There's more...

To validate the result, we must set up a Delphi technique in which the test data is absolutely unknown to the model. See Kaggle competitions for details at https://www.kaggle.com/competitions.

Three types of datasets are needed for a robust ML system:

  • Training dataset: This is used to fit a model to sample
  • Validation dataset: This is used to estimate the delta or prediction error for the fitted model (trained by training set)
  • Test dataset: This is used to assess the model generalization error once a final model is selected

Get Apache Spark 2.x Machine Learning Cookbook now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.