Splitting data for training and testing

In this recipe, you will learn to use Spark's API to split your available input data into separate datasets for the training and validation phases. An 80/20 split is common, but other ratios can be chosen depending on your needs.
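Spark exposes this directly via `randomSplit(weights, seed)` on both RDDs and DataFrames, which returns one dataset per weight. The underlying idea can be sketched in plain Python (a standalone illustration with hypothetical names, not Spark code): shuffle the rows with a fixed seed for reproducibility, then slice at the chosen fraction.

```python
import random

def train_test_split(rows, train_fraction=0.8, seed=42):
    """Deterministically shuffle rows, then slice into train/test sets."""
    rng = random.Random(seed)
    shuffled = rows[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

data = list(range(100))
train, test = train_test_split(data)
print(len(train), len(test))    # 80 20
```

Note one difference from this sketch: Spark's `randomSplit` assigns each row to a split independently at random, so with an `[0.8, 0.2]` weighting the resulting sizes are only approximately 80/20; passing a seed makes the assignment reproducible.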
