Repeatability and automation

In this section, we will discuss some methods of organizing dataset preprocessing into workflows, and then use Apache Spark pipelines to represent and implement these workflows. Then, we will review data preprocessing automation solutions.

After this section, we will be able to use Spark pipelines to represent and implement dataset preprocessing workflows, and we will understand some of the automation solutions made available by Apache Spark.

Dataset preprocessing workflows

Our data preparation work, from Data cleaning to Identity matching to Data re-organization to Feature extraction, was organized to reflect our orderly, step-by-step process of preparing datasets for machine learning. In other words, all the data ...
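
The step-by-step ordering described above can be illustrated with a minimal sketch in plain Python. This is not Spark's API and not the book's own code: the stage functions, the record fields (name, amount), and the 100.0 threshold are all hypothetical, chosen only to show how a workflow becomes repeatable once it is written down as an ordered list of stages. In Spark, the analogous construct is a Pipeline built from a list of stages.

```python
from typing import Callable, Dict, List

# A record is a plain dict; a stage maps a list of records to a new list.
# All names and fields below are hypothetical, for illustration only.
Record = Dict[str, object]
Stage = Callable[[List[Record]], List[Record]]


def clean(records: List[Record]) -> List[Record]:
    # Data cleaning: drop records that have no usable name.
    return [r for r in records if r.get("name")]


def match_identity(records: List[Record]) -> List[Record]:
    # Identity matching: normalize names so the same entity compares equal.
    return [{**r, "name": str(r["name"]).strip().lower()} for r in records]


def reorganize(records: List[Record]) -> List[Record]:
    # Data re-organization: one row per name, with amounts summed.
    totals: Dict[str, float] = {}
    for r in records:
        name = str(r["name"])
        totals[name] = totals.get(name, 0.0) + float(r["amount"])
    return [{"name": n, "amount": t} for n, t in sorted(totals.items())]


def extract_features(records: List[Record]) -> List[Record]:
    # Feature extraction: derive a boolean feature from the summed amount.
    return [{**r, "is_large": r["amount"] >= 100.0} for r in records]


def run_workflow(records: List[Record], stages: List[Stage]) -> List[Record]:
    # Apply each stage in order; the output of one stage feeds the next.
    for stage in stages:
        records = stage(records)
    return records


# The workflow itself is just data: an ordered list of stages.
WORKFLOW: List[Stage] = [clean, match_identity, reorganize, extract_features]

raw = [
    {"name": " Alice ", "amount": 60},
    {"name": "alice", "amount": 50},
    {"name": None, "amount": 10},
    {"name": "Bob", "amount": 20},
]
prepared = run_workflow(raw, WORKFLOW)
```

Because the workflow is an ordinary list, the same sequence of steps can be rerun unchanged on a new dataset, which is the property that makes the preparation work repeatable and open to automation.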
