ML pipelines

ML pipelines were developed because machine learning is more than a collection of algorithms such as classification and regression; it is a sequence of actions performed over a Dataset. Let's take a quick look at the tasks involved in a typical machine learning process. The following figure shows the top-level activities:

[Figure: ML pipelines]

The first step is to get some data for the data science work. If you are using internal data, it should be anonymized and all personally identifiable information (PII) purged.
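As a rough illustration of purging PII before the data reaches the pipeline, here is a minimal sketch in plain Python. The column names (`name`, `email`, `phone`, `customer_id`), the salt, and the helper functions are all hypothetical and chosen only for this example. Note also that hashing an identifier is pseudonymization, which is weaker than true anonymization, so treat this as a starting point rather than a complete privacy solution.

```python
import hashlib

# Hypothetical PII column names, chosen only for illustration.
PII_COLUMNS = {"name", "email", "phone"}


def purge_pii(record, pii_columns=PII_COLUMNS):
    """Return a copy of the record without the PII columns."""
    return {key: value for key, value in record.items() if key not in pii_columns}


def pseudonymize_id(value, salt):
    """Replace an identifier with a salted SHA-256 hex digest.

    This is pseudonymization, not full anonymization.
    """
    return hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()


def anonymize(record, id_column="customer_id", salt="replace-with-a-secret"):
    """Drop PII columns, then hash the identifier column if it is present."""
    cleaned = purge_pii(record)
    if id_column in cleaned:
        cleaned[id_column] = pseudonymize_id(cleaned[id_column], salt)
    return cleaned


# Made-up sample record.
raw = {
    "customer_id": 1001,
    "name": "Jane Doe",
    "email": "jane@example.com",
    "age": 34,
    "city": "Seattle",
}
clean = anonymize(raw)
```

After this step, `clean` keeps the non-identifying fields (`age`, `city`) and carries a hashed `customer_id`, so records can still be joined without exposing the original identifier.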

Once we have the data, we'll transform it: for example, we can convert a CSV file into a DataFrame consisting of strings and numbers. ...
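To make the transformation step concrete, the sketch below uses only the Python standard library to turn CSV text into named columns of strings and numbers. It is a stand-in for the idea, not Spark code: in Spark itself you would typically let the DataFrame reader do this (for example, `spark.read.csv` with the `header` and `inferSchema` options). The sample data and the helper names `parse_value` and `csv_to_columns` are assumptions made for illustration.

```python
import csv
import io


def parse_value(text):
    """Convert a CSV field to int or float when possible; otherwise return it unchanged."""
    for cast in (int, float):
        try:
            return cast(text)
        except (TypeError, ValueError):
            pass
    return text


def csv_to_columns(csv_text):
    """Read CSV text with a header row into a dict of column name -> list of typed values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name in columns:
            columns[name].append(parse_value(row[name]))
    return columns


# Made-up sample data: one string column and two numeric columns.
sample = "name,age,fare\nAlice,29,71.25\nBob,41,8.05\n"
table = csv_to_columns(sample)
```

Here `table["name"]` holds strings while `table["age"]` and `table["fare"]` hold numbers, which mirrors the mixed string and numeric columns a DataFrame would expose.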

