Spark's machine learning library (MLlib) makes it easy to combine steps by standardizing the APIs of its algorithms so that they can be chained into a single workflow, which Spark calls a pipeline. While a regression can be invoked without a pipeline, a working end-to-end system usually requires us to take a multi-step pipeline approach.
The pipeline concept is inspired by scikit-learn, another popular machine learning library. Its main building blocks are:
- Transformer: A Transformer is an algorithm that transforms one DataFrame into another DataFrame through its transform() method.
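- Estimator: An Estimator is an algorithm that is fit on a DataFrame, through its fit() method, to produce a Transformer. For example, a learning algorithm trained on a DataFrame yields a model, and that model is a Transformer.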
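
The relationship between the two concepts can be sketched in plain Python. This is a toy illustration of the contract only, not Spark's actual classes: the names `MeanCenter`, `MeanCenterModel`, and the `Rows` type are hypothetical, and a list of dictionaries stands in for a DataFrame. It assumes a single numeric column named `x`.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical stand-in for a DataFrame: a list of rows (column name -> value).
Rows = List[Dict[str, float]]


@dataclass
class MeanCenterModel:
    """Plays the role of a Transformer: maps one set of rows to another."""
    column: str
    mean: float

    def transform(self, rows: Rows) -> Rows:
        # Build new rows with an added column; the input rows are not modified.
        out_col = self.column + "_centered"
        return [{**row, out_col: row[self.column] - self.mean} for row in rows]


@dataclass
class MeanCenter:
    """Plays the role of an Estimator: learns from rows, returns a Transformer."""
    column: str

    def fit(self, rows: Rows) -> MeanCenterModel:
        # The "learning" step here is just computing the column mean.
        mean = sum(row[self.column] for row in rows) / len(rows)
        return MeanCenterModel(column=self.column, mean=mean)


training = [{"x": 1.0}, {"x": 2.0}, {"x": 3.0}]
model = MeanCenter(column="x").fit(training)  # Estimator produces a Transformer
centered = model.transform(training)          # Transformer produces new rows
```

The point of the split is that everything learned from the data lives in the object returned by `fit()`, so the same fitted model can later be applied to data it has not seen before.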