Apache Spark

Apache Spark is a well-known example of a general-purpose distributed processing engine capable of handling petabytes (PB) of data. Because it is a general-purpose engine, it is suited to a wide variety of use cases at scale, including the engineering and execution of Extract-Transform-Load (ETL) pipelines using its Spark SQL library, interactive analytics, stream processing using its Spark Streaming library, graph-based processing using its GraphX library, and machine learning using its MLlib library. We will be employing Apache Spark's machine learning library in later chapters. For now, however, it is important to get an overview of how Apache Spark works under the hood.

Apache Spark software services run in Java Virtual Machines ...

Get Machine Learning with Apache Spark Quick Start Guide now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.