Large-scale machine learning with Apache Spark and MLlib

The Spark project (https://spark.apache.org/) is a cluster computing framework that emphasizes low-latency job execution. It's a relatively recent project, having grown out of UC Berkeley's AMPLab in 2009.

Although Spark can coexist with Hadoop (by connecting to files stored on the Hadoop Distributed File System (HDFS), for example), it targets much faster job execution times by keeping much of the computation in memory. In contrast with Hadoop's two-stage MapReduce paradigm, which writes intermediate results to disk between iterations, Spark's in-memory model can perform tens or hundreds of times faster for some applications, particularly those performing multiple iterations over the data. ...
