There's more...

The glom() function lets you treat each partition of an RDD as a single array rather than as a sequence of individual rows. Although you can produce the same results without it in most cases, glom() allows you to reduce the shuffling between partitions.

Although methods 1 and 2 below look similar on the surface for calculating the minimum value in an RDD, the glom() version causes much less data to move across the network, because it first applies min() to each partition and only then sends the resulting values on. The best way to see the difference is to run both methods on RDDs with 10 million or more elements and watch the I/O and CPU usage.

  • The first method is to find the minimum value without using glom():
val minValue1= numRDD.reduce(_ ...
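
The per-partition idea described above can be sketched as follows. This is a minimal illustration rather than the book's own listing: the object name GlomMinSketch, the helper names minWithoutGlom and minWithGlom, the sample values, the local master setting, and the choice of an RDD[Int] are all assumptions made for the example.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object GlomMinSketch {

  // Hypothetical helper: minimum taken by reducing over the elements directly.
  def minWithoutGlom(rdd: RDD[Int]): Int =
    rdd.reduce(_ min _)

  // Hypothetical helper: glom() turns each partition into one Array[Int],
  // min is taken inside each array, and only those per-partition minima
  // are reduced to the final value. Empty partitions are filtered out
  // because calling min on an empty array throws an exception.
  def minWithGlom(rdd: RDD[Int]): Int =
    rdd.glom().filter(_.nonEmpty).map(_.min).reduce(_ min _)

  def main(args: Array[String]): Unit = {
    // Assumed local session, for illustration only.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("GlomMinSketch")
      .getOrCreate()

    // Assumed sample data spread over 4 partitions.
    val numRDD = spark.sparkContext.parallelize(Seq(8, 3, 5, 1, 9, 4, 7, 2), 4)

    // Show how glom() groups the elements: one array per partition.
    numRDD.glom().collect().foreach(p => println(p.mkString("[", ", ", "]")))

    println(s"min without glom(): ${minWithoutGlom(numRDD)}")
    println(s"min with glom():    ${minWithGlom(numRDD)}")

    spark.stop()
  }
}
```

Both helpers return the same value. Note that reduce() itself also combines the values inside each partition before merging the partial results, so how much glom() actually saves depends on the workload and is best confirmed by measuring on a large RDD, as suggested above.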
