Using serialization to improve performance

Serialization plays an important part in distributed computing. Two persistence (storage) levels support storing RDDs in serialized form:

MEMORY_ONLY_SER: This stores RDDs as serialized objects, creating one byte array per partition.
MEMORY_AND_DISK_SER: This is similar to MEMORY_ONLY_SER, but it spills partitions that do not fit in memory to disk.

The following are the steps to add appropriate persistence levels:

Import the storage-level constants defined on the StorageLevel object:

scala> import org.apache.spark.storage.StorageLevel._

Create an RDD:

scala> val words = sc.textFile("words")
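As a minimal sketch of how a serialized storage level might be applied to an RDD like this one, the standalone program below persists a small RDD with MEMORY_ONLY_SER and checks the level that was set. The object name SerializedPersistSketch, the local[*] master, and the in-memory sample data are assumptions made for illustration. In the Spark shell, sc is already provided, so only the persist and count calls would be needed.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Hypothetical helper object, not part of the original recipe.
object SerializedPersistSketch {

  // Persists a small RDD in serialized form and returns
  // (number of elements, whether the storage level is MEMORY_ONLY_SER).
  def run(): (Long, Boolean) = {
    // Assumption: a local master is used so the sketch runs without a cluster.
    val conf = new SparkConf()
      .setAppName("serialized-persist-sketch")
      .setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Assumption: in-memory sample data stands in for the "words" file.
      val words = sc.parallelize(Seq("spark", "serialization", "rdd"))

      // Mark the RDD to be cached as one serialized byte array per partition.
      words.persist(StorageLevel.MEMORY_ONLY_SER)

      // An action is needed before anything is actually cached.
      val n = words.count()

      (n, words.getStorageLevel == StorageLevel.MEMORY_ONLY_SER)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (n, isSerializedLevel) = run()
    println(s"count = $n, MEMORY_ONLY_SER = $isSerializedLevel")
  }
}
```

Swapping in StorageLevel.MEMORY_AND_DISK_SER would keep the same serialized format while letting partitions that do not fit in memory spill to disk.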

Though serialization reduces ...

Spark Cookbook by Rishi Yadav