Using serialization to improve performance

Serialization plays an important part in distributed computing. There are two persistence (storage) levels, which support serializing RDDs:

  • MEMORY_ONLY_SER: This stores RDDs as serialized objects. It will create one byte array per partition
  • MEMORY_AND_DISK_SER: This is similar to the MEMORY_ONLY_SER, but it spills partitions that do not fit in the memory to disk

The following are the steps to add appropriate persistence levels:

  1. Start the Spark shell:
    $ spark-shell
    
  2. Import the StorageLevel and implicits associated with it:
    scala> import org.apache.spark.storage.StorageLevel._
    
  3. Create an RDD:
    scala> val words = sc.textFile("words")
    
  4. Persist the RDD:
    scala> words.persist(MEMORY_ONLY_SER)
    

Though serialization reduces ...

Get Spark Cookbook now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.