You can control data persistence (for example, caching) and specify placement preferences for RDD partitions and then use specific operators for manipulating them. By default, Spark persists RDDs in memory, but it can spill them to disk if sufficient RAM isn't available. Caching improves performance by several orders of magnitude; however, it is often memory intensive. Other persistence options include storing RDDs to disk and replicating them across the nodes in your cluster. The in-memory storage of persistent RDDs can be in the form of deserialized or serialized Java objects. The deserialized option is faster, while the serialized option is more memory-efficient (but slower). Unused RDDs are automatically removed from the cache but...
- Understanding Resilient Distributed Datasets (RDDs)
- from Learning Spark SQL
- Publisher: Packt Publishing
- Released: September 2017
Read more about "cache and persist". For RDDs, cache() is memory-only; with persist() we can specify the storage level we want.