Summary

RDDs are the backbone of Spark: these schema-less data structures are the most fundamental objects we will work with in Spark.

In this chapter, we presented ways to create RDDs: by means of the .parallelize(...) method as well as by reading data from text files. We also showed some ways of processing unstructured data.

Transformations in Spark are lazy: they are only applied when an action is called. In this chapter, we discussed and presented the most commonly used transformations and actions; the PySpark documentation (http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD) contains many more.

One major distinction between Scala and Python RDDs is speed: Python RDDs can be much slower than their Scala counterparts.
