Creating RDDs from external data sources, whether a text file, Hadoop HDFS, a sequence file, Cassandra, or a Parquet file, is remarkably simple. Once again, we use SparkSession (SparkContext prior to Spark 2.0) to get a handle to the cluster. Once the function (for example, textFile with a protocol-prefixed file path) is executed, the data is broken into smaller pieces (partitions) that automatically flow to the cluster, where they become available to computations as fault-tolerant distributed collections that can be operated on in parallel.
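To make this concrete, here is a minimal PySpark sketch of the pattern described above; the application name and the file path data.txt are illustrative assumptions, not taken from the original text:

```python
from pyspark.sql import SparkSession

# SparkSession is the unified entry point since Spark 2.0; the underlying
# SparkContext is still reachable via spark.sparkContext.
spark = SparkSession.builder.appName("rdd-from-file").getOrCreate()
sc = spark.sparkContext

# textFile accepts a local path or a protocol-prefixed one (hdfs://, s3a://, ...)
# and splits the data into partitions distributed across the cluster.
lines = sc.textFile("data.txt")  # "data.txt" is a placeholder path

# The resulting RDD is a fault-tolerant distributed collection that can be
# operated on in parallel, e.g. a simple action such as counting lines.
print(lines.count())

spark.stop()
```

The same handle works for the other sources mentioned: Parquet and Cassandra are typically read through SparkSession's DataFrame reader rather than textFile, but the partitioning and distribution behavior is analogous.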
- There are a number of variations to consider when working with real-life data. The best advice, based on our own experience, is to consult the documentation before writing your own functions ...