Loading and saving data from an arbitrary source

So far, we have covered three data sources that are built into DataFrames: Parquet (the default), JSON, and JDBC. DataFrames are not limited to these three; they can load from and save to any arbitrary data source by specifying the format manually.

In this recipe, we will cover loading and saving data from arbitrary sources.

How to do it...

  1. Start the Spark shell and give it some extra memory:
    $ spark-shell --driver-memory 1G
    
  2. Load the data from Parquet; since Parquet is the default data source, you do not have to specify it:
    scala> val people = sqlContext.read.load("hdfs://localhost:9000/user/hduser/people.parquet") 
    
  3. Load the data from Parquet by manually specifying the format:
    scala> val people = sqlContext.read.format("org.apache.spark.sql.parquet").load("hdfs://localhost:9000/user/hduser/people.parquet")
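
Saving works the same way: the write method returns a DataFrameWriter that accepts the same manual format specification as read. The following is a minimal sketch rather than a step from the recipe itself; the JSON output path is illustrative, and it assumes Spark 1.4 or later, where DataFrame.write is available:

    scala> // Save the people DataFrame, specifying the format manually (here, JSON)
    scala> people.write.format("json").save("hdfs://localhost:9000/user/hduser/people.json")
    scala> // Load it back, again specifying the format manually
    scala> val peopleJson = sqlContext.read.format("json").load("hdfs://localhost:9000/user/hduser/people.json")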
