Chapter 3. External Data Sources

One of the strengths of Spark is that it provides a single runtime that can connect with various underlying data sources.

In this chapter, we will connect to different data sources. This chapter is divided into the following recipes:

  • Loading data from the local filesystem
  • Loading data from HDFS
  • Loading data from HDFS using a custom InputFormat
  • Loading data from Amazon S3
  • Loading data from Apache Cassandra
  • Loading data from relational databases

Introduction

Spark provides a unified runtime for big data. HDFS, which is Hadoop's filesystem, is the most used storage platform for Spark as it provides cost-effective storage for unstructured and semi-structured data on commodity hardware. Spark is not limited to HDFS and can work ...

Get Spark Cookbook now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.