Chapter 3. External Data Sources

One of Spark's strengths is that it provides a single runtime that can connect to a variety of underlying data sources.

In this chapter, we will connect to different data sources. This chapter is divided into the following recipes:

Loading data from the local filesystem
Loading data from HDFS
Loading data from HDFS using a custom InputFormat
Loading data from Amazon S3
Loading data from Apache Cassandra
Loading data from relational databases

Introduction

Spark provides a unified runtime for big data. HDFS, Hadoop's filesystem, is the most widely used storage platform for Spark because it provides cost-effective storage for unstructured and semi-structured data on commodity hardware. Spark is not limited to HDFS and can work ...

Spark Cookbook by Rishi Yadav
