Data acquisition

Data acquisition, or data collection, is the first step in any data science project. The required data is rarely available in one place; it is usually distributed across line-of-business (LOB) applications and systems.

Much of this topic was covered in the previous chapter, which outlined how to source data from different data sources and store it in DataFrames for easier analysis. Spark has a built-in mechanism for fetching data from some common data sources, and it provides the Data Source API for sources that are not supported out of the box.
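
As a reminder of what that built-in mechanism looks like in practice, here is a minimal PySpark sketch. It is an illustration under assumptions and is not taken from the book: the helper name `load_source`, the `demo` function, and the file paths are all hypothetical. Only the `spark.read.format(...).option(...).load(...)` chain is Spark's own reader interface.

```python
from typing import Any, Mapping, Optional


def load_source(spark: Any, fmt: str, path: Optional[str] = None,
                options: Optional[Mapping[str, str]] = None) -> Any:
    """Load one source into a DataFrame through Spark's DataFrameReader.

    `load_source` is a hypothetical helper, not part of Spark. `fmt` is a
    format name Spark knows out of the box (for example "csv", "json",
    "parquet", "jdbc") or one registered through the Data Source API.
    """
    reader = spark.read.format(fmt)
    # Apply reader options one by one, e.g. {"header": "true"} for CSV.
    for key, value in (options or {}).items():
        reader = reader.option(key, value)
    # Sources such as JDBC take no path; their location comes from options.
    return reader.load(path) if path is not None else reader.load()


def demo() -> None:
    """Example usage; requires PySpark. The file paths are made up."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("acquisition-sketch").getOrCreate()
    orders = load_source(spark, "csv", "data/orders.csv",
                         {"header": "true", "inferSchema": "true"})
    customers = load_source(spark, "json", "data/customers.json")
    orders.printSchema()
    customers.printSchema()
    spark.stop()
```

Calling `demo()` in an environment with PySpark installed, and with the two files in place, would print the schema of each DataFrame. Keeping the format name and options as parameters means data from several LOB systems can be pulled in through one code path.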

To get a better understanding of the data acquisition and preparation phases, let us assume a scenario and try to address all the ...
