Unit 2: Data Acquisition Pipeline

Data acquisition consists of three steps: obtaining the artifacts that contain the input data from a variety of sources, extracting the data from those artifacts, and converting it into representations suitable for further processing, as shown in the following figure.

images/pipeline.png

The three main sources of data are the Internet (namely, the World Wide Web), databases, and local files (possibly previously downloaded by hand or using additional software). Some of the local files may have been produced by other Python programs and contain serialized or “pickled” data (see Unit 12, Pickling and Unpickling Data, for further explanation).
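As a rough illustration of these three kinds of sources, the following minimal sketch reads a small artifact from each one using only the Python standard library. The helper names (from_url, from_database, from_local_file, from_pickle), the sample file names, and the readings table are hypothetical and chosen for this example only. To keep the sketch runnable offline, a file:// URL stands in for a real web address and an in-memory SQLite database stands in for a production database.

```python
import pickle
import sqlite3
import tempfile
from pathlib import Path
from urllib.request import urlopen


def from_url(url):
    """Obtain an artifact by URL and return its content as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def from_database(connection, query):
    """Run a query on an open connection and return all rows."""
    return connection.execute(query).fetchall()


def from_local_file(path):
    """Read a local text file."""
    return Path(path).read_text(encoding="utf-8")


def from_pickle(path):
    """Restore a pickled Python object; only unpickle files you trust."""
    with open(path, "rb") as infile:
        return pickle.load(infile)


# Build small sample artifacts so that the sketch runs without a network.
with tempfile.TemporaryDirectory() as tmp:
    text_path = Path(tmp) / "sample.txt"
    text_path.write_text("alpha,beta", encoding="utf-8")

    pickle_path = Path(tmp) / "sample.pickle"
    with open(pickle_path, "wb") as outfile:
        pickle.dump({"answer": 42}, outfile)

    # Assumption: an in-memory SQLite database replaces a real server.
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE readings (id INTEGER, value REAL)")
    connection.execute("INSERT INTO readings VALUES (1, 3.5)")

    # Assumption: a file:// URL replaces an http(s):// address.
    web_text = from_url(text_path.as_uri())
    rows = from_database(connection, "SELECT id, value FROM readings")
    file_text = from_local_file(text_path)
    restored = from_pickle(pickle_path)
    connection.close()

# Convert the raw text into a representation suitable for processing.
fields = web_text.split(",")
```

In each case the function obtains the artifact and extracts its content; the final line shows the conversion step, turning raw text into a list of fields.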
