Building complex data pipelines

As soon as the data tools we're building grow into something bigger than a simple script, it's useful to split the data pre-processing tasks into small units, in order to map all the steps and dependencies of the data pipeline.

By the term data pipeline, we mean a sequence of data processing operations that cleans, augments, and manipulates the original data, transforming it into something digestible by the analytics engine. Any non-trivial data analytics project requires a data pipeline composed of a number of steps.
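As a minimal sketch of this idea, a pipeline can be seen as a chain of functions, where each step consumes the output of the previous one (the step names and stand-in data below are hypothetical):

def download():
    # Stand-in for a real data source
    return ['  Raw record  ', '', '  Another record ']

def clean(records):
    # Drop empty records and strip whitespace
    return [r.strip() for r in records if r.strip()]

def augment(records):
    # Attach extra attributes to each record
    return [{'text': r, 'length': len(r)} for r in records]

data = augment(clean(download()))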

In the prototyping phase, it is common to split these steps into different scripts, which are then run individually, for example:

$ python download_some_data.py
$ python clean_some_data.py ...
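
Once the number of steps grows, running each script by hand becomes error-prone, and it's worth making the dependencies between steps explicit. A workflow library such as Luigi is one option: each step declares the tasks it requires and the output it produces, so the scheduler can work out the order of execution. The following is a minimal sketch, with hypothetical task and file names:

import luigi

class DownloadSomeData(luigi.Task):

    def output(self):
        return luigi.LocalTarget('raw_data.json')

    def run(self):
        # The actual download logic would go here
        with self.output().open('w') as fout:
            fout.write('...')

class CleanSomeData(luigi.Task):

    def requires(self):
        # This task runs only after DownloadSomeData has produced its output
        return DownloadSomeData()

    def output(self):
        return luigi.LocalTarget('clean_data.json')

    def run(self):
        # The actual cleaning logic would go here
        with self.input().open('r') as fin, self.output().open('w') as fout:
            fout.write(fin.read())

if __name__ == '__main__':
    luigi.run()

Saved as, say, pipeline.py, asking for the last task is enough to trigger the whole chain, as the scheduler resolves the dependencies:

$ python pipeline.py CleanSomeData --local-scheduler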
