
Learning Hadoop 2 by Garry Turkington, Gabriele Modena


Collecting additional data

Many data processing systems don't have a single data ingest source; often, one primary source is enriched by other secondary sources. We will now look at how to incorporate the retrieval of such reference data into our data warehouse.

At a high level, the problem isn't very different from our retrieval of the raw tweet data: we wish to pull data from an external source, possibly do some processing on it, and store it somewhere it can be used later. But this does highlight an aspect we need to consider: do we really want to retrieve this data every time we ingest new tweets? The answer is certainly no. The reference data changes very rarely, and we could easily fetch it much less frequently than new tweet data. ...
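To make the scheduling point concrete, here is a minimal Python sketch of giving each source its own refresh interval, so that slowly changing reference data is fetched far less often than tweets. The interval values, the source names, and the helpers `is_due` and `sources_to_fetch` are all hypothetical and chosen only for illustration; they are not taken from the book, and a real deployment would more likely express this in a workflow scheduler.

```python
from datetime import datetime, timedelta

# Hypothetical refresh intervals: tweets are ingested often,
# reference data only rarely because it changes very slowly.
INTERVALS = {
    "tweets": timedelta(minutes=5),
    "reference": timedelta(days=1),
}


def is_due(last_run, now, interval):
    """Return True if a source was never fetched or its interval has elapsed."""
    return last_run is None or now - last_run >= interval


def sources_to_fetch(last_runs, now):
    """List the sources whose refresh interval has elapsed as of `now`.

    `last_runs` maps a source name to the datetime of its last successful
    fetch; a missing entry means the source has never been fetched.
    """
    return [
        name
        for name, interval in INTERVALS.items()
        if is_due(last_runs.get(name), now, interval)
    ]


if __name__ == "__main__":
    now = datetime(2015, 1, 2, 12, 0)
    last_runs = {
        "tweets": now - timedelta(minutes=10),
        "reference": now - timedelta(hours=3),
    }
    # Only the tweet ingest is due; the reference data is still fresh.
    print(sources_to_fetch(last_runs, now))
```

Keeping the interval per source, rather than tying every fetch to the tweet ingest, means the reference retrieval can be tuned or disabled without touching the primary pipeline.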
