Data Just Right: Introduction to Large-Scale Data & Analytics by Michael Manoochehri

9. Building Data Transformation Workflows with Pig and Cascading

Collecting and processing large amounts of data can be a complicated task. Fortunately, many common data processing challenges can be broken down into smaller problems. Open source software tools allow us to shard and distribute data transformation jobs across many machines, using strategies such as MapReduce.
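To make the idea of breaking a job into smaller problems concrete, here is a minimal single-process sketch of the MapReduce pattern. It is not Hadoop, Pig, or Cascading code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample shards are hypothetical, and the in-memory lists only stand in for data that a real cluster would spread across machines.

```python
from collections import defaultdict


def map_phase(record):
    """Map step: emit a (word, 1) pair for every word in one line of text."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single count."""
    return key, sum(values)


def run_job(shards):
    """Run map, shuffle, and reduce over every shard (sequentially here)."""
    pairs = (
        pair
        for shard in shards
        for record in shard
        for pair in map_phase(record)
    )
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: each inner list plays the role of one machine's shard.
shards = [["pig cascading pig"], ["hadoop pig"]]
counts = run_job(shards)
print(counts)
```

Because each call to `map_phase` looks at only one record and each call to `reduce_phase` looks at only one key, both steps could in principle run on different machines, which is the property a framework such as Hadoop relies on.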

While frameworks like Hadoop help manage much of the complexity of taking large MapReduce processing tasks and farming them out to individual machines in a cluster, we still need to define exactly how the data will be processed. Do we want to alter the data in some way? Split it up or combine it with another source?
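The three questions above (alter, split, combine) map onto three basic transformation steps. The sketch below illustrates them in plain Python under stated assumptions: the record fields (`user_id`, `country`, `name`), the sample data, and the helper names are all hypothetical, and this is not how Pig or Cascading express these operations.

```python
def alter(records):
    """Alter: normalize the hypothetical 'country' field to upper case."""
    return [{**record, "country": record["country"].upper()} for record in records]


def split(records, predicate):
    """Split: partition records into those that match the predicate and the rest."""
    matched = [record for record in records if predicate(record)]
    rest = [record for record in records if not predicate(record)]
    return matched, rest


def combine(records, lookup, key):
    """Combine: inner-join records with a second source on a shared key."""
    index = {row[key]: row for row in lookup}
    return [{**record, **index[record[key]]} for record in records if record[key] in index]


# Hypothetical sources: a visit log and a user table.
visits = [
    {"user_id": 1, "country": "us"},
    {"user_id": 2, "country": "de"},
    {"user_id": 3, "country": "us"},
]
users = [{"user_id": 1, "name": "Ada"}, {"user_id": 2, "name": "Lin"}]

cleaned = alter(visits)
us_visits, other_visits = split(cleaned, lambda record: record["country"] == "US")
joined = combine(us_visits, users, "user_id")
print(joined)
```

Chaining small steps like these, each with one input and one output, is the essence of a transformation workflow; the join here drops visits whose `user_id` has no match in the second source.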

With large amounts of data coming from many different sources, ...
