Chapter 5. Analytic Helpers

Now that you’ve ingested data into your Hadoop cluster, what’s next? Usually you’ll want to start by simply cleansing or transforming your data. This could be as simple as reformatting fields and removing corrupt records, or it could involve all manner of complex aggregation, enrichment, and summarization. Once you’ve cleaned up your data, you may be satisfied to simply push it into a more traditional data store, such as a relational database, and consider your big data work to be done. On the other hand, you may want to continue to work with your data, running specialized machine-learning algorithms to categorize it or perhaps performing some sort of geospatial analysis.

In this chapter, we’re going to talk about two types of tools:

MapReduce interfaces

General-purpose tools that make it easier to process your data

Analytic libraries

Focused-purpose libraries that include functionality to make it easier to analyze your data

MapReduce Interfaces

In the early days of Hadoop, the only way to process the data in your system was to work with MapReduce in Java, but this approach presented a couple of major problems:

  • Your analytic writers need to understand not only your business and your data, but also Java code

  • Pushing a Java archive to Hadoop is more time-consuming than simply authoring a query

For example, the process of developing and testing a simple analytic written directly in MapReduce might look ...
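To make the verbosity concrete, here is a minimal sketch of the map and reduce steps behind the classic word-count analytic. This is an illustrative assumption, not code from the book: it uses plain Java methods rather than Hadoop's `Mapper` and `Reducer` subclasses, so the core logic is visible without the surrounding framework and job-configuration code a real MapReduce job would also require.

```java
import java.util.*;

// Sketch of word-count map/reduce logic in plain Java.
// In a real Hadoop job, map() and reduce() would live in
// Mapper and Reducer subclasses, packaged into a JAR and
// submitted to the cluster.
public class WordCountSketch {

    // Map step: emit a (word, 1) pair for every word in a line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Reduce step: sum the counts emitted for each word.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : new String[] {"hello hadoop", "hello world"}) {
            pairs.addAll(map(line));
        }
        System.out.println(reduce(pairs)); // {hadoop=1, hello=2, world=1}
    }
}
```

Even with the framework stripped away, this is far more ceremony than the equivalent one-line query a higher-level interface would let an analyst write.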
