Chapter 3. Data Filtering Design Patterns and Scheduling Work

The example in the previous chapter was a fairly simple application, but by now you should understand the basics of getting an Amazon EMR job running with log data. That application only grouped data records by time to determine how many messages we received each second. In many data analysis problems, however, you want to filter your data down to a smaller data set and focus the analysis on the parts that are interesting. As in our log analysis scenario, many data analysis problems focus on error scenarios and anomalies. With large data sets, this can feel like finding a needle in a haystack.

In this chapter, we’ll extend the Amazon EMR application to demonstrate several additional MapReduce patterns for filtering and analyzing data sets. To demonstrate these new building blocks, we’ll use a new data source that contains a greater variety of data than the earlier scenario. Going back to our NASA theme, we’ll use a web access log published by NASA and analyze it for web server errors. The MapReduce patterns we’ll look at reduce the web server log data down to only the requests that resulted in HTTP errors on NASA’s website. Additionally, we’ll combine concepts from Chapters 2 and 3 to show how filtering and summarization together can provide greater insight into the data.
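
As a rough preview of the filtering idea, the sketch below keeps only the log lines whose HTTP status indicates an error and then counts them per status code. It is plain Python that imitates the map and reduce steps in memory, not the application built in this chapter. It assumes the log uses the Common Log Format and treats any status of 400 or higher as an error; the names `map_errors`, `reduce_counts`, and `count_errors` are hypothetical and chosen only for illustration.

```python
import re

# Assumed layout (Common Log Format):
#   host ident user [timestamp] "request" status size
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>.*)" (?P<status>\d{3}) (?P<size>\S+)$'
)


def map_errors(line):
    """Map step: emit (status, 1) only for lines that are HTTP errors."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return  # skip lines that do not parse; the filter drops them
    status = int(match.group("status"))
    if status >= 400:  # assumption: 4xx and 5xx count as errors
        yield status, 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted values for each status code."""
    counts = {}
    for status, value in pairs:
        counts[status] = counts.get(status, 0) + value
    return counts


def count_errors(lines):
    """Run the map step over every line, then reduce the results."""
    pairs = (pair for line in lines for pair in map_errors(line))
    return reduce_counts(pairs)
```

Calling `count_errors` on an iterable of log lines returns a dictionary that maps each error status code to the number of matching requests. Successful requests and unparseable lines never leave the map step, which is the essence of the filtering pattern: the reducer only sees the small subset of records that matter.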

Toward the end of this chapter, we’ll ...
