Programming Elastic MapReduce by Christopher Phillips, Kevin Schmidt

Chapter 3. Data Filtering Design Patterns and Scheduling Work

Our initial example from the previous chapter was a fairly simple application, but by now you should understand the basics of getting an Amazon EMR job running with log data. The application only grouped data records by time in order to determine how many messages we received each second. In many data analysis problems, however, you want to filter your data down to a smaller data set and focus the analysis on only the key parts that are interesting. As in our log analysis scenario, many data analysis problems focus on error scenarios and anomalies. With large data sets, this can feel like finding a needle in a haystack.

In this chapter, we’ll extend the Amazon EMR application to demonstrate a number of additional useful MapReduce patterns for filtering and analyzing data sets. To demonstrate these new building blocks, we’ll use a new data source that contains a greater variety of data than the earlier scenario. Going back to our NASA theme, we’ll use a web access log published by NASA and analyze it for web server errors. The MapReduce patterns that we’ll look at will filter the web server log data down to the requests that resulted in HTTP errors on NASA’s website. Additionally, we’ll combine concepts learned in Chapters 2 and 3 to show how filtering and summarization can be used to gain greater insights into the data.
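To make the filtering idea concrete before the details arrive, here is a minimal sketch in plain Java of the kind of test a filtering step might apply to each log line. It is not the chapter's implementation. It assumes the access log follows the Common Log Format (host, identity, user, bracketed timestamp, quoted request, status code, byte count) and that "HTTP error" means a status code of 400 or higher. The class name `HttpErrorFilter`, its methods, and the sample lines are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: decides whether a web access log line records an HTTP error.
class HttpErrorFilter {
    // Assumed layout (Common Log Format):
    //   host ident user [timestamp] "request" status bytes
    private static final Pattern LOG_LINE = Pattern.compile(
        "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"(.*)\" (\\d{3}) (\\S+)$");

    // Returns the HTTP status code, or -1 when the line does not match the assumed layout.
    static int statusOf(String line) {
        Matcher m = LOG_LINE.matcher(line.trim());
        return m.matches() ? Integer.parseInt(m.group(4)) : -1;
    }

    // Assumption: any 4xx or 5xx status counts as an error worth keeping.
    static boolean isHttpError(String line) {
        return statusOf(line) >= 400;
    }

    // Keeps only the error lines; malformed lines are dropped along with successes.
    static List<String> filterErrors(List<String> lines) {
        List<String> errors = new ArrayList<>();
        for (String line : lines) {
            if (isHttpError(line)) {
                errors.add(line);
            }
        }
        return errors;
    }

    public static void main(String[] args) {
        // Made-up sample records, not taken from the real data set.
        List<String> sample = Arrays.asList(
            "host1.example.com - - [01/Jul/1995:00:00:01 -0400] \"GET /index.html HTTP/1.0\" 200 1839",
            "host2.example.com - - [01/Jul/1995:00:00:02 -0400] \"GET /missing.html HTTP/1.0\" 404 -",
            "host3.example.com - - [01/Jul/1995:00:00:03 -0400] \"GET /cgi-bin/form HTTP/1.0\" 500 -");
        for (String line : filterErrors(sample)) {
            System.out.println(line);
        }
    }
}
```

In a MapReduce job, a predicate like `isHttpError` would typically be called from the map function, which emits a record only when the test passes, so the bulk of the log never reaches the reduce phase.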

Toward the end of this chapter, we’ll ...
