MapReduce Design Patterns by Adam Shook, Donald Miner

Chapter 5. Join Patterns

Having all your data in one giant data set is a rarity. For example, suppose you have user information stored in a SQL database because it is updated frequently. Meanwhile, web logs arrive in a constant stream and are dumped directly into HDFS. Also, daily analytics that make sense of these logs are stored somewhere in HDFS, and financial records are stored in an encrypted repository. The list goes on.

Data is all over the place, and while it’s very valuable on its own, we can discover interesting relationships when we start analyzing these sets together. This is where join patterns come into play. Joins can be used to enrich data with a smaller reference set, or to filter out or select records that appear in some special list. The use cases go on and on as well.

In SQL, joins are accomplished using simple commands, and the database engine handles all of the grunt work. Sadly for us, joins in MapReduce are not nearly this simple. MapReduce operates on a single key/value pair at a time, typically from the same input. We are now working with at least two data sets that probably have different structures, so we need to know which data set a record came from in order to process it correctly. Typically, no filtering is done prior to the join operation, so some join operations will require every single byte of input to be sent to the reduce phase, which is very taxing on your network. For example, joining a terabyte of data onto another ...
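
To make the point about knowing which data set a record came from more concrete, here is a minimal in-memory sketch in plain Python. It is not Hadoop code and not necessarily how any particular pattern in this chapter is implemented. The sample records, the `U`/`L` source tags, and every function name are hypothetical, chosen only for illustration. Each map function tags its output with the data set it came from, a simulated shuffle groups values by key, and the reduce step uses the tags to pair records from the two sides.

```python
from collections import defaultdict

# Hypothetical sample inputs: two data sets with different structures.
USERS = ["1,alice", "2,bob", "3,carol"]             # user_id,name
LOGS = ["1,/home", "1,/cart", "3,/home", "4,/faq"]  # user_id,url


def map_users(line):
    """Emit (join key, tagged value) for a user record."""
    user_id, name = line.split(",", 1)
    return user_id, ("U", name)  # "U" marks the user data set


def map_logs(line):
    """Emit (join key, tagged value) for a web log record."""
    user_id, url = line.split(",", 1)
    return user_id, ("L", url)  # "L" marks the log data set


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_inner_join(key, values):
    """Use the source tags to split the values, then pair both sides."""
    names = [v for tag, v in values if tag == "U"]
    urls = [v for tag, v in values if tag == "L"]
    return [(key, name, url) for name in names for url in urls]


def run_join(users, logs):
    # No filtering happens before the join: every record is emitted
    # by a mapper and passes through the shuffle.
    pairs = [map_users(u) for u in users] + [map_logs(g) for g in logs]
    joined = []
    for key, values in sorted(shuffle(pairs).items()):
        joined.extend(reduce_inner_join(key, values))
    return joined


if __name__ == "__main__":
    for row in run_join(USERS, LOGS):
        print(row)
```

In this sketch, user `2` has no log entries and log key `4` has no matching user, so neither appears in the inner-join output, yet both records still travel through the shuffle. That is the network cost the paragraph above describes, only at toy scale.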
