Hadoop in Practice by Alex Holmes

Appendix D. Optimized MapReduce join frameworks

In this appendix we’ll look at the two join frameworks used in chapter 4. The first is a repartition join framework, which needs less memory than the Hadoop join implementation in the org.apache.hadoop.contrib.utils.join package. The second is a replicated join framework, into which you’ll build some smarts that allow you to cache the smaller of the two datasets being joined.

D.1. An optimized repartition join framework

The Hadoop contrib join package requires that all the values for a key be loaded into memory. How can you implement a reduce-side join without that memory space overhead? In this optimization you’ll cache the dataset that’s smallest in ...
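
The caching idea can be illustrated with a minimal sketch in plain Java, with no Hadoop dependency. It is not the framework's actual code: the class `RepartitionJoinSketch`, the `Tagged` type, and the `joinOneKey` method are hypothetical names chosen for illustration. The sketch also assumes that the values for a key reach the reducer with the smaller dataset's records first (for example, through a secondary sort), so that only those records need to be held in memory while the larger dataset is streamed past them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of a reduce-side join that caches only the
// smaller dataset's records for a key and streams the larger dataset.
final class RepartitionJoinSketch {

    /** One reducer input value, tagged with the dataset it came from. */
    public static final class Tagged {
        final boolean fromSmaller;
        final String value;

        public Tagged(boolean fromSmaller, String value) {
            this.fromSmaller = fromSmaller;
            this.value = value;
        }
    }

    /**
     * Joins the values for a single key.
     * ASSUMPTION: all records from the smaller dataset are ordered before
     * any record from the larger dataset.
     */
    public static List<String> joinOneKey(String key, Iterable<Tagged> values) {
        List<String> cached = new ArrayList<>(); // smaller dataset only
        List<String> joined = new ArrayList<>();
        for (Tagged t : values) {
            if (t.fromSmaller) {
                // Hold the small side in memory.
                cached.add(t.value);
            } else {
                // Stream the large side: pair each record with the cached
                // small-side records, then let it go.
                for (String small : cached) {
                    joined.add(key + "\t" + small + "\t" + t.value);
                }
            }
        }
        return joined;
    }
}
```

In a real reducer the joined records would be emitted as they are produced rather than collected in a list; the list is used here only to keep the sketch easy to inspect.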
