Chapter 7. Joining Tables
In this chapter, we’ll cover JOIN operations in Pig. A join combines multiple datasets or relations into a single relation based on the presence of a common key or keys. Pig supports several types of JOIN operations, including INNER and OUTER (LEFT, RIGHT, and FULL) joins. We’ll learn how to perform different kinds of joins in Pig, and we’ll also walk through how a join works at a low level, in Python/MrJob. By the end of the chapter, you’ll understand how to join like a pro.
To follow this chapter, it helps to be familiar with joining data, whether from SQL or a related background. If you’re new to joins, a more thorough introduction will help. Check out Jeff Atwood’s post “A Visual Explanation of SQL Joins.”
In database terminology, a join combines the rows of two or more tables based on some matching information, known as a key. For example, you could join a table of names and a table of mailing addresses, so long as both tables had a common field for the user ID. You could also join a table of prices to a table of items, given an item ID column in both tables. Joins are useful because they permit people to normalize data (that is to say, eliminate redundant content between multiple tables) yet still bring several tables’ content to a single view on the fly.
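To make the names-and-addresses example concrete, here is a minimal sketch in plain Python. The sample rows and the helper names `inner_join` and `left_outer_join` are made up for illustration and do not come from Pig or any library. It shows how rows are matched on a shared user ID, and how an inner join differs from a left outer join when a key is missing on one side.

```python
# Hypothetical sample data: each row is (user_id, value).
names = [(1, "Alice"), (2, "Bob"), (3, "Carol")]
addresses = [(1, "12 Oak St"), (3, "98 Pine Ave"), (4, "5 Elm Rd")]


def build_index(rows):
    """Group the values of a table by key so lookups take one step."""
    index = {}
    for key, value in rows:
        index.setdefault(key, []).append(value)
    return index


def inner_join(left, right):
    """Keep only the keys that appear in both tables."""
    index = build_index(right)
    return [
        (key, lvalue, rvalue)
        for key, lvalue in left
        for rvalue in index.get(key, [])
    ]


def left_outer_join(left, right):
    """Keep every left row; fill in None where the right side has no match."""
    index = build_index(right)
    return [
        (key, lvalue, rvalue)
        for key, lvalue in left
        for rvalue in index.get(key, [None])
    ]


inner = inner_join(names, addresses)
left_outer = left_outer_join(names, addresses)
```

In this sketch, user 2 has no address, so the inner join drops that row while the left outer join keeps it with `None` in the address slot. User 4 has no name, so it appears in neither result.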
Joins are pedestrian fare in relational databases. Far less so for Hadoop, as MapReduce wasn’t really created with joins in mind, and you have to go through acrobatics to make it work. ...
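As a rough illustration of those acrobatics, the sketch below simulates one common approach, a reduce-side join, in plain Python with no Hadoop or MrJob involved. It is not necessarily the method this chapter goes on to use. The function names, the "N"/"A" tags, and the in-memory `shuffle` step are all assumptions made for the example. Each mapper tags its records with the table they came from, the shuffle groups values by key, and the reducer pairs up values from the two tables.

```python
from collections import defaultdict


def map_names(record):
    """Emit (key, tagged value) for a row of the hypothetical names table."""
    user_id, name = record
    yield user_id, ("N", name)


def map_addresses(record):
    """Emit (key, tagged value) for a row of the hypothetical addresses table."""
    user_id, address = record
    yield user_id, ("A", address)


def shuffle(pairs):
    """Stand-in for the framework's shuffle: group all values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_join(key, values):
    """Separate values by source tag, then emit every name/address pairing."""
    names = [value for tag, value in values if tag == "N"]
    addresses = [value for tag, value in values if tag == "A"]
    for name in names:
        for address in addresses:
            yield key, name, address


def run_join(name_records, address_records):
    """Run the map, shuffle, and reduce phases in memory (inner join)."""
    pairs = []
    for record in name_records:
        pairs.extend(map_names(record))
    for record in address_records:
        pairs.extend(map_addresses(record))
    grouped = shuffle(pairs)
    output = []
    for key in sorted(grouped):
        output.extend(reduce_join(key, grouped[key]))
    return output
```

The tagging is the key design choice here: once records from both tables arrive at the same reducer, the tag is the only way to tell which table each value came from.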