Performing Group By queries in Pig

In this recipe, we will use the Group By operator in Pig scripts to aggregate data at the group level.

Getting ready

To perform this recipe, you should have a running Hadoop cluster as well as the latest version of Pig installed on it.

How to do it...

Group By is one of the most useful operators for data analysis, and Pig supports it so that we can perform aggregations at the group level. We will use the same employee dataset as in the previous recipe:

1	Tanmay	ENGINEERING	5000
2	Sneha	PRODUCTION	8000
3	Sakalya	ENGINEERING	7000
4	Avinash	SALES	6000
5	Manisha	SALES	5700
6	Vinit	FINANCE	6200
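
To make "aggregation at the group level" concrete, here is a minimal Python sketch of what such a query computes on the rows above. It is only an illustration of the idea, not the Pig script for this recipe. It assumes the four columns are an ID, a name, a department, and a salary (the dataset does not label them), and it picks a salary total per department as an example aggregate. The function name total_salary_by_department is hypothetical.

```python
from collections import defaultdict

# Rows copied from the employee dataset above.
# Assumed column meaning: (id, name, department, salary).
rows = [
    (1, "Tanmay", "ENGINEERING", 5000),
    (2, "Sneha", "PRODUCTION", 8000),
    (3, "Sakalya", "ENGINEERING", 7000),
    (4, "Avinash", "SALES", 6000),
    (5, "Manisha", "SALES", 5700),
    (6, "Vinit", "FINANCE", 6200),
]


def total_salary_by_department(rows):
    """Group rows by department, then sum the salaries in each group."""
    groups = defaultdict(list)
    for _emp_id, _name, dept, salary in rows:
        # Grouping step: collect every salary under its department key.
        groups[dept].append(salary)
    # Aggregation step: reduce each group to a single value.
    return {dept: sum(salaries) for dept, salaries in groups.items()}


totals = total_salary_by_department(rows)
for dept, total in sorted(totals.items()):
    print(dept, total)
```

The two steps are kept separate on purpose: grouping first collects the rows that share a key, and the aggregate is then computed once per group.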

First of all, load the data into HDFS:

hadoop fs -mkdir -p /pig/emps_data
hadoop fs -put emps.txt /pig/emps_data

Next, ...
