Data Research and Advanced Data Cleansing with Pig and Hive

WHAT YOU WILL LEARN IN THIS CHAPTER

  • Understanding the Difference Between Pig and Hive and When to Use Each
  • Using Pig Latin Built-in Functions for Advanced Extraction, Transformation, and Loading of Data
  • Understanding the Various Types of Hive Functions Available
  • Extending Hive with Map-reduce Scripts
  • Creating Your Own Functions to Plug into Hive

All data processing on Hadoop essentially boils down to a map-reduce process. The map phase retrieves the data and performs operations such as filtering and sorting; the reduce phase performs a summary operation such as grouping and counting. Hadoop map-reduce jobs are typically written in programming languages such as Java and C#. Although this works well for developers with a programming background, it presents a steep learning curve for nonprogrammers. This is where Pig comes into play. Another tool for creating and running map-reduce jobs in Hadoop is Hive. Like Pig, Hive relies on a batch-based, parallel-processing paradigm and is useful for querying, aggregating, and filtering large data sets.
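To make that division of labor concrete, the following minimal Pig Latin sketch (the file path, field names, and filter condition are hypothetical) filters a set of web log records and then groups and counts them. The FILTER corresponds to the map side of the job, while the GROUP and COUNT correspond to the reduce side; Pig translates the script into the underlying map-reduce jobs for you.

    -- Minimal sketch: count error responses per URL (hypothetical data layout)
    logs   = LOAD '/data/weblogs.txt' USING PigStorage('\t')
                 AS (ip:chararray, url:chararray, status:int);
    errors = FILTER logs BY status >= 400;          -- map-side filtering
    by_url = GROUP errors BY url;                   -- grouping feeds the reduce phase
    counts = FOREACH by_url GENERATE group AS url,
                 COUNT(errors) AS error_count;      -- reduce-side summary
    DUMP counts;

In Hive, the same summary would typically be expressed as a SQL-like SELECT ... GROUP BY query over a table defined on the same files, which is what makes Hive a natural fit for analysts who already know SQL.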

This chapter covers both Pig and Hive and will help you to understand the strengths of each. You will also see how to extend Pig and Hive using functions and custom map-reduce scripts. In addition, the chapter includes hands-on activities to help ...
