Chapter 6. Transform Data in the Data Lake

In the previous chapter, we ingested the source data into the Data Lake. Raw data at this volume is of little use to decision makers until it has been transformed into information they can act on. This chapter discusses how to perform that transformation.

The topics covered in this chapter are as follows:

  • Transformation overview
  • Tools for transforming data in a Data Lake, such as HCatalog, Hive, Pig, and MapReduce
  • Transformation of the airline on-time performance (OTP) raw data into an aggregate
  • Review of the transformation results

Transformation overview

Once the data is in the cluster, the next step in a typical project is to prepare it for future consumption. This typically ...
