Implementing Change Data Capture using Hive

Change Data Capture (CDC) is one of the most painful areas in data warehousing. CDC captures the changes that occur in a table. A change can take the form of records being added, updated, or deleted. In this recipe, we are going to take a look at how to perform CDC in Hive.

Getting ready

To perform this recipe, you should have a running Hadoop cluster as well as the latest version of Hive installed on it. Here, I am using Hive 1.2.1.

How to do it

First of all, we need a data sample. Consider a simple employee table that has columns such as the employee ID, name, and salary. Let's say we import this table from a source table in week 1, and after a week, we want to know about the changes that ...
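To make the idea concrete, here is a minimal HiveQL sketch of comparing two weekly snapshots of the employee table. It is not necessarily the approach this recipe goes on to use. The table names employee_week1 and employee_week2, the column types, and the comma-delimited storage format are assumptions made for illustration only.

```sql
-- Hypothetical snapshot tables; the names, types, and delimiter are assumptions.
CREATE TABLE IF NOT EXISTS employee_week1 (id INT, name STRING, salary INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

CREATE TABLE IF NOT EXISTS employee_week2 (id INT, name STRING, salary INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Compare the two snapshots on the employee ID and label each difference.
-- Assumes id is unique per snapshot and that name and salary are never NULL
-- (a NULL on either side would make the <> comparisons evaluate to NULL).
SELECT
  COALESCE(w2.id, w1.id)         AS id,
  COALESCE(w2.name, w1.name)     AS name,
  COALESCE(w2.salary, w1.salary) AS salary,
  CASE
    WHEN w1.id IS NULL THEN 'INSERT'  -- present only in week 2
    WHEN w2.id IS NULL THEN 'DELETE'  -- present only in week 1
    ELSE 'UPDATE'                     -- present in both, values differ
  END AS change_type
FROM employee_week1 w1
FULL OUTER JOIN employee_week2 w2
  ON w1.id = w2.id
WHERE w1.id IS NULL
   OR w2.id IS NULL
   OR w1.name <> w2.name
   OR w1.salary <> w2.salary;
```

The full outer join keeps rows that exist in only one snapshot, so inserts and deletes show up alongside updates. Unchanged rows are filtered out by the WHERE clause.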
