3.3. Tracking Data Lineage Identifiers

Data lineage is a part of the metadata tracking in an ETL process associated with mapping data back to the source data points. But before considering an approach to data lineage tracking, to better understand data lineage requires understanding metadata in general.

Metadata is the data about the data. For ETL, it's the integrated tracking mechanism for the process and the core data. Metadata is very important for ETL and data warehousing—not just for administrators to understand what data got loaded, but also for system developers to validate data sources, for data administrators to find out what happened when failures occur, and for users to be confident in the data they are looking at. In fact, the latter may be the most important, because if the users do not trust the data, then their adoption of the solution will be slow going. Tracking data lineage is directly related to knowing where each data point came from, not just what system or table or file, but knowing which source row(s) they came from.

As with other requirements, data lineage can be handled several ways in SSIS. The most common way is to add an auto-incrementing number to the data flow that maps data points back to their sources.

If you are staging raw data, this can be leveraged by sending this lineage number to the staged data and also to the destination tables. With raw staging tables, instead of having to track back to a source system, which may be volatile, you are able ...

Get Expert SQL Server™ 2005 Integration Services now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.