Chapter 7. Ensuring Data Integrity

When working with open source enterprise data management systems, it’s common to use multiple storage and processing layers in our data architecture, which often means storing data in multiple formats to optimize access. It can even mean duplicating data. In the past that might have been viewed as an antipattern because of the expense and complexity involved, but with newer systems and cheap storage it becomes much more practical.

What doesn’t change is the need to ensure the integrity of the data as it moves through the system, from the data sources to its final storage. When we talk about data integrity, we mean being able to ensure that the data is accurate and consistent throughout our data pipelines. To do that, it’s critical that we have a known lineage for all data as it moves through the system.

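To make this concrete, here is a minimal sketch of the kind of check we have in mind: comparing record counts and checksums between two stages of a pipeline. It assumes newline-delimited record files, and the stage names and file paths are purely illustrative.

import hashlib

def stage_fingerprint(path):
    """Return (record_count, checksum) for a newline-delimited record file."""
    count = 0
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for line in f:
            count += 1
            digest.update(line)
    return count, digest.hexdigest()

# Hypothetical landing and staging copies of the same dataset.
source_count, source_hash = stage_fingerprint("landing/events.jsonl")
staged_count, staged_hash = stage_fingerprint("staging/events.jsonl")

# A mismatch in either value means records were lost or altered in transit.
if (source_count, source_hash) != (staged_count, staged_hash):
    raise ValueError("Integrity check failed between landing and staging")
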
In this chapter, we discuss what it means to ensure data integrity and provide some examples of how to do this as data moves through our system. In this discussion, we consider what we call full fidelity data, which is data that maintains the full context of the source data. This data might be stored in different formats from the source data, but as long as the data can be returned to the original state, we consider it full fidelity. We also consider datasets derived from our original source data; for example, data that’s been filtered and aggregated. Regardless of whether the final datasets are full fidelity or derived, ...

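As a simple illustration of the full fidelity idea, the sketch below parses hypothetical CSV source records into dictionaries (the form in which we might store them) and then verifies that the original bytes can be reconstructed exactly; the column names and sample rows are invented for the example.

import csv
import io

def csv_to_records(csv_text):
    """Parse CSV text into a list of dicts, preserving every source field."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def records_to_csv(records):
    """Re-serialize stored records back into CSV with the original column order."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(records[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()

source = "user_id,event,amount\n42,purchase,19.99\n43,refund,5.00\n"
records = csv_to_records(source)

# The stored form only counts as full fidelity if the source can be rebuilt exactly.
assert records_to_csv(records) == source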