3.6 Summary

■ Data quality is defined in terms of accuracy, completeness, consistency, timeliness, believability, and interpretability. These qualities are assessed based on the intended use of the data.

■ Data cleaning routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data. Data cleaning is usually performed as an iterative two-step process consisting of discrepancy detection and data transformation.

■ Data integration combines data from multiple sources to form a coherent data store. The resolution of semantic heterogeneity, metadata, correlation analysis, tuple duplication detection, and data conflict detection contribute to smooth data integration.

■ Data reduction
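
As a rough illustration of the cleaning and integration steps summarized above, the following sketch fills missing values, smooths noise by bin means, removes duplicate tuples, and computes a correlation coefficient between two attributes. All function names and the sample values are hypothetical and chosen for illustration only; the code uses plain Python and the standard library, and is one simple way to do each step rather than a definitive method.

```python
from statistics import mean


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into equal-frequency bins of
    `bin_size`, and replace each value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed


def drop_duplicate_tuples(rows):
    """Keep the first occurrence of each tuple, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists.
    Assumes neither list is constant (nonzero spread in both)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    print(fill_missing_with_mean([4, None, 8]))
    print(smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], 3))
    print(drop_duplicate_tuples([(1, "a"), (2, "b"), (1, "a")]))
    print(pearson([1, 2, 3], [2, 4, 6]))
```

A correlation coefficient close to 1 or -1 between two attributes suggests that one of them may be redundant after integration. The mean is only one possible fill value; the median or a predicted value may suit skewed data better.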
