Data Munging with Hadoop

If you torture the data long enough, it will confess.

Ronald Coase, Economist

As every data scientist knows, roughly 70%–80% of the time in a data science project goes to what is commonly known as data munging, a popular term that covers two main activities:

• Identifying and remediating data quality problems

• Transforming the raw data into what is known as a feature matrix, a task commonly referred to as feature generation or feature engineering
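
To make the two activities concrete, here is a minimal sketch in plain Python. It is not taken from the book and does not use Hadoop. The records, field names, and the helper functions `clean` and `to_feature_matrix` are all hypothetical, and mean imputation and one-hot encoding are just two of many possible choices.

```python
# Hypothetical raw records; field names and values are invented for illustration.
raw_records = [
    {"user": "a", "age": "34", "country": "US"},
    {"user": "b", "age": "", "country": "us"},   # missing age, inconsistent case
    {"user": "c", "age": "29", "country": "DE"},
]


def clean(record):
    """Remediate two simple quality problems: blank ages and mixed-case codes."""
    age = int(record["age"]) if record["age"].strip() else None
    return {
        "user": record["user"],
        "age": age,
        "country": record["country"].upper(),
    }


def to_feature_matrix(records):
    """Turn cleaned records into numeric rows: [age, one-hot country columns]."""
    countries = sorted({r["country"] for r in records})
    known_ages = [r["age"] for r in records if r["age"] is not None]
    # Assumption: fill missing ages with the mean of the known ages.
    mean_age = sum(known_ages) / len(known_ages)
    matrix = []
    for r in records:
        age = r["age"] if r["age"] is not None else mean_age
        one_hot = [1.0 if r["country"] == c else 0.0 for c in countries]
        matrix.append([float(age)] + one_hot)
    return countries, matrix


cleaned = [clean(r) for r in raw_records]
columns, features = to_feature_matrix(cleaned)
print(columns)
for row in features:
    print(row)
```

Each row of `features` corresponds to one record and each column to one numeric feature, which is the shape most modeling tools expect as input.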

This eBook, which is part of our upcoming book, Data Science with Hadoop
