Preface

Most people imagine data science as a discipline focused on advanced math and machine learning techniques. In reality, most data scientists spend a significant share of their time (70%–80%) on tasks often called “data munging”: data cleansing and normalization, aggregation, sampling, transformation, and other forms of feature generation.

These activities are often dismissed as low-value “grunt work,” but they are actually interesting and sometimes require machine learning to accomplish. The resulting skill set is a complex mishmash: ordinary data cleansing and extraction techniques that most data analysts or software engineers will recognize, alongside more advanced skills that would normally be seen ...
