Statistics for Big Data For Dummies by David Semmelroth, Alan Anderson

Chapter 6

Dirty Work: Preparing Your Data for Analysis

In This Chapter

Understanding how your data is formatted

Recognizing common data problems

Working with dates

Dealing with messy data

“Garbage in, garbage out” (GIGO) is a cliché that dates back to the early days of data processing. It succinctly captures the idea that any analysis that you do is only as good as the data you start with. In the context of statistical analysis, GIGO is particularly relevant. It is very easy to get caught up in the apparent power of statistical methods and software packages. They can seem almost magical in their predictive powers. But these predictions can be inaccurate and sometimes wildly misleading if the data they are analyzing is messy.

And the data is almost always messy — especially when it is big data. As discussed in Chapter 2, big data can be characterized by the three Vs: volume, velocity, and variety. All three of these characteristics open data up to problems.

The high volume and velocity of data open it up to technical problems with the way it is captured. Database systems don’t always ...
