Statistics for Big Data For Dummies by David Semmelroth, Alan Anderson

Chapter 20

Ten (or So) Best Practices in Data Preparation

In This Chapter

Understanding the key steps in data validation

Preparing data for analysis

The main goal of this book is to familiarize you with the statistical methods used to build useful statistical models. But as you’ve probably noticed, we have spent a great deal of time, particularly in Part II, on getting data ready for analysis. Statistical software packages are extremely powerful these days, but they cannot overcome poor-quality data. This chapter provides a checklist of things to do before you start building statistical models.

Check Data Formats

Your analysis always starts with a raw data file. Raw data files come in many shapes and sizes: mainframe data is different from PC data, spreadsheet data is formatted differently from web data, and so forth. In the age of big data, you will surely face data from a variety of sources. Your first step in analyzing your data is to make sure you can read the files you’re given. Chapter 7 gives some tips about how to do this.
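As a rough illustration of checking that a file can be read, the sketch below tries a few candidate delimiters on a delimited text file and reports how many fields each row splits into. It is a minimal sketch under stated assumptions, not a method taken from this book: the function names, the list of candidate delimiters, and the sample data are all hypothetical, and real files may also need attention to encoding, quoting, and fixed-width layouts.

```python
import csv
import io
from collections import Counter


def field_count_profile(text, delimiter):
    """Count how many rows split into each number of fields for one delimiter."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    # Blank lines come back as empty lists, so they are skipped.
    return Counter(len(row) for row in reader if row)


def guess_delimiter(text, candidates=(",", ";", "\t", "|")):
    """Pick the candidate that splits the most rows into the same number (>1) of fields."""
    best, best_score = None, 0
    for delim in candidates:
        profile = field_count_profile(text, delim)
        if not profile:
            continue
        n_fields, n_rows = profile.most_common(1)[0]
        if n_fields > 1 and n_rows > best_score:
            best, best_score = delim, n_rows
    return best


# Hypothetical sample: a semicolon-delimited file whose last record is short.
sample = "id;name;signup_date\n1;Ann;2014-01-05\n2;Bob;2014-02-11\n3;Cy\n"

delimiter = guess_delimiter(sample)
profile = field_count_profile(sample, delimiter)
print(repr(delimiter), dict(profile))
```

If the profile shows more than one field count, as it does for this sample, some records did not split the way the rest did, which is worth investigating before any analysis begins.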

Chapter 6 talks about the formats of the individual data fields, or variables, in your data file. You need to actually look at what each field contains. For example, it’s not wise to trust that ...
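One way to look at what each field contains is to tally the kinds of values that actually appear in every column. The following is a minimal sketch, assuming a comma-delimited file with a header row; the `classify` and `profile_fields` helpers, the three value categories, and the sample records are hypothetical and chosen only for illustration.

```python
import csv
import io
from collections import Counter, defaultdict


def classify(value):
    """Label one raw value as 'empty', 'number', or 'text'."""
    value = value.strip()
    if value == "":
        return "empty"
    try:
        float(value)  # note: also accepts strings such as "nan" and "inf"
        return "number"
    except ValueError:
        return "text"


def profile_fields(text):
    """Tally, for each named column, the kinds of values it actually holds."""
    profile = defaultdict(Counter)
    for row in csv.DictReader(io.StringIO(text)):
        for name, value in row.items():
            if name is None:
                continue  # extra unnamed fields in an overlong row
            # Short rows yield None for missing fields; treat them as empty.
            profile[name][classify(value or "")] += 1
    return profile


# Hypothetical sample: 'age' mixes numbers, text, and blanks;
# one 'zip' value contains the letter O instead of a zero.
sample = (
    "customer_id,age,zip\n"
    "101,34,02139\n"
    "102,unknown,10001\n"
    "103,,9021O\n"
)

for name, kinds in profile_fields(sample).items():
    print(name, dict(kinds))
```

A column that is supposed to be numeric but shows a mix of categories is a signal to inspect the raw values before trusting the field's declared format.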
