
# Chapter 2. A Single Variable: Shape and Distribution

When dealing with univariate data, we are mostly concerned with the overall shape of the distribution. Some of the initial questions we may ask include:

• Where are the data points located, and how far do they spread? What are typical, as well as minimal and maximal, values?

• How are the points distributed? Are they spread out evenly or do they cluster in certain areas?

• How many points are there? Is this a large data set or a relatively small one?

• Is the distribution symmetric or asymmetric? In other words, is the tail of the distribution much larger on one side than on the other?

• Are the tails of the distribution relatively heavy (i.e., do many data points lie far away from the central group of points), or are most of the points—with the possible exception of individual outliers—confined to a restricted region?

• If there are clusters, how many are there? Is there only one, or are there several? Approximately where are the clusters located, and how large are they—both in terms of spread and in terms of the number of data points belonging to each cluster?

• Are the clusters possibly superimposed on some form of unstructured background, or does the entire data set consist only of the clustered data points?

• Does the data set contain any significant outliers—that is, data points that seem to be different from all the others?

• And lastly, are there any other unusual or significant features in the data set—gaps, sharp cutoffs, unusual values, ...
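
Several of these questions (how many points, where they lie, whether the distribution is lopsided, whether there are outliers) can be given a rough first answer numerically, before any graph is drawn. The following is a minimal sketch using only the Python standard library. The function name `summarize`, the sample values, and the use of Tukey's 1.5 × IQR rule to flag outliers are illustrative assumptions, not a method prescribed by the text.

```python
import statistics


def summarize(values):
    """Return a few numbers describing size, location, spread, and symmetry."""
    data = sorted(values)
    n = len(data)
    mean = statistics.fmean(data)
    median = statistics.median(data)

    # Quartiles and interquartile range describe where the central half lies.
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1

    # Moment-based skewness: positive means a longer tail toward large values.
    sd = statistics.pstdev(data)
    skewness = sum((x - mean) ** 3 for x in data) / (n * sd ** 3) if sd > 0 else 0.0

    # Assumption: Tukey's rule of thumb, flagging points beyond 1.5 * IQR
    # from the quartiles as candidate outliers.
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [x for x in data if x < lower or x > upper]

    return {
        "count": n,
        "min": data[0],
        "max": data[-1],
        "mean": mean,
        "median": median,
        "iqr": iqr,
        "skewness": skewness,
        "outliers": outliers,
    }


# Hypothetical sample: a tight group of small values plus one distant point.
sample = [2, 3, 3, 4, 4, 4, 5, 5, 6, 30]
summary = summarize(sample)
for name, value in summary.items():
    print(f"{name:>9}: {value}")
```

Numbers like these only hint at the answers. They cannot reveal clusters, gaps, or sharp cutoffs, which is why looking at the data graphically remains the main tool.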