Chapter 3 Exploratory Data Analysis

  3.1 Hypothesis Testing Versus Exploratory Data Analysis
  3.2 Getting to Know the Data Set
  3.3 Exploring Categorical Variables
  3.4 Exploring Numeric Variables
  3.5 Exploring Multivariate Relationships
  3.6 Selecting Interesting Subsets of the Data for Further Investigation
  3.7 Using EDA to Uncover Anomalous Fields
  3.8 Binning Based on Predictive Value
  3.9 Deriving New Variables: Flag Variables
  3.10 Deriving New Variables: Numerical Variables
  3.11 Using EDA to Investigate Correlated Predictor Variables
  3.12 Summary
    The R Zone
    Reference
    Exercises
    Hands-On Analysis

3.1 Hypothesis Testing Versus Exploratory Data Analysis

When approaching a data mining problem, a data mining analyst may already have some a priori hypotheses that he or she would like to test regarding the relationships between the variables. For example, suppose that cell phone executives are interested in whether a recent increase in the fee structure has led to a decrease in market share. In this case, the analyst would test the hypothesis that market share has decreased, and would therefore use hypothesis testing procedures.

A myriad of statistical hypothesis testing procedures are available through the traditional statistical analysis literature. We cover many of these in Chapters 4 and 5. However, analysts do not always have a priori notions of the expected relationships among the variables. Especially when confronted with unknown, large databases, analysts often prefer to use ...

