Discovering Knowledge in Data: An Introduction to Data Mining, 2nd Edition

Book Description

The field of data mining lies at the confluence of predictive analytics, statistical analysis, and business intelligence. Due to the ever-increasing complexity and size of data sets and the wide range of applications in computer science, business, and health care, the process of discovering knowledge in data is more relevant than ever before.

This book provides the tools needed to thrive in today's big data world. The author demonstrates how to leverage a company's existing databases to increase profits and market share, and carefully explains the most current data science methods and techniques. The reader will "learn data mining by doing data mining". With new chapters on preparing to model the data, imputation of missing data, and multivariate statistical analysis, Discovering Knowledge in Data, Second Edition remains the eminent reference on data mining.

  • The second edition of a highly praised, successful reference on data mining, with thorough coverage of big data applications, predictive analytics, and statistical analysis

  • Includes new chapters on Multivariate Statistics, Preparing to Model the Data, and Imputation of Missing Data, and an Appendix on Data Summarization and Visualization

  • Offers extensive coverage of the R statistical programming language

  • Contains 280 end-of-chapter exercises

  • Includes a companion website with further resources for all readers, and PowerPoint slides, a solutions manual, and suggested projects for instructors who adopt the book

Table of Contents

    1. Preface
      1. What is Data Mining?
      2. Why is This Book Needed?
      3. What's New for the Second Edition?
      4. Danger! Data Mining is Easy to Do Badly
      5. “White Box” Approach: Understanding the Underlying Algorithmic and Model Structures
      6. Data Mining as a Process
      7. Graphical Approach, Emphasizing Exploratory Data Analysis
      8. How The Book is Structured
      9. Acknowledgments
    2. Chapter 1: An Introduction to Data Mining
      1. 1.1 What is Data Mining?
      2. 1.2 Wanted: Data Miners
      3. 1.3 The Need for Human Direction of Data Mining
      4. 1.4 The Cross-Industry Standard Process for Data Mining
      5. 1.5 Fallacies of Data Mining
      6. 1.6 What Tasks Can Data Mining Accomplish?
      7. References
      8. Exercises
      9. Note
    3. Chapter 2: Data Preprocessing
      1. 2.1 Why do We Need to Preprocess the Data?
      2. 2.2 Data Cleaning
      3. 2.3 Handling Missing Data
      4. 2.4 Identifying Misclassifications
      5. 2.5 Graphical Methods for Identifying Outliers
      6. 2.6 Measures of Center and Spread
      7. 2.7 Data Transformation
      8. 2.8 Min-Max Normalization
      9. 2.9 Z-Score Standardization
      10. 2.10 Decimal Scaling
      11. 2.11 Transformations to Achieve Normality
      12. 2.12 Numerical Methods for Identifying Outliers
      13. 2.13 Flag Variables
      14. 2.14 Transforming Categorical Variables into Numerical Variables
      15. 2.15 Binning Numerical Variables
      16. 2.16 Reclassifying Categorical Variables
      17. 2.17 Adding an Index Field
      18. 2.18 Removing Variables that are Not Useful
      19. 2.19 Variables that Should Probably Not Be Removed
      20. 2.20 Removal of Duplicate Records
      21. 2.21 A Word About ID Fields
      22. References
      23. Exercises
      24. Hands-On Analysis
      25. Notes
    4. Chapter 3: Exploratory Data Analysis
      1. 3.1 Hypothesis Testing Versus Exploratory Data Analysis
      2. 3.2 Getting to Know the Data Set
      3. 3.3 Exploring Categorical Variables
      4. 3.4 Exploring Numeric Variables
      5. 3.5 Exploring Multivariate Relationships
      6. 3.6 Selecting Interesting Subsets of the Data for Further Investigation
      7. 3.7 Using EDA to Uncover Anomalous Fields
      8. 3.8 Binning Based on Predictive Value
      9. 3.9 Deriving New Variables: Flag Variables
      10. 3.10 Deriving New Variables: Numerical Variables
      11. 3.11 Using EDA to Investigate Correlated Predictor Variables
      12. 3.12 Summary
      13. Reference
      14. Exercises
      15. Hands-On Analysis
      16. Note
    5. Chapter 4: Univariate Statistical Analysis
      1. 4.1 Data Mining Tasks in Discovering Knowledge in Data
      2. 4.2 Statistical Approaches to Estimation and Prediction
      3. 4.3 Statistical Inference
      4. 4.4 How Confident are We in Our Estimates?
      5. 4.5 Confidence Interval Estimation of the Mean
      6. 4.6 How to Reduce the Margin of Error
      7. 4.7 Confidence Interval Estimation of the Proportion
      8. 4.8 Hypothesis Testing for the Mean
      9. 4.9 Assessing the Strength of Evidence Against the Null Hypothesis
      10. 4.10 Using Confidence Intervals to Perform Hypothesis Tests
      11. 4.11 Hypothesis Testing for the Proportion
      12. Reference
      13. Exercises
    6. Chapter 5: Multivariate Statistics
      1. 5.1 Two-Sample t-Test for Difference in Means
      2. 5.2 Two-Sample Z-Test for Difference in Proportions
      3. 5.3 Test for Homogeneity of Proportions
      4. 5.4 Chi-Square Test for Goodness of Fit of Multinomial Data
      5. 5.5 Analysis of Variance
      6. 5.6 Regression Analysis
      7. 5.7 Hypothesis Testing in Regression
      8. 5.8 Measuring the Quality of a Regression Model
      9. 5.9 Dangers of Extrapolation
      10. 5.10 Confidence Intervals for the Mean Value of y Given x
      11. 5.11 Prediction Intervals for a Randomly Chosen Value of y Given x
      12. 5.12 Multiple Regression
      13. 5.13 Verifying Model Assumptions
      14. Reference
      15. Exercises
      16. Hands-On Analysis
      17. Note
    7. Chapter 6: Preparing to Model the Data
      1. 6.1 Supervised Versus Unsupervised Methods
      2. 6.2 Statistical Methodology and Data Mining Methodology
      3. 6.3 Cross-Validation
      4. 6.4 Overfitting
      5. 6.5 Bias–Variance Trade-Off
      6. 6.6 Balancing the Training Data Set
      7. 6.7 Establishing Baseline Performance
      8. Reference
      9. Exercises
    8. Chapter 7: k-Nearest Neighbor Algorithm
      1. 7.1 Classification Task
      2. 7.2 k-Nearest Neighbor Algorithm
      3. 7.3 Distance Function
      4. 7.4 Combination Function
      5. 7.5 Quantifying Attribute Relevance: Stretching the Axes
      6. 7.6 Database Considerations
      7. 7.7 k-Nearest Neighbor Algorithm for Estimation and Prediction
      8. 7.8 Choosing k
      9. 7.9 Application of k-Nearest Neighbor Algorithm Using IBM/SPSS Modeler
      10. Exercises
      11. Hands-On Analysis
    9. Chapter 8: Decision Trees
      1. 8.1 What is a Decision Tree?
      2. 8.2 Requirements for Using Decision Trees
      3. 8.3 Classification and Regression Trees
      4. 8.4 C4.5 Algorithm
      5. 8.5 Decision Rules
      6. 8.6 Comparison of the C5.0 and CART Algorithms Applied to Real Data
      7. References
      8. Exercises
      9. Hands-On Analysis
    10. Chapter 9: Neural Networks
      1. 9.1 Input and Output Encoding
      2. 9.2 Neural Networks for Estimation and Prediction
      3. 9.3 Simple Example of a Neural Network
      4. 9.4 Sigmoid Activation Function
      5. 9.5 Back-Propagation
      6. 9.6 Termination Criteria
      7. 9.7 Learning Rate
      8. 9.8 Momentum Term
      9. 9.9 Sensitivity Analysis
      10. 9.10 Application of Neural Network Modeling
      11. References
      12. Exercises
      13. Hands-On Analysis
    11. Chapter 10: Hierarchical and k-Means Clustering
      1. 10.1 The Clustering Task
      2. 10.2 Hierarchical Clustering Methods
      3. 10.3 Single-Linkage Clustering
      4. 10.4 Complete-Linkage Clustering
      5. 10.5 k-Means Clustering
      6. 10.6 Example of k-Means Clustering at Work
      7. 10.7 Behavior of MSB, MSE, and Pseudo-F as the k-Means Algorithm Proceeds
      8. 10.8 Application of k-Means Clustering Using SAS Enterprise Miner
      9. 10.9 Using Cluster Membership to Predict Churn
      10. References
      11. Exercises
      12. Hands-On Analysis
      13. Note
    12. Chapter 11: Kohonen Networks
      1. 11.1 Self-Organizing Maps
      2. 11.2 Kohonen Networks
      3. 11.3 Example of a Kohonen Network Study
      4. 11.4 Cluster Validity
      5. 11.5 Application of Clustering Using Kohonen Networks
      6. 11.6 Interpreting the Clusters
      7. 11.7 Using Cluster Membership as Input to Downstream Data Mining Models
      8. References
      9. Exercises
      10. Hands-On Analysis
    13. Chapter 12: Association Rules
      1. 12.1 Affinity Analysis and Market Basket Analysis
      2. 12.2 Support, Confidence, Frequent Itemsets, and the Apriori Property
      3. 12.3 How Does the Apriori Algorithm Work?
      4. 12.4 Extension from Flag Data to General Categorical Data
      5. 12.5 Information-Theoretic Approach: Generalized Rule Induction Method
      6. 12.6 Association Rules are Easy to Do Badly
      7. 12.7 How Can We Measure the Usefulness of Association Rules?
      8. 12.8 Do Association Rules Represent Supervised or Unsupervised Learning?
      9. 12.9 Local Patterns Versus Global Models
      10. References
      11. Exercises
      12. Hands-On Analysis
    14. Chapter 13: Imputation of Missing Data
      1. 13.1 Need for Imputation of Missing Data
      2. 13.2 Imputation of Missing Data: Continuous Variables
      3. 13.3 Standard Error of the Imputation
      4. 13.4 Imputation of Missing Data: Categorical Variables
      5. 13.5 Handling Patterns in Missingness
      6. Reference
      7. Exercises
      8. Hands-On Analysis
      9. Notes
    15. Chapter 14: Model Evaluation Techniques
      1. 14.1 Model Evaluation Techniques for the Description Task
      2. 14.2 Model Evaluation Techniques for the Estimation and Prediction Tasks
      3. 14.3 Model Evaluation Techniques for the Classification Task
      4. 14.4 Error Rate, False Positives, and False Negatives
      5. 14.5 Sensitivity and Specificity
      6. 14.6 Misclassification Cost Adjustment to Reflect Real-World Concerns
      7. 14.7 Decision Cost/Benefit Analysis
      8. 14.8 Lift Charts and Gains Charts
      9. 14.9 Interweaving Model Evaluation with Model Building
      10. 14.10 Confluence of Results: Applying a Suite of Models
      11. Reference
      12. Exercises
      13. Hands-On Analysis
      14. Notes
    16. Appendix: Data Summarization and Visualization
      1. Part 1 Summarization 1: Building Blocks of Data Analysis
      2. Part 2 Visualization: Graphs and Tables for Summarizing and Organizing Data
      3. Part 3 Summarization 2: Measures of Center, Variability, and Position
      4. Part 4 Summarization and Visualization of Bivariate Relationships
    17. Index
    18. End User License Agreement