Practical Statistics for Data Scientists, 1st Edition

Book Description

With Early Release ebooks, you get books in their earliest form—the author's raw and unedited content as he or she writes—so you can take advantage of these technologies long before the official release of these titles. You’ll also receive updates when significant changes are made, new chapters are available, and the final ebook bundle is released.

Statistics and machine learning are key components of data science, but only a small proportion of data scientists are actually trained as statisticians. This concise guide illustrates how to apply statistical concepts essential to data science, with advice on how to avoid their misuse.

Many courses and books teach basic statistics, but rarely from a data science perspective. And while many data science resources incorporate statistical methods, they typically lack a deep statistical perspective. This quick reference book bridges that gap in an accessible, readable format.

Table of Contents

  1. Preface
    1. What to Expect
    2. Conventions Used in This Book
    3. Using Code Examples
    4. Safari® Books Online
    5. How to Contact Us
    6. Acknowledgments
  2. 1. Exploratory Data Analysis
    1. Elements of Structured Data
      1. Further Reading
    2. Rectangular Data
      1. Data Frames and Indexes
      2. Non-Rectangular Data Structures
      3. Further Reading
    3. Estimates of Location
      1. Mean
      2. Median and Robust Estimates
      3. Example: Location Estimates of Population and Murder Rates
      4. Further Reading
    4. Estimates of Variability
      1. Standard Deviation and Related Estimates
      2. Estimates Based on Percentiles
      3. Example: Variability Estimates of State Population
      4. Further Reading
    5. Exploring the Data Distribution
      1. Percentiles and Boxplots
      2. Frequency Table and Histograms
      3. Density Estimates
      4. Further Reading
    6. Exploring Binary and Categorical Data
      1. Mode
      2. Expected Value
      3. Further Reading
    7. Correlation
      1. Scatterplots
      2. Further Reading
    8. Exploring Two or More Variables
      1. Hexagonal Binning and Contours (plotting numeric vs. numeric)
      2. Two Categorical Variables
      3. Categorical and Numeric Data
      4. Visualizing Multiple Variables
      5. Further Reading
    9. Summary
  3. 2. Data and Sampling Distributions
    1. Random sampling and sample bias
      1. Bias
      2. Random Selection
      3. Size Versus Quality; When Does Size Matter?
      4. Sample Mean Versus Population Mean
      5. Further Reading
    2. Selection bias
      1. Regression to the mean
      2. Further Reading
    3. Sampling Distribution of a Statistic
      1. Central Limit Theorem
      2. Standard error
      3. Further Reading
    4. The Bootstrap
      1. Resampling versus bootstrapping
      2. Further Reading
    5. Confidence intervals
      1. Further Reading
    6. Normal distribution
      1. Standard Normal and QQ-Plots
    7. Long-Tailed Distributions
      1. Further Reading
    8. Student’s t distribution
      1. Further Reading
    9. Binomial distribution
      1. Further Reading
    10. Poisson and Related Distributions
      1. Poisson Distributions
      2. Exponential distribution
      3. Estimating the Failure Rate
      4. Weibull distribution
      5. Further Reading
    11. Summary
  4. 3. Statistical Experiments and Significance Testing
    1. A-B Testing
      1. Why have a control group?
      2. Why just A-B? Why not C, D, etc.?
      3. For Further Reading
    2. Hypothesis Test
      1. The Null Hypothesis
      2. Alternative hypothesis
      3. One-way, two-way hypothesis test
      4. Further Reading
    3. Resampling
      1. Permutation Test
      2. Example: Web Stickiness
      3. Exhaustive and Bootstrap Permutation Test
      4. Permutation Tests: The Bottom Line for Data Science
      5. For Further Reading
    4. Statistical Significance and P-values
      1. P-value
      2. Alpha
      3. Type 1 and Type 2 Error
      4. Data Science and P-Values
      5. Further Reading
    5. t-test
      1. Further Reading
    6. Multiple Testing
      1. Further Reading
    7. Degrees of freedom
      1. Further Reading
    8. ANOVA
      1. F-Statistic
      2. 2-way ANOVA
      3. Further Reading
    9. Chi-square test
      1. Chi-square Test: A Resampling Approach
      2. Chi-Squared Test: Statistical Theory
      3. Fisher’s Exact Test
      4. Relevance for data science
      5. Further Reading
    10. Multi-arm bandit algorithm
      1. Further Reading
    11. Power and sample size
      1. Sample Size
      2. For Further Reading
    12. Summary
  5. 4. Regression and Prediction
    1. Simple Linear Regression
      1. The Regression Equation
      2. Fitted Values and Residuals
      3. Least Squares
      4. Prediction Versus Explanation (Profiling)
      5. Further Reading
    2. Multiple Linear Regression
      1. Example: King County Housing Data
      2. Assessing the Model
      3. Cross-validation
      4. Model Selection and Stepwise Regression
      5. Weighted Regression
    3. Prediction Using Regression
      1. The Dangers of Extrapolation
      2. Confidence and Prediction Intervals
    4. Factor Variables in Regression
      1. Dummy Variables Representation
      2. Factor Variables With Many Levels
      3. Ordered factor variables
    5. Interpreting the Regression Equation
      1. Correlated Predictors
      2. Multicollinearity
      3. Confounding Variables
      4. Interactions and Main Effects
    6. Testing the Assumptions - Regression Diagnostics
      1. Outliers
      2. Influential values
      3. Heteroskedasticity, Non-normality and Correlated Errors
      4. Partial Residual Plots and Nonlinearity
    7. Polynomial and Spline Regression
      1. Polynomial
      2. Splines
      3. Generalized Additive Models
      4. Further Reading
    8. Summary
  6. 5. Classification
    1. Naive Bayes
      1. Why Exact Bayesian Classification is Impractical
      2. The Naive Solution
      3. Numeric Predictor Variables
      4. Further Reading
    2. Discriminant Analysis
      1. Covariance Matrix
      2. Fisher’s Linear Discriminant
      3. A Simple Example
      4. Further Reading
    3. Logistic regression
      1. Logistic Response Function and Logit
      2. Logistic Regression and the GLM
      3. Generalized Linear Models (GLM)
      4. Predicted Values from Logistic Regression
      5. Interpreting the Coefficients and Odds Ratios
      6. Linear and Logistic Regression: Similarities and Differences
      7. Assessing the Model
      8. Further Reading
    4. Evaluating Classification Models
      1. Confusion Matrix
      2. The Rare Class Problem
      3. Precision, Recall and Specificity
      4. ROC Curve
      5. AUC
      6. Lift
      7. Further Reading
    5. Strategies for Imbalanced Data
      1. Undersampling
      2. Oversampling and Up/Down Weighting
      3. Data Generation
      4. Cost-Based Classification
      5. Exploring the Predictions
      6. Further Reading
    6. Summary
  7. 6. Statistical Machine Learning
    1. K-Nearest-Neighbors (KNN)
      1. A Small Example: Predicting Loan Default
      2. Distance Metrics
      3. One Hot Encoder
      4. Standardization (normalization, z-scores)
      5. Choosing K
      6. KNN as a Feature Engine
    2. Tree Models
      1. A Simple Example
      2. The Recursive Partitioning Algorithm
      3. Measuring Homogeneity or Impurity
      4. Stopping the Tree From Growing
      5. Predicting a Continuous Value
      6. How Trees are Used
      7. Further Reading
    3. Bagging and the Random Forest
      1. Bagging
      2. Random Forest
      3. Variable Importance
      4. Hyperparameters
    4. Boosting
      1. The Boosting Algorithm
      2. XGBoost
      3. Regularization: Avoiding Overfitting
      4. Hyperparameters and Cross-Validation
    5. Summary
  8. 7. Unsupervised Learning
    1. Principal Components Analysis (PCA)
      1. A Simple Example
      2. Computing the Principal Components
      3. Interpreting Principal Components
    2. K-Means Clustering
      1. A Simple Example
      2. K-means Algorithm
      3. Interpreting the Clusters
      4. Selecting the Number of Clusters
    3. Hierarchical Clustering
      1. A Simple Example
      2. The Dendrogram
      3. The Agglomerative Algorithm
      4. Measures of Dissimilarity
    4. Model Based Clustering
      1. Multivariate Normal Distribution
      2. Mixtures of Normals
      3. Selecting the Number of Clusters
    5. Scaling and Categorical Variables
      1. Scaling the Variables
      2. Dominant Variables
      3. Categorical Data and Gower’s Distance
      4. Problems with Clustering Mixed Data
    6. Summary
  9. Bibliography
  10. Index