Think Stats, 2nd Edition

Book description

Think Stats: Probability and Statistics for Programmers is a textbook for a new kind of introductory prob-stat class. It emphasizes the use of statistics to explore large datasets. It takes a computational approach: students write programs in Python as a way of developing and testing their understanding. The new edition switches over to pandas for data processing. It also includes new chapters on multiple regression, survival analysis, missing value imputation and resampling.

Table of Contents

  1. Preface
    1. How I wrote this book
    2. Using the code
    3. Contributor List
  2. 1. Data science for programmers
    1. A statistical approach
    2. The National Survey of Family Growth
    3. Importing the data
    4. DataFrames
    5. Variables
    6. Transformation
    7. Validation
    8. Interpretation
    9. Exercises
    10. Glossary
  3. 2. Distributions
    1. Histograms
    2. Representing histograms
    3. Plotting histograms
    4. NSFG variables
    5. Outliers
    6. First babies
    7. Summarizing distributions
    8. Variance
    9. Effect size
    10. Reporting results
    11. Exercises
    12. Glossary
  4. 3. Probability mass functions
    1. Pmfs
    2. Plotting PMFs
    3. Other visualizations
    4. The class size paradox
    5. DataFrame indexing
    6. Exercises
    7. Glossary
  5. 4. Cumulative distribution functions
    1. The limits of PMFs
    2. Percentiles
    3. CDFs
    4. Representing CDFs
    5. Comparing CDFs
    6. Percentile-based statistics
    7. Random numbers
    8. Comparing percentile ranks
    9. Exercises
    10. Glossary
  6. 5. Modeling distributions
    1. The exponential distribution
    2. The normal distribution
    3. Normal probability plot
    4. The lognormal distribution
    5. The Pareto distribution
    6. Generating random numbers
    7. Why model?
    8. Exercises
    9. Glossary
  7. 6. Probability density functions
    1. PDFs
    2. Kernel density estimation
    3. The distribution framework
    4. Hist implementation
    5. Pmf implementation
    6. Cdf implementation
    7. Moments
    8. Skewness
    9. Exercises
    10. Glossary
  8. 7. Relationships between variables
    1. Scatter plots
    2. Characterizing relationships
    3. Correlation
    4. Covariance
    5. Pearson’s correlation
    6. Non-linear relationships
    7. Spearman’s rank correlation
    8. Correlation and causation
    9. Exercises
    10. Glossary
  9. 8. Estimation
    1. The estimation game
    2. Guess the variance
    3. Sampling distributions
    4. Sampling bias
    5. Exponential distributions
    6. Exercises
    7. Glossary
  10. 9. Hypothesis testing
    1. Classical hypothesis testing
    2. HypothesisTest
    3. Testing a difference in means
    4. Other test statistics
    5. Testing a correlation
    6. Testing proportions
    7. Chi-squared tests
    8. First babies again
    9. Errors
    10. Power
    11. Replication
    12. Exercises
    13. Glossary
  11. 10. Linear least squares
    1. Least squares fit
    2. Implementation
    3. Residuals
    4. Estimation
    5. Goodness of fit
    6. Testing a linear model
    7. Weighted resampling
    8. Exercises
    9. Glossary
  12. 11. Regression
    1. StatsModels
    2. Multiple regression
    3. Non-linear relationships
    4. Data mining
    5. Prediction
    6. Logistic regression
    7. Estimating parameters
    8. Implementation
    9. Accuracy
    10. Exercises
    11. Glossary
  13. 12. Time series analysis
    1. Importing and cleaning
    2. Plotting
    3. Linear regression
    4. Moving averages
    5. Missing data
    6. Serial correlation
    7. Autocorrelation
    8. Prediction
    9. Further reading
    10. Exercises
    11. Glossary
  14. 13. Survival analysis
    1. Survival curves
    2. Hazard function
    3. Estimating survival curves
    4. Kaplan-Meier estimation
    5. The marriage curve
    6. Estimating the survival function
    7. Confidence intervals
    8. Cohort effects
    9. Extrapolation
    10. Expected remaining lifetime
    11. Exercises
    12. Glossary
  15. 14. Analytic methods
    1. Why normal?
    2. Sampling distributions
    3. Representing normal distributions
    4. Central limit theorem
    5. Testing CLT
    6. Applying CLT
    7. Correlation test
    8. Chi-squared test
    9. Discussion
    10. Exercises
  16. Index
  17. About the Author
  18. Copyright