Chapter 7. Descriptive Statistics and Modeling

The earlier chapters in this book focused on a variety of data processing techniques that enable you to transform raw data into a dataset that’s ready for statistical analysis. In this chapter, we turn our attention to some basic statistical analysis and modeling techniques. We’ll focus on exploring and summarizing datasets with plots and summary statistics, and on conducting regression and classification analyses with multivariate linear regression and logistic regression.

This chapter isn’t meant to be a comprehensive treatment of statistical analysis techniques or pandas functionality. Instead, the goal is to demonstrate how you can produce some standard descriptive statistics and models with pandas and statsmodels.
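To give a flavor of what "standard descriptive statistics" looks like in pandas, here is a minimal sketch. The column names and values are invented purely for illustration and do not come from either dataset used in this chapter.

```python
import pandas as pd

# Toy data, made up for this sketch (hypothetical column names and values).
df = pd.DataFrame({
    "measurement": [9.4, 9.8, 10.5, 11.2],
    "score": [5, 5, 6, 7],
})

# describe() reports count, mean, std, min, quartiles, and max
# for each numeric column.
summary = df.describe()
print(summary)

# Individual statistics are also available as Series methods.
mean_score = df["score"].mean()
print(mean_score)
```

The same calls work unchanged on a larger DataFrame loaded from a file, which is how they are typically used in practice.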

Datasets

Instead of creating datasets with thousands of rows from scratch, let’s download them from the Internet. One of the datasets we’ll use is the Wine Quality dataset, which is available at the UC Irvine Machine Learning Repository. The other dataset is the Customer Churn dataset, which has been featured in several analytics blog posts.

Wine Quality

The Wine Quality dataset consists of two files, one for red wines and one for white wines, for variants of the Portuguese “Vinho Verde” wine. The red wines file contains 1,599 observations and the white wines file contains 4,898 observations. Both files contain one output variable and eleven input variables. The output variable is quality, which is a score between ...
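Because the red and white wines arrive as two separate files, one common approach is to stack them into a single table with a column recording each row's origin. The sketch below assumes that approach; the helper name `combine_wines`, the `type` column, and the tiny stand-in frames are all hypothetical. In practice the two frames would come from reading the downloaded files (for example with `pd.read_csv`, using whatever file names and delimiter your copies have).

```python
import pandas as pd


def combine_wines(red: pd.DataFrame, white: pd.DataFrame) -> pd.DataFrame:
    """Stack the red and white tables and label each row's origin.

    Hypothetical helper for illustration; the 'type' column name is an
    assumption, not something defined by the dataset itself.
    """
    red = red.assign(type="red")        # assign() returns a copy
    white = white.assign(type="white")
    return pd.concat([red, white], ignore_index=True)


# Stand-in frames with made-up values; the real files are far larger.
red_demo = pd.DataFrame({"quality": [5, 6]})
white_demo = pd.DataFrame({"quality": [6, 7, 5]})

wine = combine_wines(red_demo, white_demo)

# Row counts per wine type: a quick sanity check after combining.
counts = wine["type"].value_counts()
print(counts)
```

Run against the full files, the same check should report 1,599 red and 4,898 white observations, matching the counts given above.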
