Chapter 9. Correlation

Standard Scores

In this chapter, we look at relationships between variables. For example, we have a sense that height is related to weight; people who are taller tend to be heavier. Correlation is a description of this kind of relationship.

A challenge in measuring correlation is that the variables we want to compare might not be expressed in the same units. For example, height might be in centimeters and weight in kilograms. And even if they are in the same units, they come from different distributions.

There are two common solutions to these problems:

  1. Transform all values to standard scores. This leads to the Pearson coefficient of correlation.

  2. Transform all values to their percentile ranks. This leads to the Spearman coefficient.

If X is a series of values, x_i, we can convert to standard scores by subtracting the mean and dividing by the standard deviation: z_i = (x_i − μ) / σ.

The numerator is a deviation: the signed difference between a value and the mean. Dividing by σ normalizes the deviation, so the values of Z are dimensionless (no units) and their distribution has mean 0 and variance 1.
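As a minimal sketch of this transformation, the following uses only Python's standard library. The function name standard_scores and the sample heights are made up for illustration. It assumes σ is the population standard deviation (statistics.pstdev), which is what makes the variance of Z come out to exactly 1.

```python
import statistics

def standard_scores(xs):
    """Convert values to standard scores: z_i = (x_i - mu) / sigma.

    Assumes sigma is the population standard deviation.
    """
    mu = statistics.fmean(xs)
    sigma = statistics.pstdev(xs)
    if sigma == 0:
        # All values are equal, so the scores are undefined.
        raise ValueError("standard deviation is zero")
    return [(x - mu) / sigma for x in xs]

# Hypothetical heights in centimeters.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
zs = standard_scores(heights_cm)

# The scores are dimensionless, with mean 0 and variance 1
# (up to floating-point error).
z_mean = statistics.fmean(zs)
z_var = statistics.pvariance(zs)
```

Dividing by the sample standard deviation (statistics.stdev) instead would give scores whose population variance is slightly less than 1.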

If X is normally distributed, so is Z; but if X is skewed or has outliers, so does Z. In those cases, it is more robust to use percentile ranks. If R contains the percentile ranks of the values in X, the distribution of R is uniform between 0 and 100 (apart from ties), regardless of the distribution of X.
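A short sketch can make the robustness claim concrete. The function percentile_ranks below is a hypothetical helper that uses one common convention: the percentile rank of a value is the percentage of values in the series that are less than or equal to it. The income figures are invented for illustration.

```python
def percentile_ranks(xs):
    """Return the percentile rank of each value in xs.

    Convention assumed here: the rank of x is the percentage of
    values in xs that are less than or equal to x.
    """
    n = len(xs)
    return [100.0 * sum(1 for y in xs if y <= x) / n for x in xs]

# Two hypothetical series that differ only in the size of the largest value.
with_outlier = [20, 30, 40, 50, 1000]
without_outlier = [20, 30, 40, 50, 60]

ranks_a = percentile_ranks(with_outlier)
ranks_b = percentile_ranks(without_outlier)

# The ranks depend only on the ordering of the values, so the
# extreme value 1000 has no more influence than 60 would.
same_ranks = ranks_a == ranks_b
```

Because only the order of the values matters, the ranks are evenly spaced between 0 and 100 whenever the values are distinct, whatever the shape of the original distribution.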

Covariance

Covariance is a measure of the tendency of two variables to vary together. If we have two series, X and ...
