R in a Nutshell, 2nd Edition by Joseph Adler

Correlation and Covariance

Very often, when analyzing data, you want to know if two variables are correlated. Informally, correlation answers the question, “When we increase (or decrease) x, does y increase (or decrease), and by how much?” Formally, correlation measures the linear dependence between two random variables. Correlation values range between −1 and 1: 1 means that one variable is an increasing linear function of the other, 0 means the two variables have no linear relationship (though they may still be related in a nonlinear way), and −1 means that one variable is a decreasing linear function of the other (the two move in completely opposite directions; see Figure 16-1).
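The three reference values can be checked with R's built-in cor function. The sketch below uses made-up vectors (the names y_pos, y_neg, and y_sym are illustrative, not from the text); results may differ from the exact values by floating-point rounding.

```r
# Illustrative data (not from the text): every y is constructed from x
x <- c(1, 2, 3, 4, 5, 6)
y_pos <- 2 * x + 3       # increasing linear function of x
y_neg <- -0.5 * x + 10   # decreasing linear function of x
y_sym <- (x - mean(x))^2 # symmetric about mean(x), so no linear trend

cor(x, y_pos)  # 1 (up to rounding)
cor(x, y_neg)  # -1 (up to rounding)
cor(x, y_sym)  # 0 (up to rounding), even though y_sym is fully determined by x
```

The last case is a reminder that a correlation of 0 only rules out a linear relationship.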


Figure 16-1. Correlation (Source: http://xkcd.com/552/)

The most commonly used correlation measurement is the Pearson correlation statistic (it’s the formula behind the CORREL function in Excel):

$$ r = \frac{\sum_{i} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i} (x_i - \bar{x})^2 \, \sum_{i} (y_i - \bar{y})^2}} $$

where x̄ is the mean of variable x, and ȳ is the mean of variable y. The Pearson correlation statistic is rooted in properties of the normal distribution and works best with normally distributed data. An alternative correlation function is the Spearman correlation statistic. Spearman correlation is a nonparametric statistic and doesn’t make any assumptions about the underlying distribution; it is the Pearson correlation computed on the ranks of the observations rather than on their raw values.
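As a rough check of both statistics, the following sketch computes Pearson's statistic by hand, using the standard textbook form (sum of products of deviations from the means, divided by the square root of the product of the summed squared deviations), and compares it with cor. It then computes Spearman's statistic as a Pearson correlation of ranks. The data and the names pearson_manual and spearman_manual are invented for illustration.

```r
# Illustrative data (not from the text)
x <- c(2.1, 3.4, 1.9, 5.6, 4.2)
y <- c(1.0, 2.9, 1.2, 9.8, 3.1)

# Pearson by hand: deviations from the means, normalized by their magnitudes
pearson_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

all.equal(pearson_manual, cor(x, y, method = "pearson"))    # TRUE

# Spearman by hand: Pearson correlation applied to the ranks of x and y
spearman_manual <- cor(rank(x), rank(y))

all.equal(spearman_manual, cor(x, y, method = "spearman"))  # TRUE
```

Because Spearman's statistic depends only on ranks, it is unchanged by any monotonically increasing transformation of x or y, such as taking logarithms of positive values.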

Another measurement of how well two random variables are related is Kendall’s tau.
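Kendall's tau is available through the same cor function by setting method = "kendall". The sketch below also counts concordant and discordant pairs by hand, assuming there are no ties in the data (with ties, cor applies a correction that this simple count omits). The data and the name tau_manual are invented for illustration.

```r
# Illustrative data (not from the text), chosen to have no ties
x <- c(2.1, 3.4, 1.9, 5.6, 4.2)
y <- c(1.0, 2.9, 1.2, 9.8, 3.1)

# All index pairs (i, j) with i < j, one pair per column
pairs <- combn(length(x), 2)

# +1 for a concordant pair (x and y ordered the same way), -1 for a discordant one
s <- sign(x[pairs[1, ]] - x[pairs[2, ]]) * sign(y[pairs[1, ]] - y[pairs[2, ]])

# Tau: (concordant - discordant) / number of pairs
tau_manual <- sum(s) / choose(length(x), 2)

all.equal(tau_manual, cor(x, y, method = "kendall"))  # TRUE
```

Here 9 of the 10 pairs are concordant and 1 is discordant, so tau is (9 − 1) / 10 = 0.8.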
