REGRESSION AND MULTIVARIATE ANALYSIS
One of the most common tasks in data analysis is to study, and where one exists to characterize, the possible relationship between two variables, X and Y. Naturally, the first step in this endeavour is to plot one variable against the other (a scatterplot) and judge whether there is a pattern and what it may mean.
If there is a linear relationship between them, which is not unusual, the strength of this relationship can be measured by the correlation coefficient. It is an easy measure to interpret: its value varies between -1 (perfect negative correlation: Y decreases as X increases) and 1 (perfect positive correlation: Y increases as X increases), with values near 0 indicating little or no linear relationship. Some remarks on its use and interpretation:
- The correlation coefficient measures only the degree of linear relationship. Two variables can be perfectly related, but if the relation is, for example, quadratic, the correlation coefficient can be very low or even zero.
- Correlation does not imply a cause-and-effect relationship. Two variables may be highly correlated (have a high correlation coefficient) without depending directly on each other. There are clear and amusing examples that highlight this situation, such as the number of firefighters who come to fight a fire and the damage caused by it: usually, the more firefighters, the greater the damage, but the firefighters do not cause the damage; a third, hidden variable is related to both, in this case the magnitude of the fire. This may happen in other cases, and is quite ...
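
To make the first remark concrete, the following minimal Python sketch computes the (Pearson) correlation coefficient for a small set of made-up values. It assumes that "correlation coefficient" here means the usual Pearson sample coefficient; the function name pearson_r and the data are hypothetical, chosen only for illustration. Two exactly linear relations give coefficients of 1 and -1, while an exact quadratic relation over values of X symmetric about zero gives a coefficient of 0, even though Y is completely determined by X.

```python
import math


def pearson_r(xs, ys):
    """Pearson sample correlation coefficient of two equal-length sequences.

    Illustrative helper (hypothetical name), not taken from the text.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up X values, symmetric about zero
xs = [-3, -2, -1, 0, 1, 2, 3]

linear_up = [2 * x + 1 for x in xs]       # Y increases linearly with X
linear_down = [-0.5 * x + 4 for x in xs]  # Y decreases linearly with X
quadratic = [x ** 2 for x in xs]          # Y is an exact function of X, but not linear

r_up = pearson_r(xs, linear_up)
r_down = pearson_r(xs, linear_down)
r_quad = pearson_r(xs, quadratic)

print("increasing line:", r_up)
print("decreasing line:", r_down)
print("parabola:       ", r_quad)
```

The quadratic case is the one worth noting: a scatterplot would show an obvious pattern, yet the coefficient alone would suggest no relationship, which is why plotting the data first remains the recommended step.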