Last week Tom Webb tweeted this post about R². In it, Cosma Shalizi argues that it’s wrong to say that “R² is the proportion of variance explained by the model”. I say not so fast.
Cosma’s point is that you can use a variable – say, Date – in a regression to predict a response – say, number of deaths in Chicago. But that doesn’t mean that the variable is causal:
It seems absurd, however, to say that the date explains how many people died in Chicago on a given day, or even the variation from day to day. The closest I can come up with to an example of someone making such a claim would be an astrologer, and even one of them would work in some patter about the planets and their influences.
But if it’s so absurd, why are you using date in the first place? Isn’t it absurd to try to explain death rates with something that a priori you think can’t possibly explain death rates?
Well, yes and no. It depends on how we interpret the regression model. The purpose of a regression is to explain a variable, Y, with one or more explanatory variables, X, i.e. to find a relationship Y = f(X) + ε. X explains Y in the sense that if we know X, we can calculate f(X), and this plus some random noise (the ε) is Y. Note that there is an asymmetry here: X is used to explain Y, but not the other way round (if there were a symmetry we would be talking about X and Y being correlated). From this it’s a small step to say that some of the variation in Y is explained by X, i.e. according to the model, if we change X, we’ll see a corresponding change in Y. Interpreted like this, it makes perfect sense to say that date predicts deaths in Chicago.
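To make the "variation explained" bookkeeping concrete, here is a minimal sketch of what R² measures for a straight-line fit. The data are made up for illustration, and `fit_line` and `variance_shares` are hypothetical helper names, not anything from Shalizi's post. It splits the variation in Y into the part carried by f(X) and the part left in ε, and shows that the two usual ways of writing R² give the same number when the line is fitted by least squares with an intercept.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

def variance_shares(x, y):
    """Return R² computed two ways: 1 - SS_res/SS_tot and SS_fit/SS_tot."""
    a, b = fit_line(x, y)
    fitted = [a + b * xi for xi in x]           # f(X)
    y_bar = mean(y)
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)                # total variation in Y
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # variation left in ε
    ss_fit = sum((fi - y_bar) ** 2 for fi in fitted)           # variation in f(X)
    return 1 - ss_res / ss_tot, ss_fit / ss_tot

# Made-up illustrative data, not real mortality figures.
x = [1, 2, 3, 4, 5, 6, 7]
y = [3.1, 2.4, 4.0, 3.6, 5.2, 4.7, 6.1]

r2, share = variance_shares(x, y)
print(f"1 - SS_res/SS_tot = {r2:.3f}")
print(f"SS_fit/SS_tot     = {share:.3f}")
```

Nothing in this arithmetic knows whether X is causal. It only says how much of the spread in Y is reproduced by f(X), which is exactly why the interpretation matters.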
But models are there to be interpreted. The (strict) interpretation of a regression model is that the explanatory variables are causal, i.e. if we manipulate the date, the number of deaths will change. Under this interpretation, it’s clearly absurd to think that sneaking into Chicago one night and changing the calendars to Wednesday rather than Sunday would change the number of deaths. But if we accept this interpretation, then once again it is absurd to fit Date as a variable in the analysis.
So why do we do this? Because in practice we use a slightly weaker interpretation of a regression model. Although the variable we put into the model may not itself be causal, we assume it is strongly correlated with one that is. The choice of variable to use is often a matter of convenience, e.g. in bird studies tarsus length is often used to measure body size, and date is related to car activity (the roads are busier during the week, when people are driving to and from work). It is assumed that what is measured is strongly enough correlated with what is causal that we can treat them as the same in the analysis: the interpretation can then shift to discussing what is the actual causal factor.
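The weaker interpretation can be sketched with a small simulation. Everything here is an assumption made for illustration: the variable names borrow the bird example, the effect size and noise levels are arbitrary, and `r_squared` is a hypothetical helper that uses the fact that R² for a straight-line fit is the squared correlation. The response is driven by body size, while tarsus length is only a noisy measurement of it.

```python
import random
from statistics import mean

def r_squared(x, y):
    """R² of the least-squares line of y on x (the squared correlation)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)

rng = random.Random(1)  # fixed seed so the sketch is repeatable
n = 2000

# Assumed set-up: body size is the causal variable, tarsus is a noisy proxy for it.
body_size = [rng.gauss(0, 1) for _ in range(n)]
tarsus = [s + rng.gauss(0, 0.3) for s in body_size]
response = [2 * s + rng.gauss(0, 1) for s in body_size]

r2_causal = r_squared(body_size, response)
r2_proxy = r_squared(tarsus, response)
print(f"R² using the causal variable: {r2_causal:.3f}")
print(f"R² using the proxy:           {r2_proxy:.3f}")
```

With these arbitrary settings the proxy should give a somewhat lower R² than the causal variable, but still a substantial one. The closer the proxy tracks the causal variable, the less is lost by treating the two as the same in the analysis.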
Under this weaker interpretation, I think it’s fine to say that Date explains variation in mortality: we know it’s not causal, but then nothing in our analyses is. As long as we don’t use a strict interpretation of the model, there’s no problem. And if we are using the strict interpretation, we shouldn’t be fitting that model anyway. If there is a problem, it’s in understanding how strictly a model should be interpreted. I think reducing this to linguistics by banning certain phrases isn’t going to help: most people will remember the ban, but not necessarily what it’s for.
Bottom line: we all know what “X explains 55% of the variance in Y” means, and in practice it’s not misinterpreted. So why make a fuss?