Over the last week or so I have been asked a couple of statistical questions where the root cause of the problems has been the unthinking application of p-values. In both cases, the solution was to use confidence intervals instead; i.e. to look at the size of the effect, not whether it was different from zero. I decided it was time to take the battle to the host of p-values, armed only with a t-test. It is enough.
What is so bad about p-values? Let’s start with a question: what does a p-value tell you?
If you answered “the probability of getting the observed statistic, or something more extreme, if the null hypothesis was true”, then you’re right but you’re probably only parroting the textbook. If you said something about it giving the strength of evidence that a hypothesis is correct, then you’re wrong, but in good company: that is how most people use them in practice.
Time to spring an example on you. Let’s suppose we are interested in whether rocket scientists are more intelligent than brain surgeons. So, we round up a large number of both professions, and give them IQ tests. We then compare the distributions, to see which one has the higher mean. This is a classic t-test problem. We get these results:
The difference is 1, with a standard error of 0.27, so the t-statistic is 3.7, with 998 degrees of freedom. Hence horribly significant: p=2×10^-4^. We give it three stars, rejoice, and send off the paper.
Then our evil rivals in the physics department do the same study. Here are their results:
|_. Profession |_. Mean |_. Variance |_. n |
|Brain Surgeons | 112 | 6^2^ | 50 |
|Rocket Scientists | 114 | 6^2^ | 50 |
This gives a t-statistic of 1.67 and p=0.1. Not significant, so no stars. They then publish a rebuttal calling us fools.
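For anyone who wants to check the arithmetic, here is a minimal sketch in Python. It assumes equal group sizes and a common standard deviation, and it swaps the t distribution for a normal approximation, so the p-values come out slightly smaller than the exact ones. The function names are mine, purely for illustration.

```python
import math
from statistics import NormalDist


def two_sample_t(diff, sd, n_per_group):
    """Return (standard error, t) for two equal-sized groups with a common sd."""
    se = sd * math.sqrt(2.0 / n_per_group)
    return se, diff / se


def approx_two_sided_p(t):
    """Two-sided p-value using a normal approximation (an assumption, not the exact t)."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))


# Second study: means 112 and 114, variance 6^2, 50 people per group.
se_rivals, t_rivals = two_sample_t(diff=114 - 112, sd=6.0, n_per_group=50)
p_rivals = approx_two_sided_p(t_rivals)

# First study: only the reported difference (1) and standard error (0.27) are used.
t_ours = 1.0 / 0.27
p_ours = approx_two_sided_p(t_ours)

print(f"rivals: SE={se_rivals:.2f}, t={t_rivals:.2f}, approx p={p_rivals:.3f}")
print(f"ours:   SE=0.27, t={t_ours:.2f}, approx p={p_ours:.1e}")
```

The smaller difference gets the stars and the larger one does not, purely because of the standard errors.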
But look more closely. The variances are the same (you would never guess this is fictitious data, would you?). And in the second study the difference is 2, twice as large as the difference in the first. And yet the test says there is no difference!
Ah, some of you are saying, look at the sample sizes. The second study is much smaller, so its standard error is larger (1.2 against 0.27, about 4.5 times as large). We can see exactly what is going on if we look at the formula for the t-statistic, assume equal sample sizes in the two classes and re-arrange it:
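Written out as a sketch, with my own symbols (they are labels chosen for illustration): the two sample means are x̄~1~ and x̄~2~, s is the common standard deviation, and n is the number of people in each group.

```latex
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s^2}{n} + \dfrac{s^2}{n}}}
  = \frac{\bar{x}_1 - \bar{x}_2}{s\sqrt{2/n}}
  = \frac{\bar{x}_1 - \bar{x}_2}{s}\,\sqrt{\frac{n}{2}}
```

For the second study this gives (2/6) × sqrt(50/2) = 1.67, the value reported for it.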
What we see is that, for a given difference and standard deviation, t is proportional to the square root of n.
So what? Well, I would contend that for almost every significance test that is done in practice, the null hypothesis is, a priori, false. For example, it is unlikely that the average IQs of brain surgeons and rocket scientists are exactly the same. It might be that the difference is only 10^-16^, but that is still not zero. So if we know that the hypothesis is false, why are we testing it? The answer is that we want to know if the difference is large enough to be important (in whatever sense). If brain surgeons’ IQs are 0.02 points higher on average, then the difference is not worth bothering about. But if they are 15 points higher, then they are more intelligent than rocket scientists.
Significance tests, however, tell us something different. They tell us if the difference is statistically significantly different from zero. If the test statistic is large enough, we declare the difference significant. But the test statistic depends on the sample size: with more data our estimates get more precise, so it is easier to distinguish a difference from zero. We saw that the t-statistic increases with the square root of n; other tests have the same general property[1].
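To see the square-root behaviour in numbers, here is a small sketch. It holds the observed difference and standard deviation fixed (1 IQ point and 6, chosen to echo the fictitious data) and varies only the per-group sample size. The function name and the sample sizes are mine, for illustration.

```python
import math


def t_statistic(diff, sd, n_per_group):
    """t for two equal-sized groups with a common sd: (diff / sd) * sqrt(n / 2)."""
    return (diff / sd) * math.sqrt(n_per_group / 2.0)


diff, sd = 1.0, 6.0  # same effect and same noise throughout
for n in (50, 200, 800, 3200):
    # Each step quadruples n, which doubles t.
    print(f"n per group = {n:5d}  t = {t_statistic(diff, sd, n):.2f}")
```

Nothing about the professions changes between rows; only the amount of data does, and that alone moves the result from "no stars" to "three stars".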
Now, if we know that the difference is not zero, then what does a significance test test? It depends on three things: (1) the true effect size (e.g. the difference between treatments), (2) the variability in observations (i.e. their standard deviation) and (3) the sample size. We know the first is non-zero: it is the signal we want to see. The variance tells us how difficult this is: the larger the variance, the more noise there is in the way. The sample size indicates how hard we are looking: a larger sample size means we have more information to try and see the signal. As we know the signal is there, the only thing a significance test can be testing is whether there is enough data to see the signal. In other words, it principally tells us about the quality of our study.
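One way to make "enough data to see the signal" concrete is to turn the question round and ask how many people per group are needed before a given difference crosses the significance threshold. This is a rough sketch only: it assumes the observed difference lands exactly on the stated value, uses a normal critical value rather than the t distribution, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def n_needed_per_group(diff, sd, alpha=0.05):
    """Smallest per-group n at which |t| reaches the two-sided normal critical value."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # about 1.96 for alpha = 0.05
    # Solve (diff / sd) * sqrt(n / 2) >= z for n.
    return math.ceil(2.0 * (z * sd / diff) ** 2)


for diff in (2.0, 1.0, 0.02):
    print(f"difference of {diff} IQ points: about {n_needed_per_group(diff, 6.0)} per group")
```

Under these assumptions a difference of 2 points needs more than the 50 per group our rivals had, and even a 0.02-point difference becomes "significant" if you test several hundred thousand people.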
What is the solution? The first thing is to realise that statistics is no substitute for thinking. If you are a scientist, you should be able to decide for yourself what an important effect is, e.g. whether a difference of 8 IQ points is important. Now, statistics can help here by providing the, um, statistics that are important. For example an effect size (e.g. a difference between control and treatment, or the slope of a regression line), or the amount of variation explained (i.e. R^2^ in regression). We can also tell you the uncertainty: that’s confidence intervals and standard errors. So if you estimate a difference of 0.2 IQ points with a standard error of 20, you know the estimate is uncertain, and there could still be a realistic difference. But if the standard error is 0.0002, the estimate is precise, but still small (remember that the population standard deviation is about 15).
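Here is a minimal sketch of what that looks like for the numbers in this post. It builds approximate 95% intervals as estimate ± z × standard error, using a normal critical value (an assumption; with small samples a t critical value would be a little wider). The standard error of 1.2 for the second study is computed from its table (variance 6^2^, 50 per group); the function name and labels are mine.

```python
from statistics import NormalDist


def confidence_interval(estimate, se, level=0.95):
    """Approximate interval: estimate +/- z * se, with a normal critical value."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return estimate - z * se, estimate + z * se


examples = {
    "first study (difference 1, SE 0.27)": (1.0, 0.27),
    "second study (difference 2, SE 1.2)": (2.0, 1.2),
    "0.2 points, SE 20": (0.2, 20.0),
    "0.2 points, SE 0.0002": (0.2, 0.0002),
}

for label, (estimate, se) in examples.items():
    low, high = confidence_interval(estimate, se)
    print(f"{label}: {low:.4f} to {high:.4f}")
```

The second study's interval contains the whole of the first study's interval, so the two studies do not contradict each other at all; the second is simply less precise.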
Sounds easy, doesn’t it? The difficulties are in the statistical modelling (which is my job), and in understanding what is being measured, and how that relates to the real world. That shouldn’t be my job – I am merely a number pusher.
fn1. Except for a few pathological ones.