Why P-values are Evil

Over the last week or so I have been asked a couple of statistical questions where the root cause of the problems has been the unthinking application of p-values. In both cases, the solution was to use confidence intervals instead; i.e. to look at the size of the effect, not whether it was different from zero. I decided it was time to take the battle to the host of p-values, armed only with a t-test. It is enough.
What is so bad about p-values? Let’s start with a question: what does a p-value tell you?

If you answered “the probability of getting the observed statistic, or something more extreme, if the null hypothesis was true”, then you’re right but you’re probably only parroting the textbook. If you said something about it giving the strength of evidence that a hypothesis is correct, then you’re wrong, but in good company: that is how most people use them in practice.
Time to spring an example on you. Let’s suppose we are interested in whether rocket scientists are more intelligent than brain surgeons. So, we round up a large number of both professions, and give them IQ tests. We then compare the distributions, to see which one has the higher mean. This is a classic t-test problem. We get these results:

|_. Profession |_. Mean |_. Variance |_. n |
|Brain Surgeons | 110 | 6^2^ | 1000 |
|Rocket Scientists | 111 | 6^2^ | 1000 |

The difference is 1, with a standard error of 0.27, so the t-statistic is 3.7, with 1998 degrees of freedom. Hence horribly significant: p = 2×10^-4^. We give it three stars, rejoice, and send off the paper.
Then our evil rivals in the physics department do the same study. Here are their results:
|_. Profession |_. Mean |_. Variance |_. n |
|Brain Surgeons | 112 | 6^2^ | 50 |
|Rocket Scientists | 114 | 6^2^ | 50 |
which gives a t-statistic of 1.67, and p = 0.1. Not significant, so no stars. They then publish a rebuttal calling us fools.
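Both results can be reproduced from the summary statistics alone. A quick sketch in Python (using scipy; the helper name `t_from_summary` is mine, and any stats package would do the same job):

```python
from math import sqrt

from scipy import stats

def t_from_summary(mean1, mean2, sd, n):
    """Pooled two-sample t-test from summary statistics,
    assuming equal variances and equal group sizes."""
    se = sqrt(2 * sd**2 / n)      # standard error of the difference
    t = abs(mean2 - mean1) / se
    df = 2 * n - 2
    p = 2 * stats.t.sf(t, df)     # two-sided p-value
    return t, df, p

print(t_from_summary(110, 111, 6, 1000))  # first study:  t ≈ 3.73, df = 1998, p ≈ 2e-4
print(t_from_summary(112, 114, 6, 50))    # second study: t ≈ 1.67, df = 98,   p ≈ 0.1
```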
But, look more closely. The variances are the same (you would never guess this is fictitious data, would you?). And in the second study the difference is 2, twice as large as the difference in the first. And yet the test says there is no difference!
Ah, some of you are saying: look at the sample sizes. The second study is much smaller, so the standard error is larger (by a factor of sqrt(1000/50) = sqrt(20) ≈ 4.5, because the standard error shrinks with the square root of the sample size). We can see exactly what is going on if we look at the formula for the t-statistic, assume equal sample sizes and variances in the two groups, and re-arrange it:

t = (mean~1~ − mean~2~) / sqrt(2s^2^/n) = sqrt(n) × (mean~1~ − mean~2~) / (s × sqrt(2))

what we see is that t is proportional to the square root of n.
So what? Well, I would contend that for almost every significance test that is done in practice, the null hypothesis is, a priori, false. For example, it is unlikely that the average IQs of brain surgeons and rocket scientists are exactly the same. It might be that the difference is only 10^-16^, but that is still not zero. So if we know that the hypothesis is false, why are we testing it? The answer is that we want to know if the difference is large enough to be important (in whatever sense). If brain surgeons’ IQs are 0.02 points higher on average, then the difference is not worth bothering about. But if they are 15 points higher, then they are more intelligent than rocket scientists.
Significance tests, however, tell us something different. They tell us whether the difference is statistically significantly different from zero. If the test statistic is large enough, we declare the difference significant. But the test statistic depends on the sample size: with more data our estimates get more precise, so it is easier to distinguish a difference from zero. We saw that the t-statistic increases with the square root of n; other tests have the same general property^1^.
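That dependence is easy to see numerically. A minimal sketch (the 0.1-point difference and the population sd of 15 are illustrative numbers, not data): hold a trivially small true difference fixed and watch t grow with the sample size.

```python
from math import sqrt

def t_stat(delta, sd, n):
    """t-statistic for a true difference delta between two groups
    of size n each, with common standard deviation sd."""
    return delta / (sd * sqrt(2 / n))   # = sqrt(n) * delta / (sd * sqrt(2))

# A fixed difference of 0.1 IQ points, sd = 15:
for n in (100, 10_000, 1_000_000):
    print(f"n = {n:>9,}: t = {t_stat(0.1, 15, n):.2f}")
# A hundredfold more data gives a tenfold bigger t; any non-zero
# difference becomes "significant" once n is large enough.
```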
Now, if we know that the difference is not zero, then what does a significance test test? It depends in equal measure on (1) the true effect size (e.g. the difference between treatments), (2) the variability in observations (i.e. their standard deviation) and (3) the sample size. We know the first is non-zero: it is the signal we want to see. The variance tells us how difficult this is: the larger the variance, the more noise there is in the way. The sample size indicates how hard we are looking: a larger sample size means we have more information to try and see the signal. As we know the signal is there, the only thing a significance test can be testing is whether there is enough data to see the signal. In other words, it principally tells us about the quality of our study.
What is the solution? The first thing is to realise that statistics is no substitute for thinking. If you are a scientist, you should be able to decide for yourself what an important effect is, e.g. whether a difference of 8 IQ points is important. Now, statistics can help here by providing the, um, statistics that are important. For example an effect size (e.g. a difference between control and treatment, or the slope of a regression line), or the amount of variation explained (i.e. R^2^ in regression). We can also tell you the uncertainty: that’s confidence intervals and standard errors. So if you estimate a difference of 0.2 IQ points with a standard error of 20, you know the estimate is uncertain, and there could still be a realistic difference. But if the standard error is 0.0002, the estimate is precise, but the difference is still small (remember that the population standard deviation is about 15).
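The two 0.2-point examples can be written out explicitly. A minimal sketch, using the usual normal-approximation 95% interval (estimate ± 1.96 × standard error):

```python
def ci95(estimate, se):
    """Normal-approximation 95% confidence interval."""
    return (estimate - 1.96 * se, estimate + 1.96 * se)

# Same estimated effect (0.2 IQ points), wildly different precision:
print(ci95(0.2, 20))      # wide: anything from a big deficit to a big advantage
print(ci95(0.2, 0.0002))  # narrow: precisely estimated, and precisely unimportant
```

Either interval answers the useful question (how big could the effect plausibly be?), which is exactly what a bare p-value hides.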
Sounds easy, doesn’t it? The difficulties are in the statistical modelling (which is my job), and in understanding what is being measured, and how that relates to the real world. That shouldn’t be my job – I am merely a number pusher.

^1^ Except for a few pathological ones.


18 Responses to Why P-values are Evil

  1. Lee Turnpenny says:

    Wow! It’s at times like these that I’m inclined to quote the last line of The Commitments.
    I’m ashamed to say, being as I’m a geneticist by degree, that I’m currently struggling with some stats. Aside from the difficulty in trying to sweet-talk others into re-counting 96-well plates of stained cells blind: when I’m scoring my own experiments I really want the averaged counts in this column with one treatment to be significantly different from the control, but in that column with another treatment to not be. The problem is I work with a particularly ‘temperamental’ primary cell type and every culture is different. And sometimes the counts from the other treatment are higher than the control – significantly, even though far less so than with the factor that I predict (‘want’) to have the effect. Yet I ‘know’ (believe) that the other factor’s effect is not important. So, it moves to repeat treatments of multiple cultures to improve numbers and iron out the anomalies. But then the stats: first within, and then between cultures. I feel like I have to choose where the bar level is, but then that seems to encroach into unscientific territory.
    Your point about the quality of the study is spot on; control as much as possible. Unfortunately, the inherently uncontrollable element in my work is the cells themselves – and probably me.

  2. Martin Fenner says:

    Bob, thanks for the blog post, this is a very relevant topic to my daily work. When calculating the sample size for planned experiments, the assumed difference between groups is often not based on what is scientifically meaningful, but what is possible with the resources available.

  3. Henry Gee says:

    Bob – thanks for a thought-provoking post. I’ve often thought that the last step in any statistical test isn’t the P-value, but whether you, the experimenter, think it’s of any importance. So, in the end, the result of all that number-crunching boils down to intuition and a certain amount of sticking a wet finger out of the window.
    In my scientist days, I was told that if a difference was significant, you’d notice it without a test. If it wasn’t, then it wasn’t, even if the test said it was. Hang on, I might be missing something …
    @ Lee — what is the last line of The Commitments?

  4. Bora Zivkovic says:

    Can I cite this blog post if/when I submit a manuscript that has every statistician frothing at the mouth?
    The experiment could not have been done in a way that would satisfy a statistician (it would require invasive surgeries in thousands of birds instead of dozens), and has barely any stats done at all, yet the response to treatment is almost all-or-none, and everyone in the field would look at the raw data and gasp!
    It is significant if I SAY it is.

  5. Richard P. Grant says:

    Wasn’t it Rutherford who said that if your experiment needed statistics, you should do a better experiment?

  6. Mike Fowler says:

    If you want to carry out an experiment, it is imperative to consider what statistical analysis you want to use when you’re designing the experiment.
    This kinda incorporates all the good points people have made above.
    I’ve just submitted a paper where I’ve claimed that a significant difference between 2 logistic regression lines isn’t biologically important. Simulations are fun and avoid the financial or ethical problems with replication!

  7. Richard P. Grant says:

    I have no ethical problems with replication.

  8. Bob O'Hara says:

    I’m sat in a bar in Tallinn (Amarillo – don’t laugh, Mike. It was better than going to Roberts Coffee or Hesburger), so no deep comments now.
    Bora – there was a discussion about small sample sizes like yours after the Phase I clinical trial failed a year or so ago. Basically, everyone agrees that we know what the right result should be, but the only way to get there is Bayesian – i.e. stuff huge priors on the analysis. When I don’t have a pint of Saku in front of me, I should try and find the stuff.
    Henry – unfortunately too many people stop at the p value.
    Richard, you don’t work in human genetics, do you? The plant breeders always had a giggle over the problems the human people had in getting 100 offspring in an F~1~.
    Lee – what Henry said.
    Actually, it sounds like you need to tame a statistician and talk through your data. Beer helps usually, although blueberry pie also has a decent exchange rate.

  9. Bora Zivkovic says:

    Once I write that puppy up, may I call you and ask for help?

  10. Bob O'Hara says:

    Certainly, Bora. I might feel compelled to blog about it though…

  11. Mark Tummers says:

    Wasn’t it Rutherford who said that if your experiment needed statistics, you should do a better experiment?
    He might have said it, and it would have created a bias in his research: only clear-cut phenomena would have been investigated. Since it is clear (without statistical analysis) that nature does act in subtle ways on occasion (at least in biology it does), you would create a biased world view by focussing on obvious results.
    Rutherford’s universe might not necessarily correspond to the real one.

  12. Bob O'Hara says:

    Rutherford was a physicist – they didn’t do hard things like variation. Quantum uncertainty was just God winding them up.

  13. Lev Osherovich says:

    Nicely put! I’ve advocated this idea for quite some time, but am surrounded by hidebound P-value worshipers who blindly equate statistical significance with biological significance. The fools, I will destroy them all!

  14. Tim Fulmer says:

    Dr. Osherovich – you still leave open the question as to how we as scientists collectively settle on the proper criterion for “biologically significant.” What is the threshold of variation for “biologically significant”? A 0.00005-fold, 5-fold, 10-fold, 10^5^-fold difference from baseline/control? Have we, as biologists and physicists, any truly objective standard for determining significance of effect? If not, have we then any access to the laws of objective reality? If not, are we practicing science or mere wishful thinking?

  15. Lev Osherovich says:

    Dr. Fulmer – the determination of biological significance is highly case-specific. My rule of thumb is that biological effects need to be stunningly big (and thus stunningly statistically significant) to be taken seriously. How to define “stunning” will depend on the dynamic range available within a system. For instance, if a 10% difference in process X means the difference between life and death, a 10% difference is biologically significant. Likewise, if a 5-fold change in variable Y has no impact on survival, it’s probably not biologically important no matter how statistically significant. Also, you’ll be the first against the wall when the P value revolution comes!

  16. Bob O'Hara says:

    bq. Have we, as biologists and physicists, any truly objective standard for determining significance of effect? If not, have we then any access to the laws of objective reality?
    Tim – you raise an interesting question. But I would argue that if we don’t know what is “significant”, then we don’t understand the system we’re dealing with – it means we don’t understand what the numbers we are producing tell us. We really should be able to judge what (say) a 10% increase in survival really means. Lev gives a nice example of the thought processes we have to employ, and it’s really biology, not statistics. i.e. it’s real science.
    Right, now I’m off to London.

  17. Raf Aerts says:

    Very instructive entry! I am going to need this in a printable PDF format for future reference.

  18. Pingback: Evil p-values | Michael McCarthy's Research