Semiotics is the study of signs, probably most famous because Umberto Eco wrote a book based around it. One important distinction that semiotics makes is between the sign and the object, i.e. the thing the sign refers to. This distinction was illustrated by René Magritte in his painting The Treachery of Images (‘La Trahison des Images’):
This is not a painting
Magritte was pointing out that this wasn’t a pipe, but a painting of a pipe (“just try to fill it with tobacco”). The painting is a sign, not the object itself, just like the word ‘pipe’, which also refers to the object.
There is a similar distinction in statistics, but it is even less appreciated. Statistics is (in part) the science of estimating numbers from data, numbers like the number of badgers in the UK or the fitness of pink sheep. It sounds obvious, but these estimates are not the real values. The rule that statistics uses to calculate an estimate from the data is called the estimator. The thing it is estimating is the estimand. Often the only difference between them is the effect of sampling error: the estimated number of badgers is fundamentally the same quantity as the real one, except that it lives in the virtual world of the model rather than the real world we inhabit.
But sometimes the estimator and estimand are different. Take, for example, fitness and natural selection [1]. In the 1930s, Fisher defined fitness as $m$ in this equation:

$$\int_0^\infty e^{-mx}\, l_x\, b_x\, dx = 1$$
($x$ is time, $b_x$ is the rate of birth of offspring, and $l_x$ is the rate of survival.) The details aren’t important for us; what is relevant is that this is a complicated definition. In practice, most people looking at natural populations estimate fitness as the Lifetime Reproductive Success (LRS): the total number of offspring produced. But this is not the same as Fisher’s $m$.
To see the difference, imagine one species which waits 100 years and then produces 10 offspring per parent and then dies, whereas another waits 1 year and then produces 2 offspring per parent and then dies. The first species has a higher LRS, but if the two are competing, then the second will have many more offspring after 100 years (all else equal).
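The comparison can be checked with a few lines of arithmetic. This is only a sketch under simplifying assumptions of my own: each individual breeds exactly once at a fixed age, every offspring survives to breed, and Fisher’s equation therefore collapses to offspring × exp(−m × age) = 1. The function names are made up for illustration.

```python
import math

def malthusian_parameter(age_at_breeding, offspring):
    """Solve offspring * exp(-m * age) = 1 for m.

    Assumes a single breeding event at a fixed age and full survival,
    which is a simplification of Fisher's integral definition.
    """
    return math.log(offspring) / age_at_breeding

def descendants_per_founder(age_at_breeding, offspring, years):
    """Descendants of one founder after `years`, counting whole generations only."""
    generations = years // age_at_breeding
    return offspring ** generations

# The two hypothetical species from the text.
slow = {"age_at_breeding": 100, "offspring": 10}  # LRS = 10
fast = {"age_at_breeding": 1, "offspring": 2}     # LRS = 2

m_slow = malthusian_parameter(**slow)
m_fast = malthusian_parameter(**fast)
n_slow = descendants_per_founder(years=100, **slow)
n_fast = descendants_per_founder(years=100, **fast)

print(f"slow breeder: LRS = {slow['offspring']}, m = {m_slow:.4f}, after 100 years: {n_slow}")
print(f"fast breeder: LRS = {fast['offspring']}, m = {m_fast:.4f}, after 100 years: {n_fast}")
```

Under these assumptions the slow breeder wins on LRS (10 against 2) but loses on m (about 0.023 against about 0.693), and after 100 years it has 10 descendants per founder against 2^100 for the fast breeder.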
So LRS, the estimator, is not the same as the estimand. But in practice it is easier to calculate, and the difference is probably not that big for most cases. So we use something that is wrong – we are estimating the wrong thing – because it is convenient (and probably isn’t too bad).
My reason for writing this is an on-going discussion with Lou Jost about $F_{ST}$. This is a number with a long pedigree, all the way back to Sewall Wright in the 1920s. It is used in population genetics as a measure of genetic divergence between populations. There are a few definitions (just to make things interesting), but the easiest to use is the ratio of the variance of allele frequencies between populations to the total variance in allele frequencies (this definition runs into trouble when a gene has more than 2 alleles, hence the more sophisticated definitions used now instead). The more divergent populations are, the higher $F_{ST}$ is.
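For a gene with two alleles, the variance-ratio definition can be written out directly. This is a minimal sketch, assuming equally weighted populations and known (not sampled) allele frequencies, so that the total variance is p̄(1 − p̄), where p̄ is the mean frequency of one allele across populations. The function name `fst_biallelic` is hypothetical.

```python
from statistics import mean, pvariance

def fst_biallelic(freqs):
    """Variance-ratio F_ST for one biallelic locus.

    `freqs` holds the frequency of one allele in each population.
    Populations are weighted equally and the frequencies are treated
    as known, so this ignores sampling error entirely.
    """
    p_bar = mean(freqs)
    total_variance = p_bar * (1 - p_bar)
    if total_variance == 0:
        raise ValueError("the same allele is fixed in every population")
    return pvariance(freqs) / total_variance

print(fst_biallelic([0.5, 0.5]))  # identical populations
print(fst_biallelic([0.4, 0.6]))  # mild divergence
print(fst_biallelic([0.0, 1.0]))  # fixed for different alleles
```

Identical populations give 0, populations fixed for different alleles give 1, and the value rises in between as the allele frequencies move apart.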
In practice, there are several ways of estimating $F_{ST}$: $G_{ST}$, $\Phi_{ST}$ etc. Several of these have the unfortunate property that they can decrease as population divergence increases [2]. Lou has used this to suggest that $F_{ST}$ should be ditched, and to suggest an alternative statistic, $D$, which does not suffer from this problem. But I think he is going too far. He is (rightly) pointing out that the estimator can have horrible problems, but he has never criticised the estimand, the actual divergence as it should be measured by $F_{ST}$. He has not shown that the different definitions of $F_{ST}$ we use now (e.g. defined in terms of probabilities of identity by descent, or coalescent times) are wrong. To do that he would have to tackle the large mathematical edifice constructed by Sewall Wright and his successors. It seems to me that if there is no problem with the estimand, then we shouldn’t throw it out. Rather we should try to improve the estimators. In this case, the problem is only with some types of genetic marker, in some situations.
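A toy calculation shows the kind of behaviour at issue. I am assuming the usual textbook forms here: $G_{ST} = (H_T - H_S)/H_T$ and $D = \frac{H_T - H_S}{1 - H_S} \cdot \frac{n}{n-1}$, with $H_S$ the mean within-population heterozygosity, $H_T$ the heterozygosity of the pooled population and $n$ the number of populations. The helper names and the example populations are invented for illustration, and the frequencies are treated as known rather than estimated from a sample.

```python
def heterozygosity(freqs):
    """Expected heterozygosity: 1 minus the sum of squared allele frequencies."""
    return 1.0 - sum(p * p for p in freqs)

def gst_and_d(populations):
    """Return (G_ST, D) for equally weighted populations.

    `populations` is a list of allele-frequency lists, all over the same
    ordered set of alleles. Assumes at least two populations and H_T > 0.
    """
    n = len(populations)
    h_s = sum(heterozygosity(pop) for pop in populations) / n
    pooled = [sum(column) / n for column in zip(*populations)]
    h_t = heterozygosity(pooled)
    g_st = (h_t - h_s) / h_t
    d = (h_t - h_s) / (1.0 - h_s) * n / (n - 1)
    return g_st, d

def disjoint_pair(k):
    """Two populations sharing no alleles, each with k equally common alleles."""
    pop_a = [1.0 / k] * k + [0.0] * k
    pop_b = [0.0] * k + [1.0 / k] * k
    return [pop_a, pop_b]

for k in (1, 2, 10):
    g_st, d = gst_and_d(disjoint_pair(k))
    print(f"{k} private alleles per population: G_ST = {g_st:.3f}, D = {d:.3f}")
```

The two populations never share an allele, so they are as divergent as they can be, yet $G_{ST}$ works out as 1/(2k − 1) and so shrinks as the number of alleles per population grows, while $D$ stays at 1. That says something about this estimator when within-population diversity is high; it says nothing about whether the estimand is wrong.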
Usually it is not important to worry deeply about the difference between estimators and estimands (other than to acknowledge that estimators have sampling variation). But there are some occasions where it does make a difference, and it is important to realise that what you are calculating may not be what you want to calculate. As long as it’s close enough, it may not matter. But when estimators and estimands diverge, the least one can do is understand which one it is that’s wrong.
[1] This description is simplified: the details aren’t important for this post, but there are a few details I’m driving over with a snowplough.
[2] The reason for this is that mutations become common, and these estimators don’t react well to that: see Kronholm et al. (http://dx.doi.org/10.1186%2F1471-2156-11-33) for more.