Semiotics and Statistics

Posted on August 16, 2010 by rpg

Semiotics is the study of signs, probably most famous because Umberto Eco wrote a book based around it. One important distinction that semiotics makes is between the sign and the object, i.e. the thing the sign refers to. This distinction was illustrated by René Magritte in his painting The Treachery of Images (‘La Trahison des Images’):
This is not a painting

Magritte was pointing out that this wasn’t a pipe, but a painting of a pipe (“just try to fill it with tobacco”). The painting is a sign not the object itself, just like the word ‘pipe’ which also refers to the object.
There is a similar distinction in statistics, but it is even less appreciated. Statistics is (in part) the science of estimating numbers from data, numbers like the number of badgers in the UK or the fitness of pink sheep. It sounds obvious, but these estimates are not the real values. The estimate that statistics calculates is called the estimator. The thing it is estimating is the estimand. Often the only difference between them is the effect of sampling error: the estimated number of badgers is fundamentally the same quantity as the real one, except that it lives in the virtual world of the model rather than the real world we inhabit.
But sometimes the estimator and estimand are different. Take, for example, fitness and natural selection¹. In the 1930s, Fisher defined fitness as m in this equation:

$$\int_0^\infty e^{-mx}\, l_x\, b_x \, dx = 1$$
(x is time, b_x is the rate of birth of offspring, and l_x is the rate of survival). The details aren’t important for us, what is relevant is that this is a complicated definition. In practice, most people looking at natural populations estimate fitness as the Lifetime Reproductive Success (LRS): the total number of offspring produced. But this is not the same as Fisher’s m:

To see the difference, imagine one species which waits 100 years and then produces 10 offspring per parent and then dies, whereas another waits 1 year and then produces 2 offspring per parent and then dies. The first species has a higher LRS, but if the two are competing, then the second will have many more offspring after 100 years (all else equal).
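The arithmetic behind this comparison can be sketched in a few lines of Python. This is only an illustration, under a simplifying assumption that goes beyond the post: each individual produces all R of its offspring in one pulse at age T and then dies, in which case Fisher's condition reduces to R·exp(−mT) = 1, so m = ln(R)/T. The function names are invented for this example.

```python
import math

def malthusian_m(offspring, age_at_reproduction):
    """Fisher's m for a single reproductive pulse: solves R * exp(-m * T) = 1."""
    return math.log(offspring) / age_at_reproduction

def descendants_after(years, offspring, age_at_reproduction):
    """Descendants of one founder after `years`, counting completed generations only."""
    generations = years // age_at_reproduction
    return offspring ** generations

# The two hypothetical species from the text.
species = {
    "slow": {"offspring": 10, "age_at_reproduction": 100},
    "fast": {"offspring": 2, "age_at_reproduction": 1},
}

for name, sp in species.items():
    print(
        name,
        "LRS =", sp["offspring"],
        "m =", round(malthusian_m(**sp), 4),
        "descendants after 100 years =", descendants_after(100, **sp),
    )
```

The slow species wins on LRS (10 against 2) but loses badly on m, which is what decides the outcome when the two compete.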
So LRS, the estimator, is not the same as the estimand. But in practice it is easier to calculate, and the difference is probably not that big for most cases. So we use something that is wrong – we are estimating the wrong thing – because it is convenient (and probably isn’t too bad).
My reason for writing this is an on-going discussion with Lou Jost about F_ST. This is a number with a long pedigree, all the way back to Sewall Wright in the 1920s. It is used in population genetics as a measure of genetic divergence between populations. There are a few definitions (just to make things interesting), but the easiest to use is the ratio of variance of allele frequencies between populations to the total variance in allele frequencies (this definition runs into trouble when a gene has more than 2 alleles, hence the more sophisticated definitions used now instead). The more divergent populations are, the higher F_ST is.
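For a gene with two alleles, the variance-ratio definition is easy to compute. The sketch below is a minimal illustration, not anyone's published estimator: it assumes the true allele frequencies are known, weights all populations equally, and takes the total variance to be p̄(1 − p̄), where p̄ is the mean frequency of one allele. The function name is made up for this example.

```python
def fst_two_alleles(freqs):
    """Variance-ratio F_ST for one bi-allelic locus.

    `freqs` holds the frequency of one allele in each population.
    Populations are weighted equally (an assumption of this sketch).
    """
    n = len(freqs)
    p_bar = sum(freqs) / n
    var_between = sum((p - p_bar) ** 2 for p in freqs) / n
    var_total = p_bar * (1 - p_bar)
    if var_total == 0:
        raise ValueError("locus is monomorphic overall; F_ST is undefined")
    return var_between / var_total

print(fst_two_alleles([0.5, 0.5, 0.5]))  # identical populations
print(fst_two_alleles([0.4, 0.6]))       # mild divergence
print(fst_two_alleles([0.0, 1.0]))       # fixed for different alleles
```

Identical populations give 0, populations fixed for different alleles give 1, and the mild case lands close to 0.04.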
In practice, there are several ways of estimating F_ST: G_ST, Φ_ST etc. Several of these have the unfortunate property that they can decrease as population divergence increases². Lou has used this to suggest that F_ST should be ditched, and to suggest an alternative statistic, D, which does not suffer from this problem. But I think he is going too far. He is (rightly) pointing out that the estimator can have horrible problems, but he has never criticised the estimand – the actual divergence, as should be measured by F_ST. He has not shown that the different definitions of F_ST we use now (e.g. defined in terms of probabilities of identity by descent, or coalescent times) are wrong. To do that he would have to tackle the large mathematical edifice constructed by Sewall Wright and his successors. It seems to me that if there is no problem with the estimand, then we shouldn’t throw it out. Rather we should try to improve the estimators. In this case, the problem is only with some types of genetic marker, in some situations.
Usually it is not important to worry deeply about the difference between estimators and estimands (other than to acknowledge that estimators have sampling variation). But there are some occasions where it does make a difference, and it is important to realise that what you are calculating may not be what you want to calculate. As long as it’s close enough, it may not matter. But when estimators and estimands diverge, the least one can do is understand which one it is that’s wrong.

¹ This description is simplified: the details aren’t important for this post, but there are a few details I’m driving over with a snowplough.

² The reason for this is that mutations become common, and these estimators don’t react well to that: see Kronholm et al. (http://dx.doi.org/10.1186%2F1471-2156-11-33) for more.

This entry was posted in Science Blogging.

63 Responses to Semiotics and Statistics

Matt Brown says:

August 17, 2010 at 1:25 pm

Estimand. What a great word. Thanks Bob, you’ve enriched my day.
Lou Jost says:

August 17, 2010 at 1:58 pm

Thanks for continuing this discussion. I disagree that Nei’s Gst is an estimator of Fst. Rather, it was a way of generalizing the 2-allele, 2-population definition to multiple populations and alleles, and to haploid, diploid, or polyploid organisms.
Regardless of that issue, you agree that the definition you give of the estimand Fst has problems when there are more than two alleles (basically all realistic cases fall into this category). That is my point too, and the reason I derived a real measure of differentiation, D, to cover all cases. Can you provide an explicit definition of the estimand Fst valid for multiple pops and multiple alleles (preferably expressed in terms of the true population allele frequencies so there is no ambiguity about whether to use sampling variance, pop variance, etc)? If so, then we can get to the real question: Does it measure differentiation? Until you give us a concrete formula for the estimand, we can’t properly judge your argument that I have gone too far in rejecting Fst as a general measure of differentiation.
Bob O'Hara says:

August 17, 2010 at 2:24 pm

“Regardless of that issue, you agree that the definition you give of the estimand Fst has problems when there are more than two alleles (basically all realistic cases fall into this category).”
I agree that Wright’s definition has problems, and that’s why other definitions (e.g. Malecot’s) are better. Slatkin used that to write Fst in terms of coalescent times, which naturally generalises to multiple populations and alleles.
Lou Jost says:

August 17, 2010 at 6:57 pm

OK, so choose one of these definitions and let’s get to work to see if it passes my tests. If you choose Slatkin’s (1991) definition, from his article relating Fst to coalescent times, as you suggest, we end up with
(f_0 – f_bar)/(1-f_bar)
a definition which is identical to Nei’s Gst, according to Slatkin himself on p.168. This definition has all of the problems described in my 2008 Molecular Ecology paper and the problems pointed out by Gregorius in his many papers (most recently Gregorius 2010, “Linking diversity and differentiation”, in the journal Diversity). In particular, when there are many demes, all fixed for a single allele, Fst will equal unity (supposedly indicating maximal differentiation between demes) even if almost all demes are fixed for the same allele! Also, if gene identity within demes is near zero (in other words, if diversity is high), then the numerator approaches zero (supposedly indicating low differentiation) even if all demes are completely differentiated!
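Both failure cases described here can be checked numerically. The sketch below is a hedged illustration rather than code from any of the cited papers: it works from true allele frequencies, weights demes equally, and reads f_0 as the mean within-deme gene identity and f_bar as the gene identity of the pooled population. Those readings, and the function names, are assumptions of this example.

```python
def gene_identity(freqs):
    """Probability that two genes drawn at random (with replacement) are identical."""
    return sum(p * p for p in freqs)

def gst(demes):
    """(f_0 - f_bar) / (1 - f_bar), with equally weighted demes.

    `demes` is a list of dicts mapping allele name -> frequency in that deme.
    """
    n = len(demes)
    f_0 = sum(gene_identity(d.values()) for d in demes) / n
    alleles = set().union(*demes)
    pooled = [sum(d.get(a, 0.0) for d in demes) / n for a in alleles]
    f_bar = gene_identity(pooled)
    return (f_0 - f_bar) / (1 - f_bar)

# Ten demes, each fixed: nine for allele A, one for allele B.
almost_same = [{"A": 1.0}] * 9 + [{"B": 1.0}]
# Ten demes, each fixed for a different allele.
all_different = [{f"a{i}": 1.0} for i in range(10)]
# Two demes sharing no alleles, each with 50 equally common private alleles.
private_diverse = [{f"d{k}_{i}": 1 / 50 for i in range(50)} for k in range(2)]

print("nine demes alike, one different:", gst(almost_same))
print("all demes fixed differently:   ", gst(all_different))
print("no shared alleles, diverse:    ", gst(private_diverse))
```

The first two cases both come out at 1, although the demes are nearly identical in one and entirely distinct in the other. The third comes out at roughly 0.01, although the two demes share no alleles at all.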
I repeat, this issue has nothing to do with estimators or estimands, but rather reflects fundamental misconceptions about the mathematics of diversity and differentiation. The additive partitioning of heterozygosity into within- and between-group components, which is the basis of most of these definitions, is wrong. It is also wrong to use the ratio of within-group heterozygosity to total heterozygosity as a measure of the similarity between groups. That misconception is the basis for other derivations of Gst and Fst.
It is worthwhile to read the Gregorius paper I just cited for a complete analysis. You might also read my “Partitioning diversity into independent alpha and beta components”, Ecology (2007), which derives the mathematical relationship between diversity and differentiation, and derives the correct partitioning formula for heterozygosity (under the guise of the “Gini-Simpson index”, which is the name ecologists use for heterozygosity). You might also read the 30 page, multi-participant debate about my papers in the last issue of Ecology (July 2010).
I am sorry to say that even with its long pedigree, Fst needs to be “ditched” as a measure of differentiation (though under certain conditions it is a useful tool for estimating migration rate). Likewise most of the literature using heterozygosity as a measure of diversity is also wrong, and this idea too needs to be ditched. The mathematics of classical population genetics was developed at a time when genes could not be sequenced, so observed diversity was always low (often bi-allelic). Under these conditions the errors in standard formulas are not conspicuous. But while technology has progressed, the mathematics of classical pop gen has not. It is still the same as in the 1920s. We know much more now about the mathematics of diversity. I hope population geneticists will rise to the challenge of modernizing their field. If they do not, the literature will continue to fill up with logically inconsistent, misinterpreted studies that will be rich fodder for future historians of science.
Bob O'Hara says:

August 17, 2010 at 8:01 pm

“In particular, when there are many demes, all fixed for a single allele, Fst will equal unity (supposedly indicating maximal differentiation between demes) even if almost all demes are fixed for the same allele!”
Rather than use the definitions based on IBD or coalescence, you shift the goalposts back to the estimator (I’ll check what Slatkin wrote tomorrow – I’m at home now).

“In particular, when there are many demes, all fixed for a single allele, Fst will equal unity (supposedly indicating maximal differentiation between demes) even if almost all demes are fixed for the same allele!”

If all demes have the same allele, H_T is 0. So Gst is (0 – 0)/0, which may or may not be 1. Or 0. If the alleles are different, then Gst=1, which is the best point estimate of divergence from the data (although the confidence intervals are difficult to calculate from just this).

“The additive partitioning of heterozygosity into within- and between-group components, which is the basis of most of these definitions, is wrong.”

IBD≠IIS: a definition based on heterozygosity (horrible term, anyway, when it refers to different individuals) is a definition of an estimator (possibly except for Gst, but then I don’t see Gst as a definition of Fst anyway).
Lou Jost says:

August 17, 2010 at 9:20 pm

Bob, I was trying to use the definition you suggested. This is Malecot’s definition as given in the Slatkin article you referred to. If this is not the definition you prefer, please just write down an actual definition of Fst, any definition, so we can test it. Please pick a different definition if you disagree with this one. Write it in terms of allele frequencies so we can work with it directly and unambiguously.
By the way, you didn’t read my first example carefully. I said ALMOST all the demes were fixed for the same allele. The “almost” is important as it means that Ht is not 0, so the definition of Fst is not undefined in this case. Since in this case f_0 = 1 (Hs = 0), the definition of Fst becomes (1-f_bar)/(1-f_bar), which equals unity regardless of the value of f_bar or Ht. So for this example, this definition of Fst does not measure differentiation, since it gives the same value no matter if all demes are fixed for different alleles or if all but one deme are fixed for the same allele.
And again I repeat, this is not an estimator, this measure uses the true population frequencies of the alleles. Gst is not an estimator either. It is defined in terms of the true population allele frequencies. Nei and Chesser derived an estimator specifically to estimate it from actual data.
I agree with you that heterozygosity is a poor term, used only for historical reasons. I prefer the term gene identity (=1-H) which makes sense for haploid, diploid, and polyploid organisms. If you write down your preferred definition of Fst here, it would be great if you write it in terms of gene identities. Anyway, whatever you call it, the partitioning used in pop gen is incomplete, as you can easily see by noting how the value of Hs constrains the value of the so-called between-group component Hst = Ht – Hs. The correct partitioning is Ht = Hs + Hst – Hs*Hst as you can verify by seeing that now Hst is unconstrained by the value of Hs. This correctly-derived Hst is the basis for the real measure of differentiation, D. Gene identities can also be partitioned directly and more simply, according to my partitioning theorem: Jst = (1/Jt)/(1/Js). This can also be used to derive D.
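The constraint described here can be made concrete with a small numerical sketch. It takes a hypothetical case of two demes that share no alleles, each holding 50 equally common private alleles, so that Hs = 0.98 and Ht = 0.99. Solving the partitioning Ht = Hs + Hst − Hs*Hst for Hst gives Hst = (Ht − Hs)/(1 − Hs). The final rescaling by n/(n − 1) to obtain D follows my reading of Jost (2008); it is not spelled out in this thread, so treat it as an assumption, along with all the function names.

```python
def hst_additive(hs, ht):
    """Between-group component under additive partitioning: Ht = Hs + Hst."""
    return ht - hs

def hst_multiplicative(hs, ht):
    """Between-group component from Ht = Hs + Hst - Hs*Hst, solved for Hst."""
    return (ht - hs) / (1 - hs)

def jost_d(hs, ht, n):
    """D as n/(n-1) times the multiplicative Hst (assumed form, after Jost 2008)."""
    return (n / (n - 1)) * hst_multiplicative(hs, ht)

# Two demes with no shared alleles, 50 equally common private alleles each.
n = 2
js, jt = 1 / 50, 1 / 100      # gene identities within demes and in the pooled population
hs, ht = 1 - js, 1 - jt       # heterozygosities: 0.98 and 0.99

print("additive Hst:       ", hst_additive(hs, ht))
print("largest additive Hst possible at this Hs:", 1 - hs)
print("multiplicative Hst: ", hst_multiplicative(hs, ht))
print("ratio (1/Jt)/(1/Js):", (1 / jt) / (1 / js))
print("D:                  ", jost_d(hs, ht, n))
```

With Hs = 0.98 the additive between-group term can never exceed 0.02, however different the demes are. The multiplicative version gives 0.5 and D comes out at 1 (up to floating-point rounding), matching the requirement that completely distinct demes score unity.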
But my plea to you continues. We have been discussing Fst for several days now but I still don’t really know what formula you want to use to define Fst. Every time I show that a formula doesn’t measure differentiation, you say that is not the formula you meant. A real, explicit formula is what we need here. Then I am sure we could reach agreement about whether or not it measures differentiation. The definition in Slatkin 1991, which I thought you were referring to in your previous post, clearly does not measure differentiation.
I suppose we should also clearly define differentiation. I am discussing allelic differentiation, which depends only on the alleles and their frequencies, and the number of demes. Differentiation should be unity if and only if there are no shared alleles in any deme. Differentiation should be zero if and only if all alleles appear at identical frequencies in all demes. This is how differentiation is interpreted in most papers on the subject. Is this what you mean by differentiation?
Lou Jost says:

August 18, 2010 at 8:08 pm

I forgot to mention that we should require that any measure of differentiation be monotonically increasing with the balanced addition of new private alleles to all demes, as described in my 2008 paper.
Tom English says:

August 19, 2010 at 4:37 am

Bob, I’m with Matt. I’m happy to have “estimand” at my disposal.
(But I mainly wanted to let you know that I always read your posts, if not always promptly.)
Bob O'Hara says:

August 19, 2010 at 5:04 pm

I re-read Nei’s PNAS paper, and he’s not clear about what Gst actually is – he doesn’t define it as Fst, but shows that it collapses into Fst when a locus is bi-allelic.
What do you mean by the f’s? Everyone uses slightly different terminology! They’re not coalescent times, because fixation obviously doesn’t imply zero coalescence.

“And again I repeat, this is not an estimator, this measure uses the true population frequencies of the alleles.”

Fst, as it’s used by population geneticists now, is a function of the process that gives rise to the allele frequencies. Even if you have the true population frequencies, these are only a stochastic realisation of the underlying process, so the Fst estimated from them can depart from the “true” Fst. This is just like fitness being the expected number of offspring (all else equal!), not the actual. We’re interested in the process that gives rise to the data, not the data per se.
Tom – glad to hear you’re reading. If you’re talking at the BES, I insist you use ‘estimand’ (note to self: make sure estimand creeps into my talk too)
Lou Jost says:

August 20, 2010 at 4:11 am

Nei is defining a measure of population structure basically from scratch, doing so in a way that naturally generalizes to multiple alleles and pops. He “derives” it by partitioning heterozygosity into additive within- and between-group components. This is similar to derivations of Fst based on additive partitioning of variance. Unfortunately, as I pointed out earlier, heterozygosity is not additive, so the “partitioning” does not really do its job.
The f notation I used was directly from Slatkin. It appears to be gene identity. Slatkin says his formula collapses onto Nei’s Gst.
I understand that geneticists are interested in getting to the underlying processes causing differentiation. However, if geneticists are trying to understand the causes of differentiation, they need a measure of the actual differentiation of the present-day population. Fst does not measure this. Remember that this discussion started with your saying I was wrong to want to ditch Fst as a measure of differentiation. I do not say Fst is useless. It is a pretty good way to find migration numbers. But geneticists have gotten so accustomed to (wrongly) equating Fst with differentiation that they often think migration number is the thing that determines differentiation. If Fst does not measure differentiation, then this belief will be wrong. Indeed Fst does not measure differentiation (and maybe you are starting to admit this by now talking about “measuring underlying processes” as opposed to properties of the actual present-day population). This means conclusions drawn from Fst about the underlying causes of differentiation are incorrect.
When the math is correctly done, it turns out that differentiation is controlled not by Nm (number of migrants per generation) but by the ratio n*u/m (n is number of demes, u is mutation rate). Geneticists appear to misunderstand the underlying causes of differentiation (one of the basic processes of evolution) because they identify differentiation with Fst.
Anyway, I hope you will tell me what you mean by Fst. Then either I will show you that it does not measure differentiation, or you will show me that it does. So PLEASE, just give us your definition of Fst, an actual formula. The tests I mentioned for a measure of differentiation are unambiguous, so we’ll know if it measures differentiation as soon as you write down the formula. But until you write down a formula, we are just blowing hot air….
Bob O'Hara says:

August 20, 2010 at 3:09 pm

“Nei is defining a measure of population structure basically from scratch, doing so in a way that naturally generalizes to multiple alleles and pops.”
Right, so he never connects it to a definition of Fst. So to say Fst is rubbish because Gst performs badly is to miss the point (and Nei himself even recommends against using Gst for large differentiation).

“It appears to be gene identity.”

What precisely do you mean? Not identity itself – these aren’t indicator functions. Their expectations?

“However, if geneticists are trying to understand the causes of differentiation, they need a measure of the actual differentiation of the present-day population.”

Well, yes. And Fst, the estimand, summarises this. Whether you measure it in terms of Pr(IBD) or coalescence times, it’ll work as that.

“I do not say Fst is useless. It is a pretty good way to find migration numbers.”

Rubbish. Have you missed the last 10 years of developments in population genetics?
Lou Jost says:

August 20, 2010 at 6:48 pm

Bob, Gst and the definitions of Fst that I am familiar with give approximately the same values except in extreme cases, and are measuring the same basic thing, as many authors have pointed out. My criticisms of Gst apply equally to all definitions of Fst that I am familiar with. Nagylaki’s 1998 proof that Fst does not measure differentiation when Hs is large examined Cockerham’s, Wright’s, and Nei’s definitions. These are all very closely related measures.
I quoted Nei’s comment about the limitations of his measure in my article (p 4016). The limitation he mentions is not related to the foundational issues described in my article, which are most serious when Hs is high (Nei’s comment applied when Hs was low compared to Ht).
Gst is not rubbish. It is a good way to measure migration numbers under the finite island model at equilibrium when m is much greater than mu. I apologize for not adding those qualifiers in my last post—I had mentioned them earlier and figured it was understood. Yes, in practice, there are many hurdles, but the connection between migration number and Gst or Fst is theoretically correct under the conditions I just mentioned (unlike the connection between Gst or Fst and differentiation.)
Gene identity is Nei’s term, explained on page 3321 in the article you reread. This is a common term in theoretical pop gen, but I guess I should have defined it. It is the probability that two genes chosen at random are identical. It is 1-H. It has nicer mathematical properties than H. In particular, it can be multiplicatively partitioned into independent within- and between-group components.
You once again repeat that Fst will work as a measure of differentiation, but you once again avoid giving a formula so we can see if it really does. PLEASE make this discussion concrete and give us a formula in terms of population allele frequencies so we can get to the bottom of this. The formula for Gst does not work. The formula for Fst in the Slatkin article you referred to does not work. I think the ball is in your court. If you claim that there is a formulation of Fst that actually measures differentiation between demes of a subdivided population, and that my assertions are wrong, then please show us how to calculate it from data. If it passes my tests I will crawl away with tail between my legs. I’ll be surprised, though, if you can come up with such a formula. Neither Slatkin nor Crow (who both reviewed my manuscript before submission) nor any subsequent reviewer has suggested such a definition in answer to my arguments.
Paper reference: Jost, L. (2008). Gst and its relatives do not measure differentiation. Molecular Ecology 17: 4015-4026.
Bob O'Hara says:

August 20, 2010 at 8:42 pm

“My criticisms of Gst apply equally to all definitions of Fst that I am familiar with.”
Even the ones that have nothing to do with Hs? Like the definitions in terms of coalescence time?

“The formula for Fst in the Slatkin article you referred to does not work. I think the ball is in your court.”

No. You punted. I suggested coalescence times, and you shifted the goalposts back to Gst. Use coalescence times, please.

“Gst is not rubbish. It is a good way to measure migration numbers under the finite island model at equilibrium when m is much greater than mu.”

Hm. And how often is that applicable?
Lou Jost says:

August 21, 2010 at 12:33 am

Bob, I am not shifting the goalposts, I have been begging for you to plant some goalposts in the ground, somewhere, anywhere. I have asked you for a formula or clear reference to one every day for a week now. You still haven’t come through. Please just give us the formula you want to use, or if you won’t do that, at least give a specific reference to a formula (written in terms of allele frequencies) in some article, so we can test it. Then we can settle this.
Like you, I am sure the finite island model is never exactly true in the real world. Yet it still provides deep theoretical insights into complex processes, in the same way that understanding an ideal gas helps us understand real gasses. It is an essential part of science to simplify and abstract when faced with complexity. Of course this requires care, so we don’t throw out the baby with the bathwater, but the finite island model and its variants have been extremely useful in helping us understand the behavior of subdivided populations. Furthermore, simulations and analysis (eg Rousset’s) show that some departures from the model’s assumptions (for example, changing panmictic migration to stepping-stone migration) do not greatly alter the formulas derived from the model. Rousset arrived at that conclusion using Fst, but a similar conclusion emerges from simulations using the correct diversity and differentiation measures. See Economo and Keitt (2008), Species diversity in neutral metacommunities: a network approach, Ecology Letters 11: 52-62 for an example couched in the language of ecology. They empirically discover many of the analytical results of my 2008 paper, concluding for example that differentiation is controlled by m/u (not Nm) and that this result is robust to network topology (stepping stone vs panmixia, for example). I am not so much interested in the current fad of using these models to calculate migration rate- that seems quite a stretch. I use them in the other direction: given some migration and mutation rates, how does the system evolve? What factors control its differentiation? To answer these kinds of questions, a model that can be solved analytically is very useful.
But before we get off on any more tangents, just give me your favored formula for Fst in terms of allele frequencies, so we can come to a definite conclusion on the main point.
Lou Jost says:

August 21, 2010 at 4:58 am

I’ve looked again at Slatkin’s treatment of Fst and coalescence times. He gives Wright’s definition of Fst, which I gave in a previous post (in terms of the gene identities f that I explained to you a few posts ago). This definition is identical to Gst and does not measure differentiation. Then Slatkin says “It would be desirable to predict the value of Fst in a way that did not confound the purely demographic processes of genetic drift and migration with purely genetic processes such as mutation. We can do this by expressing Fst in terms of coalescence times.” So it seems to me that his method of coalescence times is just an estimator of Wright’s definition of Fst, which seems to be the estimand according to Slatkin. Since the estimand itself does not measure differentiation, the estimator, if it is a good one, will also not estimate differentiation.
But let’s be sure. Give me your favorite definition.
Bob O'Hara says:

August 21, 2010 at 9:52 am

“But before we get off on any more tangents, just give me your favored formula for Fst in terms of allele frequencies,”
No, allele frequencies are a distraction – doing it that way is wrong.
Use Slatkin’s coalescence definition in terms of coalescence times. Don’t shift the goalposts onto allele frequencies, because that’s doing it wrong: you can’t distinguish between IBD and IIS.
Lou Jost says:

August 21, 2010 at 2:04 pm

Man, talk about moving goalposts. Every day since the beginning, I had been asking you for a formula for Fst in terms of allele frequencies, as in classical pop gen. I guess you then concede my point that the Fst of classical population genetics, based on allele frequencies, does not measure differentiation. (This was the point of my 2008 paper.)
Ok, so now let’s move to sequence-based estimators of Fst. Slatkin says clearly that his sequence-based formula in terms of coalescence times is an estimator of the classic allele-based Fst, which does not measure differentiation. So how does using coalescence times help anything?
But fine, let’s examine the coalescence-based definition directly. I suppose I can apply the tests for differentiation I gave earlier? What should “complete differentiation” mean in terms of sequence data? To me this means no alleles can be shared among demes. Can I use that? If so, we can see if Fst in terms of coalescence times solves the problem.
However, it is clear from reading Slatkin’s article that his measure in terms of coalescence times does not address differentiation. His point in developing the coalescence formalism was specifically to focus on the estimation of migration, not the estimation of differentiation. I do not think he ever mentions the word “differentiation” in the article. He correctly contrasts Fst-like measures (including his definition based on coalescence times) with measures of genetic distance like those of Nei (which depend on mutation rate in an essential way). Differentiation is related to this second class of measures, the measures of genetic distance between demes, and not to Fst-like measures, whose purpose is to separate out the effects of migration and drift. As I showed in my reply to Ryman and Leimar (Mol Ecol. 2009, 18: 2088-2091), real measures of genetic differentiation between demes, like my D, are directly related to this second class of measures. In fact Nei’s genetic distance can be transformed into my D. The article I mentioned earlier by Gregorius (2010) in the journal Diversity goes into more detail about Fst-like measures versus measures of differentiation between demes. These two classes of measures are complementary, measuring sort of “opposite” or “dual” qualities of population subdivision.
Lou Jost says:

August 21, 2010 at 4:43 pm

Bob, look at how Slatkin derives his formula for Fst in terms of coalescence times. He STARTS with Wright’s definition of Fst (which has nothing to do with differentiation), and then he makes the approximation that mutation rate is small, taking the limit of Wright’s measure as mu approaches zero. By doing so he does not suddenly and magically turn Fst into a measure of differentiation. That is not his goal. By taking this limit he is able to directly connect his approximation of Fst to migration rate. But this is in no sense an estimate of differentiation, which depends essentially on mu. (Also, note that Wright’s Fst is the estimand and Slatkin’s coalescence time formula is the estimator.)
The dependence of differentiation on mu is easily seen by thinking about what happens when mutation is high and migration is low. If migration is near zero, and mutation rate is very high, each deme will develop its own set of private alleles or private sequences, with very little sharing. The equilibrium condition will be high sequence differentiation between demes. Any measure whose expectation value at equilibrium is independent of mu cannot be estimating mean sequence differentiation between demes. That is my original point from a week ago, translated into the language of coalescence times and sequence divergence.
I can follow Slatkin’s lead and use his method to derive an approximation for true differentiation D in terms of coalescence times:
Differentiation = [n/(n-1)][(t-t0)/(1/mu – t0)]
where t is the mean coalescence time of two genes chosen at random from the population as a whole, t0 is the mean coalescence time within demes, and n is the number of demes.
That is potentially useful: we can now use the coalescence formalism to make deductions about real population structure! Note the essential difference between Fst and D, which is very clear when both are written in terms of coalescence times: D (and any other measure of differentiation or genetic distance between demes) depends directly on mu.
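A minimal sketch of that contrast, evaluating both expressions for the same pair of coalescence times while mu varies. It assumes the low-mutation form Fst ≈ (t - t0)/t for Slatkin's measure and takes the D expression exactly as written above; the function names and the coalescence times are hypothetical, chosen only for illustration.

```python
def fst_coalescent(t_total, t_within):
    """Low-mutation approximation Fst ~ (t - t0) / t. Note that mu does not appear."""
    return (t_total - t_within) / t_total


def d_coalescent(t_total, t_within, mu, n_demes):
    """D ~ [n/(n-1)] * (t - t0) / (1/mu - t0), as written in the comment above."""
    if mu <= 0 or mu * t_within >= 1:
        raise ValueError("approximation needs mu > 0 and mu * t0 < 1")
    return (n_demes / (n_demes - 1)) * (t_total - t_within) / (1.0 / mu - t_within)


# Hypothetical mean coalescence times (in generations) for a 10-deme population.
t, t0, n = 3000.0, 1000.0, 10

for mu in (1e-6, 1e-5, 1e-4):
    print(f"mu={mu:g}  Fst={fst_coalescent(t, t0):.3f}  D={d_coalescent(t, t0, mu, n):.4f}")
```

With the coalescence times held fixed, the Fst column stays constant while the D column grows with mu.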
Mike Fowler says:

August 21, 2010 at 9:00 pm

Fun discussion here. A bit too late for me to go into all the detail above, I’d just like to point out in response to one of Bob’s comments that anything divided by zero is undefined – in the context being discussed here.
So, it’s neither zero, nor one. Division by zero is not a valid operation.
I realise there was some semantics involved when it was brought up (‘almost’), but still wanted to point this out.
Bob O'Hara says:

August 22, 2010 at 10:20 am

bq. Man, talk about moving goalposts. Every day since the beginning, I had been asking you for a formula for Fst in terms of allele frequencies, as in classical pop gen. I guess you then concede my point that the Fst of classical population genetics, based on allele frequencies, does not measure differentiation.
I thought classical population genetics had given up on using allele frequencies since Malécot – it conflates IBD and IIS.

What should “complete differentiation” mean in terms of sequence data? To me this means no alleles can be shared among demes. Can I use that?

Well, in what situation would no alleles be shared? And what would that mean for coalescence times? And hence what would a coalescence-based Fst look like?
Unfortunately I threw away my copy of Slatkin’s paper when I moved from Helsinki, and I don’t have access to the pdf. 🙁 (And I can’t find my copy of Rousset’s book, which is really annoying.)
Lou Jost says:

August 22, 2010 at 3:39 pm

Bob, I wish I had known that you needed some of these papers. This debate would have been much clearer if you had been able to read or re-read some of the papers we have been discussing. If you send me an email address I can send you any of these papers. My email is just my name (all run together, no punctuation) at yahoo.com
No, pop gen has not given up on using allele frequencies (see any current pop gen software such as Arlequin, where you have the option to use allele frequencies or sequence data as the unit of analysis). Coalescence times based on sequence data also have the problems I mentioned.
Let’s back up and ask why we care about this, and whether identity by descent or identity by state is the relevant criterion.
Remember your Aug 20 post? You quoted me:
“However, if geneticists are trying to understand the causes of differentiation, they need a measure of the actual differentiation of the present-day population.”
and you added this:
“Well, yes. And Fst, the estimand, summarises this. Whether you measure it in terms of Pr(IBD) or coalescence times, it’ll work as that.”
When we wonder if two demes are differentiated, all that matters in the real world are the present-day frequencies of the alleles or sequenced genes in the two demes. Identity by state is what matters here. Two genes that are currently identical in sequence have exactly the same effect, whatever their past history.
The broad program of classical pop gen since the time of Wright has been to understand the causes and consequences of population structure. The procedure was to write down a measure of population structure, and then try to connect the values of that measure to more basic parameters like migration rate, mutation rate, etc, by developing simple analytical models like the finite island model. This program worked–geneticists like Malecot solved the simple models analytically and discovered analytical formulas connecting the measure of population structure to the fundamental parameters of genetics. Actually the formulas gave within-deme and total heterozygosity at equilibrium, and these could be used to calculate the measure of population structure. The role of migration and mutation, and deme size and deme number were made clear. Geneticists thought that they understood the causes of population structure.
However, they had started off on the wrong foot. Their measure of population structure was Fst or similar measures. This was an adequate measure of the genetic differentiation between demes early in the history of pop gen, when technology limited us to analyzing simple bi-allelic systems. It did not measure differentiation when diversity was high or when there were many demes, but in the early days these situations did not arise.
As technology advanced, the measure of population structure did not change much. The early measure of two-deme, bi-allelic population structure became frozen into the toolbox of geneticists, with minor modifications, partly because it retained a simple connection to migration numbers even when there were many alleles and many demes. This relatively simple connection to migration numbers gradually seems to have led to the belief that population structure and migration numbers were essentially two sides of the same coin. Geneticists began to talk about finding population structure as if this meant finding migration numbers and effective population size. Many geneticists assumed that Fst still measured actual differentiation even when there were many alleles or demes, and today if you open any volume of Molecular Ecology you will see many papers supposedly measuring differentiation by calculating Fst. However, as I and others have shown, the fundamental measure Fst and its estimators and relatives do not measure differentiation when there are many alleles or demes.
This ninety-year side-track has lost the path to understanding the causes and consequences of genetic DIFFERENTIATION between demes. This actual differentiation is the thing that drives speciation, and the thing that is relevant in conservation genetics and a host of other applications. It is this actual differentiation that needs explanation in terms of the fundamental parameters of genetics. Once we can write an equation connecting real differentiation to the basic parameters of genetics, we can begin to understand the real factors controlling the genetic divergence of demes.
So it is critical to have a good measure of differentiation. That is why I derived my D from first principles based on the mathematics of diversity.
Like Fst, my D can be written in terms of heterozygosities, so all of the old work connecting heterozygosities to the fundamental parameters of genetics can still be used to write D in terms of the fundamental parameters. So now, we can truly understand the causes of population structure. The factors controlling real differentiation are very different from the factors controlling Fst. Our understanding of the genetic basis of the evolution of differentiation has been wrong (and badly wrong) for many decades, because of historical quirks and lack of cross-fertilization of genetics from other fields. (The correct partitioning of heterozygosity into independent within- and between-group components was known in information theory by the 70’s and in physics by the late 80’s, but did not enter population genetics until my 2008 paper.)
Lou Jost says:

August 23, 2010 at 2:51 am

By the way, ecologists made exactly the same mistakes as geneticists regarding measures of diversity and differentiation, partly because they borrowed some approaches from genetics. These ecological mistakes have been easier to correct than the geneticists’ mistakes, because the systems ecologists work with are immediately observable. All ecologists have to do is open their eyes and they can quickly see that two forests are strongly differentiated in composition. Measures which say the forests are nearly identical in composition can be quickly discarded. In contrast, in genetics it seems people can more easily rationalize contradictory or nonsensical results.
John A. Davison says:

August 26, 2010 at 9:42 pm

In my opinion, there is nothing in population genetics, in Mendelian genetics, in obligatory sexual reproduction, in statistics or in any other aspect of Neo-Darwinian theory that ever had anything whatsoever to do with speciation or the production of any of the higher taxonomic categories. All that natural or artificial selection can accomplish is the production of intraspecific varieties including subspecies, none of which are incipient species. Speciation involved (past tense) a different mechanism which I have tentatively identified with the Semi-Meiotic Hypothesis (SMH) which I first published in 1984 in the Journal Of Theoretical Biology 111: 725-735. That paper and several others are available on my weblog.
Furthermore, I believe that creative evolution is no longer in progress, the present biota representing the climax of a planned evolutionary sequence. All I anticipate is the extinction of the present biota.
In short –
“A past evolution is undeniable, a present evolution undemonstrable.”
I am prepared to defend my thesis here, on my weblog or anywhere else where the issues can be discussed in civil fashion.
jadavison.wordpress.com
Lou Jost says:

August 27, 2010 at 12:16 am

Wow, that sure is off-topic….
John A. Davison says:

August 27, 2010 at 9:46 am

Lou Jost
I just presented a challenge to the whole Darwinian model and all you can say is “Wow, that sure is off topic.”
There is one central question about organic evolution – the mechanism by which it took place. Natural selection had nothing to do with that process. As a matter of fact natural selection is now what it has always been, anti-evolutionary, serving only to maintain the species unchanged for as long as possible. That is the consensus of several of my predecessors including myself. It is no longer acceptable for us to be ignored, dismissed as quacks or ridiculed.
I have repeatedly asked for verified examples of speciation in action and have been ignored. The key word here is verified. Evolutionary science, like any other biological science, is subject to the rigors of laboratory verification.
The Darwinians continue to assume that the mechanism of evolution is already known. That has always been their posture from 1859 to this very day. That assumption is wrong. During that century and a half interval we many critics have come and gone while the Darwinista and the Fundamentalista have devoted all their attention to one another, oblivious to the possibility that they are both wrong. There is nothing to debate or even to discuss in the scientific process. Hypotheses are tested and discarded when not verified. Darwinism, like Lamarckism, has failed to qualify as a valid explanatory thesis and should have disappeared from the scientific literature in Darwin’s own day. It has persisted for one reason only. There is a large fraction of the human population which is congenitally unable to accept the notion of a guided, purposeful universe. That mindset has made it impossible for those so afflicted to be objective as they observe the world in which they find themselves. To that extent they cannot be regarded as belonging to the community of those who seek the Truth free of ideology and prejudice, the indispensable feature of the true scientist.
I hope this preliminary statement serves to establish my position with respect to the great mystery of phylogenesis, a process I believe to be no longer in progress.
“A past evolution is undeniable, a present evolution undemonstrable.”
jadavison.wordpress.com
Bob O'Hara says:

August 27, 2010 at 10:35 am

bq. I just presented a challenge to the whole Darwinian model and all you can say is “Wow, that sure is off topic.”
That’s because it is.
(Lou – sorry for not replying: I intend to soon, but I’m travelling to the UK, so it might be delayed).
Lou Jost says:

August 27, 2010 at 1:38 pm

Thanks for letting me know. If you would like some reading material for the train or plane, I’d be eager to send you Slatkin’s article.
John A. Davison says:

August 27, 2010 at 5:10 pm

Let the record show that after I have denied any role for statistics, Mendelian Genetics, Population Genetics and natural selection in the evolution process, Bob OHara, whose blog this is, dismisses my comments as being “off topic.” At least he hasn’t deleted the last one to appear yet. I predict he will. That is the standard Darwinian response to being criticized these days. We critics have never been allowed to exist by the Darwinista. Sooner or later they will be forced to confront us. It will be a rout!
“If you tell the truth, you can be certain, sooner or later, to be found out.”
Oscar Wilde
jadavison.wordpress.com
Lou Jost says:

August 27, 2010 at 9:02 pm

Do you have something to say about the mathematics of measuring differentiation?
John A. Davison says:

August 27, 2010 at 10:31 pm

Lou Jost
I doubt that the mathematics of differentiation would shed any light on the mechanism by which creative evolution took place in the past. I used to teach quantitative biology at the University of Vermont and I never saw much application of math or statistics for elucidating the question of the mechanism of an organic evolution which is no longer in operation. I concur with Leo Berg that there is no role for chance in either ontogeny or phylogeny. Accordingly, I see no role for statistics there either. Both phenomena were determined in advance by one or more programmers who are apparently no longer with us. Just as all the information necessary to produce a complete individual is already present in the unfertilized egg, so I believe that was true of the early flora and fauna that were also following information already present. In short I see no role for the environment in either ontogeny or phylogeny beyond that of acting to permit the expression of what I have termed “prescribed” information. Otto Schindewolf had much the same view –
“This leads to the conclusion that the main features of the evolutionary trend were laid out right from the start with the abrupt, discontinuous production of the type, and with evolutionary potential being restricted right from the start to certain paths… Selection is only a negative principle, an eliminator, and as such it is trivial:…”
Basic Questions in Paleontology, page 360.
jadavison.wordpress.com
Lou Jost says:

August 27, 2010 at 11:53 pm

I guess that means “No” to my question.
John A. Davison says:

August 28, 2010 at 10:15 am

I want to thank Bob OHara for allowing me to present my views here, something I have been denied at Pharyngula, EvC, ARN, Uncommon Descent, Panda’s Thumb, After The Bar Closes, richarddawkins.net and several lesser blogs. I have nothing more to offer here. I trust my messages will not be deleted.
Leaving a record is all that really matters to this old investigator of the closely related mysteries of ontogeny and phylogeny, phenomena concerning which we have barely scratched the surface of our understanding.
“Neither in the one nor in the other is there room for chance.”
Leo Berg, Nomogenesis, page 134
“A past evolution is undeniable, a present evolution undemonstrable.”
jadavison.wordpress.com
Bob Verity says:

September 11, 2010 at 8:37 pm

Hello all,
I hope that this thread is still active and that the contributors so far are willing to continue the discussion, which is in my opinion a fantastic and necessary dialogue – thanks!
Clearly there has been a lot of material discussed above and I don’t want to get straight into the nitty-gritty. I think the best way for me to contribute would be to state what I think the key points are and where I stand on them.
There has been a lot of talk about estimators and estimands. If I have understood correctly this is the difference between Fst the statistic and Fst the parameter. Fst the parameter is a feature of a stochastic model of evolution. It is the probability of identity by descent and is related to the population size (N), migration rate and mutation rates (collectively called m) by the expression Fst = 1/(1+theta), where theta=4Nm. Suppose that we have a sample from a single subpopulation containing all the same allele. The allelic composition of this subpopulation is one realisation of the stochastic process governed by Fst the parameter. The most likely value of Fst in this case is 1; once we have seen a single A allele we are certain to see another. However, it is possible that Fst is smaller than 1 and we just got lucky. I should point out that when I say "got lucky" I do not mean that we got lucky in our sampling; as pointed out by Lou, the question of whether Fst measures diversity lies in the foundations of the problem and not in the sampling argument which has been discussed elsewhere (for the sake of argument we can assume that our sample covers the whole subpopulation). What I mean is that the process of evolution got lucky and produced another A allele that is not identical by descent but only identical in state. In my opinion the most representative description of Fst the parameter is the full posterior distribution (from a Bayesian perspective) which would give the maximum likelihood value of Fst=1 in this case, but also entertains the possibility of smaller values of Fst.
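The expression Fst = 1/(1+theta) is easy to evaluate in both directions. The sketch below treats m as a single combined rate, as in the paragraph above; the function names are hypothetical and the numbers are only illustrative.

```python
def fst_from_nm(n_eff, m):
    """Fst = 1 / (1 + theta), with theta = 4 * N * m."""
    theta = 4.0 * n_eff * m
    return 1.0 / (1.0 + theta)


def nm_from_fst(fst):
    """Invert the same expression: N * m = (1 / Fst - 1) / 4."""
    if not 0.0 < fst <= 1.0:
        raise ValueError("Fst must lie in (0, 1]")
    return (1.0 / fst - 1.0) / 4.0


# Illustrative values only: N = 1000 and m = 0.001 give N * m = 1.
print(f"Fst = {fst_from_nm(1000, 0.001):.3f}")
print(f"N*m = {nm_from_fst(0.2):.3f}")
```

Under this model, N*m = 1 corresponds to Fst = 0.2, and Fst approaches 1 as N*m approaches zero.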
On the other hand, Fst the statistic is something measured directly from the data – usually from gene diversity (heterozygosity). As I understand it, Fst the statistic is a frequentist measure which corresponds closely to the posterior maximum-likelihood value of the parameter Fst. I think I am agreeing with Bob O’Hara here (sorry, now that there is another Bob here I’m going to have to relegate you to surnames!) that Fst the statistic tells us something about a process of evolution.
Secondly to the question of whether Fst measures diversity, which I believe was the original question. Consider the following model. There are two subpopulations of equal size and between which there is no migration. Mutations occur at a rate m, with a proportion Ra leading to A alleles and a proportion Rb leading to B alleles (there are only two possible alleles so Ra+Rb=1). The same value of the parameter Fst applies to each of the two subpopulations. Furthermore, Fst the parameter is equal to 1 meaning once we have seen a single allele in either subpopulation we are certain that all other alleles in that subpopulation must be the same. However, the first allele in either subpopulation is not influenced by Fst (there is not yet anything for it to be IBD to) meaning they occur with proportions Ra and Rb. So, if the first allele in both subpopulations is an A then all alleles in both populations will be A. However, if the first allele in one subpopulation is A and the first allele in the other is B then the subpopulations will be completely different, which I believe Lou has called "maximally differentiated". Accordingly, the value of the statistic Fst on these populations, which corresponds closely to the maximum-likelihood estimate of the parameter, would be 1 in both cases. I have to be careful here as I think the value of the statistic Fst breaks in the case of identical subpopulations (due to a divide-by-zero mentioned previously?), but this is the effect that Lou is getting at in his model in which "almost all demes are fixed for the same allele".
Thus, I agree with Lou in that Fst the statistic is not a measure of how different subpopulations are. It is better interpreted as a measure of how close subpopulations are to fixation – in both of the above cases they are completely fixed hence Fst=1. However, the question of whether Fst or Jost’s D is a more useful statistic is another matter. Jost’s D is essentially a measure of diversity in a momentary snapshot (called the present) of the process of evolution. Diversity in this sense means diversity that is present now, and arguments apropos conservation that utilise this statistic hinge on the idea of preserving the diversity that currently exists. However, I do not know that it makes sense to talk about diversity now without considering the process that leads to diversity in the future. To take Lou’s example from the Molecular Ecology paper in which there are 20 subpopulations with 100 equally common alleles. Fst in this case would be very small because each subpopulation is a long way from fixation, thus the mutation rate required to produce such a situation would be huge. If we killed off 90% of the diversity then it would just pop back up again due to the large mutation rate. This knowledge comes from measuring a process rather than a state.
Well, that’s my two cents, let me know what you think. If you want to move this debate somewhere else then let me know, otherwise I’ll check back here.
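The two fixed outcomes of this model can be checked directly. The sketch below uses the heterozygosity-based statistic Gst = (Ht - Hs)/Ht as a stand-in for "Fst the statistic", and Jost's D = [n/(n-1)](Ht - Hs)/(1 - Hs) for comparison; equal deme weights are assumed and the function names are hypothetical.

```python
def heterozygosity(freqs):
    """Expected heterozygosity (gene diversity): 1 - sum of squared frequencies."""
    return 1.0 - sum(p * p for p in freqs)


def gst_and_d(demes):
    """demes: one allele-frequency list per deme, demes weighted equally."""
    n = len(demes)
    hs = sum(heterozygosity(d) for d in demes) / n   # mean within-deme value
    pooled = [sum(col) / n for col in zip(*demes)]   # pooled allele frequencies
    ht = heterozygosity(pooled)
    # Gst is 0/0 when every deme is fixed for the same allele, so report NaN there.
    gst = (ht - hs) / ht if ht > 0 else float("nan")
    d = (n / (n - 1)) * (ht - hs) / (1.0 - hs)
    return gst, d


# Frequencies are listed as [A, B].
print("A vs B:", gst_and_d([[1.0, 0.0], [0.0, 1.0]]))  # fixed for different alleles
print("A vs A:", gst_and_d([[1.0, 0.0], [1.0, 0.0]]))  # fixed for the same allele
```

When the demes are fixed for different alleles both statistics equal 1; when they are fixed for the same allele, Gst is undefined (the divide-by-zero case) while D is 0.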
Bob Verity
Lou Jost says:

September 12, 2010 at 3:19 am

Thanks Bob V for the input. I am still periodically checking in on this thread in case Bob O returns….
We have gotten pretty far afield in previous posts. In my opinion the whole estimator-estimand thing made for a nice instructional post but was essentially a distraction from the real point, the issue of whether Fst measures differentiation (not diversity, but it seems clear you meant differentiation above when you said diversity).
There have been some advances since my 2008 Mol Ecol paper that have clarified the difference between Fst and D. The most important is Gregorius’ 2010 paper, "Linking diversity and differentiation" (in the online journal Diversity). He observes that any given individual in a subdivided population can be classified by two labels: its deme and its allele at the locus of interest. Then it makes sense to talk about two different aspects of the way those two characters (deme label and allele label) are associated among individuals.
First, we can ask "Do alleles show fidelity to particular demes?" If yes, any given allele tends to be private to a particular deme, and between-deme differentiation in allele composition will be high. This is what D measures, the degree to which demes differ from each other in allele composition.
Alternatively, we can ask the "dual" of the above question: "Do demes show fidelity to particular alleles?" If yes, each deme tends to be associated with only one allele (whether it is the same allele across all demes or different alleles in each deme is immaterial to this question), and this kind of differentiation will be high. This is fixation, exactly what Fst and Gst measure.
These two ways of looking at deme-allele association were already suggested by Slatkin 1991, when he noticed that there were two kinds of measures of population structure, one kind represented by Fst and its variants, and the other kind represented (at that time) by genetic distance measures. D belongs to this latter class.
When there are only two demes and only two alleles, these two concepts of differentiation become identical. The more demes and alleles, the more different these two numbers can be. When Wright and other pioneer population geneticists began their work, sequencing wasn’t even dreamt of, and workers usually analyzed traits that had very low diversity. The bi-allelic case for a pair of demes got lots of theoretical attention, and in that case Fst is BOTH a fixation index and a measure of compositional differentiation among demes. As time went on and it became possible to distinguish very many alleles at a locus, some kind of intellectual inertia kept people from noticing that the second interpretation no longer applied. Even now, when anyone can easily see from examples that Gst does not measure compositional differentiation among demes, there are many who still make this interpretation.
There is no reason to debate this question, as it is a simple mathematical issue. It had even been addressed back in 1998 by Nagylaki, in a published proof that Gst did not measure differentiation when diversity is high.
The real debate should now be about WHICH kind of differentiation should be used for a particular question. Some people have misinterpreted my work as saying that Fst or Gst should never be used. This is not a reasonable position and I did not say that. In fact I used Gst in my 2008 paper in conjunction with D to derive some results. Gst (especially in its coalescence form) applied to neutral markers is a useful measure of migration if we make some basic assumptions about the underlying model. Knowing the amount of migration helps understand evolutionary patterns, as Bob V mentioned above. There is nothing wrong with this use, as long as the models apply. We can argue that the models are often unrealistic, but that is a different debate. We might also be able to argue that for non-neutral loci or for systems that deviate strongly from the finite island model, D may provide a more robust idea of the amount of migration than Gst, but I have not developed this idea yet.
When should we use D? Whenever we want to know how different the demes are in actual allele composition. Bob V mentioned conservation. If we investigated coding loci that we thought might be important for long-term survival of the species, D would be the appropriate tool to use when prioritizing demes for conservation. Gst and Fst could give wildly wrong answers, even if the locus was neutral, because of the way Gst’s maximum possible value is constrained by within-group heterozygosity.
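A small sketch of that constraint, assuming n equally weighted demes. Total heterozygosity is largest when no allele is shared between demes, in which case the pooled identity is the mean within-deme identity divided by n, so Ht cannot exceed 1 - (1 - Hs)/n. The function name is hypothetical.

```python
def gst_max(hs, n_demes):
    """Largest Gst reachable for a given mean within-deme heterozygosity hs,
    assuming n equally weighted demes that share no alleles at all."""
    ht_max = 1.0 - (1.0 - hs) / n_demes
    return (ht_max - hs) / ht_max


# Even completely non-overlapping demes cannot push Gst past this ceiling.
for hs in (0.0, 0.5, 0.9, 0.99):
    print(f"Hs={hs:.2f}  max Gst (2 demes)={gst_max(hs, 2):.4f}  "
          f"max Gst (20 demes)={gst_max(hs, 20):.4f}")
```

The ceiling falls toward zero as Hs approaches 1, which is why a low Gst at a highly variable locus says little about whether the demes share alleles.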
To me the most inappropriate use of Fst is in theoretical analyses of the demographic and genetic factors that cause differentiation in allele composition among demes. We have the "one migrant per generation" rule telling us that differentiation between demes will not arise if more than one migrant per generation enters the population. This is completely false, as the rule is derived from the behavior of Fst, which we now know does not measure differentiation of allele frequencies across demes. Here D is the appropriate measure, and paints a very different picture of the factors controlling differentiation in subdivided populations. This is one of the most basic evolutionary processes, one that has led to much of the earth’s present diversity, and it has been misunderstood because of this simple mathematical confusion about how to measure differentiation.
Lou Jost says:

September 12, 2010 at 1:28 pm

There is another common myth in pop gen and many other life sciences, related to the myth that Gst measures the compositional differentiation of alleles among demes. This is the myth that the ratio of within-group diversity to total diversity is a measure of compositional similarity, when "diversity" is measured by heterozygosity or entropy. I often see phrases like "the within-group gene diversity is close to the total gene diversity, so the groups are not differentiated". By looking at examples with high gene diversity, anyone can easily disprove this myth too. If people want to correctly judge compositional similarity on the basis of such a ratio, they must convert heterozygosity to a true diversity measure, the effective number of alleles, before taking the ratio.
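A minimal numerical illustration of that conversion, using a made-up example of two equally weighted demes that each carry 10 equally common alleles and share none (so Hs = 0.90 and Ht = 0.95). The function names are hypothetical.

```python
def effective_number(h):
    """Convert heterozygosity H to the effective number of alleles, 1 / (1 - H)."""
    return 1.0 / (1.0 - h)


def within_to_total_ratios(hs, ht):
    """Return (raw heterozygosity ratio, ratio of effective numbers of alleles)."""
    return hs / ht, effective_number(hs) / effective_number(ht)


# Two demes, 10 equally common alleles each, no alleles shared.
raw, true_div = within_to_total_ratios(0.90, 0.95)
print(f"Hs/Ht            = {raw:.3f}")
print(f"effective ratio  = {true_div:.3f}")
```

The raw ratio is close to 1, which would suggest nearly identical demes, while the ratio of effective numbers is about one half, correctly reflecting that each deme holds only half of the pooled alleles.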
Bob Verity says:

September 13, 2010 at 5:00 pm

Thanks for the speedy reply Lou.
Hmmm, it is an interesting idea to think about differentiation and diversity in this way, with differentiation defined as "allele fidelity to demes" and diversity defined as "deme fidelity to alleles". However, my current view (for which I have a draft manuscript with Richard Nichols) is to emphasize that allele distributions are most helpfully interpreted by considering the process that could have produced them, rather than with descriptive statistics of their current state.
Perhaps I can characterize the difference in this world-view by pointing out that in some models of the underlying biology, the whole process (generating both differentiation and diversity) can be estimated using Fst (eg. equilibrium infinite island models with minimal mutation). However, for other models (eg. those for which mutation is of the same order as migration rate) a two-parameter model is required and, in this case, the logic needs to be extended. Hence, I do not think that Fst is a measure of just diversity, but is a compound measure of both diversity and differentiation.
As a first example, consider the hypothetical situation underlying Fig.2 in the Jost 2008 Molecular Ecology paper – which starts with two identical subpopulations with four equally common alleles in each (some diversity but no differentiation). What kind of underlying model could have led to this outcome? One simple explanation is that there is massive migration into both subpopulations (eg. from a continent, or a "cloud" of migrants from a large number of islands) causing the allele frequencies to be completely dictated by the cloud, and hence the same in each subpopulation. Thus, migration rate is huge and Fst pops out at around zero, as shown in the figure. As we add unique alleles to each subpopulation our estimate of the migration rate goes down – it cannot be that huge or we would not have seen any differentiation. However, our estimate of the mutation rate also goes up – there must be some mutation to explain the unique alleles. Fst (the statistic) is a single compound measure that takes into account both mutation and migration rates, thus Fst does not change linearly but as a function of two parameters. At some number of new unique alleles Fst reaches a tipping point and begins to decline. This is because a high mutation rate must now be invoked to explain the diversity of the data, dragging the statistic back down to zero.
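The rise and fall described here can be reproduced with a toy construction in the spirit of that figure. This is not the figure's exact setup: it assumes two equally weighted demes with four shared alleles, a growing number of private alleles per deme, and all alleles equally common within a deme. Gst = (Ht - Hs)/Ht stands in for the statistic, with Jost's D shown for comparison; the function names are hypothetical.

```python
def heterozygosity(freqs):
    """Expected heterozygosity: 1 - sum of squared allele frequencies."""
    return 1.0 - sum(p * p for p in freqs)


def two_deme_stats(n_shared, n_unique):
    """Two demes with n_shared common alleles and n_unique private alleles each,
    every allele equally frequent within its deme (an assumed toy setup)."""
    p = 1.0 / (n_shared + n_unique)
    deme_a = [p] * n_shared + [p] * n_unique + [0.0] * n_unique
    deme_b = [p] * n_shared + [0.0] * n_unique + [p] * n_unique
    hs = (heterozygosity(deme_a) + heterozygosity(deme_b)) / 2
    pooled = [(a + b) / 2 for a, b in zip(deme_a, deme_b)]
    ht = heterozygosity(pooled)
    gst = (ht - hs) / ht
    d = 2 * (ht - hs) / (1 - hs)  # n / (n - 1) = 2 for two demes
    return gst, d


for unique in (0, 1, 2, 4, 8, 16, 32):
    gst, d = two_deme_stats(4, unique)
    print(f"private alleles per deme={unique:2d}  Gst={gst:.4f}  D={d:.4f}")
```

In this construction Gst climbs from zero, peaks when there are a handful of private alleles, and then declines, while D keeps increasing as the demes share a smaller fraction of their alleles.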
Returning to another example from Jost 2008, consider 20 subpopulations with 100 equally common alleles in each but no alleles shared between subpopulations. The differentiation of the demes tells us that migration must be tiny. The diversity of the demes tells us that mutation must be huge. Fst is a compound measure of both parameters and would therefore be small in this case.
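For concreteness, the numbers in this example can be worked out in a few lines. The sketch assumes equally weighted demes, uses Gst = (Ht - Hs)/Ht for the statistic and Jost's D = [n/(n-1)](Ht - Hs)/(1 - Hs) for comparison; the function name is hypothetical.

```python
def no_sharing_stats(n_demes, alleles_per_deme):
    """Gst and D for equally weighted demes, each with the same number of
    equally common alleles and no allele shared between demes."""
    hs = 1.0 - 1.0 / alleles_per_deme              # within-deme heterozygosity
    ht = 1.0 - 1.0 / (n_demes * alleles_per_deme)  # pooled heterozygosity
    gst = (ht - hs) / ht
    d = (n_demes / (n_demes - 1)) * (ht - hs) / (1.0 - hs)
    return gst, d


gst, d = no_sharing_stats(20, 100)
print(f"Gst = {gst:.4f}, D = {d:.4f}")
```

Gst comes out just under 0.01 even though the demes have no alleles in common, while D is 1.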
I would be interested to hear if you agree with me so far – namely that Fst is a compound measure of both diversity and differentiation. By the sounds of it this is not the conclusion that others have come to.
Assuming for the sake of argument that I am right so far and ploughing on with the logic, let us consider the case in which the migration rate dwarfs the mutation rate. This seems like a biologically relevant choice for some cases (eg. loci with few alleles). Then the parameter Fst is governed almost exclusively by the migration rate. When Fst (the statistic) is close to zero we can assume that migration is large, and when Fst is close to 1 we can assume that migration is small. Since migration is a proxy of differentiation we can conclude that when mutation rate is small Fst is a reasonable measure of differentiation.
My assertion in a nutshell:
When mutation rate is small…
Small Fst –> Large migration rate –> Low differentiation
Large Fst –> Small migration rate –> High differentiation
Any comments would be greatly appreciated
Bob Verity
Lou Jost says:

September 13, 2010 at 8:22 pm

Thanks for sharing these ideas. I’ll have more time to answer later today or tomorrow, but I have a couple quick comments.
I agree with you that Fst s behavior in my 2008 examples can be rationalized. Fst is a genuine measure of one aspect of population structure. My point is only that this aspect is NOT compositional differentiation between demes. Gregorius 2010 explains the aspect of population structure measured by Fst and contrasts this with differentiation of alleles between demes (D) .
I like your idea of thinking of Fst as a function of diversity and differentiation. If the number of demes n is large so that the factor n/(n-1) in the definition of D is approximately equal to unity , Gst is completely determined by differentiation and either within-group or total diversity, and these expressions are model-independent. Gst = D/[D+Hs/(1-Hs)] or, in terms of real diversity (note my argument above that H is not diversity, 1/(1-H) is diversity)
Gst = Diff / [Diff + Divs – 1]
where Diff is differentiation D, and Divs is within-group diversity 1/(1-Hs), so that Hs/(1-Hs) = Divs – 1. When total diversity Divt is used,
Gst = [Diff / (1-Diff)] [1/ (Divt – 1)].
Best check the algebra before using!
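Taking up the invitation to check the algebra, here is a minimal numerical sketch in Python. It assumes Nei's definition Gst = (Ht - Hs)/Ht and the large-n form D = (Ht - Hs)/(1 - Hs), and compares the expressions in D and diversity against the direct definition for random heterozygosities. The function names are hypothetical and chosen only for illustration. Because Hs/(1-Hs) equals Divs - 1 when Divs = 1/(1-Hs), the within-group form is written here as Diff / (Diff + Divs - 1).

```python
import random


def gst_direct(hs, ht):
    """Nei's Gst from within-group (Hs) and total (Ht) heterozygosity."""
    return (ht - hs) / ht


def d_large_n(hs, ht):
    """Jost's D with the n/(n-1) factor taken as 1 (assumes many demes)."""
    return (ht - hs) / (1.0 - hs)


def gst_from_d_and_hs(d, hs):
    """Gst = D / [D + Hs/(1-Hs)]."""
    return d / (d + hs / (1.0 - hs))


def gst_from_d_and_divs(d, divs):
    """Same expression in terms of within-group diversity Divs = 1/(1-Hs)."""
    return d / (d + divs - 1.0)


def gst_from_d_and_divt(d, divt):
    """Gst = [D/(1-D)] * [1/(Divt-1)], with total diversity Divt = 1/(1-Ht)."""
    return (d / (1.0 - d)) / (divt - 1.0)


def max_identity_error(trials=1000, seed=1):
    """Largest absolute gap between the direct Gst and the three rewritten forms."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        hs = rng.uniform(0.05, 0.90)
        ht = rng.uniform(hs + 0.01, 0.99)  # keep Hs < Ht < 1
        d = d_large_n(hs, ht)
        target = gst_direct(hs, ht)
        candidates = (
            gst_from_d_and_hs(d, hs),
            gst_from_d_and_divs(d, 1.0 / (1.0 - hs)),
            gst_from_d_and_divt(d, 1.0 / (1.0 - ht)),
        )
        worst = max(worst, max(abs(c - target) for c in candidates))
    return worst


if __name__ == "__main__":
    print("largest discrepancy:", max_identity_error())
```

If the three forms are algebraically equivalent to the definition, the reported discrepancy should be at the level of floating-point rounding.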
More later,
Lou

One, it is easy to write Gst (pardon the switch from Fst, but it is much more direct and its values are similar to those of the other versions of Fst) in terms of differentiation and diversity.
Lou Jost says:

September 13, 2010 at 11:57 pm

Bob, here are my comments interleaved with yours. Hope they are helpful.

Hmmm, it is an interesting idea to think about differentiation and diversity in this way, with differentiation defined as "allele fidelity to demes" and diversity defined as "deme fidelity to alleles".

A small correction: diversity is something else (effective number of equally common alleles). Fst measures deme fidelity to alleles, and D measures allele fidelity to demes.

However, my current view (for which I have a draft manuscript with Richard Nichols) is to emphasize that allele distributions are most helpfully interpreted by considering the process that could have produced them, rather than with descriptive statistics of their current state.

I see that this can be a very useful viewpoint for many purposes. I don’t think process should always be the primary focus, though; sometimes the present state of allele distribution is what is important. For example, in conservation, we are usually dealing with endangered species whose deme sizes and inter-deme migration patterns no longer have anything to do with the processes that caused the allele distributions over evolutionary timescales. If we want to prioritize demes of Spectacled Bears for preservation, we might look at a set of important loci (say, loci controlling immune system responses) or maybe all coding loci, and ask which demes are most different from each other. We’d also look at within-deme diversity. This information is all about the present-day allele distribution. Now maybe knowing ancient inter-deme migration rates might give clues about which demes were most likely to have unique alleles. But I wonder how we might estimate that, since the current population is most certainly not in equilibrium and most ancient demes have vanished.
Rather than making dubious theory-laden and assumption-laden estimates of “process”, in cases like these it seems much more direct to just look at the actual differentiation at loci of interest. And for this, we need a real measure of differentiation, such as D (note there are others). Fst and its relatives will not work.
Also, most techniques for inferring process depend on assumptions of neutrality. However, the loci worth conserving are coding loci that are not neutral. Here again, a direct measurement of their differentiation is required, and techniques which assume neutral models to infer process are not applicable.

…. I do not think that Fst is a measure of just diversity, but is a compound measure of both diversity and differentiation. or differentiation….

I agree that Fst is a function of diversity and differentiation. It is not possible to infer differentiation using Fst alone; you need to know diversity as well. That is also the point of Hedrick’s G’st (which unfortunately fails in the low-diversity end of the spectrum, but does correct Gst when diversity is high).

I would be interested to hear if you agree with me so far – namely that Fst is a compound measure of both diversity and differentiation. By the sounds of it this is not the conclusion that others have come to.

There is no doubt that your view here is correct, and I gave the explicit relationship between diversity and differentiation in the preceding post.

Assuming for the sake of argument that I am right so far and ploughing on with the logic, let us consider the case in which the migration rate dwarfs the mutation rate. This seems like a biologically relevant choice for some cases (eg. loci with few alleles). Then the parameter Fst is governed almost exclusively by the migration rate. When Fst (the statistic) is close to zero we can assume that migration is large

Just want to note that only in the case you mention, with very low diversity, can you infer that migration rate is large based on a small Fst.

, and when Fst is close to 1 we can assume that migration is small. Since migration is a proxy of differentiation

only true for this low-diversity case and for this model

we can conclude that when mutation rate is small Fst is a reasonable measure of differentiation.

Nagylaki (1998) addressed this question directly, and proved mathematically that Fst measures differentiation between two demes if and only if within-group diversity is low.

My assertion in a nutshell:
When mutation rate is small…
Small Fst –> Large migration rate –> Low differentiation
Large Fst –> Small migration rate –> High differentiation

But when there are many demes Fst does not work this way. Again, when all demes are fixed for a single allele, and all but one are fixed for the SAME allele, with very low or no migration, Fst =1.00 but differentiation between demes is virtually 0.00. From the large Fst you would have inferred high differentiation, but this would be wrong (unless there are only two demes).
Only for the bi-allelic, two-deme case do Gregorius’ two aspects of population structure collapse into one, and only in this case does Fst measure compositional differentiation among the demes.
Hope this helps,
Lou
Lou Jost says:

September 14, 2010 at 12:05 am

I am having trouble posting, not sure why. I get this message:
Your comment submission failed for the following reasons: Publish failed: Writing to ‘/var/www/html/boboh/2010/08/16/semiotics-and-statistics.new’ failed: Opening local file ‘/var/www/html/boboh/2010/08/16/semiotics-and-statistics.new’ failed: Read-only file system
Will try again tonight or tomorrow. Lou
Lou Jost says:

September 14, 2010 at 6:19 am

Sorry about the multiple copies of my post–I only submitted one. Bob O, can you please remove the extras? Lou
I’ve removed them. Of course the system chose to duplicate your longer comments – BO’H
Lou Jost says:

September 14, 2010 at 6:22 am

Ah, now I see what happened. I tried posting earlier this evening and kept getting error messages. I tried many times. None got posted, all returned an error message. But apparently every one of those attempted posts did get into the server somewhere, and all got posted at once when I submitted again this evening.
Bob Verity says:

September 15, 2010 at 3:36 pm

OK, I think we’re getting to the heart of the issue now.
Let us imagine a case in which there are 10 subpopulations of equal size; 9 of which are fixed for A and 1 of which is fixed for B. If we define differentiation as "allele fidelity to demes" then differentiation in this case would be low because the A allele is shared between 9 of the 10 demes. By contrast, Fst would be 1 in this case because all subpopulations are fixed. Thus, Fst does not measure differentiation in this sense – I agree.
But I believe there is a different definition of differentiation which does correspond to Fst. There are many unknown parameters in the above example, including the mutation and migration rates and the frequencies of the A and B alleles in the migrant pool. As I have said before, my understanding of Fst the statistic is that it is close to the posterior maximum likelihood value of Fst the parameter. The maximum likelihood explanation of the above scenario occurs when both mutation and migration rates are low (hence Fst=1) and the frequencies of the A and B allele in the migrant pool are 0.9 and 0.1 respectively. In essence, Fst assumes that the best explanation for the data is the fixation of a locus with low polymorphism.
Now imagine that we examine the same sample for a different locus. Based on what we have seen so far we would expect all subpopulations to be fixed. How many subpopulations become fixed and for which alleles depends entirely on the frequencies of the alleles in the migrant pool – for a highly polymorphic locus we would expect high allele fidelity to demes.
Thus Fst gives a sort of whole-genome measure of differentiation, which may or may not reflect the numbers that we see at a particular locus. Going back to the 10 subpopulations example and choosing 2 subpopulations that are both fixed for the A allele, in a way these are fully differentiated – we have no reason to believe that there would be any correlation in allele frequencies at another locus of the same individuals.
Clearly these are very different interpretations of differentiation, and I can see pros and cons for each. I strongly agree with you that we should be careful about theory-laden assumptions about the processes that are occurring, especially when the decisions that we make based on these assumptions may have irrevocable consequences. One weakness with the Fst world-view that I outlined above is its reliance on an infinite pool of migrants. In the case of some endangered species the polymorphism that we see now is really all there is – when it is gone it will not pop back up due to migration.
Another interesting example of the difference between the definitions occurs in the case of 10 subpopulations of size N=100, each of which contains 10 equally represented alleles that are unique to that subpopulation (100 alleles total). By the "allele fidelity to demes" definition this is maximum differentiation. From an Fst point of view the maximum likelihood explanation is that each allele has a frequency of 0.01 in the migrant pool and that there is a fairly large amount of mutation (note that even under this maximum-likelihood situation the data is incredibly unlikely), thus Fst is not close to 1, but instead is closer to 0.3. If we were to examine another locus in the same individuals then we would expect to see some alleles shared between subpopulations. Thus, to say that the subpopulations are maximally differentiated in this situation could be seen as misleading.
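The two hypothetical data sets in this comment can be run through the descriptive statistics directly. The Python sketch below assumes equal-sized demes, population (not sample-corrected) allele frequencies, Nei's Gst = (Ht - Hs)/Ht, and Jost's D = [(Ht - Hs)/(1 - Hs)] · [n/(n-1)]. The function and allele names are hypothetical. This plain Gst is not the likelihood-based reading of Fst described above, so its value for the second example need not match the figure of about 0.3 quoted there.

```python
def heterozygosity(freqs):
    """Expected heterozygosity 1 - sum(p^2) for one set of allele frequencies."""
    return 1.0 - sum(p * p for p in freqs)


def gst_and_d(demes):
    """Return (Gst, D) for equal-sized demes given as dicts allele -> frequency.

    Assumes at least two demes and some variation overall (Ht > 0).
    """
    n = len(demes)
    # Mean within-deme heterozygosity.
    hs = sum(heterozygosity(deme.values()) for deme in demes) / n
    # Heterozygosity of the pooled population (equal deme weights).
    alleles = {a for deme in demes for a in deme}
    pooled = [sum(deme.get(a, 0.0) for deme in demes) / n for a in alleles]
    ht = heterozygosity(pooled)
    gst = (ht - hs) / ht
    d = ((ht - hs) / (1.0 - hs)) * (n / (n - 1))
    return gst, d


# Example 1: nine demes fixed for A, one deme fixed for B.
fixed = [{"A": 1.0}] * 9 + [{"B": 1.0}]

# Example 2: ten demes, each with ten equally common private alleles.
private = [{f"deme{i}_allele{j}": 0.1 for j in range(10)} for i in range(10)]

if __name__ == "__main__":
    for label, demes in (("fixed", fixed), ("private", private)):
        gst, d = gst_and_d(demes)
        print(label, "Gst =", round(gst, 3), "D =", round(d, 3))
```

Under these assumptions the first example gives Gst = 1 with D of about 0.2, and the second gives D of about 1 with a Gst of about 0.09, which is the contrast between "deme fidelity to alleles" and "allele fidelity to demes" discussed in this thread.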
Again, I’d be interested to hear your thoughts. Irrespective of which definition is more useful (and I expect the usefulness of each will depend greatly on context), I do not think that this is a distinction that the wider scientific community is aware of.
Bob
Lou Jost says:

September 16, 2010 at 2:06 am

Yes, Bob V, I think now we are getting to the heart of the issue. I suggest we start a new thread about this in the Nature Population Genetics Forum, which spawned this blog post in the first place.
I do not think your interpretation of Fst in the second example can be correct. Assume for the moment that this is a neutral locus and that it follows the finite island model. Wouldn’t the logical conclusion from the data be that there is no migration? And hence, wouldn’t the most logical inference about the differentiation of the next locus be that it would also be completely differentiated? So wouldn’t D be giving the most logical prediction rather than a misleading one as you suggest?
Now suppose we don’t assume anything about neutrality or about the underlying model. If someone gave you that data and said, "Predict whether the next locus is differentiated", would you be better off basing your prediction on Fst or on D?
You said your prediction based on Fst would be that some alleles would be shared at the next locus. Notice how odd this prediction is. You are predicting AGAINST the observed data repeating itself at the next locus. What if you looked at a hundred loci and all had D=1, Fst =0.3? Because Fst is 0.3 in all these cases, you would still predict that the hundred-and-first locus would show some shared alleles, even though the 100 observed loci did not. This doesn’t seem right to me.
Nevertheless I think you have made an important point about Fst making a whole-genome assertion, IF the finite island model applies exactly, and IF the system is at equilibrium, and IF the alleles are neutral, and if only loci with small mutation rates are considered. Fst_coalescent can correct for variable mutation rates from locus to locus.
Real compositional differentiation can also be applied to the whole genome, since the average differentiation across loci has biological meaning, as long as we are talking about coding loci.
I do agree with you completely that Fst does have legitimate uses. It just does not measure compositional differentiation between demes. Low values of Fst do not mean that the demes have similar allele compositions, and high values do not mean that they have dissimilar compositions. Such inferences are just plain wrong, as examples easily show.
Lou Jost says:

September 16, 2010 at 6:18 pm

In your second example, if I were asked to predict something about the next locus based on the observed Fst = 0.3, and if the Fst value were the only thing I knew, I would not predict anything about the differentiation at the next locus. I would predict that fixation would be low.
Lou Jost says:

September 17, 2010 at 5:08 pm

Bob V, any thoughts on the above?
Bob Verity says:

September 20, 2010 at 2:48 pm

Hi Lou. Sorry I’ve been pretty busy, I’ll try to post a reply soon.
Lou Jost says:

September 20, 2010 at 10:18 pm

OK, I’ll keep checking back here now and then…
Lou
Bob Verity says:

September 21, 2010 at 4:55 pm

OK, finally got round to it.

I do not think your interpretation of Fst in the second example can be correct. Assume for the moment that this is a neutral locus and that it follows the finite island model. Wouldn’t the logical conclusion from the data be that there is no migration? And hence, wouldn’t the most logical inference about the differentiation of the next locus be that it would also be completely differentiated? So wouldn’t D be giving the most logical prediction rather than a misleading one as you suggest?

The logical conclusion would indeed be that there is no migration, but also that the mutation rate is large which drags Fst – a composite measure of migration and mutation – down to around 0.3. I think the confusion comes from the fact that I was envisioning a finite-alleles model, in this case there is no reason to expect that the next locus would be fully differentiated. Mutations would likely occur in each of the subpopulations, and more often than not this would lead to some alleles being shared between demes.

Now suppose we don’t assume anything about neutrality or about the underlying model. If someone gave you that data and said, "Predict whether the next locus is differentiated", would you be better off basing your prediction on Fst or on D?

Unfortunately I do not believe it is possible to abandon all models and assumptions completely, or at least if we do then it is not possible to extrapolate to further loci. D is a statistic measured on the current data, but tells us nothing about the genome as a whole unless it is somehow linked to a model. Even the attitude "it’s the only information we have so it’s our best guess" is underpinned by a model – the assumption that the same process is likely to yield a similar value of the statistic. As mentioned above, I believe there is an important class of models in which this is not the case.

You said your prediction based on Fst would be that some alleles would be shared at the next locus. Notice how odd this prediction is. You are predicting AGAINST the observed data repeating itself at the next locus. What if you looked at a hundred loci and all had D=1, Fst =0.3? Because Fst is 0.3 in all these cases, you would still predict that the hundred-and-first locus would show some shared alleles, even though the 100 observed loci did not. This doesn’t seem right to me.

As mentioned previously "even under this maximum-likelihood situation the data is incredibly unlikely". Thus, if we were to generate a new sample of allele frequencies using the maximum likelihood parameters it would most likely look nothing like the data we observed. I agree that this is unintuitive, and usually when this happens it can mean one of two things – either the model is a poor representation of the actual process, or (when dealing with hypothetical counts) the data is contrived. I imagine that there are situations in which a finite-alleles model is appropriate, in which case D could give misleading results.
Lou Jost says:

September 21, 2010 at 6:37 pm

Bob V, you are certainly right about the inevitability of assuming some kind of model when predicting the differentiation of the next locus. I also see your point about betting against a repetition of the actual data if that data is unlikely based on your model. After all, if we were playing cards and you were dealt a royal flush, this would not be reason to predict that your next hand would also be a royal flush.
I agree that getting lots of equally-common alleles (as in my example) is unlikely, but that assumption was made only to make calculations easy, and all of my arguments apply even if they are not equally common (as long as there are lots of alleles with relatively high abundances).
I think this discussion of model has nothing to do with the issue at hand. First, under standard models migration is a free parameter, and there is nothing inherently unlikely about a low migration rate, so it is reasonable to infer low migration rate from the data. Second, to say that you were thinking of a finite allele model does not justify your interpretation of Fst, because your calculation of Fst has nothing to do with what you were thinking. You would have gotten the same 0.3 if you were thinking of the infinite allele model, a finite allele model with lots of alleles, or a finite allele model with just a few alleles. None of that entered into your calculation or interpretation of Fst. Third, the observed data has no repeated alleles across demes, and this argues against a finite allele model unless it has a lot of alleles, in which case it makes no difference to the interpretation of the results.
Now if we sampled a hundred loci and got complete differentiation at all of them, it does seem to be unreasonable, on any model which treats migration as a free parameter, to infer that the next locus examined will have a differentiation of 0.3.
So I do not think your interpretation of Fst, as you formulated it above, is correct.
John A. Davison says:

September 21, 2010 at 8:52 pm

Bob OHara
I see you are into blocking critics here as at After The Bar Closes, Pharyngula, richarddawkins.net and every other bastion of Darwinian mysticism. You have no idea how much such cowardly protectionist tactics please me!
Bob Verity says:

September 22, 2010 at 12:57 pm

As a quick aside, does anyone know how to insert images into comments? I can see the little "Insert/Edit image" icon but am lost from then on!
Bob Verity says:

September 22, 2010 at 2:46 pm

I agree that getting lots of equally-common alleles (as in my example) is unlikely, but that assumption was made only to make calculations easy, and all of my arguments apply even if they are not equally common (as long as there are lots of alleles with relatively high abundances).

As far as I am aware my argument (that Fst correctly predicts shared alleles at the next locus even when there are none in this locus) also applies even when they are not equally common, as long as there are lots of alleles with relatively high abundances. The only requirement is that we consider a finite alleles model. Talking of which…

I think this discussion of model has nothing to do with the issue at hand … You would have gotten the same 0.3 if you were thinking of the infinite allele model, a finite allele model with lots of alleles, or a finite allele model with just a few alleles.

True, we can calculate the statistic for any model that we think up, but the interpretation of the statistic depends on the model.
I think I need to clarify the angle that I am coming from so that we can pinpoint where we disagree. Imagine that there are a bunch of subpopulations (say 10) that breed at random within subpopulations but exchange no individuals with each other. New alleles can be generated in each subpopulation by mutation/migration from a "cloud" – which could be a continent in the case of migration or just a description of the available allele frequencies in the case of mutation. Crucially, there are a finite number of alleles that can be mutated to. The migration rate is m, the mutation rate is mu, the number of individuals in each subpopulation is N, and the frequencies of alleles in the cloud is indexed by rho (so the frequency of allele A is rhoA).
Imagine that we have seen a single A allele in a subpopulation. The probability that the next allele is also an A is the probability of an IBD alelle plus the probability of getting an A allele by mutation/migration:
$\begin{align*} \Pr(A|A) &= F_{ST} + (1-F_{ST})\rho_A \\ &= \frac{1+\theta\rho_A}{1+\theta} \\ \mbox{where } \theta &= 4N(m+\mu) \hspace{5mm}\mbox{and}\hspace{5mm} F_{ST}=\frac{1}{1+\theta} \end{align*}$
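As a small illustration of the relationship just stated, the Python sketch below evaluates theta = 4N(m + mu), Fst = 1/(1 + theta), and the probability that the next allele drawn is also an A. The function names and the numerical values are hypothetical, chosen only to show how the two forms of Pr(A|A) agree.

```python
def theta_from_rates(n, m, mu):
    """theta = 4N(m + mu) for subpopulation size N, migration m, mutation mu."""
    return 4.0 * n * (m + mu)


def fst_from_theta(theta):
    """Fst = 1 / (1 + theta)."""
    return 1.0 / (1.0 + theta)


def prob_next_same(theta, rho_a):
    """Pr(A|A) = Fst + (1 - Fst) * rho_A.

    The first term is identity by descent; the second is a fresh A arriving
    by mutation or migration from the cloud, where A has frequency rho_A.
    """
    fst = fst_from_theta(theta)
    return fst + (1.0 - fst) * rho_a


def prob_next_same_closed_form(theta, rho_a):
    """The equivalent form (1 + theta * rho_A) / (1 + theta)."""
    return (1.0 + theta * rho_a) / (1.0 + theta)


if __name__ == "__main__":
    # Illustrative values only: N = 100, m = 0.005, mu = 0.0025, rho_A = 0.5.
    theta = theta_from_rates(100, 0.005, 0.0025)
    print("theta =", theta)
    print("Fst =", fst_from_theta(theta))
    print("Pr(A|A) =", prob_next_same(theta, 0.5))
```

When theta is small (little mutation and migration) Fst approaches 1 and Pr(A|A) approaches 1 regardless of rho_A; when theta is large, Pr(A|A) approaches rho_A.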
Using the same logic we can arrive at the probability of the entire sample – which turns out to be the multinomial-Dirichlet likelihood developed by Wright.
$\begin{align*} \frac{\Gamma(\theta)}{\Gamma(k+\theta)}\prod_i \frac{\Gamma(n_i+\theta\rho_i)}{\Gamma(\theta\rho_i)} \end{align*}$
Where n_i represents the number of alleles of type i, and k represents the total number of alleles.
This is the formula that I have been using to calculate the likelihood of the data. My understanding of the statistic Fst is that it is close to the posterior mode of this function. Note that the formula is predicated on a finite alleles model – there has to be a chance of getting an A allele by mutation as well as from identity by descent.
Now let’s go back to the example we have been considering – a bunch of subpopulations containing multiple unique alleles. I am no longer stipulating that there are equally common alleles in each subpopulation, only that there are many more alleles than demes. Treating all allele frequencies in the cloud as unknowns (or for the sake of convenience using the maximum-likelihood values which are just the frequencies we see in the data) it is always unlikely that we would see maximal differentiation. The logic is as follows: there must be a reasonably high mutation/migration to explain the diversity within samples, however the mutations that did occur led to unique alleles in each deme. There are fewer allele "state-spaces" in which no alleles are shared than there are state-spaces in which alleles are shared, meaning the likelihood of this event is very low. The low value of Fst reflects this and leads to the prediction of shared alleles at another locus.
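For readers who want to experiment with the likelihood formula given above, here is a minimal Python sketch that evaluates it on the log scale with `math.lgamma`. It covers a single sample from one subpopulation, uses the same symbols (n_i, k, theta, rho_i), and omits any multinomial coefficient, as the formula does. The function name and the example numbers are hypothetical.

```python
from math import exp, lgamma


def log_likelihood(counts, rho, theta):
    """Log of Gamma(theta)/Gamma(k+theta) * prod_i Gamma(n_i+theta*rho_i)/Gamma(theta*rho_i).

    counts: allele counts n_i in the sample; rho: cloud frequencies rho_i;
    theta: 4N(m + mu). Alleles with rho_i = 0 must have a count of zero.
    """
    if theta <= 0:
        raise ValueError("theta must be positive")
    k = sum(counts)
    total = lgamma(theta) - lgamma(k + theta)
    for n_i, rho_i in zip(counts, rho):
        if rho_i == 0:
            if n_i > 0:
                raise ValueError("observed an allele with zero cloud frequency")
            continue  # contributes a factor of 1
        total += lgamma(n_i + theta * rho_i) - lgamma(theta * rho_i)
    return total


if __name__ == "__main__":
    # Illustrative values only: two alleles with cloud frequencies 0.3 and 0.7.
    rho = [0.3, 0.7]
    theta = 2.0
    # One A allele: the likelihood reduces to rho_A.
    print(exp(log_likelihood([1, 0], rho, theta)))
    # Two A alleles: rho_A times Pr(A|A) = (1 + theta*rho_A)/(1 + theta).
    print(exp(log_likelihood([2, 0], rho, theta)))
```

Working on the log scale avoids overflow in the gamma functions for larger samples. A full analysis across several demes would sum this quantity over subpopulations, which is an assumption about usage and not something spelled out in the comment.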
Do you use the program R? If so then it might be useful at this stage to e-mail each other scripts to explain what we mean (as long as they are well documented!)
John A. Davison says:

September 23, 2010 at 11:00 am

Leo Berg identified the role of chance in both ontogeny and phylogeny eighty-eight years ago (1922) –
"Neither in the one nor in the other is there room for chance."
Nomogenesis, page 134
Why do you Darwinians continue to ignore the greatest Russian biologist of his generation? Is it because you are terrified at the prospect that he might be right? These are questions that demand answers. Are you able to provide them?
jadavison.wordpress.cm
Bob Verity says:

September 23, 2010 at 3:52 pm

Just so you know, I have posted a response but it needs to be accepted by Bob O’Hara before it can appear on the forum (it contains images). Stay tuned!
Bob O'Hara says:

September 23, 2010 at 6:43 pm

Hi! Sorry for ignoring this thread – I was having too much fun, and when I came back, the number of comments scared me off.
Bob – hopefully your comment is approved now.
JAD – the only comment of yours that didn’t appear consisted of two webpages, and so was flagged as spam.
Lou – I’ll clean up the mess with multiple comments later. I feel your pain with errors doing stuff like that. I hope I only delete duplicates!
John A. Davison says:

September 23, 2010 at 7:10 pm

Bob O’Hara,
I take it then that you plan to respond to the questions I posed on September 23, 2010, 11:00 AM ? Or will you continue, like other Darwinians before you, to pretend that you never had credible critics?
jadavison.wordpress.com
Lou Jost says:

September 24, 2010 at 3:22 am

John D, you yourself said earlier that you didn’t have anything specific to add to the mathematical discussion of differentiation measures. Don’t you think a discussion of the larger issues you mention would be better done somewhere else? We really just want to deal with this little bite-sized issue for now. This is a hard enough issue by itself; we are unlikely to resolve it if we simultaneously try to address such big questions as the ones you want to discuss.
By the way, these little issues may seem trivial compared to your big issues, but in order to discuss those big issues properly, the countless little issues that underlie the big issues have to be solved first. So let’s just solve this one and then later, elsewhere, work on the big picture. Don’t you think that’s a wise strategy?
Bob V, I will have to think about your formulas. Meanwhile it still seems to me that your suggested interpretation of Fst must vary with the assumed size of the allele space, yet the calculation of Fst does not involve the size of the allele space. So I can’t figure out how that works. In practical terms there is no sudden cut-off between the infinite-allele model and the finite allele model with a very large allele space. So nothing about the interpretation of Fst can suddenly change just because we assume a finite allele space.
Thanks Bob O for fixing my repeated comments. The last one was the most complete one. Sorry for the length of these comments. It is a complex issue!
John A. Davison says:

September 24, 2010 at 11:00 am

Bob
You ask – Don’t you think that is a wise strategy? – to which I respond with a firm NO!
It is my position that there has been no role for statistics or probability in the history of life. I am convinced with Robert Broom that there WAS a Plan, the word that Broom capitalized much to the dismay of the Darwinians which is why they pretend he never existed.
As I am sure you are aware, I do not regard allelic mutations as having any significance in progressive evolution, although they may play a role in extinction. Furthermore, I also believe that the several extinctions that we know have occurred were also part of Broom’s Plan. The dinosaurs would have become extinct with or without environmental catastrophe just as did the extinction of the giant amphibians that preceded them. Without planned extinction there could never have been evolution.
I realize that "our" interpretation of phylogeny is incompatible with the neo-Darwinian model. I am simply attempting to establish a dialogue with our adversaries. I thank you for asking your question, thereby offering me this opportunity to reply.
jadavison.wordpress.com
Lou Jost says:

September 24, 2010 at 1:49 pm

Don’t blame Bob for that last post, it was just me.
You are lucky that you already know what is true and what is not. It is understandable then that you don’t see any use for statistics. The rest of us are not so lucky. We have to test our theories against observations, and for this, statistics are essential, else we might end up fooling ourselves or being misled by our preconceptions.
John A. Davison says:

September 24, 2010 at 2:57 pm

Lou Jost
"We" have never claimed to know "what is true." "We" have claimed only what we know not to be true.
It is important that you know exactly who "we" are. St George Mivart, William Bateson, Leo Berg, Richard B. Goldschmidt, Otto Schindewolf, Robert Broom, Pierre Grasse and more recently Robert F. DeHaan, Soren Lovtrup and myself, not one of whom was either a religious fanatic or a proponent of the Darwinian model.
jadavison.wordpress.com
Lou Jost says:

September 25, 2010 at 11:04 pm

Hi Bob V,
Maybe you can help me with one part of your interpretation that I do not understand. You say that the Fst value of 0.3 means that some alleles will likely be shared at the next locus. How are you interpreting the 0.3 then? You seem to be treating the 0.3 as if it were an estimate of the likely differentiation. What is the formula for the differentiation measure whose value Fst is predicting? Fst itself does not measure this kind of differentiation, so we can’t use our experience with Fst to guide us in interpreting the value 0.3. We can get an Fst value of 0.3 even when no alleles are shared, as in this example. Does a value of 0.3 mean it predicts many alleles will be shared? Most alleles? A few? How do you know?
Thanks,
Lou
Lou Jost says:

October 8, 2010 at 3:31 am

Bob O, Kronholm et al (2010) have just published in BMC Genetics a correction to their original paper, which you cited in Footnote 2 of this blog post. They agree with me that D is NOT an estimator of Fst.
"Correction: Effect of mutation rate on estimators of genetic differentiation-lessons from Arabidopsis thaliana"
http://www.biomedcentral.com/1471-2156/11/88
As I mentioned earlier, Gregorius (2010) has a deep and original discussion of the aspect of population structure measured by Fst and its relatives, versus the aspect of population structure measured by D and its relatives.