Some creationists have become terribly excited by a recent paper and accompanying New Scientist article. It’ll come as no surprise that they have failed to understand the paper, and I’m confident that explaining the paper in a post won’t help, but I think the paper’s interesting, and I have a few thoughts about it anyway.

The problem the paper deals with can be traced back to two great Victorian English Charleses: Darwin and Dodgson. Darwin, of course, wrote a book he called *On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life*, in which he failed to give a good account of the origin of species, but did explain natural selection.

Charles Dodgson, as most people know, was a mathematician who wrote under the pseudonym of Lewis Carroll. One of the things he wrote about was the Red Queen. She was introduced to evolutionary biology by Leigh Van Valen. He suggested that fitness (as measured by extinction rate) may not increase over time, and showed evidence that actual times to extinction follow an exponential distribution, as they would if fitness were constant. Van Valen compared this to the Red Queen’s statement to Alice: “Now, here, you see, it takes all the running you can do, to keep in the same place.” Species are evolving through selection, but the environment is changing, so they are constantly trying to keep up.

The present paper looks at Van Valen’s idea from a slightly different perspective. The authors were interested in speciation rather than extinction and argue that the time between speciation events (i.e. the time a species spends as a single species, before it splits) can tell us something about the processes that lead to speciation. They thus compared the distribution of these times (“branch lengths”) in different parts of phylogenetic trees:

*A phylogenetic tree, yesterday*

In particular, they compare five distributions of branch lengths, for which they could give explanations for how these distributions might come about:

**Exponential**: Speciation is random, i.e. there is a constant rate at which species split, and this is not affected by the age of the species.

**Weibull**: The rate of speciation changes over time: it can increase or decrease.

**log-Normal**: There is an accumulation of factors (presumably genetic) which act multiplicatively. Eventually some threshold is reached, when speciation occurs.

**Variable Rates**: Like the exponential, but the rate of speciation is different for each species. This rate itself follows a Gamma distribution.

**Normal**: Like the log-normal, but the factors add, not multiply.
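One way to read these five models is through the hazard: the instantaneous rate of speciation for a lineage that has already reached age t without splitting. The sketch below is only an illustration using textbook forms of each distribution. The function and parameter names are assumptions made here, and the paper's own parameterisation may differ.

```python
import math

def exponential_hazard(t, rate):
    """Constant hazard: the age t of the species is irrelevant."""
    return rate

def weibull_hazard(t, shape, scale):
    """Falls with age if shape < 1, rises if shape > 1, constant if shape == 1."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def variable_rates_hazard(t, gamma_shape, gamma_rate):
    """Exponential waiting time whose rate is Gamma(gamma_shape, gamma_rate).

    The survival function is (gamma_rate / (gamma_rate + t)) ** gamma_shape,
    so the hazard across the whole mixture declines with age.
    """
    return gamma_shape / (gamma_rate + t)

def normal_hazard(t, mu, sigma):
    """Density divided by survival for a normal waiting time."""
    z = (t - mu) / sigma
    density = math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))
    survival = 0.5 * math.erfc(z / math.sqrt(2))
    return density / survival

def lognormal_hazard(t, mu, sigma):
    """Density divided by survival for a log-normal waiting time (t > 0)."""
    z = (math.log(t) - mu) / sigma
    density = math.exp(-0.5 * z * z) / (t * sigma * math.sqrt(2 * math.pi))
    survival = 0.5 * math.erfc(z / math.sqrt(2))
    return density / survival

if __name__ == "__main__":
    # Illustrative parameter values only; they are not estimates from the paper.
    for age in (0.1, 1.0, 5.0):
        print(age,
              exponential_hazard(age, 1.0),
              weibull_hazard(age, 0.7, 1.0),
              variable_rates_hazard(age, 2.0, 1.0),
              normal_hazard(age, 2.0, 1.0),
              lognormal_hazard(age, 0.0, 1.0))
```

Only the exponential has a hazard that ignores the age of the species, which is why it is the one identified with "random" speciation.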

So, Venditti *et al.* argue, if we can say that a tree has one of these distributions of branch lengths, we can say something about the processes. They thus collected sequences from 101 data sets, from species like bumblebees, cats, turtles and roses. For each data set they fitted phylogenetic trees using all of the models, and then found which model fitted each data set best^{1}. This is what they got:

*Percentage of data sets for which each model provided the best overall description of the branch-length distribution (models described in text). The coloured bars are the results from the reversible-jump procedure (see text), the grey bars record the results from the harmonic mean test. Error bars, standard error. Source: Fig. 1*

The bars show the proportion of data sets where that distribution fitted best. The conclusion is simple: the exponential is the overwhelming winner. Hence, Venditti *et al.* conclude, speciation is a random event: there is nothing intrinsic to the species (such as its age) that makes it more or less likely to speciate. I think this would fit well into how most people think about speciation: it is caused by outside events like mountains rising up in the middle of a species’ range, or a continent inconveniently splitting in half.

I have a couple of methodological concerns about this study, which I will blog about later. But one is important to the whole study. The exponential distribution has one parameter, whilst the others have two. This makes model comparison difficult: a model with more parameters will always fit better to the data. So, if we are to compare models, we have to penalize the complex models. Skipping the details, Venditti *et al.* set the model up to give a debt:

The average prior cost we assessed the two-parameter models translates to having to overcome a ‘debt’ of about 1.1 log-units. That is, to perform better than the exponential the two-parameter model would need to improve the log-likelihood by this amount.

But they compared models by rank: finding out which model was best. If all of the models fit equally well (and the authors admit that with the exception of the normal, they “can produce almost indistinguishable densities”), the exponential would come top. It’s like having a race where one runner is 10% faster than the rest: they will still win most of the races (but not all: sometimes they will have an off day, or fall over etc. The statistical equivalent is that sometimes another model will, by chance, fit better to another distribution). Now, it might be that the exponential was much better than the rest, but we aren’t given the information to decide this^{2}. So, I’m not (yet) convinced that the authors have even found anything other than an artifact.
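To make the worry concrete, here is a minimal sketch of the rank argument. It is not what Venditti *et al.* did: they used a Bayesian reversible-jump procedure, whereas this uses plain maximum likelihood, compares only the exponential against the Weibull, and treats the 1.1 log-unit debt as a fixed penalty. The function names, the 30 branches per data set and the 200 simulated data sets are all assumptions made for illustration.

```python
import math
import random

DEBT = 1.1  # log-units; the figure quoted from the paper, used here as a flat penalty

def exp_loglik(xs):
    """Maximised log-likelihood of an exponential (rate = 1 / mean)."""
    n = len(xs)
    return -n * math.log(sum(xs) / n) - n

def weibull_loglik(xs, lo=0.05, hi=20.0, iters=60):
    """Maximised log-likelihood of a Weibull; shape found by bisection."""
    n = len(xs)
    logs = [math.log(x) for x in xs]
    mean_log = sum(logs) / n

    def score(k):
        # Profile score equation for the shape k; it increases with k.
        powers = [x ** k for x in xs]
        weighted = sum(p * l for p, l in zip(powers, logs)) / sum(powers)
        return weighted - 1.0 / k - mean_log

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score(mid) < 0:
            lo = mid
        else:
            hi = mid
    k = 0.5 * (lo + hi)
    scale = (sum(x ** k for x in xs) / n) ** (1.0 / k)
    return n * math.log(k) - n * k * math.log(scale) + (k - 1) * sum(logs) - n

def exponential_win_rate(true_shape, n_branches=30, n_datasets=200, seed=1):
    """Fraction of simulated data sets in which the exponential ranks first."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_datasets):
        # weibullvariate(scale, shape); shape 1.0 is exactly exponential
        xs = [rng.weibullvariate(1.0, true_shape) for _ in range(n_branches)]
        if weibull_loglik(xs) - DEBT <= exp_loglik(xs):
            wins += 1
    return wins / n_datasets

if __name__ == "__main__":
    for shape in (1.0, 0.9, 0.7):
        print(shape, exponential_win_rate(shape))
```

Comparing the win rate for a true shape of 1.0 with the rate for a shape slightly below 1 gives a feel for how often the exponential can rank first even when it is not the generating model. A count of wins says nothing about the margin, which is what the Bayes factors would show.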

Even if the statistical results are correct, I am not sure about the interpretation and what it means for natural selection. I’d like to see a good scenario for speciation that leads to a normal distribution. The only one I can think of is that shortly after a new species diverges, it splits into two populations. Over time, barriers to reproduction build up, until the two populations have diverged enough to be isolated. This doesn’t look like a general mechanism to me: why would a species split into diverging populations? Only a small amount of gene flow is needed to keep populations connected genetically. It seems more likely that there is a trigger for a species to split, and selection might act after this trigger. If the time between these triggers is long, this will dominate the distribution of branch lengths. So Venditti *et al.* are, I think, implicitly assuming that the branch lengths are short. I’m arm-waving here, but I would be interested in seeing some modelling work to assess how different mechanisms affect branch lengths^{27}.

My reaction to this paper illustrates what I think is a bigger problem. There is a field of applied mathematics called “inverse problems”. This is all about seeing an effect, and inferring the cause (it’s what scientists have been doing for years, of course). A problem is that any effect can have several causes, and they might not be distinguishable from the data. One thing I’ve seen too many times is someone taking a pattern, fitting a mechanism to it and then declaring that the mechanism produced the pattern^{93}. The mechanisms are often dynamic, but they are fitted to static patterns, which requires additional assumptions (e.g. equilibrium) that might not hold. I think we should be suspicious of these exercises, and instead look to fit dynamic models to the dynamics of the processes. This needs more data, which is a bitch, but I think we will also learn much more about the processes we are looking at.

Venditti *et al.* improve on this by comparing several models, but they don’t look at the dynamics they are interested in, only the end result (the time to speciation). They also only go half-way in the inverse problem process: the link between the mechanism of speciation and the distribution is not rigorous. Perhaps the best thing about this paper is that it might make people think a bit more about how mechanisms of speciation affect the trees of life.

__Citation__

Venditti, C., Meade, A., & Pagel, M. (2009). Phylogenies reveal new interpretation of speciation and the Red Queen. Nature, 463 (7279), 349-352. DOI: 10.1038/nature08630

Footnotes:

^{1}This is somewhat simplified. But do you want to explain MCMC and rjMCMC in two sentences?

^{2}For those who want to know, the Bayes Factors for the exponential versus the other models could have been presented.

^{27}I had a quick search of the literature, couldn’t find anything. But I might have been looking in the wrong places.

^{93}I’ll spare you my Unified Neutral Theory of Biogeography hobby-horse for now.

I am very interested in models for macro-evolution, but couldn’t quite persuade myself this article was worth paying for. I have read the abstract, New Scientist article, and the supplementary material. The graph in the latter (table S1) showing the posterior for the Weibull shape parameter seems a better summary of the results than the graph you reproduced.

On speciation scenarios and species lifetime distributions. This ref

Losos, J. B. and F. R. Adler. 1995. Stumped by trees? A generalized null model for patterns of organismal diversity. Am. Nat. 145:329-342.

provides a justification for a distribution which is less L-shaped, more normal-like, than the exponential.

It is my suspicion that many biologists take the analogy between the tree of descendants that might arise from a single bacterium (where very short branches are impossible) and the Tree of Life far too seriously. Not sure if this is in Losos and Adler, but I have heard talk of a ‘refractory period’ during which a new species is supposed to be unable to speciate further. To the extent that Venditti et al knock these notions on the head, I am all for the article.

A clarifying note: branch length is measured in terms of number of substitutions. The ‘time a species spends as a single species, before it splits’ (call it branch duration) is measured in years. These are not the same unless a strict molecular clock is assumed, which is a silly assumption in most cases. A variable clock will mean the distribution of branch lengths is not the distribution of branch durations. It is a pity that they didn’t use a phylogenetic analysis program that estimates node times. If you have node times, it is much easier to compare the phylogenetic trees with models of tree formation, and with fossil data.
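A rough sketch of the clock point, under strong assumptions: branch durations are exponential, each branch gets its own substitution rate drawn independently from a Gamma distribution with mean 1, and branch length is simply rate times duration. The function names and the Gamma shape of 4 are hypothetical choices for illustration, not anything taken from the paper or its supplement.

```python
import random
import statistics

def coefficient_of_variation(xs):
    """Standard deviation divided by mean; equals 1 for an exponential."""
    return statistics.pstdev(xs) / statistics.fmean(xs)

def simulate_branches(n=20000, clock_shape=4.0, seed=1):
    """Return (durations, lengths) for n branches under a variable clock."""
    rng = random.Random(seed)
    # Durations in time units: exponential with rate 1 (assumed).
    durations = [rng.expovariate(1.0) for _ in range(n)]
    # Per-branch clock rate: gammavariate(shape, scale) with mean 1 (assumed).
    rates = [rng.gammavariate(clock_shape, 1.0 / clock_shape) for _ in range(n)]
    # Length in expected substitutions = rate x duration.
    lengths = [r * t for r, t in zip(rates, durations)]
    return durations, lengths

if __name__ == "__main__":
    durations, lengths = simulate_branches()
    print("CV of durations:", coefficient_of_variation(durations))
    print("CV of lengths:  ", coefficient_of_variation(lengths))
```

In this toy setup the lengths are more dispersed than the durations, so an exponential fit to one need not carry over to the other.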

My main problem with the article is that it fails to model extinction and the sampling of extant taxa by researchers. A phylogenetic tree is only a small part of an unobserved full tree. Both extinction and sampling affect the distributions of branch lengths, and the latter is almost certainly biased. The argument given in the supplementary material (starting bottom p3) does not impress me. It is not just that the geometrical distribution seems to be pulled out of a hat to make the maths work. A bigger problem is that in the presence of many extinctions (i.e. in a realistic model) the probability p that a new species makes it into the sample is dependent on time. p will be small for a new species ‘born’ long ago, and tends to the sampling-rate for recent species.

There is much research on the shape (especially balance) of phylogenetic trees which concentrates on the topology only (topology = branching pattern, branch lengths ignored). I’ve done some myself, especially comparing age-dependent branching models with empirical phylogenetic trees. This line of research suggests that species have a lifetime distribution which is more L-shaped than the exponential. You could take an equal mixture of two exponentials with scales 1 and 10, or a Weibull with shape parameter around 0.7 as a rough idea of how L-shaped the lifetime distribution needs to be in order to reproduce the observed imbalance. A ‘refractory period’ makes trees more balanced than those from a Yule process, which are in turn more balanced than empirical trees.

My guess is that branch durations have a distribution more L-shaped than exponential. Extinctions and sampling make this less L-shaped. Variations in clock-rate transform this further, mainly adding a few very long branches that you wouldn’t otherwise expect. Getting data to test models of extinction seems very tricky, so while lots more studies like Venditti et al will appear in the next decade, people will probably still be arguing at the end of it.

Thanks for your comments: very interesting!

Yes, this is something they mention in the supplemental information: they deliberately don’t transform to calendar time, because that would be non-linear. But if that’s the case, they’re saying that the distribution of time to speciation isn’t exponential, because they’re transforming away from that (I’ll have to lay that argument out in a blog post).

This was noted by a commenter on an ID blog (unsurprisingly, they’re pro-evolution). If the distribution is exponential, and if extinction is exponential too, it doesn’t matter. I suspect it matters if either of these is false, but I haven’t looked into this other than try (and fail) to find some literature: as you write, people only seem to be interested in the topology.

“if the distribution is exponential, and if extinction is exponential too, it doesn’t matter.”

It’s not so simple. What you say is true for the birth-death process as a whole, but a phylogenetic tree is not a random instance of that process. Most branches are missing, and it must be conditioned on the observed number of tips. See “The conditioned reconstructed process” by Tanja Gernhard.

Sorry, I should clarify that I meant it doesn’t matter for determining which distribution of speciation time is correct (i.e. everything is still an exponential).

What I’m not sure about is whether non-exponential distributions of extinction and speciation times can lead to exponential distributions of observed speciation times (i.e. coalescent times). I suspect they can, but I’m not bright enough to see a proof.

I was thinking about this more last night, and I now think I’m wrong about the exponentials: I think they are still exponentially distributed, but the rate changes (and is proportional to the number of lineages). I should go back and look at coalescent theory though.

Ho hum.

There’s a lot more I could say, but this doesn’t seem the best place to say it. You might find my web site interesting:

http://www.indriid.com