Last Monday I wrote about one of those frustrating papers that asks an interesting question, but the more you look at it, the less sure you are of the results. In this case they might be right, but I think there are enough problems with the analysis that I don’t have any confidence in the conclusions.

This is going to get technical, so be warned (or go and read the latest Scientias).

So, you want to stay, then? Well, it would be good to read my previous blog post, so I don’t have to repeat myself. If you’re still interested, read on…

I found a few problems, each of which might make the results dubious, so let’s take them one by one and then I’ll sum up what I think the paper says (and what can be said). I must also thank Graham Jones for his discussion on my original post: it was very useful and clarified some of the points I raise below.

## But they look the same

One problem with comparing the distributions, as the authors do, is that they are almost indistinguishable:

> with the exception of the normal, these statistical models can produce almost indistinguishable densities, but imply different modes of causation. For example, the Weibull density simplifies to the exponential density as a special case when α = 1 and β = 1. The variable-rates model is the convolution of multiple exponentials whose individual rates are assumed to follow a gamma probability distribution – in this model α and β describe the shape and scale, respectively^{13}. If this gamma distribution is very narrow (small β), then the variable-rates model converges on an exponential.

So if their analysis were fair to all of the models, they should come out with roughly equal posterior probabilities. That they don’t (the exponential is hugely favoured) raises some warning flags, but we need something more solid than this.
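This near-equivalence is easy to see numerically. A quick sketch with scipy (the parameter values are mine and purely illustrative; scipy’s `expon` and `weibull_min` stand in for the paper’s parameterisations):

```python
# Illustrative comparison (my parameter values, not the paper's): an
# exponential density against Weibull densities with shape at and near 1.
import numpy as np
from scipy import stats

x = np.linspace(0.2, 5, 500)

expon = stats.expon(scale=1.0).pdf(x)
weib1 = stats.weibull_min(c=1.0, scale=1.0).pdf(x)   # shape = 1: exactly exponential
weib = stats.weibull_min(c=1.1, scale=1.0).pdf(x)    # shape just off 1: very close

print(np.max(np.abs(expon - weib1)))   # 0: the Weibull contains the exponential
print(np.max(np.abs(expon - weib)))    # small: the densities nearly overlap
```

With shape exactly 1 the Weibull *is* the exponential, and with a shape only slightly away from 1 the densities nearly coincide away from the origin, which is the whole problem.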

## How to penalise a model

As I mentioned in the previous post, I think the method used is biased towards the exponential distribution. Oddly, the authors are aware of the problem and try to fix it, but aren’t aware of the consequences.

The problem arises because they are comparing models with different numbers of parameters. It’s almost a truism that a model with more parameters will fit better than one with fewer: it simply has more flexibility. So if one just compares the fit of the models to the data, the models with more parameters will come out better. To stop this happening, we penalise the more complex models. All else being equal, we would prefer simpler models (Occam’s razor and all that). The problem is how to penalise them.

The authors of the paper use a Bayesian approach to the model fitting. For our purposes, they calculate the probability of a model given the data. This means calculating the following integral (θ is the parameters of the model):

P(Data | Model) = ∫ P(Data | θ, Model) P(θ | Model) dθ

from which the probabilities for the *i*^{th} model can be calculated:

P(Model_{i} | Data) = P(Data | Model_{i}) P(Model_{i}) / Σ_{j} P(Data | Model_{j}) P(Model_{j})

Taking the second equation first, P(Model_{i}) is the prior probability of the model (i.e. how likely we think it is before we see the data). We can adjust this, e.g. to make all the models equally likely or to penalise models we don’t like. The other term, P(Data | Model), is calculated in the first equation, and this can also act to penalise models. This is the marginal likelihood: the probability of the data given the model. What the first equation says is that it is averaged over the parameters of the model. In other words, we average the likelihood – P(Data | θ, Model) – over the different possible values of the parameters – P(θ | Model).

How does this penalise models? Well, the likelihood is usually only large for a small portion of the parameter space: for most combinations of parameters it is effectively zero. Say for one parameter it takes up 1/3^{rd} of the parameter space. For two parameters it might take up 1/3^{rd} of each dimension. If these dimensions are independent, the result will be (roughly) a circle of diameter 1/3, with an area of π/6^{2}: about 9% of the total parameter space. Or, in pictures:

Of course, it’s not quite this simple (because the likelihood changes too), but this gives the general idea. Anyway, what this means is that if I make the parameter space larger, even more of it has a likelihood that is roughly zero, so the average likelihood is lower. Hence a model with more parameters has a lower overall average – it is penalised by this extra parameter space^{1}.
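To make this concrete, here is a toy Monte Carlo sketch (my own construction, not the authors’ code): the marginal likelihood of a simple exponential model under a narrow and a wide uniform prior on the rate. Widening the prior dilutes the average likelihood, which is exactly the penalty described above.

```python
# Toy illustration (not the authors' code): the marginal likelihood is the
# likelihood averaged over the prior, so a wider prior dilutes the average.
import numpy as np

rng = np.random.default_rng(1)
data = rng.exponential(scale=2.0, size=50)   # simulated data, true rate = 0.5
n, S = len(data), data.sum()

def marginal_likelihood(width, m=200_000):
    """Monte Carlo average of the exponential likelihood over a U(0, width) prior on the rate."""
    rates = rng.uniform(1e-9, width, size=m)
    loglik = n * np.log(rates) - rates * S   # exponential log-likelihood
    return np.exp(loglik).mean()

narrow = marginal_likelihood(2.0)
wide = marginal_likelihood(100.0)
print(narrow / wide)   # roughly the ratio of the prior widths, ~50
```

The likelihood hardly moves outside a small region near the true rate, so stretching the prior from width 2 to width 100 just adds near-zero terms to the average: the model pays for every bit of prior space the data don’t use.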

Getting back to the paper, the authors have one single-parameter model (the exponential), which their results suggest is better than all the two-parameter models. Now, the obvious conclusion, given what I have written, is that this is an artifact of the change in dimensions. Well, it is, but the story is a bit more complicated, because the authors are aware of the problem and try to correct for it.

What they do is to run the analysis twice:

> To establish the range of these uniform priors we conducted a series of analyes [sic] with each model using priors on a 0-100 interval and then inspected the posterior distributions of the models’ parameters. … For these models, then, we chose our priors to be equal to these posteriors, with most on a uniform scale of width three.

I assume they did this for each model for each data set, but this isn’t clear. It’s also not clear exactly what they did: I assume they picked the extreme values from the posteriors and used those to determine the range of the priors in the “proper” analysis. But this is a terrible way to do it: the extremes are, well, extreme, and hence unstable. The posterior distributions are simulated: i.e. they draw a lot of values from the posterior. But had they drawn more values, they would tend to get extremes that are further out. So the amount by which they penalise depends on how many simulations they do. Unless they chose the number of simulations to make the distribution “right”, this is just wrong.
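How unstable sampled extremes are is easy to show with a generic sketch (a standard normal stands in for any simulated posterior; none of this is the authors’ actual output):

```python
# Generic illustration: the extremes of a simulated posterior depend on how
# many draws you take, so priors set from those extremes are unstable.
import numpy as np

rng = np.random.default_rng(2)
draws = rng.normal(size=100_000)   # stand-in for draws from some posterior

small_max = draws[:100].max()      # extreme from a short run
big_max = draws.max()              # extreme from a long run of the SAME posterior
print(small_max, big_max)          # the long run reaches further out
```

Same posterior, different run lengths, different extremes: any prior width set from the sampled maximum and minimum inherits that arbitrariness.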

What effect this has depends on which part of the paper you read. From the legend to Figure 1:

> In both approaches, we used Bayesian prior distributions chosen to favour the four two-parameter models over the one-parameter exponential (Supplementary Information).

Which is contradicted by the supplementary information:

> Our procedures for choosing the priors ensured that they were as narrow as possible for the two-parameter models, without constraining the parameter space the model could explore, and translates into the two-parameter models having less prior weight to overcome than if we had set priors from the usual position of relative ignorance. The average prior cost we assessed the two-parameter models translates to having to overcome a ‘debt’ of about 1.1 log-units. That is, to perform better than the exponential the two-parameter model would need to improve the log-likelihood by this amount. This is less than a typical likelihood ratio test, which uses a criterion of 1.92 log-units per parameter, or an Akaike Information Criterion test which penalises models 2 log-units per parameter.

I think the supplementary information (which nobody reads) is correct: the 1.1 comes from assuming the parameters are independent (which they look at, at least informally), so that the difference in the log-likelihoods is log(*r*) – log(*π*), where *r* is the radius. log(*π*) ≈ 1.14, which is their 1.1 log-units.
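The numbers quoted above are easy to verify (the 1.92 is half the 5% critical value of a χ² with one degree of freedom):

```python
# Checking the quoted penalties: the paper's 1.1 log-units and the likelihood
# ratio test's 1.92 log-units per parameter.
import math
from scipy import stats

debt = math.log(math.pi)               # ~1.14 log-units: the "1.1" debt in the text
lrt = stats.chi2.ppf(0.95, df=1) / 2   # ~1.92 log-units: the likelihood ratio criterion
print(debt, lrt)
```

So the arithmetic is internally consistent: the prior “debt” really is smaller than the standard penalties, which is the point being made about under-penalising.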

So, the authors penalise but not enough. What makes this worse is that when they present the results, they only show the best model. The exponential is still favoured (on average: sometimes it will be worse because the sampled extremes are wider apart), so if there is really no difference between the models, it will tend to come out as being better, simply because it isn’t penalised enough.

The upshot of this is that the results that are presented are simply worthless for deciding whether the exponential is better: we would expect to see this if there was no information in the data (sometimes the other models will do better because by chance they do fit better^{2}). What we need to see is how strongly the exponential is favoured: a histogram of the posterior probabilities for the exponential would be a simple start.

## A Way to Measure Time

For me that would be enough to decide that the paper is uninformative, although it could perhaps be rescued. But why stop now?

The authors’ claim is that speciation occurs at a constant rate. They write:

> We suppose there are many potential causes of speciation, including environmental and behavioural changes, purely physical factors such as the uplifting of a mountain range that divides two populations, or genetic and genomic changes.

and it is how these causes combine that they are interested in. But what time scale are they thinking of? It’s not the one you would expect: the one measured in hours, minutes and seconds. Instead, the authors do something different: they use genetic distance, the number of substitutions (i.e. changes) in the DNA sequence: the more changes, the greater the distance. If the rate of substitution is constant, then this is equivalent to calendar time. And presumably calendar time is the scale on which they are thinking the Red Queen operates. After all, the rise of mountains is not connected to the rate of substitution in a species. But they then admit that the rate isn’t constant:

> We used genetic branch lengths in preference to branch lengths scaled to time because all temporal-scaling methods introduce uncertain nonlinear transformations

If they think that the temporal-scaling methods are valid, then they are saying that finding an exponential distribution in their distances equates to a non-exponential distribution in real time: that’s what the non-linear transformation means. Now, they’re uncertain what transformation is correct, which is fine, but that ignorance propagates: it means that they don’t know what the real distribution of speciation times is, so they can’t conclude anything about it. Odd.
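To see why an uncertain nonlinear transformation matters, here is a toy sketch (entirely my own; the clock g = t^{γ} with γ = 2 is made up): if genetic distance is a nonlinear function of calendar time, exponentially distributed distances imply non-exponentially distributed times.

```python
# Toy illustration: a made-up nonlinear "clock" g = t**gamma turns exponential
# genetic distances into Weibull-distributed calendar times.
import numpy as np

rng = np.random.default_rng(3)
g = rng.exponential(scale=1.0, size=100_000)   # exponential in genetic distance
gamma = 2.0
t = g ** (1 / gamma)                           # invert the clock: calendar time

# t is Weibull with shape gamma: its mean is Gamma(1 + 1/gamma) = sqrt(pi)/2
# (about 0.886), and its density vanishes at zero, unlike the exponential's.
print(t.mean())
```

So an exponential on one time scale is generally not an exponential on the other, which is exactly the propagated ignorance complained about above.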

## Ignoring Extinctions

This is something I first saw raised by “Sooner Emeritus” on Uncommon Descent (an ID blog: Sooner Emeritus is one of the Loyal Opposition who hasn’t been banned from there yet. At least not banned under that name). The phylogeny that is constructed is based on the extant species: the ones that haven’t gone extinct. Do the extinctions matter? Initially I thought that they did unless the distribution was exponential. But after discussing it with Graham on the previous post, I’m now fairly certain that I was wrong, and that they always matter. The problem is illustrated below.

This is a coalescent tree: we look at these backwards, so at each generation the offspring “choose” their parent. Looking at it this way means we don’t have to worry about the extinct lineages. If we have a simple process where each parent has a random number of offspring, and a constant population size (i.e. the Wright-Fisher model), then going forwards we have (approximately) exponential extinctions and ‘speciations’: this is equivalent to the simplest Red Queen model of Venditti *et al*. So what happens to the branch lengths?

Well, digging into the literature (p333) we see that if we have *j* lineages, then the time until any pair coalesces is exponentially distributed with mean 2/(*j*(*j*-1)). But this means that the branch lengths do not have a constant mean: if they did, then the time to an event would be proportional to 1/*j*, not 1/(*j*(*j*-1)).
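These standard coalescent waiting times can be checked with a little exact arithmetic (standard results, nothing from the paper): the means 2/(j(j-1)) telescope to the familiar expected tree depth of 2(1 - 1/n), and the waits are wildly unequal across j.

```python
# Exact coalescent arithmetic: with j lineages the mean wait is 2/(j(j-1));
# summed over j the expected tree depth telescopes to 2(1 - 1/n).
from fractions import Fraction

n = 10
waits = [Fraction(2, j * (j - 1)) for j in range(2, n + 1)]
depth = sum(waits)

assert depth == 2 * (1 - Fraction(1, n))   # = 9/5 for n = 10
assert waits[0] > depth / 2                # the two-lineage wait alone is over half the depth
```

The deepest (two-lineage) wait dominates, so branch lengths near the root and near the tips have very different means, just as the text says.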

Now, I should be a bit careful: the evolutionary process I just outlined is extremely simple. For example, it assumes that ‘extinction’ and ‘speciation’ are balanced. If speciation is quicker than extinction, so that the population size grows like e^{βt}, then the rate of coalescence is 2/(e^{βt}*j*(*j*-1)) (where t is the time), which still depends on the number of lineages. The point here is that balancing extinction and speciation so that the time to speciation in the observed tree is constant is tricky: there’s no guarantee that it can be done.

The times to each speciation event are still exponentially distributed, but the mean changes. This doesn’t mean, though, that the overall distribution is exponential. The authors discuss this a bit, and show that if a branch is made up of *n* exponentially distributed bits, and *n* is geometrically distributed, then the total branch length is exponentially distributed. What this means in English is that there are two possible events: Speciation and Nothing. If the branch is made by having a random time to an event, and there is a constant probability that the event is Speciation, then the distribution is exponential. The problem with their tree is that the probability changes: the closer you are to the root of the tree, the smaller the chance of a speciation appearing in the tree: there may be more speciations, so that the overall rate is constant, but these aren’t seen in the tree because those lineages go extinct. So under a random scenario, the overall distribution of branch lengths isn’t exponential: it has more variance than that.
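The geometric-sum result itself is standard, and easy to check by simulation when the per-bit speciation probability is constant (p and the rate below are arbitrary values of mine); the point above is that in the reconstructed tree that probability is *not* constant.

```python
# Simulation check: a Geometric(p) number of Exponential(rate lam) bits sums
# to an Exponential(rate p*lam) total (p and lam are arbitrary choices).
import numpy as np

rng = np.random.default_rng(4)
p, lam = 0.25, 2.0
m = 200_000

n_bits = rng.geometric(p, size=m)                 # number of exponential bits per branch
totals = rng.gamma(shape=n_bits, scale=1 / lam)   # sum of k Exp(lam) bits is Gamma(k, 1/lam)

# An exponential with rate p*lam has mean and standard deviation both 1/(p*lam) = 2.
print(totals.mean(), totals.std())
```

Mean and standard deviation agree (a hallmark of the exponential), so the constant-probability case really does collapse to an exponential; once p varies along the tree, the variance exceeds the mean and the collapse fails.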

## The Authors’ Final Demonstration

Finally, the authors manage to show that they can’t tell what the distribution in the data is – possibly. What they do is remove the short branches, i.e. where the times to speciation are short:

> The exponential and variable-rates models expect very short branches but the lognormal does not. If very short branches are poorly estimated in the phylogenetic inference, this could bias results in favour of the exponential. To ensure that we did not have a bias towards short branches we used uniform priors on branch lengths at the phylogenetic inference step (it makes no difference to our results if we use the conventional exponential prior on branch lengths). Before fitting the statistical models, we then removed all branches with an expectation of having less than or equal to 0.5 nucleotide substitutions per branch. Table 3S shows the results with these branches removed – they do not qualitatively differ from the complete analysis and only the exponential model achieves more than the null expectation of fitting 20% of the datasets.

The exponential distribution has its mode at zero. Removing the short branches should remove this mode, and the “true” distribution should now have less probability at short times: the other distributions can accommodate this, but the exponential cannot. So either they removed so few branches that it doesn’t matter, or the analysis is unable to distinguish between a full data set and one with the small values chopped off.
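For what it’s worth, the exponential is also memoryless: above any cutoff it keeps exactly its original shape, just shifted, which may be part of why chopping off short branches changes so little. A one-line check:

```python
# Memorylessness: for an exponential, P(X > s + t | X > s) equals P(X > t),
# so the distribution above any cutoff s is the same exponential, shifted.
import math

lam, s, t = 1.3, 0.5, 2.0                             # arbitrary rate and cutoff
lhs = math.exp(-lam * (s + t)) / math.exp(-lam * s)   # P(X > s+t | X > s)
rhs = math.exp(-lam * t)                              # P(X > t)
print(lhs, rhs)
```

None of the two-parameter alternatives has this property, so truncation changes their shape but not the exponential’s: a truncated data set gives the exponential an easy ride.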

Overall, what does this say? Well, very little. What it does say, I think, is that there isn’t a strong signal that speciation times are non-exponential. But I don’t think we can conclude that speciation times *are* exponentially distributed. It might be that there is more information that could be given to persuade us. But, to be honest, I doubt it: I wouldn’t expect there to be a lot of information about the distributions in the data.

This is a general issue in fitting complex models to data. If we knew the true speciation times, we could easily say what the distribution is. But we only know them uncertainly. If we try to estimate the distribution, we can get the broad pattern: the mean and variance, and perhaps the skew away from symmetry. But really, we’ll get little else. The times are uncertain, so they can be shuffled around within their confidence bounds to fit the different distributions. This means that, unless there is a huge amount of data, distinguishing between similar distributions is (at best) difficult, and probably impossible.

Which is a shame.

^{1} For those who are interested, this is how BIC, the Bayesian Information Criterion, works.

^{2} This is a good excuse to plug a paper my group threw together a couple of years ago. The short version is that different data generated with the same mechanism can strongly favour different models.

Venditti, C., Meade, A., & Pagel, M. (2009). Phylogenies reveal new interpretation of speciation and the Red Queen. *Nature*, 463(7279), 349–352. DOI: 10.1038/nature08630

Amazing! That’s exactly what I was going to say!

Yeah, but my pictures are better.

wow, and i married this smart guy! if you think this is interesting, you should hear what he says when only i am listening! 😛

Typically interesting stuff, Dr Bob.

While we’re self-publicising, I may as well point out that I’ve been pushing a similar idea – that distribution shapes matter, even when means and variances are the same – albeit in terms of cascading extinctions in ecological communities.

So, distributions in extinction ‘time’ and extinction probability are ‘driven’ by ummm, other distributions. I think.

A lot more on branch lengths and species, and well worth the wait. I’m glad you found my previous comments useful. I completely agree with your criticisms of the paper (which I still haven’t read!). You described the coalescent tree, but usually people use a birth-death process for modelling macroevolution. The number of species is then a random variable, not a constant or a given function of time. A similar pattern still holds, though. If the extinction rate is close to the speciation rate, you get long branches (sometimes very long branches) near the root, and short ones near the present.

http://www.stat.berkeley.edu/~aldous/Research/Phylo/pictures.html

has some pictures of the kind of trees you get when extinction rate = speciation rate. Note that some go off the graph.

I have some code which simulates age-dependent branching processes with extinctions. I could run this and collect branch lengths… if I get round to doing this I’ll report back.

Someone should redo what Venditti did, only better. That is, use a program (eg BEAST) which estimates node times, and then look at the likelihoods of (topology plus node times) under various models. I think that leads to quite powerful tests, despite the uncertainties in the estimates.

Hi, Bob.

I was Googling for “log-likelihood ratio” (the *active information* measure of Dembski and Marks takes that form), and ended up on a Nature Network page with a comment by you. A couple of clicks brought me here for the first time. Imagine my shock at seeing “Sooner Emeritus.” I’m even more shocked to find that he said something non-stupid regarding Venditti et al.

Cheers,

Tom

Mike – the empirical problem is still estimating the distribution from noisy data. I guess it might be possible with a metacommunity, where there are extinctions and recolonisations to tell you about the extinction pattern.

Graham – it doesn’t surprise me that the results are the same: all the asymptotics drive a coalescent towards a birth-death process. Your suggestion for simultaneously estimating topology and branch length is what the authors did (sorry if I wasn’t clear).

Tom – I hope Clive doesn’t see this: you’ll be banned. Banned, I say. Although ~~you~~ he might get banned by Dembski for the latest bit of poking you’ve done at him (I see DiEb has just been threatened with bannination).

Bob, it was me that wasn’t clear. I never doubted that they simultaneously estimated topology and branch lengths. But (a) it would be better to estimate topology and node times (in calendar time), and (b) it would be better to compare with models for the joint distribution of topology and node times, i.e. for the whole tree, not just models for the distribution of branch lengths.

I wasn’t clear on your (b) – they actually do compare the models with the joint distributions.