Physicists have it easy. When they’re not talking about stuffing their hands into their equipment, they’re measuring their fundamental constants to 38 significant figures. Chemists too have a simple time – they get to make stinks and bangs with expensive toys. But at least they know exactly what they get. Even most biologists are lucky – the crowd who do the molecular stuff are like chemists, but a bit less stinky. But us poor ecologists, who have to work with the real world…
The problem with the real world is that it’s just so messy. I once ran a field experiment looking at mildew epidemics where I missed about a month of data because there was no epidemic: it simply didn’t rain (by the way, this was in East Anglia, i.e. near Cambridge, Eva).
But this is part of Life. One of the things I'm interested in in my research is how this messiness affects populations, and whether we can use messy weather data to predict how species will react to future climate change. This has obvious practical implications, but it's only worth doing if adding the weather variables to our models improves our predictions significantly.
For this reason I had a bit of an “eek!” moment yesterday when I flicked through the eTOCs in my inbox. In there was a paper suggesting that the whole business of fitting climate variables to time series is a waste of time. I couldn’t ignore it, now, could I?
The authors of the paper, Jonas Knape and Perry de Valpine from Berkeley in California, took 492 time series from the Global Population Dynamics Database. For each of them, they extracted estimated weather and climate variables from standard databases, at a 0.5˚ × 0.5˚ resolution (if I’ve read Wikipedia correctly, this would be about 50 km × 28 km at Frankfurt’s latitude). This is a fairly coarse resolution: it’s the scale we use when looking at patterns all over Europe. Anyway, they end up using maximum and minimum temperatures and rainfall for different seasons, plus two climatic variables: NAO and SO. NAO is the North Atlantic Oscillation, which summarises whether the European climate is warm and dry or cold and wet. SO is the Southern Oscillation, which does something similar for the Pacific, and is related to El Niño.
The authors created two collections of weather and climate variables: a small one with summer and winter weather variables plus NAO and SO, and a large one with four seasons’ worth of variables and more lags into the past (i.e. this year’s weather, plus last year’s, plus the weather from two years ago). The point behind this was that with lots of variables, something is bound to come out as “significant”, but this might be due to chance alone. The model can then be over-fitted: it describes the data it was fitted to very well, but is lousy at prediction because it has fitted the noise as well as the signal. This is a Bad Thing.
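To see why this is a worry, here’s a minimal sketch (my own toy simulation, nothing to do with the authors’ actual models): regress a short run of pure noise on a pile of equally random “weather” covariates, and an impressive-looking fit appears out of nowhere.

```python
import numpy as np

rng = np.random.default_rng(42)
n, n_cov = 15, 10                 # a short time series, a pile of covariates

y = rng.normal(size=n)            # "population growth rates": pure noise
X = rng.normal(size=(n, n_cov))   # "weather" covariates: also pure noise

X1 = np.column_stack([np.ones(n), X])   # add an intercept
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
fitted = X1 @ beta
r2 = 1 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)
print(f"R^2 from fitting noise to noise: {r2:.2f}")  # typically well over 0.5
```

Give the same regression ten times as much data and the illusory R² collapses towards zero; keep that pattern in mind for the results below.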
So Knape and de Valpine fitted these covariates to their 492 time series, selecting the best models with AIC (they also looked at model-averaged results, but their conclusions are pretty much the same either way) to come up with some “good” models for their data. They then asked how well these models perform: how much do they reduce prediction error, and how well do they do compared with randomly selected covariates?
Their results are summarised in the figure below. This plots the proportion of variance the weather covariates leave unexplained: if the covariates explain half of the variation in the time series, this value is 0.5; if the weather explains nothing, it is 1. So, low values are good. They plot this against the length of the time series, i.e. the amount of data.
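To make the metric concrete, here’s a back-of-the-envelope version in Python. I’m assuming the statistic is the ratio of residual variances (with covariates over without), and I’m using plain regression where the authors fit proper population dynamic models and select with AIC, so this is only a sketch of the logic, including the comparison against randomly generated covariates:

```python
import numpy as np

def residual_variance(y, X=None):
    """Residual variance of a least-squares fit; intercept-only if X is None."""
    n = len(y)
    X1 = np.ones((n, 1)) if X is None else np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return np.var(y - X1 @ beta)

def variance_ratio(y, X):
    """Unexplained variance with covariates / without: 1 = useless, 0 = perfect."""
    return residual_variance(y, X) / residual_variance(y)

rng = np.random.default_rng(1)
n = 30
weather = rng.normal(size=(n, 3))             # hypothetical weather covariates
y = 0.8 * weather[:, 0] + rng.normal(size=n)  # growth rates partly weather-driven

obs = variance_ratio(y, weather)

# Null comparison: the same statistic for randomly generated covariates
null = np.array([variance_ratio(y, rng.normal(size=(n, 3))) for _ in range(999)])
p = np.mean(null <= obs)  # how often random covariates do at least as well
print(f"variance ratio = {obs:.2f}, randomisation p = {p:.3f}")
```

A model that is genuinely picking up weather signals should beat most of the random-covariate fits; one that is over-fitted will sit comfortably inside the null distribution.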
As the amount of data increases (as n goes up), the amount of variation explained decreases: the weather variables don’t do as well in explaining the population dynamics. In addition, the dots are the data where the model doesn’t do significantly better than chance at explaining the variation (i.e. the randomised data does as well in reducing the variance). Both of these point to over-fitting: with less data and a pile of covariates, it’s more likely that something will fit well by chance. With more data, there is more information to winnow out these chance relationships, so they get excluded from the model, and the explanatory power decreases. And whilst the analyses with lots of covariates tend to explain the data better, they still tend not to work much better than chance.
The authors also found that NAO and SO weren’t flagged as predictors any better than chance, when compared with the weather variables. There’s a slight wrinkle in this: all of their data were from Europe and the USA, but NAO mainly affects Europe (Western Europe, and the UK in particular), whilst SO will mainly affect the Pacific coast. It would be interesting to see this analysis broken down by geographic area.
So things look pretty bad: weather variables can’t predict population dynamics very well. Fortunately, I don’t think things are quite so bleak (or perhaps I should write that, fortunately, I can find some reasons why this paper hasn’t torpedoed my own work). The approach the authors used makes the whole process a black box. Their weather variables are measured at a coarse scale, but we all know that weather can change over much smaller distances: this is the weather outside just now:
The Taunus (the hills in the distance) is covered in cloud, and is probably pretty wet and miserable. Those of us living in the river valley are a bit warmer, and much less wet (well, except for those living in the river). So, climate changes over much smaller distances, and there is often significant variation in topography, and hence climate, over a few kilometres, so measuring it at a finer scale might be much more effective.
Another problem is that it is not always clear what the right variable is. Are maximum and minimum temperatures the best? These are outliers, so perhaps mean temperature would be better. Perhaps maximum daily rainfall is better than total rainfall. There may also be critical times of year when a variable matters (e.g. for plants, a low minimum temperature in the spring, i.e. frost, could be critical). A generic search might not pick out these important effects. It might be that we can find methods to explore the data and pick out the crucial weather variables, but this would mean fitting many more models, so the over-fitting problem would be immense.
Plant ecophysiologists have a measurement they use called GDD5: Growing Degree Days above 5˚C. This is calculated by summing, over the year, the amount by which each day’s mean temperature exceeds 5˚C. It’s a bit of a heuristic measure, but it is based on biological knowledge of how plants grow (i.e. not when it’s cold). This sort of measure might work better. Similarly, my (now ex-) student and I have a paper online where we tackle the same sorts of issues with moth data. But we talked to a moth biologist first, and got some idea about which variables would be important. And we got some reasonable results that way.
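GDD5 is easy to compute. Here’s a sketch using the standard degree-day definition given above; the daily temperature series is made up for illustration:

```python
import numpy as np

def gdd(daily_mean_temps, base=5.0):
    """Growing degree days: sum of each day's mean temperature above the base."""
    excess = np.asarray(daily_mean_temps) - base
    return np.sum(np.clip(excess, 0.0, None))  # days below the base count as zero

# A made-up year of daily mean temperatures (in deg C) with a seasonal cycle
days = np.arange(365)
temps = 9.5 + 10.0 * np.sin(2 * np.pi * (days - 80) / 365)
print(f"GDD5 = {gdd(temps):.0f} degree-days")
```

The appeal is that the covariate encodes biology (warmth only matters once it is warm enough to grow) rather than being one more generic column in a shot-gun regression.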
So, I guess my take-home message is that the paper shows that shot-gun approaches to fitting weather data to time series don’t work very well. The devil will be in the detail. There are some easy things to do: use local weather, not something averaged over a large region. Deciding which weather variables to use is trickier, but some thought should help, as should expert knowledge and data from experiments or more detailed field measurements. Oh, and we really need lots of data: time series of over 50 years, please?
Knape, J., & de Valpine, P. (2010). Effects of weather and climate on the dynamics of animal population time series Proceedings of the Royal Society B: Biological Sciences DOI: 10.1098/rspb.2010.1333
Mutshinda, C., O’Hara, R., & Woiwod, I. (2010). A multispecies perspective on ecological impacts of climatic forcing Journal of Animal Ecology DOI: 10.1111/j.1365-2656.2010.01743.x