Last week Nature had a feature on data sharing. Most of it revolved around the areas of science like bioinformatics that are producing piles of data very quickly. I work a lot with data that has been collected over long periods of time, which is rather different in character: a single datum might be the result of sampling hundreds of individuals over a year. Sharing that sort of data brings its own problems, even if the data have already been described in a publication. Because of the importance of this to me, I thought it was worth sharing my rambling thoughts on the matter. Hopefully someone can explain why I’m totally wrong – I have some severe biases in this area.
I’m interested in ecological and evolutionary processes, particularly in the wild. The sort of data that is really useful has accumulated over the years (e.g. since the 1930s). There is quite a lot of data like this out there, that has been collected by different groups. It would be great to be able to get free access to it, but things aren’t always that simple.
Some data, being collected.
There are some schemes to make data available: The Global Population Dynamics Database, Ecological Archives, and GBIF. The Long Term Ecological Research Network also makes the data it collects available.
The people who collected the data spent a lot of time on it, and want to control access to it. If this just means sorting out how to acknowledge the data collectors, there will usually not be a big problem: we just have to find the right way of doing it. Sometimes the data might only be needed for a small part of a paper, so that co-authorship looks a bit extreme (for published data, at least). It can also create imbalances, if one person insists on being a co-author but others do not. Putting aside the personal conflicts, some form of consensus on what to expect would be a help. This is, of course, what a lot of the discussions about data sharing are about.
I have also been told on a couple of occasions that I cannot have access to some data because the people holding it still want to do some work on it. If they want to do the same thing that I want to, that’s perhaps understandable (but could we collaborate?). But sometimes it’s clear that they want to do something different. This is frustrating: it slows down science, which should not be what we are doing.
So, what are the solutions? The ESA (Ecological Society of America) have a scheme for registering data, which is a start, but the data itself does not have to be deposited. In itself it is not enough: the archive might not be used. It also does not solve the social problem of how to make availability the norm (and what the accepted way to acknowledge the data collectors is).
This is something where the journals could help. Some journals (like Nature) have clear rules on making data available, but I looked at a couple of ecology journals, and didn’t find any similar statements. Of course, what journals should say is going to be a sensitive matter: they can try to impose their standards in a top-down fashion, but if the community doesn’t accept them, then authors will just publish elsewhere. Societies such as the ESA and BES could also thrash out a policy with their members: obviously this is better, but it could only be enforced (I think) in their own journals.
Many of these problems are general to most fields of science that collect data. But ecology has a couple of quirks that, whilst not unique, are at least not common. The first is that there is a long tradition of data collection by individuals. One of the data sets my student has worked with was collected by an amateur outside his house. Similarly, I have worked with data collected over several years by a PhD student. There is also data that has been collected by institutes as part of their monitoring efforts. This creates a strong bond between a collector and the data. It would be great if they always had the attitude that their data could be used freely by anyone, but if I were to insist on this, I’d be stepping on toes. I don’t want to do this, partly because it’s not nice but also because I’m aware that if I’m using this data, I am (in some ways) parasitizing the people who collected it. This goes back to the problem of how to give credit for the collection, and how to interact with the data providers.¹
Another aspect is that there are a lot of data sets that are still growing. There are a lot of monitoring activities which are on-going (e.g. the BBS (Breeding Bird Survey) in North America). These are not well suited to the static reporting in journals, or to static deposition of the data. A paper describing the data collection could be written, but it is inevitable that there will be changes in the protocols over time. Keeping track of these is a mess, as they have to be chased all over the literature (yes, I’ve had to do this). Perhaps something more dynamic, but still recognised as a formal publication, is needed. This seems to be the direction that GBIF is going in, as well as the National Biodiversity Network in the UK.²
Some more data, refusing to be collected.
As someone who’s barely involved in collection of the data I’m biased, so my musings may be complete rubbish. I suspect if we all sit back, we’ll realise that we want the same thing (i.e. to learn about nature from this sort of data), and that the problems are ones of practice: how do we learn, whilst still making sure nobody’s contribution is glossed over? (This stuff can be important for careers, after all.)
¹ To those who have generously provided me with data, I’d just like to say that you shouldn’t feel worried about hassling me to work on it to provide you with what you want. I know I’m not perfect in that regard!
² I should really look at these in a bit more detail before commenting on them. But this is a blog post, so my being lazy is a condition of service.