Last week Nature had a feature on data sharing. Most of it revolved around the areas of science like bioinformatics that are producing piles of data very quickly. I work a lot with data that has been collected over long periods of time, which is rather different in character: a single datum might be the result of sampling hundreds of individuals over a year. Sharing that sort of data brings its own problems, even if the data have already been described in a publication. Because of the importance of this to me, I thought it was worth sharing my rambling thoughts on the matter. Hopefully someone can explain why I’m totally wrong – I have some severe biases in this area.
I’m interested in ecological and evolutionary processes, particularly in the wild. The sort of data that is really useful has accumulated over the years (e.g. since the 1930s). There is quite a lot of data like this out there, that has been collected by different groups. It would be great to be able to get free access to it, but things aren’t always that simple.
Some data, being collected.
There are some schemes to make data available: The Global Population Dynamics Database, Ecological Archives, and GBIF. The Long Term Ecological Research Network also makes the data it collects available.
The people who collected the data spent a lot of time doing so, and want to control access to it. If this just means sorting out how to acknowledge the data collectors, there will usually not be a big problem: we just have to find the right way of doing it. Sometimes the data might only be needed for a small part of a paper, so that co-authorship looks a bit extreme (for published data, at least). It can also create imbalances, if one person insists on being a co-author but others do not. Putting aside the personal conflicts, some form of consensus on what to expect would be a help. This is, of course, what a lot of the discussions about data sharing are about.
I have also been told on a couple of occasions that I cannot have access to some data because the people holding it still want to do some work on it. If they want to do the same thing that I want to, it’s perhaps understandable (but could we collaborate?). But sometimes it’s clear that they want to do something different. This is frustrating: it slows down science, which is not what we should be doing.
So, what are the solutions? The ESA (Ecological Society of America) have a scheme for registering data, which is a start, but the data itself does not have to be deposited. Of itself it is not enough: the archive might not be used. It also does not solve the social problem of how to make availability the norm (and what the accepted way to acknowledge the data collectors is).
This is something where the journals could help. Some journals (like Nature) have clear rules on making data available, but I looked at a couple of ecology journals, and didn’t find any similar statements. Of course, what journals should say is going to be a sensitive matter: they can try to impose their standards in a top-down fashion, but if the community doesn’t accept them, then they’ll just publish elsewhere. Societies such as the ESA and BES could also thrash out a policy with their members: obviously this is better, but could only be enforced (I think) in their own journals.
Many of these problems are general to most fields of science that collect data. But ecology has a couple of quirks that, whilst not unique, are at least not common. The first is that there is a long tradition of data collection by individuals. One of the data sets my student has worked with was collected by an amateur outside his house. Similarly, I have worked with data collected over several years by a PhD student. There is also data that has been collected by institutes as part of their monitoring efforts. This creates a strong bond between a collector and the data. It would be great if they always had the attitude that their data could be used freely by anyone, but if I were to insist on this, I’d be stepping on toes. I don’t want to do this, partly because it’s not nice but also because I’m aware that if I’m using this data, I am (in some ways) parasitizing the people who collected it. This goes back to the problem of how to give credit for the collection, and how to interact with the data providers [1].
Another aspect is that there are a lot of data sets that are growing. There are a lot of monitoring activities which are on-going (e.g. the BBS (Breeding Bird Survey) in North America). These are not well suited to the static reporting in journals, or to static deposition of the data. A paper describing the data collection could be written, but it is inevitable that there will be changes in the protocols over time. Keeping track of these would be a mess, as they have to be chased all over the literature (yes, I’ve had to do this). Perhaps something more dynamic, but still recognised as a formal publication, is needed. This seems to be the direction that GBIF is going in, as well as the National Biodiversity Network in the UK [2].
Some more data, refusing to be collected.
As someone who’s barely involved in collection of the data I’m biased, so my musings may be complete rubbish. I suspect if we all sit back, we’ll realise that we want the same thing (i.e. to learn about nature from this sort of data), and that the problems are ones of practice: how do we learn, whilst still making sure nobody’s contribution is glossed over (this stuff can be important for careers, after all).
[1] To those who have generously provided me with data, I’d just like to say that you shouldn’t feel worried about hassling me to work on it to provide you with what you want. I know I’m not perfect in that regard!
[2] I should really look at these in a bit more detail before commenting on them. But this is a blog post, so my being lazy is a condition of service.
I don’t know much about this, but to me it seems like as long as the appropriate people get acknowledged, it’s only a good thing if the data is used? What is the major complaint? Are people afraid of getting scooped, or rather do they think letting people use other people’s data too easily will end in a situation where no one will bother collecting the data in the first place?
It would be nice if someone who collected that sort of data could say. Some of it is the fear of getting scooped, but I think there is also an element of possessiveness: for some people this is their life’s work, so it’s understandable.
‘Ere, aren’t you s’posed to be packing?
I’m tempted to suggest that the funding body should decide how the data they paid for is made available to others. The Hudson’s Bay Company have been reasonably open about some of theirs over the years; national funding bodies could come up with some sort of embargo system to allow the data producers/collectors a decent stab at analysis.
And if people are worried about being scooped, they can examine my PhD aphid experiment notebooks for a quick course in preparing data in gobbledygook. Bletchley Park wouldn’t have a hope of getting their heads round it.
Since we are ‘rambling’ here, some parts of the discussion remind me of the problem of negative and unpublished data. Sometimes you get the feeling the big labs are holding back quite a lot of data, for several reasons.
One is that the data consists of negative results and there is no percentage in publishing that for certain people.
The other reason could be that negative data is actually also knowledge. It shows you where not to go. And putting effort into publishing it or dumping it in a database so that other people don’t put effort in pursuing a similar dataset seems almost counterproductive for the lab of origin. Sometimes it almost seems like some people take pleasure in certain groups taking the ‘wrong’ road.
Could one of the major problems just be that we are not a big happy family but all have to compete for the same resources, and in this sense our actions are predetermined by our own biological limitations? We don’t share because mostly there is no immediate benefit for all, just for some.
I recently had a publication based on freely available data in PubMed and the SCImago database. It was a bit of a liberation. For the first time in my life I didn’t need to create the original dataset myself and could just limit myself to the analysis of the data. It was a great feeling. Compare that to my regular ‘day job’ of being a molecular biologist. A huge amount of effort goes into creating limited datasets. And if you are lucky you have something to analyze at the end of the month.
I have done work for a government organization that collects large quantities of ecological data. Being a publicly funded government organization, we were required to provide data to pretty much anyone who requested it, for whatever reason; it was my job to share data or even collect data for others. One potentially unforeseen problem I have noticed is…
Usually the organization requesting the data was legitimate and had good intentions, but certain details and specifics of the data were often left out and/or misinterpreted. While I believe data sharing to be a good thing, there is a certain intimacy involved with collecting data that provides a true understanding of what it means [no written description is a substitute for actually being there]. While I have tried to be specific about the strengths and weaknesses of a dataset, the data will inevitably be used to say something it wasn’t meant to. A specific example of this is a temporal study of habitat condition we did, where many sample sites were set up to examine change. While the spatial aspects of this study are weak, there is a strong tendency for people who use the data to create averages and maps and spatial relationships and things of the sort.
On a personal level I do science because it does good, so it really didn’t bother me that I rarely had the opportunity to be published, although it could be very discouraging for some people.
Thanks for your input, Joe (and sorry for taking so long to reply!). The worry about mis-use of data is not something I’d thought about, but I can see that it can be a concern.
I guess this is a bigger issue than in (say) genomics, because the type of data we collect is so varied that it can be used and mis-used for different things.
I’ve been thinking a little about this, Bob. I am involved with genomics and transcriptomics and some other -omics sets of data. My personal philosophy has been to put the sets out there as soon as possible. In our limited experience, that has generally meant providing some metadata about how they were collected and the biological samples from which they were derived, and wondering if it all will serve some purpose. This data can be entered, for example, on the Gene Expression Omnibus and then embargoed until, for example, a first publication making use of that data has appeared. Thereafter, anyone else using the dataset is honor-bound to cite the first study (usually authored by those who collected the data).
At the conference I was just at, though, data sharing was really a question. We want to participate in a large international collaboration. This entails federating dozens, if not hundreds, of clinicians who are involved in collecting patient questionnaires and DNA samples. They are collaborator-competitors. Of the clinical and genomic data, how much, to whom and when? Obviously not all can be made 100% public. But perhaps it is not exploitable if not. Would questions of privacy be resolved if the data, like for the US Census, were made public in 5 years? in 50? in 100? The proposal was that all data within the consortium should be available to all consortium members. Is that the only way to go? Should a member contributing five patients have as many access rights as another contributing 500? Tricky stuff.
Data sharing is indeed a pressing issue these days across different fields of science. There are questions about the way data is made accessible, how the data is collected, and how much data can be made accessible. There is a recent project I found out about called NIF (Neuroscience Information Framework) that strives to tackle these questions to help neuroscientists and students. I know this does not relate directly to ecology but it is relevant to the data sharing issue. What are your thoughts on this project? Would problems be solved if an ecological information framework were created?
Check out the Neuroscience Information Framework at:
http://neuinfo.org/
The Ecological Society of America’s Science Office has hosted a series of workshops over the past 5 years exploring this topic. Readers of this post may be interested in our workshop reports, available at http://www.esa.org/science_resources/datasharing.php
A number of ecology journals have been active in helping launch Dryad, which is a digital library of data underlying published works. And ESA and BES are currently considering implementing a Joint Data Archiving Policy that was recently adopted by a number of evolutionary biology journals. Let your voice be heard to your society officers if you feel strongly about these issues. They need to hear about how much their membership values data sharing and how much of a leadership role they should take in signing on to, and funding, these kinds of efforts.
Thanks for mentioning that. I saw the announcement about data sharing in American Naturalist, but hadn’t followed it up. I’ll have to look at Dryad – I hadn’t heard of it.