h1. I’ve got a huge dataset
Last time I talked about networking we looked at how technology has enabled communication between scientists globally, and pondered a little on where it might take us in the future. There was a long and not always off-topic comment thread with some very interesting viewpoints being advanced. But what about the results and theories those scientists produce? Can they be networked?
Elsewhere in Nature Network (but I’m too lazy to search through the comment threads to link to it) I’ve remarked on how utterly cool it has been to be able to take a laptop into a seminar and, with a working internet connection, use information given by the speaker to generate new data about one’s own project. This strikes me as a good thing, almost as good as jelly pies.
At the Gordon Conference (which as I type these words is a fading memory, being rendered into oblivion by the noise of this Air New Zealand 747 allegedly flying to Auckland) last week the wireless signal did not extend into the auditorium, so I was unable to check my protein against information given in the talks directly.
But I have generated a database of my own that sits quite happily on the MacBook. It’s pretty bespoke itself although not as elegant as Jenny’s, I’m sure. I’d also love to be able to link it to annotation databases so that ontology and links to PubMed, OMIM etc. could be updated in real time (or at least each time I open the database).
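That kind of live annotation linking could be as simple as storing, for each gene, URLs that resolve against the public services when the database opens. A minimal sketch, assuming each record carries an official gene symbol — the helper name `annotation_links` and the example symbol are invented, though the NCBI E-utilities `esearch` endpoint is real:

```python
# Sketch: generating refreshable annotation links for genes in a local
# database. Assumes each record carries an official gene symbol; the
# function name is hypothetical, the E-utilities endpoint is NCBI's.
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def annotation_links(gene_symbol):
    """Return URLs that could be re-fetched each time the database opens."""
    pubmed_query = urlencode({"db": "pubmed",
                              "term": f"{gene_symbol}[Gene] AND splicing"})
    return {
        "pubmed": f"{EUTILS}?{pubmed_query}",
        "omim": f"https://www.omim.org/search?search={gene_symbol}",
    }

links = annotation_links("HNRNPA1")
```

Re-resolving the links rather than caching the results is what keeps the annotation current without any manual curation.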
What I have is an exon microarray dataset. I have knocked down my favourite protein (and the awake among you will know by now its identity) in cultured cells, made RNA from those cells and probed them with about a million different exon-level probes on a microarray chip. This is actually quite a big deal. I am able to test the level of expression of over two hundred thousand separate, well-annotated exons. That, if I’m clever enough to work it out, tells me how my protein affects splicing.
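One common way to get at splicing from exon-level data is a ‘splicing index’: normalise each exon’s knockdown/control change by the gene’s overall change, so that an exon moving differently from its gene stands out. A toy sketch with invented numbers, not my actual pipeline:

```python
# Sketch of a "splicing index" for exon-array data: an exon whose
# knockdown/control ratio differs from its gene's overall ratio is a
# candidate for regulated splicing. All numbers here are invented.
from math import log2

def splicing_index(exon_kd, exon_ctrl, gene_kd, gene_ctrl):
    """log2 exon-level change minus log2 gene-level change."""
    return log2(exon_kd / exon_ctrl) - log2(gene_kd / gene_ctrl)

# Gene-level expression barely moves (2000 -> 1900),
# but this exon drops fourfold after knockdown:
si = splicing_index(exon_kd=50.0, exon_ctrl=200.0,
                    gene_kd=1900.0, gene_ctrl=2000.0)
# A strongly negative index flags the exon as skipped after knockdown.
```

The normalisation is the point: a fourfold exon drop means nothing if the whole transcript dropped fourfold too.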
So of course, whenever someone talked about their favourite protein I would immediately look it up in my database and see if my experiment had affected it. Which is pretty cool: not only are we bringing scientists together in new and collaborative ways, we’re suddenly able to take their data and let them talk to each other. Of course, you could do this previously, if you had a brain the size of a planet, but the technology we have access to now makes the process a whole lot easier, faster, and more accurate.
I was able to sit opposite my poster, or troll around the session in general, checking out how the genes dear to other scientists were affected in the experiment I performed a month ago. Not only that, but I could pull sequences from the databases and look for the sequence elements predicted to be present if my protein was affecting their gene’s post-transcriptional regulation.
And it didn’t stop there.
In one particular (somewhat boring, I’m afraid to say) talk I sat next to someone from London, checking out her dozens of transcripts and mulling over the occurrence of various splice variants. It suddenly occurred to me that even if I didn’t have a good hit for her gene in my knockdown dataset, I could at least come up with an exon expression map from my controls. And not only for her gene, of course, but for every gene that is represented in my database. In the poster session I was able to demonstrate to one student that a particular splice variant does not occur in a common cell line (so that, if nothing else, if he wanted to study it in culture he should take care over his system). It also explained, perhaps, why searching on his computer for the sequence elements I am interested in was fruitless.
Again, none of this is really new. Sharing ideas and data is what scientists do -worst- best, but the ease and power with which we could do it in such a setting was striking. There were plenty of people tapping away on keyboards at the poster sessions — doing email and whatnot — but I’m pretty sure I was the only one actually generating new interaction networks. Generating and sharing new data. Thinking about how we might link these different things into each other and into the public databases that already exist.
And data want to be free. I said that I have a good control expression data set. I can tell you at a glance if your favourite gene is expressed in my favourite cell line, and even the splicing pattern if it has multiple transcripts (which, it appears, is pretty likely unless you’re working on histones). How much data, I wonder, exists like this? More to the point, how easy is it to get at such information? All the negative controls we do (please tell me that you do do negative controls) that are never published are data. They tell us how things behave in the absence of experimental perturbation.
We’re also leading, here, up to looking at the conflict between open science and the publishing imperative. I think it’s highly unlikely that I could publish my control sets in a way that’s meaningful to my CV, but I’d like the information to be available. Even my knockdown experiment is going to produce more information than I can sensibly analyse. Should that too be available, for data mining? Do I write the couple of papers that come out of this and then say “By the way, anyone else who wants to have a go, here are my pivot tables”? Do I, in that case, have any claim on those data?
So what we find is that while technology-enabled networking works on at least two levels (i.e., bringing scientists together, and enabling them to share their data in new and informative ways), we run up against some old problems. How do we make our data available? Individual websites are too ephemeral; public databases raise the question of who curates them. How do we agree on making them machine-readable (paging Peter Murray-Rust and friends)?
How do we prevent ourselves being gazumped? One of the conditions of a Gordon Research Conference is that unpublished results go no further than the conference site (which, by the way, is why I haven’t illustrated this post with a picture of my poster). I understand this limitation, and the rationale behind it. Some of the most exciting presentations were of unpublished work (the talks where people recapped what was published last year in Nature or wherever were interesting to the newcomer to the field, but you could sense a ‘meh, this is old news’ vibe): unfortunately I cannot blog about them.
But the new networking is going to challenge our ‘publish or die’ environment. The discussion at the Gap raises the possibility of continually evolving manuscripts, where an early publication date might not be as important as the process. Where, in fact, the very notion of ‘first author’ might become unimportant.
The whole scientific publication model is under revision. You might remember the N.I.C.E. results board in C. S. Lewis’s That Hideous Strength, where the most recent findings were instantly displayed. Are we moving towards that idea, one where peer review takes place ‘on the fly’? (And indeed, if we remove the obligation to be the first to publish in high-impact journals, what effect would that have on the quality of the data, or on the instances of scientific fraud? On the flip side, how do we then audit the quality of the published science?)
All this from the simple expedient of schlepping my database to a conference. From any angle, these look to be interesting times.
you’ve got a huge script font too 😉
It’s a level one heading! Not my fault it’s so huge!
🙂
I’m just going to try and break the internets here by some recursive linking.
spam-flavoured pizza?
Hmm. Missed this first time around. Richard, what is wrong with GEO? I’ve found it a perfectly reasonable repository for microarray data. My SAGE data is in there, and we’ve made use of fifteen different labs’ deposited data in that paper which I still can’t get published since last year. Mostly because it’s too much fun to play with lots of big datasets to focus on one single experimental followup. (Not an ideal strategy for publication, but) once it does get out there in a format useful to our postdoc’s career, the approach creates new subjects… much like your going around to posters and generating new “networks”. I do the same as you during conferences with people’s favorite genes and my datasets. It’s a great way to dissipate time, but so gratifying…!
Ah… thanks for that, Heather. What is ‘wrong’ with GEO is that I’m working outwith my own and the lab’s area of expertise, and we’re all groping around in the dark.
It’s quite isolated here in Australia, so you can see that the whole Science 2.0 thing is very close to my heart!