David Basanta pointed out a provocative article in Wired, in which the author claimed that we won’t need to think about our data any more: we can just throw it into a big computer along with an SVM1 or a NN2. His reasoning is that we are getting so much data that any patterns in it will leap out at us, if we just prod it in the right way. And of course the computer scientists have all the right prods. This will then totally change the way science is done – we won’t have to worry about thinking up hypotheses and experiments to test them, we just mine the computer for results, in a manner reminiscent of The Library of Babel. Or perhaps it is closer to Dave Langford’s The Net of Babel.
There are plenty of criticisms that can be made of this – this Ars Technica piece makes some good points, and hopefully a few people will pop up at the social network forum, as well as on David’s blog, and comment. But I want to pick at one aspect that has been bothering me for some time.
The problem is the amount of data being produced nowadays. For example, all sorts of odd beasts are being sequenced (including a moggie called Cinnamon, and Craig Venter). There are also microarrays, which allow the biologist to assay the expression of thousands of genes in one sample.
These wonderful new toys have developed, in part, because of miniaturisation of the assay process, so one sample can be assayed for many things at the same time. The problem – and this is particularly severe with microarrays – is that this produces a lot of data from one sample. And because the cost of running one sample is high, only a few samples can be assayed; 50 samples would be a large experiment3. As a statistician, I find having so much data from so few replicates fascinating. There is data, but it is in the wrong place.
Genome sequences are similar: for any reasonably complex eukaryote we are only able to sequence a few individuals. So we are a long way from population genomics: for example, if we want reasonable estimates of allele frequencies, we will have to sequence tens or hundreds of individuals.
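To put rough numbers on that, here is a minimal sketch (the allele frequency and the sample sizes are just illustrative choices of mine) of how the standard error of an allele frequency estimate shrinks as more diploid individuals are sequenced:

```python
from math import sqrt

def allele_freq_se(p, n_individuals):
    """Standard error of an estimated allele frequency from n diploid
    individuals (2n chromosomes), assuming simple binomial sampling."""
    return sqrt(p * (1 - p) / (2 * n_individuals))

# Illustrative case: a moderately rare allele at frequency 0.1
p = 0.1
for n in (2, 10, 50, 200):
    print(f"n = {n:3d} individuals: SE = {allele_freq_se(p, n):.3f}")
```

With only a couple of genomes the standard error is of the same order as the frequency itself; it takes tens to hundreds of individuals before the estimate becomes usefully precise.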
Now, for some things this may not be a problem. The human genome project has been very useful, even if only a single sequence was produced. But it tells us nothing about variation between individuals and populations, so we don’t really know how much of the sequence is typical. Similarly, it is difficult for most labs to get enough replication in their microarray experiments to look at anything but the simpler comparisons. And yet, the amount of data produced is massive. Somehow this just seems wrong.
For someone like me, the good news is that this means we need solid statistical methods to extract as much information as we can from the data. Mining isn’t going to work: we need to design our microarray experiments, and then use analyses that focus on that design. Of course, this also gives statisticians ample opportunity to sit around over a pint of Guinness complaining to each other about how biologists are running worthless experiments.
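To give a flavour of what I mean by analyses that focus on the design, here is a small sketch. Everything in it is made up (the gene count, the batch structure, the effect sizes), but the point is that each gene gets a model that reflects how the experiment was actually run, rather than being handed to a generic pattern-miner:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "microarray": 1,000 genes on 6 arrays, 3 control vs 3 treated,
# processed in two batches. All numbers are invented for illustration.
n_genes, n_arrays = 1000, 6
treatment = np.array([0, 0, 0, 1, 1, 1])
batch = np.array([0, 1, 0, 1, 0, 1])

# Simulated log-expression: a batch effect everywhere,
# plus a real treatment effect in the first 50 genes.
expr = rng.normal(size=(n_genes, n_arrays)) + 0.5 * batch
expr[:50] += 1.5 * treatment

# Design matrix that mirrors the experiment: intercept, treatment, batch.
X = np.column_stack([np.ones(n_arrays), treatment, batch])

# Per-gene least-squares fit; the treatment coefficient is what we care about.
beta, *_ = np.linalg.lstsq(X, expr.T, rcond=None)   # shape (3, n_genes)
resid = expr.T - X @ beta
df = n_arrays - X.shape[1]
sigma2 = (resid ** 2).sum(axis=0) / df
se_trt = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
t_trt = beta[1] / se_trt

print("largest |t| statistics for the treatment effect:")
print(np.sort(np.abs(t_trt))[-5:])
```

Nothing clever is happening here: it is just an ANOVA-style model fitted gene by gene, with the batch term soaking up a nuisance effect that a naive data-mining pass would happily confuse with the treatment.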
I guess my other point is that we have to be aware of how little data we have. Just because the data from our latest experiment takes up more disk space than the latest Gee blockbuster doesn’t mean there is a lot of useful information in it. The comparisons we want to make – between individuals, between treatments – may have little power. 10 million observations on 6 samples is still only information about 6 samples. Any differences will have to be large to be seen above the noise, and extrapolating beyond those samples is dangerous, because we cannot estimate how much variation there is outside of them. Like many things, it is not just the size that matters, the quality is important too.
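A back-of-the-envelope power calculation makes the point concrete. This is a sketch with numbers picked out of the air (three arrays per group, ten thousand genes, a crude Bonferroni correction), asking how large a standardised difference has to be before a two-sample t-test will reliably see it:

```python
import numpy as np
from scipy.stats import t as t_dist, nct

def two_sample_power(d, n_per_group, alpha):
    """Approximate power of a two-sided two-sample t-test for a
    standardised effect size d, with n_per_group samples per group."""
    df = 2 * n_per_group - 2
    ncp = d * np.sqrt(n_per_group / 2)          # noncentrality parameter
    t_crit = t_dist.ppf(1 - alpha / 2, df)
    return (1 - nct.cdf(t_crit, df, ncp)) + nct.cdf(-t_crit, df, ncp)

n_per_group = 3                                  # 6 arrays in total
scenarios = [("one gene, no correction", 0.05),
             ("10,000 genes, Bonferroni", 0.05 / 10_000)]

for label, alpha in scenarios:
    for d in (1, 2, 4):
        power = two_sample_power(d, n_per_group, alpha)
        print(f"{label}: d = {d}, power = {power:.3f}")
```

Even without any multiple-testing correction, a difference of one standard deviation is almost invisible with three samples per group; once you correct for ten thousand genes, only absurdly large effects stand a chance.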
1 Support Vector Machine. If you know what it means, can you go and re-write the wiki page in plain English, please?
2 Neural Net, not Nature Network. Although there is some mileage in finding out what we would do with a pile of data.
3 Based on a cursory glance. A quick challenge for any bioinformaticians: dig out the sample sizes of the experiments in the GEO database, or a similar database.
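For anyone who fancies footnote 3’s challenge, this is roughly how I imagine it could be done with Biopython’s Entrez module. I have not run it against the live GEO records, so the search term and the ‘n_samples’ summary field are assumptions to verify, not a tested recipe:

```python
# Rough sketch for footnote 3: tally sample counts of GEO series via NCBI Entrez.
# The 'GSE[ETYP]' search term and the 'n_samples' field are my assumptions about
# the GEO DataSets records; check them against real esearch/esummary output.
from collections import Counter
from Bio import Entrez

Entrez.email = "you@example.org"      # NCBI asks for a contact address

# Find GEO series (GSE) records in the GEO DataSets (gds) database.
handle = Entrez.esearch(db="gds", term="GSE[ETYP]", retmax=500)
ids = Entrez.read(handle)["IdList"]
handle.close()

# Fetch summaries in batches and count how many samples each series reports.
sizes = Counter()
for start in range(0, len(ids), 100):
    chunk = ids[start:start + 100]
    handle = Entrez.esummary(db="gds", id=",".join(chunk))
    for record in Entrez.read(handle):
        sizes[int(record["n_samples"])] += 1
    handle.close()

print("samples per series : number of series")
for n, count in sorted(sizes.items()):
    print(f"{n:5d} : {count}")
```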