It’s the Wrong Data, Grommit!

David Basanta pointed out a provocative article in Wired, in which the author claimed that we won’t need to think about our data any more: we can just throw it into a big computer along with an SVM1 or a NN2. His reasoning is that we are getting so much data that any patterns in the data will leap out at us, if we just prod it in the right way. And of course the computer scientists have all the right prods. This will then totally change the way science is done – we won’t have to worry about thinking up hypotheses and experiments to test them, we just mine the computer for results, in a manner reminiscent of The Library of Babel. Or perhaps it is closer to Dave Langford’s The Net of Babel.

There are plenty of criticisms of this that can be made – this Ars Technica piece makes some good points, and hopefully a few people will pop up at the social network forum as well as David’s blog and comment. But I want to pick at one aspect that has been bothering me for some time.
The problem is the amount of data being produced nowadays. For example, all sorts of odd beasts are being sequenced (including a moggie called Cinnamon, and Craig Venter). There are also microarrays, which allow the biologist to assay the expression of thousands of genes in one sample.
These wonderful new toys have developed, in part, because of miniaturisation of the assay process, so one sample can be assayed for many things at the same time. The problem – and this is particularly severe with microarrays – is that this produces a lot of data from one sample. And because the cost of running one sample is high, only a few samples can be assayed; 50 samples would be a large experiment3. As a statistician, I find having so much data from so few replicates fascinating. There is data, but it is in the wrong place.
Genome sequences are similar: for any reasonably complex eukaryote we are only able to sequence a few individuals. So we are a long way from population genomics: for example if we want to get reasonable estimates of allele frequencies, we will have to sequence tens or hundreds of individuals.
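To put rough numbers on "tens or hundreds of individuals", here is a minimal sketch of the binomial standard error of an allele frequency estimate. It assumes a diploid organism, randomly sampled individuals whose chromosomes can be treated as independent draws, and a made-up true frequency of 0.2; the function name `allele_freq_se` is just for illustration.

```python
import math

def allele_freq_se(p, n_individuals, ploidy=2):
    """Binomial standard error of an estimated allele frequency.

    Assumes each individual contributes `ploidy` independent chromosomes
    (a simplification: no relatedness, no sequencing error).
    """
    n_chromosomes = ploidy * n_individuals
    return math.sqrt(p * (1 - p) / n_chromosomes)

# Hypothetical true frequency of 0.2, for a range of sample sizes.
for n in (3, 10, 100, 1000):
    se = allele_freq_se(0.2, n)
    print(f"{n:>5} individuals: SE = {se:.3f}, rough 95% interval = +/- {1.96 * se:.3f}")
```

Under these assumptions, three individuals give a standard error of about 0.16, nearly as large as the frequency being estimated, while a hundred individuals bring it down to about 0.03.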
Now, for some things this may not be a problem. The human genome project has been very useful, even if only a single sequence was produced. But it tells us nothing about variation between individuals and populations, so we don’t really know how much of the sequence is typical. Similarly, it is difficult for most labs to get the replication of microarrays to be able to look at anything but the simpler comparisons. And yet, the amount of data produced is massive. Somehow this just seems wrong.
For someone like me, the good news is that it means we need solid statistical methods to extract the most information we can out of the data. Mining isn’t going to work: we need to design our microarray experiments, and then use analyses that focus on the design. Of course, this also gives statisticians ample opportunity to sit around over a pint of Guinness complaining to each other about how biologists are running worthless experiments.
I guess my other point is that we have to be aware of how little data we have. Just because the data from our latest experiment takes up more disk space than the latest Gee blockbuster doesn’t mean there is a lot of useful information. The comparisons we want to make – between individuals, between treatments – may have little power. 10 million observations on 6 samples is still only information about 6 samples. Any differences will have to be large to be seen above the noise, and extrapolating beyond them is dangerous, because we cannot estimate how much variation there is outside of these samples. Like many things, it is not just the size that matters: the quality is important too.
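A small simulation makes the point about noise concrete. This is only a sketch under assumed conditions: 10,000 hypothetical genes, 3 control and 3 treated samples, normally distributed expression values, and no real difference for any gene. Each gene gets an ordinary pooled two-sample t-test, using 2.776 as the two-sided 5% critical value for 4 degrees of freedom.

```python
import random
import statistics

def pooled_t(a, b):
    """Two-sample t statistic with a pooled variance estimate."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    se = (pooled_var * (1 / na + 1 / nb)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se

def count_false_hits(n_genes=10_000, n_per_group=3, t_crit=2.776, seed=1):
    """Count genes that pass the test when there is no true effect at all."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(n_genes):
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        treated = [rng.gauss(0, 1) for _ in range(n_per_group)]
        if abs(pooled_t(control, treated)) > t_crit:
            hits += 1
    return hits

hits = count_false_hits()
print(f"{hits} of 10000 null genes look 'significant' at the 5% level")
```

Roughly 5% of the genes, a few hundred, should come out as hits purely by chance. Adding more genes per array only adds more of them; it does nothing for the 6 samples that each comparison actually rests on.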

1 Support Vector Machine. If you know what it means, can you go and re-write the wiki page in plain English, please?

2 Neural Net, not Nature Network. Although there is some mileage in finding out what we would do with a pile of data.

3 Based on a cursory glance. A quick challenge for any bioinformaticians: quickly dig out the sample sizes of the experiments in the GEO database. Or a similar database.

About rpg

Scientist, poet, gadfly

13 Responses to It’s the Wrong Data, Grommit!

  1. Maxine Clarke says:

    Good post, Bob. I had not thought of it from the point of view of “lots of data, not enough samples” before. That’s a really telling point, for making statistical sense of the data.

  2. Henry Gee says:

    Pish and tosh. That the computers will tease the patterns out for us so that we won’t need models or hypotheses, that is. Not your post, Bob, which is thought-provoking.
    Why tish and posh push and tush? Because someone has to be there, with a pint of Guinness or otherwise, to decide whether the pattern tells us anything, and because the last step in any statistical procedure is always a judgement call. And no machine can do that. It’s a feel thing.

  3. Bob O'Hara says:

    Henry, that would have been my other post. But I thought that there would be plenty of other people wanting to make that point, so I used this as an excuse to go on about one of my own little hobby-horses.

  4. Neil Saunders says:

    Your very own Euan Adie wrote a nice tutorial on using SVMs. Basically, it’s a multivariate statistical technique for classifying samples into groups based on their features: where a feature is just a set of numerical values that describes the sample.

  5. Heather Etchevers says:

    Bob, I’m happy to see that the footnoting markup is as haphazard for you as it was for me.
    Okay, the main point is that your post rings so true for me, because I just received a collaborator’s draft about experiments using an improved SAGE technique.
    The problem here is that SAGE or any of its newer variants is so bloody expensive that there are no replicates, generally. You try to pull information out of relative levels of expression from one gene to another within the same individual. And then – here is where it sticks for the paper in question – you need to test what amounts to a hypothesis about gene relationships with a different technique. (And “we” haven’t.)
    This draft is one big list of genes, presented in various subsets with no tested biological relevance. It’s also 92 pages long for the moment. No amount of “this is preliminary” is going to make up for that. I wish I could hide behind an editorial anonymity when I tell this person that “the paper is not acceptable in its present form” and the paper is too descriptive.

  6. Henry Gee says:

    Heather, that’s very sound advice. I was pleased to see, in capitals and upfront, the advice that authors should wait 24 hours before deciding how to respond to a letter of rejection.
    Editors (and I are one) are only human and are (or should be) only too aware of their human fallibility. We’re quite prepared to admit that sometimes we reject papers that we shouldn’t. So we are always happy to see letters of appeal provided they are considered, constructive, substantive and polite. It takes time to compose such a letter. By the same token, I sometimes wait 24 hours before rejecting a newly submitted manuscript which, on first glance, seems inappropriate, incomprehensible, or both. Usually my decision, after sleeping on it, is the same — to reject it. Authors appreciate editors making decisions quickly, but not so quickly that it looks as though we haven’t read the paper (which we have.)

  7. Heather Etchevers says:

    “not so quickly that it looks as though we haven’t read the paper (which we have.)”
    So say you;-) Sometimes I wonder.

  8. Henry Gee says:

    “So say you… ;-) Sometimes I wonder”
    I could not possibly comment…

  9. Brian Clegg says:

    Like many such ‘this will solve the problems of discipline X’ solutions, I suspect there’s a mix of truth and ‘got it seriously wrong’.
    As I think I’ve blogged elsewhere, back in the 80s someone came up with a computer program that was going to render programmers redundant. You just told it what you wanted, and it wrote the program for you. It didn’t change the world. Later we were told APL or fourth generation languages would change the IT world. They did their bit, but it wasn’t earth-shattering.
    I suspect the same here. Cloud computing will find some interesting stuff – but it isn’t going to make any great change to the way science is undertaken. I think one of the problems, reflecting your remarks about the volume of data, is that we sometimes forget the difference between searching, processing and understanding. Google can return vast numbers of search responses in a fraction of a second – but that doesn’t necessarily help you solve a problem without interpretation.

  10. Bob O'Hara says:

    Heather – after extensive observation, I have discovered that your strategy should be to draw some pie charts. I have yet to establish if anyone finds them informative, but it seems to be The Done Thing.
    Henry – you or one of your colleagues once rejected a manuscript over a Finnish lunchtime. The authors were rather proud of the speed.

  11. Henry Gee says:

    Depends how long Finns take to have lunch. Without knowing this, the as-yet-fictional proverb ‘As long as a Finnish Lunchtime’ could refer to something unusually long (‘A Country Mile’) or short (‘Faster than a Buttered Ferret up a Teflon Trouserleg’).

  12. Bob O'Hara says:

    The Finns are terribly efficient about such things – chat is not considered necessary. Alas for the food, neither is taste.

  13. Jarno Tuimala says:

    Noticing this entry today, I did a quick search of the GEO database and sampled roughly 10% of the data series there. The number of samples varied between 1 and several thousand, the median being 12 arrays per series.
    It seems that a “cursory glance” does a good job here! About 90% of the series had fewer than 50 samples.
    These figures could slightly overestimate the true sample sizes, since a series in the GEO database can contain data for several experiments.