Over at FriendFeed, Cameron Naylor asked the following on Friday:
Does anyone know whether the impact of papers in PLoS Biol is more or less evenly spread than those in e.g. Cell/Nature/Science – vague recollection that 20% of the papers in CNS make up 80% of the impact factor
The discussion walked around a bit, but nobody presented any actual data. So, yesterday when I was stuck cat-sitting by the computer (or, to be precise, The Beast was engaged in human-sitting), I thought I would chase the data up and see what it said.
My plan was to compare the distribution of the number of citations per paper for PLoS Biology, Cell, Science, Nature, PNAS and Proc. R. Soc. B. Cameron had mentioned the first four, and the latter two are high-quality general journals that PLoS Biology seems to be aiming to compete with.
I extracted my data from Web of Science. PLoS Biology was first published in October 2003, so I restricted my searches to 2004-2007 (2008 is going to be messy, because we haven’t left it yet). I restricted my results to articles (WoS also includes letters, book reviews etc.), and for PNAS, Nature and Science I restricted the search to biological subjects. It is well known that different subject areas have different citation rates, so I wanted to make the comparison as even as possible. It was a bit tricky to decide which categories to keep, as there is some overlap with medicine, and I am not sure where the line should be drawn. Hopefully it won’t affect the results too much.
Downloading took a while. WoS only allows you to download 500 articles at a time, so I ended up with (for example) 26 PNAS files. Most of the information in them I don’t need: I only want to know the year and the number of citations.
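As a rough illustration of that step (not the R code used for this post), here is a minimal Python sketch that stitches several export files together and keeps only the year and the citation count. It assumes tab-delimited exports with a header row, and it assumes the columns are named `PY` (publication year) and `TC` (times cited); the file contents below are made up.

```python
import csv
import io

def read_wos_export(handle):
    """Return (year, citations) pairs from one tab-delimited export file."""
    reader = csv.DictReader(handle, delimiter="\t")
    records = []
    for row in reader:
        # PY = publication year, TC = times cited (assumed column names).
        records.append((int(row["PY"]), int(row["TC"])))
    return records

def combine_exports(handles):
    """Concatenate the records from several export files."""
    combined = []
    for handle in handles:
        combined.extend(read_wos_export(handle))
    return combined

# Two tiny made-up files standing in for the 500-record downloads.
file_a = io.StringIO("TI\tPY\tTC\nPaper one\t2004\t12\nPaper two\t2005\t0\n")
file_b = io.StringIO("TI\tPY\tTC\nPaper three\t2004\t7\n")

records = combine_exports([file_a, file_b])
print(records)  # [(2004, 12), (2005, 0), (2004, 7)]
```

With real downloads, the `io.StringIO` objects would be replaced by opened files, one per 500-record chunk.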
With data in hand I can start to play with it. Cameron’s question revolves around the spread of citations (I am assuming ‘number of citations’ is synonymous with ‘impact’). It should be obvious that the number of citations changes over time, so it is worth splitting the data into years.
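One direct way to put a number on Cameron’s "20% of the papers make up 80% of the impact" recollection is to compute, for each year, the share of all citations that goes to the most-cited 20% of papers. The Python sketch below is only an illustration of that idea with invented numbers; it is not one of the analyses shown in this post, and the function names are hypothetical.

```python
import math
from collections import defaultdict

def top_share(citations, fraction=0.2):
    """Share of all citations received by the top `fraction` of papers."""
    total = sum(citations)
    if total == 0:
        return float("nan")  # no citations at all: the share is undefined
    n_top = max(1, math.ceil(fraction * len(citations)))
    ranked = sorted(citations, reverse=True)
    return sum(ranked[:n_top]) / total

def top_share_by_year(records, fraction=0.2):
    """Split (year, citations) records by year and compute the share for each."""
    by_year = defaultdict(list)
    for year, cites in records:
        by_year[year].append(cites)
    return {year: top_share(c, fraction) for year, c in sorted(by_year.items())}

# Invented data: an uneven year (2004) and a perfectly even one (2005).
records = [(2004, 50), (2004, 20), (2004, 15), (2004, 10), (2004, 5),
           (2005, 4), (2005, 4), (2005, 4), (2005, 4), (2005, 4)]
shares = top_share_by_year(records)
print(shares)
```

In the invented 2004 data the single most-cited paper (the top 20% of five) takes half of the citations, while in the even 2005 data the top 20% takes exactly 20%.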
We have a decent number of papers per journal:
Table: Number of papers in each journal in each year (only biological papers included for Nature, Science and PNAS)
|Journal|2004|2005|2006|2007|
|---|---|---|---|---|
|Proc. R. Soc. B|499|331|399|393|
If this table looks awful it’s because one of Matt’s mates broke it. He’s promised to call in a virtual carpenter
PLoS Biology has the fewest, PNAS the most. 2004 was a light year for PLoS Biology, presumably partly because they were still starting up.
OK, let’s look at the data, starting by simply plotting histograms, to see the distributions:
Note that I have log-transformed the x-axis (actually log(0.5 + x), so that papers with zero citations can still be plotted). The distributions look fairly similar, and the means decrease as we get more recent. There is perhaps some left skewness, particularly in PLoS Biology and Proc. R. Soc. B, and this seems to be related to the number of papers with zero citations (that’s the left-most bar, at a value below 0, because log(0.5) is negative). One might interpret this as being a class of papers that are only cited slowly, e.g. I suspect palaeontology is like that.
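To make the transformation concrete, here is a small Python sketch (again, not the R code behind the plots) that applies log(0.5 + x) to some invented citation counts and tallies them into bins of width 1 on the log scale. The natural logarithm and the bin width are assumptions made for the example.

```python
import math
from collections import Counter

def log_citations(citations, offset=0.5):
    """Apply log(offset + x); the offset keeps zero counts finite."""
    return [math.log(offset + c) for c in citations]

def histogram(values, bin_width=1.0):
    """Count values per bin; each bin is labelled by its lower edge."""
    counts = Counter(math.floor(v / bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Invented citation counts, including two uncited papers.
citations = [0, 0, 1, 3, 10, 25, 120]
transformed = log_citations(citations)
bins = histogram(transformed)
print(bins)  # {-1.0: 2, 0.0: 1, 1.0: 1, 2.0: 1, 3.0: 1, 4.0: 1}
```

The two uncited papers land in the bin starting at -1, which is why the zero-citation bar sits to the left of 0 on the transformed axis.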
But the large number of zeroes for PLoS Biology in 2006 looks odd. It turns out that this is because of an error in the WoS database. Several papers that year are listed twice, and only one got the citations. So, I have to write a bit of code to remove the duplicates. Moral: always check your data quality.¹
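A de-duplication step along these lines could look like the Python sketch below. How the duplicates were actually identified is not stated here, so matching on a normalised title plus year is purely an assumption, as are the invented records; for each match, the copy with the most citations is kept.

```python
def drop_duplicates(records):
    """Keep one record per (title, year): the copy with the most citations."""
    best = {}
    for title, year, cites in records:
        # Assumed matching rule: same year and same title, ignoring
        # case and surrounding whitespace.
        key = (title.strip().lower(), year)
        if key not in best or cites > best[key][2]:
            best[key] = (title.strip(), year, cites)
    return list(best.values())

# Invented records: two papers are listed twice, with citations on one copy only.
records = [
    ("Paper A", 2006, 14),
    ("Paper A", 2006, 0),
    ("Paper B", 2006, 0),
    ("paper c ", 2006, 0),
    ("Paper C", 2006, 9),
]
cleaned = drop_duplicates(records)
print(cleaned)
# [('Paper A', 2006, 14), ('Paper B', 2006, 0), ('Paper C', 2006, 9)]
```

Paper B is a genuine uncited paper and survives, so this removes the spurious zeroes without hiding the real ones.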
After cleaning up the data, we get this:
It is difficult to see any huge changes in the variance when the mean changes (all the plots are on the same scale). If anything, the variance is slightly larger when the mean is smaller; again, it looks like this is because of the class of lowly-cited papers. But this is what we are primarily interested in, so we should explore it further.
There is some evidence that the variance changes with the mean, and there is plentiful evidence that the mean changes over time. One way to look at the relationship between variance and mean is to calculate the coefficient of variation: the standard deviation divided by the mean. Let us plot it:
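For anyone who wants to try the calculation, here is a minimal Python sketch of the coefficient of variation per journal and year. It works on raw citation counts and uses the sample standard deviation (dividing by n - 1, as R's `sd()` does); both choices are assumptions for the example, and the records are invented.

```python
import statistics
from collections import defaultdict

def summarise(records):
    """Mean, sd and CV of citations for each (journal, year) group."""
    groups = defaultdict(list)
    for journal, year, cites in records:
        groups[(journal, year)].append(cites)
    summary = {}
    for key, cites in sorted(groups.items()):
        mean = statistics.mean(cites)
        sd = statistics.stdev(cites)  # sample sd; needs at least 2 papers
        cv = sd / mean if mean else float("nan")
        summary[key] = {"mean": mean, "sd": sd, "cv": cv}
    return summary

# Invented data: an older, well-cited year and a recent, lightly-cited one.
records = [
    ("Journal X", 2004, 2), ("Journal X", 2004, 4), ("Journal X", 2004, 6),
    ("Journal X", 2007, 0), ("Journal X", 2007, 1), ("Journal X", 2007, 5),
]
summary = summarise(records)
for key, stats in summary.items():
    print(key, round(stats["cv"], 3))
```

In the invented data the 2007 group has the smaller mean and the larger CV, which is the kind of pattern to look for in the real plot.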
Visually, there seems to be a trend, with higher CVs later on (when the mean is smaller). PLoS Biology is not too dissimilar: it is increasing more rapidly until 2006 (i.e. the variation increases relatively quickly), but the CV is smaller in 2007 than we might expect.
Overall, that was not too clear. But, after a bit of thought I decided to simply plot the standard deviation against the mean:
This looks nicer. The points all lie along roughly the same line. Cell is lower, i.e. there is less variation than we would expect from the other journals: no doubt this is because it covers a smaller area of biology, so the slower-moving areas are excluded, and hence the mean is higher and the variance is lower. PLoS Biology is in the same area as PNAS (not a bad journal to be compared to). Proc. R. Soc. B is at the bottom: no doubt because it tends to publish more in the slower-moving areas of biology.
But what of the variation in PLoS Biology? To me it looks similar to the other journals (with the exception of Cell). It is lower in 2004, but I guess that may be related to the strategy for getting it started; I’m sure someone will be able to explain what was happening then. Aside from that, it looks like it’s just drifting along in the same noise as the others.
Finally, I can’t help but draw out some statistical lessons:
- Simply getting hold of the data is the best way to answer a question like this.
- Make sure the data is tidied up.
- Graphs are good: sometimes you don’t need anything else.
- This analysis is nothing more than data exploration. I looked at the data in several ways (including a couple I haven’t shown). By doing this, and keeping the question in mind, I could find out quite a bit about the data, and also an answer that (I hope) is fairly compelling.
- We could go on and do more modelling, i.e. get into what could be considered “proper” statistics. In this case, I suspect we don’t need to. Why spend time doing something when you already have the answer?
I did all my analyses in R, so if anyone wants the data and code, just ask and I can email it to you.
¹ Moral No. 2: it’s better to do it before you think you’ve completed the analysis. *sigh*