How Do Citations Vary?

Over at FriendFeed, Cameron Naylor asked the following on Friday:

Does anyone know whether the impact of papers in PLoS Biol is more or less evenly spread than those in e.g. Cell/Nature/Science – vague recollection that 20% of the papers in CNS make up 80% of the impact factor

The discussion walked around a bit, but nobody presented any actual data. So, yesterday when I was stuck cat-sitting by the computer (or, to be precise, The Beast was engaged in human-sitting), I thought I would chase the data up and see what it said.


My plan was to compare the distribution of the number of citations per paper for PLoS Biology, Cell, Science, Nature, PNAS and Proc. R. Soc. B. Cameron had mentioned the first four, and the latter two are high-quality general journals that PLoS Biology seems to be aiming to compete with.
I extracted my data from Web of Science. PLoS Biology was first published in October 2003, so I restricted my searches to 2004-2007 (2008 is going to be messy, because we haven’t left it yet). I restricted my results to articles (WoS also includes letters, book reviews etc.), and for PNAS, Nature and Science I restricted the search to biological subjects. It is well known that different subject areas have different citation rates, so I wanted to make the comparison as even as possible. It was a bit tricky to decide which categories to keep, as there is some overlap with medicine, and I am not sure where the line should be drawn. Hopefully it won’t affect the results too much.
Downloading took several passes: WoS only allows you to download 500 articles at a time, so I ended up with (for example) 26 PNAS files. And most of that information I don’t need: I only want to know the year and number of citations.
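For readers who want to try this step themselves, here is a minimal sketch of stitching several export files together and keeping only the two fields of interest. It is in Python rather than the R used for the real analysis, and it assumes tab-delimited exports with a header row in which `PY` holds the publication year and `TC` the times-cited count. The function name and the toy records are invented for illustration and are not WoS data.

```python
import csv
import io

def read_year_and_citations(export_texts):
    """Combine several tab-delimited exports into (year, citations) pairs."""
    records = []
    for text in export_texts:
        reader = csv.DictReader(io.StringIO(text), delimiter="\t")
        for row in reader:
            # Assumed column names: PY = publication year, TC = times cited.
            records.append((int(row["PY"]), int(row["TC"])))
    return records

# Two invented "files" standing in for the 500-record downloads.
file_a = "TI\tPY\tTC\nPaper one\t2004\t12\nPaper two\t2005\t0\n"
file_b = "TI\tPY\tTC\nPaper three\t2005\t7\n"
records = read_year_and_citations([file_a, file_b])
print(records)
```

With real files on disk, each file's contents would be read into a string (or the open file handle passed straight to `csv.DictReader`) in place of the toy strings.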
With data in hand I can start to play with it. Cameron’s question revolves around the spread of citations (I am assuming ‘number of citations’ is synonymous with ‘impact’). It should be obvious that the number of citations changes over time, so it is worth splitting the data into years.
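The "20% of the papers make up 80% of the impact" idea can be checked directly for any one journal and year. The sketch below is one possible way to do it, not a calculation from the original analysis: the function name `top_share` is hypothetical, and the ten citation counts are invented so that the arithmetic is easy to follow.

```python
def top_share(citations, top_fraction=0.2):
    """Fraction of all citations received by the most-cited `top_fraction` of papers."""
    ranked = sorted(citations, reverse=True)
    # Always count at least one paper, even for very short lists.
    n_top = max(1, round(len(ranked) * top_fraction))
    total = sum(ranked)
    if total == 0:
        return 0.0  # no citations at all, so no share to speak of
    return sum(ranked[:n_top]) / total

# Invented counts for ten papers in one journal-year (not WoS data).
toy = [120, 40, 10, 8, 6, 5, 4, 3, 2, 2]
print(top_share(toy))
```

In this toy example the two most-cited papers hold 160 of the 200 citations. Running the same function over each journal and year would give a single number per group to compare.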
We have a decent number of papers per journal:
Table: Number of papers in each journal in each year (only biological papers included for Nature, Science and PNAS)

Journal           2004   2005   2006   2007
PLoS Biology        47    126    102    109
Cell               239    284    294    297
Science            629    658    574    563
Nature             665    712    645    593
PNAS              2915   2910   3055   3194
Proc. R. Soc. B    499    331    399    393

If this table looks awful it’s because one of Matt’s mates broke it. He’s promised to call in a virtual carpenter.
PLoS Biology has the fewest, PNAS the most. 2004 was a light year for PLoS Biology, presumably partly because they were still starting up.
OK, let’s look at the data, starting by simply plotting histograms, to see the distributions:

(Figure: histograms of the number of citations per paper, for each journal and year.)
Note that I have log-transformed the x-axis (actually log(0.5 + x)). The distributions look fairly similar, and the means decrease for more recent years. There is perhaps some left skewness, particularly in PLoS Biology and Proc. R. Soc. B, and this seems to be related to the number of papers with zero citations (that’s the left-most bar, at a value below 0). One might interpret this as a class of papers that are only cited slowly; I suspect palaeontology is like that, for example.
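A quick sketch shows why the zero-citation bar lands below 0 on this axis. It assumes the natural logarithm (the default for `log()` in R). The base is not stated above, but any base greater than 1 gives a negative value for a paper with no citations. The function name is made up for illustration.

```python
import math

def log_citations(x):
    """Transform a citation count as log(0.5 + x), so zero counts stay finite."""
    return math.log(0.5 + x)

# A few example counts; only the zero count maps to a negative value.
for count in [0, 1, 10, 100]:
    print(count, round(log_citations(count), 2))
```

Adding 0.5 before taking logs is what keeps the uncited papers on the plot at all, since log(0) is undefined.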
But the large number of zeroes for PLoS Biology in 2006 looks odd. It turns out that this is because of an error in the WoS database. Several papers that year are listed twice, and only one got the citations. So, I had to write a bit of code to remove the duplicates. Moral: always check your data quality [1].
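The clean-up code itself is not shown here, so the following is only a sketch of one way the duplicates could be removed, written in Python. It assumes each record carries a title, a year and a citation count, treats two records as the same paper when their titles match after trimming and lower-casing, and keeps the copy with more citations. The function name and the toy records are invented.

```python
def drop_duplicates(records):
    """Keep one record per (title, year), retaining the highest citation count."""
    best = {}
    for title, year, cites in records:
        # Normalise the title so trivial differences do not hide a duplicate.
        key = (title.strip().lower(), year)
        if key not in best or cites > best[key][2]:
            best[key] = (title, year, cites)
    return list(best.values())

# Invented records: the second entry duplicates the first with zero citations.
toy = [
    ("A study of something", 2006, 25),
    ("A Study of Something ", 2006, 0),
    ("Another paper", 2006, 3),
]
cleaned = drop_duplicates(toy)
print(cleaned)
```

Keeping the larger count matches the case described above, where only one copy was cited. If both copies had picked up citations, summing the two counts would be the more faithful choice.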
After cleaning up the data, we get this:

It is difficult to see any huge changes in the variance when the mean changes (all the plots are on the same scale). If anything, the variance is slightly larger when the mean is smaller; again, it looks like this is because of the class of lowly-cited papers. But this is what we are primarily interested in, so we should explore it further.
There is some evidence that the variance changes with the mean, and there is plentiful evidence that the mean changes over time. One way to look at the relationship between variance and mean is to calculate the coefficient of variation: the standard deviation divided by the mean. Let us plot it:
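As a sketch of the calculation, the coefficient of variation for each journal and year could be computed as below. This is Python rather than the R used for the real analysis, it assumes the sample standard deviation (n - 1 denominator), and the journal name and citation counts are invented.

```python
from collections import defaultdict
from statistics import mean, stdev

def cv_by_group(records):
    """Return {(journal, year): (mean, sd, cv)} from (journal, year, citations) rows."""
    groups = defaultdict(list)
    for journal, year, cites in records:
        groups[(journal, year)].append(cites)
    summary = {}
    for key, counts in groups.items():
        m = mean(counts)
        s = stdev(counts)  # sample SD; needs at least two papers per group
        # The CV is undefined when the mean is zero, so report NaN there.
        cv = s / m if m else float("nan")
        summary[key] = (m, s, cv)
    return summary

# Invented counts: an older, well-cited year and a recent, lightly cited one.
toy = [
    ("Journal A", 2004, 10), ("Journal A", 2004, 20), ("Journal A", 2004, 30),
    ("Journal A", 2007, 1), ("Journal A", 2007, 2), ("Journal A", 2007, 6),
]
summary = cv_by_group(toy)
for key, (m, s, cv) in summary.items():
    print(key, round(m, 2), round(s, 2), round(cv, 2))
```

The same summary also holds the mean and standard deviation for each group, so it can feed a plot of one against the other as well as a plot of the CV by year.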

Visually, there seems to be a trend, with higher CVs later on (when the mean is smaller). PLoS Biology is not too dissimilar: its CV increases more rapidly until 2006 (i.e. the variation increases relatively quickly), but the CV is smaller in 2007 than we might expect.
Overall, that was not too clear. But, after a bit of thought I decided to simply plot the standard deviation against the mean:

This looks nicer. The points all lie along roughly the same line. Cell is lower, i.e. there is less variation than we would expect from the other journals: no doubt this is because it covers a smaller area of biology, so the slower-moving areas are excluded, and hence the mean is higher and the variance is lower. PLoS Biology is in the same area as PNAS (not a bad journal to be compared to). Proc. R. Soc. B is at the bottom: no doubt because it tends to publish more in the slower-moving areas of biology.
But what of the variation in PLoS Biology? To me it looks similar to the other journals (with the exception of Cell). It is lower in 2004, but I guess that may be related to the strategy for getting it started; I’m sure someone will be able to explain what was happening then. Aside from that, it looks like it’s just drifting along in the same noise as the others.
Finally, I can’t help but draw out some statistical lessons:

  1. Simply getting hold of the data is the best way to answer a question like this.
  2. Make sure the data is tidied up.
  3. Graphs are good: sometimes you don’t need anything else.
  4. This analysis is nothing more than data exploration. I looked at the data in several ways (including a couple I haven’t shown). By doing this, and keeping the question in mind, I could find out quite a bit about the data, and also an answer that (I hope) is fairly compelling.
  5. We could go on and do more modelling, i.e. get into what could be considered “proper” statistics. In this case, I suspect we don’t need to. Why spend time doing something when you already have the answer?

I did all my analyses in R, so if anyone wants the data and code, just ask and I can email it to you.

[1] Moral No. 2: it’s better to do it before you think you’ve completed the analysis. (sigh)


16 Responses to How Do Citations Vary?

  1. Mark Tummers says:

    I don’t want to bum you out after having written such a thoughtful and well researched blog entry, but isn’t Web of Science content copyrighted?
    I always got the impression that they aren’t usually very keen on other people spreading their stuff for free.

  2. Bob O'Hara says:

    Oops, thanks Mark. I hadn’t checked that. According to their Acceptable use policy I wouldn’t have a leg to stand on.
    So if you want the data, you’ll have to buy me a wheelchair.

  3. Brian Derby says:

    I am not sure if you are violating their terms and conditions. Surely you are analysing data (in the database) and providing an analysis. This is no different from (say) looking up material property data in an online database and using it in a published work. You cite your source but that does not give anyone access to the source without a fee. So as long as you hide the source data (but acknowledge the source) and only publish your results then you should be OK.

  4. Bob O'Hara says:

    Thanks, Brian. I guess what I’ve done is OK, but I can’t hand the data out.
    Strictly, it looks like it is unacceptable to put any references found in WoS into a paper or manuscript. Eek!

  5. Mark Tummers says:

    “I am not sure if you are violating their terms and conditions. Surely you are analysing data (in the database) and providing an analysis.”
    It’s not for personal use if you publish it on a commercial site.

  6. Brian Derby says:

    I think it is the same with any copyright issue in academic work. To be safe you ask for permission to use the data but as long as your output adds new content then you can publish it under your own copyright. You are not copying it if you are making fair academic use. In principle, what Bob is doing could be done by analysing the raw data from articles. There is some freeware out there that does it from analysis of Google Scholar data.

  7. Maxine Clarke says:

    Irrespective of all that, this looks very clever, Bob. It is past my bedtime now so I am unable to offer intelligent comment, but I am going to look at this again tomorrow.
    Thanks for this food for thought.

  8. Cath Ennis says:

    Yes, nice analysis. How long were you cat sitting for exactly?!

  9. Frank Norman says:

    So, is the answer “yes” or “no”? (I like to keep things simple).

  10. Mark Tummers says:

    “I think it is the same with any copyright issue in academic work. To be safe you ask for permission to use the data but as long as your output adds new content then you can publish it under your own copyright.”
    I sincerely hope you are right but Thomson isn’t an academic institution or publication. It’s a company that is running a commercial service.

  11. Bob O'Hara says:

    Um, what was the question, Frank?
    I think the answer to Cath’s question was “not long enough”. At least that’s The Beast’s answer.

  12. Brian Derby says:

    “I sincerely hope you are right but Thomson isn’t an academic institution or publication. It’s a company that is running a commercial service.”
    So does that mean you think Elsevier isn’t run on commercial grounds?

  13. Noah Gray says:

    Just as an aside, I believe the source of the “error” in the WoS database is actually not their fault, but rather due to the fact that PLoS dumped the printed version of the journal in 2006, and in the transition, some papers actually received page numbers from print as well as the fancy new “e” page number. One of my publications is duplicated in most databases and in fact, both versions have received citations (the original real page number version has received many more citations despite the e version being the only one in PubMed…):
    PLoS Biol. 2006 Nov;4(11):2065-2075
    PLoS Biol. 2006 Nov;4(11):e370

  14. Bob O'Hara says:

    Ah, thanks for the explanation, Noah. I was wondering how this could happen in a computerized age. I would go back and check, but that requires use of the mouse, and The Beast is presently sat on the mousemat, enjoying some Shostakovitch.

  15. Mark Tummers says:

    “So does that mean you think Elsevier isn’t run on commercial grounds?”
    I think it means that Thomson gains nothing from free distribution of content they normally sell. The Acceptable use policy is actually very clear on this matter.

  16. Heather Etchevers says:

    Bob, it’s really quite fantastic to read someone doing proper scientific work just for our amusement in a blog post. Well done, and thank you! Now I will feel even more vindicated about pushing PLoS Biology as a viable general-biology journal alternative to the others you name.
