What I wanted to talk about yesterday, before I got distracted, was a mix of things that came out of last Monday’s ALPSP meeting, a paper in PLoS One and this whole assessment schtick that we’ve got going on.
We know, don’t we, that the Impact Factor as a measure of an individual scientist’s productivity is deeply flawed. Some might go as far as to say it’s flawed fundamentally and should be ditched forthwith: after all it is prone to gaming, unduly skewed by review articles, does not demonstrate a link between the ‘quality’ of a journal and any given article therein, can reduce the measured effectiveness of a single article (if a seminal paper is cited in a review then the review rather than the original paper can be cited more frequently), has no meaning in fields (humanities, nursing…) where citations are not routinely used as a measure of impact, and is very, very slow. (Seriously. We’re looking at a two year gap at least between the original work being done and it being cited in any meaningful manner—there has to be the first paper published, experiments thought about and performed, results analysed, papers written, reviewed and finally another publishing round before you can measure it.) It is at least one step removed from the output it might be taken as measuring, relies on everyone else playing fair, and, just like a quantum physical experiment, perturbs in the act of measurement the very thing it is supposed to measure.
There have been attempts to massage the Impact Factor, and new formulations cooked up, such as the Hirsch Index (and its variants) or the Eigenfactor, but essentially these are all measuring not a researcher’s activity, nor even how good that activity has been, but rather how someone else has managed to exploit it. Let’s not even think about research activities that do not result in a classical ‘citable unit’: database annotations, talks, training of students, fixing kit, etc., etc. And when you start looking at this in terms of stakeholders, for example the Research Excrement Excellence Framework (&c.), you get into all sorts of difficulties because people want to be able to measure you in a pretty automated way in order to finish it before the heat death of the universe.
Clearly there is no substitute for actually reading an article to determine its importance
… which is probably why certain people last Monday were very careful to say “indicators” rather than “metrics”, and brings me onto article-level metrics and a paper on the impact of Wellcome-funded research.
PLoS ONE are trying to look at the impact (however you define that) in terms of individual articles. According to Peter Binfield, for each PLoS ONE article the aim
is to provide information relating to online usage, citation activity, blog and media coverage, commenting activity, social bookmarking, "star ratings" and "best of" picks as selected by academic experts, as well as other measures yet to be determined
In other words, you’ll be measured by what you write, not by who writes about what you wrote. And even that’s a bit tricky, because really, there is no substitute for “expert” review, as was elegantly demonstrated in PLoS One last week.
Liz Allen and colleagues at the Wellcome Trust analysed the output of a whole bunch of Wellcome-funded researchers, whether project or programme grant or fellowship holders. They appointed a panel, or college, of ‘experts’ to review papers published in 2005 and essentially got this college to answer the question “how good is this stuff, I mean, really?”—for nearly 700 papers. (Okay, they didn’t assess ‘non-citable units’, but it’s a start.)
The results are interesting. For a start, it turns out that the Impact Factor, or citation analysis generally, is a pretty poor indicator of an individual paper’s importance (or quality, or influence). We all kind of suspected this, but it’s nice to finally have some data.
The other thing, and here I declare my conflict of interest, is that the Wellcome’s expert college on the whole matched Faculty of 1000’s assessment. Yes, some papers were completely missed, but we think we know why and are working to make sure that in future we do spot more interesting things that appear in less well-known journals. We’re pretty stoked about this, actually.
You can’t beat human input. Not yet, anyway. What we’d really like to do is find some way of matching PLoS’s article-level metrics indicators with our F1000 Factor and synthesize a sensible measure that looks at the quality of the research as well as its influence and usage. Here would be a good place to draw your attention to Johan Bollen’s incredibly cool network graphs, which could provide hours of fun for all the family. And yes, I talked to Johan on Monday and yes, he does plan to slice those data at the individual author/research group level.
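Purely to make that "synthesize a sensible measure" idea concrete, here is the crudest possible sketch: z-score a handful of signals across a set of articles and take a weighted sum. The field names, weights and numbers below are all invented for illustration; this is emphatically not how the F1000 Factor or PLoS's article-level metrics are actually calculated.

```python
from statistics import mean, stdev

def composite_indicator(articles, weights=(0.4, 0.3, 0.3)):
    """Naive composite of an expert rating, citation count and online usage.
    Each signal is z-scored across the article set, then combined using the
    (entirely arbitrary) weights supplied."""
    def zscores(values):
        m, s = mean(values), stdev(values)
        return [(v - m) / s if s else 0.0 for v in values]

    expert = zscores([a["expert_rating"] for a in articles])
    cites = zscores([a["citations"] for a in articles])
    usage = zscores([a["downloads"] for a in articles])
    w_e, w_c, w_u = weights
    return [w_e * e + w_c * c + w_u * u for e, c, u in zip(expert, cites, usage)]

# Hypothetical article records -- every field name and value is made up.
articles = [
    {"expert_rating": 8.0, "citations": 12, "downloads": 3400},
    {"expert_rating": 6.0, "citations": 40, "downloads": 900},
    {"expert_rating": 3.0, "citations": 2, "downloads": 5200},
]
print(composite_indicator(articles))
```

The hard part, of course, is not the arithmetic but deciding which signals belong in there at all, and how heavily to weight expert judgement against raw usage.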
You may have noticed that we (Faculty of 1000) already ‘score’ articles. Seeing as we have these scores, you might imagine that we’d be able to, oh I don’t know, give scores to individual journals or institutions: perhaps even individual scientists. Now there’s an idea…
But you know, I do wonder if all these indicators and claims of unfairness, fairness, whatever, might be missing the point, somewhat. Whatever the metric you use, good science will out. I wonder if there are truly any brilliant scientists who have not got jobs, or not got funded, because the current indicators have missed them? And I’m sure you’ve all got stories you can tell me. To which I say,
the plural of anecdote is not ‘data’
Furthermore, we all assume that there is necessarily a link between past results and future performance, in opposition to what the investment adverts keep telling us. The Impact Factor—or any citation metric—is at least two years out of date. So I wonder if this holds when dishing out the cash for research? Now, I’m pretty open-minded, and a strong believer in experiment. I would really like someone to test my hypotheses. So I propose the following:
Take a pot of money. Tens, if not hundreds, of millions of pounds or dollars, doesn’t matter which. Divide it into four equal portions. Invite researchers to apply for the money in the usual way, and on receipt of the applications assign them randomly to one of four groups, A, B, C or D. Group A is judged in the standard manner of competitive grant applications. Group B is judged on the basis of the Hirsch Indices or Impact Factors of the PIs involved, and Group C uses something like an expert college in much the same way as the papers in the Allen study were judged (but looking at individual researchers rather than their papers). Applicants in Group D get the money randomly. We’d probably want to limit it to a 20% success rate for each group—someone else can work out the details.
And then, after five years, we assess, using the best methods available, the productivity of the researchers who were funded (and we could probably look at the careers of those who weren’t, too).
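To be clear about what the allocation step would involve, here is a minimal sketch, with made-up applicant IDs, four equal groups and the 20% success rate applied within each group; how groups A, B and C actually pick their winners is precisely the thing under test, so only the random group D is filled in.

```python
import random

def allocate(applicants, success_rate=0.2, seed=None):
    """Randomly split applications into four equal groups A-D and fund
    group D entirely at random. Groups A-C would apply their own selection
    method (panel, bibliometrics, expert college) to fund the same proportion."""
    rng = random.Random(seed)
    shuffled = applicants[:]
    rng.shuffle(shuffled)
    groups = {"A": [], "B": [], "C": [], "D": []}
    for idx, applicant in enumerate(shuffled):
        groups["ABCD"[idx % 4]].append(applicant)
    n_funded = {g: max(1, int(len(members) * success_rate))
                for g, members in groups.items()}
    funded_d = rng.sample(groups["D"], n_funded["D"])
    return groups, n_funded, funded_d

# 200 hypothetical applications: 50 per group, 10 funded per group at 20%.
groups, n_funded, funded_d = allocate([f"app{i:03d}" for i in range(200)], seed=42)
print({g: len(m) for g, m in groups.items()})
print(len(funded_d), "applications funded at random in group D")
```

Someone else can still work out the details; the point is only that the assignment itself is trivial to do and to audit.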
This is probably unethical, but potentially very interesting. I wouldn’t be surprised if there were a few surprises.
Allen, L., Jones, C., Dolby, K., Lynn, D., & Walport, M. (2009). Looking for Landmarks: The Role of Expert Review and Bibliometric Analysis in Evaluating Scientific Publication Outputs. PLoS ONE, 4(6). DOI: 10.1371/journal.pone.0005910
Note to self: think about integrating Streamosphere, etc. Will this be automatic in MT4, do you think?
No, but MT4 will wash the outsides of upstairs windows and vacuum under the couch.
And weigh out SDS.
without a mask.
Or gloves.
Paraformaldehyde, too? And titrate the solution? I could definitely use that feature.
It even warms the PBS to 60°C so it dissolves.
Thanks for the link (both here and at PLoS)! I came across an interesting site last week, which takes this “scoring” system one step further. From Anne-Wil Harzing’s website you can download a free program called Publish or Perish.
The idea behind this is to aid in job / tenure applications to demonstrate the impact of your research.
You end up with a detailed analysis as follows:
* Total number of papers
* Total number of citations
* Average number of citations per paper
* Average number of citations per author
* Average number of papers per author
* Average number of citations per year
* Hirsch’s h-index and related parameters
* Egghe’s g-index
* The contemporary h-index
* The age-weighted citation rate
* Two variations of individual h-indices
* An analysis of the number of authors per paper.
I analyzed my own “productivity” (obviously) and ended up with a bunch of numbers, and it took a lot of reading to figure out what any of them meant. What was also interesting (and this held for the other authors I searched, too) was that some of the “best” ranked articles often turned out to be reviews.
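For anyone else staring at that list and wondering what the numbers actually mean, here is a rough sketch of how two of them, Hirsch’s h-index and Egghe’s g-index, are computed from a list of per-paper citation counts; the citation counts below are made up.

```python
def h_index(citations):
    """Largest h such that h papers each have at least h citations (Hirsch 2005)."""
    cites = sorted(citations, reverse=True)
    h = 0
    for rank, c in enumerate(cites, start=1):
        if c >= rank:
            h = rank
        else:
            break
    return h

def g_index(citations):
    """Egghe's g-index: largest g such that the top g papers together
    have at least g**2 citations."""
    cites = sorted(citations, reverse=True)
    total, g = 0, 0
    for rank, c in enumerate(cites, start=1):
        total += c
        if total >= rank * rank:
            g = rank
    return g

papers = [42, 18, 9, 7, 6, 5, 3, 1, 0, 0]  # invented citation counts
print(h_index(papers))  # 5: at least 5 papers have >= 5 citations, but not 6 with >= 6
print(g_index(papers))  # 9: the top 9 papers have 91 >= 81 citations between them
```

Because reviews tend to rack up citations quickly, they inflate exactly these kinds of counts, which is presumably why they float to the top of the rankings you saw.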
My questions were:
1. Would anyone really understand what these numbers meant on your CV?
2. Do we actually need a scoring system for scientific papers?
3. Isn’t this just another scoring system, and how could you integrate these together with the following?
“is to provide information relating to online usage, citation activity, blog and media coverage, commenting activity, social bookmarking, “star ratings” and “best of” picks as selected by academic experts, as well as other measures yet to be determined”
You’re welcome, Barry.
I’m glad you mentioned PoP, as that was also brought up on Monday and… I’d forgotten. blush
It’s a great little program you can burn away hours on…
must … resist …
There was an article on Hirsch in Wired magazine
Has Hirsch’s Hirsch index gone rapidly up since his paper in PNAS?
bq. Either way, the numbers have the stark, uncluttered feel that scientists love. “I’m a physicist,” Hirsch says. “Some people put a lot of weight on subjective criteria, and I don’t like that at all.”
Oh good grief.
Richard, great idea on the experiment. It’s stunning how such an obviously flawed metric as impact factor is so widely accepted by large swathes of supposedly rational, critical thinking scientists.
Is there anything more depressing than submitting your CV for funding/job applications (as I am regularly at the moment), with strict guidelines stipulating that you must provide citation metrics for all papers, etc.? It pretty much assures you that the committee is only going to look at the numbers.
You can be a co-author, Darren.
The trouble is raising the money—it would basically mean a funding body saying “OK, what have we got to lose?” and realizing they could fund better science in the long term.
Love the idea of the experiment! I know a program director at NSF to whom I’ve sent the link. I’ve also sent it to a program director at the German DFG. The third person who came to mind is Richard Gordon of ‘baseline grant’ fame: “Gordon, R. & B.J. Poulin (2009). Cost of the NSERC science grant peer review system exceeds the cost of giving every qualified researcher a baseline grant. Accountability in Research: Policies and Quality Assurance 16(1), 1-28.” Let’s see what they say (if anything).
Ah, top show, Björn. Let’s see if they come knocking on my door…
The ongoing romance with the Impact Factor seems to be due to its overall appearance of simplicity and objectivity. That, and every other index is worse (especially when you’re a young scientist or your paper is new and hasn’t been cited a great deal yet).
Richard, is this idea of re-assessing quality post-publication simply just another method of peer review? As all of our papers have already been through peer review (ugly as it is), what is the marginal value of yet another round of it? This might be important given the opportunity costs involved in going through them all again.
Yes, Richard, such an experiment is definitely overdue! I have actually sketched out a grant proposal along these lines, for a two-step funding scheme on Europe and Global Challenges (Eur 50,000 in the first phase, 1,000,000 in the second, for which independent submissions are permitted), to be submitted in collaboration with Richard Gordon and Bryan Poulin (the authors of the above-mentioned study) as well as Brian Czech.

Since useful stats can only be gathered if there is a high number of projects, we foresaw projects of Eur 1,000 each – a scale inappropriate for full-blown research projects, but probably suitable for multiple small projects that allow graduate students a little bit of independent research, e.g. on a sideline of their PhD thesis. Due to the low overall budget, basically no money would be available for project administration, so the allocation procedure would have to be simple. We chose a competition, which would probably fall somewhere next to C in your scheme – going for random allocation would not have stood any chance of passing the peer review process for the call. Given that the funding agencies already have the data on group A (any other Eur 50,000 they have spent previously, perhaps even on peer review, as detailed in the above-mentioned paper), the aim of the project would be to test whether these multiple small-scale projects would result in higher impact per budget than the control group. The project should be conducted as an open science project, such that everybody can observe the results as they come in and see for themselves whether this model works at this scale.

Knowing that all this is unusual, we contacted the funding agencies that had issued the call and invited them to give some initial feedback. This turned out to be very discouraging, for three main reasons: (i) the focus of the call is on global governance, for which scientists like Gordon and me and economists like Poulin and Czech were considered a non-ideal match; (ii) the allocation of research funds was not seen as a global challenge; (iii) some internal policies of theirs that were not mentioned in the call. So we finally decided not to submit at this first stage. However, reason (iii) would not apply at the second stage, and the larger budget then (plus the additional time) would allow us to build a team that better fits their grid, and to make a stronger case for the proper allocation of research funds as a current challenge at both the European and the global level. So we definitely plan to submit a proposal for the second stage, and possibly elsewhere as opportunities arise, and we would welcome anyone interested in these matters to join us in working out such proposals together, preferably in the open, as in our wiki linked from above.
My html-formatted text did not go through (even though it worked in preview), so I pasted a properly formatted version onto my blog at http://ways.org/en/blogs/2009/jun/22/re_on_articlelevel_metrics_and_other_animals .
One important factor to control for in this experiment (if possible) would be the influence of other funding sources held by the investigators in the various groups. For example, it seems likely (based on the arguments above) that a PI with stellar impact factor metrics is also running a lab with a fairly large budget. Surely this would really skew the results in their direction?
Actually, on the same line of thought, it always strikes me as circular logic to judge previous grant funding as a measure of “research output” or achievement, as many assessment exercises seem to do. Surely this factor should be used as a denominator against which output (i.e. papers, patents, etc.) is judged, to truly gauge productivity?
Nat — no. Pre-publication peer review essentially tells you if the work is sound: done correctly, right experiments, not missing other work, etc. There may be some input into ‘importance’ at that stage, but that’s usually done editorially at the big journals. Once a piece of work is published, there’s no straightforward way of telling people that it’s actually important or just a ‘for the record’ thing. You have to know the field in most cases to ascertain that: and that’s the void F1000 attempts to fill.
Or did I misunderstand the question?
Daniel, there’s a lot to think about there. I think we need to look at serious money though. Five year fellowships I think would be the minimum.
Darren, good point. There’d have to be some very careful thinking about controls. And yeah, your second point is well-made.
I hope some of these ideas are picked up by the organizers of Science Online London 2009.
Subtle, Raf. Thanks.
Indeed, Raf!
May I also point to a related blog post on the Mendeley Blog: Changing the journal impact factor
Richard, yes, serious money is necessary, but let’s see what this would entail: for the stats to become useful, we would need at least N=10 projects in each of your i=4 groups, and preferably considerably more. Still, for statistical reasons, these N projects would have to be very similar in all other respects and thus would probably all have to come from a pre-selected and narrowly defined field.
Of course, the analysis could consider k fields independently if funding is available, but five-year research projects involving a PI and several other scientists and technicians easily amount to a million Euro. Do you think it is realistic to go for N*i*k millions with such a proof-of-concept project, which will certainly be met with high levels of suspicion, particularly since it will most probably be subjected to classical peer review itself?
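Just to put a rough number on that N*i*k: with the minimum N=10 and i=4 above, an assumed k=3 fields and an assumed million Euro per five-year project, the back-of-the-envelope total is 120 million Euro.

```python
# Back-of-the-envelope total for the full-scale experiment.
# N and i come from the comment above; k and the per-project cost are
# assumptions made purely for illustration.
N = 10                         # projects per allocation group (minimum)
i = 4                          # allocation groups A-D
k = 3                          # assumed number of independently analysed fields
cost_per_project = 1_000_000   # assumed cost in Euro of one five-year project

print(f"{N * i * k * cost_per_project:,} Euro")  # 120,000,000 Euro
```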
I admit 50,000 Euro is too little to provide really convincing data, but it may entice funding bodies to take another look at their records for group A (as the Wellcome Trust announced in the Allen et al. study you cited above) to see how they’ve spent their last couple of 50,000-Euro tranches (or what have you), and to reconsider the proposal.
All we need as a proof of concept would be to identify a set of conditions under which alternative allocation mechanisms (including random elements, if combined with carefully designed – and transparent – reporting mechanisms at the end of the project, perhaps involving some kind of “karma” in the sense used at http://en.wikipedia.org/w/index.php?title=Slashdot&oldid=297276354#Moderation ) perform better.
I am confident that this initial demonstration can be achieved with grants on the order of the 1,000,000 Euro in the program I mentioned, though it would of course be nice to have several independent studies of this sort – targeting partly overlapping disciplines, for instance, or different mechanisms of quality assessment. I personally do not think your groups A and B would differ too much but C has promise (particularly if it involves transparent mechanisms), while D was previously found to perform slightly worse than A for both grants and papers (see http://www.nature.com/nature/journal/v459/n7247/full/459641b.html ) but it could still be attractive as it basically involves no costs for the allocation itself.
In the long run, I think peer review won’t go away for big grants but it will have to become more transparent, and as the NSERC study cited by Björn shows, we may well be better off by doing away with it for smaller grants, for which the eligibility criteria would then be key (and hopefully involve something more author-centered than the Journal Impact factor).
Ach. Anyone got a copy of Cameron’s letter they could let me see?
Richard, nice post (and interesting paper)
bq. This is probably unethical, but potentially very interesting
all the best kinds of experiments are…. mwhahaha
Thanks Duncan. They called me mad, you know. Mad. I’ll show them.
bq. the plural of anecdote is not ‘data’
Apparently not so in the world of CAM
The engineers and physical scientists have also been discussing the random allocation of money in funding decisions recently.
I’ve been considering writing a post about the IF’s two-year “best before” issue; seems like you’ve saved me the trouble.
Basically, I agree with what you say – if an article really has an impact in changing the way a field works, for example, we certainly don’t expect to see any results strongly influenced by the article within 2 years. Of course, there can be other reasons papers get cited rapidly – breaking ideas, theoretical developments related to new experimental results, the LPU [0], mutual back-scratching, self-citation…
But it got me to thinking – there’s also a certain positive feedback between current IF and the choice of which journal people send their work to. So they’ll preferentially send their ‘best’ work to journals with the highest IF in their field. If published, that will reflect one community-based decision (editorial/peer review) on the current state of their work.
Web of Knowledge Journal Citation Reports also provide a 5-year IF stat (along with others, including the Eigenfactor), yet these seem to correlate rather highly with the IF anyway [1] – do these other measures really add anything new, or just add to the confusion?
[0] The Least Publishable Unit – break down your original research into the tiniest possible pieces to confuse readers / bump up your publication rate / increase self-citation… ahh, screw it.
[1] For the top 60 Ecology & Evolutionary Biology journals, 2- and 5-year IF: correlation = 0.712, p
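On footnote [1]: checking whether the 5-year figure tells you anything the 2-year one does not is a quick calculation once you have both columns from the JCR. A minimal sketch, using invented placeholder values rather than the real Ecology & Evolutionary Biology numbers:

```python
def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Placeholder impact factors, not JCR data.
if_2yr = [4.2, 3.1, 2.8, 5.6, 1.9, 3.3]
if_5yr = [4.8, 3.4, 3.0, 6.1, 2.2, 3.5]
print(round(pearson(if_2yr, if_5yr), 3))
```

A correlation of 0.7 or so means the two measures mostly move together, which rather supports the ‘adds to the confusion’ reading.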
Hm, I now know what to do on my vacation… try the program and see if I get happier than I do with the IF 😉
Overall, I do think that the easier the measure, the simpler it is to use (and, of course, the easier it is for it to leave important factors out of the equation). The IF and citation counts for each paper seem like an easy assessment of “importance”. Then again, all new papers are out of the competition, so I guess that biases things towards “older” scientists with already (much-)cited papers?! But that would be to say, as RG already said, “to those who have, more shall be given”… (or was that somewhere else?)
Having now seen Cameron’s letter, it looks like D (random assignment), once the cost of administering peer review/name your process is factored in, is the best bet…
Richard. I disagree, slightly. I think “Importance” is built into peer review. This is especially true in high end medical journals where you are instructed to take this into account.
This doesn’t of course rule out the possibility of very important findings being published in lower ranked journals. But it may take years for this to become apparent.
The problem remains that, without hindsight, one cannot tell whether a new paper will be important or not, other than through the field-relative impact factor of the journal. I’m not sure I understand how another layer of peer review will help clarify this.
What is the marginal value of two levels of peer review?
I think you’re right that good science will out (whatever the metric). You inspired me to write about impact factor boxing. Now where did I put my gloves?
A UK funding initiative that may be relevant here: Vitae innovate (up to £100,000 to support the career development of researchers in the UK). One could argue that the current system puts improper barriers in the way of such career development, and that the kind of system we are discussing here may be a solution, but this has to be tested. That sum, however, is also rather at the lower end of what we are aiming at.
Hah, Duncan–only just noticed that you cited my blog in that format. Now all we need to do is get indexed by PubMed…
Dick Gordon of baseline grants fame has asked me to post his comments here, since he is traveling without access:
Richard Gordon:
“I honestly don’t like the direction this is going. The issues Bryan Poulin and I raise have to do with evaluating granting agencies and their performance. The approach of finding “a measure of an individual scientist’s productivity“ is blaming the victim. It’s a game played by science administrators, part of their desire for power. It’s about time we ranked and measured granting agencies, not scientists, as to how much money they waste and opportunities for innovation they squelch. I don’t understand the relationship between this endless “measure of man” fetish and baseline grants. Leave scientists alone. They do not “perform” like a circus troupe. Why measure a scientist’s “performance” at all? Let’s focus on measuring performance of granting agencies. That’s where the 800 pound gorilla is in our room.”
Well, Björn, and Dick, I think that scientists do perform. They have a responsibility to the stakeholders to deliver. And it is important to grant agencies, to PIs looking for someone to hire, to many people, actually, to know who is good and performing and who is not.
Money goes in, something needs to come out. Something measurable: even if that measure is simply adding to the world’s knowledge. Otherwise we may as well just piss around on the internet all day.
Money
Sir Paul Nurse got into a bit of bother recently with what appeared to be some ill-considered remarks about science funding. That was, of course, an interview in a newspaper and some context and subtlety was lost. Mark Henderson of the Times blogged Nu…