I am sick of impact factors and so is science.
The impact factor might have started out as a good idea, but its time has come and gone. Conceived by Eugene Garfield in the 1970s as a useful tool for research libraries to judge the relative merits of journals when allocating their subscription budgets, the impact factor is calculated annually as the mean number of citations to articles published in any given journal in the two preceding years.
By the early 1990s it was clear that the use of the arithmetic mean in this calculation is problematic because the pattern of citation distribution is so skewed. Analysis by Per Seglen in 1992 showed that typically only 15% of the papers in a journal account for half the total citations. Therefore only this minority of the articles has more than the average number of citations denoted by the journal impact factor. Take a moment to think about what that means: the vast majority of the journal’s papers — fully 85% — have fewer citations than the average. The impact factor is a statistically indefensible indicator of journal performance; it flatters to deceive, distributing credit that has been earned by only a small fraction of its published papers.
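To see how that skew plays out, here is a minimal sketch in Python, using invented citation counts chosen only to mimic the pattern Seglen described (they are illustrative, not real journal data):

```python
from statistics import mean, median

# Invented citation counts for 20 papers in a hypothetical journal, skewed so
# that a few papers collect most of the citations (illustrative numbers only).
citations = [0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 4, 5, 6, 8, 10, 12, 15, 40, 75, 120]

impact_factor_style = mean(citations)   # what an IF-type calculation reports
typical_paper = median(citations)       # what a typical paper actually receives
below_average = sum(1 for c in citations if c < impact_factor_style)

print(f"mean = {impact_factor_style:.1f}, median = {typical_paper}, "
      f"papers below the mean: {below_average}/{len(citations)}")
# mean = 15.4, median = 3.5, papers below the mean: 17/20
```

In this toy example 17 of the 20 papers sit below the 'average'; the mean is propped up by a handful of highly cited papers, which is precisely why it flatters to deceive.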
But the real problem started when impact factors began to be applied to papers and to people, a development that Garfield never anticipated. I can’t trace the precise origin of the growth but it has become a cancer that can no longer be ignored. The malady seems to particularly afflict researchers in science, technology and medicine who, astonishingly for a group that prizes its intelligence, have acquired a dependency on a valuation system that is grounded in falsity. We spend our lives fretting about how high an impact factor we can attach to our published research because it has become such an important determinant in the award of the grants and promotions needed to advance a career. We submit to time-wasting and demoralising rounds of manuscript rejection, retarding the progress of science in the chase for a false measure of prestige.
Twenty years on from Seglen’s analysis a new paper by Jerome Vanclay from Southern Cross University in Australia has reiterated the statistical ineptitude of using arithmetic means to rank journals and highlighted other problems with the impact factor calculation. Vanclay points out that it fails to take proper account of data entry errors in the titles or dates of papers, or of the deficient and opaque sampling methods used by Thomson Reuters in its calculation. Nor, he observes, does the two-year time limit placed on the impact factor calculation accommodate variations in the temporal citation patterns between different fields and journals; citations to Nature papers peak 2-3 years following publication, whereas citations of papers in Ecology take much more time to accrue and are maximal only after 7-8 years. Whichever way you look, the impact factor is a mis-measure.
Vanclay’s paper is a worthy addition to the critical literature on the impact factor. Its defects and perverse effects are well known and have been dissected by David Colquhoun, Michael Eisen and Peter Lawrence, among others. Even Philip Campbell, editor-in-chief of Nature, which has one of the highest impact factors in the business, has recognised that we need to escape its dispiriting hold over the lives of researchers.
Writing in 2008, Campbell (albeit somewhat uncertainly) saw a possible solution to the impact factor conundrum in the rise of mega-journals like PLoS ONE, which publish exclusively online and judge papers only on their novelty and technical competence, and in the potential of article-level metrics to assess the scientific worth of papers and their authors. In the end, however, he couldn’t shake the editorial habit of selection, writing of the contents of archives and mega-journals: “nobody wants to have to wade through a morass of papers of hugely mixed quality, so how will the more interesting papers […] get noticed as such?”
Four years later such views are being buffeted by the rising tides of open access and social media. It might sound paradoxical but nobody should have to wade through the entire literature because everybody could be involved in the sifting.
The trick will be to crowd-source the task. Now I am not suggesting we abandon peer-review; I retain my faith in the quality control provided by expert assessment of manuscripts before publication, but this should simply be a technical check on the work, not an arbiter of its value. The long tails of barely referenced papers in the citation distributions of all journals — even those of high rank — are evidence enough that pre-publication peer review is an unreliable determinant of ultimate worth.
Instead we need to find ways to attach to each piece of work the value that the scientific community places on it through use and citation. The rate of accrual of citations remains rather sluggish, even in today’s wired world, so attempts are being made to capture the internet buzz that greets each new publication; there are interesting innovations in this regard from the likes of PLOS, Mendeley and altmetrics.org.
The old guard may be shaking their heads and murmuring darkly about gaming of any system that tries to capture the web-chatter sparked by new research. But they shouldn’t be so concerned. Any working scientist will have experienced the thrill of hearing exciting new findings reported at a conference where results do not need to be wrapped between the covers of a particular journal for their significance to be appreciated. All it takes is for people to gather together in the coffee break and talk. The web allows something akin to that process to be energised on a daily basis; if we tap in online as the community of scientists downloads and flags up the papers of greatest interest to them, we could recover a sense of the worth of the scientific literature (and the efforts behind it) that is factual rather than fictional.
These developments go hand in hand with the rise of open access (OA) publishing. Though primarily motivated by the research and societal benefits that will accrue from freeing the dissemination of the research literature, open access is also needed to optimise crowd-sifting of the literature by making it accessible to everyone. But the growth of open access is also being held back by the leaden hand of the impact factor. This year has seen several significant policy developments in the US, EU and UK, but we still have a considerable way to go. In the long term open access can only work by moving to a gold ‘author pays’ model that has to be funded by monies released from subscription cancellations, but while we continue to place false value in impact factors, the publishers of high ranking journals can claim that the cost of sifting and rejecting scores of manuscripts must be borne by the system and therefore warrants exorbitant charges for gold OA.
It doesn’t have to be this way. We can avoid high cost gold OA and achieve a system of valuation that works by ridding ourselves of the impact factor.
I don’t wish to under-estimate the difficulties. I am well aware of the risks involved, particularly to young researchers trying to forge a career in a culture that is so inured to the impact factor. It will take a determined and concerted effort from those in a position of influence, not least from senior researchers, funders and university administrators. It won’t be easy and it won’t be quick. Two decades of criticism have done little to break the addiction to a measure of worth that is statistically worthless.
But every little helps, so, taking my cue from society’s assault on another disease-laden dependency, it is time to stigmatise impact factors the way that cigarettes have been. It is time to start a smear campaign so that nobody will look at them without thinking of their ill effects, so that nobody will mention them uncritically without feeling a prick of shame.
So consider all that we know of impact factors and think on this: if you use impact factors you are statistically illiterate.
- If you include journal impact factors in the list of publications in your cv, you are statistically illiterate.
- If you are judging grant or promotion applications and find yourself scanning the applicant’s publications, checking off the impact factors, you are statistically illiterate.
- If you publish a journal that trumpets its impact factor in adverts or emails, you are statistically illiterate. (If you trumpet that impact factor to three decimal places, there is little hope for you.)
- If you see someone else using impact factors and make no attempt at correction, you connive at statistical illiteracy.
The stupid, it burns. Do you feel the heat?
I need to thank David Colquhoun for inadvertently helping to inspire this post.
Update 20-8-2012: Anyone looking in horror at the long comment thread might be interested to know it has now been summarised in my following post.
Brilliant. I love it.
It’s time to declare war on statistical illiteracy.
I hope you sent a copy to the VC of Queen Mary, University of London, and every “research manager” in the country.
Not yet… But thanks for the inspiration (and example).
Great article Stephen, as usual.
In terms of actually changing things though, how would we, or should we go about doing that? I tell people when they use it that it is a meaningless metric, but these aren’t people with the power to change things.
How do we reach those at the top? The argument against IF is strong, and most likely irrefutable, but how can we make those who keep stamping their IFs on journals actually stop it, and start employing altmetrics etc instead?
P.S. Wouldn’t it be nice if a version of this article could make an appearance in, oh let’s say, The Guardian..?
Reaching those at the top is key. The Wellcome Trust has made a start by incorporating into its open access policy that it “affirms the principle that it is the intrinsic merit of the work, and not the title of the journal in which an author’s work is published, that should be considered in making funding decisions.”
Now for many, those are just words. The fear of going against the grain is too great. But if we can get RCUK to echo this and every Research Council and panel chair to keep re-iterating this message (and their disdain for the IF and anyone who mentions it), we might start to make progress.
As for posting it at the Guardian, I thought it was a bit niche for them but I might ask…
Nice post and excellent points, thanks. And I agree that the potential of crowd-sourcing measures of value is really exciting. Couple of random thoughts:
1. Are you aware of any analysis looking at whether skewness in citation frequency varies with journal IF? I’ve not read anything systematic on that (can’t see any in the papers you’ve linked to), but think it would be interesting. In other words, are the citation freq curves parallel for different journals (such that bottom 5% of Nature papers, say, are still cited >> bottom 5% of J Highly Specialised Studies papers), or do they cross (i.e., a poorly-cited Nature paper is poorly-cited compared to more specialised journals too).
2. To some extent high IFs must be self-fulfilling. I’ve often idly wondered how many citations to really high profile papers occur in the first paragraph of citing papers (or grant proposals, or whatever), because you _have_ to cite something high profile just to show that your own work is interesting / timely / relevant. Would be an interesting text-mining project for someone!
3. Something I’ve been pondering (will probably result in a blog post eventually) – I wonder whether the quality (or at least, the rigour) of peer review is one thing that _might_ actually vary with IF? i.e., do reviewers take their responsibilities more seriously when reviewing for higher-ranked journals? Even if true, I have mixed feelings – I think some reviewers go out of their way to impress editors of ‘top’ journals with point-scoring & nitpicking, rather than focusing on the science; but equally some major improvements would never get made without the intervention of rigorous reviewers. Might a wholesale switch to ‘science basically sound’-type reviews result in a reduction of really useful, ‘how you could improve this (and incidentally reach a much wider audience)’-type reviews?
Anyway, idle thoughts – thanks for provoking them!
1. Seglen’s paper (linked to in the post) covers just this point. There’s some variation in skewness but not much.
2. Indeed – Zoe Corbyn wrote a commentary on a paper analysing this in Nature.
3. That’s an interesting point. Not sure how one would go about that analysis. For sure, sometimes reviewers do suggest additional experiments that are very worthwhile and add significantly to the finished product. Nevertheless I do feel an emphasis on technical quality, rather than ‘is this paper good enough for this journal’ would pay dividends.
Thanks for the response. Fig 5 in Seglen sort of addresses my point 1, though he doesn’t actually test for diffs between journals as far as I can tell, and it’s only an analysis of 3 journals. A systematic comparison of skewness with IF across multiple journals is what I was after. And if skewness doesn’t change, it makes using the mean much more defensible.
Thanks too for the link to the Nature commentary, which is interesting, though seems to buy into the ‘it’s high impact so it must be good’ idea, rather than questioning the context in which ‘high impact’ papers are cited in subsequent work (my suspicion remains that they are cited more in introductory paragraphs than in methods, say).
Many thanks Stephen for contributing to this important issue in science. However, I disagree that it is a matter of statistical literacy. If “There’s some variation in skewness but not much”, then any other measure of central tendency would have the same problem. So, to evaluate scientific contributions, we need rationality, not precisely statistical literacy.
Interesting papers you link to…. Who is MIchale Eisen?! Must be a typo…
Thanks – sorted!
Lots of good points. The impact factor has basically two goals: sorting the literature based on usefulness, and giving some measure of a researcher’s ability. It does both poorly, but before it can be replaced something better has to be proposed.
I’d love to see some more radical ideas about how this would work. All the twitter buzz/social stuff is a good start, but many many more researchers need to be involved. It seems heavily biased by field. I see very few papers in my field being discussed unless they have a genomics slant, and even fewer researchers discussing them.
I disagree. When something doesn’t work, you stop using it. You don’t need to wait for something that does work.
I agree also with D L Dahly, not least because calling for the abandonment of the IF will help to stimulate thinking about how to assess papers and researchers properly.
Can someone please explain to me why this is a problem? The reason given here – that most papers have fewer citations than the arithmetic mean – is only a problem if you think a mean is a median.
If there is a problem with skewed data, it’s at the other end, with a few papers that are cited a lot. I think this is only a problem if a journal only publishes a few papers every year (e.g. the Bulletin of the AMNH), as the effect of the upper tail becomes diluted fairly quickly.
Aren’t you committing a deeper sin, of equating ultimate worth to number of citations?
I also wonder, is it impact factor that’s preventing OA or is it the drive to publish in journals that are perceived as better quality? I suspect it’s the latter – I’m sure most ecologists would rather publish in American Naturalist than in Methods in Ecology and Evolution, despite the latter having a higher impact factor (and a stunningly brilliant executive editor), because Am. Nat. has a higher reputation.
We need something more than the number of times something is cited to understand how important the work is to science and technology, I agree.
The difficulty of the work, the risk, the originality, the amount of labor involved, and the contributions of each individual to making it happen. Unfortunately, teasing that all apart and summing it over a person’s career to reward past performance or anticipate future performance… well, now, that’s beyond the challenge of separating the median from the mean. So far, in fact, that quibbles with the impact factor are largely irrelevant by comparison.
Eugene Garfield (for it was he) has a lot to answer for, having unwittingly unleashed the dark force of impact factors on the world of scientific publishing. For the managers saw this and, thinking it was good, allowed impact factors to beget impact statements, and these in turn begat pathways to impact, and then the greatest powers in the land agreed this was good and made impact a significant part of the REF. And scientists were asked to predict impact even for that which had not yet happened, and were left tugging their beards in vexation and disbelief as they witnessed the statistically illiterate in pursuit of the unmetricisable. But their protestations fell on stony ground as the managers wielded large sticks that they brought to bear on the heads of those scientists whose publications failed verily to cut the impact factor mustard…..etc
But seriously, this is a terrific post Stephen, even though it is difficult to see how one gets the genie back in the bottle.
Yes, Garfield released the beast, but in fairness to him, he “has issued repeated warnings that journal impact factors are an inappropriate and misleading measure of individual research, especially if used for tenure and promotion.” That quote from an excellent 2005 paper by Brian D. Cameron in Libraries & the Academy, which provides essential historical context for any IF discussion: http://muse.jhu.edu/journals/pla/summary/v005/5.1cameron.html
As Andrew & Bob say below – nothing inherently wrong with IF, if used appropriately. Lots wrong with it, if not.
He has indeed spoken against misuse of IF but the problem, as several here have noted, is that this is a metric (and a flawed one at that) that relates to journals, but which is misguidedly applied by box-ticking managers to individual papers.
Umm – isn’t that what I said? It’s what I thought I was saying, and exactly the message I took from the Garfield quote.
Thanks for that link Tom – will definitely check it out.
And thank you Stephen for your kind comment!
Great post.
In particular, thanks for writing “If you see someone else using impact factors and make no attempt at correction, you connive at statistical illiteracy”.
I think we should all make a concerted effort to speak out about this a bit more. I wrote about this back in July and provoked some good responses from a palaeontology community mailing-list. Initially I felt a bit bad for critiquing an otherwise well-meaning initiative to list all the journals in our field – but ranking them all by impact factor just seemed like an awfully bad ‘improvement’ to me.
If we all made such small but important public statements such as this, the message might just hopefully filter through to the rest of academia. Keep up the blogging please 🙂
Well done Ross – It will take lots and lots of individual contributions to fix this.
There’s nothing intrinsically ‘wrong’ with the Impact Factor but I agree it makes sense when assessing a journal to make use of a range of metrics to get a fuller picture and to fit need (e.g. 5 yr impact factor, SNIP, SJR, speed to publish). An analogy is car selling: some buyers look at horsepower, others the safety rating or fuel consumption, usually a combination. (I’m a journals publisher for Elsevier.)
I agree. It’s only a problem because people take it too seriously.
What? There’s nothing ‘wrong’ with IF? It’s negotiated, rather than calculated, it’s irreproducible and we would fail our undergraduate students if they used the mean for such skewed distributions. If that is ‘nothing’ then I really hope you never try to write a scientific paper.
Sources: http://bjoern.brembs.net/comment-n397.html
Actually Andrew, the point (or one of the points, at least) is that journal impact factors are lousy for assessing individual scientists or papers. To use your analogy, it would be like saying “Fords have better average fuel economy; I drive a Ford; therefore my driving is fuel-efficient”. Or, again, “Fords are involved in fewer accidents than other cars; I drive a Ford; therefore I am a safe driver”.
Disclaimer: I hate Fords.
Sheesh – you’ll be trashing the h-index next… 🙂
When you talk about ‘impact factor’, I think you should be clearer, because what you’re talking about is the journal impact factor.
How much of your criticism is actually about the abuse of journal-level metrics applied to individual papers or people? To put it another way, would you embrace the h-index rather than the (Journal) Impact Factor, because the h-index is an article-level metric?
if you wouldn’t embrace the h-index, why not? I’d suggest that it’s because the real problem here is over-attention on the metric approach to measuring and rewarding quality – in that it changes scientific priorities and practice. And so I don’t see how altmetrics are going to change that – apart from, as noted, that they will be article-level metrics, not journal-level. (I would be pleased if someone could explain to me otherwise).
I also agree with Bob O’Hara’s point – that it’s “the drive to publish in journals that are perceived as better quality” that may be holding back OA. Suppose Garfield’s IF never existed. Wouldn’t people still want to publish in perceived-higher-quality journals? I don’t have a good sense of the answer to this, but I do think that people ought to consider that journals gain reputations and desirability on factors other than the IF.
I thought it was pretty clearly defined in my second paragraph.
On metrics – yes, there is a danger that boiling any human activity down to just one number will distract attention from more holistic methods of determining quality. The present problem is that all the focus is on the IF, which is so wrong it hurts.
The h-index is an improvement of sorts, since it looks specifically at the contributions of a particular individual. But even here there are problems. Citations aren’t the whole story and the h-index is also age-related.
Will have to leave your other points for later – gotta run.
Thanks all so far for the comments. Sorry not to have joined in the discussion yet: bad timing by me since I’m out enjoying the Edinburgh festival. But I’ll be back…
I have a sense of a desire for the simultaneous possession and consumption of cake. We all agree that research papers should be freely accessible, though some agree with this proposition more than others. The question is then whether papers are made freely available before or after editorial selection and/or peer review. If before, then the papers that win out are ‘sifted’ by this ‘crowd’ of which you speak, Earth Human. I have serious doubts that this crowd really exists. Open-Access, Open Peer-Review and so on might be right, but at present they are the obsessions of the few. Most people don’t have the time, and, I suspect, most people value (though it’s not trendy to admit it) the input and detachment of an editor, who can wade through everything so you don’t have to. The alternative – that the best papers will somehow rise to the top of their own accord – is also flawed, because, I suspect, most of the selection of most of the papers will be done by a minority of people who care enough about such things to devote the time to do the sifting. The people ‘in the know’ will know, so you have replaced one kind of impact factor with another. The animals looked first at the pigs, and then at the men, and then at the pigs again, and couldn’t tell which was which.
Disclaimer – I am an editor at your favourite weekly professional science magazine beginning with N, which employs, even in these tough times, scads and scads of editors who spend their entire days just sifting. No time for research there. It’s worth saying that if YWPSMBWN cared as much about its IF as some people think it does, it would publish only molecular and cell biology. No evolution or ecology, no earth sciences, no physical sciences of any kind. If it adopted that policy its IF would presumably go sky high.
Another disclaimer – these remarks are made in a personal capacity. So there.
You make many good points, oh wise man, but I do not accept your premise that “The question is then whether papers are made freely available before or after editorial selection and/or peer review”. We could have both. Pre-publication review is a useful filter but, as we know, imperfect. And yet its valuations stick fast in terms of the journal impact factor. We need measures and means of assessment that tap into the wisdom of those members of the field into which the research report has been thrown.
This crowd of which I speak certainly does exist. From it come all the people who give their time freely to do pre-publication peer-review. I am not suggesting that there should be a formal post-publication procedure but that we find means to capture the activity and interest that is sparked following publication. This happens normally at conferences; we just need to find a way to apply it to all publications.
A smear campaign? Welcome aboard….
Regarding “sifting”, two points. First, PubMed. Second, evaluating the quality of extant data is one of the main jobs of a scientist. People who want to farm this out to professional editors or anyone not in their laboratory astonish me.
“A smear campaign? Welcome aboard….”
Yes, I know I’m far from being the first to rant about the iniquities of the impact factor — see post 😉 — but maybe the first to call for a smear campaign by name? It doesn’t really matter.
On your 2nd point: true enough but I still think there is merit in pre-publication peer-review. Yes, it is our job as authors but it’s still valuable to get an outside view. But my point is that much of the evaluation stops at the point when the paper is accepted. The journal’s impact factor is attached and there’s a kind of end (citation counts notwithstanding). What I’d like to see is a way of capturing the post-publication commentary on the paper which may well pick up on things that neither the author nor the reviewers have noticed and folding that into the process of evaluation.
As I have probably said elsewhere, there is room in the ecosystem for every style of publication (and of course you know that my masters at YFWPSMBWN are experimenting with many of them.) This is because different kinds of science have different social conventions. Some, for example, prefer double-blind peer-review; others should like it completely open; physicists have no problem with preprint servers, something that fills molecular and cell biologists with horror.
I sense that what you’d like is a better way of managing and quantifying the reaction to a paper once it has appeared (whether or not it is ‘published’ possibly being a side issue). Some journals publish referees’ reports after publication (EMBOJ for example), which is one answer to the problem. Here at YHWHYFWPSMBWN we publish formally peer-reviewed online responses to papers which themselves attract citations, and also encourage more informal online comment, though at present it’s not much used – a problem common to online commenting. One suggestion might be this: when and if authors are tagged by unique identifiers (yet another discussion), it would be easy to search for anything with that tag or a combination of tags (for multi-author papers). This might lead, I guess, to a better assessment of the ‘impact’ of individuals or combinations of scientists, whether from papers of their own or from their comments or reviews of other papers, and these could be combined with the papers themselves to produce an overall single-paper impact. Journal impact (or preprint server impact) could be judged on combinations of all these by some algorithm I hardly dare imagine.
I like that idea, and hope we can trust others to figure out the algorithm…
I see that several people have mentioned the H-index and other proposed metrics. It seems odd, given that most of the commentators are scientists, that nobody has mentioned the little matter of evidence.
Presumably, the idiots who use metrics to select people for jobs or grants do so because they believe that they predict future success. I’m not aware of any such evidence, though I am aware of several Nobel prizewinners who would have been fired early in their careers on the basis of their metrics.
How could evidence be found? You could check the metrics of a sample of scientists every year and follow the cohort for, say, 30 years, and see how many of them are generally accepted as being successful.
Still better, take two samples of young scientists and allocate them at random to (a) get on with the job (b) to be assessed by the cruel and silly methods used at, for example, Queen Mary, University of London. That way you’d really be able to tell whether imposition of their methods had any beneficial effects.
I know these experiments would be difficult and take a long time, but until they are done it is irresponsible to pretend that you know whether metrics are useful or not.
No doubt bibliometricians would object that if you were to do the work properly, it would take too long and be bad for their own metrics. I’d regard that reaction as being a good example of the harm done by metrics to good science.
At present, the evidence for the usefulness of metrics is about as good as that for homeopathy.
I guess that is why I use so often on Twitter the hashtag #edubollocks.
David, the problem with your cohort analysis is that you have to assume that the career progression of the subjects is unaffected by their data. In reality, they are being judged annually by their peers/senior faculty/administration on the very data you are measuring. If the data are misleading in terms of predicting future potential (as is most certainly the case), then the experiment is corrupted. The only way to do this is to sequester 30 volunteers in some isolated research facility for a decade or so, pay them equally, give them equal resources and hope they don’t kill each other. The comparator of Queen Mary with another, less draconian institution is a more practical test, albeit cruel and unlikely to pass decent ethical review…
Publication level citations are much less troublesome but these data are a lot more work to analyze. There’s the rub. We seem to prefer measures that are simple if inaccurate to those that are meaningful but require thinking. Hardly scientific!
I agree entirely. Cohort studies are often misleading, and can even do harm, as in the case of HRT -see http://www.dcscience.net/?p=1435.
Some sort of cluster randomisation might be possible though, along the lines described in the Cabinet Office paper which I keep urging people to read.
http://www.cabinetoffice.gov.uk/resource-library/test-learn-adapt-developing-public-policy-randomised-controlled-trials
My point was not to design the test but merely to point out to advocates of metrics that it apparently has not occurred to most of them that their ideas need to be tested before they are imposed on people. It is irresponsible to get people fired and ruin their lives on the basis of untested ideas, just as much as it’s irresponsible to sell people drugs that don’t work.
Yes, that was my point about the irony of using non-scientifically tested measures to quantitate science. The other shoe to drop is that our methods have very likely weeded out minds and decimated the career prospects of those who are not wired to think like the crowd (as well as promoting those who learned to game the system early). We’ll never really know. We used to nurture and celebrate eccentricity in science; now our processes mute these qualities.
David, I know that one group led by Peter Van den Besselaar is trying to do this kind of study on a small scale, except not quite the way you’ve described it. They have a paper coming out in August in Higher Education Policy (it’s noted on his webpage but hasn’t appeared yet). They compare 21 pairs of similar researchers, where one of the pair dropped out and one is still in their academic career. Bottom line: ” We found no systematic relationship between the career success and the academic performance of highly talented scholars, measured as the number of publications and citations … family conditions, social capital, organisational factors and contextual factors of the talented scholars are important and differentiate the two groups of talents … the interviews suggest that success is the effect of a number of cascading factors and accumulating advantages, whereas accumulating disadvantages determine whether a talented researcher leaves the university.” I can send you a preprint if you’re interested – or ask him.
@Richard Van Noorden
Thanks for that information. I’d be very interested to see it.
Despite the fact that the result seems to be what I expected, it must be said that matching is a risky process, as RA Fisher pointed out in the 1930s. In medicine proper randomisation has been routine since the 1940s. But in education and social sciences it is still rare for people to design experiments that can give a clear answer. It’s all described with beautiful clarity in the recent Cabinet Office paper.
To fail to do the right sort of test of your ideas is academic spivery. But of course short cuts and spivery are exactly what’s encouraged by use of metrics-based targets.
Google now ranks journals with an h-index metric based on Google Scholar citations (http://bit.ly/OobKcr). IMHO it’s a better impact metric since it doesn’t suffer from the flaw of averages. You have to do a bit of work if you want to see a list in your favorite area.
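For anyone unfamiliar with it, the h-index is simply the largest number h such that h of the items have at least h citations each. A rough sketch with invented citation counts (note how little the single blockbuster paper at the top contributes, in contrast to its effect on a mean):

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

# Invented counts: one blockbuster paper plus a long tail of modest ones.
print(h_index([120, 40, 15, 12, 10, 9, 8, 5, 3, 2, 1, 1, 0, 0]))  # 7
# Swapping the 120 for 1200 leaves the h-index unchanged, whereas the mean
# would jump from roughly 16 to 93.
```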
What do you mean by the “flaw of averages”? I’m not familiar with the phrase, and googling it suggests a problem that would affect the h index just as much.
There is a variable difference between Google Scholar citations and those of Web of Science. I can’t fathom it as it strays above and below but typically Google Scholar yields higher cites (even with careful curation). The algorithm and sources for the Google Scholar methodology are not public.
Has someone published an age-correction chart for h-index?
Web of Science is limited to the last 20 years. Google Scholar does not have this limit.
I hadn’t previously thought about the IF/OA connection. Trust you to be the one to highlight it, Stephen.
IFs do seem quite arbitrary as a measure of research quality. We might as well use the length of the journal’s name. In fact, I’ve written a little something about doing just that: http://alexanderbrown.info/2012/08/13/journals-impact-or-title-length/
I’ve now updated that post with some actual data on journal title length versus “impact”.
“If you include journal impact factors in the list of publications in your cv, you are statistically illiterate.”
Can you please tell that to the funding agencies that absolutely require this information in grant applicants’ CVs?
We certainly need to be able to reach that audience. The Wellcome Trust has a standing policy of disregarding journal names and, by association, impact factors. But they are the only funding organisation that does so. We need to shame the others into action.
Great to see so much excitement about altmetrics in the post and comments! I agree that it’ll be tough to change the culture of academia to value new, social indicators of engagement and impact. That said, in the two years since we wrote the altmetrics manifesto I’ve seen a huge upswing of interest and support for this approach, including articles in the Chronicle of Higher Ed and elsewhere, and a $125k grant from the Sloan foundation to fund total-impact.
This latter is an open-source webapp to help scholars aggregate and present all sorts of altmetrics, including tweets, bookmarks, Mendeley library inclusion, citation in Wikipedia, and more. We also support diverse products including blog posts, software, and datasets.
I think it’s this diversity which is the greatest potential of altmetrics over the Impact Factor. The JIF, in addition to being statistically problematic, looks along such a narrow dimension of impact. Garfield and others knew decades ago that citations only told part of the story – the problem was, it was all they had. I’m really excited about how total-impact and similar software tools are beginning to surface data that lets us tell a richer story.
Jason Priem is advocating a method without the slightest evidence that it does what he claims for it.
Measuring the ability of people by metrics constitutes a social (or educational) intervention. When will people in these areas learn about the need to test their hypotheses?
The usual procedure is for someone to invent a metric on the back of an envelope and try to foist it on the world. Naive administrators believe the made-up claims and use the metrics to ruin people’s lives. Like any other social interventions, these things have real-life consequences.
Have the people who invent them never heard of control groups or randomisation? Apparently not. It’s much easier to promote their untested methods via social media.
It really is very like the problems of quack medicine. Make up claims and turn a quick buck. It’s the very antithesis of how science should be done. It is irresponsible and harmful.
So is your argument that ALL metrics are inherently bad for judging the relative scientific merit of journals and articles? If we accept that the Impact Factor is flawed, does that mean that all attempts to measure in this kind of way are bad?
To be fair, Jason is at least fostering the development of alternatives, though I agree that, ultimately, we need to be very careful about how we deploy them. It seems reasonable to suppose that having a broader range of indicators should be helpful.
I remain slightly sceptical of numerical indicators per se since they do give people an excuse to be lazy in their assessments. Perhaps the time freed up from not having to submit and resubmit manuscripts multiple times will give us the opportunity to make better assessments of our peers? 😉
Regarding the association between journal metrics and access which Stephen touches on: a tenet of the open access movement is the imperative to make journal content available to non-academics (academic access currently being high).
It will be interesting to see therefore what the influence of public usage on scientific literature and funding will be given that lay people could become direct and in some cases enthusiastic arbiters: Facebook likes and tweets would work well in that scenario.
Andrew Miller
Oh please! How would Facebook likes work for Peter Higgs papers?
Are you a scientist of any sort? I’ve rarely heard a dafter suggestion.
@David Colquhoun
Exactly my point. I think the social media aspect of bibliometrics is problematic for these reasons.
No-one is suggesting that there is any worth in using facile (and easily game-able) indicators such as facebook likes in this process.
Does that mean that the academic consensus is against PLOS’s current article level metrics approach? FB-likes, tweets and blog links are all displayed under an article’s ‘metrics’ tab.
Thanks, Stephen, for calling out Nature Publishing Group as being statistically illiterate:
http://bjoern.brembs.net/comment-n854.html
http://bjoern.brembs.net/comment-n836.html
Unfortunately, the IF is a measure required by libraries to decide whether or not to subscribe, and so is an integral part of the marketing toolkit. If journals stopped touting the IF it would be a brave move, but possibly not a shrewd commercial one.
It will be interesting to see when this will change.
Check out the Mendeley link in the post… alternatives are emerging.
Nice post indeed, but aren’t you caught red-handed when you write in this post
http://occamstypewriter.org/scurry/2012/04/01/plos1-public-library-of-sloppiness/
that “the Impact Factor for PLoS ONE is a respectable 4.4.” ?
Guilty (ish). This shows how deeply engrained is the habit — and how difficult, unfortunately, it will be to unshackle ourselves. On that occasion I was comparing journals, which is the only defensible use of the IF (though of course, it should be evaluated as the median, rather than the mean).
At least you didn’t report it to 3 decimal places ( like some fools have done).
But why the median? Why do you think that’s a better summary than the mean? With such a skewed distribution and small mean, most journals will have medians of 0, 0.5 or 1.
It’s more honest? So what if most journals will have medians of 0, 0.5, 1. That tells you the truth that there’s not much to choose between them.
But the wider point is ultimately to dispense with IFs altogether.
But the mean is the truth too. It’s just as honest – it’s a different measure of central spread.
I’m not getting this antipathy towards arithmetic means, which we use all the time to summarise our data, but all of a sudden they’re a big no-no for impact factors.
We teach our undergraduate students in their first statistics classes (and I’m sure, as this is your specialty, you teach this too?), that you have different ways of reporting differently distributed data.
We fail students who don’t get the difference between parametric and non-parametric data and I’m sure you do, too?
Whenever I review a paper and they fail to use non-parametric tests when they unequivocally should, I tell the authors so and I’m sure you do, too?
I guess it’s because use of the mean has inflated the credit awarded to journals and those who publish in them; it’s a poor measure of the ‘typical’ citations that a journal’s papers attract.
But I am much more concerned about the mis-use of IFs (as of course is pretty much everyone else). If drawing attention to its weak statistical basis could help to undermine its use, that might help us to crawl out from underneath their dreadful weight.
Björn –
Parametric and non-parametric data? Err… what are you on about?
You haven’t actually answered my question – what’s wrong with using the mean as a measure of central tendency?
Of course I’ve answered it. See, e.g., here:
http://www.diffen.com/difference/Mean_vs_Median
http://sethgodin.typepad.com/seths_blog/2007/09/mean-vs-median.html
http://stattrek.com/descriptive-statistics/central-tendency.aspx
Which I’m sure you know much better than I do (I’m only linking to save myself from writing). Thus, the mean inflates ‘typical’ citation counts and is simply sub-standard. I agree of course it’s honest. Total count is just as honest, or the maximum number or the lowest number (i.e. zero). But you can, in fact, be honestly wrong, just as you can be honestly sub-standard or honestly dumb. I’m not implying anything about any present party, I’m mainly making the point that honesty is quite irrelevant, once you assume that nobody is lying.
WRT the last sentence: given that the IF is negotiated and irreproducible, one could actually make the claim that the IF is produced by ‘lying’ and then the arithmetic mean would in fact be the most honest component of the IF, which doesn’t really make it any better, IMHO.
Stephen –
??
Fair enough – of course using the median or mode would fail to reflect other aspects, i.e. that there are papers that get a decent number of citations, e.g. 10 or 15 rather than 3 or 4.
I think the mean is better interpreted as a journal-level metric, not a paper-level metric: it’s the average (!) rate of citation over all papers. It doesn’t tell you anything about individual papers but it does tell you something about their aggregate influence. Which is, of course what the IF was intended to do.
P.S. We should demand that Richard allows us to nest discussions until they’re so thin there’s only one character per line, and w’s are too wide.
Bob, that’s actually a setting under your control. Some of us on OT have no nesting at all…
I agree with Bob – if you’re after a relative index (which is what I think most people use journal IF for), you want something that reflects the variability in your data (i.e. the long RH tail) – median’s not going to be very useful (for the specific purpose of distinguishing between journals) if it results in most journals getting equal rank. And if skewness of citations really doesn’t vary between journals, the mean will give you this without introducing bias.
Doesn’t get over the myriad other problems with IF of course, and has very bad consequences if applied to individual papers, but using the mean as a journal-level metric is not the problem.
Björn – none of those links are about impact factors, and none appear to have been written by you. Plus, I think they were all written before my comment, so can hardly be a reply to it.
TBH, I don’t see any point in discussing this further with you.
The mean is not a good measure when the IF is being abused – e.g. when assessing individuals based on where they publish. This seems to be far too commonplace.
If you are comparing one journal against another (what the IF was intended for as I understand it) then it perhaps becomes more reasonable to use the mean.
However, if journal A has a high IF one might expect it to have more highly cited articles than journal B. Yet journal B could have more highly cited articles while lacking the monster citation gobbler that journal A actually has. This is an extreme example but you can see how using the mean might mislead people over the ‘impact’ a journal has.
Instead, why not compromise and use two numbers when comparing journals (can our brains not cope?)? Why not have the mean and median as the IF? That way when comparing journals we can get a feel for the spread of citations.
Alternatively just do away with impact factors altogether and let the articles speak for themselves.
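To put some invented numbers on the journal A versus journal B scenario above (purely illustrative, not drawn from any real journals):

```python
from statistics import mean, median

# Hypothetical citation counts for ten papers in each journal.
journal_a = [0, 0, 1, 1, 1, 2, 2, 3, 3, 200]     # one monster citation gobbler
journal_b = [2, 4, 5, 6, 8, 10, 12, 14, 16, 18]  # more consistently well-cited papers

for name, cites in [("A", journal_a), ("B", journal_b)]:
    well_cited = sum(1 for c in cites if c >= 10)
    print(f"Journal {name}: mean = {mean(cites):.1f}, median = {median(cites)}, "
          f"papers with 10+ citations: {well_cited}")
# Journal A: mean = 21.3, median = 1.5, papers with 10+ citations: 1
# Journal B: mean = 9.5, median = 9.0, papers with 10+ citations: 5
```

Reporting the mean alone would rank journal A ahead; seeing the mean and median side by side immediately flags that the comparison is being driven by a single paper.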
Just a disclaimer for my comment above:
I put the links there not because they were mine or because they were written to answer Bob’s comment, but because they say what I said in reply to Bob’s comment, just more verbose: means are usually bad metrics for non-normally distributed data.
You mean like Nature does in their spam campaigns? (see links above)
Whilst there seems to be consensus among commenters here that IF is a dumb tool for assessing individual papers or their authors, I find it alarming to discover there are quite a few here who would advocate other supposedly more accurate forms of metricisation. As David Colquhoun has pointed out, there are several Nobel prize winners who would have looked pretty dismal had they been judged at the time of their prize-winning work using such techniques. In contrast, had Benveniste’s notorious Nature paper on the memory of water been published today, it would undoubtedly have had millions of hits, downloads, tweets, retweets, ‘likes’ on Facebook etc., i.e., a clear ‘winner’ no matter what form of metric analysis you apply.
Those of us who started our scientific careers before the advent of IF recall that the absence of bibliometrics/impact analysis/altmetrics/Facebook/Twitter etc etc., proved to be no impediment to doing excellent science. As it is impossible to quantify the impact of scientific work until years after its publication, attempts to do so are doomed to fail.
This is a good point but, if you can make sure that it is the assessments (or expressions of interest in a paper) of a community that you trust that are counted, perhaps that is the way forward.
I see a potential analogy with twitter. I follow people who flag up interesting things. Sometimes I unfollow folk because I realise that their main interests are not for me. Over time, I have built up a list of people I follow because I see I can trust their judgement. Similarly, when tweeting myself and linking to things, I am mindful of my reputation among my followers. Only very rarely do I link to articles that I have not read myself, and then only if the link has originally come from someone I feel I can trust.
Likewise, within my field, I know who people are and whose judgement I can trust. That trust has built up over time through meeting them at conferences and email exchanges.
But what if we could expand those one-on-one exchanges to the scale of the web by building a twitter-like model for sharing information and opinions? It is that kind of trusted assessment that I would like to see aggregated and attached to people and their publications.
If only more on Twitter were so self-reflective. This is a useful medium but as noted previously, social media is even more distorted than JIFs (Retweet for Prince Harry for the next Nobel Prize).
The reason we ask for external references when assessing promotions, etc. is to gauge objective expert opinions among peers. These letters, if to be of value, require careful consideration on the part of the writer. Their reputation is also at stake (in addition to the person being written about). It is this mutual dependency that provides considered balance and value. Instead of relying on a pair of digits to assign “quality”, we must be prepared to dig deeper in assessing our proteges. There is a lot at stake.
“If only more on Twitter were so self-reflective.”
Thanks but my point was that these communities should be self-organising (as on Twitter). I don’t follow anyone who tweets about nonsensical notions such as Harry being nominated for a Nobel.
I also have to ask what is an “objective expert opinion”? Objective doesn’t work for me in this sentence. I’d replace it by professional and suggest that careful consideration doesn’t have to take so much effort. I agree there’s a lot at stake in assessing papers and people so we do have to think through these things carefully. Not suggesting for a moment that I have reached firm conclusions on all this (apart from on the uselessness of impact factors).
Stephen – I still can’t quite work out whether you are pro or anti metrics. I see the Scientific Press as being free to use whatever metric tools they like to compete with one another, the problem arises (as so many have said here) when those blunt instruments are used as a surrogate for quality by managers (and sometimes academics) when determining the careers, promotions, grant applications etc., of academic scientists.
Even the numerous suggestions here as to how IFs could be refined or calculated more accurately, will still reveal nothing as to the quality of an individual scientific publication or the standing of its authors. The bottom line is that if you want to judge the merit of a piece of work, you actually have to put in a bit of effort and read the paper.
Stephen — I still can’t quite work out whether you are pro or anti metrics.
That’s probably because I’m still trying to think through this issue. Plus, when I talk about ‘crowd-sourcing’, I’m not just talking about extracting information for the compilation of numerical measures. Moreover, metrics are used for different things, not just when assessing applicants for grants and promotion.
I think Chris Chambers is right when he says we’ll never do away with metrics, whatever the limitations of boiling things down to numbers. If that is the case, I’d rather see a panel of measures become established, which should give a more robust and granular assessment. Hopefully also, use of a range of measures would serve to remind us that no single number (and not even a set of them) is an adequate measure, especially when making a judgement with high stakes. The Wellcome Trust has a policy of instructing panel members and reviewers to disregard journal names and impact factors when assessing applications; now there are obvious difficulties with that but I think if the policy were to be adopted by other funders and universities, it would start to shift the culture.
For other situations, such as looking to find the latest, hottest literature in a field that is not your own, I think metrics and crowd-sourcing of information (e.g. via contacts on Mendeley) could be a useful way to navigate to the papers that might be of most interest to you. This is only likely to work well if the information comes from people or a community that you trust (hence my earlier analogy to the trust that can be established between followers & follows on twitter). But ultimately that is about helping you to find scientific papers so that you can read them!
Gents, I don’t think it’s necessary to come down on the side of pro- or anti- metrics. As scientists we should use any appropriate tool that furthers understanding.
The JIF is obviously an inappropriate tool for assessing a scientist’s performance or the impact of a single paper. That doesn’t mean all metrics are necessarily bad. Some metrics (as yet uninvented) might well be useful.
“If you include journal impact factors in the list of publications in your cv, you are statistically illiterate.”
Or you are writing for a statistically illiterate audience. It’s easy from the security of a tenured job to exhort people to avoid quoting IFs, but in some fields it may take some bravery for a PDRA to follow this advice and, much as it pains me to say it and much as I support this campaign against IFs, it may not be in the PDRA’s interests to delete the IFs. If IFs are usually quoted in CVs in your field then my pragmatic advice to PDRAs and finishing grad students would be to do the same but to add a footnote caveat such as “However, see e.g. Vanclay 2012 arXiv:1201.3076 on the limitations of the IF metric.”
“It’s easy from the security of a tenured job to exhort people to avoid quoting IFs, but in some fields it may take some bravery for a PDRA to follow this advice”
I agree – that’s why I wrote in my post:
It is easier for me to write this sort of post than it is for a more junior scientist. But I still think it’s better that I write it than keep quiet. In no way do I expect junior researchers to solve this problem — it will take a concerted effort by all.
Quick question, Stephen: what will you be using for assessing doctoral candidates, other research/researcher applications/bids, etc, particularly if you’re on a board with other assessors using IFs? Will you need to put in extra time and effort with this, including providing re-education notes to the relevant committees??
Also minor point, I don’t recall PlosOne or most other megajournals requiring ‘novelty’, just ‘originality’ (perhaps read better as authenticity).
For doctoral candidates, I read their thesis from cover to cover.
For applications from publishing researchers, I agree it will be more work. But since it is clear that impact factors are a worthless way of judging the work within one paper or the papers of one individual, it would be unfair to rely on them. So, yes, at least I will need to scan the abstracts of the papers that are cited as their most important. There are other ways around this: at Imperial, candidates for promotion typically need to highlight their top four papers and write a short paragraph explaining why they rate each as among their best.
For grant applications, the quality of the case for support is also a useful indicator of the researcher’s ability to conceive and plan a program of research.
To address your last point: not sure I see the difference between novelty and originality in this context (though PLOS ONE’s instructions use the word originality).
The reference to Vanclay (e.g. Vanclay 2012 arXiv:1201.3076) given by Stephen Serjeant is interesting. He suggests using the mode rather than the mean or median. That would make the differences between journals even smaller than by using the median. The distribution of numbers of citations is very highly skewed, much more so than a geometric distribution, especially for high IF journals. That makes any measure of central tendency suspect. Best to forget the whole IF charade.
Use of the mode is good not because it is mathematically or logically better. It is good because humans have a tendency to hear “mode” no matter what measure of central tendency is used.
Just for information – that Vanclay reference is also the one I discussed in my post.
I recently went (unsuccessfully) for promotion. In a debrief meeting, my head of faculty (at a well-respected London institute of research and higher education) advised me to include impact factors for each of my publications on my CV to make it easier for the person who is assessing my application.
That’s the level that this “illiteracy” reaches!
FWIW, I don’t have journal IFs on my CV, but I do highlight a ‘recent example of high impact’, which happens to be a PLOS ONE paper – so I can list page views (compared to average for the journal) as well as citations and stuff I’ve kept track of: extensive media coverage, use in policy reports, textbooks, etc.
Fat lot of good it’s done me. Nobody I’ve wanted to impress with my CV has commented on it, and my HoD reckons it’s 2* in REF terms…
Both of these examples illustrate the extreme and dispiriting state that we have got ourselves into.
@steve
That’s very disappointing. Perhaps your head of faculty got that job because he/she wasn’t much good at research. It seems to be quite common for people who get high on the administrative ladder to lose (or never to have had) critical faculties.
An alternative hypothesis is that the HoF recognizes the constraints of the system and was advising Steve how to game it to advantage–by making it ‘easier’ for the person doing the assessing. Sound advice, surely.
Great article. Thanks.
A couple of side comments. First, IFs vary heavily from field to field. Some areas of engineering have journals with quite low IFs but they are excellent quality; it’s just that it’s a very niche subject area, not such a big research community.
Second, the heavy focus on IFs means some journals actively manipulate the scientific process to maximise their impact factor! I was shocked when recently submitting a paper to a respected learned society that during the submission process I was asked to fill out a form which included, as far as I can recall, fields for: (i) the number of references in my paper (hmm, ok), (ii) the number of references to other papers from the journal to which I was submitting (what!!), (iii) the number of references to other papers from the journal to which I was submitting within the last 2 years (speechless).
This is actively distorting the scientific process to bump up a journal IF, because it will cause people to reconsider their references and try to wedge in work that doesn’t necessarily fit their arguments. Shocking.
That’s nothing – some journals suggest you add references to specific papers from their own journal. Another trick is to publish your highly citable papers (e.g. reviews) in January, so they have the full 2 years to accrue references.
Any index of quality can be gamed, so perhaps the problem is that we’re taking IFs too seriously (where an operational definition of “too seriously” is “seriously enough to want to game them”).
That rather depends on knowing who’s going to read the application. If I’m reviewing a grant, I wouldn’t think well of anyone who cites IFs.
Yeah, that’s the problem, isn’t it? You don’t know who is reviewing, so you don’t know if they pay attention to IF or not.
Maybe this should segue into a discussion of anonymous vs nonymous review?
j/k
This discussion is making my head hurt… but let me ask you this – how about measuring the number of citations a paper has received, ignoring what journal it’s published in?
This, of course, has its own potential pitfalls – newer papers will rank lower, and certain *ahem* journals that have editorial news articles in their early pages, which themselves cite articles to be found later in the same issue, may inflate things somewhat arbitrarily. But if you buy that, in general, highly-cited papers are more “important” (please note the intentional use of inverted commas there), then it’s (possibly) as good a measure of impact as any.
I must say, I do like David Colquhoun’s suggestion of actually doing the experiment, though.
@Richard Wintle
I think that Stephen Curry dealt with that very well, above:
“had Benveniste’s notorious Nature paper on the memory of water been published today, it would undoubtedly have had millions of hits, downloads, tweets, retweets ‘likes’ on Facebook etc., i.e., a clear ‘winner’ no matter what form of metric analysis you apply.”
And of course it’s well known that reviews get more citations than original work.
Those two facts make it pretty clear that citations are bad too.
But, as you say, nobody knows for sure. The experiment hasn’t been done. And I don’t see the altmetrics geeks waiting for 30 years for the answer.
Richard, David,
Citations are some measure of interest in a piece of work, but very variable between fields as David Howey points out above. At best they are a relative measure within a particular field. But ultimately, perhaps it’s the fixation on numbers that is the root of the problem.
Hm, yes, I missed David Howey’s comment, which is a good example of that phenomenon I was alluding to *cough*NatureNewsAndViews*cough*.
David, I’d meant to disclaim about reviews, but I forgot. Since PubMed classifies papers as “Reviews” and other categories (although not entirely accurately all the time), at least there might be a way of excluding reviews from primary publications when counting citations. That’s getting complicated though, and doesn’t at all deal with the “water memory” example Stephen mentioned.
“But ultimately, perhaps it’s the fixation on numbers that is the root of the problem”
Yes. Which is why I’m cautious of altmetrics.
I think everyone is cautious of metrics, alt or not. We should at least understand them and abandon those that are worthless or distorting, such as the impact factor. But I guess the worry is that any metric, any measure, will introduce some kind of distortion. Just look at what the REF achieves…
Just a couple of quick comments as this thread is already full of great ideas and discussion.
First, I think it’s naive to think we will ever do away with metrics. If anything we are moving the other way. Society thinks that science needs to be quantitatively accountable and that’s going to involve numbers of some kind. End of, I’m afraid.
Second, as David Colquhoun says, what’s needed here is evidence. What we’re essentially trying to do here is operationalise and quantify the *quality* of science (and scientists). Is that even possible? Maybe not. But we simply don’t know. All we are doing at the moment is making assumptions, one way or the other. JIF has somehow evolved to be a proxy for quality, and a very bad one at that.
Perhaps some combination of metrics (very likely not involving JIF) is indeed predictive of success/influence in science. At the minimum we need to do retrospective analyses of these factors and how they relate to real world outcomes and ratings of article quality by other expert scientists (the gold standard). One could imagine that some combination of, e.g. h-index, individual article citation count, page views, twitter mentions, blog mentions etc. could, in sum, predict quality of work (as rated by scientists), future career success, and public impact of the science over different time scales (short, medium, long).
I don’t know of any such work, but it would be incredibly valuable.
Of course, it would also be inherently correlational rather than causal. As a first step it would identify the *potentially* causal elements, which could then be tested using an RCT-style methodology. Then we might get somewhere, but until then the field will continue to fool itself into thinking that JIF measures something of worth.
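To make the suggestion concrete, here is a minimal sketch of the kind of retrospective analysis described above, assuming one had assembled a table of per-paper metrics together with expert quality ratings. The file name and all column names are hypothetical placeholders, not an existing dataset.

```python
# A minimal sketch of the retrospective analysis suggested above: regress
# expert quality ratings on a basket of candidate metrics to see which, in
# combination, carry any predictive signal. The CSV file and all column
# names are hypothetical placeholders.
import pandas as pd
import statsmodels.api as sm

papers = pd.read_csv("papers_with_metrics.csv")   # hypothetical dataset

metrics = ["citations_5yr", "page_views", "tweets", "blog_mentions", "author_h_index"]
X = sm.add_constant(papers[metrics])              # predictors plus an intercept
y = papers["expert_rating"]                       # the gold standard: expert scores

model = sm.OLS(y, X).fit()
print(model.summary())                            # which metrics add anything at all?
```

As noted above, anything that comes out of such a model is correlational; out-of-sample prediction, and eventually something closer to an RCT, would be needed before acting on it.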
One final point: in effect JIF *does* measure value because as a community of (deluded) scientists we *give* it value. If we make appointments and award grants/fellowships based on the JIFs in an applicant’s CV then we are making JIF a currency, like a £20 note.
I think Stephen should set up an impact factor Naughty Step.
Can we measure the impact of blogposts by the number of comments they receive?
I very much doubt it, alas.
Perhaps tweets about blogposts, but then that may not be the audience you want to reach.
A really good methodological paper like Lowry’s protein determination gets cited millions of times. Fair enough, because it’s quite difficult to develop a method that’s generally applicable and easy to use.
A paper that wraps up a particular subject, to the extent that there’s nothing more to be said, rapidly becomes common (“textbook”) knowledge, and may hardly ever be cited, particularly if the presentation and the journal were low-key. I guess the identification and characterisation of natural metabolites may sometimes fall into that category.
I enjoyed reading this constructive discussion. I think we all agree on the limitations of simple and single-number metrics.
Indeed, it is well known that the citation distribution underlying the IF is skewed, with a long tail of highly cited papers. A year ago I wrote this editorial dissecting the 2010 IF of Nature Materials: http://www.nature.com/nmat/journal/v10/n9/full/nmat3114.html
(Disclaimer: I am an editor at the journal.)
In that editorial you will find the full distribution of the papers, according to citations accumulated in 2010, that contributed to the IF of the journal that year. This gives context to the IF, but it is ‘too much’ information if one has to evaluate tens or hundreds of journals.
Instead, the meaning of a journal’s IF is simple to grasp: the average number of citations per paper (the problem comes when one ignores that such an average is meaningless in the context of a particular paper or even a small subset of papers). However, the number is useful to librarians and budget managers at libraries and institutions; the more citations a journal receives, the larger the pool of readers it should have.
But it has been extrapolated into a synonym for the quality of the journal, and more often than not, for the quality of the papers the journal publishes. In some respects, I believe this has happened because it is easy to understand, and because it does the job (albeit sometimes a poor one). For instance, if one wants to quickly search for the latest quality research in a certain scientific topic beyond one’s expertise, picking a journal that publishes research in the topic and that has a high IF will very likely provide a good return without much investment in search effort and time.
In my view the IF seems to provide most people with some sort of quality signal that on average works, but that on occasion gives lousy results, in particular when it is extrapolated beyond the realm of journals.
Now, I am all for article-level metrics. However, I do not yet have a feeling for what quick, wisdom-of-the-crowd signals mean in the context of scientific papers. How should I interpret such a quality signal? Is it fairly consistent? I see two main problems with metrics derived from quick data on online social networks. First, we all know that the tweet/like/kudos of an expert or colleague does not carry the same level of trust as the equivalent vote from the average non-specialist. And second, regardless of the disparate reasons behind citations, these tend to involve careful thought and a few pairs of eyes (including reviewers); because, of course, understanding and assessing the quality of a scientific paper more often than not requires a certain level of expertise and a certain amount of precious time. Tweets and likes, by contrast, can be much more whimsical and more easily prone to quick trends and network effects such as the bandwagon effect.
Nevertheless, it may well be that clever algorithms might some day provide metrics that are good enough…
“if one wants to quickly search for the latest quality research in a certain scientific topic beyond one’s expertise, picking a journal that publishes research in the topic and that has a high IF will very likely provide a good return without much investment in search effort and time.”
I have to say, I’ve never adopted this approach. Instead I would look for reviews on the subject area and then use that as a starting point for digging into the primary literature. Or ask a colleague — if I’m thinking of breaking into a new area, I’d probably have a friend who’s already interested, or a friend of a friend.
“I see two main problems with metrics derived from quick data on online social networks…”
I didn’t mean to imply that quick and trite assessments (such as crop up frequently on twitter/facebook) should be counted among the ‘wisdom of the crowds’. I agree it takes time to make an assessment of a paper but I was thinking more of the ‘crowd’ being the community of trusted scientists within a particular field. It’s the element of trust that is important. Tried to sketch out my thinking in a little more detail in a comment above. But still thinking…
Hmm, citation didn’t go through. This is what I quoted:
“if one wants to quickly search for the latest quality research in a certain scientific topic beyond one’s expertise, picking a journal that publishes research in the topic and that has a high IF will very likely provide a good return without much investment in search effort and time.”
Not really. See the data I just compiled:
https://twitter.com/PepPamies/status/237284953589694465/photo/1/large
Great post Stephen! I agree with you that we should eventually stop using IFs as a measure of the quality of people and papers, but this will be hard and take some time. As has already been mentioned, we are addicted to numbers, and a single wrong number has taken over for the wrong reasons. The other problem is that IFs are not going to disappear all of a sudden. We need to calculate them properly! Perhaps the solution is for us scientists to calculate them properly, by using the mode or by throwing away the top and bottom 25% of the citation data and calculating the mean from the rest. These corrected IFs could be published right after the current IFs are published by WoK.
This might add more confusion to an already complicated issue, but if we provide an alternative, correct number we may stop the addiction to this awful one. I’m aware that having two IFs around might cause more problems, but as many people on this blog are genuinely interested, I thought I would throw the idea out here for discussion.
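For what it’s worth, here is a minimal Python sketch of that ‘corrected IF’ idea, comparing the standard mean with an interquartile mean (top and bottom 25% of papers discarded) and the mode. The per-paper citation counts are simulated and purely hypothetical.

```python
# A minimal sketch of the 'corrected IF' idea above: compare the standard mean
# with an interquartile mean (top and bottom 25% of papers discarded) and the
# mode, on invented per-paper citation counts for a fictional journal.
import numpy as np

def corrected_if(citations):
    """Interquartile mean: the average of the middle 50% of per-paper counts."""
    c = np.sort(np.asarray(citations))
    lo, hi = len(c) // 4, len(c) - len(c) // 4
    return c[lo:hi].mean()

rng = np.random.default_rng(1)
cites = rng.negative_binomial(n=1.0, p=0.15, size=500)  # hypothetical, skewed counts

print("standard IF (mean):", round(cites.mean(), 2))
print("corrected IF (interquartile mean):", round(corrected_if(cites), 2))
print("mode:", np.bincount(cites).argmax())
```

Because the tail of such a distribution is long, the trimmed figure comes out lower than the plain mean, which is the point of the proposed correction.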
I think more numbers would be better because it would emphasise the message that none of them is perfect.
Stephen
I’d like to differentiate further between two major points that you made. I agree that impact factors are largely misused and irrelevant, although in today’s world of Google Scholar one could, with some limited accuracy, use actual citations of papers as some kind of measure of “research relevance” over time. Admittedly, though, I’ve seen highly cited papers that were always noted as being wrong.
On the other hand, although you shied away from outright disavowing the peer review process, I do think that there are important standards to uphold. Regardless of impact factor, I’d venture that every researcher in my field knows which peer-reviewed journals are highly respected, and which are not. This is despite yearly fluctuations in the impact factor. So I am suggesting that I don’t think a single system in which every paper without flawed experiments gets published is a good one.
Why? Not because of ego. Better papers not only provide new data, but clarify and shed light. Other papers may bring new data to scientists, but if I want to read a paper with a message and not a conglomeration of data (albeit accurate data), I know which journals to turn to. And I aspire to publish in those journals.
Which begs yet another question – how good is Google Scholar at capturing citations? I really don’t know, but it always reports different figures than ISI Web of Science/Science Citation Index, for example. When I have to look at # of citations, I tend to look at both (and if I’m writing them down, say in a summary table for a grant application or whatever, I generally report both – in case anybody cares).
I dunno; Web of Science was gone with the wind when the budget cuts started.
I wasn’t shying away; rather I made it clear that I still see a very useful function in pre-publication peer review.
On your latter point, surely you would rather earn respect by the quality of your work, not the brand name on the front of the journal. It is only the fact that (mis)measures of esteem are attached to journals that drives this behaviour.
The problem is, as Steve Moss notes, that you really have to read the paper to evaluate it. So in today’s world of cluttered scientific releases, knowing that there are journals I can trust allows me to focus on and read those works, as opposed to tip-toeing through a sea of data.
I think that in promotion and hiring decisions it is also that ranking by journal relieves the committee members of actually needing to read, understand and judge the papers. They also don’t have to ‘own’ their judgements of the people, or the work, as the metrics can be blamed if it all goes awry.
Talking about the actual reading of papers.
Is there not an opportunity to use the log files of the online journal hosts to demonstrate what gets downloaded or looked at online? That could be one metric.
There is also the growing study of online reading, again based on log files from eReaders. There are lots of studies by Nicholas, Rowlands, and Duggan. These could add weight to the download metrics.
In another development there is Scoopinion, a browser-level plugin that watches your online reading habits and uses this to seed other reading material based on what you actually spend time reading. Perhaps this could be applied in the academic community to selected “influencers” or “peer review readers” and the data used as a metric.
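As a rough illustration of the log-file suggestion, here is a minimal sketch that counts article downloads per DOI from a web-server access log. The log path, URL pattern and line format are all hypothetical; real journal platforms report usage differently (e.g. via COUNTER-style reports).

```python
# A minimal sketch of the download-metric idea: count article downloads per
# DOI from a web server access log. The log path and line format are
# hypothetical placeholders; real journal platforms differ.
import re
from collections import Counter

doi_pattern = re.compile(r"GET /article/(10\.\d{4,9}/\S+)")
downloads = Counter()

with open("access.log") as log:            # hypothetical log file
    for line in log:
        match = doi_pattern.search(line)
        if match:
            downloads[match.group(1)] += 1

for doi, count in downloads.most_common(10):
    print(f"{count:6d}  {doi}")
```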
Absolutely – but as Chris Chambers points out above, we are unlikely ever to get away from some kind of metric (especially since they serve various functions). So we might as well have measures that are more accurate; better yet, if we get away from having a single over-riding indicator (the JIF), then that would serve as a constant reminder that no single number is ever adequate to make the best judgement of a piece of science or an individual scientist.
Great post!
I did not have time to read all the comments, but would like to add these two references:
In an open access commentary, Julien Mayor analyzes the impact factors of psychology journals. He found that top journals such as Psychological Review show a steep increase in citations over a much longer period than the two years considered in the ISI IF. Furthermore, its median number of citations is much higher than that of Nature, although its IF (based on the mean number of citations) was only about a quarter of Nature’s in 2009.
This is consistent with the claim made above that a minority of articles in high-IF journals like Nature is responsible for a majority of citations. (Are scientists nearsighted gamblers? The misleading nature of impact factors)
Leading neuroscientists criticized impact factor fetishism in a PNAS editorial, writing
Fig. 1 in “Are scientists nearsighted gamblers? The misleading nature of impact factors” is something Bob O’Hara & Co should see when they defend the IF…
Of the four points at the bottom of your post, I emphatically agree with the first two. Using Impact Factor to evaluate a person rather than a journal has to be laughed out of the room at every opportunity.
The third point, I’m not so sure of. It seems to me this might be the one place where Impact Factor has some usefulness.
As an author trying to decide where to publish an article, I want to have some assurance that publishing my paper in the Chinese Journal of Irreproducible Crap is better than burying the manuscript in my back yard. There are a lot of new journals out there, particularly online, and we’re still in the process of trying to figure out which ones are credible. A measure like Impact Factor can show that a journal is being used by a scientific community, and saves me time in trying to work out if this is a viable venue for publication.
Promoting tiny increases in Impact Factor and showing them to 3 decimal places is goofy, though.
How about you publish there: http://www.journals.elsevier.com/homeopathy/
IF: 1.141
BTW, ~80% of journals are below IF1 (own estimate). My first experimental paper (as an undergrad) was published in the Journal of Fish Biology, IF=1.685
Come to think of it, that comparison alone is almost sufficient to discredit the IF even to compare journals 🙂
My reading of the data:
https://docs.google.com/document/d/1VF_jAcDyxdxqH9QHMJX9g4JH5L4R-9r6VSjc7Gwb8ig/edit
tells me that most differences we perceive between journals are subjective and don’t hold up against scientific scrutiny. The data indicates: journal rank (wrt some measure of quality) is largely a figment of our imagination without much supporting evidence.
Point conceded. How Homeopathy is on the score board is beyond me.
Yes, I totally agree that even in the one place where I think Impact Factor has some value, that its value is limited. You have to look at other factors to figure out how many Pirsigs a journal has.
But if a journal has been running for 5, 10 years, and nobody has cited a paper from the journal, ever? That says to me that it’s worse than a dumping ground. Even a dump attracts flies.
The non-utility of the JIF is clearly seen in the link given by Stephan Schleim above.
Great article, Stephen.
I love Jenny’s suggestion of a naughty step 🙂 Perhaps we need a publicly declared boycott like they did with the Elsevier boycott/Cost of Knowledge?!
Some kind of boycott of impact factors is needed but I’m not sure how it would be implemented; hence the initial plan for a ‘smear campaign’. But it’s definitely worth thinking of additional actions that could be taken to eradicate the use of impact factors.
Some of the statements in the comments above are contradicted by the clear correlation between the two-year impact factor and the 5-year median number of cites for research articles. See the two links below:
https://twitter.com/PepPamies/status/237284953589694465/photo/1/large
http://occamstypewriter.org/scurry/2012/08/19/sick-of-impact-factors-coda/#comment-12080
To Pep Pàmies
I’m afraid you are making the same mistake that most bibliometricians do, namely talking about averages rather than individuals. It was shown quite clearly by Seglen that for any one person there is no detectable correlation between the number of citations a paper gets and the IF of the journal in which it appears. That’s been confirmed since, and it’s certainly true for me.
Of course citation counts are also a very dubious method for assessing the future prospects of individuals, but that is another story.
To David Colquhoun:
Yes, of course. I fully agree. As I mentioned in my first comment above (http://occamstypewriter.org/scurry/2012/08/13/sick-of-impact-factors/#comment-11956) and in my other comment on the follow-up piece to this blog entry (http://occamstypewriter.org/scurry/2012/08/19/sick-of-impact-factors-coda/#comment-12080), IFs are for journals, not for individual papers or researchers. This of course has been known for a long time, as the comments and links in response to this blog entry clearly show.
All I am saying with the data I compiled is that there is a strong correlation between the 2-year IF and the 5-year median for original research articles (I would be surprised if no one has published this correlation before). This contradicts one main point of this piece: that IFs are not useful for ranking journals. In view of the data, and if we agree that citations accumulated within a sufficiently long period of time are a measure of the scientific worth of the average paper, I find that point difficult to sustain.
According to the data, the rule of thumb would be that half of the original research papers published in a journal in the last 5 years will have been cited at least a number of times approximately equal to the journal’s IF.
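For anyone who wants to probe that rule of thumb on their own data, here is a minimal sketch of the comparison: an IF-like 2-year mean per journal set against the 5-year median for its research articles. The file and column names are hypothetical placeholders, not a real dataset.

```python
# A minimal sketch of checking the rule of thumb above: compare each journal's
# 2-year mean citation count (an IF-like number) with the 5-year median for
# its research articles. File and column names are hypothetical.
import pandas as pd

papers = pd.read_csv("per_paper_citations.csv")   # hypothetical per-paper data
# expected columns: journal, cites_2yr, cites_5yr

per_journal = papers.groupby("journal").agg(
    if_like_mean=("cites_2yr", "mean"),    # roughly what the IF reports
    median_5yr=("cites_5yr", "median"),    # the 'typical' paper over 5 years
)

print(per_journal)
print("correlation:", per_journal["if_like_mean"].corr(per_journal["median_5yr"]))
```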
Pep,
Your analysis does not contradict the post because I did not claim that “IFs are not useful for ranking journals”. Rather, I argued that use of the IF is inappropriate because it gives an exaggerated indication of the ‘typical’ number of citations accumulated by a journal’s papers over a 2-year period (see the third paragraph). Your analysis confirms this nicely, since it shows that it takes 5 years for the median to catch up with the mean calculated from the first 2 years of data. Journals that advertise their IFs without properly explaining the statistical context are engaging in dissimulation.
In any case, as you also clearly agree, the primary problem with IFs (which was the main point of the post) is their continued misapplication to papers and individuals.
I sense we both agree on the fundamental points: The IF has long been misused, and easy-to-understand, useful article-level and researcher-level metrics are badly needed.
As for the contradictions that I pointed out, I was referring to these sentences in your post:
“Take a moment to think about what that means: the vast majority of the journal’s papers — fully 85% — have fewer citations than the average. The impact factor is a statistically indefensible indicator of journal performance; it flatters to deceive, distributing credit that has been earned by only a small fraction of its published papers.”
It seems to me that this statement refers to an inappropriate interpretation of the IF. If one interprets the IF as the typical number of citations that a paper will receive in about two years after publication (that is, the mode), then yes, the IF appears to be a mis-measure.
But the problems disappear if one is faithful to the meaning of the IF (it is an average, not a mode, of citations to a particular journal within a particular timespan). If one sticks to that true meaning of the IF, then ‘distribution of credit earned by a small fraction of papers’ does not apply in this context. It seems that the wrong notion of the IF is so ingrained in the scientific community that you have unconsciously been a victim of it in your own post: if I understood correctly, you used the average as a proxy for the mode; that is, you seem to have applied the IF as a measure of the short-term ‘credit’ for a typical paper, or, if you like, as a ‘mis-measure’ of the quality of a typical paper in the journal.
But, as the data I compiled show, the IF is a good indicator of journal performance if we agree that the median number of citations over a sufficiently long time (5 years, for instance) is a good proxy for the quality of the original research articles published in a journal.
In conclusion: it is difficult to argue against the data-supported notion that the IF is a good indicator of long-term journal performance for the vast majority of journals (that is, for journals that don’t game the system or publish a substantial number of reviews, for instance).
And it should always be stressed (also by publishers; I will raise this issue at the company I work for, Nature Publishing Group) that extrapolations of the IF to individual researchers and papers are erroneous and dangerous.
Thanks Pep — I think we are converging on the view that the problem is that the IF is widely mis-interpreted among the scientific community as the ‘typical’ number of citations that a paper from a given journal will accrue in 2 years. That at least was the message I was trying to get across.
The more important point, however, on which we (and most others it would seem) agree fully is the pernicious problem of mis-applying JIFs to individuals or individual papers. The continued promotion of this one parameter by journals and publishers gives it a false legitimacy in this regard that has proved very difficult to undermine. So I would certainly be grateful for any efforts to get across the message that “extrapolations of the IF to individual researchers and papers are erroneous and dangerous”.
Guys
It is all very interesting and mostly true. However, I think our energies would be better spent fighting the underlying problem: the managerial class usurping power. Sure, measuring the unmeasurable is an interesting problem. But there is a limited need for this debate in the UK. Good or bad, the REF criteria are not so stupid as to rely on JIFs. What is more, I have never seen JIFs referred to in any EPSRC documentation, and I reviewed EPSRC grant applications for many years, sitting on EPSRC panels a couple of times a year. I do not think EPSRC has changed this policy in the past three years, although I am not involved with this Council anymore: in 2009 I was unlawfully dismissed for failing to reach my new Dean Professor Rao Bhamidimarri’s JIF ideal. This was used to disband our old professoriate, which had a high RAE2008 score. Our hard-earned QR money was then used to attract a new professoriate (much younger people, who hadn’t held chairs before and who, surely, were not informed of what had happened). The non-QR income went down by 30%, student satisfaction at my department is now at 50%, but these things happen, right? I doubt the outcome of Queen Mary’s proposed “restructure” would be any different.
So, this is my plea: reclaim the ground! Decisions on firing academics should be collegiate, involving other academics, researchers, students, technicians and junior administrators, probably using well-argued submissions (with all sorts of bibliometrics if necessary) and, definitely, an anonymous vote. Senior managers should have an input (the external pressures are great), but they have to be accountable for their decisions!
Larissa
Thanks for the comment — I completely agree that the primary problem is the mis-use of JIFs in judging individuals. They have a false legitimacy that administrators use instead of proper judgement. That said, I’m not sure that your extremely democratic proposal is workable in practice, though I would certainly want to see individuals’ contributions to research and teaching (and other vital functions of a university) factored into decisions on promotion etc.
Stephen
Of course, a proper democracy is dreadfully difficult to implement. But let’s start thinking about it. General anonymous voting on university appointments and dismissals? Rotating academic boards? What about a database of discredited managerial practices (such as the invention of non-REF criteria, in particular retrospective criteria), approved by, say, professional bodies? A national anonymous helpline for bullied members of universities? A fund for class actions against unscrupulous managers? Proper educational research? (Do you know how fashionable it is becoming to say that people can learn without teachers?) This is something the unions should be doing but don’t, so somebody has to.
I am also sick of impact factors, or any other ratings/metrics/evaluators. Performance indicators are proliferating, but what is performance and how do you measure it? Is your work crap if you did not publish in high impact factor journals?
As the American psychologist James McKeen Cattell said, “Expert judgment is the best, and in the last resort the only, criterion of performance.”
When I read a paper, I am just asking:
– is it useful for my own work?
– can I apply the strategies/techniques used to my own work?
– am I convinced by the data?
– can I reproduce some of the data in my lab if I need to?
– has it been peer-reviewed? (Of course we do need some kinds of controls: conflicts of interest, rigorous methodology, transparency, plagiarism.)
So I am in total agreement with Stephen when he says: “this should simply be a technical check on the work, not an arbiter of its value.” And in a way, the final technical check should be your OWN CHECK. Are you convinced or not by the paper, independently of where and by whom it has been published?
So after that, I don’t care at all whether the paper has been published in Cell or in the Little Science Journal of New Hampshire. Yes, well, the Little Science Journal of New Hampshire is not open access, does not have a website and publishes 5 papers a year, but maybe some great unnoticed work that could be very useful for my projects is published there.
But unfortunately ” Impact factor has become such an important determinant in the award of the grants and promotions needed to advance a career. ” ……
Moreover, the so-called big journals are also playing the game and imposing the rules to keep their high impact factors.
In my field (molecular biology, gene regulation), if I want to publish in high impact factor journals (Nature, Science, Cell, …), it is not just a question of novelty, high importance or medical breakthroughs. No, it is also a question of money! You know, the genome-wide association studies, deep sequencing of 3,000 tumors taken from 10,000 individuals, that sort of thing.
I am not saying that these studies are not useful (well, in fact I am saying it), but when I read this kind of work, I am thinking: “How much did it cost to do that?!” “The first figure of the paper must have cost at least a hundred thousand dollars. That’s probably the budget for my whole lab for 2 years, not the budget for one project…”
Voilà. In short, I don’t like and I don’t care about impact factors. I just need to be convinced by the work and to use it to progress in my own research.
Just come across your thoughtful piece. I think I may have been the first person to calculate Impact factors (average number of citations per publication) on a few individuals in the article C. Oppenheim and S.P. Renn, Highly cited old papers and the reasons why people cite them, Journal of the American Society for Information Science, 1978, 29 (2), 225-231. However, it was just a one sentence aside in a paper devoted to other topics, and I did not recommend its further use. I doubt anyone noticed the remark apart from Garfield himself, who commented to me that he thought it was a silly idea!
Thanks for the comment – it adds a nice historical touch to the discussion. I wonder whether motivations to cite papers have changed since 1978, given the rise of bibliometrics and the all-conquering JIF?
Evaluating how we evaluate by Ronald Vale : http://www.molbiolcell.org/content/23/17/3285.full
Very interesting reading!
Thanks for pointing out that paper – it is superb!
Update (8 pm): have now blogged about it!
I am pretty low on the totem pole and, quite frankly, seriously reconsidering science as a desirable career because of problems like this. Science is notorious for long hours and low pay. The only upside (I’m told) is freedom from the hierarchical inefficiencies endemic in most other fields. It looks like this isn’t true anymore.
I think that the onus is really on those doing the hiring, who are by and large too lazy to properly assess the applicant’s ability. These metrics are not the enemy; our enemy is laziness. It is much the same problem faced by businesses that predominantly hire MBAs from x university, irrespective of work experience. These businesses get what they deserve, and this practice is changing rapidly.
The interview process needs to change. This is the root of the problem. People are disregarded as unqualified within minutes of introduction. A resume should be previous publications (the actual publication, not a list of names) alongside a list of technical skills. One serves as proof of the other. If the applicant is capable of solving the kind of problems required by organization x, they might be hired. This would take hours of work per applicant. So be it. At one time, all of you old people were hired in this way. It is truly a shame that you are unwilling to grant us young scientists the same honor.
The fact that prominent scientists can actually get caught up in what is little more than a popularity contest astounds me. A post far above this suggests that Facebook would be an immature indicator of popularity, when the culture at conferences and, in general, the publishing industry seems like little more than a technologically sophisticated school-yard.
While I do not think this is true of everyone successful in the field, I think that modern science has bred a culture of lazy, popularity-seeking individuals. No wonder they are so poor at statistics.
Clearly a heart-felt comment, jviviano – thank you. The situation is serious and problematic but not, I would like to think, hopeless. Please see the article I linked to in a more recent post, which addresses the point of lazy reliance on impact factors for evaluation of scientists and comes up with more robust proposals.
… just had a paper kicked back from a journal. One of the review comments was this: “Nevertheless, references to previously published papers in [this journal] are not sufficient.”
Groan.
A perennial attempt, typical of complex bureaucracies, to replace semantics with syntax: semantics is not algorithmic and requires people to actually understand things; syntax only needs machines and can be automated.
Many years ago I was on the Faculty’s Library Committee. The metric that tickled the Science Librarian and the Committee was “Journal half life”, which conjured up the image of volumes stochastically exploding on the shelves….
….but actually quite an interesting metric, since some journals had a half life >99 years. It was in the longer half life journals that the most definitive measurements were recorded, waiting for a real advance before they were superseded. Sadly, ISI now restrict themselves to 1-10 for this metric.
Hi Stephen
as an entomologist doomed to publish in so-called low-IF journals I totally agree; it was one of the reasons why I left IC. Did you send the Rector a link to your article?
One more thing. Quite often a committee evaluating funding proposals is assessing proposals from a rather broad range of topics within some field. It takes even a specialist many hours to write a proper review of a paper for a journal, so I am not sure how a non-specialist could evaluate a paper in, let’s say, 1 hour. Even a nice-looking paper may have big problems in how it cites previous work, and its equations may be based on wrong or unrealistic assumptions. But these things might be very difficult for a non-specialist (in that subfield) to see. This is based on my own experience of evaluating papers. In any case, this seems to be a very difficult topic and I guess there is no approach that is valid for every scenario. Thank you for reading.
Sir,
Thanks for this article, it has saved me! Because my research was never accepted by journals with high impact factors, I had started to underestimate myself, my research and even my research career. Now, after reading your article, I am very relieved and down to earth! Recently, my research was accepted by Indian journals (I am from India), most of them open access and with low IFs. I dared to publish my research in such journals thinking that my countrymen would read it and benefit from it. But the response to such low-IF journals is such that readers hesitate to read them or dismiss them as low-grade material! The result of such (mis)thinking is that many of the OAPs are shutting down; even one of the journals in which I have published my articles has now stopped publication, and once again authors and researchers like me have suffered!
Thanks again!
The notion to abandon metrics is absurd. We are scientists. Science is successful because we can quantify phenomena and use those quantities to make inferences about the qualities of the systems we study.
How can any of us advocate to not apply scientific principles to evaluating our own work and the work of others?
On my next journal article, I’d like to see how far I get claiming I had an inhibitor, but instead of inhibition data I simply submit a packet of letters from evaluators that explain why my inhibitor may be a good one.
*sings*
There’s a strawwww man, waiting in the sky…
I have not advocated this.
Your inhibitor analogy is flawed. You are right that letters of recommendation won’t get you too far – a more direct measurement such as an IC50 is needed. But the average of the IC50s that you have measured in the last 2 years for different and unrelated inhibitors is not good either! In fact the latter is absurd, and one can argue it is worse than a packet of letters of recommendation from well-respected scientists.
Hi Stephen
I really enjoyed your article. I agree. Can we start and sign an internet petition against the use of IFs to assess researchers’ performance in universities?
Hi Stephen,
I agree with you in almost everything.
I’d like you to comment on my project.
https://www.facebook.com/TheScienceRevolution
I am trying to find a solution to the problem…
This could be a number of metrics that depend on the value of your publication for the scientific community.
To measure this value you will consider:
the number of downloads
the rating
the followers
the comments
…
Regards,
Michele Gimona
I’m sorry, but those measurements do not measure the “value of your publication for the scientific community”. This has been shown time and time again. They measure short term popularity and your ability to game the system. If adopted they would corrupt science.
Very useful post. Here is also another digest related to the impact factor: http://publication2application.org/2013/12/02/impact-factor-a-poor-quality-indicator-of-quality/
Thanks