Debating the role of metrics in research assessment

I spent all of today attending the “In metrics we trust?” workshop organised jointly by HEFCE and the Science Policy Research Unit (SPRU) at Sussex University. This was part of the information-gathering process of HEFCE’s independent review of the role of metrics in research assessment; the review has a particular focus on how metrics might be used in the Research Excellence Framework (REF) that determines block grant allocations to university departments and research institutes. I was attending because I am a member of the review steering group.

The day promised to be one of vigorous debate because the consultation process that closed earlier in the summer had attracted over 150 responses — soon to be published — and these presented a wide range of views on the dangers and potential of metrics. And so it proved to be, with three panel sessions exploring “The changing landscape for research metrics”, “The darker side of metrics: gaming & unintended consequences” and the question of whether there can be reasonable progress “Towards responsible uses of metrics”. Sandwiched between these was a bazaar in which various metrics vendors displayed their wares.

I don’t have time tonight to capture the range of points and insights that were offered during the course of an interesting day, but I was somewhat reassured by the widely shared expressions of belief that any use of metrics in research assessment — or as a probe of the propagation or impact of research into the wider world — has to be done with care. The mantra that metrics should inform judgements and not replace them was repeated by many participants and will hopefully soon be enshrined in a set of principles to be known as the Leiden Manifesto.

Almost a lone voice, Prof Dorothy Bishop presented a provocative case for supplanting the cumbersome system of peer review in the REF with a much lighter touch analysis of departmental h-indices calculated for research-active staff — an idea that she has previously outlined on her blog. Dorothy showed that, at least for some disciplines (including natural sciences and psychology), use of this metric generated scores that correlated well with resource allocations from the 2008 Research Assessment Exercise (the forerunner of the REF). (Update: as David Colquhoun points out below, Dorothy showed today that most of the correlation is actually due to the number of people in each department — and she has since detailed her proposals in a new blogpost). The particular advantages of this approach are the cost saving — reckoned to be somewhere between £60m and £100m — and the elimination of the bias that arises from panel members’ affiliations. But it remains to be seen if the method is applicable across all disciplines; or if it fulfils some of the other purposes of the REF, which include examination of broader impacts and demonstrating the commitment of UK research to quality control through periodic self-examination (a feature that plays well at the level of government).

I hope others might chime in with their impressions and analyses of the day. Already there is a Storify aggregation of some of the tweets that tracked the different sessions. I include below my contribution, which was part of the session on the darker side of metrics. It has been lightly edited to clarify and sharpen some points but remains brief and incomplete. This debate is far from over.

“I come here today very much with an open mind on many aspects of metrics, though I fear that may largely be because I am still somewhat confused. So I am glad to have the opportunity to participate in today’s discussions. Already, I am beginning to see some interesting things.

On some topics my mind is made up. I remain sick of impact factors, for example, because of the way that they are so commonly mis-applied in the assessment of individuals or individual pieces of research. I don’t need to rehearse the arguments that I laid out in a blog post of the same name in 2012, except to say that impact factors are a powerful illustration of how a relatively innocent innovation in quantitation can be perverted and do real damage to the research community. I don’t think there is much dispute on that point (though I was surprised and disappointed to come across defenders of this metric in the discussion at the end of this session).

I am worried about people being seduced by the apparent objectivity of numbers. We saw something of that last week in the excitement whipped up by the announcement of the World University Rankings in the Times Higher Education (THE). In the preamble to its explanation of the methodology, the THE describes the ranking process as a “sophisticated exercise” that is “carefully calibrated” to provide “the most comprehensive and balanced comparisons”. It ranks universities on a composite score drawn from estimates of a range of indicators of teaching, research volume and influence, industrial income, and international outlook.

The Times Higher are good enough to be open about the methodology but when you read exactly how they assemble and weigh the various components, you read statements such as “we believe…”, “UGs tend to…”, “our experts suggested that…” or worse: “the proxy suggests that…”. And so you can see that, although it may be sophisticated, the measure is clearly also subjective. It is not sophisticated enough to assign error bars or confidence intervals to the scores given to universities and I think that’s unhealthy. It seems as if the rankers are laying claim to a level of precision that cannot be justified.

And that tendency for numerical ‘measures’ to wrap themselves in a pseudo-objective authority is a longstanding problem with metrics; in the end people adopt them without thinking hard enough about where they came from.

As a result, I am worried about the word ‘metric’. It implies measurement but, although there are now an increasing number of things that we can count — thanks to the increasing computerisation and connectedness due to the internet — there is still much uncertainty (as we heard this morning from Cameron Neylon) about what those numbers are measuring or what they mean. We still struggle to define quality and impact, never mind being able to measure them. But that is OK and we should not be shy about admitting the difficulty of making judgements about quality or impact — or conceding the limitations of the things that we are counting.

But I think it would be more honest if we were to abandon the word ‘metric’ and confine ourselves to the term ‘indicator’. To my mind it captures the nature of ‘metrics’ more accurately and limits the value that we tend to attribute to them (with apologies to all the bibliometricians and scientometricians in the room).

As someone who is from Ireland, where we have been telling stories for thousands of years — from a time before stories were written down, never mind cited and counted — I was pleased to have heard the word ‘story’ (or its posher cousin ‘narrative’) mentioned so many times in the session this morning. Stories matter to people and although it is now a commonplace to assert that ‘the plural of anecdote is not data’, I wonder if that is always true.

I think that in some ways the diversity of activities and qualities and impacts that are part and parcel of the academic enterprise can only be captured in stories and in narratives. We should be honest about our limited abilities to describe these attributes with quantitative indicators. More than that, we should not be shy about celebrating the wonderful stories that we can tell. I look forward to the publication of the REF2014 narratives (sorry, stories) because I think many of us will be pleasantly surprised to find out about the different ways that research work has vaulted over the walls of academia and into the real world — where it matters.

And finally, wearing the hats associated with my involvement in Science is Vital (SiV) and the Campaign for Science and Engineering (CaSE), I want to emphasise the important political dimension of the REF, which is that it provides a mechanism for the research community to demonstrate that it is accountable — to government and to the tax-payers who fund us. I think that is important. (And I think that it is important for the researchers on the ground to buy into the process and participate — it is not sufficient to leave this to provosts, vice-chancellors and research managers).

With that in mind, and not forgetting the limitations of quantitative indicators, researchers shouldn’t be too prissy about using numbers that have some meaning — especially if they are aggregated at levels that can attenuate the noise in the system. At SiV and CaSE, the case for continued investment in UK science is based in part on the productivity and quality of our research base. In part that is estimated through numbers of publications and citation rates. The UK has 1% of the world’s scientists but produces 6% of publications, and about 14% of the most highly cited papers. Do we really believe those numbers are meaningless? They are not the whole story of course. It is just as important — I am aware of the presence of sophisticated policy analysts such as Ben Martin and Andy Stirling in the room today — to be able to talk about the need to maintain a research and university infrastructure so we have generative and absorptive capacity for innovation. (Not to mention the intrinsic value that research gives to human existence by satisfying our curious nature).

So although there are risks, I think we should count on some indicators to inform our judgements, to test and challenge our stories (so as to mitigate our biases), and to help us tell those stories to ministers and the public. Those risks are real but I think they can be counteracted by transparency and debate. I am optimistic that the research community is up to that challenge.”



9 Responses to Debating the role of metrics in research assessment

  1. I don’t think that Dorothy Bishop said what you attribute to her. What she showed was that the moderate (r^2 = 0.7) correlation between RAE income and H-index that she showed on her blog a while ago was explained almost entirely by the number of people in the department (I’m kicking myself for not noticing that her original graph was not normalised for the size of departments). She showed also today that adding the H-index hardly improved the correlation. So her talk showed that the people who were submitted for RAE were more or less equally good, and you could get a very similar result simply by dividing the cash equally. No need for any metrics whatsoever.

    This would be the best and cheapest system ever. Sadly it has one problem: it works only when applied to people who submitted for the RAE. If departments were paid equally per person, they’d just submit everyone. It is hard to think of a solution that isn’t susceptible to cheating (I don’t use the euphemism “gaming”).

  2. Readers may like to have a look at the article A Critique of Hirsch’s Citation Index: A Combinatorial Fermi Problem by Alexander Yong in this month’s Notices of the American Mathematical Society. The take-home message is that a person’s h-index is well approximated by the square root of the number of total citations they have. Since this is merely a rescaling of the latter measure, the h-index is hardly a nuanced replacement.

    Amusingly, in a discussion of journal metrics [2] a mathematician, on seeing the h-indices of two journals, immediately predicted (based on Yong’s findings) that one journal published about 8 times as many papers as the other, and he was correct. In this case the more selective journal also had a much smaller impact factor.
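For readers who want to try the comparison themselves, here is a minimal sketch that computes an h-index from a list of citation counts and sets it beside a square-root estimate. The citation counts are made up for illustration, the function names are hypothetical, and the scaling factor of 0.54 is the rule-of-thumb constant as I recall it from Yong’s article, so treat it as an assumption rather than a fitted value.

```python
import math


def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def sqrt_estimate(citations, scale=0.54):
    """Rule-of-thumb estimate: scale * sqrt(total citations).

    The default scale of 0.54 is an assumption taken from Yong's article.
    """
    return scale * math.sqrt(sum(citations))


# Made-up citation counts for one hypothetical researcher (not real data).
example = [50, 33, 20, 18, 12, 9, 9, 7, 5, 4, 3, 3, 2, 1, 1, 0]

print("h-index:", h_index(example))
print("square-root estimate:", round(sqrt_estimate(example), 2))
```

If the two numbers track each other closely across many real citation records, that would support the point that the h-index adds little beyond the total citation count. A single invented list like this one proves nothing either way.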


  3. David Sweeney says:

    David says ‘you could get a very similar result simply by dividing the cash equally. No need for any metrics whatsoever’

    Just going on the basis of RAE2008 submitted numbers (and David draws attention to the problem there) and taking no account of performance in allocating funding would lead to cuts of over 20% in QR at several institutions. Performance-based funding does make a significant difference to the allocations. We will see shortly whether REF panels have produced a similarly differentiated performance judgement.

    • Thanks for that information. Have you got it in graphical form? Which departments would lose, and which would gain?

      I guess that it means that Dorothy Bishop’s r-squared of 0.8 between number submitted and allocation is not good enough to predict accurately (quite possible). Or it means that the correlation is less high in areas other than psychology. Do you know which of these explanations is right?

      Of course, as I mentioned, allocation on the basis of department size alone wouldn’t work because there would be no pre-selection by departments; the departmental cat would get submitted.

      • Dave Fernig says:

        To avoid the inclusion of the cat, could one not apply a simple filter on quality, rather than trying to divide 1*-4*? In essence, simply count the number of academic staff producing above a particular threshold. Panel work would be reduced, since they would only have to look at the margins, and internal assessment would also be reduced, since the focus would again be on a single margin. Moreover, if this threshold was roughly the 1*-2* divide, it would, I think, be easier, at least in STEM areas. Then apply the above formula.
        In the end, those who are productive are… …productive. Pushing the system into ever more refined guessing of quality does not change what I publish, though the less wary may well destroy their careers due to poor mentoring and pursuit of the chimera of glamour mags.

        • That sounds like a good idea. But who does that simple filter? It seems well worth thinking about more because it would save a huge amount of time and money.

          It would also save the money that universities pay to metrics companies. If you must have citations, use Google Scholar, which is not only free, but better than any of the expensive products of publishing companies, because it covers books.
