The REF: what is the measure of success?

Science has been extraordinarily successful at taking the measure of the world, but paradoxically the world finds it extraordinarily difficult to take the measure of science — or any type of scholarship for that matter.

That is not for want of trying, as any researcher who has submitted a PhD thesis for examination or a manuscript for publication or an application for funding will know. Assessment is part of what we do and not just at the level of the individual. The UK research community has just prostrated itself for the sake of the Research Excellence Framework (REF), an administrative exercise run by the Higher Education Funding Council for England (HEFCE) that has washed a tsunami of paperwork from every university department in the country into the hands of the panels of reviewers now charged with measuring the quality and impact of their research output.

The REF has convulsed the whole university sector, driving the transfer market in star researchers who might score extra performance points and the hiring of additional administrative staff to manage the process, because the judgements it delivers will have a huge effect on funding allocations by HEFCE for at least the next 5 years. The process has many detractors, though most might grudgingly admit that the willingness to submit to this periodic rite of assessment accounts at least in part for the high performance of UK research on the global stage. That said, the enormity of the exercise is reason enough to search for ways to make it less burdensome. So before the current REF has even reached its conclusion (the results will be posted on 18th December), HEFCE has already started to think about how the evaluation exercise might itself be evaluated.


The metrics review

As part of that effort and at the instigation of the minister for science and universities, David Willetts, HEFCE has set up an independent review of the role of metrics in research assessment. The review is being chaired by Prof James Wilsdon and, along with eleven others from the worlds of academia, publishing and science policy, I am a member of its steering group (see this Word document for the full membership list and terms of reference).

Metrics remain a tricky issue. In 2009 a pilot of a proposal to supplant the REF (or rather, its predecessor, the RAE) with an assessment that was largely based on bibliometric indicators concluded that citation counts were an insufficiently reliable measure of quality. How much has changed since then is a question that will be considered closely by the steering group, although this time around the focus is on determining whether there are metrics that might be used meaningfully in conjunction with other forms of assessment, including peer review, to lighten the administrative load of the assessment process. There is no appetite for a wholesale switch to a metrics-based assessment process. To get an overview of current thinking on metrics from a variety of perspectives, I would recommend this round-up of recent posts curated by the LSE Impact blog.

One thing that has changed of course is the rise of alternative metrics — or altmetrics — which are typically based on the interest generated by publications on various forms of social media, including Twitter, blogs and reference management sites such as Mendeley. The emergence of altmetrics is very much part of the internet zeitgeist. They have the advantage of focusing minds at the level of the individual article, which avoids the well known problems of judging research quality on the basis of journal-level metrics such as the impact factor.

Social media may be useful for capturing the buzz around particular papers and thus something of their reach beyond the research community. There is potential value in being able to measure and exploit these signals, not least to help researchers discover papers that they might not otherwise come across — to provide more efficient filters as the authors of the altmetrics manifesto would have it. But it would be quite a leap from where we are now to feed these alternative measures of interest or usage into the process of research evaluation. Part of the difficulty lies in the fact that most of the value of the research literature is still extracted within the confines of the research community. That may be slowly changing with the rise of open access, which is undoubtedly a positive move that needs to be closely monitored, but at the same time — and it hurts me to say it — we should not get over-excited by tweets and blogs.

That said, I think it’s still OK to be excited by altmetrics; it’s just that the proponents of these new forms of data capture need to get down to the serious work of determining how much useful information can be extracted. That has already begun, as reported in a recent issue of Research Trends, and I look forward to finding out more through the work of the steering group. Though I have already written a fair amount about impact factors and assessment, I don’t feel that I have yet come close to considering all the angles on metrics and claim no particular expertise at this juncture.

That’s why I would encourage people to respond to HEFCE’s call for evidence which is open until noon on Monday 30th June. The review may have been set up at the behest of the minister but it remains very much independent — as I can attest from the deliberations at our first two meetings — and will take a hard look at all the submissions. So please make the most of the opportunity to contribute.


Beyond the review

Although the remit of the review is strictly limited to consideration of the possible role of metrics in future assessment exercises, I can’t help wondering about the wider ramifications of the REF.

The motivation behind the process is undoubtedly healthy. The validation of the quality of UK research and reward of those institutions where it is done best instils a competitive element that drives ambition and achievement. But however the evaluation is performed, the focus remains on research outputs, primary among which are published papers, and that narrow focus is, I think, problematic. I hope you will indulge me as I try to pick apart that statement; my thinking on this topic has by no means fully matured but I would like to start a conversation.

I know from my time making the arguments for science funding as part of Science is Vital that it is hard to measure the value of public spending on research. As shown in classic studies like those of Salter and Martin or the Royal Society’s The Scientific Century report, this is in large part because the benefits are multi-dimensional and hard to locate with precision. They include the discovery of new knowledge, realised in published papers, but within the university sector there are many other activities associated with the production of those outputs, such as training of skilled graduates and postgraduates, development of new instruments and methods, fostering of academic and industrial networks, increasing the capacity for scientific problem-solving and the creation of new companies.

There is a whole mesh (or is it mess?) of outputs. The latest incarnation of the REF has made a determined effort to capture some of this added value or impact of UK research but has wisely taken a pragmatic route. Realising that a metrics-led approach to measuring impact presents too many difficulties, not least for comparisons between disciplines, HEFCE instead asked departments to produce a set of impact case studies, which give a narrative account of how published research has impacted the world beyond academia. Although there has been much carping about the introduction of the impact agenda, which many see as boiling the research enterprise down to overly utilitarian essentials, the retrospective survey of the wider influences of UK research output embodied by the REF has been a surprisingly positive experience, not least because it has unearthed benefits of which many university departments were previously unaware. Collectively, the case studies might even provide a rich resource with which to argue for continued and increased investment in the research base.

Even so there are other problematic aspects to the REF. In the past year, as well as generating all the paperwork needed for our REF submission, our department has undergone an external review of its undergraduate teaching. As the current Director of UG Studies (DUGS) I was required to take a leading role in preparing the voluminous documentation for this further assessment exercise — a 58-page report with no fewer than forty appendices — and organising a site visit by our assessors involving many different staff and students. As with the REF, the process is administratively onerous but the exercise nevertheless has significant value: it provides an opportunity to take stock and serves as a bulwark against complacency.

But the question that now looms in my mind is why are these assessment exercises separated? The division appears arbitrary, even if it makes some kind of logistical sense, given the strains that they place on university departments. From that perspective it might be difficult to argue for any kind of unification but there is a fundamental issue to be addressed: is it sensible to isolate research performance from other valuable academic activities?

These other activities include not just UG teaching but also postgraduate training, mentoring of young postdoctoral scientists, peer review of research papers, grant and promotion applications, institutional administration, the promotion of diversity, and involvement in public discourse. Arguably the separation (which in reality means the elevation) of research from these other activities is damaging to the research enterprise as a whole. It creates tensions within universities where staff are juggling their time, more often than not to the detriment of teaching, and is responsible for a culture that has become too dependent on publication. This distortion of the academic mission has been worsened by the reification of journal impact factors as the primary measure of scientific achievement.

Evidence published last Tuesday shows that, in biomedicine at least, the most important predictor of ‘success’ for an early career researcher is the number of first-author papers published in journals with high impact factors. The measure of success here is defined narrowly as achieving the independent status of principal investigator running your own lab (usually by securing a lectureship in a university or a long-term post at a research institute). It should come as no surprise that the well-known rules of the game (publish or perish) should produce such an outcome. But what is missing here is consideration of the negative impacts of the artificial separation of research from other facets of the job of an academic.

In recent months I have spoken to more than one young researcher who has abandoned the dream of leading their own research group because of their perception of the extreme intensity of the competition and the sure knowledge that without a high-impact paper they are unlikely to make it in such a world. A ‘pissing contest’ is how one memorably described it. Is anyone counting the cost of those broken dreams? Should not these losses be counted in our research assessment processes?

It has often struck me that an academic career is a tremendous privilege; it offers the chance to follow your curiosity into uncharted territory and to share your love of your discipline with the next generation. There are still plenty of people who derive great satisfaction from their work in the academy (even I have my good days) but I detect increased levels of stress and weariness, particularly since becoming DUGS. The responsibilities of that position have had some impact on my own research output but I was willing to take it on because I believe in the multifaceted role of ‘the academic’ and in the broader value of the institution known as the university*. However, it has not been an easy task trying to promote the value of both research and teaching in a culture, promoted in part by the REF, that places such a supreme value on research output. In such an environment, research cannot do other than conflict with teaching and that is ultimately to the detriment of both. And to the student experience. And to the quality of life for staff.

These issues are not new, and have been addressed previously by the likes of Peter Lawrence and Ron Vale. The San Francisco Declaration on Research Assessment, which has just celebrated its first anniversary (please sign up), is the latest attempt to rein in the mis-measurement of research achievements. But while there may be local efforts to hire and promote staff based on performance across the whole range of academic activities, research remains an international business involving the exchange of people and ideas across national boundaries, so a coordinated effort is required to solve these problems, or at the very least to identify and promote instances of best practice.

To that end, what are the chances that the REF might take a lead, perhaps even by using metrics? If we are going to take some account of citations or downloads in discussions of research quality, why not consider adding other measures designed to capture the student learning experience, or staff satisfaction, or good academic citizenship, to create a basket of measures that might rebalance the incentives for universities and their staff? There are huge and obvious problems with such an approach that need careful consideration; I am not proposing that we submit thoughtlessly to the whims of student satisfaction surveys, but am intrigued by how measures of workplace quality might play a role.

There are no easy answers. I anticipate some will argue that switching the tight focus of the REF away from research risks undermining the power of the UK research base. But to those tempted to follow that line, please evaluate the cost of not doing so and report back.



*Though I cannot deny that my motivation for applying for a lectureship back in 1995 was to secure a permanent foothold that would enable me to start a career as a PI. At the outset I was prepared to pay the quid pro quo of teaching hours demanded but was advised not to get over-enthusiastic about teaching if I wanted to get promoted.






17 Responses to The REF: what is the measure of success?

  1. Stephen
    Very interesting and I wish you and the rest of the panel happy wrestling with this problem. Broadening what a future REF might cover is a very interesting idea but there are various aspects you don’t mention. One of these is one I know social scientists had problems with during the REF Impact Pilot that I don’t believe was really resolved: what about time people put into policy discussions that don’t lead to changes in policy? So if you sit on some panel (and of course this doesn’t have to be social scientists but can apply much more broadly) but can’t demonstrate a clear link to an outcome from your personal intervention it didn’t ‘count’. In science this relates to the ‘Brian Cox effect’ when all his hard work in public engagement couldn’t, I believe, be counted because it didn’t fit a metrics-driven-research-linked criterion. Will you be broadening the meaning of impact in your vision?

    I will also mention here, perhaps inevitably, the gender angle. James assures me that this will be thoroughly considered: the under-citing of work by women being a key example. Analysis of the 2008 RAE highlighted this problem. I can quite see how broadening the REF to include many other non-research-related aspects of HE life could be beneficial in this respect. It is exactly what Cambridge’s letter to the THE was all about (see my previous post here). Good luck in your travails!

    • Stephen says:

      Thanks for the comments and the useful links Athene. The question of contributions that lead nowhere is an interesting one and, of course, parallels the experiences of many during the execution of lab work. To some degree the practice of asking for a selection of impact narratives implicitly acknowledges that not every contribution could or should lead to something tangible; it is the same for patents that never result in products or profits.

      The Cox effect question is more problematic since the REF largely overlooks such contributions, which can be substantial. I would favour some relaxation of the policy that contributions have to be tied to an identifiable research paper of sufficient quality.

  2. Mick Watson says:

    I think we need to be very careful about altmetrics.

    Firstly, because I have over 4000 Twitter followers and a well-read blog, my papers are going to get more Tweets and retweets than someone who is not on Twitter and does not blog. Does that make my paper more valuable? I think not. Secondly, Altmetric have tables of the top 100 papers. This is a positive feedback loop where the top papers are promoted, and therefore get more attention. Thirdly, and finally, Altmetric are a start-up – a company whose sole responsibility is to make money. This means they are not independent.

    Whilst impact factor clearly has its problems, and looks to be dying, the idea of measuring a paper’s value by the number of citations it attracts is not a bad one. Instead of using the journal’s impact factor, we could use the number of citations a paper receives. This number needs to be normalized to the size of the research community – “human cancer” has many more researchers than “starfish evolution”, for example – but it is not outside of our capabilities to do this.

    I favour Google Scholar citations over other measures as they are far more comprehensive – for example, most citation indices do not index supplementary information, or PhD theses posted online, whereas Google Scholar does. I also think funders should allow citation indices to index both successful and unsuccessful grant applications, as these are another source of impact – if a paper informs a grant application, then that is a valid citation.

    Impact factor is bad, and dead, but let’s not throw the baby out with the bath water!
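The normalisation suggested in this comment could take several forms. The sketch below is one possibility, not the commenter's own method: it assumes that a field's average citation count is an acceptable stand-in for the size of its research community, and divides each paper's citations by that average. The function name, the paper identifiers and the citation numbers are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean


def field_normalised_citations(papers):
    """Return {paper_id: citations / mean citations of the paper's field}.

    `papers` is a list of (paper_id, field, citations) tuples.
    Assumption: the field average is used as a proxy for community size.
    """
    # Group raw citation counts by field.
    by_field = defaultdict(list)
    for _, field, cites in papers:
        by_field[field].append(cites)

    # Average citations per paper in each field.
    field_mean = {field: mean(counts) for field, counts in by_field.items()}

    # Score each paper relative to its own field's average.
    scores = {}
    for paper_id, field, cites in papers:
        avg = field_mean[field]
        scores[paper_id] = cites / avg if avg > 0 else 0.0
    return scores


# Hypothetical data: a large field and a small one.
papers = [
    ("cancer-1", "human cancer", 120),
    ("cancer-2", "human cancer", 40),
    ("starfish-1", "starfish evolution", 12),
    ("starfish-2", "starfish evolution", 4),
]
scores = field_normalised_citations(papers)
for paper_id, score in scores.items():
    print(paper_id, round(score, 2))
```

With these made-up numbers, the most-cited starfish paper scores the same as the most-cited cancer paper even though its raw count is ten times smaller, which is the kind of rebalancing the comment asks for. Real field boundaries are much harder to draw than this toy example suggests.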


    • Stephen says:

      Citations may be the best of a bad bunch but the standard charge levelled now is that they can take a long time to accrue; part of the drive for altmetrics is to try to identify measures that will predict citations. However, whether that can be demonstrated with a degree of robustness that will satisfy the research community remains an open question.

    • Jumping in here to clarify for other readers: Altmetric and altmetrics are two different things. The first is a company that measures the second (altmetrics).

      It’s true that Altmetric is a start-up, but we at Impactstory are a non-profit, and both of us (and our friends at Plum Analytics and other altmetrics-related companies) are working towards the same goal: to provide scientists with the tools to discover the many ways in which they can have impact, and share that information with others.

      Nonetheless, the question of altmetrics being used in REF is a good one. Stephen’s raised some great points in his post above.

      • Stephen says:

        Hi Stacy – thanks for the comment. Can I ask is it a particular goal of to generate and test metrics for research/researcher evaluation to inform funding decisions or is the main focus on helping people to discover interesting papers?

  3. Stephen
    I read your article with interest and hope you’ll come along to our event at UCL later this month to share your views and experience further. Details are here:

  4. Dave Fernig says:

    An excellent and most stimulating piece. Metrics for individual papers seem to have fatal flaws. Impact factor has been so rightly damned that it should be dead, though it appears to have more than one life outside the UK. Article citations do not work because rates of citations of papers in the same field differ by orders of magnitude; the “Sleeping Beauty” effect is just one contributing factor. Altmetrics are not independent. Worst, all of the above are “gameable” and this is occurring around the world – e.g., you cite my papers, I cite yours, etc. One metric that might just work is a Departmental (or even Faculty) h-factor. This was noted by Dorothy Bishop on her blog, who compared UK psychology department h-indices with the 2004 RAE, and independently for physics departments (department h-index compared to 2008 RAE outcome on the totheleftofcentre blog).
    While this approach has also drawn criticism, it might just work in the UK. It is unlikely to be gameable, because of the independence of academics. If a Pro VC tells us that we should write lots of reviews to increase the h-index, we are not going to comply. I would agree that this only holds if the units are large, so perhaps larger than a department. It averages the problems with article citations, so that these even out. Finally, it looks at strength in depth, which is a good thing.
    However, there is only one method that can actually possibly really work: reading the paper. So the problem can be put quite simply: do we want to spend huge amounts of time reading papers or do we prefer to duck and use a proxy?
    On the grounds that reading papers is interesting, a better way to save time would be to ditch all the narrative on environment, etc. and replace these with a few data Tables.
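For readers unfamiliar with the departmental h-index mentioned in this comment, here is a minimal sketch of how it could be computed. It assumes the department's papers are simply pooled and the usual h-index definition is applied to the pool (the largest h such that h papers have at least h citations each). It is not taken from either of the blog analyses cited, the staff names and citation counts are hypothetical, and a real calculation would need to avoid counting co-authored papers twice.

```python
def h_index(citation_counts):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


def departmental_h_index(staff_papers):
    """Pool every staff member's papers and compute a single h-index.

    `staff_papers` maps a (hypothetical) staff name to a list of
    per-paper citation counts. Assumption: no paper appears twice.
    """
    pooled = [cites for counts in staff_papers.values() for cites in counts]
    return h_index(pooled)


# Hypothetical department of three researchers.
department = {
    "researcher_a": [25, 8, 5, 3],
    "researcher_b": [10, 6, 1],
    "researcher_c": [4, 4, 0],
}
dept_h = departmental_h_index(department)
print("departmental h-index:", dept_h)
```

In this toy case the pooled value is higher than any individual's h-index, which reflects the comment's point that the measure rewards strength in depth across a unit and not the record of any single person.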

    • Stephen says:

      Thanks for the link to the theleftofcentre analysis Dave, which I had not seen before and is a useful complement to Dorothy Bishop’s proposal. I think there may be some legs there.

      I agree also that some of the text in the REF submission relating to environment and strategy is probably superfluous. These are ‘inputs’ and the process should perhaps stick to measuring outputs, good ones being the mark of a healthy environment and decent strategic planning.

      I’m still keen on the idea of expanding the range of outputs to be considered though…

  5. If you let loose academics on the question of how they’ll be assessed, they’ll go off on an obsessive quest for perfection. I’ve argued for metrics, not because I particularly like metrics – I can see their disadvantages, but because I think we need to do a cost-benefit analysis of any system we adopt, in which we consider what the marginal gains are if we add a new layer of assessment.
    The point I argued in my blogpost was that if you used metrics, the final outcome – which is after all a fairly coarse-grained funding decision – is not very different from the result you’d get from the full-on REF. That seemed true at least for science subjects. The costs of the two methods, on the other hand, differ by an order of magnitude.
    I think we need to be pragmatic. Suppose, for instance, in one discipline Cambridge ended up with a higher allocation of funds than Oxford if you used metrics, whereas the position was reversed if you did the full REF. Whether that concerned you should depend on how big the difference would be, and how much time everyone had saved by not doing the REF. My case is that it might be well worth saving ourselves the hassle of the REF and living with the imperfect outcomes from metrics (and by that I mean something based on paper citations, not journal impact factors) because that is the least bad option – and it would free up lots of time for people who just want to get on and do good research.
    I appreciate that some people argue that REF has other benefits apart from determining funding decisions – e.g. reflecting on what you’ve done etc, and forcing us to adopt a broader view. But it also has massive disadvantages in terms of impact on staff security and morale. I think it’s also questionable that ‘expert judgements’ are really such a good thing – it would be very interesting to know how reliable they are if two evaluators independently rate a publication. Metrics are at least neutral, whereas the experts may be influenced by personal prejudices or ignorance.
    So my argument is we should stop looking for a perfect system and consider what it is we are trying to achieve. Once we have decided that, we should go for the most efficient approach, in terms of costs and benefits.

  6. Euan says:

    As a disclaimer I founded Altmetric so you can guess my POV, but wanted to contribute my

    I agree with everything above. The one thing I’d say is that it’d definitely be a mistake to think that there’s a clear definition of altmetrics anywhere (the manifesto is a product of its time and is a good jumping off point). Rather ‘altmetrics’ is more akin to ‘scientometrics’ in that it’s a set of approaches that share a common goal. The data sources and how they’re used vary.

    The common goal is to take a broader view of impact, by which I mean not just to look at citations or articles, or to only measure scholarly use, or to only look at quantitative measures.

    Altmetrics is a bit of a misleading name at this point, I think, because it’s neither strictly speaking an “alternative” (rather it’s complementary) and not particularly about metrics (most of the time it’s dealing with indicators, or qualitative information).

    It’s not supposed to be a magic bullet, and hopefully nobody is describing it as such anywhere. Rather it’s a way of saving time and money by pointing people towards outputs that are having different kinds of impact.

    For Altmetric it’s mostly about pulling together all of the information about a paper or dataset or set of outputs in one place. You could possibly do the same thing yourself manually – what we do is automate it and process the information in a consistent way.

    Part of that is things like Twitter – and being talked about a lot on Twitter is a valid indicator of impact, just maybe not the kind of impact that reflects quality – but actually I think the most exciting data is things like blogs, post publication review and policy documents (which are a more recent addition).

    If your practice orientated paper hasn’t been cited in the scholarly literature but has been referenced widely by, say, health authorities across Europe then you should be able to point to proof of this and use it wherever you need to talk about your work, be it a biosketch or during a P&T meeting.

    If a paper has a strong public engagement component then this can also be difficult to prove. You should be able to point to independently collected data that shows who was talking about it and to what degree, and that data should be benchmarked against similar papers (who can say if ’14 tweets’ is a lot or a little otherwise?)

    If your dataset is on figshare, has been downloaded 10k times and was mentioned in a New Scientist article then why shouldn’t you be able to point directly to it and the associated metrics instead of having to write a one page description of it in a journal, so that your citations can be tracked?

    At a departmental or institutional level who you approach to write impact statements for shouldn’t just be up to word of mouth or who is the least shy about speaking up. If we’re consistently and reliably collecting indicators of different types of impact for all your institution’s papers the hope is that we can help you narrow down the field from all, say, 4000 papers published over the past five years to a few hundred. You still need to go out and interview researchers to write the narrative, but at least you can avoid some wasted time in the process.

    You can take a look at what all this means in practice – go to Nature and click on the metrics tab for a paper. Go to a tool like ImpactStory (which is a separate group, and not affiliated with us) and put in your own papers to see what data comes back.

    To answer some specific points from earlier:

    > a company whose sole responsibility is to make money. This means they are not independent.

    You’re thinking public companies. 🙂 I started Altmetric because I wanted to work in this field and not have to answer to anybody (I started off in academia, then spent a bit of time in publishing, where I worked on this kind of stuff full time)

    I’d actually look at things the other way round. We wouldn’t be able to exist as a commercial company if the data wasn’t useful and there wasn’t demand for what we do.

    > Can I ask is it a particular goal of to generate and test metrics for research/researcher evaluation to inform funding decisions

    No, not to generate or test metrics. That’d imply numbers were the important thing. I’d say one of our goals is to surface *data* which can be used to inform funding decisions (alongside everything else! You put it best in the post – it always has to be in conjunction with other things). The kinds of things you would probably only surface otherwise with a lot of legwork or expert assistance.

    At the research rather than individual researcher level it’s to also surface indicators about that data.

    > is the main focus on helping people to discover interesting papers?

    To a lesser extent.

  7. CrisisMaven says:

    “the world finds it extraordinarily difficult to take the measure of science” – Without oversimplifying: this goes back to why Goostman is not a Turing machine and why Google’s once-thought-brilliant PageRank algorithm does not work as planned: science always tries to measure things that can be added and subtracted to. This is why the number of citations in scientific journals matter for granting tenure just as much as giving a certain website a space in Google’s search results. So while scientists vehemently say that scientific truth is not amenable to democratic voting, scientists resort to exactly that when trying to “measure” science. The same happens when one tries to grapple with automated translations. So far no automatic translation process has really ever been convincing to anyone who has even a rudimentary feeling for semantics and style. Science has to be content with the fact that it is not the contemporaries (like the Ptolemeans) that decide which theory is “right” and which papers therefore SHOULD have been the most spectacularly cited, but that the “proof of the pudding” always lies far far into the future. Anything else is hubris.
