The ABC of panel scoring: Anchoring, Bias and Committee Procedures

Academic life is particularly full of rank-ordered lists, even if they are frequently not transparently available. From undergraduate examinations to professorial promotions, from REF (and in future TEF) marks to grant-awarding panels, the scores matter. Anyone who has ever been ‘scored’ will worry about the accuracy of the scores given; anyone who has been involved in decision-making will have their own views about the process, its validity and whether their own part left them satisfied. Peer review may be the best process we have for making these judgements – which in essence all of them rely on – but no one ever claimed peer review was faultless. If you have never sat on any comparable committee you may well be interested in, as well as deeply suspicious of, what actually goes on. If so, you may find this scholarly article from the social sciences illuminating. In it, the author gives much qualitative insight into the goings-on in a series of Swedish Research Council meetings, as he explores a particular phenomenon known as the ‘anchoring effect’, on which more later.

In all of the committees I have been involved with I have only once sat on (and never had the misfortune to chair) a panel where I felt there was something slightly dodgy going on, in the sense that there was a sub-group behaving as a cartel. I hasten to add this behaviour was spotted and neutralised by an oversight panel. In general people try really hard to be objective but, as the article demonstrates, this is not as easy as you might think. Consider the following issues as demonstrating the challenges that may arise, implicitly or explicitly:

  1. If asked to score between 1 and 10 against some criterion, some people will use the full range, but others will probably cluster scores between 4 and 8 believing nothing is perfection and nothing is completely worthless. Averaging such scores to produce a crude rank-ordered list (even if subsequently modified by discussion, as such raw lists essentially always are) may not be the optimum way to proceed, but is likely to be what happens.
  2. In the case of a grant proposal, a very convincing case may be made that rests on a fundamentally flawed assumption which only the specialist is able to pick up; or equivalently in promotions, only the person closest to an application may spot that there is unjustified hyperbole in some of the claims. Rightly, these judgements should have more weight than those of a less expert panel member, but it will be random in each case whether such immediately relevant expertise is represented on the panel.
  3. Absolutely ‘solid’ metrics (e.g. the h index) may be used improperly, e.g. to compare candidates from very different disciplines. If you try to compare a pure mathematician (think Andrew Wiles of Fermat’s Last Theorem fame) with a synthetic chemist, their h indices may vary by a factor of 10. That says nothing about their relative excellence. That much is pretty obvious, but even if you compare a synthetic chemist with a physical chemist, the differences may be substantial. Sub-disciplines as well as larger groupings matter in these things. Similarly with prizes: focussing on the UK, the Royal Society of Chemistry just happens to have a much larger and more varied collection of prizes than the Institute of Physics, so a solidly good but not-necessarily-stellar chemist is far more likely to be able to list a prize or two than a comparable physicist. You need to be very aware of these differences to be able to tension these solid facts appropriately.
  4. The committee procedures may significantly affect the way different panel members participate. I once sat on a research council panel which was dealing with four very different sub-fields. Initially the modus operandi was for each of the four to be taken in turn. This meant it was all too easy for panel members only to focus on the area they were closest to, essentially dozing off (or at least being very bored and not concentrating) during the rest of the presentations. As a result, when the final scores were decided most of the committee had little to say about most of the applications. During the time I served on the panel (and this must be at least 15 years ago) it became obvious just what a bad way of proceeding this was, and eventually meetings took place considering applications simply in alphabetical order. I am sure this led to better decisions as everyone concentrated throughout the discussions.
  5. Without needing to invoke either a conspiracy or genuine conflict of interest, if there is someone who has a prior high opinion of one particular applicant, this may shine through regardless of the case on the table. If this person happens to talk first and is (as a recent committee member described themselves to me) a dogmatic character, a strongly positive message can be conveyed which later speakers find hard or are unwilling to challenge. Randomness in order of speaking may have a significant effect on what is ultimately a collective decision. Chairs can do what they can to overcome dogmatic speakers, but are unlikely to know in advance how best to order speakers so that no unreasonable advantage can be accrued by any particular candidate.
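
To make the first point above concrete, here is a minimal sketch in Python of how two scoring styles can interact with simple averaging. Everything in it is an assumption made for illustration: the panel members, the proposals P1 to P4 and the scores are hypothetical, and per-member standardisation (z-scores) is just one possible correction, not something any particular panel is known to use. It also assumes every member scores every proposal.

```python
from statistics import mean, pstdev

# Hypothetical raw scores (1-10): both members score the same four proposals.
raw_scores = {
    "full_range_scorer": {"P1": 10, "P2": 1, "P3": 6, "P4": 3},
    "clustered_scorer": {"P1": 5, "P2": 8, "P3": 6, "P4": 5},
}


def rank_by_mean(scores):
    """Rank proposals (best first) by the plain average of all members' scores."""
    proposals = sorted(next(iter(scores.values())))
    averages = {p: mean(member[p] for member in scores.values()) for p in proposals}
    return sorted(proposals, key=lambda p: averages[p], reverse=True)


def standardise(member_scores):
    """Convert one member's scores to z-scores (mean 0, standard deviation 1)."""
    mu = mean(member_scores.values())
    sigma = pstdev(member_scores.values())
    if sigma == 0:
        # A member who gives every proposal the same score expresses no preference.
        return {p: 0.0 for p in member_scores}
    return {p: (score - mu) / sigma for p, score in member_scores.items()}


def rank_by_standardised_mean(scores):
    """Standardise each member's scores separately, then average and rank."""
    z_scores = {member: standardise(s) for member, s in scores.items()}
    return rank_by_mean(z_scores)


print(rank_by_mean(raw_scores))               # ['P1', 'P3', 'P2', 'P4']
print(rank_by_standardised_mean(raw_scores))  # ['P1', 'P2', 'P3', 'P4']
```

With these made-up numbers the raw average lets the full-range scorer dominate, so P3 sits above P2; once each member's scores are put on a common scale, the clustered scorer's clear preference for P2 counts for as much as the other member's dislike of it, and the two swap places. Whether such a correction is actually fairer depends on whether the spread in a member's scores reflects real differences in quality or merely habit.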

The issue of ‘anchoring’ I referred to at the beginning relates most closely to this last point of a preliminary score influencing later results. First identified, I believe, by Daniel Kahneman, it is the phenomenon by which the introduction of an initial figure may have a subsequent impact on how people score, react or choose to proceed. Given some figure – it could be for scoring a grant or equally for what they are prepared to pay for some product, which was the context Kahneman considered – people use that as a baseline and tweak what they believe is appropriate around it rather than starting afresh themselves with an objective view. So, in the context of scoring a collection of grants, if the scores submitted in advance by panel members are averaged and presented to the panel before detailed discussion starts, it might influence how the subsequent debate unfolds and hence the final scores which are awarded.

This is the situation which forms the basis of the paper I referred to above by Swedish researcher Lambros Roumbanis as he analyses panel meetings of the Swedish Research Council. But his paper describes a much broader range of behaviours than just this particular facet, which is why it is so generally informative for those curious about what goes on in such meetings. Of course every panel is different and so the observations must be treated as exemplars rather than necessarily typical. In my experience people are probably less reflective in their lunch breaks than he apparently discovered, probably because his very presence influenced behaviour. Nevertheless people do agonise over their actions – committee members are not, in my experience, blasé or careless. That does not stop them having internal biases, prejudices and baggage from previous meetings, all of which may impact on how they interact with other panel members and the paperwork in front of them. However, let me stress, few if any panel members approach the task with anything but the best of intentions; nor do they tend to set out to game the system for some nefarious purpose. Gross biases tend to be picked up and challenged. Despite all that, there is absolutely no doubt that peer review does not always end up with the right answers, be it down to anchoring, ignorance or incompetence. Alternative methodologies are not likely to be any better. Lottery anyone?


7 Responses to The ABC of panel scoring: Anchoring, Bias and Committee Procedures

  1. Luke says:

    You may be too quick to dismiss the lottery method. See Gillies’s arguments in favour of it:
    http://riviste.unimi.it/index.php/roars/article/view/3834

  2. NQ says:

    Heck, even if you compare a synthetic organic chemist with a medicinal (organic) chemist and a chemical biologist in the organic chemistry division of a department, the latter two’s outputs will always be much higher, all else being equal.

    Happy to hear you have only experienced dodgy business in this context once. I wonder whether that number would be the same if you worked in the US…

  3. Maria says:

    So as long as one person in the committee does not exploit the whole scale consistently across people to be ranked (e.g., they give scores from 4-8 for all people to be ranked and do not change their ranking style throughout the course of the complete ranking procedure), it’s not a big problem.
    I think you cannot take the mean on ordinal scale level measurements such as a ranking. You could take the median. In this case I think it would also be fine to simply summarise your points and take the person with most points.

  4. I have seen committees presented with histograms with each individual’s scoring patterns. It can be informative but is still quite hard to counter any implicit influence of high and low scorers. Some referee forms also require scores – and in many cases this has undoubtedly led to their own ‘grade inflation’.

    • Maria says:

      If scores are for a reference letter, they are almost a meaningless measure.

      For a panel, scores are not necessarily a problem, but maybe I don’t understand how panels work.

      I thought that each of the people in the panel assigns a score, e.g., from 1-10 to each applicant. If you have a pool of, e.g., 32 applicants, and 5 people to evaluate them, all 5 people award a score to each of the 32 applicants. Then you take some type of meaningful measure (which can’t be the mean). If one person in the panel consistently does not award 9 and 10, all applicants will be similarly disadvantaged. You are only interested in who is best, e.g., you take the person with the highest sum of scores.
      To have someone who does not make use of the full scale only matters if you evaluate the scores, e.g., by setting a threshold value which the successful applicant has to meet, e.g., you reject all applicants who do not get an average of 9.5.

      Then, it would not matter that much whether panel members have different rating styles. It is a classic problem in psychology.

      You could (not sure whether such things are done):

      – simply prompt panel members to make use of the full scale

      – clearly define what a 10 looks like, making use of categories which specify what you are looking for in a candidate (e.g., an applicant can get 10 pts in total. These consist of 3 pt for teaching, 3 pt for publications, 3 pt for mentoring, 1 pt for whatever. Then you specify, e.g. that in the teaching category, 0 pt will be awarded for no teaching experience, and 3 pt will be awarded for excellent teaching and then calculate the composite score for each applicant based on these categories)

      – force panel members to rank applicants instead of assigning scores to them (so one of them has to be the top)

      • If everyone scores you are of course correct. But very often – e.g. in grant-giving panels – this isn’t the case. Only the ‘experts’ on a particular grant will score. So if you have two curmudgeonly people in area A and two who believe in all the hype in area B, B may come out apparently much stronger than A on a first pass. And those scores may ‘anchor’ what happens next. See Douglas Kell’s response below too: he was a Research Council CEO and has only too much awareness of the challenges.

  5. Douglas Kell says:

    The experiment of asking two Panels in separate ‘rooms’ to rank the same set of grant proposals has been done many times. Usually both pick the (few) stunners and turkeys, but whether you get through or not depends entirely on which room you were assessed in. That is the case whatever the strike rate in the typical modern range (15-25%). This is because of GENUINE uncertainty (in the broadly statistical sense).

    There is barely any need to invoke unreasonable bias at all. (Of course a clear view is some kind of bias…). I have very occasionally seen it, and it was always transparent and hence blocked. It helps if IMs know the strike rate so they do not try and push too many they speak to. It also helps if they are ACTUALLY multidisciplinary and understand the grants (which often means chasing at least some of the cited reviews).

    Delphi system (everyone scores every grant) is hard work but can work well, esp for calls inviting ‘left-field’ proposals.

    The anchoring effect is VERY real. Kahneman’s BRILLIANT book (Kahneman D: Thinking, fast and slow. London: Penguin, 2011) lists many other common and unconscious cognitive biases.

    Bear all of this in mind when considering the behaviour of the Cabinet, how the referendum went, and other events that involve ranking ideas, candidates, or anything else….