Peer review of quality of care: methods and metrics
Julian Bion, Joseph Edward Alderman
Intensive Care Medicine, University of Birmingham College of Medical and Dental Sciences, Birmingham, UK
Correspondence to Professor Julian Bion, Intensive Care Medicine, University of Birmingham College of Medical and Dental Sciences, Birmingham B15 2TT, UK; J.F.Bion{at}

The privilege of professional self-regulation rests on clinical peer review, a long-established method for assuring quality of care, training, management and research. In clinical peer review, healthcare professionals evaluate each other’s clinical performance. Based originally on the personal experience and expertise (and prejudices and biases) of one’s peers, the process has gradually been formalised by the development of externally verifiable standards of practice, audit of care processes and outcomes and benchmarking of individual, group and organisational performance and patient outcomes. The spectrum of clinical peer review ranges from local quality improvement activities such as morbidity and mortality reviews, to medical opinion offered in courts of law. Peer review can therefore have different purposes ranging from collaborative reflective learning to identification of malpractice.

Given the ubiquity and importance of clinical peer review, it would be reasonable to expect some evidence of reliability of judgements made by different reviewers. And yet the literature tells a rather different story. A systematic review1 of the inter-rater reliability of audited case records reported mean kappa values ranging from 0.32 to 0.7, with higher reliability when reviewers employed explicit criteria. Reviewers may give inconsistent judgements, change their opinions over time2 and be susceptible to a variety of biases including implicit,3 cognitive4 and outcome or hindsight bias.5 To some extent, this may be mitigated and reliability improved by using a combination of both criterion-based and implicit (global) assessment6 combined with structured judgement templates,7 8 or when a smaller group of reviewers is employed to detect well-characterised signals such as adverse events.9 In a comparison of weekend and weekday quality of care across two epochs of time, using a combination of structured judgement and global (implicit) reviews of case records,10 we found modest levels of agreement between reviewers examining the same case, but a high level of agreement when cases were aggregated at organisational level: the big picture was more informative than the individual case. In legal settings in the UK, the Woolf recommendations encourage consensus between expert witnesses by requiring a single, joint assessment from experts appointed by the courts.11 While this approach may have improved matters for the courts,12 the evidence that consensus-based reviews produce more accurate judgements is elusive.13 This is problematic when the stakes are so high for patient care and for individual and organisational reputations. These methodological challenges to peer review have the potential to undermine the edifice of self-regulation: if the instrument of investigation (the reviewer) is so flawed, how can we have confidence in the outcome—the judgement of quality?
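The kappa statistic cited above corrects raw agreement between reviewers for the agreement expected by chance. The following is a minimal sketch of Cohen's kappa for two reviewers, using invented ratings purely for illustration; the function name and the data are assumptions and are not drawn from any of the cited studies.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two reviewers who rated the same set of cases."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(ratings_a)
    # Proportion of cases on which the two reviewers gave the same rating.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each reviewer's marginal frequencies.
    count_a = Counter(ratings_a)
    count_b = Counter(ratings_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used one identical category throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical judgements of care quality for ten case records.
reviewer_1 = ["good", "good", "poor", "good", "poor",
              "good", "poor", "good", "good", "poor"]
reviewer_2 = ["good", "poor", "poor", "good", "good",
              "good", "poor", "good", "poor", "poor"]

kappa = cohen_kappa(reviewer_1, reviewer_2)
print(f"kappa = {kappa:.2f}")
```

In this invented example the reviewers agree on 7 of 10 cases, yet kappa is only about 0.4 because half of that agreement would be expected by chance alone.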

In an attempt to determine the utility of peer review, Schmitt et al14 in this edition of the journal performed a cluster randomised trial of 60 hospitals, nested within the German ‘Initiative Qualitätsmedizin’ (IQM), a voluntary national multiprofessional quality improvement collaboration established in 2009 involving 385 hospitals. The population they chose for review was intensive care unit patients receiving mechanical ventilation for >24 hours. The 60 hospitals selected were those with the highest hospital mortality rates in 2016, the rationale being that these hospitals would have the greatest headroom for improvement. The logic model therefore contains the assumption that higher mortality rates are attributable, at least in part, to deficiencies in care processes which can be identified and corrected through peer review.

Ultimately, the analysis was based on 30 intervention hospitals and 29 control hospitals caring for 12 085 and 13 016 patients in the pre-intervention and post-intervention periods, respectively. Thirty-three of the 60 hospitals had previously participated in clinical peer review of mechanical ventilation. Data from the non-participating hospitals (‘observation arm’) were used to derive standardised hospital mortality ratios based on a range of characteristics which included hospital coding for emergency admission status and comorbidities, but not acute physiology or severity of illness.

Clinical peer review in the intervention hospitals consisted of several linked steps: self-selection of 12–16 records of patients who had died; self-assessment of care quality; on-site assessment by the review team consisting of trained doctors and nurses from other IQM hospitals; and a structured report and discussion between reviewers and staff to agree ‘clear and precisely formulated potentials for improvement to derive an action plan’, the implementation of which was the responsibility of the local clinicians. This therefore fulfils the Medical Research Council’s criteria for a complex intervention.15 The authors used a difference-in-difference analysis to mitigate the impact of case mix and organisational differences between hospitals. The primary outcome was the difference between the standardised mortality ratios for the year before and the year after peer review. What did they find? There was no impact of the intervention on either crude or adjusted mortality ratios. Peer review had no perceptible impact on mortality.
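The logic of a difference-in-difference comparison of standardised mortality ratios can be sketched in a few lines. This is a simplified illustration only: the trial authors' actual statistical model is not described here, and the function names and all death counts below are hypothetical.

```python
def smr(observed_deaths, expected_deaths):
    """Standardised mortality ratio: observed deaths / risk-adjusted expected deaths."""
    if expected_deaths <= 0:
        raise ValueError("expected_deaths must be positive")
    return observed_deaths / expected_deaths


def difference_in_difference(pre_intervention, post_intervention,
                             pre_control, post_control):
    """Change in the intervention arm minus the change in the control arm."""
    return (post_intervention - pre_intervention) - (post_control - pre_control)


# Hypothetical arm-level figures (not taken from the trial).
pre_int = smr(120, 100)   # 1.20 before peer review
post_int = smr(110, 100)  # 1.10 after peer review
pre_ctl = smr(118, 100)   # 1.18 in the same pre-period, control arm
post_ctl = smr(108, 100)  # 1.08 in the same post-period, control arm

did = difference_in_difference(pre_int, post_int, pre_ctl, post_ctl)
print(f"difference-in-difference = {did:+.3f}")
```

In this invented scenario both arms improve by the same amount, so the difference-in-difference is approximately zero: a before-and-after comparison in the intervention arm alone would have wrongly credited peer review with the improvement.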

How should we interpret this apparent lack of effect? The authors took care to ensure adequate power for their study. Could the context have been unfavourable? It seems unlikely that intensive care staff would ‘lack receptors’ for quality improvement interventions since, even when metrics are disputed,16 17 clinical staff achieve improvements in care over time when given the tools.18 Moreover, when the authors invited participation in the project, they received positive responses from 237 hospitals, which does not suggest lack of interest. Or should we accept the null hypothesis and conclude that clinical peer review can join the list of other ineffective interventions in critical illness,19 with wide implications for the whole of medicine?

We suspect that in addition to the unreliability of peer review discussed above, its impact will have been diminished further by methodological issues, some of which are acknowledged by the authors. These need to be addressed in future research. We consider here case selection, the choice of process or outcome measures, and the content of the intervention.

In terms of case selection, restricting the investigation to mortality reviews limits the generalisability of the study and introduces the problem of simultaneity or endogenous selection bias.20 Here, the selected population (in this case patients requiring mechanical ventilation for >24 hours) and the target of the intervention (reliability improvement or error prevention) both lie on the causal pathway to the primary outcome (mortality). At the same time, the risk of that outcome is itself a potential (‘simultaneous’) contributor to the probability of requiring prolonged ventilation or of experiencing an error or omission in care,21 a form of ‘reverse causation’.22 23 While mortality reviews may reveal valuable opportunities for improvements in care of individual patients, the rationale for examining only those episodes of care which ended in death may be erroneous when reviewing care quality aggregated at unit or organisational level. Deficiencies in care processes do not generally result in death, while patients who die may have received exemplary care, even though they may have had more complex pathways with greater opportunity for errors. When evaluating quality of care therefore, it may be better to study this in a fully representative patient population, not just in those who died.

Standardised mortality ratios as a measure of care quality suffer from several methodological deficiencies,24 including sensitivity to unmeasured aspects of case mix,25 26 and the low proportion of deaths classed as avoidable, with the majority of deaths being a consequence of the patients’ acute or comorbid diseases.27–29 Importantly, there is no clear relationship between organisation-level standardised mortality ratios and clinical judgements of care quality.30 Institutional reporting rates of incidents involving severe harm or death similarly show no relationship with mortality or patient satisfaction.31 The signal-to-noise ratio is therefore adverse, since the opportunity for identifying improvements in care processes leading to improvements in outcomes is dependent on the total number of cases and the proportion which are genuinely avoidable.32 The question then is ‘how much of the (adjusted) mortality risk can be attributed to deficiencies in processes of care which can be detected by peer review and controlled by the clinical team’? The answer to the first part may be ‘not much’, but the answer to the second element (detection and control) could be anything from ‘variable’ to ‘substantial’. And incremental improvements in care processes over time may add up to important gains which emerge as gradual secular trends,18 33 34 the ‘rising tide’ phenomenon.35
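The arithmetic behind this adverse signal-to-noise ratio can be made explicit. The sketch below assumes a simple multiplicative model in which only avoidable deaths can be prevented, and only where the underlying deficiency is detected and corrected; the function name and every figure are hypothetical and chosen only to show the scale of the effect.

```python
def max_mortality_reduction(total_deaths, avoidable_fraction, corrected_fraction):
    """Upper bound on deaths prevented under a simple multiplicative model.

    avoidable_fraction: share of deaths attributable to deficiencies in care.
    corrected_fraction: share of those deficiencies detected and corrected.
    Returns (deaths prevented, relative reduction in mortality).
    """
    for fraction in (avoidable_fraction, corrected_fraction):
        if not 0.0 <= fraction <= 1.0:
            raise ValueError("fractions must lie between 0 and 1")
    if total_deaths <= 0:
        raise ValueError("total_deaths must be positive")
    prevented = total_deaths * avoidable_fraction * corrected_fraction
    return prevented, prevented / total_deaths


# Hypothetical inputs: 1000 deaths, 5% avoidable, half of those corrected.
deaths_prevented, relative_reduction = max_mortality_reduction(1000, 0.05, 0.5)
print(f"{deaths_prevented:.0f} deaths prevented, "
      f"{relative_reduction:.1%} relative reduction in mortality")
```

With these assumed inputs, even a highly effective review process shifts mortality by only 2.5%, a signal that is easily lost among differences in case mix.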

Should one use care processes or outcomes to assess quality improvement interventions? Outcomes matter to patients and to staff; but at what point should the measurements be censored: 28-day, hospital survival, 3 months postdischarge, 12 months? And what about quality of survival? Duration alone is insufficient for those living with multimorbidity. Process measures may be more laborious to collect, but they offer a more rapidly available quality signal than outcome, and are more ‘empowering’ as they give a clearer indication of what staff need to do to improve care by providing an explicit link between the metrics, the content of the intervention and consequential actions. Process measures and the criteria used for the reviews may have both technical and behavioural-social components. The technical components will include evidence from randomised trials of interventions which influence outcome and for which there is a performance gap. The obvious example in this case is lung-protective ventilation which is still not used reliably in around one-third of eligible patients with acute respiratory distress syndrome36 and even fewer receiving intraoperative ventilation for elective or emergency surgery37 even though standardising best practice reduces mortality.38 Adherence may be higher in patients ventilated for COVID-19 pneumonia39 suggesting that the trend to standardisation of treatment was accelerated by the pandemic. Other interventions could include venous thromboembolism prophylaxis, sedation minimisation, use of neuromuscular blockade and selective digestive decontamination depending on the patient population.

Behavioural and social components of quality interventions may include communication, teamworking, use of checklists40 and ability to challenge or raise concerns.41 Behavioural barriers, which may be subtle and difficult to detect, include disputes about evidence, loss of autonomy and divergent views on clinical responsibilities,42 disagreement about the validity of performance metrics43 and difficulty in sustaining improvement over time.44

The contents of the intervention (the process ‘targets’) should lie on the causal pathway to the desired outcome, and as far as possible should be supported both by evidence of impact and by a gap analysis indicating headroom for improvement. Schmitt et al found 132 discrete recommendations for improvement in the intervention hospitals, 81 of which had been, or were being, implemented, but 53 (65%) of these were regarded as unlikely to affect mortality. This captures succinctly the problem of peer review: there are many aspects of the care pathway which might be done differently, or better, but which of these is really important? To answer that question needs preliminary diagnostic work to understand the problem before deciding that peer review is the right vehicle, and which treatments it needs to bring to bear.

Quality improvement research is still at an early stage in the development of rigorous methodologies.43 Schmitt et al are to be applauded for having employed a cluster randomised trial to evaluate a quality improvement intervention and for having documented the contents of the intervention with clarity. Future work needs to address the issue of what constitutes a representative patient population; to consider incorporating contextual factors; to determine which processes of care really influence outcomes and to identify gaps in current practice and whether there is sufficient headroom for improvement. These elements should be brought together in the form of a logic model44 offering a theory of change which may then be tested using methods such as realist evaluation.45 46 Increasing sophistication of the electronic patient record (EPR) may reduce dependence on peer review, since if the correct processes of care are known, then error correction can be incorporated in real time in the form of prompts, reminders and automated control limits, with performance benchmarked against one’s peers. This will work for the technical aspects of care, but the non-technical, behavioural aspects such as effective compassionate communication and teamworking cannot really be determined from the EPR. The future of peer review may lie not in the retrospective examination of case records, but in the contemporaneous observations of practice by peers and patients, within a model of workplace-based reflective learning.47

Ethics statements

Patient consent for publication

Ethics approval

Not applicable.



  • Twitter @jaldmn

  • Contributors Both authors contributed to the review and writing the editorial.

  • Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.

  • Competing interests None declared.

  • Provenance and peer review Commissioned; internally peer reviewed.
