Identifying adverse events: reflections on an imperfect gold standard after 20 years of patient safety research
Kaveh G Shojania (1), Perla J Marang-van de Mheen (2)
(1) Medicine, University of Toronto Faculty of Medicine, Toronto, Ontario, Canada
(2) Department of Biomedical Data Sciences, J10-S, Leids Universitair Medisch Centrum, Leiden, The Netherlands
Correspondence to Dr Kaveh G Shojania, Sunnybrook Health Sciences Centre, Room H468, 2075 Bayview Avenue, Toronto, ON M4N 3M5, Canada; kaveh.shojania{at}

In ancient Roman religion, Janus was the god of gates and doorways, but also beginnings, endings, transitions, passages, time and duality. Usually depicted as having two faces, Janus looks at the past with one face and to the future with the other. Why mention Janus in an editorial about patient safety? Partly because the 20-year anniversary of To Err is Human 1 marks a transition—from the beginnings of patient safety as a fledgling field to a more mature research endeavour.

Beyond this symbolism of a transition period, Janus’s past and future looking faces bear another connection to patient safety. The ‘gold standard’ research method in patient safety, record review to look for ‘adverse events’ (AEs), defined as harms from medical care, has taken two forms. The more common method, famously used in the Harvard Medical Practice Study (HMPS)2 and other studies which have emulated it,3–9 involves retrospective (‘backwards looking’) record review. An initial review looks for signs of possible harms from medical care, which, when present, trigger more detailed review to adjudicate the presence of AEs and judge the degree to which adhering to accepted standards of care could have prevented them.

More recently, some investigators have conducted prospective (‘forward looking’) surveillance to identify AEs in near-real time.10–12 These forward-looking and backward-looking AE studies have succeeded in showing the scope of many safety problems. But, after 20 years of research, can we continue to use the same metric for both measuring safety and monitoring its improvement over time?

Identifying adverse events through retrospective record review

Though widely attributed to the HMPS,2 retrospective record review to identify AEs originally came from a much less well-known study.13 Carried out by the California Medical Association and California Hospital Association, this study sought to explore alternate models for compensating patients harmed by their medical care. This was also the main motivation for the HMPS. Showing the extent of preventable harm caused by medical care would potentially provide the basis for changing from the traditional malpractice system to a ‘no fault’ compensation system, as seen in Denmark, Sweden, Finland and New Zealand.14

In a preliminary publication, the HMPS authors wrote “In the California Medical Insurance Feasibility Study, certain screening criteria—such as death, transfer to a special care unit, an undesirable outcome and readmission to the hospital—were found to be associated with an increased likelihood of medical injury. In the absence of such criteria, AEs were generally not found. We eliminated several of the California criteria that we found redundant and added two…”15 Such adding and eliminating has been the story of this methodology over the subsequent decades. Various national AE studies3–9 have added new screening criteria or ‘triggers’ (eg, to detect safety problems of relevance to particular clinical settings or patient populations16–19) and abandoned others (eg, excessive length of stay) that, in practice, detected adverse outcomes from patients’ underlying illnesses rather than harms due to their medical care (ie, non-AEs).

Efforts to refine the AE methodology have also aimed at improving the review process to increase agreement between reviewers about key judgements. Reviewers have generally exhibited moderate to good agreement when it comes to distinguishing AEs from harms not caused by medical care, but only fair agreement when it comes to judging preventability or errors.20 Some studies have reported better agreement about preventable AEs, achieving kappa values in the 0.4–0.6 range.6 8 Even this improved agreement falls short of what one would expect for a field’s ‘gold standard’ measure.
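
To make the kappa values above concrete, the following minimal sketch computes Cohen's kappa for two reviewers judging preventability. The counts are hypothetical, chosen only for illustration, and do not come from any of the cited studies. The sketch shows how two reviewers can agree on 75% of cases and still reach a kappa of only about 0.49, because much of that raw agreement would be expected by chance.

```python
def cohens_kappa(both_yes, a_only, b_only, both_no):
    """Cohen's kappa for two reviewers making a yes/no judgement."""
    total = both_yes + a_only + b_only + both_no
    # Proportion of cases on which the two reviewers agree
    observed = (both_yes + both_no) / total
    # Marginal proportion of 'preventable' calls for each reviewer
    a_yes = (both_yes + a_only) / total
    b_yes = (both_yes + b_only) / total
    # Agreement expected by chance if the reviewers judged independently
    expected = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)
    return (observed - expected) / (1 - expected)


# Hypothetical counts: 100 AEs, each judged preventable or not by two reviewers
raw_agreement = (30 + 45) / 100
kappa = cohens_kappa(both_yes=30, a_only=12, b_only=13, both_no=45)
print(f"raw agreement = {raw_agreement:.2f}, kappa = {kappa:.2f}")
```

With these assumed counts, kappa falls in the 0.4–0.6 range described above even though the reviewers agree on three of every four cases.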

Beyond reviewer disagreement about the presence of AEs and their preventability, these retrospective studies suffer from complete reliance on documentation practices. As with incident reporting, more events can simply mean more reporting, not worse safety.21 When investigators in the Netherlands conducted a second national AE study, they found an AE rate in 2008 of 6.2%, higher than the 4.1% found in 2004.22 Since preventable AEs did not increase, the accompanying editorial23 suggested that more frequent non-preventable AEs reflected better documentation as a result of the growing interest in patient safety.

More commonly, though, researchers have worried about the converse situation: harms that go undocumented. For instance, members of the surgical team caring for a patient after a bowel resection that has gone well may not document a brief adverse drug event. And, even if someone mentions the event, lack of relevant details will hamper judgments about preventability. Moreover, some events do not cause harm immediately and thus do not appear to warrant documentation when they occur.

Detection of AEs using triggers for retrospective record review launched the field of patient safety, but the method has clear shortcomings. Many AEs go undocumented in medical records and reviewers often disagree about those that are documented. Enter the prospective version of the traditional AE detection method.

Prospective application of the trigger tool method for detecting adverse events

Prospective AE surveillance still relies heavily (but not exclusively) on the use of triggers: signs of possible quality of care problems, such as unexpected death, unplanned admission to intensive care or documented patient dissatisfaction with care, as well as signs of specific events of interest, for instance, a laboratory test positive for Clostridium difficile infection. Importantly, though, trigger detection occurs in near-real time (usually within 48 hours) as opposed to months or years later. In addition, a trained observer integrated in the clinical environment supplements record review with debriefs of front-line staff, obtaining relevant details not noted in medical records. Observers can also learn of possible AEs by observing ward rounds, which can identify many events not captured in medical records,24 25 by reviewing incident reporting systems and through direct communication from front-line staff.10–12

This intensified prospective surveillance strategy aims to detect more candidate AEs and obtain key details relevant to judgments about harm and preventability. Moreover, involving staff from the clinical unit in identifying possible harms (and also in weekly conferences for reviewing the identified cases) may engage them in efforts to improve patient safety in a way that learning about AEs affecting patients who received care years ago might not.

Could prospective surveillance enhance the value of AEs as a performance measure?

Regardless of whether any method for identifying AEs can also inform improvement efforts, many might argue that we need at least one ‘gold standard’ measure for tracking progress and/or comparing different healthcare organisations. Just as we compare hospitals using risk-adjusted mortality and readmission rates, maybe we could compare hospitals using a robust measure of patient safety. Existing methods for comparing performance on safety measures tend to use administrative data and have limited validity along with poor positive predictive values when compared with clinical data.26–28 Maybe prospective surveillance, with its likely enhanced detection of preventable AEs, can provide such a comparative performance measure for patient safety.

In this issue of BMJ Quality & Safety, Forster et al report on their use of prospective AE surveillance at five hospitals in two Canadian provinces.29 They sought to determine the degree to which observed variations in rates of (preventable) AEs likely reflect true differences in safety versus variations in the measurement method, including observer and reviewer behaviours. To help characterise the contribution of measurement issues to apparent differences in rates of AEs, Forster and colleagues added the elegant methodological feature of rotating observers between hospitals during the study. And, they restricted the study to general medicine wards to avoid another potential source of variation, as units within hospitals can show greater variation than seen across hospitals.30

The five hospitals consisted of four academic centres offering tertiary and quaternary services and one large urban community hospital. The percentage of hospital admissions with at least one AE ranged from a low of 9.9% (at the community hospital) to a high of 35.8% at one of the academic hospitals, with an overall AE risk per hospitalisation of 22% across the five hospitals. Admissions with at least one preventable AE ranged from 9.9% (again, at the community hospital) to 29.7% (at the same academic hospital with the highest AE risk). These risks for AEs and preventable AEs generally exceed those seen in the previous study using retrospective AE record review conducted in Canada.6 That study reported an overall risk for AEs of 7.5% (10.9% in teaching hospitals) and 2.8% for preventable AEs (3.3% in teaching hospitals). The higher rates of AEs in teaching hospitals likely reflect differences in documentation (more clinicians tend to enter notes on a given patient) and/or differences in case-mix, including transfers of particularly complex patients from non-teaching hospitals.

Regardless of hospital type, the focus of this latest prospective AE study lay in determining the degree to which this method allows identification of true differences in safety between hospitals as opposed to variations intrinsic to the AE detection method. Forster and colleagues reported large variation between the trained observers detecting triggers within the same hospital and also that the magnitude of this observer effect was highly correlated with the hospital. For instance, there was a twofold variation between observers in the hospital with the lowest risk of AEs and a smaller variation in the hospital with the highest risk. The subsequent physician review process somewhat dampened this variation in observer behaviour. But, as in retrospective record review studies, physician reviewers exhibited only modest agreement for judging preventability, with a kappa score of 0.55 (95% CI 0.41 to 0.69).

Even with the ability to detect AEs not captured in the medical record and a greater likelihood of obtaining information relevant to judging preventability, the prospective surveillance method does not appear to solve the issue of variation in the measurement method for detecting AEs. The rates at which observers identify triggers for more detailed record review and persistent limitations in reviewer agreement about key judgments prevent distinguishing true differences in safety between hospitals from measurement variation intrinsic to the AE detection method.

Heterogeneity as the fundamental challenge to using AEs as a metric

Past discussions of problems with AE studies have focused on issues such as reviewer behaviour, properties of the triggers and, now with prospective surveillance, the behaviours of the trained observers. More fundamental, though, is the problem that the AE rate is a composite indicator comprising multiple heterogeneous components. Composite performance measures, such as the Overall Hospital Quality Star Ratings in the USA31 32 or the NHS England Overall Patient Experience Score,33 combine multiple indicators of care quality into a single score. Such composites offer two advantages: the simplicity of a single overall measure and increased statistical power from having more eligible events. But these composites can create problems,34 35 just as with composite outcomes in clinical trials.36 37 Composite outcomes pose particular problems when the components vary substantially in terms of their frequency and/or severity and when an intervention exerts differential effects on the various components.

AEs have these problems to a far greater extent than most composite outcomes. Instead of a small number of component submeasures, the AE rate encompasses all possible injuries from medical care: adverse drug events, complications of surgery and other invasive procedures, hospital acquired infections, non-infectious hazards of hospitalisation (eg, fall-related injuries, pressure ulcers, venous thromboembolism, delirium, malnutrition), diagnostic delays and so on. Each of these major categories is itself heterogeneous (figure 1). For instance, the category of preventable adverse drug events includes harms caused at the time of ordering medications, harms arising during drug dispensing and others from medication administration. Computerised provider order entry may prevent some adverse drug events at the ordering stage, but it will do little to reduce harms arising at the stages of dispensing or medication administration. Similarly, hospital acquired infections include central line associated bloodstream infections,38 39 catheter-associated urinary tract infections,40–42 C. difficile 42 and so on, each with different interventions to reduce these events. So, the AE rate at a given hospital at a given time represents a composite comprising a very long list of distinct event types, ranging from common to very infrequent harms, and with very different potentials for improvement from a given safety intervention or even multiple interventions.
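
A small worked example may help show how a composite dilutes the effect of a targeted intervention. The sketch below uses hypothetical component rates (events per 100 admissions), not figures from any cited study, and assumes an intervention such as computerised provider order entry halves ordering-stage adverse drug events while leaving every other component unchanged.

```python
# Hypothetical AE components, in events per 100 admissions (illustrative only)
components = {
    "ordering-stage adverse drug events": 2.0,
    "dispensing/administration adverse drug events": 3.0,
    "surgical and procedural complications": 6.0,
    "hospital acquired infections": 5.0,
    "other hazards of hospitalisation": 4.0,
    "diagnostic delays": 2.0,
}


def composite_rate(rates):
    """Overall AE rate as the sum of its component rates."""
    return sum(rates.values())


def relative_change(before, after):
    """Change in a rate, expressed as a fraction of its starting value."""
    return (after - before) / before


baseline = composite_rate(components)

# Assumption: the intervention halves one component and affects no other
after_intervention = dict(components)
after_intervention["ordering-stage adverse drug events"] *= 0.5

new_total = composite_rate(after_intervention)
change = relative_change(baseline, new_total)
print(f"composite: {baseline:.1f} -> {new_total:.1f} per 100 admissions "
      f"({change:.1%} change)")
```

Under these assumptions, a 50% reduction in the targeted component moves the composite AE rate by less than 5%.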

Figure 1

Depiction of the intrinsic heterogeneity associated with AE rates. The categories of AEs and their distribution come from a systematic review of retrospective record review studies.20 The point of the figure lies in illustrating the deceptive degree of heterogeneity associated with the label ‘adverse event’, not the specific categories or their relative sizes. Definitions reflect those used in most individual studies, although some studies varied in the names and definitions of certain categories. ‘Therapeutic’ refers to AEs involving inappropriate or delayed treatment despite a correct diagnosis. ‘System/other’ includes AEs that cannot be attributed to an individual or specific source (eg, lack of/defective equipment or supplies, inadequate reporting or communication, inadequate staffing/training/supervision, no protocol/failure to implement protocol). AE, adverse event.

Incomplete capture for many specific types of AEs further compounds the measurement problem. The trigger tool methodology—whether retrospective or prospective—incompletely captures specific categories of AEs,43 44 as many will not produce death, transfer to an intensive care unit, readmission within 30 days or other triggers for record review. Thus, we have a composite outcome (AEs) comprising dozens of component categories (distinct types of AEs), most of which will include few cases in a given sample and many of which suffer from underdetection. The noise of chance variations in the small numbers constituting the numerous components of the AE composite will overwhelm any true signal of, for instance, specific adverse drug events reduced by computerised decision support. Heterogeneity both across and within categories of AEs combined with the small numbers for each means that measurements of AE rates have poor signal-to-noise ratio, thus preventing robust comparisons across hospitals. This same signal-to-noise problem will bedevil efforts to monitor progress over time. Fluctuations in (preventable) AEs at a single institution are at least as likely to reflect measurement variation (ie, noise) as they are true changes in safety.
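
The signal-to-noise argument can be sketched with simple binomial arithmetic. The example below is a rough illustration under stated assumptions: each admission either has or does not have an AE, admissions are independent, and a hospital's true AE risk falls from 22% to 21% (a hypothetical improvement). It considers sampling variation only. Observer and reviewer variation would add further noise.

```python
import math


def signal_to_noise(p_before, p_after, n_per_period):
    """True change in AE risk divided by the standard error of the
    difference between two independent samples of n admissions each."""
    signal = abs(p_before - p_after)
    variance = (p_before * (1 - p_before) + p_after * (1 - p_after)) / n_per_period
    return signal / math.sqrt(variance)


# Hypothetical scenario: true AE risk falls from 22% to 21% of admissions
for n in (500, 1000, 5000, 20000):
    ratio = signal_to_noise(0.22, 0.21, n)
    print(f"n = {n:>6} admissions per period: signal/noise = {ratio:.2f}")
```

With 1000 reviewed admissions per period, the assumed improvement is about half the size of the sampling error, and the ratio passes the conventional threshold of roughly 2 only with well over 10 000 admissions per period, far more than intensive record review or prospective surveillance can typically cover.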

Looking to the future

Returning to Janus and looking back on 20 years of patient safety research, the AE as a metric in various studies, both retrospective and prospective, has served to demonstrate the scope of the problem and to engage clinicians, managers, researchers and policy makers. But, looking forward, such a broad, omnibus metric will not detect important differences in safety between institutions or track progress over time. For these tasks, we need to measure specific events of interest. These include established measures for capturing common healthcare-acquired infections, prospective registries for capturing outcomes of surgery,45–47 validated text mining algorithms applied to electronic health records to capture specific care-related injuries,48 methods to track missed diagnoses leading to harm,49 50 and so on.

Generating broad interest in patient safety required an easily understood measure to demonstrate the scope of preventable harms caused by the healthcare system. Few would question that scope now. To make progress in this now well-established field, we need measures tailored to specific patient harms. No other field attempts to measure progress in the form of an omnibus measure. To assess progress in cardiovascular health, for instance, one looks at trends in the incidence and prognosis over time for common cardiovascular diseases, such as myocardial infarction and stroke, not an omnibus measure of all possible harms from ‘heart and blood vessel disease’. Similarly, we will show progress in patient safety by tracking common, well-defined patient safety problems, not some general measure of all possible harms from medical care, the nature of which will inevitably change over time.51

The figure of Janus looking to the past and to the future captures the possibility of measuring AEs using both retrospective and prospective methods. But Janus also represents transitions. After 20 years of active research in patient safety, the time has come to put away the imperfect gold standard of AE rates and transition to more specific measures of important safety problems.



  • Contributors KGS and PJM-vdM both contributed to conception of the paper; they both critically read and modified subsequent drafts and approved the final version. They are both editors at BMJ Quality & Safety.

  • Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.

  • Competing interests None declared.

  • Patient consent for publication Not required.

  • Provenance and peer review Commissioned; internally peer reviewed.
