Assessment of adverse events in medical care: lack of consistency between experienced teams using the global trigger tool
- 1 School of Health and Caring Sciences, Faculty of Health, Social Work and Behavioural Sciences, Linnaeus University, Kalmar, Sweden
- 2 Division of Drug Research, Anaesthesiology and Intensive Care, Department of Medical and Health Sciences, Linköping, Sweden
- 3 Department of Anesthesia and Intensive Care, County Council of Östergötland, Linköping, Sweden
- 4 Division of Nursing Science, Department of Medical and Health Sciences, Linköping University, Linköping, Sweden
- Correspondence to Kristina Schildmeijer, Linnaeus University, School of Health and Caring Sciences, Kalmar 39182, Sweden
Contributors KGIS was responsible for the study design, acquisition of data, analysis, interpretation of data, drafting the article and intellectual content. LN was responsible for the research study design, analysis, interpretation of data, drafting the article and intellectual content. KÅ was responsible for the analysis, drafting the article and intellectual content. JP was responsible for the study design, drafting the article and intellectual content.
- Accepted 7 January 2012
- Published Online First 23 February 2012
Background Many patients are harmed as the result of healthcare. A retrospective structured record review is one way to identify adverse events (AEs). One such review approach is the global trigger tool (GTT), a consistent and well-developed method used to detect AEs. The GTT was originally intended to be used for measuring data over time within a single organisation. However, as the method spreads, it is likely that comparisons of GTT safety outcomes between hospitals will occur.
Objective To evaluate agreement in judgement of AEs between well-trained GTT teams from different hospitals.
Methods Five teams from five hospitals of different sizes in the southeast of Sweden conducted a retrospective review of patient records from a random sample of 50 admissions between October 2009 and May 2010. Inter-rater reliability between teams was assessed using descriptive and κ statistics.
Results The five teams identified 42 different AEs altogether. The number of identified AEs differed between the teams, corresponding to a level of AEs ranging from 27.2 to 99.7 per 1000 hospital days. Pair-wise agreement for detection of AEs ranged from 88% to 96%, with weighted κ values between 0.26 and 0.77. Of the AEs, 29 (69%) were identified by only one team and not by the other four groups. Most AEs resulted in minor and transient harm, the most common being healthcare-associated infections. The level of agreement regarding the potential for prevention showed a large variation between the teams.
Conclusions The results do not encourage the use of the GTT for making comparisons between hospitals. The use of the GTT to this end would require substantial training to achieve better agreement across reviewer teams.
- Global trigger tool
- record review
- adverse events
- patient safety
- trigger tools
- epidemiology and detection
Despite increased efforts to improve patient safety, many patients are still harmed as the result of healthcare.1 2 European studies report a prevalence of patient injuries of 9–12%.2–4 A review of conditions in Sweden in 2008 showed that 12% of the patients who received hospital care had an injury. The majority of these harmful events were considered to be preventable.2
One way to identify medical injuries is through a retrospective structured review of the patients' records. There are different methods for performing reviews of such records, all of which have been developed to identify and subsequently prevent healthcare-related injury.2 5 All methods are based on the identification of criteria/triggers—that is, events or circumstances indicating that the patient may have had a medical injury. The focus is on injury and clinical outcome rather than on individual mistakes, and healthcare professionals are therefore encouraged to use knowledge of outcomes for improving patient safety.6
One of the most common research methods for identifying patient harm is the one originally used in the Harvard Medical Practice Study.7 8 This and other retrospective methods have been criticised for being time-consuming and resource-intensive.9 10 Another method is the global trigger tool (GTT), which has a time limit of 20 min per record assessment.9 11 The method was developed by the Institute for Healthcare Improvement (IHI) in the USA. The aims of the GTT are to identify adverse events (AEs) and measure their rate of occurrence over time. The trigger tool method has been used increasingly in patient safety work.5 6 9 12–14 Research involving activity-specific versions based on the GTT has also been conducted for patients in intensive care,15 neonatal care,16 surgical care6 and after ambulatory surgery.13 The Swedish version of the GTT is based on the IHI version from 2007.12 17
The GTT was developed as a means of measuring the occurrence of internal AEs over time.18 In order to keep data over time as robust as possible, the GTT method suggests that changes in team members should take place gradually over time with some overlap. In this way, the team members keep their evaluation of harm as constant as possible.18 The use of the GTT in this manner has been shown to provide a consistent and well-developed method for detecting AEs.19 20 For the GTT to be used for comparisons between clinics or hospitals, there must be satisfactory agreement between teams. To the best of our knowledge, agreement between teams that are regular users of the GTT has not been studied. Therefore the aim of this study was to evaluate agreement in judgements of AEs made by experienced GTT teams from different hospitals.
The global trigger tool
The GTT entails retrospective review of medical records by experienced personnel. Small samples of randomly chosen medical records are reviewed over time. The review is based on identification of triggers that may indicate a healthcare-related injury. Examples of triggers include pressure ulcers, transfer to a higher level of care, and performance of a reoperation. The original IHI version, including 54 triggers defined in six different modules (care, surgical, medication, intensive care, perinatal and emergency department), has been translated into Swedish and culturally adjusted by a group of physicians and researchers in collaboration with county councils in the southeast health region of Sweden. The Swedish GTT version consists of 53 triggers, and a second aspect has been added: preventability. In contrast with the IHI view that there is a danger that the reviewers will become too involved in addressing questions about preventability, the Swedish version of the GTT considers that the primary goal should be to focus on the avoidable AE as part of ongoing work on safety.17
The audit team consists of two primary reviewers, often registered nurses (RNs), who perform an individual initial review of all the records. Each record is reviewed within a maximum time of 20 min, and the review focuses on finding triggers (events) and potential AEs, rather than on making any attempt to read the entire record in detail. Each trigger found is marked on the GTT worksheet, with one worksheet for each patient. A positive trigger does not necessarily mean that an injury has occurred.18 When the RNs have made their reviews and have reached a consensus, they hand over the records with marked triggers and potential AEs to the physician for confirmation of the AEs, and for determination of the level of harm and the possibility of prevention. The AEs that are found are categorised according to the National Coordinating Council for Medication Error Reporting and Prevention (NCC MERP) Index.21 The level of harm is categorised from E to I, and preventability is classified on a scale of 1 to 6.2 22 Later letters indicate higher levels of harm, and higher scores indicate greater potential for prevention (table 1).
The study included one team from each of five different hospitals of different sizes in the south-eastern region of Sweden. Following the IHI model, each team consisted of one physician and two RNs. Medical specialities for the team members are shown in table 2.
All teams were very familiar with the GTT method, having used it for at least 3 years in their own hospital's patient safety work. All five teams were instructed to review the records that had been selected from a single hospital and to make this review in exactly the same way as they would have done had the records been from their own hospital. Because the teams were to work as they normally did, no validation or consensus exercise was carried out between them before the start of the study, and no specific or distinct definition of an AE was presented to them. It was nevertheless clear to the teams that an AE was broadly regarded as harm attributed to medical care.
A sample of 50 cases was selected for review from admissions at one of the participating hospitals (220 beds) between October 2009 and May 2010, using a random number generator. Team I belonged to the audited hospital. All cases eligible for selection met two criteria for inclusion: (1) inpatients with at least 24 h hospital stay; (2) patients over 18 years old. Records from surgical, orthopaedics, gynaecology and obstetrics, medical, psychiatry and geriatric clinics were included in the study. All hospitals in the study use the same digitised record system. A secretary, not otherwise involved in the study, made paper copies of the records and removed patient identification. The copies were sent by mail to the participating teams. After finishing their reviews, the teams returned the records and case report form (CRF) to the researcher (KGIS) for analysis.
The review process was performed in two stages. In stage 1, the records in the random sample were reviewed independently by two experienced RNs in each team. They each made an individual review of all the records and screened for the presence of one or more of the 53 triggers. When the RNs found an incident that coincided with at least one trigger, a decision was made by the RN whether it constituted a potential AE or not.
After the two RNs in each team had analysed the records independently and each had completed a CRF, they discussed and analysed the records together and subsequently agreed on an evaluation for each of those events. The RNs then completed together a CRF representing a consensus reached for each medical record. Records with potential AEs were handed over to the physician.
In stage 2, the physician in the team performed an independent review of all records that contained a potential AE in stage 1. The physicians made judgements about a possible cause for AEs related to the healthcare provided, as only AEs due to the care given, and not those that could be a consequence of the patient's disease, were to be noted as AEs in the GTT summary. The physician also estimated the severity of each AE and the degree of preventability (table 1).22 If a patient had more than one AE, each of these was included and counted separately. The physician looked especially at the suspected AEs that were identified in the first stage, but he or she did not systematically review the entire record. If a physician found an AE not identified by the RNs in stage 1, this was included in the review.
Descriptive statistics were used to describe sample characteristics and the agreement between teams. The results are presented as frequencies, means, medians and ranges. Agreement between teams in terms of inter-rater reliability was assessed using κ statistics. Pair-wise agreement between teams regarding numbers of identified triggers and number of AEs was evaluated with weighted κ statistics, using linear weights for agreement, as the variables had more than two levels.23 The Stata command KAPCI with bootstrapping (10 000 replications) was used to calculate 95% CIs and a combined κ statistic across all teams. Because the KAPCI command cannot calculate a weighted κ for multiple raters, the combined κ values for all teams are reported as unweighted values, in contrast with the pair-wise comparisons.24 We used the following scale to interpret the κ statistic: poor (<0.00), slight (0.00–0.20), fair (0.21–0.40), moderate (0.41–0.60), substantial (0.61–0.80) and almost perfect (0.81–1.00).25 The data analyses were performed using SPSS V.19 for Windows and Stata for Windows V.12.0.
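As an illustration of the pair-wise statistic described above, the following is a minimal Python sketch of Cohen's κ with linear agreement weights for two raters, together with the interpretation scale quoted in the text. It is not the Stata KAPCI routine used in the study and does not reproduce the bootstrap CIs; the function names `weighted_kappa` and `interpret_kappa` and the example ratings are hypothetical.

```python
from collections import Counter


def weighted_kappa(ratings_a, ratings_b, categories):
    """Cohen's kappa with linear agreement weights for two raters.

    `categories` lists the ordered levels (for example 0, 1, 2 AEs per record).
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must rate the same non-empty set of records")
    k = len(categories)
    if k < 2:
        raise ValueError("at least two categories are required")
    index = {c: i for i, c in enumerate(categories)}
    n = len(ratings_a)

    def weight(i, j):
        # Linear agreement weight: 1 for identical levels, 0 at maximum distance.
        return 1.0 - abs(i - j) / (k - 1)

    observed = Counter((index[a], index[b]) for a, b in zip(ratings_a, ratings_b))
    row = Counter(index[a] for a in ratings_a)
    col = Counter(index[b] for b in ratings_b)

    p_obs = sum(weight(i, j) * c for (i, j), c in observed.items()) / n
    p_exp = sum(weight(i, j) * row[i] * col[j]
                for i in range(k) for j in range(k)) / (n * n)
    if p_exp == 1.0:
        raise ValueError("kappa is undefined when expected agreement equals 1")
    return (p_obs - p_exp) / (1.0 - p_exp)


def interpret_kappa(kappa):
    """Map kappa to the scale quoted in the text.

    Values falling between two listed bands (e.g. 0.205) go to the higher band.
    """
    if kappa < 0.0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Hypothetical number of AEs per record as judged by two teams (not study data).
team_x = [0, 0, 1, 2, 0, 1]
team_y = [0, 1, 1, 2, 0, 0]
kappa = weighted_kappa(team_x, team_y, categories=[0, 1, 2])
print(round(kappa, 3), interpret_kappa(kappa))
```

With linear weights, a disagreement of one level (for example 0 versus 1 AE) is penalised half as much as a disagreement of two levels when there are three categories, which is why weighting suits count variables with more than two levels.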
Background description of cases and the review process
No record was excluded in the review process because of insufficient documentation. The mean age of patients was 62.4 years (SD 21.2; range 19–98). There were 31 women and 19 men. The total number of hospital days analysed was 331, corresponding to a median of 5 days per admission (IQR 5; range 2–32). The patients were treated in surgical (13), orthopaedic (4), gynaecological and obstetric (6), medical (21), psychiatric (4) and geriatric (2) care.
Triggers and adverse events
The median time for primary review was 10 min (IQR 7; range 1–20), and review times did not differ between teams. The total number of identified triggers varied between 29 and 95 across the teams (table 3). Of the 53 triggers in the GTT, 31 were identified in the examined patient records. In nine of the charts, no positive trigger was found by any team. The most common triggers found were ‘readmission within 30 days’ (n=47; range between teams 5–15), ‘healthcare-associated infection’ (n=30; range between teams 3–9), and ‘abrupt medication stop’ (n=29; range between teams 0–16). In no case did an RN suggest a potential AE that was then rejected by the physician. The five teams identified 42 different AEs altogether. The number of AEs identified by each team ranged from 9 to 33 (table 3), corresponding to a level of AEs ranging from 27.2 to 99.7 per 1000 hospital days. Some patients were identified as having more than one AE (range 1–5), so that, depending on the team, 7–16 patients (14–32%) were judged to have suffered at least one AE. Most AEs resulted in minor, transient harm (level E–F) (table 4). The most common AE identified by the teams was a healthcare-associated infection (range 5–8).
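The rates quoted above follow directly from the AE counts and the 331 hospital days analysed. A minimal Python sketch of the arithmetic is given below; the function names are hypothetical, and the inputs are the figures reported in the text.

```python
def aes_per_1000_hospital_days(n_aes, hospital_days):
    """Number of AEs expressed per 1000 hospital days."""
    if hospital_days <= 0:
        raise ValueError("hospital_days must be positive")
    return 1000.0 * n_aes / hospital_days


def percent_of_admissions(n_patients_with_ae, n_admissions):
    """Share of admissions with at least one AE, in per cent."""
    if n_admissions <= 0:
        raise ValueError("n_admissions must be positive")
    return 100.0 * n_patients_with_ae / n_admissions


HOSPITAL_DAYS = 331  # total hospital days in the 50 reviewed admissions
ADMISSIONS = 50

# Lowest and highest number of AEs identified by a single team.
for n_aes in (9, 33):
    print(n_aes, round(aes_per_1000_hospital_days(n_aes, HOSPITAL_DAYS), 1))

# Lowest and highest number of patients judged to have at least one AE.
for n_patients in (7, 16):
    print(n_patients, round(percent_of_admissions(n_patients, ADMISSIONS)))
```

Because all teams reviewed the same 331 hospital days, the more than threefold spread in the rate reflects only differences in how many AEs each team judged to be present.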
The majority (55–82%) of AEs were found among the surgical specialities by four of the five teams (table 4).
We observed agreement among the majority of teams in eight (19%) of the total of 42 AEs. Of all the AEs identified, 29 (69%) were identified by only one of the teams and not by the other four (online appendix 1). Overall, agreement on the detection of AEs and on the level of harm was highest for healthcare-related infections, that is, pneumonia, sepsis and urinary tract infection.
In two cases, AEs were judged to have contributed to the patient's death (online appendix 1). The first case was detected by two teams. However, the AE that contributed to the patient's death was regarded as pneumonia by one team and as arterial emboli by the other. The second case, an accidental cutting of an intestinal anastomosis (Roux-en-Y limb), was identified by two other teams. In this case, one of the remaining teams stated that, as the patient's death had occurred more than 30 days after the index care episode, it should be excluded, according to the GTT manual. Other teams stated that the patient's concomitant diseases were the cause of death, so that the death was not a result of the care the patient received, and accordingly did not include it as an AE in the GTT analysis.
Most of the AEs (58%) were judged to have been preventable. Although the teams showed substantial agreement on preventability for some types of AE (for example, sepsis), they judged the level of preventability differently for others, for example, urinary tract infection (range 1–5), pressure ulcer (range 2–5) and pneumonia (range 1–4). Overall, the percentage of the AEs identified by each team as being preventable ranged from 33% (team V) to 82% (team III) (table 3).
The levels of agreement in terms of inter-rater reliability are presented in table 5. Weighted κ values for the detection of the number of triggers team-by-team ranged from 0.32 to 0.60, with a combined unweighted κ of 0.20 (95% CI 0.12 to 0.30), corresponding to slight inter-rater reliability (table 5). The level of agreement was higher for the detection of numbers of AEs. The weighted κ values ranged from 0.26 to 0.77 team-by-team, with a combined unweighted κ of 0.45 (95% CI 0.26 to 0.63), corresponding to moderate reliability. Because team IV differed from the other teams in identifying a larger share of AEs, a combined unweighted κ value was also calculated with team IV excluded. The greatest difference was that the level of agreement for the number of identified AEs increased from κ=0.45 to κ=0.65.
The aim of this study was to evaluate if the GTT could justifiably be used for comparisons between teams or hospitals. We found that there were large differences in the number of AEs detected by different, but equally experienced, review teams, and this was also true for the assessment of levels of harm and preventability. The agreement between teams was highest in detecting healthcare-associated infections.
All teams had used the GTT on a monthly basis for at least 3 years in patient safety work at their own hospital, and they were accustomed to reviewing care episodes from different medical specialities. No training sessions or efforts to instruct the teams about reaching a consensus were given before the start of the study, since our aim was to compare the degree of agreement between experienced teams when each team used the GTT method in the manner to which they were already accustomed.
There has been a clear increase in the use of the GTT in patient safety work at both the overall hospital level and the clinical level. As the number of review teams has been increasing, there is a risk that there will be ever greater departures from the original GTT method. As the method uses numbers to indicate ‘level of patient safety’, there is also a risk that efforts will be made to compare the GTT results between hospitals or clinics. For such comparisons to be meaningful, it must be shown that teams at all hospitals come to the same evaluations given data on a particular group of patients. The results of the present study indicate that such comparisons cannot be justified.
Four teams in our study made similar assessments, while one team (team IV) documented three times as many AEs as the others. In cases where several teams had identified the same AE, team IV's assessment of severity and preventability did not differ in an obvious manner from the others. The difference is that, on 22 occasions, team IV identified events as being AEs that none of the other teams identified as such. The other four teams also identified AEs that were not identified by the other teams, but only one or two per team. It appears to us that team IV had a much broader view of the kinds of events that might be identified as AEs. Many of the AEs identified by that team were seen by the other teams solely as triggers. For example, a readmission within 30 days was considered by team IV to be an AE because they assumed that, if care had been performed following standard procedure on the first occasion, then the patient would not have had to be readmitted to the hospital. Within our health region, a regional GTT network was started in 2007. Other issues occasionally discussed within this network include harm and preventability. Teams I, II, III and V have participated in these meetings, but team IV has not for various reasons. Teams I, II, III and V used the IHI definition of AE strictly following the GTT by limiting the identification of AEs to unintended physical injuries, whereas team IV's identifications were based on The Swedish National Board of Health and Welfare's definition of AE: ‘Any suffering, discomfort, bodily or mental injury, illness or death caused by healthcare and which is not an inevitable consequence of the patient's condition or an expected effect of the treatment received by the patient because of her/his condition’.26
In the present study, the agreement regarding AEs expressed as a percentage is higher than that reported by Classen et al,11 who looked at the level of agreement between reviewer pairs who analysed the same 50 records. Our κ values are substantially lower than in that study. One possible explanation is that we compared the number of AEs, whereas Classen et al reported whether the case had any AEs or not. The κ values were much lower for detecting triggers than for detecting AEs, and lower than in a study by Naessens et al,20 who looked at levels of agreement between primary reviewers within teams. Finding triggers is helpful in identifying possible AEs, but not a goal in itself. We asked the teams to use the GTT in the manner to which they were accustomed. It may be that the teams have developed different ways of conducting their review and thereby look for triggers in different ways. Furthermore, an AE can be identified by any one of several different triggers, but the teams sometimes only marked one of those different triggers. If the GTT is to be used in patient safety work, then agreement on the level of harm is of primary importance, but reaching agreement on finding triggers could be helpful in the further development of the instrument, as well as in teaching and implementation of the GTT.
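A general property of the κ statistic may also help to explain how a high percentage agreement can coexist with a modest κ: when most records contain no AE, two reviewers agree on many records by chance alone, and κ corrects for this. The Python sketch below illustrates the effect for a simple yes/no judgement. The counts are hypothetical and are not taken from the study, and the function name is an assumption made for the example.

```python
def percent_agreement_and_kappa(both_yes, only_a, only_b, both_no):
    """Raw agreement and unweighted Cohen's kappa for a 2x2 table of two raters."""
    n = both_yes + only_a + only_b + both_no
    if n == 0:
        raise ValueError("the table must contain at least one record")
    p_obs = (both_yes + both_no) / n
    a_yes = (both_yes + only_a) / n  # proportion rated 'AE present' by rater A
    b_yes = (both_yes + only_b) / n  # proportion rated 'AE present' by rater B
    p_exp = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)
    if p_exp == 1.0:
        raise ValueError("kappa is undefined when expected agreement equals 1")
    return p_obs, (p_obs - p_exp) / (1 - p_exp)


# Hypothetical 50 records: both teams see an AE in 2, neither in 44,
# and each team alone sees an AE in 2 further records.
p_obs, kappa = percent_agreement_and_kappa(both_yes=2, only_a=2, only_b=2, both_no=44)
print(f"agreement = {p_obs:.0%}, kappa = {kappa:.2f}")
```

In this hypothetical table, the two raters agree on 46 of 50 records, yet κ is only about 0.46, because agreement on the many records without an AE is largely expected by chance.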
We observed agreement among the majority (at least three) of the teams on only eight of the total of 42 AEs. All of these were different types of infection plus one pressure ulcer. In one case, four teams noted that the patient had a urinary infection, but one of the teams also felt that the patient had sepsis (positive blood culture). The fifth team simply identified the positive blood culture as a trigger. Only three of the 42 AEs were identified by all five teams, all of these being infection complications. The reason that healthcare-associated infections and pressure ulcers were identified more consistently than others may be that they are more easily found in the record text and are often indicated by diagnostic codes at the beginning of the record. In the cases where only one of the teams found an AE, the reason may be that the records, according to the GTT manual, were only to be read summarily in the search for triggers. Another reason is that some of the teams in some cases made the assessment that an injury was caused by the patient's underlying disease and had no relationship to the care received. In one case where the patient had pressure ulcers, this was identified by four teams as an AE. The fifth team had noted the pressure ulcer as a positive trigger, but did not consider it to be harm related to healthcare. Another case in which the assessments differed was a case in which a urethral catheter had fallen out and been replaced by a new one. One of the teams noted this as a positive trigger ‘reoperation’, and judged it as an AE. Another team classified this as a positive ‘treatment’ trigger, but not an AE, and the other three teams did not mention it at all. As a final example of the differences we found, one team had indicated that a patient had an allergic reaction and identified it as an AE. In the CRF from the other teams, two had classified this event as an antihistamine trigger and one as ‘other’ trigger. These teams did not consider that this reaction gave rise to any patient harm. The fifth team did not describe this event in any way.
The majority of the AEs in the present study were judged preventable. This is in accordance with other studies.2 16 However, great differences were found in the assessments of levels of preventability. Consideration of preventability is not included in the original GTT method.18 The Swedish manual is not precise about preventability, and no examples are given in the manual that could help to make judgements. This makes us doubt the wisdom of using preventability assessments in this way. Instead, we argue in accordance with Classen et al19 that a better approach in patient safety work is to look at almost all AEs as being preventable. Moreover, it may be that AEs that today are considered as not being avoidable will in the future be judged as indicating substandard care. It is of utmost importance in making structured record reviews to place the greatest effort on detecting AEs and not on estimating the level of preventability, nor on identifying an individual person's mistakes. Instead of relying only on the traditional voluntary reports, hospitals might turn to making standardised use of the GTT to stimulate the staff to engage in proactive patient safety work.6
GTT reviews with feedback to the participating clinics encourage safety work and increase awareness of patient safety among the reviewers. The reviewers can pass the knowledge on both inside and outside their own clinics. The opportunity to carry out reviews of one's own department/hospital can lead to awareness of the risks patients are exposed to and also enhance the possibility of taking preventive actions. Sharek et al27 found that internal review teams identified more AEs with higher inter-rater and intrarater reliability than external teams. Compared with experienced teams, however, sensitivity was only 49% and 34%, respectively. A simulation study by Hofer et al28 concluded that using only one reviewer to define AEs is not satisfactory. Only by combining the reviews from multiple physicians can acceptable positive predictive values for AEs be obtained. The use of internal teams with a stable group of team members is necessary for using the GTT to measure rates and trends over time. It may be that results from GTT reviews could better be used to identify areas where improvements in patient safety work are warranted rather than using reviews to compare different hospitals or clinics.
There are limitations to the present study that need to be considered. One is the retrospective design of the record review. There is no gold standard for AE detection with which to compare our results. We cannot presume that the teams included in our study are representative of other teams in other hospitals. The composition of our teams with respect to medical specialties represented differed from team to team, and this may have influenced the judgements made by the teams. Teams at different hospitals will, of course, be made up of people with different specialties. We cannot exclude the possibility that the team from the source hospital (team I) was biased in its evaluation, even though its results are in agreement with those of three of the other teams (II, III and V). The records from the hospital used in this study may not be representative of a broader sample of hospitals. The teams only reviewed 50 records; however, the differences in judgement we have identified are striking. Classen et al11 also evaluated 50 records in their original evaluation of the GTT method.
Training increases the consistency of assessments within the teams.11 20 Naessens et al compared GTT reviews at four different hospitals and reported κ values from 0.40 to 0.77 for AE agreement within the teams. These teams were given a 5-month period in which to establish a working process, and they attended consistent training sessions during the 2-year period of the study.20 In no case in our study did physicians overturn an AE judgement by one of the RNs. Even though the physician's assessments were based on the RNs' findings and not on independent reviews, this finding indicates that the participants in each team were in agreement about their assessment. Training and discussions to reach agreement within a team significantly improve agreement on the presence of an AE,11 and our team members had all worked together for longer periods. Even if agreement within reviewer pairs is increased by discussion, this consistency does not improve overall reliability across pairs of physicians who are part of different discussions that include other reviewers.29 On the contrary, a high reliability within a team working together carries a risk of overconfidence and comfort in their ratings.
One measure to increase agreement between reviewer teams could be to develop a more detailed manual and to change the Swedish GTT version so that it clearly states that the review is to be limited to identifying physical harm only. To confirm that training, collaboration and a standardised process are prerequisites for achieving conformity within teams,20 it would be interesting to design a study in which teams re-examine the records they had examined 5 years earlier.
We found that there were large differences between the experienced review teams in the number of AEs detected, as well as in the assessment of level of harm and preventability. Our results do not encourage the use of the GTT to make comparisons between teams or hospitals. If the GTT is to be used to this end, substantial group training will be required to achieve better agreement across reviewer teams.
We thank the teams for their efforts and Linnaeus University for support.
Funding FORSS—the research council of the southeast of Sweden and the County Council of Kalmar (grant number 72521).
Competing interests None.
Ethics approval The ethics board at Linköping University, Sweden (study number 2010/56-31).
Provenance and peer review Not commissioned; externally peer reviewed.