Summary of research into the reliability of adverse event measures of healthcare quality
Study | Dimension of reliability | Methods | Results and conclusions |
---|---|---|---|
Panniers and Newlander (1986)30 | Inter-rater reliability | Used 2 raters to apply modified form of adverse patient occurrence inventory to sample of 200 cases from 426 patients with myocardial infarction | Raw agreement of 99–100% for 10 items of adverse patient occurrence inventory; agreement for the other 5 items ranged 72–96% (κ 0.29 to 0.83). Concluded adverse patient occurrence inventory was reliable |
Schumacher et al (1987)32 | Inter-rater reliability | Used 7 raters to apply adverse patient occurrence inventory to 752 cases (each reviewed 2 or 3 times) drawn from 7 hospitals | Pearson correlation coefficients used to measure association between raters. Mean correlation for adverse patient occurrence score was 0.33 (ranged from −0.05 to 0.58). Concluded adverse patient occurrence inventory was insufficiently reliable |
Richards et al (1988)28 | Inter-rater reliability | Used multiple raters to apply adverse patient occurrence inventory to 516 cases drawn from 5 hospitals, each reviewed by 2 raters | κ statistics for adverse patient occurrence numerator items had mean of 0.33 (ranged from −0.18 to 0.73); for adverse patient occurrence denominator items mean was 0.50 (ranged from 0.28 to 0.83). For adverse patient occurrence score, found within-patient variability much less than overall variability. ANOVA showed differences between raters responsible for 2% of adverse patient occurrence score variability. Concluded adverse patient occurrence inventory “at best moderately reliable” |
Harvard Medical Practice Study (1990)10 | Inter-rater reliability | Used multiple raters to apply the study’s own adverse event measure to 282 cases (random 1% sample of total study), each reviewed by 2 raters | Raw agreement on presence/absence of adverse event in each case of 93.6%, κ of 0.85. Concluded measure was sufficiently reliable for use in study |
Walshe (1998)15 | Inter-rater reliability | Used multiple raters to apply adverse event measure to 374 admissions across 3 specialties, each reviewed by 2 raters | Overall κ statistics of 0.46, 0.63, and 0.65 in the 3 specialties, suggesting “moderate to good reliability”, though reliability depended heavily on rater training |
Walshe (1998)15 | Intra-rater reliability | Used a single rater to apply adverse event measure to 110 admissions in obstetrics, then rescreened the same records 4 months later | Overall κ statistic of 0.56, suggesting moderate reliability. Significantly more adverse events found on second screening, when the rater was aware the study was being undertaken |
Walshe (1998)15 | Inter-rater reliability | Observational study of 6095 admissions in 8 specialties screened for adverse events by 4 different raters | Significant differences between raters in rates of adverse events detected were found in 6 of the 8 specialties |
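
Most of the rows above report raw agreement and the κ (Cohen's kappa) statistic as their measures of inter-rater reliability. The sketch below, using hypothetical judgements rather than data from any of the cited studies, illustrates how these two quantities are computed for two raters deciding whether an adverse event is present in each case.

```python
# Illustrative sketch only: hypothetical data, not taken from any study above.
# Shows how raw agreement and Cohen's kappa are computed for two raters
# judging presence/absence of an adverse event.
from collections import Counter

def raw_agreement(rater_a, rater_b):
    """Proportion of cases on which the two raters give the same judgement."""
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(rater_a)
    p_observed = raw_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same category
    # if each judged independently at their own marginal rates.
    p_expected = sum((counts_a[c] / n) * (counts_b[c] / n)
                     for c in set(rater_a) | set(rater_b))
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical judgements for 10 cases (1 = adverse event present, 0 = absent)
a = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]
b = [1, 0, 1, 1, 0, 0, 0, 0, 1, 0]
print(f"raw agreement = {raw_agreement(a, b):.2f}")  # 0.80
print(f"kappa         = {cohens_kappa(a, b):.2f}")   # 0.57
```

Because κ discounts the agreement expected by chance, it can sit well below raw agreement when one judgement (typically “no adverse event”) dominates; this is consistent with the table above, where raw agreement in the 72–96% range corresponds to κ values as low as 0.29.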