Inter- and intra-rater reliability for classification of medication related events in paediatric inpatients
  1. D L Kunac1,
  2. D M Reith2,
  3. J Kennedy1,
  4. N C Austin3,
  5. S M Williams2
  1. 1School of Pharmacy, University of Otago, Dunedin, New Zealand
  2. 2Department of Women’s and Children’s Health, Dunedin School of Medicine, University of Otago, Dunedin, New Zealand
  3. 3Christchurch Women’s Hospital, Christchurch, New Zealand
  1. Correspondence to:
 Mrs D L Kunac
 Research Fellow, School of Pharmacy, University of Otago, P O Box 913, Dunedin, New Zealand; desiree.kunac{at}


Background: In medication safety research studies medication related events are often classified by type, seriousness, and degree of preventability, but there is currently no universally reliable “gold standard” approach. The reliability (reproducibility) of this process is important as the targeting of prevention strategies is often based on specific categories of event. The aim of this study was to determine the reliability of reviewer judgements regarding classification of paediatric inpatient medication related events.

Methods: Three health professionals independently reviewed suspected medication related events and classified them by type (adverse drug event (ADE), potential ADE, medication error, rule violation, or other event). ADEs and potential ADEs were then rated according to seriousness of patient injury using a seven point scale and preventability using a decision algorithm and a six point scale. Inter- and intra-rater reliabilities were calculated using the kappa (κ) statistic.

Results: Agreement between all three reviewers regarding event type ranged from “slight” for potential ADEs (κ = 0.20, 95% CI 0.00 to 0.40) to “substantial” agreement for the presence of an ADE (κ = 0.73, 95% CI 0.69 to 0.77). Agreement ranged from “slight” (κ = 0.06, 95% CI 0.02 to 0.10) to “fair” (κ = 0.34, 95% CI 0.30 to 0.38) for seriousness classifications but, by collapsing the seven categories into serious versus not serious, “moderate” agreement was found (κ = 0.50, 95% CI 0.46 to 0.54). For preventability decision, overall agreement was “fair” (κ = 0.37, 95% CI 0.33 to 0.41) but “moderate” for not preventable events (κ = 0.47, 95% CI 0.43 to 0.51).

Conclusion: Trained reviewers can reliably assess paediatric inpatient medication related events for the presence of an ADE and for its seriousness. Assessments of preventability appeared to be a more difficult judgement in children and approaches that improve reliability would be useful.

  • medication error
  • children
  • classification
  • inter-rater reliability
  • intra-rater reliability

Medication related patient injuries, so-called adverse drug events (ADEs), and errors in the use of medications are commonly associated with the pharmacological treatment of patients in hospital.1–3 In order to analyse these events with the aim of developing prevention strategies, data on the frequency, type, seriousness, and degree of preventability4 of the events are required. If calculation of rates of events and the targeting of prevention strategies are based on specific categories of event, then the reliability (or reproducibility) of the classification process is important.

Such classifications require some form of professional review and the general approach has previously been to have two independent physicians make these judgements.4 Judgements require not only an up to date clinical knowledge, but also consideration of standards of care and the recognition of distinction between those injuries caused by disease or patient condition and those due to a medication.5 Variations in judgements made by reviewers are an important source of measurement error.6

Where two or more reviews have been undertaken independently, it is possible to conduct reliability studies to determine the level of reviewer agreement in the measurement process. Reliability refers to the consistency of ratings or to the ability of two or more reviewers to reach the same conclusions about a specific case.7

In epidemiological studies of adverse events (including ADEs), the statistic most often used to measure agreement between two reviewers is the kappa statistic (κ). Kappa is a chance corrected index of agreement and is calculated by the equation κ = (O − E)/(1 − E), where O = observed agreement and E = expected agreement by chance.6 Using kappa, reliability of 0.00 is considered poor agreement, 0.01–0.20 considered slight agreement, 0.21–0.40 fair agreement, 0.41–0.60 moderate agreement, 0.61–0.80 substantial agreement, and 0.81–1.00 almost perfect agreement.6,8 The percentage agreement is sometimes reported and is calculated by dividing the number of agreed cases by the total number of cases.
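As an illustration of the two measures just described, the following minimal Python sketch computes percentage agreement and two-reviewer kappa from O and E, and maps a kappa value onto the agreement bands quoted above. The function names and the ten example ratings are hypothetical and are not taken from the study data; treating negative values as "poor" is also an assumption.

```python
from collections import Counter


def percent_agreement(r1, r2):
    """Proportion of cases on which both reviewers chose the same category."""
    if len(r1) != len(r2) or not r1:
        raise ValueError("ratings must be non-empty and of equal length")
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)


def cohen_kappa(r1, r2):
    """Two-reviewer kappa: (O - E) / (1 - E)."""
    n = len(r1)
    observed = percent_agreement(r1, r2)
    c1, c2 = Counter(r1), Counter(r2)
    # Chance agreement: product of the two reviewers' marginal proportions,
    # summed over every category used by either reviewer.
    expected = sum(c1[c] * c2[c] for c in c1.keys() | c2.keys()) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when expected agreement is 1")
    return (observed - expected) / (1 - expected)


def strength_label(kappa):
    """Agreement bands as quoted in the text (negative values treated as poor)."""
    bands = [(0.00, "poor"), (0.20, "slight"), (0.40, "fair"),
             (0.60, "moderate"), (0.80, "substantial"), (1.00, "almost perfect")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    raise ValueError("kappa cannot exceed 1")


# Hypothetical ratings of ten events by two reviewers.
r1 = ["ADE"] * 4 + ["error"] * 6
r2 = ["ADE"] * 3 + ["error"] * 6 + ["ADE"]

kappa = cohen_kappa(r1, r2)
print(f"agreement = {percent_agreement(r1, r2):.2f}, "
      f"kappa = {kappa:.2f} ({strength_label(kappa)})")
```

In this made-up example the reviewers agree on 8 of 10 events, yet kappa is lower than the raw agreement because about half of that agreement would be expected by chance alone.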

Although a standardised approach for classification of events has recently been proposed,4 there is no “gold standard” currently available. Not only does this mean that a variety of classification scales are in use, but the rigour with which these classifications are undertaken varies considerably between studies. Few paediatric ADE studies have published estimates of reliability. Where estimates of reliability are provided, there is often too little information available to allow comparison between studies or to allow an understanding of what factors may influence reliability judgements.

This study was undertaken to determine the reliability of reviewer judgements regarding classification of paediatric inpatient medication related events by type, seriousness, and degree of preventability.


Methods

Study design and setting

A prospective observational cohort study was conducted over a 12 week period from 18 March to 9 June 2002 at a university affiliated urban general hospital in Dunedin, New Zealand. All admissions to the neonatal intensive care unit (NICU), postnatal ward, and paediatric ward during the study period were eligible for inclusion. Patients were excluded if the hospital admission was for less than 24 hours, if medical staff deemed it inappropriate for a patient to be involved, or if the admission was due to an intentional overdose. This resulted in 495 eligible study patients who had a total of 520 admissions (84 paediatric medical, 61 paediatric surgical, 57 NICU, 318 postnatal).

Medication related events were identified by the principal investigator (DK) using a multipronged approach which involved:

  • chart review for all admissions;9

  • attendance at multidisciplinary ward meetings;

  • interview of parents/carers (and children) when further information or clarification of information was required (a total of 106 of the 110 parents approached (96.4%) gave consent and were interviewed);

  • voluntary and verbally solicited reports from staff;4 all paediatric ward staff were educated about the study and were invited to take part by submitting voluntary reports of any actual events or potentially unsafe medication systems that they noted during their daily activities. This was either via the hard copy medEVENT form designed for the study or communicated verbally direct to the investigator during daily ward visits or via telephone. In addition, when the investigator visited ward areas, reports were solicited from staff on duty at the time.

All suspected medication related incidents (N = 701) were reviewed by a panel of three health professionals who independently categorised the events in various ways. The panel included a paediatric clinical pharmacologist (DR, reviewer 1), a neonatologist (NA, reviewer 2), and a clinical pharmacist (JK, reviewer 3). Prior to this process, the reviewers underwent a calibration exercise using simulated test cases and the reviewer form. As a result of discussions regarding these test cases, a clear set of guidelines was agreed; this included explanatory notes about the review process and contained definitions and examples for the different event categories, as shown in table 1. An anonymised computer generated summary was created for each event. Assessments were performed individually by the reviewers using a standardised form.

Table 1

 Medication related event types: definitions and examples (adapted from Kaushal et al3)

The review panel was required to judge event type (ADE, potential ADE, medication error, rule violation or other event), seriousness, and preventability. The reviewers rated ADEs and potential ADEs for seriousness based on International Committee on Harmonisation (US) guidelines.10 The reviewers assessed preventability on the basis of the practitioners’ presumed knowledge at the time the medication was prescribed. A preventable versus not preventable decision was made using a set of questions developed by Schumock and Thornton.11 Confidence about the preventability classification of events was rated on a six point scale, based on the four point score devised by Dubois and Brook.12 The preventability scores were collapsed into preventable (score 1–3) and not preventable (4–6) events. Medication errors, by definition, were automatically deemed “not serious” and “preventable” events. Rule violations, being very trivial events, were separated out as “not applicable” in the classification of preventability of events.
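The collapsing rules described in this paragraph can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the category label strings are hypothetical, and the handling of event types other than those named in the text is not specified in the source, so the sketch simply rejects them.

```python
def collapse_preventability(score):
    """Collapse the six point score: 1-3 preventable, 4-6 not preventable."""
    if score not in (1, 2, 3, 4, 5, 6):
        raise ValueError("preventability score must be an integer from 1 to 6")
    return "preventable" if score <= 3 else "not preventable"


def preventability_category(event_type, score=None):
    """Apply the rules stated in the text to a single reviewed event.

    The event_type strings are hypothetical labels chosen for this sketch.
    """
    if event_type == "medication error":
        # Medication errors were, by definition, deemed preventable.
        return "preventable"
    if event_type == "rule violation":
        # Rule violations were separated out as "not applicable".
        return "not applicable"
    if event_type in ("ADE", "potential ADE"):
        if score is None:
            raise ValueError("a preventability score is needed for this event type")
        return collapse_preventability(score)
    # The source does not say how other event types were handled.
    raise ValueError(f"no rule stated for event type: {event_type!r}")


print(preventability_category("ADE", 2))
print(preventability_category("potential ADE", 5))
print(preventability_category("rule violation"))
```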

Statistical analysis

Inter-rater and intra-rater reliabilities for key judgements were calculated using the percentage of agreement and the kappa statistic (κ) using STATA for Windows Version 8.0 (Stata Corporation, College Station, TX, 2003). Three-way kappa was used to evaluate reliability between all three reviewers and two-way kappa analysis was performed for evaluation of reviewer pairs. Because the marginal totals for some outcomes for some pairs of reviewers were very different, the maximum possible value of kappa was also calculated. Kappa max (max κ) was calculated using the equation: 1 − (minimum disagreement/expected disagreement).8
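To make the kappa max calculation concrete, here is a minimal Python sketch that works from a square cross-tabulation of two reviewers' classifications. It assumes that "minimum disagreement" means the smallest disagreement attainable given each reviewer's marginal totals (one minus the sum over categories of the smaller of the two marginal proportions); the function name and the example counts are hypothetical.

```python
def kappa_and_max(table):
    """Return (kappa, kappa_max) for a square agreement table.

    table[i][j] is the number of events that reviewer A placed in
    category i and reviewer B placed in category j.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    row_p = [sum(table[i]) / n for i in range(k)]
    col_p = [sum(table[i][j] for i in range(k)) / n for j in range(k)]

    observed = sum(table[i][i] for i in range(k)) / n
    expected = sum(r * c for r, c in zip(row_p, col_p))
    # With fixed marginals, agreement in a category cannot exceed the
    # smaller of the two reviewers' marginal proportions for it.
    max_observed = sum(min(r, c) for r, c in zip(row_p, col_p))

    kappa = (observed - expected) / (1 - expected)
    kappa_max = 1 - (1 - max_observed) / (1 - expected)
    return kappa, kappa_max


# Hypothetical 2 x 2 table (serious v not serious) with unequal marginals.
kappa, kappa_max = kappa_and_max([[40, 10],
                                  [20, 30]])
print(f"kappa = {kappa:.2f}, kappa max = {kappa_max:.2f}, "
      f"ratio = {kappa / kappa_max:.2f}")
```

In this example the reviewers' marginal totals differ (50 v 60 events in the first category), so even with the best possible alignment kappa could not exceed about 0.8; the ratio expresses the observed kappa relative to that ceiling.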


Results

Level of agreement between all three reviewers

Agreement between all three reviewers regarding event type ranged from “slight” for potential ADEs to “substantial” for the presence of an ADE. Overall, using all five categories of event, “fair” agreement was found between reviewers (table 2).

Table 2

 Inter-rater reliability for all three reviewers for event type
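For readers unfamiliar with agreement statistics for more than two reviewers, the sketch below shows one common generalisation, Fleiss' kappa, giving an overall value and a value for each category. Whether this corresponds exactly to the three-way kappa produced by the statistical package used in the study is an assumption, and the six example events are invented purely for illustration.

```python
def fleiss_kappa(ratings, categories):
    """Overall and per-category Fleiss' kappa.

    ratings: list of events, each a list holding one category per reviewer.
    categories: list of all category labels.
    """
    n_events = len(ratings)
    n_raters = len(ratings[0])
    counts = [[event.count(c) for c in categories] for event in ratings]
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every event needs one known category per reviewer")

    total = n_events * n_raters
    p = [sum(row[j] for row in counts) / total for j in range(len(categories))]

    # Observed agreement: share of agreeing reviewer pairs, averaged over events.
    per_event = [(sum(x * x for x in row) - n_raters) / (n_raters * (n_raters - 1))
                 for row in counts]
    observed = sum(per_event) / n_events
    expected = sum(q * q for q in p)
    overall = (observed - expected) / (1 - expected)

    by_category = {}
    for j, c in enumerate(categories):
        disagreement = sum(row[j] * (n_raters - row[j]) for row in counts)
        denom = n_events * n_raters * (n_raters - 1) * p[j] * (1 - p[j])
        by_category[c] = 1 - disagreement / denom if denom > 0 else float("nan")
    return overall, by_category


# Six hypothetical events, each classified by three reviewers.
events = [
    ["ADE", "ADE", "ADE"],
    ["ADE", "ADE", "ADE"],
    ["medication error", "medication error", "medication error"],
    ["medication error", "medication error", "potential ADE"],
    ["potential ADE", "medication error", "medication error"],
    ["ADE", "potential ADE", "medication error"],
]
overall, by_category = fleiss_kappa(
    events, ["ADE", "potential ADE", "medication error"])
print(f"overall kappa = {overall:.2f}")
for name, value in by_category.items():
    print(f"  {name}: {value:.2f}")
```

In this invented data set the reviewers never agree on which events are potential ADEs, so that category has the lowest kappa even though overall agreement is fair, a pattern similar in kind to the one reported above.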

The level of agreement between the three reviewers for seriousness is shown in table 3. Agreement was not much better than chance for seriousness categories of “potential death” (D) (there were no fatalities documented during the study period) and “intervention to prevent permanent impairment” (O). The strength of agreement was “moderate” for not serious events and also “moderate” when the seriousness categories were collapsed into serious versus not serious events.

Table 3

 Inter-rater reliability for all three reviewers for seriousness

The level of agreement between the three reviewers for preventability is shown in table 4. For preventability decision (yes/no), overall agreement was “fair” but “moderate” for not preventable events. For the preventability score and when scores were collapsed into three categories, overall agreement was again “fair” for preventability of events.

Table 4

 Inter-rater reliability for all three reviewers for preventability

Level of agreement between reviewer pairs

The levels of agreement between reviewer pairs for event type, seriousness, and preventability are shown in table 5. For event type, the best agreement occurred between reviewers 1 and 3 where the level of agreement was found to be “moderate”. Only “fair” agreement occurred between the other two reviewer pairs. The intra-rater reliability for each reviewer for a repeat categorisation of event type (12 months apart) of 100 randomly selected events is shown in table 5. Each of the three reviewers was found to be consistent.

Table 5

 Level of agreement between reviewer pairs

The classification of events into the seven categories of seriousness demonstrated only “fair” agreement between each of the reviewer pairs. The maximum value of κ for reviewers 1 and 2 was 0.63 because the reviewers judged the seriousness in very different ways. For example, the second reviewer described 55 (7.9%) events as seriousness category O, whereas reviewer 1 described three (0.43%) events as seriousness category O. In addition, it appears that reviewer 3 was more likely to classify potential ADEs as more serious events than the other reviewers (table 3). By collapsing the categories down into two (serious versus not serious events), agreement between the reviewer pairs 2 and 3 and between 1 and 2 improved to “moderate” agreement. The κ/κmax ratio also improved for these reviewer pairs demonstrating “substantial” agreement for seriousness of events. There was only “fair” agreement between reviewers 1 and 3 when considering κ values and the κ/κmax ratio.

For the judgements made by reviewer pairs regarding the yes or no decision as to whether or not an event was preventable, the best agreement was found between reviewers 1 and 3 with κ = 0.50 and κ/κmax = 0.51, which is regarded as “moderate” agreement. Only “fair” agreement was found for the other two reviewer pairs. Similar findings were found for the preventability scores and when preventability was collapsed into three categories.
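As a rough worked check on the figures quoted above for reviewers 1 and 3, the implied maximum attainable kappa can be recovered from the reported kappa and ratio. Both inputs are rounded to two decimal places, so the result is approximate.

```latex
\kappa_{\max} \approx \frac{\kappa}{\kappa/\kappa_{\max}}
             = \frac{0.50}{0.51}
             \approx 0.98
```

A value this close to 1 suggests that the marginal totals of these two reviewers were nearly balanced for the yes or no preventability decision, so the ratio changes the interpretation very little in this case.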


Discussion

The level of agreement between all three reviewers was found to be “substantial” for judgements regarding whether or not an event was an ADE (patient injury related to a medication). However, for classification into the other event types, the level of agreement was lower, especially for potential ADEs where agreement was only “slight”. Moderate agreement was achieved when the seriousness categories were collapsed into serious v not serious events. The degree of preventability appeared a more difficult judgement, with only “fair” agreement found between the three reviewers. Despite classification guidelines and prior discussion between the reviewers, there appeared to be some marked differences in interpretation between them. The judgements of each reviewer regarding event categorisation were, however, found to be consistent over time.

There are a limited number of paediatric studies of ADEs and medication errors that report reliability data for event classification by type of event, seriousness, or degree of preventability.3,13–15 Kaushal et al3 reported 87–100% agreement, κ = 0.65–1.0, but actual values specific to event type, seriousness, and preventability were not stated. In each of the studies by King et al14 and Potts et al,15 inter-rater reliability is reported for event type but not for seriousness or preventability. In the remaining study, Kozer and colleagues13 found substantial agreement between two paediatric emergency physicians for whether an error occurred (κ = 0.79) and for a three category ranking of severity of events (κ = 0.70).

Event type

In the present study, although we found “substantial” agreement between all three reviewers for the presence of an ADE (κ = 0.73), there was only “fair” to “moderate” agreement for classification of the other event types. It is evident that the reviewers classified event types very differently (table 2); in particular, reviewer 1 classified very few events as potential ADEs compared with reviewers 2 and 3. This led to only a “fair” level of agreement for event type overall between the three reviewers (κ = 0.40).

For reviewer pairs, the present study showed best agreement between reviewers 1 and 3 (κ = 0.51) for event type overall. Previous paediatric studies reported higher levels of agreement between two reviewers for event type. King et al14 found “substantial” agreement (κ = 0.64, 95% CI 0.45 to 0.82) for 20 randomly selected incident reports from paediatric inpatients at a tertiary care paediatric hospital when independently rated by two physicians. Potts et al15 reported a κ value of 0.96, indicating “almost perfect” agreement between a clinical pharmacist and physician when a 10% random sample of patients from a paediatric critical care unit was reviewed. Unfortunately, these reports do not provide a breakdown of levels of agreement for the different event types, so it is difficult to compare our study findings any further with other paediatric inpatient reliability data.

However, the finding for the presence of an ADE (κ = 0.73) in the present study is consistent with the adult literature. For ADE v potential ADE or problem order, Bates and colleagues reported “almost perfect” agreement in two studies (κ = 0.83 and κ = 0.98),2,16 and “substantial” agreement (κ = 0.68) for classification as medication error, rule violation or neither.17 For adult inpatients at community based nursing homes in the United States, Gurwitz18 reported “substantial” agreement (κ = 0.80) between two independent physicians for the presence of an ADE. It appears that judgements regarding classification of events as ADEs are more reliable than classification of other event types. This would seem reasonable as ADE classification is based on objective evidence of actual patient injury, whereas classification for other event types is subjective and based on reviewer opinion regarding potential for patient harm (potential ADEs) and whether the cause of the event was due to error (medication error) or violation of a rule or guideline (rule violation).


Seriousness

In the present study, when reviewers classified events for seriousness into one of the seven categories, the level of agreement was only “fair” between all three reviewers and for the reviewer pairs. When collapsed into two seriousness categories (serious and not serious), “moderate” agreement between the three reviewers was achieved. Many of the published paediatric studies of ADEs have included some assessment of seriousness of events using a variety of different rating scales. However, only two studies appear to have evaluated and published inter-rater reliability data regarding the severity or seriousness scale being used. Both studies report “substantial” agreement between two independent physician reviewers.3,13 The lower level of agreement in the present study may in part be due to differences in the rating scales used (seven categories in the present study compared with 3–4 point scales in the previous paediatric reports), but may also be attributed to bias among reviewers in the present study. The very different frequencies (table 3) show that the reviewers classified seriousness of events in very different ways.

In adult inpatient studies, three to four category scales of seriousness have been evaluated for reliability, producing mixed results. Using a three point scale and classification by two independent physicians, Bates16 reported a κ value of 0.89 for life threatening v serious or significant and a κ value of 0.63 for significant v serious or life threatening. In a later study, Bates and colleagues,2 using a four point scale adapted from Folli et al,19 found (as we did) low κ values despite a high percentage agreement. The actual findings were κ = 0.37 (85% agreement) for life threatening v serious or significant and κ = 0.32 (66% agreement) for significant v serious or life threatening. Again using the same four point Folli scale19 but subsequently collapsing it into severe v not severe events, Gurwitz et al18 reported “substantial” agreement (κ = 0.62).
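The combination of a low κ value with a high percentage agreement typically arises when one category dominates, because most of the observed agreement is then expected by chance. The following Python sketch illustrates the effect with invented counts for a two category (serious v not serious) judgement; the numbers are hypothetical and are not drawn from any of the cited studies.

```python
def agreement_and_kappa(both_yes, a_only, b_only, both_no):
    """Percentage agreement and kappa for a 2 x 2 table of two reviewers.

    both_yes / both_no: events on which the reviewers agreed.
    a_only / b_only: events only reviewer A (or only B) called "serious".
    """
    n = both_yes + a_only + b_only + both_no
    observed = (both_yes + both_no) / n
    a_yes = (both_yes + a_only) / n
    b_yes = (both_yes + b_only) / n
    expected = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)
    return observed, (observed - expected) / (1 - expected)


# Skewed case: serious events are rare (hypothetical counts).
obs_skew, k_skew = agreement_and_kappa(both_yes=2, a_only=4, b_only=4, both_no=90)
# Balanced case: same percentage agreement, categories equally common.
obs_bal, k_bal = agreement_and_kappa(both_yes=46, a_only=4, b_only=4, both_no=46)

print(f"skewed:   agreement = {obs_skew:.0%}, kappa = {k_skew:.2f}")
print(f"balanced: agreement = {obs_bal:.0%}, kappa = {k_bal:.2f}")
```

Both tables show the same percentage agreement, but in the skewed table nearly all of it is agreement on the common "not serious" category, so the chance corrected value is much lower.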


Preventability

In the present study the low levels of agreement regarding preventability indicate that the reviewers had difficulty determining whether an error was associated with an event. It may be that such judgements are difficult in the paediatric setting due to unlicensed use of medicines in children20–25 and the resulting lack of standardised paediatric clinical practice guidelines. It is believed that judgements regarding appropriateness of care are strongly influenced by perceived outcomes and that practice guidelines help reviewers to make assessments by clarifying the accepted standard of care.26 The appropriate standard of care may therefore have been unclear to our reviewers, making preventability judgements difficult.

Few paediatric inpatient studies have reviewed events for degree of preventability and, unfortunately, those that have also used the Schumock and Thornton11 assessment criteria27–29 have not reported inter-rater reliability data for preventability judgements. Kaushal et al,3 using a five point scale collapsed into preventable v not preventable events, reported the level of agreement for preventability to be within the range κ = 0.65–1.0. In the present study, rule violations—being very trivial events—were considered separately within the “not applicable” category. It is not clear whether rule violations were considered as part of the “preventable” event group by Kaushal et al3 but, if so, this may account for a higher level of agreement than the present study findings.

In adult inpatient studies, using a four point scale proposed by Dubois12 and collapsed into preventable v not preventable events, “substantial” to “almost perfect” agreement has been found between two independent physician reviewers. In two separate studies of hospitalised adults, Bates and colleagues have reported κ values for preventability of 0.71 (ref 16) and 0.92 (ref 2). For adult inpatients in community based nursing homes, Gurwitz18 also found “substantial” agreement (κ = 0.73) for preventability.

The lower levels of agreement in the present study probably reflect bias among reviewers, but may also be attributed to different event types being included in the “preventable” grouping.


The present study is limited because the data come from hospital records of paediatric admissions at one academic institution and represent the agreements from three reviewers, so they may not be generalisable to other geographical locations or other reviewers. Also, our study only investigated one implicit review instrument. Reordering, rewording, or restructuring the subcategories of our review form could produce better degrees of reliability.

Implications for future research

Our findings have several implications for the design of future research studies involving medication related event classification. Firstly, in research studies involving classification of events, independent review should be undertaken by at least two reviewers so that reliability of judgements may be determined. Although in the present study there seemed to be some differences in interpretation between reviewers despite classification guidelines, structured review criteria, and early joint review of “test” cases, such strategies to identify any differences in interpretation before the start of the study are essential. Secondly, for assessment of seriousness, reviewer judgements could be streamlined by direct classification of events as serious v not serious events, but this may not be as useful clinically. Thirdly, in research studies where event classification is undertaken, inter- and intra-rater reliability data should be reported in sufficient detail to allow the reader to assess the reproducibility of the classification method used. Finally, where marginal totals are markedly different, inclusion of the κ:κmax ratio is useful as this may account for lower than expected levels of agreement.

Key messages

  • The reliability of reviewer ratings for medication related event classification was “substantial” for the presence of an ADE, “moderate” for seriousness of the event, but only “fair” for the more complex judgement regarding preventability of events.

  • Trained reviewers can reliably assess paediatric inpatient medication related events for the presence of an ADE and for seriousness.

  • Assessments of preventability appeared to be a more difficult judgement in children and approaches that improve reliability would be useful.



  • This research was supported by a Fellowship awarded to Desireé Kunac by the Child Health Research Foundation of New Zealand.

  • Competing interests: none.

  • Ethical approval for this study was granted by the Otago Ethics Committee.