Assessing quality of care from hospital case notes: comparison of reliability of two methods
  1. A Hutchinson1,
  2. J E Coster1,
  3. K L Cooper1,
  4. A McIntosh1,
  5. S J Walters2,
  6. P A Bath3,
  7. M Pearson4,
  8. K Rantell2,
  9. M J Campbell2,
  10. J Nicholl2,
  11. P Irwin4
  1. 1Section of Public Health, ScHARR, University of Sheffield, Sheffield, UK
  2. 2Section of Health Services Research, ScHARR, University of Sheffield, Sheffield, UK
  3. 3Department of Information Studies, University of Sheffield, Sheffield, UK
  4. 4Clinical Effectiveness and Evaluation Unit, Royal College of Physicians, London, UK
  1. Correspondence to Allen Hutchinson, Section of Public Health, ScHARR, Regent Court, 30 Regent Street, Sheffield S1 4DA, UK; allen.hutchinson{at}


Objectives To determine which of two methods of case note review provides the more useful and reliable information for reviewing quality of care.

Design Retrospective, multiple reviews of 692 case notes were undertaken using both holistic (implicit) and criterion-based (explicit) review methods. Quality measures were evidence-based review criteria and a quality of care rating scale.

Setting Nine randomly selected acute hospitals in England.

Participants Sixteen doctors, 14 other clinical staff (11 specialist nurses and three clinically trained audit staff) and eight non-clinical audit staff.

Analysis Methods Intrarater consistency, inter-rater reliability between pairs of staff using intraclass correlation coefficients (ICCs), completeness of criterion data capture and between-staff group comparison.

Results A total of 1473 holistic reviews and 1389 criterion-based reviews were undertaken. When reviewers of the same staff type reviewed the same record, holistic scale score inter-rater reliability was moderate within each of the three groups (ICC 0.46 to 0.52). Inter-rater reliability for criterion-based scores was moderate to good (ICC 0.61 to 0.88).

Comparison of holistic review score and criterion-based score of case notes reviewed by doctors and by non-clinical audit staff showed a reasonable level of agreement between the two methods.

Conclusions Using a holistic approach to review case notes, reviewers can achieve reasonable repeatability within their own professional group. When the same clinical record was reviewed twice by the doctors, and by the non-clinical audit staff, using both holistic and criterion-based methods, there were close similarities between the quality of care scores generated by the two methods. When using retrospective review of case notes to examine quality of care, a clear view is required of the purpose and the expected outputs of the project.

  • Medical records
  • quality of care
  • reliability
  • implicit review
  • explicit review
  • healthcare quality
  • criterion-based review
  • holistic review

Quality of care is assessed from clinical records using two principal approaches: holistic (implicit) and criterion-based (explicit) review. Both approaches have strengths and weaknesses, whether they are used for performance monitoring, assessment or research.

Clinical staff are accustomed to reviewing patient records to judge the quality of care. This holistic approach uses professional judgement and requires no prior assumptions about the individual case, can be applied to any condition, can extend to examining any aspect of care and may be relatively quick. However, the standards against which quality is judged holistically are implicit, being dependent on the reviewer's personal knowledge and perspective, and therefore subjective.

Semistructured holistic review methods have therefore been developed to determine standards of hospital, outpatient and nursing care.1–3 These methods ask specific questions about phases and aspects of care and may use scales to rate quality.

Nevertheless, despite attempts to reduce levels of subjectivity in holistic review by providing extensive training for physician reviewers, concerns remain about review methods based principally on professional judgement. There are concerns about inter-rater reliability,4 choice of methods of assessing reliability,5 consistency,6 bias towards harshness or leniency,7 hindsight bias8 and reviewer idiosyncrasy.9 Moreover, lower levels of inter-rater reliability have been found for holistic review than for criterion-based review.9 Criterion-based review has therefore been proposed as a more effective means of assessing quality.10 ,11

Criterion-based review allows comparison of care against explicit standards (eg, from national clinical guidelines), where unambiguous questions are defined to construct variables with good reproducibility, for retrieval from case notes. Clinical audit in the UK has adopted these objective, criterion-based, approaches,12–14 using explicit standards, independent of profession. These have been used to identify substantial variations in organisation and clinical care between hospitals.12

However, criterion-based review has been criticised as being insensitive15 and may not identify unexpected factors influencing outcomes of care.16 Mixed methods are an alternative,17 ,18 whereby nurses use criterion-based review to identify a subset of problematic cases for subsequent holistic review by doctors; however, prior selection may lead to hindsight bias among the physician reviewers who may judge selected cases more harshly.8 ,11 Moreover, nurses and doctors may use different information to judge care quality (and may make different judgements about an individual case).18

It is therefore not clear which review method provides more reliable and useful information, or how reliable and reproducible the different methods are when carried out by different healthcare professionals. Our study compares the results of three different professional groupings when evaluating quality of care from the same set of case notes, using both holistic review with quality of care rating scales and criterion-based review.


Setting and reviewer professional background

Data were collected from nine acute hospitals in England, selected randomly from 136 that met high patient-throughput criteria for the two study conditions. In each hospital, staff were recruited to undertake reviews of cases of an admission for an exacerbation of either chronic obstructive pulmonary disease (COPD) or heart failure. Three staff types were recruited: 16 doctors in specialist training, 14 other clinical staff (11 of whom were nurses specialising in the review condition) and 8 non-clinical audit staff.

Training and data collection

Reviewers received a 1-day joint training session on holistic and criterion-based review methods. Clinical scenarios were used and reviewers were provided with copies of national clinical guidelines for COPD and heart failure care.19 ,20 Data collection software was demonstrated. Reviewers evaluated the records within their own hospital, similar to local clinical audit, and no patient-identifiable data were used in the analysis.

Review methods

Different combinations of reviewers from the three staff types were used at each hospital to compare their effectiveness in carrying out holistic and criterion-based case note reviews. In each hospital, case notes of 50 consecutive admissions of COPD or heart failure were sought and reviewed by staff type combinations of one to four staff (figure 1).

Figure 1

Overview of selection and review process.

Each reviewer evaluated care on the same case notes using both review methods, holistic and then criterion-based review, holistic being used first to reduce potential hindsight bias8 ,11 caused by finding a low criterion-based score first. Using their own implicit standards, reviewers rated the reported quality of care provided to each patient for three structured phases of care (admission/investigations, initial management and predischarge care). Each phase was rated on a 1 to 6 scale (1=unsatisfactory, 6=very best care). Overall quality of care for each review was rated on a 1 to 10 scale (1=unsatisfactory, 10=very best care).

Reviewers then undertook a criterion-based review on the same case notes. Criteria development used established methods for constructing explicit evidence-based review criteria4 ,12 ,13 ,21 (COPD, n=37; heart failure, n=33) derived from national clinical guidelines recommendations and expert opinion.19 ,20

Criterion-based data were used to assess each reviewer's effectiveness at abstracting data from clinical records and completing the data collection form; an “effectiveness of reviewer” score was calculated and converted to a percentage for each record review (one point per data field completed by the reviewer; one point subtracted per data field left blank). Quality of care scores were calculated for each record, comprising the percentage of the criteria identified by the reviewer as having been met.
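The two criterion-based scores described above can be sketched in code. This is a minimal illustration and not the study's data collection software: the function names, the encoding of each data field as True (criterion met), False (not met) or None (left blank), and the use of all criteria as the denominator are assumptions made for the example. The effectiveness score is computed here as the percentage of fields completed, which is only one possible reading of the point-based rule.

```python
from typing import Optional, Sequence


def effectiveness_score(fields: Sequence[Optional[bool]]) -> float:
    """Percentage of data fields the reviewer completed.

    Assumed encoding: True = criterion met, False = not met,
    None = field left blank by the reviewer.
    """
    if not fields:
        raise ValueError("at least one data field is required")
    completed = sum(1 for f in fields if f is not None)
    return 100.0 * completed / len(fields)


def quality_of_care_score(fields: Sequence[Optional[bool]]) -> float:
    """Percentage of criteria the reviewer recorded as met.

    Assumption: the denominator is every criterion on the form,
    including fields left blank.
    """
    if not fields:
        raise ValueError("at least one data field is required")
    met = sum(1 for f in fields if f is True)
    return 100.0 * met / len(fields)


# Hypothetical COPD review: 37 criteria, 30 met, 5 not met, 2 left blank.
review = [True] * 30 + [False] * 5 + [None] * 2
print(round(effectiveness_score(review), 1))
print(round(quality_of_care_score(review), 1))
```

A different denominator (for example, only the fields completed) would give a higher quality of care score for the same record, so the choice should be stated whenever scores are compared across reviewers.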

Analysis methods

Holistic review

Intrarater and inter-rater reliability was calculated for holistic quality of care scores. Robust standard errors (STATA V.9, College Station, Texas, USA)22 were used to allow for clustering of scores around each reviewer when calculating confidence intervals and p values for the mean overall scores by reviewer type.

Intrarater consistency for each review was assessed by calculating Pearson's correlation coefficient between the mean rating of the three phases of care and the rating of overall care.
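The intrarater check can be illustrated as follows. This is a sketch under stated assumptions rather than the analysis code used in the study (which was run in STATA): the function names and the example ratings are hypothetical, and the correlation is written out in plain Python so that the calculation is visible.

```python
from math import sqrt
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's correlation coefficient between two equal-length samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two samples of equal length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / sqrt(sxx * syy)


def intrarater_consistency(phase_ratings, overall_ratings) -> float:
    """Correlate the mean of the three phase ratings (1 to 6 scale)
    with the overall rating (1 to 10 scale) across a reviewer's records."""
    phase_means = [sum(p) / len(p) for p in phase_ratings]
    return pearson_r(phase_means, overall_ratings)


# Hypothetical ratings for four records reviewed by one reviewer.
phases = [(4, 5, 4), (2, 3, 2), (6, 5, 6), (3, 3, 4)]
overall = [7, 4, 9, 5]
print(round(intrarater_consistency(phases, overall), 3))
```

Because Pearson's coefficient is unaffected by linear rescaling, the different ranges of the phase scale and the overall scale do not need to be aligned before the correlation is calculated.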

To assess inter-rater reliability between ratings of the same record by different reviewers, raw ratings were converted to ranks to adjust for variation in the range of scores used by different reviewers. Intraclass correlation coefficients (ICCs) were calculated on these ranks.4 ,23 The ICC is the correlation between two measurements or quality of care ratings in the same patient, using randomly chosen reviewers.
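The rank-then-ICC step can be sketched as below. Several details are assumptions made for illustration, because the text does not specify them: ratings are ranked within each reviewer, tied ratings receive the average of their ranks, and the ICC is the one-way random effects form, ICC = (MSB − MSW) / (MSB + (k − 1) MSW), where MSB and MSW are the between-record and within-record mean squares and k is the number of reviewers per record. The function names are hypothetical.

```python
from typing import List, Sequence


def midranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # positions i..j share this rank
        for pos in range(i, j + 1):
            ranks[order[pos]] = average_rank
        i = j + 1
    return ranks


def icc_oneway(scores: Sequence[Sequence[float]]) -> float:
    """One-way random effects ICC; scores[r] holds the k scores for record r."""
    n, k = len(scores), len(scores[0])
    if n < 2 or k < 2:
        raise ValueError("need at least 2 records and 2 reviewers per record")
    grand_mean = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    msb = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    msw = sum((x - m) ** 2 for row, m in zip(scores, row_means) for x in row) / (n * (k - 1))
    denominator = msb + (k - 1) * msw
    if denominator == 0:
        raise ValueError("ICC is undefined when all scores are identical")
    return (msb - msw) / denominator


def icc_on_ranks(ratings_by_reviewer: Sequence[Sequence[float]]) -> float:
    """Rank each reviewer's ratings, then compute the ICC across records."""
    ranked = [midranks(r) for r in ratings_by_reviewer]
    per_record = [list(scores) for scores in zip(*ranked)]
    return icc_oneway(per_record)


# Hypothetical pair of reviewers who use different parts of the 1 to 10 scale
# but order the four records identically.
print(icc_on_ranks([[2, 5, 7, 9], [4, 6, 8, 10]]))
```

In this example the two reviewers never give the same raw rating to a record, yet the ICC on ranks is 1 because they agree on the ordering of the records, which is the adjustment for scale use that ranking is intended to make.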

Criterion-based review

Mean criterion-based quality of care scores were compared across the three staff types using a one-way analysis of variance, taking account of clustering by staff type.

Inter-rater reliability for overall quality of care scores by pairs/triplets of staff reviewing the same records was estimated using ICCs. Pooled ICC estimates from the different combinations of reviewers used a weighting that was inversely proportional to the variance of the estimate.23
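Inverse-variance weighting of the ICC estimates can be illustrated as follows. This is a minimal sketch: the function name and the input values are hypothetical, and the variance of the pooled estimate is taken to be the reciprocal of the summed weights, which is the usual fixed-effect formula and is an assumption here rather than something stated in the text.

```python
from typing import Sequence, Tuple


def pooled_estimate(estimates: Sequence[float],
                    variances: Sequence[float]) -> Tuple[float, float]:
    """Pool estimates with weights inversely proportional to their variances.

    Returns the pooled estimate and its variance (assumed to be 1 / sum of weights).
    """
    if len(estimates) != len(variances) or not estimates:
        raise ValueError("need one variance per estimate")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    return pooled, 1.0 / total_weight


# Hypothetical ICCs from two reviewer pairs; the first is estimated more
# precisely and therefore carries three times the weight of the second.
pooled, variance = pooled_estimate([0.4, 0.8], [0.01, 0.03])
print(round(pooled, 3), round(variance, 4))
```

The effect of the weighting is that reviewer combinations with few shared records, and so imprecise ICC estimates, contribute less to the pooled value than combinations with many shared records.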

Inter-rater reliability results for the two review methods were compared.


Across nine acute hospitals, 38 reviewers undertook 1473 holistic reviews and 1389 criterion-based reviews (total=692 case notes). The number of case notes reviewed by each individual ranged from 9 to 50 (see electronic table E1). This variation was due to the effect of job rotations, workload pressures and difficulties in obtaining clinical records.

Intrarater consistency in holistic reviews

For all three staff types (table 1), there were statistically significant correlations (r>0.71, p<0.001) between the mean scale score ratings that reviewers assigned to the individual phases of care and their rating of the overall quality of care for the same set of case notes, indicating a fair to good level of intrarater consistency in rating the quality of care using holistic review scale scores.

Table 1

Intrarater consistency between holistic scale score ratings for phases of care and for overall care

Criterion-based reviewer effectiveness

Effectiveness in capturing criterion-based data was high and similar across all staff types (table 2), with mean scores approximately 95% (approximately 1.5 data items missing per review).

Table 2

Criterion-based reviewer effectiveness scores

Inter-rater reliability for holistic review

Holistic review reliability between scale score ratings of the same record by pairs of reviewers was moderate within all three staff types, although it varied between reviewer pairs and was sometimes very poor (table 3).

Table 3

Inter-rater reliability between holistic overall ratings of the same record by paired reviewers of different staff types

The overall weighted mean ICC was moderate across all three reviewer types, with overlapping 95% confidence intervals (CIs) indicating no significant differences between staff types.

Comparisons between professional groups

Where reviewers from different staff types used holistic scale score methods to review the same record, inter-rater reliability was assessed between staff groups for all phases of care and overall care (table 4). For the holistic phase of care findings within staff groups, there was generally modest to fair agreement within pairs, particularly among doctors, although the range is large even among them (eg, initial management results). However, where staff from different groups reviewed the same record, agreement between the different professional groups on their assessment of the quality of care was poor to non-existent.

Table 4

Within-staff-type ICC and between-staff-type group ICC comparisons of holistic scale score reliability for phases of care and overall score

Analysis of variance between the holistic overall scale ratings of the three staff types shows that the nurse/other clinical group scores were significantly lower than those of the doctor (p<0.001) and non-clinical audit groups (p<0.001). The comparison of the latter two groups showed no significant difference (p=0.352).

Inter-rater reliability for criterion-based review

Inter-rater reliability between criterion-based scores (ie, the percentage of criteria recorded as being met) for the same record by different reviewers ranged from moderate to good within all staff types. Doctors showed a significantly higher level of reliability (table 5).

Table 5

Inter-rater reliability between criterion-based scores (proportion of criteria stated as being met) for the same record by different reviewers

Comparison of holistic and criterion-based methods

Inter-rater reliability results for the two review methods were compared. In addition, an estimate of the within-staff-type consistency across both review methods was calculated using the p value for the difference between the overall holistic quality of care ratings (expressed as a percentage) and the percentage of criteria recorded as being met.
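The paired comparison of the two methods on the same record can be sketched as below. The details are assumptions made for illustration: the 1 to 10 holistic rating is converted to a percentage by multiplying by 10, the difference is taken as holistic minus criterion-based, and an ordinary paired t statistic is computed without the adjustment for clustering by reviewer described in the analysis methods. The function name and data are hypothetical.

```python
from math import sqrt
from typing import Sequence, Tuple


def paired_difference(holistic_ratings: Sequence[float],
                      criterion_pct: Sequence[float]) -> Tuple[float, float]:
    """Mean paired difference (holistic % minus criterion %) and its t statistic.

    Assumption: a holistic rating on the 1 to 10 scale maps to rating * 10 percent.
    """
    if len(holistic_ratings) != len(criterion_pct) or len(holistic_ratings) < 2:
        raise ValueError("need at least two paired reviews")
    holistic_pct = [10.0 * r for r in holistic_ratings]
    diffs = [h - c for h, c in zip(holistic_pct, criterion_pct)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    variance = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    if variance == 0:
        raise ValueError("t statistic is undefined when all differences are equal")
    t_statistic = mean_diff / sqrt(variance / n)
    return mean_diff, t_statistic


# Hypothetical reviews of three records by one reviewer using both methods.
mean_diff, t_stat = paired_difference([7, 8, 6], [60.0, 75.0, 55.0])
print(round(mean_diff, 2), round(t_stat, 2))
```

A p value would be obtained by referring the t statistic to a t distribution with n − 1 degrees of freedom; that step is omitted here to keep the sketch free of external libraries.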

Table 6 shows that the mean overall “quality of care” scores across the 692 patient records were similar for holistic and criterion-based methods and for all three staff types (70% to 79%, where 100%=excellent care).

Table 6

Mean ratings/scores of overall quality of care: comparison of two review methods

Estimation of the level of quality of care score agreement between the two methods for an individual record, using p value for difference, shows that there was no significant difference between the holistic and criterion-based assessments when undertaken by the doctors (mean difference −1.9, p value for difference 0.406) and by the non-clinical audit staff (mean difference 3.1, p value for difference 0.223).

A non-significant p value for difference indicates that there is some association between the scores derived from the two review methods. These results suggest that for the doctors and the non-clinical audit staff the two methods are giving, on average, a somewhat similar result. The pooled results for all staff showed a small mean difference (−2.6) that bordered on statistical significance, possibly influenced by the highly significant results from the nurse/other clinical group (39% of all of the reviews).


Retrospective assessment of the quality and safety of care can be performed from the clinical record using holistic or criterion-based review methods: both have methodological constraints. Studies mostly compare different professional groups using different methods. Thus, Weingart et al18 compared explicit (criterion-based) review undertaken by nurses with implicit review of the same record undertaken by physicians, and found that “nurse and physician reviewers often came to substantially different conclusions”. This is the first UK study to contrast the two methods of review systematically and also across three different professional groups.

We investigated the level of agreement between healthcare professionals, from different backgrounds, when they review the same record. This agreement, or reliability, relates to the repeatability of the results from the review—whether a different reviewer would come to the same conclusion about the quality of care from the same data source, using the same method. This is clearly a practical question for those reviewing quality of care in clinical audit or performance review.

Reviewers undertaking holistic review, using scale scores, were relatively consistent in the scores allocated to care quality across the individual phases of care and overall for the entire episode of care. All three staff groups had moderate within-group inter-rater reliability, ranging from 0.46 (95% CI 0.34 to 0.59) to 0.52 (95% CI 0.41 to 0.62), with the doctor reviewers faring best. These were rather higher values than the average found in a systematic review by Lilford et al,5 in which implicit structured case note review studies concerned with causality and process of care had mean κ values <0.4 (causality; κ 0.39 (SD 0.19), process; κ 0.35 (SD 0.19)). Our study results are also somewhat similar to those of Hofer et al4 who used ICCs to examine repeatability and found a reliability of 0.46 for a structured holistic review of diabetes and heart failure case notes by physician reviewers (although only 0.26 for case notes of patients with COPD). By comparison, a recent holistic assessment of patients dying in UK hospitals achieved a κ score of 0.39 on the key indicator of quality of medical care.24

Criterion-based review demonstrated that all reviewers could identify relevant data (the effectiveness of reviewer scores were around 95%). There were moderate (0.61 for non-clinical audit staff) to quite high levels of inter-rater reliability (clinical staff 0.74, doctors 0.88)—similar to those found in large UK national clinical audit programmes of stroke25 ,26 and continence,27 and reflecting the trend to higher values for explicit reviews found in other studies.5 Our study confirms the findings of the UK stroke care audit,25 ,26 that criterion-based record review can be undertaken by staff from different backgrounds.

Case note review can only consider what has been recorded, and incomplete records do not mean that an event did not occur. If a practitioner considered something too trivial to record, then it is doubtful that any consequential actions would have occurred. However, some significant events will remain unrecorded and thus unreviewed. Direct observation of care delivery overcomes the problem of missing information, and is an alternative approach,17 although too expensive as a standard procedure. Hindsight bias in case note review is an acknowledged challenge.28 We tried to minimise any effect by undertaking holistic review before criterion review.

The overall results of care quality assessment were similar with both methods in our review, and all staff types rated care quality reasonably highly (between 70% and 79%, where 100% represents excellent care). But the weak inter-group reliability for holistic scores has implications when choosing how to evaluate care quality from case notes. Used as a screening tool, criterion-based review produces sufficient information to judge the overall quality of care, provided that appropriate review criteria are chosen. A structured form of holistic review also gives a reliable picture of the quality of care in the right hands, and can also pick up extra nuances of quality variation.

Our medical reviewers were relatively inexperienced but, with audit training, were able to use both criterion-based and holistic review effectively. It would be interesting to explore whether senior clinicians' greater clinical experience would produce different holistic assessments. We hypothesise that it would be useful to explore further the expertise of specialist nurses in holistic review because they have particular skills in helping patients with adherence to care pathways.

So which method of review would be best used for clinical audit and performance review, and by which professional groups? All three professional groups performed well when using criterion-based review, so the decision on who should undertake reviews depends mainly on cost and availability of staff.

On the other hand, the decision on who should undertake structured holistic review is more complex. The method can deliver more than just the sum of the results of collecting a set of review criteria. Although all groups can use the method of holistic scale scoring, our data suggest that, for the more technical phases of care, the three groups interpreted the same records differently despite considerable training in the review method. To some extent this probably reflects their background knowledge of clinical care delivery. It is unrealistic to expect non-clinical audit staff to fully appreciate the details of the medical care, let alone judge when care has deviated from best practice.

Although nurses are much closer to the medical care process, the limited agreement between the doctors and the nurses may reflect different internal professional standards for assessing quality and safety of care. Weingart et al18 conjectured that nurses and doctors reviewed in different ways, that nurses sought data on the routines of care while doctors looked for a wider picture and that neither group considered both dimensions. Analysis of textual commentary on quality of care available from each holistic review will throw further light on this question.


There is modest agreement between the holistic and criterion-based quality assessment scores of the same record by the same reviewer. However, for holistic review, different staff groups are implicitly using different care standards in their assessment of quality. Large-scale criterion-based audits, such as those promoted by the English Healthcare Commission,29 may miss the richer information provided by holistic review. A mixed holistic and criterion-based approach may be a solution5 and has been subsequently used in this study to investigate the relationship between care process and outcome.


Karen Beck has provided administrative support throughout the project and her contribution has been exceptionally helpful.

The research team would like to thank all of those NHS staff who so generously contributed to the successful completion of this study. We also wish to thank staff at the Royal College of Physicians Clinical Effectiveness and Evaluation Unit for their help in selecting research sites and methods development.



  • Funding DoH Methodology Research Programme c/o Department of Epidemiology and Public Health, University of Birmingham, Edgbaston, Birmingham, UK. The project was funded by the Department of Health for England Methodology Research Programme.

  • Competing interests None.

  • Ethics approval An opinion was sought from the Trent Multiple Research Ethics Committee and the project was deemed not to require formal ethical review because no identifiable individual patient data were seen by the research team.

  • Provenance and peer review Not commissioned; externally peer reviewed.