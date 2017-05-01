Relating to our third objective, we show that the low reliability has a considerable impact on conclusions about the burden of preventable deaths, at both the individual and population levels, when using the ‘more probable than not’ standard of causation. To make a judgement at the individual level about a specific case, a reliability of 0.25 for a single measurement implies a need to average 12 independent reviews to achieve a reliability of 0.8 for a decision about whether any given death was preventable. 22,23 However, for an estimate at the hospital or system level, one can average across cases as well as reviewers, and reasonably precise estimates could be made with more practical numbers of reviews per case and total cases. In addition, as seen in figures 3 and 4, analyses that do not remove the noise or reviewer differences will significantly overestimate the degree to which reviewers think deaths are preventable. 10,24
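
The figure of 12 reviews is consistent with the Spearman-Brown prophecy formula, which gives the reliability of the mean of n parallel measurements as n·r / (1 + (n − 1)·r). The sketch below is illustrative only: it assumes that the reviews are independent and exchangeable, and the function names are ours rather than part of the original analysis.

```python
import math


def spearman_brown(single_reliability: float, n_reviews: float) -> float:
    """Reliability of the average of n_reviews parallel reviews."""
    r = single_reliability
    return n_reviews * r / (1 + (n_reviews - 1) * r)


def reviews_needed(single_reliability: float, target: float, tol: float = 1e-9) -> int:
    """Smallest whole number of reviews whose average reaches `target`.

    Inverts the Spearman-Brown formula: n = t(1 - r) / (r(1 - t)).
    `tol` guards against floating-point noise pushing an exact
    integer answer up to the next whole number.
    """
    r, t = single_reliability, target
    if not (0 < r < 1 and 0 < t < 1):
        raise ValueError("reliabilities must lie strictly between 0 and 1")
    n = t * (1 - r) / (r * (1 - t))
    return max(1, math.ceil(n - tol))


if __name__ == "__main__":
    # Single-review reliability of 0.25 and a target of 0.8, as in the text.
    print(reviews_needed(0.25, 0.8))  # prints 12
    # Required reviews for the single-review reliabilities reported in this paper.
    for r in (0.22, 0.23, 0.27):
        print(r, reviews_needed(r, 0.8))
```

Because the required number of reviews rises steeply as the single-review reliability falls, small differences in the estimated reliability translate into noticeably different review burdens for case-level decisions.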

In terms of our second objective, we find that the reliabilities of the Likert and continuous scale measurements were similar to each other in both datasets (0.27 vs 0.27 for the UK study; 0.23 vs 0.22 for the US study), suggesting no particular preference for one scale or the other in terms of precision. These low estimates of reliability, at about 0.2–0.3, are also consistent with almost all prior studies using expert review to estimate preventable deaths, quality of care or preventability of adverse events. 21

In this paper, we have discussed some of the measurement characteristics of methods to judge the preventability of hospital deaths. In reference to our first objective, despite the two samples being very different in time period, country and design of the sample, the Likert scale and the continuous scale appear to behave in a similar fashion (figure 2). The reviewers appear to stop assigning the ‘uncertain preventability’ category and start assigning the ‘possibly’ and ‘probably preventable’ categories at almost exactly the point where their estimate of preventability on the continuous scale exceeds 50%. If the goal is to determine whether an average reviewer would feel that the death was more likely than not to be preventable, the observed correspondence provides support for grouping the response of ‘uncertain’ on a 5-point Likert scale along with the ‘possibly not’ and ‘probably not preventable’ responses.

The strength of our study lies in its generalisability, given that the datasets differ in time, place and clinical conditions. We were able to tease apart reviewer effects and residual errors in describing preventability across the case note sample. We carefully eliminated from the statistical analysis the few cases of ‘incoherence’, that is, the provision of logically inconsistent answers.

The study was constrained by the original datasets. The Likert scale was always completed before the continuous scale on the case note review forms, whereas, ideally, the order would be randomised to mitigate practice effects. There were (small) differences in the precise descriptions given for the Likert categories in the UK and USA.

Implications

Relating to our fourth objective, what are the implications of our findings for the design of a programme that would attempt to measure the burden of preventable deaths in a health system? Our findings suggest that a Likert scale can reasonably represent expert opinion about causality, and that a continuous scale offers no clear advantage in terms of precision. However, the low reliability calls for careful consideration of how to design the measurement procedure, so as to find the number of reviews per patient and the number of independent reviewers needed to generate the estimates that the programme is supposed to produce at the required precision. Furthermore, to allow monitoring of the reliability of measurement and of the estimates adjusted for that reliability, both reviewers and the case notes that they review must be more or less randomly distributed. This would preclude the use of reviewers only from the hospital that provides the cases, and would require standardising the selection and training of reviewers across the health system.

However, it is crucial to point out that the interpretation of these numbers elicited from physician reviewers remains open to question. We would argue strongly against the interpretation that the measurements represent an objective probability that can be used to estimate a casualty count. There is really no evidence suggesting that this scale, elicited from physician experts by either of the two measurements, is anything more than ordinal with respect to the true probability of preventing death. Counterfactual reasoning is notoriously difficult, and there is evidence that physicians are not very good at estimating absolute prognostic probabilities and that they systematically increase their estimates of poor care when there is a bad outcome. 25,26

If we can only assume that the measurement is ordinal with respect to the true probability of death, then it makes no sense to dichotomise it and consider the cases on one side of the cut-off as ‘truly’ preventable and those on the other as not. In fact, it is extraordinary that this measurement procedure, which has been used in numerous studies, has been assumed, in the absence of any supporting evidence, to represent a ratio scale of measurement with a true zero and equal intervals, which is what would be required to estimate the actual burden of preventable deaths.

There may be a better solution. The measurement properties demonstrate that physicians, with a low, but not insignificant, level of reproducibility, can distinguish between patient case notes on a scale that is elicited by asking them to estimate the probability that a death could have been averted by optimal care. Why not give up on trying, after the fact, to estimate the probability of preventing death? It is, after all, a hubristic endeavour. Rather, ask the reviewers to estimate simply how good the care was in the cases of people who have died. This would not allow health systems to count up the number of deaths attributable to poor care, as they all seem to want to do. However, with a sufficient investment in a robust measurement system, it could allow systems to track relative performance across both hospitals and time, ensuring that attention could be focused on laggards and improvement of the system overall could be tracked. In that sense, review of deaths is really just a review of case notes enriched, one may suppose, to contain a higher proportion of serious errors. Review of deaths also enables doctors to be involved in quality assurance and to detect specific ‘bear traps’ to which they and others can be alerted.