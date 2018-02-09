The years immediately following the emergence of widespread interest in patient safety1 and then healthcare quality2 saw considerable debate between pragmatically oriented improvers and research-oriented evaluators,3–6 or between ‘evangelists’ and ‘snails’, as one longtime observer characterised the two groups.7 Too often, enthusiastic improvers (‘evangelists’) relied on simple pre-post designs within a single context, leading to erroneous claims of efficacy.8 In contrast, research-oriented investigators (‘snails’) and journals pushed for ever more rigorous designs, including randomised trials, potentially at the cost of discouraging the many improvers without this training and of slowing the development and deployment of effective interventions.9 10 Many clinicians, quality improvement (QI) experts and researchers are thus caught in a quandary: how best to evaluate a candidate QI intervention? How can we balance the pragmatic needs of improvement, including the frequent need to refine the intervention or its implementation, with the demands of most traditional evaluative designs, which typically require a static intervention?

We believe this question is one of the most important issues to consider when developing a QI intervention and is often not considered carefully enough—either by snails or evangelists. Decisions about when and how to evaluate potentially promising interventions can have crucial implications for the future of the intervention and the patients it could affect.

Two recent examples of improvement interventions evaluated using traditional designs

In this issue of BMJ Quality and Safety, Swaminathan and colleagues11 present a rigorous evaluation of the Michigan Appropriateness Guide for Intravenous Catheters (MAGIC) QI intervention, intended to reduce adverse events stemming from the insertion of peripherally inserted central catheters (PICCs). PICCs have become ubiquitous as a substitute for a central intravenous line when patients need longer-term central intravenous access, but clinicians often order them unnecessarily or order inappropriate types: for example, a double-lumen PICC when a single-lumen PICC would work just as well and carry a lower risk of complications. The authors implemented MAGIC at a single intervention hospital and used data from nine contemporaneous controls drawn from a QI collaborative in the state of Michigan (all 10 sites participate in the collaborative). The MAGIC intervention included computerised decision support at the time of ordering and a much larger role for PICC nurses in regulating appropriate PICC placement. Training was also delivered to PICC nurses and ordering providers. Outcomes included rates of inappropriate PICC use and device-related adverse events. The intervention achieved a statistically significant but relatively small decrease in the rate of inappropriate PICC use at the intervention site after adjustment for measurable potential confounders (incidence rate ratio 0.86; 95% CI 0.74 to 0.99, P=0.048). Fewer adverse events occurred at the intervention hospital, but this reduction largely reflected fewer catheter occlusions. Rates of venous thrombosis and infection remained unchanged, though prior work by the authors has shown low rates for both of these complications (5.2% and 1.1%, respectively, for thrombosis and infection).12 Some might characterise these results as disappointing, bordering on a ‘negative trial’.
The authors (understandably) regard the intervention as having achieved some success and probably hope to refine the intervention further and test it in other hospitals in this collaborative. We do not seek to debate this point. Our interest here lies in discussing the tension in QI between the need to refine interventions, especially early in their development, and the desire to conduct rigorous, compelling evaluations to demonstrate their impact.

Consider a second example, in which Westbrook and colleagues evaluated a bundled intervention to reduce nurse interruptions during medication administration using a cluster randomised controlled trial (RCT).13 For every 100 medication administrations, nurses on intervention wards experienced 15 fewer non-medication-related interruptions compared with control wards. Using results from their previous work on the risk of adverse drug events with interruptions during medication administration,14 the authors themselves acknowledged that the observed reduction in interruptions would likely achieve little benefit for patients. Moreover, the nurses hated wearing the ‘do not interrupt’ vests, which constituted a core feature of the intervention.

Why these two examples? What both interventions share, in addition to their small to modest impacts, is the use of rigorous, traditional evaluation paradigms: one an interrupted time series combined with contemporaneous controls (about as rigorous a non-randomised design as possible) and the other a cluster RCT. Yet the disappointing effect sizes raise the question: did these interventions need further refinement before being subjected to rigorous evaluation? In the case of MAGIC, the authors had a reasonable idea for an intervention, but much less prior work to inform the precise ingredients or implementation. In the example of the cluster RCT of a bundled ‘do not interrupt’ intervention, substantial prior work (not just by these authors) had explored this type of intervention. Thus, the investigators did not plan modifications to the intervention or its implementation strategy. Consequently, it made sense to randomise wards to a fixed intervention or to usual care and focus on evaluating the impact. That said, it obviously occurred to Westbrook et al 13 that the nurses might not like wearing the vests, since they solicited this feedback as part of their results. Thus, they might have anticipated the need for making some changes to the intervention.

We recognise that hindsight is 20/20, and that the increasing use of controlled before-and-after studies, interrupted time series and RCTs to evaluate improvement interventions represents a welcome advance compared with the simple before-after study, which has nothing to recommend it yet remains woefully common. On the other hand, we question the degree to which these traditional designs provide the appropriate balance between rigour and the need to refine interventions, since these evaluative designs presume a ‘fixed’ or unchanging intervention. None offers an obvious way for investigators to modify the intervention in response to implementation challenges or a disappointing effect size. More adaptive versions of these traditional evaluative designs do in fact exist, and we believe they are underused.