How to attribute causality in quality improvement: lessons from epidemiology
Alan J Poots1, Julie E Reed1, Thomas Woodcock1, Derek Bell1, Don Goldmann2,3

1 National Institute for Health Research (NIHR) Collaboration for Leadership in Applied Health Research and Care (CLAHRC) North West London (NWL), Imperial College London, London, UK
2 Institute for Healthcare Improvement, Boston, Massachusetts, USA
3 Harvard TH Chan School of Public Health and Harvard Medical School, Institute for Healthcare Improvement, Boston, Massachusetts, USA

Correspondence to Dr Alan J Poots, National Institute for Health Research (NIHR) Collaboration for Leadership in Applied Health Research and Care (CLAHRC) North West London (NWL), Imperial College London, SW10 9NH, London, UK; a.poots{at}


Quality improvement and implementation (QI&I) initiatives face critical challenges in an era of evidence-based, value-driven patient care. Whether front-line staff, large organisations or government bodies design and run QI&I, there is increasing need to demonstrate impact to justify investment of time and resources in implementing and scaling up an intervention.

Decisions about sustaining, scaling up and spreading an initiative can be informed by evidence of causation and the estimated attributable effect of an intervention on observed outcomes. Achieving this in healthcare can be challenging, because interventions are often multimodal and applied in complex systems.1 Where evidence of causation is weak, the credibility of the intervention’s effectiveness is reduced, with a resultant reduced desire to replicate it. The greater our confidence in a causal relationship between QI&I interventions and observed results, the greater our confidence that improvement will result when the intervention occurs in different settings.

Guidance exists for design, conduct, evaluation and reporting of QI&I initiatives.2–4 The Standards for QUality Improvement Reporting Excellence (SQUIRE) and the Standards for Reporting Implementation Studies (STARI) guidelines were developed specifically for reporting QI&I initiatives.5 6 However, much of this guidance is targeted at larger formal evaluations, and may require levels of resource or expertise not available to all QI&I initiatives. This paper proposes that QI&I initiatives, regardless of scope and resources, can be enhanced by applying epidemiological principles, adapted from those promulgated by Austin Bradford Hill.7

Applying Bradford Hill Criteria and QI&I methods to strengthen evidence

Hill proposed nine ‘aspects of association’ that could be considered before ‘…deciding that the most likely interpretation is causation’.7 His objective was to improve the ability to form scientific judgements about causality. The nine aspects, subsequently referred to as the ‘Bradford Hill Criteria’ (BHC), are considered in the following sections. With roots in causes of disease, the BHC have natural alignment with healthcare.8 They can help make sense of causation in complex healthcare systems and, by extension, interventions to improve those systems. We posit that QI&I methods can be used to provide evidence towards meeting the criteria and to infer causality. We offer a QI&I-oriented interpretation of the BHC and match criteria with relevant QI&I methods (table 1), and in the main text refer to Michigan’s Keystone Project to show the criteria in practice.9

Table 1

The Bradford Hill Criteria, epidemiological meaning, a translation for quality improvement and implementation (QI&I) in italics and a brief description of QI&I methods that can provide evidence and advice to practitioners

Strength of association

QI&I initiatives aim to achieve meaningful impact from the perspective of the delivery system, providers and patients. The size of change is important. This can be detected using statistical process control (SPC) charts with appropriate control limits10 and measures of effect size (eg, relative risk). Hill did not consider the p-value to be as important for establishing cause, stating:

‘Such tests can, and should, remind us of the effects that the play of chance can create, and they will instruct us in the likely magnitude of those effects. Beyond that they contribute nothing to the ‘proof’ of our hypothesis’.7

Significant p-values have little meaning if changes are trivial in a clinical or systems improvement sense. The potential overemphasis on p-values per se has been highlighted.11 Yet the more speculative the postulated cause–effect relationship, the more stringent should be the design and evaluation (eg, mitigation for bias and confounding, consideration of the counterfactual and determination of effect sizes), and the strength of evidence for the intervention.
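As an illustration of reporting effect size alongside (rather than instead of) statistical testing, the following minimal sketch computes a relative risk with an approximate 95% CI using the standard log method, together with the absolute risk reduction. The counts and the function name `relative_risk` are hypothetical and are not taken from any study cited here.

```python
import math

def relative_risk(events_after, n_after, events_before, n_before, z=1.96):
    """Relative risk (after vs before) with an approximate CI via the log method."""
    risk_after = events_after / n_after
    risk_before = events_before / n_before
    rr = risk_after / risk_before
    # Standard error of log(RR) for two independent proportions
    se_log_rr = math.sqrt(
        1 / events_after - 1 / n_after + 1 / events_before - 1 / n_before
    )
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper

# Hypothetical counts: 30 events in 400 patients before, 12 in 400 after
rr, lower, upper = relative_risk(
    events_after=12, n_after=400, events_before=30, n_before=400
)
# Absolute risk reduction conveys how many patients are actually affected
arr = 30 / 400 - 12 / 400

print(f"RR = {rr:.2f} (95% CI {lower:.2f} to {upper:.2f}), ARR = {arr:.3f}")
```

Whether an effect of this size is meaningful remains a clinical and systems judgement; the calculation only quantifies it.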

Consistency of association

Confidence in the causal nature of a relationship between stimulus (eg, intervention) and response (eg, health outcomes) is increased if the association is demonstrated in multiple studies or projects in diverse contexts and conditions. To determine consistency of association, there is a need to account for context, including differentiating the hard ‘core’ from the adaptable ‘peripheral’ components.12 Programme theory and implementation plans should incorporate study and documentation of contextual factors to facilitate scale-up of interventions.

Specificity of association

Specificity is established when a presumed cause produces a specific effect, asking if the outcome is unique to the exposure. The assessment of alleged impact of a QI&I intervention on outcomes needs to consider bias, confounding, and trends unrelated to the intervention. There is a premium on having a comparator not exposed to the intervention of interest (eg, the counterfactual). If this is impractical or unethical, there should be sufficient baseline and follow-up data to support a claim of temporality. As noted by the Cochrane Effective Practice and Organisation of Care (EPOC) Group, the level of confidence regarding causality is dependent on the degree to which QI&I initiatives address bias and confounding in design, evaluation and analysis and publication.3 EPOC provides tools, such as a bias checklist, to mitigate these problems.3 A strong implementation and evaluation plan coupled with examination for specificity of changes using SPC and other analysis, in light of the planned QI&I interventions, would provide evidence for specificity.
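One common way to use a comparator that was not exposed to the intervention is a difference-in-differences estimate. This technique is not named in the text above and is offered only as a sketch; it assumes the two groups would have followed parallel trends without the intervention. All rates and names below are hypothetical.

```python
from statistics import mean

def difference_in_differences(interv_pre, interv_post, comp_pre, comp_post):
    """Change in the intervention group minus change in the comparator group."""
    change_intervention = mean(interv_post) - mean(interv_pre)
    change_comparator = mean(comp_post) - mean(comp_pre)
    return change_intervention - change_comparator

# Hypothetical monthly rates (events per 1000 patient-days)
interv_pre = [8.0, 7.6, 7.8, 8.2]
interv_post = [5.0, 4.6, 4.8, 5.2]
comp_pre = [8.1, 7.9, 8.3, 7.7]
comp_post = [7.0, 6.8, 7.2, 7.0]

did = difference_in_differences(interv_pre, interv_post, comp_pre, comp_post)
print(f"Change beyond the comparator's trend: {did:.1f} per 1000 patient-days")
```

Here the comparator also improves, so only part of the drop in the intervention group is attributed to the intervention. The estimate does not by itself address bias or confounding in how the groups were chosen.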


Temporality

Changes in outcome taking place before an intervention starts are not caused by that intervention. QI&I is well prepared in this regard; temporality is demonstrable by SPC charts, with annotations.9 When demonstrating temporality, a comparison group is desirable to ensure that changes attributed to the intervention are not due to a secular trend. Baseline data are required to detect change in the system performance. Other interventions or events occurring concurrently with the QI&I intervention should be documented, as they may confound the observed association. The anticipated lag time for the intervention to ‘kick in’ should be specified and displayed on time-series graphs.
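The sketch below shows one way baseline data and an annotated intervention date might be combined. It assumes an individuals (XmR) chart, with limits set from baseline data only, and two common special-cause rules: a point outside the limits, or a run of eight points on one side of the centre line. The monthly rates, the intervention month and the function names are hypothetical.

```python
from statistics import mean

def xmr_limits(baseline):
    """Lower limit, centre line and upper limit of an XmR chart from baseline data."""
    centre = mean(baseline)
    moving_ranges = [abs(b - a) for a, b in zip(baseline, baseline[1:])]
    # 2.66 is the conventional XmR constant (3 / 1.128)
    spread = 2.66 * mean(moving_ranges)
    return centre - spread, centre, centre + spread

def first_signal(values, lower, centre, upper, run_length=8):
    """Index of the first point outside the limits or completing a one-sided run."""
    run_side, run_count = 0, 0
    for i, value in enumerate(values):
        if value < lower or value > upper:
            return i
        side = (value > centre) - (value < centre)
        if side != 0 and side == run_side:
            run_count += 1
        else:
            run_side, run_count = side, (1 if side != 0 else 0)
        if run_count >= run_length:
            return i
    return None

# Hypothetical monthly infection rates: 12 baseline months, then 8 months after start
baseline = [7.1, 6.8, 7.4, 6.9, 7.3, 7.0, 6.7, 7.2, 7.5, 6.9, 7.1, 7.0]
post = [6.9, 6.6, 5.8, 5.5, 5.2, 5.0, 4.9, 5.1]
intervention_start = len(baseline)  # annotation: intervention begins at month index 12

lower, centre, upper = xmr_limits(baseline)
signal = first_signal(baseline + post, lower, centre, upper)
consistent_with_temporality = signal is not None and signal >= intervention_start

print(f"First signal at month {signal}; after intervention start: {consistent_with_temporality}")
```

A signal that appears before the annotated start, or long after the anticipated lag, would weaken a claim of temporality.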

Biological gradient

Often referred to as ‘dose-response’, this is the relationship between exposure to stimulus (amount of input) and outcome (degree of change in the outcome of interest). In QI&I this is relevant as one should measure the ‘dose’ of an intervention, and whether the intended ‘dose’ was delivered to intended recipients with reliability (all individuals receive an intervention) and fidelity (all components of the intervention are received by an individual). We can consider the ‘dose-response’ on individuals, for example, direct effect of a care bundle on a patient’s health, or on organisations, for example, the resources required to implement an intervention, the number of people treated, the modifications of the intervention between and within settings, and the population and health economy outcomes.

If an intervention is delivered with diminished ‘dose’, the magnitude of improvement would be expected to be reduced. Specification of the ‘dose’ is important in an implementation plan; for example, in designing a postoperative care plan for patients with hip arthroplasty, the number of nurse home visits and the intensity (eg, length of visit and quality of interaction) of the nurse’s activities could be considered ‘dose’, with speed and magnitude of recovery the ‘response’. Determination of ‘dose’ can be complex when the intervention is multimodal and interactions among the various elements are difficult to estimate. Programme theory and implementation plans should provide logic for expected gradients,13 14 demonstrable on SPC charts annotated with key changes in implementation practices based on Plan-Do-Study-Act (PDSA) cycles.10 15
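The definitions of reliability and fidelity, and the idea of an expected gradient, can be made concrete with a small sketch. The bundle components, patient records and recovery scores below are hypothetical. The monotonicity check is a deliberately crude stand-in for a formal trend analysis.

```python
from statistics import mean

# Hypothetical three-component postoperative care bundle
BUNDLE = {"home_visit", "mobility_plan", "pain_review"}

# Hypothetical delivery records: components each intended recipient received
records = {
    "p1": {"home_visit", "mobility_plan", "pain_review"},
    "p2": {"home_visit", "mobility_plan"},
    "p3": set(),
    "p4": {"home_visit", "mobility_plan", "pain_review"},
}

def reliability(records):
    """Share of intended recipients who received any part of the intervention."""
    return sum(1 for got in records.values() if got) / len(records)

def fidelity(records, bundle):
    """Among recipients, the share who received every component of the bundle."""
    recipients = [got for got in records.values() if got]
    if not recipients:
        return 0.0
    return sum(1 for got in recipients if bundle <= got) / len(recipients)

def mean_response_by_dose(pairs):
    """Mean response for each dose level, ordered by increasing dose."""
    by_dose = {}
    for dose, response in pairs:
        by_dose.setdefault(dose, []).append(response)
    return {dose: mean(vals) for dose, vals in sorted(by_dose.items())}

def has_monotone_gradient(means):
    """True if the mean response never decreases as dose increases."""
    values = list(means.values())
    return all(a <= b for a, b in zip(values, values[1:]))

# Hypothetical (nurse home visits, recovery score) pairs
pairs = [(0, 40), (0, 44), (1, 50), (1, 54), (2, 61), (2, 59), (3, 66)]
means = mean_response_by_dose(pairs)

print(reliability(records), fidelity(records, BUNDLE), means, has_monotone_gradient(means))
```

For a multimodal intervention, a single count such as number of visits will rarely capture the full ‘dose’, so a gradient found this way is supporting rather than conclusive evidence.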


Plausibility

Plausibility requires a credible rationale as to why an intervention might have a specific outcome. It does not imply certainty, and if a claim of cause and effect generates incredulity, a more robust design should be considered in the programme theory and implementation plan. The investigation of an initially implausible premise would require strong evidence for an association through an escalation of confirmatory studies, starting with proof-of-concept studies, then increasing the size and scope of studies as confidence in programme theory grows.13 For example, is it plausible that one educational seminar would improve colon cancer screening in primary care? Or is it more plausible to incorporate computer alerts, a behavioural economics ‘nudge’, patient education and feedback of screening rates from comparable primary care practices?

Formal approaches to displaying programme theory incorporate assessments of ‘plausibility’ in predicting the attributable effect on desired outcomes of implementing specific activities. Process maps can be used to understand the processes and mechanisms of care requiring alteration to change outcomes, thereby contributing towards plausibility.


Coherence

Coherence is concerned with the alternative theories we need to reject to find an idea plausible: does the observed effect conform to expectations, and can variations to those expectations be explained rationally? In QI&I practice, we should ask what other potential mechanisms would be rejected before accepting the QI&I intervention has coherence. If a claim exhibits greater congruence with existing knowledge, it should be preferred. Programme theory, drawn from existing literature, considers coherence and should highlight any alternative explanations for any observed improvements.


Experiment

This criterion asks whether deliberate alterations to a system result in changes in outcome. Numerous designs are available (eg, stepped-wedge, factorial, cluster randomised controlled trials and Bayesian adaptive cluster randomised trials),16 17 providing differing levels of confidence in attribution depending on the design’s mitigation of bias and confounding. The choice of a specific design may be dictated by practical or ethical considerations.

QI&I exploits iterative experiments: PDSA cycles test changes that improvers hypothesise are contributory to achieving the outcome of interest. The cycle of prediction, measurement, analysis and revision of the intervention based on the analysis is fundamental in QI&I. If processes of care rather than outcomes are measured (because the outcomes are rare or likely to be delayed), there should be strong evidence that targeted processes are linked to those outcomes. Iterative successful PDSAs build confidence that a theorised causal pathway is correct and that improvement in outcomes is attributable to the implemented changes.


Analogy

Analogy is related to plausibility and coherence, allowing inference to be drawn from related studies and learning from other settings. Hill used as an example:

‘In some circumstances it would be fair to judge by analogy. With the effects of thalidomide and rubella before us we would surely be ready to accept slighter but similar evidence with another drug or another viral disease in pregnancy?’ 7

Improvers could establish analogy by reviewing the literature for related initiatives and plumbing the experience of the QI&I community.

The criteria in practice: considering Michigan’s Keystone Project

Michigan’s Keystone Project sought to reduce catheter-based infection.9 The interventions around disinfection were based on strong evidence, providing plausibility. The initial project evaluation shows a strength of association: statistically significant effects (p<0.002), with relatively large and meaningful sizes (eg, median 2.7 infections per 1000 days to 0).9 There was consistency of effect with observed reductions in 103 centres. While analyses included time variables, appropriately annotated SPC could have increased confidence in temporality, and a comparative arm would have increased confidence in specificity. During the initial Keystone project, the cause and effect relationship of the intervention appeared to have coherence, and no alternative explanations were presented for the observed effects. However, in ‘Matching Michigan’ and a post hoc evaluation of Keystone,18 19 it became apparent that the explanation was not fully coherent, and that alternative explanations for the success of the intervention existed: the ‘dose’ and its delivery were more complex than initially described. This demonstrates how applying the BHC to knowledge gained over time can build or question confidence in causality: in this case, coherence for the programme theory seemed strong at first, with attempts to replicate diminishing that confidence. For the Michigan study, the causality question remains contested as the post hoc theorisation for what constitutes the intervention needs to be applied in practice to determine if it is sufficient for reproducing the desired impact.


The BHC provide an epidemiological approach to imputing causality in QI&I initiatives. These criteria are compatible with scientific improvement methods and, if properly used, QI&I methods can provide evidence towards each criterion. Pragmatic amendments to the BHC are permissible (eg, combining plausibility and coherence). Refinements to the BHC will be desirable as new scientific advances provide insights into mechanisms by which interventions influence outcomes.20 Further, the BHC should not be considered ‘the letter of the law’. As Hill stated:

‘None of my nine viewpoints [criteria] can bring indisputable evidence for or against the cause and effect hypothesis and none can be required as a sine qua non’.7

We contend these criteria, in their totality, can build confidence towards causality, and should not be a ‘checklist’ in which every element must be checked for a study to be deemed credible. Yet lack of temporality would raise a concern, and an implausible intervention would suggest the other criteria need to be addressed with rigour. Apparent conflicts between criteria should be weighed in reaching a judgement, analogous to how judgement is reached in law courts: is the evidence, in its totality, proof beyond reasonable doubt? For instance, were there weak plausibility and limited specificity, attribution may be tenuous, despite apparent statistical association. Using techniques for grading the quality of evidence, for example, Grading of Recommendations Assessment, Development, and Evaluation (GRADE),21 could help this judgement.

A causal relationship can inform a decision to scale up an intervention: the magnitude of the impact, the number of beneficiaries, the overall cost (time and resources) to the healthcare delivery system and society, and the policy environment are important considerations.

Improvers could ask if Hill were to examine their QI&I initiative, would he willingly state the results are attributable to the implemented interventions? The BHC can provide a lens through which improvers can gain ‘causal confidence’ in their initiatives. Achieving an ‘exemplary’ causal confidence would require a plausible programme theory specifying a causal pathway to improving measurable processes and outcomes. There would be a sound implementation plan incorporating real-time learning from iterative PDSA tests with bias and confounding addressed during design and evaluation. The timing of improvements in relation to implementation would be clear, with dose–response to interventions. Alternative explanations for the observed effect would be explored. Similar interventions would be successful in other settings. Contextual factors accelerating or impeding the intervention would be presented to enhance the likelihood that replication of core elements of the intervention, with adaptation to local context, would lead to improvement.

In short, improvers can leverage epidemiology and improvement science to maximise causal confidence when attributing interventions implemented to results observed.



  • Contributors AJP and DG conceived the manuscript. AJP wrote the first draft. All authors contributed to development of the idea and ongoing writing, and approved the final manuscript.

  • Funding This article presents independent research commissioned by the National Institute for Health Research (NIHR) under the Collaborations for Leadership in Applied Health Research and Care (CLAHRC) programme for North West London. The views expressed in this publication are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health.

  • Competing interests None declared.

  • Provenance and peer review Not commissioned; externally peer reviewed.
