Abstract
Quality improvement (QI) projects often employ statistical process control (SPC) charts to monitor process or outcome measures as part of ongoing feedback, to inform successive Plan-Do-Study-Act cycles and refine the intervention (formative evaluation). SPC charts can also be used to draw inferences on effectiveness and generalisability of improvement efforts (summative evaluation), but only if appropriately designed and meeting specific methodological requirements for generalisability. Inadequate design decreases the validity of results, which not only reduces the chance of publication but could also result in patient harm and wasted resources if incorrect conclusions are drawn. This paper aims to bring together much of what has been written in various tutorials, to suggest a process for using SPC in QI projects. We highlight four critical decision points that are often missed, how these are inter-related and how they affect the inferences that can be drawn regarding effectiveness of the intervention: (1) the need for a stable baseline to enable drawing inferences on effectiveness; (2) choice of outcome measures to assess effectiveness, safety and intervention fidelity; (3) design features to improve the quality of QI projects; (4) choice of SPC analysis aligned with the type of outcome, and reporting on the potential influence of other interventions or secular trends.
These decision points should be explicitly reported for readers to interpret and judge the results, and can be seen as supplementing the Standards for Quality Improvement Reporting Excellence guidelines. Thinking in advance about both formative and summative evaluation will inform more deliberate choices and strengthen the evidence produced by QI projects.
- Statistical process control
- Quality improvement methodologies
- Evaluation methodology
- Control charts, run charts
- PDSA
WHAT IS ALREADY KNOWN ON THIS TOPIC
Many tutorials have explained the advantages of statistical process control (SPC) techniques over traditional statistical testing in quality improvement (QI) projects, the basic principles, how to select and construct SPC charts and specific issues such as sampling considerations. Little has been written on how to bring these together in a process for using SPC in QI projects, highlighting critical decision points that are often missed but affect the inferences that can be drawn about the effectiveness of the intervention.
WHAT THIS STUDY ADDS
Critical decision points that should be explicitly reported for readers to interpret and judge the results include:
establishing a stable baseline;
the choice of outcome measures;
QI design features; and
SPC analysis used to draw inferences on the intervention effect.
HOW THIS STUDY MIGHT AFFECT RESEARCH, PRACTICE OR POLICY
Thinking in advance about the kind of conclusions that need to be drawn from a QI project will inform explicit choices on design, measurements and SPC analyses and thereby allow for greater generalisability of results.
The situation in practice
Many quality improvement (QI) projects start because physicians, nurses or other healthcare professionals encounter a problem in practice that they want to improve. Some common triggers include:
Outcome data showing variation or disparities in care; for example, indigenous Maya women in Guatemala, who mostly deliver at home, are more than twice as likely to die from obstetric complications as non-indigenous mothers.1
Worrying trends within a particular service or care provider organisation; for example, an increase in newly prescribed benzodiazepines and sedative-hypnotic drugs during patients’ hospitalisation, putting them at risk for long-term use and drug-related problems.2
Inefficient processes or unnecessary tests; for example, some laboratory tests commonly ordered in combination with another, adding little value to patient management at significant cost.3
Problems in sustaining improvements made previously; for example, adherence to central line maintenance bundles for every patient every time.4
Patient feedback; for example, patients reporting loss of medication information across care settings.5
A common approach is for a team of stakeholders to investigate root causes of the problem and create driver diagrams to develop a programme theory and intervention, for example, using the action effect method as a structured approach to guide this process.6 Multiple Plan-Do-Study-Act (PDSA) cycles are then commonly used to implement the intervention, adding or adapting elements to make it more effective, guided by targeted process and outcome measures.7 Particularly if positive results appear, the team may want to disseminate the project findings so that others can benefit from the lessons learnt. The Standards for Quality Improvement Reporting Excellence (SQUIRE 2.0) reporting guidelines state that such reports should include ‘qualitative and quantitative methods used to draw inferences from the data’.8 Very often, the desired inference concerns whether and to what extent the intervention causes improvement in the targeted outcome—the effectiveness of the intervention. This raises the question: has the study been designed such that changes in outcome can really be attributed to the intervention?
QI projects often employ run charts or statistical process control (SPC) charts to monitor a targeted process or outcome measure, to understand and inform successive PDSA cycles. This approach is an example of formative evaluation; that is, ongoing feedback to refine an intervention. Run charts can help identify upward or downward trends and thereby whether the targeted measure is moving in the right direction, but cannot establish whether a process is in control or not. SPC charts can do both as these have control limits, and therefore add the ability to detect when changes are needed to make the process stable. This type of formative evaluation is often not consistent with the methodological requirements to use SPC for summative evaluation, that is, to draw inferences on effectiveness and generalisability of the improvement effort. If one only starts to think about these methodological issues when writing up a project for publication, it will be too late to change the design to allow for robust and generalisable conclusions concerning effectiveness. In many cases, SPC techniques can be used both for formative evaluation and to draw inferences concerning effectiveness, but only if appropriately designed. Inadequate design reduces the chance of publication in a peer-reviewed journal and more importantly could also result in patient harm and wasted resources if incorrect conclusions are drawn.
There have been many tutorials explaining the advantages of SPC over traditional statistical testing in QI projects, outlining the basic principles, how to select and construct SPC charts, or specific issues such as sampling considerations.9–11 However, little has been written on bringing these together in a process for using SPC in QI projects, highlighting critical decision points that are often missed but can affect how inferences about effectiveness may be drawn. This paper will address this gap by explaining how these critical decision points are inter-related and how they can be addressed by a careful plan for SPC.
Need for a stable baseline
Many improvement projects fail to establish a stable baseline against which to identify and quantify any improvements made. They may have either insufficient data points for a stable baseline, or encounter special cause variation within their baseline, meaning it is not stable and may be changing even prior to introduction of an intervention. To understand why a stable baseline is crucial to draw inferences from SPC analyses, we first briefly review the principles of SPC, described in more detail elsewhere.9 SPC is based on the principle that there is variation in any process, but that the variation is predictable if that process is stable (‘common cause variation’). Furthermore, for a stable process, we can compute the range of values within which this variation occurs, based on observed data and a hypothesised underlying statistical distribution (eg, Gaussian, binomial or Poisson, depending on the measure of interest).9 For instance, under common cause variation, average values of a continuous variable at every data point (eg, weekly averages) will tend to be normally distributed, and we can expect that if the process remains stable, 99.7% of future measurements will be within ±3 SD of the mean. Within these limits, we consider the process to be ‘in statistical control’; the limits are therefore called control limits. If an intervention is introduced, this may disturb the expected pattern (special cause variation), meaning that the measurements will deviate from the predicted range.
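To make this concrete, the following is a minimal illustrative sketch (not from the original article) of how a centre line and ±3 SD control limits could be computed from a stable baseline of weekly averages, using simulated data and the sample SD as a simplification; dedicated SPC software uses chart-specific formulae (eg, moving ranges for XmR charts, binomial variance for p-charts).

```python
import numpy as np

# Hypothetical baseline: 25 weekly averages from a stable process
rng = np.random.default_rng(42)
baseline = rng.normal(loc=30.0, scale=2.0, size=25)

centre_line = baseline.mean()
sigma = baseline.std(ddof=1)          # simplified estimate of common cause variation
ucl = centre_line + 3 * sigma         # upper control limit
lcl = centre_line - 3 * sigma         # lower control limit

print(f"CL={centre_line:.1f}, LCL={lcl:.1f}, UCL={ucl:.1f}")
# A future point outside [LCL, UCL] would signal special cause variation.
```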
From these principles, it is clear that if we are to attribute evidence of special cause variation to an intervention, we first need to establish that there is a stable process before introducing the intervention, which is referred to as a stable baseline (figure 1A). Establishing a stable baseline will usually take about 20–25 data points for SPC charts,11 12 before we can test whether subsequent measurements are starting to deviate from what is expected. This can be understood from the fact that each data point is a sample taken, and that there is likely variation from sample to sample. Taking only a few samples will therefore not give a very good representation of the true mean, similar to having only a few observations in traditional statistical analysis. Even though it is possible to calculate control limits with fewer data points, the charts become more powerful when at least 20 data points are used.11 Run charts need similar stability to enable detection of a changing trend, which usually takes about 10–15 data points,11 so these are sometimes used if the urgency of improvement outweighs the need for a longer stable baseline. The sample size for each data point is also important in this context, as this influences variation in the mean (or other statistic of interest) and hence the width of the control limits. Together, these determine the power to detect differences at a certain point in time. Further guidance regarding sampling considerations, the minimum sample size and power calculations is available elsewhere.11 12 Related to this is the choice of the time unit, for example, whether monthly or weekly measures are used. This will likely depend on the number of eligible patients or available data. The rationale for the number of data points should therefore be reported (similar to other time series techniques),13 rather than only the total time before and after the intervention.
Figure 1 Data are monthly percentages of emergency department attendances admitted to hospital, discharged home or transferred to another provider within 4 hours, for two hospitals in England, taken from figures published by NHS England (adjusted for overdispersion).31 The baseline process is stable for the hospital depicted in (A), but is unstable for the hospital in (B).
Sometimes, the preintervention process is not stable, that is, there is special cause variation in the baseline, for instance, one point outside the control limits (figure 1B). Once this is detected, there are several things that can be done. The first is to look for causes of the special cause variation; for example, this may occur in a particular subgroup of patients and once the data from this subgroup are removed from the analysis, the process is stable for all other patients. Removing a subgroup of patients from the analyses will obviously limit the generalisability, but at least allows inferences to be drawn on those remaining patients. This does not necessarily entail withholding the intervention from any subgroup of patients, and care must be taken to ensure certain groups are not disadvantaged or discriminated against, as this merely relates to the analyses. Another option is to delay the introduction of the intervention, particularly if the special cause variation occurred only early in the baseline or is limited to a single data point, so that adding a few more data points yields a stable baseline. It is important that this is done iteratively until a stable baseline is achieved—the resulting control limits are sometimes known as ‘trial limits’—otherwise predictions about future outcomes and thereby inferences about the intervention effect will be invalid and misleading. In practice, improvement teams have to balance the potential benefits of waiting for a stable baseline against the potential harm of delaying onset of improvement efforts. There will not always be an approach that can achieve the best of both worlds—but at least if such questions are considered up front, this will be a conscious decision.
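As an illustration of the iterative approach that produces ‘trial limits’, the following hypothetical sketch (not from the original article) recomputes the centre line and 3 SD limits after excluding baseline points with an identified special cause, repeating until the remaining baseline shows only common cause variation. In practice, a point should only be excluded once its special cause has been investigated and understood.

```python
import numpy as np

def trial_limits(values, max_iter=10):
    """Iteratively recompute centre line and 3-sigma limits, dropping points
    outside the limits, until no further special cause points are found.
    Simplified sketch: assumes each exclusion is justified by investigation."""
    vals = np.asarray(values, dtype=float)
    keep = np.ones(len(vals), dtype=bool)
    for _ in range(max_iter):
        cl = vals[keep].mean()
        sigma = vals[keep].std(ddof=1)
        outside = np.abs(vals - cl) > 3 * sigma
        new_keep = keep & ~outside
        if new_keep.sum() == keep.sum():   # baseline is now stable
            break
        keep = new_keep
    return cl, cl - 3 * sigma, cl + 3 * sigma

# Example: a single aberrant baseline point inflates the limits until excluded
cl, lcl, ucl = trial_limits([30, 31, 29, 30, 32, 55, 31, 30, 29, 30, 31, 30])
```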
Choice of outcome measures
As for any intervention study, a QI study requires a range of outcome measures covering the effectiveness of the intervention (primary outcome measure), its safety (balancing measures) and compliance with it (intervention fidelity measures), using existing or newly collected data.
The primary outcome measure should capture the key quality or safety issue targeted by the QI project, which can relate to an outcome for patients or a care process. The definition and data collection method for this outcome measure should be the same before and after the intervention, and for all healthcare settings (including control settings if applicable), as slight changes will distort the effect attributed to the intervention. For instance, using retrospective (existing) data before the intervention versus prospective data after the intervention is likely to induce methodological effects unrelated to the intervention. In practice, because a stable baseline needs to be established (which might take time) and the QI team may be keen to start implementing the intervention, this often means that existing (routinely collected) data are used. Examples include the number of opioid-related oversedation events per 1000 patient-days,14 percentage of primary care patients lost to follow-up,15 time to receive antibiotics16 or number of deliveries between newborns with an Apgar score <7 after 5 min.17 Even though it is possible to use prospectively collected data, enough data must first be collected to establish a stable baseline before starting the intervention. Using existing data therefore has the advantage of being able to start developing and implementing the intervention more quickly, provided that the baseline process is indeed stable. What we encounter frequently, though, is that there has been an audit of only a few data points showing that there is a problem of some kind, which then initiated a QI project and implementation of an intervention a year later. Even though a lot of effort has gone into such projects, without a stable baseline and continuous longitudinal data we cannot attribute any change in outcome to the intervention.
Balancing measures are important to include in a QI project to assess whether the intervention has unintended effects, and therefore also require the same data before and after introduction of the intervention. Examples of balancing measures include falls that result in injury,2 prehospital time18 and major adverse events.19 Intervention fidelity measures are used to establish whether the intervention has been delivered as intended, similar to treatment compliance in randomised controlled trials. These are very useful as part of the PDSA cycles, as they are often intermediate processes to improve the primary outcome, and so provide information regarding the need to adapt the intervention. Showing improvements in these intermediate measures therefore also contributes information on the likelihood that the intervention (rather than something else) produced the change in the primary outcome. Examples of intervention fidelity measures include compliance with clinical care bundles,4 17 the percentage of infants with verbal consent documented19 or the proportion of clinical care providers receiving an educational intervention.2 These fidelity measures are mostly collected prospectively during intervention implementation and can add further evidence concerning the causal relationship between the intervention and outcome.
QI study designs
The design of a QI study is important as it defines the data collection, analysis and inferences that can be drawn, and should therefore be explicitly reported. For instance, an uncontrolled before-after study giving only average estimates in two periods will not be able to control for any secular trends.
First, the use of routinely collected (existing) data versus newly collected (prospective) data is an important decision, similar to cohort studies where retrospective refers to already available data and prospective to newly collected data. This distinction is important as use of prospective data allows for choices on which data to collect and how to measure, including for relevant confounders, whereas these possibilities are more limited when existing data are used.
Second, the time horizon available for a QI project may affect choices on the time unit for the longitudinal data. If, for instance, the QI project needs to be conducted within 1 year, this means that monthly data on the outcomes will not give enough data points to establish a stable baseline, let alone for any evaluation of the intervention effect. Weekly or fortnightly data may give enough data points, provided there is sufficient sample size for each data point, that is, the total number of eligible patients every week or 2 weeks. Alternatively, if a primary outcome can only be assessed monthly, this means that the time horizon of the QI project needs to be longer.
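As a hypothetical planning sketch illustrating this second consideration (the numbers are illustrative, not from the original article), the trade-off between the time unit, the number of baseline data points and the sample size per data point can be worked through explicitly before the project starts.

```python
# Illustrative planning: 6 months of baseline time, ~60 eligible patients/month
eligible_per_month = 60
baseline_months = 6

for unit, periods_per_month in [("monthly", 1), ("fortnightly", 2), ("weekly", 52 / 12)]:
    n_points = round(baseline_months * periods_per_month)
    patients_per_point = eligible_per_month / periods_per_month
    print(f"{unit:11s}: {n_points:2d} baseline points, ~{patients_per_point:.0f} patients per point")

# Monthly data give too few baseline points; weekly data give ~26 points but
# only ~14 patients per point, so the control limits will be wider.
```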
Third, it is important to consider whether a controlled or uncontrolled QI study design is to be used. A controlled study is a stronger design for drawing inferences on the effectiveness of the intervention: if the outcome changes in the intervention group but not in the control group, it is more likely that the intervention produced the change. There are different types of controls that can be considered for QI studies, which will help to deal with different types of confounding as reported for other time series20 (table 1). Previous QI studies have used:
Location-based controls, for example, another ward in the same hospital not exposed to the intervention.2 21
Characteristic-based controls, for example, a study aiming to reduce readmissions specifically for patients with heart failure used patients with acute myocardial infarction and pneumonia as controls.22
Outcome-based controls, for example, a study targeting hand hygiene to reduce healthcare-associated infections attributable to inpatient or outpatient care used infections attributable to the operating room as the control, as these were expected to be less sensitive to changes in hand hygiene.23
Table 1 Controlled quality improvement studies—different types of controls take care of different types of confounding
When selecting a control, it is important to consider the scale of the intervention and the risk of contamination; for instance, whether professionals working in another location or another patient group chosen as control could still be exposed to the intervention. In addition, one needs to consider the type of factors that will be controlled for and what cannot be controlled because these factors are uniquely tied to the intervention group or outcome. For example, examining the changes in outcome in a control ward in the same hospital will show the impact of, say, a new hospital-wide policy on the outcome, which can be taken into account when interpreting the impact of the intervention in the intervention ward. However, a change in ward policy in the intervention ward that affects the outcome cannot be separated from the intervention effect. The choice of control also depends on whether it will be possible to collect similar data on the outcome measures as for the intervention group.
Even though not included as a separate item in the SQUIRE guidelines, it would be helpful if authors include an explicit description of their study design. Explicit reporting on the data used, time horizon, (un)controlled design, type of control and which confounding factors are controlled in this way, will give more insight into the quality of evidence generated by this particular QI study.
SPC analyses to draw inferences on the intervention effect
Choosing the appropriate type of SPC chart is the first step in SPC analyses, directly linked to the outcome for which the analysis is conducted. Outcomes can be based on different types of variables (eg, continuous, percentage) that follow different statistical distributions, which subsequently determines calculation of control limits. Just as the appropriate regression analysis is chosen based on the type of outcome, the same is true for choice of SPC chart (table 2), as described in more detail elsewhere.10 12 Although continuous measures contain more information than dichotomised equivalents, many QI projects have primary outcomes expressed as percentages/proportions and hence the p-chart is most frequently used. More complex charts such as the exponentially weighted moving average and cumulative sum charts use accumulated information over time and are thereby able to detect small changes more quickly than the p-chart.24 U-charts are appropriate for rates, where the number of events is adjusted for the time at risk, for example, the rate of central line infections per 1000 days in situ. C-charts monitor counts, for example, the number of falls per week. If the correct chart type is chosen, deviations from the underlying distribution will show up as special causes. An incorrect choice of chart may result in poor chart performance, such as an increased risk of false positives.
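As a minimal illustration (a sketch under the standard binomial assumption, not code from the original article), the centre line and 3 SD limits of a p-chart with varying subgroup sizes could be computed as follows; other chart types (u, c, XmR, EWMA, CUSUM) each have their own formulae.

```python
import numpy as np

def p_chart_limits(events, denominators):
    """p-chart sketch: per-period proportions with 3-sigma limits based on the
    binomial distribution, allowing the subgroup size to vary by period."""
    events = np.asarray(events, dtype=float)
    n = np.asarray(denominators, dtype=float)
    p_bar = events.sum() / n.sum()                 # centre line: pooled proportion
    se = np.sqrt(p_bar * (1 - p_bar) / n)          # standard error per period
    ucl = np.clip(p_bar + 3 * se, 0, 1)
    lcl = np.clip(p_bar - 3 * se, 0, 1)
    return events / n, p_bar, lcl, ucl

# Hypothetical example: monthly counts of patients meeting a target, out of all eligible
props, cl, lcl, ucl = p_chart_limits([42, 39, 45, 40, 44], [60, 55, 63, 58, 61])
```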
Table 2 Choosing the appropriate type of statistical process control chart aligned with the type of outcome data
The next step is to plot the observed outcome data over time and to calculate the centre line and control limits, to allow detection of special cause variation introduced by the intervention. Most QI projects use ±3 SD control limits to limit the risk of a type I error, that is, incorrectly detecting special cause variation (false-positive signals). Whereas traditional statistical techniques mostly accept a 5% type I error risk to test one hypothesis at a time and use ±2 SD for clinical decisions, control charts contain many data points, each of which contributes to the overall false-positive risk. As shown elsewhere, a control chart of 25 data points using 3 SD limits would have an overall false-positive rate of 6.5% whereas using 2 SD limits would increase this to 27.7%.9 To detect special cause variation, several rules are commonly used that can be summarised as data falling outside the control limits or displaying abnormal patterns within the control limits that would not be expected under common cause variation9 10 (a minimal code sketch of these rules follows the list):
One point outside the upper or lower control limit, as this is outside the range of predicted values for common cause variation.
Trend: a run of 8 successive points10 (some prefer fewer points9) trending up or down.
Shift: 8 successive points on one side of the centre line (again, some prefer a different cut-off).
Two out of 3 successive points beyond 2 SD on the same side of the centre line. One point beyond the 2 SD limit may be an extreme value within the range of predicted values, but two consecutive extreme values are not very likely under random variation.
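The following is the sketch referred to above: a simplified, hypothetical implementation of these rules (exact rule definitions and cut-offs vary between authors and software, so this is illustrative only).

```python
import numpy as np

def special_cause_signals(x, cl, sigma):
    """Flag points breaking commonly used special cause rules (simplified sketch)."""
    x = np.asarray(x, dtype=float)
    flags = {}

    # Rule 1: one point outside the 3-sigma control limits
    flags["outside_limits"] = np.abs(x - cl) > 3 * sigma

    # Shift: 8 successive points on the same side of the centre line
    side = np.sign(x - cl)
    flags["shift"] = np.array([i >= 7 and abs(side[i - 7:i + 1].sum()) == 8
                               for i in range(len(x))])

    # Trend: 8 successive points all increasing or all decreasing
    step = np.sign(np.diff(x))
    flags["trend"] = np.array([i >= 7 and abs(step[i - 7:i].sum()) == 7
                               for i in range(len(x))])

    # Two out of three successive points beyond 2 sigma on the same side
    above2, below2 = (x - cl) > 2 * sigma, (cl - x) > 2 * sigma
    flags["two_of_three"] = np.array([
        i >= 2 and (above2[i - 2:i + 1].sum() >= 2 or below2[i - 2:i + 1].sum() >= 2)
        for i in range(len(x))])
    return flags
```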
Which rules are employed should be decided a priori and specified in a QI report. It is helpful to annotate the SPC charts with the different PDSA cycles to show when different elements were added or adapted, and to show when special cause variation occurred (eg, by a different colour).
Certain types of special cause variation, notably shift rule breaks, may indicate an improvement (or indeed a deterioration) in the process being measured. If such a change persists, a new centre line with new control limits should be established to encapsulate the expected variation in this new process. This may happen more than once if a project is successful. For instance, in a recently published QI report the mean monthly percentage of infants receiving timely hepatitis B vaccination increased from a baseline mean of 45% to 76%, and then to 95%.19 Other projects have used centre lines before versus during intervention implementation to show the intervention effect,4 before versus after the intervention ended15 or during sustainability phases.16 In such cases, where the new limits are established based on a time period criterion rather than on special cause variation, it is important to note that differences in the mean between the two periods may be due to common cause variation alone, that is, the intervention may not have caused a change. Regardless of the criteria used to determine when new limits are established, each set of limits must be based on sufficient data, as described for the baseline period above. For instance, if at least 25 data points are required to establish baseline limits, it is logical to use at least 25 data points to form control limits for a new, improved process.
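As a hypothetical sketch of this re-baselining step (not code from the original article), separate centre lines and limits could be computed per phase once a sustained shift has been confirmed and each phase contains enough data points.

```python
import numpy as np

def phase_limits(values, phase_starts):
    """Compute a separate centre line and simplified 3-sigma limits for each
    phase (eg, baseline and post-shift), given the index where each phase starts."""
    values = np.asarray(values, dtype=float)
    boundaries = list(phase_starts) + [len(values)]
    phases = []
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        seg = values[start:end]
        cl = seg.mean()
        sigma = seg.std(ddof=1)
        phases.append({"start": start, "cl": cl,
                       "lcl": cl - 3 * sigma, "ucl": cl + 3 * sigma})
    return phases

# Example (hypothetical series): a 25-point baseline followed by an improved process
# phases = phase_limits(monthly_percentages, phase_starts=[0, 25])
```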
There may be factors other than the intervention affecting the outcome and thereby introducing special cause variation. Such confounding can be dealt with by statistical adjustment or stratification as in other research. It is important to report explicitly about any other interventions occurring in the same time period that may have affected the outcome, particularly for uncontrolled QI studies where this could be due to a secular trend. For controlled studies, such reporting should focus on other interventions specific to the intervention group such as a new ward policy (when using a location-based control) or a policy affecting specific outcomes only (when using an outcome-based control). The controlled design will not be able to separate the effect of these factors from those attributable to the intervention itself, highlighting the importance of selecting a control carefully and reporting explicitly on the type of confounding it will control for. This also reinforces that intervention fidelity measures need to be chosen carefully to show how the intervention may have produced its results, and to evaluate whether changes in these intermediate variables are concurrent with changes in the primary outcome. The combination of effects shown for the primary outcome measure in the intervention group (vs control) and intervention fidelity measures will determine our understanding of how the intervention has worked and the likelihood that it has produced the changes in outcomes.
Discussion and recommendations
Even though not every QI project sets out to establish evidence on the effectiveness of an intervention, many end up seeking to do so and it is therefore important to plan for both formative and summative evaluation needs. Thinking in advance about the kind of conclusions you would like to be able to draw will inform explicit choices on the design, measurements and SPC analyses. The present paper has highlighted the critical decision points and shown how these are inter-related with many aspects of a QI study. Explicit reporting on these critical decision points can be seen as an add-on to the SQUIRE guidelines as routinely used for QI reports, and is needed for editors, reviewers and readers to interpret and judge the results (table 3). This paper also adds more specific detail to the features of a high-quality measurement plan guiding all phases of the project, including planning the (SPC) analysis, handling missing data and possible confounders.25
Table 3 Critical decision points that need explicit reporting for a quality improvement study
Making more deliberate choices and reporting explicitly on these choices will strengthen QI projects and thereby the evidence they can produce. The need for rigorous methodology and identification of sources of bias in QI projects has been argued previously, including that statements to the effect that data ‘are for quality improvement’ and not ‘research’ may run the risk of promoting less rigorous standards.26 The primary aim at the start of many QI projects may be to improve a particular problem in daily practice, rather than to produce new knowledge as in research.27 SPC charts are then often used as an approach to measurement, where data will give feedback and inform further development of the intervention. Previous studies have shown that using SPC charts as part of the intervention (formative evaluation) can be effective in improving patient outcomes.28 29 Even though SPC is then primarily a tool for monitoring rather than for drawing inferences about the effect of each implemented improvement initiative, this use also requires planning the feedback, defining what constitutes special cause variation and so on. Similarly, when SPC is used as a statistical tool to draw inferences about the effect of the intervention (summative evaluation), this requires planning, as otherwise decisions taken during the process might affect the ability to draw generalisable conclusions. In learning health systems, there will be continual QI projects to evaluate whether interventions to improve care are effective and reduce harm; these require the best possible methods and strongest feasible designs to ensure the evidence is as rigorous as possible, even if randomisation is not always an option.30 Ultimately, evaluation of QI projects cannot be an afterthought, as we need to know which of our efforts have worked to improve care, to ensure that patients will benefit.
Footnotes
Contributors Both authors conceived this study. PJM-vdM wrote the first draft and is guarantor. Both authors critically reviewed the manuscript and approved the final version.
Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.
Competing interests None declared.
Provenance and peer review Commissioned; externally peer reviewed.