## Abstract

**Background** Patient-reported outcome measures (PROMs) often produce skewed distributions of individual scores after a healthcare intervention. For health performance indicators derived from skewed distributions, funnel plots designed with symmetric control limits may increase the risk of false alarms about poor performance.

**Aim** To investigate the accuracy of funnel plots with symmetric control limits when comparing provider performance based on PROMs.

**Methods** The authors used a database containing condition-specific PROMs for 17 453 hip replacements and 7656 varicose vein procedures performed by providers in the English NHS. The mean postoperative PROM score, adjusted for patient characteristics, was used as the measure of performance. To compare performance, symmetric 99.8% control limits were calculated on funnel plots, 3 SDs away from the overall mean on either side. These were compared to control limits derived directly from percentiles of simulated (bootstrap) distributions of mean scores.

**Results** The simulated control limits on funnel plots for both procedures were asymmetric. The empirical probability of falling outside the symmetric 99.8% ‘poor performance’ control limit was inflated from the stipulated rate of 0.1% to 0.2–0.3% for provider sample sizes of up to 150 procedures. The authors observed that, out of 237 providers of hip replacement, eight had adjusted mean scores that exceeded the symmetric ‘poor performance’ limit compared with only five that exceeded the corresponding simulated limit. In other words, three (1.3%) were differently classified. For varicose vein surgery, five out of 160 providers exceeded the symmetric limit and four exceeded the simulated limit, that is, one (0.6%) was differently classified.

**Conclusions** When designing funnel plots for comparisons of provider performance based on highly skewed data, the use of simulated control limits should be considered.

- Patient outcomes
- funnel plots
- healthcare quality
- surgery
- statistical process control
- statistics
- performance measures
- clinical practice guidelines

## Introduction

The funnel plot is an increasingly common graphical tool for comparing providers on some measure of performance and for identifying ‘outliers’.1–4 Funnel plots present the performance indicator value on the vertical axis with a measure that is related to how accurately the indicator has been measured on the horizontal axis. The latter is typically provider volume, such as number of procedures. Superimposed lines are drawn to mark out a target outcome and a set of control limits which form a curved funnel about the target (figure 1). The control limits are designed to contain the bulk of the variation in provider values that could be attributable to random (common-cause) variation rather than to systematic variation in performance.

The funnel plot is a type of scatter plot used for comparing the performance of healthcare providers. In a simple funnel plot, a performance indicator is measured on the vertical axis and the number of cases that the indicator is based on (the sample size) is measured on the horizontal axis. A horizontal line is drawn to indicate a target level of performance, often the average, and curved lines are drawn in a funnel shape 2 and 3 SDs away from the target on either side to show how much natural variation would be expected for different sample sizes.

We focus on the postoperative Oxford Hip Score (OHS), a patient-reported measure of hip pain and disability, as an indicator of the performance of orthopaedic units in carrying out hip replacement. Figure 1 shows the mean OHS for 237 units compared with the number of procedures in each. It reveals 14 ‘outliers’ that have an unusually low mean OHS lying on or outside the outer ‘poor performance’ control limit (more than 3 SDs) and six ‘outliers’ that have an unusually high mean.

The use of funnel plots to compare performance was introduced as a pragmatic alternative to ranking providers. It recognises that some variation in provider outcomes is to be expected and is not always a cause for action. It provides a transparent threshold for investigating potential cases of poor performance and is becoming increasingly popular among clinicians and policy makers.

When designing funnel plots, symmetric upper and lower control limits are ordinarily calculated using the assumption that the random variation in the provider indicator values follows an approximately normal distribution.1 However, this assumption may be wrong when the patient-level data are skewed. This situation frequently occurs for scores from patient-reported outcome measures (PROMs) after healthcare interventions. It arises because continuous measures of outcome based on patients' assessments of their symptoms, functioning and health-related quality of life produce a score for someone of average health that is near to one end of the range, not in the middle. This effect is due to a ‘ceiling effect’ in the measurement scales and to the effectiveness of treatment in relieving patients of health problems. The practical consequence is that, after treatments designed to cure symptoms or improve functioning, the distribution of individual patient scores is frequently skewed and concentrated around the best score on the scale.5

When there is a skewed distribution of mean PROM scores at the provider level, applying symmetric control limits may lead to wrong inferences about outlying provider performance. This is of particular concern for the identification of ‘poor performing’ providers, that is, providers that have worse than expected values and exceed the control limits. This can occur because the data are skewed in a direction that causes the ‘poor performance’ symmetric control limits to be too close to the target outcome. However, the risk associated with this has so far not been evaluated. Moreover, based on fundamental statistical principles (the central limit theorem), we can expect that the skewness in the provider mean scores will decrease as provider volume increases. Symmetric control limits based on the normal approximation will become acceptable in practice above some minimum volume.
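The central limit theorem argument can be made concrete with a standard result: for independent, identically distributed scores with skewness γ, the mean of n scores has skewness γ/√n. The short sketch below tabulates this for a few provider volumes. It is illustrative only; the function name is hypothetical, and the input value of −1.11 is the unadjusted patient-level OHS skewness reported later in this article.

```python
import math


def skewness_of_mean(patient_skewness: float, volume: int) -> float:
    """Skewness of the mean of `volume` independent, identically distributed scores.

    Standard result: the skewness of a sample mean equals the skewness of the
    individual observations divided by the square root of the sample size.
    """
    return patient_skewness / math.sqrt(volume)


# -1.11 is the unadjusted OHS skewness reported in the Risk adjustment section.
for volume in (10, 50, 150, 500):
    print(volume, round(skewness_of_mean(-1.11, volume), 3))
```

The skewness shrinks only with the square root of volume, which is why the normal approximation can remain imperfect in the extreme tails even for moderately large providers.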

The aims of this study were threefold. First, to quantify the impact of skewness in two PROMs on simulated (bootstrap) sampling distributions of mean postoperative scores for different provider volumes, for both unadjusted and risk-adjusted data. Second, to estimate a minimum provider volume above which the normal approximation becomes acceptable. Third, to estimate the probability of a provider's mean postoperative score exceeding the ‘poor performance’ control limits using alternative methods of calculating control limits, including a skewness correction method from the non-health literature. Section 1 describes the data. Section 2 presents the different methods of estimating control limits, the estimated probabilities of mean scores exceeding these limits and additional issues related to risk-adjustment. Section 3 focuses on the numbers of orthopaedic and general surgery providers in our dataset that were differently labelled using the symmetric and simulated (bootstrap) control limits.

## Data

Since April 2009, all patients in the English NHS undergoing hip replacement and varicose vein surgery have been asked to complete a questionnaire before their operation and a second questionnaire 6 months after for hips and 3 months after for varicose veins.6 For this study, data were gathered for 25 109 patients who either underwent hip replacement surgery in one of 237 orthopaedic providers between April 2009 and January 2010, or who had varicose vein surgery in one of 160 general surgery providers between April 2009 and March 2010. We excluded patients with a missing postoperative questionnaire (n=14 804) and patients with missing data on age, sex, preoperative PROM score or postoperative PROM score (n=1903). We also excluded patients of providers who had performed surgery on fewer than five patients over the study period (n=72 patients, 32 providers). Further details of the patient cohort are given in an online appendix (tables A1–A3).

Two condition-specific PROMs were used. The Oxford Hip Score (OHS) is derived from patient responses to 12 questions about hip-related pain and limits on physical functioning and everyday activities.7 Scores are calculated by adding up values associated with each response to produce a scale from 0 (worst) to 48 (best). In our database, 1 in 10 patients attained the maximum score of 48 after surgery, a quarter attained a score of at least 46 and the distribution was negatively skewed (figure 2), similar to findings in other studies.8

The Aberdeen Varicose Vein Questionnaire (AVVQ) consists of 13 questions on levels of pain, ankle swelling, discolouration, itching and cosmetic appearance.9 Patient responses are scored to form a scale from 0 (best) to 100 (worst), the reverse direction to that of the OHS. After surgery, nearly 1 in 10 patients attained the minimum (best) score of 0, a quarter attained a score of 4 or less and the distribution of scores was positively skewed (figure 2).

## Funnel plots for provider level (mean) PROM scores

The initial exercise focused on the mean of patients' postoperative PROM scores (OHS or AVVQ) by provider. For each procedure, target performance was defined as the mean postoperative score across all patients. Symmetric 95% and 99.8% control limits were calculated using values lying 2 and 3 SDs from the mean (see formulae in online appendix). SDs of the mean scores for each provider volume were estimated using the common SD of the scores across all patients, divided by the square root of provider volume. As explained above, this method assumes (either explicitly or implicitly) that the sampling distribution of mean scores is approximately normal.
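A minimal sketch of the symmetric calculation described above is given below. The exact formulae used in the study are in the online appendix; this version simply places the limits a chosen number of SDs of the mean on either side of the target. The function name and the example values (a target of 38.0 and a patient-level SD of 9.0) are hypothetical and are not taken from the study data.

```python
import math


def symmetric_limits(target: float, patient_sd: float, volume: int, n_sd: float = 3.0):
    """Symmetric funnel plot limits for a provider treating `volume` patients.

    The SD of the provider mean is the common patient-level SD divided by the
    square root of provider volume; the limits lie `n_sd` such SDs either side
    of the target.
    """
    sd_of_mean = patient_sd / math.sqrt(volume)
    return target - n_sd * sd_of_mean, target + n_sd * sd_of_mean


# Hypothetical inputs, for illustration only.
for volume in (10, 50, 150):
    lower, upper = symmetric_limits(target=38.0, patient_sd=9.0, volume=volume)
    print(volume, round(lower, 2), round(upper, 2))
```

Calling the function over a grid of volumes traces out the funnel shape: the limits are wide for small providers and narrow towards the target as volume grows.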

We then derived control limits from percentiles of simulated sampling distributions of the mean scores. We simulated sampling distributions of mean scores by resampling data from all patients pooled together using bootstrapping.10 Bootstrap methods have been proposed in the design of control charts for non-normal data outside the health literature.11 This approach recognises that the theoretical distribution of patient-level scores is unknown and reproduces patterns of mean scores under the hypothesis of random variation, avoiding the assumption that the distribution is normal. Instead, the observed distribution is used to generate a large number of ‘new’ samples of data (bootstrap replicates). The mean of each new sample is calculated and the distribution of these means gives an approximation to the sampling distribution. This approach is conceptually simple although computationally intensive, and was carried out in Stata. We took repeated random samples ranging in size from 10 to 200 patients, selected to represent a realistic range of provider sample sizes. For each sample, we performed 20 000 (bootstrap) replications.

We used the simulated sampling distributions to derive 95% and 99.8% control limits. The lower and upper 95% limits were calculated as the 2.5th and 97.5th percentile values respectively and the lower and upper 99.8% limits corresponded to 0.1th and 99.9th percentile values. The control limits were robust to alternative methods of calculating percentiles, since values for the 20th and 21st and the 19 980th and 19 981st observations (out of 20 000 simulated means) were close together, even for small sample sizes.
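The bootstrap procedure can be sketched as follows. This is not the Stata code used in the study; it is an illustrative Python version using only the standard library, with a hypothetical function name and synthetic, negatively skewed scores on a 0–48 scale standing in for the real patient data. The percentile rule (taking the k-th smallest and k-th largest of the simulated means) is one of several reasonable conventions.

```python
import random
import statistics


def bootstrap_limits(scores, volume, coverage=0.998, n_reps=20_000, seed=0):
    """Percentile control limits for the mean of `volume` pooled patient scores."""
    rng = random.Random(seed)
    # Each replicate draws `volume` patients with replacement from the pooled
    # data and records the mean of that simulated provider.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=volume)) for _ in range(n_reps)
    )
    # Number of replicates in each tail, e.g. 20 out of 20 000 for 99.8% limits.
    k = max(1, round(n_reps * (1.0 - coverage) / 2.0))
    return means[k - 1], means[n_reps - k]


# Synthetic skewed scores (assumption: NOT the study data), capped at 48.
data_rng = random.Random(1)
scores = [48.0 - min(48.0, data_rng.gammavariate(3.3, 3.0)) for _ in range(5000)]

lower, upper = bootstrap_limits(scores, volume=30)
target = statistics.fmean(scores)
print(f"target={target:.2f} lower={lower:.2f} upper={upper:.2f}")
```

With negatively skewed input, the lower limit should lie further from the target than the upper limit, which is the asymmetry the simulated control limits are intended to capture.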

The simulated sampling distributions of means were skewed and the degree of skewness decreased with increasing (simulated) provider volume. Consequently, the control limits derived from these simulated distributions were asymmetric. Figure 3 compares the symmetric control limits, based on the normal approximation, and the asymmetric control limits, derived from the simulated distributions.

### Classification of outlying performance

We calculated the percentage of the simulated sampling distribution that fell beyond the outer symmetric control limits to determine the degree to which using symmetric control limits inflated the risk of false alarms about poor provider performance (equivalent to a type I error). The risk of a false alarm is intended to be 1 in 1000 (0.1%) using 99.8% control limits. However, for volume ranges of 10–150, we found that the risk of a provider being beyond the 99.8% control limit which labelled it as a ‘poor performing’ outlier was 2–3 in 1000 (0.2–0.3%) (table 1). The risk of a false alarm for poor performance converged to the intended risk with increasing provider volume, falling to less than 0.2% at a volume of around 150 (slightly higher for the AVVQ).
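The false alarm calculation can be illustrated with a short simulation: generate bootstrap means for a given provider volume and count the share that falls beyond the symmetric ‘poor performance’ limit. The sketch below uses synthetic skewed scores and hypothetical names, so the rates it prints are not the study's estimates. Note also that, under exact normality, a limit at 3 SDs has a one-sided tail probability of about 0.135%, whereas about 3.09 SDs corresponds to exactly 0.1%; `n_sd` is left as a parameter for that reason.

```python
import math
import random
import statistics


def false_alarm_rate(scores, volume, n_sd=3.0, n_reps=20_000,
                     poor_is_low=True, seed=0):
    """Share of simulated in-control provider means beyond the symmetric limit."""
    rng = random.Random(seed)
    target = statistics.fmean(scores)
    sd_of_mean = statistics.pstdev(scores) / math.sqrt(volume)
    beyond = 0
    for _ in range(n_reps):
        mean = statistics.fmean(rng.choices(scores, k=volume))
        if poor_is_low:
            # Low scores are worse (as for the OHS).
            beyond += mean < target - n_sd * sd_of_mean
        else:
            # High scores are worse (as for the AVVQ).
            beyond += mean > target + n_sd * sd_of_mean
    return beyond / n_reps


# Synthetic negatively skewed scores on a 0-48 scale (assumption, not study data).
data_rng = random.Random(1)
scores = [48.0 - min(48.0, data_rng.gammavariate(3.3, 3.0)) for _ in range(5000)]

for volume in (10, 50, 150):
    print(volume, false_alarm_rate(scores, volume))
```

Setting `poor_is_low=False` on the same negatively skewed data estimates the opposite tail, where the symmetric limit is too far from the target rather than too close.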

The impact of skewness on 95% simulated control limits was relatively small, because the normality assumption holds fairly well in the central part of the distribution. Consequently, the proportional increase in the risk of a false alarm using symmetric 95% control limits was smaller and converged more quickly to the intended risk of 2.5% with increasing sample size (table 1).

Although we have not shown estimates for this, the use of symmetric control limits for skewed PROMs data also decreased the chance of provider mean scores exceeding the ‘good performance’ control limits.

### Risk adjustment

The implications of using risk-adjusted data for the design of funnel plots and control charts have been investigated for performance indicators derived from binary and count data,1,12 but not, to our knowledge, for continuous outcome measures such as PROMs. Risk adjustment of continuous data has two impacts on the calculation of control limits on funnel plots. First, risk adjustment is expected to reduce the variability in any continuous outcome measure. Consequently, the SD of risk-adjusted scores is less than the SD of actual scores and both the symmetric and simulated control limits should be slightly narrower. Second, risk adjustment may reduce the skewness in the distribution of patient outcome data, which in turn may reduce the asymmetry of the simulated control limits.

To investigate the effect of risk adjustment, we produced adjusted postoperative PROM scores for both procedures using a linear regression model that included the following patient characteristics: preoperative PROM scores (OHS or AVVQ), age, sex, Index of Multiple Deprivation based on residential postcode, patient-reported comorbidities, and patient-reported general health before surgery (see online appendix table A4). We found that risk adjustment reduced the SD in the outcome by around 10% for the OHS and 20% for the AVVQ. It also reduced the skewness in the distribution of patient postoperative scores. For the OHS data, skewness fell by around 13% from −1.11 to −0.97. For AVVQ, skewness fell by around 40% from +1.55 to +0.97.
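To show why regression-based adjustment narrows the spread of scores, the sketch below adjusts postoperative scores for a single covariate (the preoperative score) by ordinary least squares. This is a deliberate simplification: the study model included several patient characteristics, and its exact definition of the adjusted score is given in the online appendix. Here the adjusted score is assumed to be the overall mean plus the regression residual, and all names and data values are hypothetical.

```python
import statistics


def risk_adjust(post, pre):
    """Adjusted scores: overall mean plus residual from regressing post on pre.

    Assumption for illustration: a single covariate and this particular
    definition of the adjusted score; the study used a multivariable model.
    """
    mean_pre = statistics.fmean(pre)
    mean_post = statistics.fmean(post)
    sxx = sum((x - mean_pre) ** 2 for x in pre)
    sxy = sum((x - mean_pre) * (y - mean_post) for x, y in zip(pre, post))
    slope = sxy / sxx
    intercept = mean_post - slope * mean_pre
    return [mean_post + (y - (intercept + slope * x)) for x, y in zip(pre, post)]


# Hypothetical preoperative and postoperative scores for five patients.
pre = [10.0, 20.0, 30.0, 40.0, 50.0]
post = [30.0, 35.0, 44.0, 41.0, 48.0]
adjusted = risk_adjust(post, pre)
print(round(statistics.pstdev(post), 2), round(statistics.pstdev(adjusted), 2))
```

Because least squares residuals can never vary more than the raw outcome, the SD of the adjusted scores is no larger than the SD of the observed scores, and the control limits narrow accordingly.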

Using the risk-adjusted data, we derived symmetric and simulated control limits using the same methods as for the unadjusted data. The simulated control limits were again asymmetric, and the risk of a false alarm about poor performance using 99.8% symmetric control limits remained higher than the intended rate of 0.1%. However, the risk adjustment did not greatly reduce the degree of inflated risk. Estimates ranged from 0.2% to 0.3% for provider volumes of up to 150 and fell to 0.2% or below for provider volumes over 150.

### Skewness correction

A number of approaches have been proposed outside the health literature for adjusting control charts to analyse means of skewed process data. As well as bootstrap methods, which we have applied here, more general parametric charts and non-parametric methods have also been suggested.13 As a potential simple alternative method to simulated (bootstrap) control limits, we tested a correction method for skewed data to produce asymmetric control limits. The method approximates percentiles of the sampling distribution of means using the mean, SD and degree of skewness in the individual data.14 Simulation results from a family of well known skewed distributions were used to select values for the correction formula (see online appendix). However, we found that the formula overcorrected for the effects of skewness and the ceiling effect in PROMs data (results not shown). On the basis of these findings, we concluded that the use of this simple correction formula cannot be recommended.

## Impact on classification of provider performance

Using postoperative PROMs data for hip replacement and varicose vein surgery, we have shown two examples of how skewed patient-level scores influence the probability of a provider being labelled as a poor performer on a funnel plot with symmetric control limits. Here, we focus on the numbers of providers in our dataset that were differently labelled using the symmetric and simulated (bootstrap) control limits. In practice, the distribution of provider volumes and their relative mean scores will affect the numbers of providers that are misclassified as poor performers using symmetric instead of asymmetric control limits. The orthopaedic and general surgery providers in our analysis had between five and 528 patients with complete data. Around two-fifths of providers had data on 30 or fewer patients and a fifth had data on 150 patients or more. Because these data come from the early stages of the NHS PROMs programme, provider volumes with complete data are likely to increase. However, older data may be considered less timely.

Based on the unadjusted data, 14 orthopaedic providers out of 237 had mean postoperative OHS scores which fell below the symmetric 99.8% lower limit, of which 12 (out of 237) also fell below the asymmetric simulated 99.8% lower limit (table 2). In other words, using asymmetric simulated limits, two (0.8%) who would otherwise have been deemed poor performers were not. After risk adjustment, including adjustment of the control limits, eight orthopaedic providers fell below the symmetric 99.8% lower limit, of which five also fell below the corresponding simulated limit, that is, three (1.3%) were differently classified.

Following a similar pattern, seven general surgery providers out of 160 had mean AVVQ scores higher (worse) than the symmetric 99.8% upper limit, of which three also fell above the simulated 99.8% upper limit, that is, four (2.5%) were differently classified (table 2). After risk adjustment, five providers were classified as outlying by the symmetric 99.8% limit and four by the asymmetric 99.8% limit, that is, one (0.6%) was differently classified.
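The percentages of differently classified providers follow directly from the counts reported above, as the brief check below illustrates. The function name is hypothetical, and the calculation assumes that every provider flagged by the simulated limit was also flagged by the symmetric limit, as the text describes.

```python
def percent_differently_classified(flagged_symmetric, flagged_simulated, n_providers):
    """Providers flagged by the symmetric limit only, as a percentage of all providers."""
    return 100.0 * (flagged_symmetric - flagged_simulated) / n_providers


# Counts as reported in the text: (symmetric limit, simulated limit, providers).
reported = {
    "OHS, unadjusted": (14, 12, 237),
    "OHS, risk adjusted": (8, 5, 237),
    "AVVQ, unadjusted": (7, 3, 160),
    "AVVQ, risk adjusted": (5, 4, 160),
}

for label, counts in reported.items():
    print(f"{label}: {percent_differently_classified(*counts):.1f}%")
```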

Providers that were classified as poor performers by the symmetric but not by the simulated asymmetric control limits had mean scores lying close to both sets of control limits (figure 3). They were marginal cases, lying just outside the symmetric limits. In the present examples, it was not only small providers whose classification of ‘poor performance’ changed, but also those with higher volumes, including one provider with 155 patients.

## Discussion

Skewness in the distribution of postoperative PROM scores causes the distribution of provider performance indicators derived from these to be skewed. As a consequence, the usual design of funnel plots with symmetric control limits may increase the number of providers labelled as having poor performance (and decrease the number of providers designated as being better than average).

For two condition-specific PROMs, the OHS and the AVVQ, distributions of patients' postoperative scores were highly skewed after surgery. We used the mean postoperative score, adjusted for patient characteristics, as our measure of provider performance. We compared the impact of using funnel plots designed with symmetric and simulated control limits on the classification of poor performance, with the latter derived directly from percentiles of simulated (bootstrap) distributions of mean scores. We found that the simulated control limits on funnel plots for both procedures were asymmetric. Compared with the simulated limits, the estimated empirical probability of falling outside the symmetric 99.8% ‘poor performance’ control limit was inflated from a rate of 1 in 1000 (0.1%) to between 2 and 3 in 1000 (0.2–0.3%) for providers carrying out fewer than 150 procedures. The estimated probability fell to below 0.2% for provider sample sizes of more than 150.

We also compared the impact of using symmetric and simulated control limits on the observed classification of poor performance among providers in our database. For hip replacement, eight out of 237 providers had adjusted mean scores that exceeded the outer symmetric ‘poor performance’ limit, compared with only five that exceeded the corresponding simulated limit. In other words, three (1.3%) were differently classified. For varicose vein surgery, five exceeded the symmetric limit and four exceeded the simulated limit, that is, one (0.6%) was differently classified.

For studies using mean PROM scores to compare performance in other clinical areas, the impact of using simulated rather than symmetric funnel plots on the classification of performance will depend upon the level of skewness in the individual data and the provider sample sizes. Skewness is a common feature of PROMs in other clinical areas,5 although the level of skewness may be less than that observed for the OHS and AVVQ. Larger provider sample sizes could be used to minimise the impact on provider comparisons, but this would usually require the collection of data for a longer period of time. For rare procedures and for comparisons of surgeon performance, smaller sample sizes may be unavoidable.

As regards the practical importance of our findings, the selection of some level of risk of a false alarm about poor performance is always traded off against the risk and consequences of not identifying genuine instances of poor performance. It may well be that using symmetric control limits that correspond to 0.2% or 0.3% instead of 0.1% is deemed an acceptably small risk of a false alarm. It should also be recognised that making inferences about provider performance is not an exact science. For example, comparing performance across many providers also increases the risk of false alarms above the intended rate, unless adjustments are made for multiple testing.15
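The multiple testing point can be illustrated with simple arithmetic. If every provider were performing at the target level and providers were independent (both simplifying assumptions made here for illustration), the chance that at least one of k providers crosses the ‘poor performance’ limit is 1 − (1 − p)^k, where p is the per-provider false alarm risk. The function name below is hypothetical.

```python
def prob_any_false_alarm(per_provider_rate: float, n_providers: int) -> float:
    """Chance of at least one false alarm among independent in-control providers."""
    return 1.0 - (1.0 - per_provider_rate) ** n_providers


# 237 and 160 are the numbers of providers compared in this study; the rates
# span the intended 0.1% and the inflated 0.2-0.3% discussed in the text.
for n_providers in (237, 160):
    for rate in (0.001, 0.002, 0.003):
        print(n_providers, rate, round(prob_any_false_alarm(rate, n_providers), 3))
```

Even at the intended per-provider rate, the chance of at least one false alarm across a few hundred providers is far from negligible, which puts the additional inflation from skewness into context.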

The choice of method for calculating control limits for routine provider comparisons based on PROMs should depend on how the performance data are to be published and used. Symmetric control limits could reasonably continue to be used if there is to be some flexibility over how marginal cases of poor performance are handled, particularly if PROMs are to be used alongside other performance indicators. This would require managers, regulators and politicians to make judgements about marginal cases lying on or close to the 99.8% threshold, especially when low volume providers are considered. Alternatively, if simple classifications of performance are to be widely used as the basis for patient and clinician judgements and potential investigation by regulators, the calculation of asymmetric simulated control limits should be considered.

## Acknowledgments

We would like to thank Jiri Chard, Susan Charman, Mike Kenward, Maxine Kuczawski, Mark Pennington and David Smith for useful discussions and comments.

## Footnotes

Funding This work is funded by the DH Directorate of System Management and New Enterprise.

Competing interests None.

Provenance and peer review Not commissioned; externally peer reviewed.