Healthcare performance measurement is a complex undertaking, often presenting a number of potential alternative approaches and methodological nuances. Important considerations include richness and quality of data sources; data completeness; choice of metrics and target population; sample size; patient-level and provider-level data collection periods; risk adjustment; statistical methodology (eg, logistic regression vs hierarchical models); model performance, reliability and validity; and classification of outliers. Given these many considerations, as well as the absence of nationally accepted standards for provider profiling, it is not surprising that different rating organisations and methodologies may produce divergent results for the same hospitals.1–6
Outlier classification, the last step in the measurement process, has particularly important ramifications. For patients, it may lead them to choose or avoid a particular provider. For providers, outlier status may positively or negatively impact referrals and reimbursement, and may influence how scarce hospital resources are deployed to address putative areas of concern. Misclassification is probably more common than generally appreciated. For example, partitioning of hospitals (eg, terciles, quartiles, quintiles, deciles) to determine outliers may lead to excessive false positives—hospitals labelled as having above or below average performance when, in fact, their results do not differ significantly from the mean based on appropriate statistical tests.7,8
The current study
In this issue, Paddock et al9 address a seemingly straightforward question—what precisely does it mean to be a performance outlier? Using Hospital Compare data, the authors demonstrate an apparently contradictory finding. When directly compared one to another, some individual hospitals in a given performance tier may not be statistically significantly different than individual hospitals in adjacent tiers, even when those tier assignments were made using appropriate tests of statistical significance. For instance, Paddock et al9 show that for each bottom-tier hospital, there was at least one mid-tier hospital with statistically indistinguishable performance. Among mid-tier (‘average’) hospitals, 60–75% had performance that was not statistically significantly different than that of some bottom-tier hospitals.
How can this be? On the one hand, hospitals appear to have been appropriately divided into three discrete groups based on their performance rankings—bottom, mid and top tiers. On the other hand, direct comparisons between specific pairs of hospitals in adjacent tiers often showed no statistically significant difference, which seems inconsistent with their original rankings. The answer to this apparent paradox illustrates several statistical concepts, some unfamiliar to non-statisticians but of fundamental importance to the correct interpretation of risk-adjusted outcomes and outlier status.
First and most fundamentally, Paddock et al9 use a completely different statistical methodology for their direct hospital-to-hospital comparisons than the approach used in the original Hospital Compare tier assignments.10–12 The latter employed Bayesian hierarchical regression models with 95% credible intervals (similar to CIs) to determine outliers. From the perspective of causal inference theory,13–17 the Hospital Compare approach considers the following unobservable counterfactual: ‘What would the results have been if this hospital's patients had been cared for by an “average” hospital in the reference population?’ This is often referred to as the ‘expected’ outcome. A level of statistical certainty for the hospital-level estimates is chosen (eg, 95% credible interval), the actual results of a given hospital are compared to the expected or counterfactual outcomes, and any hospital whose 95% credible interval for their risk-adjusted mortality rate excludes the expected mortality rate is designated an outlier.
Because Paddock et al9 did not have access to the patient-level data on which the Hospital Compare analyses were based, they first converted CIs to SEs, then re-estimated performance tiers (presumably, though not stated, using one-sample z-tests), which were similar to the original Hospital Compare ratings. Finally, they performed two-sample z-tests using the results from various hospital combinations in adjacent performance tiers. Their counterfactual is not the expected outcome if a hospital's patients were cared for by an average hospital, but rather by one specific alternative hospital. Their corresponding null hypothesis is that the difference in mean mortality rates between the two hospitals being compared is zero (or, alternatively, that the ratio of their mean mortality rates is unity).
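The two kinds of test described above can be sketched in a few lines. This is an illustration only, not the authors' code: it assumes symmetric 95% intervals on the rate scale and a normal approximation, and every figure (the national rate and both hospitals' rates and intervals) is invented for the example.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()
Z_95 = STD_NORMAL.inv_cdf(0.975)  # about 1.96


def se_from_interval(lower, upper):
    """Approximate SE implied by a symmetric 95% interval."""
    return (upper - lower) / (2 * Z_95)


def one_sample_p(rate, se, reference_rate):
    """Two-sided p value for H0: hospital rate equals a fixed reference rate."""
    z = (rate - reference_rate) / se
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def two_sample_p(rate_a, se_a, rate_b, se_b):
    """Two-sided p value for H0: the two hospital rates are equal."""
    z = (rate_a - rate_b) / (se_a ** 2 + se_b ** 2) ** 0.5
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


# Hypothetical risk-adjusted mortality rates (%) and 95% intervals.
NATIONAL_RATE = 12.0
bottom_rate, bottom_se = 15.0, se_from_interval(12.2, 17.8)
mid_rate, mid_se = 13.0, se_from_interval(10.0, 16.0)

p_bottom = one_sample_p(bottom_rate, bottom_se, NATIONAL_RATE)
p_mid = one_sample_p(mid_rate, mid_se, NATIONAL_RATE)
p_pair = two_sample_p(bottom_rate, bottom_se, mid_rate, mid_se)

print(f"bottom-tier hospital vs national rate: p = {p_bottom:.3f}")
print(f"mid-tier hospital vs national rate:    p = {p_mid:.3f}")
print(f"bottom-tier vs mid-tier hospital:      p = {p_pair:.3f}")
```

With these invented figures, the first hospital differs from the national rate at the 5% level and the second does not, yet the two hospitals are not distinguishable from each other. The pairwise test carries the uncertainty of both estimates, so its SE is larger than either hospital's own.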
Thus, the direct hospital–hospital comparisons performed by Paddock et al9 ask a different question than the original Hospital Compare analyses, with a different counterfactual statement and statistical approach. Viewed from this perspective, it is no longer paradoxical but completely logical that they found different results. In this particular study, failure to reject the null hypothesis of no difference in performance among pairs of hospitals from adjacent tiers was also driven by the large SEs (resulting from small hospital sample sizes—see below). Indistinguishable performance would be particularly likely for pairs of hospitals whose performance was close to the boundary between two adjacent performance categories. That the authors only required at least one hospital from an adjacent tier to be statistically indistinguishable is a relatively low bar.
Direct and indirect standardisation
Notwithstanding the results from this specific study, which are largely a function of small sample sizes, the authors do not address the more fundamental error of using indirectly standardised results to directly compare pairs of hospitals. The differences between direct and indirect standardisation13,17,18 remain unappreciated by most non-methodologists, resulting in their frequent misapplication and misinterpretation. In direct standardisation, rates from each stratum of the study population are applied to a reference population. This type of standardisation is common in epidemiological studies where there may be only a few strata of interest (eg, age–sex strata). Directly standardised results estimate what the outcomes would have been in the reference population if these patients had been cared for by a particular study hospital. In causal inference terminology, this is the unobservable counterfactual. The results from many different hospitals can be applied to the reference population in exactly the same fashion, and it is therefore permissible to directly compare their directly standardised results.
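As a minimal sketch of direct standardisation, the calculation below applies each hospital's stratum-specific rates to one shared reference population. The two age strata, the reference counts and both hospitals' rates are hypothetical.

```python
def directly_standardised_rate(hospital_rates, reference_counts):
    """Apply a hospital's stratum-specific rates to the reference population.

    Raises KeyError if the hospital has no rate for a reference stratum,
    which is the situation that makes direct standardisation impractical
    when there are many risk factors.
    """
    total = sum(reference_counts.values())
    expected_events = sum(
        hospital_rates[stratum] * count
        for stratum, count in reference_counts.items()
    )
    return expected_events / total


# Hypothetical reference population and stratum-specific mortality rates.
reference_counts = {"under 65": 600, "65 and over": 400}
hospital_a = {"under 65": 0.02, "65 and over": 0.08}
hospital_b = {"under 65": 0.03, "65 and over": 0.06}

rate_a = directly_standardised_rate(hospital_a, reference_counts)
rate_b = directly_standardised_rate(hospital_b, reference_counts)
print(f"Hospital A: {rate_a:.3f}, Hospital B: {rate_b:.3f}")
```

Because both rates refer to the same reference population, comparing them directly is legitimate in this setting.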
The conditions that make direct standardisation possible are not found in most profiling applications because of the large number of risk factors and the fact that any given hospital may have no observations for patients having certain types of risk factors. Consequently, in virtually all healthcare profiling applications, risk adjustment is performed using indirect rather than direct standardisation. The incremental risks associated with each predictor variable (eg, a risk factor such as insulin-dependent diabetes) are derived from the reference population using regression. As in the original Hospital Compare approach discussed above, the expected outcomes in the study population reflect the anticipated results if those patients had been cared for by an average hospital in the reference population, a quite different counterfactual than in direct standardisation.17 Expected results for each patient of a given hospital are summed and compared with their observed results to estimate an O/E ratio (eg, standardised mortality ratio), which can be multiplied by the average mortality to yield a risk-adjusted or risk-standardised rate.
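The arithmetic of indirect standardisation can be sketched as follows. The per-patient expected probabilities would in practice come from a regression model fitted to the reference population; here they are simply assumed, as are the observed deaths and the reference mortality.

```python
def indirectly_standardised(observed_deaths, expected_probabilities, reference_rate):
    """Return the O/E ratio and the risk-adjusted rate for one hospital."""
    expected_deaths = sum(expected_probabilities)
    oe_ratio = observed_deaths / expected_deaths
    return oe_ratio, oe_ratio * reference_rate


# Hypothetical hospital: 50 low-risk and 50 higher-risk patients.
expected_probabilities = [0.02] * 50 + [0.10] * 50  # model-predicted risks (assumed)
observed_deaths = 9
reference_rate = 0.05  # average mortality in the reference population (assumed)

oe_ratio, adjusted_rate = indirectly_standardised(
    observed_deaths, expected_probabilities, reference_rate
)
print(f"O/E = {oe_ratio:.2f}, risk-adjusted rate = {adjusted_rate:.3f}")
```

The expected count is built only from the patients this hospital actually treated, so the resulting rate describes performance on that case mix and no other.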
Covariate overlap
Direct hospital–hospital comparisons using indirectly standardised observational data are inappropriate in virtually all profiling scenarios. The only exception is a very specific circumstance—when all regions of the covariate space defined by patient risk factors contain observations from all hospitals being compared—which would be an uncommon and chance occurrence in most profiling applications.17,19 In the absence of covariate overlap, there may be patients from one hospital for whom there are no comparable patients in the other hospital (in causal inference parlance, there is no empirical counterfactual19), and thus no way to fairly compare performance in all patients cared for by the two hospitals. For example, it is unlikely that each hospital would have octogenarians with renal failure and chronic liver disease who underwent emergency aortic valve replacement (AVR), but one of them might. No adjustment (eg, model-based extrapolation) can reliably remedy the lack of data in the area of non-overlap, and statistical inferences should generally be limited to regions where there is overlap.
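One crude way to look for non-overlap is to tabulate which risk profiles occur in both hospitals. The sketch below does this for coarse, hypothetical profiles; a real assessment would more likely compare propensity score distributions, and the three-part profile used here is an assumption made for illustration.

```python
from collections import Counter


def overlap_summary(profiles_a, profiles_b):
    """Return shared risk profiles and each hospital's share of patients in them."""
    counts_a, counts_b = Counter(profiles_a), Counter(profiles_b)
    shared = set(counts_a) & set(counts_b)
    share_a = sum(counts_a[p] for p in shared) / len(profiles_a)
    share_b = sum(counts_b[p] for p in shared) / len(profiles_b)
    return shared, share_a, share_b


# Hypothetical profiles: (age band, renal failure, emergency operation).
hospital_a = [("<80", False, False)] * 90 + [("80+", True, True)] * 10
hospital_b = [("<80", False, False)] * 100

shared, share_a, share_b = overlap_summary(hospital_a, hospital_b)
print(f"Shared profiles: {sorted(shared)}")
print(f"Hospital A patients with a counterpart in B: {share_a:.0%}")
print(f"Hospital B patients with a counterpart in A: {share_b:.0%}")
```

In this invented example, a tenth of Hospital A's patients have no counterpart at Hospital B, so any fair comparison would have to be restricted to the shared profile.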
Thus, ‘risk-adjusted’ results derived using indirect standardisation cannot be used to directly compare two hospitals unless their patient mix has been demonstrated to be similar (eg, overlapping propensity score distributions).17 Indirectly standardised rates for each hospital are estimated only for the patients they actually treated, and their results only apply to their particular case mix. It cannot be assumed that a hospital achieving better than average results in a generally low risk population could do the same in a population of very high risk patients that it has never treated. Because their indirectly standardised rates were obtained by applying reference population rates to their low risk patients, assuming that they would have similar performance if confronted with a high-risk, tertiary patient population is optimistic and unwarranted.
Covariate imbalance and bias
Irrespective of whether there is overlap in their respective distributions of patient risk, these distributions may still vary across hospitals being compared (ie, the prevalence of relevant risk factors may be different) and this covariate imbalance19 may bias the interpretation of results and the determination of outliers.20 Covariate imbalance is a common problem in profiling using observational data because patients are not randomised (the method used to achieve covariate balance in clinical trials). Standard regression-based adjustment may not completely address bias when there is substantial lack of covariate balance. Covariate imbalance was the motivation for the development of propensity score approaches for matching, modelling or stratification in studies using observational data,21,22 and propensity approaches to profiling have been investigated.20
Case mix bias
Despite excellent patient-level risk adjustment, substantial case mix bias (eg, due to marked differences in the distributions of high and low risk cases between hospitals) may be present and may impact performance estimates and outlier status. For example, the target population (condition or procedure) may be very broadly defined, which is usually done in an effort to increase sample size. Instead of focusing only on isolated aortic valve replacement (AVR), a relatively homogeneous cohort, measure developers may include all patients with an AVR, even when this procedure has been combined with other operations (such as simultaneous coronary artery bypass grafting surgery).7 These combined procedures generally are associated with higher average mortality than their corresponding isolated procedures, so the resulting study population will have a heterogeneous range of expected mortality rates. Sometimes, completely dissimilar conditions or procedures with quite different inherent risk are aggregated into a heterogeneous composite measure to increase sample size or to give the appearance of being broadly representative. For example, the hospital standardised mortality ratio (HSMR) encompasses nearly all of the admissions at a given hospital.4
In all these examples, even with perfect patient-level risk adjustment, comparisons among providers may be biased and inaccurate unless differences in the relative distributions of higher and lower risk cases are properly accounted for,4,23 a profiling analogue of Simpson's paradox.24,25 The impact of this phenomenon is not uniform. Centres performing a greater proportion of more complex cases, with higher inherent risk of adverse outcomes, may falsely appear to have worse results.
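A small numerical example shows how the paradox arises. The counts below are invented: Centre A does mostly combined procedures and Centre B mostly isolated AVR.

```python
# Hypothetical (deaths, cases) by procedure type.
centre_a = {"isolated AVR": (2, 100), "AVR + CABG": (18, 300)}
centre_b = {"isolated AVR": (9, 300), "AVR + CABG": (8, 100)}


def stratum_rate(centre, procedure):
    """Mortality rate within one procedure type."""
    deaths, cases = centre[procedure]
    return deaths / cases


def crude_rate(centre):
    """Mortality rate across all procedures, ignoring case mix."""
    deaths = sum(d for d, _ in centre.values())
    cases = sum(n for _, n in centre.values())
    return deaths / cases


for procedure in centre_a:
    print(
        f"{procedure}: A = {stratum_rate(centre_a, procedure):.1%}, "
        f"B = {stratum_rate(centre_b, procedure):.1%}"
    )
print(f"All cases: A = {crude_rate(centre_a):.2%}, B = {crude_rate(centre_b):.2%}")
```

Centre A has the lower mortality within each procedure type but the higher mortality overall, purely because three quarters of its cases are the higher-risk combined operation.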
The limitation of small sample size
Small sample sizes are common in provider profiling, and this makes it difficult to reliably differentiate hospital performance and classify outliers. In a study of major surgical procedures, Dimick et al26 found that only coronary artery bypass grafting surgery was performed with sufficient volume by most providers to reliably allow detection of a doubling of mortality rate. Krell et al27 found that most surgical outcomes measures estimated from the American College of Surgeons’ National Surgical Quality Improvement Program (NSQIP) registry data had low reliability to detect performance differences for common procedures. Similar findings have been observed with common medical diagnoses.28–30
At volumes typically encountered in practice, and even assuming perfect patient-level risk adjustment, much of the variation in healthcare performance measures is random; the extent of random variation and potential misclassification is greater at lower volumes and event rates.31 As a consequence, there is substantial fluctuation from one sampling period to another in the rates of adverse events and performance rankings among providers.32 Longitudinal assessment of provider performance over longer periods of time and investigation of trends are more prudent approaches than relying on results in one sampling period.33
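The role of chance at low volume can be made concrete with an exact binomial calculation. The sketch below asks how often a hospital whose true mortality equals an assumed 3% would, by chance alone, record at least double that rate in one sampling period; the rate and the volumes are hypothetical.

```python
from math import ceil, comb


def prob_at_least(k, n, p):
    """Exact binomial probability of k or more events in n cases."""
    below = sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k))
    return 1 - below


def prob_apparent_doubling(n, true_rate):
    """Chance that an average hospital records at least twice the true rate."""
    k = ceil(round(2 * true_rate * n, 9))  # rounding guards against float noise
    return prob_at_least(k, n, true_rate)


TRUE_RATE = 0.03  # hypothetical underlying mortality
chance_by_volume = {n: prob_apparent_doubling(n, TRUE_RATE) for n in (50, 200, 1000)}

for n, chance in chance_by_volume.items():
    print(f"{n:>5} cases: {chance:.4f}")
```

Under these assumptions an apparent doubling is a fairly common chance event at 50 cases and practically never occurs at 1000 cases.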
Different approaches have been used to address the limitations of small sample size in provider profiling and outlier classification. These include establishing lower limits for sample size below which estimates are not calculated; collecting provider data over longer time periods to increase the number of observations; broadening the target population inclusion criteria (although this may lead to aggregation issues discussed previously, including ecological bias); attribution of results to larger units (eg, hospitals rather than individual physicians); and use of composite measures that effectively increase the number of endpoints.34 Many statisticians also advocate the use of empirical Bayes or fully Bayesian approaches which shrink sample estimates towards the population mean.35–38 This yields more accurate estimates of true underlying performance, with less chance of false positive outliers, especially in small samples.
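The effect of shrinkage can be illustrated with a simple beta-binomial calculation. This is a simplified sketch rather than a full empirical Bayes analysis: the population rate and the prior strength are assumed here, whereas in practice they would be estimated from all hospitals' data.

```python
def shrunken_rate(deaths, cases, population_rate, prior_strength):
    """Posterior mean rate under a beta prior centred on the population rate.

    prior_strength acts like a number of pseudo-cases carrying the
    population rate; larger values pull estimates harder towards the mean.
    """
    prior_deaths = population_rate * prior_strength
    return (deaths + prior_deaths) / (cases + prior_strength)


POPULATION_RATE = 0.04  # hypothetical
PRIOR_STRENGTH = 100    # hypothetical

# Two hypothetical hospitals with the same observed rate of 15%.
small = shrunken_rate(3, 20, POPULATION_RATE, PRIOR_STRENGTH)
large = shrunken_rate(150, 1000, POPULATION_RATE, PRIOR_STRENGTH)
print(f"Small hospital (3/20):     {small:.3f}")
print(f"Large hospital (150/1000): {large:.3f}")
```

Both hospitals observed 15% mortality, but the small hospital's estimate moves most of the way back towards the population rate while the large hospital's barely changes, which is why shrinkage reduces false positive outliers among low-volume providers.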
Statistical certainty
Closely related to these sample size concerns is the degree of statistical certainty chosen to classify a hospital as an outlier (eg, 90%, 95%, 99% CI). The overall health policy ‘costs’ of higher specificity and fewer false outliers versus higher sensitivity and more false outliers must be considered, and there is no one correct answer.39 Furthermore, the p values and CIs from traditional frequentist approaches may sometimes be misleading. With very small sample sizes, virtually no provider can be reliably identified as an outlier; conversely, with very large sample sizes, outlying results identified by statistical criteria may have little practical difference from the average. Bayesian approaches may provide more intuitive interpretation, such as estimating the probability that a hospital's performance exceeds some threshold.36,40–42
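A posterior probability of this kind can be sketched with a beta-binomial model. The example assumes a uniform prior on the hospital's mortality rate and no risk adjustment, and it integrates the posterior density numerically so that only the standard library is needed; the counts and the 5% threshold are hypothetical.

```python
from math import exp, lgamma, log


def beta_tail(a, b, threshold, steps=20000):
    """P(X > threshold) for X ~ Beta(a, b), by midpoint-rule integration."""
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)
    width = (1 - threshold) / steps
    total = 0.0
    for i in range(steps):
        x = threshold + (i + 0.5) * width
        total += exp(log_norm + (a - 1) * log(x) + (b - 1) * log(1 - x))
    return total * width


def prob_rate_exceeds(deaths, cases, threshold):
    """Posterior probability that the true rate exceeds threshold (uniform prior)."""
    return beta_tail(deaths + 1, cases - deaths + 1, threshold)


THRESHOLD = 0.05  # hypothetical performance threshold
p_high = prob_rate_exceeds(12, 100, THRESHOLD)
p_borderline = prob_rate_exceeds(5, 100, THRESHOLD)
print(f"12 deaths in 100 cases: P(rate > 5%) = {p_high:.3f}")
print(f" 5 deaths in 100 cases: P(rate > 5%) = {p_borderline:.3f}")
```

A statement such as "the probability that this hospital's mortality exceeds 5%" may be easier for end users to act on than a p value, and it degrades gracefully when the data are sparse.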
The impact of outliers on the reference population (‘expected’ values)
Additional problems with outlier classification can arise if the expected outcome for a particular provider is derived from a relatively small reference population (eg, the cardiac surgery programmes in a particular state). Every provider's outcomes impact not only their own observed value but also the ‘expected’ value (the E in O/E) for their programme, which is based on the reference population to which they belong.13 A substantially aberrant result from one or two providers will expand the range of values that are considered average, and will reduce the likelihood of a truly abnormal outlying provider being correctly classified as such. Several approaches to this problem have been suggested, including replication with posterior predicted p values, and leave-one-out cross validation, in which the expected performance for each hospital is estimated from a model developed from all other hospitals.37
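The leave-one-out idea can be illustrated with a deliberately simplified example in which the 'model' is just the pooled, unadjusted mortality rate of the reference programmes. The five programmes and their counts are hypothetical, and a real application would refit a risk model rather than a crude rate.

```python
def pooled_rate(records):
    """Pooled mortality rate across a list of (deaths, cases) records."""
    return sum(d for d, _ in records) / sum(n for _, n in records)


def oe_ratios(programmes, leave_one_out):
    """O/E ratio per programme, with or without itself in the reference."""
    ratios = {}
    for name, (deaths, cases) in programmes.items():
        if leave_one_out:
            reference = [v for k, v in programmes.items() if k != name]
        else:
            reference = list(programmes.values())
        expected_deaths = pooled_rate(reference) * cases
        ratios[name] = deaths / expected_deaths
    return ratios


# Hypothetical small reference population: (deaths, cases) per programme.
programmes = {
    "A": (3, 100), "B": (3, 100), "C": (3, 100), "D": (3, 100), "E": (13, 100),
}

with_self = oe_ratios(programmes, leave_one_out=False)
without_self = oe_ratios(programmes, leave_one_out=True)
print(f"Programme E, included in reference: O/E = {with_self['E']:.2f}")
print(f"Programme E, leave-one-out:         O/E = {without_self['E']:.2f}")
```

When programme E contributes to its own expected value, its excess mortality is partly absorbed into the reference rate and its O/E ratio is understated.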
Graphical tools for outlier detection
Finally, various graphical methods have also been used to monitor healthcare performance and to determine outliers. These include funnel plots,43,44 in which unadjusted or adjusted point estimates of provider performance are plotted against sample size (volume), with superimposed CIs around the population average to indicate warning or outlier status. Other methods include real time graphical monitoring using cumulative sum (CUSUM) approaches, in which results are immediately updated with each patient or procedure.45–47
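The control limits of a funnel plot can be sketched as below. The example uses a normal approximation to the binomial around an assumed population rate of 4%; exact binomial limits are generally preferable at low volumes, and all figures are hypothetical.

```python
from statistics import NormalDist


def funnel_limits(population_rate, cases, coverage=0.95):
    """Approximate control limits around the population rate at a given volume."""
    z = NormalDist().inv_cdf(0.5 + coverage / 2)
    half_width = z * (population_rate * (1 - population_rate) / cases) ** 0.5
    return max(0.0, population_rate - half_width), min(1.0, population_rate + half_width)


def flag(observed_rate, population_rate, cases, coverage=0.95):
    """Classify a provider relative to the funnel limits at its own volume."""
    lower, upper = funnel_limits(population_rate, cases, coverage)
    if observed_rate > upper:
        return "high"
    if observed_rate < lower:
        return "low"
    return "within limits"


POPULATION_RATE = 0.04  # hypothetical
for cases in (100, 2000):
    lower, upper = funnel_limits(POPULATION_RATE, cases)
    status = flag(0.06, POPULATION_RATE, cases)
    print(f"{cases:>5} cases: limits {lower:.3f} to {upper:.3f}, 6% is {status}")
```

The same observed rate of 6% falls inside the limits at 100 cases and outside them at 2000 cases, which is the narrowing that gives the plot its funnel shape.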
Conclusion
Outlier determination, the final step in the performance measurement process, is a more complicated undertaking than most non-experts appreciate, with many nuances in implementation and interpretation. Those involved in provider profiling have a responsibility to explicitly state the approaches they use for outlier classification, and to explain the proper interpretation of outlier status to end users of varying statistical sophistication. For example, as demonstrated in the study of Paddock et al,9 it should be recognised that while the CMS website is named Hospital Compare, the statistically valid comparison is between each hospital and a hypothetical average hospital, not between pairs of hospitals.
Given the historical lack of comparative performance data in healthcare and the urgent need to foster informed consumer choice and performance improvement, it is understandable that various stakeholders (patients, payers, regulators) might be tempted to view the issue of outliers too simplistically, sometimes misinterpreting or unintentionally misusing outlier results. However, this may lead to consequences that are at least as undesirable as having no performance data at all. Misclassification of providers may misdirect consumers, unfairly discredit or commend certain providers, and lead to misallocation of scarce resources. Scientific rigour and sound judgment are required to accurately classify outliers and to constructively use this information to improve healthcare quality.
Footnotes

Competing interests None.

Provenance and peer review Not commissioned; internally peer reviewed.