Better-than-average and worse-than-average hospitals may not significantly differ from average hospitals: an analysis of Medicare Hospital Compare ratings
  1. Susan M Paddock1,
  2. John L Adams2,
  3. Fernando Hoces de la Guardia1
  1. 1RAND Corporation, Santa Monica, California, USA
  2. 2Kaiser Permanente Research, Pasadena, California, USA
  1. Correspondence to Dr Susan M Paddock, RAND Corporation, 1776 Main Street, PO Box 2138, Santa Monica, CA 90470-2138, USA; paddock{at}


Background Public report card designers aim to provide comprehensible provider performance information to consumers. Report cards often display classifications of providers into performance tiers that reflect whether performance is statistically significantly above or below average or not statistically significantly different from average. To further enhance the salience of public reporting to consumers, report card websites often allow a user to compare a subset of selected providers on tiered performance rather than direct statistical comparisons of the providers in a consumer's personal choice set.

Objective We illustrate the differences in conclusions drawn about relative provider performance using tiers versus conducting statistical tests to assess performance differences.

Methods Using publicly available cross-sectional data from Medicare Hospital Compare on three mortality and three readmission outcome measures, we compared each provider in the top or bottom performance tier with those in the middle tier and assessed the proportion of such comparisons that exhibited no statistically significant differences.

Results Across the six outcomes, 1.3–6.1% of hospitals were classified in the top tier. Each top-tier hospital did not statistically significantly differ in performance from at least one mid-tier hospital. The percentages of mid-tier hospitals that were not statistically significantly different from a given top-tier hospital were 74.3–81.1%. The percentages of hospitals classified as bottom tier were 0.6–4.0%. Each bottom-tier hospital showed no statistically significant difference from at least one mid-tier hospital. The percentage of mid-tier hospitals that were not significantly different from a bottom-tier hospital ranged from 60.4% to 74.8%.

Conclusions Our analyses illustrate the need for further innovations in the design of public report cards to enhance their salience for consumers.

  • Performance measures
  • Quality measurement
  • Report cards
  • Statistics



The public reporting of healthcare provider performance is widespread among public and private purchasers of healthcare. The Agency for Healthcare Research and Quality's Health Care Reporting Compendium lists over 200 samples of ‘report cards’ of healthcare provider performance.1 Public report sponsors aim to improve the quality of care and lower its cost by providing consumers with the information needed to select high-performing providers.2 However, the effectiveness of public reports in improving care and patient outcomes is mixed.3–6

Recommendations for enhancing the effectiveness and appeal of public report cards include presenting the information for consumers in as clear and accessible a format as possible.7 8 To this end, public reports often categorise provider performance into tiers based on their relative performance. Sometimes the top tier is defined by a prespecified performance percentile. For example, the top performance tier used by the Office of the Patient Advocate in California in the USA is defined as performance ‘comparable to the top 90% of scores’ on a given quality measure.9 The types of scores used in such summaries might be simple observed averages or averages that are adjusted for patient characteristics. Another approach is to classify providers on the basis of whether their performance is statistically significantly different from average. For example, the US's Medicare Hospital Compare website reports hospital performance relative to the national mean for 30-day readmission and mortality rates. Hospital performance is classified as belonging to one of three tiers: ‘better than US national rate’, ‘worse than US national rate’ or ‘no different from US national rate’. Some public reports supplement such verbal descriptors with graphical symbols. The reporting of performance tiers is also supported by findings that consumer numeracy is associated with the consumer's degree of understanding of public reports10 and that consumers prefer clear and simple summaries of the information presented in public reports, such as star ratings that correspond to performance tiers or rank-ordering by provider performance.11

From the consumer perspective, a side-by-side comparison of selected providers tailors the information in public report cards to a consumer's specific interests and needs, potentially enhancing their appeal.7 For example, the Medicare Hospital Compare website prompts users to ‘Choose up to three hospitals to compare’. Consumers are then able to visually compare the tier assignments for their selected hospitals. The underlying goal of enhancing the ability of consumers to make side-by-side comparisons of select providers on report card websites is to steer consumers toward the best providers. Several other web-based reports described in the Agency for Healthcare Research and Quality's Health Care Reporting Compendium have similar comparison utilities.1

A major limitation of such side-by-side comparison utilities of major public report card websites is that they do not reflect direct comparisons of performance. Provider A might be ‘better than the US average’ and provider B ‘not significantly different from the US average’; however, that does not mean that provider A is significantly better than provider B. Testing whether providers A and B significantly differ from one another requires that one conduct a statistical test that directly compares their performances. In contrast, the tier assignments of providers A and B result from two distinct statistical tests, each of which is only informative about whether each provider significantly differs from the national average. However, the difference between a statistically significant result and a non-significant result is not necessarily significant.12 Thus, for these side-by-side comparisons, the critical comparison between providers A and B is missing from public report cards.
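The fallacy can be made concrete with a small numeric sketch (all figures below are invented for illustration and are not Hospital Compare data): suppose the national mortality rate is 16.0%, provider A's estimate is 14.0% (SE 0.9) and provider B's is 15.0% (SE 1.0).

```python
from math import sqrt

# Illustrative numbers only (not from Hospital Compare).
national = 16.0
a_est, a_se = 14.0, 0.9  # 95% CI (12.2, 15.8) lies wholly below 16.0 -> 'better than US national rate'
b_est, b_se = 15.0, 1.0  # 95% CI (13.0, 17.0) covers 16.0 -> 'no different from US national rate'

# Tier assignments come from two separate tests against the national rate:
print(a_est + 1.96 * a_se < national)                        # True: A is top tier
print(b_est - 1.96 * b_se < national < b_est + 1.96 * b_se)  # True: B is mid tier

# The missing comparison: a direct two-sample z test of A against B.
z = (a_est - b_est) / sqrt(a_se**2 + b_se**2)
print(abs(z) > 1.96)  # False: A and B do not significantly differ (z ≈ -0.74)
```

Even though A and B land in different tiers, the direct test between them is far from significant.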

Given the prevalence of tier-based reporting, the appeal of reporting tiers in lieu of or in concert with more detailed numeric information, and the goal of improving the quality of healthcare by empowering consumers to select providers based on relative performance,13 the objective of this report is to examine the robustness of using tiered performance reports from the perspective of a consumer seeking personalised comparisons of provider performance. We examine the degree to which a consumer might be misled by comparing providers based on performance tiers, illustrating the implications of this using 30-day mortality and readmission outcome measures from Medicare Hospital Compare, and conclude with a discussion of the implications of our findings for the design of public report cards.

Study data and methods


We examine publicly available Medicare Hospital Compare data archived in October 2012. Nearly all hospitals serving Medicare beneficiaries in the USA participate in the programme, thereby avoiding the two-percentage-point payment reduction for non-participation.14 The publicly available performance data for the six risk-adjusted 30-day unplanned readmission and mortality outcome measures presented here reflect patient-level data obtained from Medicare claims and eligibility information. The population of Medicare beneficiaries targeted for these measures comprises those aged 65 or older who were enrolled in fee-for-service Medicare for 12 months before their hospital admission and who were admitted to acute care or Veterans’ Administration hospitals for heart attack, heart failure or pneumonia. The readmission measures additionally require that beneficiaries were enrolled in Medicare for 30 days after their index admission; patients who died during the index admission or who left the hospital against medical advice are excluded from the calculation.15 A single cross-sectional estimate of performance is reported for each measure and each hospital for care provided to Medicare beneficiaries during the 3-year period of 1 July 2008 to 30 June 2011. The six outcome measures considered here are shown in table 1.

Table 1

Summary statistics for Medicare Hospital Compare mortality and readmission measures for the performance period 1 July 2008 through 30 June 2011


For hospitals with at least 25 eligible events, the publicly available Medicare Hospital Compare data contain hospital-specific estimates of mortality and readmission, 95% CIs and national average performance obtained using hierarchical logistic regression modelling with adjustment for patient risk factors.16 The 95% CIs are used to assign hospitals to one of three performance tiers. If the entire 95% CI falls below the national average, the hospital is deemed ‘better than US national rate’. In contrast, when the entire 95% CI exceeds the national average, performance is labelled ‘worse than US national rate’. Hospitals whose CIs include the national rate are labelled ‘no different from US national rate’. To make direct comparisons between pairs of hospitals, we first compute an approximate SE from the reported 95% CIs by dividing the length of each CI reported in Hospital Compare by 2×1.96. We then re-estimate the tier assignments using approximate 95% CIs obtained as a function of the reported hospital mean performance and the approximate SE. Then, for each pairing of one hospital from the ‘better than US national rate’ or ‘worse than US national rate’ tier with one from the ‘no different from US national rate’ tier, we conduct a two-sample z test of whether their performances significantly differ from one another. The test statistic is the difference in performance estimates for a pair of hospitals divided by the estimated SD of that difference. Finally, we compute the average proportion of mid-tier hospitals that are not statistically significantly different from a given top-tier or bottom-tier hospital.
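The calculations described above can be sketched in code. This is a minimal illustration of the described procedure, not the authors' actual implementation; the function names and example numbers are ours.

```python
from math import sqrt

def approx_se(ci_low, ci_high):
    # Recover an approximate SE from a reported 95% CI: length / (2 * 1.96).
    return (ci_high - ci_low) / (2 * 1.96)

def tier(ci_low, ci_high, national):
    # Hospital Compare-style tier assignment from the 95% CI.
    if ci_high < national:
        return 'better than US national rate'  # entire CI below the national rate
    if ci_low > national:
        return 'worse than US national rate'   # entire CI above the national rate
    return 'no different from US national rate'

def differ(est1, se1, est2, se2):
    # Two-sample z test: difference in estimates over the SD of that difference.
    z = (est1 - est2) / sqrt(se1 ** 2 + se2 ** 2)
    return abs(z) > 1.96

# Invented example: a top-tier and a mid-tier hospital whose direct
# pairwise comparison is nonetheless not statistically significant.
national = 16.0
top_est, top_ci = 14.0, (12.2, 15.8)
mid_est, mid_ci = 15.0, (13.0, 17.0)
print(tier(*top_ci, national))  # better than US national rate
print(tier(*mid_ci, national))  # no different from US national rate
print(differ(top_est, approx_se(*top_ci), mid_est, approx_se(*mid_ci)))  # False
```

Applying `differ` over every (top-tier, mid-tier) and (bottom-tier, mid-tier) pairing, and averaging the share of non-significant results, yields the proportions reported below.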

The full patient-level dataset that is analysed to obtain the Medicare Hospital Compare estimates is not publicly available, limiting our ability to perfectly replicate results. However, our approach of computing approximate 95% CIs based on the publicly available hospital-level Hospital Compare data results in low levels of disagreement between the tier assignments assigned by our approach versus those reported by Hospital Compare; specifically, the range of disagreement between tier classifications using the reported versus approximate 95% CIs ranged from 0.9% to 2.6% across the six measures.


Column (a) of table 1 displays the outcome measure considered in the given row, along with the number of hospitals for which the outcome was reported in Hospital Compare. The national rate for each outcome is presented in column (b), and the mean number and range (10th–90th centile) of eligible events per hospital in column (c). Each hospital's performance is compared with the national rate for each outcome. Column (d) shows the number and percentage of hospitals performing significantly better than the US national rate displayed in column (b), column (e) the number and percentage of hospitals not significantly different from the national rate, and column (f) the number and percentage of hospitals significantly worse than the national rate. The proportions of mid-tier hospitals are 90–96% (column (e)). Only 0.6% of hospitals are classified as bottom tier on the heart attack mortality measure and 1.3% as top tier on the pneumonia 30-day readmission measure, although a greater proportion (6.1%) are classified as top tier for the 30-day heart failure mortality rate (column (d)).

Column (b1) of table 2 shows that, for each outcome measure displayed in column (a), 100% of the top-tier hospitals do not significantly differ from at least one mid-tier hospital, while 91.5–94.8% of mid-tier hospitals do not significantly differ from at least one top-tier hospital (table 2, Column (b2)). Column (b3) of table 2 shows that the average percentage of mid-tier hospitals that do not statistically significantly differ from a given top-tier hospital varies between 74.3% and 81.1% across the six measures, reflecting the difference in results obtained by comparing provider tiers versus testing the significance of the difference between a top-tier and a mid-tier provider. Columns (c1)–(c3) of table 2 show analogous results for the comparison of bottom-tier and mid-tier providers. For each measure, 100% of bottom-tier hospitals do not significantly differ from at least one mid-tier hospital (table 2, column (c1)), while the percentage of mid-tier hospitals that do not significantly differ from at least one bottom-tier hospital ranges from 83.1% to 90.8% across the measures (table 2, column (c2)). Column (c3) of table 2 shows that the average percentage of mid-tier hospitals that do not statistically significantly differ from a given bottom-tier hospital is 67.6% to 74.8% across the six measures. Although not shown in the table, all pairwise comparisons between top- and bottom-tier hospitals are statistically significant. Thus, for this particular dataset and this scenario, the comparisons made using tier assignments and pairwise significance tests of top versus bottom tier providers agree.

Table 2

Summary statistics for Medicare Hospital Compare mortality and readmission measures for the performance period 1 July 2008 through 30 June 2011

Table 3 presents a sensitivity analysis to explore whether having twice as much data on each provider would reduce the percentages shown in columns (b2–b3) and (c2–c3) in table 2. Such a scenario would occur for instance if the data collection period were twice as long and the estimates across the two periods were independent. In such a scenario, there would be a greater number of top-tier and bottom-tier hospitals, although similar to the results presented in table 2, virtually all such hospitals would not significantly differ from at least one mid-tier hospital (columns b1 and c1 of table 3). The proportions of mid-tier hospitals not significantly different from at least one top-tier (column b2, table 3) or bottom-tier (column c2, table 3) hospital are comparable to those shown in table 2. The proportions of mid-tier hospitals that are not significantly different from a given top-tier or bottom-tier hospital (columns b3 and c3, respectively, of table 3) are lower than those shown in table 2, ranging from 50.4% to 70.9%.

Table 3

Illustration of how results might differ with twice the amount of information in the data
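The mechanics behind this doubled-information scenario can be sketched as follows: pooling two independent, equally precise reporting periods halves the variance of the estimate, so the SE shrinks by a factor of √2 and more hospitals reach significance against the national rate. The numbers below are illustrative, not from the article's tables.

```python
from math import sqrt

se_one = 1.0               # SE with one reporting period (illustrative)
se_two = se_one / sqrt(2)  # SE after pooling two independent, equal periods

diff = -1.5  # a hospital 1.5 points below the national rate (illustrative)
print(abs(diff / se_one) > 1.96)  # False: mid tier with one period of data
print(abs(diff / se_two) > 1.96)  # True: top tier with doubled information
```

The same shrinkage applies to the SD of a pairwise difference, which is why the proportions in columns b3 and c3 of table 3 fall relative to table 2 while remaining substantial.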


The goal of public reporting is to improve the quality of healthcare by empowering consumers to choose the best providers.13 Personalising public report cards to consumers’ interests might enhance their appeal and ultimately their effectiveness.7 Reports of relative performance might be more useful to consumers than comparisons of individual providers with national benchmarks.2 Current web-based report cards may, implicitly and explicitly, encourage consumers to view the provided comparisons of provider performance tiers as more personally tailored to their own decision-making than they really are. In the Medicare Hospital Compare example shown here, top-tier and mid-tier hospitals did not significantly differ in performance about 75% of the time. While the level of ‘acceptable’ disagreement in this context is an open question, studies suggest that patients might tolerate misclassification rates of physician profiles of 5–20%.17 From the patient perspective, there is a risk of needlessly disrupting patient–provider relationships when there is little evidence of meaningful differences between pairs of consumer-selected providers.2 Over the long term, failing to make these seemingly personalised comparisons truly personal risks eroding confidence in public report cards if consumer actions are based on misleading information.

We also illustrated that the disagreement between tier assignment comparisons and direct statistical testing might be reduced somewhat by increasing the amount of information available in the data; however, the fundamental issue remains that direct comparisons of providers belonging to different performance tiers are often not statistically significant. Even if such a solution were technically feasible, practical considerations might impede its uptake. For example, as one reporting period for the Medicare Hospital Compare outcomes reflects 3 years of performance, two reporting periods would reflect 6 years of performance, making it more likely that quality improvement initiatives, technology improvements or other changes would occur in the interim and render the results less relevant to stakeholders.18

Our analysis focused on the hierarchical modelling-based classification of hospitals into three tiers used by Medicare Hospital Compare. However, our findings apply more generally to any report card in which providers are classified into tiers on the basis of examining the overall distribution of provider performance. For example, providers flagged as outliers using a funnel plot would not necessarily significantly differ from a given non-outlier provider,19 and providers that exceed a predetermined performance threshold (eg, 75th centile) would not necessarily significantly differ from those that do not.20 Although the publicly available Medicare Hospital Compare estimates were risk-adjusted, it is important to note that our findings apply to both risk-adjusted and non-adjusted performance estimates; information about the strengths and limitations of risk adjustment strategies is available elsewhere.21 22

To improve the clarity and salience of public report cards, several changes could be made. One strategy would be for report cards simply to list the ‘top’, ‘average’ and ‘low-performing’ providers as determined by performance tiers and to remove tools that allow seemingly personalised comparisons of performance tiers. Consumers might still misinterpret the tiers and make their own personalised comparisons, but at least the report card website design would not encourage them to do so; report cards might also include an explicit caution against it. However, given the importance of providing salient information to consumers, alternatives that allow consumers to make valid cross-tier comparisons should be considered. For example, provider performance data and/or direct provider comparisons could be preloaded or computed on demand. Given modern computational tools, this is feasible: once the analysis data and the datasets containing the requisite hierarchical modelling output were loaded, computations using V.3.0.1 of the R Language and Environment for Statistical Computing23 for all six measures shown in table 1 took 14 s on a 64-bit machine with two 2.9 GHz processors and 8 GB of RAM, and the object storing the pairwise test results for a single measure was about 135 MB.
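To give a sense of the computational scale involved, the sketch below runs every pairwise z test for one measure on synthetic data. It is written in Python rather than the authors' R code, and the hospital count, estimates and SEs are all invented.

```python
from itertools import combinations
from math import sqrt
import random

random.seed(0)
n = 500  # hypothetical number of hospitals reporting one measure
ests = [random.gauss(16.0, 1.0) for _ in range(n)]   # synthetic performance estimates
ses = [random.uniform(0.5, 1.5) for _ in range(n)]   # synthetic standard errors

# Every pairwise two-sample z test for this measure:
n_significant = sum(
    abs((ests[i] - ests[j]) / sqrt(ses[i] ** 2 + ses[j] ** 2)) > 1.96
    for i, j in combinations(range(n), 2)
)
n_pairs = n * (n - 1) // 2
print(n_pairs)  # 124750 pairwise tests, computed in a fraction of a second
```

Even an interpreted, unvectorised loop over n(n−1)/2 pairs completes almost instantly at this scale, consistent with the feasibility of computing such comparisons on demand.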

Although this report focuses on pairwise comparisons, we are aware of several public reporting websites that allow consumers to directly ‘compare’ multiple providers. However, our findings have implications for all consumer-designated provider performance comparisons. The focus on pairwise comparisons in this report will provide the most optimistic assessment of how well a consumer might do. This is clear if one considers the extreme case of comparing all providers simultaneously through rank-ordering their performances. Estimates of hospital performance ranks are subject to variation that is large enough to make it difficult, if not impossible, to differentiate performance among providers except for the most extreme performers.24 ,25

A potential limitation is that we do not correct for the inflation of Type I error (ie, falsely identifying statistically significant differences) associated with conducting multiple testing of pairwise provider differences. We instead take the perspective of an individual consumer who is interested in examining a specific pair of providers of his/her choosing. This is the implicit expectation of public report designers, given the current practice in public reporting of not adjusting for pairwise comparisons that reflect multiple comparisons. While this is a methodological topic worthy of future study, our analysis could be regarded as illustrating a best-case scenario in terms of how well provider performance is differentiated for providers classified to different tiers.


Comparing the performance of selected hospitals using performance tiers can lead to misleading results. In the majority of cases, performance for hospitals reported as either top tier or bottom tier on Medicare's Hospital Compare website did not significantly differ from that of mid-tier hospitals. These findings highlight the need for innovations in the design of public report cards to enhance their salience for consumers. At a minimum, report cards need clearer explanations to avoid misleading consumers into thinking that all top- or bottom-tier hospitals necessarily have meaningfully different performances from a mid-tier hospital.


The authors thank the editors and two anonymous referees for their helpful suggestions.




  • Contributors SMP: conceptualised and designed the study, oversaw statistical analyses, and drafted the initial manuscript. JLA: contributed to the conceptualisation of the study and provided comments on analysis output and draft. FHdlG: analysed the data and reviewed and provided comments on the manuscript.

  • Funding This work is supported by grant 1 R21 HS021860, ‘Innovations in the Science of Public Reporting of Provider Performance’, from the Agency for Healthcare Research and Quality. The content is solely the responsibility of the authors and does not necessarily represent the official views of the Agency for Healthcare Research and Quality.

  • Competing interests None.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Data sharing statement The authors used publicly available data from, file name: (archived 10/01/2012).

Linked Articles

  • Editorial
    David M Shahian, Sharon-Lise T Normand