Abstract
Statistical hypothesis testing involving the comparison of three or more means and/or proportions is a frequent undertaking in medical statistics. For comparison of means, analysis of variance is a common choice, and for comparison of proportions, χ² tests are common. However, both approaches have important limitations: they require post hoc testing to identify the unusual group(s), and neither provides an integral graphical device for presenting the final results. These limitations are elegantly overcome by the analysis of means, which is widely used in industrial statistics and is illustrated here using means and proportions.
Introduction
Experimental or observational studies in healthcare typically involve the comparison of two or more groups of patients with respect to a pre-specified outcome (eg, weight loss or mortality). The comparison is usually accompanied by a statistical hypothesis test which produces a p value, that is, the probability of the observed, or a more extreme, result given the null hypothesis of no difference.1 But where there are three or more groups under consideration, researchers often resort to more sophisticated methods such as analysis of variance (ANOVA)1 for numerical outcomes (eg, comparing blood pressure in the three groups) or a χ² test for categorical outcomes (eg, comparing infection in the three groups).1 However, although ANOVA and χ² tests are included in introductory medical statistics books1–3 and are widely employed, there are some important issues and limitations associated with their proper use which merit attention.
Analysis of variance
Consider the data in table 1, which are based (via simulation) on data reported by Bewick et al,4 who give the mean Simplified Acute Physiology Scores for four groups of patients admitted to intensive care: those with no infection, those with infection on admission to the intensive care unit (ICU), those with ICU acquired infection, and those with both ICU acquired infection and infection on admission. Figure 1 (left panel) shows the data using box plots. When analysed using a one-way ANOVA, these data produce an F statistic of 3.19 (3 degrees of freedom) and an associated p value of 0.02, leading to the rejection of the null hypothesis of no difference.
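To make the F statistic concrete, the following is a minimal sketch of how it is computed for a one-way layout: the between-group mean square divided by the within-group mean square. The function name `one_way_f` and the toy scores are hypothetical and are not the table 1 data; the sketch stops at the F statistic and does not compute a p value.

```python
from statistics import mean


def one_way_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    `groups` is a list of lists, one list of observations per group.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Between-group sum of squares: spread of group means about the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of observations about their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical toy scores for three groups (NOT the table 1 data).
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_b, df_w = one_way_f(toy_groups)
print(f_stat, df_b, df_w)
```

With four groups, as in table 1, the between-group degrees of freedom are 4 − 1 = 3, which is the figure quoted with the F statistic above.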
Although we can conclude that the variation is statistically significant at the 5% level, our question remains only partially answered because we do not know which patient group(s) is the odd one out. This is because ANOVA (described elsewhere)4 5 tests the omnibus null hypothesis of no difference between groups but does not identify the aberrant group(s). So, ANOVA is incomplete and requires subsequent post hoc testing to identify the unusual group(s). Ideally these post hoc tests should also be pre-specified and make proper statistical allowance for multiple statistical testing.5 At least half a dozen post hoc tests6 (eg, Bonferroni correction and Tukey's honestly significant difference) have been recommended, but with little consensus among researchers the choice is somewhat bewildering. Furthermore, ANOVA does little to aid the assessment of clinically significant differences, partly because ANOVA is essentially a numerical technique with no accompanying graphic.4 The danger with an unplanned post hoc comparison after ANOVA is that it may tempt comparison of the most extreme groups, yielding results which are likely to be too optimistic and potentially biased. If ANOVA does not yield a statistically significant difference, then there is no clear rationale for further analysis,4 5 although there is contradictory advice about this.7 In summary then, it appears that the proper use of ANOVA in practice is less than straightforward.
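As an illustration of what allowance for multiple testing involves, the sketch below applies the Bonferroni correction to all pairwise comparisons among four groups. The helper name `bonferroni_pairwise` and the group labels A to D are assumptions made for this example.

```python
from itertools import combinations


def bonferroni_pairwise(group_labels, alpha=0.05):
    """List every pairwise comparison and the Bonferroni-adjusted per-test alpha."""
    pairs = list(combinations(group_labels, 2))
    # Bonferroni: divide the overall significance level by the number of tests.
    per_test_alpha = alpha / len(pairs)
    return pairs, per_test_alpha


# Four groups (labels assumed) give 4 * 3 / 2 = 6 pairwise comparisons.
pairs, per_test_alpha = bonferroni_pairwise(["A", "B", "C", "D"])
print(len(pairs), per_test_alpha)
```

With four groups there are six pairwise tests, so each would need to reach roughly the 0.008 level for the overall error rate to stay at 0.05.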
Analysis of means
It would of course be attractive to have a more straightforward statistical technique which undertook the omnibus test of the null hypothesis, identified the unusual group(s) and aided the assessment of clinical significance, while simultaneously making allowance for multiple testing without the need to resort to post hoc testing. Fortunately such a technique, almost unheard of in medical statistics, has been developed and is used in industrial statistics. It is called analysis of means (ANOM), and it earned a prestigious prize for its inventor, ER Ott.8
ANOM relies on critical values from a multivariate non-central t distribution, assuming that the individual results are independent and normally distributed with a common variance.9 The general idea is to plot the group means together with upper and lower decision lines that are positioned around the grand mean.
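For a balanced design with k groups of n observations each, the decision lines are commonly written in the ANOM literature in the form below. It is given here as the standard textbook expression under that balanced-design assumption, not as a verbatim reproduction of the authors' formula. Here $\bar{y}_{..}$ is the grand mean, $s$ is the pooled within-group SD (the square root of the ANOVA mean square error), $v = k(n-1)$ is the error degrees of freedom, and $h_{\alpha,k,v}$ is the tabulated ANOM critical value.

$$
\bar{y}_{..} \;\pm\; h_{\alpha,k,v}\; s\,\sqrt{\frac{k-1}{k\,n}}
$$

A group mean falling outside these two lines is declared significantly different from the grand mean at level α.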
The researcher is only required to specify the statistical significance level (α). For the data shown in table 1, there are k=4 groups, α=0.05 and v=400−4=396 degrees of freedom, which yields a critical value9 $h_{\alpha,k,v}$ of 2.47 and ANOM limits of 35.86 (lower) and 42.15 (upper). Figure 1 (right panel) shows the results from ANOM. The output is a remarkable graphic that simultaneously incorporates the mean physiological scores in each group and the null hypothesis statistical test (as shown by the upper and lower ANOM limits with respect to the grand mean), gives insight into the clinical significance of the findings and identifies the aberrant group(s). In this case we conclude that the mean physiological score for group A differs significantly (statistically and clinically) from the grand mean score. A further demonstration of the versatility of ANOM is the ease with which we can include ANOM limits for a more stringent level of statistical significance, such as α=0.01 (using $h_{\alpha,k,v}$=3.01). This helps us to assess statistical significance at different thresholds and serves as a useful reminder that the setting of statistical significance levels is a decision for the researcher.
It is important to appreciate how the null hypothesis differs between ANOM and ANOVA. ANOM compares each group mean with the grand mean, whereas ANOVA tests whether the group means differ from one another, and so the results from ANOVA and ANOM may sometimes differ. This difference in approach highlights the fact that, in considering equality of means, there are several ways to construct the null hypothesis. Similarities with statistical process control charts10 are obvious. ANOM can also be used with imbalanced subgroup sizes and with attribute data.9
χ² test for proportions
When the comparison of three or more groups is based on categorical or attribute data (eg, alive/dead), it is not unusual to see the χ² test being used to test for differences in proportions. For example, table 2 shows the crude mortality in five hospitals at 6 months following admission with a stroke.11 A χ² test shows the variation is statistically significant (χ²=396.6, 9 degrees of freedom, p<0.0001), but once again the test sheds no light on which hospitals are the odd ones out. When we analyse the data using ANOM, now adapted to deal with proportions9 (see figure 2), we clearly see that hospitals A and D have aberrant mortality compared with the grand mean. The ANOM decision limits are computed separately for each hospital, j, where j=1 to k.
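A form commonly given in the ANOM literature for proportions with unequal group sizes is $\bar{p} \pm h\,\sqrt{\bar{p}(1-\bar{p})}\,\sqrt{(N-n_j)/(N n_j)}$, where $\bar{p}$ is the overall proportion, $n_j$ is the number of patients in hospital j and $N$ is the total across hospitals. The sketch below implements that expression as an assumption about the kind of formula intended, relying on the normal approximation to the binomial. The counts are hypothetical (not the table 2 data), and `h=2.6` is a placeholder to be replaced by the appropriate tabulated critical value.

```python
from math import sqrt


def anom_proportion_limits(events, sizes, h):
    """Per-group ANOM decision limits for proportions (normal approximation).

    events[j] is the number of events (eg, deaths) and sizes[j] the number of
    patients in group j; h is a user-supplied critical value.
    """
    total_n = sum(sizes)
    p_bar = sum(events) / total_n  # overall (grand) proportion
    limits = []
    for n_j in sizes:
        half_width = h * sqrt(p_bar * (1 - p_bar)) * sqrt((total_n - n_j) / (total_n * n_j))
        # Clamp to [0, 1] because a proportion cannot fall outside that range.
        limits.append((max(0.0, p_bar - half_width), min(1.0, p_bar + half_width)))
    return p_bar, limits


# Hypothetical counts for five hospitals (NOT the table 2 data).
deaths = [30, 50, 45, 90, 40]
admissions = [200, 250, 220, 300, 230]
p_bar, limits = anom_proportion_limits(deaths, admissions, h=2.6)  # h is a placeholder

for label, d, n_j, (lo, hi) in zip("ABCDE", deaths, admissions, limits):
    rate = d / n_j
    print(label, round(rate, 3), round(lo, 3), round(hi, 3), not (lo <= rate <= hi))
```

Because the half-width shrinks as $n_j$ grows, larger hospitals get narrower limits, which is why the decision lines on a chart for unequal group sizes appear stepped, not straight.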
Limitations of ANOM
ANOM is not widely used in medical statistics, which may reflect its absence from courses, textbooks and statistical software; SAS Software (V.9.2)12 and Minitab (V.5.1.0.0)13 are the exceptions known to the authors. ANOM, like ANOVA, also requires the usual assumptions of normality and constant variance to apply. However, the general advice appears to be that whenever ANOVA and the χ² test are viable and employed for their basic purpose, ANOM may be profitably used as a supplementary or alternative analysis without drawbacks.9 ANOM shows which group mean(s) differ significantly from the overall grand mean, but when the primary questions involve pairwise testing (eg, is group A significantly different from group B and group C?), post hoc testing would still be required. ANOM has not been extended to studies involving repeated measures, which can be analysed, under certain restrictions, by ANOVA, although the multilevel modelling approach is advocated as the preferred analysis. The primary purpose of the statistical hypothesis testing framework is to produce p values and so, just as with ANOVA, further work is required to derive CIs.14
Key messages
Many studies, including quality improvement applications, involve a comparison of three or more groups (eg, mean blood pressure in three patient groups).
The traditional approach in medical statistics has been to compare groups using analysis of variance (ANOVA), but this approach has limitations.
ANOVA tests the omnibus null hypothesis of no difference between the groups but does not identify the aberrant group(s) without subsequent post hoc testing. Moreover, ANOVA does not produce a graphical result.
Analysis of means (ANOM) is a well-established technique in industrial quality improvement that solves both of these problems: it identifies which group(s) has a mean value that differs significantly from the overall average and it naturally lends itself to a graphical display of the groups with the aberrant group(s) appearing outside statistical limits, thereby aiding the assessment of statistical and clinical significance.
ANOM has no limitations that are not also shared by the traditional ANOVA approach.
Analogous arguments favour the use of ANOM for the comparison of proportions over the traditional χ² test.
Footnotes
Competing interests None.
Provenance and peer review Not commissioned; externally peer reviewed.