Original articles
Dealing with missing data in observational health care outcome analyses
Introduction
Observational outcomes studies appear frequently in the clinical and health services research literature. The objectives of these studies typically are hypothesis generation about optimum management of illness, or analyses of the quality of medical care. As Iezzoni notes, meaningful assessments of patients' outcomes in observational studies require two basic procedures [1]: a reliable and accurate measure of the outcome itself, and a method of adjusting for factors affecting that outcome, other than the variable(s) of primary interest. For example, where mortality is the outcome under scrutiny, multivariable models are constructed to determine which variables predict individual patients' probabilities of dying, and the expected mortality rates for two or more groups of patients.
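To make the risk-adjustment step concrete, the following is a minimal sketch of how a fitted logistic model yields each patient's predicted probability of death and an expected mortality rate for a group. The intercept, coefficients, variable names, and patient records are hypothetical values chosen for illustration; they are not taken from any model in this article.

```python
import math

# Hypothetical logistic regression coefficients (illustration only).
INTERCEPT = -4.0
COEFS = {"diabetes": 0.4, "heart_failure": 0.9, "renal_disease": 0.7}


def predicted_risk(patient):
    """Predicted probability of death from the logistic model."""
    linear_predictor = INTERCEPT + sum(
        beta * patient.get(name, 0) for name, beta in COEFS.items()
    )
    return 1.0 / (1.0 + math.exp(-linear_predictor))


def expected_mortality_rate(patients):
    """Expected rate for a group: the mean of individual predicted risks."""
    return sum(predicted_risk(p) for p in patients) / len(patients)


# Two hypothetical groups with different case mix.
group_a = [{"diabetes": 0}, {"diabetes": 1}, {"heart_failure": 1}]
group_b = [
    {"diabetes": 1, "heart_failure": 1},
    {"renal_disease": 1, "heart_failure": 1},
    {"diabetes": 1},
]

print(f"Expected mortality, group A: {expected_mortality_rate(group_a):.3f}")
print(f"Expected mortality, group B: {expected_mortality_rate(group_b):.3f}")
```

Comparing each group's observed mortality with its expected rate is what allows outcomes to be judged net of case mix. Note that `patient.get(name, 0)` silently treats an absent covariate as 0, which is one of the missing-data choices examined in this article.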
Much of the published outcomes research in health care relies on administrative databases with limited clinical information about patients. Multivariable risk adjustment based on administrative data is therefore constrained from the outset by the lack of details on important prognostic factors. Clinical databases are better able to explain interprovider differences in outcomes than are administrative databases. As Hannan et al. [2] have demonstrated, the advantage of clinical databases comes from the ability to select and capture prospectively those clinical variables that are important prognostically and have no comparable diagnostic code in administrative data. However, when more detailed databases are developed, costs rise as do the chances that some data for some patients will not be collected. Cases with missing values for any one of the variables entered into a model unfortunately cannot be used in multivariable analysis, unless imputation is used.
Common methods for handling missing covariate values include stratification on missing data status, conditional-mean imputation, and complete subject analysis in logistic regression. These methods can be biased under reasonable circumstances and are often unsatisfactory [3]. More sophisticated methods include multiple-imputation methods, maximum likelihood or pseudo maximum likelihood methods, and weighted estimating equation methods [3]. However, the validity of all methods for handling missing data depends on meeting certain assumptions [3], the most stringent being the assumption that the data as a whole are “missing completely at random” (i.e., whether or not a given variable is missing is entirely independent of the values of other variables, and also independent of whether other variables are missing). A less stringent assumption is that the data are “missing at random” (i.e., whether or not a given variable is missing is entirely independent of the values of any other unobserved variables, although it can depend on the values of observed variables). In their review of methods for handling missing covariates in epidemiology, Greenland and Finkle argue that if the “missing at random” assumption fails, none of the above mentioned missing data methods can be applied [3].
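The distinction between the two assumptions can be illustrated with a small simulation. The sketch below is an illustration under assumed values, not an analysis from this article: `z` stands for a fully observed covariate, `x` for a binary covariate subject to missingness, and all probabilities are hypothetical. It shows that complete-subject analysis recovers the prevalence of `x` when data are missing completely at random, is biased when missingness depends on the observed `z`, and can be corrected in the latter case by using `z`.

```python
import random

random.seed(0)
N = 20000

# z: fully observed covariate; x: covariate subject to missingness.
# x is more common when z = 1 (hypothetical prevalences 0.2 and 0.6).
patients = []
for _ in range(N):
    z = 1 if random.random() < 0.5 else 0
    x = 1 if random.random() < (0.2 + 0.4 * z) else 0
    patients.append((z, x))

true_prev = sum(x for _, x in patients) / N


def observed_mcar(p_missing=0.3):
    """Missing completely at random: same missingness probability for all."""
    return [(z, None if random.random() < p_missing else x) for z, x in patients]


def observed_mar(p_missing_by_z=(0.1, 0.6)):
    """Missing at random: missingness depends only on the observed z."""
    return [
        (z, None if random.random() < p_missing_by_z[z] else x)
        for z, x in patients
    ]


def complete_case_prevalence(rows):
    """Prevalence of x among subjects with x observed."""
    seen = [x for _, x in rows if x is not None]
    return sum(seen) / len(seen)


def stratified_prevalence(rows):
    """Within-stratum observed prevalence, averaged over all subjects' z."""
    total = 0.0
    for level in (0, 1):
        stratum = [x for z, x in rows if z == level]
        seen = [x for x in stratum if x is not None]
        total += (len(stratum) / len(rows)) * (sum(seen) / len(seen))
    return total


mcar_cc = complete_case_prevalence(observed_mcar())
mar_rows = observed_mar()
mar_cc = complete_case_prevalence(mar_rows)
mar_strat = stratified_prevalence(mar_rows)

print(f"True prevalence:             {true_prev:.3f}")
print(f"Complete cases under MCAR:   {mcar_cc:.3f}")
print(f"Complete cases under MAR:    {mar_cc:.3f}")
print(f"Stratified on z under MAR:   {mar_strat:.3f}")
```

If missingness instead depended on the unobserved value of `x` itself, no adjustment based on `z` alone would remove the bias, which is the situation in which the standard methods fail.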
We recently faced the problems of nonrandom patterns of missing data in a new clinical registry. One conventional response in this situation is to exclude cases with missing data, but herein lies a catch-22. If the data are nonrandomly missing, then the impact of exclusion will be nonrandom, with resultant biases in any analyses. Another approach is to impute the lowest level of severity for a given missing variable. In this instance, the goal is to provide an incentive for participating centers or health care providers to be more assiduous about data capture in the future. A third possibility is using alternative data sources to “fill in the blanks.” For example, Smith et al. [4] recently demonstrated that significantly more accurate estimates of probabilities of death are possible with administrative data when limited clinical information from clinical databases is merged with the administrative data. The converse (using administrative data to fill in gaps in clinical registry data) is also feasible.
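The three strategies can be sketched on a toy dataset. The records, the `diabetes` variable, the administrative lookup, and the function names below are all hypothetical and constructed for illustration; they do not reproduce the APPROACH data or its results.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[int]]

# Hypothetical clinical registry records; None marks a missing value.
clinical: List[Record] = [
    {"id": 1, "diabetes": 1},
    {"id": 2, "diabetes": 0},
    {"id": 3, "diabetes": None},
    {"id": 4, "diabetes": 0},
    {"id": 5, "diabetes": None},
    {"id": 6, "diabetes": 1},
    {"id": 7, "diabetes": None},
    {"id": 8, "diabetes": 0},
]

# Hypothetical administrative data: patient id -> coded diabetes status.
admin_diabetes: Dict[int, int] = {3: 1, 5: 0, 7: 1}


def exclude_missing(records: List[Record], var: str) -> List[Record]:
    """Strategy 1: drop cases with a missing value."""
    return [r for r in records if r[var] is not None]


def default_to_zero(records: List[Record], var: str) -> List[Record]:
    """Strategy 2: impute the lowest severity level (0) when missing."""
    return [{**r, var: 0 if r[var] is None else r[var]} for r in records]


def enhance_from_admin(
    records: List[Record], var: str, admin: Dict[int, int]
) -> List[Record]:
    """Strategy 3: fill gaps from administrative data where available."""
    enhanced = []
    for r in records:
        value = r[var]
        if value is None:
            value = admin.get(r["id"])  # stays None if no administrative record
        enhanced.append({**r, var: value})
    return enhanced


def prevalence(records: List[Record], var: str) -> float:
    """Prevalence among records with a known value."""
    known = [r[var] for r in records if r[var] is not None]
    return sum(known) / len(known)


excluded = exclude_missing(clinical, "diabetes")
zeroed = default_to_zero(clinical, "diabetes")
enhanced = enhance_from_admin(clinical, "diabetes", admin_diabetes)

print(len(excluded), prevalence(excluded, "diabetes"))
print(len(zeroed), prevalence(zeroed, "diabetes"))
print(len(enhanced), prevalence(enhanced, "diabetes"))
```

In this constructed example, exclusion shrinks the sample, defaulting to zero keeps every case but lowers the apparent prevalence, and the administrative merge keeps every case while recovering conditions that were present but unrecorded.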
We have tested each of these three strategies for dealing with missing data, and report here on the findings. We also reflect on the lessons that other health services researchers might draw from our experience.
APPROACH project
The Alberta Provincial Project for Outcome Assessment in Coronary Heart Disease (APPROACH Project) is a province-wide inception cohort of all adult Alberta residents undergoing cardiac catheterization for ischemic heart disease. The APPROACH project was initiated to study provincial outcomes of care and facilitate quality assurance/quality improvement for patients with coronary artery disease in Alberta. The APPROACH database contains detailed clinical information on adult patients with known
Results
A total of 6065 patients (71.5% male) with a mean age of 62.1 years (SD = 11.3 years) were used for these analyses. Table 3 indicates the prevalence of the predictor variables in each of the three datasets examined in our analysis. With the exception of prior PTCA, CABG, and lytic therapy, the enhanced database demonstrates a consistently higher prevalence for each of the predictor variables. This suggests that assuming a negative or 0 code when data are missing or unknown underestimates the true
Discussion
Prospective clinical databases like APPROACH are potentially valuable tools for studying outcomes of health care. However, missing data present major challenges to researchers wishing to develop risk adjustment algorithms to take advantage of clinical databases. As noted earlier, the standard methodologies for handling missing data presuppose that the data are at least “missing at random”—an assumption that is frequently violated in clinical and health care research. In the APPROACH database,
Acknowledgements
The authors thank the members of the APPROACH Project Clinical Steering Committee for their continued input and support: Principal Investigator, Dr. Merril Knudtson (Foothills Hospital), Chairperson: Dr. Vladamir Dzavik (University of Alberta Hospital), Dr. Neil Brass (Royal Alexandra Hospital), Dr. William Ghali (University of Calgary), Dr. Dennis Humen (University of Alberta Hospital), Dr. Arvind Koshal (University of Alberta Hospital), Dr. Robert Lesoway (Foothills Hospital), Dr. Andrew
References (16)
- et al. A new method of classifying prognostic comorbidity in longitudinal studies. J Chron Dis (1987).
- et al. Using Medicare claims data to assess provider quality for CABG surgery: does it work well enough? Health Services Res (1997).
- et al. A critical look at methods for handling missing covariates in epidemiologic regression analyses. Am J Epidemiol (1995).
- et al. Using clinical variables to estimate the risk of clinical mortality. Med Care (1991).
- Smith LR, Harrell FE Jr, Rankin JS, et al. Determinants of early versus late cardiac death in patients undergoing...
- International classification of diseases, 9th revision (clinical modification). Washington: Public Health Service, US...
- Promises and limitations of the Patient Outcome Research Teams: the low-back pain example. Proc Assoc Am Physicians (1995).