Original articles
Dealing with missing data in observational health care outcome analyses

https://doi.org/10.1016/S0895-4356(99)00181-X

Abstract

Observational outcome analyses appear frequently in the health research literature. For such analyses, clinical registries are preferred to administrative databases. Missing data are a common problem in any clinical registry, and pose a threat to the validity of observational outcomes analyses. Faced with missing data in a new clinical registry, we compared three possible responses: exclude cases with missing data; assume that the missing data indicated absence of risk; or merge the clinical database with an existing administrative database. The predictive model derived using the merged data showed a higher C statistic (C = 0.770), better model goodness-of-fit as measured in a decile-of-risk analysis, the largest gradient of risk across deciles (46.3), and the largest decrease in deviance (−2 log likelihood = 406.2). The superior performance of the enhanced data model supports the use of this “enhancement” methodology and bears consideration when researchers are faced with nonrandom missing data.

Introduction

Observational outcomes studies appear frequently in the clinical and health services research literature. The objectives of these studies typically are hypothesis generation about optimum management of illness, or analyses of the quality of medical care. As Iezzoni notes, meaningful assessments of patients' outcomes in observational studies require two basic procedures [1]: a reliable and accurate measure of the outcome itself, and a method of adjusting for factors affecting that outcome, other than the variable(s) of primary interest. For example, where mortality is the outcome under scrutiny, multivariable models are constructed to determine which variables predict individual patients' probabilities of dying, and the expected mortality rates for two or more groups of patients.
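To make the idea of risk-adjusted comparison concrete, the following minimal sketch contrasts observed and expected mortality rates for two groups of patients. It assumes that each patient already carries a predicted probability of death from some multivariable model; the field names (`group`, `died`, `p_death`) and the four example patients are hypothetical and are not drawn from any study data.

```python
from collections import defaultdict


def expected_vs_observed(patients):
    """Return {group: (observed_rate, expected_rate)}.

    Each patient is a dict with hypothetical keys:
      'group'   - the provider or treatment group being compared
      'died'    - 1 if the patient died, 0 otherwise
      'p_death' - predicted probability of death from a risk model
    """
    # Per group: [number of deaths, sum of predicted risks, number of patients]
    totals = defaultdict(lambda: [0, 0.0, 0])
    for p in patients:
        t = totals[p["group"]]
        t[0] += p["died"]
        t[1] += p["p_death"]
        t[2] += 1
    return {g: (d / n, e / n) for g, (d, e, n) in totals.items()}


# Invented example patients, for illustration only.
patients = [
    {"group": "A", "died": 1, "p_death": 0.30},
    {"group": "A", "died": 0, "p_death": 0.10},
    {"group": "B", "died": 0, "p_death": 0.05},
    {"group": "B", "died": 0, "p_death": 0.15},
]
rates = expected_vs_observed(patients)
for group, (observed, expected) in sorted(rates.items()):
    print(group, round(observed, 3), round(expected, 3))
```

The expected rate for a group is simply the mean of its patients' predicted probabilities, so a group whose observed rate exceeds its expected rate has done worse than its case mix alone would predict.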

Much of the published outcomes research in health care relies on administrative databases with limited clinical information about patients. Multivariable risk adjustment based on administrative data is therefore constrained from the outset by the lack of details on important prognostic factors. Clinical databases are better able to explain interprovider differences in outcomes than are administrative databases. As Hannan et al. [2] have demonstrated, the advantage of clinical databases comes from the ability to select and capture prospectively those clinical variables that are important prognostically and have no comparable diagnostic code in administrative data. However, when more detailed databases are developed, costs rise as do the chances that some data for some patients will not be collected. Cases with missing values for any one of the variables entered into a model unfortunately cannot be used in multivariable analysis, unless imputation is used.

Common methods for handling missing covariate values include stratification on missing data status, conditional-mean imputation, and complete subject analysis in logistic regression. These methods can be biased under reasonable circumstances and are often unsatisfactory [3]. More sophisticated methods include multiple-imputation methods, maximum likelihood or pseudo maximum likelihood methods, and weighted estimating equation methods [3]. However, the validity of all methods for handling missing data depends on meeting certain assumptions [3], the most stringent being the assumption that the data as a whole are “missing completely at random” (i.e., whether or not a given variable is missing is entirely independent of the values of other variables, and also independent of whether other variables are missing). A less stringent assumption is that the data are “missing at random” (i.e., whether or not a given variable is missing is entirely independent of the values of any other unobserved variables, although it can depend on the values of observed variables). In their review of methods for handling missing covariates in epidemiology, Greenland and Finkle argue that if the “missing at random” assumption fails, none of the above-mentioned missing data methods can be applied [3].
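The distinction between these assumptions can be illustrated with a small simulation. The sketch below is an illustration only: the covariate names (`urgent`, `factor`), the prevalences, and the missingness probabilities are all invented, and the label "MNAR" (missing not at random) is used here for the case in which the "missing at random" assumption fails. It compares the true prevalence of a binary risk factor with the prevalence seen in a complete subject analysis under each mechanism.

```python
import random


def simulate(mechanism, n=20000, seed=0):
    """Return a list of (urgent, factor, missing) tuples.

    urgent  - an always-observed covariate (hypothetical)
    factor  - a binary risk factor whose value may go unrecorded
    missing - True when the factor was not recorded
    All probabilities below are invented for illustration.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        urgent = rng.random() < 0.4
        # The risk factor is more common among urgent cases.
        factor = rng.random() < (0.5 if urgent else 0.2)
        if mechanism == "MCAR":
            p_miss = 0.3                      # unrelated to any variable
        elif mechanism == "MAR":
            p_miss = 0.5 if urgent else 0.1   # depends only on an observed covariate
        elif mechanism == "MNAR":
            p_miss = 0.1 if factor else 0.5   # depends on the unrecorded value itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        rows.append((urgent, factor, rng.random() < p_miss))
    return rows


def prevalence(rows, complete_only):
    """Prevalence of the factor, optionally restricted to complete cases."""
    kept = [factor for (_, factor, missing) in rows
            if not (complete_only and missing)]
    return sum(kept) / len(kept)


results = {}
for mechanism in ("MCAR", "MAR", "MNAR"):
    rows = simulate(mechanism)
    results[mechanism] = (prevalence(rows, False), prevalence(rows, True))
    print(mechanism, [round(x, 3) for x in results[mechanism]])
```

Under these invented settings the complete-case prevalence stays close to the truth only in the MCAR case. In the MAR case the overall figure drifts, but the drift is explained by the observed covariate and can in principle be corrected; in the MNAR case no observed variable accounts for it.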

We recently faced the problems of nonrandom patterns of missing data in a new clinical registry. One conventional response in this situation is to exclude cases with missing data, but herein lies a catch-22. If the data are nonrandomly missing, then the impact of exclusion will be nonrandom, with resultant biases in any analyses. Another approach is to impute the lowest level of severity for a given missing variable. In this instance, the goal is to provide an incentive for participating centers or health care providers to be more assiduous about data capture in the future. A third possibility is using alternative data sources to “fill in the blanks.” For example, Smith et al. [4] recently demonstrated that significantly more accurate estimates of probabilities of death are possible with administrative data when limited clinical information from clinical databases is merged with the administrative data. The converse (using administrative data to fill in gaps in clinical registry data) is also feasible.
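The three responses described above can be sketched on toy records as follows. This is a hedged illustration, not the procedure used for the registry: the record layout, the field names (`id`, `diabetes`, `copd`), and the linkage by a shared patient identifier are all assumptions, and missing values are represented as `None`.

```python
def exclude_incomplete(registry, fields):
    """Strategy 1: drop every case with a missing value in any field."""
    return [r for r in registry if all(r[f] is not None for f in fields)]


def assume_absent(registry, fields):
    """Strategy 2: treat a missing value as absence of the risk factor (0)."""
    return [{**r, **{f: 0 for f in fields if r[f] is None}} for r in registry]


def enhance_from_admin(registry, admin, fields):
    """Strategy 3: fill gaps from an administrative source keyed by patient id.

    A value stays None when the administrative source has nothing either.
    """
    enhanced = []
    for r in registry:
        source = admin.get(r["id"], {})
        filled = dict(r)  # copy, so the registry itself is not modified
        for f in fields:
            if filled[f] is None:
                filled[f] = source.get(f)
        enhanced.append(filled)
    return enhanced


# Invented records, for illustration only.
fields = ["diabetes", "copd"]
registry = [
    {"id": 1, "diabetes": 1, "copd": 0},
    {"id": 2, "diabetes": None, "copd": 0},
    {"id": 3, "diabetes": None, "copd": None},
]
admin = {
    2: {"diabetes": 1, "copd": 0},
    3: {"diabetes": 0, "copd": 1},
}

excluded = exclude_incomplete(registry, fields)
zero_filled = assume_absent(registry, fields)
enhanced = enhance_from_admin(registry, admin, fields)
print(len(excluded), zero_filled[1]["diabetes"], enhanced[1]["diabetes"])
```

In this toy example, exclusion keeps one of three patients, the zero-fill strategy records patient 2 as non-diabetic, and the merged data record the same patient as diabetic, which shows how the choice of strategy can change the apparent prevalence of a risk factor.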

We have tested each of these three strategies for dealing with missing data, and report here on the findings. We also reflect on the lessons that other health services researchers might draw from our experience.

Section snippets

APPROACH project

The Alberta Provincial Project for Outcome Assessment in Coronary Heart Disease (APPROACH Project) is a province-wide inception cohort of all adult Alberta residents undergoing cardiac catheterization for ischemic heart disease. The APPROACH project was initiated to study provincial outcomes of care and facilitate quality assurance/quality improvement for patients with coronary artery disease in Alberta. The APPROACH database contains detailed clinical information on adult patients with known

Results

A total of 6065 patients (71.5% male) with a mean age 62.1 years (SD = 11.3 years) were used for these analyses. Table 3 indicates the prevalence of the predictor variables in each of the three datasets examined in our analysis. With the exception of prior PTCA, CABG, and lytic therapy, the enhanced database demonstrates a consistently higher prevalence for each of the predictor variables. This suggests that assuming a negative or 0 code when data are missing or unknown underestimates the true

Discussion

Prospective clinical databases like APPROACH are potentially valuable tools for studying outcomes of health care. However, missing data present major challenges to researchers wishing to develop risk adjustment algorithms to take advantage of clinical databases. As noted earlier, the standard methodologies for handling missing data presuppose that the data are at least “missing at random”—an assumption that is frequently violated in clinical and health care research. In the APPROACH database,

Acknowledgements

The authors thank the members of the APPROACH Project Clinical Steering Committee for their continued input and support: Principal Investigator, Dr. Merril Knudtson (Foothills Hospital), Chairperson: Dr. Vladamir Dzavik (University of Alberta Hospital), Dr. Neil Brass (Royal Alexandra Hospital), Dr. William Ghali (University of Calgary), Dr. Dennis Humen (University of Alberta Hospital), Dr. Arvind Koshal (University of Alberta Hospital), Dr. Robert Lesoway (Foothills Hospital), Dr. Andrew

References (16)

  • M.E. Charlson et al. A new method of classifying prognostic comorbidity in longitudinal studies. J Chron Dis (1987)
  • E.L. Hannan et al. Using Medicare claims data to assess provider quality for CABG surgery: does it work well enough? Health Services Res (1997)
  • S. Greenland et al. A critical look at methods for handling missing covariates in epidemiologic regression analyses. Am J Epidemiol (1995)
  • D.W. Smith et al. Using clinical variables to estimate the risk of clinical mortality. Med Care (1991)
  • Smith LR, Harrell FE Jr, Rankin JS, et al. Determinants of early versus late cardiac death in patients undergoing...
  • International classification of diseases, 9th revision (clinical modification). Washington: Public Health Service, US...
  • R.A. Deyo. Promises and limitations of the Patient Outcome Research Teams: the low-back pain example. Proc Assoc Am Physicians (1995)
There are more references available in the full text version of this article.
