Original article
Attrition in longitudinal studies: How to deal with missing data

https://doi.org/10.1016/S0895-4356(01)00476-0

Abstract

The purpose of this paper was to illustrate the influence of missing data on the results of longitudinal statistical analyses [i.e., MANOVA for repeated measurements and Generalised Estimating Equations (GEE)] and to illustrate the influence of using different imputation methods to replace missing data. Besides a complete dataset, four incomplete datasets were considered: two datasets with 10% missing data and two datasets with 25% missing data. In both situations, missingness was considered either independent of or dependent on observed data. Imputation methods were divided into cross-sectional methods (i.e., mean of series, hot deck, and cross-sectional regression) and longitudinal methods (i.e., last value carried forward, longitudinal interpolation, and longitudinal regression). In addition, the multiple imputation method was applied and discussed. The analyses were performed on a particular (observational) longitudinal dataset, with particular missing data patterns and imputation methods. The results of this illustration show that when MANOVA for repeated measurements is used, imputation methods are highly recommendable (because MANOVA, as implemented in the software used, applies listwise deletion of cases with a missing value). When GEE analysis was applied, imputation methods were not necessary. When imputation methods were used, longitudinal imputation methods were often preferable to cross-sectional imputation methods, in that the point estimates and standard errors were closer to the estimates derived from the complete dataset. Furthermore, this study showed that the theoretically more valid multiple imputation method did not lead to different point estimates than the simpler (longitudinal) imputation methods. However, the estimated standard errors appeared to be theoretically more adequate, because they reflect the uncertainty in estimation caused by missing values.

Introduction

One of the main methodological problems in longitudinal studies is attrition. Attrition, missing data, and dropout are all terms used for the situation in which not all N subjects have data on all T repeated measurements [1], [2], [3]. Attrition is generally seen at the end of a longitudinal study, although it is also possible that subjects miss a particular measurement and then return to the study at the next follow-up.

A few decades ago, few methods were available to analyse longitudinal data. The available methods (e.g., MANOVA for repeated measurements) had a major drawback: if one of the repeated measurements was missing, all other available data of that subject were excluded from the analysis as well. To overcome this problem, imputation methods for missing data have been developed [3]. With today's more sophisticated methods to analyse longitudinal data, such as Generalised Estimating Equations (GEE), subjects with incomplete data are not excluded from the analyses. If a particular subject is missing one or more of the T repeated measurements, the remaining available data from the other measurements for that subject are used in the analyses. In other words, when more sophisticated methods for the analysis of longitudinal data are used, it is probably less urgent to estimate the missing data.
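The difference between listwise deletion and using all available observations can be made concrete with a small sketch. The toy data, the subject labels, and the helper names below are hypothetical and are not taken from the paper; the code only counts what each approach retains and does not fit MANOVA or GEE.

```python
# Hypothetical toy data: 4 subjects, T = 3 repeated measurements.
# None marks a missing measurement.
data = {
    "s1": [5.1, 5.3, 5.6],
    "s2": [4.8, None, 5.0],
    "s3": [6.0, 6.1, None],
    "s4": [5.5, 5.4, 5.2],
}


def listwise_complete(records):
    """Keep only subjects observed at every occasion (listwise deletion)."""
    return {sid: ys for sid, ys in records.items()
            if all(y is not None for y in ys)}


def available_observations(records):
    """Keep every observed (subject, occasion, value) triple in long format."""
    return [(sid, t, y)
            for sid, ys in records.items()
            for t, y in enumerate(ys, start=1)
            if y is not None]


complete = listwise_complete(data)
long_rows = available_observations(data)

n_listwise = sum(len(ys) for ys in complete.values())
n_available = len(long_rows)

print(f"listwise deletion: {len(complete)} subjects, {n_listwise} observations")
print(f"available data:    {n_available} observations")
```

In this toy example two single missing values remove two whole subjects under listwise deletion, whereas the long-format list keeps every observed measurement.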

In general, a distinction is made between three types of attrition or missingness: (1) missing completely at random (MCAR, attrition is independent of both unobserved and observed data), (2) missing at random (MAR, attrition depends on observed data, but not on unobserved data), and (3) missing not at random (MNAR, attrition depends on unobserved data). Information on the type of attrition and the possible determinants of attrition is important for a proper interpretation of the results of longitudinal data analysis. However, that is not the focus of this paper.
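The three mechanisms can be illustrated with a minimal simulation. Everything in it is an assumption made for illustration: two measurements per subject, normally distributed values, and the specific dropout probabilities (chosen so that roughly 25% of the second measurement is missing in each scenario). It is not the procedure used to create the incomplete datasets in the paper.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible
n = 5000

# Two correlated measurements per subject (assumed model):
# y1 is always observed, y2 may be missing.
y1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
y2 = [0.8 * a + rng.gauss(0.0, 0.6) for a in y1]


def mcar(y1_i, y2_i):
    """Missingness ignores all data."""
    return rng.random() < 0.25


def mar(y1_i, y2_i):
    """Missingness depends only on the observed first measurement."""
    return rng.random() < (0.45 if y1_i > 0 else 0.05)


def mnar(y1_i, y2_i):
    """Missingness depends on the value that would go missing."""
    return rng.random() < (0.45 if y2_i > 0 else 0.05)


def observed_mean(mechanism):
    """Mean of y2 among subjects whose y2 is still observed."""
    kept = [b for a, b in zip(y1, y2) if not mechanism(a, b)]
    return sum(kept) / len(kept)


full_mean = sum(y2) / n
mean_mcar = observed_mean(mcar)
mean_mar = observed_mean(mar)
mean_mnar = observed_mean(mnar)

print(f"complete data: {full_mean:.3f}")
print(f"MCAR: {mean_mcar:.3f}  MAR: {mean_mar:.3f}  MNAR: {mean_mnar:.3f}")
```

Under these assumptions the mean of the observed second measurement stays close to the complete-data mean under MCAR and is pulled downward under MAR and MNAR. Under MAR the cause of the shift is visible in the observed first measurement; under MNAR it is not.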

The purpose of this paper is to give an introduction to the ways of dealing with missing data in longitudinal studies, aimed particularly at researchers who are less experienced in this field. Therefore, several of the available imputation methods to replace missing data will be discussed. In an example, the influence of missing data on the results of statistical analyses and the influence of different imputation methods on these results will be illustrated. Because the topic of this supplement concerns longitudinal observational studies, the illustration will be focussed on a specific observational longitudinal dataset. The results of this illustration should, therefore, be interpreted within the limitations of this choice.
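Three of the single imputation methods named in the abstract can be sketched in a few lines. This is a minimal illustration, not the implementation used in the paper: the function names and example values are hypothetical, and "mean of series" is interpreted here as the mean of the values observed at one measurement occasion across subjects, which is an assumption based on its classification as a cross-sectional method.

```python
def last_value_carried_forward(series):
    """Longitudinal: replace each missing value with the most recent observed one."""
    out, last = [], None
    for y in series:
        if y is not None:
            last = y
        out.append(last)  # stays None if nothing has been observed yet
    return out


def linear_interpolation(series):
    """Longitudinal: fill interior gaps on a straight line between neighbours."""
    out = list(series)
    known = [i for i, y in enumerate(series) if y is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = series[left] + step * (i - left)
    return out  # leading and trailing gaps are left missing


def mean_of_series(column):
    """Cross-sectional: replace missing values with the observed mean at that occasion."""
    observed = [y for y in column if y is not None]
    mean = sum(observed) / len(observed)
    return [mean if y is None else y for y in column]


# One subject measured at six occasions (hypothetical values).
subject = [5.0, None, 6.0, None, None, 7.5]
lvcf = last_value_carried_forward(subject)
interp = linear_interpolation(subject)

# One occasion measured on four subjects (hypothetical values).
occasion = [4.0, None, 6.0, 5.0]
filled = mean_of_series(occasion)

print(lvcf)
print(interp)
print(filled)
```

The two longitudinal functions use only the subject's own trajectory, while the cross-sectional function uses only other subjects at the same occasion, which is the distinction the abstract draws between the two groups of methods.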

Section snippets

Dataset

The longitudinal dataset consists of a continuous outcome variable Y, which has been measured six times, and four predictor variables: X1, a continuous time-independent predictor variable, X2, a continuous time-dependent predictor variable, X3, a dichotomous time-dependent predictor variable and X4, a dichotomous time-independent predictor variable. All time-dependent predictor variables were measured at the same six occasions as the outcome variable Y. The number of subjects in the complete

Results

Table 1 shows descriptive information of outcome variable Y (i.e., total serum cholesterol) and the four predictor variables (fitness level, the sum of skinfolds, smoking behaviour and gender). Table 2 shows the interperiod correlation coefficients for outcome variable Y. Because the interperiod correlation coefficients were quite high, the MNAR and the MAR dataset were comparable. We decided only to report the results of the statistical analysis for the MAR dataset. The reason for choosing

Discussion

In this paper we have examined the consequences of missing data in longitudinal studies for the results of statistical analyses. This was done by comparing datasets without missing data, datasets with missing data, and datasets in which the missing data were imputed by different imputation methods. Furthermore, we considered two statistical methods; i.e., MANOVA for repeated measurements, in which listwise deletion of cases with missing data occurred, and GEE analysis in which it is assumed

Conclusions

The present example with its mentioned limitations (i.e., specific observational longitudinal dataset, four missing data scenarios, limited number of imputation techniques, missingness dependent on the outcome variable, two statistical methods, less advanced multiple imputation estimation procedures) shows that when MANOVA for repeated measurements is used to analyze a longitudinal dataset with missing data, imputation methods to replace these missing data are highly recommendable (because

References (17)

  • P.J. Diggle. Testing for random dropouts in repeated measurement data. Biometrics (1989)
  • R.J.A. Little et al. Statistical analysis with missing data (1987)
  • D.B. Rubin. Multiple imputation for nonresponse in surveys (1987)
  • D.B. Rubin. Multiple imputation after 18+ years. J Am Stat Assoc (1996)
  • J.L. Schafer. Analysis of incomplete multivariate data (1997)
  • SPSS-X user's guide. 3rd ed. Chicago: SPSS Inc.,...
There are more references available in the full text version of this article.
