Friday, February 16, 2007

Comments on the inaugural address of Prof dr Jos Twisk I: multiple imputation

On Friday, February 2nd, Jos Twisk gave his inaugural address on the occasion of his appointment as full professor of applied biostatistics of longitudinal research at VU University,
Amsterdam, entitled (my translation):

Applied biostatistics: be swayed by the issues of the day or sustain ancient traditions?

It was a very amusing lecture, in which he brought up several methodological issues.

I will take up these issues one by one in subsequent entries of this weblog and give my personal comments.

One of today's issues is 'multiple imputation', introduced by Rubin in the seventies and eighties of the previous century (see for instance: Rubin, 1987/2004). Jos showed how the results of multiply
imputed data sets have to be combined to get a composite result. His conclusion was:
don't impute if it can be avoided and certainly don't do it several times.
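
For readers who have not seen the combination step, here is a minimal sketch of the usual pooling rules for a single parameter: the pooled estimate is the mean of the per-data-set estimates, and the total variance is the mean within-imputation variance plus (1 + 1/m) times the between-imputation variance. The function name `pool_rubin` and the numbers are made up for illustration; they are not taken from the lecture.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool m point estimates and their squared standard errors (hypothetical helper)."""
    m = len(estimates)
    q_bar = mean(estimates)        # pooled point estimate
    u_bar = mean(variances)        # within-imputation variance
    b = variance(estimates)        # between-imputation variance (divisor m - 1)
    t = u_bar + (1 + 1 / m) * b    # total variance of the pooled estimate
    return q_bar, t

# Made-up results from m = 3 imputed data sets.
estimates = [1.9, 2.1, 2.0]
variances = [0.25, 0.25, 0.25]
q, t = pool_rubin(estimates, variances)
print(f"pooled estimate {q:.3f}, total variance {t:.4f}")
```

The between-imputation term is what makes the pooled standard error larger than the one from any single imputed data set.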

In general, I agree with the first part of that advice, but avoiding imputation is often not realistic in practice. Techniques like repeated-measures analysis of (co)variance require complete
data sets, and a single missing observation for a patient will cause the whole case to be left out of the analysis.
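
A small sketch of how much information such listwise deletion can throw away. The data below are hypothetical: one row per patient, one value per visit, with `None` marking a missing observation.

```python
# Hypothetical wide-format data: patient id -> measurements at three visits.
patients = {
    "p1": [5.1, 4.8, 4.6],
    "p2": [6.0, None, 5.2],   # one missing visit
    "p3": [5.5, 5.3, 5.0],
    "p4": [None, 4.9, 4.7],   # one missing visit
}

# Listwise deletion: keep only patients with no missing value at all.
complete = {pid: v for pid, v in patients.items() if None not in v}

n_obs_total = sum(x is not None for v in patients.values() for x in v)
n_obs_used = sum(len(v) for v in complete.values())

print(f"patients kept: {len(complete)} of {len(patients)}")
print(f"observed values used: {n_obs_used} of {n_obs_total}")
```

In this toy example two missing values cost four perfectly good observations as well.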

One can, of course, use multilevel analysis instead, but this is difficult if a complex factorial design with several covariates has to be analyzed. It takes away the possibility to use analysis of variance in a quick-and-dirty way to get rid of uninfluential covariates/factors (confounders).

Instead, I prefer to impute multiply anyway (three times, as Rubin suggests, not five times as Jos did in his deterrent example), but to use only one of the imputed data sets for the whole data analysis process.
Only when the analysis is finished are the essential results replicated using the other two data sets.
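
The procedure can be sketched as follows. This is only an illustration under loud assumptions: the data are made up, `impute_once` is a hypothetical stand-in that fills each gap with a random draw from the observed values (a real analysis would use a proper imputation model), and the 'analysis' is reduced to a single difference in means.

```python
import random
from statistics import mean

def impute_once(values, seed):
    """Stand-in imputation: fill each gap with a random observed value."""
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    return [v if v is not None else rng.choice(observed) for v in values]

def effect(treated, control):
    """The 'effect of interest': difference in group means."""
    return mean(treated) - mean(control)

# Made-up outcome data with missing values (None).
treated = [4.1, None, 3.8, 4.4, None, 4.0]
control = [5.0, 5.3, None, 4.9, 5.1, 5.2]

# Step 1: create three imputed data sets.
imputed = [
    (impute_once(treated, seed), impute_once(control, seed + 100))
    for seed in (1, 2, 3)
]

# Step 2: do the whole analysis on the first imputed data set only.
primary = effect(*imputed[0])

# Step 3: afterwards, replicate the essential result on the other two.
replications = [effect(t, c) for t, c in imputed[1:]]
same_sign = all((r < 0) == (primary < 0) for r in replications)

print(f"primary effect: {primary:.2f}")
print("replications:", [round(r, 2) for r in replications])
print("same direction in all data sets:", same_sign)
```

The point of step 3 is only to see whether the conclusion survives a different set of imputed values, not to produce a pooled estimate.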

Four extra remarks about this suggestion:

  1. Apart from a comparison with the two other imputed data sets, a complete case analysis should be done and the results compared.
  2. Rubin's original proposal was aimed at parameter estimation. By pooling the results in a clever way, combined parameter estimates are obtained in which the variance due to imputation is taken into account. However, we are usually not interested in estimation. In the above procedure, we only check whether the imputation procedure influences the effects of interest.
  3. The above procedure does not serve as a check on the assumption of ignorability of nonresponse ('missing at random'), since not much can be derived from the data set about the distribution of nonresponse.
  4. Note that there is a publication problem. One cannot publish results that are based on heavily imputed data sets derived from a raw data set with a large percentage of missing values, unless the results are more or less the same as those obtained from the complete case analysis. In that case, reporting the results from the raw data set is preferable.

Herman

Donald B. Rubin (1987). Multiple Imputation for Nonresponse in Surveys. New York: Wiley. (Reprinted in 2004).