Tuesday, February 20, 2007

Power and sample size calculations

One of the other issues Jos Twisk commented upon in his inaugural address (see previous posting) was the questionable obligation for researchers to accompany research proposals with a sample size/power calculation. To demonstrate the alleged futility of this requirement, he announced that he had written a small computer program with the following input dialogue:

====================================================
Q: Give the required alpha
A: 0.05

Q: Give the required power (1 - beta)
A: 0.80

Q: Give the required standard deviation
A: ....

Q: Give the required effect size
A: ...

The program would now always produce the following output:

Sample size: As many units as possible

====================================================

That power calculations can be doubtful and unreliable has been stated by many others (see, for instance, Kuik and Tobi, 1998). The problem is twofold:

  1. Almost always, there are practical constraints on the number of patients/respondents that can possibly be included in a study. This makes it tempting for the researcher (and the methodologist who helps him or her) to choose the standard deviation and effect size in such a way that an acceptable number comes out. Usually there is some freedom to do so.
  2. Although there is not much room to tamper with the effect size, the variability estimate can be chosen rather freely: in most cases it is hard to tell whether a realistic value has been used (a sketch follows below).
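
To see how much freedom this leaves, here is a minimal sketch in Python of the standard approximation for the number of patients per group in a two-group comparison of means, n = 2 * (z[1 - alpha/2] + z[power])^2 * sigma^2 / delta^2, where z[.] denotes a standard normal quantile. The effect size and the candidate standard deviations are invented for illustration:

====================================================
# Required sample size per group for a two-sided comparison of two means,
# evaluated for several 'plausible' standard deviations.
from math import ceil
from scipy.stats import norm

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sample z-test."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 * sigma ** 2 / delta ** 2)

# The same effect size (delta = 5) with three defensible standard deviations:
for sigma in (8, 10, 12):
    print(sigma, n_per_group(delta=5, sigma=sigma))
# Prints 41, 63 and 91: a 20% change in the assumed standard deviation
# moves the required number per group from 63 down to 41 or up to 91.
====================================================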

On the other hand, it is not realistic to militate, as Jos did, against power calculations altogether. From the point of view of the funding agency, the wish to get an impression of the expected credibility of a study's results can hardly be called unreasonable. Underpowering a study not only wastes money, but may also be unethical towards the included patients, while overpowering may place an undue burden on them.

So, what can we do? Since the bottleneck in power calculations seems to be the variance estimate, a possible solution is to estimate that quantity from a small number of observations collected, for instance, during a pilot study. Efron and Tibshirani (1993, section 25.4) describe how to approach this and give a more general treatment of the use of bootstrapping in power calculations.
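
A minimal sketch of that idea in Python; the pilot measurements, the effect size and the candidate sample sizes are invented for illustration, and the code only conveys the general flavour of the approach, not the precise procedure of their section 25.4. The pilot observations are resampled, the effect size of interest is imposed on one of the two resampled groups, and the proportion of rejections estimates the power:

====================================================
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
# Hypothetical pilot measurements, used only to capture the variability:
pilot = np.array([12.1, 9.8, 14.3, 11.0, 10.5, 13.2, 12.8, 9.4])
delta = 2.0    # smallest difference considered relevant (assumed)
alpha = 0.05
B = 2000       # number of bootstrap replications

for n in (20, 40, 80):                # candidate sample sizes per group
    rejections = 0
    for _ in range(B):
        # Resample a 'control' and a shifted 'treatment' group of size n:
        control = rng.choice(pilot, size=n, replace=True)
        treatment = rng.choice(pilot, size=n, replace=True) + delta
        if ttest_ind(treatment, control).pvalue < alpha:
            rejections += 1
    print(n, rejections / B)          # estimated power at this sample size
====================================================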

Herman Adèr

PS. An overview of available free statistical software, including software for power and sample size calculations, can be found at: http://statpages.org/javasta2.html

References

Kuik, D. J. and H. Tobi (1998). On the uselessness of power computations. In R. Payne and P. Laine (Eds.), Proceedings in Computational Statistics 1998: Short Communications and Posters, pp. 67-68.

Efron, B. and R. J. Tibshirani (1993). An Introduction to the Bootstrap. New York and London: Chapman & Hall.

Friday, February 16, 2007

Comments on the inaugural address of Prof. dr. Jos Twisk I: multiple imputation

On Friday, February 2nd, Jos Twisk delivered his inaugural address on the occasion of his appointment as full professor of applied biostatistics of longitudinal research at VU University, Amsterdam. It was entitled (my translation):

Applied biostatistics: be swayed by the issues of the day or sustain ancient traditions?

It was a very amusing lecture, in which he brought up several methodological issues.

I will revisit these issues here in subsequent entries to this weblog and give my personal comments.

One of today's issues is `multiple imputation', introduced by Rubin in the nineteen-eighties (see, for instance, Rubin, 1987/2004). Jos showed how the results of multiply imputed data sets have to be combined to get a composite result. His conclusion was: don't impute if it can be avoided, and certainly don't do it several times.

In general, I agree with the first part of that advice, but it is not always realistic in practice. Techniques like repeated measures analysis of (co)variance require complete data sets, and a single missing observation for a patient causes the whole case to be left out of the analysis.

One can, of course, use multilevel analysis instead, but this is difficult if a complex factorial design with several covariates has to be analyzed. It also takes away the possibility of using analysis of variance in a quick-and-dirty way to get rid of uninfluential covariates/factors (confounders).

Instead, I prefer to impute anyway, and to do so multiply (three times, as Rubin suggests, not five as Jos did in his deterrent example), but to use only one of the imputed data sets for the whole data analysis process. Only when the analysis is finished are the essential results replicated using the other two data sets.
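
A minimal sketch of this workflow in Python, with simulated data standing in for the three imputed data sets and a simple regression standing in for the full analysis:

====================================================
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(2)

def make_imputed_dataset():
    """Stand-in for one completed (imputed) data set: exposure x, outcome y."""
    x = rng.normal(size=100)
    y = 0.5 * x + rng.normal(size=100)
    return x, y

imputed_sets = [make_imputed_dataset() for _ in range(3)]

# The whole (exploratory) analysis is carried out on the first set only ...
x1, y1 = imputed_sets[0]
print("main analysis slope:", linregress(x1, y1).slope)

# ... and only the essential result is replicated on the other two sets.
# Markedly different slopes would signal that the imputation drives the result.
for x, y in imputed_sets[1:]:
    print("replication slope:", linregress(x, y).slope)
====================================================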

Four extra remarks about this suggestion:

  1. Apart from a comparison with the two other imputed data sets, a complete case analysis should be done and the results compared.
  2. Rubin's original proposal was aimed at parameter estimation. By pooling the results in a clever way, combined parameter estimates are obtained in which the variance due to imputation is taken into account (a sketch follows after this list). However, we are usually not interested in estimation as such. In the above procedure, we only check whether the imputation procedure influences the effects of interest.
  3. The above procedure does not serve as a check on the assumption of ignorability of nonresponse (`missing at random'), since not much can be derived from the data set itself about the distribution of the nonresponse.
  4. Note that there is a publication problem. One cannot publish results that are based on heavily imputed data sets derived from a raw data set with a large percentage of missing values, unless the results are more or less the same as those obtained from the complete case analysis. In that case, reporting the results from the raw data set is preferable.
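
For contrast with remark 2: Rubin's pooling rules themselves are easy to apply once every imputed data set has been analysed. The pooled estimate is the mean of the m per-imputation estimates, and its total variance is the mean within-imputation variance plus (1 + 1/m) times the between-imputation variance. A minimal sketch in Python, with hypothetical estimates and squared standard errors from m = 3 imputations:

====================================================
import numpy as np

def pool_rubin(estimates, variances):
    """Pooled estimate and total variance across m imputed data sets."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    q_bar = estimates.mean()            # pooled point estimate
    u_bar = variances.mean()            # within-imputation variance
    b = estimates.var(ddof=1)           # between-imputation variance
    t = u_bar + (1 + 1 / m) * b         # total variance
    return q_bar, t

# Hypothetical treatment effects and squared standard errors:
print(pool_rubin([0.42, 0.47, 0.44], [0.010, 0.011, 0.009]))
====================================================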

Herman

Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: Wiley. (Reprinted in 2004.)