Saturday, June 23, 2007

An alternative modelling procedure based on a strong theory

Suppose we have a strong theory T that translates into several (for instance, three) alternative models M1, M2 and M3. Furthermore, suppose we are able to construct a research design R, based on theory T, which allows us to verify our theory. Suppose we are able to implement an experiment (for instance, a clinical trial) based on R and that we have collected data D during this experiment.

The usual procedure would be to test whether the data D are consistent with models M1, M2 and/or M3.

But we could also proceed as follows:

Generate data according to the models M1, M2 and M3, resulting in three data sets D1, D2 and D3, and test whether these data sets could have resulted from the same population as D.
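
To make this concrete, here is a minimal sketch in R of what such a comparison could look like for a single, deliberately simple model. Everything in it (the linear form of M1, its parameter values, the use of a two-sample Kolmogorov-Smirnov test on the outcome variable) is a hypothetical illustration, not something prescribed by the procedure above.

## Hypothetical example: suppose model M1 states that y depends linearly on x
## with normally distributed errors. Generate data D1 according to M1 and test
## whether D1 could have come from the same population as the observed data D.
set.seed(123)
n <- 100

## Observed data D (simulated here only as a stand-in for real experimental data)
D <- data.frame(x = rnorm(n))
D$y <- 0.4 * D$x + rnorm(n, sd = 1)

## Data D1 generated according to model M1 (slope and error sd taken from theory T)
D1 <- data.frame(x = rnorm(n))
D1$y <- 0.5 * D1$x + rnorm(n, sd = 1)

## Compare the marginal distributions of the outcome in D and D1.
## A full comparison would also involve the joint distribution of (x, y).
ks.test(D$y, D1$y)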

Remarks:
  • The above is only possible if we have a strong theory T on which we can base our models beforehand.
  • A methodological advantage is that the researcher is forced to formulate his/her theoretical concepts and translate them into models before (s)he starts the experiment.
  • A second advantage is that deviations between D and Di (i = 1, 2, 3) give information both on the relationships between variables and on the influence of the underlying (possibly multivariate) distributions (assuming that our models are based on known theoretical distributions like the normal distribution, which is common practice).
  • A third advantage seems to be that we can test the alternative hypothesis directly.
  • This procedure cannot be combined with crossvalidation (randomly splitting the data into two parts, one part to find models consistent with the data, another part to test those models), because in the first part models are formulated that are consistent with (possibly multivariate) distribution violations in the data: the same violations are present in the second part of the data, too.

Questions:

  1. Does a weak theory simply translate into a larger set of models?
  2. Simulating data based on models M1, M2 and M3 may not be trivial. Can we use procedures similar to those used in MCMC (Markov chain Monte Carlo)?
  3. Can we use a Bayesian perspective, for instance by assuming that D1, D2 and D3 are based on prior distributions for the data D?
  4. Is the above approach known and described in the `simulation community'?

Monday, June 11, 2007

Paul de Boeck: Always do a PCA

Another of Paul's rules used during statistical consultation (rule 5), `Always do a PCA, it tells you about sources of differences in the data and about the interaction between the two modes of the data set', was met with scepticism, notably by David Kaplan, who said that he recommended clients never to use PCA.

Comments:
  • PCA is an abbreviation of `Principal component analysis'. It is essentially a data reduction technique requiring no assumptions about the distribution of the variables. In a nutshell, the technique results in a representation of the data relative to an orthogonal coordinate system. Data reduction is obtained by considering only a few axes of the coordinate system.
  • Methodologically, the drawback of the technique lies in the orthogonality, which in most cases is not realistic in view of the substantive meaning of the data. To remedy this, a promax rotation can be used, which yields non-orthogonal axes.
  • As an alternative to PCA, confirmatory factor analysis (CFA) can be used in an exploratory way, in particular if some assumptions can be made about the relation between factors and items (note that CFA does have several assumptions on the distribution of the observed variables, notably multivariate normality).
  • Although Paul had to endure heavy criticism of his fifth rule, in my opinion he had a point. In fact, it is common practice among data analysts to use PCA as a quick and dirty technique to explore the data, even if they know how to apply CFA (see the sketch after this list). If promax rotation is used instead of varimax rotation, some of the objections against the orthogonality assumption are mitigated, although not completely met: a CFA on the other half of the data using the factor structure found with PCA may result in completely different estimated angles between the factors.
  • For those who heard Paul's talk, the recommendation to use PCA was not surprising: he put heavy emphasis on data exploration as an antidote to the often theory-centered approach that prevails in social science and behavioral research, and PCA can very well be used in an exploratory way. As holds for all exploration, the truth is never ascertained. The analysis has to be confirmed either on another part of the data or by doing a new, carefully designed experiment that allows for unequivocal confirmation.
  • As a last remark, I want to stress that the use of PCA is not as straightforward as it seems (in particular if one wants to have some confidence in the results). For a recent article on PCA see Costello and Osborne (2005).
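
To illustrate this quick-and-dirty exploratory use of PCA, here is a minimal sketch in base R. The data set is invented and the choice of two components is arbitrary; the point is only to show a PCA followed by a promax rotation of the loadings.

## Quick exploratory PCA on standardized variables, followed by a promax
## rotation of the component loadings to obtain non-orthogonal (oblique) axes.
set.seed(1)
X <- matrix(rnorm(200 * 6), ncol = 6)   # stand-in for a real data matrix
colnames(X) <- paste0("item", 1:6)

pca <- prcomp(X, scale. = TRUE)
summary(pca)                            # proportion of variance per component

## Loadings of the first two components (eigenvectors scaled by the sd's)
L <- pca$rotation[, 1:2] %*% diag(pca$sdev[1:2])
rotated <- promax(L)                    # oblique rotation, see ?promax
rotated$loadings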

References

Costello, A. B. and J. W. Osborne (2005). Best Practices in Exploratory Factor Analysis: Four Recommendations for Getting the Most From Your Analysis. Practical Assessment, Research & Evaluation, 10(7).

Monday, June 4, 2007

Paul de Boeck: rules during consultation

During the March colloquium on Advising on research methods (see previous logs), Paul de Boeck (http://www.kuleuven.be/cv/u0002630.htm) gave seven rules that are important in consultancy when clients involved in social science research, in which theoretical concepts are investigated using empirical research methods, come for advice. His presentation and the abstract of the talk can be found at:
http://www.knaw.nl/colloquia/advising/index.cfm#proceedings

I give the rules below and will comment on some of them in this and the next few blogs:

  1. Not everything is worth being measured or can be measured, often the data are more interesting than the concept.
  2. Always reflect on which type of covariation is meant when the relationship between concepts is considered. All too often, automatically the covariation over persons is used as the basis, without good reasons.
  3. Measurement, reliability and validity testing, and hypothesis testing don’t need to be sequential steps, they can all be done simultaneously.
  4. So-called psychometric criteria are not theory-independent, and sometimes the theoretical implications of the psychometric criteria are wrong.
  5. Always do a PCA, it tells you about sources of differences in the data and about the interaction between the two modes of the data set.
  6. One does not necessarily have to care about the scale level of the data.
  7. Don’t construct indices of concepts, unless for descriptive summaries.

Ad rule 1 (Not everything is worth being measured or can be measured, often the data are more interesting than the concept). It should be stressed that this rule is thought to be most relevant during a consultation session. Let's take it apart: the first part (`Not everything is worth being measured or can be measured') is difficult to `sell' during consultation, because it means that during the study data were collected that were not worth collecting: this is particularly painful when it has to be said about the primary variables of an investigation. It is difficult to see what the second part (`often the data are more interesting than the concept') has to do with the first part: one can hardly say: `your study design started from a wrong idea and thus the data collection is worthless, but let's look at the data'. However (and this was clearly demonstrated during the presentation), a case can be made for a much looser connection between the data and the concepts to which they refer, because much can go wrong during implementation.

In particular (my addition): if enough data are available, a crossvalidation strategy can be useful, in which the data are randomly split into two parts. The first part is used for exploration, with the emphasis on the data (and their relation to study design and implementation); the second part is used to investigate all worthwhile findings/hypotheses that came out of the exploration phase. Of course, in many cases not enough data were collected to allow this strategy. In that case two other strategies are available: (1) One may use the expected crossvalidation index (ECVI) given in Kaplan (page 117 ff.), which gives an impression of the crossvalidation adequacy of a model. (2) One may split the sample into unequal parts, using the first part for exploration as before and the (smaller) second part to test the findings of the exploration, but now using small sample techniques like bootstrapping, if needed. A sketch of the basic split follows below.
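
As a minimal sketch of the basic split in R (the data frame and the 50/50 split are placeholders; in the unequal-split variant one would simply change the size argument):

## Randomly split a data set into an exploration part and a confirmation part.
set.seed(2007)
mydata <- data.frame(y = rnorm(120), x = rnorm(120))    # stand-in for real data

idx <- sample(nrow(mydata), size = floor(nrow(mydata) / 2))
exploration <- mydata[idx, ]    # used to generate hypotheses/models
confirmation <- mydata[-idx, ]  # kept aside to test the findings afterwards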

Kaplan, D. (2000). Structural Equation Modeling: Foundations and Extensions. Thousand Oaks, London, New Delhi: Sage Publications.

Thursday, April 26, 2007

Small sample SEM

An issue that came up during a session on applications of structural equation modelling (SEM) after a lecture by David Kaplan at the Colloquium `Advising on research methods' in Amsterdam (29-30 March) was the fact that SEM is less often applied in medical/epidemiological research than it could be. Several bottlenecks can be (and were) identified:

  1. The concept of `latent variable' (and, relatedly, the concept of (substantive) theory and corresponding model) seems to be less easy to conceptualize in Medicine than in the social/behavioral sciences.
  2. Jules Ellis mentioned that SEM software is not easily obtainable and costly, and that specifying a SEM model in standard software is often awkward and difficult to achieve for the medical researcher/analyst.
  3. Most SEM models require sample sizes that are too large for most small- to moderately-sized medical/epidemiological research projects.

Comments:

Ad 1. (The concept of `latent variable' seems to be less easy to conceptualize in Medicine than in the social/behavioral sciences).

Of course, this is an interesting but more or less philosophical point, if phrased in this way. In real-life medical/epidemiological research there often seem to be no compelling reasons to take a theoretical stand as a starting point. I won't go into the reasons for this.

However, there are several situations in which a theoretical conceptualization could help to specify a SEM model that is more appropriate than the regression-like models that are frequently used. For instance,

  • In what is called `fundamental' or `basic' medical research, where complicated dynamic mechanisms are studied (think of genome studies or physiological studies of illness progression), dynamic SEM models could be used to specify the dynamic process.
  • In questionnaire design, confirmatory factor analysis can be used to analyze the factor structure, just as it is done in social science research (see de Vet et al. (2005)).

Ad 2 (SEM software is costly and specifying a SEM model in standard software is too difficult for the medical researcher/analyst).

In an e-mail afterwards, David Kaplan mentioned that a package to perform SEM modelling exists in R (a statistical environment related to S-Plus, freely obtainable via the Internet, see, for instance: http://cran.nedmirror.nl/). It turned out to be written by John Fox. Sources and binaries for the sem package can be obtained at: http://finzi.psych.upenn.edu/R/library/sem/html/00Index.html
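
To give an impression of what model specification in this package looks like, here is a minimal sketch of a one-factor confirmatory model. The data are simulated only to have something to run, and the sketch assumes the specifyModel()/sem() interface as I understand it from the package documentation, so treat it as an illustration rather than a worked-out analysis.

## One-factor confirmatory model for three observed variables, fitted with
## John Fox's sem package on a simulated data set.
# install.packages("sem")   # if not yet installed
library(sem)

set.seed(42)
f <- rnorm(200)
dat <- data.frame(x1 = 0.8 * f + rnorm(200, sd = 0.6),
                  x2 = 0.7 * f + rnorm(200, sd = 0.7),
                  x3 = 0.6 * f + rnorm(200, sd = 0.8))

model <- specifyModel(text = "
  F1 ->  x1, lam1, NA
  F1 ->  x2, lam2, NA
  F1 ->  x3, lam3, NA
  x1 <-> x1, e1,   NA
  x2 <-> x2, e2,   NA
  x3 <-> x3, e3,   NA
  F1 <-> F1, NA,   1
")

fit <- sem(model, S = cov(dat), N = nrow(dat))
summary(fit)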

As to the other point, that the syntax of model specification would be too complicated for the researcher-in-the-field: after all, many of them have also been able to learn how to use multilevel analysis. So, where there is a will, there is a way, even in learning how to use SEM, in particular if there were a clear need.

Ad 3 (Most SEM models require sample sizes that are too large).

To me, this seems the most important bottleneck for standard application in Medicine/Epidemiology. If SEM requires sample sizes of at least 500, application in most medical studies is out of the question.

In his comments, Jelte Wicherts mentioned that small sample size techniques are being studied. Afterwards, in some e-mail exchanges, I suggested that resampling techniques like bootstrapping could be applied in small sample situations. As it turns out, Fox's package, mentioned above, also contains a bootstrapping possibility (but see Kaplan's book chapter 5, where arguments are given why sample sizes would increase when non-normality has to be assumed).

Comment by David Kaplan:

I'm not sure what you mean by:


(but see Kaplan's book chapter 5, where arguments are given why sample sizes would increase when non-normality has to be assumed).

I think you mean that when estimating models with non-normal observed variables, larger sample sizes are typically needed for estimators to behave properly. That is partly true, and was true in the good old days. But now, there are estimators that don't require huge sample sizes. Also, I believe there are bootstrapping approaches to get standard errors when sample sizes are a bit smaller.

References

de Vet, H. C., Adèr, H. J., Terwee, C. B., & Pouwer, F. (2005). Are factor analytical techniques used appropriately in the validation of health status questionnaires? A systematic review on the quality of factor analysis of the SF-36. Quality of Life Research, 14(5), 1203–1218.

Kaplan, D. (2000). Structural Equation Modeling: Foundations and Extensions. Thousand Oaks, London, New Delhi: Sage Publications.

Wednesday, April 4, 2007

Colloquium `Advising on research methods'

This colloquium was organised by Don Mellenbergh and myself and held on March 29-30 in Amsterdam, the Netherlands.

Speakers were:


  • Janice Derr (Advising in a multi-disciplinary setting)
  • Steven Piantadosi (Research designs in Medicine)
  • Don Mellenbergh (Advising on test construction)
  • Gerald van Belle (Statistics and everyday life)
  • Jules Ellis (Advising to policy makers in Health Care)
  • Bo Lu (Bias correction using propensity scores)
  • Paul de Boeck (Consulting in behaviour research)
  • Willem Heiser (Survival skills in publishing)
  • Robert Pool (Combining qualitative and quantitative methods)
  • David Kaplan (Research problem and structural equation model)
  • Denny Borsboom (Advising on test validity)
  • Herman Adèr (Time and strategy)

The colloquium was preceded by a masterclass on Wednesday the 28th.

For more details, see: http://www.knaw.nl/colloquia/advising/
In the next few posts, I will describe some of the topics that were discussed.

Herman

Monday, March 5, 2007

Which methodological mistake is less harmful?

I hope you can agree to the following statement:

`Methodological considerations on an empirical study are relevant only in relation to the results'.

This idea is in line with the concept of content quality of a study, defined as follows:

Content quality. The content quality of a study indicates its bearing on the solution of the initial research problem.

Now assume that a study is methodologically flawed in some way; in other words, the content quality is low. The question now arises which of the following is less harmful:


  1. Something that could have been uncovered by using the right methodology, but that was not found due to these flaws.
  2. Due to methodological flaws, erroneous conclusions were drawn (and published).

Most researchers would indicate the first option as the least harmful, reasoning that it happens all the time: different researchers draw different conclusions looking at the same data, and historically, new perspectives are opened time and again. However, one may ask what is meant by `the right methodology' in option 1: if this indicates that something could have been found, but wasn't, because a less subtle analysis technique was applied, the error could have been easily avoided. In particular, if the data set was published, as is required in some fields of research, your name as a serious researcher would be endangered if somebody else, repeating the analyses the right way, came up with something completely different! (Maybe that's why some researchers are hesitant to make their data publicly available.)

In the second case, if these wrong conclusions are simply negative ones (we did not find nor report an effect although there is one present in the data), this is of course a pity, but no serious harm is done. Even if somebody showed that a positive conclusion could have been drawn, nobody will look askance at you, since your conclusions were on the safe side. And the one who has done the replication should explain very well why his or her results deviate. If he or she has applied some outlandish technique that is not familiar to most, (s)he may even be considered a niggler who is out to harm your good name as a researcher.

(Note that in the above we are talking about published results, although this may not have been obvious at the beginning.)

Tuesday, February 20, 2007

Power and sample size calculations

One of the other issues Jos Twisk commented upon in his inaugural address (see previous posting) was the doubtful obligation of researchers to provide research proposals with a sample size/power calculation. To demonstrate the alleged futility of this requirement, he announced that he had written a small computer program that has the following input dialogue:

====================================================
Q: Give the required alpha
A: 0.05

Q: Give the required power (1 - beta)
A: 0.80

Q: Give the required standard deviation
A: ....

Q: Give the required effect size
A: ...

The program would now always produce the following output:

Sample size: As many units as possible

====================================================

That power calculations can be doubtful and unreliable has been stated by many others (see, for instance, Kuik and Tobi, 1998). The problems with them are twofold:

  1. Almost always, there are practical constraints on the number of patients/respondents that can possibly be included in a study. This makes it tempting for the researcher (and the methodologist who helps him or her) to choose the standard deviation and effect size in such a way that an acceptable number comes out (see the sketch after this list). Usually there is some freedom to do so.
  2. Although there is not much room to tamper with the effect size, the variability estimate can be chosen more freely: in most cases it is hard to tell whether a realistic value has been chosen.
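
The sketch below, using base R's power.t.test() with made-up numbers, shows how sensitive the required group size is to the assumed standard deviation, which is exactly what creates this freedom:

## Required group size for a two-sample t-test with a fixed expected difference,
## under two different assumptions about the standard deviation.
power.t.test(delta = 5, sd = 10, sig.level = 0.05, power = 0.80)  # about 64 per group
power.t.test(delta = 5, sd = 7,  sig.level = 0.05, power = 0.80)  # about 32 per group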

On the other hand, it is not realistic to militate against power calculations altogether in practice (as Jos did). From the point of view of the funding agency, the wish to get an impression of the expected credibility of the results of a study can hardly be called exaggerated. Underpowering a study not only wastes money, but may also be unethical towards the included patients, while overpowering may place an undue burden on the patients.

So, what can we do? Since the bottleneck in the power calculation seems to be the variance estimate, a possible solution is to estimate that statistic from a small number of observations collected, for instance, during a pilot study. See section 25.4 in Efron and Tibshirani (1993) on how to approach this; they give a more general treatment of the use of bootstrapping in power calculations.
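
A minimal sketch of that idea in R (not the exact procedure of their section 25.4; the pilot values are invented): bootstrap the standard deviation from a small pilot sample and see how the resulting required sample sizes vary.

## Bootstrap the standard deviation of a small pilot sample and translate the
## spread of those estimates into a range of required group sizes.
set.seed(7)
pilot <- c(12.1, 9.8, 14.3, 11.0, 13.5, 10.2, 12.9, 11.7)   # invented pilot data

boot_sd <- replicate(1000, sd(sample(pilot, replace = TRUE)))
quantile(boot_sd, c(0.25, 0.50, 0.75))

## Required n per group for an expected difference of 2, over the bootstrap spread of sd
sapply(quantile(boot_sd, c(0.25, 0.50, 0.75)),
       function(s) power.t.test(delta = 2, sd = s, power = 0.80)$n)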

Herman Adèr

PS. An overview of available free statistical software including software for power and sample size calculations can be found at: http://statpages.org/javasta2.html

References

Kuik, D. J. and H. Tobi (1998). On the Uselessness of Power Computations. In: Proceedings in Computational Statistics 1998: Short Communications and Posters (eds. R. Payne and P. Laine), pp. 67-68.

Efron, B. and R. J. Tibshirani (1993). An Introduction to the Bootstrap. New York, London: Chapman & Hall.

Friday, February 16, 2007

Comments on the inaugural address of Prof dr Jos Twisk I: multiple imputation

On Friday, February 2nd, Jos Twisk delivered his inaugural address on the occasion of his appointment as full professor of applied biostatistics of longitudinal research at VU University, Amsterdam, entitled (my translation):

Applied biostatistics: be swayed by the issues of the day or sustain ancient traditions?

It was a very amusing lecture, in which he brought up several methodological issues.

I will bring up these issues here once more as subsequent entries to this weblog and give my personal comments.

One of today's issues is `multiple imputation', introduced by Rubin (see, for instance, Rubin, 1987/2004). Jos showed how the results of multiply imputed data sets have to be combined to get a composite result. His conclusion was: don't impute if it can be avoided, and certainly don't do it several times.

In general, I agree with the first part of that advice, but it is often not realistic in practice. Techniques like repeated measures analysis of (co)variance require complete data sets, and one missing observation for a patient will cause the whole case to be left out of the analysis.

One can, of course, use multilevel analysis instead, but this is difficult if a complex factorial design with several covariates has to be analyzed. It takes away the possibility of using analysis of variance in a quick-and-dirty way to get rid of uninfluential covariates/factors (confounders).

Instead, I prefer to impute multiply anyway (three times, as Rubin suggests, not five as Jos did in his deterrent example), but to use only one of the imputed data sets for the whole data analysis process. Only when the analysis process is finished are the essential results replicated using the other two data sets; a sketch follows below.
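
A minimal sketch of this suggestion in R, using the mice package (the data set is invented, and mice is simply one common implementation of multiple imputation; it is not necessarily what Jos used in his example):

## Impute three times, carry out the whole analysis on the first imputed data
## set, and afterwards replicate the essential result on the other two.
# install.packages("mice")   # if not yet installed
library(mice)

set.seed(11)
dat <- data.frame(y = rnorm(60), x = rnorm(60))
dat$x[sample(60, 10)] <- NA                 # introduce some missing values

imp <- mice(dat, m = 3, printFlag = FALSE)  # three imputed data sets

main <- lm(y ~ x, data = complete(imp, 1))  # full analysis on data set 1
summary(main)

## Replication of the essential result on the other two imputed data sets
coef(lm(y ~ x, data = complete(imp, 2)))
coef(lm(y ~ x, data = complete(imp, 3)))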

Four extra remarks about this suggestion:

  1. Apart from a comparison with the two other imputed data sets, a complete case analysis should be done and the results compared.
  2. Rubin's original proposal was aimed at parameter estimation. By pooling the results in a clever way, combined parameter estimates are obtained in which the variance due to imputation is taken into account. However, we are usually not interested in estimation as such. In the above procedure, we only check whether the imputation procedure influences the effects of interest.
  3. The above procedure does not serve as a check on the assumption of ignorability of nonresponse (`missing at random'), since not much can be derived from the data set concerning the distribution of the nonresponse.
  4. Note that there is a publication problem. One cannot publish results that are based on heavily imputed data sets derived from a raw data set with a large percentage of missings, unless the results are more or less the same as those obtained from the complete case analysis. In that case, reporting the results from the raw data set is preferable.

Herman

Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: Wiley. (Reprinted in 2004.)