Northern Prairie Wildlife Research Center
I find it useful to think of replication occurring at 3 different levels (Table 1). The fundamental notion is ordinary replication in an experiment: treatments are applied independently to several units. In our squirrel-woodlot example, we would want several woodlots to be logged and several to be left as controls. (Comparable considerations apply to observational studies or sample surveys.) As mentioned above, replication guards against basing a decision on a single, possibly unusual, outcome of the treatment. It also provides an estimate of the variation associated with the treatment. Other levels of replication are pseudoreplication and metareplication.
| Term | Repeated action | Scope of inference | P-value | Analysis |
|---|---|---|---|---|
| Pseudoreplication | Measurement | Object measured | Wrong | Pseudo-analysis |
| Ordinary replication | Treatment | Objects for which samples are representative | "OK" | Analysis |
| Metareplication | Study | Situations for which studies are representative | Irrelevant | Meta-analysis |
At a lower level than ordinary replication is what Hurlbert (1984) called pseudoreplication. Although often couched in analysis-of-variance terms (using the wrong error term in an analysis), it typically arises when measurements on a unit are repeated and those repeated measurements are treated as if they were independent observations. The treatments may have been assigned randomly and independently to the units, but repeated observations on the same unit are not independent. This is what Hurlbert (1984) called simple pseudoreplication and what Eberhardt (1976) had included in pseudodesign. Pseudoreplication was common when Hurlbert (1984) surveyed the literature on manipulative ecological experiments, mostly published during 1974-1980; he estimated that about 27% of the experiments involved pseudoreplication. Heffner et al. (1996:2561) found that the frequency of pseudoreplication in more recent literature (1991-1992) had dropped but was still "disturbingly high." Stewart-Oaten (2002) provided some keys for recognizing pseudoreplication, which is not always straightforward.
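The cost of pseudoreplication can be demonstrated with a small simulation. The sketch below uses made-up numbers (not from any real squirrel-woodlot study): repeated counts are generated on woodlots with no true treatment effect, and the same large-sample test is applied twice, once treating every count as an independent replicate and once using woodlot means as the unit of analysis.

```python
import numpy as np

rng = np.random.default_rng(42)

def two_sample_z(a, b):
    """Large-sample two-sample test statistic (z approximation)."""
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return abs(a.mean() - b.mean()) / se

def simulate_group(n_woodlots, n_counts, woodlot_sd, count_sd):
    """Repeated counts on woodlots; counts within a woodlot share its effect."""
    effects = rng.normal(0.0, woodlot_sd, n_woodlots)
    return rng.normal(effects[:, None], count_sd, (n_woodlots, n_counts))

n_sims = 2000
pseudo_rejections = correct_rejections = 0
for _ in range(n_sims):
    # No true treatment effect: both groups come from the same process.
    logged = simulate_group(5, 20, 1.0, 0.5)
    control = simulate_group(5, 20, 1.0, 0.5)

    # Pseudoreplication: treat all 100 counts per group as independent.
    if two_sample_z(logged.ravel(), control.ravel()) > 1.96:
        pseudo_rejections += 1
    # Correct unit of analysis: one mean per woodlot (5 per group).
    if two_sample_z(logged.mean(axis=1), control.mean(axis=1)) > 1.96:
        correct_rejections += 1

pseudo_rate = pseudo_rejections / n_sims
correct_rate = correct_rejections / n_sims
print(f"False-positive rate, counts as replicates:   {pseudo_rate:.2f}")
print(f"False-positive rate, woodlots as replicates: {correct_rate:.2f}")
```

With these illustrative settings the pseudoreplicated analysis rejects the true null far more often than the nominal 5%, while the woodlot-mean analysis stays close to nominal (slightly above, because the z approximation is rough with only 5 woodlots per group).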
At a higher level than ordinary replication is what I term metareplication. Metareplication involves the replication of studies, preferably in different years, at different sites, with different methodologies, and by different investigators. Conducting studies in different years and at different sites reduces the chance that some artifact associated with a particular time or place caused the observed results; it should be unlikely that an unusual set of circumstances would manifest itself several times or, especially, at several sites. Conducting studies with different methods similarly reassures us that the results were not simply due to the methods or equipment employed to get those results. And having more than 1 investigator perform studies of similar phenomena reduces the opportunity for the results to be due to some hidden bias or characteristic of that researcher. Just as replication within individual studies reduces the influence of errors in observations by averaging the errors, metareplication reduces the influence of errors among studies themselves.
Youden (1972) provided a classic example of the need for metareplication. He described the sequence of 15 studies conducted during 1895-1961 to estimate the average distance between Earth and the sun. Each scientist obtained an estimate, as well as a confidence interval for that estimate. Every estimate obtained was outside the confidence interval for the previous estimate! The confidence each investigator had in his estimate thus was severely overrated. The critical message from this saga is that we should have far less confidence in any individual study than we are led to believe from internal estimates of reliability. This also points out the need to conduct studies of any phenomenon in different circumstances, with different methods, and by different investigators. That is, to do metareplication.
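A short simulation illustrates how the pattern Youden described can arise. In the hypothetical sketch below, each study reports a confidence interval based only on its internal standard error, while an unreported systematic error that varies from study to study (the values are invented for illustration) makes the reported intervals far too optimistic.

```python
import numpy as np

rng = np.random.default_rng(7)

truth = 0.0          # the quantity every study is trying to estimate
internal_se = 1.0    # within-study standard error each study reports
bias_sd = 3.0        # unreported systematic error, differing between studies

n_studies = 10_000
bias = rng.normal(0.0, bias_sd, n_studies)
estimates = rng.normal(truth + bias, internal_se)

# Each study's nominal 95% CI uses only its internal standard error.
lower = estimates - 1.96 * internal_se
upper = estimates + 1.96 * internal_se
coverage = np.mean((lower <= truth) & (truth <= upper))
print(f"Nominal 95% intervals cover the truth {coverage:.0%} of the time")
```

Because the internal standard error ignores the between-study component, actual coverage falls well below the nominal 95%, just as the successive solar-distance estimates fell outside one another's intervals.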
Allied to this reasoning is Levins' notion of truth lying at the "intersection of independent lies" (Levins 1966:423). He considered alternative models, each of which suffered from 1 or more simplifying assumptions (and all models involve some simplification of the system being modeled) that made each model unrealistic in some way or another. He suggested that if the models—despite their differing assumptions—lead to similar results, we have a robust finding that is relatively free of the details of each model. In the context of metareplication, although independent studies of some phenomenon each may suffer from various shortcomings, if they paint substantially similar pictures, we can have confidence in what we see.
The idea of robustness in data analysis is analogous to robustness among studies. Robustness in the analysis of data from a single study means that the conclusions are not strongly dependent on the assumptions involved in the analysis (Mallows 1979). Similar inferences would be obtained from statistical methods that differ in their assumptions. For example, conclusions might not vary even if the data do not follow the assumed distribution, such as the Normal, or if outliers are present in the data. Analogously, robustness in metareplication means that similar interpretations about phenomena are reached from studies that differ in methods, investigators, locations, times, etc.
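As an illustration of robustness in data analysis, the sketch below (with made-up skewed data, including an extreme outlier) applies two permutation tests whose statistics differ in their assumptions: the difference of means is sensitive to the outlier, the difference of medians is not. When both lead to the same conclusion, that conclusion does not hinge on how the analysis treats the outlier.

```python
import numpy as np

rng = np.random.default_rng(1)

def perm_pvalue(x, y, stat, n_perm=2000):
    """Two-sided permutation p-value for the statistic stat(x, y)."""
    observed = abs(stat(x, y))
    pooled = np.concatenate([x, y])
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(stat(pooled[:len(x)], pooled[len(x):])) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)

# Invented skewed counts; the treated group contains an extreme outlier.
control = np.array([1.1, 1.8, 2.0, 2.4, 2.9, 3.1, 3.5, 3.8, 4.2, 4.9])
treated = np.array([5.6, 6.0, 6.4, 7.1, 7.5, 8.2, 8.8, 9.3, 10.1, 38.0])

# Two statistics with different assumptions about the data.
p_mean = perm_pvalue(control, treated, lambda a, b: a.mean() - b.mean())
p_median = perm_pvalue(control, treated,
                       lambda a, b: np.median(a) - np.median(b))
print(f"p-value, difference of means:   {p_mean:.4f}")
print(f"p-value, difference of medians: {p_median:.4f}")
```

Here both tests clearly reject the null despite the outlier, so the inference is robust in Mallows' sense; had they disagreed, we would know the conclusion depended on the analytic assumptions.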
The notion that studies should be replicated certainly is not new. Replication, in the form of repetition of key experiments by others, has been conventional practice in science far longer than statistics itself has been (Carpenter 1990). Fisher (1971) observed that conclusions are always provisional, like progress reports, interpreting the evidence so far accrued. Tukey (1960) proposed that conclusions derive from the assessment of a series of individual results, rather than a particular result. Eberhardt and Thomas (1991:57) observed that "truly definitive single experiments are very rare in any field of endeavor, progress is actually made through sequences of investigations." Cox and Wermuth (1996:10) noted that, "Of course, deep understanding is unlikely to be achieved by a single study, no matter how carefully planned." Hurlbert and White (1993:149) suggested that, although serious statistical errors were rampant in at least 1 area of ecology, principal conclusions, "those concerning phenomena that have been studied by several investigators, have been unaffected." And Catchpole (1989:287) stated that, "Most hypotheses are tested, not in the splendid isolation of one finely controlled 'perfect' experiment, but in the wider context of a whole series of experiments and observations. Surely a much more valuable form of validity comes from the independent repetition of experiments by colleagues in different parts of the world." As summarized by Anderson et al. (2001:312), "In the long run, science is safeguarded by repeated studies to ascertain what is real and what is merely a spurious result from a single study."