P-values resulting from statistical tests of null hypotheses often are used to judge the significance of findings from a study. A small P-value suggests either that the null hypothesis is not true or that an unusual result has occurred. P-values often are misinterpreted as: (1) the probability that the results were due to chance, (2) an indication of the reliability of the result, or (3) the probability that the null hypothesis is true (Carver 1978, Johnson 1999). Small P-values are taken to represent strong evidence that the null hypothesis is false, but in reality the connection between P and Pr{H0 is true|data} is nebulous (Berger and Sellke 1987).
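A small simulation can make that nebulous connection concrete. The sketch below is illustrative only (the sample size, the effect size when the null is false, and the 50:50 mix of true and false nulls are all assumed); it asks how often the null hypothesis is actually true among simulated tests whose P-values land near 0.05.

```python
# Illustrative sketch (assumed values, not from the studies cited): among tests with
# P near 0.05, how often is the null hypothesis actually true? The answer depends on
# sample size, effect size, and the mix of true and false nulls -- it is not 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 20                      # observations per simulated study (assumed)
effect = 0.5                # true mean when the null hypothesis is false (assumed)
n_sims = 200_000

null_true = rng.random(n_sims) < 0.5            # half the simulated nulls are true
means = np.where(null_true, 0.0, effect)
xbar = rng.normal(means, 1.0 / np.sqrt(n))      # sample means, sigma = 1
z = xbar * np.sqrt(n)                           # one-sample z statistics for H0: mu = 0
p = 2 * stats.norm.sf(np.abs(z))                # two-sided P-values

near_05 = (p > 0.04) & (p < 0.06)
print("Pr{H0 is true | P near 0.05} ~", round(null_true[near_05].mean(), 2))
```

With these assumed values the fraction is roughly 0.2, several times larger than the P-value itself, which is the sense in which P and Pr{H0 is true|data} are only loosely connected.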
R. A. Fisher was an early advocate of P-values, but he actually recommended that they be used in a manner opposite to the way they mostly are used now. Fisher viewed a significant P-value as providing reason to continue studying the phenomenon (recalling that either the hypothesis was wrong or something unusual happened). In stark contrast, modern researchers often use nonsignificant P-values as reason to continue study; many investigators, when faced with nonsignificant results, argue that "a larger sample size [i.e., further research] is needed to reach significance."
Scientists are encouraged to replicate studies using the same methods as were used in the original studies. This practice eliminates variation due to methodology and, if different results are obtained, suggests that the initial results may have been an accident (Table 2). That is, they did not bear up under metareplication. Obtaining the same results when using the same methods, however, allows for the possibility that the results were specific to the method, rather than a general truth.
Replication with different methods is critical to determine whether results are robust with respect to methodology and not an artifact of the methods employed. When we get consistent results with different methods, we have greater confidence in those results; the results are robust with respect to method. Should we get different results when different methods are used, the original results may have been artifacts of the methods (Table 2).
Table 2. Possible interpretations when a study is replicated, according to whether the methods and the results of the original and replicated studies are the same or different.

| Methods of original and replicated studies | Same results | Different results |
|---|---|---|
| Same | Results may have been specific to the method, rather than a general truth | Original results may have been accidental, not bearing up under metareplication |
| Different | Results are robust with respect to method | Results may have been an artifact of the method used |
A cogent argument has been made that only well-thought-out hypotheses should be tested in a study. Doing so avoids "fishing expeditions" and the chance of claiming that accidental findings are real (Johnson 1981, Rexstad et al. 1988, Anderson et al. 2001, Burnham and Anderson 2002). I think that surprise findings in fact should be considered, but not as confirmed results from the study so much as prods for further investigation; they generate hypotheses to test. For example, suppose you conduct a regression analysis involving many explanatory variables. If you use a stepwise procedure to select variables, that analysis can give very misleading estimates of effect sizes, P-values, and the like (Pope and Webster 1972, Hurvich and Tsai 1990). Variables deemed to be important may or may not actually have major influence on the response variable, and conclusions to that effect should not be claimed. It is appropriate, on the other hand, to use the results in a further investigation, focusing on the explanatory variables that the analysis suggested were influential. It is better to conduct a new study (i.e., to metareplicate), but at a minimum cross-validation will be useful; in that approach, a model is developed with part of the data set and evaluated on the remaining data. This is not to suggest that a priori hypotheses are unimportant, or that carefully designed studies to evaluate those hypotheses are not a highly appropriate way to conduct science; the point is only that a balance between exploratory and confirmatory research is needed. Studies should be designed to learn something, not merely to generate questions for further research, and apparent findings need to be rigorously confirmed. If scientists looked only at variables known or suspected to be influential, however, how would we ever get new findings?
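As a rough sketch of that cross-validation step (the data below are simulated noise, and the one-variable-at-a-time screening is only a simplified stand-in for a stepwise procedure), the following code selects "significant" predictors on half of a data set and then checks how many remain significant on the held-out half.

```python
# Hedged illustration: screen many candidate predictors on a training half, then see
# whether the "important" ones hold up on the held-out half. With pure noise, a few
# predictors usually pass the screen, and most of them fail on the holdout.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, k = 100, 30                            # observations and candidate predictors (assumed)
X = rng.normal(size=(n, k))               # explanatory variables: pure noise
y = rng.normal(size=n)                    # response unrelated to every predictor

train = np.arange(n) < n // 2             # first half for model development
test = ~train                             # second half held out for evaluation

def screen(X, y):
    """P-values from simple one-at-a-time regressions of y on each predictor."""
    return np.array([stats.linregress(X[:, j], y).pvalue for j in range(X.shape[1])])

selected = np.flatnonzero(screen(X[train], y[train]) < 0.05)
p_test = screen(X[test][:, selected], y[test])

print("selected on training half:", selected)
print("still significant on holdout:", int((p_test < 0.05).sum()), "of", len(selected))
```

The point is not that this particular screening rule is recommended, only that evaluating the selected variables on data not used to select them exposes how many of the apparent effects were accidental.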
Meta-analysis essentially is an analysis of analyses (Hedges and Olkin 1985, Osenberg et al. 1999, Gurevitch and Hedges 2001). The units being analyzed are themselves analyses. Meta-analysis dates back to 1904, when Karl Pearson grouped data from various military tests and concluded that vaccination against intestinal fever was ineffective (Mann 1994). Often studies of comparable effects are analyzed by vote counting: of the studies that looked for the effect, this many had statistically significant results and that many did not. One problem with the vote-counting approach is that, if the true effect is not strong and sample sizes are not large, most studies will not detect the effect. So a critical review of the studies would conclude that most studies found no effect, and the effect would be dismissed.
In contrast, meta-analysis examines the full range of estimated effects (not P-values), whether or not they were individually statistically significant. From the resulting pattern may emerge evidence of consistent effects, even if they are small. Mann (1994) cited several instances in which meta-analyses led to dramatically different conclusions than did expert reviews of studies that used vote-counting methods. Meta-analysis does have a serious danger, however, in publication bias (Berlin et al. 1989). A study that demonstrates an effect at a statistically significant level is more likely to be written for publication, favorably reviewed by referees and editors, and ultimately published than is a study without such significant effects (Sterling et al. 1995). So the published literature on an effect may give a very biased picture of what the research in toto demonstrated. (The medical community worries that ineffective and even harmful medical practices may be adopted if positive results are more likely to be published than negative results [Hoffert 1998]. Indeed, an on-line journal, the Journal of Negative Results in Biomedicine, is being launched to correct distortions caused by a general bias against null results [Anonymous 2002].) Even if results from unpublished studies could be accessed, much care would be needed to evaluate them. Bailar (1995) observed that quality meta-analysis requires expertise in the subject matter reviewed. A question always looms about unpublished studies (Hoffert 1998): Was the study not published because it generated no statistically significant results or because it was flawed in some way? Further, could it be that the study was not published because it was contrary to the prevailing thinking at the time? Yet, despite the concerns with meta-analysis, it does provide a vehicle for thoughtfully conducting a synthesis of the studies relevant to a particular question.
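A small simulation illustrates the contrast (the effect size, study sizes, and number of studies below are assumed for illustration): many small studies of the same weak effect are mostly nonsignificant individually, so vote counting dismisses the effect, while pooling the estimates with inverse-variance weights, one of the simplest fixed-effect meta-analytic summaries, recovers it.

```python
# Illustrative sketch (assumed values): vote counting versus a simple fixed-effect
# meta-analysis of many small studies estimating the same weak effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
true_effect = 0.2                         # weak true effect (assumed)
n_per_study, n_studies = 25, 15           # small studies (assumed)

se = np.full(n_studies, 1 / np.sqrt(n_per_study))        # standard error of each estimate
estimates = rng.normal(true_effect, se)                   # per-study effect estimates
p = 2 * stats.norm.sf(np.abs(estimates / se))             # per-study two-sided P-values

# Vote counting: how many studies individually "found" the effect?
print("individually significant studies:", int((p < 0.05).sum()), "of", n_studies)

# Fixed-effect meta-analysis: inverse-variance weighted mean of the estimates
w = 1 / se**2
pooled = np.sum(w * estimates) / np.sum(w)
pooled_se = 1 / np.sqrt(np.sum(w))
p_pooled = 2 * stats.norm.sf(abs(pooled / pooled_se))
print(f"pooled estimate = {pooled:.2f}, P = {p_pooled:.4f}")
```

With values like these, typically only a handful of the 15 studies reach P < 0.05, yet the pooled estimate sits near the true effect with a much smaller P-value, which is exactly the pattern vote counting misses.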
Statisticians, including myself (Johnson 1974), regularly advise against conducting studies that lack sufficient power. Observations in those studies are too few to yield a high probability of rejecting some null hypothesis, even if the hypothesis is false. While large samples certainly are preferable to small samples, I no longer believe it is appropriate to condemn studies with small samples. Indeed, it may be preferable to have the results of numerous small but well-designed studies rather than results from a single "definitive" investigation. This is so because the single study, despite its large samples, may have been compromised by an unusual happenstance or by the effect of a "lurking variable" (a third variable that induces a correlation between two other variables that are otherwise unrelated). Numerous small studies, because of the benefits of metareplication, are in less jeopardy of yielding misleading conclusions.
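The power problem can be made concrete with a short calculation (a sketch assuming a standardized effect size of 0.3 and a two-sided one-sample z-test): the probability of rejecting a false null hypothesis stays modest until samples are fairly large.

```python
# Illustrative power calculation (assumed effect size) for a two-sided one-sample z-test:
# power = Pr(reject H0 | H0 false) as a function of sample size.
import numpy as np
from scipy import stats

effect = 0.3                              # standardized true effect size (assumed)
alpha = 0.05
z_crit = stats.norm.isf(alpha / 2)        # two-sided critical value, ~1.96

for n in (10, 25, 50, 100, 400):
    shift = effect * np.sqrt(n)           # noncentrality of the z statistic
    power = stats.norm.sf(z_crit - shift) + stats.norm.cdf(-z_crit - shift)
    print(f"n = {n:4d}  power ~ {power:.2f}")
```

With an effect of this size, a study of 25 observations has roughly a one-in-three chance of reaching significance, so a collection of such studies assessed only by vote counting would mostly report "no effect" even though the effect is real.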
One danger of a small study is that the sampled units do not adequately represent the target population. Lack of representation also can plague larger studies, however. I suspect that the greatest danger of a small study is the tendency to accept the null hypothesis as truth, if it is not rejected. Concluding that a hypothesis is true simply because it was not rejected in a statistical test is folly. Nonetheless, it is done frequently; Johnson (1999, 2002) cited numerous instances in which authors of The Journal of Wildlife Management articles concluded that null hypotheses were true, even when samples were small and test statistics were nearly significant.
Metareplication protects against situations in which an effect exists but is small, and therefore is not statistically significant in individual studies and so is never claimed. Hence, small studies should not be discouraged, as long as the investigators acknowledge that they are not definitive. Studies should be designed to address the topic as effectively and efficiently as possible. If the scope has to be narrow and the scale has to be small, or if logistic constraints preclude large samples, the results still may be worthwhile and should be published, with their limitations acknowledged. Without meta-analysis or a similar strategy, however, the value of small studies will not be realized.
This journal encourages authors to present management implications deriving from the studies they describe. That practice may not always be appropriate. Results from a single study, unless supported by evidence from other studies, may be misleading. The fact that a study is the only one dealing with a certain species in a particular state is no reason to base management recommendations solely on that single study. Recommendations should be based on a larger body of knowledge. Similarly, manuscripts should be considered for publication even if they are not "groundbreaking," but instead provide support for inferences originally obtained from previous studies.
What about "management studies"? These seem to be studies conducted by people other than scientists or graduate students. They also are claimed to require less rigor (good design, adequate sample size, etc.) than "research studies" do. I would argue that the reverse may in fact be true: management studies should at least equal research investigations in quality. If an erroneous conclusion is reached in a research study, the only negative consequence is the publication of that error in a journal, and, one hopes, further investigation will demonstrate that the published conclusion was unwarranted. In contrast, an erroneous conclusion reached in a management study may well lead to some very inappropriate management action being taken, with negative consequences for wildlife and their habitats.
The Bayesian philosophy offers a more natural way to think about metareplication than does the traditional (frequentist) approach. In concept, a frequentist considers only the likelihood function, acting as if the only information about a variable under investigation derives from the study at hand. A Bayesian accounts for the context and history more explicitly by considering the likelihood in conjunction with the prior distribution. The prior incorporates what is known or believed about the variable before the study was conducted. I think people naturally tend to be Bayesians. They have what might be termed mental inertia: they tend to continue in their existing beliefs even in the face of evidence against those beliefs. Only with repeated doses of new evidence do they change their opinions. Sterne and Smith (2001) suggested that the public, by being cynical about the results of new medical studies, were exhibiting a subconscious Bayesianism.
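The sketch below illustrates that updating with a conjugate normal example (the prior, the hypothetical study estimates, and their variances are assumed for illustration): each new study pulls the posterior toward the accumulating evidence, but only gradually, which is the mental inertia described above.

```python
# Illustrative normal-normal Bayesian updating (assumed values): a skeptical prior is
# revised a little at a time as successive study estimates arrive.
prior_mean, prior_var = 0.0, 0.25           # prior belief about an effect (assumed)
study_estimates = [0.8, 0.7, 0.9, 0.75]     # hypothetical results from successive studies
study_var = 0.25                            # sampling variance of each estimate (assumed)

mean, var = prior_mean, prior_var
for i, est in enumerate(study_estimates, start=1):
    # conjugate update: precision-weighted average of current belief and new evidence
    precision = 1 / var + 1 / study_var
    mean = (mean / var + est / study_var) / precision
    var = 1 / precision
    print(f"after study {i}: posterior mean = {mean:.2f}, sd = {var ** 0.5:.2f}")
```

Even though every study estimate exceeds 0.7, the posterior mean remains well below the individual estimates after four studies, reflecting the weight still carried by the prior.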