Northern Prairie Wildlife Research Center

Frequentist methods, dating back at least a century, are much more than merely test statistics and *P*-values.

The prevalence of testing null hypotheses that are uninteresting (or even
silly) is quite high. For example, Anderson et al. (2000) found an average
of 6,188 *P*-values per year (1993-1997) in *Ecology* and 5,263
per year (1994-1998) in *The Journal of Wildlife Management* and suggested
that these large frequencies represented a misuse and overuse of null hypothesis
testing methods. Johnson (1999) and Anderson et al. (2000) give examples of
null hypotheses tested that were clearly of little biological interest, or
were entirely unsupported before the study was initiated. We strongly recommend
a substantial decrease in the reporting of results of null hypothesis tests
when the null is trivial or uninteresting.

Naked *P*-values (i.e., those reported without estimates of effect size,
its sign, and a measure of precision) are especially to be avoided. Nonparametric
tests (like their parametric counterparts) are based on estimates of effect
size, although usually only the direction of the effect is reported (a nearly
naked *P*-value). The problem with naked and nearly naked *P*-values
is that their magnitude is often interpreted as indicative of effect size.
It is misleading to infer that small *P*-values indicate large effect
sizes, because small *P*-values can also result from low variability or
large sample sizes. *P*-values are not a proper measure of strength of evidence
(Royall 1997, Sellke et al. 2001).
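The dependence of the *P*-value on sample size can be illustrated with a minimal sketch. It assumes a two-sample z-test with a known common standard deviation and equal group sizes; the function names and the numbers (a difference of 0.5 and a standard deviation of 5) are hypothetical and chosen only for illustration. The effect size is held fixed while only the sample size changes.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(diff, sd, n):
    """Two-sided P-value for a difference in means between two groups.

    Assumes (for illustration only) a z-test with known common standard
    deviation `sd` and `n` observations in each group.
    """
    se = sd * sqrt(2.0 / n)          # standard error of the difference
    z = diff / se                    # test statistic
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def p_values_by_n(diff, sd, sizes):
    """Map each per-group sample size to the resulting P-value."""
    return {n: two_sided_p(diff, sd, n) for n in sizes}


# Hypothetical values: the same small difference at four sample sizes.
results = p_values_by_n(diff=0.5, sd=5.0, sizes=[10, 100, 1000, 10000])
for n, p in results.items():
    print(f"n per group = {n:>6}: difference = 0.5, P = {p:.4g}")
```

The estimated difference is identical in every row, yet the *P*-value moves from clearly nonsignificant to extremely small as the sample size grows, which is why the *P*-value alone says little about the size of the effect.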

We encourage authors to carefully consider whether the information they convey in the language of null hypothesis testing could be greatly improved by instead reporting estimates and measures of precision. Emphasizing estimation over hypothesis testing in the reporting of the results of data analysis helps protect against the pitfalls associated with the failure to distinguish between statistical significance and biological significance (Yoccoz 1991).
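As a sketch of what reporting an estimate with a measure of precision might look like, the example below computes a difference in means, its standard error, and an approximate 95% confidence interval. The function name `effect_with_ci` and the data are hypothetical, and the interval uses a large-sample normal approximation, which is an assumption made to keep the example short; with small samples an interval based on the t distribution would usually be preferred.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def effect_with_ci(treatment, control, level=0.95):
    """Return (estimate, standard error, (lower, upper)) for the
    difference in means, treatment minus control.

    Uses unpooled sample variances and a normal approximation for the
    interval (an illustrative simplification).
    """
    diff = mean(treatment) - mean(control)
    se = sqrt(variance(treatment) / len(treatment)
              + variance(control) / len(control))
    z = NormalDist().inv_cdf(0.5 + level / 2.0)   # about 1.96 for 95%
    return diff, se, (diff - z * se, diff + z * se)


# Invented measurements, for illustration only.
treatment = [14.2, 15.1, 13.8, 16.0, 15.4, 14.9, 15.7, 14.4]
control = [13.9, 14.6, 13.2, 15.1, 14.8, 14.0, 14.7, 13.5]

diff, se, (lower, upper) = effect_with_ci(treatment, control)
print(f"Estimated effect = {diff:.2f} (SE = {se:.2f}, "
      f"95% CI = {lower:.2f} to {upper:.2f})")
```

A statement of this form gives the reader the sign and magnitude of the effect together with its precision, so biological importance can be judged directly rather than inferred from a test result.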

We do not recommend reporting test statistics and *P*-values from observational
studies, at least not without appropriate caveats (Sellke et al. 2001). Such
results are suggestive rather than conclusive given the observational nature
of the data. In strict experiments, these quantities can be useful, but we
still recommend a focus on the estimation of effect size rather than on *P*-values
and their supposed statistical significance.

The computer output of many canned statistical packages contains numerous
test statistics and *P*-values, many of which are of little interest;
reporting these values may create an aura of scientific objectivity when both
the objectivity and substance are often lacking. We encourage authors to resist
the temptation to report dozens of *P*-values only because these appear
on computer output.

Do not claim to have proven the null hypothesis; that a null hypothesis cannot be
proven is a basic tenet of science. If a test yields a nonsignificant *P*-value, it may not be unreasonable
to state that "the test failed to reject the null hypothesis" or that "the
results seem consistent with the null hypothesis" and then discuss Type I
and II errors. However, these classical issues need not be raised when discussing
the estimated effect size (e.g., "The estimated effect of the treatment was
small," and then give the estimate and measure of precision).

Do not report estimated test power after a statistical test has been conducted and found to be nonsignificant, as such post hoc power is not meaningful (Goodman and Berlin 1994). A priori power and sample size considerations are important in planning an experimental design, but estimates of post hoc power should not be reported (Gerard et al. 1998, Hoenig and Heisey 2001).
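One way to see why post hoc power adds no information is sketched below. It assumes a one-sided z-test and takes the observed effect as if it were the true effect; the function name `observed_power` is hypothetical. Under these assumptions the computed power depends on the data only through the *P*-value, so it cannot tell the reader anything the *P*-value has not already said.

```python
from statistics import NormalDist

_STD = NormalDist()  # standard normal distribution


def observed_power(p_value, alpha=0.05):
    """Post hoc ("observed") power of a one-sided z-test.

    Treats the observed test statistic as the true standardized effect
    (an illustrative assumption). Requires 0 < p_value < 1.
    """
    z_obs = _STD.inv_cdf(1.0 - p_value)   # statistic implied by the P-value
    z_crit = _STD.inv_cdf(1.0 - alpha)    # critical value of the test
    return 1.0 - _STD.cdf(z_crit - z_obs)


for p in (0.05, 0.10, 0.20, 0.50):
    print(f"P = {p:.2f} -> observed power = {observed_power(p):.3f}")
```

In this setting a nonsignificant result (a *P*-value above alpha) always corresponds to observed power below one half, and larger *P*-values always give lower observed power.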
