Northern Prairie Wildlife Research Center
The prevalence of testing null hypotheses that are uninteresting (or even silly) is quite high. For example, Anderson et al. (2000) found an average of 6,188 P-values per year (1993-1997) in Ecology and 5,263 per year (1994-1998) in The Journal of Wildlife Management and suggested that these large frequencies represented a misuse and overuse of null hypothesis testing methods. Johnson (1999) and Anderson et al. (2000) give examples of null hypotheses tested that were clearly of little biological interest, or were entirely unsupported before the study was initiated. We strongly recommend a substantial decrease in the reporting of results of null hypothesis tests when the null is trivial or uninteresting.
Naked P-values (i.e., those reported without estimates of effect size, its sign, and a measure of precision) are especially to be avoided. Nonparametric tests (like their parametric counterparts) are based on estimates of effect size, although usually only the direction of the effect is reported (a nearly naked P-value). The problem with naked and nearly naked P-values is that their magnitude is often interpreted as indicative of effect size. It is misleading to interpret small P-values as indicating large effect sizes, because small P-values can also result from low variability or large sample sizes. P-values are not a proper measure of strength of evidence (Royall 1997, Sellke et al. 2001).
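The dependence of the P-value on sample size can be illustrated with a minimal Python sketch. It assumes a two-sample z-test with a known, common standard deviation; the function name `two_sided_p` and all numbers are hypothetical and chosen only for illustration. The effect size is held fixed while the sample size changes.

```python
import math

def two_sided_p(diff, sd, n_per_group):
    """Two-sided P-value for a two-sample z-test with a known, common SD."""
    se = sd * math.sqrt(2.0 / n_per_group)  # standard error of the difference
    z = diff / se
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) and stays accurate in the tail
    return math.erfc(abs(z) / math.sqrt(2.0))

# Hypothetical values: the same small effect (0.1 SD) at two sample sizes
effect, sd = 0.1, 1.0
p_small_n = two_sided_p(effect, sd, n_per_group=20)
p_large_n = two_sided_p(effect, sd, n_per_group=2000)

print(f"n = 20 per group:   P = {p_small_n:.3f}")
print(f"n = 2000 per group: P = {p_large_n:.4f}")
```

The estimated effect is identical in both calls; only the sample size differs, yet the second P-value is far smaller. A reader shown the P-value alone could not tell whether the effect was large or the sample was.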
We encourage authors to carefully consider whether the information they convey in the language of null hypothesis testing could be greatly improved by instead reporting estimates and measures of precision. Emphasizing estimation over hypothesis testing in the reporting of the results of data analysis helps protect against the pitfalls associated with the failure to distinguish between statistical significance and biological significance (Yoccoz 1991).
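As a rough sketch of what reporting an estimate with a measure of precision might look like, the Python example below computes a difference in means and an approximate 95% confidence interval. The helper `difference_with_ci` and the data are hypothetical, and the interval uses a large-sample normal approximation as a simplifying assumption; with samples this small, a t-based interval would usually be preferable.

```python
import math
from statistics import NormalDist, mean, stdev

def difference_with_ci(a, b, level=0.95):
    """Estimate mean(a) - mean(b) with a large-sample (normal) confidence interval."""
    estimate = mean(a) - mean(b)
    # Standard error that allows unequal variances in the two groups
    se = math.sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # normal quantile for the chosen level
    return estimate, estimate - z * se, estimate + z * se

# Hypothetical measurements, invented for illustration only
treated = [12.1, 13.4, 11.8, 14.0, 12.9, 13.3, 12.5, 13.8]
control = [11.9, 12.2, 11.5, 12.8, 12.0, 12.6, 11.7, 12.4]

est, lower, upper = difference_with_ci(treated, control)
print(f"Estimated difference: {est:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

A statement of this form gives the size of the effect, its sign, and its precision, which lets the reader judge biological significance directly.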
We do not recommend reporting test statistics and P-values from observational studies, at least not without appropriate caveats (Sellke et al. 2001). Such results are suggestive rather than conclusive given the observational nature of the data. In strict experiments, these quantities can be useful, but we still recommend a focus on the estimation of effect size rather than on P-values and their supposed statistical significance.
The computer output of many canned statistical packages contains numerous test statistics and P-values, many of which are of little interest; reporting these values may create an aura of scientific objectivity when both the objectivity and substance are often lacking. We encourage authors to resist the temptation to report dozens of P-values only because these appear on computer output.
Do not claim to have proven the null hypothesis; this is a basic tenet of science. If a test yields a nonsignificant P-value, it may not be unreasonable to state that "the test failed to reject the null hypothesis" or that "the results seem consistent with the null hypothesis" and then discuss Type I and II errors. However, these classical issues are not necessary when discussing the estimated effect size (e.g., "The estimated effect of the treatment was small," and then give the estimate and measure of precision).
Do not report estimated test power after a statistical test has been conducted and found to be nonsignificant, as such post hoc power is not meaningful (Goodman and Berlin 1994). A priori power and sample size considerations are important in planning an experimental design, but estimates of post hoc power should not be reported (Gerard et al. 1998, Hoenig and Heisey 2001).
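One way to see why post hoc power adds no information is that, for a given test and significance level, it is determined entirely by the P-value already in hand. The Python sketch below assumes a two-sided z-test and treats the observed effect as if it were the true effect; the function name `observed_power` is hypothetical.

```python
from statistics import NormalDist

def observed_power(p_value, alpha=0.05):
    """Post hoc ("observed") power of a two-sided z-test, computed only from its P-value.

    Expects 0 < p_value <= 1.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    z_obs = std_normal.inv_cdf(1.0 - p_value / 2.0)  # |z| implied by the P-value
    # Probability of rejecting if the true effect equaled the observed one
    return (1.0 - std_normal.cdf(z_crit - z_obs)) + std_normal.cdf(-z_crit - z_obs)

for p in (0.05, 0.20, 0.50, 0.90):
    print(f"P = {p:.2f} -> observed power = {observed_power(p):.3f}")
```

Under these assumptions, the computed power falls as the P-value rises, and any nonsignificant result yields observed power of roughly one half or less. The calculation therefore restates the P-value rather than supplying new evidence.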