USGS - science for a changing world

Northern Prairie Wildlife Research Center


The Insignificance of Statistical Significance Testing

What are the Alternatives?

What should we do instead of testing hypotheses? As Quinn and Dunham (1983) pointed out, it is more fruitful to determine the relative importance of the contributions of, and interactions between, a number of processes. For this purpose, estimation is far more appropriate than hypothesis testing (Campbell 1992). For certain other situations, decision theory is an appropriate tool. For either of these applications, as well as for hypothesis testing itself, the Bayesian approach offers some distinct advantages over the traditional methods. These alternatives are briefly outlined below. Although the alternatives will not meet all potential needs, they do offer attractive choices in many frequently encountered situations.

Estimates and Confidence Intervals

Four decades ago, Anscombe (1956) observed that statistical hypothesis tests were totally irrelevant, and that what was needed were estimates of magnitudes of effects, with standard errors. Yates (1964) indicated that "The most commonly occurring weakness in the application of Fisherian methods is undue emphasis on tests of significance, and failure to recognize that in many types of experimental work estimates of the treatment effects, together with estimates of the errors to which they are subject, are the quantities of primary interest." Further, because wildlife ecologists want to influence management practices, Johnson (1995) noted that, "If ecologists are to be taken seriously by decision makers, they must provide information useful for deciding on a course of action, as opposed to addressing purely academic questions." To reinforce that point, several education and psychological journals have adopted editorial policies requiring that parameter estimates accompany any P-values presented (McLean and Ernest 1998).

Ordinary confidence intervals provide more information than do P-values. Knowing that a 95% confidence interval includes zero tells one that, if a test of the hypothesis that the parameter equals zero is conducted, the resulting P-value will be greater than 0.05. A confidence interval provides both an estimate of the effect size and a measure of its uncertainty. A 95% confidence interval of, say, (-50, 300) suggests the parameter is less well estimated than does a confidence interval of (120, 130). Perhaps surprisingly, confidence intervals have a longer history than statistical hypothesis tests (Schmidt and Hunter 1997).
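The relationship between intervals and tests can be made concrete with a small sketch. The numbers below are hypothetical, chosen only to reproduce the two intervals mentioned above under a normal approximation:

```python
def normal_ci(estimate, se, z=1.96):
    """Approximate 95% confidence interval under a normal approximation."""
    return (estimate - z * se, estimate + z * se)

# Two hypothetical effects with the same point estimate but very
# different precision (standard errors chosen to match the text).
wide = normal_ci(125.0, 89.3)    # roughly (-50, 300)
narrow = normal_ci(125.0, 2.55)  # roughly (120, 130)

# The wide interval includes zero, so a test of H0: effect = 0 would
# not reject at the 0.05 level; the narrow interval excludes zero and
# would reject. Unlike the bare P-value, each interval also conveys
# the magnitude of the effect and how well it is estimated.
```
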

With these advantages and a longer history, why have confidence intervals not seen wider use? Steiger and Fouladi (1997) and Reichardt and Gollob (1997) posited several explanations: (1) hypothesis testing has become a tradition; (2) the advantages of confidence intervals are not recognized; (3) there is some ignorance of the procedures available; (4) major statistical packages do not include many confidence interval estimates; (5) sizes of parameter estimates are often disappointingly small even though they may be very significantly different from zero; (6) the wide confidence intervals that often result from a study are embarrassing; (7) some hypothesis tests (e.g., chi square contingency table) have no uniquely defined parameter associated with them; and (8) recommendations to use confidence intervals often are accompanied by recommendations to abandon statistical tests altogether, which is unwelcome advice. None of these reasons is a valid excuse for using hypothesis tests in lieu of confidence intervals in situations for which parameter estimation is the objective.

Decision Theory

Often experiments or surveys are conducted to help make some decision, such as what limits to set on hunting seasons, whether a forest stand should be logged, or whether a pesticide should be approved. In those cases, hypothesis testing is inadequate, for it does not take into consideration the costs of alternative actions. Here a useful tool is statistical decision theory: the theory of acting rationally, in the face of uncertainty, with respect to anticipated gains and losses. Hypothesis testing generally limits the probability of a Type I error (rejecting a true null hypothesis), often arbitrarily set at α = 0.05, while letting the probability of a Type II error (accepting a false null hypothesis) fall where it may. In ecological situations, however, a Type II error may be far more costly than a Type I error (Toft and Shea 1983). As an example, approving a pesticide that reduces the survival rate of an endangered species by 5% may be disastrous to that species, even if that change is not statistically detectable. As another, continued overharvest in marine fisheries may result in the collapse of the ecosystem even while statistical tests are unable to reject the null hypothesis that fishing has no effect (Dayton 1998). Details on decision theory can be found in DeGroot (1970), Berger (1985), and Pratt et al. (1995).
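The pesticide example can be sketched as a minimal decision-theoretic calculation. All probabilities and losses below are hypothetical, invented purely to illustrate how expected loss, rather than a significance threshold, drives the decision:

```python
# Hypothetical probability that the pesticide is harmful, as might be
# assessed from (statistically "non-significant") monitoring data.
p_harm = 0.30

# Hypothetical losses (arbitrary units) for each action under each
# state of nature. Approving a harmful pesticide is catastrophic for
# the endangered species; rejecting a safe one costs far less.
loss = {
    ("approve", "harmful"): 1000.0,
    ("approve", "safe"):       0.0,
    ("reject",  "harmful"):   10.0,
    ("reject",  "safe"):      50.0,
}

def expected_loss(action):
    """Loss averaged over the states of nature, weighted by belief."""
    return (p_harm * loss[(action, "harmful")]
            + (1 - p_harm) * loss[(action, "safe")])

best = min(("approve", "reject"), key=expected_loss)
# Rejection minimizes expected loss here, even though a test at
# alpha = 0.05 might fail to detect the 5% reduction in survival.
```
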

Model Selection

Statistical tests can play a useful role in diagnostic checks and evaluations of tentative statistical models (Box 1980). But even for this application, competing tools are superior. Information criteria, such as Akaike's, provide objective measures for selecting among different models fitted to a data set. Burnham and Anderson (1998) provided a detailed overview of model selection procedures based on information criteria. In addition, for many applications it is not advisable to select a "best" model and then proceed as if that model were correct. There may be a group of models entertained, and the data will provide different strengths of evidence for each model. Rather than basing decisions or conclusions on the single model most strongly supported by the data, one should acknowledge the uncertainty about the model by considering the entire set of models, each perhaps weighted by its own strength of evidence (Buckland et al. 1997).
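One common way to express each model's strength of evidence is the Akaike weight, computed from AIC differences as in Burnham and Anderson (1998). The sketch below uses hypothetical AIC values for three candidate models:

```python
import math

def akaike_weights(aic_values):
    """Convert AIC values for competing models into Akaike weights:
    w_i = exp(-delta_i / 2) / sum_j exp(-delta_j / 2), where delta_i
    is each model's AIC difference from the best (lowest) AIC."""
    best = min(aic_values)
    rel_likelihoods = [math.exp(-0.5 * (a - best)) for a in aic_values]
    total = sum(rel_likelihoods)
    return [r / total for r in rel_likelihoods]

# Hypothetical AIC values for three candidate models.
weights = akaike_weights([100.0, 102.0, 110.0])
# The first model is best supported, but the second retains
# non-negligible weight, so inference should not simply discard it.
```
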

Bayesian Approaches

Bayesian approaches offer alternatives preferable in several respects to ordinary methods (often called frequentist, because they invoke the long-run frequency of outcomes in imagined repetitions of experiments or samples) for hypothesis testing as well as for estimation and decision-making. Space limitations preclude a detailed review of the approach here; see Box and Tiao (1973), Berger (1985), and Carlin and Louis (1996) for longer expositions, and Schmitt (1969) for an elementary introduction.

Sometimes the value of a parameter is predicted from theory, and it is more reasonable to test whether or not that value is consistent with the observed data than to calculate a confidence interval (Berger and Delampady 1987, Zellner 1987). For testing such hypotheses, what is usually desired (and what is sometimes believed to be provided by a statistical hypothesis test) is Pr[H0 | data]. What is obtained, as pointed out earlier, is P = Pr[observed or more extreme data | H0]. Bayes' theorem offers a formula for converting between them.

    Pr[H0 | data] = Pr[data | H0] · Pr[H0] / Pr[data]

This is an old (Bayes 1763) and well-known theorem in probability. Its use in the present situation does not follow from the frequentist view of statistics, which considers Pr[H0] as unknown, but either zero or 1. In the Bayesian approach, Pr[H0] is determined before data are gathered; it is therefore called the prior probability of H0. Pr[H0] can be determined either subjectively (what is your prior belief about the truth of the null hypothesis?) or by a variety of objective means (e.g., Box and Tiao 1973, Carlin and Louis 1996). The use of subjective probabilities is a major reason that Bayesian approaches fell out of favor: science must be objective! (The other main reason is that Bayesian calculations tend to be computationally demanding, but modern computing power can largely overcome this obstacle.)
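A small numerical sketch shows why Pr[H0 | data] and P = Pr[data | H0] can differ sharply. All values here are hypothetical, chosen only to make the arithmetic transparent:

```python
def posterior_h0(prior_h0, p_data_given_h0, p_data_given_ha):
    """Bayes' theorem with two hypotheses: Pr[H0 | data] =
    Pr[data | H0] Pr[H0] / (Pr[data | H0] Pr[H0] + Pr[data | HA] Pr[HA])."""
    prior_ha = 1.0 - prior_h0
    evidence = (p_data_given_h0 * prior_h0
                + p_data_given_ha * prior_ha)
    return p_data_given_h0 * prior_h0 / evidence

# With an even-odds prior and data 3 times more likely under the
# alternative, a "significant" P of 0.05 corresponds to a posterior
# probability of H0 of 0.25 -- not 0.05.
post = posterior_h0(prior_h0=0.5,
                    p_data_given_h0=0.05,
                    p_data_given_ha=0.15)
```
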

Briefly consider parameter estimation. Suppose you want to estimate a parameter θ. Then replacing H0 by θ in the above formula yields

    Pr[θ | data] = Pr[data | θ] · Pr[θ] / Pr[data]

which provides an expression that shows how initial knowledge about the value of a parameter, reflected in the prior probability function Pr[θ], is modified by data obtained from a study, Pr[data | θ], to yield a final probability function, Pr[θ | data]. This process of updating beliefs leads in a natural way to adaptive resource management (Holling 1978, Walters 1986), a recent favorite topic in our field (e.g., Walters and Green 1997).
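The updating process is easiest to see with a conjugate prior. As a sketch, suppose θ is a survival probability, the prior is a beta distribution, and the data are s survivors out of n marked animals; the numbers below are hypothetical:

```python
def update_beta(a, b, s, n):
    """Conjugate Bayesian update: a Beta(a, b) prior on a probability
    theta, combined with s successes in n binomial trials, yields a
    Beta(a + s, b + n - s) posterior."""
    return a + s, b + (n - s)

a, b = 1.0, 1.0                         # uniform prior on theta
a, b = update_beta(a, b, s=42, n=60)    # first year's (hypothetical) data
post_mean = a / (a + b)                 # posterior mean of theta

# Next year's data update the same posterior again -- the sequential
# learning that underlies adaptive resource management.
a2, b2 = update_beta(a, b, s=35, n=50)
```
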

Bayesian confidence intervals are much more natural than their frequentist counterparts. A frequentist 95% confidence interval for a parameter θ, denoted (θL, θU), is interpreted as follows: if the study were repeated an infinite number of times, 95% of the confidence intervals that resulted would contain the true value θ. It says nothing about the particular study that was actually conducted, which led Howson and Urbach (1991:373) to comment that "statisticians regularly say that one can be '95 per cent confident' that the parameter lies in the confidence interval. They never say why." In contrast, a Bayesian confidence interval, sometimes called a credible interval, is interpreted to mean that the probability that the true value of the parameter lies in the interval is 95%. That statement is much more natural, and is what people think a confidence interval is, until they get the notion drummed out of their heads in statistics courses.
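A credible interval can be read directly from draws of the posterior distribution. As a sketch, assume a hypothetical Beta(43, 19) posterior for a survival probability θ and take equal-tailed quantiles of simulated draws (Python's standard library supplies the beta sampler):

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

# Draws from a hypothetical Beta(43, 19) posterior for theta.
n_draws = 20000
draws = sorted(random.betavariate(43, 19) for _ in range(n_draws))

# Equal-tailed 95% credible interval from the 2.5% and 97.5% quantiles.
lo = draws[int(0.025 * n_draws)]
hi = draws[int(0.975 * n_draws)]
# Given the model and data, Pr[lo < theta < hi] is (approximately)
# 0.95 -- the direct statement a frequentist interval cannot make.
```
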

For decision analysis, Bayes' theorem offers a very logical way to make decisions in the face of uncertainty. It allows for incorporating beliefs, data, and the gains or losses expected from possible consequences of decisions. See Wolfson et al. (1996) and Ellison (1996) for recent overviews of Bayesian methods with an ecological orientation.

