Northern Prairie Wildlife Research Center

U.S. Geological Survey, Biological Resources Division

Jamestown, North Dakota 58401

Presented at the Fifth Annual Conference of the Wildlife Society, Buffalo, New York, 26 September 1998. Part of the symposium, Evaluating the Role of Hypothesis Testing/Power Analysis in Wildlife Science, sponsored by the Biometrics Working Group.

*Abstract*: Wildlife biologists recently have been subjected to the
credo that if you're not testing hypotheses, you're not doing real science.
To protect themselves against rejection by journal editors, authors cloak
their findings in an armor of P values. I contend that much statistical hypothesis
testing is misguided. Virtually all null hypotheses tested are, in fact, false;
the only issue is whether or not the sample size is sufficiently large to
show it. No matter if it is or not, one then gets led into the quagmire of
deciding biological significance versus statistical significance. Most often,
parameter estimation is a more appropriate tool than statistical hypothesis
testing. Statistical hypothesis testing should be distinguished from scientific
hypothesis testing, in which truly viable alternative hypotheses are evaluated
in a real attempt to falsify them. The latter method is part of the deductive
logic of strong inference, which is better suited to simple systems. Ecological
systems are complex, with components typically influenced by many factors,
whose influences often vary in place and time. Competing hypotheses in ecology
rarely can be falsified and eliminated. Wildlife biologists perhaps adopt
hypothesis tests in order to make what are really descriptive studies appear
as scientific as those in the "hard" sciences. Rather than attempting to falsify
hypotheses, it may be more productive to understand the relative importance
of multiple factors.

*Literature*: The following is a compilation of references cited in
the presented paper, as well as citations provided by Marks R. Nester for
his *A Myopic View and History of Hypothesis Testing*,
which follows the Literature. Marks Nester's email address is **nesterm@qfri1.se2.dpi.qld.gov.au**.
Additional comments are available at David Parkhurst's *Quotes
Criticizing Significance Testing*.

Altman, D. G. 1985. Discussion of Dr Chatfield's paper. Journal of the Royal Statistical Society A 148: 242.
Anscombe, F. J. 1956. Discussion on Dr. David's and Dr. Johnson's Paper. Journal of the Royal Statistical Society B 18: 24-27.
Arbuthnott, J. 1710. An argument for Divine Providence, taken from the constant regularity observ'd in the births of both sexes. Philosophical Transactions of the Royal Society 23: 186-190.
Bakan, D. 1967. The test of significance in psychological research. From Chapter 1 of On Method, Jossey-Bass, Inc. (San Francisco). Reprinted in The Significance Test Controversy - A Reader, D. E. Morrison and R. E. Henkel, eds., 1970, Aldine Publishing Company (Butterworth Group).
Barnard, G. 1998. Letter. New Scientist 157: 47.
Barndorff-Nielsen, O. 1977. Discussion of D. R. Cox's paper. Scandinavian Journal of Statistics 4: 67-69.
Beaven, E. S. 1935. Discussion on Dr. Neyman's Paper. Journal of the Royal Statistical Society, Supplement 2: 159-161.
Berger, J. O., and D. A. Berry. 1988. Statistical analysis and the illusion of objectivity. American Scientist 76: 159-165.
Berger, J. O. and Sellke, T. 1987. Testing a point null hypothesis: the irreconcilability of P values and evidence. Journal of the American Statistical Association 82: 112-122.
Berkson, J. 1938. Some difficulties of interpretation encountered in the application of the chi-square test. Journal of the American Statistical Association 33: 526-536.
Berkson, J. 1942. Tests of significance considered as evidence. Journal of the American Statistical Association 37: 325-335.
Binder, A. 1963. Further considerations on testing the null hypothesis and the strategy and tactics of investigating theoretical models. Psychological Review 70: 107-115. Reprinted in Statistical Issues, A Reader for the Behavioural Sciences, R. E. Kirk, ed., 1972, Wadsworth Publishing Company: 118-126.
Boardman, T. J. 1994. The statistician who changed the world: W. Edwards Deming, 1900-1993. The American Statistician 48(3): 179-187.
Box, G. E. P. 1976. Science and statistics. Journal of the American Statistical Association 71: 791-799.
Box, G. E. P. 1983. An apology for ecumenism in statistics. In Scientific Inference, Data Analysis, and Robustness, G. E. P. Box, T. Leonard and C. F. Wu, eds., Academic Press, Inc.: 51-84.
Braithwaite, R. B. 1953. Scientific Explanation. A Study of the Function of Theory, Probability and Law in Science. Cambridge University Press.
Bryan-Jones, J. and Finney, D. J. 1983. On an error in "Instructions to Authors". HortScience 18(3): 279-282.
Buchanan-Wollaston, H. J. 1935. The philosophic basis of statistical analysis. Journal of the International Council for the Exploration of the Sea 10: 249-263.
Camilleri, S. F. 1962. Theory, probability, and induction in social research. American Sociological Review 27: 170-178. Reprinted in The Significance Test Controversy - A Reader, D. E. Morrison and R. E. Henkel, eds., 1970, Aldine Publishing Company (Butterworth Group).
Campbell, M. 1992. Letter. Royal Statistical Society News & Notes 18(9): 5.
Carver, R. P. 1978. The case against statistical significance testing. Harvard Educational Review 48: 378-399.
Casella, G. and Berger, R. L. 1987. Rejoinder. Journal of the American Statistical Association 82: 133-135.
Chatfield, C. 1985. The initial examination of data (with discussion). Journal of the Royal Statistical Society A 148: 214-253.
Chatfield, C. 1989. Comments on the paper by McPherson. Journal of the Royal Statistical Society A 152: 234-238.
Chernoff, H. 1986. Comment. The American Statistician 40(1): 5-6.
Chew, V. 1976. Comparing treatment means: a compendium. HortScience 11(4): 348-357.
Chew, V. 1977. Statistical hypothesis testing: an academic exercise in futility. Proceedings of the Florida State Horticultural Society 90: 214-215.
Chew, V. 1980. Testing differences among means: correct interpretation and some alternatives. HortScience 15(4): 467-470.
Cochran, W. G. and Cox, G. M. 1957. Experimental Designs. 2nd ed. John Wiley & Sons, Inc.
Cohen, J. 1990. Things I have learned (so far). American Psychologist 45: 1304-1312.
Cohen, J. 1994. The earth is round (p < .05). American Psychologist 49: 997-1003.
Cormack, R. M. 1985. Discussion of Dr Chatfield's paper. Journal of the Royal Statistical Society A 148: 231-233.
Cox, D. R. 1958. Some problems connected with statistical inference. Annals of Mathematical Statistics 29: 357-372.
Cox, D. R. 1977. The role of significance tests (with discussion). Scandinavian Journal of Statistics 4: 49-70.
Cox, D. R. 1982. Statistical significance tests. British Journal of Clinical Pharmacology 14: 325-331.
Cox, D. R. and Snell, E. J. 1981. Applied Statistics: Principles and Examples. Chapman and Hall.
Deming, W. E. 1975. On probability as a basis for action. The American Statistician 29: 146.
Edwards, W., Lindman, H. and Savage, L. J. 1963. Bayesian statistical inference for psychological research. Psychological Review 70: 193-242.
Finney, D. J. 1988. Was this in your statistics textbook? III. Design and analysis. Experimental Agriculture 24: 421-432.
Finney, D. J. 1989a. Was this in your statistics textbook? VI. Regression and covariance. Experimental Agriculture 25: 291-311.
Finney, D. J. 1989b. Is the statistician still necessary? Biometrie Praximetrie 29: 135-146.
Fisher, R. A. 1925. Statistical Methods for Research Workers. Oliver and Boyd (London).
Fisher, R. A. 1935. The Design of Experiments. Oliver and Boyd (Edinburgh).
Gauch Jr., H. G. 1988. Model selection and validation for yield trials with interaction. Biometrics 44: 705-715.
Gavarret, J. 1840. Principes Généraux de Statistique Médicale. [No publisher given] (Paris). (Not cited).
Geary, R. C. 1947. Testing for normality. Biometrika 34: 209-242.
Gerard, P. D., D. R. Smith, and G. Weerakkody. 1998. Limits of retrospective power analysis. Journal of Wildlife Management 62: 801-807.
Gigerenzer, G., Swijtink, Z., Porter, T., Daston, L., Beatty, J. and Krüger, L. 1989. The Empire of Chance. Cambridge University Press, Cambridge, England.
Gold, D. 1958. Comment on "A critique of tests of significance". American Sociological Review 23: 85-86.
Good, I. J. 1983. Good Thinking. The Foundations of Probability and Its Applications. University of Minnesota Press (Minneapolis).
Grant, D. A. 1962. Testing the null hypothesis and the strategy and tactics of investigating theoretical models. Psychological Review 69: 54-61.
Graybill, F. A. 1976. Theory and Application of the Linear Model. Duxbury Press (Massachusetts).
Guttman, L. 1977. What is not what in statistics. The Statistician 26: 81-107.
Guttman, L. 1985. The illogic of statistical inference for cumulative science. Applied Stochastic Models and Data Analysis 1: 3-10.
Hacking, I. 1965. Logic of Statistical Inference. Cambridge University Press.
Hahn, G. J. 1990. Commentary. Technometrics 32: 257-258.
Hays, W. L. 1973. Statistics for the Social Sciences. Second edition. Holt, Rinehart and Winston.
Healy, M. J. R. 1978. Is statistics a science? Journal of the Royal Statistical Society A 141: 385-393.
Healy, M. J. R. 1989. Comments on the paper by McPherson. Journal of the Royal Statistical Society A 152: 232-234.
Hinkley, D. V. 1987. Comment. Journal of the American Statistical Association 82: 128-129.
Hodges Jr., J. L. and Lehmann, E. L. 1954. Testing the approximate validity of statistical hypotheses. Journal of the Royal Statistical Society B 16: 261-268.
Hogben, L. 1957a. The contemporary crisis or the uncertainties of uncertain inference. In Statistical Theory. W. W. Norton & Co., Inc. Reprinted in The Significance Test Controversy - A Reader, D. E. Morrison and R. E. Henkel, eds., 1970, Aldine Publishing Company (Butterworth Group).
Hogben, L. 1957b. Statistical prudence and statistical inference. In Statistical Theory. W. W. Norton & Co., Inc. Reprinted in The Significance Test Controversy - A Reader, D. E. Morrison and R. E. Henkel, eds., 1970, Aldine Publishing Company (Butterworth Group).
Hunter, J. S. 1990. Commentary. Technometrics 32: 261.
Inman, H. F. 1994. Karl Pearson and R. A. Fisher on statistical tests: a 1935 exchange from Nature. The American Statistician 48(1): 2-11.
Johnson, D. H. 1995. Statistical sirens: the allure of nonparametrics. Ecology 76: 1998-2000.
Jones, D. 1984. Use, misuse, and role of multiple-comparison procedures in ecological and agricultural entomology. Environmental Entomology 13: 635-649.
Jones, D. and Matloff, N. 1986. Statistical hypothesis testing in biology: a contradiction in terms. Journal of Economic Entomology 79: 1156-1160.
Kempthorne, O. 1966. Some aspects of experimental inference. Journal of the American Statistical Association 61: 11-34.
Kempthorne, O. 1976. Of what use are tests of significance and tests of hypotheses. Communications in Statistics: Theory and Methods A5: 763-777.
Kish, L. 1959. Some statistical problems in research design. American Sociological Review 24: 328-338. Reprinted in The Significance Test Controversy - A Reader, D. E. Morrison and R. E. Henkel, eds., 1970, Aldine Publishing Company (Butterworth Group).
Kruskal, W. H. 1978. Significance, tests of. In International Encyclopedia of Statistics, W. H. Kruskal and J. M. Tanur, eds., Free Press (New York): 944-958.
Kruskal, W. 1980. The significance of Fisher: a review of R. A. Fisher: The Life of a Scientist. Journal of the American Statistical Association 75: 1019-1030.
Kruskal, W. and Majors, R. 1989. Concepts of relative importance in recent scientific literature. The American Statistician 43(1): 2-6.
LaForge, R. 1967. Confidence intervals or tests of significance in scientific research. Psychological Bulletin 68: 446-447.
Lindley, D. V. 1986. Discussion. The Statistician 35: 502-504.
Little, T. M. 1981. Interpretation and presentation of results. HortScience 16: 637-640.
Luce, R. D. 1988. The tools-to-theory hypothesis. Review of G. Gigerenzer and D. J. Murray, "Cognition as intuitive statistics." Contemporary Psychology 33: 582-583.
Lykken, D. T. 1968. Statistical significance in psychological research. Psychological Bulletin 70: 151-159. Reprinted in The Significance Test Controversy - A Reader, D. E. Morrison and R. E. Henkel, eds., 1970, Aldine Publishing Company (Butterworth Group).
Matloff, N. S. 1991. Statistical hypothesis testing: problems and alternatives. Environmental Entomology 20: 1246-1250.
Matthews, R. 1997. Faith, hope and statistics. New Scientist 156(2109): 36-39.
McCloskey, D. N. 1995. The insignificance of statistical significance. Scientific American 272(4): 104-105.
McNemar, Q. 1960. At random: sense and nonsense. American Psychologist 15: 295-300.
Meehl, P. E. 1967. Theory testing in psychology and physics: a methodological paradox. Philosophy of Science 34: 103-115. Reprinted in The Significance Test Controversy - A Reader, D. E. Morrison and R. E. Henkel, eds., 1970, Aldine Publishing Company (Butterworth Group).
Meehl, P. E. 1978. Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology 46: 806-834.
Meehl, P. E. 1990. Why summaries of research on psychological theories are often uninterpretable. Psychological Reports 66 (Monograph Supplement 1-V66): 195-244.
Moore, D. S. and McCabe, G. P. 1989. Introduction to the Practice of Statistics. W. H. Freeman and Company (New York).
Morrison, D. E. and Henkel, R. E. 1969. Significance tests reconsidered. The American Sociologist 4: 131-140. Reprinted in The Significance Test Controversy - A Reader, D. E. Morrison and R. E. Henkel, eds., 1970, Aldine Publishing Company (Butterworth Group).
Morrison, D. E. and Henkel, R. E. (Eds.) 1970. The Significance Test Controversy - A Reader. Aldine Publishing Company (Butterworth Group).
Natrella, M. G. 1960. The relation between confidence intervals and tests of significance. American Statistician 14: 20-22, 33. Reprinted in Statistical Issues, A Reader for the Behavioural Sciences, R. E. Kirk, ed., 1972, Wadsworth Publishing Company: 113-117.
Nelder, J. A. 1971. Discussion on papers by Wynn, Bloomfield, O'Neill and Wetherill. Journal of the Royal Statistical Society B 33: 244-246.
Nelder, J. A. 1985. Discussion of Dr Chatfield's paper. Journal of the Royal Statistical Society A 148: 238.
Nester, M. R. 1996. An applied statistician's creed. Applied Statistics 45: 401-410.
Neyman, J. 1958. The use of the concept of power in agricultural experimentation. Journal of the Indian Society of Agricultural Statistics 9: 9-17.
Neyman, J. and Pearson, E. S. 1933. On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society A 231: 289-337.
Nunnally, J. 1960. The place of statistics in psychology. Educational and Psychological Measurement 20: 641-650.
O'Brien, T. C. and Shapiro, B. J. 1968. Statistical significance--what? Mathematics Teacher 61: 673-676. Reprinted in Statistical Issues, A Reader for the Behavioural Sciences, R. E. Kirk, ed., 1972, Wadsworth Publishing Company: 109-112.
Pearce, S. C. 1992. Data analysis in agricultural experimentation. II. Some standard contrasts. Experimental Agriculture 28: 375-383.
Pearson, K. 1900. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine, Series 5, 50: 157-175.
Pearson, K. 1935a. Statistical tests. Nature 136: 296-297. (Reproduced in H. F. Inman. 1994. Karl Pearson and R. A. Fisher on statistical tests: a 1935 exchange from Nature. The American Statistician 48(1): 2-11.)
Pearson, K. 1935b. Statistical tests. Nature 136: 550. (Reproduced in H. F. Inman. 1994. Karl Pearson and R. A. Fisher on statistical tests: a 1935 exchange from Nature. The American Statistician 48(1): 2-11.)
Perry, J. N. 1986. Multiple-comparison procedures: a dissenting view. Journal of Economic Entomology 79: 1149-1155.
Peterman, R. M. 1990. The importance of reporting statistical power: the forest decline and acidic deposition example. Ecology 71: 2024-2027.
Petranka, J. W. 1990. Caught between a rock and a hard place. Herpetologica 46: 346-350.
Platt, J. R. 1964. Strong inference. Science 146: 347-353.
Pratt, J. W. 1976. A discussion of the question: for what use are tests of hypotheses and tests of significance. Communications in Statistics: Theory and Methods A5: 779-787.
Preece, D. A. 1982. The design and analysis of experiments: what has gone wrong? Utilitas Mathematica 21A: 201-244.
Preece, D. A. 1984. Biometry in the Third World: science not ritual. Biometrics 40: 519-523.
Preece, D. A. 1990. R. A. Fisher and experimental design: a review. Biometrics 46: 925-935.
Quinn, J. F., and A. E. Dunham. 1983. On hypothesis testing in ecology and evolution. American Naturalist 122: 602-617.
Ranstam, J. 1996. A common misconception about p-value and its consequences. Acta Orthopaedica Scandinavica 67: 505-507.
Rosnow, R. L. and Rosenthal, R. 1989. Statistical procedures and the justification of knowledge in psychological science. American Psychologist 44: 1276-1284.
Rothman, K. 1978. A show of confidence. New England Journal of Medicine 299: 1362-1363.
Rozeboom, W. W. 1960. The fallacy of the null hypothesis significance test. Psychological Bulletin 57: 416-428. Reprinted in The Significance Test Controversy - A Reader, D. E. Morrison and R. E. Henkel, eds., 1970, Aldine Publishing Company (Butterworth Group).
Savage, I. R. 1957. Nonparametric statistics. Journal of the American Statistical Association 52: 331-344.
Sayn-Wittgenstein, L. 1965. Statistics - salvation or slavery? Forestry Chronicle 41: 103-105.
Selvin, H. C. 1957. A critique of tests of significance in survey research. American Sociological Review 22: 519-527.
Simberloff, D. 1990. Hypotheses, errors, and statistical assumptions. Herpetologica 46: 351-357.
Skipper Jr., J. K., Guenther, A. L. and Nass, G. 1967. The sacredness of .05: a note concerning the uses of statistical levels of significance in social science. The American Sociologist 2: 16-18. Reprinted in The Significance Test Controversy - A Reader, D. E. Morrison and R. E. Henkel, eds., 1970, Aldine Publishing Company (Butterworth Group).
Smith, C. A. B. 1960. Book review of Norman T. J. Bailey: Statistical Methods in Biology. Applied Statistics 9: 64-66.
Stevens, S. S. 1968. Measurement, statistics, and the schemapiric view. Science 161: 849-856. Abridged in Statistical Issues, A Reader for the Behavioural Sciences, R. E. Kirk, ed., 1972, Wadsworth Publishing Company: 66-78.
Street, D. J. 1990. Fisher's contributions to agricultural statistics. Biometrics 46: 937-945.
"Student". 1908. The probable error of a mean. Biometrika 6: 1-25.
Tamhane, A. C. 1996. Review of R. E. Bechhofer, T. J. Santner and D. M. Goldsman, Design and Analysis of Experiments for Statistical Selection, Screening and Multiple Comparisons, John Wiley (New York), 1995. Technometrics 38: 289-290.
Tukey, J. W. 1973. The problem of multiple comparisons. Unpublished manuscript, Dept. of Statistics, Princeton University.
Tukey, J. W. 1991. The philosophy of multiple comparisons. Statistical Science 6: 100-116.
Tversky, A. and Kahneman, D. 1971. Belief in the law of small numbers. Psychological Bulletin 76: 105-110.
Upton, G. J. G. 1992. Fisher's exact test. Journal of the Royal Statistical Society A 155: 395-402.
Vardeman, S. B. 1987. Comment. Journal of the American Statistical Association 82: 130-131.
Venn, J. 1888. Cambridge anthropometry. Journal of the Anthropological Institute 18: 140-154.
Wang, C. 1993. Sense and Nonsense of Statistical Inference. Marcel Dekker, Inc.
Warren, W. G. 1986. On the presentation of statistical analysis: reason or ritual. Canadian Journal of Forest Research 16: 1185-1191.
Yates, F. 1951. The influence of Statistical Methods for Research Workers on the development of the science of statistics. Journal of the American Statistical Association 46: 19-34.
Yates, F. 1964. Sir Ronald Fisher and the design of experiments. Biometrics 20: 307-321.
Yoccoz, N. G. 1991. Use, overuse, and misuse of significance tests in evolutionary biology and ecology. Bulletin of the Ecological Society of America 72: 106-111.
Zeisel, H. 1955. The significance of insignificant differences. Public Opinion Quarterly 17: 319-321. Reprinted in The Significance Test Controversy - A Reader, D. E. Morrison and R. E. Henkel, eds., 1970, Aldine Publishing Company (Butterworth Group).

Marks R. Nester, author of "An Applied Statistician's Creed." Applied Statistics 45: 401-410.

According to Hacking (1965), John Arbuthnott (1710) was the first to publish a test of a statistical hypothesis. Hogben (1957b) attributes to Jules Gavarret (1840) the earliest use of the probable error as a form of significance test in the biological arena. Hogben (1957b) also states that Venn (1888) was one of the earliest users of the terms "test" and "significant". The form of the chi-squared goodness-of-fit distribution was published by K. Pearson in 1900. W. S. Gosset, using the pseudonym "Student", developed the t-distribution in 1908. According to E. S. Beaven (1935), T. B. Wood and Professor Stratton were the first to determine probable errors in the context of replicated agricultural experiments. Apparently Wood and Stratton wrote their paper in 1910, but Beaven does not give a reference. The foundations of modern hypothesis testing were laid by Fisher (1925), although the modifications propounded by Neyman and Pearson (1933) are the generally accepted norm.

I contend that the general acceptance of statistical hypothesis testing is
one of the most unfortunate aspects of 20th century applied science.
Tests for the identity of population distributions, for equality of treatment
means, for presence of interactions, for the nullity of a correlation coefficient,
and so on, have been responsible for much bad science, much lazy science,
and much silly science. A good scientist can manage with, and will not be
misled by, parameter estimates and their associated standard errors or confidence
limits. A theory dealing with the statistical behaviour of populations should
be supported by rational argument as well as data. In such cases, accurate
statistical evaluation of the data is hindered by null hypothesis testing.
The scientist must always give due thought to the statistical analysis, but
must never let statistical analysis be a substitute for thinking! If instead
of developing theories, a researcher is involved in such practical issues
as selecting the best treatment(s), then the researcher is probably confronting
a complex decision problem involving *inter alia* economic considerations.
Once again, analyses such as null hypothesis testing and multiple comparison
procedures are of no benefit.
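Nester's point can be made concrete with a small simulation (a hypothetical sketch of my own, not from his paper; the effect size, sample sizes, and known-variance z-test are illustrative assumptions): when the true difference between two treatments is trivially small, whether the null hypothesis is rejected is governed almost entirely by sample size, while the estimate and its confidence limits report the magnitude directly.

```python
# Illustration (not from the paper): with a trivially small true difference,
# a "significant" P value is mostly a matter of sample size, while the
# estimate and its confidence interval convey the (un)importance directly.
import math
import random

random.seed(1)

def two_sample_z(n, true_diff=0.05, sd=1.0):
    """Simulate two groups; return (estimated diff, 95% CI, two-sided P)."""
    a = [random.gauss(0.0, sd) for _ in range(n)]
    b = [random.gauss(true_diff, sd) for _ in range(n)]
    diff = sum(b) / n - sum(a) / n
    se = sd * math.sqrt(2.0 / n)        # known-variance z-test for simplicity
    z = diff / se
    # two-sided P from the normal CDF, Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
    ci = (diff - 1.96 * se, diff + 1.96 * se)
    return diff, ci, p

for n in (50, 500, 50_000):
    diff, (lo, hi), p = two_sample_z(n)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"n={n:6d}  diff={diff:+.3f}  95% CI=({lo:+.3f}, {hi:+.3f})  P={p:.4f}  {verdict}")
```

Typically the verdict flips from "not significant" to "significant" as n grows, while the interval shows throughout that the effect (about 0.05 standard deviations here) was never of practical consequence.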

Although some of the following passages have been included for their historical interest, most of the quotations are offered in partial support of my views.

**1710**

Arbuthnott - "This Equality of Males and Females is not the Effect of Chance but Divine Providence ... which I thus demonstrate:

Let there be a Die of Two sides, M and F ... But it is very improbable (if mere Chance govern'd) that ... To repair that Loss, provident Nature ... brings forth more Males than Females; and that in almost a constant proportion."

**1888**

Venn - "When a sufficient number of results had been obtained ... I was requested ... to undertake an analysis of them, and a comparison of their general outcome with that of those obtained by almost identical instruments at South Kensington. ... When we are dealing with statistics, we ought to be able not merely to say vaguely that the difference does or does not seem significant to us, but we ought to have some test as to what difference would be significant. ... The above remarks ... inform us which of the differences in the above tables are permanent and significant, in the sense that we may be tolerably confident that if we took another similar batch we should find a similar difference; and which of them are merely transient and insignificant, in the sense that another similar batch is about as likely as not to reverse the conclusion we have obtained."

**1900**

Pearson - "A theoretical probability curve without limited range will never at the extreme tails exactly fit observation. The difficulty is obvious where the observations go by units and the theory by fractions."

Pearson - "if the earlier writers on probability had not proceeded so entirely from the mathematical standpoint, but had endeavoured first to classify experience in deviations from the average, and then to obtain some measure of the actual goodness of fit provided by the normal curve, that curve would never have obtained its present position in the theory of errors"

Pearson - "We can only conclude from the investigations here considered that the normal curve possesses no special fitness for describing errors or deviations such as arise either in observing practice or in nature"

**1933**

Neyman and Pearson - "if x is a continuous variable ... then any value of x is a singularity of relative probability equal to zero. We are inclined to think that as far as a particular hypothesis is concerned, no test based upon the theory of probability can by itself provide any valuable evidence of the truth or falsehood of that hypothesis"

**1935**

Buchanan-Wollaston - "The [null] hypothesis should be such that it is acceptable
on *a priori* grounds if the data do not show it to be unlikely to
be true"

Fisher - "Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis"

Pearson (a) - "the [chi-square goodness-of-fit] tests are used to ascertain
whether a reasonable *graduation* curve has been achieved, not to assert
whether one or another hypothesis is true or false"

Pearson (a) - "I have never found a normal curve fit anything if there are enough observations!"

Pearson (b) - "There is only one case in which an hypothesis can be definitely rejected, namely when its probability is zero."

**1938**

Berkson - "we may assume that it is practically certain that any series of
real observations does not actually follow a normal curve with *absolute
exactitude* ... and ... the chi-square [goodness-of-fit] *P* will
be small if the sample has a sufficiently large number of observations in
it"
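Berkson's observation is easy to reproduce by simulation (a hypothetical sketch of my own: the mildly heavy-tailed population, the equal-probability cells, and the Wilson-Hilferty normal approximation to the chi-square tail area are all my assumptions, not Berkson's example). Because real data never follow a normal curve exactly, the Pearson goodness-of-fit statistic grows roughly linearly with n and its P value collapses, whatever the practical quality of the fit.

```python
# Sketch (my illustration, not Berkson's): data that are *almost* normal
# still yield a vanishing goodness-of-fit P once n is large enough.
import random
from statistics import NormalDist

random.seed(2)
ND = NormalDist()
K = 10                                     # equal-probability cells under H0
edges = [ND.inv_cdf(i / K) for i in range(1, K)]

def chi_square_stat(n):
    """Pearson chi-square of slightly heavy-tailed data against N(0,1) cells."""
    counts = [0] * K
    for _ in range(n):
        # 95% standard normal, 5% wider normal: a small, realistic departure
        x = random.gauss(0, 1) if random.random() < 0.95 else random.gauss(0, 3)
        cell = sum(x > e for e in edges)   # index of the cell containing x
        counts[cell] += 1
    expected = n / K
    return sum((c - expected) ** 2 / expected for c in counts)

def chi2_pvalue(x, df):
    """Wilson-Hilferty normal approximation to the chi-square tail area."""
    z = ((x / df) ** (1 / 3) - (1 - 2 / (9 * df))) / (2 / (9 * df)) ** 0.5
    return 1 - ND.cdf(z)

for n in (1_000, 100_000):
    stat = chi_square_stat(n)
    print(f"n={n:7d}  chi-square={stat:9.1f} on {K - 1} df  P ≈ {chi2_pvalue(stat, K - 1):.4f}")
```

The same departure from normality that is invisible at n = 1,000 produces an astronomically small P at n = 100,000, exactly as Berkson describes.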

**1942**

Berkson - "null hypothesis procedure ... It says 'If A is true, B will happen sometimes; therefore if B has been found to happen, A can be considered disproved' "

Berkson - "I do not say anything has been 'proved' or 'disproved.' I leave to others the use of these words, which I think are quite inadmissible as applying to anything that can be accomplished by statistics"

**1947**

Geary - "Normality is a myth; there never was, and never will be, a normal distribution"

**1951**

Yates - "the emphasis given to formal tests of significance ... has resulted in ... an undue concentration of effort by mathematical statisticians on investigations of tests of significance applicable to problems which are of little or no practical importance ... and ... it has caused scientific research workers to pay undue attention to the results of the tests of significance ... and too little to the estimates of the magnitude of the effects they are investigating"

Yates - "the occasions ... in which quantitative data are collected solely with the object of proving or disproving a given hypothesis are relatively rare"

Yates - "... the unfortunate consequence that scientific workers have often regarded the execution of a test of significance on an experiment as the ultimate objective"

Yates - "Results are significant or not significant and that is the end of it"

**1953**

Braithwaite - "The peculiarity of ... statistical hypotheses is that they are not conclusively refutable by any experience"

Braithwaite - "no batch of observations, however large, either definitively
rejects or definitively fails to reject the hypothesis H_{0}"

Braithwaite - "what John Dewey called 'the quest for certainty' is, in the case of empirical knowledge, a snare and a delusion"

Braithwaite - "The ultimate justification for any scientific belief will depend upon the main purpose for which we think scientifically--that of predicting and thereby controlling the future"

**1954**

Hodges, Jr. and Lehmann - "we may formulate the hypothesis that a population is normally distributed, but we realize that no natural population is ever exactly normal"

Hodges, Jr. and Lehmann - "when we formulate the hypothesis that the sex ratio is the same in two populations, we do not really believe that it could be exactly the same"

**1955**

Zeisel - "the researchers who follow the statistical way of life often distinguish themselves by a certain aridity of theoretical insights"

**1956**

Anscombe - "Tests of the null hypothesis that there is no difference between certain treatments are often made in the analysis of agricultural or industrial experiments in which alternative methods or processes are compared. Such tests are ... totally irrelevant. What are needed are estimates of magnitudes of effects, with standard errors"

**1957**

Cochran and Cox - "In many experiments it seems obvious that the different
treatments must have produced some difference, however small, in effect. Thus
the hypothesis that there is *no* difference is unrealistic: the real
problem is to obtain estimates of the sizes of the differences"

Hogben (a) - "Acceptability of a statistically *significant* result
... promotes a high output of publication. Hence the argument that the techniques
*work* has a tempting appeal to young biologists, if harassed by their
seniors to produce results, or if admonished by editors to conform to a prescribed
ritual of analysis before publication. ... the plea for justification by works
... is therefore likely to fall on deaf ears, unless we reinstate reflective
thinking in the university curriculum"

Hogben (a) - "we can already detect signs of such deterioration in the growing volume of published papers ... recording so-called significant conclusions which an earlier vintage would have regarded merely as private clues for further exploration"

Savage - "to make measurements and then ignore their magnitude would ordinarily be pointless. Exclusive reliance on tests of significance obscures the fact that statistical significance does not imply substantive significance"

Savage - "Null hypotheses of no difference are usually known to be false before the data are collected ... when they are, their rejection or acceptance simply reflects the size of the sample and the power of the test, and is not a contribution to science"

Selvin - "High levels of predictability, explanation, and association are legitimate goals for social scientists; they are not the same as a high level of significance, nor is statistical significance a substitute for them."

**1958**

Cox - "Exact truth of a null hypothesis is very unlikely except in a genuine uniformity trial"

Cox - "Assumptions that we make, such as those concerning the form of the population sampled, are always untrue"

Gold - "An important weakness of much analysis in current social research is the failure of the analyst to consider the distinction between statistical significance and substantive importance."

Neyman - "If we go to the trouble of setting up an experiment this is because
we want to establish the presence of some possible effect of a treatment."
*Comment: This is sadly reminiscent of Fisher (1935).*

Neyman - "if experimenters realized how little is the chance of their experiments discovering what they are intended to discover, then a very substantial proportion of the experiments that are now in progress would have been abandoned in favour of an increase in size of the remaining experiments, judged more important"

Neyman - "What was the probability (power) of detecting interactions ... in the experiment performed? ... The probability in question is frequently relatively low ... in cases of this kind the fact that the test failed to detect the existence of interactions does not mean very much. In fact, they may exist and have gone undetected."

**1959**

Kish - "Significance should stand for meaning and refer to substantive matter. ... I would recommend that statisticians discard the phrase 'test of significance' "

Kish - "the tests of null hypotheses of *zero* differences, of no
relationships, are frequently weak, perhaps trivial statements of the researcher's
aims ... in many cases, instead of the tests of significance it would be more
to the point to measure the magnitudes of the relationships, attaching proper
statements of their sampling variation. The magnitudes of relationships cannot
be measured in terms of levels of significance"

**1960**

McNemar - "too many users of the analysis of variance seem to regard the reaching of a mediocre level of significance as more important than any descriptive specification of the underlying averages"

McNemar - "so much of what should be regarded as preliminary gets published,
then quoted as the last word, which it usually is because the investigator
is too willing to rest on the laurels that come from finding a significant
difference. Why should he worry about the *degree* of relationship
or its possible lack of linearity"

Natrella - "One reason for preferring to present a confidence interval statement (where possible) is that the confidence interval, by its width, tells more about the reliance that can be placed on the results of the experiment than does a YES-NO test of significance."

Natrella - "the significance test without its OC [Operating Characteristic] curve has distorted the thinking in some experimental problems"

Natrella - "Confidence intervals give a feeling of the uncertainty of experimental evidence, and (very important) give it in the same units ... as the original observations."
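Natrella's point — that an interval reports both the estimate and its precision in the units of the original observations — can be sketched with a few lines of Python. The data values are purely hypothetical, and the interval uses a normal approximation for brevity:

```python
# Illustrative sketch (hypothetical data): a 95% confidence interval
# conveys the estimate and its uncertainty in the original units,
# unlike a bare YES-NO significance verdict.
from statistics import NormalDist, mean, stdev

yields = [31.2, 29.8, 33.1, 30.4, 32.6, 28.9, 31.7, 30.9]  # hypothetical plot yields
n = len(yields)
m = mean(yields)
se = stdev(yields) / n ** 0.5          # standard error of the mean
z = NormalDist().inv_cdf(0.975)        # ~1.96; normal approximation
lo, hi = m - z * se, m + z * se
print(f"mean = {m:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```

The width of the reported interval, not a p-value, is what tells a reader how much reliance to place on the estimate.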

Nunnally - "Few ... of the criticisms which will be made were originated by the author ... However, it is hoped that when the criticisms are brought together they will argue persuasively for a change in viewpoint about statistical logic"

Nunnally - "the null-hypothesis models ... share a crippling flaw: in the
real world the null hypothesis is almost never true, and it is usually nonsensical
to perform an experiment with the *sole* aim of rejecting the null
hypothesis"

Nunnally - "when large numbers of subjects are used in studies, nearly all comparisons of means are 'significantly' different and all correlations are 'significantly' different from zero"
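Nunnally's claim is easy to demonstrate by simulation. In this sketch (assumed, not from any cited study) two populations differ by a negligible 0.01 units, yet with a large enough sample a two-sample z-test rejects the null hypothesis of exactly zero difference:

```python
# Illustrative sketch: with very large samples, even a trivially small
# true difference (0.01 units here, an assumed value) is declared
# "statistically significant" -- though the estimate shows it is negligible.
import random
from statistics import NormalDist, mean, stdev

random.seed(1)
n = 1_000_000
a = [random.gauss(0.00, 1.0) for _ in range(n)]  # population mean 0.00
b = [random.gauss(0.01, 1.0) for _ in range(n)]  # population mean 0.01

diff = mean(b) - mean(a)
se = (stdev(a) ** 2 / n + stdev(b) ** 2 / n) ** 0.5
z = diff / se
p = 2 * (1 - NormalDist().cdf(abs(z)))
print(f"estimated difference = {diff:.4f}, z = {z:.1f}, p = {p:.2g}")
```

The test "succeeds," but only the estimated magnitude of the difference reveals that nothing of practical consequence was found.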

Nunnally - "If rejection of the null hypothesis were the real intention in psychological experiments, there usually would be no need to gather data"

Nunnally - "the mere rejection of a null hypothesis provides only meager information"

Nunnally - "Closely related to the null hypothesis is the notion that only enough subjects need be used in psychological experiments to obtain 'significant' results. This often encourages experimenters to be content with very imprecise estimates of effects"

Nunnally - "analysis of variance should be considered primarily an estimation device"

Nunnally - "psychological research is often difficult and frustrating, and the frustration can lead to a 'flight into statistics.' With some, this takes the form of a preoccupation with statistics to the point of divorcement from the headaches of empirical study. With others, the hypothesis-testing models provide a quick and easy way of finding 'significant differences' and an attendant sense of satisfaction"

Nunnally - "We should not feel proud when we see the psychologist smile and say 'the correlation is significant beyond the .01 level.' Perhaps that is the most that he can say, but he has no reason to smile"

Rozeboom - "one can hardly avoid polemics when butchering sacred cows"

Rozeboom - "Whenever possible, the basic statistical report should be in
the form of a *confidence interval*"

Rozeboom - "the stranglehold that conventional null hypothesis significance testing has clamped on publication standards must be broken"

Rozeboom - "The traditional null hypothesis significance-test method ...
of statistical analysis is here vigorously excoriated for its inappropriateness
as a method of *inference*"

Smith - "One feature ... which requires much more justification than is usually given, is the setting up of unplausible null hypotheses. For example, a statistician may set out a test to see whether two drugs have exactly the same effect, or whether a regression line is exactly straight. These hypotheses can scarcely be taken literally"

**1962**

Camilleri - "another problem associated with the test of significance. The particular level of significance chosen for an investigation is not a logical consequence of the theory of statistical inference"

Camilleri - "The precision and empirical concreteness often associated with the test of significance are illusory and it would be a serious error to predicate our actions towards hypotheses on the test of significance as if it were a reliable arbiter of truth"

Grant - "In view of our long-term strategy of improving our theories, our statistical tactics can be greatly improved by shifting emphasis away from over-all hypothesis testing in the direction of statistical estimation. This always holds true when we are concerned with the actual size of one or more differences rather than simply in the existence of differences."

**1963**

Binder - [With regard to Fisher's 1935 quote about experiments and null hypotheses] "This is not very edifying since one does not expect to prove any hypothesis by the methods of probabilistic inference."

Binder - "when one tests a point prediction he usually knows before the first sample element is drawn that his empirical hypothesis is not precisely true"

Binder - "It is surely apparent that anyone who wants to obtain a significant difference badly enough can obtain one ... choose a sample size large enough"

Edwards et al. - "in typical applications, one of the hypotheses--the null hypothesis--is known by all concerned to be false from the outset"

Edwards et al. - "classical procedures quite typically are, from a Bayesian
point of view, far too ready to reject the null hypotheses" *Comment: Then
this is a most convincing argument against the use of Bayesian methods.*

Edwards et al. - "Estimation is best when it is stable. Rejection of a null hypothesis is best when it is interocular"

**1964**

Yates - "The most commonly occurring weakness ... is ... undue emphasis on tests of significance, and failure to recognise that in many types of experimental work estimates of treatment effects, together with estimates of the errors to which they are subject, are the quantities of primary interest"

Yates - "In many experiments ... it is known that the null hypothesis ... is certainly untrue"

**1965**

Sayn-Wittgenstein - "There is nothing wrong with the t-test; it has merely
been used to give an answer that was never asked for. The Student t-test answers
the question: 'Is there any *real* difference between the means of
the measurement by the old and the new method, or could the *apparent*
difference have arisen from random variation?' We already know that there
is a real difference, so the question is pointless. The question we should
have answered is: 'How big is the difference between the two sets of measurements,
and how precisely have we determined it?'"

**1966**

Kempthorne - "a continuously distributed random variable ... one never actually observes such random variables, ... all observations are ... discrete"

**1967**

Bakan - "Little of what is contained in this paper is not already available in the literature"

Bakan - "the test of significance has been carrying too much of the burden of scientific inference. It may well be the case that wise and ingenious investigators can find their way to reasonable conclusions from data because and in spite of their procedures. Too often, however, even wise and ingenious investigators ... tend to credit the test of significance with properties it does not have"

Bakan - "a priori *reasons for believing that the null hypothesis is generally
false anyway*. One of the common experiences of research workers is the
very high frequency with which significant results are obtained with large
samples"

Bakan - "there is really no good reason to expect the null hypothesis to be true in any population ... Why should any correlation coefficient be exactly .00 in the population? ... why should different drugs have exactly the same effect on any population parameter"

Bakan - "if the test of significance is really of such limited appropriateness ... we would be much better off if we were to attempt to estimate the magnitude of the parameters in the populations"

Bakan - "When we reach a point where our statistical procedures are substitutes instead of aids to thought, and we are led to absurdities, then we must return to common sense"

Bakan - "we need to get on with the business of generating ... hypotheses and proceed to do investigations and make inferences which bear on them, instead of ... testing the statistical null hypothesis in any number of contexts in which we have every reason to suppose that it is false in the first place"

LaForge - "Confidence regions ... for estimation of unknown parameters ...
*are* appropriate for most scientific research and reporting"

Meehl - "in psychological and sociological investigations involving very large numbers of subjects, it is regularly found that almost all correlations or differences between means are statistically significant"

Meehl - "it is highly unlikely that *any* psychologically discriminable
stimulation which we apply to an experimental subject would exert literally
*zero* effect upon any aspect of his performance"

Meehl - "a fairly widespread tendency to report experimental findings with a liberal use of *ad hoc* explanations for those that didn't 'pan out' "

Meehl - "our eager-beaver researcher, undismayed by logic-of-science considerations and relying blissfully on the 'exactitude' of modern statistical hypothesis-testing, has produced a long publication list and been promoted to a full professorship. In terms of his contribution to the enduring body of psychological knowledge, he has done hardly anything. His true position is that of a potent-but-sterile intellectual rake, who leaves in his merry path a long train of ravished maidens but no viable scientific offspring"

Skipper Jr., Guenther and Nass - "The current obsession with .05 ... has
the consequence of differentiating significant research findings and those
best forgotten, published studies from unpublished ones, and renewal of grants
from termination. It would not be difficult to document the joy experienced
by a social scientist when his *F* ratio or *t* value yields
significance at .05, nor his horror when the table reads 'only' .10 or .06.
One comes to internalize the difference between .05 and .06 as 'right' *vs*.
'wrong,' 'creditable' *vs*. 'embarrassing,' 'success' *vs*.
'failure' "

Skipper Jr., Guenther and Nass - "blind adherence to the .05 level denies any consideration of alternative strategies, and it is a serious impediment to the interpretation of data"

**1968**

Lykken - "Unless one of the variables is wholly unreliable so that the values
obtained are strictly random, it would be foolish to suppose that the correlation
between any two variables is identically equal to 0.0000... (or that the effect
of some treatment or the difference between two groups is exactly *zero*)"

Lykken - "the finding of statistical significance is perhaps the least important attribute of a good experiment"

Lykken - "The value of any research can be determined, not from the statistical results, but only by skilled, subjective evaluation of the coherence and reasonableness of the theory, the degree of experimental control employed, the sophistication of the measuring techniques, the scientific or practical importance of the phenomena studied"

Lykken - "Editors must be bold enough to take responsibility for deciding
which studies are good and which are not, without resorting to letting the
*p* value of the significance tests determine this decision"

O'Brien and Shapiro - "It is this distinction between statistical significance and practical importance that seems often to be overlooked by many researchers."

Stevens - "What does it mean? Can no one recognize a decisive result without
a significance test? How much can the burgeoning of computation be blamed
on fad? How often does inferential computation serve as a premature excuse
for going to press? Whether the scholar has discovered something or not, he
can sometimes subject his data to an analysis of variance, a *t* test,
or some other device that will produce a so-called objective measure of 'significance.'
The illusion of objectivity seems to preserve itself despite the admitted
necessity for the investigator to make improbable assumptions, and to pluck
off the top of his head a figure for the level of probability that he will
consider significant."

Stevens - "The extreme stochastophobe is likely to ask: What scientific discoveries owe their existence to the techniques of statistical analysis or inference?"

Stevens - "The aspersions voiced by stochastophobes fall mainly on those scientists who seem, by the surfeit of their statistical chants, to turn data treatment into hierurgy. These are not the statisticians themselves, for they see statistics for what it is, a straightforward discipline designed to amplify the power of common sense in the discernment of order amid complexity."

**1969**

Morrison and Henkel - "In addition to important technical errors, fundamental errors in the philosophy of science are frequently involved in this indiscriminate use of the tests [of significance]"

Morrison and Henkel - "What we say is frankly polemical, though not original"

Morrison and Henkel - "we usually know in advance of testing that the null hypothesis is false"

Morrison and Henkel - "To say we want to be conservative, to guard against accepting more than 5 percent of our false alternative hypotheses as true ... is nonsense in scientific research"

Morrison and Henkel - "Researchers have long recognized the unfortunate connotations and consequences of the term 'significance,' and we propose it is time for a change"

Morrison and Henkel - "there is evidence that significance tests have been a genuine block to achieving ... knowledge"

**1970**

Morrison and Henkel - "we are convinced that the diversion of energy away
from the rituals of significance testing in basic scientific research will
be a worthy first step toward this goal [solving the problems of scientific
inference] and will ... be one difference in behavioral science that *is*
significant"

Morrison and Henkel - "scientists by and large adjust their beliefs about a hypothesis in informal ways on the basis of evidence, regardless of the formal decisions to reject or accept hypotheses made by individual researchers"

Morrison and Henkel - "significance testing in behavioral research is deeply implicated in our false search for empirical association, rather than a search for hypotheses that explain"

Morrison and Henkel - "many researchers ... will regard the abandonment of the tests a threat to the very foundations of empirical behavioural research. In fact, our experience (among sociologists) has been that many researchers accept all or most of our arguments on rational grounds, but keep using significance tests as before simply because use is a strong norm in the discipline"

**1971**

Nelder - "multiple comparison methods have no place at all in the interpretation of data"

Tversky and Kahneman - "the statistical power of many psychological studies is ridiculously low. This is a self-defeating practice: it makes for frustrated scientists and inefficient research. The investigator who tests a valid hypothesis but fails to obtain significant results cannot help but regard nature as untrustworthy or even hostile"

Tversky and Kahneman - "Significance levels are usually computed and reported, but power and confidence limits are not. Perhaps they should be."

Tversky and Kahneman - "The emphasis on significance levels tends to obscure a fundamental distinction between the size of an effect and its statistical significance."

**1973**

Hays - "There is surely nothing on earth that is completely independent of anything else. The strength of an association may approach zero, but it should seldom or never be exactly zero."

Tukey - "The twin assumptions of normality of distribution and homogeneity of variance are not ever exactly fulfilled in practice, and often they do not even hold to a good approximation."

**1976**

Box - "all models are wrong"

Box - "in nature there never was a normal distribution, there never was a straight line"

Box - "experiments where errors cannot be expected to be independent are very common"

Chew - "the research worker has been oversold on hypothesis testing. Just
as no two peas in a pod are identical, no two treatment means will be exactly
equal. ... It seems ridiculous ... to test a hypothesis that we *a priori*
know is almost certain to be false"

Graybill - "when making inferences about parameters ... hypothesis tests
should seldom be used if confidence intervals are available ... the confidence
intervals could lead to *opposite* practical conclusions when a test
suggests rejection of *H_{0}* ... even though ..."

Kempthorne - "one will not ever have a random sample from a normal distribution"

Kempthorne - "no one, I think, really believes in the possibility of sharp null hypotheses -- that two means are absolutely equal in noisy sciences"

Pratt - "tests [of hypotheses] provide a poor model of most real problems, usually so poor that their objectivity is tangential and often too poor to be useful"

Pratt - "And when, as so often, the test is of a hypothesis known to be false ... the relevance of the conventional testing approach remains to be explicated"

Pratt - "This reduces the role of tests essentially to convention. Convention is useful in daily life, law, religion, and politics, but it impedes philosophy"

**1977**

Barndorff-Nielsen - "Most of the models considered in statistics are but rough approximations to reality"

Chew - "Testing the equality of 2 true treatment means is ridiculous. They will always be different, at least beyond the hundredth decimal place."

Cox - "Overemphasis on tests of significance at the expense especially of interval estimation has long been condemned"

Cox - "Admittedly all real measurements are discrete"

Cox - "there are considerable dangers in overemphasizing the role of significance tests in the interpretation of data"

Cox - "statistical significance is quite different from scientific significance and ... therefore estimation ... of the magnitude of effects is in general essential regardless of whether statistically significant departure from the null hypothesis is achieved"

Guttman - "lack of interaction in analysis of variance and ... lack of correlation in bivariate distributions--such nullities would be quite surprising phenomena in the usual interactive complexities of social life"

Guttman - "Estimation and approximation may be more fruitful than significance in developing science, never forgetting replication."

Guttman - "It [the normal distribution] is seldom, if ever, observed in nature."

**1978**

Carver - "Statistical significance testing has involved more fantasy than fact. The emphasis on statistical significance over scientific significance in educational research represents a corrupt form of the scientific method. Educational research would be better off if it stopped testing its results for statistical significance."

Carver - "Statistical significance ordinarily depends upon how many subjects are used in the research. The more subjects the researcher uses, the more likely the researcher will be to get statistically significant results."

Healy - "it is widely agreed among statisticians ... that significance testing is not the be-all and end-all of the subject"

Healy - "The commonest agricultural experiments ... are fertilizer and variety trials. In neither of these is there any question of the population treatment means being identical ... the objective is to measure how big the differences are"

Kruskal - "statistical significance of a sample bears no necessary relationship to possible subject-matter significance"

Kruskal - "it is easy to ... throw out an interesting baby with the nonsignificant bath water. Lack of statistical significance at a conventional level does not mean that no real effect is present; it means only that no real effect is clearly seen from the data. That is why it is of the highest importance to look at power and to compute confidence intervals"
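Kruskal's advice to "look at power" can be made concrete with a short sketch. The effect size, variance, and sample sizes below are assumed for illustration; the function approximates the power of a two-sided two-sample z-test:

```python
# Illustrative sketch (assumed numbers): at small sample sizes the power
# to detect a modest true difference can be well under 50%, so a
# nonsignificant result may mean only that too little data were collected.
from statistics import NormalDist

def power_two_sample(delta, sigma, n, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test with n per group."""
    se = sigma * (2 / n) ** 0.5
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    nd = NormalDist()
    # probability the test statistic falls in either rejection region
    return (1 - nd.cdf(z_crit - delta / se)) + nd.cdf(-z_crit - delta / se)

for n in (10, 30, 100, 400):
    print(f"n = {n:3d} per group: power = {power_two_sample(0.5, 1.0, n):.2f}")
```

With 10 per group the power here is only about one in five, which is exactly the situation in which an "interesting baby" gets thrown out with the nonsignificant bath water.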

Kruskal - "Another criticism of standard significance tests is that in most
applications it is known beforehand that the null hypothesis cannot be *exactly*
true"

Kruskal - "Because of the relative simplicity of its structure, significance testing has been overemphasized in some presentations of statistics, and as a result some students come mistakenly to feel that statistics is little else than significance testing"

Meehl - "I suggest to you that Sir Ronald has befuddled us, mesmerized us, and led us down the primrose path. I believe that the almost universal reliance on merely refuting the null hypothesis as the standard method for corroborating substantive theories in the soft areas is a terrible mistake, is basically unsound, poor scientific strategy, and one of the worst things that ever happened in the history of psychology."

Meehl - "probably all theories are false in the eyes of God"

Meehl - "as I believe is generally recognized by statisticians today and by thoughtful social scientists, the null hypothesis, taken literally, is always false"

Rothman - "The P value ... conveys no information about the extent to which two groups differ or two variables are associated. ... P values serve poorly as descriptive statistics".

Rothman - "By choosing a measure that quantifies the degree of association or effect in the data and then calculating a confidence interval, researchers can summarize the strength of association in their data and allow for random variation in a simple and unambiguous way."

**1980**

Chew - "... means are *significantly different* ... This is a very
unfortunate choice of terminology, because the significant difference in the
*statistical* sense is often taken, incorrectly, as being significant
in the *practical* or *economic* sense"

Chew - "Experimenters are often unhappy if the decision from the analysis
of variance is to accept H_{0}. ... The correct interpretation in
this case is that all true differences are 'small' and/or the number of replicates
is insufficient"

Chew - "I have tried to steer them [agricultural researchers] away from testing
H_{0}. I maintain that on *a priori* physical, chemical and
biological grounds, H_{0} is always false in all realistic experiments,
and H_{0} will always be rejected given enough replication"

Chew - "As Confucius might have said, *if the difference isn't different
enough to make a difference, what's the difference?*"

Kruskal - "the traditional table [analysis of variance table] with its terminology and seductive additivities has in fact often led to superficiality of analysis"

**1981**

Cox and Snell - "Models are always to some extent tentative"

Little - "The idea that one should proceed no further with an analysis, once a non-significant F-value for treatments is found, has led many experimenters to overlook important information in the interpretation of their data"

**1982**

Cox - "It is very bad practice to summarise an important investigation solely by a value of P".

Cox - "The criterion for publication should be the achievement of reasonable precision and not whether a significant effect has been found"

Preece - "over-emphasis on significance-testing continues"

Preece - "the norm should be that only a standard error is quoted for comparing means from an experiment"

Preece - "experimenters having difficulty in interpreting their results, after the results have been converted into an analysis of variance, must often be urged to think as if they had never heard of statistics; only then is fettered rote-thinking abandoned in favour of common-sense and intelligence"

**1983**

Box - "The resultant magnification of the importance of formal hypothesis tests has inadvertently led to underestimation by scientists of the area in which statistical methods can be of value and to a wide misunderstanding of their purpose"

Bryan-Jones and Finney - "Of central importance to clear presentation is the standard error of a mean"

Bryan-Jones and Finney - "In interpreting and in presenting experimental results there is no adequate substitute for thought - thought about the questions to be asked, thought about the nature and weight of evidence the data provide on these questions, and thought about how the story can be told with clarity and full honesty to a reader. Statistical techniques must be chosen and used to aid, but not to replace, relevant thought"

Bryan-Jones and Finney - "Our message is not new"

Good - "with general principles ... it is usually possible to find something in the past that to some extent foreshadows it"

Good - "A large enough sample will usually lead to the rejection of almost any null hypothesis ... Why bother to carry out a statistical experiment to test a null hypothesis if it is known in advance that the hypothesis cannot be exactly true"

**1984**

Jones - "There is a rising feeling among statisticians that hypothesis tests ... are not the most meaningful analyses"

Jones - "preoccupation with testing 'is there an interaction?' in factorial experiments, ... emphasis should be on 'how strong is the interaction?' "

Jones - "The difference between 'statistically significant' and 'biologically significant' needs to be appreciated much more than it is now"

Jones - "Reporting of results in terms of confidence intervals instead of hypothesis tests should be strongly encouraged"

Preece - "Statistical 'recipes' are followed blindly, and ritual has taken over from scientific thinking"

Preece - "The ritualistic use of multiple-range tests--often when the null
hypothesis is *a priori* untenable ...--is a disease"

**1985**

Altman - "Somehow there has developed a widespread belief that statistical analysis is legitimate only if it includes significance testing. This belief leads to, and is fostered by, numerous introductory statistics texts that are little more than catalogues of techniques for performing significance tests"

Chatfield - "differences are 'significant' ... nearly always ... in large samples"

Chatfield - "Within the last decade or so, practising statisticians have begun to question the relevance of some Statistics courses ... However ... Statistics teaching is still often dominated by formal mathematics"

Chatfield - "tests on outliers are less important than advice from 'people in the field' "

Chatfield - "significance tests ... are also widely overused and misused"

Chatfield - "an *ANOVA* will not tell us *how* a null hypothesis
is rejected"

Chatfield - "Rather than ask if these differences are statistically significant, it seems more important to ask if they are of educational importance"

Chatfield - "All statistical techniques, however sophisticated, should be subordinate to subjective judgement"

Chatfield - "it has ... become impossible to get results published in some medical, psychological and biological journals without reporting significance values even when of doubtful validity"

Cormack - "Estimates and measures of variability are more valuable than hypothesis tests"

Guttman - "Since a point hypothesis is not to be expected in practice to
be exactly true, but only approximate, a proper test of significance should
*almost always* show significance for large enough samples. So the
whole game of testing point hypotheses, power analysis notwithstanding, is
but a mathematical game without empirical importance."

Nelder - "the grotesque emphasis on significance tests in statistics courses of all kinds ... is taught to people, who if they come away with no other notion, will remember that statistics is about tests for significant differences. ... The apparatus on which their statistics course has been constructed is often worse than irrelevant, it is misleading about what is important in examining data and making inferences"

**1986**

Chernoff - "Analysis of variance ... stems from a hypothesis-testing formulation that is difficult to take seriously and would be of limited value for making final conclusions."

Gardner and Altman - "In this approach [hypothesis testing] data are examined in relation to a statistical 'null' hypothesis, and the practice has led to the mistaken belief that studies should aim at obtaining 'statistical significance.' On the contrary, the purpose of most research investigations in medicine is to determine the magnitude of some factor(s) of interest."

Gardner and Altman - "there is a tendency to equate statistical significance with medical importance or biological relevance"

Gardner and Altman - "Confidence intervals ... should become the standard method for presenting the statistical results of major findings."

Jones and Matloff - "We recommend that authors display the estimate of the difference and the confidence limit for this difference"

Jones and Matloff - "at its worst, the results of statistical hypothesis testing can be seriously misleading, and at its best, it offers no informational advantage over its alternatives"

Jones and Matloff - "the ubiquitous problem of synonymizing statistical significance with biological significance"

Jones and Matloff - "all populations are different, a priori"

Jones and Matloff - "The only remedy ... is for journal editors to be keenly aware of the problems associated with hypothesis tests, and to be sympathetic, if not strongly encouraging, toward individuals who are taking the initial lead in phasing them out"

Lindley - "estimation procedures provide more information [than significance tests]: they tell one about reasonable alternatives and not just about the reasonableness of one value"

Perry - "significance tests have a limited role in biological experiments because 1) significance refers merely to plausibility, not to biological importance ... 2) theories may be proved to be strictly untrue but still of practical use ... 3) a null hypothesis is often known to be false before experimentation 4) the outcome of a test often depends merely on the size of the experiment ... the more replicates, the greater the chance of achieving significance; 5) in agricultural and ecological entomology, the really critical, single experiment is rare; 6) results may indicate merely that a hypothesis is rejected, but not give the magnitude of departures from the hypothesis ... 7) the exact nature of tests is often exaggerated and ignores the fact that all tests are based on assumptions that rarely hold in practice"

Perry - "A confidence interval certainly gives more information than the result of a significance test alone ... I ... recommend its use [standard error of each mean]"

Warren - "the word 'significant' could be abolished ... Based on a dictionary definition, one might expect that results that are declared significant would be important, meaningful, or consequential. Being 'significant at an arbitrary probability level,' ... ensures none of these"

Warren - "the researcher has the right to make inferences that may seem contrary to the objective analysis [statistical analysis], provided that is what he or she really believes and that the objective results have been given due consideration"

Warren - "I have seen authors declare that means were not different, but with less than a 50% chance of detecting a difference the magnitude of which would be important; if such a difference existed they would have been better off tossing a coin and not doing the experiment"
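
Warren's coin-toss complaint can be made concrete with a short numerical sketch (mine, not Warren's), assuming a two-sided two-sample z-test with known, equal variances and hypothetical numbers: for a modest effect and a typical field sample size, the chance of detecting the difference is well below one half.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def two_sample_power(effect, sigma, n):
    """Approximate power of a two-sided two-sample z-test at alpha = 0.05
    (equal group sizes n, known common standard deviation sigma)."""
    z_crit = 1.959963984540054          # Phi^-1(0.975)
    se = sigma * math.sqrt(2.0 / n)     # standard error of the mean difference
    return (1.0 - normal_cdf(z_crit - effect / se)
            + normal_cdf(-z_crit - effect / se))

# A quarter-standard-deviation effect with 30 subjects per group
# (both values hypothetical) gives power well below 0.5:
print(two_sample_power(effect=0.25, sigma=1.0, n=30))
```

With these inputs the test detects the effect far less often than a coin toss would, which is exactly Warren's point.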

**1987**

Berger and Sellke - "even if testing of a point null hypothesis were disreputable,
the reality is that people do it all the time ... and we should do our best
to see that it is done well". *Comment: On the contrary, if we assist others
to perform disreputable tests then we ourselves also become disreputable.*

Casella and Berger - "In a large majority of problems (especially location problems) hypothesis testing is inappropriate: Set up the confidence interval and be done with it!"
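
A minimal sketch of the "set up the confidence interval and be done with it" advice, under the simplifying assumption of a z-interval with known common sigma; the group means, sigma, and sample size below are hypothetical, chosen only for illustration.

```python
import math

def ci_for_difference(mean1, mean2, sigma, n, z=1.959963984540054):
    """95% z-interval for the difference of two group means
    (equal group sizes n, known common standard deviation sigma)."""
    se = sigma * math.sqrt(2.0 / n)     # standard error of the difference
    d = mean1 - mean2
    return (d - z * se, d + z * se)

# Hypothetical clutch-size means for two habitats:
lo, hi = ci_for_difference(mean1=8.4, mean2=7.9, sigma=1.2, n=25)
print(lo, hi)
```

The interval spans zero, so a test would say "not significant", but unlike the test it also reports the size of the estimated difference and, by its width, the precision of the study.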

Hinkley - "for problems where the usual null hypothesis defines a special value for a parameter, surely it would be more informative to give a confidence range for that parameter"

Vardeman - "Competent scientists do not believe their own models or theories, but rather treat them as convenient fictions. ... The issue to a scientist is not whether a model is true, but rather whether there is another whose predictive power is enough better to justify movement from today's fiction to a new one"

Vardeman - "Too much of what *all* statisticians do ... is blatantly
subjective for any of us to kid ourselves or the users of our technology into
believing that we have operated 'impartially' in any true sense. ... We can
do what seems to us most appropriate, but we can *not* be objective
and would do well to avoid language that hints to the contrary"

**1988**

Finney - "rigid dependence upon significance tests in single experiments is to be deplored"

Finney - "The primary purpose of analysis of variance is to produce estimates of one or more error mean squares, and not (as is often believed) to provide significance tests"

Finney - "A null hypothesis that yields under two different treatments have identical expectations is scarcely very plausible, and its rejection by a significance test is more dependent upon the size of an experiment than upon its untruth"

Finney - "I have failed to find a single instance in which the Duncan test was helpful, and I doubt whether any of the alternative tests [multiple range significance tests] would please me better"

Finney - "Is it ever worth basing analysis and interpretation of an experiment on the inherently implausible null hypothesis that two (or more) recognizably distinct cultivars have identical yield capacities?"

Gauch - "the mere declaration that the interaction is or is not significant is far too coarse a result to give agronomists or plant breeders effective insight into their research material"

Luce - "I could only wish for every psychologist to read this chapter as an antidote to mindless hypothesis testing in lieu of doing good science: measuring effects, constructing substantive theories of some depth, and developing probability models and statistical procedures suited to these theories."

**1989**

Chatfield - "We all know ... that the misuse of statistics and an overemphasis
on *p* values is endemic in many scientific journals"

Finney (a) - "The analysis of data ... requires assumptions ... The assumptions are never correct"

Finney (b) - "I confidently assert that yields of potatoes from plots of a well-conducted field experiments [sic] can be assumed independently and Normally distributed with constant variance; I do not believe this"

Finney (b) - "the Blind need frequent warnings and help in avoiding the multiple comparison test procedures that some editors demand but that to me appear completely devoid of practical utility"

Gigerenzer et al. - "In some fields, a strikingly narrow understanding of statistical significance made a significant result seem to be the ultimate purpose of research, and non-significance the sign of a badly conducted experiment - hence with almost no chance of publication."

Healy - "it is a travesty to describe a *p* value ... as 'simple,
objective and easily interpreted' ... To use it as a *measure* of closeness
between model and data is to invite confusion"

Kruskal and Majors - "We are also concerned about the use of statistical significance--P values--to measure importance; this is like the old confusion of substantive with statistical significance"

Moore and McCabe - "Some hesitation about the unthinking use of significance tests is a sign of statistical maturity"

Moore and McCabe - "It is usually wise to give a confidence interval for the parameter in which you are interested"

Moore and McCabe - "A null hypothesis that is ... false can become widely believed if repeated attempts to find evidence against it fail because of low power"

Moore and McCabe - "Other eminent statisticians have argued that if 'decision' is given a broad meaning, almost all problems of statistical inference can be posed as problems of making decisions in the presence of uncertainty"

Rosnow and Rosenthal - "A result that is statistically significant is not necessarily practically significant as judged by the magnitude of the effect."

**1990**

Cohen - "The null hypothesis ... is *always* false in the real world.
... If it is false, even to a tiny degree, it must be the case that a large
enough sample will produce a significant result and lead to its rejection."
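
Cohen's point is easy to verify numerically. The sketch below (my illustration, not Cohen's) holds a biologically trivial mean difference fixed and lets the sample size grow, assuming a two-sided two-sample z-test with known sigma; the p-value can be driven as low as one likes.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def z_test_p(delta, sigma, n):
    """Two-sided p-value for an observed mean difference delta between
    two groups of size n (known common standard deviation sigma)."""
    z = abs(delta) / (sigma * math.sqrt(2.0 / n))
    return 2.0 * (1.0 - normal_cdf(z))

# A difference of 0.01 standard deviations -- tiny by any biological
# standard -- becomes "significant" once n is large enough:
for n in (100, 10_000, 1_000_000):
    print(n, z_test_p(delta=0.01, sigma=1.0, n=n))
```

The effect never changes; only the sample size does, yet the verdict swings from "no difference" to overwhelming rejection.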

Cohen - "I believe ... that hypothesis testing has been greatly overemphasized in psychology and in other disciplines that use it."

Cohen - "The prevailing yes-no decision at the magic .05 level from a single research is a far cry from the use of informed judgment. Science simply doesn't work that way. A successful piece of research doesn't conclusively settle an issue, it just makes some theoretical proposition to some degree more likely. ... There is no ontological basis for dichotomous decision making in psychological inquiry."

Hahn - "hypothesis tests (irrelevant for most practical applications)"

Hunter - "How about 'alpha and beta risks' and 'testing the null hypothesis'? ... The very beginning language employed by the statistician describes phenomena in which engineers/physical scientists have little practical interest! They want to know how many, how much, and how well ... Required are interval estimates. We offer instead hypothesis tests and power curves"

Meehl - "All statistical tables should be required to include means and standard
deviations, rather than merely a *t*, *F*, or χ^{2}, or
even worse only statistical significance."

Meehl - "Confidence intervals for parameters ought regularly to be provided."

Meehl - "Since the null hypothesis refutation racket is 'steady work' and has the merits of an automated research grinding device, scholars who are pardonably devoted to making more money and keeping their jobs ... are unlikely to contemplate with equanimity a criticism that says that their whole procedure is scientifically feckless and that they should quit doing it and do something else. ... that might ... mean that they should quit the academy and make an honest living selling shoes"

Preece - "I cannot see how anyone could now agree with this [Fisher's 1935 quote about experiments and null hypotheses]"

Street - "Fisher ... appears to have placed an undue emphasis on the significance test"

Street - "in many experiments it is well known ... that there are differences among the treatments. The point of the experiment is to estimate ... and provide ... standard errors. One of the consequences of this emphasis on significance tests is that some scientists ... have come to see a significant result as an end in itself"

**1991**

Matloff - "statistical significance is not the same as scientific significance"

Matloff - "the test is asking whether a certain condition holds exactly, and this exactness is almost never of scientific interest"

Matloff - With regard to a goodness-of-fit test to answer whether certain ratios have given exact values, "we know a priori this is not true; no model can completely capture all possible genetical mechanisms"

Matloff - "the number of stars by itself is relevant only to the question
of whether *H_{0}* is exactly true--a question which is almost
always not of interest to us, especially because we usually know a priori
that *H_{0}* is false"

Matloff - "problems stemming from the fact that hypothesis tests do not address questions of scientific interest"

Matloff - "the 'star system' includes neither an E part [estimate] nor an A part [accuracy] and thus excludes vital information ... There is no such danger in basing our analysis on CIs [confidence intervals]"

Matloff - "no population has an exact normal distribution, nor are variances exactly homogeneous, and independence assumptions are often violated to at least some degree"

Tukey - "Statisticians classically asked the wrong question--and were willing to answer with a lie, one that was often a downright lie. They asked 'Are the effects of A and B different?' and they were willing to answer 'no.' All we know about the world teaches us that the effects of A and B are always different--in some decimal place--for any A and B. Thus asking 'Are the effects different?' is foolish."

Tukey - "Empirical knowledge is always fuzzy! And theoretical knowledge, like all the laws of physics, as of today's date, is always wrong--in detail, though possibly providing some very good approximations indeed."

**1992**

Pearce - "In a biological context interactions are common, so it is better to play safe and regard any appreciable interaction as real whether it is significant or not"

Upton - "The experimenter must keep in mind that significance at the 5% level will only coincide with practical significance by chance!"

**1993**

Wang - "Testing of statistical hypotheses ... are often irrelevant, wrong-headed, or both"

Wang - "the tyranny of the N-P [Neyman-Pearson] theory in many branches of empirical science is detrimental, not advantageous, to the course of science"

**1994**

Boardman - "He [W. E. Deming] went on to suggest that the problem lay in teaching 'what is wrong.' The list of evils taught in courses on statistics ... is a long one. One of the topics included hypothesis testing. Personally I have found few, if any, occasions where such tests are appropriate."

Cohen - "I make no pretense of the originality of my remarks in this article."

Cohen - "I argue herein that NHST [null hypothesis significance testing] has not only failed to support the advance of psychology as a science but also has seriously impeded it."

Cohen - "my ... recommendation is that ... we routinely report effect sizes in the form of confidence limits."

Cohen - "they [confidence limits] are rarely to be found in the literature. I suspect that the main reason they are not reported is that they are so embarrassingly large!"

Inman - "Like many working scientists since, Buchanan-Wollaston professed a belief that commonly used statistical tests were either obvious or irrelevant to the scientific problem of interest"

**1995**

McCloskey - "scientists care about whether a result is statistically significant, but they should care much more about whether it is meaningful"

McCloskey - "the scale for measuring ... effects ... or ... changes ... is not so clear: you may get statistically impeccable answers that make little difference to anyone or 'insignificant' ones that are absolutely crucial"

**1996**

Ranstam - "A common misconception is that an effect exists only if it is statistically significant and that it does not exist if it is not [statistically significant]"

Ranstam - "When using confidence intervals, clinical rather than statistical significance is emphasized. Moreover, confidence intervals, by their width, disclose the statistical precision of the results."

Tamhane - "The point of departure in ranking-and-selection methodology is the recognition that the treatments being compared are in fact different, and a sufficiently large sample size will demonstrate this fact with any preassigned confidence level. Therefore, it is futile to test the null hypothesis of homogeneity."