
Hypothesis Testing: Statistics as Pseudoscience

Douglas H. Johnson
Northern Prairie Wildlife Research Center
U.S. Geological Survey, Biological Resources Division
Jamestown, North Dakota 58401

Presented at the Fifth Annual Conference of The Wildlife Society, Buffalo, New York, 26 September 1998. Part of the symposium, Evaluating the Role of Hypothesis Testing/Power Analysis in Wildlife Science, sponsored by the Biometrics Working Group.

Abstract: Wildlife biologists recently have been subjected to the credo that if you're not testing hypotheses, you're not doing real science. To protect themselves against rejection by journal editors, authors cloak their findings in an armor of P values. I contend that much statistical hypothesis testing is misguided. Virtually all null hypotheses tested are, in fact, false; the only issue is whether or not the sample size is sufficiently large to show it. No matter if it is or not, one then gets led into the quagmire of deciding biological significance versus statistical significance. Most often, parameter estimation is a more appropriate tool than statistical hypothesis testing. Statistical hypothesis testing should be distinguished from scientific hypothesis testing, in which truly viable alternative hypotheses are evaluated in a real attempt to falsify them. The latter method is part of the deductive logic of strong inference, which is better-suited to simple systems. Ecological systems are complex, with components typically influenced by many factors, whose influences often vary in place and time. Competing hypotheses in ecology rarely can be falsified and eliminated. Wildlife biologists perhaps adopt hypothesis tests in order to make what are really descriptive studies appear as scientific as those in the "hard" sciences. Rather than attempting to falsify hypotheses, it may be more productive to understand the relative importance of multiple factors.

Literature: The following is a compilation of references cited in the presented paper, as well as citations provided by Marks R. Nester for his A Myopic View and History of Hypothesis Testing, which follows the Literature. Marks Nester's email address is nesterm@qfri1.se2.dpi.qld.gov.au. Additional comments are available at David Parkhurst's Quotes Criticizing Significance Testing.

Altman, D. G.  1985.  Discussion of Dr Chatfield's paper.   Journal of the
     Royal Statistical Society A 148 : 242.


Anscombe, F. J. 1956. Discussion on Dr. David's and Dr. Johnson's Paper. 
     Journal of the Royal Statistical Society  B 18 : 24-27.


Arbuthnott, J. 1710. An argument for Divine Providence, taken from the 
     constant regularity observ'd in the births of both sexes. 
     Philosophical Transactions of the Royal Society 23 : 186-190.


Bakan, D. 1967. The test of significance in psychological research. From 
     Chapter 1 of On Method, Jossey-Bass, Inc. (San Francisco). Reprinted
     in The Significance Test Controversy - A Reader, D. E. Morrison and 
     R. E. Henkel, eds., 1970, Aldine Publishing Company (Butterworth Group).


Barnard, G.  1998.  Letter.  New Scientist 157: 47.


Barndorff-Nielsen, O. 1977. Discussion of D. R. Cox's paper. Scandinavian 
     Journal of Statistics 4 : 67-69.


Beaven, E. S. 1935. Discussion on Dr. Neyman's Paper. Journal of the Royal 
     Statistical Society, Supplement 2 : 159-161.


Berger, J. O., and D. A. Berry.  1988.  Statistical Analysis and the 
     Illusion of Objectivity. American Scientist 76:159-165.


Berger, J. O. and Sellke, T. 1987. Testing a point null hypothesis: the 
     irreconcilability of P values and evidence.   Journal of the 
     American Statistical Association 82 : 112-122.


Berkson, J. 1938. Some difficulties of interpretation encountered in the 
     application of the chi-square test.   Journal of the American 
     Statistical Association 33 : 526-536.


Berkson, J. 1942. Tests of significance considered as evidence. Journal of 
     the American Statistical Association 37 : 325-335.


Binder, A. 1963. Further considerations on testing the null hypothesis and 
     the strategy and tactics of investigating theoretical models. 
     Psychological Review 70 : 107-115. Reprinted in Statistical Issues, 
     A Reader for the Behavioural Sciences, R. E. Kirk, ed., 1972, 
     Wadsworth Publishing Company : 118-126.


Boardman, T. J. 1994. The statistician who changed the world: W. Edwards 
     Deming, 1900-1993. The American Statistician 48(3) : 179-187.

Box, G. E. P. 1976. Science and statistics. Journal of the American 
     Statistical Association 71 : 791-799.

Box, G. E. P.  1983.  An apology for ecumenism in statistics. In Scientific 
     Inference, Data Analysis, and Robustness, G. E. P. Box, T. Leonard and 
     C. F. Wu, eds., Academic Press, Inc. : 51-84.

Braithwaite, R. B.  1953.  Scientific Explanation. A Study of the Function 
     of Theory, Probability and Law in Science. Cambridge University Press.

Bryan-Jones, J. and Finney, D. J.  1983.  On an error in "Instructions to 
     Authors". HortScience 18(3) : 279-282.

Buchanan-Wollaston, H. J.  1935.  The philosophic basis of statistical 
     analysis. Journal of the International Council for the Exploration 
     of the Sea 10 : 249-263.

Camilleri, S. F.  1962.  Theory, probability, and induction in social 
     research. American Sociological Review 27 : 170-178. Reprinted in 
     The Significance Test Controversy - A Reader, D. E. Morrison and 
     R. E. Henkel, eds., 1970,  Aldine Publishing Company (Butterworth 
     Group).

Campbell, M.  1992.  Letter.  Royal Statistical Society News & Notes 
     18(9):5.

Carver, R. P. 1978. The case against statistical significance testing. 
     Harvard Educational Review 48 : 378-399.

Casella, G. and Berger, R. L.  1987.  Rejoinder. Journal of the American 
     Statistical Association 82 : 133-135.

Chatfield, C.  1985.  The initial examination of data (with discussion). 
     Journal of the Royal Statistical Society  A 148: 214-253.

Chatfield, C.  1989.  Comments on the paper by McPherson. Journal of the 
     Royal Statistical Society  A 152 : 234-238.

Chernoff, H.  1986.  Comment. The American Statistician 40(1) : 5-6.

Chew, V.  1976.  Comparing treatment means: a compendium. HortScience 
     11(4) : 348-357.

Chew, V.  1977.  Statistical hypothesis testing: an academic exercise in 
     futility.  Proceedings of the Florida State Horticultural Society 
     90 : 214-215.

Chew, V.  1980.  Testing differences among means: correct interpretation 
     and some alternatives. HortScience 15(4) : 467-470.

Cochran, W. G. and Cox, G. M.  1957.  Experimental Designs. 2nd ed. John 
     Wiley & Sons, Inc.

Cohen, J.  1990.  Things I have learned (so far). American  Psychologist 
     45 : 1304-1312.

Cohen, J. 1994. The earth is round (p < .05). American  Psychologist 
     49 : 997-1003.

Cormack, R. M. 1985. Discussion of Dr Chatfield's paper. Journal of the 
     Royal Statistical Society A 148 : 231-233.

Cox, D. R. 1958. Some problems connected with statistical inference. 
     Annals of Mathematical Statistics 29 : 357-372.

Cox, D. R. 1977. The role of significance tests.  (with discussion). 
     Scandinavian Journal of Statistics  4 : 49-70.

Cox, D. R. 1982. Statistical significance tests. British Journal of 
     Clinical Pharmacology 14 : 325-331.

Cox, D. R. and Snell, E. J. 1981. Applied Statistics Principles and 
     Examples. Chapman and Hall.

Deming, W. E.  1975.  On probability as a basis for action.  The 
     American Statistician 29:146. 

Edwards, W., Lindman, H. and Savage, L. J. 1963. Bayesian statistical 
     inference for psychological research. Psychological Review 70 : 
     193-242.

Finney, D. J. 1988. Was this in your statistics textbook? III. Design 
     and analysis. Experimental Agriculture 24 : 421-432.

Finney, D. J. 1989a. Was this in your statistics textbook? VI. Regression 
     and covariance. Experimental Agriculture. 25 : 291-311.

Finney, D. J. 1989b. Is the statistician still necessary? Biometrie 
     Praximetrie 29 : 135-146.

Fisher, R. A. 1925. Statistical Methods for Research Workers. Oliver 
     and Boyd (London).

Fisher, R. A. 1935. The Design of Experiments. Oliver and Boyd (Edinburgh).

Gauch Jr., H. G. 1988. Model selection and validation for yield trials 
     with interaction. Biometrics 44 : 705-715.

Gavarret, J. 1840. Principes Généraux de Statistique Médicale. 
     [No publisher given] (Paris). (Not cited).

Geary, R. C. 1947. Testing for normality. Biometrika 34 : 209-242.

Gerard, P. D., D. R. Smith, and G. Weerakkody.  1998.  Limits of 
     retrospective power analysis. Journal of Wildlife Management 
     62:801-807.

Gigerenzer, G., Swijtink, Z., Porter, T., Daston, L., Beatty, J. and 
     Krüger, L. 1989. The Empire of Chance. Cambridge University 
     Press, Cambridge, England.

Gold, D. 1958. Comment on "A critique of tests of significance". American 
     Sociological Review 23 : 85-86.

Good, I. J. 1983. Good Thinking. The Foundations of Probability and Its 
     Applications. University of Minnesota Press (Minneapolis).

Grant, D. A. 1962. Testing the null hypothesis and the strategy and 
     tactics of investigating theoretical models. Psychological Review 
     69 : 54-61.

Graybill, F. A. 1976. Theory and Application of the Linear Model. 
     Duxbury Press (Massachusetts).

Guttman, L. 1977. What is not what in statistics. The Statistician 
     26 : 81-107.

Guttman, L. 1985. The illogic of statistical inference for cumulative 
     science. Applied Stochastic Models and Data Analysis 1 : 3-10.

Hacking, I. 1965. Logic of Statistical Inference. Cambridge University 
     Press.

Hahn, G. J. 1990. Commentary. Technometrics 32: 257-258.

Hays, W. L. 1973. Statistics for the Social Sciences. Second edition. 
     Holt, Rinehart and Winston.

Healy, M. J. R. 1978. Is statistics a science? Journal of the Royal 
     Statistical Society  A 141 : 385-393.

Healy, M. J. R. 1989. Comments on the paper by McPherson. Journal of the 
     Royal Statistical Society  A 152 : 232-234.

Hinkley, D. V. 1987. Comment. Journal of the American Statistical Association 
     82 : 128-129.

Hodges Jr., J. L. and Lehmann, E. L. 1954. Testing the approximate validity
     of statistical hypotheses.  Journal of the Royal Statistical Society  
     B 16 : 261-268.

Hogben, L. 1957a. The contemporary crisis or the uncertainties of 
     uncertain inference. Statistical Theory, W. W. Norton & 
     Co., Inc., Reprinted in The Significance Test Controversy - 
     A Reader, D. E. Morrison and R. E. Henkel, eds., 1970, Aldine 
     Publishing Company (Butterworth Group).

Hogben, L. 1957b. Statistical prudence and statistical inference. 
     Statistical Theory, W. W. Norton & Co., Inc. Reprinted 
     in The Significance Test Controversy - A Reader, D. E. 
     Morrison and R. E. Henkel, eds., 1970, Aldine Publishing 
     Company (Butterworth Group).

Hunter, J. S. 1990. Commentary. Technometrics 32 : 261.

Inman, H. F. 1994. Karl Pearson and R. A. Fisher on statistical tests: A 
     1935 exchange from Nature. The American Statistician 48(1) : 2-11.

Johnson, D. H.  1995.  Statistical sirens: the allure of nonparametrics.  
     Ecology 76:1998-2000.

Jones, D. 1984. Use, misuse, and role of multiple-comparison procedures in 
     ecological and agricultural entomology. Environmental Entomology 
     13 : 635-649.

Jones, D. and Matloff, N. 1986. Statistical hypothesis testing in biology: 
     a contradiction in terms. Journal of Economic Entomology 79 : 
     1156-1160.

Kempthorne, O. 1966. Some aspects of experimental inference. Journal of 
     the American Statistical Association 61 : 11-34.

Kempthorne, O. 1976. Of what use are tests of significance and tests of 
     hypotheses. Communications in Statistics: Theory and Methods A5 : 
     763-777.

Kish, L. 1959. Some statistical problems in research design. American 
     Sociological Review, 24 : 328-338. Reprinted in The Significance 
     Test Controversy - A Reader, D. E. Morrison and R. E. Henkel, 
     eds., 1970, Aldine Publishing Company (Butterworth Group).

Kruskal, W. H. 1978. Significance, Tests of.   In International Encyclopedia 
     of Statistics , W. H. Kruskal and J. M. Tanur, eds., Free Press 
     (New York) : 944-958.

Kruskal, W. 1980. The significance of Fisher: a review of R. A. Fisher: The 
     Life of a Scientist. Journal of the American Statistical Association 
     75 : 1019-1030.

Kruskal, W. and Majors, R. 1989. Concepts of relative importance in recent 
     scientific literature. The American Statistician 43(1) : 2-6.

LaForge, R. 1967. Confidence intervals or tests of significance in scientific 
     research. Psychological Bulletin 68 : 446-447.

Lindley, D. V. 1986. Discussion. The Statistician 35 : 502-504.

Little, T. M. 1981. Interpretation and presentation of results. 
     HortScience 16 : 637-640.

Luce, R. D. 1988. The tools-to-theory hypothesis. Review of G. Gigerenzer 
     and D. J. Murray, "Cognition as intuitive statistics." Contemporary 
     Psychology 33 : 582-583.

Lykken, D. T. 1968. Statistical significance in psychological research. 
     Psychological Bulletin 70 : 151-159. Reprinted in The Significance 
     Test Controversy - A Reader, D. E. Morrison and R. E. Henkel, eds., 
     1970, Aldine Publishing Company (Butterworth Group).

Matloff, N. S. 1991. Statistical hypothesis testing: problems and 
     alternatives. Environmental Entomology 20 : 1246-1250.

Matthews, R.  1997.  Faith, hope and statistics.  New Scientist 
     156(2109):36-39.

McCloskey, D. N. 1995. The insignificance of statistical significance. 
     Scientific American 272(4) : 104-105.

McNemar, Q. 1960. At random: sense and nonsense. American Psychologist 
     15 : 295-300.

Meehl, P. E. 1967. Theory testing in psychology and physics: A 
     methodological paradox. Philosophy of Science 34 : 103-115. 
     Reprinted in The Significance Test Controversy - A Reader, 
     D. E. Morrison and R. E. Henkel, eds., 1970, Aldine 
     Publishing Company (Butterworth Group).

Meehl, P. E. 1978. Theoretical risks and tabular asterisks: Sir Karl, 
     Sir Ronald, and the slow progress of soft psychology. Journal of 
     Consulting and Clinical Psychology 46 : 806-834.

Meehl, P. E. 1990. Why summaries of research on psychological theories 
     are often uninterpretable. Psychological Reports 66 (Monograph 
     Supplement 1-V66) : 195-244.

Moore, D. S. and McCabe, G. P. 1989. Introduction to the Practice of 
     Statistics. W. H. Freeman and Company (New York).

Morrison, D. E. and Henkel, R. E. 1969. Significance tests reconsidered. 
     The American Sociologist 4 : 131-140. Reprinted in The Significance 
     Test Controversy - A Reader, D. E. Morrison and R. E. Henkel, eds., 
     1970, Aldine Publishing Company (Butterworth Group).

Morrison, D. E. and Henkel, R. E. (Eds.) 1970. The Significance Test 
     Controversy - A Reader. Aldine Publishing Company (Butterworth 
     Group).

Natrella, M. G. 1960. The relation between confidence intervals and 
     tests of significance. American Statistician 14 : 20-22, 33. 
     Reprinted in Statistical Issues, A Reader for the Behavioural 
     Sciences, R. E. Kirk, ed., 1972, Wadsworth Publishing Company 
     : 113-117.

Nelder, J. A. 1971. Discussion on papers by Wynn, Bloomfield, O'Neill 
     and Wetherill. Journal of the Royal Statistical Society, B 33 : 
     244-246.

Nelder, J. A. 1985. Discussion of Dr Chatfield's paper. Journal of the 
     Royal Statistical Society  A 148 : 238.

Nester, M. R.  1996.  An applied statistician's creed.  Applied 
     Statistics 45:401-410.

Neyman, J. 1958. The use of the concept of power in agricultural 
     experimentation. Journal of the Indian Society of 
     Agricultural Statistics 9 : 9-17.

Neyman, J. and Pearson, E. S. 1933. On the problem of the most efficient 
     tests of statistical hypotheses. Philosophical Transactions of the 
     Royal Society A 231 : 289-337.

Nunnally, J. 1960. The place of statistics in psychology. Educational and 
     Psychological Measurement 20 : 641-650.

O'Brien, T. C. and Shapiro, B. J. 1968. Statistical significance--what? 
     Mathematics Teacher 61 : 673-676. Reprinted in Statistical Issues, 
     A Reader for the Behavioural Sciences, R. E. Kirk, ed., 1972, 
     Wadsworth Publishing Company : 109-112.

Pearce, S. C. 1992. Data analysis in agricultural experimentation. II. Some 
     standard contrasts. Experimental Agriculture 28 : 375-383.

Pearson, K. 1900. On the criterion that a given system of deviations from 
     the probable in the case of a correlated system of variables is 
     such that it can be reasonably supposed to have arisen from random 
     sampling. Philosophical Magazine, Series V, 1 : 157-175.

Pearson, K. 1935a. Statistical tests. Nature 136 : 296-297. (reproduced 
     in H. F. Inman (1994). Karl Pearson and R. A. Fisher on statistical 
     tests: A 1935 exchange from Nature. The American Statistician 
     48(1) : 2-11.)

Pearson, K. 1935b. Statistical tests. Nature 136 : 550. (reproduced in 
     H. F. Inman (1994). Karl Pearson and R. A. Fisher on statistical 
     tests: A 1935 exchange from Nature. The American Statistician 
     48(1) : 2-11.)

Perry, J. N. 1986. Multiple-comparison procedures: a dissenting view. 
     Journal of Economic Entomology 79 : 1149-1155.

Peterman, R. M.  1990.  The importance of reporting statistical power: 
     the forest decline and acidic deposition example.  Ecology 71: 
     2024-2027.

Petranka, J. W.  1990.  Caught between a rock and a hard place.  
     Herpetologica 46: 346-350.

Platt, J. R.  1964.  Strong inference.  Science 146:347-353.

Pratt, J. W. 1976. A discussion of the question: for what use are tests 
     of hypotheses and tests of significance. Communications in 
     Statistics: Theory and Methods  A5 : 779-787.

Preece, D. A. 1982. The design and analysis of experiments: what has gone
     wrong? Utilitas Mathematica 21A : 201-244.

Preece, D. A. 1984. Biometry in the Third World: science not ritual. 
     Biometrics 40 : 519-523.

Preece, D. A. 1990. R. A. Fisher and experimental design: a review. 
     Biometrics 46 : 925-935.

Quinn, J. F., and A. E. Dunham.  1983.  On hypothesis testing in ecology and 
     evolution. American Naturalist 122: 602-617.

Ranstam, J. 1996. A common misconception about p-value and its consequences. 
     Acta Orthopaedica Scandinavica 67 : 505-507.

Rosnow, R. L. and Rosenthal, R. 1989. Statistical procedures and the 
     justification of knowledge in psychological science. American 
     Psychologist 44 : 1276-1284.

Rothman, K. 1978. A show of confidence. New England Journal of Medicine 
     299 : 1362-1363.

Rozeboom, W. W. 1960. The fallacy of the null hypothesis significance test. 
     Psychological Bulletin 57 : 416-428. Reprinted in The Significance 
     Test Controversy - A Reader, D. E. Morrison and R. E. Henkel, eds., 
     1970, Aldine Publishing Company (Butterworth Group).

Savage, I. R. 1957. Nonparametric statistics. Journal of the American 
     Statistical Association 52 : 331-344.

Sayn-Wittgenstein, L. 1965. Statistics - salvation or slavery? Forestry 
     Chronicle 41 : 103-105.

Selvin, H. C. 1957. A critique of tests of significance in survey research. 
     American Sociological Review 22 : 519-527.

Simberloff, D.  1990.  Hypotheses, errors, and statistical assumptions.  
     Herpetologica 46: 351-357.

Skipper Jr., J. K., Guenther, A. L. and Nass, G. 1967. The sacredness 
     of .05: A note concerning the uses of statistical levels of 
     significance in social science. The American Sociologist 2 : 
     16-18.  Reprinted in The Significance Test Controversy - A 
     Reader, D. E. Morrison and R. E. Henkel, eds., 1970, Aldine 
     Publishing Company (Butterworth Group).

Smith, C. A. B. 1960. Book review of Norman T. J. Bailey: Statistical 
     Methods in Biology. Applied Statistics 9 : 64-66.

Stevens, S. S. 1968. Measurement, statistics, and the schemapiric view. 
     Science 161 : 849-856. Abridged in Statistical Issues, A Reader 
     for the Behavioural Sciences, R. E. Kirk, ed., 1972, Wadsworth 
     Publishing Company : 66-78.

Street, D. J. 1990. Fisher's contributions to agricultural statistics. 
     Biometrics 46 : 937-945.

"Student" 1908. The probable error of a mean. Biometrika 6 : 1-25.

Tamhane, A. C. 1996. Review of R. E. Bechhofer, T. J. Santner and D. M. 
     Goldsman, Design and Analysis of Experiments for Statistical 
     Selection, Screening and Multiple Comparisons, John Wiley 
     (New York), 1995. Technometrics 38 : 289-290.

Tukey, J. W. 1973. The problem of multiple comparisons. Unpublished 
     manuscript, Dept. of Statistics, Princeton University.

Tukey, J. W. 1991. The philosophy of multiple comparisons. Statistical 
     Science 6 : 100-116.

Tversky, A. and Kahneman, D. 1971. Belief in the law of small numbers. 
     Psychological Bulletin 76 : 105-110.

Upton, G. J. G. 1992. Fisher's exact test. Journal of the Royal Statistical
     Society  A 155 : 395-402.

Vardeman, S. B. 1987. Comment. Journal of the American Statistical 
     Association 82 : 130-131.

Venn, J. 1888. Cambridge anthropometry. Journal of the Anthropological 
     Institute 18 : 140-154.

Wang, C. 1993. Sense and Nonsense of Statistical Inference. Marcel 
     Dekker, Inc.

Warren, W. G. 1986. On the presentation of statistical analysis: reason or 
     ritual. Canadian Journal of Forest Research 16 : 1185-1191.

Yates, F. 1951. The influence of Statistical Methods for Research Workers 
     on the development of the science of statistics. Journal of the 
     American Statistical Association 46 : 19-34.

Yates, F. 1964. Sir Ronald Fisher and the design of experiments. Biometrics 
     20 : 307-321.

Yoccoz, N. G.  1991.  Use, overuse, and misuse of significance tests in 
     evolutionary biology and ecology.  Bulletin of the Ecological 
     Society of America 72:106-111.

Zeisel, H. 1955. The significance of insignificant differences. Public 
     Opinion Quarterly 17 : 319-321. Reprinted in The Significance 
     Test Controversy - A Reader, D. E. Morrison and R. E. Henkel, 
     eds., 1970, Aldine Publishing Company (Butterworth Group).

Marks R. Nester, author of "An Applied Statistician's Creed." Applied Statistics 45:401-410.

1. A Myopic View and History of Hypothesis Testing

According to Hacking (1965), John Arbuthnott (1710) was the first to publish a test of a statistical hypothesis. Hogben (1957b) attributes to Jules Gavarret (1840) the earliest use of the probable error as a form of significance test in the biological arena. Hogben (1957b) also states that Venn (1888) was one of the earliest users of the terms "test" and "significant". The form of the chi-squared goodness-of-fit distribution was published by K. Pearson in 1900. W. S. Gosset, using the pseudonym "Student", developed the t-distribution in 1908. According to E. S. Beaven (1935), T. B. Wood and Professor Stratton were the first to determine probable errors in the context of replicated agricultural experiments. Apparently Wood and Stratton wrote their paper in 1910, but Beaven does not give a reference. The foundations of modern hypothesis testing were laid by Fisher (1925), although the modifications propounded by Neyman and Pearson (1933) are the generally accepted norm.

I contend that the general acceptance of statistical hypothesis testing is one of the most unfortunate aspects of 20th century applied science. Tests for the identity of population distributions, for equality of treatment means, for presence of interactions, for the nullity of a correlation coefficient, and so on, have been responsible for much bad science, much lazy science, and much silly science. A good scientist can manage with, and will not be misled by, parameter estimates and their associated standard errors or confidence limits. A theory dealing with the statistical behaviour of populations should be supported by rational argument as well as data. In such cases, accurate statistical evaluation of the data is hindered by null hypothesis testing. The scientist must always give due thought to the statistical analysis, but must never let statistical analysis be a substitute for thinking! If instead of developing theories, a researcher is involved in such practical issues as selecting the best treatment(s), then the researcher is probably confronting a complex decision problem involving inter alia economic considerations. Once again, analyses such as null hypothesis testing and multiple comparison procedures are of no benefit.
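The contrast between a significance test and an estimate with its standard error can be sketched numerically. The example below is an illustration only, not part of the original argument: it assumes a large-sample z approximation, a known common standard deviation, and a hypothetical observed difference of 0.1 between two treatment means. The names `normal_cdf` and `two_sample_summary` are invented for this sketch.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def two_sample_summary(diff, sd, n):
    """Large-sample z summary for a difference between two group means.

    diff: observed difference in means (hypothetical value)
    sd:   common standard deviation, treated as known (an assumption)
    n:    number of observations in each group
    """
    se = sd * math.sqrt(2.0 / n)                 # standard error of the difference
    z = diff / se                                # test statistic for H0: difference = 0
    p_value = 2.0 * (1.0 - normal_cdf(abs(z)))   # two-sided P value
    half_width = 1.96 * se                       # approximate 95% confidence half-width
    return {"se": se, "p": p_value, "ci": (diff - half_width, diff + half_width)}


if __name__ == "__main__":
    # Same small difference, increasing sample size per group.
    for n in (20, 200, 2000, 20000):
        s = two_sample_summary(diff=0.1, sd=1.0, n=n)
        low, high = s["ci"]
        print(f"n={n:6d}  p={s['p']:.4f}  95% CI=({low:+.3f}, {high:+.3f})")
```

With these hypothetical inputs the P value falls from well above 0.05 at n = 20 to well below it at n = 2000, although the estimated difference never changes. The confidence interval, by contrast, reports the size of the effect and how precisely it is known at every sample size.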

Although some of the following passages have been included for their historical interest, most of the quotations are offered in partial support of my views.

1710

Arbuthnott - "This Equality of Males and Females is not the Effect of Chance but Divine Providence ... which I thus demonstrate :

Let there be a Die of Two sides, M and F ...But it is very improbable (if mere Chance govern'd) that ... To repair that Loss, provident Nature ... brings forth more Males than Females ; and that in almost a constant proportion."

1888

Venn - "When a sufficient number of results had been obtained ... I was requested ... to undertake an analysis of them, and a comparison of their general outcome with that of those obtained by almost identical instruments at South Kensington. ... When we are dealing with statistics, we ought to be able not merely to say vaguely that the difference does or does not seem significant to us, but we ought to have some test as to what difference would be significant. ... The above remarks ... inform us which of the differences in the above tables are permanent and significant, in the sense that we may be tolerably confident that if we took another similar batch we should find a similar difference; and which of them are merely transient and insignificant, in the sense that another similar batch is about as likely as not to reverse the conclusion we have obtained."

1900

Pearson - "A theoretical probability curve without limited range will never at the extreme tails exactly fit observation. The difficulty is obvious where the observations go by units and the theory by fractions."

Pearson - "if the earlier writers on probability had not proceeded so entirely from the mathematical standpoint, but had endeavoured first to classify experience in deviations from the average, and then to obtain some measure of the actual goodness of fit provided by the normal curve, that curve would never have obtained its present position in the theory of errors"

Pearson - "We can only conclude from the investigations here considered that the normal curve possesses no special fitness for describing errors or deviations such as arise either in observing practice or in nature"

1933

Neyman and Pearson - "if x is a continuous variable ... then any value of x is a singularity of relative probability equal to zero. We are inclined to think that as far as a particular hypothesis is concerned, no test based upon the theory of probability can by itself provide any valuable evidence of the truth or falsehood of that hypothesis"

1935

Buchanan-Wollaston - "The [null] hypothesis should be such that it is acceptable on a priori grounds if the data do not show it to be unlikely to be true"

Fisher - "Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis"

Pearson (a) - "the [chi-square goodness-of-fit] tests are used to ascertain whether a reasonable graduation curve has been achieved, not to assert whether one or another hypothesis is true or false"

Pearson (a) - "I have never found a normal curve fit anything if there are enough observations!"

Pearson (b) - "There is only one case in which an hypothesis can be definitely rejected, namely when its probability is zero."

1938

Berkson - "we may assume that it is practically certain that any series of real observations does not actually follow a normal curve with absolute exactitude ... and ... the chi-square [goodness-of-fit] P will be small if the sample has a sufficiently large number of observations in it"

1942

Berkson - "null hypothesis procedure ... It says 'If A is true, B will happen sometimes; therefore if B has been found to happen, A can be considered disproved' "

Berkson - "I do not say anything has been 'proved' or 'disproved.' I leave to others the use of these words, which I think are quite inadmissible as applying to anything that can be accomplished by statistics"

1947

Geary - "Normality is a myth; there never was, and never will be, a normal distribution"

1951

Yates - "the emphasis given to formal tests of significance ... has resulted in ... an undue concentration of effort by mathematical statisticians on investigations of tests of significance applicable to problems which are of little or no practical importance ... and ... it has caused scientific research workers to pay undue attention to the results of the tests of significance ... and too little to the estimates of the magnitude of the effects they are investigating"

Yates - "the occasions ... in which quantitative data are collected solely with the object of proving or disproving a given hypothesis are relatively rare"

Yates - "... the unfortunate consequence that scientific workers have often regarded the execution of a test of significance on an experiment as the ultimate objective"

Yates - "Results are significant or not significant and that is the end of it"

1953

Braithwaite - "The peculiarity of ... statistical hypotheses is that they are not conclusively refutable by any experience"

Braithwaite - "no batch of observations, however large, either definitively rejects or definitively fails to reject the hypothesis H0"

Braithwaite - "what John Dewey called 'the quest for certainty' is, in the case of empirical knowledge, a snare and a delusion"

Braithwaite - "The ultimate justification for any scientific belief will depend upon the main purpose for which we think scientifically--that of predicting and thereby controlling the future"

1954

Hodges, Jr. and Lehmann - "we may formulate the hypothesis that a population is normally distributed, but we realize that no natural population is ever exactly normal"

Hodges, Jr. and Lehmann - "when we formulate the hypothesis that the sex ratio is the same in two populations, we do not really believe that it could be exactly the same"

1955

Zeisel - "the researchers who follow the statistical way of life often distinguish themselves by a certain aridity of theoretical insights"

1956

Anscombe - "Tests of the null hypothesis that there is no difference between certain treatments are often made in the analysis of agricultural or industrial experiments in which alternative methods or processes are compared. Such tests are ... totally irrelevant. What are needed are estimates of magnitudes of effects, with standard errors"

1957

Cochran and Cox - "In many experiments it seems obvious that the different treatments must have produced some difference, however small, in effect. Thus the hypothesis that there is no difference is unrealistic: the real problem is to obtain estimates of the sizes of the differences"

Hogben (a) - "Acceptability of a statistically significant result ... promotes a high output of publication. Hence the argument that the techniques work has a tempting appeal to young biologists, if harassed by their seniors to produce results, or if admonished by editors to conform to a prescribed ritual of analysis before publication. ... the plea for justification by works ... is therefore likely to fall on deaf ears, unless we reinstate reflective thinking in the university curriculum"

Hogben (a) - "we can already detect signs of such deterioration in the growing volume of published papers ... recording so-called significant conclusions which an earlier vintage would have regarded merely as private clues for further exploration"

Savage - "to make measurements and then ignore their magnitude would ordinarily be pointless. Exclusive reliance on tests of significance obscures the fact that statistical significance does not imply substantive significance"

Savage - "Null hypotheses of no difference are usually known to be false before the data are collected ... when they are, their rejection or acceptance simply reflects the size of the sample and the power of the test, and is not a contribution to science"

Selvin - "High levels of predictability, explanation, and association are legitimate goals for social scientists; they are not the same as a high level of significance, nor is statistical significance a substitute for them."

1958

Cox - "Exact truth of a null hypothesis is very unlikely except in a genuine uniformity trial"

Cox - "Assumptions that we make, such as those concerning the form of the population sampled, are always untrue"

Gold - "An important weakness of much analysis in current social research is the failure of the analyst to consider the distinction between statistical significance and substantive importance."

Neyman - "If we go to the trouble of setting up an experiment this is because we want to establish the presence of some possible effect of a treatment." Comment: This is sadly reminiscent of Fisher (1935).

Neyman - "if experimenters realized how little is the chance of their experiments discovering what they are intended to discover, then a very substantial proportion of the experiments that are now in progress would have been abandoned in favour of an increase in size of the remaining experiments, judged more important"

Neyman - "What was the probability (power) of detecting interactions ... in the experiment performed? ... The probability in question is frequently relatively low ... in cases of this kind the fact that the test failed to detect the existence of interactions does not mean very much. In fact, they may exist and have gone undetected."

1959

Kish - "Significance should stand for meaning and refer to substantive matter. ... I would recommend that statisticians discard the phrase 'test of significance' "

Kish - "the tests of null hypotheses of zero differences, of no relationships, are frequently weak, perhaps trivial statements of the researcher's aims ... in many cases, instead of the tests of significance it would be more to the point to measure the magnitudes of the relationships, attaching proper statements of their sampling variation. The magnitudes of relationships cannot be measured in terms of levels of significance"

1960

McNemar - "too many users of the analysis of variance seem to regard the reaching of a mediocre level of significance as more important than any descriptive specification of the underlying averages"

McNemar - "so much of what should be regarded as preliminary gets published, then quoted as the last word, which it usually is because the investigator is too willing to rest on the laurels that come from finding a significant difference. Why should he worry about the degree of relationship or its possible lack of linearity"

Natrella - "One reason for preferring to present a confidence interval statement (where possible) is that the confidence interval, by its width, tells more about the reliance that can be placed on the results of the experiment than does a YES-NO test of significance."

Natrella - "the significance test without its OC [Operating Characteristic] curve has distorted the thinking in some experimental problems"

Natrella - "Confidence intervals give a feeling of the uncertainty of experimental evidence, and (very important) give it in the same units ... as the original observations."

Nunnally - "Few ... of the criticisms which will be made were originated by the author ... However, it is hoped that when the criticisms are brought together they will argue persuasively for a change in viewpoint about statistical logic"

Nunnally - "the null-hypothesis models ... share a crippling flaw: in the real world the null hypothesis is almost never true, and it is usually nonsensical to perform an experiment with the sole aim of rejecting the null hypothesis"

Nunnally - "when large numbers of subjects are used in studies, nearly all comparisons of means are 'significantly' different and all correlations are 'significantly' different from zero"

Nunnally - "If rejection of the null hypothesis were the real intention in psychological experiments, there usually would be no need to gather data"

Nunnally - "the mere rejection of a null hypothesis provides only meager information"

Nunnally - "Closely related to the null hypothesis is the notion that only enough subjects need be used in psychological experiments to obtain 'significant' results. This often encourages experimenters to be content with very imprecise estimates of effects"

Nunnally - "analysis of variance should be considered primarily an estimation device"

Nunnally - "psychological research is often difficult and frustrating, and the frustration can lead to a 'flight into statistics.' With some, this takes the form of a preoccupation with statistics to the point of divorcement from the headaches of empirical study. With others, the hypothesis-testing models provide a quick and easy way of finding 'significant differences' and an attendant sense of satisfaction"

Nunnally - "We should not feel proud when we see the psychologist smile and say 'the correlation is significant beyond the .01 level.' Perhaps that is the most that he can say, but he has no reason to smile"

Rozeboom - "one can hardly avoid polemics when butchering sacred cows"

Rozeboom - "Whenever possible, the basic statistical report should be in the form of a confidence interval"

Rozeboom - "the stranglehold that conventional null hypothesis significance testing has clamped on publication standards must be broken"

Rozeboom - "The traditional null hypothesis significance-test method ... of statistical analysis is here vigorously excoriated for its inappropriateness as a method of inference"

Smith - "One feature ... which requires much more justification than is usually given, is the setting up of unplausible null hypotheses. For example, a statistician may set out a test to see whether two drugs have exactly the same effect, or whether a regression line is exactly straight. These hypotheses can scarcely be taken literally"

1962

Camilleri - "another problem associated with the test of significance. The particular level of significance chosen for an investigation is not a logical consequence of the theory of statistical inference"

Camilleri - "The precision and empirical concreteness often associated with the test of significance are illusory and it would be a serious error to predicate our actions towards hypotheses on the test of significance as if it were a reliable arbiter of truth"

Grant - "In view of our long-term strategy of improving our theories, our statistical tactics can be greatly improved by shifting emphasis away from over-all hypothesis testing in the direction of statistical estimation. This always holds true when we are concerned with the actual size of one or more differences rather than simply in the existence of differences."

1963

Binder - [With regard to Fisher's 1935 quote about experiments and null hypotheses] "This is not very edifying since one does not expect to prove any hypothesis by the methods of probabilistic inference."

Binder - "when one tests a point prediction he usually knows before the first sample element is drawn that his empirical hypothesis is not precisely true"

Binder - "It is surely apparent that anyone who wants to obtain a significant difference badly enough can obtain one ... choose a sample size large enough"

Edwards et al. - "in typical applications, one of the hypotheses--the null hypothesis--is known by all concerned to be false from the outset"

Edwards et al. - "classical procedures quite typically are, from a Bayesian point of view, far too ready to reject the null hypotheses" Comment: Then this is a most convincing argument against the use of Bayesian methods.

Edwards et al. - "Estimation is best when it is stable. Rejection of a null hypothesis is best when it is interocular"
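
Illustration (not part of the quoted sources): Binder's remark that a large enough sample will make any nonzero difference significant can be checked with a little arithmetic. The sketch below assumes a two-group comparison with equal group sizes, a known common standard deviation, and a normal (z) approximation. The function names and the chosen difference of 0.05 standard deviations are hypothetical, picked only for illustration.

```python
import math

def two_sided_p(diff, sd, n):
    """Two-sided p-value of a z-test for a difference between two group means.

    Assumes (for illustration only) equal group sizes n and a known common sd.
    """
    se = sd * math.sqrt(2.0 / n)          # standard error of the difference
    z = abs(diff) / se
    return math.erfc(z / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))

def conf_interval(diff, sd, n, z_crit=1.959964):
    """Approximate 95% confidence interval for the difference, same assumptions."""
    se = sd * math.sqrt(2.0 / n)
    return diff - z_crit * se, diff + z_crit * se

# A fixed, small difference (0.05 sd) examined at increasing sample sizes.
for n in (100, 10_000, 1_000_000):
    p = two_sided_p(0.05, 1.0, n)
    lo, hi = conf_interval(0.05, 1.0, n)
    print(f"n per group = {n:>9}: p = {p:.3g}, 95% CI = ({lo:.4f}, {hi:.4f})")
```

The difference is the same in every row; only the sample size changes the verdict of the test. The interval, by contrast, reports the size of the difference and how precisely it is known, which is the quantity most of the authors quoted here ask for.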

1964

Yates - "The most commonly occurring weakness ... is ... undue emphasis on tests of significance, and failure to recognise that in many types of experimental work estimates of treatment effects, together with estimates of the errors to which they are subject, are the quantities of primary interest"

Yates - "In many experiments ... it is known that the null hypothesis ... is certainly untrue"

1965

Sayn-Wittgenstein - "There is nothing wrong with the t-test; it has merely been used to give an answer that was never asked for. The Student t-test answers the question: 'Is there any real difference between the means of the measurement by the old and the new method, or could the apparent difference have arisen from random variation?' We already know that there is a real difference, so the question is pointless. The question we should have answered is: 'How big is the difference between the two sets of measurements, and how precisely have we determined it?'"

1966

Kempthorne - "a continuously distributed random variable ... one never actually observes such random variables, ... all observations are ... discrete"

1967

Bakan - "Little of what is contained in this paper is not already available in the literature"

Bakan - "the test of significance has been carrying too much of the burden of scientific inference. It may well be the case that wise and ingenious investigators can find their way to reasonable conclusions from data because and in spite of their procedures. Too often, however, even wise and ingenious investigators ... tend to credit the test of significance with properties it does not have"

Bakan - "a priori reasons for believing that the null hypothesis is generally false anyway. One of the common experiences of research workers is the very high frequency with which significant results are obtained with large samples"

Bakan - "there is really no good reason to expect the null hypothesis to be true in any population ... Why should any correlation coefficient be exactly .00 in the population? ... why should different drugs have exactly the same effect on any population parameter"

Bakan - "if the test of significance is really of such limited appropriateness ... we would be much better off if we were to attempt to estimate the magnitude of the parameters in the populations"

Bakan - "When we reach a point where our statistical procedures are substitutes instead of aids to thought, and we are led to absurdities, then we must return to common sense"

Bakan - "we need to get on with the business of generating ... hypotheses and proceed to do investigations and make inferences which bear on them, instead of ... testing the statistical null hypothesis in any number of contexts in which we have every reason to suppose that it is false in the first place"

LaForge - "Confidence regions ... for estimation of unknown parameters ... are appropriate for most scientific research and reporting"

Meehl - "in psychological and sociological investigations involving very large numbers of subjects, it is regularly found that almost all correlations or differences between means are statistically significant"

Meehl - "it is highly unlikely that any psychologically discriminable stimulation which we apply to an experimental subject would exert literally zero effect upon any aspect of his performance"

Meehl - "a fairly widespread tendency to report experimental findings with a liberal use of ad hoc explanations for those that didn't 'pan out' "

Meehl - "our eager-beaver researcher, undismayed by logic-of-science considerations and relying blissfully on the 'exactitude' of modern statistical hypothesis-testing, has produced a long publication list and been promoted to a full professorship. In terms of his contribution to the enduring body of psychological knowledge, he has done hardly anything. His true position is that of a potent-but-sterile intellectual rake, who leaves in his merry path a long train of ravished maidens but no viable scientific offspring"

Skipper Jr., Guenther and Nass - "The current obsession with .05 ... has the consequence of differentiating significant research findings and those best forgotten, published studies from unpublished ones, and renewal of grants from termination. It would not be difficult to document the joy experienced by a social scientist when his F ratio or t value yields significance at .05, nor his horror when the table reads 'only' .10 or .06. One comes to internalize the difference between .05 and .06 as 'right' vs. 'wrong,' 'creditable' vs. 'embarrassing,' 'success' vs. 'failure' "

Skipper Jr., Guenther and Nass - "blind adherence to the .05 level denies any consideration of alternative strategies, and it is a serious impediment to the interpretation of data"

1968

Lykken - "Unless one of the variables is wholly unreliable so that the values obtained are strictly random, it would be foolish to suppose that the correlation between any two variables is identically equal to 0.0000... (or that the effect of some treatment or the difference between two groups is exactly zero)"

Lykken - "the finding of statistical significance is perhaps the least important attribute of a good experiment"

Lykken - "The value of any research can be determined, not from the statistical results, but only by skilled, subjective evaluation of the coherence and reasonableness of the theory, the degree of experimental control employed, the sophistication of the measuring techniques, the scientific or practical importance of the phenomena studied"

Lykken - "Editors must be bold enough to take responsibility for deciding which studies are good and which are not, without resorting to letting the p value of the significance tests determine this decision"

O'Brien and Shapiro - "It is this distinction between statistical significance and practical importance that seems often to be overlooked by many researchers."

Stevens - "What does it mean? Can no one recognize a decisive result without a significance test? How much can the burgeoning of computation be blamed on fad? How often does inferential computation serve as a premature excuse for going to press? Whether the scholar has discovered something or not, he can sometimes subject his data to an analysis of variance, a t test, or some other device that will produce a so-called objective measure of 'significance.' The illusion of objectivity seems to preserve itself despite the admitted necessity for the investigator to make improbable assumptions, and to pluck off the top of his head a figure for the level of probability that he will consider significant."

Stevens - "The extreme stochastophobe is likely to ask: What scientific discoveries owe their existence to the techniques of statistical analysis or inference?"

Stevens - "The aspersions voiced by stochastophobes fall mainly on those scientists who seem, by the surfeit of their statistical chants, to turn data treatment into hierurgy. These are not the statisticians themselves, for they see statistics for what it is, a straightforward discipline designed to amplify the power of common sense in the discernment of order amid complexity."

1969

Morrison and Henkel - "In addition to important technical errors, fundamental errors in the philosophy of science are frequently involved in this indiscriminate use of the tests [of significance]"

Morrison and Henkel - "What we say is frankly polemical, though not original"

Morrison and Henkel - "we usually know in advance of testing that the null hypothesis is false"

Morrison and Henkel - "To say we want to be conservative, to guard against accepting more than 5 percent of our false alternative hypotheses as true ... is nonsense in scientific research"

Morrison and Henkel - "Researchers have long recognized the unfortunate connotations and consequences of the term 'significance,' and we propose it is time for a change"

Morrison and Henkel - "there is evidence that significance tests have been a genuine block to achieving ... knowledge"

1970

Morrison and Henkel - "we are convinced that the diversion of energy away from the rituals of significance testing in basic scientific research will be a worthy first step toward this goal [solving the problems of scientific inference] and will ... be one difference in behavioral science that is significant"

Morrison and Henkel - "scientists by and large adjust their beliefs about a hypothesis in informal ways on the basis of evidence, regardless of the formal decisions to reject or accept hypotheses made by individual researchers"

Morrison and Henkel - "significance testing in behavioral research is deeply implicated in our false search for empirical association, rather than a search for hypotheses that explain"

Morrison and Henkel - "many researchers ... will regard the abandonment of the tests a threat to the very foundations of empirical behavioural research. In fact, our experience (among sociologists) has been that many researchers accept all or most of our arguments on rational grounds, but keep using significance tests as before simply because use is a strong norm in the discipline"

1971

Nelder - "multiple comparison methods have no place at all in the interpretation of data"

Tversky and Kahneman - "the statistical power of many psychological studies is ridiculously low. This is a self-defeating practice: it makes for frustrated scientists and inefficient research. The investigator who tests a valid hypothesis but fails to obtain significant results cannot help but regard nature as untrustworthy or even hostile"

Tversky and Kahneman - "Significance levels are usually computed and reported, but power and confidence limits are not. Perhaps they should be."

Tversky and Kahneman - "The emphasis on significance levels tends to obscure a fundamental distinction between the size of an effect and its statistical significance."

1973

Hays - "There is surely nothing on earth that is completely independent of anything else. The strength of an association may approach zero, but it should seldom or never be exactly zero."

Tukey - "The twin assumptions of normality of distribution and homogeneity of variance are not ever exactly fulfilled in practice, and often they do not even hold to a good approximation."

1976

Box - "all models are wrong"

Box - "in nature there never was a normal distribution, there never was a straight line"

Box - "experiments where errors cannot be expected to be independent are very common"

Chew - "the research worker has been oversold on hypothesis testing. Just as no two peas in a pod are identical, no two treatment means will be exactly equal. ... It seems ridiculous ... to test a hypothesis that we a priori know is almost certain to be false"

Graybill - "when making inferences about parameters ... hypothesis tests should seldom be used if confidence intervals are available ... the confidence intervals could lead to opposite practical conclusions when a test suggests rejection of H0 ... even though H0 is not rejected, the confidence interval gives more useful information"

Kempthorne - "one will not ever have a random sample from a normal distribution"

Kempthorne - "no one, I think, really believes in the possibility of sharp null hypotheses -- that two means are absolutely equal in noisy sciences"

Pratt - "tests [of hypotheses] provide a poor model of most real problems, usually so poor that their objectivity is tangential and often too poor to be useful"

Pratt - "And when, as so often, the test is of a hypothesis known to be false ... the relevance of the conventional testing approach remains to be explicated"

Pratt - "This reduces the role of tests essentially to convention. Convention is useful in daily life, law, religion, and politics, but it impedes philosophy"

1977

Barndorff-Nielsen - "Most of the models considered in statistics are but rough approximations to reality"

Chew - "Testing the equality of 2 true treatment means is ridiculous. They will always be different, at least beyond the hundredth decimal place."

Cox - "Overemphasis on tests of significance at the expense especially of interval estimation has long been condemned"

Cox - "Admittedly all real measurements are discrete"

Cox - "there are considerable dangers in overemphasizing the role of significance tests in the interpretation of data"

Cox - "statistical significance is quite different from scientific significance and ... therefore estimation ... of the magnitude of effects is in general essential regardless of whether statistically significant departure from the null hypothesis is achieved"

Guttman - "lack of interaction in analysis of variance and ... lack of correlation in bivariate distributions--such nullities would be quite surprising phenomena in the usual interactive complexities of social life"

Guttman - "Estimation and approximation may be more fruitful than significance in developing science, never forgetting replication."

Guttman - "It [the normal distribution] is seldom, if ever, observed in nature."

1978

Carver - "Statistical significance testing has involved more fantasy than fact. The emphasis on statistical significance over scientific significance in educational research represents a corrupt form of the scientific method. Educational research would be better off if it stopped testing its results for statistical significance."

Carver - "Statistical significance ordinarily depends upon how many subjects are used in the research. The more subjects the researcher uses, the more likely the researcher will be to get statistically significant results."

Healy - "it is widely agreed among statisticians ... that significance testing is not the be-all and end-all of the subject"

Healy - "The commonest agricultural experiments ... are fertilizer and variety trials. In neither of these is there any question of the population treatment means being identical ... the objective is to measure how big the differences are"

Kruskal - "statistical significance of a sample bears no necessary relationship to possible subject-matter significance"

Kruskal - "it is easy to ... throw out an interesting baby with the nonsignificant bath water. Lack of statistical significance at a conventional level does not mean that no real effect is present; it means only that no real effect is clearly seen from the data. That is why it is of the highest importance to look at power and to compute confidence intervals"

Kruskal - "Another criticism of standard significance tests is that in most applications it is known beforehand that the null hypothesis cannot be exactly true"

Kruskal - "Because of the relative simplicity of its structure, significance testing has been overemphasized in some presentations of statistics, and as a result some students come mistakenly to feel that statistics is little else than significance testing"

Meehl - "I suggest to you that Sir Ronald has befuddled us, mesmerized us, and led us down the primrose path. I believe that the almost universal reliance on merely refuting the null hypothesis as the standard method for corroborating substantive theories in the soft areas is a terrible mistake, is basically unsound, poor scientific strategy, and one of the worst things that ever happened in the history of psychology."

Meehl - "probably all theories are false in the eyes of God"

Meehl - "as I believe is generally recognized by statisticians today and by thoughtful social scientists, the null hypothesis, taken literally, is always false"

Rothman - "The P value ... conveys no information about the extent to which two groups differ or two variables are associated. ... P values serve poorly as descriptive statistics".

Rothman - "By choosing a measure that quantifies the degree of association or effect in the data and then calculating a confidence interval, researchers can summarize the strength of association in their data and allow for random variation in a simple and unambiguous way."

1980

Chew - "... means are significantly different ... This is a very unfortunate choice of terminology, because the significant difference in the statistical sense is often taken, incorrectly, as being significant in the practical or economic sense"

Chew - "Experimenters are often unhappy if the decision from the analysis of variance is to accept H0. ... The correct interpretation in this case is that all true differences are 'small' and/or the number of replicates is insufficient"

Chew - "I have tried to steer them [agricultural researchers] away from testing H0. I maintain that on a priori physical, chemical and biological grounds, H0 is always false in all realistic experiments, and H0 will always be rejected given enough replication"

Chew - "As Confucius might have said, if the difference isn't different enough to make a difference, what's the difference?"

Kruskal - "the traditional table [analysis of variance table] with its terminology and seductive additivities has in fact often led to superficiality of analysis"

1981

Cox and Snell - "Models are always to some extent tentative"

Little - "The idea that one should proceed no further with an analysis, once a non-significant F-value for treatments is found, has led many experimenters to overlook important information in the interpretation of their data"

1982

Cox - "It is very bad practice to summarise an important investigation solely by a value of P".

Cox - "The criterion for publication should be the achievement of reasonable precision and not whether a significant effect has been found"

Preece - "over-emphasis on significance-testing continues"

Preece - "the norm should be that only a standard error is quoted for comparing means from an experiment"

Preece - "experimenters having difficulty in interpreting their results, after the results have been converted into an analysis of variance, must often be urged to think as if they had never heard of statistics; only then is fettered rote-thinking abandoned in favour of common-sense and intelligence"

1983

Box - "The resultant magnification of the importance of formal hypothesis tests has inadvertently led to underestimation by scientists of the area in which statistical methods can be of value and to a wide misunderstanding of their purpose"

Bryan-Jones and Finney - "Of central importance to clear presentation is the standard error of a mean"

Bryan-Jones and Finney - "In interpreting and in presenting experimental results there is no adequate substitute for thought - thought about the questions to be asked, thought about the nature and weight of evidence the data provide on these questions, and thought about how the story can be told with clarity and full honesty to a reader. Statistical techniques must be chosen and used to aid, but not to replace, relevant thought"

Bryan-Jones and Finney - "Our message is not new"

Good - "with general principles ... it is usually possible to find something in the past that to some extent foreshadows it"

Good - "A large enough sample will usually lead to the rejection of almost any null hypothesis ... Why bother to carry out a statistical experiment to test a null hypothesis if it is known in advance that the hypothesis cannot be exactly true"

1984

Jones - "There is a rising feeling among statisticians that hypothesis tests ... are not the most meaningful analyses"

Jones - "preoccupation with testing 'is there an interaction' in factorial experiments, ... emphasis should be on 'how strong is the interaction?' "

Jones - "The difference between 'statistically significant' and 'biologically significant' needs to be appreciated much more than it is now"

Jones - "Reporting of results in terms of confidence intervals instead of hypothesis tests should be strongly encouraged"

Preece - "Statistical 'recipes' are followed blindly, and ritual has taken over from scientific thinking"

Preece - "The ritualistic use of multiple-range tests--often when the null hypothesis is a priori untenable ...--is a disease"

1985

Altman - "Somehow there has developed a widespread belief that statistical analysis is legitimate only if it includes significance testing. This belief leads to, and is fostered by, numerous introductory statistics texts that are little more than catalogues of techniques for performing significance tests"

Chatfield - "differences are 'significant' ... nearly always ... in large samples"

Chatfield - "Within the last decade or so, practising statisticians have begun to question the relevance of some Statistics courses ... However ... Statistics teaching is still often dominated by formal mathematics"

Chatfield - "tests on outliers are less important than advice from 'people in the field' "

Chatfield - "significance tests ... are also widely overused and misused"

Chatfield - "an ANOVA will not tell us how a null hypothesis is rejected"

Chatfield - "Rather than ask if these differences are statistically significant, it seems more important to ask if they are of educational importance"

Chatfield - "All statistical techniques, however sophisticated, should be subordinate to subjective judgement"

Chatfield - "it has ... become impossible to get results published in some medical, psychological and biological journals without reporting significance values even when of doubtful validity"

Cormack - "Estimates and measures of variability are more valuable than hypothesis tests"

Guttman - "Since a point hypothesis is not to be expected in practice to be exactly true, but only approximate, a proper test of significance should almost always show significance for large enough samples. So the whole game of testing point hypotheses, power analysis notwithstanding, is but a mathematical game without empirical importance."

Nelder - "the grotesque emphasis on significance tests in statistics courses of all kinds ... is taught to people, who if they come away with no other notion, will remember that statistics is about tests for significant differences. ... The apparatus on which their statistics course has been constructed is often worse than irrelevant, it is misleading about what is important in examining data and making inferences"

1986

Chernoff - "Analysis of variance ... stems from a hypothesis-testing formulation that is difficult to take seriously and would be of limited value for making final conclusions."

Gardner and Altman - "In this approach [hypothesis testing] data are examined in relation to a statistical 'null' hypothesis, and the practice has led to the mistaken belief that studies should aim at obtaining 'statistical significance.' On the contrary, the purpose of most research investigations in medicine is to determine the magnitude of some factor(s) of interest."

Gardner and Altman - "there is a tendency to equate statistical significance with medical importance or biological relevance"

Gardner and Altman - "Confidence intervals ... should become the standard method for presenting the statistical results of major findings."

Jones and Matloff - "We recommend that authors display the estimate of the difference and the confidence limit for this difference"

Jones and Matloff - "at its worst, the results of statistical hypothesis testing can be seriously misleading, and at its best, it offers no informational advantage over its alternatives"

Jones and Matloff - "the ubiquitous problem of synonymizing statistical significance with biological significance"

Jones and Matloff - "all populations are different, a priori"

Jones and Matloff - "The only remedy ... is for journal editors to be keenly aware of the problems associated with hypothesis tests, and to be sympathetic, if not strongly encouraging, toward individuals who are taking the initial lead in phasing them out"

Lindley - "estimation procedures provide more information [than significance tests]: they tell one about reasonable alternatives and not just about the reasonableness of one value"

Perry - "significance tests have a limited role in biological experiments because 1) significance refers merely to plausibility, not to biological importance ... 2) theories may be proved to be strictly untrue but still of practical use ... 3) a null hypothesis is often known to be false before experimentation 4) the outcome of a test often depends merely on the size of the experiment ... the more replicates, the greater the chance of achieving significance; 5) in agricultural and ecological entomology, the really critical, single experiment is rare; 6) results may indicate merely that a hypothesis is rejected, but not give the magnitude of departures from the hypothesis ... 7) the exact nature of tests is often exaggerated and ignores the fact that all tests are based on assumptions that rarely hold in practice"

Perry - "A confidence interval certainly gives more information than the result of a significance test alone ... I ... recommend its use [standard error of each mean]"

Warren - "the word 'significant' could be abolished ... Based on a dictionary definition, one might expect that results that are declared significant would be important, meaningful, or consequential. Being 'significant at an arbitrary probability level,' ... ensures none of these"

Warren - "the researcher has the right to make inferences that may seem contrary to the objective analysis [statistical analysis], provided that is what he or she really believes and that the objective results have been given due consideration"

Warren - "I have seen authors declare that means were not different, but with less than a 50% chance of detecting a difference the magnitude of which would be important; if such a difference existed they would have been better off tossing a coin and not doing the experiment"
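
Warren's remark about tests with less than a 50% chance of detecting an important difference can be made concrete with a rough power calculation. The following is a minimal sketch, not a method taken from any of the quoted authors: it assumes a two-sided, two-sample z-test with a known, equal standard deviation in both groups, and the function name `two_sample_power` and all the numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_power(delta, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test (known, equal sd)."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Standard error of the difference between two group means.
    se = sd * sqrt(2 / n_per_group)
    # True difference expressed in standard-error units.
    shift = delta / se
    # Probability that the test statistic lands in either rejection region.
    return (1 - std_normal.cdf(z_crit - shift)) + std_normal.cdf(-z_crit - shift)


if __name__ == "__main__":
    # Hypothetical setting: a difference of half a standard deviation is
    # taken to be the smallest difference of practical importance.
    for n in (5, 10, 20, 50, 100):
        print(f"n per group = {n:3d}  power = {two_sample_power(0.5, 1.0, n):.2f}")
```

Under these assumptions, small experiments fall well short of 50% power, so a "not significant" result from such a design says little about whether an important difference exists.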

1987

Berger and Sellke - "even if testing of a point null hypothesis were disreputable, the reality is that people do it all the time ... and we should do our best to see that it is done well". Comment: On the contrary, if we assist others to perform disreputable tests then we ourselves also become disreputable.

Casella and Berger - "In a large majority of problems (especially location problems) hypothesis testing is inappropriate: Set up the confidence interval and be done with it!"

Hinkley - "for problems where the usual null hypothesis defines a special value for a parameter, surely it would be more informative to give a confidence range for that parameter"

Vardeman - "Competent scientists do not believe their own models or theories, but rather treat them as convenient fictions. ... The issue to a scientist is not whether a model is true, but rather whether there is another whose predictive power is enough better to justify movement from today's fiction to a new one"

Vardeman - "Too much of what all statisticians do ... is blatantly subjective for any of us to kid ourselves or the users of our technology into believing that we have operated 'impartially' in any true sense. ... We can do what seems to us most appropriate, but we can not be objective and would do well to avoid language that hints to the contrary"

1988

Finney - "rigid dependence upon significance tests in single experiments is to be deplored"

Finney - "The primary purpose of analysis of variance is to produce estimates of one or more error mean squares, and not (as is often believed) to provide significance tests"

Finney - "A null hypothesis that yields under two different treatments have identical expectations is scarcely very plausible, and its rejection by a significance test is more dependent upon the size of an experiment than upon its untruth"

Finney - "I have failed to find a single instance in which the Duncan test was helpful, and I doubt whether any of the alternative tests [multiple range significance tests] would please me better"

Finney - "Is it ever worth basing analysis and interpretation of an experiment on the inherently implausible null hypothesis that two (or more) recognizably distinct cultivars have identical yield capacities?"

Gauch - "the mere declaration that the interaction is or is not significant is far too coarse a result to give agronomists or plant breeders effective insight into their research material"

Luce - "I could only wish for every psychologist to read this chapter as an antidote to mindless hypothesis testing in lieu of doing good science: measuring effects, constructing substantive theories of some depth, and developing probability models and statistical procedures suited to these theories."

1989

Chatfield - "We all know ... that the misuse of statistics and an overemphasis on p values is endemic in many scientific journals"

Finney (a) - "The analysis of data ... requires assumptions ... The assumptions are never correct"

Finney (b) - "I confidently assert that yields of potatoes from plots of a well-conducted field experiments [sic] can be assumed independently and Normally distributed with constant variance; I do not believe this"

Finney (b) - "the Blind need frequent warnings and help in avoiding the multiple comparison test procedures that some editors demand but that to me appear completely devoid of practical utility"

Gigerenzer et al. - "In some fields, a strikingly narrow understanding of statistical significance made a significant result seem to be the ultimate purpose of research, and non-significance the sign of a badly conducted experiment - hence with almost no chance of publication."

Healy - "it is a travesty to describe a p value ... as 'simple, objective and easily interpreted' ... To use it as a measure of closeness between model and data is to invite confusion"

Kruskal and Majors - "We are also concerned about the use of statistical significance--P values--to measure importance; this is like the old confusion of substantive with statistical significance"

Moore and McCabe - "Some hesitation about the unthinking use of significance tests is a sign of statistical maturity"

Moore and McCabe - "It is usually wise to give a confidence interval for the parameter in which you are interested"

Moore and McCabe - "A null hypothesis that is ... false can become widely believed if repeated attempts to find evidence against it fail because of low power"

Moore and McCabe - "Other eminent statisticians have argued that if 'decision' is given a broad meaning, almost all problems of statistical inference can be posed as problems of making decisions in the presence of uncertainty"

Rosnow and Rosenthal - "A result that is statistically significant is not necessarily practically significant as judged by the magnitude of the effect."

1990

Cohen - "The null hypothesis ... is always false in the real world. ... If it is false, even to a tiny degree, it must be the case that a large enough sample will produce a significant result and lead to its rejection."

Cohen - "I believe ... that hypothesis testing has been greatly overemphasized in psychology and in other disciplines that use it."

Cohen - "The prevailing yes-no decision at the magic .05 level from a single research is a far cry from the use of informed judgment. Science simply doesn't work that way. A successful piece of research doesn't conclusively settle an issue, it just makes some theoretical proposition to some degree more likely. ... There is no ontological basis for dichotomous decision making in psychological inquiry."

Hahn - "hypothesis tests (irrelevant for most practical applications)"

Hunter - "How about 'alpha and beta risks' and 'testing the null hypothesis'? ... The very beginning language employed by the statistician describes phenomena in which engineers/physical scientists have little practical interest! They want to know how many, how much, and how well ... Required are interval estimates. We offer instead hypothesis tests and power curves"

Meehl - "All statistical tables should be required to include means and standard deviations, rather than merely a t, F or χ², or even worse only statistical significance."

Meehl - "Confidence intervals for parameters ought regularly to be provided."

Meehl - "Since the null hypothesis refutation racket is 'steady work' and has the merits of an automated research grinding device, scholars who are pardonably devoted to making more money and keeping their jobs ... are unlikely to contemplate with equanimity a criticism that says that their whole procedure is scientifically feckless and that they should quit doing it and do something else. ... that might ... mean that they should quit the academy and make an honest living selling shoes"

Preece - "I cannot see how anyone could now agree with this [Fisher's 1935 quote about experiments and null hypotheses]"

Street - "Fisher ... appears to have placed an undue emphasis on the significance test"

Street - "in many experiments it is well known ... that there are differences among the treatments. The point of the experiment is to estimate ... and provide ... standard errors. One of the consequences of this emphasis on significance tests is that some scientists ... have come to see a significant result as an end in itself"

1991

Matloff - "statistical significance is not the same as scientific significance"

Matloff - "the test is asking whether a certain condition holds exactly, and this exactness is almost never of scientific interest"

Matloff - With regard to a goodness-of-fit test to answer whether certain ratios have given exact values, "we know a priori this is not true; no model can completely capture all possible genetical mechanisms"

Matloff - "the number of stars by itself is relevant only to the question of whether H0 is exactly true--a question which is almost always not of interest to us, especially because we usually know a priori that H0 cannot be exactly true."

Matloff - "problems stemming from the fact that hypothesis tests do not address questions of scientific interest"

Matloff - "the 'star system' includes neither an E part [estimate] nor an A part [accuracy] and thus excludes vital information ... There is no such danger in basing our analysis on CIs [confidence intervals]"

Matloff - "no population has an exact normal distribution, nor are variances exactly homogeneous, and independence assumptions are often violated to at least some degree"

Tukey - "Statisticians classically asked the wrong question--and were willing to answer with a lie, one that was often a downright lie. They asked 'Are the effects of A and B different?' and they were willing to answer 'no.' All we know about the world teaches us that the effects of A and B are always different--in some decimal place--for any A and B. Thus asking 'Are the effects different?' is foolish."

Tukey - "Empirical knowledge is always fuzzy! And theoretical knowledge, like all the laws of physics, as of today's date, is always wrong--in detail, though possibly providing some very good approximations indeed."

1992

Pearce - "In a biological context interactions are common, so it is better to play safe and regard any appreciable interaction as real whether it is significant or not"

Upton - "The experimenter must keep in mind that significance at the 5% level will only coincide with practical significance by chance!"

1993

Wang - "Testing of statistical hypotheses ... are often irrelevant, wrong-headed, or both"

Wang - "the tyranny of the N-P [Neyman-Pearson] theory in many branches of empirical science is detrimental, not advantageous, to the course of science"

1994

Boardman - "He [W. E. Deming] went on to suggest that the problem lay in teaching 'what is wrong.' The list of evils taught in courses on statistics ... is a long one. One of the topics included hypothesis testing. Personally I have found few, if any, occasions where such tests are appropriate."

Cohen - "I make no pretense of the originality of my remarks in this article."

Cohen - "I argue herein that NHST [null hypothesis significance testing] has not only failed to support the advance of psychology as a science but also has seriously impeded it."

Cohen - "my ... recommendation is that ... we routinely report effect sizes in the form of confidence limits."

Cohen - "they [confidence limits] are rarely to be found in the literature. I suspect that the main reason they are not reported is that they are so embarrassingly large!"

Inman - "Like many working scientists since, Buchanan-Wollaston professed a belief that commonly used statistical tests were either obvious or irrelevant to the scientific problem of interest"

1995

McCloskey - "scientists care about whether a result is statistically significant, but they should care much more about whether it is meaningful"

McCloskey - "the scale for measuring ... effects ... or ... changes ... is not so clear: you may get statistically impeccable answers that make little difference to anyone or 'insignificant' ones that are absolutely crucial"

1996

Ranstam - "A common misconception is that an effect exists only if it is statistically significant and that it does not exist if it is not [statistically significant]"

Ranstam - "When using confidence intervals, clinical rather than statistical significance is emphasized. Moreover, confidence intervals, by their width, disclose the statistical precision of the results."

Tamhane - "The point of departure in ranking-and-selection methodology is the recognition that the treatments being compared are in fact different, and a sufficiently large sample size will demonstrate this fact with any preassigned confidence level. Therefore, it is futile to test the null hypothesis of homogeneity."
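
A recurring claim in these quotations (Guttman, Cohen, Tukey, Tamhane) is that any nonzero difference becomes "significant" once the sample is large enough, whereas an interval estimate keeps the size of the difference in view. The sketch below illustrates that arithmetic under stated assumptions: a two-sample z-test with known, equal standard deviation, and an observed difference held fixed at a hypothetical 0.02 standard deviations. The function name `z_test_and_interval` is illustrative only.

```python
from math import sqrt
from statistics import NormalDist


def z_test_and_interval(diff, sd, n_per_group, level=0.95):
    """Two-sided p-value for H0: difference = 0, plus a confidence interval."""
    std_normal = NormalDist()
    # Standard error of the difference between two group means.
    se = sd * sqrt(2 / n_per_group)
    z = diff / se
    p_value = 2 * (1 - std_normal.cdf(abs(z)))
    # Half-width of the interval at the requested confidence level.
    half_width = std_normal.inv_cdf(0.5 + level / 2) * se
    return p_value, (diff - half_width, diff + half_width)


if __name__ == "__main__":
    # Hypothetical: the observed difference stays tiny while n grows.
    for n in (100, 10_000, 1_000_000):
        p, (lo, hi) = z_test_and_interval(0.02, 1.0, n)
        print(f"n = {n:>9,d}  p = {p:.4f}  95% CI = ({lo:.4f}, {hi:.4f})")
```

In this setup the p-value moves from clearly non-significant to vanishingly small purely because n increases, while the interval narrows around the same small difference, leaving the reader to judge whether a difference of that size matters.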

URL: http://www.npwrc.usgs.gov/resource/methods/hypotest/index.htm