Read over the chapter on effect sizes (don’t worry about formulas or the methods of how to calculate effect sizes) and the Cohen article (1994). Describe the importance of effect sizes in the field of psychology.
The Earth Is Round (p < .05)
Jacob Cohen
After 4 decades of severe criticism, the ritual of null hypothesis significance testing (mechanical dichotomous decisions around a sacred .05 criterion) still persists. This article reviews the problems with this practice, including its near-universal misinterpretation of p as the probability that H0 is false, the misinterpretation that its complement is the probability of successful replication, and the mistaken assumption that if one rejects H0 one thereby affirms the theory that led to the test. Exploratory data analysis and the use of graphic methods, a steady improvement in and a movement toward standardization in measurement, an emphasis on estimating effect sizes using confidence intervals, and the informed use of available statistical methods are suggested. For generalization, psychologists must finally rely, as has been done in all the older sciences, on replication.
I make no pretense of the originality of my remarks in this article. One of the few things we, as psychologists, have learned from over a century of scientific study is that at age three score and 10, originality is not to be expected. David Bakan said back in 1966 that his claim that "a great deal of mischief has been associated" with the test of significance "is hardly original," that it is "what 'everybody knows,'" and that "to say it 'out loud' is . . . to assume the role of the child who pointed out that the emperor was really outfitted in his underwear" (p. 423). If it was hardly original in 1966, it can hardly be original now. Yet this naked emperor has been shamelessly running around for a long time.
Like many men my age, I mostly grouse. My harangue today is on testing for statistical significance, about which Bill Rozeboom (1960) wrote 33 years ago, "The statistical folkways of a more primitive past continue to dominate the local scene" (p. 417).
And today, they continue to continue. And we, as teachers, consultants, authors, and otherwise perpetrators of quantitative methods, are responsible for the ritualization of null hypothesis significance testing (NHST; I resisted the temptation to call it statistical hypothesis inference testing) to the point of meaninglessness and beyond. I argue herein that NHST has not only failed to support the advance of psychology as a science but also has seriously impeded it.
Consider the following: A colleague approaches me with a statistical problem. He believes that a generally rare disease does not exist at all in a given population, hence H0: P = 0. He draws a more or less random sample of 30 cases from this population and finds that one of the cases has the disease, hence Ps = 1/30 = .033. He is not sure how to test H0, chi-square with Yates's (1951) correction or the Fisher exact test, and wonders whether he has enough power. Would you believe it? And would you believe that if he tried to publish this result without a significance test, one or more reviewers might complain? It could happen.
December 1994 • American Psychologist. Copyright 1994 by the American Psychological Association, Inc. 0003-066X/94/$2.00. Vol. 49, No. 12, 997-1003
Almost a quarter of a century ago, a couple of sociologists, D. E. Morrison and R. E. Henkel (1970), edited a book entitled The Significance Test Controversy. Among the contributors were Bill Rozeboom (1960), Paul Meehl (1967), David Bakan (1966), and David Lykken (1968). Without exception, they damned NHST. For example, Meehl described NHST as "a potent but sterile intellectual rake who leaves in his merry path a long train of ravished maidens but no viable scientific offspring" (p. 265). They were, however, by no means the first to do so. Joseph Berkson attacked NHST in 1938, even before it sank its deep roots in psychology. Lancelot Hogben's book-length critique appeared in 1957. When I read it then, I was appalled by its rank apostasy. I was at that time well trained in the current Fisherian dogma and had not yet heard of Neyman-Pearson (try to find a reference to them in the statistics texts of that day: McNemar, Edwards, Guilford, Walker). Indeed, I had already had some dizzying success as a purveyor of plain and fancy NHST to my fellow clinicians in the Veterans Administration.
What's wrong with NHST? Well, among many other things, it does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it does! What we want to know is "Given these data, what is the probability that H0 is true?" But as most of us know, what it tells us is "Given that H0 is true, what is the probability of these (or more extreme) data?" These are not the same, as has been pointed out many times over the years by the contributors to the Morrison-Henkel (1970) book, among others, and, more recently and emphatically, by Meehl (1978, 1986, 1990a, 1990b), Gigerenzer (1993), Falk and Greenbaum (in press), and yours truly (Cohen, 1990).
J. Bruce Overmier served as action editor for this article.
This article was originally an address given for the Saul B. Sells Memorial Lifetime Achievement Award, Society of Multivariate Experimental Psychology, San Pedro, California, October 29, 1993.
I have made good use of the comments made on a preliminary draft of this article by Patricia Cohen and other colleagues: Robert P. Abelson, David Bakan, Michael Borenstein, Robyn M. Dawes, Ruma Falk, Gerd Gigerenzer, Charles Greenbaum, Raymond A. Katzell, Donald F. Klein, Robert S. Lee, Paul E. Meehl, Stanley A. Mulaik, Robert Rosenthal, William W. Rozeboom, Elia Sinaiko, Judith D. Singer, and Bruce Thompson. I also acknowledge the help I received from reviewers David Lykken, Matt McGue, and Paul Slovic.
Correspondence concerning this article should be addressed to Jacob Cohen, Department of Psychology, New York University, 6 Washington Place, 5th Floor, New York, NY 10003.
The Permanent Illusion
One problem arises from a misapplication of deductive syllogistic reasoning. Falk and Greenbaum (in press) called this the "illusion of probabilistic proof by contradiction" or the "illusion of attaining improbability." Gigerenzer (1993) called it the "permanent illusion" and the "Bayesian Id's wishful thinking," part of the "hybrid logic" of contemporary statistical inference, a mishmash of Fisher and Neyman-Pearson, with invalid Bayesian interpretation. It is the widespread belief that the level of significance at which H0 is rejected, say .05, is the probability that it is correct or, at the very least, that it is of low probability.
The following is almost but not quite the reasoning of null hypothesis rejection:
If the null hypothesis is correct, then this datum (D) cannot occur.
It has, however, occurred.
Therefore, the null hypothesis is false.
If this were the reasoning of H0 testing, then it would be formally correct. It would be what Aristotle called the modus tollens, denying the antecedent by denying the consequent. But this is not the reasoning of NHST. Instead, it makes this reasoning probabilistic, as follows:
If the null hypothesis is correct, then these data are highly unlikely.
These data have occurred.
Therefore, the null hypothesis is highly unlikely.
By making it probabilistic, it becomes invalid. Why? Well, consider this:
The following syllogism is sensible and also the formally correct modus tollens:
If a person is a Martian, then he is not a member of Congress.
This person is a member of Congress.
Therefore, he is not a Martian.
Sounds reasonable, no? This next syllogism is not sensible because the major premise is wrong, but the reasoning is as before and still a formally correct modus tollens:
If a person is an American, then he is not a member of Congress. (WRONG!)
This person is a member of Congress.
Therefore, he is not an American.
If the major premise is made sensible by making it probabilistic, not absolute, the syllogism becomes formally incorrect and leads to a conclusion that is not sensible:
If a person is an American, then he is probably not a member of Congress. (TRUE, RIGHT?)
This person is a member of Congress.
Therefore, he is probably not an American. (Pollard & Richardson, 1987)
This is formally exactly the same as
If H0 is true, then this result (statistical significance) would probably not occur.
This result has occurred.
Then H0 is probably not true and therefore formally invalid.
This formulation appears at least implicitly in article after article in psychological journals and explicitly in some statistics textbooks: "the illusion of attaining improbability."
Why P(D | H0) ≠ P(H0 | D)
When one tests H0, one is finding the probability that the data (D) could have arisen if H0 were true, P(D | H0). If that probability is small, then it can be concluded that if H0 is true, then D is unlikely. Now, what really is at issue, what is always the real issue, is the probability that H0 is true, given the data, P(H0 | D), the inverse probability. When one rejects H0, one wants to conclude that H0 is unlikely, say, p < .01. The very reason the statistical test is done is to be able to reject H0 because of its unlikelihood! But that is the posterior probability, available only through Bayes's theorem, for which one needs to know P(H0), the probability of the null hypothesis before the experiment, the "prior" probability.
Now, one does not normally know the probability of H0. Bayesian statisticians cope with this problem by positing a prior probability or distribution of probabilities. But an example from psychiatric diagnosis in which one knows P(H0) is illuminating:
The incidence of schizophrenia in adults is about 2%. A proposed screening test is estimated to have at least 95% accuracy in making the positive diagnosis (sensitivity) and about 97% accuracy in declaring normality (specificity). Formally stated, P(normal | H0) ≈ .97, P(schizophrenia | H1) > .95. So, let
H0 = The case is normal, so that
H1 = The case is schizophrenic, and
D = The test result (the data) is positive for schizophrenia.
With a positive test for schizophrenia at hand, given the more than .95 assumed accuracy of the test, P(D | H0), the probability of a positive test given that the case is normal, is less than .05, that is, significant at p < .05. One would reject the hypothesis that the case is normal and conclude that the case has schizophrenia, as it happens mistakenly, but within the .05 alpha error. But that's not the point.
The probability of the case being normal, P(H0), given a positive test (D), that is, P(H0 | D), is not what has just been discovered however much it sounds like it and however much it is wished to be. It is not true that the probability that the case is normal is less than .05, nor is it even unlikely that it is a normal case. By a Bayesian maneuver, this inverse probability, the probability that the case is normal, given a positive test for schizophrenia, is about .60! The arithmetic follows:
\[
P(H_0 \mid D) = \frac{P(H_0)\,P(\text{test wrong} \mid H_0)}{P(H_0)\,P(\text{test wrong} \mid H_0) + P(H_1)\,P(\text{test correct} \mid H_1)}
= \frac{(.98)(.03)}{(.98)(.03) + (.02)(.95)} = \frac{.0294}{.0294 + .0190} = .607
\]
The situation may be made clearer by expressing it approximately as a 2 × 2 table for 1,000 cases.

                             The case actually is
Result                       Normal    Schiz    Total
Negative test (Normal)          949        1      950
Positive test (Schiz)            30       20       50
Total                           979       21    1,000
As the table shows, the conditional probability of a normal case for those testing as schizophrenic is not small—of the 50 cases testing as schizophrenics, 30 are false positives, actually normal, 60% of them!
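The arithmetic of this example can be checked in a few lines of Python. This is only a sketch: the function name `posterior_null` and its parameter names are illustrative and do not come from the article; the inputs (base rate .02, sensitivity .95, specificity .97) are the figures given in the text.

```python
def posterior_null(base_rate, sensitivity, specificity):
    """P(H0 | D): probability that the case is normal given a positive test."""
    p_h0 = 1.0 - base_rate             # P(H0): the case is normal
    p_h1 = base_rate                   # P(H1): the case is schizophrenic
    p_d_given_h0 = 1.0 - specificity   # P(D | H0): false-positive rate
    p_d_given_h1 = sensitivity         # P(D | H1): true-positive rate
    numerator = p_h0 * p_d_given_h0
    return numerator / (numerator + p_h1 * p_d_given_h1)


p = posterior_null(base_rate=0.02, sensitivity=0.95, specificity=0.97)
print(round(p, 3))  # 0.607

# The rounded 2 x 2 table gives the same picture: 30 of 50 positives are normal.
table_share = 30 / (30 + 20)
print(table_share)  # 0.6
```

Rerunning the function with a much higher base rate (say .50) shows how strongly the posterior depends on the prior: the same test then leaves only a small probability that a positive case is normal.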
This extreme result occurs because of the low base rate for schizophrenia, but it demonstrates how wrong one can be by considering the p value from a typical significance test as bearing on the truth of the null hypothesis for a set of data.
It should not be inferred from this example that all null hypothesis testing requires a Bayesian prior. There is a form of H0 testing that has been used in astronomy and physics for centuries, what Meehl (1967) called the "strong" form, as advocated by Karl Popper (1959). Popper proposed that a scientific theory be tested by attempts to falsify it. In null hypothesis testing terms, one takes a central prediction of the theory, say, a point value of some crucial variable, sets it up as the H0, and challenges the theory by attempting to reject it. This is certainly a valid procedure, potentially even more useful when used in confidence interval form. What I and my ilk decry is the "weak" form in which theories are "confirmed" by rejecting null hypotheses.
The inverse probability error in interpreting H0 is not reserved for the great unwashed, but appears many times in statistical textbooks (although frequently together with the correct interpretation, whose authors apparently think they are interchangeable). Among the distinguished authors making this error are Guilford, Nunnally, Anastasi, Ferguson, and Lindquist. Many examples of this error are given by Robyn Dawes (1988, pp. 70-75); Falk and Greenbaum (in press); Gigerenzer (1993, pp. 316-329), who also nailed R. A. Fisher (who emphatically rejected Bayesian theory of inverse probability but slipped into invalid Bayesian interpretations of NHST; p. 318); and Oakes (1986, pp. 17-20), who also nailed me for this error (p. 20).
The illusion of attaining improbability or the Bayesian Id's wishful thinking error in using NHST is very easy to make. It was made by 68 out of 70 academic psychologists studied by Oakes (1986, pp. 79-82). Oakes incidentally offered an explanation of the neglect of power analysis because of the near universality of this inverse probability error:
After all, why worry about the probability of obtaining data that will lead to the rejection of the null hypothesis if it is false when your analysis gives you the actual probability of the null hypothesis being false? (p. 83)
A problem that follows readily from the Bayesian Id's wishful thinking error is the belief that after a successful rejection of H0, it is highly probable that replications of the research will also result in H0 rejection. In their classic article "The Belief in the Law of Small Numbers," Tversky and Kahneman (1971) showed that because people intuitively regard data drawn randomly from a population as highly representative, most members of the audience at an American Psychological Association meeting and at a mathematical psychology conference believed that a study with a significant result would replicate with a significant result in a small sample (p. 105). Of Oakes's (1986) academic psychologists, 42 out of 70 believed that a t of 2.7, with df = 18 and p = .01, meant that if the experiment were repeated many times, a significant result would be obtained 99% of the time. Rosenthal (1993) said with regard to this replication fallacy that "Nothing could be further from the truth" (p. 542f) and pointed out that given the typical .50 level of power for medium effect sizes at which most behavioral scientists work (Cohen, 1962), the chance that all three of three replications yield significant results is only one in eight, and in five replications, the chance of as many as three of them being significant is only 50:50.
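Rosenthal's replication figures follow from the binomial distribution, as the short Python sketch below illustrates. It assumes that replications are independent and that each has exactly the .50 power quoted in the text; the helper name `prob_at_least` is illustrative, not from the article.

```python
from math import comb


def prob_at_least(k, n, power):
    """P(at least k of n independent replications reach significance)."""
    return sum(
        comb(n, j) * power**j * (1 - power) ** (n - j)
        for j in range(k, n + 1)
    )


all_three = prob_at_least(3, 3, 0.5)      # all 3 of 3 replications significant
three_of_five = prob_at_least(3, 5, 0.5)  # 3 or more of 5 significant
print(all_three, three_of_five)  # 0.125 0.5
```

Under these assumptions, the belief that a significant result will replicate 99% of the time is far off the mark.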
An error in elementary logic made frequently by NHST proponents and pointed out by its critics is the thoughtless, usually implicit, conclusion that if H0 is rejected, then the theory is established: If A then B; B therefore A. But even the valid form of the syllogism (if A then B; not B therefore not A) can be misinterpreted. Meehl (1990a, 1990b) pointed out that in addition to the theory that led to the test, there are usually several auxiliary theories or assumptions and ceteris paribus clauses and that it is the logical product of these that is counterpoised against H0. Thus, when H0 is rejected, it can be because of the falsity of any of the auxiliary theories about instrumentation or the nature of the psyche or of the ceteris paribus clauses, and not of the substantive theory that precipitated the research.
So even when used and interpreted "properly," with a significance criterion (almost always p < .05) set a priori (or more frequently understood), H0 has little to commend it in the testing of psychological theories in its usual reject-H0-confirm-the-theory form. The ritual dichotomous reject-accept decision, however objective and administratively convenient, is not the way any science is done. As Bill Rozeboom wrote in 1960, "The primary aim of a scientific experiment is not to precipitate decisions, but to make an appropriate adjustment in the degree to which one . . . believes the hypothesis . . . being tested" (p. 420).
The Nil Hypothesis
Thus far, I have been considering H0s in their most general sense, as propositions about the state of affairs in a population, more particularly, as some specified value of a population parameter. Thus, "the population mean difference is 4" may be an H0, as may be "the proportion of males in this population is .75" and "the correlation in this population is .20." But as almost universally used, the null in H0 is taken to mean nil, zero. For Fisher, the null hypothesis was the hypothesis to be nullified. As if things were not bad enough in the interpretation, or misinterpretation, of NHST in this general sense, things get downright ridiculous when H0 is to the effect that the effect size (ES) is 0: that the population mean difference is 0, that the correlation is 0, that the proportion of males is .50, that the raters' reliability is 0 (an H0 that can almost always be rejected, even with a small sample; Heaven help us!). Most of the criticism of NHST in the literature has been for this special case where its use may be valid only for true experiments involving randomization (e.g., controlled clinical trials) or when any departure from pure chance is meaningful (as in laboratory experiments on clairvoyance), but even in these cases, confidence intervals provide more information. I henceforth refer to the H0 that an ES = 0 as the "nil hypothesis."
My work in power analysis led me to realize that the nil hypothesis is always false. If I may unblushingly quote myself,
It can only be true in the bowels of a computer processor running a Monte Carlo study (and even then a stray electron may make it false). If it is false, even to a tiny degree, it must be the case that a large enough sample will produce a significant result and lead to its rejection. So if the null hypothesis is always false, what's the big deal about rejecting it? (p. 1308)
I wrote that in 1990. More recently I discovered that in 1938, Berkson wrote
It would be agreed by statisticians that a large sample is always better than a small sample. If, then, we know in advance the P that will result from an application of the Chi-square test to a large sample, there would seem to be no use in doing it on a smaller one. But since the result of the former test is known, it is no test at all. (p. 526f)
Tukey (1991) wrote that "It is foolish to ask 'Are the effects of A and B different?' They are always different, for some decimal place" (p. 100).
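The claim that any nonzero effect becomes significant with a large enough sample can be illustrated with a rough Python sketch. It rests on simplifying assumptions that are not in the article: a two-group comparison with known unit variance, a normal (z) approximation, and an observed difference exactly equal to a tiny true standardized difference of 0.01. The function name `two_sided_p` is hypothetical.

```python
from statistics import NormalDist


def two_sided_p(d, n_per_group):
    """Approximate two-sided p for a standardized mean difference d."""
    # The standard error of d is about sqrt(2 / n), so z = d * sqrt(n / 2).
    z = abs(d) * (n_per_group / 2) ** 0.5
    return 2 * (1 - NormalDist().cdf(z))


# Same tiny effect, ever larger samples: p shrinks toward zero.
for n in (100, 10_000, 100_000, 1_000_000):
    print(n, two_sided_p(0.01, n))
```

The effect size never changes in this loop; only the sample size does, which is why rejecting a nil hypothesis says so little by itself.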
The point is made piercingly by Thompson (1992):
Statistical significance testing can involve a tautological logic in which tired researchers, having collected data on hundreds of subjects, then, conduct a statistical test to evaluate whether there were a lot of subjects, which the researchers already know, because they collected the data and know they are tired. This tautology has created considerable damage as regards the cumulation of knowledge. (p. 436)
In an unpublished study, Meehl and Lykken cross-tabulated 15 items for a sample of 57,000 Minnesota high school students, including father's occupation, father's education, mother's education, number of siblings, sex, birth order, educational plans, family attitudes toward college, whether they liked school, college choice, occupational plan in 10 years, religious preference, leisure time activities, and high school organizations. All of the 105 chi-squares that these 15 items produced by the cross-tabulations were statistically significant, and 96% of them at p < .000001 (Meehl, 1990b).
One might say, "With 57,000 cases, relationships as small as a Cramér φ of .02-.03 will be significant at p < .000001, so what's the big deal?" Well, the big deal is that many of the relationships were much larger than .03. Enter the Meehl "crud factor," more genteelly called by Lykken "the ambient correlation noise." In soft psychology, "Everything is related to everything else." Meehl acknowledged (1990b) that neither he nor anyone else has accurate knowledge about the size of the crud factor in a given research domain, "but the notion that the correlation between arbitrarily paired trait variables will be, while not literally zero, of such minuscule size as to be of no importance, is surely wrong" (p. 212, italics in original).
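The link between a small φ and a tiny p value at N = 57,000 can be sketched as follows. The sketch assumes the simplest case, a 2 × 2 table (df = 1), where chi-square equals N times φ squared; the actual Meehl and Lykken tables may well have had more categories and therefore more degrees of freedom. The function name `chi_square_p_from_phi` is illustrative.

```python
from math import erfc, sqrt


def chi_square_p_from_phi(phi, n):
    """Chi-square (df = 1) and its p value implied by a phi coefficient."""
    chi2 = n * phi**2
    # With df = 1, chi-square is a squared standard normal deviate, so
    # P(X > chi2) = P(|Z| > sqrt(chi2)) = erfc(sqrt(chi2 / 2)).
    return chi2, erfc(sqrt(chi2 / 2))


for phi in (0.02, 0.03):
    chi2, p = chi_square_p_from_phi(phi, 57_000)
    print(phi, round(chi2, 1), p)
```

Under this df = 1 assumption, the .000001 threshold is crossed somewhere between φ = .02 and φ = .03, in line with the range quoted in the text.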
Meehl (1986) considered a typical review article on the evidence for some theory based on nil hypothesis testing that reports a 16:4 box score in favor of the theory. After taking into account the operation of the crud factor, the bias against reporting and publishing "negative" results (Rosenthal's, 1979, "file drawer" problem), and assuming power of .75, he estimated the likelihood ratio of the theory against the crud factor as 1:1. Then, assuming that the prior probability of theories in soft psychology is <.10, he concluded that the Bayesian posterior probability is also <.10 (p. 327f). So a 16:4 box score for a theory becomes, more realistically, a 9:1 odds ratio against it.
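Meehl's conclusion is the odds form of Bayes's theorem: posterior odds equal prior odds times the likelihood ratio. The Python sketch below uses a prior of exactly .10, the upper bound given in the text, as an assumption; the function name `posterior_probability` is illustrative.

```python
def posterior_probability(prior, likelihood_ratio):
    """Posterior probability from a prior probability and a likelihood ratio."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)


post = posterior_probability(prior=0.10, likelihood_ratio=1.0)
odds_against = (1 - post) / post
print(round(post, 2), round(odds_against, 1))  # 0.1 9.0
```

With a likelihood ratio of 1:1 the evidence leaves the prior untouched, which is why the 16:4 box score ends up as odds of about 9:1 against the theory.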
Meta-analysis, with its emphasis on effect sizes, is a bright spot in the contemporary scene. One of its major contributors and proponents, Frank Schmidt (1992), provided an interesting perspective on the consequences of current NHST-driven research in the behavioral sciences. He reminded researchers that, given the fact that the nil hypothesis is always false, the rate of Type I errors is 0%, not 5%, and that only Type II errors can be made, which run typically at about 50% (Cohen, 1962; Sedlmeier & Gigerenzer, 1989). He showed that typically, the sample effect size necessary for significance is notably larger than the actual population effect size and that the average of the statistically significant effect sizes is much larger than the actual effect size. The result is that people who do focus on effect sizes end up with a substantial positive bias in their effect size estimation. Furthermore, there is the irony that the "sophisticates" who use procedures to adjust their alpha error for multiple tests (using Bonferroni, Newman-Keuls, etc.) are adjusting for a nonexistent alpha error, thus reduce their power, and, if lucky enough to get a significant result, only end up grossly overestimating the population effect size!
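Schmidt's point about inflated effect sizes among significant results can be illustrated with a small simulation. The setup is an assumption made for illustration, not Schmidt's own analysis: a true standardized difference of 0.5, 32 cases per group (which gives power near .50), and observed effect sizes drawn from a normal approximation with standard error sqrt(2/n). The function name and parameters are hypothetical.

```python
import random
from statistics import mean


def simulate_significant_effects(true_d=0.5, n_per_group=32,
                                 studies=20_000, seed=0):
    """Return (estimated power, mean observed d among significant studies)."""
    rng = random.Random(seed)
    se = (2 / n_per_group) ** 0.5  # approximate standard error of d
    observed = [rng.gauss(true_d, se) for _ in range(studies)]
    # Keep only studies whose z statistic exceeds the two-sided .05 cutoff.
    significant = [d for d in observed if abs(d) / se > 1.96]
    return len(significant) / studies, mean(significant)


power, mean_sig_d = simulate_significant_effects()
print(f"power ~ {power:.2f}, mean significant d ~ {mean_sig_d:.2f}")
```

Because only the studies that happen to overshoot the cutoff are kept, the average of the significant effect sizes lands well above the true value of 0.5.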
Because NHST p values have become the coin of the realm in much of psychology, they have served to inhibit its development as a science. Go build a quantitative science with p values! All psychologists know that statistically significant does not mean plain-English significant, but if one reads the literature, one often discovers that a finding reported in the Results section studded with asterisks implicitly becomes in the Discussion section highly significant or very highly significant, important, big!
Even a correct interpretation of p values does not achieve very much, and has not for a long time. Tukey (1991) warned that if researchers fail to reject a nil hypothesis about the difference between A and B, all they can say is that the direction of the difference is "uncertain." If researchers reject the nil hypothesis then they can say they can be pretty sure of the direction, for example, "A is larger than B." But if all we, as psychologists, learn from a piece of research is that A is larger than B (p < .01), we have not learned very much. And this is typically all we learn. Confidence intervals are rarely to be seen in our publications. In another article (Tukey, 1969), he chided psychologists and other life and behavior scientists with the admonition "Amount, as well as direction is vital" and went on to say the following:
The physical scientists have learned much by storing up amounts, not just directions. If, for example, elasticity had been confined to "When you pull on it, it gets longer!," Hooke's law, the elastic limit, plasticity, and many other important topics could not have appeared (p. 86). . . . Measuring the right things on a communicable scale lets us stockpile information about amounts. Such information can be useful, whether or not the chosen scale is an interval scale. Before the second law of thermodynamics (and there were many decades of progress in physics and chemistry before it appeared) the scale of temperature was not, in any nontrivial sense, an interval scale. Yet these decades of progress would have been impossible had physicists and chemists refused either to record temperatures or to calculate with them. (p. 80)
In the same vein, Tukey (1969) complained about correlation coefficients, quoting his teacher, Charles Winsor, as saying that they are a dangerous symptom. Unlike regression coefficients, correlations are subject to vary with selection as researchers change populations. He attributed researchers' preference for correlations to their avoidance of thinking about the units with which they measure.
Given two perfectly meaningless variables, one is reminded of their meaninglessness when a regression coefficient is given, since one wonders how to interpret its value. . . . Being so uninterested in our variables that we do not care about their units can hardly be desirable. (p. 89)
The major problem with correlations applied to research data is that they cannot provide useful information on causal strength because they change with the degree of variability of the variables they relate. Causality operates on single instances, not on populations whose members vary. The effect of A on B for me can hardly depend on whether I'm in a group that varies greatly in A or another that does not vary at all. It is not an accident that causal modeling proceeds with regression and not correlation coefficients. In the same vein, I should note that standardized effect size measures, such as d and f developed in power analysis (Cohen, 1988) are, like correlations, also dependent on population variability of the dependent variable and are properly used only when that fact is kept in mind.
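The dependence of a correlation on variability, with the raw regression coefficient held fixed, can be shown with a short Python sketch. It assumes a simple linear model, y = slope × x + error, with error independent of x; the function name `implied_correlation` and the numeric values are illustrative only.

```python
def implied_correlation(slope, sd_x, sd_error):
    """Correlation implied by y = slope * x + error for a given spread of x."""
    sd_y = ((slope * sd_x) ** 2 + sd_error**2) ** 0.5
    return slope * sd_x / sd_y


# The causal effect (slope = 0.5) is identical in every group;
# only the variability of x differs.
for sd_x in (0.5, 1.0, 3.0):
    print(sd_x, round(implied_correlation(0.5, sd_x, 1.0), 3))
```

The slope is the same in all three groups, yet the correlation rises as the spread of x grows, which is the sense in which a correlation cannot serve as a measure of causal strength.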
To work constructively with "raw" regression coefficients and confidence intervals, psychologists have to start respecting the units they work with, or develop measurement units they can respect enough so that researchers in a given field or subfield can agree to use them. In this way, there can be hope that researchers' knowledge can be cumulative. There are few such in soft psychology. A beginning in this direction comes from meta-analysis, which, whatever else it may accomplish, has at least focused attention on effect sizes. But imagine how much more fruitful the typical meta-analysis would be if the research covered used the same measures for the constructs they studied. Researchers could get beyond using a mass of studies to demonstrate convincingly that "if you pull on it, it gets longer."
Recall my example of the highly significant correlation between height and intelligence in 14,000 school children that translated into a regression coefficient that meant that to raise a child's IQ from 100 to 130 would require giving enough growth hormone to raise his or her height by 14 feet (Cohen, 1990).
What to Do?
First, don't look for a magic alternative to NHST, some other objective mechanical ritual to replace it. It doesn't exist.
Second, even before we, as psychologists, seek to generalize from our data, we must seek to understand and improve them. A major breakthrough to the approach to data, emphasizing "detective work" rather than "sanctification" was heralded by John Tukey in his article "The Future of Data Analysis" (1962) and detailed in his seminal book Exploratory Data Analysis (EDA; 1977). EDA seeks not to vault to generalization to the population but by simple, flexible, informal, and largely graphic techniques aims for understanding the set of data in hand. Important contributions to graphic data analysis have since been made by Tufte (1983, 1990), Cleveland (1993; Cleveland & McGill, 1988), and others. An excellent chapter-length treatment by Wainer and Thissen (1981), recently updated (Wainer & Thissen, 1993), provides many useful references, and statistical program packages provide the necessary software (see, for an example, Lee Wilkinson's [1990] SYGRAPH, which is presently
