Summarize a recent research journal article from one of the American Psychological Association (APA) journals.
- Your summary needs to be between 800 and 1,000 words; no more and no less.
- You must also use 12-point Times New Roman font.
- Your work will automatically be submitted to Turnitin upon submission to determine how other authors' work was used in the assignment. Take notes in your own words while reading the selected article. Do not copy and paste directly from the selected article, because matches of 30% or more to other authors' works will result in an automatic zero (0) for the assignment.
- At the top of your work, you must include your name, the name of the article you selected, the name of the journal the article was taken from, the names of the article's authors, and your total word count. An example is below:
- Student name
- Name of article
- Name of journal that the article was in: PSYCHOLOGICAL METHODS
- Authors names
- Word count
- Your work should not include any direct quotations from the article you selected. Put everything in your own words and do not summarize the abstract section of the article.
- You should summarize a recent research journal article from one of the American Psychological Association (APA) journals listed in the table below.
There’s More Than One Way to Conduct a Replication Study: Beyond Statistical Significance
Samantha F. Anderson and Scott E. Maxwell, University of Notre Dame
As the field of psychology struggles to trust published findings, replication research has begun to become more of a priority to both scientists and journals. With this increasing emphasis placed on reproducibility, it is essential that replication studies be capable of advancing the field. However, we argue that many researchers have been only narrowly interpreting the meaning of replication, with studies being designed with a simple statistically significant or nonsignificant results framework in mind. Although this interpretation may be desirable in some cases, we develop a variety of additional “replication goals” that researchers could consider when planning studies. Even if researchers are aware of these goals, we show that they are rarely used in practice—as results are typically analyzed in a manner only appropriate to a simple significance test. We discuss each goal conceptually, explain appropriate analysis procedures, and provide 1 or more examples to illustrate these analyses in practice. We hope that these various goals will allow researchers to develop a more nuanced understanding of replication that can be flexible enough to answer the various questions that researchers might seek to understand.
Keywords: replication, data analysis, confidence interval, effect size, equivalence test
Replication, a once largely ignored premise, has recently become a defining precept for the future of psychology. Reproducibility has been referred to as the "cornerstone" (Simons, 2014, p. 76) and "Supreme Court" (Collins, 1985, p. 19) of science, and as "the best and possibly the only believable evidence for the reliability of an effect" (Simons, 2014, p. 76). In fact, "findings that do not replicate are worse than fairy tales" (Wagenmakers, Wetzels, Borsboom, van der Maas, & Kievit, 2012, p. 633).

The idea of replication is not new. Even prior to Sir Ronald Fisher and the advent of modern experimental design (circa 1935), the field of agriculture used replication to assess accuracy and reliability (Yates, 1964). In fact, Fisher himself emphasized the importance of replication, believing that experimental findings are only established if "a properly designed experiment rarely fails to give . . . significance" (Fisher, 1926, p. 504). In 1969, Tukey noted that "confirmation comes from repetition" and that ignoring the need for replication would "lend[s] itself to failure and more probably destruction" (Tukey, 1969, p. 84). However, replications were rarely conducted due to lack of incentive and rarely published due to lack of novelty (Nosek & Lakens, 2014). This lack of incentive gradually started to change when concerns about "the reliability of research findings in the field" began to emerge (Pashler & Wagenmakers, 2012, p. 528). The field has been amid "a crisis of confidence" (Pashler & Wagenmakers, 2012, p. 528) wherein published findings are regarded with a greater degree of skepticism in the wake of potentially too much flexibility in research practices (e.g., Simmons, Nelson, & Simonsohn, 2011).

Due to these growing concerns over the potential unreliability of reported results in psychology, researchers have begun to emphasize the importance of reproducing results and call for a greater focus on replication. Over the past decade, the number of articles focused on replication has grown steadily. A PsycINFO search for scholarly documents with replication or any of its derivatives in the title yields 82 articles in 2003, 121 articles in 2008, and 154 articles in 2013. Major journals have dedicated special sections and issues to the topic (e.g., Perspectives on Psychological Science, 2012; Social Psychology, 2014) in the hopes of creating incentives for researchers to engage in replication studies. The Center for Open Science (2012) has introduced a project aimed at assessing the bias present in the current psychological literature by inviting scientists to attempt to replicate findings from a sample of published findings from prominent journals in 2008.

Yet, despite increased appreciation for the role of replication and motivation to focus on replication, the current state of replication research remains seemingly incapable of truly advancing the field. The failure rate of replications is alarmingly high, as evidenced in a recent issue of Social Psychology. Out of 14 replication attempts arranged by Nosek and Lakens (2014), nine did not replicate the original study and another five were only partial replications, more nuanced manifestations of the effect (i.e., the effect only appeared in specific conditions), or had smaller effect sizes. This lack of replicability lends itself to questions regarding the reason that so many fail. In some cases, failure to replicate may be due to issues with the original study, including researcher degrees of freedom (Simmons et al., 2011). However, other replications may fail due to problems with the replication study itself. In addition to low power of the replication study, a number of other factors have limited the effectiveness of recent replications (Braver, Thoemmes, & Rosenthal, 2014).
Author note: This article was published Online First July 27, 2015. Samantha F. Anderson and Scott E. Maxwell, Department of Psychology, University of Notre Dame. Correspondence concerning this article should be addressed to Samantha F. Anderson, Department of Psychology, University of Notre Dame, 118 Haggar Hall, Notre Dame, IN 46556. E-mail: [email protected]
Psychological Methods, 2016, Vol. 21, No. 1, 1–12. © 2015 American Psychological Association. http://dx.doi.org/10.1037/met0000051
First, researchers have often displayed a rather narrow perspective on replication, with study goals often being to replicate a statistically significant effect. Less often, the intention seems to be to show that a presumed effect does not exist. We assert that there are a number of additional worthy goals that researchers have rarely considered in planning replication studies. Further, even when pursuing noble goals, the analyses used to achieve these goals are often inadequate and do not truly match the intended research question.

This article aims to offer readers an appreciation for a number of replication-related goals that may lead the field to a more nuanced understanding of the replicability of prior findings (see Table 1). Further, we intend to provide a conceptual and practical overview of recommended analysis strategies, each paired with illustrative examples.
General Considerations
We conducted a PsycINFO search for scholarly, peer-reviewed articles published in 2013 with replicat* in the title. Of the 154 results, we selected 50 to code. The other 104 studies were excluded based on the following properties: less relevant to general psychology (e.g., business journals, nursing journals), language other than English, qualitative-only results, replications of psychometric properties, and genome-wide association studies. This selection of 50 replications yields 44 that seem to decide the success of the replication based on a statistical test alone (see Footnote 1). These studies generally interpreted the p value as either in line with or divergent from the original study, based on whether both studies came to the same or different conclusions regarding statistical significance. Although this general strategy may at times be the most appropriate for the question at hand, we argue that authors may be considering replication in an overly narrow context. Along these lines, we invite authors to consider both the additional goals we outline and the analyses appropriate to those goals. The following section introduces six potential goals for replication. Later sections will be devoted to further developing those goals and associated analyses.
Replication of significance may indeed be a worthy goal to pursue, as replicating an effect in the same direction as the original study is often enlightening in its own right. This may be especially true if the original study resulted in unexpected or counterintuitive findings. For example, consider the seminal findings on thought suppression (Wegner, Schneider, Carter, & White, 1987). Surprisingly, participants who were instructed to suppress thoughts of a white bear were rather ineffective at doing so, thinking about the bear once per minute, on average. The second finding was even more surprising. When the same participants repeated the experiment with instructions to think about the white bear, they had thoughts of the bear significantly more often than a control group who never received the thought-suppression instructions. This rebound effect was an unexpected phenomenon, and one could argue that its existence and the directionality of the effect are more important than the size of the effect. Replications of results of this nature may simply need to reproduce a statistically significant effect in the intended direction in order to lend support to the original findings. Unexpected results may have a special role in theory testing, underscoring the importance of hypothesis testing as opposed to effect estimation when examining theoretical predictions (Morey, Rouder, Verhagen, & Wagenmakers, 2014).
However, significance-based replication (Goal 1) may not always be the most advantageous goal, and the overwhelming emphasis on this goal may be limiting the contribution that replication research can make to the field as a whole. Recent apparent nonreplications of controversial results, such as those of Bargh's subtle priming and Bem's ESP experiment, speak to the importance of detecting spurious findings (e.g., Doyen, Klein, Pichon, & Cleeremans, 2012; Galak, LeBoeuf, Nelson, & Simmons, 2012). Indeed, there were a number (16) of authors in the 2013 PsycINFO sample of replication studies that reported a nonreplication of original findings. Although one of these studies explicitly referenced nonreplication as the goal, others were vaguer in their intentions. It is thus somewhat unclear what the goal truly was in some of these cases, but what is clear is the fact that the often-reported claims of "null" results did not match the analyses performed. In fact, all of these studies conducted analyses capable only of evidencing a statistically significant effect, but not a null effect. It is well known that failure to reject the null hypothesis does not necessarily constitute evidence that the null hypothesis should be accepted, but our review of replication studies showed authors regularly making this mistake. Several studies published in the 2014 replication special issue of Social Psychology also fell victim to this mismatch between reported nonreplication and the inappropriate analyses used to support that conclusion.
Thus, whether or not authors are aware of the utility of intentions to show a null effect (nonreplication; Goal 2), they are most often using analysis strategies that fail to support the interpretation given for the results and fail to answer what may have been the real question of interest. There are two separate cautions here. First, authors who expect a statistically significant result (replication) are absolutely justified in conducting analyses in line with this goal. However, these authors must be careful in interpreting nonsignificant findings as direct evidence of a failure to replicate. More notably, authors who desire to evidence that there is no effect have often failed to utilize an analysis that can substantiate this goal. Again, though nonsignificant findings indicate a failure to reject the null hypothesis, many researchers claim that p values greater than .05 are evidence in favor of the null hypothesis or are a metric from which to determine the probability that the null is true. In fact, p values greater than .05 by themselves reveal little about the probability that the null hypothesis is true. We recommend that authors make use of equivalence tests (frequentist) or Bayesian methods, described in greater detail later, in order to adequately support the claim of a null replication effect. We note here that although Bayesian methods can be used for other hypothesis- and interval-based situations presented in this paper, we limit our presentation of Bayesian methods to Goal 2, as these methods are especially helpful for answering questions regarding the lack of an effect. Readers interested in Bayesian methods more generally may consult Kruschke (2014) and Gelman et al. (2013).
Footnote 1: This estimate of 44 studies is likely conservative. Three additional studies based their main analysis on surface-level comparisons (correlations, stepwise regression models, and sensitivity/specificity), without testing whether these parameters statistically differed. Thus, these studies did not decide success directly based on a statistically significant–nonsignificant distinction, but rather an even simpler visual inspection of the estimates in question.
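To make the Bayesian route to Goal 2 concrete, the sketch below computes a default Bayes factor for a two-group replication. It is a minimal illustration, not the authors' own procedure: it assumes the third-party pingouin library, and the t statistic and group sizes are hypothetical placeholders.

```python
# Minimal sketch: quantifying evidence FOR the null (Goal 2) with a
# default Bayes factor, rather than misreading a nonsignificant p value.
# Assumes the third-party pingouin library; the t statistic and group
# sizes below are hypothetical, not taken from any study in the article.
import pingouin as pg

t_rep = 0.41       # hypothetical t statistic from the replication study
n1, n2 = 120, 120  # hypothetical group sizes

bf10 = float(pg.bayesfactor_ttest(t_rep, nx=n1, ny=n2))  # evidence for H1 vs. H0
bf01 = 1.0 / bf10                                         # evidence for H0 vs. H1

print(f"BF01 = {bf01:.2f}")
# By common rules of thumb, BF01 > 3 is moderate evidence that the effect
# is absent; a claim that p > .05 alone can never support.
```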
In addition to showing the existence or nonexistence of a previously published finding, we believe that there are other potential goals of replication that have largely been overlooked in the recent push toward reproducibility. For example, researchers may have reason to question the size of the effect reported in the original study. Research has shown that published effect sizes are likely to be upwardly biased, which may motivate researchers to attempt to better estimate the true population effect size (Lane & Dunlap, 1978; Maxwell, 2004). Thus, it may be worthwhile to estimate the size of the effect of the original study, providing evidence that it is indeed as sizable as the original authors claimed (Biesanz & Schrager, 2010; Goal 3). This goal warrants the formation of a confidence interval around the replication effect size. Many researchers seem to be unaware of this goal, as only 23 studies from our PsycINFO sample provided an effect size, only four provided a confidence interval around it, and none discussed or interpreted these confidence intervals.

Another goal may be to replicate the original study by combining it with a new sample of participants in something akin to a small meta-analysis (Goal 4). Although two studies in the 2013 sample had access to the original study's raw data, this access is often not possible. Goal 4 allows comparison between a replication and the original study without this requirement. However, no studies in our sample followed this goal. Authors may also want to show that a replication effect is clearly inconsistent with the original study's effect through more than simply direction/significance alone, meriting a test of the difference in effect sizes (Goal 5). This goal is an extension of Goal 3, wherein the replication effect size must be significantly distinct from the original to support nonreplication. Authors who declare their study a nonreplication in response to finding a smaller effect size in their sample, without testing its disparity from the original effect size, seem to be unaware of this goal, or of the proper analyses to accomplish the goal (three studies from our PsycINFO sample). Only one study from our 2013 sample used an analysis in line with Goal 5. Conversely, it may also be enlightening to show that a replication is clearly consistent with the original study through an analysis such as an equivalence test of the difference in effect sizes (Goal 6; no studies from our PsycINFO sample). Similarly to Goal 5, authors who declare their study a replication in response to finding a similar effect size, without testing its equivalence to the original effect size, seem to be unaware of this goal and its associated analyses (three studies from our PsycINFO sample).
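As a concrete illustration of Goals 3 and 5, the sketch below computes Cohen's d with an approximate confidence interval, and then a confidence interval for the difference between a replication and an original effect size. It uses the standard large-sample approximation to the standard error of d (Hedges & Olkin, 1985); the replication summary statistics are hypothetical placeholders, and the original-study values echo the scope-severity example analyzed later in this article (Goal 1), assuming 30 participants per group.

```python
# Minimal sketch of Goals 3 and 5: a confidence interval around the
# replication effect size, and a confidence interval for the difference
# between the replication and original effect sizes. Uses the common
# large-sample approximation to the standard error of Cohen's d
# (Hedges & Olkin, 1985). Replication numbers are hypothetical; the
# original-study numbers echo the scope-severity example discussed
# later (Goal 1), assuming 30 participants per group.
import math

def cohens_d_ci(m1, m2, sd1, sd2, n1, n2, z=1.96):
    """Return (d, (lower, upper), se) from two-group summary statistics."""
    pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    d = (m1 - m2) / pooled_sd
    se = math.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
    return d, (d - z * se, d + z * se), se

# Goal 3: how precisely does the replication estimate the effect size?
d_rep, ci_rep, se_rep = cohens_d_ci(6.05, 5.75, 1.6, 1.5, 150, 150)
print(f"replication d = {d_rep:.2f}, 95% CI = ({ci_rep[0]:.2f}, {ci_rep[1]:.2f})")

# Goal 5: is the replication effect clearly inconsistent with the original?
d_orig, _, se_orig = cohens_d_ci(6.37, 5.51, 1.67, 1.33, 30, 30)
diff = d_rep - d_orig
se_diff = math.sqrt(se_rep**2 + se_orig**2)  # independent studies
lo, hi = diff - 1.96 * se_diff, diff + 1.96 * se_diff
print(f"difference in d = {diff:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
# A CI excluding 0 would support nonreplication (Goal 5); a CI falling
# entirely inside a pre-set equivalence region would support Goal 6.
```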
It is important to note some caveats regarding direct (exact) versus conceptual replications. While direct replications were once avoided for lack of originality, authors have recently urged the field to take note of the benefits and importance of direct replication. According to Simons (2014), this type of replication is "the only way to verify the reliability of an effect" (p. 76). With respect to this recent emphasis, the current article will assume direct replication. However, despite the push toward direct replication, some have still touted the benefits of conceptual replication (Stroebe & Strack, 2014). Importantly, many of the points and analyses suggested in this paper may translate well to conceptual replication. However, readers should be cautioned that there are exceptions to this, as in replication studies with multiple dependent variables. Further, the interpretation of results may not be as straightforward in conceptual replications, as nonsignificant or disparate findings could be due to a host of uncontrolled factors, such as differences in conditions, measurement tools, and participants.

Further, this article will mainly limit its focus to single replications of original studies. However, as others have noted, a single replication is usually insufficient to accept or refute a published effect with absolute confidence. We echo the importance of multiple replications, which can then lend themselves to a future meta-analysis, and we will broaden our discussion to these cases when possible (Hunter, 2001). Nevertheless, it is also important for researchers to be aware of the various questions that may be addressed with a single replication study, as well as knowing the most appropriate analytic methods for each type of question. We note that, just as in meta-analysis, access to the raw data is not required for any of our proposed methods.

Finally, the issue of sample size planning for replication studies is beyond the scope of this article. However, we emphasize that many, if not all, of the goals may require much larger sample sizes than are commonly seen in the literature. Replication research often suffers from low power due to the uncertainty and bias inherent in the sample effect sizes (from the original study) that inform the replication's planned sample size (Maxwell, Lau, & Howard, in press). Thus, even replication studies that claim to have power greater than .8 may have actual power that is much lower.
Table 1
Six Replication Goals and Descriptions

| No. | Goal | Recommended analysis | Success criterion |
|-----|------|----------------------|-------------------|
| 1 | To infer the existence of a replication effect | Repeat analysis of original study | p < .05 |
| 2 | To infer a null replication effect | Equivalence test | Confidence interval falls completely inside region of equivalence |
| 3 | To precisely estimate the replication effect size | AIPE; construct confidence interval for effect size | Effect size estimated with desired level of precision |
| 4 | To combine replication sample data with original results | Construct confidence interval for the average effect size of replication and original studies | Building on prior knowledge; more precise estimate of the effect of interest |
| 5 | To assess whether replication is clearly inconsistent with original | Construct confidence interval for the difference in effect sizes | Confidence interval for difference in effect sizes does not include 0 |
| 6 | To assess whether replication is clearly consistent with original | Equivalence test, using confidence interval for the difference in effect sizes | Confidence interval for difference in effect sizes falls completely inside region of equivalence |
We urge researchers to attend to the nuances of their proposed analyses in making sample size decisions. We recommend Taylor and Muller (1996) for a power analysis method that handles both publication bias and the distribution inherent in sample effect size estimates. Finally, it is important to note that the sample size of the original study plays an important role anytime the goal involves either comparing or combining the results of the original study and the replication study.
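To illustrate the power problem described above, the sketch below contrasts the power a replication claims (computed from the published effect size) with its actual power if the true effect is smaller due to publication bias. It assumes the statsmodels library; d = 0.57 is borrowed from the scope-severity example used later in this article, and the smaller "true" d is a hypothetical bias-corrected value.

```python
# Minimal sketch: why replications planned from a published (and likely
# inflated) effect size can be badly underpowered. Assumes statsmodels;
# d = 0.57 mirrors the scope-severity example used later in the article,
# and the smaller "true" d is a hypothetical bias-corrected value.
from statsmodels.stats.power import TTestIndPower

power_calc = TTestIndPower()

d_published = 0.57  # effect size reported by the original study
d_true = 0.35       # hypothetical smaller true effect

# Per-group n that yields a *claimed* power of .80 at the published d
n_per_group = power_calc.solve_power(effect_size=d_published, power=0.80, alpha=0.05)
print(f"planned n per group: {n_per_group:.0f}")

# Actual power at that n if the true effect is smaller
actual = power_calc.power(effect_size=d_true, nobs1=round(n_per_group), alpha=0.05)
print(f"actual power if true d = {d_true}: {actual:.2f}")  # well below .80
```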
We will discuss each of the aforementioned potential goals in more detail later in this article. We emphasize that these goals are not mutually exclusive and often may be combined when appropriate based on the questions at hand. We caution that goals should be decided a priori, before conducting analyses. Performing replication studies with attention to a wider variety of definitions that could constitute replication may provide more illuminating answers as to the validity of purported effects in the literature.
Goal 1: To Infer the Existence (and Direction) of an Effect
As described previously, the goal of replicating the statistical significance of an effect seems to be the most common purpose described in recent replication studies. This is not surprising, given that psychologists often have "an exaggerated belief in the likelihood of successfully replicating an obtained finding" (Tversky & Kahneman, 1971, p. 105). If reproducibility is indeed the gold standard of science, it makes sense to attempt to replicate the statistical significance (and in many cases the directionality) of previously reported effects. After selecting an appropriate sample size, the statistical methods chosen should typically mirror those in the original study. This may be a regression, an ANOVA, or something more complex. We provide a two-group example below, although we acknowledge that this is only representative of some replications of interest to researchers. Of course, there are many ways an original study could have been performed, each with its own standard analysis. Researchers should be sensitive to the context of the original study in planning the most appropriate way to conduct the replication analyses.

Suppose a researcher is interested in replicating the scope-severity paradox. The original study on the topic found a surprising series of results that stood in contrast to the common-sense view of the time. Specifically, participants randomly assigned to conditions judged equivalent crimes less severely when more people had been victimized by the crime and recommended more punishment for crimes of equal magnitude when fewer people were victimized (Nordgren & McDonnell, 2011). For simplicity, suppose the researcher's replication will focus only on the perceived severity of the crime, where in the original study, small-scope vignettes were judged with more severity (M = 6.37, SD = 1.67) than large-scope vignettes (M = 5.51, SD = 1.33), F(1, 59) = 4.88, p = .03 (see Footnote 2). The corresponding original sample effect size was d = 0.57 (approximately Cohen's medium effect). Notice that the original study found a significant effect of the scope of the crime on perceived severity. The researcher may first want to replicate the statistical significance of the effect, without attention to its size. In this case, it is important to the researcher to say that participants indeed perceived crimes to be more severe when fewer people were affected by them, but not whether that effect is large enough to be of clinical or practical importance. In other words, the fact that such a surprising effect exists is noteworthy, while its size is less vital to the theory. Consequently, analyses should proceed as in the original study. In this case, the researcher would perform an independent-samples t test on two groups randomly assigned to scope conditions. A p value of less than .05 would indicate that the replication attempt was successful, while a larger p value would indicate that varying the scope of a crime did not have a statistically significant effect on perceived severity, though not that the influence of scope was essentially zero.

We argue that in these cases, it is often not only the statistical significance of the effect, but also the directionality inherent in the statistically significant finding that is important to convey. For a successful replication of a two-group study, the replication effect must not only have a p value less than .05, but also reproduce the direction of the mean difference found in the original study. In the example above, a replication finding that participants judge crimes more severely when they victimize more people would likely be considered unsuccessful, even if the mean difference was statistically significant in both cases. We acknowledge, however, that there may be a few situations where even direction does not matter. Although many studies involving three or more groups eventually involve analyzing contrasts, where direction is of interest to the theory, some theories may simply contrast any difference between means with no difference between means. For example, a seminal study found that infants preferred to look at faces over scrambled faces and blank screens (Goren, Sarty, & Wu, 1975). We argue that in this case, however, a replication of any visual preference may still be considered successful by some, if the contrasted theories are thought to be no visual preference versus any visual preference (indicating that the infant visual system is more developed than had been previously thought).
Goal 2: To Infer a Null Effect
In 2011, an uproar ensued over a controversial study published in the Journal of Personality and Social Psychology (JPSP; Bem, 2011). The article, through nine experiments, claimed that undergraduates successfully displayed retroactive influence of future events on current responses, an indication of the existence of psi, with a mean sample effect size of d = 0.22. Skeptics attempted to reproduce Bem's findings and failed multiple times. But what truly constitutes a nonreplication? A study using methods akin to the original, but failing to produce a statistically significant result, may be viewed by many as a failure to replicate. In fact, a highly publicized replication attempt of Bem's study made essentially these conclusions based on nonsignificant results (Ritchie, Wiseman, & French, 2012). Although other replications went on to apply more appropriate analyses as evidence for nonreplication, these and other similar conclusions are a sign of a general lack of understanding of what nonsignificant findings actually reveal.

If one is skeptical of an original study's results, a goal may be to infer a null effect. In this case, it is necessary to show evidence in favor of the null hypothesis, rather than simply a failure to reject the null.
Footnote 2: The authors report 1 and 59 degrees of freedom for a one-way ANOVA with 60 participants. If the description is accurate, the correct df would be 1 and 58.
As discussed earlier, a failure to reject the null hypothesis is not necessarily informative about the likelihood that the true effect is zero or does not exist, even when the study is seemingly adequately powered. In fact, when the goal is to infer a null effect, the alternative hypothesis should be the default hypothesis, and it should take sufficient evidence to overturn the alternative in favor of the null, so the two hypotheses effectively play opposite roles from their usual role in traditional hypothesis testing (Walker & Nowacki, 2011). In light of this, we do not recommend using the traditional statistical methods of the original study. Three approaches are capable of satisfying the goal of being able to conclude that an effect is null or essentially null in many common psychological designs.
Frequentist Method
The method most accessible to psychology researchers is the equivalence test (or two one-sided tests; TOST), because it is derived from the traditional frequentist perspective familiar to those conducting hypothesis tests. The first step is to establish what is known as a region of equivalence or region of indifference. This is an interval of values that the researcher believes to be so small as to be essentially zero. Notice that this interval is based entirely on theory and must be specified prior to collecting replication data. The logic of this is consistent with the "good enough principle," which acknowledges that in strict terms, the null hypothesis may never be exactly true (Serlin & Lapsley, 1985). Those authors encourage forming "a good-enough belt width of delta" in the null prediction (p. 79). Following the traditional analyses, the second step is to form a (1 − 2α) × 100% confidence interval around the estimate of the effect. For an α level of .05, a 90% confidence interval should be computed. Although 95% confidence intervals are more common in traditional null hypothesis testing, the equivalence test corresponds to two one-tailed tests, each at α = .05 (Walker & Nowacki, 2011). The logic of TOST is that if the confidence interval of the estimate falls entirely within the region of equivalence, the null hypothesis can be claimed to be functionally true with a low amount of uncertainty.
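The sketch below implements this two-step TOST logic directly: a 90% confidence interval for the mean difference (for α = .05), checked against a pre-specified region of equivalence. The simulated data and the equivalence bound delta are hypothetical placeholders.

```python
# Minimal sketch of the equivalence test (TOST) via the confidence-interval
# route: form a (1 - 2*alpha) = 90% CI for the mean difference and check
# whether it lies entirely inside the region of equivalence. The simulated
# data and the bound `delta` are hypothetical; delta must be set a priori.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)
group1 = rng.normal(loc=5.9, scale=1.5, size=100)  # simulated ratings
group2 = rng.normal(loc=5.8, scale=1.5, size=100)

delta = 0.5  # region of equivalence (-0.5, 0.5), chosen from theory

n1, n2 = len(group1), len(group2)
sp = np.sqrt(((n1 - 1) * group1.var(ddof=1) + (n2 - 1) * group2.var(ddof=1))
             / (n1 + n2 - 2))                      # pooled SD
se = sp * np.sqrt(1 / n1 + 1 / n2)                 # SE of the mean difference
t_crit = stats.t.ppf(0.95, df=n1 + n2 - 2)         # 90% CI <=> TOST at alpha=.05

diff = group1.mean() - group2.mean()
lo, hi = diff - t_crit * se, diff + t_crit * se

equivalent = (lo > -delta) and (hi < delta)
print(f"90% CI = ({lo:.2f}, {hi:.2f}); effect essentially null: {equivalent}")
```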
Continuing with the scope-severity paradox example introduced in Goal 1, suppose a skeptic believes the original study was flawed and thus would like to show that the number of individuals affected by a crime essentially does not impact its perceived severity.