Read the paper by John et al. (2012). What are your thoughts on how researchers conduct science in psychology? What did you learn from this paper, and how do you reconcile these results with the ideas discussed in the segment on philosophy of science? Are there ways in which we should change the practice of conducting psychological research based on the findings in this paper?
Research Article

Measuring the Prevalence of Questionable Research Practices With Incentives for Truth Telling

Leslie K. John (1), George Loewenstein (2), and Drazen Prelec (3)

(1) Marketing Unit, Harvard Business School; (2) Department of Social & Decision Sciences, Carnegie Mellon University; (3) Sloan School of Management and Departments of Economics and Brain & Cognitive Sciences, Massachusetts Institute of Technology

Psychological Science, 23(5), 524–532. © The Author(s) 2012. DOI: 10.1177/0956797611430953

Abstract

Cases of clear scientific misconduct have received significant media attention recently, but less flagrantly questionable research practices may be more prevalent and, ultimately, more damaging to the academic enterprise. Using an anonymous elicitation format supplemented by incentives for honest reporting, we surveyed over 2,000 psychologists about their involvement in questionable research practices. The impact of truth-telling incentives on self-admissions of questionable research practices was positive, and this impact was greater for practices that respondents judged to be less defensible. Combining three different estimation methods, we found that the percentage of respondents who have engaged in questionable practices was surprisingly high. This finding suggests that some questionable practices may constitute the prevailing research norm.

Keywords: professional standards, judgment, disclosure, methodology

Received 5/20/11; Revision accepted 10/20/11

Corresponding Author: Leslie K. John, Harvard Business School—Marketing, Morgan Hall 169, Soldiers Field, Boston, MA 02163. E-mail: [email protected]

Although cases of overt scientific misconduct have received significant media attention recently (Altman, 2006; Deer, 2011; Steneck, 2002, 2006), exploitation of the gray area of acceptable practice is certainly much more prevalent, and may be more damaging to the academic enterprise in the long run, than outright fraud. Questionable research practices (QRPs), such as excluding data points on the basis of post hoc criteria, can spuriously increase the likelihood of finding evidence in support of a hypothesis. Just how dramatic these effects can be was demonstrated by Simmons, Nelson, and Simonsohn (2011) in a series of experiments and simulations that showed how greatly QRPs increase the likelihood of finding support for a false hypothesis. QRPs are the steroids of scientific competition, artificially enhancing performance and producing a kind of arms race in which researchers who strictly play by the rules are at a competitive disadvantage. QRPs, by nature of the very fact that they are often questionable as opposed to blatantly improper, also offer considerable latitude for rationalization and self-deception.

Concerns over QRPs have been mounting (Crocker, 2011; Lacetera & Zirulia, 2011; Marshall, 2000; Sovacool, 2008; Sterba, 2006; Wicherts, 2011), and several studies—many of which have focused on medical research—have assessed their prevalence (Gardner, Lidz, & Hartwig, 2005; Geggie, 2001; Henry et al., 2005; List, Bailey, Euzent, & Martin, 2001; Martinson, Anderson, & de Vries, 2005; Swazey, Anderson, & Louis, 1993). In the study reported here, we measured the percentage of psychologists who have engaged in QRPs.

As with any unethical or socially stigmatized behavior, self-reported survey data are likely to underrepresent true prevalence. Respondents have little incentive, apart from good will, to provide honest answers (Fanelli, 2009). The goal of the present study was to obtain realistic estimates of QRPs with a new survey methodology that incorporates explicit response-contingent incentives for truth telling and supplements self-reports with impersonal judgments about the prevalence of practices and about respondents' honesty. These impersonal judgments made it possible to elicit alternative estimates, from which we inferred the upper and lower boundaries of the actual prevalence of QRPs. Across QRPs, even raw self-admission rates were surprisingly high, and for certain practices, the inferred actual estimates approached 100%, which suggests that these practices may constitute the de facto scientific norm.
Method
In a study with a two-condition, between-subjects design, we e-mailed an electronic survey to 5,964 academic psychologists at major U.S. universities (for details on the survey and the sample, see Procedure and Table S1, respectively, in the Supplemental Material available online). Participants anonymously indicated whether they had personally engaged in each of 10 QRPs (self-admission rate; Table 1), and if they had, whether they thought their actions had been defensible. The order in which the QRPs were presented was randomized between subjects. There were 2,155 respondents, for a response rate of 36%. Of respondents who began the survey, 719 (33.4%) did not complete it (see Supplementary Results and Fig. S1 in the Supplemental Material); however, because the QRPs were presented in random order, data from all respondents—even those who did not finish the survey—were included in the analysis.
In addition to providing self-admission rates, respondents also provided two impersonal estimates related to each QRP: (a) the percentage of other psychologists who had engaged in each behavior (prevalence estimate), and (b) among those psychologists who had, the percentage that would admit to having done so (admission estimate). Therefore, each respondent was asked to provide three pieces of information for each QRP. Respondents who indicated that they had engaged in a QRP were also asked to rate whether they thought it was defensible to have done so (0 = no, 1 = possibly, and 2 = yes). If they wished, they could also elaborate on why they thought it was (or was not) defensible.
After providing this information for each QRP, respondents were also asked to rate their degree of doubt about the integrity of the research done by researchers at other institutions, other researchers at their own institution, graduate students, their collaborators, and themselves (1 = never, 2 = once or twice, 3 = occasionally, 4 = often).
Table 1. Results of the Main Study: Mean Self-Admission Rates, Comparison of Self-Admission Rates Across Groups, and Mean Defensibility Ratings
Columns for each item: self-admission rate (%) in the control group; self-admission rate (%) in the BTS group; odds ratio (BTS/control); two-tailed p (likelihood ratio test); defensibility rating (across groups), M (SD).

1. In a paper, failing to report all of a study's dependent measures: 63.4; 66.5; 1.14; .23; 1.84 (0.39)
2. Deciding whether to collect more data after looking to see whether the results were significant: 55.9; 58.0; 1.08; .46; 1.79 (0.44)
3. In a paper, failing to report all of a study's conditions: 27.7; 27.4; 0.98; .90; 1.77 (0.49)
4. Stopping collecting data earlier than planned because one found the result that one had been looking for: 15.6; 22.5; 1.57; .00; 1.76 (0.48)
5. In a paper, "rounding off" a p value (e.g., reporting that a p value of .054 is less than .05): 22.0; 23.3; 1.07; .58; 1.68 (0.57)
6. In a paper, selectively reporting studies that "worked": 45.8; 50.0; 1.18; .13; 1.66 (0.53)
7. Deciding whether to exclude data after looking at the impact of doing so on the results: 38.2; 43.4; 1.23; .06; 1.61 (0.59)
8. In a paper, reporting an unexpected finding as having been predicted from the start: 27.0; 35.0; 1.45; .00; 1.50 (0.60)
9. In a paper, claiming that results are unaffected by demographic variables (e.g., gender) when one is actually unsure (or knows that they do): 3.0; 4.5; 1.52; .16; 1.32 (0.60)
10. Falsifying data: 0.6; 1.7; 2.75; .07; 0.16 (0.38)

Note: Items are listed in decreasing order of rated defensibility. Respondents who admitted to having engaged in a given behavior were asked to rate whether they thought it was defensible to have done so (0 = no, 1 = possibly, and 2 = yes). Standard deviations are given in parentheses. BTS = Bayesian truth serum. Applying the Bonferroni correction for multiple comparisons, we adjusted the critical alpha level downward to .005 (i.e., .05/10 comparisons).
The two versions of the survey differed in the incentives they offered to respondents. In the Bayesian-truth-serum (BTS) condition, a scoring algorithm developed by one of the authors (Prelec, 2004) was used to provide incentives for truth telling. This algorithm uses respondents' answers about their own behavior and their estimates of the sample distribution of answers as inputs in a truth-rewarding scoring formula. Because the survey was anonymous, compensation could not be directly linked to individual scores. Instead, respondents were told that we would make a donation to a charity of their choice, selected from five options, and that the size of this donation would depend on the truthfulness of their responses, as determined by the BTS scoring system. By inducing a (correct) belief that dishonesty would reduce donations, we hoped to amplify the moral stakes riding on each answer (for details on the donations, see Supplementary Results in the Supplemental Material). Respondents were not given the details of the scoring system but were told that it was based on an algorithm published in Science and were given a link to the article. There was no deception: Respondents' BTS scores determined our contributions to the five charities. Respondents in the control condition were simply told that a charitable donation would be made on behalf of each respondent. (For details on the effect of the size of the incentive on response rates, see Participation Incentive Survey in the Supplemental Material.)
The three types of answers to the survey questions—self-admission, prevalence estimate, admission estimate—allowed us to estimate the actual prevalence of each QRP in different ways. The credibility of each estimate hinged on the credibility of one of the three answers in the survey: First, if respondents answered the personal question honestly, then self-admission rates would reveal the actual prevalence of the QRPs in this sample. Second, if average prevalence estimates were accurate, then they would also allow us to directly estimate the actual prevalence of the QRPs. Third, if average admission estimates were accurate, then actual prevalence could be estimated using the ratios of admission rates to admission estimates. This would correspond to a case in which respondents did not know the actual prevalence of a practice but did have a good sense of how likely it is that a colleague would admit to it in a survey. The three estimates should converge if the self-admission rate equaled the prevalence estimate multiplied by the admission estimate. To the extent that this equality is violated, there would be differences between prevalence rates measured by the different methods.
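To make the arithmetic of the third estimator, and of the combined (geometric-mean) estimate reported below, concrete, here is a minimal sketch using hypothetical values for a single practice. It is not the authors' analysis code, and all numbers are placeholders.

```python
# Minimal sketch (not from the paper) of the three prevalence estimators,
# using hypothetical values for a single QRP.
from statistics import geometric_mean

self_admission = 0.40   # share of respondents admitting to the practice
prevalence_est = 0.55   # mean estimate of how many colleagues engage in it
admission_est  = 0.45   # mean estimate of how many engagers would admit it

# Third estimator: prevalence inferred from admission behavior, capped at 100%.
derived_prevalence = min(self_admission / admission_est, 1.0)

# If respondents were well calibrated and honest, the identity
# self_admission == prevalence_est * admission_est would hold exactly.
estimates = [self_admission, prevalence_est, derived_prevalence]
print(f"derived prevalence = {derived_prevalence:.2%}")
print(f"geometric mean of the three estimates = {geometric_mean(estimates):.2%}")
```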
Results

Raw self-admission rates, prevalence estimates, prevalence estimates derived from the admission estimates (i.e., self-admission rate/admission estimate), and geometric means of these three percentages are shown in Figure 1. For details on our approach to analyzing the data, see Data Analysis in the Supplemental Material.
Truth-telling incentives
A priori, truth-telling incentives (as provided in the BTS condition) should affect responses in proportion to the baseline (i.e., control condition) level of false denials. These baseline levels are unknown, but one can hypothesize that they should be minimal for impersonal estimates of prevalence and admission, and greatest for personal admissions of unethical practices broadly judged as unacceptable, which represent "red-card" violations.
As hypothesized, prevalence estimates (see Table S2 in the Supplemental Material) and admission estimates (see Table S3 in the Supplemental Material) were comparable in the two conditions, but self-admission rates for some items (Table 1), especially those that were "more questionable," were higher in the BTS condition than in the control condition. (Table 1 also presents the p values of the likelihood ratio test of the difference in admission rates between conditions.)
We assessed the effect of the BTS manipulation by examining the odds ratio of self-admission rates in the BTS condition to self-admission rates in the control condition. The odds ratio was high for one practice (falsifying data), moderate for three practices (premature stopping of data collection, falsely reporting a finding as expected, and falsely claiming that results are unaffected by certain variables), and negligible for the remainder of the practices (Table 1). The acceptability of a practice can be inferred from the self-admission rate in the control condition (baseline) or assessed directly by judgments of defensibility. The nonparametric correlation of BTS impact, as measured by odds ratio, with the baseline self-admission rate was −.62 (p < .06; parametric correlation = −.65, p < .05); the correlation of odds ratio with defensibility rating was −.68 (p < .03; parametric correlation = −.94, p < .001). These correlations were more modest when Item 10 ("Falsifying data") was excluded (odds ratio with baseline self-admission rate: nonparametric correlation = −.48, p < .20; parametric correlation = −.59, p < .10; odds ratio with defensibility rating: nonparametric correlation = −.57, p < .12; parametric correlation = −.59, p < .10).
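As an illustration of how these summary statistics are computed, the following sketch reconstructs the odds ratio for one item and the item-level correlations between BTS impact and defensibility from the rounded values in Table 1. It is not the authors' code (scipy is assumed to be available), and because the published rates are rounded, it only approximates figures computed from the raw data.

```python
# Sketch (assumed helper script, not from the paper): BTS effect and its
# relation to rated defensibility, using the rounded values in Table 1.
from scipy.stats import spearmanr, pearsonr

def odds_ratio_from_rates(p_bts, p_control):
    """Odds ratio of self-admission in the BTS vs. control condition."""
    return (p_bts / (1 - p_bts)) / (p_control / (1 - p_control))

# Item 10 (falsifying data): 1.7% vs. 0.6%; rounding means this only
# approximates the reported 2.75.
print(round(odds_ratio_from_rates(0.017, 0.006), 2))

# Items 1-10, in the order listed in Table 1.
odds_ratio    = [1.14, 1.08, 0.98, 1.57, 1.07, 1.18, 1.23, 1.45, 1.52, 2.75]
defensibility = [1.84, 1.79, 1.77, 1.76, 1.68, 1.66, 1.61, 1.50, 1.32, 0.16]

rho, _ = spearmanr(odds_ratio, defensibility)  # nonparametric correlation
r, _ = pearsonr(odds_ratio, defensibility)     # parametric correlation
print(f"Spearman rho = {rho:.2f}, Pearson r = {r:.2f}")  # rho is about -.68
```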
Prevalence estimates

Figure 1 displays mean prevalence estimates for the three types of responses in the BTS condition (the admission estimates were capped at 100%; they exceeded 100% by a small margin for a few items). The figure also shows the geometric means of all three responses; these means, in effect, give equal credence to the three types of answers. The raw admission rates are almost certainly too low given the likelihood that respondents did not admit to all QRPs that they actually engaged in. Therefore, the geometric means are probably conservative judgments of true prevalence.
One would infer from the geometric means of the three variables that nearly 1 in 10 research psychologists has introduced false data into the scientific record (Items 5 and 10) and that the majority of research psychologists have engaged in practices such as selective reporting of studies (Item 6), not reporting all dependent measures (Item 1), collecting more data after determining whether the results were significant (Item 2), reporting unexpected findings as having been predicted (Item 8), and excluding data post hoc (Item 7).
These estimates are somewhat higher than estimates reported in previous research. For example, a meta-analysis of surveys—none of which provided incentives for truthful responding—found that, among scientists from a variety of disciplines, 9.5% of respondents admitted to having engaged in QRPs other than data falsification; the upper-boundary estimate was 33.7% (Fanelli, 2009). In the present study, the mean self-admission rate in the BTS condition (excluding the data-falsification item for comparability with Fanelli, 2009) was 36.6%—higher than both of the meta-analysis estimates. Moreover, among participants in the BTS condition who completed the survey, 94.0% admitted to having engaged in at least one QRP (compared with 91.4% in the control condition). The self-admission rate in our control condition (33.0%) mirrored the upper-boundary estimate obtained in Fanelli's meta-analysis (33.7%).
Response to a given item on our survey was predictive of responses to the other items: The survey items approximated a Guttman scale, meaning that an admission to a relatively rare behavior (e.g., falsifying data) usually implied that the respondent had also engaged in more common behaviors. Among completed response sets, the coefficient of reproducibility—the average proportion of a person's responses that can be reproduced by knowing the number of items to which he or she responded affirmatively—was .80 (high values indicate close agreement; items are considered to form a Guttman scale if reproducibility is .90 or higher; Guttman, 1974). This finding suggests that researchers' engagement in or avoidance of specific QRPs is not completely idiosyncratic. It indicates that there is a rough consensus among researchers about the relative unethicality of the behaviors, but large variation in where researchers draw the line when it comes to their own behavior.
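For readers unfamiliar with the measure, here is a minimal sketch of how a coefficient of reproducibility can be computed from a binary respondent-by-item matrix, following the definition quoted above. This is an illustrative implementation, not the authors' code, and the data shown are hypothetical.

```python
# Sketch (assumed, not the authors' code): Guttman coefficient of
# reproducibility for a binary respondent-by-item matrix of QRP admissions.
import numpy as np

def guttman_reproducibility(X):
    """X: (n_respondents, n_items) array of 0/1 admissions.
    A respondent who admits to k items is predicted to have admitted to the
    k most commonly admitted items; reproducibility is the share of observed
    responses matched by that prediction."""
    X = np.asarray(X)
    order = np.argsort(-X.mean(axis=0))       # items, most to least admitted
    X = X[:, order]
    k = X.sum(axis=1, keepdims=True)          # each respondent's total score
    predicted = (np.arange(X.shape[1]) < k).astype(int)
    return (predicted == X).mean()

# Toy usage: three respondents, four items (hypothetical data).
demo = [[1, 1, 0, 0],
        [1, 0, 0, 0],
        [1, 1, 1, 0]]
print(round(guttman_reproducibility(demo), 2))  # 1.0 for a perfect scale
```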
[Figure 1: bar graph omitted. Geometric means shown above the bars, as they appear in the source: 78%, 72%, 42%, 36%, 39%, 67%, 62%, 54%, 13%, 9%.]
Fig. 1. Results of the Bayesian-truth-serum condition in the main study. For each of the 10 items, the graph shows the self-admission rate, prevalence estimate, prevalence estimate derived from the admission estimate (i.e., self-admission rate/admission estimate), and geometric mean of these three percentages (numbers above the bars). See Table 1 for the complete text of the items.
Perceived defensibility
Respondents had an opportunity to state whether they thought their actions were defensible. Consistent with the notion that latitude for rationalization is positively associated with engagement in QRPs, our findings showed that respondents who admitted to a QRP tended to think that their actions were defensible. The overall mean defensibility rating of practices that respondents acknowledged having engaged in was 1.70 (SD = 0.53)—between possibly defensible and defensible. Mean judged defensibility for each item is shown in Table 1. Defensibility ratings did not generally differ according to the respondents’ discipline or the type of research they conducted (see Table S4 in the Supplemental Material).
Doubts about research integrity

A large percentage of respondents indicated that they had doubts about research integrity on at least one occasion (Fig. 2). The degree of doubt differed by target; for example, respondents were more wary of research generated by researchers at other institutions than of research conducted by their collaborators. Although heterogeneous referent-group sizes make these differences difficult to interpret (the number of researchers at other institutions is presumably larger than one's own set of collaborators), it is noteworthy that approximately 35% of respondents indicated that they had doubts about the integrity of their own research on at least one occasion.

[Figure 2: bar chart omitted. Categories of researcher: researchers at other institutions, researchers at your institution, graduate students, your collaborators, yourself; response options: never, once or twice, occasionally, often.]
Fig. 2. Results of the main study: distribution of responses to a question asking about doubts concerning the integrity of the research conducted by various categories of researchers.
Frequency of engagement
Although the prevalence estimates obtained in the BTS condition are somewhat higher than previous estimates, they do not enable us to distinguish between the researcher who routinely engages in a given behavior and the researcher who has only engaged in that behavior once. To the extent that self-admission rates are driven by the former type, our results are more worrisome. We conducted a smaller-scale survey, in which we tested for differences in admission rates as a function of the response scale.
We asked 133 attendees of an annual conference of behavioral researchers whether they had engaged in each of 25 different QRPs (many of which we also inquired about in the main study). Using a 2 × 2 between-subjects design, we manipulated the wording of the questions and the response scale. The questions were either phrased as a generic action ("Falsifying data") or in the first person ("I have falsified data"), and participants indicated whether they had engaged in the behaviors using either a dichotomous response scale (yes/no, as in the main study) or a frequency response scale (never, once or twice, occasionally, frequently).
Because the overall self-admission rates for the individual items were generally similar to those obtained in the main study, we do not report them here. Respondents made fewer affirmative admissions on the dichotomous response scale (M = 3.77 out of 25, SD = 2.27) than on the frequency response scale (M = 6.02 out of 25, SD = 3.70), F(1, 129) = 17.0, p < .0005. This result suggests that in the dichotomous-scale condition, some nontrivial fraction of respondents who had engaged in a QRP only a small number of times reported that they had never engaged in it, which implies that the prevalence rates obtained in the main study are conservative. There was no effect of the wording manipulation.
We explored the response-scale effect further by comparing the distribution of responses between the two response-scale conditions across all 25 items, collapsing across the wording manipulation (Fig. 3). Among the affirmative responses in the frequency-response-scale condition (i.e., responses of once or twice, occasionally, or frequently), 64% (i.e., .153/(.153 + .062 + .023)) fell into the once or twice category, a nontrivial percentage (26%) fell into occasionally, and 10% fell into frequently. This result suggests that the prevalence estimates from the BTS study represent a combination of single-instance and habitual engagement in the behaviors.
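A quick sketch of that breakdown (illustrative only; the proportions are those quoted above):

```python
# Share of each affirmative category among all affirmative responses,
# using the overall response proportions quoted in the text.
affirmative = {"once or twice": 0.153, "occasionally": 0.062, "frequently": 0.023}
total = sum(affirmative.values())
for label, p in affirmative.items():
    print(f"{label}: {p / total:.0%}")   # roughly 64%, 26%, and 10%
```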
Subgroup differences

Table 2 presents self-admission rates as a function of disciplines within psychology and the primary methodology used in research. Relatively high rates of QRPs were self-reported among the cognitive, neuroscience, and social disciplines, and among researchers using behavioral, experimental, and laboratory methodologies (for details, see Data Analysis in the Supplemental Material). Clinical psychologists reported relatively low rates of QRPs.
These subgroup differences could reflect the particular relevance of our QRPs to these disciplines and methodologies, or they could reflect differences in perceived defensibility of the behaviors. To explore these possible explanations, we sent a brief follow-up survey to 1,440 of the participants in the main study, which asked them to rate two aspects of the same 10 QRPs. First, they were asked to rate the extent to which each practice applies to their research methodology (i.e., how frequently, if at all, they encountered the opportunity to engage in the practice). The possible responses were never applicable, sometimes applicable, often applicable, and always applicable. Second, they were asked whether it is generally defensible to engage in each practice. The possible responses were indefensible, possibly defensible, and defensible. Unlike in the main study, in which respondents were asked to provide a defensibility rating only if they had admitted to having engaged in a given practice, all respondents in the follow-up survey were asked to provide these ratings. We counterbalanced the order in which respondents rated the two dimensions. There were 504 respondents, for a response rate of 35%. Of respondents who began the survey, 65 (12.9%) did not complete it; as in the main study, data from all respondents—even those who did not finish the survey—were included in the analysis because the QRPs were presented in randomized order.
Table 2 presents the results from the follow-up survey. The subgroup differences in applicability ratings and defensibility ratings were partially consistent with the differences in self-reported prevalence: Most notably, mean applicability and defensibility ratings were elevated among social psychologists—a subgroup with relatively high self-admission rates. Similarly, the items were particularly applicable to (but not judged to be more defensible by) researchers who conduct behavioral, experimental, and laboratory research.
To test for the relative importance of applicability and defensibility ratings in explaining subfield differences, we conducted an analysis of variance on mean self-admission rates across QRPs and disciplines. Both type of QRP (p < .001, partial η² = .87) and subfield (p < .05, partial η² = .21) were highly significant predictors of self-admission rates, and their significance and effect size were largely unchanged after controlling for applicability and defensibility ratings, even though both of the …
[Figure 3: bar charts omitted. Y-axes: proportion of participants endorsing each response option.]
Fig. 3. Results of the follow-up study: distribution of responses among participants who were asked whether they had engaged in 25 questionable research practices. Participants answered using either (a) a frequency response scale (never, once or twice, occasionally, frequently) or (b) a dichotomous response scale (no, yes).