Please use the attached files for this assignment.
Please provide a 250–500 word summary of your assigned reading. This should describe key aspects of the article you read, and how it relates to the subject of the week (i.e., class lecture).
- Farrington, D. P. (2003). Methodological quality standards for evaluation research. The Annals of the American Academy of Political and Social Science, 587, 49–68.
- Petrosino, A., et al. (2001). Meeting challenges of evidence-based policy: The Campbell Collaboration. The Annals of the American Academy of Political and Social Science, 578, 14–34.
- Weisburd, D., et al. (2001). Does research design affect study outcomes in criminal justice? The Annals of the American Academy of Political and Social Science, 578, 50–70.
- Wilson, D. B. (2001). Meta-analytic methods for criminology. The Annals of the American Academy of Political and Social Science, 578, 71–89.
Methodological Quality Standards for Evaluation Research

By DAVID P. FARRINGTON

David P. Farrington is professor of psychological criminology at Cambridge University. He is the chair of the Campbell Collaboration Crime and Justice Group and president of the Academy of Experimental Criminology. He is a past president of the American Society of Criminology, the British Society of Criminology, and the European Association of Psychology and Law. He has received the Sellin-Glueck and Sutherland awards from the American Society of Criminology for outstanding contributions to criminology. His major research interest is in the development of offending from childhood to adulthood, and he is the director of the Cambridge Study in Delinquent Development, which is a prospective longitudinal survey of 411 London males from age eight to age forty-eight.

NOTE: I am grateful to Bob Boruch, Tom Cook, Cynthia Lum, Anthony Petrosino, David Weisburd, and Brandon Welsh for helpful comments on an earlier draft of this article.
Evaluation studies vary in methodological quality. It is essential to develop methodological quality standards for evaluation research that can be understood and easily used by scholars, practitioners, policy makers, the mass media, and systematic reviewers. This article proposes that such standards should be based on statistical conclusion validity, internal validity, construct validity, external validity, and descriptive validity. Methodological quality scales are reviewed, and it is argued that efforts should be made to improve them. Pawson and Tilley’s challenge to the Campbell evaluation tradition is also assessed. It is concluded that this challenge does not have any implications for methodological quality standards, because the Campbell tradition already emphasizes the need to study moderators and mediators in evaluation research.
Keywords: methodological quality; evaluation; validity; crime reduction; systematic reviews
The Campbell Collaboration Crime and Justice Group aims to prepare and maintain systematic reviews of impact evaluation studies on the effectiveness of criminological interventions and to make them accessible electronically to scholars, practitioners, policy makers, the mass media, and the general public (Farrington and Petrosino 2000, 2001).
It is clear that evaluation studies vary in methodological quality. The preferred approach of the Campbell Collaboration Crime and Justice Group is not for a reviewer to attempt to review all evaluation studies on a particular topic, however poor their methodology, but rather to include only the best studies in systematic reviews. However, this policy requires the specification of generally accepted, explicit, and transparent criteria for determining what are the best studies on a particular topic, which in turn requires the development of methodological quality standards for evaluation research.
In due course, it is possible that methodological quality standards will be specified by the Campbell Collaboration for all its constituent groups. It is also possible that different standards may be needed for different topics. This article is an attempt to make progress in developing methodological quality standards. Unfortunately, discussions about methodological quality standards, and about inclusion and exclusion criteria in systematic reviews, are inevitably contentious because they are seen as potentially threatening by some evaluation researchers. People whose projects are excluded from systematic reviews correctly interpret this as a criticism of the methodological quality of their work. In our systematic reviews of the effectiveness of improved street lighting and closed-circuit television (CCTV) (Farrington and Welsh 2002; Welsh and Farrington 2003 [this issue]), referees considered that the excluded studies were being “cast into outer darkness” (although we did make a list of them).
What are the features of an evaluation study with high methodological quality? In trying to specify these for criminology and the social and behavioral sciences, the most relevant work—appropriately enough—is by Donald Campbell and his colleagues (Campbell and Stanley 1966; Cook and Campbell 1979; Shadish, Cook, and Campbell 2002). Campbell was clearly one of the leaders of the tradition of field experiments and quasi experimentation (Shadish, Cook, and Campbell 2002, p. xx). However, not everyone agrees with the Campbell approach. The main challenge to it in the United Kingdom has come from Pawson and Tilley (1997), who have developed “realistic evaluation” as a competitor. Briefly, Pawson and Tilley argued that the Campbell tradition of experimental and quasi-experimental evaluation research has “failed” because of its emphasis on “what works.” Instead, they argue, evaluation research should primarily be concerned with testing theories, especially about linkages between contexts, mechanisms, and outcomes (see below).
Methodological quality standards are likely to vary according to the topic being reviewed. For example, because there have been many randomized experiments on family-based crime prevention (Farrington and Welsh 1999), it would not be unreasonable to restrict a systematic review of this topic to the gold standard of randomized experiments. However, there have been no randomized experiments designed to evaluate the effect of either improved street lighting or CCTV on crime. Therefore, in our systematic reviews of these topics (Farrington and Welsh 2002; Welsh and Farrington 2003), we set a minimum methodological standard for inclusion in our reviews of projects with before-and-after measures of crime in experimental and comparable control areas. This was considered to be the minimum interpretable design by Cook and Campbell (1979).
This was also set as the minimum design that was adequate for drawing valid conclusions about what works in the book Evidence-Based Crime Prevention (Sherman et al. 2002), based on the Maryland Scientific Methods Scale (SMS) (see below). An important issue is how far it is desirable and feasible to use a methodological quality scale to assess the quality of evaluation research and as the basis for making decisions about including or excluding studies in systematic reviews. And if a methodological quality scale should be used, which one should be chosen?
This article, then, has three main aims:
1. to review criteria of methodological quality in evaluation research,
2. to review methodological quality scales and to decide what type of scale might be useful in assisting reviewers in making inclusion and exclusion decisions for systematic reviews, and
3. to consider the validity of Pawson and Tilley’s (1997) challenge to the Campbell approach.
Methodological Quality Criteria
According to Cook and Campbell (1979) and Shadish, Cook, and Campbell (2002), methodological quality depends on four criteria: statistical conclusion validity, internal validity, construct validity, and external validity. This validity typology “has always been the central hallmark of Campbell’s work over the years” (Shadish, Cook, and Campbell 2002, xviii). “Validity” refers to the correctness of inferences about cause and effect (Shadish, Cook, and Campbell 2002, 34).
From the time of John Stuart Mill, the main criteria for establishing a causal relationship have been that (1) the cause precedes the effect, (2) the cause is related to the effect, and (3) other plausible alternative explanations of the effect can be excluded. The main aim of the Campbell validity typology is to identify plausible alternative explanations (threats to valid causal inference) so that researchers can anticipate likely criticisms and design evaluation studies to eliminate them. If threats to valid causal inference cannot be ruled out in the design, they should at least be measured and their importance estimated.
Following Lösel and Koferl (1989), I have added descriptive validity, or the adequacy of reporting, as a fifth criterion of the methodological quality of evaluation research. This is because, to complete a systematic review, it is important that information about key features of the evaluation is provided in each research report.
Statistical Conclusion Validity
Statistical conclusion validity is concerned with whether the presumed cause (the intervention) and the presumed effect (the outcome) are related. Measures of effect size and their associated confidence intervals should be calculated. Statistical significance (the probability of obtaining the observed effect size if the null hypothesis of no relationship were true) should also be calculated, but in many ways, it is less important than the effect size. This is because a statistically significant result could indicate a large effect in a small sample or a small effect in a large sample.
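To make this distinction concrete, the following minimal Python sketch (standard library only; the helper function and all sample figures are invented for illustration, not taken from any study discussed here) shows a trivially small standardized effect reaching significance in a large sample while a substantively large effect fails to reach it in a small one.

```python
import math

def cohens_d_with_ci(m1, m2, sd_pooled, n1, n2, z_crit=1.96):
    """Standardized mean difference (Cohen's d) with an approximate
    95% confidence interval and two-sided p-value (normal approximation)."""
    d = (m1 - m2) / sd_pooled
    # Large-sample approximation to the standard error of d
    se = math.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
    ci = (d - z_crit * se, d + z_crit * se)
    z = d / se
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return d, ci, p

# A small effect in a large sample: statistically significant but trivial
print(cohens_d_with_ci(10.5, 10.0, 10.0, n1=5000, n2=5000))  # d = 0.05, p ~ .01
# A large effect in a small sample: substantively big but nonsignificant
print(cohens_d_with_ci(15.0, 10.0, 10.0, n1=15, n2=15))      # d = 0.50, p ~ .18
```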
The main threats to statistical conclusion validity are insufficient statistical power to detect the effect (e.g., because of small sample size) and the use of inappropriate statistical techniques (e.g., where the data violate the underlying assumptions of a statistical test). Statistical power refers to the probability of correctly rejecting the null hypothesis when it is false. Other threats to statistical conclusion validity include the use of many statistical tests (in a so-called fishing expedition for significant results) and the heterogeneity of the experimental units (e.g., the people or areas in experimental and control conditions). The more variability there is in the units, the harder it will be to detect any effect of the intervention.
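Power can be approximated in the same spirit. This sketch (normal approximation, invented sample sizes) shows how the power to detect a modest standardized effect climbs with the number of units per condition.

```python
import math

def normal_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def power_two_groups(d, n_per_group, alpha_z=1.96):
    """Approximate power of a two-group comparison to detect a standardized
    mean difference d (two-sided alpha = .05; the negligible lower tail
    is ignored)."""
    z_effect = d * math.sqrt(n_per_group / 2)
    return normal_cdf(z_effect - alpha_z)

# With 50 units per condition, a modest effect (d = 0.3) is detected less
# than a third of the time; about 350 per condition gives ~98% power.
for n in (50, 100, 350):
    print(n, round(power_two_groups(0.3, n), 2))
```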
Shadish, Cook, and Campbell (2002, 45) included the unreliability of measures as a threat to statistical conclusion validity, but this seems more appropriately classified as a threat to construct validity (see below). While the allocation of threats to validity categories is sometimes problematic, I have placed each threat in only one validity category.
Internal Validity
Internal validity refers to the correctness of the key question about whether the intervention really did cause a change in the outcome, and it has generally been regarded as the most important type of validity (Shadish, Cook, and Campbell 2002, 97). In investigating this question, some kind of control condition is essential to estimate what would have happened to the experimental units (e.g., people or areas) if the intervention had not been applied to them—termed the “counterfactual inference.” Experimental control is usually better than statistical control. One problem is that the control units rarely receive no treatment; instead, they typically receive the more usual treatment or some kind of treatment that is different from the experimental intervention. Therefore, it is important to specify the effect size—compared to what?
The main threats to internal validity have been identified often but do not seem to be uniformly well known (Shadish, Cook, and Campbell 2002, 55):
1. Selection: the effect reflects preexisting differences between experimental and control conditions.
2. History: the effect is caused by some event occurring at the same time as the intervention.
3. Maturation: the effect reflects a continuation of preexisting trends, for example, in normal human development.
4. Instrumentation: the effect is caused by a change in the method of measuring the outcome.
5. Testing: the pretest measurement causes a change in the posttest measure.
6. Regression to the mean: where an intervention is implemented on units with unusually high scores (e.g., areas with high crime rates), natural fluctuation will cause a decrease in these scores on the posttest, which may be mistakenly interpreted as an effect of the intervention. The opposite (an increase) happens when interventions are applied to low-crime areas or low-scoring people (see the simulation sketch after this list).
7. Differential attrition: the effect is caused by differential loss of units (e.g., people) from experimental compared to control conditions.
8. Causal order: it is unclear whether the intervention preceded the outcome.
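Threat 6 is easy to demonstrate by simulation. In the sketch below (all numbers invented), 1,000 areas have stable underlying crime levels; selecting the highest-crime tenth at pretest produces an apparent posttest decrease with no intervention at all.

```python
import random
import statistics

random.seed(1)

# Each area has a stable "true" crime level; observed rates add noise.
true_levels = [random.gauss(100, 15) for _ in range(1000)]
pretest = [t + random.gauss(0, 10) for t in true_levels]
posttest = [t + random.gauss(0, 10) for t in true_levels]  # no intervention

# Select the 10% of areas with the highest pretest crime rates,
# as a targeted intervention might.
cutoff = sorted(pretest, reverse=True)[99]
selected = [i for i in range(1000) if pretest[i] >= cutoff]

pre_mean = statistics.mean(pretest[i] for i in selected)
post_mean = statistics.mean(posttest[i] for i in selected)
print(f"pretest mean:  {pre_mean:.1f}")
print(f"posttest mean: {post_mean:.1f}")  # lower, despite no intervention
```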
In addition, there may be interactive effects of threats. For example, a selection-maturation effect may occur if the experimental and control conditions have different preexisting trends, or a selection-history effect may occur if the experimental and control conditions experience different historical events (e.g., where they are located in different settings).
In principle, a randomized experiment has the highest possible internal validity because it can rule out all these threats, although in practice, differential attrition may still be problematic. Randomization is the only method of assignment that controls for unknown and unmeasured confounders as well as those that are known and measured. The conclusion that the intervention really did cause a change in the outcome is not necessarily the final conclusion. It is desirable to go beyond this and investigate links in the causal chain between the intervention and the outcome (“mediators,” according to Baron and Kenny 1986), the dose-response relationship between the intervention and the outcome, and the validity of any theory linking the intervention and the outcome.
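The point about unknown confounders can also be illustrated by simulation. In this sketch (invented data), random assignment balances a confounder the evaluator never measures, whereas self-selection of high-propensity units into treatment builds the confounder into the treatment-control contrast.

```python
import random
import statistics

random.seed(2)

# An unmeasured confounder (e.g., an area's underlying crime propensity)
# that the evaluator never observes.
confounder = [random.gauss(0, 1) for _ in range(500)]

# Random assignment: the confounder is balanced in expectation.
units = list(range(500))
random.shuffle(units)
treat, control = units[:250], units[250:]
print(statistics.mean(confounder[i] for i in treat),
      statistics.mean(confounder[i] for i in control))  # both near 0

# Self-selection: the highest-propensity units receive the intervention,
# so the confounder is confounded with treatment.
ranked = sorted(range(500), key=lambda i: confounder[i], reverse=True)
treat2, control2 = ranked[:250], ranked[250:]
print(statistics.mean(confounder[i] for i in treat2),
      statistics.mean(confounder[i] for i in control2))  # far apart
```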
Construct Validity
Construct validity refers to the adequacy of the operational definition and measurement of the theoretical constructs that underlie the intervention and the outcome. For example, if a project aims to investigate the effect of interpersonal skills training on offending, did the training program really target and change interpersonal skills, and were arrests a valid measure of offending? Whereas the operational definition and measurement of physical constructs such as height and weight are not contentious, this is not true of most criminological constructs.
The main threats to construct validity center on the extent to which the intervention succeeded in changing what it was intended to change (e.g., how far there was treatment fidelity or implementation failure) and on the validity and reliability of outcome measures (e.g., how adequately police-recorded crime rates reflect true crime rates). Displacement of offending and “diffusion of benefits” of the intervention (Clarke and Weisburd 1994) should also be investigated. Other threats to construct validity include those arising from a participant’s knowledge of the intervention and problems of contamination of treatment (e.g., where the control group receives elements of the intervention). To counter the Hawthorne effect, it is acknowledged in medicine that double-blind trials are needed, wherein neither doctors nor patients know about the experiment. It is also desirable to investigate interaction effects between different interventions or different ingredients of an intervention.
External Validity
External validity refers to the generalizability of causal relationships across different persons, places, times, and operational definitions of interventions and outcomes (e.g., from a demonstration project to the routine large-scale application of an intervention). It is difficult to investigate this within one evaluation study, unless it is a large-scale, multisite trial. External validity can be established more convincingly in systematic reviews and meta-analyses of numerous evaluation studies. Shadish, Cook, and Campbell (2002, 83) distinguished generalizability to similar versus different populations, for example, contrasting how far the effects of an intervention with men might be replicated with other men as opposed to how far these effects might be replicated with women. The first type of generalizability would be increased by carefully choosing random samples from some population as potential (experimental or control) participants in an evaluation study.
The main threats to external validity listed by Shadish, Cook, and Campbell (2002, 87) consist of interactions of causal relationships (effect sizes) with types of persons, settings, interventions, and outcomes. For example, an intervention designed to reduce offending may be effective with some types of people and in some types of places but not in others. A key issue is whether the effect size varies according to whether those who carried out the research had some kind of stake in the results (e.g., if a project is funded by a government agency, the agency may be embarrassed if the evaluation shows no effect of its highly trumpeted intervention). There may be boundary conditions within which interventions do or do not work, or “moderators” of a causal relationship in the terminology of Baron and Kenny (1986). Also, mediators of causal relationships (links in the causal chain) may be effective in some settings but not in others. Ideally, theories should be proposed to explain these kinds of interactions.
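A simple moderator analysis compares effect sizes across subgroups. The sketch below (the subgroup results are invented) uses the standard z-test for the difference between two independent effect sizes.

```python
import math

def moderator_z(d1, se1, d2, se2):
    """z-test for the difference between two independent effect sizes,
    e.g., the same intervention evaluated in two types of setting."""
    diff_se = math.sqrt(se1**2 + se2**2)
    z = (d1 - d2) / diff_se
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p

# Illustrative (invented) subgroup results: d = 0.40 in urban settings,
# d = 0.05 in rural ones. A significant z suggests setting moderates
# the effect of the intervention.
print(moderator_z(0.40, 0.10, 0.05, 0.12))
```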
Descriptive Validity
Descriptive validity refers to the adequacy of the presentation of key features of an evaluation in a research report. As mentioned, systematic reviews can be carried out satisfactorily only if the original evaluation reports document key data on issues such as the number of participants and the effect size. A list of minimum elements to be included in an evaluation report would include at least the following (see also Boruch 1997, chapter 10):
1. Design of the study: how were experimental units allocated to experimental or control conditions?
2. Characteristics of experimental units and settings (e.g., age and gender of individuals, sociodemographic features of areas).
3. Sample sizes and attrition rates.
4. Causal hypotheses to be tested and theories from which they are derived.
5. The operational definition and detailed description of the intervention (including its intensity and duration).
6. Implementation details and program delivery personnel.
7. Description of what treatment the control condition received.
8. The operational definition and measurement of the outcome before and after the intervention.
9. The reliability and validity of outcome measures.
10. The follow-up period after the intervention.
11. Effect size, confidence intervals, statistical significance, and statistical methods used.
12. How independent and extraneous variables were controlled so that it was possible to disentangle the impact of the intervention or how threats to internal validity were ruled out.
13. Who knows what about the intervention.
14. Conflict of interest issues: who funded the intervention, and how independent were the researchers?
It would be desirable for professional associations, funding agencies, journal editors, and/or the Campbell Collaboration to get together to develop a checklist of items that must be included in all research reports on impact evaluations.
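As a purely hypothetical illustration of what such a checklist might look like in machine-readable form (the item names and the completeness check below are one possible encoding of the fourteen elements listed above, not an existing Campbell Collaboration instrument):

```python
# Hypothetical encoding of the fourteen reporting elements, so that a
# review team could flag what an evaluation report omits.
REPORTING_ITEMS = [
    "design", "units_and_settings", "sample_sizes_attrition",
    "hypotheses_and_theory", "intervention_definition",
    "implementation_details", "control_condition_treatment",
    "outcome_measurement", "outcome_reliability_validity",
    "follow_up_period", "effect_size_and_statistics",
    "control_of_extraneous_variables", "blinding",
    "conflicts_of_interest",
]

def missing_items(report: dict) -> list:
    """Return the checklist items a report fails to document."""
    return [item for item in REPORTING_ITEMS if not report.get(item)]

example_report = {"design": "randomized",
                  "sample_sizes_attrition": "n=120, 8% attrition"}
print(missing_items(example_report))  # everything else is undocumented
```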
Methodological Quality Scales
Methodological quality scales can be used in systematic reviews to determine criteria for inclusion or exclusion of studies in the review. Alternatively, they can be used (e.g., in a meta-analysis) in trying to explain differences in results between different evaluation studies. For example, Weisburd, Lum, and Petrosino (2001) found disparities between estimates of the effects of interventions from randomized experiments compared with quasi experiments. Weaker designs were more likely to find that an intervention was effective because in these designs, the intervention is confounded with other extraneous influences on offending.
There have been many prior attempts to devise scales of methodological quality for impact evaluations, especially in the medical sciences. Moher et al. (1995) identified twenty-five scales devised up to 1993 for assessing the quality of clinical trials. The first of these was constructed by Chalmers et al. (1981), and it included thirty items each scored from 0 to 10, designed to produce a total methodological quality score out of 100. The items with the highest weightings focused on how far the study was a double-blind trial (i.e., how far the participants and treatment professionals knew or did not know about the aims of the study). Unfortunately, with this kind of a scale, it is hard to know what meaning to attach to any score, and the same score can be achieved in many different ways.
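A toy example makes the last point concrete (the items and weights below are invented, not Chalmers et al.’s actual instrument): two studies earn the same total on a weighted additive scale while differing on exactly the features that matter most for causal inference.

```python
# Two hypothetical studies scored on an invented weighted additive scale.
# Both total 60/100, yet study A is randomized and study B is not.
study_a = {"randomization": 20, "blinding": 0, "attrition_handling": 15,
           "outcome_measurement": 15, "reporting_completeness": 10}
study_b = {"randomization": 0, "blinding": 20, "attrition_handling": 10,
           "outcome_measurement": 10, "reporting_completeness": 20}

print(sum(study_a.values()), sum(study_b.values()))  # 60 60
```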
Juni et al. (1999) compared these twenty-five scales to one another. Interestingly, interrater reliability was excellent for most scales, and agreement among the twenty-five scales was considerable (r = .72). The authors of sixteen scales defined a threshold for high quality, with the median threshold corresponding to 60 percent of the maximum score. The relationship between methodological quality and effect size varied considerably over the twenty-five scales. Juni et al. concluded that this was because some of these scales gave more weight to the quality of reporting, ethical issues, or the interpretation of results rather than to internal validity.
As an example of a methodological quality scale developed in the social sciences, Gibbs (1989) constructed a scale for assessing social work evaluation studies. This was based on fourteen items, which, when added up, produced a score from 0 to 100. Some of the items referred to the completeness of reporting of the study, while others (e.g., randomization, a no-treatment control group, sample sizes, construct validity of outcome, reliability of outcome measure, and tests of statistical significance) referred to methodological features.
The guidance offered by the Centre for Reviews and Dissemination (2001) of the U.K. National Health Service is intended to assist reviewers in the health field. A hierarchy of evidence is presented:
1. Randomized, controlled, double-blind trials.
2. Quasi-experimental studies (experiments without randomization).
3. Controlled observational studies (comparison of outcomes between participants who have received an intervention and those who have not).
4. Observational studies without a control group.
5. Expert opinion.
This guidance includes many methodological points and discussions about criteria of methodological quality, including key questions that reviewers should ask. The conclusions suggest that quality assessment primarily involves the appraisal of internal validity, that is, how far the design and analysis minimize bias; that a minimum quality threshold can be used to select studies for review; that quality differences can be used in explaining the heterogeneity of results; and that individual quality components are preferable to composite quality scores.
The SMS
The most influential methodological quality scale in criminology is the SMS, which was developed for large-scale reviews of what works or does not work in preventing crime (Sherman et al. 1998, 2002). The main aim of the SMS is to communicate to scholars, policy makers, and practitioners in the simplest possible way that studies evaluating the effects of criminological interventions differ in methodological quality. The SMS was largely based on the ideas of Cook and Campbell (1979).
In constructing the SMS, the Maryland researchers were particularly influenced by the methodological quality scale developed by Brounstein et al. (1997) in the National Structured Evaluation of Alcohol and Other Drug Abuse Prevention. These researchers rated each prevention program evaluation on ten criteria using a scale from 0 to 5: adequacy of sampling, adequacy of sample size, pretreatment measures of outcomes, adequacy of comparison groups, controls for prior group differences, adequacy of measurement of variables, attrition, postintervention measurement, adequacy of statistical analyses, and testing of alternative explanations. They also gave each program evaluation an overall rating from 0 (no confidence in results) to 5 (high confidence in results), with 3 indicating the minimum degree of methodological rigor for the reviewers to have confidence that the results were reasonably accurate. Only 30 percent out of 440 evaluations received a score of 3 to 5.
Brounstein et al. (1997) found that the interrater reliability of the overall quality score was high (.85), while the reliabilities for the ten criteria ranged from .56 (testing of alternative explanations) to .89 (adequacy of sample size). A principal component analysis of the ten criteria revealed a single factor reflecting methodological quality. The weightings of the items on this dimension ranged from .44 (adequacy of sample size) to .84 (adequacy of statistical analyses). In attempting to improve future evaluations, they recommended random assignment, appropriate comparison groups, preoutcome and postoutcome measures, the analysis of attrition, and assessment of the levels of dosage of the treatment received by each participant.
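Interrater reliability of this kind is simply the correlation between two raters’ scores over a set of evaluations. A minimal sketch (the ratings are invented):

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two raters' quality scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Invented ratings of ten evaluations by two independent raters (0-5 scale)
rater1 = [5, 4, 3, 4, 2, 5, 1, 3, 4, 2]
rater2 = [5, 3, 3, 4, 2, 4, 2, 3, 5, 2]
print(round(pearson_r(rater1, rater2), 2))  # ~.87, high agreement
```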
In constructing the SMS, the main aim was to devise a simple scale measuring internal validity that could easily be communicated. Thus, a simple 5-point scale was used rather than a summation of scores (e.g., from 0 to 100) on a number of specific criteria. It was intended that each point on the scale should be understandable, and the scale is as follows (see Sherman et al. 1998):
Level 1: correlation between a prevention program and a measure of crime at one point in time (e.g., areas with CCTV have lower crime rates than areas without CCTV).
This design fails to rule out many threats to internal validity and also fails to establish causal order.
Level 2: measures of crime before and after the program, with no comparable control condition (e.g., crime decreased after CCTV was installed in an area).
This design establishes causal order but fails to rule out many threats to internal validity. Level 1 and level 2 designs were considered inadequate and uninterpretable by Cook and Campbell (1979).
Level 3: measures of crime before and after the program in experimental and comparable control conditions (e.g., crime decreased after CCTV was installed in an experimental area, but there was no decrease in crime in a comparable control area).
As mentioned, this was considered to be the minimum interpretable design by Cook and Campbell (1979), and it is also regarded as the minimum design that is adequate for drawing conclusions about what works in the book Evidence-Based Crime Prevention (Sherman et al. 2002). It rules out many threats to internal validity, including history, maturation/trends, instrumentation, testing effects, and differential attrition. The main problems with it center on selection effects and regression to the mean (because of the nonequivalence of the experimental and control conditions).
Level 4: measures of crime before and after the program in multiple experimental and control units, controlling for other variables that influence crime (e.g., victimization of premises under CCTV surveillance decreased compared to victimization of control premises, after controlling for features of premises that influenced their victimization).
This design has better statistical control of extraneous influences on the outcome and hence deals with selection and regression threats more adequately.
Level 5: random assignment of program and control conditions to units (e.g., victimization of premises randomly assigned to have CCTV surveillance decreased compared to victimization of control premises).
Providing that a sufficiently large number of units are randomly assigned, those in the experimental condition will be equivalent (within the limits of statistical fluctuation) to those in the control condition on all possible extraneous variables that influence the outcome. Hence, this design deals with selection and regression problems and has the highest possible internal validity.
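The five levels can be summarized as a simple decision rule. The sketch below is one possible encoding (the argument names are the sketch’s own invention; the SMS itself is defined in prose by Sherman et al. 1998, not in code).

```python
def sms_level(randomized: bool,
              multiple_units_with_controls: bool,
              comparable_control: bool,
              before_and_after: bool) -> int:
    """One possible encoding of the Maryland Scientific Methods Scale."""
    if randomized:
        return 5  # random assignment of conditions to units
    if multiple_units_with_controls:
        return 4  # multiple units, statistical control of other variables
    if comparable_control and before_and_after:
        return 3  # before-after measures in experimental and control areas
    if before_and_after:
        return 2  # before-after measures, no comparable control
    return 1      # a single correlation at one point in time

# A before-and-after CCTV study with one comparable control area:
print(sms_level(randomized=False, multiple_units_with_controls=False,
                comparable_control=True, before_and_after=True))  # -> 3
```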
While randomized experiments in principle have the highest internal validity, in practice, they are relatively uncommon in criminology and often have implementation problems (Farrington 1983; Weisburd 2000). In light of the fact that the SMS as defined above focuses only on internal validity, all evaluation projects were also rated on statistical conclusion validity and on construct validity. Specifically, the following four aspects of each study were rated:
Statistical conclusion validity
1. Was the statistical analysis appropriate?
2. Did the study have low statistical power to detect effects because of small samples?
3. Was there a low response rate or differential attrition?
Construct validity
4. What was the reliability and validity of measurement of the outcome?
If there was a serious problem in any of these areas