WEEK 2: Assessment Purpose, Strengths, and Weaknesses
Overview: In Week 2, you will explore the purpose of assessment. You will examine the advantages and disadvantages of various assessments used in many K–12 settings. Education is driven by outcomes. As a leader in education, you will be expected to analyze the goals and assumptions inherent in different assessment instruments as part of evaluating curriculum and instruction.
Part A – Mastery-Level Grading or Standards-Based Grading
Brookhart and Nitko (2019) suggest that increased teaching, greater teacher effort, and more effective work constitute “positive test preparation” because these actions result in increased student learning (p. 779). The authors, however, discourage high-stakes test preparation strategies such as coaching and the reallocation of instructional time and resources, because these strategies narrow the scope of what is taught in classrooms to cover only sample test items.
In today’s high-stakes testing environment, consider how you, as a K–12 administrator, would respond to Brookhart and Nitko’s assertions.
Write a 250- to 300-word response to the following:
Describe how Brookhart and Nitko’s assertion might apply to your school or school district.
Share specific examples of how you might work with faculty to provide a learning environment that is not centered on high-stakes testing. Include at least 3 steps that must be taken to ensure that this conversation with your faculty results in improved teaching and learning at the classroom level.
Reference
Brookhart, S. M., & Nitko, A. J. (2019). Educational assessment of students (8th ed.). Pearson Education, Inc.
Part B – Article Synthesis
Select and read 2 of the Week 2 articles provided in the attachments to this post.
Write a 250- to 300-word response to the following:
Describe the types of studies conducted.
Describe the population and sample size of the studies.
Compare the conclusions drawn by the authors relevant to the use of letter grades, standards-based grading, and/or mastery-based grading.
Provide citations according to APA guidelines.
Part C – Summative Assessment: Assumptions and Goals of Assessment Tools
Exam Content
As the leader of your school district’s assessment and evaluation team, you have been asked to share information about two formal assessment tools with parents, teachers, administrators, and the larger community.
Identify 2 formal assessment tools used in a school district of your choice. Select assessments relevant to your current placement and/or your doctoral studies. For example, secondary teachers might consider the ACT, SAT, and/or state-mandated proficiency tests.
Create a 12- to 16-slide presentation providing an analysis of each assessment. Your presentation should:
Include speaker notes with your presentation. Speaker notes should be detailed and thoroughly cited with references.
Include a minimum of 5 peer-reviewed scholarly references with a copy and/or link to the assessments you reviewed.
WEEK 2 Learning Activities
Educational Assessment of Students, Ch. 7
Read Ch. 7, “Diagnostic and Formative Assessments.”
Educational Assessment of Students, Ch. 16
Read Ch. 16, “Standardized Achievement Tests.”
REQUIRED
Brookhart, S. M., Guskey, T. R., Bowers, A. J., McMillan, J. H., Smith, J. K., Smith, L. F., Stevens, M. T., & Welsh, M. E. (2016). A century of grading research: Meaning and value in the most common educational measure. Review of Educational Research, 86(4), 803–848.
Klugman, E. M., & Ho, A. D. (2020). How can released state test items support interim assessment purposes in an educational crisis? Educational Measurement: Issues and Practice, 39(3), 65–69.
Scarlett, M. H. (2018). “Why did I get a C?”: Communicating student performance using standards-based grading. InSight: A Journal of Scholarly Teaching, 13, 59–75.
Review of Educational Research, December 2016, Vol. 86, No. 4, pp. 803–848
DOI: 10.3102/0034654316672069
© 2016 AERA. http://rer.aera.net

A Century of Grading Research: Meaning and Value in the Most Common Educational Measure

Susan M. Brookhart, Duquesne University
Thomas R. Guskey, University of Kentucky
Alex J. Bowers, Teachers College, Columbia University
James H. McMillan, Virginia Commonwealth University
Jeffrey K. Smith and Lisa F. Smith, University of Otago
Michael T. Stevens and Megan E. Welsh, University of California at Davis

Grading refers to the symbols assigned to individual pieces of student work or to composite measures of student performance on report cards. This review of over 100 years of research on grading considers five types of studies: (a) early studies of the reliability of grades, (b) quantitative studies of the composition of K–12 report card grades, (c) survey and interview studies of teachers’ perceptions of grades, (d) studies of standards-based grading, and (e) grading in higher education. Early 20th-century studies generally condemned teachers’ grades as unreliable. More recent studies of the relationships of grades to tested achievement and survey studies of teachers’ grading practices and beliefs suggest that grades assess a multidimensional construct containing both cognitive and noncognitive factors reflecting what teachers value in student work. Implications for future research and for grading practices are discussed.
Keywords: grading, classroom assessment, educational measurement

Grading refers to the symbols assigned to individual pieces of student work or to composite measures of student performance on student report cards. Grades or marks, as they were referred to in the first half of the 20th century, were the focus of some of the earliest educational research. Grading research history parallels the history of educational research more generally, with studies becoming both more rigorous and sophisticated over time. Grading is important to study because of the centrality of grades in the educational experience of all students. Grades are widely perceived to be what students “earn” for their achievement (Brookhart, 1993, p. 139), and have pervasive influence on students and schooling (Pattison, Grodsky, & Muller, 2013). Furthermore, grades predict important future educational consequences, such as dropping out of school (Bowers, 2010a; Bowers & Sprott, 2012; Bowers, Sprott, & Taff, 2013), applying and being admitted to college, and college success (Atkinson & Geiser, 2009; Bowers, 2010a; Thorsen & Cliffordson, 2012). Grades are especially predictive of academic success in more open admissions higher education institutions (Sawyer, 2013).

Purpose of This Review, and Research Question

This review synthesizes findings from five types of grading studies: (a) early studies of the reliability of grades on student work, (b) quantitative studies of the composition of K–12 report card grades and related educational outcomes, (c) survey and interview studies of teachers’ perceptions of grades and grading practices, (d) studies of standards-based grading (SBG) and the relationship between students’ report card grades and large-scale accountability assessments, and (e) grading in higher education. The central question underlying all of these studies is, “What do grades mean?” In essence, this is a validity question (Kane, 2006; Messick, 1989). It concerns whether evidence supports the intended meaning and use of grades as an educational measure. To date, several reviews have given partial answers to that question, but none of these reviews synthesize 100 years of research from five types of studies. The purpose of this review is to provide a more comprehensive and complete answer to the research question, “What do grades mean?”

Background

The earliest research on grading concerned mostly the reliability of grades teachers assigned to students’ work. The earliest investigation of which the authors are aware was published in the Journal of the Royal Statistical Society. Edgeworth (1888) applied the “theory of errors” (p. 600) based on normal curve theory to the case of grading examinations. He described three different sources of error: (a) chance; (b) personal differences among graders regarding the whole exam (severity or leniency and speed) and individual items on the exam, now referred to as task variation; and (c) “taking his [the examinee’s] answers as representative of his proficiency” (p. 614), now referred to as generalizing to the domain.
In parsing these sources of error, Edgeworth went beyond simple chance variation in grades to treat grades as subject to multiple sources of variation or error. This nuanced view, which was quite advanced for its time, remains useful today. Edgeworth pointed out the educational consequences of unreliability in grading, especially in awarding diplomas, honors, and other qualifications to students. He used this point to build an argument for improving reliability. Today, the existence of unintended adverse consequences is also an argument for improving validity (Messick, 1989).

During the 19th century, student progress reports were presented to parents orally by the teacher during a visit to a student’s home, with little standardization of content. Oral reports were eventually abandoned in favor of written narrative descriptions of how students were performing in certain skills like penmanship, reading, or arithmetic (Guskey & Bailey, 2001). In the 20th century, high school student populations became so diverse and subject area instruction so specific that high schools sought a way to manage the increasing demands and complexity of evaluating student progress (Guskey & Bailey, 2001). Although elementary schools maintained narrative descriptions, high schools increasingly favored percentage grades because the completion of narrative descriptions was viewed as time-consuming and lacking cost-effectiveness (Farr, 2000). One could argue that this move to percentage grades eliminated the specific communication of what students knew and could do.

Reviews by Crooks (1933), A. Z. Smith and Dobbin (1960), and Kirschenbaum, Napier, and Simon (1971) debated whether grading should be norm- or criterion-referenced, based on clearly defined standards for student learning. Although high schools tended to stay with norm-referenced grades to accommodate the need for ranking students for college admissions, some elementary school educators transitioned to what was eventually called mastery learning and then standards-based education. Based on studies of grading reliability (F. J. Kelly, 1914; Rugg, 1918), in the 1920s, teachers began to adopt grading systems with fewer and broader categories (e.g., the A–F scale). Still, variation in grading practices persisted. Hill (1935) found variability in the frequency of grade reports, ranging from 2 to 12 times per year, and a wide array of grade reporting practices. Of 443 schools studied, 8% employed descriptive grading, 9% percentage grading, 31% percentage-equivalent categorical grading, 54% categorical grading that was not percentage-equivalent, and 2% “gave a general rating on some basis such as ‘degree to which the pupil is working to capacity’” (Hill, 1935, p. 119). By the 1940s, more than 80% of U.S. schools had adopted the A–F grading scale, and it has remained the most commonly used scale to the present day. Current grading reforms move in the direction of SBG, a relatively new and increasingly common practice (Grindberg, 2014) in which grades are based on standards for achievement. In SBG, work habits and other nonachievement factors are reported separately from achievement (Guskey & Bailey, 2010).

Method

Literature searches for each of the five types of studies were conducted by different groups of coauthors, using the same general strategy:
(a) a keyword search of electronic databases, (b) review of abstracts against criteria for the type of study, (c) a full read of studies that met criteria, and (d) a snowball search using the references from qualified studies. All searches were limited to articles published in English.

To identify studies of grading reliability, electronic searches using the terms “teachers’ marks (or marking)” and “teachers’ grades (or grading)” were conducted in the following databases: ERIC, the Journal of Educational Measurement, Educational Measurement: Issues and Practice, ProQuest’s Periodicals Index Online, and the Journal of Educational Research. The criterion for inclusion was that the research addressed individual pieces of student work (usually examinations), not composite report card grades. Sixteen empirical studies were found (Table 1).

To identify studies of grades and related educational outcomes, search terms included “(grades OR marks) AND (model* OR relationship OR correlation OR association OR factor).” Databases searched included JSTOR, ERIC, and Educational Full Text Wilson Web. Criteria for inclusion were that the study (a) examined the relationship of K–12 grades to schooling outcomes, (b) used quantitative methods, and (c) examined data from actual student assessments rather than teacher perspectives on grading. Forty-one empirical studies were identified (Tables 2, 3, and 4).

For studies of K–12 teachers’ perspectives about grading and grading practices, the search terms used were “grade(s),” “grading,” and “marking” with “teacher perceptions,” “teacher practices,” and “teacher attitudes.” Databases searched included ERIC, Education Research Complete, Dissertation Abstracts, and Google Scholar. Criteria for inclusion were that the study topic was K–12 teachers’ perceptions of grading and grading practices and that the study was published since 1994 (the date of Brookhart’s previous review). Thirty-five empirical studies were found (31 are presented in Table 5, and four that investigated SBG are in Table 6).

The search for studies of SBG used the search terms “standards” and (“grades” or “reports”) and “education.” Databases searched included PsycINFO, PsycARTICLES, ERIC, and Education Source. The criterion for inclusion was that articles needed to address SBG. Eight empirical studies were identified (Table 6).

For studies of grading in higher education, search terms included “grades” or “grading,” combined with “university,” “college,” and “higher education” in the title. Databases searched included EBSCO Education Research Complete, ERIC, and ProQuest (Education Journals). The inclusion criterion was that the study investigated grading practices in higher education. University websites in 12 different countries were also consulted to allow for international comparisons. Fourteen empirical studies were found (Table 7).

Results

Grading Reliability

Table 1 displays the results of studies on the reliability of teachers’ grades. The main finding was that great variation exists in the grades teachers assign to students’ work (Ashbaugh, 1924; Brimi, 2011; Eells, 1930; Healy, 1935; Hulten, 1925; F. J. Kelly, 1914; Lauterbach, 1928; Rugg, 1918; Silberstein, 1922; Sims, 1933; Starch, 1913, 1915; Starch & Elliott, 1912, 1913a, 1913b).
TABLE 1. Early studies of the reliability of grades

Ashbaugh (1924). Method: descriptive statistics. Sample: 55 seniors and graduate students in Education grading 1 seventh-grade arithmetic paper. Main findings:
● Grading the same paper on 3 occasions, the mean remained constant but the distribution narrowed
● Grader inconsistency over time; grades more variable on Occasion 2 than Occasion 3
● After presenting results to the class and discussing the problems and the students’ work, graders devised a point scheme for each problem and grading variability decreased

Bolton (1927). Method: descriptive statistics. Sample: 22 sixth-grade teachers of arithmetic in one district, grading 24 papers. Main findings:
● Teachers are consistent with one another in their ratings
● Average deviation was 5.1 (out of 100)
● Greater variability for lowest-quality work (level of work as a source of variation)

Brimi (2011). Method: descriptive statistics. Sample: 73 English teachers grading one essay. Main findings:
● Range of scores was 46 points and covered all five letter grade levels (A, B, C, D, F)

Eells (1930). Method: intrarater reliability; correlation of Time 1 and Time 2, with an 11-week interval. Sample: 61 teachers in a measurement course, grading 3 elementary geography and 2 history questions. Main findings:
● Teacher inconsistency over time a major source of variation
● Estimated reliability ranged from .25 to .51
● Variability lowest for one very poor paper (level of work as a source of variation)

Healy (1935). Method: descriptive statistics. Sample: 175 sixth-grade compositions from 50 different teachers, one each of Excellent, Superior, Average, Poor, Failure, reanalyzed by trained judges. Main findings:
● Format and usage errors weighed more heavily in teachers’ grades than the quality of ideas (relative emphasis of criteria as a source of variation in grades)

Hulten (1925). Method: intrarater reliability; descriptive statistics for Time 1 and Time 2, with a 2-month interval. Sample: 30 English teachers grading 5 compositions. Main findings:
● Teacher inconsistency over time
● 20% of compositions changed from pass to fail or vice versa on the second marking

Jacoby (1910). Method: descriptive statistics. Sample: 6 astronomy professors marking 11 exams. Main findings:
● Little variability in grades
● Student work quality was high

Lauterbach (1928). Method: descriptive statistics. Sample: 57 teachers grading 120 papers (30 papers per teacher, half handwritten and half typed). Main findings:
● Student work quality was a source of variation in grades
● In absolute terms, there was much variation by teacher for each paper
● In relative terms, teachers’ marks reliably ranked students

Shriner (1930). Method: descriptive statistics. Sample: 25 high school English teachers and 25 algebra teachers, grading 25 exams each (English and algebra, respectively). Main findings:
● Teachers’ grading was reliable
● Median correlations of each teacher’s grade with the average grade for each paper were .946 (algebra) and .917 (English)
● Greater teacher variability in grades for the poorer papers

Silberstein (1922). Method: descriptive statistics. Sample: 31 teachers grading 1 English paper that originally passed in high school (73%) but was failed by the Regents (59%). Main findings:
● When teachers regraded the same paper, they changed their grade
● Scores on individual questions on the exam were very variable and explained the overall grading variation, except for one question about syntax, where grades were more uniform

Sims (1933). Method: descriptive statistics. Sample: reanalysis of four data sets: 21 teachers grading 24 arithmetic papers; 25 teachers grading 25 algebra papers; 25 teachers grading 25 high school English exams; and 9 readers grading 20 psychology exams. Main findings:
● Two kinds of variability in teachers’ grades: (a) differences in students’ work quality and (b) “differences in the standards of grading found among school systems and among teachers within a system” (p. 637)
● Teacher variability in assigning grades was large
● Variability in marks was reduced by converting scores to grades

Starch (1913). Method: descriptive statistics. Sample: 10 instructors grading 10 freshman English exams. Main findings:
● Teacher variability was large, and largest for the two poorest papers
● Isolated four sources of variation and reported probable error (p. 632; total probable error [pe] = 5.4 out of 100): (a) differences among the standards of different schools (pe almost 0), (b) differences among the standards of different teachers (pe = 1.0), (c) differences in the relative values placed by different teachers on various elements in a paper, including content and form (pe = 2.1), and (d) differences due to the pure inability to distinguish between closely allied degrees of merit (pe = 2.2)

Starch (1915). Method: descriptive statistics. Sample: 12 teachers grading 24 sixth- and seventh-grade compositions. Main findings:
● Average teacher variability of 4.2 (out of 100) was reduced to 2.8 by forcing a normal distribution using a 5-category scale (poor, inferior, medium, superior, and excellent)

Starch and Elliott (1912). Method: descriptive statistics. Sample: 142 high school English teachers grading 2 exams. Main findings:
● Teacher variability in assigning grades was large (a range of 30–40 out of 100 points; pe = 4.0 and 4.8, respectively)
● Teacher variability in the relative sense, as well

Starch and Elliott (1913a). Method: descriptive statistics. Sample: 138 high school mathematics teachers grading 1 geometry exam. Main findings:
● Teacher variability was larger than for the English papers in Starch and Elliott (1912): pe = 7.5
● Grade for 1 answer varies about as widely as the composite grade for the whole exam

Starch and Elliott (1913b). Method: descriptive statistics. Sample: 122 high school history teachers grading 1 exam. Main findings:
● Teacher variability was larger than for the English or math exams (Starch & Elliott, 1912, 1913a): pe = 7.7
● Concluded that variability is due not to subject but to “the examiner and method of examination” (p. 680)
TABLE 2. Studies of the relation of K–12 report card grades and tested achievement

Brennan, Kim, Wenz-Gross, and Siperstein (2001). Method: correlation. Sample: 736 eighth-grade students. Main findings: compared Massachusetts Comprehensive Assessment System standardized state reading test scores to grades in mathematics, English, and science, r = .54–.59.

Carter (1952). Method: correlation. Sample: 235 high school students. Main findings: grades and standardized algebra achievement scores, r = .52.

Duckworth, Quinn, and Tsukayama (2012). Method: structural equation modeling. Sample: (a) 1,364 ninth-grade students; (b) 510 eighth-grade students. Main findings: standardized reading and mathematics test scores compared to GPA, r = .62–.66; engagement and persistence are mediated through teacher evaluations of student conduct and homework completion.

Duckworth and Seligman (2006). Method: correlation. Sample: 140 eighth-grade students. Main findings: GPA and 2003 TerraNova Second Edition/California Achievement Test, r = .66.

McCandless, Roberts, and Starnes (1972). Method: correlation. Sample: 433 seventh-grade students. Main findings: grades and Metropolitan Achievement Test scores, r = .31, accounting for socioeconomic status, ethnicity, and gender.

Moore (1939). Method: correlation. Sample: 200 fifth- and sixth-grade students. Main findings: grades and Stanford Achievement Test, r = .61.

Pattison, Grodsky, and Muller (2013). Method: correlation. Sample: four U.S. nationally representative data sets of over 10,000 students each (National Longitudinal Study of the High School Class of 1972; High School and Beyond sophomore cohort; National Educational Longitudinal Study of 1988; Educational Longitudinal Study of 2002). Main findings: high school GPA compared to reading (r = .46–.54) and mathematics (r = .52–.64) standardized tests.

Unzicker (1925). Method: correlation. Sample: 425 seventh- through ninth-grade students. Main findings: average grades across English, mathematics, and history correlated .47 with the Otis intelligence test.

Woodruff and Ziomek (2004). Method: correlation. Sample: about 700,000 high school students each year, 1991–2003. Main findings: self-reported GPA and ACT composite scores, r = .56–.58; self-reported mathematics grades and ACT scores, r = .54–.57; self-reported English grades and ACT scores, r = .45–.50.
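The coefficients in Table 2 are Pearson product-moment correlations between paired grade and test-score measures. The following minimal sketch shows how such a coefficient is computed; the data are invented for illustration and do not come from any study in this review.

```python
import math

# Invented, illustrative data: one (GPA, test score) pair per student.
gpas = [2.1, 3.4, 2.8, 3.9, 3.1, 2.5, 3.7, 2.9]
tests = [48, 71, 55, 80, 69, 50, 66, 62]

n = len(gpas)
mean_g = sum(gpas) / n
mean_t = sum(tests) / n

# Pearson r: covariance divided by the product of the standard deviations.
cov = sum((g - mean_g) * (t - mean_t) for g, t in zip(gpas, tests))
var_g = sum((g - mean_g) ** 2 for g in gpas)
var_t = sum((t - mean_t) ** 2 for t in tests)
r = cov / math.sqrt(var_g * var_t)

print(f"r = {r:.2f}")  # on real data, the Table 2 studies report r of roughly .31 to .66
```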
TABLE 3. Studies of K–12 report card grades as multidimensional measures of academic knowledge, engagement, and persistence

Bowers (2009). Method: multidimensional scaling. Sample: 195 high school students. Main findings: grades were multidimensional, separating core subject and noncore grades versus state standardized assessments in science, mathematics, and reading and the ACT.

Bowers (2011). Method: multidimensional scaling. Sample: 4,520 high school students from the Educational Longitudinal Study of 2002. Main findings: three-factor structure: (a) a cognitive factor that describes the relationship between tests and core subject grades, (b) an engagement factor between core subject grades and noncore subject grades, and (c) a factor that described the difference between grades in art and physical education.

Casillas et al. (2012). Method: correlation; hierarchical linear modeling. Sample: 4,660 seventh and eighth graders. Main findings: 25% of the explained variance in GPAs was attributable to the standardized assessments; academic discipline and commitment to school were strongly related to GPA.

Farkas, Grobe, Sheehan, and Shuan (1990). Method: regression. Sample: 486 eighth graders and their teachers. Main findings: student work habits were the strongest noncognitive predictors of grades.

S. Kelly (2008). Method: hierarchical linear modeling. Sample: 1,653 sixth-, seventh-, and eighth-grade students. Main findings: positive and significant effects of students’ substantive engagement on subsequent grades but no relationship with procedural engagement.

Klapp Lekholm and Cliffordson (2008). Method: structural equation modeling. Sample: 99,070 Swedish students. Main findings: grades consisted of two major factors: (a) a cognitive achievement factor and (b) a noncognitive “common grade dimension.”

Klapp Lekholm and Cliffordson (2009); Klapp Lekholm (2011). Method: factor analysis; structural equation modeling. Sample: 99,070 Swedish students. Main findings: the cognitive achievement factor of grades consists of student self-perception of competence, self-efficacy, coping strategies, and subject-specific interest; the noncognitive factor consists of motivation and a general interest in school.

Miner (1967). Method: factor analysis. Sample: 671 high school students. Main findings: examined academic grades in first, third, sixth, ninth, and twelfth grades; achievement tests in fifth, sixth, and ninth grades; and citizenship grades in first, third, and sixth grades; a three-factor solution was identified: (a) objective achievement, (b) a behavior factor, and (c) high school achievement as measured through grades.

Sobel (1936). Method: descriptive. Sample: not reported. Main findings: students categorized into three groups based on comparing grades and achievement test levels: grade-superior, middle-group, mark-superior.

Thorsen and Cliffordson (2012). Method: structural equation modeling. Sample: all Grade 9 students in Sweden, 99,085 (2003), 105,697 (2004), 108,753 (2005). Main findings: generally replicated Klapp Lekholm and Cliffordson (2009).

Thorsen (2014). Method: structural equation modeling. Sample: 3,855 students in Sweden. Main findings: generally replicated Klapp Lekholm and Cliffordson (2009) in examining norm-referenced grades.

Willingham, Pollack, and Lewis (2002). Method: regression. Sample: 8,454 students from 581 schools. Main findings: a moderate relationship between grades and tests was identified, as well as strong positive relationships between grades and student motivation, engagement, completion of work assigned, and persistence.
TABLE 4. Studies of grades as predictors of educational outcomes

Alexander, Entwisle, and Kabbani (2001). Method: regression. Sample: 301 Grade 9 students. Main findings: student background, grade retention, academic performance, and behavior strongly related to dropping out.

Allensworth and Easton (2007). Method: descriptive; regression. Sample: 24,894 first-time ninth-grade students in Chicago. Main findings: GPA and failing a course in early high school strongly predict dropout.

Allensworth, Gwynne, Moore, and de la Torre (2014). Method: descriptive; regression. Sample: 19,963 Grade 8 Chicago students. Main findings: middle school grades and attendance are stronger predictors of high school performance than test scores, and middle school grades are a strong predictor of students being on or off track for high school success.

Balfanz, Herzog, and MacIver (2007). Method: regression. Sample: 12,972 sixth-grade students from Philadelphia. Main findings: predictors of dropping out of high school included failing mathematics or English, low attendance, and poor behavior.

Barrington and Hendricks (1989). Method: analysis of variance; correlation. Sample: 214 high school students. Main findings: GPA, number of low grades, intelligence test scores, and student mobility significantly predicted dropout.

Bowers (2010a). Method: cluster analysis. Sample: 188 students tracked from Grade 1 through high school. Main findings: longitudinal low-grade clusters across all types of course subjects correlated with dropping out and not taking the ACT.

Bowers (2010b). Method: regression. Sample: 193 students tracked from Grade 1 through high school. Main findings: receiving low grades (D or F) and being retained in grade strongly related to dropping out.

Bowers and Sprott (2012). Method: growth mixture modeling. Sample: 5,400 Grade 10 students from the Education Longitudinal Study of 2002. Main findings: noncumulative GPA trajectories in early high school were strongly predictive of dropping out.

Bowers, Sprott, and Taff (2013). Method: receiver operating characteristic analysis. Sample: 110 dropout flags from 36 previous studies. Main findings: dropout flags focusing on GPA were some of the most accurate dropout flags across the literature.

Cairns, Cairns, and Neckerman (1989). Method: cluster analysis; regression. Sample: 475 Grade 7 students. Main findings: beyond student demographics, student aggressiveness and low levels of academic performance associated with dropping out.

Cliffordson (2008). Method: two-level modeling. Sample: 164,106 Swedish students. Main findings: grades predict achievement in higher education more strongly than the Swedish Scholastic Aptitude Test, and criterion-referenced grades predict slightly better than norm-referenced grades.

Ekstrom, Goertz, Pollack, and Rock (1986). Method: regression. Sample: High School and Beyond survey, 30,000 high school sophomores. Main findings: grades and problem behavior identified as the most important variables for identifying dropping out, higher than test scores.

Ensminger and Slusarcick (1992). Method: regression. Sample: 1,242 first graders from a historically disadvantaged community. Main findings: low grades and aggressive behavior related to eventually dropping out, with low SES negatively moderating the relationships.

Fitzsimmons, Cheever, Leonard, and Macunovich (1969). Method: correlation. Sample: 270 high school students. Main findings: students receiving low grades (D or F) in elementary or middle school were at much higher risk of dropping out.

Jimerson, Egeland, Sroufe, and Carlson (2000). Method: regression. Sample: 177 children tracked from birth through age 19. Main findings: home environment, quality of parent caregiving, academic achievement, student problem behaviors, peer competence, and intelligence test scores significantly related to dropping out.

Lloyd (1978). Method: regression. Sample: 1,532 third-grade students. Main findings: dropping out significantly predicted by grades and marks.

Morris, Ehren, and Lenz (1991). Method: correlation; chi-square. Sample: 785 students in Grades 7 through 12. Main findings: dropping out predicted by absences, low grades (D or F), and mobility.

Roderick and Camburn (1999). Method: regression. Sample: 27,612 Chicago ninth graders. Main findings: examined significant predictors of course failure, including low attendance, and found failure rates varied significantly at the school level.

Troob (1985). Method: descriptive. Sample: 21,000 New York City high school students. Main findings: low grades and high absences corresponded to higher levels of dropping out.
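Several Table 4 studies (e.g., Bowers, Sprott, & Taff, 2013) evaluate “dropout flags” with receiver operating characteristic (ROC) analysis, which weighs how many actual dropouts a flag catches against how many graduates it wrongly flags. A minimal sketch of that accuracy computation, on invented data rather than any study’s records, might look like this:

```python
# Illustrative only: invented (final GPA, dropped_out) records,
# not data from any study in Table 4.
students = [
    (1.4, True), (3.2, False), (2.0, True), (3.6, False),
    (1.8, True), (2.9, False), (2.4, False), (1.1, True),
    (2.6, False), (2.2, True), (3.0, False), (1.9, False),
]

# A simple dropout flag of the kind ROC analysis evaluates:
# flag any student whose GPA falls below a chosen threshold.
threshold = 2.5
flagged = [(gpa < threshold, dropped) for gpa, dropped in students]

# True-positive rate: share of actual dropouts the flag catches.
# False-positive rate: share of graduates the flag wrongly catches.
tp = sum(1 for f, d in flagged if f and d)
fp = sum(1 for f, d in flagged if f and not d)
dropouts = sum(1 for _, d in students if d)
graduates = len(students) - dropouts

print(f"TPR = {tp / dropouts:.2f}, FPR = {fp / graduates:.2f}")
# Sweeping the threshold over all values traces the ROC curve; a more
# accurate flag pushes TPR up while holding FPR down.
```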
TABLE 5. Studies of teachers’ grading practices and perceptions

Adrian (2012). Method: mixed methods. Sample: 86 elementary teachers. Main findings:
● Approximately 20% of teachers thought that effort, behavior, and homework should be included in standards-based grading
● Few thought that it was not appropriate to reduce grades for late assignments

Bailey (2012). Method: survey; descriptive. Sample: 307 secondary teachers. Main findings: teachers used a variety of factors in grading, with social studies and male teachers emphasizing effort more than other groups, science teachers emphasizing effort least, and female teachers emphasizing behavior more than male teachers.

Bonner and Chen (2008). Method: survey; scenarios; descriptive. Sample: 222 teacher candidates. Main findings: grading perceptions, based on instructional style, focused on equity, consistency, accuracy, and fairness, using nonachievement factors to obtain the highest grades possible.

Cizek, Fitzgerald, and Rachor (1995). Method: survey; descriptive. Sample: 143 elementary and secondary teachers. Main findings:
● With few differences based on grade level or years of experience, teachers used both objective and subjective factors, synthesizing information to enhance the likelihood of achieving high grades
● Significant diversity in grading practices
● Little awareness of district grading policies

Cross and Frary (1999). Method: survey; descriptive. Sample: 307 middle and high school teachers. Main findings:
● Teachers variously combined achievement, effort, behavior, improvement, and attitudes to assign grades, and reported that “ideal” grading should include noncognitive factors
● Most teachers agreed that effort, conduct, and achievement should be reported separately

Duncan and Noonan (2007). Method: survey; factor analysis. Sample: 77 high school math teachers. Main findings:
● Achievement and academic enabling factors, such as effort and ability, were identified as most important for grading, with significant variation among teachers
● Nonachievement factors considered by most teachers
● Frame of reference for grading was mixed: mostly criterion-referenced, some self-referenced based on improvement, some norm-referenced

Frary, Cross, and Weber (1993). Method: survey; descriptive. Sample: 536 secondary teachers. Main findings: up to 70% of teachers agreed that ability, effort, and improvement should be used for grading.

Grimes (2010). Method: survey; descriptive. Sample: 199 middle school teachers. Main findings: grades should be based on both achievement and nonachievement factors, including improvement, mastery, and effort.

Guskey (2002). Method: survey; descriptive. Sample: 94 elementary and 112 secondary teachers. Main findings:
● 70% of teachers reported an ideal grade distribution of 41% As, 29% Bs, and 19% Cs, but with significant variation
● Teachers wanted students to obtain the highest grade possible
● Highest ranked purpose was to communicate to parents, then to use as feedback to students
● Multiple factors used to determine grades, including homework, effort, and progress

Guskey (2009a). Method: survey; descriptive. Sample: 513 elementary and secondary teachers. Main findings:
● Significant variation in grading practices and issues were reported
● Most agreed learning occurs without grading
● 50% averaged multiple scores to determine grades
● 73% based grades on criteria, not norms
● Grades used for communication with students and parents

Hay and Macdonald (2008). Method: interviews and observations. Sample: two high school teachers. Main findings: teachers’ values and experience influenced internalization of criteria important for grading, resulting in varied practices.

Imperial (2011). Method: survey; descriptive. Sample: 411 high school teachers. Main findings:
● Teachers reported a wide variety of grading practices; whereas the primary purpose was to indicate achievement, about half used noncognitive factors
● Grading was unrelated to training received in recommended grading practices

Kunnath (2016). Method: mixed methods. Sample: 251 high school teachers. Main findings:
● Teachers used both objective achievement results and subjective factors in grading
● Teachers incorporated individual circumstances to promote the highest grades possible
● Grading was based on teachers’ philosophy of teaching

Liu (2008b). Method: survey; multivariate analyses. Sample: 52 middle and 55 high school teachers. Main findings:
● Most teachers used effort, ability, and attendance/participation in grading, with few differences between grade levels
● 40% used classroom behavior
● 90% used effort
● 65% used ability
● 75% used attendance/participation

Liu (2008a). Method: survey; factor analysis. Sample: 300 middle and high school teachers. Main findings: six components in grading were confirmed: importance/value; feedback for motivation, instruction, and improvement; effort/participation; ability and problem solving; comparisons/extra credit; and grading self-efficacy/ease/confidence/accuracy.

Llosa (2008). Method: survey; factor analysis; verbal protocol analysis. Sample: 1,224 elementary teachers. Main findings:
● While showing variations in interpreting English proficiency standards, teachers’ grading supported valid summative judgments though weak formative use for improving instruction
● Teachers incorporated student personality and behavior in grading

McMillan (2001). Method: survey; descriptive; factor analysis. Sample: 1,483 middle and high school teachers. Main findings:
● Significant variation in weight given to different factors, with a high percentage of teachers using noncognitive factors
● Four components of grading were identified: academic enabling noncognitive factors, achievement, external comparisons, and use of extra credit, with significant variation among teachers

McMillan and Lawson (2001). Method: survey; descriptive. Sample: 213 secondary science teachers. Main findings: teachers reported use of both cognitive and noncognitive factors in grading, especially effort.

McMillan, Myran, and Workman (2002). Method: survey; factor analysis. Sample: 901 elementary school teachers. Main findings:
● Five components were confirmed, including academic enablers such as improvement and effort, extra credit, achievement, homework, and external comparisons
● 70% indicated use of effort, improvement, and ability
● No differences between math and language arts teachers
● High variability in how much different factors are weighted

McMillan and Nash (2000). Method: interviews. Sample: 24 elementary and secondary math and English teachers. Main findings: teaching philosophy and student effort that improves motivation and learning were very important considerations for grading.

Randall and Engelhard (2009). Method: survey; scenarios; descriptive; Rasch modeling. Sample: 800 elementary, 800 middle, and 800 high school teachers. Main findings: achievement was the most important factor; effort and behavior provided as feedback; little emphasis on ability.

Randall and Engelhard (2010). Method: survey; scenarios; descriptive. Sample: 79 elementary, 155 middle, and 108 high school teachers. Main findings: achievement was the most important factor; use of effort and classroom behavior for borderline cases.

Russell and Austin (2010). Method: survey; descriptive. Sample: 352 secondary music teachers. Main findings:
● Noncognitive factors, such as performance/skill, attendance/participation, attitude, and practice/effort, weighted as much as or more than achievement
● In high school there was a greater emphasis on attendance; in middle school, more on practice

M. Simon, Tierney, Forgette-Giroux, Charland, Noonan, and Duncan (2010). Method: case study. Sample: one high school math teacher. Main findings: standardized grading policies conflicted with professional judgments.

Sun and Cheng (2013). Method: survey scenarios; descriptive. Sample: 350 secondary English language teachers. Main findings:
● Found emphasis on individualized use of grades for motivation and extensive use of noncognitive factors and fairness, especially for borderline grades and for encouragement and effort attributions to benefit students
● Teachers placed more emphasis on nonachievement factors, such as effort, homework, and study habits, than achievement

Svennberg, Meckbach, and Redelius (2014). Method: interviews. Sample: four physical education teachers. Main findings: identified knowledge/skills, motivation, confidence, and interaction with others as important factors.

Tierney, Simon, and Charland (2011). Method: mixed methods. Sample: 77 high school math teachers. Main findings:
● Most teachers believed in fair grading practices that stressed improvement, with little emphasis on attitude, motivation, or participation, with differences individualized to students
● Effort was considered for borderline grades

Troug and Friedman (1996). Method: mixed methods. Sample: 53 high school teachers. Main findings: significant variability in grading practices and use of both achievement and nonachievement factors.

Webster (2011). Method: mixed methods. Sample: 42 high school teachers. Main findings: teachers reported multiple purposes and inconsistent practices while showing a clear desire to focus most on achievement consistent with standards.

Wiley (2011). Method: survey; scenarios; descriptive. Sample: 15 high school teachers. Main findings:
● Teachers varied in how much nonachievement factors were used for grading
● Found greater emphasis on nonachievement factors, especially effort, for low-ability or low-achieving students

Yesbeck (2011). Method: interviews. Sample: 10 middle school language arts teachers. Main findings: a multitude of both achievement and nonachievement factors were included in grading.
TABLE 6. Studies of standards-based grading

Cox (2011). Method: focus group; interview. Sample: 16 high school teachers. Main findings: although a district policy limited the impact of nonachievement factors on grades, teachers varied a great deal in their implementation. High implementers:
● substituted end-of-course assessment and high-stakes assessment scores for grades when students performed better on these exams than on other assessments
● allowed students to retake exams and would record the highest score
● assigned a score of 50 to all failing grades
● accepted late work without penalty

Guskey, Swan, and Jung (2010). Method: survey; descriptive. Sample: 24 elementary and secondary teachers and 117 parents. Main findings: teachers and parents believed that a standards-based report card provided high-quality, clear, and better understood information.

Howley, Kusimo, and Parrott (1999). Method: interviews; surveys; test scores; GPA. Sample: 52 middle school girls and 52 of their teachers. Main findings: half of the variance in GPA could be explained by test scores, but the relationship between grades and test scores varied by school; teachers differed in the extent to which noncognitive factors like effort were used to determine grades.

McMunn, Schenck, and McColskey (2003). Method: interviews; focus groups; observations; surveys; document analysis. Sample: 241 teachers, all levels. Main findings:
● Teachers who volunteered to participate in a standards-based grading effort reported changing their grading practices to be more standards-based after participating in professional development
● However, classroom observations and student focus group data indicated that implementation of standards-based practice was not as widespread as teachers reported

J. A. Ross and Kostuch (2011). Method: grades; test scores; student demographics. Sample: 15,942 students randomly sampled from the population of students in Ontario. Main findings:
● Moderate correlations were observed between grades and test scores
● The magnitude of the grade–test score relationship did not vary by gender or grade but was stronger in mathematics than in reading or writing
● Grades tended to be higher than test scores, except in writing

Swan, Guskey, and Jung (2014). Method: survey. Sample: 115 parents and 383 teachers, both in a district in which standards-based and traditional report cards were concurrently generated. Main findings: both teachers and parents preferred standards-based over traditional report cards, with teachers indicating the greatest preference; teachers also reported that although standards-based grades took more time to generate, the effort was worthwhile due to improvements in the quality of information provided.

Welsh and D’Agostino (2009); Welsh, D’Agostino, and Kaniskan (2013). Method: interviews; 2 years of standards-based grades; 2 years of test scores. Sample: 37 elementary teachers were interviewed, and 80 elementary classrooms provided student-level grades and test scores. Main findings:
● Interviews were quantitatively coded to generate an Appraisal Style scale that captured the use of high-quality standards-based grading practices
● The convergence between spring grades and test scores, both expressed in terms of performance levels, was estimated for each teacher in each year; teachers tended to grade more rigorously in mathematics and less rigorously in reading and writing
● Appraisal style was moderately correlated with convergence rates
TABLE 7. Studies of grading in higher education

Abrami, Dickens, Perry, and Leventhal (1980). Method: experimental, quantitative. Sample: Experiment 1: 143 undergraduates; Experiment 2: 278 undergraduates. Main findings: standards did not affect student achievement.

Brumfield (2005). Method: survey. Sample: 419 member institutions of the American Association of Collegiate Registrars and Admissions Officers. Main findings: grades are a central feature of academia; there is a broad range of grading systems.

Centra and Creech (1976). Method: nonexperimental. Sample: 9,194 class averages of student evaluations. Main findings: ratings of teacher effectiveness were correlated at .20 with expected grades.

Collins and Nickel (1974). Method: survey. Sample: 544 two- and four-year colleges and universities. Main findings: there are many different types of grading systems, and the use of nontraditional grading practices is widespread.

Feldman (1997). Method: meta-analysis. Sample: 31 studies. Main findings: correlation between anticipated grade and course evaluation rating was between .10 and .30.

Ginexi (2003). Method: survey. Sample: 136 undergraduate students in a general psychology course. Main findings: anticipated grade was related to higher teacher ratings and ease of comprehension of assigned readings, but to no other questions on the course evaluation.

Holmes (1972). Method: experimental. Sample: 97 undergraduate students in an introductory psychology course. Main findings: students’ grades were not related to course evaluations, but students who received unexpectedly (manipulated) low grades gave poorer instructor evaluations.

Kasten and Young (1983). Method: experimental. Sample: 77 graduate students in 5 educational administration classes. Main findings: random assignment to 3 purposes for the course evaluation (personal decision, instructor’s use, or no purpose stated) yielded no significant differences in ratings.

Kulick and Wright (2008). Method: Monte Carlo simulation. Sample: series of simulations based on 400 students. Main findings: normal distributions of test scores do not necessarily provide evidence of the efficacy of the evaluation of the quality of the test.

Maurer (2006). Method: experimental. Sample: 642 students in 17 (unspecified) classes taught by the same instructor. Main findings: students were randomly assigned to 3 conditions (personnel decision, course improvement, or control group) and asked for expected grades; expected grade was related to course evaluations, but the stated purpose of the evaluation was not.

Mayo (1970). Method: survey. Sample: 3 instructors of an undergraduate introductory measurement course. Main findings: in a mastery learning context, active participation with course material appeared to be superior to only doing the reading and receiving lectures.

Nicolson (1917). Method: survey. Sample: 64 colleges approved by the Carnegie Foundation. Main findings: 36 of the colleges used a 5-division marking scale for grading purposes.

Salmons (1993). Method: nonexperimental. Sample: 444 introductory psychology students from Radford University. Main findings: students were given a course evaluation prior to the first exam and again after receiving their final grades; from pre to post, students anticipating a low grade lowered their evaluation of the course and students anticipating a high grade raised their evaluation of the course.

J. K. Smith and Smith (2009). Method: experimental. Sample: 240 introductory psychology students. Main findings: students were randomly assigned to 1 of 3 approaches to university grading: a 100-point system, a percentage system, and an open-point system; significant differences were found for motivation, confidence, and effort but not for perceptions of achievement or accuracy.
Three studies (Bolton, 1927; Jacoby, 1910; Shriner, 1930) argued against this conclusion, however, contending that teacher variability in grading was not as great as commonly suggested.

As the work of Edgeworth (1888) previewed, these studies identified several sources of the variability in grading. Starch (1913), for example, determined that three major factors produced an average probable error of 5.4 on a 100-point scale across instructors and schools. Specifically, “differences due to the pure inability to distinguish between closely allied degrees of merit” (p. 630) contributed 2.2 points, “differences in the relative values placed by different teachers upon various elements in a paper, including content and form” (p. 630) contributed 2.1 points, and “differences among the standards of different teachers” (p. 630) contributed 1.0 point. Although investigated, “differences among the standards of different schools” (p. 630) contributed practically nothing toward the total (p. 632).

Other studies listed in Table 1 identify these and other sources of grading variability. Differences in grading criteria, or lack of criteria, were found to be a prominent source of variability in grades (Ashbaugh, 1924; Brimi, 2011; Eells, 1930; Healy, 1935; Silberstein, 1922), akin to Starch’s (1913) difference in the relative values teachers place on various elements in a paper. Teacher severity or leniency was found to be another source of variability in grades (Shriner, 1930; Silberstein, 1922; Sims, 1933), similar to Starch’s differences in teachers’ standards. Differences in student work quality were associated with variability in grades, but the findings were inconsistent. Bolton (1927), for example, found greater grading variability for poorer papers. Similarly, Jacoby (1910) interpreted his high agreement as a result of the high quality of the papers in his sample. Eells (1930), however, found greater grading consistency in the poorer papers. Lauterbach (1928) found more grading variability for typewritten compositions than for handwritten versions of the same work. Finally, between-teacher error was a central factor in all of the studies in Table 1. Studies by Eells (1930) and Hulten (1925) demonstrated within-teacher error, as well.

Given a probable error of around 5 on a 100-point scale, Starch (1913) recommended the use of a 9-point scale (i.e., A+, A−, B+, B−, C+, C−, D+, D−, and F) and later tested the improvement in reliability gained by moving to a 5-point scale based on the normal distribution (Starch, 1915). His and other studies contributed to the movement in the early 20th century away from a 100-point scale. The ABCDF letter grade scale became more common and remains the most prevalent grading scale in schools in the United States today.
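A note on the statistic these early studies relied on: for a normal distribution, the probable error is the median absolute deviation from the mean, roughly 0.6745 times the standard deviation, so half of all marks fall within one probable error of the mean and half fall outside it. The sketch below is a purely illustrative simulation with invented numbers, not Starch’s data; the grader count and spread are chosen only so the result lands near the reported value of about 5 on a 100-point scale.

```python
import random
import statistics

random.seed(1)

# Invented scenario: 40 graders independently score the same 100-point paper.
true_score = 76
grader_scores = [random.gauss(true_score, 7.5) for _ in range(40)]

mean_score = statistics.mean(grader_scores)

# Probable error: the median absolute deviation from the mean, which for a
# normal distribution is approximately 0.6745 times the standard deviation.
abs_devs = sorted(abs(s - mean_score) for s in grader_scores)
probable_error = statistics.median(abs_devs)

print(f"mean = {mean_score:.1f}, probable error = {probable_error:.1f}")
print(f"0.6745 * SD = {0.6745 * statistics.stdev(grader_scores):.1f}")
```

On this scale, a probable error of 5 means half of the graders’ marks fall more than 5 points from the average mark for the same paper, which is the practical concern the early reliability studies raised.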
Grades and Related Educational Outcomes

Quantitative studies of grades and related educational outcomes moved the focus of research on grades from questions of reliability to questions of validity. Three types of studies investigated the meaning of grades in this way. The oldest line of research (Table 2) looked at the relationship between grades and scores on standardized tests of intelligence or achievement. Today, those studies would be seen as seeking concurrent evidence for validity under the assumption that graded achievement should be the same as tested achievement (Brookhart, 2015). As the 20th century progressed, researchers added noncognitive variables to these studies, describing grades as multidimensional measures of academic knowledge, engagement, and persistence (Table 3). A third group of more recent studies looked at the relationship between grades and other educational outcomes, for example, dropping out of school or future success in school (Table 4). These studies offer predictive evidence for validity under the assumption that grades measure school success.
Correlation of Grades and Other Assessments

Table 2 describes studies that investigated the relationship between grades (usually grade point average [GPA]) and standardized test scores in an effort to understand the composition of the grades and marks that teachers assign to K–12 students. Despite the enduring perception that the correlation between grades and standardized test scores is strong (Allen, 2005; Duckworth, Quinn, & Tsukayama, 2012; Stanley & Baines, 2004), this correlation is and always has been relatively modest, in the .5 range. As Willingham, Pollack, and Lewis (2002) noted,

Understanding these characteristics of grades is important for the valid use of test scores as well as grade averages because, in practice, the two measures are often intimately connected . . . [there is a] tendency to assume that a grade average and a test score are, in some sense, mutual surrogates; that is, measuring much the same thing, even in the face of obvious differences. (p. 2)

Research on the relationship between grades and standardized assessment results is marked by two major eras: early 20th-century studies and late 20th- into 21st-century studies. Unzicker (1925) found that average grades across subjects correlated .47 with intelligence test scores. C. C. Ross and Hooks (1930) reviewed 20 studies conducted from 1920 through 1929 on report card grades and intelligence test scores in elementary school as predictors of junior high and high school grades. Results showed that the correlations between grades in seventh grade and intelligence test scores ranged from .38 to .44. C. C. Ross and Hooks concluded,

Data from this and other studies indicate that the grade school record affords a more reliable or consistent basis of prediction than any other available, the correlations in three widely-scattered school systems showing remarkable stability; and that without question the grade school record of the pupil is the most usable or practical of all bases for prediction, being available wherever cumulative records are kept, without cost and with a minimum expenditure of time and effort. (p. 195)

Subsequent studies moved from correlating grades and intelligence test scores to correlating grades with standardized achievement results (Carter, 1952, r = .52; Moore, 1939, r = .61). McCandless, Roberts, and Starnes (1972) found a smaller correlation (r = .31) after accounting for socioeconomic status, ethnicity, and gender. Although the sample selection procedures and methods used in these early investigations are problematic by current standards, they represent a clear desire on the part of researchers to understand what teacher-assigned grades represent in comparison to other known standardized assessments. In other words, their focus was criterion validity (C. C. Ross & Hooks, 1930).

Investigations from the late 20th century and into the 21st century replicated earlier studies but included larger, more representative samples and used more current standardized tests and methods (Brennan, Kim, Wenz-Gross, & Siperstein, 2001; Woodruff & Ziomek, 2004).
Brennan et al. (2001), for example, compared reading scores from the Massachusetts Comprehensive Assessment System state test to grades in mathematics, English, and science and found correlations ranging from .54 to .59. Similarly, using GPA and 2003 TerraNova Second Edition/California Achievement Test scores, Duckworth and Seligman (2006) found a correlation of .66. Subsequently, Duckworth et al. (2012) compared standardized reading and mathematics test scores to GPA and found correlations between .62 and .66.

Woodruff and Ziomek (2004) compared GPA and ACT composite scores for all high school students who took the ACT college entrance exam between 1991 and 2003. They found moderate but consistent correlations ranging from .56 to .58 over the years for average GPA and composite ACT scores, from .54 to .57 for mathematics grades and ACT scores, and from .45 to .50 in English. Student GPAs were self-reported, however. Pattison et al. (2013) examined four decades of achievement data on tens of thousands of students using national databases to compare high school GPA to reading and mathematics standardized tests. The authors found GPA correlations consistent with past research, ranging from .52 to .64 in mathematics and from .46 to .54 in reading comprehension.

Although some variability exists across years and subjects, correlations have remained moderate but remarkably consistent in studies based on large, nationally representative data sets. Across 100 years of research, teacher-assigned grades typically correlate about .5 with standardized measures of achievement. In other words, 25% of the variation in grades teachers assign is attributable to a trait measured by standardized tests (Bowers, 2011). The remaining 75% is attributable to something else. As Swineford (1947) noted in a study on grading in middle and high school, “The data clearly show that marks assigned by teachers in this school are reliable measures of something [italics added] but there is apparently a lack of agreement on just what that something should be” (p. 517). A correlation of .5 is neither very weak, which counters arguments that grades are completely subjective measures of academic knowledge, nor very strong, which refutes arguments that grades are a strong measure of fundamental academic knowledge. These correlations have also remained consistent despite large shifts in the educational system, especially in relation to accountability and standardized testing (Bowers, 2011; Linn, 1982).
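The 25%/75% split above is simply the coefficient of determination: squaring a correlation gives the proportion of variance the two measures share. As a quick check of the arithmetic behind Bowers’s (2011) point:

```latex
r \approx .5 \;\Rightarrow\; r^{2} \approx (.5)^{2} = .25,
\qquad 1 - r^{2} \approx .75
```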
Grades as Multidimensional Measures of Academic Knowledge, Engagement, and Persistence

Investigations of the composition of K–12 report card grades consistently find them to be multidimensional, comprising minimally academic knowledge, substantive engagement, and persistence. Table 3 presents studies of grades and other measures, including many noncognitive variables. In the earliest study of this type, Sobel (1936) found that students with high grades and low test scores had outstanding penmanship, attendance, punctuality, and effort marks, and their teachers rated them high in industry, perseverance, dependability, cooperation, and ambition. Similarly, Miner (1967) factor-analyzed longitudinal data for a sample of students, including their grades in 1st, 3rd, 6th, 9th, and 12th grades; achievement tests in 5th, 6th, and 9th grades; and citizenship grades in 1st, 3rd, and 6th grades. She identified a three-factor solution: (a) objective achievement as measured through standardized assessments, (b) early classroom citizenship (a behavior factor), and (c) high school achievement as measured through grades, demonstrating that behavior and two types of achievement could be identified as separate factors.
Farkas, Grobe, Sheehan, and Shuan (1990) showed that student work habits were the strongest noncognitive predictors of grades. They noted, “Most striking is the powerful effect of student work habits upon course grades . . . teacher judgments of student non-cognitive characteristics are powerful determinants of course grades, even when student cognitive performance is controlled” (p. 140). Likewise, Willingham et al. (2002), using large national databases, found a moderate relationship between grades and tests as well as strong positive relationships between grades and student motivation, engagement, completion of work assigned, and persistence. Relying on a theory of a conative factor of schooling, focusing on student interest, volition, and self-regulation (Snow, 1989), the authors suggested that grades provide a useful assessment of both conative and cognitive student factors (Willingham et al., 2002).

S. Kelly (2008) countered a criticism of the conative factor theory of grades, namely that teachers may award grades based on students appearing engaged and going through the motions (i.e., a procedural form of engagement) as opposed to more substantive engagement involving legitimate effort and participation that leads to increased learning. He found positive and significant effects of students’ substantive engagement on subsequent grades but no relationship with procedural engagement, noting, “This finding suggests that most teachers successfully use grades to reward achievement-oriented behavior and promote a widespread growth in achievement” (p. 45). S. Kelly also argued that misperceptions that teachers do not distinguish between apparent and substantive engagement lend mistaken support to the use of high-stakes tests as inherently more “objective” (p. 46) than teacher assessments.

Recent studies have expanded on this work, applying sophisticated methodologies. Bowers (2009, 2011) used multidimensional scaling to examine the relationship between grades and standardized test scores in each semester of high school in both core subjects (mathematics, English, science, and social studies) and noncore subjects (foreign/non-English languages, art, and physical education). Bowers (2011) found evidence for a three-factor structure: (a) a cognitive factor that describes the relationship between tests and core subject grades, (b) a conative and engagement factor between core subject grades and noncore subject grades (termed a “Success at School Factor, SSF,” p. 154), and (c) a factor that described the difference between grades in art and physical education. He also showed that teachers’ assessment of students’ ability to negotiate the social processes of schooling represents much of the variance in grades that is unrelated to test scores. These results point to the importance of substantive engagement and persistence (S. Kelly, 2008; Willingham et al., 2002) as factors that help students in both core and noncore subjects. Subsequently, Duckworth et al. (2012) used structural equation modeling for 510 New York City fifth through eighth graders to show that engagement and persistence are mediated through teacher evaluations of student conduct and homework completion.

Casillas et al. (2012) examined the interrelationships among grades, standardized assessment scores, and a range of psychosocial characteristics and behaviors.
Brookhart et al.824Twenty-five percent of the explained variance in GPAs was attributable to the standardized assessments; the rest was predicted by a combination of prior grades (30%), psychosocial factors (23%), behavioral indicators (10%), demographics (9%), and school factors (3%). Academic discipline and commitment to school (i.e., the degree to which the student is hard working, conscientious, and effortful) had the strongest relationship to GPA.A set of recent studies focused on the Swedish national context (Cliffordson, 2008; Klapp Lekholm, 2011; Klapp Lekholm & Cliffordson, 2008, 2009; Thorsen, 2014; Thorsen & Cliffordson, 2012), which is interesting because report cards are uniform throughout the country and require teachers to grade students using the same performance level scoring system used by the national exam. Klapp Lekholm and Cliffordson (2008) showed that grades consisted of two major factors: a cog-nitive achievement factor and a noncognitive “common grade dimension” (p. 188). In a follow-up study, Klapp Lekholm and Cliffordson (2009) reanalyzed the same data, examining the relationships between multiple student and school char-acteristics and both the cognitive and noncognitive achievement factors. For the cognitive achievement factor of grades, student self-perception of competence, self-efficacy, coping strategies, and subject-specific interest were most important. In contrast, the most important student variables for the noncognitive factor were motivation and a general interest in school. These structural equation modeling results were replicated across three full population-level cohorts in Sweden repre-senting all 99,085 9th grade students in 2003, 105,697 students in 2004, and 108,753 in 2005 (Thorsen & Cliffordson, 2012), as well as in comparison to both norm-referenced and criterion-referenced grading systems, examining 3,855 stu-dents in Sweden (Thorsen, 2014). Klapp Lekholm and Cliffordson (2009) wrote,The relation between general interest or motivation and the common grade dimension seems to recognize that students who are motivated often possess both specific and general goals and approach new phenomena with the goal of understanding them, which is a student characteristic awarded in grades. (p. 19)These findings, similar to those of S. Kelly (2008), Bowers (2009, 2011), and Casillas et al. (2012), support the idea that substantive engagement is an impor-tant component of grades that is distinct from the skills measured by standard-ized tests. A validity argument that expects grades and standardized tests to correlate highly therefore may not be sound because the construct of school achievement is not fully defined by standardized test scores. Tested achievement represents one dimension of the results of schooling, privileging “individual cog-nition, pure mentation, symbol manipulation, and generalized learning” (Resnick, 1987, pp. 13–15).Grades as Predictors of Educational OutcomesTable 4 presents studies of grades as predictors of educational outcomes. Teacher-assigned grades are known to predict graduation from high school (Bowers, 2014), as well as transition from high school to college (Atkinson & Geiser, 2009; Cliffordson, 2008). Satisfactory grades historically have been used as one of the means to grant students a high school diploma (Rumberger, 2011).
Studies from the second half of the 20th century and into the 21st century, however, have focused on using grades from early grade levels to predict student graduation rates or risk of dropping out of school (Gleason & Dynarski, 2002; Pallas, 1989).
Early studies in this domain (Fitzsimmons, Cheever, Leonard, & Macunovich, 1969; Lloyd, 1974, 1978; Voss, Wendling, & Elliott, 1966) identified teacher-assigned grades as one of the strongest predictors of student risk for failing to graduate from high school. Subsequent studies included other variables, such as absence and misbehavior, and found that grades remained a strong predictor (Barrington & Hendricks, 1989; Cairns, Cairns, & Neckerman, 1989; Ekstrom, Goertz, Pollack, & Rock, 1986; Ensminger & Slusarcick, 1992; Finn, 1989; Hargis, 1990; Morris, Ehren, & Lenz, 1991; Rumberger, 1987; Troob, 1985). More recent research using a life course perspective showed that low or failing grades have a cumulative effect over a student’s time in school and contribute to the eventual decision to leave (Alexander, Entwisle, & Kabbani, 2001; Jimerson, Egeland, Sroufe, & Carlson, 2000; Pallas, 2003; Roderick & Camburn, 1999).
Other research in this area considered grades in two ways: the influence of low grades (Ds and Fs) on dropping out, and the relationship of a continuous scale of grades (e.g., GPA) to at-risk status and eventual graduation or dropping out. Three examples are particularly notable. Allensworth and colleagues have shown that failing a core subject in ninth grade is highly correlated with dropping out of school and places a student off track for graduation (Allensworth, 2013; Allensworth & Easton, 2005, 2007). Such failure also compromises the transition from middle school to high school (Allensworth, Gwynne, Moore, & de la Torre, 2014). Balfanz, Herzog, and MacIver (2007) showed a strong relationship between failing core courses in sixth grade and dropping out. Focusing on modeling conditional risk, Bowers (2010b) found that the strongest predictor of dropping out after grade retention was having D and F grades.
Few studies, however, have focused on grades as the sole predictor of graduation or dropping out. Most studies examine longitudinal grade patterns, using either data-mining techniques such as cluster analysis of all K–12 course grades (Bowers, 2010a) or mixture modeling techniques to identify patterns of growth or decline in GPA in early high school (Bowers & Sprott, 2012). A recent review of studies on the accuracy of dropout predictors showed that, along with the Allensworth Chicago on-track indicator (Allensworth & Easton, 2007), longitudinal GPA trajectories were among the most accurate predictors identified (Bowers et al., 2013).
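The logic of these early-warning predictors is simple enough to sketch in a few lines of code. The following Python fragment is offered only as an illustration on invented synthetic data; the variable names, rates, and effect sizes are assumptions for demonstration, not values from the studies cited. It combines an Allensworth-style on-track flag (no failed ninth-grade core course) with ninth-grade GPA in a simple logistic model of graduation:
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
gpa9 = np.clip(rng.normal(2.8, 0.8, n), 0.0, 4.0)             # ninth-grade GPA
failed_core = rng.random(n) < np.where(gpa9 < 2.0, 0.6, 0.1)  # any failed core course
on_track = ~failed_core                                       # Allensworth-style flag
# Invented outcome model: graduation is more likely with higher GPA and on-track status.
graduated = rng.random(n) < 1.0 / (1.0 + np.exp(-(-4.0 + 1.5 * gpa9 + 1.0 * on_track)))

model = LogisticRegression().fit(np.column_stack([gpa9, on_track]), graduated)
print("coefficients (GPA, on-track):", model.coef_.round(2))
Nothing in this toy model should be read as an estimate; it merely shows why a single failed course and a GPA trajectory can both carry predictive signal about the same outcome.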
Teachers’ Perceptions of Grading and Grading Practices
Systematic investigations of teachers’ grading practices and perceptions about grading began to be published in the 1980s and were summarized in Brookhart’s (1994) review of 19 empirical studies of teachers’ grading practices, opinions, and beliefs. Five themes were supported. First, teachers use measures of achievement, primarily tests, as major determinants of grades. Second, teachers believe it is important to grade fairly. Views of fairness included using multiple sources of information, incorporating effort, and making it clear to students what is assessed and how they will be graded. This finding suggests teachers consider school achievement to include the work students do in school, not just the final outcome. Third, in 12 of the studies, teachers included noncognitive factors in grades, including ability, effort, improvement, completion of work, and, to a small extent, other student behaviors. Fourth, grading practices are not consistent across teachers, with respect to either the purpose or the extent to which noncognitive factors are considered, reflecting differences in teachers’ beliefs and values. Finally, grading practices vary by grade level: secondary teachers emphasize achievement products such as tests, whereas elementary teachers use informal evidence of learning along with achievement and performance assessments.
Brookhart’s (1994) review demonstrated an upswing in interest in investigating grading practices during this period, in which performance-based and portfolio classroom assessment was emphasized and reports of the unreliability of teachers’ subjective judgments about student work also increased. The findings were in accord with policymakers’ increasing distrust of teachers’ judgments about student achievement.
Teachers’ Reported Grading Practices
Empirical studies of teachers’ grading practices over the past 20 years have mainly used surveys to document how teachers use both cognitive and noncognitive evidence, primarily effort, and their own professional judgment in determining grades. Table 5 shows that most studies published since Brookhart’s (1994) review document that teachers in different subjects and grade levels use “hodgepodge” grading (Brookhart, 1991, p. 36), combining achievement, effort, behavior, improvement, and attitudes (Adrian, 2012; Bailey, 2012; Cizek, Fitzgerald, & Rachor, 1995; Cross & Frary, 1999; Duncan & Noonan, 2007; Frary, Cross, & Weber, 1993; Grimes, 2010; Guskey, 2002, 2009a; Imperial, 2011; Liu, 2008b; Llosa, 2008; McMillan, 2001; McMillan & Lawson, 2001; McMillan, Myran, & Workman, 2002; McMillan & Nash, 2000; Randall & Engelhard, 2009, 2010; Russell & Austin, 2010; Sun & Cheng, 2013; Svennberg, Meckbach, & Redelius, 2014; Troug & Friedman, 1996; Yesbeck, 2011). Teachers often make grading decisions with little school or district guidance.
Teachers distinguish among nonachievement factors in grading. They view “academic enablers” (McMillan, 2001, p. 25), including effort, ability, work habits, attention, and participation, differently from other nonachievement factors, such as student personality and behavior. McMillan (2001), consistent with earlier research, found that academic performance and academic enablers were by far the most important factors in determining grades. These findings have been replicated (Duncan & Noonan, 2007; McMillan et al., 2002). In a qualitative study, McMillan and Nash (2000) found that teaching philosophy and judgments about what is best for students’ motivation and learning contribute to variability in grading practices, suggesting that an emphasis on effort, in particular, influences these outcomes. Randall and Engelhard (2010) found that teacher beliefs about what best supports students are important factors in grading, especially the use of noncognitive factors for borderline grades, as Sun and Cheng (2013) also found with a sample of Chinese secondary teachers. These studies suggest that part of the reason for the multidimensional nature of grading reported in the previous section is that teachers’ conceptions of academic achievement include behavior that supports and promotes academic achievement, and that teachers evaluate these behaviors as well as academic content in determining grades. These studies also showed significant variation among teachers within the same school. That is, the weight that different teachers give to separate factors can vary a great deal within a single elementary or secondary school (Cizek et al., 1995; Cross & Frary, 1999; Duncan & Noonan, 2007; Guskey, 2009a; Henke, Chen, Goldman, Rollefson, & Gruber, 1999; Troug & Friedman, 1996; Webster, 2011).
Teacher Perceptions About Grading
Compared to the number of studies about teachers’ grading practices, relatively few studies focus directly on perceptual constructs such as importance, meaning, value, attitudes, and beliefs. Several studies used Brookhart’s (1994) suggestion that Messick’s (1989) construct validity framework is a reasonable approach for investigating perceptions. This framework focuses on both the interpretation of the construct (what grading means) and the implications and consequences of grading (the effect it has on students). Sun and Cheng (2013) used this conceptual framework to analyze teachers’ comments about their grading and the extent to which values and consequences were considered. The results showed that teachers interpreted good grades as a reward for accomplished work, based on both effort and quality; for student attitude toward achievement, as reflected by homework completion; and for progress in learning. Teachers indicated the need for fairness and accuracy, not just accomplishment, saying that grades are fairer if they are lowered for lack of effort or participation and that grading needs to be strict for high achievers. Teachers also considered the consequences of grading decisions for students’ future success and feelings of competence.
Fairness in an individual sense is a theme in several studies of teacher perceptions of grades (Bonner & Chen, 2009; Grimes, 2010; Hay & Macdonald, 2008; Kunnath, 2016; Sun & Cheng, 2013; Svennberg et al., 2014; Tierney, Simon, & Charland, 2011). Teachers perceive grades to have value according to what they can do for individual students. Many teachers use their understanding of individual student circumstances, their instructional experience, and perceptions of equity, consistency, accuracy, and fairness to make professional judgments, instead of relying solely on a grading algorithm. These claims suggest that grading practices may vary within a single classroom, just as they do among teachers, and that this variation is viewed, at least by some teachers, as a needed element of accurate, fair grading, not as a problem. In a case study of one high school mathematics teacher in Canada, M. Simon et al. (2010) reported that standardized grading policy often conflicted with professional judgment and had a significant impact on determining students’ final grades.
Some researchers (Liu, 2008a; Liu, O’Connell, & McCoach, 2006; Wiley, 2011) have developed scales to assess teachers’ beliefs and attitudes about grading, including items that load on importance, usefulness, effort, ability, grading habits, and perceived self-efficacy of the grading process. These studies have corroborated the survey and interview findings about teachers’ beliefs in using both cognitive and noncognitive factors in grading. Guskey (2009a) found differences between elementary and secondary teachers in their perspectives about the purposes of grading. Elementary teachers were more likely to view grading as a process of communication with students and parents and to differentiate grades for individual students. Secondary teachers believed that grading served a classroom control and management function, emphasizing student behavior and completion of work.
In short, findings from the limited number of studies on teacher perceptions of grading are largely consistent with findings from grading practice surveys. Some studies have successfully explored the basis for practices and show that teachers view grading as a means to have fair, individualized, positive impacts on students’ learning and motivation and, to a lesser extent, classroom control. Together, the research on grading practices and perceptions suggests the following four clear and enduring findings. First, teachers idiosyncratically use a multitude of achievement and nonachievement factors in their grading practices to improve learning and motivation as well as to document academic performance. Second, student effort is a key element in grading. Third, teachers advocate for students by helping them achieve high grades. Finally, teacher judgment is an essential part of fair and accurate grading.
Standards-Based Grading
SBG recommendations emphasize communicating student progress in relation to grade-level standards (e.g., adding fractions, computing area) that describe performance using ordered categories (e.g., below basic, basic, proficient, advanced) and involve separate reporting of work habits and behavior (Brookhart, 2011; Guskey, 2009b; Guskey & Bailey, 2001, 2010; Marzano & Heflebower, 2011; McMillan, 2009; Melograno, 2007; Mohnsen, 2013; O’Connor, 2009; Scriffiny, 2008; Shippy, Washer, & Perrin, 2013; Wiggins, 1994). SBG is differentiated from standardized grading, which provides teachers with uniform grading procedures in an attempt to improve consistency in grading methods, and from mastery grading, which expresses student performance on a variety of skills using a binary mastered/not mastered scale (Guskey & Bailey, 2001). Some also assert that SBG can provide exceptionally high-quality information to parents, teachers, and students and, therefore, has the potential to bring about instructional improvements and larger educational reforms. Others urge caution. Cizek (2000), for example, warned that SBG may be no better than other reporting formats and subject to the same misinterpretations as other grading scales.
Literature on SBG implementation recommendations is extensive, but empirical studies are few. Studies of SBG to date have focused mostly on the implementation of SBG reforms and the relationship of SBG to state achievement tests designed to measure the same or similar standards. One study investigated student, teacher, and parent perceptions of SBG. Table 6 presents these studies.
Implementation of SBG
Schools, districts, and teachers have experienced difficulties in implementing SBG (Clarridge & Whitaker, 1994; Cox, 2011; Hay & Macdonald, 2008; McMunn, Schenck, & McColskey, 2003; M. Simon et al., 2010; Tierney et al., 2011). The understanding and support of teachers, parents, and students are key to successful implementation of SBG practices, especially grading on standards and separating achievement grades from learning skills (academic enablers). Although many teachers report that they support such grading reforms, they also report using practices that mix effort, improvement, or motivation with academic achievement (Cox, 2011; Hay & Macdonald, 2008; McMunn et al., 2003). Teachers also vary in implementing SBG practices (Cox, 2011), especially in using common assessments, following minimum grading policies, accepting late work with no penalty, and allowing students to retest and replace poor scores with retest scores.
The previous section summarized two studies of grading practices in Ontario, Canada, which adopted SBG province-wide and required teachers to grade students on specific topics within each content area using percentage grades. M. Simon et al. (2010) identified tensions between provincial grading policies and one teacher’s practice. Tierney et al. (2011) found that few teachers were aware of and applying provincial SBG policies. These findings are consistent with McMunn et al.’s (2003) findings, which showed that changes in grading practice do not necessarily follow changes in grading policy.
SBG as a Communication Tool
Swan, Guskey, and Jung (2014; see also Guskey, Swan, & Jung, 2010) found that parents, teachers, and students preferred SBG over traditional report cards, with teachers who were considering adopting SBG expressing the most favorable attitudes. Teachers implementing SBG reported that it took longer to record the detailed information included in the SBG report cards but felt the additional time was worthwhile because SBGs yielded higher-quality information. An earlier informal report by Guskey (2004) found, however, that many parents attempted to interpret nearly all labels (e.g., below basic, basic, proficient, advanced) in terms of letter grades. It may be that a decade of increasing familiarity with SBG has changed perceptions of the meaning and usefulness of SBG.
Relationship of SBGs to High-Stakes Test Scores
One might expect consistency between SBGs and standards-based assessment scores because they purport to measure the same standards. Eight papers examined this consistency (Howley, Kusimo, & Parrott, 1999; Klapp Lekholm, 2011; Klapp Lekholm & Cliffordson, 2008, 2009; J. A. Ross & Kostuch, 2011; Thorsen & Cliffordson, 2012; Welsh & D’Agostino, 2009; Welsh, D’Agostino, & Kaniskan, 2013). All yielded essentially the same result: SBGs and high-stakes, standards-based assessment scores were only moderately related. Howley et al. (1999) found that 50% of the variance in GPA could be explained by standards-based assessment scores, and the magnitude of the relationship varied by school. Interview data revealed that even in SBG settings, some teachers included noncognitive factors (e.g., attendance and participation) in grades. This finding may explain the modest relationship, at least in part.
Welsh and D’Agostino (2009) and Welsh et al. (2013) developed an Appraisal Scale that gauged teachers’ efforts to assess and grade students on standards attainment. This 10-item measure focused on the alignment of assessments with standards and on the use of a clear, standards attainment–focused grading method. They found small to moderate correlations between this measure and grade–test score convergence. That is, the standards-based grades of teachers who used criterion-referenced achievement information were more related to standards-based assessments than were the grades of teachers who did not follow this practice.
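To make the notion of grade–test convergence concrete, the following Python fragment, a minimal sketch on invented synthetic data, correlates teacher-assigned performance levels with test-based levels and reports an exact-agreement rate. The weights given to achievement and effort below are illustrative assumptions, not estimates from these studies:
import numpy as np

rng = np.random.default_rng(1)
n = 500
ability = rng.normal(size=n)                 # achievement on the standard
effort = rng.normal(size=n)                  # a noncognitive factor
# Assumed mixing: teacher levels blend achievement with effort; test levels add error.
teacher_score = 0.7 * ability + 0.5 * effort + 0.3 * rng.normal(size=n)
test_score = 0.9 * ability + 0.4 * rng.normal(size=n)
cuts = [-1.0, 0.0, 1.0]                      # 1 = below basic ... 4 = advanced
teacher_level = np.digitize(teacher_score, cuts) + 1
test_level = np.digitize(test_score, cuts) + 1

r = np.corrcoef(teacher_level, test_level)[0, 1]
agreement = (teacher_level == test_level).mean()
print(f"correlation: {r:.2f}, exact agreement: {agreement:.2f}")
Under these assumptions the two measures correlate only moderately even though both track the same underlying achievement, which is the pattern the eight papers above report.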
Welsh and D’Agostino (2009) and Welsh et al. (2013) found that SBG–test score relationships were larger in writing and mathematics than in reading. In addition, although teachers assigned lower grades than test scores in mathematics, grades were higher than test scores in reading and writing. J. A. Ross and Kostuch (2011) also found stronger SBG–test correlations in mathematics than in reading or writing, and grades tended to be higher than test scores, with the exception of writing scores at some grade levels.
Grading in Higher Education
Grades in higher education differ markedly among countries. As a case in point, four dramatic differences exist between the United States and New Zealand. First, grading practices are much more centralized in New Zealand, where grading is fairly consistent across universities and highly consistent within universities. Second, the grading scale starts with a passing score of 50%, and 80% and above yields an A. Third, essay examinations are more prevalent in New Zealand than multiple-choice testing. Fourth, grade distributions are reviewed, and the grades of individual instructors are considered each semester at departmental-level meetings. These practices are, at best, rarities in higher education in the United States.
An examination of 35 country and university websites paints a broad picture of the diversity in grading practices. Many countries use a system like that in New Zealand, in which 50 or 51 is the minimal passing score and 80 and above (sometimes 90 and above) is required for an “A.” Many countries also offer an “E” grade, which is sometimes a passing score and other times indicates a failure less egregious than an “F.” If 50% is considered passing, then skepticism toward multiple-choice testing (where there is often a 1 in 4 chance of a correct guess) becomes understandable. In the Netherlands, a 1 (lowest) to 10 (highest) system is used, with grades 1 to 3 and 9 and 10 rarely awarded, leaving a 5-point grading system for most students (Nuffic, 2013). In the European Union, differences between countries are so substantial that the European Credit Transfer and Accumulation System was created (European Commission, 2009).
Grading in higher education varies within countries as well. In the United States, it is typically seen as a matter of academic freedom and not a fit subject for external intervention. Indeed, in an analysis of the American Association of Collegiate Registrars and Admissions Officers survey of grading practices in higher education in the United States, Collins and Nickel (1974) reported, “There are as many different types of grading systems as there are institutions” (p. 3). The 2004 version of the same survey suggested, however, a somewhat more settled situation in recent years (Brumfield, 2005). Grading in higher education shares many issues of grade meaning with the K–12 context, which have been addressed above. Two unique issues for grade meaning remain: grading and student course evaluations, and historical changes in expected grade distributions. Table 7 presents studies in these areas.
Grades and Student Course Evaluations
Students in higher education routinely evaluate the quality of their course experiences and their instructors’ teaching. The relationship between course grades and course evaluations has been of interest for at least 40 years (Abrami, Dickens, Perry, & Leventhal, 1980; Holmes, 1972) and is a subquestion in the general research about student evaluations of courses (e.g., Centra, 1993; Marsh, 1984, 1987; McKeachie, 1979; Spooren, Brockx, & Mortelmans, 2013). The hypothesis is straightforward: Students will give higher course evaluations to faculty who are lenient graders. This grade-leniency theory (Love & Kotchen, 2010; McKenzie, 1975) has long been lamented, particularly by faculty who perceive themselves as rigorous graders and do not enjoy favorable student evaluations. The assumption is so prevalent that it is close to being accepted as settled science (Ginexi, 2003; Marsh, 1987; Salmons, 1993). Ginexi (2003) posited that the relationship between anticipated grades and course evaluation ratings could be a function of cognitive dissonance (between the student’s self-image and an anticipated low grade) or of revenge theory (retribution for an anticipated low grade). Although Maurer (2006) argued that revenge theory is popular among faculty receiving low course evaluations, neither his study nor an earlier study by Kasten and Young (1983) found this to be the case. These authors therefore argued for the cognitive dissonance model, in which attributing poor teaching to the perceived lack of student success is an intrapersonal face-saving device.
A critical look at the literature presents an alternative argument. First, the relationship between anticipated grades and course evaluation ratings is moderate at best. Meta-analytic work (Centra & Creech, 1976; Feldman, 1997) suggests correlations between .10 and .30, meaning that anticipated grades account for less than 10% of the variance in course evaluations. It therefore appears that anticipated grades have little influence on student evaluations. Second, the relationship between anticipated grades and course evaluations could simply reflect an honest assessment of students’ opinions of instruction, which varies according to the students’ experiences of the course (J. K. Smith & Smith, 2009). Students who like the instructional approach may be expected to do better than students who do not. Students exposed to exceptionally good teaching might be expected to do well in the course and to rate the instruction highly (and vice versa for poor instruction). Although face-saving or revenge might occur, a fair amount of honest and accurate appraisal of the quality of teaching might be reflected in the observed correlations.
Historical Changes in Expectations for Grade Distributions
The roots of grading in higher education can be traced back hundreds of years. In the 16th century, Cambridge University developed a three-tier grading system with 25% of the grades at the top, 50% in the middle, and 25% at the bottom (Winter, 1993). Working from European models, American universities invented systems for ranking and categorizing students based both on academic performance and on progress, conduct, attentiveness, interest, effort, and regular attendance at class and chapel (Cureton, 1971; Rugg, 1918; Schneider & Hutt, 2014). Grades were ubiquitous at all levels of education at the turn of the 20th century but were idiosyncratically determined (Schneider & Hutt, 2014), as described earlier.
To resolve inconsistencies, educators turned to the new science of statistics and its concomitant passion for measuring and ranking human characteristics (Pearson, 1930). Inspired by the work of his cousin, Charles Darwin, Francis Galton pioneered the field of psychometrics, extending his efforts to ranking one’s fitness to produce high-quality offspring on an A to D scale (Galton & Galton, 1998). Educators began to debate how normal curve theory and other scientific advances should be applied to grading. As with K–12 education, the consensus was that the 0 to 100 marking system led to an unjustified implication of precision and that the normal curve would allow for the transformation of student ranks into A–F or other categories (Rugg, 1918).
Meyer (1908) argued for grade categories as follows: excellent (3% of students), superior (22%), medium (50%), inferior (22%), and failure (3%). He argued that a student picked at random is as likely to be of medium ability as not. Interestingly, Meyer’s terms for the middle three grades (superior, medium, and inferior) are norm-referenced, whereas the two extreme grades (excellent and failure) are criterion-referenced. Roughly a decade later, Nicolson (1917) found that 36 out of 64 colleges were using a 5-point scale for grading, typically A, B, C, D, and F. The questions debated at the time concerned the details of such systems rather than the overall approach. As Rugg (1918) stated,
Now the term inherited capacity practically defines itself. By it we mean the “start in life;” the sum total of nervous possibilities which the infant has at birth and to which, therefore, nothing that the individual himself can do will contribute in any way whatsoever. (p. 706)
Rugg (1918) went on to say that educational conditions interact with inherited capacity, resulting in what he called “ability-to-do” (p. 706). He recommended that teachers base marks on observations of students’ performance that reflect those abilities and that grades should form a normal distribution. This approach reduces grading to determining the number of grading divisions and the number of students who should fall into each category. Thus, there is a shift from a decentralized and fundamentally haphazard approach to assigning grades to one based on “scientific” (p. 701) principles. Furthermore, Rugg argued that letter grades were preferable to percentage grades because they more accurately represented the level of precision that was possible.
Another interesting aspect of Rugg’s (1918) and Meyer’s (1908) work is the notion that grades should simply be a method of ranking students and not necessarily be used for making decisions about achievement. Although Meyer argued that 3% should fail a typical course (and he feared that people would see this as too lenient), he was less certain about what to do with the “inferior” group, stating that grades should solely represent a student’s rank in the class. In hindsight, these approaches seem reductionist at best. Although the notion of grading “on the curve” remained popular through at least the early 1960s, a categorical (A–F) approach to assigning grades was implemented, with graders tacitly keeping a close eye on the notion that neither too many As nor too many Fs were handed out (Guskey, 2000; Kulick & Wright, 2008). The normal curve was the “silent partner” of the grading system.
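Meyer’s proposal is, in effect, an algorithm: rank the class, then hand out grades in fixed proportions. A minimal Python sketch makes the mechanics plain; the quotas are Meyer’s, while the scores themselves are invented for illustration:
import numpy as np

quotas = [("excellent", 0.03), ("superior", 0.22), ("medium", 0.50),
          ("inferior", 0.22), ("failure", 0.03)]   # Meyer's (1908) proportions

def curve_grades(scores):
    order = np.argsort(scores)[::-1]               # rank students, best first
    grades = np.empty(len(scores), dtype=object)
    start = 0
    for label, share in quotas:
        k = round(share * len(scores))
        grades[order[start:start + k]] = label     # fill the quota from the top
        start += k
    grades[order[start:]] = quotas[-1][0]          # any rounding remainder fails
    return grades

scores = np.random.default_rng(2).normal(70, 10, 100)   # invented exam scores
assigned = curve_grades(scores)
print({label: int((assigned == label).sum()) for label, _ in quotas})
# {'excellent': 3, 'superior': 22, 'medium': 50, 'inferior': 22, 'failure': 3}
Note that nothing in the procedure depends on what the scores mean; the same quotas apply whether the class has learned a great deal or very little, which is precisely the reductionism the review goes on to criticize.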
In the United States in the 1960s, a confluence of technical and societal events led to dramatic changes in perspectives about grading: criterion-referenced testing (Glaser, 1963), mastery learning and mastery testing (Bloom, 1971; Mayo, 1970), the Civil Rights movement, and the war in Vietnam. Glaser (1963) brought forth the innovative idea that sense should be made of test performance by “referencing” performance not to a norming group but rather to the domain from which the test came; students’ performance should not be based on the performance of their peers. The proper referent, according to Glaser, was the level of mastery of the subject matter being assessed. Working from Carroll’s (1963) model of school learning, Bloom (1971) developed the underlying argument for mastery learning theory: that achievement in any course (and, by extension, the grade received) should be a function of the quality of teaching, the perseverance of the student, and the time allowed for the student to master the material (Guskey, 1985).
It was not the case that the work of Bloom (1971) and Glaser (1963) single-handedly changed how grading took place in higher education, but ideas about teaching and learning partially inspired by this work led to a substantial rethinking of the proper aims of education. Bring into this mix a national reexamination of status and equity, and the time was ripe for a humanistic and social reassessment of grading and learning in general. The final ingredient in the mix was the war in Vietnam. The United States had its first conscription since World War II, and as the war grew increasingly unpopular, so did the pressure on professors not to fail students and thereby make them subject to the draft. The effect of the draft on grading practices in higher education is unmistakable (Rojstaczer & Healy, 2012): the proportion of A and B grades rose dramatically during the years of the draft, and the proportion of D and F grades fell concomitantly.
Grades have risen again dramatically in the past 25 years. Rojstaczer and Healy (2012) argued that the increase resulted from new views of students as consumers, or even customers, and away from viewing students as needing discipline. Others have contended that faculty inflate grades to vie for good course ratings (the grade-leniency theory; Love & Kotchen, 2010). Or perhaps students are higher achieving than they once were and deserve better grades.
Discussion
This review shows that over the past 100 years, teacher-assigned grades have been maligned by researchers and psychometricians alike as subjective and unreliable measures of student academic achievement (Allen, 2005; Banker, 1927; Carter, 1952; Evans, 1976; Hargis, 1990; Kirschenbaum et al., 1971; Quann, 1983; S. B. Simon & Bellanca, 1976). However, others have noted that grades are a useful indicator of numerous factors that matter to students, teachers, parents, schools, and communities (Bisesi, Farr, Greene, & Haydel, 2000; Folzer-Napier, 1976; Linn, 1982). Over the past 100 years, research has attempted to identify the different components of grades in order to inform educational decision making (Bowers, 2009; Parsons, 1959). Interestingly, although standardized assessment scores have been shown to have low criterion validity for overall schooling outcomes (e.g., high school graduation and admission to postsecondary institutions), grades consistently predict K–12 educational persistence, completion, and the transition from high school to college (Atkinson & Geiser, 2009; Bowers et al., 2013).
One hundred years of quantitative studies of the composition of K–12 report card grades demonstrate that teacher-assigned grades represent both the cognitive knowledge measured in standardized assessment scores and, to a smaller extent, noncognitive factors such as substantive engagement, persistence, and positive school behaviors (e.g., Bowers, 2009, 2011; Farkas et al., 1990; Klapp Lekholm & Cliffordson, 2008, 2009; Miner, 1967; Willingham et al., 2002). Grades are useful in predicting and identifying students who may face challenges in either the academic component of schooling or in the sociobehavioral domain (e.g., Allensworth, 2013; Allensworth & Easton, 2007; Allensworth et al., 2014; Atkinson & Geiser, 2009; Bowers, 2014).
The conclusion is that grades typically represent a mixture of multiple factors that teachers value. Teachers recognize the important role of effort in achievement and motivation (Aronson, 2008; Cizek et al., 1995; Cross & Frary, 1999; Duncan & Noonan, 2007; Guskey, 2002, 2009a; Imperial, 2011; S. Kelly, 2008; Liu, 2008b; McMillan, 2001; McMillan et al., 2002; McMillan & Lawson, 2001; McMillan & Nash, 2000; Randall & Engelhard, 2009, 2010; Russell & Austin, 2010; Sun & Cheng, 2013; Svennberg et al., 2014; Troug & Friedman, 1996; Yesbeck, 2011). They differentiate academic enablers (McMillan, 2001, p. 25), such as effort, ability, improvement, work habits, attention, and participation, which they endorse as relevant to grading, from other student characteristics, such as gender, socioeconomic status, or personality, which they do not endorse as relevant to grading.
This quality of graded achievement as a multidimensional measure of success in school may be what makes grades better predictors of future success in school than tested achievement (Atkinson & Geiser, 2009; Barrington & Hendricks, 1989; Bowers, 2014; Cairns et al., 1989; Cliffordson, 2008; Ekstrom et al., 1986; Ensminger & Slusarcick, 1992; Finn, 1989; Fitzsimmons et al., 1969; Hargis, 1990; Lloyd, 1974, 1978; Morris et al., 1991; Rumberger, 1987; Troob, 1985; Voss et al., 1966), especially given the known limitations of achievement testing (Nichols & Berliner, 2007; Polikoff, Porter, & Smithson, 2011). In the search for assessments of noncognitive factors that predict educational outcomes (Heckman & Rubinstein, 2001; Levin, 2013), grades appear to be useful. Current theories postulate that both cognitive and noncognitive skills are important to acquire and build over the course of life. Although noncognitive skills may help students develop cognitive skills, the reverse is not true (Cunha & Heckman, 2008).
Teachers’ values are a major component in this multidimensional interpretation of grades. Besides academic enablers, two other important teacher values work to make graded achievement different from tested achievement. One is the value that teachers place on being fair to students (Bonner, 2016; Bonner & Chen, 2009; Brookhart, 1994; Grimes, 2010; Hay & Macdonald, 2008; Sun & Cheng, 2013; Svennberg et al., 2014; Tierney et al., 2011). In their concept of fairness, most teachers believe that students who try should not fail, whether or not they learn. Related to this concept is teachers’ wish to help all or most students be successful (Bonner, 2016; Brookhart, 1994).
Grades, therefore, must be considered multidimensional measures that reflect mostly achievement of classroom learning intentions and also, to a lesser degree, students’ efforts at getting there. Grades are not unidimensional measures of pure achievement, as has been assumed in the past (e.g., Carter, 1952; McCandless et al., 1972; Moore, 1939; C. C. Ross & Hooks, 1930) or recommended in the present (e.g., Brookhart, 2009, 2011; Guskey, 2000; Guskey & Bailey, 2010; Marzano & Heflebower, 2011; O’Connor, 2009; Scriffiny, 2008). Although measurement experts and professional developers may wish grades were unadulterated measures of what students have learned and are able to do, strong evidence indicates that they are not.
For those who wish grades could be a more focused measure of achievement of intended instructional outcomes, future research needs to cast a broader net. The value teachers attach to effort and other academic enablers in grades and their insistence that grades should be fair point to instructional and societal issues that are well beyond the scope of grading. Why, for example, do some students who sincerely try to learn what they are taught not achieve the intended learning outcomes? Two important possibilities are intended learning outcomes that are developmentally inappropriate for these students (e.g., these students lack readiness or prior instruction in the domain) and poorly designed lessons that do not make clear what students are expected to learn, do not instruct students in appropriate ways, and do not arrange learning activities and formative assessments in ways that help students learn well.
Research focusing solely on grades typically misses antecedent causes. Future research should make these connections. For example, does more of the variance in grades reflect achievement in classes where lessons are high quality and appropriate for students? Is a negatively skewed grade distribution, where most students achieve and very few fail, effective for the purposes of certifying achievement, communicating with students and parents, passing students to the next grade, or predicting future educational success? Do changes in instructional design lead to changes in grading practices, in grade distributions, and in the usefulness of grades as predictors of future educational success?
This review suggests that most teachers’ grades do not yield a pure achievement measure but rather a multidimensional measure dependent on both what students learn and how they behave in the classroom. This conclusion, however, does not excuse low-quality grading practices or suggest there is no room for improvement. One hundred years of grading research have generally confirmed large variation among teachers in the validity and reliability of grades, both in the meaning of grades and in the accuracy of reporting. Early research found great variation among teachers when asked to grade the same examination or paper. Many of these early studies communicated a “what’s wrong with teachers” undertone that today would likely be seen as researcher bias.
Early researchers attributed variation in teachers’ grades to one or more of the following sources: criteria (Ashbaugh, 1924; Brimi, 2011; Healy, 1935; Silberstein, 1922; Sims, 1933; Starch, 1915; Starch & Elliott, 1913a, 1913b), students’ work quality (Bolton, 1927; Healy, 1935; Jacoby, 1910; Lauterbach, 1928; Shriner, 1930; Sims, 1933), teacher severity/leniency (Shriner, 1930; Silberstein, 1922; Sims, 1933; Starch, 1915; Starch & Elliott, 1913b), task (Silberstein, 1922; Starch & Elliott, 1913a), scale (Ashbaugh, 1924; Sims, 1933; Starch, 1913, 1915), and teacher error (Brimi, 2011; Eells, 1930; Hulten, 1925; Lauterbach, 1928; Silberstein, 1922; Starch & Elliott, 1912, 1913a, 1913b).
Starch (1913; Starch & Elliott, 1913b) found that teacher error and emphasizing different criteria were the two largest sources of variation.
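These sources of variation are easy to mimic in a toy simulation. In the following Python sketch, every magnitude is an invented assumption: simulated teachers grade the same set of papers while differing in severity/leniency and in the weight they give a second criterion (say, mechanics rather than content), with random error added, in the spirit of Starch and Elliott’s demonstrations:
import numpy as np

rng = np.random.default_rng(3)
papers, teachers = 50, 40
content = rng.normal(75, 8, papers)            # quality on the main criterion
mechanics = rng.normal(75, 8, papers)          # quality on a second criterion
severity = rng.normal(0, 4, (teachers, 1))     # teacher severity/leniency
weight = rng.uniform(0, 0.5, (teachers, 1))    # emphasis placed on mechanics
error = rng.normal(0, 5, (teachers, papers))   # unreliability of judgment

marks = (1 - weight) * content + weight * mechanics + severity + error

# Spread of marks assigned to the SAME paper across simulated teachers.
print("mean within-paper range:", round(float((marks.max(0) - marks.min(0)).mean()), 1))
Even with modest assumed differences in severity and criterion emphasis, identical papers receive marks spanning a wide range, which is qualitatively what the early studies observed.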
Regarding sources of error, J. K. Smith (2003) suggested reconceptualizing reliability for grades as a matter of sufficiency of information for making the grade assignment. This recommendation is consistent with the fact that as grades are aggregated from individual pieces of work into report card or course grades and GPAs, reliability increases, because errors attached to particular pieces of work tend to average out. The reliability of overall college grade point average is estimated at .93 (Beatty, Walmsley, Sackett, Kuncel, & Koch, 2015).
In most studies investigating teachers’ grading reliability, teachers were sent examination papers without specific grading criteria and simply asked to assign grades. Today, this lack of clear grading criteria would be seen as a shortcoming in the assessment process. Most of these studies thus confounded teachers’ inability to judge student work consistently with random error, treating both as teacher error. Rater training offers a modern solution to this situation. Research has shown that with training on established criteria, individuals can judge examinees’ work more accurately and reliably (Myford, 2012). Unfortunately, most teachers and professors today are not well trained, typically grade alone, and rarely seek help from colleagues to check the reliability of their grading. Thus, working toward clearer criteria, collaborating among teachers, and involving students in the development of grading criteria appear to be promising approaches to enhancing grading reliability.
Considering criteria as a source of variation in teachers’ grading has implications for grade meaning and validity. The attributes on which grading decisions are based function as the constructs the grades are intended to measure. To the extent that teachers include factors that do not indicate achievement in the domain they intend to measure (e.g., when grades include consideration of format and surface-level features of an assignment), grades do not give students, parents, or other educators accurate information about learning. Furthermore, to the extent that teachers do not appropriately interpret student work as evidence of learning, the intended meaning of the grade is also compromised. There is evidence that even teachers who explicitly decide to grade solely on achievement of learning standards sometimes mix effort, improvement, and other academic enablers into grades (Cox, 2011; Hay & Macdonald, 2008; McMunn et al., 2003).
Future research in this area should seek ways to help teachers improve the criteria they use to grade, their skill at identifying levels of quality on those criteria, and their ability to effectively merge these assessment skills with instructional skills. When students are taught the criteria by which to judge high-quality work and are assessed by those same criteria, grade meaning is enhanced. Even if grades remain multidimensional measures of success in school, the dimensions on which grades are based should be defensible goals of schooling and should match students’ opportunities to learn.
No research agenda will ever entirely eliminate teacher variation in grading. Nevertheless, the authors of this review have suggested several ways forward. Investigating grading in the larger context of instruction and assessment will help focus research on important sources and causes of invalid or unreliable grading decisions.
Investigating ways to differentiate instruction more effectively, routinely, and easily will reduce teachers’ feelings of pressure to pass students who may try but do not reach an expected level of achievement. Investigating the multidimensional construct of “success in school” will acknowledge that grades measure something significant that is not measured by achievement tests. Investigating ways to help teachers develop skills in writing or selecting criteria, communicating those criteria, and recognizing them in students’ work will improve the quality of grading. All of these seem reachable goals to achieve before the next century of grading research. All will assuredly contribute to enhancing the validity, reliability, and fairness of grading.
Note
Contributing authors worked equally and are listed in alphabetical order after the two project leaders.
References
Abrami, P. C., Dickens, W. J., Perry, R. P., & Leventhal, L. (1980). Do teacher standards for assigning grades affect student evaluations of instruction? Journal of Educational Psychology, 72, 107–118. doi:10.1037/0022-0663.72.1.107
Adrian, C. A. (2012). Implementing standards-based grading: Elementary teachers’ beliefs, practices and concerns (Doctoral dissertation). Available from ProQuest Dissertations and Theses database. (UMI No. 1032540669)
Alexander, K. L., Entwisle, D. R., & Kabbani, N. S. (2001). The dropout process in life course perspective: Early risk factors at home and school. Teachers College Record, 103, 760–822. doi:10.1111/0161-4681.00134
Allen, J. D. (2005). Grades as valid measures of academic achievement of classroom learning. The Clearing House, 78, 218–223. doi:10.3200/TCHS.78.5.218-223
Allensworth, E. M. (2013). The use of ninth-grade early warning indicators to improve Chicago schools. Journal of Education for Students Placed at Risk, 18, 68–83. doi:10.1080/10824669.2013.745181
Allensworth, E. M., & Easton, J. Q. (2005). The on-track indicator as a predictor of high school graduation. Chicago, IL: University of Chicago Consortium on Chicago School Research.
Allensworth, E. M., & Easton, J. Q. (2007). What matters for staying on-track and graduating in Chicago public high schools: A close look at course grades, failures, and attendance in the freshman year. Chicago, IL: University of Chicago Consortium on Chicago School Research.
Allensworth, E. M., Gwynne, J. A., Moore, P., & de la Torre, M. (2014). Looking forward to high school and college: Middle grade indicators of readiness in Chicago Public Schools. Chicago, IL: University of Chicago Consortium on Chicago School Research.
Aronson, M. J. (2008). How teachers’ perceptions in the areas of student behavior, attendance and student personality influence their grading practice (Doctoral dissertation). Available from ProQuest Dissertations and Theses database. (UMI No. 304510267)
Ashbaugh, E. J. (1924). Reducing the variability of teachers’ marks. Journal of Educational Research, 9, 185–198. doi:10.1080/00220671.1924.10879447
Atkinson, R. C., & Geiser, S. (2009). Reflections on a century of college admissions tests. Educational Researcher, 38, 665–676. doi:10.3102/0013189x09351981
Bailey, M. T. (2012). The relationship between secondary school teacher perceptions of grading practices and secondary school teacher perceptions of student motivation (Doctoral dissertation). Available from ProQuest Dissertations and Theses database. (UMI No. 1011481355)
Balfanz, R., Herzog, L., & MacIver, D. J. (2007). Preventing student disengagement and keeping students on the graduation path in urban middle-grades schools: Early identification and effective interventions. Educational Psychologist, 42, 223–235. doi:10.1080/00461520701621079
Banker, H. J. (1927). The significance of teachers’ marks. Journal of Educational Research, 16, 159–171. doi:10.1080/00220671.1927.10879778
Barrington, B. L., & Hendricks, B. (1989). Differentiating characteristics of high school graduates, dropouts, and nongraduates. Journal of Educational Research, 82, 309–319. doi:10.1080/00220671.1989.10885913
Beatty, A. S., Walmsley, P. T., Sackett, P. R., Kuncel, N. R., & Koch, A. J. (2015). The reliability of college grades. Educational Measurement: Issues and Practice, 34, 31–40. doi:10.1111/emip.12096
Bisesi, T., Farr, R., Greene, B., & Haydel, E. (2000). Reporting to parents and the community. In E. Trumbull & B. Farr (Eds.), Grading and reporting student progress in an age of standards (pp. 157–184). Norwood, MA: Christopher-Gordon.
Bloom, B. S. (1971). Mastery learning. In J. H. Block (Ed.), Mastery learning: Theory and practice (pp. 47–63). New York, NY: Holt, Rinehart & Winston.
Bolton, F. E. (1927). Do teachers’ marks vary as much as supposed? Education, 48, 28–39.
Bonner, S. M. (2016). Teachers’ perceptions about assessment. In G. T. Brown & L. Harris (Eds.), Handbook of human and social conditions in assessment (pp. 21–39). London, England: Routledge.
Bonner, S. M., & Chen, P. P. (2009). Teacher candidates’ perceptions about grading and constructivist teaching. Educational Assessment, 14, 57–77. doi:10.1080/10627190903039411
Bowers, A. J. (2009). Reconsidering grades as data for decision making: More than just academic knowledge. Journal of Educational Administration, 47, 609–629. doi:10.1108/09578230910981080
Bowers, A. J. (2010a). Analyzing the longitudinal K-12 grading histories of entire cohorts of students: Grades, data driven decision making, dropping out and hierarchical cluster analysis. Practical Assessment, Research & Evaluation, 15(7), 1–18. Retrieved from http://pareonline.net/pdf/v15n7.pdf
Bowers, A. J. (2010b). Grades and graduation: A longitudinal risk perspective to identify student dropouts. Journal of Educational Research, 103, 191–207. doi:10.1080/00220670903382970
Bowers, A. J. (2011). What’s in a grade? The multidimensional nature of what teacher-assigned grades assess in high school. Educational Research and Evaluation, 17, 141–159. doi:10.1080/13803611.2011.597112
Bowers, A. J. (2014). Student risk factors. In D. J. Brewer & L. O. Picus (Eds.), Encyclopedia of education economics & finance (pp. 624–628). Thousand Oaks, CA: Sage.
Bowers, A. J., & Sprott, R. (2012). Examining the multiple trajectories associated with dropping out of high school: A growth mixture model analysis. Journal of Educational Research, 105, 176–195. doi:10.1080/00220671.2011.552075
Bowers, A. J., Sprott, R., & Taff, S. (2013). Do we know who will drop out? A review of the predictors of dropping out of high school: Precision, sensitivity and specificity. High School Journal, 96, 77–100. doi:10.1353/hsj.2013.0000
Brennan, R. T., Kim, J., Wenz-Gross, M., & Siperstein, G. N. (2001). The relative equitability of high-stakes testing versus teacher-assigned grades: An analysis of the Massachusetts Comprehensive Assessment System (MCAS). Harvard Educational Review, 71, 173–215. doi:10.17763/haer.71.2.v51n6503372t4578
Brimi, H. M. (2011). Reliability of grading high school work in English. Practical Assessment, Research & Evaluation, 16(17). Retrieved from http://pareonline.net/getvn.asp?v=16&n=17
Brookhart, S. M. (1991). Grading practices and validity. Educational Measurement: Issues and Practice, 10(1), 35–36. doi:10.1111/j.1745-3992.1991.tb00182.x
Brookhart, S. M. (1993). Teachers’ grading practices: Meaning and values. Journal of Educational Measurement, 30, 123–142. doi:10.1111/j.1745-3984.1993.tb01070.x
Brookhart, S. M. (1994). Teachers’ grading: Practice and theory. Applied Measurement in Education, 7, 279–301. doi:10.1207/s15324818ame0704_2
Brookhart, S. M. (2009). Grading (2nd ed.). New York, NY: Merrill Pearson Education.
Brookhart, S. M. (2011). Grading and learning: Practices that support student achievement. Bloomington, IN: Solution Tree Press.
Brookhart, S. M. (2015). Graded achievement, tested achievement, and validity. Educational Assessment, 20, 268–296. doi:10.1080/10627197.2015.1093928
Brumfield, C. (2005). Current trends in grades and grading practices in higher education: Results of the 2004 AACRAO survey. Retrieved from ERIC database. (ED489795)
Cairns, R. B., Cairns, B. D., & Neckerman, H. J. (1989). Early school dropout: Configurations and determinants. Child Development, 60, 1437–1452. doi:10.2307/1130933
Carroll, J. (1963). A model of school learning. Teachers College Record, 64, 723–733.
Carter, R. S. (1952). How invalid are marks assigned by teachers? Journal of Educational Psychology, 43, 218–228. doi:10.1037/h0061688
Casillas, A., Robbins, S., Allen, J., Kuo, Y. L., Hanson, M. A., & Schmeiser, C. (2012). Predicting early academic failure in high school from prior academic achievement, psychosocial characteristics, and behavior. Journal of Educational Psychology, 104, 407–420. doi:10.1037/a0027180
Centra, J. A. (1993). Reflective faculty evaluation. San Francisco, CA: Jossey-Bass.
Centra, J. A., & Creech, F. R. (1976). The relationship between student, teacher, and course characteristics and student ratings of teacher effectiveness (Report No. PR-76-1). Princeton, NJ: Educational Testing Service.
Cizek, G. J. (2000). Pockets of resistance in the assessment revolution. Educational Measurement: Issues and Practice, 19(2), 16–23. doi:10.1111/j.1745-3992.2000.tb00026.x
Cizek, G. J., Fitzgerald, J. M., & Rachor, R. A. (1995). Teachers’ assessment practices: Preparation, isolation, and the kitchen sink. Educational Assessment, 3, 159–179. doi:10.1207/s15326977ea0302_3
Clarridge, P. B., & Whitaker, E. M. (1994). Implementing a new elementary progress report. Educational Leadership, 52(2), 7–9. Retrieved from http://www.ascd.org/publications/educational-leadership/oct94/vol52/num02/Implementing-a-New-Elementary-Progress-Report.aspx
Cliffordson, C. (2008). Differential prediction of study success across academic programs in the Swedish context: The validity of grades and tests as selection instruments for higher education. Educational Assessment, 13, 56–75. doi:10.1080/10627190801968240
Collins, J. R., & Nickel, K. N. (1974). A study of grading practices in institutions of higher education. Retrieved from ERIC database. (ED 097 846)
Cox, K. B. (2011). Putting classroom grading on the table, a reform in progress. American Secondary Education, 40(1), 67–87.
Crooks, A. D. (1933). Marks and marking systems: A digest. Journal of Educational Research, 27, 259–272. doi:10.1080/00220671.1933.10880402
Cross, L. H., & Frary, R. B. (1999). Hodgepodge grading: Endorsed by students and teachers alike. Applied Measurement in Education, 12, 53–72. doi:10.1207/s15324818ame1201_4
Cunha, F., & Heckman, J. J. (2008). Formulating, identifying and estimating the technology of cognitive and noncognitive skill formation. Journal of Human Resources, 43, 738–782. doi:10.3368/jhr.43.4.738
Cureton, L. W. (1971). The history of grading practices. NCME Measurement in Education, 2(4), 1–8.
Duckworth, A. L., Quinn, P. D., & Tsukayama, E. (2012). What No Child Left Behind leaves behind: The roles of IQ and self-control in predicting standardized achievement test scores and report card grades. Journal of Educational Psychology, 104, 439–451. doi:10.1037/a0026280
Duckworth, A. L., & Seligman, M. E. P. (2006). Self-discipline gives girls the edge: Gender in self-discipline, grades, and achievement test scores. Journal of Educational Psychology, 98, 198–208. doi:10.1037/0022-0663.98.1.198
Duncan, R. C., & Noonan, B. (2007). Factors affecting teachers’ grading and assessment practices. Alberta Journal of Educational Research, 53, 1–21.
Edgeworth, F. Y. (1888). The statistics of examinations. Journal of the Royal Statistical Society, 51, 599–635.
Eells, W. C. (1930). Reliability of repeated grading of essay type examinations. Journal of Educational Psychology, 21, 48–52.
Ekstrom, R. B., Goertz, M. E., Pollack, J. M., & Rock, D. A. (1986). Who drops out of high school and why? Findings from a national study. Teachers College Record, 87, 356–373.
Ensminger, M. E., & Slusarcick, A. L. (1992). Paths to high school graduation or dropout: A longitudinal study of a first-grade cohort. Sociology of Education, 65, 91–113. doi:10.2307/2112677
European Commission. (2009). ECTS user’s guide. Luxembourg: Office for Official Publications of the European Communities. doi:10.2766/88064
Evans, F. B. (1976). What research says about grading. In S. B. Simon & J. A. Bellanca (Eds.), Degrading the grading myths: A primer of alternatives to grades and marks (pp. 30–50). Washington, DC: Association for Supervision and Curriculum Development.
Farkas, G., Grobe, R. P., Sheehan, D., & Shuan, Y. (1990). Cultural resources and school success: Gender, ethnicity, and poverty groups within an urban school district. American Sociological Review, 55, 127–142. doi:10.2307/2095708
Farr, B. P. (2000). Grading practices: An overview of the issues. In E. Trumbull & B. Farr (Eds.), Grading and reporting student progress in an age of standards (pp. 1–22). Norwood, MA: Christopher-Gordon.
Feldman, K. A. (1997). Identifying exemplary teachers and teaching: Evidence from student ratings. In R. P. Perry & J. C. Smart (Eds.), Effective teaching in higher education: Research and practice (pp. 93–143). New York, NY: Agathon Press.
Finn, J. D. (1989). Withdrawing from school. Review of Educational Research, 59, 117–142. doi:10.3102/00346543059002117
Fitzsimmons, S. J., Cheever, J., Leonard, E., & Macunovich, D. (1969). School failures: Now and tomorrow. Developmental Psychology, 1, 134–146. doi:10.1037/h0027088
Folzer-Napier, S. (1976). Grading and young children. In S. B. Simon & J. A. Bellanca (Eds.), Degrading the grading myths: A primer of alternatives to grades and marks (pp. 23–27). Washington, DC: Association for Supervision and Curriculum Development.
Frary, R. B., Cross, L. H., & Weber, L. J. (1993). Testing and grading practices and opinions of secondary teachers of academic subjects: Implications for instruction in measurement. Educational Measurement: Issues and Practice, 12(3), 23–30. doi:10.1111/j.1745-3992.1993.tb00539.x
Galton, D. J., & Galton, C. J. (1998). Francis Galton: And eugenics today. Journal of Medical Ethics, 24, 99–105.
Ginexi, E. M. (2003). General psychology course evaluations: Differential survey response by expected grade. Teaching of Psychology, 30, 248–251.
Glaser, R. (1963). Instructional technology and the measurement of learning outcomes: Some questions. American Psychologist, 18, 519–521. doi:10.1111/j.1745-3992.1994.tb00561.x
Gleason, P., & Dynarski, M. (2002). Do we know whom to serve? Issues in using risk factors to identify dropouts. Journal of Education for Students Placed at Risk, 7, 25–41. doi:10.1207/S15327671ESPR0701_3
Grimes, T. V. (2010). Interpreting the meaning of grades: A descriptive analysis of middle school teachers' assessment and grading practices (Doctoral dissertation). Available from ProQuest Dissertations and Theses database. (UMI No. 305268025)
Grindberg, E. (2014, April 7). Ditching letter grades for a "window" into the classroom. Cable News Network. Retrieved from http://www.cnn.com/2014/04/07/living/report-card-changes-standards-based-grading-schools/
Guskey, T. R. (1985). Implementing mastery learning. Belmont, CA: Wadsworth.
Guskey, T. R. (2000). Grading policies that work against standards . . . and how to fix them. NASSP Bulletin, 84(620), 20–29. doi:10.1177/019263650008462003
Guskey, T. R. (2002, April). Perspectives on grading and reporting: Differences among teachers, students, and parents. Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA.
Guskey, T. R. (2004). The communication challenge of standards-based reporting. Phi Delta Kappan, 86, 326–329. doi:10.1177/003172170408600419
Guskey, T. R. (2009a, April). Bound by tradition: Teachers' views of crucial grading and reporting issues. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA.
Guskey, T. R. (2009b). Grading policies that work against standards . . . and how to fix them. In T. R. Guskey (Ed.), Practical solutions for serious problems in standards-based grading (pp. 9–26). Thousand Oaks, CA: Corwin.
Guskey, T. R., & Bailey, J. M. (2001). Developing grading and reporting systems for student learning. Thousand Oaks, CA: Corwin.
Guskey, T. R., & Bailey, J. M. (2010). Developing standards-based report cards. Thousand Oaks, CA: Corwin.
Guskey, T. R., Swan, G. M., & Jung, L. A. (2010, April). Developing a statewide, standards-based student report card: A review of the Kentucky initiative. Paper presented at the annual meeting of the American Educational Research Association, Denver, CO.
Hargis, C. H. (1990). Grades and grading practices: Obstacles to improving education and helping at-risk students. Springfield, IL: Charles C. Thomas.
Hay, P. J., & Macdonald, D. (2008). (Mis)appropriations of criteria and standards-referenced assessment in a performance-based subject. Assessment in Education, 15, 153–168. doi:10.1080/09695940802164184
Healy, K. L. (1935). A study of the factors involved in the rating of pupils' compositions. Journal of Experimental Education, 4, 50–53. doi:10.1080/00220973.1935.11009995
Heckman, J. J., & Rubinstein, Y. (2001). The importance of noncognitive skills: Lessons from the GED testing program. American Economic Review, 91, 145–149. doi:10.2307/2677749
Henke, R. R., Chen, X., Goldman, G., Rollefson, M., & Gruber, K. (1999). What happens in classrooms? Instructional practices in elementary and secondary schools, 1994–95 (NCES 1999-348). Washington, DC: U.S. Department of Education. Retrieved from http://nces.ed.gov/pubs99/1999348.pdf
Hill, G. (1935). The report card in present practice. Educational Method, 15, 115–131.
Holmes, D. S. (1972). Effects of grades and disconfirmed grade expectancies on students' evaluations of their instructor. Journal of Educational Psychology, 63, 130–133.
Howley, A., Kusimo, P. S., & Parrott, L. (1999). Grading and the ethos of effort. Learning Environments Research, 3, 229–246. doi:10.1023/A:1011469327430
Hulten, C. E. (1925). The personal element in teachers' marks. Journal of Educational Research, 12, 49–55. doi:10.1080/00220671.1925.10879575
Imperial, P. (2011). Grading and reporting purposes and practices in Catholic secondary schools and grades' efficacy in accurately communicating student learning (Doctoral dissertation). Available from ProQuest Dissertations and Theses database. (UMI No. 896956719)
Jacoby, H. (1910). Note on the marking system in the astronomical course at Columbia College, 1909–1910. Science, 31, 819–820. doi:10.1126/science.31.804.819
Jimerson, S. R., Egeland, B., Sroufe, L. A., & Carlson, B. (2000). A prospective longitudinal study of high school dropouts examining multiple predictors across development. Journal of School Psychology, 38, 525–549. doi:10.1016/S0022-4405(00)00051-0
Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 17–64). Westport, CT: American Council on Education/Praeger.
Kasten, K. L., & Young, I. P. (1983). Bias and the intended use of student evaluations of university faculty. Instructional Science, 12, 161–169. doi:10.1007/BF00122455
Kelly, F. J. (1914). Teachers' marks: Their variability and standardization (Contributions to Education No. 66). New York, NY: Teachers College, Columbia University.
Kelly, S. (2008). What types of students' effort are rewarded with high marks? Sociology of Education, 81, 32–52. doi:10.1177/003804070808100102
Kirschenbaum, H., Napier, R., & Simon, S. B. (1971). Wad-ja-get? The grading game in American education. New York, NY: Hart.
Klapp Lekholm, A. (2011). Effects of school characteristics on grades in compulsory school. Scandinavian Journal of Educational Research, 55, 587–608. doi:10.1080/00313831.2011.555923
Klapp Lekholm, A., & Cliffordson, C. (2008). Discrepancies between school grades and test scores at individual and school level: Effects of gender and family background. Educational Research and Evaluation, 14, 181–199. doi:10.1080/13803610801956663
Klapp Lekholm, A., & Cliffordson, C. (2009). Effects of student characteristics on grades in compulsory school. Educational Research and Evaluation, 15, 1–23. doi:10.1080/13803610802470425
Kulick, G., & Wright, R. (2008). The impact of grading on the curve: A simulation analysis. International Journal for the Scholarship of Teaching and Learning, 2(2), Article 5.
Kunnath, J. P. (2016). A critical pedagogy perspective of the impact of school poverty level on the teacher grading decision-making process (Doctoral dissertation). Available from ProQuest Dissertations and Theses database. (UMI No. 10007423)
Lauterbach, C. E. (1928). Some factors affecting teachers' marks. Journal of Educational Psychology, 19, 266–271.
Levin, H. M. (2013). The utility and need for incorporating noncognitive skills into large-scale educational assessments. In M. von Davier, E. Gonzalez, I. Kirsch, & K. Yamamoto (Eds.), The role of international large-scale assessments: Perspectives from technology, economy, and educational research (pp. 67–86). Dordrecht, Netherlands: Springer.
Linn, R. L. (1982). Ability testing: Individual differences, prediction, and differential prediction. In A. K. Wigdor & W. R. Garner (Eds.), Ability testing: Uses, consequences, and controversies (pp. 335–388). Washington, DC: National Academies Press.
Liu, X. (2008a, October). Assessing measurement invariance of the teachers' perceptions of grading practices scale across cultures. Paper presented at the annual meeting of the Northeastern Educational Research Association, Rocky Hill, CT.
Liu, X. (2008b, October). Measuring teachers' perceptions of grading practices: Does school level make a difference? Paper presented at the annual meeting of the Northeastern Educational Research Association, Rocky Hill, CT.
Liu, X., O'Connell, A. A., & McCoach, D. B. (2006, April). The initial validation of teachers' perceptions of grading practices. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA.
Llosa, L. (2008). Building and supporting a validity argument for a standards-based classroom assessment of English proficiency based on teacher judgments. Educational Measurement: Issues and Practice, 27(3), 32–42. doi:10.1111/j.1745-3992.2008.00126.x
Lloyd, D. N. (1974). Analysis of sixth grade characteristics predicting high school dropout or graduation. JSAS Catalog of Selected Documents in Psychology, 4, 90.
Lloyd, D. N. (1978). Prediction of school failure from third-grade data. Educational and Psychological Measurement, 38, 1193–1200. doi:10.1177/001316447803800442
Love, D. A., & Kotchen, M. J. (2010). Grades, course evaluations, and academic incentives. Eastern Economic Journal, 36, 151–163. doi:10.1057/eej.2009.6
Marsh, H. W. (1984). Students' evaluations of university teaching: Dimensionality, reliability, validity, potential biases, and utility. Journal of Educational Psychology, 76, 707–754.
Marsh, H. W. (1987). Students' evaluations of university teaching: Research findings, methodological issues, and directions for future research. International Journal of Educational Research, 11, 253–288. doi:10.1016/0883-0355(87)90001-2
Marzano, R. J., & Heflebower, T. (2011). Grades that show what students know. Educational Leadership, 69(3), 34–39.
Maurer, T. W. (2006). Cognitive dissonance or revenge? Student grades and course evaluations. Teaching of Psychology, 33, 176–179. doi:10.1207/s15328023top3303_4
Mayo, S. T. (1970). Trends in the teaching of the first course in measurement. Paper presented at the National Council on Measurement in Education symposium, Chicago, IL. Retrieved from ERIC database. (ED047007)
McCandless, B. R., Roberts, A., & Starnes, T. (1972). Teachers' marks, achievement test scores, and aptitude relations with respect to social class, race, and sex. Journal of Educational Psychology, 63, 153–159. doi:10.1037/h0032646
McKeachie, W. J. (1979). Student ratings of faculty: A reprise. Academe, 65, 384–397. doi:10.2307/40248725
McKenzie, R. B. (1975). The economic effects of grade inflation on instructor evaluations: A theoretical approach. Journal of Economic Education, 6, 99–105. doi:10.1080/00220485.1975.10845408
McMillan, J. H. (2001). Secondary teachers' classroom assessment and grading practices. Educational Measurement: Issues and Practice, 20(1), 20–32. doi:10.1111/j.1745-3992.2001.tb00055.x
McMillan, J. H. (2009). Synthesis of issues and implications for practice. In T. R. Guskey (Ed.), Practical solutions for serious problems in standards-based grading (pp. 105–120). Thousand Oaks, CA: Corwin.
McMillan, J. H., & Lawson, S. R. (2001). Secondary science teachers' classroom assessment and grading practices. Retrieved from ERIC database. (ED 450 158)
McMillan, J. H., Myran, S., & Workman, D. (2002). Elementary teachers' classroom assessment and grading practices. Journal of Educational Research, 95, 203–213. doi:10.1080/00220670209596593
McMillan, J. H., & Nash, S. (2000, April). Teacher classroom assessment and grading decision making. Paper presented at the annual meeting of the National Council on Measurement in Education, New Orleans, LA.
McMunn, N., Schenck, P., & McColskey, W. (2003, April). Standards-based assessment, grading, and reporting in classrooms: Can district training and support change teacher practice? Paper presented at the annual meeting of the American Educational Research Association, Chicago, IL.
Melograno, V. J. (2007). Grading and report cards for standards-based physical education. Journal of Physical Education, Recreation, and Dance, 78(6), 45–53. doi:10.1080/07303084.2007.10598041
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). New York, NY: American Council on Education and Macmillan.
Meyer, M. (1908). The grading of students. Science, 28, 243–252. doi:10.1126/science.28.712.243
Miner, B. C. (1967). Three factors of school achievement. Journal of Educational Research, 60, 370–376. doi:10.2307/27531890
Mohnsen, B. (2013). Assessment and grading in physical education. Strategies, 20(2), 24–28. doi:10.1080/08924562.2006.10590709
Moore, C. C. (1939). The elementary school mark. Pedagogical Seminary and Journal of Genetic Psychology, 54, 285–294. doi:10.1080/08856559.1939.10534336
Morris, J. D., Ehren, B. J., & Lenz, B. K. (1991). Building a model to predict which fourth through eighth graders will drop out in high school. Journal of Experimental Education, 59, 286–293. doi:10.1080/00220973.1991.10806615
Myford, C. (2012). Rater cognition research: Some possible directions for the future. Educational Measurement: Issues and Practice, 31(3), 48–49. doi:10.1111/j.1745-3992.2012.00243.x
Nichols, S. L., & Berliner, D. C. (2007). Collateral damage: How high stakes testing corrupts America's schools. Cambridge, MA: Harvard Education Press.
Nicolson, F. W. (1917). Standardizing the marking system. Educational Review, 54, 225–237.
Nuffic. (2013). Grading systems in the Netherlands, the United States and the United Kingdom. The Hague, Netherlands: Author.
O'Connor, K. (2009). How to grade for learning: Linking grades to standards (3rd ed.). Glenview, IL: Pearson Professional Development.
Pallas, A. M. (1989). Conceptual and measurement issues in the study of school dropouts. In K. Namboodiri & R. G. Corwin (Eds.), Research in the sociology of education and socialization (Vol. 8, pp. 87–116). Greenwich, CT: JAI.
Pallas, A. M. (2003). Educational transitions, trajectories, and pathways. In J. T. Mortimer & M. J. Shanahan (Eds.), Handbook of the life course (pp. 165–184). New York, NY: Kluwer Academic/Plenum.
Parsons, T. (1959). The school class as a social system: Some of its functions in American society. Harvard Educational Review, 29, 297–318.
Pattison, E., Grodsky, E., & Muller, C. (2013). Is the sky falling? Grade inflation and the signaling power of grades. Educational Researcher, 42, 259–265. doi:10.3102/0013189x13481382
Pearson, K. (1930). Life of Francis Galton. London, England: Cambridge University Press.
Polikoff, M. S., Porter, A. C., & Smithson, J. (2011). How well aligned are state assessments of student achievement with state content standards? American Educational Research Journal, 48, 965–995. doi:10.3102/0002831211410684
Quann, C. J. (1983). Grades and grading: Historical perspectives and the 1982 AACRAO study. Washington, DC: American Association of Collegiate Registrars and Admissions Officers.
Randall, J., & Engelhard, G. (2009). Examining teacher grades using Rasch measurement theory. Journal of Educational Measurement, 46, 1–18. doi:10.1111/j.1745-3984.2009.01066.x
Randall, J., & Engelhard, G. (2010). Examining the grading practices of teachers. Teaching and Teacher Education, 26, 1372–1380. doi:10.1016/j.tate.2010.03.008
Resnick, L. B. (1987). The 1987 presidential address: Learning in school and out. Educational Researcher, 16(9), 13–20. doi:10.3102/0013189X016009013
Roderick, M., & Camburn, E. (1999). Risk and recovery from course failure in the early years of high school. American Educational Research Journal, 36, 303–343. doi:10.3102/00028312036002303
Rojstaczer, S., & Healy, C. (2012). Where A is ordinary: The evolution of American college and university grading, 1940–2009. Teachers College Record, 114(7), 1–23.
Ross, C. C., & Hooks, N. T. (1930). How shall we predict high-school achievement? Journal of Educational Research, 22, 184–196. doi:10.2307/27525222
Ross, J. A., & Kostuch, L. (2011). Consistency of report card grades and external assessments in a Canadian province. Educational Assessment, Evaluation and Accountability, 23, 158–180. doi:10.1007/s11092-011-9117-3
Rugg, H. O. (1918). Teachers' marks and the reconstruction of the marking system. Elementary School Journal, 18, 701–719. doi:10.1086/454643
Rumberger, R. W. (1987). High school dropouts: A review of issues and evidence. Review of Educational Research, 57, 101–121. doi:10.3102/00346543057002101
Rumberger, R. W. (2011). Dropping out: Why students drop out of high school and what can be done about it. Cambridge, MA: Harvard University Press.
Russell, J. A., & Austin, J. R. (2010). Assessment practices of secondary music teachers. Journal of Research in Music Education, 58, 37–54. doi:10.1177/0022429409360062
Salmons, S. D. (1993). The relationship between students' grades and their evaluation of instructor performance. Applied H.R.M. Research, 4, 102–114.
Sawyer, R. (2013). Beyond correlations: Usefulness of high school GPA and test scores in making college admissions decisions. Applied Measurement in Education, 26, 89–112. doi:10.1080/08957347.2013.765433
Schneider, J., & Hutt, E. (2014). Making the grade: A history of the A–F marking scheme. Journal of Curriculum Studies, 46, 201–224. doi:10.1080/00220272.2013.790480
Scriffiny, P. L. (2008). Seven reasons for standards-based grading. Educational Leadership, 66(2), 70–74. Retrieved from http://www.ascd.org/publications/educational_leadership/oct08/vol66/num02/Seven_Reasons_for_Standards-Based_Grading.aspx
Shriner, W. O. (1930). The comparison factor in the evaluation of examination papers. Teachers College Journal, 1, 65–74.
Shippy, N., Washer, B. A., & Perrin, B. (2013). Teaching with the end in mind: The role of standards-based grading. Journal of Family & Consumer Sciences, 105(2), 14–16. doi:10.14307/JFCS105.2.5
Silberstein, N. (1922). The variability of teachers' marks. English Journal, 11, 414–424.
Simon, M., Tierney, R. D., Forgette-Giroux, R., Charland, J., Noonan, B., & Duncan, R. (2010). A secondary school teacher's description of the process of determining report card grades. McGill Journal of Education, 45, 535–554. doi:10.7202/1003576ar
Simon, S. B., & Bellanca, J. A. (1976). Degrading the grading myths: A primer of alternatives to grades and marks. Washington, DC: Association for Supervision and Curriculum Development.
Sims, V. M. (1933). Reducing the variability of essay examination marks through eliminating variations in standards of grading. Journal of Educational Research, 26, 637–647. doi:10.1080/00220671.1933.10880358
Smith, A. Z., & Dobbin, J. E. (1960). Marks and marking systems. In C. W. Harris (Ed.), Encyclopedia of educational research (3rd ed., pp. 783–791). New York, NY: Macmillan.
Smith, J. K. (2003). Reconsidering reliability in classroom assessment and grading. Educational Measurement: Issues and Practice, 22(4), 26–33. doi:10.1111/j.1745-3992.2003.tb00141.x
Smith, J. K., & Smith, L. F. (2009). The impact of framing effect on student preferences for university grading systems. Studies in Educational Evaluation, 35, 160–167. doi:10.1016/j.stueduc.2009.11.001
Snow, R. E. (1989). Toward assessment of cognitive and conative structures in learning. Educational Researcher, 18(9), 8–14. doi:10.3102/0013189x018009008
Sobel, F. S. (1936). Teachers' marks and objective tests as indices of adjustment. Teachers College Record, 38, 239–240.
Spooren, P., Brockx, B., & Mortelmans, D. (2013). On the validity of student evaluation of teaching: The state of the art. Review of Educational Research, 83, 598–642. doi:10.3102/0034654313496870
Stanley, G., & Baines, L. (2004). No more shopping for grades at B-Mart: Re-establishing grades as indicators of academic performance. The Clearing House, 77, 101–104. doi:10.1080/00098650409601237
Starch, D. (1913). Reliability and distribution of grades. Science, 38, 630–636. doi:10.1126/science.38.983.630
Starch, D. (1915). Can the variability of marks be reduced? School & Society, 2, 242–243.
Starch, D., & Elliott, E. C. (1912). Reliability of the grading of high-school work in English. School Review, 20, 442–457.
Starch, D., & Elliott, E. C. (1913a). Reliability of grading work in mathematics. School Review, 21, 254–259.
Starch, D., & Elliott, E. C. (1913b). Reliability of grading work in history. School Review, 21, 676–681.
Sun, Y., & Cheng, L. (2013). Teachers' grading practices: Meaning and values assigned. Assessment in Education, 21, 326–343. doi:10.1080/0969594X.2013.768207
Svennberg, L., Meckbach, J., & Redelius, K. (2014). Exploring PE teachers' "gut feelings": An attempt to verbalise and discuss teachers' internalised grading criteria. European Physical Education Review, 20, 199–214. doi:10.1177/1356336X13517437
Swan, G. M., Guskey, T. R., & Jung, L. A. (2014). Parents' and teachers' perceptions of standards-based and traditional report cards. Educational Assessment, Evaluation and Accountability, 26, 289–299. doi:10.1007/s11092-014-9191-4
Swineford, F. (1947). Examination of the purported unreliability of teachers' marks. Elementary School Journal, 47, 516–521. doi:10.2307/3203007
Thorsen, C. (2014). Dimensions of norm-referenced compulsory school grades and their relative importance for the prediction of upper secondary school grades. Scandinavian Journal of Educational Research, 58, 127–146. doi:10.1080/00313831.2012.705322
Thorsen, C., & Cliffordson, C. (2012). Teachers' grade assignment and the predictive validity of criterion-referenced grades. Educational Research and Evaluation, 18, 153–172. doi:10.1080/13803611.2012.659929
Tierney, R. D., Simon, M., & Charland, J. (2011). Being fair: Teachers' interpretations of principles for standards-based grading. The Educational Forum, 75, 210–227. doi:10.1080/00131725.2011.577669
Troob, C. (1985). Longitudinal study of students entering high school in 1979: The relationship between first term performance and school completion. New York, NY: New York City Board of Education.
Troug, A. J., & Friedman, S. J. (1996, April). Evaluating high school teachers' written grading policies from a measurement perspective. Paper presented at the annual meeting of the National Council on Measurement in Education, New York, NY.
Unzicker, S. P. (1925). Teachers' marks and intelligence. Journal of Educational Research, 11, 123–131. doi:10.1080/00220671.1925.10879537
Voss, H. L., Wendling, A., & Elliott, D. S. (1966). Some types of high school dropouts. Journal of Educational Research, 59, 363–368.
Webster, K. L. (2011). High school grading practices: Teacher leaders' reflections, insights, and recommendations (Doctoral dissertation). Available from ProQuest Dissertations and Theses database. (UMI No. 3498925)
Welsh, M. E., & D'Agostino, J. (2009). Fostering consistency between standards-based grades and large-scale assessment results. In T. R. Guskey (Ed.), Practical solutions for serious problems in standards-based grading (pp. 75–104). Thousand Oaks, CA: Corwin.
Welsh, M. E., D'Agostino, J. V., & Kaniskan, R. (2013). Grading as a reform effort: Do standards-based grades converge with test scores? Educational Measurement: Issues and Practice, 32(2), 26–36. doi:10.1111/emip.12009
Wiggins, G. (1994). Toward better report cards. Educational Leadership, 52(2), 28–37. Retrieved from http://www.ascd.org/publications/educational-leadership/oct94/vol52/num02/Toward-Better-Report-Cards.aspx
Wiley, C. R. (2011). Profiles of teacher grading practices: Integrating teacher beliefs, course criteria, and student characteristics (Doctoral dissertation). Available from ProQuest Dissertations and Theses database. (UMI No. 887719048)
Willingham, W. W., Pollack, J. M., & Lewis, C. (2002). Grades and test scores: Accounting for observed differences. Journal of Educational Measurement, 39, 1–37. doi:10.1002/j.2333-8504.2000.tb01838.x
Winter, R. (1993). Education or grading? Arguments for a non-subdivided honours degree. Studies in Higher Education, 18, 363–377. doi:10.1080/03075079312331382271
Woodruff, D. J., & Ziomek, R. L. (2004). High school grade inflation from 1991 to 2003 (Research Report Series 2004-04). Iowa City, IA: ACT.
Yesbeck, D. M. (2011). Grading practices: Teachers' considerations of academic and non-academic factors (Doctoral dissertation). Available from ProQuest Dissertations and Theses database. (UMI No. 913076079)

Authors

SUSAN M. BROOKHART, PhD, is an independent educational consultant and an adjunct faculty member at Duquesne University, Pittsburgh, PA 15282; email: [email protected]
THOMAS R. GUSKEY, PhD, is a professor of education at the University of Kentucky, Lexington, KY 40506; email: [email protected]
ALEX J. BOWERS, PhD, is an associate professor of education leadership at Teachers College, Columbia University, New York, NY 10027; email: [email protected]
JAMES H. MCMILLAN, PhD, is interim associate dean for academic affairs and professor of education at Virginia Commonwealth University, Richmond, VA 23284; email: [email protected]
JEFFREY K. SMITH, PhD, is a professor of education at the University of Otago in Dunedin, New Zealand; email: [email protected]
LISA F. SMITH, PhD, is a professor and dean of education at the University of Otago in Dunedin, New Zealand; email: [email protected]
MICHAEL T. STEVENS is a graduate student in the School of Education at the University of California, Davis, CA 95616; email: [email protected]
MEGAN E. WELSH, PhD, is an assistant professor in educational assessment and measurement at the University of California, Davis, CA 95616; email: [email protected]
Educational Measurement: Issues and Practice
Fall 2020, Vol. 39, No. 3, pp. 65–69

How Can Released State Test Items Support Interim Assessment Purposes in an Educational Crisis?

Emma M. Klugman and Andrew D. Ho, Harvard Graduate School of Education

State testing programs regularly release previously administered test items to the public. We provide an open-source recipe for state, district, and school assessment coordinators to combine these items flexibly to produce scores linked to established state score scales. These would enable estimation of student score distributions and achievement levels. We discuss how educators can use resulting scores to estimate achievement distributions at the classroom and school level. We emphasize that any use of such tests should be tertiary, with no stakes for students, educators, and schools, particularly in the context of a crisis like the COVID-19 pandemic. These tests and their results should also be lower in priority than assessments of physical, mental, and social–emotional health, and lower in priority than classroom and district assessments that may already be in place. We encourage state testing programs to release all the ingredients for this recipe to support low-stakes, aggregate-level assessments. This is particularly urgent during a crisis where scores may be declining and gaps increasing at unknown rates.

Keywords: achievement levels, COVID-19, interim assessment, item maps, item response theory, psychometrics, state testing

State testing programs regularly release examples of test items to the public. These releases serve multiple purposes. They provide educators and students an opportunity to familiarize themselves with item formats. They demystify the testing experience for the public. And they can improve understanding of test scores by illustrating the kinds of tasks that students at particular achievement levels can accomplish successfully. As exemplars, these items are typically screened carefully, with demonstrated alignment to state content standards. They are generally evaluated at great expense in operational administrations and field tests. They have known quality and technical characteristics. However, states generally release the items themselves, not their technical characteristics. This prevents any use of released items to estimate scores on state scales.

This is generally wise. Released items have unknown exposure and unknown familiarity, and uncontrolled conditions in any readministration would risk standard inferences about proficiency. State testing programs are rightfully hesitant to sanction any uses of released items, to protect against coaching that would inflate scores on a typical administration.

However, at this writing in August of 2020, there are serious threats to any notion of a typical administration, and there is a dearth of high-quality assessment options. In this current pandemic, we argue that states should make technical parameters of released items public to support low-stakes uses of standards-based test score reports. The cost is negligible, and all assessment options should be available to educators for educational monitoring purposes. In this article, we provide a recipe for construction of tests using released items and provide guardrails to ensure appropriate use in an educational crisis.

Assessment in the COVID-19 Crisis

In the spring of 2020, COVID-19 caused U.S. school districts to cease in-person instruction months earlier than usual. The first states closed schools on March 16, and all states had recommended school closure by March 24 (Education Week, 2020). Remote instruction has differed substantially between and within states in implementation and uptake (Harris et al., 2020). As schools open in-person and online in the fall of 2020, unusual numbers of students may not have learned nor had the opportunity to learn previous grade material. Although projections exist for the magnitude of declines and possible increases in disparities (Kuhfeld et al., 2020), assessments can provide a more direct estimate this school year. Results of such interim assessments can inform
strategies to support teachers and students, including funding, curriculum redesign, and instruction (Perie, Marion, & Gong, 2009).

COVID-19 is an international health disaster, and standardized measures of proficiency in reading, writing, mathematics, and other subjects should be tertiary to other assessment targets and assessment purposes (Lake & Olson, 2020; Marion, Gong, Lorié, & Kockler, 2020; Olson, 2020). There is a hierarchy of assessment needs in a crisis, and measures of academic levels should rightfully be tertiary. Higher priorities and assessment approaches should include:

- Teacher- or parent-reported surveys of students' spring attendance, participation, and content coverage. In many schools with remote instruction, teachers and parents can report their impressions of attendance, participation, and proficiency compared to prior years.
- Existing classroom and district assessments. Districts already have access to classroom assessments that can assess prior-grade material. Some district-level assessments have fall tests that can report scores linked to state proficiency standards.
- Assessments of physical, mental, and social–emotional health, sufficient levels of which are necessary conditions for learning.

As an optional supplement to these approaches, school and district educational personnel may also find aggregate summaries of student proficiency in terms of state performance standards useful. For example, a school or district may recognize, due to other assessments listed above, that substantial units or students had no access to material taught at the end of the year, motivating some weeks of review of prior-grade content. A test comprised of previously released, prior-grade items would enable estimation of proficiency distributions on prior-grade score scales, including proficiency in terms of achievement level cut scores.

Although some districts have access to assessments that report on state test score scales, usually through statistical projections, such assessments are costly and not universal. Tests comprised of released items are free and interpretable directly in terms of state achievement levels. We also show how item maps comprised of released items can provide educators with examples of performance tasks that students in each achievement level can do. We provide an explicit recipe for such tests; then we conclude with clear guardrails for appropriate use. In particular, we caution that any current use (or implied future use) of these scores for judgments about student tracking, educator effectiveness, or school effectiveness would invite severe bias and inflation that would render scores unusable for those high-stakes purposes.

Availability of Released Items and Parameter Estimates

Interest in the reuse of calibrated items surged in the 1990s as the National Assessment of Educational Progress (NAEP) began reporting state results. The term "market-basket reporting" (National Research Council, 2000) was considered and discarded, and authors demonstrated how "domain scores" using Item Response Theory could support reuse of calibrated items (Bock, Thissen, & Zimowski, 1997; Pommerich, 2006). More recently, there has been international interest in creating tests for administration across different countries and conditions (Das & Zajonc, 2010; Muralidharan, Singh, & Ganimian, 2019). We could not find a straightforward recipe for creating such tests nor an article that discussed application and caveats in a crisis.

Unfortunately, in our search of publicly available manuals, we found few examples of state technical manuals that enable users to merge published items to published estimates. This does not appear to be an intentional omission. Rather, state testing program personnel may reason that released items have an audience that is not interested in technical specifications, and item parameter estimates have an audience that is not interested in item content. We hope that it becomes standard practice to either publish item parameter estimates with released items or include a key that
enables merging of released items with parameter estimates in technical manuals.

Table 1 shows whether the key ingredients for reuse of items are available across large testing programs and states. The ingredients are available for large national and international programs like NAEP, PISA, and TIMSS. We also conducted a search of state websites for the 15 largest states, for items, parameter estimates, and a key linking the two.

Table 1. Online Public Availability of Items and Parameter Estimates for the Construction of Open Tests
Selected large-scale national and international testing programs and programs from the 15 largest states as of August 2020. This table will be updated online at https://emmaklugman.github.io/files/open-tests.html

Testing Program      | (1) Operational (or Field-Tested) Items Available? | (2) Item Parameter Estimates Available? | (3) Key Enabling a Merge of (1) and (2) Available?
NAEP                 | Yes                | Yes                | Yes
PISA                 | Yes                | Yes                | Yes
TIMSS                | Yes                | Yes                | Yes
Smarter Balanced     | Yes                | No                 | No
New Meridian (PARCC) | Yes                | No                 | No
California           | Yes                | No                 | No
Texas                | Yes                | No                 | No
Florida              | Yes                | No                 | No
New York             | Yes, 3–8 & Regents | Yes, 3–8 & Regents | No, 3–8; Yes, Regents
Pennsylvania         | Yes                | Yes                | No
Illinois             | Yes                | No                 | No
Ohio                 | Yes                | Yes                | Haphazardly
Georgia              | Yes                | No                 | No
North Carolina       | Yes                | No                 | No
Michigan             | Yes                | No                 | No
New Jersey           | Yes                | No                 | No
Virginia             | Yes                | No                 | No
Washington           | Yes                | No                 | No
Arizona              | Yes                | Yes                | No
Massachusetts        | Yes                | Yes                | No
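To make column (3) of Table 1 concrete, the "key" the authors call for is simply a shared item identifier that lets anyone join a released-item file to a parameter-estimate file. Here is a minimal sketch in R (the language of the article's online appendix); the item IDs, prompts, and parameter values are invented for illustration, not drawn from any state's files.

```r
# Hypothetical released-item file: item IDs and item content
released_items <- data.frame(
  item_id = c("G4-M-07", "G4-M-12", "G4-M-19"),
  prompt  = c("Identify the place value of a digit",
              "Solve a one-variable linear equation",
              "Determine the perimeter of a rectangle")
)

# Hypothetical technical-manual file: the same IDs with 3PL estimates
item_parameters <- data.frame(
  item_id = c("G4-M-07", "G4-M-12", "G4-M-19"),
  a = c(1.1, 0.9, 1.3),    # discrimination
  b = c(-0.8, 0.1, 0.6),   # difficulty
  c = c(0.20, 0.20, 0.15)  # pseudo-guessing (lower asymptote)
)

# The shared item_id column is the "key": joining the two files yields
# a calibrated form assembled entirely from public ingredients
calibrated_form <- merge(released_items, item_parameters, by = "item_id")
```

Without the shared identifier, the two public files cannot be reconciled, which is exactly the gap the authors document for most states.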
We find that these state testing programs always make operational items available, in the case of some states through the assessment consortia known as Smarter Balanced and New Meridian (which was related to the Partnership for Assessment of Readiness for College and Careers, PARCC). We found item parameter estimates in a few states. A key that enables a merge of the two key ingredients was only available for the New York Regents (a longstanding high school testing program) and in Ohio, where the necessary information was largely available but seemed unintentional and based on item order rather than item IDs.

Ingredients for Test Construction Using Released State Test Items

For this example, we consider a possible use of Grade 4 items to estimate Grade 4 proficiency for Grade 5 students in a COVID-19-disrupted year. This illustrative example is available in our Online Appendix, complete with code in R. We use the National Assessment of Educational Progress (NAEP) for publicly available ingredients. In practice, ingredients from state tests will be preferable given the relative curricular and political relevance of state standards and state score scales. The recipe for standards-linked test scores requires five essential ingredients:

1. Test items
2. Item parameter estimates
3. A list or key enabling association of items and their corresponding estimates
4. Linking functions from underlying θ scales to scale scores
5. Achievement level cut scores

Starting with the first ingredient, designers should ensure selection of items that suits their desired content coverage. Although the restrictive assumptions of Item Response Theory suggest that the selection of items has no effect on score estimation (Yen & Fitzpatrick, 2006), it is reasonable to select items in similar proportion to test blueprints, or some subset of items from a content area in which educators have particular interest. As we note in our section about caveats, state tests are typically administered at the end of a sequence of related instruction. If tests are not given in a similar sequence and conditions, standard inferences may not apply. Thus, a presentation or review of Grade 4 material that mimics the standard instructional onramp to Grade 4 testing would help to ensure appropriate inferences from scores.

The second ingredient is item parameter estimates. These are an occasional feature of technical manuals for state tests. Turning to the third ingredient, as we mention above, a link is rarely available, with the exception of large-scale programs like NAEP, TIMSS, and PISA, and one-off examples like the New York Regents Exams and Ohio.

The fourth ingredient is a linking function, usually a simple linear equation for each score scale that maps from item parameter estimates on the underlying θ scale to the scale scores for reporting. Fifth and finally, achievement level reporting, in categories like Basic, Proficient, and Advanced, requires cut scores delineating these levels. Both linking functions and achievement level cut scores are reported regularly in state technical manuals and documentation.

Recipe for Test Construction Using Released State Test Items

The recipe for generating standards-based score reports from the ingredients above requires straightforward application of Item Response Theory. The recipe is available online at https://emmaklugman.github.io/files/open-tests.html and assumes expertise at the level of a first-year survey course in educational measurement. Reviews of IRT include those by Yen and Fitzpatrick (2006) and Thissen and Wainer (2001). Many state technical manuals also review state-specific scoring procedures and technical details.

We use a common and straightforward procedure known as the Test Characteristic Curve (TCC) scoring method, which results in a 1-to-1 table of summed scores to θ estimates and scale scores. Kolen and Tong (2010) compare this approach with other alternatives. They note that the TCC approach is both transparent and avoids the dependence of scores on priors, which may offset the tradeoffs of the slight increase in imprecision. Users may substitute alternative scoring approaches into this recipe.
Given the ingredients listed in the previous section, the recipe follows:

1. Arrange released test items into an online or paper booklet.
2. Generate a table mapping summed scores to scale scores.
3. Administer the test and collect responses.
4. Sum correct responses to summed scores and locate corresponding scale scores.
5. Report scale scores, including achievement levels and item map locations as desired.

Test items should be arranged to support a natural flow of content and difficulty. For items where item locations are known, test constructors may try to preserve relative item order. For more on principles of test design, see Downing and Haladyna (2006).

To create a table mapping summed scores to scale scores, we reproduce a standard recipe: sum the item characteristic curve functions to a test characteristic curve, invert it, and then transform the result linearly to estimate scale scores. For simplicity, consider a dichotomously scored 3-parameter-logistic model:

$$P_i(\theta) \equiv P_i(X_i = 1 \mid \theta) = c_i + \frac{1 - c_i}{1 + \exp(-D a_i (\theta - b_i))}.$$

Here, each examinee's dichotomous response $X$ to item $i$ depends upon examinee proficiency $\theta$ and item parameters $a$, $b$, and $c$, indicating information (discrimination), location (difficulty), and a lower asymptote (pseudo-guessing), respectively. Many models include an arbitrary scaling parameter, $D = 1.7$, which should simply be included or excluded for consistency. The sum of these item characteristic curves yields the test characteristic curve:

$$T(\theta) = \sum_i P_i(\theta).$$

This sum of probabilities is the expected sum score given known examinee proficiency $\theta$. Inverting the test characteristic curve using numerical interpolation methods yields the TCC estimate of $\theta$ for any summed score:

$$\hat{\theta}_{TCC} = T^{-1}\left(\sum_i X_i\right).$$
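A minimal sketch of this step in R, the language of the authors' online materials, is below. The five items, their 3PL parameter values, the bounds on θ, and the scale constants M and K are all placeholders, not estimates from any actual testing program; the published recipe at the authors' URL is the authoritative version.

```r
# Hypothetical 3PL parameter estimates for a five-item released form
a  <- c(0.9, 1.2, 0.7, 1.5, 1.1)     # discrimination
b  <- c(-1.0, -0.3, 0.2, 0.8, 1.4)   # difficulty
c3 <- c(0.20, 0.25, 0.20, 0.15, 0.20) # pseudo-guessing
D  <- 1.7                             # scaling constant

# Item characteristic curves and the test characteristic curve T(theta)
p_item <- function(theta) c3 + (1 - c3) / (1 + exp(-D * a * (theta - b)))
tcc    <- function(theta) sum(p_item(theta))

# Achievable summed scores: T(theta) is bounded below by sum(c3), so
# near-chance summed scores have no finite theta solution
sum_score <- seq(ceiling(tcc(-6)), floor(tcc(6)))

# Invert T numerically: the TCC theta estimate for each summed score
theta_hat <- sapply(sum_score, function(s)
  uniroot(function(th) tcc(th) - s, lower = -6, upper = 6)$root)

# Linear transformation to the reporting scale (M and K would come
# from a state technical manual; these values are invented)
M <- 40; K <- 250
scale_score <- round(M * theta_hat + K)

data.frame(sum_score, theta_hat = round(theta_hat, 2), scale_score)
```

The resulting data frame is the 1-to-1 summed-score-to-scale-score lookup table that step 2 of the recipe calls for; an educator would only need the final two columns.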
Transformations to scale scores $s$ are typically linear, and constants for the slope and intercept ($M$ and $K$, respectively) are often available in technical manuals:

$$\hat{s} = M \hat{\theta} + K.$$

States also publish achievement level cut scores denoting minimum threshold scores for categories. For NAEP, these achievement level labels are Basic, Proficient, and Advanced, delineated by cut scores in each subject and grade: $c_B$, $c_P$, and $c_A$. A scale score $s$ is assigned an achievement level category $L$ in straightforward fashion:

$$L(s) = \begin{cases} \text{Advanced} & \text{if } s \geq c_A \\ \text{Proficient} & \text{if } c_P \leq s < c_A \\ \text{Basic} & \text{if } c_B \leq s < c_P \\ \text{Below Basic} & \text{if } s < c_B \end{cases}$$

Finally, item maps can illustrate items and tasks that examinees at each score are likely to be able to answer correctly. Each item is anchored to the $\theta$ scale assuming a given probability of a correct response, known as the response probability, $p_R$. This can be set to various levels like .67 (Huynh, 2006) or, in our example here and online, .73. The item response function is then inverted and transformed to the score scale to obtain each item's mapped location, $s_i$. Under the assumptions of IRT, any item from the domain can be mapped, even if it was not administered to students:

$$s_i = M \left[ \frac{1}{D a_i} \log\left( \frac{p_R - c_i}{1 - p_R} \right) + b_i \right] + K.$$

This recipe results in Table 2, using real data from NAEP. Each summed score aligns with a single underlying proficiency estimate $\hat{\theta}$, scale score $\hat{s}$, achievement level, and nearby mapped item. This recipe is online and available at https://emmaklugman.github.io/files/open-tests.html, complete with open-source code in R. Although we recommend scores for aggregate-level inferences, we also include estimates of standard errors for each individual-level scale score using Item Response Theory.

Table 2. Sum Scores, Estimated θ Scores, Scale Scores, Achievement Levels, and Item Maps, with Content Areas Shown
Ingredients are from the National Assessment of Educational Progress and the National Center for Education Statistics. The recipe is available at https://emmaklugman.github.io/files/open-tests.html

Sum Score | Theta | Scale Score | Achievement Level | Subscale    | Item
8         | -2.48 | 162         | Below Basic       | Geometry    | Identify a figure that is not …
9         | -2.01 | 177         | Below Basic       | Geometry    | Divide a square into various …
10        | -1.65 | 188         | Below Basic       | Measurement | Identify appropriate …
11        | -1.36 | 198         | Below Basic       | Measurement | Identify a reasonable amount …
12        | -1.10 | 206         | Below Basic       | Operations  | Identify the place value of a …
13        | -0.88 | 213         | Below Basic       | Operations  | Recognize the result of …
14        | -0.68 | 219         | Basic             | Operations  | Compose numbers using place …
15        | -0.49 | 225         | Basic             | Operations  | Represent the same whole …
16        | -0.32 | 231         | Basic             | Operations  | Subtract three-digit number from …
17        | -0.15 | 236         | Basic             | Algebra     | Solve a one-variable linear …
18        | 0.01  | 241         | Basic             | Algebra     | Determine the missing shapes in …
19        | 0.17  | 246         | Basic             | Algebra     | Mark locations on a grid …
20        | 0.33  | 251         | Proficient        | Geometry    | Use an interactive tool to create …
21        | 0.49  | 256         | Proficient        | Measurement | Determine perimeter of a …
22        | 0.65  | 262         | Proficient        | Algebra     | Determine and apply a rule …
23        | 0.82  | 267         | Proficient        | Operations  | Represent fractions using a …
24        | 1.00  | 273         | Proficient        | Measurement | Identify given measurements on …
25        | 1.19  | 279         | Proficient        | Analysis    | Determine number of ways …
26        | 1.40  | 286         | Advanced          | Algebra     | Determine and apply a rule …
27        | 1.64  | 293         | Advanced          | Operations  | Solve a story problem involving …
28        | 1.93  | 303         | Advanced          | Algebra     | Relate input to output from a …
29        | 2.32  | 315         | Advanced          | Operations  | Compose numbers using place …
30        | 2.95  | 335         | Advanced          | Geometry    | Divide a square into various …
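In the same spirit as the earlier sketch, the last two steps (achievement-level classification and item mapping) can be written in a few lines of R. The cut scores below are invented placeholders, not published NAEP values, and the item parameters repeat the hypothetical five-item form above so the block runs on its own.

```r
# Hypothetical ingredients (same illustrative values as the sketch above)
a  <- c(0.9, 1.2, 0.7, 1.5, 1.1)
b  <- c(-1.0, -0.3, 0.2, 0.8, 1.4)
c3 <- c(0.20, 0.25, 0.20, 0.15, 0.20)
D  <- 1.7; M <- 40; K <- 250

# Invented achievement-level cut scores on the reporting scale
cut_basic <- 214; cut_prof <- 249; cut_adv <- 282

# Classify scale scores into levels, matching the piecewise L(s) above
achievement_level <- function(s)
  cut(s, breaks = c(-Inf, cut_basic, cut_prof, cut_adv, Inf),
      labels = c("Below Basic", "Basic", "Proficient", "Advanced"),
      right = FALSE)

achievement_level(c(205, 240, 268, 300))
# Below Basic, Basic, Proficient, Advanced

# Item mapping: anchor each item at the theta where P(correct) = p_R,
# then apply the same linear transformation to the reporting scale
p_R <- 0.73
item_location <- M * (log((p_R - c3) / (1 - p_R)) / (D * a) + b) + K
round(item_location)
```

Together with the lookup table from the previous sketch, this is everything needed to produce a report like Table 2 from public ingredients.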
Discussion: Cautions and Caveats

We close with a series of caveats. One set of caveats relates to the interpretation and use of individual scores. A second set of caveats builds upon the first, with additional threats to the comparability of aggregate scores to past years. Users of these tests in a crisis may try to answer two important descriptive questions: (1) How much have scores declined? (2) How much have score disparities grown? Answers to these questions must attend to these sets of caveats.

First, in a crisis, many physical and psychological factors may threaten a typical administration and introduce construct-irrelevant variance. We cannot emphasize enough the appropriately tertiary and supplemental role of the tests that we propose here. Physical health and safety must come first in a crisis, followed by assessments of social and emotional well-being. Students must be safe and feel safe before they can learn or demonstrate what they have learned.

Second, when many students are working from home, online test-taking in different administration conditions is a threat to comparability. Complicating factors in home administrations include online connectivity, parental involvement, and other in-home interference or distractions. Such factors can inflate scores if, for example, parents assist students, or students use additional online resources. They can deflate
scores if there are atypical distractions or poor internet connectivity.

Third, these tests typically follow standardized instructional on-ramps at the end of a year of instruction. Irregular or inconsistent exposure to instruction prior to administration will threaten standard interpretations of scale scores. For example, consider a fall administration that follows a fall instructional unit where teachers emphasize algebra over other domains like geometry or measurement. Resulting scores may lead users to underestimate algebra proficiency, when in fact the scores reflect relatively low proficiency in other domains.

Additional threats to inferences arise at the aggregate level, to the extent that the population in school in a crisis may not be the same as in years past. Students who are not in school in a crisis are not missing at random. Standard interpretations of trends and gap trends will be threatened to the extent that the population of students in school does not match the population of students who would have been in school absent the crisis. Matching based on scores from past years and other covariates may help to address some of this bias, but such a procedure risks precision and transparency.

The use of existing classroom and interim assessments will also require the similar caveats above. The one important exception is the third caveat, where classroom and district assessments may have more flexible and appropriate instructional onramps. However, high-quality district assessments are not available to all districts, and these are not always directly interpretable in terms of state content and performance standards.

Thus, in spite of these necessary caveats, we emphasize that state testing programs already make high-quality ingredients for useful tests available to the public, and we provide a recipe as well as guardrails for appropriate use. We encourage states to release the currently missing ingredient, a key for merging items with parameter estimates. The cost would be negligible. All low-stakes assessment options should be available to schools and districts in a crisis.

References

Bock, R. D., Thissen, D., & Zimowski, M. F. (1997). IRT estimation of domain scores. Journal of Educational Measurement, 34(3), 197–211. https://doi.org/10.1111/j.1745-3984.1997.tb00515.x
Das, J., & Zajonc, T. (2010). India shining and Bharat drowning: Comparing two Indian states to the worldwide distribution in mathematics achievement. Journal of Development Economics, 92(2), 175–187. https://doi.org/10.1016/j.jdeveco.2009.03.004
Downing, S. M., & Haladyna, T. M. (Eds.). (2006). Handbook of test development. Mahwah, NJ: Lawrence Erlbaum.
Education Week. (2020). Map: Coronavirus and school closures. Education Week. https://www.edweek.org/ew/section/multimedia/map-coronavirus-and-school-closures.html
Harris, D. N., Liu, L., Oliver, D., Balfe, C., Slaughter, S., & Mattei, N. (2020). How America's schools responded to the COVID crisis (Technical Report). National Center for Research on Education Access and Choice. https://www.reachcentered.org/uploads/technicalreport/20200713-Technical-Report-Harris-et-al-How-Americas-Schools-Responded-to-the-COVID-Crisis.pdf
Huynh, H. (2006). A clarification on the response probability criterion RP67 for standard settings based on bookmark and item mapping. Educational Measurement: Issues and Practice, 25(2), 19–20. https://doi.org/10.1111/j.1745-3992.2006.00053.x
Kolen, M. J., & Tong, Y. (2010). Psychometric properties of IRT proficiency estimates. Educational Measurement: Issues and Practice, 29(3), 8–14. https://doi.org/10.1111/j.1745-3992.2010.00179.x
Kuhfeld, M., Soland, J., Tarasawa, B., Johnson, A., Ruzek, E., & Liu, J. (2020). Projecting the potential impacts of COVID-19 school closures on academic achievement (Working Paper No. 20-226). Annenberg Institute at Brown University. https://www.edworkingpapers.com/ai20-226
Lake, R., & Olson, L. (2020). Learning as we go: Principles for effective assessment during the COVID-19 pandemic (The Evidence Project at CRPE). https://www.crpe.org/sites/default/files/final_diagnostics_brief_2020.pdf
Marion, S., Gong, B., Lorié, W., & Kockler, R. (2020). Restart & recovery: Assessment considerations for Fall 2020. Council of Chief State School Officers. https://ccsso.org/sites/default/files/2020-07/Assessment%20Considerations%20for%20Fall%202020.pdf
Muralidharan, K., Singh, A., & Ganimian, A. J. (2019). Disrupting education? Experimental evidence on technology-aided instruction in India. American Economic Review, 109(4), 1426–1460. https://doi.org/10.1257/aer.20171112
National Research Council. (2000). Designing a market basket for NAEP: Summary of a workshop (P. J. DeVito & J. A. Koenig, Eds.). Washington, DC: The National Academies Press. https://doi.org/10.17226/9891
Olson, L. (2020). Blueprint for testing: How schools should assess students during the Covid crisis (FutureEd). Georgetown University. https://www.future-ed.org/blueprint-for-testing-how-schools-should-assess-students-during-the-covid-crisis/
Perie, M., Marion, S., & Gong, B. (2009). Moving toward a comprehensive assessment system: A framework for considering interim assessments. Educational Measurement: Issues and Practice, 28(3), 5–13. https://doi.org/10.1111/j.1745-3992.2009.00149.x
Pommerich, M. (2006). Validation of group domain score estimates using a test of domain. Journal of Educational Measurement, 43(2), 97–111. https://doi.org/10.1111/j.1745-3984.2006.00007.x
Thissen, D., & Wainer, H. (Eds.). (2001). Test scoring. Mahwah, NJ: Lawrence Erlbaum Associates.
Yen, W. M., & Fitzpatrick, A. R. (2006). Item response theory. In R. Brennan (Ed.), Educational measurement (4th ed., pp. 111–154). Westport, CT: American Council on Education/Praeger.
"Why did I get a C?": Communicating Student Performance Using Standards-Based Grading

Michael H. Scarlett, PhD
Associate Professor, Education Department
Augustana College

Standards-based grading, an alternative form of grading in which a student's achievement is based on their performance on a clearly defined set of standards rather than on their performance on tests and assignments, is commonplace in K-12 education but has been slow to catch on in higher education. This article presents an example of how standards-based grading was implemented in two sections of an undergraduate course on assessment to add clarity to the meaning of students' grades. The author reflects on lessons learned from implementation, including the benefits and challenges posed by adopting the practice.

"How many points is this worth?" "What do I need to do to get an 'A'?" "Do you offer extra credit?" These are the types of questions I often get from students when we talk about their grades. I generally respond by discussing the weight of different assignments in comparison to tests and projects, the impact of turning assignments in late, the amount of material to be covered on tests, and, almost invariably, the admonition, "It's in the syllabus." Rarely, it seems, do these conversations focus on student learning. In fact, grades often seem to impede rather than facilitate communication.

To address the confusion that often surrounds the awarding of grades, I implemented an approach that is commonplace in K-12 schools but almost completely absent from higher education called standards-based grading. Standards-based grading is a practice that bases students' grades on their performance on a set of clearly defined learning objectives rather than the completion of assignments and tests or the accumulation of points (Brookhart, 2009; Guskey & Bailey, 2010). With the system of standards-based grading I implemented, students' grades were calculated by averaging the scores they received on rubrics indicating a level of mastery of course objectives. I derived the rubric scores using evidence from tests, quizzes, projects, etc., instead of just adding up points for correct answers. At the end of the course I developed a standards-based report card that clearly showed my students exactly which learning objectives they had mastered and which ones they had not.

What I discovered was that implementing standards-based grading involved much more than a simple cosmetic redesign of my grade book. By aligning my grade book with specific standards and by basing the students' grades on their performance in relation to these standards, the standards-based grading approach caused me to reconceptualize the relationship between assessment, curriculum, and instruction in significant ways. As a result, I had much more substantive conversations with students, focusing on learning rather than policies, effort, or the number of points for an assignment. Most importantly, I felt that the grade I awarded to students at the end of the term much more accurately represented their level of understanding than when my grades were based solely on the number of points students earned. In short, my experience implementing standards-based grading was truly transformational.
The purpose of this article is to describe a rationale and a process for implementing standards-based grading and to reflect on the benefits and challenges of implementation.

Background on Standards-Based Grading

To understand standards-based grading it is helpful to understand how it differs from traditional grading practices. In higher education, and in most secondary schools, a student's grade is determined by their performance on a variety of assessments, such as tests, quizzes, and projects. It is common for each assessment to be worth a certain number of points, with assessments that are deemed more significant being worth more points or a greater percentage of the student's grade. Assessments generally address multiple learning goals, sometimes identified on a rubric in the case of a project; in the case of a test, the learning goals are reflected in the questions but are often not explicitly communicated to the student (i.e., the final exam will cover all the material addressed since the midterm).

In contrast, in a standards-based approach students receive a score for each learning goal or target addressed in the course. The score for each standard is determined by a student's performance on assessment items (test questions, performance assessments, etc.) carefully aligned with the learning targets or goals. For example, my students take three tests during the term. Each test is made up of approximately 20 open-ended questions, and each question is aligned with a learning target, with multiple test items aligned to a specific standard. When I grade the tests, I score each item on a scale based on the student's level of understanding, and then I give a standards-based score that is the average of the scores on the items that addressed that standard (a small worked example of this averaging appears at the end of this section). When students get their tests back they can see not only how well they did on each item, but also how well they did on each of the standards. Similarly, students complete several projects and receive scores on a rubric that is aligned with course learning goals (for a summary of key differences between standards-based grading and traditional grading practices, see Appendix A, and see Appendix B for a sample standards-based grade report).

Standards-based grading is not a new practice in K-12 education. Beginning in the 1990s, the curriculum of elementary and secondary schools became increasingly standards-based. With the passage of No Child Left Behind in 2001, an accountability system was established to monitor the educational progress of students using standardized tests. The pressure to prepare students for standardized testing caused many in education to question the relationship between students' performance in the classroom, represented by their grades, and their performance on standardized tests. Presumably, a student who can get good grades in a math class should do well on standardized tests in the same subject. Now that most states have adopted the Common Core State Standards in math and English/language arts, the alignment between curriculum, instruction, and assessment is likely to be ever more heavily scrutinized (Welsh, D'Agostino, & Kaniskan, 2013). Anybody who has seen a report card for an elementary student recently has probably noted how it no longer reports the student's grade in single subject areas, such as math or science; instead it reports the student's progress on specific skills or standards.
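The per-standard averaging the author describes is mechanically simple. Here is a minimal sketch in R (for consistency with the other examples in this packet); the items, standards, and rubric scores are hypothetical, not the author's actual course data.

```r
# Hypothetical item-level rubric scores from one student's test;
# each item is aligned to a single course standard, and each item is
# scored on a 1-4 mastery scale rather than marked right/wrong
item_scores <- data.frame(
  item     = paste0("Q", 1:6),
  standard = c("S1", "S1", "S2", "S2", "S2", "S3"),
  score    = c(4, 3, 2, 3, 3, 4)
)

# Standards-based score: the mean of the item scores aligned to each
# standard, which is what appears on the standards-based report card
standard_scores <- aggregate(score ~ standard, data = item_scores, FUN = mean)
standard_scores$score <- round(standard_scores$score, 2)
standard_scores
#   standard score
# 1       S1  3.50
# 2       S2  2.67
# 3       S3  4.00
```

The point of the restructuring is visible in the output: instead of a single total of points, the student sees one mastery estimate per learning target.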
To make grades a more accurate reflection of what students know and can do in relation to standards, standards-based grading is based on several core principles. First, a grade should represent the degree to which a student has demonstrated mastery of a clearly defined set of standards (Brookhart, 2009; Marzano, 2000; Popham, 2011; Wiggins, 1998), rather than a norm-referenced or relative approach in which students are compared to other students. Second, performance in relation to standards should be defined using clearly articulated descriptors on a scale of four or five levels rather than with a percentage system based on the accumulation of a number of points (Guskey, 2011). Third, factors that influence a grade but are not directly related to student mastery of a standard should be considered separately for grading purposes (Guskey, 2011). Such factors include lateness, effort, attendance, and the use of extra credit to "boost" a grade. These factors only serve to confuse the true performance of the student. Fourth, a grade should reflect how much a student has learned and not when they learned it, meaning the most recent and/or consistent evidence of a student's understanding should be considered over a simple averaging of performance on tests and assignments over the course of a year or semester. Finally, and related to the last principle, students should not be penalized for practice, meaning not all assignments should be factored into a student's grade (Fisher, Frey, & Pumpian, 2011). Homework, practice problems, or other types of formative assessment should be used for feedback but not to determine a final grade, because they reflect a student's developing understanding and not their final understanding, which should be measured using summative assessments.

In addition to these core principles, standards-based grading is often connected to mastery learning (Guskey, 1980). The underlying assumption behind mastery learning is that all students should be provided with multiple opportunities to demonstrate their understanding of a standard to achieve proficiency. Grades in this approach are used to help identify students' strengths and weaknesses to foster growth rather than simply to identify talent (Guskey, 2011). Allowing opportunities for reassessment provides teachers with opportunities to use grades to facilitate meaningful communication with students about their specific strengths and weaknesses.

Review of Literature

While the research supporting the use of standards-based grading is lacking, there are several studies that suggest traditional grading practices are flawed. For example, in two famous early studies by Starch and Elliott (1912, 1913) on the subjectivity of grading, the authors discovered a wide range of scores awarded by teachers grading the same assignment, even when it involved subjects like geometry. Brimi (2011) replicated one of these early studies and discovered almost identical results, even after teachers had received 20 hours of training on assessment. Another problem is that the meaning of a grade is often difficult to ascertain because it conflates too many factors (lateness, effort, and neatness, for example) often unrelated to learning or impossible to measure (Gordon & Fay, 2010). The well-documented rise in grade
inflation, too, suggests that there is good reason to be skeptical of the meaning of grades as a true measure of a student's understanding (Rojstaczer & Healy, 2012; Seligman, 2002). Brookhart (1994) discovered in her research on teachers' grading practices a lack of congruence between best practices in the field of assessment and how teachers graded their students. In other words, many teachers are simply not well educated when it comes to issues of assessment and grading. This is particularly clear in the emphasis teachers place on grades as a reward for students' work rather than as a measure of achievement (Brookhart, 1993, p. 139). These are just a few of the reasons why experts in assessment advocate for standards-based grading as an alternative to traditional grading.

The studies done in K-12 education on the practice of standards-based grading suggest that it can improve student learning and may increase student motivation. A large-scale study in the Denver area, for example, demonstrated a higher correlation between grades and standardized test scores in schools with standards-based grading than in those without, and scores on standardized tests in schools with standards-based grading were also higher (Haptonstall, 2010). In the Omaha Public Schools as well, the number of students failing classes decreased significantly when a standards-based approach to grading was implemented (Proulx, Spencer-May, & Westerberg, 2012). Also, in a study by Fisher et al. (2011), a school in San Diego that implemented several components of standards-based grading saw its performance on state tests increase, as well as students' GPAs.

Despite these positive findings, very little research has been done in higher education related to the use of standards-based grading. The few studies that do exist on the use of standards-based grading in colleges or universities suggest that grade reform is possible in higher education, and the experiences of both the professors and students involved in the studies were generally positive. Beatty (2013), for example, documented his experience implementing standards-based grading in two semesters of university physics. He discovered that many, but not all, students liked the standards-based approach; however, the logistics of successful implementation are significant and challenging. Rundquist (2011) also reported a similarly positive experience implementing standards-based grading in an upper-level physics course. Finally, Kalnin (2014) implemented proficiency-based grading in one instructional unit in a course on assessment and found that the process gave her a deeper appreciation of the challenges of "practicing what we preach," and it deepened her students' assessment literacy. To date, these appear to be the only studies specifically on the use of standards-based grading in college or university settings; however, given current trends in K-12 education, it appears likely that standards-based grading will continue to grow in use in colleges and universities, and the need for a better understanding of best practices in implementing this approach will only increase.

Context

The context in which I implemented standards-based grading was a private, selective liberal arts college in the Midwest. The course was Assessing Learning, a required course for all students in the Education program, which includes elementary,
secondary, and K-12 majors. As the second course in the education sequence, most students take Assessing Learning as sophomores, and they have all either been fully or provisionally admitted into the education program by the time they take the course, meaning most of the students have at least a 3.0 GPA and a minimum score of 22 on the ACT or 1100 on the SAT. The fact that the students were all majors and most have met minimum program requirements means that they are, on average, more highly motivated and capable than the average student on our campus. Also, as majors in education they tend to have a high level of engagement and interest in topics such as grading.

The Process of Implementing the Standards-Based Grading Approach

To implement the standards-based grading approach I consulted a variety of articles and texts, mainly relating to the context of K-12 education, but also those mentioned above in higher education. Various articles cited below influenced practical considerations, but the overall process came from the course text (Popham, 2011) and the work of Guskey and Bailey (2010) and Marzano (2000). I used Popham's process primarily because I wanted to model what was presented in our course text, and I found a great degree of conceptual similarity between the different approaches, even though Popham refers to the approach as "goal-attainment grading." Guskey and Bailey (2010) and Marzano (2000) provided more in-depth answers to the many practical considerations I needed to make.

Step 1: Clarifying Curricular Aims or Standards

The first step in implementing the standards-based grading approach is to determine a set of learning targets or objectives that accurately reflect important concepts and skills addressed in the course. I started with ten course goals, which were broad statements of what the teacher candidates need to know and be able to do. Using these broad outcomes as a guide, I thoroughly reviewed course materials and readings to identify more specific and assessable learning targets reflecting the knowledge and skills I deemed necessary to be literate in classroom assessment practices. Bloom's Taxonomy was useful for ensuring the learning targets represented an appropriate range of cognitive challenge and that they were all assessable. Examples of specific learning targets can be found in the sample standards-based report card in Appendix B.

Once I identified an appropriate number of learning targets addressing the essential content and skills, I considered what non-academic behaviors were important for my students to demonstrate. Advocates of standards-based grading argue that non-academic factors such as lateness, effort, and attendance should not be used to determine a standards-based grade (Guskey & Bailey, 2010); however, Guskey and Bailey (2010) recommend acknowledging the importance of these non-academic factors by separating students' grades into product (mastery of course objectives), process (factors such as attendance), and progress (how much a student has gained from a course). The process goals I deemed most important included attendance, active participation, meeting deadlines, completing assignments (even ungraded ones), and
the general professional dispositions desirable of an adult working with children (use of appropriate language, communication skills, etc.). See Appendix B for an example of how the process grade was communicated to students using the standards-based report card. A progress grade was not computed because the challenges of fairly determining the amount of growth attained by each student during the term were simply too great, particularly given an eleven-week trimester.

Another step in the process of clarifying one's curricular aims is to identify the criteria by which to determine whether a student has mastered the aim (Popham, 2011, p. 391). Having an idea of what mastery looks like is an essential step in clarifying for oneself and one's students what the curricular aim is. After reviewing multiple examples of rubrics (Beatty, 2013; Guskey & Bailey, 2010; Rundquist, 2011), I arrived at a five-point scale to evaluate student learning in relation to my learning targets (see Appendix C).

Step 2: Choosing Standards-Based Assessment Evidence

Once I identified my learning targets, both product and process related, I reviewed my course assessments (a combination of performance assessments and traditional tests) to make sure I was collecting appropriate evidence of my students' understanding. All my assessment items required constructed responses, in which students needed to write out an answer rather than select from multiple-choice or true/false options. Multiple assessments were helpful for a variety of reasons, some of which will be discussed in the next section, but overall, advocates of the standards-based grading approach suggest that students should have multiple opportunities to demonstrate mastery of course learning targets and that a grade should be based on a sufficient amount of evidence (Marzano & Heflebower, 2011). I found that I did not have to significantly change my assessments; rather, the process required me to think about how my assessments were connected to my learning targets and how strong the evidence I collected was. I also excluded a wide range of assignments I normally would have included in the grade book when calculating the final grade. The types of assignments excluded fall into the category of formative assessments: assignments designed to collect evidence of a student's progress toward meeting a standard, to provide feedback, and to assist the student in monitoring their own learning (Popham, 2011). Examples of formative assessments not included for grading were daily homework assignments, quizzes, and other in-class assignments. While these assessments were critical for me as the professor to know whether my students were learning, including them in the final grade would have ultimately punished students for practice (Fisher et al., 2011).

Step 3: Weighting Standards-Based Assessment Evidence

When considering how to weight the evidence that would be used to determine the final grade, I again considered the basic tenet of the standards-based grading approach that low grades received early in a term should not be averaged with grades received later (Fisher et al., 2011). This means that my grade book was set up so that the time of assessment was taken into account and that the most recent grade a
student received was the most important grade in determining the final grade. While there are different models and approaches to determine a grade for a single standard (Hooper & Cowell, 2014; Marzano, 2000), using the most recent score made sense to me both because it didn't penalize students for low grades early in the term and because it would ostensibly communicate to students that what really matters is how they finish, not how they start. The fact that average scores on my first test tend to be much lower than on later tests suggests to me that students also need to get used to the assessments and to the expectations for assessment evidence graded as distinguished, on target, and so on.

Another question related to the weighting of evidence is what to do if a student scores lower on a reassessment opportunity. Should the lower, but most recent, score be used, should the new score be disregarded, or should the scores be averaged? I decided to average the two most recent scores because I wanted to communicate to students that the fact they had already demonstrated a higher level of understanding was important, but consistency was also important; if a student had mastered a learning target early in the term with a 4 but forgot what they learned and scored a 1, then perhaps they really did not reach a level of mastery warranting a score of "distinguished."

Lastly, it was clear to me as I determined the standards to be assessed that not all standards should be weighted equally. Some standards were more important because they reflected a greater level of cognitive complexity or because they were more fundamental to the broader outcomes required by the course and the program. For example, my students' ability to identify different types of assessment bias was important, but their ability to construct their own assessment items free from assessment bias was even more important and worthy of more weight in the grade book. The system of weighting used a multiplier from 1 to 4 depending on the complexity and significance of the standard being assessed. This system of weighting was useful when communicating with students because it gave them an idea of what knowledge and skills were most significant and why.

Step 4: Arriving at a Final Standards-Based Grade

The overarching purpose of standards-based grading is to clearly communicate to students a level of performance in relation to a set of standards, and the best way to do this is to use a reporting system that is sufficiently detailed to accomplish this task. Guskey and Bailey (2010) recommend reporting performance on non-academic or process goals separately from product and progress goals so that the ability of the grade to clearly communicate will not be diminished (p. 157). Yet in higher education and in most secondary schools there is a need to award a student a final, omnibus grade. Acknowledging the significance of grades for students to advance in our program, and understanding the need to ensure that the grades I awarded were indeed reflective of students' mastery of course standards, I provided students with a grade report that was separated into process and product grades so that I could communicate to them what their grade was based on, but I also calculated a final grade that reflected both academic and non-academic performance.
My decision to include the process score in the students’ overall grade is a significant departure from the spirit of standards-based grading, but I was concerned that a
student could receive a good final grade in the course but not demonstrate the types of dispositions we expect of our students, and in my case, future teachers. Accordingly, I wanted the final grade to reflect both mastery of content and professional dispositions. To do this, I decided that to receive an "A," a student should demonstrate understanding at the distinguished level on a majority of the learning targets and exhibit no non-academic concerns. With this in mind, I arrived at a final, omnibus grade by averaging students' performance on the learning targets, using the most recent evidence (or an average of the two most recent scores if the most recent score was lower), and averaging the scores awarded for non-academic factors (determined through a combination of self, peer, and instructor assessment, depending on the trait). Finally, I multiplied the product score by .8 (80%) and the process score by .2 (20%) and combined them to determine the final score, which I then converted to a letter grade using the grade point scale (see Appendix B for a sample grade book). The final grades I awarded were comparable in range and distribution to grades given in non-standards-based grading courses, but unlike those grades, they were based on clearly defined standards of performance.
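To make the computation concrete, the following Python sketch pulls together the rules from Steps 3 and 4: the most-recent-score rule (with averaging when a reassessment comes back lower), the 1-4 weighting multipliers, and the 80/20 combination of product and process scores. The learning targets, weights, and score histories are invented for illustration, and the letter-grade cutoffs are an assumption, not the actual grade point scale I used:

    from statistics import mean

    def target_score(history):
        # history: chronological 0-4 scores for one learning target.
        # Use the most recent score; if a reassessment came back lower,
        # average the two most recent scores instead.
        if len(history) >= 2 and history[-1] < history[-2]:
            return (history[-1] + history[-2]) / 2
        return history[-1]

    # Hypothetical grade book: learning target -> (weight 1-4, score history).
    product_targets = {
        "LT1: identifies types of assessment bias": (2, [2, 3, 4]),
        "LT2: constructs bias-free assessment items": (4, [4, 1]),  # (4+1)/2 = 2.5
    }
    process_scores = [4, 3, 4]  # attendance, deadlines, dispositions, etc.

    # Weighted average of the product (mastery) scores.
    scores = [(w, target_score(h)) for w, h in product_targets.values()]
    product = sum(w * s for w, s in scores) / sum(w for w, _ in scores)
    process = mean(process_scores)

    final = 0.8 * product + 0.2 * process  # the 80/20 combination described above

    # Hypothetical cutoffs on the 4-point scale, for illustration only.
    letter = "A" if final >= 3.7 else "B" if final >= 2.7 else "C" if final >= 1.7 else "D"

Running this sketch on the sample data yields a product score of 3.0, a process score of about 3.67, and a final score of about 3.13, which the hypothetical cutoffs would map to a B.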
Reflections on the Process

My main take-away from implementing standards-based grading and from reviewing the research is that it is an approach with a great deal of value because it encourages healthy reflection on what we teach and how we assess our students. It also fosters communication with our students by making the focus of a grade student achievement rather than success on an assessment instrument. The grades students received at the end of the course more accurately reflected their level of understanding of course content than in the past, when I based my grades on an accumulation of points. Also, the way I communicated with students about their grades and assessments improved significantly. Rather than discussing low test scores or a failure to complete assignments as the reason for a poor grade, I used a standards-based "report card" to communicate to students the specific learning targets they still needed to master and the opportunities they would have to demonstrate their understanding of these learning targets. In course evaluations students reported that they clearly understood the relationship between course content, in-class learning activities, and assessments, and that this helped them to focus on learning what was important. These conversations represented a significant, positive shift in the way I talked about grades and assessment with students.

In fact, students' reactions to standards-based grading were mostly very positive. The results from an anonymous post-course survey indicate students liked the clarity of standards-based grading and felt that it gave them a sense of control over their grade because of the opportunities for reassessment. On the other hand, some students felt that the standards were set too high or were not sure what they needed to do to reach a higher level of mastery. The practice also confused some students. I believe this was partly because it was different from what they were used to and partly because I was still learning how to implement the practice. Despite some negative comments, scores on course evaluations were much higher than the average for other courses at the institution, specifically on items related to grading and
assessment. Overall, the majority of students appreciated the approach and wished other faculty used it in their courses.

The process also required me to think about my assessments differently, as each item on a test, for example, was connected to a specific learning target. Reviewing my assessments from this perspective improved them by ensuring that course content was adequately represented. Most importantly, when I graded my tests I was able to see which learning targets students struggled with and which the majority had mastered. Understanding student and class performance in relation to learning targets then led me to examine my teaching practices and the ways I presented different topics in class. As a result of this reflection, I made several changes to my teaching to better address specific learning targets students struggled with, including using more formative assessments and structuring in-class activities to address specific topics in more depth. I also used item analysis to inform future assessments, making sure to include questions on topics the class overall struggled with to provide an opportunity for reassessment.

Providing both individual and group opportunities for reassessment represented another significant improvement afforded by standards-based grading. While not all students took advantage of opportunities to reassess, I believe those who did benefited from the opportunity to review material and to demonstrate their understanding in different ways. In almost all cases reassessment led to higher scores, and, because the higher, most recent score was used to determine the final grade, the final grade was a more accurate representation of the student's level of understanding.

Despite my generally positive experience, standards-based grading is not without its pitfalls. Something I hear quite often from K-12 teachers, and which is reflected in my own course surveys, is that standards-based grading is difficult to understand at first because it is different. Another common complaint I hear from K-12 and pre-service teachers is that students are not as motivated to complete assignments if they know the grades on the assignments will not count toward their final grade. The practice of not grading formative work, a central component of standards-based grading, represents a significant hurdle for teachers or professors wanting to implement this approach. The way I addressed this concern was to include work completion in the process grade, so that a student's grade was affected if they failed to complete homework assignments. In addition, to participate in class students needed to come prepared with their work complete, which was another graded component of the course.

Standards-based grading is also difficult to implement because it requires professors to think about assessment differently. Grading was definitely more work because, rather than just adding up the number of correct answers on a test, I was thinking about the level of understanding reflected in each answer compared to a standard of performance. My main conclusion is that the philosophy and the growing body of research supporting standards-based grading are promising, but the realities of assessing and grading in higher education present professors with challenges in implementing it with fidelity.
Issues such as arriving at a final, end-of-course grade that does not take into account non-academic factors, providing multiple opportunities for reassessment, and not grading homework are all elements of standards-based grading that I struggled
with as I implemented the approach. My review of the three published articles related to the implementation of standards-based grading suggests that the issues I faced are not uncommon; however, the way these issues are addressed varies depending on context. The nature of the course and the methods of assessment will likely determine what standards-based grading looks like in practice.

Recommendations

While my experience and the reaction of the students were positive overall, more needs to be done to "work out the kinks." The challenges of implementing this approach in a higher education context with fidelity to the basic principles are significant. To be successful, multiple iterations are likely to be needed, and much more serious, systematic inquiry into the benefits and limitations will be required.

My first recommendation is that more research be done to better understand best practices for implementing standards-based grading in higher education. Some of the more obvious areas in need of investigation include the role context plays in successful implementation. In K-12 education standards-based grading has been implemented in a wide variety of contexts, but it seems to run into more resistance in secondary education. Could it be that the content being taught and the course level determine whether standards-based grading can be implemented successfully? It seems to work well in college physics and assessment courses, but what about upper-division writing courses or introductory language courses? Is it feasible in large, lecture-style courses, or will it only be manageable when course enrollment is low? Another question needing research is how standards-based grading influences students' approach to learning and their overall mastery of course goals. If the standards-based grading approach is meant to improve learning, do we know this is really happening? The work that has already been done is promising because it suggests students view the approach favorably, but the next step needs to be taken, particularly when the opportunity arises to compare student learning in courses with and without standards-based grading.

For those interested in implementing standards-based grading, my recommendation is to start by developing a mock grade book representing the elements that are most important to you and that will help to facilitate communication with your students (a minimal sketch follows below). If a significant reason to adopt standards-based grading is to improve communication, then the tool used to convey this information to students is important. Once you have an idea of what the final product will look like, the process for arriving at the grade report, outlined in this article, will likely make more sense.
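As a minimal, assumed illustration (the learning targets, assessments, and scores below are invented, not taken from my course), such a mock grade book might start as nothing more than a mapping from learning targets to the evidence collected for each:

    # A mock standards-based grade book: rows are learning targets, not assignments.
    # Each target maps to its evidence, listed in chronological order.
    mock_gradebook = {
        "LT1: writes clear selection-type items": {"Test 1": 3, "Test 2": 4},
        "LT2: interprets standardized test reports": {"Test 2": 2, "Project 1": 3},
        "LT3: designs rubrics aligned to targets": {"Project 1": 4},
    }

    for target, evidence in mock_gradebook.items():
        most_recent = list(evidence.values())[-1]  # relies on chronological entry order
        print(f"{target}: most recent evidence = {most_recent} (all evidence: {evidence})")

Even a simple table like this makes the key design decisions visible: which targets matter, what counts as evidence for each, and how recency will be handled.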
The work that has already been done on standards-based grading suggests that it is a worthwhile approach but that it is challenging to implement. In my experience, the challenges are worth the effort because of the clarity standards-based grading brought to my grading process and the improved levels of communication it enabled. Given that standards-based grading is likely to become more commonplace in higher education, it behooves us to continue to work out the kinks and to learn from each other.

References

Beatty, I. (2013). Standards-based grading in introductory university physics. Journal of the Scholarship of Teaching and Learning, 13(2), 1-22. Retrieved from http://www.iupui.edu/~josotl

Brimi, H. (2011). Reliability of grading high school work in English. Practical Assessment, Research and Evaluation, 16(17), 1-12. Retrieved from http://pareonline.net/pdf/v16n17.pdf

Brookhart, S. (1993). Teachers' grading practices: Meaning and values. Journal of Educational Measurement, 30(2), 123-142.

Brookhart, S. (1994). Teachers' grading: Practice and theory. Applied Measurement in Education, 7(4), 279-301.

Brookhart, S. (2009). Grading (2nd ed.). New York, NY: Merrill.

Fisher, D., Frey, N., & Pumpian, I. (2011). No penalties for practice. Educational Leadership, 69(3), 46-51. Retrieved from http://www.ascd.org/publications/educational-leadership/nov11/vol69/num03/abstract.aspx

Gordon, M., & Fay, C. (2010). The effects of grading and teaching practices on students' perceptions of fairness. College Teaching, 58, 93-98.

Guskey, T. R. (1980). Mastery learning: Applying the theory. Theory into Practice, 19(2), 104-111.

Guskey, T. R. (2011). Five obstacles to grading reform. Educational Leadership, 69(3), 16-21. Retrieved from http://www.ascd.org/publications/educational-leadership/nov11/vol69/num03/abstract.aspx

Guskey, T. R., & Bailey, J. (2010). Developing standards-based report cards. Thousand Oaks, CA: Corwin Press.

Haptonstall, K. G. (2010). An analysis of the correlation between standards-based, non-standards-based grading systems and achievement as measured by the Colorado Student Assessment Program (CSAP) (Doctoral dissertation). Retrieved from ProQuest Dissertations and Theses. (UMI No. 3397087)

Hooper, J., & Cowell, R. (2014). Standards-based grading: History adjusted true score. Educational Assessment, 19(1), 58-76. doi:10.1080/10627197.2014.869451

Kalnin, J. (2014). Proficiency-based grading: Can we practice what they preach? AILACTE Journal, 11(1), 19-36. Retrieved from http://www.eric.ed.gov/contentdelivery/servlet/ERICServlet?accno=EJ1052571

Marzano, R. (2000). Transforming classroom grading. Alexandria, VA: ASCD.

Marzano, R. J., & Heflebower, T. (2011). Grades that show what students know. Educational Leadership, 69(3), 34-39. Retrieved from http://www.ascd.org/publications/educational-leadership/nov11/vol69/num03/abstract.aspx

O'Connor, K. (2002). How to grade for learning: Linking grades to standards (2nd ed.). Thousand Oaks, CA: Corwin Press.

Popham, J. (2011). Classroom assessment: What teachers need to know. Boston, MA: Pearson.

Proulx, C., Spencer-May, K., & Westerberg, T. (2012). Moving to standards-based grading: Lessons from Omaha. Principal Leadership, 13(4), 30-34. Retrieved from http://www.principals.org/Default.aspx?TabId=2043

Rojstaczer, S., & Healy, C. (2012). Where A is ordinary: The evolution of university and college grading, 1940-2009. Teachers College Record, 114(7), 1-23. Retrieved from http://www.tcrecord.org/Content.asp?ContentId=16473

Rundquist, A. (2011). Standards-based grading with a voice: Listening for students' understanding. In N. S. Rebello, P. V. Englehardt, & C. Singh (Eds.), AIP Conference Proceedings, 1413 (pp. 69-72). doi:10.1063/1.3679996

Seligman, D. (2002, March 18). The grade-inflation swindle. Forbes, 19.

Starch, D., & Elliott, E. C. (1912). Reliability of the grading of high-school work in English. School Review, 20, 442-457. Retrieved from http://www.jstor.org/stable/1076706

Starch, D., & Elliott, E. C. (1913). Reliability of grading work in mathematics. School Review, 21, 254-259. Retrieved from http://www.jstor.org/stable/1640875

Townsley, M. (2014, November 11). What's the difference between standards-based grading (or reporting) and competency-based education? Retrieved from http://www.competencyworks.org/analysis/what-is-the-difference-between-standards-based-grading/

Welsh, M., D'Agostino, J., & Kaniskan, B. (2013). Grading as a reform effort: Do standards-based grades converge with test scores? Educational Measurement: Issues and Practice, 32(2), 26-36. doi:10.1111/emip.12009

Wiggins, G. (1998). Educative assessment: Designing assessments to inform and improve student performance. San Francisco, CA: Jossey-Bass Publishers.
Appendices

Appendix A

Table A1
Summary of Differences between Traditional Grading and Standards-Based Grading

Traditional Grading System:
1. Based on assessment methods (quizzes, tests, homework, projects, etc.). One grade/entry is given per assessment.
2. Assessments are based on a percentage system. Criteria for success may be unclear.
3. Use an uncertain mix of assessment, achievement, effort, and behavior to determine the final grade. May use late penalties and extra credit.
4. Everything goes in the grade book, regardless of purpose.
5. Include every score, regardless of when it was collected. Assessments record the average, not the best, work.

Standards-Based Grading System:
1. Based on learning goals and performance standards. One grade/entry is given per learning goal.
2. Standards are criterion or proficiency-based. Criteria and targets are made available to students ahead of time.
3. Measures achievement only OR separates achievement from effort/behavior. No penalties or extra credit given.
4. Selected assessments (tests, quizzes, projects, etc.) are used for grading purposes.
5. Emphasize the most recent evidence of learning when grading.

Note. Adapted by M. Townsley (2014) from How to Grade for Learning: Linking Grades to Standards (2nd ed.), by K. O'Connor (2002). Copyright 2014 by Corwin Press.
Appendix B

Sample Standards-Based Report Card

[The sample report card appears as a figure in the original publication.]
Appendix C

Table A2
Generic scale used to evaluate assessment evidence

4 - Distinguished: Student demonstrates clear, accurate, and advanced evidence of understanding
3 - Mastery: Student demonstrates a clear, accurate understanding
2 - Developing: Student demonstrates a partial understanding
1 - Concern: Student demonstrates a clear misunderstanding
0 - No evidence: No evidence of understanding provided

Appendix D

Student post-course survey comments

What did you like about standards-based grading, if anything?

- That it reflects what [is] most important; Learning
- I like how the intangibles are separate from the overall grade. This makes the student's assignment grade more accurate as to the caliber of his/her performance in mastering the learning targets. It also provides more organization for the teacher because basically everything (assessments, grading, instruction, etc.) revolves around the learning targets he/she puts in place to satisfy standards. This ensures that teachers do not get too carried away with planning only somewhat related lessons because everything has to tie back to the learning targets.
- I like how there is something that everyone could achieve and work up to.
- I like that standards-based focuses on the mastery of content when giving a grade. Then nothing else would influence the grade and students, parents, and teachers would get a clear understanding of the student's learning.
- I liked that we had the opportunity to reassess on certain learning targets that we did not fully master.
- I like the reassessment opportunities.
- I like that we can do reassessments for our learning targets. It really shows whether or not you understand the content and where you need to focus your attention if you want to raise your grade.
- I was able to be reassessed. I could see where I went wrong on what topic.
- I liked how I knew everything that was going to be on the test. I knew exactly what to study. Nothing was a surprise. It is extremely fair.
- I like that the standards are communicated with us before hand and we know exactly what we are going to be graded on.
- I did like how I was better able to tell what I did know and didn't. It was easier than just a percentage.
- I like that learning targets were given to us for every class period, and we knew exactly what was expected for us to know and be assessed on.
- I liked how it set out a certain criteria.
- The learning targets make it easy to track progress and help students know what to study.
- I did enjoy seeing exactly where I was lacking. Being able to see the learning targets and my score on each helped motivate me to reach 4s for every target
- It follows the course objectives/learning targets and it measures student mastery of their content
- I liked that it showed the level of understanding for each of the standards and that the grade was not given but it had more of an impression that it was being earned.
- Only assess[es] the students learning based on the standards being assessed
- It most resembled how much I actually learned. I thought this was an awesome way to grade, especially with reassessment opportunities.

What did you NOT like about standards-based grading, if anything?

- Not understanding my grade for 8 weeks.
- I honestly like standards-based grading but I feel like it would be a culture-shock to suddenly implement this in schools. Although people would eventually get used to it, I feel like many students and parents would be initially overwhelmed by the grading format as it would appear on something like PowerSchool. Instead of having the traditional format of exams, homework assignments, participation, etc., there would be actual learning targets with assignments listed under it. Like I said, people would get used to it, but I know that I would be somewhat alarmed if my child's grading format was changed dramatically from the way I was comfortable with.
- I think it would be hard to not consider effort when giving a grade because it is very important in the learning process.
- I was sometimes confused about why I got a different level than I expected (for example, a "Target" instead of "Distinguished"), and I didn't feel like this was ever explained to me.
- I really enjoyed it a lot. The only thing was that I wasn't used to this type of grading, so it took me a while to adjust to how I can view my performance.
- I did not necessarily dislike it, but I could see that some people would dislike how heavily test scores are weighted and that their homework does not count for much.
- The teacher controls the standards, so sometimes they are subjective.
- I didn't know why I got the score I got and what was the 100% correct answer ever.
- I did not like how on a test if you mastered it the first time, but then [if] it was to be re-assessed and you didn't do as well the second time, then the score was reevaluated and lowered.
- It can be too specific – not allowing for creativity or wiggle room.
- I did not like how hard it was to gain mastery. I understand it, but it took a lot more work to earn my A than other classes may take.
- I do not like the fact that it is often difficult to tell how I did on a particular assignment. For me, I do not think of my courses as being separated into various standard[s]. I think of them as being separated into various assignments. If you tell me I got a 2/3 on this standard, that doesn't mean anything to me. But if you tell me I got an 85% on the rubric project, I can judge that against how I *thought* I should have done on that project and determine whether I need to put in more effort. In short, I think it's useful for letting students know how they are doing, which as a student is frustrating.
- When being reassessed I did not like the averaging of the scores if the 2nd time the grade was lower.

Michael Scarlett is an assistant professor in the Education Department at Augustana College in Rock Island, IL. He teaches courses on assessment, educational technology, and methods of teaching social studies. His research interests include game-based learning, the history of American Indian education, and standards-based grading.
