- Each paper needs to be between 800-1,000 words, no more and no less.
- You must also use 12-point Times New Roman font.
- You need to save your file as either Microsoft Word format or rich text format (.rtf) in order for the summary to be uploaded to Webcourses. It is your responsibility to ensure that your file is saved correctly and in the correct file format before the deadline. It is also your responsibility to ensure that you can successfully upload your article summary into the system prior to the deadline. Do not wait until the last minute to upload your article summaries in case you experience technical problems.
- Your paper will automatically be submitted to Turnitin upon submission, to determine how another author’s work was used in the assignment. Make sure you take notes while reading the selected article in your own words. Do not copy and paste directly from the selected article because matches to other authors’ works of 30% or more will result in an automatic zero (0) for the assignment.
- At the top of your summary, you must include your name, the name of the article you selected, the name of the journal that the article was taken from, name of the authors of the article, and your total word count. An example is below:
- Student name
- Name of article: Applications of Generalizability Theory and Their Relations to Classical Test Theory and Structural Equation Modeling
- Name of journal that the article was in: Psychological Methods
- Word count
- Your summary should not include any direct quotations from the article you selected. Put everything in your own words and do not summarize the abstract section of the article.
- You should summarize a recent research journal article from one of the American Psychological Association (APA) journals listed in the table below. Please note that you will receive a zero if your summary is on a paper that is not from one of the journals listed below and/or it is not clear from your title page whether you used one of these journals.
Applications of Generalizability Theory and Their Relations to Classical Test Theory and Structural Equation Modeling
Walter P. Vispoel, Carrie A. Morris, and Murat Kilinc, University of Iowa
Abstract

Although widely recognized as a comprehensive framework for representing score reliability, generalizability theory (G-theory), despite its potential benefits, has been used sparingly in reporting of results for measures of individual differences. In this article, we highlight many valuable ways that G-theory can be used to quantify, evaluate, and improve psychometric properties of scores. Our illustrations encompass assessment of overall reliability, percentages of score variation accounted for by individual sources of measurement error, dependability of cut-scores for decision making, estimation of reliability and dependability for changes made to measurement procedures, disattenuation of validity coefficients for measurement error, and linkages of G-theory with classical test theory and structural equation modeling. We also identify computer packages for performing G-theory analyses, most of which can be obtained free of charge, and describe how they compare with regard to data input requirements, ease of use, complexity of designs supported, and output produced.
Translational Abstract

Generalizability theory (G-theory) is widely recognized as a comprehensive framework for representing score reliability. However, despite its potential benefits, G-theory has been used sparingly in reporting of results for measures of individual differences. In this article, we describe G-theory in a straightforward manner and highlight many valuable ways it can be used to quantify, evaluate, and improve psychometric properties of scores. Our illustrations encompass assessment of overall reliability, percentages of score variation accounted for by individual sources of measurement error, dependability of cut-scores for decision making, estimation of reliability and dependability for changes made to measurement procedures, disattenuation of validity coefficients for measurement error, and linkages of G-theory with classical test theory and structural equation modeling. We also identify computer packages for performing G-theory analyses, most of which can be obtained free of charge, and describe how they compare with regard to data input requirements, ease of use, complexity of designs supported, and output produced. These resources, along with formulas provided throughout the article, should enable readers to apply G-theory to their own research and understand how it aligns with and differs from other measurement models.
Keywords: generalizability theory, reliability, validity, classical test theory, structural equation modeling
Over 40 years have passed since Cronbach, Gleser, Nanda, and Rajaratnam (1972) published their seminal treatise on generalizability theory (G-theory)—The Dependability of Behavioral Measurements: Theory of Generalizability for Scores and Profiles. Their work significantly broadened perspectives on measurement theory by providing a comprehensive framework for estimating score consistency with reference to multiple sources of measurement error. Over time, many additional treatments of G-theory have appeared that summarize and expand the work of Cronbach et al. (see, e.g., Brennan, 2001a; Crocker & Algina, 1986; Feldt & Brennan, 1989; Haertel, 2006; Marcoulides, 2000; Raykov & Marcoulides, 2011; Shavelson & Webb, 1991; Shavelson, Webb, & Rowley, 1989; Wiley, Webb, & Shavelson, 2013). Yet despite the strong interest in G-theory within the measurement community, applications of it are still rare when reporting results for measures of individual differences. Possible reasons for such neglect may be G-theory’s technical vocabulary, overlooked linkages between it and classical test theory (CTT), and difficulty in finding and running software for doing G-theory analyses. The purpose of this article is to describe G-theory in a straightforward manner, illustrate effective ways it can be used with measures of individual differences, highlight many of its direct connections with conventional indices of reliability and validity, show how G-theory can be approached from a structural equation modeling perspective, and identify computer resources for conducting G-theory analyses.
This article was published Online First January 23, 2017.
Walter P. Vispoel, Carrie A. Morris, and Murat Kilinc, Department of Psychological and Quantitative Foundations, University of Iowa.
We thank Patricia Martin for her help in preparing and proofreading drafts of the submitted manuscript.
Correspondence concerning this article should be addressed to Walter P. Vispoel, Department of Psychological and Quantitative Foundations, University of Iowa, 361 Lindquist Center, Iowa City, IA 52242-1529. E-mail: [email protected]
This document is copyrighted by the American Psychological Association or one of its allied publishers. This article is intended solely for the personal use of the individual user and is not to be disseminated broadly.
Psychological Methods, 2018, Vol. 23, No. 1, 1–26. © 2017 American Psychological Association. 1082-989X/18/$12.00 http://dx.doi.org/10.1037/met0000107
Background
The most important concept in CTT is the reliability of scores for a given measure. The theory begins with the assumption that an individual’s observed score (X) is the sum of true (T) and error (E) scores: X = T + E. True score represents an individual’s expected or average observed score over a presumed infinite number of retakes of the measure with no carryover effects. Because such individual score modeling is impossible to implement in practice, reliability of scores is derived by administering the same measure(s) to a population of individuals. To compute a reliability coefficient for those scores, we assume that parallel forms of a measure can be constructed in which a given individual has the same true score on both forms, observed-score variances are equal across forms, and error scores are uncorrelated with true scores and with each other. Under these conditions, observed-score variance will equal the sum of true-score and error variances, and the correlation between scores from the parallel forms will represent reliability as a ratio of true-score variance over true-score variance plus error variance, as shown in Equation 1:
CTT: Reliability coefficient = True-score Variance / (True-score Variance + Total Error Variance). (1)
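Equation 1 can be illustrated with simulated data. The sketch below (a hedged illustration with hypothetical variance values of our own choosing, not from the article) draws true scores, builds two parallel forms by adding independent errors, and checks that the correlation between forms approximates the theoretical reliability ratio:

```python
import numpy as np

rng = np.random.default_rng(0)
n_persons = 100_000                      # large sample to tame sampling error
true_var, error_var = 4.0, 1.0           # hypothetical illustration values

true_scores = rng.normal(0.0, np.sqrt(true_var), n_persons)
form_a = true_scores + rng.normal(0.0, np.sqrt(error_var), n_persons)
form_b = true_scores + rng.normal(0.0, np.sqrt(error_var), n_persons)

# Equation 1: reliability = true-score variance / (true-score + error variance)
theoretical = true_var / (true_var + error_var)
# The correlation between parallel forms estimates the same quantity
estimated = np.corrcoef(form_a, form_b)[0, 1]
print(round(theoretical, 3), round(estimated, 3))
```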
Equation 1 illustrates that error variance in CTT is a single undifferentiated entity. Partitioning of observed-score variance in G-theory can mirror that same partitioning, but refine it further by isolating multiple sources of measurement error. These sources of measurement error in turn define the universe over which results are generalized, leading to the term universe score in G-theory replacing the term true score in CTT. Generalizability coefficients (G-coefficients) in G-theory are analogous to reliability coefficients in CTT, and also represent ratios of systematic variance divided by systematic variance plus error variance. The primary difference between the two is that G-coefficients can separate individual sources of measurement error, as shown in Equation 2:
G-theory: G-coefficient = Universe-score Variance / (Universe-score Variance + Σ Individual Sources of Error Variance). (2)
G-coefficients are typically derived using variance components from analysis of variance (ANOVA) models. In the context of ANOVA, partitioning of true-score and error variances in CTT is conceptually similar to partitioning of systematic and error effects in a one-way design, whereas partitioning of universe score and multiple sources of error in G-theory more closely resembles having multiple effects in a factorial design (Shavelson et al., 1989). Within a G-theory framework, a single measurement of behavior (item score, subscale score, rating, etc.) is conceptualized as a sample from a universe of admissible observations for the targeted objects of measurement—represented as persons in all illustrations discussed here. Aspects of the assessment such as individual items, blocks of items, test forms, prompts, raters, or occasions can represent facets or possible sources of measurement error in a G-theory design analogous to factors in an ANOVA model. As with factors in an ANOVA model, facets in a G-theory design can be treated as either fixed or random. A facet representing essay prompts, for example, would be considered fixed if inferences are restricted only to the particular prompts administered, or as random if the prompts are viewed as being sampled from a larger universe of possible prompts. In each G-theory design that we illustrate, tasks, occasions, or both will represent measurement facets of interest. Unless otherwise noted, we assume that persons are sampled at random from a target population of interest, and that tasks and occasions are sampled at random from broader universes of similar tasks and occasions, respectively.

In sections to follow, we describe the basics of G-theory with reference to single- and two-facet designs relevant to most measures of individual differences. We then demonstrate using real data how G-theory and conventional reliability coefficients (e.g., alpha, split-half, parallel-form, and test–retest) align and differ. In doing so, we emphasize benefits of two-facet designs in quantifying multiple sources of measurement error and show how those designs can provide more informative indices of overall reliability, dependability of individual cut-scores, and validity coefficients disattenuated for measurement error. In later sections, we describe recent advances in G-theory, its linkages with structural equation modeling, and software packages available for doing G-theory analyses.
G-Theory Basics
Single-Facet Designs
Partitioning of scores. In Table 1, we provide ANOVA models for two single-facet designs. Task is the single measurement facet of interest in the first design, and occasion in the second. In each design, a given observed score is partitioned into a linear composite representing a grand mean and effects for persons, the measurement facet of interest (task or occasion), and the combination of person and measurement facet. Although our initial illustrations focus on tasks, they can be easily adapted to occasions simply by substituting occasions for tasks in the equations to follow. To show direct linkages between G-theory and CTT indices in a familiar context, we conceptualize tasks in our illustrations as representing items, half-measures (i.e., splits), or full-measures (i.e., forms) for objectively scored instruments such as Likert-style questionnaires or multiple-choice tests in which anyone scoring the measures would get the same results. Equation 3 represents a G-theory, persons × tasks (p × t) design with person as the object of measurement and task (item, split, or form) as the measurement facet of interest. This design is a repeated measures, random-effects ANOVA model in which each person has as many scores as the number of tasks sampled:
Ypt = μ + (μp − μ) + (μt − μ) + (Ypt − μp − μt + μ) = grand mean + person effect + task effect + residual. (3)
In Equation 3, Ypt represents the score for a particular person on a particular task. The grand mean (μ) is a constant that represents the mean Ypt score aggregated across all persons and tasks. The universe score (μp), analogous to true score in CTT, corresponds to a person’s expected long-run average observed score over the universe of admissible observations included in the model (i.e., tasks in this example). The universe score is typically the primary focus of interest because it is intended to represent a person’s score independent of the specific tasks used to derive it. The symbol μt
denotes the mean for a particular task aggregated across persons. The deviation score μp − μ represents the person effect, μt − μ represents the task effect, and Ypt − μp − μt + μ is a residual term that reflects what remains in a given Ypt score after the person and task effects are subtracted out. The residual term, often labeled residual, error, pt, or pt,e, includes the person × task interaction and any other sources of error. All components in Equation 3 except the constant μ will have a
distribution with a variance representing differences in scores for persons, tasks, or residuals. In this completely crossed, balanced ANOVA model, the variance of Ypt scores can be partitioned into independent additive variance components representing persons, tasks, and residuals, as shown in Equation 4:
σ²Ypt = σ²p + σ²t + σ²pt,e. (4)

In the equation, σ²p represents the extent to which scores vary across persons, and σ²t across tasks; σ²pt,e reflects residual variation in observed scores not accounted for by persons and tasks. Because observed scores used for decision making in practice usually involve aggregating or averaging across tasks (e.g., summing item scores within a subscale to create a total subscale score), partitioning of those scores is typically of more interest in representing consistency of results. The partitioning of mean scores across tasks is shown in Equation 5. Because means for tasks themselves are constants across individuals, σ²t is excluded from that partitioning:

σ²YpT = σ²p + σ²pt,e / n′t, where n′t = number of tasks. (5)
Note that the lowercase t from Ypt in Equation 4 has been replaced with an uppercase T in Equation 5 to reflect the averaging of Y scores across all tasks represented.
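The partitioning in Equations 3 through 5 can be checked numerically. The brief sketch below (the score matrix contains arbitrary illustration values of our own, not data from the article) decomposes a small persons × tasks matrix into a grand mean, person effects, task effects, and residuals, and confirms that the components recompose each observed score exactly:

```python
import numpy as np

# Small persons x tasks matrix (rows = persons, columns = tasks); values arbitrary
Y = np.array([[4.0, 5.0, 3.0],
              [2.0, 3.0, 2.0],
              [5.0, 5.0, 4.0]])

grand_mean = Y.mean()                          # mu
person_means = Y.mean(axis=1, keepdims=True)   # mu_p
task_means = Y.mean(axis=0, keepdims=True)     # mu_t

person_effect = person_means - grand_mean      # mu_p - mu
task_effect = task_means - grand_mean          # mu_t - mu
residual = Y - person_means - task_means + grand_mean

# Equation 3: grand mean + person effect + task effect + residual recovers Y
reconstructed = grand_mean + person_effect + task_effect + residual
print(np.allclose(Y, reconstructed))
```

Because the effects are deviations from the grand mean, the person and task effects each sum to zero, which is what makes the variance partition in Equation 4 additive.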
Indices of score consistency. Indices of score consistency in G-theory are tailored to whether scores are used for norm- or criterion-referenced decisions. With norm referencing (e.g., rank ordering), decisions are focused on relative differences in the characteristic of interest. For example, I might want to know where I fall among my peers in extraversion. With criterion referencing, decisions are based on absolute levels of scores. Here, I might be more interested in determining whether my extraversion score is
Table 1
G-Theory ANOVA Models, Partitioning, and Score Consistency Indices for One-Facet Designs

One facet: persons × tasks
  Model(a): Ypt = μ + (μp − μ) + (μt − μ) + (Ypt − μp − μt + μ)
    Score of a person on a task = mean across persons and tasks + person effect + task effect + person × task interaction and other error
  Partitioning of variance
    Individual score: σ²Ypt = σ²p + σ²t + σ²pt,e
    Mean score: σ²YpT = σ²p + σ²pt,e/n′t
  Error variances
    Relative: σ²pt,e/n′t
    Absolute: σ²pt,e/n′t + σ²t/n′t
  Coefficients
    G-coefficient: σ²p / (σ²p + σ²pt,e/n′t)
    Global D-coefficient: σ²p / (σ²p + σ²pt,e/n′t + σ²t/n′t)
  Standard error of measurement
    Relative: √(σ²pt,e/n′t)
    Absolute: √(σ²pt,e/n′t + σ²t/n′t)

One facet: persons × occasions
  Model: Ypo = μ + (μp − μ) + (μo − μ) + (Ypo − μp − μo + μ)
    Score of a person on a given occasion = mean across persons and occasions + person effect + occasion effect + person × occasion interaction and other error
  Partitioning of variance
    Individual score: σ²Ypo = σ²p + σ²o + σ²po,e
    Mean score: σ²YpO = σ²p + σ²po,e/n′o
  Error variances
    Relative: σ²po,e/n′o
    Absolute: σ²po,e/n′o + σ²o/n′o
  Coefficients
    G-coefficient: σ²p / (σ²p + σ²po,e/n′o)
    Global D-coefficient: σ²p / (σ²p + σ²po,e/n′o + σ²o/n′o)
  Standard error of measurement
    Relative: √(σ²po,e/n′o)
    Absolute: √(σ²po,e/n′o + σ²o/n′o)

Note. Primes are used with ns in all G-theory based formulas to allow for changes in numbers of replicates within different contexts.
(a) Tasks represent items, splits, or forms in illustrations used throughout this article.
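The quantities in Table 1 are all functions of estimated variance components, which for a crossed persons × tasks design can be obtained from ANOVA mean squares. The sketch below is an illustrative implementation of ours (the function name and simulated data are assumptions, not the authors' software); it estimates the p × t components and forms the G-coefficient, which for a persons × items design coincides with Cronbach's alpha (Hoyt's ANOVA formulation):

```python
import numpy as np

def g_study_pt(Y):
    """Estimate p x t variance components from a persons x tasks score matrix
    using expected mean squares (function name is ours, for illustration)."""
    n_p, n_t = Y.shape
    grand = Y.mean()
    ms_p = n_t * np.sum((Y.mean(axis=1) - grand) ** 2) / (n_p - 1)
    ms_t = n_p * np.sum((Y.mean(axis=0) - grand) ** 2) / (n_t - 1)
    resid = Y - Y.mean(axis=1, keepdims=True) - Y.mean(axis=0, keepdims=True) + grand
    ms_res = np.sum(resid ** 2) / ((n_p - 1) * (n_t - 1))
    return {"p": (ms_p - ms_res) / n_t,     # universe-score variance
            "t": (ms_t - ms_res) / n_p,     # task variance
            "pt,e": ms_res}                 # residual (relative-error) variance

# Simulated persons x items data: a person effect plus item-specific noise
rng = np.random.default_rng(1)
Y = rng.normal(size=(500, 8)) + rng.normal(size=(500, 1))

vc = g_study_pt(Y)
n_t = Y.shape[1]
g_coef = vc["p"] / (vc["p"] + vc["pt,e"] / n_t)        # Table 1 G-coefficient

# Cross-check: for a persons x items design this equals Cronbach's alpha
k = n_t
alpha = k / (k - 1) * (1 - Y.var(axis=0, ddof=1).sum() / Y.sum(axis=1).var(ddof=1))
print(round(g_coef, 3), round(alpha, 3))
```

The two printed values agree exactly, which mirrors the article's point that CTT reliability coefficients are special cases of G-theory indices.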
high enough to qualify me to be hired as a sales representative. For norm-referenced decisions, indices of score consistency will only include variance components that influence relative differences in scores, whereas those for criterion-referenced decisions will include components that reflect both relative and absolute differences, as shown in Equations 6 and 7:
Generalizability (or G-) coefficient = σ²p / (σ²p + σ²pt,e/n′t)
  = Universe-score Variance / (Universe-score Variance + Relative-error Variance); (6)

Dependability (or D-) coefficient = σ²p / (σ²p + σ²pt,e/n′t + σ²t/n′t)
  = Universe-score Variance / (Universe-score Variance + Absolute-error Variance). (7)
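Equations 6 and 7 compute directly from variance components. A minimal sketch with hypothetical component values (ours, not taken from the article):

```python
# Hypothetical p x T variance components (illustration only, not from the article)
var_p, var_t, var_pte = 0.60, 0.05, 0.35
n_t = 10                                   # number of tasks averaged over

rel_error = var_pte / n_t                  # relative-error variance
abs_error = var_pte / n_t + var_t / n_t    # absolute-error variance

g_coef = var_p / (var_p + rel_error)       # Equation 6
d_coef = var_p / (var_p + abs_error)       # Equation 7; never exceeds g_coef
print(round(g_coef, 3), round(d_coef, 3))
```

Because the absolute-error variance adds the nonnegative term σ²t/n′t, the D-coefficient can never exceed the G-coefficient, matching the discussion of Equations 6 and 7.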
Equations 6 and 7 represent indices of generalizability and dependability relevant to norm- and criterion-referenced decisions, respectively. In other treatments of G-theory (e.g., Brennan, 2001a), the symbol Eρ² is often used to denote a G-coefficient, and Φ is used to denote a D-coefficient. In Equation 7, σ²t is included as part of the D-coefficient because the nature of behaviors reflected by tasks could affect the absolute magnitude of scores. Note from Equations 6 and 7 that absolute-error variance will always be greater than or equal to relative-error variance. They will be equal only when all task means are the same (i.e., σ²t = 0), thereby reflecting equal levels of the behaviors measured (endorsement, difficulty, etc.). We will refer to D-coefficients like those in Equation 7 as global D-coefficients to distinguish them from the cut-score specific D-coefficients we consider next. Global D-coefficients provide overall estimates of consistency
accounting for differences in rank order of scores as well as absolute differences in levels of scores. However, in practice, decisions based on absolute levels of scores are typically targeted to specific cut-points. In such cases, cut-score specific dependability indices are of greater interest. Equation 8 represents the general formula for these coefficients within the p × T design:
Cut-score specific D-coefficient = (σ²p + [μY − C]²) / (σ²p + [μY − C]² + σ²pt,e/n′t + σ²t/n′t)
  = (Universe-score Variance + [Mean − Cut-score]²) / (Universe-score Variance + [Mean − Cut-score]² + Absolute-error Variance). (8)
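Equation 8 can likewise be expressed as a small function. In this sketch (the function name and numeric values are illustrative assumptions), the cut-score specific D-coefficient equals the global D-coefficient when the cut-score sits at the scale mean and exceeds it as the cut-score moves away:

```python
def cut_score_d(var_p, abs_error_var, mean, cut):
    """Cut-score specific D-coefficient, Equation 8 (name and values illustrative)."""
    offset = (mean - cut) ** 2
    return (var_p + offset) / (var_p + offset + abs_error_var)

var_p, abs_err = 0.60, 0.04                    # hypothetical components
global_d = var_p / (var_p + abs_err)           # Equation 7 counterpart

d_at_mean = cut_score_d(var_p, abs_err, mean=3.0, cut=3.0)  # equals global D
d_away = cut_score_d(var_p, abs_err, mean=3.0, cut=4.0)     # exceeds global D
print(round(d_at_mean, 3), round(d_away, 3))
```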
Equation 8 shows that a cut-score specific D-coefficient equals its corresponding global counterpart when the cut-score (C) is at the scale mean (i.e., μY − C = 0), but exceeds that value in other instances. Conceptually, cut-score specific D-coefficients quantify the extent to which an observed score reflects whether an individual is truly above or below the targeted cut-score. G- and global D-coefficients in G-theory, as well as alpha,
split-half, parallel-form, and test–retest coefficients in CTT, provide summary indices of score consistency on a standardized 0 to 1 metric. However, each has the drawback of not being on the scale likely used for decision making (Cronbach, Linn, Brennan, & Haertel, 1997; Cronbach & Shavelson, 2004). In CTT, this drawback can be addressed by transforming a reliability coefficient to
the observed score scale using Equation 9 to derive the standard error of measurement (SEM), which represents the standard deviation of differences between observed and true scores:
SEMCTT = σY √(1 − Reliability Coefficient). (9)
Similarly, standard error indices can be derived from G-theory for making relative and absolute decisions by taking the square roots of relative- and absolute-error variances, as shown in Equations 10 and 11:
SEMG-theory, relative = √(σ²pt,e/n′t) = √(Relative-error Variance); (10)

SEMG-theory, absolute = √(σ²pt,e/n′t + σ²t/n′t) = √(Absolute-error Variance). (11)
If tasks are represented by items or splits, the SEMs from Equations 10 and 11 would need to be multiplied by the number of items (n′i) or splits (n′s) to reference them to the total score metric. Although not discussed here, conditional standard error and associated indices also can be derived for particular points on a score scale using either CTT or G-theory (see, e.g., Brennan, 1998, 2001a; Feldt, 1984; Jarjoura, 1986; Lord, 1957, 1965, 1980; Thorndike, 1951; Vispoel & Tao, 2013; Woodruff, 1991; Woodruff, Traynor, Cui, & Fang, 2013, for further information about how to derive them and the complexities often involved).
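The SEM formulas in Equations 9 through 11 translate directly into code. The sketch below uses hypothetical values of our own and rescales the G-theory SEMs from the mean-score metric to the total-score metric by multiplying by the number of items, as described above:

```python
import math

# Hypothetical values (ours, not the article's) for a 10-item p x T design
sigma_y, reliability = 5.0, 0.90      # total-score SD and reliability for Eq. 9
var_pte, var_t = 0.35, 0.05           # item-metric variance components
n_items = 10

sem_ctt = sigma_y * math.sqrt(1 - reliability)                  # Equation 9

sem_relative = math.sqrt(var_pte / n_items)                     # Equation 10
sem_absolute = math.sqrt(var_pte / n_items + var_t / n_items)   # Equation 11

# Multiply by the number of items to reference the SEMs to the total-score metric
sem_relative_total = n_items * sem_relative
sem_absolute_total = n_items * sem_absolute
print(round(sem_ctt, 3), round(sem_relative_total, 3), round(sem_absolute_total, 3))
```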
Two-Facet Designs
Partitioning of scores. The main problem with reliability indices for the present single-facet designs is that they fail to account for and separate the three primary sources of measurement error typically affecting scores from measures of individual differences: random-response, specific-factor, and transient. Random-response error reflects “noise” affecting scores within a particular occasion of administration resulting from moment-to-moment fluctuations in effort, mood, attention, memory, and other factors. Specific-factor error represents consistent responding to particular tasks unrelated to the construct(s) being measured. Transient error refers to stable factors that affect scores within a particular occasion (fatigue, illness, motivation, etc.) but not across occasions. Unless each source of measurement error is properly taken into account, reliability will likely be overestimated. Reliability indices from single-facet, persons × Tasks (p × T) designs from G-theory and single-occasion alpha, split-half, and parallel-form coefficients from CTT include random-response and specific-factor error, but treat transient error as universe/true-score variance. In contrast, reliability indices for single-facet, persons × Occasions (p × O) designs from G-theory and test–retest coefficients from CTT include random-response and transient error, but treat specific-factor error as universe/true-score variance.

Disentangling these sources of measurement error requires replications of both tasks and occasions that could be represented in a G-theory, persons × tasks × occasions (p × t × o) design (see Table 2). Within this general design, the task facet again could be items (i), splits (s), or forms (f). As before, the design entails
repeated measures but with each person now having as many scores as the product of numbers of tasks and occasions sampled (n′t × n′o). Equation 12 shows the partitioning of an observed score Ypto within a p × t × o, random-effects ANOVA model. Note that a given Ypto score is a linear composite of a grand mean and effects for person, each measurement facet of interest, and all combinations of person and measurement facets:
Ypto = μ + (μp − μ) + (μt − μ) + (μo − μ) + (μto − μt − μo + μ) + (μpt − μp − μt + μ) + (μpo − μp − μo + μ) + (Ypto − μpt − μpo − μto + μp + μt + μo − μ).

Score of a person on a task on one occasion = mean across persons, tasks, and occasions + person effect (p) + task effect (t) + occasion effect (o) + task × occasion interaction (t × o) + person × task interaction (p × t) + person × occasion interaction (p × o) + person × task × occasion interaction and other error (p × t × o, residual, error, pto, or pto,e). (12)
In Equation 12, Ypto represents a score for a particular person, task, and occasion. The grand mean (μ) is a constant that equals the mean Y score aggregated across all persons, tasks, and occasions. The universe score (μp) represents a person’s expected long-run average observed score over all combinations of tasks and occasions. The symbol μt is the mean for a particular task aggregated across persons and occasions; μo is the mean for a particular occasion aggregated across persons and tasks; and μp − μ, μt − μ, and μo − μ represent main effects for person, task, and occasion, respectively. The remaining components in the model reflect all possible two- and three-way interactions involving persons and the measurement facets of interest. A task × occasion (t × o) interaction effect would indicate that
differences in task means vary by occasion. A person × task (p × t) interaction effect would reveal that differences in task means vary from person to person. These idiosyncratic task differences in scores signal the presence of specific-factor error. A person × occasion (p × o) interaction effect would indicate that differences in occasion means vary from person to person. These person-specific occasion differences reflect the presence of transient error. The person × task × occasion interaction (p × t × o) represents what remains after all other main and interaction effects are subtracted from Ypto. This term, typically labeled as residual, error, pto, or pto,e, is treated as random-response error and includes the three-way interaction and other sources of error unaccounted for in the model. As was the case with the single-facet designs, the variance of
individual scores in this two-facet design will be a composite of variance components for all main and interaction effects, as shown in Equation 13:
σ²Ypto = σ²p + σ²t + σ²o + σ²to + σ²pt + σ²po + σ²pto,e. (13)
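The two-facet variance partitioning just described can be estimated from a three-dimensional score array using expected mean squares. The sketch below is our own illustrative implementation (the function name, array layout, and effect values are assumptions, not the authors' software); it uses purely additive demonstration data, so every interaction component should come out at zero and the person component should equal the sample variance of the person effects:

```python
import numpy as np

def g_study_pto(Y):
    """Estimate variance components for a fully crossed p x t x o random-effects
    design (one observation per cell) via expected mean squares. Y has shape
    (persons, tasks, occasions); function name and layout are illustrative."""
    n_p, n_t, n_o = Y.shape
    g = Y.mean()
    m_p = Y.mean(axis=(1, 2)); m_t = Y.mean(axis=(0, 2)); m_o = Y.mean(axis=(0, 1))
    m_pt = Y.mean(axis=2); m_po = Y.mean(axis=1); m_to = Y.mean(axis=0)

    ms_p = n_t * n_o * np.sum((m_p - g) ** 2) / (n_p - 1)
    ms_t = n_p * n_o * np.sum((m_t - g) ** 2) / (n_t - 1)
    ms_o = n_p * n_t * np.sum((m_o - g) ** 2) / (n_o - 1)
    ms_pt = n_o * np.sum((m_pt - m_p[:, None] - m_t[None, :] + g) ** 2) / ((n_p - 1) * (n_t - 1))
    ms_po = n_t * np.sum((m_po - m_p[:, None] - m_o[None, :] + g) ** 2) / ((n_p - 1) * (n_o - 1))
    ms_to = n_p * np.sum((m_to - m_t[:, None] - m_o[None, :] + g) ** 2) / ((n_t - 1) * (n_o - 1))
    resid = (Y - m_pt[:, :, None] - m_po[:, None, :] - m_to[None, :, :]
             + m_p[:, None, None] + m_t[None, :, None] + m_o[None, None, :] - g)
    ms_pto = np.sum(resid ** 2) / ((n_p - 1) * (n_t - 1) * (n_o - 1))

    return {"pto,e": ms_pto,
            "pt": (ms_pt - ms_pto) / n_o,
            "po": (ms_po - ms_pto) / n_t,
            "to": (ms_to - ms_pto) / n_p,
            "p": (ms_p - ms_pt - ms_po + ms_pto) / (n_t * n_o),
            "t": (ms_t - ms_pt - ms_to + ms_pto) / (n_p * n_o),
            "o": (ms_o - ms_po - ms_to + ms_pto) / (n_p * n_t)}

# Purely additive demonstration data: person, task, and occasion effects only
p_eff = np.array([0.0, 1.0, 2.0, 3.0])
t_eff = np.array([0.0, 0.5, 1.0])
o_eff = np.array([0.0, -0.5])
Y = p_eff[:, None, None] + t_eff[None, :, None] + o_eff[None, None, :]

vc = g_study_pto(Y)
print(round(vc["p"], 4), round(vc["pto,e"], 4))
```

With real data, the p × t, p × o, and residual components would be nonzero and would correspond to specific-factor, transient, and random-response error, respectively, as described above.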