FOCUS – ADDRESSING BIAS AND RACISM OF BLACK WOMEN IN LEADERSHIP AND EMPLOYMENT
Use the Attached Template and structure an annotated bibliography, in APA 7th edition format, of the Article attached
YOU CAN ONLY COMPILE INFORMATION FROM THE ARTICLE ATTACHED!!!!
250 words
EXAMPLE OF ANNOTATED BIBLIOGRAPHY
Example Reference Format
Baker, V. L., & Pifer, M. J. (2011). The role of relationships in the transition from doctoral student to independent scholar. Studies in Continuing Education, 33(1), 5–17. https://doi.org/10.1080/0158037X.2010.515569
Provide a reference and an annotation (150-250 words) that includes important details about the article for each of the sources.
Annotations are descriptive and critical assessments of literature that help researchers evaluate texts and determine relevancy in relation to a research project. Ultimately, it is a note-taking tool that fosters critical thinking and helps you evaluate the source material for possible later use. Instead of reading articles and forgetting what you have read, you have a convenient document full of helpful information. An annotated bibliography can help you see the bigger picture of the literature you are reading. It can help you visualize the overall status of the topic, as well as where your unique question might fit into the field of literature.
AT THE END OF THE ANNOTATED BIBLIOGRAPHY EXPLAIN WHY THIS ARTICLE IS RELEVANT TO THE STRUGGLES OF BLACK WOMEN IN LEADERSHIP POSITIONS IN AMERICA
RESEARCH ARTICLE
Avoiding bias when inferring race using name-based approaches
Diego Kozlowski1*, Dakota S. Murray2, Alexis Bell3, Will Hulsey3, Vincent Larivière4, Thema Monroe-White3, Cassidy R. Sugimoto5
1 DRIVEN DTU, Faculté des Sciences, de la Technologie et de la Médecine, University of Luxembourg,
Esch-sur-Alzette, Luxembourg, 2 School of Informatics, Computing, and Engineering, Indiana University
Bloomington, Bloomington, Indiana, United States of America, 3 Campbell School of Business, Berry
College, Mt Berry, Georgia, United States of America, 4 École de bibliothéconomie et des sciences de
l’information, Université de Montréal, Montréal, Québec, Canada, 5 School of Public Policy, Georgia Institute
of Technology, Atlanta, Georgia, United States of America
Abstract
Racial disparity in academia is a widely acknowledged problem. The quantitative understanding of racial-based systemic inequalities is an important step towards a more equitable research system. However, because of the lack of robust information on authors' race, few large-scale analyses have been performed on this topic. Algorithmic approaches offer one solution, using known information about authors, such as their names, to infer their perceived race. As with any other algorithm, the process of racial inference can generate biases if it is not carefully considered. The goal of this article is to assess the extent to which algorithmic bias is introduced using different approaches for name-based racial inference. We use information from the U.S. Census and mortgage applications to infer the race of U.S.-affiliated authors in the Web of Science. We estimate the effects of using given and family names, thresholds or continuous distributions, and imputation. Our results demonstrate that the validity of name-based inference varies by race/ethnicity and that threshold approaches underestimate Black authors and overestimate White authors. We conclude with recommendations to avoid potential biases. This article lays the foundation for more systematic and less-biased investigations into racial disparities in science.
Introduction
The use of racial categories in the quantitative study of science dates back so far that it intertwines with the controversial origins of statistical analysis itself [1,2]. However, while Galton and the eugenics movement reinforced the racial stratification of society, racial categories have also been used to acknowledge and mitigate racial discrimination. As Zuberi [3] explains: "The racialization of data is an artifact of both the struggles to preserve and to destroy racial stratification." This places the use of race as a statistical category in a precarious position, one that both reinforces the social processes that segregate and disempower parts of the population, while simultaneously providing an empirical basis for understanding and mitigating inequities.
PLOS ONE
PLOS ONE | https://doi.org/10.1371/journal.pone.0264270 March 1, 2022 1 / 16
OPEN ACCESS
Citation: Kozlowski D, Murray DS, Bell A, Hulsey W, Larivière V, Monroe-White T, et al. (2022) Avoiding bias when inferring race using name-based approaches. PLoS ONE 17(3): e0264270. https://doi.org/10.1371/journal.pone.0264270
Editor: Lutz Bornmann, Max Planck Society,
GERMANY
Received: October 15, 2021
Accepted: February 7, 2022
Published: March 1, 2022
Peer Review History: PLOS recognizes the
benefits of transparency in the peer review
process; therefore, we enable the publication of
all of the content of peer review and author
responses alongside final, published articles. The
editorial history of this article is available here:
https://doi.org/10.1371/journal.pone.0264270
Copyright: © 2022 Kozlowski et al. This is an open access article distributed under the terms of the
Creative Commons Attribution License, which
permits unrestricted use, distribution, and
reproduction in any medium, provided the original
author and source are credited.
Data Availability Statement: The data used for this article are available at https://sciencebias.uni.lu/app/ and https://github.com/DiegoKoz/intersectional_inequalities.
Science is not immune from these inequities [4–7]. Early research on racial disparities in scientific publishing relied primarily on self-reported data in surveys [8], geocoding [9], and directories [10]. However, there is increasing use of large-scale inference of race based on names [11], similar to the approaches used for gender disambiguation [12]. Algorithms, however, are known to encode human biases [13,14]: there is no such thing as algorithmic neutrality. The automatic inference of authors' race based on their features in bibliographic databases is itself an algorithmic process that needs to be scrutinized, as it could implicitly encode bias, with major impact on the over- and under-representation of racial groups.
In this study, we use the self-declared race/ethnicity from the 2010 U.S. Census and mortgage applications as the basis for inferring race from author names on scientific publications indexed in the Web of Science database. Bibliometric databases do not include self-declared race by authors, as they are based on the information provided in publications, such as given and family names. Given that the U.S. Census provides the proportion of self-declared race by family name, this information can be used to infer U.S. authors' race from their family names. Name-based racial inference has been used in several articles. Many studies assigned a single category given the family or given name [15–19]. Other studies used the aggregated probabilities associated with a name, instead of a single label [20]. In this research, we assess the biases incurred when using a single label, i.e., thresholding. The main goal of this research is to define the least biased algorithm to predict a racial category given a name. We present several different approaches for inferring race and examine the bias generated in each case. The goal of the research is to provide an empirical critique of name-based race inference and recommendations for approaches that minimize bias. Even if perfect inference is not achievable, the conclusions that arise from this study will allow researchers to conduct more careful analyses of racial and ethnic disparities in science. Although the categories analysed are only valid in the U.S. context, the general recommendations can be extended to any other country in which the Census (or a similar data collection mechanism) includes self-reported race.
Racial categories in the U.S. Census
The U.S. Census is a rich and long-running dataset, but also deeply flawed and criticized. Currently it is a decennial count of all U.S. residents, both citizens and non-citizens, in which several characteristics of the population are gathered, including self-declared race/ethnicity. The classification of race in the U.S. Census is value-laden with the agendas and priorities of its creators, namely 18th-century White men whom Wilkerson [21] refers to as "the dominant caste." The first U.S. Census was conducted in 1790 and founded on the principles of racial stratification and White superiority. Categories included: "Free White males of 16 years and upward," "Free White males under 16 years," "Free White females," "All other free persons," and "Slaves" [22]. At that time, each member of a household was classified into one of these five categories based on the observation of the census-taker, such that an individual of "mixed white and other parentage" was classified into "All other free persons" in order to preserve the privileged status of "Free White. . ." To date, anyone classifying themselves as other than "non-Hispanic White" is considered a "minority." The shared ground across the centuries of census survey design and classification strata reflects the sustained prioritization of the White male caste [3,23].
Today, self-identification is used to assign individuals to their respective race/ethnicity classifications [24], per the U.S. Office of Management and Budget (OMB) guidelines. However, the concept of race and/or ethnicity remains poorly understood. For example, in 2000 the category "Some other race" was the third largest racial group, consisting primarily of individuals who in 2010 identified as Hispanic or Latino (which according to the 2010 census definition refers to a person of Cuban, Mexican, Puerto Rican, South or Central American, or other Spanish culture or origin, regardless of race). Instructions and questions which facilitated the distinction between race and ethnicity began with the 2010 census, which stated that "[f]or this census, Hispanic origins are not races"; to date, in the U.S. federal statistical system, Hispanic origin is considered a separate concept from race. However, this did not preclude individuals from self-identifying their race as "Latino," "Mexican," "Puerto Rican," "Salvadoran," or other national origins or ethnicities [25]. Furthermore, 6.1% of the U.S. population changed their self-identification of both race and ethnicity between the 2000 and 2010 censuses [26], demonstrating the dynamicity of the classification. The inclusion of certain categories has also been the focus of considerable political debate. For example, the inclusion of citizenship generated significant debates in the preparation of the 2020 Census, as it may have generated a larger nonresponse rate from the Hispanic community [27]. For this article, we attempt to represent the fullest extent of potential U.S.-affiliated authors; thereby, we consider both citizens and non-citizens.

Funding: VL acknowledges funding from the Canada Research Chairs program, https://www.chairs-chaires.gc.ca/ (grant # 950-231768). DK acknowledges funding from the Luxembourg National Research Fund, https://www.fnr.lu/, under the PRIDE program (PRIDE17/12252781). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.
The social function of the concept of race (i.e., the building of racialized groups) underpins its definition more than any physical traits of the population. For example, "Hispanic" as a category arises from this conceptualization, even though in the 2010 U.S. Census the question about Hispanic origin is different from the one on self-perceived race. While Hispanic origin does not relate to any physical attribute, it is still considered a socially racialized group, and this is also how the aggregated data are presented by the Census Bureau. Therefore, in this paper, we will use the term race to refer to these social constructions, acknowledging the complex relation between conceptions of race and ethnicity. Even more importantly, this conceptualization of race also determines what can be done with the results of the proposed models. Given that race is a social construct, inferred racial categories should only be used in the study of group-level social dynamics underlying these categories, and not as individual-level traits. Census classifications are founded upon the social construction of race and the reality of racism in the U.S., which serves as "a multi-level and multi-dimensional system of dominant group oppression that scapegoats the race and/or ethnicity of one or more subordinate groups" [28]. Self-identification of racial categories continues to reflect broader definitional challenges, along with issues of interpretation, and above all the amorphous power dynamics surrounding race, politics, and science in the U.S. In this study, we are keenly aware of these challenges, and our operationalization of race categories is shaped in part by these tensions.
Data
This project uses several data sources to test the different approaches for race inference based on the author's name. First, to test the interaction between given- and family-name distributions, we simulate a dataset that covers most of the possible combinations. Using a Dirichlet process [29], we randomly generate 500 multinomial distributions that simulate those of given names, and another 500 random multinomial distributions that simulate those of family names. We then build a grid of all possible combinations of the given- and family-name random distributions (250,000 combinations). This randomly generated data is only used to determine the best combination of the probability distributions of given and family names for inferring race.
In addition to the simulation, we use two datasets with real given and family names and an
assigned probability for each racial group. The data from the given names is from Tzioumis
[30], who builds a list of 4,250 given names based on mortgage applications, with self-reported
race. Family name data is based on the 2010 U.S. Census [31], which includes all family names
with more than 100 appearances in the census, with a total of 162,253 surnames that covers
PLOS ONE Avoiding bias when inferring race using name-based approaches
PLOS ONE | https://doi.org/10.1371/journal.pone.0264270 March 1, 2022 3 / 16
more than 90% of the population. For confidentiality, this list removes counts for those racial
categories with fewer than five cases, as it would be possible to exactly identify individuals and
their self-reported race. In those cases, we replace with zero and renormalize. As explained
previously, changes were introduced in the 2010 U.S. Census racial categories. Questions now
include both racial and ethnic origin, placing "Hispanic" outside the racial categories. Even if
now “Hispanic” is not considered a racial category, but an ethnic origin that can occur in com-
bination with other racial categories (e.g., Black, White or Asian Hispanic), the information
about names and racial groups merge both questions into a single categorization. Therefore,
the racial categories used in this research includes “Hispanic” as a category, and all other racial
categories excluding people with Hispanic origin. The category "White" becomes "Non-His-
panic White Alone", and "Black or African American" becomes "Non-Hispanic Black or Afri-
can American Alone", and so on. The final categories used in both datasets are:
• Non-Hispanic White Alone (White)
• Non-Hispanic Black or African American Alone (Black)
• Non-Hispanic Asian and Native Hawaiian and Other Pacific Islander Alone (Asian)
• Non-Hispanic American Indian and Alaska Native Alone (AIAN)
• Non-Hispanic Two or More Races (Two or more)
• Hispanic or Latino origin (Hispanic)
We test these data on the Web of Science (WoS) to study how name-based racial inference
performs on the population of U.S. first authors. WoS did not regularly provide first names in
articles before 2008, nor did it provide links between authors and their institutional addresses;
therefore, the data includes all articles published between 2008 and 2019. Given that links
between authors and institutions are sometimes missing or incorrect, we restricted the analysis
to first authors to ensure that our analysis solely focused on U.S. authors. This results in
5,431,451 articles, 1,609,107 distinct U.S. first authors in WoS, 152,835 distinct given names
and 288,663 distinct family names for first authors. Given that in this database, ‘AIAN’ and
‘Two or more’ account for only 0.69% and 1.76% of authors respectively, we remove these and
renormalize the distribution with the remaining categories. Therefore, in what follows we will
refer exclusively to categories Asian, Black, Hispanic, and White.
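The renormalization step after dropping the rare categories can be sketched as follows; the probability values are hypothetical, and only the category names come from the article.

```python
import numpy as np

# Hypothetical name-level distribution over the six census-derived categories
cats = ["Asian", "Black", "Hispanic", "White", "AIAN", "Two or more"]
p = np.array([0.10, 0.05, 0.08, 0.70, 0.02, 0.05])

# Drop 'AIAN' and 'Two or more' and renormalize the remaining mass to 1
keep = np.array([c not in ("AIAN", "Two or more") for c in cats])
q = p[keep] / p[keep].sum()
print(q)  # distribution over Asian, Black, Hispanic, White
```

The same division-by-remaining-mass also applies to the confidentiality-suppressed counts replaced with zero in the census surname list.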
Methods
Manual validation
The data is presented as a series of distributions of names across races (Table 1). In name-based inference methods, it is not uncommon to use a threshold to create a categorical distinction: e.g., using a 90% threshold, one would assume that all instances of Juan as a given name should be categorized as Hispanic and all instances of Washington as a family name should be categorized as Black. In such a situation, any name not reaching this threshold would be excluded (e.g., those with the family name "Lee" would be removed from the analysis). This approach, however, assumes that the distinctiveness of names across races does not significantly differ.
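A minimal sketch of such a threshold rule, using the name distributions reproduced from Table 1 of the article (the function name and structure are illustrative, not the authors' code):

```python
def infer_race_threshold(dist, threshold=0.9):
    """Return the category whose probability meets the threshold,
    or None if no category is distinctive enough (author excluded)."""
    best = max(dist, key=dist.get)
    return best if dist[best] >= threshold else None

# Name distributions reproduced from Table 1
juan = {"Asian": 0.015, "Black": 0.005, "Hispanic": 0.934, "White": 0.045}
lee = {"Asian": 0.438, "Black": 0.169, "Hispanic": 0.020, "White": 0.373}

print(infer_race_threshold(juan))  # -> Hispanic (93.4% passes the threshold)
print(infer_race_threshold(lee))   # -> None ("Lee" is excluded entirely)
```

The exclusion of names like "Lee" is exactly the behavior whose race-dependent error the manual validation below probes.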
To test this, we began our analysis by manually validating name-based inference at three threshold ranges: 70–79%, 80–89%, and 90–100%. We sampled 300 authors from the WoS database, 25 randomly sampled for every combination of racial category and inference threshold. Two coders manually queried a search engine for the name and affiliation of each author and attempted to infer a perceived racial category through visual inspection of their professional photos and information listed on their websites and CVs (e.g., affiliation with racialized organizations such as Omega Psi Phi Fraternity, Inc., SACNAS, etc.).

Fig 1 shows the number of valid and invalid inferences, as well as those for whom a category could not be manually identified, and those for whom no information was found. Name-based inference of Asian authors was found to be highly valid at every considered threshold. The inference of Black authors, in contrast, produced many invalid or uncertain classifications at the 70–80% threshold, but had higher validity at the 90% threshold. Similarly, inferring Hispanic authors was only accurate above the 80% threshold. Inference of White authors was highly valid at all thresholds but improved above 90%. This suggests that a simple threshold-based approach does not perform equally well across all racial categories. We therefore consider an alternative weighting-based scheme that does not provide an exclusive categorization but uses the full information of the distribution.
Weighting scheme
We assess three strategies for inferring race from an author's name using a combination of their given and family name distributions across racial categories (Table 1). The first two aim at building a new distribution as a weighted average of both the given and family name racial distributions, and the third uses both distributions sequentially. In this section we explain these three approaches and compare them to alternatives that use only given or only family name racial distributions.

Table 1. Sample of family names (U.S. Census) and given names (mortgage data).

Type    Name        Asian   Black   Hispanic   White   Count
Given   Juan         1.5%    0.5%     93.4%     4.5%       4,019
Given   Doris        3.4%   13.5%      6.3%    76.7%       1,332
Given   Andy        38.8%    1.6%      6.4%    53.2%         555
Family  Rodriguez    0.6%    0.5%     94.1%     4.8%   1,094,924
Family  Lee         43.8%   16.9%      2.0%    37.3%     693,023
Family  Washington   0.3%   91.6%      2.7%     5.4%     177,386

https://doi.org/10.1371/journal.pone.0264270.t001

Fig 1. Manual validation of racial categories.
https://doi.org/10.1371/journal.pone.0264270.g001
The weighting scheme should account for the intuition that if the given (family) name is highly informative while the family (given) name is not, the resulting average distribution should prioritize the information in the given (family) name distribution. For example, 94% of people with Rodriguez as a family name identify themselves as Hispanic, whereas 39% of the people with the given name Andy identify as Asian, and 53% as White (see Table 1). For an author called Andy Rodriguez, we would like to build a distribution that encodes the informativeness of their family name, Rodriguez, rather than the relatively uninformative given name, Andy. The first weighting scheme proposed is based on the standard deviation of the distribution:

SD = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}

where x_i is in this case the probability associated with category i, and n is the total number of categories. With four racial categories, the standard deviation moves between 0, for perfect uniformity, and 0.5 when one category has a probability of 1. The second weighting scheme is based on entropy, a measure designed to capture the informativeness of a distribution:

Entropy = -\sum_{i=1}^{n} P(x_i)\log P(x_i)

Using these, we propose the following weight for both given and family names:

x_{weight} = \frac{f(x)^{exp}}{f(x)^{exp} + f(y)^{exp}}

with x and y as the given (family) and family (given) names respectively, f the weighting function (standard deviation or entropy), and exp the exponent applied to the function, a tuneable parameter. For the standard deviation, using the square function means we use the variance of the distribution. In general, the higher exp is set, the more skewed the weighting is towards the most informative name distribution. In the extreme, it would be possible to use an indicator function to simply choose the most skewed of the two distributions, but this approach would not use the information from both distributions. For this reason, we decided to experiment with exp ∈ {1, 2}, which implies a trade-off between selecting the most informative of the two distributions and using all available information.
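The standard-deviation variant of this scheme can be sketched as follows, using the Andy Rodriguez example from Table 1. This is an illustrative sketch, not the authors' code; note the entropy variant would need its ordering inverted (lower entropy means more informative), which is not handled here.

```python
import numpy as np

def combined_distribution(given, family, exp=2):
    """Weighted average of given- and family-name race distributions,
    weighting the more informative (higher-spread) name more heavily.
    Uses the sample standard deviation as the informativeness function f."""
    given, family = np.asarray(given), np.asarray(family)
    fg = given.std(ddof=1) ** exp
    ff = family.std(ddof=1) ** exp
    w = fg / (fg + ff)  # weight assigned to the given name
    return w * given + (1 - w) * family

# "Andy Rodriguez" with Table 1 values, ordered Asian, Black, Hispanic, White
andy = [0.388, 0.016, 0.064, 0.532]
rodriguez = [0.006, 0.005, 0.941, 0.048]
print(combined_distribution(andy, rodriguez))
```

Because Rodriguez is far more skewed than Andy, the combined distribution leans heavily towards Hispanic, matching the intuition described above.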
Fig 2 shows the weighting of the simulated given and family names based on their informativeness, for different values of the exponent. The horizontal and vertical axes show the highest value in the given- and family-name distribution, respectively; a higher value on either axis corresponds to a more informative given/family name. The color shows how much weight is assigned to given names. When the exponent is set to two, both the entropy- and standard deviation-based models skew towards the most informative feature, a desirable property. Compared to other models, the variance gives the most extreme values to cases where only one name is informative, whereas the entropy-based model is the most uniform.
Information retrieval
The above weighting schemes result in a single probability distribution of an author belonging to each of the racial categories, from which a race can be inferred. One strategy for inferring race from this distribution is to select the racial category above a certain threshold, if any. A second strategy is to use the full distribution to weight the author across different racial categories, rather than assigning any specific category. We also consider a third strategy, which sequentially uses family and then given names to infer race.

We first retrieve all authors who have a family name with a probability of belonging to a specific racial group greater than a given threshold. This retrieves N authors. Second, we retrieve the same number of authors as in the first step, N, using their given names. Finally, we merge the authors from both steps, removing duplicates who had both given and family names above the set threshold. This process results in between N and 2N authors. There are several natural variations on this two-step method. For example, a percentage threshold
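The two-step retrieval described above can be sketched as follows. This is a hypothetical sketch: the `Author` structure is invented for illustration, and the article does not spell out how the second step selects its N authors, so top-N by given-name probability is assumed here.

```python
from dataclasses import dataclass

@dataclass
class Author:
    name: str
    given_dist: dict   # P(race | given name)
    family_dist: dict  # P(race | family name)

def two_step_retrieval(authors, category, threshold=0.9):
    # Step 1: authors whose family name passes the threshold (N authors)
    step1 = [a for a in authors if a.family_dist[category] >= threshold]
    n = len(step1)
    # Step 2: retrieve the same number N by given name (selection rule assumed)
    ranked = sorted(authors, key=lambda a: a.given_dist[category], reverse=True)
    step2 = ranked[:n]
    # Merge and deduplicate: yields between N and 2N distinct authors
    return {a.name for a in step1} | {a.name for a in step2}

authors = [
    Author("Juan Rodriguez", {"Hispanic": 0.934}, {"Hispanic": 0.941}),
    Author("Doris Lee", {"Hispanic": 0.063}, {"Hispanic": 0.020}),
]
print(two_step_retrieval(authors, "Hispanic"))
```

Here only one author passes the family-name threshold (N = 1), and the same author also tops the given-name ranking, so the merged set stays at N rather than 2N.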