FOCUS – ADDRESSING BIAS AND RACISM OF BLACK WOMEN IN LEADERSHIP AND EMPLOYMENT
Use the Attached Template and structure an annotated bibliography, in APA 7th edition format, of the Article attached
YOU CAN ONLY COMPILE INFORMATION FROM THE ARTICLE ATTACHED!!!!
250 words
EXAMPLE OF ANNOTATED BIBLIOGRAPHY
Example Reference Format
Baker, V. L., & Pifer, M. J. (2011). The role of relationships in the transition from doctoral student to independent scholar. Studies in Continuing Education, 33(1), 5–17. https://doi.org/10.1080/0158037X.2010.515569
Provide a reference and an annotation (150-250 words) that includes important details about the article for each of the sources.
Annotations are descriptive and critical assessments of literature that help researchers evaluate texts and determine relevancy in relation to a research project. Ultimately, it is a note-taking tool that fosters critical thinking and helps you evaluate the source material for possible later use. Instead of reading articles and forgetting what you have read, you have a convenient document full of helpful information. An annotated bibliography can help you see the bigger picture of the literature you are reading. It can help you visualize the overall status of the topic, as well as where your unique question might fit into the field of literature.
AT THE END OF THE ANNOTATED BIBLIOGRAPHY EXPLAIN WHY THIS ARTICLE IS RELEVANT TO THE STRUGGLES OF BLACK WOMEN IN LEADERSHIP POSITIONS IN AMERICA
RESEARCH ARTICLE
Avoiding bias when inferring race using name-based approaches
Diego Kozlowski1*, Dakota S. Murray2, Alexis Bell3, Will Hulsey3, Vincent Larivière4, Thema Monroe-White3, Cassidy R. Sugimoto5
1 DRIVEN DTU, Faculté des Sciences, de la Technologie et de la Médecine, University of Luxembourg,
Esch-sur-Alzette, Luxembourg, 2 School of Informatics, Computing, and Engineering, Indiana University
Bloomington, Bloomington, Indiana, United States of America, 3 Campbell School of Business, Berry
College, Mt Berry, Georgia, United States of America, 4 École de bibliothéconomie et des sciences de
l’information, Université de Montréal, Montréal, Québec, Canada, 5 School of Public Policy, Georgia Institute
of Technology, Atlanta, Georgia, United States of America
Abstract
Racial disparity in academia is a widely acknowledged problem. The quantitative understanding of racial-based systemic inequalities is an important step towards a more equitable research system. However, because of the lack of robust information on authors' race, few large-scale analyses have been performed on this topic. Algorithmic approaches offer one solution, using known information about authors, such as their names, to infer their perceived race. As with any other algorithm, the process of racial inference can generate biases if it is not carefully considered. The goal of this article is to assess the extent to which algorithmic bias is introduced using different approaches for name-based racial inference. We use information from the U.S. Census and mortgage applications to infer the race of U.S.-affiliated authors in the Web of Science. We estimate the effects of using given and family names, thresholds or continuous distributions, and imputation. Our results demonstrate that the validity of name-based inference varies by race/ethnicity and that threshold approaches underestimate Black authors and overestimate White authors. We conclude with recommendations to avoid potential biases. This article lays the foundation for more systematic and less-biased investigations into racial disparities in science.
Introduction
The use of racial categories in the quantitative study of science dates back so far that it intertwines with the controversial origins of statistical analysis itself [1,2]. However, while Galton and the eugenics movement reinforced the racial stratification of society, racial categories have also been used to acknowledge and mitigate racial discrimination. As Zuberi [3] explains: "The racialization of data is an artifact of both the struggles to preserve and to destroy racial stratification." This places the use of race as a statistical category in a precarious position, one that both reinforces the social processes that segregate and disempower parts of the population, while simultaneously providing an empirical basis for understanding and mitigating inequities.
PLOS ONE
PLOS ONE | https://doi.org/10.1371/journal.pone.0264270 March 1, 2022 1 / 16
OPEN ACCESS
Citation: Kozlowski D, Murray DS, Bell A, Hulsey W, Larivière V, Monroe-White T, et al. (2022) Avoiding bias when inferring race using name-based approaches. PLoS ONE 17(3): e0264270. https://doi.org/10.1371/journal.pone.0264270
Editor: Lutz Bornmann, Max Planck Society,
GERMANY
Received: October 15, 2021
Accepted: February 7, 2022
Published: March 1, 2022
Peer Review History: PLOS recognizes the
benefits of transparency in the peer review
process; therefore, we enable the publication of
all of the content of peer review and author
responses alongside final, published articles. The
editorial history of this article is available here:
https://doi.org/10.1371/journal.pone.0264270
Copyright: © 2022 Kozlowski et al. This is an open access article distributed under the terms of the
Creative Commons Attribution License, which
permits unrestricted use, distribution, and
reproduction in any medium, provided the original
author and source are credited.
Data Availability Statement: The data used for this article are available at https://sciencebias.uni.lu/app/ and https://github.com/DiegoKoz/intersectional_inequalities.
Science is not immune from these inequities [4–7]. Early research on racial disparities in scientific publishing relied primarily on self-reported data in surveys [8], geocoding [9], and directories [10]. However, there is increasing use of large-scale inference of race based on names [11], similar to the approaches used for gender disambiguation [12]. Algorithms, however, are known to encode human biases [13,14]: there is no such thing as algorithmic neutrality. The automatic inference of authors' race based on their features in bibliographic databases is itself an algorithmic process that needs to be scrutinized, as it could implicitly encode bias, with major impact on the over- and under-representation of racial groups.
In this study, we use the self-declared race/ethnicity from the 2010 U.S. Census and mortgage applications as the basis for inferring race from author names on scientific publications indexed in the Web of Science database. Bibliometric databases do not include self-declared race by authors, as they are based on the information provided in publications, such as given and family names. Given that the U.S. Census provides the proportion of self-declared race by family name, this information can be used to infer U.S. authors' race from their family names. Name-based racial inference has been used in several articles. Many studies assigned a single category given the family or given name [15–19]. Other studies used the aggregated probabilities associated with a name, instead of a single label [20]. In this research, we assess the biases incurred when using a single label, i.e., thresholding. The main goal of this research is to define the least biased algorithm to predict a racial category given a name. We present several different approaches for inferring race and examine the bias generated in each case. The goal of the research is to provide an empirical critique of name-based race inference and recommendations for approaches that minimize bias. Even if perfect inference is not achievable, the conclusions that arise from this study will allow researchers to conduct more careful analyses of racial and ethnic disparities in science. Although the categories analysed are only valid in the U.S. context, the general recommendations can be extended to any other country in which the Census (or a similar data collection mechanism) includes self-reported race.
Racial categories in the U.S. Census
The U.S. Census is a rich and long-running dataset, but also deeply flawed and criticized. Currently it is a decennial count of all U.S. residents, both citizens and non-citizens, in which several characteristics of the population are gathered, including self-declared race/ethnicity. The classification of race in the U.S. Census is value-laden with the agendas and priorities of its creators, namely 18th-century White men whom Wilkerson [21] refers to as "the dominant caste." The first U.S. Census was conducted in 1790 and founded on the principles of racial stratification and White superiority. Categories included: "Free White males of 16 years and upward," "Free White males under 16 years," "Free White females," "All other free persons," and "Slaves" [22]. At that time, each member of a household was classified into one of these five categories based on the observation of the census-taker, such that an individual of "mixed white and other parentage" was classified into "All other free persons" in order to preserve the privileged status of "Free White. . ." To date, anyone classifying themselves as other than "non-Hispanic White" is considered a "minority." The shared ground across the centuries of census survey design and classification strata reflects the sustained prioritization of the White male caste [3,23].
Today, self-identification is used to assign individuals to their respective race/ethnicity classifications [24], per the U.S. Office of Management and Budget (OMB) guidelines. However, the concept of race and/or ethnicity remains poorly understood. For example, in 2000 the category "Some other race" was the third largest racial group, consisting primarily of individuals who in 2010 identified as Hispanic or Latino (which according to the 2010 census definition refers to a person of Cuban, Mexican, Puerto Rican, South or Central American, or other Spanish culture or origin, regardless of race). Instructions and questions which facilitated the distinction between race and ethnicity began with the 2010 census, which stated that "[f]or this census, Hispanic origins are not races"; to date, in the U.S. federal statistical system, Hispanic origin is considered a separate concept from race. However, this did not preclude individuals from self-identifying their race as "Latino," "Mexican," "Puerto Rican," "Salvadoran," or other national origins or ethnicities [25]. Furthermore, 6.1% of the U.S. population changed their self-identification of both race and ethnicity between the 2000 and 2010 censuses [26], demonstrating the dynamicity of the classification. The inclusion of certain categories has also been the focus of considerable political debate. For example, the inclusion of citizenship generated significant debates in the preparation of the 2020 Census, as it may have generated a larger nonresponse rate from the Hispanic community [27]. For this article, we attempt to represent the fullest extent of potential U.S.-affiliated authors; thereby, we consider both citizens and non-citizens.

Funding: VL acknowledges funding from the Canada Research Chairs program, https://www.chairs-chaires.gc.ca/ (grant # 950-231768). DK acknowledges funding from the Luxembourg National Research Fund, https://www.fnr.lu/, under the PRIDE program (PRIDE17/12252781). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.
The social function of the concept of race (i.e., the building of racialized groups) underpins its definition more than any physical traits of the population. For example, "Hispanic" as a category arises from this conceptualization, even though in the 2010 U.S. Census the question about Hispanic origin is different from the one on self-perceived race. While Hispanic origin does not relate to any physical attribute, it is still considered a socially racialized group, and this is also how the aggregated data are presented by the Census Bureau. Therefore, in this paper, we will use the term race to refer to these social constructions, acknowledging the complex relation between conceptions of race and ethnicity. Even more importantly, this conceptualization of race also determines what can be done with the results of the proposed models. Given that race is a social construct, inferred racial categories should only be used in the study of group-level social dynamics underlying these categories, and not as individual-level traits. Census classifications are founded upon the social construction of race and the reality of racism in the U.S., which serves as "a multi-level and multi-dimensional system of dominant group oppression that scapegoats the race and/or ethnicity of one or more subordinate groups" [28]. Self-identification of racial categories continues to reflect broader definitional challenges, along with issues of interpretation, and above all the amorphous power dynamics surrounding race, politics, and science in the U.S. In this study, we are keenly aware of these challenges, and our operationalization of race categories is shaped in part by these tensions.
Data
This project uses several data sources to test the different approaches for race inference based on the author's name. First, to test the interaction between given- and family-name distributions, we simulate a dataset that covers most of the possible combinations. Using a Dirichlet process [29], we randomly generate 500 multinomial distributions that simulate those of given names, and another 500 random multinomial distributions that simulate those of family names. We then build a grid of all possible combinations of the given- and family-name random distributions (250,000 combinations). This randomly generated data is only used to determine the best combination of the probability distributions of given and family names for inferring race.
In addition to the simulation, we use two datasets with real given and family names and an
assigned probability for each racial group. The data from the given names is from Tzioumis
[30], who builds a list of 4,250 given names based on mortgage applications, with self-reported
race. Family name data is based on the 2010 U.S. Census [31], which includes all family names
with more than 100 appearances in the census, with a total of 162,253 surnames that covers
PLOS ONE Avoiding bias when inferring race using name-based approaches
PLOS ONE | https://doi.org/10.1371/journal.pone.0264270 March 1, 2022 3 / 16
more than 90% of the population. For confidentiality, this list removes counts for those racial
categories with fewer than five cases, as it would be possible to exactly identify individuals and
their self-reported race. In those cases, we replace with zero and renormalize. As explained
previously, changes were introduced in the 2010 U.S. Census racial categories. Questions now
include both racial and ethnic origin, placing "Hispanic" outside the racial categories. Even if
now “Hispanic” is not considered a racial category, but an ethnic origin that can occur in com-
bination with other racial categories (e.g., Black, White or Asian Hispanic), the information
about names and racial groups merge both questions into a single categorization. Therefore,
the racial categories used in this research includes “Hispanic” as a category, and all other racial
categories excluding people with Hispanic origin. The category "White" becomes "Non-His-
panic White Alone", and "Black or African American" becomes "Non-Hispanic Black or Afri-
can American Alone", and so on. The final categories used in both datasets are:
• Non-Hispanic White Alone (White)
• Non-Hispanic Black or African American Alone (Black)
• Non-Hispanic Asian and Native Hawaiian and Other Pacific Islander Alone (Asian)
• Non-Hispanic American Indian and Alaska Native Alone (AIAN)
• Non-Hispanic Two or More Races (Two or more)
• Hispanic or Latino origin (Hispanic)
We test these data on the Web of Science (WoS) to study how name-based racial inference
performs on the population of U.S. first authors. WoS did not regularly provide first names in
articles before 2008, nor did it provide links between authors and their institutional addresses;
therefore, the data includes all articles published between 2008 and 2019. Given that links
between authors and institutions are sometimes missing or incorrect, we restricted the analysis
to first authors to ensure that our analysis solely focused on U.S. authors. This results in
5,431,451 articles, 1,609,107 distinct U.S. first authors in WoS, 152,835 distinct given names
and 288,663 distinct family names for first authors. Given that in this database, ‘AIAN’ and
‘Two or more’ account for only 0.69% and 1.76% of authors respectively, we remove these and
renormalize the distribution with the remaining categories. Therefore, in what follows we will
refer exclusively to categories Asian, Black, Hispanic, and White.
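The renormalization step after dropping the rare categories can be sketched as follows; the probability values are hypothetical, and only the category names come from the article.

```python
import numpy as np

# Hypothetical name-level distribution over the six census-derived categories
cats = ["Asian", "Black", "Hispanic", "White", "AIAN", "Two or more"]
p = np.array([0.10, 0.05, 0.08, 0.70, 0.02, 0.05])

# Drop 'AIAN' and 'Two or more' and renormalize the remaining mass to 1
keep = np.array([c not in ("AIAN", "Two or more") for c in cats])
q = p[keep] / p[keep].sum()
print(q)  # distribution over Asian, Black, Hispanic, White
```

The same division-by-remaining-mass also applies to the confidentiality-suppressed counts replaced with zero in the census surname list.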
Methods
Manual validation
The data is presented as a series of distributions of names across races (Table 1). In name-based inference methods, it is not uncommon to use a threshold to create a categorical distinction: e.g., using a 90% threshold, one would assume that all instances of Juan as a given name should be categorized as Hispanic and all instances of Washington as a family name should be categorized as Black. In such a situation, any name not reaching this threshold would be excluded (e.g., those with the family name "Lee" would be removed from the analysis). This approach, however, assumes that the distinctiveness of names across races does not significantly differ.
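A minimal sketch of such a threshold rule, using the name distributions reproduced from Table 1 of the article (the function name and structure are illustrative, not the authors' code):

```python
def infer_race_threshold(dist, threshold=0.9):
    """Return the category whose probability meets the threshold,
    or None if no category is distinctive enough (author excluded)."""
    best = max(dist, key=dist.get)
    return best if dist[best] >= threshold else None

# Name distributions reproduced from Table 1
juan = {"Asian": 0.015, "Black": 0.005, "Hispanic": 0.934, "White": 0.045}
lee = {"Asian": 0.438, "Black": 0.169, "Hispanic": 0.020, "White": 0.373}

print(infer_race_threshold(juan))  # -> Hispanic (93.4% passes the threshold)
print(infer_race_threshold(lee))   # -> None ("Lee" is excluded entirely)
```

The exclusion of names like "Lee" is exactly the behavior whose race-dependent error the manual validation below probes.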
To test this, we began our analysis by manually validating name-based inference at three threshold ranges: 70–79%, 80–89%, and 90–100%. We sampled 300 authors from the WoS database, 25 randomly sampled for every combination of racial category and inference threshold. Two coders manually queried a search engine for the name and affiliation of each author and attempted to infer a perceived racial category through visual inspection of their professional photos and information listed on their websites and CVs (e.g., affiliation with racialized organizations such as Omega Psi Phi Fraternity, Inc., SACNAS, etc.).

Fig 1 shows the number of valid and invalid inferences, as well as those for whom a category could not be manually identified, and those for whom no information was found. Name-based inference of Asian authors was found to be highly valid at every considered threshold. The inference of Black authors, in contrast, produced many invalid or uncertain classifications at the 70–80% threshold, but had higher validity at the 90% threshold. Similarly, inferring Hispanic authors was only accurate above the 80% threshold. Inference of White authors was highly valid at all thresholds but improved above 90%. This suggests that a simple threshold-based approach does not perform equally well across all racial categories. We therefore consider an alternative weighting-based scheme that does not provide an exclusive categorization but uses the full information of the distribution.
Weighting scheme
We assess three strategies for inferring race from an author's name using a combination of their given and family name distributions across racial categories (Table 1). The first two aim at building a new distribution as a weighted average of both the given and family name racial distributions, and the third uses both distributions sequentially. In this section we explain these three approaches and compare them to alternatives that use only given or only family name racial distributions.

Table 1. Sample of family names (U.S. Census) and given names (mortgage data).

Type    Name        Asian   Black   Hispanic   White   Count
Given   Juan         1.5%    0.5%     93.4%     4.5%       4,019
Given   Doris        3.4%   13.5%      6.3%    76.7%       1,332
Given   Andy        38.8%    1.6%      6.4%    53.2%         555
Family  Rodriguez    0.6%    0.5%     94.1%     4.8%   1,094,924
Family  Lee         43.8%   16.9%      2.0%    37.3%     693,023
Family  Washington   0.3%   91.6%      2.7%     5.4%     177,386

https://doi.org/10.1371/journal.pone.0264270.t001

Fig 1. Manual validation of racial categories.
https://doi.org/10.1371/journal.pone.0264270.g001
The weighting scheme should account for the intuition that if the given (family) name is highly informative while the family (given) name is not, the resulting average distribution should prioritize the information in the given (family) name distribution. For example, 94% of people with Rodriguez as a family name identify themselves as Hispanic, whereas 39% of the people with the given name Andy identify as Asian, and 53% as White (see Table 1). For an author called Andy Rodriguez, we would like to build a distribution that encodes the informativeness of their family name, Rodriguez, rather than the relatively uninformative given name, Andy. The first weighting scheme proposed is based on the standard deviation of the distribution:

SD = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}

where x_i is in this case the probability associated with category i, and n is the total number of categories. With four racial categories, the standard deviation moves between 0, for perfect uniformity, and 0.5 when one category has a probability of 1. The second weighting scheme is based on entropy, a measure designed to capture the informativeness of a distribution:

Entropy = -\sum_{i=1}^{n} P(x_i)\log P(x_i)

Using these, we propose the following weight for both given and family names:

x_{weight} = \frac{f(x)^{exp}}{f(x)^{exp} + f(y)^{exp}}

with x and y as the given (family) and family (given) names respectively, f the weighting function (standard deviation or entropy), and exp the exponent applied to the function, a tuneable parameter. For the standard deviation, using the square function means we use the variance of the distribution. In general, the higher exp is set, the more skewed the weighting is towards the most informative name distribution. In the extreme, it would be possible to use an indicator function to simply choose the most skewed of the two distributions, but this approach would not use the information from both distributions. For this reason, we decided to experiment with exp ∈ {1, 2}, which implies a trade-off between selecting the most informative of the two distributions and using all available information.
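The standard-deviation variant of this scheme can be sketched as follows, using the Andy Rodriguez example from Table 1. This is an illustrative sketch, not the authors' code; note the entropy variant would need its ordering inverted (lower entropy means more informative), which is not handled here.

```python
import numpy as np

def combined_distribution(given, family, exp=2):
    """Weighted average of given- and family-name race distributions,
    weighting the more informative (higher-spread) name more heavily.
    Uses the sample standard deviation as the informativeness function f."""
    given, family = np.asarray(given), np.asarray(family)
    fg = given.std(ddof=1) ** exp
    ff = family.std(ddof=1) ** exp
    w = fg / (fg + ff)  # weight assigned to the given name
    return w * given + (1 - w) * family

# "Andy Rodriguez" with Table 1 values, ordered Asian, Black, Hispanic, White
andy = [0.388, 0.016, 0.064, 0.532]
rodriguez = [0.006, 0.005, 0.941, 0.048]
print(combined_distribution(andy, rodriguez))
```

Because Rodriguez is far more skewed than Andy, the combined distribution leans heavily towards Hispanic, matching the intuition described above.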
Fig 2 shows the weighting of the simulated given and family names based on their informativeness, for different values of the exponent. The horizontal and vertical axes show the highest value in the given- and family-name distribution, respectively; a higher value on either axis corresponds to a more informative given/family name. The color shows how much weight is assigned to given names. When the exponent is set to two, both the entropy- and standard deviation-based models skew towards the most informative feature, a desirable property. Compared to other models, the variance gives the most extreme values to cases where only one name is informative, whereas the entropy-based model is the most uniform.
Information retrieval
The above weighting schemes result in a single probability distribution of an author belonging to each of the racial categories, from which a race can be inferred. One strategy for inferring race from this distribution is to select the racial category above a certain threshold, if any. A second strategy is to use the full distribution to weight the author across different racial categories, rather than assigning any specific category. We also consider a third strategy, which sequentially uses family and then given names to infer race.

We first retrieve all authors who have a family name with a probability of belonging to a specific racial group greater than a given threshold. This retrieves N authors. Second, we retrieve the same number of authors as in the first step, N, using their given names. Finally, we merge the authors from both steps, removing duplicates who had both given and family names above the set threshold. This process results in between N and 2N authors. There are several natural variations on this two-step method. For example, a percentage threshold
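The two-step retrieval described above can be sketched as follows. This is a hypothetical sketch: the `Author` structure is invented for illustration, and the article does not spell out how the second step selects its N authors, so top-N by given-name probability is assumed here.

```python
from dataclasses import dataclass

@dataclass
class Author:
    name: str
    given_dist: dict   # P(race | given name)
    family_dist: dict  # P(race | family name)

def two_step_retrieval(authors, category, threshold=0.9):
    # Step 1: authors whose family name passes the threshold (N authors)
    step1 = [a for a in authors if a.family_dist[category] >= threshold]
    n = len(step1)
    # Step 2: retrieve the same number N by given name (selection rule assumed)
    ranked = sorted(authors, key=lambda a: a.given_dist[category], reverse=True)
    step2 = ranked[:n]
    # Merge and deduplicate: yields between N and 2N distinct authors
    return {a.name for a in step1} | {a.name for a in step2}

authors = [
    Author("Juan Rodriguez", {"Hispanic": 0.934}, {"Hispanic": 0.941}),
    Author("Doris Lee", {"Hispanic": 0.063}, {"Hispanic": 0.020}),
]
print(two_step_retrieval(authors, "Hispanic"))
```

Here only one author passes the family-name threshold (N = 1), and the same author also tops the given-name ranking, so the merged set stays at N rather than 2N.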