Inside the folder, the two websites are assessment examples from 2021 and 2020, the three articles are the assigned research papers (which could be used as references), and the PTUA summative assessment file sets out the requirements.
Sinclair et al. – 2023 – Assessing the socio-demographic representativeness.pdf
Applied Geography 158 (2023) 102997
Available online 13 July 2023 0143-6228/© 2023 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
Assessing the socio-demographic representativeness of mobile phone application data
Michael Sinclair a,*, Saeed Maadi a, Qunshan Zhao a, Jinhyun Hong b, Andrea Ghermandi c, Nick Bailey a
a Urban Big Data Centre, University of Glasgow, Glasgow, UK; b Department of Smart Cities, University of Seoul, South Korea; c Department of Natural Resources and Environmental Management, University of Haifa, Israel
ARTICLE INFO
Handling Editor: Y.D. Wei
Keywords: Mobile phone data; Socio-demographic representativeness; Tamoco; Huq
ABSTRACT
Emerging forms of mobile phone data generated from the use of mobile phone applications have the potential to advance scientific research across a range of disciplines. However, there are risks regarding uncertainties in the socio-demographic representativeness of these data, which may introduce bias and mislead policy recommendations. This paper addresses the issue directly by developing a novel approach to assessing socio-demographic representativeness, demonstrating this with two large independent mobile phone application datasets, Huq and Tamoco, each with three years of data for a large and diverse city-region (Glasgow, Scotland) home to over 1.8 million people. We advance methods for detecting home location by including high-resolution land use data in the process and test representativeness across multiple dimensions. Our findings offer greater confidence in using mobile phone app data for research and planning. Both datasets show good representativeness compared to the known population distribution. Indeed, they achieve better population coverage than the ‘gold standard’ random sample survey which is the alternative source of data on population mobility in this region. More importantly, our approach provides an improved benchmark for assessing the quality of similar data sources in the future.
1. Introduction
New forms of mobile phone (MP) data from the use of applications, or ‘apps’, offer enormous potential as an alternative or complement to traditional survey data sources to enhance our understanding of human activity and mobility (Huang et al., 2022; Kang et al., 2020). The huge volumes of data available from these novel sources, as well as the spatial and temporal detail they provide, create unprecedented opportunities across a wide range of disciplines to advance scientific research. However, there are critical unanswered questions concerning the socio-demographic representativeness of these new forms of MP data. This creates a risk that underlying bias could produce unreliable results which are then used as the basis for policy (Grantz et al., 2020). Furthermore, the limited analysis of the issue of socio-demographic representativeness restricts the progress of applied research seeking to utilise these novel and emerging forms of MP app data as an alternative or complement to more traditional data sources.
Traditionally, scientific research has utilized MP data from call detail records, which track phone locations during potentially billable events (Calabrese et al., 2013; Grantz et al., 2020; Pappalardo et al., 2021; Ren & Guan, 2022; Vanhoof et al., 2018b; Wang et al., 2020; Yabe et al., 2022). More recently, a new form of location data from the use of GPS-enabled smartphone applications has emerged, which also offers large data volumes but with much higher spatial accuracy (Berke et al., 2022; Grantz et al., 2020; Huang et al., 2022; Wang et al., 2020; Yabe et al., 2020). This mobile phone application (MPA) data, generated and collected from the use of a wide range of apps, provides point location information which supports more detailed analysis and opens up the range of possible analytical applications (Cameron et al., 2020; Heo et al., 2020; Mears et al., 2021; Sinclair et al., 2021; Yabe et al., 2020). So far, these have included disaster and pandemic response (Huang et al., 2022; Kishore et al., 2022; Yabe et al., 2020), nature-based recreation (Mears et al., 2021; Sinclair et al., 2021) and analyses of human mobility (Calafiore et al., 2021; Gao et al., 2020; Kang et al., 2020). The applications of these novel data are in their infancy and their potential spans a wide range of disciplines.
* Corresponding author. E-mail address: [email protected] (M. Sinclair).
https://doi.org/10.1016/j.apgeog.2023.102997 Received 17 February 2023; Received in revised form 20 April 2023; Accepted 11 May 2023
New spatial data forms such as MPA data are frequently contrasted with traditional survey data as an alternative source for applied research (Mayer-Schönberger and Cukier, 2013; Savage & Burrows, 2007). Although household surveys are considered the ‘gold standard’ for research due to generalizable random samples, they face declining response rates (Brick & Williams, 2013; Meyer et al., 2015), interviewer effects, recall error and normative bias (Marsh, 1982). Surveys are unsuitable for rapid response situations like the Covid-19 pandemic, and their low sample sizes limit fine-grained spatial/temporal detail. Additionally, some highly marginalized groups such as the homeless or those in temporary forms of accommodation may be excluded. In this context, MPA data may offer advantages as a complement or alternative to traditional data (boyd & Crawford, 2012). MPA data provide spatial-temporal detail, often in (near) real-time or with low lag. While consent is still required, data collection is less burdensome, potentially reducing non-response bias and including previously excluded groups. Despite this potential, there are uncertainties about these novel data. A major one concerns data quality, since so much of the data capture and processing is unavailable to researchers due to commercial concerns. This raises worries that the data may under-represent marginalized groups, particularly the ‘digitally excluded’ (boyd & Crawford, 2012).
There are critical questions, therefore, about the quality of MPA data which need to be addressed before more widespread use is justified in research (Grantz et al., 2020). Key among them is the question of how accurately MPA data represent the population of interest, given they are not the result of a carefully-planned sampling strategy (Ranjan et al., 2012; Zhao et al., 2016) and are constructed from the use of a wide range of different applications. In particular, the concern is that inequalities in access to and use of mobile phones may be reflected in these data. The resulting research may skew attention and possibly resources towards already advantaged social groups (Grantz et al., 2020) or fail to adequately include groups such as the elderly population (Guo et al., 2019; Lee et al., 2021). Though the question of bias is common to all forms of MP data, it is perhaps especially relevant to MPA data where datasets are assembled by commercial intermediaries. These intermediaries gather data across a wide and diverse set of apps with the aim of achieving scale and broad representativeness, but their processes are not transparent: few metrics are provided to evidence the latter, and there are currently no standards by which they can be evaluated. Despite its importance, few studies using MPA data explore the topic of socio-demographic representativeness directly (Huang et al., 2020, 2022).
Assessing representativeness is challenging due to the steps taken by MP data providers to protect user privacy. MP data are often provided to researchers as aggregated totals, making it impossible to identify the characteristics of individuals at all. Where data are provided at the individual level, such as with MPA data, information is rarely if ever provided on a user’s personal characteristics, so representativeness cannot be examined directly (Grantz et al., 2020). Researchers therefore often use techniques to infer the user’s home location based on location histories. These home locations allow the geographic distribution of the sample of MP users to be compared to ‘ground truth’ sources such as official population statistics (Berke et al., 2022; Calabrese et al., 2013; Huang et al., 2022; Mao et al., 2015; Phithakkitnukoon et al., 2012; Wang et al., 2019; Çolak et al., 2015). This process provides a very useful measure of differential geographic coverage (Yabe et al., 2020) as well as variations by the socio-demographic status of different areas (Bernabeu-Bautista et al., 2021; Huang et al., 2020, 2022). Enriching the data in this way also greatly increases the potential impact of research.
Different approaches have been adopted to estimate or infer home locations from MP and other locational data based on the volume of content generated by a user in space and/or time (Calafiore et al., 2021; Pappalardo et al., 2021; Sinclair et al., 2020). Since it is rarely possible to validate home detection algorithms against known home locations for users (Pappalardo et al., 2021), techniques are designed with the aim of reducing potential error. To assign a home location at a country or city level, daily activity counts are generally sufficient (Bojic et al., 2015; Sinclair et al., 2020). However, to infer socio-demographic information for users requires predicting home locations for much smaller geographies. Including the full range of an individual’s daily activities towards this end could lead to an increase in false predictions as users might record volumes of data around places designated for work or socialising (Pappalardo et al., 2021; Vanhoof et al., 2018a). This is especially true for MPA data where datasets represent a wide range of activities, due to the mix of apps involved. The main approach to overcome this is to utilise activity heuristics, by including a time element in the algorithm. Restricting the analysis to night-time data, based on the assumption that people are more often at home during this period (Berke et al., 2022; Bojic et al., 2015; Calabrese et al., 2013; Calafiore et al., 2021; Phithakkitnukoon et al., 2012; Sinclair et al., 2020; Vanhoof et al., 2018; Çolak et al., 2015), has been shown to improve results (Pappalardo et al., 2021).
There are two main limitations with current approaches to assessing representativeness of these novel data. The first is that, even with activity heuristics, problems remain in inferring home locations as many people spend periods of the night at sites of work, leisure, or transit. This is especially true for MPA data where the data represent various activities based on a diversity of apps. The second is that, once home locations have been inferred, researchers rarely explore representativeness in a systematic or comprehensive way. In this paper, we address both issues and hence provide a more appropriate standard for assessing representativeness of MPA data. First, with home locations, we propose a novel approach which incorporates high-resolution land use data into the process. By relying only on data captured within buildings which have a designated residential use, we greatly reduce the chance of identifying night-time work, leisure, or transit locations as home locations. Second, we use these potentially improved home location estimates to examine representativeness using multiple independent dimensions. These cover the geographical distribution but also socio-economic and socio-demographic status.
To illustrate our approach, we apply it to an assessment of representativeness for two extensive and independent sources of MPA data. Each contains data from a diverse portfolio of apps covering a wide time period (three years) for a large and socio-demographically diverse city-region (Glasgow). First, we apply our home detection approach which incorporates high-resolution residential land use data into the process. Second, we compare the distribution of the resulting samples of MPA users from both data sources to the known population distribution across three years (2019–2021). Comparisons are made by geographic location and against two different measures of area socio-demographic status. One is an official index of area deprivation in Scotland, the Scottish Index of Multiple Deprivation (SIMD). The other is a commercial socio-demographic classification, CACI’s Acorn consumer classification (CACI), which segments areas by analysing a wide range of data on demographics and consumer behaviour. Third, we compare the results on representativeness found using our novel home detection approach against those found using a more conventional approach, which does not utilise residential land use data, to illustrate the impact of this innovation. Finally, we compare the distribution of our sample of MP users to the distribution of the sample of households captured by a traditional survey which is widely used in the study area for mobility analysis and transport planning, the Scottish Household Survey (SHS). Such traditional forms of data are frequently held up as the ‘gold standard’ against which new forms of data are compared since they are built round a structured random sample. Comparison against such a sample provides arguably a fairer test of representativeness.
2. Data and methods
2.1. Study area
The study area is Glasgow city-region, comprising the core city (the largest in Scotland) and seven surrounding councils (Fig. 1). Glasgow is home to over 600,000 people, while the wider city-region houses over 1.8 million people. The city-region covers areas or neighbourhoods with a wide range of socio-demographic circumstances, which makes it particularly suitable to test for inequalities in sample coverage by socio-economic status. Fig. 1 also shows the eight council boundaries used for reporting results as well as the built-up residential areas within each.
2.2. Mobile phone application datasets
The core data for this research are MPA datasets from two private companies, Huq and Tamoco (see footnote 1). Both are examples of smartphone GPS location data (Yabe et al., 2022), which are timestamped point data generated using MP apps on GPS-enabled smartphones (Table 1). This type of big data generally offers a higher spatial precision than traditional sources of MP data such as call detail records, which are often limited to cell tower regions. The data used in this study are confined to the extent of the study area (Fig. 1) and consist of hundreds of millions of data points per year. More widely, Huq currently offers data across the UK and Tamoco across the UK and the United States of America.
The construction and structure of the datasets are similar across both providers. Each contains data from a range of partner apps on an informed consent basis, with data limited to users aged 16+. Data are collected when an app records the time and location of a device based on the most accurate location sensor available at the time, including GPS, Bluetooth, cellular tower, Wi-Fi or a combination of sources (Wang & Chen, 2018). Due to a lack of transparency from the commercial providers, the specific applications included in the datasets are unknown to researchers. However, data are pooled from a wide and diverse set of apps with the aim of achieving scale and broad representativeness. In one of the years, for example, one provider was collecting data from over 200 unique apps. The data represent timestamped point locations with a certain degree of error. Each MP device has the personal identifiers replaced with non-reversible hashed identifiers. This means that data points from the same user can be linked over time. With Huq, the points from individual users can be linked over the whole period, while Tamoco resets its hashed identifiers every month. Data volumes are vast and fluctuate year-to-year (Table 1), reflecting in part the changes in the apps with which the intermediaries have contracts. Assessing representativeness will therefore always be an on-going exercise.
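The consequence of the two identifier policies for linking data points can be illustrated with a minimal Python sketch. The hashing scheme used here (SHA-256 over a secret salt and the device identifier, optionally mixed with the calendar month) is an assumption made purely for illustration; neither provider's actual procedure is described in the paper, and `SECRET`, `hash_id` and the device identifier are hypothetical names.

```python
import hashlib
from typing import Optional

SECRET = "illustrative-secret"  # hypothetical salt; providers' real salts are unknown


def hash_id(device_id: str, month: Optional[str] = None) -> str:
    """Return a one-way hashed identifier for a device.

    If `month` (e.g. "2020-03") is supplied it is mixed into the hash, so the
    identifier changes each month (the monthly-reset case, as assumed here).
    """
    material = f"{SECRET}|{device_id}|{month or ''}"
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


# Persistent identifier: the same device yields the same hash in every month,
# so its points can be linked across the whole period.
persistent_march = hash_id("device-123")
persistent_april = hash_id("device-123")

# Monthly-reset identifier: the same device yields a different hash each month,
# so its points can only be linked within a month.
monthly_march = hash_id("device-123", "2020-03")
monthly_april = hash_id("device-123", "2020-04")
```

Under this assumption, a user-level analysis spanning several months is possible only with the persistent identifiers, which is why counts of unique users for a monthly-reset dataset have to be handled differently.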
2.3. Other secondary data sources used in the analysis
Different levels of geographic boundaries are used in the analysis, all of which are represented visually in Appendix 1. The highest level used is Council (n = 8), which is also visualised in Fig. 1. The next level is the Intermediate Zone (n = 417); Intermediate Zones nest within councils. We also use Datazones, which nest within Intermediate Zones and are the key geography for small area statistics in Scotland. These are the spatial unit used in this paper for home location detection. Datazones are also used to assign the Scottish Index of Multiple Deprivation measure of socio-demographic status to mobile devices (see below). Datazones are designed to have a population of 500–1000 and there are 2336 in the study area. The finest spatial boundary used is the unit postcode (n = 44,829), which nests within Datazones. These boundaries are used to assign the second measure of socio-demographic status to users, the CACI Acorn Consumer Classification (see below).
In comparing the socio-demographic representativeness of MP users to the population, we use two sources, one public and one private. The Scottish Index of Multiple Deprivation (SIMD, https://simd.scot/) assigns a relative measure of area deprivation to Datazones across Scotland. The SIMD combines measures of deprivation across multiple domains (income, employment, education, health, crime, housing and access to services) into a scaleless relative ranking. SIMD 2020 is used in this research. Our analysis assigns MP users an SIMD quintile and percentile, using national rankings, based on the Datazone where they are estimated to live. As Glasgow city-region is a relatively deprived area, there is an over-representation in more deprived quintiles (1 and 2). See the supplementary material for population breakdown by SIMD groups. The CACI Acorn Consumer Classification (https://www.caci.co.uk/) is a private socio-demographic data source which segments the UK population by analysing a wide range of data on demographics and consumer behaviour. CACI segments unit postcodes into 6 categories, 18 groups and 62 types. The subdivisions are nested, with the 6 categories broken into between 2 and 4 groups (see footnote 2), and the 18 groups broken into between 3 and 6 types. Our analysis assigns mobile phone users a category, group and type based on the postcode where they are estimated to live. In this study we use 2020 CACI data. See the supplementary material for population breakdown by CACI groups.
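The step from a national SIMD rank to a quintile and percentile can be sketched in Python as follows. This is an illustration rather than the authors' code: it assumes SIMD 2020 ranks 6,976 Datazones nationally with rank 1 as the most deprived, and it assumes equal-sized bands cut on the national ranking. The function names and the Datazone codes in the usage lines are hypothetical.

```python
N_DATAZONES = 6976  # assumed number of Datazones ranked nationally in SIMD 2020


def simd_band(rank: int, bands: int, n: int = N_DATAZONES) -> int:
    """Map a national rank (1 = most deprived) to a band numbered 1..bands.

    Integer ceiling division is used so that no floating-point rounding
    can move a Datazone across a band boundary.
    """
    if not 1 <= rank <= n:
        raise ValueError("rank must lie between 1 and n")
    return (rank * bands + n - 1) // n


def assign_simd(home_datazone: str, rank_by_datazone: dict) -> dict:
    """Attach a quintile and percentile to a user's estimated home Datazone."""
    rank = rank_by_datazone[home_datazone]
    return {"quintile": simd_band(rank, 5), "percentile": simd_band(rank, 100)}


# Hypothetical lookup table: Datazone code -> national SIMD rank.
ranks = {"DZ_example_1": 120, "DZ_example_2": 6500}
user_a = assign_simd("DZ_example_1", ranks)
user_b = assign_simd("DZ_example_2", ranks)
```

Because the bands are defined on the national ranking, a relatively deprived region such as the study area is expected to have more than a fifth of its population in quintiles 1 and 2.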
As a key step in the home location detection process, which is explained in the next section, we make use of high-resolution land use data from Geomni’s UKBuildings layer. This dataset is a multi-polygon spatial dataset representing the footprint of all buildings in the UK, including residential buildings (see Fig. 1). Each building is assigned a usage, classified into various types. For this study, we use all buildings with a residential or mixed-residential use. Data from 2020 is used in this study.
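The idea of keeping only data points that fall inside residential building footprints can be sketched with a plain ray-casting point-in-polygon test. This is a simplified Python illustration, not the authors' PostgreSQL/R implementation: coordinates are assumed to be in a projected system (metres), footprints are simple polygons without holes, and the usage labels (`"residential"`, `"mixed_residential"`, `"commercial"`) and the example footprints are hypothetical rather than taken from the UKBuildings schema.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Polygon = List[Point]

RESIDENTIAL_USES = {"residential", "mixed_residential"}  # hypothetical labels


def point_in_polygon(pt: Point, poly: Polygon) -> bool:
    """Ray casting: count polygon edges crossed by a ray running right from pt."""
    x, y = pt
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        # Only edges that straddle the horizontal line through pt can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def keep_residential_points(points: List[Point], buildings: List[Dict]) -> List[Point]:
    """Return only the points lying inside a residential or mixed-use footprint."""
    footprints = [b["footprint"] for b in buildings if b["use"] in RESIDENTIAL_USES]
    return [p for p in points if any(point_in_polygon(p, f) for f in footprints)]


# Hypothetical footprints: one house and one shop, each a 10 m square.
buildings = [
    {"use": "residential", "footprint": [(0, 0), (10, 0), (10, 10), (0, 10)]},
    {"use": "commercial", "footprint": [(20, 0), (30, 0), (30, 10), (20, 10)]},
]
points = [(5.0, 5.0), (25.0, 5.0), (50.0, 50.0)]
residential_points = keep_residential_points(points, buildings)
```

For datasets of the size described here, a spatial index or a spatial database join would be needed in practice; the nested loop above is only meant to make the filtering rule explicit.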
In the final section of the results, we compare the MPA samples to a traditional survey dataset widely used in social research across Scotland, the Scottish Household Survey (SHS, http://www.scottishhouseholdsurvey.com/). The SHS is an annual survey of over 10,000 households, used as the basis of a range of official statistics. The SHS has a repeat cross-sectional design with a sample for the Glasgow city-region of N = 3495 in 2019 (the most recent available; see footnote 3). For Glasgow City, fewer than half the eligible adults completed a travel diary. Younger adults were significantly under-represented while those aged 65+ were over-represented. Some population groups are excluded by the sample design, including households living on military bases, in communal establishments, in mobile homes or sites for traveling people, or homeless (Scottish Government, 2020).
2.4. Home location detection techniques
To compare the distribution of each MPA sample to the population, it is necessary to estimate the home location for each MP user in the dataset and this is typically done based on night-time locations, as discussed in the Introduction. The specific period which constitutes night-time varies between studies but a window beginning between 19.00 and 22.00 and ending between 05.00 and 09.00 is common (Pappalardo et al., 2021; Vanhoof et al., 2018). It is rarely possible to verify the accuracy of home location estimates with ‘ground truth’ data. One study which achieved this found that using night-time data was more accurate than taking data which covered the whole day (Pappalardo et al., 2021). Accordingly, this is the approach we build on here using the night-time period of 20.00 to 06.00. Box 1 explains in more detail how we estimate home locations using our approach (Method 1) and a more conventional approach (Method 2).
1 Information for Huq available at: https://www.ubdc.ac.uk/data-services/data-catalogue/transport-and-mobility-data/huq-data/; and Tamoco available at: https://www.ubdc.ac.uk/data-services/data-catalogue/transport-and-mobility-data/tamoco-data/.
2 This is with the exception of the category/group ‘Not Private Households’ which is not disaggregated between the levels of category and group (and relates to areas which generally do not have a residential population).
3 Households are selected using a random sample stratified by council which over-represents smaller councils to ensure each achieves a minimum sample size. A travel survey portion is completed by one randomly-selected adult in each household (Scottish Government, 2020) and is by definition therefore skewed towards adults from smaller households. For 2019, the response rate for households nationally was 63%, with random adults completing the travel survey in 92% of cases but this varied across the country.
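The night-time rule described in Section 2.4 can be sketched in Python as follows. This is a minimal illustration and not a reproduction of Box 1, which is not included here: it assumes each data point has already been assigned to a Datazone, treats 20.00 up to but not including 06.00 as night-time, and takes the Datazone with the most night-time points as the home location. The modal rule, the boundary handling, and all names and records below are assumptions.

```python
from collections import Counter, defaultdict
from datetime import datetime

NIGHT_START, NIGHT_END = 20, 6  # night-time window of 20.00 to 06.00


def is_night(ts: datetime) -> bool:
    """True for timestamps from 20:00 up to (but not including) 06:00."""
    return ts.hour >= NIGHT_START or ts.hour < NIGHT_END


def estimate_homes(records):
    """Estimate a home Datazone per user.

    records: iterable of (user_id, timestamp, datazone) tuples.
    Users with no night-time points receive no estimate.
    """
    counts = defaultdict(Counter)
    for user, ts, zone in records:
        if is_night(ts):
            counts[user][zone] += 1
    # Assumed rule: the Datazone with the most night-time points is "home".
    return {user: c.most_common(1)[0][0] for user, c in counts.items()}


# Hypothetical records for two users.
records = [
    ("u1", datetime(2020, 1, 1, 23, 0), "DZ_A"),
    ("u1", datetime(2020, 1, 2, 2, 0), "DZ_A"),
    ("u1", datetime(2020, 1, 2, 13, 0), "DZ_B"),
    ("u1", datetime(2020, 1, 2, 13, 5), "DZ_B"),
    ("u1", datetime(2020, 1, 2, 13, 10), "DZ_B"),
    ("u1", datetime(2020, 1, 2, 21, 0), "DZ_B"),
    ("u2", datetime(2020, 1, 2, 12, 0), "DZ_C"),
]
homes = estimate_homes(records)
```

In this example user `u1` has more points overall in `DZ_B`, but most of them are daytime points, so the night-time rule selects `DZ_A`; user `u2` has no night-time data and is dropped. A variant that also discards points outside residential buildings would apply that filter before counting.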
2.5. Comparing the representativeness of mobile phone application data
The results from section 2.4 allow us to allocate each unique MP user to a Council, Intermediate Zone, Datazone, and unit postcode. Using these we can assign MP users to a SIMD deprivation quintile and percentile (from Datazone), as well as a CACI Acorn category, group, and type (from postcode). We assess representativeness in three ways. For the geographic distribution, we focus initially on the eight council areas but later report the distribution across the 417 Intermediate Zones. For variations by deprivation status, we initially examine the distribution across the five quintiles of the SIMD index but later present results at the percentile level. Lastly, for variations by socio-demographic status, we initially examine the distribution across CACI’s six broadest categories but later use the 18 groups and the 62 types. Following these comparisons to the population, we make a further comparison with the sample of travel diary respondents in the SHS. We make comparisons across the eight councils using the measure of SIMD deprivation quintile, the latter being the finest spatial disaggregation available on the publicly-available SHS files.
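A comparison of this kind reduces to setting the share of the sample in each category against the share of the population in the same category. The Python sketch below shows one simple way to do so, reporting the difference in shares per category together with an index of dissimilarity (half the sum of absolute differences). The choice of summary measure is an assumption made for illustration, not a statement of the metric used in the paper, and the counts are invented.

```python
def shares(counts: dict) -> dict:
    """Convert category counts to proportions that sum to 1."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


def compare(sample_counts: dict, population_counts: dict):
    """Return per-category share differences and an index of dissimilarity.

    A positive difference means the category is over-represented in the sample.
    The index is 0 for identical distributions and approaches 1 as they diverge.
    """
    s, p = shares(sample_counts), shares(population_counts)
    diff = {k: s.get(k, 0.0) - p[k] for k in p}
    dissimilarity = 0.5 * sum(abs(d) for d in diff.values())
    return diff, dissimilarity


# Invented counts by SIMD quintile (1 = most deprived); not data from the paper.
population = {1: 400, 2: 250, 3: 150, 4: 100, 5: 100}
sample = {1: 35, 2: 25, 3: 15, 4: 10, 5: 15}
diff, dissimilarity = compare(sample, population)
```

The same function can be applied unchanged to councils, Intermediate Zones, SIMD percentiles or CACI categories, groups and types, since each is just a different set of category keys.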
2.6. Transparency and reproducibility
The MP datasets used in this analysis can be accessed for research purposes by application to the Urban Big Data Centre, an Economic and Social Research Council funded research centre and national data service based at the University of Glasgow. Datazone and higher geographic boundaries are available under Open Government licence (http://spatialdata.gov.scot/). Postcode boundaries are freely available from the Scottish Postcode Directory (National Records of Scotland, n.d.) under the ‘Public Sector Geospatial Agreement’ which covers non-commercial use of the data. SIMD data is available under Open Government licence. CACI data are accessed here under a licence agreed with CACI for this particular study. Geomni’s UKBuildings layer (Digital Map Data © The GeoInformation Group Limited (2022), created and maintained by Geomni, a Verisk company) is accessed under a general academic licence via Digimap (https://digimap.edina.ac.uk/). SHS data are accessed through the UK Data Service under their standard End User Licence (Scottish Government & Ipsos MORI, 2021). All analysis is completed using a combination of PostgreSQL and R programming language (R Core Team, 2022). The code to process the data and estimate home location is openly available on GitHub (https://github.com/sinclairmichael/appliedgeography_representativeness.git).
Fig. 1. Glasgow City-region, council areas and built-up areas. Residential and mixed residential buildings are from Geomni’s UKBuildings layer which is created and maintained by Geomni, a Verisk company (see section on data sources).
Table 1. Summary of mobile phone application data collections in the study area.

Provider   Measure                     2019      2020      2021
Huq        Unique users                19,399    29,741    25,233
Huq        Datapoints (millions)       21.9      161.8     346.8
Huq        Mean datapoints per user    1129      5440      13,744
Tamoco     Unique users                81,203    85,258    81,136
Tamoco     Datapoints (millions)       442.5     808.1     471.8
Tamoco     Mean datapoints per user    5449      9478      5814

Notes: Unique users are based on the number of unique hashed identifiers active in a given year. The number of Tamoco users is based on the monthly average for each year since identifiers are reset monthly.
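The mean datapoints per user in Table 1 can be checked against the other two rows with a few lines of Python. The sketch assumes the mean is simply total datapoints divided by unique users; because the datapoint totals are reported in millions to one decimal place, the recomputed means agree with the published values only to within rounding.

```python
# Values transcribed from Table 1:
# (unique users, datapoints in millions, reported mean datapoints per user).
TABLE_1 = {
    ("Huq", 2019): (19_399, 21.9, 1_129),
    ("Huq", 2020): (29_741, 161.8, 5_440),
    ("Huq", 2021): (25_233, 346.8, 13_744),
    ("Tamoco", 2019): (81_203, 442.5, 5_449),
    ("Tamoco", 2020): (85_258, 808.1, 9_478),
    ("Tamoco", 2021): (81_136, 471.8, 5_814),
}


def mean_points_per_user(users: int, datapoints_millions: float) -> float:
    """Total datapoints divided by unique users (assumed definition of the mean)."""
    return datapoints_millions * 1_000_000 / users


for (provider, year), (users, millions, reported) in TABLE_1.items():
    computed = mean_points_per_user(users, millions)
    print(f"{provider} {year}: computed {computed:,.0f}, reported {reported:,}")
```

For Tamoco the denominator is the monthly average number of identifiers, as stated in the notes, so the mean is per monthly identifier rather than per person tracked across the year.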
