Week 6: Issues with Data Science
Data Management Maturity
Data governance and data management are often used
interchangeably.
Better to treat them as separate levels
• Data Management is what you do to handle the data
o Resources, practices, enacting policies
• Data Governance is making sure that it is done
appropriately
o Policies, training, providing resources
o Planning and understanding
Governance and management
DCC data (curation) lifecycle model
https://www.dcc.ac.uk/guidance/curation-lifecycle-model
Capability Maturity Model
• Good management happens all through the data lifecycle
• 4 key process areas:
o Data acquisition, processing and quality assurance
Goal: Reliably capture and describe scientific data in a way that facilitates preservation and reuse
o Data description and representation
Goal: Create quality metadata for data discovery, preservation, and provenance functions
o Data dissemination
Goal: Design and implement interfaces for users to obtain and interact with data
o Repository services/preservation
Goal: Preserve collected data for long-term use
• Good data governance uses a good management system
o A mature system manages data all through the data lifecycle and throughout all projects.
K Crowston & J Qin (2011) A Capability Maturity Model for Scientific Data Management: Evidence from the Literature. Proceedings of the American Society for Information Science & Technology, V48.
https://asistdl.onlinelibrary.wiley.com/doi/full/10.1002/meet.2011.1450480103
Capability Maturity Model
• Data management and governance are not things arranged just for each project.
• They should be universal in how an organisation thinks about and approaches data
o at all times
o in all divisions
o in all projects
o for all stakeholders
Universality
End of Data Management Maturity
Week 6: Issues with Data Science
Ethics of Linked Data
• Connecting elements within multiple structured data sets
• Allows data relating to an element to be collected from multiple data sets
• Expands the knowledge base of a single data set
• Linked Open Data (LOD) allows the links and data to be freely shared and accessed
o Used by companies, but they tend not to contribute their own data
Linked Data
Sir Tim Berners-Lee, the inventor of the WWW and HTML, wanted a semantic web, using linked data:
1. Name/identify things with URIs
2. Use HTTP URIs so things can be looked up
3. Standardise the format of data about things with URIs
4. On the web, use the URIs when mentioning things
CC BY 2.0 https://www.flickr.com/photos/tamaleaver/7674657708
Semantic Web
• The Resource Description Framework (RDF) is a language for representing (subject, predicate, object) triples, which are used to represent semantics. It is a core representation language for Linked Open Data and the Semantic Web.
• RDF can be serialised in different formats, for instance as XML or simply as line-delimited lists of triples; a parsing sketch follows the example below.
<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:schema="http://schema.org/">
  <rdf:Description rdf:about="http://example.org/bob#me">
    <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person"/>
    <schema:birthDate rdf:datatype="http://www.w3.org/2001/XMLSchema#date">1990-07-04</schema:birthDate>
    <foaf:knows rdf:resource="http://example.org/alice#me"/>
    <foaf:topic_interest rdf:resource="http://www.wikidata.org/entity/Q12418"/>
  </rdf:Description>
  <rdf:Description rdf:about="http://www.wikidata.org/entity/Q12418">
    <dcterms:title>Mona Lisa</dcterms:title>
    <dcterms:creator rdf:resource="http://dbpedia.org/resource/Leonardo_da_Vinci"/>
  </rdf:Description>
  <rdf:Description rdf:about="http://data.europeana.eu/item/04802/243FA8618938F4117025F17A8B813C5F9AA4D619">
    <dcterms:subject rdf:resource="http://www.wikidata.org/entity/Q12418"/>
  </rdf:Description>
</rdf:RDF>
https://www.w3.org/TR/rdf11-primer/
Format of linked (open) data
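A minimal sketch (not part of the original slides) of how the RDF/XML example above could be read in Python with the rdflib library; the file name bob.rdf and variable names are assumptions for illustration only.

# Sketch: parsing the RDF/XML example above with rdflib (assumed to be installed).
from rdflib import Graph, URIRef

g = Graph()
g.parse("bob.rdf", format="xml")  # hypothetical file holding the RDF/XML above

# Every statement in the graph is a (subject, predicate, object) triple
for s, p, o in g:
    print(s, p, o)

# Ask which topics Bob is interested in (foaf:topic_interest)
bob = URIRef("http://example.org/bob#me")
interest = URIRef("http://xmlns.com/foaf/0.1/topic_interest")
for topic in g.objects(bob, interest):
    print("Bob is interested in:", topic)  # http://www.wikidata.org/entity/Q12418 (Mona Lisa)

Because every element is named by a URI, the same kind of query can follow the Q12418 link out to Wikidata or Europeana, which is what makes linked (open) data easy to combine across data sets.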
• Ethics – the moral handling of data, e.g., not selling others' private data on to scammers
• People have rights
o privacy
o access
o erasure
o … etc.
• Companies have rights
o ownership of data
o intellectual property
o copyright
o confidentiality
Ethics
• Business models
o Data has become a valuable asset
o Data has become a valuable product
• Companies can link data from different services, by buying out other companies or by establishing new services for other companies to use.
Alphabet: Google, YouTube, Gmail, Android, Chrome, DoubleClick
Facebook: Facebook, Instagram, Oculus VR, WhatsApp, Giphy
Microsoft: Skype, Hotmail, Bing, Windows, Xbox Live/Minecraft/Bethesda, GitHub
Companies using linked data
• Business models
o Multiple departments have separate systems
o Departments interact, so why can't their data?
o Law enforcement needs to know what everyone else knows!
• Problems
o Who should know what?
o How do you manage who should know what?
o What priorities do you give to the rights of people?
Governments using linked data
• What can you do?
• What should you do?
• How do you make sure the right thing is done?
Breaking it down
See: “The curly fry conundrum: Why social media ‘likes’ say more
than you might think” by Jennifer Golbeck
e.g. Target ® predicting which women are pregnant based on their
purchases
• Many things can be predicted from Facebook “likes”
• Homophily (tendency to associate with similar individuals) is important for enabling prediction
• We often don’t own or manage corporate/internet/app
data about ourselves
• The source data is critical for advertisers, so we cannot expect companies to be banned/excluded from using it
• So how can we manage confidentiality?
Confidentiality
• for many apps/websites, you must accept their privacy and data-sharing policies to use their services fully;
• the interface for selecting privacy preferences should
move away from individual Internet platforms and be put
into the hands of individual consumers;
• users could have an open-source agent that brokers their confidentiality preferences
• but would that be feasible and would businesses ever
agree?
Confidentiality (cont.)
See: “Empower consumers to control their privacy in the
Internet of Everything” by Carla Rudder (blog)
1. Corporations: want to use data for business advantage;
‣ opposing consumers
2. Security conscious: concerned with individual freedom, liberty, mass surveillance;
‣ opposing intelligence orgs like the National Security Agency
3. Open data: want open accessibility, support FOI requests;
‣ opposing security experts concerned with leaks
4. Big data and civil rights: concerned about big data and citizens;
‣ opposing data brokers selling consumer data
Politics of Confidentiality
See: “Four political camps in the big data world” by Cathy O’Neil (blog)
See: Facebook Doesn’t Tell Users Everything and Facebook Privacy: Social Network Buys Data
Facebook buys third-party data (from brokers) to obtain a user's activity, income, etc.
• keeps upwards of 52,000 features about users, many provided to advertisers
• the bought data is used as a complement: it is public, offline data, e.g., from Oracle's Datalogix
• but it is not revealed to users
Facebook and Personal Data
See: “Can Facebook influence an election result?” by Michael Brand (ex-Monash, opinion on ABC news via The Conversation) and also
“How Facebook could swing the election” by Caitlin Dewey (article, Washington Post)
• implicit data: Facebook can predict who you will vote for
• their “I voted” button encourages people to vote (as they see which of their friends have)
• studies show it significantly increased voting in the 2010 US election
• they can therefore subtly affect your voting
• could Facebook deploy “I voted” button selectively to favour
certain parties in certain areas?
Facebook and Voting
See: "Machine logic: our lives are ruled by big tech's decisions by data" and "If prejudice lurks among us, can our analytics do any better?"
Predictive models built on large populations are used to filter/make key life decisions like release from jail, treatment in hospital, getting a loan, news/videos you see (e.g., Facebook) …
• ML algorithms do the filtering
• ML algorithms can also produce prejudice (i.e., are biased)
• decisions made en masse, not personalised
• decisions are centralised (who writes the algorithms?)
• perhaps this is OK … perhaps
Population-level Prediction
Philip R. "Phil" Zimmermann,
• creator of the Pretty Good Privacy (PGP) email
encryption software
• Interview in 2013: “the biggest threat to privacy was Moore’s Law
… the ability of computers to track us doubles every
eighteen months
…The natural flow of technology tends to move in the direction of making surveillance easier”
Zimmermann's law
Australian govt interface:
• Australian JobSearch
• Australian Taxation Office
• Centrelink
• Child Support
• Department of Health Applications Portal
• Department of Veterans' Affairs
• HousingVic Online Services
• Medicare
• My Aged Care
• My Health Record
• National Disability Insurance Scheme
• National Redress Scheme
• State Revenue Office Victoria
Government linked data
• My.gov.au provides the public with access to their data
o Greater dependency on online interfaces
o Less pen and paper data processing
o More automation of processing
o Cf. RoboDebt, Census
• Less clear what access each government can have to
the data
Government data access
• "require some telecommunications service providers to retain specific telecommunications data (the data set) relating to the services they offer for at least 2 years"
o Who talks to whom on the phone & when
o Who emails whom & when
o The IP address
• What doesn't it include?
o information about telecommunications content or web browsing history
• Who has access to the data without a warrant?
o 20 intelligence agencies, criminal law enforcement agencies, ATO, ASIC and ACCC
o Civil litigation exemption
(Australian) Data retention laws
https://www.homeaffairs.gov.au/about-us/our-portfolios/national-security/lawful-access-telecommunications/data-retention-obligations
• Rights vs functionality
• Change in responsibilities
o Change in processes and technology in response
• Where do automation and AI fit?
o Where is the responsibility and accountability?
o Snowden and the NSA surveillance
Data retention laws – issues
End of Ethics of Linked Data
Week 6: Issues with Data Science
AI Veracity
• Various factors can affect the “accuracy” of any analysis
o Data quality
o Choice of analysis
o Design of analysis
o Choice of data
• It is easy for the modelling to misrepresent what the data is supposed to reflect.
o Even statistical analysis can be biased!
Can you trust the analysis?
Chris is an excellent driver. They have applied for new car insurance, and an ML system automatically evaluates their application. What personal data should be considered?
a) Driving record?
b) Payment metrics?
c) Location?
Should the system reject the application purely due to where Chris lives?
https://www.crimestatistics.vic.gov.au/crime-statistics/latest-crime-data-by-area
Question
Google trains ML systems to recognize some common items in pictures. What do you think it
thought was in these hands in 2020?
a) Banana
b) Gun
c) Monocular
d) Tool
Question
https://algorithmwatch.org/en/story/google-vision-racism/
• Not all bias is in the numbers
• Bias can also be in how you have designed the
research
o Are the variables appropriate for all situations being modelled?
o Are assumptions made about the stakeholders who the data relates to?
o Are assumptions being made about the context of the data?
Bias of design
• Sometimes the data used to train an ML system is biased, regardless of its volume
o Narrow
o Regional
o Undertested in varied contexts
• A biased system may discriminate in its results, for instance by
o gender
o ethnic associations
o generalities
• A biased system may not be as accurate in its results for unfamiliar contexts and subjects (a basic check is sketched after this slide)
Bias of data
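A minimal sketch (not from the slides; pandas is assumed and the column and group names are illustrative) of one basic check for this kind of data bias: compare a model's accuracy per group rather than only overall.

# Sketch: checking whether predictions are less accurate for some groups.
# Assumes a DataFrame with a sensitive attribute "group", true labels "y_true"
# and model predictions "y_pred" (all illustrative names).
import pandas as pd

def accuracy_by_group(df: pd.DataFrame, group_col: str = "group") -> pd.Series:
    """Prediction accuracy for each value of the sensitive attribute."""
    correct = df["y_true"] == df["y_pred"]
    return correct.groupby(df[group_col]).mean()

# Toy data: the model is right for every "A" case but wrong for every "B" case.
df = pd.DataFrame({
    "group":  ["A", "A", "A", "B", "B", "B"],
    "y_true": [1, 0, 1, 1, 0, 1],
    "y_pred": [1, 0, 1, 0, 1, 0],
})
print(accuracy_by_group(df))  # A: 1.0, B: 0.0 -> a red flag worth investigating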
• Bias like this can appear in any automated processing
o Google: Shows ads for high paying jobs to men more than women
o Jailtime: Sees black Americans as more at risk of reoffending than white Americans
o Student applications: ML trained on past admission decisions can pick up the bias in that decision process and add it to the system
• Automated systems will only be as good as the underlying data
Not just about image recognition
https://towardsdatascience.com/bias-in-artificial-intelligence-a3239ce316c9
https://www.fastcompany.com/90342596/schools-are-quietly-turning-to-ai-to-help-pick-who-gets-in-what-could-go-wrong
• Automated systems may speed up the processes, but humans are better at understanding the context
o Human-in-the-loop
• Need the human perspective in the design, understanding and review of the process, how it is utilised and its results
Great responsibilities Great transparency
Human perspective
• Need to incorporate legal requirements into the system
o Not discriminating by race, gender, sex, age, etc unless allowed
• Need to respect the rights of the individual
• Privacy-by-design
o Factor the right to privacy into the design of any DS system, not as an afterthought.
o Also factor rights and legal requirements into how any system is used
Legality by design
End of AI Veracity
Week 6: Issues with Data Science
Sampling
• When collecting data for processing, it has to be relevant
o Can you get all data relating to the scenario you are modelling?
o Can you only get a random sample of data? The sample data has to be representative of the population being modelled
o How large a sample do you need?
o What known variables are included in the data?
o Is the sample data distributed to match the required strata/categories? (a sketch follows this slide)
• Observe the population before you make any unqualified assumptions
Sampling populations
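A minimal sketch (not from the slides; pandas is assumed and the column names are made up) of keeping a sample's distribution matched to the required strata by sampling the same fraction from every stratum.

# Sketch: stratified sampling so the sample mirrors the population's strata.
import pandas as pd

def stratified_sample(population: pd.DataFrame, stratum_col: str,
                      frac: float, seed: int = 42) -> pd.DataFrame:
    """Sample the same fraction from every stratum, preserving each stratum's share."""
    return population.groupby(stratum_col).sample(frac=frac, random_state=seed)

# Toy population: 70% "north", 30% "south"
population = pd.DataFrame({
    "region": ["north"] * 700 + ["south"] * 300,
    "income": range(1000),
})
sample = stratified_sample(population, "region", frac=0.1)
print(sample["region"].value_counts(normalize=True))  # ~0.7 north, ~0.3 south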
• Blind experiments or A/B testing may be used to show whether there is a relationship between variables
• The experimental scenario needs to be divided into:
o A: Sample is subjected to the known variable
o B: Sample is not subjected to the known variable (the Control set)
• The validity of the hypothesis is based on whether A has a different response to B, where the response is the target variable (a small sketch follows this slide).
https://en.wikipedia.org/wiki/A/B_testing#/media/File:A-B_testing_example.png (CC BY-SA 4.0)
A/B testing
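A minimal sketch (names, group sizes and outcomes are all invented) of randomly assigning a sample to the A (treatment) and B (control) groups and then comparing the target variable.

# Sketch: random assignment for an A/B test, then compare the target variable.
import random

random.seed(0)
users = [f"user{i}" for i in range(100)]   # hypothetical sample
random.shuffle(users)
group_a, group_b = users[:50], users[50:]  # A: subjected to the variable, B: control

# Pretend we later observe a conversion (True/False) for each user (made-up outcomes)
conversion = {u: random.random() < (0.3 if u in group_a else 0.2) for u in users}

rate_a = sum(conversion[u] for u in group_a) / len(group_a)
rate_b = sum(conversion[u] for u in group_b) / len(group_b)
print(f"A: {rate_a:.2f}  B: {rate_b:.2f}")  # a difference still needs a significance test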
How much of a difference in results is enough?
• Must test the statistical significance
o p-value: units of chance of your "surprise" (0 to 1), considering how likely you could get the same results regardless of the hypothesis
• Hypothesis: Aspirin reduces heart attacks
o Sample: studied 100 men for 5 years
Group HA: 50 men take aspirin daily
Group HP: 50 men take placebo daily (control)
o Results (a worked sketch follows this slide):
‣ High p: HA 4 heart attacks, HP 5 heart attacks, so both around 1 in 10 men
‣ Low p: HP 10, HA 1, so very different and significant!
Significance testing
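A minimal sketch of putting a p-value on the aspirin example above; Fisher's exact test from scipy is one common choice for a 2x2 table, though the slides do not prescribe a particular test.

# Sketch: significance test for the aspirin example (counts taken from the slide).
from scipy.stats import fisher_exact

# Contingency tables: rows = aspirin / placebo, columns = heart attack / no heart attack
similar   = [[4, 46], [5, 45]]    # HA 4/50 vs HP 5/50  -> groups look alike
different = [[1, 49], [10, 40]]   # HA 1/50 vs HP 10/50 -> groups look very different

_, p_similar = fisher_exact(similar)
_, p_different = fisher_exact(different)
print(f"similar groups:   p = {p_similar:.2f}")    # high p, not significant
print(f"different groups: p = {p_different:.3f}")  # low p (below 0.05), significant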
• How much difference is enough? (p < 0.05?)
• More data gives a more accurate impression, but how much is enough?
• Should you publish experimental results that challenge previous runs of the same experiment?
o Negative results shouldn't be forgotten
o Old experiments may be flawed
o New data may better reflect the context
• Can you cross-validate your results?
o k-fold testing: experiment with k combinations of test and training data (see the sketch after this slide)
Significance chasing
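A minimal sketch of k-fold cross-validation using scikit-learn; the dataset and model here are placeholders, not from the slides. The experiment is repeated on k different train/test splits, and the spread of scores shows how stable the result is.

# Sketch: k-fold cross-validation to check that a result is not a one-off split.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)           # placeholder dataset
model = LogisticRegression(max_iter=1000)   # placeholder model

scores = cross_val_score(model, X, y, cv=5)  # k = 5 combinations of train/test data
print("per-fold accuracy:", scores.round(3))
print("mean accuracy:", scores.mean().round(3), "+/-", scores.std().round(3))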
• Is data science interested in finding patterns in data (observation) or experimentation (testing outcomes)?
• Both models and theories/hypotheses are research artefacts
o Need to demonstrate how they match evidence
o The scientific method isn't the only valid research methodology!
• Still need to make sure any modelling or other research
outcomes are valid!
Challenging the scientific method?
• In 2009 Google claimed it worked out a correlation between some search terms and a growth in flu cases
o Could identify the trends 2 weeks before it became a health problem!
• But this has problems!
o Not openly sharing their methods – IP!
o Not openly sharing their data – privacy and proprietary
o Inconsistent in temporal perspectives
o Overestimates the infections!
• “greater value can be obtained by combining GFT with other near–real time health data”
Google Flu Trends
Lazer, D., R. Kennedy, G. King, and A. Vespignani. 2014. “The Parable of Google Flu: Traps in Big Data Analysis.” Science 343 (6176) (March 14): 1203–1205.
• Data science allows us to expand what we can do with data
o Growth laws
o Dealing with the Vs
• Data science allows us to reinterpret scenarios
o New ways to approach old problems
• Data science is not standalone
o Combine with existing methods
o Human-in-the-loop
• Data science doesn't have to be just about making better models
o Use data science to solve real problems
Data science and society
End of Sampling
Week 6: Issues with Data Science
Future of Data Science
https://www.gartner.com/smarterwithgartner/5-trends-drive-the-gartner-hype-cycle-for-emerging-technologies-2020/
Gartner’s hype cycle
• Traditional technology reaches its limits
• DNA storage becomes a reality
• Expansion of electronic physical experiences
• Farms and factories face automation
• CIOs become Chief Operating Officers
• Change is driven by recording work conversations
• Increase in freelance customer service experts
• More attention to a “voice of society” metric in organisations
• On-site childcare entices employees
• Handling malicious content becomes a priority
Gartner’s predictions for 2021+
https://www.gartner.com/smarterwithgartner/gartner-top-10-strategic-predictions-for-2021-and-beyond/
The 2021 Hype Cycle for Emerging Technologies
https://www.gartner.com/smarterwithgartner/3-themes-surface-in-the-2021-hype-cycle-for-emerging-technologies
• Growth of big data technologies has allowed multiple types
of data to be combined
o Structured and unstructured data, e.g., sales records and customer feedback
o Multimedia, e.g., video and textual data, image and textual data
• Growth of IT has allowed better processing capability
(Moore’s Law)
o New ways to use multiple models relating to different data sets (Bell's Law), e.g., visual interpretation of gestures and
audio interpretation of speech vs world knowledge
o ML using neural networks (NN) and deep learning
Combining data
• Very much a data science process
o Gather data
o Analyse data
o Produce conclusions
o Make decisions
o Act on the decisions
• Many uses
o Manufacturing "robots"
o Robotic vacuum cleaner
o Adaptive energy systems
o Chatbot
o Stock market agent
o Independent agents in modelling, e.g., public behaviour during a pandemic
Autonomous devices
• For instance,
o Drones & other aircraft (autopilot!)
o Trucks on freeways & mining sites
o Trains
o Suburban cars
• Collect data from various sources
o Local: speed limits
o Internal: sensors, cameras, radar
o External: road maps, weather
• Actions
o Plans: routes, known objectives
o Instinct: dynamic, adaptable responses, preempting actions of other entities
Autonomous vehicles
Gartner’s Legal Tech Hype Curve