Week 6: Issues with Data Science
Data Management Maturity
Data governance and data management are often used
interchangeably.
Better to treat them as separate levels
• Data Management is what you do to handle the data
o Resources, practices, enacting policies
• Data Governance is making sure that it is done
appropriately
o Policies, training, providing resources
o Planning and understanding
Governance and management
DCC data (curation) lifecycle model
https://www.dcc.ac.uk/guidance/curation-lifecycle-model
Capability Maturity Model
• Good management happens all through the data lifecycle
• 4 key process areas:
o Data acquisition, processing and quality assurance
Goal: Reliably capture and describe scientific data in a way that facilitates preservation and reuse
o Data description and representation
Goal: Create quality metadata for data discovery, preservation, and provenance functions
o Data dissemination
Goal: Design and implement interfaces for users to obtain and interact with data
o Repository services/preservation
Goal: Preserve collected data for long-term use
• Good data governance uses a good management system
o A mature system manages data all through the data lifecycle and throughout all projects.
K Crowston & J Qin (2011) A Capability Maturity Model for Scientific Data Management: Evidence from the Literature. Proceedings of the American Society for Information Science & Technology, V48.
https://asistdl.onlinelibrary.wiley.com/doi/full/10.1002/meet.2011.1450480103
Capability Maturity Model
• Data management and governance are not things arranged just for each project.
• They should be universal in how an organisation thinks about and approaches data
o at all times
o in all divisions
o in all projects
o for all stakeholders
Universality
End of Data Management Maturity
Week 6: Issues with Data Science
Ethics of Linked Data
• Connecting elements within multiple structured data sets
• Allows data relating to an element to be collected from multiple data sets
• Expands the knowledge base of a single data set
• Linked Open Data (LOD) allows the links and data to be freely shared and accessed
o Used by companies, but they tend not to contribute their own data
Linked Data
Sir Tim Berners-Lee, the inventor of the WWW and HTML, wanted a semantic web, using linked data:
1. Name/identify things with URIs
2. Use HTTP URIs so things can be looked up
3. Standardise the format of data about things with URIs
4. On the web, use the URIs when mentioning things
CC BY 2.0 https://www.flickr.com/photos/tamaleaver/7674657708
Semantic Web
• The Resource Description Framework (RDF) is a language for representing (subject, predicate, object) triples, which are used to represent semantics. It is a core representation language for Linked Open Data and the Semantic Web.
• RDF can be serialised in different formats, for instance as XML or simply as line-delimited lists of triples; a parsing sketch follows the example below.
<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:schema="http://schema.org/">
  <rdf:Description rdf:about="http://example.org/bob#me">
    <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person"/>
    <schema:birthDate rdf:datatype="http://www.w3.org/2001/XMLSchema#date">1990-07-04</schema:birthDate>
    <foaf:knows rdf:resource="http://example.org/alice#me"/>
    <foaf:topic_interest rdf:resource="http://www.wikidata.org/entity/Q12418"/>
  </rdf:Description>
  <rdf:Description rdf:about="http://www.wikidata.org/entity/Q12418">
    <dcterms:title>Mona Lisa</dcterms:title>
    <dcterms:creator rdf:resource="http://dbpedia.org/resource/Leonardo_da_Vinci"/>
  </rdf:Description>
  <rdf:Description rdf:about="http://data.europeana.eu/item/04802/243FA8618938F4117025F17A8B813C5F9AA4D619">
    <dcterms:subject rdf:resource="http://www.wikidata.org/entity/Q12418"/>
  </rdf:Description>
</rdf:RDF>
https://www.w3.org/TR/rdf11-primer/
Format of linked (open) data
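A minimal sketch (not part of the original slides) of how the RDF/XML example above could be read in Python with the rdflib library; the file name bob.rdf and variable names are assumptions for illustration only.

# Sketch: parsing the RDF/XML example above with rdflib (assumed to be installed).
from rdflib import Graph, URIRef

g = Graph()
g.parse("bob.rdf", format="xml")  # hypothetical file holding the RDF/XML above

# Every statement in the graph is a (subject, predicate, object) triple
for s, p, o in g:
    print(s, p, o)

# Ask which topics Bob is interested in (foaf:topic_interest)
bob = URIRef("http://example.org/bob#me")
interest = URIRef("http://xmlns.com/foaf/0.1/topic_interest")
for topic in g.objects(bob, interest):
    print("Bob is interested in:", topic)  # http://www.wikidata.org/entity/Q12418 (Mona Lisa)

Because every element is named by a URI, the same kind of query can follow the Q12418 link out to Wikidata or Europeana, which is what makes linked (open) data easy to combine across data sets.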
• Ethics – the moral handling of data, e.g., not selling others' private data on to scammers
• People have rights
o privacy
o access
o erasure
o … etc.
• Companies have rights
o ownership of data
o intellectual property
o copyright
o confidentiality
Ethics
• Business models
o Data has become a valuable asset
o Data has become a valuable product
• Companies can link data from different services, by buying out other companies or by establishing new services for other companies to use.
Alphabet: Google, YouTube, Gmail, Android, Chrome, DoubleClick
Facebook: Facebook, Instagram, Oculus VR, WhatsApp, Giphy
Microsoft: Skype, Hotmail, Bing, Windows, Xbox Live/Minecraft/Bethesda, GitHub
Companies using linked data
• Business models
o Multiple departments have separate systems
o Departments interact, so why can't their data?
o Law enforcement needs to know what everyone else knows!
• Problems
o Who should know what?
o How do you manage who should know what?
o What priorities do you give to the rights of people?
Governments using linked data
• What can you do?
• What should you do?
• How do you make sure the right thing is done?
Breaking it down
See: “The curly fry conundrum: Why social media ‘likes’ say more
than you might think” by Jennifer Golbeck
e.g. Target ® predicting which women are pregnant based on their
purchases
• Many things can be predicted from Facebook “likes”
• Homophily (tendency to associate with similar individuals) is important for enabling prediction
• We often don’t own or manage corporate/internet/app
data about ourselves
• The source data is critical for advertisers, so we cannot expect companies to be banned/excluded from using it
• So how can we manage confidentiality?
Confidentiality
• for many apps/websites, you must accept their privacy and data-sharing policies to use their services fully;
• the interface for selecting privacy preferences should
move away from individual Internet platforms and be put
into the hands of individual consumers;
• users could have an open-source agent that brokers their confidentiality preferences
• but would that be feasible and would businesses ever
agree?
Confidentiality (cont.)
See: “Empower consumers to control their privacy in the
Internet of Everything” by Carla Rudder (blog)
1. Corporations: want to use data for business advantage;
‣ opposing consumers
2. Security conscious: concerned with individual freedom, liberty, mass surveillance;
‣ opposing intelligence orgs like the National Security Agency
3. Open data: want open accessibility, support FOI requests;
‣ opposing security experts concerned with leaks
4. Big data and civil rights: concerned about big data and citizens;
‣ opposing data brokers selling consumer data
Politics of Confidentiality
See: “Four political camps in the big data world” by Cathy O’Neil (blog)
See: Facebook Doesn’t Tell Users Everything and Facebook Privacy: Social Network Buys Data
Facebook buys third-party data (from brokers) to obtain a user's activity, income, etc.
• keeps upwards of 52,000 features about users, many provided to advertisers
• the bought data is used as a complement: it is public, offline data, e.g., from Oracle's Datalogix
• but it is not revealed to users
Facebook and Personal Data
See: “Can Facebook influence an election result?” by Michael Brand (ex-Monash, opinion on ABC news via The Conversation) and also
“How Facebook could swing the election” by Caitlin Dewey (article, Washington Post)
• implicit data: Facebook can predict who you will vote for
• their “I voted” button encourages people to vote (as they see which of their friends have)
• studies show it significantly increased voting in the 2010 US election
• they can therefore subtly affect your voting
• could Facebook deploy “I voted” button selectively to favour
certain parties in certain areas?
Facebook and Voting
See: "Machine logic: our lives are ruled by big tech's decisions by data" and "If prejudice lurks among us, can our analytics do any better?"
Predictive models built on large populations are used to filter/make key life decisions like release from jail, treatment in hospital, getting a loan, news/videos you see (e.g., Facebook) …
• ML algorithms do the filtering
• ML algorithms can also produce prejudice (i.e., are biased)
• decisions made en masse, not personalised
• decisions are centralised (who writes the algorithms?)
• perhaps this is OK … perhaps
Population-level Prediction
Philip R. "Phil" Zimmermann,
• creator of the Pretty Good Privacy (PGP) email
encryption software
• Interview in 2013: “the biggest threat to privacy was Moore’s Law
… the ability of computers to track us doubles every
eighteen months
…The natural flow of technology tends to move in the direction of making surveillance easier”
Zimmermann's law
Australian govt interface:
• Australian JobSearch
• Australian Taxation Office
• Centrelink
• Child Support
• Department of Health Applications Portal
• Department of Veterans' Affairs
• HousingVic Online Services
• Medicare
• My Aged Care
• My Health Record
• National Disability Insurance Scheme
• National Redress Scheme
• State Revenue Office Victoria
Government linked data
• My.gov.au provides the public with access to their data
o Greater dependency on online interfaces
o Less pen and paper data processing
o More automation of processing
o Cf. RoboDebt, Census
• Less clear what access each government can have to
the data
Government data access
• "require some telecommunications service providers to retain specific telecommunications data (the data set) relating to the services they offer for at least 2 years"
o Who talks to whom on the phone & when
o Who emails whom & when
o The IP address
• What doesn't it include?
o information about telecommunications content or web browsing history
• Who has access to the data without a warrant?
o 20 intelligence agencies, criminal law enforcement agencies, ATO, ASIC and ACCC
o Civil litigation exemption
(Australian) Data retention laws
https://www.homeaffairs.gov.au/about-us/our-portfolios/national-security/lawful-access-telecommunications/data-retention-obligations
• Rights vs functionality
• Change in responsibilities
o Change in processes and technology in response
• Where do automation and AI fit?
o Where is the responsibility and accountability?
o Snowden and the NSA surveillance
Data retention laws – issues
End of Ethics of Linked Data
Week 6: Issues with Data Science
AI Veracity
• Various factors can affect the “accuracy” of any analysis
o Data quality
o Choice of analysis
o Design of analysis
o Choice of data
• It is easy for the modelling to misrepresent what the data is supposed to reflect.
o Even statistical analysis can be biased!
Can you trust the analysis?
Chris is an excellent driver. They have applied for new car insurance, and an ML system automatically evaluates their application. What personal data should be considered?
a) Driving record?
b) Payment metrics?
c) Location?
Should the system reject the application purely due to where Chris lives?
https://www.crimestatistics.vic.gov.au/crime-statistics/latest-crime-data-by-area
Question
Google trains ML systems to recognize some common items in pictures. What do you think it
thought was in these hands in 2020?
a) Banana
b) Gun
c) Monocular
d) Tool
Question
https://algorithmwatch.org/en/story/google-vision-racism/
• Not all bias is in the numbers
• Bias can also be in how you have designed the
research
o Are the variables appropriate for all situations being modelled?
o Are assumptions made about the stakeholders who the data relates to?
o Are assumptions being made about the context of the data?
Bias of design
• Sometimes the data used to train an ML system is biased, regardless of its volume
o Narrow
o Regional
o Undertested in varied contexts
• A biased system may discriminate in its results, for instance by
o gender
o ethnic associations
o generalities
• A biased system may not be as accurate in its results for unfamiliar contexts and subjects (a basic check is sketched after this slide)
Bias of data
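A minimal sketch (not from the slides; pandas is assumed and the column and group names are illustrative) of one basic check for this kind of data bias: compare a model's accuracy per group rather than only overall.

# Sketch: checking whether predictions are less accurate for some groups.
# Assumes a DataFrame with a sensitive attribute "group", true labels "y_true"
# and model predictions "y_pred" (all illustrative names).
import pandas as pd

def accuracy_by_group(df: pd.DataFrame, group_col: str = "group") -> pd.Series:
    """Prediction accuracy for each value of the sensitive attribute."""
    correct = df["y_true"] == df["y_pred"]
    return correct.groupby(df[group_col]).mean()

# Toy data: the model is right for every "A" case but wrong for every "B" case.
df = pd.DataFrame({
    "group":  ["A", "A", "A", "B", "B", "B"],
    "y_true": [1, 0, 1, 1, 0, 1],
    "y_pred": [1, 0, 1, 0, 1, 0],
})
print(accuracy_by_group(df))  # A: 1.0, B: 0.0 -> a red flag worth investigating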
• Bias like this can appear in any automated processing
o Google: Shows ads for high paying jobs to men more than women
o Jailtime: Sees black Americans as more at risk of reoffending than white Americans
o Student applications: ML trained on past admission decisions can pick up the bias in that decision process and add it to the system
• Automated systems will only be as good as the underlying data
Not just about image recognition
https://towardsdatascience.com/bias-in-artificial-intelligence-a3239ce316c9
https://www.fastcompany.com/90342596/schools-are-quietly-turning-to-ai-to-help-pick-who-gets-in-what-could-go-wrong
• Automated systems may speed up the processes, but humans are better at understanding the context
o Human-in-the-loop
• Need the human perspective in the design, understanding and review of the process, how it is utilised and its results
Great responsibilities Great transparency
Human perspective
• Need to incorporate legal requirements into the system
o Not discriminating by race, gender, sex, age, etc unless allowed
• Need to respect the rights of the individual
• Privacy-by-design
o Factor the right to privacy into the design of any DS system, not as an afterthought.
o Also factor rights and legal requirements into how any system is used
Legality by design
End of AI Veracity
Week 6: Issues with Data Science
Sampling
• When collecting data for processing, it has to be relevant
o Can you get all data relating to the scenario you are modelling?
o Can you only get a random sample of data? The sample data has to be representative of the population being modelled
o How large a sample do you need?
o What known variables are included in the data?
o Is the sample data distributed to match the required strata/categories? (a sketch follows this slide)
• Observe the population before you make any unqualified assumptions
Sampling populations
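A minimal sketch (not from the slides; pandas is assumed and the column names are made up) of keeping a sample's distribution matched to the required strata by sampling the same fraction from every stratum.

# Sketch: stratified sampling so the sample mirrors the population's strata.
import pandas as pd

def stratified_sample(population: pd.DataFrame, stratum_col: str,
                      frac: float, seed: int = 42) -> pd.DataFrame:
    """Sample the same fraction from every stratum, preserving each stratum's share."""
    return population.groupby(stratum_col).sample(frac=frac, random_state=seed)

# Toy population: 70% "north", 30% "south"
population = pd.DataFrame({
    "region": ["north"] * 700 + ["south"] * 300,
    "income": range(1000),
})
sample = stratified_sample(population, "region", frac=0.1)
print(sample["region"].value_counts(normalize=True))  # ~0.7 north, ~0.3 south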
• Blind experiments or A/B testing may be used to show whether there is a relationship between variables
• The experimental scenario needs to be divided into:
o A: Sample is subjected to the known variable
o B: Sample is not subjected to the known variable (the Control set)
• The validity of the hypothesis is based on whether A has a different response to B, where the response is the target variable (a small sketch follows this slide).
https://en.wikipedia.org/wiki/A/B_testing#/media/File:A-B_testing_example.png (CC BY-SA 4.0)
A/B testing
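A minimal sketch (names, group sizes and outcomes are all invented) of randomly assigning a sample to the A (treatment) and B (control) groups and then comparing the target variable.

# Sketch: random assignment for an A/B test, then compare the target variable.
import random

random.seed(0)
users = [f"user{i}" for i in range(100)]   # hypothetical sample
random.shuffle(users)
group_a, group_b = users[:50], users[50:]  # A: subjected to the variable, B: control

# Pretend we later observe a conversion (True/False) for each user (made-up outcomes)
conversion = {u: random.random() < (0.3 if u in group_a else 0.2) for u in users}

rate_a = sum(conversion[u] for u in group_a) / len(group_a)
rate_b = sum(conversion[u] for u in group_b) / len(group_b)
print(f"A: {rate_a:.2f}  B: {rate_b:.2f}")  # a difference still needs a significance test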
How much of a difference in results is enough?
• Must test the statistical significance
o p-value: units of chance of your "surprise" (0 to 1), considering how likely you could get the same results regardless of the hypothesis
• Hypothesis: Aspirin reduces heart attacks
o Sample: studied 100 men for 5 years
Group HA: 50 men take aspirin daily
Group HP: 50 men take placebo daily (control)
o Results (a worked sketch follows this slide):
‣ High p: HA 4 heart attacks, HP 5 heart attacks, so both around 1 in 10 men
‣ Low p: HP 10, HA 1, so very different and significant!
Significance testing
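A minimal sketch of putting a p-value on the aspirin example above; Fisher's exact test from scipy is one common choice for a 2x2 table, though the slides do not prescribe a particular test.

# Sketch: significance test for the aspirin example (counts taken from the slide).
from scipy.stats import fisher_exact

# Contingency tables: rows = aspirin / placebo, columns = heart attack / no heart attack
similar   = [[4, 46], [5, 45]]    # HA 4/50 vs HP 5/50  -> groups look alike
different = [[1, 49], [10, 40]]   # HA 1/50 vs HP 10/50 -> groups look very different

_, p_similar = fisher_exact(similar)
_, p_different = fisher_exact(different)
print(f"similar groups:   p = {p_similar:.2f}")    # high p, not significant
print(f"different groups: p = {p_different:.3f}")  # low p (below 0.05), significant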
• How much difference is enough? (p < 0.05?)
• More data gives a more accurate impression, but how much is enough?
• Should you publish experimental results that challenge previous runs of the same experiment?
o Negative results shouldn't be forgotten
o Old experiments may be flawed
o New data may better reflect the context
• Can you cross-validate your results?
o k-fold testing: experiment with k combinations of test and training data (see the sketch after this slide)
Significance chasing
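A minimal sketch of k-fold cross-validation using scikit-learn; the dataset and model here are placeholders, not from the slides. The experiment is repeated on k different train/test splits, and the spread of scores shows how stable the result is.

# Sketch: k-fold cross-validation to check that a result is not a one-off split.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)           # placeholder dataset
model = LogisticRegression(max_iter=1000)   # placeholder model

scores = cross_val_score(model, X, y, cv=5)  # k = 5 combinations of train/test data
print("per-fold accuracy:", scores.round(3))
print("mean accuracy:", scores.mean().round(3), "+/-", scores.std().round(3))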
• Is data science interested in finding patterns in data (observation) or experimentation (testing outcomes)?
• Both models and theories/hypotheses are research artefacts
o Need to demonstrate how they match evidence
o The scientific method isn't the only valid research methodology!
• Still need to make sure any modelling or other research
outcomes are valid!
Challenging the scientific method?
• In 2009 Google claimed it worked out a correlation between some search terms and a growth in flu cases
o Could identify the trends 2 weeks before it became a health problem!
• But this has problems!
o Not openly sharing their methods – IP!
o Not openly sharing their data – privacy and proprietary
o Inconsistent in temporal perspectives
o Overestimates the infections!
• “greater value can be obtained by combining GFT with other near–real time health data”
Google Flu Trends
Lazer, D., R. Kennedy, G. King, and A. Vespignani. 2014. “The Parable of Google Flu: Traps in Big Data Analysis.” Science 343 (6176) (March 14): 1203–1205.
• Data science allows us to expand what we can do with data
o Growth laws
o Dealing with the Vs
• Data science allows us to reinterpret scenarios
o New ways to approach old problems
• Data science is not standalone
o Combine with existing methods
o Human-in-the-loop
• Data science doesn't have to be just about making better models
o Use data science to solve real problems
Data science and society
End of Sampling
Week 6: Issues with Data Science
Future of Data Science
https://www.gartner.com/smarterwithgartner/5-trends-drive-the-gartner-hype-cycle-for-emerging-technologies-2020/
Gartner’s hype cycle
• Traditional technology reaches its limits
• DNA storage becomes a reality
• Expansion of electronic physical experiences
• Farms and factories face automation
• CIOs become Chief Operating Officers
• Change is driven by recording work conversations
• Increase in freelance customer service experts
• More attention to a “voice of society” metric in organisations
• On-site childcare entices employees
• Handling malicious content becomes a priority
Gartner’s predictions for 2021+
https://www.gartner.com/smarterwithgartner/gartner-top-10-strategic-predictions-for-2021-and-beyond/
The 2021 Hype Cycle for Emerging Technologies
https://www.gartner.com/smarterwithgartner/3-themes-surface-in-the-2021-hype-cycle-for-emerging-technologies
• Growth of big data technologies has allowed multiple types
of data to be combined
o Structured and unstructured data, e.g., sales records and customer feedback
o Multimedia, e.g., video and textual data, image and textual data
• Growth of IT has allowed better processing capability
(Moore’s Law)
o New ways to use multiple models relating to different data sets (Bell's Law), e.g., visual interpretation of gestures and
audio interpretation of speech vs world knowledge
o ML using neural networks (NN) and deep learning
Combining data
• Very much a data science process
o Gather data
o Analyse data
o Produce conclusions
o Make decisions
o Act on the decisions
• Many uses
o Manufacturing "robots"
o Robotic vacuum cleaner
o Adaptive energy systems
o Chatbot
o Stock market agent
o Independent agents in modelling, e.g., public behaviour during a pandemic
Autonomous devices
• For instance,
o Drones & other aircraft (autopilot!)
o Trucks on freeways & mining sites
o Trains
o Suburban cars
• Collect data from various sources
o Local: speed limits
o Internal: sensors, cameras, radar
o External: road maps, weather
• Actions
o Plans: routes, known objectives
o Instinct: dynamic, adaptable responses, preempting actions of other entities
Autonomous vehicles
Gartner’s Legal Tech Hype Curve