Follow a 3-paragraph format: define, explain in detail, then present an actual example via research. Your paper must provide in-depth analysis of all the topics presented:
> Read cases and white papers that talk about Big Data analytics. Present the common theme in those case studies.
> Review the following Big Data Tutorial (attached).
> Choose one of the three applications for big data presented (Recommendation, Social Network Analytics, and Media Monitoring)
> Provide a case study of how a company has implemented the big data application and from your research suggest areas of improvement or expansion.
The paper needs 8-10 pages in APA format with an introduction and conclusion, and must include a minimum of 9 peer-reviewed citations.
Marko Grobelnik, Blaz Fortuna, Dunja Mladenic
Jozef Stefan Institute, Slovenia
Sydney, Oct 22nd 2013
Big-Data in numbers
Big-Data Definitions
Motivation
State of Market
Techniques
Tools
Data Science
Applications ◦ Recommendation, Social networks, Media Monitoring
Concluding remarks
[Infographic slides on time spent online; source: http://www.go-gulf.com/blog/online-time/]
"Big-data" is similar to "small-data", but bigger…
…but having bigger data requires different approaches:
◦ techniques, tools, architectures
…with an aim to solve new problems…
◦ …or old problems in a better way
Volume – challenging to load and process (how to index, retrieve)
Variety – different data types and degrees of structure (how to query semi-structured data)
Velocity – real-time processing influenced by rate of data arrival
From “Understanding Big Data” by IBM
1. Volume (lots of data = "Tonnabytes")
2. Variety (complexity, curse of dimensionality)
3. Velocity (rate of data and information flow)
4. Veracity (verifying inference-based models from comprehensive data collections)
5. Variability
6. Venue (location)
7. Vocabulary (semantics)
[Google Trends charts: comparing the query volume of "big data" and "data mining", then adding "web 2.0" to the comparison]
Big-Data
Key enablers for the appearance and growth of “Big Data” are:
◦ Increase of storage capacities
◦ Increase of processing power
◦ Availability of data
Source: WikiBon report on “Big Data Vendor Revenue and Market Forecast 2012-2017”, 2013
…when the operations on data are complex:
◦ …e.g. simple counting is not a complex problem
◦ Modeling and reasoning with data of different kinds can get extremely complex
Good news about big-data:
◦ Often, because of the vast amount of data, modeling techniques can get simpler (e.g. smart counting can replace complex model-based analytics)…
◦ …as long as we deal with the scale
Research areas (such as IR, KDD, ML, NLP, SemWeb, …) are sub-cubes within the data cube
[Data-cube dimensions: Scalability, Streaming, Context, Quality, Usage]
A risk with "Big-Data mining" is that an analyst can "discover" patterns that are meaningless.
Statisticians call it Bonferroni's principle:
◦ Roughly, if you look in more places for interesting patterns than your amount of data will support, you are bound to find crap
Example: we want to find (unrelated) people who at least twice have stayed at the same hotel on the same day
◦ 10^9 people being tracked
◦ 1,000 days
◦ Each person stays in a hotel 1% of the time (1 day out of 100)
◦ Hotels hold 100 people (so there are 10^5 hotels)
◦ If everyone behaves randomly (i.e., no terrorists), will the data mining detect anything suspicious?
Expected number of "suspicious" pairs of people: 250,000
◦ …far too many combinations to check – we need some additional evidence to find "suspicious" pairs of people in a more efficient way
Example taken from: Rajaraman, Ullman: Mining of Massive Datasets
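A short worked derivation of that 250,000 figure, using only the assumptions stated in the example above:

```latex
% Expected number of chance "suspicious" pairs, from the assumptions above.
\begin{align*}
P(\text{two given people are both in some hotel on a given day})
    &= 0.01 \times 0.01 = 10^{-4}\\
P(\ldots\text{and it is the same hotel})
    &= 10^{-4} \times \tfrac{1}{10^{5}} = 10^{-9}\\
P(\text{same hotel on two given days})
    &= \left(10^{-9}\right)^{2} = 10^{-18}\\
\#\text{pairs of people} &\approx \tbinom{10^{9}}{2} \approx 5 \times 10^{17}\\
\#\text{pairs of days} &\approx \tbinom{1000}{2} \approx 5 \times 10^{5}\\
E[\#\text{suspicious pairs}] &\approx 5 \times 10^{17} \cdot 5 \times 10^{5} \cdot 10^{-18}
    = 250{,}000
\end{align*}
```

So even with zero real signal, pure chance produces a quarter of a million "suspicious" pairs.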
Smart sampling of data
◦ …reducing the original data while not losing its statistical properties (see the reservoir-sampling sketch after this list)
Finding similar items
◦ …efficient multidimensional indexing
Incremental updating of the models
◦ (vs. building models from scratch)
◦ …crucial for streaming data
Distributed linear algebra
◦ …dealing with large sparse matrices
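As a concrete illustration of the first item, here is a minimal reservoir-sampling sketch in Python (Vitter's classic Algorithm R; the function name and parameters are illustrative, not from the tutorial). It keeps a uniform random sample of fixed size k from a stream without ever storing the stream:

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of
    unknown length, using only O(k) memory (Algorithm R)."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = random.randint(0, i)    # inclusive on both ends
            if j < k:
                reservoir[j] = item     # replace with probability k/(i+1)
    # Every item seen ends up in the sample with probability k/n.
    return reservoir

print(reservoir_sample(range(10**6), 5))   # e.g. [417305, 52987, ...]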
On top of the previous ops we perform the usual data mining / machine learning / statistics operators:
◦ Supervised learning (classification, regression, …)
◦ Unsupervised learning (clustering, different types of decompositions, …)
◦ …
…we are just more careful which algorithms we choose
◦ typically linear or sub-linear versions of the algorithms
An excellent overview of the algorithms covering the above issues is the book “Rajaraman, Leskovec, Ullman: Mining of Massive Datasets”
Downloadable from: http://infolab.stanford.edu/~ullman/mmds.html
Where is processing hosted?
◦ Distributed Servers / Cloud (e.g. Amazon EC2)
Where is data stored?
◦ Distributed Storage (e.g. Amazon S3)
What is the programming model?
◦ Distributed Processing (e.g. MapReduce)
How is data stored & indexed?
◦ High-performance schema-free databases (e.g. MongoDB)
What operations are performed on data?
◦ Analytic / Semantic Processing
http://www.bigdata-startups.com/open-source-tools/
Computing and storage are typically hosted transparently on cloud infrastructures
◦ …providing scale, flexibility and high fail-safety
Distributed Servers
◦ Amazon EC2, Google App Engine, Elastic Beanstalk, Heroku
Distributed Storage
◦ Amazon S3, Hadoop Distributed File System
Distributed processing of Big-Data requires non-standard programming models
◦ …beyond single machines or traditional parallel programming models (like MPI)
◦ …the aim is to simplify complex programming tasks
The most popular programming model is the MapReduce approach
◦ …suitable for commodity hardware to reduce costs
The key idea of the MapReduce approach:
◦ A target problem needs to be parallelizable
◦ First, the problem gets split into a set of smaller problems (Map step)
◦ Next, the smaller problems are solved in parallel
◦ Finally, the solutions to the smaller problems get synthesized into a solution of the original problem (Reduce step)
Worked example: word count with MapReduce

Input documents (news headlines):
◦ "Google Maps charts new territory into businesses"
◦ "Google selling new tools for businesses to build their own maps"
◦ "Google promises consumer experience for businesses with Maps Engine Pro"
◦ "Google is trying to get its Maps service used by more businesses"

Desired output – word counts over all documents: Google 4, Maps 4, Businesses 4, New 1, Charts 1, Territory 1, Tools 1, …

Map step: each task counts the words in its share of the documents
◦ Task 1 (first two documents): Businesses 2, Charts 1, Maps 2, Territory 1, …
◦ Task 2 (last two documents): Businesses 2, Engine 1, Maps 2, Service 1, …

Shuffle step: the map outputs are split according to the hash of a key
◦ In our case: key = word, hash = first character
◦ Reduce 1 receives: Businesses 2, Charts 1, … (from Task 1) and Businesses 2, Engine 1, … (from Task 2)
◦ Reduce 2 receives: Maps 2, Territory 1, … (from Task 1) and Maps 2, Service 1, … (from Task 2)

Reduce step: each reducer sums the counts of the words assigned to it
◦ Reduce 1 outputs: Businesses 4, Charts 1, Engine 1, …
◦ Reduce 2 outputs: Maps 4, Territory 1, Service 1, …

Finally, we concatenate the reducer outputs into the final result: Businesses 4, Charts 1, Engine 1, Maps 4, Territory 1, Service 1, …
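The same flow as a single-machine Python sketch (it only simulates the Map / shuffle / Reduce steps in one process; the helper names are invented for illustration, and this is not Hadoop code):

```python
from collections import Counter
from itertools import chain

docs = [
    "Google Maps charts new territory into businesses",
    "Google selling new tools for businesses to build their own maps",
    "Google promises consumer experience for businesses with Maps Engine Pro",
    "Google is trying to get its Maps service used by more businesses",
]

# Map step: each "task" counts the words in its share of the documents.
def map_task(documents):
    return Counter(w.capitalize() for d in documents for w in d.split())

task_outputs = [map_task(docs[:2]), map_task(docs[2:])]

# Shuffle step: partition each map output by a hash of the key
# (here: based on the first character of the word).
def partition(counter, n_reducers=2):
    parts = [Counter() for _ in range(n_reducers)]
    for word, count in counter.items():
        parts[ord(word[0]) % n_reducers][word] = count
    return parts

# Reduce step: each reducer sums the counts it received from all tasks.
reducers = [Counter(), Counter()]
for task in task_outputs:
    for reducer, part in zip(reducers, partition(task)):
        reducer.update(part)

# Concatenate the reducer outputs into the final result.
result = dict(chain(*(r.items() for r in reducers)))
print(result)   # e.g. {'Google': 4, 'Maps': 4, 'Businesses': 4, ...}
```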
Apache Hadoop [http://hadoop.apache.org/]
◦ Open-source MapReduce implementation
Tools using Hadoop:
◦ Hive: data warehouse infrastructure that provides data summarization and ad hoc querying (HiveQL)
◦ Pig: high-level data-flow language and execution framework for parallel computation (Pig Latin)
◦ Mahout: scalable machine learning and data mining library
◦ Flume: a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data
◦ Many more: Cascading, Cascalog, mrjob, MapR, Azkaban, Oozie, …
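As a small taste of these tools, here is the same word count written with mrjob, one of the Python frameworks listed above; a minimal sketch that assumes mrjob is installed and input text is supplied on the command line:

```python
from mrjob.job import MRJob

class MRWordCount(MRJob):
    """Word count as a MapReduce job; mrjob can run it locally
    or ship the same code to a Hadoop cluster unchanged."""

    def mapper(self, _, line):
        # Map step: emit (word, 1) for every word in the input line.
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # Reduce step: sum the counts shuffled to this key.
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()   # e.g. python word_count.py headlines.txt
```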
The hype around Hadoop today resembles the hype around databases in the nineties.
“[…] need to solve a problem that relational databases are a bad fit for”, Eric Evans
Motives:
◦ Avoidance of Unneeded Complexity – many use-cases require only a subset of the functionality of RDBMSs (e.g. ACID properties)
◦ High Throughput – some NoSQL databases offer significantly higher throughput than RDBMSs
◦ Horizontal Scalability, running on commodity hardware
◦ Avoidance of Expensive Object-Relational Mapping – most NoSQL databases store simple data structures
◦ Compromising Reliability for Better Performance
Based on "NoSQL Databases", Christof Strauch, http://www.christof-strauch.de/nosqldbs.pdf
BASE approach
◦ Stands for "Basically Available, Soft state, Eventual consistency"
◦ Favors availability, graceful degradation and performance
Continuum of consistency tradeoffs:
◦ Strict – all reads must return data from the latest completed writes
◦ Eventual – the system eventually returns the last written value
◦ Read Your Own Writes (RYOW) – a client sees its own updates immediately
◦ Session – RYOW guaranteed only within the same session
◦ Monotonic – a client only sees more recent data in future requests
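To make the gap between "eventual" and "read your own writes" concrete, here is a toy single-process simulation in Python (purely illustrative; the class, the replication delay, and the routing choices are invented for this sketch and are not part of any real database):

```python
import time
import threading

class ToyStore:
    """Toy store: the primary applies writes immediately; a replica
    copies them after a delay, so replica reads can be stale."""
    def __init__(self, lag=0.5):
        self.primary, self.replica, self.lag = {}, {}, lag

    def write(self, key, value):
        self.primary[key] = value
        # Replicate asynchronously: the replica converges "eventually".
        threading.Timer(self.lag, self.replica.__setitem__, (key, value)).start()

    def read_eventual(self, key):
        return self.replica.get(key)          # may return a stale value

    def read_your_own_writes(self, key):
        return self.primary.get(key)          # route session reads to primary

store = ToyStore()
store.write("x", 1)
print(store.read_eventual("x"))          # likely None: not replicated yet
print(store.read_your_own_writes("x"))   # 1, immediately
time.sleep(1)
print(store.read_eventual("x"))          # 1, once replication catches up
```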
Consistent hashing
◦ Use the same function for hashing objects and nodes
◦ Assign each object to the nearest node on the circle
◦ Reassign objects when nodes are added or removed
◦ Replicate objects to the r nearest nodes
White, Tom: Consistent Hashing. Blog post of 2007-11-27. http://weblogs.java.net/blog/tomwhite/archive/2007/11/consistent_hash.html
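A compact Python sketch of the scheme (the class name and md5-based hash are illustrative choices, not from the tutorial). Adding or removing a node only remaps the keys that fall in that node's arc of the circle:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Map any string onto the hash circle [0, 2^32).
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2**32)

class ConsistentHashRing:
    def __init__(self, nodes=()):
        self._ring = []              # sorted list of (position, node)
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str):
        bisect.insort(self._ring, (_hash(node), node))

    def remove_node(self, node: str):
        self._ring.remove((_hash(node), node))

    def get_node(self, key: str) -> str:
        # Walk clockwise from the key's position to the nearest node.
        if not self._ring:
            raise KeyError("empty ring")
        pos = bisect.bisect(self._ring, (_hash(key), ""))
        return self._ring[pos % len(self._ring)][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.get_node("user:42"))   # e.g. "node-b"
ring.add_node("node-d")           # only keys in node-d's arc move
```

Real systems additionally place several "virtual" points per node on the circle to smooth out the load, as described in the blog post cited above.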
Storage Layout
◦ Row-based
◦ Columnar
◦ Columnar with Locality Groups
Query Models
◦ Lookup in key-value stores
◦ Distributed data processing via MapReduce
Lipcon, Todd: Design Patterns for Distributed Non-Relational Databases. Presentation of 2009-06-11. http://www.slideshare.net/guestdfd1ec/design-patterns-for-distributed-nonrelationaldatabases
Key-value stores: a map or dictionary allowing one to add and retrieve values per key
Favor scalability over consistency
◦ Run on clusters of commodity hardware
◦ Component failure is the "standard mode of operation"
Examples:
◦ Amazon Dynamo
◦ Project Voldemort (developed by LinkedIn)
◦ Redis
◦ Memcached (not persistent)
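A minimal sketch of the key-value model using the redis-py client (assumes a Redis server running on localhost; the key names are illustrative):

```python
import redis

# Connect to a local Redis server (default port 6379).
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# The whole query model is: add and retrieve values per key.
r.set("user:42:name", "Ada")
r.set("user:42:last_login", "2013-10-22")
print(r.get("user:42:name"))              # -> "Ada"

# Keys can expire, which suits cache-style use (cf. Memcached).
r.set("session:abc", "token", ex=3600)    # expires in one hour
```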
Document databases combine several key-value pairs into documents
◦ Documents are represented as JSON (example below)
Examples:
◦ Apache CouchDB
◦ MongoDB
" Title " : " CouchDB ",
" Last editor " : "172.5.123.91" ,
" Last modified ": "9/23/2010" ,
" Categories ": [" Database ", " NoSQL ", " Document Database "],
" Body ": " CouchDB is a …" , " Reviewed ": false
Columnar stores use a columnar storage layout with locality groups (column families)
Examples:
◦ Google Bigtable
◦ Hypertable, HBase – open-source implementations of Google Bigtable
◦ Cassandra – a combination of Google Bigtable and Amazon Dynamo, designed for high write throughput
Infrastructure for real-time processing:
◦ Kafka [http://kafka.apache.org/] – a high-throughput distributed messaging system
◦ Hadoop [http://hadoop.apache.org/] – open-source MapReduce implementation
◦ Storm [http://storm-project.net/] – real-time distributed computation system
◦ Cassandra [http://cassandra.apache.org/] – hybrid between a key-value and a row-oriented DB; distributed, decentralized, no single point of failure; optimized for fast writes
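To ground the streaming side of this stack, a hedged sketch of publishing events with the kafka-python client (assumes a broker on localhost:9092; the topic name and event fields are illustrative):

```python
import json
from kafka import KafkaProducer

# Connect to a local Kafka broker; serialize events as JSON bytes.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Producers append events to a topic; downstream consumers
# (e.g. a Storm topology) read them in near real time.
producer.send("page-views", {"user": 42, "url": "/maps", "ts": 1382400000})
producer.flush()   # block until buffered messages are delivered
```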