Tutorial: Big Data Algorithms and
Applications Under Hadoop
KUNPENG ZHANG
SIDDHARTHA BHATTACHARYYA
http://kzhang6.people.uic.edu/tutorial/amcis2014.html
August 7, 2014
Schedule
I. Introduction to big data (8:00 – 8:30)
II. Hadoop and MapReduce (8:30 – 9:45)
III. Coffee break (9:45 – 10:00)
IV. Distributed algorithms and applications (10:00 – 11:40)
V. Conclusion (11:40 – 12:00)
I. Introduction to big data
• What is big data
• Why big data matters to you
• 10 use cases of big data analytics
• Techniques for analyzing big data
What is big data
• Big data is a blanket term for any collection of data sets so
large and complex that it becomes difficult to process using
on-hand data management tools or traditional data processing
applications. [From Wikipedia]
5 Vs of big data
• To get a better understanding of what big data is, it is
often described using 5 Vs: Volume, Velocity, Variety,
Veracity, and Value.
We see increasing volumes of data that grow at
exponential rates
Volume refers to the vast amount of data generated every
second. We are not talking about terabytes but zettabytes or
brontobytes. If we take all the data generated in the world
between the beginning of time and 2008, the same amount of
data will soon be generated every minute. This makes most
data sets too large to store and analyze using traditional
database technology. New big data tools use distributed
systems so we can store and analyze data across databases
that are dotted around everywhere in the world.
We see increasing velocity (or speed) at which
data changes, travels, or increases
Velocity refers to the speed at which new data is generated
and the speed at which data moves around. Just think of
social media messages going viral in seconds. Technology
now allows us to analyze the data while it is being
generated (sometimes referred to as in-memory analytics),
without ever putting it into databases.
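The idea of analyzing data while it is being generated can be illustrated with a minimal sketch. It keeps a running mean and variance using Welford's method, so each value is seen once and never stored. The `RunningStats` class, the `simulated_stream` generator, and the sample values are assumptions made for illustration, not part of the tutorial.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Incrementally tracks count, mean, and variance (Welford's method)."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, value: float) -> None:
        # Fold one new observation into the summary in O(1) time and memory.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    @property
    def variance(self) -> float:
        # Population variance; 0.0 until at least one value has arrived.
        return self.m2 / self.count if self.count else 0.0


def simulated_stream():
    """Hypothetical stand-in for a live feed, e.g. messages per second."""
    yield from [12.0, 15.0, 11.0, 30.0, 14.0]


stats = RunningStats()
for reading in simulated_stream():
    stats.update(reading)  # each value is consumed once and then discarded

print(f"n={stats.count} mean={stats.mean:.2f} variance={stats.variance:.2f}")
```

Only three numbers are kept in memory regardless of how long the stream runs, which is the property that makes this style of summary usable on fast-moving data.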
We see increasing variety of data types
Variety refers to the different types of data we can now
use. In the past we only focused on structured data that
neatly fitted into tables or relational databases, such as
financial data. In fact, 80% of the world’s data is
unstructured (text, images, video, voice, etc.). With big
data technology we can now analyze and bring together data
of different types such as messages, social media
conversations, photos, sensor data, video or voice
recordings.
We see increasing veracity (or accuracy) of data
Veracity refers to the messiness or trustworthiness of data.
With many forms of big data, quality and accuracy are less
controllable (just think of Twitter posts with hashtags,
abbreviations, typos, and colloquial speech, as well as the
reliability and accuracy of content), but technology now
allows us to work with this type of data.
Value – The most important V of all!
There is another V to take into account when looking at big
data: Value.
Having access to big data is no good unless we can turn it
into value.
Companies are starting to generate amazing value from their
big data.
Why big data matters to you
Big data is more prevalent than you think
Big data formats
Competitive advantages gained through big data
Big data job postings
10 use cases of big data analytics
1. Understanding and targeting customers
• Big data is used to better understand customers and their
behaviors and preferences.
– Target can very accurately predict when one of its
customers is expecting a baby;
– Wal-Mart can predict what products will sell;
– Car insurance companies understand how well their
customers actually drive;
– The Obama campaign used big data analytics to win the
2012 presidential election.
Diagram labels: browser logs, social media data, sensor data,
text analytics, predictive models
2. Understanding and optimizing business processes
• Retailers are able to optimize their stock based on
predictions generated from social media data, web search
trends, and weather forecasts;
• Geographic positioning and radio frequency identification
sensors are used to track goods or delivery vehicles and
optimize routes by integrating live traffic data, etc.
3. Personal quantification and performance optimization
• The Jawbone armband collects data on our calorie
consumption, activity levels, and sleep patterns, and
analyzes such volumes of data to bring entirely new insights
that it can feed back to individual users;
• Most online dating sites apply big data tools and
algorithms to find us the most appropriate matches.
4. Improving healthcare and public health
• Big data techniques are already being used to monitor
babies in a specialist premature and sick baby unit;
• Big data analytics allow us to monitor and predict the
developments of epidemics and disease outbreaks;
• By recording and analyzing every heartbeat and breathing
pattern of every baby, infections can be predicted 24 hours
before any physical symptoms appear.
5. Improving sports performance
• Use video analytics to track the performance of every
player;
• Use sensor technology in sports equipment to allow us to
get feedback on games;
• Use smart technology to track athletes outside of the
sporting environment: nutrition, sleep, and social media
conversation.
6. Improving science and research
• CERN, the European nuclear physics lab near Geneva, with
its Large Hadron Collider, the world’s largest and most
powerful particle accelerator, is using thousands of
computers distributed across 150 data centers worldwide to
unlock the secrets of our universe by analyzing its 30
petabytes of data.
7. Optimizing machine and device performance
• Google self-driving car: the Toyota Prius is fitted with
cameras, GPS, powerful computers and sensors to safely
drive without the intervention of human beings;
• Big data tools are also used to optimize energy grids using
data from smart meters.
8. Improving security and law enforcement
• The National Security Agency (NSA) in the U.S. uses big data
analytics to foil terrorist plots (and maybe spy on us);
• Police forces use big data tools to catch criminals and even
predict criminal activity;
• Credit card companies use big data to detect fraudulent
transactions.
9. Improving and optimizing cities and countries
• Smart cities optimize traffic flows based on real time
traffic information as well as social media and weather
data.
10. Financial trading
• The majority of equity trading now takes place via data
algorithms that increasingly take into account signals from
social media networks and news websites to make buy and
sell decisions in split seconds (High-Frequency Trading,
HFT).
Techniques for analyzing big data
Techniques and their applications
• Association rule mining: market basket analysis
• Classification: prediction of customer buying decisions
• Cluster analysis: segmenting consumers into groups
• Crowdsourcing: collecting data from community
• Data fusion and data integration: social media data
combined with real-time sales data to determine what
effect a marketing campaign is having on customer
sentiment and purchasing behavior
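As a rough illustration of association rule mining for market basket analysis, the sketch below counts item pairs and reports rules of the form "lhs -> rhs" whose support and confidence clear a threshold. The baskets, the thresholds, and the `pair_rules` helper are all hypothetical. Only pairs are considered here, whereas full algorithms such as Apriori also handle larger item sets.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, invented for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]


def pair_rules(baskets, min_support=0.3, min_confidence=0.6):
    """Return (lhs, rhs, support, confidence) for qualifying item pairs."""
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in basket)
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # fraction of baskets containing both items
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            # Confidence: of the baskets with lhs, how many also have rhs.
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


for lhs, rhs, support, confidence in pair_rules(baskets):
    print(f"{lhs} -> {rhs}: support={support:.2f} confidence={confidence:.2f}")
```

Support filters out pairs that are too rare to matter, and confidence keeps only the direction of a rule that holds often enough to act on.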
Techniques and their applications
• Ensemble learning
• Genetic algorithms: job scheduling in manufacturing
and optimizing the performance of an investment
portfolio
• Neural networks: identify fraudulent insurance claims
• Natural language processing: sentiment analysis
• Network analysis: identifying key opinion leaders to
target for marketing and identifying bottlenecks in
Techniques and their applications
• Regression: forecasting sales volumes based on various
market and economic variables
• Time series analysis: hourly value of a stock market
index or the number of patients diagnosed with a given
condition every day
• Visualization: understand and improve results of big
data analyses
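The regression item above can be sketched with ordinary least squares on a single predictor. The data (advertising spend versus units sold), the variable names, and the `fit_line` helper are assumptions made for illustration. A real sales forecast would typically use several market and economic variables and a statistics library.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations: ad spend (thousands) and units sold (thousands).
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
units_sold = [12.1, 13.9, 16.2, 17.8, 20.0]

slope, intercept = fit_line(ad_spend, units_sold)
forecast = intercept + slope * 6.0  # predicted sales at a spend of 6.0

print(f"slope={slope:.2f} intercept={intercept:.2f} forecast={forecast:.2f}")
```

The fitted slope estimates how much sales change per unit of spend, and the forecast simply extends that line to a value outside the observed range, which should be treated with caution.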
Big data tools
• Bigtable by Google
• MapReduce by Google
• Cassandra by Apache
• Dynamo by Amazon
• HBase by Apache
• Hadoop by Apache
Visualization tools
• D3.js: http://d3js.org/
• Tag cloud: http://tagcrowd.com/
• Clustergram: http://www.schonlau.net/clustergram.html
• History flow: http://hint.fm/projects/historyflow/
• R: http://www.r-project.org/
• Network visualization (Gephi): http://gephi.github.io/