Create Danny Bickson Unleash Data Science Co-Founder

Create
TM
Unleash Data Science
Danny Bickson
Co-Founder
GraphLab Project History
GraphLab
GraphChi
(2009)
(2011)
GraphLab
Create
(2014)
GraphLab Conferences
2012
2013
2014
Growing community contribution
Data-driven, predictive data apps
are making our world amazing…
Historical data
Real-time
predictions &
decisions
Sensor & interaction data
Recommenders
Social
Industrial Apps
Pers.
Medicine
Human Sensing
Forecasters
Fraud &
Anomaly
Detection
Sentiment
Analysis &
other Text
Apps
Why predictive apps are
the future?
Spectrum of data products
time-to-line-of-business value
Continuously get to value faster!
Was using 217 business rules
hoping world doesn’t change
Have an inspiring idea to
reinvent their business
Building a predictive app
Key pains:
Noisy Space of Tools
Data scientists use a variety of tools, across
different programming languages…
require a lot of context-switching…
affects productivity and impedes reproducibility.
Ben Lorica,
Data Analysis: Just one component of
the Data Science workflow
Hiring Talent
35%
Shortfall in data-savvy workers
needed to make sense out of
big data by 2018 [McKinsey 2011]
Crossing the Big Data Chasm
scale of data
production data
GraphLab Create:
Get aUnleashing
Hadoop
cluster!?!?
data
science
big data chasm
from inspiration to production
single machine memory
speed of iteration
Create
TM
Data scientist: inspiration to production
Prototype
Use my laptop
Variety of data
Not toy data scales
Language I love
Iterate quickly
Production
Clean
Learn
Monitor
Deploy
data pipeline
predictive service
GraphLab Create
Analyze big data on one machine
graphs, tables, text, images
in Python
doesn’t have to fit in memory
Distribute in production
with same code
on EC2, Yarn,…
GraphLab Canvas: Monitor & visualize
from prototype to production
What folks are saying about
GraphLab Create
“The ease of use and scalable performance, which is not limited by the memory of the
machine, are allowing us to innovate and advance at an astonishing pace.”
- Andrew Bruce, Senior Director, Data Science, Zillow
“Graphlab Create provides us with an end-to-end efficient framework … both tabular and
graph data generated by the activity of our users.”
- Baldo Faeita, Social Computing Lead, Adobe Systems
“…during my time as Zynga's lead architect for big data, I found my way to GraphLab. I was
astounded at the dramatic savings, on the order of 500x…”
- Mohan Reddy, Chief Architect, The Hive LLC.
Action 1: end-to-end predictive app
Key use cases
Retail
Financial
Marketing &
Advertising
Telecom
Social network
analysis
Medical
• Recommenders, pricing,…
• Fraud detection,…
• Targeting, sentiment,…
• Churn prediction,…
• Communities, friends,…
• Med records, drug design,…
Detecting fraud in the bitcoin network
Poorly-regulated cryptocurrency market w/open transactions
Task:
Investigate suspicious transactional behavior
Productive on one
machine
Python
Tables, graphs, text & images
Scale beyond memory
Integrated visualization
Productive at scale
Deploy, monitor, improve on the Cloud/Cluster
Integrates with Hadoop/EC2
Deploy data pipelines & prediction services
GraphLab Create
Powered by
GraphLab Engine
Fastest analytics system
Built for Machine Learning
in Production
Scalable, robust algorithms
End-to-end support
Get to value fast
Deep Dive: key components of a
GraphLab Create
Scalable data structures:
The SFrame & The SGraph
Scalable data structures
ease those scaling pains
Pain
SFrame/SGraph
Running out of memory Graceful degradation
Optimized out-of-core computation
Integrating data
Unified tables, graphs, text, images
Missing values
Strong/weak types
Missing value support from get-go
Strong types only when strength helps
SFrame: Scalable tabular data
Never run out of memory
Sharded, compressed, out-of-core, columnar
Arbitrary lambda transformations, joins,… from Python
Group-by
aggregate
10M rows/s
on your
desktop
…
Same
Python
code
SGraph: Scalable graph data
PageRank at
10M edges/s
on your
desktop
Easily and efficiently express entire pipelines
Scalable, robust machine learning
Most ML toolkits don’t focus on the
real challenges
Tools out there
Real needs
Bag of algorithms
Task-oriented, e.g., recommender
Brittle
Robust to data problems
Lots of parameters to tune
Automatic; tuning is a bonus
“State-of-the-art” methods
in “research” mode
state-of-the-art accuracy,
performance (& methods)
GraphLab Create: Robust ML & graph analytics
state-of-the-art scaling and accuracy
focused on solving tasks, automatically
Scalable machine learning
•
•
•
•
•
•
•
Recommend Products
Target & segment
Detect Fraud
Analyze Text
Recommender
Regress
3.5B ratings
1 hour
on your
Classify
desktop
…
ML with GraphLab Create
Tables, graphs, text, images
State-of-the-art accuracy & scaling
Classification, regression, clustering, nearest neighbors,…
Want topic models?
Want winning method
for 50% of Kaggle
competitions?
Want hot, hot, hot?
Super-fast LDA
Boosted
decision trees
Deep learning
Action 2: deep learning
Deploying GraphLab Create
Data pipelines
Predictive services
Sample Prototype
MESSY
NOT MODULAR
FILE PATHS
NOT PORTABLE
Data pipelines & predictive services
Clean
Learn
Deploy
GraphLab
Data Pipeline
GraphLab
Predictive ServiceTM
Reusable components
Beyond batch & stream processing
Runs on Hadoop
Predictive applications
require real-time service
CDH5 now; Pivotal, Spark coming…
Runs on Cloud
EC2 now; Azure, Google coming…
Deployed directly from
data pipeline
Monitor from GraphLab Canvas
GraphLab Predictive ServicesTM
GraphLab
Predictive ServicesTM
Serves predictions in a robust, scalable,
incremental fashion
Your Web Service or
Predictive App
Caching Layer
Predictive
Object Server
MLBetter
Model
ML Model
Architecture and integration
Clean
Learn
Deploy
data pipelines
predictive services
Local
GraphLab
Canvas
<Python>
HDFS
Machine Learning
S3
SQL/noSQL
Robust, scalable, auto-tuning, task-oriented
Graphs, tables, text, images
SFrame
Scales out-of-core
GraphDB
SGraph
Fastest graph analytics
SameGraphLab
code, many Engine
environments
Same code, many environments
End-to-end
visualization
monitoring
management
Benchmarking
Recommender
3.5B ratings
1 hour
on your
desktop
PageRank at
10M edges/s
on your
desktop
LDA at
1.4M tokens/s
on your
desktop
Tables
Graphs
Text
GraphLab Create is fast, really fast
Most importantly, it’s fast
enough for production now!
On graph data
Finding influencers in the
Live-Journal graph
16 machines
1340
Mahout
1 machine
GraphLab
135
Create
0
200
400
600
800 1000 1200 1400 1600
Runtime (in seconds, PageRank for 10
iterations)
GraphLab on 1 machine is 10x faster than Mahout on 16 machines
On tabular data
Time in seconds on desktop
Logistic regression benchmark
70000
60000
50000
40000
30000
20000
10000
0
Scikit Learn
GraphLab Create
Orders of magnitude faster
KDD Cup data: predict student performance on math problems based on interactions with tutoring system
8.4M data points, 20M features, 2.4GB compressed
Scaling up
Time in seconds on desktop
Recommender
using matrix factorization
3.5B ratings
~ 1 hour
5000
4000
3000
2000
1000
0
0
2E+09
Number of ratings
Amazon ratings data: 35M ratings, 6.6M users, 2.5M products
Replicated synthetically WRT users to evaluate scaling
4E+09
GraphLab.com
GraphLab Create Roadmap
March 2014
July 21st 2014
Scalable data structures
Tables, graphs, text
Robust ML algorithms
GraphLab Canvas
Data pipelines
October 2014
New ML algorithms
More data types
Predictive services
Monitoring in production
100+ companies participated in beta program
Already used in production
Extremely positive feedback
Every feature since March in response to customer requests
Please keep them coming!
Commitment to open-source
• We have been committed to open-source for 6 years
- PowerGraph, GraphChi,…
- Our focus now is on GraphLab Create
• We are inspired by companies like MongoDB & ElasticSearch
- Open-source core
- Provide value-add tools, such as monitoring & management
Our users can be successful by just
using open-source version
GraphLab Create:
Unleashing data science
from inspiration to production
pip install graphlab-create
@graphlabteam