GraphLab Create™: Unleash Data Science
Danny Bickson, Co-Founder

GraphLab Project History
• GraphLab (2009)
• GraphChi (2011)
• GraphLab Create (2014)

GraphLab Conferences
• 2012, 2013, 2014
• Growing community contribution

Data-driven, predictive data apps are making our world amazing…
• Historical data
• Sensor & interaction data
• Real-time predictions & decisions
• Recommenders, social, industrial apps, personalized medicine, human sensing, forecasters, fraud & anomaly detection, sentiment analysis & other text apps

Why are predictive apps the future?
• Spectrum of data products (time-to-line-of-business value)
• Was using 217 business rules, hoping the world doesn’t change
• Have an inspiring idea to reinvent their business
• Continuously get to value faster!

Building a predictive app
Key pains:
• Noisy space of tools: “Data scientists use a variety of tools, across different programming languages… require a lot of context-switching… affects productivity and impedes reproducibility.” - Ben Lorica, “Data Analysis: Just one component of the Data Science workflow”
• Hiring talent: 35% shortfall in data-savvy workers needed to make sense out of big data by 2018 [McKinsey 2011]
• Crossing the big data chasm
[Figure: scale of data vs. speed of iteration; labels: single machine memory, production data, big data chasm, “Get a Hadoop cluster!?!?”]
GraphLab Create: Unleashing data science from inspiration to production

Data scientist: inspiration to production
• Prototype: use my laptop; variety of data; not toy data scales; language I love; iterate quickly
• Production: clean, learn, monitor, deploy; data pipeline, predictive service

GraphLab Create
• Analyze big data on one machine: graphs, tables, text, images; in Python; doesn’t have to fit in memory
• Distribute in production with the same code on EC2, Yarn,…
• GraphLab Canvas: monitor & visualize from prototype to production

What folks are saying about GraphLab Create
• “The ease of use and scalable performance, which is not limited by the memory of the machine, are allowing us to innovate and advance at an astonishing pace.” - Andrew Bruce, Senior Director, Data Science, Zillow
• “GraphLab Create provides us with an end-to-end efficient framework … both tabular and graph data generated by the activity of our users.” - Baldo Faieta, Social Computing Lead, Adobe Systems
• “…during my time as Zynga's lead architect for big data, I found my way to GraphLab. I was astounded at the dramatic savings, on the order of 500x…” - Mohan Reddy, Chief Architect, The Hive LLC.
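
A note on “doesn’t have to fit in memory”: one common way to achieve this is to stream rows from disk and keep only small per-group state in RAM. The sketch below illustrates that idea in plain Python for a group-by mean. It is an illustration only, not how GraphLab Create implements it; the function name `streaming_groupby_mean`, the column names, and the sample data are hypothetical.

```python
import csv
import io
from collections import defaultdict


def streaming_groupby_mean(rows, key, value):
    """Compute the mean of `value` per `key` in a single pass.

    Only one running (sum, count) pair per distinct key is kept in memory,
    so the input itself may be far larger than RAM.
    """
    totals = defaultdict(lambda: [0.0, 0])  # key -> [running sum, row count]
    for row in rows:
        acc = totals[row[key]]
        acc[0] += float(row[value])
        acc[1] += 1
    return {k: s / n for k, (s, n) in totals.items()}


# Tiny in-memory stand-in for a large CSV file on disk (hypothetical data).
sample = io.StringIO("user,rating\nann,4\nbob,2\nann,5\nbob,3\nann,3\n")
means = streaming_groupby_mean(csv.DictReader(sample), "user", "rating")
print(means)
```

Because `csv.DictReader` yields one row at a time, swapping the `io.StringIO` stand-in for an open file handle would process a file of any size with memory proportional to the number of distinct keys.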
Action 1: end-to-end predictive app

Key use cases
• Retail: recommenders, pricing,…
• Financial: fraud detection,…
• Marketing & Advertising: targeting, sentiment,…
• Telecom: churn prediction,…
• Social network analysis: communities, friends,…
• Medical: med records, drug design,…

Detecting fraud in the bitcoin network
• Poorly-regulated cryptocurrency market w/open transactions
• Task: investigate suspicious transactional behavior

GraphLab Create
• Productive on one machine: Python; tables, graphs, text & images; scale beyond memory; integrated visualization
• Productive at scale: deploy, monitor, improve on the Cloud/Cluster; integrates with Hadoop/EC2; deploy data pipelines & prediction services
• Powered by GraphLab Engine: fastest analytics system
• Built for Machine Learning in Production: scalable, robust algorithms; end-to-end support; get to value fast

Deep dive: key components of GraphLab Create

Scalable data structures: The SFrame & The SGraph

Scalable data structures ease those scaling pains
Pain | SFrame/SGraph
Running out of memory | Graceful degradation; optimized out-of-core computation
Integrating data | Unified tables, graphs, text, images
Missing values | Missing value support from get-go
Strong/weak types | Strong types only when strength helps

SFrame: Scalable tabular data
• Never run out of memory
• Sharded, compressed, out-of-core, columnar
• Arbitrary lambda transformations, joins,… from Python
• Group-by aggregate 10M rows/s on your desktop
• … Same Python code

SGraph: Scalable graph data
• PageRank at 10M edges/s on your desktop
• Easily and efficiently express entire pipelines

Scalable, robust machine learning

Most ML toolkits don’t focus on the real challenges
Tools out there | Real needs
Bag of algorithms | Task-oriented, e.g., recommender
Brittle | Robust to data problems
Lots of parameters to tune | Automatic; tuning is a bonus
“State-of-the-art” methods in “research” mode | State-of-the-art accuracy, performance (& methods)

GraphLab Create: Robust ML & graph analytics
State-of-the-art scaling and accuracy, focused on solving tasks, automatically

Scalable machine learning
• Recommend products
• Target & segment
• Detect fraud
• Analyze text
• Regress
• Classify
• …
Recommender: 3.5B ratings in 1 hour on your desktop

ML with GraphLab Create
• Tables, graphs, text, images
• State-of-the-art accuracy & scaling
• Classification, regression, clustering, nearest neighbors,…
• Want topic models? Super-fast LDA
• Want winning method for 50% of Kaggle competitions? Boosted decision trees
• Want hot, hot, hot? Deep learning

Action 2: deep learning

Deploying GraphLab Create: data pipelines, predictive services

Sample prototype
• Messy
• Not modular
• File paths not portable

Data pipelines & predictive services
Clean, Learn, Deploy
• GraphLab Data Pipeline: reusable components; runs on Hadoop (CDH5 now; Pivotal, Spark coming…)
• GraphLab Predictive Service™: beyond batch & stream processing; predictive applications require real-time service; runs on Cloud (EC2 now; Azure, Google coming…); deployed directly from data pipeline; monitor from GraphLab Canvas

GraphLab Predictive Services™
Serves predictions in a robust, scalable, incremental fashion
[Diagram: Your Web Service or Predictive App; Caching Layer; Predictive Object Server; ML Model; Better ML Model]

Architecture and integration
[Diagram]
• Clean, Learn, Deploy: data pipelines, predictive services
• GraphLab Canvas: end-to-end visualization, monitoring, management
• <Python>
• Machine Learning: robust, scalable, auto-tuning, task-oriented
• SFrame: scales out-of-core
• SGraph: fastest graph analytics
• Graphs, tables, text, images
• GraphLab Engine: same code, many environments
• Local, HDFS, S3, SQL/noSQL, GraphDB

Benchmarking: GraphLab Create is fast, really fast
• Tables: recommender, 3.5B ratings in 1 hour on your desktop
• Graphs: PageRank at 10M edges/s on your desktop
• Text: LDA at 1.4M tokens/s on your desktop
Most importantly, it’s fast enough for production now!
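
For readers unfamiliar with the PageRank workload quoted in these numbers, here is a minimal power-iteration sketch in plain Python. It only shows what the algorithm computes and is not GraphLab's implementation. The damping factor of 0.85 is the conventional default and an assumption not stated in the slides; the function signature and the toy graph are hypothetical.

```python
def pagerank(edges, num_iterations=10, damping=0.85):
    """Power-iteration PageRank over a list of (source, target) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_degree = {n: 0 for n in nodes}
    for src, _ in edges:
        out_degree[src] += 1

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(num_iterations):
        # Rank held by nodes without outgoing edges is spread over all nodes.
        dangling = sum(rank[n] for n in nodes if out_degree[n] == 0)
        base = (1.0 - damping + damping * dangling) / len(nodes)
        new_rank = {n: base for n in nodes}
        # Each edge passes an equal share of its source's rank to its target.
        for src, dst in edges:
            new_rank[dst] += damping * rank[src] / out_degree[src]
        rank = new_rank
    return rank


# Hypothetical toy graph: three pages link to "hub", which links back to "a".
toy_edges = [("a", "hub"), ("b", "hub"), ("c", "hub"), ("hub", "a")]
ranks = pagerank(toy_edges)
for node in sorted(ranks):
    print(node, round(ranks[node], 4))
```

Each iteration touches every edge once, which is why throughput for this workload is naturally reported in edges per second.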
On graph data: finding influencers in the LiveJournal graph
[Bar chart: runtime in seconds, PageRank for 10 iterations]
• Mahout, 16 machines: 1340
• GraphLab Create, 1 machine: 135
GraphLab on 1 machine is 10x faster than Mahout on 16 machines

On tabular data: logistic regression benchmark
[Bar chart: time in seconds on desktop, scikit-learn vs. GraphLab Create]
• Orders of magnitude faster
• KDD Cup data: predict student performance on math problems based on interactions with tutoring system
• 8.4M data points, 20M features, 2.4GB compressed

Scaling up: recommender using matrix factorization
[Line chart: time in seconds on desktop vs. number of ratings]
• 3.5B ratings ~ 1 hour
• Amazon ratings data: 35M ratings, 6.6M users, 2.5M products
• Replicated synthetically with respect to users to evaluate scaling

GraphLab.com

GraphLab Create Roadmap
• March 2014, July 21st 2014: scalable data structures; tables, graphs, text; robust ML algorithms; GraphLab Canvas; data pipelines
• October 2014: new ML algorithms; more data types; predictive services; monitoring in production

100+ companies participated in beta program
• Already used in production
• Extremely positive feedback
• Every feature since March in response to customer requests
• Please keep them coming!

Commitment to open-source
• We have been committed to open-source for 6 years
- PowerGraph, GraphChi,…
- Our focus now is on GraphLab Create
• We are inspired by companies like MongoDB & ElasticSearch
- Open-source core
- Provide value-add tools, such as monitoring & management
Our users can be successful by just using the open-source version

GraphLab Create: Unleashing data science from inspiration to production
pip install graphlab-create
@graphlabteam