overview slides - Myria - University of Washington

Big Data Management with Myria Brandon Haynes and Magdalena Balazinska DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING UNIVERSITY OF WASHINGTON hEp://myria.cs.washington.edu 1 Myria Team Magda Balazinska, Bill Howe, and Dan Suciu (faculty) SoTware Engineer: Tobin Baker Grad Students Shumo Chu Eric Gribkoff Brandon Haynes Jeremy Hyrkas Paris Koutris Brendan Lee Brandon Myers Ryan Maas Dominik Moritz Laurel Orr Jennifer OrZz Jingjing Wang Ugrad Students Yuqing Guo Dan Radion Alumnae/Alumni: Victor Almeida, Lee Lee Choo, Dan Halperin, Vaspol Ruamviboonsuk, Emad Soroush, Mayukha Vadari, Andrew Whitaker, Shengliang Xu 2 Acknowledgments The Myria Team Our science collaborators Our sponsors •  NaZonal Science FoundaZon, Moore & Sloan FoundaZons, Washington Research FoundaZon, eScience InsZtute, ISTC Big Data, Petrobras, and EMC Magdalena Balazinska -­‐ University of Washington 3 Overview of the Myria Stack Magdalena Balazinska -­‐ University of Washington 4 Stack for big data management and analyZcs •  A new big data mgmt & analyZcs system –  Available open source –  Runs in shared-­‐nothing clusters (Amazon EC2) –  Also runs in an HPC cluster at MIT –  Think of it as Hive/Hadoop but faster –  Think of it as Spark but faster –  This of it as SQL Server or PostgreSQL but more scalable •  An opera3onal service deployed at UW •  Developed by the UW database group and eScience 5 Myria Big Data Management Service Myria is a cloud service: Just open browser and go! 6 Myria Demo on Amazon hEp://demo.myria.cs.washington.edu Magdalena Balazinska -­‐ University of Washington 7 Myria Is a Cloud Service Browser RACO
MyMergerTree Python/iPython Web UI Language Parser Query OpZmizer Google
App
Engine
MyriaX
Parallel Query
Execution
8 Myria Supports Different Back-­‐Ends mul$ple languages and apps RACO mul$ple Big Data systems RAGO Goals: OpZmizing middleware for big data systems – write once, run efficiently anywhere: Hadoop, Spark, MyriaX, straight C, RDBMS, PGAS, MPI, … Magdalena Balazinska -­‐ University of Washington 9 Myria Is a Parallel Data Mgmt System JSON query plans & other instruc$ons MyriaX
Sharednothing
cluster
REST Server
Coordinator
Worker
Primary data store: Catalog
Catalog
Worker
Catalog
jdbc jdbc RDBMS
RDBMS
Can ingest data from: Worker
Catalog
jdbc RDBMS
Can ingest data from: HDFS
S3
… HDFS
Magdalena Balazinska -­‐ University of Washington HDFS
10 Performance Example Input data: TwiEer Graph 1.5 billion edges and 41 million verZces 18 Ra#o (Spark / Myria) 16 14 12 Select 10 8 Aggregate 6 Join 4 Connected Components 2 0 4 8 16 32 64 # workers Magdalena Balazinska -­‐ University of Washington 11 Example Myria ApplicaZons ps3.fcs…subset
RED fluorescence
Nanoplankton
P35-surf
103
102
104
Ultraplankton
Picoplankton
103
101
102
100
100
1
10
IS
580-30
692-40
Phytoplankton
0
Galaxy SimulaZons 104
101
Prochlorococcus
101
2
10
FSC Small Stuff
3
10
102
FSC
103
104
4
10
FSC
Environmental Flow Cytometry 100
100
101
102
FSC Small Stuff
103
104
Retail AnalyZcs Bibliometrics 12 Myria Demo hEp://demo.myria.cs.washington.edu •  DemonstraZng Web interface to Myria –  Analyzing pre-­‐loaded data –  Analyzing data stored in S3 •  DemonstraZng Python interface to Myria –  Basic Python and iPython •  DemonstraZng spinning up your own cluster –  Demo with Amazon EC2 but private cluster also OK Magdalena Balazinska -­‐ University of Washington 13 Myria DocumentaZon Docs: hEp://myria.cs.washington.edu/docs/index.html Issues: Please post on github hEps://github.com/uwescience/myria-­‐stack Users mailing list: myria-­‐[email protected] To subscribe: hEps://mailman.cs.washington.edu/mailman/lisZnfo/myria-­‐users Magdalena Balazinska -­‐ University of Washington 14