Big Data Management with Myria Brandon Haynes and Magdalena Balazinska DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING UNIVERSITY OF WASHINGTON hEp://myria.cs.washington.edu 1 Myria Team Magda Balazinska, Bill Howe, and Dan Suciu (faculty) SoTware Engineer: Tobin Baker Grad Students Shumo Chu Eric Gribkoff Brandon Haynes Jeremy Hyrkas Paris Koutris Brendan Lee Brandon Myers Ryan Maas Dominik Moritz Laurel Orr Jennifer OrZz Jingjing Wang Ugrad Students Yuqing Guo Dan Radion Alumnae/Alumni: Victor Almeida, Lee Lee Choo, Dan Halperin, Vaspol Ruamviboonsuk, Emad Soroush, Mayukha Vadari, Andrew Whitaker, Shengliang Xu 2 Acknowledgments The Myria Team Our science collaborators Our sponsors • NaZonal Science FoundaZon, Moore & Sloan FoundaZons, Washington Research FoundaZon, eScience InsZtute, ISTC Big Data, Petrobras, and EMC Magdalena Balazinska -‐ University of Washington 3 Overview of the Myria Stack Magdalena Balazinska -‐ University of Washington 4 Stack for big data management and analyZcs • A new big data mgmt & analyZcs system – Available open source – Runs in shared-‐nothing clusters (Amazon EC2) – Also runs in an HPC cluster at MIT – Think of it as Hive/Hadoop but faster – Think of it as Spark but faster – This of it as SQL Server or PostgreSQL but more scalable • An opera3onal service deployed at UW • Developed by the UW database group and eScience 5 Myria Big Data Management Service Myria is a cloud service: Just open browser and go! 6 Myria Demo on Amazon hEp://demo.myria.cs.washington.edu Magdalena Balazinska -‐ University of Washington 7 Myria Is a Cloud Service Browser RACO MyMergerTree Python/iPython Web UI Language Parser Query OpZmizer Google App Engine MyriaX Parallel Query Execution 8 Myria Supports Different Back-‐Ends mul$ple languages and apps RACO mul$ple Big Data systems RAGO Goals: OpZmizing middleware for big data systems – write once, run efficiently anywhere: Hadoop, Spark, MyriaX, straight C, RDBMS, PGAS, MPI, … Magdalena Balazinska -‐ University of Washington 9 Myria Is a Parallel Data Mgmt System JSON query plans & other instruc$ons MyriaX Sharednothing cluster REST Server Coordinator Worker Primary data store: Catalog Catalog Worker Catalog jdbc jdbc RDBMS RDBMS Can ingest data from: Worker Catalog jdbc RDBMS Can ingest data from: HDFS S3 … HDFS Magdalena Balazinska -‐ University of Washington HDFS 10 Performance Example Input data: TwiEer Graph 1.5 billion edges and 41 million verZces 18 Ra#o (Spark / Myria) 16 14 12 Select 10 8 Aggregate 6 Join 4 Connected Components 2 0 4 8 16 32 64 # workers Magdalena Balazinska -‐ University of Washington 11 Example Myria ApplicaZons ps3.fcs…subset RED fluorescence Nanoplankton P35-surf 103 102 104 Ultraplankton Picoplankton 103 101 102 100 100 1 10 IS 580-30 692-40 Phytoplankton 0 Galaxy SimulaZons 104 101 Prochlorococcus 101 2 10 FSC Small Stuff 3 10 102 FSC 103 104 4 10 FSC Environmental Flow Cytometry 100 100 101 102 FSC Small Stuff 103 104 Retail AnalyZcs Bibliometrics 12 Myria Demo hEp://demo.myria.cs.washington.edu • DemonstraZng Web interface to Myria – Analyzing pre-‐loaded data – Analyzing data stored in S3 • DemonstraZng Python interface to Myria – Basic Python and iPython • DemonstraZng spinning up your own cluster – Demo with Amazon EC2 but private cluster also OK Magdalena Balazinska -‐ University of Washington 13 Myria DocumentaZon Docs: hEp://myria.cs.washington.edu/docs/index.html Issues: Please post on github hEps://github.com/uwescience/myria-‐stack Users mailing list: myria-‐[email protected] To subscribe: hEps://mailman.cs.washington.edu/mailman/lisZnfo/myria-‐users Magdalena Balazinska -‐ University of Washington 14
© Copyright 2024