USER-CENTRIC DATA MANAGEMENT IN THE ERA OF BIG DATA
Alexandros Labrinidis
Advanced Data Management Technologies Lab
Department of Computer Science
University of Pittsburgh
http://labrinidis.cs.pitt.edu
© 2014 Alexandros Labrinidis, University of Pittsburgh
October 8, 2014

Data-Intensive Science
[Diagram: Data-intensive science; Observational; Simulation; One (virtual) instrument; Multiple instruments]
✤ SDSS: Sloan Digital Sky Survey (2000 - ) 200 GB/night
✤ LSST: Large Synoptic Survey Telescope (2015 - ) 30 TB/night -- 1.28 PB/year
✤ LHC: Large Hadron Collider 15 PB/year
✤ SKA: Square Kilometer Array (2019 - ) 10 PB/hour

Data-Intensive Science
✤ Gene Sequencing
✤ Personalized Medicine

Data-Intensive Science
✤ Climate Modeling
✤ Turbulent Combustion Flow

What’s the Big Deal with Big Data?
• Featured on the cover of Nature and The Economist!
• And even has a Dilbert cartoon!

Big Data Definition - The three Vs
• Volume - size does matter!
• Velocity - data at speed, i.e., the data “fire-hose”
• Variety - heterogeneity is the rule

Five more Vs
• Variability - rapid change of data characteristics over time
• Veracity - ability to handle uncertainty, inconsistency, etc.
• Visibility - protect privacy and provide security
• Value - usefulness and ability to find the right hay-colored needle in the haystack
• Voracity - strong appetite for data!

Enter Moore’s Law
Moore's law is the observation that, over the history of computing hardware, the number of transistors in a dense integrated circuit doubles approximately every two years. The law is named after Gordon E. Moore, co-founder of Intel Corporation, who described the trend in his 1965 paper.
Source: http://en.wikipedia.org/wiki/Moore's_law
[ Wikipedia Image ]

Enter Bezos’ Law
Bezos' law is the observation that, over the history of the cloud, the price of a unit of computing power is reduced by 50% approximately every 3 years.
Source: http://blog.appzero.com/blog/futureofcloud
Photo: http://www.slashgear.com/google-data-center-hd-photos-hit-where-the-internet-lives-gallery-17252451/

Storage capacity increase
[Chart: HDD Capacity (GB), vertical axis from 0 to 7000; source: Wikipedia data]
Insert other exponentially increasing graphs here (e.g., data generation rates, world-wide smartphone access rates, Internet of Things, …)

But
• Human processing capacity remains roughly the same!
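The gap between exponentially improving technology and constant human capacity can be made concrete with a little arithmetic. The following is a minimal sketch, not part of the original slides: the function names and the 12-year horizon are hypothetical, and the doubling and halving periods are simply the approximate figures quoted for Moore's law and Bezos' law.

```python
def growth_factor(years: float, doubling_period_years: float) -> float:
    """Multiplicative growth after `years` if a quantity doubles every period."""
    return 2.0 ** (years / doubling_period_years)


def price_factor(years: float, halving_period_years: float) -> float:
    """Fraction of the original price remaining if the price halves every period."""
    return 0.5 ** (years / halving_period_years)


if __name__ == "__main__":
    horizon_years = 12  # hypothetical horizon, chosen only for illustration

    # Moore's law as quoted: transistor count doubles roughly every 2 years.
    transistors = growth_factor(horizon_years, doubling_period_years=2.0)

    # Bezos' law as quoted: unit price of computing halves roughly every 3 years.
    cloud_price = price_factor(horizon_years, halving_period_years=3.0)

    # Assumption for this sketch: human processing capacity stays constant.
    human_capacity = 1.0

    print(f"After {horizon_years} years:")
    print(f"  transistor count  x{transistors:g}")
    print(f"  cloud unit price  x{cloud_price:g}")
    print(f"  human capacity    x{human_capacity:g}")
```

Under these assumed rates, the technology curves change by orders of magnitude over a horizon in which the human factor does not move at all.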
We refer to this as the:
Big Data – Same Humans Problem

About the ADMT Lab
• Directed by:
  • Panos K. Chrysanthis
  • Alexandros Labrinidis
• Established in 1995
• Currently: 5 PhD students
• Our “slogan”: User-centric data management for network-centric applications

Look at the entire data lifecycle

AQSIOS - A DSMS Architecture
AQSIOS is the DSMS prototype developed at our ADMT Lab. It is built on top of the STREAM prototype from Stanford.
[Architecture diagram: data stream sources; continuous queries (Q1, Q2, Q3) and query networks; AQSIOS components: scheduler, query optimizer, statistics collector, load manager; stream applications; administrator, who sets the delay targets and priorities for queries]

DILoS evaluation - QoS and QoD

                         Average response time (ms)     Average data loss (%)
                         Class 1  Class 2   Class 3     Class 1  Class 2  Class 3
No load manager             3.40     3.53  56541.69           0        0        0
Common load manager         3.00     3.13    517.07       11.42    11.43    11.60
Per-class load manager      3.55     3.75    492.84           0        0    35.95
DILoS                       4.28     4.38     42.95           0        0        0

Style of research
• Emphasis on systems and algorithms
• Building real systems
  • Often based on academic prototypes (e.g., STREAM from Stanford) or on top of well-known open-source software (e.g., Storm)
• Experimenting using real systems and simulation
• Comparing alternatives
  • Should we group queries in way A or way B?
  • If we do 4 different optimizations, what is the relative benefit of each one?
  • In which cases would a certain algorithm be better than another?
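As one way to picture the "comparing alternatives" style of work, the sketch below takes the numbers from the DILoS evaluation slide and computes the relative benefit of each scheme against the no-load-manager baseline. It is only an illustration: the speedup ratio, the "lossless" check, and all function names are assumptions introduced here, not metrics or code from the original evaluation.

```python
# Numbers copied from the DILoS evaluation slide: average response time (ms)
# and average data loss (%), each listed for Class 1, Class 2, Class 3.
RESULTS = {
    "No load manager": {
        "response_ms": (3.40, 3.53, 56541.69),
        "loss_pct": (0.0, 0.0, 0.0),
    },
    "Common load manager": {
        "response_ms": (3.00, 3.13, 517.07),
        "loss_pct": (11.42, 11.43, 11.60),
    },
    "Per-class load manager": {
        "response_ms": (3.55, 3.75, 492.84),
        "loss_pct": (0.0, 0.0, 35.95),
    },
    "DILoS": {
        "response_ms": (4.28, 4.38, 42.95),
        "loss_pct": (0.0, 0.0, 0.0),
    },
}


def speedup_vs_baseline(results, class_index, baseline="No load manager"):
    """Return baseline_time / scheme_time for one query class (higher is better).

    The ratio is an illustrative metric chosen for this sketch.
    """
    base = results[baseline]["response_ms"][class_index]
    return {
        name: base / row["response_ms"][class_index]
        for name, row in results.items()
    }


def lossless_schemes(results):
    """Return the names of schemes that report 0% data loss in every class."""
    return [
        name
        for name, row in results.items()
        if all(loss == 0 for loss in row["loss_pct"])
    ]


if __name__ == "__main__":
    for name, ratio in speedup_vs_baseline(RESULTS, class_index=2).items():
        print(f"{name:24s} Class 3 speedup: {ratio:8.1f}x")
    print("No data loss:", ", ".join(lossless_schemes(RESULTS)))
```

Read this way, the table shows a trade-off: DILoS has slightly higher response times for Classes 1 and 2 than the baseline, but a far lower Class 3 response time, and it reports no data loss.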
Types of projects for undergrads
• Upcoming:
  • web-based user interface to visualize run-time behavior of a real system
• Past:
  • clustering of tweets
  • web-based interfaces to different database back-ends
  • REST APIs for remote data access
  • application to coordinate supernovae observations
  • monitoring application for transient astronomical events

More info
• [1] The Beckman Report on Database Research, by Abadi et al., October 2013. http://beckman.cs.wisc.edu
• [2] Big Data and Its Technical Challenges, by Jagadish, Gehrke, Labrinidis, Papakonstantinou, Patel, Ramakrishnan, and Shahabi, Communications of the ACM, July 2014. http://bit.ly/bigdatachallenges (over 4,500 downloads)
• [3] Contact me: http://labrinidis.cs.pitt.edu/contact