The Big Data – Same Humans Problem

Alexandros Labrinidis
Advanced Data Management Technologies Lab
Department of Computer Science, University of Pittsburgh

CIDR 2015 – Gong Show – January 5, 2015


You know Big Data is an important problem if...
• It is featured on the cover of Nature and the Economist!

You know Big Data is an even more important problem if...
• It has a Dilbert cartoon!

What is Big Data?

Definition #1:
• Big data is like teenage sex:
  o everyone talks about it,
  o nobody really knows how to do it,
  o everyone thinks everyone else is doing it,
  o so everyone claims they are doing it...

Definition #2:
• Anything that won't fit in Excel!

Definition #3:
• Using the Vs

The three Vs
If you have been living under a rock for the last few years...

The three Vs
• Volume - size does matter!
• Velocity - data at speed, i.e., the data “fire-hose”
• Variety - heterogeneity is the rule

Five more Vs
• Variability - rapid change of data characteristics over time
• Veracity - ability to handle uncertainty, inconsistency, etc.
• Visibility - protect privacy and provide security
• Value - usefulness & the ability to find the right needle in the haystack
• Voracity - strong appetite for data!

Enter Moore’s Law
Moore's law is the observation that, over the history of computing hardware, the number of transistors in a dense integrated circuit doubles approximately every two years. The law is named after Gordon E. Moore, co-founder of Intel Corporation, who described the trend in his 1965 paper.
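The doubling rule above is simple exponential arithmetic. A minimal sketch (the 1971 Intel 4004 baseline of ~2,300 transistors is an illustrative assumption, not a figure from the talk):

```python
def moores_law(base_count, base_year, year, doubling_period=2):
    """Project a transistor count assuming doubling every `doubling_period` years."""
    return base_count * 2 ** ((year - base_year) / doubling_period)

# From ~2,300 transistors in 1971, project to 2015 (22 doublings):
projected = moores_law(2_300, 1971, 2015)
print(f"{projected:,.0f}")  # 9,646,899,200 — roughly 9.6 billion
```

The same formula with a negative exponent models the deck's later point about prices halving (e.g., a 50% reduction every 3 years).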
Source: http://en.wikipedia.org/wiki/Moore's_law
[ Wikipedia image ]

Enter Bezos’ Law
Bezos' law is the observation that, over the history of cloud computing, the price of a unit of computing power is reduced by 50% approximately every 3 years.
Source: http://blog.appzero.com/blog/futureofcloud
Photo: http://www.slashgear.com/google-data-center-hd-photos-hit-where-the-internet-lives-gallery-17252451/

Storage capacity increase
[ Chart: HDD capacity (GB) over time, growing exponentially – Wikipedia data ]
Insert other exponentially increasing graphs here (e.g., data generation rates, worldwide smartphone adoption rates, Internet of Things, ...)

But...
• Human processing capacity remains roughly the same!

We refer to this as the:
Big Data – Same Humans Problem

Rethinking Scalability

Systems point of view:
• Response time
• Throughput
• Scale-up
• Scale-out

Human point of view:
• Making sure humans do not get lost in a sea of data!

Rethinking Scalability
Example: a Data Stream Management System processing 1,000,000 events per second

Systems point of view:
• Fantastic!

Human point of view:
• Terrible!

So now what?
• In addition to “traditional” scalability, it is important to consider “big data” research in:
  o Summarization
  o Personalization
  o Ranking
  o Recommender Systems
  o Visual Analytics
  o Crowd-sourcing
  o ...

It is important!

Thank you

Further reading:
Big Data and Its Technical Challenges
http://bit.ly/bigdatachallenges
By Jagadish, Gehrke, Labrinidis, Papakonstantinou, Patel, Ramakrishnan, and Shahabi, Communications of the ACM, July 2014