Data driven discovery: opportunities and challenges
Tony Tyson
UC Davis
Next Generation Research & the University of California

Planning
• Watch the tail of the distribution
• Disruptive technologies drive user behavior
• Rare modalities will become commonplace

Future of computing at scale

The “Big” in Big Data
• What you do with it
• More challenging than volume or storage
• The big opportunity and challenge

Full end-to-end simulations

LSST Wide-Fast-Deep survey
• A survey of 37 billion objects in space and time
• Each sky patch will be visited over 800 times: 30 trillion measurements
• 15 terabytes per night, for ten years
• Complex high-dimensional data

“Genome project” approach to astronomy
• Avoid the cost of building a new facility and running a new experiment every time we ask a new science question
• One exhaustive survey of the optical universe
• A 3.2 gigapixel image every 18 sec for 10 years
• Calibrated trusted data: over 500PB

• 500PB image collection + 15PB catalog
• Many simulated universes
• Multiple 100-1000PB databases
Exascale data enables many “experiments”

Automated discovery
• Data exploration
• This is also required for automated Data Quality Assessment

Comparing data with theory: Cosmological Simulations
• Hard to analyze the data afterwards -> need Database
• Compare to real data
• Next generation of simulations with Petabytes of output is under way (Exascale-Sky)

The Science of Big Data
• Data growing exponentially, in all science
• Changes the nature of science from hypothesis-driven to data-driven discovery
• Cuts across all sciences
• Convergence of physical and life sciences through Big Data (statistics and computing)
• A new scientific revolution

The Crunch
• Science community starving for storage and IO
  • Put data-intensive computations as close to data as possible
• Current architectures cannot scale much further
  • Need to get off the curve leading to the power wall
• A new, Fourth Paradigm of science is emerging
  • Many common patterns across all scientific disciplines

5 Year Trend
• Sociology:
  • Data collection in ever larger collaborations
  • Analysis decoupled, off archived data by smaller groups
  • Multi-PB data sets
• Some form of a scalable Cloud solution inevitable
  • Who will operate it, what business model, what scale?
  • Science needs different tradeoffs than eCommerce
• Scientific data will never be fully co-located
  • Geographic origin tied to experimental facilities
  • Streaming algorithms, data pipes for distributed workflows

[Figure: campus network diagram. Labels: Internet (CENIC), research fiber, border, core, area, building, LAN, VLAN1; link speeds from 100 Mb - 1 Gb up to 10 - 100 Gb.]

Infrastructure
• Big bandwidth between data centers
• Exascale computations at centers
• Adequate bandwidth to users for visualization (streaming HD)
• Plus one more important ingredient...

What’s needed? (not drawn to scale)
• Scientists: science data & questions
• Miners: data mining algorithms
• Plumbers: database to store data, execute queries
• Tools: question & answer, visualization
(Jim Gray)

We need to train a cadre of scientists who are deep both in CS/Statistics and their domain science
Example: Data Science Initiative @ UC Davis

Big Data @ UC Davis
[Figure labels: Domain Sciences, Discover, Develop, Training, Distribute, Infrastructure, Analytics]
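
Back-of-envelope check of the LSST figures
The LSST numbers quoted earlier in the slides (37 billion objects, over 800 visits, 15 terabytes per night for ten years, a 3.2 gigapixel image every 18 sec) can be cross-checked with simple arithmetic. The sketch below is only an illustration and is not part of the talk. The function names are hypothetical, and the figure of 365 observing nights per year is an assumption that gives an upper bound on raw volume.

```python
# Figures quoted on the slides.
OBJECTS = 37e9              # objects in the survey
VISITS_PER_PATCH = 800      # "over 800" visits, so treat this as a lower bound
TB_PER_NIGHT = 15           # terabytes collected per night
SURVEY_YEARS = 10
PIXELS_PER_IMAGE = 3.2e9    # 3.2 gigapixel camera
SECONDS_PER_IMAGE = 18

# Assumption (not from the slides): the telescope observes every night,
# which yields an upper bound on the raw data volume.
NIGHTS_PER_YEAR = 365


def total_measurements(objects=OBJECTS, visits=VISITS_PER_PATCH):
    """Number of object measurements if every object is seen on every visit."""
    return objects * visits


def raw_volume_pb(tb_per_night=TB_PER_NIGHT,
                  nights_per_year=NIGHTS_PER_YEAR,
                  years=SURVEY_YEARS):
    """Raw data volume in petabytes, using 1 PB = 1000 TB."""
    return tb_per_night * nights_per_year * years / 1000


def pixel_rate(pixels=PIXELS_PER_IMAGE, seconds=SECONDS_PER_IMAGE):
    """Average pixels per second while the camera is taking images."""
    return pixels / seconds


if __name__ == "__main__":
    print(f"measurements: {total_measurements():.3g}")
    print(f"raw volume:   {raw_volume_pb():.1f} PB (upper bound)")
    print(f"pixel rate:   {pixel_rate():.3g} pixels/s")
```

Under these assumptions the measurement count comes to about 29.6 trillion, in line with the 30 trillion on the slide, and the raw volume comes to at most about 55PB over ten years. The "over 500PB" figure on the slide refers to the calibrated image collection, which this estimate does not attempt to reproduce.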