T. Tyson - Computing & Communications

Data-driven discovery: opportunities and challenges
Tony Tyson
UC Davis
Next Generation Research & the University of California
Planning
▪ Watch the tail of the distribution
▪ Disruptive technologies drive user behavior
▪ Rare modalities will become commonplace
Future of computing at scale
The “Big” in Big Data
• What you do with it
• More challenging than volume or storage
• The big opportunity and challenge
Full end-to-end simulations
LSST Wide-Fast-Deep survey
A survey of 37 billion objects in space and time
Each sky patch will be visited over 800 times: 30 trillion measurements
15 terabytes per night, for ten years
Complex high-dimensional data
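The survey figures above can be cross-checked with simple arithmetic. This is a rough sketch, not an official LSST calculation: the object count, visit count, and nightly volume come from the slide, while the number of observing nights per year (`NIGHTS_PER_YEAR`) is an assumption introduced here for illustration.

```python
# Back-of-envelope check of the survey figures quoted on the slide.
OBJECTS = 37e9          # objects in the survey (from the slide)
VISITS = 800            # visits per sky patch (the slide says "over 800")
TB_PER_NIGHT = 15       # terabytes per night (from the slide)
YEARS = 10              # survey duration (from the slide)
NIGHTS_PER_YEAR = 300   # ASSUMPTION: usable observing nights per year

# One measurement per object per visit.
measurements = OBJECTS * VISITS

# Raw data volume over the whole survey, in petabytes (1 PB = 1000 TB).
raw_pb = TB_PER_NIGHT * NIGHTS_PER_YEAR * YEARS / 1000

print(f"measurements: {measurements:.2e}")
print(f"raw volume:   {raw_pb:.0f} PB")
```

With 800 visits the product is just under 30 trillion, consistent with the slide's "over 800 times: 30 trillion measurements".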
“Genome project” approach to astronomy
▪ Avoid the cost of building a new facility and running a new experiment every time we ask a new science question
▪ One exhaustive survey of the optical universe
▪ A 3.2-gigapixel image every 18 sec for 10 years
▪ Calibrated, trusted data: over 500 PB
500 PB image collection + 15 PB catalog
▪ Many simulated universes
▪ Multiple 100-1000 PB databases
▪ Exascale data enables many “experiments”
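The imaging cadence quoted above (a 3.2-gigapixel image every 18 seconds) implies a sustained data rate off the camera. A minimal sketch of that estimate follows; the pixel count and cadence are from the slide, but the 2 bytes per pixel (`BYTES_PER_PIXEL`) is an assumption made here, not a figure from the talk.

```python
# Sustained camera data rate implied by the imaging cadence.
PIXELS = 3.2e9            # pixels per image (from the slide)
SECONDS_PER_IMAGE = 18    # one image every 18 seconds (from the slide)
BYTES_PER_PIXEL = 2       # ASSUMPTION: 16-bit raw pixels

image_bytes = PIXELS * BYTES_PER_PIXEL
image_gb = image_bytes / 1e9                          # size of one image in GB
rate_mb_s = image_bytes / SECONDS_PER_IMAGE / 1e6     # average rate in MB/s

print(f"image size: {image_gb:.1f} GB")
print(f"data rate:  {rate_mb_s:.0f} MB/s")
```

Under that assumption each image is several gigabytes, and the pipeline has to absorb a few hundred megabytes per second for as long as the camera is observing.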
Automated discovery
Data exploration
This is also required for automated Data Quality Assessment
Comparing data with theory:
Cosmological Simulations
▪ Hard to analyze the data afterwards -> need a database
▪ Compare to real data
▪ Next generation of simulations with petabytes of output is under way (Exascale-Sky)
The Science of Big Data
▪ Data growing exponentially, in all science
▪ Changes the nature of science from hypothesis-driven to data-driven discovery
▪ Cuts across all sciences
▪ Convergence of physical and life sciences through Big Data (statistics and computing)
▪ A new scientific revolution
The Crunch
▪ Science community starving for storage and IO
• Put data-intensive computations as close to data as possible
▪ Current architectures cannot scale much further
• Need to get off the curve leading to power wall
▪ A new, Fourth Paradigm of science is emerging
• Many common patterns across all scientific disciplines
5 Year Trend
▪ Sociology:
• Data collection in ever larger collaborations
• Analysis decoupled, off archived data by smaller groups
• Multi-PB data sets
▪ Some form of a scalable Cloud solution inevitable
• Who will operate it, what business model, what scale?
• Science needs different tradeoffs than eCommerce
▪ Scientific data will never be fully co-located
• Geographic origin tied to experimental facilities
• Streaming algorithms, data pipes for distributed workflows
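A streaming algorithm sees each record once, keeps only a small summary in memory, and ideally lets summaries computed at different sites be combined without moving the raw data. The slide does not name a specific algorithm; the sketch below uses Welford's online mean and variance, with the standard pairwise merge step, purely as an illustration. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class RunningStats:
    """Constant-memory summary of a stream: count, mean, and sum of squared deviations."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0

    def update(self, x: float) -> None:
        # Welford's update: one pass, no stored history.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Sample variance; defined as 0.0 for fewer than two values.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


def summarize(stream: Iterable[float]) -> RunningStats:
    """Consume a stream once and return its summary."""
    stats = RunningStats()
    for x in stream:
        stats.update(x)
    return stats


def merge(a: RunningStats, b: RunningStats) -> RunningStats:
    """Combine two summaries (e.g. from two sites) without revisiting the data."""
    n = a.n + b.n
    if n == 0:
        return RunningStats()
    delta = b.mean - a.mean
    mean = a.mean + delta * b.n / n
    m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / n
    return RunningStats(n, mean, m2)


# Two "sites" each summarize their own data; only the summaries travel.
site_a = summarize([2.0, 4.0, 4.0, 4.0])
site_b = summarize([5.0, 5.0, 7.0, 9.0])
combined = merge(site_a, site_b)
print(combined.n, combined.mean, combined.variance)
```

Only three numbers per site cross the network here, which is the appeal of this pattern when the underlying data sets are tied to the facilities that produced them.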
[Figure: campus network diagram. Labels: Research Fiber, Internet (CENIC), Border, Core, Area, Building, LAN, VLAN1. Link speeds shown: 100 Mb - 1 Gb, 1 Gb – 10 Gb, 10 Gb, 10 - 100 Gb.]
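The link speeds labelled in the network diagram suggest why petabyte-scale data tends to stay where it is produced. The sketch below estimates how long one petabyte takes to move at each speed; the function name `transfer_days` and the `efficiency` parameter are hypothetical, and the calculation ignores protocol overhead and contention unless `efficiency` is lowered.

```python
def transfer_days(size_bytes: float, link_bits_per_s: float, efficiency: float = 1.0) -> float:
    """Days needed to move size_bytes over a link.

    efficiency is the assumed fraction of line rate actually achieved (1.0 = ideal).
    """
    seconds = size_bytes * 8 / (link_bits_per_s * efficiency)
    return seconds / 86400  # seconds per day


PB = 1e15  # one petabyte in bytes (decimal)

# Link speeds taken from the diagram labels.
links = [("100 Mb/s", 100e6), ("1 Gb/s", 1e9), ("10 Gb/s", 10e9), ("100 Gb/s", 100e9)]

for label, rate in links:
    print(f"1 PB over {label:>8}: {transfer_days(PB, rate):10.1f} days")
```

Even at an ideal 10 Gb/s, a single petabyte takes a little over nine days to move, which is one way to read the advice to put data-intensive computation as close to the data as possible.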
Infrastructure
• Big bandwidth between data centers
• Exascale computations at centers
• Adequate bandwidth to users for visualization (streaming HD)
• Plus one more important ingredient...
What’s needed?
(not drawn to scale)
• Scientists: science data & questions
• Miners: data mining algorithms
• Plumbers: database to store data and execute queries
• Tools: question & answer, visualization
Jim Gray
We need to train a cadre of scientists who are deep in both CS/Statistics and their domain science
Example: Data Science Initiative @ UC Davis
Big Data @ UC Davis
[Figure: diagram with labels Domain Sciences, Analytics, Infrastructure, Training, Discover, Develop, Distribute.]