IBM Big Data

Big Data Analytics for Life & Agricultural Sciences
IBM Big Data Session
 IBM Big Data Platform Overview – Bill Zanine, GM, IBM Advanced Analytics for Big Data
 IBM Big Data Life Sciences, Healthcare & Research Use Cases – Bill Zanine
The IBM Big Data Platform
 High-Performance Computing
– IBM Blue Gene/Q: high performance for computationally intensive applications
– Platform Computing: high-performance framework for distributed computing
 Stream Computing
– InfoSphere Streams: low-latency analytics for streaming data
 Hadoop
– InfoSphere BigInsights: Hadoop-based low-latency analytics for data variety and volume
 MPP Data Warehouse
– IBM Puredata System for Operational Analytics: operational analytics on structured data
– IBM Smart Analytics System: BI and ad hoc analytics on structured data
IBM Big Data Strategy: Optimized Analytic & Compute Platforms
Big data, business strategy and new analytics require optimized analytic platforms:
• Integrate and manage the full variety, velocity and volume of data
• Variety, velocity and volume further drive the need for optimized compute platforms
• Operate on the data in its native form, in its current location, minimizing movement
• Simplify data management, embed governance, and seamlessly leverage tools across platforms

[Platform diagram: Analytic Applications (BI/Reporting, Exploration/Visualization, Functional App, Industry App, Predictive Analytics, Content Analytics) run on the IBM Big Data Platform (Visualization & Discovery, Application Development, Systems Management, Analytic Accelerators), which spans four engines (High-Performance Computing, Hadoop System, Data Warehouse, Stream Computing) on top of Information Integration & Governance]
IBM Big Data Platforms Span the Breadth of Data-Intensive Analytic Computing Needs

[Spectrum diagram: Big Data Computing spans peta-scale analytics through real-time analytics – petascale computing (highly planned, batch), petascale data processing (scalable batch, unstructured), petascale interactive data processing (batch & interactive, structured), and in-line analytics (highly planned, autonomous)]

 No single architecture can fulfill all the compute needs for big data analytics, exploration and reporting
 Specialized architectures and technologies are optimized to solve specific compute profiles, data volumes and data types
 Architectures must be leveraged appropriately for scalable, cost-effective computing
Data Intensive Computing for Life Sciences
 Streaming Data – sensor data: manufacturing, medical, environmental & lab
– Patient monitoring, quality control
 Peta-Scale Data – sequencer data, assays, medical records, tissue & cell imaging
– SNP alignment, image classification, attribute extraction
 Peta-Scale Computing – simulations & models, publication graphs, protein science
– Molecular dynamics, complex graph analysis
 Data Intensive, Interactive Computing – image metadata, genomic profiles, environmental & patient attribution
– Translational medicine, genotype/phenotype, predictive healthcare
 Data Intensive Compute Acceleration – complex simulations
– Matrix acceleration, compute acceleration
Tackling Computational Challenges in Life Sciences
 The Vast Datascape of Life Sciences
 Scalable, Cost Efficient Genomic Processing and Analysis
 Scalable Sensor Processing and Geospatial Analytics
 Health Outcomes and Epidemiology
Combining Big Data and Smarter Analytics to Improve Performance in the Life Sciences

[Diagram: interlocking data and analysis domains across agriculture and healthcare – SNP, yield, biologic records, plant records, gene assembly, gene interaction, proteins, translational analysis, proteomics, simulations, claims, weather clustering, health records, environmental records, tissue topography, geospatial analysis]
Scalable Genomic Data Processing
SUNY Buffalo – Center for Computational Research
Data Intensive Discovery Initiative
SUNY Buffalo – Large Gene Interaction Analytics
UB Center for Protein Therapeutics
 Use new algorithms and add multiple variables that were previously nearly impossible to incorporate
 Reduce the time required to conduct analysis from 27.2 hours without the IBM Puredata data warehouse appliance to 11.7 minutes with it
 Carry out their research with little to no database administration
 Publish multiple articles in scientific journals, with more in process
 Proceed with studies based on 'vector phenotypes', a more complex variable type that will further push the IBM Puredata data warehouse appliance platform
Revolution R – Genome Wide Association Study
 Genome Wide Association Study (GWAS)
– An epidemiological study that examines many common genetic variants in different individuals to see if any variant is associated with a trait
 Revolution R allows bio-statisticians to work with Puredata as if they were simply using R on their desktops
– Simplicity with performance, considered a "game-changer" by end users
 CRAN library support lets them benefit from the aggregate knowledge of the R community
– An extensive selection of packages for bio-statistics and relevant analytic techniques makes developers significantly more productive

"What finished in 2 hours 47 minutes on the 1000-6 was still running on the HPC environment and was not estimated to complete until after 14+ days."
– Manager, IT Solution Delivery
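To make the workload concrete, a minimal single-SNP association scan looks like the following. This is an illustrative Python sketch on synthetic data; the genotype coding, cohort size and chi-square test are assumptions for brevity, not the actual Revolution R / Puredata implementation, which pushes the per-SNP work into the appliance.

```python
# Toy GWAS scan: test each SNP for association with a case/control trait.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_subjects, n_snps = 1_000, 5_000
genotypes = rng.integers(0, 3, size=(n_subjects, n_snps))  # 0/1/2 minor-allele count
phenotype = rng.integers(0, 2, size=n_subjects)            # 1 = case, 0 = control

p_values = np.empty(n_snps)
for j in range(n_snps):
    # 2x3 contingency table: case/control status vs. genotype class
    table = np.zeros((2, 3))
    for g in range(3):
        in_class = genotypes[:, j] == g
        table[1, g] = phenotype[in_class].sum()       # cases with genotype g
        table[0, g] = in_class.sum() - table[1, g]    # controls with genotype g
    # Chi-square test of independence; a real study would use a trend test
    # or logistic regression with covariates, and correct for multiple testing
    p_values[j] = stats.chi2_contingency(table)[1]

print("most associated SNP:", p_values.argmin(), "p =", p_values.min())
```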
EC2 vs Puredata: Bowtie Financial Value Proposition
 What amount of processing on EC2 equates to the cost of a Puredata system?
 Bowtie on EC2
– Assume the new system will be used for Bowtie, and only Bowtie
– Today, Bowtie takes 3 hours on 320 EC2 CPUs
– Cost of each Bowtie run = 3 hours * 320 CPUs * $0.68 per CPU per hour
• $653* per Bowtie run
 Bowtie on Puredata
– A TF-6 costs $600K*, or $200K per year assuming a 3-year deferral
– How many times would a customer have to run Bowtie on EC2 (On-Demand) for the same expenditure?
• $200K per year / $653 per run = 306 Bowtie runs per year
 Bottom line: Puredata is the better financial value proposition if the need to run Bowtie exceeds ~300 runs a year for 3 years (the break-even math is checked below)
 Also, the Puredata TF-6 offers 9x the Bowtie processing capacity of a comparably priced EC2 environment
*Costs are relative, based upon 2010 list pricing
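The break-even arithmetic on this slide can be verified directly from the quoted list prices:

```python
# Back-of-envelope check of the EC2-vs-Puredata break-even math
# (2010 list prices as quoted on the slide).
ec2_rate = 0.68          # $ per CPU per hour, EC2 On-Demand
cpus, hours = 320, 3     # one Bowtie run on EC2
cost_per_run = ec2_rate * cpus * hours
print(f"cost per Bowtie run on EC2: ${cost_per_run:,.2f}")            # $652.80

tf6_price = 600_000      # Puredata TF-6 list price
annual_cost = tf6_price / 3                                           # 3-year amortization
print(f"break-even: {annual_cost / cost_per_run:.0f} runs per year")  # ~306
```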
Benchmarks on Tanay ZX Series
For all the options traded in the US in a given day, as reported by OPRA (500K to 1 million trades), implied volatility can be calculated by the Tanay ZX Series in less than 500 milliseconds.
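For context, the per-option computation behind that benchmark is an implied-volatility root-find. Below is a plain-Python sketch using bisection on the Black-Scholes call price; it is illustrative only and says nothing about how the ZX Series parallelizes the work across trades.

```python
# Back out Black-Scholes implied volatility for one option via bisection.
import math

def bs_call_price(spot, strike, rate, t, vol):
    """Black-Scholes price of a European call option."""
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol**2) * t) / (vol * math.sqrt(t))
    d2 = d1 - vol * math.sqrt(t)
    N = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))  # standard normal CDF
    return spot * N(d1) - strike * math.exp(-rate * t) * N(d2)

def implied_vol(price, spot, strike, rate, t, lo=1e-4, hi=5.0, tol=1e-8):
    """Find the volatility at which the model price matches the market price."""
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if bs_call_price(spot, strike, rate, t, mid) < price:
            lo = mid          # model price too low: volatility must be higher
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

print(implied_vol(price=10.45, spot=100, strike=100, rate=0.05, t=1.0))  # ~0.20
```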
Applications in Healthcare
• Predictive Modeling
– Relationships between conditions, genetics, demographics and outcomes
– Survival modeling
• Monte Carlo Simulations
– Gibbs sampling for MCMC studies on patient data (toy sketch below)
– Drug design: molecular docking
• Gene Sequencing
– Parallelized Basic Local Alignment Search Tool (BLAST)
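As a concrete illustration of the Gibbs sampling bullet, here is a toy sampler for a bivariate normal with correlation rho. The actual patient-data models are not described on this slide, so everything below is an assumption chosen for brevity.

```python
# Gibbs sampling: draw from a joint distribution by alternating draws from
# each full conditional. For a standard bivariate normal with correlation
# rho, x | y ~ N(rho*y, 1 - rho^2), and symmetrically for y | x.
import math, random

random.seed(0)
rho = 0.8
x = y = 0.0
samples = []
for i in range(20_000):
    x = random.gauss(rho * y, math.sqrt(1 - rho**2))
    y = random.gauss(rho * x, math.sqrt(1 - rho**2))
    if i >= 1_000:                      # discard burn-in draws
        samples.append((x, y))

# Means are ~0 and variances ~1, so E[xy] estimates the correlation
est = sum(a * b for a, b in samples) / len(samples)
print(f"estimated correlation: {est:.2f}")   # ~0.80
```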
Sensor & Geospatial Data Processing
IBM Puredata Spatial – Precision Farming
High-speed analytics on farm telematics: yield data (GPS), soil data, Common Land Units (CLUs), elevation, farm plots

Example – farm equipment company
Intersect: 48 million crop yield records (points) with 30 million Common Land Units
Result: ~411,000 summary records by CLU (min, max, avg yield)
Total time: ~45 min

"We would not even attempt to do this large a process on Oracle."
– Customer GIS Analyst
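To show the shape of the operation, here is the same point-in-polygon rollup sketched with open-source tools. This is not the Puredata Spatial implementation (which runs as in-database SQL); the file names and the clu_id and yield columns are hypothetical.

```python
# Spatial join + per-polygon aggregation, the core of the CLU yield summary.
import geopandas as gpd

yield_points = gpd.read_file("yield_points.shp")        # GPS yield records
clus = gpd.read_file("common_land_units.shp")           # CLU polygons

# Tag each yield point with the CLU polygon that contains it
joined = gpd.sjoin(yield_points, clus[["clu_id", "geometry"]],
                   how="inner", predicate="within")

# One summary record per CLU: min, max and average yield
summary = joined.groupby("clu_id")["yield"].agg(["min", "max", "mean"])
print(summary.head())
```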
Vestas
A global wind energy company based in Denmark

Business challenge
 Improve placement of wind turbines: save time, increase output, extend service life

Project objectives
 Leverage a large volume of weather data (2.8 PB today; ~16 PB by 2015)
 Reduce modeling time from weeks to hours
 Optimize ongoing operations

Solution components
 IBM InfoSphere BigInsights Enterprise Edition
 IBM xSeries hardware

Why IBM?
 Domain expertise
 Reliability, security, scalability, and an integrated solution
 Standard enterprise software support
 Single vendor for software, hardware, storage and support
University of Ontario Institute of Technology
Detecting life-threatening conditions in neonatal care units

 Business challenge
– Premature births and associated health risks are on the rise
– Enormous data loss from patient monitoring equipment: 3,600 readings/hr reduced to 1 spot reading/hr
– By analyzing physical behaviors (heart rate, respiration, etc.), it is possible to determine when the body is coming under stress

 Project objectives
– Analyze ALL the data in real time to detect that a baby is becoming unwell earlier than is possible today
– Reduce the average length of stay in neonatal intensive care, thereby reducing healthcare costs

 The benefits
– Analyze ~90 million data points per day per patient in real time: every reading taken is analyzed
– Able to stream the data into a database, with the process shown to keep pace with the incoming data

Solution components:
 InfoSphere Streams, on premises and in the cloud
 Warehouse to correlate physical behavior across different populations
 Models developed in the warehouse are used to analyze the streaming data

"I could see that there were enormous opportunities to capture, store and utilize this data in real time to improve the quality of care for neonatal babies."
Dr. Carolyn McGregor
Canada Research Chair in Health Informatics
University of Ontario Institute of Technology
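The core pattern here, scoring every reading against a recent baseline instead of keeping one spot reading per hour, can be illustrated in a few lines. InfoSphere Streams applications are written in SPL; this Python generator is only a toy stand-in, and the window size and threshold are invented numbers.

```python
# Flag readings that deviate sharply from a sliding baseline.
from collections import deque
import statistics

def detect_anomalies(readings, window=120, z_threshold=3.0):
    """Yield (index, value, z-score) for readings far from the recent baseline."""
    recent = deque(maxlen=window)            # sliding window of recent readings
    for i, value in enumerate(readings):
        if len(recent) == recent.maxlen:
            mean = statistics.fmean(recent)
            stdev = statistics.pstdev(recent) or 1e-9
            z = (value - mean) / stdev
            if abs(z) > z_threshold:         # reading deviates from baseline
                yield i, value, z
        recent.append(value)

# Example: a stable heart-rate stream with one abrupt deviation
stream = [140, 141, 139, 140] * 75 + [90] + [140] * 50
for i, value, z in detect_anomalies(stream):
    print(f"reading {i}: {value} bpm (z = {z:.1f})")
```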
Pacific Northwest Smart Grid Demonstration Project
Capabilities:
 Stream computing: a real-time control system
 Deep analytics appliance: analyze massive data sets
 Demonstrates scalability from 100 to 500K homes while retaining 10 years of historical data
 60K metered customers in 5 states
 Accommodates ad hoc analysis of price fluctuation, energy consumption profiles, risk, fraud detection, grid health, etc.
Hardcore Research
Computational Biology and Healthcare – Groups and Projects
 Computational Biology Center (Watson Research Lab)
– Comparative Genomics
– Protein Folding
– DNA Transistor (nanopore sequencing)
 Healthcare Informatics (Almaden Research Lab) ***
– AALIM: Advanced Analytics for Information Management
– The Spatiotemporal Epidemiological Modeler (STEM)
– Genome-Wide Association Studies for Predictive Healthcare ***
 Healthcare Solutions (Haifa Research Lab)
– HIV Therapy Prediction (based on virus DNA markers)
– HYPERGENES (genetics of hypertension)
Bioinformatics on Hadoop: Alignment and Variant Calling
 This DNA sequence analysis workflow is implemented in the academic software Crossbow (Bowtie aligner + SOAPsnp variant caller)

[Workflow diagram: DNA reads are distributed across parallel Aligner tasks (MAP step); aligned reads are partitioned by chromosomal region (Chr 1 ... Chr 22, Chr Y); one Variation Caller per region (REDUCE step) emits the SNPs for its chromosome]
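The MAP/REDUCE split in the diagram can be sketched as plain Python. The align_with_bowtie and call_snps_with_soapsnp functions below are trivial stubs standing in for Bowtie and SOAPsnp, and the Hadoop shuffle is simulated with a dictionary; only the partition-by-chromosome structure mirrors Crossbow.

```python
# Schematic of the Crossbow-style MapReduce decomposition.
from collections import defaultdict
import random

def align_with_bowtie(read):
    # Stub for the Bowtie aligner: returns (chromosome, position)
    return random.choice(["chr1", "chr22", "chrY"]), random.randrange(1_000_000)

def call_snps_with_soapsnp(alignments):
    # Stub for the SOAPsnp variant caller over one chromosome's reads
    return f"{len(alignments)} aligned reads -> SNP calls"

def run_pipeline(reads):
    # MAP step: align each read, keyed by chromosome
    by_chrom = defaultdict(list)
    for read in reads:
        chrom, pos = align_with_bowtie(read)
        by_chrom[chrom].append((pos, read))       # shuffle: group by chromosome
    # REDUCE step: variant calling per chromosome is independent, hence parallel
    return {c: call_snps_with_soapsnp(a) for c, a in by_chrom.items()}

print(run_pipeline(["ACGTACGT"] * 10))
```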
Health Outcomes and Epidemiology
Health Analytics – GPRD & OMOP Benchmarking
 General Practice Research Database – GPRD
– European clinical records database: 11 million patients, 24 years
– Pharmacovigilance, drug utilization and health outcomes analytics
 Observational Medical Outcomes Partnership – OMOP
– Shared models and methods for health outcomes analytics
 Migration of GPRD & OMOP data models, ETL and SAS programs to Puredata
– 266 GB of raw data, compressed to 75 GB
 Environment: Puredata 1000-6, SAS & SAS Access 9.2

| Workload | SAS | SAS w/ Puredata |
|---|---|---|
| GPRD copy | 65 hours | ~14 hours |
| OMOP copy | Failed to complete (out of memory) | ~14 hours |
| GPRD to GPRD-OMOP transformation | Two weeks | ~51 minutes |
| Small SAS analysis | 6h 38m | 0h 3m |
| Large SAS analysis | Failed to complete (out of memory) | 16m 47s |
| Standard "OSCAR" analysis | Failed to complete (out of memory) | 6h 21m |
| IHCIS summary | n/a | 48 seconds |
Improved Clinical Insights

Problem and effect
 Post-launch monitoring of clinical data required the manual integration of data across several data providers
 Data inquiries were performed overnight via manually developed SAS programs
 Differences in data assets did not facilitate integration over time
 Pre-formatted data sets (intended to simplify integration) did not enable access to the unique characteristics of the different sources
 Flat-file-oriented processing significantly increased the complexity of analysis
 Significant data duplication, at the risk of good data management

Implementation scope
 Migration of SAS-based flat files to a relational environment in Puredata
 Optimization of existing SAS code to best leverage Puredata in-database functionality
 Integration of clinical data from Premiere and IMS with sales & marketing data from IMS, WK and SDI

Improvement metrics
 Reduction in time to perform data inquiries and analytics
 Advanced analytic capabilities for clinical analysis
 Ability of end users to gain immediate benefit without significant retooling or new technologies

Results
 Immediate 10x performance improvement on existing SAS applications, with little to no rework of original code
 Traditional S&M data assets are leveraged as an early indicator for more detailed investigations with clinical data
 Company data management strategy is now focused on centralizing data assets on Puredata to improve analytic capabilities
Revolution R – Medication Response Modeling
 Understand the major causes of morbidity and mortality related to inaccurate dosages of medications
– Relating individual responses to medication with genetic makeup and environmental factors indicated by biomarker measurements
 The Puredata architecture allowed for the re-use of the existing investment in R
– 20x+ performance improvement over the existing HPC infrastructure
– The business logic of the R code remained intact
– The speed at which they could interrogate the data let them experiment with many models
 Explored use of in-database analytics (dbLytix)
– 1000x performance improvement over the existing HPC infrastructure

| | HPC Environment | IBM Puredata 1000-12 | IBM Puredata 1000-12 |
|---|---|---|---|
| Language | R | R (nzTapply) | dbLytix |
| Platform | Linux HPC | IBM Puredata | IBM Puredata |
| Nodes | TBD (100+) | 12 | 12 |
| Deployment | R server | Revolution R Enterprise | dbLytix |
| Interface | Linux command line | Desktop GUI (nzTapply) | – |
| Evaluations | – | – | 24.9 billion |
| Elapsed time | 36+ hours | 2 hours | 2 minutes |
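The nzTapply pattern above is split-apply modeling: one small model fitted per group, with groups processed in parallel inside the appliance. Below is a minimal pandas sketch of the same pattern on synthetic data; the column names (patient_id, biomarker, response) and the per-group model are assumptions for illustration, and everything runs serially here.

```python
# Split-apply: fit one tiny regression per patient group.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "patient_id": np.repeat(np.arange(100), 50),      # 100 patients, 50 obs each
    "biomarker":  rng.normal(size=5_000),
})
df["response"] = 2.0 * df["biomarker"] + rng.normal(size=5_000)

def fit_slope(group):
    # Per-patient model: least-squares slope of response vs. biomarker
    slope, _intercept = np.polyfit(group["biomarker"], group["response"], 1)
    return slope

# In-database, each group would be shipped to a worker node (the nzTapply idea)
slopes = df.groupby("patient_id").apply(fit_slope)
print(slopes.describe())                               # slopes cluster near 2.0
```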
Optum Insight – Predictive Healthcare with Fuzzy Logix
 Predict who is at risk of developing diabetes 6 months, 1 year & 2 years out
– Provide more advanced intervention services for health plans
– Based upon medical observations and patient demographics
– The in-memory analytic environment was limited to analyzing small patient cohorts with limited computational capability
 "Optum Insight could not do this work without Netezza or Fuzzy Logix"
– Leveraged 79x more medical observations, processed 150x faster
– From 150 variables to 1,700, with a capacity for 5,000+
– Fast iterative analysis for continuous model improvement

| | In-Memory Analytics (Linux server) | Puredata Model v.1 (UDA) | Puredata Model v.2 (UDTF) |
|---|---|---|---|
| Nodes | Multi-core / 4 GB RAM | 24 CPU / 96 cores | 24 CPU / 96 cores |
| Observations (rows) | 2 million | 2 million | 14 million |
| Variables (dimensions) | 150 | 1,500 | 1,700 |
| Evaluations (matrix elements) | 300 million | 2.7 billion | 23.8 billion |
| Elapsed time | > 5 hours | 30 minutes | 4 hours |
| Improvement* | – | 10x | 150x |

*Improvement estimates are conservative. Models on Puredata leveraged 10x the amount of data while performing several additional calculations that were not feasible with the in-memory solution.
Health Insurance Provider – SAS SQL Pass-Thru
• Puredata easily outperforms a finely tuned BCU
– IBM BCU database used for SAS applications
– 5 years of policyholder data
– "Nasty" queries: "decision tree across neural networks by zip code"
– The BCU had just undergone 6 months of tuning
– The Puredata POC loaded raw data with no optimization
• Testing used SAS/Access for ODBC
– Will be even faster with SAS/Access for Puredata
– 15x average performance improvement
– Over 26x on long-running analytics

[Chart: query runtimes in seconds (0 to 10,000), IBM BCU vs. TwinFin-12]

| | IBM BCU | Puredata 1000-12 |
|---|---|---|
| CPU | 22 | 24 CPU / 96 cores |
| Storage | 32.0 TB | 32.0 TB |
| Data | 1.5 TB | 0.8 TB |
| Indices | 1.5 TB | Not applicable |
| Duplication | 15.0 TB | Not applicable |
| Tuning | 6 months | 1-wk POC |
Catalina Marketing – In-Database Analytics
 35x improvement in staff productivity
– Model development reduced from 2+ months to 2 days
– From 10s of models per year to 100s with the same staff
 Increased depth of data per model
– From 150 to 3.2 million features
– From 1 million to 14.5 trillion records per analysis
 Impressive ROI on IT investment
– 12 million times more data processed per CPU per day
– Direct correlation of model development to revenue

[Charts: "Developing Models Faster" (days for model build, with speedup, vs. number of CPUs: 1, 4, 24, 400) and "While Increasing Depth of Data" (rows in millions and rows/day/CPU vs. number of CPUs)]
Catalina – SAS Scoring Accelerator

| | PC SAS | Unix SAS | SAS MP Connect | Puredata SAS SQL Pass-Thru | Puredata SAS Scoring Accelerator |
|---|---|---|---|---|---|
| Nodes | 1 CPU | 4 CPU | 24 CPU | 400 CPU | 400 CPU |
| Storage | 0.5 TB | 5 TB | 15 TB | 120 TB | 120 TB |
| #rows (data volume) | 1 million | 4 billion | 4 billion | 14 trillion | 140 trillion |
| #columns (dimensions, features) | 30 brands, 5 variables | 30 brands, 800 categories, 5 variables | 30 brands, 800 categories, 5 variables | 80,000 brands, 800 categories, 5 variables | 80,000 brands, 800 categories, 5 variables |
| Model build and deploy | 70-84 days | 35 days | 10 days | 3 days | 2 days |
| Rows/CPU/day | 14,286 | 28,571,429 | 16,666,667 | 12,083,333,333 | 175,625,000,000 |
| Per-CPU speedup from PC SAS | – | 2,000x | 1,167x | 845,833x | 12,293,750x |
| Rows/day | 14,286 | 114,285,714 | 400,000,000 | 4,833,333,333,333 | 70,250,000,000,000 |
| Per-day speedup from PC SAS | – | 8,000x | 28,000x | 338,333,333x | 4,917,500,000x |
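The derived throughput rows can be recomputed from the raw inputs in the table. Note that the published rates imply row counts of roughly 14.5 and 140.5 trillion for the two Puredata columns (consistent with the 14.5 trillion figure on the previous slide) rather than the rounded 14 and 140 trillion shown; those inferred counts are used below.

```python
# Recompute rows/day and rows/CPU/day from (rows, CPUs, build days).
configs = {
    "PC SAS":                  (1_000_000,             1, 70),
    "Unix SAS":                (4_000_000_000,         4, 35),
    "SAS MP Connect":          (4_000_000_000,        24, 10),
    "Puredata SQL Pass-Thru":  (14_500_000_000_000,  400,  3),
    "Scoring Accelerator":     (140_500_000_000_000, 400,  2),
}
for name, (rows, cpus, days) in configs.items():
    per_day = rows / days
    print(f"{name:>24}: {per_day:,.0f} rows/day, {per_day / cpus:,.0f} rows/CPU/day")
```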
Harvard Medical School Collaboration
What is the Harvard Computational PharmacoEpidemiology Program?
• Harvard Medical School faculty selected Puredata for complex pharmaco-epidemiology analytics and studies on drug effectiveness & safety
• 100% of computation runs on Puredata
• Faculty are in the Methods Core of FDA Mini-Sentinel and are globally esteemed
Why computational pharmaco-epidemiology?
• The FDA will be implementing a system to track the safety of drugs and devices through active surveillance on tens to hundreds of terabytes of claims data (FDA Mini-Sentinel)
• Pharma companies want to innovate ahead of Sentinel and find new markets and risks
• Payers want to measure their insured populations, providers, outcomes and ACOs
• Comparative effectiveness research is a top priority for next-generation healthcare
Why is it special?
• These end users have no IT budget and no DBAs, period!