Big Data Analytics for Life & Agricultural Sciences
IBM Big Data Session 2: IBM Big Data Platform Overview
Bill Zanine, GM IBM Advanced Analytics for Big Data
IBM Big Data Life Sciences, Healthcare & Research Use Cases
© 2012 IBM Corporation

The IBM Big Data Platform
• High-Performance Computing
  – IBM Blue Gene/Q: high performance for computationally intensive applications
  – Platform Computing: high-performance framework for distributed computing
• Hadoop
  – InfoSphere BigInsights: Hadoop-based low-latency analytics for variety and volume
• Stream Computing
  – InfoSphere Streams: low-latency analytics for streaming data
• MPP Data Warehouse
  – IBM Puredata System for Operational Analytics: operational analytics on structured data
  – IBM Smart Analytics System: BI and ad hoc analytics on structured data

IBM Big Data Strategy: Optimized Analytic & Compute Platforms
Big data, business strategy and new analytics require optimized analytic platforms:
• Integrate and manage the full variety, velocity and volume of data
• Variety, velocity and volume further drive the need for optimized compute platforms
• Operate on the data in its native form, in its current location, minimizing movement
• Simplify data management, embed governance, and seamlessly leverage tools across platforms
[Diagram: analytic applications (BI/reporting, exploration/visualization, functional and industry apps, predictive analytics, content analytics) sit on the IBM Big Data Platform, which provides visualization & discovery, application development, systems management and analytic accelerators across the High-Performance Computing, Hadoop, Data Warehouse and Stream Computing engines, on a foundation of Information Integration & Governance.]

IBM Big Data Platforms Span the Breadth of Data-Intensive Analytic Computing Needs
• Peta-scale analytics: petascale interactive data processing; unstructured batch and interactive structured workloads
• Real-time analytics: in-line analytics; autonomous operation
• Petascale computing and petascale data processing: highly planned, scalable batch workloads
No single architecture can fulfill all the compute needs for big data analytics, exploration and reporting. Specialized architectures and technologies are optimized to solve specific compute profiles, data volumes and data types. Architectures must be leveraged appropriately for scalable, cost-effective computing.

Data Intensive Computing for Life Sciences
• Streaming data – sensor data from manufacturing, medical, environmental and lab sources: patient monitoring, quality control
• Peta-scale data – sequencer data, assays, medical records, tissue and cell imaging: SNP alignment, image classification, attribute extraction
• Peta-scale computing – simulations and models, publication graphs: protein science, molecular dynamics, complex graph analysis
• Data-intensive, interactive computing – image metadata, genomic profiles, environmental and patient attribution: translational medicine, genotype/phenotype, predictive healthcare
• Data-intensive compute acceleration – complex simulations: matrix acceleration, compute acceleration

Tackling Computational Challenges in Life Sciences
• The vast datascape of life sciences
• Scalable, cost-efficient genomic processing and analysis
• Scalable sensor processing and geospatial analytics
• Health outcomes and epidemiology

Combining Big Data and Smarter Analytics to Improve Performance in the Life Sciences
SNP yield, biologic records, plant records, gene assembly, gene interaction, proteins, translational analysis, proteomics, simulations, claims, weather, clustering, health records, environmental records, tissue topography, geospatial analysis.

Scalable Genomic Data Processing

SUNY Buffalo – Center for Computational Research
Data Intensive Discovery Initiative

SUNY Buffalo – Large Gene Interaction Analytics
UB Center for Protein Therapeutics
• Use new algorithms and add multiple variables that were previously nearly impossible to analyze
• Reduce the time required to conduct analysis from 27.2 hours without the IBM Puredata data warehouse appliance to 11.7 minutes with it
• Carry out their research with little to no database administration
• Publish multiple articles in scientific journals, with more in process
• Proceed with studies based on ‘vector phenotypes’ – a more complex variable that will further push the IBM Puredata data warehouse appliance platform

Revolution R – Genome Wide Association Study
• Genome Wide Association Study (GWAS) – an epidemiological study that examines many common genetic variants in different individuals to see if any variant is associated with a trait
• Revolution R allows biostatisticians to work with Puredata as if they were simply using R on their desktop – simplicity with performance, considered a “game-changer” by end users
• CRAN library support lets them benefit from the aggregate knowledge of the R community – an extensive selection of packages for biostatistics and relevant analytic techniques makes developers significantly more productive
“What finished in 2 hours 47 minutes on the 1000-6 was still running on the HPC environment and was not estimated to complete until after 14+ days.” – Manager, IT Solution Delivery

EC2 vs Puredata: Bowtie Financial Value Proposition
What amount of processing on EC2 equates to the cost of a Puredata system?
• Bowtie on EC2
  – Assume the new system will be used for Bowtie, and only Bowtie
  – Today, Bowtie takes 3 hours on 320 EC2 CPUs
  – Cost of each Bowtie run = 3 hours × 320 CPUs × $0.68 per CPU per hour = $653* per run
• Bowtie on Puredata
  – A TF-6 costs $600K*, or $200K per year assuming a 3-year deferral
  – How many times would a customer have to run Bowtie on EC2 (on demand) for the same expenditure?
  – $200K per year / $653 per run = 306 Bowtie runs per year
• Puredata is the better financial value proposition if the need to run Bowtie exceeds roughly 300 times a year, for 3 years
• The Puredata TF-6 also offers 9x the Bowtie processing capacity of a comparably priced EC2 environment
*Costs are relative, based upon 2010 list pricing

Benchmarks on Tanay ZX Series
For all the options traded in the US in a given day as reported by OPRA (500K to 1 million trades), implied volatility can be calculated by the Tanay ZX Series in less than 500 milliseconds.

Applications in Healthcare
• Predictive modeling
  – Relationship between conditions, genetics, demographics and outcomes
  – Survival modeling
• Monte Carlo simulations
  – Gibbs sampling for MCMC studies on patient data
  – Drug design: molecular docking
• Gene sequencing
  – Parallelized Basic Local Alignment Search Tool (BLAST)

Sensor & Geospatial Data Processing

IBM Puredata Spatial – Precision Farming
High-speed analytics on farm telematics: yield data (GPS), soil data, Common Land Units (CLUs), elevation, farm plots
Example – farm equipment company:
• Intersect 48 million crop yield records (points) with 30 million Common Land Units
• Result: ~411,000 summary records by CLU (min, max, avg yield)
• Total time: ~45 minutes
“We would not even attempt to do this large a process on Oracle.” – Customer GIS analyst

Information Management
Vestas
A global wind energy company based in Denmark
Business challenge: improve placement of wind turbines – save time, increase output, extend service life
Project objectives:
• Leverage a large volume of weather data (2.8 PB today; ~16 PB by 2015)
• Reduce modeling time from weeks to hours
• Optimize ongoing operations
Solution: IBM InfoSphere BigInsights Enterprise Edition on IBM xSeries hardware
Why IBM?
• Domain expertise
• Reliability, security, scalability, and an integrated solution
• Standard enterprise software support
• A single vendor for software, hardware, storage and support

University of Ontario Institute of Technology
Detecting life-threatening conditions in neonatal care units
Business challenge:
• Premature births and associated health risks are on the rise
• Enormous data loss from patient monitoring equipment: 3,600 readings/hr reduced to 1 spot reading/hr
• By analyzing physical behaviors (heart rate, respiration, etc.) it is possible to determine when the body is coming under stress
Project objectives:
• Analyze ALL the data in real time to detect when a baby is becoming unwell earlier than is possible today
• Reduce the average length of stay in neonatal intensive care, reducing healthcare costs
The benefits:
• Analyze ~90 million points of data per day per patient in real time – every reading taken is analyzed
• Able to stream the data into a database, and shown that the process can keep pace with the incoming data
Solution components: InfoSphere Streams (on premises and in the cloud); a warehouse to correlate physical behavior across different populations; models developed in the warehouse are used to analyze the streaming data.
“I could see that there were enormous opportunities to capture, store and utilize this data in real time to improve the quality of care for neonatal babies.”
– Dr. Carolyn McGregor, Canada Research Chair in Health Informatics, University of Ontario Institute of Technology

Pacific Northwest Smart Grid Demonstration Project
Capabilities:
• Stream computing – real-time control system
• Deep analytics appliance – analyze massive data sets
• Demonstrates scalability from 100 to 500K homes while retaining 10 years’ historical data
• 60K metered customers in 5 states
• Accommodates ad hoc analysis of price fluctuation, energy consumption profiles, risk, fraud detection, grid health, etc.
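The neonatal-monitoring and smart-grid cases above share one in-line analytics pattern: score every reading against a trailing window as it arrives, instead of storing one spot sample per hour. InfoSphere Streams applications are written in SPL; the sketch below only illustrates the pattern in Python, and the window size, z-score threshold, and function name are illustrative assumptions, not part of any IBM product:

```python
from collections import deque

def detect_stress(readings, window=60, z_thresh=3.0):
    """Flag readings more than z_thresh standard deviations from the
    trailing-window mean. `readings` is an iterable of (time, value)."""
    buf = deque(maxlen=window)   # trailing window of recent values
    alerts = []
    for t, x in readings:
        if len(buf) >= 10:       # wait for a minimal baseline
            mean = sum(buf) / len(buf)
            var = sum((v - mean) ** 2 for v in buf) / len(buf)
            if var > 0 and abs(x - mean) > z_thresh * var ** 0.5:
                alerts.append((t, x))
        buf.append(x)            # every reading updates the baseline
    return alerts
```

Because each reading touches only the fixed-size window, the per-reading cost is constant, which is what lets this style of analysis keep pace with ~90 million readings per patient per day rather than falling back to hourly spot samples.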
Hardcore Research

Computational Biology and Healthcare – Groups and Projects
• Computational Biology Center (Watson Research Lab)
  – Comparative genomics
  – Protein folding
  – DNA Transistor (nanopore sequencing)
• Healthcare Informatics (Almaden Research Lab)
  – AALIM: Advanced Analytics for Information Management
  – The Spatiotemporal Epidemiological Modeler (STEM)
  – Genome-wide association studies for predictive healthcare
• Healthcare Solutions (Haifa Research Lab)
  – HIV therapy prediction (based on virus DNA markers)
  – HYPERGENES (genetics of hypertension)

Bioinformatics on Hadoop: Alignment and Variant Calling
This DNA sequence analysis workflow is implemented in the academic software Crossbow (Bowtie aligner + SOAPsnp variant caller). In the MAP step, aligners place DNA reads onto chromosomal regions (Chr 1 … Chr 22, Chr Y); the aligned reads are grouped by chromosomal region, and in the REDUCE step a variant caller per region emits that chromosome’s SNPs.

Health Outcomes and Epidemiology

University of Ontario Institute of Technology – Detecting life-threatening conditions in neonatal care units (same case study as the earlier slide)

Health Analytics – GPRD & OMOP Benchmarking
• General Practice Research Database (GPRD)
  – European clinical records database: 11 million patients, 24 years
  – Pharmacovigilance, drug utilization and health outcomes analytics
• Observational Medical Outcomes Partnership (OMOP)
  – Shared models and methods for health outcomes analytics
• Migration of the GPRD & OMOP data models, ETL and SAS programs to Puredata
  – 266 GB of raw data, compressed to 75 GB

Task                            | SAS & SAS/Access 9.2               | SAS with Puredata 1000-6
GPRD copy                       | 65 hours                           | ~14 hours
OMOP copy                       | Failed to complete (out of memory) | ~14 hours
GPRD-to-OMOP transformation     | Two weeks                          | ~51 minutes
Small SAS analysis              | 6h 38m                             | 0h 3m
Large SAS analysis              | Failed to complete (out of memory) | 16m 47s
Standard “OSCAR” analysis       | Failed to complete (out of memory) | 6h 21m
IHCIS summary                   | n/a                                | 48 seconds

Improved Clinical Insights
Problem and effect:
• Post-launch monitoring of clinical data required the manual integration of data across several data providers
• Data inquiries were performed overnight via manually developed SAS programs
• Differences in data assets did not facilitate integration over time
• Pre-formatted data sets (intended to simplify integration) did not enable access to the unique characteristics of different sources
• Flat-file-oriented processing significantly increased the complexity of analysis
• Significant data duplication, at the risk of good data management
Implementation scope:
• Migration of SAS-based flat files to a relational environment in Puredata
• Optimization of existing SAS code to best leverage Puredata in-database functionality
• Integration of clinical data from Premiere and IMS with sales & marketing data from IMS, WK and SDI
Improvement metrics:
• Reduction in the time to perform data inquiries and analytics
• Advanced analytic capabilities for clinical analysis
• End users gain immediate benefit without significant retooling or new technologies
Result:
• Immediate 10x performance improvement on existing SAS applications with little to no rework of original code
• Traditional S&M data assets leveraged as an early indicator for more detailed investigations with clinical data
• Company data management strategy now focused on centralizing data assets on Puredata to improve analytic capabilities

Revolution R – Medication Response Modeling
Understand the major causes of morbidity and mortality related to inaccurate dosages of medications
• Relating individual responses to medication with genetic makeup and environmental factors indicated by biomarker measurement
The Puredata architecture allowed re-use of the existing investment in R:
• 20x+ performance improvement over the existing HPC infrastructure
• Business logic of the R code remained intact
• The speed at which they could interrogate the data allowed them to experiment with many models
Explored use of in-database analytics (dbLytix):
• 1,000x performance improvement over the existing HPC infrastructure

Environment          | Language     | Platform     | Nodes      | Deployment interface                | Elapsed time
HPC                  | R            | Linux HPC    | TBD (100+) | R server, command line              | 36+ hours
IBM Puredata 1000-12 | R (nzTapply) | IBM Puredata | 12         | Revolution R Enterprise desktop GUI | 2 hours
IBM Puredata 1000-12 | dbLytix      | IBM Puredata | 12         | dbLytix (24.9 billion evaluations)  | 2 minutes

Optum Insight – Predictive Healthcare with Fuzzy Logix
Predict who is at risk to develop diabetes 6 months, 1 year and 2 years out
• Provide more advanced intervention services for health plans
• Based upon medical observations and patient demographics
• The in-memory analytic environment was limited to analyzing small sets of patient cohorts with limited computational capability
“Optum Insight could not do this work without Netezza or Fuzzy Logix”
• Leveraged 79x more medical observations, processed 150x faster
• From 150 variables to 1,700, with a capacity for 5,000+
• Fast iterative analysis for continuous model improvement

Environment               | Hardware                         | Observations (rows) | Variables (dimensions) | Evaluations (matrix elements) | Elapsed time | Improvement*
In-memory analytics       | Linux server, multi-core/4GB RAM | 2 million           | 150                    | 300 million                   | > 5 hours    | –
Puredata model v.1 (UDA)  | 24 CPU/96 cores                  | 2 million           | 1,500                  | 2.7 billion                   | 30 minutes   | 10x
Puredata model v.2 (UDTF) | 24 CPU/96 cores                  | 14 million          | 1,700                  | 23.8 billion                  | 4 hours      | 150x
*Improvement estimates are conservative. Models on Puredata leveraged 10x the amount of data while performing several additional calculations that were not feasible with the in-memory solution.

Health Insurance Provider – SAS SQL Pass-Thru
Puredata easily outperforms a finely tuned BCU:
• The IBM BCU database had been used for SAS applications on more than 5 years of policyholder data, with “nasty” queries (“decision tree across neural networks by zip code”), and had just undergone 6 months of tuning
• The Puredata POC loaded raw data with no optimization
• Testing used SAS/Access for ODBC; results will be even faster with SAS/Access for Puredata
• 15x average performance improvement; over 26x on long-running analytics

            | IBM BCU  | Puredata 1000-12 (TwinFin-12)
CPU         | 22       | 24 CPU/96 cores
Storage     | 32.0 TB  | 32.0 TB
Data        | 1.5 TB   | 0.8 TB
Indices     | 1.5 TB   | Not applicable
Duplication | 15.0 TB  | Not applicable
Tuning      | 6 months | 1-week POC
[Chart: query times in seconds (0–10,000) comparing IBM BCU and TwinFin-12]

Catalina Marketing – In-Database Analytics
Developing models faster…
• 35x improvement in staff productivity
  – Model development reduced from 2+ months to 2 days
  – From tens of models to hundreds per year with the same staff
• Increased depth of data per model
  – From 150 to 3.2 million features
  – From 1 million to 14.5 trillion records per analysis
• Impressive ROI on IT investment
  – 12 million times more data processed per CPU per day
  – Direct correlation of model development to revenue
[Charts: days for model build and rows/day/CPU (millions) versus number of CPUs (1, 4, 24, 400)]

Catalina – SAS Scoring Accelerator
                          | PC SAS                 | Unix SAS                               | SAS MP Connect                         | Puredata SAS SQL Pass-Thru                 | Puredata SAS Scoring Accelerator
Nodes                     | 1 CPU                  | 4 CPU                                  | 24 CPU                                 | 400 CPU                                    | 400 CPU
Storage                   | 0.5 TB                 | 5 TB                                   | 15 TB                                  | 120 TB                                     | 120 TB
#rows (data volume)       | 1 million              | 4 billion                              | 4 billion                              | 14 trillion                                | 140 trillion
#columns (dimensions)     | 30 brands, 5 variables | 30 brands, 800 categories, 5 variables | 30 brands, 800 categories, 5 variables | 80,000 brands, 800 categories, 5 variables | 80,000 brands, 800 categories, 5 variables
Model build and deploy    | 70–84 days             | 35 days                                | 10 days                                | 3 days                                     | 2 days
Rows/CPU/day              | 14,286                 | 28,571,429                             | 16,666,667                             | 12,083,333,333                             | 175,625,000,000
Per-CPU speedup vs PC SAS | –                      | 2,000x                                 | 1,167x                                 | 845,833x                                   | 12,293,750x
Rows/day                  | 14,286                 | 114,285,714                            | 400,000,000                            | 4,833,333,333,333                          | 70,250,000,000,000
Per-day speedup vs PC SAS | –                      | 8,000x                                 | 28,000x                                | 338,333,333x                               | 4,917,500,000x

Harvard Medical School Collaboration
What is the Harvard Computational PharmacoEpidemiology Program?
• Harvard Medical School faculty selected Puredata for complex pharmaco-epidemiology analytics and studies on drug effectiveness and safety
• 100% of computation runs on Puredata
• Faculty are in the Methods Core of FDA Mini-Sentinel and globally esteemed
Why computational pharmaco-epidemiology?
• The FDA will be implementing a system to track the safety of drugs and devices through active surveillance on tens to hundreds of terabytes of claims data (FDA Mini-Sentinel)
• Pharmaceutical companies want to innovate ahead of Sentinel, finding new markets and risks
• Payers want to measure their insured population, providers, outcomes and ACOs
• Comparative effectiveness research is a top priority for next-generation healthcare
Why is it special?
• These end users have no IT budget and no DBAs, period!
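The Crossbow-style workflow on the earlier “Bioinformatics on Hadoop” slide decomposes neatly into MapReduce: the map step aligns each read to a chromosomal region, the shuffle groups aligned reads by chromosome, and the reduce step calls variants per region. The toy Python sketch below shows only that shape; the exact-match-with-one-mismatch “aligner” and majority-vote “caller” are stand-ins for Bowtie and SOAPsnp, not their actual algorithms:

```python
from collections import defaultdict

# Toy reference: one short "chromosome" per name (illustrative only).
REFERENCE = {"chr1": "ACGTACGT", "chr2": "TTGACCAA"}

def map_read(read):
    """MAP: align a read to the reference by substring search,
    allowing one mismatch. Emits (chromosome, (offset, read))."""
    for chrom, seq in REFERENCE.items():
        for off in range(len(seq) - len(read) + 1):
            window = seq[off:off + len(read)]
            if sum(a != b for a, b in zip(window, read)) <= 1:
                yield chrom, (off, read)
                return  # report first hit only

def reduce_variants(chrom, aligned):
    """REDUCE: for one chromosome, report positions where the majority
    base across aligned reads disagrees with the reference (toy SNP call)."""
    piles = defaultdict(list)          # position -> observed bases
    for off, read in aligned:
        for i, base in enumerate(read):
            piles[off + i].append(base)
    snps = []
    for pos, bases in sorted(piles.items()):
        call = max(set(bases), key=bases.count)   # majority vote
        if call != REFERENCE[chrom][pos]:
            snps.append((pos, REFERENCE[chrom][pos], call))
    return snps

def crossbow_like(reads):
    """Shuffle mapper output by chromosome, then reduce each group."""
    groups = defaultdict(list)
    for read in reads:
        for chrom, rec in map_read(read):
            groups[chrom].append(rec)
    return {chrom: reduce_variants(chrom, recs) for chrom, recs in groups.items()}
```

The point of the decomposition is that alignment parallelizes over reads and variant calling parallelizes over chromosomal regions, so both steps scale out across a Hadoop cluster independently.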