Taming the Big Ocean Data

National Aeronautics and Space Administration, Jet Propulsion Laboratory, California Institute of Technology, Pasadena, California
Thomas Huang
Project Technologist
NASA Physical Oceanography Distributed Active Archive Center
Jet Propulsion Laboratory
California Institute of Technology
4800 Oak Grove Drive
Pasadena, CA 91109-8099, United States of America
THUANG/JPL
Cloud Computing @GSAW 2015
NASA’s PO.DAAC
•  The NASA Physical Oceanography Distributed
Active Archive Center (PO.DAAC) at the Jet Propulsion
Laboratory is an element of the Earth Observing
System Data and Information System (EOSDIS).
EOSDIS provides science data to a wide
community of users for NASA’s Science Mission
Directorate.
•  Archives and distributes data relevant to the physical
state of the ocean
•  The mission of the PO.DAAC is to PRESERVE
NASA’s ocean and climate data and make these
universally ACCESSIBLE and MEANINGFUL.
http://podaac.jpl.nasa.gov
PO.DAAC’s Mission and Cloud Computing
Data Preservation
•  Archive on the Cloud
•  Storage redundancy
•  Hardware reliability
•  Elastic storage
[Figure: Data Management & Archive System, Applications and Business Logics layers. Labels: Ingest Handlers for Aquarius, GHRSST, ASCAT, and Jason-1; Managers]
Data Accessibility
•  Data services availability
•  Platform for
•  Spatial Searches
•  Spatial subsetting
•  Quality screening, etc.
•  Zone replication (with additional costs)
[Figure: Data Management & Archive System, service layers. Labels: Inventory, Security, Sig Event, Search, Job Tracking Services (ZooKeeper), File Services, Ingest Pools, Archive Pools]
Data Analysis
•  Climatology
•  Data re-gridding
•  Relevancy
•  Anomaly detections
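The climatology and anomaly detection items above can be illustrated with a small calculation. The following is a minimal Python sketch, assuming observations arrive as (year, month, value) tuples and that an anomaly is flagged by a fixed threshold; the function names, sample format, and threshold rule are illustrative assumptions, not the PO.DAAC implementation.

```python
from collections import defaultdict
from statistics import mean


def monthly_climatology(samples):
    """Return {month: mean value} from (year, month, value) samples."""
    by_month = defaultdict(list)
    for _year, month, value in samples:
        by_month[month].append(value)
    return {month: mean(values) for month, values in by_month.items()}


def anomalies(samples, climatology):
    """Subtract the climatological mean of each sample's month."""
    return [(y, m, v - climatology[m]) for y, m, v in samples]


def detect_anomalies(anoms, threshold):
    """Keep samples whose departure from climatology exceeds the threshold."""
    return [(y, m, a) for y, m, a in anoms if abs(a) > threshold]


# Hypothetical sea surface temperature samples (degrees C).
samples = [
    (2010, 1, 10.0), (2011, 1, 12.0), (2012, 1, 20.0),
    (2010, 7, 25.0), (2011, 7, 25.0),
]
clim = monthly_climatology(samples)
flagged = detect_anomalies(anomalies(samples, clim), threshold=5.0)
print(flagged)
```

A production service would compute the same quantities per grid cell and in parallel, which is where an elastic computing platform becomes useful.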
Cloud Options
Commercial Cloud: public Cloud (e.g. Amazon)
•  ~$30/TB/month, additional cost for I/O (PUT, COPY, POST…)
•  Transfer Out: ~$90/TB/Month
•  What do we get?
•  Potential cost reduction
•  No hardware maintenance
•  No System Administration required
•  Reliable storage
•  Pay-by-the-drink
On-premise Cloud: private Cloud
•  Closer to the physical archive
•  Fixed storage cost
•  No transfer / I/O cost
•  Need trained System Administrator
•  Provide elastic computing infrastructure and programming model
Bursting Cloud: hybrid Cloud
•  Bursting computing jobs to external (public/private) Clouds
•  Ability to leverage additional computing resources
•  Gotchas
•  Depending on the computing problem, it might require data replication, and hence storage cost, if the external Cloud is a commercial Cloud
•  Cost for On-premise Cloud and possible external costs
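The rough commercial prices quoted above make a back-of-the-envelope estimate possible. The sketch below is a hedged illustration in Python: the two rates come from the slide (~$30/TB/month storage, ~$90/TB transferred out), while the function names, the flat I/O charge parameter, and the example volumes are assumptions for illustration only.

```python
# Approximate rates quoted on the slide (2015 figures).
STORAGE_USD_PER_TB_MONTH = 30.0
TRANSFER_OUT_USD_PER_TB = 90.0


def commercial_monthly_cost(stored_tb, transfer_out_tb, io_usd=0.0):
    """Estimate one month of public Cloud cost.

    io_usd stands in for request charges (PUT, COPY, POST, ...), which the
    slide mentions but does not price.
    """
    return (stored_tb * STORAGE_USD_PER_TB_MONTH
            + transfer_out_tb * TRANSFER_OUT_USD_PER_TB
            + io_usd)


def bursting_replication_cost(replicated_tb, months=1):
    """Extra storage cost when a bursting job needs data copied to a commercial Cloud."""
    return replicated_tb * STORAGE_USD_PER_TB_MONTH * months


# Hypothetical archive: 100 TB stored, 10 TB downloaded by users per month.
print(commercial_monthly_cost(100, 10))        # 3900.0
print(bursting_replication_cost(5, months=2))  # 300.0
```

Note that the transfer-out term grows with user demand, so a popular archive can see distribution cost exceed storage cost.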
Some PO.DAAC-related Cloud Computing Activities
Technology Infusion – Cloud Computing Study
•  Amazon
•  NASA Nebula
•  Apache Hadoop and HBase
•  Built Climatology Service
NASA ITLabs Cloudbursting
•  Job bursting between three NASA centers
Pilot JPL CIO OpenStack (Nebula Inc.)
Nexus – Science Data Analysis Platform
•  Spark, Hadoop, NoSQL, Solr, etc.
•  Turnkey deployment
•  Target projects: Sea Level Change Portal, ACCESS, AIST
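[Figure: Nexus Data Analysis Platform. Labels: EDGE (OpenSearch; Metadata: ISO, GCMD, etc.), Analysis, Data Aggregation Service, Geospatial Metadata Repository, Data Management, Data Access and Distribution, Workflow, Data Analysis]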
Working with Big Data and Cloud Computing Communities
•  Chair, ESIP Federation Cloud Computing Cluster
•  Chair, NASA ESDSWG Data-Intensive Architecture
•  Active Contributors, ESDSWG Cloud Computing
•  Active Contributors, NIST Big Data Working Group
•  JPL Selection Committee for Cloud RFP
Funded Cloud Computing Efforts
2013 Sea Level Rise
•  PI: C. Boening/JPL, A NASA Web portal for Sea Level Change
2013 ROSES ACCESS
•  PI: E. Armstrong/JPL, Enhanced Quality Screening for Earth Science Data
•  PI: C. Lynnes/GSFC, Federated Giovanni
2014 ROSES AIST
•  PI: T. Huang/JPL, OceanXtremes: Oceanographic Data-Intensive Anomaly Detection and Analytics Portal
•  PI: S. Smith/FSU, A Service to Match Satellite and In-situ Marine Observations to Support Platform Intercomparisons, Cross-calibration, Validation, and Quality Control
•  PI: C. Yang/GMU, Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Metadata, Usage Metrics, and User Feedback to Improve Data Discovery and Access (MUDROD)
[Figure: In-situ Match-up architecture. Labels: COAPS, NCAR, JPL, PO.DAAC; in-situ sources SAMOS, ICOADS, SPURS, IVAD; satellite source Physical Ocean; THREDDS, OPeNDAP, W10N, Pomegranate, MySQL, Cache; EDGE (OpenSearch; Metadata: ISO, GCMD, etc.), Data Aggregation Service, Geospatial Metadata Repository; Web Portal, Match-up Service, Matchup Processors, Match-up Products]
[Figure: Virtualized Quality Screening Service. Labels: VirtualQSS Portal, PO.DAAC, w10n, OPeNDAP, Archive]
[Figure: Oceanographic Common Search Interface. Labels: Apache Solr, NASA ECHO, PO.DAAC, NSIDC, Other Data Center, w10n, OPeNDAP, Archive]
PO.DAAC Labs
http://podaac.jpl.nasa.gov/podaac_labs
[Excerpt from the MUDROD proposal shown on the slide]
[...] Knowledge Base including SWEET, ocean ontology, and triple store from PO.DAAC data holdings and the user community, b) re-interface semantic engine as the MUDROD Engine by considering vocabulary, ontology, triple store linkage and weights, metadata, user profile and ocean ontology analyses, and c) integrate the MUDROD GUI for PO.DAAC data discovery, and search data holdings at ECHO and CLH by leveraging previous developments.
1.2.3 MUDROD GUI
User Centered Design will be adopted to integrate the MUDROD Graphical User Interface (GUI) by a) involving user communities for ontology and triple store capture, b) utilizing the workflow of scientists, c) engaging subject matter experts to provide insights and feedback during the integration process, and proactively testing the MUDROD GUI for overall usability at each stage of development.
The MUDROD GUI will provide the interface for user interactions with: a) search constraints input, b) ranked results, c) data exploration based on data recommendations, and d) navigation through semantics to find relevant datasets.
The optimization will be added with three components to assist users with better discovery and access: a) ontology graph will be used to show the semantic association hierarchy of user keywords for navigation; b) similarity ranking will be added to provide better matched results for end users; c) recommendation will be added once user selected a specific dataset. Scientists would be able to use the three functionalities to quickly nail down to available datasets and be directed to the PO.DAAC and other Earth science data downloading and subsetting services within NASA data systems.
1.2.4 MUDROD Engine
The MUDROD engine will include four components: a semantic search dispatcher, a semantic similarity calculator, a result presentation component, and a profile analyzer. Scientists will input keywords or other Earth Science search terms. The MUDROD engine will take the search input and coordinate the search against the data sources at PO.DAAC, ECHO, and CLH using the MUDROD knowledge base. The results will be provided to scientists for interaction in three forms of ranked results, recommendations, and navigation through ontology.
Figure 4. MUDROD Engine Architecture
1.2.4.1 Semantic Search Dispatcher
Based on the semantic capability developed for PO.DAAC clearinghouse and EIE, we will integrate a semantic search dispatcher to transform keyword-based search into semantic search [...]
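The MUDROD text describes a dispatcher that transforms keyword-based search into semantic search using vocabulary and ontology linkage weights. A minimal Python sketch of that idea follows. The term table, its weights, the dataset descriptions, and both function names are hypothetical illustrations, and the simple substring scoring is a stand-in, not the MUDROD implementation.

```python
# Hypothetical weighted term links, standing in for ontology / triple store linkage.
RELATED_TERMS = {
    "wind": {"ocean wind": 0.9, "ascat": 0.7, "scatterometer": 0.4},
    "salinity": {"sea surface salinity": 0.9, "aquarius": 0.8},
}


def expand_query(keywords, related=RELATED_TERMS, min_weight=0.5):
    """Expand keywords into {term: weight}; the keyword itself gets weight 1.0."""
    expanded = {}
    for keyword in keywords:
        term = keyword.lower()
        expanded[term] = 1.0
        for linked, weight in related.get(term, {}).items():
            if weight >= min_weight:
                expanded[linked] = max(expanded.get(linked, 0.0), weight)
    return expanded


def rank_datasets(expanded, datasets):
    """Score datasets by the summed weights of expanded terms in their description."""
    scored = []
    for name, description in datasets.items():
        text = description.lower()
        score = sum(w for term, w in expanded.items() if term in text)
        if score > 0:
            scored.append((name, score))
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Hypothetical dataset descriptions.
datasets = {
    "A": "GHRSST Level 4 sea surface temperature analysis",
    "B": "ASCAT coastal vectors",
    "C": "Aquarius sea surface salinity",
}
terms = expand_query(["wind"])
ranking = rank_datasets(terms, datasets)
print([name for name, _score in ranking])
```

Here dataset B never mentions "wind", yet it is returned because the linked term "ascat" carries the query to it, which is the benefit semantic expansion is meant to provide.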
Visualization as a Service (VaaS)
•  We are also working on visualization widgets that can be embedded into
any HTML page.
•  Visualizing L3 GRACE data from a NetCDF file located somewhere
within PO.DAAC.
NASA Sea Level Change Portal
•  The Sea Level Change Portal will serve
as a central hub for enabling collaboration
within the NASA Sea Level Science
Team
•  The ultimate goal is to provide scientists
and the general public with a “one-stop”
source for current sea level change
information and data, including
interactive tools for accessing and
viewing regional data, a virtual
dashboard of sea level indicators, and
ongoing updates through a suite of
editorial products that include content
articles, graphics, video, and
animations.
Sea Level Change Portal
DATA ACCESS AND ANALYSIS
•  Goal: Enable easy access to multi-disciplinary data sets and
facilitate quick online analyses. Features include
•  Global data selection for geographical maps
•  Spatial
•  Temporal
•  Data analysis
•  Regional averages (time series)
•  Basic statistical analysis (RMS, correlation, PDF, spectral analysis, …)
•  Model/data comparison
•  Data subscription
•  Define search and receive “data alert” once new data matching this search
arrives.
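The data subscription feature above amounts to matching each newly arrived granule against stored searches. A minimal Python sketch follows, assuming a search is defined by a dataset name and a bounding box; the class names, fields, and alert message format are hypothetical and not taken from the portal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Granule:
    dataset: str
    lon: float
    lat: float
    date: str  # ISO 8601 date, e.g. "2015-03-01"


@dataclass(frozen=True)
class Subscription:
    user: str
    dataset: str
    bbox: tuple  # (min_lon, min_lat, max_lon, max_lat)

    def matches(self, granule):
        """True when the granule belongs to the dataset and falls in the box."""
        min_lon, min_lat, max_lon, max_lat = self.bbox
        return (granule.dataset == self.dataset
                and min_lon <= granule.lon <= max_lon
                and min_lat <= granule.lat <= max_lat)


def data_alerts(granule, subscriptions):
    """Return one alert message per subscription the new granule satisfies."""
    return [f"{s.user}: new {granule.dataset} data for {granule.date}"
            for s in subscriptions if s.matches(granule)]


subscriptions = [
    Subscription("alice", "GHRSST", (-130.0, 20.0, -110.0, 40.0)),
    Subscription("bob", "ASCAT", (-180.0, -90.0, 180.0, 90.0)),
]
new_granule = Granule("GHRSST", -120.0, 30.0, "2015-03-01")
print(data_alerts(new_granule, subscriptions))
```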
NASA Sea Level Change Portal
ARCHITECTURE
[Figure: Sea Level Change Portal architecture spanning JPL and NASA GSFC. Labels: SLCP CMS, Ruby on Rails, PostgreSQL, Content Search, Content Index; Science Data Analysis Client; Data Search and Analysis; Nexus Data Analysis Platform (EDGE with OpenSearch and ISO, GCMD, etc. metadata; Analysis; Data Aggregation Service; Geospatial Metadata Repository; Data Management; Data Access and Distribution; Workflow; Data Analysis); Metadata (Dataset and Granule); Data Analysis (time-series, subsetting, comparison, etc.); NASA Common Metadata Repository (CMR); NASA and non-NASA Data Centers supplying data and metadata]
Sea Level Change Portal
NEXUS: DATA ANALYSIS PLATFORM
[Figure: Nexus Data Analysis Platform. Labels: EDGE (OpenSearch; Metadata: ISO, GCMD, etc.), Analysis, Data Aggregation Service, Geospatial Metadata Repository, Data Management, Data Access and Distribution, Workflow, Data Analysis]
•  Data analysis platform on the Cloud
•  Data management and transformation
•  Multi-disciplinary data coordination
•  On-the-fly analysis services
•  Time series
•  Correlation
•  Re-gridding
•  Data subsetting
•  Data visualization service
•  RESTful access to geospatial
array data
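The on-the-fly services listed above (subsetting, regional time series, correlation) reduce to a few array operations. The sketch below shows them in plain Python on a tiny latitude/longitude grid. It assumes cosine-of-latitude area weighting and a Pearson correlation; the function names and the sample grids are illustrative assumptions, not the Nexus API.

```python
import math


def regional_average(grid, lats, lons, bbox):
    """Area-weighted mean of grid[i][j] inside bbox.

    bbox is (min_lat, max_lat, min_lon, max_lon); cells set to None are skipped.
    """
    min_lat, max_lat, min_lon, max_lon = bbox
    total = 0.0
    weight_sum = 0.0
    for i, lat in enumerate(lats):
        if not (min_lat <= lat <= max_lat):
            continue
        weight = math.cos(math.radians(lat))  # cells shrink toward the poles
        for j, lon in enumerate(lons):
            value = grid[i][j]
            if min_lon <= lon <= max_lon and value is not None:
                total += weight * value
                weight_sum += weight
    return total / weight_sum if weight_sum > 0 else None


def time_series(grids, lats, lons, bbox):
    """One regional average per time step."""
    return [regional_average(grid, lats, lons, bbox) for grid in grids]


def correlation(x, y):
    """Pearson correlation of two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical 2 x 2 grids for two time steps.
lats, lons = [0.0, 60.0], [0.0, 10.0]
grids = [
    [[1.0, 1.0], [3.0, 3.0]],
    [[2.0, 2.0], [6.0, 6.0]],
]
global_series = time_series(grids, lats, lons, (-90.0, 90.0, -180.0, 180.0))
tropical_series = time_series(grids, lats, lons, (-10.0, 10.0, -180.0, 180.0))
print(global_series, tropical_series)
```

Each time step is independent of the others, so a platform built on Spark or Hadoop can distribute the per-step averages across workers.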
Federated, Multi-Cloud Architecture for Big Data
•  Recognizing not all data centers are equal
•  Providing a common, portable software solution stack improves interoperability by
providing a distributed data analysis solution for Big Earth Science Data.
[Figure: Federated, multi-Cloud architecture with several Data Centers, each with a Data Management Node and an Analytic Node, plus a Central Analytic Node. Labels recovered from the diagram: Nexus Data Analysis Platform (EDGE with OpenSearch and ISO, GCMD, etc. metadata; Analysis; Data Aggregation Service; Geospatial Metadata Repository; Data Management; Data Access and Distribution; Workflow; Data Analysis); HORIZON Data Management and Workflow Framework (Applications with Ingest Handlers; Business Logics with Managers; Inventory, Security, Sig Event, Search, Product Subscriber; Job Tracking Services (ZooKeeper); File & Product Services (ZooKeeper); Ingest Pools; Worker Pools)]
Summary
•  Start with the architecture. Can’t build with just Cloud.
•  Don’t jump into Cloud because it is popular
•  Use Cloud because it makes sense
•  Cost?
•  Reliability?
•  Platform to improve data access and analysis?
•  Might need to rethink existing software solutions when moving to the Cloud
•  Truly leverage the elasticity of the Cloud?
Summary
•  Use automated deployment – Puppet, Chef, Salt, etc.
•  Bringing the computing close to the data makes sense – On-premise Cloud (currently)
•  Need local experts
•  Governance
•  For Commercial Cloud
•  Simplified (fixed) costing model for Amazon resources
•  Reduced storage and transfer out pricing, and zone replication
•  For On-premise Cloud
•  Export controlled: public data, ITAR software
•  Suggest standardizing the Cloud stack
•  Federated, multi-Cloud environment
THANKS
Questions, and more information: [email protected]