PHEMI Central Datasheet

DATASHEET
PHEMI Central Big Data Warehouse
PHEMI Central™ is a big data warehouse that takes advantage of the power,
scalability, and flexibility of Hadoop while providing fully integrated privacy, security,
governance, and data management—all built right in.
Drive discovery and fuel innovation with big data
economics, while meeting compliance and
governance objectives.
Take Advantage of Enterprise-Grade
Hadoop to Unlock the Value in Your Data
PHEMI Central provides:
• The ability to collect, curate, and consume any volume and variety of data
• High-speed data ingestion and processing to support real-time operations
and business intelligence applications
• Full data control and lifecycle management
• Built-in Privacy by Design to enable collaboration while protecting sensitive
and private information
• The ability to work with big data technologies without an army of
Hadoop programmers
PHEMI Central leverages three unique innovations:
• Metadata Framework: Extensible, descriptive end-to-end metadata
enables field-level access control and data management.
• DPF Framework: Custom and standard Data Processing Functions (code
libraries) can parse, recognize, extract, cleanse, standardize, encrypt, mask,
or redact selected fields.
• Policy Enforcement Engine: All user requests for data are filtered through a
rules engine, ensuring that data remains governance-compliant at all times.
Collect
Consolidate all of your data, structured and unstructured
Curate
Automatically index and catalog data for sub-second lookups
Consume
Make data easily accessible to all of your end users
Privacy, Security, and Governance
Automatically enforce data sharing, consent, and privacy policies
Data Management
Gain full control with enterprise-grade data management
[Figure: PHEMI Central Big Data Warehouse. Data flows from data sources through three stages to applications and users.
Stages: Collect (ingest any raw data type and tag with metadata); Curate (use powerful data processing functions to
transform and catalog data into analytics-ready assets); Consume (generate datasets on demand; use any third-party apps).
Supporting layers: Privacy, Security, Governance (protect information at the field level and ensure rightful access at
scale); Data Management (manage data down to the field level); System Management (enterprise-grade reliability,
availability, and scalability with cluster economics).
Data sources and consuming applications shown: database systems, text, reports and analytics, business intelligence,
custom applications, spreadsheet, third-party applications, images, sensors, genomics.]
Collect
Ingest and tag all types and any size of data
PHEMI Central ingests data from multiple and disparate sources. Data can range from small kilobyte files to large
terabyte files. Schemaless ingestion is fast. You can:
• Stream data from machine-to-machine data sources through the PHEMI REST API
• Push data directly from data sources and ETL tools using either JDBC or the PHEMI REST API
• Deploy a custom connector based on the PHEMI REST API to allow PHEMI Central to fetch data from
data sources
• Upload data manually using a standard web browser window
Data is tagged on ingest with descriptive metadata that immediately enforces privacy policies and data sharing
agreements, and controls the data lifecycle.
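As an illustration of tagging on ingest, the sketch below builds (but does not send) an HTTP request that pushes a small record together with descriptive metadata. The endpoint URL, the header name, and the metadata keys are hypothetical placeholders chosen for this example; they are not the documented PHEMI REST API.

```python
import json
import urllib.request

# Hypothetical endpoint: the real PHEMI REST API paths are not given in this datasheet.
INGEST_URL = "https://phemi.example.com/api/ingest"


def build_ingest_request(payload: bytes, metadata: dict) -> urllib.request.Request:
    """Build a POST request carrying raw data plus descriptive metadata tags."""
    headers = {
        "Content-Type": "application/octet-stream",
        # Assumed convention for this sketch: metadata travels as a JSON header.
        "X-Metadata": json.dumps(metadata, sort_keys=True),
    }
    return urllib.request.Request(
        INGEST_URL, data=payload, headers=headers, method="POST"
    )


request = build_ingest_request(
    b"patient_id,sex\n1001,F\n",
    {"source": "clinic-a", "sensitivity": "identified", "retention_days": 365},
)
# urllib stores header names in str.capitalize() form, hence "X-metadata".
print(request.get_method(), request.full_url)
print(request.get_header("X-metadata"))
```

In this sketch the sensitivity and retention tags stand in for the privacy and lifecycle metadata that the text says is attached at ingest time.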
Curate
Extract the greatest possible value from your data with processing, indexing, cataloging, linking, and metadata
PHEMI Central uses a flexible key-value store. Data is
automatically indexed and cataloged as it is stored, making
it immediately findable and retrievable. Sophisticated
metadata tagging is used to describe, manage, and govern
the data that the system stores.
“For the first time, organizations can take advantage of big data while retaining the
governance and data management of a traditional enterprise data warehouse.”
Data Linking
After cataloging and indexing, data can be linked based on keywords, graph relationships, and geospatial attributes.
Data linking expands the kinds of connections you can make between data items, promotes discovery, and gives
you a more complete picture of your data.
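Keyword-based linking, one of the linking modes mentioned above, can be pictured with a small inverted index. This is a minimal sketch under assumed data: the catalog layout, item names, and keywords are invented for illustration and do not describe how PHEMI Central stores or links data internally.

```python
from collections import defaultdict
from itertools import combinations


def link_by_keyword(catalog: dict) -> dict:
    """Return, for each keyword shared by two or more items, the linked item pairs."""
    index = defaultdict(set)
    for item_id, keywords in catalog.items():
        for keyword in keywords:
            index[keyword].add(item_id)
    return {
        keyword: list(combinations(sorted(items), 2))
        for keyword, items in index.items()
        if len(items) > 1
    }


# Hypothetical cataloged items and their keywords.
catalog = {
    "lab-report-17": {"diabetes", "hba1c"},
    "clinic-note-04": {"diabetes", "follow-up"},
    "sensor-feed-09": {"glucose"},
}
links = link_by_keyword(catalog)
print(links)
```

Here only the keyword shared by two items produces a link; keywords that appear on a single item are ignored.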
Data Processing Function Framework
PHEMI Central lets you develop custom programs, called Data Processing Functions (DPFs), that give you
fine-grained control over how data is processed. DPFs can:
• Parse ingested data, extract or cleanse data, encrypt, mask, or anonymize selected information
• Provide enhanced or deeper indexing and cataloging
• Map data into standardized ontologies
• Analyze streams of machine data to find patterns and exceptions, calculate aggregates, or convert streaming
data into an analytics-ready state for trending and predictive analysis
As your organization’s needs evolve and knowledge advances, you can simply develop new DPFs and re-execute
them on your data. DPFs can be developed in modern programming languages such as Java, Python, and C++.
No specialized expertise in big data technologies such as MapReduce or YARN is required. Your DPF can be
developed by PHEMI, by your in-house programmers, or by a third party.
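To give a feel for what a DPF might do, the sketch below parses pipe-delimited text, cleanses whitespace, redacts a name, masks phone numbers, and replaces an identifier with a salted hash. The function shape, field names, input format, and salt are assumptions made for this example; the actual DPF interface is not described in this datasheet.

```python
import hashlib
import re

# Hypothetical DPF shape: a plain function from raw ingested text to a list of
# cleaned, field-level records. This is not PHEMI's actual DPF interface.
PHONE_PATTERN = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def mask_phone(text: str) -> str:
    """Replace anything that looks like an NNN-NNN-NNNN phone number."""
    return PHONE_PATTERN.sub("***-***-****", text)


def pseudonymize(value: str, salt: str) -> str:
    """Derive a stable pseudonym from an identifier using salted SHA-256."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:12]


def example_dpf(raw: str, salt: str = "demo-salt") -> list:
    """Parse 'id|name|note' lines, cleanse whitespace, and protect selected fields."""
    records = []
    for line in raw.strip().splitlines():
        patient_id, _name, note = (part.strip() for part in line.split("|", 2))
        records.append({
            "patient_ref": pseudonymize(patient_id, salt),
            "name": "REDACTED",
            "note": mask_phone(note),
        })
    return records


print(example_dpf("1001 | Jane Doe | call 604-555-0100 \n"))
```

Because the raw input is left untouched, a revised function of this kind could be run again over the same data later, which matches the re-execution model described above.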
Data Dictionary
Conventional big data systems store big data, but struggle to catalog or track diverse data types. With PHEMI
Central, you can use DPFs to build data dictionaries, identifying and saving a common interpretation for fields that
occur frequently but are named differently or use different format conventions (such as “M/F” vs. “Male/Female”,
or converting between Imperial and metric measurement schemes). Data dictionaries greatly simplify queries
and analysis.
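The “M/F” versus “Male/Female” and Imperial-to-metric cases above can be sketched as a small lookup-based normalizer. The field names, aliases, and canonical values here are hypothetical; a real data dictionary would be defined by the organization's own conventions.

```python
# Hypothetical data dictionary: aliases map differently named source fields to
# one canonical field, and code tables map differing formats to one value set.
FIELD_ALIASES = {
    "sex": "sex",
    "gender": "sex",
    "height_in": "height_cm",
    "height_cm": "height_cm",
}
SEX_CODES = {"m": "Male", "male": "Male", "f": "Female", "female": "Female"}
CM_PER_INCH = 2.54


def normalize(record: dict) -> dict:
    """Rename fields to canonical names and convert values to canonical form."""
    out = {}
    for field, value in record.items():
        key = field.lower()
        canonical = FIELD_ALIASES.get(key, key)
        if canonical == "sex":
            value = SEX_CODES.get(str(value).strip().lower(), value)
        elif key == "height_in":
            # Imperial to metric: inches to centimetres.
            value = round(float(value) * CM_PER_INCH, 1)
        out[canonical] = value
    return out


print(normalize({"Gender": "F", "height_in": 70}))
print(normalize({"sex": "Male", "height_cm": 180}))
```

Once both sources are normalized this way, a single query on `sex` and `height_cm` covers records that originally used different names and units.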
Consume
Access your datasets on demand at sub-second speeds,
even with petabytes of data
Describing or tagging information with metadata means that
users and applications can query data based on the data’s properties, instead
of navigating complex directories or schemas. Multiple users can interact with
the system, accessing datasets via SQL, data exports, and custom applications
built on the PHEMI API.
Above all, information in PHEMI Central is findable and searchable, for users
and applications.
• Break down costly data silos by aggregating, then constructing datasets
across multiple and disparate data sources
• Reduce data sprawl by creating virtual datasets only on export
• Improve consumption speeds with digital assets that are cataloged and
indexed in advance
• Ensure rightful access at all times, with every data request automatically
mediated by the PHEMI policy enforcement engine
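Querying by the data's properties, and materializing a dataset only when it is requested, can be illustrated with a lazy filter over a metadata catalog. The catalog entries and property names below are invented for this sketch; PHEMI Central's own query interfaces are SQL, data exports, and its API, as described above.

```python
from typing import Iterable, Iterator

# Hypothetical catalog: each item is described by metadata properties.
CATALOG = [
    {"id": "a1", "type": "lab_result", "year": 2014, "site": "clinic-a"},
    {"id": "b2", "type": "lab_result", "year": 2015, "site": "clinic-b"},
    {"id": "c3", "type": "image", "year": 2014, "site": "clinic-a"},
    {"id": "d4", "type": "lab_result", "year": 2014, "site": "clinic-b"},
]


def query(catalog: Iterable[dict], **properties) -> Iterator[dict]:
    """Lazily yield items whose metadata matches every requested property."""
    for item in catalog:
        if all(item.get(name) == value for name, value in properties.items()):
            yield item


# Nothing is copied until the caller iterates over the result.
dataset = query(CATALOG, type="lab_result", year=2014)
print([item["id"] for item in dataset])
```

The generator stands in for a virtual dataset: the selection spans items from different sites, and no separate copy exists until the result is consumed.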
Privacy, Security, and Governance
Automatically de-identify, encrypt, or mask personal information
PHEMI Central provides an industry-pioneering set of capabilities to manage
the governance of sensitive data, enforced from end to end and throughout
the lifecycle of data. PHEMI Central uses one coordinated framework based
on Privacy by Design principles to define, manage, and enforce data sharing
agreements and privacy policies across an entire organization or set
of organizations.
Data is tagged with attributes that describe its level of sensitivity. Users
are tagged with attributes that describe their level of authorization. Simple,
powerful access rules describe the relationships between data visibility and
user authorization. Datasets can be associated with access policies that are
independent of the policies attached to the source data collections,
but rightful access to data is always enforced.
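The relationship between data sensitivity and user authorization can be sketched as a simple attribute comparison. The three levels, the field tags, and the rule that untagged fields default to the most restrictive level are assumptions for this example, not PHEMI Central's actual policy model.

```python
# Hypothetical ordering of sensitivity / authorization levels.
LEVELS = {"public": 0, "de-identified": 1, "identified": 2}


def visible_fields(record: dict, field_sensitivity: dict, user_authorization: str) -> dict:
    """Return only the fields whose sensitivity the user is authorized to see."""
    allowed = LEVELS[user_authorization]
    return {
        field: value
        for field, value in record.items()
        # Untagged fields are treated as the most sensitive (assumption).
        if LEVELS[field_sensitivity.get(field, "identified")] <= allowed
    }


record = {"name": "Jane Doe", "birth_year": 1980, "region": "BC", "notes": "n/a"}
tags = {"name": "identified", "birth_year": "de-identified", "region": "public"}

print(visible_fields(record, tags, "de-identified"))
print(visible_fields(record, tags, "identified"))
```

Applying a check of this kind on every request, rather than inside each application, is the idea behind routing all data access through a single policy enforcement point.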
PHEMI Central keeps your data secure:
• User roles determine what operations a user can perform
• The system maintains a complete, tamperproof audit log of operations
and data access
• Communication links from data sources or to consuming systems can
be encrypted using Secure Sockets Layer (SSL) or Transport Layer
Security (TLS)
• Data fields can be individually selected for encryption at rest
• Because privacy and security are performed at the data level, it’s easier
and faster to prototype, test, and deploy new applications
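One common way to make an audit log tamper-evident is to chain entries with cryptographic hashes, so that altering any earlier entry invalidates every later one. The sketch below shows that general technique; it is offered as an illustration only and is not a description of how PHEMI Central implements its audit log.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(user: str, action: str, previous: str) -> str:
    body = {"user": user, "action": action, "previous": previous}
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_entry(log: list, user: str, action: str) -> None:
    """Append an entry whose hash covers its content and the previous hash."""
    previous = log[-1]["hash"] if log else GENESIS
    log.append({
        "user": user,
        "action": action,
        "previous": previous,
        "hash": _digest(user, action, previous),
    })


def verify(log: list) -> bool:
    """Recompute the chain; any edited or reordered entry breaks it."""
    previous = GENESIS
    for entry in log:
        expected = _digest(entry["user"], entry["action"], entry["previous"])
        if entry["previous"] != previous or entry["hash"] != expected:
            return False
        previous = entry["hash"]
    return True


audit_log = []
append_entry(audit_log, "analyst-1", "read dataset cohort-2014")
append_entry(audit_log, "steward-2", "export dataset cohort-2014")
print(verify(audit_log))
```

A chain like this detects tampering rather than preventing it; a production system would also need to protect the most recent hash, for example by storing it separately.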
Privacy by Design
A Privacy by Design (PbD) approach requires you
to take into account seven foundational principles
throughout your system. But how do you know
whether your system implements PbD principles?
Here’s a checklist:
1. Metadata. All data should be tagged on ingest
with enough descriptive information to allow
adequate privacy, sharing, consent, and lifecycle
management, plus compliance with any other
governance requirements.
2. Role-based access control. User and
application access to functionality and operations
is adequately restricted by system roles.
3. Policy-based data access. Access to and
visibility of data is restricted by permissions and
authorizations, and controlled by access policies.
4. Automatic policy enforcement. The system
automatically enforces policies and governance;
manual intervention is not required. Enforcement
is not relegated to applications built on top of the
repository. There’s a single point of management
to ensure policy enforcement.
5. Transparency. Data stewards and privacy
officers can directly view and verify the system
implementation of governance policies.
6. Auditability. The system automatically
tracks system activity, and maintains a detailed,
tamperproof audit log of data access and
system operations.
7. Data immutability. Data in the repository
remains available in its original form, regardless of
what digital assets are derived from the original
through transformation.
8. Ability to anonymize. The system should be
able to de-identify, encrypt, mask, obfuscate, or
redact personal information, and allow the data
steward or privacy officer to choose which version
of data appears to which users.
Privacy by Design is recognized as the global
privacy standard in a landmark resolution by the
International Conference of Data Protection and
Privacy Commissioners. Visit privacybydesign.ca.
Specifications
Data Management
Use a powerful metadata framework to manage digital assets at the field level
Field-level metadata contains the rules and policies governing the data at the field level. Data
retention policies and data sharing agreements are automatically enforced. Data in the system is
immutable: the original data cannot be modified and data is only purged from the system based
on the configured retention policy. Robust version control and rollback capabilities mean
that data is never lost, corrupted, or overwritten.
System Management
Get cluster reliability and economics at scale
PHEMI Central can be deployed at the customer premise, as a managed service, or as a
cloud-based service. The system uses low-cost commodity hardware components and Direct
Attached Storage (DAS) disk drives to lower the cost of ownership compared to traditional
enterprise data warehouse systems. Storage and compute resources scale linearly from terabytes
to petabytes.
All data in the system is replicated three times to ensure availability and resiliency. DAS drives can
be hot-swapped without impacting performance or data availability. Larger or faster DAS drives
and nodes are absorbed into the system and load-balanced automatically.
The system provides clear visibility into system health, diagnostics, troubleshooting, capacity,
and digital assets under management. System management capabilities can also be integrated
with existing tools.
On-Premise Deployment*
4 Cluster Nodes. Each:
• 8xCore (2.2 GHz)
• 64 GB RAM
• 12 TB Direct Attached Storage
2 Management Nodes. Each:
• 4xCore (2.2 GHz)
• 64 GB RAM
• 2 TB RAID1 Storage
1 Front-End Node:
• 4xCore (2.2 GHz)
• 64 GB RAM
• 2 TB RAID1 Storage
10 Gigabit Ethernet Network
Cloud Deployment*
• Subscribe to PHEMI Central as a managed service running on Amazon Web Services.
• Cloud service grows from 1 TB storage capacity.
Data Ingest Protocols
• SFTP File Transfer
• HTTP/HTTPS Manual Upload
• REST Web Services API
• ODBC/JDBC SQL Interface
• CCDA HL7 Interface
Data Export Protocols
• Excel/CSV/TSV Download
• REST Web Services API
• ODBC/JDBC SQL Interface
Analytics Tools
Supports leading analytics tools, including:
• R
• SAP
• SAS
• SPSS
• Qlikview
• Stata
• Tableau
• Netezza
• MySQL
Data Processing Functions
• Excel Reader
• CSV Reader
• JSON Reader
• XML Reader
• Variant Call Format (VCF) Reader
• Custom DPFs
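As a rough sizing illustration for the on-premise specification, raw storage on the cluster nodes can be divided by the three-way replication factor to estimate usable capacity. This back-of-the-envelope sketch ignores file system overhead, reserved space, and compression, and the result is not a figure published by PHEMI.

```python
def usable_capacity_tb(nodes: int, tb_per_node: float, replication_factor: int) -> float:
    """Estimate usable storage as raw capacity divided by the replication factor."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return nodes * tb_per_node / replication_factor


# Values taken from the specification: 4 cluster nodes, 12 TB DAS each,
# all data replicated three times.
raw_tb = 4 * 12
estimate_tb = usable_capacity_tb(nodes=4, tb_per_node=12, replication_factor=3)
print(raw_tb, estimate_tb)
```

Because storage is described as scaling linearly, the same division applies as nodes are added, under the same simplifying assumptions.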
*All our deployments align with appropriate privacy and security requirements, including the Health Insurance Portability and
Accountability Act (HIPAA) and the Health Information Technology for Economic and Clinical Health (HITECH) Act, as well as
Canadian federal and provincial legislation.
Ease Your Entry into Big Data
PHEMI Central makes it easy to break into big data. The software is fully integrated and enterprise-ready, so you don’t need to
hire a team of Hadoop engineers to build and maintain your system. And, you can start small and expand incrementally. Use
PHEMI Central to offload your existing data warehouse, or to capture new data types or sources. Keep your existing systems
and tools and let PHEMI Central feed data into them. You can move into big data as you become ready, at your own speed.
Visit www.phemi.com for more information.
www.phemi.com
[email protected]
twitter.com/PHEMIsystems
linkedin.com/company/phemi
Copyright © 2015, PHEMI and/or its affiliates. All rights reserved. Affiliate names may be trademarks of their respective owners.
This document contains forward-looking features. May 2015