
WHITE PAPER
Beyond the Data Lake
Managing Big Data for Value Creation
In this white paper
The Data Lake Fallacy
Moving Beyond Data Lakes
A Big Data Warehouse Supports Strategy, Value Creation
By Dr. Paul Terry, President & CEO, PHEMI
We live in an era of “big data” in which data-driven insights will drive efficiencies and
increased productivity, fuel discovery, and spark innovation.
To gain these benefits, organizations must take a strategic approach to their digital
assets. They should consider the value these assets can provide today, as well as
into the future, if they are properly, strategically managed. A sound data management
strategy should address the complete data lifecycle, including how data is collected,
stored, secured, curated, analyzed, presented and, finally, destroyed.
Many vendors that purport to address the big data challenge actually offer only the first
items on this checklist — collection, storage, and security. These basic services address
the fact that most organizations possess multiple databases that cannot communicate
with one another. Thus, one commonly proffered solution is to combine all databases
into one — the so-called “data lake.”
In this white paper, I’ll point out the strengths and shortcomings of the data lake
concept and describe an alternative big data management strategy that is scalable and
enterprise-grade, and addresses the complete data lifecycle for significant value creation
today and into the future.
The Data Lake Fallacy
The term data lake implies a single repository where all data is stored in its native format
and made available for retrieval, analysis, and value creation. Without proper curation
and the addition of metadata to guide governance, link related data and provide
additional functionalities, a data lake risks becoming a “data swamp.”
With a data lake approach, finding the right data in real time, along with valuable, related
data, to produce business intelligence and actionable insights remains an ill-defined
proposition. Even the most skilled data scientists can find themselves stuck in a data
swamp, struggling to gain useful results. Software modules that work with big data file
systems and address the additional functionalities needed for an effective data lake
approach are emerging. But an organization that takes an à la carte approach — rather
than investing in an integrated suite of functionalities in a single solution — must select
components, integrate them, configure them, test them, and revise them. Users must
rely on experts skilled in both big data programming and data science. Their information
technology (IT) department will gain a time-consuming new set of responsibilities that
require specialized and hard-to-find skillsets.
With an ad hoc approach to the big data challenge, the execution of an
organizational data strategy is likely to become arduous and time-consuming, with
few assurances of success.
“Data lakes typically begin as ungoverned data stores. Meeting the needs of wider
audiences requires curated repositories with governance, semantic consistency
and access controls—elements already found in a data warehouse. It’s beneficial
to quickly move beyond a ‘data lake’ concept to develop a more robust, logical,
data warehouse strategy.”
— Nick Heudecker, research director, Gartner, Inc.
A data lake, while appealing in its apparent simplicity, may collect, store, and secure
data in its native format, and combine many disparate databases, but it does not
address the suite of additional functionalities that create significant value over time.
With a data lake, it is a challenge to protect, control, find, and retrieve data, much
less create value with it.
One commonly implemented approach is to use Hadoop, an open-source software
framework, written in Java, for distributed storage and processing
of big data. Though Hadoop offers file system availability and reliability, it provides
limited security, particularly in the area of access controls. A new approach is
needed to ensure that only the right user, at the right time, can see the specific data
or level of data they are permitted to access. Data security and privacy are fundamentally about
controlling the visibility of individual pieces of data, such as the names and numbers
associated with a patient record in a healthcare application.
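To make the idea concrete, the following is a minimal, hypothetical Python sketch of per-field visibility labels checked against a user's roles. The record, labels, roles, and function are invented for illustration and do not describe any particular product's implementation.

    # Hypothetical sketch of field-level visibility control on a patient record.
    # Labels, roles, and field names are illustrative, not a real product API.

    PATIENT_RECORD = {
        "patient_id":  {"value": "P-10042",        "visibility": {"admin", "clinician"}},
        "name":        {"value": "Jane Doe",        "visibility": {"clinician"}},
        "diagnosis":   {"value": "Type 2 diabetes", "visibility": {"clinician", "researcher"}},
        "postal_code": {"value": "V6B 1A1",         "visibility": {"clinician", "researcher"}},
    }

    def visible_fields(record, user_roles):
        """Return only the fields whose visibility labels intersect the user's roles."""
        return {name: entry["value"]
                for name, entry in record.items()
                if entry["visibility"] & user_roles}

    print(visible_fields(PATIENT_RECORD, {"researcher"}))  # diagnosis and postal_code only
    print(visible_fields(PATIENT_RECORD, {"clinician"}))   # the full record

In this simplified model, the same query returns different views of the same record depending on who is asking, which is the behaviour a big data platform must enforce at scale.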
Moving Beyond Data Lakes
“While it is certainly true that ‘data lakes’ can provide value to various parts of the
organization, the [data lake’s] proposition of enterprise-wide data management has
yet to be realized.”
— Andrew White, VP and distinguished analyst, Gartner, Inc.
Three functionalities are needed to move beyond the data lake approach towards
a comprehensive, enterprise-grade, big data warehouse solution for value creation:
metadata, governance, and performance.
By assigning metadata to datasets entering a data management system, it is possible to
determine data quality, maintain the original data, and track changes made to that data
(version control). Without metadata, every query begins from scratch. The data lake risks
becoming a data swamp.
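As a rough illustration, and not a prescribed schema, the sketch below models the kind of metadata record that could carry provenance, a fingerprint of the untouched original, basic quality checks, and a version history; all field names are assumptions made for the example.

    # Illustrative metadata record for a dataset entering a data management system.
    # Field names and structure are assumptions, not a prescribed schema.
    import hashlib
    from dataclasses import dataclass, field
    from datetime import datetime, timezone

    @dataclass
    class DatasetMetadata:
        source: str                    # provenance: where the data came from
        original_sha256: str           # fingerprint of the untouched original
        ingested_at: str
        quality_checks: dict = field(default_factory=dict)
        versions: list = field(default_factory=list)   # audit trail of changes

        def record_change(self, who, what):
            """Version control: append an entry rather than overwriting history."""
            self.versions.append({
                "who": who,
                "what": what,
                "when": datetime.now(timezone.utc).isoformat(),
            })

    raw_bytes = b"patient_id,glucose\n P-10042,7.1\n"
    meta = DatasetMetadata(
        source="clinic-lab-feed",
        original_sha256=hashlib.sha256(raw_bytes).hexdigest(),
        ingested_at=datetime.now(timezone.utc).isoformat(),
        quality_checks={"row_count": 1, "schema_ok": True},
    )
    meta.record_change(who="etl-job-17", what="normalized units to mmol/L")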
Governance policies to control who has access to specific datasets or levels of data
granularity are critical to managing privacy, consent, confidentiality, and data-sharing
agreements within or between organizations. Data lakes provide little or no oversight
and control of their contents and security and
privacy policies. Access controls for a single
data repository such as a data lake that lacks
metadata and policy enforcement remain embryonic at best.

Performance is a critical variable in data management practices. Without metadata, and the indexing and cataloging metadata supports, finding answers to queries is slow, cumbersome, and fragmented. Queries themselves depend on data-analysis expertise on the part of enterprise users or the IT staff that supports them. A single data repository, cobbled together with à la carte functionalities, is unlikely to perform as swiftly, accurately, and comprehensively as a purpose-built, optimized data management system that reflects an organization's strategic vision for value creation.

The Power of Metadata
Traditional database metadata describes provenance (lineage) and data definition. Metadata is data that describes and provides context and governance rules for other data. This is useful, but a big data management platform needs to do more. In a big data management platform, metadata describes the original file, adds context such as source, user history, and data-sharing agreements, captures additional data on the nature of the dataset, and supports indexing and cataloging. Metadata allows users to implement governance policies and security and privacy measures such as access and visibility controls. Metadata is the key to powerful end-to-end data control.
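To see why the indexing and cataloging that metadata supports matter for query performance, consider the deliberately simplified, hypothetical sketch below: queries consult a small in-memory index instead of rescanning every raw file from scratch. The structures and function names are illustrative only.

    # Toy metadata index: tags map to dataset identifiers, so a query consults a
    # small index instead of rescanning every raw file from scratch.
    from collections import defaultdict

    tag_index = defaultdict(set)

    def register(dataset_id, tags):
        for tag in tags:
            tag_index[tag].add(dataset_id)

    def find(*tags):
        """Return the IDs of datasets carrying all of the requested tags."""
        matches = [tag_index[t] for t in tags]
        return set.intersection(*matches) if matches else set()

    register("lab-results-2015-03", {"lab", "glucose", "2015"})
    register("imaging-2015-03", {"imaging", "2015"})
    print(find("lab", "2015"))   # {'lab-results-2015-03'}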
A Big Data Warehouse Supports Strategy and Value Creation
The implications of the data lake fallacy for public- or private-sector managers who are
mandated by law or driven by market pressures to store, secure, analyze, and create value
from data in their care should be clear: a holistic, enterprise-grade solution for big data
should manage data across its complete lifecycle.
A complete solution, found in the big data warehouse model, must enable automated data collection; ensure the application of data privacy, security, and governance measures; convert disparate data formats into an analytics-ready state; and analyze and present actionable insights on demand. When a dataset's mandated or useful life reaches an
end, a complete solution must address proper data destruction. Throughout the lifecycle,
data must be preserved in its original form, and any data items derived through transformation after ingest must be accurately tracked.
A complete solution should also be scalable. That is, to validate the solution’s value
proposition, it must be possible to apply the solution incrementally, to one database entity
at a time, in a systematic and cost-effective way, as the business builds out its big data
management strategy and associated investments. Scalability means the marginal cost of
adding the next bit of data is less than the cost of adding the previous bit.
The goal of any big data strategy should be actionable insights, value creation, and
innovation. With a big data management platform approach, the ability to mine for insights
increases as more data sources are added to the system (this is known as the "network effect"). Increases in efficiency and productivity, doing more with less, should be a given. To illustrate how a big data warehouse approach enables an organization's data management strategy to achieve these returns on investment, it's useful to envision a step-by-step approach that includes collection, curation, and consumption.
Collect.
A big data warehouse solution enables a user to automatically collect disparate data
sources, tag them with metadata, catalog and index them, and load them into a data
repository, according to rules set by the user. A critical requirement of a big data warehouse
platform in the collection phase is the ability to handle any kind of data, including structured
(e.g., database records), semi-structured (e.g., Microsoft Excel, machine-collected data, or
genomic files) and/or unstructured (e.g., images or documents).
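As a purely illustrative sketch of the collection phase, assuming nothing about any particular product's API, the following shows an ingest routine that accepts arbitrary bytes, preserves the original, fingerprints it, and registers it in a catalog with user-supplied metadata and tags.

    # Hypothetical ingest step: accept any kind of file, fingerprint it, tag it
    # with metadata, and register it in a catalog. Names and rules are illustrative.
    import hashlib
    from datetime import datetime, timezone

    content_store = {}   # dataset_id -> raw bytes, preserved exactly as received
    catalog = {}         # dataset_id -> metadata describing that item

    def collect(dataset_id, raw_bytes, source, kind, tags):
        content_store[dataset_id] = raw_bytes    # original kept in native format
        catalog[dataset_id] = {
            "source": source,
            "kind": kind,                        # structured / semi-structured / unstructured
            "sha256": hashlib.sha256(raw_bytes).hexdigest(),
            "tags": set(tags),
            "ingested_at": datetime.now(timezone.utc).isoformat(),
        }

    collect("vitals-2015-04", b"hr,bp\n72,120/80\n",
            source="ward-monitor", kind="semi-structured", tags={"vitals", "2015"})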
Curate.
A big data warehouse produces analytics-ready digital assets that are cataloged and
protected. The curation process preserves the original data item in its native format,
providing a baseline resource that can be re-analyzed, potentially in different ways, going
forward. Metadata describes the original file, adds context such as source, user history, and data-sharing agreements, captures additional data, and supports indexing and
cataloging. Metadata allows users to implement governance policies, security, and privacy
measures such as access and visibility controls. And metadata tracks who has touched
that data, when, and how.
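Building on the hypothetical collect sketch above, curation can be pictured as deriving an analytics-ready item that points back to the preserved original, with an audit trail recording who transformed what and when; again, the names and structures are illustrative, not a product API.

    # Hypothetical curation step, building on the collect sketch: a derived,
    # analytics-ready item references the preserved original, and every
    # transformation is written to an audit trail.
    from datetime import datetime, timezone

    audit_log = []

    def curate(parent_id, derived_id, derived_bytes, transform, who):
        parent = catalog[parent_id]              # the original is never modified
        content_store[derived_id] = derived_bytes
        catalog[derived_id] = {
            "derived_from": parent_id,           # lineage back to the native file
            "transform": transform,
            "tags": set(parent["tags"]) | {"analytics-ready"},
        }
        audit_log.append({
            "who": who,
            "what": f"{transform}: {parent_id} -> {derived_id}",
            "when": datetime.now(timezone.utc).isoformat(),
        })

    curate("vitals-2015-04", "vitals-2015-04-clean",
           b"heart_rate,systolic,diastolic\n72,120,80\n",
           transform="split blood pressure into columns", who="etl-job-17")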
Consume.
The result of these steps should be a flexible yet robust platform that enables on-demand
retrieval of datasets and the analysis, actionable insights, and value that justify such
an investment. A big data warehouse needs to integrate with and leverage existing IT
investments, applications, and analytics tools. The platform takes on the role of policy
enforcement, based on the attributes of the user, the metadata, and applicable governance
policies. A big data warehouse platform also supports the development and enables
the use of in-house or third-party applications that perform the actual data analysis for
actionable insights.
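Continuing the same hypothetical sketches, consumption can be pictured as retrieval mediated by a policy check over the user's attributes and the dataset's metadata; the simple tag-based rule below is a stand-in for the much richer governance policies described above.

    # Hypothetical consumption step, building on the sketches above: retrieval is
    # mediated by a policy check over user attributes and dataset metadata.
    def retrieve(dataset_id, user):
        meta = catalog[dataset_id]
        allowed_tags = user.get("allowed_tags", set())
        if not meta["tags"] & allowed_tags:      # policy enforcement at query time
            raise PermissionError(f"{user['name']} may not access {dataset_id}")
        return content_store[dataset_id], meta

    analyst = {"name": "analyst-1", "allowed_tags": {"analytics-ready"}}
    data, meta = retrieve("vitals-2015-04-clean", analyst)   # permitted
    # retrieve("vitals-2015-04", analyst) would raise PermissionError: the raw
    # original is not tagged analytics-ready, so the analyst cannot see it.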
Unlike a data lake, a big data warehouse approach should collect, curate, and consume
data at speed and scale.
Conclusion
Full data-lifecycle management in a single, purpose-built platform enables an optimal,
strategic approach to digital assets for value creation.
Functionalities should include governance (including access, data-sharing, and visibility controls); secure, reliable, scalable, and fast storage; the application of metadata; data immutability, audit, and version control; and timely destruction. Such a platform should enable
swift development of applications to serve an organization’s specific needs.
The result of an organization’s strategic approach to digital assets and investment in a big
data warehouse platform should be increased efficiencies and productivity, and support
for value creation and innovation.
About the author
Dr. Paul Terry is president and CEO of Vancouver, B.C.-based PHEMI,
developer of a big data warehouse platform, where he provides vision and
technical leadership. Terry advises private and public healthcare organizations
on next-generation data strategies. He is an adjunct professor in big data at
Simon Fraser University (SFU) and a partner with Magellan Angel Partners.
He lectures in technology, strategy and product management for the MBA
program at SFU. He is a member of the big data Sub-Committee Working
Group at the BC Institute for Health Innovation and serves on Genome BC’s
Health Strategy Task Force.
Prior to his experience in healthcare and venture capital, Paul was the CTO
and cofounder of OctigaBay Systems—a pioneer in high performance
computing—which was acquired by Cray Inc., the world leader in supercomputing. He
was also the cofounder and CTO of Abatis Systems, which was acquired by Redback
Networks in one of the largest technology acquisitions in Canadian history. He holds an
MBA from the Cranfield School of Management, a marketing diploma from the Chartered
Institute of Marketing, a PhD in electrical engineering and an honours Bachelor’s degree
from the University of Liverpool.
About PHEMI
PHEMI was founded in 2013 by a team of proven entrepreneurs and industry experts.
Headquartered in Vancouver, Canada, the PHEMI team has extensive experience
bringing innovative technologies to enterprise-class customers. Industry expertise—
ranging from healthcare to telecom to public sector to security—drives PHEMI Central
features, while networking and high performance computing technology expertise drive
PHEMI architecture to meet the challenges of big data.
PHEMI Central gives organizations the agility to seamlessly collect data sources,
catalog and curate a powerful inventory of secure digital assets, conceive new business
applications, and rapidly build new solutions to support strategic objectives.
PHEMI partners with best-in-class technology and service providers to deliver a complete
solution to meet any organization’s needs.
Visit www.phemi.com for more information.
www.phemi.com
[email protected]
twitter.com/PHEMIsystems
linkedin.com/company/phemi
Copyright © 2015, PHEMI and/or its affiliates. All rights reserved. Affiliate names may be trademarks of their respective owners. April 2015.