WHITE PAPER

Beyond the Data Lake: Managing Big Data for Value Creation
By Dr. Paul Terry, President & CEO, PHEMI

In this white paper:
- The Data Lake Fallacy
- Moving Beyond Data Lakes
- A Big Data Warehouse Supports Strategy, Value Creation

We live in an era of "big data" in which data-driven insights will drive efficiencies and increased productivity, fuel discovery, and spark innovation. To gain these benefits, organizations must take a strategic approach to their digital assets. They should consider the value these assets can provide today, as well as into the future, if they are properly and strategically managed.

A sound data management strategy should address the complete data lifecycle, including how data is collected, stored, secured, curated, analyzed, presented and, finally, destroyed. Many vendors that purport to address the big data challenge actually offer only the first items on this checklist: collection, storage, and security. These basic services address the fact that most organizations possess multiple databases that cannot communicate with one another. Thus, one commonly proffered solution is to combine all databases into one, the so-called "data lake."

In this white paper, I'll point out the strengths and shortcomings of the data lake concept and describe an alternative big data management strategy that is scalable, enterprise-grade, and addresses the complete data lifecycle for significant value creation today and into the future.

The Data Lake Fallacy

The term "data lake" implies a single repository where all data is stored in its native format and made available for retrieval, analysis, and value creation. Without proper curation and the addition of metadata to guide governance, link related data, and provide additional functionalities, a data lake risks becoming a "data swamp." With a data lake approach, finding the right data in real time, along with valuable related data, to produce business intelligence and actionable insights remains an ill-defined proposition. Even the most skilled data scientists can find themselves stuck in a data swamp, struggling to gain useful results.

Software modules that work with big data file systems and address the additional functionalities needed for an effective data lake approach are emerging. But an organization that takes an à la carte approach, rather than investing in an integrated suite of functionalities in a single solution, must select components, integrate them, configure them, test them, and revise them. Users must rely on experts skilled in both big data programming and data science. Their information technology (IT) department will gain a time-consuming new set of responsibilities that require specialized and hard-to-find skillsets. With an ad hoc approach to the big data challenge, the execution of an organizational data strategy is likely to become arduous and time-consuming, with few assurances of success.

"Data lakes typically begin as ungoverned data stores. Meeting the needs of wider audiences requires curated repositories with governance, semantic consistency and access controls—elements already found in a data warehouse. It's beneficial to quickly move beyond a 'data lake' concept to develop a more robust, logical, data warehouse strategy."
— Nick Heudecker, research director, Gartner, Inc.
A data lake, while appealing in its apparent simplicity, may collect, store, and secure data in its native format and combine many disparate databases, but it does not address the suite of additional functionalities that create significant value over time. With a data lake, it is a challenge to protect, control, find, and retrieve data, much less create value with it.

One commonly implemented approach is to use Hadoop, an open-source software framework, written in Java, for distributed storage and processing of big data. Though Hadoop offers file system availability and reliability, it provides limited security, particularly in the area of access controls. A new approach is needed to ensure that only the right user, at the right time, can see the specific data, or level of data, they are permitted to access. Data security and privacy are all about controlling the visibility of individual pieces of data, such as the names and numbers associated with a patient record in a healthcare application.

Moving Beyond Data Lakes

"While it is certainly true that 'data lakes' can provide value to various parts of the organization, the [data lake's] proposition of enterprise-wide data management has yet to be realized."
— Andrew White, VP and distinguished analyst, Gartner, Inc.

Three functionalities are needed to move beyond the data lake approach towards a comprehensive, enterprise-grade, big data warehouse solution for value creation: metadata, governance, and performance.

By assigning metadata to datasets entering a data management system, it is possible to determine data quality, maintain the original data, and track changes made to that data (version control). Without metadata, every query begins from scratch, and the data lake risks becoming a data swamp.

Governance policies to control who has access to specific datasets, or to specific levels of data granularity, are critical to managing privacy, consent, confidentiality, and data-sharing agreements within or between organizations. Data lakes provide little or no oversight and control of their contents or of security and privacy policies. Access controls for a single data repository, such as a data lake, that lacks metadata and policy enforcement remain embryonic at best.

Performance is a critical variable in data management practices. Without metadata, and the indexing and cataloging that metadata supports, finding answers to queries is slow, cumbersome, and fragmented. Queries themselves depend on data-analysis expertise on the part of enterprise users or the IT staff that supports them. A single data repository, cobbled together with à la carte functionalities, is unlikely to perform as swiftly, accurately, and comprehensively as a purpose-built, optimized data management system that reflects an organization's strategic vision for value creation.

The Power of Metadata

Metadata is data that describes and provides context and governance rules for other data. Traditional database metadata describes provenance (lineage) and data definition. This is useful, but a big data management platform needs to do more. In a big data management platform, metadata describes the original file; adds context such as source, user history, and data-sharing agreements; captures additional data on the nature of the dataset; and supports indexing and cataloging. Metadata allows users to implement governance policies and security and privacy measures such as access and visibility controls, as sketched in the example below.
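To make this concrete, here is a minimal, hypothetical sketch (not PHEMI Central code) of metadata-driven visibility control: each stored item carries a metadata record that includes a visibility label, and a read is permitted only when the requesting user's attributes satisfy that label. All names and fields are illustrative.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class MetadataRecord:
    """Illustrative metadata attached to a data item at ingest."""
    source: str                       # provenance: where the item came from
    ingested_at: datetime             # when it entered the warehouse
    visibility: set                   # labels a user must hold to see the item
    sharing_agreement: str = "none"   # applicable data-sharing agreement, if any
    version: int = 1                  # simple version counter for derived items

@dataclass
class DataItem:
    key: str
    value: str
    metadata: MetadataRecord

def can_view(user_attributes: set, item: DataItem) -> bool:
    """Policy check: the user must hold every label the item requires."""
    return item.metadata.visibility.issubset(user_attributes)

def read(user_attributes: set, item: DataItem) -> str:
    """Return the value only when the visibility policy allows it."""
    return item.value if can_view(user_attributes, item) else "<redacted>"

# Example: a patient's name is visible to a consented clinician,
# but not to an analyst cleared only for de-identified data.
record = DataItem(
    key="patient/123/name",
    value="Jane Doe",
    metadata=MetadataRecord(
        source="hospital_emr_feed",
        ingested_at=datetime(2015, 4, 1),
        visibility={"clinician", "consented"},
    ),
)

print(read({"clinician", "consented"}, record))  # Jane Doe
print(read({"analyst"}, record))                 # <redacted>
```

Systems in the Hadoop ecosystem express the same idea at much finer grain (for example, Apache Accumulo's cell-level visibility labels). The essential point is that the labels travel with the data as metadata, so policy can be enforced at read time without altering the original records.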
Metadata is the key to powerful, end-to-end data control.

A Big Data Warehouse Supports Strategy and Value Creation

The implications of the data lake fallacy for public- or private-sector managers who are mandated by law, or driven by market pressures, to store, secure, analyze, and create value from data in their care should be clear: a holistic, enterprise-grade solution for big data should manage data across its complete lifecycle. A complete solution, found in the big data warehouse model, must enable automated data collection; ensure the application of data privacy, security, and governance measures; convert disparate data formats into an analytics-ready state; and analyze and present actionable insights on demand. When a dataset's mandated or useful life reaches an end, a complete solution must address proper data destruction. Throughout the lifecycle, data must be preserved in its original form, and any data items derived through transformation after ingest must be accurately tracked.

A complete solution should also be scalable. That is, to validate the solution's value proposition, it must be possible to apply the solution incrementally, to one database entity at a time, in a systematic and cost-effective way, as the business builds out its big data management strategy and associated investments. Scalability means the marginal cost of adding the next bit of data is less than that of the previous bit.

The goal of any big data strategy should be actionable insights, value creation, and innovation. With a big data management platform approach, the ability to mine for insights increases as more data sources are added to the system (the "network effect"). Increases in efficiencies and productivity (doing more with less) should be a given. To illustrate how a big data warehouse approach enables an organization's data management strategy to achieve these returns on investment, it is useful to envision a step-by-step approach that includes collection, curation, and consumption; a simplified sketch of these three steps follows the descriptions below.

Collect. A big data warehouse solution enables a user to automatically collect disparate data sources, tag them with metadata, catalog and index them, and load them into a data repository, according to rules set by the user. A critical requirement of a big data warehouse platform in the collection phase is the ability to handle any kind of data, including structured (e.g., database records), semi-structured (e.g., Microsoft Excel, machine-collected data, or genomic files), and/or unstructured data (e.g., images or documents).

Curate. A big data warehouse produces analytics-ready digital assets that are cataloged and protected. The curation process preserves the original data item in its native format, providing a baseline resource that can be re-analyzed, potentially in different ways, going forward. Metadata describes the original file; adds context such as source, user history, and data-sharing agreements; captures additional data; and supports indexing and cataloging. Metadata allows users to implement governance policies and security and privacy measures such as access and visibility controls. And metadata tracks who has touched that data, when, and how.

Consume. The result of these steps should be a flexible yet robust platform that enables on-demand retrieval of datasets and the analysis, actionable insights, and value that justify such an investment.
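The following is a minimal, hypothetical sketch of the collect, curate, and consume steps. It is not PHEMI's implementation; it simply assumes an in-memory catalog keyed by metadata, preserves each original item unchanged, and stores any derived, analytics-ready versions alongside it so lineage and version history remain traceable.

```python
import hashlib
from datetime import datetime

class BigDataWarehouse:
    """Toy model of the collect / curate / consume lifecycle."""

    def __init__(self):
        self.originals = {}   # item_id -> raw bytes, preserved exactly as ingested
        self.derived = {}     # item_id -> list of derived, analytics-ready versions
        self.catalog = []     # metadata records used for indexing and search

    # Collect: ingest any kind of data and tag it with metadata.
    def collect(self, raw: bytes, source: str, tags: dict) -> str:
        item_id = hashlib.sha256(raw).hexdigest()[:12]
        self.originals[item_id] = raw
        self.catalog.append({
            "item_id": item_id,
            "source": source,
            "ingested_at": datetime.utcnow().isoformat(),
            **tags,
        })
        return item_id

    # Curate: derive an analytics-ready form without touching the original.
    def curate(self, item_id: str, transform) -> dict:
        version = {
            "version": len(self.derived.get(item_id, [])) + 1,
            "created_at": datetime.utcnow().isoformat(),
            "data": transform(self.originals[item_id]),
        }
        self.derived.setdefault(item_id, []).append(version)
        return version

    # Consume: find datasets by their metadata and retrieve the latest version.
    def consume(self, **query):
        for record in self.catalog:
            if all(record.get(k) == v for k, v in query.items()):
                item_id = record["item_id"]
                versions = self.derived.get(item_id)
                yield record, (versions[-1]["data"] if versions
                               else self.originals[item_id])

# Usage: ingest a semi-structured record, curate it into a dictionary, query it.
wh = BigDataWarehouse()
item = wh.collect(b"id=123;glucose=5.4", source="lab_feed", tags={"domain": "clinical"})
wh.curate(item, lambda raw: dict(kv.split("=") for kv in raw.decode().split(";")))
for meta, data in wh.consume(domain="clinical"):
    print(meta["source"], data)   # lab_feed {'id': '123', 'glucose': '5.4'}
```

Because the original bytes are never modified, any derived version can be regenerated, or re-derived in a different way later, which is exactly the property the curation step described above depends on.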
A big data warehouse needs to integrate with, and leverage, existing IT investments, applications, and analytics tools. The platform takes on the role of policy enforcement, based on the attributes of the user, the metadata, and the applicable governance policies. A big data warehouse platform also supports the development and use of in-house or third-party applications that perform the actual data analysis for actionable insights. Unlike a data lake, a big data warehouse approach should collect, curate, and consume data at speed and scale.

Conclusion

Full data-lifecycle management in a single, purpose-built platform enables an optimal, strategic approach to digital assets for value creation. Functionalities should include governance (including access, data-sharing, and visibility controls); secure, reliable, scalable, and fast storage; the application of metadata; data immutability; audit; version control; and timely destruction. Such a platform should enable swift development of applications to serve an organization's specific needs. The result of an organization's strategic approach to digital assets, and of its investment in a big data warehouse platform, should be increased efficiencies and productivity, and support for value creation and innovation.

About the author

Dr. Paul Terry is president and CEO of Vancouver, B.C.-based PHEMI, developer of a big data warehouse platform, where he provides vision and technical leadership. Terry advises private and public healthcare organizations on next-generation data strategies. He is an adjunct professor in big data at Simon Fraser University (SFU) and a partner with Magellan Angel Partners. He lectures in technology, strategy, and product management for the MBA program at SFU. He is a member of the big data Sub-Committee Working Group at the BC Institute for Health Innovation and serves on Genome BC's Health Strategy Task Force.

Prior to his experience in healthcare and venture capital, Paul was the CTO and cofounder of OctigaBay Systems, a pioneer in high-performance computing, which was acquired by Cray Inc., the world leader in supercomputing. He was also the cofounder and CTO of Abatis Systems, which was acquired by Redback Networks in one of the largest technology acquisitions in Canadian history. He holds an MBA from the Cranfield School of Management, a marketing diploma from the Chartered Institute of Marketing, a PhD in electrical engineering, and an honours bachelor's degree from the University of Liverpool.

About PHEMI

PHEMI was founded in 2013 by a team of proven entrepreneurs and industry experts. Headquartered in Vancouver, Canada, the PHEMI team has extensive experience bringing innovative technologies to enterprise-class customers. Industry expertise ranging from healthcare to telecom to the public sector to security drives PHEMI Central features, while networking and high-performance computing expertise drives the PHEMI architecture to meet the challenges of big data. PHEMI Central gives organizations the agility to seamlessly collect data sources, catalog and curate a powerful inventory of secure digital assets, conceive new business applications, and rapidly build new solutions to support strategic objectives. PHEMI partners with best-in-class technology and service providers to deliver a complete solution to meet any organization's needs. Visit www.phemi.com for more information.
www.phemi.com | [email protected] | twitter.com/PHEMIsystems | linkedin.com/company/phemi

Copyright © 2015, PHEMI and/or its affiliates. All rights reserved. Affiliate names may be trademarks of their respective owners. April 2015.