Advances in Data Integration: The “No ETL” Approach
Marcos A. Campos, Principal Consultant, The Cognatic Group
Sponsored by Capsenta™
capsenta.com
INTRODUCTION

Data integration is a costly activity. Current approaches naturally lead to the question: is it possible to bring multiple data sources together in a less costly, fully automated way, and still satisfactorily respond to the full breadth of business needs? An affirmative answer would represent a breakthrough in several industries.

For example, in the pharmaceutical industry there are two primary areas of executive pain. The first is the pre-clinical data discovery phase, where there is a need to “fail faster.” With many potential drugs being tested, pharmaceutical executives want to be able to rule out failures as early as possible. They also want to rule out drugs that do not meet specific criteria, and drugs that have already been evaluated or patented by a competitor. The second pain point occurs in the later stages 2 and 3 of the clinical trial phase, where finding appropriate patients for a new drug trial is a painstaking process. The ability to integrate key clinical data from electronic medical record systems and 3rd-party repositories could alleviate these problems and reduce the time it takes to develop a new drug. In the health care industry, access to the right data, at the right time, and in the right setting can save lives and improve efficiency.

In general, the goals of this “no ETL” approach are to achieve:

• a 50% reduction in the time to integrate data sources, and
• a 50% reduction in costs related to business intelligence (BI) [1] data integration.

To some extent, recent methods in systems and software architecture have already broken down some of the walls between data silos. Service Oriented Architecture (SOA) methods have improved the way that data is moved between separate databases. However, mapping one data model to another, and physically transforming data from one model to the other, remains expensive. Mapping data is a surprisingly hands-on and labor-intensive process. These labor-intensive processes are collectively categorized as Extract, Transform, and Load (ETL) activities. ETL has become a mainstay of Business Intelligence (BI) tools and data warehousing.

In this white paper, Capsenta proposes a solution that renders ETL activities obsolete. To understand where the innovation is happening, one must understand the shift that has occurred: a shift that redefines what we mean by data integration.

[1] Bluepatent.com: Business Intelligence (BI) is the set of techniques and tools for the transformation of raw data into meaningful and useful information for business analysis purposes.
IS ETL BECOMING OBSOLETE?

Most organizations that deal with large amounts of data are using methods and approaches that are becoming obsolete. From a business perspective, how often have business leaders identified a new business opportunity only to be told that either the data is not compliant with company standards, or that integration is too costly to be feasible? In some situations, operations come to a stop because the key technical people who are familiar with a legacy database leave the organization. Maintenance and modifications are put on hold. Time and resources go into finding new hires to learn legacy data formats, or consultants to transform the existing data into supported formats. And yet, addressing all of the above still does not solve the root problem of deriving sufficient business value from disconnected data sets.

Part of the reason for this is that typical data sources are complex. Data must be tended to by technical staff. In large data warehouses, technical staff often work in islands of heterogeneous technology. For example, one department’s Oracle database (with tables designed 20 years ago) might not play well with another department’s Microsoft SQL Server database (whose tables are changing). Data is not easily aggregated. This is part of why so much time and effort goes into Extract, Transform, and Load (ETL) activities.

For many years, the dominant form of data storage in business, commerce, and government has been the relational database. Because of that, business intelligence, search functions, and data analytics have required ETL operations. The problem with current ETL methods is that they take time and resources, and include hidden costs that are not apparent at the beginning of a project. In an article titled “The True Cost of Integration in the World of BI,” David Linthicum states: “When it comes to the cost of a BI deployment, it’s not the software that will get you; it’s the miscellany -- the miscellaneous integration work, in particular.” [2]

It’s reasonable to assume that data integration is a significant cost for any BI project: implementation-phase estimates put integration at almost 80% of the cost, compared to 20% for the analytics component. [2] Data integration hassles are legendary, including bringing together all relevant data from various operational systems that were never designed to feed BI systems. In addition to integration and conversion costs, there are ongoing costs. Linthicum states, “When you look at ongoing costs, though, the roles reverse, making data integration 20 percent of the costs versus reporting and analytics.” [2]

Why so expensive? On a commercial scale, data integration is hard. It’s complex. It’s perhaps one of the most difficult jobs in the world of BI. However, it’s also critical. Indeed, the ability to bring in the right data on a timely basis is directly related to the value that the BI system will bring to the business -- more so than the types of analytics it’s looking to drive. It’s the old “garbage-in-garbage-out” concept.

[2] Linthicum, David, “The True Cost of Data Integration,” TDWI, Aug. 20, 2013.
What about real-time analysis? The need to perform ETL activities means that real-time data analysis is not possible. Typically, new search queries mean that new report formats have to be created manually. In contrast, semantic technologies provide tools for real-time data analysis. That’s another reason why data integration is moving toward semantic technologies, and away from ETL activities.

SHIFT TO SEMANTIC TECHNOLOGY

The shift to semantic technology sometimes comes under the umbrella of “Big Data” or “Data Analytics.” But those labels alone do not mean that ETL has been eliminated. The true shift is in the way that we treat data: removing the focus from methods of storage and placing it on the business value contained in the data itself. That’s what semantic technology does. It adds meaning to data. This shift has become possible because we have turned a corner in machine processing speeds, network availability, network performance, and the cost of hardware. Turning the corner in these areas means that methods of computing have become available to the business community. Until recently, these programming methods and algorithms were more suited to an academic environment, where realistic application would have taken place on a “super-computer.” These methods and emerging algorithms are now commercially viable. By enriching data with semantics we improve:

• SEARCH: Enable search across multiple, heterogeneous data sources
• ANALYTICS: Provide data analytics in real-time between previously unmapped data sets
• SPEED TO MARKET: Reduce the time, capital, and human resources associated with data mapping

Semantic technology gets at the root problem that ETL ignores. Coupled with next-generation information architecture, the latest semantic technologies represent a giant leap toward the end of the need for ETL. By applying certain semantic techniques, an organization can automate computer processing to produce those benefits in search, analytics, and improved speed to market. Data scientists at universities around the world, in governments, and at commercial companies are taking advantage of this technology. The technology is moving quickly. New methods are emerging and existing methods are being optimized.

EMERGING SOLUTION

Capsenta has developed technologies, based on industry standards and semantic techniques, that provide a better approach to data integration. Capsenta uses computational methods and approaches that are not restricted by underlying storage mechanisms and do not require ETL activity. Capsenta technology can:

• Virtualize data as graphs
• Leverage common vocabularies
• Use automated mapping
• Support federation

This technology and approach have already proven their value in the area of cardiovascular medical devices, where Capsenta is powering first-to-market connectivity for every implantable device patient in the United States. Medical device manufacturing is among the most competitive industries in the healthcare domain. Individual companies and organizations have little incentive to merge their data, and there is little attempt to collaborate on centralizing data for the purposes of analytics and patient search. In addition, almost 50% of patients with implantable devices know neither the type of their device nor the name of its manufacturer. These issues make it hard to identify which manufacturer’s service representative should be contacted during patient triage, which can result in lengthy delays before clinical care is performed. Capsenta is working with industry professionals to enable federated search of device data in emergency rooms. This will enable faster identification of a patient’s device and save critical clinical time.

VIRTUALIZE DATA AS A GRAPH

A key part of the solution is to virtualize data as a graph. A graph is a data model that uses nodes and edges instead of tables, columns, and rows. One advantage of using a graph is that integrating multiple data sets is easier than it would be if the data were in tables: integrating multiple, disparate graphs can be done simply by adding more edges and nodes. In this way, multiple graphs can become one. By exposing relational data as graphs, we change the hard problem of relational database integration into the relatively simple activity of finding edges between nodes. The tables, columns, and rows in the original database still exist. But with the use of graphs, the data can be represented in a virtual way and integrated with other databases without the need to map foreign keys or normalize data entities.

To do this, Capsenta uses semantic web standards established by the World Wide Web Consortium (W3C). One key standard is the Resource Description Framework (RDF). RDF describes resources and is used for modeling data as graphs. We apply RDF conventions to data that is in a relational database so that it can be represented in an RDF database. Another useful open standard is SPARQL, which is used for querying graphs. Capsenta’s OntoExplorer search tool uses SPARQL to perform semantic searches. Capsenta also uses a W3C standard called the Web Ontology Language (OWL). We use OWL to represent complex knowledge about things, groups of things, and the relationships between things within the context of a graph. As the name implies, OWL helps describe ontologies. In this context, an ontology is like a vocabulary: a complex collection of terms.
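To make this concrete, the following is a minimal sketch using rdflib, an open-source Python library for RDF, rather than Capsenta’s Ultrawrap; all of the URIs, record values, and the linking property are invented for illustration. Two rows from separate databases are exposed as triples, the two graphs are merged, and a single SPARQL query then spans both original sources.

    # A hedged sketch with the open-source rdflib library; URIs are invented.
    from rdflib import Graph, Literal, Namespace, RDF

    EX = Namespace("http://example.org/")

    # Database A: one row of a PATIENT table, virtualized as nodes and edges.
    g_a = Graph()
    g_a.add((EX.patient17, RDF.type, EX.Patient))
    g_a.add((EX.patient17, EX.lastName, Literal("Alvarez")))

    # Database B: one row of a DEVICE table from a different organization.
    g_b = Graph()
    g_b.add((EX.device42, RDF.type, EX.ImplantableDevice))
    g_b.add((EX.device42, EX.manufacturer, Literal("Acme Medical")))

    # Integration: take the union of the graphs and add one linking edge.
    g = g_a + g_b
    g.add((EX.patient17, EX.hasDevice, EX.device42))

    # One SPARQL query now spans what were two separate databases.
    results = g.query("""
        SELECT ?last ?maker WHERE {
            ?p <http://example.org/lastName> ?last ;
               <http://example.org/hasDevice> ?d .
            ?d <http://example.org/manufacturer> ?maker .
        }""")
    for row in results:
        print(row.last, row.maker)   # -> Alvarez Acme Medical

In a production setting the triples would be generated automatically from the relational schema rather than written by hand; the point here is only that integration reduces to adding edges between nodes.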
LEVERAGE COMMON VOCABULARIES

The semantic approach becomes much more useful if it can leverage common vocabularies and taxonomies. Vocabularies are sets of metadata, terms, and relationships. In the context of semantic standards, vocabularies and ontologies essentially mean the same thing. Every industry has its own set of standard vocabularies. Many industries, such as the health care industry, have mature vocabularies that are shared between companies and organizations. For example, in the healthcare industry there are organizations such as Health Level Seven International (HL7) and the Coalition for ICD-10 that maintain standards and common dictionaries in order to promote interoperability. In the finance industry, there is the Financial Industry Business Ontology (FIBO), an initiative to define common financial terms, definitions, and synonyms.

The suite of semantic technologies, RDF and OWL, allows the creation of rich knowledge models (ontologies) in which business logic can be captured and maintained. OWL ontologies are machine-readable. This is important because it means that an application can directly use those ontologies (vocabularies) without the need for traditional data analysis. For example, suppose that in Database A the data reveals that a person named Pedro is Mexican, and in Database B the data reveals that a person named Maria is Cuban. What if you were to ask the question: “Who are all the Hispanics in the data set?” Suppose there is a layer of semantics (ontological knowledge) that includes the assertions “Mexican is Hispanic” and “Cuban is Hispanic”. Given those ontological relationships, a semantic search could return both Pedro and Maria as Hispanic, even though the word Hispanic does not appear in Database A or in Database B.
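The Pedro and Maria example can be sketched in a few lines of rdflib as well; here the ontology layer is expressed with rdfs:subClassOf assertions, and the question is asked with a SPARQL 1.1 property path. The URIs are again invented, and a production ontology would be far richer.

    # A minimal sketch of the Pedro/Maria example; URIs are illustrative only.
    from rdflib import Graph, Namespace, RDF, RDFS

    EX = Namespace("http://example.org/")
    g = Graph()

    # Facts as they appear in the two source databases.
    g.add((EX.Pedro, RDF.type, EX.Mexican))   # Database A
    g.add((EX.Maria, RDF.type, EX.Cuban))     # Database B

    # The shared semantic layer: "Mexican is Hispanic", "Cuban is Hispanic".
    g.add((EX.Mexican, RDFS.subClassOf, EX.Hispanic))
    g.add((EX.Cuban, RDFS.subClassOf, EX.Hispanic))

    # "Who are all the Hispanics?" -- rdf:type followed by any number of
    # subclass steps, written as the property path a/rdfs:subClassOf*.
    q = """
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?person WHERE {
            ?person a/rdfs:subClassOf* <http://example.org/Hispanic> .
        }"""
    for row in g.query(q):
        print(row.person)   # prints both Pedro and Maria

Neither source record contains the word “Hispanic”; the answer comes entirely from the ontological layer.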
USE AUTOMATED MAPPING

Applying semantic technology to graph data models makes automated mapping possible. This is not the kind of mapping seen in traditional technologies. Using the supervised automation features in Capsenta’s Ultrawrap™ product, initial discovery and visualization of a source relational database’s schema and metadata is almost instantaneous.

Traditional methods of mapping data elements require human judgment. Seemingly simple data elements can require human effort to sort out the logic. For example, consider a simple element such as a name. One data source might store a name as LAST_NAME with a separate element for FIRST_NAME. A different data source might store a name as LAST_FIRST_MIDDLE. And yet another data source might store a name as FIRST_MIDDLE_LAST. Traditional mapping of one element to another would require not only human judgment, but also some method of conversion in order to normalize the data. Applying semantic technology to graph data makes it possible to connect the name elements from those separate databases automatically, without the need for manual conversion or normalization. Using semantic search in combination with common vocabularies, we can automate the mapping between source schemas and target ontologies. This saves time and effort over traditional integrations, which would have required a human to map the elements of one database to another.

Traditional data mapping can also lead to missing data in search results. For example, consider a search in a clinical database for all patients with diabetes. A search using the term “diabetes” might yield an incomplete list of patients. What if some patients are diagnosed with NIDDM, or simply recorded as “diabetic”? What if the intention of the search was to find those patients as well? Using semantics, an automated mapping can tie these terms together to yield a search result covering all of the associated diagnoses, without the need for a human to make that determination.

Another advantage is that automated mapping makes it possible for a computer to make inferences about the relationships between data elements. This kind of inference goes beyond finding synonyms: it finds related data that are not necessarily synonyms. For example, consider a recent case study involving databases of constitutional language. There are data warehouses that contain text from the constitutions of nations all over the world. If one were to search for text that contains the word “religion,” the search results might miss instances of text that deal with religious issues; traditional methods of mapping the data would miss phrases such as “church and state.” This kind of error of omission is much less likely in a semantic search, where such phrases would be inferred to be related and returned in the results.
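As a rough illustration of what a schema-to-ontology mapping accomplishes, the sketch below maps two of the differently shaped name formats onto one target vocabulary by hand; in Capsenta’s approach the equivalent mapping is produced through supervised automation rather than hand-written rules, and all field and property names here are invented.

    # Hand-written stand-in for an automated schema-to-ontology mapping.
    from rdflib import Graph, Literal, Namespace, RDF

    EX = Namespace("http://example.org/")
    g = Graph()

    # Source A stores names as separate FIRST_NAME / LAST_NAME columns.
    row_a = {"FIRST_NAME": "Maria", "LAST_NAME": "Gomez"}
    # Source B stores a single LAST_FIRST_MIDDLE string.
    row_b = {"LAST_FIRST_MIDDLE": "Smith, John Q"}

    # Mapping rules: both shapes land on the same target properties.
    pa = EX.personA1
    g.add((pa, RDF.type, EX.Person))
    g.add((pa, EX.givenName, Literal(row_a["FIRST_NAME"])))
    g.add((pa, EX.familyName, Literal(row_a["LAST_NAME"])))

    pb = EX.personB1
    last, first_middle = row_b["LAST_FIRST_MIDDLE"].split(", ")
    g.add((pb, RDF.type, EX.Person))
    g.add((pb, EX.givenName, Literal(first_middle.split()[0])))
    g.add((pb, EX.familyName, Literal(last)))

    # One query retrieves names uniformly, regardless of the source shape.
    q = """
        SELECT ?given ?family WHERE {
            ?p a <http://example.org/Person> ;
               <http://example.org/givenName> ?given ;
               <http://example.org/familyName> ?family .
        }"""
    for row in g.query(q):
        print(row.given, row.family)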
SUPPORT FEDERATION

One of the most significant improvements of using semantic technology to perform search operations is that the data can exist anywhere. Data sources can be in different areas of an organization, in different formats, or even stored within different organizations. The data sources do not have to be centralized in order to support the virtualization, mapping, and search operations we have described. As we mentioned earlier, individual companies and organizations do not always have an incentive to merge their data with others. Those barriers between data sources, however, are not barriers to using semantic search technology. Virtualization, mapping, and search can be done across independent databases without the need to centralize the data.

In addition, this federated approach does not impose a performance overhead: a search across independent data sources is no slower than a non-federated search. This is possible because the Capsenta tools leverage the optimization mechanisms inherent in each individual database. The Ultrawrap™ middleware takes advantage of the query optimizers that already exist in relational databases such as Oracle, DB2, and Microsoft SQL Server.
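For a sense of what federation looks like at the query level, the sketch below uses SPARQL 1.1 SERVICE clauses, which delegate each block to the remote database that owns the data. The endpoint URLs and properties are hypothetical, so this shows the shape of a federated query rather than a runnable call, and Capsenta’s own federation machinery may well differ.

    # Shape of a federated query; endpoint URLs are hypothetical.
    from rdflib import Graph

    federated_query = """
        SELECT ?patient ?maker WHERE {
            SERVICE <http://hospital.example.org/sparql> {
                ?patient <http://example.org/hasDeviceId> ?id .
            }
            SERVICE <http://manufacturer.example.org/sparql> {
                ?device <http://example.org/deviceId> ?id ;
                        <http://example.org/manufacturer> ?maker .
            }
        }"""

    # An empty local graph acts purely as the coordinator; each SERVICE
    # block is evaluated by the remote endpoint, so no data is centralized.
    for row in Graph().query(federated_query):
        print(row.patient, row.maker)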
CONCLUSION

The tools and capabilities available to IT data professionals have lagged behind business demands. Semantic technology provides a solution to existing data integration problems and the overhead of ETL activities. The solution comprises several elements. One element is the representation of relational databases as graphs (virtualizing data without changing the underlying database). Another element is connecting the virtualized data with common vocabularies to create useful ontologies. Finally, to be useful in a practical application, the solution must also support automated mapping and federation. When all these elements come together, ETL is no longer required.

Capsenta Inc. has created a solution that does this, called Ultrawrap™. The Ultrawrap™ middleware uses semantic technology to integrate and enrich disparate data sources. OntoExplorer enables next-generation search capability. Together, the combined solution brings all of the above-mentioned elements to bear, providing a significant advance in data integration and search. Capsenta’s solution has already shown value in the healthcare domain. The semantic approach could provide similar benefits in other domains.

ABOUT CAPSENTA

Capsenta, founded in 2011, is a spin-out of the University of Texas at Austin. The Ultrawrap™ technology is the product of more than 6 years of research. Capsenta increases the value of a company’s data by applying knowledge graphs and semantic technologies. Semantics improve search, analysis, and interpretation. Capsenta’s patent-pending technology is the only complete W3C-compliant turnkey solution for making existing SQL, and SQL data warehouse infrastructure, upward compatible with semantics. The approach assures scalability and robustness. Traditional applications seamlessly coexist with semantically enriched applications. By enriching data with semantics, Capsenta is able to accelerate the provisioning process from months to days. Capsenta’s technology starts with standards compliance and is able to integrate the result with any existing database infrastructure. Capsenta provides Data as a Service (DaaS): database integration services as a one-time cloud service. Capsenta customers benefit from the power of semantics from day one, with low risk.

TARGET MARKET: Life Science, Drug Discovery, Clinical Trials, Personalized Medicine, Health IT

For further guidance on this topic and additional detail, contact us at [email protected]. You can learn more about database automation and search at www.capsenta.com or reach Capsenta directly by calling 1-888-983-2597.

5225 FOSSIL RIM RD., AUSTIN TX 78746
888.983.2597
[email protected]
capsenta.com