How to Use an Uncommon-Sense Approach to Big Data Quality

CONCLUSIONS PAPER

Insights from a webinar in the Applying Business Analytics webinar series

Featuring:
Scott Chastain, Global Engineering Manager, Information Management and Delivery, SAS
David Loshin, President, Knowledge Integrity Inc.

Table of Contents

Why All the Interest in Big Data?
  The Advantages of Analyzing All the Data, Not Just a Subset
Challenges and Problems with Big Data
  Data Volume Overwhelms the Systems Designed to Digest and Analyze It
  Data Comes from Disparate and Inconsistent Sources
  Much of the Available Data Was Not Designed for Decision Making
  It Can Be Difficult to Determine What Data Is Relevant
  Data Preparation Differs Based on the Analytic Method Being Used
The Uncommon-Sense Part of the Data Quality Story
Big Data Quality in Action: Three Use Cases
  Utility Sensor Data: Dealing with Missing or Out-of-Range Values
  Social Media Data: Grappling with the Ambiguity of Language
  Multichannel Retail Data: Linking an Online Entity to a Real Person
Closing Thoughts
About the Presenters
For More Information

Why All the Interest in Big Data?

Organizations are inundated with data – terabytes, petabytes and exabytes of it. Data pours in from every conceivable direction: from operational and transactional systems, from scanning and facilities management systems, from inbound and outbound customer contact points, from mobile media and the Web.

Exponential growth in data volumes isn't new. It continues a trend that started in the 1970s. What is new is that, for many organizations, the volume, variety and velocity of data exceed the storage and compute capacity available to use that data for accurate and timely decision making.

The hopeful vision of big data is that organizations will be able to harvest every byte of relevant data and use it to make supremely informed decisions. We now have the technologies not only to collect and store big data but, more important, to understand and take advantage of its full value.

"The financial services industry has led the way in using analytics and big data to manage risk and curb fraud, waste and abuse – especially important in that regulatory environment," said Scott Chastain, Global Engineering Manager for Information Management and Delivery at SAS. "We're also seeing a transference of big data analytics into other areas, such as health care and government. The ability to find that needle in the haystack becomes very important when you're examining things like costs, outcomes, utilization and fraud for large populations."

The Advantages of Analyzing All the Data, Not Just a Subset

"Big data provides gigantic statistical samples, which enhance analytic tool results," wrote Philip Russom, Director of Data Management Research for TDWI, in the TDWI Best Practices Report Big Data Analytics (Fourth Quarter 2011).
"The general rule is that the larger the data sample, the more accurate are the statistics and other products of the analysis."

In the past, organizations were limited to using subsets of their data, or were constrained to simplistic analyses, because the sheer volume of data overwhelmed their IT platforms. What good is it to collect and store terabytes of data if you can't analyze it in full context, or if you have to wait hours or days to get results for urgent questions?

"When you expand your focus and data sources, you've got more data, but when you're limited in your computing capabilities, you tend to focus on the low-hanging fruit," said David Loshin, data management consultant and President of Knowledge Integrity Inc. "Analysts look for the typical patterns of fraud, but there's a lot of real estate under that curve. When you're accumulating much larger volumes of data, and analysts have access to that data at a granular level, they're able to find things they weren't able to find before."

For example, "utility companies have been collecting sensor data for quite some time, but they were constrained and couldn't use it all," said Chastain. "Now they can look at sensor data in near-real-time fashion across a community or systemwide process, merge that with data from traditional business processes, and use the insights to optimize the infrastructure for reliability and cost savings."

Challenges and Problems with Big Data

With the onset of big data comes the question: How do you maintain the quality of all that data? The traditional recipes for data quality success – data profiling, data cleansing and data monitoring – don't always lead to success in today's big data world. There are several challenges to address.

Data Volume Overwhelms the Systems Designed to Digest and Analyze It

Consider the examples of data generated by utility meters, RFID readers and facility management systems. "Many industries are not used to the scale and volume of all the data that can be generated by these sensors," said Loshin. "For example, the legacy model for data collection in a utility company was a person going home to home reading meters once a month. As utilities transition to smart meters and sensors installed throughout the power grid, these components are starting to generate data readings every 15 minutes – and they are capable of delivering readings in increments of seconds or milliseconds. The industry has to be able to absorb that data feed and be poised to collect and analyze a scale of data that is orders of magnitude greater than anything they've ever seen before."

"Traditional structured data is still a very large part of every business organization, but we see an increased use of nontraditional data sources, such as machine-to-machine data, geolocation data from cellphones and unstructured text data. The growth of these nontraditional sources is playing a big role in the growth of big data."
– Scott Chastain, Global Engineering Manager, Information Management and Delivery, SAS

Data Comes from Disparate and Inconsistent Sources

This diversity of data types introduces integration challenges as organizations seek a more holistic perspective on the business. "Organizations have typically focused their data management practices on solving a particular problem in a functional area, not on looking horizontally across all areas of the business," said Loshin. When trying to integrate data from functional silos, data formats will be inconsistent. Definitions of a term can vary from one system to another, leading to confusion about what a concept such as "customer" even means and whether data elements can even be merged.
As organizations start looking to optimize for business value across the organization, data management and data quality practices will have to adapt.

Much of the Available Data Was Not Designed for Decision Making

"Many of the data sets were originally focused on the operational and transactional functions," said Loshin. "For transaction processing, for example, the data set just needs to have the complete set of values that enable a customer's purchase to go through. But the objective of big data analytics is to drive good decision making, which relies on the absorption of data sets that were not intended for that type of use. So we end up with data that doesn't meet the expectations of downstream business consumers.

"Furthermore, the data is being generated in environments over which we have no control. When we pull data from sensors to analyze residential use patterns, we can't control whether the sensors were working right or not. We can only use our techniques to determine whether the data complies with our expectations and whether we have enough trust in the data to analyze it for our business purposes."

"Much of the data that feeds into big data analysis was generated for other purposes besides analysis – and is being repurposed for analyses that were not originally anticipated."
– David Loshin, President, Knowledge Integrity Inc.

It Can Be Difficult to Determine What Data Is Relevant

Of the data you collect, what do you keep, and what do you throw out? Cheap storage has driven a propensity to hoard data, but this habit is unsustainable. Organizations need a better information engineering pipeline and governance process.

"We've talked to clients who regret that they hadn't collected data that now they could use, so now they're starting to collect everything," said Loshin. "Others are saying, 'We have all this data, and we don't know what we're going to do with it, but we don't want to throw it out, because we might want to use it someday.' This data caching creates another demand on a big data environment – for more archiving, more scalability, more capabilities for analysis and extraction – which creates yet more demand, and still people are sometimes left scratching their heads trying to figure out what they're going to do."

Of the data you keep, what do you include in analysis? Not all questions are better answered by bigger data. The traditional modus operandi has been to store everything, and only when you query it do you discover whether it is relevant. This is a costly and cumbersome proposition. "You have to determine exactly which data is relevant to a particular business question," said Chastain. "Whether it's big data, small data or a combination, organizations are trying to determine what data will be used for business gain – and that influences how we set up the data."

Data Preparation Differs Based on the Analytic Method Being Used

For example, all the practices associated with data preparation for a data warehouse are appropriate for online analytic processing (OLAP). But with query-based analytics, users often want to begin the analysis very quickly in response to a sudden change in the business environment. The analysis can require large volumes – often multiple terabytes – of raw operational data.
The urgency of the analysis doesn't allow time for much (if any) data transformation, cleansing and modeling. Nor, as the next section explains, would you necessarily want to make the data perfect.

The Uncommon-Sense Part of the Data Quality Story

"It is counterintuitive, but the traditional approach to data quality – which focuses on data cleanliness and orderliness – is not the way we do it for big data analytics," said Loshin.

Advanced analytics has the potential to identify useful information in data that could be perceived as having poor quality. Data anomalies, missing data, nonstandard values and other things that would be inappropriate for reporting from a standard data warehouse may hold useful information that can be revealed through advanced analytics. For example, fraud is often revealed in nonstandard or outlier data. So you don't want to do much data cleansing, data modeling or ETL (extract, transform, load) – the way you would for data warehousing – because that could mask the very issues you're looking for.

"In the past, people would say, 'Here's a set of data; we want you to predict fraud, waste or abuse,'" said Chastain. "Often that data had already been cleansed or standardized in an enterprise data warehouse. However, the inconsistencies in the data can be good predictors of fraud, waste and abuse. For example, a fraud analyst might want to spot people who are trying to skirt the system by entering names or addresses slightly differently, using the wife's maiden name or a work address rather than a home address. Rather than cleansed and standardized data that has been through traditional data quality processes, analysts need as much granular data as possible from the existing sources."

Loshin agreed. "Traditional approaches to data quality and data integration focused on cleansing and correcting the data before it came into the analysis environment, but if you do that, you eliminate some of the hooks you may be looking for. … We're looking not to cleanse the data but rather to use the cleansing techniques to establish relationships among transactional records." For instance, record linkage and matching techniques designed for fixing data inconsistencies for an entity can instead be used to spot cases where multiple family members are getting the same controlled substance from different physicians, or where drivers and body shops are colluding for purposes of claims fraud. The idea is to preserve analytical data's rich details, because they enable discovery; the sketch below illustrates the matching-for-linkage idea.
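To make this concrete, here is a minimal Python sketch of the matching-for-linkage idea. It is not from the webinar: the records, field names and similarity threshold are invented for illustration, and the fuzzy comparison uses the standard library's difflib rather than a production matching engine.

```python
# Hypothetical sketch: matching techniques used to LINK near-duplicate
# identities across raw records, not to cleanse or merge them.
from difflib import SequenceMatcher
from itertools import combinations

def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Fuzzy comparison of normalized strings; 0.8 is an illustrative cutoff."""
    a, b = a.lower().strip(), b.lower().strip()
    return SequenceMatcher(None, a, b).ratio() >= threshold

# Raw prescription records, deliberately left unstandardized (invented data).
records = [
    {"patient": "John Q. Smith", "address": "12 Oak St",     "drug": "oxycodone",   "prescriber": "Dr. Adams"},
    {"patient": "Jon Smith",     "address": "12 Oak Street", "drug": "oxycodone",   "prescriber": "Dr. Baker"},
    {"patient": "Maria Lopez",   "address": "44 Elm Ave",    "drug": "amoxicillin", "prescriber": "Dr. Chen"},
]

# Link records that likely refer to the same person or household, then flag
# linked pairs showing the same controlled substance from different physicians.
for r1, r2 in combinations(records, 2):
    linked = similar(r1["patient"], r2["patient"]) or similar(r1["address"], r2["address"])
    if linked and r1["drug"] == r2["drug"] and r1["prescriber"] != r2["prescriber"]:
        print(f"Review: '{r1['patient']}' and '{r2['patient']}' - same drug, different prescribers")
```

Note that the same similarity logic a cleansing tool would use to merge these rows into one "golden" record is used here only to surface a suspicious relationship; the raw variants stay intact for the analyst.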
Big Data Quality in Action: Three Use Cases

Utility Sensor Data: Dealing with Missing or Out-of-Range Values

In machine-to-machine data processes, you don't have a human to blame for data errors, but data quality problems still exist. As with fraud investigation, though, you don't necessarily want to fix them. "If there's a missed expectation in the data – data is missing or doesn't conform to a particular range of values – it's indicative of some type of problem," said Loshin. For example, if a sensor monitoring the viscosity of a liquid passing through a pipeline delivers an out-of-range value, it could mean: (A) the value is correct and there's a problem with the viscosity of the fluid, or (B) the value is incorrect, and the sensor is not working properly.

"When you've got thousands of miles of pipeline equipped with multiple sensors every 15 feet … or oil rigs with significant numbers of sensors placed along different areas of the activity … or a jumbo jet generating gigabytes of performance data every 15 seconds … you want to be able to determine whether something is going wrong with the sensor or the condition," said Loshin. "We can use data quality techniques to monitor compliance with our expected values, but we don't want to correct the data. We need to determine whether it is a data error and we can impute the correct value, or whether the data indicates something that's actually going on in the environment."

"The standard rallying flag for the data quality consultant has been to say, 'The data always has to be 100 percent perfect.' But if you make it perfect, you end up eliminating some of the things you're actually looking for."
– David Loshin, President, Knowledge Integrity Inc.

Chastain provided another example, of temperature sensors giving readings over time. "If we see readings start to drop in and out of the expected range, is that something we want to fix? For a time series, I can impute missing values and get a nice pretty curve representing temperature as a function of time. But what if the fluctuations indicate that the temperature sensor is failing? This is one of those cases where the traditional data quality process would be to fix the data, but the missing values may actually be more valuable to the business than knowing what the values should be." A sketch of this flag-don't-fix approach follows.
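The pattern both examples describe can be reduced to a few lines: monitor compliance with expected values and flag exceptions, but never overwrite them. The following Python sketch uses invented readings, an invented operating range and an invented failure threshold.

```python
# A minimal "flag, don't fix" sketch for a sensor stream.
# Operating range, readings and thresholds are illustrative assumptions.
EXPECTED_RANGE = (10.0, 40.0)  # e.g., degrees Celsius for this sensor

readings = [21.5, 22.1, None, 58.3, 22.4, None, None, 21.9]  # None = missed reading

flags = []
for i, value in enumerate(readings):
    if value is None:
        flags.append((i, "missing"))                  # keep the gap; do not impute
    elif not EXPECTED_RANGE[0] <= value <= EXPECTED_RANGE[1]:
        flags.append((i, f"out of range ({value})"))  # keep the outlier; do not clip

# Downstream logic decides whether the flags point to a failing sensor
# (e.g., a cluster of anomalies) or to a real change in the environment.
missing_rate = sum(1 for _, kind in flags if kind == "missing") / len(readings)
if missing_rate > 0.25:  # illustrative threshold
    print(f"Possible failing sensor: {missing_rate:.0%} of readings missing")
for index, kind in flags:
    print(f"reading {index}: {kind}")
```

The design choice is that imputation, if it happens at all, happens downstream and is recorded as such, so the raw anomalies remain available as evidence.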
Social Media Data: Grappling with the Ambiguity of Language

Unstructured data sources (such as text documents, tweets and social media posts) provide a rich supplement to the data feeding big data analytics. Through analysis of text in customer service notes, warranty claims and social media, organizations can spot impending product quality issues, determine customer sentiment about the brand, and uncover new market opportunities. Analysts are only beginning to explore the possibilities.

The trouble is, human language is messy from a data quality point of view. It can be casual, inconsistent and ambiguous. People misspell words and use cryptic acronyms. Homonyms abound. "There's no constraint on the creation of that data," said Loshin. "You can't force a Facebook poster or Twitter user to spell somebody's name right, and you can't go back and tell them that as the data owner they need to correct it. So it's a challenge to extract a signal out of a lot of noise."

If those hindrances are true of human speech and writing in general, they are exacerbated with social media. "Social media is its own language," said Chastain. "My own son tells me I'm just not hip enough to understand half of what tweets are saying. So the biggest thing we see with understanding unstructured text is that it's not just necessarily about the text, but the context."

For unstructured data, data quality revolves around natural language processing – a discipline from the field of artificial intelligence that combines computer science and linguistics. Natural language processing identifies meaningful concepts, attributes and opinions in the spoken or written word. The data quality mechanisms are advanced linguistic rules that determine the structure of a sentence, the part of speech (noun, verb, etc.), known entities and so on. The context of the sentence aids with disambiguation, such as differentiating between car repair services and religious services, or between Amazon the river and Amazon the e-commerce giant.

"All of this natural language processing has to be relative to the business or the problem we want to solve," said Chastain. "For example, 'radical' has a completely different meaning for a government organization or a skateboard manufacturer. So the data quality processes not only have to do entity extraction, we also have to put those terms into our business vernacular to understand them in the context of the conversation." A simplified sketch of that idea appears below.
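As one way to picture what "putting terms into the business vernacular" might look like, here is a simplified Python sketch. The domain vocabularies and example posts are invented, and real natural language processing would use trained linguistic models rather than keyword overlap.

```python
# Hypothetical sketch of context-driven disambiguation: a term only counts
# as a "hit" after it is resolved against a domain vocabulary.
DOMAIN_CONTEXT = {
    "security_monitoring": {"threat", "extremist", "group", "attack"},
    "skateboard_brand":    {"trick", "deck", "ollie", "ramp", "awesome"},
}

def interpret(term: str, text: str) -> str:
    """Assign a sense to `term` based on the words around it."""
    words = set(text.lower().split())
    scores = {sense: len(words & vocab) for sense, vocab in DOMAIN_CONTEXT.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

print(interpret("radical", "that ollie off the ramp was totally radical"))
# -> skateboard_brand
print(interpret("radical", "a radical group claimed the attack"))
# -> security_monitoring
```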
Multichannel Retail Data: Linking an Online Entity to a Real Person

Retailers who have brick-and-mortar stores and an online store share some vexing data challenges: How do you link the anonymous online visitor with someone who visits your store and gets your catalog? How do you determine whether a purchase was made for the buyer or for someone else?

Retailers want to capture the link between an online visitor and an individual consumer's name so they can apply customer intelligence to that shopper and offer the right treatment. Even if that link is unknown, the retailer still wants as much information as possible to optimize the online visitor's experience. Because this link is so vital, retailers are getting inventive in persuading consumers to provide identifying information. "You'll see sites that ask whether you'd like to log in with your Facebook ID," said Chastain. "Many consumers will say, 'Wow, that's great; it's a lot easier for me to remember than some other type of login.' Now the retailer can link me from my Facebook identity to my consumer behavior in their stores with shopping cart analysis and any other ways I interact with the retailer. So there are definitely some techniques being tried out in the marketplace, but getting that missing link is still an ongoing challenge."

Another challenge is understanding why a purchase was made, said Loshin. "There are data quality issues involved with differentiating an identity from the representation of an identity. When I buy books for my kids from Amazon, I'm really representing my children, as opposed to when I buy books as gifts or for my own pleasure reading. As an Amazon customer, I'm using a single identity but representing multiple entities. Similarly, my wife and I use the same supermarket loyalty card, but we have different types of buying behavior.

"A third example is a colleague who created a Twitter account for her dog and posts tweets on the dog's behalf. This raises the question: How many individuals have multiple Twitter accounts that represent themselves in multiple ways? When you talk about linking things together, it becomes complex, because sometimes one representation has multiple people behind it, and sometimes multiple representations have a single person behind them."

"Most recommendation engines don't have the ability to differentiate between those various entities," said Chastain. "When I go to an online shopping site, I get recommendations based on a previous purchase of a gift for someone else. The industry is taking steps, but this is a growing data quality challenge." The sketch below illustrates the identity-versus-entity distinction.
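The identity-versus-entity distinction Loshin describes can be modeled explicitly. In this hypothetical Python sketch, one shared login maps to several household members, and a purchase is attributed to an entity only when the evidence supports it; the profiles, tags and attribution rule are invented.

```python
# Hypothetical sketch: one online identity can represent several entities.
from dataclasses import dataclass, field

@dataclass
class Entity:
    name: str
    interests: set = field(default_factory=set)

# A shared account (identity) mapped to the real people (entities) behind it.
HOUSEHOLD = {
    "family_account": [
        Entity("parent", {"data management", "history"}),
        Entity("child",  {"picture books", "dinosaurs"}),
    ]
}

def attribute_purchase(identity: str, product_tags: set) -> str:
    """Guess which entity behind an identity a purchase represents."""
    candidates = HOUSEHOLD.get(identity, [])
    best = max(candidates, key=lambda e: len(e.interests & product_tags), default=None)
    if best is None or not (best.interests & product_tags):
        return "unattributed"  # don't force a link when evidence is weak
    return best.name

print(attribute_purchase("family_account", {"dinosaurs", "picture books"}))  # child
print(attribute_purchase("family_account", {"gardening"}))                   # unattributed
```

A recommendation engine built on this distinction could, for instance, suppress purchases attributed to another entity (or left unattributed) when recommending items to the account holder.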
Closing Thoughts

"The people who make data quality mistakes tend to be the ones who create a big data environment and then go looking for problems to solve with it," said Chastain. "Start with the problem the business wants to solve, because that will determine the analytic techniques, data sources and what it means to have 'quality' data. Do you need some of that traditional data cleansing, or do you need to be granular and preserve the anomalies that could enable discovery?"

When it comes to big data analytics, raw and unstructured data can be a gold mine of details that reveal facts, relationships, clusters and anomalies. Too much preparation at the beginning of an analytical data project may lose some of the "data nuggets" that fuel discovery.

About the Presenters

David Loshin, President, Knowledge Integrity Inc.
David Loshin is a recognized thought leader and expert consultant in the areas of data quality, master data management and business intelligence. He is a prolific author on business intelligence best practices and has written numerous books and papers on data management, including the recently published Practitioner's Guide to Data Quality Improvement.

Scott Chastain, Global Engineering Manager, Information Management and Delivery, SAS
Scott Chastain designs customer solutions for a wide array of business challenges. Chastain has extensive experience in a variety of industries, including health care, manufacturing, telecommunications, government and finance. Chastain believes that enterprise data architecture is a foundational requirement for business analytics, and to that end, his primary focus revolves around data management, data governance and master data management.

For More Information

To view the on-demand recording of this webinar: sas.com/reg/web/corp/2069818
To view other events in the Applying Business Analytics Webinar Series: sas.com/ABAWS
For more about SAS® and big data: sas.com/big-data/index.html
For more about SAS Data Quality Solution: sas.com/data-quality/index.html
For a go-to resource for premium content and collaboration with experts and peers: AllAnalytics.com
Follow us on Twitter: @sasanalytics
Like us on Facebook: SAS Analytics

About SAS

SAS is the leader in business analytics software and services, and the largest independent vendor in the business intelligence market. Through innovative solutions, SAS helps customers at more than 60,000 sites improve performance and deliver value by making better decisions faster. Since 1976, SAS has been giving customers around the world THE POWER TO KNOW®. For more information on SAS® Business Analytics software and services, visit sas.com.

SAS Data Quality Solution provides an enterprise solution for profiling, cleansing, augmenting and integrating data to create consistent, reliable information. With SAS Data Quality Solution, you can automatically incorporate data quality into data integration and business intelligence projects to dramatically improve returns on your analytic initiatives with big data.

SAS Institute Inc. World Headquarters: +1 919 677 8000
To contact your local SAS office, please visit: sas.com/offices

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies. Copyright © 2013, SAS Institute Inc. All rights reserved.