How to Use an Uncommon-Sense Approach to Big Data Quality
CONCLUSIONS PAPER
Insights from a webinar in the Applying Business Analytics webinar series
Featuring:
Scott Chastain, Global Engineering Manager,
Information Management and Delivery, SAS
David Loshin, President, Knowledge Integrity Inc.
Table of Contents
Why All the Interest in Big Data?
The Advantages of Analyzing All the Data, Not Just a Subset
Challenges and Problems with Big Data
Data Volume Overwhelms the Systems Designed to Digest and Analyze It
Data Comes from Disparate and Inconsistent Sources
Much of the Available Data Was Not Designed for Decision Making
It Can Be Difficult to Determine What Data Is Relevant
Data Preparation Differs Based on the Analytic Method Being Used
The Uncommon-Sense Part of the Data Quality Story
Big Data Quality in Action: Three Use Cases
Utility Sensor Data: Dealing with Missing or Out-of-Range Values
Social Media Data: Grappling with the Ambiguity of Language
Multichannel Retail Data: Linking an Online Entity to a Real Person
Closing Thoughts
About the Presenters
For More Information
Why All the Interest in Big Data?
Organizations are inundated in data – terabytes, petabytes and exabytes of it. Data pours in from every conceivable direction: from operational and transactional systems, from scanning and facilities management systems, from inbound and outbound customer contact points, from mobile media and the Web.
For many organizations, the exponential growth in data volumes isn’t new. It continues a trend that started in the 1970s. What is new is that the volume, variety and velocity of data exceed the storage and compute capacity to use that data for accurate and timely decision making.
The hopeful vision of big data is that organizations will be able to harvest every byte of relevant data and use it to make supremely informed decisions. We now have the technologies not only to collect and store big data but, more importantly, to understand and take advantage of its full value.
“The financial services industry has led the way in using analytics and big data to manage risk and curb fraud, waste and abuse – especially important in that regulatory environment,” said Scott Chastain, Global Engineering Manager for Information Management and Delivery at SAS. “We’re also seeing a transference of big data analytics into other areas, such as health care and government. The ability to find that needle in the haystack becomes very important when you’re examining things like costs, outcomes, utilization and fraud for large populations.”
The Advantages of Analyzing All the Data, Not Just a Subset
“Big data provides gigantic statistical samples, which enhance analytic tool results,”
wrote Philip Russom, Director of Data Management Research for TDWI (in the TDWI
Best Practices Report, Big Data Analytics, Fourth Quarter 2011). “The general rule is
that the larger the data sample, the more accurate are the statistics and other products
of the analysis.”
In the past, organizations were limited to using subsets of their data, or were constrained
to simplistic analysis because the sheer volume of data overwhelmed their IT platforms.
What good is it to collect and store terabytes of data if you can’t analyze it in full context,
or if you have to wait hours or days to get results for urgent questions?
“When you expand your focus and data sources, you’ve got more data, but when
you’re limited in your computing capabilities, you tend to focus on the low-hanging fruit,”
said David Loshin, data management consultant and President of Knowledge Integrity
Inc. “Analysts look for the typical patterns of fraud, but there’s a lot of real estate under
that curve. When you’re accumulating much larger volumes of data, and analysts have
access to that data at a granular level, they’re able to find things they weren’t able to
find before.”
For example, “utility companies have been collecting sensor data for quite some time, but they were constrained and couldn’t use it all,” said Chastain. “Now they can look
at sensor data in near-real-time fashion across a community or systemwide process,
merge that with data from traditional business processes, and use the insights to
optimize the infrastructure for reliability and cost savings.”
Challenges and Problems with Big Data
With the onset of big data comes the question: How do you maintain the quality of all
that data? The traditional recipes for data quality success – data profiling, data cleansing
and data monitoring – don’t always lead to success in today’s big data world. There are
several challenges to address.
Data Volume Overwhelms the Systems Designed to Digest and Analyze It
Consider the examples of data generated by utility meters, RFID readers and facility
management systems. “Many industries are not used to the scale and volume of all
the data that can be generated by these sensors,” said Loshin. “For example, the
legacy model for data collection in a utility company was a person going home-to-home reading meters once a month. As utilities transition to smart meters and sensors
installed throughout the power grid, these components are starting to generate data
readings every 15 minutes – and they are capable of delivering readings in increments
of seconds or milliseconds. The industry has to be able to absorb that data feed and
be poised to collect and analyze a scale of data that is orders of magnitude greater
than anything they’ve ever seen before.”
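To put that shift in perspective, here is a rough, back-of-the-envelope sketch in Python. The meter counts and record sizes are illustrative assumptions, not figures from the webinar; the point is simply how quickly reading frequency multiplies data volume.

```python
# Rough, illustrative arithmetic: how reading frequency multiplies data volume.
# All figures below are assumptions for the sake of the example, not actual
# utility-industry numbers.

METERS = 1_000_000          # assumed number of smart meters in a service area
BYTES_PER_READING = 100     # assumed size of one reading (timestamp, meter ID, value)

def yearly_volume_gb(readings_per_day: float) -> float:
    """Return approximate data volume per year in gigabytes."""
    return METERS * readings_per_day * 365 * BYTES_PER_READING / 1e9

monthly = yearly_volume_gb(1 / 30)       # legacy model: one manual reading a month
every_15_min = yearly_volume_gb(96)      # 4 readings per hour, 24 hours a day
every_second = yearly_volume_gb(86_400)  # sensors capable of per-second readings

print(f"Monthly manual reads : {monthly:,.1f} GB/year")
print(f"Every 15 minutes     : {every_15_min:,.1f} GB/year")
print(f"Every second         : {every_second:,.1f} GB/year")
```

Even with these conservative assumptions, moving from a monthly read to a 15-minute interval multiplies the data feed by roughly 3,000 times, and per-second readings add several more orders of magnitude on top of that.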
“Traditional structured data is still a very large part of every business organization, but we see an increased use of nontraditional data sources, such as machine-to-machine data, geolocation data from cellphones and unstructured text data. The growth of these nontraditional sources is playing a big role in the growth of big data.”
Scott Chastain
Global Engineering Manager, Information Management and Delivery, SAS
Data Comes from Disparate and Inconsistent Sources
This diversity of data types introduces integration challenges as organizations seek a
more holistic perspective on the business. “Organizations have typically focused their
data management practices on solving a particular problem in a functional area, not
on looking horizontally across all areas of the business,” said Loshin.
When organizations try to integrate data from functional silos, they find that data formats are inconsistent. Definitions of a term can vary from one system to another, leading to confusion about what a concept such as “customer” even means and whether data elements can be merged. As organizations start looking to optimize for business value across the
organization, data management and data quality practices will have to adapt.
Much of the Available Data Was Not Designed for Decision Making
“Many of the data sets were originally focused on the operational and transactional
functions,” said Loshin. “For transaction processing, for example, the data set just needs
to have the complete set of values that enable a customer’s purchase to go through.
But the objective of big data analytics is to drive good decision making, which relies on
the absorption of data sets that were not intended for that type of use. So we end up
with data that doesn’t meet the expectations of downstream business consumers.
“Furthermore, the data is being generated in environments over which we have no
control. When we pull data from sensors to analyze residential use patterns, we can’t
control whether the sensors were working right or not. We can only use our techniques
to determine whether the data complies with our expectations and whether we have
enough trust in the data to analyze it for our business purposes.”
It Can Be Difficult to Determine What Data Is Relevant
Of the data you collect, what do you keep, and what do you throw out? Cheap storage
has driven a propensity to hoard data, but this habit is unsustainable. Organizations
need a better information engineering pipeline and governance process.
“We’ve talked to clients who regret that they hadn’t collected data that now they could
use, so now they’re starting to collect everything,” said Loshin. “Others are saying, ‘We
have all this data, and we don’t know what we’re going to do with it, but we don’t want
to throw it out, because we might want to use it someday.’ This data caching creates
another demand on a big data environment – for more archiving, more scalability, more capabilities for analysis and extraction – which creates yet more demand, and still
people are sometimes left scratching their heads trying to figure out what they’re going
to do.”
Of the data you keep, what do you include in analysis? Not all questions are better
answered by bigger data. The traditional modus operandi has been to store everything,
and only when you query it do you discover if it is relevant. This is a costly and
cumbersome proposition. “You have to determine exactly which data is relevant to a
particular business question,” said Chastain. “Whether it’s big data, small data or a
combination, organizations are trying to determine what data will be used for business
gain – and that influences how we set up the data.”
“Much of the data that feeds
into big data analysis was
generated for other purposes
besides analysis – and is being
repurposed for analyses that
were not originally anticipated.”
David Loshin
President, Knowledge Integrity Inc.
Data Preparation Differs Based on the Analytic Method Being Used
For example, all the practices associated with data preparation for a data warehouse
are appropriate for online analytic processing (OLAP). But with query-based analytics,
users often want to begin the analysis very quickly in response to a sudden change in
the business environment. The analysis can require large data volumes – often multiple
terabytes – of raw operational data. The urgency of the analysis doesn’t allow time for much (if any) data transformation, cleansing and modeling. Nor, as the next section explains, would you necessarily want to make the data perfect.
The Uncommon-Sense Part of the Data Quality Story
“It is counterintuitive, but the traditional approach to data quality – which focuses
on data cleanliness and orderliness – is not the way we do it for big data analytics,”
said Loshin.
Advanced analytics has the potential to identify useful information from data that could
be perceived as having poor quality. Data anomalies, missing data, nonstandard
values and other things that would be inappropriate for reporting from a standard
data warehouse may hold useful information that can be revealed through advanced
analytics. For example, fraud is often revealed in nonstandard or outlier data. So you
don’t want to do much data cleansing, data modeling or ETL (extract, transform, load)
– the way you would for data warehousing – because that could mask the very issues
you’re looking for.
“It is counterintuitive, but the
traditional approach to data
quality – which focuses on data
cleanliness and orderliness –
is not the way we do it for big
data analytics.”
David Loshin
President, Knowledge Integrity Inc.
“In the past, people would say, ‘Here’s a set of data; we want you to predict fraud,
waste or abuse,’” said Chastain. “Often that data had already been cleansed or
standardized in an enterprise data warehouse. However, the inconsistencies in the data
can be good predictors of fraud, waste and abuse. For example, a fraud analyst might
want to spot people who are trying to skirt the system by entering names or addresses
slightly differently, using the wife’s maiden name or a work address rather than a home
address. Rather than cleansed and standardized data that has been through traditional
data quality processes, analysts need as much granular data as possible from the
existing sources.”
Loshin agreed. “Traditional approaches to data quality and data integration focused on
cleansing and correcting the data before it came into the analysis environment, but if you
do that, you eliminate some of the hooks you may be looking for. … We’re looking not
to cleanse the data but rather to use the cleansing techniques to establish relationships
among transactional records.” For instance, record linkage and matching techniques
designed for fixing data inconsistencies for an entity can be used to spot cases where
multiple family members are getting the same controlled substance from different
physicians, or where drivers and body shops are colluding for purposes of claims fraud.
The idea is to preserve analytical data’s rich details, because they enable discovery.
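As a simple illustration of that idea, the sketch below is a hypothetical Python example, not a SAS implementation; the records, similarity measure and threshold are all assumptions. It uses fuzzy matching to link records with slightly different name and address spellings to the same probable entity while leaving the original values untouched.

```python
# Minimal sketch: use matching techniques to LINK variant records to one entity
# instead of cleansing them. Records, similarity measure and threshold are
# hypothetical, illustrative choices.
from difflib import SequenceMatcher
from itertools import combinations

records = [
    {"id": 1, "name": "Mary Smith",   "address": "12 Oak St"},
    {"id": 2, "name": "M. Smyth",     "address": "12 Oak Street"},
    {"id": 3, "name": "Robert Jones", "address": "98 Elm Ave"},
]

def similarity(a: str, b: str) -> float:
    """Crude string similarity in [0, 1]; real record linkage uses stronger methods."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

THRESHOLD = 0.7  # assumed cutoff for declaring a probable link

links = []
for r1, r2 in combinations(records, 2):
    score = (similarity(r1["name"], r2["name"]) +
             similarity(r1["address"], r2["address"])) / 2
    if score >= THRESHOLD:
        # Record the relationship; do NOT standardize or overwrite either record,
        # because the variation itself may be the signal an analyst needs.
        links.append((r1["id"], r2["id"], round(score, 2)))

print(links)  # records 1 and 2 are linked as one probable person; both spellings survive
```

The design point is the last comment: the linkage result is stored alongside the raw records, so an analyst can ask why the same person appears under two spellings rather than having the discrepancy silently corrected away.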
Big Data Quality in Action: Three Use Cases
Utility Sensor Data: Dealing with Missing or Out-of-Range Values
In machine-to-machine data processes, you don’t have a human to blame for data
errors, but data quality problems still exist. As with fraud investigation, you don’t necessarily want to fix them, though. “If there’s a missed expectation in the data – data is
missing or doesn’t conform to a particular range of values – it’s indicative of some type
of problem,” said Loshin. For example, if a sensor monitoring the viscosity of a liquid
passing through a pipeline delivers an out-of-range value, it could mean: (A) The value is
correct and there’s a problem with the viscosity of the fluid, or (B) The value is incorrect,
and the sensor is not working properly.
“When you’ve got thousands of miles of pipeline equipped with multiple sensors every
15 feet … or oil rigs with significant numbers of sensors placed along different areas
of the activity … or a jumbo jet generating gigabytes of performance data every 15
seconds … you want to be able to determine whether something is going wrong with
the sensor or the condition,” said Loshin. “We can use data quality techniques to
monitor compliance with our expected values, but we don’t want to correct the data.
We need to determine if it is a data error and we can impute the correct value, or
whether the data indicates something that’s actually going on with the environment.”
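A minimal sketch of that approach follows, assuming a hypothetical stream of pipeline viscosity readings and an assumed valid range. It flags values that miss expectations for investigation instead of correcting them.

```python
# Minimal sketch: monitor sensor readings against expected values and flag
# misses for investigation rather than "fixing" them. The range and readings
# are assumptions for illustration only.

EXPECTED_RANGE = (10.0, 50.0)   # assumed valid viscosity range for this sensor

readings = [
    {"sensor": "P-17", "ts": "2013-04-01T00:00", "value": 32.4},
    {"sensor": "P-17", "ts": "2013-04-01T00:15", "value": None},   # missing
    {"sensor": "P-17", "ts": "2013-04-01T00:30", "value": 212.0},  # out of range
]

def check(reading: dict) -> str:
    value = reading["value"]
    if value is None:
        return "MISSING"              # failed sensor, or a dropped feed?
    lo, hi = EXPECTED_RANGE
    if not (lo <= value <= hi):
        return "OUT_OF_RANGE"         # real condition, or faulty sensor?
    return "OK"

for r in readings:
    status = check(r)
    if status != "OK":
        # Route to investigation; do not impute or overwrite the raw value.
        print(f"{r['sensor']} {r['ts']}: {status} (raw value kept: {r['value']})")
```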
“The standard rallying flag for the
data quality consultant has been
to say, ‘The data always has to
be 100 percent perfect.’ But if
you make it perfect, you end up
eliminating some of the things
you’re actually looking for.”
David Loshin
President, Knowledge Integrity Inc.
Chastain offered another example: temperature sensors giving readings over time.
“If we see readings start to drop in and out of the expected range, is that something we
want to fix? For a time series, I can impute missing values and get a nice pretty curve
representing temperature as a function of time. But what if the fluctuations indicate that
the temperature sensor is failing? This is one of those cases where the traditional data
quality process would be to fix the data, but the missing values may actually be more
valuable to the business than knowing what the values should be.”
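Continuing that example, the short sketch below (with hypothetical readings and thresholds) shows why the misses themselves carry information. Rather than imputing a smooth curve, it tracks how often a sensor’s readings are missing or out of range; a rising miss rate becomes the signal that the sensor may be failing.

```python
# Minimal sketch: instead of imputing missing or out-of-range temperature values,
# track the miss rate per sensor over time; a rising rate suggests the sensor
# itself is failing. Readings and thresholds are illustrative assumptions.

EXPECTED = (15.0, 30.0)   # assumed acceptable temperature band

# Sixteen hourly readings from a hypothetical sensor; None means a missing reading.
readings = [22.1, 22.4, None, 22.0, 21.8, 22.3, None, None,
            35.9, 21.7, None, 40.2, None, None, 38.5, None]

def is_miss(value) -> bool:
    return value is None or not (EXPECTED[0] <= value <= EXPECTED[1])

WINDOW = 8
for start in range(0, len(readings), WINDOW):
    window = readings[start:start + WINDOW]
    miss_rate = sum(is_miss(v) for v in window) / len(window)
    flag = "  <-- investigate sensor" if miss_rate > 0.5 else ""
    print(f"hours {start:02d}-{start + len(window) - 1:02d}: "
          f"miss rate {miss_rate:.0%}{flag}")
```

Imputation would hide exactly the pattern this loop surfaces: the second window’s miss rate jumps sharply, which is the business-relevant fact.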
Social Media Data: Grappling with the Ambiguity of Language
Unstructured data sources (such as text documents, tweets and social media posts) provide a rich supplement to the data that feeds into big data analytics. Through analysis
of text in customer service notes, warranty claims and social media, organizations
can spot impending product quality issues, determine customer sentiment about the
brand, and uncover new market opportunities. Analysts are only beginning to explore
the possibilities.
Trouble is, human language is messy from a data quality point of view. It can be
casual, inconsistent and ambiguous. People misspell words and use cryptic acronyms.
Homonyms abound. “There’s no constraint on the creation of that data,” said Loshin.
“You can’t force a Facebook poster or Twitter user to spell somebody’s name right, and
you can’t go back and tell them that as the data owner they need to correct it. So it’s a
challenge to extract a signal out of a lot of noise.”
“My own son tells me I’m just not
hip enough to understand half
of what tweets are saying. So
the biggest thing we see with
understanding unstructured text
is that it’s not just necessarily
about the text, but the context.”
Scott Chastain
Global Engineering Manager, Information
Management and Delivery, SAS
If those hindrances are true of human speech and writing in general, they are
exacerbated with social media. “Social media is its own language,” said Chastain. “My
own son tells me I’m just not hip enough to understand half of what tweets are saying.
So the biggest thing we see with understanding unstructured text is that it’s not just
necessarily about the text, but the context.”
For unstructured data, data quality revolves around natural language processing –
a discipline from the field of artificial intelligence that combines computer science and
linguistics. Natural language processing identifies meaningful concepts, attributes
and opinions in the spoken or written word. The data quality mechanisms are
advanced linguistic rules that determine the structure of a sentence, the part of speech
(noun, verb, etc.), known entities and so on. The context of the sentence aids with
disambiguation, such as differentiating between car repair services and religious
services, or between Amazon the river and Amazon the e-commerce giant.
“All of this natural language processing has to be relative to the business or the problem
we want to solve,” said Chastain. “For example, ‘radical’ has a completely different
meaning for a government organization or a skateboard manufacturer. So the data
quality processes not only have to do entity extraction, we also have to put those terms
into our business vernacular to understand them in the context of the conversation.”
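As a toy illustration of that last point, the hand-rolled sketch below resolves an ambiguous term by the business vocabulary that surrounds it. The context lists and sentences are assumptions for illustration; production text analytics relies on full linguistic rules and entity extraction rather than keyword counting.

```python
# Toy sketch: resolve an ambiguous term by the business context around it.
# The vocabularies and example sentences are illustrative assumptions only.

CONTEXTS = {
    "security":   {"threat", "extremist", "alert", "border", "intelligence"},
    "skateboard": {"trick", "deck", "ollie", "halfpipe", "board", "awesome"},
}

def resolve_sense(term: str, sentence: str) -> str:
    """Pick the context whose vocabulary overlaps most with the sentence."""
    words = set(sentence.lower().split())
    scores = {name: len(words & vocab) for name, vocab in CONTEXTS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

print(resolve_sense("radical", "That ollie off the halfpipe was radical, new deck rocks"))
print(resolve_sense("radical", "Intelligence alert flagged radical activity near the border"))
```

The same word resolves to a different sense in each sentence, which is the “business vernacular” problem Chastain describes in miniature.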
Multichannel Retail Data: Linking an Online Entity to a Real Person
Retailers who have brick-and-mortar stores and an online store share some vexing
data challenges:
• How do you link the anonymous online visitor with someone who visits your store and gets your catalog?
• How do you determine if the purchase was made for the buyer or for someone else?
Retailers want to capture that link between an online visitor and an individual consumer’s name so they can apply customer intelligence to that shopper and offer the right treatment.
Even if that link is unknown, the retailer still wants as much information as possible to
optimize the online visitor’s experience.
Because this link is so vital, retailers are getting inventive in persuading consumers to
provide identifying information. “You’ll see sites that ask whether you’d like to log in with
your Facebook ID,” said Chastain. “Many consumers will say, ‘Wow, that’s great, it’s a
lot easier for me to remember than some other type of login.’ Now the retailer can link
me from my Facebook identity to my consumer behavior in their stores with shopping
cart analysis and any other ways I interact with the retailer. So there are definitely some
techniques being tried out in the marketplace, but getting that missing link is still an
ongoing challenge.”
Another challenge is understanding why a purchase was made, said Loshin. “There are
data quality issues involved with differentiating an identity from the representation of an
identity. When I buy books for my kids from Amazon, I’m really representing my children,
as opposed to when I buy textbooks for gifts or for my own pleasure reading. As an
Amazon customer, I’m using a single identity but representing multiple entities. Similarly,
my wife and I use the same supermarket loyalty card but we have different types of
buying behavior.
“A third example is a colleague who created a Twitter account for her dog and posts
tweets on the dog’s behalf. This raises the question: How many individuals have multiple
Twitter accounts that represent themselves in multiple ways? When you talk about
linking things together, it becomes complex, because sometimes one representation
has multiple people behind it, and sometimes multiple representations have a single
person behind them.”
“Most recommendation engines don’t have the ability to differentiate between
those various entities,” said Chastain. “When I go to an online shopping site, I get
recommendations based on a previous purchase of a gift for someone else. The
industry is taking steps, but this is a growing data quality challenge.”
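A minimal sketch of the underlying idea follows, using hypothetical order data and crude heuristics rather than an actual recommendation engine. It separates one customer’s purchases by the entity they likely represent before those purchases feed recommendations.

```python
# Minimal sketch: separate one customer's purchases by the entity they likely
# represent (self vs. someone else) before using them for recommendations.
# The orders and the heuristics are illustrative assumptions only.

orders = [
    {"item": "data management textbook", "gift_wrap": False, "ship_to": "home"},
    {"item": "children's picture book",  "gift_wrap": True,  "ship_to": "home"},
    {"item": "mystery novel",            "gift_wrap": False, "ship_to": "home"},
    {"item": "skateboard deck",          "gift_wrap": True,  "ship_to": "other address"},
]

def likely_entity(order: dict) -> str:
    """Crude heuristic: gift wrap or an unusual ship-to address suggests the
    purchase represents someone other than the account holder."""
    if order["gift_wrap"] or order["ship_to"] != "home":
        return "other"
    return "self"

profile = [o["item"] for o in orders if likely_entity(o) == "self"]
excluded = [o["item"] for o in orders if likely_entity(o) == "other"]

print("Use for this shopper's recommendations:", profile)
print("Set aside (likely bought for someone else):", excluded)
```

Real systems would need far richer signals, but even this crude split keeps the gift purchase from distorting the shopper’s own profile.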
Closing Thoughts
“The people who make data quality mistakes tend to be the ones who create a big
data environment and then go looking for problems to solve with it,” said Chastain.
“Start with the problem the business wants to solve, because that will determine the analytic techniques, data sources and what it means to have ‘quality’ data. Do you need to have some of that traditional data cleansing, or do you need to be granular and preserve the anomalies that could enable discovery?”
When it comes to big data analytics, raw and unstructured data can be a goldmine for
details that reveal facts, relationships, clusters and anomalies. Too much prep at the
beginning of an analytical data project may lose some of the “data nuggets” that fuel
the discovery.
About the Presenters
David Loshin
President, Knowledge Integrity Inc.
David Loshin is a recognized thought leader and expert consultant in the areas of data
quality, master data management and business intelligence. He is a prolific author on
business intelligence best practices and has authored numerous books and papers
on data management, including the recently published Practitioner’s Guide to Data
Quality Improvement.
Scott Chastain
Global Engineering Manager, Information Management and Delivery, SAS
Scott Chastain designs customer solutions for a wide array of business challenges.
Chastain has extensive experience in a variety of industries, including health care,
manufacturing, telecommunications, government and finance. Chastain believes that
enterprise data architecture is a foundational requirement for business analytics, and to
that end, his primary focus revolves around data management, data governance and
master data management.
For More Information
To view the on-demand recording of this webinar:
sas.com/reg/web/corp/2069818
To view other events in the Applying Business Analytics Webinar Series:
sas.com/ABAWS
For more about SAS® and big data:
sas.com/big-data/index.html
For more about SAS Data Quality Solution:
sas.com/data-quality/index.html
For a go-to resource for premium content and collaboration with experts and peers:
AllAnalytics.com
Follow us on Twitter: @sasanalytics
Like us on Facebook: SAS Analytics
About SAS
SAS is the leader in business analytics software and services, and the largest independent vendor in the business intelligence market.
Through innovative solutions, SAS helps customers at more than 60,000 sites improve performance and deliver value by making better
decisions faster. Since 1976 SAS has been giving customers around the world THE POWER TO KNOW ®. For more information on
SAS® Business Analytics software and services, visit sas.com.
SAS Data Quality Solution provides an enterprise solution for profiling, cleansing, augmenting and integrating data to create consistent,
reliable information. With SAS Data Quality Solution, you can automatically incorporate data quality into data integration and business
intelligence projects to dramatically improve returns on your analytic initiatives with big data.
SAS Institute Inc. World Headquarters +1 919 677 8000
To contact your local SAS office, please visit:
sas.com/offices
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA
and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.
Copyright © 2013, SAS Institute Inc. All rights reserved. 106269_S95503_0413