Why Enterprises Need Trustworthy Data TDWI

June 2014
TDWI E-Book
Q&A: Building Trust in Your Data
The Ramifications of Trusted Data
Fostering Confidence in Data
About IBM
Sponsored by:
tdwi.org
Expert Q&A
Why We Need Trusted Data
Fostering Confidence in Data
About IBM
Q&A: Building Trust in Your Data
Without accurate and trustworthy data, the value of
enterprise analytics will be diminished. In this Q&A, we
discuss governance, data quality, and the role of the
chief data officer.
Enterprises are increasingly recognizing the importance of
having accurate, trustworthy data in their data warehouses.
Data governance can help ensure data quality, but it need
not be heavy-handed and inflexible. We spoke to Paula
Wiles Sigmon, program director of product marketing for the
InfoSphere Information Integration and Governance portfolio
at IBM Corporation’s Information Management division, about
what enterprises are doing to improve the quality of their data,
who should be responsible for data that goes in and comes
out of an enterprise data warehouse, and the emerging role of
the chief data officer.
TDWI: How have your recent conversations with
clients about data warehousing been different from
conversations in the past?
Paula Wiles Sigmon: In recent years, most organizations
have recognized the importance of a data warehouse and have
spoken with us primarily about either how to get started with
a new one or how to improve the warehouses already in place.
Today, the focus has shifted to the concept of a modernized
warehouse—in particular, one that meets the needs of an
organization that is starting to take in lots more data from
new sources, in more and different forms.
What are the new issues and concerns?
The one we’re hearing most often is a concern about how to
take advantage of what’s best in a modern data warehouse as
well as what’s best in Hadoop. After a brief period a year or
so ago, when organizations wondered if data warehouses were
even needed in a world where Hadoop was an option, most
enterprises today realize that the warehouse and the Hadoop
environment can and should coexist and complement each
other. What organizations are trying to do is to figure out how
to get it right—how to move the right data from Hadoop to the
warehouse, move the right data out of the warehouse when it
no longer has value, and make the warehouse a home for good
data rather than bad data or questionable data, so that deep
analytics can be based on the best available information.
You are focused on “information integration and
governance”—not the first thoughts many people have
when they start to plan or modernize a data warehouse.
Can you help explain the relationship?
It’s certainly true that people considering a new or modernized
warehouse tend to focus first on the warehouse itself and not
the data flowing into it. Sometimes very early in the process
and sometimes a bit later, they realize that the whole point
of the warehouse is to provide a foundation for reporting and
analytics. If those reports and analytics are not based on the
best available information, then the entire warehouse/analytics
ecosystem has diminished value to the organization.
We’ve all seen the scenario where business people simply
don’t trust the reports they receive. They can base their
decisions on them anyway; they can ignore them and make
decisions that don’t even pretend to be fact based; they can
strike out on their own and try to create their own systems
to meet their specific needs, creating more data silos. The
options aren’t pretty.
People start asking key questions, such as: How can I create a
warehouse that instills confidence among the business users
who receive the output? Tracing back from that question are
questions about the factors that tend to build confidence. For
example: How can I create and manage data quality? How can
I provide transparency into the lineage of the data—where
it originated, who or what has changed it, when it was last
updated? How can I make sure I’m keeping the information
that’s needed for compliance or for business operations but
not the data that has passed its useful life and become a
liability rather than an asset? How can I provide the best
360-degree view of customers, products, and other key
entities despite conflicting information flowing toward the
warehouse from multiple sources? How can I protect the
information in the warehouse from both accidental leaks and
intentional breaches?
All these questions are critical to the design of a modern data
warehouse ecosystem, and they all point to the importance of
information integration and governance.
Some people raise concerns about governance as a
heavyweight undertaking that can slow down projects or
increase costs. What is your view?
We certainly have heard that concern. It tends to result from
the belief that governance is a “one size fits all” undertaking.
If that were the case, the concern would probably be valid. If
we needed to define our governance practices for the critical
data from internal systems that feed our financial reporting,
and then apply these practices to comments gleaned from
social media—and intended to provide a sense of the market
or some additional insights into a particular customer’s
interests—we would end up with a heavy-handed approach to
the second set of data that is neither practical nor reasonable.
However, that isn’t the case. One size doesn’t fit all.
What organizations need to do instead is agree that all data
brought into the organization needs some level of governance,
but understand that the levels vary based on both the data
source and the intended use of the data. Then we end up with
a situation where we have appropriate controls in place, where
users can have confidence in the information at their disposal,
but where there is plenty of agility to adapt to new data
sources without a governance sledgehammer.
We’ve heard customers describe their own approaches to right-sizing data governance. For example, some use a time-boxed
approach, accepting data from a new source into a special
test zone for a short period of time while it is evaluated in
isolation. Then, if it is determined to have longer-term value, it
must move to another zone where more controls are in place.
Another approach is to classify data according to intended
uses, and set up governance zones according to those
classifications.
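As a rough illustration of the time-boxed approach described above, the following Python sketch models a data source that sits in a test zone for a fixed evaluation window and is then either promoted to a zone with more controls or dropped. The zone names, the 30-day window, the `Source` class, and the "retired" outcome are all hypothetical, chosen only for illustration; they do not come from the interview or from any IBM product.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical evaluation window; the interview does not specify a length.
TEST_WINDOW = timedelta(days=30)


@dataclass
class Source:
    name: str
    landed_on: date            # day the source entered the test zone
    has_long_term_value: bool  # outcome of the evaluation


def next_zone(source: Source, today: date) -> str:
    """Return the zone a source belongs in under a time-boxed policy."""
    if today - source.landed_on < TEST_WINDOW:
        return "test"        # still being evaluated in isolation
    if source.has_long_term_value:
        return "controlled"  # zone where more governance controls apply
    return "retired"         # assumed outcome when no lasting value is found


feed = Source("social_comments", date(2014, 6, 1), has_long_term_value=True)
print(next_zone(feed, date(2014, 6, 10)))  # test
print(next_zone(feed, date(2014, 7, 15)))  # controlled
```

Classifying data by intended use, the second approach mentioned, could follow the same pattern with the zone chosen from a lookup on the classification rather than from elapsed time.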
So many conversations today focus on what comes out
of the data warehouse—in particular, the analytics that
organizations need to drive the business. What is your
perspective?
I believe that’s an important focus. If we put data into a
warehouse and never produced any output, it would not serve
the business. Today’s organizations are moving more and more
to a data-driven approach to decision making. All eyes are on
the analytics, and that’s appropriate.
It’s our view, though, that any close examination of analytics
must ultimately lead back to the underlying data. If it’s
outdated, inconsistent, or derived from questionable sources,
the organization’s best efforts at data-driven decisions can
lead to some disastrous results.
Should business people, in particular, be concerned
about what goes into the warehouse?
Yes. No business person should be comfortable making decisions
based on questionable data. For those who want to take a look
under the covers, the facts about what goes into the warehouse
should be transparent. For others who don’t really want to take
a deep dive into data lineage as an ongoing practice, there still
should be clear answers to questions about where and when the
data originated, how it is secured, and so on.
In fact, we have found that coming to a clear understanding
of information, its history, and its meaning is an important
area of collaboration between IT and the business. That
collaboration is essential to good governance and also
confidence in data.
Are you starting to deal with individuals in any roles that
previously were not involved in data warehousing?
As CMOs grow in their importance as consumers of
information and analytics and as investors in technology, they
clearly care more and more about the data coming out of the
warehouse, if not in the warehouse itself. The significant new
players we’re seeing in data warehouse conversations are the
data scientists. These folks are popping up everywhere as
they look to apply varied techniques to derive meaning from
data, whether it is in a data warehouse or elsewhere. Their
role seems to be taking on new importance in a big data world,
where there is more data to which they can apply their skills.
How are chief data officers (CDOs) getting involved with
functions such as data warehousing and analytics?
Although most organizations don’t yet have CDOs, the CDO
population is growing rapidly, especially in industries such as
government and financial services. The CDO job description
varies from one organization to the next, but often the CDO
has either direct or dotted-line responsibility for analytics.
The CDO’s role is all about governance (a control-oriented
objective) combined with a creative search for ways to drive
value from information. Because much of the information in
question resides in the data warehouse, the CDO naturally
takes an interest in that data, how it’s managed, and the
value it contains.
Is there any connection between emerging roles such
as the CDO and the issue of confidence (or lack of
confidence) in data?
There isn’t enough data yet to define a causal relationship
between the presence of a CDO and an increase in user
confidence in data, but we do see a correlation between the
two, and that makes sense. The very presence of a chief data
officer means that the organization cares enough about data
and its business value that it has created a new role to focus
on data. The CDO is a C-level executive who thinks every day
about ways to make data better and to make it work better
to support the goals of the organization. If the CDO is a good
communicator—and every CDO should be—then it just
makes sense that people in the organization will understand
more about their data and trust it more because it is getting
top-level attention.
What does IBM bring to the table for organizations
planning a new or modernized data warehouse?
IBM offers a complete portfolio of tools and solutions ranging
from data warehouse appliances to data exploration tools to
a Hadoop distribution that’s perfect for landing big data or
offloading data from a warehouse within a zone architecture.
For information integration and governance—so important to
data warehouses that inspire confidence in business users—
the IBM InfoSphere product family, part of Watson Foundations,
enables organizations to integrate data from diverse sources
at high speed, establish and maintain data quality, foster
business/IT collaboration, manage master data, manage data
across its life cycle, and enhance data security and privacy.
Whether organizations are just getting started with a new data
warehouse or expanding and enriching an existing warehouse
environment, IBM has not only the tools but also the expertise
to help accelerate success.
The Ramifications of Trusted Data
By Philip Russom
What is “trusted data” and how do we achieve it?
The term trusted data has been bandied about a lot lately,
and everyone seems to have a different definition. It’s an
important concept, so in this column I will define the term and
consider the ramifications for business intelligence and other
data-driven business processes.
For some, trusted data is an emotional matter. In the
context of business intelligence (BI), they want to
feel confident that data presented in reports, cubes,
dashboards, and other BI products is in the best condition
possible because data in good condition makes their jobs
easier and their actions more accurate, timely, effective,
and compliant. People who consume data through
operational applications have similar concerns.
Emotions aside, the condition of trusted data is easily
quantified. Namely, condition is a technical measurement
of data’s completeness, quality, age, schema, profile, and
documentation. The assumption is that trusted data should
come from carefully selected sources, be transformed in
accordance with data’s intended use, and be delivered in
formats and time frames that are appropriate to specific
consumers of reports and other manifestations of data.
Hence, the trustworthiness of data is quantified mostly by the
technical properties that define its condition. You still need to
be mindful of the emotional impact that data’s condition has
on the perceptions of people who consume the data for BI.
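To make the idea of a measurable condition concrete, here is a minimal Python sketch that computes two of the properties listed above, completeness and age, for a handful of records. The sample records, the field names, and the choice of "stalest record" as the age metric are assumptions made for illustration; the column does not prescribe specific formulas.

```python
from datetime import date

# Hypothetical records; None marks a missing value.
records = [
    {"customer_id": 1, "email": "a@example.com", "updated": date(2014, 5, 1)},
    {"customer_id": 2, "email": None, "updated": date(2014, 1, 15)},
    {"customer_id": 3, "email": "c@example.com", "updated": date(2013, 11, 30)},
]
REQUIRED = ("customer_id", "email")


def completeness(rows, fields):
    """Fraction of required field values that are present."""
    total = len(rows) * len(fields)
    filled = sum(1 for row in rows for f in fields if row.get(f) is not None)
    return filled / total if total else 0.0


def max_age_days(rows, today):
    """Age in days of the stalest record (rows must be non-empty)."""
    return max((today - row["updated"]).days for row in rows)


today = date(2014, 6, 1)
print(round(completeness(records, REQUIRED), 2))  # 0.83
print(max_age_days(records, today))               # 183
```

Numbers like these could be reported alongside BI output, which speaks to both the technical and the perceptual side of trust.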
Why We Need Trusted Data
A lack of trusted data leads to a number of poor practices. For
example, if BI data isn’t trustworthy because it’s not in good
condition, then poor decisions are based on poor data. Whether
data is in good condition or not, the mere perception that the
data’s not trusted can lead users to ignore supplied BI data
and instead build their own BI data stores, such as rogue data
marts, spreadsheets, and personal productivity databases.
Less-than-trustworthy data creates problems in operational
processes, too. When data is (or is perceived to be) of poor
quality or incomplete, users and managers base tactical and
operational decisions on guesses.
If you can turn these problems around, then trusted data has
benefits. Whether in BI or operations, people will use data they
trust, which in turn leads to greater consistency, compliance,
and accuracy in business processes based on the data.
How to Get Trusted Data
Achieving trustworthiness for data is a multi-step process:
• Select sources that are appropriate, certified, and diverse
• Process data to transform and aggregate it for the
intended use, improve its quality, and enhance its meta- and master data
• Deliver data in the right time frame, in forms suited to its
intended use
• Collaborate cross-functionally and govern data for
compliance
The process and its best practices involve a mix of:
• Organizational effort: Business and technical people and
teams, in collaboration
• Technical automation: Information management tools
and techniques for integration, quality, databases,
applications, metadata and master data management,
and so on
A Final Word
Giving business and technical users data they trust is key to
good business intelligence. If users don’t have confidence in
the data of a data warehouse and other BI data stores, they
may argue over the data’s accuracy, refuse to use the reports
and analyses fed from the data, or build their own data stores.
Non-trusted data likewise hinders operational excellence, as
people sidestep prescribed processes and misuse applications.
All these paths are nonproductive and lead to faulty decision
making and operational actions.
To learn more, replay Philip Russom’s TDWI Webinar on
Trusted Data for BI.
Philip Russom is director of TDWI Research for data management
and oversees many of TDWI’s research-oriented publications, services,
and events. He is a well-known figure in data warehousing and
business intelligence, having published over 500 research reports,
magazine articles, opinion columns, speeches, Webinars, and more.
Before joining TDWI in 2005, Russom was an industry analyst covering
BI at Forrester Research and Giga Information Group. He also ran his
own business as an independent industry analyst and BI consultant
and was a contributing editor with leading IT magazines. Before that,
Russom worked in technical and marketing positions for various
database vendors. You can reach him at [email protected],
@prussom on Twitter, and on LinkedIn at
linkedin.com/in/philiprussom.
Fostering Confidence in Data
Analytics has little value if users don’t have
confidence in their data. Governance practices
combined with data management best practices can
enhance confidence in data and give decision makers
confidence in their analysis.
Now as ever, the data warehouse (DW) is a good idea. Ask
a team of experts to design a data architecture capable of
delivering accurate, trustworthy information and timely analytic
insights and the finished product would closely resemble the DW.
Even though trends such as big data, advanced analytics, and
the cloud are massive forces for disruption, they don’t—singly
or in combination—obviate or nullify the data warehouse.
The DW design balances the business decision makers’ need for
qualified and consistent information (a need for history: a six-month, one-year, or even five-year perspective on what’s going
on with the business) with their need for access to timely and,
above all, trustworthy insights and information.
Trust can best be won by ensuring the consistency,
cleanliness, correctness, lineage, and security of data, argues
Praveenkumar Hosangadi, product marketing manager with
IBM’s Information Management team. He says research
shows that one in three business leaders doesn’t trust the
information used to make important decisions. “The tsunami
of data that we are seeing now will only add to the data
uncertainty. One of the critical success factors for a modern
data architecture is the implementation of information
governance. Governance practices, including data quality,
data life cycle management, master data management, and
data security and privacy—when blended with best practices
in data management—can enhance confidence in data and
enable confident decision making.
“Confidence in data is very important to the adoption of
big data and analytics, as questionable data can lead to
questionable insights—with no value as a basis for enterprise
decisions or operations,” he argues. “Data confidence is
especially vital for business users who make high-impact
decisions based on insights into data. If users lack confidence
in their data, they will lack confidence in the results. Data
confidence is all the more important in the big data world
because so much of the data growth is coming in forms and
from sources whose reliability is questionable.”
Mutual Complementarity
Hosangadi doesn’t question the importance of Hadoop and
other big data technologies, however. What’s intriguing, he
suggests, is just how neatly the DW and big data complement
one another.
“The traditional data warehouse ecosystem needs to adapt to
big data scale and support a wider variety of data types. This
does not mean ripping and replacing infrastructure. Instead,
what’s needed is the right strategy and the right combination
of technologies,” he says. “As more big data use cases are
emerging and more organizations are gaining experience with
big data, it is clear that the data warehouse and other big
data technologies such as Hadoop complement each other. The
analytic performance, the capability sets, the value per byte of
data stored, and the costs involved are totally different.”
The keyword, again, is complementarity: big data technologies
support workloads and use cases that the DW itself cannot
cost-effectively perform. These include advanced analytics—
particularly in conjunction with multi-structured data types—
and the use of the Hadoop platform as a landing zone or
even as a persistent store for staging and transforming data.
(MapReduce, Hadoop’s built-in parallel processing engine, is a
big help in this regard.)
“It is important to view a data warehouse as an ecosystem that
fosters analytic and BI systems. The more insights organizations
seek, the more crucial the modernization step is,” Hosangadi
says. “To adapt to the big data world, organizations need to
consider a few types of modernization, including preprocessing
of data in a ‘landing zone’ to determine what should be moved
to the warehouse, facilities for offloading infrequently accessed
data from the warehouse, and exploration of big data to
discover new, high-value information and free up the warehouse
for deeper analytics.”
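One way to picture the modernization steps Hosangadi lists is as a pair of routing rules: one decides whether a landing-zone data set is ready for the warehouse, and the other decides whether a warehouse table should be offloaded. The Python sketch below is an illustration only. The `quality_score` field, the 0.9 threshold, and the 180-day cutoff are hypothetical and do not come from the article or from any Hadoop or IBM API.

```python
from datetime import date

QUALITY_THRESHOLD = 0.9   # hypothetical minimum score for warehouse loading
OFFLOAD_AFTER_DAYS = 180  # hypothetical cutoff for "infrequently accessed"


def route_from_landing_zone(dataset: dict) -> str:
    """Decide whether a landing-zone data set moves to the warehouse."""
    if dataset["quality_score"] >= QUALITY_THRESHOLD:
        return "warehouse"
    return "landing_zone"  # stays put for exploration or further cleansing


def should_offload(last_accessed: date, today: date) -> bool:
    """True when a warehouse table has gone unused past the cutoff."""
    return (today - last_accessed).days > OFFLOAD_AFTER_DAYS


print(route_from_landing_zone({"name": "orders", "quality_score": 0.97}))
print(route_from_landing_zone({"name": "clickstream", "quality_score": 0.4}))
print(should_offload(date(2013, 10, 1), date(2014, 6, 1)))
```

In practice the thresholds would be set per data classification rather than globally, in keeping with the point that governance levels vary by source and intended use.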
Complementarity is, by definition, a quid-pro-quo proposition.
Just as big data technologies largely complement the data
warehouse, Hosangadi suggests, so, too, does the DW fill
a complementary role vis-à-vis Hadoop and other big data
technologies. “[O]rganizations need to make sure that their
capabilities in areas such as data quality and transformation
of data that’s moved to the data warehouse are up to the task,
ready to handle high volumes at high speed, and [able to]
handle the diversity of data types that [an] organization [today]
needs to be able to handle.”
Complementarity also means using the right platform for the
right workloads. The Hadoop platform bundles a distributed
file system (HDFS) with a built-in massively parallel
processing (MPP) compute layer, the MapReduce engine. For
this reason, some vendors now position Hadoop as a one-stop platform for all things, including the decision support
and analytic workloads traditionally performed by MPP data
warehouse systems.
This is a huge mistake, argues Hosangadi, inasmuch as it
miscasts Hadoop for a role that it cannot possibly fulfill.
“Hadoop as a big data platform is not a substitute for an MPP
data integration engine because data integration capabilities
have not yet matured on the Hadoop stack. Features such as
data cleansing, data profiling, and the capture of changed
data—capabilities of today’s powerful integration systems—
are not available in Hadoop,” he contends.
“Organizations need to take advantage of Hadoop for providing
functions such as storage of big data in a landing zone while
continuing to exploit the rich, production-proven capabilities of
data integration systems for getting trustworthy data into the
enterprise data warehouse.”
Increasing Confidence
There’s another, often-overlooked aspect of trustworthiness,
Hosangadi points out: security. Organizations must not only
tend to the consistency and quality of the data that they load
into a DW, but must also protect that data against breaches
from within or without.
Moreover, as it develops and “productionizes” big data
technologies, an organization must also determine how to
integrate information with its data warehouse in such a way as
to substantively address data consistency and quality issues, as
well as governance requirements. After all, the data warehouse
(or a specialized, usually MPP, analytic database) is the logical
destination for the analytic insights that are identified and
refined in a big data platform such as Hadoop, as well as for
the smaller, conformed data sets that are to be prepared and
exported from Hadoop when it is used as a landing zone or
persistent store for data of all kinds.
The key issue is that business decision makers and knowledge
workers must be able to trust this information. “In next-generation data warehouses, information integration and
governance capabilities enable organizations to manage
data growth with a scalable, high-performance platform and
deliver information that is worthy of knowledge workers’ trust,”
Hosangadi observes, citing a number of issues that he says
“are keys to confidence in information.”
These include:
• System integrity: Data must be consistent across
different systems
• Data governance: Governance policies must be in place,
and, more important, must be enforced
• Data completeness: Records must be complete, with a
common view of master data records
• Data correctness: Data must be validated, verified, and
standardized
• Data currency: Data must be up to date
• Data lineage: The source or lineage of data must be
known and qualified
• Data security and protection: Data must be
safeguarded against breach and/or data loss
When dealing with data at big data scale, there’s a temptation
to throw in the towel on some core data management tenets,
such as data quality and governance. This isn’t just a mistake,
Hosangadi argues, it’s unnecessary. It might sound heretical,
but it’s frankly OK to relax quality or governance rules for
certain kinds of data.
“Data abundance makes data quality and governance much
more relevant today than ever before. Participants in a recent
Twitter chat mentioned that they spend anywhere from 40 to
80 percent of their time searching for the right information. The
more time spent locating the right information, the less time
available for analytics and innovation,” he notes.
“Different types of data require different levels of governance,”
Hosangadi continues. “For example, customer data, product
data, and data to be used for financial planning, budgeting, and
forecasting require maximum control and governance, whereas
social network data and unstructured external data, when used
to assess high-level market trends, typically need much less
governance. So while data of all types should be governed in
some way, organizations can be smart about their governance
implementations and invest in the areas of greatest need.”
About IBM
www.ibm.com
IBM offerings for the data warehouse environment include
data warehouse appliances, data exploration tools, a Hadoop
distribution, and an information integration and governance
portfolio that is critical to delivering the best available data to
the warehouse.
As a critical element of Watson™ Foundations, the IBM
big data and analytics platform, InfoSphere Information
Integration and Governance (IIG) provides market-leading
functionality to handle the challenges of big data. It provides
optimal scalability and performance for massive data
volumes, agile and right-sized integration and governance for
the increasing velocity of data, and support and protection for
a wide variety of data types and big data systems. It enables
organizations to have a clear understanding of information, its
history and its meaning, facilitating collaboration between IT
and business.
InfoSphere capabilities include metadata, business glossary,
and policy management; data integration; data quality; master
data management (MDM); data life cycle management; and
data security and privacy. Together, these capabilities help
make data warehousing, big data and analytics projects
successful by delivering business users the confidence to act
on insight. Details are available at ibm.com/software/data/
information-integration-governance.
• Trusted Information
• Information Integration and Governance
tdwi.org
TDWI, a division of 1105 Media, Inc., is the premier provider
of in-depth, high-quality education and research in the
business intelligence, data warehousing, and analytics
industry. TDWI is dedicated to educating business and
information technology professionals about the best practices,
strategies, techniques, and tools required to successfully
design, build, maintain, and enhance business intelligence,
data warehousing, and analytics solutions. TDWI also fosters
the advancement of business intelligence, data warehousing,
and analytics research and contributes to knowledge
transfer and the professional development of its members.
TDWI offers a worldwide membership program, five major
educational conferences, topical educational seminars,
role-based training, on-site courses, certification, solution
provider partnerships, an awards program for best practices,
live Webinars, resourceful publications, an in-depth research
program, and a comprehensive website, tdwi.org.
© 2014 by TDWI (The Data Warehousing Institute™), a division of 1105 Media,
Inc. All rights reserved. Reproductions in whole or in part are prohibited except
by written permission. E-mail requests or feedback to [email protected].
Product and company names mentioned herein may be trademarks and/or
registered trademarks of their respective companies.