DATA WAREHOUSE/BIG DATA

DATA WAREHOUSE/BIG DATA – AN ARCHITECTURAL APPROACH
By W H Inmon and Deborah Arline
© copyright 2014 Forest Rim Technology, all rights reserved
First there was the data warehouse. Then came Big Data. Such was the enthusiasm for Big Data that some of its proponents made the proclamation – "When you have Big Data, you won't need a data warehouse." Indeed there is much confusion and much misunderstanding with regard to Big Data and the data warehouse.
In this paper it will be seen that the data warehouse and Big Data are indeed separate environments and that they are complementary to each other. This paper takes an architectural view.
AN ARCHITECTURAL PERSPECTIVE
In order to understand the complex and symbiotic relationship between the data warehouse and Big Data, some foundational groundwork must be laid. Without that groundwork the final solution will not make much sense.
The starting point is that a data warehouse is an architecture and Big Data is a technology. And as is the case with all technologies and all architectures, there may be some overlap, but a technology and an architecture are essentially different things.
WHAT IS BIG DATA?
A good starting point is – what is Big Data? Fig 1 shows a representation of Big Data.
In Fig 1 we see Big Data. Big Data is technology that is designed to –
- Accommodate very large – almost unlimited – amounts of storage
- Use inexpensive storage for housing the data
- Manage the storage using the "Roman census" method
- Store data in an unstructured manner
There are other definitions of Big Data but for the purpose of this paper this will be our working
definition.
Big Data centers around a technological component known as Hadoop. Hadoop is technology that
satisfies all these conditions of the definition of Big Data. Many vendors have built suites of tools
surrounding Hadoop.
Big Data then is a technology that is useful for the storage and management of large volumes of data.
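As a rough, hypothetical illustration of the "Roman census" style of processing that Hadoop-like technology applies at scale, the short Python sketch below ships a counting function out to each partition of data and brings back only the small partial results, rather than moving the data to a central processor. The partitions and record layout are invented purely for illustration.

# A minimal sketch of "Roman census" processing: the work is shipped to the
# data (each partition is counted where it sits) and only the small partial
# results travel back to be combined. This mimics the map/reduce pattern that
# Hadoop-style technology implements at scale; the partitions here are just
# in-memory lists for illustration.
from collections import Counter

# Hypothetical partitions of call records spread across cheap storage nodes.
partitions = [
    [{"caller": "303-555-0101", "minutes": 4}, {"caller": "303-555-0188", "minutes": 11}],
    [{"caller": "720-555-0144", "minutes": 2}, {"caller": "303-555-0101", "minutes": 7}],
]

def local_census(partition):
    """Run at the data: summarize one partition without moving its records."""
    calls = Counter()
    minutes = Counter()
    for record in partition:
        calls[record["caller"]] += 1
        minutes[record["caller"]] += record["minutes"]
    return calls, minutes

# Only the small summaries are brought together ("back to Rome").
total_calls, total_minutes = Counter(), Counter()
for calls, minutes in map(local_census, partitions):
    total_calls.update(calls)
    total_minutes.update(minutes)

print(total_calls)    # Counter({'303-555-0101': 2, ...})
print(total_minutes)  # Counter({'303-555-0101': 11, ...})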
WHAT IS A DATA WAREHOUSE?
So what is a data warehouse? A data warehouse is a structure of data where there is a single version of
the truth. The essence of a data warehouse is the integrity of data contained inside the data warehouse.
When an executive wants information that can be believed and trusted, the executive turns to a data
warehouse. Data warehouses contain detailed historical data. Data in a data warehouse is typically
integrated, where the data comes from different sources.
The definition of a data warehouse was established from the very beginning. A data warehouse is a –
- Subject oriented
- Integrated
- Non volatile
- Time variant
collection of data in support of management's decisions.
Fig 2 depicts a representation of a data warehouse.
In order to achieve the integrity of data that is central to a data warehouse, a data warehouse typically
has a carefully constructed infrastructure, where data is edited, calculated, tested, and transformed
before it enters the data warehouse. Because data going into the data warehouse comes from multiple
sources, data typically passes through a process known as ETL (extract/transform/load).
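As a minimal, hypothetical sketch of what the ETL step does, the Python fragment below extracts rows from two invented source systems, edits and transforms them into one agreed format, and loads them into a single warehouse table. The source layouts, conversion rate, and table design are assumptions made only to keep the example concrete; a real ETL infrastructure does far more editing, calculating, and testing than this.

# A minimal, hypothetical ETL sketch: two source systems encode the same
# facts differently, so the data is edited and transformed into one agreed
# format before it is loaded into the warehouse table.
import sqlite3
from datetime import datetime

# Extract: rows as they arrive from two (invented) source systems.
source_a = [{"cust": "C001", "sale_usd": "125.50", "date": "2014-03-01"}]
source_b = [{"customer_id": "c1", "amount_eur": 80.0, "sold_on": "01/03/2014"}]

EUR_TO_USD = 1.37  # assumed conversion rate, for illustration only

def transform_a(row):
    # Source A already uses the warehouse conventions; just check and cast.
    return (row["cust"].upper(), float(row["sale_usd"]), row["date"])

def transform_b(row):
    # Source B uses different names, a different currency, and a different date format.
    date = datetime.strptime(row["sold_on"], "%d/%m/%Y").strftime("%Y-%m-%d")
    return (row["customer_id"].upper(), round(row["amount_eur"] * EUR_TO_USD, 2), date)

# Load: one integrated table, a single version of the truth.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (customer_id TEXT, amount_usd REAL, sale_date TEXT)")
rows = [transform_a(r) for r in source_a] + [transform_b(r) for r in source_b]
warehouse.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

print(warehouse.execute("SELECT * FROM sales ORDER BY sale_date").fetchall())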
OVERLAP
From a foundational standpoint, how much overlap is there between a data warehouse and Big Data?
The answer is that there is actually very little overlap between a data warehouse and Big Data. Fig 3
shows the overlap.
In Fig 3 it is seen that sometimes a data warehouse contains a reasonably large amount of data. And of
course, Big Data can certainly accommodate a reasonably large amount of data. So there is some
overlap between a data warehouse and Big Data. But what is remarkable about the overlap between the two entities is how little overlap there really is.
Another way to look at the overlap between a data warehouse and Big Data is seen in Fig 4.
data warehouse and no Big Data – yes
data warehouse and Big Data – yes
Big Data and no data warehouse – yes
Big Data and data warehouse – yes
Fig 4 shows that there is no necessary overlap between a data warehouse and Big Data. A data warehouse and Big Data can each exist completely independently of the other.
A NON TRADITIONAL VIEW
In order to understand how Big Data and a data warehouse interface, it is necessary to look at Big Data
in a non traditional way. There are indeed many different ways that Big Data can be analyzed. The way
suggested here is only one of many ways.
One way that Big Data can be sub divided is in terms of repetitive data and non repetitive data.
Fig 5 notionally shows this sub division.
Repetitive unstructured data is data that occurs very frequently and has the same structure and oftentimes the same content. There are many examples of repetitive unstructured data. One example of
repetitive unstructured data is the record of phone calls, where the length of the call, the date of the
call, the caller and the callee are noted. Another example of repetitive unstructured data is metering
data. Metering data is data that is gathered each week or month where there is a register of the activity
or usage of energy at a particular location. In metering data there is a metered amount, an account
number, and a date. And there are many, many occurrences of metering records. Another type of
repetitive data is oil and gas exploration data.
There are in fact many examples of repetitive Big Data.
The other type of Big Data is non repetitive unstructured data. With non repetitive unstructured data
there often are many records of data. But each record of data is unique in terms of structure and
content. If any two non repetitive unstructured records are similar in terms of structure and content, it
is an accident.
There are many forms of non repetitive unstructured data. There are emails, where one email is very
different from the next email in the queue. Or there are call center records, where a customer interacts
with an operator representing a company. There are telephone conversations, sales calls, litigation
records, and many, many different types of non repetitive unstructured data.
So Big Data can be divided into this simple classification of data – repetitive unstructured data and non
repetitive unstructured data.
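A tiny, hypothetical sketch makes the distinction concrete: repetitive records all share one simple structure, while non repetitive records (here, a pair of email bodies) share essentially nothing in structure from one record to the next. The records and the crude structural check are invented for illustration only.

# Repetitive unstructured data: many records, all with the same simple shape.
call_records = [
    {"caller": "303-555-0101", "callee": "720-555-0144", "date": "2014-03-01", "minutes": 4},
    {"caller": "303-555-0188", "callee": "303-555-0101", "date": "2014-03-01", "minutes": 11},
]

# Non repetitive unstructured data: each record is unique in structure and content.
emails = [
    "Hi Joan - the Q2 contract renewal is attached, please sign by Friday.",
    "Reminder: maintenance window Saturday 2am MT, the portal will be down.",
]

# A crude check of "repetitiveness": do all records expose the same fields?
def looks_repetitive(records):
    shapes = {tuple(sorted(r)) for r in records if isinstance(r, dict)}
    return len(shapes) == 1 and len(records) > 1

print(looks_repetitive(call_records))  # True  - one shared structure
print(looks_repetitive(emails))        # False - free text, no shared fields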
Admittedly there are many different ways to sub divide Big Data. But for the purpose of defining the
relationship between a data warehouse and Big Data, this division is the one we will use.
CONTEXT
When dealing with data – any data – it is useful to consider the context of the data. Indeed, using data
where the context is unknown is a dangerous thing to do.
An important point to be made is that for repetitive unstructured data, identifying the context of data
comes very easily and naturally. Consider the simple diagram seen in Fig 6.
Fig 6 shows that there are many records in a repetitive Big Data environment. But the records are essentially simple records, and the context and meaning of each record is clear. Because repetitive unstructured records are so simple, determining context in a repetitive environment is a very easy and natural thing to do.
Now consider context in the non repetitive environment. There is plenty of context to be found in the non repetitive unstructured environment. The problem is that the context is embedded in the document itself. Sometimes context is buried in the text of the document. Sometimes context is inferred from the external characterization of the document. Context is found in a million different places and in a million different ways in the non repetitive unstructured environment.
In order to derive the context inherent to a non repetitive unstructured document, it is necessary to use
technology known as “textual disambiguation” (or “textual ETL”.) Fig 7 shows that textual
disambiguation is used to derive context from non repetitive unstructured data.
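The sketch below is only a rough, hypothetical illustration of the kind of work textual disambiguation performs – it is not Forest Rim's textual ETL – but it shows the essential move: reading free-form text and emitting structured rows in which each value is tied to an explicitly named piece of context. The call center notes and the patterns are invented for the example.

# A rough sketch of deriving context from non repetitive text: simple patterns
# pull values out of free-form call center notes and attach a named context
# to each value, producing structured rows a data base (or analyst) can use.
import re

notes = [
    "Customer Ann Ramirez called about invoice 88213; promised refund of $45.00 by 2014-03-07.",
    "Follow-up on account 55102 - caller upset about outage, no refund discussed.",
]

# Hypothetical patterns; real textual disambiguation relies on taxonomies,
# proximity analysis, and far richer rules than these.
patterns = {
    "invoice_number": r"invoice (\d+)",
    "account_number": r"account (\d+)",
    "refund_amount":  r"\$(\d+\.\d{2})",
    "date":           r"(\d{4}-\d{2}-\d{2})",
}

contextualized = []
for doc_id, text in enumerate(notes):
    for context_name, pattern in patterns.items():
        for value in re.findall(pattern, text):
            contextualized.append({"doc": doc_id, "context": context_name, "value": value})

for row in contextualized:
    print(row)
# {'doc': 0, 'context': 'invoice_number', 'value': '88213'} ... and so on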
ANALYTIC PROCESSING
So how is analytic processing done from Big Data? There are several ways that analytic processing can
be done. One way is through simple search technology. This approach is seen in Fig 8.
In Fig 8 it is seen that simple search technology works well on repetitive unstructured Big Data, where context is obvious and easily derived. The problem is that simple search does not work well in the face of non repetitive unstructured data, because in order for simple search to work well there must be simple and obvious context in the data the search processing is operating on.
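A hypothetical fragment makes the point. Searching repetitive records is easy because every record exposes the same obvious fields; the same simple search applied to raw non repetitive text has no context to hold on to. The records below are invented for illustration.

# Simple search over repetitive data: context is obvious, so the query is easy.
call_records = [
    {"caller": "303-555-0101", "minutes": 4},
    {"caller": "303-555-0188", "minutes": 45},
]
long_calls = [r for r in call_records if r["minutes"] > 30]
print(long_calls)  # the 45 minute call, found through a well defined field

# The same simple search against raw non repetitive text falls apart: a
# keyword match for "45" cannot tell minutes from dollars from a policy term.
notes = [
    "Refund of $45.00 approved for the customer.",
    "Caller asked about the 45 day return policy.",
]
print([n for n in notes if "45" in n])  # both match - no usable context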
But it is possible to use textual disambiguation to derive context from non repetitive unstructured data
and then to replace the data back into the Big Data environment. In this case it is said that the Big Data
environment has been “context enriched”. Fig 9 shows this enrichment.
In Fig 9 it is seen that non repetitive unstructured data is read and passed through textual disambiguation. The output is then placed back into the Big Data environment in a "context enriched" state. Once the data is back in Big Data in its context enriched state, a simple search tool can be used to analyze it.
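Continuing the hypothetical sketch above, context enrichment can be pictured as writing the derived context back alongside the original text, after which a simple search against the named context succeeds where a raw keyword search could not. The enrichment rule shown is an assumption made only to illustrate the flow.

# A sketch of "context enrichment": the raw non repetitive record is kept, but
# the context derived by textual disambiguation is stored alongside it, so a
# simple search can now ask questions the raw text alone could not answer.
import re

raw_big_data = [
    {"text": "Customer promised refund of $45.00 on invoice 88213."},
    {"text": "Caller asked about the 45 day return policy."},
]

def enrich(record):
    context = {}
    amount = re.search(r"\$(\d+\.\d{2})", record["text"])
    if amount:
        context["refund_amount"] = float(amount.group(1))
    return {**record, "context": context}

enriched_big_data = [enrich(r) for r in raw_big_data]

# Simple search against the derived context: only the genuine refund is found,
# whereas a naive search for "45" in the raw text would match both records.
refunds = [r for r in enriched_big_data if r["context"].get("refund_amount", 0) > 40]
print(refunds)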
THE DATA WAREHOUSE/BIG DATA INTERFACE
The actual interface between data warehouse and Big Data is seen in Fig 10.
Fig 10 – raw Big Data divides into repetitive and non repetitive data. Repetitive data feeds direct analysis, simple search, and distillation into the classical data warehouse. Non repetitive data passes through textual disambiguation into either an unstructured data base or context enriched Big Data, supporting combined analysis, analysis of unstructured contextualized data, and simple search analysis of enriched Big Data.
In Fig 10 it is seen that raw Big Data can be divided into repetitive data and non repetitive data, as has
been discussed. Repetitive data can be directly analyzed or can be searched by a simple search tool. Non
repetitive data is accessed by textual disambiguation. When non repetitive data passes through textual
disambiguation, the context of the data is derived. Once the context has been derived, the output can
be placed either in a standard data base format or into an enriched Big Data environment. If data is placed in a data base format, the data can be easily accessed and analyzed in conjunction with existing
data warehouse data.
In addition, repetitive data can be “distilled” and placed into a standard data base if desired.
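As one last hypothetical sketch, "distilling" repetitive data might look like the fragment below: many detailed metering records are reduced to a few summary rows that fit naturally into a standard data base table alongside data warehouse data. The records and the summary chosen are invented for illustration.

# A sketch of distillation: a large volume of repetitive records is reduced to
# compact summary rows suitable for a standard data base table.
import sqlite3
from collections import defaultdict

meter_readings = [
    {"account": "A-100", "month": "2014-02", "kwh": 410},
    {"account": "A-100", "month": "2014-03", "kwh": 388},
    {"account": "B-221", "month": "2014-03", "kwh": 902},
]

# Distill: total usage per account.
totals = defaultdict(int)
for reading in meter_readings:
    totals[reading["account"]] += reading["kwh"]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE usage_summary (account TEXT, total_kwh INTEGER)")
db.executemany("INSERT INTO usage_summary VALUES (?, ?)", totals.items())
print(db.execute("SELECT * FROM usage_summary").fetchall())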
One interesting feature of this diagram is that the different kinds of analysis that are done throughout
the environment are quite different. The type of analysis that is done is profoundly shaped by the data
that is available for analysis.
Forest Rim Technology is located in Castle Rock, CO. Forest Rim Technology produces textual ETL, a
technology that allows unstructured text to be disambiguated and placed into a standard data base
where it can be analyzed.
Forest Rim Technology was founded by Bill Inmon.
Deborah Arline is……………………………….