Data X-Ray: A Diagnostic Tool for Data Errors

Xiaolan Wang (University of Massachusetts, Amherst, MA, USA), [email protected]
Xin Luna Dong (Google Inc., Mountain View, CA, USA), [email protected]
Alexandra Meliou (University of Massachusetts, Amherst, MA, USA), [email protected]
ABSTRACT
Many systems and applications are data-driven, and the correctness
of their operation relies heavily on the correctness of their data.
While existing data cleaning techniques can be quite effective at
purging datasets of errors, they disregard the fact that many errors
are systematic, inherent to the process that produces the data, and
thus will keep occurring unless the problem is corrected at its source.
In contrast to traditional data cleaning, in this paper we focus on
data diagnosis: explaining where and how the errors happen in a
data generative process.
We develop a large-scale diagnostic framework called DataXRay.
Our contributions are three-fold. First, we transform the diagnosis problem to the problem of finding common properties among
erroneous elements, with minimal domain-specific assumptions.
Second, we use Bayesian analysis to derive a cost model that implements three intuitive principles of good diagnoses. Third, we
design an efficient, highly-parallelizable algorithm for performing
data diagnosis on large-scale data. We evaluate our cost model and
algorithm using both real-world and synthetic data, and show that
our diagnostic framework produces better diagnoses and is orders
of magnitude more efficient than existing techniques.
1. INTRODUCTION
Systems and applications rely heavily on data, which makes data
quality a critical factor for their function. Data management
research has long recognized the importance of data quality, and has
developed an extensive arsenal of data cleaning approaches based
on rules, statistics, analyses, and interactive tools [1, 25, 37, 53, 54].
While existing data cleaning techniques can be quite effective at
purging datasets of errors, they disregard the fact that many errors
are systematic, inherent to the process that produces the data, and
thus will keep occurring unless the problems are corrected at their
source. For example, a faulty sensor will keep producing wrong
measurements, a program bug will keep generating re-occurring
mistakes in simulation results, and a bad extraction pattern will
continue to derive incorrect relations from web documents.
In this paper, we propose a data diagnostic tool that helps data
producers identify the possible systematic causes of errors in the
data. This is in contrast to traditional data cleaning methods, which
treat the symptom (the errors in a dataset) rather than the underlying
condition (the cause of the errors). Since finding particular causes is
often domain specific, we instead aim to provide a generic approach
that finds groupings of errors that may be due to the same cause;
such groupings give clues for discerning the underlying problems.
We call our tool DataXRay: just as a medical X-ray can facilitate
(but not in itself give) the diagnosis of medical conditions, our tool
shows the inherent relationship between errors and helps diagnose
their causes. We use examples from three different domains to
illustrate how DataXRay achieves this goal.
EXAMPLE 1 (KNOWLEDGE EXTRACTION). Web-scale knowledge bases [20, 24, 36] often contain a large number of errors, resulting both from mistakes, omissions, or oversights in the knowledge
extraction process, and from erroneous, out-of-date information
from the web sources. Existing knowledge curation techniques focus on finding correct knowledge in the form of (subject, predicate,
object) triples by analyzing extractions from multiple knowledge
extractors [20, 21]; however, they do not offer any insight on why
such errors occur and how they can be prevented in the future.
We applied DataXRay on 2 billion knowledge triples from
Knowledge Vault [20], a web-scale probabilistic knowledge base
that continuously augments its content through extraction of web
information, such as text, tables, page structure, and human annotations. Our tool returns groupings of erroneous triples that may
be caused by the same systematic error. Among the many different
types of errors that DataXRay reports, we give three examples.
Annotation errors: DataXRay reports a grouping of about 600
knowledge triples from besoccer.com with object “Feb 18, 1986”,
which were extracted from webmaster annotations according
to schema.org; among them the error rate is 100% (all triples
conflict with the real world). A manual examination of three to five
instances quickly revealed that the webmaster used “2/18/1986”
to annotate the date of birth for all soccer players, possibly by
copying HTML segments.
Reconciliation errors: DataXRay reports a grouping of about
700,000 triples, extracted by a particular extractor from various
websites, with object “baseball coach”; among them the error
rate is 90%. Manual investigation showed that the mistakes
resulted from reconciliation errors: all coaches were reconciled
to baseball coaches.
Extraction errors: DataXRay reports a grouping of about 2 million triples, extracted by a particular extractor from various websites, with predicates containing "olympics"; among them the
error rate is 95%. Manual investigation showed that the mistakes
resulted from an over-generalized extraction pattern that extracts
all sports games as olympic games.
These three examples illustrate that the errors exposed by our tool
cover a variety of problems, from data errors to extraction errors.
Some of these problems can be very hard to detect; for example, web
annotations are typically not visible from webpages. Even though
DataXRay does not report the causes directly, it guides diagnosis
by reporting sets of errors that are likely due to the same cause.
EXAMPLE 2 (WIRELESS PACKET LOSS). A wireless sensor
network experiences significant packet loss. DataXRay reports
a grouping of messages containing the range of node IDs 10–15
as destinations, where the message drop rate is very high. Manual
investigation unveils that nodes 10–15 are all located on a side of
the building with poor connectivity due to interference.
EXAMPLE 3 (TRAFFIC INCIDENTS). We use DataXRay to
analyze traffic incident and weather data collected by the US Department of Transportation on multiple freeways in Portland, OR, over
two months [59]. Our algorithm uses the reported traffic incidents
as “error” labels, and automatically derives that surface water level
of more than 2cm is likely a cause of accidents.
Diagnosing errors in big data environments raises three major
challenges that make existing techniques, such as provenance analysis, feature selection, and causal analysis, not applicable.
Massive scale. Many applications, such as knowledge extraction
and sensing, continuously produce huge volumes of data. A sampling that results in a manageable data size often loses statistical
strength. Algorithms working with data of this size require linear
time complexity and the ability to process data in parallel. Existing techniques such as feature selection [48, 57] cannot handle this
increased scale, because they are not easy to implement in shared-nothing architectures [42].
System complexity. Data generative processes, such as knowledge
extractors, often implement a large number of algorithms, invoke
external tools, set different parameters, and so on. As a result, it is
not feasible to analyze them and reason with them directly. Existing
provenance techniques [15, 19] are not well-suited for this level of
complexity, and most existing tools on data auditing focus on much
simpler, relational settings [31, 55].
High error rates. In some applications such as knowledge extraction from the web, the error rate can be as high as 70% [21]. This
disqualifies causal analysis techniques [44, 45], whose premise relies on the assumption that errors, and generally observations that
require explanation, are rare.
DataXRay addresses these challenges by analyzing the relationship between erroneous instances in an efficient and scalable
manner. More concretely, we make the following contributions.
• We abstract the processes that derive data using a hierarchical
structure of features. Each feature corresponds to a subset of
data properties and each cause of errors can be considered to be
associated with a feature. We then transform the problem of error
diagnosis to the problem of finding the features that best represent
erroneous elements. This transformation enforces minimal assumptions, can model a large range of application scenarios, and
allows for efficient exploration of possible diagnoses (Section 2).
• We apply Bayesian analysis to estimate the causal likelihood of a
set of features being associated with the causes of the errors, and
use that to determine the most likely diagnosis for a given set of
errors. We identify three intuitive principles for good diagnoses:
conciseness (simpler diagnoses are preferable), specificity (each
diagnosis should be closely associated with the real cause), and
consistency (diagnoses should not be contradicted by a lot of correct data). We design a cost model that captures these principles
and can be evaluated efficiently (Section 3).
• We exploit the hierarchical structure of features and propose a
top-down iterative algorithm with linear time complexity that
evaluates possible diagnoses from broader to more concise using
our cost model. We then extend our algorithm to a parallel,
MapReduce-based version (Section 4).
• Our evaluation includes three phases of experiments. First, we
evaluate our cost model on real-world extraction data and demonstrate that it is significantly more effective at deriving diagnoses
than other feature selection methods, including logistic regression.
Second, we show that our algorithm is orders of magnitude more
efficient than other feature selection methods. Third, we present
experiments on synthetic data demonstrating that our approach
can scale effectively to very large data sizes (Section 5).
2. DATA MODEL ABSTRACTIONS
In this section, we introduce a running example motivated by
knowledge extraction, and describe a model abstraction to formalize the problem of diagnosing errors in this setting. Although we
focus on knowledge extraction as the driving application for our
framework, our approach can easily adapt to general settings.
EXAMPLE 4. Figure 1a depicts two example web tables that
reside on the same wiki page, containing information about musicians. Figure 1b depicts the knowledge triples extracted from these
web tables using an extraction system. For example, the triple (P.
Fontaine, DoB, c.1380) represents the information that the date of
birth of P. Fontaine is c. 1380.
Some of the extracted triples are incorrect, and are highlighted in
the table of Figure 1b (t5 , t9 , and t12 ). While traditional cleaning
techniques may simply remove these triples from the knowledge base,
or further provide a list of such triples as feedback, our objective is
to help diagnose the problem and understand the reasons for these
errors. In this case, the reason for the incorrect results is that the
extractors assign a default value (“01/01/1900”) to unknown dates.
In this work, we assume that we know which extracted triples
are incorrect. Such labels can be obtained by existing cleaning and
classification techniques [21, 25, 26]. They may also occur naturally
in the data, such as extraction confidence from the extraction systems, or the occurrence of accidents in Example 3. Our goal is not
to identify the errors, but to reveal common properties of the errors
that may be helpful in diagnosing the underlying causes.
2.1 The element-feature model
We observe that a cause of errors is often associated with some
properties of the erroneous instances and causes a high error rate
for data with these properties. In Example 4, the three erroneous
triples are caused by using 01/01/1900 as the default value when a
date is unknown. Indeed, they share a common property that their
objects are all 01/01/1900. By highlighting the observation that the
error rate is high (1.0 in our example) for triples with “object value:
01/01/1900”, we can help users diagnose this possible cause. As
another example, imagine a high error rate for triples extracted from
Tbl #1 (Figure 1b) where the objects are of the Date type. It suggests
that the date format in that table may not be captured properly by
the extractors. Surfacing such an observation for triples with "source
tableID: Tbl #1" and "object type: Date" can help the diagnosis.
Based on this intuition, we define the element-feature model,
where we consider each data instance as an element, and capture its
properties by a set of property values. We then use a subset of property values, which we call a feature, to capture a possible cause of
errors. Features can be derived from data using their schema, types,
and values, as well as from provenance metadata. Features form a
Musicians – Table 1                                    Composers – Table 2
Name         Date of Birth   Date of Death             Name         Date of Birth   Date of Death
P. Fontaine  c.1380          c.1450                    G. Legrant   fl.1405         N/A
J. Vide      unknown         1433                      H. Lantins   fl.c.1420       unknown

(a) Two web tables with information about musicians, that appear on the same wiki page.

ID   knowledge triple                     source URL  source tableID  subject type  subject instance  predicate type  predicate instance  object type  object instance
t1   {P. Fontaine, Profession, Musician}  wiki        tbl #1          People        P. Fontaine       Bio             Profession          Profession   Musician
t2   {P. Fontaine, DoB, c.1380}           wiki        tbl #1          People        P. Fontaine       Bio             DoB                 Date         c.1380
t3   {P. Fontaine, DoD, c.1450}           wiki        tbl #1          People        P. Fontaine       Bio             DoD                 Date         c.1450
t4   {J. Vide, Profession, Musician}      wiki        tbl #1          People        J. Vide           Bio             Profession          Profession   Musician
t5   {J. Vide, DoB, 01/01/1900}           wiki        tbl #1          People        J. Vide           Bio             DoB                 Date         01/01/1900   (incorrect)
t6   {J. Vide, DoD, 1433}                 wiki        tbl #1          People        J. Vide           Bio             DoD                 Date         1433
t7   {G. Legrant, Profession, Composer}   wiki        tbl #2          People        G. Legrant        Bio             Profession          Profession   Composer
t8   {G. Legrant, DoB, fl.1405}           wiki        tbl #2          People        G. Legrant        Bio             DoB                 Date         fl.1405
t9   {G. Legrant, DoD, 01/01/1900}        wiki        tbl #2          People        G. Legrant        Bio             DoD                 Date         01/01/1900   (incorrect)
t10  {H. Lantins, Profession, Composer}   wiki        tbl #2          People        H. Lantins        Bio             Profession          Profession   Composer
t11  {H. Lantins, DoB, fl.c.1420}         wiki        tbl #2          People        H. Lantins        Bio             DoB                 Date         fl.c.1420
t12  {H. Lantins, DoD, 01/01/1900}        wiki        tbl #2          People        H. Lantins        Bio             DoD                 Date         01/01/1900   (incorrect)

(b) Knowledge triples extracted from the web tables in (a) and values of their properties in all dimensions. Incorrect triples are marked.

Figure 1: Error diagnosis in information extraction motivated by Example 1. An extraction pipeline processes the web tables in (a) and derives 12 knowledge triples (elements). Each element has four property dimensions with different granularity levels. The extractors assign a default value to dates that are unknown ("01/01/1900"), leading to three incorrect triples (t5, t9, and t12).
natural hierarchy based on containment relationships; for example,
the feature with object_type=Date contains, and thus is an ancestor
of, the feature with object_type=Date ∧ object_instance=01/01/1900.
The problem of error diagnosis then reduces to the problem of finding a set of features that best summarizes all erroneous data elements.
We next define the terms used in this model.
Property dimension: A property dimension describes one aspect of a
data instance. For our example dataset, there can be four dimensions:
source, subject, predicate, and object. Without losing generality, we
consider a certain ordering of the dimensions.
Property hierarchy: Recall that features in each dimension form a
hierarchy. In our example, the source dimension has three levels
in the hierarchy: Root, source_URL, source_tableID. Each other
dimension also has three levels; taking subject as an example, the
hierarchy is Root, subject_type, subject_instance.
Accordingly, we define the property hierarchy as follows. The
root of the hierarchy represents the coarsest granularity and each
dimension has value ALL for the root level (Root). Descendant
properties are finer-granularity representations of their ancestors;
we say property A is a child of property B if one of the features
of A is a child of the corresponding feature of B. For example,
property {ALL, ALL, ALL, (object_instance, 01/01/1900)} is a child
of property {ALL, ALL, ALL, (object_type, Date)}. As we show later,
the hierarchy will allow us to solve the problem efficiently, using an
algorithm that explores it in a top-down fashion (Section 4).
Property vector: With property dimensions and hierarchies, we can
use a property vector to represent a data instance or a set of instances
that share common properties. The vector contains one (property,
value) pair for each dimension, where the property is in the hierarchy
of that dimension, and the value is for the particular property. The
root-level property corresponds to the pair (Root, ALL), but we write
just ALL for short. For example:
• {ALL, ALL, ALL, (object_instance, 01/01/1900)} represents all
triples with object 01/01/1900.
• {(source_tableID, Tbl#1), ALL, ALL, (object_type, Date)} represents all triples from Tbl#1 of the particular wiki page with
objects of the Date type.
Element: For each data instance (triple) we define an element to capture its truthfulness and property vector; the vector should contain a
value for the leaf-level property for every dimension.
DEFINITION 5 (ELEMENT). Consider a dataset with m property dimensions. A data instance is an element e = (V, P), where
• V is true if the instance is correct and false otherwise;
• P = {d1, . . . , dm} is a property vector, where di, i ∈ [1, m],
is a (property, value) pair for the leaf property of the i-th
dimension.
Figure 2a presents a subset of the elements that correspond to the
triples in Figure 1b.
Feature: Each property vector defines a set of triples that share a set
of properties; we call it a feature.
DEFINITION 6 (FEATURE). Consider a dataset with m property dimensions. A feature f is a pair f = (P, E), where
• P = {d1, . . . , dm} is a property vector, where di, i ∈ [1, m],
is a (property, value) pair for a property in the hierarchy of
the i-th dimension.
• E is the set of elements with the properties represented by P .
Figure 2b shows some example features for Example 4. As an
example, feature f6 represents all triples whose object is 01/01/1900;
elements e5 , e9 , and e12 carry this feature.
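To make the element-feature model concrete, here is a minimal Python sketch of one possible encoding, with elements carrying per-dimension property chains and a feature collecting the elements that match its property vector; the class and helper names are illustrative and not taken from the paper's implementation.

```python
from dataclasses import dataclass, field

ALL = ("Root", "ALL")   # root-level (property, value) pair, abbreviated ALL in the text

@dataclass
class Element:
    V: bool    # True if the data instance is correct, False otherwise
    P: tuple   # per dimension: a list of (property, value) pairs from coarsest to leaf level

@dataclass
class Feature:
    P: tuple                                 # one (property, value) pair per dimension; ALL if unconstrained
    E: list = field(default_factory=list)    # elements that carry this feature

def carries(element: Element, feature: Feature) -> bool:
    """An element carries a feature if, in every dimension, the feature's pair
    is ALL or appears somewhere along the element's property chain."""
    return all(fp == ALL or fp in chain for chain, fp in zip(element.P, feature.P))

# Element e5 of Figure 2a, with the full property chain of each dimension:
e5 = Element(V=False, P=(
    [("source_URL", "wiki"), ("source_tableID", "tbl #1")],
    [("subj_type", "People"), ("subj_instance", "J. Vide")],
    [("pred_type", "Bio"), ("pred_instance", "DoB")],
    [("obj_type", "Date"), ("obj_instance", "01/01/1900")],
))

# Feature f6 of Figure 2b: all triples whose object instance is 01/01/1900.
f6 = Feature(P=(ALL, ALL, ALL, ("obj_instance", "01/01/1900")))
assert carries(e5, f6)   # e5 is one of the elements {e5, e9, e12} of f6
```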
Problem definition.
We now formalize the problem of deriving diagnoses for data
errors using the element-feature model. Each feature identifies a
possible cause of error, and a diagnosis is a set of features that
collectively explain the causes of the observed errors.
Element  V      Property Vector
e1       true   {(source_tableID, tbl #1), (subj_instance, P. Fontaine), (pred_instance, Profession), (obj_instance, Musician)}
e2       true   {(source_tableID, tbl #1), (subj_instance, P. Fontaine), (pred_instance, DoB), (obj_instance, c.1380)}
e3       true   {(source_tableID, tbl #1), (subj_instance, P. Fontaine), (pred_instance, DoD), (obj_instance, c.1450)}
e4       true   {(source_tableID, tbl #1), (subj_instance, J. Vide), (pred_instance, Profession), (obj_instance, Musician)}
e5       false  {(source_tableID, tbl #1), (subj_instance, J. Vide), (pred_instance, DoB), (obj_instance, 01/01/1900)}
···      ···    ···

(a) List of Elements: The triples of Figure 1b, represented in the element format. The truthfulness value (V) of incorrect elements is false.

         Feature  Property vector                                              Structure vector  List of elements
Level 0  f0       {ALL, ALL, ALL, ALL}                                         {0, 0, 0, 0}      {e1, e2, e3, e4, e5, e6, e7, e8, e9, e10, e11, e12}
Level 1  f1       {(source_URL, wiki), ALL, ALL, ALL}                          {1, 0, 0, 0}      {e1, e2, e3, e4, e5, e6, e7, e8, e9, e10, e11, e12}
         f2       {ALL, (subj_type, People), ALL, ALL}                         {0, 1, 0, 0}      {e1, e2, e3, e4, e5, e6, e7, e8, e9, e10, e11, e12}
         f3       {ALL, ALL, (pred_type, Bio), ALL}                            {0, 0, 1, 0}      {e1, e2, e3, e4, e5, e6, e7, e8, e9, e10, e11, e12}
         f4       {ALL, ALL, ALL, (obj_type, Profession)}                      {0, 0, 0, 1}      {e1, e4, e7, e10}
         f5       {ALL, ALL, ALL, (obj_type, Date)}                            {0, 0, 0, 1}      {e2, e3, e5, e6, e8, e9, e11, e12}
Level 2  f6       {ALL, ALL, ALL, (obj_instance, 01/01/1900)}                  {0, 0, 0, 2}      {e5, e9, e12}
         f7       {ALL, ALL, ALL, (obj_instance, c.1380)}                      {0, 0, 0, 2}      {e2}
         ···      ···                                                          ···               ···
Level 4  f8       {(source_tableID, tbl #1), ALL, (pred_instance, DoB), ALL}   {2, 0, 2, 0}      {e2, e5}
         f9       {(source_tableID, tbl #2), ALL, (pred_instance, DoD), ALL}   {2, 0, 2, 0}      {e9, e12}
         ···      ···                                                          ···               ···

(b) List of Features: Candidate reasons of extraction errors in the feature format. The incorrect elements in each feature's element list are e5, e9, and e12. Structure vector and feature level are defined in Section 4.

Figure 2: The Element-Feature model is a more efficient representation of the data instances and the possible error causes that takes advantage of the hierarchical relationship of the extraction properties.
DEFINITION 7 (OPTIMAL DIAGNOSIS). Given a dataset of
elements E = {e1, ..., en} and a cost function c, the optimal diagnosis is the set of features F = {f1, ..., fk} such that
• ∀ei ∈ E with ei.V = false, ∃fj ∈ F such that ei ∈ fj.E;
• c(F) is minimized.
The first condition of the definition requires that the diagnosis
explains all errors, while the second condition requires that the
diagnosis is optimal with respect to a cost function c. Note again
that the output is a set of features that are associated with the causes,
instead of the causes themselves.
In the rest of the paper we tackle two challenges: (1) how to
derive an appropriate cost function, and (2) how to solve the optimal
diagnosis problem efficiently.
Overview: cost model (Section 3).
We start by using Bayesian analysis to derive the set of features
with the highest probability of being associated with the causes for
the mistakes in the dataset. We derive our cost function from the
Bayesian estimate: the lowest cost corresponds to the highest a
posteriori probability that the selected features are the real causes
for the errors. The resulting cost function contains three types of
penalties, which capture the following three intuitions:
Conciseness: Simpler diagnoses with fewer features are preferable.
Specificity: Each feature should have a high error rate.
Consistency: Diagnoses should not include many correct elements.
We propose an additive cost function based on these three penalties
that approximates the Bayesian estimate and is efficient to compute.
Overview: diagnostic algorithm (Section 4).
We can prove that finding the set of features with the minimal
cost is NP-complete. We design a top-down, iterative algorithm
with linear-time complexity. Our algorithm traverses the feature
hierarchy from coarser to finer granularity features. It uses local
stopping conditions to decide whether to accept the current feature or
explore deeper. We extend our algorithm to a parallel, MapReduce-based version that is effective at large-scale diagnosis tasks.
3. DIAGNOSTIC FRAMEWORK
The core of our diagnostic framework is the cost model used to
determine the optimal diagnosis (Definition 7). In this section, we
focus on deriving a cost function that is effective in identifying good
diagnoses and that can be computed efficiently. We start by using
Bayesian analysis to compute the probability that a set of features
is the cause of the mistakes in the dataset (Section 3.1). Then, we
propose a cost function that approximates the Bayesian analysis
efficiently, through simple, additive penalty functions (Section 3.2).
Finally, we show that deriving the optimal diagnosis is NP-complete,
which further motivates the need for efficient algorithms with parallelization potential (Section 3.3).
3.1 Bayesian estimate of causal likelihood
Given a set of elements E = {e1 , . . . , en } and their correctness,
we wish to estimate the probability Pr(F|E) that a set of features
F = {f1 , . . . , fk } is the cause for the incorrect data instances in E.
From Bayesian inference, the a posteriori probability Pr(F|E) is
proportional to the likelihood Pr(E|F) times the prior Pr(F):

    Pr(F|E) ∝ Pr(E|F) Pr(F)    (1)
We assume that mistakes represented by features are independent. This assumption is reasonable because even for related features, the associated causes can still be independent. For example,
feature f6 = {ALL, ALL, ALL, (obj_instance, 01/01/1900)} is
subsumed by feature f5 = {ALL, ALL, ALL, (obj_type, Date)};
however, f6 can be associated with the cause of incorrectly assigning the default value 01/01/1900, while f5 can be associated with
the cause of inability of an extractor to parse dates, and the two
causes are independent. Using this independence assumption, we
can express the prior Pr(F) as follows.
    Pr(F) = ∏_{fi ∈ F} Pr(fi)    (2)

We further use α to denote the a priori probability that a feature is a cause (Pr(fi) = α).¹ Then, Pr(F) = α^k.
    ¹For simplicity, we assume that all features have the same a priori probability of failure. However, this is not imperative in our model, and different probabilities can be used.
Now consider Pr(E|F): We assume the elements in E are independent conditioned on F. For an element ej ∈ E, we denote by F(ej) ⊆ F the features that contain ej; only errors associated with features in F(ej) can affect the correctness of ej. Thus, we have the following.

    Pr(E|F) = ∏_{ej ∈ E} Pr(ej|F) = ∏_{ej ∈ E} Pr(ej|F(ej))    (3)

We assume that for each cause of errors, there is an error rate between 0 and 1. For example, assigning a default date 01/01/1900 often has an error rate close to 1 (if not 1), a date format parsing error from a particular webpage can also have a high error rate, whereas a webtable providing erroneous data often has a lower error rate. We denote by εi the error rate of the cause associated with feature fi. The error rate εi can be derived directly from fi.E and denotes the probability that an element represented by feature fi is incorrect when fi is associated with a cause of error.
Then, the probability of an element ej being correct is the probability that none of the causes associated with the features it belongs to affects its correctness. Similarly, we can compute the probability of an element being incorrect.

    Pr(ej.V = true | F(ej)) = ∏_{fi ∈ F(ej)} (1 − εi)    (4)
    Pr(ej.V = false | F(ej)) = 1 − ∏_{fi ∈ F(ej)} (1 − εi)    (5)

As special cases, we define Pr(ej.V = true | ∅) = 1, rewarding not including correct elements in the returned features, and define Pr(ej.V = false | ∅) = 0, penalizing not including incorrect elements in the returned features. Since Definition 7 requires covering all incorrect elements, we assume in the rest of the paper that F(ei) ≠ ∅ for every ei ∈ E with ei.V = false.
Equations (1–5) together compute the probability that the features in set F are the causes of the errors in data instances of E. Our goal is to find the set F with the highest probability Pr(F|E).

EXAMPLE 8. We consider two sets of features, F1 = {f6} and F2 = {f8, f9}, as possible causes for the errors in the set of elements E = {e1, . . . , e12} in Figure 2a. Semantically, the former means that the errors are caused by the wrong default value "01/01/1900"; the latter means that the errors are caused by two mistakes: wrongly parsing the dates of birth in Table 1 and wrongly parsing the dates of death in Table 2.
Feature f6 has 3 incorrect elements and no correct elements; its error rate is ε6 = 1. Using α = 0.5, we get that Pr(F1) = 0.5, Pr(E|F1) = 1 − (1 − 1) = 1, so Pr(F1|E) ∝ 0.5.
On the other hand, f8 has one incorrect element and one correct element (ε8 = 0.5), and f9 has two incorrect elements and no correct elements (ε9 = 1). Thus, Pr(F2) = 0.5², Pr(e2|F2) = Pr(e2|f8) = (1 − 0.5) = 0.5, Pr(e5|F2) = Pr(e5|f8) = 1 − (1 − 0.5) = 0.5, Pr(e9|F2) = Pr(e12|F2) = 1 − (1 − 1) = 1, so Pr(F2|E) ∝ 0.5² · 0.5 · 0.5 · 1 · 1 = 0.0625. This result indicates that F1 is more likely to be the cause of the errors than F2.

3.2 The diagnostic cost model
The Bayesian analysis described previously represents the probability that a set of features F are the causes of the errors in E. It requires probability computation for each element. Since our goal is to find the set of features that best diagnose the errors, it would be much more intuitive to transform the a posteriori probability to a cost function that computes a cost for each feature, and sums up the cost for all selected features.
Note that both Pr(F) and Pr(ej.V = true | F(ej)) can be written as the product of a set of terms, each associated with a feature. If we can transform Pr(ej.V = false | F(ej)) to such a form, we can then define a cost function for each feature and take an aggregation. For this purpose, we estimate Pr(ej.V = false | F(ej)) as follows.

    Pr(ej.V = false | F(ej)) ≈ ∏_{fi ∈ F(ej)} εi    (6)

In general, this estimate can be arbitrarily bad (consider the extreme case |F(ej)| → ∞). However, in practice, an erroneous element is usually due to one or two reasons instead of many. Therefore, ideally, |F(ej)| should be small in the diagnosis (i.e., little overlap between returned features). Our estimate is precise when |F(ej)| = 1. Furthermore, when |F(ej)| > 1, Equation (6) computes a lower probability than (5), penalizing overlapping features even harsher, which is consistent with our intuition. Our experimental results verify that this estimate leads to good diagnosis results.
We combine Equations (1–4) and (6) to get the following expression for the probability Pr(F|E).

    Pr(F|E) ∝ ∏_{fi ∈ F} Pr(fi) ∏_{ej ∈ E} Pr(ej|F(ej)) = ∏_{fi ∈ F} α · εi^|fi.E−| · (1 − εi)^|fi.E+|    (7)

where fi.E− and fi.E+ are the sets of false and true elements of fi, respectively. Equation (7) contains three distinct components that comprise the probability Pr(F|E): a fixed factor (α), a factor that corresponds to false elements of the feature, and a factor that corresponds to true elements. Accordingly, we define a cost function for each feature.

DEFINITION 9 (FEATURE COST). The cost c(fi) of a feature fi is the sum of the fixed cost, false cost, and true cost, defined as:

    c_fixed(fi) = log(1/α)
    c_false(fi) = |fi.E−| · log(1/εi)
    c_true(fi) = |fi.E+| · log(1/(1 − εi))

The use of logarithms² allows our cost function to be additive. Then, the diagnosis cost, which we define below, is logarithmically proportional to the a posteriori probability Pr(F|E) in Equation (7).
    ²Without loss of generality, we assume logarithms of base 2.

DEFINITION 10 (DIAGNOSIS COST). The cost c(F) of a diagnosis F = {f1, ..., fk} is the sum of the costs of all its features:

    c(F) = ∑_{fi ∈ F} c(fi)

EXAMPLE 11. We revisit the candidate diagnoses of Example 8: F1 = {f6} and F2 = {f8, f9}. The costs of the relevant features are: c(f6) = c_fixed(f6) + c_false(f6) + c_true(f6) = 1 + 3 · 0 + 0 = 1, c(f8) = 1 + 1 · 1 + 1 · 1 = 3, and c(f9) = 1 + 2 · 0 + 0 = 1. Therefore, c(F1) = 1 and c(F2) = 3 + 1 = 4. Since F1 has a lower cost, it is a better diagnosis than F2.
Note also that both F1 and F2 contain only disjoint features, so their costs estimate the corresponding probabilities precisely: Pr(F1|E) = 2^−c(F1) = 0.5 and Pr(F2|E) = 2^−c(F2) = 0.0625.
Interestingly, the three penalties considered for the feature cost
capture the three important properties for the returned diagnosis.
Conciseness: Penalty c_fixed(fi) > 0 represents the a priori probability. This means that the diagnosis cost increases as the size of F
increases, so this factor prioritizes smaller feature sets (i.e., concise
explanations).
Specificity: Penalty c_false(fi) prioritizes the choice of features
with higher error rate to cover the same wrong element. If two
features cover the same wrong elements, the one with higher error
rate will result in a lower cost.
Consistency: Penalty c_true(fi) prioritizes the choice of features
that contain fewer correct elements. Feature sets that cover a lot of
correct elements will result in a high cost.
Adding these cost penalties balances conciseness, specificity, and
consistency. For example, the diagnosis with a single feature that
contains all elements is obviously the most concise diagnosis, but
its error rate is presumably low and it involves a lot of correct
elements, so there is a high true cost and false cost. On the other
hand, returning each element as a single-element feature in the
diagnosis is obviously the most specific and consistent diagnosis,
but the number of features is high, resulting in a high fixed cost.
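As a concrete illustration of Definitions 9 and 10, the following Python sketch computes feature and diagnosis costs from the counts of false and true elements and reproduces the numbers of Example 11; the function names and the way the error rate is derived from the counts are our own choices, not code from the paper.

```python
import math

def feature_cost(num_false, num_true, alpha=0.5):
    """Cost of a single feature: fixed cost + false cost + true cost (Definition 9)."""
    eps = num_false / (num_false + num_true)   # error rate derived from the feature's elements
    fixed = math.log2(1 / alpha)
    false = 0.0 if num_false == 0 else num_false * math.log2(1 / eps)
    true = 0.0 if num_true == 0 else num_true * math.log2(1 / (1 - eps))
    return fixed + false + true

def diagnosis_cost(features, alpha=0.5):
    """Cost of a diagnosis: sum of its feature costs (Definition 10)."""
    return sum(feature_cost(nf, nt, alpha) for nf, nt in features)

# Example 11: F1 = {f6} with (3 false, 0 true); F2 = {f8, f9} with (1, 1) and (2, 0).
print(diagnosis_cost([(3, 0)]))            # 1.0
print(diagnosis_cost([(1, 1), (2, 0)]))    # 4.0
```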
3.3 Complexity
For a given dataset of elements E, our cost model assigns a constant cost to each feature. This transforms the problem of deriving optimal diagnoses (Definition 7) to a weighted set cover problem [29]. There is a straightforward reduction from weighted set
cover to the problem of optimal diagnosis, which means that our
problem is NP-complete.
THEOREM 12 (NP-COMPLETENESS). Given a dataset of elements E = {e1, ..., en}, the cost function c of Definition 10, and a
maximum cost K, determining whether there exists a diagnosis F,
such that c(F) ≤ K, is NP-complete.
Weighted set cover is a well established problem, with extensive
related work. Specifically, there are several approximation algorithms for this problem [13, 16, 27], but typically, they do not come
near to addressing the scale of problems that are relevant to our
motivating application domain. These algorithms typically have
high-degree polynomial complexity (e.g., quadratic in the number
of features [27]), and they are not amenable to parallelism.
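For reference, here is a minimal sketch of the standard greedy heuristic for weighted set cover, the kind of baseline referred to as Greedy in Section 5; it assumes that all candidate features and their costs have already been materialized, which is precisely the step that becomes expensive at the scale discussed here.

```python
def greedy_weighted_set_cover(false_elements, features):
    """features: list of (cost, covered_false_elements) pairs.
    Repeatedly pick the feature with the best cost per newly covered false element."""
    uncovered = set(false_elements)
    chosen = []
    while uncovered:
        best = min(
            (f for f in features if f[1] & uncovered),
            key=lambda f: f[0] / len(f[1] & uncovered),
            default=None,
        )
        if best is None:   # some false element is not covered by any candidate feature
            break
        chosen.append(best)
        uncovered -= best[1]
    return chosen

# Toy run with the features of Example 11: f6 covers {e5, e9, e12} at cost 1,
# f8 covers {e5} (plus a true element) at cost 3, f9 covers {e9, e12} at cost 1.
print(greedy_weighted_set_cover(
    {"e5", "e9", "e12"},
    [(1, {"e5", "e9", "e12"}), (3, {"e5"}), (1, {"e9", "e12"})],
))
```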
In the next section, we introduce a powerful, sort-free, top-down
iterative algorithm for the optimal diagnosis problem, with linear
time complexity and great parallelization potential. We extend our
algorithm to a MapReduce-based implementation, and show that it
is both effective and efficient.
4. DERIVING A DIAGNOSIS
In this section, we propose an algorithm that can derive diagnoses
efficiently, by exploiting the hierarchical structure of features. We
start with a description of the feature hierarchy (Section 4.1). We
then propose an algorithm that constructs a diagnosis by traversing
the hierarchy in a top-down fashion (Section 4.2). Finally, we
present a MapReduce version of our algorithm that makes causal
diagnosis practical for large-scale datasets (Section 4.3).
[Figure 3 is a diagram of the feature hierarchy of Figure 2b: the root feature f0 at level 0; its child features f1–f5 at level 1, grouped into partitions; and level-2 features such as f6 and f7. Each node shows the feature's property vector (PV), structure vector (SV), and element set (ES).]
Figure 3: The hierarchy structure of the features of Figure 2b.
4.1 The feature hierarchy
Features form a natural hierarchy due to the property hierarchy
(Section 2.1). For example, the feature {(source_URL, wiki), ALL,
ALL, ALL} contains the feature {(source_tableID, tbl #1), ALL, ALL,
ALL }; semantically, the table is a subset of the wiki page. This
hierarchy is embedded in the features’ property vectors, and we
model it explicitly by deriving a feature’s structure vector and its
hierarchy level.
Structure vector (SV): An SV is an integer vector {s1, . . . , sm},
where the i-th element represents the granularity of the feature
in the i-th property dimension. Lower numbers represent coarser
granularity, with 0 mapping to Root. For example, in the source
dimension, we have the granularity levels Root, source_URL, and
source_tableID, which are represented in the structure vector with
0, 1, and 2, respectively. Therefore, the structure vector of feature
f1 in Figure 2b is {1, 0, 0, 0}, because wiki is a value at the granularity level of source_URL. The structure vector is derived from a
feature’s property vector, and provides an intuitive representation of
the feature’s relationship with other features.
Feature level: The feature level is an integer that denotes the distance
of a feature from the root of the feature hierarchy. It can be computed
directly from the structure vector, as the sum of all its dimensions (∑i si). For example, feature f1 has level 1.
Feature hierarchy: We define the parent-child relationships in the
feature hierarchy using the list of feature elements (f.E), the feature
structure vector (SVf ) and the feature level (Lf ).
DEFINITION 13 (PARENT-CHILD FEATURES). A feature fp is
the parent of feature fc (equivalently, fc is a child of fp ) when the
following conditions hold:
(a) e ∈ fc.E ⇒ e ∈ fp.E
(b) Lfp = Lfc − 1
(c) ∀i ∈ [1, m], SVfp(i) ≤ SVfc(i)
The conditions of the definition ensure that (a) the elements of a
parent feature are a superset of the elements of the child feature, (b)
the parent feature is one level closer to the root of the hierarchy, and
(c) each dimension of the child feature has equal or finer granularity
than the parent.
A feature can have multiple parents. For example, the features {(source_URL, wiki), ALL, (pred_instance, DoB), ALL} and {(source_tableID, tbl #1), ALL, (pred_type, Bio), ALL} are both parents of feature {(source_tableID, tbl #1), ALL, (pred_instance, DoB), ALL}.

Algorithm 1 DATAXRAY
Require: A set of elements E;
Ensure: A set of problematic features R;
 1: parentList ← InitialFeature(elementList);
 2: R ← ∅;
 3: while parentList ≠ ∅ do
 4:   S, U, childList, nextLevel ← ∅;
 5:   for each parentF ∈ parentList do
 6:     SplitFeature(parentF, childList);
 7:   end for
 8:   partitionList ← getPartition(parentList, childList);
 9:   for each partition ∈ partitionList do
10:     CompareFeature(partition, S, U);
11:   end for
12:   MergeFeature(parentList, nextLevel, S, U, R);
13:   parentList ← nextLevel;
14: end while
15: return R
The hierarchy defined by the parent-child feature relationships is a directed acyclic graph (DAG). The root of the hierarchy is
feature {ALL, ALL, ALL, ALL}, and each leaf is a feature that maps
to a unique element. For example, the feature {(source_tableID, tbl
#1), (subj_instance, J. Vide), (pred_instance, DoB), (obj_instance,
01/01/1900)}, at level 8, represents element e5 in Figure 2a, and
equivalently, triple t5 in Figure 1b.
Feature partitions: Features at the same hierarchy level generally
have overlapping sets of elements. For example, f1 and f4 have four
elements in common. This can be a problem for an algorithm that
explores the hierarchy, because it is harder to compare features that
overlap in an arbitrary way. To address this problem, we organize
the child features of a parent feature fp into m partitions, where m
is the number of property dimensions.
DEFINITION 14 (PARTITION). A partition P_i^fp contains every feature f that is a child feature of fp and SVfp(i) = SVf(i) − 1.
For example, f4 and f5 form partition P_4^f0 in level 1: they share
the same parent, f0, and SVf4(4) = SVf5(4) = SVf0(4) + 1.
Overall, level 1 has four partitions: P_1^f0 = {f1}, P_2^f0 = {f2},
P_3^f0 = {f3}, and P_4^f0 = {f4, f5}. By construction, partitions
ensure that their features do not overlap (e.g., f4.E ∩ f5.E = ∅),
and the union of all their features covers all the parent elements (e.g.,
f4.E ∪ f5.E = f0.E).
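To illustrate the definitions above, here is a small Python sketch that computes feature levels from structure vectors, checks the parent-child conditions of Definition 13, and groups a parent's children into per-dimension partitions; features are represented simply as (element set, structure vector) pairs, and the helper names are ours.

```python
from collections import defaultdict

def feature_level(sv):
    """Level of a feature = sum of its structure vector entries."""
    return sum(sv)

def is_child(parent, child):
    """Parent-child test of Definition 13 on (element_set, structure_vector) pairs."""
    p_elems, p_sv = parent
    c_elems, c_sv = child
    return (c_elems <= p_elems                                    # (a) element containment
            and feature_level(p_sv) == feature_level(c_sv) - 1    # (b) one level apart
            and all(p <= c for p, c in zip(p_sv, c_sv)))          # (c) coarser or equal per dimension

def partitions(parent, children):
    """Group the children of `parent` by the dimension in which they refine it."""
    _, p_sv = parent
    parts = defaultdict(list)
    for child in children:
        _, c_sv = child
        i = next(i for i, (p, c) in enumerate(zip(p_sv, c_sv)) if c == p + 1)
        parts[i].append(child)
    return parts

# Level-1 children of f0 from Figure 2b (element sets abbreviated to element indices):
f0 = (set(range(1, 13)), (0, 0, 0, 0))
f4 = ({1, 4, 7, 10}, (0, 0, 0, 1))
f5 = ({2, 3, 5, 6, 8, 9, 11, 12}, (0, 0, 0, 1))
assert is_child(f0, f4) and is_child(f0, f5)
print(partitions(f0, [f4, f5]))   # both fall into the partition of dimension index 3
```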
4.2 Top-down iterative diagnosis
Our diagnostic algorithm traverses the hierarchy in a top-down
fashion (breadth-first), exploring coarse-granularity features first,
and drilling down to finer-granularity features while improving the
cost estimate at every step. The algorithm maintains three sets of
features in the traversal.
• Unlikely causes U: features that are not likely to be causes.
• Suspect causes S: features that are possibly the causes.
• Result diagnosis R: features that are decided to be associated
with the causes.
Our algorithm, DataXRay, is described in Algorithm 1. At
a high level, every iteration of the algorithm considers features at
a particular level (Line 5–Line 7), compares each parent feature
with its child features, and populates the list of suspect causes S
and the list of unlikely causes U (Line 8–Line 11). At the end of
the iteration, the sets S and U are consolidated (Line 12): parent
features that occur only in S are added to the result diagnosis R, and
all elements that R contains are marked as “covered”; child features
that occur only in S are kept for traversal in the next iteration. The
traversal completes once all incorrect elements are marked as being
covered by some feature in R.
We next describe the major components of the algorithm.
S PLIT F EATURE : Given a feature and its elements, this component
derives the corresponding child features and partitions. It uses the
structure vector of the parent feature to derive the structure vectors
of the children and the partition of the parent. It then generates the
property vectors of each child feature by examining the elements in
the parent. Finally, if the parent feature is marked as “covered”, all
the elements of the feature are already covered by ancestor features
selected to R. To avoid producing redundant diagnoses, the child
features of the current feature are also marked as “covered”. Features
marked as “covered” will not be added to R.
C OMPARE F EATURE : Given a parent node and a partition, this
component compares the feature set containing only the parent,
and the feature set containing all child features in the partition, to
determine which is a better solution. The winner features are added
to S and the loser features are added to U. The comparison is based
on the cost model of Definition 10. In Section 4.2.1 we describe two
additional criteria that simplify this computation.
M ERGE F EATURE : In the previous step, each partition populates
the sets S and U independently. This component consolidates the
results. Parent features only in S and not in U transfer to the
result diagnosis R, and their child features are marked as “covered”.
Parent features in U are discarded, since it means there exists some
partition where the child features form a better solution. Child
features in S are sent to the next iteration for further traversal.
THEOREM 15 (COMPLEXITY AND BOUNDS). Algorithm 1 has
complexity O(|F|), and provides an O(n)-approximation of the
minimum cost diagnosis, where n is the number of elements in E
and F is the set of features that can be derived from E.
DataXRay provides an O(n)-approximation of the optimal
diagnosis, which is worse than the greedy approximation bound for
set cover (O(log n)). However, this worst case setting requires that
errors are uniformly distributed among the child nodes of a feature,
across all dimensions. This is extremely unusual in practice, and in
most cases DataXRay significantly outperforms approximations
for set cover.
In Section 5, we show that DataXRay outperforms greedy set
cover by an order of magnitude. Our algorithm exploits the hierarchy
of the features, leading to better worst-case complexity (linear in
the number of features) than other approximations for set cover. We
note that the number of features can be huge: O(lm |E|), where m
is the number of dimensions and l the maximum number of levels
in the property hierarchy for each dimension. However, in practice,
m is usually small. Moreover, DataXRay is by design highly parallelizable and we show how to implement it in the MapReduce
framework (Section 4.3).
4.2.1 Optimizations
Line 10 of Algorithm 1 compares two sets of features using the
cost model of Definition 10. This computation requires enumerating
each element in the feature sets. We use two heuristic criteria that
simplify computation and prune features faster.
Variance pruning: The variance of a feature describes how the errors are distributed among the child features. We compute the variance in each partition P_i^f of a feature f as:

    Var_i^f = ( ∑_{fc ∈ P_i^f} (ε_fc − ε_f)² ) / |P_i^f|

Intuitively, if a feature is associated with a cause of errors, it is likely to result in uniform mistakes across its child features, resulting in low variance of error rates among its children. A feature with high variance indicates that the actual culprit is deeper in the feature hierarchy; we thus add a parent feature f to U if Var_i^f ≥ θmax. Based on empirical testing, we chose θmax = 0.1 for our experiments. For example, feature f5 in Figure 2b has 6 child features in partition P_4^f5: five with zero error rate, and one with ε = 1. Then, the variance in that partition is Var_4^f5 = 0.14 > θmax, so f5 is added to U.
Error rate pruning: When a feature is associated with a cause of
errors, typically its error rate would not be too low. Accordingly, we
add a parent feature f to U if ε_f ≤ δmin. Again, empirically, we
chose δmin = 0.6.
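A minimal sketch of how the two pruning checks could be applied to a parent feature and one of its partitions, assuming error rates have already been computed per feature; the thresholds mirror the values reported above, and the function names are ours.

```python
THETA_MAX = 0.1   # variance threshold used in the experiments
DELTA_MIN = 0.6   # error-rate threshold used in the experiments

def error_rate(num_false, num_true):
    return num_false / (num_false + num_true)

def prune_parent(parent_rate, child_rates):
    """Return True if the parent feature should be added to U (unlikely causes)."""
    variance = sum((c - parent_rate) ** 2 for c in child_rates) / len(child_rates)
    return variance >= THETA_MAX or parent_rate <= DELTA_MIN

# Feature f5 of Figure 2b: error rate 3/8, with six children in partition 4,
# five of which are error-free and one (f6) has error rate 1.
print(prune_parent(error_rate(3, 5), [0, 0, 0, 0, 0, 1.0]))   # True: f5 goes to U
```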
4.2.2 Greedy refinement
Our diagnostic framework does not consider correlations among
features. If correlations exist, they can result in diagnoses that
contain redundant features (i.e., features with a lot of overlap).
DataXRay detects redundancies across features of different levels, but is unaware of overlap in features selected from the same
hierarchy level. To eliminate such redundancies in the resulting
diagnosis, we post-process it with a greedy set-cover step. This
greedy step looks for a minimal set of features among those chosen
by DataXRay. Since the number of features in the DataXRay
result is typically small, this step is very efficient. In Section 5,
we show that with negligible overhead, DataXRay with greedy
refinement results in significant improvements in accuracy.
4.3 Parallel diagnosis in MapReduce
We design our algorithm with parallelization in mind: the split,
compare, and merge steps can each execute in parallel, as the computation always focuses on a specific partition or feature. In this
section, we describe how our algorithm works in a MapReduce
framework, creating a separate map-reduce stage for each of the
split, compare, and merge functions.
Stage I parallelizes the generation of child features. The Map
phase maps each element in the parent feature to relevant child
features; in other words, for each element in a parent feature, it
generates pairs where the element is the value and a child property
vector is the key. The Reduce phase generates each child feature
according to the set of elements, and computes its error rate and cost.
Stage II parallelizes the comparison of each parent feature and a
partition of child features. The Map phase generates, for each child
feature, the partitions it belongs to; in other words, for each partition
that contains the child feature, it generates a pair where the child is
the value and the parent-partition pair is the key. The Reduce phase
compares the parent feature with each partition of its child features.
Stage III parallelizes the decision of whether to discard a feature,
or to return a feature in the diagnosis, or to keep it for further
traversal. The Map phase populates S and U; in other words, for
each feature in the comparison, it generates a pair where the feature
is the key and the decision for adding it to S or U is the value. The
Reduce phase makes a final decision for each feature.
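To make Stage I concrete, here is a hedged sketch of its map and reduce functions written as plain Python generators rather than actual Hadoop code; elements are assumed to be (truth value, per-dimension property chain) pairs as in the earlier model sketch, and the helper child_property_vectors and the constant names are ours.

```python
import math

ALL = ("Root", "ALL")
ALPHA = 0.1   # prior used in the experiments

def child_property_vectors(parent_pv, element_chains):
    """For one element (given as per-dimension property chains), yield the property
    vectors of the parent's children that cover it: refine exactly one dimension."""
    for i, (ppair, chain) in enumerate(zip(parent_pv, element_chains)):
        depth = 0 if ppair == ALL else chain.index(ppair) + 1
        if depth < len(chain):               # dimension i can still be refined
            child = list(parent_pv)
            child[i] = chain[depth]
            yield tuple(child)

def stage1_map(parent_pv, parent_elements):
    """Map: emit (child property vector, element) pairs for every element of the parent."""
    for elem in parent_elements:             # elem = (V, per-dimension chains)
        for child_pv in child_property_vectors(parent_pv, elem[1]):
            yield child_pv, elem

def stage1_reduce(child_pv, elements):
    """Reduce: materialize the child feature and compute its error rate and cost."""
    elements = list(elements)
    n_false = sum(1 for v, _ in elements if not v)
    n_true = len(elements) - n_false
    eps = n_false / len(elements)
    cost = (math.log2(1 / ALPHA)
            + (n_false * math.log2(1 / eps) if n_false else 0.0)
            + (n_true * math.log2(1 / (1 - eps)) if n_true else 0.0))
    return {"P": child_pv, "E": elements, "error_rate": eps, "cost": cost}
```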
Extraction      false triples   true triples   error rate
reverb          304             315            0.49
reverbnolex     338             290            0.54
textrunner      478             203            0.70
woepos          535             218            0.71
woeparse        557             324            0.63

Figure 4: Real-world datasets from 5 different knowledge extraction systems of the ReVerb ClueWeb Extraction dataset [24].
5. EXPERIMENTAL EVALUATION
This section describes a thorough evaluation of our diagnostic framework on real-world knowledge extraction data, as well as large-scale synthetic data. Our results show that (1) our cost function models the quality of diagnoses effectively; (2) our algorithm is both more effective and more efficient than other existing techniques; and (3) our MapReduce implementation of the algorithm is effective at handling datasets of large scale.
Datasets
We first describe the real-world data used in our evaluation; we
describe our synthetic data experiments in Section 5.2.
Knowledge triple extraction systems. We demonstrate the effectiveness of our diagnosis framework in practice, using five real-world
knowledge extraction systems of the ReVerb ClueWeb Extraction
dataset [24]. Figure 4 provides high-level characteristics about each
of these 5 extractors. The dataset samples 500 sentences from the
web, using Yahoo!’s random link service. The dataset contains
labeled knowledge triples: each triple has a true or false label
indicating whether it is correct or incorrect, respectively.
We proceed to describe how we model the knowledge extraction
datasets in our feature-based framework. In our model, each knowledge triple is an element with a 5-dimensional property vector, with
the following property hierarchies (a modeling sketch follows the list):
1. Source (Root, sentenceID) describes which sentence the triple
is extracted from.
2–4. Subject, Predicate, Object (Root, structure, content). Each
of these dimensions describes the structure of the sentence,
and the content value. The structure is composed of the POS
tags [58] (e.g., noun, verb). The content is the actual content
value of the triple.
5. Confidence (Root, confidence bucket). Extraction systems
annotate the extracted knowledge with confidence values as
an assessment of their quality. We capture the confidence
bucket as part of the property dimensions.
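A hedged sketch of how one labeled triple could be mapped into this 5-dimensional representation; the bucket width, field names, and the example POS tags are illustrative assumptions, not the extractors' actual output format.

```python
def confidence_bucket(confidence, width=0.1):
    """Discretize an extractor confidence score into a bucket label."""
    low = int(confidence / width) * width
    return f"[{low:.1f}, {low + width:.1f})"

def to_property_vector(sentence_id, subj, pred, obj, subj_pos, pred_pos, obj_pos, confidence):
    """Five property dimensions: source, subject, predicate, object, confidence.
    Subject/predicate/object are chains from coarser (structure) to finer (content)."""
    return (
        [("sentenceID", sentence_id)],
        [("subj_structure", subj_pos), ("subj_content", subj)],
        [("pred_structure", pred_pos), ("pred_content", pred)],
        [("obj_structure", obj_pos), ("obj_content", obj)],
        [("confidence_bucket", confidence_bucket(confidence))],
    )

# A made-up labeled triple from one sampled sentence:
pv = to_property_vector("s042", "Barack Obama", "was born in", "Hawaii",
                        "NNP NNP", "VBD VBN IN", "NNP", confidence=0.87)
```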
In our experiments, we focused on these 5 dimensions, because
of limited knowledge of each systems’ inner workings. In practice,
domain experts are likely to include more dimensions (e.g., specific
extraction patterns) to derive more accurate diagnoses.
Silver standard. The dataset does not provide a “gold standard” of
diagnoses for the erroneous triples. We manually derived a “silver
standard” against which we evaluate our methods. In particular, we
considered every feature returned by each alternative technique we
implemented, and manually investigated if it is very likely to be associated with a particular error. To the best of our knowledge, there
is no freely available dataset with labeled diagnoses; our manually
derived silver standard is a best-effort approach to evaluating our
techniques while staying grounded in real-world data.
Comparisons
We compare two versions of our algorithms, DataXRay and
DataXRay+Greedy, with several alternative algorithms and state-of-the-art methods designed for similar problem settings: a greedy
algorithm for set cover, Greedy, and a variant with a different optimization objective, RedBlue; a data quality exploration tool,
DataAuditor; and two classification algorithms, FeatureSelection and DecisionTree.
DataXRay (Section 4): Derives diagnoses by identifying "bad"
features using the DataXRay algorithm proposed in this paper.
We set α = 0.1, used in the fixed cost, and θmax = 0.1 and
δmin = 0.6, used in the pruning and filtering heuristics.
DataXRay+Greedy (Section 4.2.2): This algorithm applies a
greedy set-cover refinement step on the result of DataXRay to
eliminate redundancies.
Greedy [13]: We apply the greedy approximation for weighted
set cover to select the set of features of minimum cost that cover all
of the false elements, according to our cost model (Section 3.2).
Our cost model allows set cover to penalize features that cover true
elements, which it does not do in its default objective.
RedBlue [10, 50]: Given a collection of sets with "blue" and "red"
elements, the red-blue set cover problem looks for a sub-collection
of sets that covers all "blue" elements and a minimum number of
“red” elements. In contrast to regular set-cover, red-blue set cover
can model both correct and incorrect element coverage. We map
false elements to “blue” elements, and true elements to “red”
elements, while features are sets. We use a greedy approximation
algorithm [50] to find the cover.
DataAuditor [31, 32]: We use Data Auditor, a data quality exploration tool that uses rules and integrity constraints to construct
pattern tableaux. We annotate false elements as a consequent
(dependent) value in an FD, and use Data Auditor to learn a pattern
for this rule, which we treat as a diagnosis. We set support ŝ = 1
to diagnose all false elements; the confidence ĉ corresponds to the
error rate pruning in DataXRay, so we set ĉ = δmin = 0.6.
FeatureSelection [48, 57]: We use logistic regression to derive a set of features that is a good classifier between true and
false elements. For each feature the algorithm learns a weight between -1 and 1: a positive weight indicates that the feature is positively
proportional to the class (in our context, the feature is a cause), and
a negative weight indicates the opposite. We use our labeled data
as the training dataset, excluding features with only true elements
to speed up learning, and return features with positive weights. We
apply L1-regularization, which favors fewer features for the purpose
of avoiding over-fitting. We use 0.01 as the regularization parameter
(a higher parameter applies a higher penalty for including more
features), as we empirically found that it gives the best results.
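For readers who want to reproduce this baseline, a comparable setup using scikit-learn (the paper itself used the SLEP package) might look like the sketch below; the toy matrix is illustrative, and C is scikit-learn's inverse of the regularization strength.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# X: one row per element, one binary column per candidate feature (1 if the element
# carries the feature); y: 1 for false (erroneous) elements, 0 for true ones.
X = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 1, 0]])
y = np.array([1, 1, 0, 0])

# L1-regularized logistic regression; a regularization parameter of 0.01
# corresponds to C = 1 / 0.01 = 100 in scikit-learn's parameterization.
model = LogisticRegression(penalty="l1", solver="liblinear", C=100.0)
model.fit(X, y)

# Features with positive weights are reported as the "diagnosis" of this baseline.
selected = [j for j, w in enumerate(model.coef_[0]) if w > 0]
print(selected)
```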
DecisionTree [51]: We use decision trees with pruning as an
alternative classification method. Pruning avoids overfitting, which
would lead decision trees to always select features at the lowest
hierarchy levels. We set low confidence (0.25) to promote pruning,
and restrict the minimum number of instances to two (each selected
feature should have at least two elements). We found empirically
that these parameters provide the best results.
We use the SLEP package implementation for logistic regression [40] and WEKA [35] for decision trees, and we implemented
the rest of the algorithms in Java. In addition, the MapReduce
version of DataXRay uses Hadoop APIs.
Metrics
We evaluate the effectiveness and efficiency of the methods.
Precision/Recall/F-measure: We measure the correctness of the
derived diagnoses using three measures: Precision measures the
portion of features that are correctly identified as part of the optimal
diagnosis; Recall measures the portion of features associated with
causes of errors that appear in the derived diagnosis; F-measure computes their harmonic mean (2 · precision · recall / (precision + recall)). Note that we do not
know all causes for the errors, so recall is evaluated against the union
of all features marked as correct diagnoses in our silver standard.
Execution time: We report the execution time for each method, broken down into preprocessing time (prep.), computation time (comp.),
and total execution time (total time). The preprocessing time for
Greedy, RedBlue, and FeatureSelection accounts for the
time to find all eligible features and compute their costs if necessary.
The other methods (DataXRay, DataXRay+Greedy, DataAuditor, and DecisionTree) only need to compose the initial root
feature at level 0 during preprocessing. Computation time is the
time that each method takes on average, after the preprocessing step.
The total time is the sum of preprocessing and computation time.
We ran experiments comparing all of the methods on an iMac
with a 3.2 GHz Intel Core i5 processor and 8GB of RAM. We also conducted
experiments on the scalability of our MapReduce implementation.
The experiments were conducted on a Hadoop 2.2.0 cluster with 15
slave nodes. The head node has a 2.4 GHz processor and 24GB of RAM.
The slave nodes have 2.66 GHz processors and 16GB of RAM.
5.1 Real-world data
In our first round of experiments, we test our diagnostic framework using real-world knowledge extraction data. Figures 5a–5e
report the quality of the diagnoses produced by each method on
the data extracted by five real-world extraction systems. Figure 5f
reports the average execution time for each method. Our results in
Figure 5 demonstrate that our framework derives better diagnoses
than the other approaches, and does so more efficiently.
Our first goal is to evaluate the effectiveness of our cost function. The results in Figures 5a–5e demonstrate that the methods that
apply our cost function (DATA XR AY, DATA XR AY +G REEDY, and
G REEDY) result in diagnoses of significantly better quality. R ED B LUE generally has high recall, but lower precision. This is because
R ED B LUE favors finer-granularity features: its objective function
depends on the number of red elements (i.e., true elements) that
are included in the diagnosis, but does not consider the number of
returned features (size of the diagnosis). DATA AUDITOR also uses
a different objective and prioritizes coarse features, leading to bad
performance across all extractors. The logistic regression method
(F EATURE S ELECTION) shows low quality in all datasets. The goal
of F EATURE S ELECTION is to build a good prediction model, which
is different from our diagnosis goal. Even with L1 -regularization,
it may still select small features for the purpose of optimizing classification. We found that the F EATURE S ELECTION results often
contained redundancy and features with low error rates, resulting
in both low precision and low recall. D ECISION T REE performs
the worst, with F-measure values below 0.015. Compared to F EATURE S ELECTION, D ECISION T REE is restricted to non-overlapping
rules; this reduces the search space, but ignores many of the features,
leading to bad performance. These results show that our cost model
is successful at producing good diagnoses, and the quality of the
diagnoses is significantly better than those produced by methods
with alternative objectives.
Our second goal is to evaluate the effectiveness of our approximation algorithms in solving the optimization problem. All of the methods that use our cost model (DATA XR AY, DATA XR AY +G REEDY,
and G REEDY) achieve high recall scores in all five datasets. We
observe that typically DATA XR AY has a higher recall, whereas
G REEDY has a higher precision, especially for the textrunner,
woepos , and woeparse datasets. One weakness of DATA XR AY is
that it does not detect overlap across features that are selected at
the same level of the hierarchy. When that occurs, the resulting
diagnoses contain redundancies (multiple features that explain the
[Figures 5a–5e plot the Recall, Precision, and F-measure of DataXRay+Greedy, DataXRay, Greedy, RedBlue, DataAuditor, and FeatureSelection for the extractors reverb (a), reverbnolex (b), textrunner (c), woepos (d), and woeparse (e).]

(f) Average execution time (sec):

Method              Prep.   Comp.   Total
DataXRay+Greedy     0.02    0.41    0.43
DataXRay            0.01    0.40    0.41
Greedy              0.7     2.3     3.0
RedBlue             0.7     3.5     4.2
DataAuditor         0.03    0.17    0.2
FeatureSelection    0.9     4.6     5.5
DecisionTree        0.03    0.1     0.13
Figure 5: The quality of the derived diagnoses for all of the methods, across five knowledge extraction systems; the maximum F-measure value
achieved by D ECISION T REE was 0.0148, which was too low to display in the graphs. Our approach that combines DATA XR AY with a greedy
set-cover step outperforms all other approaches, in some cases significantly. It is also faster than other methods except DATA AUDITOR and
D ECISION T REE, but the latter two produce results of very low quality for our problem.
same errors), leading to low precision. DATA XR AY +G REEDY overcomes this weakness by applying the greedy set-cover method over
the result of DATA XR AY. This eliminates the redundancy from
the DATA XR AY diagnoses, leading to large improvements in precision, usually with only a very small drop in recall. Overall, our
DATA XR AY +G REEDY method maintains excellent performance,
with F-measure above 0.85 across all five of our datasets.
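The refinement step is a standard greedy cover computed over the features that DATA XR AY returns. The sketch below (feature and element identifiers are our own assumptions) uses a max-coverage variant, repeatedly keeping the feature that covers the most not-yet-covered false elements; the full G REEDY baseline instead ranks features by the weights from our cost model.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class GreedyRefinement {
    /**
     * Input: the features selected by DataXRay, each mapped to the false elements it covers.
     * Output: a subset that still covers all those false elements, with redundancy removed.
     */
    public static List<String> refine(Map<String, Set<String>> diagnosis) {
        Set<String> uncovered = new HashSet<>();
        for (Set<String> elements : diagnosis.values()) uncovered.addAll(elements);

        List<String> refined = new ArrayList<>();
        while (!uncovered.isEmpty()) {
            String best = null;
            int bestGain = 0;
            for (Map.Entry<String, Set<String>> entry : diagnosis.entrySet()) {
                Set<String> gain = new HashSet<>(entry.getValue());
                gain.retainAll(uncovered);          // newly covered false elements
                if (gain.size() > bestGain) {
                    bestGain = gain.size();
                    best = entry.getKey();
                }
            }
            if (best == null) break;                // remaining elements are not coverable
            refined.add(best);
            uncovered.removeAll(diagnosis.get(best));
        }
        return refined;
    }
}
```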
Our final goal is to evaluate the efficiency of the different algorithms. Figure 5f reports the average execution time for each
method. DATA AUDITOR and D ECISION T REE are a bit faster than
DATA XR AY on this dataset, but they are not viable for this problem
given their terrible F-measure performance. DATA XR AY is an order
of magnitude faster than the remaining methods. The greedy refinement is only slightly slower than DATA XR AY: since it executes
greedy set cover on the solution of DATA XR AY, the problem space
is significantly reduced and thus the greedy step is very efficient.
Interesting diagnoses: Our diagnostic framework identified several
interesting problems in these real-world extraction systems. In one
case, we found that on the reverb dataset our system produced the
feature {ALL, ALL, ALL, (obj_structure, ECC), ALL} as part of a
diagnosis, where ECC stands for objects ending with coordinating
conjunction such as and, but, for, and nor (e.g., “newspapers and”).
This feature corresponds to elements whose objects are extracted
from coordinating conjunctions. This indicates a clear problem
with the extraction process, as it does not make sense to have a
coordinating conjunction as an object in a knowledge triple.
As another example, our method identified {ALL, (subj_structure,
CD), ALL, ALL, ALL} as a problem feature in the textrunner
dataset, where CD stands for cardinal numbers. This feature corresponds to the use of cardinal numbers as the subject of a knowledge triple, which, again, points to a specific mistake in the extraction process.
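Features such as these are property vectors in which ALL acts as a wildcard. The sketch below shows one possible way to test whether an element matches a feature; the flattened "dimension=value" encoding and the sample property values are our own illustration, not the system's internal representation.

```java
import java.util.Arrays;
import java.util.List;

public class FeatureMatch {
    private static final String ALL = "ALL";   // wildcard for "any value in this dimension"

    /** True if the element's property vector satisfies every non-wildcard position of the feature. */
    public static boolean matches(List<String> feature, List<String> elementProperties) {
        for (int i = 0; i < feature.size(); i++) {
            if (!ALL.equals(feature.get(i)) && !feature.get(i).equals(elementProperties.get(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> feature = Arrays.asList(ALL, ALL, ALL, "obj_structure=ECC", ALL);
        // Hypothetical element properties, one value per dimension of the property vector.
        List<String> element = Arrays.asList("src=reverb", "subj_structure=NN",
                "pred_structure=VB", "obj_structure=ECC", "conf=high");
        System.out.println(matches(feature, element));   // prints true
    }
}
```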
5.2 Scaling with synthetic data
We built a synthetic data generator to test our diagnostic framework across varied data sizes and error rates. We have three goals for
these experiments: (1) evaluate how the different methods perform
in terms of diagnostic accuracy across datasets of varied size and
varied error rates, (2) evaluate how the different methods scale, and
(3) evaluate the accuracy and scalability of DATA XR AY in a parallel
setting across a large range of data sizes. All figures presented in
this section display averages across 50 executions.
In the first round of our synthetic data experiments, we test all
methods against datasets that range from 100 to 10,000 elements.
In this experiment, each feature fails (becomes a cause of error)
with probability 0.3, and the error rate of failed features is 0.95.
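A generator with these two knobs is straightforward to sketch. The version below is our own simplification, using a flat set of disjoint features rather than the full feature hierarchy of the actual generator:

```java
import java.util.Random;

public class SyntheticGenerator {
    /**
     * Returns element labels (true = correct element) for numFeatures features with
     * elementsPerFeature elements each; features and their elements are kept disjoint.
     */
    public static boolean[][] generate(int numFeatures, int elementsPerFeature,
                                       double featureFailureProb, double errorRate, long seed) {
        Random rnd = new Random(seed);
        boolean[][] labels = new boolean[numFeatures][elementsPerFeature];
        for (int f = 0; f < numFeatures; f++) {
            boolean failed = rnd.nextDouble() < featureFailureProb;   // feature is a cause of error
            for (int e = 0; e < elementsPerFeature; e++) {
                // Elements of a failed feature become false with probability errorRate.
                labels[f][e] = !(failed && rnd.nextDouble() < errorRate);
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        boolean[][] labels = generate(100, 100, 0.3, 0.95, 42L);  // 10,000 elements
        System.out.println("Generated " + labels.length * labels[0].length + " elements");
    }
}
```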
We present the performance results in Figure 6. We note that
DATA XR AY and DATA XR AY +G REEDY have almost identical performance in the synthetic data experiments (other than a negligible
difference in execution time), so we omit the DATA XR AY +G REEDY
line to make the plots less crowded. This is because our synthetic
data generator avoids feature overlap at the same hierarchy levels,
which makes the greedy refinement unnecessary. Therefore, from
here on we do not include results for the greedy refinement.
DATA XR AY is extremely effective across different data sizes,
showing superior performance in both effectiveness and efficiency.
As the size of the dataset increases, the F-measure of the competing
methods steadily declines, falling below 0.6 for G REEDY and below
0.4 for the other methods at 10,000 elements. In contrast, our techniques maintain F-measure above 0.8 for all data sizes. DATA XR AY
is also the fastest of the diagnostic methods across all data sizes, in
several cases by multiple orders of magnitude (Figure 6b). Figure 6c
evaluates the conciseness of the produced diagnoses. It is interesting
that F EATURE S ELECTION derives diagnoses of very similar size
to DATA XR AY, yet its F-measure is much lower, indicating that
its objective is not suitable for the problem that we tackle in this
work. DATA AUDITOR heavily favors features at higher levels of the
hierarchy, which results in diagnoses with fewer features, but leads
to low F-measure. We include additional results on the granularity
of features chosen by different methods in the Appendix.
In our second round of experiments, we generate datasets of
10,000 elements but vary the probability that a feature is incorrect
(Figure 7a) and the error rate among the elements of an incorrect feature (Figure 7b). As both these probabilities increase, they increase
[Figure 6 plots, for DataXRay, Greedy, RedBlue, DataAuditor, FeatureSelection, and DecisionTree, as the data size grows from 10^2 to 10^4 elements: (a) F-measure, (b) execution time in seconds (log scale), and (c) the number of returned features.]
Figure 6: Performance of different methods on synthetic data when we vary the size of the data. DATA XR AY maintains consistently good
performance, while the effectiveness of the other methods drops dramatically as the size increases.
[Figure 7 plots: (a) F-measure as the probability of feature failure varies from 0.1 to 0.5, and (b) F-measure as the error rate of a failed feature varies from 0.70 to 0.95, for the fixed and adapted variants of DataXRay as well as Greedy, RedBlue, DataAuditor, FeatureSelection, and DecisionTree; (c) the accuracy (recall, precision, F-measure) and computation time in minutes of the parallel DataXRay implementation for data sizes from 1,000 to 1,000,000 elements.]
Figure 7: Performance of different methods when we vary the parameters of our synthetic data generator. (a) and (b) show the robustness of
DATA XR AY whereas (c) shows the scalability of our parallel implementation of DATA XR AY.
the overall error rate of the dataset, which leads us to consider adjustments to the error pruning parameter (δmin). Figures 7a and 7b display the performance for two versions of DATA XR AY: fixed uses a fixed δmin = 0.6, while adapted sets this parameter according to the overall error rate ε̄ of the dataset, δmin = max(0.6, ε̄).
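In code, the two variants differ only in how the pruning threshold is computed; a minimal sketch (names are ours):

```java
public class PruningThreshold {
    /** fixed variant: always 0.6; adapted variant: never below the dataset's overall error rate. */
    public static double deltaMin(long numFalseElements, long numElements, boolean adapted) {
        double overallErrorRate = (double) numFalseElements / numElements;
        return adapted ? Math.max(0.6, overallErrorRate) : 0.6;
    }
}
```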
Both versions of DATA XR AY maintain F-measure well above the
other methods under this range of configurations, with DATA XR AY-adapted having a natural advantage. The performance of G REEDY
decreases as the probability of feature failure increases (Figure 7a).
The greedy approximation is prone to mistakenly rejecting large features from the diagnosis since they frequently contain true elements,
which translates to higher feature weight. This occurrence is more
common at high feature failure probability, where such features are
more likely to be incorrect. On the other hand, F EATURE S ELEC TION improves because the algorithm is better at selecting the larger
incorrect features under the classification objective. When the error
rate of an incorrect feature decreases, the performance of G REEDY
drops (Figure 7b). This is also expected: with lower error rates,
incorrect features include more correct elements, which causes the greedy approximation to make many mistakes. For DATA AUDITOR,
increases in the overall error rate exacerbate its innate tendency to
select features at higher hierarchy levels, leading to sharp drops in
its performance. In contrast, the consistently high F-measure of
DATA XR AY shows the robustness of our algorithm.
5.2.1 Parallel evaluation
We evaluated our framework with even larger data sizes using our
MapReduce implementation of DATA XR AY (Section 4.3). Figure 7c
presents results on our MapReduce DATA XR AY algorithm. We do
not include data for the competing techniques, as we were not able
to scale them beyond 10,000 elements, and we were not able to find
comparable parallel implementations of these methods.3
[Footnote 3: To the best of our knowledge, existing parallel implementations of logistic regression target shared-memory architectures, so they are limited to shared-memory, multi-core systems [42].]
We ran our parallel implementation of DATA XR AY on a Hadoop
cluster with 15 slave nodes, varying the data size from 1,000 to 1
million elements. Our results in Figure 7c show that the quality of
our diagnoses does not degrade as the problem size grows larger. In
addition, using a parallel framework allows us to derive diagnoses
efficiently, even for large data sizes: our algorithm processes the 1
million elements in about 100 minutes and the execution time grows
linearly with the data size. This execution time is very reasonable
for this type of problem, as data diagnosis, similarly to data cleaning,
is an offline process.
6. RELATED WORK
DATA XR AY targets the problem of data diagnosis: explaining
where and how the errors happen in a data generative process. This
is in contrast to traditional data cleaning [53], which focuses on
identifying and repairing incorrect data. Identifying and correcting
data errors is an important and well studied problem. Data management research has supplied a variety of tools to deal with errors
originating from integrating data from multiple sources [1, 5, 49, 52],
identifying data items that refer to the same entity [34,39], resolving
conflicts [22, 62, 63], and providing language extensions for cleaning [28]. The ultimate goal for all of these techniques is to identify
which items in a dataset are correct, and which are incorrect. These
techniques are complementary to our work; they can be used to label
the truthfulness of elements in our diagnostic framework. Our work, in contrast, focuses on identifying mistakes in the process
that produced the data, rather than finding errors in the dataset itself.
A major research thrust in the domain of data cleaning is to
automatically generate repairs for the recovered errors [12]. Again,
there is a large arsenal of tools in this category that use rules [7, 14]
and functional dependencies [12, 26], while there is also work that
focuses on generating repairs in an interactive fashion, using user
feedback [54, 61]. DATA XR AY is a diagnostic tool, rather than a
cure. It can offer insight on the likely causes of error, but it does not
suggest specific repairs for these errors.
Data Auditor [31, 32] is a data quality exploration tool that uses
rules and integrity constraints to construct pattern tableaux. These
tableaux summarize the subsets of elements that fail to satisfy a
constraint, with some flexibility to avoid over-fitting. The tableaux
outputs resemble diagnoses with a list of features, and Data Auditor
also uses a top down algorithm to derive them. However, there
are several distinctions from DATA XR AY. First, users need to select
the attributes and constraints that will be in the tableaux. Second,
the objective function that it applies focuses only on the number
of returned attributes (features), and the coverage of satisfying and
non-satisfying elements is specified as confidence constraints. In
our evaluation, we showed that Data Auditor has much lower F-measure than our methods. We also observed that its performance
is extremely sensitive to the confidence constraint, but the optimal
setting of this parameter was not consistent across data sizes and
other experimental settings (Section 5).
As DATA XR AY traces errors in the processes that produce data, it
has connections to the field of data and workflow provenance. Data
provenance studies formalisms that express why a particular data
item appears in a query result, or how that query result was produced
in relation to input data [9, 11, 15, 33]. However, since we often
do not know the details of each data generator (e.g., knowledge
extractors), we cannot easily apply these approaches. In comparison,
DATA XR AY describes how collections of data items were produced
in relation to high-level characteristics of the data generative process.
Roughly, features in our framework are a form of provenance annotations, but these are much simpler than the complex process steps
and interactions that workflow provenance [2, 19] typically captures
and maintains. Work on interactive investigation of information
extraction [18] uses provenance-based techniques and discusses a
concept of diagnosis under a similar application setting. The focus
of that work, however, is on interactive exploration and repair tools,
which is different from the scope of our work.
The work on database causality [43, 44, 46] refines the notions
of provenance, and describes the dependencies that a query result
has on the input data. Similar to DATA XR AY, causality offers a
diagnostic method that identifies data items that are more likely to
be causes of particular query outputs. This leads to a form of post-factum data cleaning, where errors in the output can be traced and
corrected at their source [45]. There are four significant differences
between causality and DATA XR AY. First, causal reasoning can only
be applied to simple data generative processes, such as queries and
boolean formulas, whereas DATA XR AY works with arbitrarily complex processes by relying on feature annotations. Second, existing
algorithms for causality do not scale to the data sizes that we tackle
with DATA XR AY. Third, existing work on causality in databases
has focused on identifying fine-grained causes in the input data,
which would correspond to individual elements in our setting, and
not higher-level features. Fourth, the premise of causality techniques
relies on the assumption that errors, and generally observations that
require explanation, are rare. This assumption does not hold in many
practical settings that our diagnostic algorithms target.
Existing work on explanations is also limited in its applicability,
as it is tied to simple queries or specific applications. The Scorpion
system [60] finds predicates on the input data as explanations for
a labeled set of outlier points in an aggregate query over a single
relation. Roy and Suciu [55] extended explanations with a formal
framework that handles complex SQL queries and database schemas
involving multiple relations and functional dependencies. Predicate explanations are related to features, but these approaches are
again limited to relational queries, rather than general processes.
Finally, application-specific explanations study this problem within
a particular domain: performance of MapReduce jobs [38], item
rating [17, 56], auditing and security [6, 23]. Recent work [30] introduced sampling techniques (Flashlight and Laserlight) to derive
explanation summaries. The objective of these methods is to find features that maximize the information gain. We evaluated Flashlight
and Laserlight experimentally, but they fared very poorly when applied to our diagnostic problems (maximum F-measure below 0.1).
A natural step in the problem of data diagnosis is to use feature
selection [48,57] to identify the features that can distinguish between
the true and the false elements. There are different methods for
performing feature selection, and logistic regression [8, 40] is one
of the most popular. In our experiments, we used logistic regression
with L1 regularization to minimize the number of returned features.
However, as our evaluation showed, it is not well suited for deriving
good diagnoses. Moreover, machine learning methods such as
logistic regression are hard to parallelize. Existing shared-memory
implementations [42] offer significant speedups, but they cannot
benefit from the MapReduce design that DATA XR AY uses. Efforts to
implement feature selection in a MapReduce framework showed that
this paradigm is not as effective as the shared-memory designs [41].
Other techniques like clustering [3, 4, 47] can be used to create groups of false elements as explanations of errors. However,
most of the clustering work assumes non-overlapping clusters. This
assumption makes these techniques ill-suited for our context as a
false element may have several causes and may therefore be
involved in multiple returned features.
Finally, the problem of finding an optimal diagnosis is related
to different versions of the set cover problem [27]. Weighted set
cover [16] seeks the set of features of minimum weight that covers
all the false elements. By assigning feature weights using our cost
model, we get an alternative algorithm for computing the optimal
diagnoses. We use the greedy heuristic approximation for weighted
set cover [13], but our evaluation showed that this algorithm is
not as effective as DATA XR AY. Red-blue set cover [10, 50] is a
variant of set cover; it seeks the set of features that cover all blue
(false) elements, and tries to minimize the number of covered red
(true) elements. Our experiments showed that the objective of
red-blue set cover produces diagnoses of lower accuracy compared
to DATA XR AY.
7. CONCLUSIONS
In this paper, we presented DATA XR AY, a large-scale, highly-parallelizable framework for error diagnosis. Diagnosis is a problem
complementary to data cleaning. While traditional data cleaning
focuses on identifying errors in a dataset, diagnosis focuses on
tracing the errors in the systems that derive the data. We showed how
to model the optimal diagnosis problem using feature hierarchies,
and used Bayesian analysis to derive a cost model that implements
intuitive properties for good diagnoses. Our experiments on real-world and synthetic datasets showed that our cost model is extremely
effective at identifying causes of errors in data, and outperforms
alternative approaches such as feature selection. By using the feature
hierarchy effectively, DATA XR AY is also much faster than the other
techniques, while our parallel MapReduce implementation allows
us to scale to data sizes beyond the capabilities of the other methods.
Acknowledgements: We thank Evgeniy Gabrilovich, Ramanathan
Guha, and Wei Zhang for inspiring discussions and use-case examples on Knowledge Vault. This work was partially supported by NSF
CCF-1349784, IIS-1421322, and a Google faculty research award.
8. REFERENCES
[1] S. Abiteboul, S. Cluet, T. Milo, P. Mogilevsky, J. Siméon, and
S. Zohar. Tools for data translation and integration. IEEE
Data Engineering Bulletin, 22(1):3–8, 1999.
[2] Y. Amsterdamer, S. B. Davidson, D. Deutch, T. Milo,
J. Stoyanovich, and V. Tannen. Putting lipstick on pig:
Enabling database-style workflow provenance. PVLDB,
5(4):346–357, Dec. 2011.
[3] P. Arabie, J. D. Carroll, W. S. DeSarbo, and J. Wind.
Overlapping clustering: A new method for product
positioning. Journal of Marketing Research, 18:310–317,
1981.
[4] A. Banerjee, C. Krumpelman, J. Ghosh, S. Basu, and R. J.
Mooney. Model-based overlapping clustering. In Proceedings
of the Eleventh ACM SIGKDD International Conference on
Knowledge Discovery in Data Mining, KDD, pages 532–537,
New York, NY, USA, 2005. ACM.
[5] C. Batini, M. Lenzerini, and S. B. Navathe. A comparative
analysis of methodologies for database schema integration.
ACM Comput. Surv., 18(4):323–364, Dec. 1986.
[6] G. Bender, L. Kot, and J. Gehrke. Explainable security for
relational databases. In Proceedings of the 2014 ACM
SIGMOD International Conference on Management of Data,
SIGMOD, pages 1411–1422, New York, NY, USA, 2014.
ACM.
[7] G. Beskales, I. F. Ilyas, and L. Golab. Sampling the repairs of
functional dependency violations under hard constraints.
PVLDB, 3(1-2):197–207, Sept. 2010.
[8] J. K. Bradley, A. Kyrola, D. Bickson, and C. Guestrin. Parallel
coordinate descent for l1-regularized loss minimization. In
International Conference on Machine Learning (ICML),
Bellevue, Washington, June 2011.
[9] P. Buneman, S. Khanna, and W. C. Tan. Why and where: A
characterization of data provenance. In ICDT, pages 316–330,
2001.
[10] R. D. Carr, S. Doddi, G. Konjevod, and M. Marathe. On the
red-blue set cover problem. In Proceedings of the 11th
Annual ACM-SIAM Symposium on Discrete Algorithms, pages
345–353, 2000.
[11] J. Cheney, L. Chiticariu, and W. C. Tan. Provenance in
databases: Why, how, and where. Foundations and Trends in
Databases, 1(4):379–474, 2009.
[12] X. Chu, I. F. Ilyas, and P. Papotti. Holistic data cleaning:
Putting violations into context. In ICDE, pages 458–469.
IEEE Computer Society, 2013.
[13] V. Chvatal. A greedy heuristic for the set-covering problem.
Mathematics of Operations Research, 4(3):233–235, 1979.
[14] G. Cong, W. Fan, F. Geerts, X. Jia, and S. Ma. Improving data
quality: Consistency and accuracy. In Proceedings of the 33rd
International Conference on Very Large Data Bases, VLDB
’07, pages 315–326. VLDB Endowment, 2007.
[15] Y. Cui, J. Widom, and J. L. Wiener. Tracing the lineage of
view data in a warehousing environment. ACM Transactions
on Database Systems, 25(2):179–227, 2000.
[16] M. Cygan, L. Kowalik, and M. Wykurz. Exponential-time
approximation of weighted set cover. Information Processing
Letters, 109(16):957–961, July 2009.
[17] M. Das, S. Amer-Yahia, G. Das, and C. Yu. Mri: Meaningful
interpretations of collaborative ratings. PVLDB,
4(11):1063–1074, 2011.
[18] A. Das Sarma, A. Jain, and D. Srivastava. I4e: Interactive investigation of iterative information extraction. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD, pages 795–806, New York, NY, USA, 2010. ACM.
[19] S. B. Davidson and J. Freire. Provenance and scientific workflows: Challenges and opportunities. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD, pages 1345–1350, New York, NY, USA, 2008. ACM.
[20] X. L. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang. Knowledge vault: A web-scale approach to probabilistic knowledge fusion. In 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2014.
[21] X. L. Dong, E. Gabrilovich, G. Heitz, W. Horn, K. Murphy, S. Sun, and W. Zhang. From data fusion to knowledge fusion. PVLDB, 2014.
[22] X. L. Dong and F. Naumann. Data fusion–resolving data conflicts for integration. PVLDB, 2009.
[23] D. Fabbri and K. LeFevre. Explanation-based auditing. PVLDB, 5(1):1–12, Sept. 2011.
[24] A. Fader, S. Soderland, and O. Etzioni. Identifying relations for open information extraction. In EMNLP, 2011.
[25] W. Fan, F. Geerts, and X. Jia. A revival of integrity constraints for data cleaning. Proc. VLDB Endow., 1(2):1522–1523, Aug. 2008.
[26] W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. Conditional functional dependencies for capturing data inconsistencies. ACM Transactions on Database Systems, 33(2):6:1–6:48, June 2008.
[27] T. A. Feo and M. G. C. Resende. A probabilistic heuristic for a computationally difficult set covering problem. Operations Research Letters, 8(2):67–71, Apr. 1989.
[28] H. Galhardas, D. Florescu, D. Shasha, and E. Simon. Ajax: An extensible data cleaning tool. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, SIGMOD, page 590, New York, NY, USA, 2000. ACM.
[29] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York, NY, USA, 1979.
[30] K. E. Gebaly, P. Agrawal, L. Golab, F. Korn, and D. Srivastava. Interpretable and informative explanations of outcomes. PVLDB, 8(1):61–72, 2014.
[31] L. Golab, H. Karloff, F. Korn, D. Srivastava, and B. Yu. On generating near-optimal tableaux for conditional functional dependencies. PVLDB, 1(1):376–390, Aug. 2008.
[32] L. Golab, H. J. Karloff, F. Korn, and D. Srivastava. Data Auditor: Exploring data quality and semantics using pattern tableaux. PVLDB, 3(2):1641–1644, 2010.
[33] T. J. Green, G. Karvounarakis, and V. Tannen. Provenance semirings. In PODS, pages 31–40, 2007.
[34] A. Gruenheid, X. L. Dong, and D. Srivastava. Incremental record linkage. PVLDB, 7(9):697–708, 2014.
[35] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA data mining software: An update. SIGKDD Explor. Newsl., 11(1):10–18, Nov. 2009.
[36] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. 2012.
[37] D. V. Kalashnikov and S. Mehrotra. Domain-independent data cleaning via analysis of entity-relationship graph. ACM Transactions on Database Systems, 31(2):716–767, June 2006.
[38] N. Khoussainova, M. Balazinska, and D. Suciu. PerfXplain: Debugging MapReduce job performance. Proc. VLDB Endow., 5(7):598–609, Mar. 2012.
[39] N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: Similarity measures and algorithms. In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, SIGMOD '06, pages 802–803, New York, NY, USA, 2006. ACM.
[40] J. Liu, S. Ji, and J. Ye. SLEP: Sparse Learning with Efficient Projections. Arizona State University, 2009.
[41] Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein. Distributed GraphLab: A framework for machine learning and data mining in the cloud. PVLDB, 5(8):716–727, Apr. 2012.
[42] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. GraphLab: A new parallel framework for machine learning. In Conference on Uncertainty in Artificial Intelligence (UAI), Catalina Island, California, July 2010.
[43] A. Meliou, W. Gatterbauer, J. Halpern, C. Koch, K. F. Moore, and D. Suciu. Causality in databases. IEEE Data Engineering Bulletin, Sept. 2010.
[44] A. Meliou, W. Gatterbauer, K. F. Moore, and D. Suciu. The complexity of causality and responsibility for query answers and non-answers. PVLDB, 4(1):34–45, 2010.
[45] A. Meliou, W. Gatterbauer, S. Nath, and D. Suciu. Tracing data errors with view-conditioned causality. In SIGMOD Conference, pages 505–516, 2011.
[46] A. Meliou, S. Roy, and D. Suciu. Causality and explanations in databases. PVLDB, 7(13):1715–1716, 2014.
[47] K. P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.
[48] A. Y. Ng. Feature selection, L1 vs. L2 regularization, and rotational invariance. In ICML, 2004.
[49] C. Parent and S. Spaccapietra. Issues and approaches of database integration. Communications of the ACM, 41(5):166–178, 1998.
[50] D. Peleg. Approximation algorithms for the label-cover-max and red-blue set cover problems. Journal of Discrete Algorithms, (1):55–64, March 2007.
[51] J. R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, Mar. 1986.
[52] E. Rahm and P. A. Bernstein. A survey of approaches to automatic schema matching. The VLDB Journal, 10(4):334–350, Dec. 2001.
[53] E. Rahm and H. H. Do. Data cleaning: Problems and current approaches. IEEE Data Engineering Bulletin, 23(4):3–13, 2000.
[54] V. Raman and J. M. Hellerstein. Potter's Wheel: An interactive data cleaning system. In Proceedings of the 27th International Conference on Very Large Data Bases, VLDB '01, pages 381–390, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.
[55] S. Roy and D. Suciu. A formal approach to finding explanations for database queries. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD, pages 1579–1590, New York, NY, USA, 2014. ACM.
[56] S. Thirumuruganathan, M. Das, S. Desai, S. Amer-Yahia, G. Das, and C. Yu. MapRat: Meaningful explanation, interactive exploration and geo-visualization of collaborative ratings. PVLDB, 5(12):1986–1989, Aug. 2012.
[57] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.
[58] K. Toutanova, D. Klein, C. D. Manning, and Y. Singer. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, NAACL, pages 173–180, Stroudsburg, PA, USA, 2003. Association for Computational Linguistics.
[59] US Department of Transportation, Federal Highway Administration. Freeway incident and weather data. http://portal.its.pdx.edu/Portal/index.php/fhwa.
[60] E. Wu and S. Madden. Scorpion: Explaining away outliers in aggregate queries. PVLDB, 6(8):553–564, June 2013.
[61] M. Yakout, A. K. Elmagarmid, J. Neville, M. Ouzzani, and I. F. Ilyas. Guided data repair. PVLDB, 4(5):279–289, Feb. 2011.
[62] X. Yin, J. Han, and P. S. Yu. Truth discovery with multiple conflicting information providers on the web. In KDD, pages 1048–1052, New York, NY, USA, 2007. ACM.
[63] B. Zhao, B. I. P. Rubinstein, J. Gemmell, and J. Han. A Bayesian approach to discovering truth from conflicting sources for data integration. PVLDB, 5(6):550–561, 2012.
APPENDIX
A. SUMMARY OF NOTATIONS
Notation        Description
e(V, P)         Element with truthfulness V and property vector P
f(P, E)         Feature with property vector P and set of elements E
E               Dataset of elements E = {e_1, ..., e_n}
F               Set of features F = {f_1, ..., f_k}; candidate diagnosis
Pr(F|E)         Causal likelihood: probability that F caused the errors in E
f.E−            The false elements in f: {e | (e ∈ f.E) ∧ (¬e.V)}
f.E+            The true elements in f: {e | (e ∈ f.E) ∧ e.V}
ε_i             The error rate of feature f_i: ε_i = |f_i.E−| / |f_i.E|
α               The a priori probability that a feature is a cause of error
SV_f            The structure vector of feature f
L_f             The level of feature f
P_i^(f_p)       Partition i of the child features of f_p
Var_i^(f_p)     The variance of error rates of features in partition P_i^(f_p)

Figure 8: Summary of notations used in the paper.
B. THEORETICAL RESULTS
T HEOREM 15 (C OMPLEXITY AND BOUNDS ). Algorithm 1 has
complexity O(|F|), and provides an O(n)-approximation of the
minimum cost diagnosis, where n is the number of elements in E
and F is the set of features that can be derived from E.
P ROOF. Let D be the diagnosis produced by Algorithm 1, and let D_OPT be the diagnosis of minimum cost that covers all the elements. Algorithm 1 compares the cost of a parent feature with that of its children, and decides to proceed if the cost decreases. That means that for every F ∈ D, Ancestors(F) ∩ D_OPT = ∅: if F', an ancestor of F, were in D_OPT, we could produce a diagnosis of lower cost than D_OPT by replacing F' with its children.
Therefore, D_OPT contains no ancestors of F ∈ D, but it could contain descendants of F instead of F itself. Let n_F be the total number of elements under feature F, and let ε be the error rate of the feature. The cost of feature F is the sum of the three cost types in Definition 9:

    C_F = log(1/α) + n_F · log(1 / (ε^ε (1−ε)^(1−ε)))

In the worst case, F can be replaced by a single descendant feature F'' with error rate 1. This means that the false and true costs of F'' are 0, and therefore its overall cost is C_F'' = log(1/α). Since log(1 / (ε^ε (1−ε)^(1−ε))) ≤ 1, the cost of F is at most O(n_F) worse than the cost of the optimal descendant set of F. Adding these worst-case costs for all features in the diagnosis D, we get an overall worst-case approximation bound of O(n), where n is the total number of elements.
Finally, Algorithm 1 accesses each feature at most once. Therefore, its time complexity is O(|F|).
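For reference, the combined per-feature cost used in this proof can be computed directly. The sketch below uses base-2 logarithms, which is what the bound log(1/(ε^ε(1−ε)^(1−ε))) ≤ 1 assumes; the exact three-part definition is given in Definition 9 (not reproduced here), so treat this as an illustration of the combined expression only.

```java
public class FeatureCost {
    /**
     * Cost of a feature with nF elements, error rate epsilon, and a priori failure
     * probability alpha:
     *   C_F = log(1/alpha) + nF * log(1 / (epsilon^epsilon * (1-epsilon)^(1-epsilon)))
     */
    public static double cost(long nF, double epsilon, double alpha) {
        double fixedCost = log2(1.0 / alpha);
        if (epsilon <= 0.0 || epsilon >= 1.0) {
            return fixedCost;          // the entropy term vanishes for a pure feature
        }
        // log(1/(e^e (1-e)^(1-e))) is exactly the binary entropy H(epsilon).
        double entropy = -epsilon * log2(epsilon) - (1.0 - epsilon) * log2(1.0 - epsilon);
        return fixedCost + nF * entropy;
    }

    private static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }
}
```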
T HEOREM 16 (T IGHTNESS). The O(n) approximation bound is tight.

P ROOF. Let F_0 be a feature that is the root of a subtree in the hierarchy. Let also F_0 have exactly two child nodes, F_1 and F_2, each containing n elements. Therefore, F_0 contains 2n elements. We assume that the error rates of features F_1 and F_2 are ε_1 = ε_2 = ε. Therefore, the cost of F_0 is

    C_0 = log(1/α) + 2n · log(1 / (ε^ε (1−ε)^(1−ε))),

and the cost of F_1 and F_2 is

    C_1 + C_2 = 2·log(1/α) + 2n · log(1 / (ε^ε (1−ε)^(1−ε))).

Therefore, DATA XR AY will select F_0 instead of its children and will terminate the descent into that part of the hierarchy. However, the optimal diagnosis can be F_1' and F_2', descendants of F_1 and F_2 respectively, with total cost 2·log(1/α). This means that the cost C_0 is O(n) worse than OPT.

The worst-case analysis of the approximation bound only materializes when the errors are distributed uniformly across the children of a node, across all dimensions. If this is not true for some dimension, then the algorithm will descend into the corresponding partition to approach the optimal solution. It is in fact very unusual for this to happen in practice, which is why DATA XR AY performs much better than G REEDY in our experimental evaluation.

C. ADDITIONAL RESULTS

Figure 9 shows how the produced diagnoses compare with the ground truth at different granularities. For each level of the hierarchy, we depict the total number of incorrect features in a diagnosis (total false positives and false negatives) for each method, normalized by the total number of features at that level. The plot presents an average of 50 executions over randomly generated datasets with 10,000 tuples and feature hierarchies of 5 levels.

DATA XR AY is the method closest to the ground truth, with only a few mistakes among features of average granularity (middle of the hierarchy). DATA AUDITOR tends to select more features at higher hierarchy levels, and this is where most of its mistakes are concentrated. R ED B LUE and G REEDY make mistakes that are more evenly distributed across the hierarchy levels. In contrast, the classification techniques (F EATURE S ELECTION and D ECISION T REE) tend to make the most mistakes in the middle of the hierarchy, with almost 60% of the features at that level being incorrectly included in or excluded from the diagnosis.

[Figure 9 plots the difference percentage from the ground truth at each feature level (0 through 4) for DataXRay, Greedy, RedBlue, DataAuditor, FeatureSelection, and DecisionTree.]

Figure 9: We evaluate how much the selected features at each hierarchy level deviate from the ground truth for each technique, over datasets of 10,000 elements. Level 0 is the root of the hierarchy, and level 4 contains the leaves (individual elements). DATA XR AY has the smallest difference from the ground truth.