Luzzu – A Framework for Linked Data Quality Assessment*

Jeremy Debattista, Christoph Lange, Sören Auer
University of Bonn & Fraunhofer IAIS
{debattis,langec,auer}@cs.uni-bonn.de

* This work is supported by the European Commission under the Seventh Framework Programme FP7 grant 601043 (http://diachron-fp7.eu).

Abstract. With the increasing adoption and growth of the Linked Open Data cloud [9], with RDFa, Microformats and other ways of embedding data into ordinary Web pages, and with initiatives such as schema.org, the Web is currently being complemented with a Web of Data. The Web of Data shares many characteristics with the original Web of Documents, including its varying quality. This heterogeneity makes it challenging to determine the quality of the data published on the Web and to subsequently make this information explicit to data consumers. The main contribution of this article is Luzzu, a quality assessment framework for Linked Open Data. Apart from providing quality metadata and quality problem reports that can be used for data cleaning, Luzzu is extensible: third-party metrics can easily be plugged into the framework. The framework does not rely on SPARQL endpoints, and is thus free of all the problems that come with them, such as query timeouts. Another advantage over SPARQL-based quality assessment frameworks is that metrics implemented in Luzzu can have more complex functionality than triple matching. Using the framework, we performed a quality assessment of a number of statistical linked datasets that are available on the LOD cloud. For this evaluation, 25 metrics from ten different dimensions were implemented.

Keywords: Data Quality, Assessment Framework, Quality Metadata, Quality Metrics

1 Introduction

With the increasing adoption of Linked Open Data [9], with RDFa, Microformats and other ways of embedding data into ordinary Web pages (cf. Web Data Commons, http://webdatacommons.org), and with initiatives such as schema.org, the Web is currently being complemented with a Web of Data. The Web of Data shares many characteristics with the original Web of Documents. As on the Web of Documents, we have varying quality on the Web of Data. Quality on the Web of Documents is usually measured indirectly using the page rank of the Web documents. This is because the quality of a document is only subjectively assessable, and thus an indirect measure, such as the number of links created by others to a certain Web page, gives a good approximation of its quality.

Figure 1. The stages of the data quality lifecycle.

On the Web of Data the situation can be deemed to be both simpler and more complex at the same time. There is a large variety of dimensions and measures of data quality [17] that can be computed automatically, so we do not have to rely on indirect indicators alone. On the other hand, the assessment of quality in terms of fitness for use [10] with respect to a certain use case is more challenging. Even datasets with quality problems might be useful for certain applications, as long as the quality is in the required range. For example, data extracted from semi-structured sources, such as DBpedia [12, 15], often contains inconsistencies as well as misrepresented and incomplete information.
However, in the case of DBpedia the data quality is perfectly sufficient for enriching Web search with facts or suggestions about general information, such as entertainment topics. In such a scenario, DBpedia can be used to show related movies and personal information when a user searches for an actor. In this case it is negligible if, in relatively few cases, a related movie or some personal facts are missing. For developing a medical application, on the other hand, the quality of DBpedia is probably insufficient, as shown in [18], since data is extracted from a semi-structured source created in a crowdsourcing effort (i.e. Wikipedia). It should be noted that even the traditional document-oriented Web has content of varying quality, but is still commonly perceived to be extremely useful.

Consequently, a key challenge is to determine the quality of datasets published on the Web and to make this quality information explicit. Moreover, quality assessment cannot be tackled in isolation, but should be considered holistically, involving the other stages of the data management lifecycle. Quality assessment on its own cannot improve the quality of a dataset, but as part of a lifecycle it can be used to continuously monitor and curate datasets as they evolve. The stages in the lifecycle depicted in Figure 1 are:

1. Metric Identification and Definition – choosing the right metrics for a dataset and the task at hand;
2. Assessment – dataset assessment based on the metrics chosen;
3. Data Repairing and Cleaning – ensuring that, following a quality assessment, a dataset is curated in order to improve its quality;
4. Storage, Cataloguing and Archiving – updating the improved dataset on the cloud whilst making the quality metadata available to the public;
5. Exploration and Ranking – finally, data consumers can explore cleaned datasets according to their quality metadata.

This article focuses on the Assessment stage of the data lifecycle. The Luzzu Quality Assessment framework for Linked Data (cf. Section 2) is proposed and discussed as a means to address the above-mentioned challenge. The framework and the experiment on its performance (cf. Section 3) are the main contributions of this paper. In order to show the framework's applicability, a number of statistical linked open datasets (available at http://270a.info) are evaluated against a number of quality metrics. The quality evaluation results are discussed in Section 4. Luzzu is compared with other similar frameworks in Section 5, whilst the concluding remarks are given in Section 6.

2 The Framework

The Luzzu Quality Assessment Framework (referred to as simply Luzzu or 'the framework') provides support for the various stages of the data quality lifecycle. This section provides a comprehensive outline of the framework, describing the main contribution of this paper. The rationale of Luzzu is to provide an integrated platform that: (1) assesses Linked Data quality using a library of generic and user-provided, domain-specific quality metrics in a scalable manner; (2) provides queryable quality metadata on the assessed datasets; and (3) assembles detailed quality reports on assessed datasets.
Furthermore, we aim to create an infrastructure that:

– can be easily extended by users through custom, domain-specific pluggable metrics, created either with a novel declarative quality metric specification language or as conventional imperative plugins;
– employs a comprehensive ontology stack for representing and exchanging all quality-related information in the assessment workflow.

The framework follows a three-step workflow, starting with the metric initialisation process (Step 1). In this step, declarative metrics defined in the Luzzu Quality Metric Language (LQML [4]) are compiled and initialised together with metrics implemented in Java. The quality assessment process is then commenced (Step 2) by sequentially streaming statements of the candidate dataset into the initialised quality metrics. Once this process is completed, the annotation step (Step 3) generates quality metadata and compiles a comprehensive quality report. This tool can be easily integrated into collections such as the Linked Data Stack [1] to enhance them with a quality assessment workflow.

Figure 2 illustrates the high-level architecture of Luzzu (GitHub: https://github.com/EIS-Bonn/Luzzu; website: http://eis-bonn.github.io/Luzzu/). The framework comprises three layers: the Communication Layer, the Assessment Layer and the Knowledge Layer. The Communication Layer exposes the framework's interfaces as a REST service. The latter two layers are described in the remainder of this section.

Figure 2. High-level architecture of the Luzzu quality framework: the Communication Layer; the Assessment Layer (Processing Unit, LQML Compilation Unit, Quality Assessment Unit); and the Knowledge Layer (Semantic Schema Layer, Annotation Unit, Operations Unit).

Figure 3. The Luzzu Ontology Stack: a generic representation level (daQ, PROV-O, Data Cube, QPRO), a specific representation level (quality metric ontologies such as DQM) and an operational level (LMI, DRMO, …).

2.1 Knowledge Layer

The Knowledge Layer is composed of three units, namely the Semantic Schema Layer, the Annotation Unit and the Operations Unit. These units assist in the provision of quality metadata and support further operations on that metadata.

Semantic Schema Layer Luzzu is based on semantic technologies and thus has an underlying semantic schema layer. This layer consists of two levels: a representation level (split into generic and specific sub-levels) and an operational level. Figure 3 shows the framework's schema stack, where the lower level comprises generic ontologies which form the foundations of the quality assessment framework, and the upper level represents the specific ontologies required for the various quality assessment tasks. Figure 4 depicts the relationships between the ontologies.

Figure 4. Key relationships in the data quality ontologies (daq:QualityGraph, qb:DataSet, daq:Observation, daq:hasObservation, daq:Metric, daq:metric, dqm:ExtensionalConciseness, lmi:LuzzuMetricJavaImplementation, lmi:referTo, qpro:QualityProblem, qpro:isDescribedBy).

Generic Representation Level – The generic representation level is domain independent and can be easily reused in similar frameworks for assessing quality. The two vocabularies described on this level are the Dataset Quality Ontology and the Quality Problem Report Ontology.
These vocabularies contribute to two objectives of Luzzu, more specifically providing queryable metadata and assembling detailed quality reports.

The rationale behind the Dataset Quality Ontology (daQ) [5] is to provide a core vocabulary that defines how quality metadata should be represented at an abstract level. It is used to attach the results of quality benchmarking of a Linked Open Dataset to the dataset itself. (All ontologies defined in Luzzu have the namespace http://purl.org/eis/vocab/{prefix}, where {prefix} is the relevant ontology prefix, e.g. daq; any updates to an ontology are automatically reflected in the ontology's webspace.) daQ is the core vocabulary of this schema layer, and any ontology describing quality metrics added to the framework (in the specific representation level) should extend it. The benefit of having an extensible schema is that quality metrics can be added to the vocabulary without major changes, as the representation of new metrics follows that of previously defined ones.

The Quality Problem Report Ontology (QPRO) enables the fine-grained description of quality problems found while assessing a dataset. It comprises the two classes qpro:QualityReport and qpro:QualityProblem. The former represents a report on the problems detected during the assessment of a dataset, whilst the latter represents the individual quality problems contained in that report.

Specific Representation Level – Complementing Luzzu, a number of Linked Data quality metrics are implemented, some of which were identified from the survey presented in [17]. These metrics are used in assessing the quality of the 270a statistical datasets [3] (cf. Section 4). The Data Quality Metric (dqm) ontology represents these metrics in a semantic fashion based on the daQ vocabulary.

Operational Level – This is the highest level in the schema stack. It comprises a number of ontologies that assist the framework in performing certain tasks, such as the Luzzu Metric Implementation ontology (LMI), a vocabulary that connects metrics specified in terms of daQ to their Java implementations.

Annotation Unit The Semantic Annotation Unit generates quality metadata and quality reports for each metric assessed. The unit is based on the two core ontologies, daQ and QPRO. With regard to quality problem reports, the unit compiles the problematic triples found during the metric assessment of a dataset into a report. The user can then feed the generated quality problem report into a data cleaning tool that supports the Quality Problem Report Ontology in order to clean the dataset.
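To make the metadata produced by the Annotation Unit more tangible, the following is a minimal sketch of how a daQ-style observation could be assembled with Apache Jena. The class and property local names (daq:Observation, daq:hasObservation, daq:metric, dqm:ExtensionalConciseness) are taken from Figure 4; the daq:value property, the exact namespace fragments and the way the observation is attached to the metric are illustrative assumptions, not the framework's actual code.

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.RDF;

public class DaqMetadataSketch {
    // Namespaces follow the http://purl.org/eis/vocab/{prefix} pattern described above;
    // the trailing '#' separator is an assumption made for illustration.
    static final String DAQ = "http://purl.org/eis/vocab/daq#";
    static final String DQM = "http://purl.org/eis/vocab/dqm#";

    public static void main(String[] args) {
        Model m = ModelFactory.createDefaultModel();
        m.setNsPrefix("daq", DAQ);
        m.setNsPrefix("dqm", DQM);

        // A metric instance typed by a dqm class (cf. Figure 4).
        Resource metric = m.createResource("urn:example:metrics/extensionalConciseness")
                .addProperty(RDF.type, m.createResource(DQM + "ExtensionalConciseness"));

        // One observation of that metric for an assessed dataset;
        // daq:value is an assumed property name for the computed score.
        Resource observation = m.createResource()
                .addProperty(RDF.type, m.createResource(DAQ + "Observation"))
                .addProperty(m.createProperty(DAQ, "metric"), metric)
                .addLiteral(m.createProperty(DAQ, "value"), 0.93);

        metric.addProperty(m.createProperty(DAQ, "hasObservation"), observation);

        m.write(System.out, "TURTLE");
    }
}
```

Because such observations live in an ordinary RDF graph, they can be queried later on, which is what makes the quality metadata "queryable" in the sense described above.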
Operations Unit The Operations Unit does not interact with the quality assessment directly. Currently, this unit provides algorithms for the day-to-day use of the quality metadata, including an algorithm for quality-driven ranking of datasets. In the spirit of "fitness for use", the proposed user-driven ranking algorithm enables users to define weights on the categories, dimensions or metrics that they deem suitable for the task at hand. The user-driven ranking is formalised as follows. Let $v : F_m \to \{\mathbb{R} \cup \mathbb{N} \cup \mathbb{B} \cup \ldots\}$ be the function that yields the value of a metric (which is most commonly a real number, but could also be an integer, a boolean, or any other simple type).

Ranking by Metric – Ranking datasets by individual metrics requires computing a weighted sum of the values of all metrics chosen by the user. Let $m_i$ be a metric, $v(m_i)$ its value, and $\theta_i$ its weight ($i = 1, \ldots, n$); then the weighted value $v(m_i, \theta_i)$ is given by:

Definition 1 (Weighted metric value). $v(m_i, \theta_i) := \theta_i \cdot v(m_i)$

The sum $\sum_{i=1}^{n} v(m_i, \theta_i)$ of these weighted values, with the same weights applied to all datasets, determines the ranking of the datasets.

Ranking by Dimension – When users want to rank datasets in a less fine-grained manner, they assign weights to dimensions. (Zaveri et al. define, for example, the dimension of licensing, which comprises the metrics "[existence of a] machine-readable indication of a license" and "human-readable indication of a license" [17].) The weighted value of a dimension D is computed by evenly applying the weight θ to each metric m in the dimension.

Definition 2 (Weighted dimension value). $v(D, \theta) := \frac{1}{\#D} \sum_{m \in D} v(m, \theta) = \frac{\theta}{\#D} \sum_{m \in D} v(m)$

Ranking by Category – A category C is defined to have one or more dimensions D. (In the classification of Zaveri et al., "Licensing" is a dimension within the category of "accessibility dimensions" [17].) Thus, similarly to the previous case, ranking on the level of categories requires distributing the weight chosen for a category over the dimensions in this category and then applying Definition 2.

Definition 3 (Weighted category value). $v(C, \theta) := \frac{1}{\#C} \sum_{D \in C} v(D, \theta)$
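A compact sketch of how Definitions 1–3 translate into code is given below. The class and method names are illustrative and are not part of the framework's API; the point is only to show that the ranking is a straightforward weighted aggregation of metric values.

```java
import java.util.List;
import java.util.Map;

/** Sketch of the user-driven ranking of Definitions 1-3 (names are illustrative). */
public class RankingSketch {

    /** Definition 1: weighted metric value v(m_i, θ_i) = θ_i · v(m_i). */
    static double weightedMetric(double metricValue, double weight) {
        return weight * metricValue;
    }

    /** Definition 2: the dimension weight is spread evenly over the dimension's metrics. */
    static double weightedDimension(List<Double> metricValues, double weight) {
        double sum = 0.0;
        for (double v : metricValues) sum += v;
        return weight * sum / metricValues.size();
    }

    /** Definition 3: average of the weighted dimension values in the category. */
    static double weightedCategory(List<List<Double>> dimensions, double weight) {
        double sum = 0.0;
        for (List<Double> d : dimensions) sum += weightedDimension(d, weight);
        return sum / dimensions.size();
    }

    /** Ranking by metric: the dataset score is the sum of weighted metric values;
     *  applying the same weights to all datasets yields the ranking. */
    static double datasetScore(Map<String, Double> metricValues, Map<String, Double> weights) {
        double score = 0.0;
        for (Map.Entry<String, Double> e : metricValues.entrySet()) {
            score += weightedMetric(e.getValue(), weights.getOrDefault(e.getKey(), 0.0));
        }
        return score;
    }
}
```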
2.2 Assessment Layer

The Assessment Layer is composed of three units, namely the Processing Unit, the LQML Compilation Unit, and the Quality Assessment Unit. These units handle the operations related to the quality assessment of a dataset.

Figure 5. The quality assessment process: a dataset and a metric selection are passed in via the Communication Layer; the Processing Unit broadcasts each quad to the initialised metrics running on separate threads; the Annotation Unit collects the metric values and problematic triples to produce the quality metadata and the quality report.

Figure 6. Streaming quads using the Jena stream process (left) and the Spark stream process (right).

The Quality Assessment Unit The Quality Assessment Unit is the most important unit of the framework. Third parties can extend the framework by creating custom metrics and plugging them into the framework. Figure 5 depicts the quality assessment process. A user selects a dataset as well as the metrics required for the quality assessment. This information is passed to the quality framework via the Communication Layer, which then invokes the assessment process. The Processing Unit is then initialised by (1) creating the necessary objects in memory and (2) initialising the chosen metrics. In Figure 5, Metric 1 is shaded out to illustrate that it was not chosen by the user for this particular process. Once the objects have been created, the stream processor fetches the dataset and streams triples or quads one by one to all initialised metrics. Since the framework has no control over the metrics' performance, the assessment of datasets is optimised by parallelising metric computation into different threads, ensuring that the different metrics are computed at the same time. After all statements have been assessed, the Annotation Unit requests the results for each metric and creates (or updates) the quality metadata graph. The Annotation Unit also requests the problematic triples and prepares a quality report that is passed back to the user via the Communication Layer. This marks the end of a successful quality assessment process.

Processing Unit The Processing Unit controls the whole execution of the quality assessment of a chosen dataset. In Luzzu, two stream processing units are implemented: one based on the Jena RDF API (http://jena.apache.org/) and the other on the Spark big data processing framework (http://spark.apache.org/). Streaming ensures scalability (since we are not limited by main memory) and parallelisability (since the parsing of a dataset can be split into several streams to be processed on different threads, cores or machines). Each data processor in the quality framework operates in three stages: (i) processor initialisation, (ii) processing, and (iii) memory clean-up. Typically, an invoked processor has three inputs: the base URI, the dataset URI, and a metric configuration file defining the metrics to be computed on the dataset. In the first stage (processor initialisation), the processor creates the necessary objects in memory to process data and loads the metrics defined in the configuration file. Once the initialisation is complete, the processing is performed by passing the streamed triples to the metrics. A final memory clean-up ensures that no unused objects take up unnecessary computational resources.

Sequential Streaming of Datasets Jena provides the possibility of streaming triples sequentially in a separate thread, implementing a producer–consumer queue. A dataset, which can be serialised in any of the typical RDF formats (RDF/XML, N-Triples, N-Quads etc.), is read directly from disk storage. Triple statements are read as string tokens, which the Jena API then transforms into triple or quad objects. Another approach to sequential stream processing is to use the Hadoop MapReduce technology (http://hadoop.apache.org/) or its in-memory equivalent Spark. The idea is to map the processing of large datasets onto multiple clusters, creating triples in the process. A simple reduce function then takes the results of the map to populate a queue that feeds the metrics. Metric computations are not 'associative' in general, which is one of the main requirements for implementing a MapReduce job; therefore, instantiated metrics are split into different threads in the master node. Figure 6 shows how the two different processors stream statements to the initialised metrics.

LQML Compilation Unit One foreseeable obstacle is that Java experts are needed to create custom metrics as traditional imperative classes following a defined interface. Therefore, our framework also provides the Luzzu Quality Metric Language (LQML), enabling knowledge engineers without Java expertise to create quality metrics in a declarative manner. Since LQML is not one of the main contributions of this paper, the reader is referred to [4] for more information.
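To give an idea of what such an imperative metric looks like, the sketch below implements a toy check in the spirit of the "Short URIs" metric against a streaming contract: the processor pushes each streamed quad into the metric, and the Annotation Unit later asks for the value. The interface and method names are hypothetical; the actual metric interface is defined in the framework's code and in [4].

```java
import org.apache.jena.graph.Node;
import org.apache.jena.sparql.core.Quad;

/**
 * Hypothetical shape of an imperative Luzzu metric. The real interface in the
 * framework may differ; this only illustrates the streaming contract described above.
 */
interface StreamedMetricSketch {
    void compute(Quad quad);   // called once per streamed statement
    double metricValue();      // final value collected by the Annotation Unit
}

/** Toy metric: ratio of subject URIs that are short (<= 80 chars) and parameter-free. */
class ShortUriMetricSketch implements StreamedMetricSketch {
    private long uris = 0;
    private long shortUris = 0;

    @Override
    public void compute(Quad quad) {
        Node s = quad.getSubject();
        if (s.isURI()) {
            uris++;
            String uri = s.getURI();
            if (uri.length() <= 80 && !uri.contains("?")) {
                shortUris++;
            }
        }
    }

    @Override
    public double metricValue() {
        return uris == 0 ? 1.0 : (double) shortUris / uris;
    }
}
```

Because each metric only sees one statement at a time, metrics of this shape can be run in parallel on separate threads, as described above.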
3 Performance Experiment

The aim of this experiment is to determine the scalability of the stream processors described in Section 2.2 with regard to dataset size. Runtime is measured for both processors against datasets ranging from 10,000 to 100,000,000 triples and against different numbers of metrics (up to 9). Since the main goal of this experiment is to measure the stream processors' performance, datasets exhibiting different kinds of quality problems are not the primary focus at this stage. We generated datasets of different sizes using the Berlin SPARQL Benchmark (BSBM) V3.1 data generator (http://wifo5-03.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/spec/). BSBM is primarily used as a benchmark to measure the performance of SPARQL queries against large datasets. We generated datasets with scale factors of 24, 56, 128, 199, 256, 666, 1369, 2089, 2785, 28453, 70812 and 284826, which translates into approximately 10K, 25K, 50K, 75K, 100K, 250K, 500K, 750K, 1M, 10M, 25M, 50M and 100M triples respectively. The result of the performance evaluation meets our expectation of a linearly scalable processor, whilst metrics affect the runtime of the stream processor. The results also confirm the assumption that big data technologies such as Spark are not beneficial for smaller datasets. All tests were run on the Google Cloud platform, with three worker clusters set up for the Spark stream processor.

Figure 7. Time vs. dataset size in triples, with no metrics initialised (Jena stream processor vs. Spark processor; both axes logarithmic).

Figure 8. Time vs. dataset size with different metric initialisations (0, 4 and 9 metrics; both axes logarithmic).

Results Figure 7 and Figure 8 show the time taken (in ms) to process datasets of different sizes; more performance evaluation results can be found at http://eis-bonn.github.io/Luzzu/evaluation.html. Figure 7 shows the time taken to process the datasets without computing any metrics, whilst Figure 8 shows the time taken to process and assess the datasets with metrics initialised (note the logarithmic scale). As expected, both processors scale linearly with regard to the number of triples. This is also observed when metrics are added to the processor. From these results we also conclude that, for up to 100M triples, the (Jena) stream processor performs better than the Spark processor. A cause for this difference is the extra overhead the Spark processor incurs in enqueuing and dequeuing triples on an external queue; however, as the number of triples increases, the performance of the two processors converges. Figure 8 shows how the stream processor fares with regard to execution time when initialised with 0, 4 and 9 metrics. The processor still scales linearly, with the execution time growing as the number of initialised metrics increases.

4 Linked Data Quality Evaluation

To showcase Luzzu in a real-world scenario, a Linked Data quality evaluation was performed on the 270a statistical linked data space.
The datasets in question are (all URIs for the 270a datasets follow the pattern http://{acronym}.270a.info):

– European Central Bank (acronym: ecb) – statistical data published by the European Central Bank;
– Food and Agriculture Organization of the United Nations (fao) – statistical data from the FAO official website, related to food and agriculture statistics for global monitoring;
– Swiss Federal Statistics Office (bfs) – Swiss statistical data related to population, national economy, health and sustainable development, amongst others;
– UNESCO Institute for Statistics (uis) – data related to education, science and technology, culture and communication, comparing over 200 countries;
– International Monetary Fund (imf) – data related to the IMF financial indicators;
– World Bank (worldbank) – data from the official World Bank API endpoint, related to social and financial projections;
– Transparency International Linked Data (transparency) – data from Excel sheets and PDF documents available on the Transparency International website, containing information about the Corruption Perceptions Index;
– Organisation for Economic Co-operation and Development (oecd) – a variety of data related to the economic growth of countries;
– Federal Reserve Board (frb) – data published by the United States central bank.

These datasets are published using the Linked SDMX principles [3]. Each of them has a SPARQL endpoint, against which a query was executed to obtain the datasets (void:Dataset) and the corresponding data dumps (void:dataDump). All datasets were then assessed in Luzzu over a number of metrics, and quality metadata was generated (the quality metadata for each dataset can be found at http://eis-bonn.github.io/Luzzu/eval270a.html).

Chosen Metrics Twenty-five metrics from ten different dimensions were implemented for the applicability evaluation (these metrics, together with any configuration, can be downloaded from http://eis-bonn.github.io/Luzzu/downloads.html). Whilst most of these metrics were identified from [17], two provenance-related metrics were additionally defined for this assessment. The W3C Data on the Web Best Practices working group has defined a number of best practices in order to support a self-sustained ecosystem of the Data Web. With regard to provenance, the working group argues that provenance information should be accessible to data consumers, who should also be able to track the origin or history of the published data [13]. In light of this statement, we defined the Basic Provenance and Extended Provenance metrics. The Basic Provenance metric checks whether a dataset, usually of type void:Dataset or dcat:Dataset, has the most basic provenance information, that is, information related to the creator or publisher, given using the dc:creator or dc:publisher properties. The Extended Provenance metric, on the other hand, checks whether a dataset has the information required to enable a consumer to track down its origin. Each dataset should have an entity containing the identification of a prov:Agent and of one or more prov:Activity instances. The identified activities should in turn have an agent associated with them, as well as a data source attached.
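The following is a minimal sketch of the kind of check the Basic Provenance metric performs, written with Apache Jena. It only inspects resources typed as void:Dataset (the metric also considers dcat:Dataset), and the use of the Dublin Core terms namespace for dc:creator and dc:publisher is an assumption for illustration.

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.ResIterator;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.RDF;

/** Sketch: every void:Dataset should carry dc:creator or dc:publisher. */
public class BasicProvenanceSketch {
    static final String VOID = "http://rdfs.org/ns/void#";
    static final String DCT = "http://purl.org/dc/terms/";

    static boolean hasBasicProvenance(Model m) {
        Property creator = m.createProperty(DCT, "creator");
        Property publisher = m.createProperty(DCT, "publisher");
        ResIterator datasets =
                m.listResourcesWithProperty(RDF.type, m.createResource(VOID + "Dataset"));
        while (datasets.hasNext()) {
            Resource dataset = datasets.next();
            if (!dataset.hasProperty(creator) && !dataset.hasProperty(publisher)) {
                return false;  // at least one dataset lacks basic provenance information
            }
        }
        return true;
    }
}
```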
Together with the dataset quality evaluation results, Table 1 shows the dimensions and metrics (marked with a reference from [17]) used on the 270a linked datasets. The metrics were implemented as described by their respective authors, except for the Misreported Content Types and No Entities as Members of Disjoint Classes metrics, where approximation techniques, more specifically reservoir sampling, were used to improve scalability [6].

Table 1. Quality evaluation results for the 270a datasets (BFS, ECB, FAO, FRB, IMF, OECD, Transparency, UIS and World Bank). Per-dataset values are reported for the following metrics, grouped by dimension: Availability – Estimated No Misreported Content Types (A4), End Point Availability (A1), RDF Availability (A2); Performance – Scalability of Data Source (P4), Low Latency (P2), High Throughput (P3), Correct URI Usage (P1); Licensing – Machine Readable License (L1), Human Readable License (L2); Interoperability – Reuse of Existing Terms (IO1), Reuse of Existing Vocabularies (IO2); Provenance – Basic Provenance, Extended Provenance; Representational Conciseness – No Prolix RDF (RC2), Short URIs (RC1); Versatility – Different Serialisations (V1), Multiple Language Usage (V2); Interpretability – Low Blank Node Usage (IN4), Defined Classes and Properties (IN3); Timeliness – Timeliness of Resource (TI2), Freshness of Dataset (TI1), Currency of Dataset (TI2); Consistency – Misplaced Classes or Properties (CS2), Estimated No Entities as Members of Disjoint Classes (CS1), Correct Usage of OWL Datatype or Object Properties (CS3).
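As mentioned above, the two estimated metrics rely on reservoir sampling so that only a bounded sample of resources needs to be dereferenced or checked. A minimal, generic sketch of such a reservoir (Algorithm R) is shown below; it illustrates the technique, not the framework's actual implementation [6].

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/**
 * Reservoir sampling (Algorithm R): keeps a fixed-size, uniformly random sample
 * of the items seen in a stream, so that only the sampled items need to be checked.
 */
public class ReservoirSketch<T> {
    private final int capacity;
    private final List<T> reservoir;
    private final Random random = new Random();
    private long seen = 0;

    public ReservoirSketch(int capacity) {
        this.capacity = capacity;
        this.reservoir = new ArrayList<>(capacity);
    }

    public void add(T item) {
        seen++;
        if (reservoir.size() < capacity) {
            reservoir.add(item);                          // fill the reservoir first
        } else {
            long j = (long) (random.nextDouble() * seen); // uniform index in [0, seen)
            if (j < capacity) reservoir.set((int) j, item);
        }
    }

    public List<T> sample() {
        return reservoir;
    }
}
```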
Results and Discussion In general, the datasets adhere to high quality standards in the consistency, representational conciseness and performance dimensions. The performance dimension relates to the efficiency of the system hosting the dataset, whilst the representational conciseness dimension deals with data clarity and compactness. In other words, the latter dimension measures whether (1) a dataset's resource URIs are short, that is, at most 80 characters long (this best practice is mentioned at http://www.w3.org/TR/chips/#uri) and free of parameters, and (2) a dataset is free of RDF collections and lists. The consistency metrics measure the extent to which a dataset is free of logical and formal errors and contradictions.

According to the Cool URIs guidelines (http://www.w3.org/TR/cooluris/), hash URIs should be used in small and stable datasets, since they provide stability with regard to retrieval performance: no extra 303 redirects are required. Slash URIs, on the other hand, should be used in large datasets. Following [7], the minimum number of triples required for a dataset to be considered large is set to 500K. The assessed Transparency data space has around 43K triples, yet only 30 of its URIs are hash URIs.

Whilst all data spaces provide basic provenance information, tracking the origin of the data source (extended provenance metric) is not possible in all datasets. For example, in the Transparency data space no provenance information is recorded. Furthermore, although provenance was recorded in the other assessed data spaces, entities might be missing some of the required attributes mentioned in the previous section. This was confirmed by a SPARQL query on the ECB dataset, where, of 460 entities, only 248 were attributable to an agent, whilst each activity contained pointers to the data source used and to the agent it is associated with.

With regard to the timeliness metrics, it was noticed that the datasets were missing the information required for the assessment. For example, the timeliness of resource metric requires the last modification time of the original data source and the last modification time of the semantic resource. Similarly, for the freshness of dataset and currency of dataset (recency) metrics, the datasets had missing meta-information, except for the Transparency and World Bank datasets in the case of the latter metric. These still obtained a relatively low quality score, and a manual investigation revealed that the datasets had not been updated for over a year.

In order to measure the interoperability metrics, some pre-processing was required. All nine datasets were tagged with a number of domains (such as Statistical, Government and Financial) and a number of vocabularies that were deemed suitable. For each domain, the Linked Open Vocabularies (LOV, http://lov.okfn.org/dataset/lov/) API was queried, and resolvable vocabularies were downloaded to Luzzu. Although one does not expect datasets to reuse all existing and available vocabularies, the reuse of vocabularies is encouraged. A SPARQL query retrieving all vocabularies referred to with the void:vocabulary property was executed against the BFS data space endpoint. It resulted in nine vocabularies, five of which were removed since they were the common OWL, RDF, RDFS, SKOS and DCMI Terms vocabularies. The remaining vocabularies were http://bfs.270a.info/property/, http://purl.org/linked-data/sdmx/2009/concept, http://purl.org/linked-data/xkos and http://www.w3.org/ns/prov. The LOV API suggested 14 vocabularies when given the terms assigned to this dataset, none of which appeared in the list of vocabularies used.
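A sketch of how such a vocabulary-listing query can be issued from Java with Jena ARQ is shown below. The endpoint path http://bfs.270a.info/sparql and the exact query shape are assumptions made for illustration; the paper only states that a query on void:vocabulary was run against the BFS endpoint.

```java
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.ResultSet;

/** Lists the vocabularies a data space declares via void:vocabulary. */
public class VocabularyQuerySketch {
    public static void main(String[] args) {
        String endpoint = "http://bfs.270a.info/sparql";  // assumed endpoint path
        String query =
            "PREFIX void: <http://rdfs.org/ns/void#> " +
            "SELECT DISTINCT ?vocab WHERE { ?ds void:vocabulary ?vocab }";
        try (QueryExecution qe = QueryExecutionFactory.sparqlService(endpoint, query)) {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                System.out.println(results.next().getResource("vocab"));
            }
        }
    }
}
```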
One possible problem with this metric is that, since these are qualitative metrics measuring quality according to a particular perspective, the dataset might have been tagged wrongly. Moreover, the metric relies on an external service, LOV, which might (1) not have all vocabularies in its catalogue, or (2) not return a vocabulary because the search term is not included in the vocabulary's description or label.

With all data dumps and SPARQL endpoint URIs present in the VoID file of every dataset, the 270a data space publishers are well aware of the importance of having the data available and obtainable (availability dimension). The datasets also fare well with regard to having a relatively low number of misreported content types for the resource URIs they use. The percentage loss in this metric is attributable to the fact that all URI resources (local, i.e. having a base URI of 270a.info, and external) that return a 200 OK HTTP status code are assessed. In fact, most of the problematic URIs found were attributable to the usage of the following three vocabularies: http://purl.org/linked-data/xkos, http://purl.org/linked-data/cube and http://purl.org/linked-data/sdmx/2009/dimension. A curl request shows that the content type returned by these is text/html; charset=iso-8859-1, even when content negotiation for semantic content types such as application/rdf+xml is enforced. When assessing the World Bank dataset, there was also one instance (more specifically http://worldbank.270a.info/property/implementing-agency) that the metric incorrectly reported as a problematic resource URI. The resource in question took an unusually long time to resolve, which might have resulted in a timeout.

All data spaces have a machine-readable license (licensing dimension) in their VoID file, more specifically for those datasets (void:Dataset) that contain observation data. The datasets that describe metadata, such as http://bfs.270a.info/dataset/meta, have no machine-readable license attributed to them, and such cases decreased the quality result for this metric. On the other hand, the datasets had no human-readable licenses in their descriptions. The corresponding metric verifies whether a human-readable text stating the licensing model attributed to the resource has been provided as part of the dataset. In contrast to the machine-readable license metric, literal objects (attached to properties such as rdfs:label, dcterms:description and rdfs:comment) are analysed for terms related to licensing.

The versatility dimension refers to the availability of data in different representations, for both human and machine consumption. With regard to human consumption (multiple language usage), the BFS dataset supports four languages on average, the highest of all assessed datasets. This metric checks all literals that have a language tag. Data spaces also offer different serialisations (attributed via void:feature – two formats for each data space), but only one dataset in each data space (e.g. http://bfs.270a.info/dataset/bfs) has this property as part of its VoID description. If, for example, we take the OECD data space, there are 140 defined VoID datasets but only one of them has the void:feature property defined.

The interpretability dimension refers to the notation of data, that is, whether a machine can process it. A dataset should therefore have low blank node usage, since blank nodes cannot be resolved and referenced externally. In this regard, all datasets have low usage, and blank nodes are generally used to describe void:classPartition resources in the VoID metadata. The datasets also use classes and properties that are defined in ontologies and vocabularies, with an occasional usage of the typical http://example.org, such as http://example.org/XStats#value, found in all data spaces.
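As an illustration of the kind of check behind the low blank node usage metric discussed above, the sketch below computes the share of non-blank subject and object nodes in a Jena model. It is a simplified, in-memory illustration; the framework itself computes its metrics over streamed statements.

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.Statement;
import org.apache.jena.rdf.model.StmtIterator;

/** Sketch: ratio of subject/object positions that are not blank nodes. */
public class BlankNodeUsageSketch {
    static double nonBlankRatio(Model model) {
        long nodes = 0, blank = 0;
        StmtIterator it = model.listStatements();
        while (it.hasNext()) {
            Statement s = it.next();
            nodes += 2;                               // one subject and one object per statement
            if (s.getSubject().isAnon()) blank++;
            if (s.getObject().isAnon()) blank++;
        }
        return nodes == 0 ? 1.0 : 1.0 - (double) blank / nodes;
    }
}
```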
5 Related Work

In this section we introduce similar tools that assess the quality of Linked Data. Table 2 summarises the tools discussed in this section. As the table shows, Luzzu, with its focus on scalability, extensibility and quality metadata representation, fills a gap in the space of related work.

Table 2. Functional comparison of Linked Data quality tools (Flemming, LinkQA, Sieve, RDFUnit, LiQuate, Triple Check Mate and Luzzu) along the following features: scalability, extensibility (e.g. via SPARQL, XML configuration, Bayesian rules, crowdsourcing, or Java and LQML in the case of Luzzu), quality metadata, quality report format (HTML and/or RDF), collaboration, cleaning support, and year of last release (2010–2015).

Flemming [7] provides a simple web user interface and a walk-through guide that helps a user assess the data quality of a resource using a set of defined metrics. Metric weights can be customised by the user, though no new metrics can be added to the system. Luzzu, in contrast, enables users to create custom metrics, including complex quality metrics that can be further customised through configuration, such as providing a list of trustworthy providers. In contrast to Luzzu, Flemming's tool outputs the results and problems as unstructured text.

LinkQA [8] is an assessment tool that measures the quality (and changes in quality) of a dataset using network analysis measures. The authors provide five network measures, namely degree, clustering coefficient, centrality, sameAs chains, and descriptive richness through sameAs. Similarly to Luzzu, new metrics can be integrated into LinkQA, but these metrics are limited to topological measures. LinkQA provides HTML reports to represent the assessment findings. Although structured, such reports cannot be attached to LOD datasets in a way that would allow assessment values to be queried later on.

In Sieve [14], metadata about named graphs is used to assess data quality, and assessment metrics are declaratively defined by users through an XML configuration. In such configurations, scoring functions (which can also be extended or customised) can be applied to a single metric or to an aggregate of metrics. Whereas Luzzu provides a quality problem report that can be imported into third-party cleaning software, Sieve provides a data cleaning process in which data is cleaned based on a user configuration. The quality assessment tool is part of the LDIF Linked Data Integration Framework, which supports Hadoop.

Similarly to Flemming's tool, LiQuate [16] provides a guided walkthrough to view pre-computed datasets. LiQuate is a quality assessment tool based on Bayesian networks, which analyse the quality of datasets in the LOD cloud whilst identifying potential quality problems and ambiguities. This probabilistic model is used in LiQuate to explore the assessed datasets for completeness, redundancies and inconsistencies. Data experts are required to identify rules for the Bayesian network.

Triple check mate [18] is mainly a crowdsourcing tool for quality assessment, supported by a semi-automatic verification of quality metrics.
With the crowdsourcing approach, certain quality problems (such as semantic errors) might be detected more easily by human evaluators than by computational algorithms. The proposed semi-automated approach, on the other hand, makes use of reasoners and machine learning to learn characteristics of a knowledge base.

RDFUnit [11] provides test-driven quality assessment for Linked Data. In RDFUnit, users define quality test patterns based on SPARQL query templates. Similarly to Luzzu and Sieve, this gives users the opportunity to adapt the quality framework to their needs. The focus of RDFUnit is on checking integrity constraints expressed as SPARQL patterns; users thus have to understand the different SPARQL patterns that represent these constraints. Quality is assessed by executing the custom SPARQL queries against dataset endpoints. In contrast, Luzzu does not rely on SPARQL querying to assess a dataset, and can therefore compute more complex processes (such as checking the dereferenceability of resources) on the dataset triples themselves. Test case results from an RDFUnit execution, both quality values and quality problems, are stored and represented as Linked Data and visualised as HTML. The main difference between Luzzu and RDFUnit in quality metadata representation is that the daQ ontology enables a more fine-grained and detailed representation of quality metrics. For example, representing quality metadata with daQ makes it possible to capture the change of a metric value over time.

6 Conclusions

Data quality assessment is crucial for the wider deployment and use of Linked Data. In this paper we presented an approach for a scalable and easy-to-use Linked Data quality assessment framework. The performance experiment showed that, although there is a margin for improvement, Luzzu has very attractive performance characteristics. In particular, quality assessment with the proposed framework scales linearly with respect to dataset size, and adding additional (domain-specific) metrics adds a relatively small overhead. Thus, Luzzu effectively supports Big Data applications. Beyer and Laney coined the definition of Big Data as high Volume, high Velocity and high Variety [2]. Volume means large amounts of data; velocity addresses how much information is handled in real time; variety addresses data diversity. The implemented framework currently scales for both Volume and Variety. With regard to Volume, the processor runtime grows linearly with the number of triples. We also cater for Variety, since in Luzzu the results are not affected by data diversity: as we support the analysis of all kinds of data represented as RDF, any data schema and even various data models are supported as long as they can be mapped to or encoded in RDF (e.g. relational data with R2RML mappings). Velocity completes the Big Data definition. Currently we employ Luzzu for quality assessment at well-defined checkpoints rather than in real time. However, due to its streaming nature, Luzzu can easily assess the quality of data streams as well, thus catering for Velocity.

We see Luzzu as the first step on a long-term research agenda aiming to shed light on the quality of data published on the Web. With regard to the future of Luzzu, we are currently investigating a number of techniques that would enable the framework to be better supported in big data architectures.
We are also conducting a thorough investigation of the quality of the Linked Datasets available in the LOD cloud.

References

1. Auer, S. et al.: Managing the Life-Cycle of Linked Data with the LOD2 Stack. In: Proceedings of the International Semantic Web Conference (ISWC 2012), 2012.
2. Beyer, M. A., Laney, D.: The Importance of 'Big Data': A Definition.
3. Capadisli, S., Auer, S., Ngomo, A. N.: Linked SDMX Data: Path to High Fidelity Statistical Linked Data. Semantic Web 6(2), 105–112 (2015).
4. Debattista, J., Lange, C., Auer, S.: Luzzu Quality Metric Language – A DSL for Linked Data Quality Assessment. CoRR (2015).
5. Debattista, J., Lange, C., Auer, S.: Representing Dataset Quality Metadata using Multi-Dimensional Views. 2014, pp. 92–99.
6. Debattista, J. et al.: Quality Assessment of Linked Datasets using Probabilistic Approximation. CoRR (2015). To appear in the ESWC 2015 proceedings.
7. Flemming, A.: Quality Characteristics of Linked Data Publishing Datasources. MA thesis, Humboldt-Universität zu Berlin, Institut für Informatik, 2011.
8. Guéret, C. et al.: Assessing Linked Data Mappings Using Network Measures. In: ESWC 2012.
9. Jentzsch, A., Cyganiak, R., Bizer, C.: State of the LOD Cloud. Version 0.3, 19 September 2011. http://lod-cloud.net/state/ (visited on 2014-08-06).
10. Juran, J. M.: Juran's Quality Control Handbook. 4th ed. McGraw-Hill, 1974.
11. Kontokostas, D. et al.: Test-driven Evaluation of Linked Data Quality. In: Proceedings of the 23rd International Conference on World Wide Web (WWW 2014), 2014.
12. Lehmann, J. et al.: DBpedia – A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia. Semantic Web Journal (2014).
13. Lóscio, B. F., Burle, C., Calegari, N.: Data on the Web Best Practices – W3C First Public Working Draft. Technical report, W3C, 2015. http://www.w3.org/TR/dwbp/.
14. Mendes, P. N., Mühleisen, H., Bizer, C.: Sieve: Linked Data Quality Assessment and Fusion. In: 2012 Joint EDBT/ICDT Workshops, pp. 116–123. ACM, 2012.
15. Morsey, M. et al.: DBpedia and the Live Extraction of Structured Data from Wikipedia. Program: Electronic Library and Information Systems 46 (2012), p. 27.
16. Ruckhaus, E., Baldizán, O., Vidal, M.-E.: Analyzing Linked Data Quality with LiQuate. In: OTM 2013 Workshops, 2013.
17. Zaveri, A. et al.: Quality Assessment Methodologies for Linked Open Data. Semantic Web Journal (2014).
18. Zaveri, A. et al.: User-driven Quality Evaluation of DBpedia. In: Proceedings of the 9th International Conference on Semantic Systems (I-SEMANTICS '13), pp. 97–104. ACM, 2013.