HOW TO INTEGRATE LINKED DATA INTO YOUR APPLICATION

SEMANTIC TECHNOLOGY & BUSINESS CONFERENCE
SAN FRANCISCO, JUNE 5, 2012
|
LDIF Team:
Andreas Schultz, Freie Universität Berlin
Andrea Matteini, mes|semantics
Robert Isele, Freie Universität Berlin
Pablo N. Mendes, Freie Universität Berlin
Christian Becker, mes|semantics
Christian Bizer, Freie Universität Berlin
With contributions by:
Hannes Mühleisen, Freie Universität Berlin; William Smith, Vulcan Inc.
|
WHAT IS LINKED DATA?
• Raw data (RDF)
• Accessible on the web
• Data can link to other data sources
[Figure: "Thing" nodes in data sets A to E, connected by data links]
• Benefits: Ease of access and re-use; enables discovery
• One API for all data sources?
|
LINKING OPEN DATA CLOUD
[Figure: Linking Open Data cloud diagram, as of September 2011. Data set categories: Media, Geographic, Publications, User-generated content, Government, Cross-domain, Life sciences. Source: http://lod-cloud.net]
|
TYPES OF LINKED DATA
• Open, Public Data (LOD Cloud)
• Linked Enterprise Data
• Commercial Linked Data (very soon?)
... AND WHAT YOU CAN DO WITH THEM
• Provide interfaces on top of them
• Augment your website
• Integrate them into your application logic
• Create specialized data marts
|
AUGMENT YOUR WEBSITE: BBC
BBC online properties make intensive use of
data from Wikipedia and MusicBrainz
|
DATA MARTS: NEUROWIKI
• NeuroWiki creates views of gene, drug and disease data from four RDF data sources
• Provides navigation and composition tools for accessing and mining the data
|
APPLICATION LOGIC: IBM WATSON
• IBM Watson makes use of Linked Data sources such as DBpedia
Photo: http://www.flickr.com/photos/ibm_media/
|
4 STEPS TO
LINKED DATA INTEGRATION
|
STEP #1:
ACCESS LINKED DATA
• Linked Data is published via HTTP, SPARQL endpoints and RDF dumps
Architectures, with their access methods and decision factors:
• On-The-Fly Dereferencing
  • Access methods: HTTP dereferencing
  • Recency: High
  • Speed / Scalability: Low
  • Reliability: Low
  • Complexity: High
• Query Federation
  • Access methods: SPARQL
  • Recency: High
  • Speed / Scalability: Decreases exponentially as new sources are added
  • Reliability: Low
  • Complexity: Moderate with SPARQL 1.1 SERVICE clause
• Crawling and Caching
  • Access methods: HTTP dereferencing, dump import, SPARQL
  • Recency: Depends
  • Speed / Scalability: High
  • Reliability: High
  • Complexity: High
Adapted from: Linked Data: Evolving the Web into a Global Data Space (Heath/Bizer 2011)
• Live access allows quick prototyping and limited production use
• As data sets grow in size and more data sources are added, a crawling/caching architecture often becomes necessary
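A minimal sketch of the crawling and caching idea, assuming each URI can be dereferenced to a list of (subject, predicate, object) string triples. The `crawl` function, the `fake_fetch` stub and the example.org URIs are hypothetical stand-ins for illustration; a real setup would fetch over HTTP and store the results in a triple store.

```python
from collections import deque


def crawl(seed_uris, fetch, max_depth=1):
    """Breadth-first crawl: dereference each URI once and cache its triples."""
    cache = {}  # URI -> list of (subject, predicate, object) triples
    queue = deque((uri, 0) for uri in seed_uris)
    while queue:
        uri, depth = queue.popleft()
        if uri in cache:
            continue  # already dereferenced, answer from the local cache
        triples = fetch(uri)
        cache[uri] = triples
        if depth < max_depth:
            # Follow data links: objects that look like HTTP URIs.
            for _, _, obj in triples:
                if obj.startswith(("http://", "https://")):
                    queue.append((obj, depth + 1))
    return cache


# Hypothetical in-memory "web" that stands in for real HTTP dereferencing.
FAKE_WEB = {
    "http://example.org/a": [
        ("http://example.org/a", "http://example.org/knows", "http://example.org/b"),
    ],
    "http://example.org/b": [
        ("http://example.org/b", "http://example.org/name", "B"),
    ],
}


def fake_fetch(uri):
    return FAKE_WEB.get(uri, [])


cache = crawl(["http://example.org/a"], fake_fetch, max_depth=1)
```

Once the cache is filled, the application queries local data only, which is where the speed and reliability of this architecture come from; recency then depends on how often the crawl is repeated.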
|
STEP #1:
ACCESS LINKED DATA
Implementations:
• On-the-fly dereferencing
  • LDspider, SQUIN, Semantic Web Client library
• Query federation
  • SPARQL 1.1 SERVICE clause
• Crawling and Caching
  • Triplestore import script
  • Public caches (e.g. Sindice, OpenLink LOD endpoint)
  • LDIF
|
STEP #2:
NORMALIZE VOCABULARIES
Data sources that overlap in content use a wide range of vocabularies.
[Chart: Most widely used vocabularies in the LOD cloud (08/10/2011). Labelled vocabularies include mpeg7, rdfg, compass, metalex, doap, wdrs, admingeo, vann, sawsdl, sdmx, geospecies, rev, vu-wordnet, umbel, uniprot, dc, http, scovo, void, tag, dbp, bio, ore, dbo, gr, dbpedia, event, time, xsd, frbr, geonames, cc, sioc, vcard, mo, bibo, akt, xhtml, foaf, geo and skos, among others.]
Source: FU Berlin / DERI; http://www4.wiwiss.fu-berlin.de/lodcloud/state/
• Over 60% of all LOD sources use proprietary vocabularies
• It’s up to the data consumer to normalize the vocabularies
• Enterprise: Need to translate between internal and external vocabularies
|
STEP #2:
NORMALIZE VOCABULARIES
Approaches to Schema Mapping:
• Hand-crafting queries against individual sources – no different from using an API
OPTIONAL { ?ow fb:location.location.containedby [ ot:preferredLabel ?city_fb_con ] } .
OPTIONAL { ?ow dbp-prop:location ?loc. ?loc rdf:type umbel-sc:City ; ot:preferredLabel ?city_db_loc }
OPTIONAL { ?ow dbp-ont:city [ ot:preferredLabel ?city_db_cit ] }
Source: http://www.readwriteweb.com/archives/the_modigliani_test_for_linked_data.php
• Ontology Representation Languages: OWL, RDFS
• Rules: SWRL, RIF
• Query Languages
  • SPARQL CONSTRUCT clause
  • TopQuadrant SPARQLMotion
• Mosto
• R2R (part of LDIF)
|
STEP #2:
NORMALIZE VOCABULARIES
Using SPARQL:
• Rename a class
CONSTRUCT {
  ?s a mo:MusicArtist .
} WHERE {
  ?s a dbpedia-owl:MusicalArtist .
}
• Value transformation (assuming dbpedia-owl:runtime is given in seconds)
CONSTRUCT {
  ?s movie:runtime ?runtimeInMinutes .
} WHERE {
  ?s dbpedia-owl:runtime ?runtime .
  BIND(?runtime / 60 AS ?runtimeInMinutes)
}
• Create URI from literal
CONSTRUCT {
  ?s diseasome:omim ?omimuri .
  ?omimuri dc:identifier ?identifier .
} WHERE {
  ?s dbpedia-owl:omim ?omim .
  BIND(IRI(CONCAT("http://bio2rdf.org/omim:", STR(?omim))) AS ?omimuri)
  BIND(CONCAT("omim:", STR(?omim)) AS ?identifier)
}
Slide credits: Andreas Schultz
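The same three mapping patterns (rename a class, transform a value, create a URI from a literal) can be sketched outside SPARQL as a plain function over triples. This is an illustrative sketch only: the `translate` function is hypothetical, the MOVIE and DISEASOME namespaces are placeholders, and the runtime conversion assumes the source value is in seconds.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DBO = "http://dbpedia.org/ontology/"
MO = "http://purl.org/ontology/mo/"
DC = "http://purl.org/dc/elements/1.1/"
MOVIE = "http://example.org/movie/"          # placeholder namespace (assumption)
DISEASOME = "http://example.org/diseasome/"  # placeholder namespace (assumption)


def translate(triples):
    """Map source triples to the target vocabulary; unmapped triples are dropped."""
    out = []
    for s, p, o in triples:
        if p == RDF_TYPE and o == DBO + "MusicalArtist":
            # Rename a class.
            out.append((s, RDF_TYPE, MO + "MusicArtist"))
        elif p == DBO + "runtime":
            # Value transformation: seconds (assumed) to minutes.
            out.append((s, MOVIE + "runtime", float(o) / 60))
        elif p == DBO + "omim":
            # Create a URI from a literal and keep the identifier as a string.
            omim_uri = "http://bio2rdf.org/omim:" + str(o)
            out.append((s, DISEASOME + "omim", omim_uri))
            out.append((omim_uri, DC + "identifier", "omim:" + str(o)))
    return out


source = [
    ("http://example.org/artist1", RDF_TYPE, DBO + "MusicalArtist"),
    ("http://example.org/film1", DBO + "runtime", 7200),
    ("http://example.org/disease1", DBO + "omim", 104300),
]
target = translate(source)
```

Like a CONSTRUCT query, the function only emits triples in the target vocabulary, so anything without a mapping rule is silently left out.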
|
STEP #3:
RESOLVE IDENTIFIERS
Data sources that overlap in content use different identifiers for the same real-world entity.
• Most LOD sources only provide owl:sameAs links to one other data source
• It’s up to the data consumer to generate additional links
• Enterprise: Need to link both internal and external resources
Number of linked data sets per source (08/10/2011):
  Linked data sets    Number of sources
  1                   98
  2                   62
  3                   38
  4                   19
  5                   5
  6 - 10              17
  > 10                27
Source: FU Berlin / DERI; http://www4.wiwiss.fu-berlin.de/lodcloud/state/
|
STEP #3:
RESOLVE IDENTIFIERS
Approaches to Identity Resolution:
• Improvised or manual merging
• Rule-based approaches:
  • Silk (part of LDIF)
  • LIMES
[Figure: candidate entities "Union Sq., New York", "Union Sq., Seattle" and "Union Sq., San Francisco"; "Union Square" and "Union Sq." both lie at 37°47′N 122°24′W, so Union Sq. = Union Sq., San Francisco]
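A rule-based match for the Union Square example can be sketched by combining a string similarity with a geographic distance threshold. This is not Silk's or LIMES's API: the function names and thresholds are hypothetical, and the New York and Seattle coordinates are rough illustrative values.

```python
from difflib import SequenceMatcher
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two WGS84 points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def label_similarity(a, b):
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_match(entity, candidates, min_similarity=0.4, max_distance_km=1.0):
    """Return the most similar candidate that is also close enough, or None."""
    best = None
    for cand in candidates:
        sim = label_similarity(entity["label"], cand["label"])
        dist = haversine_km(entity["lat"], entity["lon"], cand["lat"], cand["lon"])
        if sim >= min_similarity and dist <= max_distance_km:
            if best is None or sim > best[0]:
                best = (sim, cand)
    return best[1] if best else None


# 37°47′N 122°24′W from the slide, expressed in decimal degrees.
entity = {"label": "Union Square", "lat": 37.7833, "lon": -122.4}
candidates = [
    {"label": "Union Sq., New York", "lat": 40.7359, "lon": -73.9911},   # approximate
    {"label": "Union Sq., Seattle", "lat": 47.61, "lon": -122.33},       # approximate
    {"label": "Union Sq., San Francisco", "lat": 37.7833, "lon": -122.4},
]
match = find_match(entity, candidates)
```

The labels alone cannot separate the three candidates, since all of them resemble "Union Square"; the distance rule is what singles out San Francisco.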
|
STEP #4:
FILTER DATA
Data sources that overlap in content provide data that is conflicting and of
varying quality.
• Data sources have...
  • ... different knowledge levels, views or intents
  • ... wrong, biased, inconsistent or outdated information
• Approaches:
  • Import data into distinct Named Graphs; query them separately using the SPARQL GRAPH clause
  • Sieve (part of LDIF)
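The Named Graph approach can be sketched with quads, where the fourth element records the graph a statement came from. The `values_by_graph` helper and the `ex:` and `graph:` names are hypothetical; the lookup plays the role that a SPARQL GRAPH clause would play against a real store.

```python
from collections import defaultdict

# (subject, predicate, object, graph): one Named Graph per data source.
quads = [
    ("ex:SanFrancisco", "ex:population", "0.7M", "graph:sourceA"),
    ("ex:SanFrancisco", "ex:population", "0.8M", "graph:sourceB"),
    ("ex:SanFrancisco", "ex:label", "San Francisco", "graph:sourceA"),
]


def values_by_graph(quads, subject, predicate):
    """Group the values of one property by the graph that asserts them."""
    result = defaultdict(list)
    for s, p, o, g in quads:
        if s == subject and p == predicate:
            result[g].append(o)
    return dict(result)


population = values_by_graph(quads, "ex:SanFrancisco", "ex:population")
```

Keeping the graph alongside each value makes the conflict visible (two sources, two populations) and leaves the choice between them to the application.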
|
LDIF – LINKED DATA INTEGRATION FRAMEWORK
Integrates Linked Data from multiple sources into a clean, local target representation while keeping track of data provenance
1. Collect data: managed download and update
2. Translate data into a single target vocabulary
3. Resolve identifier aliases into local target URIs
4. Cleanse data: resolve conflicting values
5. Output
• Follows the Crawling and Caching Architecture Pattern
• Open source (Apache License, Version 2.0)
• Collaboration between Freie Universität Berlin and mes|semantics
|
LDIF PIPELINE
1. Collect data
Supported data sources:
• RDF dumps (all common formats)
• SPARQL Endpoints
• Crawling Linked Data via HTTP
|
LDIF PIPELINE
2. Translate data (R2R)
Sources use a wide range of different RDF vocabularies
[Figure: R2R maps dbpedia-owl:City, schema:Place and fb:location.citytown to local:City]
• Simple mappings using OWL / RDFS statements (x rdfs:subClassOf y)
• Complex mappings with SPARQL expressivity
• Built-in transformation function library (XPath)
|
LDIF PIPELINE
3. Resolve identities (Silk)
Sources use different identifiers for the same entity
[Figure: candidate entities "Union Sq., New York", "Union Sq., Seattle" and "Union Sq., San Francisco"; "Union Square" and "Union Sq." both lie at 37°47′N 122°24′W, so Silk links Union Sq. = Union Sq., San Francisco]
• Automated link creation based on Link Specifications
• Supports various comparators and transformations (string similarity, basic arithmetic, time, geographical distance)
|
LDIF PIPELINE
4. Cleanse data (Sieve)
Sources provide different values for the same property
[Figure: one source states that San Francisco's population is 0.7M (rated one star), another states 0.8M (rated three stars); Sieve outputs "San Francisco population is 0.8M"]
1. Quality Assessment – assign quality scores to Named Graphs (by time, by source preference, thresholds)
2. Data Fusion – resolve conflicting property values (according to quality scores, frequency, averages)
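The two stages can be sketched as follows, assuming a simple recency-based quality score per Named Graph and a "keep the value from the best-scored graph" fusion rule. This is not Sieve's configuration or API: the function names, update dates and the one-year decay are hypothetical choices for illustration.

```python
from datetime import date


def recency_score(last_updated, today, max_age_days=365):
    """Quality assessment: 1.0 for data updated today, falling linearly to 0.0."""
    age_days = (today - last_updated).days
    return max(0.0, 1.0 - age_days / max_age_days)


def fuse(values_with_graph, graph_scores):
    """Data fusion: keep the value asserted by the highest-scored graph."""
    best_value, _ = max(values_with_graph, key=lambda vg: graph_scores[vg[1]])
    return best_value


today = date(2012, 6, 5)
# Hypothetical last-update dates for two source graphs.
last_updated = {
    "graph:sourceA": date(2012, 1, 1),
    "graph:sourceB": date(2012, 5, 6),
}
graph_scores = {g: recency_score(d, today) for g, d in last_updated.items()}

# Conflicting population values for San Francisco, one per graph.
conflicting = [(700_000, "graph:sourceA"), (800_000, "graph:sourceB")]
fused = fuse(conflicting, graph_scores)
```

Other fusion rules mentioned on the slide, such as taking the most frequent value or an average, would replace only the `fuse` step; the quality scores stay the same.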
|
LDIF PIPELINE
5. Output
Output options:
• N-Quads
• N-Triples
• SPARQL Update Stream
• Provenance tracking using Named Graphs
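To show how N-Quads carry provenance, here is a minimal serializer sketch in which the fourth term of each line names the graph the statement came from. The `to_nquad` helper and the example.org IRIs are hypothetical; the sketch handles only IRIs and plain string literals and does no IRI validation.

```python
def escape_literal(text):
    """Escape backslashes, quotes and line breaks for an N-Quads string literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def to_nquad(subject, predicate, obj, graph, obj_is_literal=False):
    """Serialize one statement as an N-Quads line; the last IRI is the graph."""
    if obj_is_literal:
        obj_term = '"%s"' % escape_literal(obj)
    else:
        obj_term = "<%s>" % obj
    return "<%s> <%s> %s <%s> ." % (subject, predicate, obj_term, graph)


line = to_nquad(
    "http://example.org/SanFrancisco",
    "http://example.org/population",
    "0.8M",
    "http://example.org/graph/sourceB",
    obj_is_literal=True,
)
```

Dropping the graph term from each line would give the corresponding N-Triples output, at the cost of losing the provenance information.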
|
LDIF ARCHITECTURE
Application Layer
• Application Code
• SPARQL or RDF API
Data Access, Integration and Storage Layer
• LDIF
  • Web Data Access Module
  • Data Translation Module
  • Identity Resolution Module
  • Data Quality and Fusion Module
• Integrated Web Data
Web of Data (accessed via HTTP)
Publication Layer
• LD Wrapper over Database A (HTTP)
• LD Wrapper over Database B (HTTP)
• RDFa from a CMS (HTTP)
• RDF/XML (HTTP)
|
VERSIONS
• In-memory
  • fast, but scalability limited by local RAM
• RDF Store (TDB)
  • stores intermediate results in a Jena TDB RDF store
  • can process more data than In-memory but doesn't scale
• Cluster (Hadoop)
  • scales by parallelizing work across multiple machines using Hadoop
  • can process a virtually unlimited amount of data
  • ready for Amazon Elastic MapReduce
|
BENCHMARKS
KEGG GENES VS. UNIPROT (CLUSTER)
300M TRIPLES
3.6B TRIPLES
|
Q&A
|
THANKS!
• Early adopters wanted!
• Website: http://bit.ly/ldifweb
• Google Group: http://bit.ly/ldifgroup
• http://mes-semantics.com
• Supported in part by
  • Vulcan Inc. as part of its Project Halo
  • EU FP7 project LOD2 - Creating Knowledge out of Interlinked Data (Grant No. 257943)
Slide credits: Andrea Matteini, Robert Isele, Andreas Schultz