article

Databases and data
modelling
Lecturer: Jiaheng Lu
Spring 2016
www.helsinki.fi
10.10.2016
1
Data storage and history
Before-1950s Data was stored as paper records
Lot of time was wasted. e.g. when searching.
Therefore inefficient.
www.helsinki.fi
Magnetic tapes and hard disk
• 1950s and early 1960s: Data processing using
magnetic tapes for storage
• Late 1960s and 1970s: Hard disks allow direct
access to data
•
• Data stored in files
Matemaattis-luonnontieteellinen tiedekunta /
Iso tiedonhallinta/
Jiaheng Lu
www.helsinki.fi
10.10.2016
3
Drawbacks of file system
• Each program has its own data format
• Programs are written in different languages, and so
cannot easily access each other’s files.
• Any new requirement needs a new program
Matemaattis-luonnontieteellinen tiedekunta /
Iso tiedonhallinta/
Jiaheng Lu
www.helsinki.fi
10.10.2016
4
Database Approach
•
•
•
•
•
1960’s Network databases
1970’s Relational databases
1990’s Object-oriented and object-relational
1995+ XML, Mobile, GeoDB, Embedded DB
2005+ NoSQL DB, NewSQL DB
Matemaattis-luonnontieteellinen tiedekunta /
Iso tiedonhallinta/
Jiaheng Lu
www.helsinki.fi
10.10.2016
5
History of databases: Turing awards
1973 Charles W. Bachman
1998 Jim Gray
1981 Edgar F. Codd
2014 www.helsinki.fi
Michael Stonebraker
6
History of databases: Turing awards
Object-relational
model, column
stores,…Modern
databases
Distributed
databases and
transaction
Network
databases
Relational
databases
2014 Michael Stonebraker
1998 Jim Gray
1981 Edgar F. Codd
1973 Charles W. Bachman
www.helsinki.fi
7
Data model describes the
characteristics
• Structure
• Operations
• Constraints
Matemaattis-luonnontieteellinen tiedekunta /
Iso tiedonhallinta/
Jiaheng Lu
www.helsinki.fi
10.10.2016
8
Data model describes the
characteristics
• Structure
(John, 32, Male)
(Mary, 27, Female)
(Anna, 57, Female)
• Operations
• Constraints
Matemaattis-luonnontieteellinen tiedekunta /
Iso tiedonhallinta/
Jiaheng Lu
www.helsinki.fi
10.10.2016
9
Data model describes the
characteristics
• Structure
• Operations
E.g. Join and Union
• Constraints
Matemaattis-luonnontieteellinen tiedekunta /
Iso tiedonhallinta/
Jiaheng Lu
www.helsinki.fi
10.10.2016
10
Data model describes the
characteristics
• Structure
• Operations
• Constraints
Matemaattis-luonnontieteellinen tiedekunta /
Iso tiedonhallinta/
Jiaheng Lu
Age is never negative and
always smaller than 150
www.helsinki.fi
10.10.2016
11
Two structures of data files
Structure 1
(John, 32, Male)
(Mary, 27, Female)
(Anna, 57, Female)
Structure 2
(John, 32, Male, Engineer)
(Mary, 27, Female, Doctor)
(Anna, 57, Female, Teacher)c
Matemaattis-luonnontieteellinen tiedekunta /
Iso tiedonhallinta/
Jiaheng Lu
www.helsinki.fi
10.10.2016
12
Examples of operations
• Subsetting
• Given a condition and a set of data, find a subset of
data which satisfy the condition
• Substructure extracting
• Extract from each data item a part of structure as
specified by a condition
• Union and Join
Matemaattis-luonnontieteellinen tiedekunta /
Iso tiedonhallinta/
Jiaheng Lu
www.helsinki.fi
10.10.2016
13
Types of constraints
• Value constraints
• Age is never negative
• Uniqueness constraints
• Any course can have only one ID
• Cardinality constraints
• Each person can have at most three blood pressure
records in the system
Matemaattis-luonnontieteellinen tiedekunta /
Iso tiedonhallinta/
Jiaheng Lu
www.helsinki.fi
10.10.2016
14
Types of constraints (cont’d)
• Type constraint
• Age is an integer
• Domain constraint
• Month is between 1 to 12
• Structure constraints
• Put constraints on structure rather than data values
Matemaattis-luonnontieteellinen tiedekunta /
Iso tiedonhallinta/
Jiaheng Lu
www.helsinki.fi
10.10.2016
15
Relational models
• A collection of tables
• No duplicates tuples
• Dissimilar tuples disallowed
(John, 32, Male, Engineer)
(Mary, 27, Female, Doctor)
(Anna, 57, Female, Teacher)
(23, 87, Computer science, Teacher)
Matemaattis-luonnontieteellinen tiedekunta /
Iso tiedonhallinta/
Jiaheng Lu
www.helsinki.fi
10.10.2016
16
Relational models
• A collection of tables
• No duplicates tuples
• Dissimilar tuples disallowed
(John, 32, Male, Engineer)
(Mary, 27, Female, Doctor)
(Anna, 57, Female, Teacher)
(23, 87, Computer science, Teacher)
Matemaattis-luonnontieteellinen tiedekunta /
Iso tiedonhallinta/
Jiaheng Lu
www.helsinki.fi
10.10.2016
17
Relational models (cont’d)
• A collection of tables
• No duplicates tuples
• Dissimilar tuples disallowed
• Foreign keys and primary keys
Matemaattis-luonnontieteellinen tiedekunta /
Iso tiedonhallinta/
Jiaheng Lu
www.helsinki.fi
10.10.2016
18
Join
• Join operation for two tables
(John, 32, Male, Engineer)
(Mary, 27, Female, Doctor)
(Anna, 57, Female, Teacher)
(John, Married, )
(Mary, Single)
(Anna, Married,)
(John, 32, Male, Engineer Married, )
(Mary, 27, Female, Doctor Single)
(Anna, 57, Female, Teacher Married,)
Matemaattis-luonnontieteellinen tiedekunta /
Iso tiedonhallinta/
Jiaheng Lu
www.helsinki.fi
10.10.2016
19
Semi-structured model: XML
• eXtensible Markup Language
• Markup language for documents containing
structured information
www.helsinki.fi
XML….
• Based on Standard Generalized Markup
Language (SGML)
• Version 1.0 introduced by World Wide Web
Consortium (W3C) in 1998
• Bridge for data exchange on
the Web
www.helsinki.fi
A Simple XML Document
Freely definable tags
<article>
<author>Gerhard Weikum</author>
<title>The Web in Ten Years</title>
<text>
<abstract>In order to evolve...</abstract>
<section number=“1” title=“Introduction”>
The <index>Web</index> provides the universal...
</section>
</text>
</article>
www.helsinki.fi
22
April 29th, 2003
A SimpleStart
XML
Document
Tag
<article>
<author>Gerhard Weikum</author>
Content of
<title>The Web in Ten Years</title>
the Element
<text>
(Subelements
<abstract>In order to
and/or Text)
evolve...</abstract>
End Tag
<section number=“1” title=“Introduction”>
The <index>Web</index> provides the
universal...
</section>
</text>
</article>
www.helsinki.fi
23
April 29th, 2003
A Simple XML Document
<article>
<author>Gerhard Weikum</author>
<title>The Web in Ten Years</title>
<text>
<abstract>In order to evolve...</abstract>
<section number=“1” title=“Introduction”>
The <index>Web</index> provides the universal...
</section>
Attributes with
name and value
</text>
</article>
www.helsinki.fi
24
April 29th, 2003
Elements in XML Documents
• (Freely definable) tags: article, title, author
•
with start tag: <article> etc.
•
and end tag: </article> etc.
• Elements: <article> ... </article>
• Elements have a name (article) and a content
(...)
• Elements may be nested.
• Elements may be empty: <this_is_empty/>
www.helsinki.fi
25
Elements vs. Attributes
What is the difference between elements and
attributes?
• Only one attribute with a given name per element
(but an arbitrary number of subelements)
• Attributes have no structure, simply strings (while
elements can have subelements)
As a rule of thumb:
• Content into elements
• Metadata into attributes
Example:
<person born=“1912-06-23“ died=“1954-06-07“>
Alan Turing</person> proved that…
www.helsinki.fi
26
XML Documents as Ordered Trees
article
author
title
text
number=“1“
abstract
Gerhard
Weikum
section
title=“…“
In order …
The Web
in 10 years
The
index
Web
www.helsinki.fi
27
provides …
Document Type Definitions (DTD)
• An XML document may have an optional DTD.
• DTD serves as grammar for the underlying
XML document, and it is part of XML language.
• DTD has the form:
<!DOCTYPE name [markupdeclaration]>
www.helsinki.fi
DTD (cont’d)
• Consider an XML document:
<db><person><name>Alan</name>
<age>42</age>
<email>[email protected] </email>
</person>
<person>………</person>
……….
</db>
www.helsinki.fi
DTD (cont’d)
• DTD for it might be:
<!DOCTYPE db [
<!ELEMENT db (person*)>
<!ELEMENT person (name, age, email)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT age (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<! AttributeListemail (#PCDATA)>
]>
www.helsinki.fi
DTD (cont’d)
Occurrence Indicator:
Indicator
Occurrence
(no indicator)
Required
?
Optional
*
Optional,
repeatable
Required,
repeatable
+
One and only
one
None or one
None, one, or
more
One or more
www.helsinki.fi
Important attribute types
• There are ten attribute types
• These are the most important ones:
• CDATA
The value is character data
• (man|woman|child)
The value is one from this list
• ID
The value is a unique identifier
‒ ID values must be legal XML names and must be unique within the
document
www.helsinki.fi
More attribute types
• IDREF
The ID of another element
• IDREFS
A list of other IDs
• ENTITY
An entity
• ENTITIES
A list of entities
• NOTATION
A notation
www.helsinki.fi
Requirements
• Recall that an attribute has the form
<!ATTLIST element-name name type requirement>
• The requirement is one of:
• A default value, enclosed in quotes
‒ Example: <!ATTLIST degree CDATA "PhD">
• #REQUIRED
‒ The attribute must be present
• #IMPLIED
‒ The attribute is optional
• #FIXED "value"
‒ The attribute always has the given value
‒ If specified in the XML, the same value must be used
www.helsinki.fi
Another example: XML
<?xml version="1.0"?>
<!DOCTYPE weatherReport SYSTEM
"http://www.mysite.com/mydoc.dtd">
<weatherReport>
<date>05/29/2002</date>
<location>
<city>Philadelphia</city>, <state>PA</state>
<country>USA</country>
</location>
<temperature-range>
<high scale="F">84</high>
<low scale="F">51</low>
</temperature-range>
</weatherReport>
www.helsinki.fi
The DTD for this example
<!ELEMENT weatherReport (date, location,
temperature-range)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT location (city, state, country)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT state (#PCDATA)>
<!ELEMENT country (#PCDATA)>
<!ELEMENT temperature-range
((low, high)|(high, low))>
<!ELEMENT low (#PCDATA)>
<!ELEMENT high (#PCDATA)>
<!ATTLIST low scale (C|F) #REQUIRED>
<!ATTLIST high scale (C|F) #REQUIRED>
www.helsinki.fi
Operations on XML documents
• GetParent
• GetChildren
• GetSibling
• Root-node path
• Query needs the tree traversal
Matemaattis-luonnontieteellinen tiedekunta /
Iso tiedonhallinta/
Jiaheng Lu
www.helsinki.fi
10.10.2016
37
Querying XML with XPath and
XQuery
XPath and XQuery are query languages for XML data, both
standardized by the W3C and supported by various database products.
A query result is a set of qualifying nodes, paths, subtrees,
or subgraphs from the underyling data graph,
or a set of XML documents constructed from this raw result.
38
www.helsinki.fi
XPath
• XPath is a simple language to identify parts of the
XML document (for further processing)
• XPath operates on the tree representation of the
document
• Result of an XPath expression is a set of elements
or attributes
• Discuss abbreviated version of XPath
www.helsinki.fi
39
Elements of XPath
• An XPath expression usually is a location path that consists of
location steps, separated by /:
/article/text/abstract: selects all abstract elements
• A leading / always means the root element
• Each location step is evaluated in the context of a node in the tree,
the so-called context node
• Possible location steps:
•
•
•
•
child element x: select all child elements with name x
Attribute @x: select all attributes with name x
Wildcards * (any child), @* (any attribute)
Multiple matches, separated by |: x|y|z
www.helsinki.fi
40
Combining Location Steps
• Standard: / (context node is the result of the
preceding location step)
article/text/abstract (all the abstract nodes of
articles)
• Select any descendant, not only children: //
article//index (any index element in articles)
• Select the parent element: ..
• Select the content node: .
The latter two are important when using predicates.
www.helsinki.fi
41
Predicates in Location Steps
• Added with [] to the location step
• Used to restricts elements that qualify as result of a
location step to those that fulfil the predicate:
• a[b] elements a that have a subelement b
• a[@d] elements a that have an attribute d
• Plus conditions on content/value:
‒ a[b=„c“]
‒ A[@d>7]
‒ <, <=, >=, !=, …
www.helsinki.fi
42
XPath by Example
/literature/book/author
retrieves all book authors:
/literature/(book|article)/author
/literature/*/author
/literature//author
/literature//@year
authors of books or articles
authors of books, articles, essays, etc.
authors that are descendants of literature
value of the year attribute of descendants of literature
/literature//author[firstname]
authors that have a subelement firstname
/literature/book[author//country = “Germany“] books with German author
www.helsinki.fi
43
Semi-structure model: JSON
• JSON is a data interchange format
• Interactive Web 2.0 applications, no more use page
replacement. Data transfer without refreshing a
page.
• The most important aspects of data transfer are
simplicity, extensibility, interoperability, openness and
human readability
• Key idea in AJAX – Asynchronous Java Script and
XML.
www.helsinki.fi
How does it work?
• JSON is a subset of Java Script. JSON can be
parsed by a Java Script parser.
• It can represent either complex or simple data as it
has data types
• They are Strings, Number, Boolean, Objects and
Arrays
• E.g. of Object:
• { "name": "Jack Nimble", "format": { "type": "rect",
"width": 120, "interlace": false}}
www.helsinki.fi
• An array can be shown as
• ["Sunday", "Monday", "Tuesday", "Wednesday“]
• All data types are intuitive and similar to other programming
languages
• Also compatible with other languages like C, C++, C#,
ColdFusion, Python and many more.
www.helsinki.fi
Graph data model
Nodes table:
1. John
2. Mary
3. Anna
John
Friend
Anna
Mary
Sister
Matemaattis-luonnontieteellinen tiedekunta /
Henkilön nimi / Esityksen nimi
Edge table:
John, Mary,Friend
Mary,Anna,Sister
www.helsinki.fi
10.10.2016
47
Graphs from the Real World
Königsberg's Bridges
Ref: http://en.wikipedia.org/wiki/Seven_Bridges_of_K%C3%B6nigsberg
www.helsinki.fi
Graphs from the Real Word
Webpage Hyperlink Graph
Directed Communities
Network of Word Associations
Overlapping Communities
www.helsinki.fi
Operations on graphs
• GetNeighbour
• ShortestPath(A,B)
• Community detection
Matemaattis-luonnontieteellinen tiedekunta /
Iso tiedonhallinta/
Jiaheng Lu
www.helsinki.fi
10.10.2016
50
Applications of Community
Detection
• Website mirror server assignment
• Recommendation system
• Social network role detection
• Functional module in biological networks
• Graph coarsening and summarization
• Network hierarchy inference
www.helsinki.fi
Defining Communities
• Intuition: There are more edges inside a
community than edges connected with
the rest of the graph
www.helsinki.fi
General Challenges
• Many clustering problems are NP-hard.
Even polynomial time approaches may
be too expensive
• Call for scalable solutions
• Concepts of “cluster”, “community” are
not quantitatively well defined
www.helsinki.fi