Distributed Data Management - Databases and Information Systems

Distributed Data Management
Summer Semester 2015
TU Kaiserslautern
Prof. Dr.-Ing. Sebastian Michel
Databases and Information Systems
Group (AG DBIS)
http://dbis.informatik.uni-kl.de/
Distributed Data Management, SoSe 2015, S. Michel
General Note
• For exercise sheet 2 we have put some data files online for
download.
• To access these outside the university network,
you need the following login information.
– Username: ddm
– Password:
• Remember this login; it is also needed for forthcoming
protected downloads.
Map Reduce from High Level
[Figure: the input DATA is split across several MAP tasks; their intermediate results are passed to several REDUCE tasks, each of which produces a result.]
Map and Reduce: Types
• Map: (k1, v1) → list(k2, v2)
• Reduce: (k2, list(v2)) → list(k3, v3)
(keys allow grouping data to machines/tasks)
• For instance:
– k1= document identifier
– v1= document content
– k2= term
– v2=count
– k3= term
– v3= final count
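As a minimal sketch of these types, the following plain Python simulation (not the Hadoop API) counts terms in documents. The names `map_fn`, `reduce_fn`, and `run_job` and the sample documents are made up for illustration.

```python
from collections import defaultdict

def map_fn(doc_id, content):
    # (k1, v1) -> list(k2, v2): emit (term, 1) for every term in the document
    return [(term, 1) for term in content.split()]

def reduce_fn(term, counts):
    # (k2, list(v2)) -> list(k3, v3): sum the partial counts of one term
    return [(term, sum(counts))]

def run_job(documents):
    groups = defaultdict(list)  # stands in for the shuffle: group values by key
    for doc_id, content in documents.items():
        for term, count in map_fn(doc_id, content):
            groups[term].append(count)
    result = []
    for term in sorted(groups):  # keys reach the reducer in sorted order
        result.extend(reduce_fn(term, groups[term]))
    return result

print(run_job({"d1": "a b a", "d2": "b c"}))
# [('a', 2), ('b', 2), ('c', 1)]
```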
MR Key Principle
• Many data chunks
• Map function on each of the chunks
• Map process outputs data with keys
=> Partitions based on keys
• Aggregate (i.e., reduce) mapped data per key
• Output is written to the (distributed) file system
• E.g., count the number of occurrences of each term in
a set of documents.
MR: Example of Principle
• For instance, CSV file in file system, e.g.,
weather.csv of temperature readings
2/12/2004;64;5;2.46
9/6/2006;80;14;10.15
6/1/2002;9;16;16.01
10/30/2014;73;19;23.81
8/30/2002;64;4;16.16
1/29/2007;40;24;-2.16
11/10/2012;85;10;12.20
…..
with date;station_id;hour_of_day;temp
MR: Example of Principle (2)
• The Mapper is responsible for “parsing” the lines
• Simple example: get all tuples from 2014, like in the
“grep” example. No reducer. E.g., we have
File 1
11/24/2014;21;3;-0.47
3/13/2014;40;6;12.79
10/14/2014;26;22;22.41
2/5/2014;17;12;7.87
File 2
11/1/2014;84;1;4.62
2/24/2014;35;13;-2.44
11/17/2014;59;17;26.31
6/9/2014;23;13;23.60
File 3
2/24/2014;11;11;6.80
11/17/2014;12;2;4.85
10/8/2014;3;9;12.71
8/28/2014;33;12;7.27
……..
One file per Mapper is
written directly to the file
system; no partitioning
by key, no sorting.
But let’s add a reducer
now ….
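The map-only filter could be sketched as follows in plain Python (a simulation, not Hadoop code). The function name `map_2014` is hypothetical, and the assumption is that dates are written as month/day/year, as in the sample lines.

```python
def map_2014(line):
    # Parse one CSV line of the form date;station_id;hour_of_day;temp
    date, station_id, hour_of_day, temp = line.strip().split(";")
    year = date.split("/")[2]  # assumption: dates are month/day/year
    # Emit the line unchanged if it is from 2014, otherwise emit nothing
    return [line.strip()] if year == "2014" else []

lines = ["2/12/2004;64;5;2.46", "10/30/2014;73;19;23.81", "8/30/2002;64;4;16.16"]
output = [out for line in lines for out in map_2014(line)]
print(output)
# ['10/30/2014;73;19;23.81']
```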
MR: Example of Principle (3)
• Let’s say we are interested in the average
temperature for each hour of the day, in 2014.
• This can be done by the reducer (after the
mapper), which we add now. So the mapper
“sends” stuff to the reducer. Sorted by key.
• Say, we have two reducers (= number of
partitions)
– Partitions of tuples are created (by default) using
key.hashCode() % number_of_partitions
– So there is in general more than one group of
hour_of_day in a partition.
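A small sketch of why one partition holds several groups, in plain Python. The function `partition` is a toy stand-in for the hash-based rule: for integer keys it uses the value itself instead of a real hash code so that the result is reproducible. Hadoop's default HashPartitioner additionally masks the sign bit before taking the modulo, so that the partition number is never negative.

```python
def partition(key, number_of_partitions):
    # Toy stand-in for key.hashCode() % number_of_partitions:
    # the integer key itself plays the role of the hash code.
    return key % number_of_partitions

hours = [3, 6, 22, 12, 1, 13, 17]  # made-up hour_of_day keys
partitions = {0: [], 1: []}
for hour in hours:
    partitions[partition(hour, 2)].append(hour)

# Each partition contains several distinct keys (groups)
print(sorted(partitions[0]))  # [6, 12, 22]
print(sorted(partitions[1]))  # [1, 3, 13, 17]
```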
MR: Example of Principle (4)
• The reducer obtains then for each group of
tuples with the same hour_of_day the
temperature values and can compute the
average.
• Output is one file per reducer, e.g.:
File 1
14;17.34
17;14.01
23;9.11
File 2
4;7.19
16;16.35
22;9.89
• Inside each file, lines are sorted by hour_of_day, but not globally!
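The whole example (filter to 2014, partition by hour, average per hour) can be simulated in a few lines of plain Python. This is a sketch, not Hadoop code: the function names, the integer-modulo partitioning, and the sample lines are assumptions made for illustration.

```python
from collections import defaultdict

def map_fn(line):
    date, station_id, hour_of_day, temp = line.split(";")
    if date.endswith("/2014"):  # keep only readings from 2014
        return [(int(hour_of_day), float(temp))]
    return []

def reduce_fn(hour, temps):
    return (hour, sum(temps) / len(temps))  # average temperature of one group

def run_job(lines, num_reducers=2):
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for line in lines:
        for hour, temp in map_fn(line):
            partitions[hour % num_reducers][hour].append(temp)
    # One output "file" per reducer, sorted by key within the file only
    return [[reduce_fn(h, p[h]) for h in sorted(p)] for p in partitions]

lines = ["11/24/2014;21;3;1.0", "3/13/2014;40;6;12.0",
         "2/5/2014;17;3;3.0", "6/1/2002;9;16;16.01"]
print(run_job(lines))
# [[(6, 12.0)], [(3, 2.0)]]
```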
More Details
• While there are in general several keys (and
corresponding values) in a partition, it is
assured that all tuples for a key are in a single
partition.
• The sorting is done based on the key.
• Data is grouped by key. The reduce function is
called for each such group.
To get a strong understanding of these concepts, play around with the Hadoop
MapReduce implementation (see exercise sheet 2).
Map-Only Job vs. IdentityReducer
• In map-only jobs (zero reducers) no sorting
takes place
• With an identity reducer that simply outputs
data it receives, sorting will take place.
Multiple MapReduce Jobs Together
• One can combine multiple MapReduce Jobs, in
general, forming a workflow of jobs that can be
described with a directed acyclic graph (DAG)
• Each MR job outputs results to the (distributed) file
system.
• Subsequent jobs can consume these outputs (files,
one per reducer, or one per mapper if there is no reducer).
MAPREDUCE AND SQL / JOINS
Processing SQL Queries in MR
• Given a relation R(A,B,…) that is stored in a file
(one tuple per line)
• How to process standard SQL queries?
SELECT B              -- “projection”
FROM R
WHERE predicate(B)    -- “selection”
SQL in MR: Group-By, Aggregate, Having
SELECT department, avg(salary)
FROM salaries
GROUP BY department
HAVING avg(salary) > 50000
• Group-By, Aggregate
– Map: Send each tuple to a reducer, using the grouping
attribute as key (here: department)
– Reduce: Hence receives all tuples for one department
and can group and aggregate them
• Having
– Having is a predicate evaluated on the aggregate of a
group; hence, it is executed at the reducer.
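The query above could be simulated as follows. This is a plain Python sketch, not Hadoop code; the function names and the sample rows are made up for illustration.

```python
from collections import defaultdict

def map_fn(row):
    department, salary = row
    return (department, salary)  # the grouping attribute becomes the key

def reduce_fn(department, salaries):
    avg = sum(salaries) / len(salaries)
    # HAVING is evaluated on the aggregate, so it lives in the reducer
    return [(department, avg)] if avg > 50000 else []

def run_query(rows):
    groups = defaultdict(list)  # stands in for shuffle and grouping by key
    for row in rows:
        key, value = map_fn(row)
        groups[key].append(value)
    return [out for dep in sorted(groups) for out in reduce_fn(dep, groups[dep])]

rows = [("sales", 40000), ("sales", 50000), ("it", 60000), ("it", 70000)]
print(run_query(rows))
# [('it', 65000.0)]
```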
(Equi) Joins in Map Reduce
• Two relations R(A,B) and S(B,C):
SELECT *
FROM R, S
WHERE R.B = S.B
• Obviously: Join “partners” have to end
up at same node. How to achieve
this?
Example
Station ID   Timestamp        Temperature
1            12434343434300   25
2            12434343434500   27
1            12434343434700   31
1            12434343434900   28
2            12434343435200   29

Station ID   Station Name
1            A
2            B

Join (on Station ID):
Station ID   Station Name   Timestamp        Temperature
1            A              12434343434300   25
2            B              12434343434500   27
1            A              12434343434700   31
1            A              12434343434900   28
2            B              12434343435200   29
Reduce Side Join
• Two relations R(A,B) and S(B,C).
• Map:
– Send tuple t to reducer of key t.B
– And information where t is from (R or S)
• Reduce:
– Join tuples t1, t2 with t1.B=t2.B and t1 in R and t2
in S
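A minimal simulation of the reduce-side join in plain Python (not Hadoop code). The tags "R" and "S" carry the information where a tuple is from; the function names and sample tuples are assumptions for illustration.

```python
from collections import defaultdict

def map_r(t):
    # t = (A, B) from R: key is t.B, value is tagged with its origin
    return (t[1], ("R", t))

def map_s(t):
    # t = (B, C) from S: key is t.B, value is tagged with its origin
    return (t[0], ("S", t))

def reduce_join(b, tagged):
    r_tuples = [t for tag, t in tagged if tag == "R"]
    s_tuples = [t for tag, t in tagged if tag == "S"]
    # All tuples in this group share the same B, so pair every R with every S
    return [(a, b, c) for (a, _) in r_tuples for (_, c) in s_tuples]

def run_join(R, S):
    groups = defaultdict(list)  # stands in for the shuffle
    for key, value in [map_r(t) for t in R] + [map_s(t) for t in S]:
        groups[key].append(value)
    return [row for b in sorted(groups) for row in reduce_join(b, groups[b])]

R = [("a1", 1), ("a2", 2), ("a3", 3)]
S = [(1, "c1"), (1, "c2"), (2, "c3")]
print(run_join(R, S))
# [('a1', 1, 'c1'), ('a1', 1, 'c2'), ('a2', 2, 'c3')]
```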
Map-Side Join with one entirely known
Relation
• Two relations R(A,B) and S(B,C).
• One relation is small, say R
• Map:
– each map process knows entire relation R
– can perform join on subset of S
• output joined tuple
• Reduce:
– no reducer needed
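This variant could be sketched as follows in plain Python (a simulation, not Hadoop code). The helper `make_mapper` is hypothetical; it models that every map task loads the small relation R once and then joins its own chunk of S against it.

```python
def make_mapper(R):
    # Build a lookup table on R.B once per map task
    # ("each map process knows the entire relation R")
    r_by_b = {}
    for a, b in R:
        r_by_b.setdefault(b, []).append(a)

    def map_fn(s_tuple):
        b, c = s_tuple
        # Output one joined tuple per matching R tuple; nothing if no partner
        return [(a, b, c) for a in r_by_b.get(b, [])]

    return map_fn

R = [("a1", 1), ("a2", 2)]                   # small relation
S_chunk = [(1, "c1"), (3, "c9"), (2, "c3")]  # the part of S this mapper reads
map_fn = make_mapper(R)
joined = [row for t in S_chunk for row in map_fn(t)]
print(joined)
# [('a1', 1, 'c1'), ('a2', 2, 'c3')]
```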
Reduce-Side Join with “Semi Join”
Optimization (Filtering) in Map Phase
• Two relations R(A,B) and S(B,C).
• The number of distinct values in R.B is small
• Map:
– knows the distinct values of R.B
– send tuples t in R by key t.B
– send tuples t in S only if t.B in R.B
• Reduce:
– perform the actual join (why is this still required?)
Global Sharing of Information
• Implemented as “Distributed Cache”
• For small data
• E.g., dictionary or “stopwords” file for text
processing, or “small” relation for joins
• Read at initialization of Mapper
Reduce Side Join with Map-Side
Filtering, but now with Bloom Filters
• Reduce-side join with Map-side filtering
• Compact representation of join attributes
• Using Bloom Filter*
– very generic data structure with wide applications to
distributed data management / systems
• Will see them later again (so worth introducing)
*) - Bloom, Burton H. (1970), "Space/time trade-offs in hash coding with allowable
errors", Communications of the ACM 13 (7): 422–426.
- Broder, Andrei; Mitzenmacher, Michael (2005), "Network Applications of Bloom Filters:
A Survey", Internet Mathematics 1 (4): 485–509
Bloom Filter
• Bit array of size m (all bits=0 initially)
• Encode elements of a set in that array
– the set is, for instance, the distinct values of a table
column or a set of words. How to hash non-numbers? E.g., use the byte representation of the string
• How is the bit array constructed?
– Hash element to bucket no. and set this bit to 1
(If the bit is already 1, ok, keep it 1)
– Use multiple (=k) hash functions h_i
[Figure: bit array with bucket numbers 1 to 8]
Bloom Filter: Insert + Query
h1(x) = 3*x mod 8
h2(x) = 5*x mod 8
Insert 17: h1(17)=3, h2(17)=5
Insert 59: h1(59)=1, h2(59)=7
Bucket number:  0 1 2 3 4 5 6 7
Bit:            0 1 0 1 0 1 0 1
• Query: is x contained in the set (=filter)?
– Check if bits at both h1(x) and h2(x) are set to 1. Yes?
Then x ”might be” in the set. No? Then x is for sure not
in!
Bloom Filter: False Positives
• In case all bits at hash positions are 1, the
element might be in, but maybe it’s a mistake.
• Is x=45 contained? h1(45)=7, h2(45)=1
Bucket number:  0 1 2 3 4 5 6 7
Bit:            0 1 0 1 0 1 0 1
• It looks like it, but actually it is not! (i.e., we didn’t
insert it on the slide before)
• It is a false positive!
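The running example can be reproduced with a few lines of plain Python. The class `BloomFilter` and its method names are made up for this sketch; the two hash functions are the ones used in the example (3*x mod 8 and 5*x mod 8).

```python
class BloomFilter:
    def __init__(self, m, hash_functions):
        self.bits = [0] * m  # bit array of size m, all bits 0 initially
        self.hash_functions = hash_functions

    def insert(self, x):
        for h in self.hash_functions:
            self.bits[h(x)] = 1  # already 1? Then it simply stays 1

    def might_contain(self, x):
        # True means "might be in the set", False means "definitely not"
        return all(self.bits[h(x)] == 1 for h in self.hash_functions)

bf = BloomFilter(8, [lambda x: 3 * x % 8, lambda x: 5 * x % 8])
bf.insert(17)
bf.insert(59)
print(bf.bits)               # [0, 1, 0, 1, 0, 1, 0, 1]
print(bf.might_contain(45))  # True, although 45 was never inserted
print(bf.might_contain(4))   # False: bit 4 is not set
```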
Bloom Filter: Probability of False Positives
• Bloom Filter of size m (bits)
• k hash functions
• n inserted elements
• Thus, can be controlled: tradeoff between
compression and “failures“
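For reference, the standard approximation of the false positive probability p in terms of m, k, and n is given below. It assumes that the k hash functions are independent and spread elements uniformly over the m bits.

```latex
p \approx \left(1 - e^{-kn/m}\right)^{k},
\qquad
k_{\mathrm{opt}} = \frac{m}{n}\,\ln 2
```

Here k_opt is the number of hash functions that minimizes p for given m and n. A larger m (less compression) lowers p.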
Implications of False Positives on Join
• Reconsider the reduce-side join with map-side
filtering of relations R(A,B) and S(B,C).
• We have a Bloom filter for R.B, etc. (see slide
before)
• What do false positives cause?
– additional (and useless) network traffic and also
more work for the reducer
– but no erroneous results, as the reducer will check whether
the join can in fact be done
Map-Side Join
• Reduce-side joins are quite straightforward
• But potentially lots of data needs to be
transferred
• How can we realize a join on the map side
(without distributed cache for one small
relation)?
Map-Side Join: Idea
• Assume that the input relations are already sorted
by the key over which the join is computed
[Figure: Relation R and Relation S, each shown as a list of (key, value) tuples sorted by key, with keys ranging from A to E]
How can this work out properly?
Map-Side Join: Solution
• Need to make sure data is properly aligned
• That is, Mappers need to read input in pairs of
chunks (one from R, one from S) that contain all
join partners
• Achieved through two prior MR jobs* that use
same number of reducers, sorting tuples by the
key (of course, same order)
• Then final MR job performs join in Map Phase
*) If not sorted/aligned properly already
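Once a mapper reads an aligned pair of sorted chunks, the join itself is a merge. The following plain Python sketch (not the Hadoop implementation; the function name `merge_join` and the sample chunks are assumptions) shows this for two lists of (key, value) tuples sorted by key.

```python
def merge_join(r_chunk, s_chunk):
    # Both chunks are lists of (key, value), sorted by key and aligned,
    # i.e., all join partners of a key are inside this pair of chunks.
    out, i, j = [], 0, 0
    while i < len(r_chunk) and j < len(s_chunk):
        rk, sk = r_chunk[i][0], s_chunk[j][0]
        if rk < sk:
            i += 1
        elif rk > sk:
            j += 1
        else:
            # Pair the current R tuple with the whole run of equal keys in S
            j_run = j
            while j_run < len(s_chunk) and s_chunk[j_run][0] == rk:
                out.append((rk, r_chunk[i][1], s_chunk[j_run][1]))
                j_run += 1
            i += 1  # j stays: the next R tuple may have the same key
    return out

r = [("A", "r1"), ("B", "r2"), ("B", "r3")]
s = [("A", "s1"), ("B", "s2"), ("C", "s3")]
print(merge_join(r, s))
# [('A', 'r1', 's1'), ('B', 'r2', 's2'), ('B', 'r3', 's2')]
```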
Map-Side Join: Solution (Cont’d)
• In Hadoop there exists a special input format
named CompositeInputFormat
• You can specify the join you want (e.g., inner
or outer join) and the inputs.
Map Reduce vs. Databases
             Traditional RDBMS            Map Reduce
Data Size    Gigabytes                    Petabytes
Access       Interactive and batch        Batch
Updates      Read and write many times    Write once, read many times
Structure    Static schema                Dynamic schema
Integrity    High                         Low
Scaling      Non linear                   Linear
source: T. White, Hadoop, The Definitive Guide, 3rd edition
Objectives/Benefits
• Simple model (see also criticisms) ;)
• Scalable (depends also on problem of course)
• Aims at high throughput
• Tolerant against node failures
Limitations
• Very low level routines
• Can have quite slow response time for individual,
small tasks
• Writing complex queries can be a hassle
– Think: declarative languages like SQL
SELECT * FROM
WHERE
GROUP BY
…
Criticism
• Some people claim MR is a major step
backward
• Why?
– Too low level
– No indices
– No updates
– No transactions
• But: was it really made to replace a DB?
http://craig-henderson.blogspot.de/2009/11/dewitt-and-stonebrakers-mapreducemajor.html
HADOOP (A MR IMPLEMENTATION)
Hands on MapReduce (with Hadoop)
• Apache Hadoop. Open Source MR
• Wide acceptance:
– http://wiki.apache.org/hadoop/PoweredBy
– Amazon.com, Apple, AOL, eBay, IBM, Google,
LinkedIn, Last.fm, Microsoft, SAP, Twitter, …
Hadoop Distributed File System
(HDFS): Basics
• A given file is cut into big pieces (blocks), e.g., 64 MB each
• These blocks are then assigned to (different) nodes
[Figure: the blocks of a file distributed over several nodes]
HDFS Architecture
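[Figure: HDFS architecture. A client sends metadata ops to the NameNode, which holds the metadata (name, replicas, …; e.g., /home/foo/data, 3, …). Clients read from and write to DataNodes, which are placed in racks (Rack 1, Rack 2). The NameNode issues block ops to the DataNodes, and blocks are replicated between DataNodes.]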
source: http://hadoop.apache.org
Replication
• Can specify default replication factor (or per
directory/file)
• Replication is pipelined
– if block is full, NameNode is asked for other
DataNodes (that can hold replica)
– DataNode is contacted, receives data
– Forwards to third replica, etc.
A Note on Input Splits
• An Input Split is a chunk of the input data,
processed by a single map.
• For instance a set of lines of the original big file.
• The size of a split is usually the size of a file system block.
• But a split does not, in general, align precisely with the block
boundaries (e.g., a line can span two blocks), so a bit of data has
to be read across boundaries.
• Luckily, for applications we consider, we “do not
care” and use available input formats.
MR job execution in Hadoop 1.x
[Figure: on the client node, the MapReduce program runs the Job inside the client JVM]
source: T. White, Hadoop, The Definitive Guide, 3rd edition
MR job execution in Hadoop 1.x (Cont’d)
[Figure: job submission, involving the Job (client JVM, client node), the JobTracker (jobtracker node), the shared filesystem (e.g., HDFS), and the tasktracker nodes. Steps shown:]
1: run Job
2: get new job ID
3: copy job resources
4: submit job
5: init job
6: retrieve input splits
MR job execution in Hadoop 1.x (Cont’d)
[Figure: task assignment and execution, involving the JobTracker (jobtracker node), the shared filesystem (e.g., HDFS), and the TaskTracker with its child JVM (tasktracker node). Steps shown:]
5: init job
6: retrieve input splits
7: heartbeat (returns task)
8: retrieve job resources
9: launch Child
10: run Map or Reduce
Job Submission, Initialization,
Assignment, Execution
• asks for new job id
• checks if input/output directories exist
• computes input splits
• writes everything to HDFS
• submits job to JobTracker (step 4)
• Retrieves splits (chunks) from HDFS
• Creates for each split a Map task
• TaskTracker is responsible for executing a certain
assigned task (multiple on one physical machine)
Stragglers and Speculative Execution
• JobTracker continuously monitors progress (see
Web user interface)
• Stragglers are slow nodes
– have to wait for the slowest one (think: only one out of
1000 is slow and delays overall response time)
• Speculative execution
– run same task on more nodes if the first instance is
observed to underperform (after some time)
– wasted resources vs. improved performance
source: T. White, Hadoop, The Definitive Guide, 3rd edition
Typical Setup
[Figure: a switch connects Rack 1 and Rack 2; the racks hold Node 1 to Node 6, each node with its own disks]
Locality
• data-local
• rack-local
• off-rack
[Figure: map tasks placed relative to the HDFS block they read, within a node, a rack, and the data center]
source: T. White, Hadoop, The Definitive Guide, 3rd edition
Cost Model + Configuration for Rack
Awareness
• Simple cost model applied in Hadoop:
– Same node: 0
– Same rack: 2
– Same data center: 4
– Different data center: 6
• Hadoop needs help: You have to specify config. (topology)
• Sample configuration:
'13.2.3.4' : '/datacenter1/rack0',
'13.2.3.5' : '/datacenter1/rack0',
'13.2.3.6' : '/datacenter1/rack0',
'10.2.3.4' : '/datacenter2/rack0',
'10.2.3.4' : '/datacenter2/rack0'
....
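The cost values above correspond to the number of hops to the closest common ancestor in the topology tree. A minimal sketch in plain Python, assuming each location is written as a path of the form /datacenter/rack/node; the function name `distance` and the node names are hypothetical, and this is not Hadoop's own implementation.

```python
def distance(location_a, location_b):
    # Locations are paths such as "/datacenter1/rack0/node1" (assumed format)
    a = location_a.strip("/").split("/")
    b = location_b.strip("/").split("/")
    common = 0
    while common < min(len(a), len(b)) and a[common] == b[common]:
        common += 1
    # Hops up from a to the common ancestor plus hops down to b
    return (len(a) - common) + (len(b) - common)

print(distance("/datacenter1/rack0/node1", "/datacenter1/rack0/node1"))  # 0
print(distance("/datacenter1/rack0/node1", "/datacenter1/rack0/node2"))  # 2
print(distance("/datacenter1/rack0/node1", "/datacenter1/rack1/node3"))  # 4
print(distance("/datacenter1/rack0/node1", "/datacenter2/rack0/node4"))  # 6
```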
Shuffle and Sort
• Output of map is partitioned by key by default
• Reducer is guaranteed to get an entire partition
• Sorted by key (but not by value within each
group)
• Output of each reducer is also sorted by this
key
• Selecting which key to use, hence, affects
partitions and sort order (see a few slides later for how to
customize)
Shuffle and Sort
[Figure: copy phase. In the map task, the map output for an input split is buffered in memory, partitioned, and merged on disk. The reduce task fetches its partitions from this and other maps and merges them; other partitions go to other reducers.]
Shuffle and Sort (Cont’d)
[Figure: “sort” phase and reduce phase. The reduce task merges the fetched map outputs (a mixture of in-memory and on-disk data) and passes the merged data to reduce, which writes the output.]
Secondary Sort
• In MapReduce (Hadoop) tuples/records are
sorted by key before reaching the reducers.
• For a single key, however, tuples are not sorted
in any specific order (and this can also vary from
one execution of the job to another).
• How can we impose a specific order?
Partitioning, Grouping, Sorting
• Consider weather data, temperature (temp)
for each day. Want: maximum temp per year
• So, want data per year sorted by temp:
1900  35°C   (max for year 1900)
1900  34°C
1900  34°C
...
1901  36°C   (max for year 1901)
1901  35°C
• Idea: composite key: (year, temp)
example source: T. White, Hadoop, The Definitive Guide, 3rd edition
Partitioning, Grouping, Sorting
(Cont’d)
• Obviously, this doesn’t work as is: (1900, 35°C) and
(1900, 34°C) can end up in different partitions
• Solution(?): Write a custom partitioner that
partitions by year only, and a sort comparator
that sorts by temperature
Need for Custom Grouping
• With that custom partitioner by year and still
year and temp as key we get
1900  35°C
1900  34°C
1900  34°C
...
1901  36°C
1901  35°C
[Figure: brackets over the rows above mark the partitions and the groups]
• Problem: reducer still consumes groups by key
(within correct partitions)
Custom Grouping
• Solution: Define custom grouping method
(class) that considers year for grouping
1900  35°C
1900  34°C
1900  34°C
...
1901  36°C
1901  35°C
[Figure: brackets over the rows above mark the partitions and the groups, with grouping now done by year]
Custom Sorting
• Finally, we provide a custom sorting that sorts
the keys by temperature in descending order (=
large values first)
• What happens then? Hadoop uses year for
grouping (as said on the previous slide), but which
temp is used as the key? (Remember, we still have
composite keys.)
• The first one observed is used as key, i.e., the
largest (max) temperature is used for the temp.
Note that this example specifically aims at computing the max using secondary sort.
How would you implement a job such that the output is sorted by (year,temp) ?
Secondary Sort: Summary
• Recipe to get sorting by value
– Use composite key of natural key and natural value
– Sort comparator has to order by the composite key
(i.e., both natural key and natural value)
– Partitioner and grouping comparator for the
composite key should use only the natural key for
partitioning and grouping.
Hint (for Hadoop):
job.setMapperClass(…);
job.setPartitionerClass(…);
job.setSortComparatorClass(…);
job.setGroupingComparatorClass(…);
job.setReducerClass(…);
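The interplay of the three pieces can be simulated in plain Python. This is a sketch, not Hadoop code: the functions `partitioner`, `sort_key`, and `grouping_key` are hypothetical stand-ins for the classes registered with the calls above, and the records are made up.

```python
from itertools import groupby

def partitioner(key, num_partitions):
    year, temp = key
    return year % num_partitions  # natural key (year) only

def sort_key(key):
    year, temp = key
    return (year, -temp)  # composite order: year ascending, temp descending

def grouping_key(key):
    year, temp = key
    return year  # natural key (year) only

def max_temp_per_year(records, num_partitions=2):
    partitions = [[] for _ in range(num_partitions)]
    for key in records:  # map output: composite keys (year, temp)
        partitions[partitioner(key, num_partitions)].append(key)
    result = []
    for part in partitions:
        part.sort(key=sort_key)
        for year, group in groupby(part, key=grouping_key):
            # The first key of each group carries the maximum temperature
            result.append(next(group))
    return result

records = [(1900, 34), (1901, 35), (1900, 35), (1901, 36), (1900, 34)]
print(max_temp_per_year(records))
# [(1900, 35), (1901, 36)]
```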
Failure/Recovery in MR
• Tasktracker failure:
– detected by master through periodic heartbeats
– can also be blacklisted if too many failures occur
– just restart if dead
– JobTracker re-schedules tasks
• Master failure:
– unlikely to happen (only one machine), but if it does, all
running jobs fail
– improved in Hadoop 2.x (YARN)
And Specifically in HDFS
• NameNode marks DataNodes without recent
Heartbeats as dead
• Replication factor of some blocks can fall below
their specified value
• The NameNode constantly tracks which blocks
need to be replicated and initiates replication
whenever necessary.
• If the NameNode crashes: manual restart/recovery.
Literature
• Read on: hadoop.apache.org, there is also a tutorial
• Hadoop Book: Tom White. Hadoop: The
definitive Guide. O’Reilly.
• Hadoop Illuminated:
http://hadoopilluminated.com/hadoop_book/
• Websites, e.g.,
http://bradhedlund.com/2011/09/10/understandinghadoop-clusters-and-the-network/
• http://lintool.github.io/MapReduceAlgorithms/MapReduce-book-final.pdf