Lecture 12 - MapReduce

MapReduce, Locality Sensitive Hashing, GPUs
NLP ML Web
Andrew Rosenberg
Big Data
What is Big Data?
Data analysis based on more data than previously considered.
Analysis that requires new or different processing due to scale.
Variety, Velocity, Volume (Veracity, Value)
AR’s rule of thumb: if the data for analysis does not fit in memory on one machine, you are dealing with Big Data. If it does (or can be made to), you are probably not.
Fast Matching
• Often you want to find the closest match, or an exact match, in a set.
  • Compare a query to a document
  • Nearest Neighbors classification
  • Or duplicate detection
• Comparing all pairs of N objects is O(N²)
https://users.soe.ucsc.edu/~niejiazhong/slides/kumar.pdf
Fast Matching
• Question: Can we represent similarities between objects in a succinct manner?
  • As a function of a single object
  • By obtaining a “sketch” of the object
• Sacrifice exactness for efficiency
  • By using randomization
Minhashing
• Assume sparse features
  • words
  • n-grams (character n-grams aka “shingles”)
• Jaccard Similarity (set similarity):

sim(Ci, Cj) = |Ci ∩ Cj| / |Ci ∪ Cj|
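As a quick concrete sketch (plain Python; the example shingle sets are made up):

def jaccard(ci, cj):
    '''Jaccard similarity of two feature sets.'''
    ci, cj = set(ci), set(cj)
    return len(ci & cj) / len(ci | cj)

# character 3-gram "shingles" of two short strings
print(jaccard({'abc', 'bcd', 'cde'}, {'bcd', 'cde', 'def'}))  # 0.5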
Minhashing
C1  C2
 0   1
 1   0
 1   1
 0   0
 1   1
 0   1

Sim(C1, C2) = 2/5

• Idea: Hash the columns C1, C2 to a smaller signature, sig(C1), sig(C2)
• How would you do this?
  • Sample P rows (features)
Minhashing
• Better: Randomly permute rows

C1  C2  P1  P2
 0   1   3   4
 1   0   2   1
 1   1   1   5
 0   0   4   2
 1   1   5   3
 0   1   6   6

• h(Ci) := the first row, in permuted order, where Ci = 1
• Property: p(h(Ci) = h(Cj)) = Sim(Ci, Cj)
• Why? Scanning rows in permuted order, the first row that is not (0,0) is equally likely to be any non-(0,0) row, and the hashes agree exactly when it is a (1,1) row. So both probabilities equal #(1,1) / (#(1,1) + #(1,0) + #(0,1)).

Sim(C1, C2) = 2/5
Minhash Signatures
C1  C2  P1  P2
 0   1   3   4
 1   0   2   1
 1   1   1   5
 0   0   4   2
 1   1   5   3
 0   1   6   6

• Pick P random permutations
• sig(C) := the list, over the P permutations, of the index of the first row with a 1 in column C
• sim(sig(Ci), sig(Cj)) := fraction of permutations where the values agree.
• E[sim(sig(Ci), sig(Cj))] = sim(Ci, Cj)

Sim(C1, C2) = 2/5
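A minimal sketch of minhash signatures in plain Python (the function names and toy columns are illustrative, not from the lecture):

import random

def minhash_signature(column, num_perms, num_rows, seed=0):
    '''column: set of row indices where the column has a 1.
    For each of num_perms random permutations, record the permuted
    index of the first row containing a 1.'''
    rng = random.Random(seed)          # same seed => same permutations
    sig = []
    for _ in range(num_perms):
        perm = list(range(num_rows))
        rng.shuffle(perm)
        sig.append(min(perm[r] for r in column))
    return sig

def estimate_sim(sig_i, sig_j):
    '''Fraction of permutations where the minhash values agree.'''
    return sum(a == b for a, b in zip(sig_i, sig_j)) / len(sig_i)

# The columns from the table above: rows (0-indexed) where C1, C2 are 1
c1, c2 = {1, 2, 4}, {0, 2, 4, 5}
s1 = minhash_signature(c1, 1000, 6)
s2 = minhash_signature(c2, 1000, 6)
print(estimate_sim(s1, s2))   # close to Sim(C1, C2) = 2/5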
Locality Sensitive Hashing
• Apply multiple hash functions on random feature subsets.
• Hash functions should be chosen so that collisions occur on similar objects:

Sim(x, y) = P(hi(x) = hi(y))
Locality Sensitive Hashing
• Hash functions should be chosen so that collisions occur on similar objects.
• How to define hash functions?
  • String or graph hash functions for equality of short strings.
  • Bin continuous values.
  • Geometrically inspired functions.
LSH: Randomized minhash
Sim(x, y) = P(hi(x) = hi(y))

• Randomly sample N hash functions hi for comparison
• Likelihood of false negatives?
  • If x and y are identical, what is the probability that the hash functions will disagree?
• Likelihood of false positives?
  • If x and y are different, what is the probability that the hash functions will agree?
• p = hash collision likelihood (two non-identical entities have the same hash value)
• p^N = probability of N hash collisions
• Even with 20% hash collisions (5 bins), with 5 hash functions LSH has p(FP) = 0.2^5 = 0.00032
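Checking that arithmetic directly:

p = 0.2          # per-function collision probability (5 bins)
N = 5            # independent hash functions
print(p ** N)    # ~0.00032: all N functions must collide for a false positive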
Cosine Similarity LSH
Van Durme & Lall, ACL 2010

With Hamming distance h between two b-bit signatures:

cos(θ) ≈ cos((h/b)·π)

e.g., with Hamming distance h = 1 and signature length b = 6:
cos(θ) ≈ cos(π/6) ≈ 0.87
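A sketch of the random-hyperplane signatures behind this approximation (numpy; a minimal illustration rather than the paper's code, with arbitrary dimensions and signature length):

import numpy as np

def signature(x, hyperplanes):
    '''b-bit signature: the side of each random hyperplane x falls on.'''
    return hyperplanes @ x >= 0

rng = np.random.default_rng(0)
b = 256                            # signature length in bits
H = rng.standard_normal((b, 50))   # b random hyperplanes in 50 dims

x = rng.standard_normal(50)
y = x + 0.3 * rng.standard_normal(50)

h = np.sum(signature(x, H) != signature(y, H))   # Hamming distance
print(np.cos(h / b * np.pi))                     # approximates cos(theta)
print(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))  # true cosine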
Accumulate Counts
• Why do you want to do this?
Accumulate Counts
• [Hard disk >> Memory]
• Language Modeling
• for ASR, MT, n-gram statistics, etc.
• Inverted Indexes
Accumulate Counts
Serial

def updateCounts(f, counts):
    for word in f:
        counts[word] = counts.get(word, 0) + 1

counts = dict()
for f in files:              # one file at a time
    updateCounts(f, counts)
dumpDictionaryToFile(counts, outfile)
Parallel vs. Serial Processing
(diagram: updateCounts(f1) → updateCounts(f2) → updateCounts(f3), run one after another)
Accumulate Counts
Parallel
def updateCounts(f, counts):
    for word in f:
        counts[word] = counts.get(word, 0) + 1

counts = dict()
parallel for f in files:
    updateCounts(f, counts)
dumpDictionaryToFile(counts, outfile)

What is wrong with this?
Accumulate Counts
Parallel
def getCounts(f):
    counts = dict()
    for word in f:
        counts[word] = counts.get(word, 0) + 1
    return counts

per_file_counts = dict()
parallel for f in files:
    per_file_counts[f] = getCounts(f)
total_counts = mergeCounts(per_file_counts)
dumpDictionaryToFile(total_counts, outfile)
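One way to realize the `parallel for` above on a single machine is a process pool. A sketch (the input file names are hypothetical):

from collections import Counter
from multiprocessing import Pool

def get_counts(filename):
    '''Count words in one file, into a private dictionary.'''
    with open(filename) as f:
        return Counter(word for line in f for word in line.split())

if __name__ == "__main__":
    files = ["f1.txt", "f2.txt", "f3.txt"]      # hypothetical inputs
    with Pool() as pool:
        per_file = pool.map(get_counts, files)  # no shared state
    total_counts = sum(per_file, Counter())     # merge step
    print(total_counts.most_common(10))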
Parallel vs. Serial Processing
(diagram: updateCounts(f1), updateCounts(f2), updateCounts(f3) run in parallel, each producing its own count, merged into final_count)
General Divide and Conquer

(diagram: “Work” → Partition → w1, w2, w3 → “worker” × 3 → r1, r2, r3 → Combine → “Result”)

inspiration and images from
http://www.cse.buffalo.edu/~jing/cse601/fa12/materials/clustering_mapreduce.pdf
Challenges in Parallel/Distributed Processing
• Unpredictable runtimes.
• One slow job can make everything slow.
• Shared memory/race conditions
• Debugging
• Hardware Failures
  • If a hard drive lasts 5 years and you have 365 × 5 = 1825 hard drives in a cluster, you should expect about one hard-drive failure per day.
Map Reduce Framework
Map
• Iterate over a lot of data
• Extract something of interest from each
• Shuffle and Sort intermediate Results
Reduce
• Aggregate intermediate results
• Generate Final Output
Programmer Responsibility
• Map:
  • def map(k, v) -> [(k’, v’)]
  • given a key-value pair, generate a list of intermediate key-value pairs
• Reduce:
  • def reduce(k’, [v’]) -> [(k’, v’’)]
  • all values with the same key are sent to the same reduce function.
Map Reduce Word Count
def map(k, v):
    '''k: doc name, v: text'''
    for word in v:
        emit(word, 1)

def reduce(k, vals):
    '''k: word, vals: list of counts'''
    result = sum(vals)
    emit(k, result)
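To make the pattern concrete, here is a toy single-machine driver for map/reduce pairs like the one above (my own sketch; a real framework runs these phases distributed across machines):

from collections import defaultdict

def map_reduce(records, mapper, reducer):
    '''Run mapper over (k, v) records, shuffle by intermediate key,
    then run reducer once per distinct key.'''
    intermediate = defaultdict(list)
    for k, v in records:
        for k2, v2 in mapper(k, v):
            intermediate[k2].append(v2)    # shuffle: group values by key
    return [out for k2, vals in sorted(intermediate.items())
                for out in reducer(k2, vals)]

def wc_map(doc, text):
    for word in text.split():
        yield (word, 1)

def wc_reduce(word, counts):
    yield (word, sum(counts))

docs = [("d1", "a b a"), ("d2", "b c")]
print(map_reduce(docs, wc_map, wc_reduce))
# [('a', 2), ('b', 2), ('c', 1)]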
Visual Representation
(diagram)
(k1,v1)(k2,v2) → map → (a,1)
(k3,v3)(k4,v4) → map → (b,2)(c,3)
(k5,v5)(k6,v6) → map → (c,6)(a,5)
            … → map → (c,2)(b,7)(c,8)

Shuffle and Sort: aggregate values by keys
a: 1 5    b: 2 7    c: 2 3 6 8

reduce → (r1,s1)    reduce → (r2,s2)    reduce → (r3,s3)
Inverted Indexing
Example documents:
1: “F.D.A. Ruling Would Sharply Restrict Sale of Trans Fats”
2: “FDA to phase out use of artery-clogging trans fats”

Dictionary (in memory)    Posting List (on disk)
F.D.A.                    1:1, …
Ruling                    1:2, …
Would                     1:3, …
sharply                   1:4, …
restrict                  1:5, …
sale                      1:6, …
of                        1:7, 2:6, …
trans                     1:8, 2:8, …
fats                      1:9, 2:9, …
FDA                       2:1, …
to                        2:2, …
phase                     2:3, …
out                       2:4, …
use                       2:5, …
artery-clogging           2:7, …
Inverted Indexing
def map(k, v):
    '''k: doc name, v: text'''
    for word, offset in tokenize(v):
        emit(word, (k, offset))

def reduce(k, vals):
    '''k: word, vals: list of (doc, offset) pairs'''
    emit(k, vals)
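Run through the same toy map_reduce driver as in the word-count sketch; tokenize here is a stand-in helper, since the lecture leaves it undefined:

def tokenize(text):
    '''Stand-in tokenizer: yield (word, 1-based offset) pairs.'''
    for offset, word in enumerate(text.split(), start=1):
        yield word, offset

def ii_map(doc, text):
    for word, offset in tokenize(text):
        yield (word, (doc, offset))

def ii_reduce(word, postings):
    yield (word, sorted(postings))

docs = [("1", "fda ruling on trans fats"),
        ("2", "fda to phase out trans fats")]
print(map_reduce(docs, ii_map, ii_reduce))
# [('fats', [('1', 5), ('2', 6)]), ('fda', [('1', 1), ('2', 1)]), ...]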
Visual Representation: Move Code to Data

(same map → shuffle and sort → reduce diagram as above; the computation is shipped to where the data lives)

The index can be concatenated or left distributed.
GPUs
• Graphical Processing Units
• Processors designed specifically to display and process graphics.
• Now used for general-purpose computation.
http://lorenabarba.com/gpuatbu/Program_files/Cruz_gpuComputing09.pdf
Graphics processing
• What about graphics processing is general purpose?
• Data points are vectors (usually in 2 or 3 dims)
  • Scaling (zoom): c·x
  • Rotation: xᵀR, with R = [cos θ  −sin θ; sin θ  cos θ]
  • Skew: xᵀR, with R = [1  tan θ; tan θ  1]
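For example, rotating a large batch of 2-D points is a single matrix multiply, exactly the kind of operation a GPU pipelines well (numpy sketch; sizes are arbitrary):

import numpy as np

theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

X = np.random.randn(1_000_000, 2)   # a million 2-D points, one per row
rotated = X @ R.T                   # rotate all points at once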
Graphics processing
• What about graphics processing is general purpose?
• These processors are designed to be very efficient linear algebra processors
  • matrix multiplication, addition
  • matrix inversion, transposition
• These are many of the same operations necessary for machine learning
GPUs
• Massively parallel
  • Hundreds/thousands of cores
  • Thousands of threads
• Cheap (~$5k per state-of-the-art GPU)
• Highly available (<2 day delivery)
• Programmable: CUDA
Some stats
• GPU processing power is doubling every 18 months
• CPU power is not.

(charts: GPU vs. CPU processing power over time, for workstations and for servers)
Threads and Memory

Memory hierarchy
• Three types of memory on the graphics card:
  • Global memory: 4 GB; latency 400-600 cycles; purpose: IO for the grid
  • Shared memory: 16 KB per block; fast; purpose: thread collaboration
  • Registers: 16 KB per thread; fast; purpose: per-thread space
• Major bottleneck: transferring data from main memory (RAM) to GPU memory

Felipe A. Cruz
Basic Workflow
1. Memory allocation
2. Memory copy: Host -> GPU
3. Kernel call
4. Memory copy: GPU -> Host
5. Free GPU memory

(timeline diagram)
pycuda example

import pycuda.autoinit
import pycuda.driver as drv
import numpy
from pycuda.compiler import SourceModule

## CUDA CODE
mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
    const int i = threadIdx.x;
    dest[i] = a[i] * b[i];
}
""")

multiply_them = mod.get_function("multiply_them")

a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)
dest = numpy.zeros_like(a)

# copy inputs in, run one block of 400 threads, copy the result out
multiply_them(
    drv.Out(dest), drv.In(a), drv.In(b),
    block=(400, 1, 1), grid=(1, 1))

print(dest - a*b)   # should be all zeros
Wrap up
• LSH converts “distance” calculations to repeated, probabilistic matching functions.
• Map Reduce enables web-scale distributed computation.
• The current generation is geared around streaming data and iterative processing (Dataflow, Summingbird, Spark).
• GPUs greatly accelerate parallelizable computation: linear algebra for graphics, but also much scientific computation.