MapReduce, Locality Sensitive Hashing, GPUs
NLP / ML / Web
Andrew Rosenberg

Big Data
What is Big Data?
• Data analysis based on more data than previously considered.
• Analysis that requires new or different processing due to scale.
• Variety, Velocity, Volume (plus Veracity, Value).
• AR's rule of thumb: if the data for analysis does not fit in memory on one machine, you are dealing with Big Data. If it does (or can be made to), you probably are not.

Fast Matching
• Often you want to find the closest match, or an exact match, in a set:
  • compare a query to each document
  • nearest-neighbor classification
  • duplicate detection
• Comparing all pairs of N objects is O(N^2).
(slides after https://users.soe.ucsc.edu/~niejiazhong/slides/kumar.pdf)

Fast Matching
• Question: can we represent similarities between objects in a succinct manner?
  • as a function of a single object
  • by obtaining a "sketch" of the object
• Sacrifice exactness for efficiency, by using randomization.

Minhashing
• Assume sparse features:
  • words
  • n-grams (character n-grams, aka "shingles")
• Jaccard similarity (set similarity):
    sim(Ci, Cj) = |Ci ∩ Cj| / |Ci ∪ Cj|

Minhashing
• Two columns, one row per feature:
    C1 C2
     0  1
     1  0
     1  1
     0  0
     1  1
     0  1
    Sim(C1, C2) = 2/5
• Idea: hash the columns C1, C2 to a smaller signature, sig(C1), sig(C2).
• How would you do this? Sample P rows (features).

Minhashing
• Better: randomly permute the rows. P1 and P2 give each row's position under two random permutations:
    C1 C2 P1 P2
     0  1  3  4
     1  0  2  1
     1  1  1  5
     0  0  4  2
     1  1  5  3
     0  1  6  6
    Sim(C1, C2) = 2/5
• h(Ci) := the first row, in permuted order, where Ci = 1.
• Property: P(h(Ci) = h(Cj)) = Sim(Ci, Cj).
• Why? Both equal |11| / (|11| + |10| + |01|): the first permuted row with a 1 in either column is equally likely to be any row counted in the denominator, and the hashes agree exactly when it is a "11" row.

Minhash Signatures
(same columns and permutations as above)
• Pick P random permutations.
• sig(C) := the list of P indices of the first row with a 1 in column C.
• sim(sig(Ci), sig(Cj)) = fraction of permutations where the values agree.
• E[sim(sig(Ci), sig(Cj))] = sim(Ci, Cj).

Locality Sensitive Hashing
• Apply multiple hash functions on random feature subsets.
• Hash functions should be chosen so that collisions occur on similar objects:
    Sim(x, y) = P(hi(x) = hi(y))

Locality Sensitive Hashing
• Hash functions should be chosen so that collisions occur on similar objects.
• How to define the hash functions?
  • string or graph hash functions, for equality of short strings
  • bin continuous values
  • geometrically inspired hashes

LSH: Randomized Minhash
    Sim(x, y) = P(hi(x) = hi(y))
• Randomly sample N hash functions hi for comparison.
• Likelihood of false negatives? If x and y are identical, what is the probability that the hash functions will disagree? (Zero: identical objects always hash to the same values.)
• Likelihood of false positives? If x and y are different, what is the probability that all of the hash functions agree?
• Let p = the hash-collision likelihood (two non-identical entities get the same hash value).
• Then p^N = the probability of N hash collisions.
• Even with 20% hash collisions (5 bins) and 5 hash functions, LSH has p(FP) = 0.2^5 = 0.00032.

Cosine Similarity LSH
(Benjamin Van Durme & Ashwin Lall, ACL 2010)
• With signatures of b bits and a Hamming distance of h between two signatures:
    cos(θ) ≈ cos((h/b) · π)
• Example: Hamming distance h = 1, signature length b = 6:
    cos(θ) ≈ cos((1/6) · π) ≈ 0.87
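To make this concrete, here is a minimal Python/numpy sketch of the random-hyperplane scheme behind cosine LSH (an illustration, not code from the slides): each of b random hyperplanes contributes one signature bit, and the cosine is estimated from the Hamming distance h between signatures via cos(θ) ≈ cos((h/b) · π). The dimensionality, signature length, and test vectors below are arbitrary choices for the example.

    import numpy as np

    def signature(x, hyperplanes):
        # One bit per hyperplane: which side of the hyperplane x falls on.
        return (hyperplanes @ x) >= 0

    def approx_cos(sig_x, sig_y):
        # Estimate cos(theta) from the Hamming distance between b-bit signatures.
        b = len(sig_x)
        h = np.sum(sig_x != sig_y)
        return np.cos((h / b) * np.pi)

    rng = np.random.default_rng(0)
    d, b = 100, 64                              # feature dimension, signature length
    hyperplanes = rng.standard_normal((b, d))   # one random hyperplane per bit

    x = rng.standard_normal(d)
    y = x + 0.3 * rng.standard_normal(d)        # a noisy copy of x

    true_cos = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
    est_cos = approx_cos(signature(x, hyperplanes), signature(y, hyperplanes))
    print(true_cos, est_cos)                    # the estimate should be close

In practice you store and compare only the b-bit signatures, never the original vectors, which is where the space and time savings come from.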
Accumulate Counts
• Why do you want to do this?

Accumulate Counts
• Hard disk >> memory: the data often will not fit in RAM.
• Language modeling: for ASR, MT, n-gram statistics, etc.
• Inverted indexes.

Accumulate Counts: Serial

    def updateCounts(f, counts):
        for word in f:                        # f yields one word at a time
            counts[word] = counts.get(word, 0) + 1

    counts = dict()
    for f in files:                           # one file at a time
        updateCounts(f, counts)
    dumpDictionaryToFile(counts, outfile)

Parallel vs. Serial Processing
(diagram: the calls updateCounts(f1), updateCounts(f2), updateCounts(f3))

Accumulate Counts: Parallel

    def updateCounts(f, counts):
        for word in f:
            counts[word] = counts.get(word, 0) + 1

    counts = dict()
    parallel for f in files:                  # pseudocode: run the loop body in parallel
        updateCounts(f, counts)
    dumpDictionaryToFile(counts, outfile)

• What is wrong with this? Every worker updates the same shared counts dictionary, a race condition.

Accumulate Counts: Parallel

    def getCounts(f):
        counts = dict()
        for word in f:
            counts[word] = counts.get(word, 0) + 1
        return counts

    counts = dict()
    parallel for f in files:                  # each worker builds its own local dictionary
        counts[f] = getCounts(f)
    total_counts = mergeCounts(counts)
    dumpDictionaryToFile(total_counts, outfile)

Parallel vs. Serial Processing
(diagram: updateCounts(f1), updateCounts(f2), updateCounts(f3) each produce their own count, which are then merged into a final_count)

General Divide and Conquer
(diagram: the "work" is partitioned into w1, w2, w3; a "worker" handles each partition and produces r1, r2, r3; the partial results are combined into the "result")
(inspiration and images from http://www.cse.buffalo.edu/~jing/cse601/fa12/materials/clustering_mapreduce.pdf)

Challenges in Parallel/Distributed Processing
• Unpredictable runtimes: one slow job can make everything slow.
• Shared memory / race conditions.
• Debugging.
• Hardware failures: if a hard drive lasts 5 years and you have 365 * 5 hard drives in a cluster, you should expect one hard drive failure per day.

Map Reduce Framework
• Map:
  • iterate over a lot of data
  • extract something of interest from each record
• Shuffle and sort the intermediate results.
• Reduce:
  • aggregate the intermediate results
  • generate the final output

Programmer Responsibility
• Map:
    def map(k, v) -> [(k', v')]
  Given a key-value pair, generate a list of intermediate key-value pairs.
• Reduce:
    def reduce(k', [v']) -> [(k', v'')]
  All values with the same key are sent to the same reduce function.

Map Reduce Word Count

    def map(k, v):
        '''k: doc name, v: text'''
        for word in v.split():
            emit(word, 1)

    def reduce(k, vals):
        '''k: word, vals: list of counts'''
        result = sum(vals)
        emit(k, result)

Visual Representation
(diagram: mappers take (k1, v1) ... (k6, v6) and emit pairs such as (a, 1), (b, 2), (c, 3), (c, 6), (a, 5), (c, 2), (b, 7), (c, 8); shuffle and sort aggregates values by key, giving a: [1, 5], b: [2, 7], c: [2, 3, 6, 8]; one reduce call per key produces (r1, s1), (r2, s2), (r3, s3))

Inverted Indexing
• Example documents:
  1. "F.D.A. Ruling Would Sharply Restrict Sale of Trans Fats"
  2. "FDA to phase out use of artery-clogging trans fats"
• A dictionary (in memory) maps each term to a posting list (on disk) of doc:position entries:
    F.D.A.           -> 1:1, ...
    Ruling           -> 1:2, ...
    Would            -> 1:3, ...
    sharply          -> 1:4, ...
    restrict         -> 1:5, ...
    sale             -> 1:6, ...
    of               -> 1:7, 2:6, ...
    trans            -> 1:8, 2:8, ...
    fats             -> 1:9, 2:9, ...
    FDA              -> 2:1, ...
    to               -> 2:2, ...
    phase            -> 2:3, ...
    out              -> 2:4, ...
    use              -> 2:5, ...
    artery-clogging  -> 2:7, ...

Inverted Indexing

    def map(k, v):
        '''k: doc name, v: text'''
        for word, offset in tokenize(v):
            emit(word, (k, offset))

    def reduce(k, vals):
        '''k: word, vals: list of (doc, offset) pairs'''
        emit(k, vals)

Visual Representation
• Move code to data.
• (same map / shuffle-and-sort / reduce diagram as above)
• The index can be concatenated or left distributed.

GPUs
• Graphical Processing Units.
• Processors designed specifically to display and process graphics.
• Now used for general-purpose computation.
(http://lorenabarba.com/gpuatbu/Program_files/Cruz_gpuComputing09.pdf)

Graphics processing
• What about graphics processing is general purpose?
• Data points are vectors (usually in 2 or 3 dimensions).
• Scaling (zoom): c · x
• Rotation:
    R = [ cos θ  -sin θ ]
        [ sin θ   cos θ ]
• Skew:
    R = [   1    tan θ ]
        [ tan θ    1   ]
• Each transformation is applied to a point x (or to every vertex at once) by a matrix product, R x.
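As a small illustration (not from the slides) of why these operations suit GPUs, the numpy sketch below applies the scaling, rotation, and skew above to a batch of 2-D points with single matrix products; a GPU runs the same kind of products over millions of vertices in parallel. The angle, scale factor, and points are arbitrary.

    import numpy as np

    theta = np.pi / 6
    c = 2.0

    # One 2-D point per column.
    points = np.array([[1.0, 0.0, 1.0],
                       [0.0, 1.0, 1.0]])

    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])    # rotation by theta
    S = np.array([[1.0,           np.tan(theta)],
                  [np.tan(theta), 1.0          ]])     # skew, as on the slide

    scaled  = c * points     # scaling (zoom)
    rotated = R @ points     # rotate every point at once
    skewed  = S @ points     # skew every point at once
    print(rotated)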
Graphics processing
• What about graphics processing is general purpose?
• These processors are designed to be very efficient linear algebra processors:
  • matrix multiplication and addition
  • matrix inversion and transposition
• These are many of the same operations necessary for machine learning.

GPUs
• Massively parallel: hundreds/thousands of cores, thousands of threads.
• Cheap (~$5k per state-of-the-art GPU).
• Highly available (< 2 day delivery).
• Programmable: CUDA.

Some stats
• GPU processing power is doubling every 18 months.
• CPU power is not.
(charts: GPU vs. CPU performance over time, for workstations and for servers)

Threads and Memory
• Memory hierarchy:
  • registers per thread
  • shared memory per block
  • global memory per grid
• Three types of memory on the graphics card:
  • global memory: 4 GB; latency 400-600 cycles; purpose: IO for the grid
  • shared memory: 16 KB; fast; purpose: collaboration between the threads of a block
  • registers: 16 KB; fast; purpose: per-thread space
• The major bottleneck is transferring data from main memory (RAM) to GPU memory.
(after Felipe A. Cruz)

Basic Workflow
1. Memory allocation.
2. Memory copy: host -> GPU.
3. Kernel call.
4. Memory copy: GPU -> host.
5. Free GPU memory.

pycuda example

    import pycuda.autoinit
    import pycuda.driver as drv
    import numpy
    from pycuda.compiler import SourceModule

    ## CUDA CODE
    mod = SourceModule("""
    __global__ void multiply_them(float *dest, float *a, float *b)
    {
      const int i = threadIdx.x;
      dest[i] = a[i] * b[i];
    }
    """)

    multiply_them = mod.get_function("multiply_them")

    a = numpy.random.randn(400).astype(numpy.float32)
    b = numpy.random.randn(400).astype(numpy.float32)
    dest = numpy.zeros_like(a)

    multiply_them(
        drv.Out(dest), drv.In(a), drv.In(b),
        block=(400, 1, 1), grid=(1, 1))

    print(dest - a * b)
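The last line should print a vector of (near-)zeros, confirming that the GPU kernel computed the same elementwise product as numpy. For simple elementwise work like this, PyCUDA's gpuarray module can stand in for the hand-written kernel; a rough sketch, reusing the same 400-element arrays (an illustration, not code from the slides):

    import pycuda.autoinit
    import pycuda.gpuarray as gpuarray
    import numpy

    a = numpy.random.randn(400).astype(numpy.float32)
    b = numpy.random.randn(400).astype(numpy.float32)

    a_gpu = gpuarray.to_gpu(a)        # memory copy: host -> GPU
    b_gpu = gpuarray.to_gpu(b)
    dest_gpu = a_gpu * b_gpu          # elementwise multiply runs on the GPU

    print(dest_gpu.get() - a * b)     # memory copy: GPU -> host; should be ~0

Either way, the structure matches the basic workflow above: allocate, copy in, compute, copy out.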
Wrap up
• LSH converts "distance" calculations into repeated, probabilistic matching functions.
• Map Reduce enables web-scale distributed computation.
  • The current generation is geared around streaming data and iterative processing (Dataflow, Summingbird, Spark).
• GPUs greatly accelerate parallelizable computation: linear algebra for graphics, but also much scientific computation.