Building a Distributed Processing System with Hadoop and R, with Examples
한상우, 지현웅, 이수지, 박새롬, 장희수, 이재욱
<1>

Contents
I. Introduction to Hadoop
II. RHadoop
III. MapReduce
IV. Examples
<2>

Background
Problems with existing approaches
1. The data cannot be stored
2. Not cost-effective
3. Analysis takes an enormous amount of time
Hadoop
- Handles massive volumes of data
- Distributed processing
- Delivers results quickly
- Open source
<3>

Hadoop
Hadoop = Hadoop Distributed File System (HDFS) + Hadoop MapReduce
- HDFS : Distributed File System
  Links thousands of servers over a network and uses them as if they were the file system of a single server
- MapReduce : Distributed Processing System
  Processes the data stored on each server simultaneously, in parallel
<4>

Hadoop Structure
[Diagram: a MASTER tier (Master, Back-up Master) above a SLAVE tier (Slave 1, Slave 2, Slave 3, Slave 4)]
- Master Node : manages the slave nodes
  Acts as NameNode (HDFS) and JobTracker (MapReduce)
- Slave Node : stores and serves data
  Acts as DataNode (HDFS) and TaskTracker (MapReduce)
<5>

RHadoop?
1. rhdfs : R and HDFS
2. rhbase : R and HBase
3. rmr : R and MapReduce
With RHadoop, R users can manage and analyze data with Hadoop
<6>

MapReduce
• The Map function processes a key/value pair to generate a set of intermediate key/value pairs
• The Reduce function merges all intermediate values associated with the same intermediate key
<7>

Getting Started with RHadoop
• With the RHadoop rmr package, the 'mapreduce' function applies the same calculation to every element of a data set
• Simple example that doubles all the numbers from 1 to 100 :

    # Load the rmr package first (rmr or rmr2, depending on the installed version)
    ints = to.dfs(1:100)                      # write the integers 1..100 to HDFS
    calc = mapreduce(input = ints,
                     map = function(k, v) cbind(v, 2*v))   # pair each value with its double
    from.dfs(calc)                            # read the result back into R

    $val
         v
    [1,] 1  2
    [2,] 2  4
    [3,] 3  6
    [4,] 4  8
    [5,] 5 10
    .....
<8>

RHadoop Examples
• Map and Reduce functions are defined with respect to data structured in (key, value) pairs

    nums = to.dfs(rnorm(100, 100, 10))        # 100 draws from N(mean 100, sd 10)

    # Map: label each value as below or above 100 and emit a count of 1
    sort.map.fn <- function(k, v) {
      key <- ifelse(v < 100, "less", "greater")
      keyval(key, 1)
    }

    # Reduce: count how many values arrived for each key
    count.reduce.fn <- function(k, v) {
      keyval(k, length(v))
    }

    count <- mapreduce(input = nums,
                       map = sort.map.fn,
                       reduce = count.reduce.fn)
    from.dfs(count)

    $key
    [1] "greater" "less"
    $val
    [1] 45 55

• Intermediate (key, value) pairs emitted by the map function :

    $key        $value
    "less"      1
    "greater"   1
    ...         ...
    "less"      1

• Reduce functions handle the data separately for each key value
<9>

Simple Simulations for Option Pricing
• We assume that the asset price $S(t)$ follows the stochastic differential equation under the risk-neutral probability:
  $dS(t) = r S(t)\,dt + \sigma S(t)\,dW(t)$
  where $W$ is a Brownian motion under the risk-neutral probability
• We simulate the future prices of the underlying asset, and from them the future payoffs to be obtained
• The sample mean of the discounted payoffs is the value of the option contract
[Figure: call option value plotted against stock price, with the exercise price marked]
<10>

Example - Option Pricing
• An example for European option pricing :

    # S0, K, r, sigma, T, nPas (time steps) and nTraj (trajectories)
    # are assumed to be defined beforehand
    inp = cbind(S0*rep(1, nTraj), rep(0, nTraj));
    inp = to.dfs(inp);

    # Map: advance every trajectory by nPas Euler steps, then emit discounted payoffs
    buildTraj <- function(k, v) {
      deltaT = T/nPas;
      for (i in 1:nPas) {
        dW = sqrt(deltaT)*rnorm(length(v[,1]));
        v[,2] = v[,1] + r*v[,1]*deltaT + sigma*v[,1]*dW;
        v[,1] = v[,2];
      }
      key <- ifelse(v[,1]-K > 0, "call", "put");
      value <- ifelse(v[,1]-K > 0, exp(-r*T)*(v[,1]-K), exp(-r*T)*(K-v[,1]));
      keyval(key, value)
    }

    # Reduce: sum of payoffs for this key divided by the total number of trajectories
    price.reduce.fn <- function(k, v) {
      keyval(k, mean(v)*(length(v)/nTraj))
    }

    call <- mapreduce(input = inp, map = buildTraj, reduce = price.reduce.fn);
    from.dfs(call)

    $key
    [1] "call" "put"
    $val
    [1] 6.038343 10.676495

• Note: quantities that depend on the data should be computed from the data itself, as in length(v[,1])
<11>

RHadoop Examples
• Elapsed time :

- R
    #Timestep \ #traj    100,000    1,000,000    5,000,000
    100                      0.8          9.8        130.3
    500                      3.8         48.3        201.3

- RHadoop
    #Timestep \ #traj    100,000    1,000,000    5,000,000
    100                     41.9         55.5        130.6
    500                     44.3         82.4        237.9
<12>

Example - k-means Clustering
k-Means Clustering
• Simple partitional clustering
• The number of clusters k is chosen in advance

    Iterate {
      MAP:
        Compute the distance from every point to each of the k centers
        Assign each point to the nearest center
      REDUCE:
        Compute the average of all points assigned to each center
        Replace the k centers with the new averages
    }
<13>

Example - k-means Clustering
• In the Map function..

    Input data (n points, 2 dimensions)
    dim1    dim2
    x_11    x_12
    x_21    x_22
    ...     ...
    x_n1    x_n2

    Distances from each point to each center
    distance(x_1, c_1)    distance(x_1, c_2)    distance(x_1, c_3)
    distance(x_2, c_1)    distance(x_2, c_2)    distance(x_2, c_3)
    ...
    distance(x_n, c_1)    distance(x_n, c_2)    distance(x_n, c_3)

    Output
    $key (cluster #)    $value
    1                   x_11    x_12
    3                   x_21    x_22
    ...                 ...
    2                   x_n1    x_n2
<14>

Example - k-means Clustering
• In the Reduce function..

    Input, grouped by key
    $key    $value              $key    $value              $key    $value
    1       x_a1,1    x_a1,2    2       x_b1,1    x_b1,2    3       x_c1,1    x_c1,2
    1       x_a2,1    x_a2,2    2       x_b2,1    x_b2,2    3       x_c2,1    x_c2,2
    ...     ...                 ...     ...                 ...     ...
    1       x_al,1    x_al,2    2       x_bm,1    x_bm,2    3       x_cp,1    x_cp,2

    Output: new centers
    $key    $value
    1       c_11    c_12
    2       c_21    c_22
    3       c_31    c_32
<15>

Example - k-means Clustering
<16>

Reference
- Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Google, Inc.
<17>
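
Supplementary sketch for the k-means slides
The k-means slides describe the map and reduce steps only as diagrams. The following is a minimal single-machine sketch in base R of the same loop, with no Hadoop or rmr calls, so that the logic can be tried without a cluster. It is not the authors' implementation: the names kmeans.map, kmeans.reduce and run.kmeans, the squared Euclidean distance, the fixed iteration count, and the toy data are all assumptions made for illustration.

```r
# Hypothetical helper names; base R only, no rmr / Hadoop dependency.

# MAP step: assign every point (row of `points`) to its nearest center.
kmeans.map <- function(points, centers) {
  # n x k matrix of squared Euclidean distances, one column per center
  d2 <- vapply(seq_len(nrow(centers)), function(j) {
    rowSums(sweep(points, 2, centers[j, ])^2)
  }, numeric(nrow(points)))
  d2 <- matrix(d2, nrow = nrow(points))
  # key = index of the closest center, value = the point itself
  list(key = max.col(-d2, ties.method = "first"), val = points)
}

# REDUCE step: for each key, replace the center by the mean of its points.
# A center that received no points is left unchanged (an assumption).
kmeans.reduce <- function(key, val, centers) {
  for (j in seq_len(nrow(centers))) {
    members <- val[key == j, , drop = FALSE]
    if (nrow(members) > 0) centers[j, ] <- colMeans(members)
  }
  centers
}

# Driver: alternate map and reduce for a fixed number of iterations.
run.kmeans <- function(points, centers, n.iter = 10) {
  for (i in seq_len(n.iter)) {
    assigned <- kmeans.map(points, centers)
    centers <- kmeans.reduce(assigned$key, assigned$val, centers)
  }
  list(centers = centers, cluster = kmeans.map(points, centers)$key)
}

# Toy data: two well-separated groups of 50 points in 2 dimensions.
set.seed(1)
points <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
                matrix(rnorm(100, mean = 5, sd = 0.3), ncol = 2))
fit <- run.kmeans(points, centers = points[c(1, 51), ], n.iter = 5)
print(fit$centers)
```

In an rmr job, the body of kmeans.map would sit inside the map function and emit keyval(key, val), and the per-key average would sit inside the reduce function; the driver loop would then call mapreduce once per iteration.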