International Journal of Research In Science & Engineering Volume: 1 Special Issue: 2 e-ISSN: 2394-8299 p-ISSN: 2394-8280 A REAL TIME MEMORY SLOT UTILIZATION DESIGN FOR MAPREDUCE MEMORY CLUSTERS Suma R 1, Vinay T R 2, Byre Gowda B K 3 1 Post Graduate Student, CSE, SVCE, Bangalore 2 Assistant Professor, CSE, SVCE, Bangalore 3 Assistant Professor, CSE, Sir MVIT, Bangalore ABSTRACT MapReduce processes large data sets in the cloud by splitting the input into multiple parts, assigning them to slots, and then running the map and reduce phases. The slot-based MapReduce model is not very effective: unoptimized resource allocation leads to poor performance, and it faces several challenges. MapReduce job execution has two unique features: map slots can be allocated only to map tasks and reduce slots only to reduce tasks, and map tasks are processed before reduce tasks. Maximizing data locality is required to improve the efficiency and utilization of the system, and several challenges must be addressed to solve this problem. DynamicMR is a dynamic slot allocation framework that improves the performance of MapReduce [1]. DynamicMR builds on the Hadoop Fair Scheduler (HFS) and consists of three optimization techniques: Dynamic Hadoop Slot Allocation (DHSA), Speculative Execution Performance Balancing (SEPB), and Slot PreScheduling. 1. INTRODUCTION Big data is a collection of both structured and unstructured data that is too large, fast, and varied to be managed by traditional database management tools or traditional data processing applications. Hadoop is an open-source software framework from Apache that supports scalable distributed applications. Hadoop runs applications on large clusters of commodity hardware, provides fast and reliable analysis of both structured and unstructured data, and uses a simple programming model. 
Hadoop can scale from single servers to thousands of machines, each offering local computation and storage. Despite many studies on optimizing MapReduce/Hadoop, key challenges remain for improving the utilization and performance of a Hadoop cluster. First, resources (e.g., CPU cores) are abstracted as map and reduce slots, and MapReduce job execution has two unique features: 1. the slot allocation constraint that map slots are allocated only to map tasks and reduce slots only to reduce tasks, and 2. map tasks are executed before reduce tasks. Two observations follow: 1. different slot configurations yield different system utilization and performance for a MapReduce workload, and 2. idle reduce slots degrade performance and system utilization. Second, the straggler problem of slow map or reduce tasks delays the whole job. Third, maximizing data locality improves performance and slot utilization efficiency for a MapReduce workload. DynamicMR comprises three techniques: Dynamic Hadoop Slot Allocation (DHSA), Speculative Execution Performance Balancing (SEPB), and Slot PreScheduling (SP). DHSA 1. Slots can be used for either map or reduce tasks: if there are insufficient map slots for the pending map tasks, unused reduce slots can be borrowed, and similarly the reduce phase can borrow unused map slots when reduce tasks outnumber reduce slots. 2. Otherwise, map slots are used for map tasks and reduce slots for reduce tasks. SEPB SEPB identifies slow-running tasks. We propose Speculative Execution Performance Balancing for these speculative tasks; it balances the performance tradeoff between a single job's and a batch of jobs' execution time. 
Slot PreScheduling An approach for improving data locality in MapReduce without sacrificing fairness. DynamicMR improves the performance and utilization of MapReduce workloads by 46%-115% for a single job and 49%-112% for multiple jobs. Fig-1: Overview of the DynamicMR framework (slot utilization optimization through DHSA, with its PI-DHSA and PD-DHSA variants; utilization efficiency optimization of map and reduce tasks through SEPB and Slot PreScheduling). MapReduce is popular in industry, bioinformatics, and machine learning, and Hadoop is its best-known implementation [6]. Multiple tasks run on each node; each node hosts a configured number of map and reduce slots. A slot becomes occupied when a task is assigned to it and is released when the task completes. Resource underutilization can be overcome by resource stealing. MapReduce uses speculative execution to support fault tolerance: the master node tracks the progress of all scheduled tasks, and when it finds a slow-running task it launches a speculative task to finish the work faster. Processing huge data requires a framework and powerful hardware; Google proposed MapReduce for parallel data processing, and it is a dynamic and aggressive approach. Fairness and data locality sometimes conflict with each other: strict fairness degrades data locality, while purely locality-driven scheduling results in unfair resource usage. MapReduce is a programming model for large-scale data processing [2], processing as much as 20 petabytes of data per day. Hadoop is the open-source implementation of MapReduce, used by companies such as Facebook and Google. MapReduce uses a distributed storage layer referred to as the Hadoop Distributed File System (HDFS). A user submits a job comprising a map function and a reduce function, which are transformed into map and reduce tasks respectively. HDFS splits the data into equal-sized blocks and distributes them across cluster nodes, where the map tasks are performed. 
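The programming model just described, a user-supplied map function and reduce function with the framework splitting the input and grouping intermediate results, can be illustrated by a minimal single-process word-count sketch; the function names are our own, not Hadoop's API:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit (word, 1) pairs, analogous to a map task."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group intermediate values by key, as the framework
    does between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate each key's values into a final count."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big cluster", "data locality"]
result = reduce_phase(shuffle(map_phase(docs)))
print(result["big"])   # 2
print(result["data"])  # 2
```

In a real cluster the documents would be HDFS blocks, the map tasks would run on the nodes holding those blocks, and the shuffle would move intermediate data over the network.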
Intermediate map outputs are partitioned among one or more reduce tasks. The Locality-Aware Reduce Task Scheduler (LARTS) sizes partitions so as to achieve data locality [7]. 2. EXISTING SYSTEM Scheduling and resource allocation optimization There is prior work on scheduling and resource allocation for MapReduce jobs. In MapReduce, one job consists of multiple tasks. When all jobs arrive at the same time, the objective is to minimize job completion time. To achieve this, a computation model is developed to solve large-scale data problems and perform graph analysis, with MapReduce modeled as a two-stage hybrid flow shop. Ordering job submissions improves system performance and utilization, but it requires the map and reduce task execution times to be known in advance, which is not possible in real-world applications. DHSA, in contrast, can be used for any MapReduce workload. Even under an optimal map/reduce slot configuration, there is room for improving the performance of a MapReduce workload. Guo et al. propose resource stealing [3], which steals resources reserved for idle slots by adopting multi-threading for tasks running on multiple CPU cores. Polo et al. propose a resource-aware scheduling technique for MapReduce workloads, which improves resource utilization [10]. DHSA improves system utilization by allocating unused map and reduce slots. YARN, the new version of Hadoop, overcomes Hadoop's inefficiency problem by managing resources such as memory and network bandwidth directly. However, for multiple jobs DynamicMR performs better than YARN, because YARN has no concept of slots. 
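As an aside on the two-stage flow-shop view of MapReduce mentioned above (map as the first stage, reduce as the second), the classical two-machine case can be sequenced optimally by Johnson's rule. This is a textbook illustration under strong simplifying assumptions (one machine per stage, task times known in advance), not the hybrid flow-shop model used in the cited work:

```python
def johnson_order(jobs):
    """Johnson's rule for a two-machine flow shop: jobs faster on
    stage 1 go first (ascending stage-1 time), the rest go last
    (descending stage-2 time). Each job is a (stage1, stage2) pair."""
    first = sorted([j for j in jobs if j[0] < j[1]], key=lambda j: j[0])
    last = sorted([j for j in jobs if j[0] >= j[1]], key=lambda j: -j[1])
    return first + last

def makespan(order):
    """Simulate the two stages: a job's stage 2 starts only after its
    own stage 1 finishes and the previous job has left stage 2."""
    t1 = t2 = 0
    for a, b in order:
        t1 += a                # stage 1 ("map") completion time
        t2 = max(t2, t1) + b   # stage 2 ("reduce") completion time
    return t2

jobs = [(3, 6), (5, 2), (1, 2)]
print(johnson_order(jobs))            # [(1, 2), (3, 6), (5, 2)]
print(makespan(johnson_order(jobs)))  # 12
```

The point of the flow-shop model is exactly this dependence of the schedule on task times, which is why such approaches need execution times known in advance.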
Speculative Execution Optimization The straggler problem is handled using LATE (Longest Approximate Time to End), a speculative execution algorithm that targets heterogeneous environments and caps the number of speculative tasks. Guo et al. improve LATE's performance by proposing Benefit-Aware Speculative Execution (BASE), which evaluates the potential benefit of speculative tasks. Complementing these, we propose SEPB to balance the tradeoff between a single job's and a batch of jobs' execution time. Data Locality Optimization Data locality optimization improves the efficiency and performance of cluster utilization [4]. MapReduce has a map side and a reduce side. Map-side data locality optimization moves map tasks close to their input data blocks. MapReduce jobs are classified into map-input heavy, map-and-reduce-input heavy, and reduce-input heavy. Reduce-side data locality places reduce tasks on the machines that generate the intermediate data from map tasks. Map-side data locality is the concern of Slot PreScheduling, which uses extra idle slots to maximize data locality while keeping fairness. The delay scheduler and Slot PreScheduling together achieve fairness and data locality. SEPB and Slot PreScheduling are the two slot-efficiency optimizers that complement DHSA. MapReduce Optimization on Cloud Computing DynamicMR is a fine-grained optimization for Hadoop. Combining existing systems with DynamicMR can yield a framework that also addresses budget in cloud computing. 3. PROPOSED SYSTEM MapReduce performance can be improved from two perspectives. First, slots are classified into busy slots and idle slots; one approach is to increase slot utilization by maximizing busy slots and minimizing idle slots. Second, not every busy slot is efficiently utilized, so our approach also improves the utilization efficiency of busy slots. 
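As a concrete illustration of LATE's heuristic discussed in the related work above, estimating each running task's remaining time from its progress rate and speculating on the slowest tasks up to a cap, here is a simplified sketch; the task records and field names are our own assumptions, not Hadoop's API:

```python
def time_to_end(progress, elapsed):
    """LATE-style estimate: progress rate = progress / elapsed,
    remaining time = (1 - progress) / rate."""
    rate = progress / elapsed
    return (1.0 - progress) / rate

def pick_speculative(tasks, cap):
    """Rank running tasks by estimated time to completion (longest
    first) and pick up to `cap` candidates for speculative copies."""
    ranked = sorted(tasks,
                    key=lambda t: time_to_end(t["progress"], t["elapsed"]),
                    reverse=True)
    return [t["id"] for t in ranked[:cap]]

tasks = [
    {"id": "t1", "progress": 0.9, "elapsed": 9.0},   # ~1s remaining
    {"id": "t2", "progress": 0.2, "elapsed": 10.0},  # straggler: ~40s remaining
    {"id": "t3", "progress": 0.5, "elapsed": 10.0},  # ~10s remaining
]
print(pick_speculative(tasks, cap=1))  # ['t2']
```

The real LATE algorithm additionally uses slow-node heuristics and progress thresholds before launching a backup task; this sketch shows only the ranking-and-cap idea that SEPB builds on.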
DHSA increases slot utilization while maintaining fairness [4]. SEPB mitigates slow-running tasks. Slot PreScheduling [10] improves performance through data locality while keeping fairness. DynamicMR proceeds through the following steps: 1. When there is an idle slot, DynamicMR improves slot utilization with DHSA, which decides whether or not to allocate the slot, subject to fairness. 2. If allocation is approved, DynamicMR improves the efficiency of the slot with SEPB, whose speculative execution achieves a performance tradeoff between a single job and a batch of jobs. 3. When idle slots are allocated to pending map tasks, DynamicMR further improves slot utilization efficiency with Slot PreScheduling. 3.1 Dynamic Hadoop Slot Allocation: The current MapReduce design suffers from slot underutilization because the number of map and reduce tasks varies over time, and at times the number of map/reduce tasks exceeds the number of map/reduce slots. When reduce tasks are overloaded, unused map slots can run them, improving MapReduce performance; conversely, when all the workload lies on the map side, idle reduce slots can run map tasks. Thus map and reduce tasks can run on either map slots or reduce slots. Two issues arise: 1. In HFS, fairness is important: allocation is fair when all pools are allocated equal amounts of resources. 2. Map slots and reduce slots have different resource requirements [9]; memory and network bandwidth, for example, are the dominant resources of reduce tasks. DHSA has two alternatives, PI-DHSA and PD-DHSA. Pool-Independent DHSA: PI-DHSA consists of two parts: Fig-2: Pool-Independent DHSA 1 Intra-phase dynamic slot allocation: each pool is divided into two sub-pools, a map-phase pool and a reduce-phase pool. A pool that is overloaded and has slot demand can borrow unused slots from other pools of the same phase. For example, map-phase pool 2 can borrow map slots from map-phase pools 1 and 3. 
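The intra-phase borrowing just described, an overloaded pool taking the unused slots of other pools in the same phase, can be sketched as follows; the equal fair-share split and pool names are our own illustrative assumptions, not the exact HFS algorithm:

```python
def intra_phase_allocate(demands, total_slots):
    """Each pool first receives min(demand, fair share); the slots
    left unused by underloaded pools are then lent to pools whose
    demand exceeds their share."""
    share = total_slots // len(demands)
    alloc = {p: min(d, share) for p, d in demands.items()}
    spare = total_slots - sum(alloc.values())
    for p, d in sorted(demands.items()):
        extra = min(d - alloc[p], spare)  # borrow up to remaining demand
        alloc[p] += extra
        spare -= extra
    return alloc

# three map-phase pools sharing 12 map slots (fair share = 4 each)
demands = {"pool1": 2, "pool2": 9, "pool3": 3}
print(intra_phase_allocate(demands, 12))  # {'pool1': 2, 'pool2': 7, 'pool3': 3}
```

With a fair share of 4 slots per pool, the overloaded pool 2 ends up borrowing the three slots that pools 1 and 3 leave unused, matching the example above.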
2 Inter-phase dynamic slot allocation: when the reduce phase contains unused reduce slots and there are insufficient map slots for the pending map tasks, the map phase borrows idle slots from the reduce phase. Let Nm be the total number of map tasks, Nr the total number of reduce tasks, Sm the total number of map slots, and Sr the total number of reduce slots. Case 1: Nm ≤ Sm and Nr ≤ Sr. Map slots run map tasks and reduce slots run reduce tasks; no slot borrowing takes place. Case 2: Nm > Sm and Nr < Sr. Reduce slots run the reduce tasks, and idle reduce slots are used to run map tasks. Case 3: Nm < Sm and Nr > Sr. Unused map slots are used to run reduce tasks. Case 4: Nm > Sm and Nr > Sr. The system is in a busy state; no slots move between the map and reduce phases. Two variables, percentageOfBorrowedMapSlots and percentageOfBorrowedReduceSlots, bound the borrowing. PD-DHSA: Fig-3: Pool-Dependent DHSA Each pool's two phase pools, map-phase and reduce-phase, are selfish: a pool first satisfies its own demand with its shared map and reduce slots before lending to other pools. There are two processes: 1 Intra-pool dynamic slot allocation: within a pool there are four possible relationships. Case a: mapSlotsDemand < mapShare and reduceSlotsDemand > reduceShare: the pool's overloaded reduce tasks first borrow the unused map slots of its own map-phase pool. Case b: mapSlotsDemand > mapShare and reduceSlotsDemand < reduceShare: the reduce phase lends its unused slots to the pool's map tasks. Case c: mapSlotsDemand ≤ mapShare and reduceSlotsDemand ≤ reduceShare: neither phase needs to borrow, and the pool can give spare slots to other pools. Case d: mapSlotsDemand > mapShare and reduceSlotsDemand > reduceShare: both map and reduce slots are insufficient, and the pool must borrow slots from other pools. 2 Inter-pool dynamic slot allocation: when mapSlotsDemand + reduceSlotsDemand ≤ mapShare + reduceShare, there is no need to borrow slots from other pools. 
When mapSlotsDemand + reduceSlotsDemand > mapShare + reduceShare, slots remain insufficient even after intra-pool dynamic slot allocation, so the pool borrows unused slots from other pools. A tasktracker has four possible slot allocations: Fig-4: Slot Allocation for PD-DHSA Case 1: if the tasktracker has idle map slots and some pool has pending map tasks, it allocates map tasks to them. Case 2: if Case 1 fails, and the tasktracker has idle reduce slots and some pool has pending reduce tasks, it allocates reduce tasks to them. Case 3: if Cases 1 and 2 fail, pending map tasks are tried on the idle reduce slots. Case 4: otherwise, pending reduce tasks are allocated to the idle map slots. 3.2 Speculative Execution Performance Balancing: MapReduce job execution time is very sensitive to slow-running tasks (stragglers), which arise from faulty hardware and software misconfiguration. Stragglers are of two types. Hard straggler: a task that enters a deadlock due to endless waiting for certain resources; since it will never stop, it must be killed and a backup task run in its place. Soft straggler: a task that takes much longer than common tasks but does complete successfully. The straggler problem is detected by the LATE algorithm, and speculative execution reduces job execution time. Fig-5: totalNumOfPendingMapTasks and totalNumOfPendingReduceTasks In SEPB, failed tasks are given the highest priority; pending tasks are considered second. LATE handles straggling tasks by launching a backup task and allocating it a slot. Consider an example with six jobs, where the speculative cap for LATE is 4, the maximum number of jobs checked for pending tasks is 4, and there are 4 idle slots. 
SEPB will allocate all 4 idle slots to pending tasks, because the numbers of pending tasks for J1-J6 are 0, 0, 10, 10, 15, and 20 respectively. SEPB works on top of LATE and is an enhancement of it. 3.3 Slot PreScheduling: Slot PreScheduling improves data locality [5] without a negative impact on the fairness of MapReduce jobs. Defn 1: the available idle map slots are the idle map slots that can be allocated to the tasktracker. Defn 2: the extra idle map slots are obtained by subtracting the map slots in use from the allowed available idle map slots. Technique Fairness Slot Utilization Performance DHSA + + + SEPB _ %(+) + DS + %(+) + SPS + %(+) + TABLE 1: +, _, % denote benefit, cost, and efficiency respectively. 4. CONCLUSION The DynamicMR framework improves the performance of MapReduce workloads while maintaining fairness. Its three techniques, DHSA, SEPB, and Slot PreScheduling, all focus on slot utilization in a MapReduce cluster: DHSA maximizes slot utilization, SEPB identifies slot inefficiency, and Slot PreScheduling improves slot utilization efficiency. Combining these techniques improves the Hadoop system. REFERENCES [1] Q. Chen, C. Liu, Z. Xiao. Improving MapReduce Performance Using Smart Speculative Execution Strategy. IEEE Transactions on Computers, 2013. [2] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI'04, pp. 107-113, 2004. [3] Z. H. Guo, G. Fox, M. Zhou, Y. Ruan. Improving Resource Utilization in MapReduce. In IEEE Cluster'12, pp. 402-410, 2012. [4] Z. H. Guo, G. Fox, and M. Zhou. Investigation of Data Locality and Fairness in MapReduce. In MapReduce'12, pp. 25-32, 2012. [5] Z. H. Guo, G. Fox, and M. Zhou. Investigation of Data Locality in MapReduce. In IEEE/ACM CCGrid'12, pp. 419-426, 2012. [6] Hadoop. http://hadoop.apache.org. [7] M. Hammoud and M. F. Sakr. 
Locality-Aware Reduce Task Scheduling for MapReduce. In IEEE CLOUDCOM'11, pp. 570-576, 2011. [8] M. Hammoud, M. S. Rehman, M. F. Sakr. Center-of-Gravity Reduce Task Scheduling to Lower MapReduce Network Traffic. In IEEE CLOUD'12, pp. 49-58, 2012. [9] B. Palanisamy, A. Singh, L. Liu, and B. Jain. Purlieus: Locality-aware Resource Allocation for MapReduce in a Cloud. In SC'11, pp. 1-11, 2011. [10] J. Polo, C. Castillo, D. Carrera, et al. Resource-aware Adaptive Scheduling for MapReduce Clusters. In Middleware'11, pp. 187-207, 2011.