
World Applied Sciences Journal 28 (11): 1847-1852, 2013
ISSN 1818-4952
© IDOSI Publications, 2013
DOI: 10.5829/idosi.wasj.2013.28.11.1934
A Novel Data Replication Policy in Data Grid
Y. Nematii
Department of Computer Engineering, Islamic Azad University, Beyza Branch, Iran
Abstract: A data grid aims to provide services for sharing and managing large data files around the world. A challenging problem in data grids is how to reduce network traffic. One common approach is to replicate files at different sites and select the best replica to reduce access latency: replication generates multiple copies of a file and stores them at suitable locations to shorten file access time. Building on these two ideas, in this paper we propose a dynamic data replication strategy. Simulation results with OptorSim show that our algorithm performs better than earlier ones.
Key words: Data Replication · Data Grid · OptorSim
INTRODUCTION
A data grid is a major type of grid used in data-intensive applications, where data file sizes reach terabytes or sometimes petabytes. High Energy Physics (HEP), genetics and Earth observation are examples of such applications. In these applications, managing and storing this huge volume of data is very difficult. These massive amounts of data are shared among researchers around the world, so managing such large-scale distributed data resources is hard; the data grid is a solution to this problem [1]. A data grid is an integrating architecture that connects a group of geographically distributed computers and storage resources, enabling users to share data and resources [2]. When a user requests a file, a huge amount of bandwidth may be spent transferring it from server to client. Data replication is therefore an important optimization step for managing large data: data is replicated at different, globally distributed sites. Replication can be static or dynamic. In static replication, a replica is kept until it is deleted by the client or its lifetime expires. Dynamic replication, by contrast, creates and deletes replicas according to changes in access patterns or in the behavior of the environment. Data replication raises three issues: (1) replica management, (2) replica selection and (3) replica location. Replica management is the process of creating, deleting, moving and modifying replicas in a data grid. Replica selection is the process of choosing the most appropriate replica from the copies geographically spread across the grid. Lastly, replica location is the problem of choosing useful physical locations for the several replicas of the desired data in a large-scale wide-area data grid system. Several methods have been proposed for dynamic replication in grids [3-5]. The motivation for replication is that it can enhance data availability, network performance, load balancing and file access performance.
In a data grid, the latency of a job depends on the computing resource selected for its execution and on the location of the data file(s) the job needs to access. Scheduling jobs on suitable execution nodes and placing data at suitable locations are therefore important from the user's point of view, and difficult for the following reasons. Each job needs a large amount of data, so data locality must be taken into account in any scheduling decision. Data management policies such as data replication can decrease data access latency and improve the performance of the grid environment. Consequently, there should be a coupling between the data locality achieved through replication strategies and the scheduling itself.
In this paper, a dynamic data replication policy is proposed. We developed a replication strategy for a 3-level hierarchical structure, called HRS (Hierarchical Replication Strategy). In this algorithm, a replicated file is stored at the site with the largest number of accesses for that file, and the main factor in replica selection is bandwidth. Our replication strategy was simulated in OptorSim and compared with different scheduling algorithms and replication strategies. The simulation results show that the proposed method achieves the minimum job execution time.
The rest of the paper is organized as follows: Section 2 gives an overview of previous work on data replication and job scheduling. Section 3 presents our scheduling algorithm and replication strategy. Section 4 describes our experiments and simulation results. Finally, Section 5 presents conclusions and future work.
Related Work: In [8, 9], Ranganathan and Foster proposed six distinct replication strategies for multi-tier data grids: (1) No replication or caching: only the root node keeps replicas; (2) Best Client: a replica is created at the client with the largest number of requests for the file; (3) Cascading replication: original data files are stored at the root; when a file's popularity exceeds the root's threshold, a replica is created at the next level down on the path to the best client; (4) Plain caching: the client that requests the file keeps a local copy; (5) Caching plus Cascading Replication: a combination of Plain caching and Cascading; and (6) Fast Spread: replicas of the file are stored at each node on the path to the best client. They compared these strategies under three different access patterns, measuring the access latency and bandwidth consumption of each algorithm in a simulator. The simulations showed that different access patterns call for different replication strategies; Cascading and Fast Spread performed best. The authors also developed several scheduling strategies and combined them with the replication strategies; the results showed that data locality is important in job scheduling.
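To make the cascading idea concrete, the following is a minimal sketch of strategy (3), assuming a simple tree of nodes with per-node request counters; the class layout and threshold handling are illustrative assumptions, not the implementation evaluated in [8, 9].

```python
class Node:
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []
        self.requests = {}      # file name -> request count seen at this node
        self.replicas = set()   # file names replicated at this node

def best_client_child(node, file):
    # The child lying on the path to the client with the most requests.
    return max(node.children, key=lambda c: c.requests.get(file, 0), default=None)

def cascade(node, file, threshold):
    # Once popularity at `node` exceeds the threshold, push a replica one
    # tier down toward the best client and repeat from there.
    if node.requests.get(file, 0) > threshold:
        child = best_client_child(node, file)
        if child is not None and file not in child.replicas:
            child.replicas.add(file)
            cascade(child, file, threshold)
```

Fast Spread differs in that it places a replica at every node on the path to the best client rather than one tier at a time.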
In [10], Chakrabarti et al. proposed an Integrated Replication and Scheduling Strategy (IRS). Their goal was an iterative improvement of performance based on the coupling between the scheduling and replication methods. After jobs are scheduled, the popularity of the required files is computed and then used to replicate data for the next set of jobs.
In [11], the authors argued that data placement jobs should be treated differently from computational jobs, because they may have different semantics and different properties. They therefore developed a scheduler that provides solutions for many of the data placement activities in a grid. Their idea was to place data near the computational resource so that computational cycles are used efficiently. The method can automatically determine which protocol to use when transferring data from one host to another.
Sashi and Thanamani [12] proposed a dynamic replication strategy in which replicas are created based on weights. The main points of this algorithm are that newer records are given higher weights, increasing their importance, and that replica placement is done at the site level. In this method, the time distance of each access from the current time and the number of accesses are the main factors in deciding file popularity.
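As an illustration of such recency weighting, the sketch below scores a file by summing exponentially decayed weights over its past accesses, so newer accesses contribute more; the decay form and the half-life constant are our assumptions, not the exact formula of [12].

```python
import math
import time

def popularity(access_times, now=None, half_life=3600.0):
    # One weight per recorded access: an access `half_life` seconds ago
    # counts half as much as an access happening right now.
    now = time.time() if now is None else now
    decay = math.log(2) / half_life
    return sum(math.exp(-decay * (now - t)) for t in access_times)
```

A file accessed many times long ago can then score lower than a file accessed a few times recently, which is exactly the effect described above.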
A novel replication method, called the Branch Replication Schema (BRS), was presented in [13]. In this model, each replica is composed of a set of disjoint subreplicas organized in a hierarchical tree topology, so the subreplicas of a replica do not overlap. To improve the scalability and performance of the system for both read and write operations, a parallel approach was used [14-16]. Simulation results showed that BRS improves data access performance for files of different sizes and provides an efficient way of updating data replicas.
Sang-Min Park et al. [17] presented Bandwidth Hierarchy Replication (BHR), which decreases data access time by avoiding network congestion in a data grid network. The BHR algorithm exploits 'network-level locality', meaning that the desired file is located at a site that has broad bandwidth to the site of job execution. In a data grid, closely linked sites may be grouped within a region; a country, for example, can be regarded as such a network region. Network bandwidth between sites within a region is higher than bandwidth between sites in different regions, so a hierarchy of network bandwidth appears in the Internet. If the required file is located in the same region, the time needed to fetch it is lower; BHR decreases data access time by maximizing this network-level locality. In BHR, replicated files are placed at all requesting sites. In a follow-up paper [18], the authors presented Modified BHR, which replicates a file at the site where it has been accessed most frequently, on the assumption that it will be required there in the future. The results showed that the proposed algorithm minimized data access time and avoided unnecessary replication, improving overall performance. However, they did not use an efficient scheduling algorithm; in our work, we address both replication and scheduling.
A replication algorithm for a 3-level hierarchical structure and a scheduling algorithm were proposed in [19]. The authors considered a network hierarchy with three levels: nodes at the first level are connected through the Internet, which has low bandwidth; the next level has moderately higher bandwidth than the first; and at the last level, nodes are connected to each other by the highest-speed networks. To schedule a job efficiently, the algorithm selects the best region, LAN and site, in that order. In their method, when there is not enough space for a new replica, the major factor in choosing what to delete is bandwidth, which leads to better performance than the LRU method. Although this 3-layer replication improves some performance metrics, it has a weakness: a replica is placed at every requesting site rather than at the best site. We therefore extended this strategy to lessen this weakness.
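The sketch below gives one plausible reading of this bandwidth-driven replacement, assuming per-file sizes and a lookup of the bandwidth to the nearest remaining copy; when space runs out, the replicas that are cheapest to re-fetch are evicted first.

```python
def evict_for(new_file_size, cached, free_space, bw_to_nearest_copy):
    # cached: file -> size in GB; bw_to_nearest_copy: file -> Mbps to the
    # closest other copy of that file elsewhere in the grid (assumed known).
    victims = []
    # Evict cheapest-to-recover replicas first: widest pipe to another copy.
    for f in sorted(cached, key=lambda f: bw_to_nearest_copy[f], reverse=True):
        if free_space >= new_file_size:
            break
        victims.append(f)
        free_space += cached[f]
    return victims
```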
In [6], the authors presented a data grid architecture supporting efficient data replication and job scheduling. They grouped computing sites into individual domains according to network connectivity, with a replica server in each domain. They proposed two centralized dynamic replication algorithms with different replica placement methods and one distributed dynamic replication algorithm, together with grid scheduling heuristics such as shortest turnaround time, least relative load and data present. They used the LRU algorithm for replica replacement. The results showed that the dynamic replication algorithms can reduce job turnaround time significantly.
In [20], the performance of eight dynamic replication strategies was evaluated under different data grid settings. The simulation results showed that the chosen file replication policy and the file access pattern have a great influence on real-time grid performance. Fast Spread-Enhanced was the best of the eight algorithms considered. Peer-to-peer communication was also shown to be very profitable in boosting real-time performance.
In [21], the authors presented a dynamic replication algorithm based on Fast Spread for multi-tier data grids, called Predictive Hierarchical Fast Spread (PHFS). PHFS predicts a user's subsequent requests and adapts the replication configuration to the current conditions, thereby increasing access locality. PHFS builds on the idea that users working in the same context are highly likely to want some of the same files. One of the main results is that PHFS is appropriate for applications in which clients work in one context for a period of time and their requests are not random. The authors compared PHFS with common Fast Spread in terms of access latency, and the results showed that PHFS achieves better performance and lower latency.
Rashedur M. Rahman et al. [22] proposed a technique for selecting the best replica that exploits previous file transfer logs using the k-Nearest Neighbor (KNN) rule. When a new request arrives, all prior data is used to find the subset of prior file requests that are similar to it, and these are employed to predict the best site for the replica. They also proposed a predictive framework that considers data from different sources and predicts the transfer times of the sites hosting replicas. The results showed that the KNN method yields a better performance improvement than the traditional replica-catalog-based model.
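A minimal sketch of this style of KNN prediction follows, assuming a per-site log of past transfers and a simple distance over file size and hour of day; the features, the distance and k are illustrative choices, not those of [22].

```python
def predict_transfer_time(log, file_size, hour, k=3):
    # log: list of (file_size, hour, observed_transfer_time) tuples,
    # assumed non-empty. Average the times of the k most similar transfers.
    def distance(entry):
        s, h, _ = entry
        return abs(s - file_size) + abs(h - hour)
    nearest = sorted(log, key=distance)[:k]
    return sum(t for _, _, t in nearest) / len(nearest)

def best_replica_site(logs_by_site, file_size, hour):
    # Pick the hosting site whose predicted transfer time is smallest.
    return min(logs_by_site,
               key=lambda s: predict_transfer_time(logs_by_site[s], file_size, hour))
```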
Replication Algorithm: When a job is allocated to a local scheduler, the replica manager transfers all the needed files that are not available at the local site. The goal of replication is to transfer the required data to the local site before job execution; data replication thus enhances job scheduling performance by decreasing job execution time. Our replication strategy decides which copy should be transferred and how this new replica is handled. Replicated files are not placed at all requesting sites. Three types of locality are identified in [9].
For every file a job requires, the Replica Manager first checks whether the file exists at the local site. If the file is not available locally, the method searches the local LAN; if the file is duplicated in the same LAN, it chooses the copy with the maximum available bandwidth for the transfer. If the file does not exist in the local LAN, it searches the local region and chooses the copy with the highest bandwidth to the requesting node. If the file does not exist in the same region either, it builds a list of replicas in other regions and chooses the replica with the maximum available bandwidth for the transfer. In other words, the method distinguishes between intra-LAN, intra-region and inter-region communication.
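The following minimal sketch captures this lookup order; the catalogue and bandwidth helpers are our illustrative stand-ins, not OptorSim APIs.

```python
def locate_replica(file, site, catalogue, bandwidth):
    # catalogue.has_local(file, site)         -> is the file already at `site`?
    # catalogue.sites_with(file, site, scope) -> holders within a given scope
    # bandwidth(src, dst)                     -> available Mbps between sites
    if catalogue.has_local(file, site):
        return site                          # no transfer needed
    for scope in ("lan", "region", "other_regions"):
        candidates = catalogue.sites_with(file, site, scope)
        if candidates:
            # Prefer the source with the widest pipe to the requester.
            return max(candidates, key=lambda src: bandwidth(src, site))
    return None                              # no replica exists anywhere
```

In our strategy this lookup runs once for each file a job requires, before the job starts executing.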
Simulation
Simulation Tool: We evaluated the performance of our algorithms by implementing them in OptorSim, a data grid simulator developed by the EU DataGrid project; its architecture is shown in Fig. 1 [23]. It was designed to evaluate dynamic replication strategies [24, 25]. OptorSim consists of several components: the Computing Element (CE), which represents a computational resource that processes grid jobs; the Storage Element (SE), where grid data are stored; the Resource Broker (RB), which receives job submissions from users and allocates each job to a proper site according to the scheduling policy; and the Replica Manager (RM) at each site, which controls data transfers between sites and provides an interface for directly accessing the Replica Catalogue. The Replica Optimizer (RO) within the RM contains the replication algorithm. The broker distributes jobs to grid sites; jobs take data from the SEs and are executed using the CEs. The Replica Manager maintains the replica catalogue, and the optimizer selects the best site from which to fetch a replica.
Fig. 1: OptorSim architecture
Table 1: Bandwidth configuration
Parameter                        Value (Mbps)
Inter-LAN bandwidth              1000
Intra-LAN bandwidth              100
Intra-region bandwidth           10

Table 2: General job configuration
Parameter                        Value
Number of job types              6
Number of file accesses per job  16
Size of a single file (GB)       1
Total size of files (GB)         100
Job delay (ms)                   2500
Maximum queue size               200
Experimental Evaluation: The OptorSim structure is flat, so the OptorSim code was modified to implement the hierarchical structure. We assume there are 3 regions, with three sites per region on average. The storage capacity of the master site is 250 GB and the storage capacity of every other site is 10 GB. The assumed bandwidths are shown in Table 1. The numbers of storage elements and computing elements are set to 11 and 10, respectively. Table 2 lists the simulation parameters and the values used in our study. There are 6 job types; each job type on average requires 16 files (of 1 GB each) to execute.
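For reference, the simulation set-up above can be collected into a single configuration mapping; the values come from Tables 1 and 2 and the surrounding text, while the key names are ours.

```python
SIMULATION_CONFIG = {
    "regions": 3,
    "sites_per_region": 3,                 # on average
    "master_storage_gb": 250,
    "site_storage_gb": 10,
    "storage_elements": 11,
    "computing_elements": 10,
    "inter_lan_bandwidth_mbps": 1000,      # Table 1
    "intra_lan_bandwidth_mbps": 100,
    "intra_region_bandwidth_mbps": 10,
    "job_types": 6,                        # Table 2
    "files_per_job": 16,
    "file_size_gb": 1,
    "total_file_size_gb": 100,
    "job_delay_ms": 2500,
    "max_queue_size": 200,
}
```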
RESULTS AND DISCUSSION
We studied the performance of various replication strategies under different algorithm combinations. Four scheduling strategies were evaluated: the Random scheduler, the Shortest Queue scheduler, the Access Cost scheduler and the Queue Access Cost scheduler. The Random scheduler assigns jobs to computing elements at random. The Shortest Queue scheduler selects the computing element with the fewest jobs waiting in its queue. The Access Cost scheduler assigns a job to the computing element where its files have the lowest access cost (the cost of getting all files needed to execute the job). The Queue Access Cost scheduler selects the computing element with the smallest sum of the access cost for the job itself and the access costs of all jobs already in its queue. More details about these schedulers are given in [27].
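A minimal sketch of the Queue Access Cost choice is given below, assuming each computing element exposes its waiting queue and that a transfer-time estimate is available; both are illustrative stand-ins for OptorSim internals.

```python
def access_cost(job, ce, transfer_time):
    # Estimated time to gather all files `job` needs at computing element `ce`.
    return sum(transfer_time(f, ce) for f in job.files)

def queue_access_cost_schedule(job, computing_elements, transfer_time):
    # Minimize (job's own access cost + access costs of everything queued).
    return min(
        computing_elements,
        key=lambda ce: access_cost(job, ce, transfer_time)
                       + sum(access_cost(q, ce, transfer_time) for q in ce.queue),
    )
```

Dropping the queue term recovers the Access Cost scheduler described above.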
Four replication algorithms were investigated: Least Frequently Used (LFU), Least Recently Used (LRU), BHR and Modified BHR. In the LRU strategy, replication always takes place at the site where the job is executing; if there is not enough space for the new replica, the least recently accessed file in the storage element is deleted. In LFU, replication also always takes place at the executing site, but when space is lacking, the least frequently accessed file is deleted. The BHR algorithm, which is based on network-level locality, tries to place as many different replicas as possible within a region where broad bandwidth is available between sites. The Modified BHR Region Based algorithm replicates different files within the region, with the condition that each replica is placed at the site where the file is accessed most often.
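The two replacement rules differ only in their choice of victim, as the sketch below illustrates; the per-file metadata layout is our assumption.

```python
def lru_victim(files):
    # files: name -> {"last_access": timestamp, "count": access count}.
    return min(files, key=lambda f: files[f]["last_access"])   # oldest access

def lfu_victim(files):
    return min(files, key=lambda f: files[f]["count"])         # fewest accesses
```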
Fig. 2: Mean job execution time
Figure 2 shows the effect of the number of jobs (up to 1500) on the runtimes of the different replication strategies. As the number of jobs increases, the performance of our method improves relative to the other methods; this matters because, in a grid environment, a large number of jobs must be run.
CONCLUSION
In this paper, we proposed a novel replication strategy. Our strategy places data in a manner that offers faster access to the files required by grid jobs, thereby improving job execution performance. If the available storage for replication is insufficient, the proposed algorithm replicates only those files that do not already exist in the local LAN.
We analyzed the performance of our replication method with OptorSim. Experimental results showed that our replication algorithm outperforms the LRU method as the number of jobs increases.
In future work, we plan to use other intelligent strategies to predict future needs and to obtain the best replication configuration.
REFERENCES
1. Chervenak, A., I. Foster, C. Kesselman, C. Salisbury and S. Tuecke, 2000. The Data Grid: towards an architecture for the distributed management and analysis of large scientific datasets, J. Network Comput. Appl., 23(3): 187-200.
2. Chervenak, A., I. Foster, C. Kesselman, C. Salisbury and S. Tuecke, 2001. The Data Grid: towards an architecture for the distributed management and analysis of large scientific datasets, Journal of Network and Computer Applications, 23: 187-200.
3. Lamehamedi, H. and B.K. Szymanski, 2007. Decentralized data management framework for Data Grids, Future Generation Computer Systems, Elsevier, 23: 109-115.
4. Abawajy, J.H., 2004. Placement of file replicas in Data Grid environments, ICCS'04, LNCS 3038, pp: 66-73.
5. Tang, M., B.S. Lee, C.K. Yeo and X. Tang, 2005. Dynamic replication algorithms for the multi-tier Data Grid, Future Generation Computer Systems, Elsevier, 21: 775-790.
6. Schintke, F. and A. Reinefeld, 2003. Modeling replica availability in large Data Grids, Journal of Grid Computing, 1(2).
7. Zhao, W., X. Xu, Z. Wang, Y. Zhang and S. He, 2010. A dynamic optimal replication strategy in Data Grid environment, IEEE.
8. Ranganathan, K. and I. Foster, 2001. Design and evaluation of dynamic replication strategies for high performance Data Grids, in: Proceedings of the International Conference on Computing in High Energy and Nuclear Physics, Beijing, China.
9. Ranganathan, K. and I. Foster, 2002. Identifying dynamic replication strategies for high performance Data Grids, in: Proceedings of the 3rd IEEE/ACM International Workshop on Grid Computing, Lecture Notes in Computer Science, vol. 2242, Denver, USA, pp: 75-86.
10. Chakrabarti, A., R.A. Dheepak and S. Sengupta, 2004. Integration of scheduling and replication in Data Grids, in: Lecture Notes in Computer Science, vol. 3296, pp: 375-385.
11. Kosar, T. and M. Livny, 2004. Stork: Making data placement a first class citizen in the grid, in: Proceedings of the 24th International Conference on Distributed Computing Systems (ICDCS 2004), Tokyo, Japan, pp: 342-349.
12. Sashi, K. and A. Selvadoss Thanamani, 2010. Dynamic replica management for Data Grid, IACSIT International Journal of Engineering and Technology, 2(4).
13. Pérez, J.M., F. García-Carballeira, J. Carretero, A. Calderón and J. Fernández, 2009. Branch replication scheme: A new model for data replication in large scale data grids, Future Generation Computer Systems, Elsevier.
14. Jin, H., T. Cortes and R. Buyya (Eds.), 2002. High Performance Mass Storage and Parallel I/O: Technologies and Applications, IEEE Press and Wiley.
15. Perez, J.M., F. Garcia, J. Carretero, A. Calderon and J. Fernandez, 2004. A parallel I/O middleware to integrate heterogeneous storage resources on grids, in: Grid Computing: First European Across Grids Conference, Santiago de Compostela, Spain, February 13-14, 2004, Lecture Notes in Computer Science, vol. 2970, pp: 124-131.
16. Garcia-Carballeira, F., J. Carretero, A. Calderon, J.D. Garcia and L.M. Sanchez, 2007. A global and parallel file system for grids, Future Generation Computer Systems, 23(1): 116-122.
17. Park, S.M., J.H. Kim, Y.B. Ko and W.S. Yoon, 2004. Dynamic data replication strategy based on Internet hierarchy BHR, in: Lecture Notes in Computer Science, vol. 3033, Springer-Verlag, Heidelberg, pp: 838-846.
18. Sashi, K. and A. Selvadoss Thanamani, 2011. Dynamic replication in a data grid using a Modified BHR Region Based Algorithm, Future Generation Computer Systems, Elsevier, 27: 202-210.
19. Sepahvand, R., A. Horri and Gh. Dastghaibyfard, 2008. A new 3-layer replication scheduling strategy in Data Grid, IEEE.
20. Dogan, A., 2009. A study on performance of dynamic file replication algorithms for real-time file access in Data Grids, Future Generation Computer Systems, Elsevier, 25: 829-839.
21. Khanli, L.M., A. Isazadeh and T.N. Shishavan, 2010. PHFS: A dynamic replication method, to decrease access latency in the multi-tier data grid, Future Generation Computer Systems, Elsevier.
22. Rahman, R.M., R. Alhajj and K. Barker, 2008. Replica selection strategies in data grid, J. Parallel Distrib. Comput., Elsevier.
23. Cameron, D.G., R. Carvajal-Schiaffino, A.P. Millar, C. Nicholson, K. Stockinger and F. Zini, 2004. OptorSim: A simulation tool for scheduling and replica optimization in data grids, in: International Conference for Computing in High Energy and Nuclear Physics (CHEP 2004), Interlaken.
24. OptorSim - A Replica Optimizer Simulation: http://edgwp2.web.cern.ch/edgwp2/optimization/optorsim.html
25. EU DataGrid project: http://www.eu-egee.org/
26. Hoschek, W., F.J. Jaen-Martinez, A. Samar, H. Stockinger and K. Stockinger, 2000. Data management in an international data grid project, in: Proceedings of the First IEEE/ACM International Workshop on Grid Computing (GRID '00), Lecture Notes in Computer Science, vol. 1971, Bangalore, India, pp: 77-90.
27. Cameron, D.G., R. Carvajal-Schiaffino, A.P. Millar, C. Nicholson, K. Stockinger and F. Zini, 2003. Evaluating scheduling and replica optimisation strategies in OptorSim, in: 4th International Workshop on Grid Computing (Grid2003), Phoenix, Arizona, IEEE Computer Society Press.