Gfarm Grid File System Osamu Tatebe University of Tsukuba

First NEGST workshop
June 23-24, 2006, Paris
Gfarm Grid File System
Osamu Tatebe
University of Tsukuba
Petascale Data Intensive Computing
High Energy Physics
z CERN LHC, KEK-B Belle
z ~MB/collision,
100 collisions/sec
z ~PB/year
z2000 physicists, 35 countries
Detector for
LHCb experiment
Detector for
ALICE experiment
Astronomical Data Analysis
z data analysis of the whole data
z TB~PB/year/telescope
z Subaru telescope
z 10 GB/night, 3 TB/year
Petascale Data Intensive Computing
Requirements
Storage Capacity
Peta/Exabyte scale files, millions of millions of files
Computing Power
> 1TFLOPS, hopefully > 10TFLOPS
I/O Bandwidth
> 100GB/s, hopefully > 1TB/s within a system and
between systems
Global Sharing
group-oriented authentication and access control
Does Grid computing helps?
Yes, of course !!
Federation of resources increases storage capacity and
computing power in total
But, reality is . . .
Federation does not immediately mean a big highperformance computer
Many many small resources not tightly coupled at distant
locations
Sharing large-scale data at 100GB/s rate not possible
Limited type of applications can be applied to
Network bandwidth improvement helps to be, but Grid
middleware needs to be matured
Application modification required to fully utilize resources
Grid MW should be a distributed OS that hides a detail
Recap - What we really want
A *High-Performance* Computer
> 10TFLOPS – scalable computing power
A *High-Performance* File System
> 1PB – scalable capacity
> 1TB/s – scalable I/O bandwidth
Can we realize in Grid???
POINT: Access locality in space
How to couple distributed storage and
distributed computing is a key
Gfarm Grid File System (1)
Virtual file system that federates local storage of
commodity PCs (cluster nodes or Grid nodes)
Enables transparent access using Global namespace to
dispersed file data in a Grid
Supports fault tolerance and avoids access
concentration by automatic replica selection
It can be mounted by all cluster nodes and clients
Global
namespace /gfarm
ggf
aist
file1
jp
gtrc
file2
file1
file3
Gfarm File System
mapping
file2
file4
File replica creation
Gfarm Grid File System (2)
Files can be shared among all nodes and clients
Physically, it may be replicated and stored on any file
system node
Applications can access it regardless of its location
File system nodes can be distributed
GridFTP, samba, NFS server
GridFTP, samba, NFS server
Compute & fs node
/gfarm
Gfarm metadata server
metadata
Compute & fs node
File A
Compute & fs node
Compute & fs node
File File
B C
Compute & fs node
Compute & fs node
File A
Compute & fs node
Compute & fs node
Compute & fs node
Compute & fs node
US
File B
Compute & fs node
Japan
File C
Client
PC
File A
File C
File B
Gfarm
file system
Note
PC
…
Scalable I/O performance in
distributed environment
User’s view
Physical execution view in Gfarm
(file-affinity scheduling)
User A submits Job A that accesses File A
Job A is executed on a node that has File A
User B submits Job B that accesses File B
Job B is executed on a node that has File B
network
Cluster,
Cluster,Grid
Grid
CPU
CPU
CPU
CPU
Gfarm file system
Shared network file system
File system nodes = compute nodes
Do not separate storage and CPU (SAN not necessary)
Move and execute program instead of moving large-scale data
exploiting local I/O is a key for scalable I/O performance
Example - Gaussian 03 in
Gfarm
Ab initio quantum chemistry Package
Install once and run everywhere
No modification required to access Gfarm
Compute
Test415 (IO intensive test input)
node
1h 54min 33sec (NFS)
NFS vs GfarmFS
1h 0min 51sec (Gfarm)
Parallel analysis of all 666 test inputs using 47
nodes
Compute Compute
Compute
.
.
.
node
node
node
Write error! (NFS)
Due to heavy IO load
16h 6m 58s (Gfarm)
NFS vs GfarmFS
Quite good scalability of IO performance
Longest test input takes 15h 42m 10s (test 574)
*Gfarm consists of local
disks of compute nodes
Software Stack
Gfarm
Application
UNIX
Application
Windows
Appli.
(Word,PPT,
Excel etc)
UNIX Command
Gfarm
Command
NFS
Client
Windows
Client
nfsd
samba
FTP
Client
SFTP
/ SCP
Web
Browser
/ Client
ftpd
sshd
httpd
Syscall-hook、GfarmFS-FUSE (that mounts FS in User-space)
libgfarm - Gfarm I/O Library
Gfarm Grid File System
Feature of Gfarm Grid File System
Commodity PC based scalable architecture
Add commodity PCs to increase storage capacity in operation
much more than petabyte scale
Even PCs at distant locations via internet
Adaptive replica creation and consistent management
Create multiple file replicas to increase performance and
reliability
Create file replicas at distant locations for disaster recovery
Open Source Software
Linux binary packages, ports for *BSD, . . .
It is included in Naregi, Knoppix HTC edition, and Rocks cluster
distribution
Existing applications can access w/o any modification
Network mount, Samba, HTTP, FTP, …
http://datafarm.apgrid.org/
SC03 Bandwidth
SC05 StorCloud
Toward Instant Grid File System
Private IP issue
Currently, each file system node needs to have a
public IP address
Evaluation of data?
Probably, there may be some other issues
Ongoing Collaboration
JuxMem group, Gabriel Antoniu (IRISA/INRIA)
JuxMem provides shared object access using
JXTA
juxmem_malloc, juxmem_mmap
juxmem_acquire, juxmem_acquire_read, juxmem_release
Brainstorming
Summary
Grid potentially has capability to enable
Petascale Data Intensive Computing
Grid MW needs to be matured
Gfarm Grid File System
Global virtual file system
A dependable network shared file system in a cluster
or a grid
High performance data computing support
Associates Computational Grid with Data Grid
Collaboration with JuxMem group