HOW TO USE HARDWARE AND
SOFTWARE RESOURCES OF THE
ENEA-GRID
(Serial and Parallel jobs using LSF)
Salvatore Raia, C.R.E.S.C.O project
C.R. ENEA-Portici
C.R. ENEA-Casaccia, 29-02-2008
Part 1: Introduction
LSF and GRIDs
Some definitions
Resources
LSF structure
Part1: Introduction
(ENEA-GRID structure (SW) )
ICA client
Resources management
File System
Operating Systems
Part1: Introduction
(LSF and GRID computing)
• Computational Grids: many HW and SW resources are
available
• Users: programs that take a lot of CPU time and
memory (serial and parallel/distributed applications)
• LSF (Load Sharing Facility) is a framework for
managing (large) programs that take a lot of CPU
time and memory, hiding the heterogeneity of the
underlying systems from the user
• LSF manages resource sharing and load balancing
Part1: Introduction
(Main definitions)
• Job: Jobs are the basic unit of work in LSF. Most of what
you will do with LSF involves:
1. submitting jobs
2. monitoring jobs
3. controlling jobs
• Resource: The LSF system uses built-in and configured
resources to track job resource requirements and
schedule jobs according to the resources available on
individual hosts (see the next slide).
• Batch (jobs / systems): a program assigned to the
computer to run without further user interaction.
Part1: Introduction
(Resources discovering)
LSF schedules jobs based on available resources. There are
many resources built into LSF, but you can also add your
own resources and then use them in the same way as built-in
resources.
Ex: suppose you have software that runs on some hosts in
different clusters, but not on all of the hosts. LSF
administrators can add a resource (in the LSF
configuration files) that lists the machines where your
software works.
When users want to launch the program, they don't need to
select the host: they request the resource, and LSF
dispatches the job to one of the hosts where the resource
is defined.
Load balancing ?
Part1: Introduction
(Check available Resources)
lsinfo: resources list
lshosts <cluster_name>: displays hosts and their static
resource information
Ex.: lshosts portici
Part1: Introduction
(Check available Resources)
lsload <cluster_name>: displays load information for hosts
Ex.: lsload -R sp4
Part1: Introduction
(How does LSF work?)
Q: How does LSF take my program (job) and dispatch it so that it
fits the available resources and the user requirements?
LSF can be configured in different ways that affect the
scheduling of jobs. By default, this is how LSF handles a
new job:
1. Receive the job. Create a job file. Return the job ID to the
user.
2. Schedule the job and select the best available host.
3. Dispatch the job to a selected host.
4. Set the environment on the host.
5. Start the job.
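The job ID returned at submission is what the monitoring and control commands expect, so scripts often capture it. The following is a minimal sketch, assuming the confirmation line printed by bsub has the form "Job <ID> is submitted to ..."; the helper name extract_job_id is hypothetical and not part of LSF.

```shell
#!/bin/sh
# Extract the numeric job ID from the confirmation line printed by bsub,
# e.g. "Job <123> is submitted to default queue <normal>."
# Prints nothing if the line does not match that form.
extract_job_id() {
    printf '%s\n' "$1" | sed -n 's/^Job <\([0-9][0-9]*\)>.*/\1/p'
}

# On a real cluster one could write (not executed here):
#   jobid=$(extract_job_id "$(bsub hostname)")
#   bjobs "$jobid"

# Demonstration on a sample confirmation line.
sample='Job <123> is submitted to default queue <normal>.'
echo "job id: $(extract_job_id "$sample")"
```

Parsing the confirmation line keeps the script independent of any extra LSF tooling; if the message format differs on a given installation, the pattern has to be adapted.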
Part1: Introduction
(Job life cycle)
Part1: Introduction
(LSF software architecture)
There is one master batch daemon (MBD) running in each LSF cluster,
and one slave batch daemon (SBD) running on each LSF server host.
Part 2: Serial jobs submission
Basic commands
Monitoring and controlling jobs
Part2: Serial Jobs Submission
(basic commands)
bsub <command/program>: submit the job
Ex_1.: bsub hostname
Ex_2.: bsub -I hostname
Part2: Serial Jobs Submission
(basic commands)
bsub <options> <command/program>: submit the job
Ex_3.: bsub -I -m lin4p.frascati.enea.it hostname
Ex_4.: bsub -I -R sp4 hostname
Part2: Serial Jobs Submission
(basic commands)
bsub <options> <command/program>: submit the job
Ex_5.: bsub -o pippo.out -e error.out hostname
Ex_6.: cat pippo.out
Part2: Serial Jobs Submission
(basic commands)
bsub -q <queue_name> <application>
How to check the available queues? Use bqueues.
Part2: Serial Jobs Submission
(basic commands)
bsub -q large -R netsim2 -o out.ns -e err.ns netsim2 tcp_5.tcl
bsub -q large -R linux -o out -e err icoFoam …/icoFoam/ cavity
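The two submissions above combine a queue (-q), a resource requirement (-R) and output/error files (-o, -e). A small sketch of how such a command line could be assembled in a script follows; the helper build_bsub and the file naming scheme (<name>.out, <name>.err) are assumptions made for illustration, and the function only prints the command so it can be inspected before being run.

```shell
#!/bin/sh
# Print (do not execute) a bsub command line.
# Arguments: queue, resource, base name for the output files, then the
# program and its arguments.
build_bsub() {
    queue=$1
    res=$2
    name=$3
    shift 3
    printf '%s\n' "bsub -q $queue -R $res -o $name.out -e $name.err $*"
}

# Demonstration: reproduces the shape of the netsim2 example.
build_bsub large netsim2 ns netsim2 tcp_5.tcl
```

Once the printed line looks right, it can be pasted into the shell or passed to eval on a host where LSF is installed.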
Part2: Serial Jobs Submission
(monitoring the job)
bjobs
Possible job status
PEND (pending job, often waiting for the right resources before execution),
RUN (running job),
DONE (job finished normally),
PSUSP, USUSP, SSUSP (job suspended while pending, by the user, or by the system),
EXIT (job terminated abnormally, for example killed with 'bkill <job_ID>' by you,
the system, or the LSF administrator (me))
Kill a job: bkill <job_ID>
More info: bjobs -l <job_ID>
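Scripts that watch a job usually only need to know whether a status is final. The sketch below maps the status codes listed above to short descriptions; the function names describe_status and is_finished are hypothetical helpers, not LSF commands.

```shell
#!/bin/sh
# Translate a bjobs status code into a short description.
describe_status() {
    case "$1" in
        PEND)              echo "pending: waiting for resources" ;;
        RUN)               echo "running" ;;
        DONE)              echo "finished normally" ;;
        EXIT)              echo "terminated abnormally or killed" ;;
        PSUSP|USUSP|SSUSP) echo "suspended" ;;
        *)                 echo "unknown status" ;;
    esac
}

# Succeeds (exit status 0) only for final states.
is_finished() {
    case "$1" in
        DONE|EXIT) return 0 ;;
        *)         return 1 ;;
    esac
}

# Demonstration on a few sample codes.
for s in PEND RUN DONE SSUSP; do
    echo "$s: $(describe_status "$s")"
done
```

A polling loop on a real cluster could call is_finished on the STAT column of bjobs and sleep between checks.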
Part2: Serial Jobs Submission
(controlling the job)
Checking stderr and stdout of a job: bpeek <job_ID>
History: bhist <job_ID>
Summary of some <bsub> options
Part 3: Parallel/distributed jobs submission
LSF utilities
Job array
MPI jobs
Part 3: Parallel/distributed Jobs Submission
(LSF utilities)
LSF Environment variables
LSB_JOBID
The LSF Batch job ID number.
LSB_HOSTS
The list of hosts selected by LSF Batch to run the batch job. If
the job is run on a single processor, the value of LSB_HOSTS
is the name of the execution host. For parallel jobs, the names
of all execution hosts are listed, separated by spaces. The
batch job file is run on the first host in the list.
The host list in LSB_HOSTS can be overridden by passing a
<host_file> to some LSF utilities (see lsgrun on the next slide).
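Because LSB_HOSTS is a space-separated list, a job script can inspect it with ordinary shell tools. A minimal sketch, assuming a host name appears once per processor assigned on it; the function name slots_per_host and the sample host names are made up for illustration.

```shell
#!/bin/sh
# Print each distinct host in LSB_HOSTS followed by how many times it
# appears in the list. Prints nothing if LSB_HOSTS is empty or unset.
slots_per_host() {
    for h in $LSB_HOSTS; do
        printf '%s\n' "$h"
    done | sort | uniq -c | awk '{ print $2, $1 }'
}

# Outside LSF the variable is not set, so use sample values for the demo.
LSB_HOSTS=${LSB_HOSTS:-"hostA hostA hostB"}
slots_per_host
```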
Part 3: Parallel/distributed Jobs Submission
(LSF utilities)
Running the same program on several hosts
bsub -n <n_procs> <program>
Here "-n" sets the number of processors to allocate.
Running the same program with "lsgrun"
bsub -n 4 'lsgrun -p -f <host_file> <prog_name>'
Contents of <host_file>:
lin4p.frascati.enea.it
bw305-3.frascati.enea.it
bw305-7.frascati.enea.it
pace.bologna.enea.it
Part 3: Parallel/distributed Jobs Submission
(Job Array (Multicase))
Job array allows a sequence of jobs that share the same
executable and resource requirements, but have different
input files, to be submitted, controlled, and monitored as a
single unit.
1-Creating a Job Array
A job array is created at job submission time using the -J
option of bsub. For example, the following command
creates a job array named myArray made up of 1000 jobs.
bsub -J "myArray[1-1000]" <myJob>
Job <123> is submitted to default queue <normal>.
Part 3: Parallel/distributed Jobs Submission
(Job Array (Multicase))
2-Redirecting Standard Input and Output
The variables %I and %J are used as substitution strings to
support file redirection for jobs submitted from a job array. At
execution time, %I is expanded to provide the job array index
value of the current job, and %J is expanded to provide the job
ID of the job array.
bsub -J "myArray[1-1000]" -i "input.%I" <myJob>
bsub -J "myArray[1-1000]" -i "input.%I" -o "output.%J.%I" <myJob>
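To see which files a job array submitted with -i "input.%I" -o "output.%J.%I" would read and write, the substitution can be mimicked by hand. This is only an illustration of the naming scheme (LSF performs the real expansion at execution time); the function name array_files is hypothetical.

```shell
#!/bin/sh
# List the input/output file pairs implied by the patterns
# input.%I and output.%J.%I.
# Arguments: job ID of the array, first index, last index.
array_files() {
    i=$2
    while [ "$i" -le "$3" ]; do
        echo "input.$i -> output.$1.$i"
        i=$((i + 1))
    done
}

# Demonstration: array with job ID 123, elements 1 to 3.
array_files 123 1 3
```

Running such a listing before submission is a cheap way to check that every expected input file actually exists.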
Part 3: Parallel/distributed Jobs Submission
(Job Array (Multicase))
3-Passing Arguments on the Command Line
The environment variable LSB_JOBINDEX is used as a
substitution string to support passing job array indices on
the command line. When the job is dispatched, LSF sets
LSB_JOBINDEX in the execution environment to the job
array index of the current job.
LSB_JOBINDEX can be used inside a script or a program
that reads environment variables.
Ex: with input files input.1, input.2, input.3, ..., input.N
bsub -J "myArray[1-1000]" myJob -f input.\$LSB_JOBINDEX
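Inside a job script the same index can be read directly from the environment. A minimal sketch follows; the function name input_for_index and the fallback to index 1 when LSB_JOBINDEX is unset (for trying the script outside LSF) are assumptions made for this example.

```shell
#!/bin/sh
# Return the input file name for the current job array element.
# Falls back to index 1 when LSB_JOBINDEX is unset or empty, so the
# script can also be tried by hand outside LSF.
input_for_index() {
    echo "input.${LSB_JOBINDEX:-1}"
}

# Demonstration: report which file this element would process.
echo "this element would read $(input_for_index)"
```

Submitted as the command of a job array, each element would then pick its own file without any argument on the bsub command line.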
Part 3: Parallel/distributed Jobs Submission
(Job Array (Multicase))
4-Job array status
To display summary information about the currently running jobs submitted from a job array, use the -A
option of bjobs. For example, a job array of 10 jobs with job ID 123:
bjobs -A 123
JOBID  ARRAY_SPEC     OWNER  NJOBS  PEND  DONE  RUN  EXIT  SSUSP  USUSP  PSUSP
123    myArray[1-10]  user1  10     3     3     4    0     0      0      0
Individual job status
Current
To display the status of the individual jobs submitted from a job array, specify the job array job ID with
bjobs. For jobs submitted from a job array, JOBID displays the job array job ID, and JOBNAME displays
the job array name and the index value of each job. For example, to view a job array with job ID 123:
bjobs 123
JOBID  USER   STAT  QUEUE    FROM_HOST  EXEC_HOST  JOB_NAME     SUBMIT_TIME
123    user1  DONE  default  hostA      hostC      myArray[1]   Feb 29 12:34
123    user1  DONE  default  hostA      hostQ      myArray[2]   Feb 29 12:34
123    user1  DONE  default  hostA      hostB      myArray[3]   Feb 29 12:34
123    user1  RUN   default  hostA      hostC      myArray[4]   Feb 29 12:34
123    user1  RUN   default  hostA      hostL      myArray[5]   Feb 29 12:34
123    user1  RUN   default  hostA      hostB      myArray[6]   Feb 29 12:34
123    user1  RUN   default  hostA      hostQ      myArray[7]   Feb 29 12:34
123    user1  PEND  default  hostA                 myArray[8]   Feb 29 12:34
123    user1  PEND  default  hostA                 myArray[9]   Feb 29 12:34
123    user1  PEND  default  hostA                 myArray[10]  Feb 29 12:34
Other commands: bkill, bstop, bmod "Job_ID[index]"
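The per-status counts that bjobs -A reports can also be recomputed from the per-job listing, which is handy when post-processing saved output. A minimal sketch, assuming the status is the third whitespace-separated column and the first line is the header; count_by_status is a hypothetical helper and the sample rows are illustrative.

```shell
#!/bin/sh
# Read a bjobs-style listing on standard input, skip the header line,
# and print one "STATUS count" line per status, sorted by status name.
count_by_status() {
    awk 'NR > 1 { n[$3]++ } END { for (s in n) print s, n[s] }' | sort
}

# On a real cluster (not executed here):  bjobs 123 | count_by_status
# Demonstration on a short sample listing.
count_by_status <<'EOF'
JOBID USER  STAT QUEUE   FROM_HOST EXEC_HOST JOB_NAME   SUBMIT_TIME
123   user1 DONE default hostA     hostC     myArray[1] Feb 29 12:34
123   user1 RUN  default hostA     hostL     myArray[5] Feb 29 12:34
123   user1 RUN  default hostA     hostB     myArray[6] Feb 29 12:34
123   user1 PEND default hostA               myArray[8] Feb 29 12:34
EOF
```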
Part 3: Parallel/distributed Jobs Submission
(MPI jobs)
Interactive MPICH job
mpirun -n 10 -machinefile machine_file prog_name
Interactive POE job
mpiexec -n 10 <prog_name> -hostfile <machine_file>
Submit MPICH job with <bsub>
bsub -n 10 <options> mpirun -n 10 -machinefile machine_file prog_name
…or using the system wrappers <poejob> <mpijob>
bsub -n 10 mpijob mpirun <prog_name>
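When mpirun is called directly under bsub, the machine file should match the hosts LSF actually allocated. One possible approach, sketched here under the assumption that LSB_HOSTS lists one host name per allocated processor, is to write those names to a file, one per line; the helper name make_machinefile and the sample hosts are hypothetical, and the wrappers mentioned above may make this step unnecessary.

```shell
#!/bin/sh
# Write the hosts in LSB_HOSTS to the file given as first argument,
# one host name per line (a host repeated in LSB_HOSTS is repeated
# in the file).
make_machinefile() {
    : > "$1"
    for host in $LSB_HOSTS; do
        printf '%s\n' "$host" >> "$1"
    done
}

# Inside a job script on a real cluster one could then run, for example:
#   make_machinefile machine_file
#   mpirun -n 10 -machinefile machine_file prog_name

# Demonstration with sample values when run outside LSF.
LSB_HOSTS=${LSB_HOSTS:-"hostA hostA hostB"}
tmp=$(mktemp)
make_machinefile "$tmp"
cat "$tmp"
rm -f "$tmp"
```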
Part 3: Parallel/distributed Jobs Submission
(MPI jobs)
• Instead of starting the PJL (Parallel Job Launcher) directly, PAM
(Parallel Application Manager) starts the specified PJL wrapper on a
single host.
• The PJL wrapper starts the PJL (for example, mpirun).
• Instead of starting tasks directly, the PJL starts TS (TaskStarter)
on each host selected to run the parallel job.
• TS starts the task.