HOW TO USE HARDWARE AND SOFTWARE RESOURCES OF THE ENEA-GRID
(Serial and Parallel jobs using LSF)
Salvatore Raia, C.R.E.S.C.O project, C.R. ENEA-Portici
C.R. ENEA-Casaccia, 29-02-2008

Part 1: Introduction
LSF and GRIDs
Some definitions
Resources
LSF structure

Part 1: Introduction (ENEA-GRID structure (SW))
ICA client
Resources management
File System
Operating Systems

Part 1: Introduction (LSF and GRID computing)
• Computational Grids: a lot of available HW and SW resources
• Users: programs that take a lot of CPU time and memory (serial and parallel/distributed applications)
• LSF (Load Sharing Facility): a framework that allows managing (large) programs that take a lot of CPU time and memory, hiding the heterogeneous systems from the user
It manages resource sharing and load balancing.

Part 1: Introduction (Main definitions)
• Job: jobs are the basic unit of work in LSF. Most of what you will do with LSF involves: 1. submitting, 2. monitoring, 3. controlling jobs.
• Resource: the LSF system uses built-in and configured resources to track job resource requirements and to schedule jobs according to the resources available on individual hosts (see next).
• Batch (jobs / systems): a program assigned to the computer to run without further user interaction.

Part 1: Introduction (Resources discovering)
LSF schedules jobs based on available resources. There are many resources built into LSF, but you can also add your own resources and then use them the same way as built-in resources.
Ex.: suppose you have a piece of software that works on some hosts of different clusters, but not on all of them. The LSF administrators can add a resource specifying the machines where your software works (in the configuration files). When users want to launch the program, they do not need to select a host: they simply request the resource, and LSF dispatches the job to one of the hosts where the resource is defined.
Load balancing?

Part 1: Introduction (Check available Resources)
lsinfo: lists the resources
lshosts <cluster_name>: displays hosts and their static resource information
Ex.: lshosts portici

Part 1: Introduction (Check available Resources)
lsload <cluster_name>: displays load information for hosts
Ex.: lsload -R sp4

Part 1: Introduction (How does LSF work?)
Q: how does LSF take my program (job) and dispatch it, matching the available resources and the user requirements?
LSF can be configured in different ways that affect the scheduling of jobs. By default, this is how LSF handles a new job:
1. Receive the job. Create a job file. Return the job ID to the user.
2. Schedule the job and select the best available host.
3. Dispatch the job to the selected host.
4. Set the environment on the host.
5. Start the job.

Part 1: Introduction (Job life cycle)

Part 1: Introduction (LSF software architecture)
There is one master batch daemon (MBD) running in each LSF cluster, and one slave batch daemon (SBD) running on each LSF server host.

Part 2: Serial jobs submission
Basic commands
Monitoring and controlling jobs

Part 2: Serial Jobs Submission (basic commands)
bsub <command/program>: submit the job
Ex_1.: bsub hostname
Ex_2.: bsub -I hostname
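The two modes differ in how the output comes back. A minimal terminal session, sketched with purely illustrative job IDs and an execution host taken only as an example, could look like this:

bsub hostname
# Job <101> is submitted to default queue <normal>.
# Batch mode: the job runs detached and its output is returned when it
# finishes (by mail, or to a file when -o is used, see below).

bsub -I hostname
# Job <102> is submitted to default queue <normal>.
# <<Waiting for dispatch ...>>
# <<Starting on bw305-3.frascati.enea.it>>
# bw305-3.frascati.enea.it
# Interactive mode (-I): the command output comes straight back to the terminal.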
Part 2: Serial Jobs Submission (basic commands)
bsub <options> <command/program>: submit the job
Ex_3.: bsub -I -m lin4p.frascati.enea.it hostname
Ex_4.: bsub -I -R sp4 hostname

Part 2: Serial Jobs Submission (basic commands)
bsub <options> <command/program>: submit the job
Ex_5.: bsub -o pippo.out -e error.out hostname
Ex_6.: cat pippo.out

Part 2: Serial Jobs Submission (basic commands)
bsub -q <queue_name> <application>
How to check the available queues?

Part 2: Serial Jobs Submission (basic commands)
bsub -q large -R netsim2 -o out.ns -e err.ns netsim2 tcp_5.tcl
bsub -q large -R linux -o out -e err icoFoam …/icoFoam/ cavity

Part 2: Serial Jobs Submission (monitoring the job)
bjobs
Possible job states:
PEND (pending job, often waiting for the right resources before execution)
RUN (running/executing job)
DONE (as it says)
PSUSP, USUSP, SSUSP (pending-, user- or system-suspended job)
EXIT (as it says, e.g. after the command 'bkill jobid' issued by you, the system or the LSF administrator (me))
Kill a job: bkill <job_ID>
…more info: bjobs -l <job_ID>

Part 2: Serial Jobs Submission (controlling the job)
Checking stderr and stdout
History

Summary of some <bsub> options

Part 3: Parallel/distributed jobs submission
LSF utilities
Job array
MPI jobs

Part 3: Parallel/distributed Jobs Submission (LSF utilities)
LSF environment variables
LSB_JOBID: the LSF Batch job ID number.
LSB_HOSTS: the list of hosts selected by LSF Batch to run the batch job. If the job is run on a single processor, the value of LSB_HOSTS is the name of the execution host. For parallel jobs, the names of all execution hosts are listed, separated by spaces. The batch job file is run on the first host in the list.
LSB_HOSTS can be overridden by setting a <host_file> for some LSF utilities [see next: lsgrun].

Part 3: Parallel/distributed Jobs Submission (LSF utilities)
Running the same program on several hosts:
bsub -n <n_procs> <program>
here "-n" allocates the requested number of processors
Running the same program with "lsgrun":
bsub -n 4 'lsgrun -p -f <host_file> <prog_name>'
<host_file>:
lin4p.frascati.enea.it
bw305-3.frascati.enea.it
bw305-7.frascati.enea.it
pace.bologna.enea.it

Part 3: Parallel/distributed Jobs Submission (Job Array (Multicase))
A job array allows a sequence of jobs that share the same executable and resource requirements, but have different input files, to be submitted, controlled, and monitored as a single unit.
1 - Creating a job array
A job array is created at job submission time using the -J option of bsub. For example, the following command creates a job array named myArray made up of 1000 jobs:
bsub -J "myArray[1-1000]" <myJob>
Job <123> is submitted to default queue <normal>.

Part 3: Parallel/distributed Jobs Submission (Job Array (Multicase))
2 - Redirecting standard input and output
The variables %I and %J are used as substitution strings to support file redirection for jobs submitted from a job array. At execution time, %I is expanded to the job array index value of the current job, and %J is expanded to the job ID of the job array.
bsub -J "myArray[1-1000]" -i "input.%I" <myJob>
bsub -J "myArray[1-1000]" -i "input.%I" -o "output.%J.%I" <myJob>
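A smaller, concrete variant of the redirection example may make the expansions easier to follow; the array name, its size and the job ID below are purely illustrative:

bsub -J "test[1-4]" -i "input.%I" -o "output.%J.%I" <myJob>
# Suppose LSF answers: Job <456> is submitted to default queue <normal>.
# Element 1 then reads input.1 and writes output.456.1, element 2 reads
# input.2 and writes output.456.2, and so on up to input.4 / output.456.4.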
Part 3: Parallel/distributed Jobs Submission (Job Array (Multicase))
3 - Passing arguments on the command line
The environment variable LSB_JOBINDEX is used as a substitution string to support passing job array indices on the command line. When the job is dispatched, LSF sets LSB_JOBINDEX in the execution environment to the job array index of the current job.
LSB_JOBINDEX can also be used inside a script or a program reading environment variables.
input.1, input.2, input.3, ..., input.N
bsub -J "myArray[1-1000]" myJob -f input.\$LSB_JOBINDEX

Part 3: Parallel/distributed Jobs Submission (Job Array (Multicase))
4 - Job array status
To display summary information about the currently running jobs submitted from a job array, use the -A option of bjobs. For example, for a job array of 10 jobs with job ID 123:
bjobs -A 123
JOBID  ARRAY_SPEC    OWNER  NJOBS  PEND  DONE  RUN  EXIT  SSUSP  USUSP  PSUSP
123    myArra[1-10]  user1  10     3     3     4    0     0      0      0
Individual job status
To display the status of the individual jobs submitted from a job array, specify the job array job ID with bjobs. For jobs submitted from a job array, JOBID displays the job array job ID, and JOB_NAME displays the job array name and the index value of each job. For example, to view the job array with job ID 123:
bjobs 123
JOBID  USER   STAT  QUEUE    FROM_HOST  EXEC_HOST  JOB_NAME      SUBMIT_TIME
123    user1  DONE  default  hostA      hostC      myArray[1]    Feb 29 12:34
123    user1  DONE  default  hostA      hostQ      myArray[2]    Feb 29 12:34
123    user1  DONE  default  hostA      hostB      myArray[3]    Feb 29 12:34
123    user1  RUN   default  hostA      hostC      myArray[4]    Feb 29 12:34
123    user1  RUN   default  hostA      hostL      myArray[5]    Feb 29 12:34
123    user1  RUN   default  hostA      hostB      myArray[6]    Feb 29 12:34
123    user1  RUN   default  hostA      hostQ      myArray[7]    Feb 29 12:34
123    user1  PEND  default  hostA                 myArray[8]    Feb 29 12:34
123    user1  PEND  default  hostA                 myArray[9]    Feb 29 12:34
123    user1  PEND  default  hostA                 myArray[10]   Feb 29 12:34
Other commands: bkill, bstop, bmod "Job_ID[index]"

Part 3: Parallel/distributed Jobs Submission (MPI jobs)
Interactive MPICH job:
mpirun -n 10 -machinefile machine_file prog_name
Interactive POE job:
mpiexec -n 10 <prog_name> -hostfile <machine_file>
Submit an MPICH job with <bsub>:
bsub -n 10 <options> mpirun -n 10 -machinefile machine_file prog_name
…or using the wrapper scripts <poejob> and <mpijob>:
bsub -n 10 mpijob mpirun <prog_name>
(A complete submission-script sketch is given at the end of this part.)

Part 3: Parallel/distributed Jobs Submission (MPI jobs)
(PAM: Parallel Application Manager; PJL: Parallel Job Launcher; TS: TaskStarter)
• Instead of starting the PJL directly, PAM starts the specified PJL wrapper on a single host.
• The PJL wrapper starts the PJL (for example, mpirun).
• Instead of starting tasks directly, the PJL starts TS on each host selected to run the parallel job.
• TS starts the task.
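Putting the pieces together, a minimal MPICH submission script might look like the sketch below. It is only an illustration built on the slides above: run_mpi.sh and prog_name are hypothetical names, and the machine file is derived from LSB_HOSTS as described in the LSF utilities part.

#!/bin/bash
# run_mpi.sh -- submit with:  bsub -n 10 -o mpi.out -e mpi.err ./run_mpi.sh
# LSF puts the space-separated list of allocated hosts in LSB_HOSTS;
# turn it into a machine file for mpirun.
for h in $LSB_HOSTS; do
    echo $h
done > machine_file
# Start the MPICH job on the allocated hosts.
mpirun -n 10 -machinefile machine_file ./prog_name

The mpijob wrapper shown earlier presumably performs an equivalent host-list setup internally, so with "bsub -n 10 mpijob mpirun ./prog_name" the explicit loop is not needed.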