 
        Introduction to SLURM scitas.epfl.ch October 9, 2014 Bellatrix I Frontend at bellatrix.epfl.ch I 16 x 2.2 GHz cores per node I 424 nodes with 32GB I Infiniband QDR network I The batch system is SLURM 1 / 36 Castor I Frontend at castor.epfl.ch I 16 x 2.6 GHz cores per node I 50 nodes with 64GB I 2 nodes with 256GB I For sequential jobs (Matlab etc.) I The batch system is SLURM I RedHat 6.5 2 / 36 Deneb (October 2014) I Frontend at deneb.epfl.ch I 16 x 2.6 GHz cores per node I 376 nodes with 64GB I 8 nodes with 256GB I 2 nodes with 512GB and 32 cores I 16 nodes with 4 Nvidia K40 GPUs I Infiniband QDR network 3 / 36 Storage /home I filesystem has per user quotas I will be backed up for important things (source code, results and theses) /scratch I high performance ”temporary” space I is not backed up I is organised by laboratory 4 / 36 Connection Start the X server (automatic on a Mac) Open a terminal ssh -Y [email protected] Try the following commands: I id I pwd I quota I ls /scratch/<group>/<username> 5 / 36 The batch system Goal: to take a list of jobs and execute them when appropriate resources become available SCITAS uses SLURM on its clusters: http://slurm.schedmd.com The configuration depends on the purpose of the cluster (serial vs parallel) 6 / 36 sbatch The fundamental command is sbatch sbatch submits jobs to the batch system Suggested workflow: I create a short job-script I submit it to the batch system 7 / 36 sbatch - exercise Copy the first two examples to your home directory cp /scratch/examples/ex1.run . cp /scratch/examples/ex2.run . Open the file ex1.run with your editor of choice 8 / 36 ex1.run #!/bin/bash #SBATCH #SBATCH #SBATCH #SBATCH #SBATCH --workdir /scratch/<group>/<username> --nodes 1 --ntasks 1 --cpus-per-task 1 --mem 1024 sleep 10 echo "hello from $(hostname)" sleep 10 9 / 36 ex1.run #SBATCH is a directive to the batch system --nodes 1 the number of nodes to use - on Castor this is limited to 1 --ntasks 1 the number of tasks (in an MPI sense) to run per job --cpu-per-task 1 the number of cores per aforementioned task --mem 4096 the memory required per node in MB --time 12:00:00 --time 2-6 the time required # 12 hours # two days and six hours 10 / 36 Running ex1.run The job is assigned a default runtime of 15 minutes $ sbatch ex1.run Submitted batch job 439 $ cat /scratch/<group>/<username>/slurm-439.out hello from c03 11 / 36 What went on? sacct -j <JOB_ID> sacct -l -j <JOB_ID> Or more usefully: Sjob <JOB ID> 12 / 36 Cancelling jobs To cancel a specific job: scancel <JOB_ID> To cancel all your jobs: scancel -u <username> 13 / 36 ex2.run #!/bin/bash #SBATCH #SBATCH #SBATCH #SBATCH #SBATCH #SBATCH --workdir /scratch/<group>/<username> --nodes 1 --ntasks 1 --cpus-per-task 8 --mem 122880 --time 00:30:00 /scratch/examples/linpack/runme_1_45k 14 / 36 What’s going on? squeue squeue -u <username> Squeue Sjob <JOB_ID> scontrol -d show job <JOB_ID> sinfo 15 / 36 squeue and Squeue squeue I Job states: Pending, Resources, Priority, Running squeue | grep <JOB_ID> squeue -j <JOB_ID> Squeue <JOB_ID> 16 / 36 Sjob $ Sjob <JOB_ID> JobID JobName Cluster Account Partition Timelimit User Group ------------ ---------- ---------- ---------- ---------- ---------- --------- --------31006 ex1.run castor scitas-ge serial 00:15:00 jmenu scitas-ge 31006.batch batch castor scitas-ge Submit Eligible Start End ------------------- ------------------- ------------------- ------------------2014-05-12T15:55:48 2014-05-12T15:55:48 2014-05-12T15:55:48 2014-05-12T15:56:08 2014-05-12T15:55:48 2014-05-12T15:55:48 2014-05-12T15:55:48 2014-05-12T15:56:08 Elapsed ExitCode State ---------- -------- ---------00:00:20 0:0 COMPLETED 00:00:20 0:0 COMPLETED NCPUS NTasks NodeList UserCPU SystemCPU AveCPU MaxVMSize ---------- -------- --------------- ---------- ---------- ---------- ---------1 c04 00:00:00 00:00.001 1 1 c04 00:00:00 00:00.001 00:00:00 207016K 17 / 36 scontrol $ scontrol -d show job <JOB_ID> $ scontrol -d show job 400 obId=400 Name=s1.job UserId=user(123456) GroupId=group(654321) Priority=111 Account=scitas-ge QOS=normal JobState=RUNNING Reason=None Dependency=(null) Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0 DerivedExitCode=0:0 RunTime=00:03:39 TimeLimit=00:15:00 TimeMin=N/A SubmitTime=2014-03-06T09:45:27 EligibleTime=2014-03-06T09:45:27 StartTime=2014-03-06T09:45:27 EndTime=2014-03-06T10:00:27 PreemptTime=None SuspendTime=None SecsPreSuspend=0 Partition=serial AllocNode:Sid=castor:106310 ReqNodeList=(null) ExcNodeList=(null) NodeList=c03 BatchHost=c03 NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqS:C:T=*:*:* Nodes=c03 CPU IDs=0 Mem=1024 MinCPUsNode=1 MinMemoryCPU=1024M MinTmpDiskNode=0 Features=(null) Gres=(null) Reservation=(null) Shared=OK Contiguous=0 Licenses=(null) Network=(null) Command=/home/<user>/jobs/s1.job WorkDir=/scratch/<group>/<user> 18 / 36 Modules Modules make your life easier I module avail I module show <take your pick> I module load <take your pick> I module list I module purge I module list 19 / 36 ex3.run - Mathematica Copy the following files to your chosen directory: cp /scratch/examples/ex3.run . cp /scratch/examples/mathematica.in . Submit ‘ex3.run’ to the batch system and see what happens... 20 / 36 ex3.run #!/bin/bash #SBATCH #SBATCH #SBATCH #SBATCH #SBATCH --ntasks 1 --cpus-per-task 1 --nodes 1 --mem 4096 --time 00:05:00 echo STARTING AT ‘date‘ module purge module load mathematica/9.0.1 math < mathematica.in echo FINISHED at ‘date‘ 21 / 36 Compiling ex4.* source files Copy the following files to your chosen directory: /scratch/examples/ex4_README.txt /scratch/examples/ex4.c /scratch/examples/ex4.cxx /scratch/examples/ex4.f90 /scratch/examples/ex4.run Then compile them with: module load intelmpi/4.1.3 mpiicc -o ex4 c ex4.c mpiicpc -o ex4 cxx ex4.cxx mpiifort -o ex4 f90 ex4.f90 22 / 36 The 3 methods to get interactive access 1/3 In order to schedule an allocation use salloc with exactly the same options for resources as sbatch You will then arrive in a new prompt which is still on the submission node but by using srun you can get access to the allocated resources eroche@castor:hello > salloc -N 1 -n 2 salloc: Granted job allocation 1234 bash-4.1$ hostname castor bash-4.1$ srun hostname c03 c03 23 / 36 The 3 methods to get interactive access 2/3 To get a prompt on the machine one needs to use the “--pty” option with “srun” and then “bash -i” (or “tcsh -i”) to get the shell: eroche@castor > salloc -N 1 -n 1 salloc: Granted job allocation 1235 eroche@castor > srun --pty bash -i bash-4.1$ hostname c03 24 / 36 The 3 methods to get interactive access 3/3 This is the least elegant but it is the method by which one can run X11 applications: eroche@bellatrix > salloc -n 1 -c 16 -N 1 salloc: Granted job allocation 1236 bash-4.1$ srun hostname c04 bash-4.1$ ssh -Y c04 eroche@c04 > 25 / 36 Dynamic libs used in an application “ldd” displays the libraries an executable file depends on: /COURS > ldd ex4 f90 jmenu@castor:~ linux-vdso.so.1 => (0x00007fff4b905000) libmpigf.so.4 => /opt/software/intel/14.0.1/intel64/lib/libmpigf.so.4 (0x00007f556cf88000) libmpi.so.4 => /opt/software/intel/14.0.1/intel64/lib/libmpi.so.4 (0x00007f556c91c000) libdl.so.2 => /lib64/libdl.so.2 (0x0000003807e00000) librt.so.1 => /lib64/librt.so.1 (0x0000003808a00000) libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003808200000) libm.so.6 => /lib64/libm.so.6 (0x0000003807600000) libc.so.6 => /lib64/libc.so.6 (0x0000003807a00000) libgcc s.so.1 => /lib64/libgcc s.so.1 (0x000000380c600000) /lib64/ld-linux-x86-64.so.2 (0x0000003807200000) 26 / 36 ex4.run #!/bin/bash ... module purge module load intelmpi/4.1.3 module list echo LAUNCH DIR=/scratch/scitas-ge/jmenu EXECUTABLE="./ex4 f90" echo "--> LAUNCH DIR = ${LAUNCH DIR}" echo "--> EXECUTABLE = ${EXECUTABLE}" echo echo "--> ${EXECUTABLE} depends on the following dynamic libraries:" ldd ${EXECUTABLE} echo cd ${LAUNCH DIR} srun ${EXECUTABLE} ... 27 / 36 The debug QoS In order to have priority access for debugging sbatch --qos debug ex1.run Limits on Castor: I 30 minutes walltime I 1 job per user I 16 cores between all users To display the available QoS’s: sacctmgr show qos 28 / 36 cgroups (Castor) General: I cgroups (“control groups”) is a Linux kernel feature to limit, account, and isolate resource usage (CPU, memory, disk I/O, etc.) of process groups SLURM: I Linux cgroups apply contraints to the CPUs and memory that can be used by a job I They are automatically generated using the resource requests given to SLURM I They are automatically destroyed at the end of the job, thus releasing all resources used Even if there is physical memory available a task will be killed if it tries to exceed the limits of the cgroup! 29 / 36 System process view Two tasks running on the same node with “ps auxf” root user user 177873 177877 177908 slurmstepd: [1072] \_ /bin/bash /var/spool/slurmd/job01072/slurm_script \_ sleep 10 root user user 177890 177894 177970 slurmstepd: [1073] \_ /bin/bash /var/spool/slurmd/job01073/slurm_script \_ sleep 10 Check memory, thread and core usage with “htop” 30 / 36 Fair share 1/3 The scheduler is configured to give all groups a share of the computing power Within each group the members have an equal share by default: jmenu@castor:~ > sacctmgr show association where account=lacal format=Account,Cluster,User,GrpNodes, QOS,DefaultQOS,Share tree Account Cluster User GrpNodes QOS Def QOS Share -------------------- ---------- ---------- -------- -------------------- --------- --------lacal castor normal 1 lacal castor aabecker debug,normal normal 1 lacal castor kleinjun debug,normal normal 1 lacal castor knikitin debug,normal normal 1 Priority is based on recent usage I this is forgotten about with time (half life) I fair share comes into play when the resources are heavily used 31 / 36 Fair share 2/3 Job priority is a weighted sum of various factors: jmenu@castor:~ > sprio -w JOBID Weights PRIORITY AGE 1000 jmenu@bellatrix:~ > sprio -w JOBID PRIORITY Weights FAIRSHARE 10000 AGE 1000 QOS 100000 FAIRSHARE 10000 JOBSIZE 100 To compare jobs’ priorities: jmenu@castor:~ > sprio -j80833,77613 JOBID PRIORITY AGE FAIRSHARE 77613 145 146 0 80833 9204 93 9111 32 / 36 QOS 0 0 QOS 100000 Fair share 3/3 FairShare values range from 0.0 to 1.0: Value Meaning ≈ 0.0 you used much more resources that you were granted 0.5 ≈ 1.0 you got what you paid for you used nearly no resources jmenu@bellatrix:~ > sshare -a -A lacal Accounts requested: : lacal Account User Raw Shares Norm Shares Raw Usage Effectv Usage FairShare -------------------- ---------- ---------- ----------- ----------- ------------- ---------lacal 666 0.097869 1357691548 0.256328 0.162771 lacal boissaye 1 0.016312 0 0.042721 0.162771 lacal janson 1 0.016312 0 0.042721 0.162771 lacal jetchev 1 0.016312 0 0.042721 0.162771 lacal kleinjun 1 0.016312 1357691548 0.256328 0.000019 lacal pbottine 1 0.016312 0 0.042721 0.162771 lacal saltini 1 0.016312 0 0.042721 0.162771 More information at: http://schedmd.com/slurmdocs/priority_multifactor.html 33 / 36 Helping yourself man pages are your friend! I man sbatch I man sacct I man gcc module load intel/14.0.1 I man ifort 34 / 36 Getting help If you still have problems then send a message to: [email protected] Please start the subject with HPC for automatic routing to the HPC team Please give as much information as possible including: I the jobid I the directory location and name of the submission script I where the “slurm-*.out” file is to be found I how the “sbatch” command was used to submit it I the output from “env” and “module list” commands 35 / 36 Appendix Change your shell at: https://dinfo.epfl.ch/cgi-bin/accountprefs Scitas web site: http://scitas.epfl.ch 36 / 36
© Copyright 2025