
Cluster@WU
User’s Manual
Stefan Theußl
Martin Pacala
September 29, 2014
1 Introduction and scope
At the WU Wirtschaftsuniversität Wien, the Research Institute for Computational Methods (Forschungsinstitut für rechenintensive Methoden, or FIRM for short) hosts a cluster of Intel workstations, known as cluster@WU. See http://www.wu.ac.at/firm.
The scope of this manual is limited to a general introduction to cluster@WU and its usage. It assumes basic Linux knowledge on the part of the user. For a more comprehensive guide, including detailed instructions for Windows users and users new to Unix, as well as the optional arguments to the available commands, please visit http://statmath.wu.ac.at/cluster.
Suggestions and improvements to this manual (as well as the website
manual) can be emailed to [email protected].
1.1 cluster@WU
With a total of 528 64-bit computation cores and a total of one terabyte
of RAM, cluster@WU is well equipped to tackle challenging problems from
various research areas.
The high-performance computing cluster consists of four parts: the compute cluster running the applications, which itself consists of 44 nodes, a login server, a file server, and storage from a storage area network (SAN).
The 44 nodes, each offering 12 cores (528 cores in total) capable of processing jobs, are combined in the queue node.q. Table 1 provides a brief overview of the specification of each individual node.
node.q – 44 nodes
2 × Intel X5670 (6 cores) @ 2.93 GHz
24 GB RAM

Table 1: cluster@WU specification
The file server (clusterfs.wu.ac.at) hosts the user data, the application data, and the scheduling system, the Sun Grid Engine. This grid engine is responsible for job administration and supports the submission of serial as well as parallel tasks.
The login server (cluster.wu.ac.at) is the main entry point for application developers and cluster users. This server solely handles user authentication and the execution of programs; it provides cluster users with a platform for managing their computationally intensive jobs.
2 Cluster Access
To get access to cluster@WU, a local Unix account is needed. To acquire such an account, send an email to [email protected]. You will be notified by email once your account has been created.
2.1 Login
To log in to the cluster, type ssh [email protected] into your Unix shell (a number of SSH clients, e.g., PuTTY, are available for Windows users). After authentication with your username and password, a shell on the login server becomes available. The user is provided with programs for editing and compiling as well as tools for managing jobs for the grid engine.
The login server solely serves as an access point to the main cluster; it should therefore only be used for editing programs (e.g., installing R packages into your personal library), compiling small applications, and managing jobs submitted to the grid engine.
2.2 Changing the password
To change your password, simply use the command passwd after logging in to cluster@WU and follow the prompts in the terminal.
3 Using the Cluster
This section presents a summary of the capabilities of the Sun Grid Engine (SGE) as well as an overview of how to use this software.
Sun Grid Engine is an open-source cluster resource management and scheduling software. On cluster@WU, version 6.0 of the Grid Engine manages the remote execution of cluster jobs. It can be obtained from http://gridengine.sunsource.net/; the commercial version, Sun N1 Grid Engine, can be found at http://www.sun.com/software/gridware/.
3.1 Definitions
Before going into further detail, the user should be aware of some frequently used terms:
Nodes In SGE terminology, a node refers to a single core in the cluster. In this manual, however, a node refers to one rack unit. Each of the cluster@WU nodes has 12 cores, each of which can process one job at a time. This discrepancy can sometimes cause confusion between different texts dealing with clusters.
Jobs are user requests for resources available in a grid. These are then
evaluated by the SGE and distributed to the nodes for processing.
Tasks are smaller parts of a job that can be processed separately on different nodes or cores. A single job can consist of many tasks (even thousands). Each of these tasks can perform similar or completely different calculations, depending on the arguments passed to the SGE.
Job Arguments Each job can be submitted with extra parameters which
affect how the job gets processed. These arguments can be specified in
the job submission file.
3.2 Fair Use
In order to provide maximum flexibility, some aspects of the grid engine are set up with very few restrictions. For example, there is no limit on how many jobs a user can start or have in the queue. However, this also means that users need to write and submit their jobs in a way that does not adversely affect the right of other users to run jobs on the cluster.
For example, if you wish to start a job that contains hundreds or even thousands of tasks and will occupy a significant amount of resources for an extended period of time, submit the job with a reduced priority (see the section on arguments below) so that other users' jobs can get processed whenever a task completes.
It is not allowed to start time- or resource-intensive programs on the login server, as this affects all logged-in users. If such tasks are started anyway, they may be terminated by the administrators without notice.
3.3 How the Grid Engine Operates
In general, the grid engine has to match the available resources to the requests of the grid users. The grid engine is therefore responsible for
• accepting jobs from the outside world
• holding jobs until they can be run
• sending jobs from the holding area to an execution device (node)
• managing running jobs
• logging jobs
The user need not worry about questions like “On which node should I run my tasks?” or “How do I get the results of my computation back from a node to my home directory?”. All of this is handled by the grid engine.
3.4 Submission of a Job
A simple example shows how to submit a job on the WU cluster infrastructure:
1. Log in to the cluster with your user account (when using a terminal, via the following command):
ssh [email protected]
2. Create a text file on the cluster (we will call the file myJob) with the following content:
#$ -N firstJob
R-i --vanilla < myRprogram.R
3. Submit the job to the grid engine with the following command
qsub myJob
This queues the job; it will run as soon as a node is free.
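The three steps above can be sketched as a single shell session on the login server. This is a minimal sketch: the heredoc just reproduces the myJob file shown above, and myRprogram.R is assumed to already exist in the current directory.

```shell
# Step 2: create the job file myJob (here written via a heredoc)
cat > myJob <<'EOF'
#$ -N firstJob
R-i --vanilla < myRprogram.R
EOF

# Step 3: submit it to the grid engine.
# The guard lets the sketch also run on machines without SGE installed.
if command -v qsub >/dev/null; then
    qsub myJob
fi
```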
3.5 Output
For every submitted and processed job, the system creates two files. Each filename consists of the name of the job, a letter indicating whether the file contains the output or the errors ("o" and "e", respectively), and the job ID. The myJob submitted in the previous example would hence result in the following files being created:
firstJob.oXXXXX : starts with a prologue containing some meta information about the job, continues with the actual output (standard output only), and ends with an epilogue (runtime etc.)
firstJob.eXXXXX : contains any errors encountered
XXXXX refers to the job ID. Note that the output is cached while the job is running and appears in the two files with a delay (or immediately once the job finishes).
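As a small sketch of the naming scheme (the job id 12345 is made up for illustration):

```shell
# Compose the two file names SGE produces for a job
JOB_NAME=firstJob
JOB_ID=12345                        # hypothetical job id
OUT_FILE="${JOB_NAME}.o${JOB_ID}"   # standard output file
ERR_FILE="${JOB_NAME}.e${JOB_ID}"   # error file
echo "$OUT_FILE"                    # firstJob.o12345
echo "$ERR_FILE"                    # firstJob.e12345
```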
3.6 Deletion of a Job
To delete a job, use the qdel command.
3.7 Monitoring Job/cluster status
Statistics of the jobs running on cluster@WU can be obtained by calling qstat.
An overview of all jobs/nodes and their utilization can be obtained using
the sjs and sns commands.
3.8 Summary of Grid Engine Arguments
A job typically begins with commands to the grid engine. These commands start with #$, followed by one of the arguments described below and the desired value.
One or more arguments can be passed to the grid engine:
-N defines the job name that will be displayed in the queue and used in the output and error file names
-m bea tells the system to send an email when the job starts (b), ends (e), or is aborted (a)
-M is followed by the email address to notify
-q selects one of the currently two node queues (either bignode.q or node.q); the default is node.q
-pe [type] [n] starts up a parallel environment [type], reserving [n] cores.
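Putting several of these together, a job file header might look as follows. The job name is a placeholder, the e-mail address is the same placeholder used in the compilation example below, and -p (which lowers the job's priority, as suggested in the Fair Use section) is a standard SGE argument not listed in the table above.

```shell
#$ -N myAnalysis
#$ -M [email protected]
#$ -m bea
#$ -q node.q
#$ -p -100

# the job body follows the grid engine directives
date
```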
3.9 Job best practices & what to avoid
While it is possible to run jobs on nodes which then spawn further processes (see the Fair Use section above for an explanation), please refrain from doing so unless you have reserved the appropriate number of cores on the node you will be using (for instance by submitting the job in a parallel environment, see the example below). Otherwise the Grid Engine may allocate further jobs to a particular node even though it is already running at maximum capacity.
It might also be tempting to have your jobs write data back into your home directory (which is remotely mounted on each node when needed) as the job gets processed. This is not an issue if done in a limited fashion, but if done excessively with hundreds of simultaneous jobs, it can cause the file server to become unresponsive. This results both in jobs being unable to pass their data to the file server and in users being unable to access their home directories upon logging in to the cluster (a typical symptom is a user who normally uses SSH-key-based authentication being asked to input their password, because the unresponsive file server cannot serve their public key).
Instead, make use of the local storage in the /tmp/ directory of each node (approx. 10-15 GB) as well as the /scratch/ directory, which is mounted on each node and allows cross-node access to data, and have your scripts write into these directories instead of your home directory. Once larger chunks of data are ready, you can have them copied to your home directory.
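A minimal sketch of this pattern follows. The directory and file names are made up; on the cluster you would typically include $JOB_ID or $SGE_TASK_ID in the work directory name to keep simultaneous tasks apart.

```shell
# Work in node-local storage instead of the home directory
WORKDIR=/tmp/${USER:-anon}_myjob_example
mkdir -p "$WORKDIR"

# ... the actual computation writes its intermediate output here ...
echo "result data" > "$WORKDIR/result.txt"

# Copy the finished result home in one step, then clean up
cp "$WORKDIR/result.txt" "$HOME/"
rm -rf "$WORKDIR"
```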
3.10 Troubleshooting & further details
Troubleshooting and further details are outside the scope of this manual. Please refer to the website at http://statmath.wu.ac.at/cluster for more information. If that is insufficient to resolve your issue, you are welcome to contact the cluster administrators at [email protected].
4 Job Examples
4.1 A Simple Job
What follows is a simple job without any parameters for the SGE. The shell command date is run; after a pause of 30 seconds the command is run again.
# print date and time
date
# sleep for 30 seconds
sleep 30
# print date and time again
date
4.2 Compilation of Applications on the Cluster
The following job starts a remote compilation on cluster@WU. The arguments to the grid engine define the email address of the user to whom a mail should be sent; the flag -m e causes the email to be sent at the end of the job.
#$ -N compile-application
#$ -M [email protected]
#$ -m e
#$ -q node.q
cd /path/to/src
echo "#### clean ####"
make clean
echo "#### configure ####"
./configure CC=icc CXX=icpc FC=f77 --prefix=/path/to/local/lib/
echo "#### make ####"
make all
echo "#### install ####"
make install
echo "#### finished ####"
4.3 Same Job with Different Parameters
This is a commonly used way to (pseudo-)parallelize tasks: for one job, several different tasks are executed. The key is an environment variable called SGE_TASK_ID. For the range of task numbers provided by the -t argument, one task per number is started running the given job, with access to a unique value of this environment variable identifying the task.
To illustrate this kind of job, see the following example:
#$ -N R_alternatives -t 1:10
R-i --vanilla <<-EOF
run <- as.integer(Sys.getenv("SGE_TASK_ID"))
param <- expand.grid(mu = c(0.01, 0.025, 0.05, 0.075, 0.1),
                     sd = c(0.04, 0.1))
param
vec <- rnorm(50, param[[run, 1]], param[[run, 2]])
mean(vec)
sd(vec)
EOF
For each task ID, a vector of 50 normally distributed pseudo-random numbers is generated. The parameters of the normal distribution are chosen using the SGE_TASK_ID environment variable.
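The same parameter lookup can be sketched in plain shell. Here SGE_TASK_ID is set by hand for illustration, whereas on the cluster the grid engine sets it for each task; the value list matches the mu values of the R example above.

```shell
# On the cluster this variable is set by the grid engine; we fake task 3 here.
SGE_TASK_ID=3

MUS="0.01 0.025 0.05 0.075 0.1"

# Pick the SGE_TASK_ID-th value from the space-separated list
MU=$(echo "$MUS" | cut -d' ' -f"$SGE_TASK_ID")
echo "task $SGE_TASK_ID uses mu=$MU"   # task 3 uses mu=0.05
```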
4.4 openMPI/parallel Job
The grid engine helps the user set up parallel environments. The -pe argument followed by the desired parallel environment (e.g., orte, PVM) tells the grid engine to start the specified environment.
#$ -N RMPI
#$ -pe orte 20
#$ -q node.q
# Job for using the MPI implementation LAM on 20 cores
mpirun -np 20 /path/to/lam/executable
4.5 PVM Job
#$ -N pvm-example
#$ -pe pvm 5
#$ -q node.q
/path/to/pvm/executable
5 Available Software
This section gives a summary of the available programs. The operating system is Debian GNU/Linux (http://www.debian.org).
R
R-i        R compiled with the Intel compiler and linked against libgoto
R-g        R compiled with the default settings

Compiler
gcc        GNU C Compiler, Stallman and the GCC Developer Community (2007)
g++        GNU C++ Compiler
gfortran   GNU FORTRAN Compiler
icc        Intel C Compiler, Intel Corporation (2007a)
icpc       Intel C++ Compiler, Intel Corporation (2007a)
ifort      Intel FORTRAN Compiler, Intel Corporation (2007b)

Editor
emacs
vi/vim
nano
joe

Scientific
octave

HPC
LAM/MPI    version 7.1.3, for running MPI programs
PVM        Parallel Virtual Machine, version 3.4

Table 2: Available software
A .bashrc Modifications
This appendix explains the parts of the .bashrc which enable specific functionality on the cluster. Keep in mind that jobs need to be modified to specifically include your .bashrc by adding
#!/bin/sh
to the beginning of the file which contains your job information and instructions.
A.1 Enable openMPI
To get the MPI wrappers and libraries add the following to your .bashrc:
export LD_LIBRARY_PATH=/opt/libs/openmpi-1.4.3-INTEL-12.0-64/lib:$LD_LIBRARY_PATH
export PATH=$PATH:/opt/libs/openmpi-1.4.3-INTEL-12.0-64/bin
A.2 Enable PVM
To enable PVM add the following to your .bashrc:
# PVM
# you may wish to use this for your own programs (edit the last
# part to point to a different directory, e.g. ~/bin/_$PVM_ARCH).
#
if [ -z "$PVM_ROOT" ]; then
    if [ -d /usr/lib/pvm3 ]; then
        export PVM_ROOT=/usr/lib/pvm3
    else
        echo "Warning - PVM_ROOT not defined"
        echo "To use PVM, define PVM_ROOT and rerun your .bashrc"
    fi
fi

if [ -n "$PVM_ROOT" ]; then
    export PVM_ARCH=`$PVM_ROOT/lib/pvmgetarch`
    #
    # uncomment one of the following lines if you want the PVM commands
    # directory to be added to your shell path.
    #
    # export PATH=$PATH:$PVM_ROOT/lib              # generic
    # export PATH=$PATH:$PVM_ROOT/lib/$PVM_ARCH    # arch-specific
    #
    # uncomment the following line if you want the PVM executable directory
    # to be added to your shell path.
    #
    # export PATH=$PATH:$PVM_ROOT/bin/$PVM_ARCH
fi
A.3 Local R Library
Only the administrator has write access to the site library, and not all packages that users may require are pre-installed. For this reason, users should create their own package library in their home directory. Since the home directories are exported to all nodes during the execution of jobs, the personal package library will be available there as well.
To do this, create a directory in your home folder and add the following to your .bashrc file:
# R package directory
export R_LIBS=~/path/to/R/lib
If, upon your next login to the cluster, you start R (see the section Available Software above) and your newly created folder is displayed as the primary library, then the .bashrc modification worked.
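A minimal sketch of the setup, assuming ~/R/library as the (freely choosable) location of the personal library:

```shell
# Create the personal package library and point R_LIBS at it
mkdir -p ~/R/library
export R_LIBS=~/R/library   # the same export line goes into ~/.bashrc

# On the cluster you would then verify the setup in R, e.g.:
#   R-i --vanilla
#   > .libPaths()   # your personal library should be listed first
```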
References
Richard Stallman and the GCC Developer Community. Using the GNU Compiler Collection. The Free Software Foundation, 2007. URL http://gcc.gnu.org.
Intel Corporation. Intel C++ Compiler Documentation. Intel Corporation, 2007a. URL http://www.intel.com.
Intel Corporation. Intel Fortran Compiler Documentation. Intel Corporation, 2007b. URL http://www.intel.com.