GPC Quickstart

General Purpose Cluster (GPC)
Installed: June 2009
Operating System: Linux
Interconnect: 1/4 on Infiniband, rest on GigE
RAM/Node: 16 GB
Cores/Node: 8
Login/Devel Node: gpc-login1 (142.150.188.51)
Vendor Compilers: icc (C), ifort (Fortran), icpc (C++)
Queue Submission: Moab/Torque

The General Purpose Cluster is an extremely large cluster (ranked 16th in the world at its inception, and fastest in Canada) and is where most simulations are to be done at SciNet. It is an IBM iDataPlex cluster based on Intel's Nehalem architecture (one of the first in the world to make use of the new chips). The GPC consists of 3,780 nodes with a total of 30,240 2.5GHz cores, with 16GB RAM per node (2GB per core). One quarter of the cluster will be interconnected with non-blocking 4x-DDR Infiniband while the rest of the nodes are connected with gigabit ethernet.

Log In

The login node for the GPC cluster is gpc-login1 (142.150.188.51).

Compile/Devel Nodes

From a login node you can ssh to gpc-f101n001 and gpc-f101n002. These nodes have exactly the same hardware as the compute nodes (8-core Nehalem with 16GB RAM), so you can compile and test your codes on them. They do not have Infiniband hardware, however, so Infiniband jobs can be compiled there but not run.

Environment Variables

A modules system is used to handle environment variables associated with different compilers, MPI versions, libraries etc. To see all the options available type

module avail

To load a module

module load intel

These commands should go in your .bashrc files and/or in your submission scripts to make sure you are using the correct packages.

Compilers

The Intel compilers are icc/icpc/ifort for C/C++/Fortran. For MPI jobs, the scripts mpicc/mpiCC/mpiF90 are wrappers around these compilers which ensure that the MPI header files and libraries are correctly included and linked.

The SciNet machines are shared systems, and jobs that are to run on them are submitted to a queue; the scheduler then orders the jobs to make the best use of the machine, and launches them when resources become available. Because the scheduler reorders jobs in this way, they are not necessarily run in first-in, first-out order.

The maximum wallclock time for a job in the queue is 48 hours; computations that will take longer than this must be broken into 48-hour chunks and run as several jobs. The usual way to do this is with checkpoints, writing out the complete state of the computation every so often in such a way that a job can be restarted from this state information and continue on from where it left off. Generating checkpoints is a good idea anyway, as in the unlikely event of a hardware failure during your run, it allows you to restart without having lost much work.
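The restart half of this pattern can be sketched in a job script roughly as follows. This assumes, purely for illustration, that your code writes checkpoints named checkpoint_NNNN.dat with zero-padded step numbers (so that alphabetical order matches step order); the file naming and the helper name latest_checkpoint are hypothetical, and how your program actually reads a checkpoint is up to your code.

```shell
# latest_checkpoint DIR: print the newest checkpoint file in DIR, or nothing
# if there is none. Assumes hypothetical names checkpoint_NNNN.dat with
# zero-padded step numbers, so the shell's sorted glob ends at the latest step.
latest_checkpoint() {
    last=""
    for f in "$1"/checkpoint_*.dat; do
        # If nothing matches, the pattern is left unexpanded; -e filters it out.
        if [ -e "$f" ]; then last=$f; fi
    done
    if [ -n "$last" ]; then printf '%s\n' "$last"; fi
}

# In a job script: decide whether this job is a fresh start or a continuation.
ckpt=$(latest_checkpoint "$PWD")
if [ -n "$ckpt" ]; then
    echo "restarting from $ckpt"
else
    echo "no checkpoint found; starting from the beginning"
fi
```

Each 48-hour job in the chain can then use the same script: the first finds no checkpoint and starts fresh, and later ones pick up wherever the previous job left off.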

If your job should run in fewer than 48 hours, specify that in your script; your job will start sooner, because it is easier for the scheduler to fit in a short job than a long one. On the downside, the job will be killed automatically by the queue manager software at the end of the specified wallclock time, so if you guess wrong you might lose some work. The standard procedure is therefore to estimate how long your job will take and add 10% or so.
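The "estimate plus 10%" rule is simple arithmetic, and a small helper along these lines can turn an estimate in seconds into a walltime string. The function name pad_walltime is hypothetical; the cap reflects the 48-hour queue limit described above.

```shell
# pad_walltime SECONDS: print HH:MM:SS for the estimate plus about 10%,
# rounded up to the next second and capped at the 48-hour queue limit.
pad_walltime() {
    est=$1
    padded=$(( (est * 11 + 9) / 10 ))   # integer form of ceil(est * 1.1)
    limit=$(( 48 * 3600 ))
    if [ "$padded" -gt "$limit" ]; then padded=$limit; fi
    printf '%02d:%02d:%02d\n' \
        $(( padded / 3600 )) $(( (padded % 3600) / 60 )) $(( padded % 60 ))
}

pad_walltime 36000   # a 10-hour estimate; prints 11:00:00
```

The printed value can be used as the walltime request in your submission script.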

Jobs submitted to the TCS must make extremely efficient use of the small number of very expensive TCS nodes; otherwise, jobs should be run on the GPC, where there are 10 times as many cores available. You must use at least 32 (and preferably 64) tasks per node, and keep those tasks quite busy, or your job will be killed.

You interact with the queuing system through the queue/resource manager, LoadLeveler. (On the back end, the scheduler on the TCS is Torque, but you won't be directly interacting with it.) Most LoadLeveler commands begin with `ll'. Thus to see the state of the queue, you use the command

llq

There are many options for llq, which can be seen on the man page (type `man llq'). llq shows the current jobs in the queue, including jobs currently running. The state of the job (e.g., R, I, H) indicates whether the job is Running, Idle (waiting to run), or Held (either because it cannot run, or because the user has chosen to have it run later, perhaps after another job is completed). The number of jobs in the Idle state gives you an idea of how many jobs are ahead of you in the queue.

More detailed information on jobs that are actively running is available; SciNet staff have put together a script, jobState.sh, which we generally refer to as llq1. You can make it available by adding the following alias to your .profile:

alias llq1='/xcat/tools/tcs-scripts/LL/jobState.sh'

Then typing llq1 -n gives quite detailed information about each job, the nodes on which it is running, and the resources it is using. Of particular note is the column labeled `utilization', which gives an estimate of what percentage of the available CPU time on the nodes is being used by the job; if this number is low (say less than 70%) then the scarce TCS nodes aren't being well used and you will likely be asked to migrate your jobs to the GPC. Another measure of utilization, the amount of memory in use by the job, can be estimated by multiplying the `maximum RSS (MB)' field by the `number of tasks per node' field; if this amount isn't a significant fraction of the 128GB on the node, then again it isn't clear that the job should remain on the TCS.
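The memory estimate described above can be worked through as a short calculation. The helper below is only an illustration (node_mem_percent is a hypothetical name, and the sample numbers are made up): it multiplies the two llq1 fields and expresses the result as a percentage of a 128GB node.

```shell
# node_mem_percent MAX_RSS_MB TASKS_PER_NODE: print the estimated share of a
# 128GB node's memory in use, as a whole-number percentage (rounded down).
node_mem_percent() {
    node_mb=$(( 128 * 1024 ))     # 128GB expressed in MB
    used_mb=$(( $1 * $2 ))        # maximum RSS (MB) x number of tasks per node
    echo $(( used_mb * 100 / node_mb ))
}

# Made-up example: 1500 MB maximum RSS with 64 tasks per node.
node_mem_percent 1500 64   # prints 73
```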

To submit your own job, you must write a script which describes the job and how it is to be run (a sample script follows) and submit it to the queue, using the command

llsubmit SCRIPT-FILE-NAME

where you will replace SCRIPT-FILE-NAME with the file containing the submission script. This will return a job ID, for example tcs-f11n06.3404.0, which is used to identify the jobs. Jobs can be cancelled with the command

llcancel JOB-ID

or placed into the hold state with

llhold JOB-ID

Again, these commands have many options, which can be read about on their man pages.


Much more information on the queueing system is available on our queue page.

Submitting a Job


The GPC uses Moab/Torque as the queue manager to handle job submission and resource allocation. To submit a job, use

qsub myscript.pbs

which prints the ID of the new job. To cancel a job, pass that job ID (not the script name) to

canceljob JOB-ID

Submission Script

A sample submission script is shown below with the #PBS directives at the top and the rest being what will be executed on the compute node.

#!/bin/bash
# MOAB/Torque submission script for SciNet GPC
#
#PBS -l nodes=2:ppn=8,walltime=1:00:00,os=centos53computeA
#PBS -N test

# SOURCE YOUR ENVIRONMENT VARIABLES
source /scratch/user/.bashrc

# GO TO DIRECTORY SUBMITTED FROM
cd $PBS_O_WORKDIR

# MPIRUN COMMAND 
mpirun -np 16 -hostfile $PBS_NODEFILE ./a.out

#
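The sample script hard-codes -np 16 to match nodes=2:ppn=8. As a hedged alternative, the process count can be derived from the node file, since Torque lists each allocated node in $PBS_NODEFILE once per requested processor. The helper name count_procs is hypothetical, and the sketch only prints the command it would run so that it stays harmless outside a job.

```shell
# count_procs NODEFILE: print the number of lines in the node file, which under
# Torque is one line per allocated processor (16 for nodes=2:ppn=8).
count_procs() {
    wc -l < "$1" | tr -d ' '
}

# PBS_NODEFILE is only set inside a running job; the guard keeps this sketch
# safe to run elsewhere.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    NP=$(count_procs "$PBS_NODEFILE")
    echo "would run: mpirun -np $NP -hostfile $PBS_NODEFILE ./a.out"
fi
```

With this approach, changing the nodes or ppn request in the #PBS line does not require editing the mpirun line as well.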

MPI over Infiniband

To use the Infiniband interconnect for MPI communications, the MVAPICH2 implementation has been installed and tested for both the Intel v11 and GCC v4.1 compilers.

You will need to load one of the following sets of modules to set up the appropriate environment variables, depending on whether you want to compile with the Intel or GCC compilers.

INTEL

module load mvapich2 intel

GCC

module load mvapich2_gcc

MVAPICH2 uses the wrappers mpicc/mpicxx/mpif90/mpif77 for the compilers.

Currently you can compile and link your MPI code on the development nodes gpc-f101n001 and gpc-f101n002; however, you will not be able to test it interactively there, as these nodes are not connected with Infiniband. Alternatively, you can compile, link, and test an MPI code in an interactive queue session, using the os image "centos53develibA" as follows.

qsub -l nodes=2:ib:ppn=8,walltime=12:00:00,os=centos53develibA -I

Once you have compiled your MPI code and would like to test it, use the following command with $PROCS being the number of processors to run on and a.out being your code.

mpirun_rsh -np $PROCS -hostfile $PBS_NODEFILE ./a.out

To run your MPI-Infiniband job in a non-interactive queue you can use a submission script as follows, remembering to source the appropriate environment variables.

#!/bin/bash
#PBS -l nodes=2:ib:ppn=8,walltime=1:00:00,os=centos53computeibA
#PBS -N testib

# INTEL & MVAPICH2 ENVIRONMENT VARIABLES
module load intel mvapich2

# GO TO DIRECTORY SUBMITTED FROM
cd $PBS_O_WORKDIR

# MPIRUN COMMAND 
mpirun_rsh -np 16 -hostfile $PBS_NODEFILE ./a.out


Performance Tools