TCS Quickstart

From oldwiki.scinet.utoronto.ca
Revision as of 09:49, 13 August 2009 by Ljdursi (talk | contribs)
Tightly Coupled System (TCS)
Installed: Jan 2009
Operating System: AIX 5.3
Interconnect: InfiniBand
RAM/Node: 128 GB (256 GB on 2 nodes)
Cores/Node: 32
Login/Devel Nodes: tcs01, tcs02 (from login.scinet)
Vendor Compilers: xlc (C), xlf (Fortran), xlC (C++)
Queue Submission: LoadLeveler

The TCS is a cluster of `fat' (high-memory, many-core) nodes connected by a very fast InfiniBand interconnect. It has a relatively small number of cores (~3000), so it is dedicated to jobs that require this large-memory, low-latency configuration.

Login

First, log in via ssh with your SciNet account at login.scinet.utoronto.ca.

Compile/Devel node

From the SciNet login nodes, you can ssh to tcs01 or tcs02 to compile jobs and submit them to the queue. These machines have the same configuration as the compute nodes, so what works here should work for your run.

Your home directory is /home/USER; you have 10GB there, and it is backed up. This directory cannot be seen by the compute nodes! To run jobs, you must therefore use the /scratch/USER directory, which has a large amount of disk space but is not backed up. It thus makes sense to keep your codes in /home, compile them there, and then run them from /scratch (copying the executable and any input files over first).

The IBM compilers are xlc/xlC/xlf for C/C++/Fortran. Documentation about them, including relevant compiler flags, can be found here. For OpenMP or other threaded applications, use the `reentrant-safe' versions of them, xlc_r, xlC_r, and xlf_r. For MPI applications, the scripts mpcc/mpCC/mpxlf are wrappers for the compilers which include the appropriate flags to use the IBM MPI libraries. Hybrid applications will need to use mpcc_r/mpCC_r/mpxlf_r.
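
As a minimal sketch of the naming scheme just described, the following shell function maps a language and a parallel model to the matching compiler command. The function name tcs_compiler and its keyword arguments are made up for illustration; only the compiler names themselves come from the text above.

```shell
# Hypothetical helper: print the IBM compiler command for a given
# language (c, c++, fortran) and parallel model (serial, openmp, mpi, hybrid).
tcs_compiler() {
    lang=$1
    model=$2

    # Serial compiler and MPI wrapper script for each language.
    case $lang in
        c)       serial=xlc; mpi=mpcc  ;;
        c++)     serial=xlC; mpi=mpCC  ;;
        fortran) serial=xlf; mpi=mpxlf ;;
        *) echo "unknown language: $lang" >&2; return 1 ;;
    esac

    # Threaded code (OpenMP or hybrid) needs the _r variants.
    case $model in
        serial) echo "$serial" ;;
        openmp) echo "${serial}_r" ;;
        mpi)    echo "$mpi" ;;
        hybrid) echo "${mpi}_r" ;;
        *) echo "unknown model: $model" >&2; return 1 ;;
    esac
}

tcs_compiler fortran hybrid    # prints mpxlf_r
```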

To ensure that your code works before submitting it to the queue, you can interactively run short, small test jobs on tcs01 or tcs02; you can also run short, small jobs analyzing your data on these machines. Instructions can be found in the FAQ.

Node configuration

Each node has 16 processors, each with 2 independent cores. Thus, to fully utilize a node, at least 32 tasks are required. Because there are so few of these nodes compared to the GPC, jobs must use at least 32 tasks per node or they will be killed automatically.

In fact, the Power6 series of processors has a facility called Simultaneous Multi Threading (SMT) which allows two tasks to be very efficiently bound to each core. Because most applications spend a lot of time waiting for data from memory, this allows the second task to use the processor while the first is stuck waiting for memory, and vice versa. This requires no changes to the code, only running 64 rather than 32 tasks on the node. Real applications run by early users suggest that this gives a speedup of 1.4x to 1.9x for no user work at all!
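
The task counts above follow from simple arithmetic, sketched here in shell. The variable names and the total_tasks function are illustrative only; the numbers are those given in the text.

```shell
# Values taken from the node description above.
PROCESSORS_PER_NODE=16
CORES_PER_PROCESSOR=2
SMT_TASKS_PER_CORE=2

# Hypothetical helper: total number of tasks for a job on N whole nodes.
# Usage: total_tasks NODES [smt|nosmt]
total_tasks() {
    nodes=$1
    mode=${2:-smt}
    cores=$((PROCESSORS_PER_NODE * CORES_PER_PROCESSOR))   # 32 per node
    if [ "$mode" = smt ]; then
        per_node=$((cores * SMT_TASKS_PER_CORE))           # 64 per node
    else
        per_node=$cores
    fi
    echo $((nodes * per_node))
}

total_tasks 3          # prints 192
total_tasks 3 nosmt    # prints 96
```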

There are 102 nodes, with a naming convention such that tcs-fFFnNN is the NNth node in rack FF. They each have 128GB of memory, save two large-memory nodes (tcs-f11n03 and tcs-f11n04) which have 256GB.
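
To illustrate the naming convention, here is a small sketch that pulls the rack and node numbers out of a node name using only shell parameter expansion. The function name parse_node_name is hypothetical.

```shell
# Hypothetical helper: print the rack and node number encoded in a
# TCS node name of the form tcs-fFFnNN.
parse_node_name() {
    name=$1
    case $name in
        tcs-f*n*) ;;
        *) echo "not a TCS node name: $name" >&2; return 1 ;;
    esac
    rest=${name#tcs-f}    # e.g. 11n03
    rack=${rest%%n*}      # text before the first "n"
    node=${rest#*n}       # text after the first "n"
    echo "rack=$rack node=$node"
}

parse_node_name tcs-f11n03    # prints rack=11 node=03
```

Note that the login/development nodes (tcs01, tcs02) do not follow this pattern, so the sketch rejects them.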

Each node has 4 InfiniBand (IB) interfaces used for your code's data and message-passing traffic, and 2 Gig-E interfaces used for management purposes.

Compilation

We strongly suggest the use of at least the following compiler flags:

-q64 -O3 -qhot -qarch=pwr6 -qtune=pwr6

The -qarch=pwr6 and -qtune=pwr6 flags ensure that the new executable will take advantage of new features of the Power6 architecture. -q64 generates 64-bit code, which can then take advantage of the large amount of memory on the nodes. -O3 is a moderately aggressive level of optimization, and -qhot enables `Higher Order Transformations', which include automatically using the ESSL and MASSV high-performance mathematics libraries. As with any new platform, compiler, or set of compilation flags, you should ensure that you are still getting the same answers on some well-understood test problem with this configuration.

On the link line we suggest using

-q64 -bdatapsize:64k -bstackpsize:64k

to again make use of the large memory, and to set page sizes to 64k (versus the default 4k). This page size setting can make a significant improvement for codes that manipulate large amounts of small pieces of data.
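
Putting the compile and link flags together might look like the sketch below. It only prints the commands rather than running them, and the source file name athena.c and the show_build function are assumptions made for illustration; substitute your own compiler wrapper and files.

```shell
# Flags recommended in the text above.
COMPILE_FLAGS="-q64 -O3 -qhot -qarch=pwr6 -qtune=pwr6"
LINK_FLAGS="-q64 -bdatapsize:64k -bstackpsize:64k"

# Hypothetical helper: print (not run) the compile and link commands
# for one source file. Usage: show_build COMPILER SOURCE EXECUTABLE
show_build() {
    compiler=$1
    source_file=$2
    exe=$3
    object=${source_file%.*}.o
    echo "$compiler $COMPILE_FLAGS -c $source_file -o $object"
    echo "$compiler $LINK_FLAGS $object -o $exe"
}

show_build mpcc athena.c athena
```

Keeping the flags in two variables makes it harder to forget -q64 on the link line, where it is needed as well.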

Submitting A Job

The SciNet machines are shared systems, and jobs that are to run on them are submitted to a queue; the scheduler then orders the jobs to make the best use of the machine, and has them launched when resources become available. The intervention of the scheduler can mean that the jobs aren't quite run in first-in, first-out order.

The maximum wallclock time for a job in the queue is 48 hours; computations that will take longer than this must be broken into 48-hour chunks and run as several jobs. The usual way to do this is with checkpoints, writing out the complete state of the computation every so often in such a way that a job can be restarted from this state information and continue on from where it left off. Generating checkpoints is a good idea anyway, as in the unlikely event of a hardware failure during your run, it allows you to restart without having lost much work.
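
The checkpoint/restart pattern can be sketched in a few lines of shell. This is a toy under stated assumptions, not how any particular application does it: the "computation" is just a running sum, its complete state is two integers, and the function name and file layout are invented for the example.

```shell
# Hypothetical checkpoint/restart loop. The "computation" is a running sum;
# its complete state (step number and sum) is saved to a checkpoint file.
# Usage: run_with_checkpoints CHECKPOINT_FILE LAST_STEP INTERVAL
run_with_checkpoints() {
    ckpt=$1
    last_step=$2
    interval=$3

    # Restart from the saved state if a checkpoint exists.
    step=0
    sum=0
    if [ -f "$ckpt" ]; then
        read -r step sum < "$ckpt"
    fi

    while [ "$step" -lt "$last_step" ]; do
        step=$((step + 1))
        sum=$((sum + step))
        # Checkpoint every INTERVAL steps and at the end. Write to a
        # temporary file first so that a crash in the middle of a write
        # cannot corrupt the previous checkpoint.
        if [ $((step % interval)) -eq 0 ] || [ "$step" -eq "$last_step" ]; then
            echo "$step $sum" > "$ckpt.tmp" && mv "$ckpt.tmp" "$ckpt"
        fi
    done
    echo "$sum"
}
```

Calling the function once with LAST_STEP=5 and again with LAST_STEP=10 on the same checkpoint file gives the same final sum as a single uninterrupted run to step 10, which is the property a real checkpoint scheme needs.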

If your job should run in fewer than 48 hours, specify that in your script; your job will start sooner. (It's easier for the scheduler to fit in a short job than a long job.) On the downside, the job will be killed automatically by the queue manager software at the end of the specified wallclock time, so if you guess wrong you might lose some work. So the standard procedure is to estimate how long your job will take and add 10% or so.
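
The "estimate plus 10%" rule can be turned into a wall_clock_limit value with a little shell arithmetic. The following is only a sketch; the function name padded_wall_clock is made up, and it caps the result at the 48-hour queue maximum.

```shell
MAX_MINUTES=$((48 * 60))    # queue maximum from the text: 48 hours

# Hypothetical helper: pad an estimated run time (in whole minutes) by 10%
# and print it as HH:MM:SS, suitable for wall_clock_limit.
padded_wall_clock() {
    estimate=$1
    padded=$(( (estimate * 110 + 99) / 100 ))    # add 10%, rounding up
    if [ "$padded" -gt "$MAX_MINUTES" ]; then
        padded=$MAX_MINUTES                      # cannot exceed 48 hours
    fi
    printf '%02d:%02d:00\n' $((padded / 60)) $((padded % 60))
}

padded_wall_clock 600    # 10-hour estimate, prints 11:00:00
```

If the padded estimate hits the 48-hour cap, the run needs to be split into several jobs using checkpoints.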

Jobs submitted to the TCS must make extremely efficient use of the small number of very expensive TCS nodes; otherwise, jobs should be run on the GPC, where there are 10 times as many cores available. You must use at least 32 (and preferably 64) tasks per node, and keep those tasks quite busy, or your job will be killed.

You interact with the queuing system through the queue/resource manager, LoadLeveler. Most LoadLeveler commands begin with `ll'. Thus, to see the state of the queue, you use the command

llq

There are many options for llq, which can be seen on the man page (type `man llq'). llq shows the current jobs in the queue, including jobs currently running. The state of the job (e.g., R, I, H) indicates whether the job is Running, Idle (waiting to run), or Held (either because it cannot run, or because the user has chosen to have it run later, perhaps after another job has completed). The number of jobs in the Idle state gives you an idea of how many jobs are ahead of you in the queue.

More detailed information on jobs that are actively running is available; SciNet staff have put together a script, jobState.sh, which we generally refer to as llq1. You can make it available by adding the following alias to your .profile:

alias llq1='/xcat/tools/tcs-scripts/LL/jobState.sh'

Then typing llq1 -n gives quite detailed information about each job, the nodes on which it is running, and the resources it is using. Of particular note is the column labeled `utilization', which gives an estimate of what percentage of the available CPU time on the nodes is being used by the job; if this number is low (say less than 70%) then the scarce TCS nodes aren't being well used and you will likely be asked to migrate your jobs to the GPC. Another measure of utilization, the amount of memory in use by the job, can be estimated by multiplying the `maximum RSS (MB)' field by the `number of tasks per node' field; if this amount of memory isn't a significant fraction of the 128GB on the node, then again it isn't clear that this is a job which should remain on the TCS.
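
The memory estimate described above is a single multiplication. As a sketch (the function name memory_percent is hypothetical, and it expects whole-number inputs that you read off the llq1 output by hand):

```shell
NODE_MEMORY_MB=$((128 * 1024))    # 128GB per standard node

# Hypothetical helper: estimate the percentage of node memory in use.
# Usage: memory_percent MAX_RSS_MB TASKS_PER_NODE
memory_percent() {
    max_rss_mb=$1
    tasks_per_node=$2
    used_mb=$((max_rss_mb * tasks_per_node))
    echo $((used_mb * 100 / NODE_MEMORY_MB))
}

memory_percent 1500 64    # 1500 MB x 64 tasks, prints 73
```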

To submit your own job, you must write a script which describes the job and how it is to be run (a sample script follows) and submit it to the queue, using the command

llsubmit SCRIPT-FILE-NAME

where you will replace SCRIPT-FILE-NAME with the name of the file containing the submission script. This will return a job ID, for example tcs-f11n06.3404.0, which is used to identify the job. Jobs can be cancelled with the command

llcancel JOB-ID

or placed into the hold state with

llhold JOB-ID

Again, these commands have many options, which can be read about on their man pages.
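
A job ID such as tcs-f11n06.3404.0 is made of three dot-separated parts. The sketch below splits one apart with shell parameter expansion; the function name split_job_id is made up, and reading the first part as a host name is an assumption based on its resemblance to the node names (the trailing number is the step number, described under Advanced LoadLeveler Options).

```shell
# Hypothetical helper: split a LoadLeveler job ID such as
# tcs-f11n06.3404.0 into its three dot-separated parts.
split_job_id() {
    id=$1
    step=${id##*.}        # text after the last dot
    rest=${id%.*}         # everything before the last dot
    number=${rest##*.}    # job number
    host=${rest%.*}       # leading part (assumed to be a host name)
    echo "host=$host job=$number step=$step"
}

split_job_id tcs-f11n06.3404.0    # prints host=tcs-f11n06 job=3404 step=0
```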


Much more information on the queueing system is available on our queue page.

Submission Script for An MPI job

A template submission script follows for a typical MPI job. To use it, you need only change the job name, the directory you will run from, the executable file you'll run, and the arguments that need to be passed to it. (The example given below will change directory to /scratch/USER/SOMEDIRECTORY and run `athena -i ./athinput' as an MPI job.) You'll also have to specify the number of tasks per node you want (which should be 64 absent a very good reason otherwise), and how many nodes you require. You can also set where the output and error logs (the output of stdout and stderr) go, and the wall clock time you expect the job to take (maximum 48 hours). The `notification' lines ensure that email is sent (to user@example.com) upon job completion; notification can be set to `start' for job start, `error' for termination on an error, `complete' for job completion, `always' for all of the above, and `never' for never.

#
# LoadLeveler submission script for SciNet TCS: MPI job
#
#@ job_name        = SOME-DESCRIPTIVE-NAME
#@ initialdir      = /scratch/USER/SOMEDIRECTORY
#@ executable      = athena
#@ arguments       = -i ./athinput 
#
#@ tasks_per_node  = 64
#@ node            = 3
#@ wall_clock_limit= 12:00:00
#@ output          = $(job_name).$(jobid).out
#@ error           = $(job_name).$(jobid).err
#
#@ notification = complete
#@ notify_user  = user@example.com
#
# Don't change anything below here unless you know exactly
# why you are changing it.
#
#@ job_type        = parallel
#@ class           = verylong
#@ node_usage      = not_shared
#@ rset = rset_mcm_affinity
#@ mcm_affinity_options = mcm_distribute mcm_mem_req mcm_sni_none
#@ cpus_per_core=2
#@ task_affinity=cpu(1)
#@ environment = COPY_ALL; MEMORY_AFFINITY=MCM; MP_SYNC_QP=YES; \
#                MP_RFIFO_SIZE=16777216; MP_SHM_ATTACH_THRESH=500000; \
#                MP_EUIDEVELOP=min; MP_USE_BULK_XFER=yes; \
#                MP_RDMA_MTU=4K; MP_BULK_MIN_MSG_SIZE=64k; MP_RC_MAX_QP=8192; \
#                PSALLOC=early; NODISCLAIM=true
#
#
# Submit the job
#
#@ queue
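
Before submitting a pure MPI script like the one above, it can be worth checking the tasks_per_node value against the 32-task minimum. The following is a rough sketch of such a check; the function name check_tasks_per_node is hypothetical and it only looks at the first tasks_per_node directive in the file.

```shell
# Hypothetical pre-submission check: read tasks_per_node from a
# LoadLeveler script and complain if it is below the 32-task minimum.
check_tasks_per_node() {
    script=$1
    # Match "#@ tasks_per_node = N" with optional spaces around "@" and "=".
    tasks=$(sed -n 's/^#[[:space:]]*@[[:space:]]*tasks_per_node[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$script" | head -n 1)
    if [ -z "$tasks" ]; then
        echo "no tasks_per_node directive found in $script" >&2
        return 2
    fi
    if [ "$tasks" -lt 32 ]; then
        echo "tasks_per_node=$tasks is below the minimum of 32" >&2
        return 1
    fi
    echo "tasks_per_node=$tasks"
}
```

Usage would be `check_tasks_per_node SCRIPT-FILE-NAME` just before llsubmit. The check is not meaningful for threaded jobs that deliberately run one task with many threads per node.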


Submission Script for an OpenMP Job

Here is an example of a LoadLeveler script which will submit a shared-memory job. Since the machine has fat nodes, with 32 cores each, and efficiently runs up to 64 tasks (with SMT), all within a shared-memory, NUMA-style architecture, it can be a very efficient platform for OpenMP.

Note that there are a few important options that help the code run better, such as enabling processor affinity and memory affinity. The processor affinity is taken care of by the "ccsm_launch" utility (essentially the same as another utility from IBM called "hybrid_launch").

#===============================================================================
# Specifies the name of the shell to use for the job 
# @ shell = /usr/bin/ksh
# @ job_name = pi
# @ job_type = parallel
# @ class = verylong
# @ environment = COPY_ALL
# @ node = 1
# @ tasks_per_node = 1
# @ node_usage = not_shared
# @ output = $(jobid).out
# @ error = $(jobid).err
# @ wall_clock_limit = 04:00:00
# @ queue

export TARGET_CPU_RANGE=-1
cd /scratch/USER/SOMEDIRECTORY

# next variable is for performance, so that memory is allocated as
# close to the cpu running the task as possible (NUMA architecture)

export MEMORY_AFFINITY=MCM
# next variable is for OpenMP
export OMP_NUM_THREADS=64
# next variable is for ccsm_launch
export THRDS_PER_TASK=64

# ccsm_launch is a "hybrid program launcher" for MPI-OpenMP programs
poe ccsm_launch ./stream

Advanced LoadLeveler Options

Steps

LoadLeveler has many advanced features for controlling job submission and execution. One of these is called steps: a series of jobs can be submitted from one script, with dependencies defined between them, so that each job (called a step) waits for the previous one to finish before it starts. The following example uses the same LoadLeveler script as shown previously, but adds the #@ step_name and #@ dependency directives to rerun the same case three times in a row, waiting until each step is finished before starting the next.

#
# LoadLeveler submission script for SciNet TCS
#
#@ job_name        = SOME-DESCRIPTIVE-NAME
#@ initialdir      = /scratch/USER/SOMEDIRECTORY
#@ executable      = athena
#@ arguments       = -i ./athinput 
#
#@ tasks_per_node  = 64
#@ node            = 3
#@ wall_clock_limit= 12:00:00
#@ output          = $(job_name).$(jobid).out
#@ error           = $(job_name).$(jobid).err
#
#@ notification = complete
#@ notify_user  = user@example.com
#
# Don't change anything below here unless you know exactly
# why you are changing it.
#
#@ job_type        = parallel
#@ class           = verylong
#@ rset = rset_mcm_affinity
#@ mcm_affinity_options = mcm_distribute mcm_mem_req mcm_sni_none
#@ cpus_per_core=2
#@ task_affinity=cpu(1)
#@ environment = COPY_ALL; MEMORY_AFFINITY=MCM; MP_SYNC_QP=YES; \
#                MP_RFIFO_SIZE=16777216; MP_SHM_ATTACH_THRESH=500000; \
#                MP_EUIDEVELOP=min; MP_USE_BULK_XFER=yes; \
#                MP_RDMA_MTU=4K; MP_BULK_MIN_MSG_SIZE=64k; MP_RC_MAX_QP=8192; \
#                PSALLOC=early; NODISCLAIM=true
#
# Submit the 3 steps of the same job, with dependencies 
# defined so that step2 won't start until step1 is done
# and step3 won't start until step2 is done.
#
# @ step_name = step1
# @ queue
#
# @ step_name = step2
# @ dependency = step1 == 0 
# @ queue
#
# @ step_name = step3
# @ dependency = step2 == 0
# @ queue

When a job with steps is submitted, all the steps have the same job ID, with a suffix incremented as .0, .1, .2, etc. The first step is queued and run normally, while the other steps are shown as "NQ" (not queued) until their dependency is satisfied, which in this case means the previous step has finished with an exit code of 0.
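
For longer chains, writing the step blocks by hand gets tedious. As a sketch, the shell function below prints the step_name, dependency and queue directives for any number of chained steps, in the same form as the three-step example; the function name emit_steps is made up, and its output is meant to be appended to the rest of a submission script.

```shell
# Hypothetical generator: print the step directives for N chained steps,
# each one waiting for the previous step to exit with status 0.
emit_steps() {
    count=$1
    i=1
    while [ "$i" -le "$count" ]; do
        echo "# @ step_name = step$i"
        if [ "$i" -gt 1 ]; then
            echo "# @ dependency = step$((i - 1)) == 0"
        fi
        echo "# @ queue"
        # Separate consecutive steps with a comment line.
        if [ "$i" -lt "$count" ]; then
            echo "#"
        fi
        i=$((i + 1))
    done
}

emit_steps 3
```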