TCS Quickstart
Tightly Coupled System (TCS) | |
---|---|
Installed | Jan 2009 |
Operating System | AIX 5.3 |
Number of Nodes | 102 (3264 cores) |
Interconnect | Infiniband (4 DDR/node) |
Ram/Node | 128 GB (256 on 2 nodes) |
Cores/Node | 32 (64 threads) |
Login/Devel Node | tcs01, tcs02 (from login.scinet) |
Vendor Compilers | xlc (C) xlf (fortran) xlC (C++) |
Queue Submission | LoadLeveler |
Specifications
The TCS is a cluster of `fat' (high-memory, many-core) IBM Power 575 nodes (with 4.7GHz Power 6 processors) on a very fast Infiniband connection. It has a relatively small number of cores (~3000) and so is dedicated for jobs that require such a large memory / low latency configuration. Jobs need to use multiples of 32 cores (a node), and are submitted to a queuing system that allows jobs with a maximum wall time of 48 hours.
Login
First login via SSH with your SciNet account at login.scinet.utoronto.ca.
Compile/Devel node
From the scinet login nodes, you can ssh to tcs01 or tcs02 to compile jobs and submit them to the queue. These machines have the same configuration as the compute nodes, so what works here should work for your run.
Note that access to the TCS is not enabled by default. We ask that people justify the need for this highly specialized machine. Contact us explaining the nature of your work if you want access to the TCS. In particular, applications should scale well to 64 processes/threads to run on this system.
Your home directory is in $HOME; you have 50GB there that is backed up. This directory is seen as read-only by the compute nodes! Thus, to run jobs, you'll use the $SCRATCH directory. Here, there is a large amount of disk space, but it is not backed up. Thus it makes sense to keep your codes in $HOME, compile there, and then run them in the $SCRATCH directory.
The IBM compilers are xlc/xlC/xlf for C/C++/Fortran. Documentation about them, including relevant compiler flags, can be found here. For OpenMP or other threaded applications, use the `reentrant-safe' versions of them, xlc_r, xlC_r, and xlf_r. For MPI applications, the scripts mpcc/mpCC/mpxlf are wrappers for the compilers which include the appropriate flags to use the IBM MPI libraries. Hybrid applications will need to use mpcc_r/mpCC_r/mpxlf_r.
To ensure that your code works before submitting it to the queue, you can interactively run short, small test jobs on tcs01 or tcs02; you can also run short, small jobs analyzing your data on these machines. Instructions can be found in the FAQ.
Node configuration
Each node has 16 processors, each with 2 independent cores. Thus, to fully utilize the node, at least 32 tasks are required. Because there are so few of these nodes compared to the GPC, jobs must use at least 32 tasks or they will be killed automatically.
In fact, the Power6 series of processors has a facility called Simultaneous Multi Threading (SMT) which allows two tasks to be very efficiently bound to each core. Because most applications spend a lot of time waiting for data from memory, this allows the second task to use the processor while the first is stuck waiting for memory, and vice versa. This requires no changes to the code, only running 64 rather than 32 tasks on the node. Real applications run by early users suggest that this gives a speedup of 1.4x to 1.9x for no user work at all!
There are 102 nodes, with a naming convention such that tcs-fFFnNN is the NNth node in rack FF. They each have 128GB of memory, save two large-memory nodes (tcs-f11n03 and n04) which have 256Gb.
Each node has 4 InfiniBand (IB) interfaces used for your codes data and message-passing traffic, and 2 Gig-E interfaces used for management purposes.
Compilation
We strongly suggest the use of at least the following compiler flags for MPI code:
-q64 -qhot -qarch=pwr6 -qtune=pwr6 -O3
For OpenMP (or hybrid) code, the recommended appropriate flags are slightly different:
-q64 -qhot -qarch=pwr6 -qtune=pwr6 -O4 -qsmp=omp
(with the (older) xl compilers, -O3 optimization is typically not good enough for OpenMP programs.)
The -qarch=pwr6 and -qtune=pwr6 flags ensure that the new executable will take advantage of new features of the Power6 architecture. -q64 generates 64-bit code, which can then take advantage of the large amount of memory on the nodes. -O3 is a moderately aggressive level of optimization, and -qhot enables `Higher Order Transformations' which include automatically using the ESSL and MASSV high-performance mathematics libraries. As with any new platform, compiler and new set of compilation flags, you should ensure that you are still getting the same answers on some well understood test problem with this configuration.
Note:
- You can try increasing the optimization level to -O5 (for mpi and for openmp code), but beware that compilation may be considerably longer.
On the link line we suggest using
-q64 -bdatapsize:64k -bstackpsize:64k
to again make use of the large memory, and to set page sizes to 64k (as vs the default 4k). This page size setting can make significant improvement for codes that manipulate large amounts of small pieces of data. For OpenMP code, the link line should be supplemented by
-qsmp=omp
Note: To use the full C++ bindings of MPI (those in the MPI namespace) in c++ code, you need to add -cpp -bh:5 to the xlC compilation command (you only need the -bh:5 when linking several c++ object files).
Debuggers
- ddt - Allinea's graphical parallel debugger, in the ddt module. Highly recommended
- dbx - IBM's debugger. Standard available.
- gdb - The GNU Debugger (version 6.8). Standard available.
Submitting A Job
The SciNet machines are shared systems, and jobs that are to run on them are submitted to a queue; the scheduler then orders the jobs in order to make the best use of the machine, and has them launched when resources become available. The intervention of the scheduler can mean that the jobs aren't quite run in a first-in first-out order.
The maximum wallclock time for a job in the queue is 48 hours; computations that will take longer than this must be broken into 48-hour chunks and run as several jobs. The usual way to do this is with checkpoints, writing out the complete state of the computation every so often in such a way that a job can be restarted from this state information and continue on from where it left off. Generating checkpoints is a good idea anyway, as in the unlikely event of a hardware failure during your run, it allows you to restart without having lost much work.
If your job should run in fewer than 48 hours, specify that in your script -- your job will start sooner. (It's easier for the scheduler to fit in a short job than a long job). On the downside, the job will be killed automatically by the queue manager software at the end of the specified wallclock time, so if you guess wrong you might loose some work. So the standard procedure is to estimate how long your job will take and add 10% or so.
Jobs submitted to the TCS must make extremely efficient use of the small number of very expensive TCS nodes; otherwise, jobs should be run on the GPC, where there are 10 times as many cores available. You must use at least 32 (and preferably 64) tasks per node, and keep those tasks quite busy, or your job will be killed.
You interact with the queuing system through the queue/resource manager, LoadLeveler. Most loadleveler commands begin with `ll'. Thus to see the state of the queue, you use the command
$ llq
There are many options for llq which can be seen on the man page (type `man llq'). llq shows the current jobs in the queue, including jobs currently running. The state of the job (eg, R, I, H, etc) indicates whether the job is Running, Idle (waiting to run), Held (either because it cannot run, or because the user is choosing to have it run later, perhaps after another job is completed). The number of jobs in the Idle state give you an idea of how many jobs are ahead of you in the queue.
More detailed information on jobs that are actively running is available; SciNet staff have put together a script, jobState.sh, which we generally refer to as llq1, which can be accessed by editing in your .profile: <source lang="bash"> alias llq1='/xcat/tools/tcs-scripts/LL/jobState.sh' </source>
Then typing llq1 -n gives quite detailed information about each job, on which nodes it is running, and what resources it is using. Of particular note is the column labeled `utilization', which gives an estimate of what percentage of the available CPU time on the nodes are being used by the job; if this number is low (say less than 70%) then the scarce TCS nodes aren't being well used and you will likely be asked to migrate your jobs to the GPC. Another measure of utilization, the amount of memory in use by the job, can be estimated by multiplying the `maximum RSS (MB)' field by the `number of tasks per node' field; if this amount of memory in use isn't a significant fraction of the 128GB on the node, then again it isn't clear if this is a job which should remain on the TCS.
To submit your own job, you must write a script which describes the job and how it is to be run (a sample script follows) and submit it to the queue, using the command
$ llsubmit [SCRIPT-FILE-NAME]
where you will replace [SCRIPT-FILE-NAME] with the file containing the submission script. This will return a job ID, for example tcs-f11n06.3404.0, which is used to identify the jobs. Jobs can be cancelled with the command
$ llcancel [JOB-ID]
or placed into the hold state with
$ llhold [JOB-ID]
Again, these commands have many options, which can be read about on their man pages.
Much more information on the queuing system is available in the LoadLeveler manual.
Submission Script for An MPI job
A template submission script follows for a typical MPI job. To use it, you need only change the job name, the directory you will run from, the executable file you'll run, and the arguments that need to be passed to it. (The example given below will change directory to $SCRATCH/SOMEDIRECTORY and mpirun `athena -i ./athinput'). You'll also have to specify the number of tasks per node you want (which should be 64 absent a very good reason otherwise), and how many nodes you require. You can also set where the output and error logs (the output of stdout and stderr) go, and the wall clock time you expect the job to take (maximum 48 hours). The `notification' lines ensure that email is sent (to user@example.com) upon job completion; notification can be set to `start' for job start, `error' for termination on an error, `complete' for job completion, `always' for all of the above, and `never' for never.
# # LoadLeveler submission script for SciNet TCS: MPI job # #@ job_name = SOME-DESCRIPTIVE-NAME #@ initialdir = $SCRATCH/SOMEDIRECTORY #@ executable = athena #@ arguments = -i ./athinput # #@ tasks_per_node = 64 #@ node = 3 #@ wall_clock_limit= 12:00:00 #@ output = $(job_name).$(jobid).out #@ error = $(job_name).$(jobid).err # #@ notification = complete # # Don't change anything below here unless you know exactly # why you are changing it. # #@ job_type = parallel #@ class = verylong #@ node_usage = not_shared #@ rset = rset_mcm_affinity #@ mcm_affinity_options = mcm_distribute mcm_mem_req mcm_sni_none #@ cpus_per_core=2 #@ task_affinity=cpu(1) #@ environment = COPY_ALL; MEMORY_AFFINITY=MCM; MP_SYNC_QP=YES; \ # MP_RFIFO_SIZE=16777216; MP_SHM_ATTACH_THRESH=500000; \ # MP_EUIDEVELOP=min; MP_USE_BULK_XFER=yes; \ # MP_RDMA_MTU=4K; MP_BULK_MIN_MSG_SIZE=64k; MP_RC_MAX_QP=8192; \ # PSALLOC=early; NODISCLAIM=true # # # Submit the job # #@ queue
Submission Script for an OpenMP Job
Here is an example of a LoadLeveler script which will submit a shared-memory job. Since the machine has fat nodes, with 32 cores each, and efficiently runs up to 64 tasks (with SMT), all within a shared-memory, NUMA-style architecture, it can be a very efficient platform for OpenMP.
Note that there are a few important options that help the code run better, such as enabling processor affinity and memory affinity. The processor affinity is taken care of by the "ccsm_launch" utility (essentially the same as another utility from IBM called "hybrid_launch").
#=============================================================================== # Specifies the name of the shell to use for the job # @ shell = /usr/bin/ksh # @ job_name = pi # @ job_type = parallel # @ class = verylong # @ environment = COPY_ALL; MEMORY_AFFINITY=MCM; MP_SYNC_QP=YES; \ # MP_RFIFO_SIZE=16777216; MP_SHM_ATTACH_THRESH=500000; \ # MP_EUIDEVELOP=min; MP_USE_BULK_XFER=yes; \ # MP_RDMA_MTU=4K; MP_BULK_MIN_MSG_SIZE=64k; MP_RC_MAX_QP=8192; \ # PSALLOC=early; NODISCLAIM=true # @ node = 1 # @ tasks_per_node = 1 # @ node_usage = not_shared # @ output = $(jobid).out # @ error = $(jobid).err # @ wall_clock_limit = 04:00:00 # @ queue export TARGET_CPU_RANGE=-1 cd $SCRATCH/SOMEDIRECTORY # # next variable is for performance, so that memory is allocated as # # close to the cpu running the task as possible (NUMA architecture) export MEMORY_AFFINITY=MCM # # next variable is for OpenMP export OMP_NUM_THREADS=64 # # next variable is for ccsm_launch export THRDS_PER_TASK=64 # # ccsm_launch is a "hybrid program launcher" for MPI-OpenMP programs poe ccsm_launch ./stream
Submission Script for a mixed-mode MPI/OpenMP Job
Here is an example of a LoadLeveler script which will submit a mixed MPI/shared-memory job. Since the machine has fat nodes, with 32 cores each, and efficiently runs up to 64 tasks (with SMT), all within a shared-memory, NUMA-style architecture, it can be a very efficient platform for OpenMP. However, not all codes scale well to large numbers of threads, and it is sometimes more efficient to have a "main" MPI program, where each MPI task is itself multi-threaded.
Note that there are a few important options that help the code run better, such as enabling processor affinity and memory affinity. The processor affinity is taken care of by the "ccsm_launch" utility (essentially the same as another utility from IBM called "hybrid_launch"). This same ccsm_launch utility allows one to launch hybrid programs, as well as MIMD programs.
The following script assumes that the user has prepared a file with the poe commands that will launch the individual MPI tasks. This file (poe.cmdfile) looks like:
ccsm_launch ./myprog ccsm_launch ./myprog . . . ccsm_launch ./myprog
and there are as many lines as there are MPI tasks to be run. The MPI tasks are assigned to nodes according to the #@task_geometry option in the LoadLeveler script. In the following example we assume that the job will start 32 MPI tasks, 8 per-node, and each of these will in turn be 8-way threaded, for a total of 64 threads per node, 256 total.
# Specifies the name of the shell to use for the job # @ shell = /usr/bin/ksh # @ job_name = hybrid_CPMD # @ job_type = parallel # @ class = verylong # @ environment = COPY_ALL; MEMORY_AFFINITY=MCM; MP_SYNC_QP=YES; \ # MP_RFIFO_SIZE=16777216; MP_SHM_ATTACH_THRESH=500000; \ # MP_EUIDEVELOP=min; MP_USE_BULK_XFER=yes; \ # MP_RDMA_MTU=4K; MP_BULK_MIN_MSG_SIZE=64k; MP_RC_MAX_QP=8192; \ # PSALLOC=early; NODISCLAIM=true ## @ node = 1 ## @ tasks_per_node = 1 # @ task_geometry = {(0,1,2,3,4,5,6,7)(8,9,10,11,12,13,14,15)(16,17,18,19,20,21,22,23)(24,25,26,27,28,29,30,31)} # @ node_usage = not_shared # @ output = $(jobid).out # @ error = $(jobid).err # @ wall_clock_limit = 04:00:00 # # @ requirements = (Machine == {"tcs-f11n03"}) #===================================== ## this is necessary in order to avoid core dumps for batch files ## which can cause the system to be overloaded # ulimits # @ core_limit = 0 #===================================== # @ queue hostname export TARGET_CPU_RANGE=-1 cd $SCRATCH/src/stream # # next variable is for performance, so that memory is allocated as # # close to the cpu running the task as possible (NUMA architecture) export MEMORY_AFFINITY=MCM # # next variable is for OpenMP export OMP_NUM_THREADS=8 # # next variable is for ccsm_launch # # note that there is one entry per MPI task, and each of these is then multithreaded # # (8-way in this example) export THRDS_PER_TASK=8:8:8:8:8:8:8:8:8:8:8:8:8:8:8:8:8:8:8:8:8:8:8:8:8:8:8:8:8:8:8:8 # # ccsm_launch is a "hybrid program launcher" for MPI-OpenMP programs # # poe reads from a commands file, where each MPI task is launched # # with ccsm_launch, which takes care of the processor affinity for the # # OpenMP threads. Each line in the poe.cmdfile reads something like: # # ccsm_launch ./myCPMD # # and there must be as many such lines as MPI tasks. The number of MPI # # tasks must match the task_geometry statement describing the process placement # # on the nodes. poe -cmdfile poe.cmdfile wait
One can monitor the progress of the job, and the load on the nodes, using the "llq1 -n" utility on the TCS development nodes. Please note that we recommend running 64 tasks per TCS node, but your code may work better with 32 tasks. It also depends on how much memory each task requires.
Advanced LoadLeveler Options
Steps
LoadLeveler has a lot of advanced features to control job submission and execution. One of these features is called steps. This feature allows a series of jobs to be submitted using one script with dependencies defined between the jobs. What this allows is for a series of jobs to be run sequentially, waiting for the previous job, called a step, to be finished before the next job is started. The following example uses the same LoadLeveler script as previously shown, however the #@ step_name and #@ dependency directives are used to rerun the same case three times in a row, waiting until each job is finished to start the next.
# # LoadLeveler submission script for SciNet TCS # #@ job_name = SOME-DECSCIPTIVE-NAME #@ initialdir = $SCRATCH/SOMEDIRECTORY #@ executable = athena #@ arguments = -i ./athinput # #@ tasks_per_node = 64 #@ node = 3 #@ wall_clock_limit= 12:00:00 #@ output = $(job_name).$(jobid).out #@ error = $(job_name).$(jobid).err # #@ notification = complete # # Don't change anything below here unless you know exactly # why you are changing it. # #@ job_type = parallel #@ class = verylong #@ rset = rset_mcm_affinity #@ mcm_affinity_options = mcm_distribute mcm_mem_req mcm_sni_none #@ cpus_per_core=2 #@ task_affinity=cpu(1) #@ environment = COPY_ALL; MEMORY_AFFINITY=MCM; MP_SYNC_QP=YES; \ # MP_RFIFO_SIZE=16777216; MP_SHM_ATTACH_THRESH=500000; \ # MP_EUIDEVELOP=min; MP_USE_BULK_XFER=yes; \ # MP_RDMA_MTU=4K; MP_BULK_MIN_MSG_SIZE=64k; MP_RC_MAX_QP=8192; \ # PSALLOC=early; NODISCLAIM=true # # Submit the 3 steps of the same job, with dependencies # defined so that step2 won't start until step1 is done # and step3 won't start until step2 is done. # # @ step_name = step1 # @ queue # # @ step_name = step2 # @ dependency = step1 == 0 # @ queue # # @ step_name = step3 # @ dependency = step2 == 0 # @ queue
When a job with steps is submitted all the steps will have the same jobid, just incremented .0, .1, .2, etc. The first job or step is queued and run normally while the other steps will be shown as "NQ" or not queued until the dependency is satisfied, which in this case is for the previous step to have finished.
Big Memory
There are 2 nodes in the TCS, tcs-f11n03 and tcs-f11n04, that have 256GB of memory and they can be asked for explicitly by adding the following line to your submission script.
# @ requirements = (Machine == "tcs-f11n03")