TCS Quickstart
Tightly Coupled System (TCS) | |
---|---|
Installed | Jan 2009 |
Operating System | AIX 5.3 |
Interconnect | Infiniband |
Ram/Node | 128 GB (256 on 2 nodes) |
Cores/Node | 32 |
Login/Devel Node | tcs-f11n05 (142.150.188.41) |
Vendor Compilers | xlc (C) xlf (fortran) xlC (C++) |
Queue Submission | LoadLeveler |
The TCS is a cluster of `fat' (high-memory, many-core) nodes on a very fast Infiniband connection. It has a relatively small number of cores (~3000) and so is dedicated for jobs that require such a large memory / low latency configuration.
Login
The login node is tcs-f11n05. This is the only node with a connection to the internet, so do not run any jobs on this machine or else you will make it difficult for others to log into use the cluster at all!
Compile/Devel node
To compile jobs and submit them to the queue, ssh to tcs-f11n06 and work from there. This machine has the same configuration as the compute nodes, so what works here should work for your run.
Your home directory is in /home/USER; you have 10Gb there that is backed up. This directory cannot be seen by the compute nodes! Thus, to run jobs, you'll use the /scratch/USER directory. Here, there is a large amount of disk space, but it is not backed up. Thus it makes sense to keep your code in /home, compile there, and then run them in the /scratch directory.
The IBM compilers are xlc/xlC/xlf for C/C++/Fortran. Documentation about them, including relevant compiler flags, can be found here. For OpenMP or other threaded applications, use the `reentrant-safe' versions of them, xlc_r, xlC_r, and xlf_r. For MPI applications, the scripts mpcc/mpCC/mpxlf are wrappers for the compilers which include the appropriate flags to use the IBM MPI libraries. Hybrid applications will need to use mpcc_r/mpCC_r/mpxlf_r.
To ensure that your code works before submitting it to the queue, you can interactively run short, small test jobs on tcs-f11n06.
Node configuration
Each node has 16 processors, each with 2 independent cores. Thus, to fully utilize the node, at least 32 tasks are required. Because there are so few of these nodes compared to the GPC, jobs must use at least 32 tasks or they will be killed auotmatically.
In fact, the Power6 series of processors has a facility called Simultaneous Multi Threading (SMT) which allows two tasks to be very efficiently bound to each core. Because most applications spend a lot of time waiting for data from memory, this allows the second task to use the processor while the first is stuck waiting for memory, and vice versa. This requires no changes to the code, only running 64 rather than 32 tasks on the node. Real applications run by early users suggest that this gives a speedup of 1.4x to 1.9x for no user work at all!
There are 102 nodes, with a naming convention such that tcs-fFFnNN is the NNth node in rack FF. They each have 128GB of memory, save two large-memory nodes (tcs-f11n03 and n04) which have 256Gb.
Each node has 4 InfiniBand (IB) interfaces used for your codes data and message-passing traffic, and 2 Gig-E interfaces used for management purposes.
Compilation
We strongly suggest the use of at least the following compiler flags
-q64 -O3 -qhot -qarch=pwr6 -qtune=pwr6
The -qarch=pwr6 and -qtune=pwr6 flags ensure that the new executable will take advantage of new features of the Power6 architecture. -q64 generates 64-bit code, which can then take advantage of the large amount of memory on the nodes. -O3 is a moderately aggressive level of optimzation, and -qhot enables `Higher Order Transformations' which include automatically using the ESSL and MASSV high-peformance mathematics libraries. As with any new platform, compiler and new set of compilation flags, you should ensure that you are still getting the same answers on some well understood test problem with this configuration.
On the link line we suggest using
-q64 -bdatapsize:64k -bstackpsize:64k
to again make use of the large memory, and to set page sizes to 64k (as vs the default 4k). This page size setting can make significant improvement for codes that manipulate large amounts of small pieces of data .
Submitting A Job
The SciNet machines are shared systems, and jobs that are to run on them are submitted to a queue; the scheduler then orders the jobs in order to make the best use of the machine, and has them launched when resources become availble. The intervention of the scheduler can mean that the jobs aren't quite run in a first-in first-out order.
The maximum wallclock time for a job in the queue is 48 hours; computations that will take longer than this must be broken into 48-hour chunks and run as several jobs. The usual way to do this is with checkpoints, writing out the complete state of the computation every so often in such a way that a job can be restarted from this state information and continue on from where it left off. Generating checkpoints is a good idea anyway, as in the unlikely event of a hardware failure during your run, it allows you to restart without having lost much work.
If your job should run in fewer than 48 hours, specify that in your script -- your job will start sooner. (It's easier for the scheduler to fit in a short job than a long job). On the downside, the job will be killed automatically by the queue manager software at the end of the specified wallclock time, so if you guess wrong you might loose some work. So the standard procedure is to estimate how long your job will take and add 10% or so.
Jobs submitted to the TCS must make extremely efficient use of the small number of very expensive TCS nodes; otherwise, jobs should be run on the GPC, where there are 10 times as many cores available. You must use at least 32 (and preferably 64) tasks per node, and keep those tasks quite busy, or your job will be killed.
You interact with the queuing system through the queue/resource manager, Loadleveler. (On the back end, the scheduler on the TCS is Torque, but you won't be directly interacting with it.) Most loadleveller commands begin with `ll'. Thus to see the state of the queue, you use the command
llq
There are many options for llq which can be seen on the man page (type `man llq'). llq shows the current jobs in the queue, including jobs currently running. The state of the job (eg, R, I, H, etc) indicates whether the job is Running, Idle (waiting to run), Held (either because it cannot run, or because the user is choosing to have it run later, perhaps after another job is completed). The number of jobs in the Idle state give you an idea of how many jobs are ahead of you in the queue.
More detailed information on jobs that are actively running is availalbe; SciNet staff have put together a script, jobState.sh, which we generally refer to as llq1, which can be accessed by editing in your .profile:
alias llq1='/xcat/tools/tcs-scripts/LL/jobState.sh'
Then typing llq1 -n gives quite detailed information about each job, on which nodes it is running, and what resources it is using. Of particular note is the column labeled `utilization', which gives an estimate of what percentage of the available CPU time on the nodes are being used by the job; if this number is low (say less than 70%) then the scarce TCS nodes aren't being well used and you will likely be asked to migrate your jobs to the GPC. Another measure of utilization, the amount of memory in use by the job, can be estimated by multiplying the `maximum RSS (MB)' field by the `number of tasks per node' field; if this amount of memory in use isn't a significant fraction of the 128GB on the node, then again it isn't clear if this is a job which should remain on the TCS.
To submit your own job, you must write a script which describes the job and how it is to be run (a sample script follows) and submit it to the queue, using the command
llsubmit SCRIPT-FILE-NAME
where you will replace SCRIPT-FILE-NAME with the file containing the submission script. This will return a job ID, for example tcs-f11n06.3404.0, which is used to identify the jobs. Jobs can be cancelled with the command
llcancel JOB-ID
or placed into the hold state with
llhold JOB-ID
Again, these commands have many options, which can be read about on their man pages.
Much more information on the queueing system is available on our queue page.
Submission Script for An MPI job
A template submission script follows for a typical MPI job. To use it, you need only change the job name, the directory you will run from, the executable file you'll run, and the arguments that need to be passed to it. (The example given below will change directory to /scratch/USER/SOMEDIRECTORY and mpirun `athena -i ./athinput'). You'll also have to specify the number of tasks per node you want (which should be 64 absent a very good reason otherwise), and how many nodes you require. You can also set where the output and error logs (the output of stdout and stderr) go, and the wall clock time you expect the job to take (maximum 48 hours). The `notification' lines ensure that email is sent (to user@example.com) upon job completion; notification can be set to `start' for job start, `error' for termination on an error, `complete' for job completion, `always' for all of the above, and `never' for never.
# # LoadLeveler submission script for SciNet TCS # #@ job_name = SOME-DECSCIPTIVE-NAME #@ initialdir = /scratch/USER/SOMEDIRECTORY #@ executable = athena #@ arguments = -i ./athinput # #@ tasks_per_node = 64 #@ node = 3 #@ wall_clock_limit= 12:00:00 #@ output = $(job_name).$(jobid).out #@ error = $(job_name).$(jobid).err # #@ notification = complete #@ notify_user = user@example.com # # Don't change anything below here unless you know exactly # why you are changing it. # #@ job_type = parallel #@ class = verylong #@ rset = rset_mcm_affinity #@ mcm_affinity_options = mcm_distribute mcm_mem_req mcm_sni_none #@ cpus_per_core=2 #@ task_affinity=cpu(1) #@ environment = COPY_ALL; MEMORY_AFFINITY=MCM; MP_SYNC_QP=YES; \ # MP_RFIFO_SIZE=16777216; MP_SHM_ATTACH_THRESH=500000; \ # MP_EUIDEVELOP=min; MP_USE_BULK_XFER=yes; \ # MP_RDMA_MTU=4K; MP_BULK_MIN_MSG_SIZE=64k; MP_RC_MAX_QP=8192; \ # PSALLOC=early; NODISCLAIM=true # # # Submit the job # #@ queue
Advanced LoadLeveler Options
Steps
LoadLeveler has a lot of advanced features to control job submission and execution. One of these features is called steps. This feature allows a series of jobs to be submitted using one script with dependencies defined between the jobs. What this allows is for a series of jobs to be run sequentially, waiting for the previous job, called a step, to be finished before the next job is started. The following example uses the same loadleveler script as previously shown, however the #@ step_name and #@ dependency directives are used to rerun the same case three times in a row, waiting until each job is finished to start the next.
# # LoadLeveler submission script for SciNet TCS # #@ job_name = SOME-DECSCIPTIVE-NAME #@ initialdir = /scratch/USER/SOMEDIRECTORY #@ executable = athena #@ arguments = -i ./athinput # #@ tasks_per_node = 64 #@ node = 3 #@ wall_clock_limit= 12:00:00 #@ output = $(job_name).$(jobid).out #@ error = $(job_name).$(jobid).err # #@ notification = complete #@ notify_user = user@example.com # # Don't change anything below here unless you know exactly # why you are changing it. # #@ job_type = parallel #@ class = verylong #@ rset = rset_mcm_affinity #@ mcm_affinity_options = mcm_distribute mcm_mem_req mcm_sni_none #@ cpus_per_core=2 #@ task_affinity=cpu(1) #@ environment = COPY_ALL; MEMORY_AFFINITY=MCM; MP_SYNC_QP=YES; \ # MP_RFIFO_SIZE=16777216; MP_SHM_ATTACH_THRESH=500000; \ # MP_EUIDEVELOP=min; MP_USE_BULK_XFER=yes; \ # MP_RDMA_MTU=4K; MP_BULK_MIN_MSG_SIZE=64k; MP_RC_MAX_QP=8192; \ # PSALLOC=early; NODISCLAIM=true # # Submit the 3 steps of the same job, with dependencies # defined so that step2 won't start until step1 is done # and step3 won't start until step2 is done. # # @ step_name = step1 # @ queue # # @ step_name = step2 # @ dependency = step1 == 0 # @ queue # # @ step_name = step3 # @ dependency = step2 == 0 # @ queue
When a job with steps is submitted all the steps will have the same jobid, just incremented .0, .1, .2, etc. The first job or step is queued and run normally while the other steps will be shown as "NQ" or not queued until the dependency is satisfied, which in this case is for the previous step to have finished.