Using the TCS
About
The Tightly-coupled Capability System (TCS) is a cluster of IBM Power6 nodes intended for jobs that scale well to at least 32 processes and that require high bandwidth and large memory. It was installed at SciNet in late 2008 and is operating in "friendly-user" mode during winter 2009.
Node Names
- node tcs-f02n01 is node # 1 in frame/rack #2
- the full list of all 104 nodes can be displayed with llstatus
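
The naming convention above can be unpacked mechanically. The following is a minimal shell sketch, assuming every node name follows the two-digit tcs-fFFnNN pattern shown in the example; the function name parse_tcs_node is hypothetical and not part of any TCS tooling.

```shell
# Split a TCS node name such as tcs-f02n01 into its frame and node numbers.
# Assumption: names always look like tcs-fFFnNN with two digits each.
parse_tcs_node() {
    name=$1
    case "$name" in
        tcs-f[0-9][0-9]n[0-9][0-9]) ;;
        *) echo "unrecognized node name: $name" >&2; return 1 ;;
    esac
    frame=${name#tcs-f}     # drop the "tcs-f" prefix  -> 02n01
    frame=${frame%n*}       # drop from "n" onwards    -> 02
    node=${name##*n}        # keep what follows "n"    -> 01
    # Strip one leading zero so the values read as plain decimal numbers
    echo "frame=${frame#0} node=${node#0}"
}

parse_tcs_node tcs-f02n01   # prints: frame=2 node=1
```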
Node Specs
There are 102 compute nodes, each with:
- 32 Power6 cores (4.7 GHz each)
- 2-way simultaneous multithreading (SMT) on each core
- 128GB of RAM (except for tcs-f11n03 and tcs-f11n04, which have 256GB each)
- 4 InfiniBand (IB) interfaces used for data and message-passing traffic
- 2 GigE interfaces used for management and GPFS token traffic
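
For a sense of the aggregate capacity these figures imply, the arithmetic can be sketched in shell. This assumes that the two 256GB nodes are counted among the 102 compute nodes, which the list above does not state explicitly; the variable names are illustrative only.

```shell
# Per-node figures taken from the list above
compute_nodes=102
cores_per_node=32
smt_threads_per_core=2
std_ram_gb=128
big_ram_gb=256
big_mem_nodes=2    # assumption: tcs-f11n03 and tcs-f11n04 are compute nodes

total_cores=$((compute_nodes * cores_per_node))
total_threads=$((total_cores * smt_threads_per_core))
total_ram_gb=$(( (compute_nodes - big_mem_nodes) * std_ram_gb + big_mem_nodes * big_ram_gb ))

echo "cores=$total_cores hw_threads=$total_threads ram_gb=$total_ram_gb"
# prints: cores=3264 hw_threads=6528 ram_gb=13312
```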
User Documentation
User Access
- log in to 142.150.188.41 (this is node tcs-f11n05) to start using the TCS
Login Nodes
- there are two interactive login nodes: tcs-f11n05 and tcs-f11n06
- use the login nodes to submit and monitor jobs, edit files, compile code, etc.
- small, interactive, short test jobs may be run ONLY on tcs-f11n06
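
Because interactive tests are allowed only on tcs-f11n06, a wrapper script can check where it is running before it starts anything. This is a minimal sketch, assuming that the hostname command on the login nodes reports the node name used in this document (possibly with a domain suffix); on_test_node is a hypothetical helper name.

```shell
# Succeed only when the given host (default: the current one) is the
# designated interactive test node.
on_test_node() {
    host=${1:-$(hostname)}
    # Compare the short name so a fully qualified name also matches
    [ "${host%%.*}" = "tcs-f11n06" ]
}

if on_test_node; then
    echo "OK to run a short interactive test here"
else
    echo "Not on tcs-f11n06: submit a batch job instead"
fi
```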
Submitting Jobs
LoadLeveler Batch Files
Here is a sample LoadLeveler job script. It requests 5 nodes with 32 MPI tasks per node, for a total of 160 MPI tasks. Like all jobs, it must run from the /scratch filesystem. The job also explicitly requests the InfiniBand network for MPI traffic; without that request, the system defaults to the GigE network, which is much slower.
#===============================================================================
# Specifies the name of the shell to use for the job
# @ shell = /usr/bin/ksh
# @ job_type = parallel
# @ job_name = pi
# @ class = verylong
# @ node = 5
# @ tasks_per_node = 32
# @ output = $(jobid).out
# @ error = $(jobid).err
# @ wall_clock_limit = 04:00:00
#=====================================
# this is necessary in order to avoid core dumps for batch files
# which can cause the system to be overloaded
# ulimits
# @ core_limit = 0
#=====================================
# necessary to force use of infiniband network for MPI traffic
# @ network.MPI = sn_all,not_shared,US,HIGH
#=====================================
# @ queue
cd /scratch/cloken/examples
pi
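
The total number of MPI tasks is the product of the node and tasks_per_node directives. As a quick sanity check before submitting, that product can be read back out of a job script. The sketch below assumes the directives are written as "# @ keyword = value", one per line; total_tasks is a hypothetical helper, not a LoadLeveler command.

```shell
# Print node * tasks_per_node for the LoadLeveler job script given as $1.
total_tasks() {
    awk -F= '
        /^#[ \t]*@[ \t]*node[ \t]*=/           { nodes = $2 + 0 }
        /^#[ \t]*@[ \t]*tasks_per_node[ \t]*=/ { per_node = $2 + 0 }
        END { print nodes * per_node }
    ' "$1"
}

# Demo on a throwaway file holding only the two relevant directives
job=$(mktemp)
printf '%s\n' '# @ node = 5' '# @ tasks_per_node = 32' > "$job"
total_tasks "$job"    # prints 160
rm -f "$job"
```

On the TCS itself, the job script would then be handed to the scheduler with LoadLeveler's llsubmit command and its progress checked with llq.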
Directories available to batch system
- LoadLeveler jobs run from /scratch
- LoadLeveler jobs can NOT access /home
- users must copy any required executables, input files, etc. to their /scratch/ space before submitting a job
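
Staging files by hand is easy to get wrong, so it may help to script the copy. The following is a minimal sketch of such a step; stage_to_scratch is a hypothetical helper, and the paths in the usage comment are placeholders that assume a per-user directory under /scratch.

```shell
# Copy each listed file into a run directory, stopping at the first problem.
# Usage: stage_to_scratch DEST_DIR FILE...
stage_to_scratch() {
    dest=$1
    shift
    mkdir -p "$dest" || return 1
    for f in "$@"; do
        if [ ! -e "$f" ]; then
            echo "missing: $f" >&2
            return 1
        fi
        cp -p "$f" "$dest/" || return 1
    done
}

# Hypothetical example, run on a login node before submitting the job:
#   stage_to_scratch "/scratch/$USER/run1" "$HOME/bin/pi" "$HOME/inputs/pi.in"
```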
Monitoring Jobs
Filesystems
- there is a 10GB quota on your home directory, which is backed up to disk
- your /home/ directory is NOT mounted on the compute nodes
- LoadLeveler jobs will run from /scratch
- files in /scratch are NEVER backed up
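
To keep an eye on the home-directory quota, usage can be compared against the limit with standard tools. This sketch assumes the 10GB quota means 10 * 2^30 bytes and measures usage with du, which may differ from how the filesystem itself accounts for quota; home_usage_report is a hypothetical helper.

```shell
# Report disk usage of a directory (default: $HOME) against a 10GB quota.
home_usage_report() {
    dir=${1:-$HOME}
    quota_kb=$((10 * 1024 * 1024))    # assumption: 10GB = 10 * 2^30 bytes
    used_kb=$(du -sk "$dir" | awk '{print $1}')
    echo "used_kb=$used_kb quota_kb=$quota_kb percent=$((used_kb * 100 / quota_kb))"
}

# Example, run on a login node (may take a while on a large directory):
#   home_usage_report
```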