Using the TCS

About

The Tightly-coupled Capability System (TCS) is a cluster of IBM Power 6 nodes intended for jobs that scale well to at least 32 processes and require high bandwidth and large memory. It was installed at SciNet in late 2008 and is operating in "friendly-user" mode during winter 2009.

Node Names

  • node tcs-f02n01 is node #1 in frame/rack #2
  • the entire list of 104 nodes can be seen with llstatus, as shown below
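
For example, from a login node (a minimal sketch using the standard LoadLeveler status command; the node name below is just an illustration):

  llstatus                  # one line of status per node
  llstatus -l tcs-f02n01    # long listing for a single node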

Node Specs

There are 102 compute nodes each with:

  • 32 Power6 cores (4.7GHz each)
    • each core is 2-way multi-threaded using SMT (simultaneous multithreading)
  • 128GB of RAM (except for tcs-f11n03 and tcs-f11n04, which have 256GB each)
  • 4 InfiniBand (IB) interfaces used for data and message-passing traffic
  • 2 GigE interfaces used for management and GPFS token traffic

User Documentation

User Access

  • log in to 142.150.188.41 (node tcs-f11n05) to start using the TCS
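
For example, from your own machine (USERNAME is a placeholder for your SciNet user name):

  ssh USERNAME@142.150.188.41    # lands on login node tcs-f11n05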

Login Nodes

  • there are two interactive login nodes: tcs-f11n05 and tcs-f11n06
  • use the login nodes to submit and monitor jobs, edit files, compile code (see the compilation example below), etc.
  • small, short interactive test jobs may be run ONLY on tcs-f11n06
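
A minimal compilation sketch, assuming the IBM XL compilers and their MPI wrappers (mpcc_r for C, mpxlf90_r for Fortran 90) are available on the login nodes; the flags and source file names below are illustrative, not taken from this page:

  mpcc_r   -O3 -qarch=pwr6 -qtune=pwr6 -o pi pi.c      # MPI C program
  mpxlf90_r -O3 -qarch=pwr6 -qtune=pwr6 -o pi pi.f90   # MPI Fortran 90 program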

Submitting Jobs

LoadLeveler Batch Files

Here is a sample LoadLeveler job script. It requests 5 nodes with 32 MPI tasks per node, for a total of 160 MPI tasks. As with all jobs, it must run from the /scratch filesystem. The script also requests the InfiniBand network for MPI traffic; if the IB network is not requested explicitly, the system defaults to the GigE network, which is much slower.

#===============================================================================
# Specifies the name of the shell to use for the job 
# @ shell = /usr/bin/ksh
# @ job_type = parallel
# @ job_name = pi
# @ class = verylong
# @ node = 5
# @ tasks_per_node = 32
# @ output = $(jobid).out
# @ error = $(jobid).err
# @ wall_clock_limit = 04:00:00
#=====================================
## ulimits: set core_limit = 0 to avoid core dumps from batch jobs,
## which can overload the system
# @ core_limit = 0
#=====================================
## necessary to force use of the InfiniBand network for MPI traffic
# @ network.MPI = sn_all,not_shared,US,HIGH
#=====================================
# @ queue
 # change to the job's working directory on /scratch and run the executable
 cd /scratch/cloken/examples
 ./pi
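
Assuming the script above is saved under your /scratch space as, say, pi.ll (a placeholder file name), it is submitted with the standard LoadLeveler command:

  llsubmit pi.ll    # submit the job; LoadLeveler prints the assigned job ID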

Directories available to batch system

  • LoadLeveler jobs run from /scratch
  • LoadLeveler jobs can NOT access /home
  • users must copy any required executables, input files, etc. to their /scratch/ space before submitting a job (see the example below)
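
A minimal sketch of staging files before submission (the directory and file names are placeholders):

  mkdir -p /scratch/$USER/examples
  cp ~/examples/pi ~/examples/input.dat /scratch/$USER/examples/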

Monitoring Jobs

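Jobs can be monitored and managed with the standard LoadLeveler commands; a minimal sketch (the job ID below is a placeholder):

  llq                       # list all jobs in the queue
  llq -u $USER              # list only your own jobs
  llcancel tcs-f11n05.123   # cancel a job by its job ID
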
Filesystems

  • 10GB quota in your home directory; it is backed up to disk
  • your /home/ directory is NOT mounted on the compute nodes
  • LoadLeveler jobs run from /scratch
  • files in /scratch are NEVER backed up