GPC Quickstart

General Purpose Cluster (GPC)
  Installed:          June 2009
  Operating System:   Linux
  Interconnect:       1/4 on InfiniBand, rest on GigE
  RAM/Node:           16 GB
  Cores/Node:         8
  Login/Devel Node:   gpc-login01 (142.150.188.51)
  Vendor Compilers:   icc (C), icpc (C++), ifort (Fortran)
  Queue Submission:   Moab/Torque

The General Purpose Cluster is an extremely large cluster (ranked 16th in the world at its inception, and the fastest in Canada) and is where most simulations are to be done at SciNet. It is an IBM iDataPlex cluster based on Intel's Nehalem architecture (one of the first in the world to make use of these new chips). The GPC consists of 3,780 nodes with a total of 30,240 2.5GHz cores, with 16GB RAM per node (2GB per core). Approximately one quarter of the cluster is interconnected with non-blocking 4x-DDR InfiniBand, while the rest of the nodes are connected with gigabit ethernet.

Login

Currently you need to log into the TCS first (142.150.188.41) and then log into the GPC devel nodes listed below. This will change in the near future.

Compile/Devel Nodes

From a login node you can ssh to gpc-f101n001 and gpc-f101n002. These have the same hardware configuration as most of the compute nodes -- 8 Nehalem processing cores with 16GB RAM and gigabit ethernet. You can compile and test your codes on these nodes. To interactively test on more than 8 processors, or to test your code over an InfiniBand connection, you can submit an interactive job request (see Submitting an Interactive Job below).
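For example, a login sequence might look like the following sketch (USER is a placeholder for your SciNet user name):

ssh USER@142.150.188.41      # TCS login node (see Login above)
ssh gpc-f101n001             # from there, on to a GPC devel node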

Your home directory is in /home/USER; you have 10GB there that is backed up. This directory cannot be written to by the compute nodes! Thus, to run jobs, you'll use the /scratch/USER directory. There is a large amount of disk space there, but it is not backed up. It therefore makes sense to keep your code in /home, compile it there, and then run it in the /scratch directory.
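As a rough sketch of that workflow (the directory, makefile, and executable names here are hypothetical):

cd /home/USER/mycode              # keep and compile your source in /home (backed up)
make                              # or compile by hand; see the Compilers section below
mkdir -p /scratch/USER/run1       # run directories live in /scratch (not backed up)
cp ./myprog /scratch/USER/run1    # compute nodes can write to /scratch, not /home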

Environment Variables

A modules system is used to handle the environment variables associated with different compilers, MPI versions, libraries, etc. To see all the available options, type

module avail

To load a module

module load intel

To unload a module

module unload intel

To unload all modules

module purge


These commands should go in your .bashrc file and/or in your submission scripts to make sure you are using the correct packages.
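For instance, a typical sequence to set up a clean environment might look like this (an illustrative sketch; load whichever modules you actually need):

module purge                  # unload everything for a clean slate
module load intel openmpi     # compiler and MPI stack used in the examples below
module list                   # confirm what is currently loaded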

Compilers

The Intel compilers are icc/icpc/ifort for C/C++/Fortran. gcc/g++/gfortran v. 4.1.2 are also available, as are the somewhat newer 4.3 versions (gcc43/g++43/gfortran43). To ensure that the Intel compilers are in your PATH and their libraries are in your LD_LIBRARY_PATH, use the command

module load intel

This should likely go in your .bashrc file so that it will automatically be loaded.
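As a quick illustration (the source file names here are hypothetical):

module load intel
icc   -O2 -o hello_c   hello.c       # C
icpc  -O2 -o hello_cpp hello.cpp     # C++
ifort -O2 -o hello_f   hello.f90     # Fortran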

MPI

SciNet currently provides two sets of MPI libraries for the GPC: OpenMPI and MVAPICH2. Both sets of libraries will automatically work with both the InfiniBand and gigabit ethernet interconnects on the GPC system. We recommend OpenMPI as the default, as it quite reliably demonstrates good performance.

Both sets of libraries are compiled with the gnu compiler suite and the intel compiler suite. To use (for instance) the intel-compiled OpenMPI libraries, which we recommend as the default (and use for most of our examples here), use

module load openmpi intel

in your .bashrc. Other combinations behave similarly.

Both sets of MPI libraries provide mpicc/mpicxx/mpif90/mpif77 as wrappers around the appropriate compilers, which ensure that the correct include and library directories are used in the compilation and linking steps.
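For example, with the intel and openmpi modules loaded as above, building an MPI code might look like this (file names hypothetical):

mpicc  -O2 -o mympi_c mympi.c        # wraps icc, adding the MPI include and library paths
mpif90 -O2 -o mympi_f mympi.f90      # wraps ifort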

Submitting A Batch Job

The SciNet machines are shared systems, and jobs that are to run on them are submitted to a queue; the scheduler then orders the jobs so as to make the best use of the machine, and has them launched when resources become available. The intervention of the scheduler can mean that jobs aren't run in strict first-in, first-out order.

The maximum wallclock time for a job in the queue is 48 hours; computations that will take longer than this must be broken into 48-hour chunks and run as several jobs. The usual way to do this is with checkpoints, writing out the complete state of the computation every so often in such a way that a job can be restarted from this state information and continue on from where it left off. Generating checkpoints is a good idea anyway, as in the unlikely event of a hardware failure during your run, it allows you to restart without having lost much work.

If your job will run in fewer than 48 hours, specify that in your script -- your job will start sooner. (It's easier for the scheduler to fit in a short job than a long one.) On the downside, the job will be killed automatically by the queue manager software at the end of the specified wallclock time, so if you guess wrong you might lose some work. The standard procedure is therefore to estimate how long your job will take and add 10% or so.

You interact with the queueing system through the queue/resource managers, Moab and Torque. To see all the jobs in the queue, use

showq

To submit your own job, you must write a script which describes the job and how it is to be run (a sample script follows) and submit it to the queue, using the command

qsub SCRIPT-FILE-NAME

where you will replace SCRIPT-FILE-NAME with the file containing the submission script. This will return a job ID, for example 31415, which is used to identify the job. Information about a queued job can be found using

checkjob JOB-ID

and jobs can be canceled with the command

canceljob JOB-ID

Again, these commands have many options, which can be read about on their man pages.

Much more information on the queueing system is available on our queue page.

Batch Submission Script

A sample submission script is shown below for an MPI job using ethernet, with the #PBS directives at the top and the rest being what will be executed on the first node of the job.

#!/bin/bash
# MOAB/Torque submission script for SciNet GPC (ethernet)
#
#PBS -l nodes=2:ppn=8,walltime=1:00:00
#PBS -N test

# DIRECTORY TO RUN
cd /scratch/USER/SOMEDIRECTORY

# EXECUTION COMMAND 
mpirun -np 16 -hostfile $PBS_NODEFILE ./a.out

The script above requests two nodes, using 8 processors per node, for a wallclock time of one hour. (The resources required by the job are listed on the #PBS -l line.) Other options can be given in other #PBS lines, such as #PBS -N, which sets the name of the job. On the first of the two nodes, a shell is launched that changes directory to /scratch/USER/SOMEDIRECTORY and then uses the mpirun command to launch the job. It is assumed here that the user has a line like

module load openmpi intel

in their .bashrc.
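Putting it together, if the script above were saved as (say) ethernet_job.sh, submitting and monitoring it would look like:

qsub ethernet_job.sh      # returns a job ID, e.g. 31415
checkjob 31415            # query the queued/running job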


Submitting an Interactive Job

It is sometimes convenient to run a job interactively; this can be very handy, e.g., for debugging purposes. In this case, you type a qsub command which submits an interactive job to the queue; when the scheduler selects this job to run, it starts a shell on the first node of the job and connects it to your terminal. You can then type any series of commands (for instance, the same commands as listed in the batch submission script above) to run a job interactively.

For instance, to start the same sort of job as in the batch submission script above, but interactively, one would type

$ qsub -I -l nodes=2:ppn=8,walltime=1:00:00

This is exactly the #PBS -l line from the batch script above (which requests all 8 processors on each of 2 nodes for one hour), prepended with -I for `interactive'. When the job begins, your terminal will show you logged in to one of the compute nodes, and you can type any shell command, run mpirun, etc. When you exit the shell, the job ends.
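For instance, once the interactive job has started, one might type essentially the same commands as in the batch script above:

module load openmpi intel                        # if not already in your .bashrc
cd /scratch/USER/SOMEDIRECTORY
mpirun -np 16 -hostfile $PBS_NODEFILE ./a.out
exit                                             # leaving the shell ends the job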

Ethernet vs. Infiniband

About 1/4 of the GPC (864 nodes or 6912 cores) is connected with a high-bandwidth, low-latency fabric called InfiniBand. Jobs which require tight coupling to scale well greatly benefit from this interconnect; other types of jobs, which have relatively modest communication requirements, do not need it and run fine on gigabit ethernet.

Jobs which require InfiniBand for good performance can request nodes with the `ib' feature on the #PBS -l line:

#PBS -l nodes=2:ib:ppn=8,walltime=1:00:00

Because there are a limited number of these nodes, your job will likely start sooner if you do not request them (e.g., if you use the scripts as shown above), since that leaves a larger pool of nodes available to run your job. In fact, the InfiniBand nodes are to be used only for jobs that are known to scale well and will benefit from this type of interconnect. The MPI libraries provided by SciNet automatically use the correct interconnect, InfiniBand or ethernet, depending on which nodes your job runs on.
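A complete InfiniBand submission script is then just the ethernet example above with the ib feature added to the resource request; a sketch (again assuming module load openmpi intel in your .bashrc, with the run directory a placeholder):

#!/bin/bash
# MOAB/Torque submission script for SciNet GPC (InfiniBand)
#
#PBS -l nodes=2:ib:ppn=8,walltime=1:00:00
#PBS -N testib

# DIRECTORY TO RUN
cd /scratch/USER/SOMEDIRECTORY

# EXECUTION COMMAND
mpirun -np 16 -hostfile $PBS_NODEFILE ./a.out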