GPC Quickstart

From oldwiki.scinet.utoronto.ca
General Purpose Cluster (GPC)
Installed: June 2009
Operating System: Linux CentOS 6.4
Number of Nodes: 3864 (30,912 cores)
Interconnect: 840 nodes 1:1 DDR, 3024 nodes 5:1 QDR
RAM/Node: 16 GB
Cores/Node: 8 (16 threads)
Login/Devel Node: gpc01..gpc08 (from login.scinet)
Vendor Compilers: icc (C), ifort (Fortran), icpc (C++)
Queue Submission: Moab/Torque

Specifications

The General Purpose Cluster is an extremely large cluster (ranked 16th in the world at its inception, and fastest in Canada) and is where most computations are to be done at SciNet. The GPC consists of 3,864 nodes, each with 8 Intel cores and 16 GB of memory or more. The nodes use a shared parallel file system and have no local disks. The compute nodes are accessed through a queuing system that allows jobs with a wall time of at least 15 minutes and at most 48 hours.

Technical details: The GPC is an IBM iDataPlex cluster based on Intel's Nehalem architecture (one of the first in the world to make use of the new chips). The GPC consists of 3,864 nodes (IBM iDataPlex DX360M2) with a total of 30,912 cores (Intel Xeon E5540) at 2.53GHz, with 16GB RAM per node (2GB per core). Approximately one quarter of the cluster is interconnected with non-blocking DDR InfiniBand, while the rest of the nodes are connected with 5:1 blocked QDR InfiniBand.

Login

First login via ssh with your SciNet account at login.scinet.utoronto.ca. This puts you on a 'login node', but not yet on the GPC. From the login node, you have to ssh to one of the 8 development nodes (gpc01, gpc02, ..., gpc08) to compile/test your code.

Compile/Devel Nodes

To get to one of the devel nodes from a SciNet login node, you can ssh to gpc01..gpc08. You may also just use the gpc command, which takes you directly to the devel node with the lowest CPU load (reassessed every 5 minutes).

Except for these eight development nodes on the GPC, all other nodes are 'compute nodes' that can be used only through the scheduler. The devel nodes have the same hardware configuration as most of the compute nodes, except with more memory: 8 processing cores with 36 GB RAM and QDR InfiniBand. You can compile and test your codes on these devel nodes. To interactively test on more than 8 processors, you can submit an interactive job request to get time-limited command-line access to a compute node.

Your home directory is in $HOME. You have 50 GB there that is backed up. In addition, you have a larger, non-backed-up space in $SCRATCH for temporary files. Note that your directories are encoded in the environment variables $HOME and $SCRATCH. Currently $HOME=/home/G/GROUP/USER and $SCRATCH=/scratch/G/GROUP/USER or /scratch2/G/GROUP/USER (where GROUP is your group's name, G is the first letter of that group's name, and USER is your user name). But the locations can change as the storage system evolves and grows with demand, so please use the environment variables!

The GPC devel and compute nodes do not have local disks. Instead, $HOME and $SCRATCH are shared parallel file systems (the file system is called GPFS), which means that your files are seen on all the nodes.

Your home directory cannot be written to by the compute nodes! Thus, to run jobs, you'll use the $SCRATCH directory (currently /scratch/G/GROUP/USER but again, use the environment variable). Here, there is a large amount of disk space, but it is not backed up. Thus it makes sense to keep your codes in $HOME, compile there, and then run them in the $SCRATCH directory.
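The workflow above (build under $HOME, run under $SCRATCH) might be sketched as follows. This is only an illustration: the run name run001 and the path $HOME/mycode/a.out are hypothetical, and the mktemp fallback exists only so the snippet can be tried on a machine where $SCRATCH is not defined.

```shell
# On the GPC, $SCRATCH is already set; the fallback is an assumption
# that lets this sketch run on other machines.
SCRATCH="${SCRATCH:-$(mktemp -d)}"
RUN_NAME="run001"                      # hypothetical run name
RUN_DIR="$SCRATCH/$RUN_NAME"

# Create a run directory on the writable scratch file system.
mkdir -p "$RUN_DIR"

# Copy the executable built under $HOME, if it exists (hypothetical path).
if [ -x "$HOME/mycode/a.out" ]; then
    cp "$HOME/mycode/a.out" "$RUN_DIR/"
fi

# Jobs should be submitted and run from here, not from $HOME.
cd "$RUN_DIR" && echo "Running from $RUN_DIR"
```

Using the environment variables rather than hard-coded paths keeps the snippet valid if the storage locations change.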

Modules and Environment Variables

To use most packages on the SciNet machines, including any of the compilers, you will have to use the module command. The command module load some-package will set your environment variables (PATH, LD_LIBRARY_PATH, etc.) to include the default version of that package. The command module load some-package/specific-version will load a specific version of that package. This makes it very easy for different users to use different versions of compilers, MPI libraries, other libraries, etc.

Note that to use even the gcc compilers you will have to do

$ module load gcc

but in fact you should probably use the Intel compilers installed on this system, as they often produce faster executables (occasionally much faster).

A list of the installed software and more information on using the module command is available in Software & Libraries. Or, when logged in to the gpc, you can see all available modules on the system by typing

$ module avail

To load a module (for example, the default version of the intel compilers)

$ module load intel

To unload a module

$ module unload intel

To unload all modules

$ module purge

To list all loaded modules

$ module list

These commands should go in your submission scripts to make sure you are using the correct packages. It is possible to load them in your .bashrc files as well, but this is generally not recommended (see Important .bashrc guidelines), especially if you routinely have to flip back and forth between modules.

Note that a module load command only sets the environment variables in your current shell (and any subprocesses that the shell launches). It does not affect other shell environments; in particular, a queued job that is running is unaffected by you interactively loading a module, and conversely you loading a module at the prompt and then submitting a job does not ensure that the module is loaded when the job runs. To ensure that a module is loaded when a job runs, be sure to put your module load command in your job submission script.

Again more information on modules, and how to resolve dependencies between modules can be found on the Software & Libraries page.

Compilers

The Intel compilers are icc/icpc/ifort for C/C++/Fortran, and are available with the default module "intel". The Intel compilers are recommended over the GNU compilers. Documentation about icpc is available at http://software.intel.com/en-us/articles/intel-software-technical-documentation/. The Intel compilers accept many of the options that the GNU compilers accept, but tend to produce faster programs on our system. If, for some reason, you really need the GNU compilers, the latest version of the GNU compiler collection (currently 4.4.0) is available by loading the "gcc" module, with gcc/g++/gfortran for C/C++/Fortran. Coarray Fortran is supported by the Intel compilers from version 12 upwards and by the GNU Fortran compiler version 5.2.0. Note that the f77/g77 compilers are not supported, but the available Fortran compilers are able to compile Fortran 77.

To ensure that the intel compilers are in your PATH and their libraries are in your LD_LIBRARY_PATH, use the command

$ module load intel/15.0.2

Optimize your code for the GPC machine using at least the following compiler flags:

   -O3 -xHost

(or -O3 -march=native for the GNU compilers).

  • If your program uses OpenMP, add -fopenmp for GNU compilers.
  • If you get the warning feupdateenv is not implemented, add -limf to the link line.
  • If you need to link in the MKL libraries, you are well advised to use the Intel(R) Math Kernel Library Link Line Advisor: http://software.intel.com/en-us/articles/intel-mkl-link-line-advisor/ for help in devising the list of libraries to link with your code. Note that this gives the link line for the command prompt. When using this in Makefiles, replace $MKLPATH by ${MKLPATH}.
  • More questions about compiling? See the FAQ.
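As a small illustration of the flags recommended above, the following sketch assembles a compile command line for either compiler family. The function name opt_flags and the file names mycode.c and mycode are hypothetical; the snippet only prints the command, so it can be tried without the compilers installed.

```shell
# Print the optimization flags recommended above for a given compiler.
opt_flags() {
    case "$1" in
        icc|icpc|ifort)    echo "-O3 -xHost" ;;          # Intel compilers
        gcc|g++|gfortran)  echo "-O3 -march=native" ;;   # GNU compilers
        *) echo "unknown compiler: $1" >&2; return 1 ;;
    esac
}

COMPILER=icc        # or gcc, after "module load gcc"
SRC=mycode.c        # hypothetical source file

# Show the resulting compile line.
echo "$COMPILER $(opt_flags "$COMPILER") -o mycode $SRC"
```

On a devel node, with the matching compiler module loaded, the printed line can be run as is.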

Debuggers

  • ddt - Allinea's graphical parallel debugger, in the ddt module. Highly recommended!
  • gdb - The GNU Debugger, available in the gdb module.
  • idbc/idb - The intel debuggers, part of the intel module(s).
  • ddd - A graphical front end to the GNU Debugger, available in the ddd module.


Note that to debug code, you have to give the -g flag to the compiler. The Intel compiler needs the additional option -debug parallel to debug threaded/OpenMP code.

MPI

SciNet currently provides multiple MPI libraries for the GPC: OpenMPI and IntelMPI. We currently recommend OpenMPI as the default, as it quite reliably demonstrates good performance on the InfiniBand network (and did so too on the ethernet network). For full details and options see the complete MPI section.

The MPI libraries are compiled with both the gnu compiler suite and the intel compiler suite. To use (for instance) the intel-compiled OpenMPI libraries, which we recommend as the default (and use for most of our examples here), use

$ module load intel/15.0.2 openmpi/intel/1.6.4

in your job submission scripts and on the command-line before compiling. Putting these in your .bashrc is no longer recommended (since Oct 10, 2013).

Other combinations behave similarly.

The MPI libraries provide mpicc/mpicxx/mpif90/mpif77 as wrappers around the appropriate compilers, which ensure that the appropriate include and library directories are used in the compilation and linking steps.

We currently recommend the Intel + OpenMPI combination. However, if you require the GNU compilers as well as MPI, you will want to find the most recent openmpi module available with `gcc' in the version name. This will enable development and runtime with gcc/g++/gfortran and OpenMPI.

For mixed OpenMP/MPI code using Intel MPI, add the compilation flag -mt_mpi for full thread-safety (no such flag is necessary for OpenMPI).

Submitting A Batch Job

The SciNet machines are shared systems, and jobs that are to run on them are submitted to a queue; the scheduler then orders the jobs to make the best use of the machine, and launches them when resources become available. The intervention of the scheduler can mean that the jobs aren't quite run in first-in, first-out order. The scheduler used on the GPC is called Moab (with Torque as the 'resource manager').

The scheduler will have the job run on one or more of the compute nodes of the GPC. There are a few important differences between the devel nodes and the compute nodes:

  1. As stated above already, on compute nodes, your home directory is read-only. You have to run your jobs from the $SCRATCH directory instead. See Data Management for more details on the file systems at SciNet.
  2. The available memory on compute nodes is approximately 14GB (16GB - 2GB for the operating system). The devel nodes have 36GB, but this is shared with all the users of the node.
  3. Some libraries, especially those for graphics, are not installed on the compute nodes. This leaves more memory available for your job, but if you have an application that requires such libraries (notably R and octave), you will need to "module load extras" in your job script to make them work.

The maximum wallclock time for a job in the queue is 48 hours; computations that will take longer than this must be broken into 48-hour chunks and run as several jobs. The usual way to do this is with checkpoints: writing out the complete state of the computation every so often, in such a way that a job can be restarted from this state information and continue on from where it left off. Generating checkpoints is a good idea anyway, as in the unlikely event of a hardware failure during your run, it allows you to restart without having lost much work. A minimum job length of 15 minutes is also enforced, so shorter jobs should be batched together.
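To get a feel for how many separate jobs a long computation turns into under the 48-hour limit, a rough calculation like the one below may help. The function name jobs_needed is hypothetical, and the sketch assumes that the total run time is known in whole hours and that the checkpoints let each job use its full 48 hours.

```shell
MAX_WALLTIME_HOURS=48    # maximum wallclock time per job on the GPC

# Number of jobs needed for a computation of a given total length,
# i.e. the total number of hours divided by 48, rounded up.
jobs_needed() {
    total_hours=$1
    echo $(( (total_hours + MAX_WALLTIME_HOURS - 1) / MAX_WALLTIME_HOURS ))
}

jobs_needed 100    # a 100-hour computation needs 3 jobs
```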

There are limits to how many jobs you can submit. If your group has a default account, up to 32 nodes at a time for 48 hours per job on the GPC cluster are allowed to be queued. This is a total limit, e.g., you could request 64 nodes for 24 hours. Jobs of users with an LRAC or NRAC allocation will run at a higher priority than others while their resources last. Because of the group-based allocation, it is conceivable that your jobs won't run if your colleagues have already exhausted your group's limits.
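One way to read the default limit above is as a budget of 32 x 48 = 1536 node-hours, which is why 64 nodes for 24 hours is also allowed. The check below is a sketch under that assumption only; the function name fits_default_limit is hypothetical, and the scheduler's actual policy may include further rules.

```shell
NODE_HOUR_BUDGET=$(( 32 * 48 ))    # 1536 node-hours, per the reading above

# Succeeds if a request stays within the 48-hour wall time and the budget.
fits_default_limit() {
    nodes=$1
    hours=$2
    [ "$hours" -le 48 ] && [ $(( nodes * hours )) -le "$NODE_HOUR_BUDGET" ]
}

fits_default_limit 64 24 && echo "64 nodes x 24 h: within the limit"
fits_default_limit 64 48 || echo "64 nodes x 48 h: over the limit"
```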

Note that scheduling big jobs greatly affects the queue and other users, so you have to talk to us first to run massively parallel jobs (> 2048 cores). We will help make sure that your jobs start and run efficiently.

If your job should run in fewer than 48 hours, specify that in your script -- your job will start sooner. (It's easier for the scheduler to fit in a short job than a long job). On the downside, the job will be killed automatically by the queue manager software at the end of the specified wallclock time, so if you guess wrong you might lose some work. So the standard procedure is to estimate how long your job will take and add 10% or so.
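The "estimate plus 10%" rule can be turned into a walltime string for the #PBS -l line with a few lines of shell arithmetic. This is only a sketch: the function name padded_walltime is hypothetical, and it assumes the estimate is given in whole minutes.

```shell
# Convert an estimated run time in minutes to H:MM:SS with 10% added.
padded_walltime() {
    est_minutes=$1
    padded=$(( (est_minutes * 110 + 99) / 100 ))    # add 10%, rounding up
    printf '%d:%02d:00\n' $(( padded / 60 )) $(( padded % 60 ))
}

padded_walltime 600    # an estimate of 10 hours gives 11:00:00
```

The result must still be at most 48:00:00 to be accepted by the queue.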

You interact with the queuing system through the queue/resource manager, Moab and Torque. To see all the jobs in the queue use

$ showq

To submit your own job, you must write a script which describes the job and how it is to be run (a sample script follows) and submit it to the queue, using the command

$ qsub [SCRIPT-FILE-NAME]

where you will replace [SCRIPT-FILE-NAME] with the file containing the submission script. This will return a job ID, for example 31415, which is used to identify the job. Information about a queued job can be found using

$ qstat [JOB-ID]

and jobs can be canceled with the command

$ canceljob [JOB-ID]

Again, these commands have many options, which can be read about on their man pages.

Much more information on the queueing system is available on our queue page.

Batch Submission Script: MPI

A sample submission script is shown below for an mpi job with the #PBS directives at the top and the rest being what will be executed on the compute node.

<source lang="bash">
#!/bin/bash
# MOAB/Torque submission script for SciNet GPC
#PBS -l nodes=2:ppn=8,walltime=1:00:00
#PBS -N test

# load modules (must match modules used for compilation)
module load intel/15.0.2 openmpi/intel/1.6.4

# DIRECTORY TO RUN - $PBS_O_WORKDIR is directory job was submitted from
cd $PBS_O_WORKDIR

# EXECUTION COMMAND; -np = nodes*ppn
mpirun -np 16 ./a.out
</source>

The lines that begin #PBS are commands that are parsed and interpreted by qsub at submission time, and control administrative things about your job. In this example, the script above requests two nodes, using 8 processors per node, for a wallclock time of one hour. (The resources required by the job are listed on the #PBS -l line.) Other options can be given in other #PBS lines, such as #PBS -N, which sets the name of the job.

The rest of the script is run as a bash script at run time. A bash shell on the first node of the two nodes that are requested executes these commands as a normal bash script, just as if you had run this as a shell script from the terminal. The only difference is that PBS sets certain environment variables that you can use in the script. $PBS_O_WORKDIR is set to be the directory that the command was 'submitted' from - eg, $SCRATCH/SOMEDIRECTORY. The script then uses the mpirun command to launch the job.
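Because -np must equal nodes*ppn, it can be convenient to generate the submission script rather than edit the numbers by hand. The helper below is a sketch, not a SciNet tool: the function name make_mpi_job is hypothetical, and it assumes full 8-core nodes and the module versions used in the sample script.

```shell
# Print an MPI submission script for a given node count and wall time.
make_mpi_job() {
    nodes=$1
    walltime=$2
    ppn=8                        # cores per GPC node
    np=$(( nodes * ppn ))        # total number of MPI processes
    cat <<EOF
#!/bin/bash
#PBS -l nodes=${nodes}:ppn=${ppn},walltime=${walltime}
#PBS -N test
module load intel/15.0.2 openmpi/intel/1.6.4
cd \$PBS_O_WORKDIR
mpirun -np ${np} ./a.out
EOF
}

make_mpi_job 2 1:00:00
```

Redirecting the output to a file (for instance make_mpi_job 4 12:00:00 > job.sh) gives a script that can be passed to qsub.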


Submitting Collections of Serial Jobs

You cannot run purely serial jobs on the GPC (or any of SciNet's systems), as this would mean only one core out of 8 is used. If you have serial jobs, you have to bunch them together. SciNet-approved methods for running collections of serial jobs can be found on the serial run wiki page.
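The simplest way of bunching serial work onto one 8-core node is to start eight tasks in the background and wait for all of them. The sketch below illustrates only the shell mechanics: serial_task is a hypothetical stand-in for a real serial program, the directory name serial_batch is made up, and the serial run wiki page remains the reference for the approved methods.

```shell
# On the GPC, $SCRATCH is set; the fallback is for trying this elsewhere.
WORK_DIR="${SCRATCH:-$(mktemp -d)}/serial_batch"
mkdir -p "$WORK_DIR"

# Stand-in for a real serial program; replace with your own executable.
serial_task() {
    echo "task $1 done" > "$WORK_DIR/out_$1.txt"
}

# Start one task per core; the ampersand puts each in the background.
for i in 1 2 3 4 5 6 7 8; do
    serial_task "$i" &
done

# Without this, the script (and the job) would end before the tasks do.
wait
ls "$WORK_DIR"
```

This works well only when the eight tasks take roughly the same time; otherwise cores sit idle while the longest task finishes.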

Batch Submission Script: OpenMP

For running OpenMP jobs, the procedure is similar to that for MPI jobs:

<source lang="bash">
#!/bin/bash
# MOAB/Torque submission script for SciNet GPC (OpenMP)
#PBS -l nodes=1:ppn=8,walltime=1:00:00
#PBS -N test

# load modules (must match modules used for compilation)
module load intel/15.0.2

# DIRECTORY TO RUN - $PBS_O_WORKDIR is directory job was submitted from
cd $PBS_O_WORKDIR

export OMP_NUM_THREADS=8
./a.out
</source>

Note that in some circumstances it can be more efficient to run (say) two jobs each running on four threads than one job running on eight threads. In that case you can use the same `ampersand-and-wait' technique outlined for serial jobs (see serial run wiki page) for less-than-eight-core OpenMP jobs.

Hybrid MPI/OpenMP jobs

Using Intel MPI

Here is how to run hybrid codes using IntelMPI:

http://software.intel.com/en-us/articles/hybrid-applications-intelmpi-openmp/

Make sure you compile with the -mt_mpi option to the compilers to use the thread safe libraries. Set the environment variable I_MPI_PIN_DOMAIN:

$ export I_MPI_PIN_DOMAIN=omp

This will set the process pinning domain size to be equal to OMP_NUM_THREADS (which you should set to the desired number of threads per mpi process). Therefore, each MPI process can create $OMP_NUM_THREADS number of children threads for running within the corresponding domain. If OMP_NUM_THREADS is not set, each node is treated as a separate domain (which will allow as many threads per MPI processes as there are cores).

In addition, when invoking mpirun, you should add the argument "-ppn X", where X is the number of MPI processes per node. For example:

$ mpirun -ppn 2 -np 8 [executable]

would start 2 mpi processes of [executable] per node for a total of 8 processes, so mpirun will try to run mpi processes on 4 nodes (OMP_NUM_THREADS is then probably best set at 4). Your job script should still ask for these 4 nodes with the line <source lang="bash">

    #PBS -l nodes=4:ppn=8,walltime=....

</source> (ppn=8 is not a mistake here; the ppn parameter has a different meaning for PBS and for mpirun)

The ppn parameter to mpirun is very important! Without it, all eight MPI processes would get bunched on the first node in this example, leaving 3 nodes unused.
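The relation between -np, -ppn, the number of nodes to request, and OMP_NUM_THREADS is simple arithmetic, sketched below. The function name hybrid_layout is hypothetical, and the calculation assumes that every node runs the same number of MPI processes and that all 8 physical cores are shared evenly among them.

```shell
CORES_PER_NODE=8

# Given the total number of MPI processes and the processes per node,
# print the node count to request and the threads per process.
hybrid_layout() {
    np=$1
    ppn=$2
    nodes=$(( (np + ppn - 1) / ppn ))        # round up to whole nodes
    threads=$(( CORES_PER_NODE / ppn ))      # cores available per process
    echo "nodes=$nodes OMP_NUM_THREADS=$threads"
}

hybrid_layout 8 2    # prints: nodes=4 OMP_NUM_THREADS=4
```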

NOTE: In order to pin OpenMP threads inside the domain, use the corresponding OpenMP feature by setting the KMP_AFFINITY environment variable, see Compiler User and Reference Guide.

The IntelMPI manual is referenced on the front page of our wiki:

http://software.intel.com/sites/products/documentation/hpc/mpi/linux/reference_manual.pdf

For the above example of a total of 8 processes on 4 nodes, you could use the following script:
<source lang="bash">
#!/bin/bash
# MOAB/Torque submission script for SciNet GPC (hybrid job)
#PBS -l nodes=4:ppn=8,walltime=1:00:00
#PBS -N test

# load modules (must match modules used for compilation)
module load intel/15.0.2 intelmpi/4.1.2.040

# DIRECTORY TO RUN - $PBS_O_WORKDIR is directory job was submitted from
cd $PBS_O_WORKDIR

# SET THE NUMBER OF THREADS PER PROCESS:
export OMP_NUM_THREADS=4

# PIN THE MPI DOMAINS ACCORDING TO OMP
export I_MPI_PIN_DOMAIN=omp

# EXECUTION COMMAND; -np = nodes * (MPI processes per node)
mpirun -ppn 2 -np 8 ./a.out
</source>

Using Open MPI

For mixed MPI/OpenMP jobs using OpenMPI, which is the default for many users, the procedure is similar, but details differ.

  • Request the number of nodes in the PBS script.
  • Set OMP_NUM_THREADS to the number of threads per MPI process.
  • In addition to the -np parameter for mpirun, add the argument --bynode, so that the mpi processes are not bunched up.

So for example, to start a total of 8 processes on 4 nodes, you could use the following script:
<source lang="bash">
#!/bin/bash
# MOAB/Torque submission script for SciNet GPC (hybrid job)
#PBS -l nodes=4:ppn=8,walltime=1:00:00
#PBS -N test

# load modules (must match modules used for compilation)
module load intel/15.0.2 openmpi/intel/1.6.4

# DIRECTORY TO RUN - $PBS_O_WORKDIR is directory job was submitted from
cd $PBS_O_WORKDIR

# SET THE NUMBER OF THREADS PER PROCESS:
export OMP_NUM_THREADS=4

# EXECUTION COMMAND; -np = nodes*processes_per_node; --bynode forces a round robin of nodes.
mpirun -np 8 --bynode ./a.out
</source>


Automated Email Notifications from Jobs

By default, you get an email if your job fails (aborts) when run through the scheduler. You can set up email notifications for when the job begins or ends as well. For this, you use the -m submission option, followed by any combination of a (to get emails when the job fails), b (to get emails when the job starts), and e (to get emails when the job ends), or by n (for no emails).

For instance, to get email in all three cases, you could have a job script like this:
<source lang="bash">
#!/bin/bash
# MOAB/Torque submission script for SciNet GPC
#PBS -l nodes=2:ppn=8,walltime=1:00:00
#PBS -N test
#PBS -m abe

# load modules (must match modules used for compilation)
module load intel openmpi

# DIRECTORY TO RUN - $PBS_O_WORKDIR is directory job was submitted from
cd $PBS_O_WORKDIR

# EXECUTION COMMAND; -np = nodes*ppn
mpirun -np 16 ./a.out
</source>

Notes:

  • Getting email for every job is not necessarily a good idea, as you may get a lot of email. This is why the default is to send an email only when the job fails.
  • The emails are sent to the address that you gave when you created your account. You cannot have the emails sent to another address: although there is a -M scheduler option that would allow you to specify an email address, this option is disabled on the GPC for security reasons.


Submitting an Interactive (Debug) Job

Your development work flow may require a lot of small test runs. You are allowed to do these on the development nodes, as long as it's very brief (a few minutes), and does not use all cores on the machine. For anything more you will have to use the compute nodes. (In a hurry? Check out the debugjob command below!)

It is sometimes convenient to run a job interactively; this can be very handy for debugging purposes. In this case, you type a qsub command which submits an interactive job to the queue; when the scheduler selects this job to run, then it starts a shell running on the first node of the job, which connects to your terminal. You can then type any series of commands (for instance, the same commands listed as in the batch submission script above) to run a job interactively.

For example, to start the same sort of job as in the batch submission script above, but interactively, one would type

$ qsub -I -l nodes=2:ppn=8,walltime=1:00:00

If your interactive job requires graphics, add "-X" to this command.

Note that this is exactly the #PBS -l line in the batch script above (which requests all 8 processors on each of 2 nodes for one hour), but prepended with a -I for `interactive'. When this job begins, your terminal will now show you as being logged in to one of the compute nodes, and one can type in any shell command, run mpirun, etc. When you exit the shell, the job will end.

Interactive jobs can be used with any of the GPC queues; however, there is a short, high-turnover queue called debug which can be especially useful when the system is busy:

$ qsub -I -l nodes=2:ppn=8,walltime=1:00:00 -q debug

Because the combination of running interactively and using the high-turnover queue is very common, SciNet has the specialized command

$ debugjob [NUMBEROFNODES]

where NUMBEROFNODES is an optional argument that defaults to one. This command requests NUMBEROFNODES compute nodes for a limited time between 30 minutes and 2 hours (less time the more nodes you take), and gives you a shell on the (first) compute node.


  • More questions about running test jobs? See the FAQ.


QDR vs. DDR Infiniband

The GPC InfiniBand network has two sections: one connecting 3,024 nodes with 5:1 blocking (oversubscribed) QDR, and a second connecting 840 nodes with 1:1 non-blocking DDR InfiniBand. By default a user's job will go to whichever network section best accommodates it, typically smaller jobs to the QDR and larger jobs to the DDR. However, a user can override this by simply adding the flag "ddr" or "qdr" to the job resource request.

For example, to request two nodes anywhere on the GPC (QDR or DDR), use

#PBS -l nodes=2:ppn=8,walltime=1:00:00

in your job submission script.

For two nodes using DDR, use

#PBS -l nodes=2:ddr:ppn=8,walltime=1:00:00

To get two nodes using QDR, instead, you would use

#PBS -l nodes=2:qdr:ppn=8,walltime=1:00:00

The queueing system also tries its best to keep jobs within the same switch of the QDR, thus avoiding the 5:1 blocking. A user can explicitly request this behaviour if their job uses fewer than 30 nodes, using

#PBS -W x=nodesetisoptional:false
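The three forms of the resource request for the network sections differ only in the optional ddr or qdr flag, which the following sketch makes explicit. The function name pbs_resources is hypothetical, and the sketch assumes full 8-core nodes.

```shell
# Build the argument of "#PBS -l" for a node count, a wall time, and an
# optional network flag (ddr or qdr; empty means either section).
pbs_resources() {
    nodes=$1
    walltime=$2
    fabric=${3:-}
    case "$fabric" in
        "")       echo "nodes=${nodes}:ppn=8,walltime=${walltime}" ;;
        ddr|qdr)  echo "nodes=${nodes}:${fabric}:ppn=8,walltime=${walltime}" ;;
        *)        echo "network flag must be ddr, qdr, or empty" >&2; return 1 ;;
    esac
}

echo "#PBS -l $(pbs_resources 2 1:00:00 ddr)"
```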

HyperThreading

Each GPC compute node has 8 Nehalem cores (2 sockets, each with a four-core Intel Xeon E5540 @ 2.53GHz). Thus, to make full use of the computing power of a GPC node, you must be running at least 8 "tasks" -- MPI processes, or OpenMP threads.

Under most circumstances, running exactly 8 tasks is the most efficient way to use these nodes. However, sometimes software design (eg, having one thread for communication and one for computation) can usefully `oversubscribe' the number of physical cores, and running (say) twice as many tasks as cores can be a useful strategy. If your code is highly memory-bandwidth bound, having one task ready to run while another waits for memory access can make more effective use of the processor.

The Nehalem processors have hardware support for such two-way overloading of processors, through "HyperThreading"; there is an extra set of registers on each core to facilitate rapid switching between two tasks, making it look to the operating system as if there are in fact 16 cores per node. Depending on the nature of your code, making use of these virtual extra cores may speed up or slow down your computation; you should run small test cases before running production jobs in this manner. In most cases, the speed difference will be under 10%. Some of our users have obtained an 8% speedup by running gromacs with 16 tasks instead of 8 on a single node (mpirun -np 16 ./gromacs/mdrun -npme 4 is 108% the speed of mpirun -np 8 ./gromacs/mdrun with -npme 2 or -1).

HyperThreading with OpenMP

To use hyperthreading with an OpenMP job, one just runs twice as many threads as one would have previously; eg, if you were running 8 threads before (export OMP_NUM_THREADS=8) you would run with 16 (export OMP_NUM_THREADS=16). Everything else remains the same, including the job submission script; one still uses ppn=8 in the submission of the job, as Torque has no way of knowing (or reason for caring) that you will be running on 16 `virtual' cores rather than 8 physical cores.

HyperThreading with MPI

To use hyperthreading with an MPI job, one just runs twice as many MPI processes as one would have previously; eg, if you were running on three nodes using 8 MPI tasks per node and used mpirun ... -np 24, you could run instead with -np 48. Everything else remains the same, including the job submission script; one still uses ppn=8 in the submission of the job, as Torque has no way of knowing (or reason for caring) that you will be running on 16 `virtual' cores rather than 8 physical cores.

Note that if you are using OpenMPI (as is the default), there is another consideration; OpenMPI assumes that there is no oversubscription and each task very aggressively makes full use of a core when it is waiting for a message (eg, the waits are "busywaits"). If you find a significant slowdown when running multiple MPI tasks per core with OpenMPI, you may want to try adding the additional option to mpirun: --mca mpi_yield_when_idle 1. This will increase the latency of individual messages, but free up the core to do additional work while waiting.

With IntelMPI, the problem should be less pronounced, but you can still improve things by using mpirun -genv I_MPI_SPIN_COUNT 1 ...
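For the OpenMPI case, the doubled process count and the yield option described above can be combined as in the sketch below, which only prints the command line. The function name ht_mpirun is hypothetical; the executable ./a.out and the 8 physical cores per node follow the examples on this page.

```shell
# Print an OpenMPI mpirun line that places two MPI processes on each
# physical core (8 cores x 2 hyperthreads per node).
ht_mpirun() {
    nodes=$1
    np=$(( nodes * 8 * 2 ))
    echo "mpirun --mca mpi_yield_when_idle 1 -np $np ./a.out"
}

ht_mpirun 3    # three nodes: -np 48, as in the example above
```

The job script itself still requests ppn=8; only the mpirun line changes.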

Examples of hyperthreading with MPI

Hyperthreading using gromacs: https://support.scinet.utoronto.ca/wiki/index.php/Gromacs#Hyperthreading_with_Gromacs

HyperThreading with Hybrid MPI/OpenMP codes

With a hybrid code, one has extra flexibility in how to assign the "extra" cores -- you could run extra MPI tasks or extra OpenMP threads. As with all hybrid codes, the combination which results in the best performance depends very strongly on the nature of your code, and you should experiment with different combinations. In addition, with hybrid codes processor and memory affinity issues become very important; if you're unsure as to how to tune your application for best performance, please make an appointment with the SciNet technical analysts for more help.

Memory Configuration

Number of Nodes   Memory             Notes
3655              16 GB              14 GB is available for your application.
205               32 GB              Use the flag m32g to request (see below).
72                64 GB              Part of the contributed system Sandy, but unused cycles are available.
4                 128 GB or 256 GB   2 x 128 GB and 2 x 256 GB, as part of Sandy.
2                 128 GB             2 x older Harpertown nodes with 16 cores and 128 GB each.
2                 128 GB             2 x new Haswell nodes with 20 cores and 128 GB each.

16G

There are 3655 nodes which have 16G of memory; this is the primary configuration in the GPC. These nodes will be used by default. On these nodes, about 2 GB is taken by the operating system, which leaves about 14 GB for your job. So for MPI runs with 8 processes per node, this leaves about 14 GB / 8 = 1.75GB max per MPI process. Do not try to use more than the available memory: the node will crash and your job will either die or hang until the requested walltime has elapsed.

If you need more memory per process or per thread, you can either try to use the limited number of larger memory nodes listed below, or you can run with fewer MPI processes, or use a different decomposition, such that the job fits on a node.


32G

There are 205 nodes which have 32G of memory. To request these nodes use:

$ qsub -l nodes=1:m32g:ppn=8,walltime=1:00:00 

Again, also on these nodes, about 2 GB is taken by the operating system, but this is a relatively small amount compared to the total of 32GB.

64G/128G/256G

There are 72 16-core Intel Sandybridge nodes which have 64G, 2 with 128G, and 2 with 256G of memory available as part of the contributed Sandy cluster. These nodes are requested through the sandy queue.

$ qsub -l nodes=1:m64g:ppn=16,walltime=1:00:00 -q sandy
$ qsub -l nodes=1:m128g:ppn=16,walltime=1:00:00 -q sandy
$ qsub -l nodes=1:m256g:ppn=16,walltime=1:00:00 -q sandy

Again, also on these nodes, about 2 GB is taken by the operating system, but this is a relatively small amount compared to the total.

128G

There are four other stand-alone large memory (128GB) nodes, which are primarily to be used for data analysis of runs. All are Intel machines running the same Linux OS as the compute nodes; however, they are of different Intel generations than the regular GPC compute nodes. Two are older 16-core Harpertown nodes, so codes may have to be compiled separately for these machines, and some modules that work on the other GPC nodes, such as octave, will not work on them. The other two are newer 20-core Haswell nodes, on which all of the GPC modules should work.

These nodes can be accessed using a specific largemem queue

$ qsub -l nodes=1,walltime=1:00:00 -q largemem -I

To specifically request the older or the newer nodes, use the number of processors (16 or 20):

$ qsub -l nodes=1:ppn=16,walltime=1:00:00 -q largemem -I
$ qsub -l nodes=1:ppn=20,walltime=1:00:00 -q largemem -I

Note: To estimate your time of access to these nodes, use

$ showq -w class=largemem

How to run fewer MPI processes per node for memory bound applications

If your compute job requires more memory than is available, it will crash. In some cases, this will even crash the node itself. One way to exceed the memory is if your MPI application is too large for 8 processes to fit in memory. Oftentimes, you can solve this problem by subdividing the computation over more processes (reducing the memory per process) and asking for more nodes. Sometimes this does not work, for instance if each MPI process requires a fixed minimum chunk of memory that is too large. In that case, you will want to run with fewer MPI processes than there are cores in a node (i.e. fewer than 8 for the GPC).

Running in such a mode requires a change in the mpirun command. The mpirun command should now specify not only the total number of MPI processes, but also how these processes will be distributed over the nodes. The default distribution of processes is such that the first node gets loaded with 8 processes, then the second node gets filled up, etc., and this will clearly lead to the same memory issue that you had before. Instead, you want your processes to be distributed evenly over the nodes. The syntax for specifying that depends on whether you are using OpenMPI or IntelMPI. For OpenMPI running on 4 nodes with 3 processes per node, for example, you would invoke mpirun as follows

mpirun -np 12 --bynode <application>

For intelmpi, the same setup would be invoked using:

mpirun -np 12 -ppn 3 <application>
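How many processes fit on a node follows from the roughly 14 GB available on a 16 GB node. The sketch below computes that number and prints the matching mpirun lines for both MPI libraries. The function name procs_per_node, the executable ./a.out, and the 4096 MB per-process requirement are assumptions chosen to reproduce the 4-node, 3-processes-per-node example above.

```shell
AVAILABLE_MB=$(( 14 * 1024 ))    # about 14 GB usable on a 16 GB node
CORES_PER_NODE=8

# Largest number of processes per node that fits in memory, capped at 8.
procs_per_node() {
    need_mb=$1                   # memory needed by one MPI process, in MB
    ppn=$(( AVAILABLE_MB / need_mb ))
    if [ "$ppn" -gt "$CORES_PER_NODE" ]; then
        ppn=$CORES_PER_NODE
    fi
    if [ "$ppn" -lt 1 ]; then
        echo "a single process does not fit on a 16 GB node" >&2
        return 1
    fi
    echo "$ppn"
}

nodes=4
ppn=$(procs_per_node 4096)       # 14336 / 4096 rounds down to 3
np=$(( nodes * ppn ))
echo "OpenMPI:  mpirun -np $np --bynode ./a.out"
echo "IntelMPI: mpirun -np $np -ppn $ppn ./a.out"
```

If a single process needs more than about 14 GB, the larger-memory nodes described above are the only option.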


Checking memory usage from jobs

Ram Disk

On the GPC nodes, there is a `ram disk' available - up to half of the memory on the node may be used as a temporary file system. This is particularly useful in the early stages of migrating desktop-computing codes to a High Performance Computing platform such as the GPC. It is much faster than real disk and does not require network traffic; however, each node sees its own ramdisk and cannot see files on that of other nodes. This is a very easy way to cache writes (by writing them to fast ram disk instead of slow `real' disk); one would then periodically copy the files to /scratch or /project so that they are available after the job has completed.

To use the ramdisk, create files in /dev/shm/... and read from or write to them just as one would in (eg) $SCRATCH. Only the amount of RAM needed to store the files will be taken up by the temporary file system; thus if you have 8 serial jobs each requiring 1 GB of RAM, and 1GB is taken up by various OS services, you would still have approximately 7GB available to use as ramdisk on a 16GB node. However, if you were to write 8 GB of data to the RAM disk, this would exceed available memory and your job would likely crash.

NOTE: it is very important to delete your files from ram disk at the end of your job. If you do not do this, the next user to use that node will have less RAM available than they might expect, and this might kill their jobs.
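A job script might handle the ram disk along the following lines: write to a private directory under /dev/shm, copy the results to $SCRATCH, and then delete the ram disk files. This is a sketch only. The directory naming scheme, the file name output.dat, and the results directory are hypothetical, and the fallbacks to mktemp exist only so that the snippet can be tried on machines without /dev/shm or $SCRATCH.

```shell
# /dev/shm is the ram disk on the GPC; fall back to a temporary
# directory elsewhere (assumption for off-cluster trials).
if [ -d /dev/shm ] && [ -w /dev/shm ]; then
    RAM_BASE=/dev/shm
else
    RAM_BASE=$(mktemp -d)
fi
RAM_DIR="$RAM_BASE/${USER:-user}_job_$$"          # hypothetical naming scheme
RESULT_DIR="${SCRATCH:-$(mktemp -d)}/results"     # hypothetical results directory

# Remove our files from the ram disk, leaving the memory for the next user.
cleanup_ramdisk() {
    rm -rf "$RAM_DIR"
}
trap cleanup_ramdisk EXIT        # also clean up if the script exits early

mkdir -p "$RAM_DIR" "$RESULT_DIR"

# Stand-in for a program writing its output to the ram disk.
echo "intermediate data" > "$RAM_DIR/output.dat"

# Copy results to the shared file system before the job ends.
cp "$RAM_DIR/output.dat" "$RESULT_DIR/"

cleanup_ramdisk
```

Calling the cleanup explicitly at the end, in addition to the trap, covers both a normal finish and an early exit of the script.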

More details on how to setup your script to use the ramdisk can be found on the Ramdisk wiki page.

Managing jobs on the Queuing system

Information on checking available resources, starting, viewing, managing and canceling jobs on Moab/Torque. Also check out the FAQ.