2010-08-09T16:13:33Z

Cneale: /* Examples */

{{Infobox Computer
|image=[[Image:University_of_Tor_79284gm-a.jpg|center|300px|thumb]]
|name=General Purpose Cluster (GPC)
|installed=June 2009
|operatingsystem= Linux
|loginnode= gpc01..gpc04 (from <tt>login.scinet</tt>)
|numberofnodes=3780
|rampernode=16 GB
|corespernode=8
|interconnect=1/4 on Infiniband, rest on GigE
|vendorcompilers=icc (C) ifort (fortran) icpc (C++)
|queuetype=[[Moab | Moab/Torque]]
}}

The General Purpose Cluster is an extremely large cluster (ranked [http://www.top500.org/list/2009/06/100 16th] in the world at its inception, and fastest in Canada) and is where most simulations at SciNet should be run. It is an IBM iDataPlex cluster based on Intel's Nehalem architecture (one of the [http://www.hpcwire.com/features/HPC-Vendors-Jump-On-Nehalem-42360237.html first in the world] to make use of the new chips). The GPC consists of 3,780 nodes with a total of 30,240 2.53GHz cores, with 16GB RAM per node (2GB per core). Approximately one quarter of the cluster is interconnected with non-blocking 4x-DDR InfiniBand, while the rest of the nodes are connected with gigabit ethernet. The compute nodes are accessed through a queuing system that allows jobs with a maximum wall time of 48 hours.

===Login===

First log in via ssh with your SciNet account at <tt>login.scinet.utoronto.ca</tt>; from there you can proceed to the development nodes to compile and test your code.

===Compile/Devel Nodes===

From a scinet login node you can ssh to <tt>gpc01</tt>..<tt>gpc04</tt>. These nodes have the same hardware configuration as most of the compute nodes -- 8 Nehalem processing cores with 16GB RAM and Gigabit ethernet. You can compile and test your codes on these nodes. To interactively test on more than 8 processors, or to test your code over an InfiniBand connection, you can submit an [[GPC_Quickstart#Submitting_an_Interactive_Job | interactive job request]].

Your [[Storage_Quickstart | home directory]] is in <tt>/home/USER</tt>; you have 10GB there that is backed up. This directory cannot be written to by the compute nodes! Thus, to run jobs, you'll use the <tt>/scratch/USER</tt> directory. Here, there is a large amount of disk space, but it is not backed up. Thus it makes sense to keep your codes in /home, compile there, and then run them in the /scratch directory.
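The compile-in-/home, run-in-/scratch workflow can be sketched with a small shell helper. This is only an illustration: the function name <tt>stage_run</tt> and the directory names <tt>mycode</tt> and <tt>run1</tt> are made up for this example and are not SciNet conventions.

```shell
# Copy a compiled executable from the backed-up /home area into a run
# directory under /scratch, which the compute nodes can write to.
# Arguments: 1 = path to the executable, 2 = run directory to create.
stage_run() {
    exe=$1
    rundir=$2
    mkdir -p "$rundir" || return 1
    cp "$exe" "$rundir"/ || return 1
    # Print the run directory so the caller can cd into it.
    echo "$rundir"
}

# Example call (paths are placeholders):
#   cd "$(stage_run "/home/$USER/mycode/a.out" "/scratch/$USER/run1")"
```

Submitting the job from the run directory means <tt>$PBS_O_WORKDIR</tt> will point at a location on /scratch.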

===Modules and Environment Variables===

To use most packages on the SciNet machines, including any of the compilers, you will have to use the <tt>module</tt> command. The command <tt>module load some-package</tt> will set your environment variables (<tt>PATH</tt>, <tt>LD_LIBRARY_PATH</tt>, etc) to include the default version of that package. <tt>module load some-package/specific-version</tt> will load a specific version of that package. This makes it very easy for different users to use different versions of compilers, MPI libraries, and other libraries.

Note that to use even the gcc compilers you will have to do
<pre>
$ module load gcc
</pre>

but in fact you should probably use the Intel compilers installed on this system, as they usually produce faster code (sometimes much faster).

A list of the installed software is available in [[Software_and_Libraries | Software & Libraries]] and can
be seen on the system by typing
<pre>
$ module avail
</pre>

To load a module (for example, the default version of the intel compilers)
<pre>
$ module load intel
</pre>
To unload a module
<pre>
$ module unload intel
</pre>
To unload all modules
<pre>
$ module purge
</pre>

These commands should go in your .bashrc files and/or in your submission scripts to make sure you
are using the correct packages.
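One possible way to put these commands in a <tt>.bashrc</tt> that is also read on machines without the module system is to guard the call. The helper name <tt>load_modules</tt> below is an invention for this sketch, not something SciNet provides.

```shell
# Load each named module if the `module` command exists in this shell;
# otherwise print a notice so the same file can be sourced elsewhere.
load_modules() {
    for m in "$@"; do
        if command -v module >/dev/null 2>&1; then
            module load "$m"
        else
            echo "module command not found; skipping $m" >&2
        fi
    done
}

# In ~/.bashrc or at the top of a submission script:
#   load_modules intel
```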

===Compilers===

The intel compilers are icc/icpc/ifort for C/C++/Fortran, and are available with the default module "intel". The intel compilers are recommended over the GNU compilers. Documentation about icpc is available at
http://software.intel.com/en-us/articles/intel-software-technical-documentation/. The Intel compilers accept many of the options that the GNU compilers accept, but tend to produce faster programs on our system. If, for some reason, you really need the GNU compilers, the latest version of the GNU compiler collection (currently 4.4.0) is available by loading the "gcc" module, with gcc/g++/gfortran for C/C++/Fortran. Note that f77/g77 is not supported.

To ensure that the intel compilers are in your <tt>PATH</tt> and their libraries are in your <tt>LD_LIBRARY_PATH</tt>, use the command

<pre>
$ module load intel
</pre>

This should likely go in your <tt>.bashrc</tt> file so that it will automatically be loaded.

Optimize your code for the GPC machine using at least the following compiler flags:
<pre>
-O3 -xHost
</pre>
(or <tt>-O3 -march=native</tt> for the GNU compilers).

*If your program uses openmp, add <tt>-openmp</tt> (<tt>-fopenmp</tt> for GNU compilers).
*If you get the warning <tt>feupdateenv is not implemented</tt>, add <tt>-limf</tt> to the link line.
*If you need to link in the MKL libraries, you are well advised to use the Intel(R) Math Kernel Library Link Line Advisor: http://software.intel.com/en-us/articles/intel-mkl-link-line-advisor/ for help in devising the list of libraries to link with your code.
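The flag recommendations above can be collected in a small helper so that a build script uses the right set for whichever compiler is loaded. This is a sketch only; the function name <tt>gpc_flags</tt> and the source file name <tt>mycode.c</tt> are placeholders.

```shell
# Print the optimisation flags this page recommends for a given compiler.
# A second argument of "omp" adds the matching OpenMP flag.
gpc_flags() {
    case $1 in
        icc|icpc|ifort)   flags="-O3 -xHost";        omp="-openmp" ;;
        gcc|g++|gfortran) flags="-O3 -march=native"; omp="-fopenmp" ;;
        *) echo "unknown compiler: $1" >&2; return 1 ;;
    esac
    if [ "${2:-}" = "omp" ]; then
        flags="$flags $omp"
    fi
    echo "$flags"
}

# Example (the unquoted $(...) is deliberate so the flags are split):
#   icc $(gpc_flags icc omp) -o a.out mycode.c
```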

===[[ GPC_MPI_Versions | MPI]]===

SciNet currently provides multiple MPI libraries for the GPC: [http://www.open-mpi.org/ OpenMPI] and [http://software.intel.com/en-us/intel-mpi-library/ IntelMPI]. We currently recommend OpenMPI as the default, as it quite reliably demonstrates good performance on both the InfiniBand and ethernet networks. For full details and options see the complete [[ GPC_MPI_Versions | '''MPI''']] section.

The MPI libraries are compiled with both the gnu compiler suite and the intel compiler suite. To use (for instance) the intel-compiled OpenMPI libraries, which we recommend as the default (and use for most of our examples here), use

<pre>
$ module load openmpi intel
</pre>

in your <tt>.bashrc</tt>. Other combinations behave similarly.

The MPI libraries provide mpicc/mpicxx/mpif90/mpif77 as wrappers around the appropriate compilers, which ensure that the appropriate include and library directories are used in the compilation and linking steps.

We currently recommend the Intel + OpenMPI combination. However, if you require the GNU compilers as well as MPI, you would want to find the most recent openmpi module available with `gcc' in the version name. This will enable development and runtime with gcc/g++/gfortran and OpenMPI. You can make this your default by putting the module load line in your ~/.bashrc file.

For mixed OpenMP/MPI code using Intel MPI, add the compilation flag -mt_mpi for full thread-safety.

===Submitting A Batch Job===

The SciNet machines are shared systems, and jobs that are to run on them are submitted to a queue; the
[[Moab | scheduler]] then orders the jobs in order to make the best use of the machine, and has them launched
when resources become available. The intervention of the scheduler can mean that the jobs aren't
quite run in a first-in first-out order.

The maximum [[wallclock time]] for a job in the queue is 48 hours; computations that will take longer than
this must be broken into 48-hour chunks and run as several jobs. The usual way to do this is with [[checkpoints]],
writing out the complete state of the computation every so often in such a way that a job can be restarted from
this state information and continue on from where it left off. Generating [[checkpoints]] is a good idea anyway,
as in the unlikely event of a hardware failure during your run, it allows you to restart without having lost much work.
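How a job script picks up from the last checkpoint depends entirely on your code. As a rough sketch, assume the program writes files named <tt>checkpoint_NNNN</tt> with zero-padded step numbers and accepts a <tt>--restart FILE</tt> option; both are hypothetical conventions chosen for this example.

```shell
# Print the path of the last checkpoint file in a directory, or nothing
# if there is none. Relies on zero-padded names sorting in step order.
latest_checkpoint() {
    last=""
    for f in "$1"/checkpoint_*; do
        # If nothing matches, the pattern itself is left in $f; skip it.
        [ -e "$f" ] && last=$f
    done
    echo "$last"
}

# In a job script (the --restart option is hypothetical):
#   ckpt=$(latest_checkpoint .)
#   if [ -n "$ckpt" ]; then ./a.out --restart "$ckpt"; else ./a.out; fi
```

With this pattern the same script can be resubmitted for each 48-hour chunk without editing.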

There are limits to how many jobs you can submit. If your group has a default account, up to 32 nodes at a time for 48 hours per job on the GPC cluster are allowed to be queued. This is a total limit, e.g., you could request 64 nodes for 24 hours. Jobs of users with an LRAC or NRAC allocation will run at a higher priority than others while their resources last. Because of the group-based allocation, it is conceivable that your jobs won't run if your colleagues have already exhausted your group's limits.

Note that scheduling big jobs greatly affects the queuer and other users, so you have to talk to us first to run massively parallel jobs (> 2048 cores). We will help make sure that your jobs start and run efficiently.

If your job should run in fewer than 48 hours, specify that in your script -- your job
will start sooner. (It's easier for the [[Moab | scheduler]] to fit in a short job than a long job). On the downside, the
job will be killed automatically by the queue manager software at the end of the specified [[wallclock time]], so if you
guess wrong you might lose some work. So the standard procedure is to estimate how long your job will take and
add 10% or so.
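The estimate-plus-10% rule is simple arithmetic; a sketch of it as a shell function follows. The function name <tt>padded_walltime</tt> is made up for this example, and the cap reflects the 48-hour limit stated above.

```shell
# Pad an estimated run time (in minutes) by 10%, cap it at the 48-hour
# queue limit, and print it in the H:MM:SS form used on the #PBS -l line.
padded_walltime() {
    est=$1
    padded=$(( (est * 110 + 99) / 100 ))   # add 10%, rounding up
    limit=$(( 48 * 60 ))
    if [ "$padded" -gt "$limit" ]; then
        padded=$limit
    fi
    printf '%d:%02d:00\n' $(( padded / 60 )) $(( padded % 60 ))
}

# Example: an estimated 60-minute run
#   padded_walltime 60
```

For a 60-minute estimate this prints <tt>1:06:00</tt>, which would go after <tt>walltime=</tt> in the resource request.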

You interact with the queuing system through the queue/resource manager, [[Moab | Moab]] and [[Moab | Torque]]. To see all the jobs in the queue use
<pre>
$ showq
</pre>

To submit your own job, you must write a script which describes the job and how it is to be run (a sample script [[GPC_Quickstart#Submission_Script | follows]]) and submit it to the queue, using the command
<pre>
$ qsub [SCRIPT-FILE-NAME]
</pre>
where you will replace <tt>[SCRIPT-FILE-NAME]</tt> with the file containing the submission script. This will return a job ID, for example 31415, which is used to identify the jobs. Information about a queued job can be found using
<pre>
$ checkjob [JOB-ID]
</pre>
and jobs can be canceled with the command
<pre>
$ canceljob [JOB-ID]
</pre>

Again, these commands have many options, which can be read about on their man pages.
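When scripting these commands it can be convenient to keep the job ID in a variable. The sketch below assumes that <tt>qsub</tt> prints the identifier either as a bare number or as <tt>NUMBER.SERVER</tt>, which is common for Torque but should be checked on the system; the script name <tt>myjob.sh</tt> is a placeholder.

```shell
# Keep only the part of a job identifier before the first dot, so that
# "31415.some-server" and "31415" both give "31415".
job_number() {
    echo "${1%%.*}"
}

# Example:
#   id=$(job_number "$(qsub myjob.sh)")
#   checkjob "$id"
```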

Much more information on the queueing system is available on our [[Moab | queue]] page.

====Batch Submission Script: MPI====

A sample submission script is shown below for an mpi job using ethernet with the <tt> #PBS </tt> directives at the top and the rest being
what will be executed on the compute node.

<source lang="bash">
#!/bin/bash
# MOAB/Torque submission script for SciNet GPC (ethernet)
#
#PBS -l nodes=2:ppn=8,walltime=1:00:00
#PBS -N test

# DIRECTORY TO RUN - $PBS_O_WORKDIR is directory job was submitted from
cd $PBS_O_WORKDIR

# EXECUTION COMMAND; -np = nodes*ppn
mpirun -np 16 -hostfile $PBS_NODEFILE ./a.out
</source>

The lines that begin <tt>#PBS</tt> are commands that are parsed and interpreted by qsub at submission time, and control administrative things about your job. In this example, the script above requests two nodes, using 8 processors per node, for a [[wallclock time]] of one hour. (The resources required by the job are listed on the <tt>#PBS -l</tt> line.) Other options can be given in other <tt>#PBS</tt> lines, such as <tt>#PBS -N</tt>, which sets the name of the job.

The rest of the script is run as a bash script at run time. A bash shell on the first node of the two nodes that are requested executes these commands as a normal bash script, just as if you had run this as a shell script from the terminal. The only difference is that PBS sets certain environment variables that you can use in the script. <tt>$PBS_O_WORKDIR</tt> is set to be the directory that the command was 'submitted' from - eg, <tt>/scratch/USER/SOMEDIRECTORY</tt> - and <tt>$PBS_NODEFILE</tt> is the name of a file which contains all the nodes on which programs should execute. Using these environment variables, the script then uses the <tt>mpirun</tt> command to launch the job. Assumed here is that the user has a line like

<pre>
$ module load openmpi intel
</pre>

in their <tt>.bashrc</tt>.
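Rather than hard-coding <tt>-np 16</tt>, the process count can be derived from the node file. This sketch assumes the file named by <tt>$PBS_NODEFILE</tt> contains one line per requested processor slot (so 16 lines for <tt>nodes=2:ppn=8</tt>), which is the usual Torque behaviour; the helper names are invented for this example.

```shell
# Count the lines in a node file (one per requested processor slot).
count_slots() {
    wc -l < "$1" | tr -d ' '
}

# Count the distinct host names in a node file.
count_hosts() {
    sort -u "$1" | wc -l | tr -d ' '
}

# In a job script:
#   NP=$(count_slots "$PBS_NODEFILE")
#   mpirun -np "$NP" -hostfile "$PBS_NODEFILE" ./a.out
```

Deriving the count this way keeps the <tt>mpirun</tt> line correct if the <tt>#PBS -l</tt> request is changed later.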

* Note: The different versions of MPI require different commands to launch the run, and thus different scripts. The above script is specific for the openmpi module. For the intelmpi module, the last line of the script should read
<pre>
$ mpirun -r ssh -np 16 -env I_MPI_DEVICE ssm ./a.out
</pre>

====Submitting Collections of Serial Jobs====

SciNet-approved methods for running collections of serial jobs can be found on the [[User_Serial|serial run wiki page]].

====Batch Submission Script: OpenMP====

For running OpenMP jobs, the procedure is similar to that for MPI jobs:

<source lang="bash">
#!/bin/bash
# MOAB/Torque submission script for SciNet GPC (OpenMP)
#
#PBS -l nodes=1:ppn=8,walltime=1:00:00
#PBS -N test

# DIRECTORY TO RUN - $PBS_O_WORKDIR is directory job was submitted from
cd $PBS_O_WORKDIR

export OMP_NUM_THREADS=8
./a.out
</source>

Note that [[Introduction_To_Performance#Throughput | in some circumstances]] it can be more efficient to run (say) two jobs each running on four threads than one job running on eight threads. In that case you can use the same `ampersand-and-wait' technique outlined for serial jobs (see [[User_Serial|serial run wiki page]]) for less-than-eight-core OpenMP jobs.
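A minimal sketch of running two four-thread OpenMP jobs side by side on one node is given below. It assumes each run has its own directory and that the command can be started from there; the function name <tt>run_pair</tt> and the file name <tt>output.txt</tt> are placeholders, and the serial run page remains the reference for the supported approach.

```shell
# Run the same command in two directories at once, four threads each,
# and wait for both to finish before returning.
run_pair() {
    dir1=$1
    dir2=$2
    shift 2
    ( cd "$dir1" && OMP_NUM_THREADS=4 "$@" > output.txt 2>&1 ) &
    ( cd "$dir2" && OMP_NUM_THREADS=4 "$@" > output.txt 2>&1 ) &
    # Without this wait the job script would exit and the job would end.
    wait
}

# Example, with an executable present in each run directory:
#   run_pair run1 run2 ./a.out
```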

====Hybrid MPI/OpenMP jobs====

=====Using Intel MPI=====
Intel's guide to running hybrid codes with Intel MPI is here:

http://software.intel.com/en-us/articles/hybrid-applications-intelmpi-openmp/

Make sure you compile with the -mt_mpi option to the compilers to use the thread safe libraries.
Set the environment variable I_MPI_PIN_DOMAIN:
<pre>
$ export I_MPI_PIN_DOMAIN=omp
</pre>
This will set the process pinning domain size to be equal to OMP_NUM_THREADS (which you should set to the desired number of threads per mpi process). Therefore, each MPI process can create $OMP_NUM_THREADS number of children threads for running within the corresponding domain. If OMP_NUM_THREADS is not set, each node is treated as a separate domain (which will allow as many threads per MPI processes as there are cores).

In addition, when invoking mpirun, you should add the argument "-ppn X", where X is the number of MPI processes per node.
For example:
<pre>
$ mpirun -r ssh -ppn 2 -np 8 [executable]
</pre>
would start 2 mpi processes of <tt>[executable]</tt> per node for a total of 8 processes, so mpirun will try to run mpi processes on 4 nodes
(OMP_NUM_THREADS is then probably best set at 4).
Your job script should still ask for these 4 nodes with the line
<source lang="bash">
#PBS -l nodes=4:ppn=8,walltime=....
</source>
(<tt>ppn=8</tt> is not a mistake here; the ppn parameter has a different meaning for PBS and for mpirun)

''The ppn parameter to ''mpirun'' is very important! Without it, eight mpi jobs would get bunched on the first node in this example, leaving 3 nodes unused.''
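The relationship between the node count, the <tt>mpirun -ppn</tt> value, the total <tt>-np</tt> and <tt>OMP_NUM_THREADS</tt> is plain arithmetic. The helper below is a sketch with an invented name, and it assumes 8 cores per node with no hyperthreading.

```shell
# Given the number of nodes and the number of MPI processes per node,
# print the matching mpirun -np value and OMP_NUM_THREADS for 8-core nodes.
hybrid_layout() {
    nodes=$1
    per_node=$2
    cores=8
    echo "np=$(( nodes * per_node )) threads=$(( cores / per_node ))"
}

# Example: 4 nodes with 2 MPI processes each
#   hybrid_layout 4 2
```

For the example above this gives <tt>np=8 threads=4</tt>, while the PBS request stays at <tt>nodes=4:ppn=8</tt>.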

NOTE: In order to pin OpenMP threads inside the domain, use the corresponding OpenMP feature by setting the KMP_AFFINITY environment variable; see [http://software.intel.com/sites/products/documentation/hpc/compilerpro/en-us/fortran/lin/compiler_f/optaps/common/optaps_openmp_thread_affinity.htm#KMP_AFFINITY_Environment_Variable Intel's Compiler User and Reference Guide].

The IntelMPI manual is referenced on the front page of our wiki:

http://software.intel.com/sites/products/documentation/hpc/mpi/linux/reference_manual.pdf

For the above example of a total of 8 processes on 4 nodes, you could use the following script:
<source lang="bash">
#!/bin/bash
# MOAB/Torque submission script for SciNet GPC (hybrid job)
#
#PBS -l nodes=4:ppn=8,walltime=1:00:00
#PBS -N test

# DIRECTORY TO RUN - $PBS_O_WORKDIR is directory job was submitted from
cd $PBS_O_WORKDIR

# SET THE NUMBER OF THREADS PER PROCESS:
export OMP_NUM_THREADS=4

# PIN THE MPI DOMAINS ACCORDING TO OMP
export I_MPI_PIN_DOMAIN=omp

# EXECUTION COMMAND; -np = nodes * (mpirun -ppn value) = 4 * 2
mpirun -r ssh -ppn 2 -np 8 ./a.out
</source>

=====Using Open MPI=====

For mixed MPI/OpenMP jobs using OpenMPI, which is the default for many users, the procedure is similar, but details differ.

* Request the number of nodes in the PBS script.
* Set OMP_NUM_THREADS to the number of threads per MPI process.
* In addition to the -np parameter for mpirun, add the argument <tt>--bynode</tt>, so that the mpi processes are not bunched up.

So for example, to start a total of 8 processes on 4 nodes, you could use the following script
<source lang="bash">
#!/bin/bash
# MOAB/Torque submission script for SciNet GPC (hybrid job)
#
#PBS -l nodes=4:ppn=8,walltime=1:00:00
#PBS -N test

# DIRECTORY TO RUN - $PBS_O_WORKDIR is directory job was submitted from
cd $PBS_O_WORKDIR

# SET THE NUMBER OF THREADS PER PROCESS:
export OMP_NUM_THREADS=4

# EXECUTION COMMAND; -np = nodes*processes_per_node; --bynode forces a round robin of nodes.
mpirun -np 8 --bynode -hostfile $PBS_NODEFILE ./a.out
</source>

===Submitting an Interactive (Debug) Job===

It is sometimes convenient to run a job interactively; this can be very handy for debugging purposes. In this case, you type a <tt>qsub</tt> command which submits an interactive job to the queue; when the scheduler selects this job to run, then it starts a shell running on the first node of the job, which connects to your terminal. You can then type any series of commands (for instance, the same commands listed as in the batch submission script above) to run a job interactively.

For example, to start the same sort of job as in the batch submission script above, but interactively, one would type

<pre>
$ qsub -I -l nodes=2:ppn=8,walltime=1:00:00
</pre>

This is exactly the <tt>#PBS -l</tt> line in the batch script above (which requests all 8 processors on each of 2 nodes for one hour), but prepended with a <tt>-I</tt> for `interactive'. When this job begins, your terminal will show you as being logged in to one of the compute nodes, and you can type in any shell command, run <tt>mpirun</tt>, etc. When you exit the shell, the job will end. Interactive jobs can be used with any of the [[ Moab#GPC | GPC queues ]]; however, there is a short,
high-turnover queue called [[ Moab#debug | debug ]] which can be especially useful when the system is busy.

===Ethernet vs. Infiniband===

About 1/4 of the GPC (862 nodes or 6896 cores) is connected with a high bandwidth low-latency fabric called
[http://en.wikipedia.org/wiki/InfiniBand InfiniBand]. Many jobs which require tight coupling to scale well greatly benefit from this interconnect;
other types of jobs, which have relatively modest communications, do not require this and run fine on Gigabit ethernet.

Jobs which require the InfiniBand for good performance can request the nodes that have the `<tt>ib</tt>' feature in the <tt>#PBS -l</tt> line,
<source lang="bash">
#PBS -l nodes=2:ib:ppn=8,walltime=1:00:00
</source>

Because there are a limited number of these nodes, your job will start sooner if you do not request them (e.g. if you use the scripts as shown above), as this increases the number of nodes available to run your job. In fact, the InfiniBand nodes are to be used only for jobs that are known to scale well and will benefit from this type of interconnect. As such, at least 2 nodes must be requested, as single-node jobs will not benefit from using an
InfiniBand node. The MPI libraries provided by SciNet automatically use the correct interconnect, InfiniBand or ethernet, depending on which nodes your job runs on.

===HyperThreading===

Each GPC compute node has 8 Nehalem cores (2 sockets each with a four-core Intel Xeon E5540 @ 2.53GHz). Thus, to make full use of the computing power of a GPC node, you must be running at least 8 "tasks" -- MPI processes, or OpenMP threads.

Under most circumstances, running exactly 8 tasks is the most efficient way to use these nodes. However, sometimes software design (eg, having one thread for communication and one for computation) can usefully `oversubscribe' the number of physical cores, and running (say) twice as many tasks as cores can be a useful strategy. If your code is highly memory-bandwidth bound, having one task ready to run while another waits for memory access can make more effective use of the processor.

The Nehalem processors have hardware support for such two-way overloading of processors, through "HyperThreading"; there is an extra set of registers on each core to facilitate rapid switching between two tasks, making it look to the operating system as if there are in fact 16 cores per node. Depending on the nature of your code, making use of these virtual extra cores may speed up or slow down your computation; you should run small test cases before running production jobs in this manner. In most cases, the speed difference will be under 10%. Some of our users have obtained an 8% speedup by running gromacs with 16 tasks instead of 8 on a single node (<tt>mpirun -np 16 ./gromacs/mdrun -npme 4</tt> is 108% the speed of <tt>mpirun -np 8 ./gromacs/mdrun</tt> with <tt>-npme 2</tt> or <tt>-1</tt>).

====HyperThreading with OpenMP====

To use hyperthreading with an OpenMP job, one just runs twice as many threads as one would have previously; eg, if you were running 8 threads before (<tt>export OMP_NUM_THREADS=8</tt>) you would run with 16 (<tt>export OMP_NUM_THREADS=16</tt>). Everything else remains the same, including the job submission script; one still uses <tt>ppn=8</tt> in the submission of the job, as Torque has no way of knowing (or reason for caring) that you will be running on 16 `virtual' cores rather than 8 physical cores.
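Since hyperthreading can help or hurt, a quick comparison on a small test case is worthwhile. The sketch below times a command with 8 and then 16 threads using whole-second wall-clock time from <tt>date +%s</tt>; the function name <tt>compare_threads</tt> is invented for this example, and the test case needs to run long enough for one-second resolution to be meaningful.

```shell
# Run a command once with 8 and once with 16 OpenMP threads and print
# the elapsed wall-clock seconds for each run.
compare_threads() {
    for n in 8 16; do
        start=$(date +%s)
        OMP_NUM_THREADS=$n "$@" > /dev/null
        end=$(date +%s)
        echo "$n threads: $(( end - start )) s"
    done
}

# Example, inside an interactive or batch job on a compute node:
#   compare_threads ./a.out
```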

====HyperThreading with MPI====

To use hyperthreading with an MPI job, one just runs twice as many MPI processes as one would have previously; eg, if you were running on three nodes using 8 MPI tasks per node and used <tt>mpirun ... -np 24</tt>, you could run instead with <tt>-np 48</tt>. Everything else remains the same, including the job submission script; one still uses <tt>ppn=8</tt> in the submission of the job, as Torque has no way of knowing (or reason for caring) that you will be running on 16 `virtual' cores rather than 8 physical cores.

Note that if you are using OpenMPI (as is the default), there is another consideration; OpenMPI assumes that there is no oversubscription and each task very aggressively makes full use of a core when it is waiting for a message (eg, the waits are "busywaits"). If you find a significant slowdown when running multiple MPI tasks per core with OpenMPI, you may want to try adding the additional option to mpirun: <tt>--mca mpi_yield_when_idle 1</tt>. This will increase the latency of individual messages, but free up the core to do additional work while waiting.

With IntelMPI, the problem should be less pronounced, but you can still improve things by using <tt>mpirun -genv I_MPI_SPIN_COUNT 1 ...</tt>

=====Examples of hyperthreading with MPI=====
Hyperthreading using gromacs: https://support.scinet.utoronto.ca/wiki/index.php/Gromacs#Hyperthreading_with_Gromacs

====HyperThreading with Hybrid MPI/OpenMP codes====

With a hybrid code, one has extra flexibility in how to assign the "extra" cores -- you could run extra MPI tasks or extra OpenMP threads. As with all hybrid codes, the combination which results in the best performance depends very strongly on the nature of your code, and you should experiment with different combinations. In addition, with hybrid codes processor and memory affinity issues become very important; if you're unsure as to how to tune your application for best performance, please make an appointment with the SciNet technical analysts for more help.

===Memory Configuration===

==== 16G ====

There are 3756 nodes with 16G of memory; this is the primary configuration in the GPC. These nodes will be used by default.

==== 18G ====

There are 24 Infiniband nodes which have 18G of memory. These nodes have a fully populated memory configuration that maximizes memory bandwidth. To
request these nodes use:

<pre>
$ qsub -l nodes=2:ib:m18g:ppn=8,walltime=1:00:00
</pre>

==== 32G ====

There are 84 Infiniband nodes which have 32G of memory. To request these nodes use:

<pre>
$ qsub -l nodes=2:ib:m32g:ppn=8,walltime=1:00:00
</pre>

==== 128G ====
There are two stand-alone large memory (128GB) nodes, <tt>gpc-lrgmem01</tt> and <tt>gpc-lrgmem02</tt> which are primarily to be used for data analysis of runs. They have 16 cores and are intel machines running linux, but they are not the same architecture (Nehalem) as the GPC compute nodes, so codes may have to be compiled separately for these machines. They can be accessed using a specific <tt>largemem</tt> queue.

<pre>
$ qsub -l nodes=2:ppn=8,walltime=1:00:00 -q largemem -I
</pre>

===Ram Disk===

On the GPC nodes, there is a `ram disk' available - up to half of the memory on the node may be used as a temporary file system. This is particularly useful in the early stages of migrating desktop-computing codes to a High Performance Computing platform such as the GPC. It is much faster than real disk and does not require network traffic; however, each node sees its own ramdisk and cannot see files on that of other nodes. This is a very easy way to cache writes (by writing them to fast ram disk instead of slow `real' disk); one would then periodically copy the files to /scratch or /project so that they are available after the job has completed.

To use the ramdisk, create files in /dev/shm/.. and read from and write to them just as one would in (eg) /scratch/USER/. Only the amount of RAM needed to store the files will be taken up by the temporary file system; thus if you have 8 serial jobs each requiring 1 GB of RAM, and 1GB is taken up by various OS services, you would still have approximately 7GB available to use as ramdisk on a 16GB node. However, if you were to write 8 GB of data to the RAM disk, this would exceed available memory and your job would likely crash.

NOTE: it is very important to delete your files from ram disk at the end of your job. If you do not do this, the next user to use that node will have less RAM available than they might expect, and this might kill their jobs.
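One way to make the copy-back and clean-up hard to forget is to put them in a function and register it with <tt>trap</tt>. This is a sketch, not the SciNet-recommended script: the function name <tt>save_and_clear</tt> and the directory name <tt>myjob</tt> are placeholders, and a job that is killed outright (for example at the wallclock limit) may not get the chance to run the trap.

```shell
# Copy everything from a ram disk directory to permanent storage, then
# delete the ram disk copy so the memory is returned to the node.
# If the copy fails, the ram disk files are left in place.
save_and_clear() {
    ramdir=$1
    dest=$2
    mkdir -p "$dest" || return 1
    cp -r "$ramdir"/. "$dest"/ || return 1
    rm -rf "$ramdir"
}

# In a job script:
#   RAMDIR=/dev/shm/$USER/myjob
#   mkdir -p "$RAMDIR"
#   trap 'save_and_clear "$RAMDIR" "/scratch/$USER/myjob"' EXIT
```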

More details on how to set up your script to use the ramdisk can be found on the [[User_Ramdisk|Ramdisk wiki page]].

=== Managing jobs on the Queuing system ===
Information on checking available resources and on starting, viewing, managing and canceling jobs is available on the [[Moab | Moab/Torque]] page.

GPC Quickstart

2010-08-09T16:12:46Z

Cneale: /* HyperThreading with MPI */

{{Infobox Computer
|image=[[Image:University_of_Tor_79284gm-a.jpg|center|300px|thumb]]
|name=General Purpose Cluster (GPC)
|installed=June 2009
|operatingsystem= Linux
|loginnode= gpc01..gpc04 (from <tt>login.scinet</tt>)
|numberofnodes=3780
|rampernode=16 Gb
|corespernode=8
|interconnect=1/4 on Infiniband, rest on GigE
|vendorcompilers=icc (C) ifort (fortran) icpc (C++)
|queuetype=[[Moab | Moab/Torque]]
}}

The General Purpose Cluster is an extremely large cluster (ranked [http://www.top500.org/list/2009/06/100 16th] in the world at its inception, and fastest in Canada) and is where most simulations are to be done at SciNet. It is an IBM iDataPlex cluster based on Intel's Nehalem architecture (one of the [http://www.hpcwire.com/features/HPC-Vendors-Jump-On-Nehalem-42360237.html first in the world] to make use of the new chips). The GPC consists of 3,780 nodes with a total of 30,240 2.5GHz cores, with 16GB RAM per node (2GB per core). Approximately one quarter of the cluster is interconnected with non-blocking 4x-DDR InfiniBand while the rest of the nodes are connected with gigabit ethernet. The compute nodes are accessed through a queuing system that allows jobs with a maximum wall time of 48 hours.

===Login===

First login via ssh with your scinet account at <tt>login.scinet.utoronto.ca</tt>, and from there you can proceed to the Development nodes to compile/test your code.

===Compile/Devel Nodes===

From a scinet login node you can ssh to <tt>gpc01</tt>..<tt>gpc04</tt>. These nodes have the same hardware configuration as most of the compute nodes -- 8 Nehalem processing cores with 16GB RAM and Gigabit ethernet. You can compile and test your codes on these nodes. To interactively test on more than 8 processors, or to test your code over an InfiniBand connection, you can submit an [[GPC_Quickstart#Submitting_an_Interactive_Job | interactive job request]].

Your [[Storage_Quickstart | home directory]] is in <tt>/home/USER</tt>; you have 10GB there that is backed up. This directory cannot be written to by the compute nodes! Thus, to run jobs, you'll use the <tt>/scratch/USER</tt> directory. Here, there is a large amount of disk space, but it is not backed up. Thus it makes sense to keep your codes in /home, compile there, and then run them in the /scratch directory.

===Modules and Environment Variables===

To use most packages on the SciNet machines - including any of the compilers - , you will have to use the `modules' command. The command <tt>module load some-package</tt> will set your environment variables (<tt>PATH</tt>, <tt>LD_LIBRARY_PATH</tt>, etc) to include the default version of that package. <tt>module load some-package/specific-version</tt> will load a specific version of that package. This makes it very easy for different users to use different versions of compilers, MPI versions, libraries etc.

Note that to use even the gcc compilers you will have to do
<pre>
$ module load gcc
</pre>

but in fact you probably should use the intel compilers installed on this system as they usually produce faster code (and sometimes, much faster.)

A list of the installed software is available in [[Software_and_Libraries | Software & Libraries]] and can
be seen on the system by typing
<pre>
$ module avail
</pre>

To load a module (for example, the default version of the intel compilers)
<pre>
$ module load intel
</pre>
To unload a module
<pre>
$ module unload intel
</pre>
To unload all modules
<pre>
$ module purge
</pre>

These commands should go in your .bashrc files and/or in your submission scripts to make sure you
are using the correct packages.

===Compilers===

The intel compilers are icc/icpc/ifort for C/C++/Fortran, and are available with the default module "intel". The intel compilers are recommended over the GNU compilers. Documentation about icpc is available at
http://software.intel.com/en-us/articles/intel-software-technical-documentation/. The Intel compilers accept many of the options that the GNU compilers accept, but tend to produce faster programs on our system. If, for some reason, you really need the GNU compilers, the latest version of the GNU compiler collection (currently 4.4.0) is available by loading the "gcc" module, with gcc/g++/gfortran for C/C++/Fortran. Note that f77/g77 is not supported.

To ensure that the intel compilers are in your <tt>PATH</tt> and their libraries are in your <tt>LD_LIBRARY_PATH</tt>, use the command

<pre>
$ module load intel
</pre>

This should likely go in your <tt>.bashrc</tt> file so that it will automatically be loaded.

Optimize your code for the GPC machine using of at least the following compiler flags:
<pre>
-O3 -xHost
</pre>
(or <tt>-O3 -march=native</tt> for the GNU compilers).

*If your program uses openmp, add <tt>-openmp</tt> (<tt>-fopenmp</tt> for GNU compilers).
*If you get the warning <tt>feupdatreenv is not implemented</tt>, add -limf to the link line.
*If you need to link in the MKL libraries, you are well advised to use the Intel(R) Math Kernel Library Link Line Advisor: http://software.intel.com/en-us/articles/intel-mkl-link-line-advisor/ for help in devising the list of libraries to link with your code.

===[[ GPC_MPI_Versions | MPI]]===

SciNet currently provides multiple MPI libraries for the GPC; [http://www.open-mpi.org/ OpenMPI], and [http://software.intel.com/en-us/intel-mpi-library/ IntelMPI]. We currently recommend OpenMPI as the default, as it quite reliably demonstrates good performance on both the infiniband and ethernet networks. For full details and options see the complete [[ GPC_MPI_Versions | '''MPI''']] section.

The MPI libraries are compiled with both the gnu compiler suite and the intel compiler suite. To use (for instance) the intel-compiled OpenMPI libraries, which we recommend as the default (and use for most of our examples here), use

<pre>
$ module load openmpi intel
</pre>

in your <tt>.bashrc</tt>. Other combinations behave similarly.

The MPI libraries define the wrappers mpicc/mpicxx/mpif90/mpif77 as wrappers around the appropriate compilers, which ensure the appropriate include and library directories and used in the compilation and linking steps.

We currently recommend the Intel + OpenMPI combination. However, if you require the GNU compilers as well as MPI, you would want to find the most recent openmpi module available with `gcc' in the version name. This will enable development and runtime with gcc/g++/gfortran and OpenMPI. You can make this your default by putting the module load line in your ~/.bashrc file.

For mixed OpenMP/MPI code using Intel MPI, add the compilation flag -mt_mpi for full thread-safety.

===Submitting A Batch Job===

The SciNet machines are shared systems, and jobs that are to run on them are submitted to a queue; the
[[Moab | scheduler]] then orders the jobs in order to make the best use of the machine, and has them launched
when resources become available. The intervention of the scheduler can mean that the jobs aren't
quite run in a first-in first-out order.

The maximum [[wallclock time]] for a job in the queue is 48 hours; computations that will take longer than
this must be broken into 48-hour chunks and run as several jobs. The usual way to do this is with [[checkpoints]],
writing out the complete state of the computation every so often in such a way that a job can be restarted from
this state information and continue on from where it left off. Generating [[checkpoints]] is a good idea anyway,
as in the unlikely event of a hardware failure during your run, it allows you to restart without having lost much work.

There are limits to how many jobs you can submit. If your group has a default account, up to 32 nodes at a time for 48 hours per job on the GPC cluster are allowed to be queued. This is a total limit, e.g., you could request 64 nodes for 24 hours. Jobs of users with an LRAC or NRAC allocation will run at a higher priority than others while their resources last. Because of the group-based allocation, it is conceivable that your jobs won't run if your colleagues have already exhausted your group's limits.
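
The "total limit" arithmetic above can be sketched as a node-hours check. This assumes the default limit is read as 32 nodes x 48 hours = 1536 node-hours, with no single job longer than 48 hours; the helper name is hypothetical and the scheduler's actual accounting may differ.

```bash
# Assumption: the default limit is 32 nodes * 48 hours = 1536 node-hours.
DEFAULT_NODE_HOURS=$(( 32 * 48 ))

# within_default_limit NODES HOURS   (hypothetical helper)
# Succeeds if the request fits the node-hour budget and the 48-hour
# per-job wall time limit.
within_default_limit() {
    [ "$2" -le 48 ] && [ $(( $1 * $2 )) -le "$DEFAULT_NODE_HOURS" ]
}

for request in "32 48" "64 24" "64 48"; do
    set -- $request            # split "NODES HOURS" into $1 and $2
    if within_default_limit "$1" "$2"; then
        echo "$1 nodes x $2 h: within the default limit"
    else
        echo "$1 nodes x $2 h: over the default limit"
    fi
done
```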

Note that scheduling big jobs greatly affects the queuer and other users, so you have to talk to us first to run massively parallel jobs (> 2048 cores). We will help make sure that your jobs start and run efficiently.

If your job should run in fewer than 48 hours, specify that in your script -- your job
will start sooner. (It's easier for the [[Moab | scheduler]] to fit in a short job than a long job). On the downside, the
job will be killed automatically by the queue manager software at the end of the specified [[wallclock time]], so if you
guess wrong you might lose some work. So the standard procedure is to estimate how long your job will take and
add 10% or so.
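
The "estimate plus 10%" rule can be written as a small helper that turns an estimated run time in seconds into a value for <tt>walltime=</tt>. This is only a sketch: the function name is hypothetical, and capping at 48 hours is an assumption based on the queue limit described above.

```bash
# pad_walltime SECONDS [PERCENT]   (hypothetical helper)
# Adds PERCENT (default 10) to an estimated run time, caps the result at
# the 48-hour queue limit, and prints it as H:MM:SS.
pad_walltime() {
    est=$1
    pct=${2:-10}
    total=$(( est + (est * pct + 99) / 100 ))   # round the margin up
    cap=$(( 48 * 3600 ))
    if [ "$total" -gt "$cap" ]; then
        total=$cap
    fi
    printf '%d:%02d:%02d\n' $(( total / 3600 )) $(( total % 3600 / 60 )) $(( total % 60 ))
}

pad_walltime 3600      # a 1-hour estimate   -> 1:06:00
pad_walltime 169200    # a 47-hour estimate  -> 48:00:00 (capped)
```

A job whose padded estimate hits the cap is a candidate for being split into several checkpointed runs.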

You interact with the queuing system through the queue/resource manager, [[Moab | Moab]] and [[Moab | Torque]]. To see all the jobs in the queue use
<pre>
$ showq
</pre>

To submit your own job, you must write a script which describes the job and how it is to be run (a sample script [[GPC_Quickstart#Submission_Script | follows]]) and submit it to the queue, using the command
<pre>
$ qsub [SCRIPT-FILE-NAME]
</pre>
where you will replace <tt>[SCRIPT-FILE-NAME]</tt> with the file containing the submission script. This will return a job ID, for example 31415, which is used to identify the jobs. Information about a queued job can be found using
<pre>
$ checkjob [JOB-ID]
</pre>
and jobs can be canceled with the command
<pre>
$ canceljob [JOB-ID]
</pre>

Again, these commands have many options, which can be read about on their man pages.

Much more information on the queueing system is available on our [[Moab | queue]] page.

====Batch Submission Script: MPI====

A sample submission script for an MPI job using ethernet is shown below, with the <tt>#PBS</tt> directives at the top and the rest being
what will be executed on the compute node.

<source lang="bash">
#!/bin/bash
# MOAB/Torque submission script for SciNet GPC (ethernet)
#
#PBS -l nodes=2:ppn=8,walltime=1:00:00
#PBS -N test

# DIRECTORY TO RUN - $PBS_O_WORKDIR is directory job was submitted from
cd $PBS_O_WORKDIR

# EXECUTION COMMAND; -np = nodes*ppn
mpirun -np 16 -hostfile $PBS_NODEFILE ./a.out
</source>

The lines that begin <tt>#PBS</tt> are commands that are parsed and interpreted by qsub at submission time, and control administrative things about your job. In this example, the script above requests two nodes, using 8 processors per node, for a [[wallclock time]] of one hour. (The resources required by the job are listed on the <tt>#PBS -l</tt> line.) Other options can be given in other <tt>#PBS</tt> lines, such as <tt>#PBS -N</tt>, which sets the name of the job.

The rest of the script is run as a bash script at run time. A bash shell on the first node of the two nodes that are requested executes these commands as a normal bash script, just as if you had run this as a shell script from the terminal. The only difference is that PBS sets certain environment variables that you can use in the script. <tt>$PBS_O_WORKDIR</tt> is set to be the directory that the command was 'submitted' from - eg, <tt>/scratch/USER/SOMEDIRECTORY</tt> - and <tt>$PBS_NODEFILE</tt> is the name of a file which contains all the nodes on which programs should execute. Using these environment variables, the script then uses the <tt>mpirun</tt> command to launch the job. Assumed here is that the user has a line like

<pre>
$ module load openmpi intel
</pre>

in their <tt>.bashrc</tt>.
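
Rather than hard-coding <tt>-np 16</tt>, the process count can be derived from <tt>$PBS_NODEFILE</tt>. The sketch below assumes the node file contains one line per requested core (so <tt>nodes=2:ppn=8</tt> gives 16 lines); the helper name and the stand-in host names are made up so that the example can be tried off the cluster.

```bash
# count_slots FILE   (hypothetical helper)
# Prints the number of lines in FILE, i.e. the number of MPI processes.
count_slots() {
    n=$(wc -l < "$1")
    echo $(( n ))          # arithmetic expansion strips any padding from wc
}

# Stand-in node file: two made-up host names, eight slots each.
nodefile=$(mktemp)
for host in nodeA nodeB; do
    i=0
    while [ "$i" -lt 8 ]; do
        echo "$host"
        i=$(( i + 1 ))
    done
done > "$nodefile"

# Print the launch line instead of running it.
echo "mpirun -np $(count_slots "$nodefile") -hostfile $nodefile ./a.out"
rm -f "$nodefile"
```

In a real job script the argument would be <tt>$PBS_NODEFILE</tt>, so the script keeps working when the <tt>#PBS -l</tt> line changes.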

* Note: The different versions of MPI require different commands to launch the run, and thus different scripts. The above script is specific for the openmpi module. For the intelmpi module, the last line of the script should read
<pre>
$ mpirun -r ssh -np 16 -env I_MPI_DEVICE ssm ./a.out
</pre>

====Submitting Collections of Serial Jobs====

SciNet-approved methods for running collections of serial jobs can be found on the [[User_Serial|serial run wiki page]].

====Batch Submission Script: OpenMP====

For running OpenMP jobs, the procedure is similar to that for MPI jobs:

<source lang="bash">
#!/bin/bash
# MOAB/Torque submission script for SciNet GPC (OpenMP)
#
#PBS -l nodes=1:ppn=8,walltime=1:00:00
#PBS -N test

# DIRECTORY TO RUN - $PBS_O_WORKDIR is directory job was submitted from
cd $PBS_O_WORKDIR

export OMP_NUM_THREADS=8
./a.out
</source>

Note that [[Introduction_To_Performance#Throughput | in some circumstances]] it can be more efficient to run (say) two jobs each running on four threads than one job running on eight threads. In that case you can use the same `ampersand-and-wait' technique outlined for serial jobs (see [[User_Serial|serial run wiki page]]) for less-than-eight-core OpenMP jobs.
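
A minimal sketch of the `ampersand-and-wait' pattern for two four-thread runs on one node is shown below. <tt>run_case</tt> is a stand-in for your own OpenMP executable and the temporary output directory is an assumption; on the GPC the output would go somewhere under /scratch.

```bash
# run_case is a stand-in for your OpenMP executable (e.g. ./a.out);
# here it only reports what it was asked to do.
run_case() {
    echo "case $1 running with $OMP_NUM_THREADS threads"
}

outdir=$(mktemp -d)          # placeholder for a directory under /scratch
export OMP_NUM_THREADS=4     # 2 runs x 4 threads = 8 cores

# Start both runs in the background, then wait for both to finish
# so that the script does not exit while they are still running.
run_case A > "$outdir/caseA.log" &
run_case B > "$outdir/caseB.log" &
wait

cat "$outdir/caseA.log" "$outdir/caseB.log"
```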

====Hybrid MPI/OpenMP jobs====

=====Using Intel MPI=====
Here is how to run hybrid codes using Intel MPI:

http://software.intel.com/en-us/articles/hybrid-applications-intelmpi-openmp/

Make sure you compile with the -mt_mpi option to the compilers to use the thread safe libraries.
Set the environment variable I_MPI_PIN_DOMAIN:
<pre>
$ export I_MPI_PIN_DOMAIN=omp
</pre>
This will set the process pinning domain size to be equal to OMP_NUM_THREADS (which you should set to the desired number of threads per MPI process). Therefore, each MPI process can create $OMP_NUM_THREADS child threads for running within the corresponding domain. If OMP_NUM_THREADS is not set, each node is treated as a separate domain (which will allow as many threads per MPI process as there are cores).

In addition, when invoking mpirun, you should add the argument "-ppn X", where X is the number of MPI processes per node.
For example:
<pre>
$ mpirun -r ssh -ppn 2 -np 8 [executable]
</pre>
would start 2 mpi processes of <tt>[executable]</tt> per node for a total of 8 processes, so mpirun will try to run mpi processes on 4 nodes
(OMP_NUM_THREADS is then probably best set at 4).
Your job script should still ask for these 4 nodes with the line
<source lang="bash">
#PBS -l nodes=4:ppn=8,walltime=....
</source>
(<tt>ppn=8</tt> is not a mistake here; the ppn parameter has a different meaning for PBS and for mpirun)

''The ppn parameter to ''mpirun'' is very important! Without it, all eight MPI processes would get bunched on the first node in this example, leaving 3 nodes unused.''

NOTE: In order to pin OpenMP threads inside the domain, use the corresponding OpenMP feature by setting the KMP_AFFINITY environment variable; see [http://software.intel.com/sites/products/documentation/hpc/compilerpro/en-us/fortran/lin/compiler_f/optaps/common/optaps_openmp_thread_affinity.htm#KMP_AFFINITY_Environment_Variable Intel's Compiler User and Reference Guide].

The IntelMPI manual is referenced on the front page of our wiki:

http://software.intel.com/sites/products/documentation/hpc/mpi/linux/reference_manual.pdf

For the above example of a total of 8 processes on 4 nodes, you could use the following script:
<source lang="bash">
#!/bin/bash
# MOAB/Torque submission script for SciNet GPC (hybrid job)
#
#PBS -l nodes=4:ppn=8,walltime=1:00:00
#PBS -N test

# DIRECTORY TO RUN - $PBS_O_WORKDIR is directory job was submitted from
cd $PBS_O_WORKDIR

# SET THE NUMBER OF THREADS PER PROCESS:
export OMP_NUM_THREADS=4

# PIN THE MPI DOMAINS ACCORDING TO OMP
export I_MPI_PIN_DOMAIN=omp

# EXECUTION COMMAND; -np = nodes * MPI processes per node (4*2), not nodes*ppn from the #PBS line
mpirun -r ssh -ppn 2 -np 8 ./a.out
</source>

=====Using Open MPI=====

For mixed MPI/OpenMP jobs using OpenMPI, which is the default for many users, the procedure is similar, but details differ.

* Request the number of nodes in the PBS script.
* Set OMP_NUM_THREADS to the number of threads per MPI process.
* In addition to the -np parameter for mpirun, add the argument <tt>--bynode</tt>, so that the mpi processes are not bunched up.

So for example, to start a total of 8 processes on 4 nodes, you could use the following script
<source lang="bash">
#!/bin/bash
# MOAB/Torque submission script for SciNet GPC (hybrid job)
#
#PBS -l nodes=4:ppn=8,walltime=1:00:00
#PBS -N test

# DIRECTORY TO RUN - $PBS_O_WORKDIR is directory job was submitted from
cd $PBS_O_WORKDIR

# SET THE NUMBER OF THREADS PER PROCESS:
export OMP_NUM_THREADS=4

# EXECUTION COMMAND; -np = nodes*processes_per_node; --bynode forces a round robin of nodes.
mpirun -np 8 --bynode -hostfile $PBS_NODEFILE ./a.out
</source>
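
The arithmetic behind both hybrid scripts (total MPI processes, and threads per process so that every core is used exactly once) can be sketched as follows. The helper name is hypothetical, and the default of 8 cores per node is taken from the GPC node description.

```bash
# hybrid_layout NODES PROCS_PER_NODE [CORES_PER_NODE]   (hypothetical helper)
# Prints the -np value for mpirun and the matching OMP_NUM_THREADS.
hybrid_layout() {
    nodes=$1
    ppn_mpi=$2
    cores=${3:-8}                       # 8 cores per GPC node
    if [ $(( cores % ppn_mpi )) -ne 0 ]; then
        echo "cores per node ($cores) is not a multiple of $ppn_mpi" >&2
        return 1
    fi
    echo "np=$(( nodes * ppn_mpi )) threads=$(( cores / ppn_mpi ))"
}

hybrid_layout 4 2    # the example above: np=8 threads=4
```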

===Submitting an Interactive (Debug) Job===

It is sometimes convenient to run a job interactively; this can be very handy for debugging purposes. In this case, you type a <tt>qsub</tt> command which submits an interactive job to the queue; when the scheduler selects this job to run, then it starts a shell running on the first node of the job, which connects to your terminal. You can then type any series of commands (for instance, the same commands listed as in the batch submission script above) to run a job interactively.

For example, to start the same sort of job as in the batch submission script above, but interactively, one would type

<pre>
$ qsub -I -l nodes=2:ppn=8,walltime=1:00:00
</pre>

This is exactly the <tt>#PBS -l</tt> line in the batch script above (which requests all 8 processors on each of 2 nodes for one hour), but prepended with a <tt>-I</tt> for `interactive'. When this job begins, your terminal will show you as being logged in to one of the compute nodes, and you can type in any shell command, run <tt>mpirun</tt>, etc. When you exit the shell, the job will end. Interactive jobs can be used with any of the [[ Moab#GPC | GPC queues ]]; however, there is a short,
high-turnover queue called [[ Moab#debug | debug ]] which can be especially useful when the system is busy.

===Ethernet vs. Infiniband===

About 1/4 of the GPC (862 nodes or 6896 cores) is connected with a high bandwidth low-latency fabric called
[http://en.wikipedia.org/wiki/InfiniBand InfiniBand]. Many jobs which require tight coupling to scale well greatly benefit from this interconnect;
other types of jobs, which have relatively modest communications, do not require this and run fine on Gigabit ethernet.

Jobs which require the InfiniBand for good performance can request the nodes that have the `<tt>ib</tt>' feature in the <tt>#PBS -l</tt> line,
<source lang="bash">
#PBS -l nodes=2:ib:ppn=8,walltime=1:00:00
</source>

Because there are a limited number of these nodes, your job will start sooner if you do not request them (e.g. if you use the scripts as shown above), as this increases the number of nodes available to run your job. In fact, the InfiniBand nodes are to be used only for jobs that are known to scale well and will benefit from this type of interconnect. As such, at least 2 nodes must be requested, as single-node jobs will not benefit from using an
InfiniBand node. The MPI libraries provided by SciNet automatically use the correct interconnect, InfiniBand or ethernet, depending on which nodes your job runs on.

===HyperThreading===

Each GPC compute node has 8 Nehalem cores (2 sockets each with a four-core Intel Xeon E5540 @ 2.53GHz). Thus, to make full use of the computing power of a GPC node, you must be running at least 8 "tasks" -- MPI processes, or OpenMP threads.

Under most circumstances, running exactly 8 tasks is the most efficient way to use these nodes. However, sometimes software design (eg, having one thread for communication and one for computation) can usefully `oversubscribe' the number of physical cores, and running (say) twice as many tasks as cores can be a useful strategy. If your code is highly memory-bandwidth bound, having one task ready to run while another waits for memory access can make more effective use of the processor.

The Nehalem processors have hardware support for such two-way overloading of processors, through "HyperThreading"; there is an extra set of registers on each core to facilitate rapid switching between two tasks, making it look to the operating system as if there are in fact 16 cores per node. Depending on the nature of your code, making use of these virtual extra cores may speed up or slow down your computation; you should run small test cases before running production jobs in this manner. In most cases, the speed difference will be under 10%. Some of our users have obtained an 8% speedup by running gromacs with 16 tasks instead of 8 on a single node (mpirun -np 16 ./gromacs/mdrun -npme 4 is 108% the speed of mpirun -np 8 ./gromacs/mdrun with -npme 2 or -1).

====HyperThreading with OpenMP====

To use hyperthreading with an OpenMP job, one just runs twice as many threads as one would have previously; eg, if you were running 8 threads before (<tt>export OMP_NUM_THREADS=8</tt>) you would run with 16 (<tt>export OMP_NUM_THREADS=16</tt>). Everything else remains the same, including the job submission script; one still uses <tt>ppn=8</tt> in the submission of the job, as Torque has no way of knowing (or reason for caring) that you will be running on 16 `virtual' cores rather than 8 physical cores.

====HyperThreading with MPI====

To use hyperthreading with an MPI job, one just runs twice as many MPI processes as one would have previously; eg, if you were running on three nodes using 8 MPI tasks per node and used <tt>mpirun ... -np 24</tt>, you could run instead with <tt>-np 48</tt>. Everything else remains the same, including the job submission script; one still uses <tt>ppn=8</tt> in the submission of the job, as Torque has no way of knowing (or reason for caring) that you will be running on 16 `virtual' cores rather than 8 physical cores.

Note that if you are using OpenMPI (as is the default), there is another consideration; OpenMPI assumes that there is no oversubscription and each task very aggressively makes full use of a core when it is waiting for a message (eg, the waits are "busywaits"). If you find a significant slowdown when running multiple MPI tasks per core with OpenMPI, you may want to try adding the additional option to mpirun: <tt>--mca mpi_yield_when_idle 1</tt>. This will increase the latency of individual messages, but free up the core to do additional work while waiting.

With IntelMPI, the problem should be less pronounced, but you can still improve things by using <tt>mpirun -genv I_MPI_SPIN_COUNT 1 ...</tt>

=====Examples=====
Hyperthreading using gromacs: https://support.scinet.utoronto.ca/wiki/index.php/Gromacs#Hyperthreading_with_Gromacs

====HyperThreading with Hybrid MPI/OpenMP codes====

With a hybrid code, one has extra flexibility in how to assign the "extra" cores -- you could run extra MPI tasks or extra OpenMP threads. As with all hybrid codes, the combination which results in the best performance depends very strongly on the nature of your code, and you should experiment with different combinations. In addition, with hybrid codes processor and memory affinity issues become very important; if you're unsure as to how to tune your application for best performance, please make an appointment with the SciNet technical analysts for more help.

===Memory Configuration===

==== 16G ====

There are 3756 nodes which have 16G of memory; this is the primary configuration in the GPC. These nodes will be used by default.

==== 18G ====

There are 24 Infiniband nodes which have 18G of memory. These nodes have a fully populated memory configuration that maximizes memory bandwidth. To
request these nodes use:

<pre>
$ qsub -l nodes=2:ib:m18g:ppn=8,walltime=1:00:00
</pre>

==== 32G ====

There are 84 Infiniband nodes which have 32G of memory. To request these nodes use:

<pre>
$ qsub -l nodes=2:ib:m32g:ppn=8,walltime=1:00:00
</pre>
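
The resource strings used so far all follow the same pattern: node count, optional features such as <tt>ib</tt>, <tt>m18g</tt> or <tt>m32g</tt>, then <tt>ppn=8</tt> and the wall time. A hypothetical helper that assembles such a string might look like this (the function name and argument order are assumptions for illustration):

```bash
# pbs_resources NODES WALLTIME [FEATURE...]   (hypothetical helper)
# Builds the argument for "qsub -l" or "#PBS -l", with ppn fixed at 8.
pbs_resources() {
    nodes=$1
    walltime=$2
    shift 2
    spec="nodes=$nodes"
    for feature in "$@"; do
        spec="$spec:$feature"          # e.g. ib, m18g, m32g
    done
    echo "$spec:ppn=8,walltime=$walltime"
}

pbs_resources 2 1:00:00            # default (16G) nodes
pbs_resources 2 1:00:00 ib m32g    # InfiniBand nodes with 32G of memory
```

The output of the second call matches the 32G request shown above and could be passed to <tt>qsub -l</tt> directly.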

==== 128G ====
There are two stand-alone large memory (128GB) nodes, <tt>gpc-lrgmem01</tt> and <tt>gpc-lrgmem02</tt> which are primarily to be used for data analysis of runs. They have 16 cores and are intel machines running linux, but they are not the same architecture (Nehalem) as the GPC compute nodes, so codes may have to be compiled separately for these machines. They can be accessed using a specific <tt>largemem</tt> queue.

<pre>
$ qsub -l nodes=2:ppn=8,walltime=1:00:00 -q largemem -I
</pre>

===Ram Disk===

On the GPC nodes, there is a `ram disk' available - up to half of the memory on the node may be used as a temporary file system. This is particularly useful in the early stages of migrating desktop-computing codes to a High Performance Computing platform such as the GPC. It is much faster than real disk and does not require network traffic; however, each node sees its own ramdisk and cannot see files on that of other nodes. This is a very easy way to cache writes (by writing them to fast ram disk instead of slow `real' disk); one would then periodically copy the files to /scratch or /project so that they are available after the job has completed.

To use the ramdisk, create, read from, and write to files in /dev/shm/.. just as one would in (eg) /scratch/USER/. Only the amount of RAM needed to store the files will be taken up by the temporary file system; thus if you have 8 serial jobs each requiring 1 GB of RAM, and 1GB is taken up by various OS services, you would still have approximately 7GB available to use as ramdisk on a 16GB node. However, if you were to write 8 GB of data to the RAM disk, this would exceed available memory and your job would likely crash.

NOTE: it is very important to delete your files from ram disk at the end of your job. If you do not do this, the next user to use that node will have less RAM available than they might expect, and this might kill their jobs.

More details on how to setup your script to use the ramdisk can be found on the [[User_Ramdisk|Ramdisk wiki page]].
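
As a minimal sketch of the clean-up advice above, the pattern below works in a private directory on the ram disk, copies the result to permanent storage, and removes the ram disk files even if the run fails part-way. The helper name, file names and the fallback to a temporary directory (so the sketch can be tried off the cluster) are assumptions, not SciNet conventions.

```bash
# Assumption: RAMDISK_BASE defaults to /dev/shm as on the GPC; if that is
# not writable (e.g. on a laptop), fall back to a temporary directory.
RAMDISK_BASE=${RAMDISK_BASE:-/dev/shm}
if [ ! -d "$RAMDISK_BASE" ] || [ ! -w "$RAMDISK_BASE" ]; then
    RAMDISK_BASE=${TMPDIR:-/tmp}
fi

# run_on_ramdisk DEST   (hypothetical helper)
# The body runs in a subshell; the EXIT trap removes the ram disk files
# when that subshell ends, whether or not the commands succeeded.
run_on_ramdisk() (
    work=$(mktemp -d "$RAMDISK_BASE/job.XXXXXX") || exit 1
    trap 'rm -rf "$work"' EXIT
    echo "simulation output" > "$work/result.dat"   # stand-in for your program
    cp "$work/result.dat" "$1/"                     # save to e.g. /scratch/USER
    echo "$work"                                    # report the directory used
)

dest=$(mktemp -d)                 # placeholder for a directory under /scratch
used=$(run_on_ramdisk "$dest")
ls "$dest"
```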

=== Managing jobs on the Queuing system ===
Information on checking available resources, and on starting, viewing, managing and canceling jobs, can be found on the [[Moab | Moab/Torque]] page.

Gromacs

2010-08-09T16:11:29Z

Cneale: /* Hyperthreading with Gromacs */

Download and general information: http://www.gromacs.org

Search the mailing list archives: http://www.gromacs.org/Support/Mailing_Lists/Search

=Peculiarities of running single node GROMACS jobs on SCINET=
This is '''VERY IMPORTANT !!!'''
Please read the [https://support.scinet.utoronto.ca/wiki/index.php/User_Tips#Running_single_node_MPI_jobs relevant user tips section] for information that is essential for your single node (up to 8 core) MPI GROMACS jobs.

-- [[User:Cneale|cneale]] 14 September 2009

=Compiling GROMACS on SciNet=
Please refer to the [[Compiling_Gromacs|GROMACS compilation page]]

=Submitting GROMACS jobs on SciNet=
Please refer to the [[Running_Gromacs|GROMACS submission page]]

-- [[User:Cneale|cneale]] 18 August 2009
=GROMACS benchmarks on Scinet=

This is a rudimentary list of scaling information.

I have a 50K atom system running performance on GPC right now. On 56
cores connected with IB I am getting 55 ns/day. I set up 50 such
simulations, each with 2 proteins in a bilayer and I'm getting a total
of 5.5 us per day. I am using gromacs 4.0.5 and a 5
fs timestep by fixing the bond lengths and all angles involving
hydrogen.

I can get about 12 ns/day on 8 cores of the non-IB part of GPC -- also
excellent.

As for larger systems, my speedup over saw.sharcnet.ca for a 1e6 atom
system is only 1.2x running on 128 cores in single precision. Although saw.sharcnet.ca
is composed of Xeons, they are running at 2.83 GHz (https://www.sharcnet.ca/my/systems/show/41), which is a
faster clock speed than the 2.5 GHz of the SciNet processors, which use Intel's next-generation x86 CPU architecture.
While GROMACS is generally not excellent for scaling up to or beyond 128 cores (even for large systems),
our benchmarking of this system on saw.sharcnet.ca indicated that it was running at about 65% efficiency.
Benchmarking was also done on Scinet for this system, but was not recorded as we were mostly tinkering with the
-npme option to mdrun in an attempt to optimize it. My recollection, though, is that the scaling was similar on scinet.

-- [[User:Cneale|cneale]] 19 August 2009
=Strong scaling for GROMACS on GPC=

Requested, and on our list to complete, but not yet available in a complete chart form.

-- [[User:Cneale|cneale]] 19 August 2009
=Scientific studies being carried out using GROMACS on GPC=

Requested, but not yet available

-- [[User:Cneale|cneale]] 19 August 2009

=Hyperthreading with Gromacs=
Using -np 16 on an 8 core box and optimizing -npme, I get an 8% to 18% performance increase
compared to -np 8 with optimized -npme (using gromacs 4.0.7).
I now regularly overload the number of processes.

Selected examples:

System A with 250,000 atoms:
 mdrun -np 8  -npme -1    1.15 ns/day
 mdrun -np 8  -npme 2     1.02 ns/day
 mdrun -np 16 -npme 2     0.99 ns/day
 mdrun -np 16 -npme 4     1.36 ns/day <-- 118 % performance vs 1.15 ns/day
 mdrun -np 15 -npme 3     1.32 ns/day

System B with 35,000 atoms (4 fs timestep):
 mdrun -np 8  -npme -1   22.66 ns/day
 mdrun -np 8  -npme 2    23.06 ns/day
 mdrun -np 16 -npme -1   22.69 ns/day
 mdrun -np 16 -npme 4    24.90 ns/day <-- 108 % performance vs 23.06 ns/day
 mdrun -np 56 -npme 16   14.15 ns/day

Cutoffs and timesteps differ between these runs, but both use PME and
explicit water.
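
The percentages quoted in the examples are simply the ratio of the two ns/day figures. A small helper (hypothetical name) reproduces them; awk is used because the shell's own arithmetic is integer-only.

```bash
# rel_perf NEW OLD   (hypothetical helper)
# Prints NEW as a whole-number percentage of OLD.
rel_perf() {
    awk -v new="$1" -v old="$2" 'BEGIN { printf "%.0f\n", 100 * new / old }'
}

rel_perf 1.36 1.15      # System A: -np 16 -npme 4 vs -np 8 -npme -1 -> 118
rel_perf 24.90 23.06    # System B: -np 16 -npme 4 vs -np 8 -npme 2  -> 108
```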

And according to gromacs developer Berk Hess ( http://lists.gromacs.org/pipermail/gmx-users/2010-August/053033.html )

"In Gromacs 4.5 there is no difference [between -np and -nt based hyperthreading], since it does not use real thread parallelization.
Gromacs 4.5 has a built-in threaded MPI library, but openmpi also has an efficient
MPI implementation for shared memory machines. But even with proper thread
parallelization I expect the same 15 to 20% performance improvement."

Cneale: /* Checking on the remaining walltime from within a job */

__FORCETOC__
==Running single node MPI jobs==
In order to run GROMACS on a single node, the following two things are '''essential'''. If you do not include these two things, then some of your jobs will run fine, but others will run slowly, and others will produce only the beginning of a short log file and no further output, even though they will continue to occupy the resources fully.

1. add :compute-eth: to your #PBS -l line
<source lang="sh">
#PBS -l nodes=1:compute-eth:ppn=8,walltime=3:00:00,os=centos53computeA
</source>
2. add -mca btl_sm_num_fifos 7 -np $(wc -l $PBS_NODEFILE | gawk '{print $1}') -mca btl self,sm to the mpirun arguments.

It appears to be important that you put the -np argument between the two -mca arguments.
<source lang="sh">
/scinet/gpc/mpi/openmpi/1.3.2-intel-v11.0-ofed/bin/mpirun -mca btl_sm_num_fifos 7 \
  -np $(wc -l $PBS_NODEFILE | gawk '{print $1}') -mca btl self,sm -machinefile $PBS_NODEFILE \
  /scratch/cneale/GPC/exe/intel/gromacs-4.0.5/exec/bin/mdrun_openmpi -deffnm test
</source>
''Historical Note:''
Another solution is to use -mca btl self,tcp instead of what is listed above. This, however, forces your communication to go over sockets and is less efficient than using shared memory. If you want to try it, the code is below. However, for smaller systems (<100,000 atoms) you will see a 1% to 5% performance reduction in comparison to the code listed above.
<source lang="sh">
/scinet/gpc/mpi/openmpi/1.3.2-intel-v11.0-ofed/bin/mpirun --mca btl self,tcp \
  -np $(wc -l $PBS_NODEFILE | gawk '{print $1}') -machinefile $PBS_NODEFILE \
  /scratch/cneale/GPC/exe/intel/gromacs-4.0.5/exec/bin/mdrun_openmpi -deffnm test
</source>

We are not exactly sure why this is required, or if it is required for programs other than GROMACS. However, you are strongly recommended to add this to any such script as it should only force you to get what you intend to get in any event. Refer to the section entitled "Ensuring that you get non-IB nodes" below for more information about what these commands do.

[[User:Cneale|cneale]] September 14 2009 (Updated September 22 by cneale)

Currently the reason for the second point above is that the shared-memory communication in OpenMPI seems to be buggy, at least when the code is compiled with gcc. So instead of the (default) option "--mca btl self,sm", which often causes the code to fail, we use the "--mca btl self,tcp" option, which forces communication to go via tcp. The tcp option is slower than the sm option, but at least for now it works.

[[User:Dgruner|dgruner]] September 21 2009

Note: This bugginess with the shared memory transport in OpenMPI 1.3.2 and 1.3.3 with gcc has been resolved with the new default openmpi, 1.4.1; please use that instead. Also note that with the newest versions of openmpi, you do not need a -hostfile or -machinefile entry.

[[User:Ljdursi|Ljdursi]] 17:16, 25 February 2010 (UTC)

==Benchmarking==
===Ensuring that you get non-IB nodes===
You can specify gigE only nodes using a "compute-eth" flag

nodes=2:compute-eth:ppn=8

and this will only allow the code to run on "gigabit only"
nodes. So even if IB nodes are available, the job will wait in the queue until gigabit-only nodes are free.

By default (i.e. with no property feature for the node) the scheduler (moab) is set up to use the gigE nodes first, then the IB nodes. The scheduler configuration is ongoing, but explicitly putting either "compute-eth" for ethernet or "ib" for InfiniBand nodes will guarantee the right type of node is used.

Also you can specify the type of interconnect directly on the mpirun line using mpirun --mca btl self,tcp for ethernet, so even if it was on an IB node it would still use ethernet for communication. Since the nodes are exactly the same except for the IB card, any benchmarking would still be valid.

[[User:northrup|Scott]] August 27 2009

==Advanced interactions with PBS or MOAB==
===Checking on the remaining walltime from within a job===

There are a number of options for doing this.

1. Use start=$(date +%s) to capture the start time of your script, and later calculate the number of seconds that have elapsed, like this:

 #!/bin/bash
 start=$(date +%s)
 ...
 ...
 now=$(date +%s)
 # bc is not available on nodes so must use awk
 timeUsed=$(echo "$now $start"|awk '{print $1-$2}')

2. One can use checkjob, but be aware that it may fail and gpc01 may be down, so one needs to handle that condition in the script.

 # This returns the seconds REMAINING:
 val=""
 while [ -z "$val" ]; do
     val=$(ssh gpc01 "checkjob $PBS_JOBID" 2>/dev/null|grep Reservation|awk '{print $5}'|awk -F ':' '{print $1*3600+$2*60+$3}')
 done
 echo "$val"

3. qstat is better, because it fails less often than checkjob. checkjob is a moab command, which can fail much more often than qstat (a PBS command) when moab is busy scheduling a large number of jobs. Nevertheless, this command can also fail, so protect it like this:

 # This returns the seconds USED (and it only updates every few minutes):
 val=""
 while [ -z "$val" ]; do
     val=$(qstat -f "$PBS_JOBID" 2>/dev/null|egrep resources_used.walltime|awk '{print $3}'|awk -F ':' '{print $1*3600+$2*60+$3}')
 done
 echo "$val"

4. To be independent of the qstat or checkjob commands, one possibility is to parse the output of ps (see man ps for more detail). For example,

 ps -eo pid,etime,args|egrep /var/spool/torque/mom_priv/jobs/$PBS_JOBID | egrep -v egrep| ...

Although we're still not exactly sure how to get the time out of this. If you know, then please add it!
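
One possible way to finish the <tt>ps</tt> approach, offered as a sketch that has not been tested on the GPC: <tt>ps</tt> prints <tt>etime</tt> in the form <tt>[[DD-]hh:]mm:ss</tt>, so the field can be converted to seconds with awk, in the same style as the commands above. The function name is hypothetical. With the column order <tt>pid,etime,args</tt>, the elapsed time would be expected to be the second field of the matching line.

```bash
# etime_to_seconds ETIME   (hypothetical helper)
# Converts the ps "etime" format [[DD-]hh:]mm:ss into seconds.
etime_to_seconds() {
    echo "$1" | awk -F'[-:]' '{
        if (NF == 4)      print $1*86400 + $2*3600 + $3*60 + $4
        else if (NF == 3) print $1*3600 + $2*60 + $3
        else if (NF == 2) print $1*60 + $2
        else exit 1
    }'
}

etime_to_seconds "1-02:03:04"    # 1 day, 2 h, 3 min, 4 s -> 93784
etime_to_seconds "05:09"         # 5 min, 9 s             -> 309

# Example on the current shell process rather than on a Torque job:
etime_to_seconds "$(ps -o etime= -p $$)"
```

This returns the seconds USED; subtracting it from the requested wall time gives the seconds remaining.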

These are meant to be useful, but as always, please test before production runs.

[[User:cneale|cneale]] July 1 2010

User Tips

2010-07-01T16:43:27Z

Cneale: /* Checking on the remaining walltime from within a job */