User Tips

From oldwiki.scinet.utoronto.ca
Jump to navigation Jump to search

Running single node MPI jobs

In order to run GROMACS on a single node, the following two things are essential. If you do not include these two things, then some of your jobs will rune fine, but others will run slowly and others will produce only the beginning of a short log file and will produce no further output, even though they will continue to occupy the resources fully.

1. add :compute-eth: to your #PBS -l line <source lang="sh">

  1. PBS -l nodes=1:compute-eth:ppn=8,walltime=3:00:00,os=centos53computeA

</source> 2. add -mca btl_sm_num_fifos 7 -np $(wc -l $PBS_NODEFILE | gawk '{print $1}') -mca btl self,sm to the mpirun arguments.

It appears to be important that you put the -np argument between the two -mca arguments. <source lang="sh"> /scinet/gpc/mpi/openmpi/1.3.2-intel-v11.0-ofed/bin/mpirun -mca btl_sm_num_fifos 7 -np $(wc -l $PBS_NODEFILE | gawk '{print $1}') -mca btl self,sm -machinefile $PBS_NODEFILE /scratch/cneale/GPC/exe/intel/gromacs-4.0.5/exec/bin/mdrun_openmpi -deffnm test </source> Historical Note Another solution is to use -mca btl self,tcp instead of what is listed above. This, however, forces your communication to go over sockets and is less efficient than using shared memory. If you want to try it, the code is below. However, for smaller systems (<100,000 atoms) you will see a 1% to 5% performance reduction in comparison to the code listed above. <source lang="sh"> /scinet/gpc/mpi/openmpi/1.3.2-intel-v11.0-ofed/bin/mpirun --mca btl self,tcp -np $(wc -l $PBS_NODEFILE | gawk '{print $1}') -machinefile $PBS_NODEFILE /scratch/cneale/GPC/exe/intel/gromacs-4.0.5/exec/bin/mdrun_openmpi -deffnm test </source>

We are not exactly sure why this is required, or if it is required for programs other than GROMACS. However, you are strongly recommended to add this to any such script as it should only force you to get what you intend to get in any event. Refer to the section entitled "Ensuring that you get non-IB nodes" below for more information about what these commands do.

cneale September 14 2009 (Updated September 22 by cneale)

Currently the reason for the second point above is that the shared-memory communication in OpenMPI seems to be buggy, at least when the code is compiled with gcc, so instead of the (default) option "--mca btl self,sm", which gets the code to often fail, we use the "--mca btl self,tcp" option which forces communication to go via tcp. The tcp option is slower than the sm option, but at least for now it works.

dgruner September 21 2009

Benchmarking

Ensuring that you get non-IB nodes

You can specify gigE only nodes using a "compute-eth" flag

nodes=2:compute-eth:ppn=8

and this will only allow the code to run on "gigabit only nodes. So even if IB nodes are available it will sit in the queue.

By default (ie no property feature for the node) the scheduler (moab) is setup to use the gigE nodes first then the IB nodes. The scheduler configuration is ongoing but explicitly putting either "compute-eth" for ethernet or "ib" for infiniband nodes will guarantee the right type of node is used.

Also you can specify the type of interconnect directly on the mpirun line using mpirun --mca btl self,tcp for ethernet, so even if it was on an IB node it would still use ethernet for communication. Since the nodes are exactly the same except for the IB card, any benchmarking would still be valid.

Scott August 27 2009