User Tips

From oldwiki.scinet.utoronto.ca
Revision as of 11:41, 1 July 2010 by Cneale (talk | contribs) (advanced PBS and MOAB section added to user tips with first entry on obtaining walltime left)

Running single node MPI jobs

To run GROMACS on a single node, the following two things are essential. Without them, some of your jobs will run fine, but others will run slowly, and still others will write only the beginning of a short log file and then produce no further output, even though they continue to occupy the resources fully.

1. Add :compute-eth: to your #PBS -l line:
<source lang="sh">
#PBS -l nodes=1:compute-eth:ppn=8,walltime=3:00:00,os=centos53computeA
</source>
2. Add -mca btl_sm_num_fifos 7 -np $(wc -l $PBS_NODEFILE | gawk '{print $1}') -mca btl self,sm to the mpirun arguments.

It appears to be important that you put the -np argument between the two -mca arguments.
<source lang="sh">
/scinet/gpc/mpi/openmpi/1.3.2-intel-v11.0-ofed/bin/mpirun -mca btl_sm_num_fifos 7 -np $(wc -l $PBS_NODEFILE | gawk '{print $1}') -mca btl self,sm -machinefile $PBS_NODEFILE /scratch/cneale/GPC/exe/intel/gromacs-4.0.5/exec/bin/mdrun_openmpi -deffnm test
</source>
Historical Note: Another solution is to use -mca btl self,tcp instead of what is listed above. This, however, forces your communication to go over sockets and is less efficient than using shared memory. If you want to try it, the code is below. However, for smaller systems (<100,000 atoms) you will see a 1% to 5% performance reduction in comparison to the code listed above.
<source lang="sh">
/scinet/gpc/mpi/openmpi/1.3.2-intel-v11.0-ofed/bin/mpirun --mca btl self,tcp -np $(wc -l $PBS_NODEFILE | gawk '{print $1}') -machinefile $PBS_NODEFILE /scratch/cneale/GPC/exe/intel/gromacs-4.0.5/exec/bin/mdrun_openmpi -deffnm test
</source>

We are not exactly sure why this is required, or whether it is required for programs other than GROMACS. However, we strongly recommend adding it to any such script, since it should only force the job to use what you intended to use in any event. Refer to the section entitled "Ensuring that you get non-IB nodes" below for more information about what these commands do.

cneale September 14 2009 (Updated September 22 by cneale)

The reason for the second point above is that shared-memory communication in OpenMPI currently seems to be buggy, at least when the code is compiled with gcc. So instead of the default "--mca btl self,sm" option, which often causes the code to fail, we use the "--mca btl self,tcp" option, which forces communication to go over TCP. The tcp option is slower than the sm option, but at least for now it works.

dgruner September 21 2009

Note: This bugginess with the shared memory transport in OpenMPI 1.3.2 and 1.3.3 with gcc has been resolved with the new default openmpi, 1.4.1; please use that instead. Also note that with the newest versions of openmpi, you do not need a -hostfile or -machinefile entry.

Ljdursi 17:16, 25 February 2010 (UTC)
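
The -np value in the mpirun commands of this section is simply the number of lines in $PBS_NODEFILE, which is what the wc -l call counts. A minimal sketch of the same count without the gawk step follows; count_slots is a hypothetical helper name, not something provided by PBS or OpenMPI, and the mpirun line in the comment only repeats the flags already shown above.

```shell
#!/bin/bash
# Hypothetical helper (not part of PBS or OpenMPI): print the number of
# lines in a node file, i.e. the value passed to -np in the commands above.
count_slots() {
    local n
    # Redirecting the file into wc makes it print only the count,
    # so no awk/gawk step is needed to strip the file name.
    n=$(wc -l < "$1") || return 1
    # Arithmetic expansion drops any whitespace padding around the number.
    echo $(( n ))
}

# Possible use inside a job script (mdrun path left elided on purpose):
#   mpirun -mca btl_sm_num_fifos 7 -np "$(count_slots "$PBS_NODEFILE")" -mca btl self,sm ...
```

The function returns a non-zero status if the node file cannot be read, so a job script can stop early instead of calling mpirun with an empty -np.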

Benchmarking

Ensuring that you get non-IB nodes

You can request gigE-only nodes using the "compute-eth" flag:

nodes=2:compute-eth:ppn=8

This only allows the code to run on gigabit-only nodes, so even if IB nodes are available the job will sit in the queue.

By default (i.e. with no property feature given for the node), the scheduler (moab) is set up to use the gigE nodes first and then the IB nodes. The scheduler configuration is ongoing, but explicitly putting either "compute-eth" for ethernet or "ib" for infiniband nodes will guarantee that the right type of node is used.

You can also specify the type of interconnect directly on the mpirun line, using mpirun --mca btl self,tcp for ethernet, so that even on an IB node the job would still use ethernet for communication. Since the nodes are exactly the same except for the IB card, any benchmarking would still be valid.

Scott August 27 2009

Advanced interactions with PBS or MOAB

Checking on the remaining walltime from within a job

There are a number of options for doing this.

1. Use start=$(date +%s) to capture the start time of your script, then calculate the number of seconds that have elapsed like this:

#!/bin/bash
start=$(date +%s)
...
...
now=$(date +%s)
timeUsed=$(echo "$now $start"|awk '{print $1-$2}')
# bc is not available on nodes so must use awk

2. One can use checkjob, but be aware that it may fail and that gpc01 may be down, so you need to handle that condition in your script.

# This returns the seconds REMAINING:
val=""; while [ -z "$val" ]; do
 val=$(ssh gpc01 "checkjob $PBS_JOBID" 2>/dev/null|grep Reservation|awk '{print $5}'|awk -F ':' '{print $1*3600+$2*60+$3}');
done; echo "$val"

3. qstat is better, because it fails less often than checkjob. checkjob is a moab command, which can fail much more often than qstat (a PBS command) when moab is busy scheduling a large number of jobs. Nevertheless, qstat can also fail, so protect it like this:

# This returns the seconds USED (and it only updates every few minutes):
val=""; while [ -z "$val" ]; do
 val=$(qstat -f $PBS_JOBID 2>/dev/null|egrep resources_used.walltime|awk '{print $3}'|awk -F ':' '{print $1*3600+$2*60+$3}');
done; echo "$val"

4. To be independent of the qstat and checkjob commands, one possibility is to parse the output of ps (see man ps for more detail). For example,

ps -eo pid,etime,args|egrep /var/spool/torque/mom_priv/jobs/$PBS_JOBID | egrep -v egrep| ...

We're still not exactly sure how to get the time out of this, though. If you know, then please add it!
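
One possible way to get the time out of the ps output in option 4: the etime column is printed as [[DD-]hh:]mm:ss, so it can be converted to seconds with awk. The following is a minimal sketch that has not been tested on the GPC nodes. etime_to_seconds is a hypothetical helper name, and the commented usage assumes that the first line matched by the ps pipeline is the process you care about. Note that this gives the seconds ELAPSED for that process, not the seconds remaining.

```shell
#!/bin/bash
# Hypothetical helper: convert a ps "etime" value ([[DD-]hh:]mm:ss)
# into a plain number of seconds.
etime_to_seconds() {
    echo "$1" | awk '{
        days = 0
        t = $1
        # Split off an optional leading "DD-" day count.
        if (index(t, "-") > 0) {
            split(t, d, "-")
            days = d[1]
            t = d[2]
        }
        # What is left is either mm:ss or hh:mm:ss.
        n = split(t, f, ":")
        if (n == 3)
            secs = f[1] * 3600 + f[2] * 60 + f[3]
        else
            secs = f[1] * 60 + f[2]
        print days * 86400 + secs
    }'
}

# Possible use with the pipeline from option 4 (assumes the first match
# is the job script; column 2 of the ps output is etime):
#   etime=$(ps -eo pid,etime,args | egrep "/var/spool/torque/mom_priv/jobs/$PBS_JOBID" | egrep -v egrep | awk 'NR==1 {print $2}')
#   etime_to_seconds "$etime"
```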

These are meant to be useful, but as always, please test before production runs.
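
Options 1 and 3 above both give the seconds used rather than the seconds remaining. Below is a minimal sketch of turning either into a remaining-time estimate, assuming you pass in the requested walltime by hand so that it matches your #PBS -l line. hms_to_seconds, seconds_left and retry_until_output are hypothetical helper names. The retry helper caps the number of attempts and pauses between them, unlike the open-ended while loops above.

```shell
#!/bin/bash
# Hypothetical helpers; none of these names come from PBS, Torque or Moab.

# Convert H:MM:SS to seconds with awk, as in the one-liners above.
hms_to_seconds() {
    echo "$1" | awk -F ':' '{print $1*3600+$2*60+$3}'
}

# Seconds remaining, given the seconds used and the requested walltime
# (H:MM:SS). Clamped at zero so callers never see a negative value.
seconds_left() {
    local used=$1 limit left
    limit=$(hms_to_seconds "$2")
    left=$(( limit - used ))
    if (( left < 0 )); then left=0; fi
    echo "$left"
}

# Run a command up to $1 times, sleeping $2 seconds between attempts,
# until it prints something. Returns 1 if every attempt came back empty.
retry_until_output() {
    local tries=$1 pause=$2 out i
    shift 2
    for (( i = 1; i <= tries; i++ )); do
        out=$("$@" 2>/dev/null)
        if [ -n "$out" ]; then
            echo "$out"
            return 0
        fi
        if (( i < tries )); then sleep "$pause"; fi
    done
    return 1
}

# Example with option 1 (3:00:00 is assumed to match the #PBS -l walltime):
#   start=$(date +%s)
#   ...
#   left=$(seconds_left $(( $(date +%s) - start )) 3:00:00)
#
# Example with option 3, giving up after 5 attempts 10 seconds apart:
#   used=$(retry_until_output 5 10 qstat -f "$PBS_JOBID" | awk '/resources_used.walltime/ {print $3}')
#   [ -n "$used" ] && left=$(seconds_left "$(hms_to_seconds "$used")" 3:00:00)
```

As with the commands above, please test these before relying on them in production runs.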