User Tips

From oldwiki.scinet.utoronto.ca

Reducing virtual memory consumption for multithreaded programs

You may find that your program uses significantly more memory than it should. Usually this is due to a memory leak in your program, but there is another possible cause one should be aware of.

Relatively recent versions (2.10 and newer) of glibc, the GNU C library in which malloc is implemented, use separate per-thread memory allocation pools in a multithreaded process. This increases performance, but can result in much higher memory usage. A set of memory "arenas" of 64 MB each are set up and used to fulfill memory requests from the process. The maximum number of such arenas is 8 times the number of cores on the node, e.g. 64 on a typical SciNet node, so the maximum total arena size is 64 × 64 MB = 4 GB. This isn't the memory your process is using; it is overhead that comes in addition to that. So if your program should be using 12 GB of RAM, it might end up using 16 GB instead due to this overhead. If you run multiple tasks per node, each of these will have this overhead.

The symptoms of this memory overhead problem are:

  1. Virtual memory use, but not resident memory, is much higher than expected
  2. Memory use stabilizes - it does not continue to ramp up as a memory leak would
  3. Memory use increases with the number of threads used
  4. pmap shows that the extra memory is locked in unreadable, unwritable anonymous mappings of 64 MB each, with the number of such mappings increasing with the number of threads

Here is an example of the latter:

00007f5768000000     496     116     116 rw---   [ anon ]
00007f576807c000   65040       0       0 -----   [ anon ]
00007f576c000000    1828     116     116 rw---   [ anon ]
00007f576c1c9000   63708       0       0 -----   [ anon ]
00007f5770000000     496     116     116 rw---   [ anon ]
00007f577007c000   65040       0       0 -----   [ anon ]
00007f5774000000    7172     140     140 rw---   [ anon ]
00007f5774701000   58364       0       0 -----   [ anon ]
00007f5778000000     496     116     116 rw---   [ anon ]
00007f577807c000   65040       0       0 -----   [ anon ]
00007f577c000000     368     108     108 rw---   [ anon ]
00007f577c05c000   65168       0       0 -----   [ anon ]
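A listing like the one above comes from pmap's extended output. As a sketch, you can count the suspicious 64 MB mappings directly; here the shell's own pid ($$) stands in for your program's pid:

```shell
# Count mappings of 64 MB (65536 KB) or more in a process's address space.
# $$ (this shell) is used only as an example pid; substitute your program's pid.
big=$(pmap -x $$ | awk '$2 ~ /^[0-9]+$/ && $2+0 >= 65536 {n++} END {print n+0}')
echo "mappings >= 64 MB: $big"
```

If this count grows with the number of threads, you are likely seeing arena overhead rather than a leak.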

It turns out that such a large number of memory arenas is not needed to achieve good performance, and capping the number at 4 gives a negligible performance hit while greatly reducing the memory overhead for multithreaded programs. Setting the environment variable MALLOC_ARENA_MAX=4 practically eliminates the problem. I recommend exporting this in your .bashrc even if you don't experience virtual memory problems right now; it might save you days of debugging in the future.
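For example, add a line like this to your .bashrc (or the top of your job script):

```shell
# Limit glibc to at most 4 malloc arenas per process:
# worst case 4 x 64 MB = 256 MB of arena overhead, instead of up to 4 GB.
export MALLOC_ARENA_MAX=4
```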

For more information about this problem, see this IBM article. Sigurdkn 11:38, 27 September 2014 (EDT)

Running single node MPI jobs

In order to run GROMACS on a single node, the following two things are essential. If you do not include them, some of your jobs will run fine, but others will run slowly, and still others will produce only the beginning of a short log file and then no further output, even though they continue to occupy their resources fully.

NOTE: THIS IS DEPRECATED.

1. Add :compute-eth: to your #PBS -l line:

<source lang="bash">
#PBS -l nodes=1:compute-eth:ppn=8,walltime=3:00:00
</source>

2. Add -mca btl_sm_num_fifos 7 -np $(wc -l $PBS_NODEFILE | gawk '{print $1}') -mca btl self,sm to the mpirun arguments.

It appears to be important that you put the -np argument between the two -mca arguments.

<source lang="bash">
/scinet/gpc/mpi/openmpi/1.3.2-intel-v11.0-ofed/bin/mpirun -mca btl_sm_num_fifos 7 -np $(wc -l $PBS_NODEFILE | gawk '{print $1}') -mca btl self,sm -machinefile $PBS_NODEFILE /scratch/cneale/GPC/exe/intel/gromacs-4.0.5/exec/bin/mdrun_openmpi -deffnm test
</source>

Historical Note: Another solution is to use -mca btl self,tcp instead of what is listed above. This, however, forces your communication to go over sockets and is less efficient than using shared memory. If you want to try it, the code is below; for smaller systems (<100,000 atoms) you will see a 1% to 5% performance reduction in comparison to the shared-memory version above.

<source lang="bash">
/scinet/gpc/mpi/openmpi/1.3.2-intel-v11.0-ofed/bin/mpirun --mca btl self,tcp -np $(wc -l $PBS_NODEFILE | gawk '{print $1}') -machinefile $PBS_NODEFILE /scratch/cneale/GPC/exe/intel/gromacs-4.0.5/exec/bin/mdrun_openmpi -deffnm test
</source>

We are not exactly sure why this is required, or whether it is required for programs other than GROMACS. However, you are strongly recommended to add it to any such script, since it should do no more than force you to get what you intended in any event. Refer to the section entitled "Ensuring that you get non-IB nodes" below for more information about what these commands do.

cneale September 14 2009 (Updated September 22 by cneale)

Currently the reason for the second point above is that the shared-memory communication in OpenMPI seems to be buggy, at least when the code is compiled with gcc. So instead of the (default) option "--mca btl self,sm", which often causes the code to fail, we use the "--mca btl self,tcp" option, which forces communication to go via tcp. The tcp option is slower than the sm option, but at least for now it works.

dgruner September 21 2009

Note: This bugginess with the shared memory transport in OpenMPI 1.3.2 and 1.3.3 with gcc has been resolved with the new default openmpi, 1.4.1; please use that instead. Also note that with the newest versions of openmpi, you do not need a -hostfile or -machinefile entry.

Ljdursi 17:16, 25 February 2010 (UTC)

Benchmarking

Ensuring that you get non-IB nodes

You can specify gigE only nodes using a "compute-eth" flag

nodes=2:compute-eth:ppn=8

and this will only allow the code to run on "gigabit-only" nodes. So even if IB nodes are available, the job will sit in the queue.

By default (i.e. no property feature for the node) the scheduler (Moab) is set up to use the gigE nodes first, then the IB nodes. The scheduler configuration is ongoing, but explicitly putting either "compute-eth" for ethernet or "ib" for InfiniBand nodes will guarantee the right type of node is used.

You can also specify the type of interconnect directly on the mpirun line using mpirun --mca btl self,tcp for ethernet, so even if the job lands on an IB node it will still use ethernet for communication. Since the nodes are exactly the same except for the IB card, any benchmarking would still be valid.

Scott August 27 2009

Advanced interactions with PBS or MOAB

Checking on the remaining walltime from within a job

There are a number of options for doing this.

1. Use start=$(date +%s) to capture the start time of your script, then calculate the number of seconds that have elapsed later on:

 #!/bin/bash
 start=$(date +%s)
 ...
 ...
 now=$(date +%s)
 timeUsed=$(echo "$now $start"|awk '{print $1-$2}')
 # bc is not available on nodes so must use awk

(note that bc is available in the extras module now).
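Since both timestamps are plain integer seconds, bash's built-in arithmetic also works, with no external tool at all:

```shell
start=$(date +%s)
# ... your job's work goes here ...
now=$(date +%s)
timeUsed=$((now - start))   # bash integer arithmetic; no bc or awk needed
echo "seconds elapsed: $timeUsed"
```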

2. One can use checkjob, but be aware that it may fail and gpc01 may be down, so you need to handle that condition in your script.

 #This returns the seconds REMAINING:
 val=""; 
 while [ -z "$val" ]; do 
   val=$(ssh gpc01 "checkjob $PBS_JOBID" 2>/dev/null|grep Reservation|awk '{print $5}'|awk -F ':' '{print $1*3600+$2*60+$3}'); 
 done; 
 echo "$val"

3. qstat is better, because it fails less often than checkjob. checkjob is a Moab command, which can fail much more often than qstat (a PBS command) when Moab is busy scheduling large numbers of jobs. Nevertheless, this command can also fail, so protect it like this:

 #This returns the seconds USED (and it only updates every few minutes):
 val=""; 
 while [ -z "$val" ]; do 
   val=$(qstat -f $PBS_JOBID 2>/dev/null|egrep resources_used.walltime|awk '{print $3}'|awk -F ':' '{print $1*3600+$2*60+$3}');
 done; 
 echo "$val"

4. To be independent of the qstat and checkjob commands, one possibility is to parse the output of ps (see man ps for more detail). For example,

 ps -eo pid,etime,args|egrep /var/spool/torque/mom_priv/jobs/$PBS_JOBID | egrep -v egrep| ...

Although we're still not exactly sure how to get the time out of this. If you know, then please add it!
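One possible sketch: ps prints the etime field in the format [[dd-]hh:]mm:ss, so a small helper (the function name etime_to_seconds is just an illustration, not part of ps) can convert one such field to seconds:

```shell
# Convert a ps "etime" field ([[dd-]hh:]mm:ss) to seconds.
# Splitting on both '-' and ':' yields 2, 3, or 4 numeric fields.
etime_to_seconds() {
  echo "$1" | awk -F'[-:]' '{
    if (NF == 4)      print $1*86400 + $2*3600 + $3*60 + $4;  # dd-hh:mm:ss
    else if (NF == 3) print $1*3600 + $2*60 + $3;             # hh:mm:ss
    else              print $1*60 + $2;                       # mm:ss
  }'
}
```

For example, etime_to_seconds 1-02:03:04 prints 93784 (one day, two hours, three minutes, four seconds).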

These are meant to be useful, but as always, please test before production runs.


cneale July 1 2010