FAQ

Who do I contact for support?

Who do I contact if I have problems or questions about how to use the SciNet systems?

Answer

E-mail <support@scinet.utoronto.ca>

What does code scaling mean?

Answer

Please see A Performance Primer

What do you mean by throughput?

Answer

Please see A Performance Primer

MPI development and interactive testing

I am experimenting with the MPI calls in my code to get it working. I run a lot of tests, and each of them takes only a couple of seconds. Sometimes (like right now, as I write this email) all the machines are full and my job is put in the queue. Since I only need a couple of SECONDS, is there any way I can test on the login nodes? I can't do it with the llsubmit command, and if I use mpiexec then I need a host file. Can I use a host file to run my 2-second test jobs on the login nodes? If so, could you please send me an example host file?

Answer

On the TCS you can run small MPI jobs on the tcs-f11n06 node, which is meant for development use. Please don't run them on the main login node, tcs-f11n05. The hostfile simply lists the node name once per MPI task, so it looks like:

tcs-f11n06
tcs-f11n06
tcs-f11n06
tcs-f11n06

for a 4-task run. When you invoke "poe" or "mpirun", you can point to this file with a runtime argument. You can also specify it in the environment variable MP_HOSTFILE: if your file is /scratch/USER/hostfile, then you would do

export MP_HOSTFILE=/scratch/USER/hostfile

in your shell. You will also need to create a .rhosts file in your home directory, again listing tcs-f11n06, so that poe can start jobs. After that you can simply run your program. You can use mpiexec:

mpiexec -n 4 my_test_program

adding -hostfile /path/to/my/hostfile if you did not set the environment variable above. Alternatively, you can run it with the poe command (see "man poe" for details), or even just run the executable directly. In that case, the number of MPI processes defaults to the number of entries in your hostfile.
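The hostfile and MP_HOSTFILE steps above can be sketched in a few lines of shell. This is only an illustration: the variable names NTASKS and HOSTFILE and the temporary file location are assumptions, not SciNet conventions. On the TCS you would more likely keep the file under /scratch/USER as described above.

```shell
# Sketch: build a hostfile with one line per MPI task for the TCS
# development node, then point MP_HOSTFILE at it.
# NTASKS and HOSTFILE are illustrative names, not SciNet conventions.
NTASKS=4
HOSTFILE="${TMPDIR:-/tmp}/hostfile.$$"

: > "$HOSTFILE"                 # create (or empty) the file
i=0
while [ "$i" -lt "$NTASKS" ]; do
    echo "tcs-f11n06" >> "$HOSTFILE"
    i=$((i + 1))
done

export MP_HOSTFILE="$HOSTFILE"
echo "Wrote $NTASKS entries to $MP_HOSTFILE"
```

Changing NTASKS changes the number of entries, and therefore the default number of MPI processes.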

On the GPC you can run similar jobs on the development nodes. Even better, though, is to request an interactive job and run the tests there. More details will be forthcoming; in the meantime, please contact our support group at <support@scinet.utoronto.ca>.

How can I monitor my jobs?

By the way, how can I monitor the load, if not with llq?

Answer

You can get more information with the command

/xcat/tools/tcs-scripts/LL/jobState.sh

which I alias as:

alias llq1='/xcat/tools/tcs-scripts/LL/jobState.sh'

If you run "llq1 -n" you will see a listing of jobs together with a lot of information, including the load.
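If you also want this shortcut inside scripts, note that bash does not expand aliases in non-interactive shells by default. A shell function is one alternative. The sketch below assumes a Bourne-style shell; the function name llq1 simply mirrors the alias above.

```shell
# Sketch: wrap the job-state script in a function so the shortcut
# also works inside scripts. All arguments are passed through unchanged.
llq1() {
    /xcat/tools/tcs-scripts/LL/jobState.sh "$@"
}

# Usage on the TCS:
#   llq1 -n
```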

Autoparallelization does not work!

I compiled my code with the -qsmp=omp,auto option, and then specified that it should run with 64 threads, using

export OMP_NUM_THREADS=64

However, when I check the load using llq1 -n, it shows a load on the node of 1.37. Why?

Answer

Autoparallelization will only get you so far; in fact, it usually does not do much. A load of 1.37 means that, most of the time, only about one thread was doing any work, so the compiler found very little that it could parallelize on its own. It helps to run the compiler with the -qreport option and then read the output listing carefully to see where the compiler thought it could parallelize, where it could not, and why. Then you can go back to your code and try to address each of the issues the compiler raised. We emphasize that this is just a rough first guide, and that compilers are still not magical!
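As a rough sketch of the compile step described above, the following shell fragment adds -qreport to the flags from the question. The compiler command xlf90_r and the file name my_code.f90 are assumptions for illustration, since the question does not say which language or compiler command was used. If that compiler or source file is not available, the fragment only prints the command.

```shell
# Sketch: rebuild with the autoparallelization flags plus -qreport.
# COMPILER and SRC are hypothetical; substitute your own.
COMPILER=xlf90_r
SRC=my_code.f90
FLAGS="-qsmp=omp,auto -qreport"
CMD="$COMPILER $FLAGS -o my_code $SRC"

if command -v "$COMPILER" >/dev/null 2>&1 && [ -f "$SRC" ]; then
    $CMD    # then read the compiler's listing output for the report
else
    echo "Would run: $CMD"
fi
```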

Another transport will be used instead

I get error messages like the following when running on the GPC at the start of the run, although the job seems to proceed OK. Is this a problem?

--------------------------------------------------------------------------
[[45588,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: gpc-f101n005

Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------

Answer

Everything's fine. The two MPI libraries SciNet provides work with both the InfiniBand and the Gigabit Ethernet interconnects, and will always try to use the fastest interconnect available. In this case, you ran on normal Gigabit Ethernet GPC nodes with no InfiniBand, but the MPI libraries have no way of knowing this and try InfiniBand first anyway. This is just a harmless "failover" message: the library tried to use InfiniBand, which doesn't exist on this node, and then fell back to Gigabit Ethernet ("another transport").
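If the message is distracting, Open MPI can be asked not to print it. The following is a minimal sketch, assuming the Open MPI version installed on the GPC honours the btl_base_warn_component_unused parameter (an assumption; ompi_info on the system should list it if so). The program name my_test_program is a placeholder.

```shell
# Sketch: silence Open MPI's "unused component" warning for this shell session.
# Open MPI reads MCA parameters from environment variables prefixed OMPI_MCA_.
export OMPI_MCA_btl_base_warn_component_unused=0

# Equivalent per-run form:
#   mpirun --mca btl_base_warn_component_unused 0 -np 4 ./my_test_program
```

This only hides the message; it does not change which interconnect is used, and it applies to Open MPI only.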

Next question, please

We'll answer it asap!
