Difference between revisions of "FAQ"

From oldwiki.scinet.utoronto.ca

Revision as of 11:48, 16 August 2009

Who do I contact for support

Who do I contact if I have problems or questions about how to use the SciNet systems?

Answer:

E-mail <support@scinet.utoronto.ca>

What does code scaling mean?

Answer:

Please see A Performance Primer

What do you mean by throughput?

Answer:

Please see A Performance Primer

MPI development and interactive testing

I am in the process of playing around with the MPI calls in my code to get it to work. I do a lot of tests, and each of them takes only a couple of seconds. Sometimes (like now, as I'm sending this email) all the machines are full and I'm put in the queue. Since I just need a couple of SECONDS, is there any way I can test it on the login nodes? I can't do it using the llsubmit command, and if I use mpiexec then I need a host file. Can I use a host file to run my 2-second test jobs on the login nodes? If yes, can you send me an example host file please?

Answer:

On the TCS you can run small MPI jobs on the tcs-f11n06 node, which is meant for development use. Please don't run them on the main login node tcs-f11n05. Now, as for the hostfile, it simply looks like:

tcs-f11n06
tcs-f11n06
tcs-f11n06
tcs-f11n06

for a 4-task run. When you invoke "poe" or "mpirun", there are runtime arguments that you specify pointing to this file. You can also specify it in an environment variable MP_HOSTFILE, so, if your file is in your /scratch/USER/hostfile, then you would do

 export MP_HOSTFILE=/scratch/USER/hostfile

in your shell. You will also need to create a .rhosts file in your home directory, again listing tcs-f11n06, so that poe can start jobs. After that you can simply run your program. You can use mpiexec:

 mpiexec -n 4 my_test_program

adding -hostfile /path/to/my/hostfile if you did not set the environment variable above. Alternatively, you can run it with the poe command (do a "man poe" for details), or even by just directly running it. In this case the number of MPI processes will by default be the number of entries in your hostfile.
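As a minimal sketch of the hostfile steps above, the file can be generated rather than typed by hand. The helper name make_hostfile and the temporary demo path are illustrative assumptions, not SciNet tools; on the TCS you would write the file somewhere under your own /scratch directory instead.

```shell
# make_hostfile NODE NTASKS FILE
# Writes NODE to FILE once per MPI task (hypothetical helper, not a SciNet tool).
make_hostfile() {
    mh_node=$1; mh_ntasks=$2; mh_file=$3
    : > "$mh_file"                     # create or truncate the hostfile
    mh_i=0
    while [ "$mh_i" -lt "$mh_ntasks" ]; do
        echo "$mh_node" >> "$mh_file"
        mh_i=$((mh_i + 1))
    done
}

# Demo: a 4-task hostfile for the TCS development node, written to a
# temporary directory so the sketch can be tried anywhere.
demo_dir=$(mktemp -d)
make_hostfile tcs-f11n06 4 "$demo_dir/hostfile"
export MP_HOSTFILE="$demo_dir/hostfile"
cat "$MP_HOSTFILE"
```

With MP_HOSTFILE exported this way, the mpiexec line above needs no -hostfile argument.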

On the GPC one can run similar jobs on the development nodes. Even better, though, is to request an interactive job and run the tests there. More details will be forthcoming. Please talk to our support group at <support@scinet.utoronto.ca>.

How can I monitor my jobs on TCS?

By the way, how can I monitor the load? not with llq?

Answer:

You can get more information with the command

/xcat/tools/tcs-scripts/LL/jobState.sh

which I alias as:

alias llq1='/xcat/tools/tcs-scripts/LL/jobState.sh'

If you run "llq1 -n" you will see a listing of jobs together with a lot of information, including the load.

Autoparallelization does not work!

I compiled my code with the -qsmp=omp,auto option, and then I specified that it should be run with 64 threads - with

export OMP_NUM_THREADS=64

However, when I check the load using llq1 -n, it shows a load on the node of 1.37. Why?

Answer:

Using the autoparallelization will only get you so far. In fact, it usually does not do too much. What is helpful is to run the compiler with the -qreport option, and then read the output listing carefully to see where the compiler thought it could parallelize, where it could not, and the reasons for this. Then you can go back to your code and carefully try to address each of the issues brought up by the compiler. We emphasize that this is just a rough first guide, and that the compilers are still not magical!

Another transport will be used instead

I get error messages like the following when running on the GPC at the start of the run, although the job seems to proceed OK. Is this a problem?

--------------------------------------------------------------------------
[[45588,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: gpc-f101n005

Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------

Answer:

Everything's fine. The two MPI libraries SciNet provides work for both the InfiniBand and the Gigabit Ethernet interconnects, and will always try to use the fastest interconnect available. In this case, you ran on normal Gigabit GPC nodes with no InfiniBand; but the MPI libraries have no way of knowing this, and try the InfiniBand first anyway. This is just a harmless `failover' message; it tried to use the InfiniBand, which doesn't exist on this node, then fell back on using Gigabit Ethernet (`another transport').

OpenMP on the TCS

How do I run an OpenMP job on the TCS?

Answer:

Please look at the TCS Quickstart page.

I changed my .bashrc/.bash_profile and now nothing works

The default startup scripts provided by SciNet are as follows. Certain things, like sourcing /etc/profile and /etc/bashrc, are required for various SciNet routines to work!

.bash_profile

<source lang="bash">
if [ -f /etc/profile ]; then
      . /etc/profile
fi

# commands which work for both GPC and TCS can go here
alias passwd='echo "Please use the SciNet portal to change password: https://portal.scinet.utoronto.ca/change_password"'

HOST=$(uname)

if [ "${HOST}" == "AIX" ]; then
       # do things for the TCS machine
       alias llq1='/xcat/tools/tcs-scripts/LL/jobState.sh'
       alias llstat='/xcat/tools/tcs-scripts/LL/jobSummary.sh'
       if [ "${TERM}" = "xterm-color" ]; then
               export TERM=xterm
       fi
       # user environment for login shells goes here
       # replace colon with your own commands
       :
else
       # do things for the GPC machine
       # user environment for login shells goes here
       # replace colon with your own commands
       :
fi

PS1="\h-\$ "

if [ -f ~/.bashrc ]; then
      . ~/.bashrc
fi
</source>

.bashrc

<source lang="bash">
if [ -f /etc/bashrc ]; then
      . /etc/bashrc
fi

# commands which work for both GPC and TCS can go here

HOST=$(uname)

if [ "${HOST}" == "AIX" ]; then
       # do things for the TCS machine
       # user environment for all shells goes here
       # replace colon with your own commands
       :
else
       # do things for the GPC machine
       module load intel openmpi
       # user environment for all shells goes here
       # replace colon with your own commands
       :
fi
</source>
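If things stop working after an edit, a quick first check is whether your startup files still source the system-wide ones. The sketch below is one way to do that, not a SciNet-provided tool: the helper names sources_file and check_startup are hypothetical, and the pattern only recognizes the `. FILE` and `source FILE` forms at the start of a line.

```shell
# sources_file STARTUP_FILE SYSTEM_FILE
# Succeeds if STARTUP_FILE has an uncommented line that sources SYSTEM_FILE
# using either ". FILE" or "source FILE" (hypothetical helper).
sources_file() {
    grep -Eq "^[[:space:]]*(\.|source)[[:space:]]+$2([[:space:]]|;|\$)" "$1"
}

# check_startup STARTUP_FILE SYSTEM_FILE
# Prints a one-line report for a single startup file.
check_startup() {
    if [ ! -f "$1" ]; then
        echo "$1: missing"
    elif sources_file "$1" "$2"; then
        echo "$1: sources $2"
    else
        echo "$1: does NOT source $2"
    fi
}

check_startup "$HOME/.bash_profile" /etc/profile
check_startup "$HOME/.bashrc" /etc/bashrc
```

A "does NOT source" line points at the file to compare against the defaults listed above.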

How do I run serial jobs on GPC?

Answer:

It should be said first that SciNet is a parallel computing resource, and our priority will always be parallel jobs. Having said that, if you can make efficient use of the resources using serial jobs and get good science done, that's good too, and we're happy to help you.

The GPC nodes each have 8 processing cores, and making efficient use of these nodes means using all eight cores. As a result, we'd like users to take up whole nodes (e.g., run multiples of 8 jobs) at a time. The most straightforward way to do this is to bunch the jobs in groups of 8 that will take roughly the same amount of time, and create a job that looks a bit like this:

<source lang="bash">
#!/bin/bash
# MOAB/Torque submission script for multiple serial jobs on
# SciNet GPC
#
#PBS -l nodes=1:ppn=8,walltime=1:00:00
#PBS -N serialx8

# DIRECTORY TO RUN - $PBS_O_WORKDIR is directory job was submitted from
cd $PBS_O_WORKDIR

# EXECUTION COMMAND; ampersand off 8 jobs and wait
(cd jobdir1; ./dojob1) &
(cd jobdir2; ./dojob2) &
(cd jobdir3; ./dojob3) &
(cd jobdir4; ./dojob4) &
(cd jobdir5; ./dojob5) &
(cd jobdir6; ./dojob6) &
(cd jobdir7; ./dojob7) &
(cd jobdir8; ./dojob8) &
wait
</source>

There are three important things to take note of here. First, the wait command at the end is crucial; without it the job will terminate immediately, killing the 8 programs you just started.

Second is that it is important to group the programs by how long they will take. If (say) dojob8 takes 2 hours and the rest only take 1, then for one hour 7 of the 8 cores on the GPC node are wasted; they are sitting idle but are unavailable for other users, and the utilization of this node over the whole run is only 56%. This is the sort of thing we'll notice, and users who don't make efficient use of the machine will have their ability to use SciNet resources reduced.
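The 56% figure is the core-hours actually used divided by the core-hours reserved: 9 out of 16 in this example. A small sketch of that arithmetic follows; the function name utilization_percent is illustrative, and it assumes whole-number run times and one job per core.

```shell
# utilization_percent CORES RUNTIME...
# Prints the integer percentage of reserved core-hours actually used, when
# each job runs on its own core and the node is held until the longest job
# finishes. Run times must be whole numbers (hypothetical helper).
utilization_percent() {
    up_cores=$1; shift
    up_used=0; up_longest=0
    for up_t in "$@"; do
        up_used=$((up_used + up_t))
        if [ "$up_t" -gt "$up_longest" ]; then
            up_longest=$up_t
        fi
    done
    echo $(( 100 * up_used / (up_cores * up_longest) ))
}

# Seven 1-hour jobs and one 2-hour job on an 8-core node:
# 9 core-hours used out of 8 * 2 = 16 reserved.
utilization_percent 8 1 1 1 1 1 1 1 2
```

Running the same function on your own estimated run times before submitting gives a rough idea of whether a grouping is balanced.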

Third is that it is necessary to have a good idea of how much memory the jobs will require. The GPC compute nodes have about 14.5GB in total available to user jobs running on the 8 cores (a bit less, say 13GB, on the devel nodes gpc01..04). So the jobs also have to be bunched in ways that will fit into 14.5GB. If that's not possible -- each individual job requires significantly in excess of ~1.8GB -- then it's possible in principle to just run fewer jobs so that they do fit; but then, again, there is an under-utilization problem. In that case, the jobs are likely candidates for parallelization, and you can contact us at <support@scinet.utoronto.ca> and arrange a meeting with one of the technical analysts to help you do just that.
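When the eight runs follow a naming pattern, as jobdir1 to jobdir8 and dojob1 to dojob8 do above, the same launch-and-wait structure can be written as a loop. The sketch below is self-contained so that it can be tried anywhere: it first creates stand-in job directories and trivial job scripts in a temporary directory, which a real submission script would not do.

```shell
# Stand-in setup so the sketch runs anywhere: eight directories, each with a
# trivial job script. In a real job these would be your own programs.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
i=1
while [ "$i" -le 8 ]; do
    mkdir "jobdir$i"
    printf '#!/bin/sh\necho "job %s finished" > result.txt\n' "$i" > "jobdir$i/dojob$i"
    chmod +x "jobdir$i/dojob$i"
    i=$((i + 1))
done

# The part that belongs in the submission script: start each job in the
# background in its own subshell, then wait for all of them to finish.
i=1
while [ "$i" -le 8 ]; do
    (cd "jobdir$i" && "./dojob$i") &
    i=$((i + 1))
done
wait

cat jobdir*/result.txt
```

Each job runs in a subshell so that its cd does not affect the others, and the final wait plays the same role as in the explicit version.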

How do I run serial jobs on TCS?

Answer: You don't.

On GPC, `Job cannot be executed'

I get error messages like this trying to run on GPC:

PBS Job Id: 30414.gpc-sched
Job Name:   namd
Exec host:  gpc-f120n011/7+gpc-f120n011/6+gpc-f120n011/5+gpc-f120n011/4+gpc-f120n011/3+gpc-f120n011/2+gpc-f120n011/1+gpc-f120n011/0
Aborted by PBS Server 
Job cannot be executed
See Administrator for help



PBS Job Id: 30414.gpc-sched
Job Name:   namd
Exec host:  gpc-f120n011/7+gpc-f120n011/6+gpc-f120n011/5+gpc-f120n011/4+gpc-f120n011/3+gpc-f120n011/2+gpc-f120n011/1+gpc-f120n011/0
An error has occurred processing your job, see below.
request to copy stageout files failed on node 'gpc-f120n011/7+gpc-f120n011/6+gpc-f120n011/5+gpc-f120n011/4+gpc-f120n011/3+gpc-f120n011/2+gpc-f120n011/1+gpc-f120n011/0' for job 30414.gpc-sched

Unable to copy file 30414.gpc-sched.OU to cmadill@gpc-f101n084.scinet.local:/scratch/cmadill/projects/sim-performance-test/runtime/l/namd/8/namd.o30414
*** error from copy
30414.gpc-sched.OU: No such file or directory
*** end error output

Try doing the following:

mkdir /scratch/${USER}/.pbs_spool
ln -s /scratch/${USER}/.pbs_spool ~/.pbs_spool

This is how all new accounts are set up on SciNet.

/home on GPC for compute jobs is mounted as a read-only file system. PBS by default tries to spool its output files to ${HOME}/.pbs_spool which fails as it tries to write to a read-only file system. New accounts at SciNet get around this by having ${HOME}/.pbs_spool point to somewhere appropriate on /scratch, but if you've deleted that link or directory, or had an old account, you will see errors like the above.
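A slightly more defensive version of the two commands above checks what is already there before changing anything. This is a sketch under stated assumptions, not an official SciNet script: the function name ensure_pbs_spool is hypothetical, and it takes the home and scratch directories as arguments so that it can be tried on throwaway directories first.

```shell
# ensure_pbs_spool HOME_DIR SCRATCH_DIR
# Makes HOME_DIR/.pbs_spool a symlink to SCRATCH_DIR/.pbs_spool, creating the
# target directory if needed. Refuses to touch an existing file or directory
# that is not a symlink (hypothetical helper).
ensure_pbs_spool() {
    eps_link=$1/.pbs_spool
    eps_target=$2/.pbs_spool
    mkdir -p "$eps_target" || return 1
    if [ -L "$eps_link" ]; then
        # An existing link may be stale; replace it.
        rm "$eps_link" || return 1
    elif [ -e "$eps_link" ]; then
        echo "$eps_link exists and is not a symlink; move it aside first" >&2
        return 1
    fi
    ln -s "$eps_target" "$eps_link"
}

# Demo on throwaway directories. On the GPC the call would instead be:
#   ensure_pbs_spool "$HOME" "/scratch/$USER"
demo=$(mktemp -d)
mkdir "$demo/home" "$demo/scratch"
ensure_pbs_spool "$demo/home" "$demo/scratch" && ls -l "$demo/home/.pbs_spool"
```

Run it from a login node, where /home is writable, rather than from inside a compute job.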

Next question, please

Send your question to <support@scinet.utoronto.ca>; we'll answer it asap!