Scheduler

From oldwiki.scinet.utoronto.ca
Jump to navigation Jump to search

WARNING: SciNet is in the process of replacing this wiki with a new documentation site. For current information, please go to https://docs.scinet.utoronto.ca

The queueing system used at SciNet is based around Cluster Resources Moab Workload Manager. Moab is used on both the GPC and TCS however Torque is used as the backend resource manager on the GPC and IBM's LoadLeveler is used on the TCS.

This page outlines some of the most common Moab commands with full documentation available from Moab here, the torque (pbs) commands full documentation is here.

Some common questions about the queuing system can be found on the FAQ as well.

Queues

GPC

batch

The batch queue is the default queue on the GPC allowing the user access to all the resources for jobs between 15 minutes and upto 48 hours. If a specific queue is not specified, -q flag, then a job is submitted to the batch queue. Most jobs will run in batch, and user's can use feature flags to determine the type of nodes they require.

For example, to request two nodes anywhere on the GPC (QDR or DDR), use

#PBS -l nodes=2:ppn=8,walltime=1:00:00

For two nodes using DDR, use

#PBS -l nodes=2:ddr:ppn=8,walltime=1:00:00

To get two nodes using QDR, instead, you would use

#PBS -l nodes=2:qdr:ppn=8,walltime=1:00:00
debug

A debug queue has been set up primarily for code developers to quickly test and evaluate their codes and configurations without having to wait in the batch queue. There are 10 nodes currently reserved for the debug queue. It has quite restrictive limits to promote high turnover and availability thus a user can only use 2 nodes (16 cores) for 2 hours, to a maximum of 8 nodes (64 cores) for 1/2 an hour and can only have one job in the debug queue at a time. There is no minimum time limit on this queue.

$ qsub -l nodes=1:ppn=8,walltime=1:00:00 -q debug -I
largemem

The largemem queue is used for accessing one of two 16 core with 128 GB memory intel Xeon (non-nehalem) nodes.

$ qsub -l nodes=1:ppn=16,walltime=1:00:00 -q largemem -I

TCS

The TCS currently only has one queue, or class, in use called "verylong" and its only limitation is that jobs must be under 48 hours.

#@ class           = verylong

Job Info

To see all jobs queued on a system use

$ showq

Three sections are shown; running, idle, and blocked. Idle jobs are commonly referred to as queued jobs as they meet all the requirements, however they are waiting for available resources. Blocked jobs are either caused by improper resource requests or more commonly by exceeding a user or groups allowable resources. For example if you are allowed to submit 10 jobs and you submit 20, the first 10 jobs will be submitted properly and either run right away or be queued, however the other 10 jobs will be blocked and the jobs won't be submitted to the queue until one of the first 10 finishes.

Available Resources

Determining when your job will run can be tricky as it involves a combination of queue type, node type, system reservations, and job priority. The following commands are provided to help you figure out what resources are currently available, however they may not tell you exactly when your job will run for the aforementioned reasons.

GPC

To show how many qdr nodes are currently free, use the show back fill command

$ showbf -f qdr

To show how many ddr nodes are free, use

$ showbf -f ddr

TCS

To show how many TCS nodes are free, use

$ showbf -c verylong

For example checking for a qdr job

$ showbf -f qdr
Partition     Tasks  Nodes      Duration   StartOffset       StartDate
---------     -----  -----  ------------  ------------  --------------
ALL           14728   1839       7:36:23      00:00:00  00:23:37_09/24
ALL             256     30      INFINITY      00:00:00  00:23:37_09/24

shows that for jobs under 7:36:23 you can use 1839 nodes, but if you submit a job over that time only 30 will be available. In this case this is due to a large reservation made my SciNet staff, but from a users point of view, showbf tells you very simply what is available and at what time point. In this case, a user may wish to set #PBS -l walltime=7:30:00 in their script, or add -l walltime=7:30:00 to their qsub command in order to ensure that the jobs backfill the reserved nodes.

NOTE: showbf shows currently available nodes, however just because nodes are available doesn't mean that your job will start right away. Job priority, system reservations along with dedicated nodes, such as those for the debug queue, will alter when jobs run so even if enough nodes appear "free", it doesn't mean your job will actually run right away.

Job Submission

Interactive

On the GPC an interactive queue session can be requested using the following

$ qsub -l nodes=2:ppn=8,walltime=1:00:00 -I

In case you may experience longer than usual delays when requesting an interactive session, try specifying the "debug" queue:

$ qsub -l nodes=2:ppn=8,walltime=1:00:00 -I -q debug

Non-interactive (Batch)

For a non-interactive job submission you require a submission script formatted for the appropriate resource manger. Examples are provided for the GPC and TCS.

Job Status

$ qstat jobid

To see the status of all your jobs, use

showq -u username

or

qstat -u username

If your job appears to be blocked, you can try the following command

$ checkjob jobid

which gives more verbose and often a bit confusing output.

Cancel a Job

$ canceljob jobid

To cancel all of your jobs, try this command:

$ canceljob `showq | grep yourusername | cut -c -8` 

or using the torque commands (these tend to work faster for larger number of jobs)

$ qdel `qstat | grep yourusername | cut -c -8` 

Accounting

For any user with an NRAC/LRAC allocation, a special account with the Resource Allocation Project (RAP) identifier (RAPI) from Compute Canada Database (CCDB) is set up in order to access the allocated resources. Please use the following instructions to run your job using your special allocation. This is necessary both for accounting purposes as well as to assign the appropriate priority to your jobs.

Each job run on the system will have a default RAP associated with it. Most users already have their default RAP properly set. However, if you have more than one allocation (different RAPs), you may need/want to change your default RAP in order to charge your jobs to a particular RAP.

Changing your default RAP

  1. Go to the portal, login with your SciNet username and password.
  2. Click on "Change SciNet default RAP" and change your default RAP.

Specifying the RAP for GPC

Alternatively, you may want to assign a RAP for each particular job you run. There are two ways to specify an account for Moab/Torque: From the command line or inside the batch submission script.

Command line

Use the '-A RAPI' flag when you submit your job using qsub. Note that the command line option will override the submission script if an account is specified on both the submission script and the command line. "RAPI" is the RAP Identifier, e.g. abc-123-de.

Submission Script

Add a line in your submit script as follows:

#PBS -A RAPI

Please replace "RAPI" with your RAP Identifier.

Specifiying the RAP for TCS

Add a line in your submit script as follows:

# @ account_no = RAPI

Please replace "RAPI" with your RAP Identifier.


User Stats

Show current usage stats for a $USER

$ showstats -u $USER

Reservations

$ showres

Standard users can only see their reservations not other users or system ones. To determine what is available a user can use "showbf", it shows what resources are available and at what time level, taking into account running jobs and all the reservations. Refer to the Available Resources section of this page for more details.

Job Dependencies

Sometimes you may want one job not to start until another job finishes, however you would like to submit them both at the same time. This can be done using job dependencies on both the GPC and TCS, however the commands are different due to the underlying resource managers being different.

GPC

Use the -W flag with the following syntax in your submission script to have this job not start until the job with jobid has successfully finished

-W depend=afterok:jobid

This functionality also allows to add dependencies on several jobs, eg.

-W depend=afterok:jobid1:jobid2:...:jobidN

which in this case, the job will not start until all the jobs jobid1, jobid2, ..., jobidN have successfully finished. Note that the length of the string "afterok:jobid1:jobid2...:jobidN" cannot exceed 1024 characters.


More detailed syntax, dependency options and examples can be found [here].

TCS

Loadleveler does job dependencies using what they call steps. See the TCS Quickstart guide for an example.

Adjusting Job Priority

The ability to adjust job priorities downwards can also be of use to adjust relative priorities of jobs between users who are running jobs of the same allocation (eg, a default or NRAC allocation of the same PI). Priorities are determined by how much of the time of that allocation been currently used, and all users using that account will have identical priorities. This mechanism allows users to voluntarily reduce their priority to allow other users of the same allocation to run ahead of them.

In principle, by adjusting a jobs priority downwards, you could reduce your jobs priority to the point that someone elses job entirely could go ahead of yours. In practice, however, this is extremely unlikely. Users with NRAC allocations have priorities that are extremely large positive numbers that depend on their allocation and how much of it they have already used during the past fairshare window (2 weeks); it is very unlikely that two groups would have priorities that are within 10 or 100 or 1000 of each other.

Note that at the moment, we do not allow priorities to go negative; they are integers that can go no lower than 1. (This may change in the future) That means that users of accounts that have already used their full allocation during the current fairshare period (eg, over the past two weeks), and so whose priority would normally be negative but is capped at 1, can not lower their priority any further. Similar, users with a `default' allocation have priority 1, and cannot lower their priorities any further.

GPC

Moab allows users to adjust their jobs' priority moderately downwards, with the -p flag; that is, on a qsub line

$ qsub ... -p -10  JOBID

or in a script

...
#PBS -p -10
..

The number used (-10 in the examples above) can be any negative number down to -1024.

The ability to adjust job priorities downwards can be useful when you are running a number of jobs and want some to enter the queue at higher priorities than others. Note that if you absolutely require some jobs to start before others, you could use job dependencies instead.


For a job that is currently queued, one can adjust its priority with

$ qalter -p -10 JOBID


Suspending a Running Job

Separate from, and in addition to, the ability to place a hold on a queued job, you may want to suspend a running job. For example, you may want to test the timing of events in a weakly coupled parallel environment.

GPC

To suspend a job:

qsig -s STOP <jobid>

and to start it again:

qsig -s CONT <jobid>.

Scripts are suspendable by default, so you don't need to add any signal handling for this to work. As far as we can tell, the result is identical to using fg and ctrl-Z (or kill -STOP <PID>) in an interactive run.

More about using (and trapping) signals can be found on the Using Signals page.


QDR Network Switch Affinity

The QDR network is globally 5:1 oversubscribed, but on switch it has full 1:1 cross-section whereas the DDR is completely 1:1 non-blocking. When a job is submitted to the GPC QDR nodes, the queuing system tries to fulfill the requirements with nodes on the same switch to improve network performance. If not enough nodes are available on the same switch to satisfy the job request, the queue will then use any available nodes. This behavior can be changed by using the submission flag " -l nodesetisoptional=false " which forces the queuing system to only run this job on the same switch, thus the job will stay queued until enough nodes on one switch are available to satisfy this request. Note that the maximum number of nodes on one switch is 30, so a request of greater than 30 nodes with this flag will never run.

qsub -l nodes=4:ppn=8,walltime=10:00,nodesetisoptional=false

Multiple Job Submissions

If you are doing doing batch processing of a number of similar jobs on the GPC, torque has a feature called job arrays that can be used to simplify this process. By using the "-t 0-N" option on the command line during job submission or putting it in the job script file, #PBS -t 0-N, torque will expand your single job submission into N+1 jobs and sets the environment variable PBS_ARRAYID equal to that jobs specific number, ie 0-N, for each job. This reduces the amount of calls to qsub, and can allow the user to have many less submission scripts. Job arrays also have the benefit of batching groups of jobs allowing commands like qalter, qdel, qhold to work on all or a subset of the job array jobs with one command, instead of having to run the command for each job.

In the following example, 10 jobs are submitted using a single command

qsub -t 0-9 jobscript.sh

and the submission script then modifies the job based on the PBS_ARRAYID.

#!/bin/bash
#PBS -l nodes=1:ppn=8,walltime=10:00:00
#PBS -N array_jobs

cd ${PBS_O_WORKDIR}
mkdir job.${PBS_ARRAYID}
cd job.${PBS_ARRAYID}

echo "Running job ${PBS_ARRAYID}"
mpirun -np 8 ./mycode >& array_job.${PBS_ARRAYID}.out

The JOBID and the job name both get the additional ARRAYID added onto them in the form of a hyphen, ie JOBID-ARRAYID. If for example you wanted to cancel all the jobs in a job array you would use "qdel JOBID", whereas if you wanted to cancel just one of the jobs you would use "qdel JOBID-ARRAYID".

See here and here for full details.


Job Tools and Scripts

The following list of commands and scripts are useful at the moment of monitoring and managing submissions and jobs. We have discussed them in a recent TechTalk. More details can be found here.


In addition, the following scripts are available:

diskUsage

informs about the user and group file system usage.

quota

offers a shorter version for the user.

qsum

similar to showq, summarizes by user.

jobperf jobID

informs about the performance per-node of a given job.

jobError <jobID | jobNAME>

displays on realtime the error output of a given job.

jobOutput <jobID | jobNAME>

displays on realtime the standard output of a given job.

jobcd <jobID | jobNAME>

allows users to quickly move into the working directory of a given job.

jobscript <jobID | jobNAME>

displays the submission script used when submitting a given job.

jobssh <jobID | jobNAME>

allows users to connect to the head-node of a given job.

jobtop <jobID | jobNAME>

allows users to "top" on the head-node of a given job.

jobtree [user]

displays the jobs tree of dependencies for a given user

jobdep <jobID>

displays the dependencies of a given job.

jobperf <jobID>

displays current performance statistics of a given job (must be running).


Monitoring Jobs

Tech Talk on Monitoring Jobs


Checking the memory usage from jobs

In many occasions it can be really useful to take a look at how much memory your job is using while it is running. There a couple of ways to do so:

1) using some of the command line utilities we have developed, e.g: by using the jobperf or jobtop utilities, it will allow you to check the job performance and head's node utilization respectively.

2) ssh into the nodes where your job is being run and check for memory usage and system stats right there. For instance, trying the 'top' or 'free' commands, in those nodes.

Also, it always a good a idea and strongly encouraged to inspect the standard output-log and error-log generated for your job submissions. These files are named respectively: JobName.{o|e}jobIdNumber; where JobName is the name you gave to the job (via the '-N' PBS flag) and JobIdNumber is the id number of the job. These files are saved in the working directory after the job is finished, but they can be also accessed on real-time using the jobError and jobOutput command line utilities.

Other related topics to memory usage:
Using Ram Disk
Different Memory Configuration nodes
Monitoring Jobs in the Queue