Difference between revisions of "Scheduler"

Revision as of 11:57, 26 April 2012

The queueing system used at SciNet is based around Cluster Resources Moab Workload Manager. Moab is used on both the GPC and TCS however Torque is used as the backend resource manager on the GPC and IBM's LoadLeveler is used on the TCS.

This page outlines some of the most common Moab commands with full documentation available from Moab here, the torque (pbs) commands full documentation is here.

Some common questions about the queuing system can be found on the FAQ as well.

Queues

GPC

batch

The batch queue is the default queue on the GPC allowing the user access to all the resources for jobs upto 48 hours. If a specific queue is not specified, -q flag, then a job is submitted to the batch queue. Most jobs will run in batch, and user's can use feature flags to determine the type of nodes they require.

For example, to request two nodes anywhere on the GPC (QDR or DDR), use

#PBS -l nodes=2:ppn=8,walltime=1:00:00

For two nodes using DDR, use

#PBS -l nodes=2:ddr:ppn=8,walltime=1:00:00

To get two nodes using QDR, instead, you would use

#PBS -l nodes=2:qdr:ppn=8,walltime=1:00:00

debug

A debug queue has been set up primarily for code developers to quickly test and evaluate their codes and configurations without having to wait in the batch queue. There are 10 nodes currently reserved for the debug queue. It has quite restrictive limits to promote high turnover and availability thus a user can only use 2 nodes (16 cores) for 2 hours, to a maximum of 8 nodes (64 cores) for 1/2 an hour and can only have one job in the debug queue at a time.

$ qsub -l nodes=1:ppn=8,walltime=1:00:00 -q debug -I

largemem

The largemem queue is used for accessing one of two 16 core with 128 GB memory intel Xeon (non-nehalem) nodes.

$ qsub -l nodes=1:ppn=16,walltime=1:00:00 -q largemem -I

TCS

The TCS currently only has one queue, or class, in use called "verylong" and its only limitation is that jobs must be under 48 hours.

#@ class           = verylong

Job Info

To see all jobs queued on a system use

$ showq

Three sections are shown; running, idle, and blocked. Idle jobs are commonly referred to as queued jobs as they meet all the requirements, however they are waiting for available resources. Blocked jobs are either caused by improper resource requests or more commonly by exceeding a user or groups allowable resources. For example if you are allowed to submit 10 jobs and you submit 20, the first 10 jobs will be submitted properly and either run right away or be queued, however the other 10 jobs will be blocked and the jobs won't be submitted to the queue until one of the first 10 finishes.

Available Resources

Determining when your job will run can be tricky as it involves a combination of queue type, node type, system reservations, and job priority. The following commands are provided to help you figure out what resources are currently available, however they may not tell you exactly when your job will run for the aforementioned reasons.

GPC

To show how many qdr nodes are currently free, use the show back fill command

$ showbf -f qdr

To show how many ddr nodes are free, use

$ showbf -f ddr

TCS

To show how many TCS nodes are free, use

$ showbf -c verylong

For example checking for a qdr job

$ showbf -f qdr
Partition     Tasks  Nodes      Duration   StartOffset       StartDate
---------     -----  -----  ------------  ------------  --------------
ALL           14728   1839       7:36:23      00:00:00  00:23:37_09/24
ALL             256     30      INFINITY      00:00:00  00:23:37_09/24

shows that for jobs under 7:36:23 you can use 1839 nodes, but if you submit a job over that time only 30 will be available. In this case this is due to a large reservation made my SciNet staff, but from a users point of view, showbf tells you very simply what is available and at what time point. In this case, a user may wish to set #PBS -l walltime=7:30:00 in their script, or add -l walltime=7:30:00 to their qsub command in order to ensure that the jobs backfill the reserved nodes.

NOTE: showbf shows currently available nodes, however just because nodes are available doesn't mean that your job will start right away. Job priority, system reservations along with dedicated nodes, such as those for the debug queue, will alter when jobs run so even if enough nodes appear "free", it doesn't mean your job will actually run right away.

Job Submission

Interactive

On the GPC an interactive queue session can be requested using the following

$ qsub -l nodes=2:ppn=8,walltime=1:00:00 -I

Non-interactive (Batch)

For a non-interactive job submission you require a submission script formatted for the appropriate resource manger. Examples are provided for the GPC and TCS.

Job Status

$ checkjob jobid

Cancel a Job

$ canceljob jobid

Accounting

For any user with an NRAC/LRAC allocation, a special account with the Resource Allocation Project (RAP) identifier (RAPI) from Compute Canada Database (CCDB) is set up in order to access the allocated resources. Please use the following instructions to run your job using your special allocation. This is necessary both for accounting purposes as well as to assign the appropriate priority to your jobs.

Each job run on the system will have a default RAP associated with it. Most users already have their default RAP properly set. However, if you have more than one allocation (different RAPs), you may need/want to change your default RAP in order to charge your jobs to a particular RAP.

Changing your default RAP

Go to the portal, login with your SciNet username and password.
Click on "Change SciNet default RAP" and change your default RAP.

Specifying the RAP for GPC

Alternatively, you may want to assign a RAP for each particular job you run. There are two ways to specify an account for Moab/Torque: From the command line or inside the batch submission script.

Command line

Use the '-A RAPI' flag when you submit your job using qsub. Note that the command line option will override the submission script if an account is specified on both the submission script and the command line. "RAPI" is the RAP Identifier, e.g. abc-123-de.

Submission Script

Add a line in your submit script as follows:

#PBS -A RAPI

Please replace "RAPI" with your RAP Identifier.

Specifiying the RAP for TCS

Add a line in your submit script as follows:

# @ account_no = RAPI

Please replace "RAPI" with your RAP Identifier.

User Stats

Show current usage stats for a $USER

$ showstats -u $USER

Reservations

$ showres

Standard users can only see their reservations not other users or system ones. To determine what is available a user can use "showbf", it shows what resources are available and at what time level, taking into account running jobs and all the reservations. Refer to the Available Resources section of this page for more details.

Job Dependencies

Sometimes you may want one job not to start until another job finishes, however you would like to submit them both at the same time. This can be done using job dependencies on both the GPC and TCS, however the commands are different due to the underlying resource managers being different.

GPC

Use the -W flag with the following syntax in your submission script to have this job not start until the job with jobid or jobName (given with -N jobName) has successfully finished

-W depend:afterok:{jobid | jobName}

More detailed syntax and examples can be found [here ] and [here].

TCS

Loadleveler does job dependencies using what they call steps. See the TCS Quickstart guide for an example.

Adjusting Job Priority

The ability to adjust job priorities downwards can also be of use to adjust relative priorities of jobs between users who are running jobs of the same allocation (eg, a default or NRAC allocation of the same PI). Priorities are determined by how much of the time of that allocation been currently used, and all users using that account will have identical priorities. This mechanism allows users to voluntarily reduce their priority to allow other users of the same allocation to run ahead of them.

In principle, by adjusting a jobs priority downwards, you could reduce your jobs priority to the point that someone elses job entirely could go ahead of yours. In practice, however, this is extremely unlikely. Users with NRAC allocations have priorities that are extremely large positive numbers that depend on their allocation and how much of it they have already used during the past fairshare window (2 weeks); it is very unlikely that two groups would have priorities that are within 10 or 100 or 1000 of each other.

Note that at the moment, we do not allow priorities to go negative; they are integers that can go no lower than 1. (This may change in the future) That means that users of accounts that have already used their full allocation during the current fairshare period (eg, over the past two weeks), and so whose priority would normally be negative but is capped at 1, can not lower their priority any further. Similar, users with a `default' allocation have priority 1, and cannot lower their priorities any further.

GPC

Moab allows users to adjust their jobs' priority moderately downwards, with the -p flag; that is, on a qsub line

$ qsub ... -p -10  JOBID

or in a script

...
#PBS -p -10
..

The number used (-10 in the examples above) can be any negative number down to -1024.

The ability to adjust job priorities downwards can be useful when you are running a number of jobs and want some to enter the queue at higher priorities than others. Note that if you absolutely require some jobs to start before others, you could use job dependencies instead.

For a job that is currently queued, one can adjust its priority with

$ qalter -p -10 JOBID

Suspending a Running Job

Separate from, and in addition to, the ability to place a hold on a queued job, you may want to suspend a running job. For example, you may want to test the timing of events in a weakly coupled parallel environment.

GPC

To suspend a job:

qsig -s STOP <jobid>

and to start it again:

qsig -s CONT <jobid>.

Scripts are suspendable by default, so you don't need to add any signal handling for this to work. As far as we can tell, the result is identical to using fg and ctrl-Z (or kill -STOP <PID>) in an interactive run.

More about using (and trapping) signals can be found on the Using Signals page.

QDR Network Switch Affinity

The QDR network is globally 5:1 oversubscribed, but on switch it has full 1:1 cross-section whereas the DDR is completely 1:1 non-blocking. When a job is submitted to the GPC QDR nodes, the queuing system tries to fulfill the requirements with nodes on the same switch to improve network performance. If not enough nodes are available on the same switch to satisfy the job request, the queue will then use any available nodes. This behavior can be changed by using the submission flag " -l nodesetisoptional=false " which forces the queuing system to only run this job on the same switch, thus the job will stay queued until enough nodes on one switch are available to satisfy this request. Note that the maximum number of nodes on one switch is 30, so a request of greater than 30 nodes with this flag will never run.

qsub -l nodes=4:ppn=8,walltime=10:00,nodesetisoptional=false

Multiple Job Submissions

The current version of Torque we are running has problems with job arrays, so the section below currently does not apply.

If you are doing doing batch processing of a number of similar jobs on the GPC, torque has a feature called job arrays that can be used to simplify this process. By using the "-t 0-N" option on the command line during job submission or putting it in the job script file, #PBS -t 0-N, torque will expand your single job submission into N jobs and sets the environment variable PBS_ARRAYID equal to that jobs specific number, ie 0-N, for each job. This reduces the amount of calls to qsub, and can allow the user to have many less submission scripts. Job arrays also have the benefit of batching groups of jobs allowing commands like qalter, qdel, qhold to work on all or a subset of the job array jobs with one command, instead of having to run the command for each job.

In the following example, 10 jobs are submitted using a single command

qsub -t 0-10 jobscript.sh

and the submission script then modifies the job based on the PBS_ARRAYID.

#!/bin/bash
#PBS -l nodes=1:ppn=8,walltime=10:00:00
#PBS -N array_jobs

cd ${PBS_O_WORKDIR}
mkdir job.${PBS_ARRAYID}
cd job.${PBS_ARRAYID}

echo "Running job ${PBS_ARRAYID}"
mpirun -np 8 ./mycode >& array_job.${PBS_ARRAYID}.out

The JOBID and the job name both get the additional ARRAYID added onto them in the form of a hyphen, ie JOBID-ARRAYID. If for example you wanted to cancel all the jobs in a job array you would use "qdel JOBID", whereas if you wanted to cancel just one of the jobs you would use "qdel JOBID-ARRAYID".

See here and here for full details.