Difference between revisions of "Scheduler"

Revision as of 14:10, 6 June 2011

The queueing system used at SciNet is based on Cluster Resources' Moab Workload Manager. Moab is used on both the GPC and the TCS; however, the backend resource manager is Torque on the GPC and IBM's LoadLeveler on the TCS.

This page outlines some of the most common Moab commands. Full documentation is available from Moab here, and full documentation for the Torque (PBS) commands is here.

Answers to some common questions about the queueing system can also be found in the FAQ.

Queues

GPC

batch

The batch queue is the default queue on the GPC, giving the user access to all the resources for jobs of up to 48 hours. If no queue is specified with the -q flag, the job is submitted to the batch queue.

debug

A debug queue has been set up primarily for code developers to quickly test and evaluate their codes and configurations without having to wait in the batch queue. There are currently 10 nodes reserved for the debug queue. Its limits are quite restrictive, to promote high turnover and availability: a user can use 2 nodes (16 cores) for up to 2 hours, or up to a maximum of 8 nodes (64 cores) for half an hour, and can have only one job in the debug queue at a time.

$ qsub -l nodes=1:ppn=8,walltime=1:00:00 -q debug -I
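The debug limits above can be checked before submitting. The following is a minimal sketch, not a SciNet tool: debug_request_ok is a hypothetical helper name, and it assumes that only the two stated combinations (up to 2 nodes for up to 2 hours, or up to 8 nodes for up to half an hour) are allowed, treating anything in between conservatively.

```shell
# Sketch: check a debug-queue request against the limits described above.
# debug_request_ok is a hypothetical helper, not part of Moab or Torque.
# Usage: debug_request_ok NODES MINUTES   (returns 0 if the request fits)
debug_request_ok() {
    dbg_nodes=$1
    dbg_minutes=$2
    # Reject nonsensical requests outright.
    if [ "$dbg_nodes" -lt 1 ] || [ "$dbg_minutes" -lt 1 ]; then
        return 1
    fi
    # Tier 1: up to 2 nodes for up to 2 hours.
    if [ "$dbg_nodes" -le 2 ] && [ "$dbg_minutes" -le 120 ]; then
        return 0
    fi
    # Tier 2: up to 8 nodes for up to half an hour.
    if [ "$dbg_nodes" -le 8 ] && [ "$dbg_minutes" -le 30 ]; then
        return 0
    fi
    return 1
}
```

For example, debug_request_ok 1 60 succeeds for the request shown above, while debug_request_ok 8 60 fails because 8 nodes are only available for half an hour.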

largemem

The largemem queue is used for accessing one of two 16-core Intel Xeon (non-Nehalem) nodes, each with 128 GB of memory.

$ qsub -l nodes=1:ppn=16,walltime=1:00:00 -q largemem -I

TCS

The TCS currently has only one queue, or class, in use. It is called "verylong", and its only limitation is that jobs must be under 48 hours.

#@ class           = verylong

Job Info

To see all jobs queued on a system use

$ showq

Three sections are shown: running, idle, and blocked. Idle jobs are commonly referred to as queued jobs: they meet all the requirements but are waiting for available resources. Blocked jobs are caused either by improper resource requests or, more commonly, by exceeding a user's or group's allowable resources. For example, if you are allowed to submit 10 jobs and you submit 20, the first 10 jobs will be submitted properly and either run right away or be queued; the other 10 jobs will be blocked and won't be submitted to the queue until one of the first 10 finishes.

If showq is returning output slowly, you can query cached info using

$ showq --noblock

Available Resources

Determining when your job will run can be tricky, as it involves a combination of queue type, node type, system reservations, and job priority. The following commands are provided to help you figure out what resources are currently available; for the reasons above, however, they may not tell you exactly when your job will run.

GPC

To show how many ethernet nodes are currently free, use the show backfill command (showbf)

$ showbf -f compute-eth

To show how many infiniband nodes are free, use

$ showbf -f ib

TCS

To show how many TCS nodes are free, use

$ showbf -c verylong

For example, checking for an ethernet job with

$ showbf -f compute-eth
Partition     Tasks  Nodes      Duration   StartOffset       StartDate
---------     -----  -----  ------------  ------------  --------------
ALL           14728   1839       7:36:23      00:00:00  00:23:37_09/24
ALL             256     30      INFINITY      00:00:00  00:23:37_09/24

shows that for jobs under 7:36:23 you can use 1839 nodes, but if you submit a job over that time only 30 will be available. In this case this is due to a large reservation made by SciNet staff, but from a user's point of view, showbf tells you very simply what is available and for how long. Here, a user may wish to set #PBS -l walltime=7:30:00 in their script, or add -l walltime=7:30:00 to their qsub command, to ensure that the jobs backfill the reserved nodes.

NOTE: showbf shows currently available nodes, but the fact that nodes are available doesn't mean that your job will start right away. Job priority and system reservations, along with dedicated nodes such as those for the debug queue, all affect when jobs run, so a job may have to wait even if enough nodes appear "free".
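Reading the table by hand can be automated. The sketch below is only an illustration: backfill_window is a hypothetical helper, and it assumes the column layout shown above (node count in the third column, duration in the fourth, durations written as [[DD:]HH:]MM:SS or INFINITY). Given a node count, it prints the longest listed duration for which at least that many nodes are free, or nothing if no row qualifies.

```shell
# Sketch: pick the longest backfill window that still offers enough nodes.
# backfill_window is a hypothetical helper; it reads showbf output on stdin.
# Usage: showbf -f compute-eth | backfill_window NODES
backfill_window() {
    awk -v want="$1" '
        # Convert a duration such as 7:36:23 or 1:00:00:00 to seconds.
        function seconds(d,    p, n, s) {
            if (d == "INFINITY") return 1e18
            n = split(d, p, ":")
            s = p[n] + 60 * p[n-1] + 3600 * p[n-2]
            if (n == 4) s += 86400 * p[1]
            return s
        }
        # Only data rows have a purely numeric node count in column 3.
        $3 ~ /^[0-9]+$/ && $3 + 0 >= want + 0 {
            if (seconds($4) > best) { best = seconds($4); window = $4 }
        }
        END { if (window != "") print window }
    '
}
```

With the output above, asking for 100 nodes prints 7:36:23, while asking for 20 nodes prints INFINITY. As the rest of this section stresses, this says nothing about job priority or reservations, so it is a guide for choosing a walltime rather than a guarantee of when a job starts.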

Job Submission

Interactive

On the GPC an interactive queue session can be requested using the following

$ qsub -l nodes=2:ppn=8,walltime=1:00:00 -I

Non-interactive (Batch)

For a non-interactive job submission you require a submission script formatted for the appropriate resource manager. Examples are provided for the GPC and TCS.
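As a rough illustration of what a GPC (Torque) submission script contains, the sketch below writes a minimal one to standard output. It is not one of the SciNet examples: make_gpc_script and ./my_program are hypothetical names, and ppn=8 is taken from the interactive examples on this page. The #PBS -l and #PBS -N directives and the PBS_O_WORKDIR variable are standard Torque features.

```shell
# Sketch: write a minimal Torque submission script to stdout.
# make_gpc_script and ./my_program are hypothetical names for illustration.
# Usage: make_gpc_script NODES WALLTIME JOBNAME > job.sh && qsub job.sh
make_gpc_script() {
    cat <<EOF
#!/bin/bash
#PBS -l nodes=$1:ppn=8,walltime=$2
#PBS -N $3

# Torque starts the job in the home directory; return to the submission directory.
cd "\$PBS_O_WORKDIR"
./my_program
EOF
}
```

For example, make_gpc_script 2 1:00:00 test > job.sh produces a script requesting the same resources as the interactive example above, which can then be submitted with qsub job.sh.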

Job Status

$ checkjob jobid

Cancel a Job

$ canceljob jobid

Accounting

For any user with an NRAC/LRAC allocation, a special account with the Resource Allocation Project (RAP) identifier (RAPI) from the Compute Canada Database (CCDB) is set up in order to access the allocated resources. Please use the following instructions to run your job using your special allocation. This is necessary both for accounting purposes and to assign the appropriate priority to your jobs.

Each job run on the system will have a default RAP associated with it. Most users already have their default RAP properly set. However, if you have more than one allocation (different RAPs), you may need/want to change your default RAP in order to charge your jobs to a particular RAP.

Changing your default RAP

Go to the portal and log in with your SciNet username and password.
Click on "Change SciNet default RAP" and change your default RAP.

Specifying the RAP for GPC

Alternatively, you may want to assign a RAP for each particular job you run. There are two ways to specify an account for Moab/Torque: from the command line or inside the batch submission script.

Command line

Use the '-A RAPI' flag when you submit your job using qsub. Note that the command line option will override the submission script if an account is specified on both the submission script and the command line. "RAPI" is the RAP Identifier, e.g. abc-123-de.

Submission Script

Add a line in your submit script as follows:

#PBS -A RAPI

Please replace "RAPI" with your RAP Identifier.

Specifying the RAP for TCS

Add a line in your submit script as follows:

# @ account_no = RAPI

Please replace "RAPI" with your RAP Identifier.

User Stats

Show current usage stats for the current user ($USER)

$ showstats -u $USER

Reservations

$ showres

Standard users can see only their own reservations, not those of other users or the system. To determine what is available, a user can use "showbf", which shows what resources are available and for how long, taking into account running jobs and all reservations. Refer to the Available Resources section of this page for more details.

Job Dependencies

Sometimes you may want one job not to start until another job finishes, but would like to submit them both at the same time. This can be done using job dependencies on both the GPC and TCS; the commands differ, however, because the underlying resource managers are different.

GPC

Use the -W flag with the following syntax in your submission script to have this job not start until the job with jobid or jobName (given with -N jobName) has successfully finished

-W depend=afterok:{jobid | jobName}

More detailed syntax and examples can be found here and here.
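A common pattern is to capture the job identifier that qsub prints and pass it to the next submission. The sketch below assumes the -W depend=afterok: form documented for Torque's qsub; submit_chain is a hypothetical helper name, and first.sh and second.sh stand in for your own scripts.

```shell
# Sketch: submit two scripts so the second starts only if the first succeeds.
# submit_chain is a hypothetical helper; it relies on qsub printing the new
# job's identifier on stdout.
# Usage: submit_chain first.sh second.sh
submit_chain() {
    # Submit the first job and remember its identifier.
    first_id=$(qsub "$1") || return 1
    # Submit the second job, held until the first finishes successfully.
    qsub -W depend=afterok:"$first_id" "$2"
}
```

The function prints the identifier of the second job, so chains of more than two jobs can be built by repeating the same pattern.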

TCS

LoadLeveler handles job dependencies using what it calls steps. See the TCS Quickstart guide for an example.

Adjusting Job Priority

The ability to adjust job priorities downwards can also be of use to adjust relative priorities of jobs between users who are running jobs under the same allocation (e.g., a default or NRAC allocation of the same PI). Priorities are determined by how much of that allocation's time has currently been used, and all users of that account will have identical priorities. This mechanism allows users to voluntarily reduce their priority so that other users of the same allocation can run ahead of them.

In principle, by adjusting a job's priority downwards, you could reduce your job's priority to the point that someone else's job entirely could go ahead of yours. In practice, however, this is extremely unlikely. Users with NRAC allocations have priorities that are extremely large positive numbers that depend on their allocation and how much of it they have already used during the past fairshare window (2 weeks); it is very unlikely that two groups would have priorities that are within 10 or 100 or 1000 of each other.

Note that at the moment, we do not allow priorities to go negative; they are integers that can go no lower than 1. (This may change in the future.) That means that users of accounts that have already used their full allocation during the current fairshare period (e.g., over the past two weeks), and whose priority would therefore normally be negative but is capped at 1, cannot lower their priority any further. Similarly, users with a `default' allocation have priority 1, and cannot lower their priorities any further.

GPC

Moab allows users to adjust their jobs' priority moderately downwards, with the -p flag; that is, on a qsub line

$ qsub ... -p -10 SCRIPT

or in a script

...
#PBS -p -10
...

The number used (-10 in the examples above) can be any negative number down to -1024.

The ability to adjust job priorities downwards can be useful when you are running a number of jobs and want some to enter the queue at higher priorities than others. Note that if you absolutely require some jobs to start before others, you could use job dependencies instead.

For a job that is currently queued, one can adjust its priority with

$ qalter -p -10 JOBID
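Since only values from -1 down to -1024 are accepted, a small wrapper can catch mistakes before calling qalter. This is a minimal sketch under that assumption; lower_priority is a hypothetical helper name, not a Moab or Torque command.

```shell
# Sketch: lower the priority of a queued job, rejecting out-of-range values.
# lower_priority is a hypothetical wrapper around qalter.
# Usage: lower_priority JOBID AMOUNT   (AMOUNT between 1 and 1024)
lower_priority() {
    if [ "$2" -lt 1 ] || [ "$2" -gt 1024 ]; then
        echo "amount must be between 1 and 1024" >&2
        return 1
    fi
    # The amount is passed to qalter as a negative priority offset.
    qalter -p "-$2" "$1"
}
```

For example, lower_priority JOBID 10 runs the same qalter command as shown above. Keep in mind that a job whose priority is already capped at 1 cannot be lowered any further.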
