User Serial
General considerations
Use whole nodes...
When you submit a job on a SciNet system, it is run on one (or more than one) entire node - meaning that your job is occupying at least 8 processors during the duration of its run. The SciNet systems are usually, with many researchers waiting in the queue for computational resources, so we require that you make full use of the nodes that your job is allocated, so other resarchers don't have to wait unnecessarily, and so that your jobs get as much work done for you while they run as possible.
Often, the best way to make full use of the node is to run one large parallel computation; but sometimes it is beneficial to run several serial codes at the same time. On this page, we discuss ways to run suites of serial computations at once, as efficiently as possible, using the full resources of the node.
...but not more.
When running multiple jobs on the same node, it is essential to have a good idea of how much memory the jobs will require. The GPC compute nodes have about 14GB in total available to user jobs running on the 8 cores (a bit less, say 13GB, on the devel ndoes gpc01..04, and somewhat more for some compute nodes) So the jobs also have to be bunched in ways that will fit into 14GB. If they use more than this, they will crash the node, inconveniencing you and other researchers waiting for that node.
If that's not possible -- each individual job requires significantly in excess of ~1.75GB -- then it's possible to just run fewer jobs so that they do fit; but then, again there is an under-utilization problem. In that case, the jobs are likely candidates for parallelization, and you can contact us at <support@scinet.utoronto.ca> and arrange a meeting with one of the technical analysts to help you do just that.
If the memory requirements allow it, you could actually run more than 8 jobs at the same time, up to 16, exploiting the HyperThreading feature of the Intel Nehalem cores. It may seem counterintuitive, but running 16 jobs on 8 cores for certain types of tasks has increased some users overall throughput by 10 to 30 percent.
Is your job really serial?
While your program may not be explicitly parallel, it may use some of SciNet's threaded libraries for numerical computations, which can make use of multiple processors. In particular, SciNet's Python and R modules are compiled with aggressive optimization and using threaded numerical libraries which by default will make use of multiple cores for computations such as large matrix operations. This can greatly speed up individual runs, but by less (usually much less) than a factor of 8. If you do have many such computations to do, your throughput will be better - you will get more calculations done per unit time -if you turn off the threading and run multiple such computations at once. Threading is turned off with the shell script line export OMP_NUM_THREADS=1; that line will be included in the scripts below.
If your calculations do implicitly use threading, you may want to experiment to see what gives you the best performance - you may find that running 4 (or even 8) jobs with 2 threads each (OMP_NUM_THREADS=2), or 2 jobs with 4 threads, gives better performance than 8 jobs with 1 thread (and almost certainly better than 1 job with 8 threads). We'd encourage to you to perform exactly such a scaling test; for a small up-front investment in time you may significantly speed up all the computations you need to do.
Serial jobs of similar duration
The most straightforward way to run multiple serial jobs is to bunch the jobs in groups of 8 or more that will take roughly the same amount of time, and create a job that looks a bit like this <source lang="bash">
- !/bin/bash
- MOAB/Torque submission script for multiple serial jobs on
- SciNet GPC
- PBS -l nodes=1:ppn=8,walltime=1:00:00
- PBS -N serialx8
- DIRECTORY TO RUN - $PBS_O_WORKDIR is directory job was submitted from
cd $PBS_O_WORKDIR
- Turn off implicit threading in Python, R
export OMP_NUM_THREADS=1
- EXECUTION COMMAND; ampersand off 8 jobs and wait
(cd jobdir1; ./dojob1) & (cd jobdir2; ./dojob2) & (cd jobdir3; ./dojob3) & (cd jobdir4; ./dojob4) & (cd jobdir5; ./dojob5) & (cd jobdir6; ./dojob6) & (cd jobdir7; ./dojob7) & (cd jobdir8; ./dojob8) & wait </source>
There are four important things to take note of here. First, the wait command at the end is crucial; without it the job will terminate immediately, killing the 8 programs you just started.
Second is that it is important to group the programs by how long they will take. If (say) dojob8 takes 2 hours and the rest only take 1, then for one hour 7 of the 8 cores on the GPC node are wasted; they are sitting idle but are unavailable for other users, and the utilization of this node over the whole run is only 56%. This is the sort of thing we'll notice, and users who don't make efficient use of the machine will have their ability to use scinet resources reduced. If you have many serial jobs of varying length, use the submission script to balance the computational load, as explained below.
Third, we reiterate that if memory requirements allow it, you should try to run more than 8 jobs at once, with a maximum of 16 jobs.
GNU Parallel
GNU parallel is a really nice tool written by Ole Tange to run multiple serial jobs in parallel. It allows you to keep the processors on each 8core node busy, if you provide enough jobs to do.
GNU parallel is accessible on the GPC in the module gnu-parallel: <source lang="bash"> module load gnu-parallel/20140622 </source> Note that there are several versions of gnu-parallel installed on the GPC; we recommend using the newer version.
The citation for GNU Parallel is: O. Tange (2011): GNU Parallel - The Command-Line Power Tool, ;login: The USENIX Magazine, February 2011:42-47.
It is easiest to demonstrate the usage of GNU parallel by examples. Suppose you have 16 jobs to do, that these jobs duration varies quite a bit, but that the average job duration is around 10 hours. You could use the following script: <source lang="bash">
- !/bin/bash
- MOAB/Torque submission script for multiple serial jobs on SciNet GPC
- PBS -l nodes=1:ppn=8,walltime=24:00:00
- PBS -N gnu-parallel-example
- DIRECTORY TO RUN - $PBS_O_WORKDIR is directory job was submitted from
cd $PBS_O_WORKDIR
- Turn off implicit threading in Python, R
export OMP_NUM_THREADS=1
module load gnu-parallel/20140622
- EXECUTION COMMAND
parallel -j 8 <<EOF
cd jobdir1; ./dojob1; echo "job 1 finished" cd jobdir2; ./dojob2; echo "job 2 finished" cd jobdir3; ./dojob3; echo "job 3 finished" cd jobdir4; ./dojob4; echo "job 4 finished" cd jobdir5; ./dojob5; echo "job 5 finished" cd jobdir6; ./dojob6; echo "job 6 finished" cd jobdir7; ./dojob7; echo "job 7 finished" cd jobdir8; ./dojob8; echo "job 8 finished" cd jobdir9; ./dojob9; echo "job 9 finished" cd jobdir10; ./dojob10; echo "job 10 finished" cd jobdir11; ./dojob11; echo "job 11 finished" cd jobdir12; ./dojob12; echo "job 12 finished" cd jobdir13; ./dojob13; echo "job 13 finished" cd jobdir14; ./dojob14; echo "job 14 finished" cd jobdir15; ./dojob15; echo "job 15 finished" cd jobdir16; ./dojob16; echo "job 16 finished"
EOF </source>
The -j8 parameter sets the number of jobs to run at the same time, but 16 jobs are lined up. Initially, 8 jobs are given to the 8 processors on the node. When one of the processors is done with its assigned job, it will get a next job instead of sitting idle until the other processors are done. While you would expect that on average this script should take 20 hours (each processor on average has to complete two jobs of 10hours), there's a good chance that one of the processors gets two jobs that take more than 10 hours, so the job script requests 24 hours. How much more time you should ask for in practice depends on the spread in run times of the separate jobs.
Serial jobs of varying duration
If you have a lot (50+) of relatively short serial runs to do, of which the walltime varies, and if you know that eight jobs fit in memory without memory issues, then writing all the command explicitly in the jobscript can get tedious. If you follw a convention in that the jobs are all started by auxiliary scripts called jobs<something>, the following strategy in your submission script would maximize the cpu utilization. <source lang="bash">
- !/bin/bash
- MOAB/Torque submission script for multiple, dynamically-run
- serial jobs on SciNet GPC
- PBS -l nodes=1:ppn=8,walltime=1:00:00
- PBS -N serialdynamic
- DIRECTORY TO RUN - $PBS_O_WORKDIR is directory job was submitted from
cd $PBS_O_WORKDIR
- Turn off implicit threading in Python, R
export OMP_NUM_THREADS=1
module load gnu-parallel/20140622
- COMMANDS ARE ASSUMED TO BE SCRIPTS CALLED job*.sh
echo job*.sh | tr ' ' '\n' | parallel -j 8 </source> Notes:
- As before, GNU Parallel keeps 8 jobs running at a time, and if one finishes, starts the next. This is an easy way to do load balancing.
- You can in fact run more or less than 8 processes per node by modifying parallel's -j8 argument.
- Doing many serial jobs often entails doing many disk reads and writes, which can be detrimental to the performance. In that case, running from the ramdisk may be an option.
- When using a ramdisk, make sure you copy your results from the ramdisk back to the scratch after the runs, or when the job is killed because time has run out.
- More details on how to setup your script to use the ramdisk can be found on the Ramdisk wiki page.
- This script optimizes resource utility, but can only use 1 node (8 cores) at a time. The next section addresses how to use more nodes.
- While on the command line, the option "--bar" can be nice to see the progress, when running as a job, you would not see this status bar. Consider using the "--joblog FILENAME" option, which writes the finished jobs into the file FILENAME.
- The latter also keeps track of failed jobs, so you can later try to redo those with the same command, but with the option "--resume" added.
Version for more than 8 cores at once (still serial)
If you have hundreds of serial jobs that you want to run concurrently and the nodes are available, then the approach above, while useful, would require tens of scripts to be submitted separately. It is possible for you to request more than one node and to use the following routine to distribute your processes amongst the cores. In this case, it is important to use the newer version of GNU parallel installed on the GPC.
<source lang="bash">
- !/bin/bash
- MOAB/Torque submission script for multiple, dynamically-run
- serial jobs on SciNet GPC
- PBS -l nodes=25:ppn=8,walltime=1:00:00
- PBS -N serialdynamicMulti
- DIRECTORY TO RUN - $PBS_O_WORKDIR is directory job was submitted from
cd $PBS_O_WORKDIR
- Turn off implicit threading in Python, R
export OMP_NUM_THREADS=1
module load gnu-parallel/20140622
- START PARALLEL JOBS USING NODE LIST IN $PBS_NODEFILE
seq 800 | parallel -j8 --sshloginfile $PBS_NODEFILE --workdir $PWD ./myrun {} </source> Explanation:
- seq 800 outputs the numbers 1 through 800 on separate lines. This output is piped to (ie becomes the input of) the parallel command.
- The use of the "seq 800" is that each line that you give to parallel defines a new job. So here, there are 800 jobs.
- Each job runs a command, but because the numbers generated by seq are not commands, a real command is constructed, in this case, by the argument ./myrun {}. Here myrun is supposed to be the name of the application to run. The two curly brackets {} get replaced by the line from the input, that is, by one of the numbers.
- So parallel will run the 800 commands:
./myrun 1
./myrun 2
...
./myrun 800 - The parameter --sshloginfile $PBS_NODEFILE tells parallel to look for the file named $PBS_NODEFILE which contains the host names of the nodes assigned to the current job (as stated above, it is automatically generated).
- The parameter -j8 tells parallel to run 8 of these at a time on each of the hosts.
- The --workdir $PWD sets the working directory on the other nodes to the working directory on the first node. Without this, the run tries to start from the wrong place and will most likely fail (unless using the latest gnu parallel module, 20130422, which by default puts you in $PWD on the remote node).
- Loaded modules should get automatically loaded on the remote nodes too for the latest gnu parallel module, but not for earlier ones.
- If you need an environment variable to be transfered from the job script to the remotely running subjobs, use --env ENVIRONMENTVARIABLE. SciNet's gnu-parallel modules automatically transfer OMP_NUM_THREADS, and typical environment variables set by most modules.
Notes:
- Of course, this is just an example of what you could do with gnu parallel. How you set up your specific run depends on how each of the runs would be started. One could for instance also prepare a file of commands to run and make that the input to parallel as well.
- Note that submitting several bunches to single nodes, as in the section above, is a more failsafe way of proceeding, since a node failure would only affect one of these bunches, rather than all runs.
- GNU Parallel can be passed a file with the list of nodes to which to ssh, using --sshloginfile (thanks to Ole Tange for pointing this out). This list is automatically generated by the scheduler and its name is made available in the environment variable $PBS_NODEFILE.
- Alternatively, GNU Parallel can take a comma separated list of nodes given to its -S argument, but this would need to be constructed from the file $PBS_NODEFILE which contains all nodes assigned to the job, with each node duplicated 8x for the number of cores on each node.
- GNU Parallel can reads lines of input and convert those to arguments in the execution command. The execution command is the last argument given to parallel, with {} replaces by the lines on input.
- The --workdir argument is essential: it sets the working directory on the other nodes, which would default to your home directory if omitted. Since /home is read-only on the compute nodes, you would not like not get any output at all!
This is no longer true for the latest GNU Parallel modules (20130422), which puts you in the current directory on the remote nodes. - We reiterate that if memory requirements allow it, you should try to run more than 8 jobs at once, with a maximum of 16 jobs. You can run more or fewer than 8 processes per node by modifying the -j8 parameter to the parallel command.
More on GNU parallel
- Slides of the SciNet TechTalk on Gnu Parallel (14 Nov 2012)
- The documentation for GNU parallel can be found at http://www.gnu.org/software/parallel/
- Its man page can be found here http://www.gnu.org/software/parallel/man.html
- The man page is also available on the GPC when the gnu-parallel module is loaded, with the command
$ man parallel
. The man page contains options, such as how to make sure the output is not all scrambled, and examples.
GNU Parallel Reference
- O. Tange (2011): GNU Parallel - The Command-Line Power Tool, ;login: The USENIX Magazine, February 2011:42-47.
Older scripts
Older scripts, which mimicked some of GNU parallel functionality, can be found on the Deprecated scripts page.
--Rzon 02:22, 14 Nov 2010 (UTC)