Programming Concepts


Signal Handling

WARNING: SciNet is in the process of replacing this wiki with a new documentation site. For current information, please go to https://docs.scinet.utoronto.ca

General

In unix/linux one can send signals to a program or a script. Some common signals are:

Signal   Meaning                                                       Can be trapped?
HUP      Terminal log-out                                              Yes
INT      Interrupt signal (Ctrl-C is pressed)                          Yes
TERM     Termination of the process (as if 'kill' was called)          Yes
KILL     Kill the process (as if 'kill -9' was called)                 No
STOP     Suspend the process                                           No
CONT     Resume the process                                            Yes
USR1     Unspecified user signal (can be given with 'kill -USR1')      Yes
USR2     Unspecified user signal (can be given with 'kill -USR2')      Yes

From the node on which a process is running, signals can be sent using

 $ kill -[SIGNAL] [PID]

On the GPC, you can pass a signal to a running job using

 $ qsig -s [SIGNAL] [JOBID]

So for example, you could suspend the process running under job id 314159 by qsig -s STOP 314159, and resume it using qsig -s CONT 314159 (note that the wall clock keeps ticking!).

For qsig to work for signals whose default action is to kill your job (such as HUP, INT, TERM, and, perhaps surprisingly, USR1 and USR2), your job script should either trap these explicitly (see below), or contain a line

 trap : [LIST-OF-SIGNALS-TO-BE-PASSED] 

(e.g. trap : USR1 USR2) before the application that traps the signal is started. Note that the spaces surrounding the colon are essential here, because the colon is actually a bash command (one that does nothing).

Signals can also be sent by the system. One important instance is that on SciNet, a TERM signal is sent to a job whose requested time is over, after which the job has about 2 minutes to clean up.

With the techniques explained below, a program or script can be set up to listen for a particular kind of signal, except for KILL and STOP (which always kill and suspend a process, respectively). When it receives a signal of that kind, its execution is interrupted and a call is made to a function specified earlier in the program. It is up to the program or script what this signal handling function does, but it is a good idea to make the action appropriate for the event that triggers the signal. For instance, a TERM signal should be handled as a request to terminate the application. The user signals USR1 and USR2 do not have a pre-designated meaning and can be used for application-specific actions such as checkpointing.

Trapping signals in bash scripts

To trap signals in a bash script, one has to bind a specific signal to a command or function with the trap command. For example, <source lang="bash">
#!/bin/bash

# Print a message and exit when a TERM signal arrives.
trap "echo Term was trapped.; exit" TERM

# Sleep in 1-second steps so that the trap can run between them.
for ((i=0;i<60;i++))
do
    echo $i
    sleep 1
done
</source> Running this script in the background and sending it a TERM signal ('kill -TERM [pid]') will cause the message 'Term was trapped.' to be printed. Notes:

  • The trapped command has to use 'exit' explicitly to stop the script's execution.
  • Bash does not run a trap until the command currently running in the foreground has finished, so a single long 'sleep' would delay the handler. This is why the script contains a succession of 1 second 'sleep's.

Another useful example of signal trapping in a bash script is given on the wiki page about using ramdisk.

Trapping signals in C

To trap signals in a C program, one has to include the signal.h header file: <source lang="c">
#include <signal.h>
</source> and provide a signal handler function <source lang="c">
void term_trap(int sig)
{
   /* do something */
}
</source> which is linked to the specific signal somewhere as follows: <source lang="c">
signal(SIGTERM, term_trap);
</source> Note that the names of signals are prepended with "SIG" in signal.h.

A minimal example: <source lang="c">
#include <stdio.h>
#include <signal.h>
#include <unistd.h>

/* Signal handler: called when the process receives SIGTERM. */
void term_trap(int sig)
{
   printf("Term was trapped.\n");
}

int main()
{
   signal(SIGTERM, term_trap);  /* register the handler */
   sleep(60);                   /* returns early once a signal has been handled */
   return 0;
}
</source> To test this program:

  • save the above code as sigex.c
  • compile: icc -O3 -xHost sigex.c -o sigex
  • run: ./sigex &
  • note the pid (process identifier)
  • you can then give the process the TERM signal with kill -TERM [pid].
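
The minimal example does its work (printing) inside the handler itself. For application-specific actions such as checkpointing on USR1, a common alternative is to let the handler only set a flag and to act on that flag in the main loop. The following is a sketch of that pattern under stated assumptions, not SciNet's prescribed method: the names install_handlers and run_steps are hypothetical, and the counter stands in for actually writing a checkpoint.

```c
#include <signal.h>

/* Flags set by the handlers. volatile sig_atomic_t is the object type the
   C standard guarantees can be written safely from a signal handler. */
static volatile sig_atomic_t checkpoint_requested = 0;
static volatile sig_atomic_t terminate_requested = 0;

static void usr1_trap(int sig) { (void)sig; checkpoint_requested = 1; }
static void term_trap(int sig) { (void)sig; terminate_requested = 1; }

/* Register both handlers; returns 0 on success, -1 on failure. */
int install_handlers(void)
{
    if (signal(SIGUSR1, usr1_trap) == SIG_ERR) return -1;
    if (signal(SIGTERM, term_trap) == SIG_ERR) return -1;
    return 0;
}

/* Run up to nsteps steps (hypothetical main loop).
   Returns the number of steps completed before TERM was seen. */
int run_steps(int nsteps, int *checkpoints_written)
{
    int step;
    for (step = 0; step < nsteps; step++) {
        if (terminate_requested)
            break;                      /* TERM: stop before the next step */

        /* ... one unit of real work would go here ... */

        if (checkpoint_requested) {
            checkpoint_requested = 0;
            (*checkpoints_written)++;   /* placeholder for writing a checkpoint */
        }
    }
    return step;
}
```

With this structure, 'kill -USR1 [pid]' requests a checkpoint at the end of the current step, and 'kill -TERM [pid]' lets the current step finish before the loop exits.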

Trapping signals in C++

The same method as for a C program works in C++, except the header file is 'csignal'. For example: <source lang="c">
#include <iostream>
#include <csignal>
#include <unistd.h>

// Signal handler: called when the process receives SIGTERM.
void term_trap(int sig)
{
   std::cout << "Term was trapped.\n";
}

int main()
{
   signal(SIGTERM, term_trap);  // register the handler
   sleep(60);                   // returns early once a signal has been handled
}
</source>

Trapping signals in Fortran

Unfortunately, signal handling is not part of the Fortran standard, and different compilers have different ways of doing this. Below, examples are given for the three Fortran compilers available on SciNet: ifort, gfortran, and xlf. These examples were designed such that even though the ways to register the handler differ, the signal handlers themselves are always the same.

ifort

<source lang="fortran">
C IFORT SIGNAL TRAPPING EXAMPLE
      USE IFPORT
      EXTERNAL TRAP_TERM
      INTEGER  TRAP_TERM
      INTEGER  ERR
          ERR = SIGNAL (SIGTERM, TRAP_TERM, -1)
          CALL SLEEP (60)
      END

C SIGNAL HANDLER FUNCTION
      FUNCTION TRAP_TERM (SIG_NUM)
      INTEGER TRAP_TERM
      INTEGER SIG_NUM
          PRINT *, "Term was trapped."
          TRAP_TERM = 1
      END
</source>

gfortran

<source lang="fortran">
C GFORTRAN SIGNAL TRAPPING EXAMPLE
      INTRINSIC SIGNAL
      INTEGER SIGTERM
      PARAMETER (SIGTERM = 15)
      EXTERNAL TRAP_TERM
          CALL SIGNAL (SIGTERM, TRAP_TERM)
          CALL SLEEP (60)
      END

C SIGNAL HANDLER FUNCTION
      FUNCTION TRAP_TERM (SIG_NUM)
      INTEGER TRAP_TERM
      INTEGER SIG_NUM
          PRINT *, "Term was trapped."
          TRAP_TERM = 1
      END
</source> Note: the SIGNAL function is broken in the official gfortran versions 4.6.0 and 4.6.1, but this bug has been fixed in SciNet's gcc modules.

xlf

<source lang="fortran">
C XLF SIGNAL TRAPPING EXAMPLE
      INCLUDE 'fexcp.h'
      INTEGER SIGTERM
      PARAMETER (SIGTERM = 15)
      EXTERNAL TRAP_TERM
          CALL SIGNAL (SIGTERM, TRAP_TERM)
          CALL SLEEP (60)
      END

C SIGNAL HANDLER FUNCTION
      FUNCTION TRAP_TERM (SIG_NUM)
      INTEGER TRAP_TERM
      INTEGER SIG_NUM
          PRINT *, "Term was trapped."
          TRAP_TERM = 1
      END
</source>

--Rzon 30 June 2010

Checkpointing

Having your program output periodic checkpoints is a very good idea; if it doesn't already, you should think about adding this feature.

What are Checkpoints?

Checkpoint files are files output by your program which contain the entire state of the program run so far, so that if the program ends for whatever reason it can be restarted from that point as if it had not stopped.

SciNet's systems do not have system-level checkpointing (for various reasons, such as non-portability and difficulties with parallel programs), so you are responsible for your own checkpointing, from within your own code. It is the only way to be reliably able to restart your calculations.

Why Checkpoints?

Unlike a dedicated lab machine where you can run a job forever, on large shared computer facilities job wallclock time limits are typically 48 hours; that is, your job must end within 48 hours or it will be helpfully ended for you. Although exact limits vary, 48 hours is generally found to be a good balance between turnaround (ensuring people aren't waiting in the queue for weeks without being able to run) and being able to get a significant amount of work done in a single run. If, as will inevitably be the case, your runs grow to a size that they won't finish before the queue window closes, you will have to run jobs in several steps by outputting checkpoints and restarting from them.

Checkpoints also provide a certain amount of safety in case of hardware or software failure - one can restart from an earlier checkpoint without losing much work. In addition, checkpoints can be useful if you want to run to some intermediate state, and then use that as the starting point for several different runs; then you can save yourself having to run to the intermediate state many times.

What should be in Checkpoint Files?

Checkpoint files should contain the entire state of your run, at full precision, so that your run can continue exactly where it left off. Note that this may differ from typical outputs, which may not need to be in full precision or may not need absolutely everything in the simulation's memory.

Typically checkpoints are written out between iterations or steps in your job. To decide what needs to be output, ask yourself: 'what data structures need to be filled for the program to take the next steps?' Then the checkpoint writing should dump out all that information, and to restart from a checkpoint, you would read in that information, and then start from that step as if you had been running uninterrupted the whole time.
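
As an illustration of dumping and restoring everything needed for the next step, here is a minimal sketch in C. The sim_state structure and the function names write_checkpoint and read_checkpoint are hypothetical; a real code would list its own data structures. The state is written as raw binary so that floating-point values are restored at full precision, which assumes the checkpoint is read back on the same kind of machine that wrote it, and that the array holds at least one element.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical simulation state: everything needed to take the next step. */
typedef struct {
    long    step;   /* index of the last completed step */
    double  time;   /* simulation time reached          */
    size_t  n;      /* number of elements in x          */
    double *x;      /* the field being evolved          */
} sim_state;

/* Write the full state in binary; returns 0 on success, -1 on failure. */
int write_checkpoint(const char *path, const sim_state *s)
{
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    int ok = fwrite(&s->step, sizeof s->step, 1, f) == 1
          && fwrite(&s->time, sizeof s->time, 1, f) == 1
          && fwrite(&s->n,    sizeof s->n,    1, f) == 1
          && fwrite(s->x, sizeof *s->x, s->n, f) == s->n;
    if (fclose(f) != 0) ok = 0;     /* a failed close can mean lost data */
    return ok ? 0 : -1;
}

/* Read the state back; allocates s->x. Returns 0 on success, -1 on failure. */
int read_checkpoint(const char *path, sim_state *s)
{
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    s->x = NULL;
    int ok = fread(&s->step, sizeof s->step, 1, f) == 1
          && fread(&s->time, sizeof s->time, 1, f) == 1
          && fread(&s->n,    sizeof s->n,    1, f) == 1;
    if (ok) {
        s->x = malloc(s->n * sizeof *s->x);
        ok = s->x != NULL && fread(s->x, sizeof *s->x, s->n, f) == s->n;
    }
    fclose(f);
    if (!ok) { free(s->x); s->x = NULL; return -1; }
    return 0;
}
```

On restart, the program would call read_checkpoint and continue its loop from step + 1 instead of initializing from scratch.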

How often should you checkpoint?

There's a tension here between not spending a lot of time (or disk space) checkpointing and not losing much work if you need to restart from the last checkpoint. Checkpoint 'several' times during the queue window, for whatever value of 'several' is most suitable for your work.
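
One simple way to turn this into code is to derive a time interval from the wall-clock limit and the desired number of checkpoints, and to test between steps whether that interval has elapsed. The sketch below is only one possible policy; the function names checkpoint_interval and checkpoint_due are hypothetical, and the right number of checkpoints depends on how long each one takes to write.

```c
#include <time.h>

/* Seconds between checkpoints if n_checkpoints are spread evenly over a
   run of walltime_seconds. Falls back to the whole run for n < 1. */
double checkpoint_interval(double walltime_seconds, int n_checkpoints)
{
    if (n_checkpoints < 1)
        return walltime_seconds;
    return walltime_seconds / n_checkpoints;
}

/* Nonzero if at least interval_seconds have passed since the last checkpoint. */
int checkpoint_due(time_t now, time_t last_checkpoint, double interval_seconds)
{
    return difftime(now, last_checkpoint) >= interval_seconds;
}
```

In a main loop, one would call checkpoint_due(time(NULL), last, interval) after each step and, when it returns nonzero, write a checkpoint and set last to the current time.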

Rolling checkpoints

Because checkpoints may require substantial disk space, users often do not keep all of them. Only the last (say) two may be needed: after each new checkpoint is successfully written, the earliest one is deleted.
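
A rolling scheme that keeps the last two checkpoints can be sketched as follows. The idea is to write each new checkpoint to a temporary file first and rotate the file names only once that write has succeeded, so that an interrupted write never damages an existing checkpoint. The function name rotate_checkpoint and the file names are hypothetical, and the sketch assumes POSIX behaviour of rename(), which replaces an existing destination file.

```c
#include <stdio.h>

/* Rotate checkpoint files after 'tmp' has been completely written:
     latest -> previous   (the old 'previous' is overwritten, i.e. deleted)
     tmp    -> latest
   Returns 0 on success, -1 on failure. */
int rotate_checkpoint(const char *tmp, const char *latest, const char *previous)
{
    FILE *f = fopen(latest, "rb");
    if (f) {
        /* A latest checkpoint exists: keep it as the previous one. */
        fclose(f);
        if (rename(latest, previous) != 0)
            return -1;
    }
    return rename(tmp, latest) == 0 ? 0 : -1;
}
```

After every successful write one would call, for example, rotate_checkpoint("ckpt.tmp", "ckpt.latest", "ckpt.previous"), and on restart read "ckpt.latest", falling back to "ckpt.previous" if the latest file is unreadable.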