User Ramdisk

Revision as of 11:05, 18 June 2010

Using Ramdisk

On the GPC nodes, a `ramdisk' is available. Up to half of the memory on the node may be used as a temporary file system. This is particularly useful in the early stages of migrating desktop-computing codes to a High Performance Computing platform such as the GPC, especially codes that perform a lot of I/O, such as Blast. Heavy I/O becomes a bottleneck in large-scale computing, and carries a particular performance penalty on parallel file systems (such as the GPFS used at SciNet), since files are synchronized across the whole network.

Ramdisk is much faster than real disk, and is especially beneficial for codes which perform a lot of small I/O work, since the ramdisk requires no network traffic. However, each node sees only its own ramdisk and cannot see files on the ramdisks of other nodes.

To use the ramdisk, create files in, read from, and write to /dev/shm/.. just as you would to (e.g.) /scratch/USER/. Only the amount of RAM needed to store the files is taken up by the temporary file system. Thus if you have 8 serial jobs each requiring 1 GB of RAM, and 1 GB is taken up by various OS services, you would still have approximately 7 GB available to use as ramdisk on a 16 GB node. However, if you were to write 8 GB of data to the ramdisk, this would exceed available memory and your job would likely crash.
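As a minimal sketch of this (the `demo` subdirectory and file name are purely illustrative, and a Linux node with /dev/shm is assumed), staging a file onto the ramdisk and cleaning up afterwards looks like this:

```shell
# Create a personal directory on the ramdisk
RAMDIR=/dev/shm/$USER/demo
mkdir -p "$RAMDIR"

# Read and write files there just as on an ordinary file system
echo "hello from ramdisk" > "$RAMDIR/test.txt"
cat "$RAMDIR/test.txt"

# Always remove your files when done, so the node's RAM is freed
rm -rf "/dev/shm/$USER/demo"
```

Note that the final `rm -rf` is not optional in a real job; see the cleanup notes below.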

Note that when using the ramdisk:

  • At the start of your job, you can copy frequently accessed files to ramdisk (stage in). If there are many such files, it is beneficial to put them in a tar file.
  • One would periodically copy the output files from ramdisk to /scratch or /project, as well as at the end of the job, of course (stage out).
  • It is very important to delete your files from ramdisk at the end of your job. If you do not, the next user of that node will have less RAM available than they expect, and this might kill their jobs.

A script using the ramdisk in a one-day OpenMP job might look like this:

#!/bin/bash
#MOAB/Torque submission script for SciNet GPC 
#PBS -l nodes=1:ppn=8,walltime=24:00:00
#PBS -N ramdisk-test

#Job parameters:
execname=job          # name of the executable
input_tar=input.tar   # tar file with input files and executables
output_tar=out.tar    # file in which to store output
input_subdir=indir    # sub-directory (within input_tar) with input files
output_subdir=outdir  # sub-directory to contain output files
poll_period=60        # how often check for job completion (in seconds)
save_period=120       # how often to save output (in minutes)

#Track how long everything takes.
date

#Copy to ramdisk
echo "Stage-in: copying files to ramdisk directory /dev/shm/$USER"
mkdir -p /dev/shm/$USER
mkdir -p /dev/shm/$USER/$output_subdir
cd /dev/shm/$USER
cp $PBS_O_WORKDIR/$input_tar .
tar xf $input_tar
rm -rf $input_tar

#Track how long everything takes.
echo -n "Stage-in completed on "
date

#Run on ramdisk
echo "Starting job"
./$execname $input_subdir $output_subdir &
# Store the process id in $pid so we may check if it's still running:
pid=$!

#Note:
# 1. The above launching command is appropriate for a multi-threaded (OpenMP) application.
# 2. Ramdisk MPI jobs are limited to 1 node as /dev/shm is not shared across nodes.
# 3. For serial jobs, you'd want to start 8 jobs at the same time instead, e.g.
#     pid=""
#     for i in $(seq 8); do
#         mkdir -p $output_subdir/$i
#         ./$execname ${input_subdir}/$i ${output_subdir}/$i &
#         pid=$pid,$!
#     done
#     pid=${pid#,}   # strip the leading comma; is_running expects "pid1,pid2,..."

#Track how long everything takes.
echo -n "Job started on "
date

function save_results {    
    echo -n "Copying from directory $output_subdir to file $PBS_O_WORKDIR/$output_tar on "
    date
    tar cf $output_tar $output_subdir/*
    cp $output_tar $PBS_O_WORKDIR
    echo -n "Copying of output complete on "
    date
}

function cleanup_ramdisk {
    echo -n "Cleaning up ramdisk directory /dev/shm/$USER on "
    date
    rm -rf /dev/shm/$USER
    echo -n "done at "
    date
}

function trap_term {
    echo -n "Trapped term (soft kill) signal on "
    date
    save_results
    cleanup_ramdisk
    exit
}

function interruptible_sleep {
    # waits for a number of seconds
    # argument 1 = number of seconds
    # note: just doing `sleep $1' would not be interruptible!
    for m in `seq $1`; do  
        sleep 1
    done
}

function is_running {
    # check if one or more processes are running
    # argument 1 = a comma-separated list of PIDs (no spaces)
    ps -p $1 -o pid= | wc -l
}

# Trap the termination signal, and call the function 'trap_term' when
# that happens, so results may be saved.
trap "trap_term" TERM

#number of pollings per save period (rounded down):
npoll=$(($save_period*60/$poll_period))

#polling and saving loop
running=$(is_running $pid)
while [ $running -gt 0 ]
do
    for n in `seq $npoll`
    do
        interruptible_sleep $poll_period
        running=$(is_running $pid)
        if [ $running -eq 0 ]; then
            break
        fi
    done
    save_results
done

#Done
cleanup_ramdisk

echo -n "Job finished cleanly on "
date

Notes on this script:

  • The script assumes that the tar file input.tar contains the executable job and the input files in a subdirectory called indir (with further subdirectories for the case of 8 serial jobs).
  • The executable is supposed to take the locations of the input and output directory as arguments.
  • The trap command makes sure that the results get saved and the ramdisk gets cleaned up even when the job gets killed before the end of the script is reached. trap is a bash construct that executes the given command when the script receives, in this case, a TERM signal. The TERM signal is sent by the scheduler 30 seconds before your time is up.
  • You could also trap signals in your C, C++ or FORTRAN codes.
  • All files are kept in a subdirectory of /dev/shm. This makes the clean up simpler, and keeps things tidy when doing small test jobs on the development nodes.
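The trap mechanism above can be demonstrated in a few lines of bash. This is a toy sketch separate from the job script; the echoed messages are illustrative stand-ins for the real save_results and cleanup_ramdisk calls:

```shell
#!/bin/bash
# Register a handler for the TERM signal. In the real job script the
# handler would call save_results and cleanup_ramdisk, then exit.
trap 'echo "caught TERM: saving results"' TERM

# Simulate the scheduler's soft kill by sending TERM to ourselves;
# bash runs the trap handler, after which execution continues.
kill -TERM $$
echo "continuing after the trap handler"
```

Running this prints "caught TERM: saving results" followed by "continuing after the trap handler", showing that the handler runs instead of the default action (which would terminate the script immediately).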

Further notes:

  • Often collections of serial jobs are run on the ramdisk, see the serial run wiki page for more details on that.
  • If your application needs just a bit more ramdisk, there are 24 nodes with 18 GB and 84 nodes with 32 GB of RAM. These nodes can be requested with qsub -l nodes=2:ib:m18g:ppn=8,walltime=1:00:00 or qsub -l nodes=2:ib:m32g:ppn=8,walltime=1:00:00. They are InfiniBand nodes, which are in short supply, so only use them if you have to. Finally, there are 2 stand-alone large-memory (128 GB) nodes. They have 16 cores and are Intel machines running Linux, but they are not the same architecture (Nehalem) as the GPC compute nodes, so codes may have to be compiled separately for these machines. They can be accessed via the dedicated largemem queue. See GPC Quickstart.

--Rzon 18 June 2010 (UTC)