oldwiki.scinet.utoronto.ca - User contributions [en-gb]

R Statistical Package

2015-12-18T23:24:11Z

Leon: /* Running R on the GPC */

==Running R on the GPC==
[http://www.r-project.org/ R] is powerful statistical and plotting software available on the [[GPC_Quickstart|GPC]] in the [[Software_and_Libraries|module]] R. In fact, there are currently six R modules installed, 2.13.1, 2.14.1, 2.15.1, 3.0.0, 3.0.1 and 3.1.1. While 2.15.1 is the default, we do recommend making the transition to a newer version, which you load by specifying the version number explicitly:
<pre>
$ module load intel R/3.0.1
</pre>
(The intel module is a prerequisite for the R module.) If you will be using Rmpi, you will need to load the openmpi module as well.

Many optional packages are available for R which add functionality for specific domains; they are available through the [http://cran.r-project.org/mirrors.html Comprehensive R Archive Network (CRAN)].

R provides an easy way for users to install the libraries they need in their home directories rather than having them installed system-wide. Because there are so many optional packages that R users could want, we recommend that users who need additional packages proceed this way. This is almost certainly the easiest way to deal with the wide range of packages, ensure they're up to date, and ensure that users' package choices don't conflict.

In general, you can install the packages you need yourself in your home directory, e.g.,

<pre>
$ R
> install.packages("package-name", dependencies = TRUE)
</pre>

will download and compile the source for the packages you need in your home directory under <tt>${HOME}/R/x86_64-unknown-linux-gnu-library/2.11/</tt> (you can specify another directory with the <tt>lib=</tt> option). Then take a look at <tt>help(".libPaths")</tt> to make sure that R knows where to look for the packages you've compiled. Note that you must install packages while logged into a development node, as write access to the library folder is not available from a standard node on the cluster.

Note that during the installation you may get warnings that the packages cannot be installed in e.g. /scinet/gpc/Applications/R/3.0.1/lib64/R/bin/. But after those messages, R should have succeeded in installing the package into your home directory.
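
As a minimal sketch of the <tt>lib=</tt> alternative mentioned above, a personal library directory can be prepared from the shell before starting R. The function name and the default directory <tt>$HOME/R/library-3.0.1</tt> are made up for illustration; <tt>R_LIBS_USER</tt> is R's standard environment variable for a user library, and R only adds that directory to its search path if it already exists when R starts.

```bash
#!/bin/bash
# Create a personal R library directory and point R at it.
# The default location is a hypothetical choice, not a SciNet convention.
setup_r_library() {
    local libdir="${1:-$HOME/R/library-3.0.1}"
    # R ignores R_LIBS_USER if the directory does not exist at startup.
    mkdir -p "$libdir" || return 1
    export R_LIBS_USER="$libdir"
}
```

After calling <tt>setup_r_library</tt> in the same shell, running <tt>Rscript -e '.libPaths()'</tt> should list the new directory first, and <tt>install.packages()</tt> without a <tt>lib=</tt> argument should then install there.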

=== Running serial R jobs ===

As with all serial jobs, if your R computations do not use multiple cores, you should bundle them up so that all 8 cores of a node are performing work. Examples of this can be found on the [[User_Serial]] page.
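
To illustrate the idea of bundling, here is a rough sketch of a helper that starts several copies of a command in the background, one per core, and waits for all of them. The function name <tt>run_bundle</tt> is hypothetical, and this is not necessarily the approach recommended on the [[User_Serial]] page; it only shows the basic pattern.

```bash
#!/bin/bash
# Run "ncores" copies of a command in the background, passing each copy
# its index (1..ncores) as the last argument, then wait for all of them.
# Returns non-zero if any copy failed.
run_bundle() {
    local ncores="$1"; shift
    local i pid failed=0
    local pids=()
    for ((i = 1; i <= ncores; i++)); do
        "$@" "$i" &
        pids+=("$!")
    done
    for pid in "${pids[@]}"; do
        wait "$pid" || failed=1
    done
    return "$failed"
}
```

In a job script this might be called as <tt>run_bundle 8 Rscript --no-restore mytask.R</tt>, where <tt>mytask.R</tt> is a placeholder for a script that reads its task index from the command-line arguments.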

== Saving images from R in compute jobs ==

To make use of the graphics capability of R, R insists on having an X server running, even if you're just writing to a file. There is no X server on the compute nodes, and you'd get a message like

<pre>
unable to open connection to X11 display ''
</pre>

To get around this issue, you can run a 'virtual' X server on the compute nodes by adding the following commands at the start of your job script:

<pre>
# Make virtual X server command called Xvfb available:
module load Xlibraries

# Select a unique display number:
let DISPLAYNUM=$UID%65274
export DISPLAY=":$DISPLAYNUM"

# Start the virtual X server
Xvfb $DISPLAY -fp $SCINET_FONTPATH -ac 2>/dev/null &
</pre>

After this, run R or Rscript as usual. The virtual X server will be running in the background and will get killed when your job is done. Alternatively, you may want to kill it explicitly at the end of your job script using

<pre>
# Kill any remaining Xvfb server
pkill -u $UID Xvfb
</pre>
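
The display-number arithmetic and the cleanup step can also be wrapped in two small functions. This is only a sketch: the function names are made up for illustration, it assumes the Xlibraries module is already loaded, and it kills the specific Xvfb process it started (via its process ID) when the script exits, rather than using pkill.

```bash
#!/bin/bash
# Map a numeric user id to a display string, using the same modulus as
# the job-script lines above.
display_for_uid() {
    echo ":$(( $1 % 65274 ))"
}

# Start Xvfb in the background and arrange for it to be killed when the
# script exits. Assumes Xvfb and SCINET_FONTPATH are available.
start_virtual_x() {
    export DISPLAY
    DISPLAY=$(display_for_uid "$UID")
    Xvfb "$DISPLAY" -fp "$SCINET_FONTPATH" -ac 2>/dev/null &
    XVFB_PID=$!
    trap 'kill "$XVFB_PID" 2>/dev/null' EXIT
}
```

A job script would call <tt>start_virtual_x</tt> once before running R or Rscript.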

== Rmpi (R with MPI) ==

All the newer R installations on the GPC have Rmpi installed by default using OpenMPI. Be sure to load the OpenMPI module if you wish to use Rmpi.

=== Installing Rmpi, version 2.13.1 ===

Version 2.13.1 does not have the Rmpi library as a standard package, which means you have to install it yourself if you are using that version. The same is true if you want to use IntelMPI instead of OpenMPI.

Installing the Rmpi package can be a bit challenging, since some additional parameters need to be given to the installation, which contain the path to various header files and libraries. These paths differ depending on what [[GPC_MPI_Versions|MPI version]] you are using.

The various MPI versions on the GPC are loaded with the [[Software_and_Libraries|module]] command. So the first thing to do is to decide what MPI version to use (openmpi or intelmpi), and to type the corresponding "module load" command on the command line (as well as in your job scripts).

Because the MPI modules define all the paths in environment variables, the following command seems to work for installations of all openmpi versions.

<pre>
> install.packages("Rmpi",
configure.args =
c(paste("--with-Rmpi-include=",Sys.getenv("SCINET_MPI_INC"),sep=""),
paste("--with-Rmpi-libpath=",Sys.getenv("SCINET_MPI_LIB"),sep=""),
"--with-Rmpi-type=OPENMPI"))
</pre>

For intelmpi, you only need to change <tt>OPENMPI</tt> to <tt>MPICH2</tt> in the last line.

=== Running Rmpi ===

To start using R with Rmpi, make sure you have all required modules loaded (e.g. <tt>module load intel openmpi R/2.14.1</tt>), then launch it with
<pre>
$ mpirun -np 1 R --no-save
</pre>
which starts one master mpi process, but starts up the infrastructure to be able to spawn additional processes.

== Creating an R cluster ==

The 'parallel' package allows you to use R to launch individual serial subjobs across multiple nodes. This section describes how this is accomplished.

=== Creating your Rscript wrapper ===

The first thing to do is create a wrapper for Rscript. This needs to be done because the R module needs to be loaded on all nodes, but the submission script only loads modules on the head node of the job. The wrapper script, let's call it MyRscript.sh, is short:
<pre>
#!/bin/bash

module load intel/13.1.1 R/3.0.1
${SCINET_R_BIN}/Rscript --no-restore "$@"
</pre>
The "--no-restore" flag prevents Rscript from loading your "workspace image", if you have one saved. Loading the image causes problems for the cluster.

Once you've created your wrapper, make it executable:
<pre>
$ chmod u+x MyRscript.sh
</pre>
Your wrapper is now ready to be used.

=== The cluster R code ===
The R code which we will run consists of two parts, the code which launches the cluster, and does pre- and post-analysis, and the code which is run on the individual cluster "nodes". Here is some code which demonstrates this functionality. Let's call it MyClusterCode.R.
<pre>

######################################################
#
# worker code
#

# first define the function which will be run on all the cluster nodes. This is just a test function.
# Put your real worker code here.
testfunc <- function(a) {

    # this part is just to waste time
    b <- 0
    for (i in 1:10000) {
        b <- b + 1
    }

    s <- Sys.info()['nodename']
    return(paste0(s, " ", a[1], " ", a[2]))

}

######################################################
#
# head node code
#

# Create a bunch of index pairs to feed to the worker function. These could be parameters,
# or whatever your code needs to vary across jobs. Note that the worker function only
# takes a single argument; each entry in the list must contain all the information
# that the function needs to run. In this example, each entry contains a list which
# contains two pieces of information, a pair of indices.
indexlist <- list()
index <- 1
for (i in 1:10) {
    for (j in 1:10) {
        indexlist[index] <- list(c(i,j))
        index <- index + 1
    }
}

# Now set up the cluster.

# First load the parallel library.
library(parallel)

# Next find all the nodes which the scheduler has given to us.
# These are listed in the file which is indicated by the PBS_NODEFILE
# environment variable.
nodefile <- Sys.getenv("PBS_NODEFILE")
hostnames <- readLines(nodefile)

# Now launch the cluster, using the list of nodes and our Rscript
# wrapper.
cl <- makePSOCKcluster(names = hostnames, rscript = "/path/to/your/MyRscript.sh")

# Now run the worker code, using the parameter list we created above.
result <- clusterApplyLB(cl, indexlist, testfunc)

# The results of all the jobs will now be put in the 'result' variable,
# in the order they were specified in the 'indexlist' variable.

# Don't forget to stop the cluster when you're finished.
stopCluster(cl)
</pre>
You can, of course, add any post-processing code you need to the above code.

=== Submitting an R cluster job ===
You are now ready to submit your job to the GPC queue. The submission script is like most others:
<pre>
#!/bin/bash
#PBS -l nodes=3:ppn=8
#PBS -l walltime=5:00:00
#PBS -N MyRCluster

cd $PBS_O_WORKDIR
module load intel/13.1.1 R/3.0.1
${SCINET_R_BIN}/Rscript --no-restore MyClusterCode.R
</pre>
Be sure to use whatever number of nodes, length of time, etc., is appropriate for your job.
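
Since the R code above starts one worker per line of the node file, it can be useful to check what that file contains. The sketch below counts how many lines each host has; it assumes that the scheduler lists each host once per requested core, which is typical PBS/Torque behaviour but worth confirming on your system. The function name is made up for illustration.

```bash
#!/bin/bash
# Print "count hostname" for each distinct host in a PBS-style node file.
workers_per_host() {
    sort "$1" | uniq -c | awk '{print $1, $2}'
}
```

Inside a job, <tt>workers_per_host "$PBS_NODEFILE"</tt> would be expected to print three lines with a count of 8 each for the <tt>nodes=3:ppn=8</tt> request above, if that assumption holds.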

HPSS

2013-09-16T20:36:14Z

Leon: First step is to get a HPSS account

= '''High Performance Storage System''' =

The High Performance Storage System ([http://www.hpss-collaboration.org/index.shtml HPSS]) is a tape-backed hierarchical storage system that will provide a significant portion of the allocated storage space at SciNet. It is a repository for archiving data that is not being actively used. Data can be returned to the active GPFS filesystem when it is needed.

Because this system is intended for large data storage, it is accessible only to groups who have been awarded storage space at SciNet beyond 5TB in the yearly RAC resource allocation round.

Access and transfer of data into and out of HPSS is done under the control of the user, whose interaction is expected to be scripted and submitted as a batch job, using one or more of the following utilities:
* [http://www.mgleicher.us/GEL/hsi HSI] is a client with an ftp-like functionality which can be used to archive and retrieve large files. It is also useful for browsing the contents of HPSS.
* [http://www.mgleicher.us/GEL/htar HTAR] is a utility that creates tar formatted archives directly into HPSS. It also creates a separate index file (.idx) that can be accessed and browsed quickly.
* [https://support.scinet.utoronto.ca/wiki/index.php/ISH ISH] is a TUI utility that can perform an inventory of the files and directories in your tarballs.

We're currently running HPSS v 7.3.3 patch 6, and HSI/HTAR version 4.0.1.2

== '''Why should I use and trust HPSS?''' ==
* 10+ years history, used by about 40 facilities in the [http://www.top500.org “Top 500”] HPC list
* very reliable, data redundancy and data insurance built-in.
* highly scalable, reasonable performance at SciNet - Ingest: ~24 TB/day, Recall: ~12 TB/day (aggregated)
* HSI/HTAR clients also very reliable and used on several HPSS sites. ISH was written at SciNet.
* [[Media:HPSS_rationale_SNUG.pdf|HPSS fits well with the Storage Capacity Expansion Plan at SciNet]] (pdf presentation)

== '''Guidelines''' ==
* Expanded storage capacity is provided on tape -- a medium that is not suited for storing small files. Files smaller than ~200MB should be grouped into tarballs with tar or htar.
* Optimal performance for aggregated transfers and allocation on tapes is obtained with tarballs of size around 100GB.
* We strongly urge that you use the sample scripts we are providing as the basis for your job submissions.
* Make sure to check the application's exit code and returned logs for errors after any data transfer or tarball creation process
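
To find out which files in a directory tree fall under the small-file guideline, something like the following sketch could be used before archiving. The function name is hypothetical, and the size limit is interpreted here as 200*1024*1024 bytes, which is an assumption about the units intended above.

```bash
#!/bin/bash
# List regular files under a directory that are smaller than a limit
# given in MB (default 200, taken as 200*1024*1024 bytes).
small_files() {
    local dir="$1" limit_mb="${2:-200}"
    # "-size -Nc" matches files of fewer than N bytes.
    find "$dir" -type f -size "-$(( limit_mb * 1024 * 1024 ))c"
}
```

For example, <tt>small_files $SCRATCH/workarea | wc -l</tt> gives a quick count of the files that are candidates for grouping into a tarball.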

== '''New to the System?''' ==
The first step is to email scinet support and request an HPSS account (or else you will get "Error - authentication/initialization failed" and 71 exit codes).

This set of instructions on the wiki is the best and most compressed "manual" we have. It may seem a bit overwhelming at first because of all the job script templates we make available below (they are here so you don't have to think too much, just copy and paste), but if you approach the index at the top as a "case switch" mechanism for what you intend to do, everything falls into place.

Try this sequence:

1) [https://support.scinet.utoronto.ca/wiki/index.php/HPSS#Deleting_with_an_interactive_HSI_session take a look around HPSS using an interactive HSI session]

(most linux shell commands have an equivalent in HPSS)

2) [https://support.scinet.utoronto.ca/wiki/index.php/HPSS#Sample_tarball_create archive a small test directory using HTAR]

2a) use step 1) to see what happened

3) [https://support.scinet.utoronto.ca/wiki/index.php/HPSS#Sample_data_offload archive a file using hsi]

3a) use step 1) to see what happened

4) [https://support.scinet.utoronto.ca/wiki/index.php/HPSS#Sample_transferring_directories archive a small test directory using HSI]

4a) use step 1) to see what happened

5) now try the other cases and so on. In a couple of hours you'll be in pretty good shape.

== '''Bridge between BGQ and HPSS''' ==

BGQ users may transfer material to/from HPSS via the GPC archive queue. On the HPSS gateway node (gpc-archive01), the BGQ GPFS file systems are mounted under a single mount point, /bgq (/bgq/scratch and /bgq/home).

== '''Access Through the Queue System''' ==
All access to the archive system is done through the [[Moab|GPC queue system]].

* Job submissions should be done to the 'archive' queue
* Short jobs are limited to 1H walltime by default. Long jobs (> 1H) are limited to 72H walltime.
* Users are limited to only 1 long job and 1 short job at the same time.
* There can only be 5 long jobs running at any given time overall. Remaining submissions will be placed on hold for the time being. So far we have not seen a need for an overall limit on short jobs.

The status of pending jobs can be monitored with showq specifying the archive queue:
<pre>
showq -w class=archiveshort
OR
showq -w class=archivelong
</pre>

=== Scripted File Transfers ===
File transfers in and out of the HPSS should be scripted into jobs and submitted to the ''archive'' queue. See generic example below.
<source lang="bash">
#!/bin/bash
#PBS -l walltime=72:00:00
#PBS -q archive
#PBS -N htar_create_tarball_in_hpss
#PBS -j oe
#PBS -m e

echo "Creating a htar of finished-job1/ directory tree into HPSS"

trap "echo 'Job script not completed';exit 129" TERM INT
# Note that your initial directory in HPSS will be $ARCHIVE

DEST=$ARCHIVE/finished-job1.tar

# htar WILL overwrite an existing file with the same name so check beforehand.

hsi ls $DEST &> /dev/null
status=$?

if [ $status == 0 ]; then
echo "File $DEST already exists. Nothing has been done"
exit 1
fi

cd $SCRATCH/workarea/
htar -cpf $ARCHIVE/finished-job1.tar finished-job1/
status=$?

trap - TERM INT

if [ ! $status == 0 ]; then
echo 'HTAR returned non-zero code.'
/scinet/gpc/bin/exit2msg $status
exit $status
else
echo 'TRANSFER SUCCESSFUL'
fi
</source>
'''Note:''' Always trap the execution of your jobs for abnormal terminations, and be sure to return the exit code

=== Job Dependencies ===

Typically data will be recalled to /scratch when it is needed for analysis. Job dependencies can be constructed so that analysis jobs wait in the queue for data recalls before starting. The qsub flag is
<pre>
-W depend=afterok:<JOBID>
</pre>
where JOBID is the job number of the archive recalling job that must finish successfully before the analysis job can start.

Here is a short cut for generating the dependency (lookup [https://support.scinet.utoronto.ca/wiki/index.php/HPSS#Sample_data_recall data-recall.sh samples]):
<pre>
gpc04 $ qsub $(qsub data-recall.sh | awk -F '.' '{print "-W depend=afterok:"$1}') job-to-work-on-recalled-data.sh
</pre>
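
The same dependency can be built in two steps, which makes it easier to check that the recall job was actually submitted. The sketch below uses shell parameter expansion instead of awk to strip everything after the first dot from the identifier that qsub prints. The function name and the server name used in the comment are hypothetical.

```bash
#!/bin/bash
# Turn the job identifier printed by qsub (e.g. "1234567.some-server")
# into the corresponding dependency option.
depend_flag() {
    local jobid="${1%%.*}"   # keep everything before the first dot
    echo "-W depend=afterok:${jobid}"
}
```

A typical use would be <tt>recall_id=$(qsub data-recall.sh)</tt> followed by <tt>qsub $(depend_flag "$recall_id") job-to-work-on-recalled-data.sh</tt>.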

== '''HTAR''' ==
''' Please aggregate small files (<~200MB) into tarballs or htar files. '''

HTAR is a utility for aggregating a set of files and directories into an archive file that conforms to the POSIX TAR specification. It uses a sophisticated multithreaded buffering scheme to write files directly from GPFS into HPSS, thereby achieving a high rate of performance. HTAR does not do gzip compression, but it does have a built-in checksum algorithm.

'''Caution'''
* Files larger than 68 GB cannot be stored in an HTAR archive. If you attempt to start a transfer with any files larger than 68GB the whole HTAR session will fail, and you'll get a notification listing all those files, so that you can transfer them with HSI.
* Files with pathnames that are too long (greater than 100 characters) will be skipped, so as to conform with the TAR protocol (POSIX 1003.1 USTAR). Note that HTAR will erroneously indicate success in this case, but will produce exit code 70. For now, you can check for this type of error with "grep Warning my.output" after the job has completed.
* Unlike with cput/cget in HSI, "prompt before overwrite" is not the default with (h)tar. Be careful not to unintentionally overwrite a previous htar destination file in HPSS. There could be a similar situation when extracting material back into GPFS and overwriting the originals. Be sure to double-check the logic in your scripts.
* Check the HTAR exit code and log file before removing any files from the GPFS active filesystems.
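
The first two cautions can be checked before submitting an HTAR job. The following is a sketch only: the function names are hypothetical, and the default size limit reads "68 GB" as 68*10^9 bytes, which is an assumption since the exact threshold used by HTAR is not stated here. Path lengths are measured relative to the parent of the directory being archived, which matches the way the sample scripts invoke htar from inside the parent directory.

```bash
#!/bin/bash
# Print paths longer than "max" characters (default 100), measured as
# they would be passed to htar from the parent directory.
long_paths() {
    local dir="$1" max="${2:-100}"
    ( cd "$(dirname "$dir")" && find "$(basename "$dir")" ) |
        awk -v max="$max" 'length($0) > max'
}

# Print regular files larger than "limit" bytes. The default of
# 68000000000 bytes is an assumed reading of the 68 GB limit.
oversize_files() {
    local dir="$1" limit="${2:-68000000000}"
    find "$dir" -type f -size "+${limit}c"
}
```

If either function prints anything for the directory you intend to archive, those files would need to be renamed, moved, or transferred separately with HSI.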

=== HTAR Usage ===
* To write the ''file1'' and ''file2'' files to a new archive called ''files.tar'' in the default HPSS home directory, and preserve mask attributes (-p), enter:
<pre>
htar -cpf files.tar file1 file2
OR
htar -cpf $ARCHIVE/files.tar file1 file2
</pre>

* To write a ''subdirA'' to a new archive called ''subdirA.tar'' in the default HPSS home directory, enter:
<pre>
htar -cpf subdirA.tar subdirA
</pre>

* To extract all files from the ''project1/src'' directory in the archive file called ''proj1.tar'' in HPSS, and use the time of extraction as the modification time, enter:
<pre>
htar -xpmf proj1.tar project1/src
</pre>

* To display the names of the files in the ''out.tar'' archive file within the HPSS home directory, enter (the out.tar.idx file will be queried):
<pre>
htar -vtf out.tar
</pre>

For more details please check the '''[http://www.mgleicher.us/GEL/htar/ HTAR - Introduction]''' or the '''[http://www.mgleicher.us/GEL/htar/htar_man_page.html HTAR Man Page]''' online

==== Sample tarball create ====
<source lang="bash">
#!/bin/bash
#PBS -l walltime=72:00:00
#PBS -q archive
#PBS -N htar_create_tarball_in_hpss
#PBS -j oe
#PBS -m e

trap "echo 'Job script not completed';exit 129" TERM INT
# Note that your initial directory in HPSS will be /archive/$(id -gn)/$(whoami)/

DEST=$ARCHIVE/finished-job1.tar

# htar WILL overwrite an existing file with the same name so check beforehand.

hsi ls $DEST &> /dev/null
status=$?

if [ $status == 0 ]; then
echo "File $DEST already exists. Nothing has been done"
exit 1
fi

cd $SCRATCH/workarea/
htar -cpf $DEST finished-job1/
status=$?

trap - TERM INT

if [ ! $status == 0 ]; then
echo 'HTAR returned non-zero code.'
/scinet/gpc/bin/exit2msg $status
exit $status
else
echo 'TRANSFER SUCCESSFUL'
fi
</source>

'''Note:''' If you attempt to start a transfer with any files larger than 68GB the whole HTAR session will fail, and you'll get a notification listing all those files, so that you can transfer them with HSI.
<pre>
----------------------------------------
INFO: File too large for htar to handle: finished-job1/file1 (86567185745 bytes)
INFO: File too large for htar to handle: finished-job1/file2 (71857244579 bytes)
ERROR: 2 oversize member files found - please correct and retry
ERROR: [FATAL] error(s) generating filename list
HTAR: HTAR FAILED
###WARNING htar returned non-zero exit status
</pre>

==== Sample tarball list ====
<source lang="bash">
#!/bin/bash
#PBS -l walltime=72:00:00
#PBS -q archive
#PBS -N htar_list_tarball_in_hpss
#PBS -j oe
#PBS -m e

trap "echo 'Job script not completed';exit 129" TERM INT
# Note that your initial directory in HPSS will be $ARCHIVE

DEST=$ARCHIVE/finished-job1.tar

htar -tvf $DEST
status=$?

trap - TERM INT

if [ ! $status == 0 ]; then
echo 'HTAR returned non-zero code.'
/scinet/gpc/bin/exit2msg $status
exit $status
else
echo 'TRANSFER SUCCESSFUL'
fi
</source>

==== Sample tarball extract ====
<source lang="bash">
#!/bin/bash
#PBS -l walltime=72:00:00
#PBS -q archive
#PBS -N htar_extract_tarball_from_hpss
#PBS -j oe
#PBS -m e

trap "echo 'Job script not completed';exit 129" TERM INT
# Note that your initial directory in HPSS will be $ARCHIVE

cd $SCRATCH/recalled-from-hpss
htar -xpmf $ARCHIVE/finished-job1.tar
status=$?

trap - TERM INT

if [ ! $status == 0 ]; then
echo 'HTAR returned non-zero code.'
/scinet/gpc/bin/exit2msg $status
exit $status
else
echo 'TRANSFER SUCCESSFUL'
fi
</source>

== '''HSI''' ==

HSI may be the primary client with which some users will interact with HPSS. It provides an ftp-like interface for archiving and retrieving tarballs or [https://support.scinet.utoronto.ca/wiki/index.php/HPSS#Sample_transferring_directories directory trees]. In addition it provides a number of shell-like commands that are useful for examining and manipulating the contents in HPSS. The most commonly used commands will be:
{|border="1" cellpadding="10" cellspacing="0"
|-
| cput
| Conditionally saves or replaces a GPFSpath file to HPSSpath if the GPFS version is new or has been updated
cput [options] GPFSpath [: HPSSpath]
|-
| cget
| Conditionally retrieves a copy of a file from HPSS to GPFS only if a GPFS version does not already exist.
cget [options] [GPFSpath :] HPSSpath
|-
| cd,mkdir,ls,rm,mv
| Operate as one would expect on the contents of HPSS.
|-
| lcd,lls
| ''Local'' commands to GPFS
|}

*There are 3 distinctions about HSI that you should keep in mind, and that can generate a bit of confusion when you're first learning how to use it:
** HSI doesn't currently support renaming directory paths on-the-fly during transfers, so the cput/cget syntax may not work as one would expect in some scenarios, requiring some workarounds.
** HSI has an operator ":" which separates the GPFSpath and HPSSpath, and must be surrounded by whitespace (one or more space characters)
** The order for referring to files in HSI syntax is different from FTP. In HSI the general format is always the same, GPFS first, HPSS second, cput or cget:
<pre>
GPFSfile : HPSSfile
</pre>
For example, when using HSI to store the tarball file from GPFS into HPSS, then recall it to GPFS, the following commands could be used:
<pre>
cput tarball-in-GPFS : tarball-in-HPSS
cget tarball-recalled : tarball-in-HPSS
</pre>

unlike with FTP, where the following syntax would be used:
<pre>
put tarball-in-GPFS tarball-in-HPSS
get tarball-in-HPSS tarball-recalled
</pre>
* Simple commands can be executed on a single line.
<pre>
hsi "mkdir LargeFilesDir; cd LargeFilesDir; cput tarball-in-GPFS : tarball-in-HPSS"
</pre>

* More complex sequences can be performed using an excerpt such as this:
<pre>
hsi <<EOF
mkdir LargeFilesDir
cd LargeFilesDir
cput tarball-in-GPFS : tarball-in-HPSS
lcd $SCRATCH/LargeFilesDir2/
cput -Ruph *
end
EOF
</pre>

* The commands below are equivalent, but we recommend that you always use full path, and organize the contents of HPSS, where the default HSI directory placement is $ARCHIVE:
<pre>
hsi cput tarball
hsi cput tarball : tarball
hsi cput $SCRATCH/tarball : $ARCHIVE/tarball
</pre>

* There are no known issues renaming files on-the-fly:
<pre>
hsi cput $SCRATCH/tarball1 : $ARCHIVE/tarball2
hsi cget $SCRATCH/tarball3 : $ARCHIVE/tarball2
</pre>

* However the syntax forms such as the ones below will fail, since they rename the directory paths.
<pre>
hsi cput -Ruph $SCRATCH/LargeFilesDir : $ARCHIVE/LargeFilesDir (FAILS)
OR
hsi cget -Ruph $SCRATCH/LargeFilesDir : $ARCHIVE/LargeFilesDir2 (FAILS)
OR
hsi cput -Ruph $SCRATCH/LargeFilesDir/* : $ARCHIVE/LargeFilesDir2 (FAILS)
OR
hsi cget -Ruph $SCRATCH/LargeFilesDir : $ARCHIVE/LargeFilesDir (FAILS)
</pre>

One workaround is the following 2-step process, where you do an "lcd" in GPFS first, and then recursively transfer the whole directory (-R), keeping the same name. You may use the '-u' option to resume a previously disrupted session, '-p' to preserve timestamps, and '-h' to keep the links.
<pre>
hsi <<EOF
lcd $SCRATCH
cget -Ruph LargeFilesDir
end
EOF
</pre>

Another workaround is to do an "lcd" into the GPFSpath first and a "cd" into the HPSSpath, but transfer the files individually with the '*' wildcard character. This option lets you change the directory name:
<pre>
hsi <<EOF
lcd $SCRATCH/LargeFilesDir
mkdir $ARCHIVE/LargeFilesDir2
cd $ARCHIVE/LargeFilesDir2
cput -Ruph *
end
EOF
</pre>
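
The second workaround can be parameterized so that the command sequence is generated for any pair of directories and can be reviewed before it is sent to HSI. This is a sketch under the assumption that hsi reads the same commands from a pipe as it does from a here-document; the function name is made up for illustration.

```bash
#!/bin/bash
# Print the HSI command sequence for storing the contents of a local
# (GPFS) directory under a possibly differently named HPSS directory.
hsi_put_dir_commands() {
    local src="$1" dest="$2"
    # The "*" below is passed through literally: here-documents expand
    # variables but do not perform filename expansion.
    cat <<EOF
lcd $src
mkdir $dest
cd $dest
cput -Ruph *
end
EOF
}
```

Running <tt>hsi_put_dir_commands "$SCRATCH/LargeFilesDir" "$ARCHIVE/LargeFilesDir2"</tt> on its own prints the commands for inspection; piping the output into <tt>hsi</tt> would then execute them.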

=== Documentation ===
Complete documentation on HSI is available from the Gleicher Enterprises links below. You may peruse those links and come up with alternative syntax forms. You may even be already familiar with HPSS/HSI from other HPC facilities, which may or may not have procedures similar to ours. HSI doesn't always work as expected when you go outside of our recommended syntax, so '''we strongly urge that you use the sample scripts we are providing as the basis''' for your job submissions.
* [http://www.mgleicher.us/GEL/hsi/ HSI Introduction]
* [http://www.mgleicher.us/GEL/hsi/hsi_man_page.html man hsi]
* [https://support.scinet.utoronto.ca/wiki/index.php/HSI_help hsi help]
* [http://www.mgleicher.us/GEL/hsi/hsi-exit-codes.html exit codes]
'''Note:''' HSI returns the highest-numbered exit code, in case of multiple operations in the same hsi session. You may use '/scinet/gpc/bin/exit2msg $status' to translate those codes into intelligible messages

=== Typical Usage Scripts===
The most common interactions will be ''putting'' data into HPSS, examining the contents (ls,ish), and ''getting'' data back onto GPFS for inspection or analysis.

==== Sample '''data offload''' ====
<source lang="bash">
#!/bin/bash

# This script is named: data-offload.sh
#PBS -l walltime=72:00:00
#PBS -q archive
#PBS -N offload
#PBS -j oe
#PBS -me

trap "echo 'Job script not completed';exit 129" TERM INT
# individual tarballs already exist

/usr/local/bin/hsi -v <<EOF1
mkdir put-away
cd put-away
cput $SCRATCH/workarea/finished-job1.tar.gz : finished-job1.tar.gz
end
EOF1
status=$?
if [ ! $status == 0 ];then
echo 'HSI returned non-zero code.'
/scinet/gpc/bin/exit2msg $status
exit $status
else
echo 'TRANSFER SUCCESSFUL'
fi

/usr/local/bin/hsi -v <<EOF2
mkdir put-away
cd put-away
cput $SCRATCH/workarea/finished-job2.tar.gz : finished-job2.tar.gz
end
EOF2
status=$?
if [ ! $status == 0 ];then
echo 'HSI returned non-zero code.'
/scinet/gpc/bin/exit2msg $status
exit $status
else
echo 'TRANSFER SUCCESSFUL'
fi

trap - TERM INT
</source>

'''Note:''' as in the above example, we recommend that you capture the (highest-numbered) exit code for each hsi session independently. And remember, you may improve your exit code verbosity by adding the excerpt below to your scripts:
<source lang="bash">
if [ ! $status == 0 ];then
echo 'HSI returned non-zero code.'
/scinet/gpc/bin/exit2msg $status
exit $status
else
echo 'TRANSFER SUCCESSFUL'
fi
</source>
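
Because the same status check is repeated after every session, it could also be factored into a small function. The sketch below is one possible way to do that; the function name is hypothetical, and it uses <tt>return</tt> instead of <tt>exit</tt> so that the calling script decides whether to stop. It calls the exit2msg helper only if it is present.

```bash
#!/bin/bash
# Report the outcome of an HSI or HTAR session and pass its status back.
report_status() {
    local tool="$1" status="$2"
    if [ "$status" -ne 0 ]; then
        echo "$tool returned non-zero code."
        # exit2msg is the SciNet helper used above; skip it if unavailable.
        if [ -x /scinet/gpc/bin/exit2msg ]; then
            /scinet/gpc/bin/exit2msg "$status"
        fi
    else
        echo 'TRANSFER SUCCESSFUL'
    fi
    return "$status"
}
```

After each session one would capture <tt>status=$?</tt> as before and then call <tt>report_status HSI "$status" || exit "$status"</tt>.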

==== Sample '''data list''' ====
A very trivial way to list the contents of HPSS would be to just submit the HSI 'ls' command.
<source lang="bash">
#!/bin/bash

# This script is named: data-list.sh
#PBS -l walltime=1:00:00
#PBS -q archive
#PBS -N hpss_ls
#PBS -j oe
#PBS -me

/usr/local/bin/hsi -v <<EOF
cd put-away
ls -R
end
EOF
</source>
''Warning: if you have a lot of files, the ls command will take a long time to complete. For instance, about 400,000 files can be listed in about an hour. Adjust the walltime accordingly, and be on the safe side.''

However, we provide a much more useful and convenient way to explore the contents of HPSS with the inventory shell [[ISH]]. This example creates an index of all the files in a user's portion of the namespace. The list is placed in the directory /home/$(whoami)/.ish_register that can be inspected from the gpc-devel nodes.
<source lang="bash">
#!/bin/bash

# This script is named: data-list.sh
#PBS -l walltime=1:00:00
#PBS -q archive
#PBS -N hpss_index
#PBS -j oe
#PBS -me

INDEX_DIR=$HOME/.ish_register
if ! [ -e "$INDEX_DIR" ]; then
mkdir -p $INDEX_DIR
fi

export ISHREGISTER="$INDEX_DIR"
/scinet/gpc/bin/ish hindex
</source>
''Note: the above warning on collecting the listing for many files applies here too.''

This index can be browsed or searched with ISH on the development nodes.
<source lang="bash">
gpc-f104n084-$ /scinet/gpc/bin/ish ~/.ish_register/hpss.igz
[ish]hpss.igz> help
</source>

ISH is a powerful tool that is also useful for creating and browsing indices of tar and htar archives, so please look at the [[ISH|documentation]] or built in help.

==== Sample '''data recall''' ====
<source lang="bash">
#!/bin/bash

# This script is named: data-recall.sh
#PBS -l walltime=72:00:00
#PBS -q archive
#PBS -N recall_files
#PBS -j oe
#PBS -me

trap "echo 'Job script not completed';exit 129" TERM INT

mkdir -p $SCRATCH/recalled-from-hpss

# individual tarballs previously organized in HPSS inside the put-away-on-2010/ folder
hsi -v << EOF
cget $SCRATCH/recalled-from-hpss/Jan-2010-jobs.tar.gz : $ARCHIVE/put-away-on-2010/Jan-2010-jobs.tar.gz
cget $SCRATCH/recalled-from-hpss/Feb-2010-jobs.tar.gz : $ARCHIVE/put-away-on-2010/Feb-2010-jobs.tar.gz
end
EOF
status=$?

trap - TERM INT

if [ ! $status == 0 ]; then
echo 'HSI returned non-zero code.'
/scinet/gpc/bin/exit2msg $status
exit $status
else
echo 'TRANSFER SUCCESSFUL'
fi
</source>

We should emphasize that a single ''cget'' of multiple files (rather than several separate gets) allows HSI to do optimization, as in the following example:
<source lang="bash">
#!/bin/bash

# This script is named: data-recall.sh
#PBS -l walltime=72:00:00
#PBS -q archive
#PBS -N recall_files_optimized
#PBS -j oe
#PBS -me

trap "echo 'Job script not completed';exit 129" TERM INT
mkdir -p $SCRATCH/recalled-from-hpss

# individual tarballs previously organized in HPSS inside the put-away-on-2010/ folder
hsi -v << EOF
lcd $SCRATCH/recalled-from-hpss/
cd $ARCHIVE/put-away-on-2010/
cget Jan-2010-jobs.tar.gz Feb-2010-jobs.tar.gz
end
EOF
status=$?

trap - TERM INT

if [ ! $status == 0 ]; then
echo 'HSI returned non-zero code.'
/scinet/gpc/bin/exit2msg $status
exit $status
else
echo 'TRANSFER SUCCESSFUL'
fi
</source>

==== Sample '''transferring directories''' ====
Remember, it's not possible to rename directories on-the-fly:
<pre>
hsi cget -Ruph $SCRATCH/LargeFiles-recalled : $ARCHIVE/LargeFiles (FAILS)
</pre>

One workaround is to transfer the whole directory (and sub-directories) recursively:
<source lang="bash">
#!/bin/bash

# This script is named: data-recall.sh
#PBS -l walltime=72:00:00
#PBS -q archive
#PBS -N recall_directories
#PBS -j oe
#PBS -me

trap "echo 'Job script not completed';exit 129" TERM INT

mkdir -p $SCRATCH/recalled

hsi -v << EOF
lcd $SCRATCH/recalled
cd $ARCHIVE/
cget -Ruph LargeFiles
end
EOF
status=$?

trap - TERM INT

if [ ! $status == 0 ]; then
echo 'HSI returned non-zero code.'
/scinet/gpc/bin/exit2msg $status
exit $status
else
echo 'TRANSFER SUCCESSFUL'
fi
</source>

Another workaround is to transfer files and subdirectories individually with the "*" wildcard character:
<source lang="bash">
#!/bin/bash

# This script is named: data-recall.sh
#PBS -l walltime=72:00:00
#PBS -q archive
#PBS -N recall_directories
#PBS -j oe
#PBS -me

trap "echo 'Job script not completed';exit 129" TERM INT

mkdir -p $SCRATCH/LargeFiles-recalled

hsi -v << EOF
lcd $SCRATCH/LargeFiles-recalled
cd $ARCHIVE/LargeFiles
cget -Ruph *
end
EOF
status=$?

trap - TERM INT

if [ ! $status == 0 ]; then
echo 'HSI returned non-zero code.'
/scinet/gpc/bin/exit2msg $status
exit $status
else
echo 'TRANSFER SUCCESSFUL'
fi
</source>

* For more details please check the '''[http://www.mgleicher.us/GEL/hsi/ HSI Introduction]''', the '''[http://www.mgleicher.us/GEL/hsi/hsi_man_page.html HSI Man Page]''' or the [https://support.scinet.utoronto.ca/wiki/index.php/HSI_help '''hsi help''']

== '''[[ISH|ISH]]''' ==
=== [[ISH|Documentation and Usage]] ===

== '''File and directory management''' ==
=== Moving/renaming ===
* you may use 'mv' or 'cp' in the same way as the linux version.
<source lang="bash">
#!/bin/bash
#PBS -l walltime=72:00:00
#PBS -q archive
#PBS -N deletion_script
#PBS -j oe
#PBS -m e

echo "HPSS file and directory management"

trap "echo 'Job script not completed';exit 129" TERM INT

/usr/local/bin/hsi -v <<EOF1
mkdir $ARCHIVE/2011
mv $ARCHIVE/oldjobs $ARCHIVE/2011
cp -r $ARCHIVE/almostfinished/*done $ARCHIVE/2011
end
EOF1
status=$?

trap - TERM INT

if [ ! $status == 0 ]; then
echo 'HSI returned non-zero code.'
/scinet/gpc/bin/exit2msg $status
exit $status
else
echo 'TRANSFER SUCCESSFUL'
fi
</source>

=== Deletions ===
==== Recommendations ====
* Be careful with 'cd' commands to non-existent directories before the 'rm' command. Results may be unpredictable.
* Avoid using the stand-alone wildcard character '''*'''. If a wildcard is necessary, bind it to a common pattern whenever possible, such as '*.tmp', to limit unintended deletions.
* Avoid using relative paths, and even the environment variable $ARCHIVE. It is better to write out the full paths explicitly in your scripts.
* Avoid using recursive/looped deletion instructions on $SCRATCH contents from the archive job scripts. Even on $ARCHIVE contents, it may be better to do it as an independent job submission, after you have verified that the original ingestion into HPSS finished without any issues.
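
Some of these recommendations can be checked mechanically before a deletion job is submitted. The following is a minimal sketch, not part of HSI: <tt>check_rm_target</tt> is a hypothetical helper that accepts a target only if it is an absolute path, contains no unexpanded variable, and does not end in a stand-alone wildcard. It does not contact HPSS and does not check whether the path exists.

<source lang="bash">
#!/bin/bash
# check_rm_target is a hypothetical helper, not part of HSI.
# Returns 0 if the target looks safe to hand to 'rm', 1 otherwise.
check_rm_target() {
    local target=$1
    case $target in
        /*) ;;                  # absolute path: keep checking
        *)  return 1 ;;         # relative path or empty string: reject
    esac
    case $target in
        *'$'*) return 1 ;;      # literal '$': a variable was left unexpanded
    esac
    case ${target##*/} in
        ''|'*') return 1 ;;     # trailing '/' or stand-alone wildcard: reject
    esac
    return 0
}

for t in '/archive/s/scinet/pinto/*.tmp' '/archive/s/scinet/pinto/*' 'obsolete'; do
    if check_rm_target "$t"; then
        echo "OK:       $t"
    else
        echo "REJECTED: $t"
    fi
done
</source>

Only the first of the three sample targets is accepted; the bare wildcard and the relative path are rejected.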

==== Typical example ====
<source lang="bash">
#!/bin/bash
#PBS -l walltime=72:00:00
#PBS -q archive
#PBS -N deletion_script
#PBS -j oe
#PBS -m e

echo "Deletion of an outdated directory tree in HPSS"

trap "echo 'Job script not completed';exit 129" TERM INT
# Note that the initial directory in HPSS ($ARCHIVE) has the path explicitly expanded

/usr/local/bin/hsi -v <<EOF1
rm /archive/s/scinet/pinto/*.tmp
rm -R /archive/s/scinet/pinto/obsolete
end
EOF1
status=$?

trap - TERM INT

if [ ! $status == 0 ]; then
echo 'HSI returned non-zero code.'
/scinet/gpc/bin/exit2msg $status
exit $status
else
echo 'TRANSFER SUCCESSFUL'
fi
</source>

==== Deleting with an interactive HSI session ====
* You may feel more comfortable acquiring an interactive shell, starting an HSI session and proceeding with your deletions that way. Keep in mind that you are restricted to 1 hour.

<pre>
gpc-f103n084-$ qsub -q archive -I
qsub: waiting for job 11611291.gpc-sched to start
qsub: job 11611291.gpc-sched ready

----------------------------------------
Begin PBS Prologue Mon May 28 13:15:28 EDT 2012 1338225328
Job ID: 11611291.gpc-sched
Username: pinto
Group: scinet
Nodes: gpc-archive01
End PBS Prologue Mon May 28 13:15:28 EDT 2012 1338225328
----------------------------------------
hpss-archive01-$ hsi
******************************************************************
* Welcome to HPSS@SciNet - High Perfomance Storage System *
* *
* Contact Information: support@scinet.utoronto.ca *
* NOTE: do not transfer SMALL FILES with HSI. Use HTAR instead *
* CHECK THE INTEGRITY OF YOUR TARBALLS *
******************************************************************
Username: pinto UID: 10010 Acct: 10010(10010) Copies: 2 Firewall: off [hsi.4.0.1 Thu Mar 22 11:44:03 EDT 2012]
[HSI]/archive/s/scinet/pinto-> rm -R junk

</pre>

== '''HPSS for the 'Watchmaker' ''' ==
=== Efficient alternative to htar ===
<source lang="bash">
#!/bin/bash
#PBS -l walltime=72:00:00
#PBS -q archive
#PBS -N tar_create_tarball_in_hpss_with_hsi_by_piping
#PBS -j oe
#PBS -m e

trap "echo 'Job script not completed';exit 129" TERM INT
# Note that your initial directory in HPSS will be $ARCHIVE

# When using a pipeline like this
set -o pipefail

# to put
tar -c $SCRATCH/mydir | hsi cput - : $ARCHIVE/mydir.tar
status=$?
if [ ! $status == 0 ]; then
echo 'TAR+HSI+piping returned non-zero code.'
/scinet/gpc/bin/exit2msg $status
exit $status
else
echo 'TRANSFER SUCCESSFUL'
fi

# to immediately generate an index
ish hindex $ARCHIVE/mydir.tar
status=$?
if [ ! $status == 0 ]; then
echo 'ISH returned non-zero code.'
/scinet/gpc/bin/exit2msg $status
exit $status
else
echo 'TRANSFER SUCCESSFUL'
fi

# to get
#cd $SCRATCH
#hsi cget - : $ARCHIVE/mydir.tar | tar -xv
#status=$?
# if [ ! $status == 0 ]; then
# echo 'TAR+HSI+piping returned non-zero code.'
# /scinet/gpc/bin/exit2msg $status
# exit $status
#else
# echo 'TRANSFER SUCCESSFUL'
#fi

trap - TERM INT

</source>
'''Notes:'''
* Combining commands in this fashion, besides being HPSS-friendly, should not be noticeably slower than a recursive put with HSI, which stores each file one by one. However, reading the files back from tape in this format will be many times faster. It also overcomes the current 68GB limit on the size of stored files that we have with htar.
* To top things off, we recommend indexing with ish (in the same script) immediately after the tarball is created, while it still resides in the HPSS cache. It would be as if htar had been used.
* To ensure that an error at any stage of the pipeline shows up in the returned status use: ''set -o pipefail'' (The default is to return the status of the last command in the pipeline and this is not what you want.)
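
The effect of ''set -o pipefail'' can be seen without touching HPSS. In the minimal sketch below, <tt>false</tt> stands in for a failing <tt>tar</tt> and <tt>cat</tt> stands in for <tt>hsi</tt>; both stand-ins are assumptions made purely for illustration.

<source lang="bash">
#!/bin/bash
# Stand-ins (assumptions for illustration): 'false' plays a failing tar,
# 'cat' plays an hsi that stores whatever it receives.

set +o pipefail
false | cat
status_default=$?      # status of the last command in the pipeline (cat) only

set -o pipefail
false | cat
status_pipefail=$?     # status of the rightmost command that failed (false)

echo "without pipefail: $status_default"
echo "with pipefail:    $status_pipefail"
</source>

Without pipefail the failure of the first stage is silently lost (status 0); with pipefail the pipeline reports status 1.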

=== Multi-threaded gzip'ed compression with pigz ===
We compiled a multi-threaded implementation of gzip called pigz (http://zlib.net/pigz/). It is now part of the "extras" module and can also be used on any compute or devel node. With pigz, the previous version of the script runs much faster than it would with 'tar -czf'. In addition, if ISH is piggy-backed onto the end of the script, it will know what to do with the newly created mydir.tar.gz compressed tarball.

<source lang="bash">
#!/bin/bash
#PBS -l walltime=72:00:00
#PBS -q archive
#PBS -N tar_create_compressed_tarball_in_hpss_with_hsi_by_piping
#PBS -j oe
#PBS -m e

trap "echo 'Job script not completed';exit 129" TERM INT
# Note that your initial directory in HPSS will be $ARCHIVE

# When using a pipeline like this
set -o pipefail

module load extras

# to put
tar -c $SCRATCH/mydir | pigz | hsi cput - : $ARCHIVE/mydir.tar.gz
status=$?
if [ ! $status == 0 ]; then
echo 'TAR+PIGZ+HSI+piping returned non-zero code.'
/scinet/gpc/bin/exit2msg $status
exit $status
else
echo 'TRANSFER SUCCESSFUL'
fi
</source>
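
Because the compressed tarball in the script above is never written to local disk, it can be worth rehearsing the same pipeline on a small directory first. The following is a minimal sketch under stated assumptions: a local file replaces the <tt>hsi cput</tt> stage, the directory and file names are hypothetical, and gzip is used as a fallback where pigz is not on the path. Since pigz writes standard gzip output, <tt>gzip -t</tt> and <tt>tar -tz</tt> can be used to check the result.

<source lang="bash">
#!/bin/bash
set -o pipefail

# Use pigz when available, otherwise fall back to single-threaded gzip.
if command -v pigz > /dev/null 2>&1; then
    compressor=pigz
else
    compressor=gzip
fi

# Hypothetical test data in a scratch directory
workdir=$(mktemp -d)
mkdir "$workdir/mydir"
echo 'first file'  > "$workdir/mydir/a.txt"
echo 'second file' > "$workdir/mydir/b.txt"

# Same shape as the archive pipeline, with a local file as the last stage
tar -cf - -C "$workdir" mydir | $compressor > "$workdir/mydir.tar.gz"
status=$?

# Integrity test of the gzip stream, then count the .txt files in the tarball
gzip -t "$workdir/mydir.tar.gz"
gzip_status=$?
nfiles=$(tar -tzf "$workdir/mydir.tar.gz" | grep -c '\.txt$')

rm -rf "$workdir"
echo "pipeline status: $status, gzip -t status: $gzip_status, files: $nfiles"
</source>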

=== Content Verification ===

==== HTAR CRC checksums ====
The <tt>-Hcrc</tt> option specifies that HTAR should generate CRC checksums when creating the archive.

<source lang="bash">
#!/bin/bash
#PBS -l walltime=72:00:00
#PBS -q archive
#PBS -N htar_create_tarball_in_hpss_with_checksum_verification
#PBS -j oe
#PBS -m e

trap "echo 'Job script not completed';exit 129" TERM INT
# Note that your initial directory in HPSS will be $ARCHIVE

cd $SCRATCH/workarea

# to put
htar -cpf $ARCHIVE/finished-job1.tar -Hcrc -Hverify=1 finished-job1/

# to get
#mkdir $SCRATCH/verification
#cd $SCRATCH/verification
#htar -Hcrc -xvpmf $ARCHIVE/finished-job1.tar

status=$?

trap - TERM INT

if [ ! $status == 0 ]; then
echo 'HTAR returned non-zero code.'
/scinet/gpc/bin/exit2msg $status
exit $status
else
echo 'TRANSFER SUCCESSFUL'
fi
</source>

==== Current HSI version - Checksum built-in ====

[http://www.mgleicher.us/GEL/hsi/hsi_reference_manual_2/checksum-feature.html More usage details here]

MD5 is the standard hashing algorithm for the HSI build at SciNet.

The checksum algorithm is very CPU-intensive. Although the checksum code is compiled with a high level of compiler optimization, transfer rates can be significantly reduced when checksum creation or verification is in effect. The amount of degradation in transfer rates depends on several factors, such as processor speed, network transfer speed, and speed of the local filesystem (GPFS).

<source lang="bash">
#!/bin/bash
#PBS -l walltime=72:00:00
#PBS -q archive
#PBS -N MD5_checksum_verified_transfer
#PBS -j oe
#PBS -m e

thefile=<GPFSpath>
storedfile=<HPSSpath>

# Generate checksum on fly (-c on)
hsi -q put -c on $thefile : $storedfile

# Check the exit code of the HSI process; it must be read right away,
# before any other command (even an assignment) resets $?
status=$?

if [ ! $status == 0 ]; then
echo 'HSI returned non-zero code.'
/scinet/gpc/bin/exit2msg $status
exit $status
else
echo 'TRANSFER SUCCESSFUL'
fi

# verify checksum
hsi lshash $storedfile
status=$?

if [ ! $status == 0 ]; then
echo 'HSI returned non-zero code.'
/scinet/gpc/bin/exit2msg $status
exit $status
else
echo 'TRANSFER SUCCESSFUL'
fi

# get the file back with checksum
hsi get -c on $storedfile
status=$?

if [ ! $status == 0 ]; then
echo 'HSI returned non-zero code.'
/scinet/gpc/bin/exit2msg $status
exit $status
else
echo 'TRANSFER SUCCESSFUL'
fi
</source>
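
A general point about the status checks used in these scripts: <tt>$?</tt> always refers to the most recent command, so it has to be saved on the line immediately after the command of interest. Any intervening command, even a plain variable assignment, overwrites it. A minimal sketch, with <tt>false</tt> standing in for a failed transfer (an assumption for illustration):

<source lang="bash">
#!/bin/bash
# 'false' stands in for a failed hsi transfer (assumption for illustration).

false
status_immediate=$?    # saved at once: holds the failure code

false
marker=done            # any intervening command, even an assignment...
status_late=$?         # ...replaces the failure code with its own status

echo "saved immediately: $status_immediate"
echo "saved too late:    $status_late"
</source>

The first value is 1 and the second is 0, so a check based on the late value would report success for a failed transfer.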

==== Prior to HSI version 4.0.1.1 ====

This will checksum the contents of the HPSSpath against the original GPFSpath after the transfer has finished.
<source lang="bash">
#!/bin/bash
#PBS -l walltime=72:00:00
#PBS -q archive
#PBS -N checksum_verified_transfer
#PBS -j oe
#PBS -m e

thefile=<GPFSpath>
storedfile=<HPSSpath>
# Base name used for the local checksum file /tmp/$fname.md5
fname=$(basename "$thefile")

# Generate checksum on fly using a named pipe so that file is only read from GPFS once
mkfifo /tmp/NPIPE
cat $thefile | tee /tmp/NPIPE | hsi -q put - : $storedfile &
pid=$!
md5sum /tmp/NPIPE |tee /tmp/$fname.md5
rm -f /tmp/NPIPE

# Check the exit code of the HSI process
wait $pid
status=$?

if [ ! $status == 0 ]; then
echo 'HSI returned non-zero code.'
/scinet/gpc/bin/exit2msg $status
exit $status
else
echo 'TRANSFER SUCCESSFUL'
fi

# change filename to stdin in checksum file
sed -i.1 "s+/tmp/NPIPE+-+" /tmp/$fname.md5

# verify checksum
hsi -q get - : $storedfile | md5sum -c /tmp/$fname.md5
status=$?

if [ ! $status == 0 ]; then
echo 'HSI returned non-zero code.'
/scinet/gpc/bin/exit2msg $status
exit $status
else
echo 'TRANSFER SUCCESSFUL'
fi
</source>
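
The named-pipe technique can be tried locally before it is used against HPSS. In the minimal sketch below, <tt>cat > "$copy"</tt> stands in for <tt>hsi -q put - : $storedfile</tt>, and all file names are hypothetical; it only illustrates how <tt>tee</tt> lets the source be read once while feeding both the checksum and the transfer.

<source lang="bash">
#!/bin/bash
# Local illustration only: 'cat > "$copy"' stands in for the hsi put,
# and all file names below are hypothetical.
workdir=$(mktemp -d)
src=$workdir/source.dat
copy=$workdir/stored.dat
fifo=$workdir/NPIPE

printf 'sample payload\n' > "$src"
mkfifo "$fifo"

# Read the source once: tee feeds the named pipe (for md5sum) and the
# stand-in transfer at the same time.
cat "$src" | tee "$fifo" | cat > "$copy" &
pid=$!
sum_source=$(md5sum < "$fifo" | cut -d' ' -f1)

# Exit code of the last stage of the background pipeline
wait $pid
status=$?

# Verify by checksumming what was "stored"
sum_copy=$(md5sum < "$copy" | cut -d' ' -f1)

rm -rf "$workdir"

if [ "$status" -eq 0 ] && [ "$sum_source" = "$sum_copy" ]; then
    echo 'CHECKSUMS MATCH'
else
    echo 'CHECKSUM MISMATCH'
fi
</source>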

== '''User provided Content/Suggestions''' ==
=== '''[[HPSS-by-pomes|Packing up large data sets and putting them on HPSS]]''' ===
(Pomés group recommendations)

[[Data Management|BACK TO Data Management]]