Difference between revisions of "HPSS"

From oldwiki.scinet.utoronto.ca

Revision as of 18:20, 6 July 2011

High Performance Storage System

(Pilot usage phase to start in Jun/2011 with a select group of users. Deployment and configuration are still a work in progress)

The High Performance Storage System (HPSS) is a tape-backed hierarchical storage system that will provide a significant portion of the allocated storage space at SciNet. It is a repository for archiving data that is not being actively used. Data can be returned to the active GPFS filesystem when it is needed.

Access and transfer of data into and out of HPSS is done under the control of the user, whose interaction is expected to be scripted and submitted as a batch job, using one or more of the following utilities:

  • HSI is a client with an ftp-like functionality which can be used to archive and retrieve large files. It is also useful for browsing the contents of HPSS.
  • HTAR is a utility that creates tar-formatted archives directly in HPSS. It also creates a separate index file (.idx) that can be accessed and browsed quickly.
  • ISH is a TUI utility that can perform an inventory of the files and directories in your tarballs.

Why should I use and trust HPSS?

  • 10+ years history, used by 50+ facilities in the “Top 500” HPC list
  • very reliable, data redundancy and data insurance built-in.
  • highly scalable, reasonable performance at SciNet - Ingest: ~12 TB/day, Recall: ~24 TB/day (aggregated)
  • HSI/HTAR clients also very reliable and used on several HPSS sites. ISH was written at SciNet.
  • HPSS fits well with the Storage Capacity Expansion Plan at SciNet (pdf presentation)

Guidelines

  • Expanded storage capacity is provided on tape, a medium that is not suited to storing small files. Files smaller than ~200MB should be grouped into tarballs with tar or htar.
  • Optimal performance for aggregated transfers and allocation on tapes is obtained with tarballs of around 100GB.
  • Make sure to check the application's exit code and returned logs for errors after any data transfer or tarball creation process.
  • Pilot users: DURING THE TESTING PHASE DO NOT DELETE THE ORIGINAL FILES FROM /scratch OR /project
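
The aggregation guideline above can be checked before anything is archived. The following is a minimal sketch, not a SciNet-provided tool: the function name list_small_files is hypothetical, and the default limit of 209715200 bytes (200 MiB) is an assumed reading of "~200MB".

```shell
# list_small_files DIR [LIMIT_BYTES]
# Print regular files under DIR that are smaller than LIMIT_BYTES.
# The default (200 MiB) is an assumed interpretation of the "~200MB" guideline.
list_small_files() {
  local dir=$1 limit=${2:-209715200}
  # "-size -Nc" matches files whose size is strictly below N bytes.
  find "$dir" -type f -size "-${limit}c" -print
}
```

Running `list_small_files /scratch/$(whoami)/workarea` would list the files that are candidates for grouping with tar or htar.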

Access Through the Queue System

All access to the archive system is done through the GPC queue system.

Scripted File Transfers

File transfers in and out of the HPSS should be scripted into jobs and submitted to the archive queue. See generic example below.
<source lang="bash">
#!/bin/bash
#PBS -q archive
#PBS -N hsi_put_file_in_hpss
#PBS -j oe
#PBS -m e

echo "Starting hsi to copy workarea/finished-job1.tar.gz TO finished-job1.tar.gz"

trap "echo 'Job script not completed';exit 129" TERM INT

# Note that upon executing hsi, your initial directory in HPSS will be: /archive/$(id -gn)/$(whoami)/

/usr/local/bin/hsi -v <<EOF
cput -up /scratch/$(whoami)/workarea/finished-job1.tar.gz : /archive/$(id -gn)/$(whoami)/finished-job1.tar.gz
end
EOF
status=$?

trap - TERM INT

if [ ! $status == 0 ]; then
  echo 'HSI returned non-zero code.'
  /scinet/gpc/bin/exit2msg $status
  exit $status
else
  echo 'TRANSFER SUCCESSFUL'
fi
</source>

Note: Always trap the execution of your jobs for abnormal terminations, and be sure to return the exit code.


The status of pending jobs can be monitored with showq specifying the archive queue:

showq -w class=archive

Recalling Data for Analysis

Typically data will be recalled to /scratch when it is needed for analysis. Job dependencies can be constructed so that analysis jobs wait in the queue for data recalls before starting. The qsub flag is

-W depend=afterok:<JOBID>

where JOBID is the job number of the archive recalling job that must finish successfully before the analysis job can start.

Here is a shortcut for generating the dependency (see the data-recall.sh samples further down this page):

gpc04 $ qsub $(qsub data-recall.sh | awk -F '.' '{print "-W depend=afterok:"$1}') job-to-work-on-recalled-data.sh
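
The one-liner above can also be split into steps, which makes the job identifier available for logging. This is only a sketch: depend_flag is a hypothetical helper name, and it assumes that qsub prints an identifier of the form "number.hostname", as the awk expression above implies.

```shell
# depend_flag QSUB_OUTPUT
# Turn the identifier printed by qsub (for example "12345.some-host", a
# made-up value) into the matching "-W depend=afterok:<JOBID>" option.
depend_flag() {
  # "${1%%.*}" keeps only the part before the first dot.
  printf '%s%s\n' '-W depend=afterok:' "${1%%.*}"
}

# Intended use on a GPC node (not executed here):
#   recall_id=$(qsub data-recall.sh)
#   qsub $(depend_flag "$recall_id") job-to-work-on-recalled-data.sh
```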

HSI

HSI is the primary client with which a user will interact with HPSS. It provides an ftp-like interface for archiving and retrieving tarballs or directory trees. In addition it provides a number of shell-like commands that are useful for examining and manipulating the contents in HPSS. The most commonly used commands will be:

  • cput: Conditionally stores a file only if the file does not already exist in HPSS.
    cput [options] GPFSpath [: HPSSpath]
  • cget: Conditionally retrieves a copy of a file from HPSS to GPFS only if a local copy does not already exist.
    cget [options] [GPFSpath :] HPSSpath
  • cd, mkdir, ls, rm, mv: Operate as one would expect on the contents of HPSS.
  • lcd, lls: Local commands that operate on GPFS.


  • There are 3 distinctions about HSI that you should keep in mind, and that can generate a bit of confusion when you're first learning how to use it:
    • HSI doesn't currently support renaming directories during transfers on-the-fly, therefore the syntax for cput/cget may not work as one would expect in some scenarios, requiring some workarounds.
    • HSI has an operator ":" which separates the GPFSpath and HPSSpath, and must be surrounded by whitespace (one or more space characters)
    • The order for referring to files in HSI syntax is different from FTP. In HSI the general format is always the same for both cput and cget, GPFS first and HPSS second:
     GPFSfile : HPSSfile

For example, when using HSI to store the tarball file from GPFS into HPSS, then recall it to GPFS, the following commands could be used:

    cput tarball-in-GPFS : tarball-in-HPSS
    cget tarball-recalled : tarball-in-HPSS

unlike with FTP, where the following syntax would be used:

    put tarball-in-GPFS tarball-in-HPSS 
    get tarball-in-HPSS tarball-recalled
  • Simple commands can be executed on a single line.
    hsi "mkdir LargeFilesDir; cd LargeFilesDir; cput tarball-in-GPFS : tarball-in-HPSS"
  • More complex sequences can be performed using an excerpt such as this:
    hsi <<EOF
      mkdir LargeFilesDir
      cd LargeFilesDir
      cput tarball-in-GPFS : tarball-in-HPSS
      lcd /scratch/$(whoami)/LargeFilesDir2/
      cput -Rup *  
    end
    EOF
  • The commands below are equivalent, but we recommend that you always use full paths and organize the contents of HPSS (the default HSI directory is /archive/$(id -gn)/$(whoami)/):
    hsi cput tarball
    hsi cput tarball : tarball
    hsi cput /scratch/$(whoami)/tarball : /archive/$(id -gn)/$(whoami)/tarball
  • There are no issues renaming files on-the-fly:
    hsi cput /scratch/$(whoami)/tarball1 : /archive/$(id -gn)/$(whoami)/tarball2
    hsi cget /scratch/$(whoami)/tarball3 : /archive/$(id -gn)/$(whoami)/tarball2
  • There are no issues transferring a directory and all of its contents recursively (as in rsync), provided that you keep the same directory name on GPFS and HPSS. You may use the '-u' option to resume a previously disrupted session, and the '-p' option to preserve ACLs.
   hsi cput /scratch/$(whoami)/LargeFilesDir : /archive/$(id -gn)/$(whoami)/LargeFilesDir
OR
   hsi cget /scratch/$(whoami)/LargeFilesDir : /archive/$(id -gn)/$(whoami)/LargeFilesDir
  • However the syntax forms below will fail.
   hsi cput -Rup /scratch/$(whoami)/LargeFilesDir : /archive/$(id -gn)/$(whoami)/LargeFilesDir2    (FAILS)
OR
   hsi cput -Rup /scratch/$(whoami)/LargeFilesDir/* : /archive/$(id -gn)/$(whoami)/LargeFilesDir2  (FAILS)

One workaround is the following two-step process:

   hsi cput -Rup /scratch/$(whoami)/LargeFilesDir : /archive/$(id -gn)/$(whoami)/LargeFilesDir 
   hsi mv /archive/$(id -gn)/$(whoami)/LargeFilesDir /archive/$(id -gn)/$(whoami)/LargeFilesDir2   

Another workaround is to change directory in GPFS first with lcd, then transfer the contents with cput -R (or cget -R when recalling):

    hsi <<EOF
      lcd /scratch/$(whoami)/LargeFilesDir
      mkdir /archive/$(id -gn)/$(whoami)/LargeFilesDir2
      cd /archive/$(id -gn)/$(whoami)/LargeFilesDir2
      cput -Rup *  
    end
    EOF
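
The second workaround can be wrapped in a small generator so that the same command sequence is reused for different directory pairs. This is a sketch under assumptions: hsi_put_dir_as is a hypothetical name, the function only prints the commands, and feeding them to hsi through a pipe is assumed to behave like the here-documents used on this page.

```shell
# hsi_put_dir_as LOCAL_DIR HPSS_DIR
# Print the HSI commands that store the contents of LOCAL_DIR under a
# differently named HPSS_DIR. Nothing is transferred by this function.
hsi_put_dir_as() {
  local local_dir=$1 hpss_dir=$2
  # Unquoted here-document: the two variables expand, the "*" stays literal.
  cat <<EOF
lcd $local_dir
mkdir $hpss_dir
cd $hpss_dir
cput -Rup *
end
EOF
}

# Intended use (not executed here):
#   hsi_put_dir_as /scratch/$(whoami)/LargeFilesDir \
#                  /archive/$(id -gn)/$(whoami)/LargeFilesDir2 | hsi
```

Printing the commands first also makes it easy to review them before any data moves.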

Documentation

Complete documentation of HSI is available on the Gleicher Enterprises web site.

Note: When several operations run in the same hsi session, HSI returns the highest-numbered exit code among them.


Typical Usage Scripts

The most common interactions will be putting data into HPSS, examining the contents (ls,ish), and getting data back onto GPFS for inspection or analysis.

Sample data offload

<source lang="bash">
#!/bin/bash
# This script is named: data-offload.sh
#PBS -q archive
#PBS -N offload
#PBS -j oe
#PBS -me

trap "echo 'Job script not completed';exit 129" TERM INT

# individual tarballs already exist

/usr/local/bin/hsi -v <<EOF1
mkdir put-away
cd put-away
cput /scratch/$(whoami)/workarea/finished-job1.tar.gz : finished-job1.tar.gz
end
EOF1
status=$?
if [ ! $status == 0 ];then
  echo 'HSI returned non-zero code.'
  /scinet/gpc/bin/exit2msg $status
  exit $status
else
  echo 'TRANSFER SUCCESSFUL'
fi

/usr/local/bin/hsi -v <<EOF2
mkdir put-away
cd put-away
cput /scratch/$(whoami)/workarea/finished-job2.tar.gz : finished-job2.tar.gz
end
EOF2
status=$?
if [ ! $status == 0 ];then
  echo 'HSI returned non-zero code.'
  /scinet/gpc/bin/exit2msg $status
  exit $status
else
  echo 'TRANSFER SUCCESSFUL'
fi

trap - TERM INT
</source>


Note: as in the above example, we recommend that you capture the (highest-numbered) exit code for each hsi session independently. And remember, you may improve your exit code verbosity by adding the excerpt below to your scripts:
<source lang="bash">
if [ ! $status == 0 ];then
  echo 'HSI returned non-zero code.'
  /scinet/gpc/bin/exit2msg $status
  exit $status
else
  echo 'TRANSFER SUCCESSFUL'
fi
</source>
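
When a script contains many sessions, the repeated check can be factored into a function. The sketch below is an illustration only: run_session is a hypothetical name, and it reports the numeric code directly instead of calling the site-specific exit2msg tool.

```shell
# run_session LABEL COMMAND [ARGS...]
# Run one command (for example a single hsi session), report its exit code
# under LABEL, and return that code so the caller can decide what to do.
run_session() {
  local label=$1 status
  shift
  "$@"
  status=$?
  if [ "$status" -ne 0 ]; then
    echo "$label: returned non-zero code $status" >&2
  else
    echo "$label: TRANSFER SUCCESSFUL"
  fi
  return "$status"
}

# Intended use (not executed here), with the single-line hsi form:
#   run_session job1 hsi "cd put-away; cput finished-job1.tar.gz" || exit $?
```

Each session keeps its own exit code this way, so a failure in the first transfer is not masked by a later one.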


Sample data list

A very trivial way to list the contents of HPSS would be to just submit the HSI 'ls' command.
<source lang="bash">
#!/bin/bash
# This script is named: data-list.sh
#PBS -q archive
#PBS -N hpss_ls
#PBS -j oe
#PBS -me

/usr/local/bin/hsi -v <<EOF
cd put-away
ls -R
end
EOF
</source>


However, we provide a much more useful and convenient way to explore the contents of HPSS with the inventory shell ISH. This example creates an index of all the files in a user's portion of the namespace. The list is placed in the directory /home/$(whoami)/.ish_register that can be inspected from the gpc-devel nodes.
<source lang="bash">
#!/bin/bash
# This script is named: data-list.sh
#PBS -q archive
#PBS -N hpss_index
#PBS -j oe
#PBS -me

INDEX_DIR=$HOME/.ish_register
if ! [ -e "$INDEX_DIR" ]; then
 mkdir -p $INDEX_DIR
fi

export ISHREGISTER="$INDEX_DIR"
ish hindex
</source>


This index can be browsed or searched with ISH on the development nodes.
<source lang="bash">
gpc-f104n084-$ ish ~/.ish_register/hpss.igz
[ish]hpss.igz> help
</source>

ISH is a powerful tool that is also useful for creating and browsing indices of tar and htar archives, so please look at the documentation or the built-in help.


Sample data recall

<source lang="bash">
#!/bin/bash
# This script is named: data-recall.sh
#PBS -q archive
#PBS -N recall_files
#PBS -j oe
#PBS -me

trap "echo 'Job script not completed';exit 129" TERM INT

mkdir -p /scratch/$(whoami)/recalled-from-hpss

# individual tarballs previously organized in HPSS inside the put-away-on-2010/ folder

hsi -v << EOF
cget /scratch/$(whoami)/recalled-from-hpss/Jan-2010-jobs.tar.gz : /archive/$(id -gn)/$(whoami)/put-away-on-2010/Jan-2010-jobs.tar.gz
cget /scratch/$(whoami)/recalled-from-hpss/Feb-2010-jobs.tar.gz : /archive/$(id -gn)/$(whoami)/put-away-on-2010/Feb-2010-jobs.tar.gz
end
EOF
status=$?

trap - TERM INT

if [ ! $status == 0 ]; then
  echo 'HSI returned non-zero code.'
  /scinet/gpc/bin/exit2msg $status
  exit $status
else
  echo 'TRANSFER SUCCESSFUL'
fi
</source>


We should emphasize that a single cget of multiple files (rather than several separate gets) allows HSI to do optimization, as in the following example:
<source lang="bash">
#!/bin/bash
# This script is named: data-recall.sh
#PBS -q archive
#PBS -N recall_files_optimized
#PBS -j oe
#PBS -me

trap "echo 'Job script not completed';exit 129" TERM INT

mkdir -p /scratch/$(whoami)/recalled-from-hpss

# individual tarballs previously organized in HPSS inside the put-away-on-2010/ folder

hsi -v << EOF
lcd /scratch/$(whoami)/recalled-from-hpss/
cd /archive/$(id -gn)/$(whoami)/put-away-on-2010/
cget Jan-2010-jobs.tar.gz Feb-2010-jobs.tar.gz
end
EOF
status=$?

trap - TERM INT

if [ ! $status == 0 ]; then
  echo 'HSI returned non-zero code.'
  /scinet/gpc/bin/exit2msg $status
  exit $status
else
  echo 'TRANSFER SUCCESSFUL'
fi
</source>


Sample transferring directories

Remember, it's not possible to:

hsi cget -Rup /scratch/$(whoami)/LargeFiles-recalled : /archive/$(id -gn)/$(whoami)/LargeFiles    (FAILS)

The workaround is:
<source lang="bash">
#!/bin/bash
# This script is named: data-recall.sh
#PBS -q archive
#PBS -N recall_directories
#PBS -j oe
#PBS -me

trap "echo 'Job script not completed';exit 129" TERM INT

mkdir -p /scratch/$(whoami)/LargeFiles-recalled

hsi -v << EOF
lcd /scratch/$(whoami)/LargeFiles-recalled
cd /archive/$(id -gn)/$(whoami)/LargeFiles
cget -Rup *
end
EOF
status=$?

trap - TERM INT

if [ ! $status == 0 ]; then
  echo 'HSI returned non-zero code.'
  /scinet/gpc/bin/exit2msg $status
  exit $status
else
  echo 'TRANSFER SUCCESSFUL'
fi
</source>


Sample verify checksum

<source lang="bash">
#!/bin/env bash
#PBS -q archive
#PBS -N checksum_verified_transfer
#PBS -j oe
#PBS -me

thefile=<localpath>
storedfile=<hpsspath>
# fname was used but never set in the original script; the base name of the local file is assumed here
fname=$(basename "$thefile")

# Generate checksum on fly using a named pipe so that file is only read from GPFS once
mkfifo /tmp/NPIPE
cat $thefile | tee /tmp/NPIPE | hsi -q put - : $storedfile &
pid=$!
md5sum /tmp/NPIPE | tee /tmp/$fname.md5
rm /tmp/NPIPE

# Check the exit code of the HSI process
wait $pid
sc=$?
if [ $sc != 0 ];then
 echo "File transfer failed"
 exit $sc
fi

# change filename to stdin in checksum file
sed -i.1 "s+/tmp/NPIPE+-+" /tmp/$fname.md5

# verify checksum
hsi -q get - : $storedfile | md5sum -c /tmp/$fname.md5
sc=$?
if [ $sc != 0 ]; then
 echo '!!! Job Failed !!!'
 echo 'error=' $sc
 exit $sc
fi
</source>
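
The single-pass idea in the script above (tee into a named pipe while a second process computes the checksum) can be tried locally without HPSS. The following is a sketch under stated assumptions: stream_with_md5 is a hypothetical helper, and the plain redirect into DEST stands in for the pipe into `hsi -q put - : HPSSpath`.

```shell
# stream_with_md5 SRC DEST
# Copy SRC to DEST while computing the MD5 of the stream in the same pass.
# Prints the checksum on success. Writing to DEST is a local stand-in for
# piping the stream into hsi (assumption, for testing without HPSS).
stream_with_md5() {
  local src=$1 dest=$2 dir fifo rc sum_pid
  dir=$(mktemp -d) || return 1
  fifo=$dir/pipe
  mkfifo "$fifo" || { rm -rf "$dir"; return 1; }
  # Background reader: checksums whatever arrives on the FIFO.
  md5sum < "$fifo" > "$dir/sum" &
  sum_pid=$!
  # tee duplicates the stream: one copy to the FIFO, one to the consumer.
  tee "$fifo" < "$src" > "$dest"
  rc=$?
  wait "$sum_pid"
  if [ "$rc" -eq 0 ]; then
    cut -d' ' -f1 "$dir/sum"
  fi
  rm -rf "$dir"
  return "$rc"
}
```

Using a private temporary directory for the FIFO avoids the name clash that a fixed path such as /tmp/NPIPE could cause if two jobs ran on the same node.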

HTAR

Please aggregate small files (<~200MB) into tarballs or htar files.

HTAR is a utility for aggregating a set of files and directories into a single archive. It uses a sophisticated multithreaded buffering scheme to write files from the local filesystem directly into HPSS, which achieves a high rate of performance, and the archive it creates conforms to the POSIX TAR specification.

CAUTION

  • Files larger than 68 GB cannot be stored in an htar archive (you'll get an error message for the whole operation)
  • Check the HTAR exit code and log file before removing any files from the active filesystems.
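
Because one oversized member fails the whole htar operation, it can help to look for such files beforehand. This is a minimal sketch: find_oversize is a hypothetical name, and the default of 68000000000 bytes is only an assumed byte value for "68 GB"; the exact limit should be taken from the HTAR documentation.

```shell
# find_oversize DIR [LIMIT_BYTES]
# Print regular files under DIR larger than LIMIT_BYTES.
# Returns 1 if any such file exists, 0 otherwise.
find_oversize() {
  local dir=$1 limit=${2:-68000000000} found
  # "-size +Nc" matches files whose size is strictly above N bytes.
  found=$(find "$dir" -type f -size "+${limit}c" -print)
  if [ -n "$found" ]; then
    printf '%s\n' "$found"
    return 1
  fi
  return 0
}
```

Any file reported this way would need to be stored separately, for example with hsi, instead of being included in the htar archive.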

HTAR Usage

  • To write the file1 and file2 files to a new archive called files.tar in the default HPSS home directory, enter:
    htar -cf files.tar file1 file2
OR
    htar -cf /archive/$(id -gn)/$(whoami)/files.tar file1 file2
  • To write the subdirA directory to a new archive called subdirA.tar in the default HPSS home directory, enter:
    htar -cf subdirA.tar subdirA
  • To extract all files from the project1/src directory in the archive file called proj1.tar, and use the time of extraction as the modification time, enter:
    htar -xm -f proj1.tar project1/src
  • To display the names of the files in the out.tar archive file within the HPSS home directory, enter (the out.tar.idx file will be queried):
    htar -vtf out.tar

For more details please check the HTAR - Introduction or the HTAR Man Page online
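
The create and list commands above can be combined so that an archive is listed right after it is written, and any failure is reported. This is a sketch, not an official recipe: htar_create_checked and the HTAR_CMD override are hypothetical names, and only the `-cf` and `-vtf` invocations shown on this page are used.

```shell
# htar_create_checked ARCHIVE PATH...
# Create ARCHIVE with "htar -cf", then list it with "htar -vtf".
# Returns 1 if creation fails and 2 if the listing fails.
# HTAR_CMD (hypothetical variable) selects the htar binary, e.g. for dry runs.
htar_create_checked() {
  local htar=${HTAR_CMD:-htar} archive=$1
  shift
  if ! "$htar" -cf "$archive" "$@"; then
    echo "htar create failed for $archive" >&2
    return 1
  fi
  if ! "$htar" -vtf "$archive"; then
    echo "htar listing failed for $archive" >&2
    return 2
  fi
}

# Intended use inside an archive-queue job (not executed here):
#   htar_create_checked subdirA.tar subdirA || exit $?
```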