Data Management

From oldwiki.scinet.utoronto.ca
Revision as of 11:56, 12 December 2010 by Cneale (talk | contribs) (→‎Selective: added "or"'s to listing of how to use dsmmigrate and dsmrecall, because otherwise it looked like either use (a) the top 3 commands -or- (b) the bottome 2 commands in each list)

Storage Space

SciNet's storage system is based on IBM's GPFS (General Parallel File System). There are two main systems for user data: /home, a small, backed-up space where user home directories are located, and /scratch, a large system for input or output data for jobs. Data on /scratch is not backed up, and it will be deleted if it has not been accessed in 3 months. A third storage system, /project, exists only for groups with LRAC/NRAC allocations. SciNet does not provide long-term storage for large data sets.

Overview of the different file systems

file system  purpose      quota          block size  backed up  purged            access
/home        development  10 GB          256 KB      yes        never             read-only on compute nodes (r/w on login, devel and datamover1)
/scratch     computation  20 TB          4 MB        no         files > 3 months  read/write on all nodes
/project     computation  by allocation  256 KB      no         never             read/write on all nodes

Home Disk Space

Every SciNet user gets a 10GB directory on /home which is regularly backed up. Home is visible from the login.scinet nodes, and from the development nodes on GPC and the TCS. However, on the compute nodes of the GPC clusters -- as when jobs are running -- /home is mounted read-only; thus GPC jobs can read files in /home but cannot write to files there. /home is a good place to put code, input files for runs, and anything else that needs to be kept to reproduce runs. On the other hand, /home is not a good place to put many small files: the block size for the file system is 256KB, so you would quickly run out of disk quota, and you would make the backup system very slow.

If your application absolutely insists on writing material to your home account and you can't find a way to instruct it to write somewhere else, an alternative is to create a link pointing from your account under /home to a location under /scratch.
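A minimal sketch of such a link follows. It uses stand-in directories so that it can be tried anywhere; the names fake_home, fake_scratch and .myapp are hypothetical, and on SciNet they would correspond to your own directories under /home and /scratch.

```shell
#!/bin/bash
# Stand-ins for /home/<user> and /scratch/<user> (hypothetical names).
workdir=$(mktemp -d)
fake_home="$workdir/home/user"
fake_scratch="$workdir/scratch/user"
mkdir -p "$fake_home" "$fake_scratch/myapp_output"

# Create the link: anything the application writes to ~/.myapp
# actually lands in the directory under scratch.
ln -s "$fake_scratch/myapp_output" "$fake_home/.myapp"

# Writing through the link stores the file under scratch.
echo "result" > "$fake_home/.myapp/out.txt"
ls -l "$fake_home/.myapp"
cat "$fake_scratch/myapp_output/out.txt"
```

Since /home is read-only on the compute nodes, the link itself would have to be created from a node where /home is writable, such as a login or development node.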

Scratch Disk Space

Every SciNet user also gets a directory in /scratch. Scratch is visible from the login.scinet nodes, the development nodes on GPC and the TCS, and on the compute nodes of the clusters, mounted as read-write. Thus jobs would normally write their output somewhere in /scratch. There are NO backups of anything on /scratch.

There is a large amount of space available on /scratch but it is purged routinely so that all users running jobs and generating large outputs will have room to store their data temporarily. Computational results which you want to keep longer than this must be copied (using scp) off of SciNet entirely and to your local system. SciNet does not routinely provide long-term storage for large data sets.

Scratch Disk Purging Policy

In order to ensure that there is always significant space available for running jobs, we automatically delete files in /scratch that have not been accessed for more than 3 months as of the deletion day, the 15th of each month. This policy is subject to revision depending on its effectiveness. More details about the purging process and how users can check whether their files will be deleted follow. If you have files scheduled for deletion, you should move them to a more permanent location such as your departmental server or your /project space (for PIs who have either been allocated disk space by the LRAC or have bought disk space).

On the first of each month, a list of files scheduled for purging is produced, and an email notification is sent to each user on that list. Furthermore, on or about the 12th of each month a second scan produces a more current assessment and another email notification is sent. This way users can double-check that they have indeed taken care of all the files they needed to relocate before the purging deadline. Those files will be automatically deleted on the 15th of the same month unless they have been accessed or relocated in the interim. If you have files scheduled for deletion, they will be listed in a file in /scratch/todelete/current, which has your userid and groupid in the filename. For example, if user xxyz wants to check whether they have files scheduled for deletion, they can issue the following command on a system which mounts /scratch (e.g. a scinet login node): ls -l1 /scratch/todelete/current |grep xxyz. In the example below, the name of this file indicates that user xxyz is part of group abc and has 9,560 files scheduled for deletion, which take up 1.0TB of space:

[xxyz@scinet04 ~]$ ls -l1 /scratch/todelete/current |grep xxyz
-rw-r----- 1 xxyz     root       1733059 Jan 12 11:46 10001___xxyz_______abc_________1.00T_____9560files

The file itself contains a list of all files scheduled for deletion (in the last column) and can be viewed with standard commands like more/less/cat - e.g. more /scratch/todelete/current/10001___xxyz_______abc_________1.00T_____9560files

Similarly, you can check on all other users in your group by using the ls command with grep on your group name. For example: ls -l1 /scratch/todelete/current |grep abc. That will list all users in the same group as xxyz who have files to be purged on the 15th. Members of the same group have access to each other's contents.

NOTE: Preparing these assessments takes several hours. If you change the access/modification time of a file in the interim, that will not be detected until the next cycle. To get immediate feedback, use the 'ls -lu' command on the file to see its access time (atime). If the atime has been updated, the file will no longer be deleted when the purging date on the 15th arrives.
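As a rough self-check between scans, access times can also be inspected with find. The sketch below works on a stand-in directory and assumes that 90 days approximates the 3-month window and that GNU touch and find are available; the authoritative list remains the one in /scratch/todelete/current.

```shell
#!/bin/bash
# Stand-in for your directory on /scratch (hypothetical path).
dir=$(mktemp -d)
echo "old data" > "$dir/old.dat"
echo "new data" > "$dir/new.dat"

# Simulate a file that was last accessed 120 days ago (GNU touch syntax).
touch -a -d "120 days ago" "$dir/old.dat"

# Show the access time (atime) of each file.
ls -lu "$dir"

# List files whose last access is more than 90 days in the past.
# Assumption: 90 days approximates the 3-month purging window.
stale=$(find "$dir" -type f -atime +90)
echo "$stale"
```

Only old.dat is reported here; on a real system, point find at your own directory instead of the stand-in.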

Project Disk Space

Investigators who have been granted allocations through the LRAC/NRAC Application Process may have been allocated disk space in addition to compute time. For the period of time that the allocation is granted, they will have disk space on the /project disk system. Space on the project system is not purged, but neither is it backed up. All members of the investigator's group will have access to this system, which will be mounted read/write everywhere.

How much Disk Space Do I have left?

The /scinet/gpc/bin/diskUsage command, available on the login nodes, datamovers and the GPC devel nodes, provides information in a number of ways on the home, scratch, and project file systems. For instance, it can report how much disk space is being used by yourself and your group (with the -a option) or how much your usage has changed over a certain period ("delta information"), and it can generate plots of your usage over time. Please see the usage help below for more details.

Usage: diskUsage [-h|-?] [-a] [-u <user>] [-de|-plot]
       -h|-?: help
       -a: list usages of all members on the group
       -u <user>: as another user on your group
       -de: include delta information
       -plot: create plots of disk usages

Note that information on usage and quota is only updated hourly!

Performance

GPFS is a high-performance filesystem which provides rapid reads and writes to large datasets in parallel from many nodes. As a consequence of this design, however, the file system performs quite poorly at accessing data sets which consist of many, small files. For instance, you will find that reading data in from one 4MB file is enormously faster than from 100 40KB files. Such small files are also quite wasteful of space, as the blocksize for the filesystem is 4MB. This is something you should keep in mind when planning your input/output strategy for runs on SciNet.
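To get a feel for whether a data directory is dominated by small files, one can count the files below the 4MB block size. The following is a sketch on a stand-in directory with generated files, assuming GNU find; the directory and file names are hypothetical.

```shell
#!/bin/bash
# Stand-in data directory (hypothetical); replace with a directory of your own.
dir=$(mktemp -d)
for i in 1 2 3; do
  head -c 40960 /dev/zero > "$dir/small_$i.dat"   # three 40KB files
done
head -c 5242880 /dev/zero > "$dir/large.dat"       # one 5MB file

# Count regular files smaller than 4MB (4096 KiB), the block size quoted above.
small_count=$(find "$dir" -type f -size -4096k | wc -l)
total_count=$(find "$dir" -type f | wc -l)
echo "$small_count of $total_count files are smaller than 4MB"
```

If most files fall below the block size, combining them into fewer, larger files is likely to help both performance and space usage.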

For instance, if you run multi-process jobs, having each process write to a file of its own is not a scalable I/O solution. A directory gets locked by the first process accessing it, so all other processes have to wait for it. Not only has the code just become considerably less parallel, but chances are the file system will time out while waiting for your other processes, leading your program to crash mysteriously. Consider using MPI-IO (part of the MPI-2 standard), which allows files to be opened simultaneously by different processes, or using a dedicated process for I/O to which all other processes send their data, and which subsequently writes this data to a single file.

Local Disk

The compute nodes on the GPC do not contain hard drives, so there is no local disk available to use during your computation. You can, however, use part of a compute node's RAM like a local disk ('ramdisk'), but this will reduce how much memory is available for your program. This can be accessed using /dev/shm/ and is currently set to 8GB. Anything written to this location that you want to keep must be copied back to the /scratch filesystem, as /dev/shm is wiped after each job and, since it is in memory, will not survive a reboot of the node. More on ramdisk usage can be found here.

Note that the absence of hard drives also means that the nodes cannot swap memory, so be sure that your computation fits within memory.
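The ramdisk workflow described above can be sketched as follows. The scratch location is a stand-in and the file names are hypothetical; the fallback to a temporary directory is only there so that the sketch also runs on machines without a writable /dev/shm.

```shell
#!/bin/bash
# Use /dev/shm as the ramdisk; fall back to a temporary directory so the
# sketch also runs on machines where /dev/shm is not available.
ramdisk_root=/dev/shm
if [ ! -d "$ramdisk_root" ] || [ ! -w "$ramdisk_root" ]; then
  ramdisk_root=${TMPDIR:-/tmp}
fi

# Stand-in for a results directory on /scratch (hypothetical path).
scratch_dir=$(mktemp -d)

# Private working directory in the ramdisk for this job.
work=$(mktemp -d "$ramdisk_root/job.XXXXXX")

# Placeholder for the real computation: write output into the ramdisk.
echo "simulation output" > "$work/result.dat"

# Copy what you want to keep back to scratch, then free the memory.
cp "$work/result.dat" "$scratch_dir/"
rm -rf "$work"
```

Removing the working directory at the end returns the memory to the node, which matters because everything in /dev/shm counts against the RAM available to your program.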

Data Transfer

General guidelines

All traffic to and from the data centre has to go via SSH, or secure shell. This is a protocol which sets up a secure connection between two sites. In all cases, incoming connections to SciNet go through relatively low-speed connections to the login.scinet gateways, but there are many ways to copy files on top of the ssh protocol.

What node to use for data transfer to and from SciNet depends mostly on the amount of data to transfer:

Moving <10GB through the login nodes

The login nodes are accessible from outside SciNet, which means that you can transfer data between your own office/home machine and SciNet using scp or rsync (see below). Because the login nodes have a cpu_time limit of 5 minutes (emphasis on cpu_time, not wall_time), a transfer of more than 10GB will most likely not succeed. While the login nodes can be used for transfers of less than 10GB, using a datamover node would still be faster.

Note that transfers through a login node will timeout after a certain time (currently set to 5 minutes cpu_time), so if you have a slow connection you may need to go through datamover1.

Moving >10GB through the datamover1 node

Serious moves of data (>10GB) to or from SciNet should be done from the datamover1 or datamover2 nodes. From any of the interactive SciNet nodes, one should be able to ssh datamover1 or ssh datamover2 to log in. Those are the machines that have the fastest network connections to the outside world (by a factor of 10: a 10Gb/s link versus 1Gb/s).

Transfers must be originated from datamover1 or datamover2; that is, one cannot copy files from the outside world directly to or from the datamovers; one has to log in to a datamover and copy the data to or from the outside network. Your local machine must therefore be reachable from the outside, either by its name or its IP address. If you are behind a firewall or a (wireless) router, this may not be possible. You may need to ask your network administrator to allow the datamovers to ssh to your machine. If you need to open a hole in your firewall, these are their IPs:

datamover1 142.150.188.121
datamover2 142.150.188.122

Hpn-ssh

The usual ssh protocols were not designed for speed. On the datamover1 or datamover2 nodes, we have installed hpn-ssh, or High-Performance-enabled ssh. You use this higher-performance ssh/scp/sftp variant by loading the `hpnssh' module. Hpn-ssh is backwards compatible with the `usual' ssh, but is capable of significantly higher speeds. If you routinely have large data transfers to do, we recommend having your system administrator look into installing hpn-ssh on your system.

Everything we discuss below, unless otherwise stated, will work regardless of whether you have hpn-ssh installed on your remote system.

For Microsoft Windows users

Linux-windows transfers can be a bit more involved than linux-to-linux, but using Cygwin, this should not be a problem. Make sure you install Cygwin with the openssh package.

If you want to remain in a 100% Windows environment, another very good tool is WinSCP. It will let you easily transfer and synchronize data between your Windows workstation and the login nodes using your ssh credentials (provided that it's not much more than 10GB on each sync pass).

If you are going to use the datamover1 method, and assuming your machine is not a wireless laptop (if it is, best to find a nearby computer that's not wireless and use a usb memory stick), you'll need the IP address of your machine, which you find by typing "ipconfig /all" on your local windows machine. Also, you will need to have the ssh daemon (sshd) running locally in Cygwin.

Also note that your Windows user name does not have to be the same as on SciNet; this just depends on how your local Windows system was set up.

All locations given to scp or rsync in cygwin have to be in unix format (using "/" not "\"), and will be relative to cygwin's path, not windows (e.g. use /cygdrive/c/...... to get to the windows C: drive).

Ways to transfer data

Globus data transfer

Globus is a file-transfer service with an easy-to-use web interface that lets people transfer files with ease. To get started, please sign up for a Globus account at the Globus website. Once you sign up for an account, go to this page to start the file transfer. Please enter computecanada#niagara as one endpoint for the file transfer. If you are trying to transfer data from a laptop or desktop, you will need to install the Globus Connect Personal software, available here, to set up an endpoint for the laptop or desktop and perform the transfer.

Please see the following page on how to setup Globus to perform data transfer.

scp

scp, or secure copy, is the easiest way to copy files, although we generally find rsync below to be faster.

scp works like cp to copy files:

$ scp original_file  copy_file

except that either the original or the copy can be on another system:

$ scp jonsdatafile.bin jon@remote.system.com:/home/jon/bigdatadir/

will copy the data file into the directory /home/jon/bigdatadir/ on remote.system.com after logging in as jon; you will be prompted for a password (unless you've set up ssh keys).

Copying from remote systems works the same way:

$ scp jon@remote.system.com:/home/jon/bigdatadir/newdata.bin .

And wildcards work as you'd expect (except you have to quote the wildcards on the remote system, as it can't expand properly here.)

$ scp *.bin jon@remote.system.com:/home/jon/bigdatadir/
$ scp jon@remote.system.com:"/home/jon/inputdata/*" .

There are a few options worth knowing about:

  • scp -C compresses the file before transmitting it; if the file compresses well, this can significantly increase the effective data transfer rate (though usually not as much as compressing the data, then sending it, then uncompressing). If the file doesn't compress well, then this adds CPU overhead without accomplishing much, and can slow down your data transfer.
  • scp -oNoneEnabled=yes -oNoneSwitch=yes -- This is an hpn-ssh only option. If CPU overhead is a significant bottleneck in the data transfer, then we can avoid this by turning off the secure encryption of the data. For most of us, this is ok, but for others it is not. In either case, authentication remains secure; it is only the data transfer that is in plaintext.
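One rough way to judge whether compression during transfer is likely to pay off is to check how well a sample of the data compresses locally. The helper compressed_percent below is a hypothetical name, and gzip is used here only as an approximation of what ssh compression would achieve.

```shell
#!/bin/bash
# Print the size of a file after gzip compression as a percentage of the
# original size (integer arithmetic; the file must not be empty).
compressed_percent() {
  local file=$1
  local original compressed
  original=$(wc -c < "$file")
  compressed=$(gzip -c "$file" | wc -c)
  echo $(( 100 * compressed / original ))
}

# Demonstration on a highly compressible stand-in file (1MB of zero bytes).
sample=$(mktemp)
head -c 1048576 /dev/zero > "$sample"
percent=$(compressed_percent "$sample")
echo "compressed size is ${percent}% of the original"
rm -f "$sample"
```

A result close to 100% suggests the data is already compact (for instance, compressed or binary floating-point output), in which case the extra CPU work is unlikely to help.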

rsync

rsync is a very powerful tool for mirroring directories of data.

$ rsync -av -e ssh scinetdatadir jon@remote.system.com:/home/jon/bigdatadir/

rsync has a dizzying number of options; the above syncs scinetdatadir to the remote system; that is, any files that are newer on the local system are updated on the remote system. Because the source is given without a trailing slash, the directory itself is copied, so the data ends up in /home/jon/bigdatadir/scinetdatadir. The converse isn't true; if there were newer files on the remote system, you'd have to bring those over with

$ rsync -av -e ssh jon@remote.system.com:/home/jon/bigdatadir/scinetdatadir/ scinetdatadir

The -av options are for verbose and `archive' mode, which preserves timestamps and permissions, which is normally what you want. -e ssh tells it to use ssh for the transfer.

One of the powerful things about rsync is that it looks to see what files already exist before copying, so you can use it repeatedly as a data directory fills and it won't make unnecessary copies; similarly, if a (say) log file grows over time, it will only copy the difference between the files, further speeding things up. This also means that it behaves well if a transfer is interrupted; a second invocation of rsync will continue where the other left off.

As with scp -C, rsync -z compresses on the fly, which can significantly enhance effective data transfer rates if the files compress well, or hurt it if not.

As with scp, if both sides are running hpn-ssh one can disable encryption of the data stream should that prove to be a bottleneck:

$ rsync -av -e "ssh -oNoneEnabled=yes -oNoneSwitch=yes" jon@remote.system.com:/home/jon/bigdatadir/ scinetdatadir

SciNet's login nodes, 142.150.188.5[1-4], are publicly accessible and can be used for data transfer as long as your material is not one big chunk (much more than 2-3GB per file). We have a 5-minute CPU time limit on the login nodes, and the transfer process may be killed by the kernel before completion. The workaround is to transfer your data using an rsync loop that checks the rsync return code, assuming some files can be transferred before reaching the CPU limit. For example in a bash shell:

  for i in {1..100}; do   ### try 100 times
    rsync ...
    [ "$?" == "0" ] && break
  done
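The same idea can be wrapped in a small reusable function. This is a sketch rather than a SciNet-provided tool: retry_until_success and flaky_transfer are hypothetical names, and flaky_transfer merely simulates a transfer that is killed twice before completing.

```shell
#!/bin/bash
# Run a command repeatedly until it succeeds or the attempt limit is reached.
# Usage: retry_until_success MAX_ATTEMPTS command [args...]
retry_until_success() {
  local max_attempts=$1
  shift
  local attempt
  for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
    if "$@"; then
      echo "succeeded on attempt $attempt"
      return 0
    fi
    echo "attempt $attempt failed, retrying" >&2
  done
  return 1
}

# Stand-in for a transfer that fails twice before completing.
counter_file=$(mktemp)
echo 0 > "$counter_file"
flaky_transfer() {
  local n
  n=$(( $(cat "$counter_file") + 1 ))
  echo "$n" > "$counter_file"
  [ "$n" -ge 3 ]
}

result=$(retry_until_success 100 flaky_transfer)
echo "$result"
```

In real use the stand-in would be replaced by the actual transfer, for example retry_until_success 100 rsync -av -e ssh scinetdatadir jon@remote.system.com:/home/jon/bigdatadir/. Because rsync skips files that are already complete, each attempt continues where the previous one stopped.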

ssh tunnel

Alternatively you may use a reverse ssh tunnel (ssh -R).

If your transfer is above 10GB you will need to use one of SciNet's datamovers. If your workstation is behind a firewall (as the datamovers are), you'll need a node external to your firewall, on the edge of your organization's network, that will serve as a gateway, and can be accessible via ssh by both your workstation and the datamovers. Initiate a "ssh -R" connection from SciNet's datamover to that node. This node needs to have its ssh GatewayPorts enabled so that your workstation can connect to the specified port on that node, which will forward the traffic back to SciNet's datamover.
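The two steps might look as follows. This sketch only prints the commands so that they can be reviewed first; the gateway host name, the user names, the port number 2222 and the paths are all hypothetical placeholders, and it assumes the gateway has GatewayPorts enabled as described above.

```shell
#!/bin/bash
# All names below are hypothetical placeholders.
gateway=gateway.example.org   # node on the edge of your organization's network
gateway_user=jon              # your account on the gateway
tunnel_port=2222              # port opened on the gateway by the tunnel

# Step 1, run on the SciNet datamover: open the reverse tunnel.
# Connections to $tunnel_port on the gateway are forwarded to the
# ssh port (22) of the datamover itself.
tunnel_cmd="ssh -R ${tunnel_port}:localhost:22 ${gateway_user}@${gateway}"

# Step 2, run on your workstation: transfer through the tunnel.
# The login name is your SciNet user name, because the connection
# ends up on the datamover, not on the gateway.
transfer_cmd="rsync -av -e 'ssh -p ${tunnel_port}' mydata scinet_user@${gateway}:/scratch/scinet_user/"

echo "$tunnel_cmd"
echo "$transfer_cmd"
```

The tunnel from step 1 has to stay open for as long as the transfer in step 2 is running.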

Transfer speeds

What transfer speeds could I expect?

Below are some typical transfer numbers from datamover1 to another University of Toronto machine with a 1Gb/s link to the campus network:

Mode   With hpn-ssh  Without hpn-ssh
rsync  60-80 MB/s    30-40 MB/s
scp    50 MB/s       25 MB/s

What can slow down my data transfer?

To move data quickly, all of the stages in the process have to be fast: the file system you are reading data from, the CPU reading the data, the network connection between the sender and the receiver, and the recipient CPU and disk. The slowest element in that chain will slow down the entire transfer.

On SciNet's side, our underlying filesystem is the high-performance GPFS system, and the node we recommend you use (datamover1) has a high-speed connection to the network and fast CPUs.

Why are my transfers so much slower?

If you get numbers significantly lower than above, then there is a bottleneck in the transfer. The first thing to do is to run top on datamover1; if other people are transferring large files at the same time you are trying to, network congestion could result and you'll just have to wait until they are done.

If nothing else is going on on datamover1, there are a number of possibilities:

  • network connection between SciNet and your machine - do you know the network connection of your remote machine? Are your system's connections tuned for performance [1]?
  • is the remote server busy?
  • are the remote server's disks busy, or known to be slow?

For any further questions, contact us at Support @ SciNet


File/Ownership Management (ACL)

  • By default, at SciNet, users within the same group have read permission to each other's files (not write)
  • You may use an access control list (ACL) to allow your supervisor (or another user within your group) to manage files for you (i.e., create, move, rename, delete), while still retaining your access and permission as the original owner of the files/directories.
  • NOTE: We highly recommend that you never give write permission to other users on the top level of your home directory (/home/[owner]), since that would seriously compromise your privacy, in addition to disabling ssh key authentication, among other things. If necessary, make specific sub-directories under your home directory so that other users can manipulate/access files from those.
  • If you need to set up permissions across groups contact us (and the other group's supervisor!).

Using setfacl/getfacl

  • To allow [supervisor] to manage files in /project/group/[owner] using the setfacl and getfacl commands, follow the 3 steps below as the [owner] account from a shell:
1) $ /scinet/gpc/bin/setfacl -d -m user:[supervisor]:rwx /project/group/[owner]
   (every *new* file/directory inside [owner] will inherit [supervisor] ownership by default from now on)

2) $ /scinet/gpc/bin/setfacl -d -m user:[owner]:rwx /project/group/[owner]
   (but will also inherit [owner] ownership, ie, ownership of both by default, for files/directories created by [supervisor])

3) $ /scinet/gpc/bin/setfacl -Rm user:[supervisor]:rwx /project/group/[owner]
   (recursively modify all *existing* files/directories inside [owner] to also be rwx by [supervisor])

   $ /scinet/gpc/bin/getfacl /project/group/[owner]
   (to determine the current ACL attributes)

   $ /scinet/gpc/bin/setfacl -b /project/group/[owner]
   (to remove any previously set ACL)

PS: on the datamovers getfacl, setfacl and chacl will be on your path

For more information on using setfacl or getfacl see their man pages.

Using mmputacl/mmgetacl

  • Alternatively, you may use gpfs' native mmputacl and mmgetacl commands. The advantages are that you can set "control" permission and that POSIX or NFS v4 style ACL are supported. You will need first to create a /tmp/supervisor.acl file with the following contents:
user::rwxc
group::----
other::----
mask::rwxc
user:[owner]:rwxc
user:[supervisor]:rwxc

Then issue the following 2 commands:

1) $ mmputacl -i /tmp/supervisor.acl /project/group/[owner]
2) $ mmputacl -d -i /tmp/supervisor.acl /project/group/[owner]
   (every *new* file/directory inside [owner] will inherit [supervisor] ownership by default as well as 
   [owner] ownership, ie, ownership of both by default, for files/directories created by [supervisor])

   $ mmgetacl /project/group/[owner]
   (to determine the current ACL attributes)

   $ mmdelacl -d /project/group/[owner]
   (to remove any previously set ACL)

   $ mmeditacl /project/group/[owner]
   (to create or change a GPFS access control list)
   (for this command to work set the EDITOR environment variable: export EDITOR=/usr/bin/vi)
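The ACL file described above can be generated with a short script. This is a sketch in which alice and bob are hypothetical stand-ins for [owner] and [supervisor], and a temporary file takes the place of /tmp/supervisor.acl.

```shell
#!/bin/bash
# Hypothetical user names; substitute the real [owner] and [supervisor].
owner=alice
supervisor=bob

# The text above uses /tmp/supervisor.acl; a temporary file is used here
# so the sketch does not overwrite anything.
acl_file=$(mktemp)

# Write the ACL entries, filling in the two user names.
cat > "$acl_file" <<EOF
user::rwxc
group::----
other::----
mask::rwxc
user:${owner}:rwxc
user:${supervisor}:rwxc
EOF

cat "$acl_file"
```

The resulting file can then be passed to mmputacl with the -i option as in the two commands above.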

There is no option to recursively add or remove ACL attributes using a gpfs built-in command. You'll need to use the -i option as above for each file or directory individually. A sample bash script you may use for that purpose is given in the Appendix.

For more information on using mmputacl or mmgetacl see their man pages.


Appendix

bash script that you may adapt to recursively add or remove ACL attributes using gpfs built-in commands

Courtesy of Agata Disks (http://csngwinfo.in2p3.fr/mediawiki/index.php/GPFS_ACL)

#!/bin/bash
# USAGE
#     - on one directory:     ./set_acl.sh dir_name
#     - on more directories:  ./set_acl.sh 'dir_nam*'
#

# Path of the file that contains the ACL
ACL_FILE_PATH=/agatadisks/data/acl_file.acl

# Directories on which the ACLs have to be set
# (expanded unquoted further down, so that a quoted pattern such as 'dir_nam*' works)
dirs=$1

# Recursive function that sets the ACL on files and directories
set_acl () {
  local curr_dir=$1
  local entry
  for entry in "$curr_dir"/*
  do
    if [ -f "$entry" ]; then
      # Set ACL on a regular file
      if ! mmputacl -i "$ACL_FILE_PATH" "$entry"; then
        echo "ERROR: ACL not set on $entry" >&2
        exit 1
      fi
      echo "ACL set on file $entry"
    fi
    if [ -d "$entry" ]; then
      # Set default ACL on the directory
      if ! mmputacl -d -i "$ACL_FILE_PATH" "$entry"; then
        echo "ERROR: Default ACL not set on $entry" >&2
        exit 1
      fi
      echo "Default ACL set on directory $entry"
      # Set ACL on the directory itself
      if ! mmputacl -i "$ACL_FILE_PATH" "$entry"; then
        echo "ERROR: ACL not set on $entry" >&2
        exit 1
      fi
      echo "ACL set on directory $entry"
      # Descend into the directory
      set_acl "$entry"
    fi
  done
}

# $dirs is intentionally unquoted so that wildcards are expanded here
for dir in $dirs
do
  if [ ! -d "$dir" ]; then
    echo "ERROR: $dir is not a directory" >&2
    exit 1
  fi
  set_acl "$dir"
done
exit 0


Hierarchical Storage Management (HSM)

(a pilot project is starting in July/2010 with a select group of users)

Basic Concepts

Hierarchical Storage Management (HSM) is a data storage technique which automatically moves data between high-cost and low-cost storage media. HSM systems exist because high-speed storage devices, such as hard disk drive arrays, are more expensive (per byte stored) than slower devices, such as optical discs and magnetic tape drives. While it would be ideal to have all data available on high-speed devices all the time, this is prohibitively expensive for many organizations. Instead, HSM systems store the bulk of the enterprise's data on slower devices, and then copy data to faster disk drives when needed. In effect, HSM turns the fast disk drives into caches for the slower mass storage devices. The HSM system monitors the way data is used and makes best guesses as to which data can safely be moved to slower devices and which data should stay on the fast devices.

In a typical HSM scenario, data files which are frequently used are stored on disk drives, but are eventually migrated to tape if they are not used for a certain period of time, typically a few months. If a user does reuse a file which is on tape, it is automatically moved back to disk storage. The advantage is that the total amount of stored data can be much larger than the capacity of the disk storage available, but since only rarely-used files are on tape, most users will not notice any slowdown.

The HSM client provides both automatic and selective migration. Once file migration begins, the HSM client sends a copy of your file to storage volumes on disk devices or devices that support removable media, such as tape, and replaces the original file with a stub file on the HSM-managed file system (aka the repository at SciNet).

Repository commonly refers to a location for long-term storage, often for safety or preservation.

Migration, in the context of HSM, refers to the set of actions that move files from the front-end disk-based repository to a back-end tape library system (often invisible or inaccessible to users).

Relocation, in the context of SciNet, refers to the use of unix commands such as copy, move, tar or rsync to get data into the repository.

The stub file is a small replacement file that makes it appear as though the original file is on the repository. It contains required metadata information to locate and recall a migrated file and to respond to specific UNIX commands without recalling the file.

Automatic migration periodically monitors space usage and automatically migrates eligible files according to the options and settings that have been selected. The HSM client provides two types of automatic migration: threshold migration and demand migration.

Threshold migration maintains a specific level of free space on the repository file system. When disk usage reaches the high threshold percentage, eligible files are migrated to tapes automatically. When space usage drops to the low threshold set for the file system, file migration stops.

Demand migration responds to an out-of-space condition on the repository file system. Demand migration starts automatically if the file system runs out of space (usually triggered at 90%). For HSM, as files are migrated (oldest/largest first), space becomes available on the file system, and the process or event that caused the out-of-space condition can be resumed.

Selective migration is a user-issued HSM command that migrates specific files from the repository at will, in anticipation of the automatic migration, or independently of the system-wide eligibility criteria. For example, if you know that you will not be using a particular group of files for an extended time, you can migrate them, so as to free additional space on the repository.

Reclamation is the process of reclaiming unused space on a tape (applies to Virtual Tapes as well). Over time, as files/directories get deleted or updated on the repository, a process will expire old data, creating gaps of unused storage on the tapes. Since tapes are sequential media, typical tape handling software can only write data to the end of the tape, so these gaps of “Empty Space” cannot be used. Reclamation entails periodically, and in a rolling fashion, copying active data from the "Swiss Cheese"-like tapes to unused tapes in a compacted form.

Optimal environment: HSM should be used in an environment where the old and large files which need to be preserved are not used regularly. Files that are needed frequently should not be migrated at all; otherwise HSM would act as a cache, migrating files only to recall the same files shortly afterwards. This is not advisable. The repository file system needs to be large enough to hold all regularly used files. If the file system is too small and cannot hold all regularly needed files, HSM ends up permanently recalling requested files, exceeding the high-threshold limit, migrating other files to get below the low-threshold limit, and so on.

Deployment at SciNet

HSM is performed by dedicated IBM software made up of a number of HSM daemons running on datamover2. These daemons constantly monitor the usage of the /repository GPFS and, depending on a predefined set of policies, data may be automatically or manually migrated to the Tivoli Storage Management server (TSM), and kept on our library of LTO-4 tapes.

/repository is a 15TB "transient" location accessible only from datamover2. Users may relocate data as required from /scratch or /project to /repository in a number of ways, such as copy, move, tar or rsync. "Transient" refers to the fact that /repository works like a "Black Hole": in the background it is constantly being emptied, even while you relocate data in from other file systems. What is left behind is the directory tree with the stub files (0 byte in size at SciNet) and the metadata associated with it, which takes up about 1-2%. But, even if /repository is at 1% to start with, we ask that you please do not initiate a relocation of more than a 10TB chunk at once, so that the system has time to process your data and still allow other user(s) to migrate/recall some material before reaching 100% full.

Inside /repository, data is segregated on a per-group basis, just as in /project. Within groups, users and group supervisors can structure materials any way they prefer. But the recommendation is that those involved spend some time designing that structure ahead of time, since you may merge data from project and/or scratch (or even home). In tests we performed, we've been able to reorganize the FS structure after migration, change the name and ownership of directories and stubs, and still recall files under the new path and ownership. HSM does seem to keep a very symbiotic relation between the metadata and the inode attributes at the file system level, without necessarily having to replicate these changes with tape recall & migration operations. But please don't abuse this flexibility. If possible, keep your initial layout structure somewhat fixed over time.

We also recommend that users bundle files in tar-balls of at least 10GB before relocation, and keep a listing of those files somewhere; in fact you may use the 'tar' command to create the tar-ball directly in /repository on-the-fly. See examples below:

tar -czvf /repository/[group]/[user]/myproject1.tar.gz /project/[group]/[user]/project1/ > /project/[group]/[user]/myproject1-repository-listing.txt

or

tar -czvf /repository/[group]/[user]/myscratch1.tar.gz /scratch/[user]/scratchdata1/ > /home/[user]/myscratch1-repository-listing.txt

It is important to keep a listing of the files that are in each tar on a partition other than the HSM repository so that you can quickly decide which tar you need to recover. While the tar stub will always exist on the HSM disk, you will not be able to run tar --list on the stub without recalling the full tar file back from tape to disk. The redirection in the examples above accomplishes this.
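As a minimal sketch of how those listings might be used later, the function below searches a directory of listing files for a path fragment and prints the listing(s) that mention it. The function name find_in_listings is hypothetical, and the *-repository-listing.txt naming pattern is simply the one used in the examples above, not a SciNet-provided convention or tool.

```shell
# Hypothetical helper: print the names of listing files that mention a pattern.
# Usage: find_in_listings PATTERN LISTING_DIR
find_in_listings() {
    pattern=$1
    listing_dir=$2
    # -l prints only the names of matching files;
    # -F treats the pattern as a fixed string rather than a regular expression
    grep -l -F -- "$pattern" "$listing_dir"/*-repository-listing.txt
}
```

The name of the matching listing tells you which tar-ball to recall, without touching any stub in /repository.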

The important thing is to avoid relocating many thousands (or millions) of small files. It is very demanding on the system to constantly scan and reconcile all these files across the file system, tapes, metadata and database. A good target is an average file size greater than 100MB in /repository. Deep directory nesting also increases the time required to traverse a file system and should be avoided where possible.
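To check a directory against the 100MB guideline before relocating it, something like the following sketch could be used. The function name avg_file_size is hypothetical; it assumes GNU find (for -printf) and counts 1 MB as 1,000,000 bytes.

```shell
# Hypothetical helper: report file count, total size and average file size
# for a directory tree. Relies on GNU find's -printf.
# Usage: avg_file_size DIR
avg_file_size() {
    find "$1" -type f -printf '%s\n' |
    awk '{ total += $1; n++ }
         END {
             avg = (n > 0) ? total / n : 0
             printf "files=%d total_MB=%.1f avg_MB=%.1f\n", n, total / 1e6, avg / 1e6
         }'
}
```

If avg_MB comes out well below 100, bundling the directory into tar-balls first is probably the better option.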

Performance

Unlike /project or /scratch, /repository is only a 2-tier disk RAID, so don't expect transfer rates much higher than 60MB/sec on an rsync session, for example. In other words, a 10TB offload operation will typically take 2 days to complete if it is made up of large files. On the other hand, we have conducted experiments where we migrated only 1TB, but spread over 1 million files, and that took nearly a day! This situation should be avoided when relocating many terabytes. Performance is as much a function of the number of files as of the amount of data.
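The 2-day figure follows directly from the transfer rate. A rough sketch of the arithmetic, assuming a sustained 60MB/sec and taking 1TB as 1,000,000MB (the function name transfer_hours is hypothetical):

```shell
# Rough estimate of transfer time in whole hours at a sustained rate.
# Usage: transfer_hours SIZE_TB RATE_MB_PER_SEC   (1 TB taken as 1,000,000 MB)
transfer_hours() {
    echo $(( $1 * 1000000 / $2 / 3600 ))
}

transfer_hours 10 60    # prints 46, i.e. just under 2 days
```

This only accounts for raw throughput; as noted above, a large number of small files can make the real time much longer.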

As for the "ideal tar-ball size", experiments have shown that an isolated 10GB tar-ball typically takes 10-15 minutes to be pulled back, considering all tape operations involved. That seems like a reasonable amount of time to wait for a group of files kept off-line for an extended period of time. Also consider that pulling back an individual tiny file could still take as long as 5-8 minutes. So it's pretty clear that you get the best bang for the buck by tar'ing your material, and you won't tie up the tape system for too long. As for the upper limit, you can probably bundle files in 100-500GB tar-balls, provided that you're OK with waiting a couple of hours for them to be recalled at a later date; at least from SciNet's perspective, it would be a very efficient migration.
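One possible way to plan tar-balls of at least a given size is to walk the file list in order and close a bundle as soon as it reaches the target. The sketch below does only the planning step; the function name plan_bundles and the "size, tab, path" input format are assumptions made for illustration, and paths containing tabs or newlines are not handled.

```shell
# Hypothetical helper: assign files to numbered bundles, closing each bundle
# once it holds at least MIN_BYTES (the last bundle may end up smaller).
# Reads "size<TAB>path" lines on stdin, prints "bundle<TAB>path".
# Usage: ... | plan_bundles MIN_BYTES
plan_bundles() {
    awk -F '\t' -v min="$1" '
        { printf "%d\t%s\n", bundle + 1, $2; sum += $1 }
        sum >= min { bundle++; sum = 0 }   # target reached: start the next bundle
    '
}
```

With GNU find, the input could be produced with find DIR -type f -printf '%s\t%p\n' and piped into plan_bundles 10000000000 for a 10GB target; each bundle number would then correspond to one tar-ball.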

Please be sure to contact us to schedule your transfers IN or OUT of the system, to avoid conflicts with other users or with the system settings. For instance, if you recall a large amount of data at once, let's say 7.5TB (about half of /repository), we would have to raise the high threshold accordingly for that period (to 50%), so we don't induce the never-ending migrate/recall issues (**) described under Optimal environment.

How to migrate/recall data

Automatic

We have currently set up /repository with High and Low thresholds of 2% and 1% respectively. That means that at regular intervals the file system is checked to determine whether the 2% usage mark has been reached or surpassed. In that case, data is automatically migrated to tapes, oldest (or largest) first, until the file system is down to 1%, if possible (metadata is not migrated). Since data may be copied/moved/rsync'ed/tar'ed in faster than /repository can be emptied, you may sporadically observe 80-90% disk usage (hence the 10TB limit on a single relocation). For now at SciNet we migrate every file in /repository to tapes.

To recall a file automatically all you have to do is access it. There are many ways you can do this. For example, you may view a file with 'cat', 'more', 'vi/vim', etc. You may also copy the file (or directory) from /repository to another location. Please be patient: the file will have to be pulled back from tape, and this will take some time, longer if it happens to be at the end of a tape.

Selective

Selective migration is used to override the internal priority of HSM (oldest/largest) or to migrate files/directories "immediately". The recommendation is to not wait for the automatic migration cycle to kick in, since this could take some 6 to 12 hours at SciNet. If you already know that you relocated material to /repository with the intention of having it migrated to tapes, you can just use dsmmigrate as soon as the rsync to /repository has finished, for instance.

(files won't be migrated until they have "aged" for at least 5 minutes, that is, after their last access/modification time)

dsmmigrate [path to FILE]
or
dsmmigrate -R -D /repository/[group]/[user]/[directory]
or
dsmmigrate /repository/scinet/pinto/blahblahblah.tar.Z
or
{
 cd /repository/scinet/pinto/
 dsmmigrate blahblahblah.tar.Z
}
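The relocate-then-migrate sequence described above can be sketched as a small shell function. This is only an illustration: the function name relocate_and_migrate and the DRY_RUN switch are hypothetical, and by default the commands are printed rather than executed. The 300-second pause reflects the 5-minute aging requirement mentioned above.

```shell
# Sketch: copy a directory into /repository, wait for the files to age,
# then migrate it selectively. Commands are only printed unless DRY_RUN=0.
# Usage: relocate_and_migrate SRC_DIR DEST_DIR   (SRC_DIR without trailing slash)
relocate_and_migrate() {
    src=$1      # e.g. a directory under /scratch or /project
    dest=$2     # e.g. your directory under /repository
    run=""
    if [ "${DRY_RUN:-1}" = "1" ]; then
        run="echo"
    fi
    $run rsync -a "$src" "$dest"/ &&
    $run sleep 300 &&    # files must be at least 5 minutes old to be migrated
    $run dsmmigrate -R -D "$dest/$(basename "$src")"
}
```

Running it once in the default dry-run mode is a cheap way to check the paths before setting DRY_RUN=0 on datamover2.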

To selectively recall data, just type:

dsmrecall [path to FILE]
or
dsmrecall -R -D /repository/[group]/[user]/[directory]
or
dsmrecall /repository/scinet/pinto/blahblahblah.tar.Z
or
{
 cd /repository/scinet/pinto/
 dsmrecall blahblahblah.tar.Z
}

Note: We've been finding that the search for new candidates for automatic migration takes much longer once repository is already full of files/stubs. That is to be expected, hence the recommendation to not wait and proceed with the selective migration of your own files/directories asap.

Disaster Recovery

As with any disk-based storage, and although it is a RAID 5 file system, /repository is not immune to failures. We do not do regular backups, but it is possible to do a full recovery in case of catastrophic loss of /repository. For that, it is important that all files have been completely migrated to tapes beforehand. That puts the onus on users to ensure that this migration has indeed finished (using selective migration) for the relocated material before they delete the originals from /project or /scratch.

Common HSM commands

Some traditional unix/linux commands, such as 'ls' or 'rm', work on the stub files just as they do on real files. For others, such as 'du' or 'df', you are better off using the HSM equivalent, which gives more meaningful information in the context of HSM. The HSM commands only work inside /repository. Some of them, such as 'dsmrm', are executable only by root, in which case you'll be notified.

dsmls

to check status of files; used in the directory where you expect to have migrated files

r: resident (the file is on repository only)

m: migrated (only the stub of the file is on repository)

p: premigrated (the file is on repository and on tape)

Usage: dsmls [-Noheader] [-Recursive] [-Help] [file specs|-FIlelist=file]

Example:

gpc-logindm02-$ dsmls -R a3
IBM Tivoli Storage Manager
Command Line Space Management Client Interface
  Client Version 6, Release 1, Level 0.0  
  Client date/time: 07/27/2010 12:06:36
(c) Copyright by IBM Corporation and other(s) 1990, 2009. All Rights Reserved.

      Actual     Resident     Resident  File   File
        Size         Size     Blk (KB)  State  Name
       <dir>         8192            8   -      a3/

/repository/scinet/pinto/a3:
 34008432640            0            0   m      32G-1
 34008432640  34008432640            0   r      32G-2
 34008432640  34008432640            0   p      32G-3
           0            0            0   r      dsmerror.log
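Before deleting originals from /project or /scratch, it can be useful to list any files that are still resident only. The filter below is a minimal sketch that assumes the five-column dsmls layout shown above and file names without whitespace; the function name not_on_tape is hypothetical.

```shell
# Hypothetical helper: read dsmls output on stdin and print the names of files
# whose state is "r" (resident only, i.e. not yet on tape).
# Assumes the five-column layout shown above and file names without spaces.
not_on_tape() {
    awk '$1 ~ /^[0-9]+$/ && $4 == "r" { print $5 }'
}
```

On datamover2 this could be used as dsmls -R a3 | not_on_tape; for the example output above it would print 32G-2 and dsmerror.log. An empty result suggests that every listed file is either migrated or premigrated.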

dsmdu

disk usage on the original files/directory

Usage: dsmdu [-Allfiles] [-Summary] [-Help] [directory names]

dsmdf

disk free on the HSM file system.

Usage: dsmdf [-Help] [-Detail] [file systems]

dsmmigrate

Usage: dsmmigrate [-Recursive] [-Premigrate] [-Detail] [-Help] filespecs|-FIlelist=file 

dsmrecall

Usage: dsmrecall [-Recursive] [-Detail] [-Help] file specs|-FIlelist=file
   or  dsmrecall [-Detail] -offset=XXXX[kmgKMG] -size=XXXX[kmgKMG] file specs 


To have an idea of what HSM is doing on datamover2 at a given time:


[pinto@gpc-logindm02 ~]$ ps -def | grep dsm | grep -v mmfs

root      2455 15190  0 16:26 ?        00:00:00 dsmmonitord
root      2456  2455  2 16:26 ?        00:05:38 dsmautomig -2 system::/repository
pinto    10997 10637 30 16:40 pts/3    01:14:20 dsmmigrate -R -D pinto
root     12857     1  0 16:15 ?        00:00:00 dsmrecalld
root     13013 12857  0 16:15 ?        00:00:01 dsmrecalld
root     13015 12857  0 16:15 ?        00:00:00 dsmrecalld
root     15190     1  0 16:15 ?        00:00:00 dsmmonitord
root     16936     1  3 16:15 ?        00:10:44 dsmscoutd
root     17217     1 13 16:16 ?        00:36:49 dsmrootd
root     18732  2456  4 17:51 ?        00:07:19 dsmautomig -2 system::/repository
root     18737  2456  0 17:51 ?        00:00:26 dsmautomig -2 system::/repository
pinto    24533 10363  0 20:48 pts/2    00:00:00 grep dsm
root     25090     1  0 06:42 ?        00:00:08 dsmwatchd nodetach
root     30840 13013  0 17:15 ?        00:00:02 dsmrecalld

In the above example, dsmmonitord, dsmrecalld, dsmscoutd, dsmrootd and dsmwatchd are the 5 typical HSM daemons, and they are always running. In addition, there are 3 streams of dsmautomig (triggered by threshold migration) and 1 stream of dsmmigrate (selective migration initiated by user pinto).