Data Management

WARNING: SciNet is in the process of replacing this wiki with a new documentation site. For current information, please go to https://docs.scinet.utoronto.ca

Storage Space

SciNet's storage system is based on IBM's GPFS (General Parallel File System). There are two main systems for user data: /home, a small, backed-up space where user home directories are located, and /scratch, a large system for input and output data for jobs. Data on /scratch is not backed up, and files there are deleted if they have not been accessed in 3 months. (A third storage system, /project, exists only for groups with LRAC/NRAC allocations.) SciNet does not provide long-term storage for large data sets.

Overview of the different file systems

 file system   purpose       user quota                           block size   backed up   purged             access
 /home         development   50 GB                                256 KB       yes         never              read-only on compute nodes (r/w on login, devel and datamover1)
 /scratch      computation   first of (20 TB ; 1 million files)   4 MB         no          files > 3 months   read/write on all nodes
 /project      computation   by allocation                        256 KB       yes         never              read/write on all nodes

Note: /project is included in /scratch.

Home Disk Space

Every SciNet user gets 50GB on /home, in a directory /home/G/GROUP/USER, which is regularly backed up. Home is visible from the login.scinet nodes and from the development nodes of the GPC and the TCS. However, on the compute nodes of the GPC clusters (that is, while jobs are running) /home is mounted read-only; GPC jobs can read files in /home but cannot write to them. /home is a good place to put code, input files for runs, and anything else that needs to be kept to reproduce runs. On the other hand, /home is not a good place to put many small files: the block size of the file system is 256KB, so small files quickly eat into your disk quota and make the backup system very slow.

If your application absolutely insists on writing material to your home account and you can't find a way to instruct it to write somewhere else, an alternative is to create a link pointing from your account under /home to a location under /scratch.
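For example, a minimal sketch (run it from a login or devel node, since /home is read-only on the compute nodes; the "app_output" directory name is hypothetical, and G/GROUP/USER stand for your own group letter, group and user name as above):

 # create a writable target on scratch, then point a link at it from home
 mkdir -p /scratch/G/GROUP/USER/app_output
 ln -s /scratch/G/GROUP/USER/app_output /home/G/GROUP/USER/app_output

The application can keep writing to what it thinks is a directory under your home account, while the data actually lands on /scratch.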

Scratch Disk Space

Every SciNet user also gets a directory in /scratch, called /scratch/G/GROUP/USER. Scratch is visible from the login.scinet nodes, from the development nodes of the GPC and the TCS, and from the compute nodes of the clusters, where it is mounted read-write. Jobs therefore normally write their output somewhere in /scratch. There are NO backups of anything on /scratch.

There is a large amount of space available on /scratch, but it is purged routinely so that all users running jobs and generating large outputs have room to store their data temporarily. Computational results that you want to keep beyond the purge window must be copied (using scp) off SciNet entirely, for example to your local system. SciNet does not routinely provide long-term storage for large data sets.

Also note that the shared parallel file system was not designed to do many small file transactions. For that reason, the number of files that any user can have on scratch is limited to 1 million. This limit should be thought of as a safeguard, not an invitation to create one million files. Please see File System and I/O dos and don'ts.
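If you want a rough idea of how many files you currently have under your scratch directory, a standard find/wc pipeline is enough (a sketch; on a large tree this itself generates many metadata operations, so do not run it more often than necessary):

 # count regular files below your scratch directory
 find /scratch/G/GROUP/USER -type f | wc -l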

Scratch Disk Purging Policy

In order to ensure that there is always significant space available for running jobs, we automatically delete files in /scratch that have not been accessed or modified for more than 3 months; the actual deletion takes place on the 15th of each month. Note that the cut-off is now based on the most recent of the access and modification times, i.e. MostRecentOf(atime,mtime). This policy is subject to revision depending on its effectiveness. More details about the purging process, and how users can check whether their files will be deleted, follow below. If you have files scheduled for deletion you should move them to a more permanent location, such as your departmental server or your /project space (for PIs who have either been allocated disk space by the LRAC or have bought disk space).

On the first of each month, a list of files scheduled for purging is produced, and an email notification is sent to each user on that list. Furthermore, at or about the 12th of each month a second scan produces a more current assessment and another email notification is sent. This way users can double-check that they have indeed taken care of all the files they needed to relocate before the purging deadline. Those files will be automatically deleted on the 15th of the same month unless they have been accessed or relocated in the interim. If you have files scheduled for deletion they will be listed in a file in /scratch/t/todelete/current, which has your userid and groupid in the filename. For example, if user xxyz wants to check whether they have files scheduled for deletion, they can issue the following command on a system which mounts /scratch (e.g. a scinet login node): ls -1 /scratch/t/todelete/current |grep xxyz. In the example below, the name of this file indicates that user xxyz is part of group abc, has 9,560 files scheduled for deletion and that these take up 1.0TB of space:

 [xxyz@scinet04 ~]$ ls -1 /scratch/t/todelete/current |grep xxyz
 -rw-r----- 1 xxyz     root       1733059 Jan 12 11:46 10001___xxyz_______abc_________1.00T_____9560files

The file itself contains a list of all files scheduled for deletion (in the last column) and can be viewed with standard commands like more/less/cat - e.g. more /scratch/t/todelete/current/10001___xxyz_______abc_________1.00T_____9560files

Similarly, you can check on the other users in your group by using the ls command with grep on your group name. For example: ls -1 /scratch/t/todelete/current |grep abc. That will list all other users in the same group that xxyz is part of who have files to be purged on the 15th. Members of the same group have access to each other's contents.

NOTE: Preparing these assessments takes several hours. If you change the access/modification time of a file in the interim, that will not be detected until the next cycle. A way to get immediate feedback is to use 'ls -lu' on the file to check its atime and 'ls -l' for its mtime. If the file's atime or mtime has been updated in the meantime, it will no longer be deleted on the purging date of the 15th.
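For example, assuming a (hypothetical) file called results.dat, you could check its timestamps as follows; stat shows all the times at once:

 ls -lu results.dat     # shows the access time (atime)
 ls -l  results.dat     # shows the modification time (mtime)
 stat   results.dat     # shows access, modification and change times together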

Project Disk Space

Investigators who have been granted allocations through the LRAC/NRAC Application Process may have been allocated disk space in addition to compute time. For the period of time that the allocation is granted, they will have disk space on the /project disk system. Space on project is a subset of scratch, but it is not purged and it is backed up. All members of the investigator's group will have access to this disk system, which will be mounted read/write everywhere.

How much Disk Space Do I have left?

The /scinet/gpc/bin6/diskUsage command, available on the login nodes, datamovers and the GPC devel nodes, provides information in a number of ways on the home, scratch, and project file systems. For instance, it can report how much disk space is being used by yourself and your group (with the -a option), or how much your usage has changed over a certain period ("delta information"), or it can generate plots of your usage over time. Please see the usage help below for more details.

Usage: diskUsage [-h|-?| [-a] [-u <user>] [-de|-plot]
       -h|-?: help
       -a: list usages of all members on the group
       -u <user>: as another user on your group
       -de: include delta information
       -plot: create plots of disk usages

Did you know that you can check which of your directories have more than 1000 files with the /scinet/gpc/bin6/topUserDirOver1000list command and which have more than 1GB of material with the /scinet/gpc/bin6/topUserDirOver1GBlist command?
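A typical check might look like the sketch below; it is based on the usage text above and on the command names just mentioned, and the exact invocation and output format may differ:

 # usage for yourself and everyone in your group, including recent changes
 /scinet/gpc/bin6/diskUsage -a -de

 # directories with more than 1000 files, and directories with more than 1GB
 /scinet/gpc/bin6/topUserDirOver1000list
 /scinet/gpc/bin6/topUserDirOver1GBlist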

Notes:

  • information on usage and quota is only updated hourly!
  • contents of project count against space and #files limits on scratch

Performance

GPFS is a high-performance filesystem which provides rapid reads and writes to large datasets in parallel from many nodes. As a consequence of this design, however, the file system performs quite poorly at accessing data sets which consist of many small files. For instance, you will find that reading data in from one 16MB file is enormously faster than from 400 40KB files. Such small files are also quite wasteful of space, as the blocksize for the scratch filesystem is 16MB. This is something you should keep in mind when planning your input/output strategy for runs on SciNet.

For instance, if you run multi-process jobs, having each process write to a file of its own is not a scalable I/O solution. A directory gets locked by the first process accessing it, so all other processes have to wait for it. Not only has the code just become considerably less parallel, but chances are the file system will time out while waiting for your other processes, leading your program to crash mysteriously. Consider using MPI-IO (part of the MPI-2 standard), which allows files to be opened simultaneously by different processes, or using a dedicated process for I/O to which all other processes send their data, and which subsequently writes this data to a single file.
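On the shell side, if a run has already produced a large number of small files, bundling them into a single archive before storing or transferring them avoids most of the small-file penalty. A sketch (the run42 directory and the archive name are hypothetical):

 # pack a directory of small files into a single archive on scratch
 tar -czf /scratch/G/GROUP/USER/run42_files.tar.gz -C /scratch/G/GROUP/USER run42

 # later, list the contents without unpacking everything
 tar -tzf /scratch/G/GROUP/USER/run42_files.tar.gz | head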

Local Disk

The compute nodes on the GPC do not contain hard drives, so there is no local disk available to use during your computation. You can however use part of a compute node's RAM like a local disk ('ramdisk'), but this will reduce how much memory is available for your program. The ramdisk can be accessed via /dev/shm/ and is currently set to 8GB. Anything written to this location that you want to keep must be copied back to the /scratch filesystem, as /dev/shm is wiped after each job and, since it is in memory, will not survive a reboot of the node. More on ramdisk usage can be found here.
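A minimal sketch of using the ramdisk inside a job script (the input/output file names and the my_code executable are hypothetical; copying results back to /scratch before the job ends is essential, as explained above):

 # stage input into the ramdisk, run there, then save results back to scratch
 cp /scratch/G/GROUP/USER/input.dat /dev/shm/
 cd /dev/shm
 /home/G/GROUP/USER/my_code input.dat output.dat
 cp output.dat /scratch/G/GROUP/USER/
 rm -f /dev/shm/input.dat /dev/shm/output.dat   # /dev/shm is wiped after the job anyway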

Note that the absence of hard drives also means that the nodes cannot swap memory, so be sure that your computation fits within memory.

Buying storage space on GPFS or HPSS

Groups can buy space on GPFS or HPSS rather than rely on the annual allocation process. A good budgetary number would be:

GPFS $400/TB

HPSS $150/TB

This is a one-time cost. We have no formal, written data retention policy at this point but the intent is to keep any HPSS data (including migrating to new tape technologies) as long as SciNet is in operation. These numbers are for budgetary purposes only and subject to change (e.g. as markets and technologies evolve).


Data Transfer

General guidelines

All traffic to and from the data centre has to go via SSH, or secure shell. This is a protocol which sets up a secure connection between two sites. In all cases, incoming connections to SciNet go through relatively low-speed connections to the login.scinet gateways, but there are many ways to copy files on top of the ssh protocol.

What node to use for data transfer to and from SciNet depends mostly on the amount of data to transfer:

Moving <10GB through the login nodes

The login nodes are accessible from outside SciNet, which means that you can transfer data between your own office/home machine and SciNet using scp or rsync (see below). However, the login nodes impose a cpu_time limit of 5 minutes (cpu_time, not wall_time), so transfers much larger than about 10GB are unlikely to complete. Even for transfers of less than 10GB, using a datamover node will generally be faster.

Note that transfers through a login node will time out after a certain amount of cpu_time (currently 5 minutes), so if you have a slow connection you may need to go through datamover1.

Moving >10GB through the datamover1 node

Serious moves of data (>10GB) to or from SciNet should be done from the datamover1 or datamover2 nodes. From any of the interactive SciNet nodes, one can ssh datamover1 or ssh datamover2 to log in. These are the machines with the fastest network connections to the outside world (faster by a factor of 10: a 10 Gb/s link versus 1 Gb/s).

Transfers must be originated from datamover1 or datamover2; that is, one cannot copy files from the outside world directly to or from the datamovers; one has to log in to a datamover and copy the data to or from the outside network. Your local machine must also be reachable from the outside, either by its name or by its IP address. If you are behind a firewall or a (wireless) router, this may not be possible, and you may need to ask your network administrator to allow the datamovers to ssh to your machine. If you need to open a hole in your firewall, the datamovers' IPs are:

datamover1 142.150.188.121
datamover2 142.150.188.122
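
In practice, a push from SciNet to your machine then looks like the sketch below (the remote host name and paths are placeholders):

  $ ssh datamover1
  $ scp /scratch/$USER/bigresults.tar jon@remote.system.com:/home/jon/incoming/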

Hpn-ssh

The usual ssh protocols were not designed for speed. On the datamover1 or datamover2 nodes, we have installed hpn-ssh, or High-Performance-enabled ssh. You use this higher-performance ssh/scp/sftp variant by loading the `hpnssh' module. Hpn-ssh is backwards compatible with the `usual' ssh, but is capable of significantly higher speeds. If you routinely have large data transfers to do, we recommend having your system administrator look into installing hpn-ssh on your system.
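
For example, on a datamover you would load the module first; scp/rsync commands run from that shell then use the hpn-ssh versions:

  $ module load hpnssh
  $ which scp      # if the module prepends to your PATH, this now points at the hpn-ssh scp rather than the system one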

Everything we discuss below, unless otherwise stated, will work regardless of whether you have hpn-ssh installed on your remote system.

For Microsoft Windows users

Linux-Windows transfers can be a bit more involved than Linux-to-Linux ones, but with Cygwin this should not be a problem. Make sure you install Cygwin with the openssh package.

If you want to remain 100% within a Windows environment, another very good tool is WinSCP. It lets you easily transfer and synchronize data between your Windows workstation and the login nodes using your ssh credentials (provided that it's not much more than 10GB per sync pass).

If you are going to use the datamover1 method, and assuming your machine is not a wireless laptop (if it is, it is best to find a nearby wired computer and use a USB memory stick), you will need the IP address of your machine, which you can find by typing "ipconfig /all" on your local Windows machine. You will also need the ssh daemon (sshd) running locally in Cygwin.

Also note that your Windows user name does not have to be the same as on SciNet; this just depends on how your local Windows system was set up.

All locations given to scp or rsync in Cygwin have to be in Unix format (using "/", not "\"), and are relative to Cygwin's path, not Windows' (e.g. use /cygdrive/c/... to get to the Windows C: drive).
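
Putting this together, a small transfer from a Cygwin shell to the login gateway might look like the following sketch (user names and paths are placeholders; substitute the login node you normally connect to):

  $ scp /cygdrive/c/Users/jon/data.bin jon@login.scinet.utoronto.ca:/scratch/jon/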

Ways to transfer data

Globus data transfer

Globus is a file-transfer service with an easy-to-use web interface. To get started, please sign up for a Globus account at the Globus website. Once you have an account, go to this page to start the file transfer, and enter computecanada#niagara as one endpoint. If you are trying to transfer data from a laptop or desktop, you will need to install the Globus Connect Personal software (available here) to set up an endpoint for that machine and perform the transfer.

Please see the following page on how to set up Globus to perform data transfer.

scp

scp, or secure copy, is the easiest way to copy files, although we generally find rsync (below) to be faster.

scp works like cp to copy files:

$ scp original_file  copy_file

except that either the original or the copy can be on another system:

$ scp jonsdatafile.bin jon@remote.system.com:/home/jon/bigdatadir/

will copy the data file into the directory /home/jon/bigdatadir/ on remote.system.com after logging in as jon; you will be prompted for a password (unless you've set up ssh keys).
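
If you do not want to type a password for every transfer, a typical key setup looks like this sketch (host name as in the examples above):

  $ ssh-keygen -t rsa                    # generate a key pair; choose a passphrase when prompted
  $ ssh-copy-id jon@remote.system.com    # append your public key to the remote authorized_keys file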

Copying from remote systems works the same way:

$ scp jon@remote.system.com:/home/jon/bigdatadir/newdata.bin .

And wildcards work as you'd expect, except that wildcards referring to files on the remote system have to be quoted, so that they are expanded by the remote shell rather than by your local one:

$ scp *.bin jon@remote.system.com:/home/jon/bigdatadir/
$ scp jon@remote.system.com:"/home/jon/inputdata/*" .

There are few options worth knowing about:

  • scp -C compresses the file before transmitting it; if the file compresses well, this can significantly increase the effective data transfer rate (though usually not as much as compressing the data, sending it, and then uncompressing it). If the file doesn't compress well, this adds CPU overhead without accomplishing much, and can slow down your data transfer.
  • scp -oNoneEnabled=yes -oNoneSwitch=yes -- This is an hpn-ssh only option. If CPU overhead is a significant bottleneck in the data transfer, you can avoid it by turning off the encryption of the data stream. For most data this is acceptable, but not for sensitive data. In either case, authentication remains secure; it is only the data transfer itself that is sent in plaintext. An example command is shown below the list.
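
For instance, on a datamover with the hpnssh module loaded, a transfer with encryption of the data stream disabled would look like (host and paths are placeholders):

  $ scp -oNoneEnabled=yes -oNoneSwitch=yes bigfile.bin jon@remote.system.com:/home/jon/bigdatadir/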

rsync

rsync is a very powerful tool for mirroring directories of data.

$ rsync -av -e ssh scinetdatadir jon@remote.system.com:/home/jon/bigdatadir/

rsync has a dizzying number of options. The command above syncs scinetdatadir to the remote system; that is, any files that are newer on the local system are updated on the remote system. The converse isn't true; if there were newer files on the remote system, you would have to bring those over with

$ rsync -av -e ssh jon@remote.system.com:/home/jon/bigdatadir/ scinetdatadir 

The -a and -v options select `archive' mode, which preserves timestamps and permissions (normally what you want), and verbose mode. -e ssh tells rsync to use ssh for the transfer.

One of the powerful things about rsync is that it checks which files already exist before copying, so you can use it repeatedly as a data directory fills up and it won't make unnecessary copies; similarly, if a log file (say) grows over time, it will only copy the difference between the files, further speeding things up. This also means that it behaves well if a transfer is interrupted: a second invocation of rsync will continue where the first left off.
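
For long transfers of very large files it can also help to ask rsync to keep partially transferred files, so that a rerun resumes inside a large file rather than restarting it; this is the standard rsync --partial flag (not specific to SciNet):

  $ rsync -av --partial -e ssh scinetdatadir jon@remote.system.com:/home/jon/bigdatadir/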

As with scp -C, rsync -z compresses on the fly, which can significantly enhance effective data transfer rates if the files compress well, or hurt them if not.

As with scp, if both sides are running hpn-ssh one can disable encryption of the data stream should that prove to be a bottleneck:

$ rsync -av -e "ssh -oNoneEnabled=yes -oNoneSwitch=yes" jon@remote.system.com:/home/jon/bigdatadir/ scinetdatadir

SciNet's login nodes, 142.150.188.5[1-4], are publicly accessible and can be used for data transfer as long as your material does not come in very large chunks (much more than 2-3GB per file). Because of the 5-minute CPU-time limit on the login nodes, the transfer process may be killed by the kernel before completion. The workaround is to transfer your data in an rsync loop that checks the rsync return code, assuming at least some files can be transferred before the CPU limit is reached. For example, in a bash shell:

  for i in {1..100}; do   ### try 100 times
    rsync ...
    [ "$?" == "0" ] && break
  done

ssh tunnel

Alternatively you may use a reverse ssh tunnel (ssh -R).

If your transfer is above 10GB you will need to use one of SciNet's datamovers. If your workstation is behind a firewall (as the datamovers are), you will need a node external to your firewall, on the edge of your organization's network, that can serve as a gateway and is reachable via ssh from both your workstation and the datamovers. Initiate an "ssh -R" connection from SciNet's datamover to that node. The gateway node needs to have ssh GatewayPorts enabled so that your workstation can connect to the specified port on it, which forwards the traffic back to SciNet's datamover.
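
A minimal sketch, assuming a gateway host gateway.example.org with GatewayPorts enabled and port 2222 free (all host names, user names, ports and paths are placeholders):

  # on a SciNet datamover: open a reverse tunnel so that gateway.example.org:2222 forwards to the datamover's ssh port
  $ ssh -R 2222:localhost:22 user@gateway.example.org

  # on your workstation: connect to port 2222 on the gateway; the connection is forwarded
  # back to the datamover, so log in with your SciNet credentials
  $ rsync -av -e "ssh -p 2222" mydata/ scinetuser@gateway.example.org:/scratch/scinetuser/mydata/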

Transfer speeds

What transfer speeds could I expect?

Below are some typical transfer numbers from datamover1 to another University of Toronto machine with a 1Gb/s link to the campus network:

Mode      With hpn-ssh     Without hpn-ssh
rsync     60-80 MB/s       30-40 MB/s
scp       50 MB/s          25 MB/s

What can slow down my data transfer?

To move data quickly, all of the stages in the process have to be fast: the file system you are reading data from, the CPU reading the data, the network connection between the sender and the receiver, and the recipient's CPU and disk. The slowest element in that chain will limit the speed of the entire transfer.

On SciNet's side, our underlying filesystem is the high-performance GPFS system, and the node we recommend you use (datamover1) has a high-speed connection to the network and fast CPUs.

Why are my transfers so much slower?

If you get numbers significantly lower than the above, there is a bottleneck in the transfer. The first thing to do is to run top on datamover1; if other people are transferring large files at the same time as you, network congestion could result and you will just have to wait until they are done.

If nothing else is going on on datamover1, there are a number of possibilities:

  • the network connection between SciNet and your machine - do you know the speed of your remote machine's network connection? Are your system's connections tuned for performance [1]?
  • is the remote server busy?
  • are the remote server's disks busy, or known to be slow?

For any further questions, contact us at Support @ SciNet

File/Ownership Management (ACL)

  • By default, at SciNet, users within the same group have read permission to each other's files (not write)
  • You may use an access control list (ACL) to allow your supervisor (or another user within your group) to manage files for you (i.e., create, move, rename, delete), while still retaining your access and permissions as the original owner of the files/directories.
  • NOTE: We highly recommend that you never give write permission to other users on the top level of your home directory (/home/G/GROUP/[owner]), since that would seriously compromise your privacy and, among other things, disable ssh key authentication. If necessary, make specific sub-directories under your home directory from which other users can manipulate/access files; see the example after this list.
  • If you need to set up permissions across groups contact us (and the other group's supervisor!).
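
For example, a sketch of a dedicated shared subdirectory (the directory name is a placeholder):

  $ mkdir /home/G/GROUP/USER/shared
  $ chmod g+rwx /home/G/GROUP/USER/shared     # group members can create and modify files here, but not elsewhere in your home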


Using mmputacl/mmgetacl

  • You may use GPFS's native mmputacl and mmgetacl commands. The advantages are that you can set a "control" permission and that both POSIX and NFS v4 style ACLs are supported. You will first need to create a /tmp/supervisor.acl file with the following contents:
user::rwxc
group::----
other::----
mask::rwxc
user:[owner]:rwxc
user:[supervisor]:rwxc

Then issue the following 2 commands:

1) $ mmputacl -i /tmp/supervisor.acl /project/g/group/[owner]
2) $ mmputacl -d -i /tmp/supervisor.acl /project/g/group/[owner]
   (with the -d flag, every *new* file/directory created inside [owner] will by default inherit this ACL,
   so both [owner] and [supervisor] keep full access to files/directories created by either of them)

   $ mmgetacl /project/g/group/[owner]
   (to determine the current ACL attributes)

   $ mmdelacl -d /project/g/group/[owner]
   (to remove any previously set ACL)

   $ mmeditacl /project/g/group/[owner]
   (to create or change a GPFS access control list)
   (for this command to work set the EDITOR environment variable: export EDITOR=/usr/bin/vi)

NOTES:

  • mmputacl/setfacl will not overwrite the original Linux group permissions of a directory when it is copied to another directory that already has ACLs, hence the "#effective:r-x" note you may see from time to time in the output of mmgetacl/getfacl. If you want to give rwx permissions to everyone in your group, simply rely on the plain Unix 'chmod g+rwx' command. You may do that before or after copying the original material to another folder with the ACLs.

For more information on using mmputacl or mmgetacl see their man pages.

Appendix (ACL)

A bash script that you may adapt to recursively add or remove ACL attributes using GPFS's built-in commands:

Courtesy of Agata Disks (http://csngwinfo.in2p3.fr/mediawiki/index.php/GPFS_ACL)

#!/bin/bash
# USAGE
#     - on one directory:     ./set_acl.sh dir_name
#     - on more directories:  ./set_acl.sh 'dir_nam*'
#

# Path of the file that contains the ACL
ACL_FILE_PATH=/agatadisks/data/acl_file.acl

# Directories onto which the ACLs have to be set
dirs=$1

# Recursive function that sets the ACL on files and directories
set_acl () {
  curr_dir=$1
  for args in "$curr_dir"/*
  do
    if [ -f "$args" ]; then
      echo "ACL set on file $args"
      mmputacl -i "$ACL_FILE_PATH" "$args"
      if [ $? -ne 0 ]; then
        echo "ERROR: ACL not set on $args"
        exit 1
      fi
    fi
    if [ -d "$args" ]; then
      # Set the default (inherited) ACL on the directory
      mmputacl -d -i "$ACL_FILE_PATH" "$args"
      if [ $? -ne 0 ]; then
        echo "ERROR: Default ACL not set on $args"
        exit 1
      fi
      echo "Default ACL set on directory $args"
      # Set the ACL on the directory itself
      mmputacl -i "$ACL_FILE_PATH" "$args"
      if [ $? -ne 0 ]; then
        echo "ERROR: ACL not set on $args"
        exit 1
      fi
      echo "ACL set on directory $args"
      set_acl "$args"
    fi
  done
}

# $dirs is deliberately left unquoted so that a quoted pattern such as 'dir_nam*' expands here
for dir in $dirs
do
  if [ ! -d "$dir" ]; then
    echo "ERROR: $dir is not a directory"
    exit 1
  fi
  set_acl "$dir"
done
exit 0

High Performance Storage System (HPSS)

SciNet also operates the High Performance Storage System (HPSS), intended as longer-term, tape-backed storage; see the HPSS page for details.

More questions on data management?

Check out the FAQ.