Data Management
Storage Space
SciNet's storage system is based on IBM's GPFS (General Parallel File System). There are two main systems for user data: /home, a small, backed-up space where user home directories are located, and /scratch, a large system for input and output data for jobs. Data on /scratch is not backed up, and will be deleted if it has not been accessed in 3 months. (A third storage system, /project, exists only for groups with LRAC/NRAC allocations.) SciNet does not provide long-term storage for large data sets.
Home Disk Space
Every SciNet user gets a 10GB directory on /home which is regularly backed up. Home is visible from the login.scinet nodes and from the development nodes on the GPC and TCS. However, on the compute nodes of the GPC cluster -- that is, while jobs are running -- /home is mounted read-only; thus GPC jobs can read files in /home but cannot write to them. /home is a good place to put code, input files for runs, and anything else that needs to be kept to reproduce runs. On the other hand, /home is not a good place to put many small files, since the block size for the file system is 4MB; you would quickly run out of disk quota and would make the backup system very slow.
Scratch Disk Space
Every SciNet user also gets a directory in /scratch. Scratch is visible from the login.scinet nodes, the development nodes on GPC and the TCS, and on the compute nodes of the clusters, mounted as read-write. Thus jobs would normally write their output somewhere in /scratch. There are NO backups of anything on /scratch.
There is a large amount of space available on /scratch, but it is purged routinely so that all users running jobs and generating large outputs will have room to store their data temporarily. Computational results which you want to keep longer must be copied (using scp) off of SciNet entirely, e.g. to your local system. SciNet does not routinely provide long-term storage for large data sets.
Scratch Disk Purging Policy
In order to ensure that there is always significant space available for running jobs, we automatically delete files in /scratch that have not been accessed in 3 months. This policy is subject to revision depending on its effectiveness. More details about the purging process, and how users can check if their files will be deleted, follow below. If you have files scheduled for deletion you should move them to a more permanent location, such as your departmental server or your /project space (for PIs who have either been allocated disk space by the LRAC or have bought disk space).
On the first of each month, a list of files scheduled for deletion will be produced, and an email notification will be sent to each user on that list (starting on Jul/2010). Those files will be automatically deleted on the 15th of the same month unless they have been accessed or relocated in the interim. If you have files scheduled for deletion then they will be listed in a file in /scratch/todelete/current, which has your userid and groupid in the filename. For example, if user xxyz wants to check if they have files scheduled for deletion they can issue the following command on a system which mounts /scratch (e.g. a scinet login node): ls -l1 /scratch/todelete/current |grep xxyz. In the example below, the name of this file indicates that user xxyz is part of group abc, has 9,560 files scheduled for deletion and they take up 1.0TB of space:
[xxyz@scinet04 ~]$ ls -l1 /scratch/todelete/current |grep xxyz -rw-r----- 1 xxyz root 1733059 Jan 12 11:46 10001___xxyz_______abc_________1.00TB_____9560files
The file itself contains a list of all files scheduled for deletion (in the last column) and can be viewed with standard commands like more/less/cat - e.g. more /scratch/todelete/current/10001___xxyz_______abc_________1.00TB_____9560files
Similarly, you can check on the other users in your group by using the ls command with grep on your group. For example: ls -l1 /scratch/todelete/current |grep abc. This will list all users in the same group as xxyz who have files scheduled to be purged on the 15th. Members of the same group have access to each other's lists.
Furthermore, starting in Jul/2010 we are running a second scan on or about the 12th of each month to produce a more current assessment, so that users can double-check that they have indeed taken care of all the files they needed to relocate before the purging deadline.
Project Disk Space
Investigators who have been granted allocations through the LRAC/NRAC Application Process may have been allocated disk space in addition to compute time. For the period of time that the allocation is granted, they will have disk space on the /project disk system. Space on the project system is not purged, but neither is it backed up. All members of the investigator's group will have access to this system, which is mounted read/write everywhere.
How much Disk Space Do I have left?
The /scinet/gpc/bin/diskUsage [-a] command, available on the login nodes and the GPC devel nodes, reports how much disk space is being used by yourself and your group (with the -a option) on the home, scratch, and project file systems, and how much remains available. This information is updated hourly.
mmlsquota will show your quotas on the various filesystems. You must use mmlsquota -g <groupname> to check your group quota on /project.
Performance
GPFS is a high-performance filesystem which provides rapid reads and writes to large datasets in parallel from many nodes. As a consequence of this design, however, the file system performs quite poorly at accessing data sets which consist of many small files. For instance, you will find that reading data in from one 4MB file is enormously faster than from 100 40KB files. Such small files are also quite wasteful of space, as the block size for the filesystem is 4MB. This is something you should keep in mind when planning your input/output strategy for runs on SciNet.
For instance, if you run multi-process jobs, having each process write to a file of its own is not a scalable I/O solution. A directory gets locked by the first process accessing it, so all other processes have to wait for it. Not only does this make the code considerably less parallel, but chances are the file system will time out while waiting for your other processes, causing your program to crash mysteriously. Consider using MPI-IO (part of the MPI-2 standard), which allows files to be opened simultaneously by different processes, or using a dedicated I/O process to which all other processes send their data, and which subsequently writes this data to a single file.
Local Disk
The compute nodes on the GPC do not contain hard drives, so there is no local disk available to use during your computation. You can, however, use part of a compute node's RAM like a local disk (a 'ramdisk'), though this will reduce how much memory is available for your program. The ramdisk can be accessed at /dev/shm/ and is currently set to 8GB. Anything written to this location that you want to keep must be copied back to the /scratch filesystem, as /dev/shm is wiped after each job and, since it is in memory, will not survive a reboot of the node. More on ramdisk usage can be found here.
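To illustrate, a job-script fragment using the ramdisk might look like the sketch below. All the directory names and the $SCRATCH variable are illustrative assumptions, not SciNet-specific conventions (for this demo, $SCRATCH falls back to a temporary directory if unset):

```shell
RAMDIR="/dev/shm/$USER-job"                 # per-job working directory in the ramdisk
OUTDIR="${SCRATCH:-$(mktemp -d)}/results"   # where results should be kept on /scratch
mkdir -p "$RAMDIR" "$OUTDIR"

# ... run your program with its fast working files under $RAMDIR ...
echo "simulation output" > "$RAMDIR/out.dat"   # stand-in for your program's real output

# Copy anything worth keeping back before the job ends:
# /dev/shm is wiped after each job and does not survive a reboot.
cp "$RAMDIR"/*.dat "$OUTDIR"/
rm -rf "$RAMDIR"   # free the ramdisk memory
```

Remember that whatever lives in $RAMDIR counts against the node's RAM, so size your working set accordingly.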
Data Transfer
General guidelines
All traffic to and from the data centre has to go via SSH, or secure shell. This is a protocol which sets up a secure connection between two sites. In all cases, incoming connections to SciNet go through relatively low-speed connections to the login.scinet gateways, but there are many ways to copy files on top of the ssh protocol.
What node to use for data transfer to and from SciNet depends mostly on the amount of data to transfer:
Moving <10GB through the login nodes
The login nodes are accessible from outside SciNet, which means that you can transfer data between your own office/home machine and SciNet using scp or rsync (see below). However, the login nodes impose a cpu_time limit of 5 minutes (cpu_time, not wall_time), so a transfer of more than about 10GB is unlikely to succeed. While the login nodes can be used for transfers of less than 10GB, using a datamover node would still be faster.
Note that transfers through a login node will time out after a certain time (currently 5 minutes of cpu_time), so if you have a slow connection you may need to go through datamover1.
Moving >10GB through the datamover1 node
Serious moves of data (>10GB) to or from SciNet should be done from the datamover1 or datamover2 nodes. From any of the interactive SciNet nodes, one should be able to ssh datamover1 or ssh datamover2 to log in. These are the machines with the fastest network connections to the outside world (by a factor of 10; a 10Gb/s link vs 1Gb/s).
Transfers must be initiated from datamover1 or datamover2; that is, one cannot copy files from the outside world directly to or from the datamovers. Instead, one has to log in to a datamover and copy the data to or from the outside network from there. Your local machine must therefore be reachable from the outside, either by its name or its IP address. If you are behind a firewall or a (wireless) router, this may not be possible; you may need to ask your network administrator to allow the datamovers to ssh to your machine. If you need to open a hole in your firewall, these are their IPs:
datamover1: 142.150.188.121
datamover2: 142.150.188.122
Hpn-ssh
The usual ssh protocols were not designed for speed. On the datamover1 or datamover2 nodes, we have installed hpn-ssh, or High-Performance-enabled ssh. You use this higher-performance ssh/scp/sftp variant by loading the `hpnssh' module. Hpn-ssh is backwards compatible with the `usual' ssh, but is capable of significantly higher speeds. If you routinely have large data transfers to do, we recommend having your system administrator look into installing hpn-ssh on your system.
Everything we discuss below, unless otherwise stated, will work regardless of whether you have hpn-ssh installed on your remote system.
For Microsoft Windows users
Linux-to-Windows transfers can be a bit more involved than Linux-to-Linux, but using Cygwin this should not be a problem. Make sure you install Cygwin with the openssh package.
If you want to remain 100% in a Windows environment, another very good tool is WinSCP. It will let you easily transfer and synchronize data between your Windows workstation and the login nodes using your ssh credentials (provided you're not moving much more than 10GB on each sync pass).
If you are going to use the datamover1 method, and assuming your machine is not a wireless laptop (if it is, it is best to find a nearby wired computer and use a USB memory stick), you'll need the IP address of your machine, which you can find by typing "ipconfig /all" on your local Windows machine. You will also need the ssh daemon (sshd) running locally in Cygwin.
Also note that your Windows user name does not have to be the same as on SciNet; this just depends on how your local Windows system was set up.
All locations given to scp or rsync in Cygwin have to be in unix format (using "/", not "\"), and will be relative to Cygwin's path, not Windows' (e.g. use /cygdrive/c/... to get to the Windows C: drive).
Ways to transfer data
Globus data transfer
Globus is a file-transfer service with an easy-to-use web interface that allows people to transfer files with ease. To get started, please sign up for a Globus account at the Globus website. Once you have an account, go to this page to start the file transfer. Enter computecanada#niagara as one endpoint for the transfer. If you are trying to transfer data from a laptop or desktop, you will need to install the Globus Connect Personal software available here to set up an endpoint for that machine and perform the transfer.
Please see the following page on how to set up Globus to perform data transfers.
scp
scp, or secure copy, is the easiest way to copy files, although we generally find rsync below to be faster.
scp works like cp to copy files:
$ scp original_file copy_file
except that either the original or the copy can be on another system:
$ scp jonsdatafile.bin jon@remote.system.com:/home/jon/bigdatadir/
will copy the data file into the directory /home/jon/bigdatadir/ on remote.system.com after logging in as jon; you will be prompted for a password (unless you've set up ssh keys).
Copying from remote systems works the same way:
$ scp jon@remote.system.com:/home/jon/bigdatadir/newdata.bin .
And wildcards work as you'd expect (except that you have to quote wildcards meant for the remote system, so that your local shell does not try to expand them):
$ scp *.bin jon@remote.system.com:/home/jon/bigdatadir/
$ scp jon@remote.system.com:"/home/jon/inputdata/*" .
There are a few options worth knowing about:
- scp -C compresses the file before transmitting it; if the file compresses well, this can significantly increase the effective data transfer rate (though usually not as much as compressing the data, sending it, then uncompressing it). If the file doesn't compress well, then this adds CPU overhead without accomplishing much, and can slow down your data transfer.
- scp -oNoneEnabled=yes -oNoneSwitch=yes -- This is an hpn-ssh-only option. If CPU overhead is a significant bottleneck in the data transfer, it can be avoided by turning off the secure encryption of the data. For most data this is fine, but for sensitive data it is not. In either case, authentication remains secure; it is only the data transfer that is in plaintext.
rsync
rsync is a very powerful tool for mirroring directories of data.
$ rsync -av -e ssh scinetdatadir jon@remote.system.com:/home/jon/bigdatadir/
rsync has a dizzying number of options; the above syncs scinetdatadir to the remote system; that is, any files that are newer on the local system are updated on the remote system. The converse isn't true; if there were newer files on the remote system, you'd have to bring those over with
$ rsync -av -e ssh jon@remote.system.com:/home/jon/bigdatadir/ scinetdatadir
The -av options are for verbose and `archive' mode, which preserves timestamps and permissions, which is normally what you want. -e ssh tells it to use ssh for the transfer.
One of the powerful things about rsync is that it looks to see what files already exist before copying, so you can use it repeatedly as a data directory fills and it won't make unnecessary copies; similarly, if a (say) log file grows over time, it will only copy the difference between the files, further speeding things up. This also means that it behaves well if a transfer is interrupted; a second invocation of rsync will continue where the other left off.
As with scp -C, rsync -z compresses on the fly, which can significantly enhance effective data transfer rates if the files compress well, or hurt it if not.
As with scp, if both sides are running hpn-ssh one can disable encryption of the data stream should that prove to be a bottleneck:
$ rsync -av -e "ssh -oNoneEnabled=yes -oNoneSwitch=yes" jon@remote.system.com:/home/jon/bigdatadir/ scinetdatadir
SciNet's login nodes, 142.150.188.5[1-4], are publicly accessible and can be used for data transfer as long as your material does not come in big chunks (much more than 2-3GB per file). There is a 5-minute CPU time limit on the login nodes, and the transfer process may be killed by the kernel before completion. The workaround is to transfer your data using an rsync loop that checks the rsync return code, assuming at least some files can be transferred before the CPU limit is reached. For example, in a bash shell:
for i in {1..100}; do   ### try up to 100 times
    rsync ...
    [ "$?" == "0" ] && break
done
ssh tunnel
Alternatively you may use a reverse ssh tunnel (ssh -R).
If your transfer is above 10GB you will need to use one of SciNet's datamovers. If your workstation is behind a firewall (as the datamovers are), you'll need a node external to your firewall, on the edge of your organization's network, that will serve as a gateway, and can be accessible via ssh by both your workstation and the datamovers. Initiate a "ssh -R" connection from SciNet's datamover to that node. This node needs to have its ssh GatewayPorts enabled so that your workstation can connect to the specified port on that node, which will forward the traffic back to SciNet's datamover.
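The commands involved might look like the sketch below. Because they depend on hosts in your own network, the script only assembles and prints them rather than executing them; every hostname, user name, and the port number 2222 are placeholders for your own setup:

```shell
GATEWAY="user@gateway.example.edu"   # assumed edge node reachable by both sides
PORT=2222                            # assumed forwarded port on the gateway

# 1. Run on a SciNet datamover: forward $PORT on the gateway back to the
#    datamover's own sshd (the gateway's sshd needs GatewayPorts enabled).
TUNNEL_CMD="ssh -R $PORT:localhost:22 $GATEWAY"

# 2. Run on your workstation: copy data via the forwarded port, which
#    lands on the datamover behind SciNet's firewall.
COPY_CMD="scp -P $PORT bigfile.dat scinetuser@gateway.example.edu:/scratch/scinetuser/"

echo "$TUNNEL_CMD"
echo "$COPY_CMD"
```

The tunnel must stay open for the duration of the copy, so run the first command in its own terminal (or under screen) on the datamover.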
Transfer speeds
What transfer speeds could I expect?
Below are some typical transfer numbers from datamover1 to another University of Toronto machine with a 1Gb/s link to the campus network:
Mode | With hpn-ssh | Without |
---|---|---|
rsync | 60-80 MB/s | 30-40 MB/s |
scp | 50 MB/s | 25 MB/s |
What can slow down my data transfer?
To move data quickly, all of the stages in the process have to be fast: the file system you are reading data from, the CPU reading the data, the network connection between the sender and the receiver, and the recipient's CPU and disk. The slowest element in that chain will slow down the entire transfer.
On SciNet's side, our underlying filesystem is the high-performance GPFS system, and the node we recommend you use (datamover1) has a high-speed connection to the network and fast CPUs.
Why are my transfers so much slower?
If you get numbers significantly lower than the above, then there is a bottleneck in the transfer. The first thing to do is to run top on datamover1; if other people are transferring large files at the same time as you, network congestion could result and you'll just have to wait until they are done.
If nothing else is going on on datamover1, there are a number of possibilities:
- the network connection between SciNet and your machine - do you know the network connection of your remote machine? Are your system's connections tuned for performance [1]?
- is the remote server busy?
- are the remote server's disks busy, or known to be slow?
For any further questions, contact us at Support @ SciNet
File/Ownership Management (ACL)
- By default, at SciNet, users within the same group have read permission to each other's files (not write)
- You may use access control list (ACL) to allow your supervisor (or another user within your group) to manage files for you (i.e., create, move, rename, delete), while still retaining your access as the original owner of the files/directories.
- For example, to allow [supervisor] to manage files in /project/group/[owner], issue the following commands as the [owner] account from a shell:
$ getfacl /project/group/[owner]
  (to determine the current ACL attributes)
$ setfacl -d -m user:[supervisor]:rwx /project/group/[owner]
  (every *new* file/directory inside [owner] will inherit rwx access for [supervisor] by default from now on)
$ setfacl -d -m user:[owner]:rwx /project/group/[owner]
  (but will also inherit rwx access for [owner], i.e., default access for both)
$ setfacl -Rm user:[supervisor]:rwx /project/group/[owner]
  (recursively modify all *existing* files/directories inside [owner] to also be rwx by [supervisor])
For more information on using getfacl and setfacl, see their man pages.
If you need to set up permissions across groups, contact us (and the other group's supervisor!).
Hierarchical Storage Management (HSM)
(a pilot project is starting in July/2010 with a group of selected users)
Basic Concept
Hierarchical Storage Management (HSM) is a data storage technique which automatically moves data between high-cost and low-cost storage media. HSM systems exist because high-speed storage devices, such as hard disk drive arrays, are more expensive (per byte stored) than slower devices, such as optical discs and magnetic tape drives. While it would be ideal to have all data available on high-speed devices all the time, this is prohibitively expensive for many organizations. Instead, HSM systems store the bulk of the enterprise's data on slower devices, and then copy data to faster disk drives when needed. In effect, HSM turns the fast disk drives into caches for the slower mass storage devices. The HSM system monitors the way data is used and makes best guesses as to which data can safely be moved to slower devices and which data should stay on the fast devices.
In a typical HSM scenario, data files which are frequently used are stored on disk drives, but are eventually migrated to tape if they are not used for a certain period of time, typically a few months. If a user does reuse a file which is on tape, it is automatically moved back to disk storage. The advantage is that the total amount of stored data can be much larger than the capacity of the disk storage available, but since only rarely-used files are on tape, most users will usually not notice any slowdown.
A repository commonly refers to a location for storing data, often for safety or preservation.
Deployment at SciNet
At SciNet, HSM is performed by dedicated IBM software made up of a number of HSM daemons running on datamover1. These daemons constantly monitor the usage of the /repository GPFS filesystem and, depending on a predefined set of policies, data may be automatically or manually migrated to the Tivoli Storage Management server (TSM) and kept on our library of LTO-4 tapes.
/repository is a 7.2TB "transient" location accessible only from datamover1. Users may migrate data as required from scratch or project to repository with copy, move, or rsync commands. In the background, transfers out of repository to tape can happen even while you migrate data in from other locations, but we ask that you please never migrate more than 10TB at a time, so as to allow the system to process that data before the filesystem reaches 100% full.
Performance: unlike /project or /scratch, /repository is for now only a single-tier disk RAID, so don't expect rates of much more than 50MB/s. In other words, a 10TB offload operation will typically take over 2 days to complete. Please be sure to contact us to schedule your transfers in or out of the system, so as to avoid conflicts.