Data Management

From oldwiki.scinet.utoronto.ca
Jump to navigation Jump to search

Storage Space

SciNet's storage system is based on IBM's GPFS (General Parallel File System). There are two main systems for user data: /home, a small, backed-up space where user home directories are located, and /scratch, a large system for input or output data for jobs; data on /scratch is not only not backed up (a third storage system, /project, exist only for groups with LRAC/NRAC allocations). Data placed on scratch will be deleted if it has not been accessed in 3 months. SciNet does not provide long-term storage for large data sets.


Home Disk Space

Every SciNet user gets a 10GB directory on /home which is regularly backed-up. Home is visible from login.scinet nodes, and from the development nodes on GPC and the TCS. However, on the compute nodes of the GPC clusters -- as when jobs are running -- /home is mounted read-only; thus GPC jobs can read files in /home but cannot write to files there. /home is a good place to put code, input files for runs, and anything else that needs to be kept to reproduce runs. On the other hand, /home is not a good place to put many small files, since the block size for the file system is 256KB, so you would quickly run out of disk quota and you will make the backup system very slow.

If your application absolutely insists on writing material to your home account and you can't find a way to instruct it to write somewhere else, an alternative is to create a link pointing from your account under /home to a location under /scratch.

Scratch Disk Space

Every SciNet user also gets a directory in /scratch. Scratch is visible from the login.scinet nodes, the development nodes on GPC and the TCS, and on the compute nodes of the clusters, mounted as read-write. Thus jobs would normally write their output somewhere in /scratch. There are NO backups of anything on /scratch.

There is a large amount of space available on /scratch but it is purged routinely so that all users running jobs and generating large outputs will have room to store their data temporarily. Computational results which you want to keep longer than this must be copied (using scp) off of SciNet entirely and to your local system. SciNet does not routinely provide long-term storage for large data sets.

Scratch Disk Purging Policy

In order to ensure that there is always significant space available for running jobs we automatically delete files in /scratch that have not been accessed in 3 months. This policy is subject to revision depending on its effectiveness. More details about the purging process and how users can check if their files will be deleted follows. If you have files scheduled for deletion you should move them to a more permanent locations such as your departmental server or your /project space (for PIs who have either been allocated disk space by the LRAC or have bought diskspace).

On the first of each month, a list of files scheduled for deletion will be produced, and an email notification will be sent to each user on that list (starting on Jul/2010). Those files will be automatically deleted on the 15th of the same month unless they have been accessed or relocated in the interim. If you have files scheduled for deletion then they will be listed in a file in /scratch/todelete/current, which has your userid and groupid in the filename. For example, if user xxyz wants to check if they have files scheduled for deletion they can issue the following command on a system which mounts /scratch (e.g. a scinet login node): ls -l1 /scratch/todelete/current |grep xxyz. In the example below, the name of this file indicates that user xxyz is part of group abc, has 9,560 files scheduled for deletion and they take up 1.0TB of space:

[xxyz@scinet04 ~]$ ls -l1 /scratch/todelete/current |grep xxyz
-rw-r----- 1 xxyz     root       1733059 Jan 12 11:46 10001___xxyz_______abc_________1.00TB_____9560files

The file itself contains a list of all files scheduled for deletion (in the last column) and can be viewed with standard commands like more/less/cat - e.g. more /scratch/todelete/current/10001___xxyz_______abc_________1.00TB_____9560files

Similarly, you can also verify all other users in your group by using the ls command with grep on your group. For example: ls -l1 /scratch/todelete/current |grep abc. That will list all other users in the same group that xxyz is part of, and have files to be purged on the 15th. Members of the same group have access to each other's contents.

Furthermore, starting in Jul/2010 we are running a 2nd scan at/or about the 12th of each month to produce a more current assessment, so that users can double check that they have indeed taken care of all the files they needed to relocate before the purging deadline.

Project Disk Space

Investigators who have been granted allocations through the LRAC/NRAC Application Process may have been allocated disk space in addition to compute time. For the period of time that the allocation is granted, they will have disk space on the /project disk system. Space on the project systems are not purged, but neither are they backed up. All members of the investigators groups will have access to these systems, which will be mounted read/write everywhere.

How much Disk Space Do I have left?

The /scinet/gpc/bin/diskUsage [-a] command, available on the login nodes and the GPC devel nodes, reports how much disk space is being used by yourself and your group (with the -a option) on the home, scratch, and project file systems, and how much remains available. This information is updated hourly.

mmlsquota will show your quotas on the various filesystems. You must use mmlsquota -g <groupname> to check your group quota on /project.

Performance

GPFS is a high-performance filesystem which provides rapid reads and writes to large datasets in parallel from many nodes. As a consequence of this design, however, the file system performs quite poorly at accessing data sets which consist of many, small files. For instance, you will find that reading data in from one 4MB file is enormously faster than from 100 40KB files. Such small files are also quite wasteful of space, as the blocksize for the filesystem is 4MB. This is something you should keep in mind when planning your input/output strategy for runs on SciNet.

For instance, if you run multi-process jobs, having each process write to a file of its own is not an scalable I/O solution. A directory gets locked by the first process accessing it, so all other processes have to wait for it. Not only has the code just become considerably less parallel, chances are the file system will have a time-out while waiting for your other processes, leading your program to crash mysteriously. Consider using MPI-IO (part of the MPI-2 standard), which allows files to be opened simultaneously by different processes, or using a dedicated process for I/O to which all other processes send their data, and which subsequently writes this data to a single file.

Local Disk

The compute nodes on the GPC do not contain hard drives so there is no local disk available to use during your computation. You can however use part of a compute nodes RAM like a local disk ('ramdisk') but this will reduce how much memory is available for your program. This can be accessed using /dev/shm/ and is currently set to 8GB. Anything written to this location that you want to keep must be copied back to the /scratch filesystem as /dev/shm is wiped after each job and since it is in memory will not survive through a reboot of the node. More on ramdisk usage can be found here.


Data Transfer

General guidelines

All traffic to and from the data centre has to go via SSH, or secure shell. This is a protocol which sets up a secure connection between two sites. In all cases, incoming connections to SciNet go through relatively low-speed connections to the login.scinet gateways, but there are many ways to copy files on top of the ssh protocol.

What node to use for data transfer to and from SciNet depends mostly on the amount of data to transfer:

Moving <10GB through the login nodes

The login nodes are accessible from outside SciNet, which means that you can transfer data between your own office/home machine and SciNet using scp or rsync (see below). Although the login nodes have a cpu_time timeout of 5 minutes (emphasis on cpu_time, not wall_time), most likely if you try to transfer more than 10GB probably you wouldn't succeed. While the login nodes can be used for transfers of less than 10GB, using a datamover node would still be faster.

Note that transfers through a login node will timeout after a certain time (currently set to 5 minutes cpu_time), so if you have a slow connection you may need to go through datamover1.

Moving >10GB through the datamover1 node

Serious moves of data (>10GB) to or from SciNet should be done from datamover1 or datamover2 nodes. From any of the interactive SciNet nodes, one should be able to ssh datamover1 or ssh datmover2 to log in. Those are the machines that have the fastest network connections to the outside world (by a factor of 10; a 10Gb/s link as vs 1Gb/s).

Transfers must be originated from datamover1 or datamover2; that is, one can not copy files from the outside world directly to or from the datamovers; one has to log in to a datamover and copy the data to or from the outside network. Your local machine must be reachable from the outside as well, either by its name or its IP address. If you are behind a firewall or a (wireless) router, this may not be possible. You may need to ask your network administrator to allow the datamovers to ssh to your machine. If you need to open a hole on your firewall we provide their IPs:

datamover1 142.150.188.121
datamover2 142.150.188.122

Hpn-ssh

The usual ssh protocols were not designed for speed. On the datamover1 or datamover2 nodes, we have installed hpn-ssh, or High-Performance-enabled ssh. You use this higher-performance ssh/scp/sftp variant by loading the `hpnssh' module. Hpn-ssh is backwards compatible with the `usual' ssh, but is capable of significantly higher speeds. If you routinely have large data transfers to do, we recommend having your system administrator look into installing hpn-ssh on your system.

Everything we discuss below, unless otherwise stated, will work regardless of whether you have hpn-ssh installed on your remote system.

For Microsoft Windows users

Linux-windows transfers can be a bit more involved than linux-to-linux, but using Cygwin, this should not be a problem. Make sure you install Cygwin with the openssh package.

If you want to remain 100% a Windows environment, another very good tool is WinSCP. It will let you easily transfer and synchronize data between your Windows workstation and the login nodes using your ssh credentials (provided that it's not much more than 10GB on each sync pass).

If you are going to use the datamover1 method, and assuming your machine is not a wireless laptop (if it is, best to find a nearby computer that's not wireless and use a usb memory stick), you'll need the IP address of your machine, which you find by typing "ipconfig /all" on your local windows machine. Also, you will need to have the ssh daemon (sshd) running locally in Cygwin.

Also note that your windows user name does not have to be the same as on SciNet, this just depends on how your local windows system was set up.

All locations given to scp or rsync in cygwin have to be in unix format (using "/" not "\"), and will be relative to cygwin's path, not windows (e.g. use /cygdrive/c/...... to get to the windows C: drive).

Ways to transfer data

Globus data transfer

Globus is a file-transfer service with an easy-to-use web interface to allow people to transfer file with ease. To get started, please sign up for a Globus account at Globus website. Once you sign up for an account, go to this page to start the file transfer. Please enter computecanada#niagara as one endpoint for file transfer. If you are trying to transfer data from a laptop or desktop, you will need to install Globus Connect Personal software available here to setup an endpoint for the laptop or desktop and perform the transfer.

Please see the following page on how to setup Globus to perform data transfer.

scp

scp, or secure copy, is the easiest way to copy files, although we generally find rsync below to be faster.

scp works like cp to copy files:

$ scp original_file  copy_file

except that either the original or the copy can be on another system:

$ scp jonsdatafile.bin jon@remote.system.com:/home/jon/bigdatadir/

will copy the data file into the directory /home/jon/bigdatadir/ on remote.system.com after logging in as jon; you will be prompted for a password (unless you've set up ssh keys).

Copying from remote systems works the same way:

$ scp jon@remote.system.com:/home/jon/bigdatadir/newdata.bin .

And wildcards work as you'd expect (except you have to quote the wildcards on the remote system, as it can't expand properly here.)

$ scp *.bin jon@remote.system.com:/home/jon/bigdatadir/
$ scp jon@remote.system.com:"/home/jon/inputdata/*" .

There are few options worth knowing about:

  • scp -C compresses the file before transmitting it; if the file compresses well, this can significantly increase the effective data transfer rate (though usually not as much as compressing the data, then sending it, then uncompressing). If the file doesn't compress well, than this adds CPU overhead without accomplishing much, and can slow down your data transfer.
  • scp -oNoneEnabled=yes -oNoneSwitch=yes -- This is an hpn-ssh only option. If CPU overhead is a significant bottleneck in the data transfer, then we can avoid this by turning off the secure encryption of the data. For most of us, this is ok, but for others it is not. In either cases, authentication remains secure, it is only the data transfer that is in plaintext.

rsync

rsync is a very powerful tool for mirroring directories of data.

$ rsync -av -e ssh scinetdatadir jon@remote.system.com:/home/jon/bigdatadir/

rsync has a dizzying number of options; the above syncs scinetdatadir to the remote system; that is, any files that are newer on the localsystem are updated on the remote system. The converse isn't true; if there were newer files on the remote system, you'd have to bring those over with

$ rsync -av -e ssh jon@remote.system.com:/home/jon/bigdatadir/ scinetdatadir 

The -av options are for verbose and `archive' mode, which preserves timestamps and permissions, which is normally what you want. -e ssh tells it to use ssh for the transfer.

One of the powerful things about rsync is that it looks to see what files already exist before copying, so you can use it repeatedly as a data directory fills and it won't make unnecessary copies; similarly, if a (say) log file grows over time, it will only copy the difference between the files, further speeding things up. This also means that it behaves well if a transfer is interrupted; a second invocation of rsync will continue where the other left off.

As with scp -C, rsync -z compresses on the fly, which can significantly enhance effective data transfer rates if the files compress well, or hurt it if not.

As with scp, if both sides are running hpn-ssh one can disable encryption of the data stream should that prove to be a bottleneck:

$ rsync -av -e "ssh -oNoneEnabled=yes -oNoneSwitch=yes" jon@remote.system.com:/home/jon/bigdatadir/ scinetdatadir

SciNet's login nodes, 142.150.188.5[1-4], are publicly accessible and can be used for data transfer as long as your material is not one big chunk (much more than 2-3GB each file). We have a 5 minutes CPU time limit on the login nodes and the transfer process may be killed by the kernel before completion. The workaround is to transfer your data using a rsync loop, by checking the rsync return code, assuming some files can be transferred before reaching the CPU limit. For example in a bash shell:

  for i in {1..100}; do   ### try 100 times
    rsync ...
    [ "$?" == "0" ] && break
  done

ssh tunnel

Alternatively you may use a reverse ssh tunnel (ssh -R).

If your transfer is above 10GB you will need to use one of SciNet's datamovers. If your workstation is behind a firewall (as the datamovers are), you'll need a node external to your firewall, on the edge of your organization's network, that will serve as a gateway, and can be accessible via ssh by both your workstation and the datamovers. Initiate a "ssh -R" connection from SciNet's datamover to that node. This node needs to have its ssh GatewayPorts enabled so that your workstation can connect to the specified port on that node, which will forward the traffic back to SciNet's datamover.

Transfer speeds

What transfer speeds could I expect?

Below are some typical transfer numbers from datamover1 to another University of Toronto machine with a 1Gb/s link to the campus network:

Mode With hpn-ssh Without
rsync 60-80 MB/s 30-40 MB/s
scp 50 MB/s 25 MB/s

What can slow down my data transfer?

To move data quickly, all of the stages in the process have to be fast; the file system you are reading data from, the CPU reading the data, the network connection between the sender and the reciever, and the recipient CPU and disk. The slowest element in that chain will slow down the entire transfer.

On SciNet's side, our underlying filesystem is the high-performance GPFS system, and the node we recommend you use (datamover1) has a high-speed connection to the network and fast CPUs.

Why are my transfers so much slower?

If you get numbers significantly lower than above, then there is a bottleneck in the transfer. The first thing to do is to run top on datamover1; if other people are transfering large files at the same time you are trying to, network congestion could result and you'll just have to wait until they are done.

If nothing else is going on on datamover1, there are a number of possibilites:

  • network connection between SciNet and your machine - do you know the network connection of your remote machine? Are your systems connections tuned for performance [1]?
  • is the remote server busy?
  • are the remote servers disks busy, or known to be slow?

For any further questions, contact us at Support @ SciNet


File/Ownership Management (ACL)

  • By default, at SciNet, users within the same group have read permission to each other's files (not write)
  • You may use access control list (ACL) to allow your supervisor (or another user within your group) to manage files for you (i.e., create, move, rename, delete), while still retaining your access as the original owner of the files/directories.
  • For example, to allow [supervisor] to manage files in /project/group/[owner], issue the following commands as the [owner] account from a shell:
$ getfacl /project/group/[owner]
(to determine the current ACL attributes)

$ setfacl -d -m user:[supervisor]:rwx /project/group/[owner]
(every *new* file/directory inside [owner] will inherit [supervisor] ownership by default from now on)

$ setfacl -d -m user:[owner]:rwx /project/group/[owner]
(but will also inherit [owner] ownership, ie, ownership of both by default)

$ setfacl -Rm user:[supervisor]:rwx /project/group/[owner]
(recursively modify all *existing* files/directories inside [owner] to also be rwx by [supervisor])

For more information on using getfacl and setfacl, see their man pages.

If you need to set up permissions across groups, contact us (and the other group's supervisor!).


Hierarchical Storage Management (HSM)

(a pilot project is starting in July/2010 with a select group of users)

Basic Concepts

Hierarchical Storage Management (HSM) is a data storage technique which automatically moves data between high-cost and low-cost storage media. HSM systems exist because high-speed storage devices, such as hard disk drive arrays, are more expensive (per byte stored) than slower devices, such as optical discs and magnetic tape drives. While it would be ideal to have all data available on high-speed devices all the time, this is prohibitively expensive for many organizations. Instead, HSM systems store the bulk of the enterprise's data on slower devices, and then copy data to faster disk drives when needed. In effect, HSM turns the fast disk drives into caches for the slower mass storage devices. The HSM system monitors the way data is used and makes best guesses as to which data can safely be moved to slower devices and which data should stay on the fast devices.

In a typical HSM scenario, data files which are frequently used are stored on disk drives, but are eventually migrated to tape if they are not used for a certain period of time, typically a few months. If a user does reuse a file which is on tape, it is automatically moved back to disk storage. The advantage is that the total amount of stored data can be much larger than the capacity of the disk storage available, but since only rarely-used files are on tape, most users will usually not notice any slowdown.

The HSM client provides both automatic and selective migration. Once file migration begins, the HSM client sends a copy of your file to storage volumes on disk devices or devices that support removable media, such as tape and replaces the original file with a stub file on HSM managed file system (aka repository at SciNet)

Repository commonly refers to a location for long-term storage, often for safety or preservation.

Migration, in the context of HSM, refers to set of actions that move files from the front-end repository to a back-end tape library system (often invisible or unaccessible to users)

The stub file is a small replacement file that makes it appear as though the original file is on the repository. It contains required information to locate and recall a migrated file and to respond to specific UNIX commands without recalling the file.

Automatic migration periodically monitors space usage and automatically migrates eligible files according to the options and settings that have been selected. The HSM client provides two types of automatic migration: threshold migration and demand migration.

Threshold migration maintains a specific level of free space on the repository file system. When disk usage reaches the high threshold percentage, eligible files are migrated to tapes automatically. When space usage drops to the low threshold set for the file system, file migration stops.

Demand migration responds to an out-of-space condition on the repository file system. Demand migration starts automatically if the file system runs out of space (usually triggered at 90%). For HSM, as files are migrated, space becomes available on the file system, and the process or event that caused the out-of-space condition can be resumed.

Selective migration often an user given command, that migrates specific files from the repository at will, in anticipation of the automatic migration, or independently of the system wide eligibility criteria. For example, if you know that you will not be using a particular group of files for an extended time, you can migrate them, so as to free additional space on the repository.

Reclamation is the process of reclaiming unused space on a tape (applies to Virtual Tapes as well). Over time, as files/directories get deleted on the file system, an expiration process will expire old data, creating gaps of unused storage on the tapes. Since tapes are sequential media, typical backup software can only write data to the end of the tape, so these gaps of “Empty Space” cannot be used. The process entails periodically and in a rolling fashion copying active data from the "Swiss Cheese" like tapes to unused tapes on a compacted form.

Optimal environment: HSM should be used in an environment where the old and large files which need to be migrated are not used regularly. Files that are needed frequently should not be migrated at all, otherwise HSM would act as a cache, by migrating files and shortly after the migration the same files will be recalled. This is not advisable. The file system needs to be large enough to hold all regularly used files. If the file system is too small and cannot hold all regularly needed files, HSM is permanently recalling requested files, getting beyond the high-threshold limit, migrating other files to get below the low-threshold limit and so on.

Deployment at SciNet

HSM is performed by a dedicated IBM software made up of a number of HSM daemons running on datamover2. These daemons constantly monitor the usage of the /repository GPFS and, depending on a predefined set of policies, data may be automatically or manually migrated to the Tivoli Storage Management server (TSM), and kept on our library of LTO-4 tapes.

/repository is a 7.2TB "transient" location accessible only from datamover2. Users may migrate data as required from scratch or project to repository in a number of ways, such as copy, move, tar or rsync. In the background, transfers out of repository to tapes can happen even while you migrate data in from other locations, but we ask that you please never initiate a migration of more than a 10TB chunk at a time, so as to allow the system to process that data before reaching 100% full.

Inside /repository data is segregated on a per group basis, just as in /project. Within groups, users and group supervisors can structure materials anyway the prefer. But the recommendation is that those involved spend some time designing that structure ahead of time, since you may merge data from project and/or scratch or even home. In tests we performed, we've been able to reorganize the FS structure after migration, change the name and ownership of directories and stubs, and still recall files under the new path and ownership. HSM does seem to keep a very symbiotic relation between the metadata and the inode attributes at the file system level, without necessarily having to replicate these changes with tape recall & migration operations. But please don't abuse this flexibility. If possible, keep your initial layout structure somewhat fixed over time.

We also recommend that users bundle files in tar-balls of at least 10GB, and keep the listing of those files in your home directories, so as to avoid the migration of many thousands (or millions) of small files. It's very demanding on the system to constantly scan/reconcile all these files on the file system, tapes, metadata and database. A good reference is average file size > 10MB. Deep directory nesting levels in general also increases the time required to traverse a file system and thus should be avoided where possible.

Performance: unlike /project or /scratch, /repository for now is only a 1 tier disk raid, so don't expect rates much beyond 50MB/sec. In another words, a 10TB offload operation will typically take 2 days to complete. Please be sure to contact us to schedule your transfers IN or OUT of the system, so as to avoid conflict with other users.

How to migrate/recall data

Automatic

We currently setup repository with High and Low thresholds of 3% and 1% respectively. That means, at regular intervals the file system is monitored to determine if the 3% usage mark has been reached or surpassed. In that case, data is automatically migrated to tapes, oldest (or largest) first, until the file system is down to 1%. Since data may be copied/moved/rsync'ed in faster than repository can be emptied, you may observe 80-90% disk usages at times (hence the 10TB chunk of data limit). This happens without the interference of the user.

To recall a file automatically all you have to do is access it. There are many ways you can do this. For example, you may view a file with 'cat', 'more', 'vi/vim', etc. You may also copy the file (or directory) from /repository to another location. Please be patient: the file will have to be pulled back from tape, and this will take some time, longer if it happens to be at the end of a tape.

Selective

To overwrite the internal priority of HSM, or to migrate files/directories immediately, type:

(files won't be migrated for at least 5 minutes after the last access/modification time)

dsmmigrate [path to FILE]
dsmmigrate -R -D /repository/[group]/[user]/[directory]
dsmmigrate /repository/scinet/pinto/blahblahblah.tar.Z
or
cd /repository/scinet/pinto/
dsmmigrate blahblahblah.tar.Z

To selectively recall data, just type:

dsmrecall [path to FILE]
dsmrecall -R -D /repository/[group]/[user]/[directory]
dsmrecall /repository/scinet/pinto/blahblahblah.tar.Z
or
cd /repository/scinet/pinto/
dsmrecall blahblahblah.tar.Z

Common HSM commands

Some traditional unix/linux commands, such as 'ls' or 'rm' for example, will work with the file stubs as the real files. But others, such as 'du' or 'df', you better use a HSM equivalent, which will give you more meaningful information in the context of HSM. They only work inside /repository. In a few cases, they will be executable only by root, such as 'dsmrm', in which case you'll be notified.

dsmls

to check status of files; used in the directory where you expect to have migrated files

r: resident (the file is on the file system)

m: migrated (only the stub of the file is on the file system)

p: premigrated (the file is on the file system and on tape)

Usage: dsmls [-Noheader] [-Recursive] [-Help] [file specs|-FIlelist=file]

dsmdu

disk usage on the original files/directory

Usage: dsmdu [-Allfiles] [-Summary] [-Help] [directory names]

dsmdf

disk free on the HSM file system.

Usage: dsmdf [-Help] [-Detail] [file systems]

dsmmigrate

Usage: dsmmigrate [-Recursive] [-Premigrate] [-Detail] [-Help] filespecs|-FIlelist=file 

dsmrecall

Usage: dsmrecall [-Recursive] [-Detail] [-Help] file specs|-FIlelist=file
   or  dsmrecall [-Detail] -offset=XXXX[kmgKMG] -size=XXXX[kmgKMG] file specs