Data Management

WARNING: SciNet is in the process of replacing this wiki with a new documentation site. For current information, please go to https://docs.scinet.utoronto.ca

Storage Space

SciNet's storage system is based on IBM's GPFS (General Parallel File System). There are two main systems for user data: /home, a small, backed-up space where user home directories are located, and /scratch, a large system for input and output data for jobs. Data on /scratch is not backed up, and files there are deleted if they have not been accessed in 3 months. (A third storage system, /project, exists only for groups with LRAC/NRAC allocations.) SciNet does not provide long-term storage for large data sets.

Overview of the different file systems

 file system   purpose       user quota                           block size   backed up   purged             access
 /home         development   50 GB                                256 KB       yes         never              read-only on compute nodes (r/w on login, devel and datamover1)
 /scratch      computation   first of (20 TB ; 1 million files)   4 MB         no          files > 3 months   read/write on all nodes
 /project      computation   by allocation                        256 KB       yes         never              read/write on all nodes

Note: /project is included in /scratch.

Home Disk Space

Every SciNet user gets 50GB on /home, in a directory /home/G/GROUP/USER, which is regularly backed up. Home is visible from the login.scinet nodes and from the development nodes of the GPC and the TCS. However, on the compute nodes of the GPC clusters (that is, while jobs are running) /home is mounted read-only; GPC jobs can read files in /home but cannot write to them. /home is a good place to put code, input files for runs, and anything else that needs to be kept to reproduce runs. On the other hand, /home is not a good place to put many small files: the block size of the file system is 256KB, so small files quickly eat into your disk quota and make the backup system very slow.

If your application absolutely insists on writing material to your home account and you can't find a way to instruct it to write somewhere else, an alternative is to create a link pointing from your account under /home to a location under /scratch.
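For example, a minimal sketch (run it from a login or devel node, since /home is read-only on the compute nodes; the "app_output" directory name is hypothetical, and G/GROUP/USER stand for your own group letter, group and user name as above):

 # create a writable target on scratch, then point a link at it from home
 mkdir -p /scratch/G/GROUP/USER/app_output
 ln -s /scratch/G/GROUP/USER/app_output /home/G/GROUP/USER/app_output

The application can keep writing to what it thinks is a directory under your home account, while the data actually lands on /scratch.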

Scratch Disk Space

Every SciNet user also gets a directory in /scratch, called /scratch/G/GROUP/USER. Scratch is visible from the login.scinet nodes, from the development nodes of the GPC and the TCS, and from the compute nodes of the clusters, where it is mounted read-write. Jobs therefore normally write their output somewhere in /scratch. There are NO backups of anything on /scratch.

There is a large amount of space available on /scratch, but it is purged routinely so that all users running jobs and generating large outputs have room to store their data temporarily. Computational results that you want to keep beyond the purge window must be copied (using scp) off SciNet entirely, for example to your local system. SciNet does not routinely provide long-term storage for large data sets.

Also note that the shared parallel file system was not designed to do many small file transactions. For that reason, the number of files that any user can have on scratch is limited to 1 million. This limit should be thought of as a safeguard, not an invitation to create one million files. Please see File System and I/O dos and don'ts.
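If you want a rough idea of how many files you currently have under your scratch directory, a standard find/wc pipeline is enough (a sketch; on a large tree this itself generates many metadata operations, so do not run it more often than necessary):

 # count regular files below your scratch directory
 find /scratch/G/GROUP/USER -type f | wc -l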

Scratch Disk Purging Policy

In order to ensure that there is always significant space available for running jobs, we automatically delete files in /scratch that have not been accessed or modified for more than 3 months; the actual deletion takes place on the 15th of each month. Note that the cut-off is now based on the most recent of the access and modification times, i.e. MostRecentOf(atime,mtime). This policy is subject to revision depending on its effectiveness. More details about the purging process, and how users can check whether their files will be deleted, follow below. If you have files scheduled for deletion you should move them to a more permanent location, such as your departmental server or your /project space (for PIs who have either been allocated disk space by the LRAC or have bought disk space).

On the first of each month, a list of files scheduled for purging is produced, and an email notification is sent to each user on that list. Furthermore, at or about the 12th of each month a second scan produces a more current assessment and another email notification is sent. This way users can double-check that they have indeed taken care of all the files they needed to relocate before the purging deadline. Those files will be automatically deleted on the 15th of the same month unless they have been accessed or relocated in the interim. If you have files scheduled for deletion they will be listed in a file in /scratch/t/todelete/current, which has your userid and groupid in the filename. For example, if user xxyz wants to check whether they have files scheduled for deletion, they can issue the following command on a system which mounts /scratch (e.g. a scinet login node): ls -1 /scratch/t/todelete/current |grep xxyz. In the example below, the name of this file indicates that user xxyz is part of group abc, has 9,560 files scheduled for deletion and that these take up 1.0TB of space:

 [xxyz@scinet04 ~]$ ls -1 /scratch/t/todelete/current |grep xxyz
 -rw-r----- 1 xxyz     root       1733059 Jan 12 11:46 10001___xxyz_______abc_________1.00T_____9560files

The file itself contains a list of all files scheduled for deletion (in the last column) and can be viewed with standard commands like more/less/cat - e.g. more /scratch/t/todelete/current/10001___xxyz_______abc_________1.00T_____9560files

Similarly, you can check on the other users in your group by using the ls command with grep on your group name. For example: ls -1 /scratch/t/todelete/current |grep abc. That will list all other users in the same group that xxyz is part of who have files to be purged on the 15th. Members of the same group have access to each other's contents.

NOTE: Preparing these assessments takes several hours. If you change the access/modification time of a file in the interim, that will not be detected until the next cycle. A way to get immediate feedback is to use 'ls -lu' on the file to check its atime and 'ls -l' for its mtime. If the file's atime or mtime has been updated in the meantime, it will no longer be deleted on the purging date of the 15th.
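For example, assuming a (hypothetical) file called results.dat, you could check its timestamps as follows; stat shows all the times at once:

 ls -lu results.dat     # shows the access time (atime)
 ls -l  results.dat     # shows the modification time (mtime)
 stat   results.dat     # shows access, modification and change times together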

Project Disk Space

Investigators who have been granted allocations through the LRAC/NRAC Application Process may have been allocated disk space in addition to compute time. For the period of time that the allocation is granted, they will have disk space on the /project disk system. Space on project is a subset of scratch, but it is not purged and it is backed up. All members of the investigator's group will have access to this disk system, which will be mounted read/write everywhere.

How much Disk Space Do I have left?

The /scinet/gpc/bin6/diskUsage command, available on the login nodes, datamovers and the GPC devel nodes, provides information in a number of ways on the home, scratch, and project file systems. For instance, it can report how much disk space is being used by yourself and your group (with the -a option), or how much your usage has changed over a certain period ("delta information"), or it can generate plots of your usage over time. Please see the usage help below for more details.

Usage: diskUsage [-h|-?| [-a] [-u <user>] [-de|-plot]
       -h|-?: help
       -a: list usages of all members on the group
       -u <user>: as another user on your group
       -de: include delta information
       -plot: create plots of disk usages

Did you know that you can check which of your directories have more than 1000 files with the /scinet/gpc/bin6/topUserDirOver1000list command and which have more than 1GB of material with the /scinet/gpc/bin6/topUserDirOver1GBlist command?
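A typical check might look like the sketch below; it is based on the usage text above and on the command names just mentioned, and the exact invocation and output format may differ:

 # usage for yourself and everyone in your group, including recent changes
 /scinet/gpc/bin6/diskUsage -a -de

 # directories with more than 1000 files, and directories with more than 1GB
 /scinet/gpc/bin6/topUserDirOver1000list
 /scinet/gpc/bin6/topUserDirOver1GBlist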

Notes:

  • information on usage and quota is only updated hourly!
  • contents of project count against space and #files limits on scratch

Performance

GPFS is a high-performance filesystem which provides rapid reads and writes to large datasets in parallel from many nodes. As a consequence of this design, however, the file system performs quite poorly at accessing data sets which consist of many small files. For instance, you will find that reading data in from one 16MB file is enormously faster than from 400 40KB files. Such small files are also quite wasteful of space, as the blocksize for the scratch filesystem is 16MB. This is something you should keep in mind when planning your input/output strategy for runs on SciNet.

For instance, if you run multi-process jobs, having each process write to a file of its own is not a scalable I/O solution. A directory gets locked by the first process accessing it, so all other processes have to wait for it. Not only has the code just become considerably less parallel, but chances are the file system will time out while waiting for your other processes, leading your program to crash mysteriously. Consider using MPI-IO (part of the MPI-2 standard), which allows files to be opened simultaneously by different processes, or using a dedicated process for I/O to which all other processes send their data, and which subsequently writes this data to a single file.
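On the shell side, if a run has already produced a large number of small files, bundling them into a single archive before storing or transferring them avoids most of the small-file penalty. A sketch (the run42 directory and the archive name are hypothetical):

 # pack a directory of small files into a single archive on scratch
 tar -czf /scratch/G/GROUP/USER/run42_files.tar.gz -C /scratch/G/GROUP/USER run42

 # later, list the contents without unpacking everything
 tar -tzf /scratch/G/GROUP/USER/run42_files.tar.gz | head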

Local Disk

The compute nodes on the GPC do not contain hard drives, so there is no local disk available to use during your computation. You can however use part of a compute node's RAM like a local disk ('ramdisk'), but this will reduce how much memory is available for your program. The ramdisk can be accessed via /dev/shm/ and is currently set to 8GB. Anything written to this location that you want to keep must be copied back to the /scratch filesystem, as /dev/shm is wiped after each job and, since it is in memory, will not survive a reboot of the node. More on ramdisk usage can be found here.
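A minimal sketch of using the ramdisk inside a job script (the input/output file names and the my_code executable are hypothetical; copying results back to /scratch before the job ends is essential, as explained above):

 # stage input into the ramdisk, run there, then save results back to scratch
 cp /scratch/G/GROUP/USER/input.dat /dev/shm/
 cd /dev/shm
 /home/G/GROUP/USER/my_code input.dat output.dat
 cp output.dat /scratch/G/GROUP/USER/
 rm -f /dev/shm/input.dat /dev/shm/output.dat   # /dev/shm is wiped after the job anyway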

Note that the absence of hard drives also means that the nodes cannot swap memory, so be sure that your computation fits within memory.

Buying storage space on GPFS or HPSS

Groups can buy space on GPFS or HPSS rather than rely on the annual allocation process. A good budgetary number would be:

GPFS $400/TB

HPSS $150/TB

This is a one-time cost. We have no formal, written data retention policy at this point but the intent is to keep any HPSS data (including migrating to new tape technologies) as long as SciNet is in operation. These numbers are for budgetary purposes only and subject to change (e.g. as markets and technologies evolve).


Data Transfer

General guidelines

All traffic to and from the data centre has to go via SSH, or secure shell. This is a protocol which sets up a secure connection between two sites. In all cases, incoming connections to SciNet go through relatively low-speed connections to the login.scinet gateways, but there are many ways to copy files on top of the ssh protocol.

What node to use for data transfer to and from SciNet depends mostly on the amount of data to transfer:

Moving <10GB through the login nodes

The login nodes are accessible from outside SciNet, which means that you can transfer data between your own office/home machine and SciNet using scp or rsync (see below). However, the login nodes impose a cpu_time limit of 5 minutes (cpu_time, not wall_time), so transfers much larger than about 10GB are unlikely to complete. Even for transfers of less than 10GB, using a datamover node will generally be faster.

Note that transfers through a login node will time out after a certain amount of cpu_time (currently 5 minutes), so if you have a slow connection you may need to go through datamover1.

Moving >10GB through the datamover1 node

Serious moves of data (>10GB) to or from SciNet should be done from the datamover1 or datamover2 nodes. From any of the interactive SciNet nodes, one can ssh datamover1 or ssh datamover2 to log in. These are the machines with the fastest network connections to the outside world (faster by a factor of 10: a 10 Gb/s link versus 1 Gb/s).

Transfers must be originated from datamover1 or datamover2; that is, one cannot copy files from the outside world directly to or from the datamovers; one has to log in to a datamover and copy the data to or from the outside network. Your local machine must also be reachable from the outside, either by its name or by its IP address. If you are behind a firewall or a (wireless) router, this may not be possible, and you may need to ask your network administrator to allow the datamovers to ssh to your machine. If you need to open a hole in your firewall, the datamovers' IPs are:

datamover1 142.150.188.121
datamover2 142.150.188.122
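
In practice, a push from SciNet to your machine then looks like the sketch below (the remote host name and paths are placeholders):

  $ ssh datamover1
  $ scp /scratch/$USER/bigresults.tar jon@remote.system.com:/home/jon/incoming/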

Hpn-ssh

The usual ssh protocols were not designed for speed. On the datamover1 or datamover2 nodes, we have installed hpn-ssh, or High-Performance-enabled ssh. You use this higher-performance ssh/scp/sftp variant by loading the `hpnssh' module. Hpn-ssh is backwards compatible with the `usual' ssh, but is capable of significantly higher speeds. If you routinely have large data transfers to do, we recommend having your system administrator look into installing hpn-ssh on your system.
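
For example, on a datamover you would load the module first; scp/rsync commands run from that shell then use the hpn-ssh versions:

  $ module load hpnssh
  $ which scp      # if the module prepends to your PATH, this now points at the hpn-ssh scp rather than the system one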

Everything we discuss below, unless otherwise stated, will work regardless of whether you have hpn-ssh installed on your remote system.

For Microsoft Windows users

Linux-Windows transfers can be a bit more involved than Linux-to-Linux ones, but with Cygwin this should not be a problem. Make sure you install Cygwin with the openssh package.

If you want to remain 100% within a Windows environment, another very good tool is WinSCP. It lets you easily transfer and synchronize data between your Windows workstation and the login nodes using your ssh credentials (provided that it's not much more than 10GB per sync pass).

If you are going to use the datamover1 method, and assuming your machine is not a wireless laptop (if it is, it is best to find a nearby wired computer and use a USB memory stick), you will need the IP address of your machine, which you can find by typing "ipconfig /all" on your local Windows machine. You will also need the ssh daemon (sshd) running locally in Cygwin.

Also note that your Windows user name does not have to be the same as on SciNet; this just depends on how your local Windows system was set up.

All locations given to scp or rsync in Cygwin have to be in Unix format (using "/", not "\"), and are relative to Cygwin's path, not Windows' (e.g. use /cygdrive/c/... to get to the Windows C: drive).
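
Putting this together, a small transfer from a Cygwin shell to the login gateway might look like the following sketch (user names and paths are placeholders; substitute the login node you normally connect to):

  $ scp /cygdrive/c/Users/jon/data.bin jon@login.scinet.utoronto.ca:/scratch/jon/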

Ways to transfer data

Globus data transfer

Globus is a file-transfer service with an easy-to-use web interface. To get started, please sign up for a Globus account at the Globus website. Once you have an account, go to this page to start the file transfer, and enter computecanada#niagara as one endpoint. If you are trying to transfer data from a laptop or desktop, you will need to install the Globus Connect Personal software (available here) to set up an endpoint for that machine and perform the transfer.

Please see the following page on how to set up Globus to perform data transfer.

scp

scp, or secure copy, is the easiest way to copy files, although we generally find rsync (below) to be faster.

scp works like cp to copy files:

$ scp original_file  copy_file

except that either the original or the copy can be on another system:

$ scp jonsdatafile.bin jon@remote.system.com:/home/jon/bigdatadir/

will copy the data file into the directory /home/jon/bigdatadir/ on remote.system.com after logging in as jon; you will be prompted for a password (unless you've set up ssh keys).
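
If you do not want to type a password for every transfer, a typical key setup looks like this sketch (host name as in the examples above):

  $ ssh-keygen -t rsa                    # generate a key pair; choose a passphrase when prompted
  $ ssh-copy-id jon@remote.system.com    # append your public key to the remote authorized_keys file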

Copying from remote systems works the same way:

$ scp jon@remote.system.com:/home/jon/bigdatadir/newdata.bin .

And wildcards work as you'd expect, except that wildcards referring to files on the remote system have to be quoted, so that they are expanded by the remote shell rather than by your local one:

$ scp *.bin jon@remote.system.com:/home/jon/bigdatadir/
$ scp jon@remote.system.com:"/home/jon/inputdata/*" .

There are few options worth knowing about:

  • scp -C compresses the file before transmitting it; if the file compresses well, this can significantly increase the effective data transfer rate (though usually not as much as compressing the data, sending it, and then uncompressing it). If the file doesn't compress well, this adds CPU overhead without accomplishing much, and can slow down your data transfer.
  • scp -oNoneEnabled=yes -oNoneSwitch=yes -- This is an hpn-ssh only option. If CPU overhead is a significant bottleneck in the data transfer, you can avoid it by turning off the encryption of the data stream. For most data this is acceptable, but not for sensitive data. In either case, authentication remains secure; it is only the data transfer itself that is sent in plaintext. An example command is shown below the list.
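
For instance, on a datamover with the hpnssh module loaded, a transfer with encryption of the data stream disabled would look like (host and paths are placeholders):

  $ scp -oNoneEnabled=yes -oNoneSwitch=yes bigfile.bin jon@remote.system.com:/home/jon/bigdatadir/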

rsync

rsync is a very powerful tool for mirroring directories of data.

$ rsync -av -e ssh scinetdatadir jon@remote.system.com:/home/jon/bigdatadir/

rsync has a dizzying number of options. The command above syncs scinetdatadir to the remote system; that is, any files that are newer on the local system are updated on the remote system. The converse isn't true; if there were newer files on the remote system, you would have to bring those over with

$ rsync -av -e ssh jon@remote.system.com:/home/jon/bigdatadir/ scinetdatadir 

The -a and -v options select `archive' mode, which preserves timestamps and permissions (normally what you want), and verbose mode. -e ssh tells rsync to use ssh for the transfer.

One of the powerful things about rsync is that it checks which files already exist before copying, so you can use it repeatedly as a data directory fills up and it won't make unnecessary copies; similarly, if a log file (say) grows over time, it will only copy the difference between the files, further speeding things up. This also means that it behaves well if a transfer is interrupted: a second invocation of rsync will continue where the first left off.
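
For long transfers of very large files it can also help to ask rsync to keep partially transferred files, so that a rerun resumes inside a large file rather than restarting it; this is the standard rsync --partial flag (not specific to SciNet):

  $ rsync -av --partial -e ssh scinetdatadir jon@remote.system.com:/home/jon/bigdatadir/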

As with scp -C, rsync -z compresses on the fly, which can significantly enhance effective data transfer rates if the files compress well, or hurt them if not.

As with scp, if both sides are running hpn-ssh one can disable encryption of the data stream should that prove to be a bottleneck:

$ rsync -av -e "ssh -oNoneEnabled=yes -oNoneSwitch=yes" jon@remote.system.com:/home/jon/bigdatadir/ scinetdatadir

SciNet's login nodes, 142.150.188.5[1-4], are publicly accessible and can be used for data transfer as long as your material does not come in very large chunks (much more than 2-3GB per file). Because of the 5-minute CPU-time limit on the login nodes, the transfer process may be killed by the kernel before completion. The workaround is to transfer your data in an rsync loop that checks the rsync return code, assuming at least some files can be transferred before the CPU limit is reached. For example, in a bash shell:

  for i in {1..100}; do   ### try 100 times
    rsync ...
    [ "$?" == "0" ] && break
  done

ssh tunnel

Alternatively you may use a reverse ssh tunnel (ssh -R).

If your transfer is above 10GB you will need to use one of SciNet's datamovers. If your workstation is behind a firewall (as the datamovers are), you will need a node external to your firewall, on the edge of your organization's network, that can serve as a gateway and is reachable via ssh from both your workstation and the datamovers. Initiate an "ssh -R" connection from SciNet's datamover to that node. The gateway node needs to have ssh GatewayPorts enabled so that your workstation can connect to the specified port on it, which forwards the traffic back to SciNet's datamover.
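
A minimal sketch, assuming a gateway host gateway.example.org with GatewayPorts enabled and port 2222 free (all host names, user names, ports and paths are placeholders):

  # on a SciNet datamover: open a reverse tunnel so that gateway.example.org:2222 forwards to the datamover's ssh port
  $ ssh -R 2222:localhost:22 user@gateway.example.org

  # on your workstation: connect to port 2222 on the gateway; the connection is forwarded
  # back to the datamover, so log in with your SciNet credentials
  $ rsync -av -e "ssh -p 2222" mydata/ scinetuser@gateway.example.org:/scratch/scinetuser/mydata/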

Transfer speeds

What transfer speeds could I expect?

Below are some typical transfer numbers from datamover1 to another University of Toronto machine with a 1Gb/s link to the campus network:

Mode      With hpn-ssh     Without hpn-ssh
rsync     60-80 MB/s       30-40 MB/s
scp       50 MB/s          25 MB/s

What can slow down my data transfer?

To move data quickly, all of the stages in the process have to be fast: the file system you are reading data from, the CPU reading the data, the network connection between the sender and the receiver, and the recipient's CPU and disk. The slowest element in that chain will limit the speed of the entire transfer.

On SciNet's side, our underlying filesystem is the high-performance GPFS system, and the node we recommend you use (datamover1) has a high-speed connection to the network and fast CPUs.

Why are my transfers so much slower?

If you get numbers significantly lower than the above, there is a bottleneck in the transfer. The first thing to do is to run top on datamover1; if other people are transferring large files at the same time as you, network congestion could result and you will just have to wait until they are done.

If nothing else is going on on datamover1, there are a number of possibilities:

  • the network connection between SciNet and your machine - do you know the speed of your remote machine's network connection? Are your system's connections tuned for performance [1]?
  • is the remote server busy?
  • are the remote server's disks busy, or known to be slow?

For any further questions, contact us at Support @ SciNet

File/Ownership Management (ACL)

  • By default, at SciNet, users within the same group have read permission to each other's files (not write)
  • You may use an access control list (ACL) to allow your supervisor (or another user within your group) to manage files for you (i.e., create, move, rename, delete), while still retaining your access and permissions as the original owner of the files/directories.
  • NOTE: We highly recommend that you never give write permission to other users on the top level of your home directory (/home/G/GROUP/[owner]), since that would seriously compromise your privacy and, among other things, disable ssh key authentication. If necessary, make specific sub-directories under your home directory from which other users can manipulate/access files; see the example after this list.
  • If you need to set up permissions across groups contact us (and the other group's supervisor!).
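
For example, a sketch of a dedicated shared subdirectory (the directory name is a placeholder):

  $ mkdir /home/G/GROUP/USER/shared
  $ chmod g+rwx /home/G/GROUP/USER/shared     # group members can create and modify files here, but not elsewhere in your home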


Using mmputacl/mmgetacl

  • You may use GPFS's native mmputacl and mmgetacl commands. The advantages are that you can set a "control" permission and that both POSIX and NFS v4 style ACLs are supported. You will first need to create a /tmp/supervisor.acl file with the following contents:
user::rwxc
group::----
other::----
mask::rwxc
user:[owner]:rwxc
user:[supervisor]:rwxc

Then issue the following 2 commands:

1) $ mmputacl -i /tmp/supervisor.acl /project/g/group/[owner]
2) $ mmputacl -d -i /tmp/supervisor.acl /project/g/group/[owner]
   (with the -d flag, every *new* file/directory created inside [owner] will by default inherit this ACL,
   so both [owner] and [supervisor] keep full access to files/directories created by either of them)

   $ mmgetacl /project/g/group/[owner]
   (to determine the current ACL attributes)

   $ mmdelacl -d /project/g/group/[owner]
   (to remove any previously set ACL)

   $ mmeditacl /project/g/group/[owner]
   (to create or change a GPFS access control list)
   (for this command to work set the EDITOR environment variable: export EDITOR=/usr/bin/vi)

NOTES:

  • mmputacl/setfacl will not overwrite the original Linux group permissions of a directory when it is copied to another directory that already has ACLs, hence the "#effective:r-x" note you may see from time to time in the output of mmgetacl/getfacl. If you want to give rwx permissions to everyone in your group, simply rely on the plain Unix 'chmod g+rwx' command. You may do that before or after copying the original material to another folder with the ACLs.

For more information on using mmputacl or mmgetacl see their man pages.

Appendix (ACL)

A bash script that you may adapt to recursively add or remove ACL attributes using GPFS's built-in commands:

Courtesy of Agata Disks (http://csngwinfo.in2p3.fr/mediawiki/index.php/GPFS_ACL)

#!/bin/bash
# USAGE
#     - on one directory:     ./set_acl.sh dir_name
#     - on more directories:  ./set_acl.sh 'dir_nam*'
#

# Path of the file that contains the ACL
ACL_FILE_PATH=/agatadisks/data/acl_file.acl

# Directories onto which the ACLs have to be set
dirs=$1

# Recursive function that sets the ACL on files and directories
set_acl () {
  curr_dir=$1
  for args in "$curr_dir"/*
  do
    if [ -f "$args" ]; then
      echo "ACL set on file $args"
      mmputacl -i "$ACL_FILE_PATH" "$args"
      if [ $? -ne 0 ]; then
        echo "ERROR: ACL not set on $args"
        exit 1
      fi
    fi
    if [ -d "$args" ]; then
      # Set the default (inherited) ACL on the directory
      mmputacl -d -i "$ACL_FILE_PATH" "$args"
      if [ $? -ne 0 ]; then
        echo "ERROR: Default ACL not set on $args"
        exit 1
      fi
      echo "Default ACL set on directory $args"
      # Set the ACL on the directory itself
      mmputacl -i "$ACL_FILE_PATH" "$args"
      if [ $? -ne 0 ]; then
        echo "ERROR: ACL not set on $args"
        exit 1
      fi
      echo "ACL set on directory $args"
      set_acl "$args"
    fi
  done
}

# $dirs is deliberately left unquoted so that a quoted pattern such as 'dir_nam*' expands here
for dir in $dirs
do
  if [ ! -d "$dir" ]; then
    echo "ERROR: $dir is not a directory"
    exit 1
  fi
  set_acl "$dir"
done
exit 0

High Performance Storage System (HPSS)

SciNet also operates the High Performance Storage System (HPSS), intended as longer-term, tape-backed storage; see the HPSS page for details.

More questions on data management?

Check out the FAQ.