Data Management


SciNet's storage system is based on IBM's GPFS (General Parallel File System). There are two main systems for user data: /home, a small, backed-up space where user home directories are located, and /scratch, a large system for input and output data for jobs. Data on /scratch is not backed up, and data placed there will be deleted after two weeks. SciNet does not provide long-term storage for large data sets.

Home Disk Space

Every SciNet user gets a 10GB directory on /home which is regularly backed up. Home is visible from the login.scinet nodes and from the development nodes on the GPC and the TCS. However, on the compute nodes of the GPC cluster -- that is, while jobs are running -- /home is mounted read-only; thus GPC jobs can read files in /home but cannot write to files there. /home is a good place to put code, input files for runs, and anything else that needs to be kept to reproduce runs.
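
As a concrete illustration, here is a minimal Python sketch of what this means inside a job: the user name, directory layout, and file names are placeholders only, and your actual /home directory may be organized differently.

 import os
 
 # Placeholder layout -- substitute your own user name and file names.
 home = os.path.join("/home", os.environ.get("USER", "myuser"))
 
 # Reading input from /home works on every node, including GPC compute nodes.
 with open(os.path.join(home, "inputs", "params.txt")) as f:
     params = f.read()
 
 # Writing under /home fails on a GPC compute node, since the filesystem is
 # mounted read-only there; job output must go to /scratch instead.
 try:
     open(os.path.join(home, "output.txt"), "w")
 except OSError as err:
     print("cannot write to /home from a compute node:", err)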

Scratch Disk Space

Every SciNet user also gets a directory in /scratch. Scratch is visible from the login.scinet nodes, from the development nodes on the GPC and the TCS, and from the compute nodes of the clusters, where it is mounted read-write. Thus jobs must write their output somewhere in /scratch. There are NO backups of anything on /scratch.
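
A minimal sketch of directing job output to scratch follows; the per-user path is an assumption for illustration only, so check where your own scratch directory actually lives.

 import os
 
 # Illustrative per-user scratch path -- adjust to your own directory.
 scratch = os.path.join("/scratch", os.environ.get("USER", "myuser"))
 run_dir = os.path.join(scratch, "run_001")
 
 # /scratch is mounted read-write on the compute nodes, so job output goes here.
 os.makedirs(run_dir, exist_ok=True)
 with open(os.path.join(run_dir, "results.dat"), "w") as out:
     out.write("job output goes under /scratch -- remember it is not backed up\n")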

There is a large amount of space available on /scratch, but it is purged every two weeks so that all users running jobs and generating large outputs will have room to store their data temporarily. Computational results which you want to keep longer than this must be copied off of SciNet entirely (using scp) to your local system. SciNet does not provide long-term storage for large data sets.
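
For example, a post-processing step could push a run directory off SciNet with scp, as in the sketch below; the host name and paths here are placeholders, not real machines.

 import subprocess
 
 # Placeholder host and paths -- replace with your own machine and directories.
 run_dir = "/scratch/myuser/run_001"
 destination = "me@my.home.institution:/path/to/archive/"
 
 # Recursively copy the run directory off SciNet before the purge removes it.
 subprocess.check_call(["scp", "-r", run_dir, destination])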

Performance

GPFS is a high-performance filesystem which provides rapid reads and writes to large datasets in parallel from many nodes. As a consequence of this design, however, it performs quite poorly at accessing data sets which consist of many small files. For instance, you will find that reading data from one 4MB file is enormously faster than from 100 40KB files. Such small files are also quite wasteful of space, as the blocksize for the filesystem is 4MB. This is something you should keep in mind when planning your input/output strategy for runs on SciNet.
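
One common way to follow this advice is to consolidate many small output files into a single large file before reading them back or moving them around. The sketch below uses Python's tarfile module with an illustrative path and file-naming scheme; it is just one possible approach, not a SciNet-specific tool.

 import os
 import tarfile
 
 run_dir = "/scratch/myuser/run_001"   # illustrative path
 
 # Rather than leaving thousands of small per-step files on GPFS, bundle them
 # into one archive: a single large file is much faster to read back and does
 # not waste space on 4MB filesystem blocks.
 with tarfile.open(os.path.join(run_dir, "steps.tar"), "w") as tar:
     for name in sorted(os.listdir(run_dir)):
         if name.startswith("step_") and name.endswith(".dat"):
             tar.add(os.path.join(run_dir, name), arcname=name)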