Data Transfer


General guidelines

All traffic to and from the data centre has to go via SSH (secure shell), a protocol that sets up an encrypted connection between two sites. In all cases, incoming connections to SciNet go through relatively low-speed links to the login.scinet gateways, but there are many ways to copy files on top of the ssh protocol.

Which node to use for data transfer to and from SciNet depends mostly on the amount of data to transfer:

Moving <10GB through the login nodes

The login nodes are accessible from outside SciNet, which means that you can transfer data between your own office/home machine and SciNet using scp or rsync (see below). However, the login nodes impose a timeout of 5 minutes of cpu_time (cpu_time, not wall_time), so a transfer of more than about 10GB is unlikely to succeed. Even for transfers of less than 10GB, a datamover node will usually be faster.

Note that transfers through a login node will time out after a certain amount of time (currently set to 5 minutes of cpu_time), so if you have a slow connection you may need to go through datamover1.

Moving >10GB through the datamover1 node

Serious moves of data (>10GB) to or from SciNet should be done from the datamover1 node. From any of the interactive SciNet nodes, one should be able to ssh datamover1 to log in. This is the machine with the fastest network connection to the outside world (faster by a factor of 10: a 10Gb/s link versus 1Gb/s).

Transfers must be originated from datamover1; that is, one cannot copy files from the outside world directly to or from the data mover node; one has to log in to the data mover node and copy the data to or from the outside network. This means your local machine must be reachable from the outside, either by name or by IP address. If you are behind a firewall or a (wireless) router, this may not be possible, and you may need to ask your system administrator to allow datamover1 to ssh to your machine.
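
For example, a push from datamover1 to an outside machine might look like the following (jon, myhost.example.com and the paths are placeholders for your own):

$ ssh datamover1                 # from any interactive SciNet node
$ scp /scratch/jon/results.tar.gz jon@myhost.example.com:/data/backups/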

Hpn-ssh

The usual ssh protocols were not designed for speed. On the datamover1 node, we have installed hpn-ssh, or High-Performance-enabled ssh. You use this higher-performance ssh/scp/sftp variant by loading the "hpnssh" module. hpn-ssh is backwards compatible with the "usual" ssh, but is capable of significantly higher speeds. If you routinely have large data transfers to do, we recommend having your system administrator look into installing hpn-ssh on your system.
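
For example, on datamover1 (using the hpnssh module named above; the file name and remote host are placeholders):

$ module load hpnssh
$ scp bigfile.bin jon@remote.system.com:/home/jon/    # this scp is now the hpn-ssh variant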

Everything we discuss below, unless otherwise stated, will work regardless of whether you have hpn-ssh installed on your remote system.

For Microsoft Windows users

Linux-to-Windows transfers can be a bit more involved than Linux-to-Linux, but with Cygwin this should not be a problem. Make sure you install Cygwin with the openssh package.

If you want to remain 100% in a Windows environment, another very good tool is WinSCP. It will let you easily transfer and synchronize data between your Windows workstation and the login nodes using your ssh credentials (provided each sync pass is not much more than 10GB).

If you are going to use the datamover1 method, and assuming your machine is not a wireless laptop (if it is, it is best to find a nearby wired computer and use a USB memory stick), you'll need the IP address of your machine, which you can find by typing "ipconfig /all" on your local Windows machine. You will also need the ssh daemon (sshd) running locally in Cygwin.
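
As a rough sketch (details vary between Cygwin versions, so consult the Cygwin documentation if these commands differ on your install):

C:\> ipconfig /all        # in a Windows command prompt: note your IPv4 address

$ ssh-host-config         # in a Cygwin terminal: set up sshd as a Windows service
$ cygrunsrv -S sshd       # start the sshd service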

Also note that your Windows user name does not have to be the same as your SciNet user name; this just depends on how your local Windows system was set up.

All locations given to scp or rsync in Cygwin have to be in unix format (using "/", not "\"), and will be relative to Cygwin's path, not Windows' (e.g. use /cygdrive/c/... to get to the Windows C: drive).
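
For example, pulling a file from a SciNet login node onto the Windows C: drive might look like this (the hostname and paths are hypothetical):

$ scp jon@login.scinet.utoronto.ca:/scratch/jon/output.dat /cygdrive/c/Users/jon/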

Ways to transfer data

scp

scp, or secure copy, is the easiest way to copy files, although we generally find rsync (below) to be faster.

scp works like cp to copy files:

$ scp original_file  copy_file

except that either the original or the copy can be on another system:

$ scp jonsdatafile.bin jon@remote.system.com:/home/jon/bigdatadir/

will copy the data file into the directory /home/jon/bigdatadir/ on remote.system.com after logging in as jon; you will be prompted for a password (unless you've set up ssh keys).
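
If you want to avoid typing a password for every transfer, you can set up ssh keys; a minimal sketch, using the same placeholder remote system as above:

$ ssh-keygen -t rsa                    # generate a key pair (accept the defaults)
$ ssh-copy-id jon@remote.system.com    # append your public key to the remote authorized_keys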

Copying from remote systems works the same way:

$ scp jon@remote.system.com:/home/jon/bigdatadir/newdata.bin .

And wildcards work as you'd expect, except that you have to quote wildcards meant for the remote system, since your local shell cannot expand them:

$ scp *.bin jon@remote.system.com:/home/jon/bigdatadir/
$ scp jon@remote.system.com:"/home/jon/inputdata/*" .

There are a few options worth knowing about:

  • scp -C compresses the file before transmitting it; if the file compresses well, this can significantly increase the effective data transfer rate (though usually not as much as compressing the data first, sending the compressed file, and uncompressing it on the other side; see the sketch after this list). If the file doesn't compress well, then this adds CPU overhead without accomplishing much, and can slow down your data transfer.
  • scp -oNoneEnabled=yes -oNoneSwitch=yes -- This is an hpn-ssh-only option. If CPU overhead is a significant bottleneck in the data transfer, you can avoid it by turning off the secure encryption of the data. For many users this is acceptable, but not if your data must remain confidential in transit. In either case, authentication remains secure; it is only the data stream that is sent in plaintext.
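
A minimal sketch of the compress-first approach mentioned above (all file and directory names are placeholders):

$ tar czf mydata.tar.gz mydatadir/                      # compress locally first
$ scp mydata.tar.gz jon@remote.system.com:/home/jon/    # send the compressed archive
$ ssh jon@remote.system.com "tar xzf /home/jon/mydata.tar.gz -C /home/jon"    # unpack remotely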

rsync

rsync is a very powerful tool for mirroring directories of data.

$ rsync -av -e ssh scinetdatadir jon@remote.system.com:/home/jon/bigdatadir/

rsync has a dizzying number of options; the above command syncs scinetdatadir to the remote system; that is, any files that are newer on the local system are updated on the remote system. The converse isn't true; if there were newer files on the remote system, you'd have to bring those over with

$ rsync -av -e ssh jon@remote.system.com:/home/jon/bigdatadir/ scinetdatadir 

The -a option is for "archive" mode, which preserves timestamps and permissions (normally what you want), and -v is for verbose output. -e ssh tells rsync to use ssh for the transfer.

One of the powerful things about rsync is that it checks which files already exist before copying, so you can use it repeatedly as a data directory fills and it won't make unnecessary copies; similarly, if (say) a log file grows over time, it will only copy the difference between the files, further speeding things up. This also means that it behaves well if a transfer is interrupted: a second invocation of rsync will continue where the previous one left off.
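
If individual files are so large that one file may not complete within a single attempt, the standard rsync option --partial keeps partially-transferred files so that the next invocation resumes them instead of starting them over:

$ rsync -av --partial -e ssh scinetdatadir jon@remote.system.com:/home/jon/bigdatadir/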

As with scp -C, rsync -z compresses on the fly, which can significantly enhance effective data transfer rates if the files compress well, or hurt it if not.

As with scp, if both sides are running hpn-ssh one can disable encryption of the data stream should that prove to be a bottleneck:

$ rsync -av -e "ssh -oNoneEnabled=yes -oNoneSwitch=yes" jon@remote.system.com:/home/jon/bigdatadir/ scinetdatadir

SciNet's login nodes, 142.150.188.5[1-4], are publicly accessible and can be used for data transfer as long as your material does not come in big chunks (not much more than 2-3GB per file). There is a 5-minute CPU-time limit on the login nodes, and a transfer may be killed by the kernel before completion. The workaround is to transfer your data in an rsync loop that checks rsync's return code, assuming at least some files can be transferred before the CPU limit is reached. For example, in a bash shell:

  for i in {1..100}; do        # try up to 100 times
    rsync ...                  # your rsync command here
    [ "$?" == "0" ] && break   # a return code of 0 means the sync completed; stop retrying
  done
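
A complete version of this loop, with a hypothetical source directory and destination, might look like:

  for i in {1..100}; do
    rsync -av /scratch/jon/run42/ jon@remote.system.com:/home/jon/run42/ && break
    echo "rsync attempt $i was interrupted; retrying"
  done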

ssh tunnel

Alternatively, you may use a reverse ssh tunnel (ssh -R).

If your transfer is above 10GB you will need to use one of SciNet's datamovers. If your workstation is behind a firewall (as the datamovers are), you'll need a node external to your firewall, on the edge of your organization's network, that can serve as a gateway and is reachable via ssh by both your workstation and the datamovers. Initiate an "ssh -R" connection from SciNet's datamover to that node. The gateway node needs to have ssh GatewayPorts enabled, so that your workstation can connect to the specified port on it and have the traffic forwarded back to SciNet's datamover.
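
A sketch of the setup (gateway.example.com, port 2222 and the user names are placeholders, and GatewayPorts must be enabled in the gateway's sshd configuration):

# On datamover1: forward port 2222 on the gateway back to this node's ssh port
$ ssh -R 2222:localhost:22 user@gateway.example.com

# On your workstation: connect through the forwarded port; this authenticates
# against the datamover, so use your SciNet credentials
$ scp -P 2222 jon@gateway.example.com:/scratch/jon/bigfile.bin .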

Globus Online

Another alternative is to outsource that gateway functionality, which can be done via Globus Online. You'll need to do the following:

1) Set up your Globus Online account at https://www.globusonline.org/SignUp (you only ever need to do this once).

2) Create a WestGrid account by going to the following site: https://ccdb.computecanada.ca/me/facilities

3) Download the Globus Connect client for your workstation and follow the instructions to install and set it up: https://www.globusonline.org/globus_connect/

4) Create a proxy by running

   $ ssh bugaboo.westgrid.ca myproxy-init

When prompted, enter your WestGrid account password. You need to do this before you transfer files; the proxy is valid for 1 week.

5) Link your Westgrid account to globus:

   a) Go to https://www.globusonline.org/account/ManageIdentities.
   b) Click "Add External Identity".
   c) Select "WestGrid" as the Account Provider.
   d) Login using your WestGrid credentials.
      * You only ever have to do this once, but you must create a proxy (step 4) first.

6) Go to the following page to start the transfer: https://www.globusonline.org/xfer/StartTransfer

 - For SciNet, use scinet#dtn

This process will take one or two days, since we need to wait for the certificate to get signed before everything will work. The tool just brokers the transfer, and should work properly even if a firewall is in place.

Transfer speeds

What transfer speeds could I expect?

Below are some typical transfer numbers from datamover1 to another University of Toronto machine with a 1Gb/s link to the campus network:

Mode     With hpn-ssh    Without hpn-ssh
rsync    60-80 MB/s      30-40 MB/s
scp      50 MB/s         25 MB/s

What can slow down my data transfer?

To move data quickly, all of the stages in the process have to be fast: the file system you are reading data from, the CPU reading the data, the network connection between the sender and the receiver, and the recipient's CPU and disk. The slowest element in that chain will slow down the entire transfer.

On SciNet's side, our underlying filesystem is the high-performance GPFS system, and the node we recommend you use (datamover1) has a high-speed connection to the network and fast CPUs.

Why are my transfers so much slower?

If you get numbers significantly lower than the above, then there is a bottleneck in the transfer. The first thing to do is to run top on datamover1; if other people are transferring large files at the same time as you, network congestion could result and you'll just have to wait until they are done.

If nothing else is going on on datamover1, there are a number of possibilities:

  • the network connection between SciNet and your machine - do you know the network connection of your remote machine? Is your system's network connection tuned for performance?
  • is the remote server busy?
  • are the remote server's disks busy, or known to be slow?

For any further questions, contact us at Support @ SciNet