oldwiki.scinet.utoronto.ca - User contributions [en-gb]

Oldwiki.scinet.utoronto.ca:System Alerts

2017-01-28T13:41:44Z

Jchong: /* System Status */

== System Status==

{|
|[[File:down.png|up|link=GPC Quickstart]][[GPC Quickstart|GPC]]
|[[File:down.png|up|link=TCS Quickstart]][[TCS Quickstart|TCS]]
|[[File:down.png|up|link=Sandy]][[Sandy]]
|[[File:down.png|up|link=Gravity]][[Gravity]]
|[[File:up.png|up|link=BGQ]][[BGQ]]
|[[File:down.png|file systems unmounted]]File System
|-
|[[File:down.png|up|link=P7 Linux Cluster]][[P7 Linux Cluster|P7]]
|[[File:down.png|up|link=P8]][[P8]]
|[[File:down.png|up|link=Knights Landing]][[Knights Landing|KNL]]
|[[File:down.png|up|link=Visualization Nodes]][[Visualization Nodes|Viz]]
|[[File:down.png|up|link=HPSS]][[HPSS]]
|}
Sat 28 Jan 2017 8:41 EST BGQ is not affected and the system is up.

Sat 28 Jan 2017 8:15 EST Further issues found on the file system; system access to users has been closed until we can solve these issues.

Fri 27 Jan 2017 15:11:58 EST Cluster network issue is resolved and filesystem access is finally resolved after determining the root cause of the network issue.

 Fri Jan 27 11:20:32 EST 2017 While we're restoring things, file systems will generally not be available for usage during this period, to facilitate our work. Sorry.

 Fri Jan 27 10:02:32 EST 2017 The IB network fabric had a failure earlier today that affected the file systems. The IB fabric is back to normal, and we're working on restoring the file systems at the moment.

 Fri Jan 27 7:34:00 EST 2017 Issues with the new scratch file system; we're investigating.

 Thu Jan 26 21:24:14 EST 2017 Maintenance finished, systems back online and available, with the exception of the TCS, that does not accept jobs yet (but the devel nodes are accessible).

Jan 25, 2017, 18:48: BGQ is online.

Oldwiki.scinet.utoronto.ca:System Alerts

2017-01-27T20:12:54Z

Jchong: /* System Status */

== System Status==

{|
|[[File:up.png|up|link=GPC Quickstart]][[GPC Quickstart|GPC]]
|[[File:down.png|up|link=TCS Quickstart]][[TCS Quickstart|TCS]]
|[[File:down.png|up|link=Sandy]][[Sandy]]
|[[File:down.png|up|link=Gravity]][[Gravity]]
|[[File:up.png|up|link=BGQ]][[BGQ]]
|[[File:up.png|file systems unmounted]]File System
|-
|[[File:down.png|up|link=P7 Linux Cluster]][[P7 Linux Cluster|P7]]
|[[File:down.png|up|link=P8]][[P8]]
|[[File:down.png|up|link=Knights Landing]][[Knights Landing|KNL]]
|[[File:down.png|up|link=Visualization Nodes]][[Visualization Nodes|Viz]]
|[[File:down.png|up|link=HPSS]][[HPSS]]
|}
Fri 27 Jan 2017 15:11:58 EST Cluster network issue is resolved and filesystem access is finally resolved after determining the root cause of the network issue.

 Fri Jan 27 11:20:32 EST 2017 While we're restoring things, file systems will generally not be available for usage during this period, to facilitate our work. Sorry.

 Fri Jan 27 10:02:32 EST 2017 The IB network fabric had a failure earlier today that affected the file systems. The IB fabric is back to normal, and we're working on restoring the file systems at the moment.

 Fri Jan 27 7:34:00 EST 2017 Issues with the new scratch file system; we're investigating.

 Thu Jan 26 21:24:14 EST 2017 Maintenance finished, systems back online and available, with the exception of the TCS, that does not accept jobs yet (but the devel nodes are accessible).

Jan 25, 2017, 18:48: BGQ is online.

Oldwiki.scinet.utoronto.ca:System Alerts

2017-01-27T20:12:36Z

Jchong: /* System Status */

== System Status==

{|
|[[File:up.png|up|link=GPC Quickstart]][[GPC Quickstart|GPC]]
|[[File:down.png|up|link=TCS Quickstart]][[TCS Quickstart|TCS]]
|[[File:down.png|up|link=Sandy]][[Sandy]]
|[[File:down.png|up|link=Gravity]][[Gravity]]
|[[File:down.png|up|link=BGQ]][[BGQ]]
|[[File:up.png|file systems unmounted]]File System
|-
|[[File:down.png|up|link=P7 Linux Cluster]][[P7 Linux Cluster|P7]]
|[[File:down.png|up|link=P8]][[P8]]
|[[File:down.png|up|link=Knights Landing]][[Knights Landing|KNL]]
|[[File:down.png|up|link=Visualization Nodes]][[Visualization Nodes|Viz]]
|[[File:down.png|up|link=HPSS]][[HPSS]]
|}
Fri 27 Jan 2017 15:11:58 EST Cluster network issue is resolved and filesystem access is finally resolved after determining the root cause of the network issue.

 Fri Jan 27 11:20:32 EST 2017 While we're restoring things, file systems will generally not be available for usage during this period, to facilitate our work. Sorry.

 Fri Jan 27 10:02:32 EST 2017 The IB network fabric had a failure earlier today that affected the file systems. The IB fabric is back to normal, and we're working on restoring the file systems at the moment.

 Fri Jan 27 7:34:00 EST 2017 Issues with the new scratch file system; we're investigating.

 Thu Jan 26 21:24:14 EST 2017 Maintenance finished, systems back online and available, with the exception of the TCS, that does not accept jobs yet (but the devel nodes are accessible).

Jan 25, 2017, 18:48: BGQ is online.

Oldwiki.scinet.utoronto.ca:System Alerts

2016-08-28T16:40:20Z

Jchong: /* System Status */

== System Status==

{|
|[[File:up.png|up|link=GPC Quickstart]][[GPC Quickstart|GPC]]
|[[File:down.png|up|link=TCS Quickstart]][[TCS Quickstart|TCS]]
|[[File:down.png|up|link=Sandy]][[Sandy]]
|[[File:down.png|up]]File System
|-
|[[File:down.png|down|link=Gravity]][[Gravity]]
|[[File:up.png|up|link=P7 Linux Cluster]][[P7 Linux Cluster|P7]]
|[[File:down.png|down|link=Visualization Nodes]][[Visualization Nodes|Viz]]
|[[File:down.png|down|link=BGQ]][[BGQ]]
|[[File:down.png|down|link=HPSS]][[HPSS]]
|}

Sun 28 Aug 2016 10:19:17 EDT Turning systems on. Expect to be up in a couple of hours.

Sun Aug 28 09:16:02 EDT 2016 Datacentre was shutdown due to facility power failure.

Tue 16 Aug 2016 21:55:20 Cooling restored, filesystem up and OK, bringing up clusters. GPC should be available to users by 11PM (perhaps as early as 1030)

Tue 16 Aug 2016 20:46:37 Water service has been restored to building. Restarting cooling system.

Tue 16 Aug 2016 19:52:32 SciNet-related maintenance and modifications have been completed successfully. Work on the building water valve is expected to be done on time (by 9PM). Once water service is restored we need to restore cooling, power-up filesystems and then restart the clusters. Unlikely that any systems are available to users before 11PM and it will take longer to get everything online. Check here for updates

Tue 16 Aug 2016 07:07:32 Shutdown has started

 Scheduled full-day maintenance shutdown begins:

7AM, Tuesday, 16 Aug 

Several projects (adding new 208V circuits for storage, cooling tower maintenance etc) are being carried out on same day as the landlord needs to shutdown the main building water supply (and therefore our cooling system as well) for repairs.

Expect to start bringing systems up about 10PM depending on when the water work is done. Check here for further updates during the day

FAQ

2016-03-11T19:41:34Z

Jchong: Fixed link

__TOC__

==The Basics==
===Whom do I contact for support?===

Whom do I contact if I have problems or questions about how to use the SciNet systems?

'''Answer:'''

E-mail [mailto:support@scinet.utoronto.ca <support@scinet.utoronto.ca>]

In your email, please include the following information:

* your username on SciNet
* the cluster that your question pertains to (GPC or TCS; SciNet is not a cluster!),
* any relevant error messages
* the commands you typed before the errors occured
* the path to your code (if applicable)
* the location of the job scripts (if applicable)
* the directory from which it was submitted (if applicable)
* a description of what it is supposed to do (if applicable)
* if your problem is about connecting to SciNet, the type of computer you are connecting from.

Note that your password should never, never, never be to sent to us, even if your question is about your account.

Try to avoid sending email only to specific individuals at SciNet. Your chances of a quick reply increase significantly if you email our team!

===What does ''code scaling'' mean?===

'''Answer:'''

Please see [[Introduction_To_Performance#Parallel_Speedup|A Performance Primer]]

===What do you mean by ''throughput''?===

'''Answer:'''

Please see [[Introduction_To_Performance#Throughput|A Performance Primer]].

Here is a simple example:

Suppose you need to do 10 computations. Say each of these runs for
1 day on 8 cores, but they take "only" 18 hours on 16 cores. What is the
fastest way to get all 10 computations done - as 8-core jobs or as
16-core jobs? Let us assume you have 2 nodes at your disposal.
The answer, after some simple arithmetic, is that running your 10
jobs as 8-core jobs will take 5 days, whereas if you ran them
as 16-core jobs it would take 7.5 days. Take your own conclusions...

===I changed my .bashrc/.bash_profile and now nothing works===

The default startup scripts provided by SciNet, and guidelines for them, can be found [[Important_.bashrc_guidelines|here]]. Certain things - like sourcing <tt>/etc/profile</tt>
and <tt>/etc/bashrc</tt> are ''required'' for various SciNet routines to work!

If the situation is so bad that you cannot even log in, please send email [mailto:support@scinet.utoronto.ca support].

===Could I have my login shell changed to (t)csh?===

The login shell used on our systems is bash. While the tcsh is available on the GPC and the TCS, we do not support it as the default login shell at present. So "chsh" will not work, but you can always run tcsh interactively. Also, csh scripts will be executed correctly provided that they have the correct "shebang" <tt>#!/bin/tcsh</tt> at the top.

===How can I run Matlab / IDL / Gaussian / my favourite commercial software at SciNet?===

'''Answer:'''

Because SciNet serves such a disparate group of user communities, there is just no way we can buy licenses for everyone's commercial package. The only commercial software we have purchased is that which in principle can benefit everyone -- fast compilers and math libraries (Intel's on GPC, and IBM's on TCS).

If your research group requires a commercial package that you already have or are willing to buy licenses for, contact us at [mailto:support@scinet.utoronto.ca support@scinet] and we can work together to find out if it is feasible to implement the packages licensing arrangement on the SciNet clusters, and if so, what is the the best way to do it.

Note that it is important that you contact us before installing commercially licensed software on SciNet machines, even if you have a way to do it in your own directory without requiring sysadmin intervention. It puts us in a very awkward position if someone is found to be running unlicensed or invalidly licensed software on our systems, so we need to be aware of what is being installed where.

===Do you have a recommended ssh program that will allow scinet access from Windows machines?===

'''Answer:'''

The [[Ssh#SSH_for_Windows_Users | SSH for Windows users]] programs we recommend are:

* [http://mobaxterm.mobatek.net/en/ MobaXterm] is a tabbed ssh client with some Cygwin tools, including ssh and X, all wrapped up into one executable.
* [http://www.chiark.greenend.org.uk/~sgtatham/putty/ PuTTY] - this is a terminal for windows that connects via ssh. It is a quick install and will get you up and running quickly. '''WARNING:''' Make sure you download putty from the official website, because there are "trojanized" versions of putty around that will send your login information to a site in Russia (as reported [http://blogs.cisco.com/security/trojanized-putty-software here]). To set up your passphrase protected ssh key with putty, see [http://the.earth.li/~sgtatham/putty/0.61/htmldoc/Chapter8.html#pubkey here].
* [http://www.cygwin.com/ CygWin] - this is a whole linux-like environment for windows, which also includes an X window server so that you can display remote windows on your desktop. Make sure you include the openssh and X window system in the installation for full functionality. This is recommended if you will be doing a lot of work on Linux machines, as it makes a very similar environment available on your computer. To set up your ssh keys, following the Linux instruction on the [[Ssh keys]] page.
 To set up your ssh keys, following the Linux instruction on the [[Ssh keys]] page.

===My ssh key does not work! WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! ===

'''Answer:'''

[[Ssh_keys#Testing_Your_Key | Testing Your Key]]

* If this doesn't work, you should be able to login using your password, and investigate the problem. For example, if during a login session you get an message similar to the one below, just follow the instruction and delete the offending key on line 3 (you can use vi to jump to that line with ESC plus : plus 3). That only means that you may have logged in from your home computer to SciNet in the past, and that key is obsolete.
<pre>
$ ssh USERNAME@login.scinet.utoronto.ca

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@**@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@**@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle
attack)!
It is also possible that the RSA host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
53:f9:60:71:a8:0b:5d:74:83:52:**fe:ea:1a:9e:cc:d3.
Please contact your system administrator.
Add correct host key in /home/<user>/.ssh/known_hosts to get rid of
this message.
Offending key in /home/<user>/.ssh/known_hosts:3
RSA host key for login.scinet.utoronto.ca
<http://login.scinet.utoronto.ca <http://login.scinet.utoronto.ca>> has
changed and you have requested
</pre>

* If you get the message below you may need to logout of your gnome session and log back in since the ssh-agent needs to be
restarted with the new passphrase ssh key.
<pre>
$ ssh USERNAME@login.scinet.utoronto.ca

Agent admitted failure to sign using the key.
</pre>

===Can't forward X: "Warning: No xauth data; using fake authentication data", or "X11 connection rejected because of wrong authentication."===

I used to be able to forward X11 windows from SciNet to my home machine, but now I'm getting these messages; what's wrong?

'''Answer:'''

This very likely means that ssh/xauth can't update your ${HOME}/.Xauthority file.

The simplest pssible reason for this is that you've filled your 10GB /home quota and so can't write anything to your home directory. Use

<pre>
$ module load extras
$ diskUsage
</pre>

to check to see how close you are to your disk usage on ${HOME}.

Alternately, this could mean your .Xauthority file has become broken/corrupted/confused some how, in which case you can delete that file, and when you next log in you'll get a similar warning message involving creating .Xauthority, but things should work.

===How come I can not login to TCS?===

'''Answer:'''

A SciNet account doesn't automatically entitle you to TCS access. At a minimum, TCS jobs need to run on at least 32 cores (64 preferred because of Simultaneous Multi Threading - [[TCS_Quickstart#Node_configuration|SMT]] - on these nodes) and need the large memory (4GB/core) and bandwidth on the system. Essentially you need to be able to explain why the work can't be done on the GPC.

===How can I reset the password for my Compute Canada account?===

'''Answer:'''

You can reset your password for your Compute Canada account here:

https://ccdb.computecanada.ca/security/forgot

===How can I change or reset the password for my SciNet account?===

'''Answer:'''

To reset your password at SciNet please go to [https://portal.scinet.utoronto.ca/password_resets Password reset page].

If you know your old password and want to change it, that can be done here:

https://portal.scinet.utoronto.ca/change_password

===Why am I getting the error "Permission denied (publickey,gssapi-with-mic,password)"?===

This error can pop up in a variety of situations: when trying to log in, or when after a job has finished, when the error and output files fail to be copied (there are other possible reasons for this failure as well -- see [[FAQ#My_GPC_job_died.2C_telling_me_.60Copy_Stageout_Files_Failed.27|My GPC job died, telling me:Copy Stageout Files Failed]]).
In most cases, the "Permission denioed" error is caused by incorrect permission of the (hidden) .ssh directory. Ssh is used for logging in as well as for the copying of the standard error and output files after a job.

For security reasons,
the directory .ssh should only be writable and readable to you, but yours
has read permission for everybody, and thus it fails. You can change
this by
<pre>
chmod 700 ~/.ssh
</pre>
And to be sure, also do
<pre>
chmod 600 ~/.ssh/id_rsa ~/.ssh/id_rsa.pub ~/authorized_keys
</pre>

===ERROR:102: Tcl command execution failed? when loading modules ===
Modules sometimes require other modules to be loaded first.
Module will let you know if you didn’t.
For example:
<pre>
$ module purge
$ module load python
python/2.6.2(11):ERROR:151: Module ’python/2.6.2’ depends on one of the module(s) ’gcc/4.4.0’
python/2.6.2(11):ERROR:102: Tcl command execution failed: prereq gcc/4.4.0
$ gpc-f103n084-$ module load gcc python
$
</pre>

=== How do I compute the core-years usage of my code? ===

The "core-years" quantity is a way to account for the time your code runs, by considering the total number of cores and time used, accounting for the total number of hours in a year.
For instance if your code uses ''HH'' hours, in ''NN'' nodes, where each node has ''CC'' cores, then "core-years" can be computed as follow:

''HH*(NN*CC)/(365*24)''

If you have several independent instances (batches) running on different nodes, with ''BB'' number of batches and each batch during ''HH'' hours, then your core-years usage can be computed as,

''BB*HH*(NN*CC)/(365*24)''

As a general rule, in our GPC system, each node has only 8 cores, so ''CC'' will be always 8.

==Compiling your Code==

===How can I get g77 to work?===

The fortran 77 compilers on the GPC are ifort and gfortran. We have dropped support for g77. This has been a conscious decision. g77 (and the associated library libg2c) were completely replaced six years ago (Apr 2005) by the gcc 4.x branch, and haven't undergone any updates at all, even bug fixes, for over five years.
If we would install g77 and libg2c, we would have to deal with the inevitable confusion caused when users accidentally link against the old, broken, wrong versions of the gcc libraries instead of the correct current versions.

If your code for some reason specifically requires five-plus-year-old libraries, availability, compatibility, and unfixed-known-bug problems are only going to get worse for you over time, and this might be as good an opportunity as any to address those issues.

''A note on porting to gfortran or ifort:''

While gfortran and ifort are rather compatible with g77, one
important difference is that by default, gfortran does not preserve
local variables between function calls, while g77 does. Preserved
local variables are for instance often used in implementations of quasi-random number
generators. Proper fortran requires to declare such variables as SAVE
but not all old code does this.
Luckily, you can change gfortran's default behavior with the flag
<tt>-fno-automatic</tt>. For ifort, the corresponding flag is <tt>-noautomatic</tt>.

===Where is libg2c.so?===

libg2c.so is part of the g77 compiler, for which we dropped support. See [[#How can I get g77 to work on the GPC?]] for our reasons.

===Autoparallelization does not work!===

I compiled my code with the <tt>-qsmp=omp,auto</tt> option, and then I specified that it should be run with 64 threads - with
export OMP_NUM_THREADS=64

However, when I check the load using <tt>llq1 -n</tt>, it shows a load on the node of 1.37. Why?

'''Answer:'''

Using the autoparallelization will only get you so far. In fact, it usually does not do too much. What is helpful is to run the compiler with the <tt>-qreport</tt> option, and then read the output listing carefully to see where the compiler thought it could parallelize, where it could not, and the reasons for this. Then you can go back to your code and carefully try to address each of the issues brought up by the compiler.
We ''emphasize'' that this is just a rough first guide, and that the compilers are still not magical! For more sophisticated approaches to parallelizing your code, email us at [mailto:support@scinet.utoronto.ca <support@scinet.utoronto.ca>] to set up an appointment with one
of our technical analysts.

===How do I link against the Intel Math Kernel Library?===

If you need to link to the Intel Math Kernal Library (MKL) with the intel compilers, just add the <pre>-mkl</pre> flag. There are in fact three flavours: <tt>-mkl=sequential</tt>, <tt>-mkl=parallel</tt> and <tt>-mkl=cluster</tt>, for the serial version, the threaded version and the mpi version, respectively. (Note: The cluster version is available only when using the intelmpi module and mpi compilation wrappers.)

If you need to link in the Intel Math Kernel Library (MKL) libraries to gcc/gfortran/c++, you are well advised to use the Intel(R) Math Kernel Library Link Line Advisor: http://software.intel.com/en-us/articles/intel-mkl-link-line-advisor/ for help in devising the list of libraries to link with your code.

'''''Note that this give the link line for the command line. When using this in Makefiles, replace $MKLPATH by ${MKLPATH}.'''''

'''''Note too that, unless the integer arguments you will be passing to the MKL libraries are actually 64-bit integers, rather than the normal int or INTEGER types, you want to specify 32-bit integers (lp64) .'''''

===Can the compilers on the login nodes be disabled to prevent accidentally using them?===

'''Answer:'''

You can accomplish this by modifying your .bashrc to not load the compiler modules. See [[Important .bashrc guidelines]].

==="relocation truncated to fit: R_X86_64_PC32": Huh?===

What does this mean, and why can't I compile this code?

'''Answer:'''

Welcome to the joys of the x86 architecture! You're probably having trouble building arrays larger than 2GB, individually or together. Generally, you have to try to use the medium or large x86 `memory model'. For the intel compilers, this is specified with the compile options

-mcmodel=medium -shared-intel

==="feupdateenv is not implemented and will always fail"===

How do I get rid of this and what does it mean?

'''Answer:'''
First note that, as ominous as it sounds, this is really just a warning, and has to do with the intel math library. You can ignore it (unless you really are trying to manually change the exception handlers for floating point exceptions such as divide by zero), or take the safe road and get rid off it by linking with the intel math functions library:<pre>-limf</pre>See also [[#How do I link against the Intel Math Kernel Library?]]

===Cannot find rdmacm library when compiling on GPC===

I get the following error building my code on GPC: "<tt>ld: cannot find -lrdmacm</tt>". Where can I find this library?

'''Answer:'''

This library is part of the MPI libraries; if your compiler is having problems picking it up, it probably means you are mistakenly trying to compile on the login nodes (scinet01..scinet04). The login nodes aren't part of the GPC; they are for logging into the data centre only. From there you must go to the GPC or TCS development nodes to do any real work.

=== Why do I get this error when I try to compile: "icpc: error #10001: could not find directory in which /usr/bin/g++41 resides" ?===

You are trying to compile on the login nodes. As described in the wiki ( https://support.scinet.utoronto.ca/wiki/index.php/GPC_Quickstart#Login ), or in the users guide you would have received with your account, Scinet supports two main clusters, with very different architectures. Compilation must be done on the development nodes of the appropriate cluster (in this case, gpc01-04). Thus, log into gpc01, gpc02, gpc03, or gpc04, and compile from there.

==Testing your Code==

=== Can I run a something for a short time on the development nodes? ===

I am in the process of playing around with the mpi calls in my code to get it to work. I do a lot of tests and each of them takes a couple of seconds only. Can I do this on the development nodes?

'''Answer:'''

Yes, as long as it's very brief (a few minutes). People use the development nodes
for their work, and you don't want to bog it down for people, and testing a real
code can chew up a lot more resources than compiling, etc. The procedures differ
depending on what machine you're using.

==== TCS ====

On the TCS you can run small MPI jobs on the tcs02 node, which is meant for
development use. But even for this test run on one node, you'll need a host file --
a list of hosts (in this case, all tcs-f11n06, which is the `real' name of tcs02)
that the job will run on. Create a file called `hostfile' containing the following:

tcs-f11n06
tcs-f11n06
tcs-f11n06
tcs-f11n06

for a 4-task run. When you invoke "poe" or "mpirun", there are runtime
arguments that you specify pointing to this file. You can also specify it
in an environment variable MP_HOSTFILE, so, if your file is in your /scratch directory, say
${SCRATCH}/hostfile, then you would do

<pre>
export MP_HOSTFILE=${SCRATCH}/hostfile
</pre>

in your shell. You will also need to create a <tt>.rhosts</tt> file in your
home director, again listing <tt>tcs-f11n06</tt> so that <tt>poe</tt>
can start jobs. After that you can simply run your program. You can use
mpiexec:

<pre>
mpiexec -n 4 my_test_program
</pre>

adding <tt> -hostfile /path/to/my/hostfile</tt> if you did not set the environment
variable above. Alternatively, you can run it with the poe command (do a "man poe" for details), or even by
just directly running it. In this case the number of MPI processes will by default
be the number of entries in your hostfile.

==== GPC ====

On the GPC one can run short test jobs on the GPC [[GPC_Quickstart#Compile.2FDevel_Nodes | development nodes ]]<tt>gpc01</tt>..<tt>gpc04</tt>;
if they are single-node jobs (which they should be) they don't need a hostfile. Even better, though, is to request an [[ Moab#Interactive | interactive ]] job and run the tests either in regular batch queue or using a short high availability [[ Moab#debug | debug ]] queue that is reserved for this purpose.

=== How do I run a longer (but still shorter than an hour) test job quickly ? ===

'''Answer'''

On the GPC there is a high turnover short queue called [[ Moab#debug | debug ]] that is designed for
this purpose. You can use it by adding
<pre>
#PBS -q debug
</pre>
to your submission script.

==Submitting your jobs==

===Error Submitting My Job: qsub: Bad UID for job execution MSG=ruserok failed ===

I write up a submission script as in the examples, but when I attempt to submit the job, I get the above error. What's wrong?

'''Answer:'''

This error will occur if you try to submit a job from the login nodes. The login nodes are the gateway to all of SciNet's systems (GPC, TCS, P7, ARC), which have different hardware and queueing systems. To submit a job, you must log into a development node for the particular cluster you are submitting to and submit from there.

=== How do I charge jobs to my RAC allocation? ===

'''Answer:'''

Please see the [[Moab#Accounting|accounting section of Moab page]].

===How can I automatically resubmit a job?===

Commonly you may have a job that you know will take longer to run than what is
permissible in the queue. As long as your program contains [[Checkpoints|checkpoint]] or
restart capability, you can have one job automatically submit the next. In
the following example it is assumed that the program finishes before
the 48 hour limit and then resubmits itself by logging into one
of the development nodes.

<source lang="bash">
#!/bin/bash
# MOAB/Torque example submission script for auto resubmission
# SciNet GPC
#
#PBS -l nodes=1:ppn=8,walltime=48:00:00
#PBS -N my_job

# DIRECTORY TO RUN - $PBS_O_WORKDIR is directory job was submitted from
cd $PBS_O_WORKDIR

# YOUR CODE HERE
./run_my_code

# RESUBMIT 10 TIMES HERE
num=$NUM
if [ $num -lt 10 ]; then
num=$(($num+1))
ssh gpc01 "cd $PBS_O_WORKDIR; qsub ./script_name.sh -v NUM=$num";
fi
</source>

<pre>
qsub script_name.sh -v
</pre>

You can alternatively use [[ Moab#Job_Dependencies | Job dependencies ]] through the queuing system which will not start one job until another job has completed.

If your job can't be made to automatically stop before the 48 hour queue window, but it does write out checkpoints, you can use the timeout command to stop the program while you still have time to resubmit; for instance

<source lang="bash">
timeout 2850m ./run_my_code argument1 argument2
</source>

will run the program for 47.5 hours (2850 minutes), and then send it SIGTERM to exit the program.

===How can I pass in arguments to my submission script?===

If you wish to make your scripts more generic you can use qsub's ability
to pass in environment variables to pass in arguments to your script.
The following example shows a case where an input and an output
file are passed in on the qsub line. Multiple variables can be
passed in using the qsub "-v" option and comma delimited.

<source lang="bash">
#!/bin/bash
# MOAB/Torque example of passing in arguments
# SciNet GPC
#
#PBS -l nodes=1:ppn=8,walltime=48:00:00
#PBS -N my_job

# DIRECTORY TO RUN - $PBS_O_WORKDIR is directory job was submitted from
cd $PBS_O_WORKDIR

# YOUR CODE HERE
./run_my_code -f $INFILE -o $OUTFILE
</source>

<pre>
qsub script_name.sh -v INFILE=input.txt,OUTFILE=outfile.txt
</pre>

===I submit my GPC job, and I get an email saying it was rejected===
'''Answer:'''

This happens because the job you've submitted breaks one of the rules of the queues and is rejected. An email
is sent with the JOBID, JOBNAME, and the reason it was rejected. The following is an example where a job
requests more than 48 hours and was rejected.

<pre>
PBS Job Id: 3462493.gpc-sched
Job Name: STDIN
job deleted
Job deleted at request of root@gpc-sched
MOAB_INFO: job was rejected - job violates class configuration 'wclimit too high for class 'batch_ib' (345600 > 172800)'
</pre>

Jobs on the TCS or GPC may only run for 48 hours at a time; this restriction greatly increases responsiveness of the queue and queue throughput for all our users. If your computation requires longer than that, as many do, you will have to [[ Checkpoints | checkpoint ]] your job and restart it after each 48-hour queue window. You can manually re-submit jobs, or if you can have your job cleanly exit before the 48 hour window, there are ways to [[ FAQ#How_can_I_automatically_resubmit_a_job.3F | automatically resubmit jobs ]].

Other rejections return a more cryptic error saying "job violates class configuration" such as follows:
<pre>
PBS Job Id: 3462409.gpc-sched
Job Name: STDIN
job deleted
Job deleted at request of root@gpc-sched
MOAB_INFO: job was rejected - job violates class configuration 'user required by class 'batch''
</pre>

The most common problems that result in this error are:

* '''Incorrect number of processors per node''': Jobs on the GPC are scheduled per-node not per-core and since each node has 8 processor cores (ppn=8) the smallest job allowed is one node with 8 cores (nodes=1:ppn=8). For serial jobs users must bundle or batch them together in groups of 8. See [[ FAQ#How_do_I_run_serial_jobs_on_GPC.3F | How do I run serial jobs on GPC? ]]
* '''No number of nodes specified''': Jobs submitted to the main queue must request a specific number of nodes, either in the submission script (with a line like <tt>#PBS -l nodes=2:ppn=8</tt>) or on the command line (eg, <tt>qsub -l nodes=2:ppn=8,walltime=5:00:00 script.pbs</tt>). Note that for the debug queue, you can get away without specifying a number of nodes and a default of one will be assigned; for both technical and policy reasons, we do not enforce such a default for the main ("batch") queue.
* '''There is a 15 minute walltime minimum''' on all queues except debug and if you set your walltime less than this, it will be rejected.

=== When submitting your job, fails saying: "script is written in DOS/Windows text format" ===

'''Answer:'''

Very likely you have written your script in a windows machine, so to fix this you just need to change the format of you submission script to unix from Windows/DOS.
Use the command below for all your script files:

<pre>
dos2unix <pbs-script-file>
</pre>

where <pbs-script-file> has to be substituted by the name of your script file.

==Running your jobs==

===My job can't write to /home===

My code works fine when I test on the development nodes, but when I submit a job, or even run interactively in the development queue on GPC, it fails. What's wrong?

'''Answer:'''

As [[Data_Management#Home_Disk_Space | discussed]] [https://support.scinet.utoronto.ca/wiki/images/5/54/SciNet_Tutorial.pdf elsewhere], <tt>/home</tt> is mounted read-only on the compute nodes; you can only write to <tt>/home</tt> from the login nodes and devel nodes. (The [[GPC_Quickstart#128Glargemem | largemem nodes]] on GPC, in this respect, are more like devel nodes than compute nodes). In general, to run jobs you can read from <tt>/home</tt> but you'll have to write to <tt>/scratch</tt> (or, if you were allocated space through the RAC process, on <tt>/project</tt>). More information on SciNet filesytems can be found on our [[Data_Management | Data Management]] page.

===OpenMP on the TCS===

How do I run an OpenMP job on the TCS?

'''Answer:'''

Please look at the [[TCS_Quickstart#Submission_Script_for_an_OpenMP_Job | TCS Quickstart ]] page.

===Can I can use hybrid codes consisting of MPI and openMP on the GPC?===

'''Answer:'''

Yes. Please look at the [[GPC_Quickstart#Hybrid_MPI.2FOpenMP_jobs | GPC Quickstart ]] page.

===How do I run serial jobs on GPC?===

'''Answer''':

So it should be said first that SciNet is a parallel computing resource,
and our priority will always be parallel jobs. Having said that, if
you can make efficient use of the resources using serial jobs and get
good science done, that's good too, and we're happy to help you.

The GPC nodes each have 8 processing cores, and making efficient use of these
nodes means using all eight cores. As a result, we'd like to have the
users take up whole nodes (eg, run multiples of 8 jobs) at a time.

It depends on the nature of your job what the best strategy is. Several approaches are presented on the [[User_Serial|serial run wiki page]].

===Why can't I request only a single cpu for my job on GPC?===

'''Answer''':

On GPC, computers are allocated by the node - that is, in chunks of 8 processors. If you want to run a job that requires only one processor, you need to bundle the jobs into groups of 8, so as to not be wasting the other 7 for 48 hours. See [[User_Serial|serial run wiki page]].

===How do I run serial jobs on TCS?===

'''Answer''': You don't.

===But in the queue I found a user who is running jobs on GPC, each of which is using only one processor, so why can't I?===

'''Answer''':

The pradat* and atlaspt* jobs, amongst others, are jobs of the ATLAS high energy physics project. That they are reported as single cpu jobs is an artifact of the moab scheduler. They are in fact being automatically bundled into 8-job bundles but have to run individually to be compatible with their international grid-based systems.

===How do I use the ramdisk on GPC?===

To use the ramdisk, create and read to / write from files in /dev/shm/.. just as one would to (eg) ${SCRATCH}. Only the amount of RAM needed to store the files will be taken up by the temporary file system; thus if you have 8 serial jobs each requiring 1 GB of RAM, and 1GB is taken up by various OS services, you would still have approximately 7GB available to use as ramdisk on a 16GB node. However, if you were to write 8 GB of data to the RAM disk, this would exceed available memory and your job would likely crash.

It is very important to delete your files from ram disk at the end of your job. If you do not do this, the next user to use that node will have less RAM available than they might expect, and this might kill their jobs.

''More details on how to setup your script to use the ramdisk can be found on the [[User_Ramdisk|Ramdisk wiki page]].''

=== How can I run a job longer than 48 hours? ===

'''Answer:'''

The SciNet queues have a queue limit of 48 hours. This is pretty typical for systems of its size in Canada and elsewhere, and larger systems commonly have shorter limits. The limits are there to ensure that every user gets a fair share of the system (so that no one user ties up lots of nodes for a long time), and for safety (so that if one memory board in one node fails in the middle of a very long job, you haven't lost a months' worth of work).

Since many of us have simulations that require more than that much time, most widely-used scientific applications have "checkpoint-restart" functionality, where every so often the complete state of the calculation is stored as a checkpoint file, and one can restart a simulation from one of these. In fact, these restart files tend to be quite useful for a number of purposes.

If your job will take longer, you will have to submit your job in multiple parts, restarting from a checkpoint each time. In this way, one can run a simulation much longer than the queue limit. In fact, one can even write job scripts which automatically re-submit themselves until a run is completed, using [[FAQ#How_can_I_automatically_resubmit_a_job.3F | automatic resubmission. ]]

=== Why did showstart say it would take 3 hours for my job to start before, and now it says my job will start in 10 hours? ===

'''Answer:'''

Please look at the [[FAQ#How_do_priorities_work.2Fwhy_did_that_job_jump_ahead_of_mine_in_the_queue.3F | How do priorities work/why did that job jump ahead of mine in the queue? ]] page.

===How do priorities work/why did that job jump ahead of mine in the queue?===

'''Answer:'''

The [[Moab | queueing system]] used on SciNet machines is a [http://en.wikipedia.org/wiki/Priority_queue Priority Queue]. Jobs enter the queue at the back of the queue, and slowly make their way to the front as those ahead of them are run; but a job that enters the queue with a higher priority can `cut in line'.

The main factor which determines priority is whether or not the user (or their PI) has an [http://wiki.scinethpc.ca/wiki/index.php/Application_Process RAC allocation]. These are competitively allocated grants of computer time; there is a call for proposals towards the end of every calendar year. Users with an allocation have high priorities in an attempt to make sure that they can use the amount of computer time the committees granted them. Their priority decreases as they approach their allotted usage over the current window of time; by the time that they have exhausted that allotted usage, their priority is the same as users with no allocation (unallocated, or `default' users). Unallocated users have a fixed, low, priority.

This priority system is called `fairshare'; the scheduler attempts to make sure everyone has their fair share of the machines, where the share that's fair has been determined by the allocation committee. The fairshare window is a rolling window of two weeks; that is, any time you have a job in the queue, the fairshare calculation of its priority is given by how much of your allocation of the machine has been used in the last 14 days.

A particular allocation might have some fraction of GPC - say 4% of the machine (if the PI had been allocated 10 million CPU hours on GPC). The allocations have labels; (called `Resource Allocation Proposal Identifiers', or RAPIs) they look something like

abc-123-ab

where abc-123 is the PIs CCRI, and the suffix specifies which of the allocations granted to the PI is to be used. These can be specified on a job-by-job basis. On GPC, one adds the line
#PBS -A RAPI
to your script; on TCS, one uses
# @ account_no = RAPI
If the allocation to charge isn't specified, a default is used; each user has such a default, which can be changed at the same portal where one changes one's password,

https://portal.scinet.utoronto.ca/

A jobs priority is determined primarily by the fairshare priority of the allocation it is being charged to; the previous 14 days worth of use under that allocation is calculated and compared to the allocated fraction (here, 5%) of the machine over that window (here, 14 days). The fairshare priority is a decreasing function of the allocation left; if there is no allocation left (eg, jobs running under that allocation have already used 379,038 CPU hours in the past 14 days), the priority is the same as that of a user with no granted allocation. (This last part has been the topic of some debate; as the machine gets more utilized, it will probably be the case that we allow RAC users who have greatly overused their quota to have their priorities to drop below that of unallocated users, to give the unallocated users some chance to run on our increasingly crowded system; this would have no undue effect on our allocated users as they still would be able to use the amount of resources they had been allocated by the committees.) Note that all jobs charging the same allocation get the same fairshare priority.

There are other factors that go into calculating priority, but fairshare is the most significant. Other factors include
* amount of time waiting in queue (measured in units of the requested runtime). A waiting queue job gains priority as it sits in the queue to avoid job starvation.
* User adjustment of priorities ( See below ).

The major effect of these subdominant terms is to shuffle the order of jobs running under the same allocation.

===How do we manage job priorities within our research group?===

'''Answer:'''

Obviously, managing shared resources within a large group - whether it
is conference funding or CPU time - takes some doing.

It's important to note that the fairshare periods are intentionally kept
quite short - just two weeks long. So, for example, let us say that in your resource
allocation you have about 10% of the machine. Then for someone to use
up the whole two week amount of time in 2 days, they'd have to use 70%
of the machine in those two days - which is unlikely to happen by
accident. If that does happen,
those using the same allocation as the person who used 70% of the
machine over the two days will suffer by having much lower priority for
their jobs, but only for the next 12 days - and even then, if there are
idle cpus they'll still be able to compute.

There will be online tools for seeing how the allocation is being used,
and those people who are in charge in your group will be able to use
that information to manage the users, telling them to dial it down or
up. We know that managing a large research group is hard, and we want
to make sure we provide you the information you need to do your job
effectively.

One way for users within a group to manage their priorities within the group
is with [[Moab#Adjusting_Job_Priority | user-adjusted priorities]]; this is
described in more detail on the [[Moab | Scheduling System]] page.



=== I couldn't find the .o output file in the .pbs_spool directory as I used to ===

On Feb 24 2011, the temporary location of standard input and output files was moved from the shared file system ${SCRATCH}/.pbs_spool to the
node-local directory /var/spool/torque/spool (which resides in ram). The final location after a job has finished is unchanged,
but to check the output/error of running jobs, users will now have to ssh into the (first) node assigned to the job and look in
/var/spool/torque/spool.

This alleviates access contention to the temporary directory, especially for those users that are running a lot of jobs, and reduces the burden on the file system in general.

Note that it is good practice to redirect output to a file rather than to count on the scheduler to do this for you.

=== My GPC job died, telling me `Copy Stageout Files Failed' ===

'''Answer:'''

When a job runs on GPC, the script's standard output and error are redirected to
<tt>$PBS_JOBID.gpc-sched.OU</tt> and <tt>$PBS_JOBID.gpc-sched.ER</tt> in
/var/spool/torque/spool on the (first) node on which your job is running. At the end of the job, those .OU and .ER files are copied to where the batch script tells them to be copied, by default <tt>$PBS_JOBNAME.o$PBS_JOBID</tt> and<tt>$PBS_JOBNAME.e$PBS_JOBID</tt>. (You can set those filenames to be something clearer with the -e and -o options in your PBS script.)

When you get errors like this:
<pre>
An error has occurred processing your job, see below.
request to copy stageout files failed on node
</pre>
it means that the copying back process has failed in some way. There could be a few reasons for this. The first thing to '''make sure that your .bashrc does not produce any output''', as the output-stageout is performed by bash and further output can cause this to fail.
But it also could have just been a random filesystem error, or it could be that your job failed spectacularly enough to shortcircuit the normal job-termination process (e.g. ran out of memory very quickly) and those files just never got copied.

Write to [mailto:support@scinet.utoronto.ca <support@scinet.utoronto.ca>] if your input/output files got lost, as we will probably be able to retrieve them for you (please supply at least the jobid, and any other information that may be relevant).

Mind you that it is good practice to redirect output to a file rather than depending on the job scheduler to do this for you.



===IB Memory Errors, eg <tt> reg_mr Cannot allocate memory </tt>===

Infiniband requires more memory than ethernet; it can use RDMA (remote direct memory access) transport for which it sets aside registered memory to transfer data.

In our current network configuration, it requires a _lot_ more memory, particularly as you go to larger process counts; unfortunately, that means you can't get around the "I need more memory" problem the usual way, by running on more nodes. Machines with different memory or
network configurations may exhibit this problem at higher or lower MPI
task counts.

Right now, the best workaround is to reduce the number and size of OpenIB queues, using XRC: with the OpenMPI, add the following options to your mpirun command:

<pre>
-mca btl_openib_receive_queues X,128,256,192,128:X,2048,256,128,32:X,12288,256,128,32 -mca btl_openib_max_send_size 12288
</pre>

With Intel MPI, you should be able to do

<pre>
module load intelmpi/4.0.3.008
mpirun -genv I_MPI_FABRICS=shm:ofa -genv I_MPI_OFA_USE_XRC=1 -genv I_MPI_OFA_DYNAMIC_QPS=1 -genv I_MPI_DEBUG=5 -np XX ./mycode
</pre>

to the same end.

For more information see [[GPC MPI Versions]].

===My compute job fails, saying <tt>libpng12.so.0: cannot open shared object file</tt> or <tt>libjpeg.so.62: cannot open shared object file</tt>===

'''Answer:'''

To maximize the amount of memory available for compute jobs, the compute nodes have a less complete system image than the development nodes. In particular, since interactive graphics libraries like matplotlib and gnuplot are usually used interactively, the libraries for their use are included in the devel nodes' image but not the compute nodes.

Many of these extra libraries are, however, available in the "extras" module. So adding a "module load extras" to your job submission script - or, for overkill, to your .bashrc - should enable these scripts to run on the compute nodes.

==Monitoring jobs in the queue==

===Why hasn't my job started?===

'''Answer:'''

Use the moab command

<pre>
checkjob -v jobid
</pre>

and the last couple of lines should explain why a job hasn't started.

Please see [[Moab| Job Scheduling System (Moab) ]] for more detailed information

===How do I figure out when my job will run?===

'''Answer:'''

Please see [[Moab#Available_Resources| Job Scheduling System (Moab) ]]



===Running checkjob on my job gives me messages about JobFail and rejected===

Running checkjob on my job gives me messages that suggest my job has failed, as below: what did I do wrong?

<pre>
AName: test
State: Idle
Creds: user:xxxxxx group:xxxxxxxx account:xxxxxxxx class:batch_ib qos:ibqos
WallTime: 00:00:00 of 8:00:00
BecameEligible: Wed Jul 23 10:39:27
SubmitTime: Wed Jul 23 10:38:22
(Time Queued Total: 00:01:47 Eligible: 00:01:05)

Total Requested Tasks: 8

Req[0] TaskCount: 8 Partition: ALL
Opsys: centos6computeA Arch: --- Features: ---

Notification Events: JobFail

IWD: /scratch/x/xxxxxxxx/xxxxxxx/xxxxxxx
Partition List: torque,DDR
Flags: RESTARTABLE
Attr: checkpoint
StartPriority: 76
rejected for Opsys - (null)
rejected for State - (null)
rejected for Reserved - (null)
NOTE: job req cannot run in partition torque (available procs do not meet requirements : 0 of 8 procs found)
idle procs: 793 feasible procs: 0

Node Rejection Summary: [Opsys: 117][State: 2895][Reserved: 19]

NOTE: job violates constraints for partition SANDY (partition SANDY not in job partition mask)

NOTE: job violates constraints for partition GRAVITY (partition GRAVITY not in job partition mask)

rejected for State - (null)
NOTE:
</pre>

'''Answer:'''

The output from check job is a little cryptic in places, and if you are wondering why your job hasn't started yet, you might think that "rejection" and "JobFail" suggest that there's something wrong. But the above message is actually normal; you can use the <tt>showstart</tt> command on your job to get a (preliminary, subject to change) estimate as to when the job will start, and you'll find that it is in fact scheduled to start up in the near future.

In the above message:

* `Notification Events: JobFail` just means that, if notifications are enabled, you'll get a message if the job fails;
* `job req cannot run in partition torque` just means that the job cannot run just yet (that's why it's queued);
* `job req cannot run in dynamic partition DDR now (insufficient procs available: 0 < 8)` says why: there aren't processors available; and
* `job violates constraints for partition SANDY/GRAVITY` just means that the job isn't eligable to run in those paritcular (small) sections of the cluster.

that is, the above output is the normal and expected (if somewhat cryptic) explanation as to why the job is waiting - nothing to worry about.

===How can I monitor my running jobs on TCS?===

How can I monitor the load of TCS jobs?

'''Answer:'''

You can get more information with the command
/xcat/tools/tcs-scripts/LL/jobState.sh
which I alias as:
alias llq1='/xcat/tools/tcs-scripts/LL/jobState.sh'
If you run "llq1 -n" you will see a listing of jobs together with a lot of information, including the load.

===How can I check the memory usage from my jobs?===

How can I check the memory usage from my jobs?

'''Answer:'''

In many occasions it can be really useful to take a look at how much memory your job is using while it is running.
There a couple of ways to do so:

1) using some of the [https://wiki.scinet.utoronto.ca/wiki/index.php/SciNet_Command_Line_Utilities command line utilities] we have developed, e.g: by using the '''jobperf''' or '''jobtop''' utilities, it will allow you to check the job performance and head's node utilization respectively.

2) ''ssh'' into the nodes where your job is being run and check for memory usage and system stats right there. For instance, trying the 'top' or 'free' commands, in those nodes.

Also, it always a good a idea and strongly encouraged to inspect the standard output-log and error-log generated for your job submissions.
These files are named respectively: ''JobName.{o|e}jobIdNumber''; where ''JobName'' is the name you gave to the job (via the '-N' PBS flag) and ''JobIdNumber'' is the id number of the job.
These files are saved in the working directory after the job is finished, but they can be also accessed on real-time using the '''jobError''' and '''jobOutput''' [https://wiki.scinet.utoronto.ca/wiki/index.php/SciNet_Command_Line_Utilities command line utilities] available loading the ''extras'' module.

Other related topics to memory usage: 
[https://wiki.scinet.utoronto.ca/wiki/index.php/GPC_Quickstart#Ram_Disk Using Ram Disk]
 
[https://wiki.scinet.utoronto.ca/wiki/index.php/GPC_Quickstart#Memory_Configuration Different Memory Configuration nodes]
 
[https://wiki.scinet.utoronto.ca/wiki/index.php/FAQ#Monitoring_jobs_in_the_queue Monitoring Jobs in the Queue]
 
[https://wiki.scinet.utoronto.ca/wiki/images/a/a0/TechTalkJobMonitoring.pdf Tech Talk on Monitoring Jobs]

===Can I run cron jobs on devel nodes to monitor my jobs?===

Can I run cron jobs on devel nodes to monitor my jobs?

'''Answer:'''

No, we do not permit cron jobs to be run by users. To monitor the status of your jobs using a cron job running on your own machine, use the command

<pre>
ssh myusername@login.scinet.utoronto.ca "qstat -u myusername"
</pre>

or some variation of this command. Of course, you will need to have SSH keys setup on the machine running the cron job, so that password entry won't be necessary.

=== How does one check the amount of used CPU-hours in a project, and how does one get statistics for each user in the project? ===

'''Answer:'''

This information is available on the scinet portal,https://portal.scinet.utoronto.ca, See also [[SciNet Usage Reports]].



=== I couldn't find the .o output file in the .pbs_spool directory as I used to ===

On Feb 24 2011, the temporary location of standard input and output files was moved from the shared file system ${SCRATCH}/.pbs_spool to the
node-local directory /var/spool/torque/spool (which resides in ram). The final location after a job has finished is unchanged,
but to check the output/error of running jobs, users will now have to ssh into the (first) node assigned to the job and look in
/var/spool/torque/spool.

This alleviates access contention to the temporary directory, especially for those users that are running a lot of jobs, and reduces the burden on the file system in general.

Note that it is good practice to redirect output to a file rather than to count on the scheduler to do this for you.

=== My GPC job died, telling me `Copy Stageout Files Failed' ===

'''Answer:'''

When a job runs on GPC, the script's standard output and error are redirected to
<tt>$PBS_JOBID.gpc-sched.OU</tt> and <tt>$PBS_JOBID.gpc-sched.ER</tt> in
/var/spool/torque/spool on the (first) node on which your job is running. At the end of the job, those .OU and .ER files are copied to where the batch script tells them to be copied, by default <tt>$PBS_JOBNAME.o$PBS_JOBID</tt> and<tt>$PBS_JOBNAME.e$PBS_JOBID</tt>. (You can set those filenames to be something clearer with the -e and -o options in your PBS script.)

When you get errors like this:
<pre>
An error has occurred processing your job, see below.
request to copy stageout files failed on node
</pre>
it means that the copying back process has failed in some way. There could be a few reasons for this. The first thing to '''make sure that your .bashrc does not produce any output''', as the output-stageout is performed by bash and further output can cause this to fail.
But it also could have just been a random filesystem error, or it could be that your job failed spectacularly enough to shortcircuit the normal job-termination process (e.g. ran out of memory very quickly) and those files just never got copied.

Write to [mailto:support@scinet.utoronto.ca <support@scinet.utoronto.ca>] if your input/output files got lost, as we will probably be able to retrieve them for you (please supply at least the jobid, and any other information that may be relevant).

Mind you that it is good practice to redirect output to a file rather than depending on the job scheduler to do this for you.



===IB Memory Errors, eg <tt> reg_mr Cannot allocate memory </tt>===

Infiniband requires more memory than ethernet; it can use RDMA (remote direct memory access) transport for which it sets aside registered memory to transfer data.

In our current network configuration, it requires a _lot_ more memory, particularly as you go to larger process counts; unfortunately, that means you can't get around the "I need more memory" problem the usual way, by running on more nodes. Machines with different memory or
network configurations may exhibit this problem at higher or lower MPI
task counts.

Right now, the best workaround is to reduce the number and size of OpenIB queues, using XRC: with the OpenMPI, add the following options to your mpirun command:

<pre>
-mca btl_openib_receive_queues X,128,256,192,128:X,2048,256,128,32:X,12288,256,128,32 -mca btl_openib_max_send_size 12288
</pre>

With Intel MPI, you should be able to do

<pre>
module load intelmpi/4.0.3.008
mpirun -genv I_MPI_FABRICS=shm:ofa -genv I_MPI_OFA_USE_XRC=1 -genv I_MPI_OFA_DYNAMIC_QPS=1 -genv I_MPI_DEBUG=5 -np XX ./mycode
</pre>

to the same end.

For more information see [[GPC MPI Versions]].

===My compute job fails, saying <tt>libpng12.so.0: cannot open shared object file</tt> or <tt>libjpeg.so.62: cannot open shared object file</tt>===

'''Answer:'''

To maximize the amount of memory available for compute jobs, the compute nodes have a less complete system image than the development nodes. In particular, since interactive graphics libraries like matplotlib and gnuplot are usually used interactively, the libraries for their use are included in the devel nodes' image but not the compute nodes.

Many of these extra libraries are, however, available in the "extras" module. So adding a "module load extras" to your job submission script - or, for overkill, to your .bashrc - should enable these scripts to run on the compute nodes.

-->

==Data on SciNet disks==

===How do I find out my disk usage?===

'''Answer:'''

The standard unix/linux utilities for finding the amount of disk space used by a directory are very slow, and notoriously inefficient on the GPFS filesystems that we run on the SciNet systems. There are utilities that very quickly report your disk usage:

The <tt>'''diskUsage'''</tt> command, available with the 'extras' module on the login nodes, datamovers and the GPC devel nodes, provides information in a number of ways on the home, scratch, and project file systems. For instance, how much disk space is being used by yourself and your group (with the -a option), or how much your usage has changed over a certain period ("delta information") or you may generate plots of your usage over time.
This information is only updated hourly!

More information about these filesystems is available at the [[Data_Management | Data_Management]].

===How do I transfer data to/from SciNet?===

'''Answer:'''

All incoming connections to SciNet go through relatively low-speed connections to the <tt>login.scinet</tt> gateways, so using scp to copy files the same way you ssh in is not an effective way to move lots of data. Better tools are described in our page on [[Data_Management#Data_Transfer | Data Transfer]].

===My group works with data files of size 1-2 GB. Is this too large to transfer by scp to login.scinet.utoronto.ca ?===

'''Answer:'''

Generally, occasion transfers of data less than 10GB is perfectly acceptible to so through the login nodes. See [[Data_Management#Data_Transfer | Data Transfer]].

===How can I check if I have files in /scratch that are scheduled for automatic deletion?===

'''Answer:'''

Please see [[Storage_Quickstart#Scratch_Disk_Purging_Policy | Storage At SciNet]]

===How to allow my supervisor to manage files for me using ACL-based commands?===

'''Answer:'''

Please see [[Data_Management#File.2FOwnership_Management_.28ACL.29 | File/Ownership Management]]

===Can we buy extra storage space on SciNet?===

'''Answer:'''
Yes, please see [[Data_Management#Buying_storage_space_on_GPFS_or_HPSS | Buying storage space on GPFS or HPSS ]] for more details.

===Can I transfer files between BGQ and HPSS?===

'''Answer:'''
Yes, please see [https://support.scinet.utoronto.ca/wiki/index.php/BGQ#Bridge_to_HPSS Bridge to HPSS ] for more details.

==Keep 'em Coming!==

===Next question, please===

Send your question to [mailto:support@scinet.utoronto.ca <support@scinet.utoronto.ca>]; we'll answer it asap!

Oldwiki.scinet.utoronto.ca:System Alerts

2015-09-24T04:52:50Z

Jchong: /* System Status */

== System Status==

{|
|[[File:up.png| down|link=GPC Quickstart]][[GPC Quickstart|GPC]]
|[[File:down.png| down|link=TCS Quickstart]][[TCS Quickstart|TCS]]
|[[File:up.png| down|link=Sandy]][[Sandy]]
|[[File:up.png| down|link=GPU Devel Nodes]][[GPU Devel Nodes|ARC]]
|[[File:up.png| down]]File System
|-
|[[File:up.png| up|link=Gravity]][[Gravity]]
|[[File:down.png| down|link=P7 Linux Cluster]][[P7 Linux Cluster|P7]]
|[[File:up.png| up|link=BGQ]][[BGQ]]
|[[File:down.png| down|link=HPSS]][[HPSS]]
|
|}
Thu Sep 24 00:52:29 EDT 2015 BGQ and GPC is up now.

Wed Sep 23 23:41:40 EDT 2015: We are in the process of booting GPC up now after some filesystem issue. GPC should be up in an hour.

Wed 23 Sep 2015 22:23:51: Cooling has been restored. New networking issue causing problems with bringing up systems. More later

Wed 23 Sep 2015 18:33:09: '''**DELAY**''' encountered. A control board failed on the chiller and needs to be replaced. Expect to have cooling restored by 10PM and then will bring up systems.

On Wednesday September 23, 2015, downtime has been scheduled for maintenance and improvements on the SciNet data centre cooling system. Significant work will be done to improve the serviceability and durability of the system. This requires a shut-down for most of the day, which will start in the morning around 6:00 am. All login sessions and jobs have been killed at that time.

Services are expected to be available again around 8:00 pm today.
Check back here (on the SciNet wiki's front page) for updates.

Oldwiki.scinet.utoronto.ca:System Alerts

2015-09-24T04:25:25Z

Jchong:

== System Status==

{|
|[[File:up.png| down|link=GPC Quickstart]][[GPC Quickstart|GPC]]
|[[File:down.png| down|link=TCS Quickstart]][[TCS Quickstart|TCS]]
|[[File:up.png| down|link=Sandy]][[Sandy]]
|[[File:up.png| down|link=GPU Devel Nodes]][[GPU Devel Nodes|ARC]]
|[[File:up.png| down]]File System
|-
|[[File:up.png| down|link=Gravity]][[Gravity]]
|[[File:down.png| down|link=P7 Linux Cluster]][[P7 Linux Cluster|P7]]
|[[File:down.png| down|link=BGQ]][[BGQ]]
|[[File:down.png| down|link=HPSS]][[HPSS]]
|
|}

<h4>Scheduled Downtime on Wed. Sept. 23, 2015 6AM-8PM In Effect</h4>
Wed Sep 23 23:41:40 EDT 2015: We are in the process of booting GPC up now after some filesystem issue. GPC should be up in an hour.

Wed 23 Sep 2015 22:23:51: Cooling has been restored. New networking issue causing problems with bringing up systems. More later

Wed 23 Sep 2015 18:33:09: '''**DELAY**''' encountered. A control board failed on the chiller and needs to be replaced. Expect to have cooling restored by 10PM and then will bring up systems.

On Wednesday September 23, 2015, downtime has been scheduled for maintenance and improvements on the SciNet data centre cooling system. Significant work will be done to improve the serviceability and durability of the system. This requires a shut-down for most of the day, which will start in the morning around 6:00 am. All login sessions and jobs have been killed at that time.

Services are expected to be available again around 8:00 pm today.
Check back here (on the SciNet wiki's front page) for updates.

Oldwiki.scinet.utoronto.ca:System Alerts

2015-09-24T03:42:32Z

Jchong: /* System Status */

== System Status==

{|
|[[File:down.png| down|link=GPC Quickstart]][[GPC Quickstart|GPC]]
|[[File:down.png| down|link=TCS Quickstart]][[TCS Quickstart|TCS]]
|[[File:down.png| down|link=Sandy]][[Sandy]]
|[[File:down.png| down|link=GPU Devel Nodes]][[GPU Devel Nodes|ARC]]
|[[File:down.png| down]]File System
|-
|[[File:down.png| down|link=Gravity]][[Gravity]]
|[[File:down.png| down|link=P7 Linux Cluster]][[P7 Linux Cluster|P7]]
|[[File:down.png| down|link=BGQ]][[BGQ]]
|[[File:down.png| down|link=HPSS]][[HPSS]]
|
|}

<h4>Scheduled Downtime on Wed. Sept. 23, 2015 6AM-8PM In Effect</h4>
Wed Sep 23 23:41:40 EDT 2015: We are in the process of booting GPC up now after some filesystem issue. GPC should be up in an hour.

Wed 23 Sep 2015 22:23:51: Cooling has been restored. New networking issue causing problems with bringing up systems. More later

Wed 23 Sep 2015 18:33:09: '''**DELAY**''' encountered. A control board failed on the chiller and needs to be replaced. Expect to have cooling restored by 10PM and then will bring up systems.

On Wednesday September 23, 2015, downtime has been scheduled for maintenance and improvements on the SciNet data centre cooling system. Significant work will be done to improve the serviceability and durability of the system. This requires a shut-down for most of the day, which will start in the morning around 6:00 am. All login sessions and jobs have been killed at that time.

Services are expected to be available again around 8:00 pm today.
Check back here (on the SciNet wiki's front page) for updates.

File:Jchong.txt

2015-07-10T16:00:13Z

Jchong: testing

testing

Data Transfer

2015-05-26T16:43:36Z

Jchong: /* Globus data transfer */

=== General guidelines ===

All traffic to and from the data centre has to go via [http://en.wikipedia.org/wiki/Secure_Shell SSH], or secure shell.
This is a protocol which sets up a secure connection between two sites. In all cases, incoming connections to SciNet go through relatively low-speed connections to the login.scinet gateways, but there are many ways to copy files on top of the ssh protocol.

What node to use for data transfer to and from SciNet depends mostly on the amount of data to transfer:

==== Moving <10GB through the login nodes ====

The login nodes are accessible from outside SciNet, which means that you can transfer data between your own office/home machine and SciNet using scp or rsync (see below). Although the login nodes have a cpu_time timeout of 5 minutes (emphasis on cpu_time, not wall_time), most likely if you try to transfer more than 10GB probably you wouldn't succeed. While the login nodes can be used for transfers of less than 10GB, using a datamover node would still be faster.

Note that transfers through a login node will timeout after a certain time (currently set to 5 minutes cpu_time), so if you have a slow connection you may need to go through datamover1.

==== Moving >10GB through the datamover1 node ====

Serious moves of data (>10GB) to or from SciNet should be done from <tt>datamover1</tt> or <tt>datamover2</tt> nodes. From any of the interactive SciNet nodes, one should be able to <tt>ssh datamover1|2</tt> to log in. Those are the machines that have the fastest network connections to the outside world (by a factor of 10; a 10Gb/s link as vs 1Gb/s).

Transfers must be ''originated'' from <tt>datamover1|2</tt>; that is, one can not copy files from the outside world directly to or from the datamovers; one has to log in to a datamover and copy the data to or from the outside network. Your local machine must be reachable from the outside as well, either by its name or its IP address. If you are behind a firewall or a (wireless) router, this may not be possible. You may need to ask your network administrator to allow the datamovers to ssh to your machine. If you need to open a hole on your firewall we provide their IPs:

datamover1 142.150.188.121
datamover2 142.150.188.122

==== Hpn-ssh ====

The usual ssh protocols were not designed for speed. On the <tt>datamover1|2</tt> nodes, we have installed hpn-ssh, or [http://www.psc.edu/networking/projects/hpn-ssh/ High-Performance-enabled ssh]. You use this higher-performance ssh/scp/sftp variant by loading the `hpnssh' module. Hpn-ssh is backwards compatible with the `usual' ssh, but is capable of significantly higher speeds. If you routinely have large data transfers to do, we recommend having your system administrator look into installing [http://www.psc.edu/networking/projects/hpn-ssh/ hpn-ssh] on your system.

Everything we discuss below, unless otherwise stated, will work regardless of whether you have hpn-ssh installed on your remote system.

==== For Microsoft Windows users ====

Linux-windows transfers can be a bit more involved than linux-to-linux, but using [http://www.cygwin.com Cygwin], this should not be a problem. Make sure you install Cygwin with the openssh package.

If you want to remain 100% a Windows environment, another very good tool is [http://winscp.net/eng/index.php WinSCP]. It will let you easily transfer and synchronize data between your Windows workstation and the login nodes using your ssh credentials (provided that it's not much more than 10GB on each sync pass).

If you are going to use the [[Data_Management#Moving_.3E10GB_through_the_datamover1_node | datamover1 method]], and assuming your machine is not a wireless laptop (if it
is, best to find a nearby computer that's not wireless and use a usb
memory stick), you'll need the IP address of your machine, which you find by
typing "ipconfig /all" on your local windows machine. Also, you will need to have the ssh daemon (sshd) running locally in Cygwin.

Also note that your windows user name does not have to be the same as on SciNet, this just
depends on how your local windows system was set up.

All locations given to scp or rsync in cygwin have to be in unix format (using "/" not "\"), and will be relative to cygwin's path, not windows (e.g.
use /cygdrive/c/...... to get to the windows C: drive).

=== Ways to transfer data ===

==== Globus data transfer ====

Globus is a file-transfer service with an easy-to-use web interface to allow people to transfer file with ease. To get started, please sign up for a Globus account at [https://www.globus.org/ Globus website]. Once you sign up for an account, go to [https://www.globus.org/xfer/StartTransfer this page] to start the file transfer. Please enter computecanada#gpc as one endpoint for file transfer. The computecanada#gpc endpoint requires authentication using your scinet username and password. If you are trying to transfer data from a laptop or desktop, you will need to install Globus Connect Personal software available [https://support.globus.org/entries/24044351 here] to setup an endpoint for the laptop or desktop and perform the transfer.

Please see the following [http://www.computecanada.ca/research-portal/globus-portal/globus-user-documentation/ page] on how to setup Globus to perform data transfer.

==== scp ====

<tt>scp</tt>, or secure copy, is the easiest way to copy files, although we generally find rsync below to be faster.

scp works like cp to copy files:

$ scp original_file copy_file

except that either the original or the copy can be on another system:

$ scp jonsdatafile.bin jon@remote.system.com:/home/jon/bigdatadir/

will copy the data file into the directory <tt>/home/jon/bigdatadir/</tt> on <tt>remote.system.com</tt> after logging in as <tt>jon</tt>; you will be prompted for a password (unless you've set up ssh keys).

Copying from remote systems works the same way:

$ scp jon@remote.system.com:/home/jon/bigdatadir/newdata.bin .

And wildcards work as you'd expect (except you have to quote the wildcards on the remote system, as it can't expand properly here.)

$ scp *.bin jon@remote.system.com:/home/jon/bigdatadir/
$ scp jon@remote.system.com:"/home/jon/inputdata/*" .

There are few options worth knowing about:
* <tt>scp -C</tt> compresses the file before transmitting it; ''if'' the file compresses well, this can significantly increase the effective data transfer rate (though usually not as much as compressing the data, then sending it, then uncompressing). If the file doesn't compress well, than this adds CPU overhead without accomplishing much, and can slow down your data transfer.
* <tt>scp -oNoneEnabled=yes -oNoneSwitch=yes</tt> -- This is an hpn-ssh only option. If CPU overhead is a significant bottleneck in the data transfer, then we can avoid this by turning off the secure encryption of the data. For most of us, this is ok, but for others it is not. In either cases, '''authentication''' remains secure, it is only the data transfer that is in plaintext.

==== rsync ====

[http://samba.anu.edu.au/rsync/ rsync] is a very powerful tool for mirroring directories of data.
$ rsync -av -e ssh scinetdatadir jon@remote.system.com:/home/jon/bigdatadir/
rsync has a dizzying number of options; the above syncs <tt>scinetdatadir</tt> ''to'' the remote system; that is, any files that are newer on the localsystem are updated on the remote system. The converse isn't true; if there were newer files on the remote system, you'd have to bring those over with
$ rsync -av -e ssh jon@remote.system.com:/home/jon/bigdatadir/ scinetdatadir
The <tt>-av</tt> options are for verbose and `archive' mode, which preserves timestamps and permissions, which is normally what you want. <tt>-e ssh</tt> tells it to use ssh for the transfer.

One of the powerful things about rsync is that it looks to see what files already exist before copying, so you can use it repeatedly as a data directory fills and it won't make unnecessary copies; similarly, if a (say) log file grows over time, it will only copy the difference between the files, further speeding things up. This also means that it behaves well if a transfer is interrupted; a second invocation of rsync will continue where the other left off.

As with <tt>scp -C</tt>, <tt>rsync -z</tt> compresses on the fly, which can significantly enhance effective data transfer rates if the files compress well, or hurt it if not.

As with scp, if both sides are running hpn-ssh one can disable encryption of the data stream should that prove to be a bottleneck:
$ rsync -av -e "ssh -oNoneEnabled=yes -oNoneSwitch=yes" jon@remote.system.com:/home/jon/bigdatadir/ scinetdatadir

SciNet's login nodes, 142.150.188.5[1-4], are publicly accessible and can be used for data transfer as long as your material is not one big chunk (much more than 2-3GB each file). We have a 5 minutes CPU time limit on the login nodes and the transfer process may be killed by the kernel before completion. The workaround is to transfer your data using a rsync loop, by checking the rsync return code, assuming some files can be transferred before reaching the CPU limit. For example in a bash shell:
<pre>
for i in {1..100}; do ### try 100 times
rsync ...
[ "$?" == "0" ] && break
done
</pre>

==== ssh tunnel ====

Alternatively you may use a reverse ssh tunnel (ssh -R).

If your transfer is above 10GB you will need to use one of SciNet's datamovers. If your workstation is behind a firewall (as the datamovers are), you'll need a node external to your firewall, on the edge of your organization's network, that will serve as a gateway, and can be accessible via ssh by both your workstation and the datamovers. Initiate a "ssh -R" connection from SciNet's datamover to that node. This node needs to have its ssh GatewayPorts enabled so that your workstation can connect to the specified port on that node, which will forward the traffic back to SciNet's datamover.

=== Transfer speeds ===

==== What transfer speeds could I expect? ====

Below are some typical transfer numbers from datamover1 to another University of Toronto machine with a 1Gb/s link to the campus network:

{| class="wikitable" border="0"
|-
!{{Hl2}}| Mode
!{{Hl2}}| With hpn-ssh
!{{Hl2}}| Without
|-
| rsync
| 60-80 MB/s
| 30-40 MB/s
|-
| scp
| 50 MB/s
| 25 MB/s
|}

==== What can slow down my data transfer? ====

To move data quickly, ''all'' of the stages in the process have to be fast; the file system you are reading data from, the CPU reading the data, the network connection between the sender and the reciever, and the recipient CPU and disk. The slowest element in that chain will slow down the entire transfer.

On SciNet's side, our underlying filesystem is the high-performance [http://www-03.ibm.com/systems/software/gpfs/index.html GPFS] system, and the node we recommend you use (datamover1) has a high-speed connection to the network and fast CPUs.

==== Why are my transfers so much slower? ====

If you get numbers significantly lower than above, then there is a bottleneck in the transfer. The first thing to do is to run <tt>top</tt> on datamover1; if other people are transfering large files at the same time you are trying to, network congestion could result and you'll just have to wait until they are done.

If nothing else is going on on datamover1, there are a number of possibilites:
* network connection between SciNet and your machine - do you know the network connection of your remote machine? Are your systems connections tuned for performance [http://www.psc.edu/networking/projects/tcptune]?
* is the remote server busy?
* are the remote servers disks busy, or known to be slow?

For any further questions, contact us at [mailto:support@scinet.utoronto.ca Support @ SciNet]

Oldwiki.scinet.utoronto.ca:System Alerts

2015-02-07T20:50:14Z

Jchong: /* System Status */

== System Status==

{|
|[[File:up.png| down|link=GPC Quickstart]][[GPC Quickstart|GPC]]
|[[File:down.png| down|link=TCS Quickstart]][[TCS Quickstart|TCS]]
|[[File:up.png| down|link=Sandy]][[Sandy]]
|[[File:up.png| down|link=GPU Devel Nodes]][[GPU Devel Nodes|ARC]]
|[[File:up.png| down| up ]]File System
|-
|[[File:up.png| down|link=Gravity]][[Gravity]]
|[[File:up.png| down|link=P7 Linux Cluster]][[P7 Linux Cluster|P7]]
|[[File:down.png| down |link=BGQ]][[BGQ]]
|[[File:down.png| down|link=HPSS]][[HPSS]]
|
|}

Sat Feb 7 15:11:27 EST 2015: GPC and storage is back now. We are working on getting TCS and BGQ up.

Sat Feb 7 13:04:53 EST 2015: cooling has been restored. Some systems (GPC and storage) likely back mid-afternoon. Please check back later

Sat Feb 7 11:00:20 EST 2015: primary chilled water pump won't restart. Waiting for technician. May be noon before root cause is understood. Systems unlikely to be back before 2PM at very earliest.

Sat Feb 7 10:02:02 EST 2015: staff onsite. Assessing problem

Sat Feb 7 07:00:14 EST 2015: datacenter shutdown automatically due to a power outage

Thu Jan 22 13:27:44 EST 2015: BGQ now available as a single 4-rack system. bgqdev-fen1 is the single login/devel/submission node.

([[Previous_messages:|Previous messages]])

Oldwiki.scinet.utoronto.ca:System Alerts

2015-02-07T20:43:22Z

Jchong: /* System Status */

== System Status==

{|
|[[File:up.png| down|link=GPC Quickstart]][[GPC Quickstart|GPC]]
|[[File:down.png| down|link=TCS Quickstart]][[TCS Quickstart|TCS]]
|[[File:up.png| down|link=Sandy]][[Sandy]]
|[[File:up.png| down|link=GPU Devel Nodes]][[GPU Devel Nodes|ARC]]
|[[File:up.png| down| up ]]File System
|-
|[[File:up.png| down|link=Gravity]][[Gravity]]
|[[File:down.png| down|link=P7 Linux Cluster]][[P7 Linux Cluster|P7]]
|[[File:down.png| down |link=BGQ]][[BGQ]]
|[[File:down.png| down|link=HPSS]][[HPSS]]
|
|}

Sat Feb 7 15:11:27 EST 2015: GPC and storage is back now. We are working on getting TCS and BGQ up.

Sat Feb 7 13:04:53 EST 2015: cooling has been restored. Some systems (GPC and storage) likely back mid-afternoon. Please check back later

Sat Feb 7 11:00:20 EST 2015: primary chilled water pump won't restart. Waiting for technician. May be noon before root cause is understood. Systems unlikely to be back before 2PM at very earliest.

Sat Feb 7 10:02:02 EST 2015: staff onsite. Assessing problem

Sat Feb 7 07:00:14 EST 2015: datacenter shutdown automatically due to a power outage

Thu Jan 22 13:27:44 EST 2015: BGQ now available as a single 4-rack system. bgqdev-fen1 is the single login/devel/submission node.

([[Previous_messages:|Previous messages]])

Oldwiki.scinet.utoronto.ca:System Alerts

2015-02-07T20:11:17Z

Jchong: /* System Status */

== System Status==

{|
|[[File:up25.png| down|link=GPC Quickstart]][[GPC Quickstart|GPC]]
|[[File:down.png| down|link=TCS Quickstart]][[TCS Quickstart|TCS]]
|[[File:up.png| down|link=Sandy]][[Sandy]]
|[[File:up.png| down|link=GPU Devel Nodes]][[GPU Devel Nodes|ARC]]
|[[File:up.png| down| up ]]File System
|-
|[[File:up.png| down|link=Gravity]][[Gravity]]
|[[File:down.png| down|link=P7 Linux Cluster]][[P7 Linux Cluster|P7]]
|[[File:down.png| down |link=BGQ]][[BGQ]]
|[[File:down.png| down|link=HPSS]][[HPSS]]
|
|}

Sat Feb 7 15:11:27 EST 2015: GPC and storage is back now. We are working on getting TCS and BGQ up.

Sat Feb 7 13:04:53 EST 2015: cooling has been restored. Some systems (GPC and storage) likely back mid-afternoon. Please check back later

Sat Feb 7 11:00:20 EST 2015: primary chilled water pump won't restart. Waiting for technician. May be noon before root cause is understood. Systems unlikely to be back before 2PM at very earliest.

Sat Feb 7 10:02:02 EST 2015: staff onsite. Assessing problem

Sat Feb 7 07:00:14 EST 2015: datacenter shutdown automatically due to a power outage

Thu Jan 22 13:27:44 EST 2015: BGQ now available as a single 4-rack system. bgqdev-fen1 is the single login/devel/submission node.

([[Previous_messages:|Previous messages]])

Data Transfer

2015-02-04T21:21:08Z

Jchong: /* Globus data transfer */

=== General guidelines ===

All traffic to and from the data centre has to go via [http://en.wikipedia.org/wiki/Secure_Shell SSH], or secure shell.
This is a protocol which sets up a secure connection between two sites. In all cases, incoming connections to SciNet go through relatively low-speed connections to the login.scinet gateways, but there are many ways to copy files on top of the ssh protocol.

What node to use for data transfer to and from SciNet depends mostly on the amount of data to transfer:

==== Moving <10GB through the login nodes ====

The login nodes are accessible from outside SciNet, which means that you can transfer data between your own office/home machine and SciNet using scp or rsync (see below). Although the login nodes have a cpu_time timeout of 5 minutes (emphasis on cpu_time, not wall_time), most likely if you try to transfer more than 10GB probably you wouldn't succeed. While the login nodes can be used for transfers of less than 10GB, using a datamover node would still be faster.

Note that transfers through a login node will timeout after a certain time (currently set to 5 minutes cpu_time), so if you have a slow connection you may need to go through datamover1.

==== Moving >10GB through the datamover1 node ====

Serious moves of data (>10GB) to or from SciNet should be done from <tt>datamover1</tt> or <tt>datamover2</tt> nodes. From any of the interactive SciNet nodes, one should be able to <tt>ssh datamover1|2</tt> to log in. Those are the machines that have the fastest network connections to the outside world (by a factor of 10; a 10Gb/s link as vs 1Gb/s).

Transfers must be ''originated'' from <tt>datamover1|2</tt>; that is, one can not copy files from the outside world directly to or from the datamovers; one has to log in to a datamover and copy the data to or from the outside network. Your local machine must be reachable from the outside as well, either by its name or its IP address. If you are behind a firewall or a (wireless) router, this may not be possible. You may need to ask your network administrator to allow the datamovers to ssh to your machine. If you need to open a hole on your firewall we provide their IPs:

datamover1 142.150.188.121
datamover2 142.150.188.122

==== Hpn-ssh ====

The usual ssh protocols were not designed for speed. On the <tt>datamover1|2</tt> nodes, we have installed hpn-ssh, or [http://www.psc.edu/networking/projects/hpn-ssh/ High-Performance-enabled ssh]. You use this higher-performance ssh/scp/sftp variant by loading the `hpnssh' module. Hpn-ssh is backwards compatible with the `usual' ssh, but is capable of significantly higher speeds. If you routinely have large data transfers to do, we recommend having your system administrator look into installing [http://www.psc.edu/networking/projects/hpn-ssh/ hpn-ssh] on your system.

Everything we discuss below, unless otherwise stated, will work regardless of whether you have hpn-ssh installed on your remote system.

==== For Microsoft Windows users ====

Linux-windows transfers can be a bit more involved than linux-to-linux, but using [http://www.cygwin.com Cygwin], this should not be a problem. Make sure you install Cygwin with the openssh package.

If you want to remain 100% a Windows environment, another very good tool is [http://winscp.net/eng/index.php WinSCP]. It will let you easily transfer and synchronize data between your Windows workstation and the login nodes using your ssh credentials (provided that it's not much more than 10GB on each sync pass).

If you are going to use the [[Data_Management#Moving_.3E10GB_through_the_datamover1_node | datamover1 method]], and assuming your machine is not a wireless laptop (if it
is, best to find a nearby computer that's not wireless and use a usb
memory stick), you'll need the IP address of your machine, which you find by
typing "ipconfig /all" on your local windows machine. Also, you will need to have the ssh daemon (sshd) running locally in Cygwin.

Also note that your windows user name does not have to be the same as on SciNet, this just
depends on how your local windows system was set up.

All locations given to scp or rsync in cygwin have to be in unix format (using "/" not "\"), and will be relative to cygwin's path, not windows (e.g.
use /cygdrive/c/...... to get to the windows C: drive).

=== Ways to transfer data ===

==== Globus data transfer ====

Globus is a file-transfer service with an easy-to-use web interface to allow people to transfer file with ease. To get started, please sign up for a Globus account at [https://www.globus.org/ Globus website]. Once you sign up for an account, go to [https://www.globus.org/xfer/StartTransfer this page] to start the file transfer. Please enter computecanada#gpc as one endpoint for file transfer. The computecanada#gpc endpoint requires authentication using your scinet username and password. If you are trying to transfer data from a laptop or desktop, you will need to install Globus Connect Personal software available [https://support.globus.org/entries/24044351 here] to setup an endpoint for the laptop or desktop and perform the transfer.

Please see the following [https://computecanada.ca/en/component/content/article/15-english-catogories/general-content-english/357-globus-user-documentation page] on how to setup Globus to perform data transfer.

==== scp ====

<tt>scp</tt>, or secure copy, is the easiest way to copy files, although we generally find rsync below to be faster.

scp works like cp to copy files:

$ scp original_file copy_file

except that either the original or the copy can be on another system:

$ scp jonsdatafile.bin jon@remote.system.com:/home/jon/bigdatadir/

will copy the data file into the directory <tt>/home/jon/bigdatadir/</tt> on <tt>remote.system.com</tt> after logging in as <tt>jon</tt>; you will be prompted for a password (unless you've set up ssh keys).

Copying from remote systems works the same way:

$ scp jon@remote.system.com:/home/jon/bigdatadir/newdata.bin .

And wildcards work as you'd expect (except you have to quote the wildcards on the remote system, as it can't expand properly here.)

$ scp *.bin jon@remote.system.com:/home/jon/bigdatadir/
$ scp jon@remote.system.com:"/home/jon/inputdata/*" .

There are few options worth knowing about:
* <tt>scp -C</tt> compresses the file before transmitting it; ''if'' the file compresses well, this can significantly increase the effective data transfer rate (though usually not as much as compressing the data, then sending it, then uncompressing). If the file doesn't compress well, than this adds CPU overhead without accomplishing much, and can slow down your data transfer.
* <tt>scp -oNoneEnabled=yes -oNoneSwitch=yes</tt> -- This is an hpn-ssh only option. If CPU overhead is a significant bottleneck in the data transfer, then we can avoid this by turning off the secure encryption of the data. For most of us, this is ok, but for others it is not. In either cases, '''authentication''' remains secure, it is only the data transfer that is in plaintext.

==== rsync ====

[http://samba.anu.edu.au/rsync/ rsync] is a very powerful tool for mirroring directories of data.
$ rsync -av -e ssh scinetdatadir jon@remote.system.com:/home/jon/bigdatadir/
rsync has a dizzying number of options; the above syncs <tt>scinetdatadir</tt> ''to'' the remote system; that is, any files that are newer on the localsystem are updated on the remote system. The converse isn't true; if there were newer files on the remote system, you'd have to bring those over with
$ rsync -av -e ssh jon@remote.system.com:/home/jon/bigdatadir/ scinetdatadir
The <tt>-av</tt> options are for verbose and `archive' mode, which preserves timestamps and permissions, which is normally what you want. <tt>-e ssh</tt> tells it to use ssh for the transfer.

One of the powerful things about rsync is that it looks to see what files already exist before copying, so you can use it repeatedly as a data directory fills and it won't make unnecessary copies; similarly, if a (say) log file grows over time, it will only copy the difference between the files, further speeding things up. This also means that it behaves well if a transfer is interrupted; a second invocation of rsync will continue where the other left off.

As with <tt>scp -C</tt>, <tt>rsync -z</tt> compresses on the fly, which can significantly enhance effective data transfer rates if the files compress well, or hurt it if not.

As with scp, if both sides are running hpn-ssh one can disable encryption of the data stream should that prove to be a bottleneck:
$ rsync -av -e "ssh -oNoneEnabled=yes -oNoneSwitch=yes" jon@remote.system.com:/home/jon/bigdatadir/ scinetdatadir

SciNet's login nodes, 142.150.188.5[1-4], are publicly accessible and can be used for data transfer as long as your material is not one big chunk (much more than 2-3GB each file). We have a 5 minutes CPU time limit on the login nodes and the transfer process may be killed by the kernel before completion. The workaround is to transfer your data using a rsync loop, by checking the rsync return code, assuming some files can be transferred before reaching the CPU limit. For example in a bash shell:
<pre>
for i in {1..100}; do ### try 100 times
rsync ...
[ "$?" == "0" ] && break
done
</pre>

==== ssh tunnel ====

Alternatively you may use a reverse ssh tunnel (ssh -R).

If your transfer is above 10GB you will need to use one of SciNet's datamovers. If your workstation is behind a firewall (as the datamovers are), you'll need a node external to your firewall, on the edge of your organization's network, that will serve as a gateway, and can be accessible via ssh by both your workstation and the datamovers. Initiate a "ssh -R" connection from SciNet's datamover to that node. This node needs to have its ssh GatewayPorts enabled so that your workstation can connect to the specified port on that node, which will forward the traffic back to SciNet's datamover.

=== Transfer speeds ===

==== What transfer speeds could I expect? ====

Below are some typical transfer numbers from datamover1 to another University of Toronto machine with a 1Gb/s link to the campus network:

{| class="wikitable" border="0"
|-
!{{Hl2}}| Mode
!{{Hl2}}| With hpn-ssh
!{{Hl2}}| Without
|-
| rsync
| 60-80 MB/s
| 30-40 MB/s
|-
| scp
| 50 MB/s
| 25 MB/s
|}

==== What can slow down my data transfer? ====

To move data quickly, ''all'' of the stages in the process have to be fast; the file system you are reading data from, the CPU reading the data, the network connection between the sender and the reciever, and the recipient CPU and disk. The slowest element in that chain will slow down the entire transfer.

On SciNet's side, our underlying filesystem is the high-performance [http://www-03.ibm.com/systems/software/gpfs/index.html GPFS] system, and the node we recommend you use (datamover1) has a high-speed connection to the network and fast CPUs.

==== Why are my transfers so much slower? ====

If you get numbers significantly lower than above, then there is a bottleneck in the transfer. The first thing to do is to run <tt>top</tt> on datamover1; if other people are transfering large files at the same time you are trying to, network congestion could result and you'll just have to wait until they are done.

If nothing else is going on on datamover1, there are a number of possibilites:
* network connection between SciNet and your machine - do you know the network connection of your remote machine? Are your systems connections tuned for performance [http://www.psc.edu/networking/projects/tcptune]?
* is the remote server busy?
* are the remote servers disks busy, or known to be slow?

For any further questions, contact us at [mailto:support@scinet.utoronto.ca Support @ SciNet]

Oldwiki.scinet.utoronto.ca:System Alerts

2015-01-20T19:27:42Z

Jchong:

== System Status==

{|
|[[File:up.png| up |link=GPC Quickstart]][[GPC Quickstart|GPC]]
|[[File:up.png| up |link=TCS Quickstart]][[TCS Quickstart|TCS]]
|[[File:up.png| up |link=Sandy]][[Sandy]]
|[[File:up.png| up |link=GPU Devel Nodes]][[GPU Devel Nodes|ARC]]
|[[File:up.png| up | up ]]File System
|-
|[[File:up.png| up |link=Gravity]][[Gravity]]
|[[File:up.png| up |link=P7 Linux Cluster]][[P7 Linux Cluster|P7]]
|[[File:down.png| down |link=BGQ]][[BGQ]]
|[[File:up.png| up |link=HPSS]][[HPSS]]
|
|}

Tue Jan 20 14:27:06 EST 2015: At noon on Tuesday January 20th, 2015, both 2-rack BlueGene/Q systems, bgq and bgqdev, will be taken down in order to be merged into one 4-rack system (i.e. 65536 cores). We expect that the BGQ will be up again some time on Thursday January 22nd, 2015.

Sat 17 Jan 2015 21:50:40 EST: Cooling has been restored. Systems being restarted. Likely available within an hour or so. Root cause was a frozen pipe in cooling tower (very strange; has never happened before and today is relatively warm compared to past two weeks).

Sat 17 Jan 2015 19:34:00 EST: JCI on site as well. Diagnosing issue.

Sat 17 Jan 2015 17:33:47 EST: Unusual cooling problem. Systems down. Staff enroute to site

--

([[Previous_messages:|Previous messages]])

Oldwiki.scinet.utoronto.ca:System Alerts

2015-01-20T19:26:12Z

Jchong: /* System Status */

Oldwiki.scinet.utoronto.ca:System Alerts

2014-11-13T18:37:04Z

Jchong: /* System Status */

== System Status==

{|
|[[File:up.png|up|link=GPC Quickstart]][[GPC Quickstart|GPC]]
|[[File:up.png|up|link=TCS Quickstart]][[TCS Quickstart|TCS]]
|[[File:up.png|up|link=Sandy]][[Sandy]]
|[[File:up.png|up|link=GPU Devel Nodes]][[GPU Devel Nodes|ARC]]
|[[File:up.png|up]]File System
|-
|[[File:up.png| up |link=Gravity]][[Gravity]]
|[[File:up.png| up |link=P7 Linux Cluster]][[P7 Linux Cluster|P7]]
|[[File:up.png| up |link=BGQ]][[BGQ]]
|[[File:up.png| up |link=HPSS]][[HPSS]]
|
|}

Thu Nov 13 13:36:19 EST 2014: There will be an outage to SciNet via ssh from 10pm EST to 12am EST today to perform emergency repair on the fibre. No running job will be affected. Sorry for the inconvenience.

Wed Nov 12 15:41:44 EST 2014: There will be a 15 to 20 minute outage to SciNet via ssh from 10pm EST to 12am EST today to perform emergency repair on the fibre. No running job will be affected. Sorry for the inconvenience.

Tue Nov 11 17:32:59 EST 2014: The emergency fibre maintenance is complete. Please report any network connectivity issue if there are issues.

Tue Nov 11 11:35:38 EST 2014: Connection to SciNet might be disconnected due to emergency fibre maintenance. Outage should last about 5 minutes and it will not affect running job on the system.

Fri Nov 7 15:00:00 EDT 2014: HPSS is nearing capacity: jobs can be submitted, but will only run once they are reviewed and released by SciNet staff.

Thu Oct 30 21:10:53 EDT 2014: BGQ devel system back in production, upgraded to 1 full-rack.

([[Previous_messages:|Previous messages]])

Oldwiki.scinet.utoronto.ca:System Alerts

2014-11-12T20:43:07Z

Jchong: /* System Status */

== System Status==

{|
|[[File:up.png|up|link=GPC Quickstart]][[GPC Quickstart|GPC]]
|[[File:up.png|up|link=TCS Quickstart]][[TCS Quickstart|TCS]]
|[[File:up.png|up|link=Sandy]][[Sandy]]
|[[File:up.png|up|link=GPU Devel Nodes]][[GPU Devel Nodes|ARC]]
|[[File:up.png|up]]File System
|-
|[[File:up.png| up |link=Gravity]][[Gravity]]
|[[File:up.png| up |link=P7 Linux Cluster]][[P7 Linux Cluster|P7]]
|[[File:up.png| up |link=BGQ]][[BGQ]]
|[[File:up.png| up |link=HPSS]][[HPSS]]
|
|}

Wed Nov 12 15:41:44 EST 2014: There will be a 15 to 20 minute outage to SciNet via ssh from 10pm EST to 12am EST today to perform emergency repair on the fibre. No running job will be affected. Sorry for the inconvenience.

Tue Nov 11 17:32:59 EST 2014: The emergency fibre maintenance is complete. Please report any network connectivity issue if there are issues.

Tue Nov 11 11:35:38 EST 2014: Connection to SciNet might be disconnected due to emergency fibre maintenance. Outage should last about 5 minutes and it will not affect running job on the system.

Fri Nov 7 15:00:00 EDT 2014: HPSS is nearing capacity: jobs can be submitted, but will only run once they are reviewed and released by SciNet staff.

Thu Oct 30 21:10:53 EDT 2014: BGQ devel system back in production, upgraded to 1 full-rack.

([[Previous_messages:|Previous messages]])

Oldwiki.scinet.utoronto.ca:System Alerts

2014-11-11T22:33:51Z

Jchong: /* System Status */

== System Status==

{|
|[[File:up.png|up|link=GPC Quickstart]][[GPC Quickstart|GPC]]
|[[File:up.png|up|link=TCS Quickstart]][[TCS Quickstart|TCS]]
|[[File:up.png|up|link=Sandy]][[Sandy]]
|[[File:up.png|up|link=GPU Devel Nodes]][[GPU Devel Nodes|ARC]]
|[[File:up.png|up]]File System
|-
|[[File:up.png| up |link=Gravity]][[Gravity]]
|[[File:up.png| up |link=P7 Linux Cluster]][[P7 Linux Cluster|P7]]
|[[File:up.png| up |link=BGQ]][[BGQ]]
|[[File:up.png| up |link=HPSS]][[HPSS]]
|
|}

Tue Nov 11 17:32:59 EST 2014: The emergency fibre maintenance is complete. Please report any network connectivity issue if there are issues.

Tue Nov 11 11:35:38 EST 2014: Connection to SciNet might be disconnected due to emergency fibre maintenance. Outage should last about 5 minutes and it will not affect running job on the system.

Fri Nov 7 15:00:00 EDT 2014: HPSS is nearing capacity: jobs can be submitted, but will only run once they are reviewed and released by SciNet staff.

Thu Oct 30 21:10:53 EDT 2014: BGQ devel system back in production, upgraded to 1 full-rack.

([[Previous_messages:|Previous messages]])

Oldwiki.scinet.utoronto.ca:System Alerts

2014-11-11T16:36:51Z

Jchong: /* System Status */

Oldwiki.scinet.utoronto.ca:System Alerts

2014-11-11T16:36:30Z

Jchong: /* System Status */

Data Transfer

2014-08-11T18:08:26Z

Jchong:

=== General guidelines ===

All traffic to and from the data centre has to go via [http://en.wikipedia.org/wiki/Secure_Shell SSH], or secure shell.
This is a protocol which sets up a secure connection between two sites. In all cases, incoming connections to SciNet go through relatively low-speed connections to the login.scinet gateways, but there are many ways to copy files on top of the ssh protocol.

What node to use for data transfer to and from SciNet depends mostly on the amount of data to transfer:

==== Moving <10GB through the login nodes ====

The login nodes are accessible from outside SciNet, which means that you can transfer data between your own office/home machine and SciNet using scp or rsync (see below). Although the login nodes have a cpu_time timeout of 5 minutes (emphasis on cpu_time, not wall_time), most likely if you try to transfer more than 10GB probably you wouldn't succeed. While the login nodes can be used for transfers of less than 10GB, using a datamover node would still be faster.

Note that transfers through a login node will timeout after a certain time (currently set to 5 minutes cpu_time), so if you have a slow connection you may need to go through datamover1.

==== Moving >10GB through the datamover1 node ====

Serious moves of data (>10GB) to or from SciNet should be done from <tt>datamover1</tt> or <tt>datamover2</tt> nodes. From any of the interactive SciNet nodes, one should be able to <tt>ssh datamover1|2</tt> to log in. Those are the machines that have the fastest network connections to the outside world (by a factor of 10; a 10Gb/s link as vs 1Gb/s).

Transfers must be ''originated'' from <tt>datamover1|2</tt>; that is, one can not copy files from the outside world directly to or from the datamovers; one has to log in to a datamover and copy the data to or from the outside network. Your local machine must be reachable from the outside as well, either by its name or its IP address. If you are behind a firewall or a (wireless) router, this may not be possible. You may need to ask your network administrator to allow the datamovers to ssh to your machine. If you need to open a hole on your firewall we provide their IPs:

datamover1 142.150.188.121
datamover2 142.150.188.122

==== Hpn-ssh ====

The usual ssh protocols were not designed for speed. On the <tt>datamover1|2</tt> nodes, we have installed hpn-ssh, or [http://www.psc.edu/networking/projects/hpn-ssh/ High-Performance-enabled ssh]. You use this higher-performance ssh/scp/sftp variant by loading the `hpnssh' module. Hpn-ssh is backwards compatible with the `usual' ssh, but is capable of significantly higher speeds. If you routinely have large data transfers to do, we recommend having your system administrator look into installing [http://www.psc.edu/networking/projects/hpn-ssh/ hpn-ssh] on your system.

Everything we discuss below, unless otherwise stated, will work regardless of whether you have hpn-ssh installed on your remote system.

==== For Microsoft Windows users ====

Linux-windows transfers can be a bit more involved than linux-to-linux, but using [http://www.cygwin.com Cygwin], this should not be a problem. Make sure you install Cygwin with the openssh package.

If you want to remain 100% a Windows environment, another very good tool is [http://winscp.net/eng/index.php WinSCP]. It will let you easily transfer and synchronize data between your Windows workstation and the login nodes using your ssh credentials (provided that it's not much more than 10GB on each sync pass).

If you are going to use the [[Data_Management#Moving_.3E10GB_through_the_datamover1_node | datamover1 method]], and assuming your machine is not a wireless laptop (if it
is, best to find a nearby computer that's not wireless and use a usb
memory stick), you'll need the IP address of your machine, which you find by
typing "ipconfig /all" on your local windows machine. Also, you will need to have the ssh daemon (sshd) running locally in Cygwin.

Also note that your windows user name does not have to be the same as on SciNet, this just
depends on how your local windows system was set up.

All locations given to scp or rsync in cygwin have to be in unix format (using "/" not "\"), and will be relative to cygwin's path, not windows (e.g.
use /cygdrive/c/...... to get to the windows C: drive).

=== Ways to transfer data ===

==== Globus data transfer ====

Globus is a file-transfer service with an easy-to-use web interface to allow people to transfer file with ease. To get started, please sign up for a Globus account at [https://www.globus.org/ Globus website]. Once you sign up for an account, go to [https://www.globus.org/xfer/StartTransfer this page] to start the file transfer. Please enter computecanada#gpc as one endpoint for file transfer. The computecanada#gpc endpoint requires authentication using your scinet username and password. If you are trying to transfer data from a laptop or desktop, you will need to install Globus Connect Personal software available [https://support.globus.org/entries/24044351 here] to setup an endpoint for the laptop or desktop and perform the transfer.

==== scp ====

<tt>scp</tt>, or secure copy, is the easiest way to copy files, although we generally find rsync below to be faster.

scp works like cp to copy files:

$ scp original_file copy_file

except that either the original or the copy can be on another system:

$ scp jonsdatafile.bin jon@remote.system.com:/home/jon/bigdatadir/

will copy the data file into the directory <tt>/home/jon/bigdatadir/</tt> on <tt>remote.system.com</tt> after logging in as <tt>jon</tt>; you will be prompted for a password (unless you've set up ssh keys).

Copying from remote systems works the same way:

$ scp jon@remote.system.com:/home/jon/bigdatadir/newdata.bin .

And wildcards work as you'd expect (except you have to quote the wildcards on the remote system, as it can't expand properly here.)

$ scp *.bin jon@remote.system.com:/home/jon/bigdatadir/
$ scp jon@remote.system.com:"/home/jon/inputdata/*" .

There are few options worth knowing about:
* <tt>scp -C</tt> compresses the file before transmitting it; ''if'' the file compresses well, this can significantly increase the effective data transfer rate (though usually not as much as compressing the data, then sending it, then uncompressing). If the file doesn't compress well, than this adds CPU overhead without accomplishing much, and can slow down your data transfer.
* <tt>scp -oNoneEnabled=yes -oNoneSwitch=yes</tt> -- This is an hpn-ssh only option. If CPU overhead is a significant bottleneck in the data transfer, then we can avoid this by turning off the secure encryption of the data. For most of us, this is ok, but for others it is not. In either cases, '''authentication''' remains secure, it is only the data transfer that is in plaintext.

==== rsync ====

[http://samba.anu.edu.au/rsync/ rsync] is a very powerful tool for mirroring directories of data.
$ rsync -av -e ssh scinetdatadir jon@remote.system.com:/home/jon/bigdatadir/
rsync has a dizzying number of options; the above syncs <tt>scinetdatadir</tt> ''to'' the remote system; that is, any files that are newer on the localsystem are updated on the remote system. The converse isn't true; if there were newer files on the remote system, you'd have to bring those over with
$ rsync -av -e ssh jon@remote.system.com:/home/jon/bigdatadir/ scinetdatadir
The <tt>-av</tt> options are for verbose and `archive' mode, which preserves timestamps and permissions, which is normally what you want. <tt>-e ssh</tt> tells it to use ssh for the transfer.

One of the powerful things about rsync is that it looks to see what files already exist before copying, so you can use it repeatedly as a data directory fills and it won't make unnecessary copies; similarly, if a (say) log file grows over time, it will only copy the difference between the files, further speeding things up. This also means that it behaves well if a transfer is interrupted; a second invocation of rsync will continue where the other left off.

As with <tt>scp -C</tt>, <tt>rsync -z</tt> compresses on the fly, which can significantly enhance effective data transfer rates if the files compress well, or hurt it if not.

As with scp, if both sides are running hpn-ssh one can disable encryption of the data stream should that prove to be a bottleneck:
$ rsync -av -e "ssh -oNoneEnabled=yes -oNoneSwitch=yes" jon@remote.system.com:/home/jon/bigdatadir/ scinetdatadir

SciNet's login nodes, 142.150.188.5[1-4], are publicly accessible and can be used for data transfer as long as your material is not one big chunk (much more than 2-3GB each file). We have a 5 minutes CPU time limit on the login nodes and the transfer process may be killed by the kernel before completion. The workaround is to transfer your data using a rsync loop, by checking the rsync return code, assuming some files can be transferred before reaching the CPU limit. For example in a bash shell:
<pre>
for i in {1..100}; do ### try 100 times
rsync ...
[ "$?" == "0" ] && break
done
</pre>

==== ssh tunnel ====

Alternatively you may use a reverse ssh tunnel (ssh -R).

If your transfer is above 10GB you will need to use one of SciNet's datamovers. If your workstation is behind a firewall (as the datamovers are), you'll need a node external to your firewall, on the edge of your organization's network, that will serve as a gateway, and can be accessible via ssh by both your workstation and the datamovers. Initiate a "ssh -R" connection from SciNet's datamover to that node. This node needs to have its ssh GatewayPorts enabled so that your workstation can connect to the specified port on that node, which will forward the traffic back to SciNet's datamover.

=== Transfer speeds ===

==== What transfer speeds could I expect? ====

Below are some typical transfer numbers from datamover1 to another University of Toronto machine with a 1Gb/s link to the campus network:

{| class="wikitable" border="0"
|-
!{{Hl2}}| Mode
!{{Hl2}}| With hpn-ssh
!{{Hl2}}| Without
|-
| rsync
| 60-80 MB/s
| 30-40 MB/s
|-
| scp
| 50 MB/s
| 25 MB/s
|}

==== What can slow down my data transfer? ====

To move data quickly, ''all'' of the stages in the process have to be fast; the file system you are reading data from, the CPU reading the data, the network connection between the sender and the reciever, and the recipient CPU and disk. The slowest element in that chain will slow down the entire transfer.

On SciNet's side, our underlying filesystem is the high-performance [http://www-03.ibm.com/systems/software/gpfs/index.html GPFS] system, and the node we recommend you use (datamover1) has a high-speed connection to the network and fast CPUs.

==== Why are my transfers so much slower? ====

If you get numbers significantly lower than above, then there is a bottleneck in the transfer. The first thing to do is to run <tt>top</tt> on datamover1; if other people are transfering large files at the same time you are trying to, network congestion could result and you'll just have to wait until they are done.

If nothing else is going on on datamover1, there are a number of possibilites:
* network connection between SciNet and your machine - do you know the network connection of your remote machine? Are your systems connections tuned for performance [http://www.psc.edu/networking/projects/tcptune]?
* is the remote server busy?
* are the remote servers disks busy, or known to be slow?

For any further questions, contact us at [mailto:support@scinet.utoronto.ca Support @ SciNet]

Oldwiki.scinet.utoronto.ca:System Alerts

2014-08-08T18:11:15Z

Jchong: /* System Status */

== System Status==

{|
|[[File:up.png|up|link=GPC Quickstart]][[GPC Quickstart|GPC]]
|[[File:up.png|up|link=TCS Quickstart]][[TCS Quickstart|TCS]]
|[[File:up.png|up|link=Sandy]][[Sandy]]
|[[File:up.png|up|link=GPU Devel Nodes]][[GPU Devel Nodes|ARC]]
|[[File:up.png|up]]File System
|-
|[[File:up.png|up|link=Gravity]][[Gravity]]
|[[File:up.png|up|link=P7 Linux Cluster]][[P7 Linux Cluster|P7]]
|[[File:up.png|up|link=BGQ]][[BGQ]]
|[[File:up.png|up|link=HPSS]][[HPSS]]
|
|}

Tue Aug 5 09:34:00 EDT 2014: File system hickup on the BGQ. The issue has been resolved, please resubmit BGQ jobs.

Fri Aug 1 20:19:44 EDT 2014: The 3 second power-outage took down all the GPC, TCS and BGQ compute nodes so all running jobs were killed. Queued and new jobs started 3s later on the GPC. The TCS and BGQ are back-online as well. Please email support@scinet.utoronto.ca if you still encounter issues

Fri Aug 1 17:23:05 EDT 2014: Around 5pm, a few seconds of power outage has taken down an as-of-yet unknown number of nodes. GPC, Sandy, TCS, Gravity, ARC are certainly affected, but to which extent is not clear yet. Updates will be posted here.

Fri Aug 1 17:46:04 EDT 2014: GPC, Sandy, ARC, Gravity, TCS, and BGQ were all affected. P7, HPSS and file system are okay. We're rebooting the nodes.

Note: As a precaution, emails by the Moab/Torque scheduler have been disabled because of a potential security vulnerability since Jan 24th 2014.

([[Previous_messages:|Previous messages]])

Oldwiki.scinet.utoronto.ca:System Alerts

2014-08-01T23:58:34Z

Jchong: /* System Status */

== System Status==

{|
|[[File:up.png|up|link=GPC Quickstart]][[GPC Quickstart|GPC]]
|[[File:up75.png|down|link=TCS Quickstart]][[TCS Quickstart|TCS]]
|[[File:up.png|down|link=Sandy]][[Sandy]]
|[[File:up.png|down|link=GPU Devel Nodes]][[GPU Devel Nodes|ARC]]
|[[File:up.png|up]]File System
|-
|[[File:up.png|down|link=Gravity]][[Gravity]]
|[[File:up.png|up|link=P7 Linux Cluster]][[P7 Linux Cluster|P7]]
|[[File:up.png|down|link=BGQ]][[BGQ]]
|[[File:up.png|up|link=HPSS]][[HPSS]]
|
|}

Fri Aug 1 17:23:05 EDT 2014: Around 5pm, a few seconds of power outage has taken down an as-of-yet unknown number of nodes. GPC, Sandy, TCS, Gravity, ARC are certainly affected, but to which extent is not clear yet. Updates will be posted here.

Fri Aug 1 17:46:04 EDT 2014: GPC, Sandy, ARC, Gravity, TCS, and BGQ were all affected. P7, HPSS and file system are okay. We're rebooting the nodes.

Note: As a precaution, emails by the Moab/Torque scheduler have been disabled because of a potential security vulnerability since Jan 24th 2014.

Last updated: Tue Jul 15 7:51:44 EDT 2014
([[Previous_messages:|Previous messages]])

Data Transfer

2014-07-09T17:42:38Z

Jchong: Edit the globus data transfer page

=== General guidelines ===

All traffic to and from the data centre has to go via [http://en.wikipedia.org/wiki/Secure_Shell SSH], or secure shell.
This is a protocol which sets up a secure connection between two sites. In all cases, incoming connections to SciNet go through relatively low-speed connections to the login.scinet gateways, but there are many ways to copy files on top of the ssh protocol.

What node to use for data transfer to and from SciNet depends mostly on the amount of data to transfer:

==== Moving <10GB through the login nodes ====

The login nodes are accessible from outside SciNet, which means that you can transfer data between your own office/home machine and SciNet using scp or rsync (see below). Although the login nodes have a cpu_time timeout of 5 minutes (emphasis on cpu_time, not wall_time), most likely if you try to transfer more than 10GB probably you wouldn't succeed. While the login nodes can be used for transfers of less than 10GB, using a datamover node would still be faster.

Note that transfers through a login node will timeout after a certain time (currently set to 5 minutes cpu_time), so if you have a slow connection you may need to go through datamover1.

==== Moving >10GB through the datamover1 node ====

Serious moves of data (>10GB) to or from SciNet should be done from <tt>datamover1</tt> or <tt>datamover2</tt> nodes. From any of the interactive SciNet nodes, one should be able to <tt>ssh datamover1|2</tt> to log in. Those are the machines that have the fastest network connections to the outside world (by a factor of 10; a 10Gb/s link as vs 1Gb/s).

Transfers must be ''originated'' from <tt>datamover1|2</tt>; that is, one can not copy files from the outside world directly to or from the datamovers; one has to log in to a datamover and copy the data to or from the outside network. Your local machine must be reachable from the outside as well, either by its name or its IP address. If you are behind a firewall or a (wireless) router, this may not be possible. You may need to ask your network administrator to allow the datamovers to ssh to your machine. If you need to open a hole on your firewall we provide their IPs:

datamover1 142.150.188.121
datamover2 142.150.188.122

==== Hpn-ssh ====

The usual ssh protocols were not designed for speed. On the <tt>datamover1|2</tt> nodes, we have installed hpn-ssh, or [http://www.psc.edu/networking/projects/hpn-ssh/ High-Performance-enabled ssh]. You use this higher-performance ssh/scp/sftp variant by loading the `hpnssh' module. Hpn-ssh is backwards compatible with the `usual' ssh, but is capable of significantly higher speeds. If you routinely have large data transfers to do, we recommend having your system administrator look into installing [http://www.psc.edu/networking/projects/hpn-ssh/ hpn-ssh] on your system.

Everything we discuss below, unless otherwise stated, will work regardless of whether you have hpn-ssh installed on your remote system.

==== For Microsoft Windows users ====

Linux-windows transfers can be a bit more involved than linux-to-linux, but using [http://www.cygwin.com Cygwin], this should not be a problem. Make sure you install Cygwin with the openssh package.

If you want to remain 100% a Windows environment, another very good tool is [http://winscp.net/eng/index.php WinSCP]. It will let you easily transfer and synchronize data between your Windows workstation and the login nodes using your ssh credentials (provided that it's not much more than 10GB on each sync pass).

If you are going to use the [[Data_Management#Moving_.3E10GB_through_the_datamover1_node | datamover1 method]], and assuming your machine is not a wireless laptop (if it
is, best to find a nearby computer that's not wireless and use a usb
memory stick), you'll need the IP address of your machine, which you find by
typing "ipconfig /all" on your local windows machine. Also, you will need to have the ssh daemon (sshd) running locally in Cygwin.

Also note that your windows user name does not have to be the same as on SciNet, this just
depends on how your local windows system was set up.

All locations given to scp or rsync in cygwin have to be in unix format (using "/" not "\"), and will be relative to cygwin's path, not windows (e.g.
use /cygdrive/c/...... to get to the windows C: drive).

=== Ways to transfer data ===

==== Globus data transfer ====

Globus is a file-transfer service with a easy-to-use web interface to allow people to transfer file with ease. To get started, please sign up for a Globus account at [https://www.globus.org/ Globus website]. Once you sign up for an account, go to [https://www.globus.org/xfer/StartTransfer this page] to start the file transfer. Please enter computecanada#gpc as one endpoint for file transfer. If you are trying to transfer data from a laptop or desktop, you will need to install Globus Connect Personal software available [https://support.globus.org/entries/24044351 here] to setup an endpoint for the laptop or desktop and perform the transfer.

==== scp ====

<tt>scp</tt>, or secure copy, is the easiest way to copy files, although we generally find rsync below to be faster.

scp works like cp to copy files:

$ scp original_file copy_file

except that either the original or the copy can be on another system:

$ scp jonsdatafile.bin jon@remote.system.com:/home/jon/bigdatadir/

will copy the data file into the directory <tt>/home/jon/bigdatadir/</tt> on <tt>remote.system.com</tt> after logging in as <tt>jon</tt>; you will be prompted for a password (unless you've set up ssh keys).

Copying from remote systems works the same way:

$ scp jon@remote.system.com:/home/jon/bigdatadir/newdata.bin .

And wildcards work as you'd expect (except you have to quote the wildcards on the remote system, as it can't expand properly here.)

$ scp *.bin jon@remote.system.com:/home/jon/bigdatadir/
$ scp jon@remote.system.com:"/home/jon/inputdata/*" .

There are few options worth knowing about:
* <tt>scp -C</tt> compresses the file before transmitting it; ''if'' the file compresses well, this can significantly increase the effective data transfer rate (though usually not as much as compressing the data, then sending it, then uncompressing). If the file doesn't compress well, than this adds CPU overhead without accomplishing much, and can slow down your data transfer.
* <tt>scp -oNoneEnabled=yes -oNoneSwitch=yes</tt> -- This is an hpn-ssh only option. If CPU overhead is a significant bottleneck in the data transfer, then we can avoid this by turning off the secure encryption of the data. For most of us, this is ok, but for others it is not. In either cases, '''authentication''' remains secure, it is only the data transfer that is in plaintext.

==== rsync ====

[http://samba.anu.edu.au/rsync/ rsync] is a very powerful tool for mirroring directories of data.
$ rsync -av -e ssh scinetdatadir jon@remote.system.com:/home/jon/bigdatadir/
rsync has a dizzying number of options; the above syncs <tt>scinetdatadir</tt> ''to'' the remote system; that is, any files that are newer on the localsystem are updated on the remote system. The converse isn't true; if there were newer files on the remote system, you'd have to bring those over with
$ rsync -av -e ssh jon@remote.system.com:/home/jon/bigdatadir/ scinetdatadir
The <tt>-av</tt> options are for verbose and `archive' mode, which preserves timestamps and permissions, which is normally what you want. <tt>-e ssh</tt> tells it to use ssh for the transfer.

One of the powerful things about rsync is that it looks to see what files already exist before copying, so you can use it repeatedly as a data directory fills and it won't make unnecessary copies; similarly, if a (say) log file grows over time, it will only copy the difference between the files, further speeding things up. This also means that it behaves well if a transfer is interrupted; a second invocation of rsync will continue where the other left off.

As with <tt>scp -C</tt>, <tt>rsync -z</tt> compresses on the fly, which can significantly enhance effective data transfer rates if the files compress well, or hurt it if not.

As with scp, if both sides are running hpn-ssh one can disable encryption of the data stream should that prove to be a bottleneck:
$ rsync -av -e "ssh -oNoneEnabled=yes -oNoneSwitch=yes" jon@remote.system.com:/home/jon/bigdatadir/ scinetdatadir

SciNet's login nodes, 142.150.188.5[1-4], are publicly accessible and can be used for data transfer as long as your material is not one big chunk (much more than 2-3GB each file). We have a 5 minutes CPU time limit on the login nodes and the transfer process may be killed by the kernel before completion. The workaround is to transfer your data using a rsync loop, by checking the rsync return code, assuming some files can be transferred before reaching the CPU limit. For example in a bash shell:
<pre>
for i in {1..100}; do ### try 100 times
rsync ...
[ "$?" == "0" ] && break
done
</pre>

==== ssh tunnel ====

Alternatively you may use a reverse ssh tunnel (ssh -R).

If your transfer is above 10GB you will need to use one of SciNet's datamovers. If your workstation is behind a firewall (as the datamovers are), you'll need a node external to your firewall, on the edge of your organization's network, that will serve as a gateway, and can be accessible via ssh by both your workstation and the datamovers. Initiate a "ssh -R" connection from SciNet's datamover to that node. This node needs to have its ssh GatewayPorts enabled so that your workstation can connect to the specified port on that node, which will forward the traffic back to SciNet's datamover.

=== Transfer speeds ===

==== What transfer speeds could I expect? ====

Below are some typical transfer numbers from datamover1 to another University of Toronto machine with a 1Gb/s link to the campus network:

{| class="wikitable" border="0"
|-
!{{Hl2}}| Mode
!{{Hl2}}| With hpn-ssh
!{{Hl2}}| Without
|-
| rsync
| 60-80 MB/s
| 30-40 MB/s
|-
| scp
| 50 MB/s
| 25 MB/s
|}

==== What can slow down my data transfer? ====

To move data quickly, ''all'' of the stages in the process have to be fast; the file system you are reading data from, the CPU reading the data, the network connection between the sender and the reciever, and the recipient CPU and disk. The slowest element in that chain will slow down the entire transfer.

On SciNet's side, our underlying filesystem is the high-performance [http://www-03.ibm.com/systems/software/gpfs/index.html GPFS] system, and the node we recommend you use (datamover1) has a high-speed connection to the network and fast CPUs.

==== Why are my transfers so much slower? ====

If you get numbers significantly lower than above, then there is a bottleneck in the transfer. The first thing to do is to run <tt>top</tt> on datamover1; if other people are transfering large files at the same time you are trying to, network congestion could result and you'll just have to wait until they are done.

If nothing else is going on on datamover1, there are a number of possibilites:
* network connection between SciNet and your machine - do you know the network connection of your remote machine? Are your systems connections tuned for performance [http://www.psc.edu/networking/projects/tcptune]?
* is the remote server busy?
* are the remote servers disks busy, or known to be slow?

For any further questions, contact us at [mailto:support@scinet.utoronto.ca Support @ SciNet]

Data Transfer

2014-07-09T17:42:07Z

Jchong: /* Globus data transfer (BETA) */

=== General guidelines ===

All traffic to and from the data centre has to go via [http://en.wikipedia.org/wiki/Secure_Shell SSH], or secure shell.
This is a protocol which sets up a secure connection between two sites. In all cases, incoming connections to SciNet go through relatively low-speed connections to the login.scinet gateways, but there are many ways to copy files on top of the ssh protocol.

What node to use for data transfer to and from SciNet depends mostly on the amount of data to transfer:

==== Moving <10GB through the login nodes ====

The login nodes are accessible from outside SciNet, which means that you can transfer data between your own office/home machine and SciNet using scp or rsync (see below). Although the login nodes have a cpu_time timeout of 5 minutes (emphasis on cpu_time, not wall_time), most likely if you try to transfer more than 10GB probably you wouldn't succeed. While the login nodes can be used for transfers of less than 10GB, using a datamover node would still be faster.

Note that transfers through a login node will timeout after a certain time (currently set to 5 minutes cpu_time), so if you have a slow connection you may need to go through datamover1.

==== Moving >10GB through the datamover1 node ====

Serious moves of data (>10GB) to or from SciNet should be done from <tt>datamover1</tt> or <tt>datamover2</tt> nodes. From any of the interactive SciNet nodes, one should be able to <tt>ssh datamover1|2</tt> to log in. Those are the machines that have the fastest network connections to the outside world (by a factor of 10; a 10Gb/s link as vs 1Gb/s).

Transfers must be ''originated'' from <tt>datamover1|2</tt>; that is, one can not copy files from the outside world directly to or from the datamovers; one has to log in to a datamover and copy the data to or from the outside network. Your local machine must be reachable from the outside as well, either by its name or its IP address. If you are behind a firewall or a (wireless) router, this may not be possible. You may need to ask your network administrator to allow the datamovers to ssh to your machine. If you need to open a hole on your firewall we provide their IPs:

datamover1 142.150.188.121
datamover2 142.150.188.122

==== Hpn-ssh ====

The usual ssh protocols were not designed for speed. On the <tt>datamover1|2</tt> nodes, we have installed hpn-ssh, or [http://www.psc.edu/networking/projects/hpn-ssh/ High-Performance-enabled ssh]. You use this higher-performance ssh/scp/sftp variant by loading the `hpnssh' module. Hpn-ssh is backwards compatible with the `usual' ssh, but is capable of significantly higher speeds. If you routinely have large data transfers to do, we recommend having your system administrator look into installing [http://www.psc.edu/networking/projects/hpn-ssh/ hpn-ssh] on your system.

Everything we discuss below, unless otherwise stated, will work regardless of whether you have hpn-ssh installed on your remote system.

==== For Microsoft Windows users ====

Linux-windows transfers can be a bit more involved than linux-to-linux, but using [http://www.cygwin.com Cygwin], this should not be a problem. Make sure you install Cygwin with the openssh package.

If you want to remain 100% a Windows environment, another very good tool is [http://winscp.net/eng/index.php WinSCP]. It will let you easily transfer and synchronize data between your Windows workstation and the login nodes using your ssh credentials (provided that it's not much more than 10GB on each sync pass).

If you are going to use the [[Data_Management#Moving_.3E10GB_through_the_datamover1_node | datamover1 method]], and assuming your machine is not a wireless laptop (if it
is, best to find a nearby computer that's not wireless and use a usb
memory stick), you'll need the IP address of your machine, which you find by
typing "ipconfig /all" on your local windows machine. Also, you will need to have the ssh daemon (sshd) running locally in Cygwin.

Also note that your windows user name does not have to be the same as on SciNet, this just
depends on how your local windows system was set up.

All locations given to scp or rsync in cygwin have to be in unix format (using "/" not "\"), and will be relative to cygwin's path, not windows (e.g.
use /cygdrive/c/...... to get to the windows C: drive).

=== Ways to transfer data ===

==== Globus data transfer (BETA) ====

Globus is a file-transfer service with a easy-to-use web interface to allow people to transfer file with ease. To get started, please sign up for a Globus account at [https://www.globus.org/ Globus website]. Once you sign up for an account, go to [https://www.globus.org/xfer/StartTransfer this page] to start the file transfer. Please enter computecanada#gpc as one endpoint for file transfer. If you are trying to transfer data from a laptop or desktop, you will need to install Globus Connect Personal software available [https://support.globus.org/entries/24044351 here] to setup an endpoint for the laptop or desktop and perform the transfer.

==== scp ====

<tt>scp</tt>, or secure copy, is the easiest way to copy files, although we generally find rsync below to be faster.

scp works like cp to copy files:

$ scp original_file copy_file

except that either the original or the copy can be on another system:

$ scp jonsdatafile.bin jon@remote.system.com:/home/jon/bigdatadir/

will copy the data file into the directory <tt>/home/jon/bigdatadir/</tt> on <tt>remote.system.com</tt> after logging in as <tt>jon</tt>; you will be prompted for a password (unless you've set up ssh keys).

Copying from remote systems works the same way:

$ scp jon@remote.system.com:/home/jon/bigdatadir/newdata.bin .

And wildcards work as you'd expect (except you have to quote the wildcards on the remote system, as it can't expand properly here.)

$ scp *.bin jon@remote.system.com:/home/jon/bigdatadir/
$ scp jon@remote.system.com:"/home/jon/inputdata/*" .

There are few options worth knowing about:
* <tt>scp -C</tt> compresses the file before transmitting it; ''if'' the file compresses well, this can significantly increase the effective data transfer rate (though usually not as much as compressing the data, then sending it, then uncompressing). If the file doesn't compress well, than this adds CPU overhead without accomplishing much, and can slow down your data transfer.
* <tt>scp -oNoneEnabled=yes -oNoneSwitch=yes</tt> -- This is an hpn-ssh only option. If CPU overhead is a significant bottleneck in the data transfer, then we can avoid this by turning off the secure encryption of the data. For most of us, this is ok, but for others it is not. In either cases, '''authentication''' remains secure, it is only the data transfer that is in plaintext.

==== rsync ====

[http://samba.anu.edu.au/rsync/ rsync] is a very powerful tool for mirroring directories of data.
$ rsync -av -e ssh scinetdatadir jon@remote.system.com:/home/jon/bigdatadir/
rsync has a dizzying number of options; the above syncs <tt>scinetdatadir</tt> ''to'' the remote system; that is, any files that are newer on the localsystem are updated on the remote system. The converse isn't true; if there were newer files on the remote system, you'd have to bring those over with
$ rsync -av -e ssh jon@remote.system.com:/home/jon/bigdatadir/ scinetdatadir
The <tt>-av</tt> options are for verbose and `archive' mode, which preserves timestamps and permissions, which is normally what you want. <tt>-e ssh</tt> tells it to use ssh for the transfer.

One of the powerful things about rsync is that it looks to see what files already exist before copying, so you can use it repeatedly as a data directory fills and it won't make unnecessary copies; similarly, if a (say) log file grows over time, it will only copy the difference between the files, further speeding things up. This also means that it behaves well if a transfer is interrupted; a second invocation of rsync will continue where the other left off.

As with <tt>scp -C</tt>, <tt>rsync -z</tt> compresses on the fly, which can significantly enhance effective data transfer rates if the files compress well, or hurt it if not.

As with scp, if both sides are running hpn-ssh one can disable encryption of the data stream should that prove to be a bottleneck:
$ rsync -av -e "ssh -oNoneEnabled=yes -oNoneSwitch=yes" jon@remote.system.com:/home/jon/bigdatadir/ scinetdatadir

SciNet's login nodes, 142.150.188.5[1-4], are publicly accessible and can be used for data transfer as long as your material is not one big chunk (much more than 2-3GB each file). We have a 5 minutes CPU time limit on the login nodes and the transfer process may be killed by the kernel before completion. The workaround is to transfer your data using a rsync loop, by checking the rsync return code, assuming some files can be transferred before reaching the CPU limit. For example in a bash shell:
<pre>
for i in {1..100}; do ### try 100 times
rsync ...
[ "$?" == "0" ] && break
done
</pre>

==== ssh tunnel ====

Alternatively you may use a reverse ssh tunnel (ssh -R).

If your transfer is above 10GB you will need to use one of SciNet's datamovers. If your workstation is behind a firewall (as the datamovers are), you'll need a node external to your firewall, on the edge of your organization's network, that will serve as a gateway, and can be accessible via ssh by both your workstation and the datamovers. Initiate a "ssh -R" connection from SciNet's datamover to that node. This node needs to have its ssh GatewayPorts enabled so that your workstation can connect to the specified port on that node, which will forward the traffic back to SciNet's datamover.

=== Transfer speeds ===

==== What transfer speeds could I expect? ====

Below are some typical transfer numbers from datamover1 to another University of Toronto machine with a 1Gb/s link to the campus network:

{| class="wikitable" border="0"
|-
!{{Hl2}}| Mode
!{{Hl2}}| With hpn-ssh
!{{Hl2}}| Without
|-
| rsync
| 60-80 MB/s
| 30-40 MB/s
|-
| scp
| 50 MB/s
| 25 MB/s
|}

==== What can slow down my data transfer? ====

To move data quickly, ''all'' of the stages in the process have to be fast; the file system you are reading data from, the CPU reading the data, the network connection between the sender and the reciever, and the recipient CPU and disk. The slowest element in that chain will slow down the entire transfer.

On SciNet's side, our underlying filesystem is the high-performance [http://www-03.ibm.com/systems/software/gpfs/index.html GPFS] system, and the node we recommend you use (datamover1) has a high-speed connection to the network and fast CPUs.

==== Why are my transfers so much slower? ====

If you get numbers significantly lower than above, then there is a bottleneck in the transfer. The first thing to do is to run <tt>top</tt> on datamover1; if other people are transfering large files at the same time you are trying to, network congestion could result and you'll just have to wait until they are done.

If nothing else is going on on datamover1, there are a number of possibilites:
* network connection between SciNet and your machine - do you know the network connection of your remote machine? Are your systems connections tuned for performance [http://www.psc.edu/networking/projects/tcptune]?
* is the remote server busy?
* are the remote servers disks busy, or known to be slow?

For any further questions, contact us at [mailto:support@scinet.utoronto.ca Support @ SciNet]

Oldwiki.scinet.utoronto.ca:System Alerts

2014-07-02T14:45:39Z

Jchong: /* System Status */

== System Status==

{|
|[[File:up.png|up|link=GPC Quickstart]][[GPC Quickstart|GPC]]
|[[File:up.png|up|link=TCS Quickstart]][[TCS Quickstart|TCS]]
|[[File:up.png|up|link=Sandy]][[Sandy]]
|[[File:up.png|up|link=GPU Devel Nodes]][[GPU Devel Nodes|ARC]]
|[[File:up.png|up]]File System
|-
|[[File:up.png|up|link=Gravity]][[Gravity]]
|[[File:up.png|up|link=P7 Linux Cluster]][[P7 Linux Cluster|P7]]
|[[File:up.png|up|link=BGQ]][[BGQ]]
|[[File:down.png|up|link=HPSS]][[HPSS]]
|
|}

Mon Jun 30 15:19:39 EDT: All system down. Some kind of power issue (again).

Sun Jun 29 19:57:29: Compute systems started coming online about 730PM.

Sun Jun 29 18:20:41: filesystems restarted after some issues. Likely at least 8PM before compute systems available

Sun Jun 29 16:39:35 EDT 2014: large voltage spike tripped our main circuit breaker. We have power though it's out at sites within 2k because of lightning strike. Cooling system being restored

Sun Jun 29 15:47:11 EDT 2014: staff enroute to site. Should have update on cause within an hour

Sun Jun 29 15:40:31 EDT 2014: power lost about 3:20P today. All systems down. Investigating.

Note: As a precaution, emails by the Moab/Torque scheduler have been disabled because of a potential security vulnerability since Jan 24th 2014.

Last updated: Fri May 23 12:01:44 EDT 2014
([[Previous_messages:|Previous messages]])

Oldwiki.scinet.utoronto.ca:System Alerts

2014-06-30T23:40:23Z

Jchong: /* System Status */

== System Status==

{|
|[[File:up.png|up|link=GPC Quickstart]][[GPC Quickstart|GPC]]
|[[File:up.png|up|link=TCS Quickstart]][[TCS Quickstart|TCS]]
|[[File:up.png|up|link=Sandy]][[Sandy]]
|[[File:up.png|up|link=GPU Devel Nodes]][[GPU Devel Nodes|ARC]]
|[[File:up.png|up]]File System
|-
|[[File:up25.png|up|link=Gravity]][[Gravity]]
|[[File:up.png|up|link=P7 Linux Cluster]][[P7 Linux Cluster|P7]]
|[[File:up.png|up|link=BGQ]][[BGQ]]
|[[File:down.png|up|link=HPSS]][[HPSS]]
|
|}

Mon Jun 30 15:19:39 EDT: All system down. Some kind of power issue (again).

Sun Jun 29 19:57:29: Compute systems started coming online about 730PM.

Sun Jun 29 18:20:41: filesystems restarted after some issues. Likely at least 8PM before compute systems available

Sun Jun 29 16:39:35 EDT 2014: large voltage spike tripped our main circuit breaker. We have power though it's out at sites within 2k because of lightning strike. Cooling system being restored

Sun Jun 29 15:47:11 EDT 2014: staff enroute to site. Should have update on cause within an hour

Sun Jun 29 15:40:31 EDT 2014: power lost about 3:20P today. All systems down. Investigating.

Note: As a precaution, emails by the Moab/Torque scheduler have been disabled because of a potential security vulnerability since Jan 24th 2014.

Last updated: Fri May 23 12:01:44 EDT 2014
([[Previous_messages:|Previous messages]])

Oldwiki.scinet.utoronto.ca:System Alerts

2014-06-30T23:39:32Z

Jchong: /* System Status */

== System Status==

{|
|[[File:up.png|up|link=GPC Quickstart]][[GPC Quickstart|GPC]]
|[[File:up.png|up|link=TCS Quickstart]][[TCS Quickstart|TCS]]
|[[File:down.png|up|link=Sandy]][[Sandy]]
|[[File:up.png|up|link=GPU Devel Nodes]][[GPU Devel Nodes|ARC]]
|[[File:up.png|up]]File System
|-
|[[File:up25.png|up|link=Gravity]][[Gravity]]
|[[File:up.png|up|link=P7 Linux Cluster]][[P7 Linux Cluster|P7]]
|[[File:up.png|up|link=BGQ]][[BGQ]]
|[[File:down.png|up|link=HPSS]][[HPSS]]
|
|}

Mon Jun 30 15:19:39 EDT: All system down. Some kind of power issue (again).

Sun Jun 29 19:57:29: Compute systems started coming online about 730PM.

Sun Jun 29 18:20:41: filesystems restarted after some issues. Likely at least 8PM before compute systems available

Sun Jun 29 16:39:35 EDT 2014: large voltage spike tripped our main circuit breaker. We have power though it's out at sites within 2k because of lightning strike. Cooling system being restored

Sun Jun 29 15:47:11 EDT 2014: staff enroute to site. Should have update on cause within an hour

Sun Jun 29 15:40:31 EDT 2014: power lost about 3:20P today. All systems down. Investigating.

Note: As a precaution, emails by the Moab/Torque scheduler have been disabled because of a potential security vulnerability since Jan 24th 2014.

Last updated: Fri May 23 12:01:44 EDT 2014
([[Previous_messages:|Previous messages]])

Oldwiki.scinet.utoronto.ca:System Alerts

2014-06-30T00:08:28Z

Jchong: /* System Status */

== System Status==

{|
|[[File:up.png|up|link=GPC Quickstart]][[GPC Quickstart|GPC]]
|[[File:up.png|up|link=TCS Quickstart]][[TCS Quickstart|TCS]]
|[[File:up.png|up|link=Sandy]][[Sandy]]
|[[File:up.png|up|link=GPU Devel Nodes]][[GPU Devel Nodes|ARC]]
|[[File:up.png|up]]File System
|-
|[[File:up.png|up|link=Gravity]][[Gravity]]
|[[File:up.png|up|link=P7 Linux Cluster]][[P7 Linux Cluster|P7]]
|[[File:up.png|up|link=BGQ]][[BGQ]]
|[[File:down.png|up|link=HPSS]][[HPSS]]
|
|}

Sun Jun 29 19:57:29: Compute systems started coming online about 730PM.

Sun Jun 29 18:20:41: filesystems restarted after some issues. Likely at least 8PM before compute systems available

Sun Jun 29 16:39:35 EDT 2014: large voltage spike tripped our main circuit breaker. We have power though it's out at sites within 2k because of lightning strike. Cooling system being restored

Sun Jun 29 15:47:11 EDT 2014: staff enroute to site. Should have update on cause within an hour

Sun Jun 29 15:40:31 EDT 2014: power lost about 3:20P today. All systems down. Investigating.

Note: As a precaution, emails by the Moab/Torque scheduler have been disabled because of a potential security vulnerability since Jan 24th 2014.

Last updated: Fri May 23 12:01:44 EDT 2014
([[Previous_messages:|Previous messages]])

Data Transfer

2014-06-03T18:33:38Z

Jchong: Added small section on globus data transfer

=== General guidelines ===

All traffic to and from the data centre has to go via [http://en.wikipedia.org/wiki/Secure_Shell SSH], or secure shell.
This is a protocol which sets up a secure connection between two sites. In all cases, incoming connections to SciNet go through relatively low-speed connections to the login.scinet gateways, but there are many ways to copy files on top of the ssh protocol.

What node to use for data transfer to and from SciNet depends mostly on the amount of data to transfer:

==== Moving <10GB through the login nodes ====

The login nodes are accessible from outside SciNet, which means that you can transfer data between your own office/home machine and SciNet using scp or rsync (see below). Although the login nodes have a cpu_time timeout of 5 minutes (emphasis on cpu_time, not wall_time), most likely if you try to transfer more than 10GB probably you wouldn't succeed. While the login nodes can be used for transfers of less than 10GB, using a datamover node would still be faster.

Note that transfers through a login node will timeout after a certain time (currently set to 5 minutes cpu_time), so if you have a slow connection you may need to go through datamover1.

==== Moving >10GB through the datamover1 node ====

Serious moves of data (>10GB) to or from SciNet should be done from <tt>datamover1</tt> or <tt>datamover2</tt> nodes. From any of the interactive SciNet nodes, one should be able to <tt>ssh datamover1|2</tt> to log in. Those are the machines that have the fastest network connections to the outside world (by a factor of 10; a 10Gb/s link as vs 1Gb/s).

Transfers must be ''originated'' from <tt>datamover1|2</tt>; that is, one can not copy files from the outside world directly to or from the datamovers; one has to log in to a datamover and copy the data to or from the outside network. Your local machine must be reachable from the outside as well, either by its name or its IP address. If you are behind a firewall or a (wireless) router, this may not be possible. You may need to ask your network administrator to allow the datamovers to ssh to your machine. If you need to open a hole on your firewall we provide their IPs:

datamover1 142.150.188.121
datamover2 142.150.188.122

==== Hpn-ssh ====

The usual ssh protocols were not designed for speed. On the <tt>datamover1|2</tt> nodes, we have installed hpn-ssh, or [http://www.psc.edu/networking/projects/hpn-ssh/ High-Performance-enabled ssh]. You use this higher-performance ssh/scp/sftp variant by loading the `hpnssh' module. Hpn-ssh is backwards compatible with the `usual' ssh, but is capable of significantly higher speeds. If you routinely have large data transfers to do, we recommend having your system administrator look into installing [http://www.psc.edu/networking/projects/hpn-ssh/ hpn-ssh] on your system.

Everything we discuss below, unless otherwise stated, will work regardless of whether you have hpn-ssh installed on your remote system.

==== For Microsoft Windows users ====

Linux-windows transfers can be a bit more involved than linux-to-linux, but using [http://www.cygwin.com Cygwin], this should not be a problem. Make sure you install Cygwin with the openssh package.

If you want to remain 100% a Windows environment, another very good tool is [http://winscp.net/eng/index.php WinSCP]. It will let you easily transfer and synchronize data between your Windows workstation and the login nodes using your ssh credentials (provided that it's not much more than 10GB on each sync pass).

If you are going to use the [[Data_Management#Moving_.3E10GB_through_the_datamover1_node | datamover1 method]], and assuming your machine is not a wireless laptop (if it
is, best to find a nearby computer that's not wireless and use a usb
memory stick), you'll need the IP address of your machine, which you find by
typing "ipconfig /all" on your local windows machine. Also, you will need to have the ssh daemon (sshd) running locally in Cygwin.

Also note that your windows user name does not have to be the same as on SciNet, this just
depends on how your local windows system was set up.

All locations given to scp or rsync in cygwin have to be in unix format (using "/" not "\"), and will be relative to cygwin's path, not windows (e.g.
use /cygdrive/c/...... to get to the windows C: drive).

=== Ways to transfer data ===

==== Globus data transfer (BETA) ====

Globus is a file-transfer service with a easy-to-use web interface to allow people to transfer file with ease. To get started, please sign up for a Globus account at [https://www.globus.org/ Globus website]. Once you sign up for an account, go to [https://www.globus.org/xfer/StartTransfer this page] to start the file transfer. For the time being, please enter scinet#gpctest as one endpoint for file transfer. If you are trying to transfer data from a laptop or desktop, you will need to install Globus Connect Personal software available [https://support.globus.org/entries/24044351 here] to setup an endpoint for the laptop or desktop and perform the transfer.

==== scp ====

<tt>scp</tt>, or secure copy, is the easiest way to copy files, although we generally find rsync below to be faster.

scp works like cp to copy files:

$ scp original_file copy_file

except that either the original or the copy can be on another system:

$ scp jonsdatafile.bin jon@remote.system.com:/home/jon/bigdatadir/

will copy the data file into the directory <tt>/home/jon/bigdatadir/</tt> on <tt>remote.system.com</tt> after logging in as <tt>jon</tt>; you will be prompted for a password (unless you've set up ssh keys).

Copying from remote systems works the same way:

$ scp jon@remote.system.com:/home/jon/bigdatadir/newdata.bin .

And wildcards work as you'd expect (except you have to quote the wildcards on the remote system, as it can't expand properly here.)

$ scp *.bin jon@remote.system.com:/home/jon/bigdatadir/
$ scp jon@remote.system.com:"/home/jon/inputdata/*" .

There are few options worth knowing about:
* <tt>scp -C</tt> compresses the file before transmitting it; ''if'' the file compresses well, this can significantly increase the effective data transfer rate (though usually not as much as compressing the data, then sending it, then uncompressing). If the file doesn't compress well, than this adds CPU overhead without accomplishing much, and can slow down your data transfer.
* <tt>scp -oNoneEnabled=yes -oNoneSwitch=yes</tt> -- This is an hpn-ssh only option. If CPU overhead is a significant bottleneck in the data transfer, then we can avoid this by turning off the secure encryption of the data. For most of us, this is ok, but for others it is not. In either cases, '''authentication''' remains secure, it is only the data transfer that is in plaintext.

==== rsync ====

[http://samba.anu.edu.au/rsync/ rsync] is a very powerful tool for mirroring directories of data.
$ rsync -av -e ssh scinetdatadir jon@remote.system.com:/home/jon/bigdatadir/
rsync has a dizzying number of options; the above syncs <tt>scinetdatadir</tt> ''to'' the remote system; that is, any files that are newer on the localsystem are updated on the remote system. The converse isn't true; if there were newer files on the remote system, you'd have to bring those over with
$ rsync -av -e ssh jon@remote.system.com:/home/jon/bigdatadir/ scinetdatadir
The <tt>-av</tt> options are for verbose and `archive' mode, which preserves timestamps and permissions, which is normally what you want. <tt>-e ssh</tt> tells it to use ssh for the transfer.

One of the powerful things about rsync is that it looks to see what files already exist before copying, so you can use it repeatedly as a data directory fills and it won't make unnecessary copies; similarly, if a (say) log file grows over time, it will only copy the difference between the files, further speeding things up. This also means that it behaves well if a transfer is interrupted; a second invocation of rsync will continue where the other left off.

As with <tt>scp -C</tt>, <tt>rsync -z</tt> compresses on the fly, which can significantly enhance effective data transfer rates if the files compress well, or hurt it if not.

As with scp, if both sides are running hpn-ssh one can disable encryption of the data stream should that prove to be a bottleneck:
$ rsync -av -e "ssh -oNoneEnabled=yes -oNoneSwitch=yes" jon@remote.system.com:/home/jon/bigdatadir/ scinetdatadir

SciNet's login nodes, 142.150.188.5[1-4], are publicly accessible and can be used for data transfer as long as your material is not one big chunk (much more than 2-3GB each file). We have a 5 minutes CPU time limit on the login nodes and the transfer process may be killed by the kernel before completion. The workaround is to transfer your data using a rsync loop, by checking the rsync return code, assuming some files can be transferred before reaching the CPU limit. For example in a bash shell:
<pre>
for i in {1..100}; do ### try 100 times
rsync ...
[ "$?" == "0" ] && break
done
</pre>

==== ssh tunnel ====

Alternatively you may use a reverse ssh tunnel (ssh -R).

If your transfer is above 10GB you will need to use one of SciNet's datamovers. If your workstation is behind a firewall (as the datamovers are), you'll need a node external to your firewall, on the edge of your organization's network, that will serve as a gateway, and can be accessible via ssh by both your workstation and the datamovers. Initiate a "ssh -R" connection from SciNet's datamover to that node. This node needs to have its ssh GatewayPorts enabled so that your workstation can connect to the specified port on that node, which will forward the traffic back to SciNet's datamover.

=== Transfer speeds ===

==== What transfer speeds could I expect? ====

Below are some typical transfer numbers from datamover1 to another University of Toronto machine with a 1Gb/s link to the campus network:

{| class="wikitable" border="0"
|-
!{{Hl2}}| Mode
!{{Hl2}}| With hpn-ssh
!{{Hl2}}| Without
|-
| rsync
| 60-80 MB/s
| 30-40 MB/s
|-
| scp
| 50 MB/s
| 25 MB/s
|}

==== What can slow down my data transfer? ====

To move data quickly, ''all'' of the stages in the process have to be fast; the file system you are reading data from, the CPU reading the data, the network connection between the sender and the reciever, and the recipient CPU and disk. The slowest element in that chain will slow down the entire transfer.

On SciNet's side, our underlying filesystem is the high-performance [http://www-03.ibm.com/systems/software/gpfs/index.html GPFS] system, and the node we recommend you use (datamover1) has a high-speed connection to the network and fast CPUs.

==== Why are my transfers so much slower? ====

If you get numbers significantly lower than above, then there is a bottleneck in the transfer. The first thing to do is to run <tt>top</tt> on datamover1; if other people are transfering large files at the same time you are trying to, network congestion could result and you'll just have to wait until they are done.

If nothing else is going on on datamover1, there are a number of possibilites:
* network connection between SciNet and your machine - do you know the network connection of your remote machine? Are your systems connections tuned for performance [http://www.psc.edu/networking/projects/tcptune]?
* is the remote server busy?
* are the remote servers disks busy, or known to be slow?

For any further questions, contact us at [mailto:support@scinet.utoronto.ca Support @ SciNet]

FAQ

2014-02-28T13:49:36Z

Jchong: /* How can I change or reset the password for my SciNet account? */

__TOC__

==The Basics==
===Whom do I contact for support?===

Whom do I contact if I have problems or questions about how to use the SciNet systems?

'''Answer:'''

E-mail [mailto:support@scinet.utoronto.ca <support@scinet.utoronto.ca>]

In your email, please include the following information:

* your username on SciNet
* the cluster that your question pertains to (GPC or TCS; SciNet is not a cluster!),
* any relevant error messages
* the commands you typed before the errors occured
* the path to your code (if applicable)
* the location of the job scripts (if applicable)
* the directory from which it was submitted (if applicable)
* a description of what it is supposed to do (if applicable)
* if your problem is about connecting to SciNet, the type of computer you are connecting from.

Note that your password should never, never, never be to sent to us, even if your question is about your account.

Try to avoid sending email only to specific individuals at SciNet. Your chances of a quick reply increase significantly if you email our team!

===What does ''code scaling'' mean?===

'''Answer:'''

Please see [[Introduction_To_Performance#Parallel_Speedup|A Performance Primer]]

===What do you mean by ''throughput''?===

'''Answer:'''

Please see [[Introduction_To_Performance#Throughput|A Performance Primer]].

Here is a simple example:

Suppose you need to do 10 computations. Say each of these runs for
1 day on 8 cores, but they take "only" 18 hours on 16 cores. What is the
fastest way to get all 10 computations done - as 8-core jobs or as
16-core jobs? Let us assume you have 2 nodes at your disposal.
The answer, after some simple arithmetic, is that running your 10
jobs as 8-core jobs will take 5 days, whereas if you ran them
as 16-core jobs it would take 7.5 days. Take your own conclusions...

===I changed my .bashrc/.bash_profile and now nothing works===

The default startup scripts provided by SciNet, and guidelines for them, can be found [[Important_.bashrc_guidelines|here]]. Certain things - like sourcing <tt>/etc/profile</tt>
and <tt>/etc/bashrc</tt> are ''required'' for various SciNet routines to work!

If the situation is so bad that you cannot even log in, please send email [mailto:support@scinet.utoronto.ca support].

===Could I have my login shell changed to (t)csh?===

The login shell used on our systems is bash. While the tcsh is available on the GPC and the TCS, we do not support it as the default login shell at present. So "chsh" will not work, but you can always run tcsh interactively. Also, csh scripts will be executed correctly provided that they have the correct "shebang" <tt>#!/bin/tcsh</tt> at the top.

===How can I run Matlab / IDL / Gaussian / my favourite commercial software at SciNet?===

'''Answer:'''

Because SciNet serves such a disparate group of user communities, there is just no way we can buy licenses for everyone's commercial package. The only commercial software we have purchased is that which in principle can benefit everyone -- fast compilers and math libraries (Intel's on GPC, and IBM's on TCS).

If your research group requires a commercial package that you already have or are willing to buy licenses for, contact us at [mailto:support@scinet.utoronto.ca support@scinet] and we can work together to find out if it is feasible to implement the packages licensing arrangement on the SciNet clusters, and if so, what is the the best way to do it.

Note that it is important that you contact us before installing commercially licensed software on SciNet machines, even if you have a way to do it in your own directory without requiring sysadmin intervention. It puts us in a very awkward position if someone is found to be running unlicensed or invalidly licensed software on our systems, so we need to be aware of what is being installed where.

===Do you have a recommended ssh program that will allow scinet access from Windows machines?===

'''Answer:'''

The [[Ssh#SSH_for_Windows_Users | SSH for Windows users]] programs we recommend are:

* [http://mobaxterm.mobatek.net/en/ MobaXterm] is a tabbed ssh client with some Cygwin tools, including ssh and X, all wrapped up into one executable.
* [http://www.chiark.greenend.org.uk/~sgtatham/putty/ PuTTY] - this is a terminal for windows that connects via ssh. It is a quick install and will get you up and running quickly. To set up your passphrase protected ssh key with putty, see [http://the.earth.li/~sgtatham/putty/0.61/htmldoc/Chapter8.html#pubkey here].
* [http://www.cygwin.com/ CygWin] - this is a whole linux-like environment for windows, which also includes an X window server so that you can display remote windows on your desktop. Make sure you include the openssh and X window system in the installation for full functionality. This is recommended if you will be doing a lot of work on Linux machines, as it makes a very similar environment available on your computer. To set up your ssh keys, following the Linux instruction on the [[Ssh keys]] page.
 To set up your ssh keys, following the Linux instruction on the [[Ssh keys]] page.

===My ssh key does not work! WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! ===

'''Answer:'''

[[Ssh_keys#Testing_Your_Key | Testing Your Key]]

* If this doesn't work, you should be able to login using your password, and investigate the problem. For example, if during a login session you get an message similar to the one below, just follow the instruction and delete the offending key on line 3 (you can use vi to jump to that line with ESC plus : plus 3). That only means that you may have logged in from your home computer to SciNet in the past, and that key is obsolete.
<pre>
$ ssh USERNAME@login.scinet.utoronto.ca

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@**@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@**@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle
attack)!
It is also possible that the RSA host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
53:f9:60:71:a8:0b:5d:74:83:52:**fe:ea:1a:9e:cc:d3.
Please contact your system administrator.
Add correct host key in /home/<user>/.ssh/known_hosts to get rid of
this message.
Offending key in /home/<user>/.ssh/known_hosts:3
RSA host key for login.scinet.utoronto.ca
<http://login.scinet.utoronto.ca <http://login.scinet.utoronto.ca>> has
changed and you have requested
</pre>

* If you get the message below you may need to logout of your gnome session and log back in since the ssh-agent needs to be
restarted with the new passphrase ssh key.
<pre>
$ ssh USERNAME@login.scinet.utoronto.ca

Agent admitted failure to sign using the key.
</pre>

===Can't forward X: "Warning: No xauth data; using fake authentication data", or "X11 connection rejected because of wrong authentication."===

I used to be able to forward X11 windows from SciNet to my home machine, but now I'm getting these messages; what's wrong?

'''Answer:'''

This very likely means that ssh/xauth can't update your ${HOME}/.Xauthority file.

The simplest pssible reason for this is that you've filled your 10GB /home quota and so can't write anything to your home directory. Use

<pre>
$ module load extras
$ diskUsage
</pre>

to check to see how close you are to your disk usage on ${HOME}.

Alternately, this could mean your .Xauthority file has become broken/corrupted/confused some how, in which case you can delete that file, and when you next log in you'll get a similar warning message involving creating .Xauthority, but things should work.

===How come I can not login to TCS?===

'''Answer:'''

A SciNet account doesn't automatically entitle you to TCS access. At a minimum, TCS jobs need to run on at least 32 cores (64 preferred because of Simultaneous Multi Threading - [[TCS_Quickstart#Node_configuration|SMT]] - on these nodes) and need the large memory (4GB/core) and bandwidth on the system. Essentially you need to be able to explain why the work can't be done on the GPC.

===How can I reset the password for my Compute Canada account?===

'''Answer:'''

You can reset your password for your Compute Canada account here:

https://ccdb.computecanada.org/security/forgot

===How can I change or reset the password for my SciNet account?===

'''Answer:'''

To reset your password at SciNet please go to [https://portal.scinet.utoronto.ca/password_resets Password reset page].

If you know your old password and want to change it, that can be done here:

https://portal.scinet.utoronto.ca/change_password

===Why am I getting the error "Permission denied (publickey,gssapi-with-mic,password)"?===

This error can pop up in a variety of situations: when trying to log in, or when after a job has finished, when the error and output files fail to be copied (there are other possible reasons for this failure as well -- see [[FAQ#My_GPC_job_died.2C_telling_me_.60Copy_Stageout_Files_Failed.27|My GPC job died, telling me:Copy Stageout Files Failed]]).
In most cases, the "Permission denioed" error is caused by incorrect permission of the (hidden) .ssh directory. Ssh is used for logging in as well as for the copying of the standard error and output files after a job.

For security reasons,
the directory .ssh should only be writable and readable to you, but yours
has read permission for everybody, and thus it fails. You can change
this by
<pre>
chmod 700 ~/.ssh
</pre>
And to be sure, also do
<pre>
chmod 600 ~/.ssh/id_rsa ~/.ssh/id_rsa.pub ~/authorized_keys
</pre>

===ERROR:102: Tcl command execution failed? when loading modules ===
Modules sometimes require other modules to be loaded first.
Module will let you know if you didn’t.
For example:
<pre>
$ module purge
$ module load python
python/2.6.2(11):ERROR:151: Module ’python/2.6.2’ depends on one of the module(s) ’gcc/4.4.0’
python/2.6.2(11):ERROR:102: Tcl command execution failed: prereq gcc/4.4.0
$ gpc-f103n084-$ module load gcc python
$
</pre>

==Compiling your Code==

===How can I get g77 to work?===

The fortran 77 compilers on the GPC are ifort and gfortran. We have dropped support for g77. This has been a conscious decision. g77 (and the associated library libg2c) were completely replaced six years ago (Apr 2005) by the gcc 4.x branch, and haven't undergone any updates at all, even bug fixes, for over five years.
If we would install g77 and libg2c, we would have to deal with the inevitable confusion caused when users accidentally link against the old, broken, wrong versions of the gcc libraries instead of the correct current versions.

If your code for some reason specifically requires five-plus-year-old libraries, availability, compatibility, and unfixed-known-bug problems are only going to get worse for you over time, and this might be as good an opportunity as any to address those issues.

''A note on porting to gfortran or ifort:''

While gfortran and ifort are rather compatible with g77, one
important difference is that by default, gfortran does not preserve
local variables between function calls, while g77 does. Preserved
local variables are for instance often used in implementations of quasi-random number
generators. Proper fortran requires to declare such variables as SAVE
but not all old code does this.
Luckily, you can change gfortran's default behavior with the flag
<tt>-fno-automatic</tt>. For ifort, the corresponding flag is <tt>-noautomatic</tt>.

===Where is libg2c.so?===

libg2c.so is part of the g77 compiler, for which we dropped support. See [[#How can I get g77 to work on the GPC?]] for our reasons.

===Autoparallelization does not work!===

I compiled my code with the <tt>-qsmp=omp,auto</tt> option, and then I specified that it should be run with 64 threads - with
export OMP_NUM_THREADS=64

However, when I check the load using <tt>llq1 -n</tt>, it shows a load on the node of 1.37. Why?

'''Answer:'''

Using the autoparallelization will only get you so far. In fact, it usually does not do too much. What is helpful is to run the compiler with the <tt>-qreport</tt> option, and then read the output listing carefully to see where the compiler thought it could parallelize, where it could not, and the reasons for this. Then you can go back to your code and carefully try to address each of the issues brought up by the compiler.
We ''emphasize'' that this is just a rough first guide, and that the compilers are still not magical! For more sophisticated approaches to parallelizing your code, email us at [mailto:support@scinet.utoronto.ca <support@scinet.utoronto.ca>] to set up an appointment with one
of our technical analysts.

===How do I link against the Intel Math Kernel Library?===

If you need to link in the Intel Math Kernel Library (MKL) libraries, you are well advised to use the Intel(R) Math Kernel Library Link Line Advisor: http://software.intel.com/en-us/articles/intel-mkl-link-line-advisor/ for help in devising the list of libraries to link with your code.

'''''Note that this give the link line for the command line. When using this in Makefiles, replace $MKLPATH by ${MKLPATH}.'''''

'''''Note too that, unless the integer arguments you will be passing to the MKL libraries are actually 64-bit integers, rather than the normal int or INTEGER types, you want to specify 32-bit integers (lp64) .'''''

===Can the compilers on the login nodes be disabled to prevent accidentally using them?===

'''Answer:'''

You can accomplish this by modifying your .bashrc to not load the compiler modules. See [[Important .bashrc guidelines]].

==="relocation truncated to fit: R_X86_64_PC32": Huh?===

What does this mean, and why can't I compile this code?

'''Answer:'''

Welcome to the joys of the x86 architecture! You're probably having trouble building arrays larger than 2GB, individually or together. Generally, you have to try to use the medium or large x86 `memory model'. For the intel compilers, this is specified with the compile options

-mcmodel=medium -shared-intel

==="feupdateenv is not implemented and will always fail"===

How do I get rid of this and what does it mean?

'''Answer:'''
First note that, as ominous as it sounds, this is really just a warning, and has to do with the intel math library. You can ignore it (unless you really are trying to manually change the exception handlers for floating point exceptions such as divide by zero), or take the safe road and get rid off it by linking with the intel math functions library:<pre>-limf</pre>See also [[#How do I link against the Intel Math Kernel Library?]]

===Cannot find rdmacm library when compiling on GPC===

I get the following error building my code on GPC: "<tt>ld: cannot find -lrdmacm</tt>". Where can I find this library?

'''Answer:'''

This library is part of the MPI libraries; if your compiler is having problems picking it up, it probably means you are mistakenly trying to compile on the login nodes (scinet01..scinet04). The login nodes aren't part of the GPC; they are for logging into the data centre only. From there you must go to the GPC or TCS development nodes to do any real work.

=== Why do I get this error when I try to compile: "icpc: error #10001: could not find directory in which /usr/bin/g++41 resides" ?===

You are trying to compile on the login nodes. As described in the wiki ( https://support.scinet.utoronto.ca/wiki/index.php/GPC_Quickstart#Login ), or in the users guide you would have received with your account, Scinet supports two main clusters, with very different architectures. Compilation must be done on the development nodes of the appropriate cluster (in this case, gpc01-04). Thus, log into gpc01, gpc02, gpc03, or gpc04, and compile from there.

==Testing your Code==

=== Can I run a something for a short time on the development nodes? ===

I am in the process of playing around with the mpi calls in my code to get it to work. I do a lot of tests and each of them takes a couple of seconds only. Can I do this on the development nodes?

'''Answer:'''

Yes, as long as it's very brief (a few minutes). People use the development nodes
for their work, and you don't want to bog it down for people, and testing a real
code can chew up a lot more resources than compiling, etc. The procedures differ
depending on what machine you're using.

==== TCS ====

On the TCS you can run small MPI jobs on the tcs02 node, which is meant for
development use. But even for this test run on one node, you'll need a host file --
a list of hosts (in this case, all tcs-f11n06, which is the `real' name of tcs02)
that the job will run on. Create a file called `hostfile' containing the following:

tcs-f11n06
tcs-f11n06
tcs-f11n06
tcs-f11n06

for a 4-task run. When you invoke "poe" or "mpirun", there are runtime
arguments that you specify pointing to this file. You can also specify it
in an environment variable MP_HOSTFILE, so, if your file is in your /scratch directory, say
${SCRATCH}/hostfile, then you would do

<pre>
export MP_HOSTFILE=${SCRATCH}/hostfile
</pre>

in your shell. You will also need to create a <tt>.rhosts</tt> file in your
home director, again listing <tt>tcs-f11n06</tt> so that <tt>poe</tt>
can start jobs. After that you can simply run your program. You can use
mpiexec:

<pre>
mpiexec -n 4 my_test_program
</pre>

adding <tt> -hostfile /path/to/my/hostfile</tt> if you did not set the environment
variable above. Alternatively, you can run it with the poe command (do a "man poe" for details), or even by
just directly running it. In this case the number of MPI processes will by default
be the number of entries in your hostfile.

==== GPC ====

On the GPC one can run short test jobs on the GPC [[GPC_Quickstart#Compile.2FDevel_Nodes | development nodes ]]<tt>gpc01</tt>..<tt>gpc04</tt>;
if they are single-node jobs (which they should be) they don't need a hostfile. Even better, though, is to request an [[ Moab#Interactive | interactive ]] job and run the tests either in regular batch queue or using a short high availability [[ Moab#debug | debug ]] queue that is reserved for this purpose.

=== How do I run a longer (but still shorter than an hour) test job quickly ? ===

'''Answer'''

On the GPC there is a high turnover short queue called [[ Moab#debug | debug ]] that is designed for
this purpose. You can use it by adding
<pre>
#PBS -q debug
</pre>
to your submission script.

==Running your jobs==

===My job can't write to /home===

My code works fine when I test on the development nodes, but when I submit a job, or even run interactively in the development queue on GPC, it fails. What's wrong?

'''Answer:'''

As [[Data_Management#Home_Disk_Space | discussed]] [https://support.scinet.utoronto.ca/wiki/images/5/54/SciNet_Tutorial.pdf elsewhere], <tt>/home</tt> is mounted read-only on the compute nodes; you can only write to <tt>/home</tt> from the login nodes and devel nodes. (The [[GPC_Quickstart#128Glargemem | largemem nodes]] on GPC, in this respect, are more like devel nodes than compute nodes). In general, to run jobs you can read from <tt>/home</tt> but you'll have to write to <tt>/scratch</tt> (or, if you were allocated space through the LRAC/NRAC process, on <tt>/project</tt>). More information on SciNet filesytems can be found on our [[Data_Management | Data Management]] page.

===Error Submitting My Job: qsub: Bad UID for job execution MSG=ruserok failed ===

I write up a submission script as in the examples, but when I attempt to submit the job, I get the above error. What's wrong?

'''Answer:'''

This error will occur if you try to submit a job from the login nodes. The login nodes are the gateway to all of SciNet's systems (GPC, TCS, P7, ARC), which have different hardware and queueing systems. To submit a job, you must log into a development node for the particular cluster you are submitting to and submit from there.

===OpenMP on the TCS===

How do I run an OpenMP job on the TCS?

'''Answer:'''

Please look at the [[TCS_Quickstart#Submission_Script_for_an_OpenMP_Job | TCS Quickstart ]] page.

===Can I can use hybrid codes consisting of MPI and openMP on the GPC?===

'''Answer:'''

Yes. Please look at the [[GPC_Quickstart#Hybrid_MPI.2FOpenMP_jobs | GPC Quickstart ]] page.

===How do I run serial jobs on GPC?===

'''Answer''':

So it should be said first that SciNet is a parallel computing resource,
and our priority will always be parallel jobs. Having said that, if
you can make efficient use of the resources using serial jobs and get
good science done, that's good too, and we're happy to help you.

The GPC nodes each have 8 processing cores, and making efficient use of these
nodes means using all eight cores. As a result, we'd like to have the
users take up whole nodes (eg, run multiples of 8 jobs) at a time.

It depends on the nature of your job what the best strategy is. Several approaches are presented on the [[User_Serial|serial run wiki page]].

===Why can't I request only a single cpu for my job on GPC?===

'''Answer''':

On GPC, computers are allocated by the node - that is, in chunks of 8 processors. If you want to run a job that requires only one processor, you need to bundle the jobs into groups of 8, so as to not be wasting the other 7 for 48 hours. See [[User_Serial|serial run wiki page]].

===How do I run serial jobs on TCS?===

'''Answer''': You don't.

===But in the queue I found a user who is running jobs on GPC, each of which is using only one processor, so why can't I?===

'''Answer''':

The pradat* and atlaspt* jobs, amongst others, are jobs of the ATLAS high energy physics project. That they are reported as single cpu jobs is an artifact of the moab scheduler. They are in fact being automatically bundled into 8-job bundles but have to run individually to be compatible with their international grid-based systems.

===How do I use the ramdisk on GPC?===

To use the ramdisk, create and read to / write from files in /dev/shm/.. just as one would to (eg) ${SCRATCH}. Only the amount of RAM needed to store the files will be taken up by the temporary file system; thus if you have 8 serial jobs each requiring 1 GB of RAM, and 1GB is taken up by various OS services, you would still have approximately 7GB available to use as ramdisk on a 16GB node. However, if you were to write 8 GB of data to the RAM disk, this would exceed available memory and your job would likely crash.

It is very important to delete your files from ram disk at the end of your job. If you do not do this, the next user to use that node will have less RAM available than they might expect, and this might kill their jobs.

''More details on how to setup your script to use the ramdisk can be found on the [[User_Ramdisk|Ramdisk wiki page]].''

===How can I automatically resubmit a job?===

Commonly you may have a job that you know will take longer to run than what is
permissible in the queue. As long as your program contains [[Checkpoints|checkpoint]] or
restart capability, you can have one job automatically submit the next. In
the following example it is assumed that the program finishes before
the 48 hour limit and then resubmits itself by logging into one
of the development nodes.

<source lang="bash">
#!/bin/bash
# MOAB/Torque example submission script for auto resubmission
# SciNet GPC
#
#PBS -l nodes=1:ppn=8,walltime=48:00:00
#PBS -N my_job

# DIRECTORY TO RUN - $PBS_O_WORKDIR is directory job was submitted from
cd $PBS_O_WORKDIR

# YOUR CODE HERE
./run_my_code

# RESUBMIT 10 TIMES HERE
num=$NUM
if [ $num -lt 10 ]; then
num=$(($num+1))
ssh gpc01 "cd $PBS_O_WORKDIR; qsub ./script_name.sh -v NUM=$num";
fi
</source>

<pre>
qsub script_name.sh -v
</pre>

You can alternatively use [[ Moab#Job_Dependencies | Job dependencies ]] through the queuing system which will not start one job until another job has completed.

If your job can't be made to automatically stop before the 48 hour queue window, but it does write out checkpoints, you can use the timeout command to stop the program while you still have time to resubmit; for instance

<source lang="bash">
timeout 2850m ./run_my_code argument1 argument2
</source>

will run the program for 47.5 hours (2850 minutes), and then send it SIGTERM to exit the program.

===How can I pass in arguments to my submission script?===

If you wish to make your scripts more generic you can use qsub's ability
to pass in environment variables to pass in arguments to your script.
The following example shows a case where an input and an output
file are passed in on the qsub line. Multiple variables can be
passed in using the qsub "-v" option and comma delimited.

<source lang="bash">
#!/bin/bash
# MOAB/Torque example of passing in arguments
# SciNet GPC
#
#PBS -l nodes=1:ppn=8,walltime=48:00:00
#PBS -N my_job

# DIRECTORY TO RUN - $PBS_O_WORKDIR is directory job was submitted from
cd $PBS_O_WORKDIR

# YOUR CODE HERE
./run_my_code -f $INFILE -o $OUTFILE
</source>

<pre>
qsub script_name.sh -v INFILE=input.txt,OUTFILE=outfile.txt
</pre>

=== How can I run a job longer than 48 hours? ===

'''Answer:'''

The SciNet queues have a queue limit of 48 hours. This is pretty typical for systems of its size in Canada and elsewhere, and larger systems commonly have shorter limits. The limits are there to ensure that every user gets a fair share of the system (so that no one user ties up lots of nodes for a long time), and for safety (so that if one memory board in one node fails in the middle of a very long job, you haven't lost a months' worth of work).

Since many of us have simulations that require more than that much time, most widely-used scientific applications have "checkpoint-restart" functionality, where every so often the complete state of the calculation is stored as a checkpoint file, and one can restart a simulation from one of these. In fact, these restart files tend to be quite useful for a number of purposes.

If your job will take longer, you will have to submit your job in multiple parts, restarting from a checkpoint each time. In this way, one can run a simulation much longer than the queue limit. In fact, one can even write job scripts which automatically re-submit themselves until a run is completed, using [[FAQ#How_can_I_automatically_resubmit_a_job.3F | automatic resubmission. ]]

=== Why did showstart say it would take 3 hours for my job to start before, and now it says my job will start in 10 hours? ===

'''Answer:'''

Please look at the [[FAQ#How_do_priorities_work.2Fwhy_did_that_job_jump_ahead_of_mine_in_the_queue.3F | How do priorities work/why did that job jump ahead of mine in the queue? ]] page.

===How do priorities work/why did that job jump ahead of mine in the queue?===

'''Answer:'''

The [[Moab | queueing system]] used on SciNet machines is a [http://en.wikipedia.org/wiki/Priority_queue Priority Queue]. Jobs enter the queue at the back of the queue, and slowly make their way to the front as those ahead of them are run; but a job that enters the queue with a higher priority can `cut in line'.

The main factor which determines priority is whether or not the user (or their PI) has an [http://wiki.scinethpc.ca/wiki/index.php/Application_Process LRAC or NRAC allocation]. These are competitively allocated grants of computer time; there is a call for proposals towards the end of every calendar year. Users with an allocation have high priorities in an attempt to make sure that they can use the amount of computer time the committees granted them. Their priority decreases as they approach their allotted usage over the current window of time; by the time that they have exhausted that allotted usage, their priority is the same as users with no allocation (unallocated, or `default' users). Unallocated users have a fixed, low, priority.

This priority system is called `fairshare'; the scheduler attempts to make sure everyone has their fair share of the machines, where the share that's fair has been determined by the allocation committee. The fairshare window is a rolling window of two weeks; that is, any time you have a job in the queue, the fairshare calculation of its priority is given by how much of your allocation of the machine has been used in the last 14 days.

A particular allocation might have some fraction of GPC - say 4% of the machine (if the PI had been allocated 10 million CPU hours on GPC). The allocations have labels; (called `Resource Allocation Proposal Identifiers', or RAPIs) they look something like

abc-123-ab

where abc-123 is the PIs CCRI, and the suffix specifies which of the allocations granted to the PI is to be used. These can be specified on a job-by-job basis. On GPC, one adds the line
#PBS -A RAPI
to your script; on TCS, one uses
# @ account_no = RAPI
If the allocation to charge isn't specified, a default is used; each user has such a default, which can be changed at the same portal where one changes one's password,

https://portal.scinet.utoronto.ca/

A jobs priority is determined primarily by the fairshare priority of the allocation it is being charged to; the previous 14 days worth of use under that allocation is calculated and compared to the allocated fraction (here, 5%) of the machine over that window (here, 14 days). The fairshare priority is a decreasing function of the allocation left; if there is no allocation left (eg, jobs running under that allocation have already used 379,038 CPU hours in the past 14 days), the priority is the same as that of a user with no granted allocation. (This last part has been the topic of some debate; as the machine gets more utilized, it will probably be the case that we allow RAC users who have greatly overused their quota to have their priorities to drop below that of unallocated users, to give the unallocated users some chance to run on our increasingly crowded system; this would have no undue effect on our allocated users as they still would be able to use the amount of resources they had been allocated by the committees.) Note that all jobs charging the same allocation get the same fairshare priority.

There are other factors that go into calculating priority, but fairshare is the most significant. Other factors include
* amount of time waiting in queue (measured in units of the requested runtime). A job that requests 1 hour in the queue and has been waiting 2 days will get a bump in its priority larger than a job that requests 2 days and has been waiting the same time.
* User adjustment of priorities ( See below ).

The major effect of these subdominant terms is to shuffle the order of jobs running under the same allocation.

===How do we manage job priorities within our research group?===

'''Answer:'''

Obviously, managing shared resources within a large group - whether it
is conference funding or CPU time - takes some doing.

It's important to note that the fairshare periods are intentionally kept
quite short - just two weeks long. So, for example, let us say that in your resource
allocation you have about 10% of the machine. Then for someone to use
up the whole two week amount of time in 2 days, they'd have to use 70%
of the machine in those two days - which is unlikely to happen by
accident. If that does happen,
those using the same allocation as the person who used 70% of the
machine over the two days will suffer by having much lower priority for
their jobs, but only for the next 12 days - and even then, if there are
idle cpus they'll still be able to compute.

There will be online tools for seeing how the allocation is being used,
and those people who are in charge in your group will be able to use
that information to manage the users, telling them to dial it down or
up. We know that managing a large research group is hard, and we want
to make sure we provide you the information you need to do your job
effectively.

One way for users within a group to manage their priorities within the group
is with [[Moab#Adjusting_Job_Priority | user-adjusted priorities]]; this is
described in more detail on the [[Moab | Scheduling System]] page.

=== How do I charge jobs to my NRAC/LRAC allocation? ===

'''Answer:'''

Please see the [[Moab#Accounting|accounting section of Moab page]].

=== How does one check the amount of used CPU-hours in a project, and how does one get statistics for each user in the project? ===

'''Answer:'''

This information is available on the scinet portal,https://portal.scinet.utoronto.ca, See also [[SciNet Usage Reports]].

=== How does the Infiniband Upgrade affect my 2012 NRAC allocation ?===

The NRAC allocations for the current (2012) year that were based on ethernet and infiniband will carry over, however the allocation will be on the full GPC, not just the subsection. So if you were allocated 500 hours on Infiniband your fairshare allocation will still be 500 hours, just 500 out or 30,000, instead of 500 out of 7,000. If you received two allocations, one on gigE and one on IB, they will simply be combined. This should benefit all users as the desegregation of the GPC provides a greater pool of nodes increasing the probability of your job to run.

==Monitoring jobs in the queue==

===Why hasn't my job started?===

'''Answer:'''

Use the moab command

<pre>
checkjob -v jobid
</pre>

and the last couple of lines should explain why a job hasn't started.

Please see [[Moab| Job Scheduling System (Moab) ]] for more detailed information

===How do I figure out when my job will run?===

'''Answer:'''

Please see [[Moab#Available_Resources| Job Scheduling System (Moab) ]]



===I submit my GPC job, and I get an email saying it was rejected===
'''Answer:'''

This happens because the job you've submitted breaks one of the rules of the queues and is rejected. An email
is sent with the JOBID, JOBNAME, and the reason it was rejected. The following is an example where a job
requests more than 48 hours and was rejected.

<pre>
PBS Job Id: 3462493.gpc-sched
Job Name: STDIN
job deleted
Job deleted at request of root@gpc-sched
MOAB_INFO: job was rejected - job violates class configuration 'wclimit too high for class 'batch_ib' (345600 > 172800)'
</pre>

Jobs on the TCS or GPC may only run for 48 hours at a time; this restriction greatly increases responsiveness of the queue and queue throughput for all our users. If your computation requires longer than that, as many do, you will have to [[ Checkpoints | checkpoint ]] your job and restart it after each 48-hour queue window. You can manually re-submit jobs, or if you can have your job cleanly exit before the 48 hour window, there are ways to [[ FAQ#How_can_I_automatically_resubmit_a_job.3F | automatically resubmit jobs ]].

Other rejections return a more cryptic error saying "job violates class configuration" such as follows:
<pre>
PBS Job Id: 3462409.gpc-sched
Job Name: STDIN
job deleted
Job deleted at request of root@gpc-sched
MOAB_INFO: job was rejected - job violates class configuration 'user required by class 'batch''
</pre>

The most common problems that result in this error are:

* '''Incorrect number of processors per node''': Jobs on the GPC are scheduled per-node not per-core and since each node has 8 processor cores (ppn=8) the smallest job allowed is one node with 8 cores (nodes=1:ppn=8). For serial jobs users must bundle or batch them together in groups of 8. See [[ FAQ#How_do_I_run_serial_jobs_on_GPC.3F | How do I run serial jobs on GPC? ]]
* '''No number of nodes specified''': Jobs submitted to the main queue must request a specific number of nodes, either in the submission script (with a line like <tt>#PBS -l nodes=2:ppn=8</tt>) or on the command line (eg, <tt>qsub -l nodes=2:ppn=8,walltime=5:00:00 script.pbs</tt>). Note that for the debug queue, you can get away without specifying a number of nodes and a default of one will be assigned; for both technical and policy reasons, we do not enforce such a default for the main ("batch") queue.
* '''There is a 15 minute walltime minimum''' on all queues except debug and if you set your walltime less than this, it will be rejected.

===How can I monitor my running jobs on TCS?===

How can I monitor the load of TCS jobs?

'''Answer:'''

You can get more information with the command
/xcat/tools/tcs-scripts/LL/jobState.sh
which I alias as:
alias llq1='/xcat/tools/tcs-scripts/LL/jobState.sh'
If you run "llq1 -n" you will see a listing of jobs together with a lot of information, including the load.

==Errors in running jobs==

===On GPC, `Job cannot be executed'===

I get error messages like this trying to run on GPC:

<pre>
PBS Job Id: 30414.gpc-sched
Job Name: namd
Exec host: gpc-f120n011/7+gpc-f120n011/6+gpc-f120n011/5+gpc-f120n011/4+gpc-f120n011/3+gpc-f120n011/2+gpc-f120n011/1+gpc-f120n011/0
Aborted by PBS Server
Job cannot be executed
See Administrator for help

PBS Job Id: 30414.gpc-sched
Job Name: namd
Exec host: gpc-f120n011/7+gpc-f120n011/6+gpc-f120n011/5+gpc-f120n011/4+gpc-f120n011/3+gpc-f120n011/2+gpc-f120n011/1+gpc-f120n011/0
An error has occurred processing your job, see below.
request to copy stageout files failed on node 'gpc-f120n011/7+gpc-f120n011/6+gpc-f120n011/5+gpc-f120n011/4+gpc-f120n011/3+gpc-f120n011/2+gpc-f120n011/1+gpc-f120n011/0' for job 30414.gpc-sched

Unable to copy file 30414.gpc-sched.OU to USER@gpc-f101n084.scinet.local:/scratch/G/GROUP/USER/projects/sim-performance-test/runtime/l/namd/8/namd.o30414
*** error from copy
30414.gpc-sched.OU: No such file or directory
*** end error output
</pre>

Try doing the following:
<pre>
mkdir ${SCRATCH}/.pbs_spool
ln -s ${SCRATCH}/.pbs_spool ~/.pbs_spool
</pre>

This is how all new accounts are setup on SciNet.

<tt>/home</tt> on GPC for compute jobs is mounted as a read-only file system.
PBS by default tries to spool its output files to <tt>${HOME}/.pbs_spool</tt>
which fails as it tries to write to a read-only file
system. New accounts at SciNet get around this by having ${HOME}/.pbs_spool
point to somewhere appropriate on <tt>/scratch</tt>, but if you've deleted that link
or directory, or had an old account, you will see errors like the above.

'''On Feb 24, the input/output mechanism has been reconfigured to use a local ramdisk as the temporary location, which means that .pbs_spool is no longer needed and this error should not occur anymore.'''

=== I couldn't find the .o output file in the .pbs_spool directory as I used to ===

On Feb 24 2011, the temporary location of standard input and output files was moved from the shared file system ${SCRATCH}/.pbs_spool to the
node-local directory /var/spool/torque/spool (which resides in ram). The final location after a job has finished is unchanged,
but to check the output/error of running jobs, users will now have to ssh into the (first) node assigned to the job and look in
/var/spool/torque/spool.

This alleviates access contention to the temporary directory, especially for those users that are running a lot of jobs, and reduces the burden on the file system in general.

Note that it is good practice to redirect output to a file rather than to count on the scheduler to do this for you.

=== My GPC job died, telling me `Copy Stageout Files Failed' ===

'''Answer:'''

When a job runs on GPC, the script's standard output and error are redirected to
<tt>$PBS_JOBID.gpc-sched.OU</tt> and <tt>$PBS_JOBID.gpc-sched.ER</tt> in
/var/spool/torque/spool on the (first) node on which your job is running. At the end of the job, those .OU and .ER files are copied to where the batch script tells them to be copied, by default <tt>$PBS_JOBNAME.o$PBS_JOBID</tt> and<tt>$PBS_JOBNAME.e$PBS_JOBID</tt>. (You can set those filenames to be something clearer with the -e and -o options in your PBS script.)

When you get errors like this:
<pre>
An error has occurred processing your job, see below.
request to copy stageout files failed on node
</pre>
it means that the copying back process has failed in some way. There could be a few reasons for this. The first thing to '''make sure that your .bashrc does not produce any output''', as the output-stageout is performed by bash and further output can cause this to fail.
But it also could have just been a random filesystem error, or it could be that your job failed spectacularly enough to shortcircuit the normal job-termination process and those files just never got copied.

Write to [mailto:support@scinet.utoronto.ca <support@scinet.utoronto.ca>] if your input/output files got lost, as we will probably be able to retrieve them for you (please supply at least the jobid, and any other information that may be relevant).

Mind you that it is good practice to redirect output to a file rather than depending on the job scheduler to do this for you.



===IB Memory Errors, eg <tt> reg_mr Cannot allocate memory </tt>===

Infiniband requires more memory than ethernet; it can use RDMA (remote direct memory access) transport for which it sets aside registered memory to transfer data.

In our current network configuration, it requires a _lot_ more memory, particularly as you go to larger process counts; unfortunately, that means you can't get around the "I need more memory" problem the usual way, by running on more nodes. Machines with different memory or
network configurations may exhibit this problem at higher or lower MPI
task counts.

Right now, the best workaround is to reduce the number and size of OpenIB queues, using XRC: with the OpenMPI, add the following options to your mpirun command:

<pre>
-mca btl_openib_receive_queues X,128,256,192,128:X,2048,256,128,32:X,12288,256,128,32 -mca btl_openib_max_send_size 12288
</pre>

With Intel MPI, you should be able to do

<pre>
module load intelmpi/4.0.3.008
mpirun -genv I_MPI_FABRICS=shm:ofa -genv I_MPI_OFA_USE_XRC=1 -genv I_MPI_OFA_DYNAMIC_QPS=1 -genv I_MPI_DEBUG=5 -np XX ./mycode
</pre>

to the same end.

For more information see [[GPC MPI Versions]].

===My compute job fails, saying <tt>libpng12.so.0: cannot open shared object file</tt> or <tt>libjpeg.so.62: cannot open shared object file</tt>===

'''Answer:'''

To maximize the amount of memory available for compute jobs, the compute nodes have a less complete system image than the development nodes. In particular, since interactive graphics libraries like matplotlib and gnuplot are usually used interactively, the libraries for their use are included in the devel nodes' image but not the compute nodes.

Many of these extra libraries are, however, available in the "extras" module. So adding a "module load extras" to your job submission script - or, for overkill, to your .bashrc - should enable these scripts to run on the compute nodes.

==Data on SciNet disks==

===How do I find out my disk usage?===

'''Answer:'''

The standard unix/linux utilities for finding the amount of disk space used by a directory are very slow, and notoriously inefficient on the GPFS filesystems that we run on the SciNet systems. There are utilities that very quickly report your disk usage:

The <tt>'''/scinet/gpc/bin/diskUsage'''</tt> command, available on the login nodes, datamovers and the GPC devel nodes, provides information in a number of ways on the home, scratch, and project file systems. For instance, how much disk space is being used by yourself and your group (with the -a option), or how much your usage has changed over a certain period ("delta information") or you may generate plots of your usage over time.
This information is only updated hourly!

More information about these filesystems is available at the [[Data_Management | Data_Management]].

===How do I transfer data to/from SciNet?===

'''Answer:'''

All incoming connections to SciNet go through relatively low-speed connections to the <tt>login.scinet</tt> gateways, so using scp to copy files the same way you ssh in is not an effective way to move lots of data. Better tools are described in our page on [[Data_Management#Data_Transfer | Data Transfer]].

===My group works with data files of size 1-2 GB. Is this too large to transfer by scp to login.scinet.utoronto.ca ?===

'''Answer:'''

Generally, occasion transfers of data less than 10GB is perfectly acceptible to so through the login nodes. See [[Data_Management#Data_Transfer | Data Transfer]].

===How can I check if I have files in /scratch that are scheduled for automatic deletion?===

'''Answer:'''

Please see [[Storage_Quickstart#Scratch_Disk_Purging_Policy | Storage At SciNet]]

===How to allow my supervisor to manage files for me using ACL-based commands?===

'''Answer:'''

Please see [[Data_Management#File.2FOwnership_Management_.28ACL.29 | File/Ownership Management]]

===Can we buy extra storage space on SciNet?===

'''Answer:'''
Yes, please see [[Data_Management#Buying_storage_space_on_GPFS_or_HPSS | Buying storage space on GPFS or HPSS ]] for more details.

===Can I transfer files between BGQ and HPSS?===

'''Answer:'''
Yes, please see [https://support.scinet.utoronto.ca/wiki/index.php/BGQ#Bridge_to_HPSS Bridge to HPSS ] for more details.

==Keep 'em Coming!==

===Next question, please===

Send your question to [mailto:support@scinet.utoronto.ca <support@scinet.utoronto.ca>]; we'll answer it asap!

Oldwiki.scinet.utoronto.ca:System Alerts

2014-02-18T18:51:52Z

Jchong: /* System Status */

== System Status==

{|
|[[File:down.png|up|link=GPC Quickstart]][[GPC Quickstart|GPC]]
|[[File:down.png|up|link=TCS Quickstart]][[TCS Quickstart|TCS]]
|[[File:down.png|up|link=Sandy]][[Sandy]]
|[[File:down.png|up|link=GPU Devel Nodes]][[GPU Devel Nodes|ARC]]
|-
|[[File:down.png|up|link=Gravity]][[Gravity]]
|[[File:down.png|up|link=P7 Linux Cluster]][[P7 Linux Cluster|P7]]
|[[File:down.png|up|link=BGQ]][[BGQ]]
|[[File:down.png|down|link=HPSS]][[HPSS]]
|}

Tue Feb 18 13:50:41 EST: Power outage at datacenter. We are investigating the issue and try to restore the system ASAP.

Tue Feb 18 13:39:38 EST: Most systems are restore now and available for use. Please e-mail support@scinet.utoronto.ca if there are any issues.

Tue Feb 18 11:13:53 EST: Power glitch at datacenter shutdown cooling system and hence computers as well. Restoring cooling now. Systems likely back online within 2 hrs or so.

As a precaution, emails by the Moab/Torque scheduler have been disabled because of a potential security vulnerability.

Last updated: Mon Feb 13:57:06 EST 2014
([[Previous_messages:|Previous messages]])

Oldwiki.scinet.utoronto.ca:System Alerts

2014-02-18T18:40:32Z

Jchong: /* System Status */

Oldwiki.scinet.utoronto.ca:System Alerts

2014-02-18T18:39:26Z

Jchong: /* System Status */

Oldwiki.scinet.utoronto.ca:System Alerts

2014-02-18T18:33:22Z

Jchong: /* System Status */

Oldwiki.scinet.utoronto.ca:System Alerts

2014-02-18T18:33:02Z

Jchong: /* System Status */

Oldwiki.scinet.utoronto.ca:System Alerts

2014-02-18T18:18:58Z

Jchong: /* System Status */

Oldwiki.scinet.utoronto.ca:System Alerts

2014-02-18T16:12:16Z

Jchong: /* System Status */

== System Status==

{|
|[[File:down.png|up|link=GPC Quickstart]][[GPC Quickstart|GPC]]
|[[File:down.png|up|link=TCS Quickstart]][[TCS Quickstart|TCS]]
|[[File:down.png|up|link=Sandy]][[Sandy]]
|[[File:down.png|up|link=GPU Devel Nodes]][[GPU Devel Nodes|ARC]]
|-
|[[File:down.png|up|link=Gravity]][[Gravity]]
|[[File:down.png|up|link=P7 Linux Cluster]][[P7 Linux Cluster|P7]]
|[[File:down.png|up|link=BGQ]][[BGQ]]
|[[File:down.png|down|link=HPSS]][[HPSS]]
|}

Wed Feb 5 01:01:52:

GPC scheduler crashed around 10PM on Tuesday. Unfortunately, the scheduler cannot be revived normally. All GPC jobs were lost. Please resubmit your jobs. We apologize for the inconvenience.

Mon Feb 3 14:20:27: jobs can be submitted to the GPC. GPC nodes are coming back online. TCS is still rebooting,

At approximately 1:54 pm, Monday February 3, SciNet data centre lost power for 4 seconds. Systems are being restored.

As a precaution, emails by the Moab/Torque scheduler have been disabled because of a potential security vulnerability.

Last updated: Mon Feb 13:57:06 EST 2014
([[Previous_messages:|Previous messages]])

Data Transfer

2013-10-31T20:17:49Z

Jchong:

=== General guidelines ===

All traffic to and from the data centre has to go via [http://en.wikipedia.org/wiki/Secure_Shell SSH], or secure shell.
This is a protocol which sets up a secure connection between two sites. In all cases, incoming connections to SciNet go through relatively low-speed connections to the login.scinet gateways, but there are many ways to copy files on top of the ssh protocol.

What node to use for data transfer to and from SciNet depends mostly on the amount of data to transfer:

==== Moving <10GB through the login nodes ====

The login nodes are accessible from outside SciNet, which means that you can transfer data between your own office/home machine and SciNet using scp or rsync (see below). Although the login nodes have a cpu_time timeout of 5 minutes (emphasis on cpu_time, not wall_time), most likely if you try to transfer more than 10GB probably you wouldn't succeed. While the login nodes can be used for transfers of less than 10GB, using a datamover node would still be faster.

Note that transfers through a login node will timeout after a certain time (currently set to 5 minutes cpu_time), so if you have a slow connection you may need to go through datamover1.

==== Moving >10GB through the datamover1 node ====

Serious moves of data (>10GB) to or from SciNet should be done from the <tt>datamover1</tt> node. From any of the interactive SciNet nodes, one should be able to <tt>ssh datamover1</tt> to log in. This is the machine that has the fastest network connection to the outside world (by a factor of 10; a 10Gb/s link as vs 1Gb/s).

Transfers must be ''originated'' from <tt>datamover1</tt>; that is, one can not copy files from the outside world directly to or from the data mover node; one has to log in to the data mover node and copy the data to or from the outside network. Your local machine must be reachable from the outside, either by its name or its IP address. If you are behind a firewall or a (wireless) router, this may not be possible. You may need to ask your system administrator to allow datamover to ssh to your machine.

==== Hpn-ssh ====

The usual ssh protocols were not designed for speed. On the <tt>datamover1</tt> node, we have installed hpn-ssh, or [http://www.psc.edu/networking/projects/hpn-ssh/ High-Performance-enabled ssh]. You use this higher-performance ssh/scp/sftp variant by loading the `hpnssh' module. Hpn-ssh is backwards compatible with the `usual' ssh, but is capable of significantly higher speeds. If you routinely have large data transfers to do, we recommend having your system administrator look into installing [http://www.psc.edu/networking/projects/hpn-ssh/ hpn-ssh] on your system.

Everything we discuss below, unless otherwise stated, will work regardless of whether you have hpn-ssh installed on your remote system.

==== For Microsoft Windows users ====

Linux-windows transfers can be a bit more involved than linux-to-linux, but using [http://www.cygwin.com Cygwin], this should not be a problem. Make sure you install Cygwin with the openssh package.

If you want to remain 100% a Windows environment, another very good tool is [http://winscp.net/eng/index.php WinSCP]. It will let you easily transfer and synchronize data between your Windows workstation and the login nodes using your ssh credentials (provided that it's not much more than 10GB on each sync pass).

If you are going to use the [[Data_Management#Moving_.3E10GB_through_the_datamover1_node | datamover1 method]], and assuming your machine is not a wireless laptop (if it
is, best to find a nearby computer that's not wireless and use a usb
memory stick), you'll need the IP address of your machine, which you find by
typing "ipconfig /all" on your local windows machine. Also, you will need to have the ssh daemon (sshd) running locally in Cygwin.

Also note that your windows user name does not have to be the same as on SciNet, this just
depends on how your local windows system was set up.

All locations given to scp or rsync in cygwin have to be in unix format (using "/" not "\"), and will be relative to cygwin's path, not windows (e.g.
use /cygdrive/c/...... to get to the windows C: drive).

=== Ways to transfer data ===

==== scp ====

<tt>scp</tt>, or secure copy, is the easiest way to copy files, although we generally find rsync below to be faster.

scp works like cp to copy files:

$ scp original_file copy_file

except that either the original or the copy can be on another system:

$ scp jonsdatafile.bin jon@remote.system.com:/home/jon/bigdatadir/

will copy the data file into the directory <tt>/home/jon/bigdatadir/</tt> on <tt>remote.system.com</tt> after logging in as <tt>jon</tt>; you will be prompted for a password (unless you've set up ssh keys).

Copying from remote systems works the same way:

$ scp jon@remote.system.com:/home/jon/bigdatadir/newdata.bin .

And wildcards work as you'd expect (except you have to quote the wildcards on the remote system, as it can't expand properly here.)

$ scp *.bin jon@remote.system.com:/home/jon/bigdatadir/
$ scp jon@remote.system.com:"/home/jon/inputdata/*" .

There are few options worth knowing about:
* <tt>scp -C</tt> compresses the file before transmitting it; ''if'' the file compresses well, this can significantly increase the effective data transfer rate (though usually not as much as compressing the data, then sending it, then uncompressing). If the file doesn't compress well, than this adds CPU overhead without accomplishing much, and can slow down your data transfer.
* <tt>scp -oNoneEnabled=yes -oNoneSwitch=yes</tt> -- This is an hpn-ssh only option. If CPU overhead is a significant bottleneck in the data transfer, then we can avoid this by turning off the secure encryption of the data. For most of us, this is ok, but for others it is not. In either cases, '''authentication''' remains secure, it is only the data transfer that is in plaintext.

==== rsync ====

[http://samba.anu.edu.au/rsync/ rsync] is a very powerful tool for mirroring directories of data.
$ rsync -av -e ssh scinetdatadir jon@remote.system.com:/home/jon/bigdatadir/
rsync has a dizzying number of options; the above syncs <tt>scinetdatadir</tt> ''to'' the remote system; that is, any files that are newer on the localsystem are updated on the remote system. The converse isn't true; if there were newer files on the remote system, you'd have to bring those over with
$ rsync -av -e ssh jon@remote.system.com:/home/jon/bigdatadir/ scinetdatadir
The <tt>-av</tt> options are for verbose and `archive' mode, which preserves timestamps and permissions, which is normally what you want. <tt>-e ssh</tt> tells it to use ssh for the transfer.

One of the powerful things about rsync is that it looks to see what files already exist before copying, so you can use it repeatedly as a data directory fills and it won't make unnecessary copies; similarly, if a (say) log file grows over time, it will only copy the difference between the files, further speeding things up. This also means that it behaves well if a transfer is interrupted; a second invocation of rsync will continue where the other left off.

As with <tt>scp -C</tt>, <tt>rsync -z</tt> compresses on the fly, which can significantly enhance effective data transfer rates if the files compress well, or hurt it if not.

As with scp, if both sides are running hpn-ssh one can disable encryption of the data stream should that prove to be a bottleneck:
$ rsync -av -e "ssh -oNoneEnabled=yes -oNoneSwitch=yes" jon@remote.system.com:/home/jon/bigdatadir/ scinetdatadir

SciNet's login nodes, 142.150.188.5[1-4], are publicly accessible and can be used for data transfer as long as your material is not one big chunk (much more than 2-3GB each file). We have a 5 minutes CPU time limit on the login nodes and the transfer process may be killed by the kernel before completion. The workaround is to transfer your data using a rsync loop, by checking the rsync return code, assuming some files can be transferred before reaching the CPU limit. For example in a bash shell:
<pre>
for i in {1..100}; do ### try 100 times
rsync ...
[ "$?" == "0" ] && break
done
</pre>

==== ssh tunnel ====

Alternatively you may use a reverse ssh tunnel (ssh -R).

If your transfer is above 10GB you will need to use one of SciNet's datamovers. If your workstation is behind a firewall (as the datamovers are), you'll need a node external to your firewall, on the edge of your organization's network, that will serve as a gateway, and can be accessible via ssh by both your workstation and the datamovers. Initiate a "ssh -R" connection from SciNet's datamover to that node. This node needs to have its ssh GatewayPorts enabled so that your workstation can connect to the specified port on that node, which will forward the traffic back to SciNet's datamover.

=== Transfer speeds ===

==== What transfer speeds could I expect? ====

Below are some typical transfer numbers from datamover1 to another University of Toronto machine with a 1Gb/s link to the campus network:

{| class="wikitable" border="0"
|-
!{{Hl2}}| Mode
!{{Hl2}}| With hpn-ssh
!{{Hl2}}| Without
|-
| rsync
| 60-80 MB/s
| 30-40 MB/s
|-
| scp
| 50 MB/s
| 25 MB/s
|}

==== What can slow down my data transfer? ====

To move data quickly, ''all'' of the stages in the process have to be fast; the file system you are reading data from, the CPU reading the data, the network connection between the sender and the reciever, and the recipient CPU and disk. The slowest element in that chain will slow down the entire transfer.

On SciNet's side, our underlying filesystem is the high-performance [http://www-03.ibm.com/systems/software/gpfs/index.html GPFS] system, and the node we recommend you use (datamover1) has a high-speed connection to the network and fast CPUs.

==== Why are my transfers so much slower? ====

If you get numbers significantly lower than above, then there is a bottleneck in the transfer. The first thing to do is to run <tt>top</tt> on datamover1; if other people are transfering large files at the same time you are trying to, network congestion could result and you'll just have to wait until they are done.

If nothing else is going on on datamover1, there are a number of possibilites:
* network connection between SciNet and your machine - do you know the network connection of your remote machine? Are your systems connections tuned for performance [http://www.psc.edu/networking/projects/tcptune]?
* is the remote server busy?
* are the remote servers disks busy, or known to be slow?

For any further questions, contact us at [mailto:support@scinet.utoronto.ca Support @ SciNet]

Data Transfer

2013-10-31T20:17:30Z

Jchong:

=== General guidelines ===

All traffic to and from the data centre has to go via [http://en.wikipedia.org/wiki/Secure_Shell SSH], or secure shell.
This is a protocol which sets up a secure connection between two sites. In all cases, incoming connections to SciNet go through relatively low-speed connections to the login.scinet gateways, but there are many ways to copy files on top of the ssh protocol.

What node to use for data transfer to and from SciNet depends mostly on the amount of data to transfer:

==== Moving <10GB through the login nodes ====

The login nodes are accessible from outside SciNet, which means that you can transfer data between your own office/home machine and SciNet using scp or rsync (see below). Although the login nodes have a cpu_time timeout of 5 minutes (emphasis on cpu_time, not wall_time), most likely if you try to transfer more than 10GB probably you wouldn't succeed. While the login nodes can be used for transfers of less than 10GB, using a datamover node would still be faster.

Note that transfers through a login node will timeout after a certain time (currently set to 5 minutes cpu_time), so if you have a slow connection you may need to go through datamover1.

==== Moving >10GB through the datamover1 node ====

Serious moves of data (>10GB) to or from SciNet should be done from the <tt>datamover1</tt> node. From any of the interactive SciNet nodes, one should be able to <tt>ssh datamover1</tt> to log in. This is the machine that has the fastest network connection to the outside world (by a factor of 10; a 10Gb/s link as vs 1Gb/s).

Transfers must be ''originated'' from <tt>datamover1</tt>; that is, one can not copy files from the outside world directly to or from the data mover node; one has to log in to the data mover node and copy the data to or from the outside network. Your local machine must be reachable from the outside, either by its name or its IP address. If you are behind a firewall or a (wireless) router, this may not be possible. You may need to ask your system administrator to allow datamover to ssh to your machine.

==== Hpn-ssh ====

The usual ssh protocols were not designed for speed. On the <tt>datamover1</tt> node, we have installed hpn-ssh, or [http://www.psc.edu/networking/projects/hpn-ssh/ High-Performance-enabled ssh]. You use this higher-performance ssh/scp/sftp variant by loading the `hpnssh' module. Hpn-ssh is backwards compatible with the `usual' ssh, but is capable of significantly higher speeds. If you routinely have large data transfers to do, we recommend having your system administrator look into installing [http://www.psc.edu/networking/projects/hpn-ssh/ hpn-ssh] on your system.

Everything we discuss below, unless otherwise stated, will work regardless of whether you have hpn-ssh installed on your remote system.

==== For Microsoft Windows users ====

Linux-windows transfers can be a bit more involved than linux-to-linux, but using [http://www.cygwin.com Cygwin], this should not be a problem. Make sure you install Cygwin with the openssh package.

If you want to remain 100% a Windows environment, another very good tool is [http://winscp.net/eng/index.php WinSCP]. It will let you easily transfer and synchronize data between your Windows workstation and the login nodes using your ssh credentials (provided that it's not much more than 10GB on each sync pass).

If you are going to use the [[Data_Management#Moving_.3E10GB_through_the_datamover1_node | datamover1 method]], and assuming your machine is not a wireless laptop (if it
is, best to find a nearby computer that's not wireless and use a usb
memory stick), you'll need the IP address of your machine, which you find by
typing "ipconfig /all" on your local windows machine. Also, you will need to have the ssh daemon (sshd) running locally in Cygwin.

Also note that your windows user name does not have to be the same as on SciNet, this just
depends on how your local windows system was set up.

All locations given to scp or rsync in cygwin have to be in unix format (using "/" not "\"), and will be relative to cygwin's path, not windows (e.g.
use /cygdrive/c/...... to get to the windows C: drive).

=== Ways to transfer data ===

==== scp ====

<tt>scp</tt>, or secure copy, is the easiest way to copy files, although we generally find rsync below to be faster.

scp works like cp to copy files:

$ scp original_file copy_file

except that either the original or the copy can be on another system:

$ scp jonsdatafile.bin jon@remote.system.com:/home/jon/bigdatadir/

will copy the data file into the directory <tt>/home/jon/bigdatadir/</tt> on <tt>remote.system.com</tt> after logging in as <tt>jon</tt>; you will be prompted for a password (unless you've set up ssh keys).

Copying from remote systems works the same way:

$ scp jon@remote.system.com:/home/jon/bigdatadir/newdata.bin .

And wildcards work as you'd expect (except you have to quote the wildcards on the remote system, as it can't expand properly here.)

$ scp *.bin jon@remote.system.com:/home/jon/bigdatadir/
$ scp jon@remote.system.com:"/home/jon/inputdata/*" .

There are few options worth knowing about:
* <tt>scp -C</tt> compresses the file before transmitting it; ''if'' the file compresses well, this can significantly increase the effective data transfer rate (though usually not as much as compressing the data, then sending it, then uncompressing). If the file doesn't compress well, than this adds CPU overhead without accomplishing much, and can slow down your data transfer.
* <tt>scp -oNoneEnabled=yes -oNoneSwitch=yes</tt> -- This is an hpn-ssh only option. If CPU overhead is a significant bottleneck in the data transfer, then we can avoid this by turning off the secure encryption of the data. For most of us, this is ok, but for others it is not. In either cases, '''authentication''' remains secure, it is only the data transfer that is in plaintext.

==== rsync ====

[http://samba.anu.edu.au/rsync/ rsync] is a very powerful tool for mirroring directories of data.
$ rsync -av -e ssh scinetdatadir jon@remote.system.com:/home/jon/bigdatadir/
rsync has a dizzying number of options; the above syncs <tt>scinetdatadir</tt> ''to'' the remote system; that is, any files that are newer on the localsystem are updated on the remote system. The converse isn't true; if there were newer files on the remote system, you'd have to bring those over with
$ rsync -av -e ssh jon@remote.system.com:/home/jon/bigdatadir/ scinetdatadir
The <tt>-av</tt> options are for verbose and `archive' mode, which preserves timestamps and permissions, which is normally what you want. <tt>-e ssh</tt> tells it to use ssh for the transfer.

One of the powerful things about rsync is that it looks to see what files already exist before copying, so you can use it repeatedly as a data directory fills and it won't make unnecessary copies; similarly, if a (say) log file grows over time, it will only copy the difference between the files, further speeding things up. This also means that it behaves well if a transfer is interrupted; a second invocation of rsync will continue where the other left off.

As with <tt>scp -C</tt>, <tt>rsync -z</tt> compresses on the fly, which can significantly enhance effective data transfer rates if the files compress well, or hurt it if not.

As with scp, if both sides are running hpn-ssh one can disable encryption of the data stream should that prove to be a bottleneck:
$ rsync -av -e "ssh -oNoneEnabled=yes -oNoneSwitch=yes" jon@remote.system.com:/home/jon/bigdatadir/ scinetdatadir

SciNet's login nodes, 142.150.188.5[1-4], are publicly accessible and can be used for data transfer as long as your material is not one big chunk (much more than 2-3GB each file). We have a 5 minutes CPU time limit on the login nodes and the transfer process may be killed by the kernel before completion. The workaround is to transfer your data using a rsync loop, by checking the rsync return code, assuming some files can be transferred before reaching the CPU limit. For example in a bash shell:
<pre>
for i in {1..100}; do ### try 100 times
rsync ...
[ "$?" == "0" ] && break
done
</pre>

==== ssh tunnel ====

Alternatively you may use a reverse ssh tunnel (ssh -R).

If your transfer is above 10GB you will need to use one of SciNet's datamovers. If your workstation is behind a firewall (as the datamovers are), you'll need a node external to your firewall, on the edge of your organization's network, that will serve as a gateway, and can be accessible via ssh by both your workstation and the datamovers. Initiate a "ssh -R" connection from SciNet's datamover to that node. This node needs to have its ssh GatewayPorts enabled so that your workstation can connect to the specified port on that node, which will forward the traffic back to SciNet's datamover.

=== Transfer speeds ===

==== What transfer speeds could I expect? ====

Below are some typical transfer numbers from datamover1 to another University of Toronto machine with a 1Gb/s link to the campus network:

{| class="wikitable" border="0"
|-
!{{Hl2}}| Mode
!{{Hl2}}| With hpn-ssh
!{{Hl2}}| Without
|-
| rsync
| 60-80 MB/s
| 30-40 MB/s
|-
| scp
| 50 MB/s
| 25 MB/s
|}

==== What can slow down my data transfer? ====

To move data quickly, ''all'' of the stages in the process have to be fast; the file system you are reading data from, the CPU reading the data, the network connection between the sender and the reciever, and the recipient CPU and disk. The slowest element in that chain will slow down the entire transfer.

On SciNet's side, our underlying filesystem is the high-performance [http://www-03.ibm.com/systems/software/gpfs/index.html GPFS] system, and the node we recommend you use (datamover1) has a high-speed connection to the network and fast CPUs.

==== Why are my transfers so much slower? ====

If you get numbers significantly lower than above, then there is a bottleneck in the transfer. The first thing to do is to run <tt>top</tt> on datamover1; if other people are transfering large files at the same time you are trying to, network congestion could result and you'll just have to wait until they are done.

If nothing else is going on on datamover1, there are a number of possibilites:
* network connection between SciNet and your machine - do you know the network connection of your remote machine? Are your systems connections tuned for performance [http://www.psc.edu/networking/projects/tcptune]?
* is the remote server busy?
* are the remote servers disks busy, or known to be slow?

For any further questions, contact us at [mailto:support@scinet.utoronto.ca Support @ SciNet]

Oldwiki.scinet.utoronto.ca:System Alerts

2013-08-22T20:37:18Z

Jchong: /* System Status */

Oldwiki.scinet.utoronto.ca:System Alerts

2013-08-22T20:36:20Z

Jchong: /* System Status */

Oldwiki.scinet.utoronto.ca:System Alerts

2013-08-21T23:13:14Z

Jchong: /* System Status */

Previous messages:

2013-08-21T23:09:38Z

Jchong: Added old updates into previous message page.

'''''These are old messages, the most recent system status can be found on the [[SciNet User Support Library|main page]].'''''

'''''You can also check our twitter feed, @SciNetHPC, for updates.'''''

Wed Aug 21 06:16 - The /scratch filesystem will be mounted read-write around 2PM today. Processes on the login, devel, and datamovers that are accessing /scratch2 will be killed, or the node may be rebooted if /scratch2 cannot be re-mounted in read only mode. /scratch2 on the compute nodes will be unmounted once the job is finished. All jobs scheduled to run after 2PM will need to use /scratch otherwise the job will fail. Please cancel your jobs that uses /scratch2 and resubmit them after 2PM.

Tue Aug 20 16:26 - The process of finding the files that had parity errors is taking longer than expected due to multiple passes requirement. These process will also affect the performance of GPFS at times. We hope to finish these processes and have /scratch mounted read-write sometime tomorrow.

Mon Aug 19 14:17 - This is the plan for the next few days to get scratch back into production:

* We'll first be doing another verify to find bad sectors. This will generate a list of suspect files.
* These suspect files will be moved to a separate location, and the owners of these files will be notified.
* Sometime tomorrow (Tue Aug 20), this should allow /scratch to be mounted read/write again.
* Once /scratch is back in full production, and after running jobs using /scratch2 have finished (say 2-3 days later), /scratch2 will be phased out.
* This phase-out entails /scratch2 to be mounted read-only on the login and devel nodes to allow users to copy results that they want to keep from /scratch2 to /scratch, /home, or off-site. The login and devel nodes may require a reboot tomorrow if /scratch2 cannot be mounted read-only on the node.
* After a week or two, /scratch2 will be unmounted.

Sat Aug 17 02:21:09 - /scratch and /project are now mounted read-only on the login and devel nodes. Please read details immediately below. Note also (further below) that the monthly purge of /scratch will be delayed at least another week.

Sat Aug 17 00:04:39 - the '''/scratch and /project filesystems will be mounted read-only again by 0900 today but only on the login and devel nodes'''. You will not be able to write to /scratch or /project but you will be able to access and copy files away from them (e.g. to /scratch2). The storage vendor has completed recovering the raid parity errors and we are testing that they have "fixed" them properly, we're trying to resolve discrepancies between their lists and ours and we are identifying the names of those files which have been corrupted. Unfortunately, if there are any remaining, or improperly "fixed", parity errors, then the entire filesystem can crash when somebody accesses the affected files (this is why we had to unmount /scratch earlier this week). Accordingly, we are testing all the disk sectors that the vendor has claimed to have fixed overnight. If the filesystem remains stable over the weekend then we hope to be able to return /scratch and /project to normal on Monday or Tuesday.

Thu Aug 15 12:59:18 - work continues on recovering the filesystem. The vast majority of data appears intact but the storage vendor is still resolving parity errors. Also working on a way for users to identify what files might possibly have been corrupted. Unfortunately the timeline for all this is still uncertain. We're trying to balance paranoia for preserving data with the need for users to get back to work. /scratch2 is working well and the GPC is currently at 80% utilization

Wed Aug 14 20:00:00 - The login and GPC development nodes are back in service now. We have disabled the read-only mount for scratch since that was causing issues with the ongoing recovery. It will be made available later in the week when the recovery is complete. Please continue to check here for further updates.

Wed Aug 14 19:36:41 - There are currently filesystem issues with the gpc login node and the general scinet login node. We are working on the issue and trying to fix it.

Wed Aug 14 00:30:46 - '''the regular monthly purge of /scratch will be delayed because of the problems with the filesystem. It will tentatively take place on 22 Aug (or later). New date will be announced here.'''

Tue Aug 13 20:23:27 - GPC and TCS available. See notes below about scratch2, scratch and project filesystems.

Tue Aug 13 19:15:28- for the time being, /scratch and /project will be available only from the login and devel nodes and will only be readable (you can not write to them). This way users can retrieve files they really need but we minimize the stress on the filesystem while we complete LUN verifies and filesystem checks. These filesystems will return to normal later this week (likely Wed or Thurs but may take longer than expected). We know that there are some files that may have corrupted data and will post more details later about how to identify them. The total amount of corrupted data is small and appears to be limited only to those files which were open for writing when the problems started (about 1445 on Friday, 9 Aug). GPC users will still need to use /scratch2 for running jobs while TCS users will need to make use of /reserved1.

Tue Aug 13 17:24:18 - there is good news about /scratch and /project. They appear to be at least 99% intact. However, there are still more LUN verifies that needs to be run as well as disk fscks. It's not yet clear whether we will be able to make these disks available tonight or at some point tomorrow. Systems should come online again within a couple of hours though perhaps only with the new /scratch2 for now.

Tue Aug 13 17:13:58 - datacentre upgrades finished. Snubber network, upgraded trigger board, UPS for the controller and the Quickstart feature should make chiller more resilient to power events and improve the time it takes to restart. Hot circuit breakers also replaced

Tues Aug 13 09:00:00 - systems down for datacentre improvement work

Sun Aug 11 21:55:06 - TCS can be used by those groups which have /reserved1 space. Use /reserved1 to run jobs as you would hve with the old /scratch (which we are still trying to recover)

Sun Aug 11 21:49:03 - GPC is available for use. There is no /scratch or /project filesystem as we are still trying to recover them. You can use /scratch2 to run jobs in exactly the same way as the old scratch (however the environment variable is $SCRATCH2). New policies for /scratch2 are being set but for now each user is limited to 10TB and 1 million files. /home is unscathed.

Sun Aug 11 15:35:32 - We are implementing a contingency plan for access by GPC users. Should be available within a few hours. There will be a new scratch2 filesystem that can be used for submitting and running jobs. TCS users may have to wait another day for a fix (it is technically impossible to mount the new /scratch2 on the TCS). Unfortunately, nobody will be able to access the original /scratch or /project space and the timeline for attempting to fix and recover those filesystems is virtually impossible to judge (have to deal with new problems as they crop-up and there's no way to know how many problems lie ahead).

Sun Aug 11 09:25:41 - work resumed before 8AM this morning. Still correcting disk errors that surface so we can reach the stage where the OS can actually mount the filesystem

Sat Aug 10 22:31:45 - work stopping for this evening. SciNet and vendor staff have worked continuously for more than 30 hrs on this problem. No point risking making a mistake now. Will continue tomorrow

Sat Aug 10 20:39:34 - work continues. Disks and NSDs have been powered-up and the filesystem is attempting to read the disks. Problems with individual disks are being fixed manually as they are exposed

Sat Aug 10, 17:03 - Still no resolution to the problem. SciNet staff continue to work onsite, in consultation with the storage vendor.

Sat Aug 10 10:38:46 - storage vendor still working on solution with SciNet staff onsite. There are 2,000 hard drives and the controller is confused about location and ID of some of them. Getting a single one wrong will result in data loss so we are proceeding cautiously. Only /scratch and /project are affected. /home is accessible but GPC and TCS can not be used as they rely on /scratch. BGQ system is still usable because of separate filesystem

Sat Aug 10 07:05:07 - staff and vendor tech support still on-site. New action plan from storage vendor is being tested.

Sat Aug 10 00:28:54 - Vendor has escalated to yet a higher level of support but still no solution. People will remain on-site for a while longer to see what the new support team recommends.

Fri Aug 9 22:03:48 - Staff and vendor technician remain on-site. Storage vendor has escalated problem to critical but suggested fixes have not yet resolved the problem. BGQ remains up because it has separate filesystem.

Fri Aug 9 15 32 - /scratch and /project are down. Login and home directories are ok, but no jobs can run, and most of those running will likely die if/when they need to do I/O.

Fri Aug 9 15:25 - File system problems. Scratch is unmounted. Jobs are likely dying. We are working on it.

Thu Aug 8 13:22 - most systems are back up

Thu Aug 8 11:18:45 - problems with storage hardware. Trying to resolve with vendor

Thu Aug 8 08:14:01 Cooling has been restored. Starting to recover systems.

Thu 8 Aug 2013 06:12:28 Large voltage drop at site knocked-out cooling system at 0558 today. Staff enroute to site.

Wed 7 Aug 2013 16:27:00 EDT: All systems are up again.

Wed 7 Aug 2013 14:50:00 EDT: GPC, TCS, P7, ARC and HPSS systems are up again.

Wed 7 Aug 2013 11:47:00 EDT: Power outage at site because of thunderstorm. Systems down.

Tue Aug 6 16:52:35 EDT 2013: File systems have been fixed. Systems are back up.

Tue Aug 6 15:27:35 EDT 2013: File system trouble under investigation. Clusters aren't down per se, but the home and scratch file systems are not accessible. Many user jobs very likely died. Systems are expected to be up by the end of the afternoon today. Check here for updates.

Thu Aug 1 17:13:52 EDT 2013: Systems are back up and accessible to users.

Tue Aug 1, 8:06:00: As announced, all systems have been shutdown
at 8AM on Thurs, 1 Aug for emergency repair of a component
in the cooling system. Systems are expected to be back
on-line in the afternoon. Check here for progress updates.

16:30 update: Systems expected to be up by 5pm today.

Tue Jul 30, 19:24:00: Downtime announcement:

All systems will be shutdown at 8AM on Thurs, 1 Aug for emergency repair
of a component in the cooling system. Systems are expected to be back
on-line in the afternoon. Check here for progress updates.

Apologies for the short notice but we only learned of the problem this
afternoon. We're now attempting to re-schedule other maintenance planned
for later in August to this Thursday as well (hence the uncertainty in
the length of the required downtime).

Mon Jul 29 10:40:00 All systems back up.

Mon Jul 29 10:09:00 TCS is back up. BGQ still down.

Mon Jul 29 8:37:00 Power glitch overnight took systems down. GPC is already up, and other systems are being brought up.

Wed Jul 24 15:00:00 All BGQ racks back in production

Thu Jul 18 10:00:00 Bgqdev and one of the two bgq racks are up again

Wed Jul 17 17:00:00 Bgqdev and bgq systems are down.

Wed Jul 17 15:58:00 We're reenabling the rack, please resubmit crashed jobs.

Wed Jul 17 15:24:12 One of the two racks of the BlueGene/Q production system has gone down.

Mon Jul 15 09:45:49: Gravity01 (head node in gravity cluster) is down until futher notice. Jobs may still be submitted from devel nodes or arc01

Tue Jul 9 19:15:57 EDT: one rack of the production BGQ systems remains down due to a faulty flow sensor

Tue Jul 9 15:16:49 EDT: GPC & TCS are back online. Other systems being restored

Tue Jul 9 12:45:25 EDT: Power has been restored. We need to restart cooling systems, restart and check the filesystems etc. Will have better idea of timeline by 3PM

Tue Jul 9 11:52:02 EDT: Powerstream on-site. Wind/tension damage to two hydro poles caused overhead fuse to blow. Repairs are underway

Tue Jul 9 09:30:52 EDT: Fuse on power lines blew last night. Utility is backed-up dealing with other problems. No ETA for them to restore power at the site

Tue Jul 9 02:22:47 EDT: No power at site. Will resolve by 10AM and update here.

Tue Jul 9 01:04:40 EDT: Power failure at site and UPS has drained. Major storms and problems throughout Toronto. Staff enroute.

Mon Jul 5, 14:18:00 EDT 2013 Both BGQ systems, bgq and bgqdev, are up.

Mon Jul 4, 15:50:00 EDT 2013: Both BGQ systems, bgq and bgqdev, are down due to a cooling failure. We are investigating the cause. Given the scheduled BGQ downtime tomorrow, these systems will not be brought up before tomorrow (Friday Jul 5 2013) late morning or early afternoon. You can check the system status here on the wiki. All other SciNet systems are up and should not be affected by the downtime.

Mon Jul 4, 13:00:00 EDT 2013: On Friday July 5th, 2013, bgq and bgqdev will be taken down for maintenance. This BGQ downtime will start at 8:00 am. We expect they will be up again early afternoon on the same day. You can check the system status here on Monday. Other SciNet systems will not be affected.

Mon Jul 1, 9:00:00 EDT 2013 : BGQ production system is fully up.

Jun 26 07:55:20 EDT 2013: One of the two racks of the BGQ production system is down. The remainder is operational, as is the development BGQ cluster.

Sat May 4 14:59:00 EDT: Systems all back on-line. Let us know if you encounter issues.

Sat May 4 13:13:54 EDT: Staff on-site. Expect to have at least GPC available by 4PM (if not earlier).

Sat May 4 09:15:54 EDT: Systems unavailable due to power glitch at data center; will update shortly

Tue Apr 16 12:45:52 EDT 2013: All systems are back up. Please resubmit your jobs.

Tue Apr 16 02:37:44 EDT 2013 Systems unexpectedly went down on Apr 15, 2013 around 10:45 pm due to loss of one phase of power at site. Local utility expected to restore power this morning. Check here for updates.

Thu Apr 11 14:39:27 EDT 2013: All systems are back up. Please report any problems or unusual behaviour.

Mon Apr 8 12:17:50 EDT 2013: All systems will be '''shutdown''' at 8AM on Wed, 10 April. They are expected to be back online by the evening of Thurs, 11 April. The downtime will allow us to make a number of datacenter improvements that will reduce the number of required maintenance downtimes per year
and improve datacentre uptime. We also plan to upgrade the GPFS filesystem in order to allow for planned storage system upgrades later this year.

Wed Feb 27 14:15:48 Most systems are up. Please check this site for updates. Please report any problems or unusual system behaviour.

Wed Feb 27 12:55:35 Systems coming up. GPC will be accessible shortly, as will BGQ. We estimate 2PM for this. TCS may take a bit longer.

Wed Feb 27 10:01:05 Cooling restored. Power fluctuations had tripped breakers for cooling system. Computer systems are being tested before bringing them online. Further updates will be posted when available.

Wed Feb 27 03:34:03 Complete loss of cooling as of 0230 this morning. Under investigation. Unlikely that any systems will be back before noon today

Fri Feb 22, 2013, 7:30 am: The BGQ devel system shut down at 7:30 this morning because it detected a coolant issue. We hope to have it, and the production system, back up later this afternoon.

Wed Feb 20 04:12:26 EST 2013: Some compute nodes will be turned off Thursday (21 Feb) morning in order to reduce the cooling load in the datacentre. We'll be running on free-cooling only so that the bearings in the chiller can be replaced; that work is expected to be completed by end of Friday. At this point we're planning to shutdown 30 TCS nodes and the production BGQ (the devel system will keep running) on Thursday morning and 20% of the GPC on Friday morning. This will be done through reservations in the queueing system so that no jobs will be killed.

Plans may change depending on outside air temperatures and progress of the work.

Jan 17 17:21:01 EST 2013: Chiller maintenance work finished. System is running normally.

Oct 22 15:20 TCS is back up. Both running and queued jobs for this system were killed. Please resubmit. All other clusters are also up.

Oct 22 15:00 GPC is back up. While running jobs were killed and should be resubmitted, previously queued jobs will now start to run.

Oct 22 14:34 While testing prepareness for power problems, an unfortunate human error in reconfiguring inadvertently triggered our
emergency shutdown routine. We sincerely apologize. The systems are being brought up again. Please check back here
for updates.

Oct 22 14:19 System shutdown; all running jobs lost. We will work on bringing back up the systems as soon as possible.

Oct 22 14:00:00 Logins to the SciNet systems were suddenly and unexpectedly disconnected. We are investigating the issue.

Oct 19 19:00:00 All systems should be up. Let us know if you still are experiencing difficulties.

Oct 19 16:20:00 The GPC and TCS have been brought back up. ARC, BGQ, and HPSS are not in operation yet.

Oct 19 13:05:00 Half of the GPC is being brought up again. TCS, P7, ARC, BGQ, and HPSS are not in operation yet as the chiller control system still needs repairing.

Oct 19 11:02:48 Staff and technicians on-site have concluded that a chiller control board needs to be replaced. We believe we can bring up the chiller manually now and get a portion of the GPC running by 1PM. The repair work will require a brief chiller shutdown (but no GPC shutdown) later in the day so TCS will stay off for now in order to minimize heat load.

Oct 18 23:19:04 Still seeing significant voltage fluctuations in facility power. Will keep systems off rather then risk another failure overnight. Sorry for the inconvenience. Expect to be back up by noon tomorrow (possibly earlier)

Oct 18 22:35:13 Power quality issues brought down the chiller, which required a shutdown of the clusters. Power and chiller are coming back up, and we hope to have the clusters up by morning.

Oct 18 21:01:00 The datacentre is down due to a power failure. We are investigating the problem.

Oct 5 16:36:01: The DDR portion of the cluster will be drained of jobs in order to free it up for maintenance work on Tuesday, 9 Oct at 10:30 AM. Jobs will continue to start on the DDR portion over the long weekend so long as the requested wall-clock time allows them to finish before 10:30 AM on Tuesday. The DDR partition will be back in regular service by noon Tuesday.

Oct 4, ~9:00AM: A routing issue prevented logins to scinet. The issue is fixed, running jobs should not have been affected.

Oct 1, 8:00PM: All systems are back online.

Sep 25: All systems will be '''shutdown at 7AM on Monday, 1 Oct''' for annual cooling tower maintenance and cleaning. We expect to come back up in the evening of the same day. Check here in the late afternoon for status updates.

Tue Sep 4 16:11:21 EDT 2012: The connection to the SciNet datacentre will be interrupted from September 5 at 10:00 pm to September 6 at 2:00 am, for router maintenance. Users will not be able to log into SciNet during this window. Running jobs will NOT be affected.

Sun 15 Jul 2012 11:11:37 EDT: Systems back online. Please report any problems/issue to support@scinet.utoronto.ca

Sun 15 Jul 2012 09:24:18 EDT: Main breaker tripped. Power now restored. Cooling system coming back online. Will likely need at least a couple of hours to get systems checked and back in production.

Sun 15 Jul 2012 08:43:21 EDT: Power issue. Staff investigating.

Sun Jul 8 11:49:20 EDT 2012: Systems back online after power failure at Jul 8 09:17:13. Report any problems to support@scinet.utoronto.ca

Tue Jun 26 18:49:54 EDT 2012: Systems back online. Report any problems to support@scinet.utoronto.ca

Tue 26 Jun 2012 15:47:36 EDT: Utility power event tripped the datacenter's under-voltage protection breaker. Power appears OK now. Restarting cooling systems and then will restart compute systems. Should be back online in a couple of hours.

Tue Jun 26 2012 15:35:00 EDT: Power failure of some kind; systems down. We are investigating.

Mon Jun 25 11:22:09 EDT 2012: Systems back up

Mon Jun 25 08:51:09 EDT 2012: Under-voltage event from electrical utility automatically tripped our main circuit breaker to avoid equipment loss/damage. Power has now been restored and cooling system is being re-started. Need to check that everything is OK before restoring systems. They should be back online before noon assuming no new problems are uncovered.

Mon Jun 25 07:37:58 EDT 2012: Staff on-site. No power at main electrical panel.

Mon Jun 25 06:17:43 EDT 2012: Power failure at 0557 today. All systems shutdown. We're investigating.

Fri Jun 8 10:50:00 EDT 2012 The GPC QDR nodes will be unavailable on Monday June 11th, from 9-10am to perform network switch maintenance. All other systems and filesystems will still be available.

Thu Jun 7 20:50:00 EDT 2012 The scheduled electrical work has been completed. Systems are now available. Please email support@scinet.utoronto.ca if you experience any problems.

Thu Jun 7 07:22:24 EDT 2012 All power to the SciNet facility is off in order to complete the scheduled electrical work outlined below. The work is planned to take at least 12 hrs and the earliest we expect systems to be available to users is 10 PM tonight. Watch here for updates.

Fri Jun 1 13:14:51 EDT 2012 There will be a full SciNet shutdown on Thu Jun 7 2012, starting at 6AM.

This is the final scheduled shutdown in preparation for the installation of IBM Blue Gene/Q system. A new machine room has been built (walls, raised floor, cooling unit, electrical and water connections), but downtime is required to connect 800 kW of power from our electrical room to the new room.

All systems will go down at 6 AM on Thu 7 Jun; all login sessions and jobs will be killed at that time.

At the earliest, the systems will be available again around 10PM in the evening of Thu 7 Jun. Check on this page for updates on Thursday.

Sun May 27 09:11:53 EDT 2012: scratch became umounted on all nodes about 0700 this morning. Problem has been resolved and /scratch has been remounted everywhere.

Sun May 27 09:11:53 EDT 2012: scratch became umounted on all nodes. Working on fix

Thu May 10 15:17:54 EDT 2012: '''Systems now up'''. Overnight testing took longer than expected. Testing is completed; system is fully up and running. Much of the GPC is about to come out of warranty coverage this month, and the thorough pre-expiration shakedown provided by the tests during this downtime uncovered hardware or configuration issues with over 60 GPC nodes, including problems with memory DIMMs, network cards, and power supplies; these issues are now fixed or slated to be fixed with the offending nodes offlined. Testing also closely examined the new networking infrastructure at very large scale and several minor issues have been identified which will be improved in the very near future.

Thu May 10 07:31:54 EDT 2012: '''Systems expected to be available by 2PM today (10 May)'''. Overnight testing took longer than expected.

Tue 8 May 2012 9:30:46 EDT:
The announced 8/9 May SciNet '''shutdown''' has started. This shutdown is intended for final configurations in the changeover to full infiniband for the GPC, for some back-end maintenance, and to test the computational and file system performance of the TCS and GPC. Systems went down at 9 am on May 8; all login sessions and jobs were killed at that time. The system should be available again tomorrow evening. Check here on Wednesday for updates.

Wed 2 May 2012 10:20:46 EDT: ANNOUNCEMENT:
There will be a full SciNet '''shutdown''' from '''Tue May 8 to Wed May 9''' for final configurations in the changeover to full infiniband for the GPC, for some back-end maintenance, and to test the computational and file system performance of the TCS and GPC.
Systems will go down at 9 am on May 8; all login sessions and jobs will be killed at that time. The system should be available again in the evening of the next day. Check here on Wednesday for updates.

As noted before, see [[GPC_Quickstart#QDR_vs._DDR_Infiniband|GPC_Quickstart]] for how to run mpi jobs on the gpc in light of the new infiniband network (mostly coincides with the old way, with less parameters).

Wed 24 Apr 2012 12:47:46 EDT: The Apr 19 upgrade of the GPC to a low-latency, high-bandwidth Infiniband network throughout the cluster is now reflected in (most of) the wiki. The appropriate way to request nodes in job scripts for the new setup (which will coincide with the old way for many users) is described on the [[GPC_Quickstart#QDR_vs._DDR_Infiniband|GPC_Quickstart]] page.

Thu 19 Apr 2012 19:43:46 EDT:

The GPC network has been upgraded to a low-latency,
high-bandwidth Infiniband network throughout the cluster. Several significant benefits over the old ethernet/infiniband mixed setup are expected,
including:
*better I/O performance for all jobs
*better job performance for what used to be multi-node ethernet jobs (as they will now make use of Infiniband),
*for users that were already using Infiniband, improved queue throughput (there are now 4x as many available nodes), and the ability to run larger IB jobs.

NOTE 1: Our wiki is NOT completely up-to-date after
this recent change. For the time being, you should first check this
current page and the temporary [https://support.scinet.utoronto.ca/wiki/index.php/Infiniband_Upgrade Infiniband Upgrade] page
for anything related to networks and queueing.

NOTE 2: The temporary mpirun settings that were recommended for multinode ethernet runs are no longer in effect, as all MPI traffic is now going over InfiniBand.

NOTE 3: Though we have been testing the new system since last night, a change of
this magnitude (3,000 adapter cards installed, 5km of copper cable, 35km of fibre optic cable) is likely to result in some teething problems so please
bear with us over the next few days. Please report any issues/problems
that are not explained/resolved after reading this current page or our
[https://support.scinet.utoronto.ca/wiki/index.php/Infiniband_Upgrade Infiniband Upgrade] page
to support@scinet.utoronto.ca.

Thu Apr 12 17:39:50 EDT 2012: The TCS maintenance has been completed. Please report any problems.

Thu Apr 12 17:08:00 EST 2012: scheduled maintenance downtime of the TCS. As announced, running TCS jobs and TCS login sessions were killed. All other systems are up. The TCS is expected to be up again sometime this evening.

Tue Apr 10 16:24:00 EST 2012: scheduled downtimes:

Apr 12: TCS Shutdown (Other systems will remain up). The shutdown will start at 11 am and the system should be available again at in the evening of the same day.

Wed 28 Mar 2012 21:45:03 EDT: Connection problem was caused by trouble with a filesystem manager. Problem solved.

Wed 28 Mar 2012 20:55:27 EDT: We're experiencing some problems connecting to the login nodes. Investigating.

Wed Mar 28 10:34:25 EDT 2012: There have been some GPC file system and network stability issues reported over that past few days that we believe are related to some OS configuration changes. We are in the process of resolving them. Thanks for your patience.

Tue Mar 6 18:30:00 EST 2012: We had a glitch on our core switch due to configuraion errors. Unfortunately, this short outage has resulted in unmount of GPFS and jobs got killed. Systems are recovered. Please resubmit jobs.

Fri Mar 2 11:59:33 EST 2012: Roughly 1/3 of the TCS nodes thermal-checked themselves off ~1140 today due to a glitch in the water supply temperature. Unfortunately, all jobs running on those nodes were lost. Please check your jobs and resubmit if necessary.

Thu Feb 9 11:50:57 EST 2012: ''System Temporary Change for MPI ethernet jobs:'' 
Due to some changes we are making to the GPC GigE nodes, if you run multinode ethernet MPI jobs (IB multinode jobs are fine), you will need to explicitly request the ethernet interface in your mpirun:

For Openmpi -> mpirun --mca btl self,sm,tcp
For IntelMPI -> mpirun -env I_MPI_FABRICS shm:tcp

There is no need to do this if you run on IB, or if you run single node mpi jobs on the ethernet (GigE) nodes. Please check [[GPC_MPI_Versions]] for more details.

Thu Feb 9 11:50:57 EST 2012: Scheduled downtime is over. TCS is up. GPC is coming back rack-by-rack.

Mon Jan 31 9:12:00 EST 2012: File systems (scratch and home) got unmounted around 3:30 am and again at around 23:15 on Jan/30. Jobs may have crashed. Filesystems are back now. Please resubmit you jobs.

Wed Jan 18 16:47:38 EST 2012<; Full system shutdown as of 7AM on Tues, 17 Jan in order to perform annual maintenance on the chiller. Most work has been completed on schedule. Expect systems to be available by 8PM today.

Wed Jan 4, 14:48: Scratch file system got unmounted. Most jobs died. We are trying to fix the problem. Check back here for updates.

Wed Jan 3, 13:58: Datamover1 is down due to hardware problems. Use datamover2 instead.

Wed Dec 28 13:46:37 EST 2011 Systems are back up. All running and queued jobs were lost, due to a power failure at the SciNet datacentre. Please resubmit your jobs. Also, please report any problems to <support@scinet.utoronto.ca>.

Wed Dec 28 09:25 EST 2011 Electrician enroute. No power at our main electrical panel.

Wed Dec 28 08:51 EST 2011 Staff enroute to datacentre. More info once we understand what has happened.

Wed Dec 28 02:33 EST 2011 Datacentre appears to have lost all power. All remote access lost.

Thu Dec 8 16:43:47 EST 2011: The GPC was transitioned to CentOS 6 on Monday, December 5, 2011. Some of the known issues (and workarounds) are listed here. Thanks for your patience and understanding! - The SciNet Team.

System appears to have stabilized. Please let us know if there are any problems.

Fri Nov 25 12:39:54 EST 2011:
IMPORTANT upcoming change: The GPC will be transitioned to CentOS 6 on Monday, December 5, 2011.
All GPC devel nodes will be rebooted at noon on Monday Dec 5/11, with a CentOS6 image. The compute nodes will be rebooted as jobs finish, starting on Saturday Dec 3/11. You may already submit jobs requesting the new image (os=centos6computeA), and these jobs will be serviced as the nodes get rebooted into the new OS.

Wed Nov 16 10:53:37 EST 2011:
A glitch caused the scratch file system to get unmounted everywhere. We are on track to fixing the situation. However, most jobs were killed and you will have to resubmit your job once scratch is back.

As for the recovery of /project directories for groups with storage allocation which is less than 5 TB in /project, recovery of their directories is still in progress. Until then, those directories are unaccessible (owned by root). If you can read your project directory, it means that the recovery is complete. To expedite this process, for now, no material can be retrieved from HPSS by users.

Note that the monthly purge of the scratch space will be delayed until Friday, 18 Nov because of the downtime.

Wed Nov 15 10:47:37 EST 2011:
All systems are up and accessible again. Both /project and HPSS now follow the same new directory structure as on /home and /scratch, i.e. /project/<first-letter-of-group>/group/user.

Be aware that for groups with storage allocation which is less than 5 TB in /project, recovery of their directories is in progress and will finish in the next day or so. Until then, those directories are unaccessible (owned by root). If you can read your project directory, it means that the recovery is complete.

Note that the monthly purge of the scratch space will be delayed until Friday, 18 Nov because of the downtime.

Tue Nov 14 9:40:37 EST 2011:
All systems will be shutdown Monday morning in order to complete the
disk rearrangement begun this past week. Specifically, the /project
disks will be reformatted and added to the /scratch filesystem. The new
/scratch will be larger and faster (because of more spindles and a
second controller). We expect to be back online by late afternoon.

The monthly purge of the scratch space will be delayed until Friday, 18
Nov because of the downtime.

For groups with storage allocations, a new /project will be created but
disk allocations on it will be decreased and the difference made up with
allocations on HPSS. Both /project and HPSS will follow the same new
directory structure as now used on /home and /scratch.

Tue Nov 8 17:05:37 EST 2011: Filesystem hierarchy has been renamed as per past emails and the newsletter. e.g. the home directory of user 'resu' in group 'puorg' is now /home/p/puorg/resu and similarly for /scratch. The planned changes to /scratch (new disks and controller) have been postponed until later this month. /project remains read-only for now and there will be follow-up email to project users tomorrow.

Tue Nov 8 11:24:56 EDT 2011 Systems are down for scheduled maintenance. Expect to come back online in early evening after filesystem changes detailed in recent emails (including new directory naming hierarchy).

Wed Nov 2 13:45:24 EDT 2011 Reminder - shutdown of all systems scheduled for 9AM on Tues 8 Nov. Expect to come back online in early evening after filesystem changes detailed in recent emails (including new directory naming hierarchy).

Sat Oct 15 10:44:27 EDT 2011 Filesystem issues resolved.

Sat Oct 15 09:57:57 EDT 2011 Filesystem appears hung because of issues to do with OOM'd nodes and users close to quota limits. We're trying to resolve.

Fri Oct 7 14:00:00 EDT 2011 Scheduler issues resolved

Fri Oct 7 11:17:08 EDT 2011 The scheduler is having issues; we are investigating. The file system also seems unhappy.

Wed Oct 5 02:50:08 EDT 2011 Transformer maintenance complete. Systems back online

Wed Oct 5 01:18:27 EDT 2011 Power was restored ~12:30. Cooling has been restored as of ~ 1:10AM. Starting to bring up filesystems. Will be at least an hour before users can get access

Tue Oct 4 23:38:57 EDT 2011 Electrical crews had some problems. Expect power to be restored by midnight but then will take at least 2 hrs before we have cooling working and can hope to have systems back. Will update

Tue Oct 4 14:24:52 EDT 2011 '''System shutdown''' scheduled for '''2PM on Tues, 4 Oct''' for maintenance on building transformers. All logins and jobs will be killed at that time. Power may not be restored until midnight with systems coming back online 1-2 hrs later.

Sun Oct 2 14:20:57 EDT 2011 Around noon on Sunday we had a hiccup with GPFS that unmounted /scratch and /project on several computer nodes. Systems were back to normal at around 2PM. Several jobs were disrupted, but if your job survived please be sure it produced the expected results.

Sun Oct 2 12:20:57 EDT 2011: GPFS problems, We are investigating

Wed Sep 7 18:26:49 EDT 2011: Systems back online after emergency repair of condenser water loop.

Wed Sep 7 16:28:13 EDT 2011: Cleaning up disk issues. Hope to make systems available again by 19:00-20:00

Wed Sep 7 16:02:19 EDT 2011: Cooling has been restored. Emergency repair was underway but couldn't prevent the shutdown. Update about expected system availability timing by 16:30

Wed Sep 7 15:53:34 EDT 2011: Cooling plant problems.

Sep 3 10:21:37 EDT 2011: scratch is at over 97% full, and may be making general access very slow for everyone. A couple of users are border-lining their quota limits, and are being contacted. If possible please delete any non-essential or temporary files. Your cooperation is much appreciated.

Tue Aug 30 14:58:41 EDT 2011: Systems likely available to users by 3:30PM

Tue Aug 30 13:21:45 EDT 2011: Cracked and suspect valves replaced. Cooling restored. Starting to bring up and test systems. Next update by 3PM

Mon Aug 29 16:01:45 EDT 2011: All systems will be shutdown at 0830 on Tuesday, 30 Aug to replace a cracked valve in the primary cooling system. Expect systems to be back online by 5-6PM today.

Wed Aug 24 13:00:00 EDT 2011: Systems back on-line. Pump rewired and cooling system restarted

Wed Aug 24 07:40:39 EDT 2011: On-site since 0430. Main chilled water pump refuses to start. Technician investigating.

Wed 24 Aug 2011 03:43:16 EDT Emergency shutdown due to failure in cooling system. More later once situation diagnosed

Mon Aug 22 23:29:05 EDT 2011: GPC scheduler appears to be working properly again

Mon Aug 22 22:41:40 EDT 2011: ongoing issues with scheduler on GPC. Working on fix

Mon Aug 22 21:53:40 EDT 2011: Systems are back up

Mon Aug 22 17:14:06 EDT 2011: Bringing up and testing systems. Another update by 8PM

Mon Aug 22 16:13:18 EDT 2011: Cracked valve has been replaced. Cooling system has been restarted. Expect to be online this evening. Another update by 5PM

Mon Aug 22 14:54:45 EDT 2011: Emergency shutdown at 3PM today (22 Aug). Possibility of severe water damage otherwise. More information by 5PM.

Sun Aug 21 18:53:35 EDT 2011: Systems being brought back online and tested. Login should be enabled by 8-9 PM

Sun Aug 21 17:22:12 EDT 2011: Cooling has been restored. Starting to bring back systems but encountering some problems. Check back later - should have a handle on timelines by 7PM.

Sun Aug 21 15:15:35 EDT 2011: Major storms in Toronto. Power glitch appears to have knocked out cooling systems. All computers shutdown to avoid overheating. More later as we learn exactly what has happened.

Thu Aug 9, 12:31:03 EDT 2011 If you had jobs running when the file system failed this morning (10:30AM; 9 Aug), please check their status, as many jobs died.

Mon Jul 18, 15:04:17 EDT 2011 Datamovers and large-memory nodes are up and running. Intel license server is functional again. Purging of /scratch will happen next Friday, July/22.

Tue Jun 14, 14:05 EDT 2011 Many SciNet staff are at the HPCS meeting this week so, apologies in advance, but we may not respond as quickly as usual to email.

Sun May 22 12:15:29 EDT 2011 Cooling pump failure at data centre resulted in emergency shutdown last night. Systems are back to normal.

Tue May 17 17:57:41 EDT 2011 Cause of the last two chiller outages has been fixed. All available systems back online.

Thu May 12 12:23:35 EDT 2011 NOTE: There is some UofT network reconfiguration and maintenance on Friday morning May 13 07:45-08:15 that most likely will disrupt external network connections and data transfers. Local running jobs should not be affected.

Thu May 5 19:17:37 EDT 2011 Chiller has been fixed. Systems back to normal.

You can check our twitter feed, @SciNetHPC, for updates.

Wed May 4 22:29:00 EDT 2011 Cooling failure at data centre resulted in an emergency shutdown last night. Chiller replacement part won't arrive until noon tomorrow and then needs installation and testing. Roughly 40% of the GPC is currently available. We're using free-cooling instead of the chiller and therefore can't risk turning on more nodes.
NOTE - we plan to keep these systems up tomorrow when parts are replaced but if there are complications they may be shutdown without warning. You can check our twitter feed, @SciNetHPC, for updates.

Wed May 4 16:43:34 EDT 2011 Cooling failure at data centre resulted in an emergency shutdown last night. Chiller replacement part won't arrive until noon tomorrow and then needs installation and testing.
Roughly 1/4 of the GPC (evenly split beween IB and gigE) is currently available. We're using free-cooling instead of the chiller and may increase the number of nodes depending on how the room responds.
NOTE - we plan to keep these systems up tomorrow when parts are replaced but if there are complications they may be shutdown without warning.
You can check our twitter feed, @SciNetHPC, for updates.

Tues 3 May 2011 20:30 Cooling failure at data centre has resulted in emergency shutdown. This space will be updated when more information is available.

Thu 27 Apr 2011 20:21 EDT: System maintenance completed. All systems are operational.
TCS is back to normal too, with tcs01 and tcs02 as the devel nodes.

Sun 17 Apr 2011 18:34:10 EDT Problem with building power supply on Saturday at 2AM today took down the cooling system. Power has been restored but new problems have arisen with TCS water units.

We have been able to partially restore the TCS. The usual development nodes, tcs01 and tcs02 are not available.
However, we have created a temporary workaround using node tcs03 (tcs-f02n01), where you can login, compile, and submit jobs. Not all the compute nodes are up, but we are working on that.
Please let us know if there are any problems with this setup.

Sun Apr 16 13:01:14 EDT 2011 Problem with building power supply at 2AM today took down the cooling system . Waiting for electrical utility to check lines.

Wed Mar 23 19:06:21 EDT 2011: A VSD controller failed in the cooling system. A temporary solution will allow systems to come back online this evening (likely by 8PM). There will likely need to be downtime in the next two days in order to replace the controller.

Wed Mar 23 17:00 EST 2011 Down due to cooling failure. Being worked on. Large electrical surge took out datacentre fuses. No estimated time to solution yet.

Mon Mar 7 14:39:59 EST 2011 NB - the SciNet network connection will be cut briefly (5 mins or less) at about 8AM on Tues, 8 March in order to test the new UofT gateway router connection

Thu Feb 24 13:21:38 EST 2011 Systems were shutdown at 0830 today for scheduled repair of a leak in the cooling system.GPC is online. TCS expected to be back online by 4PM

Tue Feb 24 9:35:31 EST 2011: All systems will be shutdown at:0830 Thursday, 24 Feb in order to repair a small leak in the cooling system. Systems will be back online by afternoon or evening. Check back here for updates during the day.

Wed Feb 9 10:27:13 EST 2011: There was a cooling system failure last night, causing all running and queued jobs to be lost. All systems are back up.

Sat Feb 5 20:33:34 EST 2011: Power outage in Vaughan. TCS jobs all died. Some GPC jobs have survived.

Sat Feb 5 18:51:00 EST 2011: We just had a hiccup with the cooling tower. We suspect it's a power-grid issue, but are investigating the situation.

Thu Jan 20 18:25:54 EST 2011: Maintenance complete. Systems back online.

Wed Jan 19 07:22:32 EST 2011: Systems offline for scheduled maintenance of chiller. Expect to be back on-line in evening of Thurs, 20 Jan. Check here for updates and revised estimates of timing.

Fri Dec 24 15:40:12 EST 2010: We experienced a failure of the /scratch filesystem, with a corrupt quota mechanism, on the afternoon of Friday December 24. This resulted in a general gpfs failure that required the system to be shutdown and rebooted. Consequently all running jobs were lost. We apologize for the disruption this has caused, as well as the lost work.

Happy holidays to all! The SciNet team.

Note: From Dec 22, 2010 to Jan 2, 2011, the SciNet offices are officially closed, but the system will be up and running and we will keep an eye out for emergencies.

Mon Dec 6 17:42:12 EST 2010: File system problems appear to have been resolved.

Mon Dec 6 16:39:55 EST 2010: File system issues on both TCS and GPC. Investigating.

Fri Nov 26 18:30:55 EST 2010: All systems are up and accepting user jobs. There was a province-wide power issue at 1:30 the morning of Friday Nov 26, which caused the chiller to fail, and all systems to shutdown. All jobs running or queued at the time were killed as a result. The system is accessible now and you can resubmit your jobs.

Fri Nov 26 6:15 EST 2010: The data center had some heating issue at around 1:30am Fri Nov 26 and system are down right now. We are investigating the cause of the problem and trying to fix the issue so that we could bring the system back up and running.

Fri Nov 5 15:26 EDT 2010: Scratch is down; we are working on it.

Tue Oct 26 10:32:22 EDT 2010: scratch is hung, we're investigating.

Fri Sep 24 23:02:41 EDT 2010: Systems are back up. Please report any problems to <support at scinet dot utoronto dot ca>

Fri Sep 24 19:55:02 EDT 2010: Chiller restarted. Systems should be back online this evening - likely 9-10PM. Widespread power glitch seems to have confused one of the Variable Speed Drives and control system was unable to restart it automatically.

Fri Sep 24 16:44:39 EDT 2010: The systems were unexpectedly and automatically shut down, due to a failure of the chiller. We are investigating and will bring the systems back up as soon as possible.

Mon Sep 13 15:39:05 EDT 2010: The systems were unexpectedly shut down, automatically, due to a failure of the chiller. We are investigating and will bring the systems back up as soon as possible.

Fri Sep 10 15:08:53 EDT 2010: GPFS upgraded to 3.3.0.6; new scratch quotas applied; check yours with /scinet/gpc/bin/diskUsage.

Mon Aug 16 13:12:08 EDT 2010: Filesystems are accessible and login is normal.

Mon Aug 16 12:38:45 EDT 2010: Login access to SciNet is failing due to /home filesystem problems.
We are actively working on a solution.

Mon Aug 9 16:10:03 EDT 2010: The file system is slow and scheduler cannot be reached. We are working on the problem.

Sat Aug 7 11:42:50 EDT 2010: System status: Normal.

Sat Aug 7 10:45:52 EDT 2010: Problems with scheduler not responding on GPC. Should be fixed within an hour

Wed Aug 4 14:03:59 EDT 2010: GPC scheduler currently having filesystem issue which means you may have difficulty submitting jobs and monitoring the queue. We are working on the issue now.

Fri Jul 23 17:50:00 EDT 2010: Systems are up again after testing and maintenance.

Fri Jul 23 16:14:02 EDT 2010: Systems are down for testing and maintenance. Expect to be back up about 9PM this evening.

Tue Jul 20 17:40:45 EDT 2010: Systems are back. You may resubmit jobs.

Tue Jul 20 15:39:37 EDT 2010: Most GPC jobs died as well. The ones that appear to be running are likely in an unknown state, so will be killed in order to ensure they do not produce bogus results.
The machines should be up shortly.
Thanks for your patience and understanding. The SciNet team.

Tue Jul 20 15:19:30 EDT 2010: File systems down. We are looking at the problem now. All jobs on TCS have died. We'll inform you when the machine is available for use again. Please log out from the TCS.

Tue Jul 20 14:26:10 EDT 2010: Scratch file system down. We are looking at the problem now.

Fri Jul 16 10:59:27 EDT 2010: Systems normal.

Sun Jul 11 13:08:02 EDT 2010: All jobs running at ~3AM this morning almost certainly failed and/or were killed about 11AM. A hardware failure has reduced /home and /scratch performance by a factor of 2 but should be corrected tomorrow.

Sun Jul 11 09:56:27 EDT 2010: /scratch was inaccessible as of about 3AM

Fri Jul 9 16:18:18 EDT 2010: /scratch is accessible again

Fri Jul 9 15:38:43 EDT 2010: New trouble with the filesystems. We are working to fix things

Fri Jul 9 11:28:00 EDT 2010: The /scratch filesystem died at about 3 AM on Fri Jul 9, and all jobs running at the time died.

Oldwiki.scinet.utoronto.ca:System Alerts

2013-08-15T15:26:48Z

Jchong: /* System Status */

== System Status==

[[File:up.png|up|link=GPC Quickstart]]GPC
[[File:up.png|up|link=TCS Quickstart]]TCS
[[File:up.png|up|link=GPU Devel Nodes]]ARC
[[File:up.png|up|link=P7 Linux Cluster]]P7
[[File:up.png|up|link=BGQ]]BGQ
[[File:up.png|up|link=HPSS]]HPSS

Wed Aug 14 20:00:00 - '''The login and GPC development nodes are back in service now. We have disabled the read-only mount for scratch since that was causing issues with the ongoing recovery. It will be made available later in the week when the recovery is complete. Please continue to check here for further updates.'''

Wed Aug 14 19:36:41 - There are currently filesystem issues with the gpc login node and the general scinet login node. We are working on the issue and trying to fix it.

Wed Aug 14 00:30:46 - '''the regular monthly purge of /scratch will be delayed because of the problems with the filesystem. It will tentatively take place on 22 Aug (or later). New date will be announced here.'''

Tue Aug 13 20:23:27 - GPC and TCS available. See notes below about scratch2, scratch and project filesystems.

Tue Aug 13 19:15:28- for the time being, /scratch and /project will be available only from the login and devel nodes and will only be readable (you can not write to them). This way users can retrieve files they really need but we minimize the stress on the filesystem while we complete LUN verifies and filesystem checks. These filesystems will return to normal later this week (likely Wed or Thurs but may take longer than expected). We know that there are some files that may have corrupted data and will post more details later about how to identify them. The total amount of corrupted data is small and appears to be limited only to those files which were open for writing when the problems started (about 1445 on Friday, 9 Aug). GPC users will still need to use /scratch2 for running jobs while TCS users will need to make use of /reserved1.

Tue Aug 13 17:24:18 - there is good news about /scratch and /project. They appear to be at least 99% intact. However, there are still more LUN verifies that needs to be run as well as disk fscks. It's not yet clear whether we will be able to make these disks available tonight or at some point tomorrow. Systems should come online again within a couple of hours though perhaps only with the new /scratch2 for now.

Tue Aug 13 17:13:58 - datacentre upgrades finished. Snubber network, upgraded trigger board, UPS for the controller and the Quickstart feature should make chiller more resilient to power events and improve the time it takes to restart. Hot circuit breakers also replaced

Tues Aug 13 09:00:00 - systems down for datacentre improvement work

Sun Aug 11 21:55:06 - TCS can be used by those groups which have /reserved1 space. Use /reserved1 to run jobs as you would hve with the old /scratch (which we are still trying to recover)

Sun Aug 11 21:49:03 - GPC is available for use. There is no /scratch or /project filesystem as we are still trying to recover them. You can use /scratch2 to run jobs in exactly the same way as the old scratch (however the environment variable is $SCRATCH2). New policies for /scratch2 are being set but for now each user is limited to 10TB and 1 million files. /home is unscathed.

([[Previous_messages:|Previous messages]])

Oldwiki.scinet.utoronto.ca:System Alerts

2013-08-15T00:03:56Z

Jchong: /* System Status */

== System Status==

[[File:up.png|up|link=GPC Quickstart]]GPC
[[File:up.png|up|link=TCS Quickstart]]TCS
[[File:up.png|up|link=GPU Devel Nodes]]ARC
[[File:up.png|up|link=P7 Linux Cluster]]P7
[[File:up.png|up|link=BGQ]]BGQ
[[File:up.png|up|link=HPSS]]HPSS

Wed Aug 14 20:00:00 - '''The login node and GPC development node are back in service now. We have disabled read-only mount for scratch since that was causing issues with the ongoing recover. Please check the wiki for further updates.'''

Wed Aug 14 19:36:41 - There are currently filesystem issues with the gpc login node and the general scinet login node. We are working on the issue and trying to fix it.

Wed Aug 14 00:30:46 - the regular monthly purge of /scratch will be delayed because of the problems with the filesystem. It will tentatively take place on 22 Aug (or later). New date will be announced here.

Tue Aug 13 20:23:27 - GPC and TCS available. See notes below about scratch2, scratch and project filesystems.

Tue Aug 13 19:15:28- for the time being, /scratch and /project will be available only from the login and devel nodes and will only be readable (you can not write to them). This way users can retrieve files they really need but we minimize the stress on the filesystem while we complete LUN verifies and filesystem checks. These filesystems will return to normal later this week (likely Wed or Thurs but may take longer than expected). We know that there are some files that may have corrupted data and will post more details later about how to identify them. The total amount of corrupted data is small and appears to be limited only to those files which were open for writing when the problems started (about 1445 on Friday, 9 Aug). GPC users will still need to use /scratch2 for running jobs while TCS users will need to make use of /reserved1.

Tue Aug 13 17:24:18 - there is good news about /scratch and /project. They appear to be at least 99% intact. However, there are still more LUN verifies that needs to be run as well as disk fscks. It's not yet clear whether we will be able to make these disks available tonight or at some point tomorrow. Systems should come online again within a couple of hours though perhaps only with the new /scratch2 for now.

Tue Aug 13 17:13:58 - datacentre upgrades finished. Snubber network, upgraded trigger board, UPS for the controller and the Quickstart feature should make chiller more resilient to power events and improve the time it takes to restart. Hot circuit breakers also replaced

Tues Aug 13 09:00:00 - systems down for datacentre improvement work

Sun Aug 11 21:55:06 - TCS can be used by those groups which have /reserved1 space. Use /reserved1 to run jobs as you would hve with the old /scratch (which we are still trying to recover)

Sun Aug 11 21:49:03 - GPC is available for use. There is no /scratch or /project filesystem as we are still trying to recover them. You can use /scratch2 to run jobs in exactly the same way as the old scratch (however the environment variable is $SCRATCH2). New policies for /scratch2 are being set but for now each user is limited to 10TB and 1 million files. /home is unscathed.

([[Previous_messages:|Previous messages]])

Oldwiki.scinet.utoronto.ca:System Alerts

2013-08-15T00:03:16Z

Jchong: /* System Status */

== System Status==

[[File:up.png|up|link=GPC Quickstart]]GPC
[[File:up.png|up|link=TCS Quickstart]]TCS
[[File:up.png|up|link=GPU Devel Nodes]]ARC
[[File:up.png|up|link=P7 Linux Cluster]]P7
[[File:up.png|up|link=BGQ]]BGQ
[[File:up.png|up|link=HPSS]]HPSS

Wed Aug 14 20:00:00 - The login node and GPC development node are back in service now. We have disabled read-only mount for scratch since that was causing issues with the ongoing recover. Please check the wiki for further updates.

Wed Aug 14 19:36:41 - There are currently filesystem issues with the gpc login node and the general scinet login node. We are working on the issue and trying to fix it.

Wed Aug 14 00:30:46 - the regular monthly purge of /scratch will be delayed because of the problems with the filesystem. It will tentatively take place on 22 Aug (or later). New date will be announced here.

Tue Aug 13 20:23:27 - GPC and TCS available. See notes below about scratch2, scratch and project filesystems.

Tue Aug 13 19:15:28- for the time being, /scratch and /project will be available only from the login and devel nodes and will only be readable (you can not write to them). This way users can retrieve files they really need but we minimize the stress on the filesystem while we complete LUN verifies and filesystem checks. These filesystems will return to normal later this week (likely Wed or Thurs but may take longer than expected). We know that there are some files that may have corrupted data and will post more details later about how to identify them. The total amount of corrupted data is small and appears to be limited only to those files which were open for writing when the problems started (about 1445 on Friday, 9 Aug). GPC users will still need to use /scratch2 for running jobs while TCS users will need to make use of /reserved1.

Tue Aug 13 17:24:18 - there is good news about /scratch and /project. They appear to be at least 99% intact. However, there are still more LUN verifies that needs to be run as well as disk fscks. It's not yet clear whether we will be able to make these disks available tonight or at some point tomorrow. Systems should come online again within a couple of hours though perhaps only with the new /scratch2 for now.

Tue Aug 13 17:13:58 - datacentre upgrades finished. Snubber network, upgraded trigger board, UPS for the controller and the Quickstart feature should make chiller more resilient to power events and improve the time it takes to restart. Hot circuit breakers also replaced

Tues Aug 13 09:00:00 - systems down for datacentre improvement work

Sun Aug 11 21:55:06 - TCS can be used by those groups which have /reserved1 space. Use /reserved1 to run jobs as you would hve with the old /scratch (which we are still trying to recover)

Sun Aug 11 21:49:03 - GPC is available for use. There is no /scratch or /project filesystem as we are still trying to recover them. You can use /scratch2 to run jobs in exactly the same way as the old scratch (however the environment variable is $SCRATCH2). New policies for /scratch2 are being set but for now each user is limited to 10TB and 1 million files. /home is unscathed.

([[Previous_messages:|Previous messages]])

Oldwiki.scinet.utoronto.ca:System Alerts

2013-08-14T23:47:16Z

Jchong: /* System Status */

== System Status==

[[File:up.png|up|link=GPC Quickstart]]GPC
[[File:up.png|up|link=TCS Quickstart]]TCS
[[File:up.png|up|link=GPU Devel Nodes]]ARC
[[File:up.png|up|link=P7 Linux Cluster]]P7
[[File:up.png|up|link=BGQ]]BGQ
[[File:up.png|up|link=HPSS]]HPSS

Wed Aug 14 19:36:41 - There are currently filesystem issues with the gpc login node and the general scinet login node. We are working on the issue and trying to fix it.

Wed Aug 14 00:30:46 - the regular monthly purge of /scratch will be delayed because of the problems with the filesystem. It will tentatively take place on 22 Aug (or later). New date will be announced here.

Tue Aug 13 20:23:27 - GPC and TCS available. See notes below about scratch2, scratch and project filesystems.

Tue Aug 13 19:15:28- for the time being, /scratch and /project will be available only from the login and devel nodes and will only be readable (you can not write to them). This way users can retrieve files they really need but we minimize the stress on the filesystem while we complete LUN verifies and filesystem checks. These filesystems will return to normal later this week (likely Wed or Thurs but may take longer than expected). We know that there are some files that may have corrupted data and will post more details later about how to identify them. The total amount of corrupted data is small and appears to be limited only to those files which were open for writing when the problems started (about 1445 on Friday, 9 Aug). GPC users will still need to use /scratch2 for running jobs while TCS users will need to make use of /reserved1.

Tue Aug 13 17:24:18 - there is good news about /scratch and /project. They appear to be at least 99% intact. However, there are still more LUN verifies that needs to be run as well as disk fscks. It's not yet clear whether we will be able to make these disks available tonight or at some point tomorrow. Systems should come online again within a couple of hours though perhaps only with the new /scratch2 for now.

Tue Aug 13 17:13:58 - datacentre upgrades finished. Snubber network, upgraded trigger board, UPS for the controller and the Quickstart feature should make chiller more resilient to power events and improve the time it takes to restart. Hot circuit breakers also replaced

Tues Aug 13 09:00:00 - systems down for datacentre improvement work

Sun Aug 11 21:55:06 - TCS can be used by those groups which have /reserved1 space. Use /reserved1 to run jobs as you would hve with the old /scratch (which we are still trying to recover)

Sun Aug 11 21:49:03 - GPC is available for use. There is no /scratch or /project filesystem as we are still trying to recover them. You can use /scratch2 to run jobs in exactly the same way as the old scratch (however the environment variable is $SCRATCH2). New policies for /scratch2 are being set but for now each user is limited to 10TB and 1 million files. /home is unscathed.

([[Previous_messages:|Previous messages]])

Oldwiki.scinet.utoronto.ca:System Alerts

2013-08-14T23:37:38Z

Jchong: /* System Status */

== System Status==

[[File:up.png|up|link=GPC Quickstart]]GPC
[[File:up.png|up|link=TCS Quickstart]]TCS
[[File:up.png|up|link=GPU Devel Nodes]]ARC
[[File:up.png|up|link=P7 Linux Cluster]]P7
[[File:up.png|up|link=BGQ]]BGQ
[[File:up.png|up|link=HPSS]]HPSS

Wed Aug 14 19:36:41 - There are currently filesystem issues with the gpc login node and the general scinet login node. We are working on the issue and trying to fix it.
Wed Aug 14 00:30:46 - the regular monthly purge of /scratch will be delayed because of the problems with the filesystem. It will tentatively take place on 22 Aug (or later). New date will be announced here.

Tue Aug 13 20:23:27 - GPC and TCS available. See notes below about scratch2, scratch and project filesystems.

Tue Aug 13 19:15:28- for the time being, /scratch and /project will be available only from the login and devel nodes and will only be readable (you can not write to them). This way users can retrieve files they really need but we minimize the stress on the filesystem while we complete LUN verifies and filesystem checks. These filesystems will return to normal later this week (likely Wed or Thurs but may take longer than expected). We know that there are some files that may have corrupted data and will post more details later about how to identify them. The total amount of corrupted data is small and appears to be limited only to those files which were open for writing when the problems started (about 1445 on Friday, 9 Aug). GPC users will still need to use /scratch2 for running jobs while TCS users will need to make use of /reserved1.

Tue Aug 13 17:24:18 - there is good news about /scratch and /project. They appear to be at least 99% intact. However, there are still more LUN verifies that needs to be run as well as disk fscks. It's not yet clear whether we will be able to make these disks available tonight or at some point tomorrow. Systems should come online again within a couple of hours though perhaps only with the new /scratch2 for now.

Tue Aug 13 17:13:58 - datacentre upgrades finished. Snubber network, upgraded trigger board, UPS for the controller and the Quickstart feature should make chiller more resilient to power events and improve the time it takes to restart. Hot circuit breakers also replaced

Tues Aug 13 09:00:00 - systems down for datacentre improvement work

Sun Aug 11 21:55:06 - TCS can be used by those groups which have /reserved1 space. Use /reserved1 to run jobs as you would hve with the old /scratch (which we are still trying to recover)

Sun Aug 11 21:49:03 - GPC is available for use. There is no /scratch or /project filesystem as we are still trying to recover them. You can use /scratch2 to run jobs in exactly the same way as the old scratch (however the environment variable is $SCRATCH2). New policies for /scratch2 are being set but for now each user is limited to 10TB and 1 million files. /home is unscathed.

([[Previous_messages:|Previous messages]])