Oldwiki.scinet.utoronto.ca:System Alerts

Revision as of 19:04, 21 February 2017 by Pinto (talk | contribs) (→‎System Status)

System Status

GPC: up
TCS: up
Sandy: up
Gravity: up
BGQ: up
File System: file systems unmounted
P7: up
P8: up
KNL: up
Viz: up
HPSS: down


Tue Feb 21 19:01:08 EST 2017 The robot arm in the HPSS library is stuck, and a support call with the vendor has been opened. All jobs have been suspended until that is fixed, hopefully tomorrow (Wednesday).

Mon Feb 6 16:01:47 EST 2017 A full shutdown of the systems was required in order to bring the interconnect and filesystems back up in an orderly fashion. Please watch this space for updates.

Mon Feb 6 14:24:51 EST 2017 We're experiencing internal networking problems, which have caused filesystems to become inaccessible on some nodes. Work is underway to fix this.

Sat Jan 28 12:54:12 EST 2017 GPC and TCS are up, as well as the filesystems. We have reverted to the old /scratch and /project disks on the GPC, until we can ascertain what was wrong with the new appliance. In the meantime please submit your jobs as usual. Also, please help us by cleaning up unnecessary stuff on /scratch. For TCS users: the changes we implemented, where you need to use /scratchtcs, are still in effect.

Sat Jan 28 12:19:16 EST 2017 We are bringing the systems back. Expect to be ready in a couple of hours.

Sat 28 Jan 2017 8:41 EST BGQ is not affected and the system is up.

Sat 28 Jan 2017 8:15 EST Further issues found on the file system; system access to users has been closed until we can solve these issues.

Fri 27 Jan 2017 15:11:58 EST The cluster network issue is resolved, and filesystem access has finally been restored now that the root cause of the network issue has been determined.

Fri Jan 27 11:20:32 EST 2017 While we're restoring things, file systems will generally not be available for usage during this period, to facilitate our work. Sorry.

Fri Jan 27 10:02:32 EST 2017 The IB network fabric had a failure earlier today that affected the file systems. The IB fabric is back to normal, and we're working on restoring the file systems at the moment.

Fri Jan 27 7:34:00 EST 2017 Issues with the new scratch file system; we're investigating.

Thu Jan 26 21:24:14 EST 2017 Maintenance finished; systems are back online and available, with the exception of the TCS, which does not accept jobs yet (but the devel nodes are accessible).

Jan 25, 2017, 18:48: BGQ is online.