Oldwiki.scinet.utoronto.ca:System Alerts

System Status

GPC TCS ARC P7 BGQ HPSS

Thu Aug 15 12:59:18 - work continues on recovering the filesystem. The vast majority of data appears intact but the storage vendor is still resolving parity errors. Also working on a way for users to identify what files might possibly have been corrupted. Timeline for all this is still uncertain

Wed Aug 14 20:00:00 - The login and GPC development nodes are back in service now. We have disabled the read-only mount for scratch since that was causing issues with the ongoing recovery. It will be made available later in the week when the recovery is complete. Please continue to check here for further updates.

Wed Aug 14 19:36:41 - There are currently filesystem issues with the gpc login node and the general scinet login node. We are working on the issue and trying to fix it.

Wed Aug 14 00:30:46 - the regular monthly purge of /scratch will be delayed because of the problems with the filesystem. It will tentatively take place on 22 Aug (or later). New date will be announced here.

Tue Aug 13 20:23:27 - GPC and TCS available. See notes below about scratch2, scratch and project filesystems.

Tue Aug 13 19:15:28- for the time being, /scratch and /project will be available only from the login and devel nodes and will only be readable (you can not write to them). This way users can retrieve files they really need but we minimize the stress on the filesystem while we complete LUN verifies and filesystem checks. These filesystems will return to normal later this week (likely Wed or Thurs but may take longer than expected). We know that there are some files that may have corrupted data and will post more details later about how to identify them. The total amount of corrupted data is small and appears to be limited only to those files which were open for writing when the problems started (about 1445 on Friday, 9 Aug). GPC users will still need to use /scratch2 for running jobs while TCS users will need to make use of /reserved1.

Tue Aug 13 17:24:18 - there is good news about /scratch and /project. They appear to be at least 99% intact. However, there are still more LUN verifies that needs to be run as well as disk fscks. It's not yet clear whether we will be able to make these disks available tonight or at some point tomorrow. Systems should come online again within a couple of hours though perhaps only with the new /scratch2 for now.

Tue Aug 13 17:13:58 - datacentre upgrades finished. Snubber network, upgraded trigger board, UPS for the controller and the Quickstart feature should make chiller more resilient to power events and improve the time it takes to restart. Hot circuit breakers also replaced

Tues Aug 13 09:00:00 - systems down for datacentre improvement work

Sun Aug 11 21:55:06 - TCS can be used by those groups which have /reserved1 space. Use /reserved1 to run jobs as you would hve with the old /scratch (which we are still trying to recover)

Sun Aug 11 21:49:03 - GPC is available for use. There is no /scratch or /project filesystem as we are still trying to recover them. You can use /scratch2 to run jobs in exactly the same way as the old scratch (however the environment variable is $SCRATCH2). New policies for /scratch2 are being set but for now each user is limited to 10TB and 1 million files. /home is unscathed.

(Previous messages)

Oldwiki.scinet.utoronto.ca:System Alerts

System Status

Navigation menu

Search