Oldwiki.scinet.utoronto.ca:System Alerts


System Status

GPC: up | TCS: up | ARC: up | P7: up | BGQ: up | HPSS: up

Mon Aug 19 14:17 - This is the plan for the next few days to get scratch back into production:

  • We'll first be doing another verify to find bad sectors. This will generate a list of suspect files.
  • These suspect files will be moved to a separate location, and the owners of these files will be notified.
  • Sometime tomorrow (Tue Aug 20), this should allow /scratch to be mounted read/write again.
  • Once /scratch is back in full production, and after running jobs using /scratch2 have finished (say 2-3 days later), /scratch2 will be phased out.
  • During the phase-out, /scratch2 will be mounted read-only for a week so that users can copy any results they want to keep from /scratch2 to /scratch, /home, or off-site.
  • After that week, /scratch2 will be unmounted.
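For the copy step during the read-only week, a minimal Python sketch is given here. It is an illustration only, not a SciNet-provided tool: the function name `copy_results` and its arguments are hypothetical, and it assumes the source tree is readable and the destination is writable. It skips anything that already exists at the destination rather than overwriting it.

```python
import shutil
from pathlib import Path


def copy_results(src_root: Path, dst_root: Path, names: list[str]) -> list[Path]:
    """Copy selected files or directories from src_root to dst_root.

    Returns the destination paths that were written. Entries that are
    missing at the source, or already present at the destination, are skipped.
    """
    copied = []
    for name in names:
        src = src_root / name
        dst = dst_root / name
        if not src.exists() or dst.exists():
            continue  # nothing to copy, or already copied earlier
        dst.parent.mkdir(parents=True, exist_ok=True)
        if src.is_dir():
            shutil.copytree(src, dst)  # recursive copy of a whole directory
        else:
            shutil.copy2(src, dst)  # copy2 also preserves timestamps
        copied.append(dst)
    return copied
```

Here `src_root` would be your directory on /scratch2 and `dst_root` a directory of yours on /scratch or /home. For very large trees a dedicated transfer tool may be more appropriate than this sketch.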

Sat Aug 17 02:21:09 - /scratch and /project are now mounted read-only on the login and devel nodes. Please read details immediately below. Note also (further below) that the monthly purge of /scratch will be delayed at least another week.

Sat Aug 17 00:04:39 - the /scratch and /project filesystems will be mounted read-only again by 0900 today, but only on the login and devel nodes. You will not be able to write to /scratch or /project, but you will be able to access files on them and copy them elsewhere (e.g. to /scratch2). The storage vendor has finished repairing the RAID parity errors. We are now testing that they have "fixed" them properly, resolving discrepancies between their lists and ours, and identifying the names of the files that have been corrupted. Unfortunately, if any parity errors remain or were improperly "fixed", the entire filesystem can crash when somebody accesses the affected files (this is why we had to unmount /scratch earlier this week). Accordingly, we are testing all the disk sectors that the vendor has claimed to have fixed overnight. If the filesystem remains stable over the weekend, we hope to be able to return /scratch and /project to normal on Monday or Tuesday.

Thu Aug 15 12:59:18 - work continues on recovering the filesystem. The vast majority of data appears intact but the storage vendor is still resolving parity errors. We are also working on a way for users to identify which files might have been corrupted. Unfortunately the timeline for all this is still uncertain. We're trying to balance paranoia about preserving data with the need for users to get back to work. /scratch2 is working well and the GPC is currently at 80% utilization.

Wed Aug 14 20:00:00 - The login and GPC development nodes are back in service now. We have disabled the read-only mount for scratch since that was causing issues with the ongoing recovery. It will be made available later in the week when the recovery is complete. Please continue to check here for further updates.

Wed Aug 14 19:36:41 - There are currently filesystem issues with the GPC login node and the general SciNet login node. We are working to fix them.

Wed Aug 14 00:30:46 - the regular monthly purge of /scratch will be delayed because of the problems with the filesystem. It will tentatively take place on 22 Aug (or later). New date will be announced here.

Tue Aug 13 20:23:27 - GPC and TCS available. See notes below about scratch2, scratch and project filesystems.

Tue Aug 13 19:15:28 - for the time being, /scratch and /project will be available only from the login and devel nodes and will be read-only (you cannot write to them). This way users can retrieve files they really need while we minimize the stress on the filesystem and complete LUN verifies and filesystem checks. These filesystems will return to normal later this week (likely Wed or Thurs, but it may take longer than expected). We know that some files may have corrupted data and will post more details later about how to identify them. The total amount of corrupted data is small and appears to be limited to those files which were open for writing when the problems started (about 1445 on Friday, 9 Aug). GPC users will still need to use /scratch2 for running jobs, while TCS users will need to make use of /reserved1.
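Until the official list of affected files is posted, users who want a rough first look could list files whose modification time falls at or after the time the problems started. The Python sketch below is only a heuristic and is not the identification method referred to above: a recent modification time does not mean a file is corrupted, and a corrupted file might not show up. The function name `modified_since` is hypothetical, and the caller supplies the cutoff (including the year).

```python
import os
from datetime import datetime
from pathlib import Path


def modified_since(root: Path, cutoff: datetime) -> list[Path]:
    """Return files under root whose modification time is at or after cutoff."""
    cutoff_ts = cutoff.timestamp()
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for fname in filenames:
            path = Path(dirpath) / fname
            try:
                if path.stat().st_mtime >= cutoff_ts:
                    hits.append(path)
            except OSError:
                continue  # unreadable or vanished file; skip it
    return sorted(hits)
```

Calling it with your own directory and a cutoff of 14:45 on 9 Aug gives a candidate list to compare against the official one once it is available. Only metadata is read, but walking a large tree still puts load on the filesystem, so it is best run sparingly.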

Tue Aug 13 17:24:18 - there is good news about /scratch and /project. They appear to be at least 99% intact. However, more LUN verifies and disk fscks still need to be run. It's not yet clear whether we will be able to make these disks available tonight or at some point tomorrow. Systems should come online again within a couple of hours, though perhaps only with the new /scratch2 for now.

Tue Aug 13 17:13:58 - datacentre upgrades finished. The snubber network, upgraded trigger board, UPS for the controller and the Quickstart feature should make the chiller more resilient to power events and shorten the time it takes to restart. Hot circuit breakers were also replaced.

Tue Aug 13 09:00:00 - systems down for datacentre improvement work

Sun Aug 11 21:55:06 - TCS can be used by those groups which have /reserved1 space. Use /reserved1 to run jobs as you would have with the old /scratch (which we are still trying to recover).

Sun Aug 11 21:49:03 - GPC is available for use. There is no /scratch or /project filesystem as we are still trying to recover them. You can use /scratch2 to run jobs in exactly the same way as the old scratch (however the environment variable is $SCRATCH2). New policies for /scratch2 are being set but for now each user is limited to 10TB and 1 million files. /home is unscathed.
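To get a rough idea of where a directory tree stands relative to the interim /scratch2 limits, a short Python sketch follows. It is not an official quota tool. The names `usage` and `within_limits` are hypothetical, 10TB is assumed here to mean decimal terabytes, and the script sums apparent file sizes, which may differ from how the filesystem itself accounts for usage.

```python
import os
from pathlib import Path

MAX_FILES = 1_000_000      # interim per-user file limit quoted above
MAX_BYTES = 10 * 10**12    # 10 TB, assuming decimal terabytes


def usage(root: Path) -> tuple[int, int]:
    """Return (file count, total apparent size in bytes) for files under root.

    Symbolic links are counted as themselves and are not followed.
    """
    count = 0
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for fname in filenames:
            try:
                st = os.lstat(os.path.join(dirpath, fname))
            except OSError:
                continue  # unreadable or vanished file; skip it
            count += 1
            total += st.st_size
    return count, total


def within_limits(count: int, total: int) -> bool:
    """Check a (count, bytes) pair against the assumed interim limits."""
    return count <= MAX_FILES and total <= MAX_BYTES
```

For example, `within_limits(*usage(Path(os.environ["SCRATCH2"])))` checks the tree that $SCRATCH2 points to. Walking a tree with many files takes time and loads the filesystem, so it should not be run in a loop.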


(Previous messages)