Difference between revisions of "Oldwiki.scinet.utoronto.ca:System Alerts"

From oldwiki.scinet.utoronto.ca

Revision as of 19:26, 13 August 2013

System Status

GPC: down, TCS: down, ARC: down, P7: down, BGQ: down, HPSS: down

Tue Aug 13 19:15:28 - for the time being, /scratch and /project will be available only from the devel nodes and will be read-only (you cannot write to them). This way users can retrieve the files they really need while we minimize the stress on the filesystem as we complete LUN verifies and filesystem checks. These filesystems will return to normal later this week (likely Wednesday, or Thursday at the latest). We know that some files may have corrupted data, and we will post more details later about how to identify them. The total amount of corrupted data is small and appears to be limited to files that were open for writing when the problems started (about 14:45 on Friday, 9 Aug). GPC users will still need to use /scratch2 for running jobs, while TCS users will need to use /reserved1.


Tue Aug 13 17:24:18 - there is good news about /scratch and /project. They appear to be at least 99% intact. However, more LUN verifies still need to be run, as well as disk fscks. It's not yet clear whether we will be able to make these disks available tonight or at some point tomorrow. Systems should come online again within a couple of hours, though perhaps only with the new /scratch2 for now.

Tue Aug 13 17:13:58 - datacentre upgrades finished. The snubber network, upgraded trigger board, UPS for the controller and the Quickstart feature should make the chiller more resilient to power events and shorten the time it takes to restart. Hot circuit breakers were also replaced.

Tue Aug 13 09:00:00 - systems down for datacentre improvement work

Sun Aug 11 21:55:06 - TCS can be used by those groups which have /reserved1 space. Use /reserved1 to run jobs as you would have with the old /scratch (which we are still trying to recover).

Sun Aug 11 21:49:03 - GPC is available for use. There is no /scratch or /project filesystem as we are still trying to recover them. You can use /scratch2 to run jobs in exactly the same way as the old /scratch (note, however, that the environment variable is $SCRATCH2). New policies for /scratch2 are being set, but for now each user is limited to 10TB and 1 million files. /home is unscathed.
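
For users who want to check their own usage against the interim /scratch2 limits quoted above, a minimal Python sketch follows. It is not a SciNet tool: the function names are hypothetical, it assumes $SCRATCH2 points at a directory you own, and it assumes "10TB" means decimal terabytes. Walking a large tree generates a lot of metadata traffic, so it is best run sparingly.

```python
import os

# Interim per-user limits quoted in the notice.
# Assumption: "10TB" is decimal (10 * 10**12 bytes), not 10 TiB.
MAX_BYTES = 10 * 10**12
MAX_FILES = 1_000_000


def scratch_usage(root):
    """Return (entry_count, total_bytes) for every non-directory entry under root.

    Symbolic links are counted as entries but are not followed.
    """
    entry_count = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # entry vanished or is unreadable; skip it
            entry_count += 1
            total_bytes += st.st_size
    return entry_count, total_bytes


def within_limits(entry_count, total_bytes,
                  max_files=MAX_FILES, max_bytes=MAX_BYTES):
    """True if both the file-count and the byte limits are respected."""
    return entry_count <= max_files and total_bytes <= max_bytes


if __name__ == "__main__":
    # SCRATCH2 is the variable named in the notice; its exact value is site-specific.
    root = os.environ.get("SCRATCH2")
    if root is None:
        print("SCRATCH2 is not set in this environment")
    else:
        count, size = scratch_usage(root)
        status = "within" if within_limits(count, size) else "over"
        print(f"{count} files, {size} bytes: {status} the interim limits")
```

The byte total here is the sum of apparent file sizes, which may differ from what the filesystem's own quota accounting reports.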

Sun Aug 11 15:35:32 - We are implementing a contingency plan for access by GPC users. It should be available within a few hours. There will be a new /scratch2 filesystem that can be used for submitting and running jobs. TCS users may have to wait another day for a fix (it is technically impossible to mount the new /scratch2 on the TCS). Unfortunately, nobody will be able to access the original /scratch or /project space, and the timeline for fixing and recovering those filesystems is virtually impossible to judge (we have to deal with new problems as they crop up, and there is no way to know how many problems lie ahead).


(Previous messages)