Oldwiki.scinet.utoronto.ca:System Alerts


System Status

GPC: down | TCS: down | ARC: down | P7: down | BGQ: down | HPSS: down


SYSTEM DOWNTIME - all systems will be shut down for datacentre improvements beginning at 09:30 on Tuesday (13 Aug). Downtime is expected to last 4-6 hrs. As usual, follow progress here. This work has been scheduled for a month and, ordinarily, we would have warned users on Friday. However, given the uncertainty about the filesystem, we considered delaying the work and held off on the notification. Given the progress made over the weekend and the importance of the datacentre work, we have decided to proceed as originally scheduled. Tomorrow's work specifically addresses the resiliency of the chiller to power events such as those we experienced on Thursday. In particular, the chiller controller will be moved to the 480V UPS, a quick-start feature will be installed to speed up the restart of the chiller after an event, and a trigger board that has caused two spurious shutdowns over the past six months will be replaced.


Sun Aug 11 21:55:06 - TCS can be used by those groups which have /reserved1 space. Use /reserved1 to run jobs as you would have with the old /scratch (which we are still trying to recover).

Sun Aug 11 21:49:03 - GPC is available for use. There is no /scratch or /project filesystem, as we are still trying to recover them. You can use /scratch2 to run jobs in exactly the same way as the old /scratch (note that the environment variable is $SCRATCH2). New policies for /scratch2 are being set, but for now each user is limited to 10TB and 1 million files. /home is unscathed.
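For GPC users switching over, a job script that previously wrote under $SCRATCH can typically be repointed at the new filesystem just by swapping the variable. A minimal sketch (the run-directory name and script body are illustrative, not an official template; the fallback path exists only so the sketch runs outside GPC):

```shell
#!/bin/bash
# Illustrative: write job output under the new /scratch2 filesystem
# via $SCRATCH2, the variable mentioned in the notice above.
# Fall back to a local directory if $SCRATCH2 is unset (hypothetical default).
RUNDIR="${SCRATCH2:-/tmp/scratch2-demo}/myrun"

mkdir -p "$RUNDIR"
cd "$RUNDIR"

# Your application would run here; we just record where the job ran.
echo "running in $RUNDIR" > run.log
cat run.log
```

Keeping the location in a single variable like this means the same script works again unchanged if and when the original /scratch is recovered.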

Sun Aug 11 15:35:32 - We are implementing a contingency plan for access by GPC users; it should be available within a few hours. There will be a new /scratch2 filesystem that can be used for submitting and running jobs. TCS users may have to wait another day for a fix (it is technically impossible to mount the new /scratch2 on the TCS). Unfortunately, nobody will be able to access the original /scratch or /project space, and the timeline for attempting to fix and recover those filesystems is virtually impossible to judge: we have to deal with new problems as they crop up, and there is no way to know how many problems lie ahead.


(Previous messages)