Oldwiki.scinet.utoronto.ca:System Alerts

From oldwiki.scinet.utoronto.ca
Jump to navigation Jump to search

System Status

scratch file system downGPC scratch file system downTCS scratch file system downARC scratch file system downP7 upBGQ scratch file system downHPSS

Sun Aug 11 15:35:32 - We are implementing a contingency plan for access by GPC users. Should be available within a couple of hours. There will be a new scratch2 filesystem that can be used for submitting and running jobs. TCS users may have to wait another day for a fix (it is technically impossible to mount the new /scratch2 on the TCS). Unfortunately, nobody will be able to access the original /scratch or /project space and the timeline for attempting to fix and recover those filesystems is virtually impossible to judge (have to deal with new problems as they crop-up and there's no way to know how many problems lie ahead).

Sun Aug 11 09:25:41 - work resumed before 8AM this morning. Still correcting disk errors that surface so we can reach the stage where the OS can actually mount the filesystem

Sat Aug 10 22:31:45 - work stopping for this evening. SciNet and vendor staff have worked continuously for more than 30 hrs on this problem. No point risking making a mistake now. Will continue tomorrow

Sat Aug 10 20:39:34 - work continues. Disks and NSDs have been powered-up and the filesystem is attempting to read the disks. Problems with individual disks are being fixed manually as they are exposed

Sat Aug 10, 17:03 - Still no resolution to the problem. SciNet staff continue to work onsite, in consultation with the storage vendor.

Sat Aug 10 10:38:46 - storage vendor still working on solution with SciNet staff onsite. There are 2,000 hard drives and the controller is confused about location and ID of some of them. Getting a single one wrong will result in data loss so we are proceeding cautiously. Only /scratch and /project are affected. /home is accessible but GPC and TCS can not be used as they rely on /scratch. BGQ system is still usable because of separate filesystem

(Previous messages)