Difference between revisions of "Oldwiki.scinet.utoronto.ca:System Alerts"

From oldwiki.scinet.utoronto.ca
  
 
Sat Aug 10 07:05:07 - staff and vendor tech support still on-site. New action plan from storage vendor is being tested.
 
 
Sat Aug 10 00:28:54 - The vendor has escalated to an even higher level of support, but there is still no solution. People will remain on-site for a while longer to see what the new support team recommends.
 
 
Fri Aug 9 22:03:48 - Staff and a vendor technician remain on-site. The storage vendor has escalated the problem to critical, but the suggested fixes have not yet resolved it. BGQ remains up because it has a separate filesystem.
 
 
Fri Aug 9 15:32 - /scratch and /project are down. Login and home directories are OK, but no jobs can run, and most of those running will likely die if/when they need to do I/O.
 
 
Fri Aug 9 15:25 - File system problems. Scratch is unmounted. Jobs are likely dying. We are working on it.
 
 
 
  
 
([[Previous_messages:|Previous messages]])
 

Revision as of 15:45, 11 August 2013

System Status

GPC: scratch file system down
TCS: scratch file system down
ARC: scratch file system down
P7: scratch file system down
BGQ: up
HPSS: scratch file system down

Sun Aug 11 15:35:32 - We are implementing a contingency plan for access by GPC users, which should be available within a couple of hours. There will be a new scratch2 filesystem that can be used for submitting and running jobs. TCS users may have to wait another day for a fix (it is technically impossible to mount the new /scratch2 on the TCS). Unfortunately, nobody will be able to access the original /scratch or /project space, and the timeline for attempting to fix and recover those filesystems is virtually impossible to judge (we have to deal with new problems as they crop up, and there is no way to know how many lie ahead).

Sun Aug 11 09:25:41 - Work resumed before 8 AM this morning. We are still correcting disk errors as they surface so that we can reach the stage where the OS can actually mount the filesystem.

Sat Aug 10 22:31:45 - Work is stopping for this evening. SciNet and vendor staff have worked continuously for more than 30 hours on this problem; there is no point in risking a mistake now. Work will continue tomorrow.

Sat Aug 10 20:39:34 - Work continues. The disks and NSDs have been powered up, and the filesystem is attempting to read the disks. Problems with individual disks are being fixed manually as they are exposed.

Sat Aug 10 17:03 - Still no resolution to the problem. SciNet staff continue to work on-site, in consultation with the storage vendor.

Sat Aug 10 10:38:46 - The storage vendor is still working on a solution, with SciNet staff on-site. There are 2,000 hard drives, and the controller is confused about the location and ID of some of them. Getting a single one wrong would result in data loss, so we are proceeding cautiously. Only /scratch and /project are affected. /home is accessible, but the GPC and TCS cannot be used because they rely on /scratch. The BGQ system is still usable because it has a separate filesystem.

(Previous messages)