Oldwiki.scinet.utoronto.ca:System Alerts


System Status

GPC: scratch file system down
TCS: scratch file system down
ARC: scratch file system down
P7: scratch file system down
BGQ: up
HPSS: scratch file system down


Sat Aug 10 20:39:34 - Work continues. Disks and NSDs have been powered up and the filesystem is attempting to read the disks. Problems with individual disks are being fixed manually as they are exposed.

Sat Aug 10, 17:03 - Still no resolution to the problem. SciNet staff continue to work onsite, in consultation with the storage vendor.

Sat Aug 10 10:38:46 - Storage vendor still working on a solution with SciNet staff onsite. There are 2,000 hard drives and the controller is confused about the location and ID of some of them. Getting a single one wrong will result in data loss, so we are proceeding cautiously. Only /scratch and /project are affected. /home is accessible, but GPC and TCS cannot be used as they rely on /scratch. The BGQ system is still usable because it has a separate filesystem.

Sat Aug 10 07:05:07 - staff and vendor tech support still on-site. New action plan from storage vendor is being tested.

Sat Aug 10 00:28:54 - Vendor has escalated to a yet higher level of support but there is still no solution. People will remain on-site for a while longer to see what the new support team recommends.

Fri Aug 9 22:03:48 - Staff and vendor technician remain on-site. Storage vendor has escalated the problem to critical, but the suggested fixes have not yet resolved it. BGQ remains up because it has a separate filesystem.

Fri Aug 9 15:32 - /scratch and /project are down. Login and home directories are ok, but no jobs can run, and most of those running will likely die if/when they need to do I/O.

Fri Aug 9 15:25 - File system problems. Scratch is unmounted. Jobs are likely dying. We are working on it.

Thu Aug 8 13:22 - most systems are back up

Thu Aug 8 11:18:45 - problems with storage hardware. Trying to resolve with vendor

Thu Aug 8 08:14:01 - Cooling has been restored. Starting to recover systems.

Large voltage drop at site knocked out cooling system at 05:58 today. Staff en route to site.


Last update: Thu 8 Aug 2013 06:12:28 EDT


(Previous messages)