Difference between revisions of "Oldwiki.scinet.utoronto.ca:System Alerts"

Revision as of 22:34, 10 August 2013

System Status

GPC TCS ARC P7 BGQ HPSS

Sat Aug 10 22:31:45 - work stopping for this evening. SciNet and vendor staff have worked continuously for more than 30 hrs on this problem. No point risking making a mistake now. Will continue tomorrow

Sat Aug 10 20:39:34 - work continues. Disks and NSDs have been powered-up and the filesystem is attempting to read the disks. Problems with individual disks are being fixed manually as they are exposed

Sat Aug 10, 17:03 - Still no resolution to the problem. SciNet staff continue to work onsite, in consultation with the storage vendor.

Sat Aug 10 10:38:46 - storage vendor still working on solution with SciNet staff onsite. There are 2,000 hard drives and the controller is confused about location and ID of some of them. Getting a single one wrong will result in data loss so we are proceeding cautiously. Only /scratch and /project are affected. /home is accessible but GPC and TCS can not be used as they rely on /scratch. BGQ system is still usable because of separate filesystem

Sat Aug 10 07:05:07 - staff and vendor tech support still on-site. New action plan from storage vendor is being tested.

Sat Aug 10 00:28:54 - Vendor has escalated to yet a higher level of support but still no solution. People will remain on-site for a while longer to see what the new support team recommends.

Fri Aug 9 22:03:48 - Staff and vendor technician remain on-site. Storage vendor has escalated problem to critical but suggested fixes have not yet resolved the problem. BGQ remains up because it has separate filesystem.

Fri Aug 9 15 32 - /scratch and /project are down. Login and home directories are ok, but no jobs can run, and most of those running will likely die if/when they need to do I/O.

Fri Aug 9 15:25 - File system problems. Scratch is unmounted. Jobs are likely dying. We are working on it.

Thu Aug 8 13:22 - most systems are back up

Thu Aug 8 11:18:45 - problems with storage hardware. Trying to resolve with vendor

Thu Aug 8 08:14:01 Cooling has been restored. Starting to recover systems.

Large voltage drop at site knocked-out cooling system at 0558 today. Staff enroute to site.

Last update: Thu 8 Aug 2013 06:12:28 EDT

(Previous messages)

Difference between revisions of "Oldwiki.scinet.utoronto.ca:System Alerts"

Revision as of 22:34, 10 August 2013

System Status

Navigation menu

Search

@@ Line 14: / Line 14: @@
 [[File:down.png|scratch file system down|link=HPSS]]HPSS
+Sat Aug 10 22:31:45  - work stopping for this evening. SciNet and vendor staff have worked continuously for more than 30 hrs on this problem. No point risking making a mistake now. Will continue tomorrow
 Sat Aug 10 20:39:34 - work continues. Disks and NSDs have been powered-up and the filesystem is attempting to read the disks. Problems with individual disks are being fixed manually as they are exposed