Difference between revisions of "Oldwiki.scinet.utoronto.ca:System Alerts"

From oldwiki.scinet.utoronto.ca

Revision as of 06:06, 23 November 2013

System Status

GPC: up, TCS: down, Sandy: up, ARC: up
Gravity: up, P7: up, BGQ: up, HPSS: up

Sat Nov 23: 0448: Power glitch at site at 0342. Access to TCS has been lost. At least one IB switch appears to be misbehaving as a result, which has killed jobs touching one rack of the GPC. We will investigate more closely later this morning.

Fri Nov 22:

One of our IB fabric managers died last night. As a result, many nodes, including the GPFS managers, could not communicate properly, and many nodes had their GPFS unmounted. If your jobs crashed, please resubmit them.

Last updated: Fri Nov 22 14:49

(Previous messages)