Difference between revisions of "Oldwiki.scinet.utoronto.ca:System Alerts"

From oldwiki.scinet.utoronto.ca
== System Status: <span style="color:#dd1100">'''DOWN'''</span>==
== System Status: <span style="color:#00bb11">'''UP'''</span>==
 
  
Thu May 10 07:31:54 EDT 2012:  '''Systems expected to be available by 2PM today (10 May)'''. Overnight testing took longer than expected.
Thu May 10 12:56:54 EDT 2012:  '''Systems now available for login'''. Overnight testing took longer than expected.
  
 
Testing is complete; the system is available for login and will be fully up and running shortly.  Much of the GPC comes out of warranty coverage this month, and the thorough pre-expiration shakedown performed during this downtime uncovered hardware or configuration issues on over 60 GPC nodes, including faulty memory DIMMs, network cards, and power supplies.  These issues have either been fixed already or are slated to be fixed, with the offending nodes taken offline in the meantime.  Testing also exercised the new networking infrastructure at very large scale; several minor issues were identified and will be addressed in the near future.
The announced 8/9 May SciNet '''shutdown''' has started. This shutdown is intended for final configuration of the GPC's changeover to full InfiniBand, for some back-end maintenance, and to test the computational and file system performance of the TCS and GPC.
 
 
 
Systems went down at 9 am on May 8; all login sessions and jobs were killed at that time. The system should be available again tomorrow evening. Check here on Wednesday for updates.
 
 
 
Tue 8 May 2012 9:30:46 EDT
 
  
 
([[Previous_messages:|Previous messages]])
 

Revision as of 13:01, 10 May 2012
