Difference between revisions of "Oldwiki.scinet.utoronto.ca:System Alerts"

From oldwiki.scinet.utoronto.ca
Line 19: Line 19:
 
  -->
 
  -->
 
{|  
 
{|  
|[[File:up.png| down|link=GPC Quickstart]][[GPC Quickstart|GPC]]
+
|[[File:up.png| up|link=GPC Quickstart]][[GPC Quickstart|GPC]]
|[[File:up.png| down|link=TCS Quickstart]][[TCS Quickstart|TCS]]
+
|[[File:up.png| up|link=TCS Quickstart]][[TCS Quickstart|TCS]]
|[[File:up.png| down|link=Sandy]][[Sandy]]
+
|[[File:up.png| up|link=Sandy]][[Sandy]]
|[[File:up.png| down|link=GPU Devel Nodes]][[GPU Devel Nodes|ARC]]
+
|[[File:up.png| up|link=GPU Devel Nodes]][[GPU Devel Nodes|ARC]]
|[[File:up75.png| scratch file system went down on subset of nodes]]File System
+
|[[File:up.png| up]]File System
 
|-
 
|-
 
|[[File:up.png| up|link=Gravity]][[Gravity]]
 
|[[File:up.png| up|link=Gravity]][[Gravity]]
|[[File:up.png| down|link=P7 Linux Cluster]][[P7 Linux Cluster|P7]]
+
|[[File:up.png| up|link=P7 Linux Cluster]][[P7 Linux Cluster|P7]]
 
|[[File:up.png| up|link=BGQ]][[BGQ]]
 
|[[File:up.png| up|link=BGQ]][[BGQ]]
 
|[[File:down.png| down|link=HPSS]][[HPSS]]
 
|[[File:down.png| down|link=HPSS]][[HPSS]]

Revision as of 10:11, 29 September 2015

System Status

GPC: up | TCS: up | Sandy: up | ARC: up | File System: up
Gravity: up | P7: up | BGQ: up | HPSS: down

Mon Sep 28 15:18:00: A relatively small portion of nodes (still about 250) lost connection to the $SCRATCH file system. Jobs running on those nodes likely failed. The file system is back to normal.

Thu Sep 24 00:52:29 EDT 2015: BGQ and GPC are up now.

Wed Sep 23 23:41:40 EDT 2015: We are now booting GPC after a file system issue. GPC should be up within an hour.

Wed 23 Sep 2015 22:23:51: Cooling has been restored. A new networking issue is causing problems with bringing up systems. More later.

Wed 23 Sep 2015 18:33:09: **DELAY** encountered. A control board failed on the chiller and needs to be replaced. We expect to have cooling restored by 10 PM and will then bring up the systems.

On Wednesday, September 23, 2015, downtime has been scheduled for maintenance and improvements on the SciNet data centre cooling system. Significant work will be done to improve the serviceability and durability of the system. This requires a shutdown for most of the day, starting in the morning around 6:00 am. All login sessions and jobs were killed at that time.

Services are expected to be available again around 8:00 pm today. Check back here (on the SciNet wiki's front page) for updates.