Oldwiki.scinet.utoronto.ca:System Alerts

From oldwiki.scinet.utoronto.ca

== System Status ==

<!--
  Notes for updating the system status:

  -  When removing system status entries, please archive them to:

     http://wiki.scinethpc.ca/wiki/index.php/Previous_messages:

     (yes, the trailing colon is part of the url)

  -  The 'status circles' can be one of the following files:

     down.png  for down
     up25.png  for 25% up
     up50.png  for 50% up
     up75.png  for 75% up
     up.png    for 100% up
-->
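As an example of how these pieces fit together, a status entry is one of the circle images, linked to the relevant system page, followed by the system's name. A minimal sketch in the same wikitext style as the entries on this page (the 75% reading is purely illustrative):

 [[File:up75.png|75% up|link=GPC Quickstart]]GPC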
 
{|
|[[File:up.png|up|link=https://docs.scinet.utoronto.ca/index.php/Main_Page]][https://docs.scinet.utoronto.ca Niagara]
|-
|[[File:up.png|up|link=BGQ]][[BGQ]]
|[[File:up.png|up|link=P7 Linux Cluster]][[P7 Linux Cluster|P7]]
|[[File:up.png|up|link=P8]][[P8]]
|-
|[[File:up.png|up|link=SOSCIP_GPU]][[SOSCIP_GPU|SGC]]
|[[File:up.png|up|link=Knights Landing]][[Knights Landing|KNL]]
|[[File:down.png|down|link=HPSS]][https://docs.scinet.utoronto.ca/index.php/HPSS HPSS]
|-
|[[File:up.png|up]]File System
|[[File:up.png|up]]External Network
|
|}

System status can now be found at [https://docs.scinet.utoronto.ca docs.scinet.utoronto.ca]
  
<b> Mon 23 Apr 2018 </b> GPC-compute is decommissioned; GPC-storage remains available until <font color=red><b>30 May 2018</b></font>.

<b> Thu 18 Apr 2018 </b> The Niagara system will undergo an upgrade to its InfiniBand network between 9am and 12pm. This should be transparent to users, but there is a chance of network interruption.

<b> Fri 13 Apr 2018 </b> The HPSS system will be down for a few hours on <b>Mon, Apr 16, 9AM</b> for hardware upgrades, in preparation for the eventual move to the Niagara side.

<b> Tue 10 Apr 2018 </b> Niagara is open to users.

<b> Wed 4 Apr 2018 </b> We are very close to the production launch of Niagara, the new system installed at SciNet. While the RAC allocation year officially starts today, April 4, 2018, Niagara is still undergoing some final tuning and software updates, so the plan is to open it officially to users next week.

All active GPC users will have their accounts, $HOME, and $PROJECT transferred to the new Niagara system. Those of you who are new to SciNet but received RAC allocations on Niagara will have your accounts created and ready for you to log in.

We are planning an extended [https://support.scinet.utoronto.ca/education/go.php/370/index.php Intro to SciNet/Niagara session], available in person at our office and webcast via Vidyo and possibly other means, on Wednesday April 11 at noon EST.

Older messages from the August 2013 storage outage, with the system status at the time:

[[File:down.png|scratch file system down|link=GPC Quickstart]]GPC
[[File:down.png|scratch file system down|link=TCS Quickstart]]TCS
[[File:down.png|scratch file system down|link=GPU Devel Nodes]]ARC
[[File:down.png|scratch file system down|link=P7 Linux Cluster]]P7
[[File:up.png|up|link=BGQ]]BGQ
[[File:down.png|scratch file system down|link=HPSS]]HPSS

Sat Aug 10 22:31:45 - Work is stopping for this evening. SciNet and vendor staff have worked continuously for more than 30 hours on this problem, and there is no point risking a mistake now. Work will continue tomorrow.

Sat Aug 10 20:39:34 - Work continues. Disks and NSDs have been powered up and the filesystem is attempting to read the disks. Problems with individual disks are being fixed manually as they are exposed.

Sat Aug 10 17:03 - Still no resolution to the problem. SciNet staff continue to work onsite, in consultation with the storage vendor.

Sat Aug 10 10:38:46 - The storage vendor is still working on a solution with SciNet staff onsite. There are 2,000 hard drives, and the controller is confused about the location and ID of some of them. Getting a single one wrong would result in data loss, so we are proceeding cautiously. Only /scratch and /project are affected; /home is accessible, but GPC and TCS cannot be used as they rely on /scratch. The BGQ system is still usable because it has a separate filesystem.

Sat Aug 10 07:05:07 - Staff and vendor tech support are still on-site. A new action plan from the storage vendor is being tested.

Sat Aug 10 00:28:54 - The vendor has escalated to a yet higher level of support, but there is still no solution. People will remain on-site for a while longer to see what the new support team recommends.

Fri Aug 9 22:03:48 - Staff and a vendor technician remain on-site. The storage vendor has escalated the problem to critical, but the suggested fixes have not yet resolved it. BGQ remains up because it has a separate filesystem.

Fri Aug 9 15:32 - /scratch and /project are down. Login and home directories are ok, but no jobs can run, and most of those running will likely die if/when they need to do I/O.

Fri Aug 9 15:25 - File system problems. Scratch is unmounted. Jobs are likely dying. We are working on it.

Thu Aug 8 13:22 - Most systems are back up.

Thu Aug 8 11:18:45 - Problems with storage hardware. Trying to resolve with the vendor.

Thu Aug 8 08:14:01 - Cooling has been restored. Starting to recover systems.

A large voltage drop at the site knocked out the cooling system at 05:58 today. Staff are en route to the site.

([[Previous_messages:|Previous messages]])
<!-- [https://support.scinet.utoronto.ca/wiki/index.php/Previous_messages:] -->

Latest revision as of 14:23, 7 May 2018