Difference between revisions of "Oldwiki.scinet.utoronto.ca:System Alerts"

From oldwiki.scinet.utoronto.ca
m (Reverted edits by Rzon (talk) to last revision by Northrup)
(Undo revision 8832 by Rzon (talk))
Line 27: Line 27:
 
|-
 
|-
 
|[[File:up.png|up|link=P7 Linux Cluster]][[P7 Linux Cluster|P7]]
 
|[[File:up.png|up|link=P7 Linux Cluster]][[P7 Linux Cluster|P7]]
|[[File:up.png|up|link=P8]][[P8]]
+
|[[File:up50.png|one P8 node up, one down|link=P8]][[P8]]
 
|[[File:up.png|up|link=Knights Landing]][[Knights Landing|KNL]]
 
|[[File:up.png|up|link=Knights Landing]][[Knights Landing|KNL]]
 
|[[File:up.png|up|link=Visualization Nodes]][[Visualization Nodes|Viz]]
 
|[[File:up.png|up|link=Visualization Nodes]][[Visualization Nodes|Viz]]
Line 33: Line 33:
 
|}
 
|}
  
 +
Fri 31 March, 2017, 10:30 PM: Power 7 system is back.
  
<b> Mon Mar 20 20:50:00 EDT 2017</b> File system has recovered.
+
Fri 31 March, 2017, 8:44 PM: Power 7 system is down. We are investigating. The p8t02 power8 node is also unreachable.
 
 
<b> Mon Mar 20 14:56:05 EDT 2017</b> Problems with IB fabric or the scratch3 & project3 file systems. We are investigating.
 
 
 
<b> Tue Mar 15 18:00:00 EST 2017</b> Systems are back online and fully operational.
 
 
 
<b> Tue Mar 15 16:31:39 EST 2017</b> Power glitch at data center. Compute nodes went down; we are bringing them back up.
 
 
 
<b> Sun Mar  5 14:34:11 EST 2017</b> Globus access to HPSS has been re-enabled.
 
 
 
<b> Thu Mar  2 9:29:14 EST 2017 </b> GPC jobs are running again.
 
 
 
<b>Thu Mar  2 01:54:57 EST 2017</b> The scratch file system went down earlier and most GPC jobs were killed. New GPC jobs are on hold until the disk check finishes in the morning.
 
 
 
<b>Tue Feb 28 2017 16:00:00 EST</b> The transfer of users' files from the old scratch system to the new scratch system has been completed.  The new scratch folders are logically in the same place as before, i.e. /scratch/G/GROUP/USER.  Your $SCRATCH environment variable will point to this location when you log in.  The project folders have also been moved in the same way. Compute jobs have been released and are starting to run. Let us know if you have any concerns. Thank you for your patience.
 
 
 
<b>Tue Feb 28 2017 10:02:45 EST</b> It could take a few more hours for the scratch migration to finish. We still have a dozen or so users to go. Please check this page from time to time for updates.
 
 
 
<b>Mon Feb 27 2017 10:00:00 EST</b> The old scratch was 99% full. Given the current incident of scratch getting unmounted everywhere, we had little choice but to start the transition to the new scratch file system now, instead of the gradual roll-out that we had planned earlier.
 
 
 
We estimate the transition to the new scratch will take roughly one day, but since we want all users' data on the old scratch system to be available in the new scratch (at the same logical location), the exact duration of the transition depends on the amount of new data to be transferred over.
 
 
 
In the meantime, no jobs will start running on the GPC, Sandy, Gravity or P7. 
 
 
 
In addition, $SCRATCH will not be accessible to users during the transition, but you can log in to the login and devel nodes. $HOME is not affected.
 
 
 
The current scratch system issue and the scratch transition no longer affect the BGQ or TCS (although running jobs on TCS may have stopped this morning), because BGQ and TCS have their own separate scratch file systems. They also do not affect groups whose scratch space is on /scratch2.
 
 
 
<b>Mon Feb 27 2017 7:20:00 EST</b> Scratch file system is down. We are investigating.
 
 
 
<b>Wed Feb 22 2017 16:17:00 EST</b> Globus access to HPSS is currently not operational.  We hope to have a resolution for this soon.
 
  
 
<!-- [https://support.scinet.utoronto.ca/wiki/index.php/Previous_messages:]  -->
 
<!-- [https://support.scinet.utoronto.ca/wiki/index.php/Previous_messages:]  -->

Revision as of 23:03, 31 March 2017

System Status

GPC: up | TCS: up | Sandy: up | Gravity: up | BGQ: up | File System: up
P7: up | P8: one P8 node up, one down | KNL: up | Viz: up | HPSS: up

Fri 31 March, 2017, 10:30 PM: Power 7 system is back.

Fri 31 March, 2017, 8:44 PM: Power 7 system is down. We are investigating. The p8t02 power8 node is also unreachable.