Difference between revisions of "Oldwiki.scinet.utoronto.ca:System Alerts"

From oldwiki.scinet.utoronto.ca
Line 4:

Thu Dec 8 16:43:47 EST 2011

--------------------

We continue to experience random outages of the system. Network problems are the latest suspect. All/most GPC jobs died at around 2:40pm today.

--------------------

Revision as of 20:51, 8 December 2011

System Status: UP

We appear to have stabilized the system. Please let us know if there are any problems.

Thu Dec 8 16:43:47 EST 2011



We are still encountering problems resulting from the transition to CentOS 6. While we had tested this operating system on a subset of nodes, there are problems when running at large scale, i.e. with almost 4,000 nodes in the GPC cluster.

Please bear with us as we try to fix things. Symptoms include slow (or disappearing) file systems, sluggish nodes, and general network problems. We'll inform users when we've solved this. In the meantime, please check this space regularly for updates.

Some of the known issues (and workarounds) are listed here.

Thanks for your patience and understanding!

The SciNet Team.


The GPC was transitioned to CentOS 6 on Monday, December 5, 2011. This should not have affected running jobs, but the scratch and home file systems were unexpectedly unmounted on Monday afternoon, killing most jobs. Please resubmit.

Let us know if you encounter unexpected behavior due to the transition.

Last updated: Thu Dec 8 15:06:31 EST 2011



(Previous messages)