Difference between revisions of "Previous messages:"

From oldwiki.scinet.utoronto.ca
Jump to navigation Jump to search
Line 1: Line 1:
 
'''''Note that these are old messages, the most recent system status can be found on the [[SciNet User Support Library|main page]].'''''
 
'''''Note that these are old messages, the most recent system status can be found on the [[SciNet User Support Library|main page]].'''''
 +
 +
Wed Mar 23 17:00 EST 2011 Down due to cooling failure.  Being worked on.  Large electrical surge took out datacentre fuses.  No estimated time to solution yet.
  
 
Mon Mar  7 14:39:59 EST 2011 NB - the SciNet network connection will be cut briefly (5 mins or less) at about 8AM on Tues, 8 March in order to test the new UofT gateway router connection
 
Mon Mar  7 14:39:59 EST 2011 NB - the SciNet network connection will be cut briefly (5 mins or less) at about 8AM on Tues, 8 March in order to test the new UofT gateway router connection

Revision as of 18:47, 23 March 2011

Note that these are old messages, the most recent system status can be found on the main page.

Wed Mar 23 17:00 EST 2011 Down due to cooling failure. Being worked on. Large electrical surge took out datacentre fuses. No estimated time to solution yet.

Mon Mar 7 14:39:59 EST 2011 NB - the SciNet network connection will be cut briefly (5 mins or less) at about 8AM on Tues, 8 March in order to test the new UofT gateway router connection

Thu Feb 24 13:21:38 EST 2011 Systems were shutdown at 0830 today for scheduled repair of a leak in the cooling system.GPC is online. TCS expected to be back online by 4PM

Tue Feb 24 9:35:31 EST 2011: All systems will be shutdown at:0830 Thursday, 24 Feb in order to repair a small leak in the cooling system. Systems will be back online by afternoon or evening. Check back here for updates during the day.

Wed Feb 9 10:27:13 EST 2011: There was a cooling system failure last night, causing all running and queued jobs to be lost. All systems are back up.

Sat Feb 5 20:33:34 EST 2011: Power outage in Vaughan. TCS jobs all died. Some GPC jobs have survived.

Sat Feb 5 18:51:00 EST 2011: We just had a hiccup with the cooling tower. We suspect it's a power-grid issue, but are investigating the situation.

Thu Jan 20 18:25:54 EST 2011: Maintenance complete. Systems back online.

Wed Jan 19 07:22:32 EST 2011: Systems offline for scheduled maintenance of chiller. Expect to be back on-line in evening of Thurs, 20 Jan. Check here for updates and revised estimates of timing.

Fri Dec 24 15:40:12 EST 2010: We experienced a failure of the /scratch filesystem, with a corrupt quota mechanism, on the afternoon of Friday December 24. This resulted in a general gpfs failure that required the system to be shutdown and rebooted. Consequently all running jobs were lost. We apologize for the disruption this has caused, as well as the lost work.

Happy holidays to all! The SciNet team.

Note: From Dec 22, 2010 to Jan 2, 2011, the SciNet offices are officially closed, but the system will be up and running and we will keep an eye out for emergencies.

Mon Dec 6 17:42:12 EST 2010: File system problems appear to have been resolved.

Mon Dec 6 16:39:55 EST 2010: File system issues on both TCS and GPC. Investigating.

Fri Nov 26 18:30:55 EST 2010: All systems are up and accepting user jobs. There was a province-wide power issue at 1:30 the morning of Friday Nov 26, which caused the chiller to fail, and all systems to shutdown. All jobs running or queued at the time were killed as a result. The system is accessible now and you can resubmit your jobs.

Fri Nov 26 6:15 EST 2010: The data center had some heating issue at around 1:30am Fri Nov 26 and system are down right now. We are investigating the cause of the problem and trying to fix the issue so that we could bring the system back up and running.

Fri Nov 5 15:26 EDT 2010: Scratch is down; we are working on it.

Tue Oct 26 10:32:22 EDT 2010: scratch is hung, we're investigating.

Fri Sep 24 23:02:41 EDT 2010: Systems are back up. Please report any problems to <support at scinet dot utoronto dot ca>

Fri Sep 24 19:55:02 EDT 2010: Chiller restarted. Systems should be back online this evening - likely 9-10PM. Widespread power glitch seems to have confused one of the Variable Speed Drives and control system was unable to restart it automatically.

Fri Sep 24 16:44:39 EDT 2010: The systems were unexpectedly and automatically shut down, due to a failure of the chiller. We are investigating and will bring the systems back up as soon as possible.

Mon Sep 13 15:39:05 EDT 2010: The systems were unexpectedly shut down, automatically, due to a failure of the chiller. We are investigating and will bring the systems back up as soon as possible.

Fri Sep 10 15:08:53 EDT 2010: GPFS upgraded to 3.3.0.6; new scratch quotas applied; check yours with /scinet/gpc/bin/diskUsage.

Mon Aug 16 13:12:08 EDT 2010: Filesystems are accessible and login is normal.

Mon Aug 16 12:38:45 EDT 2010: Login access to SciNet is failing due to /home filesystem problems. We are actively working on a solution.

Mon Aug 9 16:10:03 EDT 2010: The file system is slow and scheduler cannot be reached. We are working on the problem.

Sat Aug 7 11:42:50 EDT 2010: System status: Normal.

Sat Aug 7 10:45:52 EDT 2010: Problems with scheduler not responding on GPC. Should be fixed within an hour

Wed Aug 4 14:03:59 EDT 2010: GPC scheduler currently having filesystem issue which means you may have difficulty submitting jobs and monitoring the queue. We are working on the issue now.

Fri Jul 23 17:50:00 EDT 2010: Systems are up again after testing and maintenance.

Fri Jul 23 16:14:02 EDT 2010: Systems are down for testing and maintenance. Expect to be back up about 9PM this evening.

Tue Jul 20 17:40:45 EDT 2010: Systems are back. You may resubmit jobs.

Tue Jul 20 15:39:37 EDT 2010: Most GPC jobs died as well. The ones that appear to be running are likely in an unknown state, so will be killed in order to ensure they do not produce bogus results. The machines should be up shortly. Thanks for your patience and understanding. The SciNet team.

Tue Jul 20 15:19:30 EDT 2010: File systems down. We are looking at the problem now. All jobs on TCS have died. We'll inform you when the machine is available for use again. Please log out from the TCS.

Tue Jul 20 14:26:10 EDT 2010: Scratch file system down. We are looking at the problem now.

Fri Jul 16 10:59:27 EDT 2010: Systems normal.

Sun Jul 11 13:08:02 EDT 2010: All jobs running at ~3AM this morning almost certainly failed and/or were killed about 11AM. A hardware failure has reduced /home and /scratch performance by a factor of 2 but should be corrected tomorrow.

Sun Jul 11 09:56:27 EDT 2010: /scratch was inaccessible as of about 3AM

Fri Jul 9 16:18:18 EDT 2010: /scratch is accessible again

Fri Jul 9 15:38:43 EDT 2010: New trouble with the filesystems. We are working to fix things

Fri Jul 9 11:28:00 EDT 2010: The /scratch filesystem died at about 3 AM on Fri Jul 9, and all jobs running at the time died.