Previous messages:

From oldwiki.scinet.utoronto.ca
Revision as of 17:10, 15 August 2011 by Pinto (talk | contribs)

These are old messages, the most recent system status can be found on the main page.

You can also check our twitter feed, @SciNetHPC, for updates.

Tue Aug 9, 12:31:03 EDT 2011 If you had jobs running when the file system failed this morning (10:30AM, 9 Aug), please check their status, as many jobs died.

Mon Jul 18, 15:04:17 EDT 2011 Datamovers and large-memory nodes are up and running. The Intel license server is functional again. Purging of /scratch will happen next Friday, July 22.

Tue Jun 14, 14:05 EDT 2011 Many SciNet staff are at the HPCS meeting this week, so, apologies in advance, we may not respond as quickly as usual to email.

Sun May 22 12:15:29 EDT 2011 Cooling pump failure at data centre resulted in emergency shutdown last night. Systems are back to normal.

Tue May 17 17:57:41 EDT 2011 Cause of the last two chiller outages has been fixed. All available systems back online.

Thu May 12 12:23:35 EDT 2011 NOTE: There is some UofT network reconfiguration and maintenance on Friday morning May 13 07:45-08:15 that most likely will disrupt external network connections and data transfers. Local running jobs should not be affected.

Thu May 5 19:17:37 EDT 2011 Chiller has been fixed. Systems back to normal.


Wed May 4 22:29:00 EDT 2011 Cooling failure at data centre resulted in an emergency shutdown last night. Chiller replacement part won't arrive until noon tomorrow and then needs installation and testing. Roughly 40% of the GPC is currently available. We're using free-cooling instead of the chiller and therefore can't risk turning on more nodes. NOTE - we plan to keep these systems up tomorrow when parts are replaced but if there are complications they may be shutdown without warning. You can check our twitter feed, @SciNetHPC, for updates.


Wed May 4 16:43:34 EDT 2011 Cooling failure at data centre resulted in an emergency shutdown last night. Chiller replacement part won't arrive until noon tomorrow and then needs installation and testing. Roughly 1/4 of the GPC (evenly split between IB and gigE) is currently available. We're using free-cooling instead of the chiller and may increase the number of nodes depending on how the room responds. NOTE - we plan to keep these systems up tomorrow when parts are replaced but if there are complications they may be shutdown without warning. You can check our twitter feed, @SciNetHPC, for updates.


Tues 3 May 2011 20:30 Cooling failure at data centre has resulted in emergency shutdown. This space will be updated when more information is available.

Thu 27 Apr 2011 20:21 EDT: System maintenance completed. All systems are operational. TCS is back to normal too, with tcs01 and tcs02 as the devel nodes.

Sun 17 Apr 2011 18:34:10 EDT Problem with building power supply at 2AM on Saturday took down the cooling system. Power has been restored, but new problems have arisen with the TCS water units.

We have been able to partially restore the TCS. The usual development nodes, tcs01 and tcs02, are not available. However, we have created a temporary workaround using node tcs03 (tcs-f02n01), where you can login, compile, and submit jobs. Not all the compute nodes are up, but we are working on that. Please let us know if there are any problems with this setup.

Sat Apr 16 13:01:14 EDT 2011 Problem with building power supply at 2AM today took down the cooling system. Waiting for the electrical utility to check the lines.

Wed Mar 23 19:06:21 EDT 2011: A VSD controller failed in the cooling system. A temporary solution will allow systems to come back online this evening (likely by 8PM). There will likely need to be downtime in the next two days in order to replace the controller.

Wed Mar 23 17:00 EDT 2011 Down due to cooling failure; being worked on. A large electrical surge took out datacentre fuses. No estimated time to resolution yet.

Mon Mar 7 14:39:59 EST 2011 NB - the SciNet network connection will be cut briefly (5 mins or less) at about 8AM on Tues, 8 March in order to test the new UofT gateway router connection

Thu Feb 24 13:21:38 EST 2011 Systems were shutdown at 0830 today for scheduled repair of a leak in the cooling system. GPC is online. TCS is expected to be back online by 4PM.

Tue Feb 22 9:35:31 EST 2011: All systems will be shut down at 0830 on Thursday, 24 Feb in order to repair a small leak in the cooling system. Systems will be back online by afternoon or evening. Check back here for updates during the day.

Wed Feb 9 10:27:13 EST 2011: There was a cooling system failure last night, causing all running and queued jobs to be lost. All systems are back up.

Sat Feb 5 20:33:34 EST 2011: Power outage in Vaughan. TCS jobs all died. Some GPC jobs have survived.

Sat Feb 5 18:51:00 EST 2011: We just had a hiccup with the cooling tower. We suspect it's a power-grid issue, but are investigating the situation.

Thu Jan 20 18:25:54 EST 2011: Maintenance complete. Systems back online.

Wed Jan 19 07:22:32 EST 2011: Systems offline for scheduled maintenance of chiller. Expect to be back on-line in evening of Thurs, 20 Jan. Check here for updates and revised estimates of timing.

Fri Dec 24 15:40:12 EST 2010: We experienced a failure of the /scratch filesystem, with a corrupt quota mechanism, on the afternoon of Friday December 24. This resulted in a general gpfs failure that required the system to be shutdown and rebooted. Consequently all running jobs were lost. We apologize for the disruption this has caused, as well as the lost work.

Happy holidays to all! The SciNet team.

Note: From Dec 22, 2010 to Jan 2, 2011, the SciNet offices are officially closed, but the system will be up and running and we will keep an eye out for emergencies.

Mon Dec 6 17:42:12 EST 2010: File system problems appear to have been resolved.

Mon Dec 6 16:39:55 EST 2010: File system issues on both TCS and GPC. Investigating.

Fri Nov 26 18:30:55 EST 2010: All systems are up and accepting user jobs. There was a province-wide power issue at 1:30 on the morning of Friday, Nov 26, which caused the chiller to fail and all systems to shut down. All jobs running or queued at the time were killed as a result. The system is accessible now and you can resubmit your jobs.

Fri Nov 26 6:15 EST 2010: The data centre had a heating issue at around 1:30AM Fri Nov 26, and systems are down right now. We are investigating the cause of the problem and working to bring the systems back up and running.

Fri Nov 5 15:26 EDT 2010: Scratch is down; we are working on it.

Tue Oct 26 10:32:22 EDT 2010: scratch is hung, we're investigating.

Fri Sep 24 23:02:41 EDT 2010: Systems are back up. Please report any problems to <support at scinet dot utoronto dot ca>

Fri Sep 24 19:55:02 EDT 2010: Chiller restarted. Systems should be back online this evening - likely 9-10PM. Widespread power glitch seems to have confused one of the Variable Speed Drives and control system was unable to restart it automatically.

Fri Sep 24 16:44:39 EDT 2010: The systems were unexpectedly and automatically shut down, due to a failure of the chiller. We are investigating and will bring the systems back up as soon as possible.

Mon Sep 13 15:39:05 EDT 2010: The systems were unexpectedly shut down, automatically, due to a failure of the chiller. We are investigating and will bring the systems back up as soon as possible.

Fri Sep 10 15:08:53 EDT 2010: GPFS upgraded to 3.3.0.6; new scratch quotas applied; check yours with /scinet/gpc/bin/diskUsage.

Mon Aug 16 13:12:08 EDT 2010: Filesystems are accessible and login is normal.

Mon Aug 16 12:38:45 EDT 2010: Login access to SciNet is failing due to /home filesystem problems. We are actively working on a solution.

Mon Aug 9 16:10:03 EDT 2010: The file system is slow and scheduler cannot be reached. We are working on the problem.

Sat Aug 7 11:42:50 EDT 2010: System status: Normal.

Sat Aug 7 10:45:52 EDT 2010: Problems with the scheduler not responding on GPC. Should be fixed within an hour.

Wed Aug 4 14:03:59 EDT 2010: The GPC scheduler is currently having a filesystem issue, which means you may have difficulty submitting jobs and monitoring the queue. We are working on the issue now.

Fri Jul 23 17:50:00 EDT 2010: Systems are up again after testing and maintenance.

Fri Jul 23 16:14:02 EDT 2010: Systems are down for testing and maintenance. Expect to be back up about 9PM this evening.

Tue Jul 20 17:40:45 EDT 2010: Systems are back. You may resubmit jobs.

Tue Jul 20 15:39:37 EDT 2010: Most GPC jobs died as well. The ones that appear to be running are likely in an unknown state, so they will be killed in order to ensure they do not produce bogus results. The machines should be up shortly. Thanks for your patience and understanding. The SciNet team.

Tue Jul 20 15:19:30 EDT 2010: File systems down. We are looking at the problem now. All jobs on TCS have died. We'll inform you when the machine is available for use again. Please log out from the TCS.

Tue Jul 20 14:26:10 EDT 2010: Scratch file system down. We are looking at the problem now.

Fri Jul 16 10:59:27 EDT 2010: Systems normal.

Sun Jul 11 13:08:02 EDT 2010: All jobs running at ~3AM this morning almost certainly failed and/or were killed about 11AM. A hardware failure has reduced /home and /scratch performance by a factor of 2 but should be corrected tomorrow.

Sun Jul 11 09:56:27 EDT 2010: /scratch was inaccessible as of about 3AM.

Fri Jul 9 16:18:18 EDT 2010: /scratch is accessible again.

Fri Jul 9 15:38:43 EDT 2010: New trouble with the filesystems. We are working to fix things.

Fri Jul 9 11:28:00 EDT 2010: The /scratch filesystem died at about 3 AM on Fri Jul 9, and all jobs running at the time died.