Previous messages:


These are old messages, the most recent system status can be found on the main page.

You can also check our twitter feed, @SciNetHPC, for updates.

Sun May 27 09:11:53 EDT 2012: /scratch became unmounted on all nodes at about 0700 this morning. The problem has been resolved and /scratch has been remounted everywhere.

Sun May 27 09:11:53 EDT 2012: /scratch became unmounted on all nodes. Working on a fix.

Thu May 10 15:17:54 EDT 2012: Systems are now up. Overnight testing took longer than expected but is now complete; the system is fully up and running. Much of the GPC comes out of warranty coverage this month, and the thorough pre-expiration shakedown provided by the tests during this downtime uncovered hardware or configuration issues with over 60 GPC nodes, including problems with memory DIMMs, network cards, and power supplies; these issues have been fixed or are slated to be fixed, with the offending nodes offlined in the meantime. Testing also closely examined the new networking infrastructure at very large scale, and several minor issues have been identified which will be addressed in the very near future.

Thu May 10 07:31:54 EDT 2012: Systems expected to be available by 2PM today (10 May). Overnight testing took longer than expected.

Tue 8 May 2012 9:30:46 EDT: The announced 8/9 May SciNet shutdown has started. This shutdown is intended for final configurations in the changeover to full infiniband for the GPC, for some back-end maintenance, and to test the computational and file system performance of the TCS and GPC. Systems went down at 9 am on May 8; all login sessions and jobs were killed at that time. The system should be available again tomorrow evening. Check here on Wednesday for updates.

Wed 2 May 2012 10:20:46 EDT: ANNOUNCEMENT: There will be a full SciNet shutdown from Tue May 8 to Wed May 9 for final configurations in the changeover to full infiniband for the GPC, for some back-end maintenance, and to test the computational and file system performance of the TCS and GPC. Systems will go down at 9 am on May 8; all login sessions and jobs will be killed at that time. The system should be available again in the evening of the next day. Check here on Wednesday for updates.

As noted before, see GPC_Quickstart for how to run MPI jobs on the GPC in light of the new InfiniBand network (the procedure mostly coincides with the old way, with fewer parameters).

Wed 24 Apr 2012 12:47:46 EDT: The Apr 19 upgrade of the GPC to a low-latency, high-bandwidth Infiniband network throughout the cluster is now reflected in (most of) the wiki. The appropriate way to request nodes in job scripts for the new setup (which will coincide with the old way for many users) is described on the GPC_Quickstart page.


Thu 19 Apr 2012 19:43:46 EDT:

The GPC network has been upgraded to a low-latency, high-bandwidth Infiniband network throughout the cluster. Several significant benefits over the old ethernet/infiniband mixed setup are expected, including:

  • better I/O performance for all jobs,
  • better job performance for what used to be multi-node ethernet jobs (as they will now make use of InfiniBand), and
  • for users who were already using InfiniBand, improved queue throughput (there are now 4x as many available nodes) and the ability to run larger IB jobs.

NOTE 1: Our wiki is NOT completely up-to-date after this recent change. For the time being, you should first check this current page and the temporary Infiniband Upgrade page for anything related to networks and queueing.

NOTE 2: The temporary mpirun settings that were recommended for multinode ethernet runs are no longer in effect, as all MPI traffic is now going over InfiniBand.

NOTE 3: Though we have been testing the new system since last night, a change of this magnitude (3,000 adapter cards installed, 5km of copper cable, 35km of fibre optic cable) is likely to result in some teething problems so please bear with us over the next few days. Please report any issues/problems that are not explained/resolved after reading this current page or our Infiniband Upgrade page to support@scinet.utoronto.ca.


Thu Apr 12 17:39:50 EDT 2012: The TCS maintenance has been completed. Please report any problems.

Thu Apr 12 17:08:00 EST 2012: Scheduled maintenance downtime of the TCS is underway. As announced, running TCS jobs and TCS login sessions were killed. All other systems are up. The TCS is expected to be up again sometime this evening.

Tue Apr 10 16:24:00 EST 2012: scheduled downtimes:

Apr 12: TCS shutdown (other systems will remain up). The shutdown will start at 11 am and the system should be available again in the evening of the same day.

Wed 28 Mar 2012 21:45:03 EDT: Connection problem was caused by trouble with a filesystem manager. Problem solved.

Wed 28 Mar 2012 20:55:27 EDT: We're experiencing some problems connecting to the login nodes. Investigating.

Wed Mar 28 10:34:25 EDT 2012: There have been some GPC file system and network stability issues reported over the past few days that we believe are related to some OS configuration changes. We are in the process of resolving them. Thanks for your patience.

Tue Mar 6 18:30:00 EST 2012: We had a glitch on our core switch due to configuration errors. Unfortunately, this short outage resulted in GPFS being unmounted and jobs being killed. Systems have recovered. Please resubmit your jobs.

Fri Mar 2 11:59:33 EST 2012: Roughly 1/3 of the TCS nodes shut themselves down on thermal checks at ~11:40 today due to a glitch in the water supply temperature. Unfortunately, all jobs running on those nodes were lost. Please check your jobs and resubmit if necessary.

Thu Feb 9 11:50:57 EST 2012: Temporary system change for MPI ethernet jobs:
Due to some changes we are making to the GPC GigE nodes, if you run multinode ethernet MPI jobs (IB multinode jobs are fine), you will need to explicitly request the ethernet interface in your mpirun:

For OpenMPI:  mpirun --mca btl self,sm,tcp
For IntelMPI: mpirun -env I_MPI_FABRICS shm:tcp

There is no need to do this if you run on IB, or if you run single-node MPI jobs on the ethernet (GigE) nodes. Please check GPC_MPI_Versions for more details.
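For illustration, a minimal multinode ethernet job script under this temporary change might look like the sketch below. Only the mpirun fabric options are the settings quoted above; the resource request, job name, and executable name are placeholders and should be adapted to your own job (see GPC_Quickstart and GPC_MPI_Versions for the authoritative syntax):

  #!/bin/bash
  # Hypothetical multinode GigE (ethernet) MPI job -- resource request and
  # executable are placeholders; only the mpirun fabric options come from
  # the note above.
  #PBS -l nodes=2:ppn=8,walltime=1:00:00
  #PBS -N eth-mpi-example
  cd $PBS_O_WORKDIR

  # OpenMPI: explicitly select the tcp (ethernet) transport
  mpirun --mca btl self,sm,tcp ./my_mpi_program

  # IntelMPI equivalent (use whichever matches your loaded MPI module):
  # mpirun -env I_MPI_FABRICS shm:tcp ./my_mpi_program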

Thu Feb 9 11:50:57 EST 2012: Scheduled downtime is over. TCS is up. GPC is coming back rack-by-rack.

Mon Jan 31 9:12:00 EST 2012: The file systems (scratch and home) got unmounted around 3:30 am and again around 23:15 on Jan 30. Jobs may have crashed. The filesystems are back now. Please resubmit your jobs.

Wed Jan 18 16:47:38 EST 2012: Full system shutdown as of 7AM on Tues, 17 Jan in order to perform annual maintenance on the chiller. Most work has been completed on schedule. Expect systems to be available by 8PM today.

Wed Jan 4, 14:48: Scratch file system got unmounted. Most jobs died. We are trying to fix the problem. Check back here for updates.

Wed Jan 3, 13:58: Datamover1 is down due to hardware problems. Use datamover2 instead.

Wed Dec 28 13:46:37 EST 2011 Systems are back up. All running and queued jobs were lost, due to a power failure at the SciNet datacentre. Please resubmit your jobs. Also, please report any problems to <support@scinet.utoronto.ca>.

Wed Dec 28 09:25 EST 2011 Electrician enroute. No power at our main electrical panel.

Wed Dec 28 08:51 EST 2011 Staff enroute to datacentre. More info once we understand what has happened.

Wed Dec 28 02:33 EST 2011 Datacentre appears to have lost all power. All remote access lost.


Thu Dec 8 16:43:47 EST 2011: The GPC was transitioned to CentOS 6 on Monday, December 5, 2011. Some of the known issues (and workarounds) are listed here. Thanks for your patience and understanding! - The SciNet Team.

System appears to have stabilized. Please let us know if there are any problems.

Fri Nov 25 12:39:54 EST 2011: IMPORTANT upcoming change: The GPC will be transitioned to CentOS 6 on Monday, December 5, 2011. All GPC devel nodes will be rebooted at noon on Monday Dec 5/11, with a CentOS6 image. The compute nodes will be rebooted as jobs finish, starting on Saturday Dec 3/11. You may already submit jobs requesting the new image (os=centos6computeA), and these jobs will be serviced as the nodes get rebooted into the new OS.
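As a hedged sketch of such a request (the node, ppn, and walltime values are placeholders, and the exact resource-request syntax should be checked against GPC_Quickstart; only the os=centos6computeA image name comes from the announcement above), a submission might look like:

  qsub -l nodes=1:ppn=8,os=centos6computeA,walltime=1:00:00 myscript.sh

or equivalently as a directive inside the job script:

  #PBS -l nodes=1:ppn=8,os=centos6computeA,walltime=1:00:00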

Wed Nov 16 10:53:37 EST 2011: A glitch caused the scratch file system to get unmounted everywhere. We are on track to fix the situation. However, most jobs were killed and you will have to resubmit your jobs once scratch is back.

Recovery of /project directories for groups with a storage allocation of less than 5 TB in /project is still in progress. Until it completes, those directories are inaccessible (owned by root). If you can read your project directory, the recovery is complete. To expedite this process, no material can be retrieved from HPSS by users for now.

Note that the monthly purge of the scratch space will be delayed until Friday, 18 Nov because of the downtime.

Wed Nov 15 10:47:37 EST 2011: All systems are up and accessible again. Both /project and HPSS now follow the same new directory structure as on /home and /scratch, i.e. /project/<first-letter-of-group>/group/user.

Be aware that for groups with a storage allocation of less than 5 TB in /project, recovery of their directories is in progress and will finish in the next day or so. Until then, those directories are inaccessible (owned by root). If you can read your project directory, the recovery is complete.

Note that the monthly purge of the scratch space will be delayed until Friday, 18 Nov because of the downtime.

Tue Nov 14 9:40:37 EST 2011: All systems will be shut down Monday morning in order to complete the disk rearrangement begun this past week. Specifically, the /project disks will be reformatted and added to the /scratch filesystem. The new /scratch will be larger and faster (because of more spindles and a second controller). We expect to be back online by late afternoon.

The monthly purge of the scratch space will be delayed until Friday, 18 Nov because of the downtime.

For groups with storage allocations, a new /project will be created but disk allocations on it will be decreased and the difference made up with allocations on HPSS. Both /project and HPSS will follow the same new directory structure as now used on /home and /scratch.

Tue Nov 8 17:05:37 EST 2011: Filesystem hierarchy has been renamed as per past emails and the newsletter. e.g. the home directory of user 'resu' in group 'puorg' is now /home/p/puorg/resu and similarly for /scratch. The planned changes to /scratch (new disks and controller) have been postponed until later this month. /project remains read-only for now and there will be follow-up email to project users tomorrow.

Tue Nov 8 11:24:56 EDT 2011 Systems are down for scheduled maintenance. Expect to come back online in early evening after filesystem changes detailed in recent emails (including new directory naming hierarchy).

Wed Nov 2 13:45:24 EDT 2011 Reminder - shutdown of all systems scheduled for 9AM on Tues 8 Nov. Expect to come back online in early evening after filesystem changes detailed in recent emails (including new directory naming hierarchy).

Sat Oct 15 10:44:27 EDT 2011 Filesystem issues resolved.

Sat Oct 15 09:57:57 EDT 2011 Filesystem appears hung because of issues with OOM'd nodes and users close to their quota limits. We're trying to resolve this.

Fri Oct 7 14:00:00 EDT 2011 Scheduler issues resolved

Fri Oct 7 11:17:08 EDT 2011 The scheduler is having issues; we are investigating. The file system also seems unhappy.

Wed Oct 5 02:50:08 EDT 2011 Transformer maintenance complete. Systems back online

Wed Oct 5 01:18:27 EDT 2011 Power was restored at ~12:30. Cooling has been restored as of ~1:10AM. Starting to bring up filesystems. It will be at least an hour before users can get access.

Tue Oct 4 23:38:57 EDT 2011 Electrical crews had some problems. We expect power to be restored by midnight, but it will then take at least 2 hrs before cooling is working and we can hope to have systems back. Will update.

Tue Oct 4 14:24:52 EDT 2011 System shutdown scheduled for 2PM on Tues, 4 Oct for maintenance on building transformers. All logins and jobs will be killed at that time. Power may not be restored until midnight with systems coming back online 1-2 hrs later.

Sun Oct 2 14:20:57 EDT 2011 Around noon on Sunday we had a hiccup with GPFS that unmounted /scratch and /project on several compute nodes. Systems were back to normal at around 2PM. Several jobs were disrupted, but if your job survived please make sure it produced the expected results.

Sun Oct 2 12:20:57 EDT 2011: GPFS problems. We are investigating.

Wed Sep 7 18:26:49 EDT 2011: Systems back online after emergency repair of condenser water loop.

Wed Sep 7 16:28:13 EDT 2011: Cleaning up disk issues. Hope to make systems available again by 19:00-20:00

Wed Sep 7 16:02:19 EDT 2011: Cooling has been restored. Emergency repair was underway but couldn't prevent the shutdown. Update about expected system availability timing by 16:30

Wed Sep 7 15:53:34 EDT 2011: Cooling plant problems.

Sep 3 10:21:37 EDT 2011: /scratch is over 97% full, which may be making general access very slow for everyone. A couple of users are close to their quota limits and are being contacted. If possible, please delete any non-essential or temporary files. Your cooperation is much appreciated.

Tue Aug 30 14:58:41 EDT 2011: Systems likely available to users by 3:30PM

Tue Aug 30 13:21:45 EDT 2011: Cracked and suspect valves replaced. Cooling restored. Starting to bring up and test systems. Next update by 3PM

Mon Aug 29 16:01:45 EDT 2011: All systems will be shut down at 0830 on Tuesday, 30 Aug to replace a cracked valve in the primary cooling system. Expect systems to be back online by 5-6PM that day.

Wed Aug 24 13:00:00 EDT 2011: Systems back on-line. Pump rewired and cooling system restarted

Wed Aug 24 07:40:39 EDT 2011: On-site since 0430. Main chilled water pump refuses to start. Technician investigating.

Wed 24 Aug 2011 03:43:16 EDT Emergency shutdown due to failure in cooling system. More later once situation diagnosed

Mon Aug 22 23:29:05 EDT 2011: GPC scheduler appears to be working properly again

Mon Aug 22 22:41:40 EDT 2011: ongoing issues with scheduler on GPC. Working on fix

Mon Aug 22 21:53:40 EDT 2011: Systems are back up

Mon Aug 22 17:14:06 EDT 2011: Bringing up and testing systems. Another update by 8PM

Mon Aug 22 16:13:18 EDT 2011: Cracked valve has been replaced. Cooling system has been restarted. Expect to be online this evening. Another update by 5PM

Mon Aug 22 14:54:45 EDT 2011: Emergency shutdown at 3PM today (22 Aug). Possibility of severe water damage otherwise. More information by 5PM.

Sun Aug 21 18:53:35 EDT 2011: Systems being brought back online and tested. Login should be enabled by 8-9 PM

Sun Aug 21 17:22:12 EDT 2011: Cooling has been restored. Starting to bring back systems but encountering some problems. Check back later - should have a handle on timelines by 7PM.

Sun Aug 21 15:15:35 EDT 2011: Major storms in Toronto. A power glitch appears to have knocked out the cooling systems. All computers were shut down to avoid overheating. More later as we learn exactly what has happened.

Thu Aug 9, 12:31:03 EDT 2011 If you had jobs running when the file system failed this morning (10:30AM; 9 Aug), please check their status, as many jobs died.

Mon Jul 18, 15:04:17 EDT 2011 Datamovers and large-memory nodes are up and running. The Intel license server is functional again. Purging of /scratch will happen next Friday, July 22.

Tue Jun 14, 14:05 EDT 2011 Many SciNet staff are at the HPCS meeting this week so, apologies in advance, but we may not respond as quickly as usual to email.

Sun May 22 12:15:29 EDT 2011 Cooling pump failure at data centre resulted in emergency shutdown last night. Systems are back to normal.

Tue May 17 17:57:41 EDT 2011 Cause of the last two chiller outages has been fixed. All available systems back online.

Thu May 12 12:23:35 EDT 2011 NOTE: There is some UofT network reconfiguration and maintenance on Friday morning May 13 07:45-08:15 that most likely will disrupt external network connections and data transfers. Local running jobs should not be affected.

Thu May 5 19:17:37 EDT 2011 Chiller has been fixed. Systems back to normal.

You can check our twitter feed, @SciNetHPC, for updates.

Wed May 4 22:29:00 EDT 2011 Cooling failure at data centre resulted in an emergency shutdown last night. Chiller replacement part won't arrive until noon tomorrow and then needs installation and testing. Roughly 40% of the GPC is currently available. We're using free-cooling instead of the chiller and therefore can't risk turning on more nodes. NOTE - we plan to keep these systems up tomorrow when parts are replaced, but if there are complications they may be shut down without warning. You can check our twitter feed, @SciNetHPC, for updates.

Wed May 4 16:43:34 EDT 2011 Cooling failure at data centre resulted in an emergency shutdown last night. Chiller replacement part won't arrive until noon tomorrow and then needs installation and testing. Roughly 1/4 of the GPC (evenly split between IB and gigE) is currently available. We're using free-cooling instead of the chiller and may increase the number of nodes depending on how the room responds. NOTE - we plan to keep these systems up tomorrow when parts are replaced, but if there are complications they may be shut down without warning. You can check our twitter feed, @SciNetHPC, for updates.

Tues 3 May 2011 20:30 Cooling failure at data centre has resulted in emergency shutdown. This space will be updated when more information is available.

Thu 27 Apr 2011 20:21 EDT: System maintenance completed. All systems are operational. TCS is back to normal too, with tcs01 and tcs02 as the devel nodes.

Sun 17 Apr 2011 18:34:10 EDT A problem with the building power supply at 2AM on Saturday took down the cooling system. Power has been restored, but new problems have arisen with the TCS water units.

We have been able to partially restore the TCS. The usual development nodes, tcs01 and tcs02, are not available. However, we have created a temporary workaround using node tcs03 (tcs-f02n01), where you can log in, compile, and submit jobs. Not all the compute nodes are up, but we are working on that. Please let us know if there are any problems with this setup.

Sun Apr 16 13:01:14 EDT 2011 A problem with the building power supply at 2AM today took down the cooling system. Waiting for the electrical utility to check the lines.

Wed Mar 23 19:06:21 EDT 2011: A VSD controller failed in the cooling system. A temporary solution will allow systems to come back online this evening (likely by 8PM). There will likely need to be downtime in the next two days in order to replace the controller.

Wed Mar 23 17:00 EST 2011 Down due to cooling failure. Being worked on. Large electrical surge took out datacentre fuses. No estimated time to solution yet.

Mon Mar 7 14:39:59 EST 2011 NB - the SciNet network connection will be cut briefly (5 mins or less) at about 8AM on Tues, 8 March in order to test the new UofT gateway router connection

Thu Feb 24 13:21:38 EST 2011 Systems were shut down at 0830 today for scheduled repair of a leak in the cooling system. GPC is online. TCS is expected to be back online by 4PM.

Tue Feb 24 9:35:31 EST 2011: All systems will be shut down at 0830 on Thursday, 24 Feb in order to repair a small leak in the cooling system. Systems will be back online by afternoon or evening. Check back here for updates during the day.

Wed Feb 9 10:27:13 EST 2011: There was a cooling system failure last night, causing all running and queued jobs to be lost. All systems are back up.

Sat Feb 5 20:33:34 EST 2011: Power outage in Vaughan. TCS jobs all died. Some GPC jobs have survived.

Sat Feb 5 18:51:00 EST 2011: We just had a hiccup with the cooling tower. We suspect it's a power-grid issue, but are investigating the situation.

Thu Jan 20 18:25:54 EST 2011: Maintenance complete. Systems back online.

Wed Jan 19 07:22:32 EST 2011: Systems offline for scheduled maintenance of chiller. Expect to be back on-line in evening of Thurs, 20 Jan. Check here for updates and revised estimates of timing.

Fri Dec 24 15:40:12 EST 2010: We experienced a failure of the /scratch filesystem, with a corrupt quota mechanism, on the afternoon of Friday December 24. This resulted in a general gpfs failure that required the system to be shutdown and rebooted. Consequently all running jobs were lost. We apologize for the disruption this has caused, as well as the lost work.

Happy holidays to all! The SciNet team.

Note: From Dec 22, 2010 to Jan 2, 2011, the SciNet offices are officially closed, but the system will be up and running and we will keep an eye out for emergencies.

Mon Dec 6 17:42:12 EST 2010: File system problems appear to have been resolved.

Mon Dec 6 16:39:55 EST 2010: File system issues on both TCS and GPC. Investigating.

Fri Nov 26 18:30:55 EST 2010: All systems are up and accepting user jobs. There was a province-wide power issue at 1:30 the morning of Friday Nov 26, which caused the chiller to fail, and all systems to shutdown. All jobs running or queued at the time were killed as a result. The system is accessible now and you can resubmit your jobs.

Fri Nov 26 6:15 EST 2010: The data centre had a heating issue at around 1:30am on Fri Nov 26 and systems are down right now. We are investigating the cause of the problem and trying to fix it so that we can bring the systems back up and running.

Fri Nov 5 15:26 EDT 2010: Scratch is down; we are working on it.

Tue Oct 26 10:32:22 EDT 2010: scratch is hung, we're investigating.

Fri Sep 24 23:02:41 EDT 2010: Systems are back up. Please report any problems to <support at scinet dot utoronto dot ca>

Fri Sep 24 19:55:02 EDT 2010: Chiller restarted. Systems should be back online this evening, likely 9-10PM. A widespread power glitch seems to have confused one of the Variable Speed Drives, and the control system was unable to restart it automatically.

Fri Sep 24 16:44:39 EDT 2010: The systems were unexpectedly and automatically shut down, due to a failure of the chiller. We are investigating and will bring the systems back up as soon as possible.

Mon Sep 13 15:39:05 EDT 2010: The systems were unexpectedly shut down, automatically, due to a failure of the chiller. We are investigating and will bring the systems back up as soon as possible.

Fri Sep 10 15:08:53 EDT 2010: GPFS upgraded to 3.3.0.6; new scratch quotas applied; check yours with /scinet/gpc/bin/diskUsage.

Mon Aug 16 13:12:08 EDT 2010: Filesystems are accessible and login is normal.

Mon Aug 16 12:38:45 EDT 2010: Login access to SciNet is failing due to /home filesystem problems. We are actively working on a solution.

Mon Aug 9 16:10:03 EDT 2010: The file system is slow and the scheduler cannot be reached. We are working on the problem.

Sat Aug 7 11:42:50 EDT 2010: System status: Normal.

Sat Aug 7 10:45:52 EDT 2010: Problems with scheduler not responding on GPC. Should be fixed within an hour

Wed Aug 4 14:03:59 EDT 2010: GPC scheduler currently having filesystem issue which means you may have difficulty submitting jobs and monitoring the queue. We are working on the issue now.

Fri Jul 23 17:50:00 EDT 2010: Systems are up again after testing and maintenance.

Fri Jul 23 16:14:02 EDT 2010: Systems are down for testing and maintenance. Expect to be back up about 9PM this evening.

Tue Jul 20 17:40:45 EDT 2010: Systems are back. You may resubmit jobs.

Tue Jul 20 15:39:37 EDT 2010: Most GPC jobs died as well. The ones that appear to be running are likely in an unknown state, so will be killed in order to ensure they do not produce bogus results. The machines should be up shortly. Thanks for your patience and understanding. The SciNet team.

Tue Jul 20 15:19:30 EDT 2010: File systems down. We are looking at the problem now. All jobs on TCS have died. We'll inform you when the machine is available for use again. Please log out from the TCS.

Tue Jul 20 14:26:10 EDT 2010: Scratch file system down. We are looking at the problem now.

Fri Jul 16 10:59:27 EDT 2010: Systems normal.

Sun Jul 11 13:08:02 EDT 2010: All jobs running at ~3AM this morning almost certainly failed and/or were killed about 11AM. A hardware failure has reduced /home and /scratch performance by a factor of 2 but should be corrected tomorrow.

Sun Jul 11 09:56:27 EDT 2010: /scratch was inaccessible as of about 3AM

Fri Jul 9 16:18:18 EDT 2010: /scratch is accessible again

Fri Jul 9 15:38:43 EDT 2010: New trouble with the filesystems. We are working to fix things

Fri Jul 9 11:28:00 EDT 2010: The /scratch filesystem died at about 3 AM on Fri Jul 9, and all jobs running at the time died.