Previous messages:

From oldwiki.scinet.utoronto.ca
Revision as of 07:05, 18 March 2015 by Rzon

These are old messages; the most recent system status can be found on the main page.

You can also check our Twitter feed, @SciNetHPC, for updates.

2015

Fri Mar 12 13:13:28 EDT 2015: all GPC racks are back in service

Mon Mar 9 15:20:25 EDT 2015: Unfortunately some GPC racks had to be turned off due to a plumbing problem. Any jobs that were running on f129, f130, f144 and f145 were killed.

Thu Mar 5 12:43:18 EST 2015: BGQ modules bgqgcc/4.8.1 and mpich2/gcc-4.8.1 have been updated to correct for changes related to the V1R2M2 software update.

Some day in March 2015: Warning for GCC users on the BGQ:

The BGQ software stack upgrade to V1R2M2 has caused some changes in the library directory structure for the GNU 4.8.1 compiler. If you are using this compiler, or a module that depends on it (such as the OpenFOAM modules), you may experience issues running your code. We are working on a solution for this problem. Note that projects using the XL compilers (xlf, vacpp) should not be affected.

Tue Feb 24 12:46:26 EST 2015: GPC scheduler is back to normal but all jobs were lost. Please resume submitting jobs. We apologize for the inconvenience.

Tue Feb 24 12:24:03 EST 2015: GPC scheduler has crashed several times this morning. We are still working to get it back online.

Wed Feb 18 19:10:55 EST 2015: HPSS is back in service

Tue Feb 17 13:55:20 EST 2015: HPSS will be offline until sometime on Thursday for a hardware upgrade. We'll be adding a new high density enclosure to the tape library.

Wed Feb 11 00:36:19 EST 2015: Systems are being powered on. Login nodes are accessible.

Tue Feb 10 22:30:29 EST 2015: Technician still on-site trouble-shooting a VSD drive in chiller. Starting to bring up storage and management systems under free-cooling.

Tue Feb 10 20:15:28 EST 2015: Water temperature too high. Systems were shut down automatically. Staff en route to assess.

Sat Feb 7 15:11:27 EST 2015: GPC and storage are back now. We are working on getting TCS and BGQ up.

Sat Feb 7 13:04:53 EST 2015: cooling has been restored. Some systems (GPC and storage) likely back mid-afternoon. Please check back later

Sat Feb 7 11:00:20 EST 2015: primary chilled water pump won't restart. Waiting for technician. May be noon before root cause is understood. Systems unlikely to be back before 2PM at very earliest.

Sat Feb 7 10:02:02 EST 2015: staff onsite. Assessing problem

Sat Feb 7 07:00:14 EST 2015: datacenter shut down automatically due to a power outage

Thu Jan 22 13:27:44 EST 2015: BGQ now available as a single 4-rack system. bgqdev-fen1 is the single login/devel/submission node.


Fri Feb 6 08:41:08 EST 2015: The /scratch file system, which had crashed during the early morning and took almost all jobs with it, is back to normal. Practically all GPC and TCS jobs died. Please resubmit your jobs.

Fri Feb 6 05:58:00 EST 2015: /scratch is showing Stale file handle on many GPC/TCS nodes, indicating some kind of failure. We're investigating.

Sat 17 Jan 2015 21:50:40 EST: Cooling has been restored. Systems being restarted. Likely available within an hour or so. Root cause was a frozen pipe in cooling tower (very strange; has never happened before and today is relatively warm compared to past two weeks).

Sat 17 Jan 2015 19:34:00 EST: JCI on site as well. Diagnosing issue.

Sat 17 Jan 2015 17:33:47 EST: Unusual cooling problem. Systems down. Staff en route to site.

Thu Jan 15 11:22:00 EST: Cooling tower fan belt service is finished. Chiller is being serviced as scheduled while the chilled water plant is working on free-cooling mode. We are not expecting any interruption for users. Systems are being brought up now.

Wed Jan 14 17:02:18 EST: Emergency shutdown of all compute nodes 8:30AM tomorrow (Thurs, 15 Jan). After starting to bring up systems this afternoon we learned that an emergency replacement of the cooling tower fan belt is required tomorrow morning. Compute systems that are currently up will need to be shut down at 0830 tomorrow. We will attempt to keep login nodes and storage up during tomorrow's downtime, which is expected to last 1-4 hrs.

Wed Jan 14 14:34:18 EST: Expect some systems (login nodes, GPC and BGQ) to be available by approx 3:00-3:30PM.

Wed Jan 14 13:09:03 EST: Free-cooling is being restored and should allow compute systems to come online this afternoon. Chiller maintenance will continue throughout the day and possibly into tomorrow. Check back for updates.

SCHEDULED MAINTENANCE DOWNTIME ANNOUNCEMENT

On January 14 and 15, scheduled maintenance on the data centre's cooling system will require all systems to be shut down for at least the first part of the maintenance. All SciNet systems will be shut down at 7 AM on Wednesday January 14, 2015 and all login sessions and jobs will be killed at that time.

At the earliest, the systems will be available again later on Wednesday afternoon, but it is possible that the downtime will extend into Thursday January 15, 2015. Check here on the SciNet wiki (wiki.scinethpc.ca) for updates on Wednesday and Thursday.

Fri Jan 9 01:27:49 EST 2015: Scheduler glitch. Most jobs were killed about an hour ago. Scheduler is back to normal; please resume submitting jobs. We apologize for the inconvenience!

2014

Thu Dec 18 10:12 EST 2014: Both BGQ systems are back up. Please resubmit your jobs. (Note that only the BGQ was affected; all other systems are fine.)

Thu Dec 18 9:30 EST 2014: Both BGQ systems shut off due to a cooling issue, killing all running jobs. Systems are being brought back up. Check here for updates.

Fri Dec 12 11:52:52 EST 2014: BGQ Dev system upgraded to 2 full-racks.

Fri Dec 5 08:25:48 EST 2014: City has just confirmed that water has been restored. Staff are on-site and restarting cooling systems. Users should have access to compute systems before noon.

Fri Dec 5 06:51:56 EST 2014: City has reported that water should be restored by 8AM. If they're on schedule, it could still take a few hours after that to restart cooling systems, power-up and test storage etc

Thu 4 Dec 2014 16:05:07 EST: Some systems are back up, but only until tonight when the advertised shutdown will still happen, starting at around 9PM. Only short jobs, that fit in the short time before the systems are taken down, will run. Filesystems are up, as well as devel nodes on all platforms. BGQ will remain down until after the water repairs.

Thu Dec 4 14:03:24 EST 2014: Systems shut down abnormally due to loss of plumbing to secondary loop. Still investigating.

ALL SYSTEMS TO BE SHUT DOWN ON THURSDAY: On Thursday Dec 4, at 9PM EST, all systems will need to be shut down. The city of Vaughan has advised us that the city water supply will be turned off in order to fully fix the problem that occurred on Nov 21. With no water supply we cannot cool the datacentre, hence the shutdown. We expect all systems to be back up on Friday Dec 5, at around 11AM.

Mon Dec 1 12:41:39 EST 2014: GPC Moab scheduler debugging is finished and is back to normal. No jobs were cancelled.

Thu Nov 27 11:24:17 EST 2014: On Monday, December 1, a Moab developer will debug a scheduler problem which cancels jobs unexpectedly upon restart on GPC. The debugging process will start at 11AM, and some queued jobs will be cancelled. We advise against submitting new jobs during this period. Please check the wiki for updates.

Mon Nov 24 10:39:00 EST 2014: Filesystems are back.

Mon 24 Nov 2014 10:24:55 EST: Filesystems are experiencing huge waiter problems. Our systems people are working to clear the issue.


Fri Nov 21 23:53:08 EST 2014: Systems are ready for login. GPC and BGQ are online and accepting jobs. Working on the rest of the systems.

Fri Nov 21 22:45:58 EST 2014: Water is restored. Working on bringing up systems.

Fri 21 Nov 2014 21:22:28 EST: Datacentre still down. Emergency water repairs being done in the area. Hence no cooling. We expect systems to be back in operation sometime tomorrow morning.

Fri Nov 21 18:49:25 EST 2014: Datacentre down. Staff en route. Investigating

Fri Nov 21 14:30:00 EDT 2014: File system is slow. Investigating.

Thu Nov 13 13:36:19 EST 2014: There will be an outage to SciNet via ssh from 10pm EST to 12am EST today to perform emergency repair on the fibre. No running job will be affected. Sorry for the inconvenience.

Wed Nov 12 15:41:44 EST 2014: There will be a 15 to 20 minute outage to SciNet via ssh from 10pm EST to 12am EST today to perform emergency repair on the fibre. No running job will be affected. Sorry for the inconvenience.

Tue Nov 11 17:32:59 EST 2014: The emergency fibre maintenance is complete. Please report any network connectivity issues.

Tue Nov 11 11:35:38 EST 2014: Connection to SciNet might be disconnected due to emergency fibre maintenance. Outage should last about 5 minutes and it will not affect running jobs on the system.

Fri Nov 7 15:00:00 EDT 2014: HPSS is nearing capacity: jobs can be submitted, but will only run once they are reviewed and released by SciNet staff.

Thu Oct 30 21:10:53 EDT 2014: BGQ devel system back in production, upgraded to 1 full-rack.

Tue Oct 28 08:38:41 EDT: BGQ devel system is under maintenance, and job submission is suspended. Production system is not affected.

Wed Oct 15 13:41:09 EDT: Turns out that there is no need for BGQ shutdown for today's work after all. Please ignore previous messages

Tue 14 Oct 2014 21:56:35 EDT: The BGQ compute nodes will be shut down for a brief period (hopefully less than an hour) on the morning of Wed, 15 Oct for room renovations. The BGQ front-end nodes and storage will remain accessible but all compute jobs will be killed at the time of shutdown. Any queued jobs will be able to start when the compute nodes come back online. Please check this space for updates.

Wed Oct 8 18:02:10 EDT 2014: Scheduler crashed around 5:30PM and has been restarted. Most queued jobs were lost. Please resume submitting jobs.

Sun Oct 5 18:19:14 EDT 2014: Systems started coming online 5:40PM. GPC, TCS and BGQ are running jobs again

Sun Oct 5 15:57:55 EDT 2014: Power was restored at 2:30PM. Cooling system has been restored. Filesystems being tested. System availability for users likely by 6PM

Sun Oct 5 13:47:48 EDT 2014: Power has still not been restored to the area. Our 5PM estimate above for restoring systems was based on Powerstream's forecast that they would be done by noon. Once power is back we'll update our estimate

Sun Oct 5 06:25:25 EDT 2014: All systems down as per note above

Mon Sept 29 00:43:00 EDT 2014: HPSS up.

Sun Sep 28 20:00:00 EDT 2014: ARC up and accepting jobs. File systems should be fixed too.

Sun Sep 28 18:30:00 EDT 2014: Sandy and Gravity are up. ARC is up but the scheduler is not yet operational from arc01 (use gpc nodes to submit). Some filesystem issues with gss (/scratch2 for groups using this) on some gpc nodes are being investigated too.

Sun Sep 28 13:59:00 EDT 2014: BGQ up.

Sun Sep 28 13:37:00 EDT 2014: GPC, P7 and TCS are back up. Users will be able to login shortly. All jobs were killed in the event, so please resubmit. We're working on getting the BGQ, Sandy and Gravity systems up too.

Sun Sep 28 09:43:28 EDT 2014: Brief power outage knocked-out cooling system at about 0806 this morning. Cooling has been restored. Disk controllers and filesystems are being brought up. Systems will be unavailable until at least noon.

Fri Sep 19 15:50:54 EDT 2014: Scheduler has been stable for the past hour, and jobs are being scheduled. Please submit your jobs. Please be aware that showq is not reporting some jobs that were already running before the glitch. Use qstat instead of showq for these jobs. Most of the jobs queued this morning were rejected by the scheduler when it went back online.

Fri Sep 19 12:52:20 EDT 2014: We've been experiencing intermittent problems with the scheduler. Job submission has been paused temporarily, until we can restart the scheduler. Please check this space for updates.

Mon Sep 15 11:13:10 EDT 2014: The scheduler had some issues this morning and had to be restarted to resolve them. Some queued and running jobs have been lost.

Wed Sep 10 02:50:50 EDT: Scheduler had to be shut down, reconfigured and restarted at about 1:20AM to resolve problems related to its upgrade. Unfortunately, this killed most/all running jobs

Tue Sep 9 21:11:21 EDT: login nodes now up. Check system availability dashboard above. In particular, job submission currently not working on BGQ so it is marked down for now.

Tue Sep 9 16:24:21 EDT: Systems will likely not be available before 8PM today. Transformer maintenance has been completed but waiting for power utility to reconnect us (expected ~4:30). Unfortunately, there will be an unanticipated delay as we've been told the chiller may take 2 hrs to restart and we can't bring up systems without cooling.

On Tuesday September 9, 2014, the SciNet datacentre will be down for maintenance. The electrical transformers that feed the datacentre require servicing, and the SciNet datacentre will go down at 7:30am. The shutdown is expected to last most of the day, with service resuming by 6pm.

Please note: All jobs on the GPC will be cancelled, and the queue purged. This is in order to allow us to install a new version of the Moab/Torque scheduler.

On August 21, 2014, at 00:30, there will be a 15-minute period during which maintenance on our gateway switch must be conducted. As a result, you may not be able to contact the datacentre. No jobs will be affected by this.

Tue Aug 5 09:34:00 EDT 2014: File system hiccup on the BGQ. The issue has been resolved; please resubmit BGQ jobs.

Fri Aug 1 20:19:44 EDT 2014: The 3 second power-outage took down all the GPC, TCS and BGQ compute nodes so all running jobs were killed. Queued and new jobs started 3s later on the GPC. The TCS and BGQ are back-online as well. Please email support@scinet.utoronto.ca if you still encounter issues

Fri Aug 1 17:23:05 EDT 2014: Around 5pm, a power outage lasting a few seconds took down an as-yet unknown number of nodes. GPC, Sandy, TCS, Gravity, ARC are certainly affected, but to what extent is not clear yet. Updates will be posted here.

Fri Aug 1 17:46:04 EDT 2014: GPC, Sandy, ARC, Gravity, TCS, and BGQ were all affected. P7, HPSS and file system are okay. We're rebooting the nodes.

Mon Jun 30 15:19:39 EDT: All systems down. Some kind of power issue (again).

Sun Jun 29 19:57:29: Compute systems started coming online about 730PM.

Sun Jun 29 18:20:41: filesystems restarted after some issues. Likely at least 8PM before compute systems available

Sun Jun 29 16:39:35 EDT 2014: large voltage spike tripped our main circuit breaker. We have power though it's out at sites within 2k because of lightning strike. Cooling system being restored

Sun Jun 29 15:47:11 EDT 2014: staff en route to site. Should have update on cause within an hour

Sun Jun 29 15:40:31 EDT 2014: power lost about 3:20P today. All systems down. Investigating.

Fri May 23 12:01:44 EDT: Most systems online (check status "buttons" above). TCS will be up shortly

Fri May 23 10:17:40 EDT: Cooling has been restored. Starting to bring-up filesystems. Expect systems to start coming online noon-ish barring complications

Fri May 23 10:01:08 EDT: Power has been restored to site. Starting to bring-up cooling system

Fri May 23 08:25:47 EDT: Entire building lost power at 0150 this morning - cause appears to be blown fuses on the overhead transmission lines. Utility is en route. Will update when more is known

Fri May 23 02:02:55 EDT: Power outage at datacentre. All systems down. Will likely be 9-11AM before systems come back up

TCS maintenance downtime: On Tuesday May 6, at 9AM, the TCS will go down for hardware maintenance, in order to replace some faulty components. The downtime is expected to last around 1 hour. All queued jobs at that time will remain in the queue, and will execute when the system is brought back up.

At 9:00 AM on Wednesday April 30, 2014, there was a scheduled brief loss of connectivity for a minor network reconfiguration. Login sessions to the data centre would have been disrupted, but running and queued jobs were not affected.

Fri Apr 4 13:30:00 EST 2014: The file system seems to have recovered after jobs that were hard on the file system were canceled and improved.

Fri Apr 4 11:00:00 EST 2014: GPFS file system is slow due to a number of downed nodes, so logging in may be difficult, but jobs that have started are still running.

Tue Mar 4 09:25:34 EST 2014: GPFS had a locking issue due to a number of downed nodes. The issue has been resolved.

Tue Mar 4 08:08:34 EST 2014: We are investigating a problem with the /home file system.


Tue Feb 18 18:36:46 EST: All compute systems available

Tue Feb 18 18:13:14 EST: Expect to have login nodes and GPC available before 7PM.

Tue Feb 18 17:24:20 EST: Utility restored power to area at 4:50PM. Staff have restored cooling system and are starting to bring up storage. Most systems likely back online within 2 hrs or so.

Tue Feb 18 16:01:54 EST: Powerstream is now estimating 5PM for restoring power in the area

Tue Feb 18 13:50:41 EST: Power utility is having significant problems in our area and has cut power again. When power comes back (and appears stable) we will begin to restart systems.

Tue Feb 18 13:39:38 EST: Most systems have been restored and are available for use. Please e-mail support@scinet.utoronto.ca if there are any issues.

Tue Feb 18 11:13:53 EST: Power glitch at datacenter shut down the cooling system and hence the computers as well. Restoring cooling now. Systems likely back online within 2 hrs or so.

Wed Feb 5 01:01:52: GPC scheduler crashed around 10PM on Tuesday. Unfortunately, the scheduler cannot be revived normally. All GPC jobs were lost. Please resubmit your jobs. We apologize for the inconvenience.

Mon Feb 3 14:20:27: Jobs can be submitted to the GPC. GPC nodes are coming back online. TCS is still rebooting.

At approximately 1:54 pm, Monday February 3, SciNet data centre lost power for 4 seconds. Systems are being restored.

Thu Jan 16 16:55:06 EST 2014: Scheduled maintenance is complete. Systems are up.

Fri Jan 10 10:33: Systems coming back. Will enable access by ~11-11:30AM

January 10, 4:10 am Datacentre lost power. All systems are down.

2013

Wed Dec 25 07:15:10 EST 2013: Some TCS jobs were killed at ~7AM today as we shut down frames 9 and 10 to help stabilize temperatures in the machine room. Please check your jobs and resubmit. The nodes are being restarted

Wed Dec 25 07:15:10 EST 2013: Cooling tower was successfully de-iced and water temperatures have returned to normal.

Wed Dec 25 06:57:10 EST 2013: Shutting down some TCS nodes to help lower room temperatures. Cooling tower has frozen over. Trying to get de-icing cycle going again.

Sun Dec 22 11:08:13 EST 2013: Another power event at 0312 today knocked out the BGQ again. Unfortunately key staff are without power so time to restore is unknown (more than 250,000 customers in the GTA currently without power)

Sun Dec 22 00:19:23 EST 2013: BGQ up and jobs running. Some may have been killed so check your logs.

Sat Dec 21 23:39:26 EST 2013: Power glitch to site at 2240 caused the BGQ to shut down - it is being restored. Large ice storm is underway and PowerStream reports over 20,000 customers without power. There may well be more issues overnight.


Wed Dec 18 15:59:09 EST 2013:

Dear SciNet users:

SciNet is officially on holiday from Sat Dec 21, 2013, until Sun Jan 5, 2014. All systems will be up, and maintained on a best-effort basis. User support will also be on a best-effort basis, though we will try to help if we can.

We wish you all Happy Holidays, and the best for the New Year.

The SciNet team.


Wed Dec 4, 11:30: The network connection to the SciNet datacentre will go down twice over the next week as UofT will physically move an internet gateway to a new datacentre. Jobs will continue to run during these times but users will not be able to connect to any SciNet systems (GPC, TCS, BGQ etc). The SciNet website and support email system will not be affected. SciNet network connection will be down:

  • 7:50 am to 8:10 am, Friday 6 Dec
  • 7:00 am to 11:00 am, Monday 9 Dec


Sat Nov 22: 1016: Systems have been restored. A 20s power event knocked out the entire TCS and resulted in most/all of the GPC rebooting. Hence most jobs running at 0342 this morning were lost.

Sat Nov 22: 0448: Power glitch at site at 0342. Access to TCS has been lost. Many jobs running on GPC were killed when nodes rebooted. Will investigate more closely later this morning.

Fri Nov 22: One of our IB fabric managers died last night. As a result, many nodes including the GPFS managers could not communicate properly and many nodes had their GPFS unmounted. If you had crashed jobs, please resubmit.

Thu Aug 29 12:18 - BGQ issue resolved, and is back to full production.

Wed Aug 28, 15:15: One rack of the BGQ shut down. Check here for updates.

Mon Aug 26, 9:50 - Connectivity to the SciNet data centre will be disrupted for a few minutes on Tuesday Aug 27, between 07:00 and 07:20, for router upgrades. Running jobs should not be affected, but login shells will be disconnected.

Thu Aug 22 16:34 - Just a reminder that /scratch2 is now mounted read-only on the login and development nodes, and /scratch2 will be available until Wednesday September 4. Please make sure to migrate the data stored on /scratch2 to other available storage space.

Wed Aug 21 19:10 - We will be resuming purging on the /scratch filesystem on August 28. Please archive the files you do not need onto HPSS. Note that scratch is meant for running jobs and short-term data storage only, so please copy any important data over to HPSS if you need long-term storage.

Wed Aug 21 14:10 - /scratch is mounted read-write again. Please use /scratch for all new jobs. Disks have been checked and users with suspicious files will be contacted.

Wed Aug 21 06:16 - The /scratch filesystem will be mounted read-write around 2PM today. Processes on the login, devel, and datamover nodes that are accessing /scratch2 will be killed, or the node may be rebooted if /scratch2 cannot be re-mounted in read-only mode. /scratch2 on the compute nodes will be unmounted once the job is finished. All jobs scheduled to run after 2PM will need to use /scratch, otherwise the job will fail. Please cancel any jobs that use /scratch2 and resubmit them after 2PM.

Tue Aug 20 16:26 - The process of finding the files that had parity errors is taking longer than expected because multiple passes are required. These processes will also affect the performance of GPFS at times. We hope to finish these processes and have /scratch mounted read-write sometime tomorrow.

Mon Aug 19 14:17 - This is the plan for the next few days to get scratch back into production:

  • We'll first be doing another verify to find bad sectors. This will generate a list of suspect files.
  • These suspect files will be moved to a separate location, and the owners of these files will be notified.
  • Sometime tomorrow (Tue Aug 20), this should allow /scratch to be mounted read/write again.
  • Once /scratch is back in full production, and after running jobs using /scratch2 have finished (say 2-3 days later), /scratch2 will be phased out.
  • This phase-out entails mounting /scratch2 read-only on the login and devel nodes to allow users to copy results that they want to keep from /scratch2 to /scratch, /home, or off-site. The login and devel nodes may require a reboot tomorrow if /scratch2 cannot be mounted read-only on the node.
  • After a week or two, /scratch2 will be unmounted.

Sat Aug 17 02:21:09 - /scratch and /project are now mounted read-only on the login and devel nodes. Please read details immediately below. Note also (further below) that the monthly purge of /scratch will be delayed at least another week.

Sat Aug 17 00:04:39 - the /scratch and /project filesystems will be mounted read-only again by 0900 today but only on the login and devel nodes. You will not be able to write to /scratch or /project but you will be able to access and copy files away from them (e.g. to /scratch2). The storage vendor has completed recovering the RAID parity errors. We are now testing that they have "fixed" them properly, resolving discrepancies between their lists and ours, and identifying the names of the files that have been corrupted. Unfortunately, if there are any remaining, or improperly "fixed", parity errors, then the entire filesystem can crash when somebody accesses the affected files (this is why we had to unmount /scratch earlier this week). Accordingly, we are testing all the disk sectors that the vendor has claimed to have fixed overnight. If the filesystem remains stable over the weekend then we hope to be able to return /scratch and /project to normal on Monday or Tuesday.

Thu Aug 15 12:59:18 - work continues on recovering the filesystem. The vast majority of data appears intact but the storage vendor is still resolving parity errors. Also working on a way for users to identify what files might possibly have been corrupted. Unfortunately the timeline for all this is still uncertain. We're trying to balance paranoia for preserving data with the need for users to get back to work. /scratch2 is working well and the GPC is currently at 80% utilization

Wed Aug 14 20:00:00 - The login and GPC development nodes are back in service now. We have disabled the read-only mount for scratch since that was causing issues with the ongoing recovery. It will be made available later in the week when the recovery is complete. Please continue to check here for further updates.

Wed Aug 14 19:36:41 - There are currently filesystem issues with the gpc login node and the general scinet login node. We are working on the issue and trying to fix it.

Wed Aug 14 00:30:46 - the regular monthly purge of /scratch will be delayed because of the problems with the filesystem. It will tentatively take place on 22 Aug (or later). New date will be announced here.

Tue Aug 13 20:23:27 - GPC and TCS available. See notes below about scratch2, scratch and project filesystems.

Tue Aug 13 19:15:28 - for the time being, /scratch and /project will be available only from the login and devel nodes and will only be readable (you can not write to them). This way users can retrieve files they really need but we minimize the stress on the filesystem while we complete LUN verifies and filesystem checks. These filesystems will return to normal later this week (likely Wed or Thurs but may take longer than expected). We know that there are some files that may have corrupted data and will post more details later about how to identify them. The total amount of corrupted data is small and appears to be limited only to those files which were open for writing when the problems started (about 1445 on Friday, 9 Aug). GPC users will still need to use /scratch2 for running jobs while TCS users will need to make use of /reserved1.

Tue Aug 13 17:24:18 - there is good news about /scratch and /project. They appear to be at least 99% intact. However, there are still more LUN verifies that need to be run as well as disk fscks. It's not yet clear whether we will be able to make these disks available tonight or at some point tomorrow. Systems should come online again within a couple of hours though perhaps only with the new /scratch2 for now.

Tue Aug 13 17:13:58 - datacentre upgrades finished. Snubber network, upgraded trigger board, UPS for the controller and the Quickstart feature should make chiller more resilient to power events and improve the time it takes to restart. Hot circuit breakers also replaced

Tues Aug 13 09:00:00 - systems down for datacentre improvement work

Sun Aug 11 21:55:06 - TCS can be used by those groups which have /reserved1 space. Use /reserved1 to run jobs as you would have with the old /scratch (which we are still trying to recover)

Sun Aug 11 21:49:03 - GPC is available for use. There is no /scratch or /project filesystem as we are still trying to recover them. You can use /scratch2 to run jobs in exactly the same way as the old scratch (however the environment variable is $SCRATCH2). New policies for /scratch2 are being set but for now each user is limited to 10TB and 1 million files. /home is unscathed.

Sun Aug 11 15:35:32 - We are implementing a contingency plan for access by GPC users. Should be available within a few hours. There will be a new scratch2 filesystem that can be used for submitting and running jobs. TCS users may have to wait another day for a fix (it is technically impossible to mount the new /scratch2 on the TCS). Unfortunately, nobody will be able to access the original /scratch or /project space and the timeline for attempting to fix and recover those filesystems is virtually impossible to judge (have to deal with new problems as they crop-up and there's no way to know how many problems lie ahead).

Sun Aug 11 09:25:41 - work resumed before 8AM this morning. Still correcting disk errors that surface so we can reach the stage where the OS can actually mount the filesystem

Sat Aug 10 22:31:45 - work stopping for this evening. SciNet and vendor staff have worked continuously for more than 30 hrs on this problem. No point risking making a mistake now. Will continue tomorrow

Sat Aug 10 20:39:34 - work continues. Disks and NSDs have been powered-up and the filesystem is attempting to read the disks. Problems with individual disks are being fixed manually as they are exposed

Sat Aug 10, 17:03 - Still no resolution to the problem. SciNet staff continue to work onsite, in consultation with the storage vendor.

Sat Aug 10 10:38:46 - storage vendor still working on solution with SciNet staff onsite. There are 2,000 hard drives and the controller is confused about location and ID of some of them. Getting a single one wrong will result in data loss so we are proceeding cautiously. Only /scratch and /project are affected. /home is accessible but GPC and TCS can not be used as they rely on /scratch. BGQ system is still usable because of separate filesystem


Sat Aug 10 07:05:07 - staff and vendor tech support still on-site. New action plan from storage vendor is being tested.

Sat Aug 10 00:28:54 - Vendor has escalated to yet a higher level of support but still no solution. People will remain on-site for a while longer to see what the new support team recommends.

Fri Aug 9 22:03:48 - Staff and vendor technician remain on-site. Storage vendor has escalated problem to critical but suggested fixes have not yet resolved the problem. BGQ remains up because it has separate filesystem.

Fri Aug 9 15:32 - /scratch and /project are down. Login and home directories are ok, but no jobs can run, and most of those running will likely die if/when they need to do I/O.

Fri Aug 9 15:25 - File system problems. Scratch is unmounted. Jobs are likely dying. We are working on it.


Thu Aug 8 13:22 - most systems are back up

Thu Aug 8 11:18:45 - problems with storage hardware. Trying to resolve with vendor

Thu Aug 8 08:14:01 Cooling has been restored. Starting to recover systems.

Thu 8 Aug 2013 06:12:28 Large voltage drop at site knocked-out cooling system at 0558 today. Staff enroute to site.

Wed 7 Aug 2013 16:27:00 EDT: All systems are up again.

Wed 7 Aug 2013 14:50:00 EDT: GPC, TCS, P7, ARC and HPSS systems are up again.

Wed 7 Aug 2013 11:47:00 EDT: Power outage at site because of thunderstorm. Systems down.

Tue Aug 6 16:52:35 EDT 2013: File systems have been fixed. Systems are back up.

Tue Aug 6 15:27:35 EDT 2013: File system trouble under investigation. Clusters aren't down per se, but the home and scratch file systems are not accessible. Many user jobs very likely died. Systems are expected to be up by the end of the afternoon today. Check here for updates.

Thu Aug 1 17:13:52 EDT 2013: Systems are back up and accessible to users.

Thu Aug 1, 8:06:00: As announced, all systems have been shut down at 8AM on Thurs, 1 Aug for emergency repair of a component in the cooling system. Systems are expected to be back on-line in the afternoon. Check here for progress updates.

16:30 update: Systems expected to be up by 5pm today.

Tue Jul 30, 19:24:00: Downtime announcement:

All systems will be shutdown at 8AM on Thurs, 1 Aug for emergency repair of a component in the cooling system. Systems are expected to be back on-line in the afternoon. Check here for progress updates.

Apologies for the short notice but we only learned of the problem this afternoon. We're now attempting to re-schedule other maintenance planned for later in August to this Thursday as well (hence the uncertainty in the length of the required downtime).

Mon Jul 29 10:40:00 All systems back up.

Mon Jul 29 10:09:00 TCS is back up. BGQ still down.

Mon Jul 29 8:37:00 Power glitch overnight took systems down. GPC is already up, and other systems are being brought up.

Wed Jul 24 15:00:00 All BGQ racks back in production

Thu Jul 18 10:00:00 Bgqdev and one of the two bgq racks are up again

Wed Jul 17 17:00:00 Bgqdev and bgq systems are down.

Wed Jul 17 15:58:00 We're reenabling the rack, please resubmit crashed jobs.

Wed Jul 17 15:24:12 One of the two racks of the BlueGene/Q production system has gone down.

Mon Jul 15 09:45:49: Gravity01 (head node in gravity cluster) is down until further notice. Jobs may still be submitted from devel nodes or arc01

Tue Jul 9 19:15:57 EDT: one rack of the production BGQ systems remains down due to a faulty flow sensor

Tue Jul 9 15:16:49 EDT: GPC & TCS are back online. Other systems being restored

Tue Jul 9 12:45:25 EDT: Power has been restored. We need to restart cooling systems, restart and check the filesystems etc. Will have better idea of timeline by 3PM

Tue Jul 9 11:52:02 EDT: Powerstream on-site. Wind/tension damage to two hydro poles caused overhead fuse to blow. Repairs are underway

Tue Jul 9 09:30:52 EDT: Fuse on power lines blew last night. Utility is backed-up dealing with other problems. No ETA for them to restore power at the site

Tue Jul 9 02:22:47 EDT: No power at site. Will resolve by 10AM and update here.

Tue Jul 9 01:04:40 EDT: Power failure at site and UPS has drained. Major storms and problems throughout Toronto. Staff enroute.

Fri Jul 5, 14:18:00 EDT 2013 Both BGQ systems, bgq and bgqdev, are up.

Thu Jul 4, 15:50:00 EDT 2013: Both BGQ systems, bgq and bgqdev, are down due to a cooling failure. We are investigating the cause. Given the scheduled BGQ downtime tomorrow, these systems will not be brought up before tomorrow (Friday Jul 5 2013) late morning or early afternoon. You can check the system status here on the wiki. All other SciNet systems are up and should not be affected by the downtime.

Thu Jul 4, 13:00:00 EDT 2013: On Friday July 5th, 2013, bgq and bgqdev will be taken down for maintenance. This BGQ downtime will start at 8:00 am. We expect they will be up again early afternoon on the same day. You can check the system status here on Monday. Other SciNet systems will not be affected.

Mon Jul 1, 9:00:00 EDT 2013: BGQ production system is fully up.

Jun 26 07:55:20 EDT 2013: One of the two racks of the BGQ production system is down. The remainder is operational, as is the development BGQ cluster.

Sat May 4 14:59:00 EDT: Systems all back on-line. Let us know if you encounter issues.

Sat May 4 13:13:54 EDT: Staff on-site. Expect to have at least GPC available by 4PM (if not earlier).

Sat May 4 09:15:54 EDT: Systems unavailable due to power glitch at data center; will update shortly

Tue Apr 16 12:45:52 EDT 2013: All systems are back up. Please resubmit your jobs.

Tue Apr 16 02:37:44 EDT 2013 Systems unexpectedly went down on Apr 15, 2013 around 10:45 pm due to loss of one phase of power at site. Local utility expected to restore power this morning. Check here for updates.

Thu Apr 11 14:39:27 EDT 2013: All systems are back up. Please report any problems or unusual behaviour.

Mon Apr 8 12:17:50 EDT 2013: All systems will be shutdown at 8AM on Wed, 10 April. They are expected to be back online by the evening of Thurs, 11 April. The downtime will allow us to make a number of datacenter improvements that will reduce the number of required maintenance downtimes per year and improve datacentre uptime. We also plan to upgrade the GPFS filesystem in order to allow for planned storage system upgrades later this year.

Wed Feb 27 14:15:48 Most systems are up. Please check this site for updates. Please report any problems or unusual system behaviour.

Wed Feb 27 12:55:35 Systems coming up. GPC will be accessible shortly, as will BGQ. We estimate 2PM for this. TCS may take a bit longer.

Wed Feb 27 10:01:05 Cooling restored. Power fluctuations had tripped breakers for cooling system. Computer systems are being tested before bringing them online. Further updates will be posted when available.

Wed Feb 27 03:34:03 Complete loss of cooling as of 0230 this morning. Under investigation. Unlikely that any systems will be back before noon today

Fri Feb 22, 2013, 7:30 am: The BGQ devel system shut down at 7:30 this morning because it detected a coolant issue. We hope to have it, and the production system, back up later this afternoon.

Wed Feb 20 04:12:26 EST 2013: Some compute nodes will be turned off Thursday (21 Feb) morning in order to reduce the cooling load in the datacentre. We'll be running on free-cooling only so that the bearings in the chiller can be replaced; that work is expected to be completed by end of Friday. At this point we're planning to shutdown 30 TCS nodes and the production BGQ (the devel system will keep running) on Thursday morning and 20% of the GPC on Friday morning. This will be done through reservations in the queueing system so that no jobs will be killed.

Plans may change depending on outside air temperatures and progress of the work.

Jan 17 17:21:01 EST 2013: Chiller maintenance work finished. System is running normally.


2012

Oct 22 15:20 TCS is back up. Both running and queued jobs for this system were killed. Please resubmit. All other clusters are also up.

Oct 22 15:00 GPC is back up. While running jobs were killed and should be resubmitted, previously queued jobs will now start to run.

Oct 22 14:34 While testing preparedness for power problems, an unfortunate human error in reconfiguring inadvertently triggered our emergency shutdown routine. We sincerely apologize. The systems are being brought up again. Please check back here for updates.

Oct 22 14:19 System shutdown; all running jobs lost. We will work on bringing back up the systems as soon as possible.

Oct 22 14:00:00 Logins to the SciNet systems were suddenly and unexpectedly disconnected. We are investigating the issue.

Oct 19 19:00:00 All systems should be up. Let us know if you still are experiencing difficulties.

Oct 19 16:20:00 The GPC and TCS have been brought back up. ARC, BGQ, and HPSS are not in operation yet.

Oct 19 13:05:00 Half of the GPC is being brought up again. TCS, P7, ARC, BGQ, and HPSS are not in operation yet as the chiller control system still needs repairing.

Oct 19 11:02:48 Staff and technicians on-site have concluded that a chiller control board needs to be replaced. We believe we can bring up the chiller manually now and get a portion of the GPC running by 1PM. The repair work will require a brief chiller shutdown (but no GPC shutdown) later in the day so TCS will stay off for now in order to minimize heat load.

Oct 18 23:19:04 Still seeing significant voltage fluctuations in facility power. Will keep systems off rather than risk another failure overnight. Sorry for the inconvenience. Expect to be back up by noon tomorrow (possibly earlier)

Oct 18 22:35:13 Power quality issues brought down the chiller, which required a shutdown of the clusters. Power and chiller are coming back up, and we hope to have the clusters up by morning.

Oct 18 21:01:00 The datacentre is down due to a power failure. We are investigating the problem.

Oct 5 16:36:01: The DDR portion of the cluster will be drained of jobs in order to free it up for maintenance work on Tuesday, 9 Oct at 10:30 AM. Jobs will continue to start on the DDR portion over the long weekend so long as the requested wall-clock time allows them to finish before 10:30 AM on Tuesday. The DDR partition will be back in regular service by noon Tuesday.
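
The drain rule in this notice comes down to date arithmetic: a job may start only if its start time plus its requested wall-clock time ends at or before the start of the maintenance window. A minimal Python sketch, assuming that reading of the rule; the function name `fits_before_maintenance` is hypothetical, and this is not the scheduler's actual implementation:

```python
from datetime import datetime, timedelta

# Start of the DDR maintenance window given in the announcement.
MAINTENANCE_START = datetime(2012, 10, 9, 10, 30)


def fits_before_maintenance(start, walltime, deadline=MAINTENANCE_START):
    """Return True if a job starting at `start` with the requested
    `walltime` would finish no later than `deadline`."""
    return start + walltime <= deadline


# A 48-hour job starting Friday 5 Oct at 17:00 ends Sunday at 17:00.
print(fits_before_maintenance(datetime(2012, 10, 5, 17, 0), timedelta(hours=48)))
# A 24-hour job starting Monday 8 Oct at noon would run past 10:30 on Tuesday.
print(fits_before_maintenance(datetime(2012, 10, 8, 12, 0), timedelta(hours=24)))
```

Requesting a shorter wall-clock time is therefore the only way for a job to start on the DDR portion late in the weekend.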

Oct 4, ~9:00AM: A routing issue prevented logins to SciNet. The issue is fixed; running jobs should not have been affected.

Oct 1, 8:00PM: All systems are back online.

Sep 25: All systems will be shutdown at 7AM on Monday, 1 Oct for annual cooling tower maintenance and cleaning. We expect to come back up in the evening of the same day. Check here in the late afternoon for status updates.

Tue Sep 4 16:11:21 EDT 2012: The connection to the SciNet datacentre will be interrupted from September 5 at 10:00 pm to September 6 at 2:00 am, for router maintenance. Users will not be able to log into SciNet during this window. Running jobs will NOT be affected.

Sun 15 Jul 2012 11:11:37 EDT: Systems back online. Please report any problems/issue to support@scinet.utoronto.ca

Sun 15 Jul 2012 09:24:18 EDT: Main breaker tripped. Power now restored. Cooling system coming back online. Will likely need at least a couple of hours to get systems checked and back in production.

Sun 15 Jul 2012 08:43:21 EDT: Power issue. Staff investigating.

Sun Jul 8 11:49:20 EDT 2012: Systems back online after power failure at Jul 8 09:17:13. Report any problems to support@scinet.utoronto.ca

Tue Jun 26 18:49:54 EDT 2012: Systems back online. Report any problems to support@scinet.utoronto.ca

Tue 26 Jun 2012 15:47:36 EDT: Utility power event tripped the datacenter's under-voltage protection breaker. Power appears OK now. Restarting cooling systems and then will restart compute systems. Should be back online in a couple of hours.

Tue Jun 26 2012 15:35:00 EDT: Power failure of some kind; systems down. We are investigating.

Mon Jun 25 11:22:09 EDT 2012: Systems back up

Mon Jun 25 08:51:09 EDT 2012: Under-voltage event from electrical utility automatically tripped our main circuit breaker to avoid equipment loss/damage. Power has now been restored and cooling system is being re-started. Need to check that everything is OK before restoring systems. They should be back online before noon assuming no new problems are uncovered.

Mon Jun 25 07:37:58 EDT 2012: Staff on-site. No power at main electrical panel.

Mon Jun 25 06:17:43 EDT 2012: Power failure at 0557 today. All systems shutdown. We're investigating.

Fri Jun 8 10:50:00 EDT 2012 The GPC QDR nodes will be unavailable on Monday June 11th, from 9-10am to perform network switch maintenance. All other systems and filesystems will still be available.

Thu Jun 7 20:50:00 EDT 2012 The scheduled electrical work has been completed. Systems are now available. Please email support@scinet.utoronto.ca if you experience any problems.

Thu Jun 7 07:22:24 EDT 2012 All power to the SciNet facility is off in order to complete the scheduled electrical work outlined below. The work is planned to take at least 12 hrs and the earliest we expect systems to be available to users is 10 PM tonight. Watch here for updates.

Fri Jun 1 13:14:51 EDT 2012 There will be a full SciNet shutdown on Thu Jun 7 2012, starting at 6AM.

This is the final scheduled shutdown in preparation for the installation of IBM Blue Gene/Q system. A new machine room has been built (walls, raised floor, cooling unit, electrical and water connections), but downtime is required to connect 800 kW of power from our electrical room to the new room.

All systems will go down at 6 AM on Thu 7 Jun; all login sessions and jobs will be killed at that time.

At the earliest, the systems will be available again around 10PM in the evening of Thu 7 Jun. Check on this page for updates on Thursday.

Sun May 27 09:11:53 EDT 2012: scratch became unmounted on all nodes about 0700 this morning. Problem has been resolved and /scratch has been remounted everywhere.

Sun May 27 09:11:53 EDT 2012: scratch became unmounted on all nodes. Working on fix

Thu May 10 15:17:54 EDT 2012: Systems now up. Overnight testing took longer than expected. Testing is completed; system is fully up and running. Much of the GPC is about to come out of warranty coverage this month, and the thorough pre-expiration shakedown provided by the tests during this downtime uncovered hardware or configuration issues with over 60 GPC nodes, including problems with memory DIMMs, network cards, and power supplies; these issues are now fixed or slated to be fixed with the offending nodes offlined. Testing also closely examined the new networking infrastructure at very large scale and several minor issues have been identified which will be improved in the very near future.

Thu May 10 07:31:54 EDT 2012: Systems expected to be available by 2PM today (10 May). Overnight testing took longer than expected.

Tue 8 May 2012 9:30:46 EDT: The announced 8/9 May SciNet shutdown has started. This shutdown is intended for final configurations in the changeover to full infiniband for the GPC, for some back-end maintenance, and to test the computational and file system performance of the TCS and GPC. Systems went down at 9 am on May 8; all login sessions and jobs were killed at that time. The system should be available again tomorrow evening. Check here on Wednesday for updates.

Wed 2 May 2012 10:20:46 EDT: ANNOUNCEMENT: There will be a full SciNet shutdown from Tue May 8 to Wed May 9 for final configurations in the changeover to full infiniband for the GPC, for some back-end maintenance, and to test the computational and file system performance of the TCS and GPC. Systems will go down at 9 am on May 8; all login sessions and jobs will be killed at that time. The system should be available again in the evening of the next day. Check here on Wednesday for updates.

As noted before, see GPC_Quickstart for how to run MPI jobs on the GPC in light of the new infiniband network (mostly coincides with the old way, with fewer parameters).

Wed 24 Apr 2012 12:47:46 EDT: The Apr 19 upgrade of the GPC to a low-latency, high-bandwidth Infiniband network throughout the cluster is now reflected in (most of) the wiki. The appropriate way to request nodes in job scripts for the new setup (which will coincide with the old way for many users) is described on the GPC_Quickstart page.


Thu 19 Apr 2012 19:43:46 EDT:

The GPC network has been upgraded to a low-latency, high-bandwidth Infiniband network throughout the cluster. Several significant benefits over the old ethernet/infiniband mixed setup are expected, including:

  • better I/O performance for all jobs
  • better job performance for what used to be multi-node ethernet jobs (as they will now make use of Infiniband),
  • for users that were already using Infiniband, improved queue throughput (there are now 4x as many available nodes), and the ability to run larger IB jobs.

NOTE 1: Our wiki is NOT completely up-to-date after this recent change. For the time being, you should first check this current page and the temporary Infiniband Upgrade page for anything related to networks and queueing.

NOTE 2: The temporary mpirun settings that were recommended for multinode ethernet runs are no longer in effect, as all MPI traffic is now going over InfiniBand.

NOTE 3: Though we have been testing the new system since last night, a change of this magnitude (3,000 adapter cards installed, 5km of copper cable, 35km of fibre optic cable) is likely to result in some teething problems so please bear with us over the next few days. Please report any issues/problems that are not explained/resolved after reading this current page or our Infiniband Upgrade page to support@scinet.utoronto.ca.

Thu Apr 12 17:39:50 EDT 2012: The TCS maintenance has been completed. Please report any problems.

Thu Apr 12 17:08:00 EDT 2012: scheduled maintenance downtime of the TCS. As announced, running TCS jobs and TCS login sessions were killed. All other systems are up. The TCS is expected to be up again sometime this evening.

Tue Apr 10 16:24:00 EDT 2012: scheduled downtimes:

Apr 12: TCS Shutdown (Other systems will remain up). The shutdown will start at 11 am and the system should be available again in the evening of the same day.

Wed 28 Mar 2012 21:45:03 EDT: Connection problem was caused by trouble with a filesystem manager. Problem solved.

Wed 28 Mar 2012 20:55:27 EDT: We're experiencing some problems connecting to the login nodes. Investigating.

Wed Mar 28 10:34:25 EDT 2012: There have been some GPC file system and network stability issues reported over the past few days that we believe are related to some OS configuration changes. We are in the process of resolving them. Thanks for your patience.

Tue Mar 6 18:30:00 EST 2012: We had a glitch on our core switch due to configuration errors. Unfortunately, this short outage caused GPFS to be unmounted, and jobs were killed. Systems have recovered. Please resubmit jobs.

Fri Mar 2 11:59:33 EST 2012: Roughly 1/3 of the TCS nodes thermal-checked themselves off ~1140 today due to a glitch in the water supply temperature. Unfortunately, all jobs running on those nodes were lost. Please check your jobs and resubmit if necessary.

Thu Feb 9 11:50:57 EST 2012: System Temporary Change for MPI ethernet jobs:
Due to some changes we are making to the GPC GigE nodes, if you run multinode ethernet MPI jobs (IB multinode jobs are fine), you will need to explicitly request the ethernet interface in your mpirun:

For OpenMPI -> mpirun --mca btl self,sm,tcp
For IntelMPI -> mpirun -env I_MPI_FABRICS shm:tcp

There is no need to do this if you run on IB, or if you run single node mpi jobs on the ethernet (GigE) nodes. Please check GPC_MPI_Versions for more details.
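
As an illustration of the notice above, the two sets of extra `mpirun` arguments can be kept in one place and prepended to a job's usual arguments. This is only a sketch: the dictionary keys, the function name `ethernet_mpirun`, and the example program `./my_app` are hypothetical, and only the flags themselves come from the notice.

```python
# Extra mpirun arguments quoted in the notice, keyed by MPI flavour.
# The keys and the function name are illustrative, not part of any SciNet tool.
ETHERNET_ARGS = {
    "openmpi": ["--mca", "btl", "self,sm,tcp"],
    "intelmpi": ["-env", "I_MPI_FABRICS", "shm:tcp"],
}


def ethernet_mpirun(flavour, program_args):
    """Build an mpirun argument list that explicitly requests ethernet (TCP).

    `program_args` is whatever would normally follow `mpirun`, for example
    the process count and the executable.
    """
    return ["mpirun", *ETHERNET_ARGS[flavour.lower()], *program_args]


# Hypothetical multinode ethernet job using 16 processes.
print(" ".join(ethernet_mpirun("OpenMPI", ["-np", "16", "./my_app"])))
print(" ".join(ethernet_mpirun("IntelMPI", ["-np", "16", "./my_app"])))
```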

Thu Feb 9 11:50:57 EST 2012: Scheduled downtime is over. TCS is up. GPC is coming back rack-by-rack.

Mon Jan 31 9:12:00 EST 2012: File systems (scratch and home) got unmounted around 3:30 am and again at around 23:15 on Jan/30. Jobs may have crashed. Filesystems are back now. Please resubmit your jobs.

Wed Jan 18 16:47:38 EST 2012: Full system shutdown as of 7AM on Tues, 17 Jan in order to perform annual maintenance on the chiller. Most work has been completed on schedule. Expect systems to be available by 8PM today.

Wed Jan 4, 14:48: Scratch file system got unmounted. Most jobs died. We are trying to fix the problem. Check back here for updates.

Wed Jan 3, 13:58: Datamover1 is down due to hardware problems. Use datamover2 instead.


2011

Wed Dec 28 13:46:37 EST 2011 Systems are back up. All running and queued jobs were lost, due to a power failure at the SciNet datacentre. Please resubmit your jobs. Also, please report any problems to <support@scinet.utoronto.ca>.

Wed Dec 28 09:25 EST 2011 Electrician enroute. No power at our main electrical panel.

Wed Dec 28 08:51 EST 2011 Staff enroute to datacentre. More info once we understand what has happened.

Wed Dec 28 02:33 EST 2011 Datacentre appears to have lost all power. All remote access lost.


Thu Dec 8 16:43:47 EST 2011: The GPC was transitioned to CentOS 6 on Monday, December 5, 2011. Some of the known issues (and workarounds) are listed here. Thanks for your patience and understanding! - The SciNet Team.

System appears to have stabilized. Please let us know if there are any problems.

Fri Nov 25 12:39:54 EST 2011: IMPORTANT upcoming change: The GPC will be transitioned to CentOS 6 on Monday, December 5, 2011. All GPC devel nodes will be rebooted at noon on Monday Dec 5/11, with a CentOS6 image. The compute nodes will be rebooted as jobs finish, starting on Saturday Dec 3/11. You may already submit jobs requesting the new image (os=centos6computeA), and these jobs will be serviced as the nodes get rebooted into the new OS.

Wed Nov 16 10:53:37 EST 2011: A glitch caused the scratch file system to get unmounted everywhere. We are on track to fix the situation. However, most jobs were killed and you will have to resubmit your job once scratch is back.

Recovery of /project directories for groups with a storage allocation of less than 5 TB in /project is still in progress. Until then, those directories are inaccessible (owned by root). If you can read your project directory, it means that the recovery is complete. To expedite this process, users cannot retrieve material from HPSS for now.

Note that the monthly purge of the scratch space will be delayed until Friday, 18 Nov because of the downtime.

Wed Nov 15 10:47:37 EST 2011: All systems are up and accessible again. Both /project and HPSS now follow the same new directory structure as on /home and /scratch, i.e. /project/<first-letter-of-group>/group/user.

Be aware that for groups with a storage allocation of less than 5 TB in /project, recovery of their directories is in progress and will finish in the next day or so. Until then, those directories are inaccessible (owned by root). If you can read your project directory, it means that the recovery is complete.

Note that the monthly purge of the scratch space will be delayed until Friday, 18 Nov because of the downtime.

Tue Nov 14 9:40:37 EST 2011: All systems will be shutdown Monday morning in order to complete the disk rearrangement begun this past week. Specifically, the /project disks will be reformatted and added to the /scratch filesystem. The new /scratch will be larger and faster (because of more spindles and a second controller). We expect to be back online by late afternoon.

The monthly purge of the scratch space will be delayed until Friday, 18 Nov because of the downtime.

For groups with storage allocations, a new /project will be created but disk allocations on it will be decreased and the difference made up with allocations on HPSS. Both /project and HPSS will follow the same new directory structure as now used on /home and /scratch.

Tue Nov 8 17:05:37 EST 2011: Filesystem hierarchy has been renamed as per past emails and the newsletter. e.g. the home directory of user 'resu' in group 'puorg' is now /home/p/puorg/resu and similarly for /scratch. The planned changes to /scratch (new disks and controller) have been postponed until later this month. /project remains read-only for now and there will be follow-up email to project users tomorrow.
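
The renamed hierarchy inserts the first letter of the group name as an extra directory level. A minimal Python sketch of that mapping, using the example from this notice; the function name `hierarchy_path` is hypothetical:

```python
def hierarchy_path(filesystem, group, user):
    """Return the directory for `user` in `group` under the renamed layout:
    /<filesystem>/<first letter of group>/<group>/<user>."""
    if not group:
        raise ValueError("group name must not be empty")
    return f"/{filesystem}/{group[0]}/{group}/{user}"


# Example from the notice: user 'resu' in group 'puorg'.
print(hierarchy_path("home", "puorg", "resu"))     # /home/p/puorg/resu
print(hierarchy_path("scratch", "puorg", "resu"))  # /scratch/p/puorg/resu
```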

Tue Nov 8 11:24:56 EDT 2011 Systems are down for scheduled maintenance. Expect to come back online in early evening after filesystem changes detailed in recent emails (including new directory naming hierarchy).

Wed Nov 2 13:45:24 EDT 2011 Reminder - shutdown of all systems scheduled for 9AM on Tues 8 Nov. Expect to come back online in early evening after filesystem changes detailed in recent emails (including new directory naming hierarchy).

Sat Oct 15 10:44:27 EDT 2011 Filesystem issues resolved.

Sat Oct 15 09:57:57 EDT 2011 Filesystem appears hung because of issues to do with OOM'd nodes and users close to quota limits. We're trying to resolve.

Fri Oct 7 14:00:00 EDT 2011 Scheduler issues resolved

Fri Oct 7 11:17:08 EDT 2011 The scheduler is having issues; we are investigating. The file system also seems unhappy.

Wed Oct 5 02:50:08 EDT 2011 Transformer maintenance complete. Systems back online

Wed Oct 5 01:18:27 EDT 2011 Power was restored ~12:30. Cooling has been restored as of ~ 1:10AM. Starting to bring up filesystems. Will be at least an hour before users can get access

Tue Oct 4 23:38:57 EDT 2011 Electrical crews had some problems. Expect power to be restored by midnight but then will take at least 2 hrs before we have cooling working and can hope to have systems back. Will update

Tue Oct 4 14:24:52 EDT 2011 System shutdown scheduled for 2PM on Tues, 4 Oct for maintenance on building transformers. All logins and jobs will be killed at that time. Power may not be restored until midnight with systems coming back online 1-2 hrs later.

Sun Oct 2 14:20:57 EDT 2011 Around noon on Sunday we had a hiccup with GPFS that unmounted /scratch and /project on several compute nodes. Systems were back to normal at around 2PM. Several jobs were disrupted, but if your job survived please be sure it produced the expected results.

Sun Oct 2 12:20:57 EDT 2011: GPFS problems. We are investigating

Wed Sep 7 18:26:49 EDT 2011: Systems back online after emergency repair of condenser water loop.

Wed Sep 7 16:28:13 EDT 2011: Cleaning up disk issues. Hope to make systems available again by 19:00-20:00

Wed Sep 7 16:02:19 EDT 2011: Cooling has been restored. Emergency repair was underway but couldn't prevent the shutdown. Update about expected system availability timing by 16:30

Wed Sep 7 15:53:34 EDT 2011: Cooling plant problems.

Sep 3 10:21:37 EDT 2011: scratch is over 97% full, and may be making general access very slow for everyone. A couple of users are close to their quota limits, and are being contacted. If possible please delete any non-essential or temporary files. Your cooperation is much appreciated.

Tue Aug 30 14:58:41 EDT 2011: Systems likely available to users by 3:30PM

Tue Aug 30 13:21:45 EDT 2011: Cracked and suspect valves replaced. Cooling restored. Starting to bring up and test systems. Next update by 3PM

Mon Aug 29 16:01:45 EDT 2011: All systems will be shutdown at 0830 on Tuesday, 30 Aug to replace a cracked valve in the primary cooling system. Expect systems to be back online by 5-6PM today.

Wed Aug 24 13:00:00 EDT 2011: Systems back on-line. Pump rewired and cooling system restarted

Wed Aug 24 07:40:39 EDT 2011: On-site since 0430. Main chilled water pump refuses to start. Technician investigating.

Wed 24 Aug 2011 03:43:16 EDT Emergency shutdown due to failure in cooling system. More later once situation diagnosed

Mon Aug 22 23:29:05 EDT 2011: GPC scheduler appears to be working properly again

Mon Aug 22 22:41:40 EDT 2011: ongoing issues with scheduler on GPC. Working on fix

Mon Aug 22 21:53:40 EDT 2011: Systems are back up

Mon Aug 22 17:14:06 EDT 2011: Bringing up and testing systems. Another update by 8PM

Mon Aug 22 16:13:18 EDT 2011: Cracked valve has been replaced. Cooling system has been restarted. Expect to be online this evening. Another update by 5PM

Mon Aug 22 14:54:45 EDT 2011: Emergency shutdown at 3PM today (22 Aug). Possibility of severe water damage otherwise. More information by 5PM.

Sun Aug 21 18:53:35 EDT 2011: Systems being brought back online and tested. Login should be enabled by 8-9 PM

Sun Aug 21 17:22:12 EDT 2011: Cooling has been restored. Starting to bring back systems but encountering some problems. Check back later - should have a handle on timelines by 7PM.

Sun Aug 21 15:15:35 EDT 2011: Major storms in Toronto. Power glitch appears to have knocked out cooling systems. All computers shutdown to avoid overheating. More later as we learn exactly what has happened.

Thu Aug 9, 12:31:03 EDT 2011 If you had jobs running when the file system failed this morning (10:30AM; 9 Aug), please check their status, as many jobs died.

Mon Jul 18, 15:04:17 EDT 2011 Datamovers and large-memory nodes are up and running. Intel license server is functional again. Purging of /scratch will happen next Friday, July/22.

Tue Jun 14, 14:05 EDT 2011 Many SciNet staff are at the HPCS meeting this week so, apologies in advance, but we may not respond as quickly as usual to email.

Sun May 22 12:15:29 EDT 2011 Cooling pump failure at data centre resulted in emergency shutdown last night. Systems are back to normal.

Tue May 17 17:57:41 EDT 2011 Cause of the last two chiller outages has been fixed. All available systems back online.

Thu May 12 12:23:35 EDT 2011 NOTE: There is some UofT network reconfiguration and maintenance on Friday morning May 13 07:45-08:15 that most likely will disrupt external network connections and data transfers. Local running jobs should not be affected.

Thu May 5 19:17:37 EDT 2011 Chiller has been fixed. Systems back to normal.

You can check our twitter feed, @SciNetHPC, for updates.

Wed May 4 22:29:00 EDT 2011 Cooling failure at data centre resulted in an emergency shutdown last night. Chiller replacement part won't arrive until noon tomorrow and then needs installation and testing. Roughly 40% of the GPC is currently available. We're using free-cooling instead of the chiller and therefore can't risk turning on more nodes. NOTE - we plan to keep these systems up tomorrow when parts are replaced but if there are complications they may be shutdown without warning. You can check our twitter feed, @SciNetHPC, for updates.

Wed May 4 16:43:34 EDT 2011 Cooling failure at data centre resulted in an emergency shutdown last night. Chiller replacement part won't arrive until noon tomorrow and then needs installation and testing. Roughly 1/4 of the GPC (evenly split between IB and gigE) is currently available. We're using free-cooling instead of the chiller and may increase the number of nodes depending on how the room responds. NOTE - we plan to keep these systems up tomorrow when parts are replaced but if there are complications they may be shutdown without warning. You can check our twitter feed, @SciNetHPC, for updates.

Tues 3 May 2011 20:30 Cooling failure at data centre has resulted in emergency shutdown. This space will be updated when more information is available.

Thu 27 Apr 2011 20:21 EDT: System maintenance completed. All systems are operational. TCS is back to normal too, with tcs01 and tcs02 as the devel nodes.

Sun 17 Apr 2011 18:34:10 EDT Problem with building power supply at 2AM on Saturday took down the cooling system. Power has been restored but new problems have arisen with TCS water units.

We have been able to partially restore the TCS. The usual development nodes, tcs01 and tcs02 are not available. However, we have created a temporary workaround using node tcs03 (tcs-f02n01), where you can login, compile, and submit jobs. Not all the compute nodes are up, but we are working on that. Please let us know if there are any problems with this setup.

Sat Apr 16 13:01:14 EDT 2011 Problem with building power supply at 2AM today took down the cooling system. Waiting for electrical utility to check lines.

Wed Mar 23 19:06:21 EDT 2011: A VSD controller failed in the cooling system. A temporary solution will allow systems to come back online this evening (likely by 8PM). There will likely need to be downtime in the next two days in order to replace the controller.

Wed Mar 23 17:00 EST 2011 Down due to cooling failure. Being worked on. Large electrical surge took out datacentre fuses. No estimated time to solution yet.

Mon Mar 7 14:39:59 EST 2011: NB - the SciNet network connection will be cut briefly (5 mins or less) at about 8AM on Tues, 8 March in order to test the new UofT gateway router connection.

Thu Feb 24 13:21:38 EST 2011: Systems were shut down at 0830 today for scheduled repair of a leak in the cooling system. GPC is online. TCS is expected to be back online by 4PM.

Tue Feb 24 9:35:31 EST 2011: All systems will be shut down at 0830 Thursday, 24 Feb in order to repair a small leak in the cooling system. Systems will be back online by afternoon or evening. Check back here for updates during the day.

Wed Feb 9 10:27:13 EST 2011: There was a cooling system failure last night, causing all running and queued jobs to be lost. All systems are back up.

Sat Feb 5 20:33:34 EST 2011: Power outage in Vaughan. TCS jobs all died. Some GPC jobs have survived.

Sat Feb 5 18:51:00 EST 2011: We just had a hiccup with the cooling tower. We suspect it's a power-grid issue, but are investigating the situation.

Thu Jan 20 18:25:54 EST 2011: Maintenance complete. Systems back online.

Wed Jan 19 07:22:32 EST 2011: Systems offline for scheduled maintenance of chiller. Expect to be back on-line in evening of Thurs, 20 Jan. Check here for updates and revised estimates of timing.


2010

Fri Dec 24 15:40:12 EST 2010: We experienced a failure of the /scratch filesystem, with a corrupt quota mechanism, on the afternoon of Friday December 24. This resulted in a general gpfs failure that required the system to be shut down and rebooted. Consequently all running jobs were lost. We apologize for the disruption this has caused, as well as the lost work.

Happy holidays to all! The SciNet team.

Note: From Dec 22, 2010 to Jan 2, 2011, the SciNet offices are officially closed, but the system will be up and running and we will keep an eye out for emergencies.

Mon Dec 6 17:42:12 EST 2010: File system problems appear to have been resolved.

Mon Dec 6 16:39:55 EST 2010: File system issues on both TCS and GPC. Investigating.

Fri Nov 26 18:30:55 EST 2010: All systems are up and accepting user jobs. There was a province-wide power issue at 1:30 on the morning of Friday Nov 26, which caused the chiller to fail and all systems to shut down. All jobs running or queued at the time were killed as a result. The system is accessible now and you can resubmit your jobs.

Fri Nov 26 6:15 EST 2010: The data center had a heating issue at around 1:30am Fri Nov 26 and the systems are down right now. We are investigating the cause of the problem and working to fix it so that we can bring the systems back up.

Fri Nov 5 15:26 EDT 2010: Scratch is down; we are working on it.

Tue Oct 26 10:32:22 EDT 2010: Scratch is hung; we're investigating.

Fri Sep 24 23:02:41 EDT 2010: Systems are back up. Please report any problems to <support at scinet dot utoronto dot ca>

Fri Sep 24 19:55:02 EDT 2010: Chiller restarted. Systems should be back online this evening - likely 9-10PM. A widespread power glitch seems to have confused one of the Variable Speed Drives, and the control system was unable to restart it automatically.

Fri Sep 24 16:44:39 EDT 2010: The systems were unexpectedly and automatically shut down, due to a failure of the chiller. We are investigating and will bring the systems back up as soon as possible.

Mon Sep 13 15:39:05 EDT 2010: The systems were unexpectedly shut down, automatically, due to a failure of the chiller. We are investigating and will bring the systems back up as soon as possible.

Fri Sep 10 15:08:53 EDT 2010: GPFS upgraded to 3.3.0.6; new scratch quotas applied; check yours with /scinet/gpc/bin/diskUsage.

Mon Aug 16 13:12:08 EDT 2010: Filesystems are accessible and login is normal.

Mon Aug 16 12:38:45 EDT 2010: Login access to SciNet is failing due to /home filesystem problems. We are actively working on a solution.

Mon Aug 9 16:10:03 EDT 2010: The file system is slow and the scheduler cannot be reached. We are working on the problem.

Sat Aug 7 11:42:50 EDT 2010: System status: Normal.

Sat Aug 7 10:45:52 EDT 2010: Problems with the scheduler not responding on GPC. Should be fixed within an hour.

Wed Aug 4 14:03:59 EDT 2010: The GPC scheduler is currently having a filesystem issue, which means you may have difficulty submitting jobs and monitoring the queue. We are working on the issue now.

Fri Jul 23 17:50:00 EDT 2010: Systems are up again after testing and maintenance.

Fri Jul 23 16:14:02 EDT 2010: Systems are down for testing and maintenance. Expect to be back up about 9PM this evening.

Tue Jul 20 17:40:45 EDT 2010: Systems are back. You may resubmit jobs.

Tue Jul 20 15:39:37 EDT 2010: Most GPC jobs died as well. The ones that appear to be running are likely in an unknown state, so will be killed in order to ensure they do not produce bogus results. The machines should be up shortly. Thanks for your patience and understanding. The SciNet team.

Tue Jul 20 15:19:30 EDT 2010: File systems down. We are looking at the problem now. All jobs on TCS have died. We'll inform you when the machine is available for use again. Please log out from the TCS.

Tue Jul 20 14:26:10 EDT 2010: Scratch file system down. We are looking at the problem now.

Fri Jul 16 10:59:27 EDT 2010: Systems normal.

Sun Jul 11 13:08:02 EDT 2010: All jobs running at ~3AM this morning almost certainly failed and/or were killed about 11AM. A hardware failure has reduced /home and /scratch performance by a factor of 2 but should be corrected tomorrow.

Sun Jul 11 09:56:27 EDT 2010: /scratch was inaccessible as of about 3AM.

Fri Jul 9 16:18:18 EDT 2010: /scratch is accessible again.

Fri Jul 9 15:38:43 EDT 2010: New trouble with the filesystems. We are working to fix things.

Fri Jul 9 11:28:00 EDT 2010: The /scratch filesystem died at about 3 AM on Fri Jul 9, and all jobs running at the time died.