Infiniband Upgrade

From oldwiki.scinet.utoronto.ca
Revision as of 15:46, 19 April 2012 by Cloken (talk | contribs)

On April 19, 2012, SciNet upgraded the GPC to be fully Infiniband connected. The new Infiniband network replaces the former Ethernet-connected section of the GPC with 5:1 QDR Infiniband, while the existing 1:1 DDR Infiniband section remains. The GPC now consists of 840 DDR nodes (6,720 cores) and 3,024 QDR nodes (24,192 cores). The two Infiniband sections are connected, but in general operation a job will stay within one section or the other. The QDR and DDR sections are fully compatible in terms of hardware and software, so no recompilation or different MPI flags are required. Nor is recompilation required for jobs that previously ran on the Ethernet section of the GPC.

Job Submission

For most users, existing job submission scripts will still work as expected. All jobs now run on Infiniband, so the :ib: parameter formerly used to request Infiniband nodes is no longer necessary, although it is still accepted. By default, a job goes to whichever network section best accommodates it, typically smaller jobs to the QDR section and larger jobs to the DDR section. A user can override this by adding the flag "ddr" or "qdr" to the job's resource request.

For example, to request two nodes anywhere on the GPC (QDR or DDR), use

#PBS -l nodes=2:ppn=8,walltime=1:00:00

in your job submission script.

For two nodes using DDR, use

#PBS -l nodes=2:ddr:ppn=8,walltime=1:00:00

For two nodes using QDR, use

#PBS -l nodes=2:qdr:ppn=8,walltime=1:00:00
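
The three request forms above differ only in the optional section flag. As a minimal sketch, the following bash function builds the resource string from a node count, a section and a walltime. The function name gpc_request is hypothetical and not part of any SciNet tool, and ppn=8 is hard-coded to match the examples above. The resulting string could be placed after "#PBS -l" in a script or, assuming the standard qsub -l option, passed on the command line.

```shell
#!/bin/bash
# Hypothetical helper (not a SciNet tool): prints the resource string for
# "#PBS -l" or "qsub -l".
# Arguments: node count, section ("ddr", "qdr", or "" for either), walltime.
gpc_request() {
    local nodes="$1"
    local section="$2"
    local walltime="$3"

    # Accept only the section flags described on this page.
    case "$section" in
        ""|ddr|qdr) ;;
        *) echo "unknown section: $section" >&2; return 1 ;;
    esac

    # ${section:+:$section} expands to ":ddr" or ":qdr", or to nothing
    # when no section was given.
    echo "nodes=${nodes}${section:+:$section}:ppn=8,walltime=${walltime}"
}

gpc_request 2 ""  1:00:00   # either section
gpc_request 2 ddr 1:00:00   # DDR only
gpc_request 2 qdr 1:00:00   # QDR only
```

On the command line this might be used as qsub -l "$(gpc_request 2 qdr 1:00:00)" followed by the name of your job script.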

Parameters for mpirun

No special MPI parameters are required to run a job over Infiniband; it is used by default, so most users should simply use the basic mpirun commands as outlined in GPC_MPI_Versions. For more detailed information, please see the recent Techtalk on IB.

NRAC Allocations and Fairshare

The NRAC allocations for the current year, which were based on separate Ethernet and Infiniband sections, will carry over. However, each allocation now applies to the full GPC, not just to one subsection. So if you were allocated 500 hours on Infiniband, your fairshare allocation will still be 500 hours, but 500 out of 30000 instead of 500 out of 7000. If you received two allocations, one on gigE and one on IB, they will simply be combined. This should benefit all users, since unifying the GPC provides a larger pool of nodes, which increases the probability that your job will run.