Blue Gene/Q (BGQ)
[Image: Blue Gene Cabinet.jpeg]

Installed: August 2012
Operating System: RH6.2, CNK (Linux)
Number of Nodes: 2,048 (32,768 cores), 512 (8,192 cores)
Interconnect: 5D Torus (jobs), QDR InfiniBand (I/O)
RAM/Node: 16 GB
Cores/Node: 16 (64 threads)
Login/Devel Nodes: bgq01, bgq02
Vendor Compilers: bgxlc, bgxlf
Queue Submission: LoadLeveler

Specifications

BGQ is an extremely dense and energy-efficient third-generation IBM supercomputer built around a system-on-a-chip compute node: a 16-core, 1.6 GHz PowerPC-based CPU (PowerPC A2) with 16 GB of RAM, running a very lightweight Linux OS called CNK. The nodes are bundled in groups of 32 onto node boards, 16 of these boards make up a midplane, and 2 midplanes fill a rack, for 16,384 cores and 16 TB of RAM per rack. The compute nodes are all connected by a custom 5D torus high-speed interconnect. Each rack also has 16 I/O nodes that run a full Red Hat Linux OS, manage the compute nodes, and mount the GPFS filesystem.
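As a quick check of those totals (plain shell arithmetic, nothing BGQ-specific assumed):

echo $((32 * 16 * 2))        # 32 nodes/board x 16 boards/midplane x 2 midplanes = 1024 nodes per rack
echo $((32 * 16 * 2 * 16))   # x 16 cores/node = 16384 cores per rack (and x 16 GB = 16 TB of RAM)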

[Image: BlueGeneQHardware2.png]

5D Torus (network)

The network topology of Blue Gene/Q is a five-dimensional (5D) torus, with direct links between the nearest neighbours in the ±A, ±B, ±C, ±D, and ±E directions. As such, there are only a few optimum block sizes that use the network efficiently; they are listed below, and note that they are all powers of 2.

Node Boards          Compute Nodes   Cores    Torus Dimensions
1                    32              512      2x2x2x2x2
2 (adjacent pairs)   64              1,024    2x2x4x2x2
4 (quadrants)        128             2,048    2x2x4x4x2
8 (halves)           256             4,096    4x2x4x4x2
16 (midplane)        512             8,192    4x4x4x4x2
32 (1 rack)          1,024           16,384   4x4x4x8x2
64 (2 racks)         2,048           32,768   4x4x8x8x2
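Note that each block's torus dimensions multiply out to its node count; for a midplane, for instance:

echo $((4 * 4 * 4 * 4 * 2))   # 4x4x4x4x2 = 512 nodes
echo $((512 * 16))            # x 16 cores/node = 8192 cores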

Devel Nodes

The devel nodes for the BGQ are bgq01 and bgq02.


Jobs

BGQ job size is typically determined in midplanes (512 nodes, or 8,192 cores); however, sub-blocks can be used to subdivide a midplane further, with a minimum of one I/O node per block. In SciNet's configuration (8 I/O nodes per midplane), this makes 64 nodes (1,024 cores) the smallest possible job size. A block is set up to match a job's configuration (number of nodes and network configuration) and is partitioned and booted specifically for that job. If a new job's requirements match an existing booted block, the nodes do not need to be rebooted or repartitioned; if any of the parameters differ, however, the block must be recreated and rebooted to match the new job's specifications. When running under LoadLeveler this happens automatically and, thanks to the lightweight OS on the compute nodes, very quickly.
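When running under LoadLeveler, a block is requested implicitly by size. The script below is an illustrative sketch only: job_type = bluegene and bg_size are standard LoadLeveler Blue Gene keywords, but the exact set of directives required on this system is an assumption here, and myprog is a placeholder executable.

#!/bin/sh
# Illustrative LoadLeveler script for a Blue Gene job (directives assumed,
# not SciNet-verified); "myprog" is a hypothetical executable.
# @ job_name         = bgq_sample
# @ job_type         = bluegene          # Blue Gene job class
# @ bg_size          = 64                # nodes: the smallest block here (1,024 cores)
# @ output           = $(job_name).$(jobid).out
# @ error            = $(job_name).$(jobid).err
# @ wall_clock_limit = 00:30:00
# @ queue
runjob --np 1024 --ranks-per-node=16 : ./myprog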


Compile

The BGQ uses the IBM XL series compilers in cross-compile mode; the MPI compiler wrappers are found in the driver directory:

/bgsys/drivers/V1R1M1/ppc64/comm/xl/bin/mpich2version
/bgsys/drivers/V1R1M1/ppc64/comm/xl/bin/mpixlc
/bgsys/drivers/V1R1M1/ppc64/comm/xl/bin/mpixlf90
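For example (a sketch: hello.c and hello.f90 are hypothetical source files, and the export assumes you want the wrappers on your PATH):

export PATH=/bgsys/drivers/V1R1M1/ppc64/comm/xl/bin:$PATH
mpixlc -O3 -o hello hello.c         # cross-compiles C for the compute nodes
mpixlf90 -O3 -o hello_f hello.f90   # Fortran 90 equivalent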

Run a Job

When not using LoadLeveler, there is a direct-launch program on the BGQ called runjob, which acts much like mpirun/mpiexec. The --block argument names a predefined group of nodes that has already been booted; see the next section on how to create these blocks manually. Note that a block does not need to be rebooted between jobs, only when the number of nodes or the network parameters change.

runjob --block R00-M0-N03-32 --ranks-per-node=16 --np 512 --cwd=/gpfs/DDNgpfs3/xsnorthrup/osu_bgq --exe=/gpfs/DDNgpfs3/xsnorthrup/osu_bgq/osu_mbw_mr --args file.in

or, using the other form, which is convenient if your application takes multiple arguments:

runjob --block R00-M0-N03-32 --ranks-per-node=16 --np 512 --cwd=/gpfs/DDNgpfs3/xsnorthrup/osu_bgq : /gpfs/DDNgpfs3/xsnorthrup/osu_bgq/osu_mbw_mr file.in

The flag

--verbose #

where # ranges from 1 to 7, is very useful if you are trying to debug an application.
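For example, reusing the block and executable from above:

runjob --verbose 4 --block R00-M0-N03-32 --ranks-per-node=16 --np 512 : /gpfs/DDNgpfs3/xsnorthrup/osu_bgq/osu_mbw_mr file.in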


To see running jobs and the status of available blocks, use:

list_jobs
list_blocks

Setup blocks

To reconfigure the BGQ nodes, use bg_console:

bg_console

There are various options for creating block types (section 3.2 in the BGQ admin manual), but the smallest block is created using the following command:

gen_small_block <blockid> <midplane> <cnodes> <nodeboard> 
gen_small_block  R00-M0-N03-32 R00-M0 32 N03

The block then needs to be booted using:

allocate R00-M0-N03-32

If those resources are already booted into another block, that block must be freed before the new block can be allocated.

free R00-M0-N03
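Putting the pieces together, a typical manual cycle looks like the sketch below; the first three commands are issued inside bg_console, the runjob line from an ordinary shell, and myprog is a placeholder executable.

free R00-M0-N03                               # release any block already using those resources
gen_small_block R00-M0-N03-32 R00-M0 32 N03   # define a new 32-node block
allocate R00-M0-N03-32                        # boot it
runjob --block R00-M0-N03-32 --ranks-per-node=16 --np 512 : ./myprog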

There are many other functions in bg_console:

help all

The BGQ default nomenclature for hardware is as follows:

(R)ack - (M)idplane - (N)ode board or block - (J)node - (C)ore

So R00-M01-N03-J00-C02 corresponds to the first rack, second midplane, fourth node board, first node, and third core (all components are numbered from zero).

I/O

GPFS



Documentation

BGQ System Administration Guide

BGQ Application Development