BGQ
Blue Gene/Q (BGQ) | |
---|---|
Installed | August 2012 |
Operating System | RH6.2, CNK (Linux) |
Number of Nodes | 2048 (32,768 cores), 512 (8,192 cores) |
Interconnect | 5D Torus (jobs), QDR Infiniband (I/O) |
Ram/Node | 16 GB |
Cores/Node | 16 (64 threads) |
Login/Devel Node | bgqdev-fen1,bgq-fen1 |
Vendor Compilers | bgxlc, bgxlf |
Queue Submission | Loadleveler |
System Status
The current BGQ system status can be found on the wiki's Main Page.
SOSCIP
The BGQ is a Southern Ontario Smart Computing Innovation Platform (SOSCIP) BlueGene/Q supercomputer located at the University of Toronto's SciNet HPC facility. The SOSCIP multi-university/industry consortium is funded by the Ontario Government and the Federal Economic Development Agency for Southern Ontario [1].
Support Email
Please use <bgq-support@scinet.utoronto.ca> for BGQ-specific inquiries.
Specifications
BGQ is an extremely dense and energy-efficient third-generation Blue Gene IBM supercomputer built around a system-on-a-chip compute node that has a 16-core 1.6 GHz PowerPC-based CPU (PowerPC A2) with 16 GB of RAM. The nodes are bundled in groups of 32 into a node board (512 cores), and 16 boards make up a midplane (8,192 cores), with 2 midplanes per rack, or 16,384 cores and 16 TB of RAM per rack. The compute nodes run a very lightweight Linux-based operating system called CNK. The compute nodes are all connected together using a custom 5D torus high-speed interconnect. Each rack has 16 I/O nodes that run a full Red Hat Linux OS that manages the compute nodes and mounts the filesystem. SciNet has 2 BGQ systems: a half-rack 8,192-core development system and a 2-rack 32,768-core production system.
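The per-rack figures follow from the packaging hierarchy described above. As a quick sanity check, the arithmetic can be sketched in bash (the variable names are illustrative only):

```shell
#!/bin/bash
# Packaging hierarchy of the BGQ as described in the text.
cores_per_node=16
ram_per_node_gb=16
nodes_per_board=32
boards_per_midplane=16
midplanes_per_rack=2

nodes_per_midplane=$(( nodes_per_board * boards_per_midplane ))
nodes_per_rack=$(( nodes_per_midplane * midplanes_per_rack ))
cores_per_rack=$(( nodes_per_rack * cores_per_node ))
ram_per_rack_tb=$(( nodes_per_rack * ram_per_node_gb / 1024 ))

echo "nodes per midplane: $nodes_per_midplane"
echo "nodes per rack:     $nodes_per_rack"
echo "cores per rack:     $cores_per_rack"
echo "RAM per rack (TB):  $ram_per_rack_tb"
```

With 1024 nodes per rack, the half-rack development system corresponds to one midplane and the 2-rack production system to four.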
5D Torus Network
The network topology of Blue Gene/Q is a five-dimensional (5D) torus, with direct links between the nearest neighbors in the ±A, ±B, ±C, ±D, and ±E directions. As a result, only a few block sizes use the network efficiently.
Node Boards | Compute Nodes | Cores | Torus Dimensions |
---|---|---|---|
1 | 32 | 512 | 2x2x2x2x2 |
2 (adjacent pairs) | 64 | 1024 | 2x2x4x2x2 |
4 (quadrants) | 128 | 2048 | 2x2x4x4x2 |
8 (halves) | 256 | 4096 | 4x2x4x4x2 |
16 (midplane) | 512 | 8192 | 4x4x4x4x2 |
32 (1 rack) | 1024 | 16384 | 4x4x4x8x2 |
64 (2 racks) | 2048 | 32768 | 4x4x8x8x2 |
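The node count in each row is simply the product of the five torus dimensions. A minimal sketch that multiplies out a shape string, assuming the AxBxCxDxE notation used in the table (the function name torus_nodes is hypothetical):

```shell
#!/bin/bash
# Multiply out a torus shape such as "4x4x8x8x2" to get its node count.
torus_nodes() {
    local shape=$1 product=1 dim
    local IFS=x                      # split the shape on the "x" separators
    for dim in $shape; do
        product=$(( product * dim ))
    done
    echo "$product"
}

# Reproduce a few rows of the table (16 cores per node).
for shape in 2x2x2x2x2 4x4x4x4x2 4x4x8x8x2; do
    nodes=$(torus_nodes "$shape")
    echo "$shape: $nodes nodes, $(( nodes * 16 )) cores"
done
```

The same helper is handy for checking a sub-block shape against the number of nodes it is supposed to cover.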
Login/Devel Nodes
The development nodes for the BGQ are bgqdev-fen1 for the half-rack development system and bgq-fen1 for the 2-rack production system. You can log in to them from the regular login.scinet.utoronto.ca login nodes, or log in directly from outside to the half-rack development system using bgqdev.scinet.utoronto.ca, e.g.
$ ssh -l USERNAME bgqdev.scinet.utoronto.ca -X
where USERNAME is your username on the BGQ and the -X flag is optional, needed only if you will use X graphics.
Note: To learn how to setup ssh keys for logging in please see Ssh keys.
These development nodes are Power7 machines running Linux which serve as compilation and submission hosts for the BGQ. Programs are cross-compiled for the BGQ on these nodes and then submitted to the queue using loadleveler.
Modules and Environment Variables
To use most packages on the SciNet machines, including most of the compilers, you will have to use the module command. The command module load some-package will set your environment variables (PATH, LD_LIBRARY_PATH, etc.) to include the default version of that package, while module load some-package/specific-version will load a specific version of that package. This makes it very easy for different users to use different versions of compilers, MPI libraries, other libraries, etc.
A list of the installed software can be seen on the system by typing
$ module avail
To load a module (for example, the default version of the IBM XL C/C++ compilers)
$ module load vacpp
To unload a module
$ module unload vacpp
To unload all modules
$ module purge
These commands can go in your .bashrc files to make sure you are using the correct packages.
Modules that load libraries define environment variables pointing to the location of the library files and include files, for use in Makefiles. These environment variables follow the naming convention
$SCINET_[short-module-name]_BASE
$SCINET_[short-module-name]_LIB
$SCINET_[short-module-name]_INC
for the base location of the module's files, the location of the library binaries, and the location of the header files, respectively.
So to compile and link against the library, you will have to add -I${SCINET_[short-module-name]_INC} and -L${SCINET_[short-module-name]_LIB}, respectively, in addition to the usual -l[libname].
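As an illustration of this naming convention, the compile and link flags can be assembled from the environment with a small helper. This is only a sketch: the function name scinet_flags is hypothetical, and the short module name used in the variable names (for instance GSL for the gsl module) is an assumption that should be checked with the env command after loading the module.

```shell
#!/bin/bash
# Assemble -I/-L/-l flags for one library module from its SCINET_* variables.
#   $1      short module name as it appears in the variable names
#           (ASSUMPTION: e.g. GSL for the gsl module)
#   $2...   library names to link, without the "lib" prefix
scinet_flags() {
    local name=$1 inc lib flags l
    shift
    # printenv fails if the variable is not exported, i.e. module not loaded
    inc=$(printenv "SCINET_${name}_INC") || inc=""
    lib=$(printenv "SCINET_${name}_LIB") || lib=""
    if [ -z "$inc" ] || [ -z "$lib" ]; then
        echo "SCINET_${name}_INC/_LIB not set; is the module loaded?" >&2
        return 1
    fi
    flags="-I$inc -L$lib"
    for l in "$@"; do
        flags="$flags -l$l"
    done
    echo "$flags"
}
```

Under those assumptions, a compile line could then read bgxlc main.c $(scinet_flags GSL gsl gslcblas) once the gsl module is loaded.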
Note that a module load command only sets the environment variables in your current shell (and any subprocesses that the shell launches). It does not affect other shell environments.
If you always require the same modules, it is easiest to load those modules in your .bashrc and then they will always be present in your environment; if you routinely have to flip back and forth between modules, it is easiest to have almost no modules loaded in your .bashrc and simply load them as you need them (and have the required module load commands in your job submission scripts).
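For the first approach, the module load commands can be wrapped in a small function in .bashrc, so that the file still works on machines without the module command. A minimal sketch, assuming you want the xlf, vacpp and mpich2 modules from the table of installed software (the function name load_bgq_modules is hypothetical):

```shell
#!/bin/bash
# Load the modules a project needs; suitable for .bashrc or a job script.
load_bgq_modules() {
    # Outside the SciNet systems the module command may not exist.
    if ! command -v module >/dev/null 2>&1; then
        echo "module command not found; skipping module setup" >&2
        return 1
    fi
    module load xlf vacpp mpich2
}
```

Calling load_bgq_modules at the end of .bashrc, or at the top of a job submission script, then gives the same environment in both places.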
Compilers
The BGQ uses IBM XL compilers to cross-compile code for the BGQ. Compilers are available for FORTRAN, C, and C++. They are accessible by default, or by loading the xlf and vacpp modules. By default the compilers produce static binaries; however, on the BGQ it is now possible to use dynamic libraries as well. The compilers follow the XL conventions with the prefix bg, so bgxlc and bgxlf90 are the C and FORTRAN compilers, respectively.
Most users, however, will use the MPI variants, i.e. mpixlf90 and mpixlc, which are available by loading the mpich2 module.
module load mpich2
It is recommended to use at least the following flags when compiling and linking
-O3 -qarch=qp -qtune=qp
If you want to build a package for which the configure script tries to run small test jobs, the cross-compiling nature of the BGQ can get in the way. In that case, you should use the interactive debugjob environment as described below.
Software modules installed on the BGQ
Software | Version | Comments | Command/Library | Module Name |
---|---|---|---|---|
Compilers | ||||
IBM fortran compiler | 14.1 | These are cross compilers | bgxlf,bgxlf_r,bgxlf90,... | xlf |
IBM c/c++ compilers | 12.1 | These are cross compilers | bgxlc,bgxlC,bgxlc_r,bgxlC_r,... | vacpp |
MPICH2 MPI library | 1.4.1 | There are 4 versions (see BGQ Applications Development document). | mpicc,mpicxx,mpif77,mpif90 | mpich2 |
GCC Compiler | 4.4.6 | GNU Compiler Collection for BGQ | powerpc64-bgq-linux-gcc, powerpc64-bgq-linux-g++, powerpc64-bgq-linux-gfortran | bgqgcc |
Binutils | 2.21.1 | Cross-compilation utilities | addr2line, ar, ld, ... | binutils |
CMake | 2.8.8 | cross-platform, open-source build system | cmake | cmake |
Debug/performance tools | ||||
gdb | 7.2 | GNU Debugger | gdb | gdb |
DDT | 4.0 | Allinea's Distributed Debugging Tool | ddt | ddt |
HPCTW | 1.0 | BGQ MPI and Hardware Counters | libmpihpm.a, libmpihpm_smp.a, libmpitrace.a | hptibm |
Storage tools/libraries | ||||
HDF5 | 1.8.9-v18 | Scientific data storage and retrieval | h5ls, h5diff, ..., libhdf5 | hdf5/189-v18-serial-xlc* hdf5/189-v18-mpich2-xlc |
NetCDF | 4.2.1.1 | Scientific data storage and retrieval | ncdump,ncgen,libnetcdf | netcdf/4.2.1.1-serial-xlc* netcdf/4.2.1.1-mpich2-xlc |
Parallel NetCDF | 1.3.1 | Parallel scientific data storage and retrieval using MPI-IO | libpnetcdf.a | parallel-netcdf |
Libraries | ||||
ESSL | 5.1 | IBM Engineering and Scientific Subroutine Library (manual below) | libesslbg,libesslsmpbg | essl |
FFTW | 2.1.5, 3.3.2, 3.1.2-esslwrapper | Fast Fourier transform | libsfftw,libdfftw,libfftw3, libfftw3f | fftw/2.1.5, fftw/3.3.2, fftw/3.1.2-esslwrapper |
LAPACK + ScaLAPACK | 3.4.2 + 2.0.2 | Linear algebra routines. A subset of LAPACK may be found in ESSL as well. | liblapack, libscalapack | lapack |
GSL | 1.15 | GNU Scientific Library | libgsl, libgslcblas | gsl |
BOOST | 1.47.0 | C++ Boost libraries | libboost... | cxxlibraries/boost |
bzip2+szip+zlib | 1.0.6,2.1,1.2.3 | compression libraries | libbz2,libz,libsz | compression |
METIS | 5.0.2 | Serial Graph Partitioning and Fill-reducing Matrix Ordering | libmetis | metis |
ParMETIS | 4.0.2 | Parallel graph partitioning and fill-reducing matrix ordering | libparmetis | parmetis |
Applications | ||||
gnuplot | 4.6.1 | interactive plotting program to be run on front-end nodes | gnuplot | gnuplot |
LAMMPS | Nov 2012 | Molecular Dynamics | lmp_bgq | lammps |
NAMD | 2.9 | Molecular Dynamics | namd2 | namd |
Quantum Espresso | 5.0.3 | Molecular Structure / Quantum Chemistry | qe_pw.x, etc | espresso |
Job Submission
As the BGQ architecture is different from that of the development nodes, the only way to test your program is to submit a job to the BGQ. Jobs are submitted through loadleveler and launched with runjob, which is in many ways similar to mpirun or mpiexec. As shown above in the network topology overview, there are only a few optimum job size configurations, and these are further constrained by each block requiring a minimum of one I/O node. In SciNet's configuration (with 8 I/O nodes per midplane), this makes 64 nodes (1024 cores) the smallest block size. Normally the block size matches the job size, so that the job gets fully dedicated resources. Smaller jobs can be run within the same block, but they then share resources (network and I/O); these are referred to as sub-block jobs and are described in more detail below.
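The smallest block size follows from dividing the nodes of a midplane over its I/O nodes. A short sketch of that arithmetic, using only the figures quoted in the text (the variable names are illustrative):

```shell
#!/bin/bash
# Smallest block: the compute nodes served by a single I/O node.
nodes_per_midplane=512
io_nodes_per_midplane=8
cores_per_node=16

min_block_nodes=$(( nodes_per_midplane / io_nodes_per_midplane ))
min_block_cores=$(( min_block_nodes * cores_per_node ))

echo "smallest block: $min_block_nodes nodes, $min_block_cores cores"
```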
runjob
All BGQ jobs are launched using runjob, which for those familiar with MPI is analogous to mpirun/mpiexec. Jobs run on a block, which is a predefined group of nodes that have already been configured and booted. When using loadleveler this is set for you, and you do not have to specify the block name. For example, if your loadleveler script requests 64 nodes, each with 16 cores (for a total of 1024 cores), you run a job with 16 processes per node and 1024 total processes with
runjob --np 1024 --ranks-per-node=16 --cwd=$PWD : $PWD/code -f file.in
Here, --np 1024 sets the total number of mpi tasks, while --ranks-per-node=16 specifies that 16 processes should run on each node. For pure mpi jobs, it is advisable always to give the number of ranks per node, because the default value of 1 may leave 15 cores on the node idle. The argument to ranks-per-node may be 1, 2, 4, 8, 16, 32, or 64.
(Note: If this were not a loadleveler job, and the block ID was R00-M0-N03-64, the command would be "runjob --block R00-M0-N03-64 --np 1024 --ranks-per-node=16 --cwd=$PWD : $PWD/code -f file.in")
runjob flags are shown with
runjob -h
a particularly useful one is
--verbose #
where # is from 1-7 which can be helpful in debugging an application.
How to set ranks-per-node
There are 16 cores per node, but the argument to ranks-per-node may be 1, 2, 4, 8, 16, 32, or 64. While it may seem natural to set ranks-per-node to 16, this is not generally recommended. On the BGQ, one can efficiently run more than 1 process per core, because each core has four "hardware threads" (similar to HyperThreading on the GPC and Simultaneous Multi Threading on the TCS and P7), which can keep the different parts of each core busy at the same time. One would therefore ideally use 64 ranks per node. There are two main reasons why one might not set ranks-per-node to 64:
- The memory requirements do not allow 64 ranks (each rank only has 256MB of memory)
- The application is more efficient in a hybrid MPI/OpenMP mode (or MPI/pthreads). With fewer ranks per node, the hardware threads are used as OpenMP threads within each process.
Because threads can share memory, the memory requirements of hybrid runs are typically smaller than those of pure MPI runs.
Note that the total number of mpi processes in a runjob (i.e., the --np argument) should be the ranks-per-node times the number of nodes (set by bg_size in the loadleveler script). So for the same number of nodes, if you change ranks-per-node by a factor of two, you should also multiply the total number of mpi processes by two.
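The relation between bg_size, ranks-per-node and --np can be sketched as a small helper. The function name bgq_layout is hypothetical (it is not a SciNet tool), the thread count assumes that all 64 hardware threads of a node are divided evenly over the ranks, and the memory figure is a plain division of the 16 GB per node.

```shell
#!/bin/bash
# Derive runjob settings from the node count and the ranks per node.
#   $1  number of nodes (bg_size in the loadleveler script)
#   $2  ranks per node
bgq_layout() {
    local nodes=$1 rpn=$2 np threads mem_mb
    case "$rpn" in
        1|2|4|8|16|32|64) ;;
        *) echo "ranks-per-node must be 1, 2, 4, 8, 16, 32 or 64" >&2
           return 1 ;;
    esac
    np=$(( nodes * rpn ))           # value for --np
    threads=$(( 64 / rpn ))         # hardware threads available to each rank
    mem_mb=$(( 16 * 1024 / rpn ))   # share of the node's 16 GB per rank
    echo "np=$np ranks_per_node=$rpn threads_per_rank=$threads mem_per_rank_mb=$mem_mb"
}

# Pure MPI on all hardware threads versus a hybrid layout, both on 64 nodes.
bgq_layout 64 64
bgq_layout 64 4
```

For a hybrid run, threads_per_rank is a natural upper limit for OMP_NUM_THREADS.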
Queue Limits
The maximum wall_clock_limit is 12 hours on the development bgqdev system and 24 hours on the production bgq system. Official SOSCIP project jobs are prioritized over all other jobs using a fairshare algorithm with a 14-day rolling window.
Batch Jobs
Job submission is done through loadleveler with a few blue gene specific commands. The command "bg_size" is in number of nodes, not cores, so a bg_size=64 would be 64x16=1024 cores.
#!/bin/sh
# @ job_name = bgsample
# @ job_type = bluegene
# @ comment = "BGQ Job By Size"
# @ error = $(job_name).$(Host).$(jobid).err
# @ output = $(job_name).$(Host).$(jobid).out
# @ bg_size = 64
# @ wall_clock_limit = 30:00
# @ bg_connectivity = Torus
# @ queue
# Launch all BGQ jobs using runjob
runjob --np 1024 --ranks-per-node=16 --envs OMP_NUM_THREADS=1 --cwd=$SCRATCH/ : $HOME/mycode.exe myflags
To submit to the queue use
llsubmit myscript.sh
Monitoring Jobs
To see running jobs
llq2
or
llq -b
to cancel a job use
llcancel JOBID
and to look at details of the bluegene resources use
llbgstatus
Note: the loadleveler script commands are not run on a bgq compute node but on the front-end node. Only programs started with runjob run on the bgq compute nodes. You should therefore keep scripting in the submission script to a bare minimum.
Interactive Use
As BGQ codes are cross-compiled, they cannot be run directly on the front-end nodes. Users only have access to the BGQ through loadleveler, which is appropriate for batch jobs; however, an interactive session is typically beneficial when debugging and developing. For that reason, a script called debugjob has been written to allow a session in which runjob can be run interactively. The script uses loadleveler to set up a block, sets all the correct environment variables, and then launches a spawned shell on the front-end node. The debugjob session currently allows a 30-minute session on 64 nodes.
[user@bgqdev-fen1]$ debugjob
[user@bgqdev-fen1]$ runjob --np 64 --ranks-per-node=16 --cwd=$PWD : $PWD/my_code -f myflags
[user@bgqdev-fen1]$ exit
NOTE: This is a prototype script so be gentle
Apart from debugging, this environment is also useful for building libraries and applications that need to run small tests as part of their 'configure' step. Within the debugjob session, applications compiled with the bgxl compilers or the mpcc/mpCC/mpfort wrappers will automatically run on the BGQ, skipping the need for the runjob command, provided you set the following environment variables
$ export BG_PGM_LAUNCHER=yes
$ export RUNJOB_NP=1
The latter setting sets the number of mpi processes to run. Most configure scripts expect only one mpi process, thus, RUNJOB_NP=1 is appropriate.
Sub-block jobs
BGQ allows multiple applications to share the same block; these are referred to as sub-block jobs. This needs to be done from within the same loadleveler submission script, using multiple calls to runjob. To run a sub-block job, you need to specify a "--corner" within the block at which to start each job and a 5D torus AxBxCxDxE "--shape". The starting corner depends on the specific block details provided by loadleveler and on the shape and size of the job being run.
Figuring out what the corners and shapes should be is very tricky (especially since it depends on the block you get allocated). For that reason, we've created a script called subblocks that determines the corners and shape of the sub-blocks. It only handles the (presumably common) case in which you want to subdivide the block into n equally sized sub-blocks, where n may be 1, 2, 4, 8, 16 or 32.
Here is an example script that calls subblocks with a sub-block size of 4 nodes; it sets the appropriate $SHAPE argument and an array of 16 starting corners, ${CORNER[n]}.
<source lang="bash">
#!/bin/bash
# @ job_name = bgsubblock
# @ job_type = bluegene
# @ comment = "BGQ Job SUBBLOCK "
# @ error = $(job_name).$(Host).$(jobid).err
# @ output = $(job_name).$(Host).$(jobid).out
# @ bg_size = 64
# @ wall_clock_limit = 30:00
# @ bg_connectivity = Torus
# @ queue
# Using subblocks script to set $SHAPE and array of ${CORNER[n]}
# with size of subblocks in nodes (ie similar to bg_size)
# In this case 16 sub-blocks of 4 cnodes each (64 total ie bg_size)
source subblocks 4
# 16 jobs of 4 each
for (( i=0; i < 16 ; i++)); do
    runjob --corner ${CORNER[$i]} --shape $SHAPE --np 64 --ranks-per-node=16 : your_code_here > $i.out &
done
wait
</source>
Remember that sub-block jobs are not the ideal way to run on the BlueGene/Qs. These sub-blocks all have to share the same I/O nodes, so for I/O-intensive jobs this will be an inefficient setup. Also consider that if your jobs are so small that they have to run in sub-blocks, it may be more efficient to use other clusters such as the GPC.
If you run into any issues with this technique, please contact bgq-support for help.
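The loop count and the --np value in such a script follow from bg_size, the sub-block size and ranks-per-node, so they can be computed rather than hard-coded. A minimal sketch (the function name subblock_plan is hypothetical and independent of the subblocks script itself):

```shell
#!/bin/bash
# Work out how many sub-block jobs fit in a block and how many ranks each gets.
#   $1  bg_size (nodes in the block)
#   $2  nodes per sub-block
#   $3  ranks per node
# Prints "<number of sub-blocks> <ranks per sub-block job>".
subblock_plan() {
    local bg_size=$1 sub_size=$2 rpn=$3 count
    if [ "$sub_size" -le 0 ] || [ $(( bg_size % sub_size )) -ne 0 ]; then
        echo "sub-block size must divide bg_size evenly" >&2
        return 1
    fi
    count=$(( bg_size / sub_size ))
    case "$count" in
        1|2|4|8|16|32) ;;
        *) echo "number of sub-blocks must be 1, 2, 4, 8, 16 or 32" >&2
           return 1 ;;
    esac
    echo "$count $(( sub_size * rpn ))"
}

# The example above: 64 nodes, sub-blocks of 4 nodes, 16 ranks per node.
plan=$(subblock_plan 64 4 16) && echo "sub-blocks and ranks per job: $plan"
```

For the example script this gives 16 sub-blocks with 64 ranks each, matching the loop bound and the --np argument used there.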
Filesystem
The BGQ has its own dedicated 500TB file system based on GPFS (General Parallel File System). There are two main systems for user data: /home, a small, backed-up space where user home directories are located, and /scratch, a large system for input or output data for jobs; data on /scratch is not backed up. The path to your home directory is in the environment variable $HOME, and will look like /home/G/GROUP/USER. The path to your scratch directory is in the environment variable $SCRATCH, and will look like /scratch/G/GROUP/USER (following the conventions of the rest of the SciNet systems).
file system | purpose | user quota | backed up | purged |
---|---|---|---|---|
/home | development | 50 GB | yes | never |
/scratch | computation | 20 TB | no | not currently |
Transferring files
Although the GPFS file system of the BGQ is shared between the bgq development and production system, the file system is not shared with the other SciNet systems (gpc, tcs, p7, arc), nor is the other file system mounted on the BGQ. Use scp to copy files from one file system to the other, e.g., from bgqdev-fen1, you could do
$ scp -c arcfour login.scinet.utoronto.ca:code.tgz .
or from a login node you could do
$ scp -c arcfour code.tgz bgqdev.scinet.utoronto.ca:
The flag -c arcfour is optional. It tells scp (or really, ssh), to use a non-default encryption. The one chosen here, arcfour, has been found to speed up the transfer by a factor of two (you may expect around 85MB/s). This encryption method is only recommended for copying from the BGQ file system to the regular SciNet GPFS file system or back.
Note that although these transfers are within the same data center, you have to use the full names of the systems, login.scinet.utoronto.ca and bgqdev.scinet.utoronto.ca, respectively, and that you will be asked for your password.
How much Disk Space Do I have left?
The diskUsage command, available on the bgqdev nodes, reports on the home and scratch file systems in a number of ways. For instance, it can show how much disk space is being used by yourself and your group (with the -a option) or how much your usage has changed over a certain period ("delta information"), and it can generate plots of your usage over time. Please see the usage help below for more details.
Usage: diskUsage [-h|-?| [-a] [-u <user>] [-de|-plot]
    -h|-?: help
    -a: list usages of all members on the group
    -u <user>: as another user on your group
    -de: include delta information
    -plot: create plots of disk usages
Note that the information on usage and quota is only updated hourly!
Documentation
- BGQ Day: Intro to Using the BGQ: Slides / Video recording (direct link)
- BGQ Day: BGQ Hardware Overview: Slides / Video recording (direct link)
- Julich Documentation
- Argonne BGQ Wiki
- Argonne MiraCon Presentations
- BGQ System Administration Guide
- BGQ Application Development
- IBM XL C/C++ for Blue Gene/Q: Getting started
- IBM XL C/C++ for Blue Gene/Q: Compiler reference
- IBM XL C/C++ for Blue Gene/Q: Language reference
- IBM XL C/C++ for Blue Gene/Q: Optimization and Programming Guide
- IBM XL Fortran for Blue Gene/Q: Getting started
- IBM XL Fortran for Blue Gene/Q: Compiler reference
- IBM XL Fortran for Blue Gene/Q: Language reference
- IBM XL Fortran for Blue Gene/Q: Optimization and Programming Guide
- IBM ESSL (Engineering and Scientific Subroutine Library) 5.1 for Linux on Power