Namd on BGQ
A parameter study was undertaken to test the simulation performance and efficiency of NAMD on the Blue Gene/Q cluster, BGQ, with attention to the NAMD performance tuning documentation. Determining optimal parameters for a NAMD simulation on this system is complicated by the fact that only certain job sizes (32, 64, 128, etc.) map onto optimal network topologies. The system of study is a 246,000-atom membrane protein simulation (Cytochrome c Oxidase embedded in a TIP3P-solvated DPPC bilayer) using the CHARMM36 force field (protein and lipids). The unit cell is orthorhombic with box dimensions 144 x 144 x 117 Angstroms, and the simulation time step was 2 fs. The following multiple time-stepping parameters were also used: nonbondedFreq=1, fullElectFrequency=2, stepspercycle=10.
The following benchmarks were performed with the non-SMP NAMD build, with the exception of the last section. All simulations were started from a restart file of a pre-equilibrated snapshot. Performance in nanoseconds per day is based on the geometric mean of the three "Benchmark time" lines at the beginning of the simulation's standard output and may not represent long-time averages. The "Cores" column in the results below is the number of physical cores requested through the bg_size parameter in the submission script.
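For reference, the integrator and PME-related lines of such a configuration file look roughly as follows. This is a sketch assembled from the parameters quoted above, not the exact production input; any omitted keywords (structure, coordinates, force-field files, etc.) are left out deliberately.

 # Time step (fs) and multiple time-stepping settings used in this study
 timestep            2.0
 nonbondedFreq       1
 fullElectFrequency  2
 stepspercycle       10

 # Particle-mesh Ewald; grid dimensions generated automatically from the spacing
 PME                 yes
 PMEGridSpacing      1.0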
Performance Tuning Benchmarks
Efficiency is measured with respect to the 16 ranks-per-node, 32-core simulation. In this section the patch grid was manually doubled in the X, Y and/or Z directions. The default patch-grid doubling chosen by NAMD 2.9 is generally recommended (the twoAway parameters need not be specified in the configuration file).
Ranks per node | Cores | NAMD Config Options | ns/day | Efficiency |
16 | 32 | | 2.79 | 1.00 |
16 | 64 | | 5.05 | 0.91 |
16 | 64 | twoAwayX (default) | 5.62 | 1.01 |
16 | 128 | twoAwayX (default) | 10.07 | 0.90 |
16 | 128 | twoAwayXY | 10.59 | 0.95 |
16 | 256 | twoAwayX | 14.32 | 0.64 |
16 | 256 | twoAwayXY (default) | 17.63 | 0.79 |
16 | 256 | twoAwayXYZ | 16.79 | 0.75 |
16 | 512 | twoAwayX | 23.52 | 0.53 |
16 | 512 | twoAwayXY (default) | 25.00 | 0.56 |
16 | 1024 | twoAwayX | 23.67 | 0.27 |
16 | 1024 | twoAwayXY | 28.31 | 0.32 |
16 | 1024 | twoAwayXYZ (default) | 27.98 | 0.31 |
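For reference, the twoAway options in the table above are single-line additions to the NAMD configuration file. A minimal sketch for the twoAwayXY case is shown below; NAMD 2.9 normally makes this choice automatically, so these lines are only needed to override the default.

 # Force doubling of the patch grid along x and y ("twoAwayXY" above)
 twoAwayX  yes
 twoAwayY  yes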
PME Pencils
A "pencil-based" PME decomposition may be more efficient than the default "slab-based decomposition". In this study PME pencil grids are created for both dedicated PME nodes (lblUnload=yes) and non-dedicated PME nodes. Fine-tuning of PMEPencils resulted in insignificant performance gains for this study.
Ranks per node | Cores | NAMD Config Options | ns/day | Efficiency |
16 | 256 | twoAwayXY, PMEPencils=8, lblUnload=yes | 12.93 | 0.58 |
16 | 256 | twoAwayXY, PMEPencils=12, lblUnload=yes | 17.27 | 0.77 |
16 | 256 | twoAwayXY, PMEPencils=16, lblUnload=yes | 16.02 | 0.72 |
16 | 256 | twoAwayXY, PMEPencils=20, lblUnload=yes | 15.41 | 0.69 |
16 | 256 | twoAwayXY, PMEPencils=12 | 16.21 | 0.73 |
16 | 256 | twoAwayXY, PMEPencils=16 | 17.92 | 0.80 |
16 | 256 | twoAwayXY, PMEPencils=20 | 17.99 | 0.81 |
16 | 256 | twoAwayXY, PMEPencils=24 | 17.83 | 0.80 |
16 | 256 | twoAwayXY, PMEPencils=36 | 16.97 | 0.76 |
8 | 256 | twoAwayXY, PMEPencils=20 | 18.24 | 0.82 |
16 | 256 | twoAwayXY, PMEPencils=20 | 17.99 | 0.81 |
32 | 256 | twoAwayXY, PMEPencils=20 | 13.94 | 0.63 |
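A pencil decomposition is requested in the NAMD configuration file. The sketch below corresponds to the 20-pencil rows above; note that the dedicated-PME flag is written lblUnload in the tables, whereas the NAMD user guide spells the load-balancer keyword ldbUnloadPME, so verify the exact name against your NAMD version.

 # 20 x 20 pencil grid for the PME FFT instead of the default slab decomposition
 PMEPencils    20
 # Dedicated PME nodes ("lblUnload=yes" in the tables above); keyword name
 # assumed to be ldbUnloadPME -- check the NAMD user guide for your version
 ldbUnloadPME  yes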
Ranks-Per-Node Study
The "ranks-per-node" or simply the number of processes per compute node is a Blue Gene/Q runjob command parameter. The following efficiency estimates are measured with respect to the 16 ranks per node results for the same number of nodes respectively (for the default twoAway choices and no PME Pencils). These simulations offered the best performance and were used in production simulations.
Ranks per node | Cores | NAMD Config Options | ns/day | Efficiency |
32 | 32 | | 4.46 | 1.60 |
32 | 64 | | 8.01 | 1.43 |
32 | 128 | | 13.74 | 1.30 |
32 | 256 | | 19.81 | 1.12 |
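A hedged sketch of the corresponding submission script for the non-SMP runs is shown below; the LoadLeveler directives, limits, and paths are illustrative and should be adapted to the local NAMD installation and queue settings.

 #!/bin/sh
 # Illustrative BGQ submission script (adapt class, limits, and paths locally);
 # bg_size is the value reported in the "Cores" column of the tables above
 #@ job_type         = bluegene
 #@ bg_size          = 128
 #@ wall_clock_limit = 24:00:00
 #@ queue

 # 16 or 32 ranks per node, as in the Ranks-Per-Node Study
 runjob --ranks-per-node=16 : /path/to/namd2 production.namd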
Incorrect Particle-Mesh Ewald Grid
Long-range electrostatics are computed using PME in all of the simulations above, with the PME grid generated automatically from the "pmeGridSpacing 1.0" setting. A poor choice of PME grid dimensions (i.e. dimensions that are not products of the small primes 2, 3, and 5) results in increasingly large performance degradation because of the matrix-size requirements of the FFT algorithm. Below is an example of the degradation one may expect when one of the grid dimensions (here 111 = 3 x 37) contains a large prime factor. Compare with the corresponding well-factored PME grid results in the Performance Tuning Benchmarks above.
Ranks per node | Cores | NAMD Config Options | ns/day | Efficiency |
16 | 32 | Poor PME Multiple (144x144x111) | 2.70 | 0.97 |
16 | 64 | Poor PME Multiple (144x144x111) | 5.13 | 0.92 |
16 | 128 | Poor PME Multiple (144x144x111) | 8.61 | 0.77 |
16 | 256 | Poor PME Multiple (144x144x111) | 13.93 | 0.62 |
16 | 512 | Poor PME Multiple (144x144x111) | 17.08 | 0.38 |
16 | 1024 | Poor PME Multiple (144x144x111) | 17.64 | 0.20 |
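If the automatically generated grid is unsuitable, the grid dimensions can be fixed explicitly in the configuration file. The sketch below uses well-factored sizes for this cell; 120 for the z dimension is an illustrative assumption (the smallest product of 2, 3 and 5 not below the 117 Angstrom cell length), not necessarily the value used in the runs above.

 # Explicit, well-factored PME grid (used in place of PMEGridSpacing);
 # 144 = 2^4 * 3^2 and 120 = 2^3 * 3 * 5 are products of small primes
 PMEGridSizeX  144
 PMEGridSizeY  144
 PMEGridSizeZ  120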
Symmetric Multiprocessing (SMP) Study
All of the prior benchmarks were performed with a non-SMP NAMD binary, so multithreading within a rank was not possible. The following results use the SMP build and offer performance increases of up to 40% over the non-SMP benchmarks. The number of worker threads per rank (the +ppn runtime option, listed as "ppn" in the tables) was varied from 1 (no multithreading) to 4; values greater than 4 caused the job to crash. Adding the Charm++ flag "+CmiNoProcForComThread" to the NAMD command line gave negligible improvement in all cases. Efficiency is measured with respect to the 32-core, ppn 1 simulation.
Comparing 32/ppn4 with 64/ppn4 we observed 85% scaling, which is within acceptable criteria for production runs. The 128/ppn4 simulations showed poor scaling, so larger core counts were not tested.
Ranks per node | Cores | NAMD Config Options | ns/day | Efficiency |
16 | 32 | ppn 1 | 2.75 | 1.00 |
16 | 32 | ppn 2 | 4.60 | 1.67 |
16 | 32 | ppn 3 | 5.59 | 2.03 |
16 | 32 | ppn 4 | 6.31 | 2.29 |
16 | 32 | ppn 1, +CmiNoProcForComThread | 2.74 | 1.00 |
16 | 32 | ppn 2, +CmiNoProcForComThread | 4.62 | 1.68 |
16 | 32 | ppn 3, +CmiNoProcForComThread | 5.50 | 2.00 |
16 | 32 | ppn 4, +CmiNoProcForComThread | 6.31 | 2.30 |
16 | 64 | ppn 1 | 5.54 | 1.01 |
16 | 64 | ppn 2 | 8.31 | 1.51 |
16 | 64 | ppn 3 | 8.56 | 1.56 |
16 | 64 | ppn 4 | 10.60 | 1.92 |
16 | 64 | ppn 4, +CmiNoProcForComThread | 10.58 | 1.92 |
16 | 128 | ppn 1 | 9.93 | 0.90 |
16 | 128 | ppn 2 | 13.86 | 1.26 |
16 | 128 | ppn 3 | 14.72 | 1.34 |
16 | 128 | ppn 4 | 15.10 | 1.37 |
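The ppn values in these tables are runtime options passed to the SMP namd2 binary on the runjob command line. A hedged sketch for the 16 ranks-per-node, ppn 4 case follows; the binary path is illustrative.

 # SMP build: 16 ranks per node, each running 4 worker threads (+ppn 4);
 # the optional Charm++ flag +CmiNoProcForComThread made little difference here
 runjob --ranks-per-node=16 : /path/to/namd2-smp +ppn 4 +CmiNoProcForComThread production.namd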
In an attempt to use all 64 hardware threads per node, one may also use fewer ranks per node, such as 2, 4 or 8, while requesting a larger number of threads per rank (see the sketch after the next table). The performance of 8 ranks per node on 64 cores with ppn 8 is approximately equal to that of the 16 ranks-per-node runs above.
Ranks per node | Cores | NAMD Config Options | ns/day | Efficiency |
8 | 32 | ppn 1 | 1.53 | 0.56 |
8 | 32 | ppn 2 | 2.77 | 1.01 |
8 | 32 | ppn 4 | 4.64 | 1.69 |
8 | 32 | ppn 8 | 6.32 | 2.30 |
8 | 64 | ppn 1 | 2.79 | 0.51 |
8 | 64 | ppn 2 | 5.59 | 1.02 |
8 | 64 | ppn 4 | 8.25 | 1.50 |
8 | 64 | ppn 8 | 10.71 | 1.95 |
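The same pattern extends to fewer ranks per node with more threads per rank; for the 8-rank, ppn 8 rows the runjob line would look roughly like this (again a sketch, with an illustrative path):

 # 8 ranks per node with 8 worker threads each
 runjob --ranks-per-node=8 : /path/to/namd2-smp +ppn 8 production.namd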
Simulations with both 2 and 4 ranks per node offered efficiencies comparable to the other jobs in this section that use all 64 hardware threads per node. The combination of 32 ranks per node and multithreading was not as effective as the usual 16 ranks per node with ppn 4 above. A simulation with 1 rank per node and ppn=64 was not successful, and the ppn=32 job required the use of PMEPencils.
Ranks per node | Cores | NAMD Config Options | ns/day | Efficiency |
2 | 32 | ppn 32 | 6.35 | 2.29 |
2 | 32 | ppn 32 (only with PMEPencils 8) | 7.80 | 1.95 |
4 | 32 | ppn 16 | 6.30 | 2.31 |
4 | 32 | ppn 16 | 10.73 | 1.42 |
32 | 32 | ppn 1, +CmiNoProcForComThread | 4.45 | 1.62 |
32 | 32 | ppn 2, +CmiNoProcForComThread | 6.26 | 2.27 |
When combined with SMP, PME pencils offered minimal improvement when the pencil count was selected appropriately; a poor choice (pmePencils 4 below) degraded performance substantially.
Ranks per node | Cores | NAMD Config Options | ns/day | Efficiency |
16 | 32 | ppn 4, pmePencils 4 | 2.88 | 1.05 |
16 | 32 | ppn 4, pmePencils 16 | 6.34 | 2.31 |
16 | 32 | ppn 4, pmePencils 20 | 6.35 | 2.31 |
16 | 32 | ppn 4, pmePencils 24 | 6.24 | 2.27 |