NAMD on BGQ

A parameter study was undertaken to test the simulation performance and efficiency of NAMD on the Blue Gene/Q cluster (BGQ), with attention to the NAMD performance tuning documentation. Determining optimal parameters for a NAMD simulation on this system is more difficult than usual because only certain simulation sizes (512, 1024, etc.) have optimal topologies. The system studied is a 246,000-atom membrane protein simulation (Cytochrome c Oxidase embedded in a TIP3P-solvated DPPC bilayer) using the CHARMM36 force field (protein and lipids). The unit cell is rectangular with box dimensions 144 x 144 x 117 Angstroms, and the simulation time-step was 2 fs. The following non-bonded frequency parameters were also used: nonbondedFreq=1, fullElectFrequency=2, stepspercycle=10.

The following benchmarks were performed with the non-SMP NAMD build, with the exception of the last section.

Performance Tuning Benchmarks

Efficiency is measured with respect to the 16 ranks-per-node, 512-core simulation. All simulations were started using a restart file from a pre-equilibrated snapshot. Performance in nanoseconds per day is based on the geometric mean of the three "Benchmark time" lines at the beginning of the simulation's standard output. In this section, the PME patch grid was manually doubled in the X, Y, and/or Z directions. The default patch doubling in NAMD 2.9 is generally recommended (twoAway parameters need not be specified in the configuration file).

Ranks per node	Cores	NAMD Config Options	ns/day	Efficiency
16	512		2.79	1.00
16	1024		5.05	0.91
16	1024	twoAwayX (default)	5.62	1.01
16	2048	twoAwayX (default)	10.07	0.90
16	2048	twoAwayXY	10.59	0.95
16	4096	twoAwayX	14.32	0.64
16	4096	twoAwayXY (default)	17.63	0.79
16	4096	twoAwayXYZ	16.79	0.75
16	8192	twoAwayX	23.52	0.53
16	8192	twoAwayXY (default)	25.00	0.56
16	16384	twoAwayX	23.67	0.27
16	16384	twoAwayXY	28.31	0.32
16	16384	twoAwayXYZ (default)	27.98	0.31
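
The ns/day and Efficiency columns can be reproduced with a short calculation. The following Python sketch is illustrative only: the function names are hypothetical, and it assumes the inputs are the seconds-per-step values read by hand from the three "Benchmark time" lines. The reference values (2.79 ns/day at 512 cores) and the 2 fs time-step are taken from this page.

```python
import math

SECONDS_PER_DAY = 86_400
TIMESTEP_FS = 2.0  # simulation time-step stated on this page


def ns_per_day(seconds_per_step):
    """Geometric mean of s/step samples, converted to ns/day."""
    log_mean = sum(math.log(s) for s in seconds_per_step) / len(seconds_per_step)
    mean_s_per_step = math.exp(log_mean)
    ns_per_step = TIMESTEP_FS * 1e-6  # 1 fs = 1e-6 ns
    return SECONDS_PER_DAY / mean_s_per_step * ns_per_step


def efficiency(ns_day, cores, ref_ns_day=2.79, ref_cores=512):
    """Parallel efficiency relative to the 16 ranks-per-node, 512-core run."""
    speedup = ns_day / ref_ns_day
    ideal_speedup = cores / ref_cores
    return speedup / ideal_speedup


# Example with a row from the table above: 1024 cores, no twoAway option.
print(round(efficiency(5.05, 1024), 2))
```

An efficiency of 1.00 corresponds to ideal linear scaling from the 512-core reference run; values slightly above 1.00 (as for the 1024-core twoAwayX run) indicate that the tuned run scaled marginally better than the untuned reference.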

PME Pencils

A "pencil-based" PME decomposition may be more efficient than the default "slab-based" decomposition. In this study, PME pencil grids were created both with dedicated PME nodes (lblUnload=yes) and with non-dedicated PME nodes. Fine-tuning of PMEPencils resulted in insignificant performance gains for this study.

Ranks per node	Cores	NAMD Config Options	ns/day	Efficiency
16	4096	twoAwayXY, PMEPencils=8, lblUnload=yes	12.93	0.58
16	4096	twoAwayXY, PMEPencils=12, lblUnload=yes	17.27	0.77
16	4096	twoAwayXY, PMEPencils=16, lblUnload=yes	16.02	0.72
16	4096	twoAwayXY, PMEPencils=20, lblUnload=yes	15.41	0.69
16	4096	twoAwayXY, PMEPencils=12	16.21	0.73
16	4096	twoAwayXY, PMEPencils=16	17.92	0.80
16	4096	twoAwayXY, PMEPencils=20	17.99	0.81
16	4096	twoAwayXY, PMEPencils=24	17.83	0.80
16	4096	twoAwayXY, PMEPencils=36	16.97	0.76
8	4096	twoAwayXY, PMEPencils=20	18.24	0.82
16	4096	twoAwayXY, PMEPencils=20	17.99	0.81
32	4096	twoAwayXY, PMEPencils=20	13.94	0.63

Ranks-Per-Node Study

The "ranks-per-node" value, i.e. the number of processes per compute node, is a Blue Gene/Q runjob command parameter. In this study, 64 ranks per node could not be used because of out-of-memory errors, and 32 ranks per node also ran out of memory for the 16384-core simulations. The following efficiency estimates are measured relative to the 16 ranks-per-node results on the same number of nodes (with the default twoAway choices and no PME pencils). These simulations offered the best performance and were used in production simulations. Note that the Cores column counts virtual cores: 1024, the first entry in the table, means that 512 physical cores were requested, but because the ranks-per-node value was doubled, a total of 1024 virtual cores were used.

Ranks per node	Cores	NAMD Config Options	ns/day	Efficiency
32	1024		4.46	1.60
32	2048		8.01	1.43
32	4096		13.74	1.30
32	8192		19.81	1.12
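
The relationship between virtual cores, physical cores, and the efficiency figures in this table can be written out explicitly. The Python sketch below is illustrative; the function names are hypothetical, and it assumes 16 compute cores per Blue Gene/Q node and that the Cores column is the total number of ranks.

```python
CORES_PER_NODE = 16  # compute cores per Blue Gene/Q node (assumed constant)


def physical_cores(total_ranks, ranks_per_node):
    """Physical cores occupied by a job with the given total rank count."""
    nodes = total_ranks // ranks_per_node
    return nodes * CORES_PER_NODE


def relative_speedup(ns_day, ns_day_16rpn):
    """Performance relative to the 16 ranks-per-node run on the same nodes."""
    return ns_day / ns_day_16rpn


# First table entry: 1024 ranks at 32 ranks per node occupy 512 physical cores,
# so the comparison is against the 512-core, 16 ranks-per-node result (2.79 ns/day).
print(physical_cores(1024, 32))
print(round(relative_speedup(4.46, 2.79), 2))
```

Read this way, the Efficiency column in this section is a speedup on fixed hardware rather than a parallel efficiency, which is why its values exceed 1.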

Incorrect Particle-Mesh Ewald Grid

Long-range electrostatics are computed using PME for all simulations above, with the PME grid generated automatically by the "pmeGridSpacing 1.0" setting. A poor choice of PME grid dimensions (i.e. sizes that are not products of the small factors 2, 3, and 5) can result in increasingly large performance degradation due to the matrix size requirements of the FFT algorithm. Below is an example of the performance degradation that one may expect when none of the grid dimensions is divisible by 5. These results can be compared with the more appropriate PME grid used in the Performance Tuning Benchmarks above.

Ranks per node	Cores	NAMD Config Options	ns/day	Efficiency
16	512	Poor PME Multiple (144x144x111)	2.70	0.97
16	1024	Poor PME Multiple (144x144x111)	5.13	0.92
16	2048	Poor PME Multiple (144x144x111)	8.61	0.77
16	4096	Poor PME Multiple (144x144x111)	13.93	0.62
16	8192	Poor PME Multiple (144x144x111)	17.08	0.38
16	16384	Poor PME Multiple (144x144x111)	17.64	0.20
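
A candidate grid dimension can be checked for FFT-friendly factors before a run. The Python sketch below is a minimal illustration with hypothetical function names; it is not necessarily the rule NAMD itself applies when it chooses a grid from pmeGridSpacing.

```python
import math


def has_only_small_factors(n, primes=(2, 3, 5)):
    """True if n is a positive integer whose only prime factors are in primes."""
    if n < 1:
        return False
    for p in primes:
        while n % p == 0:
            n //= p
    return n == 1


def next_good_grid_size(min_size):
    """Smallest integer >= min_size whose only prime factors are 2, 3, and 5."""
    n = max(1, math.ceil(min_size))
    while not has_only_small_factors(n):
        n += 1
    return n


# Grid dimensions from the table above: 144 factors cleanly, 111 = 3 x 37 does not.
for size in (144, 111):
    print(size, has_only_small_factors(size))

# For the 117 Angstrom box edge at 1.0 Angstrom spacing:
print(next_good_grid_size(117))
```

For the 117 Angstrom box edge this search returns 120 (2 x 2 x 2 x 3 x 5), whereas 111 contains the large prime factor 37.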

Symmetric multiprocessing (SMP) Study

All the prior benchmarks were performed with a non-SMP NAMD binary, thus true multithreading was not possible. The following benchmarks were performed with an SMP build of NAMD.

Ranks per node	Cores	Options	ns/day	Efficiency
16	512	ppn 1	2.75	1.00
16	512	ppn 2	4.60	1.67
16	512	ppn 3	5.59	2.03
16	512	ppn 4	6.31	2.29
16	512	ppn 1, +CmiNoProcForComThread	2.74	1.00
16	512	ppn 2, +CmiNoProcForComThread	4.62	1.68
16	512	ppn 3, +CmiNoProcForComThread	5.50	2.00
16	512	ppn 4, +CmiNoProcForComThread	6.31	2.29
16	1024	ppn 1	5.54	1.01
16	1024	ppn 2	8.31	1.51
16	1024	ppn 3	8.56	1.56
16	1024	ppn 4	10.59	1.93
16	1024	ppn 4, +CmiNoProcForComThread	10.58	1.92
32	512	ppn 1, +CmiNoProcForComThread	4.45	1.62
32	512	ppn 2, +CmiNoProcForComThread	6.26	2.28
16	512	ppn 4, pmePencils 4	2.88	1.05
16	512	ppn 4, pmePencils 16	6.34	2.31
16	512	ppn 4, pmePencils 20	6.35	2.31
16	512	ppn 4, pmePencils 24	6.24	2.27
