Namd on BGQ
{| style="border-spacing: 8px; width:100%"
| valign="top" style="cellpadding:1em; padding:1em; border:2px solid; background-color:#f6f674; border-radius:5px"|
'''WARNING: SciNet is in the process of replacing this wiki with a new documentation site. For current information, please go to [https://docs.scinet.utoronto.ca https://docs.scinet.utoronto.ca]'''
|}

A parameter study was undertaken to test the simulation performance and efficiency of NAMD on the Blue Gene/Q cluster, [[BGQ]], with attention to the NAMD performance tuning documentation. Determining optimal parameters for a NAMD simulation on this system is more difficult because only certain partition sizes (32, 64, 128, etc. nodes) map onto optimal network topologies. The system of study is a 246,000-atom membrane protein simulation ([http://www.rcsb.org/pdb/explore.do?structureId=1m56 Cytochrome c Oxidase] embedded in a TIP3P-solvated DPPC bilayer) using the CHARMM36 force field (protein and lipids). The unit cell is a rectangular box with dimensions 144 x 144 x 117 Angstroms, and the simulation time-step was 2 fs. The following non-bonded frequency parameters were also used: nonbondedFreq=1, fullElectFrequency=2, stepspercycle=10.

The following benchmarks were performed with the non-SMP NAMD build, with the exception of the last section. All simulations were started from a restart file of a pre-equilibrated snapshot. Performance in nanoseconds per day is based on the geometric mean of the three "'''Benchmark time'''" lines near the beginning of the simulation's standard output and may not represent long-time averages. The "Cores" column in the results below gives the number of physical nodes requested through the '''bg_size''' parameter in the submission script (each BGQ node has 16 cores).
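For reference, here is a minimal sketch of how the ns/day figure can be extracted from a NAMD log, assuming the usual "Info: Benchmark time: ... days/ns ..." output lines (the file name namd.out is only an example):

<pre>
# Geometric mean of ns/day over the "Benchmark time" lines of a NAMD log.
# Assumes the standard line format, e.g.
#   Info: Benchmark time: 512 CPUs 0.0286 s/step 0.1659 days/ns 1190 MB memory
awk '/Benchmark time:/ {
       for (i = 1; i < NF; i++)
         if ($(i+1) == "days/ns") { s += log(1/$i); n++ }
     }
     END { if (n) printf "%.2f ns/day (geometric mean of %d samples)\n", exp(s/n), n }' namd.out
</pre>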
  
 
== Performance Tuning Benchmarks ==

Efficiency is measured with respect to the 16 ranks-per-node, bg_size=32 simulation. In this section, the patch grid was manually doubled in the X, Y, or Z directions. The default patch doubling in NAMD 2.9 is generally recommended (the twoAway parameters need not be specified in the configuration file).
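The twoAway settings are ordinary NAMD configuration keywords. As a sketch only (the defaults are normally preferable), a run with the patch grid manually doubled in X and Y would contain lines like:

<pre>
# NAMD configuration fragment (sketch): manually double the patch grid
# in the X and Y directions instead of accepting NAMD 2.9's automatic choice.
twoAwayX   yes
twoAwayY   yes
# Omitting all twoAway* keywords leaves the recommended default behaviour.
</pre>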
  
 
{| class="wikitable"
|Ranks
|Cores
|NAMD Config Options
|ns/day
|Efficiency
|----
|16
|32
|
|2.79
|1.00
|----
|16
|64
|
|5.05
|0.91
|----
|16
|64
|twoAwayX (default)
|5.62
|1.01
|----
|16
|128
|twoAwayX (default)
|10.07
|0.90
|----
|16
|128
|twoAwayXY
|10.59
|0.95
|----
|16
|256
|twoAwayX
|14.32
|0.64
|----
|16
|256
|twoAwayXY (default)
|17.63
|0.79
|----
|16
|256
|twoAwayXYZ
|16.79
|0.75
|----
|16
|512
|twoAwayX
|23.52
|0.53
|----
|16
|512
|twoAwayXY (default)
|25.00
|0.56
|----
|16
|1024
|twoAwayX
|23.67
|0.27
|----
|16
|1024
|twoAwayXY
|28.31
|0.32
|----
|16
|1024
|twoAwayXYZ (default)
|27.98
|0.31
|}
== PME Pencils ==

A "pencil-based" PME decomposition may be more efficient than the default "slab-based" decomposition. In this study, PME pencil grids were created both with dedicated PME processors (lblUnload=yes) and without. Fine-tuning of PMEPencils resulted in insignificant performance gains for this study.
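As a configuration sketch (assuming that the "lblUnload" label used in the table below corresponds to NAMD's ldbUnloadPME keyword), a pencil run with dedicated PME processors would contain lines like:

<pre>
# NAMD configuration fragment (sketch): request a 12x12 pencil grid for the
# PME reciprocal-space work instead of the default slab decomposition.
PMEPencils    12
# Dedicate processors to PME by unloading them in the load balancer
# (written as "lblUnload=yes" in the tables below).
ldbUnloadPME  yes
</pre>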
 
{| class="wikitable"
|Ranks
|Cores
|NAMD Config Options
|ns/day
|Efficiency
|----
|16
|256
|twoAwayXY, PMEPencils=8, lblUnload=yes
|12.93
|0.58
|----
|16
|256
|twoAwayXY, PMEPencils=12, lblUnload=yes
|17.27
|0.77
|----
|16
|256
|twoAwayXY, PMEPencils=16, lblUnload=yes
|16.02
|0.72
|----
|16
|256
|twoAwayXY, PMEPencils=20, lblUnload=yes
|15.41
|0.69
|----
|16
|256
|twoAwayXY, PMEPencils=12
|16.21
|0.73
|----
|16
|256
|twoAwayXY, PMEPencils=16
|17.92
|0.80
|----
|16
|256
|twoAwayXY, PMEPencils=20
|17.99
|0.81
|----
|16
|256
|twoAwayXY, PMEPencils=24
|17.83
|0.80
|----
|16
|256
|twoAwayXY, PMEPencils=36
|16.97
|0.76
|----
|8
|256
|twoAwayXY, PMEPencils=20
|18.24
|0.82
|----
|16
|256
|twoAwayXY, PMEPencils=20
|17.99
|0.81
|----
|32
|256
|twoAwayXY, PMEPencils=20
|13.94
|0.63
|}
== Ranks-Per-Node Study ==

The "ranks-per-node" setting, i.e. the number of processes per compute node, is a parameter of the Blue Gene/Q runjob command. The following efficiency estimates are measured with respect to the 16 ranks-per-node results for the same number of nodes (with the default twoAway choices and no PME pencils). These simulations offered the best performance and were used in production simulations.
  
 
{| class="wikitable"
|Ranks
|Cores
|NAMD Config Options
|ns/day
|Efficiency
|----
|32
|32
|
|4.46
|1.60
|----
|32
|64
|
|8.01
|1.43
|----
|32
|128
|
|13.74
|1.30
|----
|32
|256
|
|19.81
|1.12
|}
== Incorrect Particle-Mesh Ewald Grid ==

Long-range electrostatics are computed using PME in all simulations above, with the PME grid generated automatically from the "pmeGridSpacing 1.0" setting. A poor choice of PME grid dimensions (i.e. sizes that do not factor into multiples of 2, 3, and 5) can cause increasingly large performance degradation because of the matrix size requirements of the FFT algorithm. Below is an example of the degradation one may expect when a grid dimension contains a large prime factor (here 111 = 3 x 37). Compare with the well-chosen PME grids in the Performance Tuning Benchmarks above.
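As a sketch, the grid can also be fixed explicitly with NAMD's PMEGridSize keywords, choosing dimensions that factor into small primes (the specific values below are illustrative for this cell):

<pre>
# NAMD configuration fragment (sketch): force well-factored PME grid sizes.
# 144 = 2^4 * 3^2 is a good FFT size; 111 = 3 * 37 contains the large
# prime 37 and leads to the degraded timings shown below.
PMEGridSizeX  144
PMEGridSizeY  144
# 120 = 2^3 * 3 * 5 comfortably covers the 117 Angstrom cell dimension.
PMEGridSizeZ  120
</pre>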
 
{| class="wikitable"
|Ranks
|Cores
|NAMD Config Options
|ns/day
|Efficiency
|----
|16
|32
|Poor PME Multiple (144x144x111)
|2.70
|0.97
|----
|16
|64
|Poor PME Multiple (144x144x111)
|5.13
|0.92
|----
|16
|128
|Poor PME Multiple (144x144x111)
|8.61
|0.77
|----
|16
|256
|Poor PME Multiple (144x144x111)
|13.93
|0.62
|----
|16
|512
|Poor PME Multiple (144x144x111)
|17.08
|0.38
|----
|16
|1024
|Poor PME Multiple (144x144x111)
|17.64
|0.20
|}
  
== Symmetric multiprocessing Study ==

All the prior benchmarks were performed with a non-SMP NAMD binary, so true multithreading was not possible. The following results use the SMP build and offer performance increases of up to 40% compared to the non-SMP benchmarks. The ppn value (worker threads per rank) was varied from 1 (not multithreaded) to 4; values greater than 4 resulted in a crash. The addition of the Charm++ flag "+CmiNoProcForComThread" to the NAMD command line gave negligible improvements in all cases. Efficiency is measured with respect to the bg_size=32, ppn 1 simulation.

Comparing the 32-node ppn 4 run with the 64-node ppn 4 run, we observed 85% scaling, within acceptable criteria for production runs. The 128-node ppn 4 simulations showed poor scaling, so larger partitions were not tested.
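As a launch sketch for the SMP binary (names and paths are examples only), 16 ranks per node on 32 nodes, each rank driving 4 worker threads, would be started with something like:

<pre>
# 32 nodes x 16 ranks = 512 ranks; each rank runs 4 worker threads
# via the Charm++ +ppn flag.
runjob --np 512 --ranks-per-node=16 : /path/to/namd2-smp +ppn 4 simulation.conf
</pre>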
  
 
{| class="wikitable"
|Ranks
|Cores
|NAMD Config Options
|ns/day
|Efficiency
|----
|16
|32
|ppn 1
|2.75
|1.00
|----
|16
|32
|ppn 2
|4.60
|1.67
|----
|16
|32
|ppn 3
|5.59
|2.03
|----
|16
|32
|ppn 4
|6.31
|2.29
|----
|16
|32
|ppn 1, +CmiNoProcForComThread
|2.74
|1.00
|----
|16
|32
|ppn 2, +CmiNoProcForComThread
|4.62
|1.68
|----
|16
|32
|ppn 3, +CmiNoProcForComThread
|5.50
|2.00
|----
|16
|32
|ppn 4, +CmiNoProcForComThread
|6.31
|2.30
|----
|16
|64
|ppn 1
|5.54
|1.01
|----
|16
|64
|ppn 2
|8.31
|1.51
|----
|16
|64
|ppn 3
|8.56
|1.56
|----
|16
|64
|ppn 4
|10.60
|1.92
|----
|16
|64
|ppn 4, +CmiNoProcForComThread
|10.58
|1.92
|----
|16
|128
|ppn 1
|9.93
|0.90
|----
|16
|128
|ppn 2
|13.86
|1.26
|----
|16
|128
|ppn 3
|14.72
|1.34
|----
|16
|128
|ppn 4
|15.10
|1.37
|}
To make use of the 64 hardware threads available on each node, one may also use fewer ranks per node, such as 2, 4, or 8, while requesting a larger number of threads per rank. The performance of 8 ranks per node on 64 physical nodes with ppn 8 is approximately equal to that of the 16 ranks-per-node runs above.
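A corresponding launch sketch for the 8-ranks-per-node case on 64 nodes (again with example names):

<pre>
# 64 nodes x 8 ranks = 512 ranks; each rank runs 8 worker threads (+ppn 8),
# spreading work over the 64 hardware threads (4-way SMT) of each BGQ node.
runjob --np 512 --ranks-per-node=8 : /path/to/namd2-smp +ppn 8 simulation.conf
</pre>
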
{| class="wikitable"
|Ranks
|Cores
|NAMD Config Options
|ns/day
|Efficiency
|----
|8
|32
|ppn 1
|1.53
|0.56
|----
|8
|32
|ppn 2
|2.77
|1.01
|----
|8
|32
|ppn 4
|4.64
|1.69
|----
|8
|32
|ppn 8
|6.32
|2.30
|----
|8
|64
|ppn 1
|2.79
|0.51
|----
|8
|64
|ppn 2
|5.59
|1.02
|----
|8
|64
|ppn 4
|8.25
|1.50
|----
|8
|64
|ppn 8
|10.71
|1.95
|}
  
Simulations with both 2 and 4 ranks-per-node offered efficiencies comparable to the other jobs in this section that use all 64 hardware threads per node. The combination of 32 ranks-per-node and multithreading was not as effective as the usual 16 ranks-per-node with ppn 4 above. A simulation with 1 rank-per-node and ppn 64 was not successful, and the ppn 32 job required the use of PMEPencils.
  
 
{| class="wikitable"
|Ranks
|Cores
|NAMD Config Options
|ns/day
|Efficiency
|----
|2
|32
|ppn 32
|6.35
|2.29
|----
|2
|32
|ppn 32 (only with PMEPencils 8)
|7.80
|1.95
|----
|4
|32
|ppn 16
|6.30
|2.31
|----
|4
|32
|ppn 16
|10.73
|1.42
|----
|32
|32
|ppn 1, +CmiNoProcForComThread
|4.45
|1.62
|----
|32
|32
|ppn 2, +CmiNoProcForComThread
|6.26
|2.27
|}
  
PME pencils offered minimal improvements when selected appropriately.
  
 
{| class="wikitable"
|Ranks
|Cores
|NAMD Config Options
|ns/day
|Efficiency
|----
|16
|32
|ppn 4, pmePencils 4
|2.88
|1.05
|----
|16
|32
|ppn 4, pmePencils 16
|6.34
|2.31
|----
|16
|32
|ppn 4, pmePencils 20
|6.35
|2.31
|----
|16
|32
|ppn 4, pmePencils 24
|6.24
|2.27
|}

== Documentation ==

# NAMD 2.9 User Guide
# NAMD Performance Tuning Wiki