oldwiki.scinet.utoronto.ca - User contributions [en-gb]

ATLAS

2012-06-27T18:42:05Z

Groer: Created page with " == Place holder for instructions for ATLAS users on SciNet == Leslie Groer - June 27, 2012"

== Place holder for instructions for ATLAS users on SciNet ==

Leslie Groer - June 27, 2012

Oldwiki.scinet.utoronto.ca:System Alerts

2011-05-12T16:25:57Z

Groer: /* System Status: Normal */

== System Status: <span style="color:#339900">Normal</span> ==

You can check our twitter feed, @SciNetHPC, for updates.

Last updated: Thu May 12 12:23:35 EDT 2011

'''NOTE: There is some UofT network reconfiguration and maintenance on Friday morning May 13 07:45-08:15 that most likely will disrupt external network connections and data transfers. Local running jobs should not be affected.'''

([[Previous_messages:|Previous messages]])

Introduction To Performance

2009-08-13T19:10:52Z

Groer: /* Compute Time */

==The Concepts of Parallel Performance==

Parallel computing used to be a very specialized domain, but now even making the best use of your laptop, which almost certainly has multiple independent computing cores, requires understanding the basic concepts of performance in a parallel environment.

Most fundamentally, parallel programming allows three possible ways of getting more and better science done:
;Running your computation many times
:If you have a program that works in serial, having many processors available to you allows you to run many copies of the same program at once, improving your [[#Throughput|throughput]]. This can be a fairly trivial use of parallel computing and doesn't require very specialized hardware, but it can be extremely useful for, for instance, running parameter studies or sensitivity studies. Best of all, this is essentially guaranteed to run efficiently if your serial code runs efficiently! Because this doesn't require fancy hardware, it is a waste of resources to use the [[TCS_Quickstart|Tightly Coupled System]] for these sorts of tasks; they should instead be run on the [[GPC_Quickstart|General Purpose Cluster]].
;Running your computation faster
:This is what most people think of as parallel computing. It can take a lot of work to make an existing code run efficiently on many processors, or to design a new code to make use of these resources, but when it works, one can achieve a substantial [[#Parallel_Speedup|speedup]] of individual jobs. This might mean the difference between a computation running in a feasible length of time for a research project or taking years to complete --- so while it may be a lot of work, it may be your only option. To determine whether your code runs well on many processors, you need to measure [[#Parallel_Speedup|speedup]] and [[#Efficiency|efficiency]]; to see how many processors one should use for a given problem you must run [[#Strong_Scaling_Tests|strong scaling tests]].
;Running your computation on larger problems
:One achieves speedup by using more processors on the same problem. But by running your job in parallel you may have access to more resources other than just processors --- for instance, more memory, or more disks. In this case, you may be able to run problems that simply wouldn't be possible on a single processor or a single computer; one can achieve significant '''''sizeup'''''. To find how large a problem one can efficiently run, one measures [[#Efficiency|efficiency]] and runs [[#Weak_Scaling_Tests|weak scaling tests]].

Of course, these aren't mutually exclusive; one can take advantage of any combination of the above. It may be that your problem runs efficiently on 8 cores but no more; however, you may be able to make use of more processors by running many jobs to explore parameter space, and already on 8 cores you may be able to consider larger problems than you can with just one!

===Throughput===

Throughput is the most fundamental measure of performance, and the one that ultimately most matters to most computational scientists -- if you have N computations that you need to have done for your research project, how quickly can you get them done? Everything else we'll consider here is just a
way of increasing throughput T:
<math>
T = \frac{\mathrm{Number}\,\mathrm{of}\,\mathrm{computations}}{\mathrm{Unit}\,\mathrm{time}} .
</math>

If you have many independent computations to perform (such as a parameter study or a sensitivity study) you can increase throughput almost arbitrarily by running them alongside each other at the same time, limited only by the number of processors available (or the wait time in the queue, or the disk space available, or some other external resource constraint). This approach obviously doesn't work if you only have one computation to perform, or if later
computations require the output from previous ones. In these cases, or when the individual jobs take infeasibly long, or cannot be performed on only one processor, one must resort to ''also'' using parallel programming techniques to parallelize the individual jobs.

===Compute Time===

Fundamental to everything else that follows is measuring the amount of time a computation takes on some problem size/amount of work <math>N</math> and some number of processors <math>P</math>. We'll denote this by <math>t(N,P)</math>. The easiest way to measure this time is with the <tt>time</tt> command that comes on most flavours of Unix in <tt>/bin/time</tt> or <tt>/usr/bin/time</tt>:

<pre>
/bin/time myprogram
...normal program output...
658.44user 0.85system 10:59.41elapsed
</pre>

The format of the times output at the end may vary from system to system, but the basic information returned will be the same. The ''real'' or ''elapsed'' time listed is the actual [[Wallclock time]] that elapsed during the run, ''user'' or ''cpu'' is the [[CPU time]] that was actually spent doing your computation, and the ''system'' time is the CPU time the operating system kernel spent working on behalf of your program, for example servicing file input/output requests. Time spent simply waiting (for disk, network, or other processes) shows up only in the elapsed time. Our goal will be to reduce the real wallclock time that the simulation takes as much as possible while still making efficient use of the resources available.

===Parallel Speedup===

The speedup of an individual job with some amount of work <math>N</math> as you go from running it serially to running it on <math>P</math> processors is simply:

<math>S(N,P) = \frac{t(N,P=1)}{t(N,P)} .</math>

That is, the ratio of the time it takes to run the computation on one processor to the time it takes on <math>P</math> processors. The way this is usually done is to run the parallel code on <math>P</math> processors and on <math>1</math> processor and take the ratio of the two times; but this is a form of cheating, as the parallel version of the code will generally have overheads (even in the one-processor case) compared to the best available serial-only version of the code. The best thing to do in considering the efficiency of the parallelization is to compare the parallel code to the best available serial code that does the same job.

If you are considering the speedup of a problem that doesn't fit onto one processor, of course, the concept of speedup can be generalized; one needn't start at <math>P=1</math>.

It should go without saying that, while developing your parallel code and during performance tuning, you should check that you get the same results with multiple processors as with some `known good' serial test case; it is even easier to introduce bugs in parallel code than it is in serial code!

===Efficiency===

Once you have a parallel code and some timing results, you can look at how efficiently you are making use of the resources as you use more and more processors.
The parallel efficiency of a computation of some fixed work size running on <math>P</math> processors as compared to the <math>P=1</math> case is

<math>E(N,P) = \frac{S(N,P)}{P} .</math>

That is, if you get a speedup of <math>8 \times</math> in going from one to eight processors, you are at 1.00 or 100% efficiency; anything less and you are at lower efficiency. It isn't uncommon to achieve greater than 100% parallel efficiencies for small numbers of processors for some types of problems; as you go to more processors, you also have more processor cache, and thus more of the problem's data can fit into fast cache. This is called ''super-linear speedup'' and sadly seldom extends out to very many processors.

===Strong Scaling Tests===

[[Image:scaling-example.png|thumb|right|320px|An example of a strong scaling test]]

The figure to the right and the data below show an example of a result of a small strong scaling test --- running a fixed-size problem on a varying number of processors to see how the timing of the computation scales with the number of processors. The code was an OpenMP code run on a node of the GPC. The quantitative results follow below; the times were measured and then speedups and efficiencies were calculated as above.

{|
! <math>P</math>
! <math>t(N,P)</math>
! <math>S(N,P)</math>
! <math>E(N,P)</math>
|-
| 1 || 3:50 || - || -
|-
| 2 || 2:02 || 1.87x || 94 %
|-
| 4 || 1:05 || 3.52x || 88 %
|-
| 6 || 47.8 s || 4.81x || 80 %
|-
| 8 || 43.6 s || 5.28x || 66 %
|}

The plot shows the compute time <math>t(N,P)</math> as a function of P; if the code maintained 100% parallel efficiency, we would expect the scaling to be
as 1/P, so we plot it on a log-log scale. Also shown is the ideal scaling case -- what the times would be if, using the <math>P=1</math> timing as a normalization, we did get 100% efficiency. We can see that past 4 cores the measured case starts to significantly deviate from the ideal, and it looks like things would only get worse past 8 cores.

It's important to note here that scaling tests should be done on realistic problem sizes and for realistic lengths of time. Generally, for either serial or parallel programs there will be some overhead both at initialization time and during the course of the computation; if the problem size is too small, the overhead during the course of the run might be a significant fraction of the real work, and the program will behave needlessly poorly. Similarly, if the number of timesteps or iterations is too small, the initialization overhead will play a spuriously large role in the performance.

The above behaviour is typical for a small computation; it won't scale to too many cores, and the efficiency becomes monotonically worse as one increases the number of cores in use. The rate at which this happens will depend on the problem size and the type of computation. How is one to tell where to stop;
how good an efficiency is good enough? Certainly there are rules of thumb --- one shudders to see efficiencies below 50% --- but one can arrive at more meaningful and quantitative results by considering throughput. Let's imagine we had 64 cores at our disposal, and we wanted to run 96 jobs as quickly as possible. Our total time to completion of the 96 jobs would vary with the number of cores we ran per job as follows:

{|
! <math>P</math>
! Time for one job
! Time for all 96 jobs
|-
| 1 || 3:50 || 7:40 (2 batches, 64 jobs then 32)
|-
| 2 || 2:02 || 6:06 (3 batches, 32,32,32)
|-
| 4 || 1:05 || 6:30 (6 batches, 6x16)
|-
| 6 || 47.8 s || 7:58 (10 batches, 9x10, 6)
|-
| 8 || 43.6 s || 8:43 (12 batches)
|}

Running two processes per job gives the shortest time to completion here, and if we use more than 4 processes per job, it will actually take us longer to do all our runs than simply running them serially! For jobs that scale better with the number of processes (this could be a different program, or the same program with different problem size), we will find this turnover point to be at higher <math>P</math>; for jobs that scale worse, lower <math>P</math>.

===Weak Scaling Tests===

[[Image:weak-scaling-example.png|thumb|right|320px|An example of a weak scaling test]]

The strong scaling test described above considers the performance of a parallel code with a fixed work size as the number of processors varies; this tells us how the parallel overhead behaves as you go to more and more processors. A weak scaling test fixes the amount of work '''per processor''' and compares the execution time over number of processors. Since each processor has the same amount to do, in the ideal case the execution time should remain constant. While the strong scaling test tells you how the parallel overhead scales with <math>P</math>, the weak scaling test tells you something weaker -- whether the parallel overhead varies faster or slower than the amount of work.

Nonetheless, the weak scaling test can be the relevant one for determining how large a problem size one can efficiently compute with a given parallel code and system. An example of results for a weak scaling test on the GPC and TCS up to 256 processors (8 nodes of the TCS, 32 of the GPC) is shown to the right. In this case we are maintaining extremely good efficiency up to at least 128 processors with constant work per process on both architectures. It is possible to see different behaviour when first filling up a node (e.g., for fewer than 8 processes for the GPC, or 64 for TCS) than when one starts crossing nodes; one should understand this but it doesn't necessarily indicate problems.

==Performance Tuning==

'''You cannot improve what you cannot measure.''' Performance tuning is an iterative process between running an '''instrumented''' version of your code, getting data on performance throughout the code, and attempting to make changes to the code that will make it run more efficiently.

There are three main ways of instrumenting a code to find its performance. The first is '''manually adding timers''' around important parts of the code to find out how much time is spent in each part. This is worth thinking about doing when putting together a new code, as it means that you'll have a very robust way of finding out how well the different parts of the code perform on different platforms and with different compiler options, etc. The results are, however, necessarily very coarse-grained; they are very useful for comparing performance under different situations, but give very little information about whether or not there are performance problems or what they might be.

The second technique is '''sampling''', sometimes called `program counter sampling' or `statistical sampling'. In this case, the program is run in an environment where it is interrupted briefly at some set frequency (typically something like 100 times per second) and the location of the program counter is jotted down before the program is resumed. At the end of the program, these locations are translated into locations in the source code, and one has a statistical profile of where the program has spent its time.

Statistical sampling has several advantages. It has a very low overhead --- the sampling procedure for instance takes much less time than a function call to a timer routine --- so that the program runs much as it would without the measurement process. If the samples are taken often enough, the result is a very accurate picture of where your program is spending its time, allowing you to very quickly identify `hotspots' in the code and focus your attention on the most costly areas of the program. This combination of relevant information and low-overhead makes statistical sampling the first resort for serious performance measurement.

Sampling, however, has drawbacks. While it lets you know where the program is spending its time, it doesn't tell you why, or how it got there in the first place. For instance, in a parallel program you may be spending too much time in barriers of one sort or another (perhaps at <tt>MPI_WAITALL</tt> calls in MPI, or implicit barriers at the end of <tt>parallel</tt> sections in OpenMP) but unless you know where in the code that routine was called from, you can't address the problem. In this case you need some sort of '''trace''' through the program which keeps track of which routine called what. This is generally a much heavier-weight process, which can substantially increase the runtime of the code, running the risk of `the Heisenberg effect' - measurement changing the system under observation. On the other hand, sometimes you just need that level of information, so tracing packages or libraries must be used.

A related method is the use of '''hardware counters''' --- counters within the CPU itself which keep track of performance-related information, such as the number of cache misses or branch mis-predictions within your code. Collecting this information, either regularly throughout the code or once for the entire code run, can give very specific information about performance problems. Right now these counters are available on the TCS system but not on the GPC system, as the mainstream Linux kernel does not provide access to these counters.

==Command-line Performance Tools==

Many of the tools below can be used to examine both serial and parallel performance problems with a code. We'd like to encourage you to tune serial performance first. Worrying about parallel performance before the code performs well with a single task doesn't make much sense! Profiling your code when running with one task allows you to spot serial `hot spots' for optimization, as well as giving you a more detailed understanding of where your program spends its time. Further, any performance improvement you make in the serial code will automatically speed up your parallel code.

We've already talked about coarse-grained measurements such as timers within the code and using tools such as <tt>/bin/time</tt>. These are very useful for comparing overall performance between different platforms/parameters, but we won't need to discuss them further here.

===gprof (profiling: everywhere)===

A statistical sampling workhorse is <tt>gprof</tt>, the GNU version of an old common Unix utility called prof. To use this, the code must be re-compiled with both source-code symbols intact (<tt>-g</tt>) and with profiling information available (for most compilers, this is <tt>-pg</tt>; for the IBM compilers (xlf, xlc, xlC) it is <tt>-p</tt>). It is worth knowing because of its ubiquity, and because it contains much of the functionality of newer tools, so that the same concepts occur in other tools.

So let's consider the following trivial program <tt>pi.c</tt>:
<source lang="c" line=1>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

double calc_pi(long n) {
    long in = 0;
    long out = 0;
    long i;
    double x,y;

    for (i=0; i<n; i++) {
        x = drand48();
        y = drand48();
        if (x*x+y*y < 1) {
            in++;
        } else {
            out++;
        }
    }

    return 4.*(double)in/(double)(in+out);
}

int main(int argc, char **argv) {
    long n, defaultn=100000;
    double pi;
    time_t t;

    /* seed random number generator */
    srand48(time(&t));

    /* get number of tries */
    if (argc < 2 || (n=atoi(argv[1]))<1) {
        n = defaultn;
        printf("Using default n = %ld\n", n);
    }

    pi = calc_pi(n);
    printf("Pi = %lf\n", pi);

    return 0;
}
</source>

We can compile this with profiling on and run it:
<pre>
$ gcc -g -pg -o pi pi.c
$ ./pi 100000000
Pi = 3.141804
</pre>

(Note that this isn't a very good way of calculating pi!) On exit, this program creates a file called <tt>gmon.out</tt>, which contains the profiling information about the run of the code. We can take a look at this by using <tt>gprof</tt>:
<pre>
$ gprof pi gmon.out
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
100.88      1.00     1.00        1   998.76   998.76  calc_pi

index % time    self  children    called     name
                1.00    0.00       1/1           main [2]
[1]    100.0    1.00    0.00       1         calc_pi [1]
-----------------------------------------------
                                                 <spontaneous>
[2]    100.0    0.00    1.00                 main [2]
                1.00    0.00       1/1           calc_pi [1]
-----------------------------------------------
</pre>

The first part tells us that essentially all of the time spent running was in the <tt>calc_pi()</tt> routine (of course), and the second part attempts to be a call graph, showing that <tt>main</tt> called <tt>calc_pi()</tt> once. An important concept in the timing is the distinction between the `self' and `children' times for each routine. Because most routines call many other routines, it's often useful to distinguish between the total amount of time spent between starting and ending the routine (the `inclusive' time, self plus children) and that same time excluding the time spent in child routines (the `exclusive' time, or self time).

The above results are fairly trivial and not very useful for this simple program, but in more complicated routines it can be very valuable to narrow down hotspots to particular regions of code.

[[Image:Xprofiler.png|thumb|300px|The AIX tool <tt>Xprof</tt> gives a visual representation of the <tt>gprof</tt> output.]]

In fact, gprof also allows you to view the time spent in the code by lines of code. As you chop the program up finer, the statistical sampling gets less accurate; thus to look at the results by line of code you must be sure that your sample run was long enough to get meaningful data. But the results can be extremely useful:
<pre>
$ gprof --line pi gmon.out
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
 70.31      0.70     0.70                             calc_pi (pi.c:14 @ 40078b)
 14.27      0.84     0.14                             calc_pi (pi.c:17 @ 4007bc)
  5.10      0.89     0.05                             calc_pi (pi.c:11 @ 4007c1)
  4.08      0.93     0.04                             calc_pi (pi.c:15 @ 4007b5)
  3.06      0.96     0.03                             calc_pi (pi.c:13 @ 400781)
  2.55      0.98     0.03                             calc_pi (pi.c:12 @ 400777)
  1.53      1.00     0.02                             calc_pi (pi.c:11 @ 40076d)
  0.00      1.00     0.00        1     0.00     0.00  calc_pi (pi.c:5 @ 40074c)
</pre>

where now we can see that the single line containing the radius calculation (<tt>if (x*x+y*y < 1)</tt>) is 70% of the work for the entire program. This tells you where you should spend your time to optimize the code. Other tools exist for this sort of line-by-line analysis; <tt>gcov</tt> in the gcc compiler suite counts the number of times a given source line is executed - the idea was for coverage analysis for test suites, but it certainly can be used for profiling as well; however, usually the amount of time spent at a line is more important than the number of executions.

For parallel programs, <tt>gprof</tt> will generally output a separate <tt>gmon.out</tt> file for each process; for threaded applications, output for all threads will be summed into the same <tt>gmon.out</tt>. It may be useful to sum up all the results and view them with gprof or to look at them individually.

There are other tools for looking at the same data. For instance, on the TCS system, the command <tt>Xprof</tt>
(run the same way as <tt>gprof</tt>; <tt>Xprof program_name gmon.out</tt>) lets you look at the call tree as a graphical tree. Each routine is shown by a block with a size proportional to the time spent in each routine; the width is the inclusive time, and the height is the exclusive time.

===hpmcount (performance counters: TCS)===

On the TCS, <tt>hpmcount</tt> allows the querying of the performance counter values over the course of a run. Since here we are simply asking the CPU to report values it obtains during the run of a program, the code does not need to be instrumented; simply typing
<pre>
hpmcount hpmcount_args program_name program_args
</pre>

will run the program and output the results from the hardware performance counters at the end. So for instance, with our trivial pi program above,

<pre>
tcs-f11n05-$ hpmcount ./pi
Using default n = 100000
Pi = 3.144240
Execution time (wall clock time): 0.020325 seconds

######## Resource Usage Statistics ########

Total amount of time in user mode : 0.012754 seconds
Total amount of time in system mode : 0.001486 seconds
Maximum resident set size : 440 Kbytes
Average shared memory use in text segment : 0 Kbytes*sec
Average unshared memory use in data segment : 0 Kbytes*sec
Number of page faults without I/O activity : 53
Number of page faults with I/O activity : 1
Number of times process was swapped out : 0
Number of times file system performed INPUT : 0
Number of times file system performed OUTPUT : 0
Number of IPC messages sent : 0
Number of IPC messages received : 0
Number of signals delivered : 0
Number of voluntary context switches : 6
Number of involuntary context switches : 0

####### End of Resource Statistics ########

Set: 1
Counting duration: 0.014947083 seconds
PM_FPU_1FLOP (FPU executed one flop instruction ) : 400093
PM_FPU_FMA (FPU executed multiply-add instruction) : 500030
PM_FPU_FSQRT_FDIV (FPU executed FSQRT or FDIV instruction) : 1
PM_CYC (Processor cycles) : 58485795
PM_RUN_INST_CMPL (Run instructions completed) : 24238152
PM_RUN_CYC (Run cycles) : 70307511

Utilization rate : 61.172 %
Flop : 1.400 Mflop
Flop rate (flops / WCT) : 68.888 Mflop/s
Flops / user time : 112.614 Mflop/s
FMA percentage : 111.103 %

</pre>
There are a variety of sets of performance counters that can be reported; the default set isn't especially helpful for HPC-type computations; sets of performance counters can be specified on the commandline in the format <tt>-d -s item,item,item</tt>. Sets 5 and 12 are very useful for showing memory performance (showing L1 and L2 cache misses) and set 6 is especially useful for shared memory profiling, giving statistics about how often off-processor memory had to be accessed.

Showing the counters for the entire program will often tell you if there's a problem or not, but won't tell you where it is. For more detailed information, one can [http://www.ncsa.uiuc.edu/UserInfo/Resources/Software/Tools/HPMToolkit/HPM_2_5_2.AIX.html use the hpm library] to manually instrument different regions of your code, and get similar outputs to above for several different, smaller, regions of code.

On the linux side, <tt>oprofile</tt> allows the reporting of similar information, but to use it one must have root access to the linux machine.

===cachegrind (Memory use analysis: GPC)===

[[Image:Kcachegrind.png|thumb|kcachegrind, part of the KDE development package, can give graphical overviews of the output from cachegrind]]

<tt>valgrind</tt> is a memory tool that is usually thought of in terms of finding memory-access bugs in large programs. Rather than instrumenting a code or measuring counters, valgrind takes a fairly extreme approach -- it emulates your program running on a computer, essentially running a simulation of your program running on the same kind of computer valgrind is running on. This has enormous overhead (runtimes can be up to 20x as long as normal) but the result is exquisitely detailed information about what your program is doing.

Memory access is often a bottleneck for HPC codes, and cachegrind is a tool for valgrind which simulates the use of cache in your program, giving you line-by-line information on which parts of the code have cache performance issues. Your code does not need to be recompiled, although compiling with <tt>-g</tt> is necessary for the output to be useful. Cachegrind is run as shown:
<pre>
valgrind --tool=cachegrind myprogram myprogram_arguments
</pre>

Overall results for the whole program are given at the end of the program's normal output, and more detailed information is saved in a file that begins with <tt>cachegrind.out</tt>. These output files are plain text - readable in principle by humans, but it is much easier to see what is going on with valgrind's <tt>cg_annotate</tt> command or with visual tools like kcachegrind (shown to the right) or, eventually, valkyrie (which also can be used for <tt>memcheck</tt> output).

==Graphical Performance Tools==

While graphical performance tools typically measure similar things to their command-line relations, a graphical display opens up the possibility of aggregating much more information and having increased flexibility in displaying it in a variety of ways; this can be very helpful, especially in the initial stages of finding performance problems.

===OpenSpeedShop (profiling, MPI tracing: GPC)===

[[Image:Speedshop2.png|thumb|OpenSpeedShop, like gprof will tell you where the hotspots are in the code by function]]
[[Image:Speedshop1.png|thumb|...or by line of code]]

[http://www.openspeedshop.org OpenSpeedShop] is a tool that (will be) installed on GPC. It provides the functionality of gprof, with the addition of hardware counter measurements (not currently supported on GPC machines) and options to do both lightweight and more detailed, more heavy-weight profiling. OpenSpeedShop also contains enhanced support for dealing with parallel runs, and for tracing of MPI or I/O calls to find performance problems in those areas. The parallel support is considerably more than what <tt>gprof</tt> offers; bundling the data from thousands of tasks into one set of results is a significant algorithmic challenge in itself.

Another important addition, shared by many of the other graphical tools, is the idea of bundling results into different `experiments' --- bundles of an executable, measurement type, and resulting data --- which makes the iterative process of performance tuning much easier. OpenSpeedShop, as with some other tools, has the ability to directly compare the results of different experiments, so one can more easily see if a particular change made things better or worse, and if so where.

OpenSpeedShop does not require re-compilation of the executable (although, as with all these tools, for the correlation with the source code to be useful, the code should be compiled with debugging symbols, the option for which is almost universally <tt>-g</tt> to the compiler and linker). The code is then either instrumented, or run in an instrumented environment. Shown to the right are two of the views available for examining the timing results of one OpenMP code.

OpenSpeedShop can be launched from the commandline and then used entirely through the gui: there are a variety of `wizards' which guide you through choosing how to instrument and run your experiment:

<pre>
$ openss
</pre>

This should be done from a directory containing the source code and the executable, and is an excellent way to get started with this tool. Once one is more familiar with the tool, one can run a variety of experiments on the command line:

<pre>
openss -f program_name pcsamp
</pre>

where the above runs the <tt>pcsamp</tt> (program counter sampling, as in gprof) measurement on the executable <tt>program_name</tt>. Then one can launch the gui to view the results. There are options for instrumenting the executable in a variety of ways, and taking different measurements; the [http://www.openspeedshop.org OpenSpeedShop] web page contains links to documentation and tutorials.

===PeekPerf (profiling, TCS)===

[[Image:PeekPerf.png|thumb|An example of using PeekPerf]]

[http://domino.research.ibm.com/comm/research_projects.nsf/pages/hpct.index.html Peekperf] is IBM's single graphical `dashboard' providing access to many performance measurement tools for examining hardware counter data, threads, message passing, I/O, and memory access, several of which are available separately as command-line tools. Like OpenSpeedShop, it does not require re-compilation of the executable; an instrumented version of the code is generated at run time and this instrumented version is executed with whatever options you care to pass to it. It does not have the same support for comparing experiments as OpenSpeedShop does; however, it allows running several different types of measurements at once and seeing how they correlate in a given run, which is something OpenSpeedShop doesn't have.

You start peekperf at the command line
<pre>
$ peekperf
</pre>

and tell peekperf which executable you wish to run measurements on. You then highlight which sorts of measurements you wish to make (which sorts are available depends on the type of program - threaded, OpenMP, etc.), select `generate an instrumented executable', and then `run the instrumented executable', giving it the name of either the instrumented executable or a script to run the same; peekperf will then begin displaying the resulting data as soon as the program has completed.

Understanding the interface and resulting data takes some practice, and the documentation is quite sparse; however the flexibility in the range of measurements to take makes this an excellent source of performance information for programs running on the TCS system.

===Scalasca (profiling, tracing: TCS, GPC)===
[[Image:Scalasca.png|thumb|An example of using Scalasca]]

[http://www.fz-juelich.de/jsc/scalasca/ Scalasca] is a sophisticated tool which takes the aggregation of data shown in the above graphical tools one step further and analyzes the results to pinpoint and display common performance problems; it scales extremely well and the graphical display makes it very easy for the user to find out where the performance issues are.

Scalasca requires the code to be recompiled, and it provides a wrapper command that selects the right options to use. If, for instance, your code is normally compiled with
<pre>
ifort -c myprog.f
ifort -o myprog myprog.o -lm
</pre>

then one can instead use
<pre>
scalasca -instrument ifort -c myprog.f
scalasca -instrument ifort -o myprog myprog.o -lm
</pre>

Scalasca then parses the rest of the command line and adds the necessary flags. (If you are curious, <tt>scalasca -instrument -v</tt> will show you what the resulting command line actually is.) There is also a shortcut, <tt>skin</tt>, which is equivalent to <tt>scalasca -instrument</tt>.

When the new executable is generated, it's run in a similar way; if you normally run your program as
<pre>
./myprog
</pre>

or
<pre>
mpirun -np 5 ./myprog
</pre>

You'd instead do
<pre>
scalasca -analyze ./myprog
scalasca -analyze mpirun -np 5 ./myprog
</pre>

The program will run as usual with only additional lines about output to files. (Again, there is a shortcut available; <tt>scan</tt> is equivalent to <tt>scalasca -analyze</tt>.) To then look at the results, one uses

<pre>
scalasca -examine [epik directory name]
</pre>

where the directory name is that created by the analyze program.

A screenshot of the results for an OpenMP program is shown to the right. The left panel selects the metric to display, here the time spent waiting at the implicit barriers at the end of parallel sections; the middle panel shows the call tree, indicating the context in which the delays occurred; and the panel on the right gives the breakdown for each thread.

<!-- ===VTune/Thread Profiler (GPC)=== don't have license up for this yet -->

==Common Serial Performance Problems==

===Poor use of cache===
A classic problem for scientific codes is memory bandwidth; the capacity to do on-chip floating-point or integer operations has grown much faster than the ability to get numbers onto the chip in the first place. One way around that is to use various levels of memory cache; when one number is needed from memory, a whole line of data is brought in from (slow) external memory to fast on-chip cache. This makes each individual memory access modestly slower, but tends to greatly speed up overall performance, since if you are going to do something to data in one part of memory you're typically also going to do it to neighboring values.

If you take advantage of data locality --- accessing memory in some kind of order rather than jumping around in memory --- cache can greatly increase the performance of your code. On the other hand, if you '''do''' jump around in memory a lot, cache will actually hurt your performance.

The classic way this comes up is in accessing multidimensional arrays. The example below is simplified; most cases aren't this extreme (or obvious!) but the idea is the same. Let's consider the following FORTRAN code, which simply iterates a few times through a modestly sized multidimensional array:
<source lang="f">
      program memaccess

      implicit none
      integer, parameter :: li=32,lj=32,lk=32,ll=32,lm=32
      real, dimension(li,lj,lk,ll,lm) :: a
      integer :: i,j,k,l,m
      integer :: iter

      a = 0.

! the innermost loop runs over the first (fastest-varying) index
      do iter=1,10
        do m=1,lm
          do l=1,ll
            do k=1,lk
              do j=1,lj
                do i=1,li
                  a(i,j,k,l,m) = a(i,j,k,l,m)+ i+j+k+l+m
                enddo
              enddo
            enddo
          enddo
        enddo
      enddo

      end program
</source>
The above program, which we'll suggestively call <tt>memaccess-good.f</tt>, accesses array elements in the order in which FORTRAN places them in the computer's memory; FORTRAN lays out this array in memory as <tt>[a(1,1,1,1,1), a(2,1,1,1,1), ... a(32,1,1,1,1), a(1,2,1,1,1), ...]</tt> and so on. So by ordering our loops this way, with the first index varying in the innermost loop, we march through memory in order, making maximum use of cache. The resulting code can be timed:
<pre>
$ gfortran -O3 -o memaccess-good memaccess-good.f
$ time ./memaccess-good

real 0m2.478s
user 0m2.337s
sys 0m0.094s
</pre>

If we reverse the order of the loops, so that they go <tt>do i=1,li ... do j=1,lj ... do m=1,lm</tt> from outermost to innermost, however, we get
<pre>
$ gfortran -O3 -o memaccess-bad memaccess-bad.f
$ time ./memaccess-bad

real 0m19.622s
user 0m19.101s
sys 0m0.098s
</pre>
Nearly a factor of 8 slower! Thus tools such as cachegrind can be extremely important for finding significant performance problems in memory-heavy codes.

C-based languages arrange their arrays the opposite way in memory, so that the equivalent array in C would go as <tt>[a[0][0][0][0][0], a[0][0][0][0][1], ... a[0][0][0][0][31], a[0][0][0][1][0], ... ]</tt>; thus `bad' array access in FORTRAN looks like `good' array access in C, and vice versa.
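
The two layouts can be made concrete by computing where a given element sits in memory. The following is only a minimal sketch in POSIX shell arithmetic; the function names <tt>fortran_offset</tt> and <tt>c_offset</tt> are invented for this illustration, and the extents are fixed at the 32 used in the example program.

```shell
# Illustrative helpers (hypothetical names) for a 32x32x32x32x32 array.
N=32

# Offset, in elements, of a(i,j,k,l,m) in Fortran's column-major layout
# (1-based indices; the first index varies fastest).
fortran_offset() {
    echo $(( ($1-1) + N*( ($2-1) + N*( ($3-1) + N*( ($4-1) + N*($5-1) ) ) ) ))
}

# Offset, in elements, of a[i][j][k][l][m] in C's row-major layout
# (0-based indices; the last index varies fastest).
c_offset() {
    echo $(( $5 + N*( $4 + N*( $3 + N*( $2 + N*$1 ) ) ) ))
}

fortran_offset 2 1 1 1 1   # neighbour in i: 1 element away
fortran_offset 1 1 1 1 2   # neighbour in m: 32^4 = 1048576 elements away
c_offset 0 0 0 0 1         # neighbour in the last index: 1 element away
```

With 4-byte reals, a stride of 1048576 elements is 4 MiB, so when the FORTRAN loop over <tt>m</tt> is innermost, each successive access lands in a different cache line.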

==Common OpenMP Performance Problems==

==Common MPI Performance Problems==
===Overuse of MPI_BARRIER===
===Many Small Messages===
Typically, the time it takes for a message of size ''n'' to get from one node to another can be expressed in terms of a [[latency]] ''l'' and a [[bandwidth]] ''b'',
<math>t_c = l + \frac{n}{b} .</math>
For small messages, the latency can dominate the cost of sending (and processing!) the message. By
bundling many small messages into one, you can amortize that cost over many messages, reducing
the time spent communicating.
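
As a rough worked illustration, consider sending ''k'' messages of ''n'' bytes each. Sent separately, they cost
<math>T_{\mathrm{separate}} = k \left( l + \frac{n}{b} \right) = k l + \frac{k n}{b} ,</math>
whereas bundled into a single message of ''kn'' bytes they cost
<math>T_{\mathrm{bundled}} = l + \frac{k n}{b} ,</math>
so bundling saves <math>(k-1) l</math>. With purely illustrative values, assumed here rather than measured on any SciNet system, of ''l'' = 50 μs, ''b'' = 100 MB/s, ''n'' = 8 bytes and ''k'' = 1000, the separate messages take about 1000 × (50 μs + 0.08 μs) ≈ 50 ms, while the single bundled message takes about 50 μs + 80 μs = 130 μs.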
===Not overlapping computation and communications===

GPC Quickstart

2009-08-13T19:01:23Z

Groer:

{{Infobox Computer
|image=[[Image:University_of_Tor_79284gm-a.jpg|center|300px|thumb]]
|name=General Purpose Cluster (GPC)
|installed=June 2009
|operatingsystem= Linux
|loginnode= gpc01..gpc04 (from <tt>login.scinet</tt>)
|numberofnodes=3780
|rampernode=16 GB
|corespernode=8
|interconnect=1/4 on Infiniband, rest on GigE
|vendorcompilers=icc (C) ifort (fortran) icpc (C++)
|queuetype=[[Moab/Torque]]
}}

The General Purpose Cluster is an extremely large cluster (ranked [http://www.top500.org/list/2009/06/100 16th] in the world at its inception, and fastest in Canada) and is where most simulations are to be done at SciNet. It is an IBM iDataPlex cluster based on Intel's Nehalem architecture (one of the [http://www.hpcwire.com/features/HPC-Vendors-Jump-On-Nehalem-42360237.html first in the world] to make use of the new chips). The GPC consists of 3,780 nodes with a total of 30,240 2.5GHz cores, with 16GB RAM per node (2GB per core). Approximately one quarter of the cluster is interconnected with non-blocking 4x-DDR InfiniBand while the rest of the nodes are connected with gigabit ethernet.

===Login===

First log in via ssh with your SciNet account at <tt>login.scinet.utoronto.ca</tt>, and from there you can proceed to the Development nodes to compile/test your code.

===Compile/Devel Nodes===

From a scinet login node you can ssh to <tt>gpc01</tt>..<tt>gpc04</tt>. These nodes have the same hardware configuration as most of the compute nodes -- 8 Nehalem processing cores with 16GB RAM and Gigabit ethernet. You can compile and test your codes on these nodes. To interactively test on more than 8 processors, or to test your code over an InfiniBand connection, you can submit an [[GPC_Quickstart#Submitting_an_Interactive_Job | interactive job request]].

Your [[Storage_Quickstart | home directory]] is in <tt>/home/USER</tt>; you have 10GB there that is backed up. This directory cannot be written to by the compute nodes! Thus, to run jobs, you'll use the <tt>/scratch/USER</tt> directory. Here, there is a large amount of disk space, but it is not backed up. Thus it makes sense to keep your codes in /home, compile there, and then run them in the /scratch directory.
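
The compile-in-/home, run-in-/scratch pattern can be sketched as a small shell helper. This is only an illustration under stated assumptions: the function name <tt>stage_to_scratch</tt> and the paths in the usage comment are placeholders, not something provided on the system.

```shell
# Hypothetical helper: copy an executable built under /home into a run
# directory under /scratch, creating that directory if needed.
#   $1 = path to the executable, $2 = run directory
# Prints the run directory on success.
stage_to_scratch() {
    mkdir -p "$2" && cp "$1" "$2"/ && echo "$2"
}

# Typical use (paths are placeholders):
#   cd "$(stage_to_scratch /home/USER/mycode/a.out /scratch/USER/run01)"
```

Keeping the source and build under /home means they are backed up, while only the copy in /scratch is used by the compute nodes.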

===Environment Variables===

A modules system is used to handle environment variables associated with different compilers, MPI versions, libraries etc. To see all
the options available type

<pre>
module avail
</pre>
To load a module
<pre>
module load intel
</pre>
To unload a module
<pre>
module unload intel
</pre>
To unload all modules
<pre>
module purge
</pre>

These commands should go in your .bashrc files and/or in your submission scripts to make sure you
are using the correct packages.

===Compilers===

The Intel compilers are icc/icpc/ifort for C/C++/Fortran, and are available with the default module "intel". The latest version of the GNU compiler suite (currently 4.4.0) is available by loading the "gcc" module. To ensure that the Intel compilers are in your <tt>PATH</tt> and their libraries are in your <tt>LD_LIBRARY_PATH</tt>, use the command

<pre>
module load intel
</pre>

This should likely go in your <tt>.bashrc</tt> file so that it will automatically be loaded.

===MPI===

SciNet currently provides two sets of MPI libraries for the GPC: [http://www.open-mpi.org/ OpenMPI] and [http://mvapich.cse.ohio-state.edu/ MVAPICH2]. Both sets of libraries will automatically work with both the InfiniBand and gigabit ethernet interconnects on the GPC system. We recommend OpenMPI as the default, as it quite reliably demonstrates good performance.

Both sets of libraries are compiled with the gnu compiler suite and the intel compiler suite. To use (for instance) the intel-compiled OpenMPI libraries, which we recommend as the default (and use for most of our examples here), use

<pre>
module load openmpi intel
</pre>

in your <tt>.bashrc</tt>. Other combinations behave similarly.

Both sets of MPI libraries define mpicc/mpicxx/mpif90/mpif77 as wrappers around the appropriate compilers; these ensure that the appropriate include and library directories are used in the compilation and linking steps.

We currently recommend the Intel + OpenMPI combination. However, if you require the GNU compilers as well as MPI, then the module combination

<pre>
module load gcc openmpi/1.3.2-gcc-v4.4.0-ofed
</pre>

will enable development and runtime with gcc/g++/gfortran version 4.4 and OpenMPI version 1.3.2. You can make this your default by putting the module load line in your ~/.bashrc file.

===Submitting A Batch Job===

The SciNet machines are shared systems, and jobs that are to run on them are submitted to a queue; the
[[scheduler]] then orders the jobs in order to make the best use of the machine, and has them launched
when resources become available. The intervention of the scheduler can mean that the jobs aren't
quite run in a first-in first-out order.

The maximum [[wallclock time]] for a job in the queue is 48 hours; computations that will take longer than
this must be broken into 48-hour chunks and run as several jobs. The usual way to do this is with [[checkpoints]],
writing out the complete state of the computation every so often in such a way that a job can be restarted from
this state information and continue on from where it left off. Generating [[checkpoints]] is a good idea anyway,
as in the unlikely event of a hardware failure during your run, it allows you to restart without having lost much work.
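
The checkpoint/restart idea can be sketched in a few lines of shell. This is a toy illustration only: the `computation' is just a counter, and the function name and its arguments are hypothetical; a real code would write out its full state in whatever format suits it.

```shell
# Minimal checkpoint/restart sketch. All names are hypothetical:
#   $1 = checkpoint file, $2 = total steps in the computation,
#   $3 = steps this job may run before its wallclock limit.
run_with_checkpoint() {
    ckpt=$1; total=$2; budget=$3
    step=0
    # Resume from the saved state if an earlier job left one behind.
    if [ -f "$ckpt" ]; then
        step=$(cat "$ckpt")
    fi
    done_now=0
    while [ "$step" -lt "$total" ] && [ "$done_now" -lt "$budget" ]; do
        step=$((step + 1))          # stand-in for one unit of real work
        done_now=$((done_now + 1))
        # Write to a temporary file and rename it, so that a job killed
        # mid-write never leaves a truncated checkpoint behind.
        echo "$step" > "$ckpt.tmp" && mv "$ckpt.tmp" "$ckpt"
    done
    echo "$step"                    # report how far the computation got
}
```

Calling this three times with a total of 10 steps and a budget of 4 steps per job advances the saved state to 4, then 8, then 10, which mirrors a long run split across several queue submissions.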

If your job should run in fewer than 48 hours, specify that in your script -- your job
will start sooner. (It's easier for the [[scheduler]] to fit in a short job than a long job). On the downside, the
job will be killed automatically by the queue manager software at the end of the specified [[wallclock time]], so if you
guess wrong you might lose some work. So the standard procedure is to estimate how long your job will take and
add 10% or so.
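
The estimate-plus-10% rule is simple arithmetic; the sketch below, whose function name is invented for this illustration, pads an estimate given in minutes and caps the result at the 48-hour limit, printing it in the <tt>H:MM:SS</tt> form used for <tt>walltime</tt>.

```shell
# Hypothetical helper: pad an estimated run time (in minutes) by 10%
# and cap it at the 48-hour queue limit; prints H:MM:SS.
walltime_with_margin() {
    secs=$(( $1 * 60 * 110 / 100 ))
    limit=$(( 48 * 3600 ))
    if [ "$secs" -gt "$limit" ]; then
        secs=$limit
    fi
    printf '%d:%02d:%02d\n' $((secs / 3600)) $((secs % 3600 / 60)) $((secs % 60))
}

walltime_with_margin 600    # a 10-hour estimate becomes 11:00:00
```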

You interact with the queuing system through the queue/resource manager, [[Moab]] and [[Torque]]. To see all the jobs in the queue use
<pre>
showq
</pre>

To submit your own job, you must write a script which describes the job and how it is to be run (a sample script [[GPC_Quickstart#Submission_Script | follows]]) and submit it to the queue, using the command
<pre>
qsub SCRIPT-FILE-NAME
</pre>
where you will replace <tt>SCRIPT-FILE-NAME</tt> with the file containing the submission script. This will return a job ID, for example 31415, which is used to identify the jobs. Information about a queued job can be found using
<pre>
checkjob JOB-ID
</pre>
and jobs can be canceled with the command
<pre>
canceljob JOB-ID
</pre>

Again, these commands have many options, which can be read about on their man pages.

Much more information on the queueing system is available on our [[queue]] page.

====Batch Submission Script====

A sample submission script is shown below for an mpi job using ethernet with the <tt> #PBS </tt> directives at the top and the rest being
what will be executed on the compute node.

<pre>
#!/bin/bash
# MOAB/Torque submission script for SciNet GPC (ethernet)
#
#PBS -l nodes=2:ppn=8,walltime=1:00:00
#PBS -N test

# DIRECTORY TO RUN - $PBS_O_WORKDIR is directory job was submitted from
cd $PBS_O_WORKDIR

# EXECUTION COMMAND; -np = nodes*ppn
mpirun -np 16 -hostfile $PBS_NODEFILE ./a.out
</pre>

The script above requests two nodes, using 8 processors per node, for a [[wallclock time]] of one hour. (The resources required by the job are listed on the <tt>#PBS -l</tt> line.) Other options can be given in other <tt>#PBS</tt> lines, such as <tt>#PBS -N</tt>, which sets the name of the job. On the first of the two nodes, a shell is launched that changes directory to the directory the job was submitted from (here assumed to be <tt>/scratch/USER/SOMEDIRECTORY</tt>) and then uses the <tt>mpirun</tt> command to launch the job. Assumed here is that the user has a line like

<pre>
module load openmpi intel
</pre>

in their <tt>.bashrc</tt>.
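
Rather than hard-coding <tt>-np 16</tt>, the process count can be derived from the node file. The sketch below assumes, as is usual for Torque, that <tt>$PBS_NODEFILE</tt> lists each node name once per requested processor, one name per line; the helper name <tt>count_procs</tt> is invented for this illustration.

```shell
# Hypothetical helper: count the processor slots granted to the job.
# Assumes the node file lists each node name once per requested
# processor (ppn), one name per line.
count_procs() {
    wc -l < "$1" | tr -d ' '
}

# Inside a submission script one might then write (not run here):
#   NP=$(count_procs "$PBS_NODEFILE")
#   mpirun -np "$NP" -hostfile "$PBS_NODEFILE" ./a.out
```

This keeps the <tt>mpirun</tt> line consistent with the <tt>#PBS -l</tt> request if the number of nodes is later changed.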

===Submitting an Interactive Job===

It is sometimes convenient to run a job interactively; this can be very handy for debugging purposes. In this case, you type a <tt>qsub</tt> command which submits an interactive job to the queue; when the scheduler selects this job to run, then it starts a shell running on the first node of the job, which connects to your terminal. You can then type any series of commands (for instance, the same commands listed as in the batch submission script above) to run a job interactively.

For example, to start the same sort of job as in the batch submission script above, but interactively, one would type

<pre>
$ qsub -I -l nodes=2:ppn=8,walltime=1:00:00
</pre>

This is exactly the <tt>#PBS -l</tt> line in the batch script above (which requests all 8 processors on each of 2 nodes for one hour), but prepended with a <tt>-I</tt> for `interactive'. When this job begins, your terminal will now show you as being logged in to one of the compute nodes, and one can type in any shell command, run <tt>mpirun</tt>, etc. When you exit the shell, the job will end.

===Ethernet vs. Infiniband===

About 1/4 of the GPC (862 nodes or 6896 cores) is connected with a high bandwidth low-latency fabric called
[http://en.wikipedia.org/wiki/InfiniBand InfiniBand]. Many jobs which require tight coupling to scale well greatly benefit from this interconnect;
other types of jobs, which have relatively modest communications, do not require this and run fine on Gigabit ethernet.

Jobs which require the InfiniBand for good performance can request the nodes that have the `<tt>ib</tt>' feature in the <tt>#PBS -l</tt> line,
<pre>
#PBS -l nodes=2:ib:ppn=8,walltime=1:00:00
</pre>

Because there are a limited number of these nodes, your job will typically start sooner if you do not request them (e.g. if you use the scripts as shown above), as this increases the number of nodes available to run your job. In fact, the InfiniBand nodes are to be used only for jobs that are known to scale well and will benefit from this type of interconnect. The MPI libraries provided by SciNet automatically use the correct interconnect, InfiniBand or ethernet, depending on which nodes your job runs on.
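
The resource requests shown on this page differ only in the optional node feature, which a small helper can make explicit. This is a sketch only; <tt>pbs_resources</tt> is a hypothetical name, and it merely assembles the string that would follow <tt>-l</tt>.

```shell
# Hypothetical helper: build the argument for "-l" from
#   $1 = nodes, $2 = ppn, $3 = walltime, $4 = optional node feature (e.g. ib)
pbs_resources() {
    if [ -n "${4:-}" ]; then
        echo "nodes=$1:$4:ppn=$2,walltime=$3"
    else
        echo "nodes=$1:ppn=$2,walltime=$3"
    fi
}

pbs_resources 2 8 1:00:00       # nodes=2:ppn=8,walltime=1:00:00
pbs_resources 2 8 1:00:00 ib    # nodes=2:ib:ppn=8,walltime=1:00:00
```

The same string serves for both batch and interactive requests, for example <tt>qsub -I -l "$(pbs_resources 2 8 1:00:00 ib)"</tt>.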

===Large Memory Nodes===

There are two stand-alone large memory (128GB) nodes, <tt>gpc-lrgmem01</tt> and <tt>gpc-lrgmem02</tt>, which can be used for data analysis of runs. They have 8 cores and are Intel machines running Linux, but they are not the same architecture (Nehalem) as the GPC compute nodes, so codes will have to be compiled separately for these machines. They can be logged into from <tt>login.scinet.utoronto.ca</tt>.