Introduction To Performance

From oldwiki.scinet.utoronto.ca
Jump to navigation Jump to search

The Concepts of Parallel Performance

Parallel computing used to be a very specialized domain; but now even making the best use of your laptop, which almost certainly has multiple independant computing cores, requires understanding the basic concepts of performance in a parallel environment.

Most fundamentally, parallel programming allows three possible ways of getting more and better science done:

Running your computation many times
If you have a program that works in serial, having many processors available to you allows you to run many copies of the same program at once, improving your throughput. This (can be) a sort of trivial use of parallel computing and doesn't require very specialized hardware, but it can be extremely useful for, for instance, running parameter studies or sensitivity studies. Best of all, this is essentially guaranteed to run efficiently if your serial code runs efficiently! Because this doesn't require fancy hardware, it is a waste of resources to use the Tightly Coupled System for these sorts of tasks and instead they must be run on the General Purpose Cluster.
Running your computation faster
This is what most people think of as parallel computing. It can take a lot of work to make an existing code run efficiently on many processors, or to design a new code to make use of these resources, but when it works, one can achieve a substantial speedup of individual jobs. This might mean the difference between a computation running in a feasible length of time for a research project or taking years to complete --- so while it may be a lot of work, it may be your only option. To determine whether your code runs well on many processors, you need to measure speedup and efficiency; to see how many processors one should use for a given problem you must run strong scaling tests.
Running your computation on larger problems
One achieves speedup by using more processors on the same problem. But by running your job in parallel you may have access to more resources other than just processors --- for instance, more memory, or more disks. In this case, you may be able to run problems that simply wouldn't be possible on a single processor or a single computer; one can achieve significant sizeup. To find how large a problem one can efficiently run, one measures efficiency and runs weak scaling tests.

Of course, these aren't exclusive; one can take advantage of any combination of the above. It may be that your problem runs efficiently on 8 cores but no more; however, you may be able to get use of more processors by running many jobs to explore parameter space, and already on 8 cores you may be able to consider larger problems than you can with just one!

Throughput

Throughput is the most fundamental measure of performance, and the one that ultimately most matters to most computational scientists -- if you have N computations that you need to have done for your research project, how quickly can you get them done? Everything else we'll consider here is just a way of increasing throughput T: LaTeX: 
T = \frac{\mathrm{Number}\,\mathrm{of}\,\mathrm{computations}}{\mathrm{Unit}\,\mathrm{time}} .

If you have many indepdendent computations to perform (such as a parameter study or a sensitivity study) you can increase throughput almost arbitrarily by running them alongside each other at the same time, limited only by the number of processors available (or the wait time in the queue, or the disk space available, or some other external resource constraint). This approach obviously doesn't work if you only have one computation to perform, or if later computations require the output from previous ones. In these cases, or when the individual jobs take infeasibly long, or cannot be performed on only one processor, one must resort to also using parallel programming techniques to parallelize the individual jobs.

Compute Time

Fundamental to everything else that follows is measuring the amount of time a computation takes on some problem size/amount of work LaTeX: N and some number of processors LaTeX: P. We'll done this by LaTeX: t(N,P). The easiest way to measure this time is with the time command that comes on most flavours of Unix in /bin/time or /usr/bin/time:

/bin/time myprogram
...normal program output...
658.44user 0.85system 10:59.41elapsed 

The format of the times output at the end may vary from system to system, but the basic information returned will be the same. The real or elapsed time listed is the actual Wallclock time that elapsed during the run, user or cpu is the CPU time that was actually spent doing your computation, and the system time is the system time that was spent doing system-related things during the run, such as waiting for file input/output. Our goal will be to reduce the real wallclock time that the simulation takes as much as possible while still making efficient use of the resources available.

Parallel Speedup

The speedup of an individual job with some amount of work LaTeX: N as you go from some running it serially to running on LaTeX: P is simply:

LaTeX: S(N,P) = \frac{t(N,P=1)}{t(N,P)} .

That is, the time it takes to run the computation on P vs on one processor. The way this is usually done is to run the parallel code on LaTeX: P and on LaTeX: 1 processor and take the ratio of the two times; but this is a form of cheating, as the parallel version of the code will generally have overheads (even in the one-processor case) compared to the best available serial-only version of the code. The best thing to do in considering the efficiency of the parallelization to compare the parallel code to the best available serial code that does the same job.

If you are considering the speedup of a problem that doesn't fit onto one processor, of course, the concept of speedup can be generalized; one needn't start at LaTeX: P=1.

It should go without saying that, while developing your parallel code and during performance tuning, you get the same results with multiple processors as with some `known good' serial test case; it is even easier to introduce bugs in parallel code than it is in serial code!

Efficiency

Once you have a parallel code and some timing results one can look at how efficiently you are making use of the resources as you use more and more processors. The parallel efficiency of a computation of some fixed work size running on LaTeX: P processors as compared to the LaTeX: P=1 case is

LaTeX: E = \frac{S(N,P)}{P}

That is, if you get a speedup of LaTeX: 8 \times in going from one to eight processors, you are at 1.00 or 100% efficiency; anything less and you are at lower efficiency. It isn't uncommon to achieve greater than 100% parallel efficiencies for small numbers of processors for some types of problems; as you go to more processors, you also have more processor cache, and thus more of the problems data can fit into fast cache. This is called super-linear speedup and sadly seldom extends out to very many processors.

Strong Scaling Tests

An example of a strong scaling test

The figure to the right and data below shows an example of a result of a small strong scaling test --- running a fixed-size problem on a varying number of processors to see how the timing of the computation scales with the number of processors. The code was an OpenMP code run on a node of the GPC. The quantitative results follow below; the times were measured and then speedups and efficiencies were calculated as above.

LaTeX: P LaTeX: t(N,P) LaTeX: S(N,P) LaTeX: E(N,P)
1 3:50 - -
2 2:02 1.87x 94 %
4 1:05 3.52x 88 %
6 47.8 4.81x 80 %
8 43.6 5.28x 66%

The plot shows the compute time LaTeX: t(N,P) as a function of P; if the code maintained 100% parallel efficiency, we would expect the scaling to be as 1/P, so we plot it on a log-log scale. Also shown is the ideal scaling case -- what the times would be if, using the LaTeX: P=1 timing as a normalization, we did get 100% efficiency. We can see that past 4 cores the measured case starts to significantly deviate from the ideal, and it looks like things would only get worse past 8 cores.

It's important to note here that scaling tests should be done on realistic problem sizes and for realistic lengths of time. Generally, for either serial or parallel programs there will be some overhead both at initialization time and during the course of the computation; if the problem size is too small, the overhead during the course of the run might be a significant fraction of the real work, and the program will behave needlessly poorly. Similarly, if the number of timesteps or iterations is too small, the initizalization overhead will similarly play a spuriously large role in the performance.

The above behaviour is typical for a small computation; it won't scale to too many cores, and the efficiency becomes monotonically worse as one increases the number of cores in use. The rate at which this happens will depend on the problem size and the type of computation. How is one to tell where to stop; how good an efficiency is good enough? Certainly there are rules of thumb --- one shudders to see efficiencies below 50% --- but one can arrive at more meaningful and quantitative results by considering throughput. Let's imagine we had 64 cores at our disposal, and we wanted to run 96 jobs as quickly as possible. Our total time to completion of the 96 jobs would vary with the number of cores we ran per job as follows:

LaTeX: P Time for one job Time for all 96 jobs
1 3:50 7:40 (2 batches, 64 jobs then 32)
2 2:02 7:08 (3 batches, 32,32,32)
4 1:05 6:30 (6 batches, 6x16)
6 47.8 7:58 (10 batches, 9x10, 6)
8 43.6 8:43 (12 batches)

If we use more than 4 processes per job in this case, it will actually take us longer to do all our runs! For jobs that scale better with the number of processes (this could be a different program, or the same program with different problem size), we will find this turnover point to be at higher LaTeX: P; for jobs that scale worse, lower LaTeX: P.

Weak Scaling Tests

An example of a weak scaling test

The strong scaling test described above considers the performance of a parallel code with a fixed work size as the number of processors varies; this tells us how the parallel overhead behaves as you go to more and more processors. A weak scaling test fixes the amount of work per processor and compares the execution time over number of processors. Since each processor has the same amount to do, in the ideal case the execution time should remain constant. While the strong scaling test tells you how the parallel overhead scales with LaTeX: P, the weak scaling test tells you something weaker -- whether the parallel overhead varies faster or slower than the amount of work.

Nonetheless, the weak scaling test can be the relevant one for determining how large a problem size one can efficiently compute with a given parallel code and system. An example of results for a weak scaling test on the GPC and TCS up to 256 processors (8 nodes of the TCS, 32 of the GPC) is shown to the right. In this case we are maintaining extremely good efficiency up to at least 128 processors with constant work per process on both architectures. It is possible to see different behaviour when first filling up a node (eg, for less than 8 processes for the GPC, or 64 for TCS) than when one starts crossing nodes; one should understand this but it doesn't necessarily indicate problems.

Performance Tuning

You cannot improve what you cannot measure. Performance tuning is an iterative process between running an instrumented version of your code, getting data on performance throughout the code, and attempting to make chances to the code that will make it run more efficiently.

There are three main ways of instrumenting a code to find its performance. The first is manually adding timers around important parts of the code to find out how much time is spent in each part. This is worth thinking about doing when putting together a new code, as it means that you'll have a very robust way of finding out how well the different parts of the code perform on different platforms and with different compiler options, etc.. The results are, however, necessarily very coarse-grained; they are very useful for comparing performance under different situations, but give very little information about whether or not there are performance problems or what they might be.

The second technique is sampling, sometimes called `program counter sampling' or `statistical sampling'. In this case, the program is run in an environment where it is interrupted briefly at some set frequency (typically something like 100 times per second) and the location of the program counter is jotted down before the program is resumed. At the end of the program, these locations are translated into locations in the source code, and one has a statistical profile of where the program has spent its time.

Statistical sampling has several advantages. It has a very low overhead --- the sampling procedure for instance takes much less time than a function call to a timer routine --- so that the program runs much as it would without the measurement process. If the samples are taken often enough, the result is a very accurate picture of where your program is spending its time, allowing you to very quickly identify `hotspots' in the code and focus your attention on the most costly areas of the program. This combination of relevant information and low-overhead makes statistical sampling the first resort for serious performance measurement.

Sampling, however, has drawbacks. While it lets you know where the program is spending its time, it doesn't tell you why, or how it got there in the first place. For instance, in a parallel program you may be spending too much time in barriers of one sort or another (perhaps at MPI_WAITALL calls in MPI, or implicit barriers at the end of parallel sections in OpenMP) but unless you know where in the code that routine was called from, you can't address the problem. In this case you need some sort of trace through the program which keeps track of which routine called what. This is generally a much heavier-weight process, which can substantially increase the runtime of the code, running the risk of 'the Heisenburg effect' - measurement changing the system under observation. On the other hand, sometimes you just need that level of information, so tracing packages or libraries must be used.

A related method is the use of hardware counters --- counters within the CPU itself which keep track of performance-related information, such as the number of cache misses or branch mis-predictions within your code. Using this information, either regularly throughout the code or once for the entire code run can give very specific information about performance problems. Right now these counters are available on the TCS system but not on the GPC system, as the mainstream Linux kernel does not provide access to these counters.

Command-line Performance Tools

Many of the tools below can be used to examine both serial and parallel performance problems with a code. We'd like to encourage you to tune serial performance first. Worrying about parallel performance before the code performs well with a single task doesn't make much sense! Profiling your code when running with one task allows you to spot serial `hot spots' for optimization, as well as giving you more detailed understanding of where your program spends its time. Further, any performance you make in the serial code will automatically speed up your parallel code.

We've already talked about coarse-grained measurements such as timers within the code and using tools such as /bin/time. These are very useful for comparing overall performance between different platforms/parameters, but we won't need to discuss them further here.

gprof (profiling: everywhere)

A statistical sampling workhorse is gprof, the GNU version of an old common Unix utility called prof. To use this, the code must be re-compiled with both source-code symbols intact (-g) and with profiling information available (for most compilers, this is -pg; for the IBM compilers (xlf, xlc, xlC) it is -p). It is worth knowing because of its ubiquity, and because it contains much of the functionality of newer tools so that the same concepts occur in other concepts.

So let's consider the following trivial program pi.c: <source lang="c" line=1>

  1. include <stdio.h>
  2. include <stdlib.h>
  3. include <time.h>

double calc_pi(long n) {

   long in = 0;
   long out = 0;
   long i;
   double x,y;
   for (i=0; i<n; i++) {
       x = drand48();
       y = drand48();
       if (x*x+y*y < 1) {
           in++;
       } else {
           out++;
       }
   }
   return 4.*(double)in/(double)(in+out);

}

int main(int argc, char **argv) {

   long n, defaultn=100000;
   double pi;
   time_t t;
   /* seed random number generator */
   srand48(time(&t));
   /* get number of tries */
   if (argc < 2 || (n=atoi(argv[1]))<1) {
       n = defaultn;
       printf("Using default n = %ld\n", n);
   }
   pi = calc_pi(n);
   printf("Pi = %lf\n", pi);
   return 0;

} </source>

We can compile this with profiling on and run it:

$ gcc -g -pg -o pi pi.c
$ ./pi 100000000
Pi = 3.141804

(Note that this isn't a very good way of calculating pi!). On exit, this program creates a file called gmon.out; this contains the profiling information about the run of the code. We can take a look at this by using gprof:

$ gprof pi gmon.out 
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
100.88      1.00     1.00        1   998.76   998.76  calc_pi

index % time    self  children    called     name
                1.00    0.00       1/1           main [2]
[1]    100.0    1.00    0.00       1         calc_pi [1]
-----------------------------------------------
                                                 <spontaneous>
[2]    100.0    0.00    1.00                 main [2]
                1.00    0.00       1/1           calc_pi [1]
-----------------------------------------------

The first part tells us that essentially all of the time spent running was in the calc_pi() routine (of course), and the second part attepts to be a call graph, showing that main called calc_pi() once. An important concept in the timing is the `self' and `children' times for each routine, sometimes called the inclusive and exclusive time. Because most routines call many other routines, its often useful to distinguish between the total amount of time spent between starting and ending the routine (the `inclusive' time) and that same time excluding the time spent in child routines (`exclusive' time).

The above results are fairly trivial and not very useful for this simple program, but in more complicated routines it can be very valuable to narrow down hotspots to particular regions of code.

The AIX tool Xprof gives a visual representation of the gprof output.

In fact, gprof also allows you to view the time spent in the code by lines of code. As you chop the program up finer, the statistical sampling gets less accurate; thus to look at the results by line of code you must be sure that your sample run was long enough to get meaningful data. But the results can be extremely useful:

$ gprof --line pi gmon.out
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  Ts/call  Ts/call  name    
 70.31      0.70     0.70                             calc_pi (pi.c:14 @ 40078b)
 14.27      0.84     0.14                             calc_pi (pi.c:17 @ 4007bc)
  5.10      0.89     0.05                             calc_pi (pi.c:11 @ 4007c1)
  4.08      0.93     0.04                             calc_pi (pi.c:15 @ 4007b5)
  3.06      0.96     0.03                             calc_pi (pi.c:13 @ 400781)
  2.55      0.98     0.03                             calc_pi (pi.c:12 @ 400777)
  1.53      1.00     0.02                             calc_pi (pi.c:11 @ 40076d)
  0.00      1.00     0.00        1     0.00     0.00  calc_pi (pi.c:5 @ 40074c)

where now we can see that the single line containing the radius calculation (if (x*x+y*y < 1)) is 70% of the work for the entire program. This tells you where you should spend your time to optimize the code. Other tools exist for this sort of line-by-line analysis; gcov in the gcc compiler suite counts the number of times a given source line is executed - the idea was for coverage analysis for test suites, but it certainly can be used for profiling as well; however, usually the amount of time spent at a line is more important than the number of executions.

For parallel programs, gprof will generally output a seperate gmon.out file for each process; for threaded applications, output for all threads will be summed into the same gmon.out. It may be useful to sum up all the results and view them with gprof or to look at them individually.

There are other tools for looking at the same data. For instance, on the TCS system, the command Xprof (run the same way as gprof; Xprof program_name gmon.out) lets you look at the call tree as a graphical tree. Each routine is shown by a block with a size proportional to the time spent in each routine; the width is the inclusive time, and the height is the exclusive time.

hpmcount (performance counters: TCS)

On the TCS, hpmcount allows the querying of the performance counter values over the course of a run. Since here we are simply asking the CPU to report values it obtains during the run of a program, the code does not need to be instrumented; simply typing

hpmcount hpmcount_args program_name program_args

will run the program and output the results from the hardware performance counters at the end. So for instance, with our trivial pi program above,

tcs-f11n05-$ hpmcount ./pi
Using default n = 100000
Pi = 3.144240
 Execution time (wall clock time): 0.020325 seconds

 ########  Resource Usage Statistics  ########  

 Total amount of time in user mode            : 0.012754 seconds
 Total amount of time in system mode          : 0.001486 seconds
 Maximum resident set size                    : 440 Kbytes
 Average shared memory use in text segment    : 0 Kbytes*sec
 Average unshared memory use in data segment  : 0 Kbytes*sec
 Number of page faults without I/O activity   : 53
 Number of page faults with I/O activity      : 1
 Number of times process was swapped out      : 0
 Number of times file system performed INPUT  : 0
 Number of times file system performed OUTPUT : 0
 Number of IPC messages sent                  : 0
 Number of IPC messages received              : 0
 Number of signals delivered                  : 0
 Number of voluntary context switches         : 6
 Number of involuntary context switches       : 0

 #######  End of Resource Statistics  ########

 Set: 1
 Counting duration: 0.014947083 seconds
  PM_FPU_1FLOP (FPU executed one flop instruction )          :          400093
  PM_FPU_FMA (FPU executed multiply-add instruction)         :          500030
  PM_FPU_FSQRT_FDIV (FPU executed FSQRT or FDIV instruction) :               1
  PM_CYC (Processor cycles)                                  :        58485795
  PM_RUN_INST_CMPL (Run instructions completed)              :        24238152
  PM_RUN_CYC (Run cycles)                                    :        70307511


  Utilization rate                                 :          61.172 %
  Flop                                             :           1.400 Mflop
  Flop rate (flops / WCT)                          :          68.888 Mflop/s
  Flops / user time                                :         112.614 Mflop/s
  FMA percentage                                   :         111.103 %

There are a variety of sets of performance counters that can be reported; the default set isn't especially helpful for HPC-type computations; sets of performance counters can be specified on the commandline in the format -d -s item,item,item. Sets 5 and 12 are very useful for showing memory performance (showing L1 and L2 cache misses) and set 6 is especially useful for shared memory profiling, giving statistics about how often off-processor memory had to be accessed.

Showing the counters for the entire program will often tell you if there's a problem or not, but won't tell you where it is. For more detailed information, one can use the hpm library to manually instrument different regions of your code, and get similar outputs to above for several different, smaller, regions of code.

On the linux side, oprofile allows the reporting of similar information, but to use it one must have root access to the linux machine.

cachegrind (Memory use analysis: GPC)

kcachegrind, part of the KDE development package, can give graphical overviews of the output from cachegrind

valgrind is a memory tool that is usually thought of in terms of finding memory-access bugs in large programs. Rather than instrumenting a code or measuring counters, valgrind takes a fairly extreme approach -- it emulates your program running on a computer, essentially running a simulation of your program running on the same kind of computer valgrind is running on. This has enormous overhead (runtimes can be up to 20x as long as normal) but the result is exquisitely detailed information about what your program is doing.

Memory access is often a bottleneck for HPC codes, and cachegrind is a tool for valgrind which simulates the use of cache in your program, giving you line-by-line information on which parts of the code have cache performance issues. Your code does not need to be recompiled, although compiling with -g is necessary for the output to be useful. Cachegrind is run as shown:

valgrind --tool=cachegrind myprogram myprogram_arguments

Overall results for the whole program are given at the end of the programs normal output, and more detailed information is saved in a file that begins with cachegrind.out. These output files are XML files - readable in principle by humans, but it is much easier to see what is going on with visual tools like kcachegrind (shown to the right) or, eventually, valkyrie (which also can be used for memcheck output.)

Graphical Performance Tools

While graphical performance tools typically measure similar things to their command-line relations, a graphical display opens up the possibility of aggregating much more information and having increased flexibility in displaying it in a variety of ways; this can be very helpful, especially in the initial stages of finding performance problems.

OpenSpeedShop (profiling, MPI tracing: GPC)

OpenSpeedShop, like gprof will tell you where the hotspots are in the code by function
...or by line of code

OpenSpeedShop is a tool that (will be) installed on GPC. It provides the functionality of gprof, with the addition of hardware counter measurements (not currently supported on GPC machines) and options to do both lightweight and more detailed, more heavy-weight profiling. OpenSpeedShop also contains enhanced support for dealing with parallel runs, and for tracing of MPI or I/O calls to find performance problems to those areas. The parallel support is considerably more than what gprof offers; bundling the data from thousands of tasks into one set of results is a significant algorithmic challenge in itself.

Another important addition, shared by many of the other graphical tools, is the idea of bundling results into different `experiments' --- bundles of an executable, measurement type, and resulting data --- which makes the iterative process of performance tuning much easier. OpenSpeedShop, as with some other tools, has the ability to directly compare the results of different experiments, so one can more easily see if a particular change made things better or worse, and if so where.

OpenSpeedshop does not require re-compilation of the executable (although as with all these tools, for the correlation with the source code to be useful, the code should be compiled with debugging symbols, the option for which is almost universally -g to the compiler and linker. The code is then either instrumented, or run in an instrumented environment. Shown to the right are two of the views available for exmaining the timing results of one OpenMP code.

OpenSpeedShop can be launched from the commandline and then used entirely through the gui: there are a variety of `wizards' which guide you through choosing how to instrument and run your experiment:

$ openss

this should be done from a directory containing the source code and the executable. This is an excellent way to get started with this tool. Once one is more familiar with the tool, one can run a variety of experiments on the command line:

openss –f program_name pcsamp

where the above runs the pcsamp (program counter sampling, as in gprof) measurement on the executable program_name. Then one can launch the gui to view the results. There are options for instrumenting the executable in a variety of ways, and taking different measurements; the OpenSpeedShop web page contains links to documentation and tutorials.

PeekPerf (profiling, TCS)

An example of using PeekPerf

Peekperf is IBM's single graphical `dashboard' providing access to many performance measurement tools for exmaining Hardware Counter data, threads, message passing, IO, and memory access, several of which are available seperately as command-line tools. Like OpenSpeedShop, it does not require re-compilation of the executable; an instrumented version of the code is generated at run time and this instrumented version is executed with whatever options you care to pass to it. It does not have the same support for comparing experiments as OpenSpeedShop does, however it allows running several different types of measurements at once, seeing how they correlate in a given run; this is something OpenSpeedShop doesn't have.

One starts peekperf at the commandline

$ peekperf

and tell peekperf which executableyou wish to run measurements on. You then highlight which sorts of measurements you wish to make (which sorts are available depend on the type of program - threaded, OpenMPed, etc), select `generate an instrumented executable', and then `run the instrumented executable', giving it the name of either the instrumented executable or a script to run the same; the program will then begin displaying the resulting data as soon as the program has completed.

Understanding the interface and resulting data takes some practice, and the documentation is quite sparse; however the flexibility in the range of measurements to take makes this an excellent source of performance information for programs running on the TCS system.

Scalasca (profiling, tracing: TCS, GPC)

An example of using Scalasca

| Scalasca is a sophisticated tool which takes the aggregation of data shown in the above graphical tools one step further and analyzes the results to pinpoint and display common performance problems; it scales extremely well and the graphical display makes it very easy for the user to find out where the performance issues are.

Scalasca requires the code to be recompiled, and it has wrapper scripts to select and choose the right options to use. If your code for instance is normally compiled with

ifort -c myprog.f 
ifort -o myprog myprog.o -lm

then one can instead use

scalasca -instrument ifort -c myprog.f
scalasca -instrument ifort -o myprog myprog.o -lm

Scalasca then parses the rest of the command line and adds the necessary flags. (If you are curious, scalasca -instrument -v will show you what the resulting command line actually is.) There is also a shortcut, skin, which is equivalent to scalasca -instrument.

When the new executable is generated, it's run in a similar way; if you normally run your program as

./myprog

or

mpirun -np 5 ./myprog

You'd instead do

scalasca -analyze ./myprog
scalasca -analyze mpirun -np 5 ./myprog

The program will run as usual with only additional lines about output to files. (Again, there is a shortcut available; scan is equivalent to scalasca -analyze.) To then look at the results, one uses

scalasca -examine [epik directory name]

where the directory name is that created by the analyze program.

A screenshot of the results is shown to the right for an OpenMP program, where wait times at implicit barriers at the end of parallel sections is selected as an important metric to show on the left; the middle panel shows the call tree indicating the context in which the delays occurred, and the panel on the right gives the breakdown for each thread.

VTune/Thread Profiler (GPC)

Common Serial Performance Problems

Poor use of cache

A classic problem for scientific codes is memory bandwidth; the capacity to do on-chip floating-point or integer operations has grown much faster than the ability to get numbers onto the chip in the first place. One way around that is to use various levels of memory cache; when one number is needed from memory, a whole line of data is brought in from (slow) external memory to fast on-chip cache. This makes the memory access modestly slower, but tends to greatly speed up performance since if you are going to do something to data in one part of memory you're typically going to be also doing to neighboring values.

If you take advantage of data locality --- accessing memory in some kind of order rather than jumping around in memory --- cache can greatly increase the performance of your code. On the other hand, if you do jump around in memory a lot, cache will actually hurt your performance.

The classic way this comes up is in accessing multidimensional arrays. The example below is simplified; most cases aren't this extreme (or obvious!) but the idea is the same. Let's consider the following FORTRAN code, which simply iterates a few time through a modestly sized multidimensional array: <source lang="f">

     program memaccess
     integer, parameter :: li=32,lj=32,lk=32,ll=32,lm=32
     real, dimension(li,lj,lk,ll,lm) :: a
     integer :: i,j,k,l,m
     integer :: iter
     
     a = 0.
     do iter=1,10
     do m=1,lm 
        do l=1,ll
            do k=1,lk
                do j=1,lj
                    do i=1,li
                       a(i,j,k,l,m) = a(i,j,k,l,m)+ i+j+k+l+m
                    enddo
                enddo
            enddo
        enddo
     enddo
     enddo
     end program

</source> The above program, which we'll suggestively call memaccess-good.f, accesses array elements in the order that FORTRAN places in them in the computer's memory; FORTRAN lays out this array in memory as [a(1,1,1,1,1), a(2,1,1,1,1),... a(31,1,1,1,1), a(1,2,1,1,1)...] and so on. So by ordering our loops that way we are marching through memory in order, making maximum use of cache. The resulting code can be timed:

$ gfortran -O3 -o memaccess-good memaccess-good.f 
$ time ./memaccess-good

real    0m2.478s
user    0m2.337s
sys     0m0.094s

If we reverse the order of the loops, so they go do i=1,li.. do j=1,lj,..... do m=1,lm, however, we get

$ gfortran -O3 -o memaccess-bad memaccess-bad.f
$ time ./memaccess-bad

real    0m19.622s
user    0m19.101s
sys     0m0.098s

A factor of 8 times worse! Thus tools such as cachegrind can be extremely important for finding significant performance problems in memory-heavy codes.

C-based languages arrange their arrays the opposite ways in memory, so that the equivalent array in C would go as [a[0][0][0][0][0], a[0][0][0][0][1], ... a[0][0][0][0][31], a[0][0][0][1][0], ... ]; thus `bad' array access in FORTRAN looks like `good' array access in C, and vice versa.

Common OpenMP Performance Problems

Common MPI Performance Problems

Overuse of MPI_BARRIER

Many Small Messages

Typically, a the time it takes for a message of size n to get from one node to another can be expressed in terms of a latency l and a bandwidth b, LaTeX: t_c = l + \frac{n}{b} . For small messages, the latency can dominate the cost of sending (and processing!) the message. By bundling many small messages into one, you can amortize that cost over many messages, reducing the time spent communicating.

Not overlapping computation and communications