Introduction To Performance

Serial Performance

Worrying about parallel performance before the code performs well with a single task doesn't make much sense! Profiling your code when running with one task allows you to spot serial "hot spots" for optimization, and gives you a more detailed understanding of where your program spends its time.

/bin/time

gprof

vtune (Intel)

peekperf, hpmcount (p6)

Parallel Performance

Speedup

$S(N,P) = \frac{t(N,P=1)}{t(N,P)}$

Efficiency

$E = \frac{S}{P}$
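
As a rough illustration of these two definitions, the sketch below computes speedup and efficiency from wall-clock times. The function names and the timings are hypothetical, made up for the example. Speedup is taken here as the one-task time divided by the P-task time, for a fixed problem size N.

```python
def speedup(t_serial, t_parallel):
    """Speedup S = t(N, P=1) / t(N, P) for a fixed problem size N."""
    if t_serial <= 0 or t_parallel <= 0:
        raise ValueError("timings must be positive")
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, p):
    """Parallel efficiency E = S / P for a run on p tasks."""
    if p < 1:
        raise ValueError("task count must be at least 1")
    return speedup(t_serial, t_parallel) / p


if __name__ == "__main__":
    # Hypothetical wall-clock times in seconds, keyed by task count P.
    # These are illustrative values, not measurements.
    timings = {1: 120.0, 2: 64.0, 4: 36.0, 8: 24.0}
    t1 = timings[1]
    for p, tp in sorted(timings.items()):
        s = speedup(t1, tp)
        e = efficiency(t1, tp, p)
        print(f"P={p:2d}  S={s:5.2f}  E={e:5.2f}")
```

An efficiency close to 1 means the extra tasks are being used well. An efficiency that falls as P grows is the usual sign that overheads such as communication or load imbalance are taking a larger share of the run time.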

Strong Scaling

Weak Scaling

Common OpenMP Performance Problems

Common MPI Performance Problems

Overuse of MPI_BARRIER

Many Small Messages

Typically, the time it takes for a message of size n to get from one node to another can be expressed in terms of a latency l and a bandwidth b: $t_c = l + \frac{n}{b}$. For small messages, the latency term can dominate the cost of sending (and processing!) the message. By bundling many small messages into one, you pay the latency once rather than once per message, reducing the time spent communicating.
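
To make the latency-plus-bandwidth model concrete, here is a minimal sketch that compares sending many small messages against one bundled message of the same total size. The function names and the network parameters (1 microsecond latency, 1 GB/s bandwidth) are assumptions chosen for illustration, not measurements of any particular machine.

```python
def transfer_time(n_bytes, latency, bandwidth):
    """Model time for one message: latency + n_bytes / bandwidth."""
    return latency + n_bytes / bandwidth


def total_time(num_messages, bytes_each, latency, bandwidth):
    """Model time for num_messages equal-sized messages sent one after another."""
    return num_messages * transfer_time(bytes_each, latency, bandwidth)


if __name__ == "__main__":
    # Assumed, illustrative network parameters.
    latency = 1e-6     # seconds per message
    bandwidth = 1e9    # bytes per second

    # Same payload (8000 bytes) sent two ways.
    many_small = total_time(1000, 8, latency, bandwidth)
    one_bundled = total_time(1, 8000, latency, bandwidth)

    print(f"1000 messages of 8 bytes: {many_small:.3e} s")
    print(f"1 message of 8000 bytes:  {one_bundled:.3e} s")
    print(f"ratio: {many_small / one_bundled:.1f}")
```

Under these assumed numbers the 1000 small messages pay the latency 1000 times, while the bundled message pays it once, so nearly all of the small-message cost is latency. The model ignores the cost of packing and unpacking the bundled buffer, which would need to be measured separately.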