Difference between revisions of "Introduction To Performance"

From oldwiki.scinet.utoronto.ca

Revision as of 11:22, 3 June 2009

The Concepts of Parallel Performance

Parallel computing used to be a very specialized domain, but now even making the best use of your laptop, which almost certainly has multiple independent computing cores, requires understanding the basic concepts of performance in a parallel environment.

Most fundamentally, parallel programming allows three possible ways of getting more and better science done:

  • Running many copies of the same program. If you have a program that works in serial, having many processors available to you allows you to run many copies of the same program at once, improving your throughput. This can be a fairly trivial use of parallel computing and doesn't require very specialized hardware, but it can be extremely useful for, for instance, running parameter studies or sensitivity studies. Best of all, this is essentially guaranteed to run efficiently if your serial code runs efficiently! Because this doesn't require fancy hardware, it is a waste of resources to use the Tightly Coupled System for these sorts of tasks; they must instead be run on the General Purpose Cluster.
  • Running the same program on many processors. This is what most people think of as parallel computing. It can take a lot of work to make an existing code run efficiently on many processors, or to design a new code to make use of these resources, but when it works, one can achieve a substantial speedup of individual jobs. This might mean the difference between a computation running in a feasible length of time for a research project or taking years to complete --- so while it may be a lot of work, it may be your only option. To determine whether your code runs well on many processors, you need to measure speedup and efficiency; to see how many processors one should use for a given problem you must run strong scaling tests.
  • Running larger problems. One achieves speedup by using more processors on the same problem. But by running your job in parallel you may also gain access to resources other than processors, for instance more memory or more disks. In this case, you may be able to run problems that simply wouldn't be possible on a single processor or a single computer; one can achieve significant sizeup. To find how large a problem one can efficiently run, one measures efficiency and runs weak scaling tests.

Of course, these aren't exclusive; one can take advantage of any combination of the above. It may be that your problem runs efficiently on 8 cores but no more; however, you may be able to get use of more processors by running many jobs to explore parameter space, and already on 8 cores you may be able to consider larger problems than you can with just one!


Throughput is the most fundamental measure of performance, and the one that ultimately matters most to most computational scientists: if you have N computations that you need done for your research project, how quickly can you get them done? Everything else we'll consider here is just a way of increasing throughput T:

LaTeX: T = \frac{\mathrm{Number}\,\mathrm{of}\,\mathrm{computations}}{\mathrm{Unit}\,\mathrm{time}} .

If you have many independent computations to perform (such as a parameter study or a sensitivity study) you can increase throughput almost arbitrarily by running them alongside each other at the same time, limited only by the number of processors available (or the wait time in the queue, or the disk space available, or some other external resource constraint). This approach obviously doesn't work if you only have one computation to perform, or if later computations require the output from previous ones. In these cases, or when the individual jobs take infeasibly long or cannot be performed on only one processor, one must resort to also using parallel programming techniques to parallelize the individual jobs.
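As a rough illustration of how throughput grows when independent jobs run side by side, here is a minimal sketch. It assumes every job takes the same time `t_job` and that processors are the only constraint (no queue wait, no disk limits); the function names `makespan` and `throughput` are made up for this example.

```python
import math


def makespan(n_jobs, n_procs, t_job):
    """Wall-clock time to finish n_jobs independent, equal-length jobs.

    Assumes each job uses exactly one processor, so the jobs run in
    "waves" of at most n_procs jobs at a time.
    """
    waves = math.ceil(n_jobs / n_procs)
    return waves * t_job


def throughput(n_jobs, n_procs, t_job):
    """Computations completed per unit time: T = N / (total time)."""
    return n_jobs / makespan(n_jobs, n_procs, t_job)


if __name__ == "__main__":
    # Hypothetical parameter study: 100 runs of 2 hours each.
    for procs in (1, 8, 32, 100, 200):
        print(procs, "processors:", throughput(100, procs, t_job=2.0), "runs/hour")
```

In this simplified model throughput stops improving once there are as many processors as jobs, which is the point where parallelizing the individual jobs becomes the only way forward.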


LaTeX: S(N,P) = \frac{t(N,P=1)}{t(N,P)}


LaTeX: E = \frac{S}{P}
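Reading the two formulas above as the speedup S and the parallel efficiency E, with t(N,P) the time to run a problem of size N on P processors, both can be computed directly from measured run times. The sketch below is only an illustration: the timings are invented, and the helper names (`speedup`, `efficiency`, `scaling_table`) are not from any particular tool.

```python
def speedup(t_serial, t_parallel):
    """S = t(N, P=1) / t(N, P) for a fixed problem size N."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, n_procs):
    """E = S / P; a value of 1.0 would mean ideal scaling."""
    return speedup(t_serial, t_parallel) / n_procs


def scaling_table(timings):
    """Turn {P: run time} into a sorted list of (P, S, E) tuples.

    The dictionary must contain an entry for P = 1, which is used
    as the serial reference time.
    """
    t1 = timings[1]
    return [
        (p, speedup(t1, t), efficiency(t1, t, p))
        for p, t in sorted(timings.items())
    ]


if __name__ == "__main__":
    # Made-up wall-clock times (seconds) for the same problem size.
    measured = {1: 100.0, 2: 52.0, 4: 28.0, 8: 17.0}
    for p, s, e in scaling_table(measured):
        print(f"P={p:2d}  S={s:5.2f}  E={e:4.2f}")
```

Keeping the problem size fixed while P grows, as this table does, corresponds to a strong scaling test.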

Strong Scaling

Serial Performance

Worrying about parallel performance before the code performs well with a single task doesn't make much sense! Profiling your code when running with one task allows you to spot serial "hot spots" for optimization, as well as giving you more detailed understanding of where your program spends its t



vtune (Intel)

peekperf, hpmcount (p6)

Weak Scaling

Parallel Performance Tools

Common OpenMP Performance Problems

Common MPI Performance Problems

Overuse of MPI_BARRIER

Many Small Messages

Typically, the time it takes for a message of size n to get from one node to another can be expressed in terms of a latency l and a bandwidth b: LaTeX: t_c = l + \frac{n}{b} . For small messages, the latency can dominate the cost of sending (and processing!) the message. By bundling many small messages into one, you can amortize that cost over many messages, reducing the time spent communicating.
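The latency/bandwidth model above can be turned into a small calculation comparing k separate messages with one bundled message. This is a sketch of the cost model only, not of real MPI calls; the latency and bandwidth figures are placeholder values, not measurements of any actual network.

```python
def message_time(n_bytes, latency, bandwidth):
    """Cost model t_c = l + n / b for a single message."""
    return latency + n_bytes / bandwidth


def many_small(k, n_bytes, latency, bandwidth):
    """Total time for k separate messages of n_bytes each."""
    return k * message_time(n_bytes, latency, bandwidth)


def one_bundled(k, n_bytes, latency, bandwidth):
    """Total time if the same k payloads are packed into one message."""
    return message_time(k * n_bytes, latency, bandwidth)


if __name__ == "__main__":
    # Placeholder network parameters: 1 microsecond latency, 1 GB/s bandwidth.
    latency, bandwidth = 1.0e-6, 1.0e9
    k, n_bytes = 1000, 8  # e.g. 1000 individual double-precision values
    print("separate:", many_small(k, n_bytes, latency, bandwidth), "s")
    print("bundled: ", one_bundled(k, n_bytes, latency, bandwidth), "s")
```

In this model the bandwidth term is the same either way; bundling saves (k - 1) latencies, which is why it matters most when the individual messages are small.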

Non-overlapping of computation and communications