Parallel Performance Project Research Paper

Approaching a Machine-Application Bound in Delivered Performance on Scientific Code
William Mangione-Smith, Tien-Pao Shih, Santosh G. Abraham, and Edward S. Davidson
Special Issue of IEEE Proceedings on Computer Performance Analysis, August 1993.

Abstract

We have developed a performance bounding methodology that explains the performance of loop-dominated scientific applications on particular systems. We model the throughput of key hardware units that are common bottlenecks in concurrent machines. Four units are currently used: memory interface, floating-point, instruction issue, and a "dependence unit" that models the effects of performance-limiting recurrences. We propose a workload characterization and derive upper bounds on the performance of specific machine-workload pairs. Comparing delivered performance with these bounds focuses attention on areas for improvement and indicates how much improvement might be attainable.
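As a minimal sketch of how such a bound might be computed, assume that each modeled unit needs some number of clocks per loop iteration and that the slowest unit limits throughput. The max-of-units formulation, the class and field names, and all operation counts and machine rates below are illustrative assumptions, not values or definitions taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class LoopWorkload:
    """Per-iteration counts for one inner loop (hypothetical characterization)."""
    flops: int                # floating-point operations credited per iteration
    fp_instructions: int      # floating-point instructions issued per iteration
    mem_operations: int       # loads plus stores per iteration
    instructions: int         # total instructions per iteration
    recurrence_clocks: float  # clocks per iteration forced by a recurrence


@dataclass
class Machine:
    """Peak throughput of each modeled unit (hypothetical machine)."""
    fp_per_clock: float
    mem_per_clock: float
    issue_per_clock: float


def cpf_bound(work: LoopWorkload, machine: Machine) -> tuple[float, str]:
    """Return (lower bound on clocks per flop, name of the limiting unit)."""
    clocks = {
        "memory interface": work.mem_operations / machine.mem_per_clock,
        "floating-point": work.fp_instructions / machine.fp_per_clock,
        "instruction issue": work.instructions / machine.issue_per_clock,
        "dependence unit": work.recurrence_clocks,
    }
    # Assumption: units overlap perfectly, so the busiest unit sets the bound.
    limiting = max(clocks, key=clocks.get)
    return clocks[limiting] / work.flops, limiting


# Hypothetical loop: one multiply-add (2 flops), 3 memory operations,
# 5 instructions in total, and no recurrence.
loop = LoopWorkload(flops=2, fp_instructions=1, mem_operations=3,
                    instructions=5, recurrence_clocks=0.0)
machine = Machine(fp_per_clock=1.0, mem_per_clock=1.0, issue_per_clock=4.0)

bound, unit = cpf_bound(loop, machine)
print(f"bound = {bound:.2f} CPF, limited by the {unit}")
```

In this made-up case the memory interface needs 3 clocks per iteration while the other units need fewer, so the loop cannot run faster than 1.5 CPF no matter how well the code is scheduled.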

A detailed analysis and performance improvement effort for the IBM RS/6000, using the Livermore Fortran Kernels 1-12 to represent the target workload, produces an average lower bound of 1.27 clocks per floating-point operation (CPF), whereas machine peak performance is 0.5 CPF and the V2.01 Fortran compiler attains only 2.43 CPF. Code improvements in this study have achieved 1.36 CPF, increasing the harmonic mean steady-state inner loop performance to 97.6% of the MFLOPS bound. Subsequently the V2.02 compiler achieved 1.75 CPF, and 1.60 CPF with carefully chosen preprocessing. A goal-directed compiler with knowledge of the bounds could produce higher-performance code more efficiently and automatically.
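To put the CPF figures above on a common scale, the short sketch below divides the 1.27 CPF bound by each achieved CPF, giving the fraction of the bounded performance that each version delivers. The dictionary labels and the helper name `fraction_of_bound` are our own. These ratios are taken on the average CPF values, so they need not match the 97.6% figure, which the abstract states for steady-state inner loop performance.

```python
# Average CPF values quoted in the abstract (lower is better).
BOUND_CPF = 1.27

achieved_cpf = {
    "V2.01 compiler": 2.43,
    "improved code (this study)": 1.36,
    "V2.02 compiler": 1.75,
    "V2.02 with preprocessing": 1.60,
}


def fraction_of_bound(bound_cpf: float, cpf: float) -> float:
    """Performance is proportional to 1/CPF, so the fraction is bound/achieved."""
    return bound_cpf / cpf


for label, cpf in achieved_cpf.items():
    share = fraction_of_bound(BOUND_CPF, cpf)
    print(f"{label:28s} {cpf:.2f} CPF  {100 * share:5.1f}% of bound")
```

By this measure the V2.01 compiler delivers roughly half of the bounded performance, and the improved code delivers a little over nine tenths of it.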

In general, achieved performance is also affected by cache misses and register spill code. Simple calibration loops are used to characterize cache performance. For application kernels that have tree-structured dependence graphs, the register requirements are characterized as a function of the latency and bandwidth of memory and function units.