Parallel Performance Project Research Paper
Approaching a Machine-Application Bound in Delivered Performance
on Scientific Code
William Mangione-Smith, Tien-Pao Shih, Santosh G. Abraham, and
Edward S. Davidson
Special Issue of IEEE Proceedings on Computer Performance Analysis,
August 1993.
Abstract
We have developed a performance bounding methodology that explains the
performance of loop-dominated scientific applications on particular systems.
We model the throughput of key hardware units that are common bottlenecks
in concurrent machines. The four units currently modeled are the memory
interface, floating-point, instruction issue, and a "dependence unit" that
models the effects of performance-limiting recurrences.
We propose a workload characterization, and derive upper bounds on the
performance of specific machine-workload pairs. Comparing delivered
performance with bounds focuses attention on areas for improvement and
indicates how much improvement might be attainable.
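
As a rough illustration of how a bound of this kind might be formed, the sketch below assumes that each modeled unit needs some minimum number of clocks per loop iteration and that the most heavily loaded unit sets the bound. The function name, the per-unit clock counts, and the max-over-units rule are illustrative assumptions, not formulas or measurements taken from the paper.

```python
def cpf_bound(unit_clocks, flops_per_iteration):
    """Return (bound in clocks per flop, name of the limiting unit).

    unit_clocks maps each modeled unit to the minimum clocks it needs
    per loop iteration (hypothetical inputs, supplied by the caller).
    """
    if flops_per_iteration <= 0:
        raise ValueError("flops_per_iteration must be positive")
    # Assumption: the units overlap perfectly, so the busiest one
    # determines the minimum time per iteration.
    limiting = max(unit_clocks, key=unit_clocks.get)
    return unit_clocks[limiting] / flops_per_iteration, limiting


# Hypothetical loop: 2 floating-point operations per iteration.
example_loop = {
    "memory": 3.0,
    "floating_point": 2.0,
    "issue": 2.5,
    "dependence": 1.0,
}
bound, unit = cpf_bound(example_loop, flops_per_iteration=2)
print(f"bound = {bound} CPF, limited by {unit}")
```

In this made-up case the memory interface is the bottleneck, which is the kind of pointer toward "areas for improvement" that the comparison is meant to provide.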
A detailed analysis and performance improvement effort for the IBM
RS/6000, using the Livermore Fortran Kernels 1-12 to represent the
target workload, produces an average lower bound of 1.27 clocks per
floating-point operation (CPF), whereas machine peak performance is
0.5 CPF and the V2.01 Fortran compiler attains only 2.43 CPF. Code
improvements in this study have achieved 1.36 CPF, increasing the
harmonic mean steady-state inner loop performance to 97.6% of the
MFLOPS bound. Subsequently, the V2.02 compiler achieved 1.75 CPF, and
1.60 CPF with carefully chosen preprocessing. A goal-directed compiler
with knowledge of the bounds could produce higher-performance code more
efficiently and automatically.
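
The CPF figures quoted above can be compared with a few lines of arithmetic. The sketch below is only a reading aid: the helper names are hypothetical, and the clock rate is left as a caller-supplied parameter because the abstract does not state one. Note that the simple ratio of average CPF values need not match the 97.6% figure, which is stated for harmonic mean steady-state inner loop performance.

```python
BOUND_CPF = 1.27  # average lower bound on CPF quoted in the abstract

# Delivered CPF values quoted in the abstract.
delivered_cpf = {
    "V2.01 compiler": 2.43,
    "hand-improved code": 1.36,
    "V2.02 compiler": 1.75,
    "V2.02 with preprocessing": 1.60,
}


def fraction_of_bound(cpf, bound=BOUND_CPF):
    """Fraction of the bounded performance delivered (lower CPF is better)."""
    return bound / cpf


def mflops(cpf, clock_mhz):
    """Convert clocks per flop to MFLOPS for a caller-supplied clock rate."""
    return clock_mhz / cpf


for name, cpf in delivered_cpf.items():
    print(f"{name}: {cpf} CPF, {fraction_of_bound(cpf):.1%} of the bound")

# Speedup of the hand-improved code over the V2.01 compiler output.
speedup = delivered_cpf["V2.01 compiler"] / delivered_cpf["hand-improved code"]
print(f"speedup: {speedup:.2f}x")
```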
In general, achieved performance is also affected by cache misses and
register spill code. Simple calibration loops are used to characterize
cache performance. The register requirements are characterized as a
function of the latency and bandwidth of memory and function units for
application kernels that have tree-structured dependence graphs.