Parallel Performance Project Research Paper

Research Paper

Modeling and Approaching the Deliverable Performance Capability of the KSR1 Processor
Waqar Azeem
Technical Report CSE-TR-164-93, University of Michigan, 93

Abstract

Application-specific bound models, such as the one developed in this study for the KSR1, permit effective performance evaluation of compilers and architectures. They are useful not only for improving the design of future compilers and machine implementations, but also highlight areas that require the attention of application programmers. Discovering generally applicable code optimization techniques for future compilers is one incentive to investing time and effort into hand coding such optimizations.

We make use of a hierarchical succession of bounds models that evaluate the potential of the machine implementation for a particular high-level application code (MA), the compiler-generated workload (MAC), and the actual compiler-generated schedule for that workload (MACS). This hierarchy of bounds highlights aspects of each kernel that need code optimizations and facilitates making effective decisions regarding which optimization techniques to apply in each case. These bounds provide metrics that can be used by compiler writers, system designers and application developers. The bounds model also helps in deciding when to give up optimization efforts aimed toward unrealizable performance improvements.

The first twelve LFK loops were compiled with the -O2 option of version 1.1.3 of the KSR1 Fortran compiler. The compiled code achieves only 67.34% of the MA bound and 69.34% of the MAC bound, but does achieve 92.5% of the MACS bound. This indicates that the largest performance degradation is due to inefficient scheduling of the compiler-generated workload. The overall effect of the difference between the number of operations in the compiler-generated workload and the essential set is not very large, although redundant memory operations do account for a large part of this difference. However, these redundant operations also introduce new dependence cycles within loop iterations making effective code scheduling a more complicated, or even an impossible task.

We hand-coded the first twelve loops in an attempt to produce schedules that meet the MA bound. Since the bound is calculated for the degree of loop unrolling introduced by the compiler, we did not attempt to alter the compiler- selected degree of unrolling when hand-coding. The main focus of our approach to hand-coding was elimination of redundant operations and fine-grain code scheduling of the remaining set of operations. We were able to closely pack together templates for these remaining instructions without major restructuring for all loops, except loop 1, where we had to resort to a cyclic schedule, and loop 8. Template packing problems arising from the different shapes of reservation templates among the various instruction classes are an impediment to closely packing the instruction templates in loop 8. It is for this reason that the hand-coded schedule for loop 8 achieves only 78.05% of the bound.

The three address register-to-register instruction set architecture of the KSR1 can inhibit register reuse in loops that contain triad instructions. Eight of the twelve LFK loops achieve greater than 91% of the MA bound performance with loops 2, 4, 6 and 8 being the exception. The average CPF of the hand-coded loops for LFK 1-12 is 1.53, which achieves 87.58% of the MA bound performance and a speedup of 1.3 over the -O2 compiled code. After eliminating outer loop overhead from loops 2, 4 and 6 so that delivered steady-state inner- loop performance can be measured, all but loop 8 achieve at least 91.56% of the bound. The average steady-state inner-loop CPF of the hand-coded loops for LFK 1-12 is 1.41, which achieves 95.04% of the MA bound performance.

We have adapted and used the Object Code Optimizer (OCO) which automatically reschedules code using cyclic scheduling. Using OCO, we have demonstrated that it is possible to automatically reschedule code for the KSR1 to produce close to optimal schedules in most cases. The schedules produced for the compiler-generated workload by OCO achieved 93.42% of the (k = 1) MAC bound performance with an average CPF of 2.57. For steady-state inner-loop performance the schedules produced by OCO achieved 96.98% of the (k=1) MAC bound with an average CPF of 2.48.

Our results indicate that future KSR compilers would benefit from a more efficient data flow analysis. The current compiler also does not schedule the compiler-generated code very well. Cyclic scheduling is superior to loop unrolling at the expense of higher code complexity and loop start-up time. A 1.26x speedup over -O1compiled code (k=1) resulted from the use of OCO. OCO would be even more effective for machines with more deeply pipelined functional units, i.e. with larger latencies to hide. It would also benefit from a preprocessor to eliminate redundant operations and choose an effective degree of unrolling, and from backtracking to improve template-packing when needed.
Back to Publication List, or Parallel Performance Project Home Page