Parallel Performance Project Research Paper
Research Paper
-
Modeling and Approaching the Deliverable Performance Capability of
the KSR1 Processor
Waqar Azeem
Technical Report CSE-TR-164-93, University of Michigan, 93
Abstract
-
Application-specific bound models, such as the one developed in this
study for the KSR1, permit effective performance evaluation of compilers and
architectures. They are useful not only for improving the design of future
compilers and machine implementations, but also highlight areas that require
the attention of application programmers. Discovering generally applicable code
optimization techniques for future compilers is one incentive to investing time
and effort into hand coding such optimizations.
We make use of a hierarchical succession of bounds models that evaluate
the potential of the machine implementation for a particular high-level
application code (MA), the compiler-generated workload (MAC), and the actual
compiler-generated schedule for that workload (MACS). This hierarchy of bounds
highlights aspects of each kernel that need code optimizations and facilitates
making effective decisions regarding which optimization techniques to apply in
each case. These bounds provide metrics that can be used by compiler writers,
system designers and application developers. The bounds model also helps in
deciding when to give up optimization efforts aimed toward unrealizable
performance improvements.
The first twelve LFK loops were compiled with the -O2 option of version
1.1.3 of the KSR1 Fortran compiler. The compiled code achieves only 67.34% of
the MA bound and 69.34% of the MAC bound, but does achieve 92.5% of the MACS
bound. This indicates that the largest performance degradation is due to
inefficient scheduling of the compiler-generated workload. The overall effect of
the difference between the number of operations in the compiler-generated
workload and the essential set is not very large, although redundant memory
operations do account for a large part of this difference. However, these
redundant operations also introduce new dependence cycles within loop iterations
making effective code scheduling a more complicated, or even an impossible task.
We hand-coded the first twelve loops in an attempt to produce schedules
that meet the MA bound. Since the bound is calculated for the degree of loop
unrolling introduced by the compiler, we did not attempt to alter the compiler-
selected degree of unrolling when hand-coding. The main focus of our approach to
hand-coding was elimination of redundant operations and fine-grain code
scheduling of the remaining set of operations. We were able to closely pack
together templates for these remaining instructions without major restructuring
for all loops, except loop 1, where we had to resort to a cyclic schedule, and
loop 8. Template packing problems arising from the different shapes of
reservation templates among the various instruction classes are an impediment to
closely packing the instruction templates in loop 8. It is for this reason that
the hand-coded schedule for loop 8 achieves only 78.05% of the bound.
The three address register-to-register instruction set architecture of
the KSR1 can inhibit register reuse in loops that contain triad instructions.
Eight of the twelve LFK loops achieve greater than 91% of the MA bound
performance with loops 2, 4, 6 and 8 being the exception. The average CPF of the
hand-coded loops for LFK 1-12 is 1.53, which achieves 87.58% of the MA bound
performance and a speedup of 1.3 over the -O2 compiled code. After eliminating
outer loop overhead from loops 2, 4 and 6 so that delivered steady-state inner-
loop performance can be measured, all but loop 8 achieve at least 91.56% of the
bound. The average steady-state inner-loop CPF of the hand-coded loops for LFK
1-12 is 1.41, which achieves 95.04% of the MA bound performance.
We have adapted and used the Object Code Optimizer (OCO) which
automatically reschedules code using cyclic scheduling. Using OCO, we have
demonstrated that it is possible to automatically reschedule code for the KSR1
to produce close to optimal schedules in most cases. The schedules produced for
the compiler-generated workload by OCO achieved 93.42% of the (k = 1) MAC bound
performance with an average CPF of 2.57. For steady-state inner-loop performance
the schedules produced by OCO achieved 96.98% of the (k=1) MAC bound with an
average CPF of 2.48.
Our results indicate that future KSR compilers would benefit from a more
efficient data flow analysis. The current compiler also does not schedule the
compiler-generated code very well. Cyclic scheduling is superior to loop
unrolling at the expense of higher code complexity and loop start-up time. A
1.26x speedup over -O1compiled code (k=1) resulted from the use of OCO. OCO
would be even more effective for machines with more deeply pipelined functional
units, i.e. with larger latencies to hide. It would also benefit from a
preprocessor to eliminate redundant operations and choose an effective degree
of unrolling, and from backtracking to improve template-packing when needed.
Back to Publication List, or
Parallel Performance Project Home Page