Parallel Performance Project Research Paper
Research Paper
-
Multi-Configuration Simulation Algorithms for the Evaluation of
Computer Architecute Designs
Rabin A. Sugummar
Ph.D. Thesis (Technical Report CSE-TR-173-93), University of Michigan,
August 93.
Abstract
-
In computer architecture design, a number of candidate designs are
simulated on representative workloads, and the most satisfactory
design in terms of cost d performance is chosen. This simulation
process is time-consuming, especially memory hierarchy simulation, and
is a bottleneck in architectural design. In this thesis the
multi-configuration simulation approach is adopted for speeding up the
simulation process. This approach is based on the observation that
the behavior of adjacent design configurations is largely similar, and
that the similarity may be exploited to reduce simulation work;
significant reductions in simulation time are obtained by a
synergistic simulation of many design configurations.
A suite of multi-configuration simulation algorithms is developed
for memory hierarchy simulation. The suite includes
1) An algorithm for set-associative cache simulation based
on a new data structure (the generalized binomial tree)
which runs about two times faster than earlier algorithms.
2) An algorithm for direct mapped cache simulation based on a novel
tag inclusion property which also gives a factor of two improvement
over an earlier algorithm.
3) An innovative limited lookahead algorithm with stack repair
for simulating the OPT replacement strategy in caches.
4) Novel multi-configuration simulation algorithms for write-buffers.
A simulation package, Cheetah, based on these algorithms has been developed
and used in the following modeling and optimization studies.
First, a new model, the OPT model, is introduced for classifying cache
misses. Unlike earlier models, the OPT model accounts for misses
resulting from sub-optimal replacement policies used in practical
caches. Experimental characterizations based on the OPT model of the
cache misses occurring in the SPEC benchmarks are then presented. The
results demonstrate that the replacement policy contributes to a
significan.
Second, the hit-miss and reuse behavior of individual load/store
instructions of the SPEC benchmarks are profiled. The profiles show
that a small number of instructions contribute to a large percentage
of the misses. By scheduling the instructions that miss to hide
latency, a factor of three improvement is demonstrated for
loop-dominated code. By partially controlling cache replacement using
the profile information on data reuse up to a 20% reduction in miss
ratio is demonstrated.
Back to Publication List, or
Parallel Performance Project Home Page