Parallel Performance Project
Ph.D. Thesis
-
Modulo Scheduling, Machine Representations, and
Register-Sensitive Algorithms
Alexandre E. Eichenberger
Ph.D. Thesis, University of Michigan, Dec. 96.
Postscript,
compressed(gz),
compressed(Z),
2-per-page-compressed(gz).
Abstract
-
High performance compilers increasingly rely on accurate modeling of
the machine resources to efficiently exploit the instruction level
parallelism of an application. In this dissertation, we first propose
a reduced machine description that results in significantly faster
detection of resource contentions while preserving the scheduling
constraints present in the original machine description. This
approach reduces a machine description in an automated, error-free,
and efficient fashion. Moreover, it fully supports the elaborate
scheduling techniques that are used by high-performance compilers,
such as scheduling an operation earlier than others that are already
scheduled, unscheduling operations due to resource conflicts, and
efficient handling of periodic resource requirements found in software
pipelined schedules. Reduced machine descriptions resulted in
processing the queries to the resource model in 58.1% of the original
time for a benchmark of 1327 loops scheduled by a state-of-the-art
modulo scheduler for the Cydra 5 machine.
Scheduling techniques such as Modulo Scheduling are also increasingly
successful at efficiently exploiting the instruction level parallelism
up to the resource limit of the machine, resulting in high performance
but increased register requirements. In this dissertation, we propose
an optimal register-sensitive algorithm for modulo scheduling that
schedules loop operations to achieve the minimum register requirements
for the smallest possible initiation interval between successive
iterations. The proposed approach supports machines with finite
resources and complex reservation tables, such as the Cydra 5.
It also uses a well structured formulation that enables us to find
schedules within a reasonable time for more loops (920 of the 1327
loops) and and larger loops (up to 41 operations).
We also propose an alternative approach, stage scheduling, which
reduces the register requirements of a given modulo schedule by
shifting its operations by multiples of $II$ cycles. In addition to
achieving a significant fraction of the possible decrease in register
requirements (e.g. the average register requirements decrease by
22.2% for the optimal modulo scheduler, by 19.9% for the optimal
stage scheduler, and by 19.8% for our best stage scheduling
heuristic, compared to a register-insensitive modulo scheduler in a
benchmark of 920 loops), the stage scheduling approach also preserves
the steady-state performance of the initial schedules. By bounding
the schedule length of its schedule, we may preserve the transient
performance of the original as well. Thus, by coupling efficient
stage schedule heuristics with a register-insensitive high-performance
modulo scheduler, we may very quickly obtain high-performance
schedules with low register requirements.
A technique that further decreases the register requirements in
predicated code is presented. This technique is based on precisely
computing the interferences among virtual registers in the presence of
predicated operations.
Back to Publication List, or
Parallel Performance Project Home Page