Charles Lefurgy
Dissertation
Efficient Execution of Compressed Programs
June 2000
Dissertation: PDF
Dissertation: PS
Abstract: PDF
Abstract: PS
Slides: PDF
Slides: PS
Abstract
Chair: Trevor Mudge
Code compression is the technique of using data compression to reduce the program memory size for memory-limited, embedded computers. For system-on-a-chip designs, this reduces the system die area which lowers die cost. After compilation, the binary (native code) program is compressed and stored in the embedded system. At run-time, the compressed program is incrementally decompressed and executed. While compressed programs have better code density, their performance is typically lower because additional effort is required to decompress the instruction stream. This dissertation presents methods to improve the performance of compressed programs.
Decompression overhead can be minimized by using special-purpose hardware. This dissertation analyzes IBM's CodePack decompression algorithm and proposes optimizations for it. The optimized decompressor can often execute compressed programs faster than the original native program. The performance benefit of using fewer memory transactions to fetch compressed instructions surpasses the small decompression overhead. Therefore, code compression improves performance as well as code density.
The decompression hardware can be largely replaced with software. The benefits of software decompression are greater design flexibility, reduced hardware complexity, reduced die area, and reduced cost. However, software decompression is much slower than hardware decompression. On a 5-stage pipelined embedded processor with a 4KB instruction cache, CodePack programs execute 1.3 to 27.0 times slower than native programs and reduce program memory die area (instruction cache and main memory) by 26% to 41%. This dissertation proposes instruction set support to enable efficient software-managed decompression. In addition, it explores two software optimizations, hybrid programs and memoization, to improve the execution time of compressed programs by reducing the compression. Hybrid programs contain both native and compressed code to reduce the number of times the decompressor is invoked. Memoization is a dynamic optimization that caches recent decompression results to also avoid invoking the decompressor. Optimized compressed programs that reduce die area 10% to 33% execute only 1.00 to 1.22 times slower than native code. In addition, loop-oriented (multimedia) programs are nearly as fast as native code.
-
Charles Lefurgy, Eva Piccininni, and Trevor Mudge
Reducing Code Size with Run-time Decompression
Proceedings of the 6th International Symposium on High-Performance Computer Architecture (HPCA)
January 2000
Paper: PDF
Paper: PS
Slides: PDF
Slides: PS
Slides: PowerPoint
Abstract: Compressed representations of programs can be used to improve the code density in embedded systems. Several hardware decompression architectures have been proposed recently. In this paper, we present a method of decompressing programs using software. It relies on using a software-managed instruction cache under control of the decompressor. This is achieved by employing a simple cache management instruction that allows explicit writing into a cache line. We also consider selective compression (determining which procedures in a program should be compressed) and show that selection based on cache miss profiles can substantially outperform the usual execution time based profiles for some benchmarks.
-
Charles Lefurgy, Eva Piccininni, and Trevor Mudge
Analysis of a High Performance Code Compression Method
Proceedings of the 32th Annual International Symposium on Microarchitecture
pp. 93-102
November 1999
Paper: PDF
Paper: PS
Slides: PDF
Slides: PS
Abstract: Compressing the instructions of an embedded program is important for cost-sensitive low-power control-oriented embedded computing. A number of compression schemes have been proposed to reduce program size. However, the increased instruction density has an accompanying performance cost because the instructions must be decompressed before execution. In this paper, we investigate the performance penalty of a hardware-managed code compression algorithm recently introduced in IBM's PowerPC 405. This scheme is the first to combine many previously proposed code compression techniques, making it an ideal candidate for study. We find that code compression with appropriate hardware optimizations does not have to incur much performance loss. Furthermore, our studies show this holds for architectures with a wide range of memory configurations and issue widths. Surprisingly, we find that a performance increase over native code is achievable in many situations.
-
Charles Lefurgy and Trevor Mudge
Fast Software-managed Code Decompression
Presented at CASES'99 (Computer and Architecture Support for Embedded Systems)
pp. 139-143
October 1999
Presented at 10th Annual IPoCSE Review, University of Michigan
Paper: PDF
Paper: PS
Slides: PDF
Slides: PS
Abstract: Compressing the instructions of an embedded program is important for cost-sensitive low-power control-oriented embedded computing. A number of compression schemes have been proposed to reduce program size. However, the increased instruction density has an accompanying performance cost because the instructions must be decompressed before execution. In this paper, we investigate the performance penalty of a hardware-managed code compression algorithm recently introduced in IBM's PowerPC 405. This scheme is the first to combine many previously proposed code compression techniques, making it an ideal candidate for study. We find that code compression with appropriate hardware optimizations does not have to incur much performance loss. Furthermore, our studies show this holds for architectures with a wide range of memory configurations and issue widths. Surprisingly, we find that a performance increase over native code is achievable in many situations.
-
Matt Postiff and Trevor Mudge
Smart Register Files for High-Performance Microprocessors
CSE-TR-403-99
University of Michigan
June 28, 1999
PDF report
PDF thesis proposal
PDF thesis proposal slides
Abstract: This report examines how the compiler can more efficiently use a large
number of processor registers. The placement of data items into registers,
called register allocation, is known to be one of the most important
compiler optimizations for high-speed computers because registers are the
fastest storage devices in the computer system. However, register
allocation has been limited in scope because of aliasing in the memory
system. To break this limitation and allow more data to be placed into
registers, new compiler and microarchitecture support is needed.
We propose the modification of register access semantics to include an
indirect access mode. We call this optimization the Smart Register File.
The smart register file allows the relaxation of overly-conservative
assumptions in the compiler by having the hardware provide support for
aliased data items in processor registers. As a result, the compiler can
allocate data from a larger pool of candidates than in a conventional
system. An attendant advantage is that the smart register file reduces the
number of load and store operations executed by the processor. The simple
addition of an indirect register access mode not only simplifies alias
handling, but also provides opportunities for other optimizations. This
research examines several such optimizations.
-
Charles Lefurgy and Trevor Mudge
Code Compression for DSP
Presented at CASES-98 Workshop
CSE-TR-380-98
University of Michigan
November 1998
paper: PDF (52 KB)
slides: PDF
Abstract: Previous works have proposed adding compression techniques
to a variety of architectural styles to reduce instruction memory
requirements. It is not immediately clear how these results apply to
DSP architectures. DSP instructions are longer and have potentially
greater variation which can decrease compression ratio. Our results
demonstrate that DSP programs do provide sufficient repetition for
compression algorithms. We propose a compression method and apply it
to SHARC, a popular DSP architecture. Even using a very simple
compression algorithm, it is possible to halve the size of the
instruction memory requirements.
-
Matthew A. Postiff, David A. Greene, Gary S. Tyson and Trevor N. Mudge
The Limits of Instruction Level Parallelism in SPEC95 Applications
Computer Architecture News, May 1999. (short paper)
Presented at 3rd Workshop on Interaction between Compilers and Computer Architectures (INTERACT-3)
October 1998
PDF full report
PDF short paper
PDF slides
Abstract: This paper examines the limits to instruction level parallelism that can
be found in programs, in particular the SPEC95 benchmark suite. Apart from
using a more recent version of the SPEC benchmark suite, it differs from
earlier studies in removing non-essential true dependencies that occur as
a result of the compiler employing a stack for subroutine linkage. This is
a subtle limitation to parallelism that is not readily evident as it
appears as a true dependency on the stack pointer. Other methods can be
used that do not employ a stack to remove this dependency. In this paper
we show that its removal exposes far more parallelism than has been seen
previously. We refer to this type of parallelism as ``parallelism at a
distance'' because it requires impossibly large instruction windows for
detection. We conclude with two observations: 1) that a single instruction
window characteristic of superscalar machines is inadequate for detecting
parallelism at a distance; and 2) in order to take advantage of this
parallelism the compiler must be involved, or separate threads must be
explicitly programmed.
-
Krisztian Flautner, Gary S. Tyson and Trevor Mudge
MirvSim: A high level simulator integrated with the Mirv compiler
Computer Architecture News, May 1999.
Presented at 3rd Workshop on Interaction between Compilers and Computer Architectures (INTERACT-3)
October 1998
PDF
PS slides
Abstract: A program's execution profile is an increasingly important source of information for optimizations. Along with its use for high-level optimizations, profile information can be used to take advantage of advanced ISA features such as branch hint bits. Profile information is collected either by instrumenting an executable with profile collection code or by running the program in a simulator. Instrumentation has the advantage that it is relatively fast, with the disadvantage that the instrumentation code can interfere with the measurements (e.g. branch prediction). Traditional simulation is slower but a more accurate process. However, since most simulators simulate the low level machine code of a program, relating the results to the high-level program structure can be problematic. MirvSim differs from other simulators - such as SimpleScalar - that are used in computer architecture in that instead of simulating a machine code level instruction set, it simulates the Mirv language, a high-level intermediate program represen-tation (IR). This allows the simulator to be closely cou-pled with the rest of the compiler infrastructure and to communicate information with other compiler passes easily.
-
Krisztian Flautner and Trevor Mudge
Introspective Computers
Computer Architecture News, May 1999.
Presented at 3rd Workshop on Interaction between Compilers and Computer Architectures (INTERACT-3)
October 1998
PDF
PS slides
-
David Greene
Profile-Guided Function Specialization in the MIRV Compiler
Preliminary Examination
September 15, 1998
PDF
PS
Abstract: Today's complex software systems are written in a modular, re-usable
fashion. Because of the desire to optimize programmer time, generic
software modules are developed and assembled into more complex
systems. These modules are often highly parameterized. Parameters set at
execution time often remain constant throughout the program run and
across executions. Furthermore, input data sets are often redundant.
Because programs exhibit re-use of values, it is possible to
{\em specialize} programs to particular frequently-occurring values
in an attempt to increase performance by eliminating unnecessary overhead.
This paper examines a technique to automatically specialize programs
written in C. Value profile information is gathered through the use
of an instrumenting compiler. The compiler utilizes this information
to decide where and on what values to specialize. Preliminary results are
presented that show some promising performance gains. To further explore
the potential for specialization, more extensive experiments are needed
using large programs written in a modular, re-usable style.
-
Matt Postiff, Gary Tyson, and Trevor Mudge
Performance Limits of Trace Caches
CSE-TR-373-98
University of Michigan
September 8, 1998
PDF
Abstract: A growing number of studies have explored the use of trace
caches as a mechanism to increase instruction fetch bandwidth. The trace
cache is a memory structure that stores statically noncontiguous but
dynamically adjacent instructions in contiguous memory locations. When
coupled with an aggressive trace or multiple branch predictor, it can
fetch multiple basic blocks per cycle using a single-ported cache
structure. This paper compares trace cache performance to the theoretical
limit of a three-block fetch mechanism equivalent to an idealized 3-ported
instruction cache with a perfect alignment network. Several new metrics
are defined to formalize analysis of the trace cache. These include
fragmentation, duplication, indexability, and efficiency metrics. We show
that performance is more limited by branch mispredictions than ability to
fetch multiple blocks per cycle. As branch prediction improves, high
duplication and the resulting low efficiency are shown to be among the
reasons that the trace cache does not reach its upper bound. Based on the
shortcomings of the trace cache discovered in this paper, we identify some
potential future research areas.
-
Charles Lefurgy
Space-efficient Executable Program Representations for Embedded Microprocessors
Thesis Proposal
April 1998
PDF (193 KB)
Abstract: Both low-cost embedded systems and high-performance
microprocessors can benefit from small program sizes. This thesis
proposal focuses on program representations of embedded applications,
where execution speed can be traded for code size. Our preliminary
work borrows concepts from the field of text compression and applies
them to the compression of instruction sequences. We present an
experiment that examines modifications at the microarchitecture level
to support compressed programs. A post-compilation analyzer examines a
program and replaces common sequences of instructions with a single
instruction codeword. A microprocessor executes the compressed
instruction sequences by fetching codewords from the instruction
memory, expanding them back to the original sequence of instructions
in the decode stage, and issuing them to the execution stages. We
demonstrate our technique by applying it to the PowerPC, ARM, i386,
and MIPS-16 instruction sets. Finally, we use our results to propose
another efficient program representation that explicitly communicates
the patterns of computation common to each program.
-
Charles Lefurgy, Peter Bird, I-Cheng Cheng, Trevor Mudge
Improving Code Density Using Compression Techniques
Proceedings of the 30th Annual International Symposium on Microarchitecture
pp. 194-203
December 1997
paper: compressed postscript (138 KB),
paper: PDF (106 KB)
slides: PDF (556 KB)
Abstract: We propose a method for compressing programs in embedded processors where instruction memory size dominates cost. A post-compilation analyzer examines a program and replaces common sequences of instructions with a single instruction codeword. A microprocessor executes the compressed instruction sequences by fetching codewords from the instruction memory, expanding them back to the original sequence of instructions in the decode stage, and issuing them to the execution stages. We apply our technique to the PowerPC, ARM, and i386 instruction sets and achieve an average size reduction of 39%, 34%, and 26%, respectively, for SPEC CINT95 programs.
Citations:
- J. Peyton, Jr. and S. de Jong, "Compiler with inter-modular procedure optimization", U.S. Patent 5,920,723, 1997.
-
Charles Lefurgy, Peter Bird, I-Cheng Chen, and Trevor Mudge
Improving Code Density Using Compression Techniques
CSE-TR-342-97
University of Michigan
July 1997
compressed postscript (81 KB)
PDF (96 KB)
Abstract: We propose a method for compressing programs in
embedded processors where instruction memory size dominates cost. A
post-compilation analyzer examines a program and replaces common
sequences of instructions with a single instruction codeword. A
microprocessor executes the compressed instruction sequences by
fetching codewords from the instruction memory, expanding them back to
the original sequence of instructions in the decode stage, and issuing
them to the execution stages. We apply our technique to the PowerPC
instruction set and achieve 30% to 50% reduction in size for SPEC
CINT95 programs.
-
I-Cheng Chen, Peter Bird and Trevor Mudge
The Impact of Instruction Compression on I-cache Performance
CSE-TR-330-97
University of Michigan
1997
compressed postscript (230 KB)
PDF (176 KB)
Abstract: In this paper we present a straightforward technique for compressing
the instruction stream for programs that overcomes some of the
limitations of earlier proposals. After code generation, the
instruction stream is analysed for frequently used sequences of
instructions from within the program's basic blocks. These patterns of
multiple instructions are then mapped into single byte opcodes. This
constitutes a compression of multiple, multi-byte operations onto a
single byte. When compressed opcodes are detected during the
instruction fetch cycle of program execution, they are expanded within
the CPU into the original (multi-cycle) sequence of instructions. We
only examine patterns within a program's basic block, so branch
instructions and their targets are unaffected by this technique
allowing compression to be decoupled from compilation.
-
Peter Bird and Trevor Mudge
An Instruction Stream Compression Technique
CSE-TR-319-96
University of Michigan
1996
compressed postscript (205 KB)
PDF (38 KB)
Abstract: The performance of instruction memory is a critical factor
for both large, high performance applications and for embedded
systems. With high performance systems, the bandwidth to the
instruction cache can be the limiting factor for execution speed.
Code density is often the critical factor for embedded systems.
In this report we demonstrate a straightforward technique for
compressing the instruction stream for programs. After code
generation, the instruction stream is analysed for often reused
sequences of instructions from within the pro gram's basic blocks.
These patterns of multiple instructions are then mapped into single
byte opcodes. This constitutes a compression of multiple, multi-byte
operations onto a single byte. When compressed opcodes are detected
during the instruction fetch cycle of program execution, they are
expanded within the CPU into the original (multi-cycle) set of
instructions. Because we only operate within a program's basic block,
branch instructions and their targets are unaffected by this technique.
We provide statistics gathered from code generated for the Intel
Pentium and the Power PC processors. We have found that incorporating
a 1K decode ROM in the CPU we can reduce a program's code size between
45% and 60%.