Trevor N. Mudge
Graduate students: Bruce Jacob and Chih-chieh Lee
Faculty: Trevor Mudge
Sponsor: Defense Advanced Research Projects Agency
More and more applications put significant demands on the operating system during their execution. This is especially true with multimedia programs that move large chunks of data through the system. Such applications invoke the OS to move data from disk or a network through the computer's memory system. Our research is aimed at providing a better understanding of architecture/operating system interaction throughout the system hierarchy and how a microsupercomputer class system can effectively integrate a 1 GHz processor into a system with much slower caches, main memory and disk.
Our initial studies focus on virtual memory management. Memory management can impose considerable penalties which are highly dependent on the operating system's structure and how its design matches the virtual memory hardware. Our previous work studied OS/architecture interactions such such as cache footprints and TLB usage. Present work looks at OS/architecture design interactions and design tradeoffs; one question is how a simpler virtual memory design can speed up both hardware and software, by allowing both implementations to become simpler.
Graduate Student: Tim Stanley
Faculty: Trevor Mudge
Sponsor: Defense Advanced Research Projects Agency - AASERT program
Computer-aided design tools automate much of the physical design process. However, few systematic methods, algorithms, or tools have been developed to help the computer architect specify near-optimal microarchitectural configurations in the early, high-impact design stages. Such approaches are needed to systematically guide the early design specifications subject to multiple objectives such as cost, performance, and power consumption.
The goal of this work is to develop an objective-driven microarchitectural design methodology that optimizes architectural performance subject to design constraints expressed as macro-models of area, delay and power dissipation. To date, we have shown that the genetic algorithm, a global optimization technique based on the metaphor of natural selection and survival of the fittest, is a good candidate for such an objective-driven search in a high dimensional space. This genetic algorithm has been parallelized on literally hundreds of Internet workstations using the design of a memory hierarchy with multiple performance objectives as a case study.
Future work will use this microarchitectural design methodology to answer specific architectural design optimization questions. Also, other optimization algorithms will be evaluated for this application.
Graduate Student: I-Cheng Kevin Chen
Faculty: Trevor Mudge
To obtain higher performance modern superscalar microprocessors are relying on multiple instruction issue and deep pipelining. In such execution environments, effective branch prediction becomes essential to realize the potential performance of the microprocessor. To achieve this, various branch prediction schemes have been proposed in the literature and some are now being implemented on new microprocessors. However, very few studies address the inherent limit of predictability for a given program. Unless we know the limit of predictability, there are no absolute metrics for judging how good a prediction scheme is and how much more room is left for improvement.
The notion of entropy defined in the information science may shed light on the limit of predictability. This is because the predictability of the branches depends on the randomness of the behavior of the program. Our initial studies analyze the theoretical predictability and entropy of various well-known algorithms. We are also examining new branch prediction schemes in the light of well-known compression algorithms that can achieve minimum entropy. Besides theoretical analysis, experiments on different types of benchmarks are being carried to empirically verify conclusion for those programs which are not amenable to precise analysis.
Graduate Students: Brian T. Davis
Faculty: Trevor Mudge
Sponsor: Defense Advanced Research Projects Agency
It has been observed for some time that processor speeds have grown at a much faster rate than main memory - DRAM speeds. The solution has been to introduce caches between the processor and memory. It has now reached a stage where three levels of cache are being considered for high performance systems. Such designs will be costly and difficult to verify.
In response to this trend DRAM manufacturers have introduced new memory chips that are built with DRAM technology but that expose more of the bandwidth that is inherent within each DRAM chip. Examples include pipelined DRAM, synchronous DRAM, extended-data-out-DRAM and RAMBUS. Their intended application is as a lower cost replacement of secondary (and teriary) caches. Indeed, they may be a feasible substitute for primary caches on low cost slower systems.
The goal of this project is to study new memory organizations that are possible with these new DRAMs and to incorporate our findings into a systematic methods for designing memory hierarchies for the PUMA high speed microprocessor being developed at the University of Michigan.
Student: Chih-Chieh Leee
Professor: Trevor Mudge
Sponsor: Defense Advanced Research Project Agency
The ability to minimize stalls or pipeline bubbles that may result from branches is becoming increasingly critical as microprocessor designs implement greater degrees of instruction level parallelism. There are several techniques for reducing branch penalties including guarded execution, basic block enlargement, and dynamic branch prediction. Among these, dynamic branch prediction is usually preferred, because it can be implemented without changes to the instruction set architecture or pre-existing binaries.
However, to achieve the desired prediction accuracy, current dynamic branch predictors require significant amounts of hardware to minimize the effects of aliasing in their prediction tables. In this work we are investigating better designs for two-level predictors by focusing on the elimination of aliasing. We propose a new approach, the bi-mode predictor. This predictor divides the prediction tables into two halves, and, by dynamically determining the current "mode" of the program, it selects the appropriate half of the table for prediction. This approach has been shown to reduce aliasing and, as a result, to improving prediction accuracy. Moreover, it is simple enough that it does not impact a processor's cycle time.
Student: David Van Campenhout
Professor: Trevor Mudge
Sponsor: Semiconductor Research Council, NSF
Domino logic, a form of dynamic logic, is popular for high-performance microprocessors. The use of domino logic has been restricted mainly to full custom designs, in part because the difficulty of timing verification, which is critical for correct operation. In the absence of good timing models, designers often have to depend solely on electrical simulators such as Spice to verify their designs. This research addresses the timing verification of sequential circuits consisting of both static logic and dynamic logic.
Proper operation of dynamic circuits is formulated in terms of a well established framework for analyzing the timing of sequential circuits. For domino gates, this led to the important observation that input signals to domino gates may start changing near the end of the evaluate phase. Methods for performing static timing verification have been developed. They methods offer different accuracy/cost tradeoffs. Various variants on domino logic are under investigation.
Student: David Van Campenhout
Professor: Trevor Mudge
Sponsor: Defense Advanced Research Project Agency
Hardware design verification tries to establish whether a given implementation conforms the specification. Present approaches are either limited to small circuits (formal methods) or their effectiveness is not quantifiable (simulation based techniques). This work addresses these problems.
Very little data has been published on hardware design errors. This vacuum is a serious obstacle to progress in this area of research. We developed a systematic method for collecting design error data. Design error data from a variety of design projects is being collected. A classification for design errors was drafted. Design error statistics are used to guide the synthesis of design error models. The collected data is also used to evaluate the quality of the error models. Methods to generate verification tests with measurable high coverage for the modeled design errors are being developed. These methods are targeted to handle large and hard cases.
Student: James Dundas
Professor: Trevor Mudge
Sponsor: Defense Advanced Research Project Agency
Most modern high-performance processors attain their performance by aggressive use of out-of-order execution techniques to exploit any available instruction level parallelism. We propose a new approach which is particularly suited for use with in-order processors with very high clock rates.
A "runahead" processor pre-executes instructions during cache miss cycles. These pre-executed instructions can generate highly accurate instruction and data prefetches, as well as branch predictions. As runahead pre-execution is speculative and continues past any instructions that are data-dependent upon cache miss data, all register values computed during runahead are discarded. However, this means that a simple in-order processor can pre-execute a very large number of instructions during a cache miss without using a great deal of additional hardware. The principal hardware cost is an extra register file.
We have performed some preliminary data prefetching studies that have demonstrated that runahead can significantly improve processor performance, particularly when the level one data cache is relatively small. We hope to demonstrate that a simple in-order runahead processor with a high clock rate can compare favorably with a conventional out-of-order processor model.
Student: Krisztian Flautner
Professor: Trevor Mudge
Sponsor: Defense Advanced Research Project Agency
Late binding operations are intrinsic to modern software systems. Dynamic-linked libraries, object models (CORBA, COM) and programming languages rely heavily on late binding operations. The added flexibility of the software, however is balanced by inefficient code that is executed on the processor.
Compatibility in today's terms means that a program compiled for a given processor should run on later models of a processor family. Being compatible neither guarantees that the program will take advantage of new features in the processor, nor that it will experience the same level of speedup as programs explicitly compiled for the new machine. The problem in both cases is that programs are prematurely bound to machine code.
MIRV proposes a different hardware-software interface, where a program is carried in a high-level intermediate program representation (MIRV) and is translated to the target architecture at execution time. By delaying binding more knowledge about the target is available and previously unavailable optimizations can be performed.
Our goal is to define a language that can be used to perform fast and high-quality code generation. We are exploring the architectural trade-offs and optimizations that can be achieved using MIRV.
Student: I-Cheng Chen
Professor: Trevor Mudge
Sponsor: Defense Advanced Research Project Agency
Instruction prefetching can effectively reduce instruction cache misses, thus improving the performance. We propose an efficient instruction prefetching scheme that makes use of advanced branch predictors that are often part of current high performance microprocessors. Using these accurate predictors, the instructions needed in the future can also be predicted accurately and be prefetched into the cache. Our prefetching scheme, branch prediction-based (BP-based) prefetching, employs a branch predictor to run ahead of the execution unit and to prefetch potentially useful instructions.
BP-based prefetching has a separate small fetching unit, allowing it to compute and predict targets autonomously. Our simulations show that a 4-issue machine with BP-based prefetching achieves higher performance than if its instruction cache were four times the size. In addition, BP-based prefetching outperforms other hardware instruction fetching schemes, such as next-n line prefetching and wrong-path prefetching, by a factor of 17-44% in stall overhead. BP-based prefetching is able to generate more useful prefetches than other schemes and generate them earlier. Furthermore, these prefetches are generated selectively, thus, the bus traffic remains manageable.
Professors: Trevor Mudge and Peter Bird
The performance of instruction memory is a critical factor for both large, high performance applications and for embedded systems. With high performance systems, the bandwidth to the instruction cache can be the limiting factor for execution speed. Code density is often the critical factor for embedded systems.
In this work we demonstrate a straightforward technique for compressing the instruction stream for programs. After code generation, the instruction stream is analyzed for often reused sequences of instructions from within the pro gram's basic blocks. These patterns of multiple instructions are then mapped into single byte opcodes. This constitutes a compression of multiple, multi-byte operations onto a single byte. When compressed opcodes are detected during the instruction fetch cycle of program execution, they are expanded within the CPU into the original (multi-cycle) set of instructions. Because we only operate within a program's basic block, branch instructions and their targets are unaffected by this technique.
We provide statistics gathered from code generated for the Intel Pentium and the Power PC processors. We have found that incorporating a 1K decode ROM in the CPU we can reduce a program's code size between 45% and 60%.