©. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the publishing entity.
If you would like a copy of the paper, and it is not present here, please request a copy from Catharine June at: cmsj@engin.umich.edu
"Transistor Level Micro-Placement and Routing for Two-Dimensional Digital VLSI Cell Synthesis," M.A. Riepe, K.A. Sakallah, International Symposium on Physical Design, April 12-14, 1999, pp. 74-81.
"Crosstalk Constrained Global Route Embedding," International Symposium on Physical Design, Phiroze Parakh and Richard B. Brown, April 12-14, 1999, pp. 201-206.
"A Quantitative Approach to Non-linear Process Design Rule Scaling," Spencer M. Gold, Bruce Bernhardt, Richard B. Brown, Advanced Research in VLSI, March 21-24, 1999, pp. 99-112.
The Edge-Based Design Rule Model Revisited, M.A. Riepe, K.A. Sakallah, ACM Trans. on Design Automation of Electronic Systems, Vol. 3, no. 3, July 1, 1998, pp. 463-486.
"Congestion Driven Quadratic Placement," P. Parakh, R. B. Brown, K. Sakallah, 35th Design Automation Conference, San Francisco, CA, June 15-19, 1998, pp. 275-278.
"M32: A Constructive Multilevel Logic Synthesis System," Victor N. Kravets and Karem A. Sakallah, 35th Design Automation Conference, San Francisco, CA, June 15-19, 1998, pp. 336-341.
Overview of Complementary GaAs Technology for High-Speed VLSI Circuits, by R. Brown, B. Bernhardt, M. LaMacchia, J. Abrokwah, P. Parakh, T. Basso, S. Gold, S. Stetson, C. Gauthier, D. Foster, B. Crawforth, T. McQuire, K. Sakallah, R. Lomax, and T. Mudge, IEEE Transactions on VLSI Circuits, vol. 6, no. 1, pp. 47-51, March 1998.
AFTA: A Formal Delay Model for Functional Timing Analysis, V. Chandramouli, J. P. Whittemore, Karem A. Sakallah, Design, Automation and Test in Europe: Proceedings, Paris, France, February 23-26, 1998, pp. 350-355.
We propose a method for compressing programs in embedded processors where instruction memory size dominates cost. A post-compilation analyzer examines a program and replaces common sequences of instructions with a single instruction codeword. A microprocessor executes the compressed instruction sequences by fetching codewords from the instruction memory, expanding them back to the original sequence of instructions in the decode stage, and issuing them to the execution stages. We apply our technique to the PowerPC, ARM, and i386 instruction sets and achieve an average size reduction of 39%, 34%, and 26%, respectively, for SPEC CINT95 programs.
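The replace-and-expand idea in this abstract can be illustrated with a minimal sketch. It is not the authors' analyzer: the fixed sequence length, the dictionary size, the `("CW", k)` codeword encoding, and the function names are all assumptions made for illustration, and instructions are modeled as plain strings.

```python
from collections import Counter


def build_dictionary(program, seq_len=2, max_entries=4):
    """Pick the most frequent fixed-length instruction sequences (hypothetical policy)."""
    counts = Counter(
        tuple(program[i:i + seq_len]) for i in range(len(program) - seq_len + 1)
    )
    # Only sequences that occur more than once are worth a dictionary entry.
    return [seq for seq, n in counts.most_common(max_entries) if n > 1]


def compress(program, dictionary):
    """Replace each dictionary sequence with a single ("CW", index) codeword."""
    out, i = [], 0
    while i < len(program):
        for k, seq in enumerate(dictionary):
            if tuple(program[i:i + len(seq)]) == seq:
                out.append(("CW", k))
                i += len(seq)
                break
        else:
            # No dictionary sequence starts here; keep the instruction as-is.
            out.append(program[i])
            i += 1
    return out


def expand(compressed, dictionary):
    """Model of the decode stage: expand codewords back to the original sequence."""
    out = []
    for item in compressed:
        if isinstance(item, tuple) and item[0] == "CW":
            out.extend(dictionary[item[1]])
        else:
            out.append(item)
    return out


program = ["lw r1", "add r2", "lw r1", "add r2", "sw r3"]
dictionary = build_dictionary(program)
compressed = compress(program, dictionary)
```

Here `expand(compressed, dictionary)` reproduces `program`, while `compressed` holds fewer entries than the original. A real scheme would also have to account for the storage cost of the dictionary and for branch targets, which this sketch ignores.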
The bi-mode branch predictor, C. Lee, I. Chen, and T. Mudge, 30th Ann. IEEE/ACM Symp. Microarchitecture (MICRO-30), pp. 4-13, December 1-3, 1997.
"A Complementary GaAs (CGaAsTM) 32-bit Multiply Accumulate Unit," Michael Kelley, Matthew Postiff, Timothy D. Strong, Richard B. Brown, Trevor N. Mudge, 31st Asilomar Conference on Signals, Systems, and Computers, Nov. 3-5, 1997, pp. 1507-1511.
Multilevel Optimization of Pipelined Caches, O. A. Olukotun, T. N. Mudge and R. B. Brown, IEEE Transactions on Computers, Vol. 46, No. 10, October, 1997, pp. 1093-1101.
Instruction Prefetching Using Branch Prediction Information, by I-C. Chen, C-C. Lee and T. Mudge. International Conference on Computer Design, Austin, TX, October, 1997, pp. 593-601.
Design Optimization for High-speed Per-address Two-level Branch Predictors, by I-C. Chen, C-C. Lee, M. Postiff and T. Mudge, International Conference on Computer Design, Austin, TX, October 12-15, 1997, pp. 88-96.
Choosing the Appropriate Thresholds for Measuring Propagation Delay and Transition Time, by V. Chandramouli, K. A. Sakallah, AICSP Special Issue, Analog Issues in Digital VLSI, E. B. Friedman, ed., Kluwer Academic Publishers, pp. 9-28, September 1997.
Selection of Voltage Thresholds for Delay Measurement, V. Chandramouli and Karem A. Sakallah, Analog Integrated Circuits and Signal Processing: An International Journal, Vol. 14, Number 1/2, September 1997, pp. 9-28.
Signal Delay in Coupled Distributed RC Lines in the Presence of Temporal Proximity by V. Chandramouli, A. I. Kayssi, and K. A. Sakallah, Proceedings of the Advanced Conference in VLSI, Ann Arbor, Michigan, pp. 32-46, September 15-16, 1997.
Improving Data Cache Performance by Pre-Executing Instructions Under a Cache Miss, by James Dundas and Trevor Mudge. 11th ACM International Conference on Supercomputing, Vienna, Austria, July 7-11, 1997, pp. 68-75.
In this paper we propose and evaluate a technique that improves first level data cache performance by pre-executing future instructions under a data cache miss. We show that these pre-executed instructions can generate highly accurate data prefetches, particularly when the first level cache is small. The technique is referred to as runahead processing. The hardware required to implement runahead is modest because when a miss occurs, it makes use of an otherwise idle resource, the execution logic. The principal hardware cost is an extra register file. To measure the impact of runahead, we simulated a processor executing five integer Spec95 benchmarks. Our results show that runahead was able to significantly reduce data cache CPI for four of the five benchmarks. We also compared runahead to a simple form of prefetching, sequential prefetching, which would seem to be suitable for scientific benchmarks. We confirm this by enlarging the scope of our experiments to include a scientific benchmark. However, we show that runahead was also able to outperform sequential prefetching on the scientific benchmark. We also conduct studies that demonstrate that runahead can generate many useful prefetches for lines that show little spatial locality with the misses that initiate runahead episodes. Finally, we discuss some further enhancements of our baseline runahead prefetching scheme.
Trace-driven memory simulation: A survey, R. Uhlig and T. Mudge. ACM Computing Surveys, vol. 29, no. 2, June 1997, pp. 128-170.
Impact of MCMs on High Performance Processors, by B. Davis, C. Gauthier, P. Parakh, T. Basso, C. Lefurgy, R. Brown, and T. Mudge, Proceedings: INTERpack-97: International, Intersociety Electronic & Photonic Packaging Conference vol. 1 (EEP-vol. 19-1), Mauna Lani, HI, June 15-19, 1997, pp. 863-868.
Researchers at the University of Michigan, in collaboration with their partners from Motorola and Cascade Design Automation, are developing design methodologies and automated tools for use in implementing high clock rate digital "systems-on-an-MCM." The PUMA processor, a demonstration vehicle that executes a subset of the PowerPC instruction set, will be designed to operate with a 1 GHz clock. The PUMA will be implemented as a system-on-an-MCM from a set of complementary GaAs (CGaAs) chips that employ area interconnect for high bandwidth inter-chip connections on the MCM. In particular, the paper discusses design tools for systems-on-an-MCM, the optimization of inter-chip drivers, and techniques for optimizing the system performance given the delay of inter-chip crossing.
A Discussion of a GaAs MCM Fabricated at MicroModule Systems through the Multichip Module Designer's Access Service (MIDAS), by J. Peltier, W. Hansford, C. Gauthier, R. Lomax, M. Nanua, P. Parakh, S. Stetson, Proceedings: 6th International Conference on Multichip Modules, Denver, CO, Apr. 2-4, 1997.
The MCM Designers' Access Service (MIDAS) allows designers to obtain low cost, prototype and small quantities of MCMs. MIDAS' low cost price structure comes from a multi-project environment where the customers share tooling and substrate manufacturing costs. The service acts as a technology enabler by supplying the designer with an interface "transparent" to the fabricator and common to multiple vendors.
This paper presents an example of a module that pushes the current MCM-D technology to an extreme limit. This module is a technology evaluation vehicle that aims to characterize the MCM interconnect and driver-receiver pairs for signal integrity, and evaluate the design methodology in Motorola CGaAs[TM]. This paper discusses MIDAS' role in fabricating and assembling the module, the design effort among the parties listed above and the module's functionalities.
Computation of Switching Noise in Printed Circuit Boards, by J.-G. Yook, V. Chandramouli, L. Katehi, K. Sakallah, T. Arabi, and T. Schreyer, IEEE Transactions on Components, Packaging, and Manufacturing Technology - Part A. Vol. 20, No. 1, March 1997, pp. 64-75.
CAD Tools for Area-Distributed I/O Pad Packaging, R. Farbarik, X. Liu, M. Rossman, P. Parakh, T. Basso, R. Brown, 1997 IEEE Multi-Chip Module Conference, February 4-5, 1997, pp. 125-129.
Area I/O Flip-Chip Packaging to Minimize Interconnect Length, by R.J. Lomax, R.B. Brown, M. Nanua, T.D. Strong, Proceedings: Multi-Chip Module Conference (MCMC'97), Santa Cruz, Feb. 4-5, 1997, pp. 2-7.
This paper discusses approaches to using area interconnect to achieve high performance of an experimental multichip microprocessor. These methods are being used in the PUMA project at the University of Michigan to design a processor which has a goal clock speed of 1 GHz. The approach relies on a coordinated placement of functional blocks on chips, and chips on the MCM, together with the use of area arrays to provide high bandwidth interconnections between the chips, and low inductance power connection to the MCM. Three stages of MCM development for this project are described.
Software-managed address translation, by B. L. Jacob and T. N. Mudge. In Proceedings: Third International Symposium on High Performance Computer Architecture (HPCA-3), San Antonio Texas, February 1997.
Given the number of studies that show the handling overhead of memory-management hardware to be a significant burden on system performance, it seems prudent to ask the question, "is dedicated memory-management hardware buying us anything? Does the benefit of hardware address translation outweigh its overhead?" In this paper we study the performance of software-managed address translation. The purpose of the study is to specify the design of hardware to support memory management for a high clock-rate PowerPC implementation in which a simple design is a prerequisite for a fast clock and short design cycle.
We show that software-managed address translation is just as efficient as hardware-managed address translation, and is much more flexible. Operating systems such as OSF/1 and Mach charge between 0.10 and 0.28 cycles per instruction (CPI) for address translation using dedicated memory-management hardware. Software-managed address translation exacts an overhead of 0.05 CPI. Mechanisms like shared memory, superpages, sub-page protection, and sparse address spaces can all be defined completely in software, allowing much more flexibility than hardware-defined mechanisms.
Trap-driven memory simulation with Tapeworm II, R. Uhlig, D. Nagle, T. Mudge and S. Sechrest, ACM Trans. Modeling and Computer Simulation (TOMACS), vol. 7, no. 1, Jan. 1997, pp. 7-41.
Access to Local Resources in a Nomadic Environment, by B. Jacob and T. Mudge, USENIX Technical Conference, Anaheim, CA, January 6-10, 1997.
Timing Verification of Sequential Domino Circuits, by David Van Campenhout, Trevor Mudge and Karem A. Sakallah, ICCAD-96 Digest of Technical Papers, San Jose, CA, November 10-14, 1996, pp. 127-132.
Complementary GaAs Technology for a GHz Microprocessor, by R.B. Brown, T.D. Basso, P.N. Parakh, S.M. Gold, C.R. Gauthier, R.J. Lomax, and T.N. Mudge, Proceedings: 1996 GaAs IC Symposium, Orlando, Florida, November 3-6, 1996.
A DARPA-funded project at the University of Michigan has as a goal the development of technologies and tools needed to implement microprocessors that can be clocked at GHz speeds. A Complementary GaAs HIGFET technology from the Motorola CS-1 facility (CGaAs) is the target semiconductor process. While this technology is immature, it is years ahead of CMOS in terms of fast gate delay at low power supply voltages. A major focus of this work is advanced packaging, which supports partitioning of the design into multiple integrated circuits, each having an integration level that should be achievable in CGaAs. This paper touches on the major aspects of the project (process technology, circuit design, packaging, architecture, CAD tools and software), with an emphasis on application of the CGaAs technology.
A Complementary GaAs PLL Clock Multiplier with Wide-Bandwidth and Low-Voltage Operation, by P. Stetson and R. Brown, 18th Annual IEEE GaAs IC Symposium: Technical Digest 1996, Orlando, FL, Nov. 3-6, 1996, pp. 317-320.
This paper reports a phase-locked loop clock multiplier designed for wide-bandwidth operation at supply voltages of 0.9 V to 1.5 V. Implemented in Motorola's complementary Gallium-Arsenide (CGaAsTM) process, the target application is the PUMA processor, a multi-chip microprocessor based on the PowerPC instruction set architecture. This system operates on an input system clock of 100-125 MHz, while the processor clock is targeted to run at a frequency of 1 GHz. Phase-locked loop clock multiplication factors of 2 to 16 are supported, while the achievable output frequency ranges from 100 MHz to 775 MHz. The chip utilizes Motorola's 0.7 um CGaAsTM process and is entirely implemented with the direct-coupled FET standard cell library developed for the PUMA project. This paper discusses the design and implementation of the clock multiplier. Test results are presented. The design measures 1.4 mm^2, including the fully integrated passive filter. The core power dissipation is 300 mW at 1.5 V, and 36 mW at 0.9 V.
Using Stall Cycles to Improve Microprocessor Performance, by James D. Dundas and Trevor N. Mudge. University of Michigan Technical Report: CSE-TR-301-96.
Contemporary microprocessors typically expend a significant amount of their device budget in an attempt to reduce the detrimental effects of memory latency and branch misprediction. The extra hardware frequently reduces the clock rate and wastes valuable die area that might be more productively employed in other ways. We propose a novel approach to this problem, which is intended to be used in relatively simple microprocessors with very high clock rates. A runahead microprocessor continues to execute instructions past a level one data cache miss until the miss is serviced. During this time it is said to be in runahead mode. Any additional level one data cache misses that occur during runahead, whose target addresses can be calculated, can become data stream prefetches. Any conditional branches that are executed during runahead can become trial runs, which may improve branch prediction accuracy. Any level one instruction cache misses that occur during runahead become instruction stream prefetches. After the original data cache miss is serviced, the microprocessor restarts instruction execution at the address of the load or store instruction that caused the original data cache miss. The additional hardware required to checkpoint the sequential state of the microprocessor is rather modest, and should not impact the clock rate of a high-performance implementation.
An analytical model for designing memory hierarchies, by B. Jacob, P. Chen, S. Silverman, and T. Mudge. IEEE Transactions on Computers, vol. 45, no. 10, October 1996.
Memory hierarchies have long been studied by many means: system building, trace-driven simulation, and mathematical analysis. Yet little help is available for the system designer wishing to quickly size the different levels in a memory hierarchy to a first-order approximation. In this paper, we present a simple analysis for providing this practical help and some unexpected results and intuition that come out of this analysis. By applying a specific, parameterized model of workload locality, we are able to derive a closed-form solution for the optimal size of each hierarchy level. We verify the accuracy of this solution against exhaustive simulation of a three-level storage hierarchy and a three-level memory hierarchy. In all but one case, the configuration recommended by the model performs within 5% of optimal. One surprising result of our analysis is that the best place to spend money is the cheapest (rather than the fastest) cache level, particularly with small system budgets.
Analysis of Branch Prediction via Data Compression, by I-C. Chen, J. Coffey and T. Mudge, Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VII), Cambridge, MA, October 1996.
Support for nomadism in a global environment, by Bruce L Jacob and Trevor N Mudge, Workshop on Object Replication and Mobile Computing, San Jose, California, October 7, 1996.
The Trading Function in action, by Bruce L Jacob and Trevor N Mudge, The Seventh ACM SIGOPS European Workshop, Connemara Ireland, September 1996.
Timing Verification of Sequential Domino Circuits, by David Van Campenhout, Trevor Mudge and Karem A. Sakallah, TECHCON'96, September 12-14, 1996.
Modeling the Effects of Temporal Proximity of Input Transitions on Gate Propagation Delay and Transition Time, by V. Chandramouli, K. A. Sakallah, Proc. of 33rd ACM/IEEE Design Automation Conference, pp. 617-622, Las Vegas, June, 1996.
Correlation and Aliasing in Dynamic Branch Predictors, by Stuart Sechrest, Chih-Chieh Lee, and Trevor Mudge, 23rd International Symposium on Computer Architecture (ISCA'96), pp. 22-32, May, 1996.
Rapid prototyping & evaluation of high-performance computers, R. Brown, J. Hayes, and T. Mudge, Proc. Conf. Experimental Research in Computer Systems, NSF Experimental Systems, Ed. L. Snyder, Washington DC, June 1996, pp. 159-168.
Comparison of two common pipeline structures, by M. Golden and T. Mudge, IEE Proc.-Comput. Digit. Tech., vol. 143, no. 3, pp. 161-167, May 1996.
Ravel-XL: A Hardware Accelerator for Assigned-Delay Compiled-Code Logic Gate Simulation, by M.A. Riepe, J.P.M. Silva, K.A. Sakallah, and R.B. Brown, IEEE Transactions on VLSI Systems, vol. 4, no. 1, March 1996, pp. 113-129.
Ravel-XL is a single-board hardware accelerator for gate-level digital logic simulation. It uses a standard levelized-code approach to statically schedule gate evaluations. However, unlike previous approaches based on levelized-code scheduling, it is not limited to zero- or unit-delay gate models and can provide timing accuracy comparable to that obtained from event-driven methods. We review the synchronous waveform algebra that forms the basis of the Ravel-XL simulation algorithm, present an architecture for its hardware realization, and describe an implementation of this architecture as a single VLSI chip. The chip has about 900,000 transistors on a die that is approximately 1.4 (cm)^2, requires a 256-pin package and is designed to run at 33 MHz. A Ravel-XL board consisting of the processor chip and local instruction and data memory can simulate up to one billion gates at a rate of approximately 6.6 million gate evaluations per second. To better appreciate the tradeoffs made in designing Ravel-XL, we compare its capabilities to those of other commercial and research software simulators and hardware accelerators.
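For readers unfamiliar with levelized-code scheduling, the following is a minimal sketch of the conventional zero-delay form that the abstract says Ravel-XL goes beyond. It does not model the synchronous waveform algebra or assigned delays; the netlist format, the gate set, and the function names are assumptions chosen for illustration.

```python
# Two-input gate functions on 0/1 values (hypothetical gate library).
OPS = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
}


def levelize(netlist, primary_inputs):
    """Return gate names ordered so every gate follows the gates that drive it."""
    level = {name: 0 for name in primary_inputs}

    def visit(net):
        if net not in level:
            _, fanin = netlist[net]
            level[net] = 1 + max(visit(i) for i in fanin)
        return level[net]

    for net in netlist:
        visit(net)
    return sorted(netlist, key=lambda n: level[n])


def simulate(netlist, schedule, inputs):
    """Evaluate each gate exactly once, in the static levelized order."""
    values = dict(inputs)
    for net in schedule:
        op, fanin = netlist[net]
        values[net] = OPS[op](*(values[i] for i in fanin))
    return values


# A full adder, deliberately listed out of evaluation order.
full_adder = {
    "cout": ("OR", ["c1", "c2"]),
    "sum": ("XOR", ["t", "cin"]),
    "c2": ("AND", ["t", "cin"]),
    "c1": ("AND", ["a", "b"]),
    "t": ("XOR", ["a", "b"]),
}
schedule = levelize(full_adder, ["a", "b", "cin"])
result = simulate(full_adder, schedule, {"a": 1, "b": 1, "cin": 0})
```

Because the schedule is computed once before simulation, no event queue is needed at run time, which is the property that makes the approach attractive for compiled-code and hardware-accelerated simulators.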
OS/Architecture Interactions and their Influence on Computer Architecture, by D. Nagle. Ph.D. Thesis, University of Michigan, 1995.
Trap-Driven Memory Simulation, by R. Uhlig. Ph.D. Thesis, University of Michigan, 1995.
The Role of Adaptivity in Two-Level Adaptive Branch Prediction, S. Sechrest, C.C. Lee, T. Mudge, Proceedings of the 28th Annual IEEE ACM Symposium on Microarchitecture, MICRO-28, Ann Arbor, MI, 11/29-12/1/95, pp. 264-269.
Power Rail Logic: A Low Power Logic Style for Digital GaAs Circuits, by A. Chandna, R. B. Brown, D. Putti, C. D. Kibler, IEEE Journal of Solid-State Circuits, vol. 30, no. 10, pp. 1096-1100, October 1995.
A Parallel Genetic Algorithm for Multiobjective Microprocessor Design, by T.J. Stanley and T. Mudge, 6th Int. Conf. on Genetic Algorithms, Pittsburgh, PA, July, 1995.
The microprocessor chip designer must solve the problem of partitioning millions of transistors into an arbitrary number of hardware structures within a finite chip area toward achieving maximum performance. This combinative complexity is compounded by a lengthy performance evaluation of each proposed design. We present the application of a real-valued multiobjective genetic algorithm on an asynchronous parallel workstation network as an optimization approach well suited to this problem. By casting design budget constraints as multiple design objectives, the need for penalty functions is eliminated. A microprocessor cache memory design problem is optimized with the genetic algorithm.
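Treating budget constraints as objectives means candidate designs are compared by Pareto dominance rather than by a single penalized score. The sketch below shows only that comparison step, not the genetic algorithm itself; the two objectives (miss rate, chip area) and the sample values are hypothetical, and both objectives are assumed to be minimized.

```python
def dominates(a, b):
    """True if design a is no worse than b in every objective and better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(designs):
    """Keep the designs that no other design dominates."""
    return [p for p in designs if not any(dominates(q, p) for q in designs)]


# Hypothetical cache designs as (miss_rate, area) pairs.
designs = [(0.10, 4), (0.05, 8), (0.12, 6), (0.05, 10)]
front = pareto_front(designs)
```

With these sample values, `front` keeps `(0.10, 4)` and `(0.05, 8)`: neither beats the other on both objectives, while the remaining two designs are each dominated by one of them.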
Instruction Fetching: Coping with Code Bloat, by R. Uhlig, D. Nagle, T. Mudge, S. Sechrest, and J. Emer, 22nd Int. Symp. on Computer Architecture, Santa Margherita Ligure, Italy, June 18-24, 1995, pp. 345-356.
Previous research has shown that the SPEC benchmarks achieve low miss ratios in relatively small instruction caches. This paper presents evidence that current software-development practices produce applications that exhibit substantially higher instruction-cache miss ratios than do the SPEC benchmarks. To represent these trends, we have assembled a collection of applications, called the Instruction Benchmark Suite (IBS), that provides a better test of instruction-cache performance. We discuss the rationale behind the design of IBS and characterize its behavior relative to the SPEC benchmark suite. Our analysis is based on trace-driven and trap-driven simulations and takes into full account both the application and operating-system components of the workloads.
This paper then reexamines a collection of previously-proposed hardware mechanisms for improving instruction-fetch performance in the context of the IBS workloads. We study the impact of cache organization, transfer bandwidth, prefetching, and pipelined memory systems on machines that rely on the use of relatively small primary instruction caches to facilitate increased clock rates. We find that, although of little use for SPEC, the right combination of these techniques substantially benefits IBS. Even so, under IBS, a stubborn lower bound on the instruction-fetch CPI remains as an obstacle to improving overall processor performance.
The Aurora RAM Compiler, by A. Chandna, C.D. Kibler, R.B. Brown, M. Roberts, and K. Sakallah, 32nd Design Automation Conference: Proceedings 1995, San Francisco, CA, June 12-16, 1995, pp. 261-266.
This paper describes a RAM compiler for generating and characterizing highly manufacturable optimized SRAMs using GaAs E/D MESFET technology. The compiler uses a constraint-driven design flow to achieve process tolerant RAMs. This compiler was built using a flexible design framework that can be easily adapted to optimize and characterize memories in different MESFET processes.
Critical Paths in Circuits with Level-Sensitive Latches, by T. Burks, K. Sakallah, T. Mudge, IEEE Transactions on VLSI Systems, vol. 3, no. 2, June, 1995, pp. 273-291.
This paper extends the classical notion of critical paths in combinational circuits to the case of synchronous circuits that use level-sensitive latches. Critical paths in such circuits arise from setup, hold, and cyclic constraints on the data signals at the inputs of each latch and may extend through one or more latches. Two approaches are presented for identifying these critical paths and verifying their timing. The first implicitly checks all paths using a relaxation-based solution procedure. Results of this procedure are used to calculate slack values, which in turn identify satisfied and violated critical paths. The second approach is based on a constructive algorithm which generates all the critical paths in a circuit and then verifies that their timing constraints are satisfied. Algorithms are evaluated and compared using circuits from the ISCAS89 sequential benchmark suite and the Michigan High Performance Microprocessor Project.
Notes on Calculating Computer Performance, by B. Jacob and T. Mudge. Technical Report CSE-TR-231-95, University of Michigan, Department of Electrical Engineering and Computer Science, March, 1995.
This report explains what it means to characterize the performance of a computer, and which methods are appropriate and inappropriate for the task. The most widely used metric is the performance on the SPEC benchmark suite of programs; currently, the results of running the SPEC benchmark suite are compiled into a single number using the geometric mean. The primary reason for using the geometric mean is that it preserves values across normalization, but unfortunately, it does not preserve total run time, which is probably the figure of greatest interest when performances are being compared.
Instead of the geometric mean, either the harmonic or the arithmetic mean is the appropriate method for averaging a set of running times; the arithmetic mean should be used to average times, and the harmonic mean should be used to average rates (1/time). A number of published SPECmarks are recomputed using these means to demonstrate the effect of choosing a favorable algorithm. It is assumed that the reader is familiar with the concept of weighted averages.
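A small numeric sketch makes the distinction concrete. The two run times below are invented for illustration and are not taken from the report; the point is only that the arithmetic mean of times and the harmonic mean of rates both preserve total run time, while the geometric mean does not.

```python
import math


def arithmetic_mean(xs):
    return sum(xs) / len(xs)


def harmonic_mean(xs):
    return len(xs) / sum(1.0 / x for x in xs)


def geometric_mean(xs):
    return math.exp(sum(math.log(x) for x in xs) / len(xs))


# Hypothetical run times in seconds for two benchmark programs.
times = [2.0, 8.0]
rates = [1.0 / t for t in times]  # programs per second

am_time = arithmetic_mean(times)   # 5.0 s; 2 programs * 5.0 s = 10 s total
hm_rate = harmonic_mean(rates)     # 0.2 programs/s, i.e. 1 / 5.0 s
gm_time = geometric_mean(times)    # 4.0 s; 2 programs * 4.0 s = 8 s, not 10 s
```

Multiplying `am_time` by the number of programs recovers the 10-second total, and `hm_rate` is exactly its reciprocal, so the two agree. The geometric mean implies an 8-second total, which understates the actual run time.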
Systematic Objective-driven Computer Architecture Optimization, by T. Stanley and T. Mudge, 16th Conf. on ARVLSI, Chapel Hill, NC, March 27-29, 1995, pp. 286-300.
Computer designers now have more transistors and architectural alternatives than at any time. Computer-aided design tools automate much of the physical design process. However, few tools have been developed to help the computer architect specify near-optimal microarchitectural configurations in the early design stages. Such tools are needed to systematically guide the early design specifications subject to multiple objectives such as cost, performance, and power consumption.
This paper illustrates an objective-driven microarchitectural design methodology that couples the specification design phase with an optimization technique. The design of a memory hierarchy with multiple performance objectives is used as a case study. This is a directed search problem with a high dimensionality. We show that the genetic algorithm, a global optimization technique based on the metaphor of natural selection and survival of the fittest, is an ideal candidate for such an objective-driven search in a high dimensional space. The paper concludes that search techniques such as genetic algorithms are necessary to systematically and efficiently drive architectural optimizations for multiple objectives such as dynamic power and performance in the early, high-impact stages of the design process.
A Verilog Preprocessor for Representing Datapath Components, by B. Davis and T. Mudge, 4th Int. Verilog HDL Conference: Proceedings 1995, Santa Clara, CA, March 27-29, 1995, pp. 90-98.
This paper describes research leading to the generation of a preprocessor for the Verilog hardware description language. The function of this preprocessor is to support repeated feature instances in a Verilog description for a digital system. Repeated features most commonly occur in the description of datapaths, where iterative structures like adders, multipliers and muxes are the basic building blocks. Citations from Verilog users and industry organizations in support of inclusion of a repeated feature syntax are given. Several syntaxes for describing repeated features are presented. From these proposals, a single syntax for support of repeated feature instances is selected. A preprocessor is described that will parse the extended Verilog and translate it to supported Verilog. The challenges in the generation of the preprocessor are given. The paper concludes with a status report on the preprocessor and thoughts for future development.
Timing Models for Gallium Arsenide Direct-Coupled FET Logic Circuits, by A. Kayssi and K. Sakallah, IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 14, no. 3, March, 1995, pp. 384-393.
In this paper we derive delay and transition time macromodels for GaAs DCFL logic gates. The macromodels are derived by a systematic application of dimensional analysis aimed at finding suitable minimal functional forms that capture the effects of all relevant parameters. The process is illustrated through a detailed step-by-step account of the macromodel development for DCFL inverters. Based on different modeling approximations, one- and two-argument macromodel functions are derived and compared. The inverter macromodel is then used as a basis for developing timing macromodels for superbuffers and NOR gates. The NOR gate macromodels account for the simultaneous and near-simultaneous switching of two inputs, with an extension to multiple inputs.
A Comparison of Two Pipeline Organizations, by M. Golden and T. Mudge, Micro-27: Proceedings of the 27th Annual International Symposium on Microarchitecture, San Jose, CA, Nov. 30-Dec. 2, 1994, pp. 153-161.
We examine two pipeline structures which are employed in commercial microprocessors. The first is the load-use interlock (LUI) pipeline, which employs an interlock to ensure correct operation during load-use hazards. The second is the address-generation interlock (AGI) pipeline. It eliminates the load-use hazard, but has an address-generation hazard which requires an address-generation interlock for correct operation. We compare the performance of these two pipelines on existing binaries and on applications which have been recompiled with a local code scheduler that understands the difference in the pipeline structures. When branch prediction is more than 80% accurate and the data cache access time is greater than two cycles, the AGI pipeline performs significantly better than the LUI pipeline on existing binaries. Recompiling the benchmarks with a new local code scheduler provides little additional performance improvement.
An Asynchronous GaAs MESFET Static RAM Using a New Current Mirror Memory Cell, by A. Chandna and R. B. Brown, IEEE Journal of Solid-State Circuits, vol. 29, no. 10, October, 1994, pp. 1270-1271.
An experimental 1-kb GaAs MESFET static RAM using a new memory cell has been designed, fabricated and tested. The new memory cell is not subject to the destructive read problems that constrain the design of the conventional six-transistor memory cell. The biasing arrangement for this new cell minimizes the leakage currents associated with unselected bits attached to a column, maximizing the number of bits allowed per column. This new memory cell also provides a much larger access current for readout than is possible using a conventional memory cell of the same area and cell power. A write time of 1.0 ns and address access times of between 1.0 and 2.3 ns have been obtained from a 1-kb test circuit. A cell area of 350 (um)^2 and cell current of 60 uA were achieved using a conventional E/D process.
Power Rail Logic: A Low Power Logic Style for Digital GaAs Circuits, by A. Chandna, R. B. Brown, D. Putti, C. D. Kibler, 16th Annual IEEE GaAs IC Symposium: Technical Digest 1994, Oct. 16-19, 1994, pp. 71-74.
This paper describes a new logic style called Power Rail Logic (PRL) which is compatible with DCFL circuits. Multiplexors, latches, flip-flops and exclusive-OR gates can be built using this logic style. Compared to DCFL, PRL uses fewer transistors, has larger noise margins, and up to 40% lower power-delay products. A test chip containing 32-bit barrel shifters designed in DCFL and in PRL was successfully fabricated and tested. Test results are given for both circuits.
A Variable-Voltage Bidirectional I/O Pad for Digital GaAs Applications, by P. Sherhart, M. Upton, R. Lomax, R. Brown, 16th Annual IEEE GaAs IC Symposium: Technical Digest 1994, Oct. 16-19, 1994, pp. 67-70.
A bidirectional I/O pad for digital GaAs applications has been designed, fabricated, and tested using Vitesse Semiconductor process technology. The I/O pad is designed to operate at frequencies up to 500 MHz and at GTL, ECL, or Rambus voltage levels. The I/O pads can be calibrated to these voltage levels either manually using external signals or internally using on-chip digital calibration logic.
Resource Allocation in a High Clock Rate Microprocessor, by M. Upton, T. Huff, T. Mudge, R. B. Brown, ASPLOS VI Proceedings: Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, San Jose, CA, Oct. 4-7, 1994, pp. 98-109.
This paper discusses the design of a high clock rate (300MHz) processor. The architecture is described, and the goals for the design are explained. The performance of three processor models is evaluated using trace-driven simulation. A cost model is used to estimate the resources required to build processors with varying sizes of on-chip memories, in both single and dual issue models. Recommendations are then made to increase the effectiveness of each of the models.
Trap-driven Simulation with Tapeworm II, by R. Uhlig, D. Nagle, T. Mudge, and S. Sechrest, ASPLOS VI: Proceedings: Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, San Jose, CA, Oct. 4-7, 1994, pp. 132-144.
Tapeworm II is a software-based simulation tool that evaluates the cache and TLB performance of multiple-task and operating system intensive workloads. Tapeworm resides in an OS kernel and causes a host machine's hardware to drive simulations with kernel traps instead of with address traces, as is conventionally done. This allows Tapeworm to quickly and accurately capture complete memory referencing behavior with a limited degradation in overall system performance. This paper compares trap-driven simulation, as implemented in Tapeworm, with the more common technique of trace-driven memory simulation with respect to speed, accuracy, portability and flexibility.
Design Tradeoffs for Software-Managed TLBs, by R. Uhlig, D. Nagle, T. Stanley, T. Mudge, S. Sechrest, R. B. Brown, ACM Trans. on Computer Systems, vol. 12, no. 3, August, 1994, pp. 175-205.
An increasing number of architectures provide virtual memory support through software-managed TLBs. However, software management can impose considerable penalties that are highly dependent on the operating system's structure and its use of virtual memory. This work explores software-managed TLB design tradeoffs and their interaction with a range of monolithic and microkernel operating systems. Through hardware monitoring and simulation, we explore TLB performance for benchmarks running on a MIPS R2000-based workstation running Ultrix, OSF/1, and three versions of Mach 3.0.
Macromodel Simplification Using Dimensional Analysis, by A. Kayssi and K. Sakallah, IEEE International Symposium on Circuits and Systems, May 30-June 2, 1994, pp. 335-338.
We present a procedure, based on dimensional analysis, to simplify macromodeling. By combining variables according to their units, simpler equations involving dimensionless variables can be derived. The procedure is illustrated by developing macromodels for power dissipation, supply current, and propagation delay of a CMOS inverter.
Optimization of Storage Hierarchies, by B. Jacob. Technical Report CSE-TR-228-95, University of Michigan, Department of Electrical Engineering and Computer Science, May, 1994.
Optimization is often a question of where one should put one's money in improving performance. As far as large storage hierarchies go, intuition suggests (and common practice supports) adding as much as is affordable of the fastest technology available. Many cache hierarchy studies have shown that this is often not the optimal approach, and we show that for mass storage hierarchies it always tends to be the wrong approach. For large data sets, as is the case for network file servers, a machine with no RAM and several gigabytes of disk performs 30% faster than a machine with no disk and a cost-equivalent amount of RAM.
This paper presents a mathematical analysis of the optimization of an I/O hierarchy, as well as trace driven simulations of a network file server in support of the analysis.
Kernel-based Memory Simulation, by R. Uhlig, D. Nagle, T. Mudge, S. Sechrest, Proceedings: ACM Sigmetrics Conference on Measurement and Modeling of Computer Systems, vol. 22, no. 1, May, 1994, pp. 286-287. Extended Abstract.
Trace-driven simulation is a widely accepted technique for studying the components of computer memory systems such as caches and translation look-aside buffers (TLBs). However, trace-driven methods are time consuming, requiring 20 to 50 times as long to run as actual hardware, and often cannot accurately take process interaction and operating system effects into account.
To overcome the limitations of trace-driven simulation, we have developed an alternative approach in which memory simulators run in an active operating system kernel. This method, called kernel-based memory simulation, allows us to account for all system activity, including multiple process and kernel interactions. Further, by using privileged machine operations to cause traps into the simulator only when a miss in a simulated memory structure occurs, a kernel-based simulator is able to process hits at the full speed of the underlying host hardware.
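The control flow of a miss-only simulator can be sketched in user-level code. The example below is an illustration only, not the kernel implementation described above: the class `TrapDrivenCache`, its direct-mapped organization, and its parameters are hypothetical, and the residency check in `access` merely stands in for the hardware trap mechanism.

```python
class TrapDrivenCache:
    """Toy direct-mapped cache whose simulator code runs only on misses."""

    def __init__(self, num_lines=4, line_bytes=16):
        self.num_lines = num_lines
        self.line_bytes = line_bytes
        self.resident = [None] * num_lines   # tag currently held by each line
        self.misses = 0

    def _on_trap(self, index, tag):
        # Simulator work happens only here, in the "trap handler":
        # record the miss and install the new line (evicting the old one).
        self.misses += 1
        self.resident[index] = tag

    def access(self, addr):
        block = addr // self.line_bytes
        index = block % self.num_lines
        tag = block // self.num_lines
        # Stand-in for the hardware: in a real trap-driven setup a hit would
        # proceed at full speed and never reach any simulator code.
        if self.resident[index] != tag:
            self._on_trap(index, tag)


cache = TrapDrivenCache()
for addr in [0, 4, 16, 0, 64, 0]:
    cache.access(addr)
print(cache.misses)
```

Addresses 0 and 64 map to the same line in this configuration, so the final access to address 0 misses again after the eviction.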
Delay Macromodels for Point-to-Point MCM Interconnections, by A. Kayssi and K. Sakallah, IEEE Trans. on Components, Packaging and Manufacturing Technology - Part B: Advanced Packaging, vol. 17, no. 2, May, 1994, pp. 147-152.
We develop delay macromodels for lossless as well as lossy point-to-point MCM transmission lines using a systematic model construction procedure that includes dimensional analysis. The result for lossless lines confirms earlier work and extends it to the case of nonlinear drivers and capacitive termination. For lossy lines, we show that dimensional analysis allows us to reduce the complexity of the delay expression without any loss in accuracy. A second-order polynomial fit to a four-argument delay function is shown to yield predictions that are within 5% of detailed circuit simulation at a fraction of the computational cost. We illustrate the applications of these macromodels in studying the effects of packaging technology on signal delay and in sensitivity analysis.
Optimal Allocation of On-chip Memory for Multiple-API Operating Systems, by D. Nagle, R. Uhlig, T. Mudge and S. Sechrest, Proceedings of the 21st Annual International Symposium on Computer Architecture, April 18-21, 1994, pp. 358-369.
The allocation of die area to different processor components is a central issue in the design of single-chip microprocessors. Chip area is occupied by both core execution logic, such as ALU and FPU datapaths, and memory structures, such as caches, TLBs, and write buffers. This work focuses on the allocation of die area to memory structures through a cost-benefit analysis. The cost of memory structures with different sizes and associativities is estimated by using an established area model for on-chip memory. The performance benefits of selecting a given structure are measured through a collection of methods including on-the-fly hardware monitoring, trace-driven simulation and kernel-based analysis. Special consideration is given to operating systems that support multiple application programming interfaces (APIs), a software trend that substantially affects on-chip memory allocation decisions.
Results: Small adjustments in cache and TLB design parameters can significantly impact overall performance. Operating systems that support multiple APIs, such as Mach 3.0, increase the relative importance of on-chip instruction caches and TLBs when compared against single-API systems such as Ultrix.
Several architectural innovations intended to reduce access latency and improve overall throughput increase system bandwidth requirements. Bandwidth scales with clock speed, and can be regarded as an architectural resource to be applied to latency reduction. A properly designed bus provides low arbitration latency and delivers high sustained bandwidth.
This paper evaluates the performance of a 3.2 Gbyte/s peak-bandwidth, low-latency arbitration bus connecting a GaAs superscalar CPU to a GaAs memory management unit. A microarchitectural performance model was written in the Verilog hardware description language. Bus transactions characteristic of the SPECint92 benchmarks and other workloads were generated as input. Sustained bandwidths of 1.68 Gbyte/s were achieved with arbitration costs of less than 0.5 cycles per data transfer.
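To put the two bandwidth figures in proportion, the short calculation below computes the fraction of peak bandwidth that is sustained. The helper name `bus_utilization` is hypothetical; only the 3.2 and 1.68 Gbyte/s values come from the abstract.

```python
def bus_utilization(sustained_gbytes_per_s, peak_gbytes_per_s):
    """Return sustained bandwidth as a fraction of peak bandwidth."""
    if peak_gbytes_per_s <= 0:
        raise ValueError("peak bandwidth must be positive")
    return sustained_gbytes_per_s / peak_gbytes_per_s


utilization = bus_utilization(1.68, 3.2)
print(f"{utilization:.1%} of peak bandwidth sustained")
```

With the reported figures, the sustained rate is a little over half of the peak rate.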
Gallium Arsenide Process Evaluation Based on a RISC Microprocessor Example, by R. B. Brown, M. Upton, A. Chandna, T. Huff, T. Mudge, R. Oettel, IEEE Journal of Solid-State Circuits, vol. 28, no. 10, October, 1993, pp. 1030-1037.
This work evaluates the features of a gallium-arsenide E/D MESFET process in which a 32-b RISC microprocessor was implemented. The design methodology and architecture of this prototype CPU are described. The performance sensitivity of the microprocessor and other large circuit blocks to different process parameters is analyzed, and recommendations for future process features, circuit approaches, and layout styles are made. These recommendations are reflected in the design of a second microprocessor using a more advanced process that achieves much higher density and performance.
GaAs Opportunities in High Performance Computing and Communications, by J. Toole and R. B. Brown, 15th Annual IEEE GaAs IC Symposium: Technical Digest 1993, October 10-13, 1993, pp. 3-6.
High performance computing and networking are becoming the backbone of the scientific and information infrastructure, incorporating emerging technologies into productive applications at an accelerating pace. These technologies are important both for our national security and as the basis of our future economic competitiveness. The Federal High Performance Computing and Communications Program provides an innovative and coordinated research agenda for the US in these areas. The question of greatest interest to the III/V community is, "What role will emerge for compound semiconductors as computing reinvents itself?" This paper presents some insights, and a sample of the GaAs-related research funded by the Advanced Research Projects Agency. Results of these efforts will significantly contribute to the future role these semiconductors will play in mainstream computing.
Delay Modeling for GaAs DCFL Circuits, by A. Kayssi and K. Sakallah, 15th Annual IEEE GaAs IC Symposium: Technical Digest 1993, October 10-13, 1993, pp. 67-70.
A timing macromodel for GaAs DCFL logic gates is derived. It calculates the delay of a gate as a function of such parameters as transistor sizes, capacitive loading, fanout, and input transition time. For NOR gates, the simultaneous switching of two inputs is also considered. Calculations based on the derived macromodel show excellent agreement with circuit simulation.
An Asynchronous GaAs MESFET Static RAM Using A New Current Mirror Memory Cell, by A. Chandna and R. B. Brown, 15th Annual IEEE GaAs IC Symposium: Technical Digest 1993, October 10-13, 1993, pp. 79-82.
An experimental 1kb GaAs MESFET static RAM using a new memory cell has been designed, fabricated and tested. The new memory cell is not subject to the destructive read problems which constrain the design of the conventional six-transistor memory cell. The memory cell also provides a much larger access current for readout than is possible using a conventional memory cell of the same area and cell power. Address access times of 1.6ns have been obtained from a 1kb test circuit.
Synchronization of Pipelines, by K. Sakallah, T. Mudge, T. Burks, E. Davidson, IEEE Trans. on CAD of Integrated Circuits and Systems, vol. 12, no. 8, August, 1993, pp. 1132-1146.
In this paper we apply a recently formulated general timing model of synchronous operation to the special case of latch-controlled pipelined circuits. The model accounts for multiphase synchronous clocking, correctly captures the behavior of level-sensitive latches, handles both short- and long-path delays, accommodates wave pipelining, and leads to a comprehensive set of timing constraints. Pipeline circuits are important because of their frequent use in computer systems. We define their concurrency as a function of the clock schedule and degree of wave pipelining. We then identify a special class of clock schedules, coincident multiphase clocks, which provide a lower bound on the value of the optimum cycle time. We show that the region of feasible solutions for single-phase clocking can be nonconvex or even disjoint, and derive a closed-form expression for the minimum cycle time of a restricted but practical form of single-phase clocking. We compare these forms of clocking on three pipeline examples and highlight some of the issues in pipeline synchronization.
The Aurora Project, by R. Huff, M. Upton, T. Mudge, and R. Brown, Hot Chips V: Symposium Record, Stanford, CA, August 8-10, 1993, pp. 3.2.1-3.2.12.
The Impact of Signal Transition Time on Path Delay Computation, by A. Kayssi, K. Sakallah, T. Mudge, IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol. 40, no. 5, May, 1993, pp. 302-309.
It has been recognized for some time that nonzero signal rise and fall times contribute to gate propagation delays. Practically, however, most timing analysis tools ignore these contributions when computing path delays and identifying critical paths in combinational circuits. In this paper we describe how these rise and fall times can be incorporated into path analysis algorithms. Interestingly, we show that signal transition time information can be accounted for in a simple and efficient preprocessing step followed by the application of standard path analysis methods. This two-step approach is shown to predict path delays with sufficient accuracy without unnecessarily complicating path analysis.
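The two-step idea can be sketched as follows. This is not the paper's algorithm: the linear delay model (intrinsic delay plus a sensitivity times the driver's output transition time), the fixed per-node `out_slew` values, and all function and variable names are assumptions made for illustration. Step 1 folds transition-time information into per-edge delays; step 2 is an ordinary longest-path pass over the circuit graph.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def annotate_edges(fanin, intrinsic, slew_sens, out_slew):
    """Step 1 (preprocessing): fold driver transition time into edge delays.

    Assumed model: delay(u -> v) = intrinsic[v] + slew_sens[v] * out_slew[u].
    """
    return {
        (u, v): intrinsic[v] + slew_sens[v] * out_slew[u]
        for v, preds in fanin.items()
        for u in preds
    }


def longest_arrival(fanin, edge_delay):
    """Step 2: standard longest-path analysis on the annotated DAG."""
    arrival = {}
    for v in TopologicalSorter(fanin).static_order():
        preds = fanin.get(v, ())   # primary inputs have no fan-in
        arrival[v] = max(
            (arrival[u] + edge_delay[(u, v)] for u in preds), default=0.0
        )
    return arrival


# Hypothetical two-gate circuit: g1 = f(a, b), g2 = f(g1, c).
fanin = {"g1": ["a", "b"], "g2": ["g1", "c"]}
intrinsic = {"g1": 1.0, "g2": 2.0}
slew_sens = {"g1": 0.5, "g2": 0.5}
out_slew = {"a": 0.2, "b": 0.4, "c": 0.0, "g1": 1.0}

delays = annotate_edges(fanin, intrinsic, slew_sens, out_slew)
arrival = longest_arrival(fanin, delays)
print(arrival["g2"])
```

Because the transition-time contribution is absorbed into the edge weights up front, the path analysis itself needs no changes.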
Design Tradeoffs for Software-Managed TLBs, by D. Nagle, R. Uhlig, T. Stanley, S. Sechrest, T. Mudge, R. B. Brown, Proceedings: The 20th Annual International Symposium on Computer Architecture, San Diego, CA, May 16-19, 1993, pp. 27-38.
An increasing number of architectures provide virtual memory support through software-managed TLBs. However, software management can impose considerable penalties, which are highly dependent on the operating system's structure and its use of virtual memory. This work explores software-managed TLB design tradeoffs and their interaction with a range of operating systems including monolithic and microkernel designs. Through hardware monitoring and simulation, we explore TLB performance for benchmarks running on a MIPS R2000-based workstation running Ultrix, OSF/1, and three versions of Mach 3.0.
Results: New operating systems are changing the relative frequency of different types of TLB misses, some of which may not be efficiently handled by current architectures. For the same application binaries, total TLB service time varies by as much as an order of magnitude under different operating systems. Reducing the handling cost for kernel TLB misses reduces total TLB service time up to 40%. For TLBs between 32 and 128 slots, each doubling of the TLB size reduces total TLB service time up to 50%.
A High Performance GaAs Microprocessor, by T. Huff, M. Upton, P. Sherhart, P. Barker, R. McVay, T. Stanley, R. B. Brown, R. Lomax, T. Mudge, K. Sakallah, Proceedings of the IEEE Laser and Optics Society Sarnoff Symposium, Princeton, NJ, March 1993, no page nos.
A 32-bit RISC microprocessor has been fabricated in a 0.6um GaAs DCFL process. It includes 160,000 transistors on a 13.9 x 7.8 (mm)^2 chip, and dissipates 24W. The chip contains an ALU, 32x32 register files, 4-word write buffer, small on-chip I-cache, and support for off-chip instruction and data caches.
A 32kb GaAs SRAM with Electronically Programmable Redundancy, by A. Chandna and R. B. Brown, in Research on Integrated Systems: Proceedings of the 1993 Symposium, ed. by G. Borriello and C. Ebeling, The MIT Press, Cambridge, MA, pp. 155-167.
This paper describes novel circuit techniques that have been employed in a 1Kx32-bit SRAM designed in Vitesse's 0.6um H-GaAs III E/D MESFET technology. The memory has a projected access time of 2.2ns with a nominal power dissipation of 5W.
Subthreshold leakage currents in MESFETs are orders of magnitude larger than those found in MOSFETs. The impact of these currents on circuit area and power are discussed. Based on this SRAM design, the trade-offs presented by subthreshold currents on system-level integration are examined.
A fuse circuit has been developed for this SRAM, making it the first GaAs memory to employ electronically programmable redundancy. The fuse circuit is presented with an SEM photograph of a programmed device. Transfer curves during programming and I-V curves of the device before and after programming are also shown.
A high-speed amplifier has been designed for this SRAM with a sensing speed that is independent of bit-line capacitance. Sense-amps tend to be particularly sensitive to the large device threshold variations of GaAs MESFETs. Simulation results showing the effects of these process variations on sense-amp speed and reliability are presented. A circuit design which minimizes the effects of process variations on this sense amplifier is presented.
A 160,000 Transistor GaAs Microprocessor, by M. Upton, T. Huff, P. Sherhart, P. Barker, R. McVay, T. Stanley, R. B. Brown, R. Lomax, T. Mudge, K. Sakallah, 1993 IEEE International Solid-State Circuits Conference: Digest of Technical Papers, Feb., 1993, pp. 92-94.
A single-chip GaAs microprocessor includes a Ling-adder-based ALU, 32b shifter, 32-word register file, 4-word write buffer, 32-word on-chip instruction cache, support for 2 levels of off-chip instruction and data caches and an asynchronous system interface. It is in direct-coupled FET logic (DCFL) and integrates 160,000 transistors on a 13.9 x 7.8 (mm)^2 die. When operating from a 2V supply, the chip typically dissipates 24W. Portions of the chip operate at 200 MHz. Full functionality is verified at 100MHz. The speed of the instruction decode in the first prototypes is limited by incorrect clock-phase assignment.
Hardware Support for Hiding Cache Latency, by M. Golden and T. Mudge, Technical Report CSE-TR-152-93, University of Michigan, Department of Electrical Engineering and Computer Science, January, 1993.
As the decrease in processor cycle time continues to outpace the decrease in memory cycle time, even moderately sized on-chip caches may require several cycles of access time in the near future. This means that time is lost, even on a cache hit, if independent instructions cannot be scheduled after a read from memory. A novel hardware device is proposed that keeps track of the history of load instructions and predicts their targets before they are computed by the instruction pipeline. This allows the saving of several processor cycles. The storage required to implement such a device is quite large, but as the latency required to read from the first level cache grows, a moderate performance improvement is seen.
A simplified version of a RISC microprocessor has been implemented with E/D MESFET DCFL in the Vitesse HGaAs II process. This chip was designed to drive the development of digital GaAs design automation tools. The processor architecture was modified to fit DCFL technology. The 60,500-transistor circuit executes a set of 29 basic instructions. It dissipates 11 W and operates at over 100 MHz. The RISC processor chip set being developed in this project is helping to identify the challenges and opportunities of VLSI GaAs.
Compound Semiconductor Device Requirements for VLSI, by R. B. Brown, A. Chandna, T. Huff, R. Lomax, T. Mudge, R. Oettel, M. Upton, Proceedings of the 19th International Symposium on Gallium Arsenide and Related Compounds 1992: Institute of Physics Conference Series Number 129, Karuizawa, Japan, October 1992, pp. 857-862.
High-performance digital systems, such as a 60,500-transistor RISC microprocessor designed at the University of Michigan, are a growing application area for compound semiconductors. To support such VLSI circuits, a process must have, in addition to good device switching times: high levels of integration, good yields, reasonable power dissipation, and dense, multilevel interconnect. A digital logic family should have gates with good load-driving characteristics, reasonable noise margins, special features for embedded memory, and support by appropriate design automation tools. Device development for digital applications should be guided by all of these requirements.
Delay Macromodels for the Timing Analysis of GaAs DCFL, by A. Kayssi and K. Sakallah, Proc. European Design Automation Conference (EuroDAC), Hamburg, Germany, September, 1992, pp. 142-145.
A timing macromodel for GaAs DCFL logic cells is derived. It calculates the delay of a cell as a function of such parameters as transistor sizes, capacitive loading, fanout, and input switching time. Calculations based on the derived macromodel show excellent agreement with circuit simulation at two to three orders of magnitude savings in computation times.
Performance Optimization of Pipelined Primary Caches, by K. Olukotun, T. Mudge, R. B. Brown, The 19th Annual International Symposium on Computer Architecture, Gold Coast, Australia, May 19-21, 1992, vol. 20, no. 2, pp. 181-190.
The CPU cycle time of a high-performance processor is usually determined by the access time of the primary cache. As processor speeds increase, designers will have to increase the number of pipeline stages used to fetch data from the cache in order to reduce the dependence of CPU cycle time on cache access time. This paper studies the performance advantages of a pipelined cache for a GaAs implementation of the MIPS based architecture using a design methodology that includes long traces of multiprogrammed applications and detailed timing analysis. The study evaluates instruction and data caches with various pipeline depths, cache sizes, block sizes, and refill penalties. The impact on CPU cycle time of these alternatives is also factored into the evaluation. Hardware-based and software-based strategies are considered for hiding the branch and load delays which may be required to avoid pipeline hazards. The results show that software-based methods for mitigating the penalty of branch delays can be as successful as the hardware-based branch-target buffer approach, despite the code-expansion inherent in the software methods. The situation is similar for load delays; while hardware-based dynamic methods hide more delay cycles than do static approaches, they may give up the advantage by extending the cycle time. Because these methods are quite successful at hiding small numbers of branch and load delays, and because processors with pipelined caches also have shorter CPU cycle times and larger caches, a significant performance advantage is gained by using two to three pipeline stages to fetch data from the cache.
Monster: A Tool for Analyzing the Interaction Between Operating Systems and Computer Architecture, by D. Nagle, R. Uhlig, T. Mudge, Technical Report CSE-TR-147-92, University of Michigan, Department of Electrical Engineering and Computer Science, May, 1992.
To enable computer designers to better evaluate the architectural needs of operating systems, we have developed Monster, a tool which combines hardware and software monitoring techniques to unobtrusively obtain system performance data. This report is split into two major parts. In Part I, we argue the need for OS performance evaluation tools, summarize previous hardware and software based monitoring techniques, discuss our design of Monster and finally present an analysis of compilation workloads which test and demonstrate Monster's capabilities. In Part II, we detail our plans for future studies in which Monster plays a central role.
Impact of MCM's on System Performance Optimization, by A. Kayssi, K. Sakallah, R. B. Brown, R. Lomax, T. Mudge, T. Huff, 1992 IEEE International Symposium on Circuits and Systems, San Diego, CA, May, 1992, vol. 2, pp. 919-922.
We present a performance model for the University of Michigan MCM-based GaAs microcomputer. The model takes into account architectural as well as MCM packaging considerations such as the chip-to-substrate bonding method, the dielectric constant of the insulator, and the resistivity of the metal conductor. We illustrate the use of the model by finding the I-cache size which maximizes the MIPS rating for given MCM technology parameters.
Analysis and Design of Latch-Controlled Synchronous Digital Circuits, by K. Sakallah, T. Mudge, O. Olukotun, IEEE Transactions on Computer-Aided Design, vol. 11, no. 3, March, 1992, pp. 322-333.
We present a succinct formulation of the timing constraints for latch-controlled synchronous digital circuits. We show that the constraints are mildly nonlinear and prove the equivalence of the nonlinear optimal cycle time calculation problem to an associated and simpler linear programming (LP) problem. We present an LP-based algorithm which is guaranteed to obtain the optimal cycle time for arbitrary circuits controlled by a general class of multiphase overlapped clocks. We illustrate the formulation and an initial implementation of the algorithm on some example circuits.
Synthesis and Verification of a GaAs Microprocessor from a Verilog Hardware Description, by R. B. Brown, A. Chandna, T. Hoy, T. Huff, D. Johnson, R. Lomax, T. Mudge, D. Nagle, O. Olukotun, K. Sakallah, R. Uhlig, M. Upton, Proceedings Open Verilog International User Group Meeting, March 1992, pp. 85-93.
The University of Michigan Gallium Arsenide MIPS Project is using Verilog in the design of a 250 MHz MIPS architecture microprocessor. The design system is based on a single Verilog model which is used for simulation, synthesis, and hardware verification. The model is composed of a mixture of Register Transfer Level (RTL) and behavioral descriptions. Datapaths are represented by RTL structural components, while the control logic has behavioral descriptions.
To simplify verification and test development, a number of operating system functions have been implemented using the Verilog PLI (Programming Language Interface). These functions allow the model to load and execute programs compiled for the DECstation 5000. To ensure the model's functional correctness, a verification tool compares simulation results against the execution of a physical MIPS processor. Any inconsistencies are flagged as errors. Once the model is deemed functionally correct, it is synthesized into a logic level implementation. Datapath logic, described at the register transfer level, is directly mapped into a netlist for automatic placement and routing. The control logic is translated to the Finesse logic synthesis language. The Finesse compiler then synthesizes each control block into a netlist which is passed to the physical design tools from Cascade Design Automation (CDA) for final layout.
The combination of front-end verification and back-end synthesis results in a very short design time. Our first chip, Aurora I, was completed by a team of 5 graduate students in about 5 months, including tool development and library cell layout.
In the near future, microprocessor systems with very high clock rates will use multichip module (MCM) packaging technology to reduce chip-crossing delays. In this paper, we present the results of a study for the design of a 250 MHz Gallium Arsenide (GaAs) microprocessor that employs MCM technology to improve performance. The design study for the resulting two-level split cache starts with a baseline cache architecture and then examines the following aspects: 1) primary cache size and degree of associativity; 2) data-cache write policy; 3) secondary cache size and organization; 4) primary cache fetch size; 5) concurrency between instruction and data accesses. A trace-driven simulator is used to analyze each design's performance. The results show that memory access time and page-size constraints effectively limit the size of the primary data and instruction caches to 4KW (16KB). For such cache sizes, a write-through policy is better than a write-back policy. Three cache mechanisms that contribute to improved performance are introduced. The first is a variant of the write-through policy called write-only. This write policy provides most of the performance benefits of sub-block placement without extra valid bits. The second is the use of a split secondary cache. Finally, the third mechanism allows loads to pass stores without associative matching.
Module Generation for a GaAs Microprocessor, by M. Upton, T. Hoy, T. Huff, R. B. Brown, R. Oettel, M. Rossman, Technical Digest of The Third Physical Design Workshop, May 20-22, 1991, pp. 88-94.
This paper describes the design and verification of a set of Gallium Arsenide module generators used in constructing a high performance microprocessor. This design uses an advanced 1.2 micron 3 layer metal GaAs technology. This technology is well suited to layout by module generators. The GaAs technology imposes many difficulties on the design which must be handled by the CAD system.
Multilevel Optimization in the Design of a High-Performance GaAs Microcomputer, by O. Olukotun, R. B. Brown, R. Lomax, T. Mudge, K. Sakallah, IEEE Journal of Solid-State Circuits, vol. 26, no. 5, May, 1991, pp. 763-767.
The design of microelectronic systems has traditionally been carried out at several levels of abstraction. Partitioning the design process into levels makes it more manageable, but usually results in a suboptimal design. When high performance is the goal, optimization should be done across multiple abstraction levels. This paper illustrates multilevel optimization in the design of an instruction cache for a high-performance GaAs microprocessor. Performance of the system is maximized by concurrently considering the interrelationships of: 1) the time of flight of signals across the multichip module on which the processor and cache chips are mounted; 2) the clocking scheme that synchronizes these signals; and 3) the size of the cache. These three design issues are normally considered independently because they arise in different abstraction levels. Design automation tools developed to facilitate this multilevel optimization are described. This process, applied to various subsystems, has been used to gain substantial performance improvement in the GaAs microcomputer.
Impact of MCM's on System Performance, by A. Kayssi, K. Sakallah, R. B. Brown, R. Lomax, T. Mudge, T. Huff, 1991 Multichip Module Workshop: Extended Abstract Volume, Santa Cruz, CA, March 28-29, 1991, pp. 58-65.
This paper describes the effect of using different MCM technologies on the optimized performance of the GaAs Microcomputer. This high-performance microcomputer, which implements the MIPS RISC architecture, is designed using GaAs direct-coupled FET logic (DCFL) technology from Vitesse Semiconductor Corporation. The system components mounted on the MCM are the CPU and floating-point accelerator, the primary instruction- and data-caches, and a memory management unit (MMU). A multilevel optimization procedure is used with actual MCM layouts. The effects of the dielectric constant of the insulator and of the die attachment method are described.
The Design of a Microsupercomputer, by T. Mudge, R. B. Brown, W. Birmingham, J. Dykstra, A. Kayssi, R. Lomax, O. Olukotun, K. Sakallah, R. Milano, Computer Magazine, vol. 24, no. 1, January, 1991, pp. 57-64.
The performance goals of the prototype 250-MHz microsupercomputer require an integrated design approach in which technology, architecture, and packaging are considered simultaneously. GaAs DCFL technology, with its high speed, high level of integration, and high yield, is an important element in achieving the desired performance. Multichip-module packaging must be used to achieve the needed cache performance, and careful partitioning of the processor components is required to minimize the number of chip crossings in the critical path. The use of CAD tools is critical. It is important to use an automatic layout system to leverage the technology and take immediate advantage of continuously increasing integration levels without extensive and costly redesign. Simulation of cache performance is necessary to achieve the best compromise between size and speed and, in general, the use of simulation is crucial in making the majority of decisions along the design path. Coordinating these different aspects makes it possible to achieve a global optimization of the design and to build a system meeting the specifications described in the article.
The Design of a GaAs Micro-Supercomputer, by T. Mudge, R. B. Brown, W. Birmingham, J. Dykstra, A. Kayssi, R. Lomax, O. Olukotun, K. Sakallah, Proceedings of the Twenty-Fourth Annual Hawaii International Conference on System Sciences, vol. 1, 1991, pp. 421-432.
This paper is an overview of the architecture, technology and CAD tools used in the design of an experimental 250 MHz "micro-supercomputer" which is being designed for a sustained performance of 170 MIPS. The system will include a gallium arsenide processor which executes the MIPS instruction set and a two-level cache memory system, packaged on a multi-chip module. The risk in undertaking this project is minimized by using existing but advanced GaAs technology, by building needed CAD tools on top of commercial tools, and by using a standard instruction-set architecture.
We introduce two CAD tools, checkTc and minTc, for timing verification and optimal clocking. Both tools are based on a new timing model of synchronous digital circuits which is: 1) general enough to handle arbitrary multiphase clocking; 2) complete, in the sense that it captures signal propagation along short as well as long paths in the logic; 3) extensible to make it relatively easy to incorporate "complex" latching structures; and 4) notationally simple to make it amenable to analytic treatment in some important special cases. We are currently using these tools to help in the design of a 4 ns gallium arsenide micro-supercomputer.