The Design and Optimization of
CGaAs Cache Memory

Ph.D. Thesis Proposal
Submitted to the
Department of Electrical Engineering and Computer Science
The University of Michigan
August 14, 1997

by
Spencer M. Gold

Committee:
Dr. Richard B. Brown, Chairman
Dr. Ronald J. Lomax
Dr. Trevor N. Mudge
Dr. Karem A. Sakallah
Dr. Kensall D. Wise

ABSTRACT

Complementary gallium arsenide (CGaAs) is an emerging integrated circuit technology that exhibits many positive characteristics, including high electron mobility at low electric fields, inherent radiation hardness, and the ability to realize complementary logic circuits. These qualities enable the fabrication of components having the performance of E/D MESFET GaAs with greatly reduced power dissipation and the ability to implement large capacity, highly efficient SRAM.

CGaAs is potentially well suited for space and military applications that require radiation-hard, high performance systems. If CGaAs is to fulfill this role, improved design techniques for high performance CGaAs memory subsystems, including SRAM-based cache memory, must be established. I propose to develop the architectures, circuit techniques and tools required for supporting the memory bandwidth requirements of a high performance CGaAs system. The goals of this research are to contribute to the current CGaAs knowledge base, to increase the maturity of the CGaAs fabrication process, and to demonstrate the feasibility of high performance CGaAs-based cache memory subsystems.

Silicon-based SRAM has steadily improved from a capacity-access time and a capacity-power-access time perspective. On the other hand, GaAs-based SRAMs have been severely limited by low integration levels and high power logic families. I have designed a 40 k-bit CGaAs SRAM to demonstrate the feasibility of large capacity, low access time, low power GaAs SRAM. This chip utilizes a six-transistor memory cell with an area of 183 um2, 22.5% smaller than any previously reported GaAs SRAM cell. I designed the chip using a full-custom methodology in order to achieve a 400 MHz maximum operating frequency.

SRAM compilers have been developed to rapidly explore the power-delay-area memory design space. Since their introduction in 1986, SRAM compilers have evolved into powerful tools that include features such as process independence, iterative optimizing front ends, and BIST functionality. I propose to develop an advanced CGaAs SRAM compiler that employs commercial CAD tools. Doing so will reduce development time and encapsulate the most powerful analysis tools available. This tool will also be used to evaluate the impact of various design rule enhancements on CGaAs SRAM density and speed.

The optimized CGaAs SRAMs generated by this compiler will find their eventual implementation as the cache memories of the PUMA CGaAs PowerPC microprocessor. This MCM-based processor will utilize area interconnect, gold bumping, and flip-chip mounting technologies to provide high bandwidth inter-chip communication. Its memory subsystem will be comprised of two levels of split instruction and data cache memory and a large unified DRAM. The required CGaAs cache memory components include the primary instruction and data caches and high bandwidth, non-blocking secondary instruction and data cache controllers.


Table of Contents

1. Introduction
1.1 Cache Memory Design Considerations
1.2 Gallium Arsenide
1.3 E/D MESFET GaAs
1.4 Complementary GaAs
1.5 CGaAs Integrated Circuit Design
1.6 CGaAs SRAM
1.7 Proposal Outline

2. Trends in SRAM Design
2.1 Silicon SRAM
2.2 GaAs SRAM

3. Cache Design Issues
3.1 Circuit Issues
3.2 Organizational Issues
3.3 System Level Issues
3.4 Low Integration Cache Design Issues

4. Preliminary CGaAs SRAM Research
4.1 40 k-bit SRAM Chip Overview
4.2 Design Methodology
4.3 Design for Testability
4.4 Final Chip Statistics 

5. SRAM Compilers
5.1 Transistor Sizing Algorithms
5.2 Past SRAM Compilers
5.3 Current Commercial SRAM Compilers
5.4 Proposed CGaAs SRAM Compiler

6. CGaAs Process Refinement
6.1 Cost Effective IC Process Improvement Programs
6.2 CGaAs SRAM Improvement

7. PUMA Cache Memory Hierarchy
7.1 MCM-based Microprocessors
7.2 Area Interconnect
7.3 Gold Bumping
7.4 PUMA Memory Subsystem 

8. Research Contributions
8.1 Proposed Research

References

The Design and Optimization of CGaAs Cache Memory

1. Introduction

The technology base of our society has an insatiable need for faster and more powerful computing. In order to meet this demand, microprocessors are expected to operate at ever increasing frequencies. However, a faster operating frequency does not guarantee improved performance, especially if the memory subsystem has not been designed to properly support high frequency operation.

The key to designing a high performance memory subsystem that will support high frequency operation is to design a memory structure that has both low access latency and high capacity. Although this may seem straightforward, it is not, because these design goals are opposing. A trade-off must be made between capacity and latency when designing memory structures.

Hierarchical memory subsystems can be used to create the illusion of a memory structure that is both large and fast. This is done by providing the processor with immediate, or primary access to a small, fast memory known as a cache. This primary cache is connected to higher levels of memory that progressively increase in both capacity and latency until the largest, slowest device is included. When a cache access fails to find its data, or misses, that request is forwarded to the next higher structure in the memory hierarchy. This process continues until the memory access eventually finds its data in the upper portions of the hierarchy. Once this data is retrieved, it and the physically adjacent data are copied into the memory structures below. This allows a processor to exploit the temporal and spatial locality within a process' working set. Depending on the miss percentage of each memory level, a hierarchical subsystem can be very efficient while operating at high frequencies. For example, the average access time of a three-level hierarchical memory subsystem is given below as equation 1.1. As the miss percentages of the lower two memory structures approach zero, the average access latency approaches that of the primary cache.

Average access time = L1 access time + L1 miss% x (L2 access time + L2 miss% x (L3 access time)) (1.1)
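
Equation 1.1 can be evaluated directly. The following Python sketch is illustrative only: the function name and the example latencies and miss rates are assumptions chosen for demonstration, not measurements from this work. Miss rates are expressed as fractions between 0 and 1.

```python
def average_access_time(l1_time, l1_miss, l2_time, l2_miss, l3_time):
    """Evaluate equation 1.1 for a three-level memory hierarchy.

    Times share one unit (for example ns); miss rates are fractions (0..1).
    """
    return l1_time + l1_miss * (l2_time + l2_miss * l3_time)


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    t = average_access_time(l1_time=2.5, l1_miss=0.05,
                            l2_time=10.0, l2_miss=0.20,
                            l3_time=60.0)
    print(f"average access time: {t:.2f} ns")

    # With no misses in the lower two levels, the average equals the L1 latency.
    print(average_access_time(2.5, 0.0, 10.0, 0.0, 60.0))
```

Lowering either miss rate pulls the result toward the primary cache latency, which is the behavior described in the text above.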

1.1 Cache Memory Design Considerations

The ability of a microprocessor to realize its performance potential is largely dependent upon the effectiveness of its cache memories. If the caches are too small and simple, their miss rates could be high and the average access time would be adversely affected. On the other hand, if the caches are too large or complex, their operating frequency would suffer and precious die area would be wasted. Cache designers should have the objective of implementing the most effective balance between capacity and complexity.

High performance cache design requires contributions from several fields in VLSI design. From a circuits perspective, one must consider memory implementation techniques, power dissipation limitations, packaging requirements and operational stability factors [1,2]. An understanding of the chip processing technology must also be acquired in order to determine acceptable integration levels and operational parameters. From an architectural viewpoint one must consider the bandwidth requirements of the main processor when determining a cache's operating frequency and organization. Cache capacity, set-associativity, multi-banking and pipelining are some of the microarchitectural issues that affect overall system performance and must also be taken into account. By considering the issues from each of these fields an efficient, high performance cache memory hierarchy can be obtained.

1.2 Gallium Arsenide

The intrinsic characteristics of gallium arsenide (GaAs) make it a suitable alternative substrate for high frequency integrated circuits. The primary advantage of GaAs over silicon is that, at normal doping levels, it provides a low-field electron mobility that is five times greater than that of silicon [3] (4,000 cm2/V-s compared to 800 cm2/V-s). However, the maximum electron drift velocity in GaAs is only twice that of silicon (2 x 10^7 cm/s compared to 1 x 10^7 cm/s). In saturation mode GaAs transistors can provide up to a factor of two improvement in electron velocity. These characteristics contribute in part to GaAs transistors' higher transconductance and reduced delay at lower voltages compared to silicon transistors.

GaAs substrates can be viewed as semi-insulating since their resistivity is 100,000 times greater than intrinsic silicon substrates (1 x 10^9 ohm-cm in GaAs compared to 1 x 10^4 ohm-cm in Si). Therefore, none of the precautions used in CMOS IC design to guard against latch-up need to be used in GaAs IC design. Due to gallium arsenide's larger bandgap (1.43 eV in GaAs compared to 1.11 eV in Si), its semi-insulating substrate and the absence of native oxides, GaAs-based circuits are more resistant to single event upset and total dose radiation exposure than both CMOS and SOI integrated circuits.

Gallium arsenide does have its drawbacks. GaAs IC fabrication processes are less mature than current CMOS fabrication processes, leading to greatly reduced integration levels. While some CMOS processes can successfully yield chips with over 15 million transistors, such as the DEC Alpha 21264 [4], GaAs processes have only recently produced a functional chip with over one million transistors.

1.3 E/D MESFET GaAs

One of the most established and widely available GaAs fabrication processes is E/D MESFET (enhancement/depletion metal semiconductor field effect transistor). This process provides only enhancement and depletion type transistors and requires the use of direct-coupled FET logic (DCFL). This logic style provides extremely fast and high power circuits. Although E/D MESFET does not offer a great deal of flexibility to the circuit designer, it has been popular in the supercomputing industry. The Tera computer implements some of its multi-chip core with Vitesse Semiconductor's GaAs E/D MESFET process. These chips have integration levels up to 1.2 million devices. Other systems that utilize a GaAs-based processing core are the Fujitsu VPP500 and the Convex 2X0 supercomputers.

1.4 Complementary GaAs

A new arrival in the GaAs IC scene is Motorola's complementary GaAs (CGaAs) process [5-7]. This technology differs from E/D MESFET by providing p-channel instead of depletion transistors. A cross-section of n- and p-channel CGaAs transistors is given in Fig. 1. The CGaAs transistor channel is a buried layer of intrinsic InGaAs and the gate is formed by stacking layers of AlGaAs, GaAs, and TiWN to form a Schottky diode with the InGaAs channel. This diode has a turn-on voltage of 1.8 volts, yielding devices that operate reliably within the 0.9 to 1.5 V range.
 

Fig. 1. CGaAs n- and p-channel transistor cross section

The availability of p-channel devices makes possible the use of complementary, dynamic, and source-coupled FET logic (SCFL) circuits as well as DCFL circuits. CGaAs components can therefore be designed to have the performance of E/D MESFET components but with greatly reduced power dissipation.

There are numerous advantages to using CGaAs for the implementation of high-frequency processors. In spite of these advantages CGaAs still cannot provide integration levels similar to current CMOS processes. If a system is to be implemented in CGaAs it must currently be comprised of chips having less than one million transistors each.

1.5 CGaAs Integrated Circuit Design

CGaAs closely resembles CMOS from a circuit design perspective. However, there are a few crucial differences between these technologies that make CGaAs circuit design unique. The CGaAs transistor structure differs from that of a MOSFET in that CGaAs gates will conduct at voltages above 1.8 V since their heterostructure insulated gates (HIGs) are essentially diodes. Even at levels significantly below this turn-on voltage, CGaAs gates have significant current leakage. This leakage current is hard to estimate with SPICE since the model we use for CGaAs HIGFETs is the JFET model.

There are a number of unique CGaAs design rules. p- and n-diffusions can directly abut and share a common contact. The large difference in charge carrier mobilities between n- and p-channel devices requires larger p-channel devices in CGaAs complementary gates to provide equal rise and fall times. Contacts to diffusion regions are made through an ohmic layer--an additional mask not included in CMOS processes. CGaAs' current metallization is coarser than state-of-the-art CMOS metallization and it only allows two vias to be stacked atop one another.

1.6 CGaAs SRAM

There are two types of volatile random access memory (RAM): static and dynamic. Static random-access memory (SRAM) differs from its dynamic counterpart in that it does not require a periodic refresh in order to keep a stored binary value. Since an SRAM cell stores its value in a pair of cross-coupled inverters, it has greater drive strength, at the expense of larger size, than a dynamic cell which stores its value on a capacitor. This leads to the implementation of memories that have faster access times than dynamic RAMs (DRAM). SRAM can be easily embedded with a processor core because it does not require specialized processing steps in order to operate efficiently, as does one-transistor DRAM. The overhead of restoring data after destructive reads is also avoided with SRAM since its read process is non-destructive. For these reasons SRAM has become the implementation technique of choice for cache memory.

A typical 6-transistor (6T) E/D MESFET SRAM cell utilizes cross-coupled inverters with depletion loads to store a binary value. Regardless of the data value being stored, there is always a conduction path from the power supply to ground in a cell with depletion loads. Consequently, this type of cell dissipates significant static power. A 6T CGaAs SRAM cell, on the other hand, utilizes cross coupled complementary inverters; its static power dissipation is due only to device leakage. It is the use of complementary logic that provides CGaAs' power savings.

In high frequency logic circuits, dynamic power dominates static power due to the high average switching rate. This characteristic is especially evident in IC technologies that utilize coarse metallization. The fact that E/D MESFET circuits dissipate significant static power is of lesser importance when viewed from a logic design perspective. In a RAM, however, the average switching rate is much lower. Thus, RAM power dissipation is dominated by its static power component. A major disadvantage of using E/D MESFET for system design is its inability to provide efficient embedded RAM. CGaAs has an advantage over E/D MESFET in its ability to provide large, efficient embedded RAM due to its low static power dissipation.

1.7 Proposal Outline

This proposal is organized to both identify previous work and to propose future research in the area of CGaAs cache memory design. Of course, much of this research will be directly applicable to other IC technologies such as CMOS. Trends in past silicon and GaAs SRAM designs are identified in Section 2 of this proposal. Some of the major issues of high performance cache design are given in Section 3. A full-custom, 40 k-bit CGaAs SRAM organized as a data cache is described in Section 4. Automated SRAM compilation and analysis techniques and a proposed CGaAs SRAM compiler that utilizes commercial CAD tools are described in Section 5. The proposed use of this compiler to help optimize the CGaAs process is discussed in Section 6. The CGaAs SRAMs described in this proposal will find their eventual implementation in the PUMA PowerPC microprocessor. Section 7 describes the PUMA's multi-chip cache memory hierarchy and the use of CGaAs to realize high bandwidth operation. Finally, Section 8 describes proposed research that will be pursued in the area of CGaAs cache memory design.


2. Trends in SRAM Design

SRAM development has become a mature discipline of microelectronics design. Improvements in both processing technology and circuit design techniques have enabled SRAM arrays to become larger, faster and more efficient. In 1980, a 64 k-bit, 80 ns design fabricated with a 2.0um CMOS process [8] was considered state-of-the-art. Today's BiCMOS commercial synchronous burst SRAMs, such as Motorola's MCM69P737, have up to 4.5 Mbits with pipelined operating frequencies up to 166 MHz [9].

2.1 Silicon SRAM

Early silicon SRAM designs were fabricated in diverse technologies and made to suit a wide variety of design goals. The SRAM market has since become better defined and custom processing technologies have emerged. Most high performance monolithic SRAMs available today are fabricated with a specialized BiCMOS process having up to six polysilicon [10] and four metallization layers.

Much of the progress seen in silicon SRAM design can be credited to rapidly improving process technology. However, innovative circuit design techniques have also been developed. Plots showing the features of various SRAMs presented at ISSCC, VLSIC, and CICC since 1980 are given in Fig. 2. The first graph plots SRAM capacity per nanosecond access time (k-bits/ns) on a logarithmic scale. This graph clearly indicates a trend of improving access time efficiency. The highest rated silicon SRAM is a 4.5 Mbit, 1.8 ns design fabricated with a 0.3um, 2-polysilicon BiCMOS process [11]. The second graph plots SRAM power efficiency in k-bits/ns-W. This graph indicates a similar trend in SRAM power efficiency. The most efficient high capacity SRAM is a 4 Mbit, 180 mW design fabricated with a 0.3um, 4-polysilicon BiCMOS process [12].

 
 
Fig. 2. Silicon SRAM designs presented at ISSCC, VLSIC and CICC since 1980

Two types of memory cells are used most commonly in silicon SRAM designs. High capacity designs that give priority to cell density usually employ the four transistor (4T) passive load cell shown in Fig. 3. The additional polysilicon layers in specialized SRAM processes make it possible to implement the passive load resistors in a compact fashion. Fig. 3 also illustrates the rate at which silicon 4T cell sizes have decreased since 1980. The smallest 4T cell presented to date measures 2.3 um2 and was fabricated using a 0.25um 5-polysilicon CMOS process [13].

Embedded memories are another major SRAM application. Such designs must be fabricated with more general CMOS logic processes without additional polysilicon layers. Under these circumstances, the 6T active load SRAM cell, shown in Fig. 4, becomes more appropriate. This cell requires more area, but dissipates less static power than its 4T counterpart. Fig. 4 illustrates the rate at which silicon 6T cell sizes have decreased over the years. The smallest 6T SRAM cell presented to date measures 7.82 um2 and belongs to the previously mentioned 4.5 Mbit SRAM that rates highest in access time efficiency.

Fig. 3. 4T silicon SRAM cell area

Fig. 4. 6T silicon SRAM cell area

A recent arrival on the silicon embedded memory scene is the three-transistor (3T) DRAM cell shown in Fig. 5. This cell exploits the reasonably high capacitance associated with MOSFET gates to store a data value on the gate of transistor M2. The gate of transistor M1 can be lengthened to reduce the static leakage from this storage node. Transistor M3, when asserted by the read word line, allows the data value to be read from the single read bit line. 3T DRAM cells can be practical for embedded memory applications because they do not require specialized processing steps to fabricate the data storage capacitors.

Fig. 5. Three-transistor DRAM cell

The 3T cell is more compact than its 6T counterpart allowing for much greater cell densities. A recently reported non-commercial RISC/DSP processor [14] fabricated with a 0.5um 2-metal CMOS process employs a 3T DRAM cell in its embedded caches. This cell measures 32 um2 which is about 40% smaller than an average 6T cell fabricated with a similar process.

Since the 3T cell is dynamic, additional circuitry is required to periodically refresh the data. It also uses a single bit-line for reading data, which requires more complicated sensing circuitry than a double bit-line scheme. Despite the added density it provides, the 3T DRAM cell is an unlikely candidate for large scale commercial use because of reliability problems associated with sensing data from single bit-line architectures.

2.2 GaAs SRAM

Since their introduction in 1980 [15] GaAs-based SRAMs have been designed in many processing technologies, including E/D MESFET, C-JFET [16,17] and CGaAs. GaAs SRAMs are not yet competitive with silicon designs from density, capacity or efficiency standpoints. Much of this is due to the immaturity of current GaAs IC processing technologies, which have coarser design rule geometries and inferior integration levels.

Although great progress has been made in the development of GaAs SRAM, it appears as though a wide variety of design goals are being pursued. Fig. 6 plots the access time and power efficiency characteristics of GaAs SRAMs presented at the GaAs IC Symposium and ISSCC since 1982. As illustrated by the graphs, there is not an obvious trend in design improvement as measured by these metrics.

Fig. 6. GaAs SRAM designs presented at the GaAs IC Symposium and ISSCC since 1980

The largest GaAs SRAM fabricated to date is a 64 k-bit monolithic chip designed for the Fujitsu VPP500 vector parallel processor [18] in 1991. This E/D MESFET DCFL design is quite fast for its capacity but comes at the power dissipation cost of 5.9 watts. The most power efficient GaAs SRAM was designed in 1994 using Motorola's 0.7um CGaAs process. This 4 k-bit SRAM dissipates only 16.2 mW while providing a 5.3 ns access time [19].

GaAs processing technologies have not yet provided a suitable material for highly resistive passive loads. Therefore, 4T SRAM cells are not practical in GaAs. Since GaAs HIGFET gates are essentially high turn-on voltage diodes they cannot be used to store a binary data value. This makes the use of 3T DRAM cells impractical. Designers have had little choice but to use 6T SRAM cells exclusively in past designs.

The smallest GaAs 6T cell reported to date measures 236 um2 and was used in the previously mentioned Fujitsu 64 kbit SRAM. This cell is over 30 times larger than the smallest reported silicon 6T SRAM cell. Despite this large difference, continual improvements in GaAs 6T SRAM cell area have been made, as illustrated in Fig. 7.

Fig. 7. GaAs 6T SRAM cell area


3. Cache Design Issues

There are several issues of high performance cache design that must be addressed if an optimal memory subsystem is to be achieved. These issues range from circuit implementation techniques to system level planning. Much has been published concerning these topics which provides the designer with a rich supply of ideas.

3.1 Circuit Issues

The circuit implementation strategies of embedded logic designs also apply to cache hierarchies. When deciding upon a circuit style or design technique one must first consider the ramifications of the technology being used. From an understanding of the IC technology comes an estimate of the available integration levels and operating voltages. A cache circuit designer must choose appropriate buffer styles for the technology being used (push-pull, full complementary, etc.) and develop a technique for optimally sizing them. The decision to use static or dynamic address decoding schemes has a large impact on overall circuit speed and complexity. Other circuit issues include precharging schemes (full VDD or VDD - Vtn), bit-line sensing schemes [20-23] (voltage-mode, current-mode or latching), multiplexer styles (restoring or pass-gate), bit-line architectures (single or double), and memory cell styles (static or dynamic, 4T or 6T, etc.). The choice of an appropriate operating mode [24] (asynchronous, synchronous, or wave-pipelined) is a crucial decision that must be made carefully. Cache designers must also consider whether and how to use row and/or column redundancy [25] to increase yield.

3.2 Organizational Issues

A cache's microarchitectural organization will have a dramatic impact on system performance. One of the most important organizational issues that every cache designer must address is index associativity [26]. Some associativity techniques that have been employed in the past include direct-mapped [27], set-associative [28,29], fully-associative, skewed-associative [30-32], column-associative [33], hybrid associative [34,35] and predictive sequential-associative [36]. Other organizational issues include determining the best line size and how many columns should be "folded" or multiplexed together. Cache designers must also make the best trade-off between capacity and access time as well as, in some cases, determining how to implement multi-ported and/or multi-banked designs.

3.3 System Level Issues

System level issues must also be considered in cache hierarchy design [37]. One must determine how many levels of cache should be employed in a hierarchy and how large and fast each should be [38-40]. The use of virtual or physical indexing and tagging [41] affects the virtual memory subsystem dramatically as well as the decision to use split or unified caching schemes. Cache designers must also determine the best write policies [42] and prefetching schemes [43-47] for their memory hierarchy.

3.4 Low Integration Cache Design Issues

When designing a cache memory hierarchy in an integration level-limited IC technology, such as CGaAs, a number of other important issues emerge. Low integration levels severely limit the capacity of embedded memories and often require the extensive use of off-chip cache. There are two main microarchitectural consequences of moving normally embedded cache memories off-chip. The first of these relates to the additional latency incurred when accessing these memories. Tolerating this latency might require additional pipeline stages, which can increase the branch misprediction penalty. Another technique for tolerating this latency involves increasing the amount of data retrieved each cycle from off-chip caches. This leads to the second major consequence--the processor's I/O bandwidth requirements increase.

The task of supporting the memory bandwidth requirements of a high frequency processor core is more difficult when using off-chip caches. Data busses between the processor and cache memory need to be both wide and fast. Considering that most processors are already I/O limited, multi-chip cache memory hierarchy designers must develop alternate implementation strategies to provide the processor with additional high bandwidth I/O ports. One method that can be used to increase a processor's I/O bandwidth capabilities is flip-chip mounting and area interconnect, which is described in Section 7 of this proposal.


4. Preliminary CGaAs SRAM Research

Preliminary CGaAs SRAM research must be conducted as a precursor to designing a full scale CGaAs cache memory hierarchy. The goal of this research is to determine the feasibility of high capacity, high speed CGaAs SRAM that can be organized as cache memory. Once a successful CGaAs SRAM has been designed, fabricated and tested, an effective CGaAs cache memory hierarchy can be designed.

4.1 40 k-bit SRAM Chip Overview

As a part of this research, a 40 k-bit CGaAs SRAM has been designed. This chip, illustrated in Fig. 8, is organized as 4 KBytes of data cache with tags. The memory was designed to have the maximum capacity that the CGaAs process would allow at the time it was designed while still providing a sub-nanosecond access time (based on preliminary SPICE models). This synchronous memory employs a double bit-line architecture, push-pull buffers, static decoders, VDD - Vtn precharging, latch sense amplifiers and pass-gate multiplexers.

Fig. 8. 40 k-bit CGaAs SRAM

The SRAM is comprised of 256 direct-mapped lines. Each line includes four 32-bit words, a 32-bit tag, two attribute bits, and a valid bit. Thus, the total storage capacity is 40.75 k-bits. There is a primary word read/write/invalidate port and a secondary line write port. Data written into the SRAM through the secondary port can be read through the primary port in the same cycle.
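
The capacity figures above follow from the line organization. The short Python sketch below reproduces that arithmetic; the constant and function names are illustrative, and it assumes that one k-bit equals 1024 bits, which is the convention that yields the quoted 40.75 k-bits.

```python
LINES = 256            # direct-mapped lines
WORDS_PER_LINE = 4     # 32-bit data words per line
WORD_BITS = 32
TAG_BITS = 32
ATTRIBUTE_BITS = 2
VALID_BITS = 1


def bits_per_line():
    """Data, tag, attribute and valid bits stored in one cache line."""
    return WORDS_PER_LINE * WORD_BITS + TAG_BITS + ATTRIBUTE_BITS + VALID_BITS


def total_kbits():
    """Total storage in k-bits, assuming 1 k-bit = 1024 bits."""
    return LINES * bits_per_line() / 1024


def data_kbytes():
    """Data-only capacity in KBytes (tags and status bits excluded)."""
    return LINES * WORDS_PER_LINE * WORD_BITS / 8 / 1024


if __name__ == "__main__":
    print(bits_per_line(), total_kbits(), data_kbytes())
```

Each line holds 163 bits, so the tag and status bits account for the difference between the 4 KBytes of data and the 40.75 k-bit total.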

A 6T SRAM cell that measures 19.9 um x 9.2 um (183 um2) is used in the design. Although this does not compare well with cells fabricated with current CMOS processes, it does represent a 22.5% decrease from the smallest reported GaAs SRAM cell to date.
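
The cell area comparison can be checked with a few lines of arithmetic. In the Python sketch below, the 236 um2 reference is the smallest previously reported GaAs 6T cell cited in Section 2.2; the function names are illustrative. The reduction is computed from the rounded 183 um2 figure, since that appears to be the value behind the quoted 22.5%.

```python
def cell_area_um2(width_um, height_um):
    """Rectangular cell area in square micrometers."""
    return width_um * height_um


def percent_reduction(new_area, old_area):
    """Percentage by which new_area is smaller than old_area."""
    return 100.0 * (old_area - new_area) / old_area


if __name__ == "__main__":
    area = cell_area_um2(19.9, 9.2)
    reduction = percent_reduction(183.0, 236.0)
    print(f"cell area: {area:.2f} um2, reduction: {reduction:.1f}%")
```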

The SRAM peripheral circuits were optimized for both speed and power. A mixture of complementary and direct-coupled FET logic (DCFL) is used to achieve the performance goal with minimal power dissipation. Low-power, current-controlled latch sense amplifiers [48] provide fast data resolution times. The extremely low CGaAs threshold voltages are exploited by using push/pull address and word line buffers that provide good drive strength with fast propagation times. Fig. 9 illustrates the SRAM's read/write circuitry.

Fig. 9. CGaAs SRAM read/write circuitry

4.2 Design Methodology

Obtaining a sub-nanosecond access time was the highest design priority. In order to meet this target, a mixture of logic styles was used to provide the necessary access time at the lowest possible power dissipation. DCFL-based circuits are advantageous for speed-critical paths since they have low gate delays. However, this speed increase comes at the expense of higher power dissipation and decreased noise immunity due to their high VOL. On the other hand, complementary circuits benefit from much lower power dissipation and excellent noise immunity, but suffer from longer gate delays. Cells were initially designed using DCFL circuits to provide the lowest propagation delays. Once a sub-nanosecond access time was guaranteed, non-speed-critical DCFL components were replaced by complementary circuits in order to reduce power dissipation and increase noise immunity.

The metallization used in the CGaAs process is not optimal for SRAM design. Reducing the parasitic capacitances associated with inter- and intra-cell interconnect proved to be a matter of crucial importance. In order to tightly manage these parasitics, the SRAM was designed with a full-custom layout methodology using Mentor Graphics' IC Station as the primary layout tool. This approach provided significant reductions in parasitic capacitances and helped lower the access time. Circuit simulations were conducted with Meta Software's HSPICE using Motorola device models and parasitics extracted from the circuit layouts.

I developed transistor sizing tools to design buffers with maximum drive strength and a minimum propagation time. The first of these tools creates variant SPICE models of a buffer using the constant ratio tapering method [49]. This tool simulates a design using HSPICE and reports the results, allowing the optimal buffer design to be quickly identified. The second tool resembles the first except that it sizes buffers based on Cherkauer and Friedman's constant capacitance-to-current ratio tapering method [50].
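
The idea behind constant ratio tapering can be sketched in a few lines of Python. This is not the sizing tool described above. It is a minimal illustration that assumes each buffer stage is a fixed factor larger than the one before it, and it substitutes a first-order delay estimate (stage count times taper factor, ignoring self-loading) for HSPICE simulation. All function names and the example capacitance ratio are hypothetical.

```python
def tapered_buffer(c_in, c_load, stages):
    """Return (taper, sizes) for a chain with a fixed stage-to-stage ratio.

    Sizes are relative to the first stage, whose input capacitance is c_in.
    """
    taper = (c_load / c_in) ** (1.0 / stages)
    sizes = [taper ** i for i in range(stages)]
    return taper, sizes


def relative_delay(c_in, c_load, stages):
    """First-order estimate: every stage drives `taper` times its own input load."""
    taper, _ = tapered_buffer(c_in, c_load, stages)
    return stages * taper


def best_stage_count(c_in, c_load, max_stages=12):
    """Sweep candidate stage counts and keep the one with the lowest estimate."""
    return min(range(1, max_stages + 1),
               key=lambda n: relative_delay(c_in, c_load, n))


if __name__ == "__main__":
    n = best_stage_count(c_in=1.0, c_load=1000.0)
    taper, sizes = tapered_buffer(1.0, 1000.0, n)
    print(n, round(taper, 2), [round(s, 1) for s in sizes])
```

A simulation-driven tool would replace `relative_delay` with measured propagation times for each candidate, which captures the parasitic effects that this simple model omits.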

4.3 Design for Testability

Fig. 10 is a block diagram of the SRAM core's 276 I/O signals. Directly testing this number of high speed signals is very difficult and expensive. In order to greatly reduce the number of signals brought off chip, rows of scannable flip-flop chains, organized according to Fig. 11, were added to the cache's inputs and outputs.

Fig. 10. Cache core I/O signals

Fig. 11. Cache chip testability structures

The chip has two operating modes: low-speed-sequential and high-speed-parallel. In low-speed-sequential mode the flip-flops are connected serially to form long, sequentially-scannable chains. In this mode up to four cache read/write/invalidate commands can be scanned into the flip-flop chains at a low frequency. Then the chip can be placed into high-speed-parallel mode which changes the flip-flop connections from sequential to parallel. The SRAM can then execute these four instructions at a high clock rate. Results are captured by similarly organized output flip-flop chains. Upon completion of the high speed test, the SRAM is placed back into low-speed-sequential mode and the results are scanned off chip at a low frequency. This enhancement reduces the number of high speed signals from 276 to 3 (fast CLOCK, fast CLOCK_BAR, and MODE). AND gates disable the L2WRITE and L1WRITE signals to prevent erroneously altering memory contents while commands are being scanned into the flip-flop chains.
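
The two-mode test flow can be modeled behaviorally. The Python sketch below is a toy model only: the class name, the command width, and the way rows are sliced out of the chain are assumptions made for illustration and do not reproduce the organization shown in Fig. 11. It shows commands being shifted in one bit per slow clock and then applied one row per fast clock.

```python
from collections import deque


class ScanTestModel:
    """Toy model of input scan chains holding `depth` commands of `command_bits` each."""

    def __init__(self, command_bits, depth=4):
        self.command_bits = command_bits
        self.depth = depth
        total = command_bits * depth
        # A full deque with maxlen drops the oldest bit as each new bit is appended.
        self.chain = deque([0] * total, maxlen=total)

    def scan_in(self, bits):
        """Low-speed-sequential mode: shift in one bit per slow clock."""
        for bit in bits:
            self.chain.append(bit)
        return len(bits)  # number of slow clocks consumed

    def run_at_speed(self, execute):
        """High-speed-parallel mode: apply one command row per fast clock."""
        flat = list(self.chain)
        rows = [flat[i * self.command_bits:(i + 1) * self.command_bits]
                for i in range(self.depth)]
        return [execute(row) for row in rows]


if __name__ == "__main__":
    model = ScanTestModel(command_bits=3, depth=2)
    model.scan_in([1, 0, 0, 1, 1, 1])
    # `sum` stands in for the cache operation that consumes each command.
    print(model.run_at_speed(sum))
```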

4.4 Final Chip Statistics

Table 1 gives the SRAM chip statistics. Since the chip has not yet been fabricated or tested, these statistics are based on preliminary simulations. The strength of this design is its ability to rate highly in both speed and efficiency--something previous GaAs SRAMs have not successfully accomplished. At 17.7 k-bits/ns this design ranks second in access time efficiency among reported GaAs SRAMs. With a rating of 15.4 k-bits per ns-W the design also ranks second in power efficiency.
Table 1: SRAM chip statistics
Parameter                               Value
Process Technology                      0.5 um CGaAs, 3 metal layers
Power supply                            1.5 V
Capacity                                40.75 k-bits
Chip Size                               5.6 mm x 5.6 mm (31.4 mm2)
Transistors                             298,497
Access time                             3.1 ns - 2.3 ns
  (slow - fast corners)
Maximum simulated operating frequency   300 - 400 MHz
  (slow - fast corners)
Power dissipation @ 400 MHz             1.15 W
  (Embedded memory core with peripheral circuits)
Power dissipation @ 400 MHz             2.52 W
  (Including test structures and clock buffers)
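
The two efficiency figures quoted in the text appear to follow from the Table 1 entries when the fast-corner access time and the memory core power dissipation are used. That pairing is inferred from the arithmetic, not stated explicitly, so the short Python check below should be read with that assumption in mind.

```python
# Values taken from Table 1; the choice of the fast-corner access time
# and the core-only power figure is an assumption.
capacity_kbits = 40.75
access_time_ns = 2.3
core_power_w = 1.15

# Access time efficiency: capacity delivered per unit of access time.
access_efficiency = capacity_kbits / access_time_ns

# Power efficiency: access time efficiency per watt dissipated.
power_efficiency = access_efficiency / core_power_w

print(round(access_efficiency, 1), "k-bits/ns")
print(round(power_efficiency, 1), "k-bits per ns-W")
```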


5. SRAM Compilers

The full-custom design approach has several advantages over semi-custom methodologies including the ability to more closely manage the area efficiency of a layout. It has a major drawback however--it is incredibly laborious and tedious. A full-custom design requires significantly more time to complete than if automated layout generators are used. For this reason it is not practical to use full-custom methodologies to explore the vast capacity-power-delay memory design space.

Although transistor sizing tools were used to optimize many of the 40 k-bit CGaAs SRAM's circuits, the design represents only a single point in the large power-delay-area design space. Until this space has been more fully explored it cannot be concluded with any certainty that this design has been truly optimized according to any single metric or combination of metrics. This task could be quickly accomplished with a CGaAs SRAM compiler.

5.1 Transistor Sizing Algorithms

The heart of an effective SRAM compiler is its transistor sizing engine. This routine is responsible for sizing a set of transistors for the SRAM's core and peripheral circuits to meet specified objectives. Transistor sizes have a direct impact on speed, power dissipation and layout area. An effective transistor sizing algorithm is capable of optimizing a design based on one or a combination of these factors. A tremendous amount of research has been devoted to the development of effective transistor sizing algorithms. Because of the enormity of this effort, only a brief summary of the capabilities and optimization techniques employed in past transistor sizing tools will be given here.

Most transistor sizing algorithms have been developed to optimize the power and/or area of generic logic structures subject to timing constraints. These logic structures are given in the form of a circuit netlist annotated with parasitic capacitances and resistances. There are four general approaches to solving the transistor sizing problem. The first method transforms the circuit structure into a distributed RC network and models the delay through it as a classical nonlinear optimization problem. Geometric programming techniques are then employed to exactly solve this problem mathematically. Arriving at this solution often involves defining the nonlinear optimization problem in terms of posynomial functions that can be transformed into a set of convex functions by means of an exponential transformation. A variety of exact or numerical methods can then be employed to find a local minimum, and hence, the global minimum of each convex subproblem. Numerous algorithms and sizing tools [51-57] have been developed that employ this technique. However, this method suffers from its inherent complexity and long computation time. Furthermore, it may not converge to a solution if the initial optimization problem is sufficiently complex. It should also be noted that these solutions are only exact for the given parasitics used to construct the original distributed RC network.
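To make the exponential transformation concrete, the Python sketch below sizes a three-stage buffer chain whose first-order delay, d(w1, w2) = w1 + w2/w1 + C/w2, is a posynomial in the stage sizes. Substituting w = exp(y) and taking the logarithm gives a convex function of y, so a local minimum is also the global one. The delay model, the load value, and the use of plain gradient descent in place of a dedicated geometric programming solver are all assumptions made for illustration.

```python
import math

C_LOAD = 64.0  # hypothetical load, in units of the input capacitance

# Each posynomial term is c * w1**a1 * w2**a2, stored as (c, (a1, a2)).
TERMS = [(1.0, (1, 0)), (1.0, (-1, 1)), (C_LOAD, (0, -1))]


def log_delay_and_grad(y):
    """Return log d(exp(y)) and its gradient; this function is convex in y."""
    exps = [math.log(c) + sum(a * yi for a, yi in zip(powers, y))
            for c, powers in TERMS]
    m = max(exps)  # subtract the maximum for numerical stability
    weights = [math.exp(e - m) for e in exps]
    total = sum(weights)
    value = m + math.log(total)
    grad = [sum(w * powers[i] for w, (_, powers) in zip(weights, TERMS)) / total
            for i in range(len(y))]
    return value, grad


def minimize(y0, step=0.5, iters=500):
    """Fixed-step gradient descent in the log domain; returns the sizes w."""
    y = list(y0)
    for _ in range(iters):
        _, grad = log_delay_and_grad(y)
        y = [yi - step * gi for yi, gi in zip(y, grad)]
    return [math.exp(yi) for yi in y]


if __name__ == "__main__":
    w1, w2 = minimize([0.0, 0.0])
    print(round(w1, 3), round(w2, 3))
```

For this toy problem the search settles on stage sizes near 4 and 16, the same geometric progression that constant ratio tapering would produce.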

The second technique employs heuristic methods to achieve near-optimal solutions to the nonlinear optimization problem [58-60]. This method identifies those transistors that are most sensitive to the objective and incrementally sizes them in order to achieve the desired effects without violating a constraint. Heuristic algorithms cannot guarantee optimal results but require far less computation time than rigorous mathematically exact methods.
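A minimal sketch of such a sensitivity-driven heuristic is given below in Python. Starting from minimum sizes, it repeatedly enlarges whichever device yields the largest delay reduction per unit of added area, and stops as soon as the timing constraint is met. The first-order delay model, the 10% size increment, and all names are assumptions; this is not the algorithm of any particular cited tool.

```python
C_LOAD = 64.0  # hypothetical load, in units of a minimum-size input capacitance


def chain_delay(sizes):
    """First-order delay of a buffer chain driven by a unit-size gate."""
    loads = list(sizes) + [C_LOAD]
    drivers = [1.0] + list(sizes)
    return sum(load / driver for load, driver in zip(loads, drivers))


def greedy_size(n_stages, target_delay, bump=1.1, max_iters=1000):
    """Incrementally size the most sensitive stage until timing is met."""
    sizes = [1.0] * n_stages
    for _ in range(max_iters):
        current = chain_delay(sizes)
        if current <= target_delay:
            break
        best_gain, best_index = 0.0, None
        for i in range(n_stages):
            trial = list(sizes)
            trial[i] *= bump
            # Sensitivity: delay saved per unit of added size (area proxy).
            gain = (current - chain_delay(trial)) / (trial[i] - sizes[i])
            if gain > best_gain:
                best_gain, best_index = gain, i
        if best_index is None:
            break  # no single increment helps; the heuristic is stuck
        sizes[best_index] *= bump
    return sizes


if __name__ == "__main__":
    sizes = greedy_size(n_stages=2, target_delay=20.0)
    print([round(s, 3) for s in sizes], round(chain_delay(sizes), 3))
```

Because the search stops at the first design that satisfies the constraint, the result is typically smaller than the delay-optimal sizing, which reflects the area and power savings these heuristics aim for.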

The third approach utilizes both mathematical and heuristic methods to achieve good results that are less computationally expensive. These hybrid techniques [61,62] use geometric programming methods to provide an initial starting point, then use heuristic methods to find a near-optimal solution. Thus, a more accurate solution can be obtained with only a small increase in computation time.

The fourth method of optimizing transistor sizes is based on a stochastic modeling of circuit responses [63]. In this technique the values of the complex nonlinear objective functions are viewed as random variables with known distributions. This view allows variations in process parameters to be accounted for. Stochastic modeling allows the algorithm to globally view and efficiently optimize the entire circuit instead of just the speed-critical path. However, this approach resembles heuristic methods in its ability to provide only near-optimal solutions.

A number of other interesting capabilities have been introduced into transistor sizing tools. Some of these capabilities include using SPICE to conduct iterative circuit simulations [64], using a static timing analyzer to identify speed-critical circuitry [65,66], using an interactive interface to allow the designer to control the direction of the search efforts [67-69], allowing multiple logic styles to be introduced into the design [70], providing gate-based instead of transistor-based optimization [71], and optimizing transistors in a multiple threshold voltage technology [72]. The most powerful sizing tools, however, extract parasitic capacitances and resistances from the layout and conduct both transistor sizing and layout compaction in their optimizing loops [73-76].

SRAM core and peripheral transistor sizing can be viewed as a subset of the traditional transistor sizing problem. An SRAM has a fixed organization that is known beforehand. Therefore, its sizing engine can employ specialized heuristics that are both accurate and fast. The sensitivities of various transistors on read and write access times are well understood, making the judicious choice of transistors for incremental adjustment comparatively easy.

5.2 Past SRAM Compilers

The first published SRAM compiler, known as RAMGEN [77], was developed at Texas Instruments in 1986. This tool simply connected previously designed leaf cells into a parameterized SRAM configuration that could be fabricated with either a 2.0 or 3.0um CMOS process. It also generated corresponding HDL descriptions and performance datasheets of the design. No iterative analysis or rebuilding steps were conducted to further optimize the design because the tool was strictly a layout generator.

Since RAMGEN's introduction, numerous SRAM compilers have been developed. In 1990, Texas Instruments completed a BiCMOS multi-port SRAM compiler [78,79] capable of generating designs with up to 8 k-bits of capacity. Although this tool was capable of producing a larger variety of SRAMs, it was still just a layout generator that did not employ iterative analysis or optimization steps.

In this same year, VLSI Technology, Inc. and LSI Logic Corp. introduced SRAM compilers for their ASIC customers. The VLSI Technology compiler [80] could build memories with up to 128 k-bits of capacity in a 1.0um CMOS technology. It used a proprietary silicon compilation language known as "SLICE" to both generate and connect SRAM leaf cells. This provided the customer with greater ease and flexibility in altering memory organizations. The LSI Logic MEMCOMP compiler [81] could build memories with up to six ports in either a 1.5um or 1.0um CMOS process. It used predesigned full-custom leaf cells as its building blocks. Both of these tools, however, were still just layout generators that did not employ iterative optimization techniques.

A breakthrough in SRAM compiler technology was made in 1990 when Mitsubishi Electronics introduced their MAC2 memory compiler based on Cascade Design Automation's Compiler Development System (CDS) [82]. CDS is a silicon compilation language that provides true process independence by parameterizing both device sizes and technology design rules. This compiler was also the first to use an iterative, constraint-driven optimization technique for buffer sizing. During each iteration, predefined lookup tables were used to help speed up the analysis of a given layout. This tool was capable of creating SRAM designs with up to six ports and 36 k-bits of capacity.

In 1991, Motorola's ASIC division in cooperation with Mentor Graphics Corp. developed the Memorist SRAM compiler [83]. Memorist was based on Mentor's GDT tool suite and was capable of designing 0.7um CMOS SRAMs with up to two ports and 256 k-bits of capacity. It resembled the MAC2 compiler by implementing an iterative optimization technique to size transistors, but employed this technique in sizing the entire critical path circuitry instead of just the buffers. Memorist had another key advantage over MAC2 by allowing the designer to choose between using predefined lookup tables or SPICE to analyze critical path circuitry.

The first GaAs SRAM compiler, known as Aurora [84,85], was developed at The University of Michigan in 1994. This tool was capable of generating and optimizing E/D MESFET SRAMs having up to 8 k-bits of capacity. A novel current mirror memory cell [86] (CMMC) was employed as the storage node in these designs. The compiler resembled MAC2 by providing process independence by using CDS to both generate and connect SRAM leaf cells. It also combined the additional capabilities of the Memorist compiler by employing SPICE as the primary circuit analysis tool. Aurora implemented an iterative, constraint-driven transistor sizing algorithm to build memories that were optimized for speed, power, area or a combination of these metrics. Its heuristic gradient search algorithm also allowed a large power-delay-area design space to be searched rapidly.

5.3 Current Commercial SRAM Compilers

There are two categories of commercial SRAM compilers available today: gate-array compilers and "all-layer" or embedded compilers. Gate array SRAM compilers have been developed by commercial ASIC vendors such as Texas Instruments and Motorola, and also by CAD companies such as Synopsys. These compilers have been designed to create memories from metal programmable gate array ICs. Synopsys' Cell-based-array Memory Architect provides SRAM designs that can be configured in a number of aspects including single or multi-port, speed or power optimized, synchronous or asynchronous timing, available capacity, and height-width aspect ratio.

Embedded CMOS SRAM compilers are currently available from Mentor Graphics, Cadence Design Systems, Artisan Components and Cascade Design Automation. Mentor Graphics' Memory Builder is a graphical abstract floorplanner that can be used to create a parameterized memory generator. This tool iteratively assembles, extracts and characterizes memory layouts. It is not a true optimizing compiler, but merely a layout generator. However, it does provide the ability to create an optimizing front end that iteratively calls the tool to create variant SRAM layouts.

Cadence's Structure Compiler is similar to the Mentor Graphics product. It is a tiling engine that can be used to generate layouts that are comprised of complex, repetitive structures that are either full-custom or parameterized leaf cells. This tool also has a programmable, procedural interface instead of providing its own optimizing front end. Cadence is also developing an iterative, optimizing "all layer" SRAM compiler that is to be completed by the fall of 1997.

One of the most advanced commercial SRAM compilers is Artisan Components' Process Perfect memory generator. This "all-layer" compiler is truly process independent and is capable of generating memory layouts that can be optimized according to a wide variety of design goals. It also implements an iterative, optimizing transistor sizing tool that uses a timing characterization curve-fitting technique that is within 2% of SPICE and a power characterization curve-fitting technique that is within 5% of SPICE. This compiler can also generate custom built-in self test (BIST) controllers for its embedded memories.

Cascade Design Automation has recently released their HSLP (High Speed Low Power) CMOS SRAM compiler. This compiler is similar to the MAC2 compiler in that it utilizes CDS to generate, pitch-match, and tile leaf cells into an SRAM configuration. However, since HSLP was developed in-house, it employs additional undocumented CDS features which were not available to the MAC2 developers. This tool also implements an optimizing, iterative transistor sizing algorithm and device lookup tables to meet required timing constraints.

5.4 Proposed CGaAs SRAM Compiler

I propose to develop a CGaAs SRAM compiler as part of my Ph.D. research. Such a tool would accelerate the search of the large SRAM design space and would aid in producing optimized designs. It is desirable that this tool have many of the advanced features found in today's commercial SRAM compilers. It is also important not to "reinvent the wheel" by designing tools that are already commercially available. This can be accomplished by building a compiler that employs today's most powerful commercial CAD tools in designing optimized IC layouts.

Some of the most powerful features of current and past SRAM compilers have been packaged and sold in the form of commercial CAD tools. These features include automated leaf cell generators and compactors, hierarchical leaf cell tiling that performs inter-cell pitch-matching, accurate parasitic capacitance extraction from an IC layout, reliable device-based circuit simulation, IC fabrication design rule checking, and layout versus schematic verification. With such tools already available, most of the necessary features of a powerful CGaAs SRAM compiler can be easily realized.

I will develop an SRAM compiler that employs commercial CAD tools from several different vendors. These tools include Mentor Graphics' IC Station, Cascade Design Automation's Masterport cell generator/compactor, Synopsys' Arcadia parasitic RC extractor, Meta Software's HSPICE circuit simulator, and Cadence Design Systems' Dracula layout verification tool.

This compiler would provide true process design rule independence by using Masterport to generate and compact parameterized, full-custom leaf cells that were originally designed with IC Station. Given a set of process design rules, power rail widths and transistor sizes, this tool will automatically generate pitch-matched, compacted leaf cells that can be readily tiled into a given SRAM configuration.

A heuristic transistor sizing algorithm will be developed to iteratively size, build, and analyze SRAM layouts until the power and/or area objective has been reached subject to the specified access time constraint. This iterative loop will involve tiling the core and peripheral leaf cells into an SRAM critical path layout using Cascade's Compiler Development System (CDS). Arcadia will then be used to extract the parasitic capacitances from the layout and to create an annotated SPICE netlist. HSPICE will then simulate the design and verify both correct functionality and reliable operation in the presence of IC process variations. Fig. 12 illustrates the compiler's iterative optimization method. Once an optimal layout has been achieved, the final and complete layout will be assembled using CDS. This layout will then be design rule checked and undergo layout versus schematic verification using Dracula.

Fig 12. CGaAs SRAM compiler iterative optimization method
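
The control flow of this loop can be sketched in a few lines of Python. The `build`, `extract`, `simulate`, and `resize` callables are hypothetical stand-ins for the CDS tiling, Arcadia extraction, HSPICE simulation, and sizing steps; they are not real tool interfaces, and the toy versions at the bottom exist only so the sketch runs. The sketch also simplifies the objective: it starts from small devices and stops at the first design that meets the access time constraint.

```python
def optimize_sram(initial_sizes, build, extract, simulate, resize,
                  access_time_limit_ns, max_iters=50):
    """Size, build, extract, and simulate until the timing constraint is met."""
    sizes = dict(initial_sizes)
    for iteration in range(max_iters):
        layout = build(sizes)        # tile leaf cells into a critical path
        netlist = extract(layout)    # annotate the netlist with parasitics
        result = simulate(netlist)   # measure access time (and power)
        if result["access_time_ns"] <= access_time_limit_ns:
            return sizes, result, iteration
        sizes = resize(sizes, result)
    raise RuntimeError("no design met the access time constraint")


# Toy stand-ins (assumptions for demonstration only).
def toy_build(sizes):
    return dict(sizes)


def toy_extract(layout):
    return layout


def toy_simulate(netlist):
    return {"access_time_ns": 10.0 / netlist["driver_width"]}


def toy_resize(sizes, result):
    return {"driver_width": sizes["driver_width"] * 1.25}


if __name__ == "__main__":
    final_sizes, final_result, n_iter = optimize_sram(
        {"driver_width": 1.0}, toy_build, toy_extract, toy_simulate,
        toy_resize, access_time_limit_ns=3.0)
    print(n_iter, final_sizes, final_result)
```

Keeping the tool invocations behind plain callables is one way to let the sizing heuristic be developed and tested before the commercial tools are wired in.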

An additional feature that should be added to the compiler is the ability to perform IR drop analysis on the SRAM layout while determining power rail widths. The ability to use area interconnect can also be added, enabling the core power rails to become smaller and the density of the SRAM layout to increase. This compiler will provide a way for the large CGaAs power-delay-area design space to be rapidly searched. It will also find a valuable application in the refinement of the CGaAs IC fabrication process.


6. CGaAs Process Refinement

IC feature size reduction is a regular and ongoing process at IC manufacturing facilities. The incremental reductions in minimum feature sizes, especially gate length and metallization reductions, are in large part responsible for the steady increase of microprocessor performance over the years. From an IC design standpoint, the most desirable type of process improvement is a linear shrink of all process design rules. This allows the designer to easily port an IC layout from one process to another by simply mapping the larger design to a grid with smaller dimensions. From a process engineering standpoint, however, this is the most difficult and expensive type of process shrink because it requires the reduction of all process design rules in equal proportions.

6.1 Cost Effective IC Process Improvement Programs

A cost effective IC process improvement program must consider the return on investment of each research dollar spent, measured in terms of overall IC design improvement. To measure this improvement, one must consider that each design rule shrink has its own unique cost function associated with it. Of course, interdependencies between design rule improvements require that the cost of an entire design rule improvement program be considered as a whole. This function, C(P1,P2,...Px), is measured in dollars and accounts for the total research cost associated with reducing design rules 1 through x by P1 through Px%. There is also an overall design improvement function, I(P1,P2,...Px), associated with this design rule improvement program. This function accounts for the improvement in overall circuit density, speed and/or power dissipation when reducing design rules 1 through x by P1 through Px%. Given a fixed research budget B, maximizing I(P1,P2,...Px) subject to C(P1,P2,...Px)=B will yield the most cost effective IC process improvement program.
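The structure of this constrained maximization can be illustrated with a toy example in Python. The cost and improvement functions below are purely hypothetical (a quadratic cost and a linear benefit for two design rules), and the search is a brute-force enumeration over whole-percent shrinks. The constraint is written as C <= B; for these monotonic functions the best program spends the full budget, which matches the equality constraint in the text.

```python
from itertools import product

BUDGET = 200_000             # hypothetical research budget B, in dollars
COST_COEFF = (1_000, 4_000)  # hypothetical dollars per (percent shrink)^2
GAIN_COEFF = (1.0, 2.0)      # hypothetical improvement per percent shrink


def cost(p):
    """Toy C(P1, P2): total research cost of shrinking each rule by p[i] %."""
    return sum(k * pi * pi for k, pi in zip(COST_COEFF, p))


def improvement(p):
    """Toy I(P1, P2): overall design improvement for the same shrinks."""
    return sum(g * pi for g, pi in zip(GAIN_COEFF, p))


def best_program(max_percent=30):
    """Enumerate whole-percent shrinks and keep the best affordable one."""
    candidates = product(range(max_percent + 1), repeat=len(COST_COEFF))
    feasible = (p for p in candidates if cost(p) <= BUDGET)
    return max(feasible, key=improvement)


if __name__ == "__main__":
    p = best_program()
    print(p, cost(p), improvement(p))
```

In practice neither function is available in closed form, so each evaluation of the improvement function would correspond to generating and simulating a layout.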

If both the cost function C and the design improvement function I are accurately stated mathematically, geometric programming techniques can be used to find an optimal set of design rule improvements given a fixed research budget. The cost function C for a mature IC technology, such as CMOS, can be derived by considering the costs associated with past rule shrink efforts. Curve fitting techniques can be used to derive a function that estimates the cost of future research efforts based on past research efforts. However, this technique can be inaccurate because the cost trends associated with past research efforts do not necessarily characterize the costs associated with future research efforts. This process is more difficult for less mature processes, such as CGaAs, where there exists little or no cost data from prior rule shrink efforts. In this case, cost functions must be derived using estimations from IC process engineers.

Deriving the overall design improvement function I is a more complicated matter requiring a different approach. It is extraordinarily difficult, if not impossible, to accurately derive this function without first creating a series of IC layouts with a set of incrementally adjusted design rules. Once these layouts have been created, speed and power measurements can be obtained through circuit simulation with extracted parasitics. A process independent circuit compiler could be used to rapidly create and simulate these IC layouts.

6.2 CGaAs SRAM Improvement

Compared to CMOS, CGaAs is not a mature IC fabrication process. The 0.5um CGaAs design rules are coarser than those of a typical 0.5um CMOS process, leading to static memory cells that are larger than their CMOS counterparts. CGaAs SRAM density suffers as a result. To illustrate the immaturity of CGaAs, Table 2 gives the sizes of several embedded 6T SRAM cells that were fabricated with IC processes resembling the 0.5um CGaAs process. These cell sizes (some had to be estimated from a die photo) range from 52 um2 on the Cyrix 6x86 to 66 um2 on the PowerPC 604e. The CGaAs 6T SRAM cell clearly suffers from a density disadvantage.
Table 2: 6T SRAM cell sizes
Microprocessor                       IC process             Cell size (um2)
Sun UltraSPARC (167 MHz)             0.45um 4 metal CMOS    63
Cyrix 6x86 (120 MHz)                 0.6um 3 metal CMOS     52
Digital 21164 (300 MHz)              0.5um 4 metal CMOS     63
Motorola/IBM PowerPC 604e (133 MHz)  0.5um 4 metal CMOS     66
Intel PentiumPro (166 MHz)           0.4um 4 metal BiCMOS   55
PUMA                                 0.5um 3 metal CGaAs    183

CGaAs is an obvious candidate for significant design rule improvement. In order to determine the most cost effective set of design rule shrinks, the CGaAs SRAM compiler can be used to determine the overall design improvement function of the previously stated optimization problem. Masterport provides process independence by allowing all technology design rules to be parameterized. The regeneration of leaf cells given a certain design rule change is an easy and computationally inexpensive task. The heuristic transistor sizing algorithm can be modified to optimize SRAM area, speed, and power characteristics subject to a design rule cost equality constraint. Doing so allows the CGaAs SRAM compiler to automate the task of determining the cost effectiveness of a given set of design rule shrinks. A heuristic gradient search algorithm can then be used to iteratively employ the SRAM compiler to search the multi-dimensional design rule improvement space for the most cost effective set of design rule enhancements.


7. PUMA Cache Memory Hierarchy

The memories produced by the CGaAs SRAM compiler will eventually be implemented in the PUMA PowerPC microprocessor. Due to the low integration levels provided by this technology, PUMA will be a multi-chip implementation of a PowerPC ISA subset. This DARPA-funded project has been underway at the University of Michigan since the fall of 1994.

7.1 MCM-based Microprocessors

Multi-chip microprocessor implementations face two major difficulties. The first involves determining how to optimally divide the processor into separate chips while the second involves providing high bandwidth communication buses between these chips. Bandwidth can be defined as the number of bits transmitted per cycle multiplied by the number of cycles per second. Bandwidth can be improved by increasing the width of the bus and/or by increasing the frequency at which it transmits data. Ideally, these inter-chip buses would be both wide and fast.
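The bandwidth definition above amounts to a single multiplication, shown here in Python. The 64-bit width and 400 MHz clock in the example are hypothetical values chosen for illustration, not PUMA bus parameters.

```python
def bus_bandwidth_bits_per_s(bus_width_bits, clock_hz):
    """Bandwidth = bits transmitted per cycle x cycles per second."""
    return bus_width_bits * clock_hz


if __name__ == "__main__":
    base = bus_bandwidth_bits_per_s(64, 400_000_000)
    wider = bus_bandwidth_bits_per_s(128, 400_000_000)
    faster = bus_bandwidth_bits_per_s(64, 800_000_000)
    print(base, wider, faster)
```

Doubling either the bus width or the clock frequency doubles the bandwidth, which is why both a wide and a fast inter-chip bus is desirable.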

Separately packaging each chip and mounting them on a printed circuit board (PCB) is an unacceptable solution. Single-chip packages and PCB traces have prohibitively large impedances, while long PCB traces exhibit pronounced transmission line effects when operated at high frequencies. This packaging style also limits the number of chip I/O and hence bus width. An alternative to this approach is to use multi-chip module packaging with area interconnect.

A multi-chip module (MCM) is a substrate upon which multiple bare die are directly fastened. ICs on an MCM are not individually packaged and can be placed in closer physical proximity than on a PCB. The traces on the MCM substrate which connect the bare die are shorter than PCB traces. Research has shown that short MCM traces can maintain signal integrity at much higher frequencies [87]. By dispensing with individual packages, the large impedances associated with them are eliminated. Since MCM traces are smaller, shorter, and have less impedance, chip I/O drivers and receivers can be smaller and faster. Each of these aspects leads to higher bus operating frequencies. Two processors that utilize MCM packaging are the Sun/Hal SPARC64 [88] and the IBM POWER2 [89]. Another advantage to MCM packaging is the ability to use area interconnect.

7.2 Area Interconnect

Area interconnect is an implementation technique that allows an I/O port to be placed anywhere within a chip's layout. This allows power and signal lines to enter the chip directly where they are needed instead of from the chip periphery. Adding I/O ports usually comes at the expense of increased chip area when using a peripheral pad ring. With area interconnect, a chip can have a vastly increased number of I/O with a smaller area penalty.

Flip-chip mounting makes area interconnect possible [90]. The approach we are using to connect bare die to an MCM substrate involves forming metal bumps on a die's area I/O pads. The die is then turned upside-down and fastened to the MCM substrate with an adhesive layer. Area I/O pads make electrical and thermal connection to the MCM traces through the metal bumps. Fig. 13 is an illustration of a flip-chip mounted IC. The PUMA processor will be implemented with MCM-D on ceramic packaging.

Fig 13. Flip-chip mounted ICs

7.3 Gold Bumping

Bare-die bumping processes, such as IBM's C4 [91], use solder to attach bare die to an MCM substrate. Solder has many characteristics that make it advantageous for bumping; it is inexpensive, has well understood properties and is very pliable when heated. Since solder contains lead, it is also a source of alpha particle emissions. As long as solder bumps are kept away from sensitive sections of the processor, such as dynamic logic or memories, alpha-particle emissions are not problematic. The previously mentioned MCM-based commercial processors do not utilize area interconnect. Their flip-chip solder bumps are placed on the chip's periphery--away from sensitive circuitry.

The PUMA processor, however, will utilize area interconnect to provide a larger number of high speed I/O. These area I/O pads will be placed in locations that provide convenient access to all sections of the chip, including its sensitive components. Despite their natural hardness to radiation, CGaAs dynamic logic and static memory circuits are still somewhat vulnerable to alpha particle disruptions. Therefore, gold bumping will be used instead of solder [92]. Gold makes good bumping material because it is highly pliable and does not emit alpha particles. Its major drawback is the high cost associated with its unique bumping procedure and MCM assembly method.

7.4 PUMA Memory Subsystem

The PUMA's memory hierarchy will be partitioned into numerous chips that communicate with each other across the MCM substrate and to a PCB. Fig. 14 is a block diagram of the PUMA memory subsystem.

Fig 14. PUMA memory subsystem block diagram

The PUMA's three-level memory hierarchy will be comprised of two levels of cache memory and a large DRAM. The cache hierarchy will be split into an instruction side and a data side. On the instruction side, the PUMA fixed-point unit (FXU) will have primary access to an embedded 1 KByte CGaAs instruction cache and an embedded CGaAs 8-line instruction stream buffer. These primary structures will in turn be connected to a secondary 1 MByte instruction cache implemented with four commercial BiCMOS SRAM chips.

On the data side, the FXU will have primary access to 16 KBytes of data cache implemented with four off-chip CGaAs 4 KByte SRAM chips. This cache will in turn be connected to a 1 MByte secondary data cache also implemented with four commercial BiCMOS SRAM chips. Access to the secondary caches and main memory will be conducted through off-chip CGaAs memory management units (MMUs). These high bandwidth cache controllers will manage pipelined, non-blocking access to the secondary caches and to 64 MBytes of main memory via the PUMA Interface to PCI (PIP) chip.


8. Research Contributions

On the morning of January 11, 1997, an unusually large wave of solar radiation known as a coronal mass ejection swept through the earth's upper atmosphere. This phenomenon is a relatively common occurrence that does not pose a threat to life on earth. On this particular occasion, however, the solar disruption was of sufficient magnitude to overpower and destroy the microelectronics of AT&T's Telstar 401 communications satellite orbiting at 22,000 miles above the surface of the earth [93-95]. This satellite was designed and built by Lockheed Martin, launched into space in late 1993, and began service as a television signal relay station in early 1994. Telstar was one of three satellites that formed AT&T's Skynet Satellite Services and cost the company several hundred million dollars. According to NASA, several other military satellites operated by the Pentagon were also severely affected, resulting in unspecified damage.

Coronal mass ejections are invisible waves of electrically charged particles that can be lethal to microelectronics fabricated with semiconducting substrates such as silicon. It is now apparent that current technologies used to protect orbiting satellites from the damaging effects of solar radiation are insufficient. This event has spawned a renewed search for radiation-hard IC technologies by government institutions and commercial satellite deployment companies.

CGaAs can provide the high performance, radiation hard ICs required for high-altitude orbits. Efforts are currently underway at Motorola and Boeing Space Systems to prepare CGaAs for this task. There are numerous research areas associated with developing a radiation hard, high performance CGaAs microprocessor that have yet to be thoroughly explored. One of the most central areas is the design of reliable embedded memory structures and analysis tools.

8.1 Proposed Research

I propose to conduct the following research relative to the design and optimization of CGaAs cache memory. Each of these research tasks builds on the existing CGaAs knowledge base and contributes additional understanding to it. These contributions can be directly applied to ongoing government and commercial research and development efforts.


References

[1] T. Ichikawa and M. Sasaki, "A New Analytical Model of SRAM Cell Stability in Low-Voltage Operation," IEEE Transactions on Electron Devices, vol. 43, no. 1, January 1996, pp. 54-61.
[2] H. Kato, M. Matsui, H. Shibata, K. Hashimoto, T. Ootani, and K. Ochii, "SRAM Cell Stability Under the Influence of Parasitic Resistances and Data Holding Voltage as a Stability Prober," IEEE Journal of Solid State Circuits, vol. 32, no. 2, February 1997, pp. 232-237.
[3] D. Hodges and H. Jackson, Analysis and Design of Digital Integrated Circuits, Second edition, New York: McGraw-Hill, 1988.
[4] L. Gwennap, "Digital 21264 Sets New Standard," Microprocessor Report, vol. 10, no. 14, October 28, 1996.
[5] J.K. Abrokwah, J.H. Huang, W. Ooms, C. Shurboff, J.A. Hallmark, R. Lucero, "A Manufacturable Complementary GaAs Process," Proceedings of the 1993 IEEE GaAs IC Symposium, pp. 127-129.
[6] B. Bernhardt et al., "Complementary GaAs (CGaAs): A High Performance BiCMOS Alternative," Proceedings of the 17th Annual IEEE GaAs IC Symposium, pp. 18-21, 1995.
[7] J. Abrokwah, R. Lucero, J. Hallmark, and B. Bernhardt, "Submicron P-Channel (Al,Ga)As/(In,Ga)As HIGFET's," IEEE Transactions on Electron Devices, vol. 44, no. 7, July 1997, pp. 1040-1045.
[8] T. Ohzone, S. Kondo, K. Tsuji, T. Shiragasawa, T. Ishihara, and S. Horiuchi, "A 64Kb Static RAM," Proceedings of the 1980 IEEE International Solid-state Circuits Conference, pp. 236-237.
[9] J. Mayer, "SRAMs Rev Up, Close Memory-latency Gap," Computer Design, February 1997, pp. 107-109.
[10] K. Ishibashi, K. Komiyaji, S. Morita, T. Aoto, S. Ikeda, K. Asayama, A. Koike, T. Yamanaka, N. Hashimoto, H. Iida, F. Kojima, K. Motohashi, and K. Sasaki, "A 12.5ns 16Mb CMOS SRAM," Proceedings of the 1993 IEEE Symposium on VLSI Circuits, pp. 103-104.
[11] A. Suzuki, T. Kobayashi, T. Hamano, H. Hatada, A. Kawasumi, F. Matsuoka, K. Ishimaru, M. Takahashi, M. Nishigohri, Y. Okayama, Y. Unno, M. Kakumu, and J. Tsujimoto, "A 400MHz 4.5Mb Synchronous BiCMOS SRAM with Alternating Bit-line Loads," Proceedings of the 1996 IEEE International Solid-state Circuits Conference, pp. 146-147.
[12] S. Kuhara, H. Toyoshima, K. Takeda, K. Nakamura, H. Okamura, M. Takada, H. Suzuki, H. Yoshida, and T. Yamazaki, "A 6ns 1.5V 4Mb BiCMOS SRAM," Proceedings of the 1996 IEEE International Solid-state Circuits Conference, pp. 144-145.
[13] K. Sasaki, K. Ueda, K. Takasugi, H. Toyoshima, T. Yamanaka, N. Hashimoto, and N. Ohki, "A 16Mb CMOS SRAM with a 2.3µm² Single-Bit-Line Memory Cell," Proceedings of the 1993 IEEE International Solid-state Circuits Conference, pp. 250-251.
[14] M. Dolle, S. Jhand, W. Lehner, O. Muller, M. Schlett, "A 32-b RISC/DSP Microprocessor with Reduced Complexity," IEEE Journal of Solid-State Circuits, vol. 32, no. 7, July 1997, pp. 1056-1066.
[15] D. Kinell and D. Wilson, "GaAs Static Random Access Memory Cell Design," Research abstract of the 1990 IEEE GaAs IC Symposium.
[16] J. Notthoff, R. Krein, J. Stephens, G. Troeger, C. Vogelsang, and C. Hyun, "A 4K X 1 Bit Complementary E-JFET Static RAM," Proceedings of the 1987 IEEE GaAs IC Symposium, pp. 185-188.
[17] C. Vogelsang, J. Castro, J. Notthoff, G. Troeger, J. Stephens, and R. Krein, "Complementary GaAs JFET 16K SRAM," Proceedings of the 1988 IEEE GaAs IC Symposium, pp. 75-78.
[18] M. Suzuki, S. Notomi, M. Ono, N. Kobayashi, E. Mitani, K. Odani, T. Mimura, and M. Abe, "A 1.2ns HEMT 64kb SRAM," Proceedings of the 1991 IEEE International Solid-State Circuits Conference, pp. 48-49.
[19] J. Hallmark, C. Shurboff, B. Oohms, R. Lucero, J. Abrokwah, and J. Huang, "0.9 V DSP Blocks: A 15 ns 4 K SRAM and a 45 ns 16-bit Multiply/Accumulator," Proceedings of the 1994 IEEE GaAs IC Symposium, pp. 55-58.
[20] E. Seevinck, P. van Beers, and H. Ontrop, "Current-mode Techniques for High Speed VLSI Circuits with Application to Current Sense Amplifier for CMOS SRAMs," IEEE Journal of Solid-State Circuits, vol. 26, no. 4, April 1991, pp. 525-535.
[21] H. Nambu, K. Kanetani, Y. Idei, N. Homma, K. Yamaguchi, T. Hiramoto, N. Tamba, M. Odaka, K. Watanabe, T. Ikeda, K. Ohhata, and Y. Sakurai, "High-speed Sensing Techniques for Ultrahigh-speed SRAMs," IEEE Journal of Solid-State Circuits, vol. 27, no. 4, April 1992, pp. 632-639.
[22] T. Yabe, "High-speed Circuit Techniques for 1 to 5 V Operating Memories," IEICE Transactions on Electronics, vol. E76-C, no. 5, May 1993, pp. 708-713.
[23] K. O'Connor, "A Source Sensing Technique Applied to SRAM Cells," IEEE Journal of Solid-State Circuits, vol. 30, no. 4, April 1995, pp. 500-511.
[24] K. Ishibashi, "High-Speed CMOS SRAM Technologies for Cache Applications," IEICE Transactions on Electronics, vol. E79-C, no. 6, June 1996, pp. 724-734.
[25] K. Ohhata, T. Kunsunoki, H. Nambu, K. Kanetani, T. Masuda, M. Ohayashi, S. Hamamoto, K. Yamaguchi, Y. Idei, and N. Homma, "Redundancy Circuit for a Sub-nanosecond, Megabit ECL-CMOS SRAM," IEICE Transactions on Electronics, vol. E79-C, no. 3, March 1996, pp. 415-422.
[26] M. Hill and A. Smith, "Evaluating Associativity in CPU Caches," IEEE Transactions on Computers, vol. 38, no. 12, December 1989, pp. 1612-1630.
[27] M. Hill, "A Case for Direct-Mapped Caches," IEEE Computer, vol. 21, no. 12, December 1988, pp. 25-40.
[28] J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, San Mateo: Morgan Kaufmann, 1990.
[29] R. Kessler, R. Jooss, A. Lebeck, and M. Hill, "Inexpensive Implementations of Set-Associativity," Proceedings of the 1989 ACM/IEEE International Symposium on Computer Architecture, pp. 131-138.
[30] F. Bodin and A. Seznec, "Skewed Associativity Enhances Performance Predictability," Proceedings of the 1995 ACM/IEEE International Symposium on Computer Architecture, pp. 265-274.
[31] A. Seznec, "A Case for Two-way Skewed-associative Caches," Proceedings of the 1993 ACM/IEEE International Symposium on Computer Architecture, pp. 169-178.
[32] A. Seznec, "About Set and Skewed Associativity on Second Level Caches," Proceedings of the 1993 IEEE International Conference on Computer Design, pp. 40-43.
[33] A. Agarwal and S. Pudar, "Column-Associative Caches: A Technique for Reducing the Miss Rate of Direct-Mapped Caches," Proceedings of the 1993 ACM/IEEE International Symposium on Computer Architecture, pp. 179-190.
[34] A. Seznec, "DASC Cache," Proceedings of the 1995 IEEE Symposium on High-Performance Computer Architecture, pp. 134-143.
[35] K. Theobald, H. Hum, and G. Gao, "A Design Framework for Hybrid-Access Caches," Proceedings of the 1995 IEEE Symposium on High-Performance Computer Architecture, pp. 144-153.
[36] B. Calder, D. Grunwald, and J. Emer, "Predictive Sequential Associative Cache," Proceedings of the 1996 IEEE Symposium on High-Performance Computer Architecture, pp. 244-253.
[37] S. Przybylski, M. Horowitz, and J. Hennessy, "Performance Tradeoffs in Cache Design," Proceedings of the 1988 ACM/IEEE International Symposium on Computer Architecture, pp. 290-298.
[38] A. Seznec, "Decoupled Sectored Caches: Conciliating Low Tag Implementation Cost and Low Miss Ratio," Proceedings of the 1994 ACM/IEEE International Symposium on Computer Architecture, pp. 384-393.
[39] O. Olukotun, T. Mudge, and R. Brown, "Multilevel Optimization of Pipelined Caches," to appear in IEEE Transactions on Computers, 1997.
[40] N. Jouppi and S. Wilton, "Tradeoffs in Two-level On-chip Caching," Proceedings of the 1994 ACM/IEEE International Symposium on Computer Architecture, pp. 34-45.
[41] W. Wang, J. Baer, and H. Levy, "Organization and Performance of a Two-level Virtual-Real Cache Hierarchy," Proceedings of the 1989 ACM/IEEE International Symposium on Computer Architecture, pp. 140-148.
[42] N. Jouppi, "Cache Write Policies and Performance," Proceedings of the 1993 ACM/IEEE International Symposium on Computer Architecture, pp. 191-201.
[43] H. Young and E. Shekita, "An Intelligent I-Cache Prefetch Mechanism," Proceedings of the 1993 IEEE International Conference on Computer Design, pp. 44-49.
[44] N. Jouppi, "Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers," Proceedings of the 1990 ACM/IEEE International Symposium on Computer Architecture, pp. 364-373.
[45] S. Palacharla and R. Kessler, "Evaluating Stream Buffers as a Secondary Cache Replacement," Proceedings of the 1994 ACM/IEEE International Symposium on Computer Architecture, pp. 24-33.
[46] T. Alexander and G. Kedem, "Distributed Prefetch-buffer / Cache Design for High Performance Memory Systems," Proceedings of the 1996 IEEE International Symposium on High-Performance Computer Architecture, pp. 254-263.
[47] B. Bray and M. Flynn, "Fetch Caches," Stanford University Technical Report CSL-TR-93-561, February 1993.
[48] T. Kobayashi, et al., "A Current-Controlled Latch Sense Amplifier and a Static Power-Saving Input Buffer for Low-Power Architecture," IEEE Journal of Solid-State Circuits, vol. 28, pp. 523-527, April 1993.
[49] N. Weste, K. Eshraghian, Principles of CMOS VLSI Design, Second edition, New York: Addison-Wesley, 1993.
[50] B. Cherkauer, E. Friedman, "Design of Tapered Buffers with Local Interconnect Capacitance," IEEE Journal of Solid-State Circuits, vol. 30, February 1995, pp. 151-154.
[51] K. Hedlund, "Models and Algorithms for Transistor Sizing in MOS circuits," Proceedings of the 1984 IEEE/ACM International Conference on Computer-Aided Design, pp. 12-14.
[52] B. Richman, J. Hansen, and K. Cameron, "A Deterministic Algorithm for Automatic CMOS Transistor Sizing," Proceedings of the 1987 IEEE Custom Integrated Circuits Conference, pp. 421-424.
[53] Z. Dai and K. Asada, "MOSIZ: A Two-step Transistor Sizing Algorithm based on Optimal Timing Assignment Method for Multi-stage Complex Gates," Proceedings of the 1989 IEEE Custom Integrated Circuits Conference, pp. 17.3.1 - 17.3.4.
[54] B. Hoppe, G. Neuendorf, and D. Schmitt-Landsiedel, "Automatic Transistor Sizing in High Performance CMOS Logic Circuits," Proceedings of the 1989 European Computer Conference, pp. 5-25 - 5-27.
[55] S. Sapatnekar, V. Rao, and P. Vaidya, "A Convex Optimization Approach to Transistor Sizing for CMOS Circuits," Proceedings of the 1991 IEEE/ACM International Conference on Computer-Aided Design, pp. 482-485.
[56] L. Heusler and W. Fichtner, "Transistor Sizing for Large Combinational Digital CMOS Circuits," Integration, The VLSI Journal, vol. 10, no. 2, January 1991, pp. 155-168.
[57] S. Sapatnekar, V. Rao, and P. Vaidya, "An Exact Solution to the Transistor Sizing Problem for CMOS Circuits Using Convex Optimization," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 12, no. 11, November 1993, pp. 1621-1634.
[58] W. Kao, N. Fathi, and C. Lee, "Algorithms for Automatic Transistor Sizing in CMOS Digital Circuits," Proceedings of the 1995 ACM/IEEE Design Automation Conference, pp. 781-784.
[59] N. Azemard, V. Bonzom, and D. Auvergne, "P.SIZE: A Sizing Aid for Optimized Designs," Proceedings of the 1992 European Design Automation Conference, pp. 160-165.
[60] U. Ko and P. Balsara, "Short-Circuit Power Driven Gate Sizing Technique for Reducing Power Dissipation," IEEE Transactions on VLSI Systems, vol. 3, no. 3, September 1995, pp. 450-455.
[61] J. Shyu, J. Fishburn, A. Dunlop, and A. Sangiovanni-Vincentelli, "Optimization-based Transistor Sizing," Proceedings of the 1987 IEEE Custom Integrated Circuits Conference, pp. 417-420.
[62] G. Chen, H. Onodera, and K. Tamaru, "An Iterative Gate Sizing Approach with Accurate Delay Evaluation," Proceedings of the 1995 IEEE/ACM International Conference on Computer-Aided Design, pp. 422-427.
[63] S. Mehrotra, P. Franzon, and W. Liu, "Stochastic Optimization Approach to Transistor Sizing for CMOS VLSI Circuits," Proceedings of the 1994 ACM/IEEE Design Automation Conference, pp. 36-40.
[64] W. Nye, D. Riley, A. Sangiovanni-Vincentelli, and A. Tits, "DELIGHT.SPICE: An Optimization-based System for the Design of Integrated Circuits," IEEE Transactions on Computer-Aided Design, vol. 7, no. 4, April 1988, pp. 501-519.
[65] J. Fishburn and A. Dunlop, "TILOS: A Posynomial Programming Approach to Transistor Sizing," Proceedings of the 1985 IEEE International Conference on Computer-Aided Design, pp. 326-328.
[66] S. Sapatnekar and V. Rao, "iDEAS: A Delay Estimator and Transistor Sizing Tool for CMOS Circuits," Proceedings of the 1990 IEEE Custom Integrated Circuits Conference, pp. 9.3.1 - 9.3.4.
[67] K. Hedlund, "Aesop: A Tool for Automated Transistor Sizing," Proceedings of the 1987 ACM/IEEE Design Automation Conference, pp. 114-120.
[68] M. Cirit, "Transistor Sizing in CMOS Circuits," Proceedings of the 1987 ACM/IEEE Design Automation Conference, pp. 121-124.
[69] A. Conn, P. Coulman, R. Haring, G. Morrill, and C. Visweswariah, "Optimization of Custom MOS Circuits by Transistor Sizing," Proceedings of the 1996 IEEE/ACM International Conference on Computer-Aided Design, pp. 174-180.
[70] D. Chen and C. Zukowski, "CMOS Optimization Including Logic Family Mixing," Proceedings of the 1991 IEEE International Symposium on Circuits and Systems, vol. 4, pp. 2240-2243.
[71] H. Chen and S. Kang, "A New Circuit Optimization Technique for High Performance CMOS Circuits," IEEE Transactions on Computer-Aided Design, vol. 10, no. 5, May 1991, pp. 670-676.
[72] J. Kao, A. Chandrakasan, and D. Antoniadis, "Transistor Sizing Issues and Tool for Multi-threshold CMOS Technology," Proceedings of the 1997 ACM/IEEE Design Automation Conference, pp. 409-414.
[73] D. Marple, "Transistor Size Optimization in the Tailor Layout System," Proceedings of the 1989 ACM/IEEE Design Automation Conference, pp. 43-48.
[74] M. Yamada, S. Kurosawa, R. Nojima, N. Kojima, T. Mitsuhashi, and N. Goto, "Synergistic Power/Area Optimization with Transistor Sizing and Wire Length Minimization," IEICE Transactions on Electronics, vol. E78-C, no. 4, April 1995, pp. 441-445.
[75] N. Menezes, R. Baldick and L. Pileggi, "A Sequential Quadratic Programming Approach to Concurrent Gate and Wire Sizing," Proceedings of the 1995 IEEE/ACM International Conference on Computer-Aided Design, pp. 144-151.
[76] J. Cong and L. He, "An Efficient Approach to Simultaneous Transistor and Interconnect Sizing," Proceedings of the 1996 IEEE/ACM International Conference on Computer-Aided Design, pp. 181-186.
[77] W. Swartz, C. Giuffre, W. Banzhaf, M. deWit, H. Khan, C. McIntosh, T. Pavey, and D. Thomas, "CMOS RAM, ROM, and PLA Generators for ASIC Applications," Proceedings of the 1986 IEEE Custom Integrated Circuits Conference, pp. 334-338.
[78] J. Drummond and M. Lepkowski, "BiCMOS Submicron Compiler Memories," Proceedings of the 1990 IEEE ASIC Seminar and Exhibit, pp. 3-3.1 - 3-3.3.
[79] T. Dao and F. Svejda, "A Dual-port SRAM Compiler for 0.8µm 100K BiCMOS Gate Arrays," Proceedings of the 1991 IEEE Custom Integrated Circuits Conference, pp. 22.4.1-22.4.3.
[80] T. Le, H. Phuong, and P. Lin, "1KX128 High-performance, Low-power configurable CMOS SRAM Compiler," Proceedings of the 1990 IEEE ASIC Seminar and Exhibit, pp. 3-5.1 - 3-5.4.
[81] K. Tsao, N. Zhu, and T. Pham, "A High Performance Memory Compiler for Multi-port RAMs," Proceedings of the 1990 IEEE ASIC Seminar and Exhibit, pp. 3-6.1 - 3-6.4.
[82] H. Shinohara, N. Matsumoto, K. Fujimori, Y. Tsujihashi, H. Nakao, S. Kato, Y. Horiba, and A. Tada, "A Flexible Multiport RAM Compiler for Data Path," IEEE Journal of Solid-State Circuits, vol. 26, no. 3, March 1991, pp. 343-348.
[83] J. Tou, P. Gee, J. Duh, and R. Eesley, "A Sub-micron CMOS Embedded SRAM Compiler," Proceedings of the 1991 IEEE Custom Integrated Circuits Conference, pp. 22.3.1-22.3.4.
[84] A. Chandna, C. Kibler, R. Brown, M. Roberts, and K. Sakallah, "The Aurora RAM Compiler," Proceedings of the 1995 ACM/IEEE Design Automation Conference, pp. 261-266.
[85] A. Chandna, "GaAs MESFET Static RAM Design for Embedded Applications," Ph.D. Dissertation, University of Michigan, 1995.
[86] A. Chandna and R. Brown, "An Asynchronous GaAs MESFET Static RAM Using a New Current Mirror Memory Cell," IEEE Journal of Solid-State Circuits, October 1994, pp. 1270-1276.
[87] D. Schwab, et al., "Performance Characteristics of Thin Film Multilayer Interconnects in the 1-10 GHz Frequency Range," Proceedings of the 1989 IEEE Electronics Components Conference, pp. 536-542.
[88] T. Williams, N. Patkar, G. Shen, "SPARC64: A 64-b 64-Active-Instruction Out-of-Order-Execution MCM Processor," IEEE Journal of Solid-State Circuits, vol. 30, no. 11, November 1995, pp. 1215-1226.
[89] S. White, "POWER2: Next Generation of the RISC System/6000 Family," IBM Journal of Research & Development, vol. 38, no. 5, 1994.
[90] R. Lomax, et al., "Area I/O Flip-Chip Packaging to Minimize Interconnect Length," submitted to the 1997 IEEE MCM Conference, 1997.
[91] M. Nealon, "C4: A Legacy Technology for the Future," IBM Micronews, vol. 2, no. 4, 4Q 1996.
[92] R. Aschenbrenner, et al., "Flip Chip Attachment Using Non-Conductive Adhesives and Gold Ball Bumps," International Journal of Microelectronics and Electric Packaging, vol. 18, no. 2, 2Q 1995, pp. 154-161.
[93] "Telstar 401 out of service: Satellite's failure could cost AT&T several hundred million dollars," CNNfn news article, January 17, 1997, 1:01 p.m. EST, URL: http://www.cnnfn.com.
[94] "Solar radiation might have killed satellite," CNN Interactive news article, January 20, 1997, 11:45 p.m. EST, URL: http://www.cnn.com.
[95] "Sun 'ejection' killed TV satellite," CNN Interactive news article, January 21, 1997, 10:10 p.m. EST, URL: http://www.cnn.com.