The Design and Optimization
of
CGaAs Cache Memory
Ph.D. Thesis Proposal
Submitted to the
Department of Electrical Engineering and Computer
Science
The University of Michigan
August 14, 1997
by
Spencer M. Gold
Committee:
Dr. Richard B. Brown, Chairman
Dr. Ronald J. Lomax
Dr. Trevor N. Mudge
Dr. Karem A. Sakallah
Dr. Kensall D. Wise
ABSTRACT
Complementary gallium arsenide (CGaAs) is an emerging integrated circuit
technology that exhibits many positive characteristics, including high
electron mobility at low electric fields, inherent radiation hardness,
and the ability to realize complementary logic circuits. These qualities
enable the fabrication of components having the performance of E/D MESFET
GaAs with greatly reduced power dissipation and the ability to implement
large capacity, highly efficient SRAM.
CGaAs is potentially well suited for space and military applications
that require radiation-hard, high performance systems. If CGaAs is to fulfill
this role, improved design techniques for high performance CGaAs memory
subsystems, including SRAM-based cache memory, must be established. I propose
to develop the architectures, circuit techniques and tools required for
supporting the memory bandwidth requirements of a high performance CGaAs
system. The goals of this research are to contribute to the current CGaAs
knowledge base, to increase the maturity of the CGaAs fabrication process,
and to demonstrate the feasibility of high performance CGaAs-based cache
memory subsystems.
Silicon-based SRAM has steadily improved from a capacity-access time
and a capacity-power-access time perspective. On the other hand, GaAs-based
SRAMs have been severely limited by low integration levels and high power
logic families. I have designed a 40 k-bit CGaAs SRAM to demonstrate the
feasibility of large capacity, low access time, low power GaAs SRAM. This
chip utilizes a six-transistor memory cell with an area of 183 um^2,
22.5% smaller than any previously reported GaAs SRAM cell. I designed the
chip using a full-custom methodology in order to achieve a 400 MHz maximum
operating frequency.
SRAM compilers have been developed to rapidly explore the power-delay-area
memory design space. Since their introduction in 1986, SRAM compilers have
evolved into powerful tools that include features such as process independence,
iterative optimizing front ends, and BIST functionality. I propose to develop
an advanced CGaAs SRAM compiler that employs commercial CAD tools. Doing
so will reduce development time and encapsulate the most powerful analysis
tools available. This tool will also be used to evaluate the impact of
various design rule enhancements on CGaAs SRAM density and speed.
The optimized CGaAs SRAMs generated by this compiler will find their
eventual implementation as the cache memories of the PUMA CGaAs PowerPC
microprocessor. This MCM-based processor will utilize area interconnect,
gold bumping, and flip-chip mounting technologies to provide high bandwidth
inter-chip communication. Its memory subsystem will be comprised of two
levels of split instruction and data cache memory and a large unified DRAM.
The required CGaAs cache memory components include the primary instruction
and data caches and high bandwidth, non-blocking secondary instruction
and data cache controllers.
Table of Contents
1. Introduction
1.1 Cache Memory Design Considerations
1.2 Gallium Arsenide
1.3 E/D MESFET GaAs
1.4 Complementary GaAs
1.5 CGaAs Integrated Circuit Design
1.6 CGaAs SRAM
1.7 Proposal Outline
2. Trends in SRAM Design
2.1 Silicon SRAM
2.2 GaAs SRAM
3. Cache Design Issues
3.1 Circuit Issues
3.2 Organizational Issues
3.3 System Level Issues
3.4 Low Integration Cache Design Issues
4. Preliminary CGaAs SRAM Research
4.1 40 k-bit SRAM Chip Overview
4.2 Design Methodology
4.3 Design for Testability
4.4 Final Chip Statistics
5. SRAM Compilers
5.1 Transistor Sizing Algorithms
5.2 Past SRAM Compilers
5.3 Current Commercial SRAM Compilers
5.4 Proposed CGaAs SRAM Compiler
6. CGaAs Process Refinement
6.1 Cost Effective IC Process Improvement Programs
6.2 CGaAs SRAM Improvement
7. PUMA Cache Memory Hierarchy
7.1 MCM-based Microprocessors
7.2 Area Interconnect
7.3 Gold Bumping
7.4 PUMA Memory Subsystem
8. Research Contributions
8.1 Proposed Research
References
The Design and Optimization of CGaAs Cache Memory
1. Introduction
The technology base of our society has an insatiable need for faster and
more powerful computing. In order to meet this demand, microprocessors
are expected to operate at ever increasing frequencies. However, a faster
operating frequency does not guarantee improved performance, especially
if the memory subsystem has not been designed to properly support high
frequency operation.
The key to designing a high performance memory subsystem that will support
high frequency operation is to design a memory structure that has both
low access latency and high capacity. Although this may seem straightforward,
it is not, because these design goals are opposing. A trade-off must be
made between capacity and latency when designing memory structures.
Hierarchical memory subsystems can be used to create the illusion of
a memory structure that is both large and fast. This is done by providing
the processor with immediate, or primary, access to a small, fast
memory known as a cache. This primary cache is connected to higher
levels of memory that progressively increase in both capacity and latency
until the largest, slowest device is included. When a cache access fails
to find its data, or misses, that request is forwarded to the next
highest structure in the memory hierarchy. This process continues until
the memory access eventually finds its data in the upper portions of the
hierarchy. Once this data is retrieved, it and the physically adjacent data
are copied into the memory structures below it. This allows a processor
to exploit the temporal and spatial locality within a process' working
set. Depending on the miss percentage of each memory level, a hierarchical
subsystem can be very efficient while operating at high frequencies. For
example, the average access time of a three-level hierarchical memory subsystem
is given below as equation 1.1. As the miss percentages of the lower two
memory structures approach zero, the average access latency approaches
that of the primary cache.
Average access time = L1 access time + L1 miss% x (L2 access time
+ L2 miss% x (L3 access time)) (1.1)
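As a minimal sketch of equation 1.1, the Python function below evaluates the average access time of a three-level hierarchy. The function name and the numeric values in the example are hypothetical and chosen only for illustration; miss percentages are expressed as fractions between 0 and 1.

```python
def average_access_time(l1_time, l1_miss, l2_time, l2_miss, l3_time):
    """Equation 1.1: access times in ns, miss rates as fractions in [0, 1]."""
    return l1_time + l1_miss * (l2_time + l2_miss * l3_time)


# Hypothetical figures, not measurements from any real system.
example = average_access_time(
    l1_time=2.0, l1_miss=0.05, l2_time=10.0, l2_miss=0.20, l3_time=100.0
)
print(f"average access time: {example:.2f} ns")
```

Setting the L1 miss rate to zero in this sketch returns the L1 access time itself, which matches the limiting behavior described above.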
1.1 Cache Memory Design Considerations
The ability of a microprocessor to realize its performance potential is
largely dependent upon the effectiveness of its cache memories. If the
caches are too small and simple, their miss rates could be high and the
average access time would be adversely affected. On the other hand, if
the caches are too large or complex, their operating frequency would suffer
and precious die area would be wasted. Cache designers should have the
objective of implementing the most effective balance between capacity and
complexity.
High performance cache design requires contributions from several fields
in VLSI design. From a circuits perspective, one must consider memory implementation
techniques, power dissipation limitations, packaging requirements and operational
stability factors [1,2].
An understanding of the chip processing technology must also be acquired
in order to determine acceptable integration levels and operational parameters.
From an architectural viewpoint one must consider the bandwidth requirements
of the main processor when determining a cache's operating frequency and
organization. Cache capacity, set-associativity, multi-banking and pipelining
are some of the microarchitectural issues that affect overall system performance
and must also be taken into account. By considering the issues from each
of these fields an efficient, high performance cache memory hierarchy can
be obtained.
1.2 Gallium Arsenide
The intrinsic characteristics of gallium arsenide (GaAs) make it a suitable
alternative substrate for high frequency integrated circuits. The primary
advantage of GaAs over silicon is its low-field electron mobility, which
at normal doping levels is five times that of silicon
[3] (4,000 cm^2/V-s compared
to 800 cm^2/V-s). However, the maximum electron drift velocity
in GaAs is only twice that of silicon (2x10^7 cm/s compared to
1x10^7 cm/s). In saturation mode GaAs transistors can provide
up to a factor of two improvement in electron velocity. These characteristics
contribute in part to GaAs transistors' higher transconductance and reduced
delay at lower voltages compared to silicon transistors.
GaAs substrates can be viewed as semi-insulating since their resistivity
is 100,000 times greater than that of intrinsic silicon substrates (1x10^9
Ω-cm in GaAs compared to 1x10^4 Ω-cm in Si). Therefore, none
of the precautions used in CMOS IC design to guard against latch-up need
to be used in GaAs IC design. Due to gallium arsenide's larger bandgap
(1.43 eV in GaAs compared to 1.11 eV in Si), its semi-insulating substrate
and the absence of native oxides, GaAs-based circuits are more resistant
to single event upset and total dose radiation exposure than both CMOS
and SOI integrated circuits.
Gallium arsenide does have its drawbacks. GaAs IC fabrication processes
are less mature than current CMOS fabrication processes, leading to greatly
reduced integration levels. While some CMOS processes can successfully
yield chips, such as the DEC Alpha 21264 with over 15 million transistors
[4], GaAs
processes have only recently produced a functional chip with over one million
transistors.
1.3 E/D MESFET GaAs
One of the most established and widely available GaAs fabrication processes
is E/D MESFET (enhancement/depletion metal semiconductor field effect transistor).
This process provides only enhancement and depletion type transistors and
requires the use of direct-coupled FET logic (DCFL). This logic style provides
extremely fast and high power circuits. Although E/D MESFET does not offer
a great deal of flexibility to the circuit designer, it has been popular
in the supercomputing industry. The Tera computer implements some of its
multi-chip core with Vitesse Semiconductor's GaAs E/D MESFET process. These
chips have integration levels up to 1.2 million devices. Other systems
that utilize a GaAs-based processing core are the Fujitsu VPP500 and the
Convex 2X0 supercomputers.
1.4 Complementary GaAs
A new arrival in the GaAs IC scene is Motorola's complementary GaAs (CGaAs)
process [5-7].
This technology differs from E/D MESFET by providing p-channel instead
of depletion transistors. A cross-section of n- and p-channel CGaAs transistors
is given in Fig. 1. The CGaAs transistor
channel is a buried layer of intrinsic InGaAs and the gate is formed by
stacking layers of AlGaAs, GaAs, and TiWN to form a Schottky diode with
the InGaAs channel. This diode has a turn-on voltage of 1.8 volts yielding
devices that operate reliably within the 0.9 to 1.5 V range.
Fig. 1. CGaAs n- and p-channel transistor
cross section
The availability of p-channel devices makes possible the use of complementary,
dynamic, and source-coupled FET logic (SCFL) circuits as well as DCFL circuits.
CGaAs components can therefore be designed to have the performance of E/D
MESFET components but with greatly reduced power dissipation.
There are numerous advantages to using CGaAs for the implementation
of high-frequency processors. In spite of these advantages CGaAs still
cannot provide integration levels similar to current CMOS processes. If
a system is to be implemented in CGaAs it must currently be comprised of
chips having less than one million transistors each.
1.5 CGaAs Integrated Circuit Design
CGaAs closely resembles CMOS from a circuit design perspective. However,
there are a few crucial differences between these technologies that make
CGaAs circuit design unique. The CGaAs transistor structure differs from
that of a MOSFET in that CGaAs gates will conduct at voltages above 1.8
V since their heterostructure insulated gates (HIGs) are essentially diodes.
Even at levels significantly below this turn-on voltage, CGaAs gates have
significant current leakage. This leakage current is hard to estimate with
SPICE since the model we use for CGaAs HIGFETs is the JFET model.
There are a number of unique CGaAs design rules. p- and n-diffusions
can directly abut and share a common contact. The large difference in charge
carrier mobilities between n- and p-channel devices requires larger p-channel
devices in CGaAs complementary gates to provide equal rise and fall times.
Contacts to diffusion regions are made through an ohmic layer--an additional
mask not included in CMOS processes. CGaAs' current metallization is coarser
than state-of-the-art CMOS metallization and it only allows two vias to
be stacked atop one another.
1.6 CGaAs SRAM
There are two types of volatile random access memory (RAM): static and
dynamic. Static random-access memory (SRAM) differs from its dynamic counterpart
in that it does not require a periodic refresh in order to keep a stored
binary value. Since an SRAM cell stores its value in a pair of cross-coupled
inverters, it has greater drive strength, at the expense of larger size,
than a dynamic cell which stores its value on a capacitor. This leads to
the implementation of memories that have faster access times than dynamic
RAMs (DRAM). SRAM can be easily embedded with a processor core because
it does not require specialized processing steps in order to operate efficiently,
as does one-transistor DRAM. The overhead required to restore data after destructive
reads is also not required with SRAM since its read process is non-destructive.
For these reasons SRAM has become the implementation technique of choice
for cache memory.
A typical 6-transistor (6T) E/D MESFET SRAM cell utilizes cross-coupled
inverters with depletion loads to store a binary value. Regardless of the
data value being stored, there is always a conduction path from the power
supply to ground in a cell with depletion loads. Consequently, this type
of cell dissipates significant static power. A 6T CGaAs SRAM cell, on the
other hand, utilizes cross coupled complementary inverters; its
static power dissipation is due only to device leakage. It is the use of
complementary logic that provides CGaAs' power savings.
In high frequency logic circuits, dynamic power dominates static power
due to the high average switching rate. This characteristic is especially
evident in IC technologies that utilize coarse metallization. The fact
that E/D MESFET circuits dissipate significant static power is of lesser
importance when viewed from a logic design perspective. In a RAM, however,
the average switching rate is much lower. Thus, RAM power dissipation is
dominated by its static power component. A major disadvantage of using
E/D MESFET for system design is its inability to provide efficient embedded
RAM. CGaAs has an advantage over E/D MESFET in its ability to provide large,
efficient embedded RAM due to its low static power dissipation.
1.7 Proposal Outline
This proposal is organized both to identify previous work and to propose
future research in the area of CGaAs cache memory design. Of course, much
of this research will be directly applicable to other IC technologies such
as CMOS. Trends in past silicon and GaAs SRAM designs are identified in
Section 2 of this proposal. Some of the major issues
of high performance cache design are given in Section 3.
A full-custom, 40 k-bit CGaAs SRAM organized as data cache is described
in Section 4. Automated SRAM compilation and analysis
techniques and a proposed CGaAs SRAM compiler that utilizes commercial
CAD tools are described in Section 5. The proposed use
of this compiler to help optimize the CGaAs process is discussed in Section
6. The CGaAs SRAMs described in this proposal will find their eventual
implementation in the PUMA PowerPC microprocessor. Section
7 describes the PUMA's multi-chip cache memory hierarchy and the use
of CGaAs to realize high bandwidth operation. Finally, Section
8 describes proposed research that will be pursued in the area of CGaAs
cache memory design.
2. Trends in SRAM Design
SRAM development has become a mature discipline of microelectronics design.
Improvements in both processing technology and circuit design techniques
have enabled SRAM arrays to become larger, faster and more efficient. In
1980, a 64 k-bit, 80 ns design fabricated with a 2.0um CMOS process [8]
was considered state-of-the-art. Today's BiCMOS commercial synchronous
burst SRAMs, such as Motorola's MCM69P737, have up to 4.5 Mbits with pipelined
operating frequencies up to 166 MHz [9].
2.1 Silicon SRAM
Early silicon SRAM designs were fabricated in diverse technologies and
made to suit a wide variety of design goals. The SRAM market has since
become better defined and custom processing technologies have emerged.
Most high performance monolithic SRAMs available today are fabricated with
a specialized BiCMOS process having up to six polysilicon [10]
and four metallization layers.
Much of the progress seen in silicon SRAM design can be credited to
rapidly improving process technology. However, innovative circuit design
techniques have also been developed. Plots showing the features of various
SRAMs presented at ISSCC, VLSIC, and CICC since 1980 are given in Fig.
2. The first graph plots SRAM capacity per nanosecond access time (k-bits/ns)
on a logarithmic scale. This graph clearly indicates a trend of improving
access time efficiency. The highest rated silicon SRAM is a 4.5 Mbit, 1.8
ns design fabricated with a 0.3um, 2-polysilicon BiCMOS process [11].
The second graph plots SRAM power efficiency in k-bits/ns-W. This graph
indicates a similar trend in SRAM power efficiency. The most efficient
high capacity SRAM is a 4 Mbit, 180 mW design fabricated with a 0.3um,
4-polysilicon BiCMOS process [12].
Fig 2. Silicon SRAM designs presented at ISSCC,
VLSIC and CICC since 1980
Two types of memory cells are used most commonly in silicon SRAM designs.
High capacity designs that give priority to cell density usually employ
the four transistor (4T) passive load cell shown in Fig.
3. The additional polysilicon layers in specialized SRAM processes
make it possible to implement the passive load resistors in a compact fashion.
Fig. 3 also illustrates the rate at which
silicon 4T cell sizes have decreased since 1980. The smallest 4T cell presented
to date measures 2.3 um^2 and was fabricated using a 0.25um 5-polysilicon
CMOS process [13].
Embedded memories are another major SRAM application. Such designs must
be fabricated with more general CMOS logic processes without additional
polysilicon layers. Under these circumstances, the 6T active load SRAM
cell, shown in Fig. 4, becomes more appropriate.
This cell requires more area, but dissipates less static power than its
4T counterpart. Fig. 4 illustrates the
rate at which silicon 6T cell sizes have decreased over the years. The
smallest 6T SRAM cell presented to date measures 7.82 um^2 and
belongs to the previously mentioned 4.5 Mbit SRAM that rates highest in
access time efficiency.
Fig 3. 4T silicon SRAM cell area
Fig 4. 6T silicon SRAM cell area
A recent arrival on the silicon embedded memory scene is the three-transistor
(3T) DRAM cell shown in Fig. 5. This cell
exploits the reasonably high capacitance associated with MOSFET gates to
store a data value on the gate of transistor M2. The gate of transistor
M1 can be lengthened to reduce the static leakage from this storage node.
Transistor M3, when asserted by the read word line, allows the data value
to be read from the single read bit line. 3T DRAM cells can be practical
for embedded memory applications because they do not require specialized
processing steps to fabricate the data storage capacitors.
Fig 5. Three-transistor DRAM cell
The 3T cell is more compact than its 6T counterpart allowing for much
greater cell densities. A recently reported non-commercial RISC/DSP processor
[14] fabricated with a 0.5um 2-metal
CMOS process employs a 3T DRAM cell in its embedded caches. This cell measures
32 um^2, which is about 40% smaller than an average 6T cell fabricated
with a similar process.
Since the 3T cell is dynamic, additional circuitry is required to periodically
refresh the data. It also uses a single bit-line for reading data, which
requires more complicated sensing circuitry than a double bit-line scheme.
Despite the added density it provides, the 3T DRAM cell is an unlikely
candidate for large scale commercial use because of reliability problems
associated with sensing data from single bit-line architectures.
2.2 GaAs SRAM
Since their introduction in 1980 [15]
GaAs-based SRAMs have been designed in many processing technologies, including
E/D MESFET, C-JFET [16,17]
and CGaAs. GaAs SRAMs are not yet competitive with silicon designs from
density, capacity or efficiency standpoints. Much of this is due to the
immaturity of current GaAs IC processing technologies, which have coarser
design rule geometries and inferior integration levels.
Although great progress has been made in the development of GaAs SRAM,
it appears as though a wide variety of design goals are being pursued.
Fig. 6 plots the access time and power
efficiency characteristics of GaAs SRAMs presented at the GaAs IC Symposium
and ISSCC since 1982. As illustrated by the graphs, there is not an obvious
trend in design improvement as measured by these metrics.
Fig 6. GaAs SRAM designs presented at the
GaAs IC Symposium and ISSCC since 1980
The largest GaAs SRAM fabricated to date is a 64 k-bit monolithic chip
designed for the Fujitsu VPP500 vector parallel processor [18]
in 1991. This E/D MESFET DCFL design is quite fast for its capacity but
comes at the power dissipation cost of 5.9 watts. The most power efficient
GaAs SRAM was designed in 1994 using Motorola's 0.7um CGaAs process. This
4 k-bit SRAM dissipates only 16.2 mW while providing a 5.3 ns access time
[19].
GaAs processing technologies have not yet provided a suitable material
for highly resistive passive loads. Therefore, 4T SRAM cells are not practical
in GaAs. Since GaAs HIGFET gates are essentially high turn-on voltage diodes
they cannot be used to store a binary data value. This makes the use of
3T DRAM cells impractical. Designers have had little choice but to use
6T SRAM cells exclusively in past designs.
The smallest GaAs 6T cell reported to date measures 236 um^2
and was used in the previously mentioned Fujitsu 64 kbit SRAM. This cell
is over 30 times larger than the smallest reported silicon 6T SRAM cell.
Despite this large difference, continual improvements in GaAs 6T SRAM cell
area have been made, as illustrated in Fig.
7.
Fig 7. GaAs 6T SRAM cell area
3. Cache Design Issues
There are several issues of high performance cache design that must be
addressed if an optimal memory subsystem is to be achieved. These issues
range from circuit implementation techniques to system level planning.
Much has been published on these topics, providing the designer
with a rich supply of ideas.
3.1 Circuit Issues
The circuit implementation strategies of embedded logic designs also apply
to cache hierarchies. When deciding upon a circuit style or design technique
one must first consider the ramifications of the technology being used.
From an understanding of the IC technology comes an estimate of the available
integration levels and operating voltages. A cache circuit designer must
choose appropriate buffer styles for the technology being used (push-pull,
full complementary, etc.) and develop a technique for optimally sizing
them. The decision to use static or dynamic address decoding schemes has
a large impact on overall circuit speed and complexity. Other circuit issues
include precharging schemes (full VDD or VDD - Vtn), bit-line sensing schemes
[20-23]
(voltage-mode, current-mode or latching), multiplexer styles (restoring
or pass-gate), bit-line architectures (single or double), and memory cell
styles (static or dynamic, 4T or 6T, etc.). The choice of an appropriate
operating mode [24] (asynchronous, synchronous,
or wave-pipelined) is a crucial decision that must be made carefully. Cache
designers must also consider whether and how to use row and/or column redundancy
[25] to increase yield.
3.2 Organizational Issues
A cache's microarchitectural organization will have a dramatic impact on
system performance. One of the most important organizational issues that
every cache designer must address is index associativity [26].
Some associativity techniques that have been employed in the past include
direct-mapped [27], set-associative [28,29],
fully-associative, skewed-associative [30-32],
column-associative [33], hybrid associative
[34,35]
and predictive sequential-associative [36].
Other organizational issues include determining the best line size and
how many columns should be "folded" or multiplexed together. Cache designers
must also make the best trade-off between capacity and access time as well
as, in some cases, determining how to implement multi-ported and/or multi-banked
designs.
3.3 System Level Issues
System level issues must also be considered in cache hierarchy design [37].
One must determine how many levels of cache should be employed in a hierarchy
and how large and fast each should be [38-40].
The use of virtual or physical indexing and tagging [41]
affects the virtual memory subsystem dramatically as well as the decision
to use split or unified caching schemes. Cache designers must also determine
the best write policies [42] and prefetching
schemes [43-47]
for their memory hierarchy.
3.4 Low Integration Cache Design Issues
When designing a cache memory hierarchy in an integration level-limited
IC technology, such as CGaAs, a number of other important issues emerge.
Low integration levels severely limit the capacity of embedded memories
and often require the extensive use of off-chip cache. There are two main
microarchitectural consequences of moving normally embedded cache memories
off-chip. The first of these relates to the additional latency incurred
when accessing these memories. Tolerating this latency might require additional
pipeline stages, which can increase the branch misprediction penalty. Another
technique for tolerating this latency involves increasing the amount of
data retrieved each cycle from off-chip caches. This leads to the second
major consequence--the processor's I/O bandwidth requirements increase.
The task of supporting the memory bandwidth requirements of a high frequency
processor core is more difficult when using off-chip caches. Data busses
between the processor and cache memory need to be both wide and fast. Considering
that most processors are already I/O limited, multi-chip cache memory hierarchy
designers must develop alternate implementation strategies to provide the
processor with additional high bandwidth I/O ports. One method that can
be used to increase a processor's I/O bandwidth capabilities is flip-chip
mounting and area interconnect, which is described in Section 7 of this
proposal.
4. Preliminary CGaAs SRAM Research
Preliminary CGaAs SRAM research must be conducted as a precursor to designing
a full scale CGaAs cache memory hierarchy. The goal of this research is
to determine the feasibility of high capacity, high speed CGaAs SRAM that
can be organized as cache memory. Once a successful CGaAs SRAM has been
designed, fabricated and tested, an effective CGaAs cache memory hierarchy
can be designed.
4.1 40 k-bit SRAM Chip Overview
As a part of this research, a 40 k-bit CGaAs SRAM has been designed. This
chip, illustrated in Fig. 8, is organized
as 4 KBytes of data cache with tags. The memory was designed to have the
maximum capacity that the CGaAs process would allow at the time it was
designed while still providing a sub-nanosecond access time (based on preliminary
SPICE models). This synchronous memory employs a double bit-line architecture,
push-pull buffers, static decoders, VDD - Vtn precharging, latch sense
amplifiers and pass-gate multiplexers.
Fig 8. 40 k-bit CGaAs SRAM
The SRAM is comprised of 256 direct-mapped lines. Each line includes
four 32-bit words, a 32-bit tag, two attribute bits, and a valid bit. Thus,
the total storage capacity is 40.75 k-bits. There is a primary word read/write/invalidate
port and a secondary line write port. Data written into the SRAM through
the secondary port can be read through the primary port in the same cycle.
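The capacity figures follow directly from the line organization. The short Python check below is only an illustrative calculation; the constant names are hypothetical, and it assumes that one k-bit means 1,024 bits, which is the interpretation that reproduces the 40.75 k-bit total.

```python
LINES = 256            # direct-mapped lines
WORDS_PER_LINE = 4
WORD_BITS = 32
TAG_BITS = 32
ATTRIBUTE_BITS = 2
VALID_BITS = 1

# Every line stores its data words plus tag, attribute, and valid bits.
bits_per_line = WORDS_PER_LINE * WORD_BITS + TAG_BITS + ATTRIBUTE_BITS + VALID_BITS
total_bits = LINES * bits_per_line
total_kbits = total_bits / 1024          # assumes 1 k-bit = 1,024 bits

# Data-only capacity, excluding tag and status bits.
data_bytes = LINES * WORDS_PER_LINE * WORD_BITS // 8

print(bits_per_line, total_bits, total_kbits, data_bytes)
```

Under these assumptions each line holds 163 bits, the array holds 41,728 bits (40.75 k-bits), and the data portion alone is 4,096 bytes, consistent with the 4 KByte data cache organization.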
A 6T SRAM cell that measures 19.9 um x 9.2 um (183 um^2) is
used in the design. Although this does not compare well with cells fabricated
with current CMOS processes, it does represent a 22.5% decrease from the
smallest reported GaAs SRAM cell to date.
The SRAM peripheral circuits were optimized for both speed and power.
A mixture of complementary and direct-coupled FET logic (DCFL) is used
to achieve the performance goal with minimal power dissipation. Low-power,
current-controlled latch sense amplifiers [48]
provide fast data resolution times. The extremely low CGaAs threshold voltages
are exploited by using push/pull address and word line buffers that provide
good drive strength with fast propagation times. Fig.
9 illustrates the SRAM's read/write circuitry.
Fig 9. CGaAs SRAM read/write circuitry
4.2 Design Methodology
Obtaining a sub-nanosecond access time was the highest design priority.
In order to meet this target, a mixture of logic styles was used to provide
the necessary access time at the lowest possible power dissipation. DCFL-based
circuits are advantageous for speed-critical paths since they have low
gate delays. However, this speed increase comes at the expense of higher
power dissipation and decreased noise immunity due to their high VOL. On
the other hand, complementary circuits benefit from much lower power dissipation
and excellent noise immunity, but suffer from longer gate delays. Cells
were initially designed using DCFL circuits to provide the lowest propagation
delays. Once a sub-nanosecond access time was guaranteed, non-speed-critical
DCFL components were replaced by complementary circuits in order to reduce
power dissipation and increase noise immunity.
The metallization used in the CGaAs process is not optimal for SRAM
design. Reducing the parasitic capacitances associated with inter- and
intra-cell interconnect proved to be a matter of crucial importance. In
order to tightly manage these parasitics, the SRAM was designed with a
full-custom layout methodology using Mentor Graphics' IC Station as the
primary layout tool. This approach provided significant reductions in parasitic
capacitances and helped lower the access time. Circuit simulations were
conducted with Meta Software's HSPICE using Motorola device models and
parasitics extracted from the circuit layouts.
I developed transistor sizing tools to design buffers with maximum drive
strength and a minimum propagation time. The first of these tools creates
variant SPICE models of a buffer using the constant ratio tapering method
[49]. This tool simulates a design using
HSPICE and reports the results, allowing the optimal buffer design to be
quickly identified. The second tool resembles the first except that it
sizes buffers based on Cherkauer and Friedman's constant capacitance-to-current
ratio tapering method [50].
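The idea behind constant ratio tapering can be sketched in a few lines of Python. This is not the tool described above: it replaces the HSPICE simulation of each variant with a first-order analytic delay estimate (stage count times taper factor, in units of a unit-inverter delay, ignoring self-loading), and all function names and the example load ratio are hypothetical.

```python
def tapered_buffer_candidates(c_load, c_in, max_stages=10):
    """Enumerate constant-ratio tapered buffers for a given load.

    Each candidate has n stages whose sizes grow by a fixed taper factor f,
    chosen so that f**n equals c_load / c_in.
    """
    ratio = c_load / c_in
    candidates = []
    for n in range(1, max_stages + 1):
        taper = ratio ** (1.0 / n)
        sizes = [taper ** i for i in range(n)]   # relative to the first stage
        delay = n * taper                        # first-order estimate only
        candidates.append((n, taper, sizes, delay))
    return candidates


def best_buffer(c_load, c_in, max_stages=10):
    """Return the candidate with the smallest estimated delay."""
    return min(tapered_buffer_candidates(c_load, c_in, max_stages),
               key=lambda candidate: candidate[3])


# Hypothetical example: a load 1000 times the buffer's input capacitance.
stages, taper, sizes, delay = best_buffer(c_load=1000.0, c_in=1.0)
print(stages, round(taper, 3), round(delay, 2))
```

With this simplified delay model the preferred taper factor lands close to e, so the estimate is best treated as a starting point that circuit simulation would then refine.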
4.3 Design for Testability
Fig. 10 is a block diagram of the SRAM
core's 276 I/O signals. Directly testing this number of high speed signals
is very difficult and expensive. In order to greatly reduce the number
of signals brought off chip, rows of scannable flip-flop chains, organized
according to Fig. 11, were added to the
cache's inputs and outputs.
Fig 10. Cache core I/O signals
Fig 11. Cache chip testability structures
The chip has two operating modes: low-speed-sequential and high-speed-parallel.
In low-speed-sequential mode the flip-flops are connected serially to form
long, sequentially-scannable chains. In this mode up to four cache read/write/invalidate
commands can be scanned into the flip-flop chains at a low frequency. Then
the chip can be placed into high-speed-parallel mode which changes the
flip-flop connections from sequential to parallel. The SRAM can then execute
these four instructions at a high clock rate. Results are captured by similarly
organized output flip-flop chains. Upon completion of the high speed test,
the SRAM is placed back into low-speed-sequential mode and the results
are scanned off chip at a low frequency. This enhancement reduces the number
of high speed signals from 276 to 3 (fast CLOCK, fast CLOCK_BAR, and MODE).
AND gates disable the L2WRITE and L1WRITE signals to prevent erroneously
altering memory contents while commands are being scanned into the flip-flop
chains.
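The two operating modes can be pictured with a small behavioral model. It is a sketch at the command level rather than the bit level, and the class name, method names and command encoding are all hypothetical; the chip itself shifts individual bits through its flip-flop chains.

```python
class ScanTestHarness:
    """Command-level model of the scan-in, burst, scan-out test flow."""

    DEPTH = 4  # the text allows up to four commands per high speed burst

    def __init__(self):
        self.mode = "sequential"
        self.in_chain = [None] * self.DEPTH
        self.out_chain = [None] * self.DEPTH
        self.memory = {}

    def scan_in(self, command):
        """Low-speed-sequential mode: shift one command into the chain.

        Memory is never touched here, mirroring the gated write enables.
        """
        if self.mode != "sequential":
            raise RuntimeError("scan_in requires sequential mode")
        self.in_chain = [command] + self.in_chain[:-1]

    def run_burst(self):
        """High-speed-parallel mode: execute the loaded commands in order."""
        self.mode = "parallel"
        # The oldest command has been shifted furthest, so run from the tail.
        for slot, command in enumerate(reversed(self.in_chain)):
            self.out_chain[slot] = self._execute(command)
        self.mode = "sequential"

    def _execute(self, command):
        if command is None:
            return None
        op, addr, data = command
        if op == "write":
            self.memory[addr] = data
            return None
        if op == "read":
            return self.memory.get(addr)
        if op == "invalidate":
            self.memory.pop(addr, None)
            return None
        raise ValueError(f"unknown command: {op}")

    def scan_out(self):
        """Low-speed-sequential mode: return the captured results."""
        results = list(self.out_chain)
        self.out_chain = [None] * self.DEPTH
        return results


if __name__ == "__main__":
    harness = ScanTestHarness()
    for cmd in [("write", 3, 0xAB), ("read", 3, None),
                ("invalidate", 3, None), ("read", 3, None)]:
        harness.scan_in(cmd)
    harness.run_burst()
    print(harness.scan_out())
```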
4.4 Final Chip Statistics
Table 1 gives the SRAM chip statistics.
Since the chip has not yet been fabricated or tested, these statistics
are based on preliminary simulations. The strength of this design is its
ability to rate highly in both speed and efficiency--something previous
GaAs SRAMs have not successfully accomplished. At 17.7 k-bits/ns this design
ranks second in access time efficiency among reported GaAs SRAMs. With
a rating of 15.4 k-bits per ns-W the design also ranks second in power
efficiency.
Table 1: SRAM chip statistics
| Process Technology | 0.5 um CGaAs, 3 metal layers |
| Power supply | 1.5 V |
| Capacity | 40.75 k-bits |
| Chip Size | 5.6 mm x 5.6 mm (31.4 mm2) |
| Transistors | 298,497 |
| Access time (slow - fast corners) | 3.1 ns - 2.3 ns |
| Maximum simulated operating frequency (slow - fast corners) | 300 - 400 MHz |
| Power dissipation @ 400 MHz (Embedded memory core with peripheral circuits) | 1.15 W |
| Power dissipation @ 400 MHz (Including test structures and clock buffers) | 2.52 W |
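The two efficiency figures quoted above can be reproduced from Table 1. The short check below assumes that they are based on the fast-corner access time and on the power of the embedded core with its peripheral circuits; that pairing is an inference from the numbers, not something stated explicitly, and the function name is hypothetical.

```python
def sram_efficiency(capacity_kbits, access_time_ns, power_w):
    """Return (access time efficiency, power efficiency).

    Access time efficiency is capacity per unit access time (k-bits/ns);
    power efficiency divides that figure by power (k-bits per ns-W).
    """
    access_eff = capacity_kbits / access_time_ns
    power_eff = access_eff / power_w
    return access_eff, power_eff


if __name__ == "__main__":
    # Values from Table 1: 40.75 k-bits, 2.3 ns (fast corner), 1.15 W.
    access_eff, power_eff = sram_efficiency(40.75, 2.3, 1.15)
    print(f"{access_eff:.1f} k-bits/ns, {power_eff:.1f} k-bits per ns-W")
```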
5. SRAM Compilers
The full-custom design approach has several advantages over semi-custom
methodologies including the ability to more closely manage the area efficiency
of a layout. It has a major drawback, however: it is extremely laborious
and tedious. A full-custom design requires significantly more time to complete
than if automated layout generators are used. For this reason it is not
practical to use full-custom methodologies to explore the vast capacity-power-delay
memory design space.
Although transistor sizing tools were used to optimize many of the 40
k-bit CGaAs SRAM's circuits, the design represents only a single point
in the large power-delay-area design space. Until this space has been more
fully explored it cannot be concluded with any certainty that this design
has been truly optimized according to any single metric or combination
of metrics. This task could be quickly accomplished with a CGaAs SRAM compiler.
5.1 Transistor Sizing Algorithms
The heart of an effective SRAM compiler is its transistor sizing engine.
This routine is responsible for sizing a set of transistors for the SRAM's
core and peripheral circuits to meet specified objectives. Transistor sizes
have a direct impact on speed, power dissipation and layout area. An effective
transistor sizing algorithm is capable of optimizing a design based on
one or a combination of these factors. A tremendous amount of research
has been devoted to the development of effective transistor sizing algorithms.
Because of the enormity of this effort, only a brief summary of the capabilities
and optimization techniques employed in past transistor sizing tools will
be given here.
Most transistor sizing algorithms have been developed to optimize the
power and/or area of generic logic structures subject to timing constraints.
These logic structures are given in the form of a circuit netlist annotated
with parasitic capacitances and resistances. There are four general approaches
to solving the transistor sizing problem. The first method transforms the
circuit structure into a distributed RC network and models the delay through
it as a classical nonlinear optimization problem. Geometric programming
techniques are then employed to exactly solve this problem mathematically.
Arriving at this solution often involves defining the nonlinear optimization
problem in terms of posynomial functions that can be transformed into a
set of convex functions by means of an exponential transformation. A variety
of exact or numerical methods can then be employed to find a local minimum,
and hence, the global minimum of each convex subproblem. Numerous algorithms
and sizing tools [51-57]
have been developed that employ this technique. However, this method suffers
from its inherent complexity and long computation time. Furthermore, it
may not converge to a solution if the initial optimization problem is sufficiently
complex. It should also be noted that these solutions are only exact for
the given parasitics used to construct the original distributed RC network.
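The exponential transformation mentioned above can be written out briefly. The notation is assumed for illustration: the x_i are positive design variables such as transistor widths, the c_k are positive coefficients and the a_ik are real exponents.

```latex
% A posynomial in the positive variables x_1, ..., x_n
\[
  f(x) = \sum_{k=1}^{K} c_k \prod_{i=1}^{n} x_i^{a_{ik}},
  \qquad c_k > 0,\; x_i > 0
\]
% Exponential change of variables x_i = e^{y_i}
\[
  f(e^{y}) = \sum_{k=1}^{K} \exp\Big( \sum_{i=1}^{n} a_{ik}\, y_i + \ln c_k \Big)
\]
% Taking the logarithm gives a log-sum-exp of affine functions of y
\[
  \ln f(e^{y}) = \ln \sum_{k=1}^{K} \exp\Big( \sum_{i=1}^{n} a_{ik}\, y_i + \ln c_k \Big)
\]
```

Because a log-sum-exp of affine functions is convex in y, a problem whose objective and constraints are all posynomials becomes a convex program after this substitution, which is why any local minimum found is also the global one.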
The second technique employs heuristic methods to achieve near-optimal
solutions to the nonlinear optimization problem [58-60].
This method identifies those transistors that are most sensitive to the
objective and incrementally sizes them in order to achieve the desired
effects without violating a constraint. Heuristic algorithms cannot guarantee
optimal results but require far less computation time than rigorous mathematically
exact methods.
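A sensitivity-driven heuristic of this kind might look like the following sketch. It is not the algorithm of any cited tool: the inverter-chain delay model, the unit source resistance and the function names are assumptions chosen to keep the example small.

```python
def chain_delay(widths, c_load, r_src=1.0):
    """Toy delay of an inverter chain with the given stage widths.

    Each stage's delay is its load divided by its own width; the first
    stage also loads a source of resistance r_src.
    """
    loads = widths[1:] + [c_load]
    return r_src * widths[0] + sum(load / w for w, load in zip(widths, loads))


def greedy_size(n_stages, c_load, target_delay, step=0.1, max_iters=100_000):
    """Start at minimum size and upsize the most sensitive stage each pass.

    Every candidate move adds the same area (step), so the largest delay
    reduction is also the best reduction per unit of added area.
    """
    widths = [1.0] * n_stages
    for _ in range(max_iters):
        current = chain_delay(widths, c_load)
        if current <= target_delay:
            return widths
        best_gain, best_index = 0.0, None
        for i in range(n_stages):
            trial = widths.copy()
            trial[i] += step
            gain = current - chain_delay(trial, c_load)
            if gain > best_gain:
                best_gain, best_index = gain, i
        if best_index is None:
            raise ValueError("target delay not reachable by upsizing")
        widths[best_index] += step
    raise RuntimeError("iteration limit reached")


if __name__ == "__main__":
    sized = greedy_size(n_stages=3, c_load=64.0, target_delay=15.0)
    print([round(w, 1) for w in sized], round(chain_delay(sized, 64.0), 2))
```

The loop stops as soon as the timing constraint is met, so the result is a small but not provably minimal design, which matches the near-optimal character of heuristic methods described above.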
The third approach utilizes both mathematical and heuristic methods
to achieve good results that are less computationally expensive. These
hybrid techniques [61,62]
use geometric programming methods to provide an initial starting point,
then use heuristic methods to find a near-optimal solution. Thus, a more
accurate solution can be obtained with only a small increase in computation
time.
The fourth method of optimizing transistor sizes is based on a stochastic
modeling of circuit responses [63]. In
this technique the values of the complex nonlinear objective functions
are viewed as random variables with known distributions. This view allows
variations in process parameters to be accounted for. Stochastic modeling
allows the algorithm to globally view and efficiently optimize the entire
circuit instead of just the speed-critical path. However, this approach
resembles heuristic methods in its ability to provide only near-optimal
solutions.
A number of other interesting capabilities have been introduced into
transistor sizing tools. Some of these capabilities include using SPICE
to conduct iterative circuit simulations [64],
using a static timing analyzer to identify speed-critical circuitry [65,66],
using an interactive interface to allow the designer to control the direction
of the search efforts [67-69],
allowing multiple logic styles to be introduced into the design [70],
providing gate-based instead of transistor-based optimization [71],
and optimizing transistors in a multiple threshold voltage technology [72].
The most powerful sizing tools, however, extract parasitic capacitances
and resistances from the layout and conduct both transistor sizing and
layout compaction in their optimizing loops [73-76].
SRAM core and peripheral transistor sizing can be viewed as a subset
of the traditional transistor sizing problem. An SRAM has a fixed organization
that is known beforehand. Therefore, its sizing engine can employ specialized
heuristics that are both accurate and fast. The sensitivities of various
transistors on read and write access times are well understood, making
the judicious choice of transistors for incremental adjustment comparatively
easy.
5.2 Past SRAM Compilers
The first published SRAM compiler, known as RAMGEN [77],
was developed at Texas Instruments in 1986. This tool simply connected
previously designed leaf cells into a parameterized SRAM configuration
that could be fabricated with either a 2.0 or 3.0um CMOS process. It also
generated corresponding HDL descriptions and performance datasheets of
the design. No iterative analysis or rebuilding steps were conducted to
further optimize the design because the tool was strictly a layout generator.
Since RAMGEN's introduction, numerous SRAM compilers have been developed.
In 1990, Texas Instruments completed a BiCMOS multi-port SRAM compiler
[78,79]
capable of generating designs with up to 8 k-bits of capacity. Although
this tool was capable of producing a larger variety of SRAMs, it was still
just a layout generator that did not employ iterative analysis or optimization
steps.
In this same year, VLSI Technology, Inc. and LSI Logic Corp. introduced
SRAM compilers for their ASIC customers. The VLSI Technology compiler [80]
could build memories with up to 128 k-bits of capacity in a 1.0um CMOS
technology. It used a proprietary silicon compilation language known as
"SLICE" to both generate and connect SRAM leaf cells. This provided the
customer with greater ease and flexibility in altering memory organizations.
The LSI Logic MEMCOMP compiler [81] could
build memories with up to six ports in either a 1.5um or 1.0um CMOS process.
It used predesigned full-custom leaf cells as its building blocks. Both
of these tools, however, were still just layout generators that did not
employ iterative optimization techniques.
A breakthrough in SRAM compiler technology was made in 1990 when Mitsubishi
Electronics introduced their MAC2 memory compiler based on Cascade Design
Automation's Compiler Development System (CDS) [82].
CDS is a silicon compilation language that provides true process independence
by parameterizing both device sizes and technology design rules. This compiler
was also the first to use an iterative, constraint-driven optimization
technique for buffer sizing. During each iteration, predefined lookup tables
were used to help speed up the analysis of a given layout. This tool was
capable of creating SRAM designs with up to six ports and 36 k-bits of
capacity.
In 1991, Motorola's ASIC division in cooperation with Mentor Graphics
Corp. developed the Memorist SRAM compiler [83].
Memorist was based on Mentor's GDT tool suite and was capable of designing
0.7um CMOS SRAMs with up to two ports and 256 k-bits of capacity. It resembled
the MAC2 compiler by implementing an iterative optimization technique to
size transistors, but employed this technique in sizing the entire critical
path circuitry instead of just the buffers. Memorist had another key advantage
over MAC2 by allowing the designer to choose between using predefined lookup
tables or SPICE to analyze critical path circuitry.
The first GaAs SRAM compiler, known as Aurora [84,85],
was developed at The University of Michigan in 1994. This tool was capable
of generating and optimizing E/D MESFET SRAMs having up to 8 k-bits of
capacity. A novel current mirror memory cell [86]
(CMMC) was employed as the storage node in these designs. Like MAC2, the
compiler achieved process independence by using CDS to both generate
and connect SRAM leaf cells. It also adopted Memorist's use of SPICE,
employing it as the primary circuit analysis tool. Aurora implemented
an iterative, constraint-driven transistor sizing
algorithm to build memories that were optimized for speed, power, area
or a combination of these metrics. Its heuristic gradient search algorithm
also allowed a large power-delay-area design space to be searched rapidly.
5.3 Current Commercial SRAM Compilers
There are two categories of commercial SRAM compilers available today:
gate-array compilers and "all-layer" or embedded compilers. Gate array
SRAM compilers have been developed by commercial ASIC vendors such as Texas
Instruments and Motorola, and also by CAD companies such as Synopsys. These
compilers have been designed to create memories from metal programmable
gate array ICs. Synopsys' Cell-based-array Memory Architect provides SRAM
designs that can be configured in a number of aspects including single
or multi-port, speed or power optimized, synchronous or asynchronous timing,
available capacity, and height-width aspect ratio.
Embedded CMOS SRAM compilers are currently available from Mentor Graphics,
Cadence Design Systems, Artisan Components and Cascade Design Automation.
Mentor Graphics' Memory Builder is a graphical abstract floorplanner that
can be used to create a parameterized memory generator. This tool iteratively
assembles, extracts and characterizes memory layouts. It is not a true
optimizing compiler, but merely a layout generator. However, it does provide
the ability to create an optimizing front end that iteratively calls the
tool to create variant SRAM layouts.
Cadence's Structure Compiler is similar to the Mentor Graphics product.
It is a tiling engine that can be used to generate layouts that are comprised
of complex, repetitive structures that are either full-custom or parameterized
leaf cells. This tool also has a programmable, procedural interface instead
of providing its own optimizing front end. Cadence is also developing an
iterative, optimizing "all layer" SRAM compiler that is to be completed
by the fall of 1997.
One of the most advanced commercial SRAM compilers is Artisan Components'
Process Perfect memory generator. This "all-layer" compiler is truly process
independent and is capable of generating memory layouts that can be optimized
according to a wide variety of design goals. It also implements an iterative,
optimizing transistor sizing tool that uses a timing characterization curve-fitting
technique that is within 2% of SPICE and a power characterization curve-fitting
technique that is within 5% of SPICE. This compiler can also generate custom
built-in self test (BIST) controllers for its embedded memories.
Cascade Design Automation has recently released their HSLP (High Speed
Low Power) CMOS SRAM compiler. This compiler is similar to the MAC2 compiler
in that it utilizes CDS to generate, pitch-match, and tile leaf cells into
an SRAM configuration. However, since HSLP was developed in-house, it employs
additional undocumented CDS features which were not available to the MAC2
developers. This tool also implements an optimizing, iterative transistor
sizing algorithm and device lookup tables to meet required timing constraints.
5.4 Proposed CGaAs SRAM Compiler
I propose to develop a CGaAs SRAM compiler as part of my Ph.D. research.
Such a tool would accelerate the search of the large SRAM design space
and would aid in producing optimized designs. It is desirable that this
tool have many of the advanced features found in today's commercial SRAM
compilers. It is also important not to "reinvent the wheel" by designing
tools that are already commercially available. This can be accomplished
by building a compiler that employs today's most powerful commercial CAD
tools in designing optimized IC layouts.
Some of the most powerful features of current and past SRAM compilers
have been packaged and sold in the form of commercial CAD tools. Some of
these features include automated leaf cell generators and compactors, hierarchical
leaf cell tiling that performs inter-cell pitch-matching, accurate parasitic
capacitance extraction from an IC layout, reliable device-based circuit
simulation, IC fabrication design rule checking, and layout versus schematic
verification. With such tools already available, most of the necessary
features of a powerful CGaAs SRAM compiler can be easily realized.
I will develop an SRAM compiler that employs commercial CAD tools from
several different vendors. Some of these tools include Mentor Graphics'
IC Station, Cascade Design Automation's Masterport cell generator/compactor,
Synopsys' Arcadia parasitic RC extractor, Meta Software's HSPICE circuit
simulator, and Cadence Design Systems' Dracula layout verification tool.
This compiler would provide true process design rule independence by
using Masterport to generate and compact parameterized, full-custom leaf
cells that were originally designed with IC Station. Given a set of process
design rules, power rail widths and transistor sizes, this tool will automatically
generate pitch-matched, compacted leaf cells that can be readily tiled
into a given SRAM configuration.
A heuristic transistor sizing algorithm will be developed to iteratively
size, build, and analyze SRAM layouts until the power and/or area objective
has been reached subject to the specified access time constraint. This
iterative loop will involve tiling the core and peripheral leaf cells into
an SRAM critical path layout using Cascade's Compiler Development System
(CDS). Arcadia will then be used to extract the parasitic capacitances
from the layout and to create an annotated SPICE netlist. HSPICE will then
simulate the design and verify both correct functionality and reliable
operation in the presence of IC process variations. Fig.
12 illustrates the compiler's iterative optimization method. Once an
optimal layout has been achieved, the final and complete layout will be
assembled using CDS. This layout will then be design rule checked and undergo
layout versus schematic verification using Dracula.
Fig 12. CGaAs SRAM compiler iterative optimization method
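The control flow of such a loop can be sketched with each tool hidden behind a plain callable. None of the functions below invoke or model the real CDS, Arcadia or HSPICE interfaces; the toy stand-ins, their formulas and all names are assumptions made only so the sketch runs on its own.

```python
from dataclasses import dataclass


@dataclass
class Analysis:
    access_time_ns: float
    power_w: float


def optimize(sizes, build, extract, simulate, resize, max_access_ns,
             max_iterations=50):
    """Iterate size -> build -> extract -> simulate until resize() stops.

    Returns the lowest-power (sizes, Analysis) pair that met the access
    time constraint, or None if no iteration met it.
    """
    best = None
    for _ in range(max_iterations):
        layout = build(sizes)        # tiling step (CDS in the text)
        netlist = extract(layout)    # parasitic extraction (Arcadia)
        result = simulate(netlist)   # circuit simulation (HSPICE)
        meets = result.access_time_ns <= max_access_ns
        if meets and (best is None or result.power_w < best[1].power_w):
            best = (dict(sizes), result)
        sizes = resize(sizes, result)
        if sizes is None:
            break
    return best


# Toy stand-ins so the sketch runs; none of these model a real tool.
def toy_build(sizes):
    return dict(sizes)


def toy_extract(layout):
    return {"width": layout["driver"]}


def toy_simulate(netlist):
    w = netlist["width"]
    return Analysis(access_time_ns=1.0 + 4.0 / w, power_w=0.5 + 0.1 * w)


def make_resizer(max_access_ns, step=0.5):
    """Upsize the driver while timing fails; return None to stop."""
    def resize(sizes, result):
        if result.access_time_ns <= max_access_ns:
            return None
        return {"driver": sizes["driver"] + step}
    return resize


if __name__ == "__main__":
    best = optimize({"driver": 1.0}, toy_build, toy_extract, toy_simulate,
                    make_resizer(3.2), max_access_ns=3.2)
    print(best)
```

Keeping the tools behind callables is a design choice for the sketch: it lets the sizing heuristic be exercised with fast models before the slower layout, extraction and simulation steps are connected.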
An additional feature that should be added to the compiler is the ability
to perform IR drop analysis on the SRAM layout while determining power
rail widths. The ability to use area interconnect can also be added, enabling
the core power rails to become smaller and the density of the SRAM layout
to increase. This compiler will provide a way for the large CGaAs power-delay-area
design space to be rapidly searched. It will also find a valuable application
in the refinement of the CGaAs IC fabrication process.
6. CGaAs Process Refinement
IC feature size reduction is a regular and ongoing process at IC manufacturing
facilities. The incremental reductions in minimum feature sizes, especially
gate length and metallization reductions, are in large part responsible
for the steady increase of microprocessor performance over the years. From
an IC design standpoint, the most desirable type of process improvement
is a linear shrink of all process design rules. This allows the designer
to easily port an IC layout from one process to another by simply mapping
the larger design to a grid with smaller dimensions. From a process engineering
standpoint, however, this is the most difficult and expensive type of process
shrink because it requires the reduction of all process design rules
in equal proportions.
6.1 Cost Effective IC Process Improvement Programs
A cost effective IC process improvement program must consider the return
on investment of each research dollar spent, measured in terms of overall
IC design improvement. To measure this improvement, one must consider that
each design rule shrink has its own unique cost function associated with
it. Of course, interdependencies between design rule improvements require
that the cost of an entire design rule improvement program be considered
as a whole. This function, C(P_1, P_2, ..., P_x), is measured
in dollars and accounts for the total research cost associated with reducing
design rules 1 through x by P_1 through P_x percent.
There is also an overall design improvement function, I(P_1, P_2, ..., P_x),
associated with this design rule improvement program. This function accounts
for the improvement in overall circuit density, speed and/or power dissipation
when reducing design rules 1 through x by P_1
through P_x percent. Given a fixed research budget B,
maximizing I(P_1, P_2, ..., P_x) subject to C(P_1, P_2, ..., P_x) = B
will yield the most cost effective IC process improvement program.
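Stated compactly, the program selection problem is a constrained maximization. The bounds on each percentage are an added assumption that simply reflects that a rule can be shrunk by between 0% and just under 100%.

```latex
\[
\begin{aligned}
  \max_{P_1,\dots,P_x} \quad & I(P_1, P_2, \dots, P_x) \\
  \text{subject to} \quad & C(P_1, P_2, \dots, P_x) = B, \\
  & 0 \le P_i < 100, \qquad i = 1, \dots, x
\end{aligned}
\]
```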
If both the cost function C and the design improvement
function I are accurately stated mathematically, geometric programming
techniques can be used to find an optimal set of design rule improvements
given a fixed research budget. The cost function C for
a mature IC technology, such as CMOS, can be derived by considering the
costs associated with past rule shrink efforts. Curve fitting techniques
can be used to derive a function that estimates the cost of future research
efforts based on past research efforts. However, this technique can be
inaccurate because the cost trends associated with past research efforts
do not necessarily characterize the costs associated with future research
efforts. This process is more difficult for less mature processes, such
as CGaAs, where there exists little or no cost data from prior rule shrink
efforts. In this case, cost functions must be derived using estimations
from IC process engineers.
Deriving the overall design improvement function I is
a more complicated matter requiring a different approach. It is extraordinarily
difficult, if not impossible, to accurately derive this function without
first creating a series of IC layouts with a set of incrementally adjusted
design rules. Once these layouts have been created, speed and power measurements
can be obtained through circuit simulation with extracted parasitics. A
process independent circuit compiler could be used to rapidly create and
simulate these IC layouts.
6.2 CGaAs SRAM Improvement
Compared to CMOS, CGaAs is not a mature IC fabrication process. The 0.5um
CGaAs design rules are coarser than those of a typical 0.5um CMOS process,
leading to static memory cells that are larger than their CMOS counterparts.
CGaAs SRAM density suffers as a result. To illustrate the immaturity
of CGaAs, Table 2 gives the sizes of several
embedded 6T SRAM cells that were fabricated with IC processes resembling
the 0.5um CGaAs process. The silicon cell sizes (some had to be estimated from
a die photo) range from 52 um2 on the Cyrix 6x86 to 66 um2
on the PowerPC 604e. The CGaAs 6T SRAM cell clearly suffers from a density
disadvantage.
Table 2: 6T SRAM cell sizes
| Microprocessor | IC process | Cell size (um2) |
| Sun UltraSPARC (167 MHz) | 0.45um 4 metal CMOS | 63 |
| Cyrix 6x86 (120 MHz) | 0.6um 3 metal CMOS | 52 |
| Digital 21164 (300 MHz) | 0.5um 4 metal CMOS | 63 |
| Motorola/IBM PowerPC 604e (133 MHz) | 0.5um 4 metal CMOS | 66 |
| Intel PentiumPro (166 MHz) | 0.4um 4 metal BiCMOS | 55 |
| PUMA | 0.5um 3 metal CGaAs | 183 |
CGaAs is an obvious candidate for significant design rule improvement.
In order to determine the most cost effective set of design rule shrinks,
the CGaAs SRAM compiler can be used to determine the overall design improvement
function of the previously stated optimization problem. Masterport provides
process independence by allowing all technology design rules to be parameterized.
The regeneration of leaf cells given a certain design rule change is an
easy and computationally inexpensive task. The heuristic transistor sizing
algorithm can be modified to optimize SRAM area/speed/power characteristics
subject to a design rule cost equality constraint. Doing so allows the
CGaAs SRAM compiler to automate the task of determining the cost effectiveness
of a given set of design rule shrinks. A heuristic gradient search algorithm
can then be used to iteratively employ the SRAM compiler to search the
multi-dimensional design rule improvement space for the most cost effective
set of design rule enhancements.
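One simple form such a search could take is a greedy walk that always buys the shrink with the best improvement per dollar. The sketch below is hypothetical throughout: the improvement and cost functions are toy formulas standing in for SRAM compiler runs and process engineers' estimates, and the budget is treated as an upper limit rather than the strict equality used in the formulation above.

```python
import math


def search_shrinks(n_rules, improvement, cost, budget, step=1.0):
    """Greedily raise per-rule shrink percentages within a budget.

    On each pass the rule whose next step yields the largest improvement
    per unit of added cost is shrunk further, until no step is affordable.
    """
    shrink = [0.0] * n_rules
    while True:
        base_gain, base_cost = improvement(shrink), cost(shrink)
        best = None
        for rule in range(n_rules):
            trial = shrink.copy()
            trial[rule] += step
            trial_cost = cost(trial)
            extra = trial_cost - base_cost
            if trial_cost > budget or extra <= 0:
                continue
            ratio = (improvement(trial) - base_gain) / extra
            if ratio > 0 and (best is None or ratio > best[0]):
                best = (ratio, rule)
        if best is None:
            return shrink
        shrink[best[1]] += step


# Toy stand-ins: diminishing returns on improvement, rising cost.
def toy_improvement(shrink, weights=(3.0, 1.0)):
    return sum(w * math.sqrt(p) for w, p in zip(weights, shrink))


def toy_cost(shrink, rates=(1.0, 1.0)):
    return sum(r * p * p for r, p in zip(rates, shrink))


if __name__ == "__main__":
    plan = search_shrinks(2, toy_improvement, toy_cost, budget=50.0)
    print(plan, toy_cost(plan), round(toy_improvement(plan), 2))
```

In a real study each call to the improvement function would be a full compiler run, so the number of evaluations per pass (one per design rule) is the quantity to keep small.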
7. PUMA Cache Memory Hierarchy
The memories produced by the CGaAs SRAM compiler will find their eventual
implementation into the PUMA PowerPC microprocessor. Due to the low integration
levels provided by this technology, PUMA will be a multi-chip implementation
of a PowerPC ISA subset. This DARPA funded project has been underway at
the University of Michigan since the fall of 1994.
7.1 MCM-based Microprocessors
Multi-chip microprocessor implementations face two major difficulties.
The first involves determining how to optimally divide the processor into
separate chips while the second involves providing high bandwidth communication
buses between these chips. Bandwidth can be defined as the number of bits
transmitted per cycle multiplied by the number of cycles per second. Bandwidth
can be improved by increasing the width of the bus and/or by increasing
the frequency at which it transmits data. Ideally, these inter-chip buses
would be both wide and fast.
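The bandwidth definition above is a single product, shown here with a hypothetical 64-bit bus; the bus width and clock rate are illustrative values, not PUMA specifications.

```python
def bus_bandwidth_bits_per_s(width_bits, clock_hz):
    """Bandwidth = bits transferred per cycle x cycles per second."""
    return width_bits * clock_hz


if __name__ == "__main__":
    # Hypothetical bus: 64 bits wide, clocked at 400 MHz.
    bits_per_s = bus_bandwidth_bits_per_s(64, 400_000_000)
    print(f"{bits_per_s / 1e9:.1f} Gbit/s = {bits_per_s / 8e9:.1f} GByte/s")
```

Doubling either the width or the clock rate doubles the result, which is why the text treats the two as interchangeable levers.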
Separately packaging each chip and mounting them on a printed circuit
board (PCB) is an unacceptable solution. Single-chip packages and PCB traces
have prohibitively large impedances, while long PCB traces exhibit pronounced
transmission line effects when operated at high frequencies. This packaging
style also limits the number of chip I/O and hence bus width. An alternative
to this approach is to use multi-chip module packaging with area interconnect.
A multi-chip module (MCM) is a substrate upon which multiple bare die
are directly fastened. ICs on an MCM are not individually packaged and
can be placed in closer physical proximity than on a PCB. The traces on
the MCM substrate which connect the bare die are shorter than PCB traces.
Research has shown that short MCM traces can maintain signal integrity
at much higher frequencies [87]. By dispensing
with individual packages, the large impedances associated with them are
eliminated. Since MCM traces are smaller, shorter, and have less impedance,
chip I/O drivers and receivers can be smaller and faster. Each of these
aspects leads to higher bus operating frequencies. Two processors that
utilize MCM packaging are the Sun/Hal SPARC64 [88]
and the IBM POWER2 [89]. Another advantage
to MCM packaging is the ability to use area interconnect.
7.2 Area Interconnect
Area interconnect is an implementation technique that allows an I/O port
to be placed anywhere within a chip's layout. This allows power and signal
lines to enter the chip directly where they are needed instead of from
the chip periphery. Adding I/O ports usually comes at the expense of increased
chip area when using a peripheral pad ring. With area interconnect, a chip
can have a vastly increased number of I/O with a smaller area penalty.
Flip-chip mounting makes area interconnect possible [90].
The approach we are using to connect bare die to an MCM substrate involves
forming metal bumps on a die's area I/O pads. The die is then turned upside-down
and fastened to the MCM substrate with an adhesive layer. Area I/O pads
make electrical and thermal connection to the MCM traces through the metal
bumps. Fig. 13 is an illustration of
a flip-chip mounted IC. The PUMA processor will be implemented with MCM-D
on ceramic packaging.
Fig 13. Flip-chip mounted ICs
7.3 Gold Bumping
Bare-die bumping processes, such as IBM's C4 [91],
use solder to attach bare die to an MCM substrate. Solder has many characteristics
that make it advantageous for bumping; it is inexpensive, has well understood
properties and is very pliable when heated. Since solder contains lead,
it is also a source of alpha particle emissions. As long as solder bumps
are kept away from sensitive sections of the processor, such as dynamic
logic or memories, alpha-particle emissions are not problematic. The previously
mentioned MCM-based commercial processors do not utilize area interconnect.
Their flip-chip solder bumps are placed on the chip's periphery--away from
sensitive circuitry.
The PUMA processor, however, will utilize area interconnect to provide
a larger number of high speed I/O. These area I/O pads will be placed in
locations that provide convenient access to all sections of the chip including
its sensitive components. Despite their natural hardness to radiation, CGaAs
dynamic logic and static memory circuits are still somewhat vulnerable
to alpha particle disruptions. Therefore, gold bumping will be used instead
of solder [92]. Gold makes good bumping
material because it is highly pliable and does not emit alpha particles.
Its major drawback is the high cost associated with its unique bumping
procedure and MCM assembly method.
7.4 PUMA Memory Subsystem
The PUMA's memory hierarchy will be partitioned into numerous chips that
communicate with each other across the MCM substrate and to a PCB. Fig.
14 is a block diagram of the PUMA memory subsystem.
Fig 14. PUMA memory subsystem block diagram
The PUMA's three-level memory hierarchy will be comprised of two levels
of cache memory and a large DRAM. The cache hierarchy will be split into
an instruction side and a data side. On the instruction side, the PUMA
fixed-point unit (FXU) will have primary access to an embedded 1 KByte
CGaAs instruction cache and an embedded CGaAs 8-line instruction stream
buffer. These primary structures will in turn be connected to a secondary
1 MByte instruction cache implemented with four commercial BiCMOS SRAM
chips.
On the data side, the FXU will have primary access to 16 KBytes of data
cache implemented with four off-chip CGaAs 4 KByte SRAM chips. This cache
will in turn be connected to a 1 MByte secondary data cache also implemented
with four commercial BiCMOS SRAM chips. Access to the secondary caches
and main memory will be conducted through off-chip CGaAs memory management
units (MMUs). These high bandwidth cache controllers will manage pipelined,
non-blocking access to the secondary caches and to 64 MBytes of main memory
via the PUMA Interface to PCI (PIP) chip.
8. Research Contributions
On the morning of January 11, 1997, an unusually large wave of solar radiation
known as a coronal mass ejection swept through the earth's upper atmosphere.
This phenomenon is a relatively common occurrence that does not pose a threat
to life on earth. On this particular occasion, however, the solar disruption
was of sufficient magnitude to overpower and destroy the microelectronics
of AT&T's Telstar 401 communications satellite orbiting at 22,000 miles
above the surface of the earth [93-95].
This satellite was designed and built by Lockheed Martin, launched into
space in late 1993, and began service as a television signal relay station
in early 1994. Telstar was one of three satellites that formed AT&T's
Skynet Satellite Services and cost the company several hundred million
dollars. According to NASA, several other military satellites operated
by the Pentagon were also severely affected, resulting in unspecified damage.
Coronal mass ejections are invisible waves of electrically charged particles
that can be lethal to microelectronics fabricated with semiconducting substrates
such as silicon. It is now apparent that current technologies used to protect
orbiting satellites from the damaging effects of solar radiation are insufficient.
This event has spawned a renewed search for radiation-hard IC technologies
by government institutions and commercial satellite deployment companies.
CGaAs can provide the high performance, radiation hard ICs required
for high-altitude orbits. Efforts are currently underway at Motorola and
Boeing Space Systems to prepare CGaAs for this task. There are numerous
research areas associated with developing a radiation hard, high performance
CGaAs microprocessor that have yet to be thoroughly explored. One of the
most central areas is the design of reliable embedded memory structures
and analysis tools.
8.1 Proposed Research
I propose to conduct the following research relative to the design and
optimization of CGaAs cache memory. Each of these research tasks encapsulates
and contributes additional understanding to the existing CGaAs knowledge
base. These contributions can be directly applied to ongoing government
and commercial research and development efforts.
-
Design and characterization of a 40 k-bit CGaAs SRAM. This SRAM
will exceed the capacity of any CGaAs SRAM designed to date. It will also
represent a significant improvement in access and power efficiency. This
chip will be thoroughly tested when it returns from fabrication in order
to study the effectiveness of its circuit implementation techniques. This
testing effort will lead to improved circuit techniques for later SRAM
designs. Motorola will also use this chip as a process monitor.
-
Development of a CGaAs SRAM compiler. No CGaAs SRAM compilers are
currently available. Thus, CGaAs SRAM designers must use computationally
expensive and time-consuming methodologies to explore the large CGaAs
power-delay-area memory design space. I propose to develop a CGaAs SRAM
compiler, based on commercial CAD tools, that will effectively and rapidly
explore this design space. Upon completion, this tool could find immediate
application in other government, academic, or commercial CGaAs research efforts.
-
CGaAs Design Rule Optimization Study. Truly cost-effective IC process
improvement programs consider the impact of each design rule shrink on
overall IC density and performance. I propose to use the CGaAs SRAM compiler
to search the multi-dimensional design rule improvement space. This study
will use cost estimates from Motorola's CGaAs process engineers to
determine the impact of various design rule changes on SRAM
density and performance. The results will then be given to Motorola's
CGaAs process engineers as an aid in cost-effective process improvement.
-
PUMA CGaAs Cache Memory Hierarchy Design. Finally, I propose to
design the cache memory structures for the PUMA microprocessor using the
tools and techniques previously described. This effort represents a "second
pass" at designing optimal CGaAs SRAM and will demonstrate the capacity-power-delay
capabilities of CGaAs memory. It will also demonstrate the feasibility
of high performance, multi-chip CGaAs microprocessors for space and military
applications. The required PUMA cache memory structures are the following:
1) Primary 1 KByte instruction cache
2) Primary 16 KByte data cache
3) Secondary instruction cache controller (IMMU)
4) Secondary data cache controller (DMMU)
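
The power-delay-area exploration proposed for the SRAM compiler can be
illustrated with a minimal sketch. It assumes the compiler has already
produced a set of candidate SRAM configurations, each with an estimated
power, access delay, and area, and it keeps only the candidates that no
other candidate beats in all three metrics. The names SramConfig,
dominates, and pareto_front, and all numeric values, are hypothetical
placeholders for illustration; they are not results or interfaces from
this proposal.

```python
from typing import List, NamedTuple


class SramConfig(NamedTuple):
    """One candidate SRAM organization (all values are made-up placeholders)."""
    name: str
    power_mw: float
    delay_ns: float
    area_mm2: float


def dominates(a: SramConfig, b: SramConfig) -> bool:
    """Return True if a is no worse than b in every metric and better in one."""
    a_metrics = (a.power_mw, a.delay_ns, a.area_mm2)
    b_metrics = (b.power_mw, b.delay_ns, b.area_mm2)
    no_worse = all(x <= y for x, y in zip(a_metrics, b_metrics))
    strictly_better = any(x < y for x, y in zip(a_metrics, b_metrics))
    return no_worse and strictly_better


def pareto_front(configs: List[SramConfig]) -> List[SramConfig]:
    """Keep the configurations that are not dominated by any other one."""
    return [c for c in configs
            if not any(dominates(other, c) for other in configs)]


# Hypothetical candidates; a real flow would obtain these from simulation.
candidates = [
    SramConfig("A", power_mw=120.0, delay_ns=2.0, area_mm2=4.0),
    SramConfig("B", power_mw=90.0, delay_ns=2.5, area_mm2=4.5),
    SramConfig("C", power_mw=130.0, delay_ns=2.2, area_mm2=4.2),  # worse than A
    SramConfig("D", power_mw=90.0, delay_ns=2.5, area_mm2=5.0),   # worse than B
]

front = pareto_front(candidates)
print([c.name for c in front])
```

Under these placeholder values, only configurations A and B survive: A is
faster and smaller, while B dissipates less power. A designer would then
choose among the surviving points according to the system's priorities.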
References
-
[1] T. Ichikawa and M. Sasaki, "A New Analytical Model
of SRAM Cell Stability in Low-Voltage Operation," IEEE Transactions
on Electron Devices, vol. 43, no. 1, January 1996, pp. 54-61.
-
[2] H. Kato, M. Matsui, H. Shibata, K. Hashimoto, T.
Ootani, and K. Ochii, "SRAM Cell Stability Under the Influence of Parasitic
Resistances and Data Holding Voltage as a Stability Prober," IEEE Journal
of Solid State Circuits, vol. 32, no. 2, February 1997, pp. 232-237.
-
[3] D. Hodges and H. Jackson, Analysis and Design
of Digital Integrated Circuits, Second edition, New York: McGraw-Hill,
1988.
-
[4] Linley Gwennap, "Digital 21264 Sets New Standard,"
Microprocessor Report, v.10 n. 14, October 28, 1996.
-
[5] J.K. Abrokwah, J.H. Huang, W. Ooms, C. Shurboff,
J.A. Hallmark, R. Lucero, "A Manufacturable Complementary GaAs Process,"
Proceedings of the 1993 IEEE GaAs IC Symposium, pp. 127-129.
-
[6] B. Bernhardt, et. al., "Complementary GaAs
(CGaAs): A High Performance BiCMOS Alternative," Proceedings of the
17th Annual IEEE GaAs IC Symposium, pp. 18-21, 1995.
-
[7] J. Abrokwah, R. Lucero, J. Hallmark, and B. Bernhardt,
"Submicron P-Channel (Al,Ga)As/(In,Ga)As HIGFET's," IEEE Transactions
on Electron Devices, vol. 44, no. 7, July 1997, pp. 1040-1045,
-
[8] T. Ohzone, S. Kondo, K. Tsuji, T. Shiragasawa, T.
Ishihara, and S. Horiuchi, "A 64Kb Static RAM," Proceedings of the 1980
IEEE International Solid-state Circuits Conference, pp. 236-237.
-
[9] J. Mayer, "SRAMs Rev Up, Close Memory-latency Gap,"
Computer Design, February 1997, pp. 107-109.
-
[10] K. Ishibashi, K. Komiyaji, S. Morita, T. Aoto,
S. Ikeda, K. Asayama, A. Koike, T. Yamanka, N. Hashimoto, H. Iida, F. Kojima,
K. Motohashi, and K. Sasaki, "A 12.5ns 16Mb CMOS SRAM," Proceedings
of the 1993 IEEE Symposium on VLSI Circuits, pp. 103-104.
-
[11] A, Suzuki, T. Kobayashi, T. Hamano, H. Hatada,
A. Kawasumi, F. Matsuoka, K. Ishimaru, M. Takahashi, M. Nishigohri, Y.
Okayama, Y. Unno, M. Kakumu, and J. Tsujimoto, "A 400MHz 4.5Mb Synchronous
BiCMOS SRAM with Alternating Bit-line Loads," Proceedings of the 1996
IEEE International Solid-state Circuits Conference, pp. 146-147.
-
[12] S. Kuhara, H. Toyoshima, K. Takeda, K. Nakamura,
H. Okamura, M. Takada, H. Suzuki, H. Yoshida, and T. Yamazaki, "A 6ns 1.5V
4Mb BiCMOS SRAM," Proceedings of the 1996 IEEE International Solid-state
Circuits Conference, pp. 144-145.
-
[13] K. Sasaki, K. Ueda, K. Takasugi, H. Toyoshima,
T. Yamanaka, N. Hashimoto, and N. Ohki, "A 16Mb CMOS SRAM with a 2.3mm2
Single-Bit-Line Memory Cell," Proceedings of the 1993 IEEE International
Solid-state Circuits Conference, pp. 250-251.
-
[14] M. Dolle, S. Jhand, W. Lehner, O. Muller, M. Schlett,
"A 32-b RISC/DSP Microprocessor with Reduced Complexity," IEEE Journal
of Solid State Circuits, vol. 32, no. 7, July 1997, pp. 1056-1066.
-
[15] D. Kinell and D. Wilson, "GaAs Static Random Access
Memory Cell Design," Research abstract of the 1990 IEEE GaAs IC Symposium.
-
[16] J. Notthoff, R. Krein, J. Stephens, G. Troeger,
C. Vogelsang, and C. Hyun, "A 4K X 1 Bit Complementary E-JFET Static RAM,"
Proceedings of the 1987 IEEE GaAs IC Symposium, pp. 185-188.
-
[17] C. Vogelsang, J. Castro, J. Notthoff, G. Troeger,
J. Stephens, and R. Krein, "Complementary GaAs JFET 16K SRAM," Proceedings
of the 1988 IEEE GaAs IC Symposium, pp. 75-78.
-
[18] M. Suzuki, S. Notomi, M. Ono, N. Kobayashi, E.
Mitani, K. Odani, T. Mimura, and M. Abe, "A 1.2ns HEMT 64kb SRAM," Proceedings
of the 1991 IEEE International Solid-State Circuits Conference, pp.
48-49.
-
[19] J. Hallmark, C. Shurboff, B. Oohms, R. Lucero,
J. Abrokwah, and J. Huang, "0.9 V DSP Blocks: A 15 ns 4 K SRAM and a 45
ns 16-bit Multiply/Accumulator," Proceedings of the 1994 IEEE GaAs IC
Symposium, pp. 55-58.
-
[20] E. Seevinck, P. van Beers, and H. Ontrop, "Current-mode
Techniques for High Speed VLSI Circuits with Application to Current Sense
Amplifier for CMOS SRAMs," IEEE Journal of Solid State Circuits,
vol. 26, no. 4, April 1991, pp. 525-535.
-
[21] H. Nambu, K. Kanetani, Y. Idei, N. Homma, K. Yamaguchi,
T. Hiramoto, N. Tamba, M. Odaka, K. Watanabe, T. Ikeda, K. Ohhata, and
Y. Sakurai, "High-speed Sensing Techniques for Ultrahigh-speed SRAMs,"
IEEE Journal of Solid State Circuits, vol. 27, no. 4, April 1992,
pp. 632-639.
-
[22] T. Yabe, "High-speed Circuit Techniques for 1
to 5 V Operating Memories," IEICE Transactions on Electronics, vol.
E76-C, no. 5, May 1993, pp. 708-713.
-
[23] K. O'Connor, "A Source Sensing Technique Applied
to SRAM Cells," IEEE Journal of Solid State Circuits, vol. 30, no.
4, April 1995, pp. 500-511.
-
[24] K. Ishibashi, "High-Speed CMOS SRAM Technologies
for Cache Applications," IEICE Transactions on Electronics, vol.
E79-C, no. 6, June 1996, pp. 724-734.
-
[25] K. Ohhata, T. Kunsunoki, H. Nambu, K. Kanetani,
T. Masuda, M. Ohayashi, S. Hamamoto, K. Yamaguchi, Y. Idei, and N. Homma,
"Redundancy Circuit for a Sub-nanosecond, Megabit ECL-CMOS SRAM," IEICE
Transactions on Electronics, vol. E79-C, no. 3, March 1996, pp. 415-422.
-
[26] M. Hill and A. Smith, "Evaluating Associativity
in CPU Caches," IEEE Transactions on Computers, vol. 38, no. 2,
12 December 1989, pp. 1612-1630.
-
[27] M. Hill, "A Case for Direct-Mapped Caches,"
IEEE Computer, vol. 21, no. 12, Dec 1988, pp. 25-40.
-
[28] J. Hennessy and D. Patterson, Computer Architecture
a Quantitative Approach, San Mateo: Morgan-Kaufmann, 1990.
-
[29] R. Kessler, R. Jooss, A. Lebeck, and M. Hill,
"Inexpensive Implementations of Set-Associativity," Proceedings of the
1989 ACM/IEEE International Symposium on Computer Architecture, pp.
131-138.
-
[30] F. Bodin and A. Seznec, "Skewed Associativity
Enhances Performance Predictability," Proceedings of the 1995 ACM/IEEE
International Symposium on Computer Architecture, pp. 265-274.
-
[31] A. Seznec, "A Case for Two-way Skewed-associative
Caches," Proceedings of the 1993 ACM/IEEE International Symposium on
Computer Architecture, pp. 169-178.
-
[32] A. Seznec, "About Set and Skewed Associativity
on Second Level Caches," Proceedings of the 1993 IEEE International
Conference on Computer Design, pp. 40-43.
-
[33] A. Agarwal and S. Pudar, "Column-Associative Caches:
A Technique for Reducing the Miss Rate of Direct-Mapped Caches," Proceedings
of the 1993 ACM/IEEE International Symposium on Computer Architecture,
pp. 179-190.
-
[34] A. Seznec, "DASC Cache," Proceedings of the
1995 IEEE Symposium on High-Performance Computer Architecture, pp.
134-143.
-
[35] K. Theobald, H. Hum, and G. Gao, "A Design Framework
for Hybrid-Access Caches," Proceedings of the 1995 IEEE Symposium on
High-Performance Computer Architecture, pp. 144-153.
-
[36] B. Calder, D. Grunwald, and J. Emer, "Predictive
Sequential Associative Cache," Proceedings of the 1996 IEEE Symposium
on High-Performance Computer Architecture, pp. 244-253.
-
[37] S. Przybylski, M. Horowitz, and J. Hennessy, "Performance
Tradeoffs in Cache Design," Proceedings of the 1988 ACM/IEEE International
Symposium on Computer Architecture, pp. 290-298.
-
[38] A. Seznec, "Decoupled Sectored Caches: Conciliating
Low Tag Implementation Cost and Low Miss Ratio," Proceedings of the
1994 ACM/IEEE International Symposium on Computer Architecture, pp.
384-393.
-
[39] O. Olukton, T. Mudge, and R. Brown, "Multilevel
Optimization of Pipelined Caches," to appear in IEEE Transactions on
Computers, 1997.
-
[40] N. Jouppi and S. Wilton, "Tradeoffs in Two-level
On-chip Caching," Proceedings of the 1994 ACM/IEEE International Symposium
on Computer Architecture, pp. 34-45.
-
[41] W. Wang, J. Baer, and H. Levy, "Organization and
Performance of a Two-level Virtual-Real Cache Hierarchy," Proceedings
of the 1989 ACM/IEEE International Symposium on Computer Architecture,
pp. 140-148.
-
[42] N. Jouppi, "Cache Write Policies and Performance,"
Proceedings of the 1993 ACM/IEEE International Symposium on Computer
Architecture, pp. 191-201.
-
[43] H. Young and E. Shekita, "An Intelligent I-Cache
Prefetch Mechanism," Proceedings of the 1993 IEEE International Conference
on Computer Design, pp. 44-49.
-
[44] N. Jouppi, "Improving Direct-Mapped Cache Performance
by the Addition of a Small Fully-Associative Cache and Prefetch Buffers,"
Proceedings of the 1990 ACM/IEEE International Symposium on Computer
Architecture, pp. 364-373.
-
[45] S. Palacharla and R. Kessler, "Evaluating Stream
Buffers as a Secondary Cache Replacement," Proceedings of the 1994 ACM/IEEE
International Symposium on Computer Architecture, pp. 24-33.
-
[46] T. Alexander and G. Kedem, "Distributed Prefetch-buffer
/ Cache Design for High Performance Memory Systems," Proceedings of
the 1996 IEEE International Symposium on High-Performance Computer Architecture,
pp. 254-263.
-
[47] B. Bray and M. Flynn, "Fetch Caches," Stanford
University Technical Report CSL-TR-93-561, February 1993.
-
[48] T. Kobayashi, et. al., "A Current-Controlled Latch
Sense Amplifier and a Static Power-Saving Input Buffer for Low-Power Architecture,"
IEEE Journal of Solid State Circuits, vol. 28, pp. 523-527, April
1993.
-
[49] N. Weste, K. Eshraghian, Principles of CMOS
VLSI Design, Second edition, New York: Addison-Wesley, 1993.
-
[50] B. Cherkauer, E. Friedman, "Design of Tapered
Buffers with Local Interconnect Capacitance," IEEE Journal of Solid
Stage Circuits, vol. 30, February 1995, pp. 151-154.
-
[51] K. Hedlund, "Models and Algorithms for Transistor
Sizing in MOS circuits," Proceedings of the 1984 IEEE/ACM International
Conference on Computer-Aided Design, pp. 12-14.
-
[52] B. Richman, J. Hansen, and K. Cameron, "A Deterministic
Algorithm for Automatic CMOS Transistor Sizing," Proceedings of the
1987 IEEE Custom Integrated Circuits Conference, pp. 421-424.
-
[53] Z. Dai and K. Asada, "MOSIZ: A Two-step Transistor
Sizing Algorithm based on Optimal Timing Assignment Method for Multi-stage
Complex Gates," Proceedings of the 1989 IEEE Custom Integrated Circuits
Conference, pp. 17.3.1 - 17.3.4
-
[54] B. Hoppe, G. Neuendorf, and D. Schmitt-Landsidel,
"Automatic Transistor Sizing in High Performance CMOS Logic Circuits,"
Proceedings of the 1989 European Computer Conference, pp. 5-25 -
5-27.
-
[55] S. Sapatnekar, V. Rao, and P. Vaidya, "A Convex
Optimization Approach to Transistor Sizing for CMOS Circuits," Proceedings
of the 1991 IEEE/ACM International Conference on Computer-Aided Design,
pp. 482-485.
-
[56] L. Heusler and W. Fichtner, "Transistor Sizing
for Large Combinational Digital CMOS Circuits," Integration, The VLSI
Journal, vol. 10, no. 2, 2 Jan. 1991, pp. 155-168.
-
[57] S. Sapatnekar, V. Rao, and P. Vaidya, "An Exact
Solution to the Transistor Sizing Problem for CMOS Circuits Using Convex
Optimization," IEEE Transactions on Computer-Aided Design of Integrated
Circuits and Systems, vol. 12. no. 11, November 1993, pp. 1621-1634.
-
[58] W. Kao, N. Fathi, and C. Lee, "Algorithms for
Automatic Transistor Sizing in CMOS Digital Circuits," Proceedings of
the 1995 ACM/IEEE Design Automation Conference, pp. 781-784.
-
[59] N. Azemard, V. Bonzom, and D. Auvergne, "P.SIZE:
A Sizing Aid for Optimized Designs," Proceedings of the 1992 European
Design Automation Conference, pp. 160-165.
-
[60] U. Ko and P. Balsara, "Short-Circuit Power Driven
Gate Sizing Technique for Reducing Power Dissipation," IEEE Transactions
on VLSI Systems, vol. 3, no. 3, September 1995, pp. 450-455.
-
[61] J. Shyu, J. Fishburn, A. Dunlop, and A. Sangiovanni-Vincentelli,
"Optimization-based Transistor Sizing," Proceedings of the 1987 IEEE
Custom Integrated Circuits Conference, pp. 417-420.
-
[62] G. Chen, H. Onodera, and K. Tamaru, "An Iterative
Gate Sizing Approach with Accurate Delay Evaluation," Proceedings of
the 1995 IEEE/ACM International Conference on Computer-Aided Design,
pp. 422-427.
-
[63] S. Mehrotra, P. Franzon, and W. Liu, "Stochastic
Optimization Approach to Transistor Sizing for CMOS VLSI Circuits," Proceedings
of the 1994 ACM/IEEE Design Automation Conference, pp. 36-40.
-
[64] W. Nye, D. Riley, A. Sangiovanni-Vincentelli,
and A. Tits, "DELIGHT.SPICE: An Optimization-based System for the Design
of Integrated Circuits," IEEE Transactions on Computer-Aided Design,
vol. 7 no. 4, April 1988, pp. 501-519.
-
[65] J. Fishburn and A. Dunlop, "TILOS: A Posynomial
Programming Approach to Transistor Sizing," Proceedings of the 1985
IEEE International Conference on Computer-Aided Design, pp. 326-328.
-
[66] S. Sapatnekar and V. Rao, "iDEAS: A Delay Estimator
and Transistor Sizing Tool for CMOS Circuits," Proceedings of the 1990
IEEE Custom Integrated Circuits Conference, pp. 9.3.1 - 9.3.4.
-
[67] K. Hedlund, "Aesop: A Tool for Automated Transistor
Sizing," Proceedings of the 1987 ACM/IEEE Design Automation Conference,
pp. 114 -120.
-
[68] M. Cirit, "Transistor Sizing in CMOS Circuits,"
Proceedings of the 1987 ACM/IEEE Design Automation Conference, pp.
121-124.
-
[69] A. Conn, P. Coulman, R. Haring, G. Morrill, and
C. Visweswariah, "Optimization of Custom MOS Circuits by Transistor Sizing,"
Proceedings of the 1996 IEEE/ACM International Conference on Computer-Aided
Design, pp. 174-180.
-
[70] D. Chen and C. Zukowski, "CMOS Optimization Including
Logic Family Mixing," Proceedings of the 1991 IEEE International Symposium
on Circuits and Systems, vol. 4, pp. 2240-2243.
-
[71] H. Chen and S. Kang, "A New Circuit Optimization
Technique for High Performance CMOS Circuits," IEEE Transactions on
Computer-Aided Design, vol. 10, no. 5, May 1991, pp. 670-676.
-
[72] J. Kao, A. Chandrakasan, and D. Antoniadis, "Transistor
Sizing Issues and Tool for Multi-threshold CMOS Technology," Proceedings
of the 1997 ACM/IEEE Design Automation Conference, pp. 409-414.
-
[73] D. Marple, "Transistor Size Optimization in the
Tailor Layout System," Proceedings of the 1989 ACM/IEEE Design Automation
Conference, pp. 43-48.
-
[74] M. Yamada, S. Kurosawa, R. Nojima, N. Kojima,
T. Mitsuhashi, and N. Goto, "Synergistic Power/Area Optimization with Transistor
Sizing and Wire Length Minimization," IEICE Transactions on Electronics,
vol. E78-C, no. 4, April 1995, pp. 441-445.
-
[75] N. Menezes, R. Baldick and L. Pileggi, "A Sequential
Quadratic Programming Approach to Concurrent Gate and Wire Sizing," Proceedings
of the 1995 IEEE/ACM International Conference on Computer-Aided Design,
pp. 144-151.
-
[76] J. Cong and L. He, "An Efficient Approach to Simultaneous
Transistor and Interconnect Sizing," Proceedings of the 1996 IEEE/ACM
International Conference on Computer-Aided Design, pp. 181-186.
-
[77] W. Swartz, C. Giuffre, W. Banzhaf, M. deWit, H.
Khan, C. McIntosh, T. Pavey, and D. Thomas, "CMOS RAM, ROM, and PLA Generators
for ASIC Applications," Proceedings of the 1986 IEEE Custom Integrated
Circuits Conference, pp. 334 - 338.
-
[78] J. Drummond and M. Lepkowski, "BiCMOS Submicron
Compiler Memories," Proceedings of the 1990 IEEE ASIC Seminar and Exhibit,
pp. 3-3.1 - 3-3.3.
-
[79] T. Dao and F. Svejda, "A Dual-port SRAM Compiler
for 0.8mm 100K BiCMOS Gate Arrays," Proceedings of the 1991 IEEE Custom
Integrated Circuits Conference, pp. 22.4.1-22.4.3.
-
[80] T. Le, H. Phuong, and P. Lin, "1KX128 High-performance,
Low-power configurable CMOS SRAM Compiler," Proceedings of the 1990
IEEE ASIC Seminar and Exhibit, pp. 3-5.1 - 3-5.4.
-
[81] K. Tsao, N. Zhu, and T. Pham, "A High Performance
Memory Compiler for Multi-port RAMs," Proceedings of the 1990 IEEE ASIC
Seminar and Exhibit, pp. 3-6.1 - 3-6.4.
-
[82] H. Shinohara, N. Matsumoto, K. Fujimori, Y. Tsujihashi,
H. Nakao, S. Kato, Y. Horiba, and A. Tada, "A Flexible Multiport RAM Compiler
for Data Path," IEEE Journal of Solid State Circuits, vol. 26, no.
3., March 1991, pp. 343-348.
-
[83] J. Tou, P. Gee, J. Duh, and R. Eesley, "A Sub-micron
CMOS Embedded SRAM Compiler," Proceedings of the 1991 IEEE Custom Integrated
Circuits Conference, pp. 22.3.1-22.3.4.
-
[84] A. Chandna, C. Kibler, R. Brown, M. Roberts, and
K. Sakallah, "The Aurora RAM Compiler," Proceedings of the 1995 ACM/IEEE
Design Automation Conference, pp. 261-266.
-
[85] A. Chandna, "GaAs MESFET Static RAM Design for
Embedded Applications," Ph.D. Dissertation, University of Michigan, 1995.
-
[86] A. Chandna and R. Brown, "An Asynchronous GaAs
MESFET Static RAM Using a New Current Mirror Memory Cell," IEEE Journal
of Solid State Circuits, October 1994, pp. 1270-1276.
-
[87] Schwab, D.J., et. al, "Performance Characteristics
of Thin Film Multilayer Interconnects in the 1-10 GHz Frequency Range,"
Proceedings of the 1989 IEEE Electronics Components Conference,
pp. 536-542.
-
[88] T. WIlliams, N. Patkar, G. Shen, "SPARC64: A 64-b
64-Active-Instruction Out-of-Order-Execution MCM Processor," IEEE Journal
of Solid-State Circuits, vol. 30, no. 11, November 1995, pp. 1215-1226.
-
[89] White, S.W., "POWER2: Next Generation of the RISC
System/6000 Family," IBM Journal of Research & Development,
vol 38, no. 5, 1994.
-
[90] Lomax, R.J., et. al, "Area I/O Flip-Chip Packaging
to Minimize Interconnect Length," Submitted to the 1997 IEEE MCM Conference,
1997.
-
[91] Nealon, M.J., "C4: A Legacy Technology for the
Future," IBM Micronews, vol. 2, no. 4, 4Q 1996.
-
[92] Aschenbrenner, R. et al., "Flip Chip Attachment
Using Non-Conductive Adhesives and Gold Ball Bumps," International Journal
of Microelectronics and Electric Packaging, vol. 18, no. 2, 2Q 1995,
pp. 154-161.
-
[93] "Telstar 401 out of service: Satellite's failure
could cost AT&T several hundred million dollars," CNNfn news
article, January 17, 1997, 1:01 p.m. EST, URL: http://www.cnnfn.com.
-
[94]' "Solar radiation might have killed satellite,"
CNN Interactive news article, January 20, 1997, 11:45 p.m. EST,
URL: http://www.cnn.com.
-
[95] "Sun 'ejection' killed TV satellite," CNN Interactive
news article, January 21, 1997, 10:10 p.m. EST, URL: http://www.cnn.com.