PUMA

University of Michigan: High-Performance Microprocessor Project

ARPA Contract Number: DAAH04-94-G-0327
TITLE: Design Optimization of a GaAs RISC Microprocessor with Area-Interconnect MCM Packaging


Recent Accomplishments


CAD Developments

Cascade Design Automation, one of the project's industrial sponsors, has been developing CAD tools for the University of Michigan project. Cascade enhanced their cell library migration tool, MasterPort, to support complementary GaAs processes. With this tool, the cell library at Michigan can be kept up to date with the evolving GaAs process at Motorola, the project's second industrial sponsor. The University of Michigan received early releases of MasterPort, which has now been introduced to the commercial market.

Cascade has also been developing an area-distributed pad router. This tool works within Cascade's design system, Epoch, and enables the designer to plan and optimize a chip using area pads. Presently the tool allows an array of area pads to be distributed across a design and routed. A preliminary version of this tool, Eggo, has been delivered to the University of Michigan.

During the next year, Cascade will fully support the University of Michigan with their area pad designs. Work will continue on the Eggo area pad routing tool, and a top-down floorplanner will be enhanced to assist in the placement of the area pad drivers. Cascade will also enhance their timing analysis tool, Tactic, to support domino logic for this project. This work is scheduled to be completed by February 1997.

University of Michigan researchers, building on previous work by their colleagues, developed a new timing analysis model for domino logic. The timing model was presented to representatives of Cascade, and a proposal was offered for incorporating this timing model into their static timing analysis tool, Tactic.

The development of a bus interface chip (PIP) was initiated. The PIP will form the interface between the MMUs, the main memory, and the PCI bus. A detailed specification is being drafted.

Circuit Design

Work is progressing on the design, layout, and verification of a sixty-two cell standard cell library. The library includes several different kinds of buffers as well as logic cells designed in complementary, DCFL, and domino logic styles. To aid in verification of the cell library, a verification environment has been designed around Cascade's TGEN tool. This environment provides automatic vector application and result checking for a functional set of vectors over nearly every cell in the library. Information about these cells, including schematics and layout plots, has been compiled in a standard cell notebook. This notebook, presently in its second revision, will eventually contain all timing data for the standard cells.

The cell library described above has been used in the design of an ALU, which has been taped out for fabrication. This chip, which includes over 100,000 transistors, measures 5.7 x 6.0 mm. It implements over 150 PowerPC instructions in domino logic. The ALU test chip is designed to allow high-speed testing on a lower-speed tester: on-chip buffers at the inputs and outputs allow slow scan-in and scan-out of data, with full-speed operation during the test.

In addition to the ALU, a sub-nanosecond access time 4K-byte SRAM and a 600-ps access time 32x32 register file chip have been designed. Simulation results from these circuits have already provided information that will change the microarchitecture of the final system; test results will qualify the cell set for use in the final processor chips. A 3-port floating point register file has also been designed. This register file, with two read ports and one write port, has a 64 x 32-bit organization. The chip core measures 3.6 mm x 0.8 mm. Simulations indicate a read access time of 0.62 ns with a 1.5 V supply and a 1 ns clock period. Power is estimated to be 500 mW at the rated conditions. We believe that this design achieves the best power-delay product of any sub-nanosecond register file to date.

Architecture

Two separate FXU-FPU configurations have been developed, with baseline specifications being written for both. Simulations will determine which of these two designs will be used in the final PUMA chipset. Baseline behavioral Verilog has been written and tested for the more promising of the two FXU configurations, while structural Verilog has been written for the FPU multiplier; the Verilog may be modified once the simulations are finished. Both of the proposed configurations have the same two-port memory configuration, utilizing an Instruction-side Memory Management Unit (MMU) and a Data-side MMU. A specification for the design of these two units has been completed, and Verilog coding is in progress. A number of novel concepts for improving system performance under our design constraints have been included in these designs, the most promising being in the areas of cache and branch prediction performance.

The architecture team has simulated the performance of the SPEC92 and SPEC95 benchmarks on different PUMA pipeline organizations, and continues to improve the simulator in order to evaluate different FXU-FPU and memory hierarchy organizations.

The GCC compiler has been modified to target the PUMA instruction set. The modifications made so far prevent the compiler from generating output that includes some of the unsupported PowerPC instructions. GCC will be modified further so that it generates only instructions in the PUMA instruction set.

Packaging

The first MCM, designated MCM1, includes a subset of the standard high-frequency passive MCM structures designed by the Special Purpose Processor Development Group at the Mayo Foundation. Five samples of MCM1 were sent to Mayo for characterization and testing in their specially designed test jigs. Tests included S-parameter, time-domain reflectometry and transmission measurements, as well as Kelvin resistance measurements. Some of the measurements exhibited unexpected characteristics on specific samples; therefore, one was returned to Mayo for additional tests. Mayo identified a few occurrences of open-circuit ground vias, which correlated with the abnormal measurements.

The Mayo group gave us access to a program still under development which extracts the effective R, L, G and C parameters of the transmission lines as a function of frequency. We improved the part of the program which extracts the propagation constant, and are developing a program which de-embeds the probe sites at each end of the lines in order to get better values of the actual line parameters. These results are being compared with simulation parameters determined from the dimensions and structural parameters of the MCM interconnects, using Quad Design's XTX program; preliminary results indicate reasonable agreement. This information will be used in simulation of the MCM interconnect signals.
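The relationship behind this kind of extraction can be sketched briefly. This is not the Mayo program; it is a minimal illustration using the standard transmission-line identities R + jwL = gamma * Z0 and G + jwC = gamma / Z0, and it assumes that the propagation constant gamma and the characteristic impedance Z0 have already been obtained from the measurements at a given frequency. The function names and the sample line values are hypothetical.

```python
import cmath
import math


def rlgc_from_line(gamma, z0, freq_hz):
    """Per-unit-length (R, L, G, C) from complex gamma (1/m) and Z0 (ohm)."""
    omega = 2.0 * math.pi * freq_hz
    series = gamma * z0   # equals R + j*omega*L
    shunt = gamma / z0    # equals G + j*omega*C
    return series.real, series.imag / omega, shunt.real, shunt.imag / omega


def line_from_rlgc(r, l, g, c, freq_hz):
    """Inverse mapping: (gamma, Z0) from per-unit-length R, L, G, C."""
    omega = 2.0 * math.pi * freq_hz
    series = complex(r, omega * l)
    shunt = complex(g, omega * c)
    return cmath.sqrt(series * shunt), cmath.sqrt(series / shunt)


# Round trip on made-up values (ohm/m, H/m, S/m, F/m) at 1 GHz.
sample = (50.0, 3.0e-7, 1.0e-3, 1.2e-10)
gamma, z0 = line_from_rlgc(*sample, 1.0e9)
recovered = rlgc_from_line(gamma, z0, 1.0e9)
```

Repeating the extraction at each measured frequency gives R, L, G and C as functions of frequency, which is the form needed for comparison with field-solver output.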

A second MCM, designated MCM2, will be designed with six chips mounted on a 2" substrate. We are developing an ALU chip in CGaAs technology which will be flip-chip attached with gold bumps (this chip is further described above under Circuit Design). The design includes area array placement of the pads, which will aid verification of the advantages expected from this design technique. We will also develop power and thermal models for the efficient placement of power and thermal pads.

A Microtester in CMOS technology will interface with the SRAM chips, two of which are mounted on the upper surface of the MCM as bare die (if available) and two in single chip packages on the back side of the MCM. This will exercise the memory management scheme for the final MCM system. The MCM substrate will consist of signal and power layers which can be attached to chips mounted on the back side.

The third MCM, MCM3, will contain the prototype processor, including the FXU, FPU, IMMU, DMMU, and caches, and will interface to a PCI bus. It will be a two-sided MCM on an aluminum nitride substrate, with the Si cache ICs mounted on the lower side, and the GaAs processor chips and the high density interconnects mounted on the upper side.

System-Level Design

We are designing a printed circuit board (PCB) in two phases to provide a platform for verification and characterization of the MCM technology and for testing of the PUMA processor. For phase one (PCB2), the PCB will meet 33 MHz, 3.3 V, 32-bit PCI Local Bus specifications (Rev. 2.1) and contain MCM2 and support chips. MCM2 houses the MMT (memory manager test chip) and SRAM. The support chips include a bank of SDRAM, a PCI interface chip (PIP), a PLL clock driver chip to synchronize the PIP to the SDRAM and to provide a 100 MHz memory clock domain derived from the PCI clock, an EPROM to supply instruction memory for the MMT, and a clock generator with crystal for MCM2. A series of crystals will be employed to test functionality at frequencies from 10 to 33 MHz. The target datapath width between SDRAM and PIP is 128 bits, a full cache line in the PUMA architecture. Thus, PCB2 will provide just enough SDRAM to support the bandwidth. For phase two, PCB3 will include the prototype system on MCM3 and support the entire 64 MB address space.
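As a rough check on the figures above, the peak transfer rates of the two interfaces can be compared. The sketch assumes one full-width transfer per clock cycle on each interface, which the text does not state, and it ignores protocol overhead; the function name is hypothetical.

```python
def peak_bytes_per_second(width_bits, clock_hz, transfers_per_clock=1):
    """Peak rate of a parallel interface, ignoring all protocol overhead."""
    return width_bits * clock_hz * transfers_per_clock // 8


# 128-bit SDRAM-to-PIP datapath in the 100 MHz memory clock domain.
sdram_peak = peak_bytes_per_second(128, 100_000_000)

# 32-bit PCI bus at 33 MHz.
pci_peak = peak_bytes_per_second(32, 33_000_000)

print(sdram_peak, pci_peak)
```

Under these assumptions the memory side peaks at 1.6 GB/s and the PCI side at 132 MB/s, so the host link, not the SDRAM datapath, bounds how quickly test data can be loaded.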

The complete system includes the above-mentioned PCB2 with components and the host processor provided by the Motorola Power Stack. Through the PCI interface and PIP, the host supplies data to the SDRAM and signals the MMT to begin the test procedures. The MMT transfers data from the SDRAM through the PIP to the SRAM. Next, the MMT reads the data from the SRAM and performs basic operations (e.g., XOR, ADD) before storing the results to a new location back in SDRAM. When the transfer is signaled complete, the host reads SDRAM to verify that the correct data changes are stored.
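The test sequence above can be modeled in a few lines. This is a behavioral sketch only, not the MMT firmware: the memories are plain lists of 32-bit words, the PIP is implied by the copy step, and the choice of XOR with a fixed mask as the "basic operation" is an assumption made for illustration. All names are hypothetical.

```python
WORD_MASK = 0xFFFFFFFF


def run_mmt_test(host_data, xor_mask=WORD_MASK):
    """Model the host/MMT test flow; returns (passed, result_words)."""
    n = len(host_data)
    sdram = [0] * (2 * n)  # source region followed by result region
    sram = [0] * n

    # 1. Host writes the test data into SDRAM over PCI and the PIP.
    sdram[:n] = host_data

    # 2. MMT transfers the data from SDRAM through the PIP into SRAM.
    for i in range(n):
        sram[i] = sdram[i]

    # 3. MMT reads SRAM, applies the operation, and stores the result
    #    to a new location in SDRAM.
    for i in range(n):
        sdram[n + i] = (sram[i] ^ xor_mask) & WORD_MASK

    # 4. Host reads the result region and checks it against values
    #    computed independently from the data it originally supplied.
    expected = [(word ^ xor_mask) & WORD_MASK for word in host_data]
    return sdram[n:] == expected, sdram[n:]


passed, result = run_mmt_test([0x00000000, 0x00000001, 0xDEADBEEF])
```

Because the host recomputes the expected values from its own copy of the input, a fault anywhere along the SDRAM, PIP, or SRAM path shows up as a mismatch in step 4.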

Software

We are porting the Linux operating system to the PUMA processor and modifying the GCC compiler to generate only the PowerPC instructions that will be implemented in the final processor. We have developed a trace-driven simulator that is being used to evaluate architectural trade-offs.

For further information, please see the papers "Software-managed address translation" and "Simplifying virtual memory management with hardware segmentation".

