Position Statement

Robert Yung
Sun Microsystems Labs

1. Introduction

Designing a modern processor is a complex task that demands the best trade-offs in its hardware and software components. The instruction set architecture, although an important factor in the performance of a design, cannot be easily and radically changed from one generation of a product to the next. This leaves three important and intertwined trade-offs that can affect the performance of a processor. They are the process technology, the micro-architecture, and the software (e.g. compiler, binary translator, operating system). I will focus on the first two trade-offs in this paper.

On a side note, because of the increasing complexity and processing speed in a processor design, design automation and verification become two major issues. New tools are needed not only to ensure functional correctness, but also to ensure timely completion and expected performance of a design.

2. Process Technology

Advances in process technology have been a major contributor to the superior performance of the latest generations of microprocessors. A good design will become a lesser performer if it is unable to "ride the technology curve" or is built using an obsolete process technology. Each advance in process technology offers more transistors on an integrated circuit, reduces the switching delay of logic gates, and reduces power dissipation.

In the next decade, I speculate that there will be billion-transistor integrated circuits operating at tens or hundreds of gigahertz. Such integrated circuits will require new ways to supply power to them and to extract heat from them. New packaging and new I/O organizations will also be needed to maintain the high data bandwidth required to sustain the peak execution rates of these processors. Because of miniaturization, future battery-powered portable devices (e.g., your future wrist-watch) will have much higher computing power than today's processors while providing instant access to information services and to your office via their telecommunication ports. Future nomadic computing may demand new architectural support, such as for speech recognition and encryption, that is poorly served by current general-purpose processor designs.

3. Micro-architecture

Two trade-offs that have the most impact on the performance of a uniprocessor are branch prediction and the management of memory bandwidth and latency. The ability to predict the outcome of issued but still pending branches, and to prefetch and issue instructions along the predicted execution path, is crucial in a high-issue design. Branch prediction is the focus of much micro-architectural research today, and most of the effort is concentrated on improving the accuracy of history-based prediction. This alone is inadequate for optimizing an instruction fetch unit design. Other factors that have a large impact on an instruction fetch unit are branch target prediction, taken-branch latency, misprediction cost, and, most importantly, the cycle time of the design. Simultaneous execution of multiple contexts or threads, and execution along both the predicted and non-predicted branch directions, also deserve a fresh look.
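The interplay among prediction accuracy, taken-branch latency, misprediction cost, and cycle time can be made concrete with a rough model. The following is a minimal sketch, not a measured result: the additive stall model, the function name, and every number in it are illustrative assumptions chosen only to show how a more accurate predictor can still lose if it lengthens the cycle.

```python
def time_per_instruction_ns(cycle_time_ns, base_cpi, branch_fraction,
                            direction_accuracy, mispredict_penalty_cycles,
                            taken_fraction, taken_branch_bubble_cycles):
    """Average time per instruction under a simple additive stall model.

    All parameters are hypothetical inputs; the model ignores overlap
    between stalls and every other source of delay.
    """
    # Cycles lost per branch to wrong-direction predictions.
    mispredict_stall = (1.0 - direction_accuracy) * mispredict_penalty_cycles
    # Cycles lost per branch to redirecting fetch on a correctly
    # predicted taken branch (the taken-branch latency).
    taken_stall = (direction_accuracy * taken_fraction
                   * taken_branch_bubble_cycles)
    cpi = base_cpi + branch_fraction * (mispredict_stall + taken_stall)
    return cpi * cycle_time_ns


# Assumed workload and pipeline figures, shared by both designs.
common = dict(base_cpi=0.5, branch_fraction=0.2,
              mispredict_penalty_cycles=4, taken_fraction=0.6,
              taken_branch_bubble_cycles=1)

# Design A: simpler predictor, shorter cycle.
design_a = time_per_instruction_ns(cycle_time_ns=5.0,
                                   direction_accuracy=0.90, **common)
# Design B: more accurate predictor that stretches the cycle by 10%.
design_b = time_per_instruction_ns(cycle_time_ns=5.5,
                                   direction_accuracy=0.95, **common)

print(f"design A: {design_a:.3f} ns per instruction")
print(f"design B: {design_b:.3f} ns per instruction")
```

With these assumed figures, design A averages about 3.44 ns per instruction and design B about 3.60 ns, so the gain in accuracy is more than cancelled by the longer cycle. Different assumptions can reverse the outcome, which is why the factors have to be weighed together.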

As the speed gap between a processor and its memory system widens, there is greater reliance on high-speed caches to sustain high instruction and data bandwidth at low access latency. First-level caches should be relatively small to maintain single-cycle access and a fast cycle time. The small first-level cache should be supplemented with a larger second-level cache, possibly with a higher degree of associativity, on the same integrated circuit. In the long run, the explosive increase in transistor counts on an integrated circuit opens up many possibilities for radical re-organization of a computer system. I believe there will be a continuing trend toward system integration: incorporating more caches, peripherals, and special-purpose hardware, such as multimedia acceleration and networking support, on a single integrated circuit. The driving forces behind integration are economics and the demand for miniaturization and reliability.
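The case for a small first-level cache backed by a larger on-chip second-level cache can be illustrated with the usual average-access-time calculation. This is a sketch under assumed numbers: the latencies and miss rates below are hypothetical, and the function name is introduced only for this example.

```python
def average_access_cycles(l1_hit_cycles, l1_miss_rate,
                          l2_hit_cycles, l2_local_miss_rate,
                          memory_cycles):
    """Average memory access time, in cycles, for a two-level hierarchy.

    l2_local_miss_rate is the fraction of L2 accesses (that is, of L1
    misses) that also miss in L2 and go to main memory.
    """
    # Average cost of servicing one L1 miss from L2 or, failing that,
    # from main memory.
    l1_miss_cost = l2_hit_cycles + l2_local_miss_rate * memory_cycles
    return l1_hit_cycles + l1_miss_rate * l1_miss_cost


# Hypothetical figures: single-cycle L1 with a 5% miss rate, 6-cycle
# on-chip L2 that misses on 20% of its accesses, 50-cycle main memory.
with_l2 = average_access_cycles(1, 0.05, 6, 0.20, 50)

# The same L1 with no L2: every L1 miss goes straight to main memory,
# modeled as a zero-latency L2 that always misses.
without_l2 = average_access_cycles(1, 0.05, 0, 1.0, 50)

print(f"with on-chip L2: {with_l2:.2f} cycles per access")
print(f"without L2:      {without_l2:.2f} cycles per access")
```

Under these assumptions the average access drops from about 3.5 cycles to about 1.8 cycles while the first-level cache stays small enough for single-cycle access.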

Another possible use of the vast transistor counts on an integrated circuit is to integrate a large number of small processors, homogeneous or heterogeneous, together with SRAM and DRAM in a SIMD or MIMD organization. The advantage of such an organization is potentially lower memory latency and higher memory bandwidth to each processor on the chip.


Last modified: Wed June 3 13:21 EST 1996