1. Overview
Computer architecture is the interface between the back-end of the compiler and the hardware structures that make up the microarchitecture. The job of the computer architect, in my view, is to refine that interface to best take advantage of what the compiler can produce and what the hardware is able to deliver.
The hardware limitations are fundamental at any point in time. Since this workshop deals with "Strategic Directions," I assume five to ten years out is a reasonable "point in time" to design to. That means 100 million to 1 billion transistors on a chip, and somewhat limited pin bandwidth.
This number of transistors, 100 million to 1 billion, is large, but not infinite. Thus, we cannot do everything on one chip and must make partitioning choices. Those choices will deal with raw bandwidth limitations and preferred latency alternatives.
The compiler limitations are harder to quantify, since they don't involve the more predictable process technology. Observations of the recent past, however, support the contention that compilers will continue to do better at performance on behalf of a single instruction stream than they will at producing effective multiple instruction streams.
Given the above, I would argue that high performance computer implementation must start with producing the highest performance uniprocessor on a chip, and then combining those uniprocessors into a multiple-chip multiprocessor or multi-computer.
2. A Comment on Conventional Wisdom
I realize that the previous statement is usually out of step with conventional wisdom. Every year, it seems, the seers of the day proclaim that no more performance can come from the uniprocessor, that we must take what the current uniprocessor gives us and address only the issues of connecting many of these processors together. At a government-sponsored workshop in 1986, when I argued for 4-wide issue, out-of-order execution, I was told that the MIPS R2000 would be the uniprocessor of choice. Last month (1996), when I argued for 16-wide issue with a great deal more hardware support for compile-time decisions, I was told that 4-wide issue, out-of-order execution would be the uniprocessor we will use. I believe there is still plenty of performance to be exploited at the uniprocessor level, and that part of our job here is to identify where the problems are and where the benefits are to be accrued, and to set strategic directions accordingly.
3. The Uniprocessor
The highest performing uniprocessor must deal with three issues: instruction supply, data supply, and instruction processing. This translates into three problems: branch prediction, memory bandwidth, and effective data flow.
Branch prediction is our single most important uniprocessor performance problem, since performance dictates wider issue and deeper pipelines, and prediction misses result in more and more wasted work. Flynn's bottleneck suggests wider issue (performance is limited by the issue width). Proposed cycle times suggest deeper pipelines. My personal design point for five years out is a 16-wide issue, 20-stage pipeline processor.
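To make the cost of a misprediction concrete, here is a back-of-the-envelope sketch, not a measurement. It assumes a deliberately simple model in which every mispredicted branch costs a full pipeline refill. The function names are hypothetical, and the branch fraction and prediction accuracy are illustrative inputs rather than figures from this statement.

```python
def wasted_issue_slots(issue_width, pipeline_depth):
    """Issue slots lost when one misprediction drains the whole pipeline."""
    return issue_width * pipeline_depth


def effective_ipc(issue_width, pipeline_depth, branch_fraction, accuracy):
    """Instructions per cycle under a simple refill-penalty model.

    Assumption: each mispredicted branch stalls useful issue for
    `pipeline_depth` cycles; everything else issues at full width.
    """
    base_cpi = 1.0 / issue_width
    mispredicts_per_instruction = branch_fraction * (1.0 - accuracy)
    cpi = base_cpi + mispredicts_per_instruction * pipeline_depth
    return 1.0 / cpi
```

For the 16-wide, 20-stage design point, `wasted_issue_slots(16, 20)` is 320 slots per misprediction. Comparing `effective_ipc(16, 20, 0.2, 0.95)` with `effective_ipc(16, 20, 0.2, 0.99)` shows how strongly delivered throughput depends on prediction accuracy under this model.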
Memory bandwidth is our second most important problem, as core cycle times continue to shrink at a faster rate than memory access times. Partitioning the processing capability so that part of it is at the memory cells is one suggested way to handle this. Other creative solutions involve organizing the code and data so that only what is used is brought onto the chip.
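One way to picture "organizing the code and data so that only what is used is brought onto the chip" is to count how many cache lines a loop touches under two data layouts. The sketch below is illustrative only: the 64-byte line size, the record sizes, and the function name are assumptions, not parameters taken from this statement.

```python
def lines_touched(num_records, stride_bytes, field_bytes, line_bytes=64):
    """Count distinct cache lines touched when reading one field per record.

    `stride_bytes` is the distance between consecutive copies of the field:
    the full record size for an array of records, or just the field size
    when the field is stored in its own packed array.
    """
    touched = set()
    for i in range(num_records):
        start = i * stride_bytes
        end = start + field_bytes - 1
        # A field may straddle a line boundary, so record every line it covers.
        for line in range(start // line_bytes, end // line_bytes + 1):
            touched.add(line)
    return len(touched)
```

Reading an 8-byte field from 1024 records of 64 bytes each, `lines_touched(1024, 64, 8)` gives 1024 lines, while the same field packed contiguously, `lines_touched(1024, 8, 8)`, gives 128. Under these assumptions the packed layout moves one eighth of the data across the pins for the same useful work.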
Data flow is our best hedge against performance degradation due to artifacts of the microarchitecture that have nothing to do with the flow dependencies of an algorithm. Current microprocessors have started to exploit data flow techniques. The next five to ten years are for expanding that exploitation.
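The gap between what the flow dependencies require and what a strictly sequential machine delivers can be sketched as follows. This is a minimal model with unlimited functional units and hypothetical latencies; the instruction list and function names are invented for illustration and do not describe any particular processor.

```python
def dataflow_finish_time(instrs):
    """Finish time when each instruction starts as soon as its producers finish.

    `instrs` is a list of (name, latency, [producer names]) in program order.
    Only true (flow) dependencies constrain the schedule.
    """
    finish = {}
    for name, latency, producers in instrs:
        start = max((finish[p] for p in producers), default=0)
        finish[name] = start + latency
    return max(finish.values())


def serial_finish_time(instrs):
    """Finish time when instructions execute strictly one after another."""
    return sum(latency for _, latency, _ in instrs)


# Hypothetical five-instruction fragment: two loads feed a multiply and an add.
example = [
    ("a", 3, []),          # load a
    ("b", 3, []),          # load b
    ("c", 2, ["a", "b"]),  # c = a * b
    ("d", 1, ["a"]),       # d = a + 1
    ("e", 1, ["c", "d"]),  # e = c + d
]
```

For this fragment, `dataflow_finish_time(example)` is 6 cycles and `serial_finish_time(example)` is 10. The four-cycle difference is the kind of loss that comes from the execution model rather than from the algorithm.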
4. The Microarchitecture, the Compiler, and the Algorithm
The previous section stresses what in my view are the three most important problems for a future high performance uniprocessor, but does not indicate whether these problems should be dealt with at the microarchitecture level, the compiler level, or the algorithm level. My answer: all three.
The greatest performance benefit can come from algorithm transformations, but the algorithm people need to know that the chip supports a degree of parallelism that will make their efforts worthwhile. Similarly, the compiler has the luxury of seeing the future (although admittedly only for a sample data set). Nonetheless, any decisions the compiler can make should be made at that level. Organization of code and data, prefetch and poststore instructions, larger atomic units of processing, and predicated execution are some examples of this. Finally, although the hardware cannot know the future, it is the only element of the three that knows the past for sure, and the only place where certain non-determinism can be dealt with deterministically.
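Of the compiler-level decisions listed above, predicated execution is perhaps the easiest to sketch. The idea is to turn a control dependence (a branch that must be predicted) into a data dependence (a predicate that selects a result). The example below only mimics that transformation at the source level: real predication is an instruction-set feature, and both function names are hypothetical.

```python
def clamp_branching(x, limit):
    """Clamp with a conditional branch: the hardware must predict the outcome."""
    if x > limit:
        return limit
    return x


def clamp_predicated(x, limit):
    """Clamp with the branch converted to a predicate (if-conversion).

    Both candidate values are available; the predicate p (0 or 1) is
    ordinary data that selects between them, so no branch is needed.
    """
    p = int(x > limit)
    return p * limit + (1 - p) * x
```

The two functions return the same result for any integer input. The predicated form always does the work of both paths, which is the usual trade: a little extra computation in exchange for removing a hard-to-predict branch.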
5. Concluding Comment
I believe that our strategic directions must dictate that to get the highest performance we need performance at all levels of granularity. Certainly that means a large number of nodes, correctly interconnected. But we cannot afford to ignore the node itself. My target for the node, within the time frame that this workshop is considering, would be a shared-memory MP (perhaps 16 processors), consisting of 16-wide issue, single-chip uniprocessors. I believe part of our task in this workshop is to provide the strategic direction, particularly to compiler research and microarchitecture research, to make that node effective.