Position Statement

Jim Smith
U Wisconsin

1 Introduction

We are at a crucial point in the development of techniques for high performance processing. Most of the methods now being used were originally proposed many years ago. Impressive gains have been made since the late 1970s. Many technologists and authors have published graphs showing remarkable rates of improvement in microprocessor performance over the years -- rates of improvement that are about two times as great as traditional mainframes and supercomputers. These charts typically begin with the Intel 8080, a highly serial processor with no caches and a non-overlapped interface to external memory. However, as is seldom pointed out, many of the gains in microprocessor performance since the 8080 have come from adopting innovations that were proposed for, and used in, large systems many years before. In fact, 30-40 years of hardware innovations have been compressed into 15-20 years of microprocessor development. From this perspective, steep performance improvement curves of microprocessors are much less impressive than they first appear.

Today, microprocessors have "caught up" with most innovations of mainframes and supercomputers, and the leading edge of innovation can now be found in microprocessor designs. A question that immediately comes to mind is: What fundamentally new techniques of the same importance as pipelining, superscalar, and VLIW are being proposed today for use in processors 10 years from now? For the most part, researchers investigating high performance processors are looking at evolutionary variations on already-known techniques for increasing instruction level parallelism (ILP) -- e.g. ways to increase superscalar issue by some small factor, more elaborate branch prediction methods, ways of increasing cache hit rates by more sophisticated buffering methods, more elaborate cache coherence methods, etc. These kinds of investigations are worthwhile, to be sure, but the research community should also be looking at bigger, more radical approaches whose value will only become widely-known many years from now.

What are the general characteristics we should be looking for? A high performance processor of the future will probably use the huge number of available transistors to sweep through programs, looking ahead with a horizon of hundreds or thousands of instructions, rather than a few tens as is the case today. The processor of the future will have to provide high bandwidths for fetching and executing instructions and for loading and storing memory data. It will use parallel resources to speculate heavily, occasionally having to throw away large blocks of computation, while simultaneously reaching for other large blocks. It will probably have some aspects in common with multiprocessing, but will depend on a mix of the compiler and aggressive hardware to discover hidden parallelism, often through speculation; the user will not be required to write explicitly parallel code.

1 Limitations of the Current Paradigms

Let's consider the limitations of the current superscalar and VLIW paradigms for ILP. It is these limitations that must be overcome by future paradigms if they are to be effective.

One widely recognized limitation of the superscalar approach is the window size that can be provided for pending instructions. In the recently-announced next generation processors, window sizes are on the order of at most a few 10s of instructions. These limited window sizes restrict the visibility of the parallel issue mechanism. I.e. any parallelism involving instructions that are farther apart than the window size will not be seen by the hardware. A major limiter of window size is branch prediction accuracy. That is, following a miss-prediction in current processors, any instructions placed in the window will later be discarded, rendering them useless.

A second limitation on ILP in many superscalar implementations is caused by the cross-checks that must be performed among multiple instructions as they are checked for data dependences during the dispatch phase.

A third limitation on performance for current ILP methods is that memory load and store instructions are usually identified and addresses are computed in sequential program order. In a typical superscalar implementation, this limits parallelism first by forcing instructions to be scanned and decoded in sequential order so that load and store instructions can be identified. It also forces the load/store instructions to generate their addresses in sequential order to allow address comparisons for conflicts.

Another limitation of existing ILP methods is the reliance on centralized data resources. For example, superscalar and VLIW processors use instruction sets having conventional register files shared by all the instructions. The reorder mechanism for recovering process state is often another centralized resource.

1 A Class of New ILP Paradigms

We now consider a new class of ILP methods which are intended to overcome some, or all, of the limitations of current approaches described above. It seems imperative that we break away from the single program counter model if we are to achieve significantly higher levels of instruction level parallelism. Strictly speaking, these new paradigms use multiple processing units. However, within a system they will be treated as a single processing resource -- only the hardware designer and compiler writer will be aware of the presence of multiple processing units. The class of new paradigms spans a spectrum of methods; the next two sections describe methods at or near the endpoints.

Low Overhead Autotasking

Autotasking: is a Cray Research term for the process of automatically parallelizing loops and executing them on a shared memory multiprocessor . The Convex ASAP method is similar. Parallel tasks are scheduled in user code via special shared communication and synchronization registers that are part of the architecture. Obviously, autotasking is not new, but to date it has mostly been used for highly loop-structured numeric computation in vector processors. As such, it is one of the few mainframe/supercomputer performance methods that has not yet been widely adopted in microprocessors.

A new wrinkle appears as we approach the point where we can put a multiprocessor on a chip. Processors are becoming a less significant part of total system cost, and we would like to parallelize more programs than straightforward numeric codes.

With regard to this last point, the communication and synchronization overheads are an important factor in deciding what can be effectively parallelized and the granularity at which parallelization takes place. Consequently, we emphasize \fIlow overhead\fP autotasking. Multiple processors on a chip will naturally allow reduced communication overhead among the processors. Furthermore, if processors are not an expensive commodity, the processors on a chip can be gang-scheduled with processors occasionally sitting idle when a serial piece of code is being executed. This should further reduce scheduling overheads. Through low overhead autotasking on a single chip, it might become advantageous to use parallel tasks containing only a few instructions -- this could open new opportunities for instruction level parallelism.

The Multiscalar Paradigm

At least superficially, a multiscalar processor resembles a multiprocessor, but with a single logical register file that is implemented with physical copies in the parallel processing units.

In the multiscalar model of execution, the static program is partitioned into sequential ``tasks''. A task may be a basic block, a number of connected basic blocks, a number of loop iterations, a function call, or any combination. The compiler is responsible for marking task boundaries. It does not have to have full knowledge of inter-task or intra-task dependences. Dependences are implied by the normal sequential execution semantics as in conventional architectures. Load balance, i.e. choosing tasks of roughly equivalent size, is one of the more important considerations.

The control unit of a multiscalar processor executes the program binary by processing each task as a single unit. Upon encountering each task, it predicts the subsequent task (by keeping record past history in a manner similar to dynamic branch prediction). Then the task is assigned to one of the parallel processing units. Consequently, the control unit scans the program (speculatively) taking large steps, a task at a time, not pausing to look at any of the individual instructions within a task (including branches that may be contained in a task). The multiple processing units fetch and execute instructions from each of the assigned tasks in parallel. The result is that overall many instructions are executed per clock cycle.

Each task consumes and produces values that are bound to architected registers and memory locations. Because of the natural sequential ordering of tasks, a value that a task consumes must be a value from an (earlier) task that produces it. Although copies are physically distributed, the register file has the appearance of a single, conventional file. When a register value is produced by one task and consumed by another later task, the value must be conveyed to the physical register file in the later task (and all tasks in between). This is accomplish via a combination of the compiler and hardware.

For memory, it is impossible for the compiler to know all the addresses in advance as it does with registers. Hence, the data communication linkages for memory values must be worked out at run time. This is done via a specially-constructed address resolution buffer (ARB). The ARB resolves out-of-order address hazards, and feeds them into an interleaved, multiported data cache.

In a multiscalar processor, tasks complete by committing their state changes (to registers and memory) in the same order in which the tasks were assigned to processors. This can only happen when the task has completed execution and when any predicted branch prior to the beginning of the task has been resolved. If a preceding branch prediction turns out to be wrong, the speculated task and all subsequent tasks are ``squashed'' and processing resumes with the last correct task.

1 "Features of the New Paradigms"

The reason for considering new ILP paradigms is to get beyond the significant limitations of the current approaches. So, we now consider the paradigms just described, low overhead autotasking and multiscalar processing, to see how well they address the limitations of the existing superscalar and VLIW paradigms.

First, the new paradigms can provide a greater effective window size because of the multiple program counters and the task size. That is, the distance between the beginning of the window and the end is at least the distance between the program counters of the logically oldest and newest tasks. This can be many hundreds or thousands of instructions.

Second, the branch accuracy problem can be fixed by embedding hard-to predict branches inside tasks as much as possible. If speculation at task boundaries is based on easy-to-predict branches and hard-to-predict branches are embedded within tasks, then many branches can be included within a window, yet the window can be very accurate.

Third, with autotasking the dispatch complexity problem is relieved by placing the burden for inter-task register value communication is on the compiler, as is the case with VLIW processing. In the multiscalar approach, inter-task register communication is combined between hardware and software.

Finally, for the load and store address resolution problem, there is no completely satisfying solution. Across tasks, compiler software must disambiguate (conservatively) and explicitly manage memory data communication. In multiscalar processors, relying on the compiler to find some data dependences can improve performance by reducing squashes. However, compiler detection is not required. The ARB hardware can resolve hazards in multiple tasks. This safety net allows hardware to discover much more parallelism than would otherwise be found. The complexity of this hardware, and the effect of this complexity on performance, remain to be understood fully.

1 "Other Possibilities"

Low overhead autotasking is relatively conservative and the multiscalar approach is quite aggressive in the search for instruction level parallelism. However, there is plenty of room for innovation in between. There are three significant points around which alternatives can be developed.

The appeal of autotasking is that conventional processors are used, and the cross-processor scheduling is done via software using hardware communication registers and instruction primitives that operate on the registers. The multiscalar approach (as currently being evaluated) uses a special hardwired control unit to parcel out tasks.

Low overhead autotasking uses separate logical register files in each processor, while multiscalar processors use a shared logical file and an update mechanism to keep the multiple physical copies coherent.

Low overhead autotasking relies on software to do memory disambiguation and to manage memory data communication between the parallel tasks. In the multiscalar approach, performance is improved if the compiler can find true data dependences, but it does not depend on it. Hardware can speculatively execute memory loads and stores and sort out the hazards later.

It is evident that one can conceive of a number of systems that use different variants of the above combinations. We will touch on two of them briefly.

One interesting variant is a software-scheduled multiscalar processor. Here, some hardware primitives may be provided, for example a block of shared registers and some primitive instructions to access and modify them. Then, the scheduling function can be done by the processors, themselves, rather than a special control unit. Speculation could still be used in assigning tasks, and speculative memory operations could still be allowed as in the multiscalar paradigm described above. This approach would eliminate the special task control unit hardware and facilitate the use of more conventional parallel processing where appropriate.

Another possibility is to use separate logical register files for the individual processors with explicit inter-task register data passing, but to use the speculative load/store scheme for memory data. The compiler has more knowledge of register data, because it tends to be more local and has fewer aliasing problems than memory data. Hence, the compiler might be able to communicate register data explicitly. Meanwhile, memory data can be managed in the multiscalar manner.

The bottom line is that the overall design space defined by the autotasking/multiscalar paradigm is quite large and will probably take many years to explore.

Acknowledgement

I thank Guri Sohi and all the members of the Kestrel multiscalar design project at the University of Wisconsin and the University of Minnesota for many discussions that contributed to the ideas presented here.


Last modified: Wed June 3 13:21 EST 1996