Long-term (accurate) prognostication is exceptionally difficult to perform in the field of computer architecture, because the field is so technology dependent. It requires anticipating and scheduling technological breakthroughs, which has always seemed like a terribly presumptuous thing to do. Certainly such requirements limit the range of believable predictions. However, based on what we have learned over the last 20-30 years, we can (relatively) safely make a few basic short-term assumptions:
While it is likely that new kinds of technologies will emerge, the projected behavior of established technologies like CMOS makes it unlikely that these new technologies will have a significant impact in the near future. However, it is likely that distributing a synchronous clock to hundreds of millions of transistors will be (at least) a significant engineering challenge. It is also true that processor designs that require hundreds of millions of transistors in order to achieve maximal performance (enormous branch predictors, for example, or huge instruction windows) will not move quickly and easily into newer, less dense technologies.
The solution? A distributed, decentralized (DD) approach to computing: employ multiple simple, independently programmable processing elements on-chip, elastically connected to one another and to memory via FIFO queues. In essence, this is a message-passing MIMD machine with explicit support for high-speed interprocessor communication, with programs compiled to exploit instruction-level rather than data-level parallelism.
Such a design would alleviate the synchronous clock distribution problem that plagues VLIW and superscalar processors, which can be thought of as very tightly coupled shared memory systems; they share not only addressable memory but also register space. In order to transmit data among operational units by writing and then reading the contents of a register, the clocks on VLIW and superscalar processors must be synchronized. Separating a program into multiple instruction streams employing an explicit message passing approach using FIFO ordered queues allows the decentralization of the hardware resources, since there is no central instruction window from which instructions are issued. Similarly, there is no central register file to be overloaded with contention among the processing elements, which allows for easier expandability in a MIMD approach.
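As a rough software analogy for this organization, the sketch below splits the expression (a + b) * (c - d) into two instruction streams, each run by its own "processing element" (here simply a Python thread), with a bounded FIFO carrying the intermediate value between them. The function name run_two_streams, the queue depth, and the choice of expression are illustrative assumptions, not part of any proposed design; threads merely stand in for independently clocked hardware.

```python
import queue
import threading


def run_two_streams(a, b, c, d, depth=4):
    """Evaluate (a + b) * (c - d) as two instruction streams joined by a FIFO.

    Hypothetical illustration: each thread models one simple processing
    element; there is no shared register file, only queue traffic.
    """
    pe0_to_pe1 = queue.Queue(maxsize=depth)  # bounded FIFO between elements
    result_q = queue.Queue(maxsize=depth)    # FIFO carrying the final value out

    def pe0():
        t = a + b
        pe0_to_pe1.put(t)       # blocks only if the FIFO is full

    def pe1():
        u = c - d               # independent of pe0, so it can run at any time
        t = pe0_to_pe1.get()    # blocks until pe0 has produced its value
        result_q.put(t * u)

    elements = [threading.Thread(target=pe0), threading.Thread(target=pe1)]
    for e in elements:
        e.start()
    for e in elements:
        e.join()
    return result_q.get()
```

The only synchronization between the two streams is the blocking put and get on the queue; neither element needs to know when, or at what rate, the other is running.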
The DD approach should also scale well both up and down as the number of available transistors changes, and it provides a potentially powerful mechanism for the compiler to pass on to the hardware the knowledge it maintains. Little research has been done on exploiting instruction-level parallelism across processors on a multiple-instruction-stream machine, but there are a number of reasons why this approach merits further investigation. Superscalar machines do not scale well: expanding the number of processing elements available necessitates a corresponding increase in the size of the hardware window over which code scheduling occurs, significantly increasing the scheduling complexity. Compilers for VLIW machines can help circumvent this problem, but do not support out-of-order execution well.
Exploiting instruction-level parallelism on MIMD architectures can potentially overcome both these problems. The instruction issue stage of each processor can operate in a simple single-issue, in-order manner, avoiding much of the hardware complexity required to support out-of-order issue in a single-instruction-stream approach. Out-of-order issue is nevertheless supported on a MIMD because the processors run independently; any independent instructions executed on different processors can issue in any order without any additional hardware support. This is fundamentally different from multiple issue in a VLIW machine because a strict ordering of instructions is not imposed by the compiler unless a dependence exists. Furthermore, by incorporating multiple program counters, a MIMD machine provides the architecture with more dataflow information by enriching the specification of the object language; taken to its extreme, this would allow a dataflow machine description of the program.
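To make the issue-order argument concrete, here is a toy cycle-stepped model of several single-issue, in-order processing elements. It is a sketch under simplifying assumptions of my own: every instruction is independent of those on other elements, each element stalls until its current instruction completes, and the latencies are invented for the example. The name simulate_issue_order is hypothetical.

```python
def simulate_issue_order(streams):
    """Return (cycle, element, label) tuples in the order instructions issue.

    streams maps an element name to its in-order list of (label, latency)
    pairs. Each element issues at most one instruction per cycle and only
    after its previous instruction has finished (a blocking in-order model).
    """
    pc = {pe: 0 for pe in streams}          # one program counter per element
    busy_until = {pe: 0 for pe in streams}  # cycle at which each element frees up
    issue_log = []
    cycle = 0
    while any(pc[pe] < len(streams[pe]) for pe in streams):
        for pe in sorted(streams):
            if pc[pe] < len(streams[pe]) and busy_until[pe] <= cycle:
                label, latency = streams[pe][pc[pe]]
                issue_log.append((cycle, pe, label))
                busy_until[pe] = cycle + latency
                pc[pe] += 1
        cycle += 1
    return issue_log


# Sequential program order is i1, i2, i3, i4; i1 is a long-latency operation.
example = {
    "pe0": [("i1", 3), ("i2", 1)],
    "pe1": [("i3", 1), ("i4", 1)],
}
order = [label for _, _, label in simulate_issue_order(example)]
# order is ["i1", "i3", "i4", "i2"]: i3 and i4 issue ahead of i2
```

Each element issues strictly in order, yet across the machine i3 and i4 overtake i2 while pe0 waits on i1. No instruction window or reorder logic is modeled; the reordering falls out of the independent program counters.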
While a MIMD approach to code scheduling clearly possesses certain advantages, historically these architectures have suffered from severe limitations. Data transfer latencies have been high, and the bandwidth required to support high-throughput, low contention data transfer has been unavailable because of pin limitations and/or board-level interconnects. Even if maximum data transfer rates can be made acceptable, the need to provide synchronization points can cause unacceptable performance loss. Using main memory to handle data transfers between processors can also lead to an unacceptable dependence on memory latency.
Increasing the number of transistors that can be fabricated per square centimeter provides the means by which many of the interprocessor communication problems can be eliminated. Placing several of these processing elements on the same die circumvents the pin limitations on bandwidth, and supports high on-chip data transfer rates. In addition, using FIFO queues in a manner similar to that used by decoupled machines provides a clean way to handle synchronization.
The problem of slow memory access times can also potentially be resolved by having some number of the on-chip processors responsible for accessing memory as far in advance as possible. These processors could execute a skeleton of the program, for example, potentially ensuring that data items arrive in the local memory before they are needed. The processing elements assigned to such a task could vary over time, or perhaps could be dedicated designs.
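A minimal sketch of this run-ahead idea follows, again using a thread and a bounded FIFO as stand-ins for hardware. The "access" element executes only the skeleton of the loop (the address stream) and pushes loaded values into a queue, while the "execute" element consumes them and does the arithmetic. The list standing in for memory, the sum-of-squares computation, and the name decoupled_sum_of_squares are all assumptions made for illustration.

```python
import queue
import threading


def decoupled_sum_of_squares(memory, indices, depth=8):
    """Sum memory[i] ** 2 over indices using separate access and execute streams.

    Hypothetical illustration: the access element may run ahead of the
    execute element by up to `depth` outstanding values.
    """
    load_q = queue.Queue(maxsize=depth)  # FIFO from access element to execute element
    done = object()                      # sentinel marking the end of the stream

    def access_element():
        # Skeleton of the program: address generation and loads only.
        for i in indices:
            load_q.put(memory[i])
        load_q.put(done)

    fetcher = threading.Thread(target=access_element)
    fetcher.start()

    # Execute element: consumes operands in FIFO order as they arrive.
    total = 0
    while True:
        item = load_q.get()
        if item is done:
            break
        total += item * item

    fetcher.join()
    return total
```

The queue depth bounds how far ahead the access stream can run; a deeper queue tolerates more memory latency at the cost of more buffering.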
There are obviously many details of this design that need to be hashed out. Should the processors be homogeneous or heterogeneous? What about backward code compatibility? What would an OS look like for such a machine? However, I feel that the potential benefits of this approach make it a clear candidate for further study.
Transistor densities in the near future will climb well into the hundreds of millions. Distributing a synchronous clock to all these transistors will be next to impossible. A better way for the compiler to transfer its wealth of information down to the hardware is also needed. Both of these problems can be solved by using the wealth of transistors to build multiple simple, independently programmable processing elements on-chip, elastically connected to one another and to memory via FIFO queues. In essence, this is a message-passing MIMD machine with explicit support for high-speed interprocessor communication, with programs compiled to exploit instruction-level rather than data-level parallelism.