Yale N. Patt
Graduate Students: Jared Stark, Marius Evers, Dan Friendly, Sanjay Patel, and Paul Racunas
Affiliated scientists: Dr. Stephen Melvin (Zytek), Dr. Michael Shebanow (Cyrix)
Professor: Yale Patt
Sponsor: Intel Corporation, Ross Technology
HPS is an execution model for implementing high performance computing engines. We derived the model and began our investigations in 1984 [1,2]. We have been continually refining the model ever since [3]. HPS exploits concurrency at the microarchitecture level via a variety of mechanisms such as multiple instructions issued each cycle, deep pipelines without blocking, multiple function units, out-of-order execution of dynamically scheduled instructions, a rapid retirement/repair mechanism, etc. Variations of the model have appeared in the announced products of many CPU manufacturers.
One on-going aspect of this research is the quantification and analysis of instruction level parallelism. This includes analysis of features of existing code and existing instruction set architectures as well as an understanding of the benefits compilers can provide to enhance parallel execution at the microarchitecture level. We attempt to identify beneficial as well as detrimental characteristics of an ISA so as to be able to design microarchitectural structures to efficiently execute compiled code. We are defining an instruction set architecture with the goal of providing an interface between hardware and software that allows for the efficient extraction of parallelism by the hardware.
This hardware includes instruction supply and delivery, checkpointing, dynamic scheduling, and memory disambiguation structures. In order to determine the cost effectiveness (which reflects the viability) of these structures, we are investigating, designing, and modeling alternative implementations. In addition, we are studying opportunities for new microarchitectural techniques to achieve higher performance. In all cases, we are quantifying the performance improvements of each mechanism.
There are a number of important performance issues regarding HPS. As issue width increases, one of the greatest impediments to issuing large numbers of instructions concurrently is the presence of instructions which redirect the instruction stream. This is manifest most severely by the termination of a packet (the set of instructions issued at the same time) at the first conditional branch. One way to remove this bottleneck is with an aggressive issue mechanism which supports issuing multiple branch instructions in the same cycle. The performance gained from this must be traded against the cost of supporting it.
We are also continuing to address the problem of keeping the CPU supplied with instructions and data. Our previous ("Yeh's Algorithm" [4,5]) and current [6,7] work in branch prediction attacks part of the problem. However, this does not address one of the most important challenges in high performance implementation today, the fact that the CPU cycle time is decreasing at a faster rate than cache and memory access times. The HPS out-of-order execution model helps by tolerating longer data cache and memory latencies than machines that execute instructions in order. We are investigating fetching and issuing instructions out-of-order as a technique for allowing the HPS execution model to tolerate instruction cache misses. But this will not be enough. In addition to quantifying the value of out-of-order execution and out-of-order issue as the solution of this problem, we are also studying the effects of instruction prefetching and aggressive memory disambiguation.
Students: Marius Evers, Sanjay Patel
Affiliated Scientist: Dr. Pohua Chang (Intel), Dr. Eric Hao
Professor: Yale N. Patt
Sponsors: Intel Corporation
Compiler optimization techniques are extremely important for achieving high performance with all computer architectures today. When the optimization techniques are applied properly, the performance improvement is significant. On the one hand, the hardware can take advantage of the information passed from the compiler to reduce pipeline stalls and thus improve performance. On the other hand, the compiler can make use of hardware configuration information to schedule code to achieve the best utilization of the hardware. To achieve the highest levels of performance, today's architecture design should be a composition of both hardware and compiler.
Our compiler research is divided into two efforts: compiler optimizations and a new HPS ISA. The compiler optimization work is described below. The HPS ISA work is described in the block-structured ISA section.
Traditional compiler research has focused on improving the performance of single-issue, statically-scheduled processors. Recently, the introduction of dynamically-scheduled processors with moderate issue widths has created interest in compiler optimizations for such machines. Our compiler research focuses on improving the performance of the HPS paradigm, a wide-issue, dynamically-scheduled processor. The HPS paradigm is capable of exploiting much higher levels of instruction level parallelism than can be exploited by current day processors. To achieve this performance potential, we are investigating several architecture- and microarchitecture-specific optimizations that exploit the out-of-order execution mechanism in HPS, in addition to traditional optimizations. These optimizations are as follows:
1. Instruction level parallelism optimizations
Instruction level parallelism (ILP) optimizations can eliminate flow dependencies within a program, which slow down its execution. Instructions that appear to be dependent are sometimes intrinsically independent. They simply appear dependent due to the way the user wrote the program. ILP optimizations transform such instructions into a set of independent instructions, uncovering more of the program's parallelism for an HPS implementation to exploit.
2. Predicated Execution
We are investigating the use of predicated execution in conjunction with HPS's speculative execution mechanism. Predicated execution can increase performance by eliminating branches and increasing the instruction level parallelism of a program. However, this feature adds the overhead of introducing unnecessary instructions into the machine. While speculative execution does not suffer from this overhead, a processor using the speculative execution mechanism incurs a large performance penalty when a branch is mispredicted. To gain the advantages of the predicated and speculative execution techniques, we are investigating ways to apply predicated execution to the hard-to-predict branches in a program and to use speculative execution on the remaining branches.
3. Code Scheduling
We are investigating code scheduling algorithms that improve performance for dynamically-scheduled machines. These algorithms will focus on: (1) generating branch conditions early to reduce branch penalty, (2) generating memory addresses early to assist the dynamic memory disambiguator, (3) scheduling load instructions early, and (4) scheduling instructions to avoid resource conflict/saturation stalls.
4. Memory Optimizations
Traditional memory optimizations have transformed the program so that the data cache hit rate is improved during execution. These optimizations along with HPS's ability to schedule around cache misses significantly reduce the performance loss due to memory latency. However, for applications with extremely large working sets, such as databases, memory bandwidth is the critical performance bottleneck. To reduce the impact of this bottleneck, we are investigating data placement algorithms that enable more effective utilization of a processor's bandwidth to memory. In particular, we are studying optimizations that rearrange a program's data structures and data layout to increase the probability that when the program makes an access to a cache block, the program will make accesses to most of the other elements in the block while the block is still in the cache. By doing so, the optimizations will reduce the number of memory accesses made by the processor while executing the program, reducing the program's bandwidth requirement.
Students: Marius Evers, Dan Friendly, Sanjay Patel
Affiliated Scientist: Dr. Stephen Melvin (Zytek)
Professor: Yale N. Patt
Sponsors: Intel Corporation
To address the problems faced by wide-issue processors, we are proposing a new class of instruction set architectures, block-structured ISAs [1,2]. The major distinguishing feature of a block-structured ISA is that it defines the architectural atomic unit (i.e. the instruction) to be a group of operations. Each operation corresponds roughly to an instruction in a conventional ISA. This feature simplifies the hardware required to implement a wide-issue processor and increases the processor issue density, thereby increasing performance. One simplification to the hardware is due to the reduction in architectural state that must be maintained because temporary values that do not live beyond the atomic block boundaries are no longer a part of that state [3]. Also, flow dependencies within a block can be explicitly represented, which simplifies the hardware required to decode and issue multiple instructions in parallel [3]. Block-structured ISAs increase the issue density by providing a mechanism for combining separate basic blocks into an enlarged atomic unit allowing the processor to issue past multiple branches.
Our research is examining two key issues in the implementation of block-structured ISAs: 1. block selection for the formation of enlarged atomic blocks and 2. branch prediction for enlarged atomic blocks.
1. The manner in which atomic blocks are enlarged plays a crucial role in the performance of a block-structured ISA processor. By combining more and more blocks, the issue bandwidth of the machine can be more fully utilized, the scope in which compiler optimizations can be applied is enlarged, and the opportunities to assign values to internal registers is increased. However, the number of branches included in a block increases as the blocks are made larger, decreasing the probability that a correct branch prediction can be made for the enlarged block. In addition, increasing the number of blocks that are combined in an enlarged block increases the amount of work that must be discarded in the event of a branch misprediction for one of the branches within that enlarged block. Block enlargement may also lower the icache hit rate because the block enlargement process increases the size of the program. This increase in size occurs when the block enlargement process duplicates a basic block in multiple enlarged blocks. As a result, a program whose blocks have been enlarged may use more icache lines than a program whose blocks have not been enlarged. The compiler must carefully consider the performance tradeoffs when combining basic blocks. To address this issue, our research is evaluating the performance benefits of different heuristics for forming enlarged blocks.
2. Because a single atomic block may consist of multiple basic blocks (and their associated branches), the branch predictor of a block-structured ISA processor must be able to make multiple branch predictions each cycle. The predictor must be able to make these predictions without requiring an increase in cycle time and without appreciably reducing branch prediction accuracy.
We have defined an initial version of a block-structured ISA for an HPS processor. We have implemented a compiler targeted to that ISA and a simulator to model the performance of the HPS processor which implements that ISA. Our initial results show that block-structured ISAs provides an average performance improvement of 12% over conventional ISAs using a real branch predictor and similar execution structures[4]. Using a perfect branch predictor and ICache, the improvement is around 50% indicating that there is still a large potential for further refinement. We are currently developing a better branch predictor for the processor as well as improving the compiler optimizations used to combine basic blocks. We are also continuing to refine the notion of a block-structured ISA, examining new ways in which the compiler and hardware can cooperate to achieve higher levels of performance.
Students: M. Evers, R. Chappell, P. Kim, S. Patel, P. Racunas
Affiliated Scientists: Dr. Tse-Yu Yeh (Intel), Mr. Eric Sprangle (Ross)
Professor: Yale N. Patt
Sponsor: Intel Corporation, Ross Technology
As the issue width and pipeline depth increase for high-performance superscalar processors, the amount of speculative work due to branch prediction becomes much larger. Since this work must be thrown away if the prediction is incorrect, an excellent branch predictor is vital to delivering the potential performance of a wide-issue, deeply-pipelined microarchitecture.
Earlier work in our group resulted in the Two-Level Branch Predictor, which Intel adopted for its newest state-of-the-art microprocessor, the Pentium-Pro [1,2]. Digital and AMD have also included variations in their latest releases. The Two-Level Predictor achieves 96% prediction accuracy on SPECint92 benchmarks. However, a prediction miss rate of 4 percent still results in an unacceptable loss in performance, due to the number of instructions fetched each cycle and the number of cycles before an incorrect prediction is realized.
We have proposed hybrid prediction using branch classification [3] as a mechanism to help improve the accuracy of branch predictors beyond this figure of 96%. Branch classification allows an individual branch instruction to be associated with the branch predictor best suited to predict its direction. Using this approach, a hybrid branch predictor can be constructed such that each component branch predictor is used for branches on which it performs best. We have proposed several hybrid branch predictors that achieve higher prediction accuracy than what was previously available in the literature. They use a branch classifier which partitions branches based on their dynamic taken rates in profiling code. We are developing other methods for classifying branches and for determining the appropriate component predictors to combine into a hybrid predictor.
In addition to our branch classification-based hybrids, we are continuing work on improving hybrid predictors which dynamically select which component predictor to use. We have proposed a dynamic selection mechanism for hybrid branch predictors with more than two single-scheme predictors [4]. Increasing the number of different prediction schemes incorporated into the hybrid branch predictor increases the number of branches that can be accurately predicted. For example, our multiple component hybrid predictor allows us to include predictors with shorter training times to assist in predicting during warm-up phases, such as those that occur after context switches.
We have also proposed a two-level hybrid predictor selector [5]. This uses a two-level mechanism to select the best component predictor to use and has been shown to outperform more conventional hybrid selectors. According to published information about Digital's Alpha 21264, its hybrid branch predictor uses this selection mechanism.
It has been shown that Pattern History Table interference severely limits the performance of two-level predictors at low implementation costs (up to about 16-64 KBytes). We have proposed two methods for reducing this effect: Filtering [6] and Agree Prediction [7]. Filtering dynamically identifies strongly-biased, and therefore, easy-to-predict branches. These branches are predicted with the simple "last time" predictor, while the 2-level predictor is reserved for the harder-to-predict branches. While the filtering mechanism reduces the amount of overall interference, the agree prediction mechanism focuses on shifting the interference from being harmful to being neutral. A single bit associated with each branch holds the most likely direction of that branch. The value of this bit can be set either dynamically or by the compiler. The two-level predictor will instead predict whether to agree or disagree with this one-bit prediction. Since it is more likely for the predictor to agree than to disagree, interference is now less likely to be harmful. The resulting reduction in harmful interference causes an increase in branch prediction accuracy.
As issue widths continue to increase, it becomes increasingly likely that more than one branch instruction should be issued every cycle. New issue mechanisms, such as the trace cache/fill unit, have arisen to attempt to provide instructions to fill this issue bandwidth. The limitation of predicting a single branch per cycle becomes an unacceptable bottleneck in the face of these new issue mechanisms. We are currently working on adapting our hybrid predictors and component single scheme predictors for predicting multiple branches using a single fetch address.
Students: Sanjay J. Patel, Daniel H. Friendly
Affiliated Scientist: Dr. Tse-Yu Yeh (Intel)
Professor: Yale Patt
Getting enough bandwidth from the fetch mechanism has become a critical issue in high performance computer design. In order to realize higher performance, increases in execution bandwidth achieved by adding functional units need to be balanced with increases in delivered bandwidth from the fetch mechanism. The conventional fetch mechanism composed of an instruction cache and a branch predictor is constrained by branches. Such mechanisms can deliver up to a basic block's worth of instructions per cycle, which on average is five instructions for common applications.
The trace cache is a mechanism which stores segments of the dynamic instruction trace, where each segment is composed of several basic blocks which need not be consecutive in the static image of the program. When a particular basic block is requested, the trace cache can potentially respond with the requested block along with several blocks that followed when the block was last executed. The trace cache is therefore not severely impacted by branches in delivering high bandwidth to the execution engine and can overcome the basic block hurdle. Our experiments have shown a 21% improvement in performance with the trace cache fetch mechanism on the SPECint95 benchmarks compared to an aggressive conventional mechanism.
The idea is not new. The original concepts were first put forth by this group [1][2] in 1988. The concept was more recently investigated by Rotenberg et al at the University of Wisconsin and in parallel here in our group [3]. This renewed interest in the trace cache indicates that the time for the trace cache has come as it is an effective means of delivering instruction bandwidth to a wide-issue execution core.
Our work on the trace cache also focuses on the supporting fill unit structure. The fill unit dynamically assembles trace cache segments by monitoring blocks as they are processed by the machine. The processing in the fill unit happens off the critical execution path and additional latency through this structure has little impact on machine performance. This has some good implications for machine design. The fill unit can recreate and mark instruction dependencies within a segment so that a newly fetched segment requires minimal processing before being added to the reservation stations. Structures like the dependency analysis logic and renaming logic can be moved off the execution path or made simpler. The fill unit also has good implications for high performance. Because multiple basic blocks are combined to form a segment, the fill unit can perform run-time optimizations that a compiler may not be able to perform. The fill unit can also retarget applications written in different ISAs to run more efficiently on the machine.
We feel that tomorrow's dynamically scheduled processors will contain a fill unit and trace cache. Our investigations are focusing on the long term, investigating creative uses of these structures for high performance.
Students: Dan Friendly and Jared Stark
Professor: Yale Patt
To exploit larger amounts of instruction level parallelism, processors are being built with deeper pipelines and wider issue widths. However, the occurrence of icache misses and mispredicted branches imposes a severe performance penalty on such machines. These events prevent the processor from fetching new useful instructions until the icache miss or mispredicted branch is resolved. Out-of-order issue reduces this performance penalty by enabling the processor to continue fetching useful instructions even in the event of an icache miss or hard to predict branch.
Out-of-order issue reduces the performance penalty caused by icache misses. In conventional processors, an icache fetch brings in a basic block (subject to issue constraints). Upon encountering an icache miss, an in-order issue processor will wait until the icache miss is serviced before continuing to fetch any new basic blocks. An out-of-order issue processor temporarily ignores the basic block associated with the icache miss, and attempts to fetch the basic blocks that follow the block associated with the miss. The addresses of these blocks can be generated by the branch predictor. Because the branch predictor does not depend on the instructions in the current block to generate the address of the next block to be fetched, it can continue to make predictions even when the current block misses in the icache. Thus, the processor can skip the block that missed in the icache and continue fetching at the following block using the address generated by the branch predictor. After the icache miss is serviced, the processor can fetch and issue the skipped block.
Out-of-order issue also provides a means of reducing the performance penalty due to mispredicted branches. When the processor fetches a branch, it must predict which direction the branch takes in order to determine which instructions to fetch next. Because the branch condition is usually not known at the time of the branch's fetch, the processor must guess which direction will be taken by the branch and speculatively fetch the instructions which follow. If the prediction is correct, no branch penalty is incurred. However, if prediction is incorrect, the processor pays the same penalty as if it had waited for the branch to be resolved. Current branch prediction technology is capable of achieving prediction accuracies on the order of 96% for conditional branches, but the remaining mispredictions still incur a large performance penalty.
Out-of-order issue reduces this penalty by not fetching the blocks that immediately follow a hard to predict branch until either an accurate prediction can be made for the branch, or the branch is resolved. During this period, the processor fetches and issues the instructions that begin at the merge point in the control flow graph of the paths that follow the branch. Because these instructions are control-independent of the branch, they are guaranteed to be on the program's correct path regardless of which direction the branch takes (assuming that all branches beyond the merge point are correctly predicted). The processor will continue to fetch and issue along this path until either an accurate prediction can be made, or the branch is resolved. At this point, it will return to the branch and begin fetching there. Upon reaching the merge point, the processor will jump past the instructions that it has already fetched and issued and continue fetching and issuing from the point at which it left off. As a result, instructions are fetched and issued out-of-order into the machine with respect to the program order.
We are examining the performance benefit provided by out-of-order issue. To do this, we are augmenting our full (executable-driven) simulator so that it will model the performance of a wide-issue, dynamically scheduled machine that supports out-of-order issue.