One chief way in which microarchitecture has advanced is through the synergy of approaches from several fields. For example, one common retroactive definition of RISC is that it was a marriage of the compiler with a vertically-encoded microengine: its success came from the synergy between the two. I see synergy as an emerging trend for microarchitecture research in particular, and for computer engineering in general.
To illustrate the synergy concept further, I will now describe two examples from the research my students and I have done:
1. Efficient Profiling
Techniques for ILP compilation often include methods to schedule operations beyond branch points, such as trace scheduling [Fish81], superblock scheduling [Hwuetal93], and hyperblock scheduling [MaLC92]. These techniques rely on profile information to predict branches and estimate code weights. Collecting profile information turns compilation into a three-step process: (1) the initial compile, (2) the profiling runs, and (3) the optimization compile. This three-step process replaces the traditional single compile step. To complicate matters further, profiled code often runs significantly slower than normal code, which limits the ability to capture day-to-day program usage. Because of these problems, commercial software vendors have largely rejected profiling.
My students and I have developed a method that uses branch prediction hardware to collect profile data, and it illustrates synergy at work. In this technique, the OS periodically samples hardware (e.g., a branch prediction buffer) to obtain profile data. Because hardware does the collection, there is little noticeable program slowdown during profiling [CoPC94b], [CoPM96]. This in turn allows unobtrusive profiling of utilities and applications, and the approach is easily integrated into the alpha- and beta-testing of commercial software packages. Hence, the awkwardness of profile-driven optimization was overcome through an approach that employs the hardware, the compiler, and the operating system together.
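The sampling idea can be sketched in a few lines of Python. This is a toy simulation, not the mechanism of [CoPC94b] or [CoPM96]: the 2-bit saturating counters, the class `BranchPredictionBuffer`, the function `sample_profile`, and the sampling period are all assumptions chosen for illustration.

```python
from collections import defaultdict


class BranchPredictionBuffer:
    """Toy table of 2-bit saturating counters, keyed by branch address (assumed design)."""

    def __init__(self):
        self.counters = {}  # branch address -> counter value in 0..3

    def update(self, pc, taken):
        # New entries start at 1 (weakly not-taken); counters saturate at 0 and 3.
        c = self.counters.get(pc, 1)
        self.counters[pc] = min(c + 1, 3) if taken else max(c - 1, 0)

    def snapshot(self):
        # What the hypothetical OS sampler reads out of the hardware.
        return dict(self.counters)


def sample_profile(trace, period):
    """Feed (pc, taken) pairs to the buffer and sample it every `period` branches.

    Returns {pc: fraction of samples in which the hardware predicted taken}.
    """
    bpb = BranchPredictionBuffer()
    taken_votes = defaultdict(int)
    samples = defaultdict(int)
    for i, (pc, taken) in enumerate(trace, start=1):
        bpb.update(pc, taken)          # normal hardware operation, no added cost
        if i % period == 0:            # periodic OS sampling point
            for addr, counter in bpb.snapshot().items():
                samples[addr] += 1
                taken_votes[addr] += counter >= 2   # counter >= 2 means "predict taken"
    return {addr: taken_votes[addr] / samples[addr] for addr in samples}


# A loop-closing branch that is always taken and a guard branch that never is.
trace = [(0x40, True), (0x80, False)] * 50
profile = sample_profile(trace, period=10)
```

The program itself is never instrumented in this model; the profile is assembled only from the periodic snapshots, which is why the slowdown can stay small. A compiler could then treat the sampled fractions as branch probabilities.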
2. VLIW Compatibility
The lack of compatibility between generations of VLIW architectures has been a key obstacle to their acceptance. These architectures do not maintain ISA compatibility across generations, even when the bit encodings and executable file formats are unchanged. This occurs because code scheduled for one mix of functional units will not necessarily produce correct results on another. Simply recompiling an executable to achieve compatibility is an option, but a limited one, since it currently faces severe commercialization difficulties (i.e., multiple executables, copy protection complications) [Odon95].
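To make the correctness problem concrete, here is a small Python sketch. It assumes one common source of incompatibility: operation latencies that the compiler bakes into the schedule for a machine without hardware interlocks. The text above does not say which machine differences are at fault, so the model, the `run` function, the register names, and the latencies are all illustrative assumptions.

```python
def run(schedule, latency, regs):
    """Execute a statically scheduled program with no hardware interlocks.

    schedule: list of cycles; each cycle is a list of (dest, op, src1, src2).
    latency:  {op: cycles until the result reaches the register file}.
    An operation that reads a register before a pending write lands
    sees the old value, exactly as the schedule (not the hardware) dictates.
    """
    regs = dict(regs)
    pending = []  # (cycle at which the write lands, dest, value)
    cycle = 0
    while cycle < len(schedule) or pending:
        # Writes whose latency has elapsed land at the start of the cycle.
        for item in [p for p in pending if p[0] <= cycle]:
            regs[item[1]] = item[2]
            pending.remove(item)
        if cycle < len(schedule):
            for dest, op, a, b in schedule[cycle]:
                value = regs[a] * regs[b] if op == "mul" else regs[a] + regs[b]
                pending.append((cycle + latency[op], dest, value))
        cycle += 1
    return regs


# Scheduled for a machine whose multiplier takes 2 cycles:
# the add is placed exactly 2 cycles after the mul it depends on.
schedule = [
    [("r3", "mul", "r1", "r2")],
    [],                              # explicit no-op cycle
    [("r4", "add", "r3", "r1")],
]
start = {"r1": 3, "r2": 4, "r3": 0, "r4": 0}

old_machine = run(schedule, {"mul": 2, "add": 1}, start)  # r4 == 15
new_machine = run(schedule, {"mul": 3, "add": 1}, start)  # r4 == 3 (stale r3)
```

The same bits produce different answers on the two machines because the dependence between the `mul` and the `add` is encoded only in their spacing, not checked by hardware.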
Hardware approaches to VLIW compatibility have been recommended in the literature [Rau93], [MeSP88], [FrSm94]. A limitation of the hardware approaches appears when code is executed on a machine whose resources differ significantly from the code's scheduling assumptions. In such a situation, the burden of performance falls upon the compatibility hardware. Thus, the hardware must be as complex as a high-performance superscalar engine in order to achieve high performance on old code. Such engines are often difficult to pipeline aggressively, which limits the overall cycle time of the processor. The dilemma is that aggressive cycle time is one of the key advantages of statically-scheduled microarchitectures, yet supporting compatibility through hardware can adversely impact the cycle time.
The synergistic approach that we took to the VLIW ISA compatibility problem was to reschedule code at first-time page faults, which yields compatibility without added hardware complexity [CoSa95]. The performance of the approach is on par with rescheduling via recompilation, yet it requires little hardware, no additional executable files, and no intervention by the user. An implementation requires support in the operating system, the instruction set encoding, the assembler, the linker, the compiler, and the file system.
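The control flow of rescheduling at first-time page faults can be sketched as follows. This is a toy model under stated assumptions, not the algorithm of [CoSa95]: the names `reschedule` and `ReschedulingPager`, the unit-latency operations, the greedy list scheduler, and the representation of a page as a list of (name, dependences) pairs are all hypothetical, and anti- and output dependences are ignored.

```python
def reschedule(ops, width):
    """Greedily pack ops into VLIW words of at most `width` operations.

    ops: list of (name, deps) in original program order, where deps names
         earlier ops whose results are needed. Unit latency is assumed.
    """
    placed = {}   # op name -> index of the word it was placed in
    words = []
    for name, deps in ops:
        # An op may issue no earlier than one word after its latest producer.
        cycle = max((placed[d] + 1 for d in deps), default=0)
        # Skip words that are already full on this machine.
        while cycle < len(words) and len(words[cycle]) >= width:
            cycle += 1
        while len(words) <= cycle:
            words.append([])
        words[cycle].append(name)
        placed[name] = cycle
    return words


class ReschedulingPager:
    """Hypothetical OS pager that reschedules a code page on its first fault."""

    def __init__(self, executable_pages, host_width):
        self.pages = executable_pages   # page number -> ops as stored in the file
        self.host_width = host_width    # issue width of the machine we run on
        self.resident = {}              # page number -> rescheduled words
        self.reschedule_count = 0

    def fetch(self, page_no):
        if page_no not in self.resident:            # first-time page fault
            self.resident[page_no] = reschedule(self.pages[page_no], self.host_width)
            self.reschedule_count += 1
        return self.resident[page_no]               # later accesses reuse the result


ops = [("a", []), ("b", []), ("c", ["a", "b"]), ("d", []), ("e", ["c"])]
pager = ReschedulingPager({0: ops}, host_width=2)
first = pager.fetch(0)    # [['a', 'b'], ['c', 'd'], ['e']]
again = pager.fetch(0)    # served from the resident copy, no rescheduling
```

The point of the design is that the rescheduling cost is paid once per page, and only for pages that are actually touched, so a single executable file can serve machines of different widths.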
- - -
In both of the above examples, a problem thought to be difficult was solved via the synergy between several layers of the system. Other often-cited difficult problems today call for the same kind of synergy, including:
o Early architectural design that takes into account final performance, power, cost, etc., of the architectural decisions.
o Low-power wireless applications that require synergy to solve problems of battery life and low cost, through concurrent design of the distributed OS, embedded processor microarchitectures, and compiler or hand-compilation aids, together with application writers.
Unfortunately, synergy requires generalists rather than specialists. Our entire educational system, from the Ph.D. down to the undergraduate curriculum, is focused on producing specialists. Nevertheless, synergy is emerging as an important force in microarchitecture research.
REFERENCES
[Fish81] J. A. Fisher, ``Trace scheduling: A technique for global microcode compaction,'' IEEE Trans. Comput., vol. C-30, no. 7, pp. 478--490, July 1981.
[Hwuetal93] W. W. Hwu, S. A. Mahlke, W. Y. Chen, P. P. Chang, N. J. Warter, R. A. Bringmann, R. G. Ouellette, R. E. Hank, T. Kiyohara, G. E. Haab, J. G. Holm, and D. M. Lavery, ``The superblock: An effective structure for VLIW and superscalar compilation,'' The Journal of Supercomputing, vol. 7, pp. 229--248, Jan. 1993.
[MaLC92] S. A. Mahlke, D. C. Lin, W. Y. Chen, R. E. Hank, and R. A. Bringmann, ``Effective compiler support for predicated execution using the Hyperblock,'' in Proc. 25th Ann. International Symposium on Microarchitecture, (Portland, OR), pp. 45--54, Dec. 1992.
[CoPC94b] T. M. Conte, B. A. Patel, and J. S. Cox, ``Using branch handling hardware to support profile-driven optimization,'' in Proc. 27th Ann. International Symposium on Microarchitecture, (San Jose, CA), Nov. 1994.
[CoPM96] T. M. Conte, B. A. Patel, K. N. Menezes, and J. S. Cox, ``Hardware-based profiling: An effective technique for profile-driven optimization,'' International Journal of Parallel Programming, vol. 24, Feb. 1996.
[Odon95] J. S. O'Donnell, ``Superscalar vs. VLIW,'' Computer Architecture News (ACM SIGARCH), pp. 26--28, Mar. 1995.
[Rau93] B. R. Rau, ``Dynamically scheduled VLIW processors,'' in Proc. 26th Ann. International Symposium on Microarchitecture, (Austin, TX), pp. 80--90, Dec. 1993.
[MeSP88] S. Melvin, M. Shebanow, and Y. Patt, ``Hardware support for large atomic units in dynamically scheduled machines,'' in Proc. 21st Ann. International Symposium on Microarchitecture, (San Diego, CA), pp. 60--66, Dec. 1988.
[FrSm94] M. Franklin and M. Smotherman, ``A fill-unit approach to multiple instruction issue,'' in Proc. 27th Ann. International Symposium on Microarchitecture, (San Jose, CA), pp. 162--171, Dec. 1994.
[CoSa95] T. M. Conte and S. W. Sathaye, ``Dynamic rescheduling: A technique for object code compatibility in VLIW architectures,'' in Proc. 28th Ann. International Symposium on Microarchitecture, (Ann Arbor, MI), Nov. 1995.