Research

My research focuses on the interaction between computer architecture and operating systems. I strive to devise new hardware/software interfaces that solve existing problems and potentially open up new opportunities for both architects and OS designers. Additionally, as a computer architect, I am cognizant of the fact that architectures change relatively infrequently and are hampered by a great deal of legacy software. I endeavour to produce designs that are both novel in their approach and practical in their utility. Below, I describe the current progress of two research projects and the future work that these projects lend themselves to. The first project improves the performance of networked hosts by throwing away the traditional hardware and software interfaces between network interface controllers and CPUs and replacing them with more efficient ones that exploit characteristics of network processing. The second project introduces a novel approach to thinking about register context, resulting in a new register renaming mechanism that enables a nearly ideal implementation of register windows and the possibility of many threads co-existing in one CPU core. Finally, I describe a tool developed by my research group that enabled the research described here to be carried out.

The Simple Integrated Network Interface Controller

The wheel of reincarnation refers to a process in which a computer architect chooses to design a peripheral device, adds some intelligence to that peripheral device, decides that the intelligence should be more general, and realizes that the end result is a simple peripheral device attached to an additional general-purpose processor. My research is an instance of going around the wheel of reincarnation when designing a faster network interface controller (NIC). Commodity NICs have some intelligence that allows them to relieve the central processor of some simple work. The advent of TCP offload engines (TOEs) goes further around the wheel by adding enough intelligence to the NIC to run the entire TCP/IP protocol stack. I argue that NIC design should come full circle on the wheel of reincarnation by moving the special functionality of the NIC back to the central processor.

I have demonstrated the importance of tighter integration between the NIC and CPU. I have shown that simple integration of a traditional NIC onto the CPU die can improve bandwidth by more than a factor of two relative to more conventional designs. Tighter integration alone provides significant benefits, but it also enables a redesign of the NIC itself to take advantage of the new properties of the interaction between the NIC and CPU, particularly lower latency. This leads to a NIC design that is significantly simpler than current high-performance NICs. This design, which I call the simple integrated NIC or SINIC, moves more of the intelligence of network processing to the CPU core, giving system software programmers significantly more flexibility in the way that system software uses the NIC. Thus, a suitably redesigned NIC enables software optimizations not possible with traditional NIC designs. V-SINIC, an extended version of SINIC, provides virtual per-packet registers, enabling packet-level parallel processing while maintaining a FIFO model. V-SINIC also enables deferring the copy of the packet payload on receive, which I exploit to implement a zero-copy receive optimization in the Linux 2.6 kernel.

The Virtual Context Architecture

Large numbers of logical registers can improve performance by allowing fast access to multiple subroutine contexts (register windows) and multiple thread contexts (multithreading). Support for both of these together requires a multiplicative number of registers that quickly becomes prohibitive. The virtual context architecture (VCA) \cite{oehmke:vca}, a new register-file architecture that virtualizes logical register contexts, overcomes this limitation. VCA works by treating the physical registers as a cache of a much larger memory-mapped logical register space. Complete contexts, whether activation records or threads, are no longer required to reside in their entirety in the physical register file. A VCA implementation of register windows on a single-threaded machine reduces data cache accesses by 20%, providing the same performance as a conventional machine while requiring one fewer cache port. Using VCA to support multithreading enables a four-thread machine to use half as many physical registers without a significant performance loss. VCA naturally extends to support both multithreading and register windows, providing higher performance with significantly fewer registers than a conventional machine.
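
The idea of treating physical registers as a cache of a larger logical register space can be illustrated with a toy model. The sketch below is not the VCA hardware design; it is a minimal software analogy under stated assumptions. The class name VirtualRegisterFile, the (context, register) keying, the dictionary standing in for the memory-mapped backing store, and the least-recently-used replacement policy are all hypothetical choices made for illustration.

```python
from collections import OrderedDict


class VirtualRegisterFile:
    """Toy model: a small physical register file caching a larger logical space.

    Assumptions (not taken from the VCA design): registers are named by
    (context_id, reg) pairs, replacement is least-recently-used, and a plain
    dict stands in for the memory that backs the logical register space.
    """

    def __init__(self, num_physical):
        self.num_physical = num_physical
        self.cache = OrderedDict()  # (context_id, reg) -> value, oldest first
        self.backing = {}           # spilled logical registers
        self.spills = 0
        self.fills = 0

    def _allocate(self, key, value):
        # Make room by spilling the least recently used register if full.
        if len(self.cache) >= self.num_physical:
            victim, victim_value = self.cache.popitem(last=False)
            self.backing[victim] = victim_value
            self.spills += 1
        self.cache[key] = value

    def write(self, context_id, reg, value):
        key = (context_id, reg)
        if key in self.cache:
            self.cache[key] = value
            self.cache.move_to_end(key)
        else:
            # The new value supersedes any stale spilled copy.
            self.backing.pop(key, None)
            self._allocate(key, value)

    def read(self, context_id, reg):
        key = (context_id, reg)
        if key in self.cache:
            self.cache.move_to_end(key)
            return self.cache[key]
        # Miss: fill from the backing store (never-written registers read 0).
        if key in self.backing:
            self.fills += 1
        value = self.backing.pop(key, 0)
        self._allocate(key, value)
        return value


def demo():
    # Three contexts of two registers each share four physical registers,
    # so no more than two complete contexts are ever resident at once.
    rf = VirtualRegisterFile(num_physical=4)
    for ctx in range(3):
        for reg in range(2):
            rf.write(ctx, reg, 10 * ctx + reg)
    values = [rf.read(ctx, reg) for ctx in range(3) for reg in range(2)]
    return values, rf.spills, rf.fills
```

In this model, every context still sees its own register values even though the physical file is smaller than the combined logical state; the cost shows up only as spill and fill traffic, which is the quantity a real implementation would try to keep low.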

The M5 Simulator

The research described above has all been done using the M5 simulator. M5 is a modular platform for computer architecture research, encompassing system-level architecture as well as processor microarchitecture. It is intended for use by researchers in academia or industry looking for a free, open-source, full-system simulation environment for processor, system, or platform architecture studies. Because the primary focus of the M5 development team has been simulation of network-oriented server workloads, M5 incorporates several features not commonly found in other simulators. M5 also integrates a number of other desirable features, including pervasive object orientation, multiple interchangeable CPU models, an event-driven memory system model, and multiprocessor capability. The M5 simulator is largely written in C++ and uses Python extensively for configuration. The code is freely distributable under a BSD-style license and does not depend on any commercial or restricted-license software.