Research Summary
Satish Narayanasamy

Deterministic Replay

BugNet: Reproducing and understanding software failures, especially those due to non-deterministic bugs (Heisenbugs) can be a nightmare. To solve this problem, I proposed a processor feature called BugNet [ISCA'05, IEEE Micro Award'05]. BugNet continuously logs a program execution while sacrificing very little performance (1\%). If the program crashes, the developer can use the log to debug the failure by deterministically replaying the last second of the program's execution that preceded the crash. I showed that this is sufficient to debug and understand the root cause of a majority of software failures [SoMET'06]. The key contribution in BugNet is the checkpointing scheme. Unlike prior schemes, BugNet uses a unique load-based logging scheme that captures only the execution of user code and user level libraries. This approach completely avoids the complexity of recording information about system events such as system calls, I/O, interrupts, DMA, etc. Even so, BugNet supports deterministic replay of user code across such non-deterministic system events. BugNet is system-independent (a property that is key for a feautre to be implemented in a processor)and easy to implement in a processor.

Strata: To record shared memory dependencies between concurrent threads, I proposed a logging primitive called Strata [ASPLOS'06]. Unlike the earlier mechanisms, Strata can capture multi-threaded dependencies in a snoop-based multi-processor system. Also, when compared to the state-of-the-art mechanism for a directory based system, Strata logs are 12 times smaller, and can be captured with 1/16th the hardware.

BugNet's Use In Industry

Automated Data Race Analysis: Microsoft has developed a software tool that implements a checkpointing scheme similar to BugNet using dynamic instrumentation. During a summer internship at Microsoft, I extended their tool's capability to automatically prioritize the most harmful data races using a replay-based dynamic analysis [PLDI'07]. This infrastructure has found a large user base among developers and testers at Microsoft, and has been used to find and fix over a thousand bugs in Windows Vista.

PinSEL for Architectural Simulation: Any application level processor simulator, such as Intel's ASIM or SimpleScalar has to emulate hundreds of system calls and complex system interactions such as interrupts and DMAs in order to execute a workload and evaluate its performance. Developing a system emulator is very time consuming, and it is also difficult to maintain and port the emulator across different operating systems. Colleagues and I built a record and replay software tool called PinSEL that implements BugNet's system independent checkpoint mechanism. A processor simulator like SimpleScalar or ASIM can be coupled with the PinSEL replayer, which takes care of correctly executing a workload across all system effects. PinSEL enabled Intel to do away with their tedious system emulation support in their ASIM simulator, and they were able to port ASIM to MAC OSX from Linux in a matter of a few days. PinSEL is now an integral part of Intel's ASIM based simulation infrastructure [SIGMETRICS'06].

Patching Processor Design Errors

Modern processors contain hundreds of design errors. However, unlike software bugs, there is currently no mechanism to transparently patch processor design errors in the field. I proposed a field programmable on-chip hardware unit to patch processor design errors in the field [ICCD'06, IEEE Micro Award'06 ]. On discovering a design error, the processor manufacturer can generate a patch consisting of an errata signature and a recovery method, and distribute it to customers. The errata signature is used to program the on-chip hardware, which then monitors the processor's execution to detect the error. When an error is detected, it is recovered with the help of replay and hypervisor support.

Support for Unbounded Transactions

For transactions to be a commercially viable solution for lock-free parallel programming, we need system support for virtualizing the execution of transactions in the presence of system events including context switches, cache evictions and paging. Colleagues and I developed a solution that builds on operating systems' virtual memory support to keep track of transactions' speculative data. A transaction's operations like conflict detection, commit and abort were accelerated using hardware support [ASPLOS'06].

Fault Tolerance

Future microprocessors will be highly susceptible to transient errors as the sizes of transistors decrease due to CMOS scaling. Prior techniques advocated full scale structural or temporal redundancy to achieve fault tolerance. Though they can provide complete fault coverage, they incur significant hardware and/or performance cost. It is desirable to have mechanisms that can provide partial but sufficiently high fault coverage with negligible cost. To meet this goal, we proposed a low-cost architectural mechanism that monitors events in the processor to predict transient faults. The proposed mechanism is based on the insight that when a fault occurs, it is likely that the incorrect execution would result in abnormally higher or lower number of mispredictions (branch mispredictions, L2 misses, store set mispredictions) than a correct execution. We designed a transient fault predictor that detects the anomalous behavior in the outcomes of the speculative structures to predict transient faults [DATE'07].

Security

I helped develop and implement compiler code generation optimizations to reduce the performance overhead of bounds checking, which can prevent buffer overflow attacks. We proposed a compiler optimization that eliminates the bounds checks that are not necessary to guarantee security. The optimization is based on the observation that buffer overflow attacks are launched through external inputs. Therefore, it is sufficient to bounds check only the accesses to those data structures that can possibly hold the external inputs. Also, it is sufficient to bounds check only the memory writes. The proposed optimizations reduce the number of required bounds checks as well as the amount of meta-data that need to be maintained to perform those checks [HiPEAC'07].

Hardware Profiler

Programs are very complex, and the real trick in generating useful run-time profiles is sifting through all the unimportant and infrequently occurring events to find those that are important enough to warrant optimization. I worked on a hardware profiler to efficiently catch important events even in the presence of extensive noise. This is achieved using a highly accurate sampling technique that uses multiple hash tables and monitors events over an interval to determine the importance of an event in relationship to all the other events. We evaluated our design for value and edge profiling, and showed that we can capture desired events with ~99% accuracy. The proposed design incurs small area cost, between 7 to 16 Kilobytes and it does not require any software support [HPCA'03].

Dynamic Optimization

I have also worked on a dynamic optimization technique to generate efficient software pipelined schedules for in-order machines A candidate loop is unrolled and aggressively scheduled. The schedule consists of a sequence of "trace blocks" each of which is represented using an indentifier. Thus the schedule of an unrolled loop is translated to a string of indentifiers. Instructions corresponding to a repeating pattern in the string form the kernal of our software pipelined schedule. The proposed dynamic optimization could improve the performance of in-order machines as much as 90% for certain programs and about 18% on average [HPCA'04].