CSE Technical Reports Sorted by Technical Report Number

TR Number Title Authors Date Pages

CSE-TR-383-99 Agent Communication with Differentiated Ontologies: eight new measures of description compatibility Weinstein and Birmingham Jan, 99 37
We propose an approach to achieve appropriate exchange of services and data in distributed systems subject to semantic heterogeneity. We assume _differentiated ontologies_: that terms have formal definitions as concepts related to other concepts, that local concepts inherit from concepts that are shared, and that most or all primitives are shared. We then develop measures of _description compatibility_ using the structure of the source and target definitions. We evaluate these measures by generating description-logic ontologies in artificial worlds. In our simulations, the "meaning" of a concept is its denotation in a finite universe of instances. The accuracy of the description compatibility measures can thus be judged by their success in predicting the overlap of concept denotations. Description compatibility can be used to guide agent search for services across communities that subscribe to differentiated ontologies.

CSE-TR-384-99 The Performance of Tuple Difference Coding for Compressing Databases and Data Warehouses Wu and Ravishankar Feb, 99 29
Databases and data warehouses have grown dramatically in size in recent years, and we are already seeing data warehouses that are tens of terabytes in size. Compression is important in such systems because it reduces space requirements as well as response times for queries, which are typically I/O-bound. Conventional compression methods tend to compress and decompress objects in their entirety, and are generally unsuitable for databases which require selective access to points in the space of tuples they represent. The database compression method called Tuple-Difference Coding (TDC) has been shown to meet the requirements for database compression, and to achieve high compression ratios in practice. The factors affecting the performance of TDC are known, but their effects are not well understood. This paper presents a theoretical analysis of the performance of TDC, and verifies these results through simulations. It presents analytical expressions for estimating the compression ratios when database tuples are composed of one or more attributes drawn from various distributions. It also studies the effects of attribute domain reordering on the compression ratio. Our simulations show excellent agreement with theory.
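The abstract does not spell out TDC's mechanics, but its name suggests the following toy sketch (an assumption for illustration, not the authors' exact algorithm): treat each tuple as a mixed-radix integer, sort the tuples, and store successive differences, which are small numbers that compress well while still permitting ordered, selective access.

```python
# Illustrative sketch of tuple-difference coding (an assumption based on the
# abstract, not the paper's exact method): tuples are mapped to integers in a
# mixed-radix space, sorted, and stored as successive differences.

def tuple_to_int(tup, domain_sizes):
    """Map a tuple to an integer by treating attributes as mixed-radix digits."""
    value = 0
    for digit, radix in zip(tup, domain_sizes):
        value = value * radix + digit
    return value

def tdc_encode(tuples, domain_sizes):
    """Sort the tuples and emit the first code plus successive differences."""
    codes = sorted(tuple_to_int(t, domain_sizes) for t in tuples)
    return [codes[0]] + [b - a for a, b in zip(codes, codes[1:])]

def tdc_decode(encoded, domain_sizes):
    """Recover the sorted integer codes, then the original tuples."""
    codes, acc = [], 0
    for d in encoded:
        acc += d
        codes.append(acc)
    tuples = []
    for value in codes:
        digits = []
        for radix in reversed(domain_sizes):
            digits.append(value % radix)
            value //= radix
        tuples.append(tuple(reversed(digits)))
    return tuples
```

In this simplified reading, the effects of attribute domain reordering studied in the report correspond to permuting `domain_sizes` (and the tuple attributes with it), which changes how small the successive differences are.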

CSE-TR-385-99 Mining Decentralized Data Repositories Crestana and Soparkar Feb, 99 19
Current technology for mining data typically applies to data stored centrally (i.e., in one single repository, homogeneous, with central administration, and a single schema). For instance, association rules algorithms assume that the data of interest is available at one location in a single table, or that prior to mining, the data will be warehoused centrally. However, it is important to study the mining of decentralized data (i.e., consisting of several tables, perhaps obtained via normalization or partitioning and allocation, stored in several repositories with possibly separate administration and schema). Real-life database design, management, and performance aspects suggest these considerations. While there have been a few extensions to mining algorithms for such data, the algorithms developed have largely been aimed at parallel processing (as opposed to the specific issues relating to decentralized data). In this paper, we examine the issues associated with mining decentralized data. We motivate the need for developing decentralized techniques, and identify the problems that must be addressed. We describe the mining of association rules for the case of a star schema wherein several smaller dimension tables are associated by foreign keys to a central fact table, and present an algorithm that adapts standard algorithms to work efficiently with such data. In contrast to approaches where the data would have been joined first to form a single table, we exploit the foreign key relationships to develop decentralized algorithms that execute concurrently on the separate tables, and thereafter merge the results. We provide analyses to assess our techniques, and we present empirical studies for particular cases to validate the performance.

CSE-TR-386-99 The Use of the MPI Communication Library in the NAS Parallel Benchmarks Tabe and Stout Feb, 99 22
The statistical analysis of traces taken from the NAS Parallel Benchmarks can tell one much about the type of network traffic that can be expected from scientific applications run on distributed-memory parallel computers. For instance, such applications use relatively few communication library functions, their message lengths vary widely, short messages greatly outnumber long ones, and within a single application the messages tend to follow relatively simple patterns. Information such as this can be used by hardware and software designers to optimize their systems for the highest possible performance.

CSE-TR-387-99 BLUE: A New Class of Active Queue Management Algorithms Feng Kandlur Saha and Shin Mar, 99 21
In order to stem the increasing packet loss rates caused by an exponential increase in network traffic, the IETF is considering the deployment of active queue management techniques such as RED. While active queue management can potentially reduce packet loss rates in the Internet, this paper shows that current techniques are ineffective in preventing high loss rates. The inherent problem with these queue management algorithms is that they all use queue lengths as the indicator of the severity of congestion. In light of this observation, a fundamentally different active queue management algorithm called Blue is proposed. Blue uses packet loss and link idle events to manage congestion. Using simulation and controlled experiments, Blue is shown to perform significantly better than RED, both in terms of packet loss rates and buffer size requirements in the network. As an extension to Blue, a novel technique for enforcing fairness among a large number of flows is described. In particular, this paper proposes and evaluates Stochastic Fair Blue (SFB), a queue management algorithm which can identify and rate-limit non-responsive flows using a very small amount of state information.
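The event-driven idea behind Blue can be sketched as follows (the constants and the freeze-time bookkeeping here are illustrative assumptions, not the paper's tuned values): a single marking probability is raised on packet-loss events and lowered on link-idle events, with updates rate-limited by a "freeze time" so the probability reflects sustained conditions rather than instantaneous queue length.

```python
# Minimal sketch of a Blue-style update rule: the marking probability is
# driven by loss and idle events, not queue length. All constants are
# illustrative assumptions.

FREEZE_TIME = 0.1   # seconds between successive updates (assumed value)
DELTA_INCR = 0.02   # increment on packet loss (assumed value)
DELTA_DECR = 0.002  # decrement on link idle (assumed value)

class Blue:
    def __init__(self):
        self.p_mark = 0.0        # probability of marking/dropping arrivals
        self.last_update = -1e9  # time of the last probability change

    def on_packet_loss(self, now):
        """Queue overflowed: congestion is severe, mark more aggressively."""
        if now - self.last_update >= FREEZE_TIME:
            self.p_mark = min(1.0, self.p_mark + DELTA_INCR)
            self.last_update = now

    def on_link_idle(self, now):
        """Link went idle: the queue is underused, back off marking."""
        if now - self.last_update >= FREEZE_TIME:
            self.p_mark = max(0.0, self.p_mark - DELTA_DECR)
            self.last_update = now
```

Because the only per-queue state is one probability and one timestamp, the state requirements are far smaller than schemes that track per-flow information, which is the property SFB builds on.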

CSE-TR-388-99 Distributed QoS Routing with Bounded Flooding for Real-Time Applications Kweon and Shin Mar, 99 27
Quality-of-Service (QoS) routing is indispensable to real-time applications on integrated services packet-switching networks. However, existing QoS-routing schemes based on either request flooding or link-state dissemination are not efficient in terms of their message overhead and/or operational cost. This paper proposes a QoS-routing algorithm with bounded flooding which not only dramatically reduces message overhead and operational cost, but also provides good performance. This significant overhead reduction is achieved by limiting the search area within which connection-request messages are flooded. Since our scheme is based on limited flooding, it requires neither on-demand path calculation nor link-state dissemination, both of which are known to be very expensive. In order to limit the search area, we employ at each node a distance table storing the hop counts of the shortest paths to all the nodes reachable via its outgoing links. Unlike link-state routing, the proposed scheme only needs network-topology information, which changes much less frequently than link-load states. Extensive simulation results have demonstrated the effectiveness of the proposed scheme in terms of performance, operational cost, and message overhead. In particular, under ordinary operating conditions, in which the network's physical topology does not change frequently, it is shown to be much better than existing QoS-routing schemes in terms of message overhead and operational cost, while providing reasonably good performance.
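The distance-table idea admits a compact sketch (the hop-budget forwarding rule below is a simplified assumption; the paper's bounding condition may differ): each node precomputes shortest-path hop counts, and a connection request is flooded over a link only if the neighbor can still reach the destination within the remaining hop budget.

```python
# Sketch of bounded flooding with per-node distance tables. The forwarding
# predicate is an illustrative assumption, not the paper's exact rule.
from collections import deque

def hop_counts(graph, src):
    """BFS shortest-path hop counts from src; graph: node -> neighbor list."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def forward_links(graph, node, dst, budget, tables):
    """Flood a request only over links whose neighbor keeps the total
    hop count within the budget, pruning the search area."""
    return [v for v in graph[node]
            if tables[v].get(dst, float("inf")) + 1 <= budget]
```

Since the tables depend only on topology, they need updating only when links appear or disappear, not when link loads change.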

CSE-TR-389-99 Improving Processor Performance by Dynamically Pre-Processing the Instruction Stream Dundas Apr, 99 266
The exponentially increasing gap between processor and off-chip memory speeds, as measured in processor cycles, is rapidly turning memory latency into a major processor performance bottleneck. Traditional solutions, such as employing multiple levels of caches, are expensive and do not work well with some applications. We evaluate a technique, called runahead pre-processing, that can significantly improve processor performance. The basic idea behind runahead is to use the processor pipeline to pre-process instructions during cache miss cycles, instead of stalling. The pre-processed instructions are used to generate highly accurate instruction and data stream prefetches, while all of the pre-processed instruction results are discarded after the cache miss has been serviced: this allows us to achieve a form of very aggressive speculation with a simple in-order pipeline. The principal hardware cost is a means of checkpointing the sequential state of the register file and memory hierarchy while instructions are pre-processed. As we discard all pre-processed instruction results, the checkpointing can be accomplished with a modest amount of hardware. The instruction and data stream prefetches generated during runahead episodes led to a significant performance improvement for all of the benchmarks we examined. We found that runahead typically led to about a 30% reduction in CPI for the four Spec95 integer benchmarks that we simulated, while runahead was able to reduce CPI by 77% for the STREAM benchmark. This is for a five-stage pipeline with two levels of split instruction and data caches: 8KB each of L1, and 1MB each of L2. A significant result is that when the latency to off-chip memory increases, or if the caching performance for a particular benchmark is poor, runahead is especially effective as the processor has more opportunities in which to pre-process instructions.
Finally, runahead appears particularly well suited for use with high clock-rate in-order processors that employ relatively inexpensive memory hierarchies.

CSE-TR-390-99 Transistor Level Micro Placement and Routing For Two-Dimensional Digital VLSI Cell Synthesis Riepe Apr, 99 300
The automated synthesis of mask geometry for VLSI leaf cells, referred to as the cell synthesis problem, is an important component of any structured custom integrated circuit design environment. Traditional approaches based on the classic functional cell style of Uehara & VanCleemput pose this problem as a straightforward one-dimensional graph optimization problem for which optimal solution methods are known. However, these approaches are only directly applicable to static CMOS circuits and they break down when faced with more exotic logic styles. There is an increasing need in modern VLSI designs for circuits implemented in high-performance logic families such as Cascode Voltage Switch Logic (CVSL), Pass Transistor Logic (PTL), and domino CMOS. Circuits implemented in these non-dual ratioed logic families can be highly irregular with complex geometry sharing and non-trivial routing. Such cells require a relatively unconstrained two-dimensional full-custom layout style which current methods are unable to synthesize. In this work we define the synthesis of complex two-dimensional digital cells as a new problem which we call transistor-level micro-placement and routing. To address this problem we develop a complete end-to-end methodology which is implemented in a prototype tool named TEMPO. A series of experiments on a new set of benchmark circuits verifies the effectiveness of our approach. Our methodology is centered around techniques for the efficient modeling and optimization of geometry sharing, and is supported at two different levels. Chains of diffusion-merged transistors are formed explicitly and their ordering optimized for area and global routing. In addition, more arbitrary merged structures are supported by allowing electrically compatible adjacent transistors to overlap during placement. The synthesis flow in TEMPO begins with a static transistor chain formation step. 
These chains are broken at the diffusion breaks and the resulting sub-chains passed to the placement step. During placement, an ordering is found for each chain and a location and orientation is assigned to each sub-chain. Different chain orderings affect the placement by changing the relative sizes of the sub-chains and their routing contribution. We conclude with a detailed routing step and an optional compaction step.

CSE-TR-391-99 Report on the Study of University Policies on Intellectual Property Galler Apr, 99 29

CSE-TR-392-99 Load-Sensitive Routing of Long-Lived IP Flows Shaikh Rexford and Shin Apr, 99 20
Internet service providers face a daunting challenge in provisioning network resources, due to the rapid growth of the Internet and wide fluctuations in the underlying traffic patterns. The ability of dynamic routing to circumvent congested links and improve application performance makes it a valuable traffic engineering tool. However, deployment of load-sensitive routing is hampered by the overheads imposed by link-state update propagation, path selection, and signalling. Under reasonable protocol and computational overheads, traditional approaches to load-sensitive routing of IP traffic are ineffective, and can introduce significant route flapping, since paths are selected based on out-of-date link-state information. Although stability is improved by performing load-sensitive routing at the flow level, flapping still occurs, because most IP flows have a short duration relative to the desired frequency of link-state updates. To address the efficiency and stability challenges of load-sensitive routing, we introduce a new hybrid approach that performs dynamic routing of long-lived flows, while forwarding short-lived flows on static preprovisioned paths. By relating the detection of long-lived flows to the timescale of link-state update messages in the routing protocol, route stability is considerably improved. Through simulation experiments using a one-week ISP packet trace, we show that our hybrid approach significantly outperforms traditional static and dynamic routing schemes, by reacting to fluctuations in network load without introducing route flapping.

CSE-TR-393-99 Analytical Macro-Modeling for High-Level Power Estimation Bernacchia and Papaefthymiou Apr, 99 5
In this paper, we present a new analytical macro-modeling technique for high-level power estimation. Our technique is based on a parameterizable analytical model that relies exclusively on statistical information about the circuit's primary inputs. During estimation, the statistics of the required metrics are extracted from the input stream and a power estimate is obtained by evaluating a model function that has been characterized in advance. Thus, our model yields power estimates within seconds, since it does not rely on the statistics of the circuit's primary outputs and, consequently, does not perform any simulation during estimation. Moreover, our model achieves significantly better accuracy than previous macro-modeling approaches, because it takes into account both spatial and temporal correlations in the input stream. In experiments with ISCAS-85 combinational circuits, the average absolute relative error of our power macro-modeling technique was at most 3%. For all but one circuit in our test suite, the worst-case error was at most 19.5%. In comparison with power estimates for a ripple-carry adder family that were obtained using Spice, the average absolute and worst-case errors of our model's estimates were 5.1% and 19.8%, respectively. In addition to power dissipation, our macro-modeling technique can be used to estimate the statistics of a circuit's primary outputs. Our experiments with ISCAS-85 circuits yield very low average errors. Thus, our technique is suitable for fast and accurate power estimation in core-based systems with pre-characterized blocks. Once the metrics of the primary inputs are known, the power dissipation of the entire system can be estimated by simply propagating this information through the blocks using their corresponding model functions.

CSE-TR-394-99 Trace Cache Design for Wide-Issue Superscalar Processors Sanjay J. Patel Apr, 99 166
To maximize the performance of a wide-issue superscalar processor, the fetch mechanism must be capable of delivering at least the same instruction bandwidth as the execution mechanism is capable of consuming. Fetch mechanisms consisting of a simple instruction cache are limited by the difficulty of fetching a branch and its taken target in a single cycle. Such fetch mechanisms will not suffice for processors capable of executing multiple basic blocks' worth of instructions per cycle. The Trace Cache is proposed to deal with fetch bandwidth lost to branches. The trace cache is a structure which overcomes this partial fetch problem by storing logically contiguous instructions---instructions which are adjacent in the instruction stream---in physically contiguous storage. In this manner, the trace cache is able to deliver multiple non-contiguous blocks each cycle. This dissertation contains a description of the trace cache mechanism for a 16-wide issue processor, along with an evaluation of basic parameters of this mechanism, such as relative size and associativity. The main contributions of this dissertation are a series of trace cache enhancements which boost instruction fetch bandwidth by 34% and overall performance by 14% over an aggressive instruction cache. Also included is an analysis of two important performance limitations of the trace cache: branch resolution time and instruction duplication.

CSE-TR-395-99 Stylistic Structures: An Initial Investigation of the Stochastic Generation of Tonal Music Alamkan Birmingham and Simoni May, 99 15
In this paper, we explore the use of a Markov Decision Network to generate music. We show how the network is trained and used to generate compositions of approximately 10 minutes in duration that are representative of the style of the composer whose music was used to train the network.

CSE-TR-396-99 Automated Partitioning of Tonal Music Pardo and Birmingham May, 99 34
The majority of research related to automated analysis of music presupposes human partitioning of the input into segments corresponding to significant harmonic or melodic chunks. In this paper, we analyze the difficulty of the partitioning problem in a formal manner. We then describe HarmAn, a system that partitions tonal music into harmonically significant segments corresponding to single chords and labels these segments with the proper chord labels. Input to the system consists of standard MIDI files, and the output is both an annotated piano-roll-style display and a text file with partitioning and chord-name information. Chord labels for segments are determined through template matching in pitch-class space, with conflicts between equal-scoring templates resolved through simple default preference rules. Our system's results are compared with those described in papers by Winograd (Winograd 1968), Maxwell (Maxwell 1992), and Temperley and Sleator (Temperley and Sleator 1999).
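Template matching in pitch-class space can be illustrated with a small sketch (the template set, scoring rule, and tie-breaking here are simplified assumptions, not HarmAn's exact ones): each candidate chord is a template transposed to every root, and a segment's label is the transposed template that best covers the segment's pitch classes.

```python
# Hypothetical sketch of chord labeling by pitch-class template matching.
# Scoring (matches minus non-chord tones) and the template set are
# illustrative assumptions.

# Chord templates as pitch-class sets relative to a root of 0.
TEMPLATES = {
    "maj": {0, 4, 7},
    "min": {0, 3, 7},
    "dom7": {0, 4, 7, 10},
}

def label_segment(pitch_classes):
    """Return (root, quality) of the best-matching transposed template."""
    best = None
    for root in range(12):
        for quality, template in TEMPLATES.items():
            chord = {(root + pc) % 12 for pc in template}
            score = len(chord & pitch_classes) - len(pitch_classes - chord)
            if best is None or score > best[0]:
                best = (score, root, quality)
    return best[1], best[2]
```

A real system would break score ties with preference rules (e.g. favoring more common chord qualities) rather than the first-found order used here.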

CSE-TR-397-99 A Constraint Satisfaction Approach to Tonal Harmonic Analysis Hoffman and Birmingham May, 99 24
This paper gives an algorithm for the harmonic analysis of a tonal composition using a constraint-satisfaction problem (CSP) model. The algorithm combines rule-based and preference-based approaches to perform harmonic analysis, taking the position that the problem is better modeled as a computational process than as a psychological phenomenon. Using cadential patterns within the piece, it identifies tonal centers and assigns the harmonic analysis, applying preferences when necessary. A software implementation of the algorithm is presented, along with a discussion of its results.

CSE-TR-398-99 Design Verification Using Reverse Engineering Hauke May, 99 167
Modern processors are very difficult to design and require advanced design verification methods to ensure that they function correctly. While simulation-based verification is the primary method used by industry, formal methods are gaining acceptance, although they are limited to relatively small designs. Equivalence checking, for example, is useful for verifying the equivalence between two levels of a design hierarchy, but is not applicable above the widely used register-transfer level (RTL). For this reason, new design tools are necessary to verify that an RTL description of an implementation matches its original instruction set architecture (ISA), which specifies the processor components and instructions visible to a programmer. This Master's thesis proposes a design verification method, Reverse Engineering Verification (REVE), based on analysis of an implementation's data flow. REVE takes a high-level RTL implementation of a new processor design and extracts (reverse engineers) specification information from the implementation's internal data paths. The reverse-engineered information is then matched against the original ISA specification to verify functional correctness of the implementation.

CSE-TR-399-99 Eager Writeback - a Technique for Improving Bandwidth Utilization Lee Tyson and Farrens Jun, 99 21
Modern high-performance processors utilize multi-level cache structures to help tolerate the increasing latency (measured in processor cycles) of main memory. These caches employ either a writeback or a write-through strategy to deal with store operations. Write-through caches propagate data to more distant memory levels at the time each store occurs, which requires a very large bandwidth between the memory hierarchy levels. Writeback caches can significantly reduce the bandwidth requirements between caches and memory by marking cache lines as dirty when stores are processed and writing those lines to the memory system only when that dirty line is evicted. This approach works well for many applications (e.g. SPEC95), but for graphics applications that experience significant numbers of cache misses due to streaming data, writeback cache designs can degrade overall system performance by clustering bus activity when dirty lines contend with data being fetched into the cache. In this paper we present a new technique called Eager Writeback, which re-distributes and balances traffic by writing dirty cache lines to memory prior to their eviction. This reduces the likelihood of writing dirty lines that impede the loading of cache miss data. Eager Writeback can be viewed as a compromise between write-through and writeback policies, in which dirty lines are written later than write-through, but prior to writeback. We will show that this approach can reduce the large number of writes seen in a write-through design, while avoiding the performance degradation caused by clustering bus traffic of a writeback approach.
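The compromise described above can be sketched in a few lines (the bookkeeping is heavily simplified, and the idle-bus trigger is an assumption about when eager writes occur): dirty lines are cleaned opportunistically when the bus has spare bandwidth, so a later eviction of those lines adds no write traffic to a burst of miss fills.

```python
# Minimal sketch of the Eager Writeback idea. The idle-bus trigger and
# one-line-per-idle policy are illustrative assumptions.

class EagerWritebackCache:
    def __init__(self):
        self.lines = {}       # tag -> dirty flag
        self.bus_writes = []  # trace of tags written back to memory

    def store(self, tag):
        self.lines[tag] = True  # a store makes the line dirty

    def on_bus_idle(self):
        """Eagerly clean one dirty line while the bus has spare bandwidth."""
        for tag, dirty in self.lines.items():
            if dirty:
                self.bus_writes.append(tag)
                self.lines[tag] = False  # line is now clean
                return tag
        return None

    def evict(self, tag):
        """Eviction writes back only if the line is still dirty."""
        dirty = self.lines.pop(tag)
        if dirty:
            self.bus_writes.append(tag)
        return dirty
```

The total write traffic is the same as pure writeback (each dirty line is written once), but the writes are spread into idle cycles instead of clustering with miss fills.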

CSE-TR-400-99 A Static Filter for Reducing Prefetch Traffic Srinivasan Tyson and Davidson Jun, 99 20
The growing difference between processor and main memory cycle times necessitates the use of more aggressive techniques to reduce or hide main memory access latency. Prefetching data into higher speed memories is one such technique. However, speculative prefetching can significantly increase memory traffic. We present a new technique, called Static Filtering (SF), to reduce the traffic generated by a given hardware prefetching scheme while preserving its reduced miss rate. SF uses profiling to select which load instructions should be marked "enabled" to do data prefetching. This is done by identifying which load instructions generate data references that are useful prefetch triggers. SF enables the hardware prefetch mechanism only for the set of references made by "enabled" loads. Our results from applying SF to two well-known hardware prefetching techniques, Next Sequential Prefetching (NSP) and Shadow Directory Prefetching (SDP), show that SF preserves the decrease in misses that they achieve and reduces the prefetch traffic by 50 to 60% for NSP and by 64 to 74% for SDP. In addition, timing analysis reveals that when finite memory bandwidth is a limiting factor, applying SF does in fact increase the speedup obtained by a baseline hardware prefetching technique. The other major contribution of this paper is a complete taxonomy that classifies individual prefetches in terms of the additional traffic they generate and the resulting reduction (or increase) in misses. This taxonomy provides a formal method for classifying prefetches by their usefulness. A histogram of the prefetches by category provides a new basis for comparing prefetch techniques.
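The profiling step can be sketched as follows (the usefulness threshold and the per-load bookkeeping are assumptions; the paper's exact selection criterion may differ): a profiling run records, per load instruction, how often the prefetches it triggered were useful, and only loads above a threshold are marked "enabled".

```python
# Illustrative sketch of the profiling pass behind Static Filtering.
# The 50% threshold is an assumed value, not the paper's.

USEFULNESS_THRESHOLD = 0.5  # assumed fraction of useful prefetches

def select_enabled_loads(profile):
    """profile maps a load's PC -> (useful_prefetches, total_prefetches);
    return the set of load PCs that should trigger hardware prefetching."""
    enabled = set()
    for pc, (useful, total) in profile.items():
        if total > 0 and useful / total >= USEFULNESS_THRESHOLD:
            enabled.add(pc)
    return enabled

def should_prefetch(pc, enabled):
    """At run time, the hardware prefetcher fires only for enabled loads."""
    return pc in enabled
```

Because the filter is static, the run-time cost is just one mark bit per load instruction; all the analysis happens offline during profiling.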

CSE-TR-401-99 Allocation By Conflict: A Simple Effective Cache Management Scheme Tam Tyson and Davidson Jun, 99 19
Many schemes have been proposed that incorporate an auxiliary buffer to improve the performance of a given size cache. One of the most thoroughly evaluated of these schemes, Victim caching, aims to reduce the impact of conflict misses in direct-mapped caches. While Victim has shown large performance benefits, its competitive advantage is limited to direct-mapped caches, whereas today's caches are increasingly associative. Furthermore, it requires a costly data path for swaps and saves between the cache and the buffer. Several other schemes attempt to obtain the performance improvements of Victim, but across a wide range of associativities and without the costly data path for swaps and saves. While these schemes have been shown to perform well overall, their performance still lags that of Victim when the main cache is direct-mapped. Furthermore, they also require costly hardware support, in the form of history tables for maintaining allocation decision information. This paper introduces a new cache management scheme, Allocation By Conflict (ABC), which generally outperforms both Victim and the history-based schemes. Furthermore, ABC has the lowest hardware requirements of any proposed scheme -- only a single additional bit per block in the main cache is required to maintain the information needed by the allocation decision process, and no swap-save data path is needed.

CSE-TR-402-99 Cost-Effective Adaptive Error Control Using a Hybrid of FEC and ARQ in Wireless Networks Choi and Shin Jun, 99 28
Wireless links are known to suffer location-dependent, time-varying, and bursty errors. This paper considers adaptive error-control schemes in the data link layer for reliable communication over wireless links, in which the error-control code and frame length are chosen adaptively based on the estimated channel state/condition. Three error-control schemes are considered according to (1) the number of code segments a packet is divided into, and (2) the way a lost packet is retransmitted. Through throughput performance and computation complexity analyses, these three schemes are compared, and then one of them is claimed to be the most attractive in terms of computation complexity and practicality even though its throughput performance is not the best. Simulation results also verify that this scheme works well over a time-varying fading channel. Error control for the MAC header and its effect on the performance of each error-control scheme are also considered since, without proper error protection for the header, it would be futile to exercise error control on the user data.

CSE-TR-403-99 Smart Register Files for High-Performance Microprocessors Postiff and Mudge Jun, 99 52
This report examines how the compiler can more efficiently use a large number of processor registers. The placement of data items into registers, called register allocation, is known to be one of the most important compiler optimizations for high-speed computers because registers are the fastest storage devices in the computer system. However, register allocation has been limited in scope because of aliasing in the memory system. To break this limitation and allow more data to be placed into registers, new compiler and microarchitecture support is needed. We propose the modification of register access semantics to include an indirect access mode. We call this optimization the Smart Register File. The smart register file allows the relaxation of overly-conservative assumptions in the compiler by having the hardware provide support for aliased data items in processor registers. As a result, the compiler can allocate data from a larger pool of candidates than in a conventional system. An attendant advantage is that the smart register file reduces the number of load and store operations executed by the processor. The simple addition of an indirect register access mode not only simplifies alias handling, but also provides opportunities for other optimizations. This research examines several such optimizations.

CSE-TR-404-99 Relaxed Multicast Protocols using Concurrency Control Theory Jensen and Soparkar Jun, 99 14
Techniques for coordinating distributed executions include process groups and transactions, each offering a different trade-off between degree of consistency and performance. Transaction concurrency control theory may be used to characterize process group multicast semantics; our framework, which uses conflict relations among events and the acyclicity properties of a particular precedence graph, may be used to describe the correctness of multicast orderings. In this paper, we describe protocols that realize new relaxed multicast orderings obtained by the use of concurrency control techniques. In particular, we demonstrate efficient, timestamp-based protocols for relaxed orderings that depend on data, operation, or any similar attributes that bring about such weaker orderings. In fact, we provide a protocol for any arbitrary reflexive and symmetric relation that may be defined among the message types, which makes our approach amenable to incorporating application semantics into multicast protocols. Using the precedence graph acyclicity framework, we can show that our protocols produce correct multicast delivery orderings. Furthermore, we discuss the efficiency of our protocols in terms of some qualitative performance metrics.

CSE-TR-405-99 A Brachiating Robot Controller Nakanishi Fukuda and Koditschek Aug, 99 47
We report on our empirical studies of a new controller for a two-link brachiating robot. Motivated by the pendulum-like motion of an ape's brachiation, we encode this task as the output of a "target dynamical system." Numerical simulations indicate that the resulting controller solves a number of brachiation problems that we term the "ladder," "swing up," and "rope" problems. Preliminary analysis provides some explanation for this success. The proposed controller is implemented on a physical system in our laboratory. The robot achieves behaviors including "swing locomotion" and "swing up," and is capable of continuous locomotion over several rungs of a ladder. We discuss a number of formal questions whose answers will be required to gain a full understanding of the strengths and weaknesses of this approach.

CSE-TR-406-99 Second-Order Rate-Based Flow Control with Decoupled Error Control for High-Throughput Transport Protocols Zhang and Shin Aug, 99 13
We propose an efficient flow and error control scheme for high-throughput transport protocols, which (i) decouples flow control from error control, and (ii) uses a second-order rate control, called alpha-control, for flow control and a new sliding-window scheme for error control. The alpha-control minimizes the need for packet retransmissions by adjusting the rate-gain parameter in response to variations in cross-traffic flows and their round-trip delays (RTDs). Using selective retransmission, the sliding-window scheme guarantees lossless transmission. Separating flow and error control simplifies both components and enhances throughput, since the source rate control is independent of the dynamics of the error-control window. Applying the alpha-control, the proposed scheme can drive the flow-controlled system to a retransmission-free equilibrium state. Using fluid analysis, we model the packet-loss behavior and derive closed-form expressions for packet losses, loss rate, and link-transmission efficiency. We prove that the alpha-control is a feasible and optimal linear control in terms of efficiency and fairness. Also presented are the vector-space analysis and simulation results that verify the analytical results and demonstrate the superiority of the proposed scheme over others in dealing with cross-traffic and RTD variations, controlling packet losses/retransmissions, and achieving buffer-use fairness and high throughput.

CSE-TR-407-99 Self-Organizing Network Services Jamjoom, Jamin and Shin Aug, 99 11
There is a proliferation of application-level servers that offer enhanced network services beyond the simple packet forwarding provided by the underlying network infrastructure. Examples of such servers range from Web server mirrors, Web caches, Web page distilling proxies, video transcoders, and application-level multicast routers, to application-level load-adaptive multipath routers. A fundamental question arising from the deployment of such servers is where to place them within a network. This paper explores technical issues related to the creation of an infrastructure to allow self-organization of network service placement based on observed demand for each service. In so doing, we propose a framework, called Sortie, whereby services are allocated on network nodes based on a set of very simple rules independently executed by each node. The distributed nature of allocation decisions ensures the scalability of the framework. We also present simulation results confirming the stability and efficiency of the proposed framework.

CSE-TR-408-99 A Temporal Consensus Model Zou and Jahanian Sep, 99 22
The active (state-machine) replication protocol is one of the commonly adopted approaches to providing fault-tolerant data services. In such a scheme, service state is replicated at all participating servers, and client requests are presented to _all_ the servers. A strict consensus protocol executed by the servers ensures that all servers receive all the requests in the same order. The outputs of each request from all the servers are then compared to mask the results of faulty server(s). However, the overhead associated with running the strict consensus protocol tends to slow down the response to client requests, which makes it unsuitable for real-time applications where timing predictability is a necessity. This paper presents a weak consensus model called "temporal consensus" that reduces this overhead and enables the active scheme to meet the timing constraints of real-time applications.
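The output-comparison step of active replication can be sketched as a simple majority vote over replica results; the paper's temporal consensus protocol itself (which relaxes the ordering step) is not reproduced here.

```python
from collections import Counter

# Sketch of the output-comparison step in active replication: every replica
# executes the same request, and a majority vote over their outputs masks a
# minority of faulty replicas.
def vote(replica_outputs):
    winner, count = Counter(replica_outputs).most_common(1)[0]
    if count * 2 <= len(replica_outputs):
        raise ValueError("no majority: too many faulty replicas")
    return winner
```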

CSE-TR-409-99 The Theory and Practice of Failure Transparency Lowell and Chen Oct, 99 15
System and application failures are all too common. In this paper, we argue that operating systems should provide the abstraction of failure transparency--the illusion that systems and applications do not fail. We construct a theory of consistent recovery that is useful for systems that want to provide failure transparency. The theory defines precisely what constitutes consistent recovery and provides a simple invariant that all applications must uphold to guarantee they get it. The theory unifies the various recovery protocols for achieving consistent recovery; existing protocols can be viewed as different ways of upholding our theory's invariant. We use the theory to design new recovery protocols and analyze existing protocols. We conclude by evaluating performance for a suite of recovery protocols. We focus our study on interactive programs, an application domain for which it is challenging to provide failure transparency. Our results indicate that it is possible to provide failure transparency for general applications with overhead of 1-2% for systems with reliable main memory, and 10-40% for disk-based systems.

CSE-TR-410-99 Discount Checking: Transparent Low-Overhead Recovery for General Applications Lowell and Chen Oct, 99 12
Checkpointing is a general technique for recovering applications. Unfortunately, current checkpointing systems add many seconds of overhead per checkpoint. Their high overhead prevents them from making failures transparent to users and other external entities, so failures lose visible state. This paper presents a checkpointing system called Discount Checking that is built on reliable main memory and high-speed transactions. Discount Checking can be used to make general-purpose applications recoverable easily and with low overhead. The checkpoints taken by Discount Checking are extremely fast, ranging for our target applications from 50 microseconds to a few milliseconds. Discount Checking's low overhead makes it possible to provide ideal failure transparency by checkpointing each externally visible event. Yet even with this high rate of checkpointing, Discount Checking slows real applications down by less than 0.6%.
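As a toy illustration of the "checkpoint each externally visible event" discipline, the sketch below snapshots application state just before every output, so a crash can roll back to a state consistent with everything the outside world has seen. Discount Checking does this with reliable main memory and fast transactions; the class and field names here are hypothetical.

```python
import copy

# Toy illustration: checkpoint before every externally visible event, so
# recovery never contradicts output the user has already observed. Discount
# Checking implements this with reliable main memory; this sketch just
# deep-copies a state dictionary.
class RecoverableApp:
    def __init__(self):
        self.state = {"count": 0}
        self._checkpoint = copy.deepcopy(self.state)
        self.outputs = []

    def work(self):
        self.state["count"] += 1          # internal work: no checkpoint needed

    def emit(self, message):
        self._checkpoint = copy.deepcopy(self.state)  # checkpoint first...
        self.outputs.append(message)                  # ...then make it visible

    def crash_and_recover(self):
        self.state = copy.deepcopy(self._checkpoint)  # roll back to checkpoint
```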

CSE-TR-411-99 Low-Cost Embedded Program Loop Caching - Revisited Lee, Moyer and Arends Oct, 99 15
Many portable and embedded applications are characterized by spending a large fraction of their execution time on small program loops. In these applications, instruction fetch energy can be reduced by using a small instruction cache when executing these tight loops. Recent work has shown that it is possible to use a small instruction cache without incurring any performance penalty [4, 6]. In this paper, we extend the work done in [6]. In the modified loop caching scheme proposed in this paper, when a program loop is larger than the loop cache, the loop cache can capture part of the program loop without any cache-conflict problems. For a given loop cache size, our loop caching scheme reduces instruction fetch energy more than other previously proposed loop cache schemes. We present quantitative results on how much power can be saved in an integrated embedded design using this scheme.
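The partial-capture idea can be illustrated with a tiny model: when a loop body exceeds the loop cache, only a fixed-size prefix of the loop is captured, and fetches within that prefix hit the small, low-energy cache while the rest go to the main instruction cache. The detection and indexing details in the paper differ; this is a hypothetical sketch.

```python
# Hypothetical sketch of partial loop capture: a loop cache of `size`
# instructions captures only the prefix [start, start + size) of a loop
# whose body is larger than the cache. Fetches in the prefix hit the small
# cache; all other fetches fall through to the main instruction cache.
class LoopCache:
    def __init__(self, size):
        self.size = size
        self.start = None            # first address of the captured prefix

    def capture(self, loop_start):
        self.start = loop_start      # (re)capture on loop detection

    def hit(self, addr):
        return (self.start is not None
                and self.start <= addr < self.start + self.size)
```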

CSE-TR-412-99 A Formalism for the Composition of Loosely Coupled Robot Behaviors Klavins and Koditschek Nov, 99 39
We address the problem of controlling large distributed robotic systems such as factories. We introduce tools which help us to compose local, hybrid control programs for a class of distributed robotic systems, assuming a palette of controllers for individual tasks is already constructed. These tools, which combine backchaining behaviors with Petri Nets, expand on successful work in sequential composition of robot behaviors. We apply these ideas to the design of a robotic bucket brigade and to simple, distributed assembly tasks.

CSE-TR-413-99 Windowed Certificate Revocation McDaniel and Jamin Nov, 99 19
The advent of electronic commerce and personal communications on the Internet heightens concerns over the lack of privacy and security. Network services providing a wide range of security-related guarantees are increasingly based on public key certificates. A fundamental problem inhibiting the wide acceptance of existing certificate distribution services is the lack of a scalable certificate revocation mechanism. We argue in this paper that the resource requirements of extant revocation mechanisms place a significant burden on certificate servers and network resources. We propose a novel mechanism called Windowed Revocation that satisfies the security policies and requirements of existing mechanisms and, at the same time, reduces the burden on certificate servers and network resources. We include a proof of correctness of windowed revocation and a trace-based performance study illustrating its scalability and general applicability.
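For intuition only, a verifier-side check under a windowed policy might look like the sketch below: a certificate is trusted only while its revocation window is open (and it has not appeared in revocation announcements within that window); after the window expires, fresh revocation information must be retrieved. The field names and exact policy are assumptions, not the paper's protocol.

```python
# Hypothetical sketch of a windowed-revocation check on the verifier side.
# `cert` is a dict with an issue time and a revocation-window length; the
# exact fields and policy in the paper differ.
def trusted(cert, now, revoked_serials):
    if cert["serial"] in revoked_serials:
        return False                                 # revoked in-window
    return now < cert["issued"] + cert["window"]     # window still open?
```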

CSE-TR-414-99 A Fully Associative Software-Managed Cache Design Hallnor and Reinhardt Nov, 99 12
As DRAM access latencies approach a thousand instruction-execution times and on-chip caches grow to multiple megabytes, it is not clear that conventional cache structures continue to be appropriate. Two key features, full associativity and software management, have been used successfully in the virtual-memory domain to cope with disk access latencies. Future systems will need to employ similar techniques to deal with DRAM latencies. This paper presents a practical, fully associative, software-managed secondary cache system that provides performance competitive with or superior to traditional caches without OS or application involvement. We see this structure as the first step toward OS- and application-aware management of large on-chip caches. This paper has two primary contributions: a practical design for a fully associative memory structure, the indirect index cache (IIC), and a novel replacement algorithm, generational replacement, that is specifically designed to work with the IIC. We analyze the behavior of an IIC with generational replacement as a drop-in, transparent substitute for a conventional secondary cache. We achieve miss rate reductions from 8% to 85% relative to a 4-way associative LRU organization, matching or beating a (practically infeasible) fully associative true LRU cache. Incorporating these miss rates into a rudimentary timing model indicates that the IIC/generational replacement cache could be competitive with a conventional cache at today's DRAM latencies, and will outperform a conventional cache as these CPU-relative latencies grow.
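For intuition, the core of a fully associative, software-manageable cache is a hash table: the fixed index-to-set mapping of a conventional cache disappears, so any block can live anywhere. The sketch below uses plain LRU replacement for brevity; the IIC's generational replacement algorithm is more elaborate and is not reproduced here.

```python
from collections import OrderedDict

# Rough sketch of a fully associative cache: a hash table maps addresses to
# blocks, so placement is unconstrained (the spirit of the IIC's indirect
# index). Replacement is plain LRU here; the paper's generational
# replacement algorithm differs.
class AssociativeCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()         # address -> data, recency order

    def access(self, addr):
        if addr in self.blocks:
            self.blocks.move_to_end(addr)   # hit: mark most recently used
            return True
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False) # evict least recently used
        self.blocks[addr] = "data"
        return False                        # miss
```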

CSE-TR-415-99 Out-of-Order Fetch Decode and Issue Stark Nov, 99 296
To exploit larger amounts of parallelism, processors are being built with ever wider issue widths. Unfortunately, as issue widths grow, instruction cache misses increasingly bottleneck performance. Out-of-order fetch, decode, and issue reduces the bottleneck due to these misses. The term _issue_ denotes the act of writing instructions into reservation stations, _fetch block_ denotes the instructions brought into the processor by an instruction cache fetch, and _branch predictor_ denotes the entire next-fetch-address generation logic. Upon encountering an instruction cache miss, a conventional processor waits until the miss has been serviced before continuing to fetch, decode, and issue any new fetch blocks. A processor with out-of-order fetch, decode, and issue temporarily ignores the block associated with the miss, and attempts to fetch, decode, and issue the blocks that follow. The addresses of these blocks are generated by the branch predictor. Because the predictor does not depend on the instructions in the current block to generate the address of the next fetch block, it can continue making predictions even if the current block misses in the cache. Thus, the processor can skip the block that missed and continue fetching, decoding, and issuing the following blocks using the addresses generated by the predictor. After servicing the miss, the processor can fetch, decode, and issue the skipped block. This dissertation evaluates the _potential_ performance of processors with various issue widths and window sizes, shows that enough potential performance exists to justify building a 16-wide processor, examines bottlenecks that would prevent this processor from achieving its potential performance, and demonstrates that the bottleneck due to instruction cache misses would be severe. It introduces a remedy for this bottleneck--out-of-order fetch, decode, and issue--describes its implementation, and demonstrates its usefulness.
For a 16-wide processor with a 16 KB direct-mapped instruction cache, instruction cache misses reduce its performance on a set of sixteen integer benchmarks by an average of 24%. When it is equipped with out-of-order fetch, decode, and issue, its performance is reduced by an average of only 5%.
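The skip-and-return fetch policy described above can be illustrated with a toy model: fetch blocks in the order the branch predictor supplies, set aside any block that misses in the instruction cache, and issue it once the miss is serviced. A real front end would bound the skip distance and preserve far more state; this is a sketch only.

```python
# Toy model of out-of-order fetch: follow the branch predictor's stream of
# fetch-block addresses, skip any block that misses in the instruction
# cache, and issue the skipped blocks after their misses are serviced.
def fetch_sequence(predicted_blocks, icache_contents):
    issued, skipped = [], []
    for addr in predicted_blocks:
        if addr in icache_contents:
            issued.append(addr)      # hit: issue immediately
        else:
            skipped.append(addr)     # miss: set aside, keep fetching
    issued.extend(skipped)           # misses serviced: issue skipped blocks
    return issued
```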

CSE-TR-416-99 DACIA: A Mobile Component Framework for Building Adaptive Distributed Applications Litiu and Prakash Dec, 99 14
Current computing systems exhibit an impressive heterogeneity of hardware resources, software capabilities, and network connectivity. They have to accommodate various application and client demands and to adjust to the continuously changing load on network segments and intermediate processing nodes. We present DACIA, a framework for building adaptive distributed applications through the flexible composition of software modules implementing individual functions. An application is seen as a collection of components connected in a directed graph. Through runtime reconfiguration of the application graph, the application adapts to changes in the execution environment, resource availability, and application requirements, achieving more efficient execution. The location where various components execute can be changed dynamically, and execution can occasionally be outsourced to nodes with specialized hardware or to less loaded sites on the network. Some components or groups of components can be replaced by others. DACIA supports application and user mobility, allowing parts of an application to move to different hosts at runtime while maintaining seamless connectivity.

CSE-TR-417-99 Improving Branch Prediction by Understanding Branch Behavior Evers Dec, 99 163
Accurate branch prediction can be seen as a mechanism for enabling design decisions. When short pipelines were the norm, accurate branch prediction was not as important. However, accurate branch prediction enables technologies like wide-issue, deeply pipelined superscalar processors. If branch predictors can be improved further, we can more successfully use more aggressive speculation techniques. Accurate branch prediction enables larger scheduling windows, out-of-order fetch, deeper pipelines, etc. It is therefore likely that there will be a growing demand for predictors more accurate than today's prediction technology. Previous studies have shown which branch predictors and configurations best predict the branches in a given set of benchmarks. Some studies have also investigated effects, such as pattern history table interference, that can be detrimental to the performance of these branch predictors. However, little research has been done on which characteristics of branch behavior make branches predictable. This dissertation approaches the branch problem in a different way from previous studies. The focus is on understanding how branches behave and why they are predictable. Branches are classified based on the type of behavior, and the extent of each type of behavior is quantified. One important result is that two thirds of all branches are very predictable using a simple predictor because they follow repeating patterns. We also show how correlation between branches works, and what part of this correlation is important for branch prediction. Based on this information about branch behavior, some shortcomings of current branch predictors are identified, new branch predictors are introduced, and potential areas for future improvement are identified. One of the new predictors, Dual History Length Gshare with Selective Update, is more accurate than a Gshare predictor using Branch Filtering while having a simpler implementation.
Another new predictor, the Multi Hybrid, suffers 10% fewer mispredictions than a state-of-the-art PAs/Gshare hybrid predictor at an implementation cost of 100 KB.

CSE-TR-418-99 Algebra-based Optimization Strategies for Decentralized Mining Jensen and Soparkar Dec, 99 24
