1996


Gwobaw Wu, Atul Prakash

Distributed Object Replication Support for Collaborative Systems

University of Michigan

CSE-TR-276-96 cover page

CSE-TR-276-96

Abstract: The ability to share synchronized views of interactions with an application is critical to supporting synchronous collaboration over a network. To keep bandwidth requirements and interactive response time low in wide-area networks, we have designed a toolkit, called DistView, that uses an object-level replication scheme, in which the application and interface objects that need to be shared among users are replicated at the various user sites. Existing schemes for maintaining consistency among replicated active objects are generally designed for database applications and do not necessarily work in collaborative applications, where response times or latencies are generally more critical than throughput or persistence. In this paper, we describe an object replication service that is designed to provide low interactive response times and low communication latencies in collaborative environments. We describe the algorithms used and the architecture of the system, and present an analysis of its performance.


Gheith Abandah, Edward Davidson

Characterizing Shared Memory and Communication Performance: A Case Study of the Convex SPP-1000

University of Michigan

CSE-TR-277-96 cover page

CSE-TR-277-96

Abstract: The objective of this paper is to develop models that characterize the memory and communication performance of shared-memory multiprocessors, which is crucial for developing efficient parallel applications for them. The Convex SPP-1000, a modern scalable distributed-memory multiprocessor that supports the shared-memory programming paradigm, is used throughout as a case study. The paper evaluates and models four aspects of SPP-1000 performance: scheduling, local memory, shared memory, and synchronization. Our evaluation and modeling are intended to supply useful information for application and compiler development.


Trent Jaeger, Atul Prakash

Using Simulation and Performance Improvement Knowledge for Redesigning Business Processes

University of Michigan

CSE-TR-278-96 cover page

CSE-TR-278-96

Abstract: Recent business improvement methodologies, such as Business Process Reengineering, advocate the redesign of business processes as the primary method to obtain improvement in business performance. Current approaches to process redesign assist the reengineer in identifying problems in meeting business goals, but provide little support in determining how to resolve those problems to meet the desired goals. The number of feasible changes to the business that can potentially resolve the problems is large and it is hard to predict the effect that each change will have on overall business performance. We present a process redesign methodology that helps reengineers to efficiently generate changes that can help meet business performance goals. Based on the methodology, we describe a system that includes: (1) an executable business model for specifying business processes and business performance goals; (2) a simulator that executes these processes to determine actual performance and to help reengineers identify problems in meeting the specified goals; and (3) a performance improvement knowledge base that suggests changes to resolve these problems and meet the specified goals. We demonstrate the use of the process redesign methodology and the system on two non-trivial process redesign examples, and report our experience.


Michael McClennen, Stuart Sechrest

Getting More Information into File Names

University of Michigan

CSE-TR-279-96 cover page

CSE-TR-279-96

Abstract: Hierarchical naming, while deeply embedded in our conception of file systems, is a rather weak tool for storing information about files and their relationships. A consequence is that users of today's file systems frequently have trouble locating files. We describe a system in which a standard directory tree is extended by allowing names to contain auxiliary components representing descriptive attributes rather than directory names. This system allows files to be characterized more extensively, and lets users choose among multiple organizational structures for their stored information. A prototype has been implemented by means of a new vnode layer under SunOS 4.1.3.


Ashish Mehra, Atri Indiresan, Kang Shin

Design and Evaluation of a QoS-Sensitive Communication Subsystem Architecture

University of Michigan

CSE-TR-280-96 cover page

CSE-TR-280-96

Abstract: There are a growing number of real-time applications (e.g., real-time control and audio/video conferencing) that require certain quality of service (QoS) guarantees from the underlying communication subsystem. The communication subsystem (host as well as network) must support real-time communication services that can be used to provide the required QoS of these applications, while providing reasonably good performance for best-effort traffic. In this paper we present and evaluate a QoS-sensitive communication subsystem architecture for end hosts that provides real-time communication support for generic network hardware. This architecture provides various services for managing communication resources for guaranteed-QoS (real-time) connections, such as admission control, traffic enforcement, buffer management, and CPU and link scheduling. The design of the architecture is based on three key goals: maintenance of QoS guarantees on a per-connection basis, overload protection between established connections, and fairness in delivered performance to best-effort traffic. Using this architecture, we implement real-time channels, a paradigm for real-time communication services in packet-switched networks. The proposed architecture features a process-per-channel model for protocol processing that associates a channel handler with each established channel. The model employed for handler execution is one of "cooperative" preemption, where an executing handler yields the CPU to a waiting higher-priority handler at well-defined preemption points. The architecture provides several configurable policies for CPU scheduling and overload protection. We evaluate the implementation to demonstrate that this architecture maintains QoS guarantees while adhering to the stated design goals. The evaluation also demonstrates convincingly the need for the specific features and policies provided in the architecture. Keywords: real-time communication, QoS-sensitive protocol processing, traffic enforcement, CPU and link scheduling.
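
The "cooperative" preemption model lends itself to a compact illustration. The C sketch below is hypothetical: the handler structure, the one-element ready queue, and all names are invented for exposition and are not the subsystem's actual API. An executing handler checks at each well-defined preemption point whether a higher-priority handler is waiting and, if so, yields.

    #include <stdio.h>

    struct handler {
        int priority;                /* higher value = more urgent */
        const char *name;
    };

    static struct handler *waiting = NULL;  /* ready-queue head (stub) */

    /* Called by a handler at each well-defined preemption point. */
    static int should_yield(const struct handler *self)
    {
        return waiting != NULL && waiting->priority > self->priority;
    }

    static void channel_handler(struct handler *self, int npackets)
    {
        for (int i = 0; i < npackets; i++) {
            printf("%s: processed packet %d\n", self->name, i);
            if (should_yield(self)) {
                printf("%s: yielding at preemption point\n", self->name);
                return;          /* a real handler would block and resume */
            }
        }
    }

    int main(void)
    {
        struct handler low  = { 1, "channel-A" };
        struct handler high = { 5, "channel-B" };
        channel_handler(&low, 3);   /* no one waiting: runs to completion */
        waiting = &high;
        channel_handler(&low, 3);   /* yields after the first packet */
        return 0;
    }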


Ashish Mehra, Atri Indiresan, Kang Shin

Resource Management for Real-Time Communication: Making Theory Meet Practice

University of Michigan

CSE-TR-281-96 cover page

CSE-TR-281-96

Abstract: A growing number of real-time applications (e.g., real-time control and audio/video conferencing) require certain quality of service (QoS) from the underlying communication subsystem. Real-time communication services are needed in the communication subsystem (host as well as network) to provide the required QoS to these applications while providing reasonably good performance for best-effort traffic. At the host, real-time communication necessitates that shared host resources such as CPU and link bandwidth be consumed according to the relative requirements of the active connections. This requires appropriate resource management policies for admission control and scheduling that are typically formulated assuming idealized resource models. However, when implementing these policies one must account for the performance characteristics of the hardware and software components involved, which can deviate significantly from those of the idealized resource models. In this paper, we focus on bridging the gap between theory and practice in the management of host CPU and link resources for real-time communication. Using our implementation of real-time channels, a paradigm for real-time communication in packet-switched networks, we illustrate the tradeoff between resource capacity and channel admissibility, which determines the number and type of real-time channels that can be accepted for service and the performance delivered to best-effort traffic. We demonstrate that this tradeoff is affected significantly by the choice of implementation paradigms and the grain at which CPU and link resources can be multiplexed among active channels. In order to account for this effect, we extend the admission control procedure for real-time channels originally proposed using idealized resource models. Our results show that practical considerations significantly reduce channel admissibility compared to idealized resource models. Further, the optimum choice of the multiplexing grain depends on several factors, such as the overheads of resource preemption, the relationship between CPU and link bandwidth, and the manner in which allocation of link bandwidth interacts with allocation of CPU bandwidth. Keywords: real-time communication, resource management, QoS-sensitive protocol processing, CPU and link scheduling.


Trevor Mudge, I-Cheng Chen, John Coffey

Limits to Branch Prediction

University of Michigan

CSE-TR-282-96 cover page

CSE-TR-282-96

Abstract: Branch prediction is an important mechanism in modern microprocessor design. The focus of research in this area has been on designing new branch prediction schemes. In contrast, very few studies address the inherent limits of predictability of programs themselves. Programs have an inherent limit of predictability due to the randomness of input data. Knowing the limit helps us to evaluate how good a prediction scheme is and how much we can expect to improve its accuracy. In this paper we propose two complementary approaches to estimating the limits of predictability: exact analysis of the program, and the use of a universal compression/prediction algorithm, prediction by partial matching (PPM), that has been very successful in the field of data and image compression. We review the algorithmic basis for both some common branch predictors and PPM, and show that two-level branch prediction, the best method currently in use, is a simplified version of PPM. To illustrate exact analysis, we use Quicksort to calibrate the performance of various branch predictors. With other programs, too complicated to analyze exactly, we use PPM as a measure of inherent predictability. Our initial results show that PPM can approach the theoretical limit in an analyzable program and perform just as well as the best existing branch predictors for SPECInt92. This suggests that universal compression/prediction algorithms, such as PPM, can be used to estimate the limits to branch prediction for a particular workload of programs.
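
To make the PPM idea concrete, the C sketch below implements a toy PPM-style predictor for a stream of branch outcomes: it predicts with the longest context seen so far and falls back to shorter contexts. This longest-match-with-fallback scheme and the parameters chosen are simplifications for illustration only, not the algorithm calibrated in the report.

    /* Toy PPM-style predictor for branch outcomes (1 = taken). */
    #include <stdio.h>

    #define MAXORDER 3
    static int counts[MAXORDER + 1][1 << MAXORDER][2];

    static int predict(unsigned hist)
    {
        for (int m = MAXORDER; m >= 0; m--) {
            unsigned ctx = hist & ((1u << m) - 1);
            int *c = counts[m][ctx];
            if (c[0] + c[1] > 0)
                return c[1] >= c[0];    /* majority vote at this order */
        }
        return 1;                       /* default: predict taken */
    }

    static void update(unsigned hist, int outcome)
    {
        for (int m = 0; m <= MAXORDER; m++)
            counts[m][hist & ((1u << m) - 1)][outcome]++;
    }

    int main(void)
    {
        /* A loop branch: taken twice, then not taken, repeated. */
        int stream[] = {1,1,0, 1,1,0, 1,1,0, 1,1,0};
        unsigned hist = 0;
        int correct = 0, n = sizeof stream / sizeof *stream;
        for (int i = 0; i < n; i++) {
            if (predict(hist) == stream[i]) correct++;
            update(hist, stream[i]);
            hist = ((hist << 1) | stream[i]) & ((1u << MAXORDER) - 1);
        }
        printf("accuracy: %d/%d\n", correct, n);
        return 0;
    }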


Stuart Sechrest, Chih-Chieh Lee, Trevor Mudge

Correlation and Aliasing in Dynamic Branch Predictors -- Full Simulation Results

University of Michigan

CSE-TR-283-96 cover page

CSE-TR-283-96

Abstract: The accompanying paper, "Correlation and Aliasing in Dynamic Branch Predictors," presents simulation results for a variety of branch prediction schemes using traces of the SPEC92 integer programs and the IBS-Ultrix benchmarks. However, due to space limitations and the need for clarity, only a subset of our simulation results was included. This technical report presents the full results for all of the benchmarks for each prediction scheme studied.


Nelson R. Manohar and Atul Prakash

Tool Coordination and Media Integration on Asynchronously-Shared Computer-Supported Workspaces

University of Michigan

CSE-TR-284-96 cover page

CSE-TR-284-96

Abstract: This paper presents a novel and flexible architecture to support the collaboration paradigm of asynchronous sharing of computer-supported workspaces, or simply replayable workspaces. Such a workspace may be composed of multiple interacting tools. Through the capture, re-execution, and manipulation of a session with an interactive computer-supported workspace, it is possible to reuse the process of how a task was performed (i.e., valuable collaborative information). Under our paradigm, a user session with a computer-supported workspace is encapsulated into a data artifact, referred to as a session object. The session object can then be manipulated through VCR-like controls. The session object is an active object, composed of heterogeneous media streams that represent input sequences to the various tools of the workspace. Our computer-supported workspace is made up of modular components that provide transparent management of their heterogeneous media. Such media may have or lack time-variance, periodicity, and statefulness. The architecture provides fine-grained integration of such heterogeneous media, while providing flexible support for coordinating the various tools found in a computer-supported workspace.


David Van Campenhout and Trevor Mudge

Faster Static Timing Analysis via Bus Compression

University of Michigan

CSE-TR-285-96 cover page

CSE-TR-285-96

Abstract: Static timing analysis is used extensively in the design of high-performance processors. In this paper we present a method that transforms the circuit graph used by timing analysis algorithms into a smaller one, and hence results in faster timing analysis. This transformation, called bus compression, may lead to a more conservative analysis. However, experiments on several large designs yielded the same results with or without bus compression.


Peter Chen, Wee-Teck Ng, Gurushankar Rajamani, Christopher Aycock

The Rio File Cache: Surviving Operating System Crashes

University of Michigan

CSE-TR-286-96 cover page

CSE-TR-286-96

Abstract: One of the fundamental limits to high-performance, high-reliability file systems is memory's vulnerability to system crashes. Because memory is viewed as unsafe, systems periodically write data back to disk. The extra disk traffic lowers performance, and the delay period before data is safe lowers reliability. The goal of the Rio (RAM I/O) file cache is to make ordinary main memory safe for persistent storage by enabling memory to survive operating system crashes. Reliable memory enables a system to achieve the best of both worlds: reliability equivalent to a write-through file cache, where every write is instantly safe, and performance equivalent to a pure write-back cache, with no reliability-induced writes to disk. To achieve reliability, we protect memory during a crash and restore it during a reboot (a "warm" reboot). Extensive crash tests show that even without protection, warm reboot enables memory to achieve reliability close to that of a write-through file system while performing 20 times faster. Rio makes all writes immediately permanent, yet performs faster than systems that lose 30 seconds of data on a crash: 35% faster than a standard delayed-write file system and 8% faster than a system that delays both data and metadata. For applications that demand even higher levels of reliability, Rio's optional protection mechanism makes memory even safer than a write-through file system while lowering performance 20% compared to a pure write-back system.


Wu-chang Feng and Kang G. Shin

Impact of Selection Functions on Routing Algorithm Performance in Multicomputer Networks

University of Michigan

CSE-TR-287-96 cover page

CSE-TR-287-96

Abstract: Maximizing overall performance in multicomputers requires matching application communication characteristics with a suitable routing scheme. However, since the communication demands of emerging applications vary significantly, it is hard for a single routing algorithm to perform well under all workloads. In order to study the complex dependencies between routing policies and communication workloads, we have performed a set of multi-factor experiments to better characterize routing performance. These experiments show that in addition to adaptivity, the selection functions used to order the candidate links greatly affect network performance under various traffic patterns. By supporting flexible routing, the network can tune its routing policies to application communication characteristics in order to improve performance.
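
To make the notion of a selection function concrete, the C sketch below orders a set of candidate output links under two generic policies, a static order and a shortest-queue order. Both policies and all names are illustrative assumptions, not the specific selection functions evaluated in the report.

    #include <stdio.h>

    struct link { int id; int queue_len; };

    /* Static order: always prefer the lowest-numbered candidate link. */
    static int select_static(const struct link *cand, int n)
    {
        int best = 0;
        for (int i = 1; i < n; i++)
            if (cand[i].id < cand[best].id) best = i;
        return best;
    }

    /* Congestion-based: prefer the candidate with the shortest queue. */
    static int select_shortest_queue(const struct link *cand, int n)
    {
        int best = 0;
        for (int i = 1; i < n; i++)
            if (cand[i].queue_len < cand[best].queue_len) best = i;
        return best;
    }

    int main(void)
    {
        struct link cand[] = { {0, 7}, {1, 2}, {2, 5} };
        printf("static order picks link %d\n",
               cand[select_static(cand, 3)].id);
        printf("shortest queue picks link %d\n",
               cand[select_shortest_queue(cand, 3)].id);
        return 0;
    }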


Neal J. Rothleder

Induction: Inference and Process

University of Michigan

CSE-TR-288-96 cover page

CSE-TR-288-96

Abstract: Despite the variety of discussions on induction, there is still no explicit definition of the concept. However, since most treatments appear to, at least implicitly, acknowledge such a definition, it seems clear that one exists. A key obstacle seems to be the confusion between what induction is and how induction should be performed. This research proposes that induction should be viewed in two distinct, but related, ways. The more general, abstract way gets to the heart of defining the concept, and treats induction purely as an inference. In this inference view, there is no concern for issues of efficiency, effectiveness, and other qualities of how induction might be performed. The second way to view induction is as the process that implements the inference. This paper establishes a definition of induction in the inference view. The subtleties and implications of this definition are discussed and compared with key issues and published opinions on induction.


Joseph D'Ambrosio

ISMAUT Tools: A Software Tool Kit for Rational Tradeoffs Among Conflicting Objectives

University of Michigan

CSE-TR-289-96 cover page

CSE-TR-289-96

Abstract: One primary task of engineering design is resolving the conflicting objectives that are inherent in the design process. As a result, design automation tools must provide designers with the ability to represent and apply the preferential knowledge required to resolve conflicting objectives. This paper describes a decision-theoretic approach for resolving conflicting objectives and a software tool kit, ISMAUT Tools, that enables access to this approach. ISMAUT Tools includes a library of primitive routines that can be linked into other applications, and a text-based interface providing stand-alone decision-making capabilities.


Robert W. Hall, Amit Mathur, Farnam Jahanian, Atul Prakash and Craig Rasmussen

Corona: A Communication Service for Scalable, Reliable Group Collaboration Systems (Preliminary Design)

University of Michigan

CSE-TR-290-96 cover page

CSE-TR-290-96

Abstract: We consider the problem of providing communication protocol support for large-scale group collaboration systems for use in environments such as the Internet which are subject to packet loss, wide variations in end-to-end delays, and transient partitions. We identify a set of requirements that are critical for the design of such group collaboration systems. We present a communication service, Corona, that attempts to meet these requirements. Corona supports two communication paradigms: the publish-subscribe paradigm and the peer group paradigm. In the publish-subscribe paradigm one or more data sources or publishers send data to multiple subscribers. This paradigm is characterized by the anonymous nature of communication, where a publisher is aware of the set of subscribers, but the subscribers are unaware of each other and are only aware of the publisher that they are receiving data from. In the peer group paradigm of communication, on the other hand, all the group members are aware of each other, and can send and receive data directly to each other. We present the interfaces provided by Corona to group collaboration applications. We describe the semantics of each interface method call and show how they can meet a variety of group collaboration requirements. We describe the protocols used to implement these semantics.


Tarek Abdelzaher, Anees Shaikh, Farnam Jahanian, and Kang Shin

RTCAST: Lightweight Multicast for Real-Time Process Groups

University of Michigan

CSE-TR-291-96 cover page

CSE-TR-291-96

Abstract: We propose a lightweight fault-tolerant multicast and membership service for real-time process groups which may exchange periodic and aperiodic messages. The service supports bounded-time message transport, atomicity, and order for multicasts within a group of communicating processes in the presence of processor crashes and communication failures. It guarantees agreement on membership among the communicating processors, and ensures that membership changes (e.g., resulting from processor joins or departures) are atomic and ordered with respect to multicast messages. We separate the mechanisms for maintaining consistency in a distributed system from those of maintaining real-time guarantees. The multicast and membership service may be used alone to support services that are not time-critical. An additional, separate real-time admission control layer provides on-line schedulability analysis of application traffic to guarantee hard real-time deadlines in a system with dynamically-changing loads. Moreover, we separate the concerns of processor fault-tolerance and link fault-tolerance by providing a protocol for supporting the former, regardless of the actual network topology and technology used to implement the latter. We provide the flexibility of an event-triggered approach with the fast message delivery time of time-triggered protocols, such as TTP, where messages are delivered to the application immediately upon reception. This is achieved without compromising agreement, order, and atomicity properties. In addition to the design and details of the algorithm, we describe our implementation of the protocol using the x-Kernel protocol architecture running under RT Mach 3.0.


J. Silva, K. Sakallah

GRASP -- A New Search Algorithm for Satisfiability

University of Michigan

CSE-TR-292-96 cover page

CSE-TR-292-96

Abstract: This report introduces GRASP (Generic seaRch Algorithm for the Satisfiability Problem), an integrated algorithmic framework for SAT that unifies several previously proposed search-pruning techniques and facilitates identification of additional ones. GRASP is premised on the inevitability of conflicts during search, and its most distinguishing feature is the augmentation of basic backtracking search with a powerful conflict analysis procedure. Analyzing conflicts to determine their causes enables GRASP to backtrack non-chronologically to earlier levels in the search tree, potentially pruning large portions of the search space. In addition, by "recording" the causes of conflicts, GRASP can recognize and preempt the occurrence of similar conflicts later on in the search. Finally, straightforward bookkeeping of the causality chains leading up to conflicts allows GRASP to identify assignments that are necessary for a solution to be found. Experimental results obtained from a large number of benchmarks, including many from the field of test pattern generation, indicate that application of the proposed conflict analysis techniques to SAT algorithms can be extremely effective for a large number of representative classes of SAT instances.


Peter Chen

Optimizing Delay in Delayed-Write File Systems

University of Michigan

CSE-TR-293-96 cover page

CSE-TR-293-96

Abstract: Delayed writes are used in most file systems to improve performance over write-through while limiting the amount of data lost in a crash. Delayed writes improve performance in three ways: by allowing the file cache to absorb some writes without ever propagating them to disk (write cancellation), by improving the efficiency of disk writes, and by spreading bursts out over time. For each of these benefits, this paper determines the optimal value for the delay interval: the smallest delay that achieves as good (or nearly as good) performance as an infinite delay. The largest value for optimal delay of any of the three benefits is fileCacheSize/diskTransferRate. This value is 1-10 seconds for current systems, implying that the delay used today (30 seconds) is much too large; a smaller delay would minimize data loss yet maintain the same performance.
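
As a quick worked example of that bound (with hardware numbers that are merely illustrative of the period, not taken from the report), a 16 MB file cache and a 5 MB/s disk give an optimal delay of about 3.2 seconds, within the 1-10 second range quoted above:

    /* Worked example of the bound: optimal delay <=
     * fileCacheSize / diskTransferRate.  The numbers are illustrative
     * assumptions, not measurements from the report. */
    #include <stdio.h>

    int main(void)
    {
        double file_cache_bytes  = 16.0 * 1024 * 1024;  /* 16 MB cache  */
        double disk_rate_bytes_s =  5.0 * 1024 * 1024;  /*  5 MB/s disk */
        printf("optimal delay <= %.1f s\n",
               file_cache_bytes / disk_rate_bytes_s);   /* prints 3.2 s */
        return 0;
    }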


Tien-Pao Shih

Goal-Directed Performance Tuning for Scientific Applications

University of Michigan

CSE-TR-294-96 cover page

CSE-TR-294-96

Abstract: Performance tuning, as carried out by compiler designers and application programmers to close the performance gap between the achievable peak and delivered performance, becomes increasingly important and challenging as microprocessor speeds and system sizes increase. However, although performance tuning on scientific codes usually deals with relatively small program regions, it is not generally known how to establish a reasonable performance objective and how to efficiently achieve this objective. We suggest a goal-directed approach and develop such an approach for each of three major system performance components: central processor unit (CPU) computation, memory accessing, and communication. For the CPU, we suggest using a machine-application performance model that characterizes workloads on four key function units (memory, floating-point, issue, and a virtual "dependence unit") to produce an upper bound performance objective, and derive a mechanism to approach this objective. A case study shows an average 1.79x speedup achieved by using this approach for the Livermore Fortran Kernels 1-12 running on the IBM RS/6000. For memory, as compulsory and capacity misses are relatively easy to characterize, we derive a method for building application-specific cache behavior models that report the number of misses for all three types of conflict misses: self, cross, and ping-pong. The method uses averaging concepts to determine the expected number of cache misses instead of attempting to count them exactly in each instance, which provides a more rapid, yet realistic assessment of expected cache behavior. For each type of conflict miss, we propose a reduction method that uses one or a combination of three techniques based on modifying or exploiting data layout: array padding, initial address adjustment, and access resequencing. A case study using a blocked matrix multiply program as an example shows that the model is within 11% of the simulation results, and that each type of conflict miss can be effectively reduced or completely eliminated. For communication in shared memory parallel systems, we derive an array grouping mechanism and related loop transformations to reduce communication caused by the problematic case of nonconsecutive references to shared arrays and prove several theorems that determine when and where to apply this technique. The experimental results show a 15% reduction in communication, a 40% reduction in data subcache misses, and an 18% reduction in maximum user time for a finite element application on a 56-processor KSR1 parallel computer.


David Van Campenhout, Trevor Mudge, and Karem Sakallah

Modeling Domino Logic for Static Timing Analysis

University of Michigan

CSE-TR-295-96 cover page

CSE-TR-295-96

Abstract: High performance microprocessors employ various advanced circuit techniques to achieve high clock frequencies. Critical sections of the design are often implemented in domino logic, a popular style of dynamic logic. This paper describes a new model for analyzing the temporal behavior of sequential circuits containing both static logic and domino logic. Our model integrates naturally with the SMO model for static timing analysis of sequential circuits.


David Van Campenhout, Trevor Mudge, and Karem Sakallah

Timing Analysis of Domino Logic

University of Michigan

CSE-TR-296-96 cover page

CSE-TR-296-96

Abstract: High performance microprocessors employ various advanced circuit techniques to obtain the required speed. Critical sections of the design are often implemented in domino logic, a popular style of dynamic logic. This paper describes a new model for analyzing the temporal behavior of sequential circuits using static logic mixed with domino logic. Our model integrates naturally with the SMO model for static timing analysis of sequential circuits.


G. Robert Malan, Farnam Jahanian, Craig Rasmussen, Peter Knoop

Performance of a Distributed Object-Based Internet Collaboratory

University of Michigan

CSE-TR-297-96 cover page

CSE-TR-297-96

Abstract: This paper examines the performance of the Upper Atmospheric Research Collaboratory (UARC), an object-based distributed system. For the past three years, the UARC software system has enabled space scientists to collaborate on atmospheric experiments in real time over the Internet. The UARC system provides a means for its users to view data from remote atmospheric instruments, share annotations of that data, discuss experimental results in a chat window, and simultaneously edit text manuscripts. However, UARC's distribution of atmospheric data to this geographically dispersed group of scientists is its primary mechanism for effecting their collaboration. This paper investigates the impact of UARC's implementation as a large distributed object-based software system as a means for supporting wide-area collaboratories. Specifically, it focuses on the communication performance and scalability of its object-based data distribution mechanism. First, Internet microbenchmarks are presented which characterize the UARC topology; then the results of application-level experiments are described that investigate UARC's use of NeXTSTEP's Distributed Object method invocations as a communication primitive. Finally, an analysis and discussion of the UARC system's object-based implementation concludes the paper.


Scott Dawson, Farnam Jahanian, and Todd Mitton

Experiments on Six Commercial TCP Implementations Using a Software Fault Injection Tool

University of Michigan

CSE-TR-298-96 cover page

CSE-TR-298-96

Abstract: TCP has become the de facto standard transport protocol in today's operating systems. It is a very robust protocol that can adapt to various network characteristics, packet loss, link congestion, and even significant differences in vendor implementations. This paper presents an implementation of a tool, called ORCHESTRA, that can be used for testing distributed protocols such as TCP, and a set of experiments performed on six different vendor TCP implementations using the tool. The paper also discusses some lessons learned about TCP implementations through experimentation.


Krishnendu Chakrabarty, Brian. T. Murray, John P. Hayes

Optimal Zero-Aliasing Space Compaction of Test Responses

University of Michigan

CSE-TR-299-96 cover page

CSE-TR-299-96

Abstract: Many built-in self-testing (BIST) schemes compress the test responses from a k-output circuit to q signature streams, where q << k, a process termed space compaction. The effectiveness of such a compaction method can be measured by its compaction ratio c = k/q. A high compaction ratio can introduce aliasing, which occurs when a faulty test response maps to the fault-free signature. We investigate the problem of designing zero-aliasing space compaction circuits with maximum compaction ratio c_max. We introduce a graph representation of test responses to study the space compaction process and relate space compactor design to a graph coloring problem. Given a circuit under test, a fault model, and a test set, we determine q_min, which yields c_max = k/q_min. This provides a fundamental bound on the cost of signature-based BIST. We show that q_min <= 2 for all the ISCAS 85 benchmark circuits. We develop a systematic design procedure for the synthesis of space compaction circuits and apply it to a number of ISCAS 85 circuits. Finally, we describe multi-step compaction, which allows zero aliasing to be achieved with any q, even when q_min > 1.
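
Aliasing is easy to demonstrate with the simplest possible space compactor, a parity tree that XORs all k outputs into q = 1 stream. The C sketch below is a generic illustration, not the report's synthesis procedure: one faulty response is detected, while another maps to the fault-free signature and aliases.

    #include <stdio.h>

    /* Compact one k-bit test response to a single parity bit. */
    static int compact(unsigned resp, int k)
    {
        int parity = 0;
        for (int i = 0; i < k; i++)
            parity ^= (resp >> i) & 1;
        return parity;
    }

    int main(void)
    {
        int k = 4;
        unsigned good   = 0xB;  /* fault-free response 1011 */
        unsigned fault1 = 0xA;  /* one bit flipped: 1010 -> detected */
        unsigned fault2 = 0x8;  /* two bits flipped: 1000 -> aliases */
        printf("good   parity: %d\n", compact(good, k));
        printf("fault1 parity: %d (differs: detected)\n",
               compact(fault1, k));
        printf("fault2 parity: %d (equal: aliasing!)\n",
               compact(fault2, k));
        return 0;
    }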


Sunghyun Choi and Kang G. Shin

A Cellular Wireless Local Area Network with QoS Guarantees for Heterogeneous Traffic

University of Michigan

CSE-TR-300-96 cover page

CSE-TR-300-96

Abstract: A wireless local area network (WLAN) or a cell with quality-of-service (QoS) guarantees for various types of traffic is considered. A centralized (i.e., star) network topology is adopted as the topology of a cell which consists of a base station and a number of mobile clients. Dynamic Time Division Duplexed (TDD) transmission is used, and hence, the same frequency channel is time-shared for downlink and uplink transmissions under the dynamic control of the base station. We divide traffic into three classes (say I, II, and III) according to the required QoS. Whenever there is no eligible class-I and II traffic, class-III traffic which requires no delay bound guarantees is transmitted, while uplink transmissions are controlled with a reservation scheme. Class-I traffic which requires a bounded delay and guaranteed throughput is handled with the framing strategy which consists of a smoothness traffic model and the stop-and-go queueing scheme. We also establish the admission test for adding new class-I connections. We present a modified framing strategy for class-II voice uplink transmissions which utilizes the wireless link efficiently at the cost of some packet losses. Finally, we present the performance (average delay and throughput) evaluation of the reservation scheme for class-III traffic using both analytical calculations and simulations.


James D. Dundas and Trevor N. Mudge

Using Stall Cycles to Improve Microprocessor Performance

University of Michigan

CSE-TR-301-96 cover page

CSE-TR-301-96

Abstract: Contemporary microprocessors typically expend a significant amount of their device budget in an attempt to reduce the detrimental effects of memory latency and branch misprediction. The extra hardware frequently reduces the clock rate and wastes valuable die area that might be more productively employed in other ways. We propose a novel approach to this problem, which is intended to be used in relatively simple microprocessors with very high clock rates. A runahead microprocessor continues to execute instructions past a level one data cache miss until the miss is serviced. During this time it is said to be in runahead mode. Any additional level one data cache misses that occur during runahead, whose target addresses can be calculated, can become data stream prefetches. Any conditional branches that are executed during runahead can become trial runs, which may improve branch prediction accuracy. Any level one instruction cache misses that occur during runahead become instruction stream prefetches. After the original data cache miss is serviced, the microprocessor restarts instruction execution at the address of the load or store instruction that caused the original data cache miss. The additional hardware required to checkpoint the sequential state of the microprocessor is rather modest, and should not impact the clock rate of a high-performance implementation.


Harumi A. Kuno and Elke A. Rundensteiner

The Satisfiability-Indicating Multi-Index Organization for Maintaining Materialized Path Query OODB Views

University of Michigan

CSE-TR-302-96 cover page

CSE-TR-302-96

Abstract: Materialized database views allow applications to benefit from the powerful flexibility of views while minimizing the performance penalties traditionally associated with views. However, the need to maintain materialized views in the face of updates limits the variety of queries that can be used to define them. In this paper we address the problem of incrementally maintaining OODB views formed using path queries. Traditional index organizations are not well suited for this task. The indexing needs of the path query view problem are unique in that, because the contents of the materialized view are cached and can be queried directly, the primary use for a supplemental index is during the propagation of updates rather than during query processing. Furthermore, traditional index organizations do not distinguish between single-valued and multi-valued attributes, and thus do not account for the fact that multi-valued attributes enable a single object at the head of a path to be associated with multiple instantiations of the path, any number of which could satisfy the path query predicate. This means that if an updated path involves a multi-valued attribute, then the aggregation hierarchy of the object at the head of the path must be completely re-calculated in order to determine whether or not that object participates in an alternative instantiation that fulfills the view query predicate despite the update. As a solution, we introduce a new Satisfiability Indicating Multi-Index (SMX) organization, which maintains partial information indicating whether or not a given endpoint satisfies the query predicate rather than what the exact value of the endpoint is. This new structure offers a number of benefits. (1) At most one path position forward must be traversed to determine whether or not the endpoint of an instantiation of the path fulfills a given path query predicate. (2) The SMX index structure only needs to be updated when the validity of an object's instantiation (in terms of the query predicate) changes. (3) No more than one path position forward must ever be traversed in order to identify whether or not a given object participates in any alternative instantiations that fulfill a given path query predicate. In addition to proposing this new index organization, we also present cost models and analytic evaluations comparing the performance of the SMX organization to those of the multi-index, nested index, and path index organizations with regard to calculating the effects of updates upon views. The results of our evaluations indicate that the SMX dramatically improves upon the performance of traditional index structures with respect to the problem of path query view maintenance. Keywords: incremental view maintenance, path queries, data warehousing, view materialization, and object-oriented databases.


Jude A. Rivers and Edward S. Davidson

Sectored Cache Performance Evaluation: A Case Study on the KSR-1 Data Subcache

University of Michigan

CSE-TR-303-96 cover page

CSE-TR-303-96

Abstract: Sectoring is a cache design and management technique that is re-emerging as cache sizes get larger and computer designers strive to exploit the possible gains from using large block (line) sizes due to spatial locality. Sectoring allows a small tag array to suffice, retaining address tags only for the large blocks, yet still avoids huge miss penalties by utilizing a smaller transfer size between the cache and the next higher level of memory. With this caching strategy comes the need for a new approach to evaluating cache performance, especially with regard to cache space and its best use, bus traffic, and so forth. In this study, we give a broad overview of the technique of sectoring in caches. We introduce a new set of metrics for cache performance evaluation, stressing cache block and bus traffic usage. We use this set of superfluity metrics to investigate the behavior of real scientific applications, and also to help determine adequate and appropriate cache design parameters. We show an example of how these metrics can help point to the spatial locality problems in a given application code, thereby indicating the code optimization techniques that can most significantly improve the code's performance.


Nelson R. Manohar, Atul Prakash

"Design Issues on the Support of Tools and Media in Replayable Workspaces"

University of Michigan

CSE-TR-304-96 cover page

CSE-TR-304-96

Abstract: This paper presents flexible support for asynchronous sharing (later-time use) of computer-supported workspaces (CSWs), or simply replayable workspaces. Through the interactive replay of a previously recorded CSW session, it is possible to reuse valuable collaborative information. The session is represented through heterogeneous media streams, each containing state and inputs to a CSW tool. We present here an architecture and tool/media interfaces for the support of tools as "plug-in" components of our replayable workspaces.


Jun Nakanishi, Toshio Fukuda, and Daniel E. Koditschek

Preliminary Analytical Approach to a Brachiation Robot Controller

University of Michigan

CSE-TR-305-96 cover page

CSE-TR-305-96

Abstract: We report on our preliminary studies of a new controller for a two-link brachiating robot. Motivated by the pendulum-like motion of an ape's brachiation, we encode this task as the output of a "target dynamical system." Numerical simulations indicate that the resulting controller solves a number of brachiation problems that we term the "ladder", "swing up" and "rope" problems. Preliminary analysis provides some explanation for this success. We discuss a number of formal questions whose answers will be required to gain a full understanding of the strengths and weaknesses of this approach.


I-Cheng K. Chen, Chih-Chieh Lee, Matthew A. Postiff, and Trevor N. Mudge

Tagless Two-level Branch Prediction Schemes

University of Michigan

CSE-TR-306-96 cover page

CSE-TR-306-96

Abstract: Per-address two-level branch predictors have been shown to be among the best predictors and have been implemented in current microprocessors. However, as the cycle time of modern microprocessors continues to decrease, the implementation of set-associative per-address two-level branch predictors will become more difficult. In this paper, we revisit and analyze an alternative tagless, direct-mapped approach which is simpler, has lower power, and has faster access time. The tagless predictor can also offer comparable performance to current set-associative designs since removal of tags allows more resources to be allocated for the predictor and branch target buffer (BTB). Further, removal of tags allows decoupling of the per-address predictors from the BTB, allowing the two components to be optimized individually. We then show that tagless predictors are better than tagged predictors because of opportunities for better miss-handling. Finally, we examine the system cost-benefit for tagless per-address predictors across a wide design space using equal-cost contours. We also study the sensitivity of performance to the workloads by comparing results from the Instruction Benchmark Suite (IBS) and SPEC CINT95. Our work provides principles and quantitative parameters for optimal configurations of such predictors.
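
The structure being discussed can be sketched in a few lines of C: an untagged, direct-mapped table of per-branch history registers indexed by low PC bits selects a 2-bit counter in a pattern table. The table sizes and widths below are illustrative assumptions, not the configurations studied in the paper.

    #include <stdio.h>
    #include <stdint.h>

    #define BHT_BITS  10             /* 1024 untagged history entries */
    #define HIST_BITS 8              /* 8 bits of per-branch history  */

    static uint8_t bht[1 << BHT_BITS];   /* branch history table */
    static uint8_t pht[1 << HIST_BITS];  /* 2-bit counters       */

    static int predict(uint32_t pc)
    {
        uint8_t hist = bht[(pc >> 2) & ((1 << BHT_BITS) - 1)];
        return pht[hist] >= 2;           /* MSB of the 2-bit counter */
    }

    static void update(uint32_t pc, int taken)
    {
        uint32_t i = (pc >> 2) & ((1 << BHT_BITS) - 1);
        uint8_t hist = bht[i];
        if (taken  && pht[hist] < 3) pht[hist]++;   /* saturating   */
        if (!taken && pht[hist] > 0) pht[hist]--;   /* 2-bit update */
        bht[i] = (uint8_t)(((hist << 1) | (taken ? 1 : 0))
                           & ((1 << HIST_BITS) - 1));
    }

    int main(void)
    {
        uint32_t pc = 0x40001234;
        /* Train on an alternating pattern, which the history resolves. */
        for (int i = 0; i < 100; i++) update(pc, i & 1);
        printf("prediction after history ...01: %d\n", predict(pc));
        return 0;
    }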


Timothy P. Darr and William P. Birmingham

An Agent Model for Distributed Part Selection

University of Michigan

CSE-TR-307-96 cover page

CSE-TR-307-96

Abstract: This report presents the agent model for a set of catalog, constraint, and system agents. For each type of agent, the model defines the goals it can achieve; the messages sent by agents to other agents requesting that they achieve some goal; the commitments that they make, internally, to achieve goals in response to the messages; and the messages that an agent sends once a goal is achieved. The messages that an agent sends and receives are intentional and are organized into protocols that define the type, sequence, and content of messages exchanged among agents to achieve goals.


Stacie Hibino, Elke Rundensteiner

Query Processing in the MultiMedia Visual Information Seeking Environment: A Comparative Evaluation

University of Michigan

CSE-TR-308-96 cover page

CSE-TR-308-96

Abstract: Although much research has been conducted in the area of multidimensional range queries, we examine this problem from a new perspective. In particular, our goal is to support the processing of incremental multidimensional range queries specified via our temporal visual query language (TVQL), a direct manipulation visual query interface. In this paper, we describe the details and context of our problem, emphasizing the need to support temporal browsing and examining the characteristics of temporal data. We present a simple but new array-based index structure called our k-Array and its bucket-based counterpart, the k-Bucket. We then describe the series of experiments over several temporal queries that we ran to compare the k-Array and k-Bucket with basic methods such as the linked array and other popular bucket-based methods such as the grid file and k-d tree. Our results show that the k-Bucket performs the best overall, even though it is subject to projection effects. The other methods also perform fairly competitively, but only under certain conditions.


Douglas John Pearson

Learning Procedural Planning Knowledge in Complex Environments

University of Michigan

CSE-TR-309-96 cover page

CSE-TR-309-96

Abstract: In complex, dynamic environments, an agent's knowledge of the environment (its domain knowledge) will rarely be complete and correct. Existing approaches to learning and correcting domain knowledge have focused on either learning procedural knowledge to directly guide execution (e.g. reinforcement learners) or learning declarative planning knowledge (e.g. theory revision systems). Systems that only learn execution knowledge are generally only applicable to small domains. In these domains it is possible to learn an execution policy that covers the entire state space, making planning unnecessary. Conversely, existing approaches to learning declarative planning knowledge are applicable to large domains, but they are limited to simple agents, where actions produce immediate, deterministic effects in fully sensed, noise-free environments, and where there are no exogenous events. This research investigates the use of procedural knowledge to support the learning of planning knowledge in large and complex environments. We describe a series of environmental properties that constrain learning and are violated by existing approaches to learning planning knowledge. We then present an operator-based representation for planning knowledge that is sufficiently expressive to model complex, conditional actions that produce sequential effects over time. We then present IMPROV, a system for learning and correcting errors in this planning knowledge that only requires procedural access to the knowledge. This procedural restriction ensures that learning remains tractable, even over these large, expressive representations. We first explain how IMPROV incrementally learns operator precondition knowledge. We then demonstrate how a hierarchical, operator-based representation can be used to reduce the problem of learning operator effects to the problem of learning operator preconditions. This result allows IMPROV to use a single learning method to learn both operator preconditions and effects. This also allows IMPROV to learn complex models of actions that produce conditional or sequential effects. Finally, we test the system in two sample domains and empirically demonstrate that it satisfies many of the constraints faced by learning agents in complex and challenging environments.


Anisoara Nica, Elke Angelika Rundensteiner

The Dynamic Information Integration Model

University of Michigan

CSE-TR-311-96 cover page

CSE-TR-311-96

Abstract: Challenging issues for processing queries specified over large-scale information spaces (e.g., Digital Libraries or the World Wide Web) include the diversity of the information sources in terms of their structures, query interfaces and search capabilities, as well as the dynamics of sources continuously being added, removed or upgraded. Query processing can no longer be done in a static fashion against a well-defined schema (which assumes high integration). Rather, an interactive query processing strategy that adapts its behavior to the system resources at hand is needed. In this paper, we give an innovative solution for query planning in such environments. The foundation of our solution is the Dynamic Information Integration Model (DIIM), which supports the specification of not only content but also capabilities of resources without requiring the establishment of a uniform integration schema. Besides the development of the DIIM model, contributions of this paper include: (1) introduction of the notion of fully specified queries that are semantically equivalent to a loosely-specified query (we show that our concept is a consistent and natural extension of the concept of full disjunction); (2) a translation algorithm from a loosely-specified query into a set of semantically equivalent precise query plans that can be executed against the current configuration of available sources; the steps of the translated query plans are consistent with the binding patterns of query templates of the individual sources (capability descriptions in DIIM) and with possible interrelationships between two or more information sources (expressed as join constraints in DIIM); (3) a search restriction algorithm for optimizing query processing by pruning the search space into the relevant subspace of a query based on information source descriptions; the search space of the translation algorithm is thus restricted to a query-relevant subspace; and (4) proofs of correctness for both the search restriction and translation algorithms, showing that the plans obtained by the query planning process correspond to semantically equivalent query plans.


Wu-chang Feng, Dilip Kandlur, Debanjan Saha, Kang G. Shin

TCP Enhancements for an Integrated Services Internet

University of Michigan

CSE-TR-312-96 cover page

CSE-TR-312-96

Abstract: Flow and congestion control mechanisms in TCP have been designed with the premise that network capacity should be shared equally among competing connections. With the evolution of the Internet from a classless best-effort network to an integrated services network with different classes of service, the premise of equal fair sharing is no longer true. In an integrated services Internet, network bandwidth should be shared in accordance with the reservation made for each session. In this paper, we consider a specific class of service, namely controlled-load, which is currently under consideration for deployment in the Internet. We propose a simple scheme to realize this service and study its interaction with the control mechanisms used in TCP. In light of this study, we propose a number of enhancements to the control mechanisms used in popular implementations of TCP. Using detailed simulation experiments, we show that with the proposed enhancements in place, TCP is indeed able to provide the desired end-to-end behavior.


Bruce Jacob and Trevor Mudge

Specification of the PUMA memory management design

University of Michigan

CSE-TR-314-96 cover page

CSE-TR-314-96

Abstract: In this report we specify the memory management design of a 1GHz PowerPC implementation in which a simple design is a prerequisite for a high clock rate and short design cycle. The scheme involves no translation hardware such as a translation lookaside buffer or a page-table-walking state machine. However, it is just as efficient as hardware-managed address translation and is much more flexible. Modern operating systems such as Mach charge between 0.16 and 0.28 CPI for address translation on systems with TLBs. PUMA's software-managed address translation exacts an overhead of 0.03 CPI. Mechanisms to support such features as shared memory, superpages, sub-page protection, and sparse address spaces can be defined completely in software, allowing much more flexibility than in hardware-defined mechanisms. Our software design combines the virtual caches with a PowerPC-like segmentation mechanism; it maps user 32-bit addresses onto a 44-bit segmented virtual space. We use a global page table to map this entire global virtual space. There are no individual per-process page tables; all process page tables map directly onto the global table and when processes share memory they also share a portion of the global page table. The following benefits are derived from the organization: (a) virtual cache consistency management can be eliminated, (b) the page table space requirements can be cut in half by eliminating the need to replicate page table entries for shared pages, and (c) the virtual memory system can be made less complex because it does not have to deal with the virtual-cache synonym problem.
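
The segmentation arithmetic can be sketched in a few lines of C. The field widths below (a 4-bit segment-register index, 16-bit segment IDs, 28-bit offsets) are inferred only from the 32-bit and 44-bit figures in the abstract and may not match the actual PUMA layout.

    #include <stdio.h>
    #include <stdint.h>

    static uint16_t seg_reg[16];   /* per-process segment registers */

    /* Map a 32-bit effective address to a 44-bit global virtual one. */
    static uint64_t translate(uint32_t eaddr)
    {
        uint32_t seg    = eaddr >> 28;           /* top 4 bits     */
        uint64_t segid  = seg_reg[seg];          /* 16-bit seg ID  */
        uint64_t offset = eaddr & 0x0FFFFFFF;    /* low 28 bits    */
        return (segid << 28) | offset;           /* 16 + 28 = 44   */
    }

    int main(void)
    {
        seg_reg[0] = 0x0042;                     /* arbitrary seg ID */
        printf("0x%08x -> 0x%011llx\n", 0x00001000u,
               (unsigned long long)translate(0x00001000u));
        return 0;
    }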


Nauman Chaudhry, James Moyne, and Elke A. Rundensteiner

Extended Aggregation Relationships for Process Specification and Enactment in Active Databases

University of Michigan

CSE-TR-315-96 cover page

CSE-TR-315-96

Abstract: Process specification in a variety of domains, such as experiment modeling, work-flow modeling, and process-flows in semiconductor manufacturing, is typically characterized by recursive specification in terms of sequences and alternatives. A variety of models have been proposed for the specification of such processes. In particular, object-oriented techniques have been used for achieving various desirable features (e.g., reusability, maintainability, etc.) in the process specification and active databases have been suggested as possible platforms for process enactment. However, on the one hand object-oriented models for process specification lack an important feature of object-orientation, namely the ability to organize processes as classes with inheritance support, and on the other hand various problems, such as lack of methodological support for active rule definition, analysis, and maintenance, stand in the way of successfully employing active database technology for process enactment. To take better advantage of both object-oriented techniques and active database technology, we present a comprehensive framework for process specification and enactment, which provides an integrated solution utilizing ideas from both these domains. This is achieved by developing PSOM (Process Specification Object Model) which is an object-oriented model with explicit aggregation constructs and extended sub-typing relationships for these new aggregation constructs. We show the use of PSOM for defining processes using the aggregation constructs and arranging these processes into class hierarchies based on the formal types of the processes. In addition, we establish guidelines for defining active rules for process enactment on PSOM process specifications. We also prove that the rule definition guidelines lead to modularized rule sets which simplify the analysis of termination behavior of active rules defined in a PSOM process database.


David G. Thaler and Chinya V. Ravishankar

A Name-Based Mapping Scheme for Rendezvous

University of Michigan

CSE-TR-316-96 cover page

CSE-TR-316-96

Abstract: Clusters of identical intermediate servers are often created to improve availability and robustness in many domains. The use of proxy servers for the WWW and of Rendezvous Points in multicast routing are two such situations. However, this approach is inefficient if identical requests are received and processed by multiple servers. We present an analysis of this problem, and develop a method called Highest Random Weight (HRW) mapping that eliminates these difficulties. Given an object name, HRW maps it to a server within a given cluster using the object name, rather than any a priori knowledge of server states. Since HRW always maps a given object name to the same server within a given cluster, it may be used locally at client sites to achieve consensus on object-server mappings. We present an analysis of HRW and validate it with simulation results showing that it gives faster service times than traditional request allocation schemes such as round-robin or least-loaded, and adapts well to changes in the set of servers. HRW is particularly applicable to domains in which there are a large number of requestable objects, and where there is a significant probability that a requested object will be requested again. HRW has now been adopted by the multicast routing protocol PIMv2 as its mechanism for clients to identify Rendezvous Points.
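
The core of HRW is small enough to sketch directly: each client computes a weight for every (object name, server) pair and picks the server with the highest weight, so all clients reach the same mapping without exchanging state. In the C sketch below, the FNV-1a-based mixing function is an arbitrary illustrative hash, not the weight function analyzed in the report.

    #include <stdio.h>
    #include <stdint.h>

    /* FNV-1a hash of the object name, mixed with a server identifier. */
    static uint32_t weight(const char *object, uint32_t server)
    {
        uint32_t h = 2166136261u;
        for (const char *p = object; *p; p++)
            h = (h ^ (uint8_t)*p) * 16777619u;
        h ^= server;
        h *= 16777619u;
        return h;
    }

    /* Return the index of the server with the highest weight. */
    static int hrw_map(const char *object, const uint32_t *servers, int n)
    {
        int best = 0;
        for (int i = 1; i < n; i++)
            if (weight(object, servers[i]) > weight(object, servers[best]))
                best = i;
        return best;
    }

    int main(void)
    {
        uint32_t servers[] = { 0x0a000001, 0x0a000002, 0x0a000003 };
        printf("\"index.html\" -> server %d\n",
               hrw_map("index.html", servers, 3));
        printf("\"logo.gif\"   -> server %d\n",
               hrw_map("logo.gif", servers, 3));
        return 0;
    }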


Edward S. Tam

Early Design Cycle Timing Simulation of Caches

University of Michigan

CSE-TR-317-96 cover page

CSE-TR-317-96

Abstract: Cache performance is a key component of a microprocessor's overall performance, as it is the cache that buffers the high speed CPU from the much slower main memory. Behavioral cache simulators indicate the performance of a given cache configuration in terms of the number of hits and misses experienced when running a piece of code. One can attach a leading edge penalty or "effective" penalty estimate to each miss to get a first-order idea of run time. However, individual timing penalties are not assessed within these models, nor are the various factors that can affect the latency of each memory access. Our Latency Effects (LE) cache model accounts for additional latencies due to trailing edge effects, bus width considerations, the number of outstanding accesses allowable, and port limitations. A tool implementing the LE cache model has been built on top of a behavioral cache simulator, DineroIII, in the spirit of the Resource Conflict Methodology developed by J-D Wellman. The tool was combined with Wellman's RCM_brisc processor simulator to provide a realistic interaction of the cache with the processor (including the latency masking effects of processor activity) and to assess the accuracy of the model when simulating the execution of actual programs. The combined tool accurately mirrors the effects of changing a cache's configuration for a given processor configuration running a variety of programs. While the reported execution times do not exactly match the total execution times of the same programs running on actual hardware, the tool provides enough useful information to guide processor/cache designers early in the design cycle toward optimal configurations for target applications. This addition of cache modeling to the RCM_brisc instruction-level processor simulator with perfect cache increases simulation time by only 17% (less than 5% over a constant miss penalty cache model), which is reasonable given the added benefits of using the LE cache model.


Scott Dawson, Farnam Jahanian, Todd Mitton

ORCHESTRA: A Fault Injection Environment for Distributed Systems

University of Michigan

CSE-TR-318-96 cover page

CSE-TR-318-96

Abstract: This paper reports on ORCHESTRA, a portable fault injection environment for testing implementations of distributed protocols. The paper focuses on architectural features of ORCHESTRA that provide portability, minimize intrusiveness on target protocols, and support testing of real-time systems. ORCHESTRA is based on a simple yet powerful framework, called script-driven probing and fault injection, for the evaluation and validation of the fault-tolerance and timing characteristics of distributed protocols. ORCHESTRA was initially developed on the Real-Time Mach operating system and later ported to other platforms including Solaris and SunOS, and has been used to conduct extensive experiments on several protocol implementations. A novel feature of the Real-Time Mach implementation of ORCHESTRA is that it utilizes certain features of the Real-Time Mach operating system to quantify and compensate for intrusiveness of the fault injection mechanism. In addition to describing the overall ORCHESTRA architecture and implementation, this paper also describes the experimental evaluation of two protocol implementations: a distributed group membership service on the Sun Solaris operating system, and a real-time audio-conferencing application on Real-Time Mach.


Peter L. Bird and Trevor N. Mudge

An Instruction Stream Compression Technique

University of Michigan

CSE-TR-319-96 cover page

CSE-TR-319-96

Abstract: The performance of instruction memory is a critical factor both for large, high performance applications and for embedded systems. With high performance systems, the bandwidth to the instruction cache can be the limiting factor for execution speed. Code density is often the critical factor for embedded systems. In this report we demonstrate a straightforward technique for compressing the instruction stream of programs. After code generation, the instruction stream is analyzed for often-reused sequences of instructions within the program's basic blocks. These patterns of multiple instructions are then mapped into single-byte opcodes. This constitutes a compression of multiple, multi-byte operations onto a single byte. When compressed opcodes are detected during the instruction fetch cycle of program execution, they are expanded within the CPU into the original (multi-cycle) set of instructions. Because we only operate within a program's basic block, branch instructions and their targets are unaffected by this technique. We provide statistics gathered from code generated for the Intel Pentium and the PowerPC processors. We have found that by incorporating a 1K decode ROM in the CPU we can reduce a program's code size between 45% and 60%.
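
A drastically simplified form of this dictionary idea is sketched in C below: scan a basic block for its most frequent adjacent opcode pair and replace each occurrence with a reserved single-byte escape opcode that the fetch stage would expand. Real schemes compress longer sequences via a decode ROM; this single-pair pass is only for exposition.

    #include <stdio.h>

    #define ESCAPE 0xF0   /* opcode reserved for the compressed pair */

    int main(void)
    {
        unsigned char block[] = { 1, 2, 3, 1, 2, 4, 1, 2, 3 };
        int n = sizeof block, besti = 0, bestc = 0;

        /* Count occurrences of each adjacent pair; remember the best. */
        for (int i = 0; i + 1 < n; i++) {
            int c = 0;
            for (int j = 0; j + 1 < n; j++)
                if (block[j] == block[i] && block[j+1] == block[i+1])
                    c++;
            if (c > bestc) { bestc = c; besti = i; }
        }

        /* Rewrite the block, emitting ESCAPE for the chosen pair. */
        unsigned char out[sizeof block];
        int m = 0;
        for (int i = 0; i < n; ) {
            if (i + 1 < n && block[i] == block[besti] &&
                block[i+1] == block[besti+1]) {
                out[m++] = ESCAPE;   /* expands to the two-op sequence */
                i += 2;
            } else {
                out[m++] = block[i++];
            }
        }
        printf("compressed %d ops to %d bytes\n", n, m);
        return 0;
    }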


James K. Huggins

Broy-Lamport Specification Problem: A Gurevich Abstract State Machine Solution

University of Michigan

CSE-TR-320-96 cover page

CSE-TR-320-96

Abstract: We apply the Gurevich Abstract State Machine methodology to a benchmark specification problem of Broy and Lamport.


James K. Huggins and David Van Campenhout

Specification and Verification of Pipelining in the ARM2 RISC Microprocessor

University of Michigan

CSE-TR-321-96 cover page

CSE-TR-321-96

Abstract: We specify the ARM2 RISC microprocessor using the Gurevich Abstract State Machine methodology, and prove the correctness of its pipelining techniques. (See new version of technical report, CSE-TR-371-98).


Yuri Gurevich, Marc Spielmann

Recursive Abstract State Machines

University of Michigan

CSE-TR-322-96 cover page

CSE-TR-322-96

Abstract: The Gurevich abstract state machine (ASM) thesis, supported by numerous applications, asserts that ASMs express algorithms on their natural abstraction levels directly and essentially coding-free. The only objection raised to date has been that ASMs are iterative in their nature, whereas many algorithms are naturally recursive. There seems to be an inherent contradiction between (i) the ASM idea of explicit and comprehensive states, and (ii) higher level recursion with its hiding of the stack. But consider recursion more closely. When an algorithm A calls an algorithm B, a clone of B is created and this clone becomes a slave of A. This raises the idea of treating recursion as an implicitly distributed computation. Slave agents come and go, and the master/slave hierarchy serves as the stack. Building upon this idea, we suggest a definition of recursive ASMs. The implicit use of distributed computing has an important side benefit: it leads naturally to concurrent recursion. In addition, we reduce recursive ASMs to distributed ASMs. If desired, one can view recursive notation as mere abbreviation.


Bruce L. Worthington, Gregory R. Ganger, Yale N. Patt, John Wilkes

On-Line Extraction of SCSI Disk Drive Parameters

University of Michigan

CSE-TR-323-96 cover page

CSE-TR-323-96

Abstract: Sophisticated disk scheduling algorithms require accurate and detailed disk drive specifications, including information on mechanical delays, on-board caching and prefetching algorithms, command processing and protocol overheads, and logical-to-physical block mappings. Comprehensive disk models used in storage subsystem design require similar levels of detail. This report describes a suite of general-purpose techniques and algorithms for acquiring the necessary data from SCSI disks via the ANSI-standard interface. The accuracy of the extracted information is demonstrated by comparing the behavior of a configurable disk simulator against actual disk drive activity. Detailed measurements extracted from several SCSI disk drives are provided.


Send comments and suggestions to:

trcoordinator@eecs.umich.edu