%A Wu, Gwobaw %A Prakash, Atul %T Distributed Object Replication Support for Collaborative Systems %I Computer Science and Engineering, University of Michigan %R CSE-TR-276-96 %D December 28, 1995 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-276-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-276-96.ps.Z %X The ability to share synchronized views of interactions with an application is critical to supporting synchronous collaboration over a network. To keep bandwidth requirements and interactive response time low in wide-area networks, we have designed a toolkit, called DistView, that uses an {\em object-level replication} scheme, in which the application and interface objects that need to be shared among users are replicated at the various user sites. Existing schemes for maintaining consistency among replicated active objects are generally designed for database applications and do not necessarily work in collaborative applications, where response times or latencies are generally more critical than throughput or persistence. In this paper, we describe an object replication service that is designed to provide low interactive response times and low communication latencies in collaborative environments. We describe the algorithms used and the architecture of the system, and present an analysis of its performance. %Z Thu, 15 Oct 1998 14:57:07 GMT %A Abandah, Gheith %A Davidson, Edward %T Characterizing Shared Memory and Communication Performance: A Case Study of the Convex SPP-1000 %I Computer Science and Engineering, University of Michigan %R CSE-TR-277-96 %D January 9, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-277-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-277-96.ps.Z %X The objective of this paper is to develop models that characterize the memory and communication performance of shared-memory multiprocessors, which is crucial for developing efficient parallel applications for them. The Convex SPP-1000, a modern scalable distributed-memory multiprocessor that supports the shared-memory programming paradigm, is used throughout as a case study. The paper evaluates and models four aspects of SPP-1000 performance: scheduling, local-memory, shared-memory, and synchronization. Our evaluation and modeling are intended to supply useful information for application and compiler development. %Z Thu, 15 Oct 1998 14:57:39 GMT %A Jaeger, Trent %A Prakash, Atul %T Using Simulation and Performance Improvement Knowledge for Redesigning Business Processes %I Computer Science and Engineering, University of Michigan %R CSE-TR-278-96 %D January 12, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-278-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-278-96.ps.Z %X Recent business improvement methodologies, such as Business Process Reengineering, advocate the redesign of business processes as the primary method to obtain improvement in business performance. Current approaches to process redesign assist the reengineer in identifying problems in meeting business goals, but provide little support in determining how to resolve those problems to meet the desired goals. The number of feasible changes to the business that can potentially resolve the problems is large, and it is hard to predict the effect that each change will have on overall business performance. We present a process redesign methodology that helps reengineers efficiently generate changes that can help meet business performance goals.
Based on the methodology, we describe a system that includes: (1) an executable business model for specifying business processes and business performance goals; (2) a simulator that executes these processes to determine actual performance and to help reengineers identify problems in meeting the specified goals; and (3) a performance improvement knowledge base that suggests changes to resolve these problems and meet the specified goals. We demonstrate the use of the process redesign methodology and the system on two non-trivial process redesign examples, and report our experience. %Z Thu, 15 Oct 1998 14:58:02 GMT %A McClennen, Michael %A Sechrest, Stuart %T Getting More Information into File Names %I Computer Science and Engineering, University of Michigan %R CSE-TR-279-96 %D January 18, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-279-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-279-96.ps.Z %X Hierarchical naming, while deeply embedded in our conception of file systems, is a rather weak tool for storing information about files and their relationships. A consequence is that users of today's file systems frequently have trouble locating files. We describe a system in which a standard directory tree is extended by allowing names to contain auxiliary components representing descriptive attributes rather than directory names. This system allows files to be characterized more extensively, and lets users choose among multiple organizational structures for their stored information. A prototype has been implemented by means of a new vnode layer under SunOS 4.1.3. %Z Thu, 15 Oct 1998 14:58:26 GMT %A Mehra, Ashish %A Indiresan, Atri %A Shin, Kang %T Design and Evaluation of a QoS-Sensitive Communication Subsystem Architecture %I Computer Science and Engineering, University of Michigan %R CSE-TR-280-96 %D February 9, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-280-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-280-96.ps.Z %X There are a growing number of real-time applications (e.g., real-time controls and audio/video conferencing) that require certain quality of service (QoS) guarantees from the underlying communication subsystem. The communication subsystem (host as well as network) must support real-time communication services that can be used to provide the required QoS to these applications, while providing reasonably good performance for best-effort traffic. In this paper we present and evaluate a QoS-sensitive communication subsystem architecture for end hosts that provides real-time communication support for generic network hardware. This architecture provides various services for managing communication resources for guaranteed-QoS (real-time) connections, such as admission control, traffic enforcement, buffer management, and CPU \& link scheduling. The design of the architecture is based on three key goals: maintenance of QoS-guarantees on a per-connection basis, overload protection between established connections, and fairness in delivered performance to best-effort traffic. Using this architecture, we implement real-time channels, a paradigm for real-time communication services in packet-switched networks. The proposed architecture features a process-per-channel model for protocol processing that associates a channel handler with each established channel.
The model employed for handler execution is one of ``cooperative'' preemption, where an executing handler yields the CPU to a waiting higher-priority handler at well-defined preemption points. The architecture provides several configurable policies for CPU scheduling and overload protection. We evaluate the implementation to demonstrate that this architecture maintains QoS guarantees while adhering to the stated design goals. The evaluation also demonstrates convincingly the need for specific features and policies provided in the architecture. %K Key Words - Real-time communication, QoS-sensitive protocol processing, traffic enforcement, CPU and link scheduling %Z Thu, 15 Oct 1998 14:58:52 GMT %A Mehra, Ashish %A Indiresan, Atri %A Shin, Kang %T Resource Management for Real-Time Communication: Making Theory Meet Practice %I Computer Science and Engineering, University of Michigan %R CSE-TR-281-96 %D February 9, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-281-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-281-96.ps.Z %X A growing number of real-time applications (e.g., real-time controls, and audio/video conferencing) require certain quality-of-service (QoS) from the underlying communication subsystem. Real-time communication services are needed in the communication subsystem (host as well as network) to provide the required QoS to these applications while providing reasonably good performance for best-effort traffic. At the host, real-time communication necessitates that shared host resources such as CPU and link bandwidth be consumed according to the relative requirements of the active connections. This requires appropriate resource management policies for admission control and scheduling that are typically formulated assuming idealized resource models. However, when implementing these policies one must account for the performance characteristics of the hardware and software components involved, which could deviate significantly from those of the idealized resource models. In this paper, we focus on bridging the gap between theory and practice in the management of host CPU and link resources for real-time communication. Using our implementation of real-time channels, a paradigm for real-time communication in packet-switched networks, we illustrate the tradeoff between resource capacity and channel admissibility, which determines the number and type of real-time channels that can be accepted for service and the performance delivered to best-effort traffic. We demonstrate that this tradeoff is affected significantly by the choice of implementation paradigms and the grain at which CPU and link resources can be multiplexed amongst active channels. In order to account for this effect, we extend the admission control procedure for real-time channels originally proposed using idealized resource models. Our results show that practical considerations significantly reduce channel admissibility compared to idealized resource models. Further, the optimum choice of the multiplexing grain depends on several factors such as the overheads of resource preemption, the relationship between CPU and link bandwidth, and the manner in which allocation of link bandwidth interacts with allocation of CPU bandwidth. 
%K Key Words - Real-time communication, resource management, QoS-sensitive protocol processing, CPU and link scheduling %Z Thu, 15 Oct 1998 15:00:50 GMT %A Mudge, Trevor %A Chen, I-Cheng %A Coffey, John %T Limits to Branch Prediction %I Computer Science and Engineering, University of Michigan %R CSE-TR-282-96 %D February 2, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-282-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-282-96.ps.Z %X Branch prediction is an important mechanism in modern microprocessor design. The focus of research in this area has been on designing new branch prediction schemes. In contrast, very few studies address the inherent limit of predictability of programs themselves. Programs have an inherent limit of predictability due to the randomness of input data. Knowing the limit helps us to evaluate how good a prediction scheme is and how much we can expect to improve its accuracy. In this paper we propose two complementary approaches to estimating the limits of predictability: exact analysis of the program and the use of a universal compression/prediction algorithm, prediction by partial matching (PPM), that has been very successful in the field of data and image compression. We review the algorithmic basis for both some common branch predictors and PPM and show that two-level branch prediction, the best method currently in use, is a simplified version of PPM. To illustrate exact analysis, we use Quicksort to calibrate the performance of various branch predictors. With other programs, too complicated to analyze exactly, we use PPM as a measure of inherent predictability. Our initial results show that PPM can approach the theoretical limit in an analyzable program and perform just as well as the best existing branch predictors for SPECInt92. This suggests that universal compression/prediction algorithms, such as PPM, can be used to estimate the limits to branch prediction for a particular workload of programs. %Z Thu, 15 Oct 1998 15:01:14 GMT %A Sechrest, Stuart %A Lee, Chih-Chieh %A Mudge, Trevor %T Correlation and Aliasing in Dynamic Branch Predictors - Full Simulation Results %I Computer Science and Engineering, University of Michigan %R CSE-TR-283-96 %D May 22, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-283-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-283-96.ps.Z %X The accompanying paper, "Correlation and Aliasing in Dynamic Branch Predictors," presents simulation results for a variety of branch prediction schemes using traces of the integer programs of SPEC92 and the IBS-Ultrix benchmarks. However, due to space limitations and the need for clarity, only a subset of our simulation results was included. This technical report presents the full results for all of the benchmarks for each prediction scheme studied. %Z Thu, 15 Oct 1998 15:01:35 GMT %A Manohar, Nelson R. %A Prakash, Atul %T Tool Coordination and Media Integration on Asynchronously-Shared Computer-Supported Workspaces %I Computer Science and Engineering, University of Michigan %R CSE-TR-284-96 %D February 27, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-284-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-284-96.ps.Z %X This paper presents a novel and flexible architecture to support the collaboration paradigm for asynchronous sharing of computer-supported workspaces, or simply replayable workspaces. Such a workspace, however, may be composed of multiple interacting tools.
Through the capture, re-execution, and manipulation of a session with an interactive computer-supported workspace, it is possible to reuse the process of how a task was performed (i.e., valuable collaborative information). Under our paradigm, a user session with a computer-supported workspace is encapsulated into a data artifact, referred to as a {\em session object}. The session object can then be manipulated through {\sc Vcr}-like controls. The session object is an active object, composed of heterogeneous media streams that represent input sequences to various tools of the workspace. Our computer-supported workspace is made up of modular components that provide transparent management of their heterogeneous media. Such media may have or lack time-variance, periodicity, and statefulness. The architecture provides fine-grained integration of such heterogeneous media, while providing flexible support for coordinating the various tools found in a computer-supported workspace. %Z Thu, 15 Oct 1998 15:01:55 GMT %A VanCampenhout, David %A Mudge, Trevor %T Faster Static Timing Analysis via Bus Compression %I Computer Science and Engineering, University of Michigan %R CSE-TR-285-96 %D February 27, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-285-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-285-96.ps.Z %X Static timing analysis is used extensively in the design of high-performance processors. In this paper we present a method which transforms the circuit graph used by timing analysis algorithms into a smaller one, and hence results in faster timing analysis. This transformation, called bus compression, may lead to a more conservative analysis. However, experiments on several large designs yielded the same results with or without bus compression. %Z Thu, 15 Oct 1998 15:02:15 GMT %A Chen, Peter %A Ng, Wee-Teck %A Rajamani, Gurushankar %A Aycock, Christopher %T The Rio File Cache: Surviving Operating System Crashes %I Computer Science and Engineering, University of Michigan %R CSE-TR-286-96 %D March 9, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-286-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-286-96.ps.Z %X One of the fundamental limits to high-performance, high-reliability file systems is memory's vulnerability to system crashes. Because memory is viewed as unsafe, systems periodically write data back to disk. The extra disk traffic lowers performance, and the delay period before data is safe lowers reliability. The goal of the Rio (RAM I/O) file cache is to make ordinary main memory safe for persistent storage by enabling memory to survive operating system crashes. Reliable memory enables a system to achieve the best of both worlds: reliability equivalent to a write-through file cache, where every write is instantly safe, and performance equivalent to a pure write-back cache, with no reliability-induced writes to disk. To achieve reliability, we protect memory during a crash and restore it during a reboot (a "warm" reboot). Extensive crash tests show that even without protection, warm reboot enables memory to achieve reliability close to that of a write-through file system while performing 20 times faster. Rio makes all writes immediately permanent, yet performs faster than systems that lose 30 seconds of data on a crash: 35% faster than a standard delayed-write file system and 8% faster than a system that delays both data and metadata.
For applications that demand even higher levels of reliability, Rio's optional protection mechanism makes memory even safer than a write-through file system while lowering performance 20% compared to a pure write-back system. %Z Thu, 15 Oct 1998 15:02:33 GMT %A Feng, Wu-chang %A Shin, Kang G. %T Impact of Selection Functions on Routing Algorithm Performance in Multicomputer Networks %I Computer Science and Engineering, University of Michigan %R CSE-TR-287-96 %D March 19, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-287-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-287-96.ps.Z %X Maximizing overall performance in multicomputers requires matching application communication characteristics with a suitable routing scheme. However, since the communication demands of emerging applications vary significantly, it is hard for a single routing algorithm to perform well under all workloads. In order to study the complex dependencies between routing policies and communication workloads, we have performed a set of multi-factor experiments to better characterize routing performance. These experiments show that in addition to adaptivity, the selection functions used to order the candidate links greatly affect network performance under various traffic patterns. By supporting flexible routing, the network can tune its routing policies to application communication characteristics in order to improve performance. %Z Thu, 15 Oct 1998 15:02:51 GMT %A Rothleder, Neal J. %T Induction: Inference and Process %I Computer Science and Engineering, University of Michigan %R CSE-TR-288-96 %D March 20, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-288-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-288-96.ps.Z %X Despite the variety of discussions on induction, there is still no explicit definition of the concept. However, since most treatments appear to, at least implicitly, acknowledge it, it seems clear that such a definition exists. A key obstacle seems to be the confusion between what induction is, and how induction should be performed. This research proposes the idea that induction should be viewed in two distinct, but related, ways. The more general, abstract way gets to the heart of this concept definition, and treats induction purely as an inference. In this inference view, there is no concern for issues of efficiency, effectiveness, and other qualities of how induction might be performed. The second way to view induction is as the process which implements the inference. This paper establishes a definition of induction in the inference view. The subtleties and implications of this definition are discussed, and are also compared with key issues and published opinions of induction. %Z Thu, 15 Oct 1998 15:03:09 GMT %A D'Ambrosio, Joseph %T ISMAUT Tools: A Software Tool Kit for Rational Tradeoffs Among Conflicting Objectives %I Computer Science and Engineering, University of Michigan %R CSE-TR-289-96 %D March 25, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-289-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-289-96.ps.Z %X One primary task of engineering design is resolving the conflicting objectives that are inherent in the design process. As a result, design automation tools must provide designers with the ability to represent and apply the preferential knowledge required to resolve conflicting objectives.
This paper describes a decision-theoretic approach for resolving conflicting objectives and a software tool kit, ISMAUT Tools, that enables access to this approach. ISMAUT Tools includes a library of primitive routines that can be linked into other applications, and a text-based interface providing stand-alone decision-making capabilities. %Z Thu, 15 Oct 1998 15:03:25 GMT %A Hall, Robert W. %A Mathur, Amit %A Jahanian, Farnam %A Prakash, Atul %A Rasmussen, Craig %T Corona: A Communication Service for Scalable, Reliable Group Collaboration Systems (Preliminary Design) %I Computer Science and Engineering, University of Michigan %R CSE-TR-290-96 %D March 25, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-290-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-290-96.ps.Z %X We consider the problem of providing communication protocol support for large-scale group collaboration systems for use in environments such as the Internet, which are subject to packet loss, wide variations in end-to-end delays, and transient partitions. We identify a set of requirements that are critical for the design of such group collaboration systems. We present a communication service, Corona, that attempts to meet these requirements. Corona supports two communication paradigms: the publish-subscribe paradigm and the peer group paradigm. In the publish-subscribe paradigm, one or more data sources or publishers send data to multiple subscribers. This paradigm is characterized by the anonymous nature of communication, where a publisher is aware of the set of subscribers, but the subscribers are unaware of each other and are only aware of the publisher that they are receiving data from. In the peer group paradigm of communication, on the other hand, all the group members are aware of each other, and can send and receive data directly to and from each other. We present the interfaces provided by Corona to group collaboration applications. We describe the semantics of each interface method call and show how they can meet a variety of group collaboration requirements. We describe the protocols used to implement these semantics. %Z Thu, 15 Oct 1998 15:03:45 GMT %A Abdelzaher, Tarek %A Shaikh, Anees %A Jahanian, Farnam %A Shin, Kang %T RTCAST: Lightweight Multicast for Real-Time Process Groups %I Computer Science and Engineering, University of Michigan %R CSE-TR-291-96 %D April 4, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-291-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-291-96.ps.Z %X We propose a lightweight fault-tolerant multicast and membership service for real-time process groups which may exchange periodic and aperiodic messages. The service supports bounded-time message transport, atomicity, and order for multicasts within a group of communicating processes in the presence of processor crashes and communication failures. It guarantees agreement on membership among the communicating processors, and ensures that membership changes (e.g., resulting from processor joins or departures) are atomic and ordered with respect to multicast messages. We separate the mechanisms for maintaining consistency in a distributed system from those of maintaining real-time guarantees. The multicast and membership service may be used alone to support services that are not time-critical.
An additional, separate real-time admission control layer provides on-line schedulability analysis of application traffic to guarantee hard real-time deadlines in a system with dynamically-changing loads. Moreover, we separate the concerns of processor fault-tolerance and link fault-tolerance by providing a protocol for supporting the former, regardless of the actual network topology and technology used to implement the latter. We provide the flexibility of an event-triggered approach with the fast message delivery time of time-triggered protocols, such as TTP, where messages are delivered to the application immediately upon reception. This is achieved without compromising agreement, order, and atomicity properties. In addition to the design and details of the algorithm, we describe our implementation of the protocol using the x-Kernel protocol architecture running under RT Mach 3.0. %Z Thu, 15 Oct 1998 15:04:07 GMT %A Silva, J. %A Sakallah, K. %T GRASP -- A New Search Algorithm for Satisfiability %I Computer Science and Engineering, University of Michigan %R CSE-TR-292-96 %D April 10, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-292-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-292-96.ps.Z %X This report introduces GRASP (Generic seaRch Algorithm for the Satisfiability Problem), an integrated algorithmic framework for SAT that unifies several previously proposed search-pruning techniques and facilitates identification of additional ones. GRASP is premised on the inevitability of conflicts during search, and its most distinguishing feature is the augmentation of basic backtracking search with a powerful conflict analysis procedure. Analyzing conflicts to determine their causes enables GRASP to backtrack non-chronologically to earlier levels in the search tree, potentially pruning large portions of the search space. In addition, by ``recording'' the causes of conflicts, GRASP can recognize and preempt the occurrence of similar conflicts later on in the search. Finally, straightforward bookkeeping of the causality chains leading up to conflicts allows GRASP to identify assignments that are necessary for a solution to be found. Experimental results obtained from a large number of benchmarks, including many from the field of test pattern generation, indicate that application of the proposed conflict analysis techniques to SAT algorithms can be extremely effective for a large number of representative classes of SAT instances. %Z Thu, 15 Oct 1998 15:04:23 GMT %A Chen, Peter %T Optimizing Delay in Delayed-Write File Systems %I Computer Science and Engineering, University of Michigan %R CSE-TR-293-96 %D May 3, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-293-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-293-96.ps.Z %X Delayed writes are used in most file systems to improve performance over write-through while limiting the amount of data lost in a crash. Delayed writes improve performance in three ways: by allowing the file cache to absorb some writes without ever propagating them to disk (write cancellation), by improving the efficiency of disk writes, and by spreading bursts out over time. For each of these benefits, this paper determines the optimal value for the delay interval: the smallest delay that achieves as good (or nearly as good) performance as an infinite delay. The largest value for optimal delay of any of the three benefits is fileCacheSize/diskTransferRate.
This value is 1-10 seconds for current systems, implying that the delay used today (30 seconds) is much too large; a smaller delay would minimize data loss yet maintain the same performance. %Z Thu, 15 Oct 1998 15:04:40 GMT %A Shih, Tien-Pao %T Goal-Directed Performance Tuning for Scientific Applications %I Computer Science and Engineering, University of Michigan %R CSE-TR-294-96 %D June 10, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-294-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-294-96.ps.Z %X Performance tuning, as carried out by compiler designers and application programmers to close the performance gap between the achievable peak and delivered performance, becomes increasingly important and challenging as the microprocessor speeds and system sizes increase. However, although performance tuning on scientific codes usually deals with relatively small program regions, it is not generally known how to establish a reasonable performance objective and how to efficiently achieve this objective. We suggest a goal-directed approach and develop such an approach for each of three major system performance components: central processor unit (CPU) computation, memory accessing, and communication. For the CPU, we suggest using a machine-application performance model that characterizes workloads on four key function units (memory, floating-point, issue, and a virtual ``dependence unit'') to produce an upper bound performance objective, and derive a mechanism to approach this objective. A case study shows an average 1.79x speedup achieved by using this approach for the Livermore Fortran Kernels 1-12 running on the IBM RS/6000. For memory, as compulsory and capacity misses are relatively easy to characterize, we derive a method for building application-specific cache behavior models that report the number of misses for all three types of conflict misses: self, cross, and ping-pong. The method uses averaging concepts to determine the expected number of cache misses instead of attempting to count them exactly in each instance, which provides a more rapid, yet realistic assessment of expected cache behavior. For each type of conflict miss, we propose a reduction method that uses one or a combination of three techniques based on modifying or exploiting data layout: array padding, initial address adjustment, and access resequencing. A case study using a blocked matrix multiply program as an example shows that the model is within 11% of the simulation results, and that each type of conflict miss can be effectively reduced or completely eliminated. For communication in shared memory parallel systems, we derive an array grouping mechanism and related loop transformations to reduce communication caused by the problematic case of nonconsecutive references to shared arrays and prove several theorems that determine when and where to apply this technique. 
The experimental results show a 15% reduction in communication, a 40% reduction in data subcache misses, and an 18% reduction in maximum user time for a finite element application on a 56-processor KSR1 parallel computer. %Z Thu, 15 Oct 1998 15:04:59 GMT %A VanCampenhout, David %A Mudge, Trevor %A Sakallah, Karem %T Modeling Domino Logic for Static Timing Analysis %I Computer Science and Engineering, University of Michigan %R CSE-TR-295-96 %D June 24, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-295-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-295-96.ps.Z %X High performance microprocessors employ various advanced circuit techniques to achieve high clock frequencies. Critical sections of the design are often implemented in domino logic, a popular style of dynamic logic. This paper describes a new model for analyzing the temporal behavior of sequential circuits containing both static logic and domino logic. Our model integrates naturally with the SMO model for static timing analysis of sequential circuits. %Z Thu, 15 Oct 1998 15:05:17 GMT %A VanCampenhout, David %A Mudge, Trevor %A Sakallah, Karem %T Timing Analysis of Domino Logic %I Computer Science and Engineering, University of Michigan %R CSE-TR-296-96 %D June 24, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-296-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-296-96.ps.Z %X High performance microprocessors employ various advanced circuit techniques to obtain the required speed. Critical sections of the design are often implemented in domino logic, a popular style of dynamic logic. This paper describes a new model for analyzing the temporal behavior of sequential circuits using static logic mixed with domino logic. Our model integrates naturally with the SMO model for static timing analysis of sequential circuits. %Z Thu, 15 Oct 1998 15:05:35 GMT %A Malan, G. Robert %A Jahanian, Farnam %A Rasmussen, Craig %A Knoop, Peter %T Performance of a Distributed Object-Based Internet Collaboratory %I Computer Science and Engineering, University of Michigan %R CSE-TR-297-96 %D July 10, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-297-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-297-96.ps.Z %X This paper examines the performance of the Upper Atmospheric Research Collaboratory (UARC), an object-based distributed system. For the past three years, the UARC software system has enabled space scientists to collaborate on atmospheric experiments in real-time over the Internet. The UARC system provides a means for its users to view data from remote atmospheric instruments, share annotations of that data, discuss experimental results in a chat window, and simultaneously edit text manuscripts. However, UARC's distribution of atmospheric data to this geographically dispersed group of scientists is its primary mechanism for effecting their collaboration. This paper investigates the impact of UARC's implementation as a large distributed object-based software system as a means for supporting wide-area collaboratories. Specifically, it focuses on the communication performance and scalability of its object-based data distribution mechanism. First, Internet microbenchmarks are presented which characterize the UARC topology; then the results of application-level experiments are described that investigate UARC's use of NeXTSTEP's Distributed Object method invocations as a communication primitive.
Finally, an analysis and discussion of the UARC system's object-based implementation concludes the paper. %Z Thu, 15 Oct 1998 15:05:52 GMT %A Dawson, Scott %A Jahanian, Farnam %A Mitton, Todd %T Experiments on Six Commercial TCP Implementations Using a Software Fault Injection Tool %I Computer Science and Engineering, University of Michigan %R CSE-TR-298-96 %D July 25, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-298-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-298-96.ps.Z %X TCP has become the de facto standard transport protocol in today's operating systems. It is a very robust protocol which can adapt to various network characteristics, packet loss, link congestion, and even significant differences in vendor implementations. This paper presents an implementation of a tool, called ORCHESTRA, that can be used for testing distributed protocols such as TCP, and a set of experiments performed on six different vendor TCP implementations using the tool. The paper also discusses some lessons learned about TCP implementations through experimentation. %Z Thu, 15 Oct 1998 15:06:07 GMT %A Chakrabarty, Krishnendu %A Murray, Brian T. %A Hayes, John P. %T Optimal Zero-Aliasing Space Compaction of Test Responses %I Computer Science and Engineering, University of Michigan %R CSE-TR-299-96 %D August 21, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-299-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-299-96.ps.Z %X Many built-in self-testing (BIST) schemes compress the test responses from a k-output circuit to q signature streams, where q << k, a process termed space compaction. The effectiveness of such a compaction method can be measured by its compaction ratio c=k/q. A high compaction ratio can introduce aliasing, which occurs when a faulty test response maps to the fault-free signature. We investigate the problem of designing zero-aliasing space compaction circuits with maximum compaction ratio c_max. We introduce a graph representation of test responses to study the space compaction process and relate space compactor design to a graph coloring problem. Given a circuit under test, a fault model, and a test set, we determine q_min, which yields c_max = k/q_min. This provides a fundamental bound on the cost of signature-based BIST. We show that q_min \leq 2 for all the ISCAS 85 benchmark circuits. We develop a systematic design procedure for the synthesis of space compaction circuits and apply it to a number of ISCAS 85 circuits. Finally, we describe multi-step compaction, which allows zero aliasing to be achieved with any q, even when q_min > 1. %Z Thu, 15 Oct 1998 15:06:29 GMT %A Choi, Sunghyun %A Shin, Kang G. %T A Cellular Wireless Local Area Network with QoS Guarantees for Heterogeneous Traffic %I Computer Science and Engineering, University of Michigan %R CSE-TR-300-96 %D August 26, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-300-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-300-96.ps.Z %X A wireless local area network (WLAN) or a cell with quality-of-service (QoS) guarantees for various types of traffic is considered. A centralized (i.e., star) network topology is adopted as the topology of a cell, which consists of a base station and a number of mobile clients. Dynamic Time Division Duplexed (TDD) transmission is used, and hence, the same frequency channel is time-shared for downlink and uplink transmissions under the dynamic control of the base station.
We divide traffic into three classes (say I, II, and III) according to the required QoS. Whenever there is no eligible class-I or class-II traffic, class-III traffic, which requires no delay-bound guarantees, is transmitted, while uplink transmissions are controlled with a reservation scheme. Class-I traffic, which requires a bounded delay and guaranteed throughput, is handled with the framing strategy, which consists of a smoothness traffic model and the stop-and-go queueing scheme. We also establish the admission test for adding new class-I connections. We present a modified framing strategy for class-II voice uplink transmissions which utilizes the wireless link efficiently at the cost of some packet losses. Finally, we present the performance (average delay and throughput) evaluation of the reservation scheme for class-III traffic using both analytical calculations and simulations. %Z Thu, 15 Oct 1998 15:06:47 GMT %A Dundas, James %A Mudge, Trevor %T Using Stall Cycles to Improve Microprocessor Performance %I Computer Science and Engineering, University of Michigan %R CSE-TR-301-96 %D September 4, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-301-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-301-96.ps.Z %X Contemporary microprocessors typically expend a significant amount of their device budget in an attempt to reduce the detrimental effects of memory latency and branch misprediction. The extra hardware frequently reduces the clock rate and wastes valuable die area that might be more productively employed in other ways. We propose a novel approach to this problem, which is intended to be used in relatively simple microprocessors with very high clock rates. A runahead microprocessor continues to execute instructions past a level one data cache miss until the miss is serviced. During this time it is said to be in runahead mode. Any additional level one data cache misses that occur during runahead, whose target addresses can be calculated, can become data stream prefetches. Any conditional branches that are executed during runahead can become trial runs, which may improve branch prediction accuracy. Any level one instruction cache misses that occur during runahead become instruction stream prefetches. After the original data cache miss is serviced, the microprocessor restarts instruction execution at the address of the load or store instruction that caused the original data cache miss. The additional hardware required to checkpoint the sequential state of the microprocessor is rather modest, and should not impact the clock rate of a high-performance implementation. %Z Thu, 15 Oct 1998 15:07:06 GMT %A Kuno, Harumi A. %A Rundensteiner, Elke A. %T The Satisfiability-Indicating Multi-Index Organization for Maintaining Materialized Path Query OODB Views %I Computer Science and Engineering, University of Michigan %R CSE-TR-302-96 %D September 4, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-302-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-302-96.ps.Z %X Materialized database views allow applications to benefit from the powerful flexibility of views while minimizing the performance penalties traditionally associated with views. However, the need to maintain materialized views in the face of updates limits the variety of queries that can be used to define them. In this paper we address the problem of incrementally maintaining OODB views formed using path queries.
Traditional index organizations are not well suited for this task. The indexing needs of the path query view problem are unique: because the contents of the materialized view are cached and can be queried directly, the primary use for a supplemental index is during the propagation of updates rather than during query processing. Furthermore, traditional index organizations do not distinguish between single-valued and multi-valued attributes, and thus do not account for the fact that multi-valued attributes enable a single object at the head of a path to be associated with multiple instantiations of the path, any number of which could satisfy the path query predicate. This means that if an updated path involves a multi-valued attribute, then the aggregation hierarchy of the object at the head of the path must be completely re-calculated in order to determine whether or not that object participates in an alternative instantiation that fulfills the view query predicate despite the update. As a solution, we introduce a new Satisfiability Indicating Multi-Index (SMX) organization, which maintains partial information indicating whether or not a given endpoint satisfies the query predicate rather than what the exact value of the endpoint is. This new structure offers a number of benefits. (1) At most one path position forward must be traversed to determine whether or not the endpoint of an instantiation of the path fulfills a given path query predicate. (2) The SMX index structure only needs to be updated when the validity of an object's instantiation (in terms of the query predicate) changes. (3) No more than one path position forward must ever be traversed in order to identify whether or not a given object participates in any alternative instantiations that fulfill a given path query predicate. In addition to proposing this new index organization, we also present cost models and analytic evaluations comparing the performance of the SMX organization to those of the multi, nested, and path index organizations with regard to calculating the effects of updates upon views. The results of our evaluations indicate that the SMX dramatically improves upon the performance of traditional index structures with respect to the problem of path query view maintenance. %K KEYWORDS: Incremental view maintenance, path queries, data warehousing, view materialization, and object-oriented databases. %Z Thu, 15 Oct 1998 15:07:27 GMT %A Rivers, Jude A. %A Davidson, Edward S. %T Sectored Cache Performance Evaluation: A case study on the KSR-1 data subcache %I Computer Science and Engineering, University of Michigan %R CSE-TR-303-96 %D September 6, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-303-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-303-96.ps.Z %X Sectoring is a cache design and management technique that is re-emerging as cache sizes get larger and computer designers strive to exploit the possible gains from using large block (line) sizes due to spatial locality. Sectoring allows a small tag array to suffice by retaining address tags only for the large blocks, but still avoids huge miss penalties by utilizing a smaller transfer size between the cache and the next higher level of memory. With this caching strategy comes the need for a new approach for evaluating cache performance, especially relating to cache space and its best use, bus traffic, and so forth. In this study, we give a broad overview of the technique of sectoring in caches.
We introduce a new set of metrics for cache performance evaluation, stressing cache block and bus traffic usage. We use this set of superfluity metrics to investigate the behavior of real scientific applications, and also to help determine adequate and appropriate cache design parameters. We show an example of how these metrics can help point to the spatial locality problems in a given application code, thereby indicating code optimization techniques which can most significantly improve the code's performance. %Z Thu, 15 Oct 1998 15:07:46 GMT %A Manohar, Nelson R. %A Prakash, Atul %T Design Issues on the Support of Tools and Media in Replayable Workspaces %I Computer Science and Engineering, University of Michigan %R CSE-TR-304-96 %D September 18, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-304-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-304-96.ps.Z %X This paper presents flexible support for asynchronous sharing (later-time use) of computer-supported workspaces (CSWs), or simply replayable workspaces. Through the interactive replay of a previously recorded CSW session, it is possible to reuse valuable collaborative information. The session is represented through heterogeneous media streams, each containing state and inputs to a CSW tool. We present here an architecture and tool/media interfaces for the support of tools as ``plug-in'' components of our replayable workspaces. %Z Thu, 15 Oct 1998 15:08:05 GMT %A Nakanishi, Jun %A Fukuda, T. %A Koditschek, Daniel E. %T Preliminary Analytical Approach to a Brachiation Robot Controller %I Computer Science and Engineering, University of Michigan %R CSE-TR-305-96 %D September 19, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-305-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-305-96.ps.Z %X We report on our preliminary studies of a new controller for a two-link brachiating robot. Motivated by the pendulum-like motion of an ape's brachiation, we encode this task as the output of a "target dynamical system." Numerical simulations indicate that the resulting controller solves a number of brachiation problems that we term the "ladder", "swing up" and "rope" problems. Preliminary analysis provides some explanation for this success. We discuss a number of formal questions whose answers will be required to gain a full understanding of the strengths and weaknesses of this approach. %Z Thu, 15 Oct 1998 15:08:20 GMT %A Chen, I-Cheng K. %A Lee, Chih-Chieh %A Postiff, Matthew A. %A Mudge, Trevor N. %T Tagless Two-level Branch Prediction Schemes %I Computer Science and Engineering, University of Michigan %R CSE-TR-306-96 %D September 24, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-306-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-306-96.ps.Z %X Per-address two-level branch predictors have been shown to be among the best predictors and have been implemented in current microprocessors. However, as the cycle time of modern microprocessors continues to decrease, the implementation of set-associative per-address two-level branch predictors will become more difficult. In this paper, we revisit and analyze an alternative tagless, direct-mapped approach which is simpler, has lower power, and has faster access time. The tagless predictor can also offer comparable performance to current set-associative designs since removal of tags allows more resources to be allocated for the predictor and branch target buffer (BTB).
Further, removal of tags allows decoupling of the per-address predictors from the BTB, allowing the two components to be optimized individually. We then show that tagless predictors are better than tagged predictors because of opportunities for better miss-handling. Finally, we examine the system cost-benefit for tagless per-address predictors across a wide design space using equal-cost contours. We also study the sensitivity of performance to the workloads by comparing results from the Instruction Benchmark Suite (IBS) and SPEC CINT95. Our work provides principles and quantitative parameters for optimal configurations of such predictors. %Z Thu, 15 Oct 1998 15:08:43 GMT %A Darr, Timothy P. %A Birmingham, William P. %T An Agent Model for Distributed Part Selection %I Computer Science and Engineering, University of Michigan %R CSE-TR-307-96 %D September 23, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-307-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-307-96.ps.Z %X This report presents the agent model for a set of catalog, constraint and system agents. For each type of agent, the model defines the goals it can achieve; the messages sent by agents to other agents requesting that they achieve some goal; the commitments that they make, internally, to achieve goals in response to the messages; and the messages that an agent sends once a goal is achieved. The messages that an agent sends and receives are intentional and are organized into protocols that define the type, sequence and content of messages exchanged among agents to achieve goals. %Z Thu, 15 Oct 1998 15:08:58 GMT %A Hibino, Stacie %A Rundensteiner, Elke %T Query Processing in the MultiMedia Visual Information Seeking Environment: A Comparative Evaluation %I Computer Science and Engineering, University of Michigan %R CSE-TR-308-96 %D October 8, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-308-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-308-96.ps.Z %X Although much research has been conducted in the area of multidimensional range queries, we examine this problem from a new perspective. In particular, our goal is to support the processing of incremental multidimensional range queries specified via our temporal visual query language (TVQL), a direct manipulation visual query interface. In this paper, we describe the details and context of our problem, emphasizing the need to support temporal browsing and examining the characteristics of temporal data. We present a simple but new array-based index structure called the k-Array and its bucket-based counterpart, the k-Bucket. We then describe the series of experiments over several temporal queries that we ran to compare the k-Array and k-Bucket with basic methods such as the linked array and other popular bucket-based methods such as the grid file and k-d tree. Our results show that the k-Bucket performs the best overall, even though it is subject to projection effects. The other methods also perform fairly competitively, but only under certain conditions.
%Z Thu, 15 Oct 1998 15:09:17 GMT %A Pearson, Douglas John %T Learning Procedural Planning Knowledge in Complex Environments %I Computer Science and Engineering, University of Michigan %R CSE-TR-309-96 %D October 10, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-309-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-309-96.ps.Z %X In complex, dynamic environments, an agent's knowledge of the environment (its domain knowledge) will rarely be complete and correct. Existing approaches to learning and correcting domain knowledge have focused on either learning procedural knowledge to directly guide execution (e.g. reinforcement learners) or learning declarative planning knowledge (e.g. theory revision systems). Systems that only learn execution knowledge are generally only applicable to small domains. In these domains it is possible to learn an execution policy that covers the entire state space, making planning unnecessary. Conversely, existing approaches to learning declarative planning knowledge are applicable to large domains, but they are limited to simple agents, where actions produce immediate, deterministic effects in fully sensed, noise-free environments, and where there are no exogenous events. This research investigates the use of procedural knowledge to support the learning of planning knowledge in large and complex environments. We describe a series of environmental properties that constrain learning and are violated by existing approaches to learning planning knowledge. We then present an operator-based representation for planning knowledge that is sufficiently expressive to model complex, conditional actions that produce sequential effects over time. We then present IMPROV, a system for learning and correcting errors in this planning knowledge that only requires procedural access to the knowledge. This procedural restriction ensures that learning remains tractable, even over these large, expressive representations. We first explain how IMPROV incrementally learns operator precondition knowledge. We then demonstrate how a hierarchical, operator-based representation can be used to reduce the problem of learning operator effects to the problem of learning operator preconditions. This result allows IMPROV to use a single learning method to learn both operator preconditions and effects. This also allows IMPROV to learn complex models of actions that produce conditional or sequential effects. Finally, we test the system in two sample domains and empirically demonstrate that it satisfies many of the constraints faced by learning agents in complex and challenging environments. %Z Thu, 15 Oct 1998 15:09:39 GMT %A Chandramouli, V. %A Kayssi, A.I. %A Sakallah, K.A. %T Signal Delay in Coupled Distributed RC Lines in the Presence of Temporal Proximity %I Computer Science and Engineering, University of Michigan %R CSE-TR-310-96 %D October 11, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-310-96-cover.ps.gz %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-310-96.ps.gz %X With improvements in technology, accurate delay modeling of interconnects is becoming increasingly important. Due to decreasing feature sizes, the spacing between the signal lines is also decreasing. Consequently, the switching activities on the neighboring lines can have a significant impact on the delay of the line of interest, and can no longer be ignored. Accurate modeling of this phenomenon, which we call the proximity effect, is the subject of this paper. 
This is similar to the state-dependency of logic gate delays, where signal delay can be affected by the switching activities on the side inputs of a gate. We describe an efficient and accurate delay computation method using precomputed interconnect moments that treats the coupled lines as uniform, distributed RC lines and does not make any lumped approximations. This allows the proposed delay model to be used in a timing analysis tool operating over both gate and interconnect domains while accounting for state-dependency. %Z Thu, 15 Oct 1998 15:09:57 GMT %A Nica, Anisoara %A Rundensteiner, Elke Angelika %T The Dynamic Information Integration Model %I Computer Science and Engineering, University of Michigan %R CSE-TR-311-96 %D October 30, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-311-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-311-96.ps.Z %X Challenging issues for processing queries specified over large-scale information spaces (e.g., Digital Libraries or the World Wide Web) include the diversity of the information sources in terms of their structures, query interfaces and search capabilities, as well as the dynamics of sources continuously being added, removed or upgraded. Query processing can no longer be done in a static fashion against a well-defined schema (which assumes high integration). Rather an interactive query processing strategy that adapts its behavior to the system resources at hand is needed. In this paper, we give an innovative solution for query planning in such environments. The foundation of our solution is the Dynamic Information Integration Model (DIIM) which supports the specification of not only content but also capabilities of resources without requiring the establishment of a uniform integration schema. Besides the development of the DIIM model, contributions of this paper include: (1) introduction of the notion of fully specified queries that are semantically equivalent to a loosely-specified query (we show that our concept is a consistent and natural extension to the concept of full disjunction); (2) translation algorithm of a loosely-specified query into a set of semantically equivalent precise query plans that can be executed against the current configuration of available sources; the steps of the translated query plans are consistent with the binding patterns of query templates of the individual sources (capability descriptions in DIIM) and with possible interrelationships between two or more information sources (expressed as join constraints in DIIM); (3) search restriction algorithm for optimizing query processing by pruning the search space into the relevant subspace of a query based on information source descriptions; the search space of the translation algorithm is thus restricted to a query relevant subspace; and (4) the proofs of correctness for both the search restriction and translation algorithms that show that the plans obtained by the query planning process correspond to semantically equivalent query plans. %Z Thu, 15 Oct 1998 15:10:17 GMT %A Feng, Wu-chang %A Kandlur, Dilip %A Saha, Debanjan %A Shin, Kang G. 
%T TCP Enhancements for an Integrated Services Internet %I Computer Science and Engineering, University of Michigan %R CSE-TR-312-96 %D October 30, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-312-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-312-96.ps.Z %X Flow and congestion control mechanisms in TCP have been designed with the premise that network capacity should be shared equally among competing connections. With the evolution of the Internet from a class-less best-effort network to an integrated services network with different classes of service, the premise of equal fair sharing is no longer true. In an integrated services Internet, network bandwidth should be shared in accordance with the reservation made for each session. In this paper, we consider a specific class of service, namely controlled-load, which is currently under consideration for deployment in the Internet. We propose a simple scheme to realize this service and study its interaction with the control mechanisms used in TCP. In light of this study, we propose a number of enhancements to the control mechanisms used in popular implementations of TCP. Using detailed simulation experiments, we show that with the proposed enhancements in place, TCP is indeed able to provide the desired end-to-end behavior. %Z Thu, 15 Oct 1998 15:10:33 GMT %A Soparkar, N. %A Ramamritham, K. %T Workshop on Database: Active and Real-Time 1996 (Concepts meet Practice) %I Computer Science and Engineering, University of Michigan %R CSE-TR-313-96 %D November 6, 1996 %Z Thu, 15 Oct 1998 15:10:51 GMT %A Jacob, Bruce %A Mudge, Trevor %T Specification of the PUMA memory management design %I Computer Science and Engineering, University of Michigan %R CSE-TR-314-96 %D December 10, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-314-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-314-96.ps.Z %X In this report we specify the memory management design of a 1GHz PowerPC implementation in which a simple design is a prerequisite for a high clock rate and short design cycle. The scheme involves no translation hardware such as a translation lookaside buffer or a page-table-walking state machine. However, it is just as efficient as hardware-managed address translation and is much more flexible. Modern operating systems such as Mach charge between 0.16 and 0.28 CPI for address translation on systems with TLBs. PUMA's software-managed address translation exacts an overhead of 0.03 CPI. Mechanisms to support such features as shared memory, superpages, sub-page protection, and sparse address spaces can be defined completely in software, allowing much more flexibility than in hardware-defined mechanisms. Our software design combines the virtual caches with a PowerPC-like segmentation mechanism; it maps user 32-bit addresses onto a 44-bit segmented virtual space. We use a global page table to map this entire global virtual space. There are no individual per-process page tables; all process page tables map directly onto the global table, and when processes share memory they also share a portion of the global page table. The following benefits are derived from the organization: (a) virtual cache consistency management can be eliminated, (b) the page table space requirements can be cut in half by eliminating the need to replicate page table entries for shared pages, and (c) the virtual memory system can be made less complex because it does not have to deal with the virtual-cache synonym problem. 
%Z Thu, 15 Oct 1998 15:11:10 GMT %A Chaudhry, Nauman %A Moyne, James %A Rundensteiner, Elke A. %T Extended Aggregation Relationships for Process Specification and Enactment in Active Databases %I Computer Science and Engineering, University of Michigan %R CSE-TR-315-96 %D November 11, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-315-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-315-96.ps.Z %X Process specification in a variety of domains, such as experiment modeling, work-flow modeling, and process-flows in semiconductor manufacturing, is typically characterized by recursive specification in terms of sequences and alternatives. A variety of models have been proposed for the specification of such processes. In particular, object-oriented techniques have been used for achieving various desirable features (e.g., reusability, maintainability, etc.) in the process specification and active databases have been suggested as possible platforms for process enactment. However, on the one hand object-oriented models for process specification lack an important feature of object-orientation, namely the ability to organize processes as classes with inheritance support, and on the other hand various problems, such as lack of methodological support for active rule definition, analysis, and maintenance, stand in the way of successfully employing active database technology for process enactment. To take better advantage of both object-oriented techniques and active database technology, we present a comprehensive framework for process specification and enactment, which provides an integrated solution utilizing ideas from both these domains. This is achieved by developing PSOM (Process Specification Object Model) which is an object-oriented model with explicit aggregation constructs and extended sub-typing relationships for these new aggregation constructs. We show the use of PSOM for defining processes using the aggregation constructs and arranging these processes into class hierarchies based on the formal types of the processes. In addition, we establish guidelines for defining active rules for process enactment on PSOM process specifications. We also prove that the rule definition guidelines lead to modularized rule sets which simplify the analysis of termination behavior of active rules defined in a PSOM process database. %Z Thu, 15 Oct 1998 15:11:36 GMT %A Thaler, David G. %A Ravishankar, Chinya V. %T A Name-Based Mapping Scheme for Rendezvous %I Computer Science and Engineering, University of Michigan %R CSE-TR-316-96 %D November 14, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-316-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-316-96.ps.Z %X Clusters of identical intermediate servers are often created to improve availability and robustness in many domains. The use of proxy servers for the WWW and of Rendezvous Points in multicast routing are two such situations. However, this approach is inefficient if identical requests are received and processed by multiple servers. We present an analysis of this problem, and develop a method called the Highest Random Weight (HRW) Mapping that eliminates these difficulties. Given an object name, HRW maps it to a server within a given cluster using the object name, rather than any a priori knowledge of server states. Since HRW always maps a given object name to the same server within a given cluster, it may be used locally at client sites to achieve consensus on object-server mappings. 
We present an analysis of HRW and validate it with simulation results showing that it gives faster service times than traditional request allocation schemes such as round-robin or least-loaded, and adapts well to changes in the set of servers. HRW is particularly applicable to domains in which there are a large number of requestable objects, and where there is a significant probability that a requested object will be requested again. HRW has now been adopted by the multicast routing protocol PIMv2 as its mechanism for clients to identify Rendezvous Points. %Z Thu, 15 Oct 1998 15:11:52 GMT %A Tam, Edward S. %T Early Design Cycle Timing Simulation of Caches %I Computer Science and Engineering, University of Michigan %R CSE-TR-317-96 %D November 20, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-317-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-317-96.ps.Z %X Cache performance is a key component of a microprocessor's overall performance, as it is the cache that buffers the high speed CPU from the much slower main memory. Behavioral cache simulators indicate the performance of a given cache configuration in terms of the number of hits and misses experienced when running a piece of code. One can attach a leading edge penalty or "effective" penalty estimate to each miss to get a first order idea of run time. However, individual timing penalties are not assessed within these models, nor are the various factors that can affect each memory access' latency. Our Latency Effects (LE) cache model accounts for additional latencies due to trailing edge effects, bus width considerations, the number of outstanding accesses allowable, and port limitations. A tool implementing the LE cache model has been built on top of a behavioral cache simulator, DineroIII, in the spirit of the Resource Conflict Methodology developed by J-D Wellman. The tool was combined with Wellman's RCM_brisc processor simulator to provide a realistic interaction of the cache with the processor (including the latency masking effects of processor activity) and to assess the accuracy of the model when simulating the execution of actual programs. The combined tool accurately mirrors the effects of changing a cache's configuration for a given processor configuration running a variety of programs. While the reported execution times do not exactly match the total execution times of the same programs running on actual hardware, the tool provides enough useful information to guide processor/cache designers early in the design cycle toward optimal configurations for target applications. This addition of cache modeling to the RCM_brisc instruction-level processor simulator with perfect cache increases simulation time by only 17% (less than 5% over a constant miss penalty cache model), which is reasonable given the added benefits of using the LE cache model. %Z Thu, 15 Oct 1998 15:12:10 GMT %A Dawson, Scott %A Jahanian, Farnam %A Mitton, Todd %T ORCHESTRA: A Fault Injection Environment for Distributed Systems %I Computer Science and Engineering, University of Michigan %R CSE-TR-318-96 %D November 26, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-318-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-318-96.ps.Z %X This paper reports on ORCHESTRA, a portable fault injection environment for testing implementations of distributed protocols. 
The paper focuses on architectural features of ORCHESTRA that provide portability, minimize intrusiveness on target protocols, and support testing of real-time systems. ORCHESTRA is based on a simple yet powerful framework, called script-driven probing and fault injection, for the evaluation and validation of the fault-tolerance and timing characteristics of distributed protocols. ORCHESTRA was initially developed on the Real-Time Mach operating system and later ported to other platforms including Solaris and SunOS, and has been used to conduct extensive experiments on several protocol implementations. A novel feature of the Real-Time Mach implementation of ORCHESTRA is that it utilizes certain features of the Real-Time Mach operating system to quantify and compensate for intrusiveness of the fault injection mechanism. In addition to describing the overall ORCHESTRA architecture and implementation, this paper also describes the experimental evaluation of two protocol implementations: a distributed group membership service on the Sun Solaris operating system, and a real-time audio-conferencing application on Real-Time Mach. %Z Thu, 15 Oct 1998 15:12:34 GMT %A Bird, Peter L. %A Mudge, Trevor N. %T An Instruction Stream Compression Technique %I Computer Science and Engineering, University of Michigan %R CSE-TR-319-96 %D December 2, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-319-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-319-96.ps.Z %X The performance of instruction memory is a critical factor for both large, high performance applications and for embedded systems. With high performance systems, the bandwidth to the instruction cache can be the limiting factor for execution speed. Code density is often the critical factor for embedded systems. In this report we demonstrate a straightforward technique for compressing the instruction stream for programs. After code generation, the instruction stream is analysed for often reused sequences of instructions from within the program's basic blocks. These patterns of multiple instructions are then mapped into single byte opcodes. This constitutes a compression of multiple, multi-byte operations onto a single byte. When compressed opcodes are detected during the instruction fetch cycle of program execution, they are expanded within the CPU into the original (multi-cycle) set of instructions. Because we only operate within a program's basic block, branch instructions and their targets are unaffected by this technique. We provide statistics gathered from code generated for the Intel Pentium and the PowerPC processors. We have found that by incorporating a 1K decode ROM in the CPU we can reduce a program's code size between 45% and 60%. %Z Thu, 15 Oct 1998 15:13:12 GMT %A Huggins, James K. %T Broy-Lamport Specification Problem: A Gurevich Abstract State Machine Solution %I Computer Science and Engineering, University of Michigan %R CSE-TR-320-96 %D December 13, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-320-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-320-96.ps.Z %X We apply the Gurevich Abstract State Machine methodology to a benchmark specification problem of Broy and Lamport. %Z Thu, 15 Oct 1998 15:13:27 GMT %A Huggins, James K. 
%A VanCampenhout, David %T Specification and Verification of Pipelining in the ARM2RISC Microprocessor %I Computer Science and Engineering, University of Michigan %R CSE-TR-321-96 %D December 13, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-321-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-321-96.ps.Z %X We specify the ARM2 RISC microprocessor using the Gurevich Abstract State Machine methodology, and prove the correctness of its pipelining techniques. %Z Thu, 15 Oct 1998 15:13:41 GMT %A Gurevich, Yuri %A Spielmann, Marc %T Recursive Abstract State Machines %I Computer Science and Engineering, University of Michigan %R CSE-TR-322-96 %D December 16, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-322-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-322-96.ps.Z %X The Gurevich abstract state machine (ASM) thesis, supported by numerous applications, asserts that ASMs express algorithms on their natural abstraction levels directly and essentially coding-free. The only objection raised to date has been that ASMs are iterative in their nature, whereas many algorithms are naturally recursive. There seems to be an inherent contradiction between (i) the ASM idea of explicit and comprehensive states, and (ii) higher level recursion with its hiding of the stack. But consider recursion more closely. When an algorithm A calls an algorithm B, a clone of B is created and this clone becomes a slave of A. This raises the idea of treating recursion as an implicitly distributed computation. Slave agents come and go, and the master/slave hierarchy serves as the stack. Building upon this idea, we suggest a definition of recursive ASMs. The implicit use of distributed computing has an important side benefit: it leads naturally to concurrent recursion. In addition, we reduce recursive ASMs to distributed ASMs. If desired, one can view recursive notation as mere abbreviation. %Z Thu, 15 Oct 1998 15:14:06 GMT %A Worthington, Bruce L. %A Ganger, Gregory R. %A Patt, Yale N. %A Wilkes, John %T On-Line Extraction of SCSI Disk Drive Parameters %I Computer Science and Engineering, University of Michigan %R CSE-TR-323-96 %D December 19, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-323-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-323-96.ps.Z %X Sophisticated disk scheduling algorithms require accurate and detailed disk drive specifications, including information on mechanical delays, on-board caching and prefetching algorithms, command processing and protocol overheads, and logical-to-physical block mappings. Comprehensive disk models used in storage subsystem design require similar levels of detail. This report describes a suite of general-purpose techniques and algorithms for acquiring the necessary data from SCSI disks via the ANSI-standard interface. The accuracy of the extracted information is demonstrated by comparing the behavior of a configurable disk simulator against actual disk drive activity. Detailed measurements extracted from several SCSI disk drives are provided. 
%Z Thu, 15 Oct 1998 15:14:20 GMT %A Jones, Matthew %A Rundensteiner, Elke %T Evaluating View Materialization Strategies for Complex Hierarchical Objects %I Computer Science and Engineering, University of Michigan %R CSE-TR-324-97 %D January 9, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-324-97-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-324-97.ps.Z %X In many design applications, it is common practice to store complex hierarchical objects in a compact folded form to save storage space and to reduce processing costs for accessing the objects. In these folded representations, complex objects are built up from identical and otherwise indistinguishable design objects. However, it is often necessary, especially during the refinement of data, to distinguish between these identical folded objects by personalizing a subset of them. The established practice is to explicitly unfold the hierarchical objects and thus create space in which to store distinct personalization data for each object occurrence. However, this explicit unfolding is costly and time consuming, resulting in a potentially much larger structure, and substantially increasing the costs of querying and updating the design. Therefore, we propose an unfold view operator and provide the basis for updating of customized values for each hierarchical sub-object through the unfolded view. We propose alternative strategies for the maintenance of personalization values, representing various portions of the view materialization spectrum. We present a performance evaluation comparing these strategies as well as the traditional explicit unfolding approach. Our evaluation indicates the trade-offs in terms of storage and query costs and compares the costs to do implicit unfolding through a view rather than explicit unfolding of complex hierarchical objects. %Z Thu, 15 Oct 1998 15:14:38 GMT %A Blass, Andreas %A Gurevich, Yuri %T The Linear Time Hierarchy Theorems for RAMs and Abstract State Machines %I Computer Science and Engineering, University of Michigan %R CSE-TR-325-97 %D January 21, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-325-97-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-325-97.ps.Z %X Contrary to polynomial time, linear time greatly depends on the computation model. In 1992, Neil Jones designed a number of computation models where the linear-speed-up theorem fails and linear-time computable functions form a hierarchy. The linear time of those models is too restrictive. We prove linear-time hierarchy theorems for random access machines and Gurevich abstract state machines (formerly evolving algebras). The latter generalization is harder and more important because of the greater flexibility of the ASM model. One long-term goal of this line of research is to prove lower bounds for natural linear time problems. %Z Thu, 15 Oct 1998 15:14:56 GMT %A Dexter, Scott %A Doyle, Patrick %A Gurevich, Yuri %T Gurevich Abstract State Machines and Schoenhage Storage Modification Machines %I Computer Science and Engineering, University of Michigan %R CSE-TR-326-97 %D January 23, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-326-97-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-326-97.ps.Z %X We demonstrate that Schoenhage storage modification machines are equivalent, in a strong sense, to unary abstract state machines. 
We also show that if one extends the Schoenhage model with a pairing function and removes the unary restriction, then equivalence between the two machine models survives. %Z Thu, 15 Oct 1998 15:15:14 GMT %A Gurevich, Yuri %A Soparkar, Nandit %A Wallace, Charles %T Formalizing Database Recovery %I Computer Science and Engineering, University of Michigan %R CSE-TR-327-97 %D January 29, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-327-97-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-327-97.ps.Z %X Failure resilience is an essential requirement for database systems, yet there has been little effort to specify and verify techniques for failure recovery formally. The desire to improve performance has resulted in algorithms of considerable sophistication, yet understood by few and prone to errors. In this paper, we illustrate how the methodology of Gurevich Abstract State Machines can elucidate recovery and provide formal rigor to the design of a recovery algorithm. In a series of refinements, we model recovery at several levels of abstraction, verifying the correctness of each model. This work suggests that our approach can be applied to more advanced recovery mechanisms. %Z Thu, 15 Oct 1998 15:15:32 GMT %A Jensen, Paul %A Wallace, Charles %A Soparkar, Nandit %T Specifying and Constructing Schedulers for Workflows with Autonomous Executions %I Computer Science and Engineering, University of Michigan %R CSE-TR-328-97 %D February 4, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-328-97-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-328-97.ps.Z %X Workflow has become an important paradigm for distributed data and computing systems in a wide range of application areas. In a workflow, tasks executing on autonomous, heterogeneous systems are coordinated through data and control flow constraints. An important challenge in workflow management is the scheduling of actions and operations performed by the concurrently executing tasks. The legal interleavings among the tasks must be specified, and scheduling control mechanisms to ensure correct, efficient executions must be generated. Scheduling workflows is particularly difficult because the dependencies between tasks may be application-specific and task autonomy may place certain actions outside the control or observation of the scheduler. We use techniques from supervisory control theory of discrete event systems for specifying and generating scheduling controllers in workflow environments. We specify the tasks and the intertask dependencies as finite state automata. To model task autonomy, we characterize some of the event transitions in the task automata as beyond the control or observation of the workflow scheduler. By applying the techniques of supervisory control theory to these specifications, we show how the existence of schedulers may be ascertained and how schedulers may be constructed. In cases where no controller can allow exactly the desired class of schedules, we show how to construct a scheduler that allows the best possible approximation to the desired class. We also address the issues of prioritized tasks and distributed workflow scheduling. Our approach provides an effective means to model several workflow systems and to create scheduling mechanisms to manage them. %Z Thu, 15 Oct 1998 15:15:47 GMT %A Jensen, Paul A. %A Soparkar, Nandit R. %A Mathur, Amit G. 
%T Characterizing Multicast Orderings using Concurrency Control Theory %I Computer Science and Engineering, University of Michigan %R CSE-TR-329-97 %D February 4, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-329-97-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-329-97.ps.Z %X Coordinating distributed executions is achieved by two widely used approaches: process groups and transactions. Typically, the two represent a trade-off in terms of the degrees of consistency and performance. By applying transaction concurrency control techniques to characterize and design process group multicast orderings, we aim to provide aspects of both ends of the trade-off. In particular, we propose a framework in which each message multicast is regarded as a transaction. Appropriate message ordering protocols are devised and shown to be correct using a variant of concurrency control theory. Also, we are able to incorporate certain aspects of application semantics for which existing process group approaches are inadequate. Finally, our framework provides a means to characterize the performance of orderings to allow a comparison of different ordering protocols. %Z Thu, 15 Oct 1998 15:16:09 GMT %A Chen, I-Cheng K. %A Bird, Peter L. %A Mudge, Trevor %T The Impact of Instruction Compression on I-cache Performance %I Computer Science and Engineering, University of Michigan %R CSE-TR-330-97 %D February 13, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-330-97-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-330-97.ps.Z %X In this paper we present a straightforward technique for compressing the instruction stream for programs that overcomes some of the limitations of earlier proposals. After code generation, the instruction stream is analysed for frequently used sequences of instructions from within the program's basic blocks. These patterns of multiple instructions are then mapped into single byte opcodes. This constitutes a compression of multiple, multi-byte operations onto a single byte. When compressed opcodes are detected during the instruction fetch cycle of program execution, they are expanded within the CPU into the original (multi-cycle) sequence of instructions. We only examine patterns within a program's basic block, so branch instructions and their targets are unaffected by this technique, allowing compression to be decoupled from compilation. %Z Thu, 15 Oct 1998 15:16:25 GMT %A Vlaovic, Stevan %T Communication Characterization of a Cray T3D %I Computer Science and Engineering, University of Michigan %R CSE-TR-331-97 %D February 13, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-331-97-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-331-97.ps.Z %X In order to develop efficient applications for large-scale multiprocessors, the communication patterns of the target architecture should be well understood. By understanding these communication patterns, a faster running application can be developed. Some applications require the smallest latency possible, whereas others might benefit from higher throughput. The Cray T3D is a multiprocessor that has a high speed 3D torus interconnect, and special communications hardware. By utilizing this hardware properly, a program has the potential to minimize its communications overhead. This study investigates three different libraries that provide communication on the T3D: PVM, MPI, and SMA. Both PVM and MPI are portable, whereas SMA is native. 
The latency, bandwidth, and collective communication costs of the communication primitives implemented in these libraries are characterized and compared. These results are also briefly contrasted with the results of related studies on the IBM SP2 and the Convex SPP-1000. %Z Thu, 15 Oct 1998 15:16:41 GMT %A Labovitz, Craig %A Malan, G. Robert %A Jahanian, Farnam %T Internet Routing Instability %I Computer Science and Engineering, University of Michigan %R CSE-TR-332-97 %D February 21, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-332-97-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-332-97.ps.Z %X This paper examines the network inter-domain routing information exchanged between backbone service providers at the major U.S. public Internet exchange points. Internet routing instability, or the rapid fluctuation of network reachability information, is an important problem currently facing the Internet engineering community. High levels of network instability can lead to packet loss, increased network latency and time to convergence. At the extreme, high levels of routing instability have led to the loss of internal connectivity in wide-area, national networks. In this paper, we describe several unexpected trends in routing instability, and examine a number of anomalies and pathologies observed in the exchange of inter-domain routing information. The analysis in this paper is based on data collected from BGP routing messages generated by border routers at five of the Internet core's public exchange points during a nine-month period. We show that the volume of these routing updates is several orders of magnitude more than expected and that the majority of this routing information is redundant, or pathological. Furthermore, our analysis reveals several unexpected trends and ill-behaved systematic properties in Internet routing. We finally posit a number of explanations for these anomalies and evaluate their potential impact on the Internet infrastructure. %Z Thu, 15 Oct 1998 15:17:11 GMT %A Jamin, S. %A Shenker, S.J. %T Measurement-based Admission Control Algorithms for Controlled-load Service: A Structural Examination %I Computer Science and Engineering, University of Michigan %R CSE-TR-333-97 %D April 1, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-333-97-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-333-97.ps.Z %X The Internet Engineering Task Force (IETF) is considering the adoption of the controlled-load service, a real-time service with very relaxed service guarantees. Measurement-based admission control algorithms (MBAC's), as opposed to the more conservative worst-case parameter-based methods, were expressly designed to achieve high levels of network utilization for such relaxed real-time services. Most researchers studying MBAC's, including ourselves, have focused primarily on the design of the admission control equations using a variety of principled and ad hoc motivations. In this paper we show that, much to our own surprise, the admission control equations developed so far in the MBAC literature give essentially the same performance. Furthermore, we observe that the equations, even though they are derived and motivated in quite different ways, have the same structural form. %Z Thu, 15 Oct 1998 15:17:37 GMT %A Rogers, Seth O. 
%T Symbolic Performance & Learning in Continuous-valued Environments %I Computer Science and Engineering, University of Michigan %R CSE-TR-334-97 %D May 2, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-334-97-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-334-97.ps.Z %X Real-world and simulated real-world domains, such as flying and driving, commonly have the characteristics of continuous-valued (CV) environments. These environments are frequently complex and difficult to control, requiring a great deal of specific, detailed knowledge. Symbolic computation can use many different types of knowledge to constrain the control function. This thesis investigates learning the control function using observed data in a symbolic architecture. SPLICE (Symbolic Performance & Learning In Continuous-Valued Environments) is a Soar agent that performs symbolic control function approximation. SPLICE uses a three-level framework to first classify its sensory information into symbolic regions, then map the set of regions to a local model, then use the local model to determine an action. Learning causes the models to gradually become more specific and more accurate. The SPLICE agent is evaluated against other adaptive control algorithms in simple domains using four experimental methodologies. We also instantiate SPLICE in a complex environment, simulated flight, and evaluate its performance. The final goal is not only to create an effective controller for complex continuous environments, but also to understand more clearly the ramifications of symbolic learning in continuous domains. %Z Thu, 15 Oct 1998 15:18:01 GMT %A Patel, Sanjay Jeram %A Friendly, Daniel Holmes %A Patt, Yale N. %T Critical Issues Regarding the Trace Cache Fetch Mechanism %I Computer Science and Engineering, University of Michigan %R CSE-TR-335-97 %D May 7, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-335-97-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-335-97.ps.Z %X In order to meet the demands of wider issue processors, fetch mechanisms will need to fetch multiple basic blocks per cycle. The trace cache supplies several basic blocks each cycle by storing logically contiguous instructions in physically contiguous storage. When a particular basic block is requested, the trace cache can potentially respond with the requested block along with several blocks that followed it when the block was last encountered. In this technical report, we examine some critical features of a trace cache mechanism designed for a 16-wide issue processor and evaluate their effects on performance. We examine features such as cache associativity, storage partitioning, branch predictor design, instruction cache design, and fill unit design. We compare the performance of our trace cache mechanism with that of the design presented by Rotenberg et al. and show a 23% improvement in performance. In our final analysis, we compare our trace cache mechanism with an aggressive single basic block fetch mechanism and show that the trace cache mechanism attains a 24% improvement in performance. 
%Z Thu, 15 Oct 1998 15:18:24 GMT %A Gurevich, Yuri %T May 1997 Draft of the ASM Guide %I Computer Science and Engineering, University of Michigan %R CSE-TR-336-97 %D May 7, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-336-97-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-336-97.ps.Z %X This is only a draft of a portion of the ASM guide, but there are no plans to change this part in any substantial way, only to augment it. %Z Thu, 15 Oct 1998 15:18:39 GMT %A Chang, Po-Yung %T Classification-Directed Branch Predictor Design %I Computer Science and Engineering, University of Michigan %R CSE-TR-337-97 %D May 13, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-337-97-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-337-97.ps.Z %X Pipeline stalls due to branches represent one of the most significant impediments to realizing the performance potential of deeply pipelined superscalar processors. Two well-known mechanisms have been proposed to reduce the branch penalty: speculative execution in conjunction with branch prediction, and predicated execution. This dissertation proposes branch classification, coupled with improvements in conditional branch prediction, indirect branch prediction, and predicated execution, to reduce the branch execution penalty. Branch classification allows an individual branch instruction to be associated with the branch predictor best suited to predict its direction. Using this approach, a hybrid branch predictor is constructed which achieves a higher prediction accuracy than any branch predictor previously reported in the literature. This dissertation also proposes a new prediction mechanism for predicting indirect jump targets. For the perl and gcc benchmarks, this mechanism reduces the indirect jump misprediction rate by 93.4\% and 63.3\% and the overall execution time on an 8-wide issue out-of-order execution machine by 14\% and 5\%. Finally, this dissertation proposes a new method for combining the performance benefits of predicated execution and speculative execution. This approach significantly reduces the branch execution penalty suffered by wide-issue processors. %Z Thu, 15 Oct 1998 15:19:03 GMT %A Blass, Andreas %A Gurevich, Yuri %A Shelah, Saharon %T Choiceless Polynomial Time %I Computer Science and Engineering, University of Michigan %R CSE-TR-338-97 %D May 16, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-338-97-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-338-97.ps.Z %X Turing machines define polynomial time (PTime) on strings but cannot deal with structures like graphs directly, and there is no known, easily computable string encoding of isomorphism classes of structures. Is there a computation model whose machines do not distinguish between isomorphic structures and compute exactly PTime properties? This question can be recast as follows: Does there exist a logic that captures polynomial time (without presuming the presence of a linear order)? Earlier, one of us conjectured the negative answer. The problem motivated a quest for stronger and stronger PTime logics. All these logics avoid arbitrary choice. Here we attempt to capture the choiceless fragment of PTime. Our computation model is a version of abstract state machines (formerly called evolving algebras). The idea is to replace arbitrary choice with parallel execution. The resulting logic is more expressive than other PTime logics in the literature. 
A more difficult theorem shows that the logic does not capture all PTime. %Z Thu, 15 Oct 1998 15:19:20 GMT %A Lin, Stephen %A Lee, Sang Wook %T Using Chromaticity Distributions and Eigenspace Analysis for Pose-, Illumination-, and Specularity-Invariant Color Indexing of 3D Objects %I Computer Science and Engineering, University of Michigan %R CSE-TR-339-97 %D May 23, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-339-97-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-339-97.ps.Z %X The distribution of object colors can be effectively utilized for recognition and indexing. Difficulties arise in the recognition of object color distributions when there are variations in illumination color, changes in object pose with respect to illumination direction, and specular reflections. However, most of the recent approaches to color-based recognition focus mainly on illumination color invariance. We propose an approach that identifies object color distributions influenced by: (1) illumination pose, (2) illumination color and (3) specularity. We suggest the use of chromaticity distributions to achieve illumination pose invariance. To characterize changes in chromaticity distribution due to illumination color, an eigenspace analysis is employed. A set of chromaticity histograms of each object is generated for a range of lighting colors based on linear models of illumination and reflectance, and the histograms are represented using a small number of eigen basis vectors constructed from principal components analysis. Represented in an eigenspace, chromaticity signatures for each object in the model database form a manifold encompassing a range of lighting colors. Recognition and illumination color estimation are performed by projecting the chromaticity signature of a test object into the eigenspace and examining its proximity to database manifolds. Since specular reflections may displace a projection away from its corresponding manifold, a model-based specularity detection/rejection algorithm, called chromaticity differencing, is developed to reduce this discrepancy. %Z Thu, 15 Oct 1998 15:19:36 GMT %A Crestana, Viviane M. %A Lee, Amy J. %A Rundensteiner, Elke A. %T Schema Version Removal: Optimizing Transparent Schema Evolution Systems %I Computer Science and Engineering, University of Michigan %R CSE-TR-340-97 %D May 27, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-340-97-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-340-97.ps.Z %X Powerful interoperability-enabling solutions for software application integration must allow applications to evolve and data requirements to change, while minimizing such changes on other integrated applications. The transparent schema evolution system (TSE), which we developed at the University of Michigan, accomplishes evolution through schema versioning, where a schema version is a dynamic object-oriented view. When the TSE system receives a view schema change request, the system computes a new view schema that reflects the desired change instead of modifying the old view schema in place, thereby preserving existing view schemas for old applications. This generation of a potentially large number of schema versions over time results in an excessive build-up of classes and underlying object instances, not all of which are necessarily still in use. 
Since our view system provides us with materialized views, a degradation of system performance due to update propagation is expected in a situation where we have no-longer-utilized view schemas in the system. Also, a larger number of view schemas will result in storage overhead costs. In this paper, we address this problem using consistent schema removal techniques. First, we characterize four potential problems of schema consistency that could be caused by removal of a single virtual class, and then outline our solution approach for each of these problems. Second, we demonstrate that view schema removal is sensitive to the order in which individual classes are processed. Our solution to this problem is based on a formal model, called the dependency model, of capturing all dependencies between classes as logic clauses and of manipulating them to make decisions on class deletions and nondeletions while guaranteeing the consistency of the schema. Third, based on this formal model, we have developed and proven consistent a dependency graph (DG) representation and an associated set of rules for DG generation, reduction, and transformation. A first preliminary version of the latter has been successfully implemented in our Schema Version Removal (SVR) tool. Fourth, we present a cost model for evaluating alternative removal patterns on DG. Lastly, we report our preliminary experimental studies that demonstrate the impact of our schema version removal on the performance of the TSE system. %Z Thu, 15 Oct 1998 15:20:44 GMT %A Lee, Chih-Chieh %A Uhlig, Richard %A Mudge, Trevor %T A Case Study of a Hardware-Managed TLB in a Multi-Tasking Environment %I Computer Science and Engineering, University of Michigan %R CSE-TR-341-97 %D June 3, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-341-97-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-341-97.ps.Z %X There have been very few performance studies of hardware-managed translation look-aside buffers (TLBs). The major reason is the lack of efficient and accurate analysis tools. Newer operating systems, applications, and the popularity of the client-server model of computation place a greater burden than their predecessors on memory system components such as TLBs. Thus it is becoming more important to measure the performance of memory systems under such workloads. In this work, we implemented a trap-driven simulator in an operating system to emulate a variety of TLBs. Using this tool, we were able to evaluate the performance of a range of TLBs under these newer workloads. The results show that in order to improve the TLB performance, we should carefully map pages into the TLB, append process identifiers to avoid flushing the TLB contents frequently, or reserve part of the TLB for a particular server process. %Z Thu, 15 Oct 1998 15:20:58 GMT %A Lefurgy, Charles %A Bird, Peter %A Chen, I-Cheng %A Mudge, Trevor %T Improving Code Density Using Compression Techniques %I Computer Science and Engineering, University of Michigan %R CSE-TR-342-97 %D July 8, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-342-97-cover.ps.gz %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-342-97.ps.gz %X We propose a method for compressing programs in embedded processors where instruction memory size dominates cost. A post-compilation analyzer examines a program and replaces common sequences of instructions with a single instruction codeword. 
A microprocessor executes the compressed instruction sequences by fetching codewords from the instruction memory, expanding them back to the original sequence of instructions in the decode stage, and issuing them to the execution stages. We apply our technique to the PowerPC instruction set and achieve a 30% to 50% reduction in size for SPEC CINT95 programs. %Z Thu, 15 Oct 1998 15:21:18 GMT %A Berwick, D. %A Lee, S.W. %T A Chromaticity Space for Specularity, Illumination Color- and Illumination Pose-Invariant 3-D Object Recognition %I Computer Science and Engineering, University of Michigan %R CSE-TR-343-97 %D July 28, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-343-97-cover.ps.gz %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-343-97.ps.gz %X Most of the recent recognition/indexing approaches concentrate on establishing invariance to illumination color to improve the utility of color recognition. However, other effects caused by illumination pose and specularity on three-dimensional object surfaces have not received notable attention. We present a chromaticity recognition method that discounts the effects of illumination pose, illumination color and specularity. It utilizes a chromaticity space based on log-ratio of sensor responses for illumination pose and color invariance. A model-based specularity detection/rejection algorithm can be used to improve the chromaticity recognition and illumination estimation for objects including specular reflections. %Z Thu, 15 Oct 1998 15:21:39 GMT %A Thaler, D. %A Ravishankar, C.V. %T An Architecture for Inter-Domain Troubleshooting %I Computer Science and Engineering, University of Michigan %R CSE-TR-344-97 %D August 1, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-344-97-cover.ps.gz %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-344-97.ps.gz %X In this paper, we explore the constraints of a new problem: that of coordinating network troubleshooting among peer administrative domains or Internet Service Providers, and untrusted observers. Allowing untrusted observers permits any entity to report problems, whether it is a Network Operations Center (NOC), end-user, or application. Our goals here are to define the inter-domain coordination problem clearly, and to develop an architecture which allows observers to report problems and receive timely feedback, regardless of their own locations and identities. By automating this process, we also relieve human bottlenecks at help desks and NOCs whenever possible. We begin by presenting a troubleshooting methodology for coordinating problem diagnosis. We then describe GDT, a distributed protocol which realizes this methodology. We show through simulation that GDT performs well as the number of observers and problems grows, and continues to function robustly amidst heavy packet loss. %Z Thu, 15 Oct 1998 15:21:53 GMT %A Manohar, Nelson %T Interactive Delayed-Sharing of Computer-Supported Workspaces via the Streaming of Re-Executable Content %I Computer Science and Engineering, University of Michigan %R CSE-TR-345-97 %D August 18, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-345-97-cover.ps.gz %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-345-97.ps.gz %X This dissertation introduces the notion of interactive delayed-sharing of a session on a computer-supported workspace to allow reuse of such a session at a later time. A data artifact, referred to as a session object, encapsulates the session. 
A session object is composed of heterogeneous multimedia streams that represent temporally-ordered input sequences to applets in the workspace. Playback of a session object recreates the underlying workspace spanned by these applets through the streaming and re-execution of these input sequences in their respective applets. The contributions of this dissertation are as follows. First, this dissertation pioneers the re-execution approach to the record and replay of sessions for supporting computer-supported collaborative work. In doing so, it introduces a novel paradigm for flexible support of asynchronous collaboration that allows users to collaborate by annotating, editing, and refining these delayed-shared workspaces. This dissertation explores these collaborative features, particularly focusing on the record, representation, and playback of these workspaces. Second, this dissertation introduces the notion of workspaces as authored, transportable, and temporally-aware artifacts and investigates flexible mechanisms for the delivery of temporal-awareness to workspaces composed of heterogeneous applications through the decoupling of tool and media services. This dissertation develops a formal specification through temporal relationships in these workspaces. Finally, this dissertation describes mechanisms for the integration of re-executable content streams together with continuous multimedia streams. It describes a framework for media-independent integration of heterogeneous multimedia streams. The dissertation introduces the use of statistical process controls to guide the relaxation and/or constraint of scheduling and synchronization mechanisms so as to flexibly support a range of heterogeneous media streams and their temporal relationships. %Z Thu, 15 Oct 1998 15:22:14 GMT %A Tyson, Gary S. %A Lick, Kelsey %A Farrens, Matthew %T Limited Dual Path Execution %I Computer Science and Engineering, University of Michigan %R CSE-TR-346-97 %D August 28, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-346-97-cover.ps.gz %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-346-97.ps.gz %X This work presents a hybrid branch predictor scheme that uses a limited form of dual path execution along with dynamic branch prediction to improve execution times. The ability to execute down both paths of a conditional branch enables the branch penalty to be minimized; however, relying exclusively on dual path execution is infeasible because instruction fetch rates far exceed the capability of the pipeline to retire a single branch before others must be processed. By using confidence information, available in the dynamic branch prediction state tables, a limited form of dual path execution becomes feasible. This reduces the burden on the branch predictor by allowing predictions of low confidence to be avoided. In this study we present a new approach to gather branch prediction confidence with little or no overhead, and use this confidence mechanism to determine whether dual path execution or branch prediction should be used. Comparing this hybrid predictor model to the dynamic branch predictor shows a dramatic decrease in misprediction rate, which translates to a reduction in runtime of over 20%. These results imply that dual path execution, which often is thought to be an excessively resource-consuming method, may be a worthy approach if restricted with an appropriate predicting set. %Z Thu, 15 Oct 1998 15:22:27 GMT %A Feng, Wu-chang %A Kandlur, Dilip D. %A Saha, Debanjan %A Shin, Kang G. 
%T Adaptive Packet Marking for Providing Differentiated Services in the Internet %I Computer Science and Engineering, University of Michigan %R CSE-TR-347-97 %D October 16, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-347-97-cover.ps.gz %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-347-97.ps.gz %X This paper examines the use of adaptable priority marking for providing soft bandwidth guarantees to individual connections or connection groups over the Internet. The proposed scheme does not require resource reservation for individual connections and can be supported with minimal changes to the network infrastructure. The scheme uses modest support from the network in the form of priority handling for appropriately marked packets and relies on intelligent transmission control mechanisms at the edges of the network to achieve the desired throughput levels. The paper describes the control mechanisms and evaluates their behavior in various network environments, thereby making them suitable for deployment in an evolving Internet. %Z Thu, 15 Oct 1998 15:22:46 GMT %A Tam, Edward %A Rivers, Jude %A Davidson, Edward S. %T Flexible Timing Simulation of Multiple-Cache Configurations %I Computer Science and Engineering, University of Michigan %R CSE-TR-348-97 %D October 10, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-348-97-cover.ps.gz %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-348-97.ps.gz %X As the gap between processor and memory speeds increases, cache performance becomes more critical to overall system performance. Behavioral cache simulation is typically used early in the design cycle of new processor/cache configurations to determine the performance of proposed cache configurations on target workloads. However, behavioral cache simulation does not account for the latency seen by each memory access. The Latency-Effects (LE) cache model presented in this paper accounts for this nominal latency as well as the additional latencies due to trailing-edge effects, bus width considerations, port conflicts, and the number of outstanding accesses that a cache allows before it blocks. We also extend the LE cache model to handle the latency effects of moving data among multiple caches. mlcache, a new, easily configurable and extensible tool, has been built based on the extended LE model. We show the use of mlcache in estimating the performance of traditional and novel cache configurations, including odd/even, 2-level, Assist, Victim, and NTS caches. We also show how the LE cache timing model provides more useful, realistic performance estimates than other possible behavioral-level cache timing models. %Z Thu, 15 Oct 1998 15:23:05 GMT %A Feng, Wu-chang %A Kandlur, Dilip D. %A Saha, Debanjan %A Shin, Kang G. %T Techniques for Eliminating Packet Loss in Congested TCP/IP Networks %I Computer Science and Engineering, University of Michigan %R CSE-TR-349-97 %D November 4, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-349-97-cover.ps.gz %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-349-97.ps.gz %X The congestion control mechanisms used in TCP have been the focus of numerous studies and have undergone a number of enhancements. However, even with these enhancements, TCP connections still experience alarmingly high loss rates, especially during times of congestion. To alleviate this problem, the IETF is considering active queue management mechanisms, such as RED, for deployment in the network. 
In this paper, we first show that even with RED, loss rates can only be reduced marginally in times of congestion. We then show that the solution to this problem lies in understanding the traffic generated by an aggregation of a large number of sources. Given this, we then propose and experiment with more adaptive active queue management algorithms and more intelligent end-host mechanisms which can significantly reduce loss rates across congested links. When used together, these mechanisms can effectively eliminate packet loss while maintaining high link utilizations under the most difficult scenarios. %Z Thu, 15 Oct 1998 15:23:18 GMT %A Shaikh, Anees %A Rexford, Jennifer %A Shin, Kang G. %T Dynamics of Quality-of-Service Routing with Inaccurate Link-State Information %I Computer Science and Engineering, University of Michigan %R CSE-TR-350-97 %D November 17, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-350-97-cover.ps.gz %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-350-97.ps.gz %X Quality-of-service (QoS) routing can satisfy application performance requirements and optimize network resource usage by selecting paths based on connection traffic parameters and link load information. However, effective path-selection schemes require the distribution of link-state information, which can place a significant burden on the bandwidth and processing resources in the network. We investigate the fundamental tension between network overheads and the quality of routing decisions in the context of source-directed QoS routing algorithms. In contrast to previous performance studies that compare different routing algorithms under specific network configurations, we characterize how the performance and overheads of QoS routing relate to the link-state update policies, as a function of the underlying traffic load, network topology, and link-cost metrics. We explore the interplay between stale link-state information and random fluctuations in traffic load through a broad set of simulation experiments on a parameterized model of QoS routing. These results suggest ways to tune the frequency of link-state update messages and the link-cost functions to strike a careful balance between high accuracy and low complexity. %Z Thu, 15 Oct 1998 15:23:23 GMT %A Shim, Hyong Sop %A Hall, Robert W. %A Litiu, Radu %A Prakash, Atul %T Stateful Multicast Services for Supporting Collaborative Applications %I Computer Science and Engineering, University of Michigan %R CSE-TR-351-97 %D November 17, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-351-97-cover.ps.gz %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-351-97.ps.gz %X Collaborative, multi-user applications require group multicast services with ordering guarantees for maintaining consistency of replicated shared state among collaborating processes. In traditional group multicast services (such as Isis), the group's shared state is maintained only by the clients and the multicast service facilitates atomic state transfer from existing clients to new members. 
In this paper, we argue that, in order to support collaborative applications in Internet-type environments, the multicast service should also maintain a copy of each group's state. With such a copy, the multicast service can provide consistently fast group-join and state transfer times when both slow and fast clients are present or when clients are unreliable - a crucial requirement in collaborative, multi-user applications where users may dynamically join and leave a collaborative session and expect predictable join times and interactive response time even in the presence of slow or unreliable clients. We show that the overheads incurred by a multicast service in managing each group's shared state can be made minimal and that the multicast service does not have to be aware of the semantics of the group's state. We present the design of such a multicast service, present performance results, and discuss how it meets the various needs of collaborative applications. %Z Thu, 15 Oct 1998 15:23:51 GMT %A Litiu, Radu %A Prakash, Atul %T Individual and Group QoS Issues in Communication Services for Groupware Systems %I Computer Science and Engineering, University of Michigan %R CSE-TR-352-97 %D November 17, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-352-97-cover.ps.gz %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-352-97.ps.gz %X The proliferation of computer networks in the last decade and the ubiquity of the World Wide Web have led to increased interest in the development of computer-supported cooperative work (CSCW) systems. Collaborative, multi-user applications require group multicast services that provide ordering guarantees for maintaining consistency of replicated shared context as well as provide a high degree of interactivity, even under varying load on the multicast servers. While the most common view of the quality of service (QoS) in a distributed system is in terms of the guarantee of the network connection parameters (bandwidth, end-to-end delay), in this paper we investigate QoS from the perspective of the various requirements placed on group communication servers with limited resources by multiple and diverse groups of collaborative users. We show that in the absence of QoS considerations in the design of a group communication service, some groups or individual users can be severely affected by bursty traffic or by an increase in the size of other groups. We present the design of a best-effort QoS-based adaptive group communication service for supporting reliable data communication in CSCW systems. Our QoS considerations address both group requirements and individual user requirements. We present performance results showing the effectiveness of the approach and discuss some of the open issues for future work. %Z Thu, 15 Oct 1998 15:24:11 GMT %A Zhang, Xi %A Shin, Kang G. %A Saha, Debanjan %A Kandlur, Dilip %T Scalable Flow Control for Multicast ABR Services in ATM Networks %I Computer Science and Engineering, University of Michigan %R CSE-TR-353-97 %D November 20, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-353-97-cover.ps.gz %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-353-97.ps.gz %X We propose an efficient flow-control scheme for ATM ABR multicast services. We develop a second-order rate-control algorithm to deal with the variation in feedback delay resulting from dynamic "drift" of the bottleneck location within a multicast tree. The proposed scheme makes the rate process converge to the available bandwidth of the multicast-connection's most congested link.
It also confines the buffer occupancy to a target regime bounded by a given (finite) buffer capacity at the bottleneck node. Using fluid approximation, we model the proposed scheme and study the system dynamics under the most stressful traffic conditions. We derive expressions for queue buildups and average throughputs in both transient and equilibrium states. We identify the system control factors that govern the system dynamics and develop an optimal control condition which guarantees monotonic convergence of the system state to the target regime from an arbitrary initial value. %Z Thu, 15 Oct 1998 15:24:24 GMT %A Zhang, Xi %A Shin, Kang G. %A Zheng, Qin %T Integrated Rate and Credit Based Flow Control for Unicast ABR Service in ATM Networks %I Computer Science and Engineering, University of Michigan %R CSE-TR-354-97 %D November 20, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-354-97-cover.ps.gz %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-354-97.ps.gz %X We propose a flow-control scheme that combines the merits of credit- and rate-based flow-control schemes by applying direct control over both bandwidth and buffer resources. The goal of the proposed scheme is to design an optimal rate-control policy for a given finite buffer capacity by maximizing average throughput and bounding end-to-end delay. By applying the second-order rate control, the proposed scheme not only makes the rate process converge to the neighborhood of link bandwidth, but also confines the queue-length fluctuation to a regime bounded by buffer capacity. Using hop-by-hop credit flow control, the proposed scheme ensures lossless transmission with a finite buffer capacity while maintaining high network utilization. The source dynamically adjusts its transmission rate and rate-control parameters according to both the network congestion status and whether or not the system states are in the target operating regime. Using the fluid approximation method, we model the proposed flow-control scheme and study the system dynamic behavior for ABR (Available Bit Rate) service under the most stressful traffic condition. We derive the expressions for queue build-ups and average throughput in the equilibrium states as functions of rate-control parameters, feedback delay, link bandwidth, and buffer capacity. Based on the analytical results, we identify the optimal equilibrium state and propose the second-order rate control algorithm for transient-state flow control. We derive a sufficient condition which guarantees that the second-order rate control drives the system from any transient state to an optimal equilibrium state. The worst-case performance bounds, derived as closed-form functions of flow-control parameters, provide an analytical means of evaluating the performance of the second-order rate control in terms of convergence speed and buffer utilization. The analytical results for both single and multiple connection scenarios have shown the proposed scheme to be stable and efficient in that the source rate and bottleneck queue length rapidly converge to a small neighborhood of the designated operating point. Also presented are examples showing that the proposed scheme outperforms the other existing schemes.
%Z Thu, 15 Oct 1998 15:24:38 GMT %A Wallace, Charles %T The Semantics of the Java Programming Language: Preliminary Version %I Computer Science and Engineering, University of Michigan %R CSE-TR-355-97 %D December 9, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-355-97-cover.ps.gz %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-355-97.ps.gz %X A mathematical model of the Java programming language is constructed. %Z Thu, 15 Oct 1998 15:24:51 GMT %A Zou, Hengming %A Jahanian, Farnam %T Real-Time Primary-Backup (RTPB) Replication with Temporal Consistency Guarantees %I Computer Science and Engineering, University of Michigan %R CSE-TR-356-98 %D February 13, 1998 %U ftp://ftp.eecs.umich.edu/techreports/cse/1998/CSE-TR-356-98-cover.ps.gz %U ftp://ftp.eecs.umich.edu/techreports/cse/1998/CSE-TR-356-98.ps.gz %X A common approach to building fault-tolerant distributed systems is to replicate servers that fail independently. The objective is to give the clients the illusion of service that is provided by a single server. The main approaches for structuring fault-tolerant servers are passive and active replication. This paper presents a primary-backup (passive) replication scheme for supporting fault-tolerant real-time applications. The proposed scheme, called the Real-Time PB (RTPB) replication service, is an elegant and simple approach that provides the benefits of fault-tolerance, real-time access, and temporal consistency guarantees that are not otherwise easily attainable. This paper formally defines two types of temporal consistency, namely external temporal consistency and inter-object temporal consistency. By introducing a key concept called phase variance, we are able to build our temporal consistency models and derive necessary and sufficient conditions that can be used as the basis for update and transmission scheduling that achieve temporal consistency guarantees. Furthermore, we prove that the phase variance used in the models can be bounded under various scheduling algorithms, namely EDF, Rate Monotonic, and Distance-Constrained Scheduling. The paper also presents an implementation of the real-time primary-backup replication scheme with the aforementioned temporal consistency models. This implementation was developed within the x-kernel architecture on the MK 7.2 microkernel from the Open Group. The results of a detailed performance evaluation of this implementation are also discussed. %Z Thu, 15 Oct 1998 15:25:06 GMT %A Abandah, Gheith A. %A Davidson, Edward S. %T Configuration Independent Analysis for Characterizing Shared-Memory Applications %I Computer Science and Engineering, University of Michigan %R CSE-TR-357-98 %D December 23, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1998/CSE-TR-357-98-cover.ps.gz %U ftp://ftp.eecs.umich.edu/techreports/cse/1998/CSE-TR-357-98.ps.gz %X Characterizing shared-memory applications provides insight for designing efficient systems and helps identify and correct application performance bottlenecks. Configuration dependent analysis is often used to simulate detailed application traces on a particular hardware model. The communication traffic and computation workload generated by the application trace are used as a characterization of this application. This paper demonstrates that configuration independent analysis is a useful tool to characterize shared-memory applications. Configuration independent analysis characterizes inherent application characteristics that do not change from one configuration to another.
While configuration dependent analysis is repeated for each target configuration, configuration independent analysis is only performed once. Moreover, configuration independent analysis does not require developing models for the target configurations and is faster than detailed simulation. However, configuration dependent analysis directly provides more information about specific configurations. A combination of the two analysis types constitutes a comprehensive and efficient methodology for characterizing shared-memory applications. In this paper, we show how configuration independent analysis is used to characterize eight aspects of application behavior: general characteristics, working sets, concurrency, communication patterns, communication variation over time, communication slack, communication locality, and sharing behavior. We illustrate the advantages and limitations of this approach by analyzing eight case-study benchmarks from the scientific and commercial domains and interpreting the results. %Z Thu, 15 Oct 1998 15:25:22 GMT %A Ganger, Gregory R. %A Worthington, Bruce L. %A Patt, Yale N. %T The DiskSim Simulation Environment Version 1.0 Reference Manual %I Computer Science and Engineering, University of Michigan %R CSE-TR-358-98 %D February 27, 1998 %U ftp://ftp.eecs.umich.edu/techreports/cse/1998/CSE-TR-358-98-cover.ps.gz %U ftp://ftp.eecs.umich.edu/techreports/cse/1998/CSE-TR-358-98.ps.gz %X DiskSim is an efficient, accurate and highly-configurable disk system simulator developed at the University of Michigan to support research into various aspects of storage subsystem architecture. It includes modules that simulate disks, intermediate controllers, buses, device drivers, request schedulers, disk block caches, and disk array data organizations. In particular, the disk drive module simulates modern disk drives in great detail and has been carefully validated against several production disks (with accuracy that exceeds that of any previously reported simulator). This manual describes how to configure and use DiskSim, which has been made publicly available with the hope of advancing the state-of-the-art in disk system performance evaluation in the research community. The manual also briefly describes DiskSim's internal structure and various validation results. %Z Thu, 15 Oct 1998 15:25:36 GMT %A Kravets, Victor N. %A Sakallah, Karem A. %T Constructive Multilevel Logic Synthesis Under Properties of Boolean Algebra %I Computer Science and Engineering, University of Michigan %R CSE-TR-359-98 %D March 9, 1998 %U ftp://ftp.eecs.umich.edu/techreports/cse/1998/CSE-TR-359-98-cover.ps.gz %U ftp://ftp.eecs.umich.edu/techreports/cse/1998/CSE-TR-359-98.ps.gz %X We describe a new constructive multilevel logic synthesis system that integrates the traditionally separate technology-independent and technology-dependent stages of modern synthesis tools. Dubbed M32, this system is capable of generating circuits incrementally based on both functional as well as structural considerations. This is achieved by maintaining a dynamic structural representation of the evolving implementation and by refining it through progressive introduction of gates from a target technology library. Circuit construction proceeds from the primary inputs towards the primary outputs. Preliminary experimental results show that circuits generated using this approach are generally superior to those produced by multi-stage synthesis.
%Z Thu, 15 Oct 1998 15:25:48 GMT %A Hao, Eric %T Block Enlargement Optimizations for Increasing the Instruction Fetch Rate in Block-Structured Instruction Set Architectures %I Computer Science and Engineering, University of Michigan %R CSE-TR-360-98 %D March 11, 1998 %U ftp://ftp.eecs.umich.edu/techreports/cse/1998/CSE-TR-360-98-cover.ps.gz %U ftp://ftp.eecs.umich.edu/techreports/cse/1998/CSE-TR-360-98.ps.gz %X To exploit larger amounts of instruction level parallelism, processors are being built with wider issue widths and larger numbers of functional units. Instruction fetch rate must also be increased in order to effectively exploit the performance potential of such processors. Block-structured ISAs are a new class of instruction set architectures that we designed to address the performance obstacles faced by processors attempting to exploit high levels of instruction level parallelism. The major distinguishing feature of a block-structured ISA is that it defines the architectural atomic unit (i.e., the instruction) to be a group of operations, which is called an atomic block. This dissertation defines an optimization, block enlargement, that can be applied to a block-structured ISA to increase the instruction fetch rate of a processor that implements that ISA. A compiler and a simulator for a block-structured ISA processor were constructed to evaluate the performance benefit of block-structured ISAs. This dissertation shows that for the SPECint95 benchmarks, the block-structured ISA processor executing enlarged atomic blocks and using simpler microarchitectural mechanisms to support wide-issue and dynamic scheduling outperforms a conventional ISA processor that also supports wide-issue and dynamic scheduling by 28% when assuming perfect branch prediction and by 15% when using real branch prediction. %Z Thu, 15 Oct 1998 15:26:07 GMT %A Srinivasan, Vijayalakshmi %T Improving Performance of an L1 Cache With an Associated Buffer %I Computer Science and Engineering, University of Michigan %R CSE-TR-361-98 %D March 20, 1998 %U ftp://ftp.eecs.umich.edu/techreports/cse/1998/CSE-TR-361-98-cover.ps.gz %U ftp://ftp.eecs.umich.edu/techreports/cse/1998/CSE-TR-361-98.ps.gz %X Memory access latencies are much larger than processor cycle times, and the trend has been for this gap to increase over time. Cache performance becomes critical in bridging this gap. However, since it is difficult to make a cache both large and fast, cache misses are expected to continue to have a significant performance impact. Victim caching, proposed by Jouppi, is an approach to decrease the miss ratio of direct-mapped caches without affecting their access time. NTS caching, proposed by Rivers, is a multilateral cache design scheme that improves performance of first-level (L1) caches based on the temporal locality of the reference pattern. We propose an improvement of these schemes, which we call NT-victim caching. Taking the lead from the NTS design, we have a bilateral L1 cache, having a main cache (cache A) and a small fully associative buffer (cache B). Cache B is similar to a victim buffer, and holds the victim block replaced on a miss. A cache block is temporal if, after it is brought into the cache, some word in that block is accessed more than once before the block is replaced. Unlike Victim caches, a block that is hit in cache B is swapped with a block in cache A only if it is temporal, and in most of our replacement strategies temporal blocks are less likely to be selected for replacement than non-temporal blocks.
Every cache block loaded into the L1 cache is monitored for temporal behavior by a hardware detection unit. Our results show that this design reduces the number of swaps between cache A and cache B, relative to the Victim cache, yet gives a comparable miss ratio. %Z Thu, 15 Oct 1998 15:26:33 GMT %A Abandah, Gheith Ali %T Reducing Communication Cost in Scalable Shared Memory Systems %I Computer Science and Engineering, University of Michigan %R CSE-TR-362-98 %D April 13, 1998 %U ftp://ftp.eecs.umich.edu/techreports/cse/1998/CSE-TR-362-98-cover.ps.gz %U ftp://ftp.eecs.umich.edu/techreports/cse/1998/CSE-TR-362-98.ps.gz %X Distributed shared-memory systems provide scalable performance and a convenient model for parallel programming. However, their non-uniform memory latency often makes it difficult to develop efficient parallel applications. Future systems should reduce communication cost to achieve better programmability and performance. We have developed a methodology, and implemented a suite of tools, to guide the search for improved codes and systems. As the result of one such search, we recommend a remote data caching technique that significantly reduces communication cost. We analyze applications by instrumenting their assembly-code sources. During execution, an instrumented application pipes a detailed trace to configuration independent (CIAT) and configuration dependent (CDAT) analysis tools. CIAT characterizes inherent application characteristics that do not change from one configuration to another, including working sets, concurrency, sharing behavior, and communication patterns, variation over time, slack, and locality. CDAT simulates the trace on a particular hardware model, and is easily retargeted to new systems. CIAT is faster than detailed simulation; however, CDAT directly provides more information about a specific configuration. The combination of the two tools constitutes a comprehensive and efficient methodology. We calibrate existing systems using carefully designed microbenchmarks that characterize local and remote memory, producer-consumer communication involving two or more processors, and contention when multiple processors utilize memory and interconnect. This dissertation describes these tools, illustrates their use by characterizing a wide range of applications and assessing the effects of architectural and technological advances on the performance of HP/Convex Exemplar systems, evaluates strengths and weaknesses of current system approaches, and recommends solutions. CDAT analysis of three CC-NUMA system approaches shows that current systems reduce communication cost by minimizing either remote latency or remote communication frequency. We describe four architecturally varied systems that are technologically similar to the low remote latency SGI Origin 2000, but incorporate additional techniques for reducing the number of remote communications. Using CCAT, a CDAT-like simulator that models communication contention, we evaluate the worthiness of these techniques. Superior performance is reached when processors supply cached clean data, or when a remote data cache is introduced that participates in the local bus protocol. %Z Fri, 16 Oct 1998 15:11:19 GMT %A Tam, Edward %A Rivers, Jude %A Tyson, Gary %A Davidson, Edward S.
%T mlcache: A Flexible Multi-Lateral Cache Simulator %I Computer Science and Engineering, University of Michigan %R CSE-TR-363-98 %D May 11, 1998 %U ftp://ftp.eecs.umich.edu/techreports/cse/1998/CSE-TR-363-98-cover.ps.gz %U ftp://ftp.eecs.umich.edu/techreports/cse/1998/CSE-TR-363-98.ps.gz %X As the gap between processor and memory speeds increases, cache performance becomes more critical to overall system performance. To address this, processor designers typically design for the largest possible caches that can still remain on the ever-growing processor die. However, alternate multi-lateral cache designs such as the Assist Cache, Victim Cache, and NTS Cache have been shown to perform as well as or better than larger, single-structure caches while requiring less die area. For a given die size, reducing the requirements to attain a given rate of data supply can allow more space to be dedicated to branch prediction, data forwarding, increasing the size of the reorder buffer, etc. Current cache simulators are not able to study a wide variety of multi-lateral cache configurations. Thus, the mlcache multi-lateral cache simulator was developed to help designers in the middle of the design cycle decide which cache configuration would best aid in attaining the desired performance goals of the target processor. mlcache is an event-driven, timing-sensitive simulator based on the Latency Effects cache timing model. It can be easily configured to model various multi-lateral cache configurations using its library of cache state and data movement routines. The simulator can be easily joined to a wide range of event-driven processor simulators such as RCM-brisc, Talisman, SimICS, and SimpleScalar. We use the SPEC95 benchmarks to illustrate how mlcache can be used to compare the performance of several different data cache configurations. %Z Fri, 16 Oct 1998 15:12:33 GMT %A Riepe, Michael %A Sakallah, Karem %T Transistor Level Micro Placement and Routing for Two-Dimensional Digital VLSI Cell Synthesis %I Computer Science and Engineering, University of Michigan %R CSE-TR-364-98 %D June 1, 1998 %U http://www.eecs.umich.edu/eecs/research/techreports/cse98.html %X Previous research into the problem of cell library synthesis for digital VLSI design has concentrated mostly on relatively simple 1-dimensional cell topologies for static CMOS designs. Recent interest has emerged in less constrained 2-dimensional topologies to support more complex non-dual circuits such as latches and flip flops, as well as high performance circuit families such as CVSL, PTL, and domino CMOS. We discuss a CAD methodology which supports a generalized placement and routing approach to the realization of mask geometry for such complex circuits. We explore the options available within this methodology, show how the placement and routing problems at the transistor level differ from those at the block level, and present some results for a prototype tool, TEMPO, which adopts this methodology. %Z Thu, 15 Oct 1998 15:27:39 GMT %A Zhang, Xi %A Shin, Kang G. %T A Scalable Flow-Control Algorithm for Point-to-Multipoint Communications in High-Speed Integrated Networks %I Computer Science and Engineering, University of Michigan %R CSE-TR-365-98 %D June 3, 1998 %U http://www.eecs.umich.edu/eecs/research/techreports/cse98.html %X We propose and study a scalable flow-control algorithm for multicast ABR (Available Bit Rate) service.
A key idea behind the algorithm is the Soft-Synchronization Protocol (SSP), which derives a single "consolidated" RM (Resource Management) cell at each multicast branch-point from feedback RM-cells of different downstream branches that are not necessarily responses to the same forward RM-cell in each synchronization cycle. Using balanced and unbalanced binary-tree models, we analyze the scalability of SSP in terms of the height and structure of the multicast tree. In contrast with the existing schemes, SSP is shown to be able to effectively support feedback synchronization and make the effective RM-cell roundtrip delay virtually independent of the multicast-tree's height and structure, thus making it scalable. Another key problem in multicast flow-control is how to deal with the variation of RM-cell roundtrip delay due to the randomly drifting multicast-tree bottleneck. This problem is handled by adapting the second-order rate-control parameter to roundtrip-delay variations. A fluid model is used to analyze the second-order rate control and study the system dynamics for multicast ABR service. We derive an analytical relationship between the second-order rate parameter and RM-cell roundtrip delay subject to both lossless transmission and finite buffer capacity constraints. This relationship ensures the feasibility of the second-order rate control in dealing with RM-cell roundtrip-delay variations and provides insight into how the required buffer space can be controlled by adjusting the rate parameter. We develop an optimal control condition, under which the second-order rate control guarantees monotonic convergence of the system state to the optimal regime from an arbitrary initial value. The proposed second-order rate-control algorithm is also shown to be feasible and optimal in buffer-allocation efficiency and fairness at the bottleneck. %Z Fri, 16 Oct 1998 15:14:35 GMT %A McDaniel, Patrick %A Jamin, Sugih %T A Scalable Key Distribution Hierarchy %I Computer Science and Engineering, University of Michigan %R CSE-TR-366-98 %D July 7, 1998 %U http://www.eecs.umich.edu/eecs/research/techreports/cse98.html %X As the use of the Internet for electronic commerce, audio and video conferencing, and other applications with sensitive content grows, the need for secure services becomes critical. Central to the success of these services is the support for secure public key distribution. Although there are several existing services available for this purpose, they are not very scalable, either because they depend on a centralized server or rely on ad hoc trust relationships. In this paper, we present and examine a flexible approach to certificate distribution scalable to arbitrarily large networks. We propose a two-level hierarchy where certificates can be independently authenticated by one or more peer authorities, called keyservers. Certificates for end-user and host entities are managed within local domains, called enterprises. By administering certificates close to the source, we reduce the load on the keyservers and the effects of network topology changes. We describe the design of our system and present a preliminary performance analysis based on traces of present-day DNS requests. %Z Fri, 16 Oct 1998 15:15:21 GMT %A Zou, H.
%A Jahanian, Farnam %T Optimization of a Real-Time Primary-Backup Replication Service %I Computer Science and Engineering, University of Michigan %R CSE-TR-367-98 %D July 10, 1998 %U http://www.eecs.umich.edu/eecs/research/techreports/cse98.html %X The primary-backup replication model is one of the commonly adopted approaches to providing fault-tolerant data services. The extension of the primary-backup protocol to the real-time environment, however, imposes the additional constraint of timing predictability, which requires a bounded overhead for managing redundancy. There is a trade-off between reducing system overhead and increasing (temporal) consistency between the primary and backup, and the way to achieve a balance between them is not obvious. In this paper, we explore ways to optimize the scheduling of update messages from primary to backup while maintaining the temporal consistency guarantees of the system. Depending on the purpose of the individual replication service and the environment in which it is running, very different objectives can be sought in the process of optimization. This paper considers the optimization problem from two perspectives: one aimed at minimizing the average temporal distance between the primary and backup, and the other aimed at minimizing the resources used in maintaining a given temporal constraint on the system. Corresponding optimization techniques have been developed for these two diverse objectives and an implementation of such techniques is also presented. The implementation is built on top of the existing RTPB model, which in turn was developed within the x-kernel architecture on the Mach OSF platform running MK (mach kernel) 7.2. Results of an experimental evaluation of the proposed optimization techniques are also discussed. %Z Fri, 16 Oct 1998 15:16:18 GMT %A Labovitz, Craig %A Malan, G.R. %A Jahanian, Farnam %T Origins of Internet Routing Instability %I Computer Science and Engineering, University of Michigan %R CSE-TR-368-98 %D July 17, 1998 %U http://www.eecs.umich.edu/eecs/research/techreports/cse98.html %X This paper examines the network routing messages exchanged between core Internet backbone routers. Internet routing instability, or the rapid fluctuation of network reachability information, is an important problem currently facing the Internet engineering community. High levels of network instability can lead to packet loss, increased network latency and time to convergence. At the extreme, high levels of routing instability have led to the loss of internal connectivity in wide-area, national networks. In an earlier study of inter-domain routing, we described widespread, significant pathological behaviors in the routing information exchanged between backbone service providers at the major U.S. public Internet exchange points. These pathologies included several orders of magnitude more routing updates in the Internet core than anticipated, large numbers of duplicate routing messages, and unexpected frequency components between routing instability events. The work described in this paper extends our earlier analysis by identifying the origins of several of these observed pathological Internet routing behaviors. We show that as a result of specific router vendor software changes suggested by our earlier analysis, the volume of Internet routing updates has decreased by an order of magnitude.
We also describe additional router software changes that can decrease the volume of routing updates exchanged in the Internet core by an additional 30 percent or more. We conclude with a discussion of trends in the evolution of Internet architecture and policy that may lead to a rise in Internet routing instability. %Z Thu, 15 Oct 1998 15:28:46 GMT %A Blass, A. %A Gurevich, Yuri %T Fixed-Choice and Independent-Choice Logics %I Computer Science and Engineering, University of Michigan %R CSE-TR-369-98 %D August 10, 1998 %U http://www.eecs.umich.edu/eecs/research/techreports/cse98.html %X We study extensions of first-order logic with the choice construct (choose x: φ(x)). We prove some results about Hilbert's epsilon operator, but in the main part of the paper we consider the case when all choices are independent. %Z Thu, 15 Oct 1998 15:29:00 GMT %A Lee, J.H. %A Prakash, Atul %T Malleable Shared Workspaces to Support Multiple Usage Paradigms %I Computer Science and Engineering, University of Michigan %R CSE-TR-370-98 %D August 12, 1998 %U http://www.eecs.umich.edu/eecs/research/techreports/cse98.html %X Several recent systems provide a room-based metaphor to represent shared workspaces that require use of multiple collaborative tools. These systems provide users with a fairly static usage paradigm of room-centered collaboration, requiring users to mold their collaborative activities to the paradigm rather than molding the paradigm to fit the requirements of their collaborative activities. In this paper, we propose a powerful and yet simple event-action based model, augmented with access control and multi-user features, that gives room-based systems a high degree of malleability, so that these systems can be adapted to provide support for a variety of collaborative facilities, such as call centers, mailboxes, shared repositories, and role-based collaboration, including facilities that are not necessarily anticipated by system designers. The model can be used by both system developers and end-users to customize a system to meet the requirements of their group tasks. %Z Thu, 15 Oct 1998 15:29:13 GMT %A Huggins, J.K. %A Van Campenhout, David %T Specification and Verification of Pipelining in the ARM2 RISC Microprocessor %I Computer Science and Engineering, University of Michigan %R CSE-TR-371-98 %D August 13, 1998 %U http://www.eecs.umich.edu/eecs/research/techreports/cse98.html %X Gurevich Abstract State Machines (ASMs) provide a sound mathematical basis for the specification and verification of systems. An application of the ASM methodology to the verification of a pipelined microprocessor (an ARM2 implementation) is described. Both the sequential execution model and the final pipelined model are formalized using ASMs. A series of intermediate models are introduced that gradually expose the complications of pipelining. The first intermediate model is proven equivalent to the sequential model in the absence of structural, control, and data hazards. In the following steps, these simplifying assumptions are lifted one by one, and the original proof is refined to establish the equivalence of each intermediate model with the sequential model, leading ultimately to a full proof of equivalence of the sequential and pipelined models. %Z Thu, 15 Oct 1998 15:29:26 GMT %A Bozma, H.I. %A Koditschek, D.E.
%T Assembly as a Noncooperative Game of its Pieces: Analysis of 1D Sphere Assemblies %I Computer Science and Engineering, University of Michigan %R CSE-TR-372-98 %D August 14, 1998 %U http://www.eecs.umich.edu/eecs/research/techreports/cse98.html %X We propose an event-driven algorithm for the control of simple robot assembly problems based on noncooperative game theory. We rigorously examine the simplest setting - three bodies with one degree of freedom - and offer extensive simulations for the 2 DOF extension. The initial analysis and the accompanying simulations suggest that this approach may indeed offer an attractive means of building robust event-driven assembly systems. %Z Thu, 15 Oct 1998 15:29:41 GMT %A Postiff, Matt %A Tyson, Gary %A Mudge, Trevor %T Performance Limits of Trace Caches %I Computer Science and Engineering, University of Michigan %R CSE-TR-373-98 %D September 8, 1998 %U http://www.eecs.umich.edu/eecs/research/techreports/cse98.html %X A growing number of studies have explored the use of trace caches as a mechanism to increase instruction fetch bandwidth. The trace cache is a memory structure that stores statically noncontiguous but dynamically adjacent instructions in contiguous memory locations. When coupled with an aggressive trace or multiple branch predictor, it can fetch multiple basic blocks per cycle using a single-ported cache structure. This paper compares trace cache performance to the theoretical limit of a three-block fetch mechanism equivalent to an idealized 3-ported instruction cache with a perfect alignment network. Several new metrics are defined to formalize analysis of the trace cache. These include fragmentation, duplication, indexability, and efficiency metrics. We show that performance is more limited by branch mispredictions than by the ability to fetch multiple blocks per cycle. As branch prediction improves, high duplication and the resulting low efficiency are shown to be among the reasons that the trace cache does not reach its upper bound. Based on the shortcomings of the trace cache discovered in this paper, we identify some potential future research areas. %Z Thu, 15 Oct 1998 15:29:55 GMT %A Hardwick, J. %A Stout, Quentin %T Optimal Few-Stage Allocation Procedures %I Computer Science and Engineering, University of Michigan %R CSE-TR-374-98 %D September 11, 1998 %U http://www.eecs.umich.edu/eecs/research/techreports/cse98.html %X This paper gives very general algorithms for the design of optimal experiments involving two Bernoulli populations in which sampling is carried out in stages. It is assumed that the outcomes of the previous stage are available before the allocations for the next stage are decided. At each stage, one must decide how many observations to take and how many to sample from each of the alternative populations. Of particular interest are 2- and 3-stage experiments. To illustrate that the algorithms can be used for experiments of useful sample sizes, they are applied to estimation and optimization problems. Results indicate that, for problems of moderate size, published asymptotic analyses do not always represent the true behavior of the optimal stage sizes, and efficiency may be lost if the analytical results are used instead of the true optimal allocation.
Our results also suggest that one might approach large problems by extrapolating optimal solutions for moderate sample sizes, and that approaches of this sort could give design guidelines that are far more explicit (and hopefully closer to optimal) than those obtained through asymptotic analyses alone. The examples also show the ease with which the computational approach can solve problems that present analytical difficulties. This allows one to use models that more accurately represent important characteristics of the problem. It is also shown that in various cases the base algorithms can be modified to incorporate simplifications. In such settings, significant speedups and space reductions can be obtained, permitting the exact solution of even larger problems. %Z Fri, 16 Oct 1998 15:17:22 GMT %A Mehra, A. %A Shaikh, A. %A Abdelzaher, T. %A Wang, Z. %A Shin, Kang G. %T Realizing Services for Guaranteed-QoS Communication on a Microkernel Operating System %I Computer Science and Engineering, University of Michigan %R CSE-TR-375-98 %D September 17, 1998 %U http://www.eecs.umich.edu/eecs/research/techreports/cse98.html %X Provision of end-to-end QoS guarantees on communication necessitates appropriate support in the end systems (i.e., hosts) and network routers that form the communication fabric. Typically, the support is in the form of suitable extensions to the communication subsystem and the underlying operating system for specification and maintenance of QoS guarantees. This paper focuses on the architectural and implementation challenges involved in realizing QoS-sensitive host communication subsystems on contemporary microkernel operating systems with limited real-time support. We motivate and describe the components constituting our integrated service architecture that together ensure QoS-sensitive handling of network traffic at both sending and receiving hosts. We separate the policies from mechanisms in each component, demonstrating a communication framework that can implement alternative QoS models by applying appropriate policies. We also report the results of a detailed execution profile of the system to characterize communication costs for the purposes of admission control. An experimental evaluation in a controlled configuration demonstrates the efficacy with which QoS guarantees are maintained, despite limitations imposed by the underlying operating system. %Z Thu, 15 Oct 1998 15:30:59 GMT %A McDaniel, Patrick %A Jamin, Sugih %T Windowed Key Revocation in Public Key Infrastructures %I Computer Science and Engineering, University of Michigan %R CSE-TR-376-98 %D October 13, 1998 %U http://www.eecs.umich.edu/eecs/research/techreports/cse98.html %X A fundamental problem inhibiting the wide acceptance of a Public Key Infrastructure (PKI) in the Internet is the lack of a mechanism that provides scalable certificate revocation. In this paper, we propose a novel mechanism called Windowed Revocation. In windowed revocation, certificate revocation is announced for short periods in periodic Certificate Revocation Lists (CRLs). Due to the assurances provided by the protocol over which certificates are retrieved, we bound the amount of time that any certificate is cached by users. Thus, we can limit the announcement of revocation to the time during which the certificate may be cached, rather than until its expiration. Because the time during which certificates are announced is short, CRLs are similarly small.
By limiting the size of CRLs, we are able to integrate other mechanisms that increase the scalability of the PKI. One such mechanism is the use of "pushed" CRLs using multicast. We include a proof of the correctness of our approach. %Z Thu, 12 Nov 1998 13:29:04 GMT %A Hardwick, Janis %A Oehmke, Robert %A Stout, Quentin F. %T A Program for Sequential Allocation of Three Bernoulli Populations %I Computer Science and Engineering, University of Michigan %R CSE-TR-377-98 %D October 13, 1998 %U http://www.eecs.umich.edu/eecs/research/techreports/cse98.html %X We describe a program for optimizing and analyzing sequential allocation problems involving three Bernoulli populations. Previous researchers had considered this problem computationally intractable, and we know of no prior exact optimizations for such problems, even for very small sample sizes. Despite this, our program is currently able to solve problems of size 200 or more by using a parallel computer, and problems of size 100 on a workstation. We describe the program and the techniques used to enable it to scale to large sample sizes. As an illustrative example, the program is used to create an adaptive sampling procedure that is the optimal solution to a 3-arm bandit problem. The bandit procedure is then compared to two other allocation procedures along various metrics. We also indicate extensions of the program that enable it to solve a variety of related problems. %Z Thu, 12 Nov 1998 13:31:38 GMT %A Chong, Ronald %T Modeling Dual-Task Performance Improvement: Casting Executive Process Knowledge Acquisition as Strategy Refinement %I Computer Science and Engineering, University of Michigan %R CSE-TR-378-98 %D October 15, 1998 %U http://www.eecs.umich.edu/eecs/research/techreports/cse98.html %X People demonstrate a remarkable ability to perform complex, multiple-task activities in spite of the limitations of our sensory, perceptual, cognitive, and motor systems. A prominent theory that addresses how multiple-task activities are performed is that of the executive process. Some of the functions of the executive process include enforcing task priorities and arbitrating access to limited resources. It has been shown that a time-sharing skill (or executive-process knowledge) is acquired during training on dual-task combinations. This dissertation presents the development of a computational, task-independent framework for modeling the acquisition of the knowledge acquired during training on dual-task combinations - executive process knowledge. On a selected dual-task combination - a continuous tracking task and a discrete two-choice reaction time task - this framework, when given the declarative and procedural representation of the novice task, has produced an expert model whose performance is a good match to empirical reaction time and tracking error data for the task combination. There are three main contributions of this work. First is the development of EPIC-Soar, a symbolic hybrid architecture that possesses a psychologically-motivated learning mechanism and psychologically-plausible perception and motor systems. Second is the identification and classification of executive process knowledge and the taxonomies that result from this analysis.
Third is an acquisition framework that consists of: a novel data structure for representing task strategies; a task-independent procedure for resolving simultaneous access to motor resources and learning new knowledge that avoids such collisions in the future; a second task-independent learning procedure which refines the strategy data structure and creates new procedural knowledge for performing the task; and a collection of guidelines that regulate how and when promotions are applied. %Z Thu, 12 Nov 1998 13:34:26 GMT %A Wray, Robert %T Ensuring Reasoning Consistency in Hierarchical Architectures %I Computer Science and Engineering, University of Michigan %R CSE-TR-379-98 %D November 9, 1998 %U http://www.eecs.umich.edu/eecs/research/techreports/cse98.html %X Agents often dynamically decompose a task into a hierarchy of subtasks. Hierarchical task decomposition reduces the cost of knowledge design in comparison to non-hierarchical knowledge because knowledge is simplified and can be shared across multiple tasks. However, hierarchical decomposition does have limitations. In particular, hierarchical decomposition can lead to inconsistency in agent processing, resulting potentially in irrational behavior. Further, an agent must generate the decomposition hierarchy before reacting to an external stimulus. Thus, decomposition can also reduce agent responsiveness. This thesis describes ways in which the limitations of hierarchical decomposition can be circumvented while maintaining inexpensive knowledge design and efficient agent processing. We introduce Goal-Oriented Heuristic Hierarchical Consistency (GOHHC), which eliminates across-level inconsistency. GOHHC computes logical dependencies in asserted knowledge between goals rather than between individual assertions. Although this goal-oriented heuristic ensures consistency, it does sometimes lead to unnecessary repetition in reasoning and delay in task execution. We show empirically that these drawbacks are inconsequential in execution domains. Thus, GOHHC provides an efficient guarantee of processing consistency in hierarchical architectures. Ensuring consistency also provides an architectural framework for unproblematic knowledge compilation in dynamic domains. Knowledge compilation can be used to cache hierarchical reasoning and thus avoid the delay in reaction necessitated by decomposition. An empirical investigation of compilation in hierarchical agents shows that compilation can lead to improvement in both overall performance and responsiveness, while maintaining the low design cost of hierarchical knowledge. %Z Thu, 12 Nov 1998 13:36:07 GMT %A Lefurgy, Charles %A Mudge, Trevor %T Code Compression for DSP %I Computer Science and Engineering, University of Michigan %R CSE-TR-380-98 %D November 11, 1998 %U http://www.eecs.umich.edu/eecs/research/techreports/cse98.html %X Previous works have proposed adding compression techniques to a variety of architectural styles to reduce instruction memory requirements. It is not immediately clear how these results apply to DSP architectures. DSP instructions are longer and have potentially greater variation, which can decrease the compression ratio. Our results demonstrate that DSP programs do provide sufficient repetition for compression algorithms. We propose a compression method and apply it to the SHARC, a popular DSP architecture. Even using a very simple compression algorithm, it is possible to halve the size of the instruction memory requirements. %Z Mon, 16 Nov 1998 14:42:26 GMT %A Shaikh, A.
%A Rexford, Jennifer %A Shin, Kang G. %T Evaluating the Overheads of Source-Directed Quality-of-Service Routing %I Computer Science and Engineering, University of Michigan %R CSE-TR-381-98 %D December 7, 1998 %U http://www.eecs.umich.edu/eecs/research/techreports/cse98.html %X Quality-of-Service (QoS) routing satisfies application performance requirements and optimizes network resource usage by selecting paths based on connection traffic parameters and link load information. However, effective path-selection schemes require the distribution of link-state information, which can impose a significant burden on the bandwidth and processing resources in the network. We investigate the fundamental trade-off between network overheads and the quality of routing decisions in the context of the source-directed link-state routing protocols proposed for future IP and ATM networks. In contrast to previous work that compares different routing algorithms under specific network configurations, we construct a detailed model of QoS routing that parameterizes the path-selection algorithm, link-cost function, and link-state update policy. Through extensive simulation experiments with several representative network topologies and traffic patterns, we uncover the effects of stale link-state information and random fluctuations in traffic load on the routing and signaling overheads. We then investigate how the inaccuracy of link-state information interacts with the size and connectivity of the underlying topology. Finally, we show that by tuning the coarseness of the link-cost metric to the inaccuracy of underlying link-state information we can reduce the computational complexity of the path-selection algorithm without significantly degrading performance. The paper concludes by summarizing our key results as a list of guidelines for designing efficient quality-of-service routing policies in large backbone networks. %Z Mon, 21 Dec 1998 9:49:13 GMT %A Labovitz, C. %A Ahuja, A. %A Jahanian, F. %T Experimental Study of Internet Stability and Wide-Area Backbone Failures %I Computer Science and Engineering, University of Michigan %R CSE-TR-382-98 %D December 16, 1998 %U http://www.eecs.umich.edu/eecs/research/techreports/cse98.html %X In this paper, we describe an experimental study of Internet stability and the origins of failure in Internet protocol backbones. The stability of end-to-end Internet paths depends both on the underlying telecommunication switching system and on the higher-level software and hardware components specific to the Internet's packet-switched forwarding and routing architecture. Although a number of earlier studies have examined failures in the public telecommunication system, little attention has been given to the characterization of Internet stability. Our paper analyzes Internet failures from three different perspectives. We first examine several recent major Internet failures and their probable origins. These empirical observations illustrate the complexity of the Internet and show that unlike commercial transaction systems, the interactions of the underlying components of the Internet are poorly understood. Next, our examination focuses on the stability of paths between Internet Service Providers. Our analysis is based on the experimental instrumentation of key portions of the Internet infrastructure. Specifically, we logged all the routing control traffic at five of the largest U.S. Internet exchange points over a three-year period.
This study of network reachability information found unexpectedly high levels of path fluctuation and an aggregate low mean time between failures for individual Internet paths. These results point to a high level of instability in the global Internet backbone. While our study of the Internet backbone identifies major trends in the level of path instability between different service providers, these results do not characterize failures inside the network of a service provider. The final portion of our paper focuses on a case study of the network failures observed in a large regional Internet backbone. This examination of the internal stability of a network includes twelve months of operational failure logs and a review of the internal routing communication data collected between regional backbone routers. We characterize the type and frequency of failures in twenty categories, and describe the failure properties of the regional backbone as a whole. %Z Wed, 13 Jan 1999 10:38:57 GMT %A Weinstein, Peter C. %A Birmingham, William P. %T Agent Communication with Differentiated Ontologies: eight new measures of description compatibility %I Computer Science and Engineering, University of Michigan %R CSE-TR-383-99 %D January 7, 1999 %U http://www.eecs.umich.edu/eecs/research/techreports/cse99.html %X We propose an approach to achieve appropriate exchange of services and data in distributed systems subject to semantic heterogeneity. We assume differentiated ontologies: that terms have formal definitions as concepts related to other concepts, that local concepts inherit from concepts that are shared, and that most or all primitives are shared. We then develop measures of description compatibility using the structure of the source and target definitions. We evaluate these measures by generating description-logic ontologies in artificial worlds. In our simulations, the "meaning" of a concept is its denotation in a finite universe of instances. The accuracy of the description-compatibility measures can thus be judged by their success in predicting the overlap of concept denotations. Description compatibility can be used to guide agent search for services across communities that subscribe to differentiated ontologies. %Z Thu, 4 Feb 1999 8:33:01 GMT %A Wu, W-B. %A Ravishankar, C. %T The Performance of Tuple Difference Coding for Compressing Databases and Data Warehouses %I Computer Science and Engineering, University of Michigan %R CSE-TR-384-99 %D January 26, 1999 %U http://www.eecs.umich.edu/eecs/research/techreports/cse99.html %X Databases and data warehouses have grown dramatically in size in recent years, and we are already seeing data warehouses that are tens of terabytes in size. Compression is important in such systems because it reduces space requirements as well as response times for queries, which are typically I/O-bound. Conventional compression methods tend to compress and decompress objects in their entirety, and are generally unsuitable for databases which require selective access to points in the space of tuples they represent. The database compression method called Tuple-Difference Coding (TDC) has been shown to meet the requirements for database compression, and to achieve high compression ratios in practice. The factors affecting the performance of TDC are known, but their effects are not well understood. This paper presents a theoretical analysis of the performance of TDC, and verifies these results through simulations.
It presents analytical expressions for estimating the compression ratios when database tuples are composed of one or more attributes drawn from various distributions. It also studies the effects of attribute domain reordering on the compression ratio. Our simulations show excellent agreement with theory. %Z Thu, 4 Feb 1999 8:35:11 GMT %A Crestana, Viviane %A Soparkar, Nandit %T Mining Decentralized Data Repositories %I Computer Science and Engineering, University of Michigan %R CSE-TR-385-99 %D February 1, 1999 %U http://www.eecs.umich.edu/eecs/research/techreports/cse99.html %X Current technology for mining data typically applies to data stored centrally (i.e., in a single, homogeneous repository with central administration and a single schema). For instance, association rules algorithms assume that the data of interest is available at one location in a single table, or that prior to mining, the data will be warehoused centrally. However, it is important to study the mining of decentralized data (i.e., consisting of several tables, perhaps obtained via normalization or partitioning and allocation, stored in several repositories with possibly separate administration and schema). Real-life database design, management, and performance aspects suggest these considerations. While there have been a few extensions to mining algorithms for such data, the algorithms developed have largely been aimed at parallel processing (as opposed to the specific issues relating to decentralized data). In this paper, we examine the issues associated with mining decentralized data. We motivate the need for developing decentralized techniques, and identify the problems that must be addressed. We describe the mining of association rules for the case of a star schema wherein several smaller dimension tables are associated by foreign keys to a central fact table, and present an algorithm that adapts standard algorithms to work efficiently with such data. In contrast to approaches where the data would have been joined first to form a single table, we exploit the foreign key relationships to develop decentralized algorithms that execute concurrently on the separate tables, and thereafter merge the results. We provide analyses to assess our techniques, and we present empirical studies for particular cases to validate the performance. %Z Thu, 4 Feb 1999 8:37:08 GMT %A Tabe, Ted %A Stout, Quentin F. %T The Use of the MPI Communication Library in the NAS Parallel Benchmarks %I Computer Science and Engineering, University of Michigan %R CSE-TR-386-99 %D February 17, 1999 %U http://www.eecs.umich.edu/eecs/research/techreports/cse99.html %X The statistical analysis of traces taken from the NAS Parallel Benchmarks can tell one much about the type of network traffic that can be expected from scientific applications run on distributed-memory parallel computers. For instance, such applications utilize relatively few communication library functions, the lengths of their messages vary widely, they use many more short messages than long ones, and within a single application the messages tend to follow relatively simple patterns. Information such as this can be used by hardware and software designers to optimize their systems for the highest possible performance. %Z Fri, 12 Mar 1999 10:54:50 GMT %A Feng, W. %A Kandlur, D. %A Saha, D. %A Shin, Kang G.
%T BLUE: A New Class of Active Queue Management Algorithms %I Computer Science and Engineering, University of Michigan %R CSE-TR-387-99 %D March 15, 1999 %U http://www.eecs.umich.edu/eecs/research/techreports/cse99.html %X In order to stem the increasing packet loss rates caused by an exponential increase in network traffic, the IETF is considering the deployment of active queue management techniques such as RED [11]. While active queue management can potentially reduce packet loss rates in the Internet, this paper shows that current techniques are ineffective in preventing high loss rates. The inherent problem with these queue management algorithms is that they all use queue lengths as the indicator of the severity of congestion. In light of this observation, a fundamentally different active queue management algorithm called BLUE is proposed. BLUE uses packet loss and link idle events to manage congestion. Using simulation and controlled experiments, BLUE is shown to perform significantly better than RED both in terms of packet loss rates and buffer size requirements in the network. As an extension to BLUE, a novel technique for enforcing fairness among a large number of flows is described. In particular, this paper proposes and evaluates Stochastic Fair BLUE (SFB), a queue management algorithm which can identify and rate-limit non-responsive flows using a very small amount of state information. %Z Tue, 13 Apr 1999 11:36:48 GMT %A Kweon, Seok-Kyu %A Shin, Kang G. %T Distributed QoS Routing with Bounded Flooding for Real-Time Applications %I Computer Science and Engineering, University of Michigan %R CSE-TR-388-99 %D March 15, 1999 %U http://www.eecs.umich.edu/eecs/research/techreports/cse99.html %X Quality-of-Service (QoS) routing is indispensable to real-time applications on integrated services packet-switching networks. However, existing QoS-routing schemes based on either request flooding or link-state dissemination are not efficient in terms of their message overhead and/or operational cost. This paper proposes a QoS-routing algorithm with bounded flooding which not only dramatically reduces message overhead and operational cost, but also provides good performance. This significant overhead reduction is achieved by limiting the search area within which connection-request messages are flooded. Since our scheme is based on limited flooding, it requires neither on-demand path calculation nor link-state dissemination, both of which are known to be very expensive. In order to limit the search area, we employ a distance table at each node that stores the hop counts of the shortest paths to all the nodes reachable via its outgoing links. Unlike link-state routing, the proposed scheme only needs network-topology information, which changes much less frequently than link-load states. Extensive simulation results demonstrate the effectiveness of the proposed scheme in terms of performance, operational cost, and message overhead. In particular, under ordinary operating conditions in which the network's physical topology does not change frequently, it is shown to be much better than all the other existing QoS-routing schemes in terms of message overhead and operational cost, while providing reasonably good performance.
%Z Tue, 13 Apr 1999 11:38:58 GMT %A Dundas, James David %T Improving Processor Performance by Dynamically Pre-Processing the Instruction Stream %I Computer Science and Engineering, University of Michigan %R CSE-TR-389-99 %D April 8, 1999 %U http://www.eecs.umich.edu/eecs/research/techreports/cse99.html %X The exponentially increasing gap between processor and off-chip memory speeds, as measured in processor cycles, is rapidly turning memory latency into a major processor performance bottleneck. Traditional solutions, such as employing multiple levels of caches, are expensive and do not work well with some applications. We evaluate a technique, called runahead pre-processing, that can significantly improve processor performance. The basic idea behind runahead is to use the processor pipeline to pre-process instructions during cache miss cycles, instead of stalling. The pre-processed instructions are used to generate highly accurate instruction and data stream prefetches, while all of the pre-processed instruction results are discarded after the cache miss has been serviced: this allows us to achieve a form of very aggressive speculation with a simple in-order pipeline. The principal hardware cost is a means of checkpointing the sequential state of the register file and memory hierarchy while instructions are pre-processed. As we discard all pre-processed instruction results, the checkpointing can be accomplished with a modest amount of hardware. The instruction and data stream prefetches generated during runahead episodes led to a significant performance improvement for all the benchmarks we examined. We found that runahead typically led to about a 30% reduction in CPI for the four SPEC95 integer benchmarks that we simulated, while runahead was able to reduce CPI by 77% for the STREAM benchmark. This is for a five-stage pipeline with two levels of split instruction and data caches: 8KB each for L1, and 1MB each for L2. A significant result is that when the latency to off-chip memory increases, or if the caching performance for a particular benchmark is poor, runahead is especially effective because the processor has more opportunities in which to pre-process instructions. Finally, runahead appears particularly well suited for use with high clock-rate in-order processors that employ relatively inexpensive memory hierarchies. %Z Tue, 13 Apr 1999 11:40:56 GMT %A Riepe, Michael Anthony %T Transistor Level Micro Placement and Routing for Two-Dimensional Digital VLSI Cell Synthesis %I Computer Science and Engineering, University of Michigan %R CSE-TR-390-99 %D April 8, 1999 %U http://www.eecs.umich.edu/eecs/research/techreports/cse99.html %X The automated synthesis of mask geometry for VLSI leaf cells, referred to as the cell synthesis problem, is an important component of any structured custom integrated circuit design environment. Traditional approaches based on the classic functional cell style of Uehara & VanCleemput pose this problem as a straightforward one-dimensional graph optimization problem for which optimal solution methods are known. However, these approaches are only directly applicable to static CMOS circuits and they break down when faced with more exotic logic styles. There is an increasing need in modern VLSI designs for circuits implemented in high-performance logic families such as Cascode Voltage Switch Logic (CVSL), Pass Transistor Logic (PTL), and domino CMOS.
Circuits implemented in these non-dual ratioed logic families can be highly irregular, with complex geometry sharing and non-trivial routing. Such cells require a relatively unconstrained two-dimensional full-custom layout style which current methods are unable to synthesize. In this work we define the synthesis of complex two-dimensional digital cells as a new problem which we call transistor-level micro-placement and routing. To address this problem we develop a complete end-to-end methodology which is implemented in a prototype tool named TEMPO. A series of experiments on a new set of benchmark circuits verifies the effectiveness of our approach. Our methodology is centered on techniques for the efficient modeling and optimization of geometry sharing, which is supported at two different levels. Chains of diffusion-merged transistors are formed explicitly and their ordering optimized for area and global routing. In addition, more arbitrary merged structures are supported by allowing electrically compatible adjacent transistors to overlap during placement. The synthesis flow in TEMPO begins with a static transistor chain formation step. These chains are broken at the diffusion breaks and the resulting sub-chains are passed to the placement step. During placement, an ordering is found for each chain and a location and orientation is assigned to each sub-chain. Different chain orderings affect the placement by changing the relative sizes of the sub-chains and their routing contribution. We conclude with a detailed routing step and an optional compaction step. %Z Tue, 13 Apr 1999 11:58:11 GMT %A Galler, Bernard A. %T Report on the Study of University Policies on Intellectual Property %I Computer Science and Engineering, University of Michigan %R CSE-TR-391-99 %D April 12, 1999 %U http://www.eecs.umich.edu/eecs/research/techreports/cse99.html %X The purpose of the study was to identify issues which need to be considered by universities in establishing or upgrading policies regarding rights to intellectual property created by various members of the university community. A reading of a number of such policies reveals that there are many aspects in common, such as the treatment of intellectual property created while the work was funded by external sources. The main differences appear in the treatment of non-supported, or internally supported, intellectual property. This study looks at the policies developed by various universities with regard to software, courseware, and related publications, and the issues raised by these policies. %Z Tue, 13 Apr 1999 11:59:45 GMT %A Shaikh, A. %A Rexford, J. %A Shin, Kang G. %T Load-Sensitive Routing of Long-Lived IP Flows %I Computer Science and Engineering, University of Michigan %R CSE-TR-392-99 %D April 15, 1999 %U http://www.eecs.umich.edu/eecs/research/techreports/cse99.html %X Internet service providers face a daunting challenge in provisioning network resources, due to the rapid growth of the Internet and wide fluctuations in the underlying traffic patterns. The ability of dynamic routing to circumvent congested links and improve application performance makes it a valuable traffic engineering tool. However, deployment of load-sensitive routing is hampered by the overheads imposed by link-state update propagation, path selection, and signalling.
Under reasonable protocol and computational overheads, traditional approaches to load-sensitive routing of IP traffic are ineffective, and can introduce significant route flapping, since paths are selected based on out-of-date link-state information. Although stability is improved by performing load-sensitive routing at the flow level, flapping still occurs, because most IP flows have a short duration relative to the desired frequency of link-state updates. To address the efficiency and stability challenges of load-sensitive routing, we introduce a new hybrid approach that performs dynamic routing of long-lived flows, while forwarding short-lived flows on static preprovisioned paths. By relating the detection of long-lived flows to the timescale of link-state update messages in the routing protocol, route stability is considerably improved. Through simulation experiments using a one-week ISP packet trace, we show that our hybrid approach significantly outperforms traditional static and dynamic routing schemes, by reacting to fluctuations in network load without introducing route flapping. %Z Fri, 16 Apr 1999 10:41:15 GMT %A Bernacchia, Giuseppe %A Papaefthymiou, Marios %T Analytical Macro-Modeling for High-Level Power Estimation %I Computer Science and Engineering, University of Michigan %R CSE-TR-393-99 %D April 20, 1999 %U http://www.eecs.umich.edu/eecs/research/techreports/cse99.html %X In this paper, we present a new analytical macro-modeling technique for high-level power estimation. Our technique is based on a parameterizable analytical model that relies exclusively on statistical information about the circuit's primary inputs. During estimation, the statistics of the required metrics are extracted from the input stream and a power estimate is obtained by evaluating a model function that has been characterized in advance. Thus, our model yields power estimates within seconds, since it does not rely on the statistics of the circuit's primary outputs and, consequently, does not perform any simulation during estimation. Moreover, our model achieves significantly better accuracy than previous macro-modeling approaches, because it takes into account both spatial and temporal correlations in the input stream. In experiments with ISCAS-85 combinational circuits, the average absolute relative error of our power macro-modeling technique was at most 3%. For all but one circuit in our test suite, the worst-case error was at most 19.5%. In comparison with power estimates for a ripple-carry adder family that were obtained using Spice, the average absolute and worst-case errors of our model's estimates were 5.1% and 19.8%, respectively. In addition to power dissipation, our macro-modeling technique can be used to estimate the statistics of a circuit's primary outputs. Our experiments with ISCAS-85 circuits yield very low average errors. Thus, our technique is suitable for fast and accurate power estimation in core-based systems with pre-characterized blocks. Once the metrics of the primary inputs are known, the power dissipation of the entire system can be estimated by simply propagating this information through the blocks using their corresponding model functions. %Z Wed, 21 Apr 1999 10:33:12 GMT %A Patel, Sanjay J.
%T Trace Cache Design for Wide-Issue Superscalar Processors %I Computer Science and Engineering, University of Michigan %R CSE-TR-394-99 %D April 28, 1999 %U http://www.eecs.umich.edu/eecs/research/techreports/cse99.html %X To maximize the performance of a wide-issue superscalar processor, the fetch mechanism must be capable of delivering at least the same instruction bandwidth as the execution mechanism is capable of consuming. Fetch mechanisms consisting of a simple instruction cache are limited by the difficulty of fetching a branch and its taken target in a single cycle. Such fetch mechanisms will not suffice for processors capable of executing multiple basic blocks' worth of instructions. The trace cache is proposed to deal with the fetch bandwidth lost to branches. The trace cache is a structure which overcomes this partial fetch problem by storing logically contiguous instructions (instructions which are adjacent in the instruction stream) in physically contiguous storage. In this manner, the trace cache is able to deliver multiple non-contiguous blocks each cycle. This dissertation contains a description of the trace cache mechanism for a 16-wide issue processor, along with an evaluation of basic parameters of this mechanism, such as relative size and associativity. The main contributions of this dissertation are a series of trace cache enhancements which boost instruction fetch bandwidth by 34% and overall performance by 14% over an aggressive instruction cache. Also included is an analysis of two important performance limitations of the trace cache: branch resolution time and instruction duplication. %Z Tue, 15 Jun 1999 9:29:45 GMT
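%X As an illustration of the indexing idea described in the CSE-TR-394-99 abstract above, and not the design evaluated in that report, the following minimal Python sketch keys each trace by its starting fetch address plus the predicted branch directions, so that one lookup can deliver instructions from several non-contiguous basic blocks in a single fetch cycle. The entry count, trace length, replacement policy, and all identifiers are illustrative assumptions.

    # Minimal, illustrative trace-cache sketch (hypothetical parameters).
    # A trace is keyed by (start PC, packed branch-direction bits) and holds a
    # bounded number of instruction addresses that may span several basic blocks.
    class TraceCache:
        def __init__(self, num_entries=512, max_trace_len=16):
            self.num_entries = num_entries
            self.max_trace_len = max_trace_len
            self.entries = {}  # (start_pc, branch_bits) -> list of instruction PCs

        def _key(self, start_pc, branch_dirs):
            # Pack the predicted branch directions (e.g. [True, False]) into bits.
            bits = 0
            for taken in branch_dirs:
                bits = (bits << 1) | int(taken)
            return (start_pc, bits)

        def lookup(self, start_pc, branch_dirs):
            # A hit delivers the whole trace, i.e. several possibly
            # non-contiguous basic blocks, in one fetch cycle.
            return self.entries.get(self._key(start_pc, branch_dirs))

        def fill(self, start_pc, branch_dirs, retired_pcs):
            # A fill unit would build the trace from the retired (dynamic)
            # instruction stream; here it is simply truncated to the trace limit.
            if len(self.entries) >= self.num_entries:
                self.entries.pop(next(iter(self.entries)))  # naive replacement
            self.entries[self._key(start_pc, branch_dirs)] = retired_pcs[:self.max_trace_len]

    # Example: a taken branch at 0x108 makes 0x200 adjacent in the dynamic stream.
    tc = TraceCache()
    tc.fill(0x100, [True], [0x100, 0x104, 0x108, 0x200, 0x204])
    print(tc.lookup(0x100, [True]))   # hit: five instructions fetched together
    print(tc.lookup(0x100, [False]))  # miss: different predicted path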