%A Wu, Gwobaw %A Prakash, Atul %T Distributed Object Replication Support for Collaborative Systems %I Computer Science and Engineering, University of Michigan %R CSE-TR-276-96 %D December 28, 1995 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-276-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-276-96.ps.Z %X The ability to share synchronized views of interactions with an application is critical to supporting synchronous collaboration over a network. To keep bandwidth requirements and interactive response time low in wide-area networks, we have designed a toolkit, called DistView, that uses an {\em object-level replication} scheme, in which the application and interface objects that need to be shared among users are replicated at the various user sites. Existing schemes for maintaining consistency among replicated active objects are generally designed for database applications and do not necessarily work in collaborative applications, where response times or latencies are generally more critical than throughput or persistence. In this paper, we describe an object replication service that is designed to provide low interactive response times and low communication latencies in collaborative environments. We describe the algorithms used and the architecture of the system, and present an analysis of its performance. %Z Thu, 15 Oct 1998 14:57:07 GMT %A Abandah, Gheith %A Davidson, Edward %T Characterizing Shared Memory and Communication Performance: A Case Study of the Convex SPP-1000 %I Computer Science and Engineering, University of Michigan %R CSE-TR-277-96 %D January 9, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-277-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-277-96.ps.Z %X The objective of this paper is to develop models that characterize the memory and communication performance of shared-memory multiprocessors, which is crucial for developing efficient parallel applications for them. The Convex SPP-1000, a modern scalable distributed-memory multiprocessor that supports the shared-memory programming paradigm, is used throughout as a case study. The paper evaluates and models four aspects of SPP-1000 performance: scheduling, local-memory, shared-memory, and synchronization. Our evaluation and modeling are intended to supply useful information for application and compiler development. %Z Thu, 15 Oct 1998 14:57:39 GMT %A Jaeger, Trent %A Prakash, Atul %T Using Simulation and Performance Improvement Knowledge for Redesigning Business Processes %I Computer Science and Engineering, University of Michigan %R CSE-TR-278-96 %D January 12, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-278-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-278-96.ps.Z %X Recent business improvement methodologies, such as Business Process Reengineering, advocate the redesign of business processes as the primary method to obtain improvement in business performance. Current approaches to process redesign assist the reengineer in identifying problems in meeting business goals, but provide little support in determining how to resolve those problems to meet the desired goals. The number of feasible changes to the business that can potentially resolve the problems is large, and it is hard to predict the effect that each change will have on overall business performance. We present a process redesign methodology that helps reengineers efficiently generate changes that can help meet business performance goals.
Based on the methodology, we describe a system that includes: (1) an executable business model for specifying business processes and business performance goals; (2) a simulator that executes these processes to determine actual performance and to help reengineers identify problems in meeting the specified goals; and (3) a performance improvement knowledge base that suggests changes to resolve these problems and meet the specified goals. We demonstrate the use of the process redesign methodology and the system on two non-trivial process redesign examples, and report our experience. %Z Thu, 15 Oct 1998 14:58:02 GMT %A McClennen, Michael %A Sechrest, Stuart %T Getting More Information into File Names %I Computer Science and Engineering, University of Michigan %R CSE-TR-279-96 %D January 18, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-279-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-279-96.ps.Z %X Hierarchical naming, while deeply embedded in our conception of file systems, is a rather weak tool for storing information about files and their relationships. A consequence is that users of today's file systems frequently have trouble locating files. We describe a system in which a standard directory tree is extended by allowing names to contain auxiliary components representing descriptive attributes rather than directory names. This system allows files to be characterized more extensively, and lets users choose among multiple organizational structures for their stored information. A prototype has been implemented by means of a new vnode layer under SunOS 4.1.3. %Z Thu, 15 Oct 1998 14:58:26 GMT %A Mehra, Ashish %A Indiresan, Atri %A Shin, Kang %T Design and Evaluation of a QoS-Sensitive Communication Subsystem Architecture %I Computer Science and Engineering, University of Michigan %R CSE-TR-280-96 %D February 9, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-280-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-280-96.ps.Z %X There are a growing number of real-time applications (e.g., real-time controls and audio/video conferencing) that require certain quality of service (QoS) guarantees from the underlying communication subsystem. The communication subsystem (host as well as network) must support real-time communication services that can be used to provide the required QoS to these applications, while providing reasonably good performance for best-effort traffic. In this paper we present and evaluate a QoS-sensitive communication subsystem architecture for end hosts that provides real-time communication support for generic network hardware. This architecture provides various services for managing communication resources for guaranteed-QoS (real-time) connections, such as admission control, traffic enforcement, buffer management, and CPU \& link scheduling. The design of the architecture is based on three key goals: maintenance of QoS-guarantees on a per-connection basis, overload protection between established connections, and fairness in delivered performance to best-effort traffic. Using this architecture, we implement real-time channels, a paradigm for real-time communication services in packet-switched networks. The proposed architecture features a process-per-channel model for protocol processing that associates a channel handler with each established channel.
The model employed for handler execution is one of ``cooperative'' preemption, where an executing handler yields the CPU to a waiting higher-priority handler at well-defined preemption points. The architecture provides several configurable policies for CPU scheduling and overload protection. We evaluate the implementation to demonstrate that this architecture maintains QoS guarantees while adhering to the stated design goals. The evaluation also demonstrates convincingly the need for specific features and policies provided in the architecture. %K Key Words - Real-time communication, QoS-sensitive protocol processing, traffic enforcement, CPU and link scheduling %Z Thu, 15 Oct 1998 14:58:52 GMT %A Mehra, Ashish %A Indiresan, Atri %A Shin, Kang %T Resource Management for Real-Time Communication: Making Theory Meet Practice %I Computer Science and Engineering, University of Michigan %R CSE-TR-281-96 %D February 9, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-281-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-281-96.ps.Z %X A growing number of real-time applications (e.g., real-time controls, and audio/video conferencing) require certain quality-of-service (QoS) from the underlying communication subsystem. Real-time communication services are needed in the communication subsystem (host as well as network) to provide the required QoS to these applications while providing reasonably good performance for best-effort traffic. At the host, real-time communication necessitates that shared host resources such as CPU and link bandwidth be consumed according to the relative requirements of the active connections. This requires appropriate resource management policies for admission control and scheduling that are typically formulated assuming idealized resource models. However, when implementing these policies one must account for the performance characteristics of the hardware and software components involved, which could deviate significantly from those of the idealized resource models. In this paper, we focus on bridging the gap between theory and practice in the management of host CPU and link resources for real-time communication. Using our implementation of real-time channels, a paradigm for real-time communication in packet-switched networks, we illustrate the tradeoff between resource capacity and channel admissibility, which determines the number and type of real-time channels that can be accepted for service and the performance delivered to best-effort traffic. We demonstrate that this tradeoff is affected significantly by the choice of implementation paradigms and the grain at which CPU and link resources can be multiplexed amongst active channels. In order to account for this effect, we extend the admission control procedure for real-time channels originally proposed using idealized resource models. Our results show that practical considerations significantly reduce channel admissibility compared to idealized resource models. Further, the optimum choice of the multiplexing grain depends on several factors such as the overheads of resource preemption, the relationship between CPU and link bandwidth, and the manner in which allocation of link bandwidth interacts with allocation of CPU bandwidth. 
%K Key Words - Real-time communication, resource management, QoS-sensitive protocol processing, CPU and link scheduling %Z Thu, 15 Oct 1998 15:00:50 GMT %A Mudge, Trevor %A Chen, I-Cheng %A Coffey, John %T Limits to Branch Prediction %I Computer Science and Engineering, University of Michigan %R CSE-TR-282-96 %D February 2, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-282-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-282-96.ps.Z %X Branch prediction is an important mechanism in modern microprocessor design. The focus of research in this area has been on designing new branch prediction schemes. In contrast, very few studies address the inherent limit of predictability of programs themselves. Programs have an inherent limit of predictability due to the randomness of input data. Knowing the limit helps us to evaluate how good a prediction scheme is and how much we can expect to improve its accuracy. In this paper we propose two complementary approaches to estimating the limits of predictability: exact analysis of the program and the use of a universal compression/prediction algorithm, prediction by partial matching (PPM), that has been very successful in the field of data and image compression. We review the algorithmic basis for both some common branch predictors and PPM and show that two-level branch prediction, the best method currently in use, is a simplified version of PPM. To illustrate exact analysis, we use Quicksort to calibrate the performance of various branch predictors. With other programs, too complicated to analyze exactly, we use PPM as a measure of inherent predictability. Our initial results show that PPM can approach the theoretical limit in an analyzable program and perform just as well as the best existing branch predictors for SPECInt92. This suggests that universal compression/prediction algorithms, such as PPM, can be used to estimate the limits to branch prediction for a particular workload of programs. %Z Thu, 15 Oct 1998 15:01:14 GMT %A Sechrest, Stuart %A Lee, Chih-Chieh %A Mudge, Trevor %T Correlation and Aliasing in Dynamic Branch Predictors - Full Simulation Results %I Computer Science and Engineering, University of Michigan %R CSE-TR-283-96 %D May 22, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-283-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-283-96.ps.Z %X The accompanying paper, "Correlation and Aliasing in Dynamic Branch Predictors," presents simulation results for a variety of branch prediction schemes using traces of the integer programs of SPEC92 and the IBS-Ultrix benchmarks. However, due to space limitations and the need for clarity, only a subset of our simulation results was included. This technical report presents the full results for all of the benchmarks for each prediction scheme studied. %Z Thu, 15 Oct 1998 15:01:35 GMT %A Manohar, Nelson R. %A Prakash, Atul %T Tool Coordination and Media Integration on Asynchronously-Shared Computer-Supported Workspaces %I Computer Science and Engineering, University of Michigan %R CSE-TR-284-96 %D February 27, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-284-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-284-96.ps.Z %X This paper presents a novel and flexible architecture to support the collaboration paradigm for asynchronous sharing of computer-supported workspaces, or simply replayable workspaces. Such a workspace, however, may be composed of multiple interacting tools.
Through the capture, re-execution, and manipulation of a session with an interactive computer-supported workspace, it is possible to reuse the process of how a task was performed (i.e., valuable collaborative information). Under our paradigm, a user session with a computer-supported workspace is encapsulated into a data artifact, referred to as a {\em session object}. The session object can then be manipulated through {\sc Vcr}-like controls. The session object is an active object, composed of heterogeneous media streams that represent input sequences to various tools of the workspace. Our computer-supported workspace is made up of modular components that provide transparent management of their heterogeneous media. Such media may have or lack time-variance, periodicity, and statefulness. The architecture provides fine-grained integration of such heterogeneous media, while providing flexible support for coordinating the various tools found in a computer-supported workspace. %Z Thu, 15 Oct 1998 15:01:55 GMT %A VanCampenhout, David %A Mudge, Trevor %T Faster Static Timing Analysis via Bus Compression %I Computer Science and Engineering, University of Michigan %R CSE-TR-285-96 %D February 27, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-285-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-285-96.ps.Z %X Static timing analysis is used extensively in the design of high-performance processors. In this paper we present a method which transforms the circuit graph used by timing analysis algorithms into a smaller one, and hence results in faster timing analysis. This transformation, called bus compression, may lead to a more conservative analysis. However, experiments on several large designs yielded the same results with or without bus compression. %Z Thu, 15 Oct 1998 15:02:15 GMT %A Chen, Peter %A Ng, Wee-Teck %A Rajamani, Gurushankar %A Aycock, Christopher %T The Rio File Cache: Surviving Operating System Crashes %I Computer Science and Engineering, University of Michigan %R CSE-TR-286-96 %D March 9, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-286-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-286-96.ps.Z %X One of the fundamental limits to high-performance, high-reliability file systems is memory's vulnerability to system crashes. Because memory is viewed as unsafe, systems periodically write data back to disk. The extra disk traffic lowers performance, and the delay period before data is safe lowers reliability. The goal of the Rio (RAM I/O) file cache is to make ordinary main memory safe for persistent storage by enabling memory to survive operating system crashes. Reliable memory enables a system to achieve the best of both worlds: reliability equivalent to a write-through file cache, where every write is instantly safe, and performance equivalent to a pure write-back cache, with no reliability-induced writes to disk. To achieve reliability, we protect memory during a crash and restore it during a reboot (a "warm" reboot). Extensive crash tests show that even without protection, warm reboot enables memory to achieve reliability close to that of a write-through file system while performing 20 times faster. Rio makes all writes immediately permanent, yet performs faster than systems that lose 30 seconds of data on a crash: 35% faster than a standard delayed-write file system and 8% faster than a system that delays both data and metadata.
For applications that demand even higher levels of reliability, Rio's optional protection mechanism makes memory even safer than a write-through file system while lowering performance 20% compared to a pure write-back system. %Z Thu, 15 Oct 1998 15:02:33 GMT %A Feng, Wu-chang %A Shin, Kang G. %T Impact of Selection Functions on Routing Algorithm Performance in Multicomputer Networks %I Computer Science and Engineering, University of Michigan %R CSE-TR-287-96 %D March 19, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-287-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-287-96.ps.Z %X Maximizing overall performance in multicomputers requires matching application communication characteristics with a suitable routing scheme. However, since the communication demands of emerging applications vary significantly, it is hard for a single routing algorithm to perform well under all workloads. In order to study the complex dependencies between routing policies and communication workloads, we have performed a set of multi-factor experiments to better characterize routing performance. These experiments show that in addition to adaptivity, the selection functions used to order the candidate links greatly affect network performance under various traffic patterns. By supporting flexible routing, the network can tune its routing policies to application communication characteristics in order to improve performance. %Z Thu, 15 Oct 1998 15:02:51 GMT %A Rothleder, Neal J. %T Induction: Inference and Process %I Computer Science and Engineering, University of Michigan %R CSE-TR-288-96 %D March 20, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-288-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-288-96.ps.Z %X Despite the variety of discussions on induction, there is still no explicit definition of the concept. However, since most treatments appear to, at least implicitly, acknowledge it, it seems clear that such a definition exists. A key obstacle seems to be the confusion between what induction is, and how induction should be performed. This research proposes the idea that induction should be viewed in two distinct, but related, ways. The more general, abstract way gets to the heart of this concept definition, and treats induction purely as an inference. In this inference view, there is no concern for issues of efficiency, effectiveness, and other qualities of how induction might be performed. The second way to view induction is as the process which implements the inference. This paper establishes a definition of induction in the inference view. The subtleties and implications of this definition are discussed, and are also compared with key issues and published opinions of induction. %Z Thu, 15 Oct 1998 15:03:09 GMT %A D'Ambrosio, Joseph %T ISMAUT Tools: A Software Tool Kit for Rational Tradeoffs Among Conflicting Objectives %I Computer Science and Engineering, University of Michigan %R CSE-TR-289-96 %D March 25, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-289-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-289-96.ps.Z %X One primary task of engineering design is resolving the conflicting objectives that are inherent in the design process. As a result, design automation tools must provide designers with the ability to represent and apply the preferential knowledge required to resolve conflicting objectives.
This paper describes a decision-theoretic approach for resolving conflicting objectives and a software tool kit, ISMAUT Tools, that enables access to this approach. ISMAUT Tools includes a library of primitive routines that can be linked into other applications, and a text-based interface providing stand-alone decision-making capabilities. %Z Thu, 15 Oct 1998 15:03:25 GMT %A Hall, Robert W. %A Mathur, Amit %A Jahanian, Farnam %A Prakash, Atul %A Rasmussen, Craig %T Corona: A Communication Service for Scalable, Reliable Group Collaboration Systems (Preliminary Design) %I Computer Science and Engineering, University of Michigan %R CSE-TR-290-96 %D March 25, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-290-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-290-96.ps.Z %X We consider the problem of providing communication protocol support for large-scale group collaboration systems for use in environments such as the Internet, which are subject to packet loss, wide variations in end-to-end delays, and transient partitions. We identify a set of requirements that are critical for the design of such group collaboration systems. We present a communication service, Corona, that attempts to meet these requirements. Corona supports two communication paradigms: the publish-subscribe paradigm and the peer group paradigm. In the publish-subscribe paradigm, one or more data sources or publishers send data to multiple subscribers. This paradigm is characterized by the anonymous nature of communication, where a publisher is aware of the set of subscribers, but the subscribers are unaware of each other and are only aware of the publisher that they are receiving data from. In the peer group paradigm of communication, on the other hand, all the group members are aware of each other, and can send and receive data directly to and from each other. We present the interfaces provided by Corona to group collaboration applications. We describe the semantics of each interface method call and show how they can meet a variety of group collaboration requirements. We describe the protocols used to implement these semantics. %Z Thu, 15 Oct 1998 15:03:45 GMT %A Abdelzaher, Tarek %A Shaikh, Anees %A Jahanian, Farnam %A Shin, Kang %T RTCAST: Lightweight Multicast for Real-Time Process Groups %I Computer Science and Engineering, University of Michigan %R CSE-TR-291-96 %D April 4, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-291-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-291-96.ps.Z %X We propose a lightweight fault-tolerant multicast and membership service for real-time process groups which may exchange periodic and aperiodic messages. The service supports bounded-time message transport, atomicity, and order for multicasts within a group of communicating processes in the presence of processor crashes and communication failures. It guarantees agreement on membership among the communicating processors, and ensures that membership changes (e.g., resulting from processor joins or departures) are atomic and ordered with respect to multicast messages. We separate the mechanisms for maintaining consistency in a distributed system from those of maintaining real-time guarantees. The multicast and membership service may be used alone to support services that are not time-critical.
An additional, separate real-time admission control layer provides on-line schedulability analysis of application traffic to guarantee hard real-time deadlines in a system with dynamically-changing loads. Moreover, we separate the concerns of processor fault-tolerance and link fault-tolerance by providing a protocol for supporting the former, regardless of the actual network topology and technology used to implement the latter. We provide the flexibility of an event-triggered approach with the fast message delivery time of time-triggered protocols, such as TTP, where messages are delivered to the application immediately upon reception. This is achieved without compromising agreement, order, and atomicity properties. In addition to the design and details of the algorithm, we describe our implementation of the protocol using the x-Kernel protocol architecture running under RT Mach 3.0. %Z Thu, 15 Oct 1998 15:04:07 GMT %A Silva, J. %A Sakallah, K. %T GRASP -- A New Search Algorithm for Satisfiability %I Computer Science and Engineering, University of Michigan %R CSE-TR-292-96 %D April 10, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-292-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-292-96.ps.Z %X This report introduces GRASP (Generic seaRch Algorithm for the Satisfiability Problem), an integrated algorithmic framework for SAT that unifies several previously proposed search-pruning techniques and facilitates identification of additional ones. GRASP is premised on the inevitability of conflicts during search, and its most distinguishing feature is the augmentation of basic backtracking search with a powerful conflict analysis procedure. Analyzing conflicts to determine their causes enables GRASP to backtrack non-chronologically to earlier levels in the search tree, potentially pruning large portions of the search space. In addition, by ``recording'' the causes of conflicts, GRASP can recognize and preempt the occurrence of similar conflicts later on in the search. Finally, straightforward bookkeeping of the causality chains leading up to conflicts allows GRASP to identify assignments that are necessary for a solution to be found. Experimental results obtained from a large number of benchmarks, including many from the field of test pattern generation, indicate that application of the proposed conflict analysis techniques to SAT algorithms can be extremely effective for a large number of representative classes of SAT instances. %Z Thu, 15 Oct 1998 15:04:23 GMT %A Chen, Peter %T Optimizing Delay in Delayed-Write File Systems %I Computer Science and Engineering, University of Michigan %R CSE-TR-293-96 %D May 3, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-293-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-293-96.ps.Z %X Delayed writes are used in most file systems to improve performance over write-through while limiting the amount of data lost in a crash. Delayed writes improve performance in three ways: by allowing the file cache to absorb some writes without ever propagating them to disk (write cancellation), by improving the efficiency of disk writes, and by spreading bursts out over time. For each of these benefits, this paper determines the optimal value for the delay interval: the smallest delay that achieves as good (or nearly as good) performance as an infinite delay. The largest value for optimal delay of any of the three benefits is fileCacheSize/diskTransferRate.
This value is 1-10 seconds for current systems, implying that the delay used today (30 seconds) is much too large; a smaller delay would minimize data loss yet maintain the same performance. %Z Thu, 15 Oct 1998 15:04:40 GMT %A Shih, Tien-Pao %T Goal-Directed Performance Tuning for Scientific Applications %I Computer Science and Engineering, University of Michigan %R CSE-TR-294-96 %D June 10, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-294-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-294-96.ps.Z %X Performance tuning, as carried out by compiler designers and application programmers to close the performance gap between the achievable peak and delivered performance, becomes increasingly important and challenging as the microprocessor speeds and system sizes increase. However, although performance tuning on scientific codes usually deals with relatively small program regions, it is not generally known how to establish a reasonable performance objective and how to efficiently achieve this objective. We suggest a goal-directed approach and develop such an approach for each of three major system performance components: central processor unit (CPU) computation, memory accessing, and communication. For the CPU, we suggest using a machine-application performance model that characterizes workloads on four key function units (memory, floating-point, issue, and a virtual ``dependence unit'') to produce an upper bound performance objective, and derive a mechanism to approach this objective. A case study shows an average 1.79x speedup achieved by using this approach for the Livermore Fortran Kernels 1-12 running on the IBM RS/6000. For memory, as compulsory and capacity misses are relatively easy to characterize, we derive a method for building application-specific cache behavior models that report the number of misses for all three types of conflict misses: self, cross, and ping-pong. The method uses averaging concepts to determine the expected number of cache misses instead of attempting to count them exactly in each instance, which provides a more rapid, yet realistic assessment of expected cache behavior. For each type of conflict miss, we propose a reduction method that uses one or a combination of three techniques based on modifying or exploiting data layout: array padding, initial address adjustment, and access resequencing. A case study using a blocked matrix multiply program as an example shows that the model is within 11% of the simulation results, and that each type of conflict miss can be effectively reduced or completely eliminated. For communication in shared memory parallel systems, we derive an array grouping mechanism and related loop transformations to reduce communication caused by the problematic case of nonconsecutive references to shared arrays and prove several theorems that determine when and where to apply this technique. 
The experimental results show a 15% reduction in communication, a 40% reduction in data subcache misses, and an 18% reduction in maximum user time for a finite element application on a 56-processor KSR1 parallel computer. %Z Thu, 15 Oct 1998 15:04:59 GMT %A VanCampenhout, David %A Mudge, Trevor %A Sakallah, Karem %T Modeling Domino Logic for Static Timing Analysis %I Computer Science and Engineering, University of Michigan %R CSE-TR-295-96 %D June 24, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-295-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-295-96.ps.Z %X High performance microprocessors employ various advanced circuit techniques to achieve high clock frequencies. Critical sections of the design are often implemented in domino logic, a popular style of dynamic logic. This paper describes a new model for analyzing the temporal behavior of sequential circuits containing both static logic and domino logic. Our model integrates naturally with the SMO model for static timing analysis of sequential circuits. %Z Thu, 15 Oct 1998 15:05:17 GMT %A VanCampenhout, David %A Mudge, Trevor %A Sakallah, Karem %T Timing Analysis of Domino Logic %I Computer Science and Engineering, University of Michigan %R CSE-TR-296-96 %D June 24, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-296-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-296-96.ps.Z %X High performance microprocessors employ various advanced circuit techniques to obtain the required speed. Critical sections of the design are often implemented in domino logic, a popular style of dynamic logic. This paper describes a new model for analyzing the temporal behavior of sequential circuits using static logic mixed with domino logic. Our model integrates naturally with the SMO model for static timing analysis of sequential circuits. %Z Thu, 15 Oct 1998 15:05:35 GMT %A Malan, G. Robert %A Jahanian, Farnam %A Rasmussen, Craig %A Knoop, Peter %T Performance of a Distributed Object-Based Internet Collaboratory %I Computer Science and Engineering, University of Michigan %R CSE-TR-297-96 %D July 10, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-297-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-297-96.ps.Z %X This paper examines the performance of the Upper Atmospheric Research Collaboratory (UARC), an object-based distributed system. For the past three years, the UARC software system has enabled space scientists to collaborate on atmospheric experiments in real-time over the Internet. The UARC system provides a means for its users to view data from remote atmospheric instruments, share annotations of that data, discuss experimental results in a chat window, and simultaneously edit text manuscripts. However, UARC's distribution of atmospheric data to this geographically dispersed group of scientists is its primary mechanism for effecting their collaboration. This paper investigates the impact of UARC's implementation as a large distributed object-based software system as a means for supporting wide-area collaboratories. Specifically, it focuses on the communication performance and scalability of its object-based data distribution mechanism. First, Internet microbenchmarks are presented which characterize the UARC topology; then the results of application-level experiments are described that investigate UARC's use of NeXTSTEP's Distributed Object method invocations as a communication primitive.
Finally, an analysis and discussion of the UARC system's object-based implementation concludes the paper. %Z Thu, 15 Oct 1998 15:05:52 GMT %A Dawson, Scott %A Jahanian, Farnam %A Mitton, Todd %T Experiments on Six Commercial TCP Implementations Using a Software Fault Injection Tool %I Computer Science and Engineering, University of Michigan %R CSE-TR-298-96 %D July 25, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-298-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-298-96.ps.Z %X TCP has become the de facto standard transport protocol in today's operating systems. It is a very robust protocol which can adapt to various network characteristics, packet loss, link congestion, and even significant differences in vendor implementations. This paper presents an implementation of a tool, called ORCHESTRA, that can be used for testing distributed protocols such as TCP, and a set of experiments performed on six different vendor TCP implementations using the tool. The paper also discusses some lessons learned about TCP implementations through experimentation. %Z Thu, 15 Oct 1998 15:06:07 GMT %A Chakrabarty, Krishnendu %A Murray, Brian T. %A Hayes, John P. %T Optimal Zero-Aliasing Space Compaction of Test Responses %I Computer Science and Engineering, University of Michigan %R CSE-TR-299-96 %D August 21, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-299-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-299-96.ps.Z %X Many built-in self-testing (BIST) schemes compress the test responses from a k-output circuit to q signature streams, where q << k, a process termed space compaction. The effectiveness of such a compaction method can be measured by its compaction ratio c=k/q. A high compaction ratio can introduce aliasing, which occurs when a faulty test response maps to the fault-free signature. We investigate the problem of designing zero-aliasing space compaction circuits with maximum compaction ratio c_max. We introduce a graph representation of test responses to study the space compaction process and relate space compactor design to a graph coloring problem. Given a circuit under test, a fault model, and a test set, we determine q_min, which yields c_max = k/q_min. This provides a fundamental bound on the cost of signature-based BIST. We show that q_min \leq 2 for all the ISCAS 85 benchmark circuits. We develop a systematic design procedure for the synthesis of space compaction circuits and apply it to a number of ISCAS 85 circuits. Finally, we describe multi-step compaction, which allows zero aliasing to be achieved with any q, even when q_min > 1. %Z Thu, 15 Oct 1998 15:06:29 GMT %A Choi, Sunghyun %A Shin, Kang G. %T A Cellular Wireless Local Area Network with QoS Guarantees for Heterogeneous Traffic %I Computer Science and Engineering, University of Michigan %R CSE-TR-300-96 %D August 26, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-300-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-300-96.ps.Z %X A wireless local area network (WLAN) or a cell with quality-of-service (QoS) guarantees for various types of traffic is considered. A centralized (i.e., star) network topology is adopted as the topology of a cell, which consists of a base station and a number of mobile clients. Dynamic Time Division Duplexed (TDD) transmission is used, and hence, the same frequency channel is time-shared for downlink and uplink transmissions under the dynamic control of the base station.
We divide traffic into three classes (say I, II, and III) according to the required QoS. Whenever there is no eligible class-I or class-II traffic, class-III traffic, which requires no delay-bound guarantees, is transmitted, while uplink transmissions are controlled with a reservation scheme. Class-I traffic, which requires a bounded delay and guaranteed throughput, is handled with the framing strategy, which consists of a smoothness traffic model and the stop-and-go queueing scheme. We also establish the admission test for adding new class-I connections. We present a modified framing strategy for class-II voice uplink transmissions which utilizes the wireless link efficiently at the cost of some packet losses. Finally, we present the performance (average delay and throughput) evaluation of the reservation scheme for class-III traffic using both analytical calculations and simulations. %Z Thu, 15 Oct 1998 15:06:47 GMT %A Dundas, James %A Mudge, Trevor %T Using Stall Cycles to Improve Microprocessor Performance %I Computer Science and Engineering, University of Michigan %R CSE-TR-301-96 %D September 4, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-301-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-301-96.ps.Z %X Contemporary microprocessors typically expend a significant amount of their device budget in an attempt to reduce the detrimental effects of memory latency and branch misprediction. The extra hardware frequently reduces the clock rate and wastes valuable die area that might be more productively employed in other ways. We propose a novel approach to this problem, which is intended to be used in relatively simple microprocessors with very high clock rates. A runahead microprocessor continues to execute instructions past a level one data cache miss until the miss is serviced. During this time it is said to be in runahead mode. Any additional level one data cache misses that occur during runahead, whose target addresses can be calculated, can become data stream prefetches. Any conditional branches that are executed during runahead can become trial runs, which may improve branch prediction accuracy. Any level one instruction cache misses that occur during runahead become instruction stream prefetches. After the original data cache miss is serviced, the microprocessor restarts instruction execution at the address of the load or store instruction that caused the original data cache miss. The additional hardware required to checkpoint the sequential state of the microprocessor is rather modest, and should not impact the clock rate of a high-performance implementation. %Z Thu, 15 Oct 1998 15:07:06 GMT %A Kuno, Harumi A. %A Rundensteiner, Elke A. %T The Satisfiability-Indicating Multi-Index Organization for Maintaining Materialized Path Query OODB Views %I Computer Science and Engineering, University of Michigan %R CSE-TR-302-96 %D September 4, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-302-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-302-96.ps.Z %X Materialized database views allow applications to benefit from the powerful flexibility of views while minimizing the performance penalties traditionally associated with views. However, the need to maintain materialized views in the face of updates limits the variety of queries that can be used to define them. In this paper we address the problem of incrementally maintaining OODB views formed using path queries.
Traditional index organizations are not well suited for this task. The indexing needs of the path query view problem are unique: because the contents of the materialized view are cached and can be queried directly, the primary use for a supplemental index is during the propagation of updates rather than during query processing. Furthermore, traditional index organizations do not distinguish between single-valued and multi-valued attributes, and thus do not account for the fact that multi-valued attributes enable a single object at the head of a path to be associated with multiple instantiations of the path, any number of which could satisfy the path query predicate. This means that if an updated path involves a multi-valued attribute, then the aggregation hierarchy of the object at the head of the path must be completely re-calculated in order to determine whether or not that object participates in an alternative instantiation that fulfills the view query predicate despite the update. As a solution, we introduce a new Satisfiability Indicating Multi-Index (SMX) organization, which maintains partial information indicating whether or not a given endpoint satisfies the query predicate rather than what the exact value of the endpoint is. This new structure offers a number of benefits. (1) At most one path position forward must be traversed to determine whether or not the endpoint of an instantiation of the path fulfills a given path query predicate. (2) The SMX index structure only needs to be updated when the validity of an object's instantiation (in terms of the query predicate) changes. (3) No more than one path position forward must ever be traversed in order to identify whether or not a given object participates in any alternative instantiations that fulfill a given path query predicate. In addition to proposing this new index organization, we also present cost models and analytic evaluations comparing the performance of the SMX organization to those of the multi, nested, and path index organizations with regard to calculating the effects of updates upon views. The results of our evaluations indicate that the SMX dramatically improves upon the performance of traditional index structures with respect to the problem of path query view maintenance. %K KEYWORDS: Incremental view maintenance, path queries, data warehousing, view materialization, and object-oriented databases. %Z Thu, 15 Oct 1998 15:07:27 GMT %A Rivers, Jude A. %A Davidson, Edward S. %T Sectored Cache Performance Evaluation: A case study on the KSR-1 data subcache %I Computer Science and Engineering, University of Michigan %R CSE-TR-303-96 %D September 6, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-303-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-303-96.ps.Z %X Sectoring is a cache design and management technique that is re-emerging as cache sizes get larger and computer designers strive to exploit the possible gains from using large block (line) sizes due to spatial locality. Sectoring allows a small tag array to suffice by retaining address tags only for the large blocks, but still avoids huge miss penalties by utilizing a smaller transfer size between the cache and the next higher level of memory. With this caching strategy comes the need for a new approach for evaluating cache performance, especially relating to cache space and its best use, bus traffic, and so forth. In this study, we give a broad overview of the technique of sectoring in caches.
We introduce a new set of metrics for cache performance evaluation, stressing cache block and bus traffic usage. We use this set of superfluity metrics to investigate the behavior of real scientific applications, and also to help determine adequate and appropriate cache design parameters. We show an example of how these metrics can help point to the spatial locality problems in a given application code, thereby indicating code optimization techniques which can most significantly improve the code's performance. %Z Thu, 15 Oct 1998 15:07:46 GMT %A Manohar, Nelson R. %A Prakash, Atul %T Design Issues on the Support of Tools and Media in Replayable Workspaces %I Computer Science and Engineering, University of Michigan %R CSE-TR-304-96 %D September 18, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-304-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-304-96.ps.Z %X This paper presents flexible support for asynchronous sharing (later-time use) of computer-supported workspaces (CSWs), or simply replayable workspaces. Through the interactive replay of a previously recorded CSW session, it is possible to reuse valuable collaborative information. The session is represented through heterogeneous media streams, each containing state and inputs to a CSW tool. We present here an architecture and tool/media interfaces for the support of tools as ``plug-in'' components of our replayable workspaces. %Z Thu, 15 Oct 1998 15:08:05 GMT %A Nakanishi, Jun %A Fukuda, T. %A Koditschek, Daniel E. %T Preliminary Analytical Approach to a Brachiation Robot Controller %I Computer Science and Engineering, University of Michigan %R CSE-TR-305-96 %D September 19, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-305-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-305-96.ps.Z %X We report on our preliminary studies of a new controller for a two-link brachiating robot. Motivated by the pendulum-like motion of an ape's brachiation, we encode this task as the output of a "target dynamical system." Numerical simulations indicate that the resulting controller solves a number of brachiation problems that we term the "ladder", "swing up" and "rope" problems. Preliminary analysis provides some explanation for this success. We discuss a number of formal questions whose answers will be required to gain a full understanding of the strengths and weaknesses of this approach. %Z Thu, 15 Oct 1998 15:08:20 GMT %A Chen, I-Cheng K. %A Lee, Chih-Chieh %A Postiff, Matthew A. %A Mudge, Trevor N. %T Tagless Two-level Branch Prediction Schemes %I Computer Science and Engineering, University of Michigan %R CSE-TR-306-96 %D September 24, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-306-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-306-96.ps.Z %X Per-address two-level branch predictors have been shown to be among the best predictors and have been implemented in current microprocessors. However, as the cycle time of modern microprocessors continues to decrease, the implementation of set-associative per-address two-level branch predictors will become more difficult. In this paper, we revisit and analyze an alternative tagless, direct-mapped approach which is simpler, has lower power, and has faster access time. The tagless predictor can also offer comparable performance to current set-associative designs since removal of tags allows more resources to be allocated for the predictor and branch target buffer (BTB).
Further, removal of tags allows decoupling of the per-address predictors from the BTB, allowing the two components to be optimized individually. We then show that tagless predictors are better than tagged predictors because of opportunities for better miss-handling. Finally, we examine the system cost-benefit for tagless per-address predictors across a wide design space using equal-cost contours. We also study the sensitivity of performance to the workloads by comparing results from the Instruction Benchmark Suite (IBS) and SPEC CINT95. Our work provides principles and quantitative parameters for optimal configurations of such predictors. %Z Thu, 15 Oct 1998 15:08:43 GMT %A Darr, Timothy P. %A Birmingham, William P. %T An Agent Model for Distributed Part Selection %I Computer Science and Engineering, University of Michigan %R CSE-TR-307-96 %D September 23, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-307-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-307-96.ps.Z %X This report presents the agent model for a set of catalog, constraint and system agents. For each type of agent, the model defines the goals it can achieve; the messages sent by agents to other agents requesting that they achieve some goal; the commitments that they make, internally, to achieve goals in response to the messages; and the messages that an agent sends once a goal is achieved. The messages that an agent sends and receives are intentional and are organized into protocols that define the type, sequence and content of messages exchanged among agents to achieve goals. %Z Thu, 15 Oct 1998 15:08:58 GMT %A Hibino, Stacie %A Rundensteiner, Elke %T Query Processing in the MultiMedia Visual Information Seeking Environment: A Comparative Evaluation %I Computer Science and Engineering, University of Michigan %R CSE-TR-308-96 %D October 8, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-308-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-308-96.ps.Z %X Although much research has been conducted in the area of multidimensional range queries, we examine this problem from a new perspective. In particular, our goal is to support the processing of incremental multidimensional range queries specified via our temporal visual query language (TVQL), a direct manipulation visual query interface. In this paper, we describe the details and context of our problem, emphasizing the need to support temporal browsing and examining the characteristics of temporal data. We present a simple but new array-based index structure called the k-Array and its bucket-based counterpart, the k-Bucket. We then describe the series of experiments over several temporal queries that we ran to compare the k-Array and k-Bucket with basic methods such as the linked array and other popular bucket-based methods such as the grid file and k-d tree. Our results show that the k-Bucket performs the best overall, even though it is subject to projection effects. The other methods also perform fairly competitively, but only under certain conditions.
%Z Thu, 15 Oct 1998 15:09:17 GMT %A Pearson, Douglas John %T Learning Procedural Planning Knowledge in Complex Environments %I Computer Science and Engineering, University of Michigan %R CSE-TR-309-96 %D October 10, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-309-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-309-96.ps.Z %X In complex, dynamic environments, an agent's knowledge of the environment (its domain knowledge) will rarely be complete and correct. Existing approaches to learning and correcting domain knowledge have focused on either learning procedural knowledge to directly guide execution (e.g. reinforcement learners) or learning declarative planning knowledge (e.g. theory revision systems). Systems that only learn execution knowledge are generally only applicable to small domains. In these domains it is possible to learn an execution policy that covers the entire state space, making planning unnecessary. Conversely, existing approaches to learning declarative planning knowledge are applicable to large domains, but they are limited to simple agents, where actions produce immediate, deterministic effects in fully sensed, noise-free environments, and where there are no exogenous events. This research investigates the use of procedural knowledge to support the learning of planning knowledge in large and complex environments. We describe a series of environmental properties that constrain learning and are violated by existing approaches to learning planning knowledge. We then present an operator-based representation for planning knowledge that is sufficiently expressive to model complex, conditional actions that produce sequential effects over time. We then present IMPROV, a system for learning and correcting errors in this planning knowledge that only requires procedural access to the knowledge. This procedural restriction ensures that learning remains tractable, even over these large, expressive representations. We first explain how IMPROV incrementally learns operator precondition knowledge. We then demonstrate how a hierarchical, operator-based representation can be used to reduce the problem of learning operator effects to the problem of learning operator preconditions. This result allows IMPROV to use a single learning method to learn both operator preconditions and effects. This also allows IMPROV to learn complex models of actions that produce conditional or sequential effects. Finally, we test the system in two sample domains and empirically demonstrate that it satisfies many of the constraints faced by learning agents in complex and challenging environments. %Z Thu, 15 Oct 1998 15:09:39 GMT %A Chandramouli, V. %A Kayssi, A.I. %A Sakallah, K.A. %T Signal Delay in Coupled Distributed RC Lines in the Presence of Temporal Proximity %I Computer Science and Engineering, University of Michigan %R CSE-TR-310-96 %D October 11, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-310-96-cover.ps.gz %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-310-96.ps.gz %X With improvements in technology, accurate delay modeling of interconnects is becoming increasingly important. Due to decreasing feature sizes, the spacing between the signal lines is also decreasing. Consequently, the switching activities on the neighboring lines can have a significant impact on the delay of the line of interest, and can no longer be ignored. Accurate modeling of this phenomenon, which we call the proximity effect, is the subject of this paper. 
This is similar to the state-dependency of logic gate delays, where signal delay can be affected by the switching activities on the side inputs of a gate. We describe an efficient and accurate delay computation method using precomputed interconnect moments that treats the coupled lines as uniform, distributed RC lines and does not make any lumped approximations. This allows the proposed delay model to be used in a timing analysis tool operating over both gate and interconnect domains while accounting for state-dependency. %Z Thu, 15 Oct 1998 15:09:57 GMT %A Nica, Anisoara %A Rundensteiner, Elke Angelika %T The Dynamic Information Integration Model %I Computer Science and Engineering, University of Michigan %R CSE-TR-311-96 %D October 30, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-311-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-311-96.ps.Z %X Challenging issues for processing queries specified over large-scale information spaces (e.g., Digital Libraries or the World Wide Web) include the diversity of the information sources in terms of their structures, query interfaces and search capabilities, as well as the dynamics of sources continuously being added, removed or upgraded. Query processing can no longer be done in a static fashion against a well-defined schema (which assumes high integration). Rather an interactive query processing strategy that adapts its behavior to the system resources at hand is needed. In this paper, we give an innovative solution for query planning in such environments. The foundation of our solution is the Dynamic Information Integration Model (DIIM) which supports the specification of not only content but also capabilities of resources without requiring the establishment of a uniform integration schema. Besides the development of the DIIM model, contributions of this paper include: (1) introduction of the notion of fully specified queries that are semantically equivalent to a loosely-specified query (we show that our concept is a consistent and natural extension to the concept of full disjunction); (2) translation algorithm of a loosely-specified query into a set of semantically equivalent precise query plans that can be executed against the current configuration of available sources; the steps of the translated query plans are consistent with the binding patterns of query templates of the individual sources (capability descriptions in DIIM) and with possible interrelationships between two or more information sources (expressed as join constraints in DIIM); (3) search restriction algorithm for optimizing query processing by pruning the search space into the relevant subspace of a query based on information source descriptions; the search space of the translation algorithm is thus restricted to a query relevant subspace; and (4) the proofs of correctness for both the search restriction and translation algorithms that show that the plans obtained by the query planning process correspond to semantically equivalent query plans. %Z Thu, 15 Oct 1998 15:10:17 GMT %A Feng, Wu-chang %A Kandlur, Dilip %A Saha, Debanjan %A Shin, Kang G. 
%T TCP Enhancements for an Integrated Services Internet %I Computer Science and Engineering, University of Michigan %R CSE-TR-312-96 %D October 30, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-312-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-312-96.ps.Z %X Flow and congestion control mechanisms in TCP have been designed with the premise that network capacity should be shared equally among competing connections. With the evolution of the Internet from a class-less best-effort network to an integrated services network with different classes of service, the premise of equal fair sharing is no longer true. In an integrated services Internet, network bandwidth should be shared in accordance with the reservation made for each session. In this paper, we consider a specific class of service, namely controlled-load, which is currently under consideration for deployment in the Internet. We propose a simple scheme to realize this service and study its interaction with the control mechanisms used in TCP. In light of this study, we propose a number of enhancements to the control mechanisms used in popular implementations of TCP. Using detailed simulation experiments, we show that with the proposed enhancements in place, TCP is indeed able to provide the desired end-to-end behavior. %Z Thu, 15 Oct 1998 15:10:33 GMT %A Soparkar, N. %A Ramamritham, K. %T Workshop on Database: Active and Real-Time 1996 (Concepts meet Practice) %I Computer Science and Engineering, University of Michigan %R CSE-TR-313-96 %D November 6, 1996 %Z Thu, 15 Oct 1998 15:10:51 GMT %A Jacob, Bruce %A Mudge, Trevor %T Specification of the PUMA memory management design %I Computer Science and Engineering, University of Michigan %R CSE-TR-314-96 %D December 10, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-314-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-314-96.ps.Z %X In this report we specify the memory management design of a 1GHz PowerPC implementation in which a simple design is a prerequisite for a high clock rate and short design cycle. The scheme involves no translation hardware such as a translation lookaside buffer or a page-table-walking state machine. However, it is just as efficient as hardware-managed address translation and is much more flexible. Modern operating systems such as Mach charge between 0.16 and 0.28 CPI for address translation on systems with TLBs. PUMA's software-managed address translation exacts an overhead of 0.03 CPI. Mechanisms to support such features as shared memory, superpages, sub-page protection, and sparse address spaces can be defined completely in software, allowing much more flexibility than in hardware-defined mechanisms. Our software design combines the virtual caches with a PowerPC-like segmentation mechanism; it maps user 32-bit addresses onto a 44-bit segmented virtual space. We use a global page table to map this entire global virtual space. There are no individual per-process page tables; all process page tables map directly onto the global table, and when processes share memory they also share a portion of the global page table. The following benefits are derived from the organization: (a) virtual cache consistency management can be eliminated, (b) the page table space requirements can be cut in half by eliminating the need to replicate page table entries for shared pages, and (c) the virtual memory system can be made less complex because it does not have to deal with the virtual-cache synonym problem. 
%Z Thu, 15 Oct 1998 15:11:10 GMT %A Chaudhry, Nauman %A Moyne, James %A Rundensteiner, Elke A. %T Extended Aggregation Relationships for Process Specification and Enactment in Active Databases %I Computer Science and Engineering, University of Michigan %R CSE-TR-315-96 %D November 11, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-315-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-315-96.ps.Z %X Process specification in a variety of domains, such as experiment modeling, work-flow modeling, and process-flows in semiconductor manufacturing, is typically characterized by recursive specification in terms of sequences and alternatives. A variety of models have been proposed for the specification of such processes. In particular, object-oriented techniques have been used for achieving various desirable features (e.g., reusability, maintainability, etc.) in the process specification and active databases have been suggested as possible platforms for process enactment. However, on the one hand object-oriented models for process specification lack an important feature of object-orientation, namely the ability to organize processes as classes with inheritance support, and on the other hand various problems, such as lack of methodological support for active rule definition, analysis, and maintenance, stand in the way of successfully employing active database technology for process enactment. To take better advantage of both object-oriented techniques and active database technology, we present a comprehensive framework for process specification and enactment, which provides an integrated solution utilizing ideas from both these domains. This is achieved by developing PSOM (Process Specification Object Model) which is an object-oriented model with explicit aggregation constructs and extended sub-typing relationships for these new aggregation constructs. We show the use of PSOM for defining processes using the aggregation constructs and arranging these processes into class hierarchies based on the formal types of the processes. In addition, we establish guidelines for defining active rules for process enactment on PSOM process specifications. We also prove that the rule definition guidelines lead to modularized rule sets which simplify the analysis of termination behavior of active rules defined in a PSOM process database. %Z Thu, 15 Oct 1998 15:11:36 GMT %A Thaler, David G. %A Ravishankar, Chinya V. %T A Name-Based Mapping Scheme for Rendezvous %I Computer Science and Engineering, University of Michigan %R CSE-TR-316-96 %D November 14, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-316-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-316-96.ps.Z %X Clusters of identical intermediate servers are often created to improve availability and robustness in many domains. The use of proxy servers for the WWW and of Rendezvous Points in multicast routing are two such situations. However, this approach is inefficient if identical requests are received and processed by multiple servers. We present an analysis of this problem, and develop a method called the Highest Random Weight (HRW) Mapping that eliminates these difficulties. Given an object name, HRW maps it to a server within a given cluster using the object name, rather than any a priori knowledge of server states. Since HRW always maps a given object name to the same server within a given cluster, it may be used locally at client sites to achieve consensus on object-server mappings. 
We present an analysis of HRW and validate it with simulation results showing that it gives faster service times than traditional request allocation schemes such as round-robin or least-loaded, and adapts well to changes in the set of servers. HRW is particularly applicable to domains in which there are a large number of requestable objects, and where there is a significant probability that a requested object will be requested again. HRW has now been adopted by the multicast routing protocol PIMv2 as its mechanism for clients to identify Rendezvous Points. %Z Thu, 15 Oct 1998 15:11:52 GMT %A Tam, Edward S. %T Early Design Cycle Timing Simulation of Caches %I Computer Science and Engineering, University of Michigan %R CSE-TR-317-96 %D November 20, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-317-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-317-96.ps.Z %X Cache performance is a key component of a microprocessor's overall performance, as it is the cache that buffers the high speed CPU from the much slower main memory. Behavioral cache simulators indicate the performance of a given cache configuration in terms of the number of hits and misses experienced when running a piece of code. One can attach a leading edge penalty or "effective" penalty estimate to each miss to get a first order idea of run time. However, individual timing penalties are not assessed within these models, nor are the various factors that can affect each memory access' latency. Our Latency Effects (LE) cache model accounts for additional latencies due to trailing edge effects, bus width considerations, the number of outstanding accesses allowable, and port limitations. A tool implementing the LE cache model has been built on top of a behavioral cache simulator, DineroIII, in the spirit of the Resource Conflict Methodology developed by J-D Wellman. The tool was combined with Wellman's RCM_brisc processor simulator to provide a realistic interaction of the cache with the processor (including the latency masking effects of processor activity) and to assess the accuracy of the model when simulating the execution of actual programs. The combined tool accurately mirrors the effects of changing a cache's configuration for a given processor configuration running a variety of programs. While the reported execution times do not exactly match the total execution times of the same programs running on actual hardware, the tool provides enough useful information to guide processor/cache designers early in the design cycle toward optimal configurations for target applications. This addition of cache modeling to the RCM_brisc instruction-level processor simulator with perfect cache increases simulation time by only 17% (less than 5% over a constant miss penalty cache model), which is reasonable given the added benefits of using the LE cache model. %Z Thu, 15 Oct 1998 15:12:10 GMT %A Dawson, Scott %A Jahanian, Farnam %A Mitton, Todd %T ORCHESTRA: A Fault Injection Environment for Distributed Systems %I Computer Science and Engineering, University of Michigan %R CSE-TR-318-96 %D November 26, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-318-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-318-96.ps.Z %X This paper reports on ORCHESTRA, a portable fault injection environment for testing implementations of distributed protocols. 
The paper focuses on architectural features of ORCHESTRA that provide portability, minimize intrusiveness on target protocols, and support testing of real-time systems. ORCHESTRA is based on a simple yet powerful framework, called script-driven probing and fault injection, for the evaluation and validation of the fault-tolerance and timing characteristics of distributed protocols. ORCHESTRA was initially developed on the Real-Time Mach operating system and later ported to other platforms including Solaris and SunOS, and has been used to conduct extensive experiments on several protocol implementations. A novel feature of the Real-Time Mach implementation of ORCHESTRA is that it utilizes certain features of the Real-Time Mach operating system to quantify and compensate for intrusiveness of the fault injection mechanism. In addition to describing the overall ORCHESTRA architecture and implementation, this paper also describes the experimental evaluation of two protocol implementations: a distributed group membership service on the Sun Solaris operating system, and a real-time audio-conferencing application on Real-Time Mach. %Z Thu, 15 Oct 1998 15:12:34 GMT %A Bird, Peter L. %A Mudge, Trevor N. %T An Instruction Stream Compression Technique %I Computer Science and Engineering, University of Michigan %R CSE-TR-319-96 %D December 2, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-319-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-319-96.ps.Z %X The performance of instruction memory is a critical factor for both large, high performance applications and for embedded systems. With high performance systems, the bandwidth to the instruction cache can be the limiting factor for execution speed. Code density is often the critical factor for embedded systems. In this report we demonstrate a straightforward technique for compressing the instruction stream for programs. After code generation, the instruction stream is analysed for often reused sequences of instructions from within the program's basic blocks. These patterns of multiple instructions are then mapped into single byte opcodes. This constitutes a compression of multiple, multi-byte operations onto a single byte. When compressed opcodes are detected during the instruction fetch cycle of program execution, they are expanded within the CPU into the original (multi-cycle) set of instructions. Because we only operate within a program's basic block, branch instructions and their targets are unaffected by this technique. We provide statistics gathered from code generated for the Intel Pentium and the PowerPC processors. We have found that by incorporating a 1K decode ROM in the CPU we can reduce a program's code size between 45% and 60%. %Z Thu, 15 Oct 1998 15:13:12 GMT %A Huggins, James K. %T Broy-Lamport Specification Problem: A Gurevich Abstract State Machine Solution %I Computer Science and Engineering, University of Michigan %R CSE-TR-320-96 %D December 13, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-320-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-320-96.ps.Z %X We apply the Gurevich Abstract State Machine methodology to a benchmark specification problem of Broy and Lamport. %Z Thu, 15 Oct 1998 15:13:27 GMT %A Huggins, James K. 
%A VanCampenhout, David %T Specification and Verification of Pipelining in the ARM2RISC Microprocessor %I Computer Science and Engineering, University of Michigan %R CSE-TR-321-96 %D December 13, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-321-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-321-96.ps.Z %X We specify the ARM2 RISC microprocessor using the Gurevich Abstract State Machine methodology, and prove the correctness of its pipelining techniques. %Z Thu, 15 Oct 1998 15:13:41 GMT %A Gurevich, Yuri %A Spielmann, Marc %T Recursive Abstract State Machines %I Computer Science and Engineering, University of Michigan %R CSE-TR-322-96 %D December 16, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-322-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-322-96.ps.Z %X The Gurevich abstract state machine (ASM) thesis, supported by numerous applications, asserts that ASMs express algorithms on their natural abstraction levels directly and essentially coding-free. The only objection raised to date has been that ASMs are iterative in their nature, whereas many algorithms are naturally recursive. There seems to be an inherent contradiction between (i) the ASM idea of explicit and comprehensive states, and (ii) higher level recursion with its hiding of the stack. But consider recursion more closely. When an algorithm A calls an algorithm B, a clone of B is created and this clone becomes a slave of A. This raises the idea of treating recursion as an implicitly distributed computation. Slave agents come and go, and the master/slave hierarchy serves as the stack. Building upon this idea, we suggest a definition of recursive ASMs. The implicit use of distributed computing has an important side benefit: it leads naturally to concurrent recursion. In addition, we reduce recursive ASMs to distributed ASMs. If desired, one can view recursive notation as mere abbreviation. %Z Thu, 15 Oct 1998 15:14:06 GMT %A Worthington, Bruce L. %A Ganger, Gregory R. %A Patt, Yale N. %A Wilkes, John %T On-Line Extraction of SCSI Disk Drive Parameters %I Computer Science and Engineering, University of Michigan %R CSE-TR-323-96 %D December 19, 1996 %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-323-96-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1996/CSE-TR-323-96.ps.Z %X Sophisticated disk scheduling algorithms require accurate and detailed disk drive specifications, including information on mechanical delays, on-board caching and prefetching algorithms, command processing and protocol overheads, and logical-to-physical block mappings. Comprehensive disk models used in storage subsystem design require similar levels of detail. This report describes a suite of general-purpose techniques and algorithms for acquiring the necessary data from SCSI disks via the ANSI-standard interface. The accuracy of the extracted information is demonstrated by comparing the behavior of a configurable disk simulator against actual disk drive activity. Detailed measurements extracted from several SCSI disk drives are provided. 
%Z Thu, 15 Oct 1998 15:14:20 GMT %A Jones, Matthew %A Rundensteiner, Elke %T Evaluating View Materialization Strategies for Complex Hierarchical Objects %I Computer Science and Engineering, University of Michigan %R CSE-TR-324-97 %D January 9, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-324-97-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-324-97.ps.Z %X In many design applications, it is common practice to store complex hierarchical objects in a compact folded form to save storage space and to reduce processing costs for accessing the objects. In these folded representations, complex objects are built up from identical and otherwise indistinguishable design objects. However, it is often necessary, especially during the refinement of data, to distinguish between these identical folded objects by personalizing a subset of them. The established practice is to explicitly unfold the hierarchical objects and thus create space in which to store distinct personalization data for each object occurrence. However, this explicit unfolding is costly and time consuming, resulting in a potentially much larger structure, and substantially increasing the costs of querying and updating the design. Therefore, we propose an unfold view operator and provide the basis for updating of customized values for each hierarchical sub-object through the unfolded view. We propose alternative strategies for the maintenance of personalization values, representing various portions of the view materialization spectrum. We present a performance evaluation comparing these strategies as well as the traditional explicit unfolding approach. Our evaluation indicates the trade-offs in terms of storage and query costs and compares the costs to do implicit unfolding through a view rather than explicit unfolding of complex hierarchical objects. %Z Thu, 15 Oct 1998 15:14:38 GMT %A Blass, Andreas %A Gurevich, Yuri %T The Linear Time Hierarchy Theorems for RAMs and Abstract State Machines %I Computer Science and Engineering, University of Michigan %R CSE-TR-325-97 %D January 21, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-325-97-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-325-97.ps.Z %X Contrary to polynomial time, linear time greatly depends on the computation model. In 1992, Neil Jones designed a number of computation models where the linear-speed-up theorem fails and linear-time computable functions form a hierarchy. The linear time of those models is too restrictive. We prove linear-time hierarchy theorems for random access machines and Gurevich abstract state machines (formerly evolving algebras). The latter generalization is harder and more important because of the greater flexibility of the ASM model. One long-term goal of this line of research is to prove lower bounds for natural linear time problems. %Z Thu, 15 Oct 1998 15:14:56 GMT %A Dexter, Scott %A Doyle, Patrick %A Gurevich, Yuri %T Gurevich Abstract State Machines and Schoenhage Storage Modification Machines %I Computer Science and Engineering, University of Michigan %R CSE-TR-326-97 %D January 23, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-326-97-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-326-97.ps.Z %X We demonstrate that Schoenhage storage modification machines are equivalent, in a strong sense, to unary abstract state machines. 
We also show that if one extends the Schoenhage model with a pairing function and removes the unary restriction, then equivalence between the two machine models survives. %Z Thu, 15 Oct 1998 15:15:14 GMT %A Gurevich, Yuri %A Soparkar, Nandit %A Wallace, Charles %T Formalizing Database Recovery %I Computer Science and Engineering, University of Michigan %R CSE-TR-327-97 %D January 29, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-327-97-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-327-97.ps.Z %X Failure resilience is an essential requirement for database systems, yet there has been little effort to specify and verify techniques for failure recovery formally. The desire to improve performance has resulted in algorithms of considerable sophistication, yet understood by few and prone to errors. In this paper, we illustrate how the methodology of Gurevich Abstract State Machines can elucidate recovery and provide formal rigor to the design of a recovery algorithm. In a series of refinements, we model recovery at several levels of abstraction, verifying the correctness of each model. This work suggests that our approach can be applied to more advanced recovery mechanisms. %Z Thu, 15 Oct 1998 15:15:32 GMT %A Jensen, Paul %A Wallace, Charles %A Soparkar, Nandit %T Specifying and Constructing Schedulers for Workflows with Autonomous Executions %I Computer Science and Engineering, University of Michigan %R CSE-TR-328-97 %D February 4, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-328-97-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-328-97.ps.Z %X Workflow has become an important paradigm for distributed data and computing systems in a wide range of application areas. In a workflow, tasks executing on autonomous, heterogeneous systems are coordinated through data and control flow constraints. An important challenge in workflow management is the scheduling of actions and operations performed by the concurrently executing tasks. The legal interleavings among the tasks must be specified, and scheduling control mechanisms to ensure correct, efficient executions must be generated. Scheduling workflows is particularly difficult because the dependencies between tasks may be application-specific and task autonomy may place certain actions outside the control or observation of the scheduler. We use techniques from supervisory control theory of discrete event systems for specifying and generating scheduling controllers in workflow environments. We specify the tasks and the intertask dependencies as finite state automata. To model task autonomy, we characterize some of the event transitions in the task automata as beyond the control or observation of the workflow scheduler. By applying the techniques of supervisory control theory to these specifications, we show how the existence of schedulers may be ascertained and how schedulers may be constructed. In cases where no controller can allow exactly the desired class of schedules, we show how to construct a scheduler that allows the best possible approximation to the desired class. We also address the issues of prioritized tasks and distributed workflow scheduling. Our approach provides an effective means to model several workflow systems and to create scheduling mechanisms to manage them. %Z Thu, 15 Oct 1998 15:15:47 GMT %A Jensen, Paul A. %A Soparkar, Nandit R. %A Mathur, Amit G. 
%T Characterizing Multicast Orderings using Concurrency Control Theory %I Computer Science and Engineering, University of Michigan %R CSE-TR-329-97 %D February 4, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-329-97-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-329-97.ps.Z %X Coordinating distributed executions is achieved by two widely used approaches: process groups and transactions. Typically, the two represent a trade-off in terms of the degrees of consistency and performance. By applying transaction concurrency control techniques to characterize and design process group multicast orderings, we aim to provide aspects of both ends of the trade-off. In particular, we propose a framework in which each message multicast is regarded as a transaction. Appropriate message ordering protocols are devised and shown to be correct using a variant of concurrency control theory. Also, we are able to incorporate certain aspects of application semantics for which existing process group approaches are inadequate. Finally, our framework provides a means to characterize the performance of orderings to allow a comparison of different ordering protocols. %Z Thu, 15 Oct 1998 15:16:09 GMT %A Chen, I-Cheng K. %A Bird, Peter L. %A Mudge, Trevor %T The Impact of Instruction Compression on I-cache Performance %I Computer Science and Engineering, University of Michigan %R CSE-TR-330-97 %D February 13, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-330-97-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-330-97.ps.Z %X In this paper we present a straightforward technique for compressing the instruction stream for programs that overcomes some of the limitations of earlier proposals. After code generation, the instruction stream is analysed for frequently used sequences of instructions from within the program's basic blocks. These patterns of multiple instructions are then mapped into single byte opcodes. This constitutes a compression of multiple, multi-byte operations onto a single byte. When compressed opcodes are detected during the instruction fetch cycle of program execution, they are expanded within the CPU into the original (multi-cycle) sequence of instructions. We only examine patterns within a program's basic block, so branch instructions and their targets are unaffected by this technique, allowing compression to be decoupled from compilation. %Z Thu, 15 Oct 1998 15:16:25 GMT %A Vlaovic, Stevan %T Communication Characterization of a Cray T3D %I Computer Science and Engineering, University of Michigan %R CSE-TR-331-97 %D February 13, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-331-97-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-331-97.ps.Z %X In order to develop efficient applications for large-scale multiprocessors, the communication patterns of the target architecture should be well understood. By understanding these communication patterns, a faster running application can be developed. Some applications require the smallest latency possible, whereas others might benefit from higher throughput. The Cray T3D is a multiprocessor that has a high speed 3D torus interconnect, and special communications hardware. By utilizing this hardware properly, a program has the potential to minimize its communications overhead. This study investigates three different libraries that provide communication on the T3D: PVM, MPI, and SMA. Both PVM and MPI are portable, whereas SMA is native. 
The latency, bandwidth, and collective communication costs of the communication primitives implemented in these libraries are characterized and compared. These results are also briefly contrasted with the results of related studies on the IBM SP2 and the Convex SPP-1000. %Z Thu, 15 Oct 1998 15:16:41 GMT %A Labovitz, Craig %A Malan, G. Robert %A Jahanian, Farnam %T Internet Routing Instability %I Computer Science and Engineering, University of Michigan %R CSE-TR-332-97 %D February 21, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-332-97-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-332-97.ps.Z %X This paper examines the network inter-domain routing information exchanged between backbone service providers at the major U.S. public Internet exchange points. Internet routing instability, or the rapid fluctuation of network reachability information, is an important problem currently facing the Internet engineering community. High levels of network instability can lead to packet loss, increased network latency and time to convergence. At the extreme, high levels of routing instability have led to the loss of internal connectivity in wide-area, national networks. In this paper, we describe several unexpected trends in routing instability, and examine a number of anomalies and pathologies observed in the exchange of inter-domain routing information. The analysis in this paper is based on data collected from BGP routing messages generated by border routers at five of the Internet core's public exchange points during a nine-month period. We show that the volume of these routing updates is several orders of magnitude more than expected and that the majority of this routing information is redundant, or pathological. Furthermore, our analysis reveals several unexpected trends and ill-behaved systematic properties in Internet routing. We finally posit a number of explanations for these anomalies and evaluate their potential impact on the Internet infrastructure. %Z Thu, 15 Oct 1998 15:17:11 GMT %A Jamin, S. %A Shenker, S.J. %T Measurement-based Admission Control Algorithms for Controlled-load Service: A Structural Examination %I Computer Science and Engineering, University of Michigan %R CSE-TR-333-97 %D April 1, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-333-97-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-333-97.ps.Z %X The Internet Engineering Task Force (IETF) is considering the adoption of the controlled-load service, a real-time service with very relaxed service guarantees. Measurement-based admission control algorithms (MBAC's), as opposed to the more conservative worst-case parameter-based methods, were expressly designed to achieve high levels of network utilization for such relaxed real-time services. Most researchers studying MBAC's, including ourselves, have focused primarily on the design of the admission control equations using a variety of principled and ad hoc motivations. In this paper we show that, much to our own surprise, the admission control equations developed so far in the MBAC literature give essentially the same performance. Furthermore, we observe that the equations, even though they are derived and motivated in quite different ways, have the same structural form. %Z Thu, 15 Oct 1998 15:17:37 GMT %A Rogers, Seth O. 
%T Symbolic Performance & Learning in Continuous-valued Environments %I Computer Science and Engineering, University of Michigan %R CSE-TR-334-97 %D May 2, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-334-97-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-334-97.ps.Z %X Real-world and simulated real-world domains, such as flying and driving, commonly have the characteristics of continuous-valued (CV) environments. These environments are frequently complex and difficult to control, requiring a great deal of specific, detailed knowledge. Symbolic computation can use many different types of knowledge to constrain the control function. This thesis investigates learning the control function using observed data in a symbolic architecture. SPLICE (Symbolic Performance & Learning In Continuous-Valued Environments) is a Soar agent that performs symbolic control function approximation. SPLICE uses a three-level framework to first classify its sensory information into symbolic regions, then map the set of regions to a local model, then use the local model to determine an action. Learning causes the models to gradually become more specific and more accurate. The SPLICE agent is evaluated against other adaptive control algorithms in simple domains using four experimental methodologies. We also instantiate SPLICE in a complex environment, simulated flight, and evaluate its performance. The final goal is not only to create an effective controller for complex continuous environments, but also to understand more clearly the ramifications of symbolic learning in continuous domains. %Z Thu, 15 Oct 1998 15:18:01 GMT %A Patel, Sanjay Jeram %A Friendly, Daniel Holmes %A Patt, Yale N. %T Critical Issues Regarding the Trace Cache Fetch Mechanism %I Computer Science and Engineering, University of Michigan %R CSE-TR-335-97 %D May 7, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-335-97-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-335-97.ps.Z %X In order to meet the demands of wider issue processors, fetch mechanisms will need to fetch multiple basic blocks per cycle. The trace cache supplies several basic blocks each cycle by storing logically contiguous instructions in physically contiguous storage. When a particular basic block is requested, the trace cache can potentially respond with the requested block along with several blocks that followed it when the block was last encountered. In this technical report, we examine some critical features of a trace cache mechanism designed for a 16-wide issue processor and evaluate their effects on performance. We examine features such as cache associativity, storage partitioning, branch predictor design, instruction cache design, and fill unit design. We compare the performance of our trace cache mechanism with that of the design presented by Rotenberg et al. and show a 23% improvement in performance. In our final analysis, we compare our trace cache mechanism with an aggressive single basic block fetch mechanism and show that the trace cache mechanism attains a 24% improvement in performance. 
%Z Thu, 15 Oct 1998 15:18:24 GMT %A Gurevich, Yuri %T May 1997 Draft of the ASM Guide %I Computer Science and Engineering, University of Michigan %R CSE-TR-336-97 %D May 7, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-336-97-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-336-97.ps.Z %X This is only a draft of a portion of the ASM guide, but there are no plans to change this part in any substantial way, only to augment it. %Z Thu, 15 Oct 1998 15:18:39 GMT %A Chang, Po-Yung %T Classification-Directed Branch Predictor Design %I Computer Science and Engineering, University of Michigan %R CSE-TR-337-97 %D May 13, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-337-97-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-337-97.ps.Z %X Pipeline stalls due to branches represent one of the most significant impediments to realizing the performance potential of deeply pipelined superscalar processors. Two well-known mechanisms have been proposed to reduce the branch penalty: speculative execution in conjunction with branch prediction, and predicated execution. This dissertation proposes branch classification, coupled with improvements in conditional branch prediction, indirect branch prediction, and predicated execution, to reduce the branch execution penalty. Branch classification allows an individual branch instruction to be associated with the branch predictor best suited to predict its direction. Using this approach, a hybrid branch predictor is constructed which achieves a higher prediction accuracy than any branch predictor previously reported in the literature. This dissertation also proposes a new prediction mechanism for predicting indirect jump targets. For the perl and gcc benchmarks, this mechanism reduces the indirect jump misprediction rate by 93.4\% and 63.3\% and the overall execution time on an 8-wide issue out-of-order execution machine by 14\% and 5\%. Finally, this dissertation proposes a new method for combining the performance benefits of predicated execution and speculative execution. This approach significantly reduces the branch execution penalty suffered by wide-issue processors. %Z Thu, 15 Oct 1998 15:19:03 GMT %A Blass, Andreas %A Gurevich, Yuri %A Shelah, Saharon %T Choiceless Polynomial Time %I Computer Science and Engineering, University of Michigan %R CSE-TR-338-97 %D May 16, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-338-97-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-338-97.ps.Z %X Turing machines define polynomial time (PTime) on strings but cannot deal with structures like graphs directly, and there is no known, easily computable string encoding of isomorphism classes of structures. Is there a computation model whose machines do not distinguish between isomorphic structures and compute exactly PTime properties? This question can be recast as follows: Does there exist a logic that captures polynomial time (without presuming the presence of a linear order)? Earlier, one of us conjectured the negative answer. The problem motivated a quest for stronger and stronger PTime logics. All these logics avoid arbitrary choice. Here we attempt to capture the choiceless fragment of PTime. Our computation model is a version of abstract state machines (formerly called evolving algebras). The idea is to replace arbitrary choice with parallel execution. The resulting logic is more expressive than other PTime logics in the literature. 
A more difficult theorem shows that the logic does not capture all PTime. %Z Thu, 15 Oct 1998 15:19:20 GMT %A Lin, Stephen %A Lee, Sang Wook %T Using Chromaticity Distributions and Eigenspace Analysis for Pose-, Illumination-, and Specularity-Invariant Color Indexing of 3D Objects %I Computer Science and Engineering, University of Michigan %R CSE-TR-339-97 %D May 23, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-339-97-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-339-97.ps.Z %X The distribution of object colors can be effectively utilized for recognition and indexing. Difficulties arise in the recognition of object color distributions when there are variations in illumination color, changes in object pose with respect to illumination direction, and specular reflections. However, most of the recent approaches to color-based recognition focus mainly on illumination color invariance. We propose an approach that identifies object color distributions influenced by: (1) illumination pose, (2) illumination color and (3) specularity. We suggest the use of chromaticity distributions to achieve illumination pose invariance. To characterize changes in chromaticity distribution due to illumination color, an eigenspace analysis is employed. A set of chromaticity histograms of each object is generated for a range of lighting colors based on linear models of illumination and reflectance, and the histograms are represented using a small number of eigen basis vectors constructed from principal components analysis. Represented in an eigenspace, chromaticity signatures for each object in the model database form a manifold encompassing a range of lighting colors. Recognition and illumination color estimation are performed by projecting the chromaticity signature of a test object into the eigenspace and examining its proximity to database manifolds. Since specular reflections may displace a projection away from its corresponding manifold, a model-based specularity detection/rejection algorithm, called chromaticity differencing, is developed to reduce this discrepancy. %Z Thu, 15 Oct 1998 15:19:36 GMT %A Crestana, Viviane M. %A Lee, Amy J. %A Rundensteiner, Elke A. %T Schema Version Removal: Optimizing Transparent Schema Evolution Systems %I Computer Science and Engineering, University of Michigan %R CSE-TR-340-97 %D May 27, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-340-97-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-340-97.ps.Z %X Powerful interoperability-enabling solutions for software application integration must allow applications to evolve and data requirements to change, while minimizing such changes on other integrated applications. The transparent schema evolution system (TSE), which we developed at the University of Michigan, accomplishes evolution through schema versioning, where a schema version is a dynamic object-oriented view. When the TSE system receives a view schema change request, the system computes a new view schema that reflects the desired change instead of modifying the old view schema in place, thereby preserving existing view schemas for old applications. This generation of a potentially large number of schema versions over time results in an excessive build-up of classes and underlying object instances, not all of which are necessarily still in use. 
Since our view system provides us with materialized views, a degradation of system performance due to update propagation is expected in a situation where we have no-longer-utilized view schemas in the system. Also, a larger number of view schemas will result in storage overhead costs. In this paper, we address this problem using consistent schema removal techniques. First, we characterize four potential problems of schema consistency that could be caused by removal of a single virtual class, and then outline our solution approach for each of these problems. Second, we demonstrate that view schema removal is sensitive to the order in which individual classes are processed. Our solution to this problem is based on a formal model, called the dependency model, of capturing all dependencies between classes as logic clauses and of manipulating them to make decisions on class deletions and nondeletions while guaranteeing the consistency of the schema. Third, based on this formal model, we have developed and proven consistent a dependency graph (DG) representation and an associated set of rules for DG generation, reduction, and transformation. A first preliminary version of the latter has been successfully implemented in our Schema Version Removal (SVR) tool. Fourth, we present a cost model for evaluating alternative removal patterns on DG. Lastly, we report our preliminary experimental studies that demonstrate the impact of our schema version removal on the performance of the TSE system. %Z Thu, 15 Oct 1998 15:20:44 GMT %A Lee, Chih-Chieh %A Uhlig, Richard %A Mudge, Trevor %T A Case Study of a Hardware-Managed TLB in a Multi-Tasking Environment %I Computer Science and Engineering, University of Michigan %R CSE-TR-341-97 %D June 3, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-341-97-cover.ps.Z %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-341-97.ps.Z %X There have been very few performance studies of hardware-managed translation look-aside buffers (TLBs). The major reason is the lack of efficient and accurate analysis tools. Newer operating systems, applications, and the popularity of the client-server model of computation place a greater burden than their predecessors on memory system components such as TLBs. Thus it is becoming more important to measure the performance of memory systems under such workloads. In this work, we implemented a trap-driven simulator in an operating system to emulate a variety of TLBs. Using this tool, we were able to evaluate the performance of a range of TLBs under these newer workloads. The results show that in order to improve the TLB performance, we should carefully map pages into the TLB, append process identifiers to avoid flushing the TLB contents frequently, or reserve part of the TLB for a particular server process. %Z Thu, 15 Oct 1998 15:20:58 GMT %A Lefurgy, Charles %A Bird, Peter %A Chen, I-Cheng %A Mudge, Trevor %T Improving Code Density Using Compression Techniques %I Computer Science and Engineering, University of Michigan %R CSE-TR-342-97 %D July 8, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-342-97-cover.ps.gz %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-342-97.ps.gz %X We propose a method for compressing programs in embedded processors where instruction memory size dominates cost. A post-compilation analyzer examines a program and replaces common sequences of instructions with a single instruction codeword. 
A microprocessor executes the compressed instruction sequences by fetching codewords from the instruction memory, expanding them back to the original sequence of instructions in the decode stage, and issuing them to the execution stages. We apply our technique to the PowerPC instruction set and achieve a 30% to 50% reduction in size for SPEC CINT95 programs. %Z Thu, 15 Oct 1998 15:21:18 GMT %A Berwick, D. %A Lee, S.W. %T A Chromaticity Space for Specularity, Illumination Color- and Illumination Pose-Invariant 3-D Object Recognition %I Computer Science and Engineering, University of Michigan %R CSE-TR-343-97 %D July 28, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-343-97-cover.ps.gz %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-343-97.ps.gz %X Most of the recent recognition/indexing approaches concentrate on establishing invariance to illumination color to improve the utility of color recognition. However, other effects caused by illumination pose and specularity on three-dimensional object surfaces have not received notable attention. We present a chromaticity recognition method that discounts the effects of illumination pose, illumination color and specularity. It utilizes a chromaticity space based on log-ratio of sensor responses for illumination pose and color invariance. A model-based specularity detection/rejection algorithm can be used to improve the chromaticity recognition and illumination estimation for objects including specular reflections. %Z Thu, 15 Oct 1998 15:21:39 GMT %A Thaler, D. %A Ravishankar, C.V. %T An Architecture for Inter-Domain Troubleshooting %I Computer Science and Engineering, University of Michigan %R CSE-TR-344-97 %D August 1, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-344-97-cover.ps.gz %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-344-97.ps.gz %X In this paper, we explore the constraints of a new problem: that of coordinating network troubleshooting among peer administrative domains or Internet Service Providers, and untrusted observers. Allowing untrusted observers permits any entity to report problems, whether it is a Network Operations Center (NOC), end-user, or application. Our goals here are to define the inter-domain coordination problem clearly, and to develop an architecture which allows observers to report problems and receive timely feedback, regardless of their own locations and identities. By automating this process, we also relieve human bottlenecks at help desks and NOCs whenever possible. We begin by presenting a troubleshooting methodology for coordinating problem diagnosis. We then describe GDT, a distributed protocol which realizes this methodology. We show through simulation that GDT performs well as the number of observers and problems grows, and continues to function robustly amidst heavy packet loss. %Z Thu, 15 Oct 1998 15:21:53 GMT %A Manohar, Nelson %T Interactive Delayed-Sharing of Computer-Supported Workspaces via the Streaming of Re-Executable Content %I Computer Science and Engineering, University of Michigan %R CSE-TR-345-97 %D August 18, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-345-97-cover.ps.gz %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-345-97.ps.gz %X This dissertation introduces the notion of interactive delayed-sharing of a session on a computer-supported workspace to allow reuse of such a session at a later time. A data artifact, referred to as a session object, encapsulates the session. 
A session object is composed of heterogeneous multimedia streams that represent temporally-ordered input sequences to applets in the workspace. Playback of a session object recreates the underlying workspace spanned by these applets through the streaming and re-execution of these input sequences in their respective applets. The contributions of this dissertation are as follows. First, this dissertation pioneers the re-execution approach to the record and replay of sessions for supporting computer-supported collaborative work. In doing so, it introduces a novel paradigm for flexible support of asynchronous collaboration that allows users to collaborate by annotating, editing, and refining these delayed-shared workspaces. This dissertation explores these collaborative features, particularly focusing on the record, representation, and playback of these workspaces. Second, this dissertation introduces the notion of workspaces as authored, transportable, and temporally-aware artifacts and investigates flexible mechanisms for the delivery of temporal-awareness to workspaces composed of heterogeneous applications through the decoupling of tool and media services. This dissertation develops a formal specification through temporal relationships in these workspaces. Finally, this dissertation describes mechanisms for the integration of re-executable content streams together with continuous multimedia streams. It describes a framework for media-independent integration of heterogeneous multimedia streams. The dissertation introduces the use of statistical process controls to guide the relaxation and/or constraint of scheduling and synchronization mechanisms so as to flexibly support a range of heterogeneous media streams and their temporal relationships. %Z Thu, 15 Oct 1998 15:22:14 GMT %A Tyson, Gary S. %A Lick, Kelsey %A Farrens, Matthew %T Limited Dual Path Execution %I Computer Science and Engineering, University of Michigan %R CSE-TR-346-97 %D August 28, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-346-97-cover.ps.gz %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-346-97.ps.gz %X This work presents a hybrid branch predictor scheme that uses a limited form of dual path execution along with dynamic branch prediction to improve execution times. The ability to execute down both paths of a conditional branch enables the branch penalty to be minimized; however, relying exclusively on dual path execution is infeasible because instruction fetch rates far exceed the capability of the pipeline to retire a single branch before others must be processed. By using confidence information, available in the dynamic branch prediction state tables, a limited form of dual path execution becomes feasible. This reduces the burden on the branch predictor by allowing predictions of low confidence to be avoided. In this study we present a new approach to gather branch prediction confidence with little or no overhead, and use this confidence mechanism to determine whether dual path execution or branch prediction should be used. Comparing this hybrid predictor model to the dynamic branch predictor shows a dramatic decrease in misprediction rate, which translates to a reduction in runtime of over 20%. These results imply that dual path execution, which often is thought to be an excessively resource-consuming method, may be a worthy approach if restricted with an appropriate predicting set. %Z Thu, 15 Oct 1998 15:22:27 GMT %A Feng, Wu-chang %A Kandlur, Dilip D. %A Saha, Debanjan %A Shin, Kang G. 
%T Adaptive Packet Marking for Providing Differentiated Services in the Internet %I Computer Science and Engineering, University of Michigan %R CSE-TR-347-97 %D October 16, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-347-97-cover.ps.gz %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-347-97.ps.gz %X This paper examines the use of adaptable priority marking for providing soft bandwidth guarantees to individual connections or connection groups over the Internet. The proposed scheme does not require resource reservation for individual connections and can be supported with minimal changes to the network infrastructure. The scheme uses modest support from the network in the form of priority handling for appropriately marked packets and relies on intelligent transmission control mechanisms at the edges of the network to achieve the desired throughput levels. The paper describes the control mechanisms and evaluates their behavior in various network environments, thereby making them suitable for deployment in an evolving Internet. %Z Thu, 15 Oct 1998 15:22:46 GMT %A Tam, Edward %A Rivers, Jude %A Davidson, Edward S. %T Flexible Timing Simulation of Multiple-Cache Configurations %I Computer Science and Engineering, University of Michigan %R CSE-TR-348-97 %D October 10, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-348-97-cover.ps.gz %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-348-97.ps.gz %X As the gap between processor and memory speeds increases, cache performance becomes more critical to overall system performance. Behavioral cache simulation is typically used early in the design cycle of new processor/cache configurations to determine the performance of proposed cache configurations on target workloads. However, behavioral cache simulation does not account for the latency seen by each memory access. The Latency-Effects (LE) cache model presented in this paper accounts for this nominal latency as well as the additional latencies due to trailing-edge effects, bus width considerations, port conflicts, and the number of outstanding accesses that a cache allows before it blocks. We also extend the LE cache model to handle the latency effects of moving data among multiple caches. mlcache, a new, easily configurable and extensible tool, has been built based on the extended LE model. We show the use of mlcache in estimating the performance of traditional and novel cache configurations, including odd/even, 2-level, Assist, Victim, and NTS caches. We also show how the LE cache timing model provides more useful, realistic performance estimates than other possible behavioral-level cache timing models. %Z Thu, 15 Oct 1998 15:23:05 GMT %A Feng, Wu-chang %A Kandlur, Dilip D. %A Saha, Debanjan %A Shin, Kang G. %T Techniques for Eliminating Packet Loss in Congested TCP/IP Networks %I Computer Science and Engineering, University of Michigan %R CSE-TR-349-97 %D November 4, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-349-97-cover.ps.gz %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-349-97.ps.gz %X The congestion control mechanisms used in TCP have been the focus of numerous studies and have undergone a number of enhancements. However, even with these enhancements, TCP connections still experience alarmingly high loss rates, especially during times of congestion. To alleviate this problem, the IETF is considering active queue management mechanisms, such as RED, for deployment in the network. 
In this paper, we first show that even with RED, loss rates can only be reduced marginally in times of congestion. We then show that the solution to this problem lies in understanding the traffic generated by an aggregation of a large number of sources. Given this, we then propose and experiment with more adaptive active queue management algorithms and more intelligent end-host mechanisms which can significantly reduce loss rates across congested links. When used together, these mechanisms can effectively eliminate packet loss while maintaining high link utilizations under the most difficult scenarios. %Z Thu, 15 Oct 1998 15:23:18 GMT %A Shaikh, Anees %A Rexford, Jennifer %A Shin, Kang G. %T Dynamics of Quality-of-Service Routing with Inaccurate Link-State Information %I Computer Science and Engineering, University of Michigan %R CSE-TR-350-97 %D November 17, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-350-97-cover.ps.gz %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-350-97.ps.gz %X Quality-of-service (QoS) routing can satisfy application performance requirements and optimize network resource usage by selecting paths based on connection traffic parameters and link load information. However, effective path-selection schemes require the distribution of link-state information, which can place a significant burden on the bandwidth and processing resources in the network. We investigate the fundamental tension between network overheads and the quality of routing decisions in the context of source-directed QoS routing algorithms. In contrast to previous performance studies that compare different routing algorithms under specific network configurations, we characterize how the performance and overheads of QoS routing relate to the link-state update policies, as a function of the underlying traffic load, network topology, and link-cost metrics. We explore the interplay between stale link-state information and random fluctuations in traffic load through a broad set of simulation experiments on a parameterized model of QoS routing. These results suggest ways to tune the frequency of link-state update messages and the link-cost functions to strike a careful balance between high accuracy and low complexity. %Z Thu, 15 Oct 1998 15:23:23 GMT %A Shim, Hyong Sop %A Hall, Robert W. %A Litiu, Radu %A Prakash, Atul %T Stateful Multicast Services for Supporting Collaborative Applications %I Computer Science and Engineering, University of Michigan %R CSE-TR-351-97 %D November 17, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-351-97-cover.ps.gz %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-351-97.ps.gz %X Collaborative, multi-user applications require group multicast services with ordering guarantees for maintaining consistency of replicated shared state among collaborating processes. In traditional group multicast services (such as Isis), the group's shared state is maintained only by the clients and the multicast service facilitates atomic state transfer from existing clients to new members. 
In this paper, we argue that, in order to support collaborative applications in Internet-type environments, the multicast service should also maintain a copy of each group's state. With such a copy, the multicast service can provide consistently fast group-join and state transfer times when both slow and fast clients are present or when clients are unreliable - a crucial requirement in collaborative, multi-user applications where users may dynamically join and leave a collaborative session and expect predictable join times and interactive response time even in the presence of slow or unreliable clients. We show that the overheads incurred by a multicast service in managing each group's shared state can be made minimal and that the multicast service does not have to be aware of the semantics of the group's state. We present the design of such a multicast service, present performance results, and discuss how it meets the various needs of collaborative applications. %Z Thu, 15 Oct 1998 15:23:51 GMT %A Litiu, Radu %A Prakash, Atul %T Individual and Group QoS Issues in Communication Services for Groupware Systems %I Computer Science and Engineering, University of Michigan %R CSE-TR-352-97 %D November 17, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-352-97-cover.ps.gz %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-352-97.ps.gz %X The proliferation of computer networks in the last decade and the ubiquity of the World Wide Web have led to increased interest in the development of computer-supported cooperative work (CSCW) systems. Collaborative, multi-user applications require group multicast services that provide ordering guarantees for maintaining consistency of replicated shared context as well as provide a high degree of interactivity, even under varying load on the multicast servers. While the most common view of the quality of service (QoS) in a distributed system is in terms of the guarantee of the network connection parameters (bandwidth, end-to-end delay), in this paper we investigate QoS from the perspective of the various requirements placed on group communication servers with limited resources by multiple and diverse groups of collaborative users. We show that in the absence of QoS considerations in the design of a group communication service, some groups or individual users can be severely affected by bursty traffic or by an increase in the size of other groups. We present the design of a best-effort QoS-based adaptive group communication service for supporting reliable data communication in CSCW systems. Our QoS considerations address both group requirements and individual user requirements. We present performance results showing the effectiveness of the approach and discuss some of the open issues for future work. %Z Thu, 15 Oct 1998 15:24:11 GMT %A Zhang, Xi %A Shin, Kang G. %A Saha, Debanjan %A Kandlur, Dilip %T Scalable Flow Control for Multicast ABR Services in ATM Networks %I Computer Science and Engineering, University of Michigan %R CSE-TR-353-97 %D November 20, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-353-97-cover.ps.gz %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-353-97.ps.gz %X We propose an efficient flow-control scheme for ATM ABR multicast services. We develop a second-order rate-control algorithm to deal with the variation in feedback delay resulting from dynamic "drift" of the bottleneck location within a multicast tree. The proposed scheme makes the rate process converge to the available bandwidth of the multicast-connection's most congested link.
It also confines the buffer occupancy to a target regime bounded by a given (finite) buffer capacity at the bottleneck node. Using fluid approximation, we model the proposed scheme and study the system dynamics under the most stressful traffic conditions. We derive expressions for queue buildups and average throughputs in both transient and equilibrium states. We identify the system control factors that govern the system dynamics and develop an optimal control condition which guarantees monotonic convergence of the system state to the target regime from an arbitrary initial value. %Z Thu, 15 Oct 1998 15:24:24 GMT %A Zhang, Xi %A Shin, Kang G. %A Zheng, Qin %T Integrated Rate and Credit Based Flow Control for Unicast ABR Service in ATM Networks %I Computer Science and Engineering, University of Michigan %R CSE-TR-354-97 %D November 20, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-354-97-cover.ps.gz %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-354-97.ps.gz %X We propose a flow-control scheme that combines the merits of credit- and rate-based flow-control schemes by applying direct control over both bandwidth and buffer resources. The goal of the proposed scheme is to design an optimal rate-control policy for a given finite buffer capacity by maximizing average throughput and bounding end-to-end delay. By applying the second-order rate control, the proposed scheme not only makes the rate process converge to the neighborhood of link bandwidth, but also confines the queue-length fluctuation to a regime bounded by buffer capacity. Using hop-by-hop credit flow control, the proposed scheme ensures lossless transmission with a finite buffer capacity while maintaining high network utilization. The source dynamically adjusts its transmission rate and rate-control parameters according to both the network congestion status and whether or not the system states are in the target operating regime. Using the fluid approximation method, we model the proposed flow-control scheme and study the system dynamic behavior for ABR (Available Bit Rate) service under the most stressful traffic condition. We derive the expressions for queue build-ups and average throughput in the equilibrium states as functions of rate-control parameters, feedback delay, link bandwidth, and buffer capacity. Based on the analytical results, we identify the optimal equilibrium state and propose the second-order rate control algorithm for transient-state flow control. We derive a sufficient condition which guarantees that the second-order rate control drives the system from any transient state to an optimal equilibrium state. The worst-case performance bounds, derived as closed-form functions of flow-control parameters, provide an analytical means of evaluating the performance of the second-order rate control in terms of convergence speed and buffer utilization. The analytical results for both single and multiple connection scenarios have shown the proposed scheme to be stable and efficient in that the source rate and bottleneck queue length rapidly converge to a small neighborhood of the designated operating point. Also presented are examples showing that the proposed scheme outperforms the other existing schemes.
%Z Thu, 15 Oct 1998 15:24:38 GMT %A Wallace, Charles %T The Semantics of the Java Programming Language: Preliminary Version %I Computer Science and Engineering, University of Michigan %R CSE-TR-355-97 %D December 9, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-355-97-cover.ps.gz %U ftp://ftp.eecs.umich.edu/techreports/cse/1997/CSE-TR-355-97.ps.gz %X A mathematical model of the Java programming language is constructed. %Z Thu, 15 Oct 1998 15:24:51 GMT %A Zou, Hengming %A Jahanian, Farnam %T Real-Time Primary-Backup (RTPB) Replication with Temporal Consistency Guarantees %I Computer Science and Engineering, University of Michigan %R CSE-TR-356-98 %D February 13, 1998 %U ftp://ftp.eecs.umich.edu/techreports/cse/1998/CSE-TR-356-98-cover.ps.gz %U ftp://ftp.eecs.umich.edu/techreports/cse/1998/CSE-TR-356-98.ps.gz %X A common approach to building fault-tolerant distributed systems is to replicate servers that fail independently. The objective is to give the clients the illusion of service that is provided by a single server. The main approaches for structuring fault-tolerant servers are passive and active replication. This paper presents a primary-backup (passive) replication scheme for supporting fault-tolerant real-time applications. The proposed scheme, called the Real-Time PB (RTPB) replication service, is an elegant and simple approach that provides the benefits of fault-tolerance, real-time access, and temporal consistency guarantees that are not otherwise easily attainable. This paper formally defines two types of temporal consistency, namely external temporal consistency and inter-object temporal consistency. By introducing a key concept called phase variance, we are able to build our temporal consistency models and derive necessary and sufficient conditions that can be used as the basis for update and transmission scheduling that achieve temporal consistency guarantees. Furthermore, we prove that the phase variance used in the models can be bounded under various scheduling algorithms, namely EDF, Rate Monotonic, and Distance-Constrained Scheduling. The paper also presents an implementation of the real-time primary-backup replication scheme with the aforementioned temporal consistency models. This implementation was developed within the x-kernel architecture on the MK 7.2 microkernel from the Open Group. The results of a detailed performance evaluation of this implementation are also discussed. %Z Thu, 15 Oct 1998 15:25:06 GMT %A Abandah, Gheith A. %A Davidson, Edward S. %T Configuration Independent Analysis for Characterizing Shared-Memory Applications %I Computer Science and Engineering, University of Michigan %R CSE-TR-357-98 %D December 23, 1997 %U ftp://ftp.eecs.umich.edu/techreports/cse/1998/CSE-TR-357-98-cover.ps.gz %U ftp://ftp.eecs.umich.edu/techreports/cse/1998/CSE-TR-357-98.ps.gz %X Characterizing shared-memory applications provides insight for designing efficient systems and helps identify and correct application performance bottlenecks. Configuration dependent analysis is often used to simulate detailed application traces on a particular hardware model. The communication traffic and computation workload generated by the application trace are used as a characterization of this application. This paper demonstrates that configuration independent analysis is a useful tool to characterize shared-memory applications. Configuration independent analysis characterizes inherent application characteristics that do not change from one configuration to another.
While configuration dependent analysis is repeated for each target configuration, configuration independent analysis is only performed once. Moreover, configuration independent analysis does not require developing models for the target configurations and is faster than detailed simulation. However, configuration dependent analysis directly provides more information about specific configurations. A combination of the two analysis types constitutes a comprehensive and efficient methodology for characterizing shared-memory applications. In this paper, we show how configuration independent analysis is used to characterize eight aspects of application behavior: general characteristics, working sets, concurrency, communication patterns, communication variation over time, communication slack, communication locality, and sharing behavior. We illustrate the advantages and limitations of this approach by analyzing eight case-study benchmarks from the scientific and commercial domains and interpreting the results. %Z Thu, 15 Oct 1998 15:25:22 GMT %A Ganger, Gregory R. %A Worthington, Bruce L. %A Patt, Yale N. %T The DiskSim Simulation Environment Version 1.0 Reference Manual %I Computer Science and Engineering, University of Michigan %R CSE-TR-358-98 %D February 27, 1998 %U ftp://ftp.eecs.umich.edu/techreports/cse/1998/CSE-TR-358-98-cover.ps.gz %U ftp://ftp.eecs.umich.edu/techreports/cse/1998/CSE-TR-358-98.ps.gz %X DiskSim is an efficient, accurate and highly-configurable disk system simulator developed at the University of Michigan to support research into various aspects of storage subsystem architecture. It includes modules that simulate disks, intermediate controllers, buses, device drivers, request schedulers, disk block caches, and disk array data organizations. In particular, the disk drive module simulates modern disk drives in great detail and has been carefully validated against several production disks (with accuracy that exceeds that of any previously reported simulator). This manual describes how to configure and use DiskSim, which has been made publicly available with the hope of advancing the state-of-the-art in disk system performance evaluation in the research community. The manual also briefly describes DiskSim's internal structure and various validation results. %Z Thu, 15 Oct 1998 15:25:36 GMT %A Kravets, Victor N. %A Sakallah, Karem A. %T Constructive Multilevel Logic Synthesis Under Properties of Boolean Algebra %I Computer Science and Engineering, University of Michigan %R CSE-TR-359-98 %D March 9, 1998 %U ftp://ftp.eecs.umich.edu/techreports/cse/1998/CSE-TR-359-98-cover.ps.gz %U ftp://ftp.eecs.umich.edu/techreports/cse/1998/CSE-TR-359-98.ps.gz %X We describe a new constructive multilevel logic synthesis system that integrates the traditionally separate technology-independent and technology-dependent stages of modern synthesis tools. Dubbed M32, this system is capable of generating circuits incrementally based on both functional as well as structural considerations. This is achieved by maintaining a dynamic structural representation of the evolving implementation and by refining it through progressive introduction of gates from a target technology library. Circuit construction proceeds from the primary inputs towards the primary outputs. Preliminary experimental results show that circuits generated using this approach are generally superior to those produced by multi-stage synthesis.
%Z Thu, 15 Oct 1998 15:25:48 GMT %A Hao, Eric %T Block Enlargement Optimizations for Increasing the Instruction Fetch Rate in Block-Structured Instruction Set Architectures %I Computer Science and Engineering, University of Michigan %R CSE-TR-360-98 %D March 11, 1998 %U ftp://ftp.eecs.umich.edu/techreports/cse/1998/CSE-TR-360-98-cover.ps.gz %U ftp://ftp.eecs.umich.edu/techreports/cse/1998/CSE-TR-360-98.ps.gz %X To exploit larger amounts of instruction level parallelism, processors are being built with wider issue widths and larger numbers of functional units. Instruction fetch rate must also be increased in order to effectively exploit the performance potential of such processors. Block-structured ISAs are a new class of instruction set architectures that we designed to address the performance obstacles faced by processors attempting to exploit high levels of instruction level parallelism. The major distinguishing feature of a block-structured ISA is that it defines the architectural atomic unit (i.e., the instruction) to be a group of operations, which is called an atomic block. This dissertation defines an optimization, block enlargement, that can be applied to a block-structured ISA to increase the instruction fetch rate of a processor that implements that ISA. A compiler and a simulator for a block-structured ISA processor were constructed to evaluate the performance benefit of block-structured ISAs. This dissertation shows that for the SPECint95 benchmarks, the block-structured ISA processor executing enlarged atomic blocks and using simpler microarchitectural mechanisms to support wide-issue and dynamic scheduling outperforms a conventional ISA processor that also supports wide-issue and dynamic scheduling by 28% when assuming perfect branch prediction and by 15% when using real branch prediction. %Z Thu, 15 Oct 1998 15:26:07 GMT %A Srinivasan, Vijayalakshmi %T Improving Performance of an L1 Cache With an Associated Buffer %I Computer Science and Engineering, University of Michigan %R CSE-TR-361-98 %D March 20, 1998 %U ftp://ftp.eecs.umich.edu/techreports/cse/1998/CSE-TR-361-98-cover.ps.gz %U ftp://ftp.eecs.umich.edu/techreports/cse/1998/CSE-TR-361-98.ps.gz %X Memory access latencies are much larger than processor cycle times, and the trend has been for this gap to increase over time. Cache performance becomes critical in bridging this gap. However, since it is difficult to make a cache both large and fast, cache misses are expected to continue to have a significant performance impact. Victim caching, proposed by Jouppi, is an approach to decrease the miss ratio of direct-mapped caches without affecting their access time. NTS caching, proposed by Rivers, is a multilateral cache design scheme that improves performance of first-level (L1) caches based on the temporal locality of the reference pattern. We propose an improvement of these schemes, which we call NT-victim caching. Taking the lead from the NTS design, we have a bilateral L1 cache, having a main cache (cache A) and a small fully associative buffer (cache B). Cache B is similar to a victim buffer, and holds the victim block replaced on a miss. A cache block is temporal if, after it is brought into the cache, some word in that block is accessed more than once before the block is replaced. Unlike Victim caches, a block that is hit in cache B is swapped with a block in cache A only if it is temporal, and in most of our replacement strategies temporal blocks are less likely to be selected for replacement than non-temporal blocks.
Every cache block loaded into the L1 cache is monitored for temporal behavior by a hardware detection unit. Our results show that this design reduces the number of swaps between cache A and cache B, relative to the Victim cache, yet gives a comparable miss ratio. %Z Thu, 15 Oct 1998 15:26:33 GMT %A Abandah, Gheith Ali %T Reducing Communication Cost in Scalable Shared Memory Systems %I Computer Science and Engineering, University of Michigan %R CSE-TR-362-98 %D April 13, 1998 %U ftp://ftp.eecs.umich.edu/techreports/cse/1998/CSE-TR-362-98-cover.ps.gz %U ftp://ftp.eecs.umich.edu/techreports/cse/1998/CSE-TR-362-98.ps.gz %X Distributed shared-memory systems provide scalable performance and a convenient model for parallel programming. However, their non-uniform memory latency often makes it difficult to develop efficient parallel applications. Future systems should reduce communication cost to achieve better programmability and performance. We have developed a methodology, and implemented a suite of tools, to guide the search for improved codes and systems. As the result of one such search, we recommend a remote data caching technique that significantly reduces communication cost. We analyze applications by instrumenting their assembly-code sources. During execution, an instrumented application pipes a detailed trace to configuration independent (CIAT) and configuration dependent (CDAT) analysis tools. CIAT characterizes inherent application characteristics that do not change from one configuration to another, including working sets, concurrency, sharing behavior, and communication patterns, variation over time, slack, and locality. CDAT simulates the trace on a particular hardware model, and is easily retargeted to new systems. CIAT is faster than detailed simulation; however, CDAT directly provides more information about a specific configuration. The combination of the two tools constitutes a comprehensive and efficient methodology. We calibrate existing systems using carefully designed microbenchmarks that characterize local and remote memory, producer-consumer communication involving two or more processors, and contention when multiple processors utilize memory and interconnect. This dissertation describes these tools, illustrates their use by characterizing a wide range of applications and assessing the effects of architectural and technological advances on the performance of HP/Convex Exemplar systems, evaluates strengths and weaknesses of current system approaches, and recommends solutions. CDAT analysis of three CC-NUMA system approaches shows that current systems reduce communication cost by minimizing either remote latency or remote communication frequency. We describe four architecturally varied systems that are technologically similar to the low remote latency SGI Origin 2000, but incorporate additional techniques for reducing the number of remote communications. Using CCAT, a CDAT-like simulator that models communication contention, we evaluate the worthiness of these techniques. Superior performance is reached when processors supply cached clean data, or when a remote data cache is introduced that participates in the local bus protocol. %Z Fri, 16 Oct 1998 15:11:19 GMT %A Tam, Edward %A Rivers, Jude %A Tyson, Gary %A Davidson, Edward S.
%T mlcache: A Flexible Multi-Lateral Cache Simulator %I Computer Science and Engineering, University of Michigan %R CSE-TR-363-98 %D May 11, 1998 %U ftp://ftp.eecs.umich.edu/techreports/cse/1998/CSE-TR-363-98-cover.ps.gz %U ftp://ftp.eecs.umich.edu/techreports/cse/1998/CSE-TR-363-98.ps.gz %X As the gap between processor and memory speeds increases, cache performance becomes more critical to overall system performance. To address this, processor designers typically design for the largest possible caches that can still remain on the ever-growing processor die. However, alternate multi-lateral cache designs such as the Assist Cache, Victim Cache, and NTS Cache have been shown to perform as well as or better than larger, single-structure caches while requiring less die area. For a given die size, reducing the requirements to attain a given rate of data supply can allow more space to be dedicated to branch prediction, data forwarding, increasing the size of the reorder buffer, etc. Current cache simulators are not able to study a wide variety of multi-lateral cache configurations. Thus, the mlcache multi-lateral cache simulator was developed to help designers in the middle of the design cycle decide which cache configuration would best aid in attaining the desired performance goals of the target processor. mlcache is an event-driven, timing-sensitive simulator based on the Latency Effects cache timing model. It can be easily configured to model various multi-lateral cache configurations using its library of cache state and data movement routines. The simulator can be easily joined to a wide range of event-driven processor simulators such as RCM-brisc, Talisman, SimICS, and SimpleScalar. We use the SPEC95 benchmarks to illustrate how mlcache can be used to compare the performance of several different data cache configurations. %Z Fri, 16 Oct 1998 15:12:33 GMT %A Riepe, Michael %A Sakallah, Karem %T Transistor Level Micro Placement and Routing for Two-Dimensional Digital VLSI Cell Synthesis %I Computer Science and Engineering, University of Michigan %R CSE-TR-364-98 %D June 1, 1998 %U http://www.eecs.umich.edu/eecs/research/techreports/cse98.html %X Previous research into the problem of cell library synthesis for digital VLSI design has concentrated mostly on relatively simple 1-dimensional cell topologies for static CMOS designs. Recent interest has emerged in less constrained 2-dimensional topologies to support more complex non-dual circuits such as latches and flip flops, as well as high performance circuit families such as CVSL, PTL, and domino CMOS. We discuss a CAD methodology which supports a generalized placement and routing approach to the realization of mask geometry for such complex circuits. We explore the options available within this methodology, show how the placement and routing problems at the transistor level differ from those at the block level, and present some results for a prototype tool, TEMPO, which adopts this methodology. %Z Thu, 15 Oct 1998 15:27:39 GMT %A Zhang, Xi %A Shin, Kang G. %T A Scalable Flow-Control Algorithm for Point-to-Multipoint Communications in High-Speed Integrated Networks %I Computer Science and Engineering, University of Michigan %R CSE-TR-365-98 %D June 3, 1998 %U http://www.eecs.umich.edu/eecs/research/techreports/cse98.html %X We propose and study a scalable flow-control algorithm for multicast ABR (Available Bit Rate) service.
A key idea behind the algorithm is the Soft-Synchronization Protocol (SSP), which derives a single "consolidated" RM (Resource Management) cell at each multicast branch-point from feedback RM-cells of different downstream branches that are not necessarily responses to the same forward RM-cell in each synchronization cycle. Using balanced and unbalanced binary-tree models, we analyze the scalability of SSP in terms of the height and structure of the multicast tree. In contrast with the existing schemes, SSP is shown to be able to effectively support feedback synchronization and make the effective RM-cell roundtrip delay virtually independent of the multicast-tree's height and structure, thus making it scalable. Another key problem in multicast flow-control is how to deal with the variation of RM-cell roundtrip delay due to the randomly drifting multicast-tree bottleneck. This problem is handled by adapting the second-order rate-control parameter to roundtrip-delay variations. A fluid model is used to analyze the second-order rate control and study the system dynamics for multicast ABR service. We derive an analytical relationship between the second-order rate parameter and RM-cell roundtrip delay subject to both lossless transmission and finite buffer capacity constraints. This relationship ensures the feasibility of the second-order rate control in dealing with RM-cell roundtrip-delay variations and provides insight into how the required buffer space can be controlled by adjusting the rate parameter. We develop an optimal control condition, under which the second-order rate control guarantees monotonic convergence of the system state to the optimal regime from an arbitrary initial value. The proposed second-order rate-control algorithm is also shown to be feasible and optimal in buffer-allocation efficiency and fairness at the bottleneck. %Z Fri, 16 Oct 1998 15:14:35 GMT %A McDaniel, Patrick %A Jamin, Sugih %T A Scalable Key Distribution Hierarchy %I Computer Science and Engineering, University of Michigan %R CSE-TR-366-98 %D July 7, 1998 %U http://www.eecs.umich.edu/eecs/research/techreports/cse98.html %X As the use of the Internet for electronic commerce, audio and video conferencing, and other applications with sensitive content grows, the need for secure services becomes critical. Central to the success of these services is the support for secure public key distribution. Although there are several existing services available for this purpose, they are not very scalable, either because they depend on a centralized server or rely on ad hoc trust relationships. In this paper, we present and examine a flexible approach to certificate distribution scalable to arbitrarily large networks. We propose a two-level hierarchy where certificates can be independently authenticated by one or more peer authorities, called keyservers. Certificates for end-user and host entities are managed within local domains, called enterprises. By administering certificates close to the source, we reduce the load on the keyservers and the effects of network topology changes. We describe the design of our system and present a preliminary performance analysis based on traces of present-day DNS requests. %Z Fri, 16 Oct 1998 15:15:21 GMT %A Zou, H.
%A Jahanian, Farnam %T Optimization of a Real-Time Primary-Backup Replication Service %I Computer Science and Engineering, University of Michigan %R CSE-TR-367-98 %D July 10, 1998 %U http://www.eecs.umich.edu/eecs/research/techreports/cse98.html %X The primary-backup replication model is one of the commonly adopted approaches to providing fault-tolerant data services. The extension of the primary-backup protocol to the real-time environment, however, imposes the additional constraint of timing predictability, which requires a bounded overhead for managing redundancy. There is a trade-off between reducing system overhead and increasing (temporal) consistency between the primary and backup, and the way to achieve a balance between them is not obvious. In this paper, we explore ways to optimize the scheduling of update messages from primary to backup while maintaining the temporal consistency guarantees of the system. Depending on the purpose of the individual replication service and the environment in which it is running, very different objectives can be sought in the process of optimization. This paper considers the optimization problem from two perspectives: one aimed at minimizing the average temporal distance between the primary and backup, and the other aimed at minimizing the resources used in maintaining a given temporal constraint on the system. Corresponding optimization techniques have been developed for these two diverse objectives and an implementation of such techniques is also presented. The implementation is built on top of the existing RTPB model, which in turn was developed within the x-kernel architecture on the Mach OSF platform running MK (mach kernel) 7.2. Results of an experimental evaluation of the proposed optimization techniques are also discussed. %Z Fri, 16 Oct 1998 15:16:18 GMT %A Labovitz, Craig %A Malan, G.R. %A Jahanian, Farnam %T Origins of Internet Routing Instability %I Computer Science and Engineering, University of Michigan %R CSE-TR-368-98 %D July 17, 1998 %U http://www.eecs.umich.edu/eecs/research/techreports/cse98.html %X This paper examines the network routing messages exchanged between core Internet backbone routers. Internet routing instability, or the rapid fluctuation of network reachability information, is an important problem currently facing the Internet engineering community. High levels of network instability can lead to packet loss, increased network latency and time to convergence. At the extreme, high levels of routing instability have led to the loss of internal connectivity in wide-area, national networks. In an earlier study of inter-domain routing, we described widespread, significant pathological behaviors in the routing information exchanged between backbone service providers at the major U.S. public Internet exchange points. These pathologies included several orders of magnitude more routing updates in the Internet core than anticipated, large numbers of duplicate routing messages, and unexpected frequency components between routing instability events. The work described in this paper extends our earlier analysis by identifying the origins of several of these observed pathological Internet routing behaviors. We show that as a result of specific router vendor software changes suggested by our earlier analysis, the volume of Internet routing updates has decreased by an order of magnitude.
We also describe additional router software changes that can decrease the volume of routing updates exchanged in the Internet core by an additional 30 percent or more. We conclude with a discussion of trends in the evolution of Internet architecture and policy that may lead to a rise in Internet routing instability. %Z Thu, 15 Oct 1998 15:28:46 GMT %A Blass, A. %A Gurevich, Yuri %T Fixed-Choice and Independent-Choice Logics %I Computer Science and Engineering, University of Michigan %R CSE-TR-369-98 %D August 10, 1998 %U http://www.eecs.umich.edu/eecs/research/techreports/cse98.html %X We study extensions of first-order logic with the choice construct (choose x: φ(x)). We prove some results about Hilbert's epsilon operator, but in the main part of the paper we consider the case when all choices are independent. %Z Thu, 15 Oct 1998 15:29:00 GMT %A Lee, J.H. %A Prakash, Atul %T Malleable Shared Workspaces to Support Multiple Usage Paradigms %I Computer Science and Engineering, University of Michigan %R CSE-TR-370-98 %D August 12, 1998 %U http://www.eecs.umich.edu/eecs/research/techreports/cse98.html %X Several recent systems provide a room-based metaphor to represent shared workspaces that require use of multiple collaborative tools. These systems provide users with a fairly static usage paradigm of room-centered collaboration, requiring users to mold their collaborative activities to the paradigm rather than molding the paradigm to fit the requirements of their collaborative activities. In this paper, we propose a powerful and yet simple event-action based model, augmented with access control and multi-user features, that gives room-based systems a high degree of malleability, so that these systems can be adapted to provide support for a variety of collaborative facilities, such as call centers, mailboxes, shared repositories, and role-based collaboration, including facilities that are not necessarily anticipated by system designers. The model can be used by both system developers and end-users to customize a system to meet the requirements of their group tasks. %Z Thu, 15 Oct 1998 15:29:13 GMT %A Huggins, J.K. %A Van Campenhout, David %T Specification and Verification of Pipelining in the ARM2 RISC Microprocessor %I Computer Science and Engineering, University of Michigan %R CSE-TR-371-98 %D August 13, 1998 %U http://www.eecs.umich.edu/eecs/research/techreports/cse98.html %X Gurevich Abstract State Machines (ASMs) provide a sound mathematical basis for the specification and verification of systems. An application of the ASM methodology to the verification of a pipelined microprocessor (an ARM2 implementation) is described. Both the sequential execution model and the final pipelined model are formalized using ASMs. A series of intermediate models are introduced that gradually expose the complications of pipelining. The first intermediate model is proven equivalent to the sequential model in the absence of structural, control, and data hazards. In the following steps, these simplifying assumptions are lifted one by one, and the original proof is refined to establish the equivalence of each intermediate model with the sequential model, leading ultimately to a full proof of equivalence of the sequential and pipelined models. %Z Thu, 15 Oct 1998 15:29:26 GMT %A Bozma, H.I. %A Koditschek, D.E.
%T Assembly as a Noncooperative Game of its Pieces: Analysis of 1D Sphere Assemblies %I Computer Science and Engineering, University of Michigan %R CSE-TR-372-98 %D August 14, 1998 %U http://www.eecs.umich.edu/eecs/research/techreports/cse98.html %X We propose an event-driven algorithm for the control of simple robot assembly problems based on noncooperative game theory. We rigorously examine the simplest setting - three bodies with one degree of freedom - and offer extensive simulations for the 2 DOF extension. The initial analysis and the accompanying simulations suggest that this approach may indeed offer an attractive means of building robust event-driven assembly systems. %Z Thu, 15 Oct 1998 15:29:41 GMT %A Postiff, Matt %A Tyson, Gary %A Mudge, Trevor %T Performance Limits of Trace Caches %I Computer Science and Engineering, University of Michigan %R CSE-TR-373-98 %D September 8, 1998 %U http://www.eecs.umich.edu/eecs/research/techreports/cse98.html %X A growing number of studies have explored the use of trace caches as a mechanism to increase instruction fetch bandwidth. The trace cache is a memory structure that stores statically noncontiguous but dynamically adjacent instructions in contiguous memory locations. When coupled with an aggressive trace or multiple branch predictor, it can fetch multiple basic blocks per cycle using a single-ported cache structure. This paper compares trace cache performance to the theoretical limit of a three-block fetch mechanism equivalent to an idealized 3-ported instruction cache with a perfect alignment network. Several new metrics are defined to formalize analysis of the trace cache. These include fragmentation, duplication, indexability, and efficiency metrics. We show that performance is more limited by branch mispredictions than by the ability to fetch multiple blocks per cycle. As branch prediction improves, high duplication and the resulting low efficiency are shown to be among the reasons that the trace cache does not reach its upper bound. Based on the shortcomings of the trace cache discovered in this paper, we identify some potential future research areas. %Z Thu, 15 Oct 1998 15:29:55 GMT %A Hardwick, J. %A Stout, Quentin %T Optimal Few-Stage Allocation Procedures %I Computer Science and Engineering, University of Michigan %R CSE-TR-374-98 %D September 11, 1998 %U http://www.eecs.umich.edu/eecs/research/techreports/cse98.html %X This paper gives very general algorithms for the design of optimal experiments involving two Bernoulli populations in which sampling is carried out in stages. It is assumed that the outcomes of the previous stage are available before the allocations for the next stage are decided. At each stage, one must decide how many observations to take and how many to sample from each of the alternative populations. Of particular interest are 2- and 3-stage experiments. To illustrate that the algorithms can be used for experiments of useful sample sizes, they are applied to estimation and optimization problems. Results indicate that, for problems of moderate size, published asymptotic analyses do not always represent the true behavior of the optimal stage sizes, and efficiency may be lost if the analytical results are used instead of the true optimal allocation.
Our results also suggest that one might approach large problems by extrapolating optimal solutions for moderate sample sizes, and that approaches of this sort could give design guidelines that are far more explicit (and hopefully closer to optimal) than those obtained through asymptotic analyses alone. The examples also show the ease with which the computational approach can solve problems that present analytical difficulties. This allows one to use models that more accurately represent important characteristics of the problem. It is also shown that in various cases the base algorithms can be modified to incorporate simplifications. In such settings, significant speedups and space reductions can be obtained, permitting the exact solution of even larger problems. %Z Fri, 16 Oct 1998 15:17:22 GMT %A Mehra, A. %A Shaikh, A. %A Abdelzaher, T. %A Wang, Z. %A Shin, Kang G. %T Realizing Services for Guaranteed-QoS Communication on a Microkernel Operating System %I Computer Science and Engineering, University of Michigan %R CSE-TR-375-98 %D September 17, 1998 %U http://www.eecs.umich.edu/eecs/research/techreports/cse98.html %X Provision of end-to-end QoS guarantees on communication necessitates appropriate support in the end systems (i.e., hosts) and network routers that form the communication fabric. Typically, the support is in the form of suitable extensions to the communication subsystem and the underlying operating system for specification and maintenance of QoS guarantees. This paper focuses on the architectural and implementation challenges involved in realizing QoS-sensitive host communication subsystems on contemporary microkernel operating systems with limited real-time support. We motivate and describe the components constituting our integrated service architecture that together ensure QoS-sensitive handling of network traffic at both sending and receiving hosts. We separate the policies from mechanisms in each component, demonstrating a communication framework that can implement alternative QoS models by applying appropriate policies. We also report the results of a detailed execution profile of the system to characterize communication costs for the purposes of admission control. An experimental evaluation in a controlled configuration demonstrates the efficacy with which QoS guarantees are maintained, despite limitations imposed by the underlying operating system. %Z Thu, 15 Oct 1998 15:30:59 GMT %A McDaniel, Patrick %A Jamin, Sugih %T Windowed Key Revocation in Public Key Infrastructures %I Computer Science and Engineering, University of Michigan %R CSE-TR-376-98 %D October 13, 1998 %U http://www.eecs.umich.edu/eecs/research/techreports/cse98.html %X A fundamental problem inhibiting the wide acceptance of a Public Key Infrastructure (PKI) in the Internet is the lack of a mechanism that provides scalable certificate revocation. In this paper, we propose a novel mechanism called Windowed Revocation. In windowed revocation, certificate revocation is announced for short periods in periodic Certificate Revocation Lists (CRLs). Due to the assurances provided by the protocol over which certificates are retrieved, we bound the amount of time that any certificate is cached by users. Thus, we can limit the announcement of revocation to the time during which the certificate may be cached, rather than until its expiration. Because the time during which certificates are announced is short, CRLs are similarly small.
By limiting the size of CRLs, we are able to integrate other mechanisms that increase the scalability of the PKI. One such mechanism is the use of "pushed" CRLs using multicast. We include a proof of the correctness of our approach. %Z Thu, 12 Nov 1998 13:29:04 GMT %A Hardwick, Janis %A Oehmke, Robert %A Stout, Quentin F. %T A Program for Sequential Allocation of Three Bernoulli Populations %I Computer Science and Engineering, University of Michigan %R CSE-TR-377-98 %D October 13, 1998 %U http://www.eecs.umich.edu/eecs/research/techreports/cse98.html %X We describe a program for optimizing and analyzing sequential allocation problems involving three Bernoulli populations. Previous researchers had considered this problem computationally intractable, and we know of no prior exact optimizations for such problems, even for very small sample sizes. Despite this, our program is currently able to solve problems of size 200 or more by using a parallel computer, and problems of size 100 on a workstation. We describe the program and the techniques used to enable it to scale to large sample sizes. As an illustrative example, the program is used to create an adaptive sampling procedure that is the optimal solution to a 3-arm bandit problem. The bandit procedure is then compared to two other allocation procedures along various metrics. We also indicate extensions of the program that enable it to solve a variety of related problems. %Z Thu, 12 Nov 1998 13:31:38 GMT %A Chong, Ronald %T Modeling Dual-Task Performance Improvement: Casting Executive Process Knowledge Acquisition as Strategy Refinement %I Computer Science and Engineering, University of Michigan %R CSE-TR-378-98 %D October 15, 1998 %U http://www.eecs.umich.edu/eecs/research/techreports/cse98.html %X People demonstrate a remarkable ability to perform complex, multiple-task activities in spite of the limitations of our sensory, perceptual, cognitive, and motor systems. A prominent theory that addresses how multiple-task activities are performed is that of the executive process. Some of the functions of the executive process include enforcing task priorities and arbitrating access to limited resources. It has been shown that a time-sharing skill (or executive-process knowledge) is acquired during training on dual-task combinations. This dissertation presents the development of a computational, task-independent framework for modeling the acquisition of the knowledge acquired during training on dual-task combinations - executive process knowledge. On a selected dual-task combination - a continuous tracking task and a discrete two-choice reaction time task - this framework, when given the declarative and procedural representation of the novice task, has produced an expert model whose performance is a good match to empirical reaction time and tracking error data for the task combination. There are three main contributions of this work. First is the development of EPIC-Soar, a symbolic hybrid architecture that possesses a psychologically-motivated learning mechanism and psychologically-plausible perception and motor systems. Second is the identification and classification of executive process knowledge and the taxonomies that result from this analysis.
Third is an acquisition framework that consists of: a novel data structure for representing task strategies; a task-independent procedure for resolving simultaneous access to motor resources and learning new knowledge that avoids such collisions in the future; a second task-independent learning procedure which refines the strategy data structure and creates new procedural knowledge for performing the task; and a collection of guidelines that regulate how and when promotions are applied. %Z Thu, 12 Nov 1998 13:34:26 GMT %A Wray, Robert %T Ensuring Reasoning Consistency in Hierarchical Architectures %I Computer Science and Engineering, University of Michigan %R CSE-TR-379-98 %D November 9, 1998 %U http://www.eecs.umich.edu/eecs/research/techreports/cse98.html %X Agents often dynamically decompose a task into a hierarchy of subtasks. Hierarchical task decomposition reduces the cost of knowledge design in comparison to non-hierarchical knowledge because knowledge is simplified and can be shared across multiple tasks. However, hierarchical decomposition does have limitations. In particular, hierarchical decomposition can lead to inconsistency in agent processing, resulting potentially in irrational behavior. Further, an agent must generate the decomposition hierarchy before reacting to an external stimulus. Thus, decomposition can also reduce agent responsiveness. This thesis describes ways in which the limitations of hierarchical decomposition can be circumvented while maintaining inexpensive knowledge design and efficient agent processing. We introduce Goal-Oriented Heuristic Hierarchical Consistency (GOHHC), which eliminates across-level inconsistency. GOHHC computes logical dependencies in asserted knowledge between goals rather than between individual assertions. Although this goal-oriented heuristic ensures consistency, it does sometimes lead to unnecessary repetition in reasoning and delay in task execution. We show empirically that these drawbacks are inconsequential in execution domains. Thus, GOHHC provides an efficient guarantee of processing consistency in hierarchical architectures. Ensuring consistency also provides an architectural framework for unproblematic knowledge compilation in dynamic domains. Knowledge compilation can be used to cache hierarchical reasoning and thus avoid the delay in reaction necessitated by decomposition. An empirical investigation of compilation in hierarchical agents shows that compilation can lead to improvement in both overall performance and responsiveness, while maintaining the low design cost of hierarchical knowledge. %Z Thu, 12 Nov 1998 13:36:07 GMT %A Lefurgy, Charles %A Mudge, Trevor %T Code Compression for DSP %I Computer Science and Engineering, University of Michigan %R CSE-TR-380-98 %D November 11, 1998 %U http://www.eecs.umich.edu/eecs/research/techreports/cse98.html %X Previous works have proposed adding compression techniques to a variety of architectural styles to reduce instruction memory requirements. It is not immediately clear how these results apply to DSP architectures. DSP instructions are longer and have potentially greater variation, which can decrease the compression ratio. Our results demonstrate that DSP programs do provide sufficient repetition for compression algorithms. We propose a compression method and apply it to the SHARC, a popular DSP architecture. Even using a very simple compression algorithm, it is possible to halve the size of the instruction memory requirements. %Z Mon, 16 Nov 1998 14:42:26 GMT %A Shaikh, A.
%A Rexford, Jennifer %A Shin, Kang G. %T Evaluating the Overheads of Source-Directed Quality-of-Service Routing %I Computer Science and Engineering, University of Michigan %R CSE-TR-381-98 %D December 7, 1998 %U http://www.eecs.umich.edu/eecs/research/techreports/cse98.html %X Quality-of-Service (QoS) routing satisfies application performance requirements and optimizes network resource usage by selecting paths based on connection traffic parameters and link load information. However, effective path-selection schemes require the distribution of link-state information, which can impose a significant burden on the bandwidth and processing resources in the network. We investigate the fundamental trade-off between network overheads and the quality of routing decisions in the context of the source-directed link-state routing protocols proposed for future IP and ATM networks. In contrast to previous work that compares different routing algorithms under specific network configurations, we construct a detailed model of QoS routing that parameterizes the path-selection algorithm, link-cost function, and link-state update policy. Through extensive simulation experiments with several representative network topologies and traffic patterns, we uncover the effects of stale link-state information and random fluctuations in traffic load on the routing and signaling overheads. We then investigate how the inaccuracy of link-state information interacts with the size and connectivity of the underlying topology. Finally, we show that by tuning the coarseness of the link-cost metric to the inaccuracy of underlying link-state information we can reduce the computational complexity of the path-selection algorithm without significantly degrading performance. The paper concludes by summarizing our key results as a list of guidelines for designing efficient quality-of-service routing policies in large backbone networks. %Z Mon, 21 Dec 1998 9:49:13 GMT %A Labovitz, C. %A Ahuja, A. %A Jahanian, F. %T Experimental Study of Internet Stability and Wide-Area Backbone Failures %I Computer Science and Engineering, University of Michigan %R CSE-TR-382-98 %D December 16, 1998 %U http://www.eecs.umich.edu/eecs/research/techreports/cse98.html %X In this paper, we describe an experimental study of Internet stability and the origins of failure in Internet protocol backbones. The stability of end-to-end Internet paths depends both on the underlying telecommunication switching system and on the higher-level software and hardware components specific to the Internet's packet-switched forwarding and routing architecture. Although a number of earlier studies have examined failures in the public telecommunication system, little attention has been given to the characterization of Internet stability. Our paper analyzes Internet failures from three different perspectives. We first examine several recent major Internet failures and their probable origins. These empirical observations illustrate the complexity of the Internet and show that unlike commercial transaction systems, the interactions of the underlying components of the Internet are poorly understood. Next, our examination focuses on the stability of paths between Internet Service Providers. Our analysis is based on the experimental instrumentation of key portions of the Internet infrastructure. Specifically, we logged all the routing control traffic at five of the largest U.S. Internet exchange points over a three-year period.
This study of network reachability information found unexpectedly high levels of path fluctuation and an aggregate low mean time between failures for individual Internet paths. These results point to a high level of instability in the global Internet backbone. While our study of the Internet backbone identifies major trends in the level of path instability between different service providers, these results do not characterize failures inside the network of a service provider. The final portion of our paper focuses on a case study of the network failures observed in a large regional Internet backbone. This examination of the internal stability of a network includes twelve months of operational failure logs and a review of the internal routing communication data collected between regional backbone routers. We characterize the type and frequency of failures in twenty categories, and describe the failure properties of the regional backbone as a whole. %Z Wed, 13 Jan 1999 10:38:57 GMT %A Weinstein, Peter C. %A Birmingham, William P. %T Agent Communication with Differentiated Ontologies: eight new measures of description compatibility %I Computer Science and Engineering, University of Michigan %R CSE-TR-383-99 %D January 7, 1999 %U http://www.eecs.umich.edu/eecs/research/techreports/cse99.html %X We propose an approach to achieve appropriate exchange of services and data in distributed systems subject to semantic heterogeneity. We assume differentiated ontologies: that terms have formal definitions as concepts related to other concepts, that local concepts inherit from concepts that are shared, and that most or all primitives are shared. We then develop measures of description compatibility using the structure of the source and target definitions. We evaluate these measures by generating description-logic ontologies in artificial worlds. In our simulations, the "meaning" of a concept is its denotation in a finite universe of instances. The accuracy of the description-compatibility measures can thus be judged by their success in predicting the overlap of concept denotations. Description compatibility can be used to guide agent search for services across communities that subscribe to differentiated ontologies. %Z Thu, 4 Feb 1999 8:33:01 GMT %A Wu, W-B. %A Ravishankar, C. %T The Performance of Tuple Difference Coding for Compressing Databases and Data Warehouses %I Computer Science and Engineering, University of Michigan %R CSE-TR-384-99 %D January 26, 1999 %U http://www.eecs.umich.edu/eecs/research/techreports/cse99.html %X Databases and data warehouses have grown dramatically in size in recent years, and we are already seeing data warehouses that are tens of terabytes in size. Compression is important in such systems because it reduces space requirements as well as response times for queries, which are typically I/O-bound. Conventional compression methods tend to compress and decompress objects in their entirety, and are generally unsuitable for databases which require selective access to points in the space of tuples they represent. The database compression method called Tuple-Difference Coding (TDC) has been shown to meet the requirements for database compression, and to achieve high compression ratios in practice. The factors affecting the performance of TDC are known, but their effects are not well understood. This paper presents a theoretical analysis of the performance of TDC, and verifies these results through simulations.
It presents analytical expressions for estimating the compression ratios when database tuples are composed of one or more attributes drawn from various distributions. It also studies the effects of attribute domain reordering on the compression ratio. Our simulations show excellent agreement with theory. %Z Thu, 4 Feb 1999 8:35:11 GMT %A Crestana, Viviane %A Soparkar, Nandit %T Mining Decentralized Data Repositories %I Computer Science and Engineering, University of Michigan %R CSE-TR-385-99 %D February 1, 1999 %U http://www.eecs.umich.edu/eecs/research/techreports/cse99.html %X Current technology for mining data typically applies to data stored centrally (i.e., in a single, homogeneous repository with central administration and a single schema). For instance, association rules algorithms assume that the data of interest is available at one location in a single table, or that prior to mining, the data will be warehoused centrally. However, it is important to study the mining of decentralized data (i.e., consisting of several tables, perhaps obtained via normalization or partitioning and allocation, stored in several repositories with possibly separate administration and schema). Real-life database design, management, and performance aspects suggest these considerations. While there have been a few extensions to mining algorithms for such data, the algorithms developed have largely been aimed at parallel processing (as opposed to the specific issues relating to decentralized data). In this paper, we examine the issues associated with mining decentralized data. We motivate the need for developing decentralized techniques, and identify the problems that must be addressed. We describe the mining of association rules for the case of a star schema wherein several smaller dimension tables are associated by foreign keys to a central fact table, and present an algorithm that adapts standard algorithms to work efficiently with such data. In contrast to approaches where the data would have been joined first to form a single table, we exploit the foreign key relationships to develop decentralized algorithms that execute concurrently on the separate tables, and thereafter merge the results. We provide analyses to assess our techniques, and we present empirical studies for particular cases to validate the performance. %Z Thu, 4 Feb 1999 8:37:08 GMT %A Tabe, Ted %A Stout, Quentin F. %T The Use of the MPI Communication Library in the NAS Parallel Benchmarks %I Computer Science and Engineering, University of Michigan %R CSE-TR-386-99 %D February 17, 1999 %U http://www.eecs.umich.edu/eecs/research/techreports/cse99.html %X The statistical analysis of traces taken from the NAS Parallel Benchmarks can tell one much about the type of network traffic that can be expected from scientific applications run on distributed-memory parallel computers. For instance, such applications utilize relatively few communication library functions, the lengths of their messages vary widely, they use many more short messages than long ones, and within a single application the messages tend to follow relatively simple patterns. Information such as this can be used by hardware and software designers to optimize their systems for the highest possible performance. %Z Fri, 12 Mar 1999 10:54:50 GMT %A Feng, W. %A Kandlur, D. %A Saha, D. %A Shin, Kang G.
%T BLUE: A New Class of Active Queue Management Algorithms %I Computer Science and Engineering, University of Michigan %R CSE-TR-387-99 %D March 15, 1999 %U http://www.eecs.umich.edu/eecs/research/techreports/cse99.html %X In order to stem the increasing packet loss rates caused by an exponential increase in network traffic, the IETF is considering the deployment of active queue management techniques such as RED [11]. While active queue management can potentially reduce packet loss rates in the Internet, this paper shows that current techniques are ineffective in preventing high loss rates. The inherent problem with these queue management algorithms is that they all use queue lengths as the indicator of the severity of congestion. In light of this observation, a fundamentally different active queue management algorithm called BLUE is proposed. BLUE uses packet loss and link idle events to manage congestion. Using simulation and controlled experiments, BLUE is shown to perform significantly better than RED both in terms of packet loss rates and buffer size requirements in the network. As an extension to BLUE, a novel technique for enforcing fairness among a large number of flows is described. In particular, this paper proposes and evaluates Stochastic Fair BLUE (SFB), a queue management algorithm which can identify and rate-limit non-responsive flows using a very small amount of state information. %Z Tue, 13 Apr 1999 11:36:48 GMT %A Kweon, Seok-Kyu %A Shin, Kang G. %T Distributed QoS Routing with Bounded Flooding for Real-Time Applications %I Computer Science and Engineering, University of Michigan %R CSE-TR-388-99 %D March 15, 1999 %U http://www.eecs.umich.edu/eecs/research/techreports/cse99.html %X Quality-of-Service (QoS) routing is indispensable to real-time applications on integrated services packet-switching networks. However, existing QoS-routing schemes based on either request flooding or link-state dissemination are not efficient in terms of their message overhead and/or operational cost. This paper proposes a QoS-routing algorithm with bounded flooding which not only dramatically reduces message overhead and operational cost, but also provides good performance. This significant overhead reduction is achieved by limiting the search area within which connection-request messages are flooded. Since our scheme is based on limited flooding, it requires neither on-demand path calculation nor link-state dissemination, both of which are known to be very expensive. In order to limit the search area, we employ a distance table at each node that stores the hop counts of the shortest paths to all the nodes reachable via its outgoing links. Unlike link-state routing, the proposed scheme only needs network-topology information, which changes much less frequently than link-load states. Extensive simulation results demonstrate the effectiveness of the proposed scheme in terms of performance, operational cost, and message overhead. In particular, under ordinary operating conditions in which the network's physical topology does not change frequently, it is shown to be much better than all the other existing QoS-routing schemes in terms of message overhead and operational cost, while providing reasonably good performance.
%Z Tue, 13 Apr 1999 11:38:58 GMT %A Dundas, James David %T Improving Processor Performance by Dynamically Pre-Processing the Instruction Stream %I Computer Science and Engineering, University of Michigan %R CSE-TR-389-99 %D April 8, 1999 %U http://www.eecs.umich.edu/eecs/research/techreports/cse99.html %X The exponentially increasing gap between processor and off-chip memory speeds, as measured in processor cycles, is rapidly turning memory latency into a major processor performance bottleneck. Traditional solutions, such as employing multiple levels of caches, are expensive and do not work well with some applications. We evaluate a technique, called runahead pre-processing, that can significantly improve processor performance. The basic idea behind runahead is to use the processor pipeline to pre-process instructions during cache miss cycles, instead of stalling. The pre-processed instructions are used to generate highly accurate instruction and data stream prefetches, while all of the pre-processed instruction results are discarded after the cache miss has been serviced: this allows us to achieve a form of very aggressive speculation with a simple in-order pipeline. The principal hardware cost is a means of checkpointing the sequential state of the register file and memory hierarchy while instructions are pre-processed. As we discard all pre-processed instruction results, the checkpointing can be accomplished with a modest amount of hardware. The instruction and data stream prefetches generated during runahead episodes led to a significant performance improvement for all the benchmarks we examined. We found that runahead typically led to about a 30% reduction in CPI for the four SPEC95 integer benchmarks that we simulated, while runahead was able to reduce CPI by 77% for the STREAM benchmark. This is for a five-stage pipeline with two levels of split instruction and data caches: 8KB each for L1, and 1MB each for L2. A significant result is that when the latency to off-chip memory increases, or if the caching performance for a particular benchmark is poor, runahead is especially effective because the processor has more opportunities in which to pre-process instructions. Finally, runahead appears particularly well suited for use with high clock-rate in-order processors that employ relatively inexpensive memory hierarchies. %Z Tue, 13 Apr 1999 11:40:56 GMT %A Riepe, Michael Anthony %T Transistor Level Micro Placement and Routing for Two-Dimensional Digital VLSI Cell Synthesis %I Computer Science and Engineering, University of Michigan %R CSE-TR-390-99 %D April 8, 1999 %U http://www.eecs.umich.edu/eecs/research/techreports/cse99.html %X The automated synthesis of mask geometry for VLSI leaf cells, referred to as the cell synthesis problem, is an important component of any structured custom integrated circuit design environment. Traditional approaches based on the classic functional cell style of Uehara & VanCleemput pose this problem as a straightforward one-dimensional graph optimization problem for which optimal solution methods are known. However, these approaches are only directly applicable to static CMOS circuits and they break down when faced with more exotic logic styles. There is an increasing need in modern VLSI designs for circuits implemented in high-performance logic families such as Cascode Voltage Switch Logic (CVSL), Pass Transistor Logic (PTL), and domino CMOS.
Circuits implemented in these non-dual ratioed logic families can be highly irregular, with complex geometry sharing and non-trivial routing. Such cells require a relatively unconstrained two-dimensional full-custom layout style which current methods are unable to synthesize. In this work we define the synthesis of complex two-dimensional digital cells as a new problem which we call transistor-level micro-placement and routing. To address this problem we develop a complete end-to-end methodology which is implemented in a prototype tool named TEMPO. A series of experiments on a new set of benchmark circuits verifies the effectiveness of our approach. Our methodology is centered on techniques for the efficient modeling and optimization of geometry sharing, which is supported at two different levels. Chains of diffusion-merged transistors are formed explicitly and their ordering optimized for area and global routing. In addition, more arbitrary merged structures are supported by allowing electrically compatible adjacent transistors to overlap during placement. The synthesis flow in TEMPO begins with a static transistor chain formation step. These chains are broken at the diffusion breaks and the resulting sub-chains are passed to the placement step. During placement, an ordering is found for each chain and a location and orientation is assigned to each sub-chain. Different chain orderings affect the placement by changing the relative sizes of the sub-chains and their routing contribution. We conclude with a detailed routing step and an optional compaction step. %Z Tue, 13 Apr 1999 11:58:11 GMT %A Galler, Bernard A. %T Report on the Study of University Policies on Intellectual Property %I Computer Science and Engineering, University of Michigan %R CSE-TR-391-99 %D April 12, 1999 %U http://www.eecs.umich.edu/eecs/research/techreports/cse99.html %X The purpose of the study was to identify issues which need to be considered by universities in establishing or upgrading policies regarding rights to intellectual property created by various members of the university community. A reading of a number of such policies reveals that there are many aspects in common, such as the treatment of intellectual property created while the work was funded by external sources. The main differences appear in the treatment of non-supported, or internally supported, intellectual property. This study looks at the policies developed by various universities with regard to software, courseware, and related publications, and the issues raised by these policies. %Z Tue, 13 Apr 1999 11:59:45 GMT %A Shaikh, A. %A Rexford, J. %A Shin, Kang G. %T Load-Sensitive Routing of Long-Lived IP Flows %I Computer Science and Engineering, University of Michigan %R CSE-TR-392-99 %D April 15, 1999 %U http://www.eecs.umich.edu/eecs/research/techreports/cse99.html %X Internet service providers face a daunting challenge in provisioning network resources, due to the rapid growth of the Internet and wide fluctuations in the underlying traffic patterns. The ability of dynamic routing to circumvent congested links and improve application performance makes it a valuable traffic engineering tool. However, deployment of load-sensitive routing is hampered by the overheads imposed by link-state update propagation, path selection, and signalling.
Under reasonable protocol and computational overheads, traditional approaches to load-sensitive routing of IP traffic are ineffective, and can introduce significant route flapping, since paths are selected based on out-of-date link-state information. Although stability is improved by performing load-sensitive routing at the flow level, flapping still occurs, because most IP flows have a short duration relative to the desired frequency of link-state updates. To address the efficiency and stability challenges of load-sensitive routing, we introduce a new hybrid approach that performs dynamic routing of long-lived flows, while forwarding short-lived flows on static preprovisioned paths. By relating the detection of long-lived flows to the timescale of link-state update messages in the routing protocol, route stability is considerably improved. Through simulation experiments using a one-week ISP packet trace, we show that our hybrid approach significantly outperforms traditional static and dynamic routing schemes, by reacting to fluctuations in network load without introducing route flapping. %Z Fri, 16 Apr 1999 10:41:15 GMT %A Bernacchia, Giuseppe %A Papaefthymiou, Marios %T Analytical Macro-Modeling for High-Level Power Estimation %I Computer Science and Engineering, University of Michigan %R CSE-TR-393-99 %D April 20, 1999 %U http://www.eecs.umich.edu/eecs/research/techreports/cse99.html %X In this paper, we present a new analytical macro-modeling technique for high-level power estimation. Our technique is based on a parameterizable analytical model that relies exclusively on statistical information about the circuit's primary inputs. During estimation, the statistics of the required metrics are extracted from the input stream and a power estimate is obtained by evaluating a model function that has been characterized in advance. Thus, our model yields power estimates within seconds, since it does not rely on the statistics of the circuit's primary outputs and, consequently, does not perform any simulation during estimation. Moreover, our model achieves significantly better accuracy than previous macro-modeling approaches, because it takes into account both spatial and temporal correlations in the input stream. In experiments with ISCAS-85 combinational circuits, the average absolute relative error of our power macro-modeling technique was at most 3%. For all but one circuit in our test suite, the worst-case error was at most 19.5%. In comparison with power estimates for a ripple-carry adder family that were obtained using Spice, the average absolute and worst-case errors of our model's estimates were 5.1% and 19.8%, respectively. In addition to power dissipation, our macro-modeling technique can be used to estimate the statistics of a circuit's primary outputs. Our experiments with ISCAS-85 circuits yield very low average errors. Thus, our technique is suitable for fast and accurate power estimation in core-based systems with pre-characterized blocks. Once the metrics of the primary inputs are known, the power dissipation of the entire system can be estimated by simply propagating this information through the blocks using their corresponding model functions. %Z Wed, 21 Apr 1999 10:33:12 GMT %A Patel, Sanjay J.
%T Trace Cache Design for Wide-Issue Superscalar Processors %I Computer Science and Engineering, University of Michigan %R CSE-TR-394-99 %D April 28, 1999 %U http://www.eecs.umich.edu/eecs/research/techreports/cse99.html %X To maximize the performance of a wide-issue superscalar processor, the fetch mechanism must be capable of delivering at least the same instruction bandwidth as the execution mechanism is capable of consuming. Fetch mechanisms consisting of a simple instruction cache are limited by the difficulty of fetching a branch and its taken target in a single cycle. Such fetch mechanisms will not suffice for processors capable of executing multiple basic blocks' worth of instructions. The trace cache is proposed to deal with the fetch bandwidth lost to branches. The trace cache is a structure which overcomes this partial fetch problem by storing logically contiguous instructions (instructions which are adjacent in the instruction stream) in physically contiguous storage. In this manner, the trace cache is able to deliver multiple non-contiguous blocks each cycle. This dissertation contains a description of the trace cache mechanism for a 16-wide issue processor, along with an evaluation of basic parameters of this mechanism, such as relative size and associativity. The main contributions of this dissertation are a series of trace cache enhancements which boost instruction fetch bandwidth by 34% and overall performance by 14% over an aggressive instruction cache. Also included is an analysis of two important performance limitations of the trace cache: branch resolution time and instruction duplication. %Z Tue, 15 Jun 1999 9:29:45 GMT
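%X As an illustration of the indexing idea described in the CSE-TR-394-99 abstract above, and not the design evaluated in that report, the following minimal Python sketch keys each trace by its starting fetch address plus the predicted branch directions, so that one lookup can deliver instructions from several non-contiguous basic blocks in a single fetch cycle. The entry count, trace length, replacement policy, and all identifiers are illustrative assumptions.

    # Minimal, illustrative trace-cache sketch (hypothetical parameters).
    # A trace is keyed by (start PC, packed branch-direction bits) and holds a
    # bounded number of instruction addresses that may span several basic blocks.
    class TraceCache:
        def __init__(self, num_entries=512, max_trace_len=16):
            self.num_entries = num_entries
            self.max_trace_len = max_trace_len
            self.entries = {}  # (start_pc, branch_bits) -> list of instruction PCs

        def _key(self, start_pc, branch_dirs):
            # Pack the predicted branch directions (e.g. [True, False]) into bits.
            bits = 0
            for taken in branch_dirs:
                bits = (bits << 1) | int(taken)
            return (start_pc, bits)

        def lookup(self, start_pc, branch_dirs):
            # A hit delivers the whole trace, i.e. several possibly
            # non-contiguous basic blocks, in one fetch cycle.
            return self.entries.get(self._key(start_pc, branch_dirs))

        def fill(self, start_pc, branch_dirs, retired_pcs):
            # A fill unit would build the trace from the retired (dynamic)
            # instruction stream; here it is simply truncated to the trace limit.
            if len(self.entries) >= self.num_entries:
                self.entries.pop(next(iter(self.entries)))  # naive replacement
            self.entries[self._key(start_pc, branch_dirs)] = retired_pcs[:self.max_trace_len]

    # Example: a taken branch at 0x108 makes 0x200 adjacent in the dynamic stream.
    tc = TraceCache()
    tc.fill(0x100, [True], [0x100, 0x104, 0x108, 0x200, 0x204])
    print(tc.lookup(0x100, [True]))   # hit: five instructions fetched together
    print(tc.lookup(0x100, [False]))  # miss: different predicted path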