CSE Technical Reports Sorted by Technical Report Number
| TR Number |
Title |
Authors |
Date |
Pages |
| We propose an approach to achieve appropriate exchange of services and data in distributed systems subject to semantic heterogeneity. We assume
_differentiated ontologies_: that terms have formal definitions as concepts
related to other concepts, that local concepts inherit from concepts that
are shared, and that most or all primitives are shared. We then develop
measures of _description compatibility_ using the structure of the source
and target definitions. We evaluate these measures by generating
description-logic ontologies in artificial worlds. In our simulations, the
"meaning" of a concept is its denotation in a finite universe of instances.
The accuracy of the description compatibility measures can thus be judged
by their success in predicting the overlap of concept denotations.
Description compatibility can be used to guide agent search for services
across communities that subscribe to differentiated ontologies.
|
| Databases and data warehouses have grown dramatically in size in recent years, and we are already seeing data warehouses that are tens of terabytes in size. Compression is important in such systems because it reduces space requirements as well as response times for queries, which are typically I/O-bound.
Conventional compression methods tend to compress and decompress objects in their entirety, and are generally unsuitable for databases which require selective access to points in the space of tuples they represent. The database compression method called Tuple-Difference Coding (TDC) has been shown to meet the requirements for database compression, and to achieve high compression ratios in practice. The factors affecting the performance of TDC are known, but their effects are not well understood.
This paper presents a theoretical analysis of the performance of TDC, and verifies these results through simulations. It presents analytical expressions for estimating the compression ratios when database tuples are composed of one or more attributes drawn from various distributions. It also studies the effects of attribute domain reordering on the compression ratio. Our simulations show excellent agreement with theory.
|
| Current technology for mining data typically applies to data stored centrally (i.e., in one single repository, homogeneous, with central
administration, and a single schema). For instance, association rules
algorithms assume that the data of interest is available at one
location in a single table, or that prior to mining, the data will be
warehoused centrally.
However, it is important to study the mining of decentralized data
(i.e., consisting of several tables, perhaps obtained via normalization
or partitioning and allocation, stored in several repositories with
possibly separate administration and schema). Real-life database design,
management, and performance aspects suggest these considerations.
While there have been a few extensions to mining algorithms for
such data, the algorithms developed have largely been aimed at
parallel processing (as opposed to the specific issues relating to
decentralized data).
In this paper, we examine the issues associated with mining decentralized
data. We motivate the need for developing decentralized techniques, and
identify the problems that must be addressed.
We describe the mining of
association rules for the case of a star schema wherein several smaller
dimension tables are associated by foreign keys to a central fact table,
and present an algorithm that adapts standard
algorithms to work efficiently with such data.
In contrast to approaches where
the data would have been joined first to form a single table, we exploit
the foreign key relationships to develop decentralized algorithms
that execute concurrently on the separate tables, and thereafter merge
the results.
We provide analyses to assess our techniques,
and we present empirical studies for particular cases to
validate the performance.
|
| The statistical analysis of traces taken from the NAS Parallel Benchmarks can tell one much about the type of network traffic that can be
expected from scientific applications run on distributed-memory parallel computers.
For instance, such applications
utilize a relatively few number of communication library functions, the
length of their messages is widely varying, they use many
more short messages than long ones, and within a single
application the messages tend to follow relatively simple patterns.
Information such as this can be used by hardware and software designers to
optimize their systems for the highest possible performance.
|
| In order to stem the increasing packet loss rates caused by an exponential increase in network traffic, the IETF is considering
the deploymentof active queue management techniques such as RED.
While active queue management can potentially reduce packet loss
rates in the Internet, this paper shows that current techniques
are ineffective in preventing high loss rates. The inherent problem
with these queue management algorithms is that they all use queue
lengths as the indicator of the severity of congestion. In light
of this observation, a fundamentally different active queue
management algorithm called Blue is proposed. Blue uses packet
loss and link idle events to manage congestion. Using simulation
and controlled experiments, Blue is shown to perform significantly
better than RED, both in terms of packet loss rates and buffer size
requirements in the network. As an extension to Blue, a novel technique
for enforcing fairness among a large number of flows is described.
In particular, this paper proposes and evaluates Stochastic Fair
Blue (SFB), a queue management algorithm which can identify and
rate-limit non-responsive flows using a very small amount of state
information. |
| Quality-of-Service (QoS) routing is indispensable to real-time applications on integrated services packet-switching networks.
However, existing QoS-routing schemes based on either request flooding
or link-state dissemination are not efficient in terms of their message
overhead and/or operational cost.
This paper proposes a QoS-routing algorithm with bounded
flooding which not only dramatically reduces message overhead and
operational cost, but also provides good performance.
This significant overhead reduction is achieved by limiting the search
area within which connection-request messages are flooded.
Since our scheme is based on limited flooding, it does not require
on-demand path calculation nor link-state dissemination, which are
often known to be very expensive.
In order to limit the search area, we employ a distance table at each
node storing hop counts of the shortest paths to all the
nodes reachable via its outgoing links.
Unlike link-state routing, the proposed scheme only needs
network-topology information that changes much less frequently
than link-load states.
Extensive simulation results have demonstrated the effectiveness of
the proposed scheme in terms of performance, operational cost, and
message overhead.
Especially, in an ordinary operating condition in which the networks
physical topology does not change frequently, it is shown to be much better
than all the other existing QoS-routing schemes in terms of message
overhead and operational cost, while providing reasonably good performance. |
| The exponentially increasing gap between processors and off-chip memory, as measured in processor cycles, is rapidly turning memory latency into a major processor performance bottleneck. Traditional solutions, such as employing multiple levels of caches, are expensive and do not work well with some applications. We evaluate a technique, called runahead pre-processing, that can significantly improve processor performance.
The basic idea behind runahead is to use the processor pipeline to pre-process instructions during cache miss cycles, instead of stalling. The pre-processed instructions are used to generate highly accurate instruction and data stream prefetches, while all of the pre-processed instruction results are discarded after the cache miss has been serviced: this allows us to achieve a form of very aggressive speculation with a simple in-order pipeline. The principal hardware cost is a means of checkpointing the sequential state of the register file and memory hierarchy while instructions are pre-processed. As we discard all pre-processed instruction results, the checkpointing can be accomplished with a modest amount of hardware.
The instruction and data stream prefetches generated during runahead episodes led to a significant performance improvement for all of the benchmarks we examined. We found that runahead typically led to about a 30% reduction in CPI for the four Spec95 integer benchmarks that we simulated, while runahead was able to reduce CPI by 77% for the STREAM benchmark. This is for a five stage pipeline with two levels of split instruction and data caches: 8KB each of L1, and 1MB each of L2. A significant result is that when the latency to off-chip memory increases, or if the caching performance for a particular benchmark is poor, runahead is especially effective as the processor has more opportunities in which to pre-process instructions. Finally, runahead appears particularly well suited for use with high clock-rate in-order processors that employ relatively inexpensive memory hierarchies.
|
| The automated synthesis of mask geometry for VLSI leaf cells, referred to as the cell synthesis problem, is an important component of any structured custom integrated circuit design environment. Traditional approaches based on the classic functional cell style of Uehara & VanCleemput pose this problem as a straightforward one-dimensional graph optimization problem for which optimal solution methods are known. However, these approaches are only directly applicable to static CMOS circuits and they break down when faced with more exotic logic styles.
There is an increasing need in modern VLSI designs for circuits implemented in high-performance logic families such as Cascode Voltage Switch Logic (CVSL), Pass Transistor Logic (PTL), and domino CMOS. Circuits implemented in these non-dual ratioed logic families can be highly irregular with complex geometry sharing and non-trivial routing. Such cells require a relatively unconstrained two-dimensional full-custom layout style which current methods are unable to synthesize. In this work we define the synthesis of complex two-dimensional digital cells as a new problem which we call transistor-level micro-placement and routing. To address this problem we develop a complete end-to-end methodology which is implemented in a prototype tool named TEMPO. A series of experiments on a new set of benchmark circuits verifies the effectiveness of our approach.
Our methodology is centered around techniques for the efficient modeling and optimization of geometry sharing, and is supported at two different levels. Chains of diffusion-merged transistors are formed explicitly and their ordering optimized for area and global routing. In addition, more arbitrary merged structures are supported by allowing electrically compatible adjacent transistors to overlap during placement. The synthesis flow in TEMPO begins with a static transistor chain formation step. These chains are broken at the diffusion breaks and the resulting sub-chains passed to the placement step. During placement, an ordering is found for each chain and a location and orientation is assigned to each sub-chain. Different chain orderings affect the placement by changing the relative sizes of the sub-chains and their routing contribution. We conclude with a detailed routing step and an optional compaction step.
|
| Internet service providers face a daunting challenge in provisioning network resources, due to the rapid growth of the Internet and wide
fluctuations in the underlying traffic patterns. The ability of
dynamic routing to circumvent congested links and improve application
performance makes it a valuable traffic engineering tool. However,
deployment of load-sensitive routing is hampered by the overheads
imposed by link-state update propagation, path selection, and
signalling. Under reasonable protocol and computational overheads,
traditional approaches to load-sensitive routing of IP traffic are
ineffective, and can introduce significant route flapping, since paths
are selected based on out-of-date link-state information. Although
stability is improved by performing load-sensitive routing at the flow
level, flapping still occurs, because most IP flows have a short
duration relative to the desired frequency of link-state updates. To
address the efficiency and stability challenges of load-sensitive
routing, we introduce a new hybrid approach that performs dynamic
routing of long-lived flows, while forwarding short-lived flows on
static preprovisioned paths. By relating the detection of long-lived
flows to the timescale of link-state update messages in the routing
protocol, route stability is considerably improved. Through
simulation experiments using a one-week ISP packet trace, we show that
our hybrid approach significantly outperforms traditional static and
dynamic routing schemes, by reacting to fluctuations in network load
without introducing route flapping. |
| In this paper, we present a new analytical macro-modeling technique for high-level power estimation. Our technique is based on a
parameterizable analytical model that relies exclusively on
statistical information of the circuits primary inputs. During
estimation, the statistics of the required metrics are extracted from
the input stream and a power estimate is obtained by evaluating a
model function that has been characterized in advance. Thus, our
model yields power estimates within seconds, since it does not rely on
the statistics of the circuits primary outputs and, consequently,
does not perform any simulation during estimation. Moreover, our
model achieves significantly better accuracy than previous
macro-modeling approaches, because it takes into account both spatial
and temporal correlations in the input stream.
In experiments with ISCAS-85 combinational circuits, the average
absolute relative error of our power macro-modeling technique was at
most 3%. For all but one circuit in our test suite, the worst-case
error was at most 19.5%. In comparison with power estimates for a
ripple-carry adder family that were obtained using Spice, the average
absolute and worst-case errors of our models estimates were 5.1% and
19.8%, respectively.
In addition to power dissipation, our macro-modeling technique can be
used to estimate the statistics of a circuits primary outputs. Our
experiments with ISCAS-85 circuits yield very low average errors.
Thus, our technique is suitable for fast and accurate power estimation
in core-based system with pre-characterized blocks. Once the metrics
of the primary inputs are known, the power dissipation of the entire
system can be estimated by simply propagating this information through
the blocks using their corresponding model functions. |
| To maximize the performance of a wide-issue superscalar processor, the fetch mechanism must be capable of delivering at least the same
instruction bandwidth as the execution mechanism is capable of consuming.
Fetch mechanisms consisting of a simple instruction cache are limited by
difficulty in fetching a branch and its taken target in a single cycle.
Such fetch mechanisms will not suffice for processors capable of executing
multiple basic blocks worth of instructions.
The Trace Cache is proposed to deal with lost fetch bandwidth due
to branches. The trace cache is a structure which overcomes this partial
fetch problem by storing logically contiguous
instructions---instructions which are adjacent in the instruction
stream---in physically contiguous storage. In this manner, the trace
cache is able to deliver multiple non-contiguous blocks each
cycle.
This dissertation contains a description of the trace cache mechanism
for a 16-wide issue processor, along with an evaluation of basic
parameters of this mechanism, such as relative size and associativity.
The main contributions of this dissertation are a series of trace cache
enhancements which boost instruction fetch bandwidth by 34% and overall
performance by 14% over an aggressive instruction cache. Also included
is an analysis of two important performance limitations of the trace
cache: branch resolution time and instruction duplication.
|
| In this paper, we explore the use of a Markov Decision Network to generate music. We show how the network is trained and used to generate
compositions of approximately 10 minutes in duration that are
representative of the style of the composer whose music was used to
train the network.
|
| The majority of research related to automated analysis of music presupposes human partitioning of the input into segments corresponding to significant harmonic or melodic chunks. In this paper, we analyze the difficulty of the partitioning problem in a formal manner. We then describe HarmAn, a system that partitions tonal music into harmonically significant segments corresponding to single chords and labels these segments with the proper chord labels. Input to the system consists of standard MIDI files and the output is both an annotated piano-roll style display and a text file with partitioning and chord-name information. Chord labels for segments are determined through template matching in the space of pitch-class with conflict resolution between equal scoring templates resolved through simple default preference rules. Our system’s results are compared with the results described in papers by Winograd (Winograd 1968), Maxwell (Maxwell 1992), and Temperley and Sleator (Temperley and Sleator 1999) |
| This paper gives an algorithm for harmonic analysis for for a tonal
composition
using a constraint-satisfaction problem (CSP) model. The algorithm combines
rule-based and preference-based approaches to perform harmonic analysis,
taking
the position that the problem can be modeled as a computational process and
less of a psychological phenomenon. Using cadential patterns within the piece,
it identifies tonal centers and assigns the harmonic analysis, applying
preferences when necessary. A software implementation of the algorithm is
presented, along with a discussion of its results. |
| Modern processors are very difficult to design and require advanced design verification methods to ensure that they function correctly. While
simulation-based verification is the primary method used by industry,
formal methods are gaining acceptance although they are limited to
relatively small designs. Equivalence checking, for example, is useful
for verifying the equivalence between two levels of a design hierarchy,
but is not applicable above the widely used register-transfer level (RTL).
For this reason, new design tools are necessary to verify that an RTL
description of an implementation matches its original instruction set
architecture (ISA), which specifies the processor components and
instructions visible to a programmer.
This Masters thesis proposes a design verification method, Reverse
Engineering Verification (REVE), based on analysis of an implementations
data flow. REVE takes a high-level RTL implementation of a new processor
design and extracts (reverse engineers) specification information from the
implementations internal data paths. The reverse-engineered information
is then matched with the original ISA specification to verify functional
correctness of the implementation. |
| Modern high-performance processors utilize multi-level cache structures to help tolerate the increasing latency (measured in processor cycles) of main
memory. These caches employ either a writeback or a write-through strategy
to deal with store operations. Write-through caches propagate data to more
distant memory levels at the time each store occurs, which requires a very
large bandwidth between the memory hierarchy levels. Writeback caches can
significantly reduce the bandwidth requirements between caches and memory by
marking cache lines as dirty when stores are processed and writing those
lines to the memory system only when that dirty line is evicted. This
approach works well for many applications (e.g. SPEC95), but for graphics
applications that experience significant numbers of cache misses due to
streaming data, writeback cache designs can degrade overall system
performance by clustering bus activity when dirty lines contend with data
being fetched into the cache.
In this paper we present a new technique called Eager Writeback, which
re-distributes and balances traffic by writing dirty cache lines to memory
prior to their eviction. This reduces the likelihood of writing dirty lines
that impede the loading of cache miss data. Eager Writeback can be viewed
as a compromise between write-through and writeback policies, in which dirty
lines are written later than write-through, but prior to writeback. We will
show that this approach can reduce the large number of writes seen in a
write-through design, while avoiding the performance degradation caused by
clustering bus traffic of a writeback approach.
|
| The growing difference between processor and main memory cycle time necessitates the use of more aggressive techniques to reduce or hide main
memory access latency. Prefetching data into higher speed memories is one such
technique. However, speculative prefetching can significantly increase memory
traffic.
We present a new technique, called Static Filtering (SF), to reduce the traffic
generated by a given hardware prefetching scheme while preserving its reduced
miss rate. SF uses profiling to select which load instructions should be marked
"enabled" to do data prefetching. This is done by identifying which load
instructions generate data references that are useful prefetch triggers. SF
enables the hardware prefetch mechanism only for the set of references made by
"enabled" loads. Our results from applying SF to two well-known hardware
prefetching techniques, Next Sequential Prefetching (NSP) and Shadow Directory
Prefetching (SDP), shows that SF preserves the decrease in misses that they
achieve and reduces the prefetch traffic by 50 to 60% for NSP and by 64 to 74%
for SDP. In addition, timing analysis reveals that when finite memory bandwidth
is a limiting factor, applying SF does in fact increase the speedup obtained by
a baseline hardware prefetching technique.
The other major contribution of this paper is a complete taxonomy which
classifies individual prefetches in terms of the additional traffic they
generate and the resulting reduction (or increase) in misses. This taxonomy
provides a formal method for classifying prefetches by their usefulness. A
histogram of the prefetches by category provides a new basis for comparing
prefetch techniques. |
| Many schemes have been proposed that incorporate an auxiliary buffer to improve theperformance of a given size cache. One of the most thoroughly evaluated of
these schemes, Victim caching, aims to reduce the impact of conflict misses in
direct-mappedcaches. While Victim has shown large performance benefits, its
competitive advantage is limited to direct-mapped caches, whereas todays
caches are increasingly associative. Fur-thermore, it requires a costly data
path for swaps and saves between the cache and the buffer.Several other schemes
attempt to obtain the performance improvements of Victim, but across a wide
range of associativities and without the costly data path for swaps and
saves.While these schemes have been shown to perform well overall, their
performance still lags that of Victim when the main cache is direct-mapped.
Furthermore, they also requirecostly hardware support, but in the form of
history tables for maintaining allocation decision information.This paper
introduces a new cache management scheme, Allocation By Conflict (ABC), which
generally outperforms both Victim and the history-based schemes. Further-more,
ABC has the lowest hardware requirements of any proposed scheme -- only a
single additional bit per block in the main cache is required to maintain the
informationrequired by the allocation decision process, and no swap-save data
path is needed. |
| Wireless links are known to suffer location-dependent, time-varying, and bursty errors. This paper considers adaptive error-control schemes
in the data link layer for reliable communication over wireless links,
in which the error-control code and frame length are chosen adaptively
based on the estimated channel state/condition. Three error-control
schemes are considered according to (1) the number of code segments a
packet is divided into, and (2) the way a lost packet is
retransmitted. Through throughput performance and computation
complexity analyses, these three schemes are compared, and then one of
them is claimed to be the most attractive in terms of computation
complexity and practicality even though its throughput performance is
not the best. Simulation results also verify that this scheme works
well over a time-varying fading channel. Error control for the MAC
header and its effect on the performance of each error-control scheme
are also considered since, without proper error protection for the
header, it would be futile to exercise error control on the user data.
|
| This report examines how the compiler can more efficiently use a large number of processor registers. The placement of data items into registers, called register allocation, is known to be one of the most important compiler optimizations for high-speed computers because registers are the fastest storage devices in the computer system. However, register allocation has been limited in scope because of aliasing in the memory system. To break this limitation and allow more data to be placed into registers, new compiler and microarchitecture support is needed. We propose the modification of register access semantics to include an indirect access mode. We call this optimization the Smart Register File. The smart register file allows the relaxation of overly-conservative assumptions in the compiler by having the hardware provide support for aliased data items in processor registers. As a result, the compiler can allocate data from a larger pool of candidates than in a conventional system. An attendant advantage is that the smart register file reduces the number of load and store operations executed by the processor. The simple addition of an indirect register access mode not only simplifies alias handling, but also provides opportunities for other optimizations. This research examines several such optimizations.
|
| Techniques for coordinating distributed executions includeprocess groups and transactions, each with different properties in a
trade-off in terms of the degrees of consistency and performance.
Transaction concurrency control theory may be used to characterize
process group multicast semantics; our framework
which uses conflict relations among events, and the acyclicity
properties of a particular precedence graph, may be used to describe
the correctness of multicast orderings. In this paper, we describe
protocols to realize new relaxed multicast orderings as obtained by
the use of concurrency control techniques. In particular, we
demonstrate efficient, time-stamp based protocols for relaxed
orderings that depend on data, operation, or any similar attributes
that bring about such weaker orderings. In fact, we provide a
protocol for any arbitrary reflexive and symmetric relation that may
be defined among the message types, and this makes our approach
amenable to incorporating application semantics into multicast
protocols. Using the precedence graph acyclicity framework, we can
show that our protocols produce correct multicast delivery orderings.
Furthermore, we discuss the efficiency of our protocols in terms of
some qualitative performance metrics.
|
| We report on our empirical studies of a new controller for a two-link brachiating robot. Motivated by the pendulum-like motion of an apes brachiation, we encode this task as the output of a ``target dynamical system.' Numerical simulations
indicate that the resulting controller solves a number of brachiation problems that we term the ``ladder', ``swing up'
and ``rope' problems. Preliminary analysis provides some explanation for this success. The proposed controller is
implemented on a physical system in our laboratory. The robot achieves behaviors including ``swing locomotion'
and ``swing up' and is capable of continuous locomotion over several rungs of a ladder. We discuss a number of formal
questions whose answers will be required to gain a full understanding of the strengths and weaknesses of this approach. |
| We propose an efficient flow and error control scheme for high-throughput transport protocols, which (i) decouples
flow control from error control, and (ii) uses a second-order rate
control, called $alpha$-control,
for flow control and a new sliding-window
scheme for error control. The $alpha$-control
minimizes the need for packet retransmissions by
adjusting the rate-gain parameter in response to the variations
of cross-traffic flows and their round-trip delays (RTDs).
Using selective retransmission,
the sliding-window scheme guarantees lossless transmission.
Separation of flow and error control simplifies both components
and enhances the throughput since the
source rate control is independent of the dynamics of
the error-control window.
Applying the $alpha$-control, the proposed scheme
can drive the flow-controlled system
to a retransmission-free equilibrium state.
Using the fluid analysis, we model the packet-loss behavior
and derive the closed-form expressions for packet losses,
loss rate, and the link-transmission efficiency.
We prove that the $alpha$-control is
feasible and optimal linear control in terms of efficiency
and fairness. Also presented are the vector-space analysis
and simulation results that verify the analytical results,
and demonstrate the superiority of the proposed scheme to others
in dealing with cross-traffic and RTDs variations, controlling
packet losses/retransmissions, and achieving buffer-use fairness and
a high throughput.
|
| There is a proliferation of application-level servers that offer enhanced network services beyond the
simple packet forwarding provided by the underlying network
infrastructure. Examples of such servers range from Web server
mirrors, Web caches, Web page distilling proxies,
video transcoders, and application-level multicast routers, to
application-level load-adaptive multipath routers. A fundamental
question arising from the deployment of such servers is
where to place them within a network. This paper explores technical
issues related to the creation of an infrastructure to allow
self-organization of network service placement based on observed
demand for each service. In so doing, we propose a framework, called
Sortie, whereby services are allocated on network nodes based on a set
of very simple rules independently executed by each node. The
distributed nature of allocation decisions ensures the scalability of
the framework. We also present simulation results confirming the
stability and efficiency of the proposed framework.
|
| The active (state-machine) replication protocol has been one of the commonly adopted approaches to provide fault tolerant data services. In such a scheme,
service state is replicated at all participating servers and client requests
are presented to emph{all} the servers. A strict consensus protocol executed
by the servers ensures that all servers receive all the requests in the same
order. The output of each request from all the servers are then compared to
mask the result of faulty server(s). However, the overhead associated with
running the strict consensus protocol tends to slow down the response to client
requests, which makes it unsuitable for real-time applications where timing
predictability is a necessity. This paper presents a weak consensus model
called ``temporal consensus' that reduces such overhead and enable the active
scheme to meet the timing constraint of real-time applications.
|
| System and application failures are all too common. In this paper,we argue that operating systems should provide the abstraction of
failure transparency--the illusion that systems and applications do
not fail. We construct a theory of consistent recovery that is
useful for systems that want to provide failure transparency. The
theory defines precisely what constitutes consistent recovery and
provides a simple invariant that all applications must uphold to
guarantee they get it. The theory unifies the various recovery
protocols for achieving consistent recovery; existing protocols can
be viewed as different ways of upholding our theorys invariant. We
use the theory to design new recovery protocols and analyze existing
protocols. We conclude by evaluating performance for a suite of
recovery protocols. We focus our study on interactive programs, and
application domain for which it is challenging to provide failure
transparency. Our results indicate that it is possible to provide
failure transparency for general applications with overhead of 1-2%
for systems with reliable main memory, and 10-40% for disk-based
systems. |
| Checkpointing is a general technique for recovering applications.Unfortunately, current checkpointing systems add many seconds of
overhead per checkpoint. Their high overhead prevents them from
making failures transparent to users and other external entities, so
failures lose visible state. This paper presents a checkpointing
system called Discount Checking that is built on reliable main
memory and high-speed transactions. Discount Checking can be used to
make general-purpose applications recoverable easily and with low
overhead. The checkpoints taken by Discount Checking are extremely
fast, ranging for our target applications from 50 microseconds to a
few milliseconds. Discount Checkings low overhead makes it
possible to provide ideal failure transparency by checkpointing
each externally visible event. Yet even with this high rate of
checkpointing, Discount Checking slows real applications down by
less than 0.6%. |
| Many portable and embedded applications are characterized by spending a large fraction of their execution time on
small program loops. In these applications, instruction fetch
energy can be reduced by using a small instruction cache when
executing these tight loops. Recent work has shown that it is
possible to use a small instruction cache without incurring
any performance penality [4, 6]. In this paper, we will extend the
work done in [6]. In the modified loop caching scheme proposed
in this paper, when a program loop is larger than the loop
cache size, the loop cache is capable of capturing only part
of the program loop without having any cache conflict problem.
For a given loop cache size, our loop caching scheme can reduce
instruction fetch energy more than other loop cache schemes
previously proposed. We will present some quantitative results
on how much power can be saved on an integrated embedded design
using this scheme.
|
| We address the problem of controlling large distributed robotic systems such as factories. We introduce tools which help us to compose local, hybrid control programs for a class of distributed
robotic systems, assuming a palette of controllers for individual tasks is already constructed. These
tools, which combine backchaining behaviors with Petri Nets, expand on successful work in sequential
composition of robot behaviors. We apply these ideas to the design of a robotic bucket brigade and
to simple, distributed assembly tasks.
|
| The advent of electronic commerce and personal communications on the Internet heightens concerns over the lack of
privacy and security. Network services providing a wide range of
security related guarantees are increasingly based on public key
certificates. A fundamental problem inhibiting the wide acceptance of
existing certificate distribution services is the lack of a scalable
certificate revocation mechanism. We argue in this paper that the
resource requirements of extant revocation mechanisms place
significant burden on certificate servers and network resources. We
propose a novel mechanism called Windowed Revocation that
satisfies the security policies and requirements of existing
mechanisms and, at the same time, reduces the burden on certificate
servers and network resources. We include a proof of correctness of
windowed revocation and a trace-based performance study illustrating
the scalability and general applicability of windowed revocation.
|
| As DRAM access latencies approach a thousand instruction-execution times and on-chip caches grow to multiple megabytes, it is not clear that conventional cache structures
continue to be appropriate. Two key features¾full associativity and software manage-ment¾have
been used successfully in the virtual-memory domain to cope with disk
access latencies. Future systems will need to employ similar techniques to deal with
DRAM latencies. This paper presents a practical, fully associative, software-managed sec-ondary
cache system that provides performance competitive with or superior to traditional
caches without OS or application involvement. We see this structure as the first step
toward OS- and application-aware management of large on-chip caches.
This paper has two primary contributions: a practical design for a fully associative
memory structure, the indirect index cache (IIC), and a novel replacement algorithm, gen-erational
replacement, that is specifically designed to work with the IIC. We analyze the
behavior of an IIC with generational replacement as a drop-in, transparent substitute for a
conventional secondary cache. We achieve miss rate reductions from 8% to 85% relative
to a 4-way associative LRU organization, matching or beating a (practically infeasible)
fully associative true LRU cache. Incorporating these miss rates into a rudimentary timing
model indicates that the IIC/generational replacement cache could be competitive with a
conventional cache at today’s DRAM latencies, and will outperform a conventional cache
as these CPU-relative latencies grow.
|
| To exploit larger amounts of parallelism, processors are being built with ever wider issue widths. Unfortunately, as issue widths grow, instruction
cache misses increasingly bottleneck performance.
Out-of-order fetch, decode, and issue reduces the bottleneck due to these
misses. The term *issue* denotes the act of writing instructions into
reservation stations, *fetch block* denotes the instructions brought into
the processor by an instruction cache fetch, and *branch predictor* denotes
the entire next fetch address generation logic. Upon encountering an
instruction cache miss, a conventional processor waits until the miss has
been serviced before continuing to fetch, decode, and issue any new fetch
blocks. A processor with out-of-order fetch, decode, and issue temporarily
ignores the block associated with the miss, and attempts to fetch, decode,
and issue the blocks that follow. The addresses of these blocks are
generated by the branch predictor. Because the predictor does not depend
on the instructions in the current block to generate the address of the
next fetch block, it can continue making predictions even if the current
block misses in the cache. Thus, the processor can skip the block that
missed and continue fetching, decoding, and issuing the following blocks
using the addresses generated by the predictor. After servicing the miss,
the processor can fetch, decode, and issue the skipped block.
This dissertation evaluates the *potential* performances of processors
with various issue widths and window sizes, shows that enough potential
performance exists to justify building a 16-wide processor, examines
bottlenecks that would prevent this processor from achieving its potential
performance, and demonstrates that the bottleneck due to instruction cache
misses would be severe. It introduces a remedy for this bottleneck--out-
of-order fetch, decode, and issue--describes its implementation, and
demonstrates its usefulness. For a 16-wide processor with a 16k byte
direct-mapped instruction cache, instruction cache misses reduce its
performance on a set of sixteen integer benchmarks by an average of 24%.
When it is equipped with out-of-order fetch, decode, and issue, its
performance is only reduced by an average of 5%.
|
| Current computing systems exhibit an impressive heterogeneity of hardware resources, software capabilities, and network connectivity.
They have to accommodate various application and client demands and to
adjust to the continuously changing load on network segments and
intermediate processing nodes. We present DACIA, a framework for
building adaptive distributed applications through the flexible
composition of software modules implementing individual functions. An
application is seen as a collection of components connected in a
directed graph. Through runtime reconfiguration of the application
graph, the application adapts to changes in the execution environment,
resource availability, and application requirements. Thus, a more
efficient execution is achieved. The location where various components
are executed can be changed dynamically and the execution can be
occasionally outsourced to nodes having specialized hardware or to
less loaded sites on the network. Some components or groups of
components can be replaced by other components or groups of
components. DACIA supports application and user mobility, allowing
parts of an application to move to different hosts at runtime, while
maintaining seamless connectivity.
|
| Accurate branch prediction can be seen as a mechanism for enabling design decisions. When short pipelines were the norm, accurate branch
prediction was not as important. However, having accurate branch
prediction enables technologies like wide-issue deeply pipelined
superscalar processors. If branch predictors can be improved further,
we can more successfully use more aggressive speculation
techniques. Accurate branch prediction enables larger scheduling
windows, out-of-order fetch, deeper pipelines etc. It is therefore
likely that there will be a growing demand for more accurate
predictors beyond todays prediction technology.
Previous studies have shown which branch predictors and configurations
best predict the branches in a given set of benchmarks. Some studies
have also investigated effects, such as pattern history table
interference, that can be detrimental to the performance of these
branch predictors. However, little research has been done on which
characteristics of branch behavior make branches predictable.
This dissertation approaches the branch problem in a different way
from previous studies. The focus is on understanding how branches
behave and why they are predictable. Branches are classified based on
the type of behavior, and the extent of each type of behavior is
quantified. One important result is that two thirds of all branches
are very predictable using a simple predictor because they follow
repeating patterns. We also show how correlation between branches
works, and what part of this correlation is important for branch
prediction.
Based on this information about branch behavior, some shortcomings of
current branch predictors are identified, new branch predictors are
introduced, and potential areas for future improvement are identified.
One of the new predictors, Dual History Length Gshare with Selective
Update is more accurate than a Gshare predictor using Branch Filtering
while having a simpler implementation. Another new predictor, the
Multi Hybrid, suffers 10% fewer mispredictions than a
state-of-the-art PAs/Gshare hybrid predictor at an implementation cost
of 100 KB.
|
Technical Reports Page
|
|