Parallel Performance Project Research Paper
Reducing Communication Cost in Scalable Shared Memory Systems
Gheith A. Abandah,
Ph.D. Thesis (Technical Report CSE-TR-362-98),
University of Michigan, 1998.
Abstract
Distributed shared-memory systems provide scalable performance
and a convenient model for parallel programming. However, their
non-uniform memory latency often makes it difficult to develop efficient
parallel applications. Future systems should reduce communication cost
to achieve better programmability and performance. We have developed
a methodology, and implemented a suite of tools, to guide the search
for improved codes and systems. As the result of one such search, we
recommend a remote data caching technique that significantly reduces
communication cost.
We analyze applications by instrumenting their assembly-code sources.
During execution, an instrumented application pipes a detailed trace
to configuration-independent (CIAT) and configuration-dependent (CDAT)
analysis tools. CIAT characterizes inherent application
properties that do not change from one configuration to another,
including working sets, concurrency, sharing behavior, and the
patterns, variation over time, slack, and locality of communication. CDAT
simulates the trace on a particular hardware model, and is easily
retargeted to new systems. CIAT is faster than detailed simulation;
however, CDAT directly provides more information about a specific
configuration. The combination of the two tools constitutes a
comprehensive and efficient methodology. We calibrate existing systems
using carefully designed microbenchmarks that characterize local and
remote memory, producer-consumer communication involving two or more
processors, and contention when multiple processors utilize memory and
interconnect.
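
As a rough illustration of the configuration-independent analysis described above, the following minimal sketch derives two of the named properties, per-processor working-set footprint and sharing behavior, from a memory-reference trace. It is not CIAT itself: the trace record layout `(processor_id, op, address)`, the 64-byte line size, and the function name `characterize` are assumptions made for this example only.

```python
from collections import defaultdict

LINE_SIZE = 64  # assumed cache-line size in bytes (hypothetical, not from the thesis)


def characterize(trace):
    """Summarize a trace of (processor_id, op, address) records.

    op is 'R' for a read or 'W' for a write. The result depends only on
    the trace, not on any cache or interconnect configuration.
    """
    lines_by_proc = defaultdict(set)    # lines touched by each processor
    procs_by_line = defaultdict(set)    # processors that touched each line
    writers_by_line = defaultdict(set)  # processors that wrote each line

    for proc, op, addr in trace:
        line = addr // LINE_SIZE
        lines_by_proc[proc].add(line)
        procs_by_line[line].add(proc)
        if op == "W":
            writers_by_line[line].add(proc)

    # Footprint: bytes of distinct lines each processor touched.
    footprint = {p: len(s) * LINE_SIZE for p, s in lines_by_proc.items()}
    # A line is shared if more than one processor touched it.
    shared = {l for l, ps in procs_by_line.items() if len(ps) > 1}
    # Shared lines with at least one writer imply communication.
    write_shared = {l for l in shared if writers_by_line[l]}

    return {
        "footprint_bytes": footprint,
        "total_lines": len(procs_by_line),
        "shared_lines": len(shared),
        "write_shared_lines": len(write_shared),
    }


if __name__ == "__main__":
    sample = [(0, "R", 0), (0, "W", 64), (1, "R", 64), (1, "R", 128)]
    print(characterize(sample))
```

Because the summary is computed from the trace alone, the same numbers would apply to any target machine, which is the sense in which such an analysis is configuration independent.
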
This dissertation describes these tools, illustrates their use by
characterizing a wide range of applications and assessing the effects
of architectural and technological advances on the performance of
HP/Convex Exemplar systems, evaluates the strengths and weaknesses of
current system approaches, and recommends solutions.
CDAT analysis of three CC-NUMA system approaches shows that current
systems reduce communication cost by minimizing either remote latency
or remote communication frequency. We describe four architecturally
varied systems that are technologically similar to the low remote
latency SGI Origin 2000, but incorporate additional techniques for
reducing the number of remote communications. Using CCAT, a CDAT-like
simulator that models communication contention, we evaluate the
worthiness of these techniques. Superior performance is reached when
processors supply cached clean data, or when a remote data cache is
introduced that participates in the local bus protocol.
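
To make the intuition behind a remote data cache concrete, the toy model below counts how many references to remote lines must cross the interconnect, with and without a node-level cache placed behind the processor cache. It is a sketch under strong assumptions, not the CCAT model: both caches are fully associative with LRU replacement, only reads by a single processor are modeled, and coherence traffic and contention are ignored. The names `LRUCache` and `count_remote_fetches` are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Fully associative cache of line numbers with LRU replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def access(self, line):
        """Return True on a hit; on a miss, insert the line and evict the LRU one."""
        if line in self.lines:
            self.lines.move_to_end(line)  # mark as most recently used
            return True
        self.lines[line] = True
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)  # evict least recently used
        return False


def count_remote_fetches(refs, proc_cache_lines, rdc_lines=0):
    """Count references that must be fetched from a remote node.

    refs: sequence of remote line numbers read by one processor.
    rdc_lines: capacity of the assumed node-level remote data cache
               (0 means no remote data cache).
    """
    proc_cache = LRUCache(proc_cache_lines)
    rdc = LRUCache(rdc_lines) if rdc_lines > 0 else None
    remote = 0
    for line in refs:
        if proc_cache.access(line):
            continue  # satisfied by the processor cache
        if rdc is not None and rdc.access(line):
            continue  # satisfied locally by the remote data cache
        remote += 1   # must cross the interconnect
    return remote


if __name__ == "__main__":
    refs = [0, 1, 2, 3] * 3  # working set larger than the processor cache
    print(count_remote_fetches(refs, proc_cache_lines=2))
    print(count_remote_fetches(refs, proc_cache_lines=2, rdc_lines=4))
```

In this model the remote data cache helps only when the remote working set overflows the processor cache but fits in the larger node-level cache. A realistic evaluation would also need to account for writes, invalidations, and contention.
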