Parallel Performance Project Research Paper
Research Paper
-
Domain Decomposition, Irregular Application, and Parallel Computers
Karen A. Tomko
Ph.D. Thesis, University of Michigan, 1995.
Abstract
-
Many large-scale computational problems are based on irregular
(unstructured) domains. Some examples are finite element methods in
structural analysis, finite volume methods in fluid dynamics, and
circuit simulation for VLSI design. Domain decomposition is a common
technique for distributing the data and work of irregular scientific
applications across a distributed memory parallel machine. To obtain
efficiency, subdomains must be constructed such that the work is
divided with a reasonable balance among the processors while the
communication-causing subdomain boundary is kept small.
Application and machine specific information can be
used in conjunction with domain decomposition to achieve a level of
performance not possible with traditional domain decomposition
methods. Application profiling characterizes the performance of an
application on a specific machine. We present a method that uses
curve-fitting of application profile data to calculate vertex and edge
weights for use with weighted graph decomposition algorithms. We
demonstrate its potential on two routines from a production finite
element application running on the IBM SP2. Our method combined with a
multilevel spectral algorithm reduced load imbalance from 52% to less
than 10% for one routine in our study.
Many irregular applications have several phases, that must be
load balanced individually to achieve high overall application
performance. We propose finding one decomposition that can be used
effectively for each phase of the application, and introduce a
decomposition algorithm which load balances according to two vertex
weight sets for use on two-phase applications. We show that this dual
weight algorithm can be as successful at load balancing two individual
routines together as the traditional single weight algorithm is at
load balancing each routine independently.
Domain decomposition algorithms take a simplistic view of
multiprocessor communication. Higher performance can be achieved by
considering the communication characteristics of the target
multiprocessor in conjunction with decomposition techniques. We
provide a methodology for tuning an application for a shared-address
space multiprocessor by using intelligent layout of the application
data to reduce coherence traffic and employing latency hiding
mechanisms to overlap communication with useful work. These techniques
have been applied to a finite element radar application running on the
Kendall Square KSR1.
Back to Publication List, or
Parallel Performance Project Home Page