A Performance Analysis of Local Synchronization

Julia Lipman     Quentin F. Stout
Computer Science and Engineering, University of Michigan

 

Abstract: Synchronization is often necessary in parallel computing, but it can create delays whenever the receiving processor is idle, waiting for the information to arrive. This is especially true for barrier, or global, synchronization, in which every processor must synchronize with every other processor. Nonetheless, barriers are the only form of synchronization explicitly supplied in MPI and OpenMP.

Many applications do not actually require global synchronization; local synchronization, in which a processor synchronizes only with those processors from which it has an incoming edge in some directed graph, is often adequate. However, the behavior of a system under local synchronization is more difficult to analyze, since processors do not all start tasks at the same time.

In this paper, we show that if the synchronization graph is a directed cycle and the task times are geometrically distributed with success probability p = 0.5, then the time it takes for a processor to complete a task, including synchronization time, approaches an exact limit of 2+√2 as the number of processors in the cycle approaches infinity. Under global synchronization, however, the time is unbounded, increasing logarithmically with the number of processors. Similar results also apply for p ≠ 0.5.

We give a new proof of the constant upper bounds that apply when tasks are normally distributed and the synchronization graph is any graph of bounded degree. We also prove that for some power-law distributions on the tasks, there is no constant upper bound as the number of processors increases, even for the directed cycle. Finally, we show that constant upper bounds apply for some cases of a different synchronization model in which a processor waits for only a subset of its neighbors.

Keywords: synchronization, communication, overhead, parallel computing, performance analysis, geometric distribution, heavy-tailed distribution, stochastic task times, idle time

Complete paper. This paper appears in Proceedings of Symposium on Parallelism in Algorithms and Architectures (SPAA) 2006.

Related work: This research was motivated by earlier measurements of the communication performance of MPI on parallel systems. See Statistical Analysis of Communication Time on the IBM SP2.

 


Quentin's Home Copyright © 2006-2009 Quentin F. Stout.