Edward S. Davidson
Student: Gheith A. Abandah
Professor: Edward S. Davidson
Sponsor: Hewlett-Packard
Connecting processing nodes using a directory-based cache coherent network is becoming a popular approach to building powerful computation, data, and Internet servers. Current scaleable shared memory multiprocessors have internode remote access latencies that are 3-10 times the local memory latency with moderate remote bandwidth. It is difficult to achieve good programmability and scalability for multiple node configurations with a remote latency greater than 2x. Additionally, future processors are going to be faster, demanding even lower latencies and higher bandwidth. By analyzing the behavior of shared memory applications, we seek to find efficient approaches for building new systems and reducing the communication cost.
We have developed a methodology and tools for characterizing shared-memory applications. These tools characterize the system traffic generated by the data, code, and I/O streams. They include tools for tracing shared-memory applications and tools for trace analysis. The analysis tools report a general characterization that is independent of the hardware solution, as well as application traffic characteristics on existing and proposed system configurations.
We have used these tools to analyze the NAS Parallel Benchmarks (NPB) for two sets of problem sizes. Our characterization of NPB includes the amount of memory instructions, communication, and sharing as a function of the number of processors. We have also analyzed some NPB traffic characteristics on a possible future system configuration.
We are currently evaluating current approaches for building scaleable shared-memory systems as a background to getting a better understanding of the design issues.
Previous research included porting scientific applications to massively-parallel processors, modeling the communication performance of message-passing multicomputers, and characterizing the memory and communication performance in cache-coherent nonuniform memory access (CC-NUMA) multiprocessors.
Author: Alexandre Eichenberger
Professor: Edward S. Davidson
Sponsor: Hewlett-Packard
High performance compilers increasingly rely on accurate modeling of the machine resources to efficiently exploit the instruction level parallelism of an application. In this dissertation, we first propose a reduced machine description that results in significantly faster detection of resource contentions while preserving the scheduling constraints present in the original machine description. This approach reduces a machine description in an automated, error-free, and efficient fashion. Moreover, it fully supports the elaborate scheduling techniques that are used by high-performance compilers, such as scheduling an operation earlier than others that are already scheduled, unscheduling operations due to resource conflicts, and efficient handling of periodic resource requirements found in software pipelined schedules. Reduced machine descriptions resulted in processing the queries to the resource model in 58.1% of the original time for a benchmark of 1327 loops scheduled by a state-of-the-art modulo scheduler for the Cydra 5 machine.
Scheduling techniques such as Modulo Scheduling are also increasingly successful at efficiently exploiting the instruction level parallelism up to the resource limit of the machine, resulting in high performance but increased register requirements. In this dissertation, we propose an optimal register-sensitive algorithm for modulo scheduling that schedules loop operations to achieve the minimum register requirements for the smallest possible initiation interval between successive iterations. The proposed approach supports machines with finite resources and complex reservation tables, such as the Cydra 5. It also uses a well structured formulation that enables us to find schedules within a reasonable time for more loops (920 of the 1327 loops) and larger loops (up to 41 operations).
We also propose an alternative approach, stage scheduling, which reduces the register requirements of a given modulo schedule by shifting its operations by multiples of II cycles. In addition to achieving a significant fraction of the possible decrease in register requirements (e.g. the average register requirements decrease by 22.2% for the optimal modulo scheduler, by 19.9% for the optimal stage scheduler, and by 19.8% for our best stage scheduling heuristic, compared to a register-insensitive modulo scheduler in a benchmark of 920 loops), the stage scheduling approach also preserves the steady-state performance of the initial schedules. By bounding the schedule length of its schedule, we may preserve the transient performance of the original as well. Thus, by coupling efficient stage schedule heuristics with a register-insensitive high-performance modulo scheduler, we may very quickly obtain high-performance schedules with low register requirements.
A technique that further decreases the register requirements in predicated code is presented. This technique is based on precisely computing the interferences among virtual registers in the presence of predicated operations.
Student: Shih-Hao Hung
Professor: Edward S. Davidson
Sponsor: TACOM (U.S. Army)
Most programmers are trained to write serial codes and there exist many important serial application codes. There are difficulties in parallel programming that prevent people from developing high performance parallel codes, especially for irregular applications. We approach this problem by developing models and tools that help people parallelize their codes and optimize their performance. The application performance is evaluated and characterized by analyzing the source code or the performance profile with a hierarchical performance bound model. Performance bottlenecks are detected by observing the gaps between two successive performance bounds. The application is then fine-tuned by targeting and relaxing specific performance bottlenecks with a unified performance-tuning methodology.
With source code analysis, trace-driven simulation, and runtime profiling, the performance of an application can be characterized by a hierarchical performance bound model. Previous application-performance bound models developed by our group focus mainly on uniprocessor performance. We found that relatively complicated performance profiles can be summarized with a relatively simple set of performance bounds to provide better performance visualization and understanding. The performance bound hierarchy now successively includes the effects of the machine (M-bound), the application (MA-bound), the compiler (MAC-bound), the scheduling constraints (MACS-bound), the first-level cache (MACS1-bound), the degree of parallelism in the application (MACS1A), the shared-memory system (MACS1M-bound), the overall load imbalance (MACS1ML-bound), the multiple program phases (MACS1MLP-bound), and the dynamic machine/application behavior (MACS1MLPD-bound). The gap between two successive performance bounds indicates the overhead due to the additional constraint. Collectively, they explain why the machine peak performance is not achieved and where the performance bottlenecks are.
The gaps from MACS1A to MACS1MLPD successively indicate the overhead due to imperfect parallelism, interprocessor communications, overall load imbalance, multiple program phases, and dynamic machine/program behavior. This characterization thus shows the performance bottlenecks and can be used to guide the compiler or the programmer in code optimization. We have proposed a unified scheme to improve the performance for irregular applications. An large irregular engineering application code has been used to study the problems associated with irregularity. Compared to regular applications, irregular applications are more difficult to parallelize, load balance and optimize due to indirect array accessing, sparse matrix operations, nonuniform computation requirements across the data domain, and/or unstructured problem domains. Optimal partitioning of irregular applications is an NP-complete problem. Most standard compiler optimizations cannot be applied to irregular applications since the indirect array references are not known until runtime. Interprocessor communications and the load balance are difficult to analyze and improve without performance tools.
By analyzing the communication patterns of such applications using trace-driven simulation tools, communication dependencies between the problem sub-domains are revealed and the communication performance can then be improved by optimizing the data layout, eliminating false-sharing, optimizing the cache coherence protocol, embedding data prefetch instructions, better communication mechanisms, and/or better domain decomposition schemes. The knowledge of communication dependencies also enables the use of efficient and innovative load balancing and synchronization relaxing techniques, which may reduce the overhead due to conventional barrier synchronization and multiple program phases. With this approach, irregular applications could be systematically tuned by solving individual performance issues in an ordered sequence of steps. The hierarchical bound model can then be used again to evaluate the improved solutions.
With a unified method, we are able to characterize applications, evaluate machine-application performance, and fine-tune the performance systematically. We have converted relatively complicated application profiles into simple performance bounds, transformed a generally difficult performance tuning process into step-by-step procedures, and developed techniques and tools to realize the solutions. More importantly, this method can be applied by general users and can possibly be integrated into future compilers.
In our performance tuning efforts, we found that domain decomposition is a critical determinant of the load balance, the communication traffic, and the scheduling constraints, and hence greatly affects overall performance. In addition to carrying out further experiments with different domain decomposition strategies, we plan to address the relationship between individual performance issues so as to provide more global view and more globally motivated solutions to performance-tuning problems.
Student: Jude A. Rivers
Professor: Edward S. Davidson
Sponsor: Rackham Graduate Fellowship
We have been investigating a novel memory system design where the first level (L1) data cache is organized and managed as a multiple resource, a "multi-lateral" L1 cache, within a single-processor chip. Three trends in computer architecture have made multilateral caches attractive. First, advances in VLSI design have reached a level where we anticipate a billion transistors in a single chip within the next decade for the computer designer's use. Second, processor speeds keep increasing faster than memory speeds. The general projection is that the processor-memory cycle gap will continue to increase. For any cache to be able to provide a single cycle access time that is competitive with processor speeds, the cache will likely have to be small and have little or no associativity. Thus employing a large on-chip L1 cache will become progressively more unattractive, despite the available real estate. Third, emerging processors all have increasing multiple instruction issue capabilities for exploiting the available instruction level parallelism (ILP) in a program, and this trend will continue. Multi-issue processor performance will thus benefit greatly from a memory system that has the capability (efficient management and sufficient bandwidth) to service multiple data requests per cycle.
Our multi-lateral cache organization (MLCO), in which the L1 data cache is partitioned into subcaches that are arranged in parallel and placed equidistant from the CPU, is compatible with the constraints and needs described above. Each of the MLCO subcaches is distinctively managed, and all have mutually exclusive contents. This design approach relies on the basic premise that the data reference stream of a running program can be subdivided into possible data classes and mapped into specific subcaches whose management policies are most suitable for the access pattern of that class. Different data cache partitions can have different capacities, associativities, replacement strategies, and block sizes. This sort of selective caching ensures the effective management of the L1 data cache in keeping more reusable data items, which translates into more cache hits, less cache-memory bus contention and an overall improvement in program execution time.
To collectively manage the MLCO subcaches requires an efficient criterion. Possible management criteria include techniques based on data behavior, functional dedication and address type. Temporality-based caching (TBC) is a data behavior-based MLCO management concept used in the Non-Temporal Streaming (NTS) Cache that we have developed. The NTS Cache is a basic MLCO consisting of two subcaches: a larger direct-mapped unit and a smaller fully associative nontemporal (NT) cache. Simulated evaluations show that an (8+1)KB NTS Cache performs as well as, and sometimes better than, a 16KB conventional direct-mapped cache.
Future directions include deriving and validating a theoretical "best use" model for the basic case of MLCOs, a traditional data cache coupled with a small supporting fully-associative buffer. We plan to investigate the multi-level inclusion problems associated with MLCOs backed by a level 2 cache. We also plan to carry out a performance study on MLCO schemes in a bus-based shared-memory multiprocessor environment to determine the effect that MLCO multiprocessor nodes would have on overall system performance.
Author: Tien-Pao Shih
Professor: Edward S. Davidson
Sponsor: Ford Motor Company and Office of Naval Research
Performance tuning, as carried out by compiler designers and application programmers to close the performance gap between the achievable peak and delivered performance, becomes increasingly important and challenging as the microprocessor speeds and system sizes increase. However, although performance tuning on scientific codes usually deals with relatively small program regions, it is not generally known how to establish a reasonable performance objective and how to efficiently achieve this objective. We suggest a goal-directed approach and develop such an approach for each of three major system performance components: central processor unit (CPU) computation, memory accessing, and communication.
For the CPU, we suggest using a machine-application performance model that characterizes workloads on four key function units (memory, floating-point, issue, and a virtual "dependence unit") to produce an upper bound performance objective, and derive a mechanism to approach this objective. A case study shows an average 1.79x speedup achieved by using this approach for the Livermore Fortran Kernels 1-12 running on the IBM RS/6000.
For memory, as compulsory and capacity misses are relatively easy to characterize, we derive a method for building application-specific cache behavior models that report the number of misses for all three types of conflict misses: self, cross, and ping-pong. The method uses averaging concepts to determine the expected number of cache misses instead of attempting to count them exactly in each instance, which provides a more rapid, yet realistic assessment of expected cache behavior. For each type of conflict miss, we propose a reduction method that uses one or a combination of three techniques based on modifying or exploiting data layout: array padding, initial address adjustment, and access resequencing. A case study using a blocked matrix multiply program as an example shows that the model is within 11% of the simulation results, and that each type of conflict miss can be effectively reduced or completely eliminated.
For communication in shared memory parallel systems, we derive an array grouping mechanism and related loop transformations to reduce communication caused by the problematic case of nonconsecutive references to shared arrays and prove several theorems that determine when and where to apply this technique. The experimental results show a 15% reduction in communication, a 40% reduction in data subcache misses, and an 18% reduction in maximum user time for a finite element application on a 56 processor KSR1 parallel computer.
Student: Edward S. Tam
Professor: Edward S. Davidson
Sponsor: IBM
The cache, which buffers the high speed CPU from the much slower main memory, is a key component of a microprocessor's overall performance. Behavioral cache simulators indicate the performance of a given cache configuration in terms of the number of hits and misses experienced when running a piece of code. One can attach a leading edge penalty or "effective" penalty estimate to each miss to get a first order idea of run time. However, individual timing penalties are not assessed within these models, nor are the various factors that can affect the latency of each memory access. Our Latency Effects (LE) cache model accounts for additional latencies due to trailing edge effects, bus width considerations, the number of outstanding accesses allowable, and port limitations.
A tool implementing the LE cache model has been built on top of a behavioral cache simulator, DineroIII, in the spirit of the Resource Conflict Methodology developed by J-D Wellman. The tool was combined with Wellman's RCM_brisc processor simulator to provide a realistic interaction of the cache with the processor (including the latency masking effects of processor activity) and to assess the accuracy of the model when simulating the execution of actual programs. The combined tool accurately mirrors the effects of changing a cache's configuration for a given processor configuration running a variety of programs. While the reported execution times do not exactly match the total execution times of the same programs running on actual production hardware, the tool does provide enough useful information to guide processor/cache designers early in the design cycle toward optimal configurations for target application workloads. This addition of cache modeling to the RCM_brisc instruction-level processor simulator with a perfect cache increases simulation time by only 17% (less than 5% over a constant miss penalty cache model), which is reasonable given the increased timing accuracy of LE cache model.
The tool has exposed some interesting characteristics about the interaction between the cache, the processor, and the memory access stream. For instance, in behavioral cache models, one of the main metrics of performance is the cache miss (or alternatively, hit) ratio. Without considering the latency effects of the cache and the latency masking effects of the processor, two programs with similar miss ratios might be thought to require the same number of cycles to execute, given the behavioral cache simulator's output. In actuality, two workloads can have substantially different execution cycle counts due to processor masking effects and differing access patterns. If one program has many of its misses grouped together in time and/or little processor masking occurring (e.g. few non-memory accessing instructions are being executed during cache misses), a miss may contribute directly to the overall run time of the program. However, if the misses are spread out over time and/or the processor is executing useful instructions while waiting for the outstanding memory requests to complete, a miss may have little or no effect on overall execution time because its latency is masked by useful work.
Early design cycle analysis permits the simulation and testing of novel cache configurations before actual hardware needs to be built. To this end, the combined tool has been expanded to model several temporality-based cache configurations, including the Non-Temporal Streaming cache proposed by J. Rivers and the ASSIST cache used in the HP-7200 processor. Other multi-lateral cache configurations will be evaluated in the future as well, in the hope of obtaining generalized, best-option cache configurations for given design spaces and target workloads.
Student: Stevan Vlaovic
Professor: Edward S. Davidson
In order to develop efficient applications for large scale multiprocessors, the communication patterns of the target architecture should be well understood. By understanding these communication patterns, a faster running application code can be developed. Some communications in an application require the smallest latency possible, whereas others might benefit from higher throughput.
The Cray T3D is a multiprocessor that has a high speed 3D torus interconnect, and special communications hardware. By utilizing this hardware properly, a program has the potential to minimize its communications overhead. We have investigated three different libraries that provide communication on the T3D: PVM, MPI, and SMA. Both PVM and MPI are portable, whereas SMA is native. The latency, bandwidth, and collective communication costs of the communication primitives implemented in these libraries were characterized and compared and contrasted with the results of related studies on the IBM SP2 and the Convex SPP-1000.