Defense Event

Accelerating Data Transfer for Throughput Processors

Davoud Anoushe Jamshidi

Friday, September 02, 2016
12:00pm - 2:00pm
3725 Beyster Bldg.

About the Event

Graphics processing units (GPUs) have become prevalent in modern computing systems. While their highly parallel architectures are traditionally used as accelerators for rendering graphics, GPUs are also adept at handling data-parallel workloads when provided with large blocks of data for processing. Extracting performance from a GPU requires the programmer to provide enough work to keep the device fully utilized. Unlike CPUs, which are highly optimized to reduce memory access latency, GPUs are optimized for throughput and tend to have high access latency. The naive approach to obtaining performance is to provide a GPU with hundreds to thousands of threads so that some threads can perform computation while others wait for data to arrive. This approach, however, cannot guarantee that there will always be enough computation to hide the long latency of off-chip memory accesses. Common memory access patterns on GPUs further complicate code optimization. These patterns include streaming data that is used only once, tiling data in scratchpad memories to preserve some locality and share data among many threads, and irregular accesses where neighboring threads access divergent memory locations. Limitations posed by the microarchitecture of modern GPU cores can hinder the GPU's ability to effectively hide memory access latency. This in turn limits GPU throughput and slows down execution of code on GPUs. In this thesis, architectural modifications to GPUs are proposed that address the issues and inefficiencies posed by these access patterns. Inefficiencies in the existing microarchitecture that stifle the performance of applications following such access patterns are outlined. They are then ameliorated using specialized hardware that decouples many memory accesses from threads, as well as from the existing execution pipelines, in an effort to boost memory bandwidth utilization, keep data flowing to cores, and shrink exposed memory latency. By accomplishing these goals, the proposed modifications effectively improve throughput, boosting kernel performance.

Additional Information

Sponsor(s): Professor Scott Mahlke

Open to: Public