EECS 570 Lecture 14 GPUs

Winter 2025

**Prof. Satish Narayanasamy** 

http://www.eecs.umich.edu/co4urses/eecs570/



Slides adapted from instructional material with D. Kirk and W. Hwu, Programming Massively Parallel Processors: A Handson Approach, Third Edition.

Credits to Nikos Hardavellas (Northwestern), Reetu Das (UM), Thomas Wenisch

## Readings

This week:

• Tor M. Aamodt, Wilson Wai Lun Fung, Timothy G. Rogers, General-Purpose Graphics Processor Architectures, Ch. 3.1-3.3, 4.1-4.3

# Growth in GPUs



# **Revolution in GPUs**



## A major paradigm shift

#### 18 future arenas of competition These industries could yield \$29-48 trillion in revenues and \$2-6 trillion in profits by 2040. Cloud Electric E-commerce Al software vehicles and services services THE DEPT



Cybersecurity

Video

games



Semiconductors

Shared autonomous vehicles





**Batteries** 







Streaming

video

Space

Future air



mobility



**Nuclear fission** power plants

McKinsey Global Institute

Robotics



Industrial and consumer





Drugs for obesity and related conditions

Domain-Specific and Generative AI Application Systems

Model Customization, Evaluation, Safety, and Explainability

Model Architecture and Techniques

Systems Optimization



Systems and Applications

Ecosystem: LangChain, LlamaIndex, Weights & Biases NVIDIA: AI Workbench, NeMo Guardrails

#### Services and Microservices

Ecosystem: AWS Bedrock, AzureML, Cohere, Google Vertex AI, OpenAI APIs NVIDIA: NIM, Avatar Cloud Engine (ACE), BioNeMo, NeMo, Picasso

#### Models

Ecosystem: BLOOM, Llama, Mistral, MPT, OPT, Phi-2, Getty Images Al Generator, Shutterstock 3D Generator NVIDIA: BioMegatron, Edify, Nemotron

#### SDKs and Frameworks

Ecosystem: Colossal-AI, HuggingFace Transformers, PyTorch NVIDIA: A2X, Megatron-LM, NeMo Framework, Riva, Picasso

#### Libraries

Ecosystem: XLA NVIDIA: CUDA, CUTLASS, CV-CUDA, Megatron-Core, Megatron-LM, NCCL, RAFT, Transformer Engine, TensorRT-LLM, Ray

#### Management and Orchestration

Ecosystem: Kubernetes, Nephele, Slurm, VMware NVIDIA: Base Command Platform

Al computing stack

## Computing at Exascale

El Capitan at Lawrence Livermore National Laboratory (LLNL)

Performance is expected to exceed 2 exaFLOPS, which comes with a \$600 million price tag.





## **CPUs: Latency Oriented Design**

High clock frequency

- Large caches
  - Convert long latency memory accesses to short latency cache accesses

### Sophisticated control

- Branch prediction for reduced branch latency
- Data forwarding for reduced data latency

Powerful ALU

• Reduced operation latency

### CPU



DRAM

## **GPUs: Throughput Oriented Design**

- Moderate clock frequency
- Smaller caches
  - To boost memory throughput
- Simple control
  - No branch prediction
  - No data forwarding
- Energy efficient ALUs
  - Many, long latency but heavily pipelined for high throughput
- Require massive number of threads to tolerate latencies



DRAM

## CPU vs. GPU

- Different design philosophies
  - CPU: A few out-of-order cores
  - GPU: Many in-order SIMD cores



## NVIDIA B100 (2024-25)



#### B100:

- Streaming Multiprocessors (SMs): 192
- L1 Cache: 128 KB per SM
- L2 Cache: 50 MB
- Transistor Count: 104 billion
- Memory Size: 192 GB HBM3e
- Memory Bandwidth: 8 TB/s
- Power Consumption: 700W

Each SM can execute 32 threads at a time

## Exponential growth continues ...

### Supercharging Next-Generation AI and Accelerated Computing



#### **NVLink At-Scale Performance**



Architecture Release

Massive Parallelism -Regularity

## Applications Benefit from Both CPU and GPU

CPUs for sequential parts where latency matters

CPUs can be 10+X faster than GPUs for sequential code

GPUs for parallel parts where throughput wins

GPUs can be 10+X faster than CPUs for parallel code

## Amdahl's Law



## **Speeding Up Real Applications**

### Big Idea: Amdahl's Law

Speedup = 1  
Non-speed-up part 
$$(1 - F) + \frac{F}{S}$$
 Speed-up part

Example: the execution time of half of the program can be accelerated by a factor of 2. What is the program speed-up overall?

$$\frac{1}{\frac{0.5+0.5}{2}} = \frac{1}{\frac{0.5+0.25}{2}} = 1.33$$

## Load Balance

The total amount of time to complete a parallel job is limited by the thread that takes the longest to finish



## Memory Bandwidth Constraint

Memory Contentions in accessing critical dat serialization

Massively parallel execution cannot afford se

Computation – Communication

Slowest of the two determines performance



## **Global Memory Bandwidth**

Ideal



Reality



## GPUs and SIMD/Vector Data Parallelism

- Graphics processing units (GPUs)
  - How do they have such high peak FLOPS?
  - Ans: exploit massive data parallelism
- "SIMT" execution model
  - **I** Single instruction multiple threads
  - Similar to both "vectors" and "SIMD"
  - □ A key difference: better support for conditional control flow
- Program it with CUDA or OpenCL (or Vulkan or Metal or ...)
  - Extensions to C (or Objective-C in the case of Metal)
  - Perform a "shader task" (a snippet of scalar computation) over many elements
  - Internally, GPU uses scatter/gather and vector mask operations

### Context: History of Programming GPUs

#### • "GPGPU"

- Originally could only perform "shader" computations on images
- **So**, programmers started using this framework for computation
- Puzzle to work around the limitations, unlock the raw potential
- As GPU designers notice this trend...
  - Hardware provided more "hooks" for computation
  - Provided some limited software tools
- GPU designs are now fully embracing compute
  - More programmability features to each generation
  - □ Industrial-strength tools, documentation, tutorials, etc.
  - **Can be used for in-game physics**, etc.
  - Many application targets:

AI, graphics, data analytics, scientific computation, genomics

Throughput Computing: Hardware Basics

> Justin Hensley Advanced Micro Devices, Inc Graphics Product Group



## What does a modern graphics API do?





## A Simple Program - Diffuse Shader

```
sampler mySamp;
Texture2D<float3> myTex;
float3 lightDir;
float4 diffuseShader(float3 norm, float2 uv)
  float3 kd;
  kd = myTex.Sample(mySamp, uv);
  kd *= clamp( dot(lightDir, norm), 0.0, 1.0);
  return float4(kd, 1.0);
                     Each invocation is independent, but no
                     explicitly exposed parallelism
```



## Shader is compiled



## Exploit data parallelism! - add two cores



#### Each invocation is independent!



### Add even more cores - four cores





### How about even more cores - 16 cores







## 128 cores?

#### How do you feed all these cores?





### Back to the simple core...

# •How do you feed all these cores?

Share cost of fetch / decode across many ALUs
SIMD Processing





## Back to the simple core...

# •How do you feed all these cores?

- Share cost of fetch / decode across many ALUs
  SIMD Processing
  Single
  Instruction
  - Multiple
  - Data





### Back to the simple core...

# How do you feed all these cores?

Share cost of fetch / decode across many ALUs
SIMD Processing
Single

#### SIMD Processing does not imply SIMD instructions!





### Back to a single core...







### **128-Fragments in parallel**



# 16 cores → 128 ALUs (16 cores \* 8 ALUs) → 16 independent instruction streams



## 128-things in parallel

•X cores can work on primitives (triangles)

- -"geometry shader"
- Y cores can work on vertices
  - -"vertex shader"
- Z cores can work on fragments
  - -"pixel shader"
- N cores can work on data/work/etc

-"compute kernels"/"compute shaders"

Which cores working on what data changes over time



















## How to handle stalls?

•Memory access latency = 100's to 1000's of cycles

-Stalls occur when a core cannot run the next instruction

•GPUs don't have the large / fancy caches and logic that helps avoid stall because of a dependency on a previous operation.

•But we have LOTS of independent fragments.

–Interleave processing of many fragments on a single core to avoid stalls caused by high latency operations.























## **Throughput computing**





#### Latency Hiding with "Thread Warps"

- Warp: A set of threads that execute the same instruction (on different data elements)
- Fine-grained multithreading
  - One instruction per thread in pipeline at a time (No branch prediction)
  - Interleave warp execution to hide latencies
- Register values of all threads stay in register file
- No OS context switching
- Memory latency hiding
  - **Graphics has millions of pixels**



#### Warp-based SIMD vs. Traditional SIMD

- Traditional SIMD contains a single thread
  - Lock step
  - Programming model is SIMD (no threads)
  - □ ISA contains vector/SIMD instructions
- Warp-based SIMD consists of multiple scalar threads executing in a SIMD manner (i.e., same instruction executed by all threads)
  - Each thread can be treated individually (i.e., placed in a different warp) → programming model not SIMD
    - Enables memory and branch latency tolerance
  - □ ISA is scalar  $\rightarrow$  vector instructions formed dynamically



# CUDA Devices and Threads

- A compute device
  - □ Is a coprocessor to the CPU or host
  - □ Has its own DRAM (device memory)
  - **Runs many threads in parallel**
  - Is typically a GPU but can also be another type of parallel processing device
- Data-parallel portions of an application are expressed as device kernels which run on many threads
- Differences between GPU and CPU threads
  - □ GPU threads are extremely lightweight
    - O Very little creation overhead
  - **GPU needs 1000s of threads for full efficiency** 
    - Multi-core CPU needs (relatively) only a few

## Thread Batching: Grids and Blocks

- A kernel is executed as a grid of thread blocks
  - □ All threads share data memory space
- A thread block is a batch of threads that can cooperate with each other by:
  - □ Synchronizing their execution
    - For hazard-free shared memory accesses
  - Efficiently sharing data through a low latency shared memory
- Two threads from two different blocks cannot cooperate



#### **Execution Model**

- Each thread block is executed by a single multiprocessor
  - **Synchronized using shared memory**
- Many thread blocks are assigned to a single multiprocessor
  - **T** Executed concurrently in a time-sharing fashion
  - □ Keep GPU as busy as possible
- Running many threads in parallel can hide DRAM memory latency
  - □ Global memory access : 2~300 cycles

# CUDA Device Memory Space Overview

- Each thread can:
  - □ R/W per-thread registers
  - **R/W** per-thread local memory
  - □ R/W per-block shared memory
  - **R/W per-grid global memory**
  - Read only per-grid constant memory
  - Read only per-grid texture memory

• The host can R/W global, constant, and texture memories



#### **Example: Vector Addition Kernel**

```
// Pair-wise addition of vector elements
// One thread per addition
```

```
__global___void
vectorAdd(float* iA, float* iB, float* oC)
{
    int idx = threadIdx.x
        + blockDim.x * blockId.x;
        oC[idx] = iA[idx] + iB[idx];
```

Courtesy NVIDIA

# Example: Vector Addition Host Code

```
float* h_A = (float*) malloc(N * sizeof(float));
float* h_B = (float*) malloc(N * sizeof(float));
// ... initalize h_A and h_B
```

```
// allocate device memory
float* d_A, d_B, d_C;
cudaMalloc( (void**) &d_A, N * sizeof(float) );
cudaMalloc( (void**) &d_B, N * sizeof(float) );
cudaMalloc( (void**) &d_C, N * sizeof(float) );
```

```
// copy host memory to device
cudaMemcpy( d_A, h_A, N * sizeof(float),
cudaMemcpyHostToDevice );
cudaMemcpy( d_B, h_B, N * sizeof(float),
cudaMemcpyHostToDevice );
```

```
// execute the kernel on N/256 blocks of 256 threads each
vectorAdd<<< N/256, 256>>> ( d_A, d_B, d_C);
```

#### CUDA-Strengths

- (Relatively) easy to program (small learning curve)
- Success with several complex applications
   At least 7X faster than CPU stand-alone implementations
- Allows us to read and write data at any location in the device memory
- More fast memory close to the processors (registers + shared memory)

#### **CUDA-Limitations**

- Some hardwired graphic components are hidden
- Better tools are needed
  - **Profiling**
  - Memory blocking and layout
  - Binary Translation
- Difficult to find optimal values for CUDA execution parameters
  - Number of thread per block
  - Dimension and orientation of blocks and grid
  - Use of on-chip memory resources including registers and shared memory
- Working with GPUs is an active area of research