EECS 570: Pthreads Tutorial

Developed by Tanvir Ahmed Khan (takh@umich.edu), Jan 2019.

An introductory tutorial on parallel programming using Pthreads.

Step 0: Let's start with the following sequential baseline:
[source]
Matrix multiplication!
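The original serial.cpp is not reproduced here; below is a minimal sketch of what such a baseline looks like. DIM, the int element type, and the init()/print_result() helpers are assumptions for illustration, not the original code.

#include <cstdio>
#include <cstdlib>

#define DIM 1000

int matrix_a[DIM][DIM], matrix_b[DIM][DIM], matrix_c[DIM][DIM];

// Fill the input matrices and clear the result matrix.
void init() {
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++) {
            matrix_a[i][j] = rand() % 100;
            matrix_b[i][j] = rand() % 100;
            matrix_c[i][j] = 0;
        }
}

// Classic O(DIM^3) triple loop.
void multiply() {
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++)
            for (int k = 0; k < DIM; k++)
                matrix_c[i][j] += matrix_a[i][k] * matrix_b[k][j];
}

// Dump the result so it can be diffed against the parallel version later.
void print_result(FILE *out) {
    for (int i = 0; i < DIM; i++) {
        for (int j = 0; j < DIM; j++)
            fprintf(out, "%d ", matrix_c[i][j]);
        fprintf(out, "\n");
    }
}

int main() {
    init();
    multiply();
    print_result(stdout);
    return 0;
}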

Compile and run:
$ gcc serial.cpp -o serial
$ time ./serial

real   0m4.306s
user   0m4.288s
sys    0m0.016s





Step 1: We want to accelerate this program.
Following Amdahl's law, we should parallelize the part(s) of the program that take the most time. So, let's profile the runtime.
[source]
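The instrumentation itself is not shown here; below is a sketch of one way to obtain such timings, building on the Step 0 sketch and assuming the reported values are microseconds measured with gettimeofday (the current_time_us() helper is a hypothetical name).

#include <cstdio>
#include <sys/time.h>

// Wall-clock time in microseconds.
long current_time_us() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1000000L + tv.tv_usec;
}

int main() {
    long t0 = current_time_us();
    init();
    long t1 = current_time_us();
    multiply();
    long t2 = current_time_us();
    print_result(stdout);
    long t3 = current_time_us();
    printf("Initialization time: %ld\n", t1 - t0);
    printf("Multiplication time: %ld\n", t2 - t1);
    printf("Print time: %ld\n", t3 - t2);
    return 0;
}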
Compile and run:
$ gcc serial.cpp -o serial
$ time ./serial
Initialization time: 13155
Multiplication time: 4308665
Print time: 56714

real   0m4.381s
user   0m4.372s
sys    0m0.008s
The actual multiplication takes ~98% of the whole execution time, so we should parallelize multiply(). We will use the Pthreads (POSIX Threads) library for this.




Step 2: Pthreads Basics
A thread is the smallest execution unit that the OS scheduler can schedule on a processing unit. Threads within the same process share the same address space.

Create a new thread using pthread_create.
The main thread then waits for the created thread to terminate using pthread_join.

Let's add a thread to our program.
[source]
Note: We modified multiply() to match the prototype that pthread_create expects for its thread-start routine. Also, the output is now written to a separate file so we can check the correctness of the result.
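A sketch of that change, building on the Step 0 sketch (everything except multiply(), pthread_create, and pthread_join is an assumption):

#include <pthread.h>

// multiply() now has the void *(*)(void *) signature that pthread_create expects.
void *multiply(void *arg) {
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++)
            for (int k = 0; k < DIM; k++)
                matrix_c[i][j] += matrix_a[i][k] * matrix_b[k][j];
    return NULL;
}

int main() {
    init();

    pthread_t child_thread;
    // Run multiply() in a single child thread, then wait for it to finish.
    pthread_create(&child_thread, NULL, multiply, NULL);
    pthread_join(child_thread, NULL);

    FILE *out = fopen("parallel.txt", "w");  // write the result to a file for diffing
    print_result(out);
    fclose(out);
    return 0;
}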

Let's run our first multithreaded program.
$ gcc parallel.cpp -o parallel -pthread
$ time ./parallel

real   0m4.264s
user   0m4.244s
sys    0m0.012s

$ diff parallel.txt serial.txt | wc -l
0
No speedup? We just did all the computation in a single child thread while the main thread waited for it to finish.
At least our results match.




Step 3: Distribute workload
We need several threads partitioning the work among them.
[source]
We used 10 threads. The range of k in multiply() that each thread operates on is distributed uniformly among the threads. To specify this range, we created a new structure, thread_args. On the main thread, we initialized a thread_args for each thread accordingly (see the sketch below).
Note that we call pthread_join in a separate loop, only after all threads have been created. Why not write it like this?
for(int i = 0; i < NUM_THREADS; i++) {
    pthread_create(&child_threads[i], NULL, multiply, &work_ranges[i]);
    pthread_join(child_threads[i], NULL);
}
Calling pthread_join right after pthread_create would make the main thread wait for each thread to finish before creating the next one, serializing the work again. We don't want that.
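Putting it together, a sketch of the partitioned setup (the thread_args field names and the exact partitioning are assumptions):

#define NUM_THREADS 10

// Each thread multiplies over its own sub-range of k.
typedef struct {
    int start;  // first k index (inclusive)
    int end;    // one past the last k index (exclusive)
} thread_args;

pthread_t child_threads[NUM_THREADS];
thread_args work_ranges[NUM_THREADS];

int chunk = DIM / NUM_THREADS;
for (int i = 0; i < NUM_THREADS; i++) {
    work_ranges[i].start = i * chunk;
    work_ranges[i].end = (i == NUM_THREADS - 1) ? DIM : (i + 1) * chunk;
}

// Create all threads first so they run concurrently...
for (int i = 0; i < NUM_THREADS; i++)
    pthread_create(&child_threads[i], NULL, multiply, &work_ranges[i]);

// ...then wait for all of them in a separate loop.
for (int i = 0; i < NUM_THREADS; i++)
    pthread_join(child_threads[i], NULL);

Inside multiply(), the argument is cast back with thread_args *range = (thread_args *)arg;.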

Let's run:
$ gcc parallel.cpp -o parallel -pthread
$ time ./parallel

real	0m0.875s
user	0m5.728s
sys	0m0.000s
Yaay! ~5x speedup. Let's check the output.
$ diff parallel.txt serial.txt | wc -l
3400
Aahh! We broke the correctness of the program. What went wrong?
Let’s look into the parallel multiply operation.
for(int i = 0; i < DIM; i++) {
    for(int j = 0; j < DIM; j++) {
        for(int k = range->start; k < range->end; k++) {
            matrix_c[i][j] += matrix_a[i][k] * matrix_b[k][j];
        }
    }
}
All threads add their contribution to all of the matrix_c[i][j] locations. Can two threads read-modify-write the same matrix_c[i][j] at the same time? Yes, and updates get lost: data races!
We must restore correctness by using synchronization primitives such as locks, mutexes, or semaphores.




Step 4: Synchronize: pthread_mutex_lock, pthread_mutex_unlock
[source]
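A sketch of the naively synchronized multiply(), assuming a single global mutex (the variable name lock is an assumption):

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *multiply(void *arg) {
    thread_args *range = (thread_args *)arg;
    for (int i = 0; i < DIM; i++) {
        for (int j = 0; j < DIM; j++) {
            for (int k = range->start; k < range->end; k++) {
                // The global lock protects every single read-modify-write of
                // matrix_c[i][j], so the updates are now correct, but all
                // threads serialize on this one critical section.
                pthread_mutex_lock(&lock);
                matrix_c[i][j] += matrix_a[i][k] * matrix_b[k][j];
                pthread_mutex_unlock(&lock);
            }
        }
    }
    return NULL;
}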
Let's run:
$ gcc parallel.cpp -o parallel -pthread
$ time ./parallel

real	1m33.011s
user	2m25.812s
sys	9m18.500s

$ diff parallel.txt serial.txt | wc -l
0
Even slower than the sequential baseline! Why?
At least the results are correct.




Step 5: Eliminate Lock Contention: With one global lock around every single update, we were essentially doing all the work sequentially again, and adding the overhead of locking and unlocking on top of it.
Each thread should instead accumulate its partial sums locally, in parallel, and synchronize only the final updates. Let's fix this!
[source]
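One way to do that, sketched below with the same global mutex: each thread accumulates its contribution to matrix_c[i][j] in a local variable and takes the lock only once per output element instead of once per multiply-add.

void *multiply(void *arg) {
    thread_args *range = (thread_args *)arg;
    for (int i = 0; i < DIM; i++) {
        for (int j = 0; j < DIM; j++) {
            // Accumulate this thread's partial sum locally, in parallel...
            int partial = 0;
            for (int k = range->start; k < range->end; k++)
                partial += matrix_a[i][k] * matrix_b[k][j];
            // ...and synchronize only the final update.
            pthread_mutex_lock(&lock);
            matrix_c[i][j] += partial;
            pthread_mutex_unlock(&lock);
        }
    }
    return NULL;
}

Each thread now acquires the lock once per output element rather than once per multiply-add, which is why the lock contention (and the huge sys time) largely disappears.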
Let's run:
$ gcc parallel.cpp -o parallel -pthread
$ time ./parallel

real	0m1.541s
user	0m6.708s
sys	0m4.628s

$ diff parallel.txt serial.txt | wc -l
0
Phew! We have ~2.8x speedup and correctness.




More Pthreads resources:






Native Execution on Intel Xeon Phi

Compile using Intel's icc compiler with the -mmic flag to generate code for the MIC co-processor target.
Then copy and run the binary on the co-processor using the micnativeloadex utility.
$ icc parallel.cpp -o parallel.mic -pthread -mmic
$ micnativeloadex parallel.mic