An introductory tutorial on parallel programming using Pthreads.
Step 0: Let's start with the following sequential baseline:
Matrix Multiplication !$ gcc serial.cpp -o serial
$ time ./serial
real 0m4.306s
user 0m4.288s
sys 0m0.016s
Compile and run:
$ gcc serial.cpp -o serial
$ time ./serial
Initialization time: 13155
Multiplication time: 4308665
Print time: 56714
real 0m4.381s
user 0m4.372s
sys 0m0.008s
The actual multiplication operations takes ~98% of the whole execution time. So, we should parallelize multiply(). We will use Pthreads (POSIX Threads) library for this.
Note: We modified multiply() to match the expected prototype of pthread_create call. Also, output is now written into a separate file to check the correctness of result.$ gcc parallel.cpp -o parallel -pthread
$ time ./parallel
real 0m4.264s
user 0m4.244s
sys 0m0.012s
$ diff parallel.txt serial.txt | wc -l
0
No speedup ? We just did all computation in a single thread while main thread was waiting for it to finish.
We used 10 threads. The range of k in multiply() that each thread operates on is distributed uniformly among the threads. To specify this range, we created a new structure- thread_args. On the main thread, we initialized thread_args for each thread accordingly.for(int i = 0; i < NUM_THREADS; i++) {
pthread_create(&child_threads[i], NULL, multiply, &work_ranges[i]);
pthread_join(child_threads[i], NULL);
}
Calling pthread_join just after pthread_create will cause the main thread to wait before creating the next thread. We don't want that.$ gcc parallel.cpp -o parallel -pthread
$ time ./parallel
real 0m0.875s
user 0m5.728s
sys 0m0.000s
Yaay! ~5x speedup. Let's check the output.
$ diff parallel.txt serial.txt | wc -l
3400
Aahh! We broke the correctness of the program. What went wrong?for(int i = 0; i < DIM; i++) {
for(int j = 0; j < DIM; j++) {
for(int k = range->start; k < range->end; k++) {
matrix_c[i][j] += matrix_a[i][k] * matrix_b[k][j];
}
}
}
All threads are adding to all the matrix_c[i][j] locations. Can two threads write to the same matrix_c[i][j] at the same time? Data races!
Let's run:
$ gcc parallel.cpp -o parallel -pthread
$ time ./parallel
real 1m33.011s
user 2m25.812s
sys 9m18.500s
$ diff parallel.txt serial.txt | wc -l
0
Even slower than sequential baseline! Why?
Let's run:
$ gcc parallel.cpp -o parallel -pthread
$ time ./parallel
real 0m1.541s
user 0m6.708s
sys 0m4.628s
$ diff parallel.txt serial.txt | wc -l
0
Phew! We have ~2.8x speedup and correctness.
$ icc parallel.cpp -o parallel.mic -pthread -mmic
$ micnativeloadex parallel.mic