Saturday, August 14, 2010

Parallelizing Matrix Multiplication using OpenMP in One Line

Matrix multiplication is often used for academic study. It's well suited for parallelization due to its intensive O(N^3) computation and independent computation. Parallel programming is hard. Does it surprise you if we parallelize matrix multiplication in merely one line of OpenMP directive?

Serial Matrix Multiplication

/* matrix.cpp */
const int size = 1000;

float a[size][size];
float b[size][size];
float c[size][size];

int main()
{
// Initialize buffers.
for (int i = 0; i < size; ++i) {
for (int j = 0; j < size; ++j) {
a[i][j] = (float)i + j;
b[i][j] = (float)i - j;
c[i][j] = 0.0f;
}
}

// Compute matrix multiplication.
// C <- C + A x B
for (int i = 0; i < size; ++i) {
for (int j = 0; j < size; ++j) {
for (int k = 0; k < size; ++k) {
c[i][j] += a[i][k] * b[k][j];
}
}
}

return 0;
}
Analyzing the data dependency of the operation c[i][j] += a[i][k] * b[k][j], we can see that each iteration i is independent of each other. The same applies to iteration j. For simplicity, we consider parallelizing only single for-loop.

Parallel Matrix Multiplication using OpenMP

/* matrix-omp.cpp */
const int size = 1000;

float a[size][size];
float b[size][size];
float c[size][size];

int main()
{
// Initialize buffers.
for (int i = 0; i < size; ++i) {
for (int j = 0; j < size; ++j) {
a[i][j] = (float)i + j;
b[i][j] = (float)i - j;
c[i][j] = 0.0f;
}
}

// Compute matrix multiplication.
// C <- C + A x B
#pragma omp parallel for default(none) shared(a,b,c)
for (int i = 0; i < size; ++i) {
for (int j = 0; j < size; ++j) {
for (int k = 0; k < size; ++k) {
c[i][j] += a[i][k] * b[k][j];
}
}
}

return 0;
}
With the OpenMP directive (#pragma), the i-for-loop is divided into multiple chunks, each chunk is assigned to a thread. Hence, multiple threads can compute assigned chunks in parallel. That's it.
$ g++ -O2 matrix.cpp -o matrix
$ g++ -O2 -fopenmp matrix-omp.cpp -o matrix-omp
$ time ./matrix
real 0m12.976s
user 0m12.460s
sys 0m0.059s
$ time ./matrix-omp
real 0m6.952s
user 0m12.556s
sys 0m0.050s
Bravo! The parallel version is about 1.86x faster than the serial version on a dual-core system. However, the code above is only for illustration purpose. There are still many optimization opportunities before it achieves commercial performance level.

7 comments:

  1. Thank you and one point: -03 runs 3x faster compared to -02 in both cases on my intel / g++ 4.6,

    ReplyDelete
  2. As the SIZE is a constant, the compiler will be able to do a lot more optimizations. I would not be surprise to see great improvement with -O3. But when the SIZE is a dynamically assigned value, you probably won't get such a nice improvement over -O2.

    ReplyDelete
  3. Try with #pragma omp critical when you are adding to matrix C!!!

    ReplyDelete
    Replies
    1. In this case, critical is not needed as each thread writes onto independent memory only.

      Delete
  4. Replies
    1. Reduction is better shown in different example such as summing up numbers in an array using multiple threads. reduction(+: array) can be used.

      Delete