Matrix multiplication is a classic subject of academic study. It is well suited to parallelization because it is computationally intensive (O(N^3) operations) and its iterations are independent of one another. Parallel programming is hard. Would it surprise you that we can parallelize matrix multiplication with merely one line of OpenMP directive?

__Serial Matrix Multiplication__

/* matrix.cpp */

const int size = 1000;

float a[size][size];
float b[size][size];
float c[size][size];

int main()
{
    // Initialize buffers.
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            a[i][j] = (float)i + j;
            b[i][j] = (float)i - j;
            c[i][j] = 0.0f;
        }
    }

    // Compute matrix multiplication.
    // C <- C + A x B
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            for (int k = 0; k < size; ++k) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }

    return 0;
}

Analyzing the data dependencies of the operation c[i][j] += a[i][k] * b[k][j], we can see that the iterations over i are independent of one another: each one writes to a different row of c. The same applies to the iterations over j. For simplicity, we consider parallelizing only a single for-loop.

__Parallel Matrix Multiplication using OpenMP__

/* matrix-omp.cpp */

const int size = 1000;

float a[size][size];
float b[size][size];
float c[size][size];

int main()
{
    // Initialize buffers.
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            a[i][j] = (float)i + j;
            b[i][j] = (float)i - j;
            c[i][j] = 0.0f;
        }
    }

    // Compute matrix multiplication.
    // C <- C + A x B
    #pragma omp parallel for default(none) shared(a,b,c)
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            for (int k = 0; k < size; ++k) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }

    return 0;
}

With the OpenMP directive (#pragma), the i-for-loop is divided into multiple chunks, and each chunk is assigned to a thread. Multiple threads can then compute their chunks in parallel. That's it.
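To see this chunking in action, you can ask each thread to report which rows it receives. Here is a minimal sketch (the file name chunks.cpp and the tiny size are made up so the output stays readable); omp_get_thread_num() and omp_get_num_threads() are standard OpenMP runtime routines declared in omp.h:

/* chunks.cpp */

#include <cstdio>
#include <omp.h>

int main()
{
    const int size = 8; // small, so the output stays readable

    #pragma omp parallel for
    for (int i = 0; i < size; ++i) {
        // Each thread reports the rows it was assigned.
        printf("thread %d of %d handles row %d\n",
               omp_get_thread_num(), omp_get_num_threads(), i);
    }

    return 0;
}

Compiled with g++ -fopenmp chunks.cpp -o chunks, a dual-core system will typically show thread 0 taking rows 0-3 and thread 1 taking rows 4-7, though the exact split depends on the schedule the implementation chooses.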

$ g++ -O2 matrix.cpp -o matrix
$ g++ -O2 -fopenmp matrix-omp.cpp -o matrix-omp

$ time ./matrix
real    0m12.976s
user    0m12.460s
sys     0m0.059s

$ time ./matrix-omp
real    0m6.952s
user    0m12.556s
sys     0m0.050s

Bravo! The parallel version is about 1.86x faster than the serial version on a dual-core system. However, the code above is for illustration purposes only. There are still many optimization opportunities before it reaches commercial-grade performance.
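One such opportunity concerns cache behavior. In the i-j-k order above, the innermost loop reads b[k][j] with k varying, striding down a column of a row-major array, which is cache-unfriendly. Interchanging the j and k loops makes the innermost accesses sequential in memory. A sketch of this common optimization applied to the same loop nest (an illustration, not a tuned kernel):

#pragma omp parallel for default(none) shared(a,b,c)
for (int i = 0; i < size; ++i) {
    for (int k = 0; k < size; ++k) {
        const float aik = a[i][k]; // reused across the whole j loop
        for (int j = 0; j < size; ++j) {
            // b[k][j] and c[i][j] are now scanned sequentially.
            c[i][j] += aik * b[k][j];
        }
    }
}

Each thread still owns a disjoint set of rows of c, so the parallelization remains race-free. Loop blocking (tiling), vectorization, or simply calling a tuned BLAS routine such as sgemm would close the remaining gap.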

__Comments__

Comment: Clean example!

Comment: Thank you, and one point: -O3 runs 3x faster compared to -O2 in both cases on my Intel machine with g++ 4.6.

Reply: As size is a compile-time constant, the compiler can do a lot more optimization, and I would not be surprised to see a great improvement with -O3. But when the size is assigned dynamically, you probably won't get such a nice improvement over -O2.

Comment: Try #pragma omp critical when you are adding to matrix C!

Reply: In this case, critical is not needed, as each thread writes only to memory that no other thread touches.
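To illustrate the difference: if the threads did share a write target, say a single running total, the unprotected update would be a data race, and critical is one way to fix it. A minimal sketch (the summation itself is made up for illustration):

float total = 0.0f;
#pragma omp parallel for
for (int i = 0; i < size; ++i) {
    for (int j = 0; j < size; ++j) {
        // total is shared by all threads, so the read-modify-write
        // must be protected; otherwise updates can be lost.
        #pragma omp critical
        total += c[i][j];
    }
}

This is correct but slow, because the critical section serializes every addition; a reduction, discussed below, is the better tool for this pattern.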

Deletecan you show me using "

ReplyDeletereduction "

Reduction is better shown in different example such as summing up numbers in an array using multiple threads. reduction(+: array) can be used.
