SpeedGo Computing: Parallelizing Matrix Multiplication using OpenMP in One Line

Saturday, August 14, 2010

Parallelizing Matrix Multiplication using OpenMP in One Line

Matrix multiplication is often used for academic study. It's well suited for parallelization due to its intensive O(N^3) computation and independent computation. Parallel programming is hard. Does it surprise you if we parallelize matrix multiplication in merely one line of OpenMP directive?

Serial Matrix Multiplication

/* matrix.cpp */
const int size = 1000;

float a[size][size];
float b[size][size];
float c[size][size];

int main()
{
    // Initialize buffers.
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            a[i][j] = (float)i + j;
            b[i][j] = (float)i - j;
            c[i][j] = 0.0f;
        }
    }

    // Compute matrix multiplication.
    // C <- C + A x B
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            for (int k = 0; k < size; ++k) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }

    return 0;
}

Analyzing the data dependency of the operation c[i][j] += a[i][k] * b[k][j], we can see that each iteration i is independent of each other. The same applies to iteration j. For simplicity, we consider parallelizing only single for-loop.

Parallel Matrix Multiplication using OpenMP

/* matrix-omp.cpp */
const int size = 1000;

float a[size][size];
float b[size][size];
float c[size][size];

int main()
{
    // Initialize buffers.
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            a[i][j] = (float)i + j;
            b[i][j] = (float)i - j;
            c[i][j] = 0.0f;
        }
    }

    // Compute matrix multiplication.
    // C <- C + A x B
    #pragma omp parallel for default(none) shared(a,b,c)
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            for (int k = 0; k < size; ++k) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }

    return 0;
}

With the OpenMP directive (#pragma), the i-for-loop is divided into multiple chunks, each chunk is assigned to a thread. Hence, multiple threads can compute assigned chunks in parallel. That's it.

$ g++ -O2 matrix.cpp -o matrix
$ g++ -O2 -fopenmp matrix-omp.cpp -o matrix-omp
$ time ./matrix
real 0m12.976s
user 0m12.460s
sys 0m0.059s
$ time ./matrix-omp
real 0m6.952s
user 0m12.556s
sys 0m0.050s

Bravo! The parallel version is about 1.86x faster than the serial version on a dual-core system. However, the code above is only for illustration purpose. There are still many optimization opportunities before it achieves commercial performance level.

8 comments:

AnonymousJuly 30, 2012 at 1:58 PM
clean example!
ReplyDelete
Replies
Luka RahneNovember 21, 2012 at 6:01 AM
Thank you and one point: -03 runs 3x faster compared to -02 in both cases on my intel / g++ 4.6,
ReplyDelete
Replies
xmanNovember 21, 2012 at 7:25 AM
As the SIZE is a constant, the compiler will be able to do a lot more optimizations. I would not be surprise to see great improvement with -O3. But when the SIZE is a dynamically assigned value, you probably won't get such a nice improvement over -O2.
ReplyDelete
Replies
AnonymousApril 27, 2014 at 8:37 PM
Try with #pragma omp critical when you are adding to matrix C!!!
ReplyDelete
Replies
UnknownSeptember 19, 2014 at 9:40 PM
can you show me using "
reduction "
ReplyDelete
Replies
AnonymousNovember 16, 2022 at 9:29 PM
Look in your most popular recreation and learn all about it so you can begin half in} like a professional. Finding an excellent on line casino bonus 1xbet is crucial to play slots on-line – particularly need to|if you wish to} get a small advantage. The most typical bonus types embody free spins, extra cash that matches your deposit, and special presents for recurring gamers.
ReplyDelete
Replies

Add comment