Matrix multiplication is a classic example for academic study. It is well suited for parallelization: the computation is an intensive O(N^3), and the individual element computations are independent of one another. Parallel programming is usually considered hard, so would it surprise you that we can parallelize matrix multiplication with merely one line of OpenMP directive?
Serial Matrix Multiplication
/* matrix.cpp */
const int size = 1000;
float a[size][size];
float b[size][size];
float c[size][size];

int main()
{
    // Initialize buffers.
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            a[i][j] = (float)i + j;
            b[i][j] = (float)i - j;
            c[i][j] = 0.0f;
        }
    }

    // Compute matrix multiplication.
    // C <- C + A x B
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            for (int k = 0; k < size; ++k) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }

    return 0;
}
Analyzing the data dependency of the operation c[i][j] += a[i][k] * b[k][j], we can see that the iterations over i are independent of one another, and the same holds for the iterations over j. (The iterations over k, by contrast, all accumulate into the same element c[i][j], so they are not independent.) For simplicity, we consider parallelizing only a single for-loop: the outer loop over i.
Parallel Matrix Multiplication using OpenMP
/* matrix-omp.cpp */
const int size = 1000;
float a[size][size];
float b[size][size];
float c[size][size];

int main()
{
    // Initialize buffers.
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            a[i][j] = (float)i + j;
            b[i][j] = (float)i - j;
            c[i][j] = 0.0f;
        }
    }

    // Compute matrix multiplication.
    // C <- C + A x B
    #pragma omp parallel for default(none) shared(a,b,c)
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            for (int k = 0; k < size; ++k) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }

    return 0;
}
With the OpenMP directive (#pragma), the iterations of the i-loop are divided into chunks, and each chunk is assigned to a thread. Multiple threads can thus compute their assigned chunks in parallel. That's it.
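If you want to see the chunk assignment for yourself, here is a minimal sketch (not part of the matrix example) that prints which thread handles which iteration. It uses the standard OpenMP runtime functions from <omp.h> and, like the matrix code, must be compiled with -fopenmp.

/* chunk-demo.cpp */
#include <stdio.h>
#include <omp.h>

int main()
{
    // Each thread reports the iterations it was assigned.
    #pragma omp parallel for
    for (int i = 0; i < 8; ++i) {
        printf("iteration %d handled by thread %d of %d\n",
               i, omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}

On a dual-core machine you would typically see iterations 0-3 on thread 0 and 4-7 on thread 1, although the exact split is up to the runtime.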
$ g++ -O2 matrix.cpp -o matrix
$ g++ -O2 -fopenmp matrix-omp.cpp -o matrix-omp
$ time ./matrix
real 0m12.976s
user 0m12.460s
sys 0m0.059s
$ time ./matrix-omp
real 0m6.952s
user 0m12.556s
sys 0m0.050s
Bravo! The parallel version is about 1.86x faster than the serial version on a dual-core system. However, the code above is for illustration purposes only; there are still many optimization opportunities before it reaches the performance of tuned, production-grade implementations.
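To name one such opportunity (this variant is only a sketch, not part of the measured example above): swapping the two inner loops into an i-k-j order lets the innermost loop scan c[i][...] and b[k][...] row by row, which is far friendlier to the cache. It is a drop-in replacement for the multiplication loop in matrix-omp.cpp, and the same directive on the i-loop still applies, because each thread still writes only its own rows of c.

    // C <- C + A x B, i-k-j loop order for better cache locality.
    #pragma omp parallel for default(none) shared(a,b,c)
    for (int i = 0; i < size; ++i) {
        for (int k = 0; k < size; ++k) {
            float aik = a[i][k];
            for (int j = 0; j < size; ++j) {
                c[i][j] += aik * b[k][j];
            }
        }
    }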
clean example!
Thank you, and one point: -O3 runs 3x faster compared to -O2 in both cases on my Intel machine with g++ 4.6.
As size is a compile-time constant, the compiler is able to do a lot more optimizations, so I would not be surprised to see a great improvement with -O3. But when size is assigned dynamically, you probably won't get such a nice improvement over -O2.
Try with #pragma omp critical when you are adding to matrix C!
In this case, critical is not needed, as each thread writes only to independent memory.
Can you show me an example using "reduction"?
Reduction is better shown in a different example, such as summing up the numbers in an array using multiple threads; a reduction(+: sum) clause on the scalar accumulator can be used.
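A minimal sketch of what such an example might look like (this snippet is not from the original reply; the accumulator name sum and the array data are assumed for illustration):

/* sum-omp.cpp */
#include <stdio.h>

const int n = 1000000;
float data[n];

int main()
{
    for (int i = 0; i < n; ++i) {
        data[i] = 1.0f;
    }

    float sum = 0.0f;
    // Each thread accumulates into a private copy of sum;
    // OpenMP combines the partial sums when the loop ends.
    #pragma omp parallel for reduction(+: sum)
    for (int i = 0; i < n; ++i) {
        sum += data[i];
    }

    printf("sum = %f\n", sum);
    return 0;
}

Compile it like the matrix example, with g++ -O2 -fopenmp.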