
Tuesday, August 17, 2010

Parallelizing Matrix Multiplication using MPI

MPI is a popular mechanism in high-performance computing, and it works in both cluster and shared-memory environments. If MPI works in both, why not simply use it everywhere? Why do we care about OpenMP, Cilk++, and the like? Perhaps that depends on the complexity of the application you are dealing with.

Parallel Matrix Multiplication using MPI

/* matrix-mpi.cpp */
#include <mpi.h>

const int size = 1000;

float a[size][size];
float b[size][size];
float c[size][size];

void multiply(int istart, int iend)
{
    for (int i = istart; i <= iend; ++i) {
        for (int j = 0; j < size; ++j) {
            for (int k = 0; k < size; ++k) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
}

int main(int argc, char* argv[])
{
    int rank, nproc;
    int istart, iend;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        // Initialize buffers.
        for (int i = 0; i < size; ++i) {
            for (int j = 0; j < size; ++j) {
                a[i][j] = (float)i + j;
                b[i][j] = (float)i - j;
                c[i][j] = 0.0f;
            }
        }
    }

    // Broadcast matrices to all workers.
    MPI_Bcast(a, size*size, MPI_FLOAT, 0, MPI_COMM_WORLD);
    MPI_Bcast(b, size*size, MPI_FLOAT, 0, MPI_COMM_WORLD);
    MPI_Bcast(c, size*size, MPI_FLOAT, 0, MPI_COMM_WORLD);

    // Partition work by i-for-loop.
    istart = (size / nproc) * rank;
    iend = (size / nproc) * (rank + 1) - 1;

    // Compute matrix multiplication in [istart,iend]
    // of i-for-loop.
    // C <- C + A x B
    multiply(istart, iend);

    // Gather computed results. The MPI standard forbids aliasing
    // the send and receive buffers, so rank 0 contributes its
    // rows in place.
    if (rank == 0) {
        MPI_Gather(MPI_IN_PLACE, size*size/nproc, MPI_FLOAT,
                   c, size*size/nproc, MPI_FLOAT,
                   0, MPI_COMM_WORLD);
    } else {
        MPI_Gather(c[istart], size*size/nproc, MPI_FLOAT,
                   NULL, 0, MPI_FLOAT,
                   0, MPI_COMM_WORLD);
    }

    if (rank == 0) {
        // Compute remaining multiplications
        // when size % nproc > 0.
        if (size % nproc > 0) {
            multiply((size/nproc)*nproc, size-1);
        }
    }

    MPI_Finalize();
    return 0;
}

$ g++ -O2 matrix.cpp -o matrix
$ mpicxx -O2 matrix-mpi.cpp -o matrix-mpi
$ time ./matrix
real 0m13.226s
user 0m12.529s
sys 0m0.065s
$ time mpirun -np 2 ./matrix-mpi
real 0m8.490s
user 0m6.346s
sys 0m0.178s
Phew... what a hassle. You can see the need to:
  • perform data transfers to the workers manually
  • partition the work manually
  • perform many index calculations
  • handle the remaining work when the amount of work is not divisible by the number of workers
Furthermore, this MPI version uses more memory than its shared-memory counterparts. The MPI program is launched as multiple processes acting as multiple workers, so the memory consumption multiplies as well. More work would be required to minimize the total memory consumption, as sketched below. When you must work in a cluster environment, though, perhaps you don't have many choices with the current state of the art in programming tools.
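One way to cut that footprint is to stop replicating A and C: each worker allocates only its own row block of A and C, and only B is replicated in full. Below is a minimal sketch of that idea, assuming size is divisible by nproc (remainder handling elided); the file name matrix-mpi-scatter.cpp and the flattened std::vector layout are my own, not from the original program.

/* matrix-mpi-scatter.cpp -- hypothetical memory-saving sketch */
#include <mpi.h>
#include <vector>

const int size = 1000;

int main(int argc, char* argv[])
{
    int rank, nproc;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int rows = size / nproc;          // rows per worker

    std::vector<float> a, c;                // full A and C live on rank 0 only
    std::vector<float> b(size*size);        // B is replicated on every rank
    std::vector<float> alocal(rows*size), clocal(rows*size, 0.0f);

    if (rank == 0) {
        a.resize(size*size);
        c.resize(size*size);
        for (int i = 0; i < size; ++i) {
            for (int j = 0; j < size; ++j) {
                a[i*size + j] = (float)i + j;
                b[i*size + j] = (float)i - j;
            }
        }
    }

    // Each worker receives only its own row block of A.
    MPI_Scatter(a.data(), rows*size, MPI_FLOAT,
                alocal.data(), rows*size, MPI_FLOAT, 0, MPI_COMM_WORLD);
    MPI_Bcast(b.data(), size*size, MPI_FLOAT, 0, MPI_COMM_WORLD);

    // Compute the local row block of C <- A x B.
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < size; ++j)
            for (int k = 0; k < size; ++k)
                clocal[i*size + j] += alocal[i*size + k] * b[k*size + j];

    // Collect the row blocks back on rank 0.
    MPI_Gather(clocal.data(), rows*size, MPI_FLOAT,
               c.data(), rows*size, MPI_FLOAT, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}

With this layout, each extra worker adds roughly one row block of A and C plus one copy of B, instead of three full matrices.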

Sunday, August 15, 2010

Parallelizing Matrix Multiplication using TBB

Parallelizing matrix multiplication using TBB isn't too difficult. It's just a little more work than OpenMP or Cilk++.

Parallel Matrix Multiplication using TBB

/* matrix-tbb.cpp */
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>


using namespace tbb;

const int size = 1000;

float a[size][size];
float b[size][size];
float c[size][size];


class Multiply
{
public:
    void operator()(const blocked_range<int>& r) const {
        for (int i = r.begin(); i != r.end(); ++i) {
            for (int j = 0; j < size; ++j) {
                for (int k = 0; k < size; ++k) {
                    c[i][j] += a[i][k] * b[k][j];
                }
            }
        }
    }
};

int main()
{
    // Initialize buffers.
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            a[i][j] = (float)i + j;
            b[i][j] = (float)i - j;
            c[i][j] = 0.0f;
        }
    }

    // Compute matrix multiplication.
    // C <- C + A x B
    parallel_for(blocked_range<int>(0,size), Multiply());

    return 0;
}
We've moved the computation of the matrix multiplication into the class Multiply, which takes the range of i-iterations to work on. Internally, parallel_for splits the range [0,size) into multiple blocks, and multiple workers can then process different, non-overlapping blocks in parallel.
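As an aside, with a C++11 compiler the same body can be written as a lambda instead of a separate class; the parallel_for call in the program above would become (a sketch, not in the original post):

// Same computation, expressed with TBB's lambda-friendly overload.
parallel_for(blocked_range<int>(0, size),
             [](const blocked_range<int>& r) {
                 for (int i = r.begin(); i != r.end(); ++i)
                     for (int j = 0; j < size; ++j)
                         for (int k = 0; k < size; ++k)
                             c[i][j] += a[i][k] * b[k][j];
             });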

$ g++ -O2 matrix.cpp -o matrix
$ g++ -O2 matrix-tbb.cpp -ltbb -o matrix-tbb
$ time ./matrix
real 0m12.971s
user 0m12.489s
sys 0m0.052s
$ time ./matrix-tbb
real 0m7.857s
user 0m12.734s
sys 0m0.282s
Once the computation is organized into a function object that can work on any part of the iteration space, the rest is relatively easy.
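Note that blocked_range also accepts an optional grain size that bounds how finely parallel_for will split the work; the value below is arbitrary, shown only as a sketch:

// Do not split ranges below 16 rows; a larger grain reduces
// scheduling overhead at the cost of coarser load balancing.
parallel_for(blocked_range<int>(0, size, 16), Multiply());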

Parallelizing Matrix Multiplication using Cilk++ in Two Lines

Following the parallelization of matrix multiplication using OpenMP in Parallelizing Matrix Multiplication using OpenMP in One Line, can we do the same using Cilk++?

Parallel Matrix Multiplication using Cilk++

/* matrix.cilk */
const int size = 1000;

float a[size][size];
float b[size][size];
float c[size][size];

int cilk_main()
{
    // Initialize buffers.
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            a[i][j] = (float)i + j;
            b[i][j] = (float)i - j;
            c[i][j] = 0.0f;
        }
    }

    // Compute matrix multiplication.
    // C <- C + A x B
    cilk_for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            for (int k = 0; k < size; ++k) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }

    return 0;
}
cilk_main() is the entry function analogous to main() in a typical C++ program, but cilk_main() uses Cilk++ linkage, so you can spawn and sync inside it. This change is trivial. Effectively, there is only one line of non-trivial change needed to parallelize the serial matrix multiplication: the cilk_for.
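To see what that linkage buys you, here is a hypothetical divide-and-conquer version of the row loop using explicit cilk_spawn and cilk_sync, roughly what cilk_for does under the hood. It assumes the globals from matrix.cilk above; multiply_rows and its cutoff of 64 rows are my own illustration, not from the post:

// Recursively split the row range; Cilk++ linkage permits
// cilk_spawn / cilk_sync inside this function.
void multiply_rows(int istart, int iend)   // rows [istart, iend)
{
    if (iend - istart <= 64) {             // small enough: run serially
        for (int i = istart; i < iend; ++i)
            for (int j = 0; j < size; ++j)
                for (int k = 0; k < size; ++k)
                    c[i][j] += a[i][k] * b[k][j];
        return;
    }
    int mid = istart + (iend - istart) / 2;
    cilk_spawn multiply_rows(istart, mid);  // first half may run in parallel
    multiply_rows(mid, iend);               // second half in this strand
    cilk_sync;                              // wait for the spawned half
}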
$ g++ -O2 matrix.cpp -o matrix
$ cilk++ -O2 matrix.cilk -o matrix-cilk
$ time ./matrix
real 0m12.613s
user 0m12.392s
sys 0m0.030s
$ time ./matrix-cilk
real 0m6.775s
user 0m12.591s
sys 0m0.134s
The performance improvement is encouraging. You can try out with quad-core or hex-core if you have the machine.

Saturday, August 14, 2010

Parallelizing Matrix Multiplication using OpenMP in One Line

Matrix multiplication is often used in academic study. It is well suited to parallelization: the computation is intensive, O(N^3) for N x N matrices, and the output elements can be computed independently of one another. Parallel programming is hard. Does it surprise you that we can parallelize matrix multiplication with merely one line of OpenMP directive?

Serial Matrix Multiplication

/* matrix.cpp */
const int size = 1000;

float a[size][size];
float b[size][size];
float c[size][size];

int main()
{
    // Initialize buffers.
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            a[i][j] = (float)i + j;
            b[i][j] = (float)i - j;
            c[i][j] = 0.0f;
        }
    }

    // Compute matrix multiplication.
    // C <- C + A x B
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            for (int k = 0; k < size; ++k) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }

    return 0;
}
Analyzing the data dependency of the operation c[i][j] += a[i][k] * b[k][j], we can see that the i-iterations are independent of one another, and the same holds for the j-iterations. For simplicity, we consider parallelizing only a single for-loop.
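The k-iterations, by contrast, all accumulate into the same c[i][j], so parallelizing the k-loop would require a reduction. A sketch of what that would look like for a single output element, purely to illustrate the dependence (dot_row_col is my own name, and this granularity would be far too fine to be efficient in practice):

// The running sum is a loop-carried dependence; OpenMP's
// reduction clause gives each thread a private partial sum
// and combines them at the end.
float dot_row_col(int i, int j)
{
    float sum = 0.0f;
    #pragma omp parallel for reduction(+:sum)
    for (int k = 0; k < size; ++k) {
        sum += a[i][k] * b[k][j];
    }
    return sum;
}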

Parallel Matrix Multiplication using OpenMP

/* matrix-omp.cpp */
const int size = 1000;

float a[size][size];
float b[size][size];
float c[size][size];

int main()
{
    // Initialize buffers.
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            a[i][j] = (float)i + j;
            b[i][j] = (float)i - j;
            c[i][j] = 0.0f;
        }
    }

    // Compute matrix multiplication.
    // C <- C + A x B
    #pragma omp parallel for default(none) shared(a,b,c)
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            for (int k = 0; k < size; ++k) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }

    return 0;
}
With that one OpenMP directive (#pragma), the i-for-loop is divided into multiple chunks, and each chunk is assigned to a thread, so multiple threads can compute their chunks in parallel. That's it.
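Since the j-iterations are independent too, OpenMP 3.0's collapse clause could parallelize both loops at once, exposing size*size units of work instead of size; a sketch, not in the original post:

// Fuse the two independent loops into one parallel iteration space.
#pragma omp parallel for collapse(2) default(none) shared(a,b,c)
for (int i = 0; i < size; ++i) {
    for (int j = 0; j < size; ++j) {
        for (int k = 0; k < size; ++k) {
            c[i][j] += a[i][k] * b[k][j];
        }
    }
}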
$ g++ -O2 matrix.cpp -o matrix
$ g++ -O2 -fopenmp matrix-omp.cpp -o matrix-omp
$ time ./matrix
real 0m12.976s
user 0m12.460s
sys 0m0.059s
$ time ./matrix-omp
real 0m6.952s
user 0m12.556s
sys 0m0.050s
Bravo! The parallel version is about 1.86x faster than the serial version on a dual-core system. The code above is only for illustration purposes, however; there are still many optimization opportunities before it reaches commercial-grade performance.
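One such opportunity, sketched below and not from the original post, is to interchange the j- and k-loops: with the i-k-j order, the innermost statement walks b and c row-wise, which is far friendlier to the cache than striding down b's columns.

// i-k-j order: the inner j-loop streams through b[k][] and c[i][]
// sequentially instead of jumping across rows of b.
#pragma omp parallel for default(none) shared(a,b,c)
for (int i = 0; i < size; ++i) {
    for (int k = 0; k < size; ++k) {
        float aik = a[i][k];        // hoist the loop-invariant a[i][k]
        for (int j = 0; j < size; ++j) {
            c[i][j] += aik * b[k][j];
        }
    }
}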