Tuesday, August 17, 2010

Parallelizing Matrix Multiplication using MPI

MPI is a popular mechanism in high-performance computing. It works in both cluster and shared-memory environments. So why not simply use MPI everywhere, since it works in both? Why do we care about OpenMP, Cilk++, and so on? Perhaps it depends on the complexity of the application you are dealing with.

Parallel Matrix Multiplication using MPI

/* matrix-mpi.cpp */
#include <mpi.h>

const int size = 1000;

float a[size][size];
float b[size][size];
float c[size][size];

void multiply(int istart, int iend)
{
    for (int i = istart; i <= iend; ++i) {
        for (int j = 0; j < size; ++j) {
            for (int k = 0; k < size; ++k) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
}

int main(int argc, char* argv[])
{
    int rank, nproc;
    int istart, iend;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        // Initialize buffers.
        for (int i = 0; i < size; ++i) {
            for (int j = 0; j < size; ++j) {
                a[i][j] = (float)i + j;
                b[i][j] = (float)i - j;
                c[i][j] = 0.0f;
            }
        }
    }

    // Broadcast matrices to all workers.
    MPI_Bcast(a, size*size, MPI_FLOAT, 0, MPI_COMM_WORLD);
    MPI_Bcast(b, size*size, MPI_FLOAT, 0, MPI_COMM_WORLD);
    MPI_Bcast(c, size*size, MPI_FLOAT, 0, MPI_COMM_WORLD);

    // Partition work by i-for-loop.
    istart = (size / nproc) * rank;
    iend = (size / nproc) * (rank + 1) - 1;

    // Compute matrix multiplication in [istart,iend]
    // of i-for-loop.
    // C <- C + A x B
    multiply(istart, iend);

    // Gather computed results.
    MPI_Gather(c + (size/nproc)*rank,
               size*size/nproc,
               MPI_FLOAT,
               c + (size/nproc)*rank,
               size*size/nproc,
               MPI_FLOAT,
               0,
               MPI_COMM_WORLD);

    if (rank == 0) {
        // Compute remaining multiplications
        // when size % nproc > 0.
        if (size % nproc > 0) {
            multiply((size/nproc)*nproc, size-1);
        }
    }

    MPI_Finalize();
    return 0;
}

$ g++ -O2 matrix.cpp -o matrix
$ mpicxx -O2 matrix-mpi.cpp -o matrix-mpi
$ time ./matrix
real 0m13.226s
user 0m12.529s
sys 0m0.065s
$ time mpirun -np 2 ./matrix-mpi
real 0m8.490s
user 0m6.346s
sys 0m0.178s
Phew ... what a hassle! You can see the need to:
  • perform data transfer to workers manually
  • perform work partitioning manually
  • perform many index calculations
  • handle remaining work when the amount of work is not divisible by the number of workers.
Furthermore, this MPI version uses more memory than its shared-memory counterparts. The MPI program is launched as multiple processes acting as multiple workers, so the memory consumption also multiplies. More work would be required to minimize the total memory consumption. When you must work in a cluster environment, though, perhaps you don't have many choices with the current state of the art in programming tools.


  1. How did it work for you? I get: "Assertion failed in file helper_fns.c at line 337: 0
    memcpy argument memory ranges overlap, dst_=0x601620 src_=0x601620 len_=100

    internal ABORT - process 0"

  2. If your MPI library doesn't support overlapped source and destination buffers, use separate send and receive buffers for the C matrix at process rank 0.

  3. Hi!
    My university has a cluster and I'm working on it. Your code is well commented and easy to understand, but it doesn't run very well on the cluster. The resulting matrix computed on the cluster is wrong, because most of the elements are zero. Would you help me?

  4. For some reason, I keep getting a "Fatal error in PMPI_Gather: Invalid buffer pointer, error stack:
    PMPI_Gather(856): MPI_Gather(sbuf=0x601d20, scount=100, MPI_FLOAT, rbuf=0x601d20, rcount=100, MPI_FLOAT, root=0, MPI_COMM_WORLD) failed
    PMPI_Gather(797): Buffers must not be aliased"

    I've replaced the sendbuf with MPI_IN_PLACE but the error is still there
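  5. On the MPI_IN_PLACE error above: for MPI_Gather, MPI_IN_PLACE is valid only as the send buffer at the root; the other ranks must still pass their real send buffer. A sketch of the gather rewritten that way, assuming the same layout as in the post (only the gather step is shown, so the result rows are taken as already computed in c):

    ```cpp
    #include <mpi.h>
    #include <cstddef>

    const int size = 1000;
    float c[size][size];

    int main(int argc, char* argv[])
    {
        int rank, nproc;
        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &nproc);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int count = (size / nproc) * size;  // floats per rank

        if (rank == 0) {
            // Root: MPI_IN_PLACE as sendbuf; its own rows stay in c,
            // and the other ranks' blocks land in their slots of c.
            MPI_Gather(MPI_IN_PLACE, count, MPI_FLOAT,
                       c, count, MPI_FLOAT, 0, MPI_COMM_WORLD);
        } else {
            // Non-root ranks send their block; recvbuf is ignored here.
            MPI_Gather(c + (size / nproc) * rank, count, MPI_FLOAT,
                       NULL, 0, MPI_FLOAT, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }
    ```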