Sunday, August 15, 2010

Parallelizing Matrix Multiplication using TBB

Parallelizing matrix multiplication using TBB isn't too difficult. It's just a little more work than OpenMP or Cilk++.

Parallel Matrix Multiplication using TBB

/* matrix-tbb.cpp */
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>

using namespace tbb;

const int size = 1000;

float a[size][size];
float b[size][size];
float c[size][size];

class Multiply
void operator()(blocked_range<int> r) const {
for (int i = r.begin(); i != r.end(); ++i) {
for (int j = 0; j < size; ++j) {
for (int k = 0; k < size; ++k) {
c[i][j] += a[i][k] * b[k][j];

int main()
// Initialize buffers.
for (int i = 0; i < size; ++i) {
for (int j = 0; j < size; ++j) {
a[i][j] = (float)i + j;
b[i][j] = (float)i - j;
c[i][j] = 0.0f;

// Compute matrix multiplication.
// C <- C + A x B
parallel_for(blocked_range<int>(0,size), Multiply());

return 0;
We've moved the computation of the matrix multiplication into the class Multiply which takes in the range of i-iterations to work on. The parallel_for internally split the range [0,size) into multiple blocks. Multiple workers can then work on different non-overlapping blocks in parallel.

$ g++ -O2 matrix.cpp -o matrix
$ g++ -O2 matrix-tbb.cpp -ltbb -o matrix-tbb
$ time ./matrix
real 0m12.971s
user 0m12.489s
sys 0m0.052s
$ time ./matrix-tbb
real 0m7.857s
user 0m12.734s
sys 0m0.282s
Once the computation is organized into functions which can dynamically work on different parts of the computation, it's relatively easy to proceed.


  1. You can use blocked_range2d, that means you have both for loops from tbb. It performs better.

  2. Yes. You are right. This post is more to complement other posts in OpenMP, MPI, and Cilk++

  3. Thank you for providing this code. I have tested it. But on my pc (intel Due CPU P8400), the time for matrix-tbb and matrix are almost the same. Even, the time of matrix-tbb is greater than matrix. Do you know why? Thank you.

  4. It's easier to test with larger matrix size such as 2000 x 2000. The 1000 x 1000 matrix might be running too fast on your system.

    Furthermore, the CPU dynamic clock might disrupt the running time readings.

  5. thnx for this....

  6. What causes the user & system statistic to increase?

  7. Hello. I`ve tried this code, using MVS 2012 Prof.
    and got this:
    error : identifier "blocked_range" is undefined
    error : identifier "parallel_for" is undefined

    Screenshot is here:

  8. include tbb.h and use the tbb.dll or tbb_debug.dll