SpeedGo Computing: 2010

Friday, November 19, 2010

Using SGC-Ruby-CUDA on the Newly Launched Amazon EC2 Cluster GPU

Wonder if GPU works for you? No budget for a system with decent GPU? Installations and configurations are too much trouble for you? You can now try out SGC-Ruby-CUDA on Amazon EC2 with the system image, located at US East Virginia zone, called SGCRubyCUDA.1 which is available as a community AMI.

Compile for rubycu shared library and run tests.

[root@ip-10-17-130-174 sgc-ruby-cuda.git]# rake
(in /root/sgc-ruby-cuda.git)
checking for main() in -lcuda... yes
creating Makefile
g++44 -I. -I/usr/local/include/ruby-1.9.1/x86_64-linux -I/usr/local/include/ruby-1.9.1/ruby/backward -I/usr/local/include/ruby-1.9.1 -I.   -fPIC -O3 -ggdb -Wextra -Wno-unused-parameter -Wno-parentheses -Wpointer-arith -Wwrite-strings -Wno-missing-field-initializers -Wno-long-long   -o rubycu.o -c rubycu.cpp
g++44 -shared -o rubycu.so rubycu.o -L. -L/usr/local/lib -Wl,-R/usr/local/lib -L.  -rdynamic -Wl,-export-dynamic   -lcuda  -lpthread -lrt -ldl -lcrypt -lm   -lc

[root@ip-10-17-130-174 sgc-ruby-cuda.git]# rake test
(in /root/sgc-ruby-cuda.git)
/usr/local/bin/ruby -I"lib:lib" "/usr/local/lib/ruby/1.9.1/rake/rake_test_loader.rb" "test/test_rubycu.rb" 
Loaded suite /usr/local/lib/ruby/1.9.1/rake/rake_test_loader
Started
......................
Finished in 89.055900 seconds.

22 tests, 99 assertions, 0 failures, 0 errors, 0 skips

Test run options: --seed 25668

Compile for rubygems then install it and try some SGC-Ruby-CUDA APIs.

[root@ip-10-17-130-174 sgc-ruby-cuda.git]# rake gem
(in /root/sgc-ruby-cuda.git)
mkdir -p pkg
  Successfully built RubyGem
  Name: sgc-ruby-cuda
  Version: 0.0.1
  File: sgc-ruby-cuda-0.0.1-x86_64-linux.gem
mv sgc-ruby-cuda-0.0.1-x86_64-linux.gem pkg/sgc-ruby-cuda-0.0.1-x86_64-linux.gem

[root@ip-10-17-130-174 sgc-ruby-cuda.git]# cd pkg
[root@ip-10-17-130-174 pkg]# gem install sgc-ruby-cuda-0.0.1-x86_64-linux.gem 
Successfully installed sgc-ruby-cuda-0.0.1-x86_64-linux
1 gem installed
Installing ri documentation for sgc-ruby-cuda-0.0.1-x86_64-linux...
Installing RDoc documentation for sgc-ruby-cuda-0.0.1-x86_64-linux...

[root@ip-10-17-130-174 pkg]# gem list

*** LOCAL GEMS ***

minitest (1.6.0)
rake (0.8.7)
rdoc (2.5.8)
sgc-ruby-cuda (0.0.1 x86_64-linux)

[root@ip-10-17-130-174 pkg]# irb
irb(main):001:0> require 'rubycu'
=> true
irb(main):002:0> include SGC::CU
=> Object
irb(main):004:0> CUDevice.get_count
=> 2
irb(main):005:0> d = CUDevice.get(0)
=> #<SGC::CU::CUDevice:0x0000000908c920>
irb(main):006:0> c = CUContext.new.create(0, d)
=> #<SGC::CU::CUContext:0x0000000907af40>
irb(main):007:0> d.get_name
=> "Tesla M2050"
irb(main):009:0> d.compute_capability
=> {:major=>2, :minor=>0}
irb(main):010:0> d.total_mem
=> 2817982464

Note: Remember to select Cluster GPU, when launching the instance.

Tuesday, November 16, 2010

GPU Anywhere with Cloud Computing

Simulation taking months to run? Buying and maintaining new systems causing too much hassle? Perhaps Cluster GPU would be a good candidate to save time and trouble. Cloud solution is an excellent platform for proof of concept before committed to a large system in-house.

Paying $2.10 per hour (Amazon pricing as of 16 Nov 2010) for the spec of:

22 GB of memory
33.5 EC2 Compute Units (2 x Intel Xeon X5570, quad-core "Nehalem" architecture)
2 x NVIDIA Tesla "Fermi" M2050 GPUs
1690 GB of instance storage
64-bit platform
I/O Performance: Very High (10 Gigabit Ethernet)
API name: cg1.4xlarge

That may not sound inexpensive in long run, but you now have the option to pay little for great fun, which is cheaper than a movie ticket. Get the systems for a few hours solving great puzzles. That sounds like an interesting weekend activity.

High Performance Computing Using Amazon EC2

Sunday, September 26, 2010

Parallel programming knowledge is must-have skill for Wall Street

Friday, September 17, 2010

Unigine crew: CUDA vs OpenCL vs SPU Part IV

Which language or library you choose to use for your software development has great and prolong impact to the software. Just come across a simple yet interesting benchmark. Perhaps, more details on why such numbers are obtained would be even more enlightening.

Unigine crew: CUDA vs OpenCL vs SPU Part IV

CUDA Programming with Ruby

Need GPU computing power in your Ruby program? Great! SpeedGo Computing is developing Ruby bindings for CUDA, called sgc-ruby-cuda. Take advantage of your Nvidia CUDA-enabled graphics cards with Ruby now.

Currently, only part of the CUDA Driver API is included. More components such as the CUDA Runtime API will be included to make it as complete as possible.

CUDA Programming with Ruby


require 'rubycu'

include SGC::CU

SIZE = 10 
c = CUContext.new

d = CUDevice.get(0)   # Get the first device.
c.create(0, d)    # Use this device in this CUDA context.

m = CUModule.new
m.load("vadd.ptx")    # 'nvcc -ptx vadd.cu'
                      # vadd.cu is a CUDA kernel program.

da = CUDevicePtr.new    # Pointer to device memory.
db = CUDevicePtr.new
dc = CUDevicePtr.new

da.mem_alloc(4*SIZE)    # Each Int32 is 4 bytes.
db.mem_alloc(4*SIZE)    # Allocate device memory.
dc.mem_alloc(4*SIZE)

ha = Int32Buffer.new(SIZE)    # Allocate host memory.
hb = Int32Buffer.new(SIZE)
hc = Int32Buffer.new(SIZE)
hd = Int32Buffer.new(SIZE)

(0...SIZE).each { |i| ha[i] = i }
(0...SIZE).each { |i| hb[i] = 2 }
(0...SIZE).each { |i| hc[i] = ha[i] + hb[i] }
(0...SIZE).each { |i| hd[i] = 0 }

memcpy_htod(da, ha, 4*SIZE)  # Transfer inputs to device.
memcpy_htod(db, hb, 4*SIZE)

f = m.get_function("vadd");
f.set_param(da, db, dc, SIZE)
f.set_block_shape(SIZE)
f.launch_grid(1)  # Execute kernel program in the device.

memcpy_dtoh(hd, dc, 4*SIZE) # Transfer outputs to host.

puts "A\tB\tCPU\tGPU"
(0...SIZE).each { |i| 
    puts "#{ ha[i]}\t#{hb[i]}\t#{hc[i]}\t#{hd[i] }" 
}

da.mem_free    # Free device memory.
db.mem_free
dc.mem_free

c.detach    # Release context.


/* vadd.cu */
extern "C" {
    __global__ void vadd(const int* a,
                         const int* b,
                         int* c,
                         int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = a[i] + b[i];
    }
}

Although the kernel program still need to be written in CUDA C, this Ruby bindings have provided first bridging step towards Ruby GPU computing.

How to execute?


$ ruby extconf.rb
checking for main() in -lcuda... yes
creating Makefile
$ make
...
g++ -shared -o rubycu.so rubycu.o ...
$ nvcc -ptx vadd.cu
$ ruby -I . test.rb
A       B       CPU     GPU
0       2       2       2
1       2       3       3
2       2       4       4
3       2       5       5
4       2       6       6
5       2       7       7
6       2       8       8
7       2       9       9
8       2       10      10
9       2       11      11

Cool! The summation of two vectors is performed in the GPU.

See also:

Tuesday, September 7, 2010

High Performance for All

Parallel programming is much more affordable now as multi-core CPU and programmable GPU become commodity products. Unlike a decade ago where a minimum dual socket system equipped with lower clocked CPU & RAM would relatively cost a fortune to a typical desktop user, but dual-core system is basically everywhere nowadays. The use of dual-core systems is not really because it's affordable, but simply the users have not given a choice for not going multi-core.

It was non-trivial to me a decade ago, why should I go with lower clocked CPU & RAM in order to go multi-processing? Isn't that will slow down all my applications that use only single core? Fortunately, this problem is now less severe with dynamic clock adjusting CPUs, so called turbo mode. We could enjoy the benefits of high clock speed and multiple cores for different applications.

Moving forward, does commodity products make HPC a commodity service? How is HPC doing in enterprise?

Checkout the report published by Freeform Dynamics: High Performance for All

Wednesday, August 25, 2010

AMD’s Bulldozer vs Intel's Hyper-Threading?

AMD's so called Strong Thread approach in the Bulldozer module is that really compelling?

Extra cores are added when a processor can't operate at a faster clock speed, that's a good and easy way to expand a product line with effectively faster products, even though it may NOT be any faster depending on whether the applications are taking advantage of the multiple cores. But fully duplicating x86 core is expensive to scale up.

Intel hyper-threading is a good idea in certain cases, with only little more hardware it allows multiple threads to share the functional units in a core with lower context switch overhead, tolerating memory latency as memory latency is relatively high. That works well with

Complementing threads - Threads do not use the same types of functional units such as the integer units, floating units, etc. thus maximizing the hardware utilization. Or threads do not have conflicting memory accesses, especially long latency memory accesses.
Threads play nice with cache - A thread does not result in spilling out the data of another thread from the cache. Unfortunately, this would be difficult to ensure in practice as the dynamic OS thread scheduling, memory access pattern, etc. contribute to the cache usage.

On the other hand, AMD's Strong Thread includes two sets of integer units and L1 data cache in a Bulldozer module, which is heavier than the hyper-threading approach, but more lightweight than fully duplicating a x86 core. That effectively allowing a thread to enjoy full private L1 data cache during its execution quantum, while hyper-threading works in a shared L1 cache like environment. Whether the module supports cpu affinity i.e. binding a thread to a particular core of the chip, is something we should be looking for when more details are available.

Hyper-threading vs Bulldozer may provoke the argument of shared cache vs private cache: A thread can potentially access the entire shared cache, while a thread enjoys full bandwidth in accesses to the private cache. The downside is a thread is limited to the smaller private cache size even if the other private cache in the module is under utilized. To argue that further: a larger shared cache would have higher latency due to larger storage management overhead, while smaller private cache would have lower latency generally. Whether shared or private cache is better for the performance, it's very specific to the memory access patterns of multiple threads.

As L1 cache is usually very small, the performance impact of smaller private L1 data cache for a single threaded application could be compensated by the larger shared L2 cache. When an application has large working-set, doubling the L1 data cache is probably insufficient to keep the working-set anyway.

We should also note that the floating-point units connect to shared L2 cache bypassing the L1 data cache. They probably have a good reason for that. I can recall that Itanium II does not use L1 data cache for their floating-point too.

Overall, the AMD Bulldozer is an interesting architecture. It has great potential to exhibit higher performance at lower cost. Its benchmark data is something we should keep an eye on.

See also:

Tuesday, August 17, 2010

Parallelizing Matrix Multiplication using MPI

MPI is a popular mechanism in high performance computing. It works for both cluster and shared memory environment. Why don't we simply use MPI when it works for both environments? Why do we care about OpenMP? Cilk++? etc. Perhaps that depends on the complexity of the applications you are dealing with.

Parallel Matrix Multiplication using MPI

/* matrix-mpi.cpp */
#include <mpi.h>

const int size = 1000;

float a[size][size];
float b[size][size];
float c[size][size];

void multiply(int istart, int iend)
{
    for (int i = istart; i <= iend; ++i) {
        for (int j = 0; j < size; ++j) {
            for (int k = 0; k < size; ++k) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
}

int main(int argc, char* argv[])
{
    int rank, nproc;
    int istart, iend;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        // Initialize buffers.
        for (int i = 0; i < size; ++i) {
            for (int j = 0; j < size; ++j) {
                a[i][j] = (float)i + j;
                b[i][j] = (float)i - j;
                c[i][j] = 0.0f;
            }
        }
    }

    // Broadcast matrices to all workers.
    MPI_Bcast(a, size*size, MPI_FLOAT, 0,MPI_COMM_WORLD);
    MPI_Bcast(b, size*size, MPI_FLOAT, 0,MPI_COMM_WORLD);
    MPI_Bcast(c, size*size, MPI_FLOAT, 0,MPI_COMM_WORLD);

    // Partition work by i-for-loop.
    istart = (size / nproc) * rank;
    iend = (size / nproc) * (rank + 1) - 1;

    // Compute matrix multiplication in [istart,iend]
    // of i-for-loop.
    // C <- C + A x B
    multiply(istart, iend);

    // Gather computed results.
    MPI_Gather(c + (size/nproc*rank),
               size*size/nproc,
               MPI_FLOAT,
               c + (size/nproc*rank),
               size*size/nproc,
               MPI_FLOAT,
               0,
               MPI_COMM_WORLD);

    if (rank == 0) {
        // Compute remaining multiplications
        // when size % nproc > 0.
        if (size % nproc > 0) {
            multiply((size/nproc)*nproc, size-1);
        }
    }

    MPI_Finalize();
    return 0;
}


$ g++ -O2 matrix.cpp -o matrix
$ mpicxx -O2 matrix-mpi.cpp -o matrix-mpi
$ time ./matrix
real 0m13.226s
user 0m12.529s
sys 0m0.065s
$ time mpirun -np 2 ./matrix-mpi
real 0m8.490s
user 0m6.346s
sys 0m0.178s

Phew .... what a hassle ... you can see the needs to:

perform data transfer to workers manually
perform work partitioning manually
perform many index calculations
handle remaining work when the amount of work is not divisible by the number of workers.

Furthermore, this MPI version uses more memory than the shared memory counterparts. The MPI program is launched with multiple processes as multiple workers, hence the memory consumption also multiply up. More work would be required to minimize the total memory consumption.When you must work with cluster environment, perhaps you don't have many choices with the current state of art programming tools.

Sunday, August 15, 2010

Parallelizing Matrix Multiplication using TBB

Parallelizing matrix multiplication using TBB isn't too difficult. It's just a little more work than OpenMP or Cilk++.

Parallel Matrix Multiplication using TBB

/* matrix-tbb.cpp */
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>

using namespace tbb;

const int size = 1000;

float a[size][size];
float b[size][size];
float c[size][size];


class Multiply
{
public:
    void operator()(blocked_range<int> r) const {
        for (int i = r.begin(); i != r.end(); ++i) {
            for (int j = 0; j < size; ++j) {
                for (int k = 0; k < size; ++k) {
                    c[i][j] += a[i][k] * b[k][j];
                }
            }
        }
    }
};


int main()
{
    // Initialize buffers.
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            a[i][j] = (float)i + j;
            b[i][j] = (float)i - j;
            c[i][j] = 0.0f;
        }
    }

    // Compute matrix multiplication.
    // C <- C + A x B
    parallel_for(blocked_range<int>(0,size), Multiply());

    return 0;
}

We've moved the computation of the matrix multiplication into the class Multiply which takes in the range of i-iterations to work on. The parallel_for internally split the range [0,size) into multiple blocks. Multiple workers can then work on different non-overlapping blocks in parallel.


$ g++ -O2 matrix.cpp -o matrix
$ g++ -O2 matrix-tbb.cpp -ltbb -o matrix-tbb
$ time ./matrix
real 0m12.971s
user 0m12.489s
sys 0m0.052s
$ time ./matrix-tbb
real 0m7.857s
user 0m12.734s
sys 0m0.282s

Once the computation is organized into functions which can dynamically work on different parts of the computation, it's relatively easy to proceed.

Parallelizing Matrix Multiplication using Cilk++ in Two Lines

Following the parallelization of matrix multiplication using OpenMP in Parallelizing Matrix Multiplication using OpenMP in One Line, can we do the same using Cilk++?

Parallel Matrix Multiplication using Cilk++

/* matrix.cilk */
const int size = 1000;

float a[size][size];
float b[size][size];
float c[size][size];

int cilk_main()
{
    // Initialize buffers.
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            a[i][j] = (float)i + j;
            b[i][j] = (float)i - j;
            c[i][j] = 0.0f;
        }
    }

    // Compute matrix multiplication.
    // C <- C + A x B
    cilk_for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            for (int k = 0; k < size; ++k) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }

    return 0;
}

The cilk_main() is the entry function synonymous to main() in a typical C++ program, but cilk_main() function uses Cilk++ linkage such that you can spawn and sync in the function. This update is trivial.Effectively, there is only one line of non-trivial update to parallelize the serial matrix multiplication.

$ g++ -O2 matrix.cpp -o matrix
$ cilk++ -O2 matrix.cilk -o matrix-cilk
$ time ./matrix
real 0m12.613s
user 0m12.392s
sys 0m0.030s
$ time ./matrix-cilk
real 0m6.775s
user 0m12.591s
sys 0m0.134s

The performance improvement is encouraging. You can try out with quad-core or hex-core if you have the machine.

Saturday, August 14, 2010

Parallelizing Matrix Multiplication using OpenMP in One Line

Matrix multiplication is often used for academic study. It's well suited for parallelization due to its intensive O(N^3) computation and independent computation. Parallel programming is hard. Does it surprise you if we parallelize matrix multiplication in merely one line of OpenMP directive?

Serial Matrix Multiplication

/* matrix.cpp */
const int size = 1000;

float a[size][size];
float b[size][size];
float c[size][size];

int main()
{
    // Initialize buffers.
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            a[i][j] = (float)i + j;
            b[i][j] = (float)i - j;
            c[i][j] = 0.0f;
        }
    }

    // Compute matrix multiplication.
    // C <- C + A x B
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            for (int k = 0; k < size; ++k) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }

    return 0;
}

Analyzing the data dependency of the operation c[i][j] += a[i][k] * b[k][j], we can see that each iteration i is independent of each other. The same applies to iteration j. For simplicity, we consider parallelizing only single for-loop.

Parallel Matrix Multiplication using OpenMP

/* matrix-omp.cpp */
const int size = 1000;

float a[size][size];
float b[size][size];
float c[size][size];

int main()
{
    // Initialize buffers.
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            a[i][j] = (float)i + j;
            b[i][j] = (float)i - j;
            c[i][j] = 0.0f;
        }
    }

    // Compute matrix multiplication.
    // C <- C + A x B
    #pragma omp parallel for default(none) shared(a,b,c)
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            for (int k = 0; k < size; ++k) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }

    return 0;
}

With the OpenMP directive (#pragma), the i-for-loop is divided into multiple chunks, each chunk is assigned to a thread. Hence, multiple threads can compute assigned chunks in parallel. That's it.

$ g++ -O2 matrix.cpp -o matrix
$ g++ -O2 -fopenmp matrix-omp.cpp -o matrix-omp
$ time ./matrix
real 0m12.976s
user 0m12.460s
sys 0m0.059s
$ time ./matrix-omp
real 0m6.952s
user 0m12.556s
sys 0m0.050s

Bravo! The parallel version is about 1.86x faster than the serial version on a dual-core system. However, the code above is only for illustration purpose. There are still many optimization opportunities before it achieves commercial performance level.

Wednesday, August 11, 2010

Parallel Programming - Hello World

Many computer science/engineering students learn to write Hello World program at their first programming lecture. What's your first parallel program? What about Hello World program in OpenMP, MPI, Cilk++, TBB, Ruby thread, PThread?

Hello World in C

/* hello.c */
#include <stdio.h>

int main()
{
    printf("hello world\n");
    return 0;
}

$ gcc hello.c -o hello
$ ./hello
hello world

Hello World in OpenMP

/* hello-omp.c */
#include <stdio.h>
#include <omp.h>

int main()
{
    #pragma omp parallel
    {   
        int n = omp_get_num_threads();
        int id = omp_get_thread_num();

        printf("hello world %d/%d\n", n, id);
    }   
    return 0;
}

$ gcc -fopenmp hello-omp.c -o hello-omp
$ ./hello-omp
hello world 1/2
hello world 0/2
$ ./hello-omp
hello world 0/2
hello world 1/2

Hello World in MPI

/* hello-mpi.c */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[])
{
    int n, id; 
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &n);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);

    printf("hello world %d/%d\n", id, n); 

    MPI_Finalize();
    return 0;
}

$ mpicc hello-mpi.c -o hello-mpi
$ mpirun -np 2 ./hello-mpi
hello world 0/2
hello world 1/2
$ mpirun -np 2 ./hello-mpi
hello world 1/2
hello world 0/2

Hello World in Cilk++

/* hello.cilk */
#include <stdio.h>

void hello()
{
    printf("hello world\n");
}

int cilk_main()
{
    cilk_spawn hello();
    cilk_spawn hello();
    cilk_sync;
    return 0;
}

$ cilk++ hello.cilk -o hello-cilk
$ ./hello-cilk
hello world
hello world

Hello World in TBB

/* hello-tbb.cpp */
#include <iostream>
#include <tbb/parallel_for.h>

using namespace tbb;

class Hello
{
public:
    void operator()(int x) const {
        std::cout << "Hello world\n";
    }
};

int main()
{
    // parallelizing:
    // for(int i = 0; i < 2; ++i) { ... }
    parallel_for(0, 2, 1, Hello());

    return 0;
}

$ g++ hello-tbb.cpp -ltbb -o hello-tbb
$ ./hello-tbb 
Hello world
Hello world

Hello World in Ruby Thread

/* hello.rb */
#!/usr/bin/ruby

def hello
    id = Thread.current.object_id
    puts "hello world #{id}"
end

t1 = Thread.new do
    hello
end
t2 = Thread.new do
    hello
end

t1.join
t2.join

$ ./hello.rb
hello world 69858970515620
hello world 69858970515540

Hello World in PThread

/* hello-pthread.c */
#include <stdio.h>
#include <pthread.h>

void* hello(void* arg)
{
    long id = (long)pthread_self();
    printf("hello world %ld\n", id);
    pthread_exit(NULL);
}

int main()
{
    pthread_t tid1, tid2;
    pthread_attr_t attr;

    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr,
                                PTHREAD_CREATE_JOINABLE);

    pthread_create(&tid1, &attr, hello, NULL);
    pthread_create(&tid2, &attr, hello, NULL);

    pthread_join(tid1, NULL);
    pthread_join(tid2, NULL);

    pthread_attr_destroy(&attr);
    pthread_exit(NULL);

    return 0;
}

$ gcc hello-pthread.c -lpthread -o hello-pthread
$ ./hello-pthread 
hello world 139999688689424
hello world 139999680296720

Saturday, July 31, 2010

Parallel Programming - What Are The Options?

There are simply way too many parallel programming languages and libraries to keep track of. Many of them are no longer active in development, or difficult to get them working in decent operating systems. What are the practical options currently available for multi-core CPU or GPU?

OpenMP

Hardware: Shared memory multi-core CPU system.
Parallelization: Use directives e.g. #pragma omp parallel {} in C/C++/Fortran to parallelize loops or code regions.
Supported by decent compilers.
Non-supporting compilers ignore the directives and compile as serial program.
Very good for incremental parallelization.

Cilk++

Hardware: Shared memory multi-core CPU system.
Parallelization: Use new keywords in C++ namely cilk_spawn to invoke a Cilk linkage function asynchronously, cilk_sync to synchronize with locally spawned functions, cilk_for to parallelize a for-loop.
The Cilk++ runtime system takes care of the thread scheduling which ease nested parallelization tremendously and maintain certain level of efficiency.
Requires Cilk++ compiler and Cilk++ runtime system.
Very good for parallelizing dynamic codes with low overhead.

Hardware: Shared memory multi-core CPU system.
Parallelization: C++ function objects or C++0x lambda expressions as work units, parallelizing with template functions e.g. parallel_do, parallel_for, parallel_reduce, parallel_pipeline, etc. Concurrent storage classes e.g. concurrent_vector are also provided.
Portable to multiple platforms which have good C++ supports.
Uses C++ template and function object extensively. C++ beginners might have difficulty to read/write the codes.
Allow many customization options at task level which can be complicating and messy, but threads are abstracted, i.e. thread scheduling is taken care of.
Recommended only for heavy C++ users.

PThread or thread library built into languages

Hardware: Shared memory multi-core CPU system.
Parallelization: Provides a library of functions to create, destroy, synchronize threads.
Pthread is well supported on Unix/Linux systems, but Windows would require external library.
Low level and explicit manipulations of threads.
Not recommended for general parallel programming tasks.

OpenCL

Hardware: Shared memory multi-core CPU system or OpenCL supported GPU.
Parallelization: Provides a library of functions to massively execute a kernel function on a supported device.
Supported by ATI Stream SDK and Nvidia OpenCL SDK.
Requires OpenCL runtime support for the targeted devices.
Well suited for data parallel or streaming computation.
Not recommended for direct use for general parallel programming, use wrappers for OpenCL instead.

CUDA

Hardware: CUDA enabled Nvidia GPU.
Parallelization: Provides a kernel invocation method to massively execute a kernel function on a CUDA enabled Nvidia GPU. The invocation method requires CUDA compiler to parse its special syntax in the form kernel_method<<<grid_dim, block_dim,shared_mem_size,stream>>>.
Supported by Nvidia CUDA SDK.
Requires CUDA compiler and CUDA runtime system.
Well suited for data parallel or streaming computation.
The CUDA programming guide is well documented for the requirements to achieve good performance with CUDA enabled Nvidia GPU.
Recommended for gpu programming on Nvidia GPU.

Brook+

Hardware: Shared memory multi-core CPU system or CAL supported ATI GPU.
Parallelization: Allow specification of kernel function that accepts streams of data. A kernel function is invoked as per normal function. The specification of a kernel function requires Brook+ compiler to parse the syntax of the kernel function.
Supported by ATI CAL and x86 CPU backend.
Requires Brook+ compiler and Brook+ stream runtime system.
Well suited for data parallel computation.
AMD has been promoting the use of OpenCL for ATI GPU programming. Brook+ is open sourced, however, its development is no longer active.

Hardware: Shared memory multi-core CPU system or cluster of computers.
Parallelization: Provides a library of functions for message passing between processes i.e. point-to-point and collective communications.
Supported by third party library such as MPICH, OpenMPI, etc.
Requires communication runtime system.
Low level manipulations of buffers and process-process communications.
Very popular for programming HPC cluster, but not recommended for general parallel programming.

Hardware: Shared memory multi-core CPU system or distributed systems.
Parallelization: Provides a library of functions for message passing between tasks.
Supported by third party library such as Netlib PVM3.
Use standard network interface such as TCP/IP for higher interoperability over a distributed systems.
Low level manipulations of buffers and task-task communications.

Charm++

Hardware: Shared memory multi-core CPU system or distributed systems.
Parallelization: Object-oriented C++ working units where working units called chares may communicate with other chares using proxy objects.
Scheduling computations based on availability of data.
Requires Charm++ compiler and Charm++ runtime system.

uC++

Hardware: Shared memory multi-core CPU system.
Parallelization: Provides C++ coroutines for independent executions.
The runtime system performs scheduling of virtual processor using OS kernel threads.
Requires uC++ compiler and uC++ kernel.

Thursday, July 29, 2010

Who Is Responsible For The Programming Of Multi Core CPU And GPU?

Multi core CPU and GPU are now commodity products. But, where are the software that could take advantage of their parallel architecture? Who should be developing such software? The domain expert? HPC (high performance computing) sofware engineer? or parallel programming tools such as auto parallelizing compilers?

Domain experts typically do not wish to spend too much time on computing problems. They simply want to get their computation done and collect the results. Earning another degree on computing is not what they are after.
On the other hand, HPC software engineers focused very much only on computing problems, obtaining good performance with minimum domain knowledge. Learning curve for domain knowledge could be very steep for them.
The current state of art parallel programming tools are mostly parallel programming languages or libraries, performance profiling tools, code analyzer for parallelization guidelines, auto parallelizing compilers, and debugging tools.

Consequently, domain experts develop serial algorithm prototype and let HPC software engineers handle all the parallelization and performance issues. HPC software engineers then use the limited sets of parallel programming tools to parallelize and optimize the prototype.

There is serious flaw in this workflow. A parallel program needs to be designed since the beginning of the software development to be effective. When the HPC software engineers take over the prototype for optimization and parallelization, what they can do is too limited.

Without the domain knowledge, the HPC software engineers are essentially working as an auto parallelizing compiler. They analyze the code for real data dependency instead of analyzing the high-level algorithm structure, they optimize and parallelize the algorithms used in the prototype maintaining the same serial semantics instead of using or improving the best algorithms for the tasks. They are unable to perform intrusive algorithm or code modifications necessary for good performance.

See also Why Can't Compilers Auto Parallelize Serial Code Effectively?

Current state of art parallel programming tools are not able to fully automate the optimization and parallelization process. Relying purely on tools is out of the question to optimization and parallelization.

When a parallel program is not performing, who's fault is that?

Ideally domain experts and HPC software engineers should collaborate and recognize the issues from each other. The impact of domain knowledge onto the optimization and parallelization process should not be underestimated. The steep learning curve of the domain knowledge is probably unavoidable, while domain experts should be more computation oriented formulating their domain problems into computing problems.

Bioinformatics set a good example in formulating problems into computing problems, enabling many computer scientists to work on bioinformatics problems with minimum domain knowledge.

Wednesday, July 28, 2010

Why Can't Compilers Auto Parallelize Serial Code Effectively?

An auto parallelizing tool takes in a serial code base in C/C++/Fortran etc. and produces parallel version of the code. For instance, specifying -parallel option at Intel compiler compilation produces parallelized binary with OpenMP runtime. MIPSpro compiler provides similar auto parallelizing function with -apo option, where you can view the code transformation which consists of SGI OpenMP extension.

That's a great news. Theoretically, we can simply re-compile our software applications with the auto parallelizing option of supported compilers to take advantage of multi-core CPUs. Auto parallelizing algorithms such as matrix multiplication give nice performance boost without much effort. It surprises many people who then apply auto parallelizing to a moderately complicated serial program and obtained serious performance degradation against its serial version. Does it surprise you? No?

There are reasons for these if performance degradation surprised you.

1. Lack of dynamic runtime execution profile at compile time coupled with optimistic auto parallelizing compiler.

The number of iterations a block/function is executed is not known until runtime. For cases with small number of iterations, the amount of work could be too small for a group of processor cores to execute. The parallelization overhead of this block/function could be too high to be beneficial. Yet, auto parallelizing compiler simply parallelize all it can, which could result in performance degradation.
The execution time of a block/function is not known until runtime. Effectively, you wouldn't know how long this block/function take to execute. The relative overhead required to execute this block/function in parallel is essentially unknown. An auto parallelizing compiler simply parallelize all of them without considering much about whether the parallelization is indeed beneficial.

Profile guided optimization which gathers runtime information of previous runs would help to estimate the runtime information of future runs and hence parallelizing more effectively.

2. Lack of application specific information on the data access pattern coupled with conservative auto parallelizing compiler.

Dynamic aliasing of memory locations is difficult to analyze. If a read instruction and a write instruction reference to a single memory location at the same time, the consequence is arbitrary depending on the memory model adopted by the hardware. With the extensive use of memory pointers and references, it's difficult for a compiler to determine if two pointers or references are referring to the same object. Explicitly comparing two pointers or references would be prohibitively expensive. To maintain serial semantics, an auto parallelizing compiler would play safe, giving up many potential parallelization.
Dynamic accesses to non-local data may result in conflicting accesses. When a data is non-local, it may be accessed by another core at any time, which may produce arbitrary results. In order to ensure serial semantics, the compiler would give up potential parallelization even if the non-local access is not critical to the overall function.

Programmers with application specific information could determine true aliasing of memory locations much better. Specific data access pattern of the application could guarantee conflict free accesses to non-local data. These would open up many potential parallelization which an auto parallelizing compiler is unable to exploit effectively without helps from the programmers.

3. Serial program incurs unnecessary constraints on the order of execution.
A serial program specifies a sequence of instructions to be executed in the specified order. The underlying compiler optimization and hardware optimization may perform re-ordering or enabling concurrent execution of the instructions provided the serial semantics is maintained. However, the specified execution order needs not be the real data dependency. The order of execution specified in the serial program could be overly restricted for any useful optimization and parallelization, yet difficult for a compiler with only compile time information to determine if two functions may be executed in parallel.

As serial program does not specify which code regions or functions may be executed in parallel with other codes, auto parallelizing compilers have to derive the real data dependency of the serial program by analyzing the variable defined-use information in the code at compile time. Unfortunately, it's difficult to obtain accurate data dependency information at compile time especially for high-level functions which have deep level of function call chain.

Parallel programs break this issue by allowing programmer to specify which code regions or functions may be executed in parallel without the needs to ensure serial semantics is maintained, hence, removing the unnecessary constraints on the order of execution. The burden of ensuring correctness falls onto the programmers who have application specific information which can help to determine the parallelism of the code.

Thursday, July 22, 2010

Where Are All The Practical Parallel Algorithms and Libraries?

Multi-core CPU and GPU are everywhere nowadays from laptops to desktops to high-end computing clusters. Is your particular application running any faster? Nope. But generally you need parallel algorithms for an application to make full use of the multiple cores.

Perhaps you'll expect doing some searches on the web, research publications and academic books would provide you all the state of art parallel algorithms. Many of them are for academic & research purpose, many are written in distinguished programming languages, many focused on special prototype or no longer available hardware, many that you'll hardly need, ... many of them since parallel programming has been studied for decades.

What about practical parallel algorithms or library written in modern programming languages be it C/C++, Ruby, Python, Java, ... which can be incorporated easily into your own software development?

Unfortunately you don't find a lot of them. There are practical reasons for that.

1. Lack Of Widely Accepted Composable Parallel Programming Model
A parallel programming model is composable if the model allows a parallel algorithm to be used transparently within another parallel algorithm and maintain certain level of efficiency. This is essential for building up modular foundation for a parallel algorithm library.

The needs to manually synchronize multiple cores with certain forms of locks or atomic instructions break the transparent requirement. And it's difficult to achieve good efficiency when a parallel algorithm is nested within other parallel algorithms. Excessive synchronizations would incur too much overhead for parallel programming to be beneficial. With application specific knowledge, one may move synchronizations of a parallel algorithm to higher levels to improve efficiency.

Tricks are often used to maximize the efficiency trading off composability. Hence, a parallel programming model still needs to support non-composable mechanisms.

2. Requires Specific Memory Layout And Efficiency Matters
A parallel algorithm may be implemented assuming certain memory layout for its input and output data. Multiple cores would access the data in very specific ways or patterns.

When you obtain a parallel algorithm implementation from another, it's almost certain that the assumed memory layout is not compatible with your software interface. You may do data transformation to make use of the parallel algorithm transparently. This can be extremely expensive not only the in/out transformation itself would take up computation time, it will require additional memory to maintain the transformed structures.

When the degradation of the efficiency is not tolerable, you'll then implement your own parallel algorithm with the ideal memory layout for your application.

Hence, it's not uncommon to see many implementations for similar parallel algorithms around because we can't get them out of their existing environments without incurring much engineering efforts.

3. Requires Hardware Specific Optimization For Efficiency And Efficiency Matters
Programming on parallel clusters would worry about configurations such as the number of nodes, the number of processors per node, the number of cores per processor, and their inter-connectivity.

Programming on shared memory machines would worry about the number of processors, the number of cores per processor, the cache hierarchy, and the inter-core connectivity.

When you have a very high bandwidth inter-connectivity, you would perform data or task partitioning, result aggregations in a very specific way to take advantage of that. On the other hand, when the inter-connectivity has limited bandwidth, you would customize the parallel algorithm differently, or even use other less efficient parallel algorithms for the equivalent computation.

Shared memory programming would have very specific requirements on the memory layout and data access pattern for good efficiency on the underlying hardware. Different organizations of the processor cores, inter-core connections would require different sets of optimizations.

The great variety in the hardware organizations is the fundamental issue in supporting portable performance across multiple hardwares.

As performance matters, performance optimization and tuning is a crucial part of a parallel algorithm development. The portability is certainly limited. Fortunately, this may be rectified with auto-tuning mechanism.

4. Parallel Programming Was Expensive In The Past
Before the multi-core era, parallel programming is generally not affordable. Parallel programming has been very much restricted to research labs and very large corporations.

Research labs and large corporations typically deployed very large parallel systems for solving only very large problems. As shared memory machines are generally more expensive and limited in its maximum configuration size, distributed or cluster computing was the preferred choice.

Hardware configurations for large distributed systems can be complicated and confusing, especially when new systems and old systems are merged into larger systems over time.

With research efforts mostly focusing on such complicated systems, and engineering efforts mostly deal with performance issues rather than practical software development issues, do we expect cross platforms and performance portable parallel libraries? I think that's difficult.

Although there is also a significant amount of research effort on shared memory programming, shared memory machines have their own sets of issues to achieve scalable performance.

Simple parallel libraries for small machines are not interesting enough for researchers to work on. However, with the advent of multi-core processors, parallel programming becomes more affordable and ubiquitous. Sacrificing certain level of performance is reasonable for wider coverage. Parallel libraries work only for small processor core count can also be extremely useful.

Wednesday, July 21, 2010

Why Is Parallel Programming Difficult?

Parallel programming is generally perceived as an activity only for people going after high tech, bleeding edge research. It is difficult and alien enough to drive most software engineers away, whether it is really the case or merely their misconceptions. The fact is, software engineers run away from parallel programming while modern general purpose processors consist more and more multiple cores by force and by default. These processor cores are seriously under utilized.

Is parallel programming new? Nope. It has been studied for decades.

Languages and tools available for parallel programming? Yes. OpenMP, PVM, MPI, Cuda, OpenCL, Cilk++, PThreads, Java threads, ...

But, why is parallel programming difficult when there are tons publications and tools widely available? Isn't it a know how issue? When you master the theory, it should be simple, right? Let's see.

1. Managing Complex Data Dependency Is Non-Trivial

To design a parallel algorithm, the data dependency must be well managed. For statements such as,

A1. x = 10 + 5
A2. y = 20 - 8
A3. z = x * y

Statement A1 and A2 may be executed in parallel at different cores, while statement A3 must be executed only after x and y are available to the core executing statement A3. Not only we need to ensure proper order of execution, data required for the execution of a statement must be delivered in time to the core executing the statement.

When you instruct someone to do something, naturally you say,

B1. Drink the glass of milk at your hand.
B2. Walk to the kitchen.
B3. Wash the glass.

Most people design instructions with a series of actions. You wouldn't bother to specify washing the glass only after you finished drinking the milk and after you arrived in the kitchen near the sink, would you?

Functional programming is an interesting approach to simplify the data dependency complexity by using multiple definitions of stateless computation functions. A computation function definition inherently asks which data is required to compute, forcing one to specify the data dependency implicitly. However, not all computations can be expressed functionally in a convenient way.

2. Managing Data Access Conflicts Without Full Access Control
The system main memory is shared across multiple cores. It is handy when you need to maintain only single copy of data for multiple cores.

A core might be incrementing a variable with the following statements sequentially,

C1. y = x
C2. y += 1
C3. x = y

Since the memory is shared across multiple cores, x could be updated by another core after statement C1 and before statement C3 are executed. The update is effectively lost when statement C3 is then executed writing value of y into x.

Critical section, locking mechanism, or atomic instructions are introduced to protect or synchronize accesses to the variables ensuring no other cores will access the memory during the execution of these statements. This is essentially a passive approach. The programmer may forget to protect or by mistake having the variables unprotected leaving a silent loop hole behind and the execution can go on without any warnings.

This can get extremely complicated when memory pointers are used extensively without much control. You wouldn't know if two pointers are pointing to the same memory location unless you check for that explicitly. The explicit checking would incur too much overhead.

3. Managing Multiple Core Executions Driving One Insane
Current state of art parallel program is usually a single program being executed by multiple cores in parallel, i.e. each core executes a separate copy of the same program. Each of the cores performs computations on different range of data based on their uniquely associated IDs.

When you writes a program, you will simulate that program in your mind to make sure it does what you intended to compute. The same goes to parallel program. However, now you have to simulate the program for the executions of different number of cores and ensure each core would execute correctly even when there is no data for some cores to compute.

This is further complicated by the needs for multiple cores to execute different parts of the program at the same time. How many instances of core executions can you keep track in your mind at a time?

The amount of simulations required would increase multiple folds, especially when you are dealing with multi-dimensional data with many different corner cases.

This is evident especially when a parallel program does not work as expected, and you start debugging the program using your favorite tools. You may examine what are each core executing, and step through the program for each of the cores. Yes, for each of the cores! The big challenge is, how do you trace or back trace to find the problematic codes?

4. Non-Deterministic Behavior Is Confusing
Executing a parallel program multiple times may generate different results. The exact results depend largely on the exact data or task partitioning applied, the specific order of executions, the specific scheduling & allocations of processors at runtime, and the nature of the computation. Sometimes the execution incurs integer overflow, sometimes it doesn't. You get the point? In many cases, you can't simply compare the results with the serial version as the parallel version may be executing completely different computations dynamically.

How do you ensure your parallel program is written correctly when it's difficult to prove mathematically and you can't provide good and consistent reference results? It's simply difficult. Fortunately, it's less problematic for more structural computations which cover many scientific algorithms.

Introducing additional synchronization between the cores may reduce the degree of non-deterministic behavior. Consequently, that could reduces the overall performance significantly too which defeats the purpose of parallel programming itself.

It also posts difficulty to do comprehensive benchmarking. Depending on the nature of the parallel computation, the benchmarking results could be misleading as the total amount of computation, patterns of memory accesses etc. may vary across executions, even for the same input data. More input data sets should be used to obtain statistically significant results to draw meaningful conclusions.

How do you compare the performance of the parallel version against the serial or reference version when the computation could be completely different? It's simply confusing.

5. Lack Of Parallel Library For General Use
We hardly write everything from scratch. We often make use of reusable library for our tasks in hand. That makes programming much more manageable and simpler. Complicated tasks are encapsulated inside software libraries.

If we use parallel library as much as possible such as parallel sorting, concurrent hashing, parallel set, parallel binary tree search, etc. in our application, the application inherently becomes a parallel version, right? That should make parallel programming much easier.

But you don't find many reusable parallel library readily available, do you? Check out the article on Where Are All The Practical Parallel Algorithms and Libraries?

6. Performance Matters
Performance is one of the strongest points for one to make parallel programming difficult. It defeats the purpose if the performance gain is insignificant. For this single reason, one is willing to do the following,

Perform dynamic load balancing for uneven computation units.
Apply hardware and application specific task scheduling and processor allocation to maximize the processor utilization with other cost minimization objectives e.g. minimizing the distance between two frequently communicating cores, minimizing context switching, minimizing working sets, etc.
Apply hardware specific compute granularity optimization - coarse grain vs fine grain.
Apply hardware specific cache optimization to minimize memory fetches as multiple cores are sharing the memory, hence sharing the outstanding memory fetching slots and memory bandwidth.
Apply application specific coarse grain synchronization of memory trading off structured codes and maintainability for lower synchronization overhead.
Use traditional programming language such as C/C++, Fortran etc. trading off the ease of programming in modern dynamic scripting language for performance.
Develop application specific routines with hardware specific optimization trading off portability and re-usability for performance.
Design hardware and application specific memory layout trading off portability and re-usability for performance.
Apply I/O storage optimization to minimize I/O overhead and improving sustainable I/O bandwidth.

You may notice that some of the points above are not directly related to parallel programming. Nevertheless, they affect performance. So they may be included as part of the parallel programming process.

7. Legacy Software Is Often Used As The Entry Point To Parallelization
It is expensive to build a software from scratch. It makes sense to build a parallel version based on the existing serial version. Unfortunately, the software architecture of the serial version may not be suitable for building a parallel version.

Incremental parallelization sounds nice, but not always practical. Effective parallelization could require massive changes to the data structures for better data partitioning, data accesses to synchronize shared memory locations, computation structures for coarse grain parallelism, or even the algorithms for better parallelization. Global variables which are often found in serial program is particularly problematic for parallelization. How well structured is the data dependency in the legacy software is playing an important role to effective parallelization.

For more structural algorithms as of many scientific algorithms, they are always used for large-scale data, and their data dependency is often well structured. Then, parallelizing the serial version would be a good idea.