SpeedGo Computing

Sunday, September 26, 2010

Parallel programming knowledge is a must-have skill for Wall Street

Friday, September 17, 2010

Unigine crew: CUDA vs OpenCL vs SPU Part IV

The language or library you choose for software development has a large and prolonged impact on the software. I just came across a simple yet interesting benchmark. More details on why such numbers were obtained would be even more enlightening.

Unigine crew: CUDA vs OpenCL vs SPU Part IV

CUDA Programming with Ruby

Need GPU computing power in your Ruby program? Great! SpeedGo Computing is developing Ruby bindings for CUDA, called sgc-ruby-cuda. Take advantage of your Nvidia CUDA-enabled graphics cards with Ruby now.

Currently, only part of the CUDA Driver API is included. More components such as the CUDA Runtime API will be included to make it as complete as possible.



require 'rubycu'

include SGC::CU

SIZE = 10 
c = CUContext.new

d = CUDevice.get(0)   # Get the first device.
c.create(0, d)    # Use this device in this CUDA context.

m = CUModule.new
m.load("vadd.ptx")    # 'nvcc -ptx vadd.cu'
                      # vadd.cu is a CUDA kernel program.

da = CUDevicePtr.new    # Pointer to device memory.
db = CUDevicePtr.new
dc = CUDevicePtr.new

da.mem_alloc(4*SIZE)    # Each Int32 is 4 bytes.
db.mem_alloc(4*SIZE)    # Allocate device memory.
dc.mem_alloc(4*SIZE)

ha = Int32Buffer.new(SIZE)    # Allocate host memory.
hb = Int32Buffer.new(SIZE)
hc = Int32Buffer.new(SIZE)
hd = Int32Buffer.new(SIZE)

(0...SIZE).each { |i| ha[i] = i }
(0...SIZE).each { |i| hb[i] = 2 }
(0...SIZE).each { |i| hc[i] = ha[i] + hb[i] }
(0...SIZE).each { |i| hd[i] = 0 }

memcpy_htod(da, ha, 4*SIZE)  # Transfer inputs to device.
memcpy_htod(db, hb, 4*SIZE)

f = m.get_function("vadd")
f.set_param(da, db, dc, SIZE)
f.set_block_shape(SIZE)
f.launch_grid(1)  # Execute kernel program in the device.

memcpy_dtoh(hd, dc, 4*SIZE) # Transfer outputs to host.

puts "A\tB\tCPU\tGPU"
(0...SIZE).each { |i|
    puts "#{ha[i]}\t#{hb[i]}\t#{hc[i]}\t#{hd[i]}"
}

da.mem_free    # Free device memory.
db.mem_free
dc.mem_free

c.detach    # Release context.


/* vadd.cu */
extern "C" {
    __global__ void vadd(const int* a,
                         const int* b,
                         int* c,
                         int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = a[i] + b[i];
    }
}

Although the kernel program still needs to be written in CUDA C, these Ruby bindings provide a first bridging step towards GPU computing in Ruby.

How to execute?


$ ruby extconf.rb
checking for main() in -lcuda... yes
creating Makefile
$ make
...
g++ -shared -o rubycu.so rubycu.o ...
$ nvcc -ptx vadd.cu
$ ruby -I . test.rb
A       B       CPU     GPU
0       2       2       2
1       2       3       3
2       2       4       4
3       2       5       5
4       2       6       6
5       2       7       7
6       2       8       8
7       2       9       9
8       2       10      10
9       2       11      11

Cool! The summation of two vectors is performed on the GPU.


Tuesday, September 7, 2010

High Performance for All

Parallel programming is much more affordable now that multi-core CPUs and programmable GPUs have become commodity products. A decade ago, even a minimal dual-socket system equipped with lower-clocked CPUs and RAM cost a fortune for a typical desktop user; nowadays dual-core systems are basically everywhere. Dual-core systems are in use not really because they are affordable, but simply because users have not been given the choice of not going multi-core.

A decade ago it was not obvious to me why I should accept a lower-clocked CPU and RAM in order to go multi-processing. Wouldn't that slow down all my applications that use only a single core? Fortunately, this problem is less severe now with CPUs that adjust their clock dynamically, the so-called turbo mode. We can enjoy the benefits of high clock speed and of multiple cores, depending on the application.

Moving forward, do commodity products make HPC a commodity service? How is HPC doing in the enterprise?

Check out the report published by Freeform Dynamics: High Performance for All

Wednesday, August 25, 2010

AMD’s Bulldozer vs Intel's Hyper-Threading?

Is AMD's so-called Strong Thread approach in the Bulldozer module really compelling?

Extra cores are added when a processor can't operate at a faster clock speed. That's a good and easy way to expand a product line with effectively faster products, even though they may NOT be any faster, depending on whether applications take advantage of the multiple cores. But fully duplicating an x86 core is expensive to scale up.

Intel Hyper-Threading is a good idea in certain cases. With only a little more hardware, it allows multiple threads to share the functional units in a core with lower context-switch overhead, which helps tolerate the relatively high memory latency. That works well with:

Complementing threads - Threads that do not use the same types of functional units (integer units, floating-point units, etc.), thus maximizing hardware utilization. Or threads that do not have conflicting memory accesses, especially long-latency memory accesses.
Threads that play nice with the cache - A thread does not evict another thread's data from the cache. Unfortunately, this is difficult to ensure in practice, as dynamic OS thread scheduling, memory access patterns, etc. all affect cache usage.

On the other hand, AMD's Strong Thread approach includes two sets of integer units and L1 data caches in a Bulldozer module, which is heavier than the Hyper-Threading approach but more lightweight than fully duplicating an x86 core. That effectively allows a thread to enjoy a full private L1 data cache during its execution quantum, while Hyper-Threading works in a shared-L1-cache-like environment. Whether the module supports CPU affinity, i.e. binding a thread to a particular core of the chip, is something to look for when more details are available.

Hyper-Threading vs Bulldozer may provoke the argument of shared cache vs private cache: a thread can potentially access the entire shared cache, while a thread enjoys full bandwidth when accessing its private cache. The downside of a private cache is that a thread is limited to the smaller private cache size even if the other private cache in the module is underutilized. To argue further: a larger shared cache would have higher latency due to larger storage management overhead, while a smaller private cache would generally have lower latency. Whether a shared or a private cache performs better is very specific to the memory access patterns of the threads involved.

As the L1 cache is usually very small, the performance impact of a smaller private L1 data cache on a single-threaded application could be compensated for by the larger shared L2 cache. When an application has a large working set, doubling the L1 data cache is probably insufficient to hold the working set anyway.

We should also note that the floating-point units connect to the shared L2 cache, bypassing the L1 data cache. They probably have a good reason for that. I recall that the Itanium 2 does not use its L1 data cache for floating-point data either.

Overall, the AMD Bulldozer is an interesting architecture. It has great potential to exhibit higher performance at lower cost. Its benchmark data is something we should keep an eye on.


Tuesday, August 17, 2010

Parallelizing Matrix Multiplication using MPI

MPI is a popular mechanism in high performance computing. It works in both cluster and shared-memory environments. Why don't we simply use MPI when it works in both environments? Why do we care about OpenMP, Cilk++, etc.? Perhaps that depends on the complexity of the applications you are dealing with.

Parallel Matrix Multiplication using MPI

/* matrix-mpi.cpp */
#include <mpi.h>

const int size = 1000;

float a[size][size];
float b[size][size];
float c[size][size];

void multiply(int istart, int iend)
{
    for (int i = istart; i <= iend; ++i) {
        for (int j = 0; j < size; ++j) {
            for (int k = 0; k < size; ++k) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
}

int main(int argc, char* argv[])
{
    int rank, nproc;
    int istart, iend;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        // Initialize buffers.
        for (int i = 0; i < size; ++i) {
            for (int j = 0; j < size; ++j) {
                a[i][j] = (float)i + j;
                b[i][j] = (float)i - j;
                c[i][j] = 0.0f;
            }
        }
    }

    // Broadcast matrices to all workers.
    MPI_Bcast(a, size*size, MPI_FLOAT, 0,MPI_COMM_WORLD);
    MPI_Bcast(b, size*size, MPI_FLOAT, 0,MPI_COMM_WORLD);
    MPI_Bcast(c, size*size, MPI_FLOAT, 0,MPI_COMM_WORLD);

    // Partition work by i-for-loop.
    istart = (size / nproc) * rank;
    iend = (size / nproc) * (rank + 1) - 1;

    // Compute matrix multiplication in [istart,iend]
    // of i-for-loop.
    // C <- C + A x B
    multiply(istart, iend);

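    // Gather computed results. Rank 0 already holds its block in c,
    // so it passes MPI_IN_PLACE: MPI forbids aliased send and receive
    // buffers. Each rank contributes (size/nproc) whole rows.
    const int rows = size / nproc;
    MPI_Gather(rank == 0 ? MPI_IN_PLACE : (void*)(c + rows*rank),
               rows*size,
               MPI_FLOAT,
               c,
               rows*size,
               MPI_FLOAT,
               0,
               MPI_COMM_WORLD);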

    if (rank == 0) {
        // Compute remaining multiplications
        // when size % nproc > 0.
        if (size % nproc > 0) {
            multiply((size/nproc)*nproc, size-1);
        }
    }

    MPI_Finalize();
    return 0;
}


$ g++ -O2 matrix.cpp -o matrix
$ mpicxx -O2 matrix-mpi.cpp -o matrix-mpi
$ time ./matrix
real 0m13.226s
user 0m12.529s
sys 0m0.065s
$ time mpirun -np 2 ./matrix-mpi
real 0m8.490s
user 0m6.346s
sys 0m0.178s

Phew ... what a hassle. You can see the need to:

perform data transfer to workers manually
perform work partitioning manually
perform many index calculations
handle remaining work when the amount of work is not divisible by the number of workers.

Furthermore, this MPI version uses more memory than its shared-memory counterparts. The MPI program is launched as multiple processes acting as workers, hence the memory consumption multiplies as well. More work would be required to minimize the total memory consumption. When you must work in a cluster environment, perhaps you don't have many choices with the current state-of-the-art programming tools.

Sunday, August 15, 2010

Parallelizing Matrix Multiplication using TBB

Parallelizing matrix multiplication using TBB isn't too difficult. It's just a little more work than OpenMP or Cilk++.

Parallel Matrix Multiplication using TBB

/* matrix-tbb.cpp */
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>

using namespace tbb;

const int size = 1000;

float a[size][size];
float b[size][size];
float c[size][size];


class Multiply
{
public:
    void operator()(const blocked_range<int>& r) const {
        for (int i = r.begin(); i != r.end(); ++i) {
            for (int j = 0; j < size; ++j) {
                for (int k = 0; k < size; ++k) {
                    c[i][j] += a[i][k] * b[k][j];
                }
            }
        }
    }
};


int main()
{
    // Initialize buffers.
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            a[i][j] = (float)i + j;
            b[i][j] = (float)i - j;
            c[i][j] = 0.0f;
        }
    }

    // Compute matrix multiplication.
    // C <- C + A x B
    parallel_for(blocked_range<int>(0,size), Multiply());

    return 0;
}

We've moved the computation of the matrix multiplication into the class Multiply, which takes the range of i-iterations to work on. parallel_for internally splits the range [0,size) into multiple blocks. Multiple workers can then work on different non-overlapping blocks in parallel.


$ g++ -O2 matrix.cpp -o matrix
$ g++ -O2 matrix-tbb.cpp -ltbb -o matrix-tbb
$ time ./matrix
real 0m12.971s
user 0m12.489s
sys 0m0.052s
$ time ./matrix-tbb
real 0m7.857s
user 0m12.734s
sys 0m0.282s

Once the computation is organized into functions which can dynamically work on different parts of the computation, it's relatively easy to proceed.