Sunday, September 26, 2010
Friday, September 17, 2010
Unigine crew: CUDA vs OpenCL vs SPU Part IV
Which language or library you choose to use for your software development has great and prolong impact to the software. Just come across a simple yet interesting benchmark. Perhaps, more details on why such numbers are obtained would be even more enlightening.
Unigine crew: CUDA vs OpenCL vs SPU Part IV
Unigine crew: CUDA vs OpenCL vs SPU Part IV
Labels:
CUDA,
gpu,
OpenCL,
programming
CUDA Programming with Ruby
Need GPU computing power in your Ruby program? Great! SpeedGo Computing is developing Ruby bindings for CUDA, called sgc-ruby-cuda. Take advantage of your Nvidia CUDA-enabled graphics cards with Ruby now.
Currently, only part of the CUDA Driver API is included. More components such as the CUDA Runtime API will be included to make it as complete as possible.
See also:
Currently, only part of the CUDA Driver API is included. More components such as the CUDA Runtime API will be included to make it as complete as possible.
CUDA Programming with Ruby
require 'rubycu'
include SGC::CU
SIZE = 10
c = CUContext.new
d = CUDevice.get(0) # Get the first device.
c.create(0, d) # Use this device in this CUDA context.
m = CUModule.new
m.load("vadd.ptx") # 'nvcc -ptx vadd.cu'
# vadd.cu is a CUDA kernel program.
da = CUDevicePtr.new # Pointer to device memory.
db = CUDevicePtr.new
dc = CUDevicePtr.new
da.mem_alloc(4*SIZE) # Each Int32 is 4 bytes.
db.mem_alloc(4*SIZE) # Allocate device memory.
dc.mem_alloc(4*SIZE)
ha = Int32Buffer.new(SIZE) # Allocate host memory.
hb = Int32Buffer.new(SIZE)
hc = Int32Buffer.new(SIZE)
hd = Int32Buffer.new(SIZE)
(0...SIZE).each { |i| ha[i] = i }
(0...SIZE).each { |i| hb[i] = 2 }
(0...SIZE).each { |i| hc[i] = ha[i] + hb[i] }
(0...SIZE).each { |i| hd[i] = 0 }
memcpy_htod(da, ha, 4*SIZE) # Transfer inputs to device.
memcpy_htod(db, hb, 4*SIZE)
f = m.get_function("vadd");
f.set_param(da, db, dc, SIZE)
f.set_block_shape(SIZE)
f.launch_grid(1) # Execute kernel program in the device.
memcpy_dtoh(hd, dc, 4*SIZE) # Transfer outputs to host.
puts "A\tB\tCPU\tGPU"
(0...SIZE).each { |i|
puts "#{ ha[i]}\t#{hb[i]}\t#{hc[i]}\t#{hd[i] }"
}
da.mem_free # Free device memory.
db.mem_free
dc.mem_free
c.detach # Release context.
Although the kernel program still need to be written in CUDA C, this Ruby bindings have provided first bridging step towards Ruby GPU computing.
/* vadd.cu */
extern "C" {
__global__ void vadd(const int* a,
const int* b,
int* c,
int n)
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n)
c[i] = a[i] + b[i];
}
}
How to execute?
Cool! The summation of two vectors is performed in the GPU.
$ ruby extconf.rb
checking for main() in -lcuda... yes
creating Makefile
$ make
...
g++ -shared -o rubycu.so rubycu.o ...
$ nvcc -ptx vadd.cu
$ ruby -I . test.rb
A B CPU GPU
0 2 2 2
1 2 3 3
2 2 4 4
3 2 5 5
4 2 6 6
5 2 7 7
6 2 8 8
7 2 9 9
8 2 10 10
9 2 11 11
See also:
Labels:
CUDA,
gpu,
programming,
Ruby
Tuesday, September 7, 2010
High Performance for All
Parallel programming is much more affordable now as multi-core CPU and programmable GPU become commodity products. Unlike a decade ago where a minimum dual socket system equipped with lower clocked CPU & RAM would relatively cost a fortune to a typical desktop user, but dual-core system is basically everywhere nowadays. The use of dual-core systems is not really because it's affordable, but simply the users have not given a choice for not going multi-core.
It was non-trivial to me a decade ago, why should I go with lower clocked CPU & RAM in order to go multi-processing? Isn't that will slow down all my applications that use only single core? Fortunately, this problem is now less severe with dynamic clock adjusting CPUs, so called turbo mode. We could enjoy the benefits of high clock speed and multiple cores for different applications.
Moving forward, does commodity products make HPC a commodity service? How is HPC doing in enterprise?
Checkout the report published by Freeform Dynamics: High Performance for All
It was non-trivial to me a decade ago, why should I go with lower clocked CPU & RAM in order to go multi-processing? Isn't that will slow down all my applications that use only single core? Fortunately, this problem is now less severe with dynamic clock adjusting CPUs, so called turbo mode. We could enjoy the benefits of high clock speed and multiple cores for different applications.
Moving forward, does commodity products make HPC a commodity service? How is HPC doing in enterprise?
Checkout the report published by Freeform Dynamics: High Performance for All
Labels:
cpu,
gpu,
hpc,
multi core
Wednesday, August 25, 2010
AMD’s Bulldozer vs Intel's Hyper-Threading?
AMD's so called Strong Thread approach in the Bulldozer module is that really compelling?
Extra cores are added when a processor can't operate at a faster clock speed, that's a good and easy way to expand a product line with effectively faster products, even though it may NOT be any faster depending on whether the applications are taking advantage of the multiple cores. But fully duplicating x86 core is expensive to scale up.
Intel hyper-threading is a good idea in certain cases, with only little more hardware it allows multiple threads to share the functional units in a core with lower context switch overhead, tolerating memory latency as memory latency is relatively high. That works well with
- Complementing threads - Threads do not use the same types of functional units such as the integer units, floating units, etc. thus maximizing the hardware utilization. Or threads do not have conflicting memory accesses, especially long latency memory accesses.
- Threads play nice with cache - A thread does not result in spilling out the data of another thread from the cache. Unfortunately, this would be difficult to ensure in practice as the dynamic OS thread scheduling, memory access pattern, etc. contribute to the cache usage.
Hyper-threading vs Bulldozer may provoke the argument of shared cache vs private cache: A thread can potentially access the entire shared cache, while a thread enjoys full bandwidth in accesses to the private cache. The downside is a thread is limited to the smaller private cache size even if the other private cache in the module is under utilized. To argue that further: a larger shared cache would have higher latency due to larger storage management overhead, while smaller private cache would have lower latency generally. Whether shared or private cache is better for the performance, it's very specific to the memory access patterns of multiple threads.
As L1 cache is usually very small, the performance impact of smaller private L1 data cache for a single threaded application could be compensated by the larger shared L2 cache. When an application has large working-set, doubling the L1 data cache is probably insufficient to keep the working-set anyway.
We should also note that the floating-point units connect to shared L2 cache bypassing the L1 data cache. They probably have a good reason for that. I can recall that Itanium II does not use L1 data cache for their floating-point too.
Overall, the AMD Bulldozer is an interesting architecture. It has great potential to exhibit higher performance at lower cost. Its benchmark data is something we should keep an eye on.
See also:
Labels:
apu,
cpu,
multi core
Tuesday, August 17, 2010
Parallelizing Matrix Multiplication using MPI
MPI is a popular mechanism in high performance computing. It works for both cluster and shared memory environment. Why don't we simply use MPI when it works for both environments? Why do we care about OpenMP? Cilk++? etc. Perhaps that depends on the complexity of the applications you are dealing with.
Parallel Matrix Multiplication using MPI
/* matrix-mpi.cpp */
#include <mpi.h>
const int size = 1000;
float a[size][size];
float b[size][size];
float c[size][size];
void multiply(int istart, int iend)
{
for (int i = istart; i <= iend; ++i) {
for (int j = 0; j < size; ++j) {
for (int k = 0; k < size; ++k) {
c[i][j] += a[i][k] * b[k][j];
}
}
}
}
int main(int argc, char* argv[])
{
int rank, nproc;
int istart, iend;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &nproc);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0) {
// Initialize buffers.
for (int i = 0; i < size; ++i) {
for (int j = 0; j < size; ++j) {
a[i][j] = (float)i + j;
b[i][j] = (float)i - j;
c[i][j] = 0.0f;
}
}
}
// Broadcast matrices to all workers.
MPI_Bcast(a, size*size, MPI_FLOAT, 0,MPI_COMM_WORLD);
MPI_Bcast(b, size*size, MPI_FLOAT, 0,MPI_COMM_WORLD);
MPI_Bcast(c, size*size, MPI_FLOAT, 0,MPI_COMM_WORLD);
// Partition work by i-for-loop.
istart = (size / nproc) * rank;
iend = (size / nproc) * (rank + 1) - 1;
// Compute matrix multiplication in [istart,iend]
// of i-for-loop.
// C <- C + A x B
multiply(istart, iend);
// Gather computed results.
MPI_Gather(c + (size/nproc*rank),
size*size/nproc,
MPI_FLOAT,
c + (size/nproc*rank),
size*size/nproc,
MPI_FLOAT,
0,
MPI_COMM_WORLD);
if (rank == 0) {
// Compute remaining multiplications
// when size % nproc > 0.
if (size % nproc > 0) {
multiply((size/nproc)*nproc, size-1);
}
}
MPI_Finalize();
return 0;
}
Phew .... what a hassle ... you can see the needs to:
$ g++ -O2 matrix.cpp -o matrix
$ mpicxx -O2 matrix-mpi.cpp -o matrix-mpi
$ time ./matrix
real 0m13.226s
user 0m12.529s
sys 0m0.065s
$ time mpirun -np 2 ./matrix-mpi
real 0m8.490s
user 0m6.346s
sys 0m0.178s
- perform data transfer to workers manually
- perform work partitioning manually
- perform many index calculations
- handle remaining work when the amount of work is not divisible by the number of workers.
Labels:
matrix,
MPI,
parallel,
parallelization,
programming
Sunday, August 15, 2010
Parallelizing Matrix Multiplication using TBB
Parallelizing matrix multiplication using TBB isn't too difficult. It's just a little more work than OpenMP or Cilk++.
Parallel Matrix Multiplication using TBB
/* matrix-tbb.cpp */We've moved the computation of the matrix multiplication into the class Multiply which takes in the range of i-iterations to work on. The parallel_for internally split the range [0,size) into multiple blocks. Multiple workers can then work on different non-overlapping blocks in parallel.
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
using namespace tbb;
const int size = 1000;
float a[size][size];
float b[size][size];
float c[size][size];
class Multiply
{
public:
void operator()(blocked_range<int> r) const {
for (int i = r.begin(); i != r.end(); ++i) {
for (int j = 0; j < size; ++j) {
for (int k = 0; k < size; ++k) {
c[i][j] += a[i][k] * b[k][j];
}
}
}
}
};
int main()
{
// Initialize buffers.
for (int i = 0; i < size; ++i) {
for (int j = 0; j < size; ++j) {
a[i][j] = (float)i + j;
b[i][j] = (float)i - j;
c[i][j] = 0.0f;
}
}
// Compute matrix multiplication.
// C <- C + A x B
parallel_for(blocked_range<int>(0,size), Multiply());
return 0;
}
Once the computation is organized into functions which can dynamically work on different parts of the computation, it's relatively easy to proceed.
$ g++ -O2 matrix.cpp -o matrix
$ g++ -O2 matrix-tbb.cpp -ltbb -o matrix-tbb
$ time ./matrix
real 0m12.971s
user 0m12.489s
sys 0m0.052s
$ time ./matrix-tbb
real 0m7.857s
user 0m12.734s
sys 0m0.282s
Labels:
matrix,
parallel,
parallelization,
programming,
TBB
Subscribe to:
Posts (Atom)

