SpeedGo Computing: gpu

Showing posts with label gpu. Show all posts

Monday, June 18, 2012

Personal Supercomputing System with Quad GPUs

The secrets of the new Kepler GPUs have been revealed. The Kepler based graphics cards have been studied extensively for gaming performance. Most would suggest you don't need the upgrade. Furthermore, the supported PCI-E 3.0 is of little to no use.

Well, it's probably a different story for CUDA programs. Here's the setup I'm going to use for testing CUDA programs extensively.

Asus P8Z99-V Premium

Supports dual 16x PCI-E. Most other boards support only single 16x.
Built-in 32GB SSD storage.
At the time of writing, the system is not fully functional when the 2nd GTX 690 is installed. The BIOS probably needs update to fix the problems.

Intel Ivy Bridge Core i7 3770K 3.5GHz/8MB (3.9GHz at Turbo mode)
Corsair Vengeance 1600MHz 32GB
Asus GTX 690 4GB x 2
Kingston HyperX SSD 3K SATA3 240GB
Western Digital Black SATA3 2TB
Corsair AX1200 1200W
Antec Kuhler 920 (Liquid cooling)

The CPU is extremely hot 80+ C with stock fan at full load. Liquid cooling is highly recommended.
Unfortunately, the fan control software is rather limited. It controls the fan based on the liquid temperature which is very different from the CPU temperature. As a result, the fan does not automatically ramp up on high CPU temperature as the liquid temperature may be much lower.

Asus GTX 690 in a box

GTX 690 is much longer than old GTX 285. Prepare a long casing for the card.

Full system with dual GTX 690

A closer look at the LED illuminating GTX

Monday, May 9, 2011

The Choice is Yours: CUDA in C++ or Ruby

See the output here: Ruby Query Output

See the output here: C++ Query Output

Tuesday, May 3, 2011

Web Seminar: Programming GPUs Beyond CUDA

GPU/CUDA programming is easy if we ignore the performance, or even the correctness of the program. It becomes tough when the performance is critical, one has to optimize very hard on the specific hardware. Fortunately, GPU hardware performance improves drastically every 2 years. Unfortunately, the performance is not portable across different generations of GPUs.

Prof Chen from Tshing Hua University is proposing MapCG, a MapReduce framework as a resolution to the portability problem.

Check out the details of the seminar in the following link:

http://www.gpucomputing.net/?q=node/5277

Saturday, April 30, 2011

First Release of SGC Ruby CUDA - Beginning of a long way path

Today we decided to put up the first release of the SGC Ruby CUDA v0.1.0 as a mean to attract Rubyists to try out GPU programming as their new toy projects, and also to encourage HPC developers to evaluate if Ruby is good to use for their HPC applications.

When important software libraries are not available in Ruby, we certainly do not expect much Ruby usage in the area. As time is running short, more and more hardware is piling up underutilized, we are urged to take the first fundamental step moving Ruby programming towards HPC applications by making important SDK such as Nvidia CUDA SDK available to a Ruby program.

Rubyists who are new to GPU programming can now access CUDA GPUs easily to harness the massively parallel architecture of GPUs. On the other spectrum, HPC developers now have a choice to manage their complex applications by large portion in Ruby, while retaining only relatively small section of codes in C/C++/CUDA C etc.

We believe Ruby programming could improve productivity and maintainability tremendously since in many cases, heavy computation only happens in small section of codes, and Ruby programming simplify the software architecture and implementation significantly. Even when the performance is extremely critical that one must port everything back from Ruby to C/C++/CUDA C for highest performance, one has already saved tremendous effort in software architecture and design to achieve manageable design, extendable, ease of use, etc. The porting back to C/C++/CUDA C becomes much more straightforward as one has gained much knowledge about the domain.

Compared to developing a complex application from scratch in C/C++/CUDA C, one has to go through unforeseeable curvy path to achieve the same state which is bound to very high failure rate. Hence, we believe that this could set the start of Ruby programming towards HPC applications.

SGC Ruby CUDA has been updated significantly since the last post about it. As we have packaged it into a Ruby gem, you can now install it with

gem install sgc-ruby-cuda

The code repository is hosted at github, SGC Ruby CUDA.
The documentations are available at rubydoc.info, SGC Ruby CUDA Doc.
Feel free to join the discussion group/mailing list at SGC Ruby CUDA Google Group.

Friday, November 19, 2010

Using SGC-Ruby-CUDA on the Newly Launched Amazon EC2 Cluster GPU

Wonder if GPU works for you? No budget for a system with decent GPU? Installations and configurations are too much trouble for you? You can now try out SGC-Ruby-CUDA on Amazon EC2 with the system image, located at US East Virginia zone, called SGCRubyCUDA.1 which is available as a community AMI.

Compile for rubycu shared library and run tests.

[root@ip-10-17-130-174 sgc-ruby-cuda.git]# rake
(in /root/sgc-ruby-cuda.git)
checking for main() in -lcuda... yes
creating Makefile
g++44 -I. -I/usr/local/include/ruby-1.9.1/x86_64-linux -I/usr/local/include/ruby-1.9.1/ruby/backward -I/usr/local/include/ruby-1.9.1 -I.   -fPIC -O3 -ggdb -Wextra -Wno-unused-parameter -Wno-parentheses -Wpointer-arith -Wwrite-strings -Wno-missing-field-initializers -Wno-long-long   -o rubycu.o -c rubycu.cpp
g++44 -shared -o rubycu.so rubycu.o -L. -L/usr/local/lib -Wl,-R/usr/local/lib -L.  -rdynamic -Wl,-export-dynamic   -lcuda  -lpthread -lrt -ldl -lcrypt -lm   -lc

[root@ip-10-17-130-174 sgc-ruby-cuda.git]# rake test
(in /root/sgc-ruby-cuda.git)
/usr/local/bin/ruby -I"lib:lib" "/usr/local/lib/ruby/1.9.1/rake/rake_test_loader.rb" "test/test_rubycu.rb" 
Loaded suite /usr/local/lib/ruby/1.9.1/rake/rake_test_loader
Started
......................
Finished in 89.055900 seconds.

22 tests, 99 assertions, 0 failures, 0 errors, 0 skips

Test run options: --seed 25668

Compile for rubygems then install it and try some SGC-Ruby-CUDA APIs.

[root@ip-10-17-130-174 sgc-ruby-cuda.git]# rake gem
(in /root/sgc-ruby-cuda.git)
mkdir -p pkg
  Successfully built RubyGem
  Name: sgc-ruby-cuda
  Version: 0.0.1
  File: sgc-ruby-cuda-0.0.1-x86_64-linux.gem
mv sgc-ruby-cuda-0.0.1-x86_64-linux.gem pkg/sgc-ruby-cuda-0.0.1-x86_64-linux.gem

[root@ip-10-17-130-174 sgc-ruby-cuda.git]# cd pkg
[root@ip-10-17-130-174 pkg]# gem install sgc-ruby-cuda-0.0.1-x86_64-linux.gem 
Successfully installed sgc-ruby-cuda-0.0.1-x86_64-linux
1 gem installed
Installing ri documentation for sgc-ruby-cuda-0.0.1-x86_64-linux...
Installing RDoc documentation for sgc-ruby-cuda-0.0.1-x86_64-linux...

[root@ip-10-17-130-174 pkg]# gem list

*** LOCAL GEMS ***

minitest (1.6.0)
rake (0.8.7)
rdoc (2.5.8)
sgc-ruby-cuda (0.0.1 x86_64-linux)

[root@ip-10-17-130-174 pkg]# irb
irb(main):001:0> require 'rubycu'
=> true
irb(main):002:0> include SGC::CU
=> Object
irb(main):004:0> CUDevice.get_count
=> 2
irb(main):005:0> d = CUDevice.get(0)
=> #<SGC::CU::CUDevice:0x0000000908c920>
irb(main):006:0> c = CUContext.new.create(0, d)
=> #<SGC::CU::CUContext:0x0000000907af40>
irb(main):007:0> d.get_name
=> "Tesla M2050"
irb(main):009:0> d.compute_capability
=> {:major=>2, :minor=>0}
irb(main):010:0> d.total_mem
=> 2817982464

Note: Remember to select Cluster GPU, when launching the instance.

Tuesday, November 16, 2010

GPU Anywhere with Cloud Computing

Simulation taking months to run? Buying and maintaining new systems causing too much hassle? Perhaps Cluster GPU would be a good candidate to save time and trouble. Cloud solution is an excellent platform for proof of concept before committed to a large system in-house.

Paying $2.10 per hour (Amazon pricing as of 16 Nov 2010) for the spec of:

22 GB of memory
33.5 EC2 Compute Units (2 x Intel Xeon X5570, quad-core "Nehalem" architecture)
2 x NVIDIA Tesla "Fermi" M2050 GPUs
1690 GB of instance storage
64-bit platform
I/O Performance: Very High (10 Gigabit Ethernet)
API name: cg1.4xlarge

That may not sound inexpensive in long run, but you now have the option to pay little for great fun, which is cheaper than a movie ticket. Get the systems for a few hours solving great puzzles. That sounds like an interesting weekend activity.

High Performance Computing Using Amazon EC2

Friday, September 17, 2010

Unigine crew: CUDA vs OpenCL vs SPU Part IV

Which language or library you choose to use for your software development has great and prolong impact to the software. Just come across a simple yet interesting benchmark. Perhaps, more details on why such numbers are obtained would be even more enlightening.

Unigine crew: CUDA vs OpenCL vs SPU Part IV

CUDA Programming with Ruby

Need GPU computing power in your Ruby program? Great! SpeedGo Computing is developing Ruby bindings for CUDA, called sgc-ruby-cuda. Take advantage of your Nvidia CUDA-enabled graphics cards with Ruby now.

Currently, only part of the CUDA Driver API is included. More components such as the CUDA Runtime API will be included to make it as complete as possible.

CUDA Programming with Ruby


require 'rubycu'

include SGC::CU

SIZE = 10 
c = CUContext.new

d = CUDevice.get(0)   # Get the first device.
c.create(0, d)    # Use this device in this CUDA context.

m = CUModule.new
m.load("vadd.ptx")    # 'nvcc -ptx vadd.cu'
                      # vadd.cu is a CUDA kernel program.

da = CUDevicePtr.new    # Pointer to device memory.
db = CUDevicePtr.new
dc = CUDevicePtr.new

da.mem_alloc(4*SIZE)    # Each Int32 is 4 bytes.
db.mem_alloc(4*SIZE)    # Allocate device memory.
dc.mem_alloc(4*SIZE)

ha = Int32Buffer.new(SIZE)    # Allocate host memory.
hb = Int32Buffer.new(SIZE)
hc = Int32Buffer.new(SIZE)
hd = Int32Buffer.new(SIZE)

(0...SIZE).each { |i| ha[i] = i }
(0...SIZE).each { |i| hb[i] = 2 }
(0...SIZE).each { |i| hc[i] = ha[i] + hb[i] }
(0...SIZE).each { |i| hd[i] = 0 }

memcpy_htod(da, ha, 4*SIZE)  # Transfer inputs to device.
memcpy_htod(db, hb, 4*SIZE)

f = m.get_function("vadd");
f.set_param(da, db, dc, SIZE)
f.set_block_shape(SIZE)
f.launch_grid(1)  # Execute kernel program in the device.

memcpy_dtoh(hd, dc, 4*SIZE) # Transfer outputs to host.

puts "A\tB\tCPU\tGPU"
(0...SIZE).each { |i| 
    puts "#{ ha[i]}\t#{hb[i]}\t#{hc[i]}\t#{hd[i] }" 
}

da.mem_free    # Free device memory.
db.mem_free
dc.mem_free

c.detach    # Release context.


/* vadd.cu */
extern "C" {
    __global__ void vadd(const int* a,
                         const int* b,
                         int* c,
                         int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = a[i] + b[i];
    }
}

Although the kernel program still need to be written in CUDA C, this Ruby bindings have provided first bridging step towards Ruby GPU computing.

How to execute?


$ ruby extconf.rb
checking for main() in -lcuda... yes
creating Makefile
$ make
...
g++ -shared -o rubycu.so rubycu.o ...
$ nvcc -ptx vadd.cu
$ ruby -I . test.rb
A       B       CPU     GPU
0       2       2       2
1       2       3       3
2       2       4       4
3       2       5       5
4       2       6       6
5       2       7       7
6       2       8       8
7       2       9       9
8       2       10      10
9       2       11      11

Cool! The summation of two vectors is performed in the GPU.

See also:

Tuesday, September 7, 2010

High Performance for All

Parallel programming is much more affordable now as multi-core CPU and programmable GPU become commodity products. Unlike a decade ago where a minimum dual socket system equipped with lower clocked CPU & RAM would relatively cost a fortune to a typical desktop user, but dual-core system is basically everywhere nowadays. The use of dual-core systems is not really because it's affordable, but simply the users have not given a choice for not going multi-core.

It was non-trivial to me a decade ago, why should I go with lower clocked CPU & RAM in order to go multi-processing? Isn't that will slow down all my applications that use only single core? Fortunately, this problem is now less severe with dynamic clock adjusting CPUs, so called turbo mode. We could enjoy the benefits of high clock speed and multiple cores for different applications.

Moving forward, does commodity products make HPC a commodity service? How is HPC doing in enterprise?

Checkout the report published by Freeform Dynamics: High Performance for All

Thursday, July 29, 2010

Who Is Responsible For The Programming Of Multi Core CPU And GPU?

Multi core CPU and GPU are now commodity products. But, where are the software that could take advantage of their parallel architecture? Who should be developing such software? The domain expert? HPC (high performance computing) sofware engineer? or parallel programming tools such as auto parallelizing compilers?

Domain experts typically do not wish to spend too much time on computing problems. They simply want to get their computation done and collect the results. Earning another degree on computing is not what they are after.
On the other hand, HPC software engineers focused very much only on computing problems, obtaining good performance with minimum domain knowledge. Learning curve for domain knowledge could be very steep for them.
The current state of art parallel programming tools are mostly parallel programming languages or libraries, performance profiling tools, code analyzer for parallelization guidelines, auto parallelizing compilers, and debugging tools.

Consequently, domain experts develop serial algorithm prototype and let HPC software engineers handle all the parallelization and performance issues. HPC software engineers then use the limited sets of parallel programming tools to parallelize and optimize the prototype.

There is serious flaw in this workflow. A parallel program needs to be designed since the beginning of the software development to be effective. When the HPC software engineers take over the prototype for optimization and parallelization, what they can do is too limited.

Without the domain knowledge, the HPC software engineers are essentially working as an auto parallelizing compiler. They analyze the code for real data dependency instead of analyzing the high-level algorithm structure, they optimize and parallelize the algorithms used in the prototype maintaining the same serial semantics instead of using or improving the best algorithms for the tasks. They are unable to perform intrusive algorithm or code modifications necessary for good performance.

See also Why Can't Compilers Auto Parallelize Serial Code Effectively?

Current state of art parallel programming tools are not able to fully automate the optimization and parallelization process. Relying purely on tools is out of the question to optimization and parallelization.

When a parallel program is not performing, who's fault is that?

Ideally domain experts and HPC software engineers should collaborate and recognize the issues from each other. The impact of domain knowledge onto the optimization and parallelization process should not be underestimated. The steep learning curve of the domain knowledge is probably unavoidable, while domain experts should be more computation oriented formulating their domain problems into computing problems.

Bioinformatics set a good example in formulating problems into computing problems, enabling many computer scientists to work on bioinformatics problems with minimum domain knowledge.