A few suggestions from my work with Ruby and MPI on a 2048-processor computing system.
1) Dynamic libraries can be a scaling issue. If 10,000 nodes execute the code, you want to be able to broadcast the entire binary rather than have ad-hoc queries going back to the networked file system to grab dynamic libraries. Static compilation is one solution; the other is to broadcast the dynamic libraries in one chunk. A sketch of the broadcast approach follows.
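Here is a minimal sketch of that broadcast-then-load idea. The MPI module, Comm::WORLD, rank, and bcast below are assumptions standing in for whatever Ruby MPI binding the system provides; only File and Fiddle are standard Ruby.
require 'fiddle'
# Assumed MPI binding; substitute the wrapper available on your system.
world = MPI::Comm::WORLD
# Rank 0 reads the shared library once; every other node receives the
# bytes over the interconnect instead of querying the networked FS.
blob = world.rank.zero? ? File.binread('libkernels.so') : nil
blob = world.bcast(blob, 0)
# Each node writes the library to node-local storage and loads it.
local_path = "/tmp/libkernels.#{world.rank}.so"
File.binwrite(local_path, blob)
handle = Fiddle.dlopen(local_path)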
2) Think about a Ruby-like API for the four most widely used parallel computing operations. This will be used by many people. Most users want GPU-accelerated code, but they do not want that speed at the cost of making their Ruby code more complicated.
a = GpuMatrix.new(100, 100)
b = GpuMatrix.new(100, 100)
c = GpuMatrix.new(100, 100)
a = b * c
# For these, let the user pass in a block to define the comparison
# operation for sort, and the binary operation for scan/reduce.
gpu_array.sort
gpu_array.scan
gpu_array.reduce
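To pin down what those block-based calls could mean, here is a plain-Ruby sketch of the semantics. GpuArray is a hypothetical name; a real implementation would dispatch these to device kernels rather than run them on the host.
# Hypothetical GpuArray showing the intended block-based semantics.
class GpuArray
  def initialize(values)
    @values = values
  end

  # Sort with a user-supplied comparison block.
  def sort(&cmp)
    GpuArray.new(@values.sort(&cmp))
  end

  # Inclusive prefix scan with a user-supplied binary operation.
  def scan(&op)
    acc = nil
    GpuArray.new(@values.map { |v| acc = acc.nil? ? v : op.call(acc, v) })
  end

  # Reduce all elements with a user-supplied binary operation.
  def reduce(&op)
    @values.reduce(&op)
  end

  def to_a
    @values.dup
  end
end

xs = GpuArray.new([3, 1, 2])
p xs.sort { |a, b| a <=> b }.to_a  # => [1, 2, 3]
p xs.scan { |a, b| a + b }.to_a    # => [3, 4, 6]
p xs.reduce { |a, b| a + b }       # => 6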
Hi Chad, thanks for your kind feedback. When a system scales up, there are certainly many optimizations that need to be done for good performance.
The use of a dynamic library in the slides is merely a workaround for the CUDA Runtime API, which doesn't provide a kernel load function. I have not seen any CUDA API wrapper supporting the Runtime API; this missing function could be part of the reason.
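For illustration, here is a minimal sketch of that workaround using Ruby's standard Fiddle library. The library name libkernels.so and the exported launcher vec_add_launcher are hypothetical; the idea is that the kernels are pre-compiled into a shared library and the host simply dlopens the result.
require 'fiddle'
# Hypothetical: kernels pre-compiled into a shared library, e.g.
#   nvcc --shared -Xcompiler -fPIC kernels.cu -o libkernels.so
# which exports a plain C launcher, since the Runtime API offers no
# way to load a kernel at run time.
lib = Fiddle.dlopen('./libkernels.so')

# C signature: int vec_add_launcher(float *a, float *b, float *c, int n);
vec_add = Fiddle::Function.new(
  lib['vec_add_launcher'],
  [Fiddle::TYPE_VOIDP, Fiddle::TYPE_VOIDP, Fiddle::TYPE_VOIDP, Fiddle::TYPE_INT],
  Fiddle::TYPE_INT
)

n = 1024
a = Fiddle::Pointer.malloc(n * 4)  # n floats, 4 bytes each
b = Fiddle::Pointer.malloc(n * 4)
c = Fiddle::Pointer.malloc(n * 4)
status = vec_add.call(a, b, c, n)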
For executing on a large system, we could pre-compile the dynamic library and distribute it to the local file systems. HPC system admins are good at doing all kinds of tricks :)
The 2nd point is more about high-level data structures and algorithms. We can certainly work on that once the Ruby CUDA API is done. I would be happy to get your involvement and feedback by then. Thanks.