Thursday, July 22, 2010

Where Are All The Practical Parallel Algorithms and Libraries?

Multi-core CPUs and GPUs are everywhere nowadays, from laptops to desktops to high-end computing clusters. Is your particular application running any faster? Nope. That's because you generally need parallel algorithms for an application to make full use of the multiple cores.

Perhaps you expect that searching the web, research publications, and academic books would give you all the state-of-the-art parallel algorithms. There are indeed many of them, since parallel programming has been studied for decades. But many are for academic and research purposes, many are written in uncommon programming languages, many target prototype hardware or hardware that is no longer available, and many solve problems you'll hardly ever need.

What about practical parallel algorithms or libraries written in modern programming languages, be it C/C++, Ruby, Python, Java, ..., which can be incorporated easily into your own software development?

Unfortunately, you won't find a lot of them. There are practical reasons for that.

1. Lack Of Widely Accepted Composable Parallel Programming Model
A parallel programming model is composable if it allows a parallel algorithm to be used transparently within another parallel algorithm while maintaining a certain level of efficiency. This is essential for building a modular foundation for a parallel algorithm library.

The need to manually synchronize multiple cores with some form of locks or atomic instructions breaks the transparency requirement. It is also difficult to achieve good efficiency when a parallel algorithm is nested within other parallel algorithms: excessive synchronization incurs too much overhead for parallelism to be beneficial. With application-specific knowledge, one may move the synchronization of a parallel algorithm to a higher level to improve efficiency.

Such tricks are often used to maximize efficiency at the expense of composability. Hence, a parallel programming model still needs to support non-composable mechanisms.
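
To make the nesting problem concrete, here is a minimal sketch using plain C++ threads (my own illustration, not any particular library). parallel_sum spawns its own worker threads and joins them, so every call pays a synchronization barrier; calling it from inside another parallel loop multiplies the thread count and the barrier cost.

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// A self-contained parallel sum: spawns nthreads workers and joins them.
double parallel_sum(const std::vector<double>& v, unsigned nthreads) {
    std::vector<double> partial(nthreads, 0.0);
    std::vector<std::thread> workers;
    const std::size_t chunk = (v.size() + nthreads - 1) / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin = t * chunk;
            const std::size_t end = std::min(v.size(), begin + chunk);
            for (std::size_t i = begin; i < end; ++i) partial[t] += v[i];
        });
    }
    for (auto& w : workers) w.join();   // barrier: every call pays this cost
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}

int main() {
    // An outer parallel loop over many rows, each calling parallel_sum:
    // 4 outer threads x 4 inner threads already oversubscribes a quad-core
    // machine, and the per-call join barriers start to dominate.
    std::vector<std::vector<double>> rows(64, std::vector<double>(1 << 16, 1.0));
    std::vector<double> row_sums(rows.size());
    std::vector<std::thread> outer;
    for (unsigned t = 0; t < 4; ++t) {
        outer.emplace_back([&, t] {
            for (std::size_t r = t; r < rows.size(); r += 4)
                row_sums[r] = parallel_sum(rows[r], 4);   // nested parallelism
        });
    }
    for (auto& o : outer) o.join();
}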

2. Requires Specific Memory Layout And Efficiency Matters
A parallel algorithm may be implemented assuming certain memory layout for its input and output data. Multiple cores would access the data in very specific ways or patterns.

When you obtain a parallel algorithm implementation from elsewhere, it's almost certain that the assumed memory layout is not compatible with your software's interfaces. You can transform your data to make use of the parallel algorithm transparently, but this can be extremely expensive: not only does the in/out transformation itself take up computation time, it also requires additional memory to hold the transformed structures.
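
As a rough illustration (my own sketch, built around a hypothetical library routine scale_soa), suppose the library expects a struct-of-arrays layout while your application stores an array-of-structs:

#include <cstddef>
#include <vector>

struct Particle { float x, y, z; };                 // application layout (AoS)

// Hypothetical library interface: separate contiguous arrays per field (SoA).
void scale_soa(float* xs, float* ys, float* zs, std::size_t n, float s) {
    for (std::size_t i = 0; i < n; ++i) { xs[i] *= s; ys[i] *= s; zs[i] *= s; }
}

void scale_particles(std::vector<Particle>& ps, float s) {
    // In/out transformation: three temporary arrays and two full copies per call.
    std::vector<float> xs(ps.size()), ys(ps.size()), zs(ps.size());
    for (std::size_t i = 0; i < ps.size(); ++i) {
        xs[i] = ps[i].x; ys[i] = ps[i].y; zs[i] = ps[i].z;
    }
    scale_soa(xs.data(), ys.data(), zs.data(), ps.size(), s);
    for (std::size_t i = 0; i < ps.size(); ++i) {
        ps[i].x = xs[i]; ps[i].y = ys[i]; ps[i].z = zs[i];
    }
}

Every call pays for the temporary arrays and the copies in both directions; avoiding that means either changing your application's layout or writing your own kernel.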

When the degradation in efficiency is not tolerable, you'll end up implementing your own parallel algorithm with the ideal memory layout for your application.

Hence, it's not uncommon to see many implementations of similar parallel algorithms around, because we can't get them out of their existing environments without significant engineering effort.

3. Requires Hardware-Specific Optimization And Efficiency Matters
Programming on parallel clusters means worrying about configurations such as the number of nodes, the number of processors per node, the number of cores per processor, and their interconnect.

Programming on shared-memory machines means worrying about the number of processors, the number of cores per processor, the cache hierarchy, and the inter-core connectivity.

When you have a very high-bandwidth interconnect, you would partition data or tasks and aggregate results in a very specific way to take advantage of it. On the other hand, when the interconnect has limited bandwidth, you would customize the parallel algorithm differently, or even use a different, less efficient parallel algorithm for the equivalent computation.
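
For standard collective operations, an MPI implementation already hides some of this: a call like MPI_Reduce lets the library pick a reduction scheme (flat, tree, ...) suited to the actual interconnect. How you partition your own data and tasks around such calls, though, is still up to you. A minimal sketch:

#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local = rank + 1.0;   // stand-in for this node's partial result
    double total = 0.0;
    // The MPI library decides how to combine the partials across the
    // interconnect; the partitioning that produced `local` is still ours.
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) std::printf("sum over %d ranks = %f\n", size, total);

    MPI_Finalize();
    return 0;
}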

Shared-memory programming has very specific requirements on memory layout and data access patterns for good efficiency on the underlying hardware. Different organizations of the processor cores and inter-core connections require different sets of optimizations.
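
One classic example of such a layout requirement is false sharing. In this small sketch (my own illustration, assuming 64-byte cache lines), per-thread counters packed next to each other would sit on the same cache line, so writes from different cores keep invalidating each other; padding each counter to a full cache line avoids it.

#include <thread>
#include <vector>

// Each thread gets its own counter, padded to a full cache line so that
// neighbouring counters do not share a line (assumes 64-byte lines).
struct PaddedCounter {
    long value = 0;
    char pad[64 - sizeof(long)];
};

int main() {
    const unsigned nthreads = 4;
    std::vector<PaddedCounter> counters(nthreads);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nthreads; ++t)
        workers.emplace_back([&counters, t] {
            for (long i = 0; i < 10000000; ++i) counters[t].value++;
        });
    for (auto& w : workers) w.join();
    // Without the padding, the same loop typically runs several times slower
    // because every increment bounces a shared cache line between cores.
}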

The great variety of hardware organizations is the fundamental obstacle to portable performance across platforms.

Since performance matters, optimization and tuning are a crucial part of parallel algorithm development, so portability is certainly limited. Fortunately, this may be mitigated with auto-tuning mechanisms.
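
The idea behind auto-tuning, sketched below under my own simplifying assumptions, is to measure a few candidate parameter values (here, the block size of a blocked loop) on the actual machine and keep the fastest, instead of hard-coding a value for one particular cache hierarchy.

#include <algorithm>
#include <chrono>
#include <cstddef>
#include <vector>

// A toy blocked kernel; the best block size depends on the cache hierarchy.
static void blocked_scale(std::vector<double>& a, std::size_t block) {
    for (std::size_t b = 0; b < a.size(); b += block)
        for (std::size_t i = b; i < std::min(a.size(), b + block); ++i)
            a[i] *= 1.0001;
}

// Time each candidate block size once and keep the fastest.
static std::size_t pick_block_size(std::vector<double>& a) {
    std::size_t best = 1024;
    double best_time = 1e300;
    for (std::size_t cand : {1024u, 4096u, 16384u, 65536u}) {
        const auto t0 = std::chrono::steady_clock::now();
        blocked_scale(a, cand);
        const double dt = std::chrono::duration<double>(
            std::chrono::steady_clock::now() - t0).count();
        if (dt < best_time) { best_time = dt; best = cand; }
    }
    return best;
}

int main() {
    std::vector<double> data(1 << 22, 1.0);
    const std::size_t block = pick_block_size(data);
    blocked_scale(data, block);   // use the tuned block size from here on
}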

4. Parallel Programming Was Expensive In The Past
Before the multi-core era, parallel programming was generally not affordable. It was very much restricted to research labs and very large corporations.

Research labs and large corporations typically deployed very large parallel systems for solving only very large problems. As shared-memory machines are generally more expensive and limited in their maximum configuration size, distributed or cluster computing was the preferred choice.

Hardware configurations for large distributed systems can be complicated and confusing, especially when new systems and old systems are merged into larger systems over time.

With research efforts mostly focused on such complicated systems, and engineering efforts mostly dealing with performance issues rather than practical software development issues, can we expect cross-platform, performance-portable parallel libraries? I think that's difficult.

Although there is also a significant amount of research effort on shared-memory programming, shared-memory machines have their own set of issues when it comes to achieving scalable performance.

Simple parallel libraries for small machines are not interesting enough for researchers to work on. However, with the advent of multi-core processors, parallel programming has become more affordable and ubiquitous. Sacrificing a certain level of performance is reasonable for wider coverage, and parallel libraries that work only for small processor core counts can also be extremely useful.
