Saturday, July 31, 2010

Parallel Programming - What Are The Options?

There are simply way too many parallel programming languages and libraries to keep track of. Many of them are no longer active in development, or difficult to get them working in decent operating systems. What are the practical options currently available for multi-core CPU or GPU?
  1. OpenMP
    • Hardware: Shared memory multi-core CPU system.
    • Parallelization: Use directives e.g. #pragma omp parallel {} in C/C++/Fortran to parallelize loops or code regions.
    • Supported by decent compilers.
    • Non-supporting compilers ignore the directives and compile as serial program.
    • Very good for incremental parallelization.
  2. Cilk++
    • Hardware: Shared memory multi-core CPU system.
    • Parallelization: Use new keywords in C++ namely cilk_spawn to invoke a Cilk linkage function asynchronously, cilk_sync to synchronize with locally spawned functions, cilk_for to parallelize a for-loop.
    • The Cilk++ runtime system takes care of the thread scheduling which ease nested parallelization tremendously and maintain certain level of efficiency.
    • Requires Cilk++ compiler and Cilk++ runtime system.
    • Very good for parallelizing dynamic codes with low overhead.
  3. TBB
    • Hardware: Shared memory multi-core CPU system.
    • Parallelization: C++ function objects or C++0x lambda expressions as work units, parallelizing with template functions e.g. parallel_do, parallel_for, parallel_reduce, parallel_pipeline, etc. Concurrent storage classes e.g. concurrent_vector are also provided.
    • Portable to multiple platforms which have good C++ supports.
    • Uses C++ template and function object extensively. C++ beginners might have difficulty to read/write the codes.
    • Allow many customization options at task level which can be complicating and messy, but threads are abstracted, i.e. thread scheduling is taken care of.
    • Recommended only for heavy C++ users.
  4. PThread or thread library built into languages
    • Hardware: Shared memory multi-core CPU system.
    • Parallelization: Provides a library of functions to create, destroy, synchronize threads.
    • Pthread is well supported on Unix/Linux systems, but Windows would require external library.
    • Low level and explicit manipulations of threads.
    • Not recommended for general parallel programming tasks.
  5. OpenCL
    • Hardware: Shared memory multi-core CPU system or OpenCL supported GPU.
    • Parallelization: Provides a library of functions to massively execute a kernel function on a supported device.
    • Supported by ATI Stream SDK and Nvidia OpenCL SDK.
    • Requires OpenCL runtime support for the targeted devices.
    • Well suited for data parallel or streaming computation.
    • Not recommended for direct use for general parallel programming, use wrappers for OpenCL instead.
  6. CUDA
    • Hardware: CUDA enabled Nvidia GPU.
    • Parallelization: Provides a kernel invocation method to massively execute a kernel function on a CUDA enabled Nvidia GPU. The invocation method requires CUDA compiler to parse its special syntax in the form kernel_method<<<grid_dim, block_dim,shared_mem_size,stream>>>.
    • Supported by Nvidia CUDA SDK.
    • Requires CUDA compiler and CUDA runtime system.
    • Well suited for data parallel or streaming computation.
    • The CUDA programming guide is well documented for the requirements to achieve good performance with CUDA enabled Nvidia GPU.
    • Recommended for gpu programming on Nvidia GPU.
  7. Brook+
    • Hardware: Shared memory multi-core CPU system or CAL supported ATI GPU.
    • Parallelization: Allow specification of kernel function that accepts streams of data. A kernel function is invoked as per normal function. The specification of a kernel function requires Brook+ compiler to parse the syntax of the kernel function.
    • Supported by ATI CAL and x86 CPU backend.
    • Requires Brook+ compiler and Brook+ stream runtime system.
    • Well suited for data parallel computation.
    • AMD has been promoting the use of OpenCL for ATI GPU programming. Brook+ is open sourced, however, its development is no longer active.
  8. MPI
    • Hardware: Shared memory multi-core CPU system or cluster of computers.
    • Parallelization: Provides a library of functions for message passing between processes i.e. point-to-point and collective communications.
    • Supported by third party library such as MPICHOpenMPI, etc.
    • Requires communication runtime system.
    • Low level manipulations of buffers and process-process communications.
    • Very popular for programming HPC cluster, but not recommended for general parallel programming.
  9. PVM
    • Hardware: Shared memory multi-core CPU system or distributed systems.
    • Parallelization: Provides a library of functions for message passing between tasks.
    • Supported by third party library such as Netlib PVM3.
    • Use standard network interface such as TCP/IP for higher interoperability over a distributed systems.
    • Low level manipulations of buffers and task-task communications.
  10. Charm++
    • Hardware: Shared memory multi-core CPU system or distributed systems.
    • Parallelization: Object-oriented C++ working units where working units called chares may communicate with other chares using proxy objects.
    • Scheduling computations based on availability of data.
    • Requires Charm++ compiler and Charm++ runtime system.
  11. uC++
    • Hardware: Shared memory multi-core CPU system.
    • Parallelization: Provides C++ coroutines for independent executions.
    • The runtime system performs scheduling of virtual processor using OS kernel threads.
    • Requires uC++ compiler and uC++ kernel.

Thursday, July 29, 2010

Who Is Responsible For The Programming Of Multi Core CPU And GPU?

Multi core CPU and GPU are now commodity products. But, where are the software that could take advantage of their parallel architecture? Who should be developing such software? The domain expert? HPC (high performance computing) sofware engineer? or parallel programming tools such as auto parallelizing compilers?
  1. Domain experts typically do not wish to spend too much time on computing problems. They simply want to get their computation done and collect the results. Earning another degree on computing is not what they are after.
  2. On the other hand, HPC software engineers focused very much only on computing problems, obtaining good performance with minimum domain knowledge. Learning curve for domain knowledge could be very steep for them.
  3. The current state of art parallel programming tools are mostly parallel programming languages or libraries, performance profiling tools, code analyzer for parallelization guidelines, auto parallelizing compilers, and debugging tools.
Consequently, domain experts develop serial algorithm prototype and let HPC software engineers handle all the parallelization and performance issues. HPC software engineers then use the limited sets of parallel programming tools to parallelize and optimize the prototype.

There is serious flaw in this workflow. A parallel program needs to be designed since the beginning of the software development to be effective. When the HPC software engineers take over the prototype for optimization and parallelization, what they can do is too limited.

Without the domain knowledge, the HPC software engineers are essentially working as an auto parallelizing compiler. They analyze the code for real data dependency instead of analyzing the high-level algorithm structure, they optimize and parallelize the algorithms used in the prototype maintaining the same serial semantics instead of using or improving the best algorithms for the tasks. They are unable to perform intrusive algorithm or code modifications necessary for good performance.

See also Why Can't Compilers Auto Parallelize Serial Code Effectively?

Current state of art parallel programming tools are not able to fully automate the optimization and parallelization process. Relying purely on tools is out of the question to optimization and parallelization.

When a parallel program is not performing, who's fault is that?

Ideally domain experts and HPC software engineers should collaborate and recognize the issues from each other. The impact of domain knowledge onto the optimization and parallelization process should not be underestimated. The steep learning curve of the domain knowledge is probably unavoidable, while domain experts should be more computation oriented formulating their domain problems into computing problems.

Bioinformatics set a good example in formulating problems into computing problems, enabling many computer scientists to work on bioinformatics problems with minimum domain knowledge.

Wednesday, July 28, 2010

Why Can't Compilers Auto Parallelize Serial Code Effectively?

An auto parallelizing tool takes in a serial code base in C/C++/Fortran etc. and produces parallel version of the code. For instance, specifying -parallel option at Intel compiler compilation produces parallelized binary with OpenMP runtime. MIPSpro compiler provides similar auto parallelizing function with -apo option, where you can view the code transformation which consists of SGI OpenMP extension.

That's a great news. Theoretically, we can simply re-compile our software applications with the auto parallelizing option of supported compilers to take advantage of multi-core CPUs. Auto parallelizing algorithms such as matrix multiplication give nice performance boost without much effort. It surprises many people who then apply auto parallelizing to a moderately complicated serial program and obtained serious performance degradation against its serial version. Does it surprise you? No?

There are reasons for these if performance degradation surprised you.

1. Lack of dynamic runtime execution profile at compile time coupled with optimistic auto parallelizing compiler.
  • The number of iterations a block/function is executed is not known until runtime. For cases with small number of iterations, the amount of work could be too small for a group of processor cores to execute. The parallelization overhead of this block/function could be too high to be beneficial. Yet, auto parallelizing compiler simply parallelize all it can, which could result in performance degradation.
  • The execution time of a block/function is not known until runtime. Effectively, you wouldn't know how long this block/function take to execute. The relative overhead required to execute this block/function in parallel is essentially unknown. An auto parallelizing compiler simply parallelize all of them without considering much about whether the parallelization is indeed beneficial.
Profile guided optimization which gathers runtime information of previous runs would help to estimate the runtime information of future runs and hence parallelizing more effectively.

2. Lack of application specific information on the data access pattern coupled with conservative auto parallelizing compiler.
  • Dynamic aliasing of memory locations is difficult to analyze. If a read instruction and a write instruction reference to a single memory location at the same time, the consequence is arbitrary depending on the memory model adopted by the hardware. With the extensive use of memory pointers and references, it's difficult for a compiler to determine if two pointers or references are referring to the same object. Explicitly comparing two pointers or references would be prohibitively expensive. To maintain serial semantics, an auto parallelizing compiler would play safe, giving up many potential parallelization.
  • Dynamic accesses to non-local data may result in conflicting accesses. When a data is non-local, it may be accessed by another core at any time, which may produce arbitrary results. In order to ensure serial semantics, the compiler would give up potential parallelization even if the non-local access is not critical to the overall function.
Programmers with application specific information could determine true aliasing of memory locations much better. Specific data access pattern of the application could guarantee conflict free accesses to non-local data. These would open up many potential parallelization which an auto parallelizing compiler is unable to exploit effectively without helps from the programmers.

3. Serial program incurs unnecessary constraints on the order of execution.
A serial program specifies a sequence of instructions to be executed in the specified order. The underlying compiler optimization and hardware optimization may perform re-ordering or enabling concurrent execution of the instructions provided the serial semantics is maintained. However, the specified execution order needs not be the real data dependency. The order of execution specified in the serial program could be overly restricted for any useful optimization and parallelization, yet difficult for a compiler with only compile time information to determine if two functions may be executed in parallel.

As serial program does not specify which code regions or functions may be executed in parallel with other codes, auto parallelizing compilers have to derive the real data dependency of the serial program by analyzing the variable defined-use information in the code at compile time. Unfortunately, it's difficult to obtain accurate data dependency information at compile time especially for high-level functions which have deep level of function call chain.

Parallel programs break this issue by allowing programmer to specify which code regions or functions may be executed in parallel without the needs to ensure serial semantics is maintained, hence, removing the unnecessary constraints on the order of execution. The burden of ensuring correctness falls onto the programmers who have application specific information which can help to determine the parallelism of the code.

Thursday, July 22, 2010

Where Are All The Practical Parallel Algorithms and Libraries?

Multi-core CPU and GPU are everywhere nowadays from laptops to desktops to high-end computing clusters. Is your particular application running any faster? Nope. But generally you need parallel algorithms for an application to make full use of the multiple cores.

Perhaps you'll expect doing some searches on the web, research publications and academic books would provide you all the state of art parallel algorithms. Many of them are for academic & research purpose, many are written in distinguished programming languages, many focused on special prototype or no longer available hardware, many that you'll hardly need, ... many of them since parallel programming has been studied for decades.

What about practical parallel algorithms or library written in modern programming languages be it C/C++, Ruby, Python, Java, ... which can be incorporated easily into your own software development?

Unfortunately you don't find a lot of them. There are practical reasons for that.

1. Lack Of Widely Accepted Composable Parallel Programming Model
A parallel programming model is composable if the model allows a parallel algorithm to be used transparently within another parallel algorithm and maintain certain level of efficiency. This is essential for building up modular foundation for a parallel algorithm library.

The needs to manually synchronize multiple cores with certain forms of locks or atomic instructions break the transparent requirement. And it's difficult to achieve good efficiency when a parallel algorithm is nested within other parallel algorithms. Excessive synchronizations would incur too much overhead for parallel programming to be beneficial. With application specific knowledge, one may move synchronizations of a parallel algorithm to higher levels to improve efficiency.

Tricks are often used to maximize the efficiency trading off composability. Hence, a parallel programming model still needs to support non-composable mechanisms.

2. Requires Specific Memory Layout And Efficiency Matters
A parallel algorithm may be implemented assuming certain memory layout for its input and output data. Multiple cores would access the data in very specific ways or patterns.

When you obtain a parallel algorithm implementation from another, it's almost certain that the assumed memory layout is not compatible with your software interface. You may do data transformation to make use of the parallel algorithm transparently. This can be extremely expensive not only the in/out transformation itself would take up computation time, it will require additional memory to maintain the transformed structures.

When the degradation of the efficiency is not tolerable, you'll then implement your own parallel algorithm with the ideal memory layout for your application.

Hence, it's not uncommon to see many implementations for similar parallel algorithms around because we can't get them out of their existing environments without incurring much engineering efforts.

3. Requires Hardware Specific Optimization For Efficiency And Efficiency Matters
Programming on parallel clusters would worry about configurations such as the number of nodes, the number of processors per node, the number of cores per processor, and their inter-connectivity.

Programming on shared memory machines would worry about the number of processors, the number of cores per processor, the cache hierarchy, and the inter-core connectivity.

When you have a very high bandwidth inter-connectivity, you would perform data or task partitioning, result aggregations in a very specific way to take advantage of that. On the other hand, when the inter-connectivity has limited bandwidth, you would customize the parallel algorithm differently, or even use other less efficient parallel algorithms for the equivalent computation.

Shared memory programming would have very specific requirements on the memory layout and data access pattern for good efficiency on the underlying hardware. Different organizations of the processor cores, inter-core connections would require different sets of optimizations.

The great variety in the hardware organizations is the fundamental issue in supporting portable performance across multiple hardwares.

As performance matters, performance optimization and tuning is a crucial part of a parallel algorithm development. The portability is certainly limited. Fortunately, this may be rectified with auto-tuning mechanism.

4. Parallel Programming Was Expensive In The Past
Before the multi-core era, parallel programming is generally not affordable. Parallel programming has been very much restricted to research labs and very large corporations.

Research labs and large corporations typically deployed very large parallel systems for solving only very large problems. As shared memory machines are generally more expensive and limited in its maximum configuration size, distributed or cluster computing was the preferred choice.

Hardware configurations for large distributed systems can be complicated and confusing, especially when new systems and old systems are merged into larger systems over time.

With research efforts mostly focusing on such complicated systems, and engineering efforts mostly deal with performance issues rather than practical software development issues, do we expect cross platforms and performance portable parallel libraries? I think that's difficult.

Although there is also a significant amount of research effort on shared memory programming, shared memory machines have their own sets of issues to achieve scalable performance.

Simple parallel libraries for small machines are not interesting enough for researchers to work on. However, with the advent of multi-core processors, parallel programming becomes more affordable and ubiquitous. Sacrificing certain level of performance is reasonable for wider coverage. Parallel libraries work only for small processor core count can also be extremely useful.

Wednesday, July 21, 2010

Why Is Parallel Programming Difficult?

Parallel programming is generally perceived as an activity only for people going after high tech, bleeding edge research. It is difficult and alien enough to drive most software engineers away, whether it is really the case or merely their misconceptions. The fact is, software engineers run away from parallel programming while modern general purpose processors consist more and more multiple cores by force and by default. These processor cores are seriously under utilized.

Is parallel programming new? Nope. It has been studied for decades.
Languages and tools available for parallel programming? Yes. OpenMP, PVM, MPI, Cuda, OpenCL, Cilk++, PThreads, Java threads, ...

But, why is parallel programming difficult when there are tons publications and tools widely available? Isn't it a know how issue? When you master the theory, it should be simple, right? Let's see.

1. Managing Complex Data Dependency Is Non-Trivial
To design a parallel algorithm, the data dependency must be well managed. For statements such as,
A1. x = 10 + 5
A2. y = 20 - 8
A3. z = x * y
Statement A1 and A2 may be executed in parallel at different cores, while statement A3 must be executed only after x and y are available to the core executing statement A3. Not only we need to ensure proper order of execution, data required for the execution of a statement must be delivered in time to the core executing the statement.

When you instruct someone to do something, naturally you say,
B1. Drink the glass of milk at your hand.
B2. Walk to the kitchen.
B3. Wash the glass.
Most people design instructions with a series of actions. You wouldn't bother to specify washing the glass only after you finished drinking the milk and after you arrived in the kitchen near the sink, would you?

Functional programming is an interesting approach to simplify the data dependency complexity by using multiple definitions of stateless computation functions. A computation function definition inherently asks which data is required to compute, forcing one to specify the data dependency implicitly. However, not all computations can be expressed functionally in a convenient way.

2. Managing Data Access Conflicts Without Full Access Control
The system main memory is shared across multiple cores. It is handy when you need to maintain only single copy of data for multiple cores.

A core might be incrementing a variable with the following statements sequentially,
C1. y = x
C2. y += 1
C3. x = y
Since the memory is shared across multiple cores, x could be updated by another core after statement C1 and before statement C3 are executed. The update is effectively lost when statement C3 is then executed writing value of y into x.

Critical section, locking mechanism, or atomic instructions are introduced to protect or synchronize accesses to the variables ensuring no other cores will access the memory during the execution of these statements. This is essentially a passive approach. The programmer may forget to protect or by mistake having the variables unprotected leaving a silent loop hole behind and the execution can go on without any warnings.

This can get extremely complicated when memory pointers are used extensively without much control. You wouldn't know if two pointers are pointing to the same memory location unless you check for that explicitly. The explicit checking would incur too much overhead.

3. Managing Multiple Core Executions Driving One Insane
Current state of art parallel program is usually a single program being executed by multiple cores in parallel, i.e. each core executes a separate copy of the same program. Each of the cores performs computations on different range of data based on their uniquely associated IDs.

When you writes a program, you will simulate that program in your mind to make sure it does what you intended to compute. The same goes to parallel program. However, now you have to simulate the program for the executions of different number of cores and ensure each core would execute correctly even when there is no data for some cores to compute.

This is further complicated by the needs for multiple cores to execute different parts of the program at the same time. How many instances of core executions can you keep track in your mind at a time?

The amount of simulations required would increase multiple folds, especially when you are dealing with multi-dimensional data with many different corner cases.

This is evident especially when a parallel program does not work as expected, and you start debugging the program using your favorite tools. You may examine what are each core executing, and step through the program for each of the cores. Yes, for each of the cores! The big challenge is, how do you trace or back trace to find the problematic codes?

4. Non-Deterministic Behavior Is Confusing
Executing a parallel program multiple times may generate different results. The exact results depend largely on the exact data or task partitioning applied, the specific order of executions, the specific scheduling & allocations of processors at runtime, and the nature of the computation. Sometimes the execution incurs integer overflow, sometimes it doesn't. You get the point? In many cases, you can't simply compare the results with the serial version as the parallel version may be executing completely different computations dynamically.

How do you ensure your parallel program is written correctly when it's difficult to prove mathematically and you can't provide good and consistent reference results? It's simply difficult. Fortunately, it's less problematic for more structural computations which cover many scientific algorithms.

Introducing additional synchronization between the cores may reduce the degree of non-deterministic behavior. Consequently, that could reduces the overall performance significantly too which defeats the purpose of parallel programming itself.

It also posts difficulty to do comprehensive benchmarking. Depending on the nature of the parallel computation, the benchmarking results could be misleading as the total amount of computation, patterns of memory accesses etc. may vary across executions, even for the same input data. More input data sets should be used to obtain statistically significant results to draw meaningful conclusions.

How do you compare the performance of the parallel version against the serial or reference version when the computation could be completely different? It's simply confusing.

5. Lack Of Parallel Library For General Use
We hardly write everything from scratch. We often make use of reusable library for our tasks in hand. That makes programming much more manageable and simpler. Complicated tasks are encapsulated inside software libraries.

If we use parallel library as much as possible such as parallel sorting, concurrent hashing, parallel set, parallel binary tree search, etc. in our application, the application inherently becomes a parallel version, right? That should make parallel programming much easier.

But you don't find many reusable parallel library readily available, do you? Check out the article on Where Are All The Practical Parallel Algorithms and Libraries?

6. Performance Matters
Performance is one of the strongest points for one to make parallel programming difficult. It defeats the purpose if the performance gain is insignificant. For this single reason, one is willing to do the following,
  • Perform dynamic load balancing for uneven computation units.
  • Apply hardware and application specific task scheduling and processor allocation to maximize the processor utilization with other cost minimization objectives e.g. minimizing the distance between two frequently communicating cores, minimizing context switching, minimizing working sets, etc.
  • Apply hardware specific compute granularity optimization - coarse grain vs fine grain.
  • Apply hardware specific cache optimization to minimize memory fetches as multiple cores are sharing the memory, hence sharing the outstanding memory fetching slots and memory bandwidth.
  • Apply application specific coarse grain synchronization of memory trading off structured codes and maintainability for lower synchronization overhead.
  • Use traditional programming language such as C/C++, Fortran etc. trading off the ease of programming in modern dynamic scripting language for performance.
  • Develop application specific routines with hardware specific optimization trading off portability and re-usability for performance.
  • Design hardware and application specific memory layout trading off portability and re-usability for performance.
  • Apply I/O storage optimization to minimize I/O overhead and improving sustainable I/O bandwidth.
You may notice that some of the points above are not directly related to parallel programming. Nevertheless, they affect performance. So they may be included as part of the parallel programming process.

7. Legacy Software Is Often Used As The Entry Point To Parallelization
It is expensive to build a software from scratch. It makes sense to build a parallel version based on the existing serial version. Unfortunately, the software architecture of the serial version may not be suitable for building a parallel version.

Incremental parallelization sounds nice, but not always practical. Effective parallelization could require massive changes to the data structures for better data partitioning, data accesses to synchronize shared memory locations, computation structures for coarse grain parallelism, or even the algorithms for better parallelization. Global variables which are often found in serial program is particularly problematic for parallelization. How well structured is the data dependency in the legacy software is playing an important role to effective parallelization.

For more structural algorithms as of many scientific algorithms, they are always used for large-scale data, and their data dependency is often well structured. Then, parallelizing the serial version would be a good idea.