Wednesday, July 21, 2010

Why Is Parallel Programming Difficult?

Parallel programming is generally perceived as an activity only for people going after high tech, bleeding edge research. It is difficult and alien enough to drive most software engineers away, whether it is really the case or merely their misconceptions. The fact is, software engineers run away from parallel programming while modern general purpose processors consist more and more multiple cores by force and by default. These processor cores are seriously under utilized.

Is parallel programming new? Nope. It has been studied for decades.
Languages and tools available for parallel programming? Yes. OpenMP, PVM, MPI, Cuda, OpenCL, Cilk++, PThreads, Java threads, ...

But, why is parallel programming difficult when there are tons publications and tools widely available? Isn't it a know how issue? When you master the theory, it should be simple, right? Let's see.

1. Managing Complex Data Dependency Is Non-Trivial
To design a parallel algorithm, the data dependency must be well managed. For statements such as,
A1. x = 10 + 5
A2. y = 20 - 8
A3. z = x * y
Statement A1 and A2 may be executed in parallel at different cores, while statement A3 must be executed only after x and y are available to the core executing statement A3. Not only we need to ensure proper order of execution, data required for the execution of a statement must be delivered in time to the core executing the statement.

When you instruct someone to do something, naturally you say,
B1. Drink the glass of milk at your hand.
B2. Walk to the kitchen.
B3. Wash the glass.
Most people design instructions with a series of actions. You wouldn't bother to specify washing the glass only after you finished drinking the milk and after you arrived in the kitchen near the sink, would you?

Functional programming is an interesting approach to simplify the data dependency complexity by using multiple definitions of stateless computation functions. A computation function definition inherently asks which data is required to compute, forcing one to specify the data dependency implicitly. However, not all computations can be expressed functionally in a convenient way.

2. Managing Data Access Conflicts Without Full Access Control
The system main memory is shared across multiple cores. It is handy when you need to maintain only single copy of data for multiple cores.

A core might be incrementing a variable with the following statements sequentially,
C1. y = x
C2. y += 1
C3. x = y
Since the memory is shared across multiple cores, x could be updated by another core after statement C1 and before statement C3 are executed. The update is effectively lost when statement C3 is then executed writing value of y into x.

Critical section, locking mechanism, or atomic instructions are introduced to protect or synchronize accesses to the variables ensuring no other cores will access the memory during the execution of these statements. This is essentially a passive approach. The programmer may forget to protect or by mistake having the variables unprotected leaving a silent loop hole behind and the execution can go on without any warnings.

This can get extremely complicated when memory pointers are used extensively without much control. You wouldn't know if two pointers are pointing to the same memory location unless you check for that explicitly. The explicit checking would incur too much overhead.

3. Managing Multiple Core Executions Driving One Insane
Current state of art parallel program is usually a single program being executed by multiple cores in parallel, i.e. each core executes a separate copy of the same program. Each of the cores performs computations on different range of data based on their uniquely associated IDs.

When you writes a program, you will simulate that program in your mind to make sure it does what you intended to compute. The same goes to parallel program. However, now you have to simulate the program for the executions of different number of cores and ensure each core would execute correctly even when there is no data for some cores to compute.

This is further complicated by the needs for multiple cores to execute different parts of the program at the same time. How many instances of core executions can you keep track in your mind at a time?

The amount of simulations required would increase multiple folds, especially when you are dealing with multi-dimensional data with many different corner cases.

This is evident especially when a parallel program does not work as expected, and you start debugging the program using your favorite tools. You may examine what are each core executing, and step through the program for each of the cores. Yes, for each of the cores! The big challenge is, how do you trace or back trace to find the problematic codes?

4. Non-Deterministic Behavior Is Confusing
Executing a parallel program multiple times may generate different results. The exact results depend largely on the exact data or task partitioning applied, the specific order of executions, the specific scheduling & allocations of processors at runtime, and the nature of the computation. Sometimes the execution incurs integer overflow, sometimes it doesn't. You get the point? In many cases, you can't simply compare the results with the serial version as the parallel version may be executing completely different computations dynamically.

How do you ensure your parallel program is written correctly when it's difficult to prove mathematically and you can't provide good and consistent reference results? It's simply difficult. Fortunately, it's less problematic for more structural computations which cover many scientific algorithms.

Introducing additional synchronization between the cores may reduce the degree of non-deterministic behavior. Consequently, that could reduces the overall performance significantly too which defeats the purpose of parallel programming itself.

It also posts difficulty to do comprehensive benchmarking. Depending on the nature of the parallel computation, the benchmarking results could be misleading as the total amount of computation, patterns of memory accesses etc. may vary across executions, even for the same input data. More input data sets should be used to obtain statistically significant results to draw meaningful conclusions.

How do you compare the performance of the parallel version against the serial or reference version when the computation could be completely different? It's simply confusing.

5. Lack Of Parallel Library For General Use
We hardly write everything from scratch. We often make use of reusable library for our tasks in hand. That makes programming much more manageable and simpler. Complicated tasks are encapsulated inside software libraries.

If we use parallel library as much as possible such as parallel sorting, concurrent hashing, parallel set, parallel binary tree search, etc. in our application, the application inherently becomes a parallel version, right? That should make parallel programming much easier.

But you don't find many reusable parallel library readily available, do you? Check out the article on Where Are All The Practical Parallel Algorithms and Libraries?

6. Performance Matters
Performance is one of the strongest points for one to make parallel programming difficult. It defeats the purpose if the performance gain is insignificant. For this single reason, one is willing to do the following,
  • Perform dynamic load balancing for uneven computation units.
  • Apply hardware and application specific task scheduling and processor allocation to maximize the processor utilization with other cost minimization objectives e.g. minimizing the distance between two frequently communicating cores, minimizing context switching, minimizing working sets, etc.
  • Apply hardware specific compute granularity optimization - coarse grain vs fine grain.
  • Apply hardware specific cache optimization to minimize memory fetches as multiple cores are sharing the memory, hence sharing the outstanding memory fetching slots and memory bandwidth.
  • Apply application specific coarse grain synchronization of memory trading off structured codes and maintainability for lower synchronization overhead.
  • Use traditional programming language such as C/C++, Fortran etc. trading off the ease of programming in modern dynamic scripting language for performance.
  • Develop application specific routines with hardware specific optimization trading off portability and re-usability for performance.
  • Design hardware and application specific memory layout trading off portability and re-usability for performance.
  • Apply I/O storage optimization to minimize I/O overhead and improving sustainable I/O bandwidth.
You may notice that some of the points above are not directly related to parallel programming. Nevertheless, they affect performance. So they may be included as part of the parallel programming process.

7. Legacy Software Is Often Used As The Entry Point To Parallelization
It is expensive to build a software from scratch. It makes sense to build a parallel version based on the existing serial version. Unfortunately, the software architecture of the serial version may not be suitable for building a parallel version.

Incremental parallelization sounds nice, but not always practical. Effective parallelization could require massive changes to the data structures for better data partitioning, data accesses to synchronize shared memory locations, computation structures for coarse grain parallelism, or even the algorithms for better parallelization. Global variables which are often found in serial program is particularly problematic for parallelization. How well structured is the data dependency in the legacy software is playing an important role to effective parallelization.

For more structural algorithms as of many scientific algorithms, they are always used for large-scale data, and their data dependency is often well structured. Then, parallelizing the serial version would be a good idea.

No comments:

Post a Comment