Wednesday, July 28, 2010

Why Can't Compilers Auto Parallelize Serial Code Effectively?

An auto parallelizing tool takes a serial code base in C/C++/Fortran, etc., and produces a parallel version of the code. For instance, compiling with the Intel compiler's -parallel option produces a parallelized binary that runs on the OpenMP runtime. The MIPSpro compiler provides a similar auto parallelizing function with its -apo option, which also lets you view the code transformation, expressed in SGI's OpenMP extensions.
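As a minimal sketch of the transformation (not the literal output of any particular compiler), an auto parallelizing compiler effectively rewrites an independent loop into its OpenMP equivalent:

    /* Serial loop as written by the programmer. */
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];

    /* Roughly what the auto parallelizer produces, expressed in OpenMP:
       the iterations are split among the available worker threads. */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];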

That's great news. In theory, we can simply re-compile our software applications with the auto parallelizing option of a supported compiler to take advantage of multi-core CPUs. Auto parallelizing a regular algorithm such as matrix multiplication gives a nice performance boost without much effort, as the sketch below shows. This pleasant result surprises many people, who then apply auto parallelization to a moderately complicated serial program and obtain serious performance degradation relative to the serial version. Does that surprise you? No?
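The matrix multiplication case works well because each outer-loop iteration writes a distinct row of the result, so the iterations are trivially independent. A sketch in C, with an explicit OpenMP pragma standing in for what the auto parallelizer would do:

    /* c = a * b for n x n matrices stored in row-major order. Each
       outer iteration writes a distinct row of c, so the iterations
       can safely run on different cores. */
    void matmul(int n, const double *a, const double *b, double *c)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++)
                    sum += a[i * n + k] * b[k * n + j];
                c[i * n + j] = sum;
            }
    }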

If the performance degradation did surprise you, here are the reasons.

1. Lack of a dynamic runtime execution profile at compile time, coupled with an optimistic auto parallelizing compiler.
  • The number of iterations of a block/function is not known until runtime. When the iteration count is small, the amount of work may be too little to keep a group of processor cores busy, and the parallelization overhead of the block/function can outweigh any benefit. Yet an auto parallelizing compiler simply parallelizes everything it can, which can result in performance degradation.
  • The execution time of a block/function is not known until runtime either. Effectively, you cannot tell how long the block/function takes to execute, so the relative overhead of executing it in parallel is essentially unknown. An auto parallelizing compiler parallelizes such blocks anyway, without much consideration of whether the parallelization is actually beneficial.
Profile-guided optimization, which gathers runtime information from previous runs to estimate the behavior of future runs, would help the compiler parallelize more selectively, as the sketch below illustrates.
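OpenMP's if clause gives a flavor of the runtime decision the compiler cannot make on its own. A minimal sketch, where the threshold 10000 is an arbitrary illustrative value of the kind profile data could supply:

    /* n is unknown at compile time. When n is small, creating and
       synchronizing a thread team costs more than the loop body itself,
       so the if clause keeps the loop serial below the threshold. */
    void scale(double *x, int n, double s)
    {
        #pragma omp parallel for if (n > 10000)
        for (int i = 0; i < n; i++)
            x[i] *= s;
    }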

2. Lack of application-specific information about the data access pattern, coupled with a conservative auto parallelizing compiler.
  • Dynamic aliasing of memory locations is difficult to analyze. If a read instruction and a write instruction reference the same memory location at the same time, the outcome is arbitrary, depending on the memory model adopted by the hardware. With the extensive use of pointers and references, it is difficult for a compiler to prove that two pointers or references never refer to the same object, and comparing them explicitly at runtime would be prohibitively expensive. To maintain serial semantics, an auto parallelizing compiler plays it safe, giving up many potential parallelizations (see the sketch after this list).
  • Dynamic accesses to non-local data may result in conflicting accesses. When data is non-local, it may be accessed by another core at any time, which may produce arbitrary results. To ensure serial semantics, the compiler gives up the potential parallelization even if the non-local access is not critical to the overall function.
Programmers with application-specific knowledge can determine the true aliasing of memory locations much better, and the application's specific data access pattern may guarantee conflict-free accesses to non-local data. These open up many potential parallelizations that an auto parallelizing compiler is unable to exploit without help from the programmer.
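Here is a minimal sketch in C99 of handing that aliasing knowledge to the compiler. The restrict qualifier is the programmer's promise that x and y never overlap; the function name saxpy is just an illustrative choice:

    /* Without restrict, the compiler must assume x and y may alias, so
       a write to y[i] could change a later x[i] and the loop cannot
       safely be reordered or parallelized. With restrict, the
       iterations are provably independent. */
    void saxpy(int n, float a, const float *restrict x, float *restrict y)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }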

3. A serial program imposes unnecessary constraints on the order of execution.
A serial program specifies a sequence of instructions to be executed in that exact order. Compiler and hardware optimizations may re-order the instructions or execute them concurrently, provided serial semantics is maintained. However, the specified execution order need not reflect the real data dependencies: it can be far more restrictive than any useful optimization or parallelization requires, yet a compiler with only compile-time information finds it difficult to determine whether two functions may be executed in parallel.
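A minimal sketch of such a false ordering, using hypothetical functions summarize() and report() that touch disjoint data:

    /* The program text orders these calls, but nothing in the data flow
       requires it: summarize() reads only sales, report() reads only
       inventory. Proving that, however, requires the compiler to
       analyze both entire call chains. */
    summarize(sales);
    report(inventory);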

Since a serial program does not specify which code regions or functions may execute in parallel with other code, an auto parallelizing compiler has to derive the real data dependencies of the program by analyzing the defined-use information of variables at compile time. Unfortunately, accurate data dependency information is difficult to obtain at compile time, especially for high-level functions with deep call chains.
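For example, a call whose body lives in another translation unit defeats the analysis outright (update() is a hypothetical function for illustration):

    /* The compiler sees only the prototype, not the body. */
    void update(double *x, int i);

    void run(double *x, int n)
    {
        for (int i = 0; i < n; i++)
            update(x, i);  /* may touch any element of x, so the
                              iterations cannot be proven independent
                              and the loop stays serial */
    }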

Parallel programs sidestep this issue by allowing the programmer to specify which code regions or functions may execute in parallel, without any need to maintain serial semantics, hence removing the unnecessary constraints on the order of execution. The burden of ensuring correctness falls on the programmers, whose application-specific knowledge helps determine the parallelism in the code.
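Continuing the earlier hypothetical summarize()/report() sketch, the programmer can state the parallelism directly, for instance with OpenMP sections, and the compiler no longer has to prove anything:

    /* The programmer asserts the two calls are independent; the
       runtime executes the sections concurrently. Correctness is now
       the programmer's responsibility. */
    #pragma omp parallel sections
    {
        #pragma omp section
        summarize(sales);

        #pragma omp section
        report(inventory);
    }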
