
Tuesday, September 7, 2010

High Performance for All

Parallel programming is much more affordable now that multi-core CPUs and programmable GPUs have become commodity products. A decade ago, even a minimal dual-socket system, equipped with lower-clocked CPUs and RAM, cost a fortune relative to a typical desktop; today dual-core systems are everywhere. Their ubiquity is not really a matter of affordability: users are simply no longer given a choice not to go multi-core.

A decade ago the trade-off was non-trivial to me: why should I accept a lower-clocked CPU and RAM in order to go multi-processing? Wouldn't that slow down all my applications that use only a single core? Fortunately, this problem is now less severe with CPUs that adjust their clock dynamically, the so-called turbo mode. We can enjoy the benefits of both high clock speed and multiple cores across different applications.
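On Linux, the effect of dynamic clock scaling is easy to observe through the cpufreq sysfs interface. Here is a minimal sketch, assuming a Linux system with a cpufreq driver loaded; run it while a single-threaded workload is busy and the loaded core should report a boosted frequency:

    /* Print the current clock of each logical CPU via the Linux
     * cpufreq sysfs interface.  Stops at the first missing cpuN
     * directory (or if no cpufreq driver is available). */
    #include <stdio.h>

    int main(void) {
        char path[128];
        for (int cpu = 0; ; cpu++) {
            snprintf(path, sizeof path,
                     "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_cur_freq",
                     cpu);
            FILE *f = fopen(path, "r");
            if (!f)
                break;
            long khz;
            if (fscanf(f, "%ld", &khz) == 1)
                printf("cpu%d: %.2f MHz\n", cpu, khz / 1000.0);
            fclose(f);
        }
        return 0;
    }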

Moving forward, do commodity products make HPC a commodity service? How is HPC doing in the enterprise?

Check out the report published by Freeform Dynamics: High Performance for All

Wednesday, August 25, 2010

AMD's Bulldozer vs Intel's Hyper-Threading?


Is AMD's so-called Strong Thread approach in the Bulldozer module really that compelling?

Extra cores are added when a processor can't operate at a faster clock speed. That is a good and easy way to extend a product line with effectively faster products, even though a given application may not run any faster unless it takes advantage of the multiple cores. But fully duplicating an x86 core is an expensive way to scale up.

Intel's hyper-threading is a good idea in certain cases: with only a little extra hardware, it allows multiple threads to share the functional units of a core with low context-switch overhead, helping to tolerate today's relatively high memory latency. It works well with
  • Complementary threads - threads that do not compete for the same types of functional units (integer units, floating-point units, etc.), thus maximizing hardware utilization, or that do not have conflicting memory accesses, especially long-latency ones. A toy sketch of such threads follows this list.
  • Threads that play nice with the cache - a thread does not evict another thread's data from the cache. Unfortunately, this is difficult to ensure in practice, since dynamic OS thread scheduling, memory access patterns, and other factors all shape cache usage.
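As a toy sketch of the first point, consider one thread that hammers the integer units while another hammers the floating-point units. The workloads and iteration count below are made up purely for illustration, and whether the two threads actually overlap on one core depends on the OS placing them on sibling hardware threads:

    /* Two "complementary" threads: int_work stresses the integer ALUs,
     * fp_work stresses the FPU.  On a hyper-threaded core they can
     * overlap on different functional units; two integer-heavy threads
     * would instead compete.  Compile with: cc -O2 -pthread ... */
    #include <pthread.h>
    #include <stdio.h>

    #define N 100000000L

    static void *int_work(void *arg) {
        (void)arg;
        volatile long acc = 0;
        for (long i = 0; i < N; i++)
            acc += i ^ (i << 1);          /* integer ALU traffic */
        return NULL;
    }

    static void *fp_work(void *arg) {
        (void)arg;
        volatile double acc = 0.0;
        for (long i = 0; i < N; i++)
            acc += (double)i * 1.0000001; /* FPU traffic */
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, int_work, NULL);
        pthread_create(&t2, NULL, fp_work, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        puts("done");
        return 0;
    }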
On the other hand, AMD's Strong Thread provides two sets of integer units and two L1 data caches per Bulldozer module. That is heavier than the hyper-threading approach, but more lightweight than fully duplicating an x86 core, and it effectively lets a thread enjoy a fully private L1 data cache during its execution quantum, whereas hyper-threading operates in a shared-L1 environment. Whether the module supports CPU affinity, i.e. binding a thread to a particular core of the chip, is something to look for when more details are available.
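For what it's worth, Linux already lets software do this kind of binding today. The sketch below uses the GNU/Linux extension pthread_setaffinity_np() to pin the calling thread to one core; the choice of logical CPU 0 is arbitrary, and which logical CPU ids map to which module or hyper-threaded sibling is platform-specific:

    /* Pin the calling thread to logical CPU 0 (Linux, GNU extension).
     * Compile with: cc -pthread ... */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    int main(void) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);                 /* an arbitrary example core */
        int rc = pthread_setaffinity_np(pthread_self(), sizeof set, &set);
        if (rc != 0)
            fprintf(stderr, "pthread_setaffinity_np failed: %d\n", rc);
        else
            printf("now running on CPU %d\n", sched_getcpu());
        return 0;
    }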

Hyper-threading vs Bulldozer may provoke the argument of shared cache vs private cache: a thread can potentially use the entire shared cache, while a thread enjoys full bandwidth to its private cache. The downside is that a thread is limited to the smaller private cache even when the other private cache in the module is under-utilized. To push the argument further: a larger shared cache has higher latency due to larger storage-management overhead, while a smaller private cache generally has lower latency. Whether a shared or a private cache performs better is very specific to the memory access patterns of the threads involved.

As the L1 cache is usually very small, the performance impact of a smaller private L1 data cache on a single-threaded application could be compensated by the larger shared L2 cache. And when an application has a large working set, say a few megabytes against a 16 KB L1, doubling the L1 data cache is probably insufficient to hold the working set anyway.

We should also note that the floating-point units connect to the shared L2 cache, bypassing the L1 data cache. They probably have a good reason for that; I recall the Itanium 2 does not use its L1 data cache for floating-point loads either.

Overall, the AMD Bulldozer is an interesting architecture. It has great potential to exhibit higher performance at lower cost. Its benchmark data is something we should keep an eye on.


Thursday, July 29, 2010

Who Is Responsible For Programming Multi-Core CPUs And GPUs?

Multi-core CPUs and GPUs are now commodity products. But where is the software that takes advantage of their parallel architectures? Who should be developing such software? The domain expert? The HPC (high-performance computing) software engineer? Or parallel programming tools such as auto-parallelizing compilers?
  1. Domain experts typically do not wish to spend much time on computing problems. They simply want to get their computation done and collect the results; earning another degree in computing is not what they are after.
  2. HPC software engineers, on the other hand, focus almost exclusively on computing problems, obtaining good performance with minimum domain knowledge. The learning curve for the domain knowledge can be very steep for them.
  3. The current state-of-the-art parallel programming tools are mostly parallel programming languages and libraries, performance profilers, code analyzers that suggest parallelization opportunities, auto-parallelizing compilers, and debugging tools.
Consequently, domain experts develop a serial prototype of the algorithm and let HPC software engineers handle all the parallelization and performance issues. The HPC software engineers then use the limited set of parallel programming tools to parallelize and optimize the prototype.

There is a serious flaw in this workflow: to be effective, a parallel program needs to be designed for parallelism from the very beginning of its development. By the time the HPC software engineers take over the prototype for optimization and parallelization, what they can do is too limited.

Without the domain knowledge, the HPC software engineers are essentially working as a human auto-parallelizing compiler. They analyze the code for true data dependencies instead of analyzing the high-level structure of the algorithm; they optimize and parallelize the algorithms used in the prototype while preserving the same serial semantics, instead of adopting or improving the best algorithms for the task. They are unable to perform the intrusive algorithmic or code modifications necessary for good performance.
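A contrived example of the gap, using OpenMP only as an illustration (f() is a made-up placeholder for some domain kernel): the serial prototype accumulates into sum, a loop-carried dependency, so neither a compiler nor an engineer bound to the exact serial semantics may reorder the floating-point additions. Only the domain expert can say that reassociating them is acceptable, and that is precisely what licenses the parallel rewrite:

    /* Serial prototype vs. parallel rewrite.  The rewrite changes the
     * order of floating-point additions, which is only valid given the
     * domain-level knowledge that reassociation is tolerable here.
     * Compile with: cc -O2 -fopenmp ... */
    #include <stdio.h>

    #define N 10000000

    static double f(int i) { return i * 0.5; }  /* placeholder kernel */

    int main(void) {
        /* Serial prototype: loop-carried dependency on sum. */
        double sum = 0.0;
        for (int i = 0; i < N; i++)
            sum += f(i);

        /* Parallel rewrite with an OpenMP reduction. */
        double psum = 0.0;
        #pragma omp parallel for reduction(+:psum)
        for (int i = 0; i < N; i++)
            psum += f(i);

        printf("serial=%f parallel=%f\n", sum, psum);
        return 0;
    }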

See also Why Can't Compilers Auto Parallelize Serial Code Effectively?

Current state-of-the-art parallel programming tools cannot fully automate the optimization and parallelization process; relying purely on tools for optimization and parallelization is out of the question.

When a parallel program does not perform, whose fault is that?

Ideally, domain experts and HPC software engineers should collaborate and understand each other's issues. The impact of domain knowledge on the optimization and parallelization process should not be underestimated. The steep learning curve of the domain knowledge is probably unavoidable, and domain experts should become more computation-oriented, formulating their domain problems as computing problems.

Bioinformatics sets a good example here: by formulating its problems as computing problems, it enables many computer scientists to work on them with minimal domain knowledge.