AMD's so called Strong Thread approach in the Bulldozer module is that really compelling?
Extra cores are added when a processor can't operate at a faster clock speed, that's a good and easy way to expand a product line with effectively faster products, even though it may NOT be any faster depending on whether the applications are taking advantage of the multiple cores. But fully duplicating x86 core is expensive to scale up.
Intel hyper-threading is a good idea in certain cases, with only little more hardware it allows multiple threads to share the functional units in a core with lower context switch overhead, tolerating memory latency as memory latency is relatively high. That works well with
- Complementing threads - Threads do not use the same types of functional units such as the integer units, floating units, etc. thus maximizing the hardware utilization. Or threads do not have conflicting memory accesses, especially long latency memory accesses.
- Threads play nice with cache - A thread does not result in spilling out the data of another thread from the cache. Unfortunately, this would be difficult to ensure in practice as the dynamic OS thread scheduling, memory access pattern, etc. contribute to the cache usage.
Hyper-threading vs Bulldozer may provoke the argument of shared cache vs private cache: A thread can potentially access the entire shared cache, while a thread enjoys full bandwidth in accesses to the private cache. The downside is a thread is limited to the smaller private cache size even if the other private cache in the module is under utilized. To argue that further: a larger shared cache would have higher latency due to larger storage management overhead, while smaller private cache would have lower latency generally. Whether shared or private cache is better for the performance, it's very specific to the memory access patterns of multiple threads.
As L1 cache is usually very small, the performance impact of smaller private L1 data cache for a single threaded application could be compensated by the larger shared L2 cache. When an application has large working-set, doubling the L1 data cache is probably insufficient to keep the working-set anyway.
We should also note that the floating-point units connect to shared L2 cache bypassing the L1 data cache. They probably have a good reason for that. I can recall that Itanium II does not use L1 data cache for their floating-point too.
Overall, the AMD Bulldozer is an interesting architecture. It has great potential to exhibit higher performance at lower cost. Its benchmark data is something we should keep an eye on.