I wouldn't mind if they had put it in clear and honest terms: a 20% improvement in performance is not worth the unpredictable chances of performance degradation. That's a fair engineering decision. But hiding behind what some Cognos consultant and some Technet curator have written online, really?
And I'm still trying to decipher the technical motivation from this answer: "Of course there are those that can say “well, things like SMT can be implemented inexpensively and don’t consume that much power.” To those, I ask you, historically hasn’t AMD been the one committed to deliver better value and lower power? Why would we stray from our core principles?"
The first example says "unstable," which is clearly false, so we can ignore him as ignorant.
The next two are both on Windows, which until Windows 7 was unable to schedule hyperthreaded processors correctly.
In the last example, Linpack is maxing out the core, so obviously hyperthreading won't help; hyperthreading helps when no single task is maxing out the core. A drop is never great, but for regular desktop apps I think you'll see an overall net gain.
The OS task scheduler can switch tasks if one stalls for IO. A CPU based task scheduler can task switch on cache miss, or availability of single pieces of a CPU (like an adder).
Next he'll say not to run a multitasking OS. Instead run each task on its own CPU - "and don't worry, we are making more CPUs soon".
> A CPU based task scheduler can task switch on cache miss, or availability of single pieces of a CPU (like an adder).
Can't the CPU just signal an interrupt for those, and let the OS scheduler handle it?
This is usually not efficient because handling an interrupt takes nearly as long as a cache miss (and let us not consider the case where the interrupt handler itself triggers a cache miss), but Intel did a prototype on the Itanium that worked this way.
A CPU scheduler (instruction unit) does not schedule threads or processes - instead it schedules instructions (assembly instructions).
A thread might block while waiting for IO (disk, net, user input, etc).
An instruction might block because it's waiting for the cache to load its data, or it might block because it does not have an available execution unit.
A modern CPU has, for example, 4 units that can add, 2 that can multiply, 1 for floating point, another for memory IO, another that does XOR, etc, etc.
If one of those is busy you try to find an instruction that can make use of an idle unit.
It's often hard to find such an instruction because instructions frequently depend on each other, so you have to execute them in order. With hyperthreading you have double the number of available instructions to choose from (and the two instruction streams are independent of each other).
The biggest problem with hyperthreading is the cache. Memory is very slow, so you cache data locally; with hyperthreading you have twice as much data to cache, so some of it gets evicted - this is bad.
Some tasks will be slower because of this, but most won't.
Hyperthreading doesn't require any special OS support. As the Wikipedia page states:
> This technology is transparent to operating systems and programs. All that is required to take advantage of hyper-threading is symmetric multiprocessing (SMP) support in the operating system, as the logical processors appear as standard separate processors.
According to whom? I don't think anyone with a clue would have suggested that. If you care about writing high-performance code, you care about the behavior of your processor. Obviously, multiplexing the execution of two program threads onto a single physical core is going to affect performance, whether in the form of scheduling differences, caching effects, etc.
> they should have dropped HT and gone for actual cores instead.
Well, you can do both, of course. If HT gets you better utilization of a single core, then fair enough. The question is really whether the transistors/power you burn on HT would be better spent on additional cores (or bigger caches, etc.).
I expect that if the code was extremely tight with no cache misses, any extra gain from hyperthreading would be pretty limited, if not negative due to extra thread switching costs.
There are probably pathological cases where disabling SMT would be an improvement, but I think in general it should always improve throughput. The main question is whether the transistors that go towards SMT would be better spent elsewhere. But the cores already take a very small percentage of the die - most of it is cache. And adding SMT doesn't increase the core size very much (a couple of percent, IIRC), so it seems like a no-brainer to me as long as the OS scheduler can do a decent job.
If the two threads are not so related, they can easily trash each other's caches and have a substantially negative impact on performance. Ulrich Drepper has a pretty good discussion on page 29 of this doc ( http://people.redhat.com/drepper/cpumemory.pdf ).
It boils down to this: code with a low cache hit rate can often benefit from HT, as it can usually be parallelised into two threads with a not-much-worse hit rate, and there will be plenty of stalls to fill. More optimised code stands to lose quite significantly, as parallelising code with a high cache hit rate so that the threads (a) don't trash each other's caches and (b) still have a very high hit rate is very hard.
The main problem is if the working set size of the single threaded code fits just in cache, but the combined working set size of both threads does not. In this case the threads are competing for the shared L1 cache instead of doing useful work.
While extra work can also be done when there isn't a stall (for example, when the processor is unable to sufficiently parallelise execution of the one thread), his argument is that this is a relatively insignificant fraction of the gain produced by HT, so his model is sufficiently accurate. Now, I don't know whether this is true in practice, but Ulrich Drepper ought to :-).
I certainly know that in my own testing running ruby servers on both i7 and i5 iMacs that I see nothing close to a 50% speedup. More like 5-10%, max.
We have a different strategy than the competition. They are choosing SMT, we are heading down the path of more physical cores.
At the end of the day 3 things matter: what is the performance (throughput), what is the power consumption, and what is the price.
I am pretty confident we will do well in all 3.
The argument that we have been deflecting for some time is "SMT only adds ~5% overhead and gives you ~20% more performance." On the surface that sounds like a good tradeoff - if you always got the 20%. But you don't.
Instead, for our Bulldozer product, we are using 16 cores. It will be in approximately the same power/thermal range of our current 4 and 6-core products. But 2 threads running on 2 physical cores will deliver 80% more throughput than 1 thread running on 1 core. When you compare that to the 20% uplift for SMT when you run 2 threads on 1 core, you start to see the benefit.
I'd really like to see a detailed technical justification for it.
I wouldn't be surprised if things that make long term sense but give incremental benefits at the cost of significant increases in verification work look pretty unsavory to a company in sore financial shape, as they've been for a while now.
I've heard from a number of senior folks in CS with ties to industry that, in many cases, Intel's choices of CPU features have been motivated in large part by making it impossible for competitors to achieve feature parity without risking, or outright facing, bankruptcy.
If future implementations do a better job it would be something to consider.
For now we believe that actual cores will drive better performance. If I can put 12 cores into the same price and power envelopes (or even lower) as my competitor's 6-core with SMT, I end up with a better processor.
Cores vs. SMT breaks down if your cores consume significantly more power or drive your cost up. But if you avoid those things, then cores over SMT becomes the right call.
The idea of Hyperthreading is to keep the processor utilized at all times. Modern processors have many different components for each core, and not every thread of execution uses all the components. Enabling Hyperthreading lets the OS supply the processor with more work, potentially reducing the amount of the processor that's unused at any time.
The article is right about reducing performance. If you have an 8-core machine and one app that only runs on one core (hello, kcryptd...), it will run slower than "normal" when you have 7 other jobs using the CPU. But if you have a job that scales linearly across cores, then you will almost always see a speed increase. (I tested this on my machine with a "make" on the Linux kernel. The CPU time used by the whole build is the same from -j1 to -j4, while the wall-clock time decreases linearly. When you start adding "hyper" threads, the wall-clock time decreases more slowly, but the CPU time increases. This, I think, is what scares people: the processor becomes slower as you add threads beyond the number of actual cores, but its overall throughput increases.)
I can't think of any workloads where this is a bad thing. I don't know of any server workload that leans heavily on one thread. (Databases, maybe.) On a workstation, you are only going to "starve" normal processes when you are intentionally doing something intensive, like encoding video. For normal workloads, you will never use even the 4 real cores completely.
(As I type this post, the only things that could possibly want to run are kcryptd, my music player, the X server, and the web browser. Hyperthreading is enabled, but there is no way I can possibly get the processor into the state where it starts slowing down. But when I run "make -j8", I really enjoy the speed benefit at the expense of making firefox 3% slower.)
Anyway, Google around for various benchmarks showing the results of Hyperthreading enabled and disabled, and I think you'll want to enable it. I did my own, and I'm glad I did -- it even makes my eeepc slightly faster!
As long as you are CPU bound.
Once you're IO bound you can have as many cores as you want but it isn't going to move any faster.
The only machines where I manage to approach 100% cpu usage are video servers, all the other boxes are sooner or later IO bound. The only cure for that is gobs of memory.
In my experience it's pretty rare for a webserver with more than 4 cores to be CPU bound. Most of the time the disk or the network card is the bottleneck.
Unless you code in a very inefficient way of course, then it can be that you need more CPU before you hit that wall.
AMD has so far more or less held the moral high ground against Intel in the 'spin wars'; they can do better than this.
It's simply a marketing piece in the guise of a blog post.
At the bottom of the entry it reads:
> John Fruehe is the Director of Product Marketing for Server/Workstation products at AMD. His postings are his own opinions and may not represent AMD’s positions, strategies or opinions.
So they effectively distance themselves from this posting, but at the same time it sits on amd.com.
Yes, it is marketing, but my job is marketing.
You are definitely right that CPU is not always the problem, some systems can be I/O or memory-bound.
Our 4 channels of memory on the new products will help with the latter; to fix the former, you need OEMs willing to put down multiple chipsets. Typically the cost drives people away from those designs, unfortunately.
>Then in 2011, we plan to introduce Interlagos...we’re designing some shared components that help reduce power consumption and die size, but you won’t see us sharing integer pipelines, the meat of the core.
HLLs could benefit from better OS support for scheduling and dealing with multiple cores.
In my multithreaded Java application, turning off HT improved performance 100%. Granted, it was one of the early single-core HT implementations, but I still always turn it off in the BIOS.
I tend to prefer AMD over Intel because you get more MIPS/$ and less heat.
That was with an Intel 5500 (2 quad-cores, "16" logical cores total with hyperthreading).
I used to work for a major OEM that sold Intel systems. One of the biggest problems was that SQL Server, for instance, ran slower with HT when you had more than 4 threads.
But customers never thought to turn it off because they assumed it was a "performance feature." If it was pitched as a "sometimes" performance feature, then people might be inclined to turn it off to check performance.
If I had a dollar for every customer I had to have that discussion with, I could have bought myself a server.
Simultaneous multithreading is what Intel calls hyperthreading.