I'm a bit disappointed by the marketing spin. They come up with a few examples where HT falls down, and use that to dismiss the whole idea?
I wouldn't mind if they had put it in clear and honest terms: a 20% improvement in performance is not worth the unpredictable chances of performance degradation. That's a fair engineering decision. But hiding behind what some Cognos consultant and some Technet curator have written online, really?
And I'm still trying to decipher the technical motivation from this answer: "Of course there are those that can say “well, things like SMT can be implemented inexpensively and don’t consume that much power.” To those, I ask you, historically hasn’t AMD been the one committed to deliver better value and lower power? Why would we stray from our core principles?"
The first example says "unstable", which is clearly false, so we can ignore him as ignorant.
The second two are both on Windows, which until Windows 7 was unable to schedule for hyperthreading correctly.
In the last example, with Linpack, you are maxing out the core, so obviously hyperthreading won't help. Hyperthreading helps when you are not maxing out the core on any one task. Although a drop is not great, for regular desktop apps I think you'll see an overall net gain.
The OS task scheduler can switch tasks if one stalls for IO. A CPU based task scheduler can task switch on cache miss, or availability of single pieces of a CPU (like an adder).
Next he'll say not to run a multitasking OS. Instead run each task on its own CPU - "and don't worry we are making more CPUs soon".
Could you link to further reading about how a CPU-based scheduler works? The Wikipedia article [1] only mentions OS-level scheduling, and I was always under the impression that the CPU knew nothing of threads or processes.
> CPU based task scheduler can task switch on cache miss, or availability of single pieces of a CPU (like an adder).
Can't the CPU just signal an interrupt for those, and let the OS scheduler handle it?
Multithreaded processors have multiple hardware threads. An SMT processor schedules instructions, mostly ignoring which thread they come from. An SoEMT processor schedules hardware threads, usually switching on a cache miss or similar stall event.
> Can't the CPU just signal an interrupt for those, and let the OS scheduler handle it?
This is usually not efficient because handling an interrupt takes nearly as long as a cache miss (and let us not consider the case where the interrupt handler itself triggers a cache miss), but Intel did a prototype on the Itanium that worked this way.
A CPU scheduler (instruction unit) does not schedule threads or processes - instead it schedules instructions (assembly instructions).
A thread might block while waiting for IO (disk, net, user input, etc).
An instruction might block because it's waiting for the cache to load its data, or it might block because it does not have an available execution unit.
A modern CPU has, for example, 4 units that can add, 2 that can multiply, 1 for floating point, another for memory IO, another that does XOR, etc, etc.
If one of those is busy you try to find an instruction that can make use of an idle unit.
It's often hard to find an instruction because they often depend on each other, so you have to do them in order. With hyperthreading you have double the number of available instructions to choose from (and those two instruction streams are independent of each other).
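To make the unit-filling argument concrete, here is a toy Python model (my own sketch, not how any real core works): every instruction takes one cycle, each stream issues in program order, and the unit counts are made up. A chain of dependent adds can only issue one instruction per cycle no matter how many adders exist, so a second independent stream fills adders that would otherwise sit idle.

```python
from collections import Counter


def cycles_to_finish(streams, units):
    """Count the cycles needed to issue every instruction in `streams`.

    streams: list of instruction lists; each instruction is (unit, dep),
             where dep is the index of an earlier instruction in the same
             stream that must have finished, or None.
    units:   mapping of unit name -> how many of that unit the core has.
    Toy assumptions: one-cycle latency, in-order issue within a stream,
    and stream 0 always gets first pick of the free units.
    """
    for stream in streams:
        for unit, _ in stream:
            if units.get(unit, 0) < 1:
                raise ValueError(f"no execution unit of type {unit!r}")

    next_idx = [0] * len(streams)        # next instruction to issue, per stream
    done_at = [{} for _ in streams]      # instruction index -> cycle it ran
    cycle = 0
    while any(n < len(s) for n, s in zip(next_idx, streams)):
        cycle += 1
        free = Counter(units)            # all units are free at cycle start
        for t, stream in enumerate(streams):
            while next_idx[t] < len(stream):
                unit, dep = stream[next_idx[t]]
                # A dependency must have run in an earlier cycle.
                dep_ready = dep is None or done_at[t].get(dep, cycle) < cycle
                if not dep_ready or free[unit] == 0:
                    break
                free[unit] -= 1
                done_at[t][next_idx[t]] = cycle
                next_idx[t] += 1
    return cycle


# Hypothetical core: 4 adders, 2 multipliers.
units = {"add": 4, "mul": 2}
# Eight adds where each one needs the result of the previous one.
chain = [("add", None)] + [("add", i - 1) for i in range(1, 8)]

alone = cycles_to_finish([chain], units)            # one hardware thread
together = cycles_to_finish([chain, chain], units)  # two hardware threads
```

In this model, `alone` and `together` both come out to 8 cycles, so the second stream's 8 instructions ride along for free. Real cores issue out of order and have multi-cycle latencies, so the gain is never this clean.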
The biggest problem with hyperthreading is the cache. Memory is very slow, so you cache memory locally. With hyperthreading you have twice as much data to cache, so some data is evicted - this is bad.
Some tasks will be slower because of this, but most won't.
> The second two are both on windows, which until windows 7 was unable to schedule hyperthreading correctly.
Hyperthreading doesn't require any special OS support. As the Wikipedia page states:
> This technology is transparent to operating systems and programs. All that is required to take advantage of hyper-threading is symmetric multiprocessing (SMP) support in the operating system, as the logical processors appear as standard separate processors.
No extensive OS support is required to utilize basic SMT, true -- but the scheduler needs to be aware of SMT to do remotely-intelligent scheduling. As a trivial example, if you have 2 runnable processes, 4 logical cores, and 2 physical cores, it obviously makes sense to schedule one process on each physical core. (In fact, the Wikipedia article says this, in the very next paragraph after the one you quoted.)
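A minimal sketch of that placement rule in Python. The topology format (a list of physical cores, each listing its logical CPU ids) and both function names are hypothetical, made up for illustration; a real scheduler also weighs load, cache affinity and power.

```python
def place(num_tasks, siblings):
    """Assign tasks to logical CPUs, filling idle physical cores first.

    siblings: list of physical cores, each a list of its logical CPU ids,
              e.g. [[0, 1], [2, 3]] for 2 cores with 2 hardware threads each
              (an assumed numbering; real systems number siblings differently).
    Returns the logical CPU ids chosen, in assignment order.
    """
    chosen = []
    depth = max((len(core) for core in siblings), default=0)
    # First pass takes one hardware thread per core; later passes double up.
    for slot in range(depth):
        for core in siblings:
            if slot < len(core) and len(chosen) < num_tasks:
                chosen.append(core[slot])
    return chosen


def naive_place(num_tasks, siblings):
    """SMT-unaware baseline: take logical CPUs in numeric order."""
    logical = sorted(cpu for core in siblings for cpu in core)
    return logical[:num_tasks]


topology = [[0, 1], [2, 3]]          # 2 physical cores, 4 logical CPUs
aware = place(2, topology)
unaware = naive_place(2, topology)
```

With this topology, `place(2, ...)` picks logical CPUs 0 and 2 (one per physical core), while the SMT-unaware baseline picks 0 and 1, which share a core.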
True, but then the point the original article makes kicks in: that's a relatively poor way of doing things. Originally HT was supposed to be transparent. Then it turned out the abstraction wasn't quite as good as they hoped and OS level support was needed. That adds a lot of complexity and possibilities for mistakes. At that point, they should have dropped HT and gone for actual cores instead.
According to whom? I don't think anyone with a clue would have suggested that. If you care about writing high-performance code, you care about the behavior of your processor. Obviously, multiplexing the execution of two program threads onto a single physical core is going to affect performance, whether in the form of scheduling differences, caching effects, etc.
> they should have dropped HT and go for actual cores instead.
Well, you can do both, of course. If HT gets you better utilization of a single core, then fair enough. The question is really whether the transistors/power you burn on HT would be better spent on additional cores (or bigger caches, etc.).
In my casual micro-benchmarking experience, hyperthreading on my Core i7 920 can be worth about 50% extra. That is, the difference between 4 threads and 8 threads on 4 hyperthreaded cores can be about 50% (and I can see that Windows 7 is scheduling the 4 threads across the 4 cores correctly).
I expect that if the code was extremely tight with no cache misses, any extra gain from hyperthreading would be pretty limited, if not negative due to extra thread switching costs.
Even with no cache misses, SMT can still be a win, since it can increase instruction level parallelism - a single thread may not have enough instructions that can execute in parallel to completely fill the pipeline. But with SMT, multiple threads have the chance to fill all available execution units, which is much more likely.
There are probably pathological cases where disabling SMT would be an improvement, but I think in general it should improve throughput. The main question is whether the transistors that go towards SMT would be better spent elsewhere. But already the cores on the chip take a very small percent of the die - most of it is cache. And adding SMT doesn't increase the core size very much (a couple percent IIRC), so it seems like a no-brainer to me as long as the OS scheduler can do a decent job.
It often depends on how related the datasets of the two hyperthreads are. If both HTs are working on the same data, then it can often be a win for the reason you state - increased instruction level parallelism, and of course the opportunity to do work while one thread is stalled on a non-overlappable miss.
If the two threads are not so related, they can easily thrash each other's cache and have a substantially negative impact upon performance. Ulrich Drepper has a pretty good discussion on page 29 of this doc ( http://people.redhat.com/drepper/cpumemory.pdf ).
It boils down to this: Code that has a low cache hit rate can often benefit from HT, as it can usually be parallelised into two threads that have a not-much-worse cache hit rate, and there will be plenty of gaps to fill. More optimised code stands to lose quite significantly, as parallelising code with a high cache hit rate so that the threads (a) do not thrash each other's cache, and (b) still have a very high hit rate, is very hard.
Note that his example is about a processor that can only execute one thread at a time, switching on a cache miss. This is not SMT.
The main problem is if the working set size of the single threaded code fits just in cache, but the combined working set size of both threads does not. In this case the threads are competing for the shared L1 cache instead of doing useful work.
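That cliff is easy to reproduce in a toy Python model. This is a sketch under loud assumptions: a fully associative LRU cache of 64 lines, two threads that each loop over a private 48-line working set, and strict one-for-one interleaving of their accesses. All of those numbers and helper names are invented for illustration.

```python
from collections import OrderedDict


def hit_rate(accesses, cache_lines):
    """Fraction of accesses that hit in a fully associative LRU cache."""
    cache = OrderedDict()
    hits = 0
    for addr in accesses:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)        # mark as most recently used
        else:
            cache[addr] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict the least recently used
    return hits / len(accesses)


def loop_over(tag, working_set, passes):
    """One thread repeatedly walking its own private working set."""
    return [(tag, i) for _ in range(passes) for i in range(working_set)]


def interleave(a, b):
    """Alternate accesses from two threads, as if they shared the core."""
    return [x for pair in zip(a, b) for x in pair]


solo = loop_over("A", 48, 100)
shared = interleave(solo, loop_over("B", 48, 100))

solo_rate = hit_rate(solo, 64)      # 48 lines fit in a 64-line cache
shared_rate = hit_rate(shared, 64)  # 96 combined lines do not fit
```

Here the solo walk hits about 99% of the time, while the interleaved pair misses on every access, because a cyclic walk over 96 lines defeats a 64-entry LRU. Real L1 caches are set-associative and not strictly LRU, so the cliff is softer in practice.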
That's not quite right - his example models execution as one thread at a time, switching on a cache miss because most of the extra work that gets done by SMT is when you get a cache miss, freeing up the rest of the processor.
While extra work can also be done when there isn't a stall (for example, when the processor is unable to sufficiently parallelise execution of the one thread), his argument is that this is a relatively insignificant fraction of the gain produced by HT, so his model is sufficiently accurate. Now, I don't know practically whether this is true or not, but Ulrich Drepper ought to :-).
Isn't that only true for micro-benchmarking, though? Tiny little programs that fit in cache? I was under the impression that HT could provide great gains with a tiny executable (the bit of code being used, anyway) but with larger chunks of code that didn't fit in cache it was mostly useless.
I certainly know that in my own testing, running Ruby servers on both i7 and i5 iMacs, I see nothing close to a 50% speedup. More like 5-10%, max.
The goal was not to dismiss SMT - merely to explain why we are not using it.
We have a different strategy than the competition. They are choosing SMT, we are heading down the path of more physical cores.
At the end of the day 3 things matter: what is the performance (throughput), what is the power consumption, and what is the price.
I am pretty confident we will do well in all 3.
The argument that we have been deflecting for some time is "SMT only adds ~5% overhead and gives you ~20% more performance." On the surface that sounds like a good tradeoff. If you always got 20% - but you don't.
Instead, for our Bulldozer product, we are using 16 cores. It will be in approximately the same power/thermal range of our current 4 and 6-core products. But 2 threads running on 2 physical cores will deliver 80% more throughput than 1 thread running on 1 core. When you compare that to the 20% uplift for SMT when you run 2 threads on 1 core, you start to see the benefit.
Yes, it's a marketing piece, but the statement that AMD is never going to add threads to their CPUs is surprising. Given the trend in memory vs. CPU performance, adding threads to hide latency seems like a complete no-brainer.
I'd really like to see a detailed technical justification for it.
Most likely the technical justification is a lack of manpower and schedule constraints.
I wouldn't be surprised if things that make long term sense but give incremental benefits at the cost of significant increases in verification work look pretty unsavory to a company in sore financial shape, as they've been for a while now.
I've heard from a number of senior folks in CS who have ties to industry that in many cases Intel's choice of CPU features has been motivated in large part by making it impossible for competitors to achieve feature parity without risking, or perhaps certainly facing, bankruptcy.
No, it has more to do with the inconsistency of SMT in an x86 environment. If we were always getting consistent performance gains and expected stability it would be a consideration. That might change in the future, but the technology needs to be refined first.
It reads like classic FUD material. They cast doubt on the technology yet they don't mention you can disable SMT in the BIOS of an Intel system. It also doesn't answer the most obvious question: What if I want 8-12 cores and the possible +20% performance of SMT?
We have looked into that, but that is not our strategy. If we knew that SMT (in an x86 architecture) could lead to an "always better performance" environment, it might be more appealing. But because the performance aspect becomes application dependent (and there are possible stability situations), we defer to no SMT.
If future implementations do a better job it would be something to consider.
For now we believe that actual cores will drive better performance. If I can put 12 cores into the same price and power envelopes (or even lower) as my competitor's 6-core with SMT, I end up with a better processor.
Cores vs. SMT breaks down if your cores are consuming significantly more power or you are driving your cost up. But if you are not doing those things, then cores over SMT becomes the right call.
This article is worthless. There are no technical details at all, and the FUD is mostly wrong.
The idea of Hyperthreading is to keep the processor utilized at all times. Modern processors have many different components for each core, and not every thread of execution uses all the components. Enabling Hyperthreading lets the OS supply the processor with more work, potentially reducing the amount of the processor that's unused at any time.
The article is right about reducing performance. If you have an 8 core machine, and you have one app that only runs on one core (hello, kcryptd...), it will run slower than "normal" when you have 7 other jobs using the CPU. But if you have a job that scales linearly across cores, then you will almost always see a speed increase. (I tested this on my machine with a "make" on the Linux kernel. The CPU time used by the whole build is the same from -j1 to -j4, and the wall-clock time decreases linearly. When you start adding "hyper" threads, you see the wall-clock time decrease more slowly, but the CPU time increase. This, I think, is what scares people. Each thread becomes slower as you add threads beyond the number of actual cores, but the processor's overall throughput increases.)
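The shape of those numbers falls out of a very simple model. This Python sketch assumes a perfectly parallel build, two hardware threads per core, and a core running two threads delivering 1.2x the throughput of one thread; that 1.2 is an illustrative guess, not something I measured, and the function name is made up.

```python
def build_times(jobs, cores, work=100.0, smt_gain=1.2):
    """Model wall-clock and CPU time for a perfectly parallel build.

    work:     CPU-seconds the build needs when every thread has a core
              to itself.
    smt_gain: assumed throughput of a core running two threads, relative
              to one thread (an illustrative guess, not a measurement).
    Returns (wall_seconds, cpu_seconds).
    """
    if jobs < 1 or cores < 1:
        raise ValueError("need at least one job and one core")
    jobs = min(jobs, 2 * cores)            # two hardware threads per core
    doubled = max(0, jobs - cores)         # cores running two threads
    single = min(jobs, cores) - doubled    # cores running one thread
    throughput = single * 1.0 + doubled * smt_gain
    wall = work / throughput
    cpu = wall * jobs                      # every thread is busy throughout
    return wall, cpu


wall4, cpu4 = build_times(jobs=4, cores=4)   # one thread per real core
wall8, cpu8 = build_times(jobs=8, cores=4)   # every core doubled up
```

With 4 cores, going from 4 to 8 jobs drops the modelled wall-clock time from 25 to about 20.8, while CPU time rises from 100 to about 166.7: each thread is slower, but the build finishes sooner.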
I can't think of any workloads where this is a bad thing. I don't know of any server workload that leans heavily on one thread. (Databases, maybe.) On a workstation, you are only going to "starve" normal processes when you are intentionally doing something intensive, like encoding video. For normal workloads, you will never use even the 4 real cores completely.
(As I type this post, the only things that could possibly want to run are kcryptd, my music player, the X server, and the web browser. Hyperthreading is enabled, but there is no way I can possibly get the processor into the state where it starts slowing down. But when I run "make -j8", I really enjoy the speed benefit at the expense of making firefox 3% slower.)
Anyway, Google around for various benchmarks showing the results of Hyperthreading enabled and disabled, and I think you'll want to enable it. I did my own, and I'm glad I did -- it even makes my eeepc slightly faster!
> Running more threads increases throughput for applications as long as you have available cores.
As long as you are CPU bound.
Once you're IO bound you can have as many cores as you want but it isn't going to move any faster.
The only machines where I manage to approach 100% cpu usage are video servers, all the other boxes are sooner or later IO bound. The only cure for that is gobs of memory.
It's in my experience pretty rare to have a webserver with more than 4 cores to be CPU bound. Most of the time the disk or the network card are the bottle-neck.
Unless you code in a very inefficient way of course, then it can be that you need more CPU before you hit that wall.
AMD has so far more or less held the moral high ground against Intel in the 'spin wars'; they can do better than this.
It's simply a marketing piece in the guise of a blog post.
At the bottom of the entry it reads:
> John Fruehe is the Director of Product Marketing for Server/Workstation products at AMD. His postings are his own opinions and may not represent AMD’s positions, strategies or opinions.
So they effectively distance themselves from this posting, but at the same time it sits on amd.com.
Ah, welcome to the world of lawyers. Those disclaimers are everywhere, like the labels on lawnmowers that say "don't pick this up while it is running." They aren't distancing themselves from me; it's just standard legal stuff.
Yes, it is marketing, but my job is marketing.
You are definitely right that CPU is not always the problem, some systems can be I/O or memory-bound.
Our 4 channels of memory on the new products will help with the latter; to fix the former, you need OEMs willing to put down multiple chipsets. Typically the cost drives people away from those designs, unfortunately.
Ooops: AMD are going to support SMT - it's in their 2011-2012 road map! However, they don't call it SMT; they think they'll get away with calling them cores. It is two small integer cores sharing a fetcher, decoder, L2 cache, and a single FP unit.
> Then in 2011, we plan to introduce Interlagos...we’re designing some shared components that help reduce power consumption and die size, but you won’t see us sharing integer pipelines, the meat of the core.
I just had an insight about Python's GIL. Python depends on the OS scheduler to allocate computing resources to tasks. But if multiple threads of Python are contending for the GIL, to the OS scheduler, this is Python doing "work".
HLLs could benefit from better support from the OS for scheduling and dealing with multiple cores.
It's just my experience, I have no formal studies to back it up...
In my multithreaded Java application, turning off HT improved performance 100%. Granted, it was one of the early HT implementations with a single core, but I still always turn it off in my BIOS.
I tend to prefer AMD over Intel because you get more MIPS/$ and less heat.
I think this helps underscore my point. Sometimes you get performance, sometimes you don't.
I used to work for a major OEM that sold Intel systems. One of the biggest problems was that SQL Server, for instance, ran slower with HT when you had more than 4 threads.
But customers never thought to turn it off because they assumed it was a "performance feature." If it was pitched as a "sometimes" performance feature, then people might be inclined to turn it off to check performance.
If I had a dollar for every customer I had to have that discussion with, I could have bought myself a server.