The root of the problem is that .NET implemented a rather baroque spin loop with several assumptions about the CPU and OS architecture. Unlike any other spin loop I have seen or written, they execute multiple pause instructions per loop. Arguably this was a poor choice.
It would be interesting to see the history of this design. It was probably done years ago when .NET was closed source.
At the risk of repeating the content of the blog post ... The issue was reported in August, 2017 and fixed later in the same month. The fix shipped in a .NET Core patch update in September. We attempted to get the fix into .NET Framework 4.7.2 but missed that scheduling. We're now looking at making the fix in a .NET Framework patch release.
Then please quote at least one such loop actually used in the Linux kernel, including where it is used.
objdump -d vmlinux.o | grep -c "f3 90"
683a0: e8 00 00 00 00 callq 683a5 <__task_rq_lock+0x5>
683a5: 55 push %rbp
683a6: 48 89 e5 mov %rsp,%rbp
683a9: 41 56 push %r14
683ab: 49 c7 c6 00 00 00 00 mov $0x0,%r14
683b2: 41 55 push %r13
683b4: 49 89 fd mov %rdi,%r13
683b7: 41 54 push %r12
683b9: 53 push %rbx
683ba: 48 c7 c3 00 00 00 00 mov $0x0,%rbx
683c1: 41 8b 45 3c mov 0x3c(%r13),%eax
683c5: 4d 8b 24 c6 mov (%r14,%rax,8),%r12
683c9: 49 01 dc add %rbx,%r12
683cc: 4c 89 e7 mov %r12,%rdi
683cf: e8 00 00 00 00 callq 683d4 <__task_rq_lock+0x34>
683d4: 41 8b 45 3c mov 0x3c(%r13),%eax
683d8: 49 8b 14 c6 mov (%r14,%rax,8),%rdx
683dc: 48 01 da add %rbx,%rdx
683df: 49 39 d4 cmp %rdx,%r12
683e2: 75 13 jne 683f7 <__task_rq_lock+0x57>
683e4: 41 83 7d 60 02 cmpl $0x2,0x60(%r13)
683e9: 74 0c je 683f7 <__task_rq_lock+0x57>
683eb: 5b pop %rbx
683ec: 4c 89 e0 mov %r12,%rax
683ef: 41 5c pop %r12
683f1: 41 5d pop %r13
683f3: 41 5e pop %r14
683f5: 5d pop %rbp
683f6: c3 retq
683f7: 4c 89 e7 mov %r12,%rdi
683fa: e8 00 00 00 00 callq 683ff <__task_rq_lock+0x5f>
683ff: 41 83 7d 60 02 cmpl $0x2,0x60(%r13)
68404: 75 bb jne 683c1 <__task_rq_lock+0x21>
68406: f3 90 pause
68408: 41 83 7d 60 02 cmpl $0x2,0x60(%r13)
6840d: 75 b2 jne 683c1 <__task_rq_lock+0x21>
6840f: eb f5 jmp 68406 <__task_rq_lock+0x66>
68411: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
68416: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
6841d: 00 00 00
I've never disputed that the kernel uses pause (of course it does); what I dispute is the claim eloff states in the post to which I reply:
"where it starts off with one pause instruction, then goes to multiple, then moves to more drastic back-off techniques"
That's what eloff appears to claim in his post, where he effectively disputes the earlier claim of cesarb: "it's a loop over the pause instruction, but the loop condition checks the lock value"
Your example doesn't support the claim of eloff; it supports the claim of cesarb:
That is, it checks the external condition after each pause, which can't produce the problems that .NET produced.
And that's why I had expected eloff to give even one example that supports his claim, since all I can find is what cesarb describes: "it's a loop over the pause instruction, but the loop condition checks the lock value"
The whole thread is about whether a loop containing pause but without checks of the external condition (instead just a counter, like in the offending .NET example from the original article) exists in the Linux kernel. If somebody claims that the .NET implementation is not unique in its carelessness of looping over pause with nothing but a big counter (which is how it appears to me: that .NET does something unexpected), he or she should support that.
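To make the distinction concrete, the two shapes under discussion look roughly like this (an illustrative C sketch, not code from either codebase):

    #include <stdatomic.h>
    #include <emmintrin.h>   /* _mm_pause() */

    /* Kernel-style: pause, but re-check the external condition on every
       iteration (what cesarb describes, and what the disassembly above shows). */
    static void wait_checking(const atomic_int *locked)
    {
        while (atomic_load_explicit(locked, memory_order_relaxed))
            _mm_pause();
    }

    /* Counter-only: pause N times without looking at the lock in between
       (the shape attributed to the .NET back-off). */
    static void wait_counter_only(int n)
    {
        for (int i = 0; i < n; i++)
            _mm_pause();
    }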
(That I ask for support for the "extraordinary claim" in a thread where only one side makes it should be understandable, but it seems some here vote "down" without understanding what is being discussed at all -- it's absurd having to summarize the whole discussion in a reply to the "extraordinary claim" in which I merely ask for the evidence.)
Yeah, I guess you are right. Sorry for the noise.
/* p.s. I didn't vote down your comment */
Would be interested to know why you think the code was always broken. Perhaps Intel defines a preferred algorithm they failed to implement?
New architectures make new choices and, according to the redlined doc, they expect 1-2% performance improvements on threaded apps, which is a measurable improvement. They also explicitly call out possible degradation cases so folks who care can adjust.
.NET apparently hits such a degradation case, but parent says this is due to .NET implementing spin loops in a way (almost) no one else does. If so, in this particular case Intel is doing the greatest good for the greatest number, which is reasonable. My 2c.
I certainly can't quantify it in numbers, but I guess .NET has some of the highest requirements regarding synchronization performance, at least in the standard libs; there are obviously some efforts underway to improve that situation.
You aren't going to get a good spin lock without making some assumptions about the hardware and OS. Get new hardware, gotta add a new case.
I also have to quibble with this conclusion:
This is not a .NET issue. It affects all Spinlock implementations
which use the pause instruction.
Is it though?
The "pause" instruction gives all resources to the hyperthread sibling. So under high-utilization (ie: your core has a hyperthread-sibling ready and willing), a "pause" instruction is basically a super-low latency task-switch.
I wouldn't be surprised that a "spinlock with pause" would be better, even in 5ms to 10ms chunks of time (Windows has 15ms quantas, and may task switch as late as 30ms to 75ms depending on the situation).
Because its better to do the hardware-assisted "pause" task switch rather than actually cause the OS-scheduler to get involved. In fact, that's what Microsoft AND Intel have concluded!! To get better real-world metrics, both Microsoft AND Intel felt like the "pause" instruction should wait longer! Skylake-X increases the latency, while .NET created this exponential-backoff.
I bet it was good in isolation, but when combined, the overall effect was too much pausing.
Yes, definitely, beyond any doubt.
The .NET approach only makes sense if you have exactly the same number of active threads as CPU "cores" and if your CPU always supports hyperthreading. If either of those is not true then what .NET is doing is a severe CPU bottleneck when the lock is actually contended.
And it is incredibly unlikely that that exact situation happens in the wild. You almost always have more than <cpu core count> threads/processes active in heavy multithreading workloads, and you definitely can't assume everyone has hyperthreading as that's a feature commonly cut from mid & low range CPUs.
For context here a thread context switch via something like futex is in the sub-10us range. So a 5ms spinlock robs you of ~4.9ms of usable CPU throughput that a different thread/process could have been doing instead. The timeout levels here are just shy of an eternity. 5ms is a really, really long amount of time.
By way of comparison, pthread_mutex_lock in glibc will do one pause per attempt for at most 100 attempts. So under Skylake it has a max pause duration of around 3-5us, depending on clock speed, which is pretty reasonable. That roughly aligns with the cost of a context switch, so once spinning would take longer than a context switch, it does a context switch instead.
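A rough sketch of that strategy (illustrative only; the real glibc code differs, and the 0/1/2 futex protocol here is the textbook one, not glibc's):

    #include <stdatomic.h>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <emmintrin.h>

    #define MAX_SPINS 100   /* roughly the cost of a context switch */

    /* 0 = unlocked, 1 = locked, 2 = locked with (possible) waiters */
    static void adaptive_lock(atomic_int *word)
    {
        for (int i = 0; i < MAX_SPINS; i++) {
            int expected = 0;
            if (atomic_compare_exchange_weak(word, &expected, 1))
                return;              /* acquired while spinning */
            _mm_pause();             /* one pause per failed attempt */
        }
        /* Spinning already cost about a context switch; sleep in the kernel.
           (Unlock must store 0 and FUTEX_WAKE if the old value was 2.) */
        while (atomic_exchange(word, 2) != 0)
            syscall(SYS_futex, word, FUTEX_WAIT, 2, NULL, NULL, 0);
    }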
Pausing for as much as 10ms could be better under some circumstances, but that would seem to me to be heavier than a context-switch. Whether the system actually achieves better throughput or not in that case would be workload dependent. Worst case the hyperthread is also running the spin lock loop and no useful work at all is accomplished on the core.
Isn't this entire thread about "pause"? Anyway, I think we both understand the difference between "context switch" and "pause" here.
> Pausing for as much as 10ms could be better under some circumstances, but that would seem to me to be heavier than a context-switch. Whether the system actually achieves better throughput or not in that case would be workload dependent.
Seems reasonable. Ideally, threads / cores shouldn't be blocked that long, so 10ms+ blocked periods is definitely an anomaly.
But whether or not a context switch is actually worth it? Ehhh, it's hard to say for sure.
"The issue has been fixed with .NET Core 2.1 and .NET Framework 4.8 Preview contains also the fixes for it."
And by the end of the article I was thinking their code might just run faster as a single thread on a ~4GHz processor.
Another thought was why do they have so much resource contention in the first place? That seems fishy.
Another thought: Perhaps they should write it in Rust with all of its "fearless concurrency" claims. While I'm skeptical of those claims, the exercise would probably highlight some of the real issues.
Any way I look at it, I keep thinking it's a software design problem.
Rust doesn't solve deadlocks or race conditions. Rust only makes it hard to accidentally share a memory location, but that's only a very small part of the problem; race conditions are a fundamental part of computer science, baked into any language that supports expressing and joining asynchronous processes.
I would be generous in agreeing with a statement about Rust making it a little more difficult to end up with a deadlock or unwanted race condition, except that's not true when compared with many other high level languages.
Also nothing is impossible until you prove it via formal methods (e.g. TLA+).
Not just a memory location, any value. You can use the tools Rust gives you on any piece of state you care to manage - a database connection, a remote server, the office coffee machine - represent it as a value and Rust will let you check that it's owned where you want it to be owned, borrowed where you want it to be borrowed and so on.
> Also nothing is impossible until you prove it via formal methods (e.g. TLA+).
What's the distinction you're drawing between something like TLA+ and something like the Rust borrow checker?
It is easy to specifically describe what Rust prevents: unsynchronized memory access from multiple threads. What other high-level languages allow read-write shared-memory multithreading and guarantee that?
The fact that unsynchronized memory access is impossible in safe Rust (modulo compiler bugs, CPU bugs, etc.) is something that nobody seriously doubts. It's obvious. I'm glad people are working on proving it, but the proof is an academic exercise.
Pony maybe? https://www.ponylang.org/discover/#what-is-pony
At least that's my impression of its reference capabilities.
The advantage of Rust over, e.g. Java, then, is that Rust can prevent data races at compile time?
Speculation on my part. I'm not actually advocating Rust but I am thinking about a project to give it a try.
It does not; it only prevents a specific problem known as a data race, which is an issue related to memory/ownership.
There is no deadlock protection. And race conditions are a much broader set of issues than just data races.
I like Rust; I just want to keep your expectations in check.
That looks like astroturfing to you? Please don't throw accusations like this around willy-nilly.
No, it looks like people exaggerating to me. I don't doubt there is some merit to the safety claims made about Rust, but many people have a tendency toward unwarranted idealization. I admitted that I have not learned Rust yet which should add doubt to my own skepticism.
The project I'm thinking of trying with Rust looks like it will be predominantly Unsafe Rust - using a lot of structures that are multiply referenced by necessity (or not?). I'm doing a lot of reading and thinking prior to starting because I really want the benefit from Rust as well as changing my thinking if there are good changes to be made there. Using an array of items and having them reference each other by index should satisfy the Rust compiler, but that doesn't actually solve the potential problems of trying to traverse (update) a complex data structure with parallel threads. The challenge of doing this suggests to me that the claims are overblown. A modestly complex data structure seems to require disabling all that safety but I am still learning and contemplating different approaches. Questioning basic assumptions seems prudent. I want to believe, but at this point I'm not convinced.
The problem was GC-initiated, i.e. the entry into GC caused the deadlock in the first place, and the process never recovered from it because all threads were spinlocking and no resources were available to complete the GC.
The object code in question uses an instruction that's basically a microarchitecture-specific hack. IMO the real issue here is that the object code is optimized for an earlier microarchitecture and needs an update.
Disclaimer: I work for Intel, but not on their CPUs, and I'm not a fanboi.
On the other hand, I think the hack is not in using the pause instruction itself, but that crazy backoff algo. For example, Linux uses pause too (via rep; nop) in lots of places. But not in an exponential loop (that I'm aware of).
I would expect any spinning lock to spin at most some slight fraction of the scheduling granularity, but almost always far less than the context switch time except maybe in some specific case of latency sensitive code.
Maybe the long spin time is for some reason trying to account for the very long scheduling quanta of Windows Server?
Superficially it seems misdirected in anything but real-time code, since most schedulers would allow the thread to keep its quantum until the next scheduling point if the thread wait time was low. The only case I can see this not being true is in a system that is overloaded and all CPUs have multiple threads that are all starved.
So it's not like resources are lost (at least, under high CPU utilization). But still, you're right in that a 29ms spin-lock sounds outrageous to me. That's definitely long.
The textbook example of a spinlock is maybe a synchronized counter (iterations++), or maybe a producer/consumer queue.
- invokes the accept() function of the socket family: https://github.com/torvalds/linux/blob/be779f03d563981c65cc7...
- which invokes the accept() function of the protocol: https://github.com/torvalds/linux/blob/be779f03d563981c65cc7...
- which invokes lock_sock_nested(), which spins https://github.com/torvalds/linux/blob/be779f03d563981c65cc7...
Yes, there are spin-locks involved in the process. But I'm talking about these lines of code:
> error = inet_csk_wait_for_connect(sk, timeo);
> mutex_acquire(&sk->sk_lock.dep_map, subclass, 0, _RET_IP_);
You know, the stuff that causes milliseconds to multiple-seconds worth of delay, as opposed to "pause / spinlocks" which is measured in nanoseconds.
You can find similar behaviour in the filesystem APIs, anything touching the VM (mmap_sem IIRC is also a spinlock), pretty much any OS API where threads are all banging at the same shared resource that isn't expected to require a long wait. struct file also contains a spinlock, but doesn't look like it's used in normal operation
Hmmm... okay. I think I see what you're going for. I think it's a bit of a muddy example though because of the mutex.
But pre-mutex, there's definitely a spinlock, and the 40-cores would definitely hit that first. Mutexes themselves are likely a spinlock as well, so there's a chance (a low-chance... but a chance nonetheless) that everything is fast-pathed and never sleeps a thread.
So yeah, there are better examples you could use. But upon further analysis, it does seem like your example does work with the right frame of mind.
Genuine questions, because I really don't know.
So actually, you're about right. It's "wrong" to spinlock above 40 cores. But at 40 cores or less, you probably should prefer spinlocks over atomics or other primitives on the x86 platform.
As for what? Well, you need a synchronization primitive for a producer-consumer queue. A spinlock would be a good choice for that, even on 40+ core machines (since most of those cores wouldn't be spinning at the same time, even if they're all locking the same lock).
Remember, the x86 primitive for synchronization is the "LOCK" prefix. When an x86 core "LOCK"s an instruction, no other core is allowed to touch the data it's touching. (In the MESI protocol: the cache line goes into the "exclusive" state.) As such, spinlocks match the fundamental nature of the x86 instruction set very very well.
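For instance, a sketch of a spinlock-protected producer/consumer queue (illustrative names and sizes only; the lock acquire compiles to a LOCKed xchg, the release to a plain store):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <emmintrin.h>

    #define QSIZE 256   /* power of two, so unsigned wraparound stays consistent */

    static struct {
        atomic_int lock;        /* 0 = free, 1 = held */
        unsigned head, tail;    /* only touched while holding the lock */
        int buf[QSIZE];
    } q;

    static void q_lock(void)
    {
        while (atomic_exchange_explicit(&q.lock, 1, memory_order_acquire))
            _mm_pause();
    }

    static void q_unlock(void)
    {
        atomic_store_explicit(&q.lock, 0, memory_order_release);
    }

    static bool q_push(int v)            /* producer */
    {
        q_lock();
        bool ok = q.tail - q.head < QSIZE;
        if (ok)
            q.buf[q.tail++ % QSIZE] = v;
        q_unlock();
        return ok;
    }

    static bool q_pop(int *v)            /* consumer */
    {
        q_lock();
        bool ok = q.head != q.tail;
        if (ok)
            *v = q.buf[q.head++ % QSIZE];
        q_unlock();
        return ok;
    }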
Can anybody actually explain this result? The spinlock needs at least one atomic operation and a write (for lock + unlock) and one non-atomic read-modify-write per iteration while the atomic increment needs only a single atomic operation per iteration and no other memory access. How on Earth can the latter variant be slower?
The only thing I can think of right now is that due to some micro-architectural idiosyncracies, the relevant cache lines end up migrating between cores more often in the atomic increment case, while in the spinlock case, the same core ends up winning the lock back-to-back more often (and so cachelines migrate less frequently which reduces the overall running time).
That seems like it would most likely only ever be relevant to this kind of extremely micro benchmark, but I'd be interested to hear if there are other explanations (I haven't watched the whole talk, but at least the slides don't seem to make an attempt to explain this).
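If anyone wants to poke at it, a minimal reproduction along these lines should show the comparison (an illustrative sketch; thread and iteration counts are arbitrary):

    /* cc -O2 -pthread bench.c */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <time.h>
    #include <emmintrin.h>

    #define THREADS 4
    #define ITERS   10000000L

    static atomic_long counter_atomic;    /* for the lock xadd variant */
    static atomic_int  lock;              /* 0 = free, 1 = held */
    static long        counter_locked;    /* for the spinlock variant */

    static void *atomic_worker(void *arg)
    {
        (void)arg;
        for (long i = 0; i < ITERS; i++)
            atomic_fetch_add_explicit(&counter_atomic, 1, memory_order_relaxed);
        return NULL;
    }

    static void *spinlock_worker(void *arg)
    {
        (void)arg;
        for (long i = 0; i < ITERS; i++) {
            while (atomic_exchange_explicit(&lock, 1, memory_order_acquire))
                _mm_pause();
            counter_locked++;             /* plain increment inside the lock */
            atomic_store_explicit(&lock, 0, memory_order_release);
        }
        return NULL;
    }

    int main(void)
    {
        for (int pass = 0; pass < 2; pass++) {
            void *(*fn)(void *) = pass ? spinlock_worker : atomic_worker;
            pthread_t t[THREADS];
            struct timespec a, b;
            clock_gettime(CLOCK_MONOTONIC, &a);
            for (int i = 0; i < THREADS; i++) pthread_create(&t[i], NULL, fn, NULL);
            for (int i = 0; i < THREADS; i++) pthread_join(t[i], NULL);
            clock_gettime(CLOCK_MONOTONIC, &b);
            printf("%s: %.2fs\n", pass ? "spinlock+increment" : "lock xadd",
                   (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9);
        }
        return 0;
    }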
Furthermore, ALL loads / stores are almost strictly ordered on x86.
> Intel 64 memory ordering obeys the following principles:
> 1. Loads are not reordered with other loads.
> 2. Stores are not reordered with other stores.
> 3. Stores are not reordered with older loads.
> 4. Loads may be reordered with older stores to different locations but not with older stores to the same location.
> 5. In a multiprocessor system, memory ordering obeys causality (memory ordering respects transitive visibility).
> 6. In a multiprocessor system, stores to the same location have a total order.
> 7. In a multiprocessor system, locked instructions have a total order.
> 8. Loads and stores are not reordered with locked instructions.
As such, "relaxed atomics" DO NOT EXIST on x86. Almost everything on x86 is innately an acquire / release semantics (even non-locked stuff).
As such, the optimal spinlock-unlock implementation on x86 doesn't use the LOCK prefix at all! You rely upon the memory-ordering consistently guaranteed by the x86 processor, and let the cache-coherence mechanism handle everything for you.
You'll still need the "lock" to obtain the lock (writing 1 into the memory location AND reading the old value). But unlocking the lock (writing 0 into the memory location) can be done x86-LOCK-prefix free.
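I.e., something along these lines (a sketch using C11 atomics; on x86 the acquire exchange becomes a LOCKed xchg and the release store becomes an ordinary mov):

    #include <stdatomic.h>
    #include <emmintrin.h>

    static void spin_lock(atomic_int *held)
    {
        /* xchg: write 1, read back the old value (implicitly LOCKed on x86) */
        while (atomic_exchange_explicit(held, 1, memory_order_acquire))
            _mm_pause();
    }

    static void spin_unlock(atomic_int *held)
    {
        /* release store: compiles to a plain mov, no LOCK prefix */
        atomic_store_explicit(held, 0, memory_order_release);
    }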
In contrast, a CAS-atomic on x86 needs to "LOCK" the memory bus, which under x86 is specified as:
> Causes the processor’s LOCK# signal to be asserted during execution of the accompanying instruction (turns the instruction into an atomic instruction). In a multiprocessor environment, the LOCK# signal ensures that the processor has exclusive use of any shared memory while the signal is asserted.
So when you do a CAS-atomic on x86, in REALITY, the LOCK-prefix will be asserted, locking out all other cores from interacting with your cache line.
So the cache-coherence mechanism on x86 seems sufficiently robust and well optimized by Intel / AMD, at least with our current 28-core Xeons and 32-core EPYCS. Future scaling may be an issue for Intel/AMD, and ARM / Power9 (which have true relaxed memory) may scale better eventually.
Atomic RMWs on Intel (and really the vast majority of high performance architectures) have no effect outside of the core issuing them except in some very exotic circumstances. The LOCK signal is a relic of the pre-P6 past when everything communicated via a single bus.
What prevents a cache line from being 'interacted' with, is being held in exclusive mode. This is a property of any write, atomic or not. An RMW won't hold a cacheline significantly longer than a normal write would.
> As such, the optimal spinlock implementation on x86 doesn't use the LOCK prefix at all! You rely upon the memory-ordering consistently guaranteed by the x86 processor, and let the cache-coherence mechanism handle everything for you.
That's just not true. Spinlocks are not magic; they need to be implemented, and all the implementations I know of use an (atomic and therefore locked!) cmpxchg.
Which makes sense. Spinlocks are a synchronization primitive, and you wrote yourself that ALL synchronization primitives on x86 are "LOCK" prefixed.
So let's rephrase the original question: why should a locked cmpxchg + additional memory accesses be more efficient than a single locked increment?
> So let's rephrase the original question: why should a locked cmpxchg + additional memory accesses be more efficient than a single locked increment?
The ideal spinlock on x86 isn't a cmpxchg, btw. It's a pure locked-swap.
And I think therein lies the answer. A locked-swap on x86 can theoretically execute faster (it's a pure write and a pure read to the cache), while the uOps generated for a cmpxchg would include a comparison.
From a cache-coherency perspective: the pure-write of a locked-swap can be implemented by invalidating the caches of every other core. So there's a "race" between cores to send out the "invalidate cache" message to everybody. Everyone who loses to the invalidate-message needs to stall and re-read the new value.
But otherwise, the "winner" of locked-swap executes incredibly efficiently. The winner instantly transitions into the "Owner" state and continues to execute.
But a cmpxchg is a far more complicated order of events. I'd imagine that the underlying, undocumented, cache-coherence mechanisms struggle under a LOCKed cmpxchg (which is far more common with atomic-based lockless programming).
Remember, the "LOCK" prefix is implemented by the MESI protocol (or something more complicated, like MESIF on Intel or MEOSI on AMD) in reality. Even then, modern processors probably implement something far more complicated, and its unfortunate that its undocumented.
Still, LOCK swap is clearly more efficient under MESI than LOCK cmpxchg.
A locked-increment would probably be roughly the same speed as a cmpxchg btw. Both a locked-increment and locked-cmpxchg require a full read/modify/write cycle.
While a locked-swap is just a read / invalidate+write instead.
On the other hand, if the spin lock is contended, avoiding unconditionally acquiring the cacheline in exclusive mode is beneficial, as it will mitigate cacheline ping-pong. Thus the test-and-test-and-set idiom.
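Roughly (an illustrative sketch of the test-and-test-and-set shape):

    #include <stdatomic.h>
    #include <emmintrin.h>

    static void spin_lock_ttas(atomic_int *held)
    {
        for (;;) {
            /* only attempt the exclusive-mode write when the lock looks free */
            if (!atomic_exchange_explicit(held, 1, memory_order_acquire))
                return;
            /* otherwise spin on plain loads: the line can stay shared,
               which mitigates the ping-pong */
            while (atomic_load_explicit(held, memory_order_relaxed))
                _mm_pause();
        }
    }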
On the third hand, if the lock is contended, a spin lock is probably inappropriate...
Also I want to mention that CAS can fail, while xchg always succeeds, but the performance implications are not clear cut: for example, a spin lock will issue xchg in a loop (failure is encoded at a higher level), while wait-free algos can issue CAS outside of loops, since a failed CAS implies that another CAS succeeded and thus carries information.
You almost had a perfect post IMO :-) But this last line is incorrect on some systems.
x86 CAS seems to be strong as you imply. But apparently, Power9 or ARM CAS can have spurious failures. Or more appropriately, Power9 and ARM implement "Load Linked / Store Conditional", which can have spurious failures, and thus compare-and-swap (at a higher-level, like in C++ or C) will look like a CAS failed spuriously.
Which is why atomic_compare_exchange_weak and atomic_compare_exchange_strong exists. Weak doesn't "hold information" due to the potential of spurious failures, but can be a more efficient implementation in the case of LL/SC instructions.
Still (but don't quote me on this because I'm no Power expert), IIRC Power actually does guarantee, via idiom recognition, that specific uses of LL/SC are wait free by downgrading it internally to an actual CAS after a number of failures, so a compare_exchange_strong which loops a bounded number of times can be implemented on this architecture.
For various reasons I still suspect that some other factor is at play, but I also don't really know how to tease that out. Maybe there are some performance counters one could look at, but I'm just speculating.
I'm okay with your conclusion, but your first sentence is incorrect. The XCHG instruction is fully atomic and sequentially consistent even without a LOCK prefix. (This isn't to say x86 has any particular magic -- XCHG with a memory operand is just automatically LOCKed.) Also, ordinary stores are releases, and are thus quite useful as synchronization primitives, without any LOCK prefix at all.
Atomics-based lock-free data-structures scale beyond 40+ core counts. But note that x86's "lock" instruction is fundamental to the system. x86 doesn't have relaxed atomics or barriers under most circumstances. As such, the spinlock best represents what the x86 instruction set offers.
More info on the purpose of the pause instruction: https://stackoverflow.com/questions/4725676/how-does-x86-pau...
Info on what's a "spinlock" (I thought it was another word for busy wait, but there's a subtle difference): https://stackoverflow.com/questions/38124337/spinlock-vs-bus...
That's fair, but consider it closer to a subtitle than a tl;dr, which is supposed to be a summary that lets you skip the main body entirely.
> just to avoid wild speculations and conspiracies
The burden there is on people to not comment if all they know about the article is a wild guess.
The increased latency ... has a small positive performance impact of 1-2% on highly threaded applications.
Making an instruction more than 10x slower overall, for a 1-2% gain (who wants to bet it's some "industry standard" benchmark...?) in a very specific situation sounds like a textbook example of premature microoptimisation. Of course they don't want to give the impression that they severely slowed things down, but the wording of the last sentence is quite amusing: "some performance loss"? More like "a lot". It reminds me of what they said about the Spectre/Meltdown workarounds.
No, this is textbook optimization
Benchmarks showed a positive gain -> that's a real optimization, not premature at all.
Premature optimization is when you guess that a change will make things better without measuring. As soon as you measure it becomes real optimization work, not premature at all. As in, this is how you're supposed to make things better.
Feel free to debate the merits of their benchmark, but this is a textbook example of Doing Performance Correctly otherwise.
But this gain of 1-2% is balanced against the author's 50% performance loss. I guess it means that the benchmark results here depend on the choice of tested applications. If they had used more multithreaded .NET applications, the tests could have shown a performance degradation as well.
If you are spending even a tiny fraction of 1% of your time spinning, your app has terrible performance bugs already.
This article is just bad analysis.
Not necessarily. Very latency-sensitive applications may pin a thread to a CPU core and mark the core as reserved so the kernel doesn't schedule any other task or interrupt on it. Busy-waiting is used instead of yielding to the scheduler, that way context switches can be avoided almost entirely.
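For example, something like this (a sketch on Linux; it assumes the core has also been isolated from general scheduling, e.g. via the isolcpus= kernel parameter, which is outside the snippet):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdatomic.h>
    #include <emmintrin.h>

    /* Pin the calling thread to `cpu`, then busy-wait on a flag without
       ever yielding to the scheduler. */
    static void pin_and_wait(int cpu, const atomic_int *ready)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

        while (!atomic_load_explicit(ready, memory_order_acquire))
            _mm_pause();   /* stays on-core: no context switch, minimal latency */
    }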
The bottom line is that this is an architecture optimized for compute throughput, and they figured out that if they spent a few cycles on spin exit they could more than make it back via more effective hyperthread dispatch. And for everyone but .NET, that seems to be the case.
I thought the VM class of languages were generally inappropriate for real-time work. Heck, even in C, you need to do specific work on the memory allocator to make real-time viable.
It’s not “premature optimization” to gain 1-2% for likely trillions of application executions.
Tbh I think this is more a sign that Intel's architecture, the Core i generation of CPUs, has been optimized about as much as it reasonably can be, and Intel is squeezing out a few more CPU cycles here and there.
Someone else can pipe in, but the impression I got from friends in that industry was that Intel had a fairly advanced org set up to evaluate that.
Or to look at it another way, they didn't want to make an $x billion bet without due diligence.
If the SwitchToThread call returns False, then the code reverts to a Sleep(0) call instead.
Credit to StackOverflow and Joe Duffy's excellent article on this:
However, as mentioned in one of the SO replies, you still need to make sure that you don't use locks that use loops on calls like this when those locks will have to wait on anything long-running - you're just going to burn CPU/battery needlessly. So, only use such locks when you know that the protected code doesn't involve any long wait states, or modify the code to back out even further into a proper kernel wait after looping X times. Personally, I agree with second SO answer and don't like such constructs, and would personally say just go with a straight-up kernel wait if there's any question about whether such conditions could exist now or in the future.
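The escalation Duffy describes looks roughly like this (a sketch in C against the Win32 API; the iteration threshold is illustrative, not his exact number):

    #include <windows.h>

    static void spin_wait_once(int iteration)
    {
        if (iteration < 10) {
            YieldProcessor();               /* emits a pause instruction */
        } else if (SwitchToThread() == 0) { /* no other ready thread on this CPU... */
            Sleep(0);                       /* ...so give up the rest of the timeslice */
        }
    }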
"Pause" is a hint to the CPU that the current thread is spinning in a spinlock. You've ALREADY have tested the lock, but it was being held by some other processor. So why do you care about latency? In fact, you probably want to free up more processor resources as much as possible.
Indeed, there aren't actually any resources wasted when you do a "pause" instruction. In a highly-threaded environment, the hyperthread sibling of the thread picks up the slack (you give all your resources to that thread, so it executes quicker).
10-cycles is probably a poor choice for modern processors. 10-cycles is 3.3 nanoseconds, which is way faster than even the L3 cache. So by the time a single pause instruction is done on older architectures, the L3 cache hasn't updated yet and everything is still locked!!
140-cycles is a bit on the long side, but we're also looking at a server-chip which might be dual socket. So if the processor is waiting for main-memory to update, then 140-cycles is reasonable (but really, it should be ~40 cycles so that it can coordinate over L3 cache if possible).
So I can see why Intel would increase the pause time above 10-cycles. But I'm unsure why Intel increased the timing beyond that. I'm guessing it has something to do with pipelining?
I have no idea what "disrupting Intel's model" would actually mean. But I would be fine if Intel is just outcompeted at their own game, and this already seems to be happening.
I'd say disruption of Intel's model would mean different ways of selling CPUs, fast storage or other silicon. All I see is Intel's competitors trying to mimic Intel to some extent because the enterprise market is the most lucrative. AMD, Qualcomm, Cavium and IBM all consider Intel a main and direct competitor. So I'd be interested if anyone has an idea on disrupting Intel's way of doing business.
It may run on Intel now, but once "runs on AWS/GCP/Azure" becomes more important than running on an ISA, there's no reason compute couldn't transition much faster.
That would likely make "properly coded" Power9 or ARM synchronization to be more scalable. Possibly a major advantage as 32-cores and higher are beginning to hit mainstream (or at least, high-end mainstream)
You can get an 8 year old octocore that still destroys anything Intel has put out recently.
Intel relies on inertia. AMD consistently beats them on actual performance at a lower price. I remember my 386 DX2 had better performance than an Intel 486. And let's not forget the P4 issues with their garbage RAM. Plus they backdoor their chips with stupid insecure crap.
It's only with Ryzen that AMD reached performance parity with Intel once again.
Stock price doesn't determine CPU quality. AMD stock is a whipping boy for shorting... and an easy way to profit when those professional investors are constantly wrong. (I've invested in AMD over the last 20 years and there is an extremely predictable cycle in its stock price. Though just a disclaimer: I currently do not hold any AMD stock.)
Because I read the benchmarks. Intel was winning by such a huge margin that AMD nearly bankrupted themselves trying to compete on price. Intel didn't even need to dip into their margins at all because AMD was several generations behind.
Intel beat AMD to 32nm fabrications by over a year, with the gap widening over the next five. Intel got to 14nm in 2015 while AMD just got there last year with Ryzen.
> Like I previously stated you can buy an octocore and that's still beats most Intel offerings for far less
BS. Nothing pre-Zen was competitive with contemporary Intel offerings, much less the latest offerings. As I pointed out, AMD was a generation or two behind Intel for this entire decade.
Yeah, you can still buy these six+ year old CPUs for $90, but, as that review shows, they were barely competitive with 2012 Intel processors, much less competitive with CPUs that are two generations newer.
> Stock price doesn't determine CPU quality.
It's a measure of how well a company is doing.
The original Phenom had a major CPU bug that hit once they finally began to gain market-share in the server market. This cost AMD pretty heavily and forced them to cut R&D budgets, which, as stated, put AMD generations behind Intel in tech. This led to a slashing of margins to compete, which put them even further behind Intel.
Oh yeah, and somewhere during all that, they were the target of class action lawsuits for over-stated performance.
This doesn't even get into the Radeon lineup, which totally missed the AI/ML craze that sent nVidia profits to the sky.
> This doesn't even get into the Radeon lineup, which totally missed the AI/ML craze that sent nVidia profits to the sky.
On the other hand, they scored well on the cryptomining craze. If it wasn't for the fact that an R9 290 was significantly faster at mining than the GeForce 980 GTX, I'm not sure if AMD would still be making GPUs.
I absolutely don't believe an 8 year old AMD octocore will "destroy" anything Intel has. Sure, I imagine a single highly specific and biased test could give AMD the advantage in that one instance, but in overall performance AMD has been behind since the Pentium M/Core 2, with only a few products that initially compete or win, only to fall behind again.
We may be remembering a different history, but Intel has had single threaded performance locked up for awhile.
AMD prices aggressively and glues more cores together, but hyperbole isn't doing your point any favors.
Bad compared to whom?
Intel's 32-bit offering at the time wasn't any good either; the Pentium 4 had such a deep pipeline (and not-so-smart prefetch/speculation) that the chips ran at a high frequency and ate a lot of power, but pipeline stages were idling or working on useless stuff a large amount of the time.
Luckily for the world, AMD managed to produce that alternative, the AMD64 architecture, which kicked Intel's ass so hard that eventually they had to adopt it.
If it weren't for AMD's AMD64, and their cross-licensing deal with Intel (that allowed them to make an x86 compatible chip without specifically asking for Intel's permission), we'd be in an Itanium dominated world right now (or .. ARM would have stepped up, it's hard to predict an alternate reality) but surely with less efficient and less effective processors.
History does not repeat, but it rhymes. Intel has no Itanic this time, but it did become numb enough to let AMD outmaneuver them again.
It's not that Intel is doing a bad job, as much as they have too little incentive to do a good job.
This is far from certain: Itanium shipped years later and much, much below the advertised performance thresholds. In the meantime even Intel’s own x86 had closed the raw performance gap, and since IA-64 was such a huge bet on brilliant compilers, the early benchmarks for most business code favored the P-III even before you factored in cost. Similarly, the original claim that it’d offer competitive x86 performance turned out to be half a decade behind by the time it shipped, so it wasn’t even remotely competitive for legacy code, and businesses back then had even more weight of code which was difficult to recompile.
It’s equally plausible to me that Intel would have continued to see strong x86 sales and weak demand for Itanium and ended up shipping their own 64-bit extension a few years later with less pressure from AMD, simply because businesses weren’t keen on the “recompile everything, make a bunch of expensive optimizations, and it might be faster than the system it’s replacing” pitch.
I agree it is plausible, but the "few" in "few years later" makes a huge difference. Itanium was not requested or driven by the market, not even wanted; it was dictated by Intel, who put billions into it. Without any market pressure, they would have likely taken their sweet time - possibly 10 or more years - to let Itanium mature. They actually took the writeoff for Itanium as late as they could - I would guess no one wants to own such a colossal failure, and that would have been true even if AMD wasn't around.
While IA-64 originated in HP, as another poster noted, it was in Intel's interest that HP would be out of the CPU market, and in that sense Itanium was a success -- it left Intel in a much stronger position in the long run, and if it wasn't for you meddling AMD kids, they would have gotten away with it in the short run as well.
HP had ceded the CPU market but Power was competitive on both the desktop (Apple was shipping a lot of PowerPC) and high-end server (IBM, Bull, and Hitachi) markets, delivering quite competitive performance and mature 64-bit support, and there was still some chance left for SPARC, MIPS, or Alpha — and 3 of those 4 CPUs had native support in Windows. AMD publicly announced AMD 64 in 1999 and I'd be surprised if they hadn't had some early news before that time since IBM, Sun, and SGI all had Opteron servers on the market fairly early on. If that hadn't been looking likely, I wonder whether someone at IBM or Sun might have decided to invest more heavily since Intel was voluntarily giving up their biggest advantage by sacrificing backwards compatibility.
A) Smart compilers are hard, expensive, and slow to develop
B) In 2001, source wasn't open enough for a new ISA to succeed. Not sure it would be now either (chicken / egg for big iron workloads: need marketshare for optimization, need optimization for marketshare)
C) Intel absolutely missed the "desktop memory needs are going to expand" trend
A) That's still true. The best compilers for Itanium (and other VLIW architectures) are still bad and cannot make good use of the hardware. It's likely to stay that way a long time, unless programming languages and programmers adopt vector semantics; they haven't yet.
B) It's never a good time for a new ISA to succeed, except if it achieves something really novel (like GPUs for parallel processing, or ARM for power consumption). Things can succeed, but unseating an incumbent is exceptionally hard when the improvement is marginal if any; Itanium, if a super smart compiler had existed when it came out (they still don't ...), would have been a small improvement. It only made sense for Intel to bet on it the way they did because they were a monopoly (or so they believed).
C) I disagree; Intel didn't miss anything. They just thought they were able to dictate where the market is going. And they would have, except for AMD's guts to do AMD64. For a long time, people bought AMD64 to run 64-bit XP, so they could run with 4GB or even 8GB of RAM - even though every single process was 32-bit. But that feature was more-or-less supported also on 32-bit with bank switching and other memory extensions - the actual use of 64 bit programs (with more than 3GB per process) came much, much later.
Usually by the desktop gaming crowd, who can't define VLIW or discuss the relative merits of different microarchitecture choices.
On (A) and (B), I'll forgive Intel (and HP). Circa 1990, I can see it being much more debatable whether making hardware (ILP) or software (VLIW) smarter was a more reliable path to performance. Especially with the market being smaller & more centralized (commercial compilers galore!).
I mean, hell, AMD made the same wrong call circa 2011 in their GPU arch, no? Thus one of the reasons they ceded ML to the simpler SIMD design Nvidia used, which was more amenable to creating, e.g., something like CUDA? (correct me if I'm wrong)
On (C), my impression was that bank switching and PAE were poorly supported by Windows at the desktop level. Which essentially made it a non-starter for the primary computer market at the time.
AMD/ATI in 2011 was still focused on graphics, ignoring the GPGPU revolution (as it was still called back then). In many ways they still haven't managed to turn around.
> On (C), my impression was that bank switching and PAE were poorly supported by Windows at the desktop level. Which essentially made it a non-starter for the primary computer market at the time.
I remember that while PAE was clunky, it did enable (say) running multiple processes on 8-core machines and giving each its own 2GB of memory; it was definitely poor compared to the real thing (AMD64 at the time), but like EMS, XMS and similar "extended memory" systems of the late '80s before it, it was good enough to stop people from moving to a decent architecture (e.g. Motorola 680x0) if it didn't provide backward compatibility.
On PAE, I think the 4GB memory limit on contemporary Windows desktop OSes is what I'm remembering: https://en.wikipedia.org/wiki/Physical_Address_Extension#Mic...
Today even Russia has managed to produce a VLIW architecture and a good enough compiler.
Do you have a citation for the “good enough” claim? In particular the listed peak performance appears to be below a modern phone's and, far more importantly, we'd need robust benchmark data to know that it's not following the Itanium cycle where a few very favorable, possibly hand-tuned, benchmarks manage to approach the theoretical maximum but most real-world code is substantially lower.