Which Intel CPU generation will have hardware fixes for these Spectre variants?
According to what we're seeing, the situation seems to be reversed with the 3xxx series, where AMD seems to have a small but significant lead; we'll have to wait for independent benchmarks.
They aren't competition if they fall off the map. So a slight edge would be great because it at least puts them back in the game.
This is easy to prove. The highest-clocked Xeons you'll find are a special SKU exclusive to AWS. Sure enough, they have far fewer cores than instances with lower clocks.
What you're seeing isn't routing issues, but the fact that their newer process isn't up to snuff, and they don't have the proper yields on larger die sizes.
Like, I've shipped RTL and know pretty well how this stuff works.
It certainly looks promising, but I'll still hold my excitement until we get some 3rd party benchmarks.
I'm also skeptical of first party benchmarks, but I'm already pretty excited that I finally might be able to justify an upgrade from my old Haswell setup.
In a cloud you typically pay for cores from a specific CPU type. Presumably any clouds that offer AMD CPUs will price them in a competitive manner.
I mean, the cloud business is the place with the most to lose from these kinds of issues, so I am incredibly suspicious of the claim that cloud providers aren't patching their microcode.
Whether it's Intel's or their own modified variant of the microcode, I would fully expect it to be patched in some way.
They have much to lose from not applying these mitigations, especially if they're the people spending a fortune to find them.
I honestly doubt that claim, however, and hadn't heard it before this thread.
The benefits of X570 over B450 therefore have nothing to do with GPU performance but instead would be either overclocking capability or, more significantly, I/O to everything else.
B450 only provides 6x PCIe 2.0 lanes and two USB 3.1 Gen 2 ports. That's not a lot of expansion capability, especially with NVMe drives. Want 10GbE? Or a second NVMe drive? Good luck.
X570 gets to leverage double the bandwidth to the CPU in addition to being more capable internally. So you'll see more boards with more M.2 NVMe slots as a result, for example. And Thunderbolt 3 support. Check out some of the X570 boards shown off - the amount of connectivity they have is awesome. That's why you'd get X570 over B450.
Most people do not need a second nvme drive or 10GbE.
The thing is that most things aren't (currently) bottlenecked by PCIe 3.0. A 2080 Ti shows about 3% performance degradation by running in 3.0x8 mode. 4 lanes of PCIe 3.0 is 4 GB/s (32 Gb/s) which is plenty for 10 Gb/s networking... or even 40 Gb/s networking like Infiniband QDR (which runs at 32 Gb/s real speed after encoding overhead). So you can reasonably run graphics, 10 GbE, and one NVMe device off your 3.0x16 PEG lanes.
And AMD also provides an extra 3.0x4 for NVMe devices, so you can run graphics, 10 GbE, and NVMe RAID without touching the PCH at all.
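To make the arithmetic concrete, here's the back-of-the-envelope version in C (the per-lane rates and encoding overheads are the standard spec numbers, not measured figures):

    #include <stdio.h>

    int main(void) {
        /* PCIe 3.0: 8 GT/s per lane with 128b/130b encoding */
        double pcie3_lane_gbps = 8.0 * 128.0 / 130.0;   /* ~7.88 Gb/s usable */
        double pcie3_x4_gbps   = 4 * pcie3_lane_gbps;   /* ~31.5 Gb/s */

        /* Infiniband QDR: 4 lanes at 10 GT/s with 8b/10b encoding */
        double ib_qdr_gbps = 4 * 10.0 * 8.0 / 10.0;     /* 32 Gb/s real */

        printf("PCIe 3.0 x4: %.1f Gb/s (%.2f GB/s)\n",
               pcie3_x4_gbps, pcie3_x4_gbps / 8);
        printf("10GbE:       10.0 Gb/s\n");
        printf("IB QDR:      %.1f Gb/s after encoding\n", ib_qdr_gbps);
        return 0;
    }

(So x4 lands a hair under QDR's data rate, but close enough in practice.)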
The real use-case that I see is SuperCarrier-style motherboards that have PEX/PLX switches and shitloads of x16 slots multiplexed into a few fast physical lanes, like a 7-slot board or something. Or NVMe RAID/JBOD cards that put 4 NVMe drives onto a single slot. But right now there are no PEX/PLX switch chips that run at PCIe 4.0 speeds anyway, so you can't do that.
Sure, but you won't find any board with a setup like that. You could also reasonably split the x4 NVMe lanes into 2x x2, but again you won't find such a setup.
You'll find no shortage of boards with everything wired up to the PCH, though, and it's "good enough" even if it isn't ideal. The extra bandwidth will certainly not be unwanted, especially when you're also sharing that bandwidth with USB and SATA connections.
> The real use-case that I see is SuperCarrier-style motherboards that have PEX/PLX switches and shitloads of x16 slots multiplexed into a few fast physical lanes, like a 7-slot board or something.
I think those use cases would instead just use threadripper or epyc. Epyc in particular with its borderline stupid 128 lanes off of the CPU.
(I'm fairly certain for most gaming workloads, the bandwidth increase will only come into play when getting closer to 4k 144Hz, which is unlikely to be pushed out by first gen PCIe 4.0 GPUs.)
Here's hoping it comes out, looks like a great CPU for a relatively cheap desktop.
If what AMD says is true and the new (for them) TAGE predictor in their industry-leading microarchitecture has 30% fewer branch mispredictions than the last one, it feels very cool that one can read and somewhat understand the operation of a similar predictor in the leisure hours of a few days.
Also those caches are huge, wow.
There are a few narrow workloads where having a huge unified cache is an advantage, but it generally isn't. If you have many independent processes or VMs it can actually be worse, because when you have one thrashing the caches it would ruin performance across the whole processor rather than being isolated to a subset.
Meanwhile most working sets either fit into 8MB or don't fit into 64MB. When you have a 4MB working set it makes no difference and when you have a 500GB one it's the difference between a >99% miss rate and a marginally better but still >99% miss rate.
Where it really matters is when you have a working set which is ~16MB and then the whole thing fits in one case but not the other. But that's not actually that common, and even in that case it's no help if you're running multiple independent processes because then they each only get their proportionate share of the cache anyway.
So the difference is really limited to a narrow class of applications with a very specific working set size and little cache contention between separate threads/processes.
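If you want to see that working-set cliff for yourself, a pointer-chasing micro-benchmark makes it obvious. A minimal sketch (sizes and iteration counts are arbitrary, Linux/glibc assumed for clock_gettime):

    /* Chase pointers around a single random cycle so the prefetchers
     * can't help; average latency jumps as the working set outgrows L3. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static double ns_per_access(size_t n, long iters) {
        size_t *buf = malloc(n * sizeof *buf);
        for (size_t i = 0; i < n; i++) buf[i] = i;
        for (size_t i = n - 1; i > 0; i--) {        /* Sattolo's shuffle:  */
            size_t j = (size_t)rand() % i;          /* one cycle covering  */
            size_t t = buf[i]; buf[i] = buf[j]; buf[j] = t;  /* all slots  */
        }
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        size_t p = 0;
        for (long i = 0; i < iters; i++) p = buf[p];  /* serial dependency */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        volatile size_t sink = p; (void)sink;       /* keep the loop alive */
        free(buf);
        return ((t1.tv_sec - t0.tv_sec) * 1e9
                + (t1.tv_nsec - t0.tv_nsec)) / iters;
    }

    int main(void) {
        for (size_t mb = 1; mb <= 128; mb *= 2)     /* 1MB .. 128MB */
            printf("%3zu MB: %.1f ns/access\n", mb,
                   ns_per_access((mb << 20) / sizeof(size_t), 20000000L));
        return 0;
    }

On a Zen CCX you'd expect the jump somewhere past 8MB; on a big unified L3, past the full cache size.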
And most people don't run a bunch of VMs. Single-thread performance still dominates, and latency cannot be improved by adding CPUs.
Ryzen/Epyc has cores organized into groups called a CCX, up to four cores with up to 8MB of L3 cache for the original Ryzen/Epyc. So Ryzen 5 2500X has one CCX, Ryzen 7 2700X has two, Threadripper 1950X has four, Epyc 7601 has eight.
Suppose you have a 1950X and a thread with a 500MB+ working set size which is continuously thrashing the caches because all its data won't fit. You have a total of 32MB L3 cache but each CCX really has its own 8MB. That's not as good for that one thread (it can't have the whole 32MB), but it's much better for all the threads on the other CCXs that aren't having that one thread constantly evict their data to make room for its own which will never all fit anyway.
This can matter even for lightly-threaded workloads. You take that thread on a 2700X or 1950X and it runs on one CCX while any other processes can run unmolested on another CCX, even if there are only one or two others.
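On Linux you can even force that placement yourself. A minimal sketch with sched_setaffinity (the CPU numbering here is hypothetical; check your actual CCX layout with lscpu or hwloc first):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void) {
        /* Assume CPUs 0-3 are one CCX (verify on your own box). Pin the
         * cache-thrashing process there so the other CCXs keep their L3. */
        cpu_set_t set;
        CPU_ZERO(&set);
        for (int cpu = 0; cpu <= 3; cpu++)
            CPU_SET(cpu, &set);
        if (sched_setaffinity(0, sizeof set, &set) != 0) {  /* 0 = self */
            perror("sched_setaffinity");
            return 1;
        }
        printf("pinned to CPUs 0-3\n");
        return 0;
    }

(taskset -c 0-3 ./thrasher does the same thing from the shell.)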
> And most people don't run a bunch of vms.
That is precisely what many of the people who buy Epyc will do with it, and it's the use case with the highest number of partitions. The desktop quad cores with a single CCX have their entire L3 available to any thread.
> Single thread performance still dominates
If your workloads are all single-threaded then why buy a 16+ thread processor?
While that might prevent one bad process from evicting things, it seems like it might almost lead to substandard cache utilization, especially on servers that might just want to run one related thing well.
Also, sharing between L3s would seem to be a huge issue, but I wasn't able to find info on how that is handled (multiple copies?). This does seem like it would help cloud systems isolate cache writes, though.
I work mostly on HPC and latency-sensitive things where I try to run a bunch of single threads with as little communication as possible, but still need to share data (e.g., our logging goes to shm, our network ingress and egress hit a shared queue, etc.).
I would probably buy one as a desktop, but not for the servers. Also no AVX-512, where beyond the wider vectors the real gain seems to be the improved instruction set.
Right, that's the trade off. Note that it's the same one both Intel and AMD make with the L2, and also what happens between sockets in multi-socket systems. And separation reduces the cache latency a bit because it costs a couple of cycles to unify the cache. But it's not as good when you have multiple threads fighting over the same data.
> I would probably buy one as a desktop, but not for the servers. Also no AVX-512, where beyond the wider vectors the real gain seems to be the improved instruction set.
If you're buying multiple servers the thing to do is to buy one of each first and actually test it for yourself. We can argue all day about cache hierarchies and instruction sets, and that stuff can be important when you're optimizing the code, but it's a complex calculation. If you have the workload where a unified cache is better, but so is having more cores, which factor dominates? How does a 2S Xeon compare with a 1S Epyc with the same total number of cores? What if you populate the second socket for both? How much power does each system use in practice on your actual workload? How does that impact the clock speed they can sustain? What happens with and without SMT in each case?
When it comes down to it there is no substitute for empirical testing.
Doesn't really matter to me; with the value they've been delivering since the original Ryzen launch, I see no reason not to buy them all. People appreciate a good discount on a desirable CPU on Craigslist when it's time to upgrade. It's just an easy swap, especially if you use an IC Graphite thermal pad instead of thermal paste.
Has Linux added similar code?
Are we calling this Mini-Numa or something else?
Yes, in 4.15 patches emerged for TR/Epyc and waaaaaay back in 2.6 it had scheduler domains which can do the same thing.
Only Threadripper and EPYC are NUMA.
Yeah, I meant this. I don't use Ryzen so I think of Zen as being Threadripper and EPYC.
So now there are at least 2 levels of non-uniform access for the BIOS/OS to manage.
The full list of varying latencies is something like: SMT, inter-core, inter-CCX, inter-die, inter-socket. And even that misses a few subtleties.
I found this paragraph confusing, is it talking about data prefetchers (Which would make sense b/c of the mention of short prefetches) or branch predictors? (Which would make sense b/c of the mention of TAGE and Perceptron)
Or, to put that another way, this reads to me like the probabilistic equivalent of a compiler doing dead code elimination on unconnected basic blocks. The L1 predictor is marking L1 cache lines as “dead” (i.e. LRU) when no recently-visited L1 cache line branch-predicts into them.
The way this works is there's a fast predictor (L1) that can make a prediction every cycle, or at worst every two cycles, which initially steers the front end. At the same time, the slow (L2) predictor is also working on a prediction, but it takes longer: either it's throughput-limited (e.g., one prediction every 4 cycles) or it has a long latency (e.g., it takes 4 cycles from the last update to make a new one). If the slow predictor ends up disagreeing with the fast one, the front end is "re-steered", i.e., repointed to the new path predicted by the slow predictor.
This happens in only a few cycles, so it is much better than a branch misprediction: the new instructions haven't started executing yet, so it is possible the bubble is entirely hidden, especially if IPC isn't close to the max (as it usually is not).
Just a guess though - performance counter events indicate that Intel may use a similar fast/slow mechanism.
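In code, the override idea looks something like this (a toy model with made-up table sizes, a software caricature of what's really a pipelined hardware structure, not any vendor's actual design):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Fast L1 prediction steers fetch immediately; the slower, more
     * accurate L2 prediction arrives a few cycles later and re-steers
     * the front end if it disagrees. */
    static uint8_t l1_table[1024];    /* small bimodal table, 1-cycle lookup */
    static uint8_t l2_table[65536];   /* stand-in for the big slow predictor */

    static bool l1_predict(uint64_t pc) { return l1_table[pc % 1024] >= 2; }
    static bool l2_predict(uint64_t pc) { return l2_table[pc % 65536] >= 2; }

    static uint64_t fetch_pc;         /* where the front end is fetching */
    static long resteers;

    static void on_branch(uint64_t pc, uint64_t target) {
        bool fast = l1_predict(pc);
        fetch_pc = fast ? target : pc + 4;      /* steer right away */

        bool slow = l2_predict(pc);             /* really arrives cycles later */
        if (slow != fast) {
            fetch_pc = slow ? target : pc + 4;  /* re-steer: small bubble, but
                                                   nothing has executed yet, so
                                                   far cheaper than a flush */
            resteers++;
        }
    }

    int main(void) {
        for (uint64_t pc = 0x1000; pc < 0x2000; pc += 4)
            on_branch(pc, pc + 64);
        printf("re-steers: %ld, fetching at %#llx\n",
               resteers, (unsigned long long)fetch_pc);
        return 0;
    }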
The complexity of the chip is higher than the previous models, with three dies under the hood instead of one. The high end chips are closer to Threadripper than they are to the models they're replacing.
I think $750 is still a ridiculously good price, and Intel's feet are being held to the fire.
- 4 memory channels vs 2
- 64 PCIe lanes vs 16+4+4
- 64MB of L3 cache vs 32MB
- DDR4 3200 vs DDR4 2666
- The PCIe lanes are 4.0 vs 3.0
- 3.5GHz base clock vs 3.4
- 4.7GHz boost clock vs 4.0
- 15% better instructions per clock
- Full AVX2 instead of emulating with two 128-bit units
- 105W vs 180W TDP
Ryzen 3000 specs say it can support 128GB of RAM, but it’s hard to find 32GB DIMMs on the market.
So if you’re trying to build a workstation with lots of ram and more than one GPU, the Ryzen boards are too limited even if you’re willing to buy a nice one.
I feel like if you're requiring 128GB of RAM and/or maxing out the PCIe then you're going to have a bigger budget and Threadripper makes more sense.
Just a note: Two units for fused multiply-add, otherwise it's four units with two multipliers and two adders.
Threadripper 1950X comes with the same core count, more memory channels, more PCIe lanes, and support for more memory. You can grab one for $499 from Amazon.
So you're not going to save more than a few bucks, but you'll get a slower and outdated CPU.
Note: I have a TR cooler running on my AM4 board (custom loop though so not completely comparable) and there is more than sufficient space to place it.
In turn, I misunderstood your reply to @lhoff, because in that context, I read it as a rebuttal of the idea that TR parts are expensive, suggesting an AM4 mobo + TR4 cooler as substitutes on a 1950X system.
My 4790K feels so outdated now...
I wouldn't make the assumption that AMD could sustainably sell that much silicon at that price point.
 - https://www.amazon.com/AMD-Threadripper-32-thread-Processor-...
The performance comparison will be interesting though. The 3950X should be quite a bit faster than the 1950X when it's not bottlenecked by memory bandwidth, but of course the 1950X still has twice the memory channels. Slightly offset by the Zen2 memory controller supporting higher frequency RAM. So which one is better will depend heavily on workload. I suspect that for a developer workstation the 3950X would be the better performer, most compilation workloads are not very sensitive to bandwidth.
If you don't need those features you're completely correct about the 3950x.
My biggest problem with virtualization is USB. I have a libvirt with GPU passthrough setup that works great, but have been unable to get a USB controller of any sort to passthrough; always winds up in a group with a bunch of other PCI-e devices. And ordinary forwarding with SPICE or something isn’t really sufficient for what I’d like to set up...
(disclaimer - I don't own a board that can do this, I will one day, though).
https://linustechtips.com/main/topic/799836-pcie-lanes-for-r... — nice diagram of lanes
GPUs generally don't come close to saturating x8 3.0 lanes, unless you have a very specific workload (like the new 3dmark bandwidth benchmark AMD used to demo PCIe 4.0).
Games don't do nearly enough asset streaming to use a lot of bandwidth, since the amount of assets used at the same time is limited by VRAM size, and most stuff is kept around for quite some time. Offline 3D renderers like Blender Cycles IIRC just upload the whole scene at once and then path tracing happens in VRAM without much I/O. For buttcoin mining, people literally use boards with tons of x1 slots + risers. No idea how neural nets behave, but would make sense that they also just keep updating the weights in VRAM.
It would leave a fairly big gap in the lineup with nothing to compete against Intel's X299 platform. AM4 is lacking in memory channels and PCIe lanes. Epyc has much lower clockspeeds, much more expensive CPUs, and more expensive motherboards than Threadripper.
Well, that is what first-gen Threadripper was. Same socket and all, but with half the connected DDR lanes and a pin telling the motherboard it's not EPYC.
I know it's not a big difference, but given the changes to IO and the 16 core consumer version, I don't see why there would be any internal difference to EPYC this time around (which this article claims will have a variable number of chiplets).
As Lisa said, TRs were distinct from Epycs; I guess using UDIMMs vs RDIMMs and a much higher base clock (except for the high-frequency EPYC 7371) led to a few changes.
If it does, it means the process is struggling to produce chips at that speed, so the headroom is incredibly low and you can forget about overclocking.
That said, the part you're missing is binning. The 3800X definitely gets the worst-binned chiplets, as evidenced by the 3900X and 3950X having the same TDP.
Even then, both AMD and Intel CPUs will pull significantly above their rated TDP when boosting. It's not quite a base-clock measurement (eg 9900K is more like 4.3-4.4 when 95W-limited) but it's definitely not a boost power measurement either.
Again, pretty much just a marketing number these days.
There's a mere 100MHz difference in base clock, which is what TDP is based off of. Nowhere close to enough of a reduction to fully explain +50% cores at the same TDP.
> Again, pretty much just a marketing number these days.
Not really no. You just need to understand it represents all core base frequency thermal design target, and not maximum power draw.
It's still based in reality, though. It's not some random made up number.
And binning is an extremely real thing with very significant impact. Not sure why you seem to be trying to outright dismiss it.
I like small builds that are a good compromise between performance and power needs, and the 3700X looks sweet on 65W at that price point.
That was a 34% decrease in cost for comparable models after one generation. If that trend holds, the comparable Zen2+/Zen3 model will only be around $500. So hopefully you just need to wait a year.
That's not going to happen this time around. Intel doesn't really have a response to 16C consumer processors. Best thing they can do is release the 10C chip they're working on... probably at $500 again. And they will be behind the 12C version that AMD has at $500 already.
The only similarly aggressive move that Intel could even make would be to drop 10C to the $350 segment (perhaps with Hyperthreading disabled), which would be a massive blow to their margins.
Sadly, until that happens, AMD CPUs are dead to me. For a C++ (or C or Rust) developer, rr is just too much of a productivity boost to give up.
Apropos the article, I'm trying to convince myself to build an EPYC 3201 server now rather than waiting for the Zen 2 version, for which I presume I'd have to wait until October or November at the earliest.
Intel is said to have switched to cobalt wiring in its latest node, and seems to be paying dearly for that. TSMC and others seem to have gone the conventional road and continued to perfect the salicide process for smaller nodes without any issues.
Take the 105W TDP chip vs., say, the 65W one: if a lighter task isn't saturating the cores, the power/heat generation would be similar, and the bigger chip doesn't really ramp up the heat/wattage unless heavier loads are thrown at it?
Similarly, 4 cores running on the 12 or 16 core chips should eat about the same amount of power as each other.
As for an already-compiled binary: depending on how it was compiled, it may or may not work on a different CPU. Also, the compiler doesn't do the runtime checks.
The under the hood stuff like true 256 bit registers, branch prediction, cache, etc, all is below the machine code level as other people have pointed out. The compiler doesn't know about it.
>What about if the binary runs on a different CPU, will the compiler include feature checks and multiple code versions?
This is referred to as multiple/dynamic code paths and it needs to be supported by the processor microarchitecture and compiler. afaik only the Intel Compiler and Intel processors support it, with the -ax compilation flag.
In general you should pick a minimum architecture for your applications, since it will be forward compatible.
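If you're on GCC or Clang you can also do the dispatch by hand with __builtin_cpu_supports. A minimal sketch (the kernel functions are just placeholders for a real hot loop):

    #include <stdio.h>

    /* Hypothetical hot loop built twice: once allowed to use AVX2,
     * once generic, with a one-time runtime check picking between them. */
    __attribute__((target("avx2")))
    static float sum_avx2(const float *a, int n) {
        float s = 0;                /* compiler may auto-vectorize with AVX2 */
        for (int i = 0; i < n; i++) s += a[i];
        return s;
    }

    static float sum_generic(const float *a, int n) {
        float s = 0;
        for (int i = 0; i < n; i++) s += a[i];
        return s;
    }

    int main(void) {
        float data[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        float s = __builtin_cpu_supports("avx2") ? sum_avx2(data, 8)
                                                 : sum_generic(data, 8);
        printf("sum = %.1f\n", s);
        return 0;
    }

GCC can even emit the check and the variants for you with __attribute__((target_clones("avx2","default"))) on a single function.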
Last I checked a lot of progress had been made, but you're unable to run any 32-bit applications and some software such as the Adobe Creative Suite simply won't run.
I'm considering a new Ryzen hackintosh build in July!
Err, "can't wait" as in "it will be awesome when it arrives" or as in "I'm not going to wait for that, stupid Apple"?
Also, I would be a lot more confident in labeling it sarcasm if I didn't believe there were plenty of people actually eagerly awaiting new Apple hardware just like that.
In any case, Poe's law strikes again.
AMD can deliver 8C/16T in 65W, but their GPUs need 50%+ more power than NVIDIA's for the same performance (up to 100% more at the 1080-class lower end). You're saying I'm not right and they don't have a problem?
It's a repeat of Vega, where AMD finally managed to reach Maxwell-level perf/watt on 14nm, merely matching what NVIDIA had done on 28nm. Once again they are years late and too expensive to boot.
They've managed to close the gap to Turing a little bit (because NVIDIA is still on a 16nm-derived node rather than 7nm), but it's going to be a bloodbath when NVIDIA moves down to 7nm next year.
Price is the great equalizer, but once again AMD is choosing to price head-to-head with NVIDIA. Racing onto 7nm was not a cheap move for them.