The yields for Intel's big monolithic high-core-count dies must be abysmal compared to the small 7nm TSMC chiplets AMD is packing together.
HBM memory for instance is still quite expensive, and even 3D NAND was more expensive initially. Now it's harder to tell because the whole market crashed and they reached a limit for density anyway.
In both this case and their delays in getting to 10nm due to wanting larger dies, it really feels like Intel's management is letting better be the enemy of good whereas AMD is making smarter choices about where to compromise.
By the time Intel releases their awesome 10nm 3D chiplet stack, AMD will likely have moved onto 5nm compute chiplets with a 7nm IO chiplet. It's not clear to me how Intel will catch up in the next 5 years or so.
Point being, you were wrong in the statements you have been making. 5nm will not be out by the time Foveros is introduced. I did not state Foveros is a competitor to a 128-thread Epyc processor.
What's your point again?
Plus, these chips with two compute chiplets also have double the PCIe lanes (40)! So a number of NVMe drives, GPUs, 10GbE etc. can run together without fighting over lanes (and that's before counting the doubled bandwidth of PCIe 4.0).
It does still feel weird calling a 16-core/32-thread CPU with 72MB of cache 'consumer'.
As DDR5 is coming out next year, that will mean a new socket, limiting the upgrade path for the CPU, RAM & motherboard. Although, 16 cores at ~4.5GHz shouldn't be a problem for the near future (maybe even 5 years). Same goes for the PCIe bandwidth.
Edit: Just done some checking, and it appears the 3950X has 24 PCIe lanes (16+4+4), but they are twice as fast, so not far behind the current 2nd-generation Threadripper!
The chipset multiplexes up to x16 lanes of "stuff" onto the x4 chipset lanes from the CPU.
All of this is physically determined by the pinout of the socket and none of this can change unless AMD moves to a new socket. What did change is the speed of the lanes - x4 lanes on 4.0 is twice as fast as x4 lanes on 3.0.
AMD, like Intel, likes to pretend that chipset lanes "count" as full CPU lanes, arriving at a total of 36 effective lanes. But that's nothing new either.
But if you are thinking of upgradeability of CPUs you are much much better sticking with AMD, who don't change their CPU socket every couple of years.
Purchasing decisions are not always rational, many times they are just emotion driven. The fear of missing out is hardwired into our brains.
I remember paying something like $800 for an Athlon 700 MHz CPU (the cartridge one) in 1999.
Intel has always been the go-to. The #1 priority is single-thread performance, first and foremost. Second is at least 4 cores. Most modern games can utilize at least 4, but it's also important to give the OS and other programs like Discord plenty of cores.
While Ryzen Gen 1 and Gen 2 have been amazing values, for gaming performance Intel has still been king. When you compare AMD to Intel FPS for FPS, Intel nearly ALWAYS wins.
CSGO is especially reliant on single-thread performance, but this goes for most games. It's worth noting too that while games can use multiple cores, I don't believe most engines scale to 8+ cores very well.
Supposedly Zen 2 solved most of that. (And some game benchmarks like CSGO suggest they really did) We'll see how it actually pans out since there's still the issue of inter-CCX latency (and now even cross-chiplet latency).
Inter-CCX communication requires hopping over the Infinity Fabric bus, which (in the case of Zen 1; no newer benchmarks) increases inter-thread latency from ~45ns to ~131ns. I'm sure it was reduced in Zen+ and is probably closer to 100ns by now. However, I'm not sure if inter-chiplet communication will be the same (e.g. it has its own IF bus) or worse (IO die overhead).
Hopefully someone runs the same inter-thread communication benchmarks on Zen 2.
Otherwise, yeah it can be terrible.
I've said in other comments my 4790K is getting a bit old at this point, not slow for most stuff, but definitely hungry for more cores for a lot of tasks, and looking to break past 32GB of RAM. I'd also been considering Epyc or even Xeon, as older/used Xeons can be very well priced. Guess I'm waiting until September.
I’m in nearly the exact same boat. I’d like to have ECC RAM the second time around for my home server, which the Zen chips reportedly support, though I don’t see many people using it. I’d also like better power usage. I think I’m going to wait one more year.
For now, planning on just playing around with it. I haven't decided if I'll be running Windows or Linux as the base OS yet.
Game developers have always made good use of the available resources. They'll use the extra power available. The newest techniques they have, like work-stealing queues, can scale to a large number of cores.
So games and gamers will use the extra cores. It's much less of a jump from 4 cores to 16 than from 1 to 2.
Give it a year or two.
No it's not worth it IMO, but some people spend crazy amounts chasing a few extra fps.
High-end phones have gotten way more expensive, but they are still consumer products.
It sounds like you're saying performance/power is a benefit for Intel, possibly based upon the history of AMD chips, but that line of thought has been wrong since the Ryzen architecture.
AMD gives their TDP with turbo enabled (similar to real usage); Intel gives TDP at rest / with no turbo enabled.
There is still some variance from both between stated and real TDP, but the core of the difference is well understood, and dates back almost a dozen CPU generations to when Intel already had to guzzle power like crazy to superclock their chips in the vague hope that they could compete with AMD's products of the time (and then they never reverted it once they took the lead back with the Core architecture).
It's kind of similar to the whole "Intel wants comparisons done with SMT off" thing: because the last 15 years were theirs, the whole thing is biased toward Intel... yet they still massively lose those comparisons.
As a Small Form Factor enthusiast, I can attest to this with utmost confidence. The chips will run at their expected TDP when configured as specified by the factory, that's just not the default on almost any enthusiast board from known companies. In the case of ASUS it can actually be a bit of a battle to get things to run as intel specifies, both with MCE and automatic overclocking behaviors.
If that's the case, then also the performance is "massively blown out", since essentially all the benchmarks around are based on popular motherboards.
Anandtech did a test some time ago with a real, fixed, 95 W TDP[¹], and it ain't pretty.
It's definitely good for Intel that "every popular motherboard" is, uh, guilty of going out of spec, otherwise, the popular opinion of Intel chips would be significantly lower.
Regardless, I'm also not really convinced that this can be considered "cheating" by the motherboards. According to the official Intel page [²]:
> The processor must be working in the power, temperature, and specification limits of the thermal design power (TDP)
so ultimately, it's the CPU that sets the performance/consumption ceiling.
Source: I have one of these.
Maybe Intel took that back with their lower clocked 8c/16t chips, dunno, this isn't something that comes up all that much in consumer reviews. But there's at least not a significant gap in either direction, it's pretty much a wash.
On the server side of things Anandtech didn't seem to go much into it but at least with this one: https://www.anandtech.com/show/11544/intel-skylake-ep-vs-amd...
The dual EPYC 7601 used 100W less than the Xeon competition in POV-Ray while also being the fastest system there by a substantial margin. That would put performance, power, and performance/watt all firmly in the EPYC 7601's domain on that one test. And Intel took it back on MySQL. So a 50/50 split.
When limited to its "official" 95W TDP, the 9900K does about 4.3 GHz and has a higher perf/watt than Ryzen (both higher performance and lower power consumption).
So basically you are in a situation where the Ryzen pulls less at stock, has slightly higher efficiency at stock, but has a much lower clock ceiling. While the 9900K ships with much higher clocks and worse efficiency, but has a much lower power floor if you pull the clocks back to 2700X levels.
Of note, the 2700X is actually pulling ~130W under AVX loads (33W more than the 95W-limited 9900K).
The Stilt noted that the default power limit AMD ships is 141.75W and the 2700X will run it for an unlimited amount of time (whereas Intel at least claims PL2 obeys a time limit, although in practice all mobo companies violate the spec and boost for an unlimited amount of time as well). So really "TDP" is a joke all around these days. Nobody really respects TDP limits when boosting, and it doesn't directly correspond to base clocks either (both 9900K and 2700X can run above baseclocks at rated TDP). It is just sort of a marketing number.
Epyc is a different matter, and once again more cores translates into better efficiency than fewer, higher-clocked cores. But the gotcha there is that Infinity Fabric is not free either; on Epyc the fabric alone is pulling more than 100W (literally half of the total power!).
Similarly, the 2700X spends 25W on its Infinity Fabric, while an 8700K is only spending 8W. So, Infinity Fabric pulls roughly 3x as much power as Intel is spending on its Ringbus. This really hits the consumer chips a lot harder, mesh on the Skylake-X and Skylake-SP is closer to Infinity Fabric power levels (but still lower).
Plus, GF 14nm wasn't as good a node as Intel 14nm. So Ryzen is starting from a worse node.
Bottom line: core for core, power efficiency on first-gen Ryzen and Epyc was inferior, but of course Epyc lets you have more cores than Xeon. The Ryzen consumer platform's efficiency was strictly worse than Intel's, though.
And that goes double for laptop chips, which are the one area that Intel still dominates. Raven Ridge and Picasso are terrible for efficiency compared to Intel's mobile lineup. And AMD mobile won't be moving to 7nm until next year.
Because of that whole "nobody obeys TDP and it doesn't correspond to base clocks or any other performance level", we'll just have to wait for reviews and see what Zen2 and Epyc are actually like. I am really interested in the Infinity Fabric power consumption, that's potentially going to be the limitation as we move onto 7nm and core power goes down, while AMD scales chiplet count up further.
Why is this shocking? Zen 2 is 7nm and Intel's latest is at 14nm. It would be a far bigger shock if they didn't beat Intel in performance/watt. Zen 2 vs whatever Intel releases on 10nm in the next ~6-18 months is a much more interesting comparison.
AMD wasn't really a consideration except for budget builds until they launched the Athlon in the late 90s. The success of Athlon was as much about Intel's fumble with Netburst as it was about Athlon being a solid competitor.
It took Intel almost a decade to roll out Core and in that time AMD failed to capture the market despite making tremendous gains and legitimizing itself.
Ultimately AMD fumbled with the Bulldozer/Excavator lines of CPUs and lost almost everything they had gained.
The reasons AMD couldn't capture the market are complex but the short answer is that Intel influences every aspect of a computer from software, to compilers, to peripherals, to firmware.
And by "AMD failed" you mean Intel used illegal means to stop them, right?
The US, Japanese and Korean fair trade commission equivalents all either blamed Intel or fined them. The EU was still too young in that area to act in time, but in 2009 they handed Intel one of their biggest fines ever, €1.06 billion, for what they did, along with an appropriate "oh, and if you do it again we won't be late, and won't be so nice".
Calling it "AMD failed to capture the market" is technically true, but that's one funny point of view.
Because Intel played dirty and illegal.
I've heard this baseless assertion before but so far I've never heard any semblance of support. Why do you believe that AMD "fumbled" with their Bulldozer line?
This article about Zen starts with an overview of why Bulldozer failed to deliver: https://arstechnica.com/gadgets/2017/03/amds-moment-of-zen-f...
Or that while it was power efficient at idle, it was exceptionally power hungry under load?
Maybe it was when the CEO admitted it failed to meet expectations, said we'd have to wait 4 years for a successor, and then stepped down?
Idk... I'm probably way off base.
I'm planning to hold out for next gen when they get ray tracing hardware to be a bit more future proof (my GTX 970's not dead yet), but since I'm thinking of trading my Wintendo out for a Mac + eGPU setup it's nice to see that AMD could actually be a good GPU option now.
Those were just announced this week, so keep an eye out for 3rd party benchmarks soon.
AMD's recent GPUs have a reputation for shipping "hot/high-power" at stock and then doing much better when undervolted. Navi will get the die shrink, so the results for both power and thermals are likely to be even better, but benchmarking needs to be done before we have a full picture of what's changed.
The Ryzen processor is 105W vs. 165W for the significantly slower Intel processor. Additionally, AMD's TDP numbers are much more accurate in terms of real peak usage than Intel's. So the Zen 2 processors will almost certainly have a much better performance/power ratio than the corresponding Intel ones moving forward. That was definitely not the case for AMD in their last generation.
> In this case, for the new 9th Generation Core processors, Intel has set the PL2 value to 210W. This is essentially the power required to hit the peak turbo on all cores, such as 4.7 GHz on the eight-core Core i9-9900K. So users can completely forget the 95W TDP when it comes to cooling.
In other words
1) Intel's advertised "TDP" = true? (they no longer use the original meaning of "Thermal Design Power")
2) Intel's advertised peak performance = true (with caveats such as all the mitigations required for the CPU flaws, which lower performance)
3) Intel's advertised peak performance at advertised TDP = BIG FAT LIE
There also seem to be some new X570 motherboards that will actually support this level of craziness, too.
Intel is currently getting absolutely destroyed on that front.
Both AMD & Intel list TDP for all cores active at base clock frequencies. The major difference is Intel heavily leverages what they call all-core boost to never actually run at their base clock, allowing them to list ridiculously low base clock frequencies. For example the i9-9900K's base frequency is listed at 3.6GHz, but the all-core turbo frequency is a whopping 4.7GHz. That difference is how you end up with a CPU that expects 210W of sustained power delivery (the 9900K's PL2 spec) even though its TDP is only 95W.
AMD doesn't (didn't?) have an all-core boost concept, so their base clocks are just higher, making their TDP number closer to real-world. But still technically base-clock numbers and not boost numbers, and so you will still see power draw in excess of TDP.
Officially the Ryzen 9 3950X supports up to DDR4-3200 (1600 MHz) according to the published specs https://www.amd.com/en/products/cpu/amd-ryzen-9-3950x; however, in this benchmark the memory was overclocked to 2063 MHz:
Memory: 32768 MB DDR4 SDRAM 2063MHz
No, it supports "4200+ with ease, 5133 demonstrated".
From official slides https://www.anandtech.com/show/14525/amd-zen-2-microarchitec...
Note I am not playing down the 3950X's performance. It is overall a processor superior to Intel's counterparts in most aspects.
Every DDR4 module beyond that is officially a 3200 module with an overclock option.
That's why you need to enable an Extreme Memory Profile (XMP) in your BIOS to use speeds beyond 3200.
The point being that this is a tricked out rig, not an official reporting of the CPU's performance. And that makes the headline essentially a lie.
You can see all 9980XE Geekbench results here: https://browser.geekbench.com/v4/cpu/search?dir=asc&page=1&q...
This 3950X result is definitely not faster than the top overclocked 9980XE, but it is faster than something like 3/4 of them. Given the base clocks of each I would expect the stock 3950X will end up at least slightly faster than the stock 9980XE though.
For example, one entry:
  Name: Intel Core i9-9980XE
  Memory: 65536 MB DDR4 SDRAM 2101MHz
  (Single-Core Score / Multi-Core Score as listed in the linked results)
34650 to 61072 in a generation is no joke, especially coming from a far smaller, much lower-power part.
Before the release and subsequent independent testing the trust in any exceptional results should be very low.
I mean, no one should lose their minds over it right now or anything, but it seems impressive. I certainly don't see an upside to giving bogus stats right now.
An Epyc 7501 (32c/64t) apparently only gets 17k multicore score on geekbench under windows: https://browser.geekbench.com/processors/2141
Which is hilariously wrong. And if you think that's some quirk of Epyc, well, same CPU gets 65k when run under Linux: https://browser.geekbench.com/v4/cpu/10782563 So clearly there's a software issue in play. Maybe this is related to the new Windows scheduler change. Maybe geekbench just has some pathologically bad behavior. Who knows.
So yes we should wait for release & independent testing before getting too excited, even if that's just so we get numbers from something other than geekbench.
This Ryzen 9 3950X scores so high because the memory is heavily overclocked by +29%, see my other post in this thread.
Looks like a couple hit 70k+ at 3.00 GHz base.
It depends on the work. So as always benchmark suites are to be taken with a grain of salt. More specific benchmarks, such as compiling a standard set of real software packages, can give a clearer picture of performance for those more specific use cases.
Until we see more specific data on how these chips perform for certain tasks, this is just FUD.
* Ryzen has a longer branch prediction history than Intel's processors.
* This will give it an advantage on repetitive executions.
* It's a challenge to robustly measure tasks since using repeated executions to gain confidence intervals can interfere with the measurement itself.
What's not clear is to what extent real-world tasks are repetitive enough to benefit or random enough to be negatively impacted. It's likely a mix of both.
By no means am I attempting to spread FUD — I find it quite interesting and wanted to spark a bit of discussion on it.
Is there a good place to go for this? I've tried to find software development focused benchmarks before, but I've come up mostly empty.
For a more specific example, Linux kernel compilation benchmarks: https://openbenchmarking.org/showdown/pts/build-linux-kernel
Funny, but by making the name of the platform a blank, this applies just as much to 2005 as it does to 2019.
The new Xbox will feature a custom Ryzen of some form. Who’s next, Apple?
Given that it's AMD, shouldn't that be "it's fabless"?
I've got a mix of Intel and AMD, and have had no loyalty back to when I replaced my Pentium 75 with a pre-unlocked AMD Duron from OcUK.
I'm so glad to see AMD not only raise its game exponentially, but also force Intel to compete. It's good for everyone.
My next purchase will probably be a Ryzen 5 2600, because the price drop ahead of the 3xxx has made them ridiculous value for money.
Definitely a good time to be a PC gamer.
Slightly frustrating that the integrated graphics 3x00G chips are basically Ryzen 2xxx chips though. I hope the g-range gets a refresh with proper Zen 2-based chips shortly.
WRT "who next", did you see the Chinese AMD custom Ryzen+Vega APU console last year, the Subor Z-Plus, with 8GB GDDR5 as shared system and graphics memory?
It's "fabless fab!"
There are modified Darwin kernels that allow Hackintosh to work on AMD processors. These kernels have some stability issues, but if hobbyist outsiders can get most of the way, I don't foresee it being a big hurdle for actual Apple engineers.
As I see it, as long as Apple is putting out x86 hardware, there’s no reason why it can’t be AMD x86 hardware.
(I’m also secretly hoping the ARM thing won’t actually happen, but that’s neither here nor there, and I’m probably wrong.)
I also have major concerns about raw performance at the high end, and I suspect ARM would come with even more software lockdown, although there's no reason that has to be the case.
I subscribe to the theory that the Air will move to ARM at some point. Adding this feature to Xcode sounds like the sort of thing you would do to prepare the way for an architecture shift, especially if you were still on the fence about that shift. Let's just get a feel for how viable this space is before committing to anything.
After dropping Gen 7's PPC, both the PS4 and the XB1 were customised AMD APUs, and there's no great architectural rival.
> The Xbox had a customized Coppermine Pentium-III era processor from Intel.
The original Xbox was a PC in a box; the CPU was not a customised part.
And yes, the CPU was customized:
At the very least, half of the cache is disabled. They cherry-picked a feature from the Pentium III lineup that they wanted to keep while lowering the cache to Celeron levels. It's a deliberate modification to reduce cost while maintaining desired performance.
It's not detectably customized beyond that but it's not like it's a SKU you can buy off the shelf, either.
I'm not sure what they're waiting for exactly.
Secondly Apple might be waiting for their own chips to reach a point where they can be used in their laptops/desktops and jump on to that. It would be overkill to use ryzen as an interim.
I'm hoping someone eventually just does the needful and sticks a Thunderbolt controller on a PCIe 4.0 graphics card and makes it work somehow.
The numbers provided by AMD are supposedly benched before 1903 Windows scheduler updates (for CCX aware process threading, much faster clock ramping, etc) and without the latest Intel security mitigations, so it's possible that real world numbers might be even better: https://www.anandtech.com/show/14525/amd-zen-2-microarchitec...
Besides the massive L3 cache, Zen 2 now supports very fast RAM overclocking on par with Intel platforms (DDR4-3600 OOTB, air-cooled 4200+, and 5K+ on high-end motherboards - a huge improvement considering how finicky Zen, and even Zen+, was) and also a huge FPU bump (including single-cycle AVX2), but I think for full details we'll again be waiting either for July or later, for AMD's Hot Chips presentation.
Every workload will be different, but considering AMD's node, efficiency, and security advantages, I wouldn't take it for granted anymore that Intel will have a lead even for single-core perf (especially once thermals come into play).
Because the software doesn't do it (much; I've been told some applications do time-delayed mixing for stuff like delay) and the software is entrenched.
Then other CPUs would be free to start the next chunk of samples. The amount of parallelism is going to depend on the buffer size and number of samples each plugin needs to operate.
For example, if each plugin includes any kind of LUT, you don't have data locality either way, and you're much better off passing data between the plugins. If the plugins are complex, you'll be flushing your instruction cache, which will have to be refilled via random access as opposed to the linear reading of an audio segment.
Further, 192kHz 24-bit audio is only about 0.5 megabytes per second. Skylake lists sustained L3 bandwidth as 18 bytes/cycle. That is enough to transfer 100k such audio streams simultaneously. It's very unlikely this is a bottleneck.
Also instructions shouldn't be huge, but more importantly they don't change. If the audio buffer stays on the same CPU, it doesn't change either.
Don't forget that writing takes time too. Writing can be a big bottleneck. Keep the data local to the same CPU and it doesn't have to go out to main memory yet.
Other things you are saying about 'flushing' the instruction cache, L3 bandwidth numbers and theoretical LUT that make a difference in one scenario and not the other without measuring (even though the whole scenario is made up) just seem like stabs in the dark to argue about vague what-ifs.
OK, so we're left with a single core running a thousand plugins, and instruction cache pressure is a 'stab in the dark to argue about vague what-ifs'?
You take an absolutist view on what is so obviously a complicated trade off and talk down to me to boot. Maybe I know about high performance code, maybe I don't, maybe you do, maybe you don't. But I do know enough about talking to people on the internet to know to nip this conversation in the bud.
The latency is mostly about initial cache misses. There is no reason to take the time to write out a buffer of samples to memory, only to have another CPU access them with a cache miss. One of the many things you are missing here is prefetching. Instructions will be heavily prefetched, as will samples when accessed in any sort of linear fashion.
Also you can't explicitly use caches or send data between them; that is up to the CPU, and it will use the whole cache hierarchy.
> You take an absolutist view
Everything dealing with performance needs to be measured, but I have a good idea of how things work so I know what to prioritize and try first. Architecture is really the key to these things and in my replies I've illustrated why.
> Maybe I know about high performance code, maybe I don't
It sounds like you have read enough, but haven't necessarily gone through lots of optimizations and rectified what you know with the results of profiling. Understanding modern CPUs is good for understanding why results happen, but less so for estimating exactly what the results will be when going in blind.
> maybe you do, maybe you don't
I've got a decent handle on it at this point.
Your experience led to overconfidence and you identified a ridiculous bottleneck for the problem domain. This is complicated and FPU heavy code running on few pieces of tiny data. And yes, riddled with LUTs. The latency cost you're worried about is in the noise.
Instead of doing some back of the envelope calculations and realizing your mistake, you double down, handwave and smugly attack me.
Your conclusions are bullshit, as is your evaluation of my experience. For anyone else that happens to be reading, I suggest taking a look through the source of a few plugins and judging for yourself.
That being said the LUTs would follow the same pattern as execution - all threads would use them and if they are a part of the executable they don't change. This combined with prefetching and out of order instructions means that their latency is likely to be hidden by the cache.
New data coming through, however, would be transformed, creating more new data. While the instructions and LUTs aren't changing, the new data created by each transformation can either be kept local to the same core, avoiding those write-back penalties and cache misses, or incur them by allocating new memory, writing to it, and eventually getting it to another CPU.
If the same CPU is working on the same memory buffer there is no need to try to allocate them for every filter or manage lifetimes and ownership of various buffers.
1) It's very common for the processing of samples to not be independent, but have iterative state; for example delay effects, amplifiers, noise gates...
2) The work done per sample is substantial with nested loops, trig functions and hard to vectorize patterns
So not only does your technique break the model of the problem domain, the L3 latency you're so worried about when retrieving a block of samples is comparable to a single call to sin, which in some cases we're doing multiple times per sample.
Now you conflate passing data between threads with memory allocation, as though SPSC ring buffers aren't a trivial building block. This is after lecturing me on my many "misunderstandings"... if you're willing to assume I'm advocating malloc in the critical path (!?), no wonder you're finding so many.
I'm not upset, I'm just being blunt. Ditch the cockiness, or at least reserve it for when your arguments are bulletproof.
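For reference, the kind of SPSC ring buffer I mean is roughly this (a minimal, fixed-capacity sketch: nothing is allocated on the hot path and each side only writes its own index):

    #include <array>
    #include <atomic>
    #include <cstddef>
    #include <optional>

    // Minimal single-producer/single-consumer ring buffer: one thread
    // calls push(), one thread calls pop(); no locks, no allocation.
    template <typename T, std::size_t Capacity>
    class SpscRing {
    public:
        bool push(const T& item) {   // producer thread only
            std::size_t head = head_.load(std::memory_order_relaxed);
            std::size_t next = (head + 1) % Capacity;
            if (next == tail_.load(std::memory_order_acquire))
                return false;        // full; caller decides what to do
            buf_[head] = item;
            head_.store(next, std::memory_order_release);
            return true;
        }

        std::optional<T> pop() {     // consumer thread only
            std::size_t tail = tail_.load(std::memory_order_relaxed);
            if (tail == head_.load(std::memory_order_acquire))
                return std::nullopt; // empty
            T item = buf_[tail];
            tail_.store((tail + 1) % Capacity, std::memory_order_release);
            return item;
        }

    private:
        std::array<T, Capacity> buf_{};
        std::atomic<std::size_t> head_{0};
        std::atomic<std::size_t> tail_{0};
    };

In a plugin chain, T would just be a pointer to (or small handle for) a preallocated sample buffer, so handing a block to the next stage is an index bump, not a malloc.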
I'm not sure where this is coming from. If one cpu is generating new data and another CPU is picking it up, it's wasting locality. If lots of new data is generated it might get to other CPUs though shared cache or memory, but either way it isn't necessary.
Data accessed linearly is prefetched and latency is eventually hidden. This, combined with the fact that instructions aren't changing and are usually tiny in comparison, is why instruction locality is not the primary problem to solve.
The difference it makes is up to measurement, but trying to pin one filter per core is a simplistic and naive answer. It implies that concurrency depends on how many different transformations exist, when the reality is that the number of cores that can be utilized will come down to the number of groups of data that can be dealt with without dependencies.
> SPSC ring buffers
That's a form of memory allocation. When you fabricate something to argue against, that's called a straw man fallacy.
In any case, we're clearly not going to find common ground here.
The data rates for real-time audio are so much smaller than modern memory system capabilities that we can almost ignore them. A 192 kHz, 24-bit, 6-channel audio program is less than 3 MB/s, thousands of times slower than a modern workstation CPU and memory system can muster.
The stack of audio filters you describe are a natural fit for pipelined software architectures, and such architectures are trivially mapped to pipelined parallel processing models. Whatever buffer granularity one might make in a single-threaded, synchronous audio API to relay data through a sequence of filter functions can be distributed into an asynchronous pipeline, with workers on separate cores looping over a stream of input sample buffers. It just takes an SMP-style queue abstraction to handle the buffer relay between the workers, while each can invoke a typical synchronous function. Also, because these sorts of filters usually have a very consistent cost regardless of the input signal, they could be benchmarked on a given machine to plan an efficient allocation of pipeline stages to CPU cores (or to predict that the pipeline is too expensive for the given machine).
Finally, audio was a domain motivating DSPs and SIMD processing long before graphics. An awful lot of audio effects ought to be easily written for a high performance SIMD processing platform, just like custom shaders in a modern video game are mapped to GPUs by the graphics driver.
The biggest issue is that we're using plugins written by third parties to a few common standards. Even when the plugins themselves are not trying to make use of a multicore environment, you still get compatibility bugs and various taxes on re-encoding input and output streams to the desired bit depth and sample rate. It can really throw a wrench into optimizing at the DAW level because you can't just go in and fix the plugins to do the right thing.
Then add in the widely varying quality of the plugin developers, from "has hand-tuned efficient inner loops for different instruction set capabilities" to "left in denormal number processing, so the CPU dies when the signal gets quiet." Occasionally someone tries to do a GPU-based setup, only to be disappointed by memory latency becoming the bottleneck on overall latency (needless to say, latency is really prioritized over throughput in real-time audio).
Finally, the skillsets of the developers tend to be math-heavy in the first place: the product they're making is often something like a very accurate simulation of an analog oscillator or filter model, which takes tons of iterations per sample. Or something that is flinging around FFTs for an effect like autotune. They are giving the market what it wants, which is something that is slightly higher quality and probably dozens or hundreds of times more resource-hungry to process one channel.
If all you're doing is mixing and simple digital filters, you're in a great place: you can probably do hundreds of those. But we've managed to invent our way into new bottlenecks. And at the base of it, it's really that the tooling is wrong and we do need a DSP-centric environment like you suggest. (SOUL is a good candidate for going in this direction.)
For N stages, instead of having each filter run at 1/N duty cycle, waiting for their turn to run, they can all remain mostly active. As soon as they are done with one buffer, the next one from the previous pipeline stage is likely to be waiting for them. This can actually lower total latency and avoid dropouts because the next buffer can begin processing in the first stage as soon as the previous buffer has been released to the second stage.
Whatever you can calculate sequentially, like:
    buf0 = input.recv()
    buf1 = filter1(buf0)
    buf2 = filter2(buf1)
    buf3 = filter3(buf2)
can instead be run as a pipeline of workers. Each worker is dedicated to running a specific filter function, so its internal state remains local to that one worker. Only the intermediate sample buffers get relayed between the workers, usually via a low-latency asynchronous queue or similar data structure. If a particular filter function is a little slow, the next stage will simply block on its input receive step until the slow stage can perform the send.
(Edited to try to fix pseudo code block)
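To make the shape concrete, here's a rough C++ sketch of that pipeline (using an ordinary blocking queue purely for clarity; a real-time version would use preallocated buffers and lock-free SPSC queues instead):

    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    using Buffer = std::vector<float>;   // one block of samples

    // Simple blocking channel between two pipeline stages.
    class Channel {
    public:
        void send(Buffer b) {
            { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(b)); }
            cv_.notify_one();
        }
        Buffer recv() {
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [&] { return !q_.empty(); });
            Buffer b = std::move(q_.front());
            q_.pop();
            return b;
        }
    private:
        std::mutex m_;
        std::condition_variable cv_;
        std::queue<Buffer> q_;
    };

    // One worker per filter stage: receive a buffer, process it in place,
    // pass it on. The filter's internal state never leaves this thread.
    std::thread stage(Channel& in, Channel& out,
                      std::function<void(Buffer&)> filter) {
        return std::thread([&in, &out, filter = std::move(filter)] {
            for (;;) {
                Buffer b = in.recv();
                if (b.empty()) { out.send({}); return; }  // empty = shutdown
                filter(b);
                out.send(std::move(b));
            }
        });
    }

    // Wiring it up (filter1..filter3 stand in for whatever the plugins do):
    //   Channel c0, c1, c2, c3;
    //   auto t1 = stage(c0, c1, filter1);
    //   auto t2 = stage(c1, c2, filter2);
    //   auto t3 = stage(c2, c3, filter3);
    //   // the audio input callback feeds c0; the output drains c3

Same data flow as the sequential version above, but stage N can be working on buffer k while stage N+1 is still finishing buffer k-1.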
A completely sequential process would have a full end-to-end pipeline delay between each audio frame. The first stage cannot start processing a frame until the last stage has finished processing the previous frame. In a real-time system, this turns into a severe throughput limit, as you start to have input/output overflow/underflow. The pipeline throughput is the reciprocal of the end-to-end frame delay.
But, concurrent execution of the pipeline on multiple CPU cores means that you can have many frames in flight at once. The total end-to-end delay is still the sum of the per-stage delays, but the inter-frame delay can be minimized. As soon as a stage has completed one frame, it can start work on the next in the sequence. In such a pipeline, the throughput is the reciprocal of the inter-frame delay for the slowest stage rather than of the total end-to-end delay. The real-time system can scale the number of pipeline stages with the number of CPU cores without encountering input/output overflow/underflow.
Because frame drops were mentioned early on in this discussion, I (and probably others who responded) assumed we were talking about this pipeline throughput issue. But, if your real-time application requires feedback of the results back into a live process, i.e. mixing the audio stream back into the listening environment for performers or audience, then I understand you also have a concern about end-to-end latency and not just buffer throughput.
One approach is to reduce the frame size, so that each frame processes more quickly at each stage. Practically speaking, each frame will be a little less efficient as there is more control-flow overhead to dispatch it. But, you can exploit the concurrent pipeline execution to absorb this added overhead. The smaller frames will get through the pipeline quickly, and the total pipeline throughput will still be high. Of course, there will be some practical limit to how small a frame gets before you no longer see an improvement.
Things like SIMD optimization are also a good way to increase the speed of an individual stage. Many signal-processing algorithms can use vectorized math for a frame of sequential samples, to increase the number of samples processed per cycle and to optimize the memory access patterns too. These modern cores keep increasing their SIMD widths and effective ops/cycle even when their regular clock rate isn't much higher. This is a lot of power left on the table if you do not write SIMD code.
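As a toy illustration of the per-frame SIMD point (assuming AVX, 32-byte-aligned buffers, and a frame length that's a multiple of 8; a real kernel would handle the tail and pick the ISA at runtime):

    #include <immintrin.h>
    #include <cstddef>

    // Apply a gain to one frame of samples, 8 floats per AVX iteration.
    void apply_gain_avx(float* samples, std::size_t n, float gain) {
        const __m256 g = _mm256_set1_ps(gain);
        for (std::size_t i = 0; i < n; i += 8) {
            __m256 v = _mm256_load_ps(samples + i);
            _mm256_store_ps(samples + i, _mm256_mul_ps(v, g));
        }
    }

A gain stage is the simplest possible case; effects with feedback (biquads, delays) can't be vectorized across time like this, but they can often be vectorized across channels or voices instead.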
And, as others have mentioned in the discussion, if your filters do not involve cross-channel effects, you can parallelize the pipelines for different channels. This also reduces the size of each frame and hence its processing cost, so the end-to-end delay drops while the throughput remains high with different channels being processed in truly parallel fashion.
Even a GPU-based solution could help. What is needed here is a software architecture where you run the entire pipeline on the GPU to take advantage of the very high speed RAM and cache zones within the GPU. You only transfer input from host to GPU and final results back from GPU to host. You will use only a very small subset of the GPU's processing units, compared to a graphics workload, but you can benefit from very fast buffers for managing filter state as well as the same kind of SIMD primitives to rip through a frame of samples. I realize that this would be difficult for a multi-vendor product with third-party plugins, etc.
If it is possible to process in small sample blocks (of length T), with roughly correspondingly small processing time (X), there shouldn't be a problem keeping the latency small with pipelining. If filters depend on future data (lookahead), it is plausible that reducing T might not be possible. Otherwise, it should be mostly a problem of weak software design and lots of legacy software and platforms.
This precludes parallel processing of individual packets, but does not prevent concurrent processing of packets.
Plugin A accepts a packet, processes it, outputs it. Plugin B accepts a packet from A, processes it, outputs it. Plugin C accepts a packet from B, processes it, outputs it. [...] Plugin G accepts a packet from F, processes it, outputs it.
Everything is serial so far. Got it. Here's the thing though: Plugin A processes packet n, Plugin B processes packet n-1, Plugin C processes packet n-2, [...] Plugin G processes packet n-6. Now you have 7 independent threads processing 7 independent data packets. As long as the queues between plugins are suitably small you won't introduce latency.
The mental model here should be familiar to anyone in the music industry; each pedal between the instrument and the amp is a plugin, each wire is a queue. Each pedal processes its data concurrently (but not parallel with) with every other pedal.
It's relatively common in game development for AI/physics to generate the data for frame n while graphics displays frame n-1 (there's a natural, fairly hard sequential barrier separating physics from graphics, and there's a hard sequential barrier when the frame is finally shipped off to the GPU), especially on consoles that have 8-core CPUs where each core is really slow. The PS4/Xbox One use AMD's Jaguar cores, the low-power line of that era. The single-core performance of these CPUs is absolutely atrocious, but the devs make it work for latency-sensitive activities like gaming.
> Data travelling from one core to another could mean additional performance loss.
Only if it is evicted from the L3 cache, and the 3950X has 64MB of it. That's over a second(!!) of audio at 16 channels × 192kHz × 32 bits/sample.
Speaking of channels, that seems like a natural opportunity for parallelism.
I get that legacy code is legacy code, and a framework designed to run optimally on Netburst isn't necessarily going to run optimally on Zen 2. (or any other CPU from the past decade) But this is an institutional problem, not a technical one. It sounds to me like somebody needs to bite the bullet and make some breaking changes to the framework.
The process is realtime, so you cannot receive events ahead of time. It is actually running how you describe, but you can only process so much during the length of a single buffer. The typical solution is to increase the length of the buffer, but that increases latency; or to reduce the length of the buffer, but that introduces overhead.
> Each pedal processes its data concurrently (but not parallel with) with every other pedal.
That's how it works.
> The single core performance of these CPUs are absolutely atrocious, but the devs make it work for latency sensitive activities like gaming.
I am talking about realistic simulations. You can definitely run simple models without latency, that's not a problem.
> Only if it is evicted from the L3 cache, and the 3950X has 64MB of it. That's over a second(!!) of latency at 16 channel+192kHz+32 bits/sample audio.
That's nothing. A typical chain can consist of dozens of plugins times dozens of channels.
There is no problem with a case as simple as running 16 channels with simple processing.
> Speaking of channels, that seems like a natural opportunity for parallelism.
That works pretty well. If you are able to run your single chain in realtime, you can typically run as many of them as you have available cores.
But, as another person mentioned, this benchmark wasn't run at the full boost clock for the 3950X, assuming this isn't a faked result entirely.
Please excuse my lack of experience with audio processing, but...
What you're describing about the output of one plugin being fed into the input of another is analogous to unix shell scripts piping data between processes. It actually does allow parallelization, because the first stage can be working on generating more data while the second stage is processing the data that was already generated, and the third stage is able to also be processing data that was previously generated by the second stage.
Beyond that, if you have multiple audio streams, it seems like each one would have their own instances of the plugins.
So, if you had 3 streams of audio, with 4 different plugins being applied to each stream, you would have at least 12 parallel threads of processing... assuming the software was written to take advantage of multiple cores.
If the software is literally just single threaded, there's nothing to be done but to either accept that limitation or find alternative software.
AMD claims that their benchmarks show that the 3900X is faster at Cinebench single threaded than the Intel 9900K. (https://images.anandtech.com/doci/14525/COMPUTEX_KEYNOTE_DRA...) The 3950X has a higher boost clock, so it should be even faster.
I really think you should wait until you see audio processing benchmarks before making dramatic claims like "It looks like I wouldn't be able to run my chain in realtime on this new AMD" based on a -3% difference in performance on a leaked benchmark of a processor that isn't even running at its full clock speed. How can you be so sure that a 3% difference would actually prevent you from running your "chain" in realtime? Based on the evidence available, the chip should do 9% better than the recorded result here (4.7GHz actual boost divided by 4.3GHz boost used in the benchmark), reversing the situation and making the Intel chip slower. Suddenly the Intel chip is inadequate?! No, I really don't think so. Even though Zen 2 seems like it will be better, I feel confident that even a slower chip like the 9900K would be perfectly fine for audio processing.
Conceptually yes, but technically, multimedia frameworks don’t have much in common with unix shell pipes.
Pipes don’t care about latency, their only goal is throughput. For realtime multimedia, latency matters a lot.
Processes with pipes have very simple data flow topology. In multimedia it’s normal to have wide branches, or even cycles in the data flow graph. E.g. you can connect delay effect to the output of a mixer, and connect output of the delay back into one of the inputs of the mixer.
Bytes in the pipes don’t have timestamps, multimedia buffers do, failing to maintain synchronization across the graph is unacceptable.
I’m not saying multimedia frameworks don’t use multiple cores, they do. But due to the above issues, multithreading is often more limited compared to multiple processes reading/writing pipes.
The main advantage is that you wouldn't be limited in the number of plugins you could run by the performance of a single core, since you could run each plugin on its own core, like you mentioned.
Obviously, having faster individual cores means that each plugin introduces less total latency, but the difference in single-threaded performance between Zen 2 and Intel's best is likely to be very small, and I fully expect Zen 2 to have the best single-threaded performance in certain applications.
Even though I do a lot of Docker and some rendering and Photoshop - most development tasks, Docker builds, and even most Photoshop tasks that aren't GPU-accelerated are bottlenecked on single-core performance.
Same goes for the overall zippiness of the OS. The most important thing for me is that whatever I am doing this moment is as fast as possible and single core performance still rules since most software still does not take advantage of multiple cores.
For the next home server though, I am definitely planning on a high core count AMD.
I would add, though, that all the new processors are getting so fast that the difference in single-core performance is probably not noticeable. Your main issue would be long-running single-core tasks, but long-running tasks are the ones most likely to be multithreaded anyway.
I totally agree with this. I can't stand having a resource limit on creativity when I'm making music. What's worse, is even if you get dedicated hardware (DSP chips, etc.) they are normally designed for specific software, and aren't (and likely can't be) a 'global accelerator' for all audio plugins, regardless of the developer.
My understanding is that typically the TDP is designed to fit to the base clock of the processor, and doesn't necessarily include the amount of power necessary to achieve the boost clocks.
Also, what about GPUs TDP?
Never investigated GPUs. One way to find out would be to trawl Anandtech reviews and collect TDP and measured power draw numbers, they always take measurements.
Binning? There's variation in yields, the better parts might get classified as 3950X, the lesser ones get 4 cores disabled and a 3900X branding.
This is different from Intel's TDP (Thermal Design Power), i.e. power usage when running at base frequency; in reality they run quite a lot higher.
Because Ryzen 3000 is manufactured on 7nm, it's extremely efficient, producing less waste heat for the same amount of work. Both the 3900X and 3950X are designed to produce no more than about 105 watts of heat. But of course, that doesn't say how much current they actually draw under full load. That specification is the key and is very hard to find.
When these chips are released, you will likely see reviews that measure the total system power, that is the power CPU draws plus PSU inefficiencies, VRM inefficiencies, motherboard component inefficiencies, on top of all the power ram, ssds, and everything else uses. So it will not be an accurate measurement, but it will give you an overall sense of how power hungry it really is.
AMD CPU designs have historically been very power hungry, and I expect the new ones to be no different. Looking at how their 7nm GPUs compare against RTX in power consumption leads me to believe the 3000 series will require quite a bit of juice.
In order to make optimum use of that many cores, I wish to see:
1. The most-used legacy software libraries incorporate concurrent/parallel algorithms for both CPU-bound and mixed (CPU + IO) loads.
2. Some inventive, compact and powerful heatsink designs implemented in laptop models.
In the past I thought maybe the rumored move to ARM could be the reason, but now with the new Mac Pro I doubt Apple will move to ARM except for some of its laptops.
If AMD can keep it up this time (or Intel keeps flopping) then it may very well happen down the road. Until then, the age old investor relations statement rings true: “past performance is no guarantee of future results.”
Note: I have a Ryzen 5 1600 in my gaming rig and a Ryzen 5 2600 in the wife’s, I love these chips - but I also see the reality of Apple’s ecosystem is all.
Just the same as choosing AMD would involve trade-offs in terms of a very slight loss of single threaded performance, or a higher idle power consumption, particularly in laptops.
In either case, you have good options. Neither product is completely devastatingly useless for any task, as was the case with Bulldozer, which had single threaded performance that was nearly half that of Intel's.
With the release of Zen, there was no longer a clear market leader dominating in performance of all classes, or pricing, or whatever other metric you want. That's called "competitive."
Zen 2 looks like it will be "uncontested." It will have the advantage in essentially everything, including single and multithreaded performance, gaming performance, power consumption, and price... if AMD's benchmarks are to be believed. The general sentiment is that AMD's benchmarks were actually conservative.
The benchmark leaked above in this thread is not running at the production boost clock, which would be 9% higher than the benchmark given, making it theoretically uncontested.
Obviously, we will have to wait for extensive third party benchmarking, but Zen has always been competitive, immediately and unequivocally reducing Intel to merely being competitive as well. Zen 2 has the opportunity be more.
Intel opened Thunderbolt up to non-Intel platforms a while ago, and we're already seeing motherboards that offer it for Ryzen.
I think I lost track of the thread though, because you're not necessarily the one who asked "why" about Apple.
Where AMD does compete is thread count. A higher number of slower cores did fill a few niches. Except... many software vendors charge per core (a Windows Server license is limited to 16 cores), so fewer, faster cores work out better value for most business users. Plus, power usage is a huge issue in data centres, again favouring Intel.
The biggest problem right now is virtual machines can't move (live migrate) from Intel to AMD hardware (and vice versa) without having to be restarted. So AMD is only really a viable option for new clusters, but I would think Intel is still nervous.
Zen 2 raises IPC by 15%, and raises clock speeds by a solid 10% or more. Single threaded Zen 2 performance is not even a slight concern for me.
Add 9% to the benchmark result this entire thread is about, because this engineering sample was not running at the specified boost frequency that the 3950X will have. Intel has nothing to compete against that... it should be uncontested.
On Epyc, their clock speeds were generally comparable to Intel's, and the single threaded performance was already great there, except for a few specialty processors that Intel released for servers that don't care about high core counts. Epyc 2 stands to completely annihilate any advantage Intel had left.
AMD Zen has always used less power than Intel for each unit of work done, which was one of the original surprises, so... power consumption is absolutely not favoring Intel.
I really feel like you're mentally comparing to the old Bulldozer Opteron processors, based on the concerns you listed.
Intel seems to be in a perfect storm, while AMD seems to have all their ducks lined up (architecture, Fabrication Process, clock speeds).
Still, exciting times! Intel has stagnated on quad-core enthusiast CPUs for a decade (Q6600 - 7700k), it's good to finally have some competition again.
Now that AMD is using TSMC and/or Global Foundries, not sure if still the case.
- Don't want to rely too much on a single manufacturer (they already use AMD GPUs). Always keep multiple supplies alive/well.
- Don't take away too much from Intel to not affect other components (they were in the game for LTE modems which Apple needed/needs)
- How good are integrated Intel vs AMD GPUs? Could play a role as well
Currently in fanless environments (Such as the iPad Pro) the latest CPU, A12X, outperforms Intel's fanless offerings by a good amount.
I would imagine that Apple could build similarly performing parts, if not better, using current A12 tech, and don't forget that Apple is already using TSMC's 7nm process. Additionally, Apple could make use of big.LITTLE in varying sizes to bring large power consumption advantages to Macs, along with their Neural Engine.
Or who knows, maybe they'll just wait for RISC-V to mature before making any sort of switch.
When I say leapfrog I am implying that I believe this list to be correctly ordered and that Apple will not use AMD chips but wait until they can use ARM.
Just idle speculation
AMD is clearly better now, but Apple just needs the CPUs not to suck, so Intel it is.