The numbers provided by AMD are supposedly benched before 1903 Windows scheduler updates (for CCX aware process threading, much faster clock ramping, etc) and without the latest Intel security mitigations, so it's possible that real world numbers might be even better: https://www.anandtech.com/show/14525/amd-zen-2-microarchitec...
Besides the massive L3 cache, Zen 2 now supports very fast RAM overclocking on par with Intel platforms (DDR4 3600 OOTB, air-cooled 4200+, and 5K+ on high-end motherboards - a huge improvement considering how finicky Zen, and even Zen+, were) and also gets a huge FPU bump (including single-cycle AVX2), but I think for full details, again, we'll be waiting either for July or later for AMD's Hot Chips presentation.
Every workload will be different, but considering AMD's node, efficiency, and security advantages, I wouldn't take it for granted anymore that Intel will have a lead even for single-core perf (especially once thermals come into play).
Because the software doesn't do it (much; I've been told some applications do time-delayed mixing for stuff like delay) and the software is entrenched.
Then other CPUs would be free to start the next chunk of samples. The amount of parallelism is going to depend on the buffer size and number of samples each plugin needs to operate.
For example, if each plugin includes any kind of LUT, you don't have data locality either way, and you're much better off passing data between the plugins. If the plugins are complex, you'll be flushing your instruction cache, which will have to be refilled via random access as opposed to the linear reading of an audio segment.
Further, 192 kHz 24-bit audio is only about 0.6 megabytes per second. Skylake lists sustained L3 bandwidth as 18 bytes/cycle. This is enough to transfer 100k such audio streams simultaneously. It's very unlikely this is a bottleneck.
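The arithmetic behind that estimate can be checked with a quick sketch. The 18 bytes/cycle figure is the one quoted above; the 4 GHz clock is an assumption, since the comment does not give one.

```python
# Back-of-the-envelope check: how many mono audio streams fit in L3 bandwidth?
# ASSUMPTION: a 4 GHz core clock; the comment above only gives bytes/cycle.

def stream_bytes_per_second(sample_rate_hz: int, bits_per_sample: int, channels: int = 1) -> float:
    """Raw PCM data rate in bytes per second."""
    return sample_rate_hz * (bits_per_sample / 8) * channels

def streams_supported(bytes_per_cycle: float, clock_hz: float, stream_rate: float) -> float:
    """Number of streams a given sustained bandwidth could move at once."""
    return bytes_per_cycle * clock_hz / stream_rate

rate = stream_bytes_per_second(192_000, 24)    # 576,000 bytes/s
streams = streams_supported(18, 4.0e9, rate)   # 125,000 streams
print(f"{rate / 1e6:.3f} MB/s per stream, ~{streams:,.0f} streams")
```

So the "100k streams" figure holds up under that clock assumption, with some room to spare.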
Also instructions shouldn't be huge, but more importantly they don't change. If the audio buffer stays on the same CPU, it doesn't change either.
Don't forget that writing takes time too. Writing can be a big bottleneck. Keep the data local to the same CPU and it doesn't have to go out to main memory yet.
Other things you are saying about 'flushing' the instruction cache, L3 bandwidth numbers and theoretical LUT that make a difference in one scenario and not the other without measuring (even though the whole scenario is made up) just seem like stabs in the dark to argue about vague what-ifs.
OK, so we're left with a single core running a thousand plugins, and instruction cache pressure is a 'stab in the dark to argue about vague what-ifs'?
You take an absolutist view on what is so obviously a complicated trade off and talk down to me to boot. Maybe I know about high performance code, maybe I don't, maybe you do, maybe you don't. But I do know enough about talking to people on the internet to know to nip this conversation in the bud.
The latency is mostly about initial cache misses. There is no reason to take the time to write out a buffer of samples to memory, only to have another CPU access them with a cache miss. One of the many things you are missing here is prefetching. Instructions will be heavily prefetched, as will samples when accessed in any sort of linear fashion.
Also, you can't explicitly use caches or send data between them; that is going to be up to the CPU, and it will use the whole cache hierarchy.
> You take an absolutist view
Everything dealing with performance needs to be measured, but I have a good idea of how things work so I know what to prioritize and try first. Architecture is really the key to these things and in my replies I've illustrated why.
> Maybe I know about high performance code, maybe I don't
It sounds like you have read enough, but haven't necessarily gone through lots of optimizations and reconciled what you know with the results of profiling. Understanding modern CPUs is good for understanding why results happen, but less so for estimating exactly what the results will be when going in blind.
> maybe you do, maybe you don't
I've got a decent handle on it at this point.
Your experience led to overconfidence and you identified a ridiculous bottleneck for the problem domain. This is complicated and FPU heavy code running on few pieces of tiny data. And yes, riddled with LUTs. The latency cost you're worried about is in the noise.
Instead of doing some back of the envelope calculations and realizing your mistake, you double down, handwave and smugly attack me.
Your conclusions are bullshit, as is your evaluation of my experience. For anyone else that happens to be reading, I suggest taking a look through the source of a few plugins and judging for yourself.
That being said the LUTs would follow the same pattern as execution - all threads would use them and if they are a part of the executable they don't change. This combined with prefetching and out of order instructions means that their latency is likely to be hidden by the cache.
New data coming through, however, would be transformed, creating more new data. While the instructions and LUTs aren't changing, the new data created by each transformation can be kept local, so it doesn't incur the write-back penalties and cache misses that come from allocating new memory, writing to it, and eventually getting it to another CPU.
If the same CPU is working on the same memory buffer there is no need to try to allocate them for every filter or manage lifetimes and ownership of various buffers.
1) It's very common for the processing of samples to not be independent, but have iterative state; for example delay effects, amplifiers, noise gates...
2) The work done per sample is substantial with nested loops, trig functions and hard to vectorize patterns
So not only does your technique break the model of the problem domain, the L3 latency you're so worried about when retrieving a block of samples is comparable to a single call to sin, which in some cases we're doing multiple times per sample.
Now you conflate passing data between threads with memory allocation, as though SPSC ring buffers aren't a trivial building block. This is after lecturing me on my many "misunderstandings"... if you're willing to assume I'm advocating malloc in the critical path (!?), no wonder you're finding so many.
I'm not upset, I'm just being blunt. Ditch the cockiness, or at least reserve it for when your arguments are bulletproof.
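For readers unfamiliar with the term used above, a single-producer/single-consumer (SPSC) ring buffer reserves its storage once and afterwards only moves two indices. The sketch below is illustrative Python, not real-time-safe code: a production version would be written in C or C++ with atomic indices, and the class and method names here are hypothetical.

```python
class SpscRing:
    """Fixed-capacity ring buffer for one producer and one consumer.

    Storage is allocated once in __init__; push/pop only move indices.
    Illustrative only: real audio code would use atomics in C/C++.
    """

    def __init__(self, capacity: int):
        # One slot is kept empty to distinguish "full" from "empty".
        self._slots = [None] * (capacity + 1)
        self._head = 0  # next slot to read (owned by the consumer)
        self._tail = 0  # next slot to write (owned by the producer)

    def push(self, item) -> bool:
        nxt = (self._tail + 1) % len(self._slots)
        if nxt == self._head:
            return False  # full: caller decides whether to drop or retry
        self._slots[self._tail] = item
        self._tail = nxt
        return True

    def pop(self):
        if self._head == self._tail:
            return None  # empty (so None itself cannot be used as an item)
        item = self._slots[self._head]
        self._slots[self._head] = None
        self._head = (self._head + 1) % len(self._slots)
        return item

ring = SpscRing(capacity=2)
assert ring.push("block 0") and ring.push("block 1")
assert not ring.push("block 2")   # full: nothing is overwritten
assert ring.pop() == "block 0"    # FIFO order
```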
I'm not sure where this is coming from. If one CPU is generating new data and another CPU is picking it up, it's wasting locality. If lots of new data is generated it might get to other CPUs through shared cache or memory, but either way it isn't necessary.
Data accessed linearly is prefetched and latency is eventually hidden. This, combined with the fact that instructions aren't changing and are usually tiny in comparison, is why instruction locality is not the primary problem to solve.
The difference it makes is up to measurement, but trying to pin one filter per core is a simplistic and naive answer. It implies that concurrency is dependent on how many different transformations exist, when the reality is that the number of cores that can be utilized will come down to the number of groups of data that can be dealt with without dependencies.
> SPSC ring buffers
That's a form of memory allocation. When you fabricate something to argue against, that's called a straw man fallacy.
In any case, we're clearly not going to find common ground here.
The data rates for real-time audio are so much smaller than modern memory system capabilities that we can almost ignore them. A 192 kHz, 24-bit, 6-channel audio program is about 3.5 MB/s, thousands of times slower than a modern workstation CPU and memory system can muster.
The stack of audio filters you describe is a natural fit for pipelined software architectures, and such architectures are trivially mapped to pipelined parallel processing models. Whatever buffer granularity one might make in a single-threaded, synchronous audio API to relay data through a sequence of filter functions can be distributed into an asynchronous pipeline, with workers on separate cores looping over a stream of input sample buffers. It just takes an SMP-style queue abstraction to handle the buffer relay between the workers, while each can invoke a typical synchronous function. Also, because these sorts of filters usually have a very consistent cost regardless of the input signal, they could be benchmarked on a given machine to plan an efficient allocation of pipeline stages to CPU cores (or to predict that the pipeline is too expensive for the given machine).
Finally, audio was a domain motivating DSPs and SIMD processing long before graphics. An awful lot of audio effects ought to be easily written for a high performance SIMD processing platform, just like custom shaders in a modern video game are mapped to GPUs by the graphics driver.
The biggest issue is that we're using plugins written by third parties to a few common standards. Even when the plugins themselves are not trying to make use of a multicore environment, you still get compatibility bugs and various taxes on re-encoding input and output streams to the desired bit depth and sample rate. It can really throw a wrench into optimizing at the DAW level because you can't just go in and fix the plugins to do the right thing.
Then add in the widely varying quality of the plugin developers, from "has hand-tuned efficient inner loops for different instruction set capabilities" to "left in denormal number processing, so the CPU dies when the signal gets quiet." Occasionally someone tries to do a GPU-based setup, only to be disappointed by memory latency becoming the bottleneck on overall latency (needless to say, latency is really prioritized over throughput in real-time audio).
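The denormal problem mentioned above shows up when a decaying signal (a reverb or filter tail, say) shrinks below the smallest normal floating-point value; on many CPUs, arithmetic on such subnormal values is much slower. A minimal sketch of the effect, and of one common remedy (snapping tiny values to zero), follows. The threshold and function names are illustrative assumptions; native plugins typically enable the CPU's flush-to-zero mode instead.

```python
import sys

SMALLEST_NORMAL = sys.float_info.min  # about 2.2e-308 for 64-bit floats

def is_subnormal(x: float) -> bool:
    """True for non-zero values below the normal floating-point range."""
    return x != 0.0 and abs(x) < SMALLEST_NORMAL

def flush_tiny(x: float, threshold: float = 1e-30) -> float:
    """Snap values far below audibility to exactly zero (hypothetical threshold)."""
    return 0.0 if abs(x) < threshold else x

def decay(state: float, steps: int, flush: bool) -> float:
    """A feedback tail: the state is halved on every step."""
    for _ in range(steps):
        state *= 0.5
        if flush:
            state = flush_tiny(state)
    return state

tail = decay(1.0, 1040, flush=False)
print(is_subnormal(tail))             # True: the tail sits in the subnormal range
print(decay(1.0, 1040, flush=True))   # 0.0: the tail was flushed long before that
```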
Finally, the skillsets of the developers tend to be math-heavy in the first place: the product they're making is often something like a very accurate simulation of an analog oscillator or filter model, which takes tons of iterations per sample. Or something that is flinging around FFTs for an effect like autotune. They are giving the market what it wants, which is something that is slightly higher quality and probably dozens or hundreds of times more resource-hungry to process one channel.
If all you're doing is mixing and simple digital filters, you're in a great place: you can probably do hundreds of those. But we've managed to invent our way into new bottlenecks. And at the base of it, it's really that the tooling is wrong and we do need a DSP-centric environment like you suggest. (SOUL is a good candidate for going in this direction.)
For N stages, instead of having each filter run at 1/N duty cycle, waiting for their turn to run, they can all remain mostly active. As soon as they are done with one buffer, the next one from the previous pipeline stage is likely to be waiting for them. This can actually lower total latency and avoid dropouts because the next buffer can begin processing in the first stage as soon as the previous buffer has been released to the second stage.
Whatever you can calculate sequentially like:
    buf0 = input.recv()
    buf1 = filter1(buf0)
    buf2 = filter2(buf1)
    buf3 = filter3(buf2)
Each worker is dedicated to running a specific filter function, so its internal state remains local to that one worker. Only the intermediate sample buffers get relayed between the workers, usually via a low-latency asynchronous queue or similar data structure. If a particular filter function is a little slow, the next stage will simply block on its input receive step until the slow stage can perform the send.
(Edited to try to fix pseudo code block)
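A minimal sketch of the worker arrangement described above, using Python threads with `queue.Queue` as the relay between stages. The three filter functions are hypothetical stand-ins, and Python threads only illustrate the structure; a real audio engine would use native threads and lock-free queues.

```python
import queue
import threading

STOP = object()  # sentinel that tells each worker to shut down

def worker(fn, inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Run one filter in a loop: receive a buffer, process it, send it on."""
    while True:
        buf = inbox.get()        # blocks until the previous stage sends
        if buf is STOP:
            outbox.put(STOP)     # propagate shutdown downstream
            return
        outbox.put(fn(buf))

# Hypothetical stand-in filters operating on lists of samples.
def filter1(buf): return [s * 0.5 for s in buf]
def filter2(buf): return [s + 1.0 for s in buf]
def filter3(buf): return [max(-1.0, min(1.0, s)) for s in buf]

def run_pipeline(buffers):
    filters = [filter1, filter2, filter3]
    # Small bounded queues keep at most a couple of buffers in flight per hop.
    queues = [queue.Queue(maxsize=2) for _ in range(len(filters) + 1)]
    threads = [
        threading.Thread(target=worker, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(filters)
    ]

    def feed():
        for buf in buffers:
            queues[0].put(buf)
        queues[0].put(STOP)

    threads.append(threading.Thread(target=feed))
    for t in threads:
        t.start()

    results = []
    while True:
        out = queues[-1].get()
        if out is STOP:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results

print(run_pipeline([[0.0, 1.0], [2.0, -4.0]]))  # [[1.0, 1.0], [1.0, -1.0]]
```

Each stage keeps its own state and only buffers cross thread boundaries. Because every hop is a FIFO with a single consumer, output order matches input order.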
A completely sequential process would have a full end-to-end pipeline delay between each audio frame. The first stage cannot start processing a frame until the last stage has finished processing the previous frame. In a real-time system, this turns into a severe throughput limit, as you start to have input/output overflow/underflow. The pipeline throughput is the reciprocal of the end-to-end frame delay.
But, concurrent execution of the pipeline on multiple CPU cores means that you can have many frames in flight at once. The total end-to-end delay is still the sum of the per-stage delays, but the inter-frame delay can be minimized. As soon as a stage has completed one frame, it can start work on the next in the sequence. In such a pipeline, the throughput is the reciprocal of the inter-frame delay for the slowest stage rather than of the total end-to-end delay. The real-time system can scale the number of pipeline stages with the number of CPU cores without encountering input/output overflow/underflow.
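The two throughput rules above can be written down directly. The per-stage delays below are made-up numbers for illustration only.

```python
def sequential_throughput(stage_delays_ms):
    """Frames per second when a frame must clear every stage before the next one starts."""
    return 1000.0 / sum(stage_delays_ms)

def pipelined_throughput(stage_delays_ms):
    """Frames per second with one core per stage: limited by the slowest stage."""
    return 1000.0 / max(stage_delays_ms)

def end_to_end_delay_ms(stage_delays_ms):
    """Latency of a single frame; pipelining does not shorten this."""
    return sum(stage_delays_ms)

delays = [1.0, 2.0, 1.0]  # hypothetical stage delays in milliseconds
print(sequential_throughput(delays))  # 250.0 frames/s
print(pipelined_throughput(delays))   # 500.0 frames/s
print(end_to_end_delay_ms(delays))    # 4.0 ms either way
```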
Because frame drops were mentioned early on in this discussion, I (and probably others who responded) assumed we were talking about this pipeline throughput issue. But, if your real-time application requires feedback of the results back into a live process, i.e. mixing the audio stream back into the listening environment for performers or audience, then I understand you also have a concern about end-to-end latency and not just buffer throughput.
One approach is to reduce the frame size, so that each frame processes more quickly at each stage. Practically speaking, each frame will be a little less efficient as there is more control-flow overhead to dispatch it. But, you can exploit the concurrent pipeline execution to absorb this added overhead. The smaller frames will get through the pipeline quickly, and the total pipeline throughput will still be high. Of course, there will be some practical limit to how small a frame gets before you no longer see an improvement.
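To make the trade-off concrete: the time a frame represents is its length divided by the sample rate, and the fixed dispatch overhead eats a larger share of that budget as frames shrink. The 48 kHz rate and the 20 µs overhead are illustrative assumptions, not measurements.

```python
def frame_duration_ms(frame_size: int, sample_rate_hz: int) -> float:
    """Audio time covered by one frame, which is also the deadline for processing it."""
    return 1000.0 * frame_size / sample_rate_hz

def overhead_fraction(frame_size: int, sample_rate_hz: int, dispatch_overhead_us: float) -> float:
    """Share of each frame's time budget consumed by fixed dispatch overhead."""
    return (dispatch_overhead_us / 1000.0) / frame_duration_ms(frame_size, sample_rate_hz)

for size in (1024, 256, 64):
    dur = frame_duration_ms(size, 48_000)
    frac = overhead_fraction(size, 48_000, dispatch_overhead_us=20.0)
    print(f"{size:5d} samples: {dur:6.2f} ms per frame, {frac:.1%} overhead")
```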
Things like SIMD optimization are also a good way to increase the speed of an individual stage. Many signal-processing algorithms can use vectorized math for a frame of sequential samples, to increase the number of samples processed per cycle and to optimize the memory access patterns too. These modern cores keep increasing their SIMD widths and effective ops/cycle even when their regular clock rate isn't much higher. This is a lot of power left on the table if you do not write SIMD code.
And, as others have mentioned in the discussion, if your filters do not involve cross-channel effects, you can parallelize the pipelines for different channels. This also reduces the size of each frame and hence its processing cost, so the end-to-end delay drops while the throughput remains high with different channels being processed in truly parallel fashion.
Even a GPU-based solution could help. What is needed here is a software architecture where you run the entire pipeline on the GPU to take advantage of the very high speed RAM and cache zones within the GPU. You only transfer input from host to GPU and final results back from GPU to host. You will use only a very small subset of the GPU's processing units, compared to a graphics workload, but you can benefit from very fast buffers for managing filter state as well as the same kind of SIMD primitives to rip through a frame of samples. I realize that this would be difficult for a multi-vendor product with third-party plugins, etc.
If it is possible to process with small samples (T), with roughly correspondingly small processing time (X), there shouldn't be a problem keeping the latency small with pipelining. If filters depend on future data (lookahead), it is plausible reducing T might not be possible. Otherwise, it should be mostly a problem of weak software design and lots of legacy software and platforms.
This precludes parallel processing of individual packets, but does not prevent concurrent processing of packets.
Plugin A accepts a packet, processes it, outputs it. Plugin B accepts a packet from A, processes it, outputs it. Plugin C accepts a packet from B, processes it, outputs it. [...] Plugin G accepts a packet from F, processes it, outputs it.
Everything is serial so far. Got it. Here's the thing though: Plugin A processes packet n, Plugin B processes packet n-1, Plugin C processes packet n-2, [...] Plugin G processes packet n-6. Now you have 7 independent threads processing 7 independent data packets. As long as the queues between plugins are suitably small you won't introduce latency.
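The staggering described above can be tabulated. In this sketch each tick is one packet period and stage k works on packet t - k; the function name is hypothetical.

```python
def pipeline_schedule(num_stages: int, num_packets: int):
    """For each tick, list the (stage, packet) pairs being processed concurrently."""
    ticks = []
    for t in range(num_packets + num_stages - 1):
        active = [(stage, t - stage) for stage in range(num_stages)
                  if 0 <= t - stage < num_packets]
        ticks.append(active)
    return ticks

names = "ABCDEFG"
for t, active in enumerate(pipeline_schedule(len(names), 10)):
    row = ", ".join(f"{names[s]}:pkt{p}" for s, p in active)
    print(f"tick {t:2d}: {row}")
```

Once the pipeline has filled (tick 6 here), all seven plugins are busy on seven different packets at the same time.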
The mental model here should be familiar to anyone in the music industry: each pedal between the instrument and the amp is a plugin, each wire is a queue. Each pedal processes its data concurrently with (but not in parallel with) every other pedal.
It's relatively common in game development for AI/physics to generate the data for frame n while graphics displays frame n-1 (there's a natural, fairly hard sequential barrier separating physics from graphics, and there's a hard sequential barrier when the frame is finally shipped off to the GPU), especially on consoles that have 8-core CPUs where each core is really slow. PS4/Xbox One use the AMD Jaguar architecture, a low-power design that succeeded Bobcat rather than a variant of the Bulldozer-family desktop cores. The single core performance of these CPUs is absolutely atrocious, but the devs make it work for latency sensitive activities like gaming.
> Data travelling from one core to another could mean additional performance loss.
Only if it is evicted from the L3 cache, and the 3950X has 64MB of it. That's over a second(!!) of latency at 16 channel+192kHz+32 bits/sample audio.
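The "over a second" figure is easy to verify, and is in fact conservative. This sketch assumes the full 64 MB (taken as 64 MiB) were available to audio data, which ignores that the 3950X's L3 is split across CCXs and shared with everything else running.

```python
def seconds_of_audio(cache_bytes: int, channels: int, sample_rate_hz: int, bits_per_sample: int) -> float:
    """How many seconds of raw PCM audio fit in a cache of the given size."""
    bytes_per_second = channels * sample_rate_hz * bits_per_sample // 8
    return cache_bytes / bytes_per_second

secs = seconds_of_audio(64 * 1024 * 1024, channels=16, sample_rate_hz=192_000, bits_per_sample=32)
print(f"{secs:.2f} s")  # 5.46 s
```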
Speaking of channels, that seems like a natural opportunity for parallelism.
I get that legacy code is legacy code, and a framework designed to run optimally on Netburst isn't necessarily going to run optimally on Zen 2. (or any other CPU from the past decade) But this is an institutional problem, not a technical one. It sounds to me like somebody needs to bite the bullet and make some breaking changes to the framework.
The process is realtime, so you cannot receive events ahead of time. It is actually running how you describe, but you can only process so much during the length of a single buffer. The typical solution is either to increase the length of the buffer, which increases latency, or to reduce the length of the buffer, which introduces overhead.
> Each pedal processes its data concurrently with (but not in parallel with) every other pedal.
That's how it works.
> The single core performance of these CPUs is absolutely atrocious, but the devs make it work for latency sensitive activities like gaming.
I am talking about realistic simulations. You can definitely run simple models without latency, that's not a problem.
> Only if it is evicted from the L3 cache, and the 3950X has 64MB of it. That's over a second(!!) of latency at 16 channel+192kHz+32 bits/sample audio.
That's nothing. A typical chain can consist of dozens of plugins times dozens of channels.
There is no problem with such a simple case as running 16 channels with simple processing.
> Speaking of channels, that seems like a natural opportunity for parallelism.
That works pretty well. If you are able to run your single chain in realtime, you can typically run as many of them as you have available cores.
But, as another person mentioned, this benchmark wasn't run at the full boost clock for the 3950X, assuming this isn't a faked result entirely.
Please excuse my lack of experience with audio processing, but...
What you're describing about the output of one plugin being fed into the input of another is analogous to unix shell scripts piping data between processes. It actually does allow parallelization, because the first stage can be working on generating more data while the second stage is processing the data that was already generated, and the third stage is able to also be processing data that was previously generated by the second stage.
Beyond that, if you have multiple audio streams, it seems like each one would have their own instances of the plugins.
So, if you had 3 streams of audio, with 4 different plugins being applied to each stream, you would have at least 12 parallel threads of processing... assuming the software was written to take advantage of multiple cores.
If the software is literally just single threaded, there's nothing to be done but to either accept that limitation or find alternative software.
AMD claims that their benchmarks show that the 3900X is faster at Cinebench single threaded than the Intel 9900K. (https://images.anandtech.com/doci/14525/COMPUTEX_KEYNOTE_DRA...) The 3950X has a higher boost clock, so it should be even faster.
I really think you should wait until you see audio processing benchmarks before making dramatic claims like "It looks like I wouldn't be able to run my chain in realtime on this new AMD" based on a -3% difference in performance on a leaked benchmark of a processor that isn't even running at the full clockspeed. How can you be so sure that a 3% difference would actually prevent you from running your "chain" in realtime? But, based on the evidence available, the chip should do 9% better than the recorded result here (4.7GHz actual boost divided by 4.3GHz boost used in the benchmark), reversing the situation and making the Intel chip slower. Suddenly the Intel chip is inadequate?! No, I really don't think so. Even though Zen 2 seems like it will be better, I feel more confident that even a slower chip like the 9900K would be perfectly fine for audio processing.
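The 9% figure comes from scaling the leaked score by the clock ratio, which assumes performance scales linearly with frequency (an optimistic simplification). The scores below are placeholders normalized to the Intel result, not real benchmark numbers.

```python
def scale_by_clock(score: float, benchmarked_ghz: float, target_ghz: float) -> float:
    """Project a score to a different clock, assuming perfectly linear scaling."""
    return score * target_ghz / benchmarked_ghz

intel = 100.0        # placeholder baseline, not a real score
amd_leaked = 97.0    # 3% behind, as in the leaked result discussed above
amd_projected = scale_by_clock(amd_leaked, 4.3, 4.7)

print(f"clock ratio: {4.7 / 4.3:.3f}")               # 1.093
print(f"projected: {amd_projected:.1f} vs {intel}")   # 106.0 vs 100.0
```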
Conceptually yes, but technically, multimedia frameworks don’t have much in common with unix shell pipes.
Pipes don’t care about latency, their only goal is throughput. For realtime multimedia, latency matters a lot.
Processes with pipes have very simple data flow topology. In multimedia it’s normal to have wide branches, or even cycles in the data flow graph. E.g. you can connect delay effect to the output of a mixer, and connect output of the delay back into one of the inputs of the mixer.
Bytes in the pipes don’t have timestamps, multimedia buffers do, failing to maintain synchronization across the graph is unacceptable.
I’m not saying multimedia frameworks don’t use multiple cores, they do. But due to the above issues, multithreading is often more limited compared to multiple processes reading/writing pipes.
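The mixer-and-delay cycle mentioned above is only computable because the delay breaks the loop: each output sample depends on an output from `delay_samples` earlier, never on itself. A minimal sketch of such a feedback path follows. The gain and delay length are arbitrary, the class name is hypothetical, and the state held in `line` is what has to persist between buffers.

```python
from collections import deque

class FeedbackDelay:
    """y[n] = x[n] + gain * y[n - delay_samples], with state kept across calls."""

    def __init__(self, delay_samples: int, gain: float):
        if delay_samples < 1:
            raise ValueError("a feedback loop needs at least one sample of delay")
        self.gain = gain
        # The delay line holds the last `delay_samples` outputs, initially silence.
        self.line = deque([0.0] * delay_samples, maxlen=delay_samples)

    def process(self, block):
        out = []
        for x in block:
            y = x + self.gain * self.line[0]  # oldest output feeds back into the mix
            self.line.append(y)               # maxlen drops the oldest entry
            out.append(y)
        return out

fx = FeedbackDelay(delay_samples=2, gain=0.5)
print(fx.process([1.0, 0.0, 0.0]))  # [1.0, 0.0, 0.5]
print(fx.process([0.0, 0.25]))      # [0.0, 0.5]
```

The second call picks up where the first left off, which is why blocks of such an effect cannot be handed to different cores independently.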
The main advantage is that you wouldn't be limited in the number of plugins you could run by the performance of a single core, since you could run each plugin on its own core, like you mentioned.
Obviously, having faster individual cores means that each plugin introduces less total latency, but the difference in single-threaded performance between Zen 2 and Intel's best is likely to be very small, and I fully expect Zen 2 to have the best single-threaded performance in certain applications.
Even though I do a lot of docker and some rendering and Photoshop - most development tasks, docker builds, and even most Photoshop tasks that aren't GPU accelerated are bottlenecked on single core performance.
Same goes for the overall zippiness of the OS. The most important thing for me is that whatever I am doing this moment is as fast as possible and single core performance still rules since most software still does not take advantage of multiple cores.
For the next home server though, I am definitely planning on a high core count AMD.
I would add, though, that all the new processors are getting so fast that the difference in single core performance is probably not noticeable. Your main issue would be long-running tasks, and those are generally the ones more likely to be multithreaded.
I totally agree with this. I can't stand having a resource limit on creativity when I'm making music. What's worse is that even if you get dedicated hardware (DSP chips, etc.), it is normally designed for specific software, and isn't (and likely can't be) a 'global accelerator' for all audio plugins, regardless of the developer.