This is silly. A bottleneck for audio processing is a particular product's flaw,...

mntmoss · on June 12, 2019

I don't think you're wrong in a technical sense, but the human factors in a contemporary DAW environment are imposing a huge penalty on what's possible.

The biggest issue is that we're using plugins written by third parties to a few common standards. Even when the plugins themselves are not trying to make use of a multicore environment, you still get compatibility bugs and various taxes on re-encoding input and output streams to the desired bit depth and sample rate. It can really throw a wrench into optimizing at the DAW level because you can't just go in and fix the plugins to do the right thing.

Then add in the widely varying quality of the plugin developers, from "has hand-tuned efficient inner loops for different instruction set capabilities" to "left in denormal number processing, so the CPU dies when the signal gets quiet." Occasionally someone tries to do a GPU-based setup, only to be disappointed by memory latency becoming the bottleneck on overall latency(needless to say, latency is really prioritized over throughput in real-time audio).

Finally, the skillsets of the developers tend to be math-heavy in the first place: the product they're making is often something like a very accurate simulation of an analog oscillator or filter model, which takes tons of iterations per sample. Or something that is flinging around FFTs for an effect like autotune. They are giving the market what it wants, which is something that is slightly higher quality and probably dozens or hundreds of times more resource-hungry to process one channel.

If all you're doing is mixing and simple digital filters, you're in a great place: you can probably do hundreds of those. But we've managed to invent our way into new bottlenecks. And at the base of it, it's really that the tooling is wrong and we do need a DSP-centric environment like you suggest. (SOUL is a good candidate for going in this direction.)

kitchenkarma · on June 12, 2019

This is a simple fact of life and downvoting isn't going to change it. Plugin cannot start processing before it gets data from previous plugin (sure it can do some tricks like pre-computing coefficients for filters etc). How are you going to get around it? What's happening within a plugin of course can be parallelised, but other than that, the processing is inherently serial. If a computing a filter takes X time and a length of the buffer is Y you can only compute so many filters (Y/X) before it starts stuttering. You can spread that across different cores, but these filters cannot be processed at the same time, because each needs the output of the previous one.

saltcured · on June 12, 2019

Pipelining means that each stage further down the pipeline is processing an "earlier" time window than the previous stage. They don't run concurrently to speed up one buffer, but they run concurrently to sustain the throughput while having more active filters.

For N stages, instead of having each filter run at 1/N duty cycle, waiting for their turn to run, they can all remain mostly active. As soon as they are done with one buffer, the next one from the previous pipeline stage is likely to be waiting for them. This can actually lower total latency and avoid dropouts because the next buffer can begin processing in the first stage as soon as the previous buffer has been released to the second stage.

kitchenkarma · on June 12, 2019

I think this is one of the most misunderstood problem these days. Your idea could work if the process wasn't real-time. In real-time audio production scenario you cannot predict what event is going to happen so you cannot simply just process next buffer, because you won't know in advance what is needed to be processed. At the moment these pipelines are as advanced as they can be and there is simply no way around being able to process X filters in Y amount of time to work in real-time. If you think you have an idea that could work, you could solve one of the biggest problems music producers face that is not yet solved.

saltcured · on June 13, 2019

Something like a filter chain for an audio stream is truly the textbook candidate for pipelined concurrency. Conceptually, there are no events or conditional branching. Just a methodical iteration over input samples, in order, producing output samples also in order.

Whatever you can calculate sequentially like:

  while True:
     buf0 = input.recv()
     buf1 = filter1(buf0)
     buf2 = filter2(buf1)
     buf3 = filter3(buf2)
     output.send(buf3)

can instead be written as a set of concurrent worker loops.

Each worker is dedicated to running a specific filter function, so its internal state remains local to that one worker. Only the intermediate sample buffers get relayed between the workers, usually via a low-latency asynchronous queue or similar data structure. If a particular filter function is a little slow, the next stage will simply block on its input receive step until the slow stage can perform the send.

(Edited to try to fix pseudo code block)

kitchenkarma · on June 13, 2019

This is how it is typically being done. This is not a problem. Problem is that being concurrent, end to end this process is serial, so you can't process any element of this pipeline in parallel. You can run only so many of those until you run out of time to fill the buffer. I think it could be helpful for you to watch this video: https://www.youtube.com/watch?v=cN_DpYBzKso

saltcured · on June 15, 2019

Sorry for the late reply. We have to consider two kinds of latency separately.

A completely sequential process would have a full end-to-end pipeline delay between each audio frame. The first stage cannot start processing a frame until the last stage has finished processing the previous frame. In a real-time system, this turns into a severe throughput limit, as you start to have input/output overflow/underflow. The pipeline throughput is the reciprocal of the end-to-end frame delay.

But, concurrent execution of the pipeline on multiple CPU cores means that you can have many frames in flight at once. The total end-to-end delay is still the sum of the per-stage delays, but the inter-frame delay can be minimized. As soon as a stage has completed one frame, it can start work on the next in the sequence. In such a pipeline, the throughput is the reciprocal of the inter-frame delay for the slowest stage rather than of the total end-to-end delay. The real-time system can scale the number of pipeline stages with the number of CPU cores without encountering input/output overflow/underflow.

Because frame drops were mentioned early on in this discussion, I (and probably others who responded) assumed we were talking about this pipeline throughput issue. But, if your real-time application requires feedback of the results back into a live process, i.e. mixing the audio stream back into the listening environment for performers or audience, then I understand you also have a concern about end-to-end latency and not just buffer throughput.

One approach is to reduce the frame size, so that each frame processes more quickly at each stage. Practically speaking, each frame will be a little less efficient as there is more control-flow overhead to dispatch it. But, you can exploit the concurrent pipeline execution to absorb this added overhead. The smaller frames will get through the pipeline quickly, and the total pipeline throughput will still be high. Of course, there will be some practical limit to how small a frame gets before you no longer see an improvement.

Things like SIMD optimization are also a good way to increase the speed of an individual stage. Many signal-processing algorithms can use vectorized math for a frame of sequential samples, to increase the number of samples processed per cycle and to optimize the memory access patterns too. These modern cores keep increasing their SIMD widths and effective ops/cycle even when their regular clock rate isn't much higher. This is a lot of power left on the table if you do not write SIMD code.

And, as others have mentioned in the discussion, if your filters do not involve cross-channel effects, you can parallelize the pipelines for different channels. This also reduces the size of each frame and hence its processing cost, so the end-to-end delay drops while the throughput remains high with different channels being processed in truly parallel fashion.

Even a GPU-based solution could help. What is needed here is a software architecture where you run the entire pipeline on the GPU to take advantage of the very high speed RAM and cache zones within the GPU. You only transfer input from host to GPU and final results back from GPU to host. You will use only a very small subset of the GPU's processing units, compared to a graphics workload, but you can benefit from very fast buffers for managing filter state as well as the same kind of SIMD primitives to rip through a frame of samples. I realize that this would be difficult for a multi-vendor product with third-party plugins, etc.

labawi · on June 12, 2019

Assuming your samples are of duration T, and you need X CPU time to fully process a sample through all filters. Pipelining allows you to process audio with X > T, nearly X = N * T for N cores, but your latency is still going to be X.

If it is possible to process with small samples (T), with roughly correspondingly small processing time (X), there shouldn't be a problem keeping the latency small with pipelining. If filters depend on future data (lookahead), it is plausible reducing T might not be possible. Otherwise, it should be mostly a problem of weak software design and lots of legacy software and platforms.

kitchenkarma · on June 13, 2019

You cannot run the pipeline in parallel. Sure you can have a pipeline and work the buffers on separate cores, but the process is serial. If it was as simple as you think it would have been solved years ago. There are really bright heads working in this multi billion industry and they can't figure that out. Probably because that involves predicting the future.

BubRoss · on June 12, 2019

You use smaller buffers so the filters run faster and the next chunk of samples can start on another CPU as soon as they exist.