A mainline (non-RT) Linux kernel will have xruns once in a while even with large buffers (like 20 ms).
Do run cyclictest from rt-tests for a while if you don't believe this.
I use Windows on my primary computer to run Cubase, on my slave PCs to run Vienna Ensemble Pro, and a Mac to run Pro Tools for hosting picture. Many composers -- Hans Zimmer, for example -- run a very similar setup. Windows is quite common, especially in the film/game/TV scoring world.
macOS is an absolutely wonderful OS for audio work, and in a vacuum is my favorite OS (due in no small part to Core Audio/MIDI), but given the amount of power that we need and the lack of replaceable hard drives, PCs are pretty much always the better way to go nowadays if you're using a desktop. For laptops, though, I'd still nearly always recommend a MBP.
Source: I've worked in pro-audio for a long time, both independently and as an employee for one of the largest companies in the space.
Low-latency real-time audio streaming, for example, is a very common need, and the developers need to consider the same problems mentioned in the article.
I’ve anecdotally heard similar things about the video editing world: people would love to use Macs, but Apple took far too long to release the new Mac Pro (and now that it’s here, it’s out of reach for many “pro” users who don’t have unlimited budgets), so they had no choice but to move to Windows.
The whole “musicians use Apple” thing is really not true, at least not in my subjective sample.
I still remember the first time I managed to combine some of my favourite sound generator & synth apps and create a mix all on my iPad without having to export things to GarageBand/Logic, and it was a real game changer on that platform.
If anyone knows about audio development on iOS devices, this author is the 'go to' guy!
 - https://soundcloud.com/cyberferret/industrial-ants-wav
Any you’d specifically recommend?
I use Amplitube for a virtual guitar amp and it works really well with little lag. I've got a bunch of keyboard synth apps, of which my faves are: Arturia iSEM, iMS-20 and Magellan.
I know these are mainly old apps which have been around for a while, but I have hardly bought any new apps lately because these ones have served me for years.
For a sample of Animoog plus SampleTank - https://soundcloud.com/cyberferret/in-a-bad-moog
It's interesting in that it's written in Objective-C++, which in Xcode is indicated by the .mm suffix. This is for performance reasons, I assume.
I didn't know about Objective-C++, and here is a link about it. Has it been pushed into a dusty closet? I could only find an archive.org link.
I guess I fundamentally don't understand priority inversion. I've done a lot of realtime programming but have never had to confront that particular buzzword.
If my realtime-priority I/O completion routine tries to take a lock that might be held by an idle-priority GUI thread, why is that a problem? If I'm waiting on a lock, then by definition I'm not running at realtime priority while I'm doing so. Thread priorities should apply only to threads that are actually doing stuff. It would be crazy for the kernel to fail to schedule any other threads while I'm waiting for one of them to release the lock I need. My realtime thread is effectively suspended during that wait, so it can't keep any others from running.
Yes, I might have to wait for other threads and even other processes to be scheduled. Too bad. If I didn't want that to happen, I shouldn't have tried to take a lock in a realtime thread. What exactly should I have expected to happen?
This is a problem because you miss I/O deadlines. You say that “by definition” you’re not running at realtime priority, but that’s just switching around definitions, it’s not changing the problem—which is that you’re missing I/O deadlines.
> If I didn't want that to happen, I shouldn't have tried to take a lock in a realtime thread. What exactly should I have expected to happen?
Agreed that you shouldn’t try to lock in a realtime thread—which is exactly the advice of the article. But blocking until the other thread leisurely completes its work is not the only outcome. You can have the lower-priority thread inherit the priority of the highest-priority thread waiting for mutexes that it owns.
This, by the way, is one of the big reasons why mutexes on most systems aren’t just single words like they could be. This is also a very sensible way to do things. If your critical sections are short, it works well. It’s also a lot easier than doing lock-free synchronization.
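For what it's worth, here is a minimal sketch (POSIX threads, untested; the function and variable names are mine) of asking for a priority-inheriting mutex, so the kernel boosts whoever holds it while a higher-priority thread is waiting:

#include <pthread.h>

/* Sketch: create a mutex with priority inheritance enabled. On
   contention, the owner is temporarily boosted to the priority of
   the highest-priority waiter. */
static pthread_mutex_t shared_lock;

static int init_pi_mutex(void)
{
    pthread_mutexattr_t attr;
    int err;

    if ((err = pthread_mutexattr_init(&attr)) != 0)
        return err;
    /* This is the line that matters. */
    if ((err = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT)) != 0)
        return err;
    err = pthread_mutex_init(&shared_lock, &attr);
    pthread_mutexattr_destroy(&attr);
    return err;
}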
IMHO, if your system is designed such that a GUI or other low-priority thread is contending for the same resources as a realtime I/O thread, in a situation where the lock might be held for so long that the I/O thread starves, then you have made a really elementary mistake... one so fundamental that it doesn't deserve an abstract-sounding term like "priority inversion." I still feel like there's something I'm missing here.
I mean, I understand that the GUI thread in question might be heavily pre-empted by medium-priority threads, but eventually the GUI thread is going to be given enough time to finish whatever it was doing with my lock, and life will go on. Is this behavior somehow unexpected or counterintuitive?
If kernels are designed such that lower-priority threads never get a time slice until higher-priority threads block on something or explicitly sleep or yield, well, that seems like a really foolish call on the part of the kernel designer. You literally might as well have a cooperative-multitasking OS at that point.
The mistake people make is, they think "my high-priority thread has to be active every 10ms, and the low-priority thread doesn't do much work with the lock held, 1ms max, so I'm totally fine, there's 9ms of safety factor there". Priority inversion is why that logic is wrong: while the low-priority thread holds the lock, it can be preempted by any number of medium-priority threads, so that "1ms max" can stretch out arbitrarily.
I sometimes get offended when there is a fancy term for something I think is common sense. But there's nothing I can do about it.
Kernel designers solve that by temporarily boosting the priority of the lower (blocking) thread to be equal or higher than the priority of the blocked thread (https://docs.microsoft.com/en-us/windows/win32/procthread/pr...). It's called priority inheritance.
I've seen kernels that only time-slice to a lower-priority thread if all the higher-priority threads are blocked on a lock. Not ideal, but it simplifies the kernel code.
I've also seen kernels that only context switch when a thread becomes locked or when a (relatively) low resolution timer ticks.
(I'm talking non-specifically about the kernels running on video game consoles).
But you have to know, and often you don't, which is when priority inversion is a problem.
An example of a class of PI problems occurs when you take a lock on something that everyone agrees is only to be locked for a bounded time, for example, always updating a structure with several fields by taking a lock, updating the fields together, then immediately unlocking it.
Then, your high priority task uses that lock, assuming the delay when doing so is within the bounded time multiplied by the amount of contention at that priority and higher. Often the multiple is exactly 1, for a high priority RT task.
Perhaps the lock is private to a module or library, without it being known to the author of the high priority task.
For example, imagine a hash table library which internally, and briefly takes a lock when updating the table, or maybe global table statistics in some circumstances, and advertises itself as taking O(1) time for inserts and deletes. Or a tree library, etc.
Due to the wonder of modularity and encapsulation, the high priority task author reasonably presumes they can use the library without hidden gotchas. And the hash table library author neglects to mention that it briefly takes a lock sometimes, because it seems like an implementation detail that should not be public.
(An even more minimal "library" is something that simulates atomic operations on a variable (such as atomic_add), by taking a lock around each operation.)
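To make that concrete, here is a sketch of such a "library" (hypothetical code, just to show the hidden lock):

#include <pthread.h>

/* Looks like an atomic add from the caller's point of view, but there
   is a mutex hiding inside; a high-priority task calling this can end
   up blocked behind whoever else happens to hold the lock. */
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

long fake_atomic_add(long *p, long delta)
{
    pthread_mutex_lock(&counter_lock);
    *p += delta;
    long result = *p;
    pthread_mutex_unlock(&counter_lock);
    return result;
}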
It's a bit of a disaster if that "O(1)" update blocks the high priority task forever, or a very long time, because the idle task is also updating a hash table, and then the idle task gets pre-empted constantly by mid-priority tasks.
This bug could be avoided if the hash table library author documented in their API "this table is not suitable for real-time use". Or if the high priority task author avoids using libraries.
But those are extreme and very unclean, unmodular things to require - and that's why it's a problem, priority inversion.
Another good way to avoid the bug, which doesn't break encapsulation (as much) and meets ordinary expectations in a lot of code, is to have locks that implement priority inheritance (boosting) when needed. Since that fixes a large class of unexpected starvation and unexpected slowness problems, it's useful in RT kernels.
Only if your method is not found in the IMP cache. Trying to acquire a lock for every objc_msgSend would be quite slow.
> I decided not to mount an expedition into the source code because I didn’t know what to look for, or where to look for it, and I wasn’t even sure I’d even find it given that iOS and OS X are both pretty closed-off systems. So, we’ll have to take the word of those more knowledgeable than us, at least for now.
macOS's libmalloc is open source (if not updated very often): https://opensource.apple.com/source/libmalloc/libmalloc-166....
Suppose a deadline of 6 ms to produce an audio buffer. That's roughly 167 callbacks per second, or about 10,000 per minute, so if even 0.01% of the callbacks choke on a lock, we get a dropout once a minute, which is quite a lot. This means we're interested in the 99.99th percentile.
What? NO! This is only relevant when you're writing assembly. In C, you have to use atomics (or possibly volatile, I'm not sure), or this is just undefined behavior.
x86/64 has a pretty strong memory model. Bareback sharing of 64-bit variables has pretty much acquire-release semantics. Of course you should still say atomic and acq_rel when you actually mean that, but it's a question of portable code, not UB.
Volatile stops the compiler from shuffling accesses around, which can affect inter-thread observation order in some cases.
> There are many other kinds of course, including sequence point violations like "foo(i, ++i)", race conditions in multithreaded programs, violating 'restrict', divide by zero, etc.
(And no, volatile does not save you.)
These are more precisely called data races. And yes, they are UB in C/C++.
Caches may not sync between processors.
For example, on Alpha the most general way to load an arbitrary unsigned 32-bit value is:
LDAH reg, XXXX(r31) ; upper 16 bits
LDA reg, YYYY(reg) ; lower 16 bits
Now, given code like this:
x->y = 65537;
// later in the same function
x->y = (x->y == 65537 ? 65538 : 65537);
// do something later assuming x->y is either 65537 or 65538
The compiler could plausibly emit something like:
LDAH r22, 0x0001(r31) ; r22 = 0x0000 0000 0001 YYYY
LDA r22, 0x0001(r22) ; r22 = 0x0000 0000 0001 0001
; load r22 = x->y
BLBS r22, toggle_1 ; if (r22 & 1) goto toggle_1;
LDA r22, 0x0001(r22) ; r22 = (r22 & 0xffff0000) | 0x0001
BR toggle_done ; goto toggle_done;
LDA r22, 0x0002(r22) ; r22 = (r22 & 0xffff0000) | 0x0002
; do something assuming r22 is either 65537 or 65538
And so, conceivably, if another thread changed the high 16 bits of x->y between the original store and the load before the comparison, we could observe a mixed value.
Of course, you'd have to create a condition where the compiler writes r22 back to memory and then loads it again but assumes the loaded value is still >= 65536 and < 131072.
Is this contrived? Absolutely. Is it _plausible_? Maybe.
Disclaimer: I've never used an Alpha in my life. This is all inferred from Raymond Chen's really amazing series about processors that Windows NT used to support. 
However, as PeCaN has pointed out, I'm probably suffering from tunnel vision from x86 monoculture.
I believe MSVC used to put in memory barriers. Not sure if it still does. These days you are better off ignoring the volatile keyword entirely and using the properly specified atomics.
It does, but now there's a flag to disable them: /volatile:iso (https://docs.microsoft.com/en-us/cpp/build/reference/volatil...).
Either way, if targeting a modern toolchain, I would strongly recommend using atomic_store()/atomic_load() from <stdatomic.h> for exchanging data between threads when "stronger" interlocked operations (CAS, exchange, atomic arithmetic, etc.) aren't required.
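As a minimal sketch (names made up) of publishing a parameter from a control thread to an audio thread without locks:

#include <stdatomic.h>

/* The UI thread publishes a parameter; the audio thread reads it once
   per callback. No locks, no torn values. The default (seq_cst)
   ordering is the safe choice; only reach for the _explicit variants
   if profiling says you must. */
static _Atomic int gain_millibels = 0;

void ui_set_gain(int mb)        /* hypothetical: called from the UI thread */
{
    atomic_store(&gain_millibels, mb);
}

int audio_get_gain(void)        /* hypothetical: called from the audio callback */
{
    /* Worth checking atomic_is_lock_free(&gain_millibels) at startup
       if you're paranoid about the platform. */
    return atomic_load(&gain_millibels);
}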
Be very, very careful with the memory order argument when using the "explicit" versions of these functions. Apple subtly broke their IOSharedDataQueue (lock-free & wait-free circular buffer queue between kernel and user space, used for HID events, for example) in macOS 10.13.6 (fixed in 10.14.2 I think) because they used memory_order_relaxed for accessing the head and tail pointers when transitioning to the stdatomic API from OSAtomic. Presumably in a misguided attempt at "optimisation". The writer would only wake up the reader if the head/tail state before or after writing the message to the buffer indicated an empty queue. Unfortunately, due to memory access reordering in the CPU, the writer ended up sometimes seeing a state of memory that never existed on the reader thread, so the reader would wait indefinitely for the writer to wake it up after draining the queue.
The use of volatile for exchanging data between threads, as recommended elsewhere on this thread, is iffy - make sure you know exactly what volatile means to the compiler you are using and the version of the C/C++ standard you are targeting.
(I have no idea if any Alpha compilers did this in the wild.)
Well, is he technically right or not? It sounds like you think he is not right.
I am sort of amazed that audio recording hardware doesn't provide at least 100 milliseconds worth of buffering. It's all of 600 bytes of memory for a 48 kbit/s stream, so it would cost them nothing.
It really isn't rocket science, whereas here we have an entire article on keeping latency down in multithreaded code, which should tell you it's both time-consuming and requires a pretty skillful programmer. The rocket science bit starts after it's been written, when someone who doesn't have the big picture of where all the latencies are starts touching the code. The program doesn't break in an obvious way, or even deterministically. Instead, the first symptom you are likely to notice is heisenbugs occurring in production, with the frequency gradually increasing as the code ages. Tracking that down and fixing it is rocket science.
Compared to that the price of adding a buffer plus a clock in the next re-design is less than peanuts. We've been doing this stuff for decades now and gone through plenty of re-designs, which is why I'm amazed it hasn't happened.
Of course, if you're doing batch processing you hardly care about latency and the problem is trivial, except that if you hit the play button in your post-processing software you still expect instant audio output, especially for deciding cut points and whatnot.
His argument is that Objective-C uses locks internally.
But isn't it possible that those locks (as opposed to application-level locks) cause only negligible pauses, because there's no contention on them?
I guess those internal locks exist to make ObjC's runtime dynamism feasible. One would expect the typical application not to leverage that dynamism (how many apps continuously redefine methods under normal operation?).
Tip: If you suspect crap like this, adding spin delays will make it happen often enough to debug.
volatile int cnt = 0x1000;
while (cnt--) ;  /* burn a few thousand cycles to widen the race window */
Multicore superscalar processors make me nervous about my assumptions. I wouldn't put it past one of them to realize that the result is unused and nop it.
It's always good to check.
That said, I've just started working on my first audio pipeline app, and I'm having sampling rate issues making voices sound super deep.
I don't know why everybody is making this so hard: You use a lock-free statically allocated circular buffer for all communication to or from a real-time audio thread. Period. Full stop. The audio thread wins under all contention scenarios. The audio thread ONLY ships data from the circular buffer to the hardware system or vice versa--IT DOES NO OTHER TASK. Everything else talks to the circular buffers.
Nothing else works.
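For reference, the skeleton of such a single-producer/single-consumer ring buffer is small. This is a sketch, not battle-tested code; the size, the names, and the acquire/release choices here are mine:

#include <stdatomic.h>
#include <stddef.h>

#define RB_SIZE 4096                      /* power of two, chosen arbitrarily */

typedef struct {
    float          data[RB_SIZE];
    _Atomic size_t head;                  /* written only by the producer */
    _Atomic size_t tail;                  /* written only by the consumer */
} ringbuf_t;

/* Producer side: e.g. the non-real-time thread generating audio. */
static int rb_write(ringbuf_t *rb, const float *src, size_t n)
{
    size_t head = atomic_load_explicit(&rb->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&rb->tail, memory_order_acquire);
    if (RB_SIZE - (head - tail) < n)
        return 0;                         /* not enough space; caller retries later */
    for (size_t i = 0; i < n; i++)
        rb->data[(head + i) & (RB_SIZE - 1)] = src[i];
    atomic_store_explicit(&rb->head, head + n, memory_order_release);
    return 1;
}

/* Consumer side: the audio callback. It never blocks; on underrun the
   caller decides what to do (typically output silence). */
static size_t rb_read(ringbuf_t *rb, float *dst, size_t n)
{
    size_t tail = atomic_load_explicit(&rb->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&rb->head, memory_order_acquire);
    size_t avail = head - tail;
    if (n > avail)
        n = avail;
    for (size_t i = 0; i < n; i++)
        dst[i] = rb->data[(tail + i) & (RB_SIZE - 1)];
    atomic_store_explicit(&rb->tail, tail + n, memory_order_release);
    return n;
}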
The big questions are, even if you do that: 1) how do you keep the circular buffer from starving, even with your other threads working to fill it? And 2) if the circular buffer does starve, what do you do?
All you have done in this case is push the problem up one level. The interface with the hardware already works this way.
The interesting problem is actually doing something more substantial - eg. deciding which audio to play, or processing it in some way - in a way that lets you keep feeding that circular buffer on time, every time.
Actually, using a circular buffer has turned a "hard real-time" problem into a "soft real-time" problem. As long as what you want to do has an average time shorter than the time between audio packets and you have enough audio packets buffered, you can ride across the occasional schedule miss because the system was off doing something else (like GC).
For example, VoIP quite often uses a "jitter buffer" for exactly this task.
Now, you still need to keep the circular buffer fed and that may not be easy. However, it's a lot easier than being 1:1 with a hard audio thread.
> The interface with the hardware already works this way.
Most of the time when I'm interacting with low-latency audio threads, they generally don't allow me to specify the buffer semantics with very much flexibility.
As an example, consider some software I wrote recently: https://www.jefftk.com/bass-whistle It reads in audio of someone whistling, interprets it, and resynthesizes it as bass. It needs to be very low latency, or else it won't feel fluent to play. Roughly, on each audio callback it needs to:
* For every sample in the input buffer determine (a) is there whistling and if so (b) what pitch it is.
* Use those decisions to synthesize audio and populate an output buffer.
It does all of this on the audio thread, and there's absolutely no reason to move this processing elsewhere.
Code pointer: PLUG_CLASS_NAME::ProcessBlock in https://github.com/jeffkaufman/iPlug2/blob/master/Examples/B...
No. It can't invert because the circular buffer is lock free--that's the whole point. The audio thread controls the resources and if the audio thread doesn't relinquish then the other threads can't add to the data structure--the audio thread, however, never, ever sees a point of contention.
Now, the audio thread can theoretically starve because it's holding resources so long that the other threads can't transfer a buffer in. However, audio threads tend to be hard real-time, periodic, and highest priority, so that problem is of the programmer's own making.
> for many tasks it just adds overhead.
Only if your task is audio in->audio out with very little processing.
The moment you have to pull from network or disk, change in response to a UI event, or anything else which touches something non-audio, you either eat the overhead of a circular buffer data structure or you risk a glitch.
It's a really old trick; lots of Unixy APIs that return strings or objects you are not expected to free do a similar thing (function-local static variables for storing results), at the cost of not being thread-safe.
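A minimal sketch of that pattern (hypothetical function, same idea as strerror() or asctime()):

#include <stdio.h>

/* Returns a pointer into a function-local static buffer: no allocation,
   nothing for the caller to free, but the next call overwrites the
   result and it is not thread-safe. */
const char *format_duration(unsigned seconds)
{
    static char buf[32];
    snprintf(buf, sizeof buf, "%u:%02u", seconds / 60, seconds % 60);
    return buf;
}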
Or if you want to start from a lower level:
Teensy 4.0 is amazingly fast. Teensy 3.6 will suffice. Nucleo STM32H743ZI2 is great, too. Nucleo STM32L476RG is not as fast, but good enough for single-voice synthesis or basic guitar effects. STM32 Discovery still holds up well.
Having this allows for straightforward code, not always relying on ARM intrinsics, which I personally find much more enjoyable to write, and especially so to read.
The Teensy audio library is already fully featured, well developed, and the community is active. Now with the Teensy 4.0, I expect that to only improve.
Instead of having "threads" you have the main program loop and the interrupt that sends audio (mostly through I2S, interrupt triggered when the I2S buffer is empty). When the interrupt hits you better have a buffer ready to be sent, otherwise it will click. Of course, the interrupt handler has to do just that (set a DMA transfer to send audio).
All that in sync with: capturing audio, reading from SD or storage, mixing, changing volume, applying effects, etc. Together with whatever the MCU has to do like reading buttons, communicating with other devices and dealing with a bunch of other interrupts. It's not always deterministic.
If the MCU has to do a lot of audio processing in-between audio interrupts, a big buffer (not always available) can be the solution at the cost of latency. If a smaller buffer is used instead, then the MCU executes the audio interrupt very frequently and the main program barely executes...
... lots of headaches.
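In pseudo-C, the basic shape is roughly this. It's only a sketch with made-up names (i2s_tx_complete_irq, i2s_start_dma, render_audio_block, poll_buttons); real code would usually hand the DMA controller a buffer address rather than copy samples in the handler:

#include <stdint.h>
#include <stdbool.h>

#define BLOCK 128                      /* samples per transfer, arbitrary */

/* Hypothetical HAL / app hooks, assumed to exist elsewhere: */
void i2s_start_dma(int16_t *samples, int count);
void render_audio_block(int16_t *samples, int count);
void poll_buttons(void);

static int16_t buf[2][BLOCK];          /* ping-pong double buffer */
static volatile int next = 0;          /* buffer to refill and play next */
static volatile bool need_refill = false;

/* I2S/DMA "buffer empty" interrupt: do the bare minimum, then get out. */
void i2s_tx_complete_irq(void)
{
    i2s_start_dma(buf[next], BLOCK);   /* send the block the main loop filled */
    next ^= 1;                         /* hand the other buffer back for refilling */
    need_refill = true;
}

/* Main loop: everything slow (SD card, UI, effects) lives out here and
   must, on average, finish a block before the next interrupt fires. */
void main_loop(void)
{
    for (;;) {
        if (need_refill) {
            need_refill = false;
            render_audio_block(buf[next], BLOCK);
        }
        poll_buttons();
    }
}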
For example I reprogram my DLD a lot.
Also, consider creating your synth/audio processor on your local machine first. Tweak all the parameters. Make it perfect. Then port to hardware + UI, not the other way around. The days of fixed-point maths/DSP optimisations driving your algorithms are behind us.
It's really bad, I want a new OS where I can actually tell the scheduler what to do.
Also, check out this music composition app I'm launching soon http://ngrid.io.
Disclaimer: I'm working on a Haiku-native media editor (video and audio), and the latency is excellent.
On the other hand, I can't really expect regular OSes to implement a "real-time" mode or something like that. As much as I'd like it, I feel it would be so complex and fundamentally incompatible with the way those OSes work! (And then I also start thinking about power management, external factors affecting hardware, etc.) I hope the future proves me wrong. Given how many people work on audio and play videogames on those OSes, the current situation is kind of sad.
One approach is to use a real-time kernel underneath Linux. Bela does this with the Xenomai kernel:
If you wanted something that lots of other people could use, in principle you could make your own Linux distribution with a particular set of patches.
I think you can do better when a buffer runs dry. Instead of outputting silence, you could output the same frequency spectrum as you were right before the event. That way you will not hear any cracks or pops.
And of course you can fade out the effect when the buffer stays dry for more than a few seconds.
Obviously, you'd have to do some filtering when the audio continues because that too can introduce cracks.
Put another way: what you are proposing is looping the buffer, which is what some devices do. Portable CD players were kinda notorious for it, and it doesn't sound much better than cracks or pops. Computers also have a tendency to fall into buffer looping when the system hangs (which is likely the failure mode of Realtek codecs).
Yes that's true, I'm proposing something that uses an approximation of it.
Consider it from a different angle: the inner ear essentially performs a Fourier transform. At every moment the "instantaneous" spectrum determines which hair cells are triggered. Now what I propose is to keep triggering those same hair cells (and not any others) when the buffer runs dry. The exact way of accomplishing this is left as an exercise (though using short windows where you take an FFT could be a good approximation).
Perhaps you should undertake this exercise and let us know how it sounds :)
EDIT: In my experience with audio, when I have a bug that introduces even the slightest discontinuity (or even just a cusp) in the audio, well short of a pop to silence, I can still hear a "weirdness". Ears are pretty attuned to things that sound unnatural. I'm not confident that essentially "forging" the audio is going to sound natural.
You can even hang intentionally to generate original music!
(/s, please don't)
You could maybe make an argument that it would be useful in live music settings to prevent a bad situation from sounding even worse, and maybe you'd put it on some audio software so you can sort of still enjoy playing music on a crappy system, but really, it's best to have hardware and software that can 100% guarantee keeping up with audio processing.
Also, the article speaks of multithreaded software, where deadlines can be missed because of complicated dependencies. The end stage where you correct for missing samples can work independently of them in its own thread.
However, the way I'd do it if needed: keep the two most recent good buffers. When you need to synthesize, start running a phase vocoder based on the hop between those two. You get frozen sinusoids and some random noise for the bins that don't have one, and almost no CPU use on the happy path of no underruns.
Still, don't do it :)
It really depends on who you are asking. Some people just hate those loud cracks and pops, and would love to have something that filters them out naturally.
Sorry, but if you ever find yourself in such a situation: stop, pause, make yourself a tea, and consider how wise the thing you are doing really is.
Also, the settings you use for development do not have to be the same as those used in production.
Depending on the use-case, that may not be a problem. Not all things are latency-sensitive.
So it makes sense for certain live situations, but it wouldn't be desirable in studio recordings.
How does one detect whether it's a musical silence or a buffer underrun?