Four common mistakes in audio development (2016) (atastypixel.com)
291 points by PascLeRasc on Sept 22, 2019 | 143 comments

Absolutely, linux pro-audio developers have been handling these situations for over a decade now with JACK.

JACK is nearly useless without the linux-rt patchset and PREEMPT_RT.

Mainline will have xruns once in a while even with large buffers (like 20ms).

Do run cyclictest from rt-tests for a while if you don't believe this.

Yeah this is also why you see things like Reaper moving to linux and Bitwig debuting on Linux as well as mac and windows. With the future of mac and windows desktop computing up in the air I suggest that linux is the future for pro audio computing rather than apple.

Name me a working professional musician who doesn't use Apple gear (assuming they use computers at all to make music).

raises hand

I use Windows on my primary computer to run Cubase, on my slave PCs to run Vienna Ensemble Pro, and a Mac to run Pro Tools for hosting picture. Many composers -- Hans Zimmer, for example -- run a very similar setup. Windows is quite common, especially in the film/game/TV scoring world.

macOS is an absolutely wonderful OS for audio work, and in a vacuum is my favorite OS (due in no small part to Core Audio/MIDI), but given the amount of power that we need and the lack of replaceable hard drives, PCs are pretty much always the better way to go nowadays if you're using a desktop. For laptops, though, I'd still nearly always recommend a MBP.

Joel Zimmerman, aka Deadmau5. You might see a macbook sitting on stage once in a while, but saying that he uses it to do his production would be no more accurate than saying he uses his watch to control the lights (he actually uses racks of nvidia GPUs)

The pro-audio software market is significantly tilted towards Windows. The split is approximately 75/25. Also, the majority of the "pro"-audio market is not working professionals. In fact, they're a vanishingly small piece of the equation. That's true for pretty much any creative pro-sumer market. Most of the money is in the long tail of people that want to be like pros, not in the tiny number of actual professionals.

Source: I've worked in pro-audio for a long time, both independently and as an employee for one of the largest companies in the space.

Pro-audio is not all pro-music.

Low-latency real-time audio streaming, for example, is a very common need, and the developers need to consider the same problems mentioned in the article.

I hang out with a bunch of electronic music producers and almost all of them use Windows because most musicians don’t have a lot of money.

This used to be fairly true but seems to be less so these days. The lack of an affordable Mac with reasonable power in the lineup for quite a while (plus the quality control issues with the new MBPs) has led quite a lot of musicians on the lower end of the earnings spectrum to move to PC, while the lack of a high-powered, expandable pro Mac has pushed some high-end users towards Windows, at least for parts of their workflow.

I’ve anecdotally heard similar things about the video editing world - people would love to use Macs, but Apple took far too long to release the new Pro (and now it’s here, it’s out of reach for many “pro” users who don’t have unlimited budgets), so they had no choice but to move to Windows.

I worked in a studio (with 8 work spaces and a Dante network) and there was not a single Apple machine. Of all the musicians I know maybe two use Apple; most use Windows.

The whole “musicians use Apple”-thing is really not true — at least in my subjective sample size.

I used to work for a well-known audio plugins company (if you're a musician, you've heard of this company) and 70% of our customers were Mac users (that was back in 2016).

Scott Hansen used to work on Windows; he's since moved to macOS[1][2].

1. https://www.instagram.com/p/B1KlPQkgfIT/ 2. https://www.instagram.com/p/BpkqAq2n3c9/

Software defined radio deals with essentially the same requirements, and occurs much more on linux.

As an Audiobus user since pretty much day 1 (and also a Loopy app user), I'd just like to shout out to the author of this article with a "Thank You" for revolutionising music creation on Apple's small devices.

I still remember the first time I managed to combine some of my favourite sound generator & synth apps and create a mix all on my iPad without having to export things to Garageband/Logic [0], and it was a real game changer on that platform.

If anyone knows about audio development on iOS devices, this author is the 'go to' guy!

[0] - https://soundcloud.com/cyberferret/industrial-ants-wav

> favourite sound generator & synth apps

Any you’d specifically recommend?

Absolute favourite is Animoog - The interface is just so perfect for a touch screen device. Of course Loopy HD for looping. SampleTank is great for quickly creating beats and loops (and so is Figure for creating a random loop to build on). Bebot is just a heck of a lot of fun.

I use Amplitube for a virtual guitar amp and it works really well with little lag. I've got a bunch of keyboard synth apps, of which my faves are: Arturia iSEM, iMS-20 and Magellan.

I know these are mainly old apps which have been around for a while, but I hardly buy new apps these days because these ones have served me for years.

For a sample of Animoog plus SampleTank - https://soundcloud.com/cyberferret/in-a-bad-moog

AudioKit Synth One is fantastic, and it's all open-source too.

Thanks for the link. To close the loop, the audio kernel documentation in AudioKit Synth One references the OP.


It's interesting that it's written in Objective-C++, which in Xcode uses the .mm file suffix. This is for performance reasons, I assume.

I didn't know about Objective-C++; here is a link about it. Has it been pushed into a dusty closet? I could only find an archive.org link.


Let’s say your GUI thread is holding a shared lock when the audio callback runs. In order for your audio callback to return the buffer on time it first needs to wait for your GUI thread to release the lock. Your GUI thread will be running with a much lower priority than the audio thread, so it could be interrupted by pretty much any other process on the system, and the callback will have to first wait for this other process, and then the GUI thread to finish and release the lock before the audio callback can finish computing the buffer — even though the audio thread may have the highest priority on the system. This is called priority inversion.

I guess I fundamentally don't understand priority inversion. I've done a lot of realtime programming but have never had to confront that particular buzzword.

If my realtime-priority I/O completion routine tries to take a lock that might be held by an idle-priority GUI thread, why is that a problem? If I'm waiting on a lock, then by definition I'm not running at realtime priority while I'm doing so. Thread priorities should apply only to threads that are actually doing stuff. It would be crazy for the kernel to fail to schedule any other threads while I'm waiting for one of them to release the lock I need. My realtime thread is effectively suspended during that wait, so it can't keep any others from running.

Yes, I might have to wait for other threads and even other processes to be scheduled. Too bad. If I didn't want that to happen, I shouldn't have tried to take a lock in a realtime thread. What exactly should I have expected to happen?

> If my realtime-priority I/O completion routine tries to take a lock that might be held by an idle-priority GUI thread, why is that a problem?

This is a problem because you miss I/O deadlines. You say that “by definition” you’re not running at realtime priority, but that’s just switching around definitions, it’s not changing the problem—which is that you’re missing I/O deadlines.

> If I didn't want that to happen, I shouldn't have tried to take a lock in a realtime thread. What exactly should I have expected to happen?

Agreed that you shouldn’t try to lock in a realtime thread—which is exactly the advice of the article. But blocking until the other thread leisurely completes its work is not the only outcome. You can have the lower-priority thread inherit the priority of the highest-priority thread waiting for mutexes that it owns.

This, by the way, is one of the big reasons why mutexes on most systems aren’t just single words like they could be. This is also a very sensible way to do things. If your critical sections are short, it works well. It’s also a lot easier than doing lock-free synchronization.

See: https://pubs.opengroup.org/onlinepubs/7908799/xsh/pthread_mu...
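On systems with POSIX realtime support, priority inheritance is something you opt into per mutex. A minimal sketch, assuming a platform that implements PTHREAD_PRIO_INHERIT (Linux does; the helper name is mine):

```c
#define _GNU_SOURCE
#include <pthread.h>

/* Create a mutex that boosts whichever thread holds it to the
   priority of the highest-priority thread waiting on it. */
int make_pi_mutex(pthread_mutex_t *m) {
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0) return rc;
    rc = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    if (rc == 0)
        rc = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}
```

With such a mutex, the GUI thread in the example would run at the audio thread's priority for as long as it holds the lock, so medium-priority threads can't pre-empt it while the audio thread waits.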

Something else might start running instead of your realtime-priority thread or GUI thread.

Yes, that's an occupational hazard when you're waiting on a lock. Someone else might do something. In fact, someone else had better do something, or you won't get your lock.

IMHO, if your system is designed such that a GUI or other low-priority thread is contending for the same resources as a realtime I/O thread, in a situation where the lock might be held for so long that the I/O thread starves, then you have made a really elementary mistake... one so fundamental that it doesn't deserve an abstract-sounding term like "priority inversion." I still feel like there's something I'm missing here.

I mean, I understand that the GUI thread in question might be heavily pre-empted by medium-priority threads, but eventually the GUI thread is going to be given enough time to finish whatever it was doing with my lock, and life will go on. Is this behavior somehow unexpected or counterintuitive?

If kernels are designed such that lower-priority threads never get a time slice until higher-priority threads block on something or explicitly sleep or yield, well, that seems like a really foolish call on the part of the kernel designer. You literally might as well have a cooperative-multitasking OS at that point.

> I mean, I understand that the GUI thread in question might be heavily pre-empted by medium-priority threads, but eventually the GUI thread is going to be given enough time to finish whatever it was doing with my lock, and life will go on. Is this behavior somehow unexpected or counterintuitive?

The mistake people make is, they think "my high-priority thread has to be active every 10ms, and the low-priority thread doesn't do much work with the lock held, 1ms max, so I'm totally fine, there's 9ms of safety factor there". Priority inversion is why that logic is wrong.

Yep, IIUC it's just an explanation for why priorities won't help you when you get in that mess.

I sometimes get offended when there is a fancy term for something I think is common sense. But there's nothing I can do about it.

It is an elementary mistake made by the programmer of the audio application. I think that's the reason the author wrote this article.

Kernel designers solve that by temporarily boosting the priority of the lower (blocking) thread to be equal or higher than the priority of the blocked thread (https://docs.microsoft.com/en-us/windows/win32/procthread/pr...). It's called priority inheritance.

If the gui thread is holding the lock that the audio thread needs then even a few ms before the gui thread gets a slice (and hands back the gui lock) can cause the audio to starve and click.

I've seen kernels that only time-slice to a lower priority thread if all the high priority threads are sat in a lock. Not ideal but simplifies the kernel code.

I've also seen kernels that only context switch when a thread becomes locked or when a (relatively) low resolution timer ticks.

(I'm talking non-specifically about the kernels running on video game consoles).

My understanding is that you are absolutely correct. That's why you shouldn't use locks on the real-time audio thread.

If you know when a lock may be held by a low-priority thread, and take that into account when coding, then it's fine for the high-priority task to use that and wait. You get a kind of priority drop, but you're expecting it so it's ok.

But you have to know, and often you don't, which is when priority inversion is a problem.

An example of a class of PI problems occurs when you take a lock on something that everyone agrees is only to be locked for a bounded time, for example, always updating a structure with several fields by taking a lock, updating the fields together, then immediately unlocking it.

Then, your high priority task uses that lock, assuming the delay when doing so is within the bounded time multiplied by the amount of contention at that priority and higher. Often the multiple is exactly 1, for a high priority RT task.

Perhaps the lock is private to a module or library, without it being known to the author of the high priority task.

For example, imagine a hash table library which internally, and briefly takes a lock when updating the table, or maybe global table statistics in some circumstances, and advertises itself as taking O(1) time for inserts and deletes. Or a tree library, etc.

Due to the wonder of modularity and encapsulation, the high priority task author reasonably presumes they can use the library without hidden gotchas. And the hash table library author neglects to mention that it briefly takes a lock sometimes, because it seems like an implementation detail that should not be public.

(An even more minimal "library" is something that simulates atomic operations on a variable (such as atomic_add), by taking a lock around each operation.)

It's a bit of a disaster if that "O(1)" update blocks the high priority task forever, or a very long time, because the idle task is also updating a hash table, and then the idle task gets pre-empted constantly by mid-priority tasks.

This bug could be avoided if the hash table library author documented in their API "this table is not suitable for real-time use". Or if the high priority task author avoids using libraries.

But those are extreme and very unclean, unmodular things to require - and that's why it's a problem, priority inversion.

Another good way to avoid the bug, which doesn't break encapsulation (as much) and meets ordinary expectations in a lot of code, is to have locks that implement priority inheritance (boosting) when needed. Since that fixes a large class of unexpected starvation and unexpected slowness problems, it's useful in RT kernels.

> Behind Objective-C’s message sending system (i.e. calling Obj-C methods) is some code that does a whole lot of stuff — including holding locks.

Only if your method is not found in the IMP cache. Trying to acquire a lock for every objc_msgSend would be quite slow.

> I decided not to mount an expedition into the source code because I didn’t know what to look for, or where to look for it, and I wasn’t even sure I’d even find it given that iOS and OS X are both pretty closed-off systems. So, we’ll have to take the word of those more knowledgeable than us, at least for now.

macOS's libmalloc is open source (if not updated very often): https://opensource.apple.com/source/libmalloc/libmalloc-166....

The post is about chasing the long tail of latency risk. A high -- even very high -- probability of not acquiring a lock is not good enough. Investigating libmalloc and discovering that somehow it had strictly bounded, suitable execution time wouldn't be good enough either, because it could plausibly change without notice.

To get an idea of the length of the tail:

Suppose a deadline of 6ms to produce an audio buffer. If 0.01% of the callbacks choke on a lock, we get a dropout once a minute, which is quite a lot. This means we're interested in the 99.99th percentile.
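Spelled out as a sanity check (the 6 ms deadline and 0.01% figure are the numbers assumed above):

```c
/* At a 6 ms buffer period there are 60 / 0.006 = 10000 callbacks per
   minute, so a 1-in-10000 (0.01%) failure rate means roughly one
   audible dropout every minute. */
double dropouts_per_minute(double buffer_period_s, double failure_rate) {
    double callbacks_per_minute = 60.0 / buffer_period_s;
    return callbacks_per_minute * failure_rate;
}
```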

> It can be helpful to know that, on all modern processors, you can safety assign a value to a int, double, float, bool, BOOL or pointer variable on one thread, and read it on a different thread without worrying about tearing, where only a portion of the value has been assigned by the time you read it.

What? NO! This is only relevant when you're writing assembly. In C, you have to use atomics (or possibly volatile, I'm not sure), or this is just undefined behavior.

I wish people would not say "undefined behavior" to mean something that looks vaguely iffy or dangerous. It has a pretty specific meaning. [1]

x86-64 has a pretty strong memory model. Bareback sharing of 64-bit variables has pretty much acquire-release semantics. Of course you should still say atomic and acq_rel when you actually mean that, but it's a question of portable code, not UB.

Volatile stops the compiler from shuffling accesses around, which can affect inter-thread observation order in some cases.

[1]: http://blog.llvm.org/2011/05/what-every-c-programmer-should-...

Race conditions are undefined behavior though. It's even included in that post you just linked as an example of undefined behavior.

> There are many other kinds of course, including sequence point violations like "foo(i, ++i)", race conditions in multithreaded programs, violating 'restrict', divide by zero, etc.

(And no, volatile does not save you.)

> Race conditions

these are more precisely called data races. And yes, they are UB in C/C++.

If you are assigning something that fits in a single register, how would you get tearing? (Probably shouldn't have double in that list, in case it's in the special FPU registers)

yeah, 64-bit types might cause problems (I found Rust's atomic module reference educative https://doc.rust-lang.org/std/sync/atomic/ (the source code doesn't do anything too special either)), but the post said "modern processors", so I guess it's rather accurate
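Rather than guessing per architecture, C11 lets you ask the implementation directly; a quick sketch:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* True if operations on an _Atomic int need no hidden lock on this
   target, i.e. they compile to ordinary, untorn machine accesses. */
bool int_is_lock_free(void) {
    _Atomic int probe = 0;
    return atomic_is_lock_free(&probe);
}
```

On x86-64 this is true for int (and usually for 64-bit types too); on some 32-bit targets an _Atomic 64-bit variable falls back to a lock.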

Compiler may just reorder or optimize out some assignments.

Caches may not sync between processors.

The compiler is smarter than you think. Suppose one thread alternates between writing 32 and 33 to a 4-byte variable, while another thread alternates between writing 1024 and 1025 to the same 4-byte variable. You might think that the only possible values are 32, 33, 1024, 1025; but then you read the variable and find a 1056 or 1057 there. Why? Because the compiler noticed that the first thread, after its initial write of 32 or 33, only modifies the least significant byte of the value; so it generated a single-byte write to overwrite only the modified part of the value.

I’d love to see a godbolt example of this optimisation, I’ve never heard of a compiler doing this sort of thing.

That one would be more of a pessimisation, I think.

There's zero benefit to just writing one byte. I'm having a hard time believing that a compiler capable of such analysis would use it for this purpose.

Conceivably there's a benefit when you can't load the value you want in one instruction (e.g. on many RISC architectures with fixed 32-bit instructions you can't have a 32-bit immediate)

for example on Alpha the most general way to load an arbitrary unsigned 32-bit value is

    LDAH reg, XXXX(r31)  ; upper 16 bits
    LDA  reg, YYYY(reg)  ; lower 16 bits
Suppose you have code that toggles some value in memory between, say, 65537 and 65538, e.g.

    x->y = 65537;
    // later in the same function
    x->y = (x->y == 65537 ? 65538 : 65537);
    // do something later assuming x->y is either 65537 or 65538
Conceivably, the compiler could emit

    LDAH r22, 0x0001(r31) ; r22 = 0x0000 0000 0001 YYYY
    LDA  r22, 0x0001(r22) ; r22 = 0x0000 0000 0001 0001
for the first load, write it to shared memory and reload it later, and then for the comparison could emit

    ; load r22 = x->y
    BLBS r22, toggle_1   ; if (r22 & 1) goto toggle_1;
    LDA  r22, -1(r22)    ; r22 was 65538, make it 65537
    BR   toggle_done     ; goto toggle_done;
    toggle_1:
    LDA  r22, 1(r22)     ; r22 was 65537, make it 65538
    toggle_done:
    ; do something assuming r22 is either 65537 or 65538
since it knows the original LDAH will be shared by either of the new values, it can save an instruction on each path.

And so, conceivably, if another thread changed the high 16 bits of x->y between the original store and the load before the comparison, we could observe a mixed value.

Of course, you'd have to create a condition where the compiler writes r22 back to memory and then loads it again but assumes the loaded value is still >= 65536 and < 131072.

Is this contrived? Absolutely. Is it _plausible_? Maybe.

Disclaimer: I've never used an Alpha in my life. This is all inferred from Raymond Chen's really amazing series about processors that Windows NT used to support. [0]

[0] https://devblogs.microsoft.com/oldnewthing/20170807-00/?p=96...

I think the point is that unless the language guarantees atomicity (and the compiler implements the guarantee correctly), counterintuitive behavior is permissible and somewhat common, and compiler behavior varies across and within CPU architectures.

Actually I agree with the general point, OP is right that you should use volatile. I was just a bit taken aback by the specific example above.

However, as PeCaN has pointed out, I'm probably suffering from tunnel vision from x86 monoculture.

For reference, volatile is still the wrong choice for thread safety. If you need your race condition semantics to be well defined, you must use atomics (do note that those atomics might compile down to simple assignments if the platform supports that, but only the compiler is allowed to do that).

Volatile is, well, volatile. You can basically only rely on the fact that the compiler respects source order and keeps all reads and writes (instead of hoisting and reordering).

I believe msvc used to put in memory barriers. Not sure if it still does. These days you are better off ignoring the entire keyword and using the properly specified atomic stuff.

> I believe msvc used to put in memory barriers. Not sure if it still does.

it does but now there's a flag to disable them : /volatile:iso (https://docs.microsoft.com/en-us/cpp/build/reference/volatil...).

For the record I think the specific example from the OP is rather silly and of course never an optimization on x86(_64), but I'd never pass up a chance to talk about more fun architectures ;-)

I've seen variable size accesses (sometimes 32-bit, sometimes bytewise) to memory locations (almost certainly field access via a dereferenced struct pointer, i.e. operator ->) when reverse engineering x86-64 code known to have been compiled with clang. However, that was very specifically for bitwise access - I don't know if the original source code used C's bitfield struct syntax or explicit masking. I also don't recall if it was just reads whose access size was inconsistent or writes too. It's almost certainly a code size optimisation when comparing to or storing immediate (compile-time constant) values.

Either way, if targeting a modern toolchain, I would strongly recommend using atomic_store()/atomic_load() from <stdatomic.h> for exchanging data between threads when "stronger" interlocked operations (CAS, exchange, atomic arithmetic, etc.) aren't required.

Be very, very careful with the memory order argument when using the "explicit" versions of these functions. Apple subtly broke their IOSharedDataQueue (lock-free & wait-free circular buffer queue between kernel and user space, used for HID events, for example) in macOS 10.13.6 (fixed in 10.14.2 I think) because they used memory_order_relaxed for accessing the head and tail pointers when transitioning to the stdatomic API from OSAtomic. Presumably in a misguided attempt at "optimisation". The writer would only wake up the reader if the head/tail state before or after writing the message to the buffer indicated an empty queue. Unfortunately, due to memory access reordering in the CPU, the writer ended up sometimes seeing a state of memory that never existed on the reader thread, so the reader would wait indefinitely for the writer to wake it up after draining the queue.

The use of volatile for exchanging data between threads, as recommended elsewhere on this thread, is iffy - make sure you know exactly what volatile means to the compiler you are using and the version of the C/C++ standard you are targeting.
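The release/acquire pairing that fix needed can be sketched in a few lines (a single flag plus payload here rather than Apple's actual queue; names are illustrative):

```c
#include <stdatomic.h>

static int payload;        /* plain, non-atomic data */
static _Atomic int ready;  /* publication flag */

/* Writer: store the data, then set the flag with release ordering so
   the payload write cannot be reordered after the flag write. */
void produce(int v) {
    payload = v;
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Reader: acquire pairs with the writer's release, so observing
   ready == 1 guarantees the payload write is visible too. With
   memory_order_relaxed on either side, that guarantee vanishes --
   which is essentially the bug described above. */
int consume(int *out) {
    if (atomic_load_explicit(&ready, memory_order_acquire)) {
        *out = payload;
        return 1;
    }
    return 0;
}
```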

Which compilers exhibit this behaviour?

It's possible, if unlikely, that a compiler targeting the DEC Alpha could exhibit something similar (but with 16-bit values instead of bytes). See my reply to reitzensteinm.

(I have no idea if any Alpha compilers did this in the wild.)

That’s pretty fascinating, thank you!

Wouldn't the undefined behavior be whatever the CPU happens to do? Or do you think some optimization would introduce tearing?

While he's technically right, you have no guarantee that the value as a whole is written unless you use the proper fences or synchronize it in some way.

“While he's technically right, you have no guarantee”

Well, is he technically right or not? It sounds like you think he is not right.

He's right on assembly level of assignment, but the compiler is free to emit assembly that will tear. Some compilers might do that for optimization purposes.

The (C or C++) compiler is free to do anything because reading and writing a non-`atomic` variable from two different threads without synchronization is undefined behavior.

He's right that the value will be moved atomically; however, the compiler is free to reorder the store/load if fences are omitted.

Wow, is all this really necessary now? Decades ago I recall programmers being immensely proud of their Z80 queue functions that didn't disable interrupts. They were needed for the same reasons the article spells out: dealing with hard real time events. But all it took to break them was another programmer leaving interrupts off a wee bit too long in another piece of code, so we replaced these things with hardware queues that grew bigger and bigger over time - even UARTs had them.

I am sort of amazed that audio recording hardware doesn't provide at least 100 milliseconds worth of buffering. It's all of 600 bytes of memory on a 48k bit/sec stream, so it would cost them nothing.

The main problem with audio is that in many cases 100ms of latency is unacceptable. Having a large queue is a fine solution when latency isn't a big problem, but in a lot of cases for audio you want to get as close to no latency as possible.

That's true in things like phone calls, but I did read the article looking for cases like that and didn't see any. It is all about post processing, where latency isn't an issue. All you need for that case is an accurate clock embedded in the stream.

It really isn't rocket science, whereas here we have an entire article on keeping latency down in multithreaded code, which should tell you it's both time consuming and requires a pretty skillful programmer. The rocket science bit starts after it's been written, when someone who doesn't have the big picture on where all the latencies are starts touching the code. The program doesn't break in an obvious way, or even deterministically. Instead the first symptom you are likely to notice is heisenbugs occurring in production, with the frequency gradually increasing as the code ages. Tracking that down and fixing it is rocket science.

Compared to that the price of adding a buffer plus a clock in the next re-design is less than peanuts. We've been doing this stuff for decades now and gone through plenty of re-designs, which is why I'm amazed it hasn't happened.

But what makes you think sound cards don't have buffers? They do, plenty of them. You can easily cause any modern OS to have multiple seconds of audio lag by cranking the buffer size. People just generally don't want that from their system, which makes all this somewhat tricky.

Of course if you're doing batch processing you hardly care about latency and the problem is trivial except for the fact that if you hit the play button in your post processing software you still expect instant audio output, especially for deciding cut points and whatnot.

> Don’t use Objective-C/Swift on the audio thread.

His argumentation is that Objective-C uses locks internally.

But isn't it possible that those locks (as opposed to application-level locks) cause just negligible pauses, because there's no contention on them?

I guess those internal locks exist to make ObjC's runtime dynamism feasible. One would expect the typical application not to leverage that dynamism (how many apps continuously redefine methods under normal operation?).

The problem with locks on a non real-time OS is that the OS may decide to interrupt your thread while it's holding the lock, no matter how short the critical section is. I do embedded work; I've had similar things happen when the 'lock' is held for just a few instructions.

Tip: If you suspect crap like this, adding spin delays will make it happen often enough to debug.

By any chance do you have any references or information on how to do this (adding a spin delay)? It sounds interesting. -thanks :)

In C it's very simple:

   volatile int cnt = 0x1000;
   while (cnt--) { /* spin */ }

I've grown so paranoid of optimizing compilers this wouldn't even cross my mind. But it is correct, right?

Not OP, but the “volatile” qualifier should prevent the compiler from optimizing it out.

Yeah should.

Multicore super scalar processors make me nervous about my assumptions. I wouldn't put it past one of them to realize that the result is unused and nop it.

It's always good to check.

I see now, thanks!

It is possible to use Swift on audio thread. You have to be careful not to use any reference counted (i.e. class) objects. If you do need to perform allocations (e.g. creating CMSampleBuffer instances), use a custom allocator.

I had someone on the Swift compiler team tell me unequivocally that they make no real time guarantees. Unless you can point to somewhere in the language documentation that the runtime is guaranteed not to ever do X, then I would not assume that you could. (me: former Apple CoreAudio engineer)

If there’s no contention, why do you think they’re there?

They could be there as a guardrail for a worst-case scenario, without them being particularly needed for the average use case.

You can use Objective C on the audio thread just fine in practice. Just don’t call out.

No, you really can't. Show me a single audio app that uses objc in the render block.

Agreed. You really must not. No sane developer would do it. Unfortunately, there are some insane developers.

Let's say you are recording and processing audio in realtime. You have one function available which runs on a separate thread and provides you with the samples (fixed number every call). As I understand this article (and similar ones), you are not supposed to malloc on that thread (as that thread hanging could result in dropped samples). But if you want to save the audio, how else are you going to save it without appending to some array (for which you need to malloc)?

One common solution is to have another thread continuously prepare a bunch of empty buffers in advance, the audio thread just uses one of them, and after that is filled another thread writes it to disk. No locking is required, and no malloccing/disk IO happens on the audio thread.

Couldn't you just spin through buffers alloc'd at startup? Why do you need another thread for buffer management? You could do something like a ring buffer to change pointers.

That said, I've just started working on my first audio pipeline app, and I'm having sampling rate issues making voices sound super deep.

> But if you want to save the audio, how else are you going to save it without appending to some array (for which you need to malloc)?

I don't know why everybody is making this so hard: You use a lock-free statically allocated circular buffer for all communication to or from a real-time audio thread. Period. Full stop. The audio thread wins under all contention scenarios. The audio thread ONLY ships data from the circular buffer to the hardware system or vice versa--IT DOES NO OTHER TASK. Everything else talks to the circular buffers.

Nothing else works.

The big questions are even if you do that--1) how do you keep the circular buffer from starving even with your other threads working to fill the circular buffer? and 2) if the circular buffer starves, what do you do?
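On question 2, the usual answer is: the audio thread takes whatever is available and fills the rest with silence (or a short fade), and never blocks waiting for more. A toy sketch of a statically allocated sample ring with that policy — names and sizes are invented for illustration:

```cpp
#include <atomic>
#include <cstddef>

// Statically allocated SPSC float ring: worker threads fill it, the
// audio callback drains it. On starvation, output silence rather than
// block -- the audio thread always wins.
constexpr std::size_t kRingSize = 4096;  // power of two
float ring[kRingSize];
std::atomic<std::size_t> wr{0}, rd{0};   // free-running indices

// Called by a worker thread; returns how many samples were queued.
std::size_t produce(const float* src, std::size_t n) {
    std::size_t w = wr.load(std::memory_order_relaxed);
    std::size_t space = kRingSize - (w - rd.load(std::memory_order_acquire));
    if (n > space) n = space;
    for (std::size_t i = 0; i < n; ++i)
        ring[(w + i) & (kRingSize - 1)] = src[i];
    wr.store(w + n, std::memory_order_release);
    return n;
}

// Audio callback: never blocks; zero-fills whatever it cannot get.
void renderCallback(float* out, std::size_t n) {
    std::size_t r = rd.load(std::memory_order_relaxed);
    std::size_t have = wr.load(std::memory_order_acquire) - r;
    std::size_t take = have < n ? have : n;
    for (std::size_t i = 0; i < take; ++i)
        out[i] = ring[(r + i) & (kRingSize - 1)];
    for (std::size_t i = take; i < n; ++i)
        out[i] = 0.0f;  // underrun: silence (a real app might fade)
    rd.store(r + take, std::memory_order_release);
}
```

Question 1 — keeping it from starving — is then a matter of buffer depth versus the worst-case stall of the producer threads, which is exactly the latency trade-off discussed downthread.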

The audio thread ONLY ships data from the circular buffer to the hardware system or vice versa--IT DOES NO OTHER TASK.

All you have done in this case is push the problem up one level. The interface with the hardware already works this way.

The interesting problem is actually doing something more substantial - eg. deciding which audio to play, or processing it in some way - in a way that lets you keep feeding that circular buffer on time, every time.

> All you have done in this case is push the problem up one level.

Actually, using a circular buffer has transferred a "hard, real time" problem into a "soft real time" problem. As long as what you want to do has an average time shorter than the time between audio packets and you have enough audio packets buffered, you can ride across the occasional schedule miss because the system was off doing something else (like GC).

For example, VoIP quite often uses a "jitter buffer" for exactly this task.

Now, you still need to keep the circular buffer fed and that may not be easy. However, it's a lot easier than being 1:1 with a hard audio thread.

> The interface with the hardware already works this way.

Most of the time when I'm interacting with low-latency audio threads, they generally don't allow me to specify the buffer semantics with very much flexibility.

This is not right; this can get you priority inversion if your audio thread is waiting for a lower priority thread, and for many tasks it just adds overhead.

As an example, consider some software I wrote recently: https://www.jefftk.com/bass-whistle It reads in audio of someone whistling, interprets it, and resynthesizes it as bass. It needs to be very low latency, or else it won't feel fluent to play. Roughly, on each audio callback it needs to:

* For every sample in the input buffer determine (a) is there whistling and if so (b) what pitch it is.

* Use those decisions to synthesize audio and populate an output buffer.

It does all of this on the audio thread, and there's absolutely no reason to move this processing elsewhere.

Code pointer: PLUG_CLASS_NAME::ProcessBlock in https://github.com/jeffkaufman/iPlug2/blob/master/Examples/B...

> This is not right; this can get you priority inversion if your audio thread is waiting for a lower priority thread

No. It can't invert because the circular buffer is lock free--that's the whole point. The audio thread controls the resources and if the audio thread doesn't relinquish then the other threads can't add to the data structure--the audio thread, however, never, ever sees a point of contention.

Now, the audio thread can theoretically starve if it holds the buffer's resources so long that the other threads can't transfer data in. But since audio threads tend to be hard real-time, periodic, and highest-priority, that problem is of the programmer's own making.

> for many tasks it just adds overhead.

Only if your task is audio in->audio out with very little processing.

The moment you have to pull from network or disk, change in response to a UI event, or anything else which touches something non-audio, you either eat the overhead of a circular buffer data structure or you risk a glitch.

Thanks this is what I went with.

You only need one result (the latest) at any given time, so you allocate room before your loop, set an atomic indicating that a calculation is in progress, then have exclusive rw access to that memory.

It’s a really old trick; lots of unixy APIs that return strings or objects you are not expected to free do a similar thing (function-local static variables for storing results), at the cost of not being thread-safe.
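One lock-free way to realize this "only the latest result matters" pattern is a triple-buffer-style mailbox built on an atomic pointer exchange. This is a sketch with invented names (`Result`, `publish`, `latest`), not a reference implementation; note that if the reader polls faster than the writer publishes, it can get back a result it has already seen, which the serial number lets it detect.

```cpp
#include <atomic>

// "Latest value" mailbox: the worker publishes a freshly computed
// result by swapping it into the shared slot; the reader swaps its own
// buffer in to take it out. Three pre-allocated buffers rotate
// ownership, so neither side ever allocates, locks, or waits.
struct Result { float data[256]; int serial = 0; };

Result slots[3];
std::atomic<Result*> mailbox{&slots[0]};  // holds the newest published result
Result* writer_buf = &slots[1];           // owned by the worker thread
Result* reader_buf = &slots[2];           // owned by the reader thread

// Worker: fill writer_buf, then publish it and recycle what came out.
void publish(int serial) {
    writer_buf->serial = serial;  // stands in for "fill the buffer"
    writer_buf = mailbox.exchange(writer_buf, std::memory_order_acq_rel);
}

// Reader: swap own buffer for whatever is newest in the mailbox.
// May return a previously seen result if nothing new was published.
Result* latest() {
    reader_buf = mailbox.exchange(reader_buf, std::memory_order_acq_rel);
    return reader_buf;
}
```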

You use a lock/wait free data structure like a ring buffer as a FIFO to send the data to a blocking thread that buffers and writes to disk.

pre-malloc the array.

I'm wondering if anyone here does embedded audio synthesis. Is a Teensy the way to go?

I highly recommend Axoloti. It has all the connectors ready to go, is easily expandable with custom code, but has all the boilerplate stuff and basic objects, so you can get creative quickly without having to write and debug an envelope generator and filter, etc.

Or if you want to start from a lower level: Teensy 4.0 is amazingly fast. Teensy 3.6 will suffice. The Nucleo STM32H743ZI2 is great, too. The Nucleo STM32L476RG is not as fast, but good enough for single-voice synthesis or basic guitar effects. STM32 Discovery still holds up well.

If you want to poke around and have fun, Bela [1] is a pretty good way to get started. It's got a prebuilt framework where you can just push a piece of C(++) that runs in a Xenomai real-time audio thread with predigested signals from all the peripherals.

[1] https://bela.io/

I second the Bela. It's been a real pleasure to work with.

Unequivocally yes. Most of the problems mentioned in the article simply do not apply to the single-threaded and (almost) deterministic nature of microcontrollers. The Teensy 4.0 clocks at 600 MHz with a true double-precision FPU. This affords a very comfortable budget for DSP, even at single-sample update rates.

Having this allows for straightforward code, not always relying on ARM intrinsics - which I personally find much more enjoyable to write, and especially so to read.

The Teensy audio library is already fully featured, well developed, and the community is active. Now with the Teensy 4.0, I expect that to only improve.

It DOES apply to a certain extent, and can produce a whole different set of headaches.

Instead of having "threads" you have the main program loop and the interrupt that sends audio (mostly through I2S, interrupt triggered when the I2S buffer is empty). When the interrupt hits you better have a buffer ready to be sent, otherwise it will click. Of course, the interrupt handler has to do just that (set a DMA transfer to send audio).

All that in sync with: capturing audio, reading from SD or storage, mixing, changing volume, applying effects, etc. Together with whatever the MCU has to do like reading buttons, communicating with other devices and dealing with a bunch of other interrupts. It's not always deterministic.

If the MCU has to do a lot of audio processing in-between audio interrupts, a big buffer (not always available) can be the solution at the cost of latency. If a smaller buffer is used instead, then the MCU executes the audio interrupt very frequently and the main program barely executes...

... lots of headaches.
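The interrupt-versus-main-loop structure described above is usually ping-pong (double) buffering: the DMA interrupt only flips which half it owns, and the main loop renders into the other half. A board-agnostic simulation of the shape — on real hardware `i2sDmaCompleteISR` would be the actual I2S/DMA IRQ handler, and the "synthesis" here is a placeholder:

```cpp
#include <cstddef>
#include <cstdint>

// Ping-pong buffering as typically done on a microcontroller: the I2S
// DMA streams one half while the main loop fills the other.
constexpr std::size_t kHalf = 128;
int16_t pingpong[2][kHalf];       // zero-initialized (static storage)
volatile int dma_half = 0;        // half currently owned by the DMA
volatile bool half_done = false;  // set by the ISR, cleared by main loop

// Interrupt handler: as short as possible -- flip halves, raise a flag.
// (Real hardware would also acknowledge/restart the DMA transfer here.)
void i2sDmaCompleteISR() {
    dma_half ^= 1;
    half_done = true;
}

// Main loop: when a half frees up, synthesize the next block into it.
// If this ever takes longer than one half-buffer of audio, you click.
void mainLoopTick() {
    if (!half_done) return;
    half_done = false;
    int16_t* free_half = pingpong[dma_half ^ 1];
    for (std::size_t i = 0; i < kHalf; ++i)
        free_half[i] = static_cast<int16_t>(i);  // placeholder synthesis
}
```

The "big buffer vs. frequent interrupts" trade-off in the parent comment is just kHalf: bigger halves mean fewer interrupts and more headroom for SD reads and button scanning, at the cost of latency.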

I started off with a Teensy3.2, it was definitely nice. I really don't enjoy all the wiring and actual physical side of it though. So now I mostly reprogram eurorack modules. Not nearly as cheap, but some modules from 4MS or Mutable Instruments (among others) are fully open source along with the bootloader and unit testing system.

For example I reprogram my DLD a lot.


If you're looking at tiny boards, then I'd check out the ESP32-based boards right now. Cheaper. Personally, I think the R-Pi Zero is quite attractive due to its lower barrier to entry.

Also, consider creating your synth/audio processor on your local machine first. Tweak all parameters. Make it perfect. Then port to hardware + UI. Not the other way around. The times of fixed-point maths/DSP optimisations driving your algorithms are behind us.

It's a little slow. I've used a Mega, but all around prefer the Adafruit Feather line.

That's what I bought the first time, but unfortunately there don't seem to be good audio libraries for the Feather line, and the Teensy Audio library looks quite sophisticated. Or did I miss something?

They should be compatible. It's all Wiring at the end of the day. But IDK, I just write my own.

Which processor, though? There are a few.

I don't know if they've made something more updated, but I usually get the Feather M0 nrf52. The Bluetooth connectivity is super cool for controlling things, and as long as you don't enable encryption, it is relatively low latency.

I see, thanks! I'm going to get a few to play with, I've seen other people recommend that microcontroller too.

Current OS's are fundamentally unsuitable for audio work. macOS < Windows < Linux < other OS. None of them really do real-time processing.

It's really bad, I want a new OS where I can actually tell the scheduler what to do.

Also, check out this music composition app I'm launching soon http://ngrid.io.

You may want to look into the "Media OS", aka BeOS and its derivative Haiku. The Media Kit API is a monster, but it was specifically designed to allow fast transfer of buffers between nodes using kernel APIs in order to achieve the best possible timing guarantees. On 90's hardware their latency was/is lower than the latency of all desktop-oriented computer systems today on 20+ year newer hardware.

Disclaimer - i'm working on a Haiku native media editor (video and audio) and the latency is excellent.

Define “lower than”. I can achieve <4ms on Windows with decent ASIO drivers.

I really admire the discipline of programmers that work seriously on real-time code... but indeed, at the end of the day, they all depend on the OS and everything else that might be running on it for the code to behave as real-time as expected. Sure, they can set the threads to high priority and all that, but... it's still not as good as one would like. And that's not even for hard real-time.

On the other hand, I can't really expect regular OS's to implement a "real-time" mode or something like that. As much as I'd like it, I feel it would be so complex and fundamentally incompatible with the way those OS's work! (and then I also start thinking about power management, external factors affecting hardware, etc). I hope the future proves me wrong. Given how many people work on audio and play videogames on those OS's, the current situation is kind of sad.

I'm not sure you want to tell the scheduler what to do. Probably you just want a real-time scheduler that you can depend on for real-time work.

One approach is to use a real-time kernel underneath Linux. Bela does this with the Xenomai kernel:


Why is the realtime Linux patchset + Jack audio server insufficient for what you do?

A patchset is not an OS. Like I can make it work for myself but good luck making it work for users.

The LinuxCNC project[1] is a distribution based on the real-time kernel extensions for Debian, built for real-time CNC machine control. It used to be possible to get few-microsecond guaranteed latency, but with modern multicore machines you are stuck with about 100us due to the way the caching works. Still plenty fast enough for audio!

[1] http://linuxcnc.org/

I thought you wanted it for yourself, though?

If you wanted something that lots of other people could use, in principle you could make your own Linux distribution with a particular set of patches.

You can't expect people to use your os.

Good point, I've been down that rabbit hole myself.

Carefully-worded question: If OP is describing a single monolithic music production application, how does "+ Jack audio server" make the system more appropriate for meeting soft-realtime scheduling deadlines?

I just signed up. Is this a single dev (just you) project? Or, do you foresee hiring in the future?

Excellent article. Great explanation. Thanks Michael!

> Rendering live audio is quite demanding: the system has to deliver n seconds of audio data every n seconds to the audio hardware. If it doesn’t, the buffer runs dry, and the user hears a nasty glitch or crackle: that’s the hard transition from audio, to silence.

I think you can do better when a buffer runs dry. Instead of outputting silence, you could output the same frequency spectrum as you were right before the event. That way you will not hear any cracks or pops.

And of course you can fade out the effect when the buffer stays dry for more than a few seconds.

Obviously, you'd have to do some filtering when the audio continues because that too can introduce cracks.

There is really no such thing as 'instantaneous frequency response'. For any frequency to meaningfully exist, you need data for the corresponding period, e.g. if the audio contains content down to 20 Hz, you need at least 1/40th to 1/20th of a second of data for that to materialize.

Put another way - what you are proposing is looping the buffer, which is what some devices do. Portable CD players were kinda notorious for it, and it doesn't sound much better than cracks or pops. Computers also have a tendency to fall into buffer looping when the system hangs (which is likely the failure mode of Realtek codecs).

> There is really no such thing as 'instantaneous frequency response'

Yes that's true, I'm proposing something that uses an approximation of it.

Consider it from a different angle: the inner ear essentially performs a Fourier transform. At every moment the "instantaneous" spectrum determines which hair cells are triggered. Now what I propose is to keep triggering those same hair cells (and not any others) when the buffer runs dry. The exact way of accomplishing this is left as an exercise (though using short windows where you take a FFT could be a good approximation).
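As a toy illustration of that idea — and only that; the replies below are right that real concealment is much harder and artifacts are likely on anything non-stationary — one could analyze the last good frame with a DFT and continue each bin's sinusoid past the frame boundary. All names here are invented, and the naive O(N²) DFT is for clarity, not speed:

```cpp
#include <cmath>
#include <complex>
#include <vector>

// Analyze the last good frame with a naive DFT, then extrapolate each
// bin as a continuing sinusoid with the measured magnitude and phase.
// A sketch of "keep the same spectrum playing" for underrun concealment;
// real systems use phase vocoders or dedicated packet-loss concealment.
std::vector<float> freezeExtend(const std::vector<float>& frame, std::size_t extra) {
    const std::size_t N = frame.size();
    const double twoPi = 2.0 * M_PI;
    std::vector<std::complex<double>> bins(N / 2 + 1);
    for (std::size_t k = 0; k < bins.size(); ++k)        // naive DFT, O(N^2)
        for (std::size_t n = 0; n < N; ++n)
            bins[k] += std::complex<double>(frame[n]) *
                       std::exp(std::complex<double>(0.0, -twoPi * k * n / N));
    std::vector<float> out(extra, 0.0f);
    for (std::size_t k = 1; k < N / 2; ++k) {            // skip DC and Nyquist
        double mag = 2.0 * std::abs(bins[k]) / N;
        double ph  = std::arg(bins[k]);
        for (std::size_t t = 0; t < extra; ++t)          // continue past frame end
            out[t] += static_cast<float>(
                mag * std::cos(twoPi * k * (N + t) / N + ph));
    }
    return out;
}
```

For a stationary tone this continues seamlessly; for transients, noise, or anything between bins, it smears and rings, which is the core objection raised downthread.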

> The exact way of accomplishing this is left as an exercise

Perhaps you should undertake this exercise and let us know how it sounds :)

EDIT: In my experience with audio, when I have a bug that introduces even the slightest discontinuity (or even just a cusp) in the audio, well short of a pop to silence, I can still hear a "weirdness". Ears are pretty attuned to things that sound unnatural. I'm not confident that essentially "forging" the audio is going to sound natural.

What if you train a deep neural network on the song so far, so it can generate plausible-sounding music whenever the buffer drops?

You can even hang intentionally to generate original music!

(/s, please don't)

As another comment mentioned, this is done by conferencing software to deal with packet loss. It sounds like they either loop the previous DCT frame and gradually fade out, or feed the time domain output into a reverb, then cross-fade from 100% dry to 100% wet if the buffer is about to run out (someone on HN mentioned a while back that this approach was patented).

You could maybe make an argument that it would be useful in live music settings to prevent a bad situation from sounding even worse, and maybe you'd put it on some audio software so you can sort of still enjoy playing music on a crappy system, but really, it's best to have hardware and software that can 100% guarantee keeping up with audio processing.

I like the idea, but one problem is that you usually encounter a buffer underrun when the CPU can't keep up, so adding an extra step would require something like leaving enough processing headroom in each buffer to halt it early and run the approximator.

edited typo

"Output the same frequency spectrum" - takes some considerable processing! We're usually working in time-domain, not frequency-domain. I think you're saying do a load of processing to fill the gap when you don't have time to do processing?

Audio software often works in the frequency domain too. CPUs have optimized instructions for this (video also uses them).

Also, the article speaks of multithreaded software, where deadlines can be missed because of complicated dependencies. The end stage where you correct for missing samples can work independently of them in its own thread.

For the record, I think this is a terrible idea.

However, the way I'd do it if needed: keep the two most recent good buffers. When you need to synthesize, start running a phase vocoder based on the hop between those two. You get frozen sinusoids and some random noise for the bins that don't have one, and almost no CPU use on the happy path of no underruns.

Still, don't do it :)

> For the record, I think this is a terrible idea.

It really depends on who you are asking. Some people just hate those loud cracks and pops, and would love to have something that filters them out naturally.

So you always calculate the same frequency spectrum just in case of dropouts instead of getting the actual code right?

Sorry, but if you ever find yourself in such a situation: stop, pause, make yourself a tea, and consider how wise the thing you are doing really is.

I actually think it's actively harmful to hide problems that can be otherwise fixed. If the CPU is too busy to keep filling the audio buffer, the solution is to increase the buffer size to put less stress on the scheduler. I recently reduced my buffer size in Ableton Live, but I knew I had to increase it because I could hear pops. If these pops were being covered up, I wouldn't have realized my buffer size was too small and I'd be unknowingly introducing subtle artifacts into every recording.

Ok, but a large buffer size means more latency.

Also, the settings you use for development do not have to be the same as those used in production.

> Ok, but a large buffer size means more latency.

Depending on the use-case, that may not be a problem. Not all things are latency-sensitive.

With anything besides headphones there will always be latency anyway. Roughly 1ms per foot from the speaker.

Based on nothing more than user experience, conferencing software like Zoom tends to do something that sounds quite like what you're describing, complete with the fade to silence after about half a second.

So it makes sense for certain live situations, but it wouldn't be desirable in studio recordings.

Assuming it's a viable strategy with regards to processing resources (hint: it's not for anything more than toys) you will have audible artifacts, especially around transients. Filtering and additional processing will only alter the signal even more.

Instead of outputting silence, you could output the same frequency spectrum as you were right before the event.

How does one detect whether it's a musical silence or a buffer underrun?

You'd probably do this at a layer where you have access to the buffer stats to know that the buffer is nearly empty.
