Four common mistakes in audio development (atastypixel.com)
404 points by bpierre on June 15, 2016 | 208 comments



I really love time-restricted environments; in my opinion they truly liberate programmers: instead of reaching for the language/library/tech/pattern/etc. of choice, they suddenly realize "oh, we don't have time for that". Do we need GC? We don't have time for that. Do we need to allocate memory? We don't have time for that. Can we maybe do this enterprise-style, tens-of-thousands-of-source-files OOP hierarchy thing? We don't have time for that.

Realtime forces people to write less, writing only what actually matters, and all of this, in my opinion, helps people become better engineers (in a way). I just wish it looked more attractive to people; hacking your way around time- and resource-constrained systems (like MCUs, for example) can be as fun as hacking HTML/JS to make your site behave the way you want.


I've been spending a great deal of time working on a game engine for a Nintendo DS unit, which is a very constrained little system. Source here if anyone's interested.

https://github.com/zeta0134/pikmin-nds

The NDS is an odd little system. Despite supporting 3D graphics, it entirely lacks useful things like a floating-point unit or a hardware divide. All of your math ends up being done in 12.20 fixed point. Square roots are already slow on normal systems, and they're just ridiculously slow on this thing; plus you've only got something like 550k hardware cycles per frame to get everything done.

Getting over 100 objects to draw, have basic physics applied to them, and perform AI in a reasonable amount of time has led to some eye-opening revelations. Right at the start of the project I spent ages trying to come up with the fastest way to sort all of my objects back to front, to make the renderer work. I was just sure this was going to be the slowest part of the draw, so I spent probably two weeks on sorting implementations alone before finally giving up and choosing a std::priority_queue just to have something that worked.

Then I actually tested the thing, and came to the (obvious?) conclusion that I had wasted a great deal of time. The std::priority_queue was handling 100 or so objects really, really fast, and the system was lagging badly because the drawing itself was poorly optimized and needed to be revisited. I had to start actually profiling my code, and the recurring theme was that the bottlenecks were almost never where I thought they would be, and usually surprised me.

It's a whole ton of fun, I'll say that much. Not sure if I'll end up doing anything useful with the engine once it's done (obviously I cannot do anything meaningful with a Pikmin clone due to IP reasons) but I'm enjoying just being able to push the limits of the system.


Just curious - how did you profile code running on the NDS? Emulator?


I have a real NDS unit that I test on with an R4 flashcart from time to time. Since the code runs from RAM after the level is loaded, the read speed of the cart itself doesn't factor into things at all.

Emulators are generally pretty accurate on CPU timings, but other hardware timings are lacking, especially the amount of time the GPU takes to process certain operations. (DeSmuME is my usual emulator for quick testing; no$gba is the most hardware-accurate and has an excellent debugger, but it still falls short of real-hardware testing for timings.)

The NDS does have hardware timers, so I wrote a quick and dirty set of profiling functions. I can issue a start and stop keyed on a topic, and it will measure the number of cycles between the calls. So, any bit of the engine I wish to time, I just surround with matching start/stop calls using the same topic, and then print the results of all the topics out to a text console on the bottom screen. (If you're running the game, hit Select a few times to pull this up. Go withdraw a full squad of Pikmin and see the numbers change and the framerate drop.) It's crude but effective, and keeping in mind the 550k cycle limit per frame allows me to get a rough estimate of what percentage of my frame time is taken up by a particular routine.
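The topic-keyed start/stop mechanism described above can be sketched in portable C. This is my own illustration, not the poster's actual code; `read_cycles()` is a settable stub standing in for latching the NDS hardware timers (the real register reads are hardware-specific):

```c
#include <stdint.h>
#include <string.h>

#define MAX_TOPICS 16

typedef struct {
    const char *name;
    uint32_t start;       /* cycle count at profile_start() */
    uint32_t accumulated; /* total cycles between start/stop pairs */
} ProfileTopic;

static ProfileTopic topics[MAX_TOPICS];
static int topic_count = 0;

/* Stand-in for reading the hardware cycle timer; on the NDS this would
   latch a cascaded timer pair. Here it's a settable stub so the sketch
   can run off-device. */
static uint32_t fake_cycles = 0;
static uint32_t read_cycles(void) { return fake_cycles; }

static ProfileTopic *find_topic(const char *name) {
    for (int i = 0; i < topic_count; i++)
        if (strcmp(topics[i].name, name) == 0) return &topics[i];
    if (topic_count == MAX_TOPICS) return 0;  /* table full, drop topic */
    topics[topic_count].name = name;
    topics[topic_count].accumulated = 0;
    return &topics[topic_count++];
}

void profile_start(const char *name) {
    ProfileTopic *t = find_topic(name);
    if (t) t->start = read_cycles();
}

void profile_stop(const char *name) {
    ProfileTopic *t = find_topic(name);
    if (t) t->accumulated += read_cycles() - t->start;
}

uint32_t profile_cycles(const char *name) {
    ProfileTopic *t = find_topic(name);
    return t ? t->accumulated : 0;
}
```

Dividing `profile_cycles(topic)` by the ~550k cycles available per frame gives the rough per-routine percentage the poster mentions.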

I've generally found that my physics and AI calculations are proportionally slower on real hardware than the emulator reports, by a small margin. I suspect this is due to the ARM9's weird TCM cache implementation, but I really don't know what's going on. Graphics calculations are way different though, and are usually reported much faster by the emulators, so I always re-profile any graphics "optimizations" on real hardware to make sure I haven't run into something that the emulator just happens to run fast.


I believe this is what OP is referring to https://imgur.com/a/YMi6K#97


Ah yes, that was an earlier and much more crude hack. That's remarkably simple: just change the background color of the bottom screen at any time. Since the screen is being drawn at the same time your code is running, the results literally show up in realtime.

A huge downside of this technique is that color changes during H-blank (the waiting period between scanlines) and, more critically, V-blank (the rather long waiting period after each full screen draw) cannot be seen. When I was using this technique, I had the engine actually wait until all of V-blank had passed before it started doing the timing colors. This still ended up being pretty hard to read, of course, as the colors would flicker rapidly at 60 FPS, couldn't easily be saved to view later, and were complicated significantly by the multi-pass nature of the engine, where each visible frame is actually composed of several hardware frames, all with totally different timings. So I scrapped that method almost immediately and wrote text-based profiling code that used the built-in hardware timers.


If only that was what actually happened...

I spent a few years working in the games industry, and let me tell you, enterprise-like files tens of thousands of lines long, with massive OOP hierarchies, were all too common. That's why I think Unity3D is fantastic -- they do the OO (composition) in their engine libs, and you write simple scripts.

In my experience, what really forces people to write more code "that matters" is programming languages and frameworks that constrain them.


I also work in the games industry, and IMHO it depends on the game; usually the biggest fun is in the console market. People start making AAA games with engines like Unity, Unreal Engine 3/4, or Autodesk Stingray (until recently, Bitsquid), and then they suddenly realize "oh, we are not getting 30 fps here, because of Lua/C++/etc", and then the fun starts. IIRC one game company even asked Microsoft to add missing intrinsics to the Xbox One compiler because they were used in the PS4 code, and the PS4 and X1 are basically almost the same chip :)


Yeah. The biggest, ugliest C++ code bases I've seen were for AAA games. Some of the best programmers work in games, but you also see lots of grads and staff churn. TBH I have never seen a big C++ code base I liked.

Unity may have its detractors and genuine deficiencies but the Unity projects I worked on were much more maintainable due to the constraints of working inside Unity itself.


Yeah, early in my career I was making simpler 'kids' games and dreaming of working on AAA titles... that is, until I actually shipped my first AAA title and saw its ball-of-mud codebase. It's funny, because the impression I had before, from all the GDC talks, was that the big studios always seemed to have an awesome codebase that was refactored mercilessly. Not so, I now know...


> Unity may have its detractors and genuine deficiencies but the Unity projects I worked on were much more maintainable due to the constraints of working inside Unity itself.

As a game developer who has mostly worked in Unity, this is horrifying. Unity-based projects are a pile of horseshit in terms of architecture and organization, and I honestly hoped that AAA stuff was held to a higher standard.


AFAIK, the main requirement for AAA games is to ship something on time (Christmas doesn't wait) that runs fast enough. And the thing to ship is the game, not libraries that will make the next game easier.

It may have changed a bit now that hardware speed doesn't increase that much anymore, but historically, that even applied to game series. Even if you knew there would be a sequel, that sequel would target vastly improved hardware, which means new graphics effects, higher-resolution textures, etc.

That is not an environment that leads to high-quality code. The ability to update a game over the Internet hasn't improved that situation, to state it mildly.


Yup, you usually fork the engine some 8-12 months out from ship with the intention of never merging back (or merging back very little) to mainline.

I remember when we'd take UE3 drops, it'd take one engineer a whole month of compiler errors and fixing stuff to get back to where we started.



Chromium is nice, but it also uses a restricted subset of C++: std::unique_ptr, std::move, <algorithm> are okay, but no exceptions, shared pointers, etc. Most large C++ code bases are similar. LLVM also outlaws exceptions. Don't know about Firefox.

So, while these code bases are wonderfully informative and good to learn from in many ways, they won't teach you "modern" C++ styles. I put "modern" in quotes because there's nothing wrong with choosing a style that doesn't use the latest whiz-bang features.


> Most large C++ code bases are similar [...] outlaws exceptions

While I can understand the ideas behind it (exceptions do cause headaches), I find it really weird. Enforcing this means using a very limited part of the C++ standard library: no std::vector, std::map, etc., because all of those can throw exceptions. If you say no to exceptions but yes to std::vector, then either you need to write code that works in the presence of exceptions (in which case you could just say yes to them), or you have shoved your head in the sand like an ostrich, thinking that if you can't see them, they can't happen.


Large source bases take time to build, and if they're large now, started before those features became widely available. But look around a bit -- you'll find plenty of equivalents. Chromium has its own smart pointers library, and there's a good chance that they'll start shifting to the standard stuff over time.


You also have companies which jump on the google style guide bandwagon.

It's a fine guide, but it was written for Google and has some stuff in there as a result (no exceptions being one example).


> TBH I have never seen a big C++ code base I liked.

In what languages have you seen a big code base that you liked? Just curious.


Not the OP, but in C I've liked Linux well enough (though it's not perfect), and PostgreSQL and SQLite are supposed to be examples of very well structured code.


The quake sources are supposedly pretty good.


The quake sources taught me many lessons on how to structure C code!


When your time budget is a million cycles to handle a few dozen samples, you absolutely can use careful GC and careful allocation. A complicated OOP hierarchy doesn't require any thought at all. It just works, without any care needed.

Some realtime is truly restrictive. Realtime audio on a GHz-class processor is not.


OOP isn't a problem, IF you're talking about C++ code using purely stack-allocated objects that you wrote yourself to avoid any memory allocation, taking care with locks, etc...

If you're not experiencing performance issues every day despite your smartphone being "ghz class", I want your phone!


My phone has lots of performance issues. None of them are in the audio processing.


Constraints are limits you think against. Like grounding. Paradox of choice somehow.


This also works in reverse: take Google, full of fresh graduates filtered through a process of 20 whiteboard interviews, and you end up with 150 ms of audio/input subsystem lag.


Just reading the intro so far, not an audio developer particularly, but wanted to quote this:

> although there is a high horse present in this article, consider me standing beside it pointing at it, rather than sitting on top of it.

Something we can all aspire to!


Seneca in https://en.wikisource.org/wiki/Moral_letters_to_Lucilius/Let...

"What," say you, "are you giving me advice? Indeed, have you already advised yourself, already corrected your own faults? Is this the reason why you have leisure to reform other men?" No, I am not so shameless as to undertake to cure my fellow-men when I am ill myself. I am, however, discussing with you troubles which concern us both, and sharing the remedy with you, just as if we were lying ill in the same hospital. Listen to me, therefore, as you would if I were talking to myself. I am admitting you to my inmost thoughts, and am having it out with myself, merely making use of you as my pretext.


These mistakes basically boil down to not using slow scripting languages and keeping the core buffer loop fast and undisturbed. For realtime synthesis use C or C++, optimise early, use trigonometric approximations in the range -pi to pi, don't use lookup tables (they needlessly fill the cache; CPUs are fast enough), write the core loop branchless, write for the cache, use vectorisation and function pointers. Don't use anything fancy: simple int32_t, float, and vector/array are enough (some filters need double precision, though). Don't copy data, point to it. Precalculate what can be precalculated, e.g. don't do x / samplerate, do x * samplerate_inverse. Check for denormals.
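The divide-to-multiply and branchless points can be made concrete with a small sketch (a phase accumulator with invented names, not from the article):

```c
#include <stddef.h>

/* Precompute the reciprocal once per parameter change, not per sample:
   x * inv_samplerate instead of x / samplerate in the inner loop. */
typedef struct {
    float samplerate;
    float inv_samplerate; /* cached 1.0f / samplerate */
    float phase;
} Phasor;

void phasor_init(Phasor *p, float samplerate) {
    p->samplerate = samplerate;
    p->inv_samplerate = 1.0f / samplerate;
    p->phase = 0.0f;
}

/* Render n samples of a [0,1) phase ramp at freq Hz. The loop body is
   branch-free: the wrap uses truncation arithmetic, not an if. */
void phasor_run(Phasor *p, float freq, float *out, size_t n) {
    float inc = freq * p->inv_samplerate;   /* multiply, no divide */
    float phase = p->phase;
    for (size_t i = 0; i < n; i++) {
        out[i] = phase;
        phase += inc;
        phase -= (float)(int)phase;         /* branchless wrap to [0,1) */
    }
    p->phase = phase;
}
```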


I'd rather configure the FPU to flush denormals to zero.

As to slow scripting languages - TFA was talking about determinism. A slowish scripting language could give you deterministic runtime, at least as deterministic as C or assembly (interpreted Forth is a simple example), and on the other hand there's no shortage of ways to get your C or C++ code to stall for an unknown amount of cycles (TFA enumerated some.) Of course slower languages are slower, and in a faster one you're more likely to get away with stalls because you have more slack to begin with.
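On the flush-to-zero point: on x86 with SSE this is one macro call (`_MM_SET_FLUSH_ZERO_MODE` from xmmintrin.h); elsewhere a common fallback is to squash tiny values by hand before they decay into the denormal range. A minimal sketch, with the threshold chosen arbitrarily for illustration:

```c
#include <math.h>

#ifdef __SSE__
#include <xmmintrin.h>  /* SSE control-register macros */
#endif

/* Configure the FPU to flush denormal results to zero (x86/SSE only;
   a no-op elsewhere, where you'd rely on manual squashing instead). */
void enable_flush_to_zero(void) {
#ifdef __SSE__
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
#endif
}

/* Portable fallback: zero values far too small to matter for audio
   (e.g. in IIR feedback paths that decay toward zero forever). */
float squash_denormal(float x) {
    return (fabsf(x) < 1e-30f) ? 0.0f : x;
}
```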


That's new to me, wouldn't the OS be required to be deterministic to begin with, for a deterministic language to actually be deterministic? Since most OSes are not, the only thing left is raw speed (or dedicated dsp chips).


> That's new to me, wouldn't the OS be required to be deterministic to begin with, for a deterministic language to actually be deterministic? Since most OSes are not, the only thing left is raw speed (or dedicated dsp chips).

You can't control the determinism of the OS or the user's environment, sure. But you can make sure you're not adding additional nondeterminism. If the OS can't deliver stable timing, no audio app will under those circumstances. But for the times when it does, you don't want to be the only instrument popping and underrunning while every other effect is glitch-free.

Raw speed isn't really that great, because you can't process ahead of the present. If you need to provide a 128 sample buffer every 2.9 ms, it doesn't matter if a slow language takes 2.8ms, and a fast one takes 0.1ms. However, if every 10 seconds the fast language takes 10ms, you've lost 4 buffers! Consistent speed is the aim.
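The buffer arithmetic above checks out; a trivial helper makes it concrete:

```c
/* Deadline for delivering one audio buffer, in milliseconds:
   buffer_size / samplerate seconds. 128 samples at 44.1 kHz ~= 2.9 ms. */
double buffer_deadline_ms(int buffer_size, double samplerate) {
    return (double)buffer_size / samplerate * 1000.0;
}
```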


Modern OSs have some amount of predictability. If the system is not overloaded or you renice the program, you can expect it to run every couple of ticks. If there's nothing big running in parallel and your data is small enough, you can expect it to stay in cache. If you do no disk access, you can expect it not to stall.

It all depends on how much latency you can accept. With enough latency, you can buffer over anything.


> That's new to me, wouldn't the OS be required to be deterministic to begin with, for a deterministic language to actually be deterministic?

A large problem for deterministic timing is the cache hierarchy of modern CPUs.


You can also increase the priority of your thread/process. Works pretty well. On Windows, you can also mark your thread as being audio related and latency sensitive (Multimedia Class Scheduler Service)


>do not use lookup tables, they unnecessarily fill the cache, cpus are fast enough

Whoa, slow down. Testing will tell you which one is faster, but if you're applying a complicated function to every sample (e.g. one that uses pow, log, or trig, especially more than once), lookup tables are definitely worth considering. The more complex the function, the bigger the performance gain they can represent. An arithmetic approximation such as a Chebyshev polynomial can also work, but may or may not be more efficient, depending on the function.

And yes, make sure you optimize the size of your table to be as small as possible while maintaining acceptable quality (which also depends on the interpolation you're using to access it). Everything is a tradeoff. New processors are efficient but not magical, and I've seen a marked performance improvement after applying a lookup table on a small plugin a few months ago, all of this on a very modern i7.


You shouldn't use standard pow and log but approximations of these functions. These are usually faster than any lookup table that fills a good part of the cache and then stalls you. But there might be some gray area, depending on the system, or when full precision is needed. In most audio cases approximations work very well. At least that's what my testing showed.

edit: it might be different for a single plugin, where there isn't much pressure on the cache, but as the core loop grows bigger, you also need the cache more.
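As an example of the approximation route (and of the "trig approximations in -pi..pi" advice upthread), here is the well-known parabolic sine approximation with one refinement step; it's a sketch, accurate to roughly 0.1%, not a drop-in sinf replacement:

```c
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Parabolic sine approximation, valid for x in [-pi, pi].
   First pass: a parabola matching sin at 0 and +/-pi/2.
   Refinement: blend with its own square, ~0.1% max error. */
float fast_sin(float x) {
    const float B = 4.0f / (float)M_PI;
    const float C = -4.0f / ((float)M_PI * (float)M_PI);
    float y = B * x + C * x * fabsf(x);
    const float P = 0.225f;               /* empirical blend constant */
    return P * (y * fabsf(y) - y) + y;
}
```

No table means no cache footprint, which is the tradeoff being argued here.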


> Whoa, slow down.

An unfortunate suggestion, given the context. :)

GP and others always mention using C/C++, but what about Rust? Is it viable [enough] for these kinds of problems? For high-performance code in game engines etc.?

Edit: Ah I should've searched: https://news.ycombinator.com/item?id=11908849


Rust would be fine; the biggest hurdle is getting all the various libraries that C/C++ already has access to. I was looking into audio modem stuff a while back for APRS, and it seems like most of the Linux audio libraries already have Rust wrapper equivalents.


> do not use lookup tables, they unnecessarily fill the cache...

Good point. A major misconception that a lot of people have is that lookup tables are always faster than actually calculating a value, but cache locality is usually not given as much attention.


It used to be true in the past. Back then, processors were slower, and they had no cache, since the processor and memory speeds were roughly matched. A few integer multiplications would be enough to cost more than a single memory access.

It should still hold true today for small embedded processors with slow arithmetic and no cache.


They often lack RAM too, so that const lookup table could be sitting in flash.


That's not my takeaway from the article. It's not saying don't use "slow scripting languages", it's saying avoid language features that are not safe for real-time programming and give the audio thread enough time to do its thing.

In my experience, writing everything in C/C++ has its own host of problems. What's worse, a GC stutter or a segfault? :)


Writing everything in C/C++ has its own host of problems, but it's roughly the name of the game in anything realtime. It's easy enough to think about writing one effect or synth in a slower language, but keep in mind that modern digital audio workstation software is expected to run up to hundreds of plug-ins to produce one block of audio, and that's just not possible without being directly on top of the metal. Even vanilla C/C++ often isn't enough; generally, there's a necessity to drop down into assembler or at the very least use SSE intrinsics.


Realtime often means: when there's not enough time, don't bother. It does not mean going as fast as possible. Those are two separate issues, in my opinion at least.


How are they two separate issues? If you don't go as fast as possible, you can get less work done before your deadline. Even with today's CPUs and the kind of parallelisation that a good DAW program will do, I still routinely see producers (myself included) hit CPU usage limits. In audio DSP, programmers have responded to the increase in CPU capacity not with higher-level languages but instead with more computationally expensive algorithms (modern filters, for example, use techniques derived from circuit simulators for solving nonlinear equations).

Besides, actually writing the DSP code makes up a comparatively tiny portion of the development time. Tuning and tweaking the DSP algorithms takes far more, and designing the user interface dwarfs both. I know multiple audio products which use Lua for their UI layer because the productivity increase is so significant. And there, you don't have to worry about performance to anywhere the degree you need to in the DSP.


I will take a segfault over GC stutter any time of the day, because I can at least debug and fix it. GC is not really predictable.


A single GC stutter that leads to an audible drop-out means complete failure of the audio program. The article also mentions this: "Don't use Objective-C/Swift on the audio thread", which can be generalised to "Don't use slow scripting languages where you don't get full control of the audio thread's memory layout and data scope/lifetime". Top audio synthesis also gets down to hand-optimising the hot spots in assembly.


How exactly is Obj-C a slow scripting language without control of scope?


AFAIK Obj-C got some kind of garbage collection and is heavily tuned to use Apple libs. Since it's so platform-centric, Obj-C is out of the question for me. Maybe fast, but not as fast as C/C++; I would guess around Java speed.


Objective-C is compiled (to machine language, not bytecode), and does not use garbage collection (it has optional garbage collection, but this is not supported on iOS).

What informs your guess that it's "around Java speed"?


Just some blurry knowledge mixed with ignorance (50/50), because of its platform-centricity. It's probably a bit faster than Java though.


They are both exactly as bad as each other in this application but at least it's always possible to avoid a segfault.


That's a false dichotomy. s/segfault/unhandled exception/


Segfaults are hardly an inevitability in C/C++.


Actually, even C can be very much non-deterministic; you have to be careful with memory-related issues. free() and malloc() can stall your program quite drastically if you're not careful. So no, it's not just a matter of avoiding slow scripting languages. Plenty of "scripting languages" are in fact used in audio; they just have different memory models (see SuperCollider, ChucK, Pure Data).
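The usual fix for the malloc/free stalls mentioned here is to allocate everything up front on a control thread and let the audio callback touch only preallocated memory. A minimal sketch (names and the half-gain "DSP" are invented for illustration):

```c
#include <stdlib.h>
#include <stddef.h>

/* All memory is allocated in audio_setup(), on the control thread.
   The audio callback never calls malloc, free, or anything else that
   could block or stall. */
typedef struct {
    float *scratch;   /* preallocated work buffer */
    size_t capacity;
} AudioState;

int audio_setup(AudioState *s, size_t max_block) {
    s->scratch = malloc(max_block * sizeof(float));
    s->capacity = max_block;
    return s->scratch != NULL;
}

/* Real-time safe: touches only preallocated memory. */
void audio_callback(AudioState *s, const float *in, float *out, size_t n) {
    if (n > s->capacity) n = s->capacity;  /* never grow on the audio thread */
    for (size_t i = 0; i < n; i++) {
        s->scratch[i] = in[i] * 0.5f;      /* stand-in for real DSP work */
        out[i] = s->scratch[i];
    }
}

void audio_teardown(AudioState *s) { free(s->scratch); }
```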


At least Pure Data is written in C; I'd guess the others are similar under the hood.


With Csound you have a mature, stable audio engine that works everywhere, even on Android. You can call it from various languages: while you can use it as a C/C++ library, it also has well-supported bindings for Java on Android and for Objective-C and Swift on iOS. http://csounds.com. I used it for my psychoflute and moonsynth apps on Android.


> and function pointers

Ok, everything else made sense, but why on earth would you use function pointers in performance critical code if you can avoid them at all. Same obviously goes for virtual methods.


To be flexible without if/else chains or switches. If I want a multi-oscillator that can produce different waveforms, I use a pointer to a different function depending on the waveform I need. The same goes for different postprocessing/routing. I didn't actually profile this against a switch, but to me it's cleaner and more modular. If you've got a static effect whose routing stays the same, you don't need it, of course.

edit: I've never used a virtual method, but my impression is that performance-wise it's quite different from pointers to plain functions. Most of my audio code is POD structs, laying out the data in the order it gets accessed, plus standalone functions that modify the structs. It's pretty barebones data-oriented programming, compared to the other extreme of some baroque, meta-virtualised OOP architecture.


Runtime codegen could make a big difference. You can dynamically build your processing pipelines without overhead: no function pointers needed, total flexibility.

Check out JIT libraries (like GNU lightning) and LLVM. LuaJIT might also be interesting.

"llvmpipe", software rasterizer that uses LLVM for runtime codegen for shaders: http://www.mesa3d.org/llvmpipe.html


It's ancient now, but I built exactly this using Ruby and an LLVM wrapper.

https://github.com/jvoorhis/siren

I wouldn't expect this to run on the latest LLVM without a little hand-holding (the most painful part was porting it to run on x86_64 after developing it on my white MacBook!), but it worked rather well and was a fun platform for experimenting with composition, synthesis, and compilers!


Sounds intriguing and way more flexible, but the example is for video? I don't know anything about this, but if I had to guess, I'd say that any runtime code generation is worse performance-wise than a mix of POD structs and function pointers to functions which are all known at compile time.


> runtime code generation is worse performance wise compared to a mix of pod structs and function pointers to functions which are all known at compile time

Having "a mix of pod structs and function pointers" is analogous to having an interpreter for some scripting language. Do you consider interpreted scripting languages fast?

Interpreters are slow pretty much for that reason. Having to chase pointers.

Runtime code-generating JITs wipe the floor with interpreters, being usually 10-100x faster.

Now I'm not saying your application will be 10-100x faster, but done right, it'll absolutely be faster if your performance critical "inner loop" / filter chain needs to currently follow pointers. It's just a question of how much and whether it matters for the use case.
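To make the interpreter analogy concrete, here is a toy filter chain walked through function pointers next to the same chain hand-fused into straight-line code; the fused form is roughly what a JIT (or compiling all permutations ahead of time) would produce. The stages are invented for illustration:

```c
/* "Interpreted" form: each sample is threaded through an array of
   stage pointers -- one indirect call per stage per sample. */
typedef float (*stage_fn)(float x);

static float stage_gain(float x)   { return x * 0.5f; }
static float stage_clip(float x)   { return x > 1.0f ? 1.0f : (x < -1.0f ? -1.0f : x); }
static float stage_offset(float x) { return x + 0.1f; }

float run_chain(stage_fn *stages, int n_stages, float x) {
    for (int i = 0; i < n_stages; i++) x = stages[i](x);
    return x;
}

/* "Compiled" form: the same chain fused into straight-line code.
   No indirection, everything visible to the optimizer. */
float run_fused(float x) {
    x = x * 0.5f;
    x = x > 1.0f ? 1.0f : (x < -1.0f ? -1.0f : x);
    return x + 0.1f;
}
```

Both compute the same result; the difference is only in how much the compiler can see and inline.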


[Having "a mix of pod structs and function pointers" is analogous to having an interpreter for some scripting language.]

Yes, it might be similar at the core, but i would guess the usual audio pod-struct/function mix is many times less complex than a full jit language interpreter.

What does chasing pointer mean to slowness then? My understanding is that its similar to data pointers, that a certain region of memory needs to be looked up and be fed into the instruction cache. If the process is layed out so that instruction cache misses are minimised, whats so bad about it?


> whats so bad about it?

Other than executing a lot of useless instructions, causing pipeline stalls due to indirection, potentially mispredicted branches, etc. I guess nothing.

Just remember if the actual operation is just a few instructions, the price for all that extra work can be 1-2 orders of magnitude.

If you only need to chase pointers say a million times per second or less, go for it. It won't matter so much. In worst cases you might lose 10% (of total cpu core capacity) performance, if even that. If it's in tens of millions or more, it's pretty likely to be a significant bottleneck.

What does your profiler say? What kind of ILP are you getting?


The current code with the function pointers and PODs works OK for me; it's roughly in the million-calls-per-second range, and I didn't profile any alternative to it. How would you implement such an alternative with JIT code generation in this POD/function C setup without major rewrites? It seems to me like a completely different system, bound to LLVM (I use GCC).

edit: since you mentioned JIT optimisation and code generation: this is a field I don't have any experience in, but my impression so far is that this runtime rearrangement of code can yield quite a big performance increase for higher-level languages (the JVM does this, AFAIK). Is the link you provided the same idea for statically compiled C/C++ code?


If the chain of operations changes rarely, you might even get away with invoking clang (or whatever compiler) at runtime and actually compiling a chain of filter source code mashed together.


So how would you structure a performance-critical (audio) program that needs a flexible data-processing path, say with a 128-sample buffer and about 3-10 choices along the processing path, switchable at per-buffer frequency? Function pointers were the best solution I could come up with, but maybe that doesn't really eliminate the branching, cache misses, and stalling and their impact on performance; it just hides them in plain text.


How many permutations? If not too many, maybe just compile them all.

Sometimes you do need to use function pointers. It's just if it's in a performance critical path, you might want to consider alternatives.
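A sketch of the "compile them all" idea, assuming a handful of fixed paths (the paths here are invented for illustration): each permutation becomes its own fully inlinable function, and the selection happens once per buffer, not per sample.

```c
/* With few permutations, generate each full path as its own function
   and pick once per buffer; the per-sample loops contain no branches
   on configuration and no indirect calls. */
enum { PATH_DRY, PATH_GAIN, PATH_GAIN_CLIP };

static void path_dry(const float *in, float *out, int n) {
    for (int i = 0; i < n; i++) out[i] = in[i];
}
static void path_gain(const float *in, float *out, int n) {
    for (int i = 0; i < n; i++) out[i] = in[i] * 0.5f;
}
static void path_gain_clip(const float *in, float *out, int n) {
    for (int i = 0; i < n; i++) {
        float x = in[i] * 0.5f;
        out[i] = x > 1.0f ? 1.0f : (x < -1.0f ? -1.0f : x);
    }
}

/* One switch per buffer, not per sample. */
void process(int path, const float *in, float *out, int n) {
    switch (path) {
        case PATH_DRY:       path_dry(in, out, n);       break;
        case PATH_GAIN:      path_gain(in, out, n);      break;
        case PATH_GAIN_CLIP: path_gain_clip(in, out, n); break;
    }
}
```

The obvious cost is code duplication, which is why it only works when the permutation count stays small.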


The permutations in my own case are pretty few; it's just a replacement for a handful of switches, and otherwise the flow is pretty straightforward. But I see that if one wants to restructure the whole program flow on the fly, it might lead to problems. What are the alternatives? The link you provided is for graphics and doesn't seem generalised. Are you suggesting a complete switch to a JIT-compiled codebase?


Graphics and audio are usually pretty similar, other than the volume of data is usually quite a bit larger for graphics than for audio.

There's no silver bullet for structuring such a flexible, dynamically configurable, high-performance system.


I disagree with 'pretty similar' and would suggest 'challenging in completely different ways'. Graphics at a minimum of 60 fps is about 700 times less time-sensitive; that's almost three orders of magnitude of difference. On top of that, it isn't a major failure if you miss the 60 on time, but in audio it is. Furthermore, audio tends to be single-thread focused, while graphics is much more multi-thread friendly (see your link to llvmpipe, and GPU architectures in general). Graphics needs to shovel around some hundreds of megabytes per second and is less time-critical, whereas audio is ultimately time-critical and only needs to produce about 350 KB per second for 32-bit stereo (the data difference you already mentioned). That's again some three orders of magnitude of difference. I wouldn't call that similar. Since you can't provide any advice on an alternative to function pointers, I must challenge your expertise in audio programming, since you seem to be projecting from a graphics programming point of view.


Unless your hardware has to be fed sample-at-a-time, your deadline isn't 700 times worse. A fairer comparison would point out for HD video, you have mere nanoseconds per pixel, vs. microseconds for audio samples.

That is, my laptop has ~43 cycles to get out all channels of each pixel, vs. more than 60,000 cycles to get out all channels of each 44.1kHz audio sample.

But the pixels only have to be delivered by screen refresh, and the samples only have to be delivered before the sound card tries to play them--probably 64-1024 samples (or further) in the future.
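The per-unit budgets being argued here reduce to trivial arithmetic; a sketch with illustrative (not measured) clock and resolution figures:

```c
/* Rough per-unit cycle budgets: clock speed divided by units per second.
   E.g. a 2.66 GHz core at 1080p60 has tens of cycles per pixel, but
   tens of thousands of cycles per 44.1 kHz audio sample. */
double cycles_per_pixel(double hz, double w, double h, double fps) {
    return hz / (w * h * fps);
}

double cycles_per_sample(double hz, double samplerate) {
    return hz / samplerate;
}
```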


Well, I haven't done audio programming for quite a while. I did get audio hardware behave pretty nicely in 1998, though. I believe low latency audio was much harder back then.

I'm pretty surprised if your deadline is really that 23 microseconds (16.7 ms / 733). You can't even reliably dispatch an IRQ request in that time!

> I wouldn't call that similar

If you have a long audio filter chain it's not all that different from a long fragment shader. You can think of a fragment shader as a filter chain.

> Since you can't provide any advise for an alternative to function pointers

Uh. That's pretty hard without seeing the source code and full problem description. Especially because you seem to be expecting some magical solution. Well, those don't exist, it's just hard work.

Besides, like I said earlier, sometimes function pointers are appropriate for the task. Just understand the cost. Do keep in mind audio filter chains can be pretty complicated. If your filter chain has many concurrent channels with hundreds of dynamic filter steps for each sample to pass through, yeah, you might have some performance problems.
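For illustration, here's a minimal sketch of the kind of per-sample callback chain being discussed. The `filter_fn`, `gain_filter` and `run_chain` names are hypothetical, not from any real codebase; each step is an indirect call:

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical per-sample filter signature: opaque state plus one input sample.
typedef float (*filter_fn)(void *state, float in);

static float gain_filter(void *state, float in)   { return in * *(float *)state; }
static float offset_filter(void *state, float in) { return in + *(float *)state; }

// One sample through n filters; every fns[i](...) is an indirect call the
// branch predictor has to deal with, which is the cost under discussion.
static float run_chain(filter_fn *fns, void **states, size_t n, float in) {
    for (size_t i = 0; i < n; i++)
        in = fns[i](states[i], in);
    return in;
}
```

With hundreds of dynamic filter steps per sample, those indirect calls (and the inlining they prevent) are where the time goes.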

> i must challenge your expertise in audio programming since you seem to project your view from a graphics programming point.

Be my guest. Just wanted to share something I know about and give new ideas. Btw, I'm not doing graphics, but mostly I/O with micro/millisecond level deadlines.


Getting audio hardware to behave nicely for audio is pretty nice :) The calculation for the time factor difference is simply 44100/60, which is about 733; I don't know how the 23 microseconds you got there relate to this.

Your comparison to graphics seems to relate to long FIR filters, which I don't use because of cache reasons (big kernel/lookup table). FIR filters can however sound pretty nice with static parameters, but get extremely expensive for realtime modulation. For realtime modulation I suggest IIR filters (Finite Impulse Response vs. Infinite Impulse Response filter architecture).

Of course it's hard to give advice into the blue without any source, but I at least expected some general hint, like 'avoid branches, use function pointers' taken to the next level, since you were so opposed to function pointers to begin with.

I am also not completely sure what's the best approach in this regard, and I surely haven't done enough profiling here.

That said, I remain sceptical regarding your critique of function pointers in a single-threaded audio application, given the lack of an alternative.

I/O on a microsecond-critical deadline certainly seems to be in the neighbourhood of these problems. Do you use some arcane JavaScript asm.js with JIT auto-reconfiguring optimisation for this, or how would you describe your approach?


> i don't know how the 23 microseconds you got there relate to this

1 / (60 Hz * 700) = 23.8 microseconds. (Yeah, should have rounded up to 24).

What worked for me in the past was just concatenating (well, memcpying) executable filter blocks of x86 code while ensuring parameters were in the right registers.

Loop start block -> filter 1 -> filter 2 -> filter 3 ... -> loop end block.

I made these blocks by processing specifically compiled (not at runtime, but ahead of time) functions, stripping prologues and epilogues. There's more to it, but it's been a while.

Crude, but effective and fast even when each filter step was pretty small on its own.


So you put the parameters into registers 'by hand' in assembly and then memcpyed some function to the right address (using a function pointer?) to process those registers? Wow. That sounds pretty complicated. For now I just use gcc and function pointers and let gcc do the rest.

About the 1/(60 * 700): what does that number even mean to you? 1/60 is the usual refresh interval in seconds, about 16 ms; why do you multiply this by 700? That makes no sense to me.


You said earlier:

> Graphics in a minimum 60 fps context is about 700 times less time sensitive.

16.6 ms (60 fps) / 700 = ~24 microseconds.

> So you put in the parameters into registers 'per hand' in assembly and then memcopyed some function to the right adress (using a function pointer?) to process these registers? Wow. That sounds pretty complicated. For now i just use gcc and function pointers and let gcc do the rest.

I concatenated blocks of compiled code together (compiled as position independent), avoiding branches and indirect calls.

Loop setup block took care of putting right parameters to right registers.

Hardest part was getting the filter parameter passing right. Dynamic configuration was trivial after that.

There were no function pointers except one to trigger the top level loop.


My original calculation was that in audio you need to deliver 44100 values per second, while in graphics you need 60; 44100/60 roughly equals 700. That's the difference factor :) (Of course the data volume per delivery and its techniques differ much; that's why I suggested that these are very different problems.) It's very simple; maybe we got caught up in some misunderstanding.

For your concatenated blocks of precompiled code, maybe you should have used simple function pointers and one monolithic precompiled block to achieve the same result, like I originally suggested.


> For your concatenated blocks of precompiled code maybe you should have used simple function pointers, and one monolithic precompiled block to achieve the same result like i originally suggested.

That's how the original version was like, but it was way too slow. I achieved eventually about 10x faster performance for typical (dynamic) configurations.


> Interpreters are slow pretty much for that reason. Having to chase pointers.

Here's a test I did on the Raspberry Pi 3 (with source so anyone can repeat it on more Intel-y hardware): https://gist.github.com/LnxPrgr3/31eaf5648a9956f5c576f70d876...

Calls through a function pointer are technically more expensive, but it's a factor of ~1.1, not 10-100.

Interpreters are slow because they chase pointers and otherwise branch in a manner the CPU can't predict. Calling the same function through the same pointer over and over again isn't that.


A different compilation unit prevents inlining. Function pointers to small functions can be a performance problem for exactly that reason -- except for very specific special cases (when the compiler can actually figure out your function pointer calls always end up calling the same function), function pointer calls cannot be inlined.

The C/C++ compiler abstracts the difference between inlining and an actual call. For best results, cross-compilation-unit optimizations should be enabled so that inlining can occur across object and library boundaries.

> Interpreters are slow because they chase pointers and otherwise branch in a manner the CPU can't predict. Calling the same function through the same pointer over and over again isn't that.

A lot of function pointer usage is also data dependent. Data dependency and too complicated dependency chains are what breaks prediction.


Data-dependent audio pipeline callbacks? In the schemes I'm aware of, they rarely change after setup--hence my thinking effectively fixed targets were the representative case. (After an update, pipeline stalls plague you until the predictor catches up.)

Breaking inlining is probably the real cost to this. Still, we're talking ~9 cycles per indirect call on this hardware. If you call each callback per buffer instead of per sample (which everyone seems to do), you're down to an amortized cost below 1 cycle per sample for a filter chain 14 deep with only 128 sample buffers. 14 filters probably swamp that cost.
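That amortization is easy to sanity-check. A hypothetical helper, plugging in the figures above (~9 cycles per indirect call, 14 filters, 128-sample buffers):

```c
#include <assert.h>

// Amortized indirect-call overhead per sample when each callback is
// invoked once per buffer rather than once per sample.
static double amortized_cycles(double cycles_per_call, int n_filters, int buf_len) {
    return cycles_per_call * (double)n_filters / (double)buf_len;
}
```

`amortized_cycles(9.0, 14, 128)` comes out just under 1 cycle per sample, so the 14 filters' actual work dwarfs the call overhead.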

Edit: A good compiler will completely remove my trivial function if it's inlined into call.cc's code. It has no side effects, and neither does the loop calling it 100,000,000 times. The resulting timings would be meaningless: doing nothing is cheap!


> Still, we're talking ~9 cycles per indirect call on this hardware.

In 9 cycles, you could execute up to 18 SIMD instructions, each operating on 256 bits worth of data.

For scalar, 9 cycles can be up to 30 instructions or so. For good code, average about 15-20.

> If you call each callback per buffer instead of per sample (which everyone seems to do), you're down to an amortized cost below 1 cycle per sample

Function pointers are perfectly ok even in high performance programming, if it's not what burns the cycles. If you can batch the filter, all is good. If you can't, well, you might need to think up some other solution.

Profiling rocks. I simply ran some experimental cases with the same test setup using dynamic function pointers and using hardcoded code. I quickly noticed that the hardcoded version was over an order of magnitude faster. I figured out a way to narrow the difference and ended up doing dynamic code generation.

> doing nothing is cheap!

Code you don't need to run is the fastest code you can have. Infinitely fast in fact.


They're not a problem at all. ffmpeg/x264/x265 outsource all CPU-dependent code to a table of function pointers and call them constantly on small data, like 4x4 pixel blocks.

Desktop CPUs can predict function pointer jumps too, you know.


> It can be helpful to know that, on all modern processors, you can safety assign a value to a int, double, float, bool, BOOL or pointer variable on one thread, and read it on a different thread without worrying about tearing

This is true on the CPU side (for some CPUs) but what about compiler optimizations?

Using "volatile" should make stores and loads happen where they are written in the code, but can it be relied on in multi-threaded use cases? It's generally frowned upon for a good reason, but perhaps it's acceptable if you're targeting a limited range of CPUs (the article seems to be focused on iOS and ARM only).

A safer bet would be to use __atomic_load and __atomic_store (from [0] or [1]) or C11 atomics if you have the appropriate headers for your platform. They provide safe loads and stores for all CPU architectures, and provide the appropriate memory barriers for cache coherency (for architectures that do care).

[0] https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins... [1] http://clang.llvm.org/docs/LanguageExtensions.html#langext-c...
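A minimal sketch of what publishing a value with those builtins looks like; `publish_flag` and `read_flag` are illustrative names, not a real API:

```c
#include <assert.h>

// Tear-free publication with the GCC/Clang __atomic builtins: the release
// store pairs with the acquire load so the reader sees the value whole,
// with the appropriate barriers on weakly ordered architectures.
static int shared_flag;

static void publish_flag(int v) {
    __atomic_store_n(&shared_flag, v, __ATOMIC_RELEASE);
}

static int read_flag(void) {
    return __atomic_load_n(&shared_flag, __ATOMIC_ACQUIRE);
}
```

On x86 these compile to plain movs (plus compiler-reordering constraints); on ARM they emit the needed barriers.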


If you're doing this in assembly, no problem so long as you understand the memory model of the cpu architecture you're targeting.

The memory model implemented by C compilers absolutely does not allow this. The problem is not cache coherency but optimization. The compiler assumes it understands the visibility of the variables and reorders, combines, and eliminates reads and writes. C is not a high-level assembler.

> A safer bet would be to use __atomic_load ... C11 atomics

These should work but be careful. Use the semantics defined by these constructs not the semantics of any assembly you might imagine the compiler generating.


Sure, but isn't your assembly code going to look a lot like C code with atomics anyway, as you use barrier or in-order execution instructions (e.g., dmb on ARM or eieio on PPC)?


> [Using "volatile" should make stores and loads happen where they are written in the code...]

Really? I was under the impression that "volatile" basically did nothing other than prevent the compiler from optimizing out a variable that appears to not ever be written to, but is actually written to via an ISR or something.

Also, how does the compiler usually handle global variables? Since they can be modified by some code that was linked in, does it assume them to be volatile?

Of course, this issue is slightly different than the atomic-ness of certain data types. We're talking more about situations where you have a series of atomic operations which you expect to happen in the order of which they're written.


> Really? I was under the impression that "volatile" basically did nothing other than prevent the compiler from optimizing out a variable that appears to not ever be written to, but is actually written to via an ISR or something.

I think it also stops loads and stores from being reordered with other volatile loads and stores, and yes, it indeed is typically useful for memory-mapped I/O, interrupt handlers and such. But most of the time it's not the right thing to do.

> Also, how does the compiler usually handle global variables? Since they can be modified by some code that was linked in, does it assume them to be volatile?

No, they're not considered volatile but may have stricter guarantees with reordering than local variables (esp. when combined with calls to foreign functions).

If multithreaded code needs to be correct and portable, using locks or atomics is the way to go.


> I think it also stops loads and stores to be reordered with other volatile loads and stores, and yes, it indeed is typically useful for memory mapped i/o, interrupt handlers and such. But most of the time it's not the right thing to do.

It only prevents reordering by the compiler. CPU can still reorder them as much as it pleases, volatile doesn't affect it at all.

Of course you're pretty safe on x86. Not so on other platforms. Typically you find this kind of issue when porting x86 code to some other platform. X86 writes are always ordered with respect to other writes, and loads with respect to other loads. Just remember that loads are sometimes ordered before stores.

Again, volatile does not prevent CPU from reordering stores and loads.


If the compiler isn't preventing CPU reordering, then it has been impossible pre-2011 to use volatile for its original purpose, communicating with memory-mapped device registers. That sounds like a broken compiler to me.


Yeah, it'd not be very nice if the DMA start bit in some DMA control register got poked while the DMA address is still unset!

Those architectures that do this typically provide specialized always serialized instructions for communicating with memory mapped devices. If not, you need to have proper fences to ensure right order.

X86 stores are always in order (except for some specific instructions, but they're no concern in this context), so it doesn't need this.


Memory-mapped device registers are generally mapped in such a way that accesses bypass cache and go straight to memory. This has a fairly large performance penalty, of course.


Yup.

Don't. Use. Volatile.

It's meant for reading from hardware registers and nothing else.

There's nothing to prevent the reordering of reads/code around your volatile block. This is doubly nasty in that Windows will insert a memory fence, which makes everything look like you'd want it. As soon as you port to another platform, boom, all sorts of concurrency issues.

Seriously, use atomics. That's what they're built for.


> This is doubly nasty in that Windows will insert a memory fence

Windows is not inserting any memory fences anywhere in your code. MSVC cl compiler will also only do so when you specifically ask it to do so by using an intrinsic.


> Visual Studio interprets the volatile keyword differently depending on the target architecture. ... For architectures other than ARM, if no /volatile compiler option is specified, the compiler performs as if /volatile:ms were specified; therefore, for architectures other than ARM we strongly recommend that you specify /volatile:iso, and use explicit synchronization primitives and compiler intrinsics when you are dealing with memory that is shared across threads.

> /volatile:ms

> Selects Microsoft extended volatile semantics, which add memory ordering guarantees beyond the ISO-standard C++ language. Acquire/release semantics are guaranteed on volatile accesses. However, this option also forces the compiler to generate hardware memory barriers, which might add significant overhead on ARM and other weak memory-ordering architectures. If the compiler targets any platform except ARM, this is default interpretation of volatile.

From https://msdn.microsoft.com/en-us/library/12a04hfd.aspx

Really, just don't use volatile.


I think I really need to read disassembler output of any code I compile with cl. Ugh.

I haven't seen it doing that on any of my code using volatile, mostly for CAS and fetch-and-add (lock xadd).


> A safer bet would be to use __atomic_load and __atomic_store

Safer and an order of magnitude slower.


Sure, but the only way to be sure that a non-atomic store works as intended is to read the assembly output from the compiler.

It's also not very easy to predict the relative difference because it depends on which core(s) the threads get scheduled on. If both threads happen to get on the same core and the data stays in caches, we're talking about 1ns (L1) to 5ns (L2). An atomic load/store would be about 20ns (4..20x slower).

If they happen to get scheduled on different cores, the CPU interconnect will have to deal with its cache coherency protocol (similar to atomic load/store) and the result will be around the same.

We're talking about a pretty minuscule amount of time here. It's a choice between being fast if you're lucky but possibly incorrect anyway, and being marginally slower but guaranteed correct.


Wouldn't this be the case anyway (cache coherency overhead) even if we used types that just-so-happened to be atomic on the platform, such as "int"?

In other words, isn't this a non issue when comparing the performance of atomic types and "regular" types, as they both would require this?

Also, do the atomic types handle ordering?

I could be going crazy, but I remember seeing some of those atomic types simply being typedef'd to "int" in some OS code before.

EDIT: I was thinking of the "atomic_t, atomic_inc, atomic_set, etc..." in the Linux Kernel. It guarantees atomic behavior, but not ordering. Memory barriers are still required when using it between different threads. The C11 types appear to be different, as they accept an ordering argument in their load/store functions.


Many of them actually do provide guarantees (as e.g. evidenced that a typical implementation of an x86 memory barrier is an "lock; addl $0,0(%%rsp)"). That's why there's now stuff like __smp_mb__before_atomic()/__smp_mb__after_atomic(), to avoid unnecessary atomic ops.


No. The way to ensure that your store works atomically is to specify the assembly output of the compiler. Reading what it is outputting right now doesn't matter, because how do you know when it is going to change its mind?


You don't know. That's why it's a terrible solution.

No-one inspects their binaries after every recompilation or compiler (flag) change.


Why would a relaxed load/store be slower? They'll usually just compile down to a mov or whatever the equivalent is?


Do you mean that a compiler optimisation could cause tearing of the mentioned data types? That's surprising to me and I'd like to know more about it.


Not tearing, but it could reorder the assignment so that it happens someplace else than you think it will. For this kind of thread synchronization memory barriers should be used at least.


Exactly.

To give a concrete example (but artificial) what might happen, consider this case:

    global_flag = 0;
    for(this loop takes a long time) { ... } 
    global_flag = 1;
    for(this loop takes a long time) { ... }
    global_flag = 0;
If it is trying to communicate something to another thread (or audio callback) with global_flag, it might fail miserably, because the compiler is free to drop two of the assignments and place `global_flag = 0` wherever it likes in the code. Using `volatile` should prevent this compiler optimization, but it doesn't give any guarantees about cache coherency, etc.

As far as I know, using atomic load and store is the only way (apart from locks) to guarantee correctness and portability.
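For completeness, here's the same flag sketched with C11 `<stdatomic.h>` (assuming a C11 compiler); the three stores can no longer be dropped or migrated, and they carry the ordering guarantees volatile cannot provide:

```c
#include <assert.h>
#include <stdatomic.h>

// The flag from the example above, made atomic: all three stores survive
// optimization and become visible to other threads in program order.
static atomic_int global_flag;

static void phases(void) {
    atomic_store(&global_flag, 0);
    /* ... first long-running loop ... */
    atomic_store(&global_flag, 1);
    /* ... second long-running loop ... */
    atomic_store(&global_flag, 0);
}
```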


Elaborating on your example.

    my_buffer->something = 42;
    global_pointer = my_buffer;
    global_flag = 1;
Even if everything is declared `volatile`, in C or C++, the compiler or the processor can reorder the stores and reads, so that the reading thread can see the "global_pointer = my_buffer" before the "->something = 42". The only safe way to do it is to add the appropriate memory barriers on both the writing and reading sides, which force the compiler and the processor to not reorder the writes/reads.

    my_buffer->something = 42;
    write_barrier(); // Not the actual function; will vary depending on your environment.
    global_pointer = my_buffer;
    write_barrier(); // Not the actual function; will vary depending on your environment.
    global_flag = 1;
And on the reading side, the corresponding read barriers.

It's simpler to use atomic loads and stores or locks, since their implementation already has the required barriers with all the details (quick: what's the difference between an acquire and a release atomic access?).

(Note that this is different in Java, where `volatile` always implies a memory barrier.)


> what's the difference between an acquire and a release atomic access

ACQUIRE prevents reordering any loads and stores from after the barrier to before. RELEASE is the opposite, no load or store before it may happen after the barrier.

ACQUIRE is what you use when locking a spinlock, RELEASE is when you unlock.

Put these incorrectly and you're not guaranteed to have all loads and stores happening when the spinlock is locked, and your program is no longer guaranteed to behave as expected.
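A minimal spinlock sketch showing those two orderings in place (illustrative, not production code):

```c
#include <assert.h>
#include <stdatomic.h>

// Lock uses ACQUIRE so no access in the critical section can float above
// it; unlock uses RELEASE so none can sink below it.
static atomic_flag spin = ATOMIC_FLAG_INIT;

static void spin_lock(void) {
    while (atomic_flag_test_and_set_explicit(&spin, memory_order_acquire))
        ; // busy-wait until the previous holder clears the flag
}

static void spin_unlock(void) {
    atomic_flag_clear_explicit(&spin, memory_order_release);
}
```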


I thought _tearing_ referred to a value being read while in the midst of being updated, so that, for example, some of the read bits comes from the old value and some from the new value.


It does. It's a distinct issue and still very relevant if you're using shared memory without locks to communicate between threads.


I think the extreme case would be a packed struct where a multibyte member happens to span cache lines. This would not be caused by compiler optimization, but by manual deoptimization.


Or just use FAUST: http://faudiostream.sf.net/

Seriously, check it out, it's awesome. Write your audio DSP in a language suited to it, compile to efficient C++ (among other languages) and optionally embed it in a (huge) variety of environments. (even Javascript for webaudio)

It's a functional description language that describes a DSP graph, and can generate block diagrams too. Not to mention, it has libraries containing any filter you can imagine. Highly recommended.


FAUST definitely is worth checking out. Its authors are concerned with "long-term preservation of music programs", which is a great idea; FAUST allows you to do just that. One can write audio DSP code without cluttering the actual audio computation code with implementation/performance stuff like SSE, VST, ... that will be obsolete one day.

Another key point is that FAUST doesn't let you manipulate individual samples or time, but instead focuses on the dataflow, which means you don't have to start all your programs with a time loop; for example, a simple 2-input mixer is just "process = +;". It's a huge improvement over "for(t=0;t < n;t++) out[t] = in1[t] + in2[t];".


You can do something similar (translate to fast C) with Pure Data (Pd) [0], a graphical audio programming language, by using Heavy [1].

[0] http://puredata.info/

[1] https://enzienaudio.com/


Cool, I hadn't heard of Heavy. It had me until "use our cloud-based service to instantly generate high-performance code for x86 and ARM platforms." The day I switch to using a hosted service to compile my code...

Still, nice idea.


> It can take just one glitch during a live performance for a musician to completely lose faith in their whole setup

I was an early adopter of digital vinyl (if you haven't seen it before, the records have a screechy signal that software can use to determine the position of the needle on the record, which it then maps to an audio file).

A friend of mine purchased the most popular unit at the time but it was really unreliable. It crashed once while a club full of people were dancing and that was the end of it for me. I switched to a unit by a small company (Serato) that had just been released (2004) and never looked back. The unit itself still works perfectly, there was a bug back in about 2008 that they tracked down and patched for me.

Apparently the original brand has now caught up technology-wise and they're a big player, but I will never, ever, ever buy their kit. Reliability issues with audio gear can completely destroy your trust.


I'm glad someone mentioned this. There is interesting history around DVS because the first mainstream product "Final Scratch" used a modified Debian Linux OS (I believe the prototypes were on BeOS). This was in around 2000.

I publish an open source DVS called http://xwax.org/. I had the same goals as you describe; it had to be 100% stable before it did anything else. Hopefully it serves as a reasonable example of some of the advice in the article.


I saw a DJ using one of those digital encoded records a few weeks ago. Pretty cool concept, although I'm not sure what the actual difference is between that and using something like Serato which has the smaller wheels you can turn (if I remember correctly), other than being more prone to physical damage.

I guess one advantage is that it looks like they're playing the actual records. It had me fooled for a minute.


Haha, to be fair - they are playing actual records :)

At the time neither the tech nor acceptance were there for the modern tools. While increasingly there were CDJs around, they didn't feel the same for people who did any scratching.

Also, turntables were everywhere you went, so you could take everything you needed in your bag (laptop, 2 x records, little conversion box) and you could just plug into an existing setup, anywhere, without issue.

Digital vinyl bridged the gap as people started moving over. Also, at the time the software only worked with records and cds. Now you have a huge selection of controllers to choose from.

I'm actually about to sell all my stuff to buy a "controller" (something like https://www.google.co.uk/search?q=reloop+terminal+mix+8&tbm=...) so that I have something convenient and always ready to go.


It feels pretty much like playing with vinyl. You can do the same thing with digital rotary controllers but you just don't get the same touch with it. You can also do some scratch tricks with them, which doesn't work that great with digital tools (perhaps it's better now). It's a matter of preference and getting used to. If you've never tried beatmixing vinyl, it's hard to explain... try it if you get a chance, it's super fun! (would be more fun if I didn't suck so bad at it).

For a low-budget hacker solution, you can just obtain the time coded vinyls, use any reasonable USB audio interface with a RIAA preamp and use xwax (open source) or Mixxx (a GUI frontend to xwax, etc) that runs fine on Linux.


I spin records using exactly this system.

You use the digital vinyl because it feels identical to real vinyl. Those dinky little jog wheels on DJ controllers are for the pretenders; they lack the precision and weight to be much use for scratching, and are considerably more finicky for beatmatching. Serato is/was a digital vinyl system first. Those who learned to spin on turntables tend to swear by them; there is a certain tactile aspect that is hard to get elsewhere. (Serato has since pivoted to controllers, since CDJs seem to have taken over the club scene.)


So, I assume none of those popular OS'es has priority inheritance[1].

Even though it's a concept from realtime computing, I thought it would be widespread in general-purpose OSes as well. Really, it seems like a useful feature for any OS that implements priority at all.

What would be the downsides of having it in a general-purpose OS?

The ones I can think of are development cost and processing overhead.

[1] https://en.wikipedia.org/wiki/Priority_inheritance


It's available, but the concept has issues when implemented simplistically on a multi-processor system. (I'm not familiar with the arguments as to why; sorry). You also have to request this behavior explicitly, and I think under certain circumstances it can add latency or additional context switching overhead that might otherwise be avoided between your application threads. Also, in some environments your application has to have elevated privileges to boost thread priority, which carries risk.


Since we seem to have a lot of audio programmers in here, does anyone have an opinion on using non-C/C++ languages for audio development? I've always used C/C++ but newer systems languages like Go and Rust (both have e.g. PortAudio support) seem quite well suited to the task.


Sure, I made a live-codable audio plugin (VST) that runs Lua/LuaJIT. From my experience, the only thing that slows it down is the fact that it uses double floats which are unnecessarily precise. The JIT and memory management are very efficient, especially following some of Mike Pall's recommendations such as scoping all variables to be as short-lived as possible (eg. using do...end blocks to explicitly define scope).

I haven't had the time to work on it in a while but it works: https://github.com/pac-dev/protoplug


So, I mostly use C (not C++) for my audio code. I won't touch anything garbage collected for realtime DSP at all, so Go is right out.

Rust, on the other hand, I am very keen to explore, especially with some of the SIMD work that's been brewing. Somebody made bindings for the VST2.4 API/SDK already, and I've been mulling over putting together a brief proof-of-concept with it: https://github.com/overdrivenpotato/rust-vst2


You'd have to write in a bit of a dialect of Go to make sure you don't get bad GC behavior. It's possible in Go, unlike many scripting languages, to write in an allocate-up-front style that doesn't generate garbage, in pretty much exactly the same way you'd do it in C. If you write in the C style, you get a lot of C-like behavior, only with memory safety. But you don't get much support from the runtime for this style; for instance, if you accidentally append to a slice beyond its capacity, the runtime will simply reallocate the array, and there's no way to ask it to throw an error instead, beyond not using "append" (hence my comment about the "dialect"). ('Course, there isn't really one in C either; arguably reallocation is better than a segfault here, as one merely risks an audio pop at some point whereas the other guarantees it.) There is also no way to prioritize a goroutine.

I think people end up overselling the difficulty here, because people often strike me as speaking as if the 10ms maximum GC delay is actually the minimum or something and as if the GC is running uncontrollably keyed by a random number generator or something, rather than the amount of garbage generated. It's not that hard to imagine an audio application that runs a 50us GC every several minutes or even less often if you preallocate everything. In practice it's not entirely dissimilar to writing in C and avoiding malloc.

But the real level of difficulty is still something that should make you think, especially as you move from "a single filter" to "a full synthesizer suite".

Rust would generally be a better language, as long as you can step up to the somewhat harder language and can find the libraries you need. The payoff for the somewhat harder language is that you'll have much better assurances about many of the relevant issues. It's what I would generally choose for this task, again, given libraries. But if someone did want to use Go because it's a bit simpler of a language, I wouldn't immediately breathe flames on them, unless they were being obviously too ambitious.


> It's possible in Go, unlike many scripting languages, to write in an allocate-up-front style that doesn't generate garbage, in pretty much exactly the same way you'd do it in C

I am very interested in this. Do you have any links with more info?


Basically, the link would be the language spec. Go's structs are "value types". Until you deliberately take a pointer or access them through an interface, they're just structs, like C. It also has real arrays.

In many ways, Go is like a scripting language, but it does let you do some systems-like things in a way that Python or Perl or the other similar languages completely don't.



Go typically relies on garbage collection, which is a no-no. Same with Java. This can be worked around, but you may find that programming a GC language with the GC unavailable is not better than C++ (and it may rule out the use of most third-party libraries).

Rust at a high level could be suitable, when well-tested libraries exist for it.


Audio Weaver is a graphical tool for doing audio dsp. You use a drag-and-drop interface to configure your signal flow graph. When you "run" the system your graph is used to dynamically instantiate the necessary low level functions and compose them together. While the design is running you can do live tuning (changing gains, filter coefficients, delays, etc.). The low level functions have implementations on several architectures (PC, ARM Cortex-A and M, Tensilica HiFi 2 and 3, ADI SHARC, TI C66xx, plus a few more). This means that your signal flow graph is cross platform.

DISCLAIMER - I work for the company that makes Audio Weaver.


I'm pretty new to audio programming, but I'm working on a transcoding library in Rust which calls various C codec/container libraries. I haven't done any extensive testing yet but it seems to perform only a little worse than a C version, and I haven't done any optimization yet.


Java using JNI to hit the audio library, which was written in C, but the majority of the application was Java.


I'm a bit amazed he doesn't even mention [double buffering](https://en.wikipedia.org/wiki/Multiple_buffering), a technique already used in old video games to avoid flicker: draw scenes off-screen and only pass them on to the video stage once the scene is complete.

All you would need here is 2 (or more) copies of the shared data structure and a few pointers to the data structure. You fill in a draft version of the data structure and only change the "read" pointer to point to it when it's ready. Changing that pointer is, I would hope, atomic. You can then change the "write" pointer to point to a different copy of the data structure to work on.

To make sure the client only reads consistent data, it can make a copy of that pointer before it starts processing, because the pointer itself might change.

If you use 3 buffers instead of 2, and you're sure the processing of a buffer takes less time than the switching cycle, you can be sure your buffer data will never change underneath you.
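A rough C11 sketch of that pointer swap (the names and the SharedState contents are mine, just for illustration; the caveat above about needing a third buffer still applies before the writer may safely reuse the buffer it gets back):

```c
#include <stdatomic.h>

/* Hypothetical shared-state type: whatever the UI thread edits and the
 * audio thread reads (parameters, notes held, etc.). */
typedef struct {
    float gain;
    int   notes_held;
} SharedState;

static SharedState buffers[2];                         /* the two copies */
static _Atomic(SharedState *) read_ptr = &buffers[0];  /* audio thread reads */
static SharedState *write_ptr = &buffers[1];           /* UI thread fills */

/* UI thread: fill the draft copy, then publish it with one atomic swap.
 * The "release" half of acq_rel makes the field writes visible before
 * the new pointer can be observed. */
void publish_state(float gain, int notes_held)
{
    write_ptr->gain = gain;
    write_ptr->notes_held = notes_held;
    /* the exchange hands back the old read buffer as the next draft */
    write_ptr = atomic_exchange_explicit(&read_ptr, write_ptr,
                                         memory_order_acq_rel);
}

/* Audio thread: snapshot the pointer once per callback, as suggested
 * above, so the state can't change mid-callback. */
const SharedState *snapshot_state(void)
{
    return atomic_load_explicit(&read_ptr, memory_order_acquire);
}
```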


People don't really talk about 'double buffering' in audio, just 'buffering' or ring buffers. I suppose that's because we're already working in larger (and sometimes flexible) batches of more than two samples.

And it's already assumed by the article; it mentions it in terms of "the system has to deliver n seconds of audio data every n seconds to the audio hardware. If it doesn’t, the buffer runs dry". The article is focused on generating the content that has to go in these buffers.

Because the problem with just increasing buffering is that it adds more latency. In your triple-buffered video game example, the action is now an extra frame behind the double-buffered case.

Hence the focus of audio folks reducing buffers to the minimum; for any audio application that's acting on data from the real world.

For example a simple live audio processor, or musical instrument.

Small delays of milliseconds seem insignificant but they create various artefacts and effects when that software is used alongside others or in a 'live' scenario.

But you are rightly getting at something that is a problem -- the knowledge of using large buffers for audio seems to have been forgotten in some software; there's a modern assumption that realtime threading and tiny buffers are necessary to get stable, glitch-free audio even in trivial playback tools, forgetting that systems did this a long time ago with nice large ring buffers and the ability to flush or re-fill the buffer.


Double buffering isn't very suitable for audio. You're right that a suitably-large buffer would allow non-realtime code to "render" e.g. a tenth of a second of audio and pass it off to the realtime audio thread (and/or the audio subsystem, if that has sufficiently-large buffers on your platform). Unfortunately, such large buffers also introduce significant latency; the added latency is fine if you're a music player, but is pretty nasty for interactive applications.


The audio doesn't need to be double buffered. I think GP means you can avoid the author's examples with locks by double buffering the array of keyboard keys being pressed. You don't need to lock either the UI or audio thread, and can ensure they're both seeing uncorrupted data.


I worked on an audio system where the response to buffers bottoming-out was to add more buffers. We had buffers everywhere; in the apps, in the audio pipeline, in the USB stack. Ran into a glitch? Go from 5 buffers to 7. Numerology was everywhere.

So what happened was that latency was utterly unpredictable, and we had buffers that went dry (starving the output DAC) or filled up anyway.

I'd walked into the project a couple years earlier, and wound up spending six cumulative months getting a handle on all the problems, which were distributed amongst several different code bases, and ultimately involved fixing bugs in every single component, including a really nasty one in the OS hypervisor. It was quite a ride.

Isochronous audio is hell. Sure, y'all are smarter than the average bear, but . . . isochronous audio is hell, and you're gonna need rollerskates. :-)


>and you're gonna need rollerskates. :-)

What does that even mean?


Is this a common technique in audio?

In graphics programming, if you are not done with a frame when it's time to show it, you can just delay it and it won't cause many problems.

In audio, if you are not done processing the data when it should be playing, you WILL hear a glitch. Is double buffering really that useful then? Can't you just maintain a write pointer into a circular buffer that is always ahead of the read pointer?
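Something like this is what I have in mind: a minimal single-producer/single-consumer circular buffer (names and sizes are mine; free-running unsigned indices, so the write pointer can stay ahead of the read pointer without locks):

```c
#include <stdatomic.h>
#include <stddef.h>

#define RING_SIZE 1024              /* power of two so wrap is a cheap mask */

typedef struct {
    float samples[RING_SIZE];
    _Atomic size_t write_idx;       /* only the producer advances this */
    _Atomic size_t read_idx;        /* only the consumer advances this */
} Ring;

/* Producer (non-realtime thread): returns number of samples written. */
size_t ring_write(Ring *r, const float *src, size_t n)
{
    size_t w  = atomic_load_explicit(&r->write_idx, memory_order_relaxed);
    size_t rd = atomic_load_explicit(&r->read_idx, memory_order_acquire);
    size_t free_space = RING_SIZE - (w - rd);
    if (n > free_space) n = free_space;
    for (size_t i = 0; i < n; i++)
        r->samples[(w + i) & (RING_SIZE - 1)] = src[i];
    /* release: the samples are visible before the index moves */
    atomic_store_explicit(&r->write_idx, w + n, memory_order_release);
    return n;
}

/* Consumer (audio callback): returns samples read; never blocks.
 * If it returns less than n, the caller fills the rest with silence. */
size_t ring_read(Ring *r, float *dst, size_t n)
{
    size_t rd = atomic_load_explicit(&r->read_idx, memory_order_relaxed);
    size_t w  = atomic_load_explicit(&r->write_idx, memory_order_acquire);
    size_t avail = w - rd;
    if (n > avail) n = avail;
    for (size_t i = 0; i < n; i++)
        dst[i] = r->samples[(rd + i) & (RING_SIZE - 1)];
    atomic_store_explicit(&r->read_idx, rd + n, memory_order_release);
    return n;
}
```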


If I understand you correctly, you've just increased your output latency by 50% ;)


This is more about safely accessing the shared source data structures (parameters, notes, user input etc) rather than safely accessing the output buffer.


But he does, under another name...


I spent the better part of the last few years working on my own audio apps. It's definitely the most difficult programming domain I've worked in. Real time requirements + threading + low level code makes for a very challenging environment. But it's also a lot of fun. Using the tools the author describes here can save you a lot of headaches and let you focus more on the fun part though.


I wasn't even aware I WAS living in a "post-Audiobus/IAA world". So these are... iPhone apps?


Yeah, the article reads like it's going to be somewhat platform-agnostic until you're down a few paragraphs and it blasts you with iOS stuff as if it was the only platform that mattered for realtime audio. I'm more of a mind that realtime creative audio on iOS is likely to be remain pretty niche.


Great article. Interesting that modifying and reading word-sized values is an atomic operation on ARM. IIRC on x86 this is not the case because values can be cached on the cache of the different cpu cores, and thus be out of sync. Does someone have a more detailed insight into this?


ARMv7 works on deferred consistency between cores and memory. So, even if the write itself is "atomic", which should rather be called tear-free, there is no inherent synchronisation of writes and reads to the same memory location.

C11 and C++11 atomics also guarantee ordering depending on the memory model parameter.

Look up Sutter's talk titled "Atomic Weapons" for more detail.
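The canonical pattern those talks cover, sketched in C11 (hypothetical names): a word-sized store is tear-free on its own, but only the release/acquire pairing guarantees the reader that sees the flag also sees the data.

```c
#include <stdatomic.h>
#include <stdbool.h>

static float       coeff;        /* plain data, written by the UI thread  */
static atomic_bool coeff_ready;  /* word-sized flag: tear-free, but       */
                                 /* tear-free is not the same as ordered  */

/* Writer: a relaxed store of the flag could be reordered before the data
 * write on ARM; "release" forbids that. */
void set_coeff(float c)
{
    coeff = c;
    atomic_store_explicit(&coeff_ready, true, memory_order_release);
}

/* Reader: "acquire" pairs with the release above, so if the flag is
 * seen as set, the data write is guaranteed to be visible too. */
int try_get_coeff(float *out)
{
    if (!atomic_load_explicit(&coeff_ready, memory_order_acquire))
        return 0;
    *out = coeff;
    return 1;
}
```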



> and thus be out of sync

The operation is still atomic. You will not get a partially written word size value.



It kills me that it's 2016 and pulseaudio stutters when I move a window (at least on my system).


"Don’t use Objective-C/Swift on the audio thread."

Nothing wrong with using Swift (the language) to render audio in realtime on the audio thread. I'd change the advice above to: don't send messages to Objective-C objects on your audio thread.


Swift will need to do garbage collection at one time or another and then you are fu*ed.


Swift is reference counted and does not pause at unpredictable times.


Reference counting is a form of garbage collection. You can have avalanches of objects being deallocated as a consequence of one object's refcount reaching 0.

You don't want any allocation or deallocation when you need realtime performance.


An object's refcount being decremented is not an unpredictable time for a pause. If you never allocate or release objects on your audio worker thread, it will never be paused for GC, as the person I was replying to implied could happen.


Sure. But same can be said for pretty much any garbage collected system.

If you don't create objects or cause objects to become unreachable, you also won't be creating garbage. So you won't have GC pauses. Unless the GC system is braindead and runs even when there's no heap pressure.


With a stop-the-world garbage collector, allocations on other threads can result in your audio thread being paused, so you have to not be allocating objects on any thread, which is a significantly more difficult requirement than simply not allocating on your audio thread while audio playback is occurring.


> Any app that does audio has at least two threads of execution: the main thread, and the audio thread

As a side note, I sure wish browsers would hurry up and implement web audio workers so that this could be true for me!


An `AudioContext` (which is the object that lets you do non-trivial audio in the browser) does most of its work on a special, high priority thread, off an audio callback.

Having `AudioContext` in a worker does not get you a lot. It's planned, but not prioritized very high. What happens when you're calling methods in JavaScript on an `AudioContext` is that we simply validate the input and queue messages to the audio thread, so this is effectively exactly what the article is talking about. We use some locking, but we've carefully written the code so that it's not really an issue. We lock just to swap two pointers in a message queue. This could certainly use atomics, but we have more important things to do first.

That said, the interesting thing developers need and want is `AudioWorklets` [0], and that's what we at the W3C Audio WG are working on more or less full-time, these days. It used to be called `AudioWorker`, but got renamed for a lot of reasons [1].

[0]: https://github.com/WebAudio/web-audio-api/wiki/AudioWorklet-... [1]: https://github.com/WebAudio/web-audio-api/issues/532


Thanks, this is very useful! Follow up questions, if you see them:

1. Does what you said include ScriptProcessorNodes?

2. Is this true of all browsers per specification, or true of a particular browser's implementation? My experience with web audio has been that in Chrome things work as you say, but in FF I could never find a webaudio library that played back without heavy clicking, so I supposed audio workers might help.

3. Given what you said, what will be different about Worklets? I scanned the links but I guess I don't follow the bigger picture.


1. No, it's the one thing that has to be on the main thread per spec.

2. It's true for all browsers. ScriptProcessor runs on the main thread and can be glitchier in Firefox (but depending on the use case it can be the opposite). See https://padenot.github.io/web-audio-perf/#scriptprocessornod... for more details.

3. Worklets are bits of JavaScript that run on the audio thread directly, where you're supposed to only implement the DSP bits, and only communicate with the main thread using message passing.


Why don't we have sandbox execution environments in the OS itself?

Why do we have a web browser at all and not just a way of running arbitrary code from a network in a sandbox? If we continue to need the browser to act as an application runner, we'll end up re-imagining the browser as an OS, with workers for anything an OS could do anyway.


> Why don't we have sandbox execution environments in the OS itself?

Well, if you try, you wind up with something that looks an awful lot like a web browser - especially once WebAssembly is a thing. The API that many desktop operating systems provide is not designed for the security model that you're looking for, so you wind up building a new one on top of it - see WinRT when Microsoft needed a sandbox.

We used to have things like Java Web Start (still do in some enterprise systems), and, well, it's not exactly any better than a web browser except that it has a better view layer for applications than HTML/CSS. It's also not supported on mobile platforms.


> Why don't we have sandbox execution environments in the OS itself?

We do, that is how mainframes work.

It is also available in Mac OS X, Windows and mobile OSes, one just needs to make use of the respective APIs.

As for the GNU/Linux and *BSD I am out of touch how good the current sandbox support for cnames, Lx and others actually is.


It was called Java and it never lived up to the promise. The browser may not be the app deployment system we want, but it is certainly the one we deserve


> If it doesn’t, the buffer runs dry, and the user hears a nasty glitch or crackle: that’s the hard transition from audio, to silence.

It would be awesome if we could prevent this crackle somehow on a lower level of abstraction. What I mean by that is that if the buffer runs dry, the hardware (or the OS/audio driver) could do some prediction in order to bridge any gaps in the audio more nicely.


It's certainly true that different hardware handles buffer underruns differently: my high-quality RME interface emits fairly muted pops when the buffer runs dry, whereas I've had cheaper units make absolutely horrendous noises as the speakers flail around trying to cope with the signal discontinuities. The RME is interpolating the signal towards a zero crossing.


Would it be even better to smoothly join with the zero level instead of coming to a sudden halt?


That's what I mean by interpolating towards a zero crossing.


Sorry, my wording was unclear. I mean asymptotically approaching zero, as opposed to, say, going to zero in a straight line and then switching to a constant zero once you get there (thereby creating a discontinuous derivative).


Indeed, I imagine that is what is going on. The straight line case is what the loudspeaker driver will do in reality if you send it a discontinuity, and that's what produces the pop.

Smooth interpolation will avoid a really nasty pop, but in the real world, musical waveforms are highly complex, so any interpolation algorithm, however smooth, will produce some kind of artefact if you chop the wave in the middle of a cycle and smooth it to zero.

This can be observed when setting loop points in a sampler - you are usually provided with tools to help you match the loop points to the zero crossings. This is not enough however to remove all artefacts. Only some zero crossings will do: one has to match the higher-order cycles in the waveform as well. I don't really have the mathematical vocabulary to really describe what I mean here, but hopefully it's clear.

(BTW when I say driver in these posts I mean the magnet-and-cardboard-cone assembly in the speaker, not any kind of software.)


That's not how sound works. Think of a vibrating object. If it stops vibrating suddenly, slowly moving it to its center point isn't going to make the absence of vibration any less jarring.


Indeed. I think that's what I'm trying to describe in my sibling comment to yours.

However, what it will successfully avoid is the loudspeaker driver attempting to instantaneously snap from some nonzero x-position back to its origin, which is what causes the really nasty clicks.


Ultimately it's not solving anything, because even with some faking it's still not the audio the performer wants to hear. And, besides, gap-free audio software is a solved problem. The article explains how to do it.


A lot of audio hardware does this - it switches the audio off only when the signal is at a zero crossing. So it can be implemented by the OS by telling the codec that the zero value is an underrun rather than a desired signal.


I solved that problem years ago by keeping a canned piece of white noise in a const buffer that I would switch to if there was an underrun, with an immediate ramped gain reduction to zero. The results were quite good!
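Roughly like this (a reconstruction from memory, not the original code): play a pre-generated noise table with a gain that ramps linearly to zero over the concealment buffer.

```c
#include <stddef.h>

/* Fill `out` with `frames` samples of fading noise when an underrun is
 * detected. `noise` is the canned const buffer, filled once at startup
 * (e.g. from rand()); a static cursor keeps the noise phase across calls. */
void conceal_underrun(const float *noise, size_t noise_len,
                      float *out, size_t frames)
{
    static size_t pos = 0;
    float step = frames ? 1.0f / (float)frames : 0.0f;
    for (size_t i = 0; i < frames; i++) {
        float gain = 1.0f - step * (float)i;   /* ramp 1.0 down towards 0 */
        out[i] = noise[pos] * gain;
        pos = (pos + 1) % noise_len;
    }
}
```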


Interesting. So what would happen if a sine wave of, say, 1 kHz is suddenly shut off by a buffer underrun?


It would be cool if the hardware did some Fourier analysis to resample the buffers it gets (so it could make them run for longer if the buffers run dry), but it probably would cause some kind of latency issue. I reckon that just avoiding buffer underruns is less overhead.


I had to cut off an audio file abruptly once, and I found that I could smooth out the abruptness by quickly fading in a reverberated version of the audio, just as the original audio was about to end, and then letting it ring for a fraction of a second after the original had ended.

(I say "fading in", but it might have been that I had the reverb applied but dry, and transitioned to wet just before the signal ended.)


A colleague of mine actually patented this technique: http://www.google.com/patents/US8538038


:-/


Yep, this should not be a patent. It's a trivial solution to the problem that could be devised by any sound engineer!


If you have any evidence of prior art predating that patent (I really hope you do), then you can topple that patent. If your code to do what you said isn't free software, I'd recommend you release it as free software now.


I can't check right now, but it's possible that the patent predates my use of the technique. Even if not, I don't know how I could prove it.

By the way, it wasn't done in code; I did it manually in Ardour [0].

[0] <https://ardour.org/>


This seems like the most sensible thing to do, and was actually what I was getting at :)

What do you mean by latency issues? Why would there be any?


Quoting from the Waldorf microwave XT synthesizer FAQ:

The brightness of the click depends on the speed of the level change. The faster the level changes, the brighter is the click. So, the level change speed can be compared with the cutoff of a lowpass filter. There is an easy formula for it:

Let's consider a level change from full to zero (or from zero to full) output from one sample to another on a machine that uses a 44.1kHz sample rate. So, we first convert the sample to milliseconds:

1 sample equals 1/44100 second, which is = 0.02267573696ms.

To calculate the cutoff frequency of the click, just use this formula:

Cutoff (Hz) = 1000 / Level Change Time (ms)

which in the example results in:

44100Hz = 1000 / 0.02267573696ms

Whoops? This is the sampling frequency and, err, very bright.

http://faq.waldorfian.info/faq-browse.php?product=xt#116
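In code, the rule of thumb and its inverse (how long a ramp you need to keep the "click" below a given brightness) are simply:

```c
/* The FAQ's rule of thumb: the perceived brightness of a click acts like
 * a lowpass cutoff of  f (Hz) = 1000 / t  with t the level-change time
 * in milliseconds. */
double click_cutoff_hz(double level_change_ms)
{
    return 1000.0 / level_change_ms;
}

/* Inverted: the ramp time needed to keep a gain change below a target
 * brightness. */
double ramp_ms_for_cutoff(double cutoff_hz)
{
    return 1000.0 / cutoff_hz;
}
```

e.g. a single-sample change at 44.1kHz comes out at 44100 Hz as the FAQ says, while aiming for a subsonic 20 Hz "click" means ramping over 50 ms.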


Waldorf seems like a cool company. It makes me smile to see that they went into such detail in the manual. Plus the Microwave XT sounds amazing.


With zero crossing detection, it would need to know the buffer is running out before the next zero crossing occurs. So if the shutdown signal is given within the last millisecond before the zero crossing, it would ignore all data that comes after the zero, and output zero for the remaining partial wave. On starting up again, it would wait until the signal is at zero before beginning to output. At least that's how the common audio codecs I've seen that implement that feature work. The most common application is volume control, where you don't want a sudden change in the amplitude of the wave to result in a glitch, so you adjust amplitude at zero crossings.
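A sketch of that gated volume change (hypothetical names, single channel): the requested gain is held as a pending target and only applied the moment the signal crosses zero, so the amplitude never jumps mid-cycle.

```c
#include <stddef.h>

typedef struct {
    float gain;         /* gain currently applied         */
    float target;       /* requested gain, applied later  */
    float prev_sample;  /* to detect sign changes         */
} ZcGate;

void zc_set_gain(ZcGate *g, float target) { g->target = target; }

void zc_process(ZcGate *g, float *buf, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* a zero crossing: the sign flipped, or we hit exactly zero */
        if (g->gain != g->target &&
            (buf[i] == 0.0f || (buf[i] > 0.0f) != (g->prev_sample > 0.0f)))
            g->gain = g->target;
        g->prev_sample = buf[i];    /* remember the unscaled input */
        buf[i] *= g->gain;
    }
}
```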


Some network protocols for live audio performances do this.

Filling the audio buffers with some predicted blocks of audio to avoid an ugly sounding gap.

See for instance in this Thesis: "Low-Latency Audio over IP on embedded IP systems" http://www.ti5.tu-harburg.de/staff/meier/master/meier_audio_...

Sect 4.1.2 Packet Loss Handling



I wouldn't say 'better', they're both useful.


Another common mistake: off-by-one errors. Even a single zero or duplicate sample value is clearly audible! It's amazing how sensitive we are to audio artifacts.


That's not necessarily an off-by-one error, but a broader problem of audio synthesis, which you can generally deal with by applying low-pass filtering. For example, depending on the starting phase of a sine oscillator multiplied by a simple linear volume envelope (1.0 -> 0.0), you will get a click at the first sample, which is loudest at odd multiples of pi/2 for the starting phase (cos). So you need to low-pass filter the envelope to get a quick fade-in of the oscillator without a click. How you do this and how it sounds is up to you. The original Hammond organ similarly got some nice analogue clicks as initial transients because of this 'problem' as a side effect, btw; in digital it's much harder and more expensive to make nice curves. De-clicking an audio engine properly is certainly an important and sometimes frustrating task.
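One common way to do that low-pass filtering of the envelope is a one-pole smoother on the gain signal; a minimal sketch (made-up names; a coeff of roughly 0.001-0.01 at 44.1kHz gives a fade of a few milliseconds):

```c
#include <stddef.h>

typedef struct {
    float current;  /* gain actually applied                      */
    float coeff;    /* smoothing amount per sample, 0 < coeff < 1 */
} SmoothGain;

/* Instead of jumping straight to the envelope's value, the applied gain
 * moves a fixed fraction of the remaining distance each sample, which
 * rounds off the step that would otherwise click. */
float smooth_next(SmoothGain *s, float target)
{
    s->current += s->coeff * (target - s->current);
    return s->current;
}

/* e.g. applying an instantaneous 0 -> 1 envelope step to a buffer: */
void apply_envelope(SmoothGain *s, float *buf, size_t n, float target)
{
    for (size_t i = 0; i < n; i++)
        buf[i] *= smooth_next(s, target);
}
```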


What song is in the 4-second sound clip? It sounds so familiar, like something from Daft Punk.


The Firefly references in this article are on point.


This is also true of animation threads.


I remember using a circular buffer for a MOD/S3M player I wrote many moons ago. I think you called a hardware interrupt to enable the Gravis or Soundblaster to read from a block of memory and send it out to the speakers at the right frequency, and then made sure that you fed enough data into the buffer. It wasn't even concurrent - every loop, just mix enough of the track to fill the buffer and write it. Simpler days...


> It wasn't even concurrent - every loop, just mix enough of the track to fill the buffer and write it.

This probably works but the proper way to do it back then (and the way drivers work today) was to hook an interrupt that gets fired when the audio buffer needs to be filled.

So there's an element of concurrency in it since you don't know when the IRQ is going to fire. Synchronization was easier with single cores, though. Just disable interrupts and you're done.


> This probably works

No probably about it - it worked like a charm.

> but the proper way to do it back then (and the way drivers work today) was to hook an interrupt that gets fired when the audio buffer needs to be filled.

Which was much more trouble than it was worth when writing a 64k intro on a deadline.


How will WebAssembly affect audio development in the browser?



