
I could type up a thorough explanation, but it would take about an hour, and I have a lot to do. It is actually not a bad idea to do such a write-up, but I don't think the appropriate venue for it is an ephemeral post on Hacker News ... I'd rather blog it somewhere that's more suitable for long-term reference.

But I'll drop a few hints. First of all, nobody is talking about running interrupts at 48kHz. That is complete nonsense.

The central problem to solve is that you have two loops running and they need to be coordinated: the hardware is running in a loop generating samples, and the software is running in a (much more complicated) loop consuming samples. The question is how to coordinate the passing of data between these with minimal latency and maximum flexibility.

If you force things to fill fixed-size buffers before letting the software see them (say, 480 samples or whatever), then it is easy to see problems with latency and variance: simply look at a software loop with some ideal fixed frame time T and look at what happens when 1/T is not 100Hz. (Let's say it is a hard 60Hz, such as on a current game console.) See what happens in terms of latency and variance when the hardware is passing you packets every 10ms and you are asking for them every 16.7ms.
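To make the 10ms-versus-16.7ms mismatch concrete, here is a rough simulation sketch. It assumes an idealized model in which the hardware delivers one complete block exactly every 1/100 s and the software polls exactly every 1/60 s; the function name `simulate_push` and its parameters are made up for illustration. For each software frame it reports how many complete blocks arrived since the previous frame, and how much already-captured audio is still hidden inside the partially filled block.

```python
from fractions import Fraction


def simulate_push(block_hz=100, frame_hz=60, frames=6):
    """Idealized push model: fixed-size blocks meet a fixed-rate software loop.

    Returns one (new_blocks, hidden_ms) tuple per software frame:
      new_blocks: complete blocks delivered since the previous frame
      hidden_ms:  audio already captured but not yet visible to the
                  software, because its block is not full yet
    """
    block_t = Fraction(1, block_hz)   # duration of one hardware block
    frame_t = Fraction(1, frame_hz)   # duration of one software frame
    rows = []
    blocks_seen = 0
    for n in range(1, frames + 1):
        now = n * frame_t
        blocks_done = now // block_t            # complete blocks so far
        new_blocks = blocks_done - blocks_seen
        hidden_ms = float((now - blocks_done * block_t) * 1000)
        rows.append((new_blocks, hidden_ms))
        blocks_seen = blocks_done
    return rows


if __name__ == "__main__":
    # With the defaults, the block count per frame is 1, 2, 2, 1, 2, 2.
    for new_blocks, hidden_ms in simulate_push():
        print(f"blocks={new_blocks} hidden={hidden_ms:.2f} ms")
```

In this model the amount of data per frame varies (1, 2, 2, ...) and up to about 6.7ms of captured audio is invisible at read time. Calling `simulate_push(frame_hz=100)` gives one block and nothing hidden on every frame, which is the special case where the two rates happen to match.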

The key is to remove one of these fixed frequencies so that you don't have this problem. Since the one coming from the hardware is completely fictitious, that is the one to remove. Instead of pushing data to the software every 10ms, you let the software pull data at whatever rate it is ready to handle that data, thus giving you a system with only one coarse-grained component, which minimizes latency.

You are not running interrupts at 48kHz or ten billion terahertz, you are running them exactly when the application needs them, which in this case is 16.7ms (but might be 8.3ms or 10ms or a variable frame rate).

You don't have to recompute any of the filters in your front-end software based on changing amounts of data coming in from the driver. The very suggestion is nonsense; if you are doing that, it is a clear sign that your audio processing is terrible because there is a dependency between chunk size and output data. It should be obvious that your output should be a function of the input waveform only. To achieve this, you just save up old samples after you have played them, and run your filter over those plus the new samples. None of this has anything to do with what comes in from the driver when and how big.
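A minimal sketch of the "save up old samples" idea, using a plain FIR filter as the example (the class name `StreamingFIR` and the choice of FIR are assumptions for illustration, not something from the comment above). The filter keeps the last `len(taps) - 1` input samples between calls, so the output depends only on the input waveform and not on how it was chopped into chunks.

```python
class StreamingFIR:
    """FIR filter whose output does not depend on chunk boundaries."""

    def __init__(self, taps):
        self.taps = list(taps)
        # Samples from before the start of the stream are treated as silence.
        self.history = [0.0] * (len(self.taps) - 1)

    def process(self, chunk):
        """Filter a chunk of any length (including empty) and return the output."""
        chunk = list(chunk)
        buf = self.history + chunk
        n = len(self.taps)
        out = []
        for i in range(len(chunk)):
            window = buf[i:i + n]          # the n most recent samples, oldest first
            # taps[0] weights the newest sample, taps[-1] the oldest.
            out.append(sum(t * x for t, x in zip(self.taps, reversed(window))))
        # Keep the tail of the input so the next call can see across the boundary.
        self.history = buf[len(buf) - (n - 1):]
        return out


if __name__ == "__main__":
    signal = [float(i % 5) for i in range(20)]
    taps = [0.25, 0.5, 0.25]

    whole = StreamingFIR(taps).process(signal)

    chunked_filter = StreamingFIR(taps)
    chunked = []
    for start, end in [(0, 1), (1, 8), (8, 8), (8, 11), (11, 20)]:
        chunked += chunked_filter.process(signal[start:end])

    print(whole == chunked)
```

The same pattern applies to IIR filters and other stateful processing: carry the state across calls instead of resetting it per chunk.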

Edit: I should point out, by the way, that this extends to purely software-interface issues. Any audio issue where the paradigm is "give the API a callback and it will get called once in a while with samples" is terrible for multiple reasons, at least one of which is explained above. I talked to the SDL guys about this and to their credit they saw the problem immediately and SDL2 now has an application-pull way to get samples (I don't know how well it is supported on various platforms, or whether it is just a wrapper over the thread thing though, which would be Not Very Good.)

> Any audio issue where the paradigm is "give the API a callback and it will get called once in a while with samples" is terrible for multiple reasons

This is, actually, how most professional audio APIs are designed and they generally work quite well. ASIO, VST, JACK, PortAudio, CoreAudio, etc.

The other commenter was talking about audio software that both consumes and produces samples at a fixed rate. Clearly, if the audio software is late grabbing 64 samples from the input device, it's also late delivering the next 64 to the output device, and there will be a dropout. The output sample clock has to be the timing master, and the software can never be late, and since it's also waiting for the input audio, it can never be early enough to "get ahead", either.

I am not sure we can make the assumption that the input and output devices are on the same clocks or run at the same rates. Maybe they are (in a good system you'd hope they would be), but I can think of a lot of cases where that wouldn't be true.

However, even when they are synced, you can still easily see the problem. The software is never going to be able to do its job in zero time, so we always take a delay of at least one buffer-size in the software. If the software is good and amazing (and does not use a garbage collector, for example) we will take only one delay between input and output. So our latency is directly proportional to the buffer size: smaller buffer, less latency. (That delay is actually at least 3x the duration represented by the buffer size, because you have to fill the input buffer, take your 1-buffer's-worth-of-time delay in the software, then fill the output buffer).

So in this specific case you might tend toward an architecture where samples get pushed to the software and the software just acts as an event handler for the samples. That's fine, except if the software also needs to do graphics or complex simulation, that event-handler model falls apart really quickly and it is just better to do it the other way. (If you are not doing complex simulation, maybe your audio happens in one thread and the main program that is doing rendering, etc. just pokes occasional control values into that thread as the user presses keys. If you are doing complex simulation like a game, VR, etc., then whatever is producing your audio has to have a much more thorough conversation with the state held by the main thread.)

If you want to tend toward a buffered-chunk-of-samples-architecture, for some particular problem set that may make sense, but it also becomes obvious that you want that size to be very small. Not, for example, 480 samples. (A 10-millisecond buffer in the case discussed above implies at least a 30-millisecond latency).
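The arithmetic behind the "at least 3x the buffer duration" estimate can be written out as a small sketch. It assumes the three-stage model described above (fill the input buffer, one buffer of processing delay, fill the output buffer) and a 48kHz sample rate; the function name is made up for illustration.

```python
def min_round_trip_ms(buffer_samples, sample_rate_hz=48000, stages=3):
    """Lower bound on input-to-output latency in milliseconds.

    stages=3 models: fill input buffer + one buffer of software delay
    + fill output buffer. Real systems may add more on top of this.
    """
    buffer_ms = buffer_samples * 1000 / sample_rate_hz
    return stages * buffer_ms


if __name__ == "__main__":
    print(min_round_trip_ms(480))  # 480 samples = 10 ms per buffer -> 30.0
    print(min_round_trip_ms(64))   # 64 samples -> 4.0
```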

Music or video production studios typically have a central clock, so for this use-case the sample rates should be perfect. But even if the input and output devices are on perfect clocks, with NTSC (59.94 Hz), you'd need a very odd number of samples per video frame in your software, if your processing would happen at an integer fraction of the video frame rate.
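To put numbers on the "very odd number of samples per video frame", here is a small sketch using exact fractions. It assumes 59.94 Hz means exactly 60000/1001 frames per second; the function names are made up for illustration. The last line is pure arithmetic for a sample rate pulled down by the same 1.001 factor, not a claim about what studios actually use.

```python
import math
from fractions import Fraction

NTSC_60 = Fraction(60000, 1001)   # "59.94 Hz", assumed to be exactly 60000/1001


def samples_per_frame(sample_rate, fps):
    """Exact (possibly fractional) number of audio samples per video frame."""
    return Fraction(sample_rate) / fps


def frame_schedule(sample_rate, fps, frames):
    """Whole-sample counts per frame that never drift from the exact ratio."""
    spf = samples_per_frame(sample_rate, fps)
    return [math.floor((k + 1) * spf) - math.floor(k * spf) for k in range(frames)]


if __name__ == "__main__":
    print(samples_per_frame(48000, NTSC_60))        # 4004/5, i.e. 800.8
    print(frame_schedule(48000, NTSC_60, 5))        # [800, 801, 801, 801, 801]
    print(samples_per_frame(Fraction(48000000, 1001), NTSC_60))  # 800
```

So at 48000 Hz the per-frame sample count only comes out whole over a run of 5 frames (4004 samples), which is why a fixed block size tied to the video frame does not fit cleanly.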

Do you know whether studios use 48000Hz with 59.94fps or 48000/1.001 ≈ 47952Hz? Does converting from 24fps film to 23.976fps Blu-ray require resampling the audio? Or are films recorded at 48048Hz and then slowed to 48000 for consumer release?

The short answer is that it's complicated. Digital film (DCP) is typically 24 fps, IIRC -- and that doesn't go well into 60, or 50. And the difference is enough that you need to drop a frame and/or stretch the audio. And sometimes this doesn't go so well.

There's a relatively recent trend to try and record digital all the way, and this is also complicated. Record at 24 fps? At 30? At 60? 60 fps 4k is a lot of data. And sound is actually the major pain point -- video frames you can generally just drop/double, speed up/down a little to even things out. But 24 fps to 60 fps creates big enough gaps that audio pitch can become an issue.

If everything happens strictly synchronous to your audio clock, then fixed block processing is the way to go.

But jblow is right in that when you have to feed in samples from a non-synchronized source into your processing/game/video-application/... then trying to work with the fixed audio block size will be terrible/require additional synchronization somewhere else, such as an adaptive resampler on the input/output of your "main loop".

> Since the one coming from the hardware is completely fictitious

Why do you say this? The USB audio card (or similar) is generating blocks of audio at a fixed rate, no?

Maybe for video playback or games you need to synchronize audio and video, but there is no need to do that for music production apps.

If you are writing some sort of synth, as soon as you receive a MIDI note or a tap, trigger the synth and the note will play in the next audio block. No need to wait for the GUI to update.

If you are doing some sort of effect, grab the input data, process and have it ready for the next block out. I don't understand why you need a second loop.

Well, it depends on how that specific hardware is designed, but we could say that hardware that is designed to generate only fixed blocks of audio is very poor from a latency perspective.

I think you will find, though, that most hardware isn't this way, and to the extent this problem exists, it is usually an API or driver model problem.

If you're talking about a sound card for a PC, probably it is filling a ring buffer and it's the operating system (or application)'s job to DMA the samples before the ring buffer fills up, but how many samples is dependent upon when you do the transfer. But the hardware side of things is not something I know much about.
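A toy sketch of that ring-buffer arrangement, under loud assumptions: the "hardware" is simulated by a `hw_write` method, there is no real DMA or threading, and the class name `CaptureRing` is made up for illustration. The point it shows is that the consumer pulls whatever has accumulated, so the number of samples per read depends only on when the read happens.

```python
class CaptureRing:
    """Toy capture ring buffer: producer writes continuously, consumer pulls on demand."""

    def __init__(self, capacity):
        self.buf = [0.0] * capacity
        self.capacity = capacity
        self.write_pos = 0   # total samples ever written by the (simulated) hardware
        self.read_pos = 0    # total samples ever consumed by the software

    def hw_write(self, samples):
        """Stand-in for the hardware side filling the ring."""
        for s in samples:
            self.buf[self.write_pos % self.capacity] = s
            self.write_pos += 1

    def available(self):
        return self.write_pos - self.read_pos

    def pull(self):
        """Return every sample written since the last pull, however many that is."""
        n = self.available()
        if n > self.capacity:
            # The software waited too long and unread samples were overwritten.
            raise OverflowError("ring buffer overrun")
        out = [self.buf[(self.read_pos + i) % self.capacity] for i in range(n)]
        self.read_pos = self.write_pos
        return out


if __name__ == "__main__":
    ring = CaptureRing(capacity=8)
    ring.hw_write([1, 2, 3])
    print(ring.pull())          # short wait, few samples
    ring.hw_write([4, 5, 6, 7, 8, 9, 10])
    print(ring.pull())          # longer wait, more samples (wraps around)
    print(ring.pull())          # nothing new yet
```

The capacity only has to cover the longest gap the software will ever leave between pulls; it does not dictate the granularity at which the software sees data.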

> If you are writing some sort of synth, as soon as you receive a midi note or a tap, trigger the synth and the note will play in the next audio block

Yeah, and waiting for "the next audio block" to start is additional latency that you shouldn't have to suffer.

> If you are doing some sort of effect, grab the input data, process and have it ready for the next block out. I don't understand why you need a second loop.

The block of audio data you are postulating is the result of one of the loops: the loop in the audio driver that fills the block and then issues the block to user level when the block is full. My whole point is you almost never want to do it that way.

Can you recommend some good code / APIs to check out that don't do it block based? I usually use JUCE which is block based, and I assumed it was just a thin wrapper around the OS APIs which are also block based.

If you want a further analogy, it's like public transit. Which is a better commute: you take Bus A, which then drops you off at the stop for Bus B, at which you have to wait a varying and indeterminate amount of time, because the schedules for Bus A and Bus B are not synchronized; or you just take Bus C, which travels the same route without stopping?
