What I don't understand, is why the upstream audio doesn't just buffer while the downstream thing processing it is blocked. Why should that result in audible artifacts, can't it just catch up with the rest of the buffer later?
Buffer overruns feel very 1996-cd-burner-ish. Ope, burned a coaster, let's try this hellaciously real-time-bound thing again with inadequate buffering and I/O devices that have unpredictable latency.
If you (a) need low latency and (b) have hardware with unpredictable/unreliable latency behavior, you're screwed by definition.
If you can manage with large latency, then sure, buffer the hell out of everything and your chances of losing data will be close to zero.
If you need low latency, you cannot buffer the hell out of everything, and anything that interferes with the data flow runs the risk of causing an underrun.
The solution in the CD burner case was easy: massive latency is fine, so buffer the hell out of it, and it just works.
The solution for low-latency audio anything is not so easy: you can't buffer the hell out of it, which means you're susceptible to hardware and software issues that disrupt timing and end up causing underruns/overruns.
> What I don't understand, is why the upstream audio doesn't just buffer while the downstream thing processing it is blocked. Why should that result in audible artifacts, can't it just catch up with the rest of the buffer later?
So you make your buffer larger. But the thing consuming the data is still slower than the thing producing the data, so that buffer will eventually overrun, too. The graceful way to fail is not to block, but to quickly drop the excess data. However, you're now losing data, and doing so very unexpectedly overall, because apparently this transcription service was supposed to be well able to keep up with the audio stream. (If I understood this whole thing right, it's a bit vague.)
So really, unless the root cause is taken care of, you either block or lose data, without good reason in this case.
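To put rough numbers on "that buffer will eventually overrun, too", here is a minimal sketch of a bounded buffer with a consumer that is slightly too slow. The function name, the tick model, and the rates are all made up for illustration; nothing here comes from the actual system being discussed.

```python
def ticks_until_overrun(capacity, produced_per_tick, consumed_per_tick):
    """Return the tick at which a bounded buffer first overflows, or None.

    Hypothetical model: each tick the producer adds a fixed amount,
    then the consumer removes a fixed amount (never more than is there).
    """
    if produced_per_tick <= consumed_per_tick:
        return None  # the consumer keeps up, so the buffer never fills
    fill = 0
    tick = 0
    while True:
        tick += 1
        fill += produced_per_tick
        if fill > capacity:
            return tick  # excess data must now be dropped (or the producer blocked)
        fill -= min(fill, consumed_per_tick)


if __name__ == "__main__":
    # Producer at 10 units/tick, consumer at 9 units/tick (assumed rates).
    for capacity in (100, 1_000, 10_000):
        print(capacity, ticks_until_overrun(capacity, 10, 9))
```

In this model a 10x bigger buffer only buys roughly 10x more time before the first drop. It postpones the overrun, it doesn't prevent it.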
> Buffer overruns feel very 1996-cd-burner-ish.
That was a buffer underrun, the opposite. The CD burner, moving at a fixed speed, ran out of data because the producer (the rest of the PC, effectively) was not able to replenish the buffer quickly enough. The fix, besides faster PCs, was to have very large buffers that could pick up the slack for longer than the duration of a "dropout" on the PC side. Between those "dropouts", the PC was probably well able to deliver data to the CD burner at a higher rate than necessary, so the large buffer stayed filled on average.
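A toy simulation of the underrun case, as a sketch under made-up assumptions: the consumer (the "burner") drains one unit every tick no matter what, while the producer stalls for the first part of every cycle and otherwise delivers faster than the drain rate. All names and numbers are hypothetical.

```python
def count_underruns(capacity, stall_ticks, burst_rate,
                    drain_rate=1, cycle=100, total_ticks=1000):
    """Count ticks on which a fixed-rate consumer finds the buffer empty.

    Hypothetical model: the buffer starts full. In every cycle the
    producer delivers nothing for the first `stall_ticks` ticks (the
    "dropout") and `burst_rate` units per tick afterwards.
    """
    fill = capacity
    underruns = 0
    for tick in range(total_ticks):
        stalled = (tick % cycle) < stall_ticks
        if not stalled:
            fill = min(capacity, fill + burst_rate)  # refill, capped at capacity
        if fill >= drain_rate:
            fill -= drain_rate  # the consumer gets its data on time
        else:
            underruns += 1      # nothing to write: this is the coaster
    return underruns


if __name__ == "__main__":
    # Same dropout pattern, two buffer sizes.
    print(count_underruns(capacity=10, stall_ticks=20, burst_rate=2))
    print(count_underruns(capacity=32, stall_ticks=20, burst_rate=2))
```

Once the buffer holds more than the longest dropout's worth of data, and the producer is faster than the consumer between dropouts, the underruns go away. The price is the extra latency, which didn't matter for burning a CD.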
Both buffer overruns and underruns are still very much a concern. They exist wherever producers and consumers operate at mismatched rates for more than just a short time, in a real-time setting.
It's remote meeting infrastructure, so latency is critical. When burning CDs or playing music, it's OK to have a second or two of buffer. On a conference call, a second of buffer means a second of latency in each direction, so you ask a question and hear the response about 2 seconds later, which is a pretty bad experience. That's why conference software tries to keep latency as low as possible.
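The buffer-to-latency arithmetic, as a small sketch. The 48 kHz sample rate and the frame counts are assumptions picked for illustration, not values from the system in question.

```python
def buffer_latency_ms(frames, sample_rate_hz):
    """Delay added by a full buffer of `frames` audio frames, in milliseconds."""
    return 1000.0 * frames / sample_rate_hz


def round_trip_ms(frames, sample_rate_hz):
    """Question-and-answer delay if each direction has the same buffer."""
    return 2 * buffer_latency_ms(frames, sample_rate_hz)


if __name__ == "__main__":
    rate = 48_000  # assumed sample rate in Hz
    for frames in (480, 4_800, 48_000):
        print(frames, buffer_latency_ms(frames, rate), round_trip_ms(frames, rate))
```

A buffer of one second's worth of frames costs a second each way, so the low-latency case has to live with buffers small enough that a short stall can drain or fill them.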
(Now, why does it produce a pop as opposed to silence/hiccup/stretched sound? Probably because it was easiest to code.)
Even if the buffer is large, at some point (i.e. in a long enough meeting in this case) it will fill up if the consumer stays slower than the producer.
> (Now, why does it produce a pop as opposed to silence/hiccup/stretched sound? Probably because it was easiest to code.)
Sudden "silence" pretty much is a pop, and so is the silence suddenly ending. The sharp transition at least theoretically contains energy in all frequencies (or rather the full bandwidth of this bandwidth-limited signal), which we perceive as a pop.
Bang a steel bar against a hard table, and you get a whole range of frequencies as well, also very pop-like. Do the same with a tuning fork, and after the initial bang you get a nice, clean, single tone, because the tuning fork effectively filters out all the other frequencies through its impulse response.
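A quick way to see the "sharp transition spreads energy across the band" point, sketched with a plain DFT so it needs no libraries. The tone frequency, block length, and gap position are arbitrary assumptions; the helper names are made up.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Magnitudes of DFT bins 0 .. N/2 - 1 (naive O(N^2) DFT)."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(samples)))
        for k in range(n // 2)
    ]


def off_tone_energy_fraction(samples, tone_bin):
    """Fraction of spectral energy that lies outside the tone's own bin."""
    mags = dft_magnitudes(samples)
    total = sum(m * m for m in mags)
    off = sum(m * m for k, m in enumerate(mags) if k != tone_bin)
    return off / total


N = 256        # block length (assumed)
TONE_BIN = 8   # the tone completes exactly 8 cycles per block (assumed)

clean = [math.sin(2 * math.pi * TONE_BIN * i / N) for i in range(N)]
# Same tone, but hard-zeroed for a stretch, as if the buffer ran dry.
dropout = [0.0 if 100 <= i < 160 else s for i, s in enumerate(clean)]

if __name__ == "__main__":
    print(off_tone_energy_fraction(clean, TONE_BIN))
    print(off_tone_energy_fraction(dropout, TONE_BIN))
```

The clean tone keeps essentially all of its energy in one bin, while the version with the abrupt gap leaks a noticeable share into other frequencies. That leaked energy is what gets heard as the click or pop.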
It's possible to handle buffer underruns more elegantly than that, but it does require more processing power on the receive side of the buffer (basically by using some strategy to extrapolate the audio forward and decay, as opposed to just dropping the signal to zero when there's no data coming from the other side). It's a common thing to do in streaming audio contexts, especially voice, but generally at the end of the user's network connection which is presumed to be unreliable, not in the middle of a processing pipeline which is presumed to be able to hit its latency targets.
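For a flavor of what "extrapolate forward and decay" can look like, here is a deliberately crude sketch: repeat the last good frame with a shrinking gain instead of jumping straight to zero. This is an assumption-laden toy (the function name and the decay factor are made up), and real concealment schemes are considerably more sophisticated, for example matching the waveform at frame boundaries.

```python
def conceal_underrun(last_frame, missing_frames, decay=0.5):
    """Build replacement frames for a gap by repeating the last good
    frame, attenuated a bit more each time (hypothetical helper)."""
    replacements = []
    gain = 1.0
    for _ in range(missing_frames):
        gain *= decay
        replacements.append([sample * gain for sample in last_frame])
    return replacements


if __name__ == "__main__":
    last_good = [1.0, -1.0]  # a tiny stand-in for a frame of samples
    for frame in conceal_underrun(last_good, 3):
        print(frame)
```

Even this toy version needs the receiver to keep the last frame around and do per-sample work during the gap, which is the extra cost mentioned above.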
Oh yeah, I didn’t mean to imply that the pops are a necessary consequence of buffer overruns. But as you say, gracefully mitigating the symptom requires non-negligible engineering effort (and potentially resources), when the actual problem, namely buffer overruns occurring because the consumer is too slow, shouldn’t exist in this particular system in the first place.
What am I missing?