The practical solution is to meter traffic (don't send packets faster than they can be processed on the receive end, or faster than the tightest congestion bottleneck en route allows). At the same time, process as quickly as possible on the receive side: don't let a single buffer sit in the IP stack longer than absolutely necessary. E.g. receive on one higher priority thread and queue to a processing thread.
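A rough sketch of that receive-and-queue split, in Python for brevity. The names `receiver`, `worker`, and `start_pipeline` are made up for illustration, and Python's `threading` module does not expose thread priorities, so the "higher priority" part would have to come from the OS or another language. The point is only that the receive loop does nothing except move datagrams out of the kernel buffer.

```python
import queue
import socket
import threading

def receiver(sock, q, stop):
    """Drain the socket as fast as possible; no processing happens here."""
    while not stop.is_set():
        try:
            data = sock.recv(65535)
        except socket.timeout:
            continue  # wake up periodically to check the stop flag
        q.put(data)

def worker(q, handle):
    """Process datagrams at whatever pace the application can manage."""
    while True:
        item = q.get()
        if item is None:  # sentinel: shut down
            break
        handle(item)

def start_pipeline(sock, handle):
    """Start both threads; returns a function that stops them cleanly."""
    q = queue.Queue()  # unbounded app-side buffer (an assumption; cap it if memory matters)
    stop = threading.Event()
    sock.settimeout(0.1)
    rx = threading.Thread(target=receiver, args=(sock, q, stop), daemon=True)
    wk = threading.Thread(target=worker, args=(q, handle), daemon=True)
    rx.start()
    wk.start()

    def shutdown():
        stop.set()
        rx.join()
        q.put(None)
        wk.join()

    return shutdown
```

Usage would look something like `shutdown = start_pipeline(udp_sock, process_packet)`, where `process_packet` is whatever slow work the application does per datagram.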
Yes, the solution to buffer bloat is to only keep buffers at the ends, but whether that's in the kernel or in your application shouldn't matter in most cases, if any at all. Better to just let the kernel handle it and not reinvent the wheel.
Tweak the kernel send/receive buffers if you want. The defaults on Linux are usually too aggressive, but I don't see any need to do much more than that.
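For reference, tweaking the per-socket buffers is a couple of `setsockopt` calls. A minimal sketch in Python; the helper name `set_udp_buffers` and the sizes in the usage line are made up for illustration. On Linux the kernel doubles the requested value for bookkeeping overhead and caps it at `net.core.rmem_max` / `net.core.wmem_max`, so it is worth reading the value back rather than assuming the request was honored.

```python
import socket

def set_udp_buffers(sock, rcv_bytes=None, snd_bytes=None):
    """Request new kernel buffer sizes and return what the kernel actually set."""
    if rcv_bytes is not None:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcv_bytes)
    if snd_bytes is not None:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, snd_bytes)
    # Read back: the effective size may differ from the request.
    return (
        sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF),
        sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
    )
```

E.g. `set_udp_buffers(sock, rcv_bytes=256 * 1024, snd_bytes=16 * 1024)` asks for a deeper receive buffer and a shallower send buffer on the same socket.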
And as observed elsewhere in this thread, some OSs have tragically small buffers (128K). It's absolutely vital to keep those buffers from filling in ambitious apps.
I wrote audio/video/screenshare communications code for years. In the bursty situation I described, the whole point is to offload the IP stack buffer into the app buffer at high speed.
I've also written streaming media services for many years. RTP/RTCP, for example, supports adaptive rate limiting, though few implement it. The RTCP sender and receiver reports signal packet loss and jitter so that the sender can, e.g., dynamically decrease the bitrate. If implemented properly, buffering too many RTP packets can hurt the responsiveness of the dynamic adaptation, which can quickly lead to poorer quality. (Modern codecs help to mitigate this issue, but largely because the creators have spent a lot of time putting more adaptation features into the codecs and the low-level bitstream knowing that software higher up the stack is doing it wrong.)
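To make the adaptation loop concrete, here is a toy sender-side controller in Python. It is a sketch under assumptions, not any particular stack's algorithm: the function name `next_bitrate`, the roughly 2% and 10% loss thresholds, the 5% ramp-up step, and the bitrate bounds are all invented for illustration. The only protocol detail relied on is that the RTCP receiver report carries "fraction lost" as an 8-bit value representing lost/256 since the previous report.

```python
# Hypothetical thresholds, expressed in the same 0..255 units as the
# RTCP "fraction lost" field (value / 256 = loss fraction).
LOW_LOSS_U8 = 5     # roughly 2%: below this, probe for more bandwidth
HIGH_LOSS_U8 = 26   # roughly 10%: at or above this, back off

def next_bitrate(current_bps, fraction_lost_u8,
                 min_bps=64_000, max_bps=2_000_000):
    """Pick the encoder bitrate for the next interval from one receiver report."""
    if fraction_lost_u8 >= HIGH_LOSS_U8:
        # Back off by half the loss fraction: current * (1 - 0.5 * lost/256).
        target = current_bps - current_bps * fraction_lost_u8 // 512
    elif fraction_lost_u8 < LOW_LOSS_U8:
        # Little or no loss: ramp up gently (5%).
        target = current_bps * 105 // 100
    else:
        # In between: hold steady.
        target = current_bps
    return max(min_bps, min(max_bps, target))
```

The controller can only react as fast as the reports reflect current conditions, which is why a deep queue of already-sent RTP packets works against it: the loss it sees describes the network as it was several buffers ago.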
For DNS, because Linux has a default 65KB (or greater!) buffer on UDP sockets, it's trivial to get huge packet loss when doing bulk asynchronous DNS queries. The application will quickly fill the deep UDP buffer; with the deep buffer the kernel will keep the ethernet NIC frame buffer packed, with the result that you'll see a ton of collisions on the ethernet segment and dropped UDP packets once the responses start rolling in. That results in a substantial fraction of the DNS queries having to be retransmitted, and because the retransmit intervals are so long, a bulk query operation that could have finished in a few seconds or less could take upwards of a minute as the stragglers slowly finish or time out. Without the deep UDP pipelines, the ethernet segment would be less likely to hit capacity and would see fewer dropped packets, so the aggregate time for the bulk query operation would be several times shorter.
Reducing the UDP output buffer is substantially easier than implementing heuristics or gating for ramping up the number of outstanding DNS queries. The latter, if well written, might be more performant, but just doing the former would alleviate most of the problem, allowing you to move on to more important tasks.
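For comparison, the crudest form of gating is a fixed cap on in-flight queries, with no ramping heuristics at all. A minimal Python sketch: `bulk_query` and `max_in_flight` are invented names, and `resolve` stands in for whatever async lookup coroutine the application already has (an assumption, not a real library call).

```python
import asyncio

async def bulk_query(names, resolve, max_in_flight=50):
    """Resolve all names, never allowing more than max_in_flight at once."""
    sem = asyncio.Semaphore(max_in_flight)

    async def one(name):
        async with sem:  # blocks here while the cap is reached
            return await resolve(name)

    # gather() returns results in the same order as the input names.
    return await asyncio.gather(*(one(n) for n in names))
```

A fixed cap still needs tuning per network, which is part of why shrinking the socket's send buffer is the cheaper first step.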