There's no compelling reason we couldn't have IP audio links (with an easy calling interface) that have no such compression artifacts. I would actually call it more of a noise gate issue: the system is perhaps trying to save bandwidth by maximizing the compressibility of quiet parts, treating them as if they're total silence, signal off. That kills the intimacy IMO.
We've had it for so long now with digital cell phone calls, even HD calling, certainly Zoom et al (some are particularly aggressive on the noise gate when others are speaking over you).
Waiting to fill the first packet before sending it (even with no compression) will always be slower than a system that just sends the audio instantly as an analog signal. Even waiting for a single "sample" of audio (for a tiny, tiny packet) is technically slower, but of course no one does that: they tend to group together at least 2ms of audio into a packet. The only way you can go faster than a classic land line (assuming no signal repeaters, i.e. one of those landlines where the signal gets quieter the further away you are) is to cover a shorter distance, or to build the wiring or (probably your best bet) the switches out of something the signal travels through faster. I think the relevant property is a lower relative permittivity in the insulation rather than less resistance in the wire, but I don't remember enough physics to say exactly which measurement you use here.
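To put rough numbers on the packet-filling delay described above, here is a minimal sketch. The function name and the 48 kHz sample rate are my own assumptions for illustration, not anything a particular phone or VoIP system is known to use; it only counts the time spent waiting for samples, not network or codec time.

```python
def packetization_delay_ms(samples_per_packet: int, sample_rate_hz: int) -> float:
    """Time spent waiting to collect one packet's worth of audio, in ms.

    This is only the buffering delay before the first packet can be sent;
    propagation, switching, and codec delays are not modelled.
    """
    if samples_per_packet <= 0 or sample_rate_hz <= 0:
        raise ValueError("samples_per_packet and sample_rate_hz must be positive")
    return 1000.0 * samples_per_packet / sample_rate_hz


if __name__ == "__main__":
    rate = 48_000  # assumed sample rate, purely illustrative
    for samples in (1, 96, 960):
        delay = packetization_delay_ms(samples, rate)
        print(f"{samples:4d} samples per packet -> {delay:.4f} ms of waiting")
```

Under that assumed rate, a one-sample packet still costs about a fiftieth of a millisecond, and a 2ms packet is 96 samples, so the delay is never zero the way it is for an analog signal.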
Is noise gating why you can't talk over each other? I am terrible at doing this (I think of it as synthesis - 2 excited ppl grabbing the idea, running with it, then the other interrupts and takes over) and while I acknowledge it's a bad behaviour, why do video calls or most digital audio ones not let it happen? I get there's delay so they can't exactly blend the two audio streams together, but what is this crazy limitation that only one person can talk at a time?
Delay/latency is a totally separate issue that also has negative effects, but not in a way that I would characterize as affecting intimacy. Full duplex vs half duplex plays into it, as I alluded: the symptom of half duplex is that the louder person "wins" temporary exclusivity, which causes the quieter source to drop out in a way that isn't entirely different from what a noise gate does. The difference is that dropping out due to half duplex depends on the relative levels of the two sources, while dropping out due to a noise gate depends on whether the absolute level of the source is above some determined noise threshold. Either way, all the dropouts where quiet becomes silent result in a lack of breath noise, saliva noise, the kind of laugh that manifests as just a bit of a strong nasal exhale, and plenty of other sounds. Think ASMR videos.
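The relative-vs-absolute distinction can be sketched as a toy model. Everything here is hypothetical: the function names, the per-frame dB levels, and the -50 dB threshold are made up for illustration, and real conferencing software is certainly more elaborate than this.

```python
from typing import List, Optional, Tuple

Frame = Optional[float]  # level in dB, or None when the frame is dropped to silence


def noise_gate(levels_db: List[float], threshold_db: float) -> List[Frame]:
    """Absolute rule: drop any frame whose own level is below the threshold."""
    return [lvl if lvl >= threshold_db else None for lvl in levels_db]


def half_duplex(a_db: List[float], b_db: List[float]) -> Tuple[List[Frame], List[Frame]]:
    """Relative rule: per frame, only the louder of the two sources gets through."""
    out_a: List[Frame] = []
    out_b: List[Frame] = []
    for a, b in zip(a_db, b_db):
        if a >= b:
            out_a.append(a)
            out_b.append(None)
        else:
            out_a.append(None)
            out_b.append(b)
    return out_a, out_b


if __name__ == "__main__":
    # Hypothetical levels: speech around -20 dB, a breath around -55 to -60 dB.
    alice = [-20.0, -60.0, -30.0]
    bob = [-25.0, -55.0, -10.0]
    print("gate on alice:", noise_gate(alice, threshold_db=-50.0))
    print("half duplex:  ", half_duplex(alice, bob))
```

In this toy version the gate silences Alice's -60 dB breath no matter what Bob is doing, while half duplex lets Bob's -55 dB breath through in the second frame only because Alice happens to be even quieter at that moment.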