I don't know about those theories. I suspect something else. Do you remember landlines, and how responsive they were? You would say things back and forth and the conversation just worked, even on three-way and conference calls. There was very little noticeable lag!
Even in the early cellphone days, I'd often prefer using my landline. There was less lag. Cellphones felt like speakerphones that try to do that cancellation stuff, where you feel like the conversation is split into pieces... like a walkie-talkie where you don't control the "over" / button press.
Cellphones have gotten better, but they've never really hit the latency of landlines (or maybe they have and I didn't notice; I never actually talk to people on my phone any more). Video conferencing is like that but worse. It's really apparent if you're on a remote team and you're next to someone on the same call. You can hear that delay.
I think a lot of the fatigue comes from that delay. We can connect more people now, with video, over great distances, but it does come at a cost: moving from virtual dedicated circuits to packet-switched networks and transcoding on some central server.
Landlines were so fast and so "direct" in their latency (distance correlates very directly with time, due to a lack of "hops") that local phone calls were faster than the speed of sound across a table. For a bit after they came out--before people generally got used to seemingly random latency--local calls felt "intimate", as if you were talking to someone in bed with their head right next to you. I have also heard stories of negotiators who had become finely tuned to analyzing how long people paused to think, and who found long-distance calls confusing enough to throw them off their game. But no: cell phones haven't become as fast as landlines and likely never can, due to the fundamental contrast between compressed, packetized, routed audio and an analog signal propagating over a circuit-switched wired connection.
I have very clear memories of my first overseas long-distance calls in the 1980s, which were over a satellite link. The latency was so pronounced (about half a second) that carrying out a conversation took quite a bit of practice. You almost had to carry out a formal protocol of handing over control of the line, as if you were on a walkie-talkie.
I have equally clear memories of my first overseas calls carried by undersea fiber because the lack of latency was so pronounced compared to what I had become accustomed to by then.
I still can barely tolerate carrying on a conversation on a cell phone though. The latency and compression artifacts are just horrible compared to a landline. VOIP has gotten pretty good though. I can't tell the difference between a good VOIP landline and a POTS line.
I'm in the military. I have found that the habit of saying "over", learned as crowded-circuit management for open radio circuits, is quite useful in teleconferences and VTCs, and for a similar reason. It makes it clear that you're handing over the circuit. "Out" means you're leaving the circuit.
I think that can work well, as long as people also adopt the military/aviation habit of speaking crisply and not rambling on to hog the channel until they're absolutely sure they've emptied themselves of every possible thought they could ever have on the matter. Like I just did with every word after "rambling on". ;)
Yeah, this. The experience with overseas calls I was referring to was talking to my then-girlfriend who was spending a year studying in France. Saying "over" all the time is not very romantic.
On the amateur radio side you see both types mixing in pretty amusing ways. On the one hand there's contesters and dx-ers that are all about confirming contacts as fast as they can, and on the other there's more casual people that'll talk about anything and everything that comes to mind. In my experience both sides use some amount of procedural language, but at very different cadences.
In the early 1990s I was sharing a flat in Edinburgh with a guy whose dad was working in Nigeria. The delay was routinely 4-5 seconds, which almost always took us a frustrating 30 seconds or so to sort out while we went backwards and forwards, both saying "no, you start" at the same time, leading to 10 seconds of silence, then both starting again...
There was a hack to force a call over cable. I used to use it when I was making a lot of international calls for BT, back when I worked on international interconnect for X.400.
You should note that most landlines have significantly more delay than they used to have 50 years ago.
All landlines now get digitised and packetized, and usually go over an IP network, frequently via the landline company's HQ hundreds of miles away, before heading back to the neighbour's house you were calling...
Data is the same - If I ping my next door neighbor, from my broadband connection to his, it goes via London (my ISP's headquarters) and Manchester (his ISP's headquarters) before coming back to him, with a round trip latency of >20ms. In sound terms, that delay is like him standing in the next room over.
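To put a number on the "next room over" comparison, here is a minimal sketch that converts a delay into the distance sound would cover in air in the same time. The 343 m/s figure assumes dry air at roughly 20 °C, and the function name is just for illustration.

```python
# Speed of sound in dry air at roughly 20 degrees C (an assumed, typical value).
SPEED_OF_SOUND_M_PER_S = 343.0


def equivalent_distance_m(delay_ms: float) -> float:
    """Distance sound travels through air during delay_ms milliseconds."""
    return SPEED_OF_SOUND_M_PER_S * delay_ms / 1000.0


if __name__ == "__main__":
    # The 20 ms ping is a round trip; one direction is about half of that.
    for label, delay_ms in [("20 ms round trip", 20.0), ("10 ms one way", 10.0)]:
        print(f"{label}: about {equivalent_distance_m(delay_ms):.1f} m of air")
```

So the full round trip is like a voice crossing roughly seven metres of air, and the one-way half is like someone a few metres away.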
> Data is the same - If I ping my next door neighbor, from my broadband connection to his, it goes via London (my ISP's headquarters) and Manchester (his ISP's headquarters) before coming back to him, with a round trip latency of >20ms. In sound terms, that delay is like him standing in the next room over.
I feel like that would be enough to be basically perfect, though. If only we could cut down on everything else in the chain adding its own latency.
I think the rule of thumb I used to go by when recording myself playing guitar was that 20ms was a noticeable delay, 100ms was sort of tolerable and anything more than that was enough to make my picking lose synchronization.
There's no compelling reason we couldn't have IP audio links (with an easy calling interface) that have no such compression artifacts. I would actually call it more of a noise gate issue, perhaps an attempt to save bandwidth by maximizing the compressibility of quiet parts, treating them as if they're total silence, signal off. That kills the intimacy IMO.
We've had it for so long now with digital cell phone calls, even HD calling, certainly Zoom et al (some are particularly aggressive on the noise gate when others are speaking over you).
The process of waiting to create the first packet to send (even with no compression) will always be slower than a system where you just send the audio instantly as an analog signal. Even the process of waiting for a single "sample" of audio (for a tiny tiny packet) is technically slower (but of course no one does that: they tend to group together at least 2ms of audio into a packet). The only way you can go faster than a classic land line (assuming there were no signal repeaters: like, this is one of those landlines where as you get further away the signal also gets quieter) is if you can go a shorter distance or use materials with less resistance--maybe "lower relative permittivity"? I don't know exactly what measurement you use here as I don't remember enough physics--to build the wiring or (probably your best bet) switches.
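The packetization point is easy to check with arithmetic: a sender cannot emit a packet until the last sample in it has been captured, so the packet duration is a floor on the added delay. A rough sketch, assuming 8 kHz narrowband telephony sampling (the function names and the 20 ms comparison row are my own additions, not from the comment above):

```python
def packetization_delay_ms(samples_per_packet: int, sample_rate_hz: int) -> float:
    """Minimum wait before a packet can be sent: the time to capture its samples."""
    return 1000.0 * samples_per_packet / sample_rate_hz


def samples_for_duration(duration_ms: float, sample_rate_hz: int) -> int:
    """How many samples fit in a packet of the given duration."""
    return round(sample_rate_hz * duration_ms / 1000.0)


if __name__ == "__main__":
    rate = 8000  # assumed narrowband telephony sample rate
    # Single sample, the 2 ms mentioned above, and a 20 ms frame for comparison.
    for duration_ms in (1000.0 / rate, 2.0, 20.0):
        n = samples_for_duration(duration_ms, rate)
        print(f"{n:4d} samples -> {packetization_delay_ms(n, rate):.3f} ms before sending")
```

This only counts the capture wait. Codec lookahead, jitter buffers, and routing all add on top of it.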
Is noise gating why you can't talk over each other? I am terrible at doing this (I think of it as synthesis - 2 excited ppl grabbing the idea, running with it, then the other interrupts and takes over) and while I acknowledge it's a bad behaviour, why do video calls or most digital audio ones not let it happen? I get there's delay so they can't exactly blend the two audio streams together, but what is this crazy limitation that only one person can talk at a time?
Delay/latency is a totally separate issue that also has negative effects but not in a way that I would characterize as intimacy. Full duplex vs half duplex plays into it as I alluded: the symptom of half duplex is that the louder person "wins" temporary exclusivity which causes the quieter source to drop out in a way that isn't entirely different from what a noise gate does; the difference is that dropping out due to half duplex is based on the relative levels of the two sources and dropping out due to a noise gate is based on the absolute level of the source being above some determined noise threshold. Either way, all the dropouts where quiet becomes silent result in a lack of breath noise, saliva noise, the kind of laugh that manifests as just a bit of a strong nasal exhale, and plenty of other sounds. Think ASMR videos.
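The absolute-versus-relative distinction can be sketched in a few lines. This is a toy model, not how any particular codec or conferencing product works: the levels are made-up dBFS values and the gate threshold is an assumption.

```python
# Hypothetical gate threshold in dBFS (0 dBFS is full scale).
GATE_THRESHOLD_DB = -45.0


def gate_passes(level_db: float, threshold_db: float = GATE_THRESHOLD_DB) -> bool:
    """Noise gate: decision depends only on the source's own absolute level."""
    return level_db >= threshold_db


def half_duplex_passes(own_level_db: float, other_level_db: float) -> bool:
    """Half duplex: decision depends on level relative to the other talker."""
    return own_level_db >= other_level_db


if __name__ == "__main__":
    sources = {"speech": -20.0, "quiet laugh": -40.0, "breath": -55.0}
    other_talker_db = -25.0  # someone else talking at the same time
    for name, level in sources.items():
        gate = "pass" if gate_passes(level) else "drop"
        duplex = "pass" if half_duplex_passes(level, other_talker_db) else "drop"
        print(f"{name:12s} gate={gate}  half_duplex={duplex}")
```

In this toy setup the quiet laugh survives the gate but loses to the louder talker under half duplex, while the breath is gated out no matter who else is speaking.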
I never thought I'd miss working in a call center, but man do I miss my landline Jabra hardwired headset with an in-line mute button!! That place was a hellhole, but their call systems were fantastic.
I have a Logitech H390 which is better than any headset I ever used in a call center (mainly Plantronics). It has inline mute. It's only $30 - $40. Also, the call systems I was using were often crap, where every 30th call would just be static or weird behavior like agents getting routed to each other.
That and you’re constantly struggling to decipher audio and make yourself understood. Much like talking in a bar that’s too loud is exhausting. A lot of the exhaustion seems like it’s the difficulty of compensating for a high latency, low-bandwidth, low quality experience. We’re just not evolved for it—reality is the opposite of all of those things.
This is it. The issue is exhaustion compensating for all the audio problems-- dogs barking, room echoes, compression, images out of sync, the delay in reactions (constantly being unsure if you were heard- not knowing what the correct volume you should be speaking is), being asked to repeat yourself, and the mix of your audio with those of others.
It's a lot to process. With 15 people on "speakerphone", the audio is IMO the biggest and most taxing problem.
It's normally participants with bad mics, headphones, AGC, and the crappy audio support in Windows.
I do a number of actual play games and it's the players trying to use the built-in mics/headphones that have difficulty hearing the group conversations.
I use a high quality external sound card and don't have that problem - Hangouts does seem to disconnect me when I don't speak for a while, as my noise floor is so low.
Siri has a huge latency problem. I say "Hey Siri" and it's precious seconds before she's listening (articles online tell us that it's possible to just ask the entire question without waiting for the beep -- but that doesn't work; sometimes Siri's just not ready). Someone at Apple clearly needs to reengineer Siri -- I say this as someone who's bought into the Apple ecosystem.
Siri is like the Apple Maps of voice assistants.
Alexa (at least Echo gen 3) doesn't have this delay. I can talk with Alexa without getting frustrated at the latency. (unfortunately Alexa doesn't handle pauses or stumbles in sentences well)
It is interesting. I don't get exhaustion because I don't talk to the assistants that much, but people talk to their virtual assistants in a different cadence or register. Instead of "alexacanyoutellmethetime?" it is "ALEXA...WHAT...TIME...IS...IT."
Owning an Alphabet Corp Branded Espionage Hockey Puck (tm) myself, I must admit I'm fairly impressed at how far voice recognition and natural language processing has come, despite this. I remember when you really had to talk in that stilted, properly pronounced, methodical way, and then tentatively wait for a response for what seemed an age -- and it wasn't that long ago (maybe like, four or five years?). Nowadays, I can vaguely mumble at mine from the other room without thinking too much about sentence structure or how I'm pronouncing words or the volume or speed of my dictation or switching off things making background noise, and it usually does the right thing, and responds about as quickly as a person. It definitely feels a lot less mentally draining!
I suspect you're actually misidentifying the culprit.
Lag is generally not noticeable on cellphones, and lag has always been a problem with landlines on long-distance (especially far-international) calls.
What you're describing "like a walkie-talkie" is actually the difference between half-duplex and full-duplex.
Landlines are full-duplex, and cell phones were notoriously bad half-duplex... until 4G which is considered full-duplex. There's also the issue of carriers muting audio completely when it's under a certain threshold, which makes them less responsive, because you have to ask "are you there?" every so often.
Videoconferencing latency is a separate issue, because there's one central server for each meeting, so if it's between the NY and SF office, the server may be in SF, so two separate NY participants get double the latency talking to each other than they do talking to SF. But, everybody hears the same audio.
You might say, well, why don't they select a server in the middle of the country for that call then? But selecting an optimal server location for a videoconferencing call is difficult, even if you have access to cloud locations all over the country/world, because you usually don't know in advance who's going to join the call. So the server is often simply whichever one is closest to whoever hosts the call or is the first participant, which can sometimes be quite suboptimal once everyone's joined. (And I sure wouldn't want to be the one to write code for switching servers mid-call.)
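A toy model of the central-server effect, for anyone who wants to play with it. Everything here is hypothetical: the site names, the one-way latencies in `ONE_WAY_MS`, and the "minimize the worst pair" criterion are illustrative assumptions, not how any real conferencing service picks servers.

```python
from itertools import combinations

# Hypothetical one-way latencies in ms between sites; illustrative numbers only.
ONE_WAY_MS = {
    ("NY", "NY"): 2, ("SF", "SF"): 2, ("CHI", "CHI"): 2,
    ("NY", "SF"): 35, ("NY", "CHI"): 12, ("CHI", "SF"): 25,
}


def link_ms(a: str, b: str) -> int:
    """One-way latency between two sites, in either direction."""
    key = (a, b) if (a, b) in ONE_WAY_MS else (b, a)
    return ONE_WAY_MS[key]


def relay_ms(src: str, dst: str, server: str) -> int:
    """Latency from src to dst when all audio is relayed through server."""
    return link_ms(src, server) + link_ms(server, dst)


def worst_pair_ms(participants: list[str], server: str) -> int:
    """Largest participant-to-participant latency for a given server."""
    return max(relay_ms(a, b, server) for a, b in combinations(participants, 2))


def best_server(participants: list[str], servers: list[str]) -> str:
    """Server that minimizes the worst pairwise latency (one possible criterion)."""
    return min(servers, key=lambda s: worst_pair_ms(participants, s))


if __name__ == "__main__":
    people = ["NY", "NY", "SF", "SF"]
    print("NY to NY via SF:", relay_ms("NY", "NY", "SF"), "ms")
    print("NY to SF via SF:", relay_ms("NY", "SF", "SF"), "ms")
    print("best server once everyone has joined:", best_server(people, ["SF", "CHI", "NY"]))
```

The catch described above still applies: `best_server` needs the full participant list, which is exactly what you don't have when the first person joins.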
Also, videoconferencing is much closer to "half-duplex" because if you were always mixing everyone's audio together, it would be a noisy mess.
But in conclusion -- latency/delay isn't a significant new problem that digital networks have introduced. It's always been there, but duplex, silence thresholds, and conference calls are the more important factors to pay attention to.
Just another example of how the transition to digital from analog wasn't a complete solve. Switching from analog cable to digital cable also introduced this latency so that button presses on the remote are not instantaneous (due to timing issues switching between transport streams). It's kind of like Coke/New Coke/Coke Classic. Eventually, those that remember will fade away, leaving only those familiar with the current situation.
The best communication system I've ever used was the Clear-Com party line intercom in my high school's auditorium. The thing was older than me by at least 10 years. Hand assembled circuit boards in steel chassis. But:
- The headsets were comfortable and practical. I have not seen anything sold for office or even gaming use begin to approach the ergonomics. Plantronics, Logitech, none of it.
- Every headset had foldback, so even the big sound-isolating cans felt natural to speak into. I've spent several hours messing with virtual sound cards trying to make this work with Zoom and never managed it.
- The signal was so clear, and the mics were so good, that as long as you enunciated you could speak just a hair above a whisper and be reliably understood. I remember analog PSTN voice, and this was 10x better. Might have to do with less aggressive EQ.
- Full duplex, no gating. Absolutely no problem with multiple people speaking at the same time. At very busy parts of the show we might intentionally have two simultaneous interactions on the same channel. It was a bit of work to unpack in your brain, but no harder than it would have been in person.
- Obviously no noticeable latency (basic fitness for purpose - these things are for calling cues).
- There was a little bit of analog hum + background noise from each station, so when someone opened their mic you would notice, but it wasn't disruptive. When the channel was busy, this sort of substituted for body language and you'd be invited to speak at the first opportunity.
- Physical switch to open and close the mic. I know Zoom has spacebar PTT, but one button press latch on/off is also important.
It's absolutely astounding to me just how much better these systems are than their nearest alternatives 30+ years later. If I ever control a tech office, particularly with an operations component, I'll seriously consider installing one.
> It's absolutely astounding to me just how much better these systems are than their nearest alternatives 30+ years later.
FWIW, what you're describing is achievable with a decent audio interface, something like the AT BPHS1/BPHS2, and some cough/mute switches. My travel rig for video has full direct monitoring/sidetone for up to eight microphones/headsets, including wireless transmitters if I need to go that route; it's fantastic.
The current state of affairs for normal people is as bad as it is because normal people don't care.
Having used these systems in theatres/live events while I was at Uni I fully agree. These systems are amazing.
The reason is very similar to landlines. A dedicated board in each station that shares three copper lines with every other station. In this scenario you cannot beat analogue.
The ease of setup is also fantastic. All you need is some XLR cables. At some point on the line you need a power supply, which usually acts as a signal splitter so you can have lines going in different directions. Then just plug everything together.
> - The signal was so clear, and the mics were so good, that as long as you enunciated you could speak just a hair above a whisper and be reliably understood. I remember analog PSTN voice, and this was 10x better. Might have to do with less aggressive EQ.
OTOH everyone on a Web video chat always sounds like they're YELLING. Even if the volume's turned down, the tone registers as yelling. Having those happening nearby with regularity, even behind a (thin) wall, is about as bad as the proverbial open-office-seated-near-sales situation.
I think people talk louder on cell phones than they did on old analog land lines, too.
It might have something to do with the lack of sidetone[1] (where you can hear your own voice through the earpiece), which had the effect of you naturally lowering your voice. Landlines always had sidetone, but most cell phones don't.
> Every headset had foldback, so even the big sound-isolating cans felt natural to speak into. I've spent several hours messing with virtual sound cards trying to make this work with Zoom and never managed it.
I have an external USB microphone from Shure that has an integrated 3.5mm headphone jack and mic volume dial. It mixes the sound of your own voice and the line-out from your computer, just like landline phones do. You might want to give it a try.
That's called sidetone. You really don't want your computer's sound system to generate that for you because the latency is even more distracting than not having it. It's much better to have it done in hardware, either by having the sound hardware direct mic input directly to the headset or with a headset that has it built in.
A good recording system can do it in software and be acceptable. You can even apply some filters to the output (recommended, because the right feedback to singers improves their performance) and get good results. You are allowed just over 5ms to do all of the above, which is possible but not easy.
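To see why roughly 5ms is tight, here is a back-of-envelope sketch of how much buffering fits in that budget. It assumes a simplified path with one input buffer and one output buffer and ignores converter and driver overhead, so real systems will do worse; the candidate buffer sizes are just common powers of two.

```python
def round_trip_ms(buffer_frames: int, sample_rate_hz: int, buffers_in_path: int = 2) -> float:
    """Buffering delay for mic-to-headphone monitoring (input buffer + output buffer)."""
    return 1000.0 * buffers_in_path * buffer_frames / sample_rate_hz


def largest_buffer_within(budget_ms: float, sample_rate_hz: int,
                          candidates=(32, 64, 128, 256, 512)):
    """Largest candidate buffer size whose round trip fits the budget, or None."""
    fitting = [n for n in candidates if round_trip_ms(n, sample_rate_hz) <= budget_ms]
    return max(fitting) if fitting else None


if __name__ == "__main__":
    for rate in (44100, 48000, 96000):
        n = largest_buffer_within(5.0, rate)
        print(f"{rate} Hz: {n} frames -> {round_trip_ms(n, rate):.2f} ms round trip")
```

Under these assumptions a 128-frame buffer at 48 kHz already overshoots the budget, and all the filtering has to finish within one buffer period.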
It doesn't, due to potential echo, but you can set this up yourself with most headsets or with a mic and mixer; you need to with headphones that shut out your voice, or you will shout to compensate.
I tend to put my headphones on only halfway, also to catch anything going on in the background (noise cancelling headphones + my back is to the bedroom door).
It's pretty awkward to wear only one ear of my Bose headphones, but with earbuds it works great. Apple's are the least bad solution I've ever found for voice calls.
I have open back headphones that I use at home (can't use at the office because too much sound bleed), and they are great for video calls because of this since they let in the sound of my own voice.
> Do you remember landlines, and how responsive they were?
They weren't always, except for local calls. My background: I made a lot of overseas phone calls in the 1980s to talk to my father, from North America to Japan. I also took a telecom class going through uni.
When calling Japan, not only was it crazy expensive, but the connection was typically terrible. We considered it amazing to be calling Japan at all, so that was ok. Issues:
1. Latency. You had to just start talking when the ringing stopped. If you waited to hear the person say "hello", they'd generally hang up before they heard you. Sometimes it would be around 5s latency; we always said this must be "satellite" connections vs using undersea cables.
2. Voice quality. You could hear and recognize the person on the far side, but that's about it.
3. Cost. I don't remember the cost exactly since I was young, but I remember it was between $2 and $5/min in 1980s dollars. Calls were short.
Latency was a major focus of telecom networks for a couple of reasons. One was human: obviously, humans really prefer low latency in telecom networks. Good old-school POTS lines also have to do sidetone[1] the old-fashioned way, and if the round trip time gets too high it's really disruptive to hear your own voice > 300ms later. But this causes a big echo problem, so having a shorter round trip time makes the echo cancellation filters shorter/easier.
[1] Sidetone is hearing your own voice come out of the speaker. Analog POTS lines are super cool this way: on longer connections the signal is weaker, so the sidetone is weaker, so the human speaks louder. This means their own voice is louder, so the person at the far end can hear them over the weaker signal.
Agreed. After reading the title, I immediately thought of the Doherty Threshold [1]. It states that if the system can respond to user actions and give the user feedback within 400ms, it increases users’ attention and productivity.
That constant delay, plus the worse ones when the connection loses strength, takes a constant toll on "zoom" conversations.
I think that is the main factor. Many of the items they brought up in the article also add to it, but my guess is that if there were no delay or drop in quality, it would be very manageable.
I can't believe basic things like typing out an email in Outlook visibly lag at times. I'm a decent typist, but I find it absolutely ludicrous that this quad-core monster is at times half a word behind on my typing.
I can type up to 180wpm and I have to intentionally slow myself down to half speed to use Microsoft's Start menu. Wow. I type "cmd.exe" or "notepad" in the Windows Start menu and see those applications briefly appear as I type, but by the time I press return, it has forgotten the match and launches a Bing search instead.
I feel like Microsoft is still clueless, looking at Bing analytics and thinking millions of people search for "notepad" every day because they really think it's a great thing to research.
I'm amused because I remember typing on an 8-bit computer over a 300 baud modem to a different 8-bit computer that was echoing back, and it was always able to keep up. Like you, I find anything that can't do as well as the above setup unacceptable. Too bad I can't get IT to put some reasonable selection criteria in the system.
inadvertent - people who talk too loud or don't understand the lag or don't notice their drop outs
should-know-better - people who use speakerphone and inflict it on people who deal with headsets to give others good sound, or who carry their devices around while moving
> people who use speakerphone and inflict it on people
To be fair, this varies by speakerphone. I use a landline phone with a high quality speakerphone. As far as I've been told, the quality is as good as possible (limited by the line I'm dialing into).
I feel like you've hit the nail on the head. I remember growing up; I'd have long conversations on the landline and it never really felt burdensome.
With Zoom, FaceTime etc, it's hard to gauge who's talking to who. Side conversations within groups are impossible. You also can't get a feel for another person's body language.
Latency is one issue, audio quality is, I suspect, another. Most VoIP applications have this characteristic "hyper-compressed" sound to them that I find tiresome in and of itself — sort of the same way that I find shopping mall-style lighting exhausting.
If the other person isn't close to you, no system can do it. At about 1000 miles you will be out of sync no matter what system you use. Sensitive singers will run into trouble at just 500 miles. The speed of light in other mediums is slower than in a vacuum; radio can probably get over 1000 miles if you can get line of sight.
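For a sense of scale, here is a small sketch of the propagation-only delay over those distances. The 0.67 velocity factor for optical fiber is an approximate, assumed value, and the figures ignore routing, switching, and any processing at either end.

```python
SPEED_OF_LIGHT_KM_PER_S = 299_792.458
KM_PER_MILE = 1.609344


def one_way_ms(distance_miles: float, velocity_factor: float = 1.0) -> float:
    """Propagation delay over a straight path; velocity_factor is the fraction of c."""
    distance_km = distance_miles * KM_PER_MILE
    return 1000.0 * distance_km / (SPEED_OF_LIGHT_KM_PER_S * velocity_factor)


if __name__ == "__main__":
    fiber = 0.67  # assumed approximate velocity factor for optical fiber
    for miles in (500, 1000):
        print(f"{miles} miles: "
              f"{one_way_ms(miles):.1f} ms one way in vacuum, "
              f"{2 * one_way_ms(miles, fiber):.1f} ms round trip in fiber")
```

At 1000 miles the physics alone costs a bit over 5 ms each way before any equipment is involved, which is why the distance limits above can't be engineered away.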
You'll be happy to know that Damon Krukowski actually covered this on his podcast Ways of Hearing, in episode 3, entitled "Love". It goes very in depth on how "terrible" even today's cell phones sound vs a land line. https://www.radiotopia.fm/showcase/ways-of-hearing
I've worked remote for a half dozen years now, and you're absolutely right, the lag is definitely the major factor. Minor factors are just the extra little effort expended in making sure your screen is showing, making sure everyone can hear you, the screen jumping focus as it follows the speaker. Just things like that add to it.
But holy crap does VOIP (I guess) beat Zoom in latency. Recently had to call the support line for my wife's PC, and the latency during that call was around 1s. No clue what the cause was, but it was incredibly painful to have a conversation that way.
At least in part, it depends on what the program prioritizes. Mumble prioritizes latency over sound quality. You start to sound bad sooner, but it's more snappy. For most gaming, that's the right choice to make. For normal phone conversations, you might value sound quality more.
I'm using Jamulus for making music with others. It can reach latencies of < 25ms. Chatting over Jamulus is much more pleasant than talking over a video conferencing application.