
Ask HN: What's preventing us from achieving seamless video communication? - orangep
Better video compression? Faster network speeds? Alternative network protocols?

With all due respect to the amazing folks working in the domain, as a person working outside the field, the quality of even 1:1 video communication still seems far from ideal.

I wanted to understand what the main underlying hurdles are. Folks say there's less room for improvement in compression after H.264. I'm not sure how much network speed is a factor, given that things can get choppy even on wired high-bandwidth connections. Audio artifacts definitely hurt the perceived quality, so I'm not sure whether there's room for technical improvement there either.
======
keithwinstein
Low-latency packet video _can_ work incredibly well over a dependable network
connection (with a known constant throughput and no jitter), low end-to-end
per-packet latency, and good isolation between everybody's microphone and
speaker. This was mostly solved in the 1990s.

A lot of what makes Skype/Facetime/WebRTC/Chrome suck are the compromises and
complexity inherent in trying to do the best you can do for when these things
don't hold -- and sometimes, those techniques end up adding latency even when
you _do_ have a great network connection.

Receiver-side dejitter buffers add latency. Sender-side pacing and congestion
control adds latency. In-network queueing (when the sender sends more than the
network can accommodate, and packets wait in line at a bottleneck) adds
latency. Waiting for retransmissions adds latency. Low frame rates add
latency. Encoders that can't accurately hit a target frame size on an
individual frame basis add latency. And networks whose available throughput
drops (either because another flow is now competing for the same bottleneck,
or because the bottleneck link capacity itself deteriorated) turn previously
sustainable bitrates into growing in-network queues, adding still more
latency.
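To make the first of those concrete: a receiver-side dejitter buffer sized to absorb observed jitter is, by construction, added latency. Here's a minimal Python sketch of one common approach (all names hypothetical; real implementations are far more sophisticated): size the playout delay to cover a high percentile of recently observed transit delays.

```python
# Sketch of an adaptive dejitter buffer: the playout delay is chosen to
# cover (say) the 95th percentile of recently observed transit delays.
# Every millisecond of buffer depth is a millisecond of added latency.

class DejitterBuffer:
    def __init__(self, percentile=0.95, window=100):
        self.percentile = percentile
        self.window = window          # how many recent samples to consider
        self.transit_delays = []      # observed (arrival - send) times, ms

    def observe(self, send_ts_ms, arrival_ts_ms):
        self.transit_delays.append(arrival_ts_ms - send_ts_ms)
        self.transit_delays = self.transit_delays[-self.window:]

    def playout_delay_ms(self):
        """Buffer depth needed so ~95% of packets arrive before playout."""
        if not self.transit_delays:
            return 0.0
        ordered = sorted(self.transit_delays)
        idx = min(int(self.percentile * len(ordered)), len(ordered) - 1)
        return ordered[idx]

buf = DejitterBuffer()
# 20 ms nominal transit delay, with occasional 60 ms jitter spikes:
for i in range(100):
    delay = 60 if i % 10 == 0 else 20
    buf.observe(send_ts_ms=i * 20, arrival_ts_ms=i * 20 + delay)

print(buf.playout_delay_ms())  # → 60
```

Note what happened: 10% of packets arriving 40 ms late forced *every* packet to be delayed 60 ms, even on the perfect 20 ms path.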

And automatic echo cancellation can make audio incomprehensible, no matter how
good the compression is (but the alternative is feedback, or making you use a
telephone handset).
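For a feel of what echo cancellation is doing, here's a toy NLMS adaptive filter in Python (a textbook technique, not any particular product's implementation; all names are hypothetical): it learns the speaker-to-mic echo path from the far-end signal and subtracts its estimate. When adaptation goes wrong, what's left is exactly the garbled audio described above.

```python
import random

def nlms_echo_cancel(far_end, mic, taps=4, mu=0.5, eps=1e-6):
    """Toy NLMS echo canceller: estimate the echo path and subtract it."""
    w = [0.0] * taps                      # adaptive filter weights
    residual = []
    for n in range(len(mic)):
        # Most recent `taps` samples of the far-end (loudspeaker) signal:
        x = [far_end[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
        echo_est = sum(wi * xi for wi, xi in zip(w, x))
        e = mic[n] - echo_est             # what's left after cancellation
        residual.append(e)
        norm = sum(xi * xi for xi in x) + eps
        w = [wi + mu * e * xi / norm for wi, xi in zip(w, x)]
    return residual

random.seed(0)
far = [random.uniform(-1, 1) for _ in range(2000)]
# Mic hears only the echo: a two-tap echo path applied to the far end.
mic = [0.5 * far[n] + 0.2 * (far[n - 1] if n else 0.0) for n in range(2000)]

res = nlms_echo_cancel(far, mic)
tail_energy = sum(e * e for e in res[-500:])
echo_energy = sum(m * m for m in mic[-500:])
print(tail_energy < 0.01 * echo_energy)  # → True (the filter converged)
```

In the clean case above the echo vanishes; real rooms add nonlinearity, double-talk, and clock drift, which is where cancellers start chewing up the speech itself.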

Another problem is that the systems in place are just incredibly complex. The
WebRTC.org codebase (used in Chrome and elsewhere) is something like a half
million lines of code, plus another half million of vendored third-party
dependencies. The WebRTC.org rate controller (the thing that tries to tune the
video encoder to match the network capacity) is very complicated and stateful
and has a bunch of special cases and is written in a really general way that
makes it hard to reason about.

And the fact that the video encoder and the network transport protocol are
usually implemented separately, by separate entities (and the encoder is
designed as a plug-in component to serve many masters, of which low-latency
video is only one, and often baked into hardware), and each has its own
control loop running at similar timescales also makes things suck. Things
would work better if the encoder and transport protocol were purpose-designed
for each other and maybe with a richer interface between them (I'm not talking
about changing the compressed video format itself; just the encoder
implementation), BUT, then you probably wouldn't have access to such a
competitive market of pluggable H.264 encoders you could slot in to your
videoconferencing program, and it wouldn't be so easy for you to swap out
H.264 for H.265 or AV1 when those come along. And if you care about the
encoder being power-efficient (and implemented in hardware), making your own
better encoder isn't easy, even for an already-specified compression format.
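A rough illustration of that "richer interface" idea: if the transport exports a current throughput estimate and the encoder can actually hit a per-frame byte budget, the control loop collapses to almost nothing. This is a hypothetical Python sketch in that spirit, not the actual Salsify or WebRTC.org logic:

```python
def frame_byte_budget(throughput_bps, frame_rate_hz, queue_bytes, drain_frames=4):
    """Target size for the next frame: spend the per-frame share of the
    estimated throughput, minus a share of any backlog already sitting in
    the network (so a standing queue drains over ~drain_frames frames)."""
    per_frame = throughput_bps / 8 / frame_rate_hz   # bytes per frame interval
    budget = per_frame - queue_bytes / drain_frames
    return max(int(budget), 0)

# 1 Mbit/s estimate at 25 fps with no backlog: 5000 bytes per frame.
print(frame_byte_budget(1_000_000, 25, 0))      # → 5000
# With 8000 bytes already queued, back off so the queue can drain:
print(frame_byte_budget(1_000_000, 25, 8000))   # → 3000
```

The hard part isn't this arithmetic; it's having an encoder that can reliably produce a frame of (close to) the requested size, which off-the-shelf hardware encoders generally can't.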

Our research group has some results on trying to do this better (and also
simpler) in a principled way, and we have a pretty good demo video:
[https://snr.stanford.edu/salsify](https://snr.stanford.edu/salsify) . But
there's a lot of practical/business reasons why you're using WebRTC or
FaceTime and not this.

~~~
miki123211
I remember good old Skype from 3-4 years ago, which didn't use to suck at
this so much, at least on the audio side. When the network connection was
really bad, it dropped to something like phone quality; when it was good, the
quality was closer to FaceTime. You could hear the quality shift dynamically
as network conditions changed (i.e. someone starting a download). I really
enjoyed talking on Skype back then; I knew I could depend on it to get the
best quality possible, whatever the conditions were. Now it's much worse: the
quality is average, it doesn't handle bad connections well, and it doesn't
fully utilize the good ones. I don't think we have anything as good now as we
had back then. There's TeamTalk, which is amazing for audio, but much more
cumbersome to set up.

~~~
ejcho623
Why do you believe the quality has gone backwards? Is there something they're
optimizing for that compromises quality?

~~~
miki123211
Not sure myself, I'm way out of my depth here, but a friend who knows much
more about this sort of stuff told me it's probably something to do with
mobile connections. Apparently slow landline connections and bad cellular
connections like 3G are unreliable in different ways. Also, most mobile
phones (definitely all iPhones, not sure about Android) can't go higher than
16 kHz when it comes to audio calls, so anything better is pointless. Maybe
it's just M$ trying to save some pennies on the infra, especially since
they're not p2p any more (also b/c mobile connections and restrictive NATs).
I know they're supposedly not that bad any more, but considering what they've
done to Skype's other aspects, I wouldn't be surprised really.

------
cameldrv
Zoom and Facetime are pretty decent, but you are completely right. The
fundamental problem is we have a huge stack of technologies that just barely
work. It runs from USB to video drivers, to operating systems, to
videoconferencing software to Wi-Fi to home/office routers to cable modems to
cable infrastructure to TCP/IP to backbone network capacity, and back.
Everything is pushed to the limit. It's basically the Richard Gabriel "worse
is better" problem, compounded. Everyone gets their part to something like 99%
reliability.

If you're depending on 20 things, each having 99% reliability, the system has
82% reliability. Roughly speaking, that's what's happening. There is no silver
bullet to fix this. Bringing one layer from 99% to 100% brings the system from
82% to 83%.
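The compounding is easy to check:

```python
layers = 20
per_layer = 0.99

print(round(per_layer ** layers, 3))        # → 0.818, i.e. ~82% reliability
# Perfecting a single layer barely moves the needle:
print(round(per_layer ** (layers - 1), 3))  # → 0.826, i.e. ~83%
```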

~~~
ColanR
Seems like we would do well to implement some kind of certification process,
like we do with engineers. [1]

If writing low-quality code could cost you your certification, then
(hopefully, if the certifiers have the right processes in place) you'd not
write that code. That way, we could finally treat the quality of a device
driver with the same seriousness as the quality of a bridge design.

[1] [https://www.nspe.org/resources/licensure/what-
pe](https://www.nspe.org/resources/licensure/what-pe)

~~~
pjc50
The public have consistently chosen the cheaper but slightly worse option for
decades of technology. This, more so than engineer skill, puts an upper
constraint on reliability.

------
zbuf
Let's start by getting seamless audio communication.

There's too much focus on video, when deficient audio has a greater impact on
rapport. Even an old-school landline phone gives a more fluid conversation
than modern video conferencing.

The trouble is, mainstream conferencing solutions are challenged by customers
who expect great audio out of poorly spec'd rooms and microphones. The result
is so much software 'masking' of poor inputs that we all feel totally
disconnected from why the system as a whole just isn't working well.

(A small plug for my own focus on audio; cleanfeed.net, which is WebRTC-based
with some additional magic)

~~~
closeparen
The microphones and speakers in an analog landline phone are not high end
either, but they still sound amazing compared to Zoom.

------
hideo
I used to work in this field several years ago. It’s gotten way better over
time but there’s room for continuing improvement.

Personally I think it’s a bit of everything:

There’s almost no standardization in signaling protocols. Things like FaceTime
and WhatsApp don’t interoperate.

NAT hole punching remains a complex problem. It’s not easy to solve.

Bandwidth is often not stable for long periods of time. Bandwidth drops,
latency spikes, packets get lost or retransmitted. WiFi connections are
sketchy. Wired connections are better but still packet switched. Cellular
wireless systems are overloaded and suffer from multipath fading.

Encoders are insanely complicated to build. Hardware acceleration isn't easy
to implement either. Configuring an encoder's parameters for a given
connection environment is hard and remains a craft, not a science.

The human eye seems to be much more sensitive to artifacts than the human ear.
Cameras are hard to tune and expensive. Autofocus, white balance, etc. affect
call quality quite badly. Camera placement is still a challenge. Minor changes
in lighting and color can make huge shifts in quality.

This is why video from dedicated conference rooms is way better than video
calls from phones or laptops. The state of the art under controlled conditions
is really unbelievably amazing.

------
plaidfuji
Here's my question: why is video conferencing designed such that when the
video drops out or lags, _so does the audio_? I never (or far less often)
have problems with plain VoIP. Why not make a video chat client that treats
the audio as a first-class citizen, and then displays the video only if it
happens to be coming through cleanly? Is re-syncing the streams too hard?
This seems like a fix that would make the experience much better without
putting more data through any network.
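An audio-first receiver along those lines is easy to sketch: play audio on its own clock, and show a video frame only if one arrived in time, otherwise freeze the last good frame. This is a hypothetical Python illustration, not how any shipping client works:

```python
def select_media(audio_packets, video_frames, playout_ts, max_video_lag=0.1):
    """Audio is always played; video is shown only if a frame close enough
    to the playout time has actually arrived. Late video never stalls audio."""
    audio = [p for p in audio_packets if p["ts"] <= playout_ts]
    on_time = [f for f in video_frames
               if f["arrived"] and playout_ts - f["ts"] <= max_video_lag]
    video = on_time[-1] if on_time else None   # None => freeze last frame
    return audio, video

audio = [{"ts": 0.00}, {"ts": 0.02}, {"ts": 0.04}]
video = [{"ts": 0.00, "arrived": True}, {"ts": 0.033, "arrived": False}]
a, v = select_media(audio, video, playout_ts=0.05)
print(len(a), v["ts"])  # → 3 0.0  (all audio plays; stale frame shows)
```

The cost, as the reply below this comment notes in the thread's own words, is that audio and video can drift out of sync, which viewers find distracting.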

~~~
lordCarbonFiber
Probably because crappy audio is received better than audio out of sync with
the video. Having the "badly dubbed foreign film" effect on a conference is
distracting as hell.

For what it's worth, voice over IP is essentially solved; tools like Mumble
can host thousands of people with top-tier quality. I'm not sold at all on why
you'd even want video: 90% of business meetings (in my experience) are screen
shares that could be communicated by just sending a link anyway. For personal
calls, use a direct peered connection; in my experience, all of the problems
of video calling come from going through the intermediating server.

------
lazyeye
Someone needs to design vidconf hardware where (somehow) the camera is in the
middle of the screen so you are looking directly into each other's eyes.

~~~
hideo
This exists in several forms, ranging from teleprompter-style setups to
cameras embedded in the screen. They're currently expensive, but I'd expect
to see them hitting the market in the next 5-10 years.

There are also software/post-processing implementations of this out there
which try to emulate eye contact. The ones I saw 4-5 years ago were definitely
uncanny valley but there may be more happening there of late.

This gets even more interesting in multi-party situations, where there may be
3+ locations with more than one person at each location.

~~~
asymmetric
> cameras embedded in the screen

Seems like a privacy nightmare. You can’t have a hardware toggle to close the
camera nor cover it with a piece of tape.

~~~
berbec
The Librem guys are making hardware with power switches for camera, GPS, etc.
[1]

1: [https://puri.sm/posts/lockdown-mode-on-the-
librem-5-beyond-h...](https://puri.sm/posts/lockdown-mode-on-the-
librem-5-beyond-hardware-kill-switches/)

------
nikanj
Today, there are N video-conferencing solutions on the market, all of them
rather "enterprise" in their quality. Most meetings start with a good 5-10
minutes of "Can you hear me now?".

Aspiring youths see this market ripe for disruption, and next year there are
N+1 video-conferencing solutions, all of them rather shit.

The biggest challenge doesn't seem to lie in the actual video quality. The
problem is in getting the damn call up in the first place, with all
participants seeing and hearing all other participants.

------
chvid
That people don't really want it.

As far as I can see, it is possible and good enough today with Skype,
FaceTime, and so on, plus a good internet connection; yet people prefer
email, chat, telephone... I think that is what is holding it back, and
because of the lack of demand there isn't really big investment in creating
hardware and software to support it.

~~~
r3bl
I believe it's a chicken and egg problem. I know I would rely on video way
more if my first sentence in every video call didn't have to be "can you hear
me?"

~~~
chvid
The eggs have been there for a long time and plenty of chickens have been
hatched; but no one is eating them.

I am pretty sure that I can do a high quality FaceTime call with a number of
people.

Yet I still prefer doing a regular phone call, as does just about everyone
else. The question you should ask is: why is that?

It is not a technical problem.

~~~
r3bl
> I am pretty sure that I can do a high quality FaceTime call with a number of
> people.

A number of people that could afford $1000+ to be able to use that platform.

I personally wouldn't consider this problem technically solved until an
Android that can barely be called a mid-ranger can do it.

~~~
dqybh
I wouldn't expect a mid-ranger phone to be fast and completely reliable.
That's why it's just a mid-ranger. The price is a compromise; you get shitty
radio hardware and shitty software so you get shitty videoconferencing.

~~~
tracker1
Is a mid-range phone today worse than a high-end phone from 5 years ago?
Because the high-end hardware of 5 years ago could handle it decently, IIRC.
Though it really depended on your internet connection, latency, and other
issues much more than on the device's hardware.

For the most part, given the margins at the high end, today's mid-range isn't
significantly worse.

------
floatingatoll
Latency SLAs are what’s missing: “No packets shall be lost, and latency shall
not fluctuate”.

Classical phone lines worked so well because there was never latency
variation. You might get static, but we're adapted to listening through
static; you never got timing disruptions.

Videoconferencing is a black art of trying to smooth over tiny latencies that
the human brain is wired to be extraordinarily sensitive to. People read too
much into a single lost packet.

This same problem applies to VR - if the latency is not rigorously consistent,
people vomit.

It is possible to design a network that connects two people reliably and with
predictable latency - last century’s landline phone network stands as proof.
Until someone builds a network that offers the same level of service for
videoconferencing, it will continue to be a tool of last resort.

~~~
closeparen
That technology is circuit switching, i.e. the antithesis of computer
networking.

~~~
floatingatoll
Dolphin used fiberoptic interconnects to provide either maximum bandwidth or
minimum latency, with a cross-connection grid that could easily be tuned with
a software switch to provide narrowly-predictable extremely low latency with
an SLA.

To say that this is circuit switching may or may not be correct, but it's
wrong to say that it's the antithesis of computer networking.

EDIT: See also Cloudflare’s ultra-predictable, low-jitter backbone:
[https://blog.cloudflare.com/argo-and-the-cloudflare-
global-p...](https://blog.cloudflare.com/argo-and-the-cloudflare-global-
private-backbone/)

------
faragon
The biggest problem with videoconferencing is using movie video codecs
(designed to maximize compression, not for multicast latency control),
tuned/tweaked for the worst client. So the typical video conference server
uses a hand-tuned gstreamer pipeline and a mechanism for requesting key
frames at the pace of the slowest client. That works OK-ish for up to a few
hundred connections, in the case where all peers have quality connections
(fiber). Scalable video conferencing, with e.g. thousands of concurrent
clients, requires a different approach to the problem; it will eventually be
solved by specialized video hardware acceleration in the video conference
gateway server, changes in the network infrastructure, and low-latency
devices on the client side.

------
sargun
The lack of circuit switching and bandwidth.

~~~
Animats
Exactly.

ISDN was a 64kb/s circuit switched channel end to end, rigidly clocked. Every
bit came in on schedule. Voice with no jitter. A friend of mine in Switzerland
had ISDN home phones until last month, when Swisscom discontinued it in favor
of a VoIP system with worse voice quality.

If there had been a video successor to ISDN, say a 10 Mb/s circuit, we'd have
real-time HDTV video chat with no jitter.

Voice and video over IP only work because of horrible kludges to deal with
jitter and lag.

------
p2t2p
How about this:

```
$ ping google.com.au
PING google.com.au (216.58.196.131) 56(84) bytes of data.
64 bytes from syd15s04-in-f3.1e100.net (216.58.196.131): icmp_seq=1 ttl=51 time=3239 ms
64 bytes from syd15s04-in-f3.1e100.net (216.58.196.131): icmp_seq=2 ttl=51 time=4524 ms
64 bytes from syd15s04-in-f3.1e100.net (216.58.196.131): icmp_seq=3 ttl=51 time=4434 ms
64 bytes from syd15s04-in-f3.1e100.net (216.58.196.131): icmp_seq=4 ttl=51 time=3622 ms
64 bytes from syd15s04-in-f3.1e100.net (216.58.196.131): icmp_seq=5 ttl=51 time=1022 ms
64 bytes from syd15s04-in-f3.1e100.net (216.58.196.131): icmp_seq=6 ttl=51 time=849 ms
64 bytes from syd15s04-in-f3.1e100.net (216.58.196.131): icmp_seq=7 ttl=51 time=1030 ms
64 bytes from syd15s04-in-f3.1e100.net (216.58.196.131): icmp_seq=8 ttl=51 time=974 ms
64 bytes from syd15s04-in-f3.1e100.net (216.58.196.131): icmp_seq=9 ttl=51 time=897 ms
64 bytes from syd15s04-in-f3.1e100.net (216.58.196.131): icmp_seq=10 ttl=51 time=1022 ms
64 bytes from syd15s04-in-f3.1e100.net (216.58.196.131): icmp_seq=11 ttl=51 time=1008 ms
64 bytes from syd15s04-in-f3.1e100.net (216.58.196.131): icmp_seq=12 ttl=51 time=949 ms
64 bytes from syd15s04-in-f3.1e100.net (216.58.196.131): icmp_seq=13 ttl=51 time=871 ms
64 bytes from syd15s04-in-f3.1e100.net (216.58.196.131): icmp_seq=15 ttl=51 time=1103 ms
64 bytes from syd15s04-in-f3.1e100.net (216.58.196.131): icmp_seq=16 ttl=51 time=1005 ms
64 bytes from syd15s04-in-f3.1e100.net (216.58.196.131): icmp_seq=17 ttl=51 time=830 ms
64 bytes from syd15s04-in-f3.1e100.net (216.58.196.131): icmp_seq=18 ttl=51 time=752 ms
64 bytes from syd15s04-in-f3.1e100.net (216.58.196.131): icmp_seq=19 ttl=51 time=703 ms
64 bytes from syd15s04-in-f3.1e100.net (216.58.196.131): icmp_seq=20 ttl=51 time=899 ms
64 bytes from syd15s04-in-f3.1e100.net (216.58.196.131): icmp_seq=21 ttl=51 time=821 ms
```
Does this answer the question? =)
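Quantifying the swing in those RTTs (a quick Python check, with the numbers copied from the ping output):

```python
import statistics

rtts_ms = [3239, 4524, 4434, 3622, 1022, 849, 1030, 974, 897, 1022,
           1008, 949, 871, 1103, 1005, 830, 752, 703, 899, 821]

print(min(rtts_ms), max(rtts_ms))         # → 703 4524
print(round(statistics.pstdev(rtts_ms)))  # jitter on the order of a second
# And note icmp_seq=14 never came back at all.
```

A dejitter buffer deep enough to smooth that out would add seconds of latency on its own, which makes conversation impossible regardless of codec quality.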

------
thefounder
You need deterministic performance. There is a standard for that named AVB[0]
but it requires avb compliant switches/hubs. It's used in pro, live, home
audio and corporate environments with great success. In a perfect world the
internet(at least the wired internet) would be avb compliant.

[https://en.m.wikipedia.org/wiki/Audio_Video_Bridging](https://en.m.wikipedia.org/wiki/Audio_Video_Bridging)

------
jaspal747
Can we send two parallel streams of video and audio and then, at the
receiver, pick each packet selectively? If a new packet arrives on either
stream, keep it if it's new, and discard it if we already got one with the
same id... Something along the lines of adding redundancy to compensate for a
poor connection?
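What's described is essentially sender-side duplication with receiver-side dedup by sequence number, which is straightforward to sketch (Python, hypothetical packet format):

```python
def dedupe(arrivals):
    """Merge two redundant packet streams: keep the first copy of each
    sequence id to arrive, drop the duplicate from the other path."""
    seen = set()
    merged = []
    for pkt in arrivals:                 # packets in arrival order, any path
        if pkt["seq"] not in seen:
            seen.add(pkt["seq"])
            merged.append(pkt)
    return merged

# Path A drops seq 2; path B delivers it late, but it still gets through.
arrivals = [{"seq": 1, "path": "A"}, {"seq": 1, "path": "B"},
            {"seq": 3, "path": "A"}, {"seq": 2, "path": "B"},
            {"seq": 3, "path": "B"}]
out = dedupe(arrivals)
print([p["seq"] for p in out])  # → [1, 3, 2]
```

The catch is that full duplication doubles bandwidth, which on a congested bottleneck can make the loss worse; real systems tend to prefer forward error correction, which adds fractional redundancy instead.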

------
no_identd
1\. Check out this recent comment thread from a few days ago as an entry
point to learning about the numerous performance bounds caused by both design
limitations and extremely long-standing feature gaps in the computer network
protocols we nowadays use for global communications (I intentionally avoid
the term "Internet" here because, as one can hopefully glean from the content
linked there, that name seems undeserved for our "global information
superhighway"):

[https://news.ycombinator.com/item?id=19864808](https://news.ycombinator.com/item?id=19864808)

2\. Something important and very relevant to your question, which I forgot to
mention in that thread: this recent ITU slide deck by Richard Li (Huawei):

[https://www.itu.int/en/ITU-T/Workshops-and-
Seminars/201807/D...](https://www.itu.int/en/ITU-T/Workshops-and-
Seminars/201807/Documents/3_Richard%20Li.pdf)

From this recent ITU workshop: [https://www.itu.int/en/ITU-T/Workshops-and-
Seminars/201807/P...](https://www.itu.int/en/ITU-T/Workshops-and-
Seminars/201807/Pages/Programme.aspx)

The webcast of which you can find here:

[https://www.itu.int/webcast/archive/t2017075g#video](https://www.itu.int/webcast/archive/t2017075g#video)
(third presentation in the video)

Naturally, given the information from point #1 on RINA, GNUnet, and current
network-technology issues, combined with the fact that the plans presented
above don't seem to really factor that information in, I dislike some of the
directions the ITU & Huawei seem to be going there. BUT, _even so_, this
presentation basically answers most of your questions exactly (I think the
other commenters already did a most excellent job answering what the
presentation doesn't), even generalizing them to seemingly "far out"—yet
apparently entirely feasible—ideas like 'Holographic Teleports' (think VR on
steroids).

~~~
no_identd
The talk starts at around 42:32 in the fourth video.

------
amelius
Related question: why does Netflix provide so much better video streaming
than a short-distance VoIP session? (I say short-distance because Netflix
obviously uses a content distribution network of some kind; not taking this
into account would make the comparison unfair.)

~~~
Liron
Because having a VoIP call is like producing a Netflix movie 0.1 seconds
before screening it.

E.g. you couldn't buffer 5 seconds.

~~~
amelius
Yes, this is probably the explanation.

------
bo1024
Just curious if you've tried Zoom. I recently switched to it from Skype
and/or Hangouts.

------
dvh
Not enough IPv4 addresses.

~~~
dvh
The only reason (original) Skype existed was because there were, even at that
time, not enough IPv4 addresses, so users who had public IPv4 addresses
tunneled data for users behind NAT. A small guy had no chance to create any
kind of p2p communication app because he had no infrastructure like Skype's.
If everybody had a public IPv4 address it would be trivial, and also fast
because of the direct connection. Video calls are hard, but codecs had
nothing to do with it.

------
sytelus
The core issue is not technical but human. In the 1960s, Bell Labs poured
massive investment into trying to make video calls a reality. They solved
virtually all the technical issues - in the 1960s - and had a real system
deployed in NYC and other places. Here was the reason it never took off:
_people just don't like others to see them, although they like to see
others._ This might seem a funny little thing, but it ultimately led to
pulling the video handsets from the market. No amount of network effect and
marketing helped its adoption.

This experiment has been repeated in many forms by different companies in
different settings with no real success. Everyone has solved the tech issues;
it's just that screen resolution keeps increasing. The only areas I can think
of where video calls have had minor success are corporate group meetings and
talking to kids/parents, but even in those cases people tend to be very
selective about when to do video calls vs voice-only calls.

