
3K, 60fps, 130ms: achieving it with Rust - clpwn
https://blog.tonari.no/why-we-love-rust
======
ccostes
Aside from the Rust aspect (which is cool!), I can't believe we've come this
far and still don't have low-latency video conferencing. Maybe I'm overly
sensitive, but people talking over each other and the lack of conversational
flow drive me crazy with things like Hangouts.

~~~
bob1029
The biggest problem is the video codecs, and it ultimately boils down to
interframe compression. This technique requires that a certain number of
video frames be received and buffered before a final image can be produced.
That requirement imposes a baseline amount of latency that can never be
overcome by any means. It is a hard trade-off in information theory.

Something to consider is that there are alternative techniques to interframe
compression. Intraframe compression (e.g. JPEG) can bring your encoding
latency per frame down to 0~10ms at the cost of a dramatic increase in
bandwidth. Other benefits include the ability to instantly draw any frame the
moment you receive it, because every single JPEG contains 100% of the data.
With almost all video codecs, you often need some number of prior frames to
reconstitute a complete frame.

For certain applications on modern networks, intraframe compression may not be
as unbearable an idea as it once was. I've thrown together a prototype using
LibJpegTurbo and I am able to get a C#/AspNetCore websocket to push a
framebuffer drawn in safe C# to my browser window in ~5-10 milliseconds @
1080p. Testing this approach at 60fps redraw with event feedback has proven
that ideal localhost roundtrip latency is nearly indistinguishable from native
desktop applications.
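
To make the shape of this concrete, here's a minimal sketch of the per-frame
path in Rust rather than C# (the `image` crate's 0.24-era JpegEncoder stands
in for LibJpegTurbo, and a raw TCP stream stands in for the websocket; the
quality setting and framing are arbitrary):

    use std::io::Write;
    use std::net::TcpStream;

    use image::codecs::jpeg::JpegEncoder;
    use image::ColorType;

    // Encode one raw RGB framebuffer as a standalone JPEG and push it out.
    // Every frame is self-contained, so the receiver can draw it the moment
    // it arrives - no reference frames, no buffering.
    fn send_frame(stream: &mut TcpStream, rgb: &[u8], w: u32, h: u32) -> std::io::Result<()> {
        let mut jpeg = Vec::new();
        JpegEncoder::new_with_quality(&mut jpeg, 80)
            .encode(rgb, w, h, ColorType::Rgb8)
            .map_err(|e| std::io::Error::new(std::io::ErrorKind::Other, e))?;
        // Length-prefix each frame so the receiver knows where it ends.
        stream.write_all(&(jpeg.len() as u32).to_be_bytes())?;
        stream.write_all(&jpeg)
    }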

The ultimate point here is that you can build something that runs with better
latency than any streaming offering on earth right now - if you are willing to
make sacrifices on bandwidth efficiency. My three-weekend project arguably already
runs much better than Google Stadia regarding both latency and quality, but
the market for streaming game & video conference services which require 50~100
Mbps (depending on resolution & refresh rate) constant throughput is probably
very limited for now. That said, it is also not entirely non-existent - think
about corporate networks, e-sports events, very serious PC gamers on LAN, etc.
Keep in mind that it is virtually impossible to cheat at video games delivered
through these types of streaming platforms. I would very much like to keep the
streaming gaming dream alive, even if it can't be fully realized until 10Gbps+
LAN/internet is the default everywhere.

~~~
vlovich123
You can also just configure your video encoder not to use B-frames. If you
make all subsequent frames P-frames, the size is very manageable. It gets
trickier if your transport is lossy, since a dropped P-frame is a problem, but
it's not an unsolvable one if you use LTR (long-term reference) frames
intelligently.

All the benefits of efficient codecs, more manageable handling of the latency
downsides.
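
Concretely, that's just a couple of encoder flags. A sketch shelling out to
ffmpeg/libx264 from Rust (the source and destination here are placeholders):

    use std::process::Command;

    // No B-frames: every frame is an I- or P-frame, so the decoder never
    // has to wait for a future frame. `zerolatency` also drops lookahead.
    fn main() -> std::io::Result<()> {
        let status = Command::new("ffmpeg")
            .args([
                "-i", "input.mkv", // placeholder source
                "-c:v", "libx264",
                "-tune", "zerolatency",
                "-x264-params", "bframes=0",
                "-f", "mpegts", "udp://127.0.0.1:5000",
            ])
            .status()?;
        assert!(status.success());
        Ok(())
    }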

The challenge you'll run into instantly with JPEG is that the file size
increase and encoding/decoding time at large resolutions outstrip any benefits
you get in your limited tests. For video game applications you have to figure
out how to pipeline your streaming more efficiently than transferring a small
10 kb image, as otherwise you're transferring each full uncompressed frame to
the CPU, which is expensive. Doing JPEG compression on the GPU is probably
tricky. Finally, decode is the other side of the problem. HW video decoders
are embarrassingly fast and super common; your JPEG decode is going to be
significantly slower.

* EDIT: For your weekend project, are you testing with cloud servers or
locally? I would be surprised if under equivalent network conditions you're
outperforming Stadia, so be careful that you're not benchmarking local network
performance against Stadia's production performance on public networks.

~~~
namibj
Actually, there are commercial CUDA JPEG codecs (both directions) operating at
gigapixels per second. It's not a question of speed; rather, you can at least
afford to use H.264 in I-frame-only mode for much lower bandwidth
requirements.
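
That mode is one flag away in stock tooling. For instance (a sketch; the
paths are placeholders), a GOP of 1 makes every frame a keyframe:

    use std::process::Command;

    // All-intra H.264: `-g 1` forces a keyframe every frame, giving
    // JPEG-like per-frame independence at a much lower bitrate.
    fn main() -> std::io::Result<()> {
        Command::new("ffmpeg")
            .args(["-i", "input.mkv", "-c:v", "libx264", "-g", "1",
                   "-tune", "zerolatency", "-f", "mpegts", "udp://127.0.0.1:5000"])
            .status()
            .map(|_| ())
    }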

~~~
vlovich123
JPEG is still going to be larger & lower quality than H264. I still fail to
see the advantage.

~~~
namibj
~10x higher framerate?

------
jchw
Nitpick: “audiophile-quality sound”, it seems, is becoming the new
“military-grade encryption.”

I don’t have many other comments to make other than I am surprised rust-
analyzer was only mentioned in passing.

~~~
dijit
The issue with 'military-grade' is that anyone in the military will attest it
translates to: Cheapest possible thing that gets the job done.

Audiophile grade at least has roots in high fidelity.

~~~
Joeboy
> Audiophile grade at least has roots in high fidelity.

Does it though? Audiophiles generally seem to eschew fidelity in favour of
something that sounds subjectively _nice_, including the psychoacoustic
effects of spending a lot of money.

E.g. they seem very fond of "warmth". If you asked me to make something sound
"warm", I'd be applying some soft clipping and dampening the top end, not
eliminating sources of distortion.

Edit: If you actually wanted high fidelity, you'd use studio headphones /
monitors, which are designed to be "unflattering", so you can be confident
you'll hear any issues when mixing / mastering. People don't normally listen
for pleasure with those, because they become fatiguing after a few hours.

Choosing equipment because you like the sound is a very reasonable thing to
do, but it's not the same as pursuing fidelity.

~~~
fsociety
Our ears are incredibly sensitive sensors and I think attributing warmth to
soft clipping and dampening the top end is not a complete picture.

Also warmth is just a single quality. I have a pair of very accurate “cold”
headphones that I prefer for music and a pair of “warm” headphones for
electronic music and gaming.

Past the headphones, it is not so much warmth as it is space in the sound for
me. My headphone amplifier sounds effortless and that’s the best way I can
describe the quality of what I hear.

~~~
Daishiman
But those characteristics are based on objective facts of sound reproduction
that can be quantified.

The characteristic of warmth is related to amplification of certain harmonics
as well as equalization in the signal. This is fairly well understood by now.

------
jonnypotty
I wish I read more things like this on hn. "We wanted to know and understand
every line of code being run on our hardware, and it should be designed for
the exact hardware we wanted"

~~~
kingosticks
But that statement seems at odds with a dependency on the enormous WebRTC
AudioProcessing C++ module. But then they also say they don't use WebRTC so
maybe I misunderstand what's going on.

~~~
Shared404
My understanding is that the quoted statement was explaining why they moved
away from WebRTC.

~~~
bschwindHN
We moved away from WebRTC completely for video, networking, and some audio. We
still use webrtc-audio-processing for acoustic-echo-cancellation and some
other niceties. Here is our Rust wrapper for that library:

[https://github.com/tonarino/webrtc-audio-processing](https://github.com/tonarino/webrtc-audio-processing)

------
ttul
If this actually works, I am desperately keen to get my hands on it. If you
have the capacity for high bandwidth, why not use it? Zoom’s model must work
on whatever crappy broadband people have in their home office. If you have
gigabit, it doesn’t seem to make use of that extra capacity to improve video
quality.

As for sound, I don’t think audiophile quality is necessary...

~~~
onion2k
_As for sound, I don’t think audiophile quality is necessary..._

Given you'll need about 10Mbps upstream for 60fps 3K video, it seems a little
unreasonable _not_ to add on a 320Kbps (or more) audio stream.

It could make this useful for things like streaming music concerts.

~~~
wenc
Semi-related note: there's work being done at Stanford to make it possible for
remote musicians to play together in an ensemble at low latencies.

JackTrip is the resulting software -- not end-user friendly, but apparently it
works.

[https://ccrma.stanford.edu/groups/soundwire/software/jacktrip/](https://ccrma.stanford.edu/groups/soundwire/software/jacktrip/)

(Some basic numbers: sound takes about 1 ms to travel a foot, so every ms of
latency is a foot of separation between musicians. 30ms of latency = 30 ft of
separation = roughly the max for jamming. So 130ms is not low enough.)
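
A quick back-of-the-envelope, taking the speed of sound as ~343 m/s (so the
1 ms/ft rule slightly understates the distance):

    // How far apart does a given latency "feel" to musicians?
    fn equivalent_feet(latency_ms: f64) -> f64 {
        latency_ms / 1000.0 * 343.0 * 3.281
    }

    fn main() {
        println!("{:.0} ft", equivalent_feet(30.0));  // ~34 ft: edge of jammable
        println!("{:.0} ft", equivalent_feet(130.0)); // ~146 ft: far too much
    }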

------
jupp0r
I love Rust, but their deciding to redesign/reimplement WebRTC after a week of
frustration seems like a prime candidate for not-invented-here syndrome, with
Rust as the justification. There is a reason WebRTC is as big as it is; it's a
complex problem to solve.

Regarding the premise of high latency in WebRTC: Google Stadia has ~160ms
round-trip latency at 4K from my MacBook to a data center, so it's not like
that's unachievable.

~~~
Apofis
Google is colocating in your basement.

------
LockAndLol
After reading it, I'm still not entirely sure what's being done.

Is it live streaming or is it the transport?

Are they doing video encoding (the audio encoding seems to be done by that
webrtc-audio thing)?

Have they chosen a progressive encoding format that compresses frames and
pumps them out to the wire as soon as they're done?

Is TCP or UDP involved or a new Layer 3 protocol entirely?

Have I just missed all of those parts or were they really missing amid all the
Rust celebration?

~~~
clpwn
> After reading it, I'm still not entirely sure what's being done.

> Is it live streaming or is it the transport?

tonari is the entire stack, similar in "feature scope" to WebRTC but with
different goals and target environments.

> Are they doing video encoding (the audio encoding seems to be done by that
> webrtc-audio thing)?

Yep, this includes video encoding and transport. We don't use the WebRTC audio
library for encoding or transport, just for echo cancellation and other
helpful acoustic processing.

> Have they chosen a progressive encoding format that compresses frames and
> pumps them out to the wire as soon as they're done?

Yep, basically, if by that you mean we don't use B-frames or other codec
features that would require buffering multiple video frames before producing a
compressed stream, so we're able to send out frames as soon as they're
encoded.

> Is TCP or UDP involved or a new Layer 3 protocol entirely?

We encapsulate our protocol in UDP since we operate on the normal internet - a
new layer-3 protocol is out of the question without a huge lobbying force and
15 years of patience on your side.

> Have I just missed all of those parts or were they really missing amid all
> the Rust celebration?

We intentionally didn't get into the protocol details because we are saving
that for a dedicated post (and code to back it up).
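
(Purely to illustrate the general "encoded frame straight into UDP" shape -
this is a generic, std-only sketch, not our actual wire format:)

    use std::net::UdpSocket;

    // Split one encoded frame into MTU-sized datagrams with a tiny
    // [frame_id | chunk_idx | chunk_count] header so the receiver can
    // reassemble a frame, or discard it wholesale on packet loss.
    const MAX_PAYLOAD: usize = 1200; // stay under typical path MTU

    fn send_encoded_frame(sock: &UdpSocket, frame_id: u32, data: &[u8]) -> std::io::Result<()> {
        let chunks: Vec<&[u8]> = data.chunks(MAX_PAYLOAD).collect();
        let total = chunks.len() as u32;
        for (i, chunk) in chunks.iter().enumerate() {
            let mut pkt = Vec::with_capacity(12 + chunk.len());
            pkt.extend_from_slice(&frame_id.to_be_bytes());
            pkt.extend_from_slice(&(i as u32).to_be_bytes());
            pkt.extend_from_slice(&total.to_be_bytes());
            pkt.extend_from_slice(chunk);
            sock.send(&pkt)?; // socket assumed connect()ed to the peer
        }
        Ok(())
    }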

~~~
LockAndLol
Thank you very much for the answers. Glad I wasn't too far off.

Looking forward to the technical post. If you're planning on releasing all of
this royalty-free and opensource, you'd be quite a boon to the free and open
internet. Getting this picked up by the likes of Mozilla and getting it into a
browser would be amazing.

------
nerdbaggy
If anybody is looking for a low-latency, high-bandwidth P2P video streaming
solution, there is
[https://github.com/CESNET/UltraGrid/wiki](https://github.com/CESNET/UltraGrid/wiki).
It can do less than 80ms of latency.

~~~
ClumsyPilot
This is cool, thanks for the link. Is this Nvidia GPU only? Might give it a
try at some point.

~~~
nerdbaggy
You don't need the GPU; it just depends on the type of compression you want.
It supports Intel VA-API as well as NVIDIA VDPAU.

------
MR4D
Gotta love a writeup with this line in it:

      like Brian's 1970s-era MacBook Pro

That's a writer who knows what it's like to read long (aka thorough)
technical articles, and how not to bore the reader to death.

Great article!

------
zamalek
> We just enforce rustfmt.

After interacting with both rustfmt and go fmt, I have concluded that
.editorconfig is solving a problem that shouldn't need solving. We went
through the ordeal of defining our C# coding standards where I work and, let
me tell you, people (myself included) care very deeply about their way of
structuring code. And it's a bloody waste of their time.

Having the language designers say, "here is how our language should be
structured" is a breath of fresh air.

------
pier25
Woah, this portal thing into another place seems super exciting, if they can
really pull it off and maintain low latency in the real world.

------
imtringued
My WebRTC projects haven't suffered that much from latency. For me, the
biggest source of delay is usually video encoding. I've had to limit streams
to 720p and 25fps to reduce the time spent CPU-encoding a VP8 stream. There
are also bandwidth considerations (real-time encoding = significantly less
compression), but the end result is slightly less than 200ms one-way latency
(including input lag from the mouse, 15ms network latency, and display lag)
without any special settings.

All I'm doing is feeding an ffmpeg stream to Kurento and letting it broadcast
via WebRTC. This is not a web conferencing application, and it is also not
using WebRTC via p2p. It's closer to conventional live streaming with a sane
amount of latency (compared to the up to 30s of latency you commonly see on
Twitch).

Of course, I personally would prefer it if the latency could be brought down
even further. 100ms or lower is the holy grail for me, and it only appears to
be doable with codecs that aren't supported by WebRTC. However, people don't
want to install apps just for my little service, and I certainly won't encode
every stream with several codecs just for the tiny minority of the user base
that actually ends up using the app.

------
GuiA
Very cool from a tech standpoint.

From a product point of view, I find it interesting that the
illustrations/concept videos for these things always show people interacting
very close to the wall - e.g. playing chess, sitting around a table, etc.

[https://tonari.no/static/media/family.48218197.svg](https://tonari.no/static/media/family.48218197.svg)

But in practice, people tend to keep their distance from it. E.g. the pictures
of this setup tend to show people clustered in their own group on each side of
the wall, a solid 2-3 meters back from it.

[https://blog.tonari.no/images/ea56c74d-a55d-4183-9a7b-d697954c5159-tonari-frontier-2.png.optimized.jpg](https://blog.tonari.no/images/ea56c74d-a55d-4183-9a7b-d697954c5159-tonari-frontier-2.png.optimized.jpg)

It makes sense: it's awkward to be close to a large solid (emissive) surface,
and humans instinctively move closer to their in-group when faced with an
out-group. I wonder how the system could be designed to encourage participants
to come closer, if there is an advantage to that.

~~~
STRML
A practical problem to solve there: where do you put the cameras? I would
actually prefer putting them behind the screen if possible - a few small
pinholes wouldn't be that noticeable. If you could put multiple wide-angle
cameras in multiple places, you could stitch them together in software and
create a real feeling of closeness.

------
sephamorr
Why exactly do existing video streaming solutions use such small amounts of
bandwidth and have terrible quality as a result? Does anyone have a deep dive
into why this is the case? It seems that it would be a killer feature to make
better utilization of bandwidth.

Even over wifi, speedtest shows 4ms ping and 100Mbps each way on my internet
connection, but Zoom, FaceTime, and others never use more than about 0.8Mbit/s
for a video stream, and the resulting quality of audio and video
is...understandably poor.

Latency, too, feels entirely like a software problem, perhaps with too many
layers of abstraction (60fps -> 16ms for the camera, ~10ms for encoding with
NVENC or equivalents, 35ms measured one-way latency from my laptop to my
parents' 4000km away, ~10ms decode, 16ms frame delay = 87ms one way). Maybe
I'm asking for too much from non-realtime systems (I'm used to RTOS, extensive
use of DMA, zero-copy network drivers, etc.), but it seems there is a lot of
room to improve.

~~~
rasz
OnLive "solved" encoder latency 15 years ago. You dont wait 16ms for the next
frame. Instead you progressively start encoding after receiving first tens of
lines. This way your encoded video stream lags just couple of milliseconds
behind, same for decoding. You could crudely emulate this by dividing screen
into 4 rows and sending 4 concurrent video stream, instant 1/4 latency drop.
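
A sketch of the row-band version in Rust (`encode_band` stands in for any
intra-capable encoder - the callback and frame layout are hypothetical):

    // Hand each horizontal band to the encoder as soon as its rows exist,
    // instead of waiting for the whole frame to finish scanning out.
    const BANDS: usize = 4;

    fn encode_banded<F: FnMut(usize, &[u8])>(frame: &[u8], width: usize, height: usize, mut encode_band: F) {
        let stride = width * 3; // packed RGB assumed
        let rows_per_band = height / BANDS; // remainder rows omitted for brevity
        for band in 0..BANDS {
            let start = band * rows_per_band * stride;
            let end = start + rows_per_band * stride;
            // In a real pipeline this fires while the sensor is still
            // delivering the rest of the frame: ~1/BANDS the encode wait.
            encode_band(band, &frame[start..end]);
        }
    }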

~~~
sephamorr
Sure, many of the operations in the list can be pipelined as you mention.
Something like G-Sync would also allow you to sync the destination display to
the arrival of the (start of) frame.

------
zelly
The bottleneck is not on the CPU. I'm afraid this company may have wasted
their time trying to reinvent WebRTC. If you really want to get realtime
video, I think the best approach is a custom codec on CUDA or better yet
custom hardware (FPGA). You can only go so far on general purpose hardware
before you hit a wall and get Zoom/WebEx quality.

~~~
ronyfadel
Is or is not? I'm confused: if the bottleneck is not on the CPU, what does
CUDA solve?

~~~
zelly
The bottleneck is the video encoding/decoding/rendering, which is done on GPUs
to begin with. Of course if it were done on CPU instead, then it would be
significantly worse, but that's not where we're starting from. Improving stuff
on the host side by, say, rewriting WebRTC in Rust won't improve the latency
of your video by much or at all.

------
swsieber
This is welcome news.

I have been itching to convert a small headshot video stream (think under
100x100px) to audio, stream it over Mumble, and then convert it back to video,
just to see what the latency is like. It would obviously be a big undertaking,
but not as big as this, methinks.

------
usefulcat
"We wanted to know and understand every line of code being run on our
hardware, and it should be designed for the exact hardware we wanted."

This rings very true for every high-performance thing I've ever worked on,
from games to trading systems.

------
snvzz
130ms is a world better than 500ms and a very welcome improvement, but it is
still terrible.

Latency happens throughout the whole stack; unfortunately, much would need to
be fixed outside this project to achieve any further significant improvement.

The operating system, firmware, and black-box hardware are some other
non-negligible sources of latency. Everything adds up.

------
codefined
Any suggestions on a group video conferencing tool for use on a local network
(Ethernet) that's effective? Either self-hosted or online, just for personal
usage to talk with others?

------
dbrgn
"A week of struggling with WebRTC’s nearly 750,000 LoC behemoth of a codebase
revealed just how painful a single small change could be — how hard it was to
test, and feel truly safe, with the code you were dealing with."

I _totally_ feel you. It's impressive what the WebRTC implementation has
achieved, but it's just not pleasant at all to work with it.

------
systemvoltage
@dang - Suggest altering the title to say what it is "Achieving 3K, 60fps,
130ms Video Conferencing with Rust".

------
eadan
This is amazing! The first thing that popped into my mind on seeing the
life-sized "portal" was the farcaster portals from the sci-fi novel Hyperion.
[https://hyperioncantos.fandom.com/wiki/Farcaster](https://hyperioncantos.fandom.com/wiki/Farcaster)

------
chubs
Sounds impressive, but I'm dying to know: what video codec are they using?

------
novok
I wonder how it compares with Apple FaceTime on two new MacBooks with Ethernet
connections on both sides.

They actually work on reducing latency and pushing high-res video if your
connection supports it.

~~~
bschwindHN
That's a great idea. I've always preferred FaceTime, at least for the video
quality. We'll do a latency test sometime; I suspect it'll be quite good!

------
lc5G

      for crate in $(ls */Cargo.toml | xargs dirname); do
         cargo build
    

Why do this instead of

    cargo build --workspace

Is it so you can time the individual crates?

~~~
ninkendo
Yeah it looks like they wanted to know how long each crate took to build
individually.

But as long as we're nitpicking, nobody should just pipe `ls` into `xargs`
like this, since it fails if anything has spaces in it.

Instead, do:

    
    
        for cargo_toml in */Cargo.toml; do
          crate="$(dirname "${cargo_toml}")"
          pushd "${crate}"
          # ...
          popd
        done
    

Don't be that person who writes a script which won't tolerate spaces in
filenames!

~~~
OJFord
Alternatively: Don't be that person who clones the repo at a path with spaces
in!

~~~
ninkendo
Not having spaces in your directory names is certainly a good idea, but I'll
be damned if I let any of my code have issues with them. Just because
something's a good idea doesn't mean it should be a requirement :)

(The main reason for the advice of "Don't put spaces in paths" is really only
because it breaks lots of poorly-written software... but that's not an excuse
for your software to be poorly-written!)

------
Exuma
I love their homepage [https://tonari.no/](https://tonari.no/)

~~~
aiotokyo
Thank you so much! Keep an eye out for further updates.

------
Scaevolus
What's the codec stack for this? x264 --tune zerolatency + opus with
opus_delay=20ms?

~~~
namibj
20ms is wasteful. Use the minimum frame duration where SILK still works;
afaik that's 10ms.
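
In libopus terms, the frame duration is just how many samples you hand the
encoder per call. A sketch with the `opus` Rust crate (crate choice and
numbers are mine):

    // Opus frame duration = samples per encode() call at the given rate.
    // Valid frames are 2.5/5/10/20/40/60 ms; at 48 kHz, 480 samples = 10 ms,
    // the shortest frame SILK handles.
    fn main() -> Result<(), opus::Error> {
        let mut enc = opus::Encoder::new(48_000, opus::Channels::Mono, opus::Application::Voip)?;
        let frame = vec![0i16; 480];      // one 10 ms frame of silence
        let mut packet = vec![0u8; 4000]; // generous output buffer
        let n = enc.encode(&frame, &mut packet)?;
        println!("10 ms frame -> {} bytes", n);
        Ok(())
    }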

~~~
codys
This assumes that video encoding latency is lower than the audio latency.

~~~
namibj
Shouldn't be more than a single frame, which is 16 2/3 ms for 60 fps. And for
e.g. JPEG it can be even shorter, especially with a rolling shutter.

------
jonny383
"we truly don't believe we could have achieved these numbers with this level
of stability without Rust"

Oh please. This is just Rust sensationalism. People don't truly believe Rust
is faster than C, do they?

~~~
bufferoverflow
In some problems Rust is the fastest:

[https://benchmarksgame-team.pages.debian.net/benchmarksgame/performance/nbody.html](https://benchmarksgame-team.pages.debian.net/benchmarksgame/performance/nbody.html)

[https://benchmarksgame-team.pages.debian.net/benchmarksgame/performance/spectralnorm.html](https://benchmarksgame-team.pages.debian.net/benchmarksgame/performance/spectralnorm.html)

[https://benchmarksgame-team.pages.debian.net/benchmarksgame/performance/revcomp.html](https://benchmarksgame-team.pages.debian.net/benchmarksgame/performance/revcomp.html)

[https://benchmarksgame-team.pages.debian.net/benchmarksgame/performance/binarytrees.html](https://benchmarksgame-team.pages.debian.net/benchmarksgame/performance/binarytrees.html)

~~~
jonny383
Every single one of those Rust implementations is using unsafe {}, thus
defeating the purpose of using Rust in the first place. Run the same benchmark
without unsafe {}.

~~~
turndown
> defeating the purpose of using rust in the first place

I don't think this is true; the whole point of Rust is that unsafe operations
are _explicit_, not that you never use them.

Also, I looked at the first one, and it's only using unsafe on what are
basically opcode calls; I don't think it's realistic to complain about that.

------
nerdbaggy
I wonder how much bandwidth this uses. The less bandwidth it uses, the higher
the latency, because of compression. It's much easier to get low-latency video
when you have large (Gbit+) links.

------
vertex-four
Are they still using WebRTC, just their own implementation? Or have they
switched to something else on the wire?

~~~
markdog12
There's a section in the article about it: "In the beginning (or: why we're
not WebRTC)"

~~~
vertex-four
I'm interested in what they _are_ using if not WebRTC - there are several good
options in this space (SRT would be my go-to choice), so it'd be really
interesting to see whether they rolled their own wire protocol or used
something else.

~~~
beowulfey
They built it from scratch

~~~
vertex-four
Their blog post suggests they wrote _something_ from scratch, but gives no
clue as to what, whether they considered building on a more modern protocol
specification than RTP (which is a couple decades old at this point), what
they've taken from other more modern protocol specs if they didn't use one
directly, or anything aside from that they wrote some code really.

------
realchucknorris
Would love to see a demo.

------
alpineidyll3
Awesome post

------
remmargorp64
But does it have middle-out compression?

