
Salsify – A New Architecture for Real-time Internet Video - dedalus
https://snr.stanford.edu/salsify/
======
keithwinstein
Hi all -- Salsify co-author here. Surprised to see us here again, but happy to
be part of the conversation (here's a previous one:
[https://news.ycombinator.com/item?id=16964112](https://news.ycombinator.com/item?id=16964112)).

This work was led by my student Sadjad Fouladi. If you liked Salsify, you
might really like Sadjad's talk in June at USENIX ATC about "gg", his system
for letting people use AWS Lambda as a rented supercomputer (e.g. he can
compile inkscape and Chromium really fast by outsourcing the computation to
8,000 Lambda nodes that all talk to each other directly over UDP):
[https://www.youtube.com/watch?v=Cc_MVldSijA](https://www.youtube.com/watch?v=Cc_MVldSijA)
(code here:
[https://github.com/StanfordSNR/gg](https://github.com/StanfordSNR/gg))

You might also be interested in our current video project, led by my student
Francis Yan, on trying to improve live video streaming. If you visit and watch
some live TV you can help contribute to our study:
[https://puffer.stanford.edu](https://puffer.stanford.edu)

~~~
Sean-Der
Hey Keith,

Amazing work -- I am really impressed with what you are doing. I have never
found any good reading in this area besides your work; the only things I
have seen are GStreamer's rtpjitterbuffer and libwebrtc, and I haven't
really felt confident in what I have learned from either.

Have you thought about testing/comparing against other WebRTC implementations?
It makes me a little sad to see WebRTC get a bad name just because one
implementation has issues. WebRTC is a huge opportunity to get companies to
invest in one thing, instead of reinventing the wheel and locking people in.
Do you think things could be improved and eventually match Salsify?

Somewhat unrelated, but I am working on Pion WebRTC[0]. By design it is not
coupled tightly with the encoder/decoder (but I want to give feedback to
the user so it could influence the codec if they wanted). Do you have any
suggested readings, and would you be OK with me contacting you directly? I
really want to build something elegant that allows people to build amazing
things with WebRTC. Right now I have been having people do things manually,
and I don't want to ship something without doing my due diligence.

thanks

[0] [https://github.com/pion/webrtc](https://github.com/pion/webrtc)

~~~
keithwinstein
Thanks for your kind words -- you're certainly welcome to contact us if you
think we can help. Yes, as I wrote elsewhere here, it would probably be
possible to do everything Salsify does within the context of WebRTC, if you
get to change the sender and receiver. If you're only changing the sender and
you want to interoperate with receivers running the WebRTC.org codebase (e.g.
Chrome), you have less flexibility. If you're only changing the sender and you
want to work with existing high-performance video codecs and the API they
expose, even less flexibility.

My main advice for implementers would be: benchmark your end-to-end
glass-to-glass video latency (including the time spent waiting for the next
frame to be captured & encoded, and then the end-to-end latency through to
the display of that frame) over varied/unpredictable networks. In my
experience, implementers sometimes get caught up focusing too much on
network-layer measurements (IP latency) and can end up missing what in my
view is the bottom line: glass-to-glass video latency. You can use our
mahimahi mm-delay/mm-link/mm-onoff/mm-loss tools (part of Debian/Ubuntu)
and the included traces to model some bad networks, and then see how you do
on the same traces we use in our paper. If you can make a plot like our
Figure 8(a) and it all looks good, that seems like good progress in my book.
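
To make that concrete, here is a rough sketch (hypothetical names, nothing
from our codebase) of the kind of per-frame bookkeeping I mean: stamp each
frame when the camera delivers it, note when that same frame is painted on
the receiver, and look at the whole distribution rather than just the mean.

```python
# Hypothetical glass-to-glass latency bookkeeping -- illustration only.
# It assumes sender and receiver share a clock (loopback testing) or have
# synchronized clocks; in practice you also need a way to identify the
# same frame on both ends (e.g. a frame counter burned into the video).
import time
from statistics import median, quantiles

class FrameClock:
    def __init__(self):
        self.capture_times = {}   # frame_id -> time the camera delivered it
        self.latencies = []       # seconds, one entry per displayed frame

    def on_capture(self, frame_id):
        self.capture_times[frame_id] = time.monotonic()

    def on_display(self, frame_id):
        t0 = self.capture_times.pop(frame_id, None)
        if t0 is not None:
            self.latencies.append(time.monotonic() - t0)

    def report(self):
        # The tail matters more than the mean for interactivity.
        lat_ms = [1000 * x for x in self.latencies]
        p95 = quantiles(lat_ms, n=20)[-1]
        print(f"frames: {len(lat_ms)}  median: {median(lat_ms):.0f} ms  "
              f"p95: {p95:.0f} ms")
```

Run something like that inside mm-delay/mm-link with the included traces
and you have the raw material for a Figure 8(a)-style plot.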

~~~
dtaht99
Nice to see you, Keith. I agree that the network latency part of the
equation tends to get too much play, when glass-to-glass is the better
metric. On the other hand, the bufferbloat effort started where network
latencies were often the dominant contributor (seconds!), and for those who
have actually implemented things like your e2e algorithms and/or installed
stuff like fq_codel/fq_pie/sch_cake, the encoding step now dominates.

'Course, I still kind of miss scan-lines, and had hoped we'd finally all have
enough bandwidth to just blast raw frames over the network by now.

/me hides

------
vdnkh
From what I understand, Salsify requires a unique connection for each device
being served a stream. Therefore (like WebRTC) you can't cache frames in a
CDN; and each viewer requires a connection to the origin. This gets a little
expensive (which is one of the reasons why WebRTC isn't very common relative
to HTTP streaming). Have you explored any kind of multicast or cache-friendly
variation of Salsify?

------
ralphm
Earlier:

[https://news.ycombinator.com/item?id=16964112](https://news.ycombinator.com/item?id=16964112)

[https://news.ycombinator.com/item?id=16802079](https://news.ycombinator.com/item?id=16802079)

------
xt00
If this somehow fixes the situations where I have to click pause and un-pause
to get a video to actually play, that’d be a step forward for humanity...

~~~
schlipity
I think this is caused by your browser not allowing video to autoplay, and
the web site not accounting for or checking for this state.

So by toggling the pause / play button, you get the video control in a state
that finally mirrors the browser's state.

~~~
xt00
I was thinking more of the case where the video was playing fine, then it
starts buffering and pauses, and then sits there indefinitely unless I
intervene by clicking pause/unpause, etc. -- basically caused by an
imperfect connection. I think, like myself, many people have thought, “oh,
if I just wait it’ll start playing again.” But I suspect many of the people
who make the video sites don’t want each of the client pages to go into a
polling loop that spams their servers, so they have a sort of back-off
scheme where, if your internet link is not perfect, things progressively
get worse and worse and the player just sits in a slow retry loop. So the
user hitting a button basically interrupts the slow retry loop and
effectively commands the player instance to try streaming again.
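
Purely as a guess at what such a back-off loop might look like (a
hypothetical sketch, not any particular site's player code):

```python
import random
import time

def rebuffer_retry(fetch_next_chunk, user_pressed_play):
    """Hypothetical player back-off loop: each failed refill waits longer
    before retrying, so on a flaky link the player ends up parked in a
    slow retry cycle until the user pokes it."""
    delay = 1.0
    while True:
        if user_pressed_play():
            delay = 1.0          # pause/unpause resets the back-off
        if fetch_next_chunk():
            return               # got data, playback can resume
        time.sleep(delay + random.uniform(0, 0.5))   # jittered wait
        delay = min(delay * 2, 60.0)  # exponential back-off, capped
```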

------
j1elo
REMB (actually the correct name would be GCC) is the de-facto rate control
and bandwidth congestion estimation algorithm used by WebRTC (I have some
short notes with links to the algorithms that were proposed, in [0]), and
it's a technology that for all practical purposes has been kind of
abandoned since 2014.

So it's only natural that it should be possible to greatly improve upon
what we're using today. I'm glad to see advances in this field, because
it's sad that video calls are still a hugely worse experience than good old
phone calls...

[0]: [https://doc-kurento.readthedocs.io/en/latest/knowledge/congestion_rmcat.html](https://doc-kurento.readthedocs.io/en/latest/knowledge/congestion_rmcat.html)

------
Expez
Is anyone working on putting these ideas into a working product?

I'm taking guitar lessons over Skype and for the most part it works just fine.
However, every once in a while we have to resort to recording and sending
snippets over the wire to get the required fidelity. With digital amp
simulators this is thankfully a trivial exercise, but it would be great to do
without.

I've looked for other alternatives, but couldn't really find any that fit the
bill. The last thing I looked at was a few Audio over IP products, but all of
them were designed to be run on a LAN.

------
jasonhansel
I'm surprised that sites that serve videos--like YouTube and Netflix--haven't
already come up with this idea, of using a codec optimized for transmission
over the network.

I wonder if we can do this with WebSockets? (The decoder might need to be in
WASM for performance to be worthwhile.)

~~~
labawi
I believe the main improvement here is latency, which in a video streaming
context is a non-issue.

------
acd
A little bit of the latency comes from bufferbloat, where network device
buffers fill up and add delay.

[https://www.bufferbloat.net](https://www.bufferbloat.net)

------
tr33house
Glad to see Sprout-related work going on, Keith + Sadjad! I remember being
amazed by Sprout a few years ago.

------
sansnomme
Does anyone have suggestions for what protocol should be used for
in-browser cloud-gaming streaming?

~~~
dochtman
WebTransport
([https://wicg.github.io/web-transport/](https://wicg.github.io/web-transport/))
is looking interesting for the mid-term future perhaps. In any case,
building on QUIC is probably a good idea.

------
dillonmckay
So, how is this better or worse than RTSP and/or h.265?

~~~
keithwinstein
It does a different thing.

RTSP is mostly about control and framing. It doesn't specify any particular
algorithm to estimate the network's capacity, how a video encoder should try
to match that estimated capacity, or how to recover from lost packets.

H.265 is a format for compressed video that defines a bit-exact decoder. It
doesn't specify the way that an encoder encodes anything into the
compressed format, how the encoder should try to meet an externally
supplied frame-size or bitrate target (while also meeting a latency
target), or what the API should be.

The Salsify techniques could work fine with RTSP, and could work fine using
H.265 as the coded video format. The special thing about Salsify is really
about where the control lies.

Traditionally (in Skype, FaceTime, or the WebRTC.org codebase), there is a
drop-in codec with its own control loop (making frame-by-frame decisions)
and a congestion-control protocol with its own control loop (making
packet-by-packet decisions), and these control loops operate on close
enough timescales that they end up interacting poorly. And the API to the
codec is generally too limited (especially when it's a very general API, as
in WebRTC.org, that tries to abstract across pre-existing implementations
of VP9/H.264/H.265 to give the application agility across different
formats) to achieve the kind of rapid adaptation to network flakiness that
you need over cellular or bad Wi-Fi networks.

Salsify basically says, "hey, if your codec supports a functional-style API,
and your transport protocol too, and you can extract the long-lived control
state from each of those individual modules and just have one control loop
that jointly controls both the codec module and the transport module, you can
do a heck of a lot better."
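
For the curious, here is a deliberately simplified sketch of that single
control loop. The objects it manipulates (camera, codec state, transport)
are hypothetical stand-ins, not our actual implementation; see the paper
for the real details.

```python
# Simplified sketch of Salsify-style joint control (illustration only).
# The key property it relies on: encode() is pure, so we can produce
# candidate encodings of the same frame against an explicit decoder state
# and commit only the one we actually send.

def sender_loop(camera, initial_state, transport):
    codec_state = initial_state
    for frame in camera:
        # One capacity estimate, maintained by the transport from its own
        # feedback: roughly, how many bytes the path can absorb for the
        # next frame without building a queue.
        budget = transport.estimated_frame_budget()

        # Encode the frame twice, at quality levels bracketing the budget.
        lower  = codec_state.encode(frame, target_size=0.9 * budget)
        higher = codec_state.encode(frame, target_size=1.1 * budget)

        if transport.in_trouble():
            # Network is backed up: send nothing. Because nothing was
            # committed, the decoder's state is exactly where we left it.
            continue

        # Pick whichever candidate actually fits the current budget.
        chosen = higher if len(higher.payload) <= budget else lower

        transport.send(chosen.payload)   # packetize and pace as usual
        codec_state = chosen.new_state   # commit only the version we sent
```

The functional API is what makes the two candidates and the skip cheap: no
encoder state advances unless we explicitly commit the version that was
actually sent.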

~~~
dillonmckay
Thank you for clarifying that for me.

------
summm
TL;DR: they created a VP8 implementation that is purely functional and
exploited that property to integrate the control loops of the codec and the
transport protocol. Sounds promising.

------
swami26
is this worse or better than zoom?

~~~
keithwinstein
Well, we've never installed a rootkit on anybody's computer, so at least we've
got that going for us...

More to the point, we don't have an empirical end-to-end measurement of Zoom
the way we do for Skype, FaceTime, Google Hangouts, and WebRTC-in-Chrome (with
and without VP9-SVC). But my understanding is that Zoom is architected
similarly to those systems and can be expected to behave within the same
envelope. Would need to measure it to know for sure.

On the other hand, Zoom is an actual product with users, and Salsify is a
research prototype that doesn't even have audio, much less users. So hard to
compare outside the narrow technical questions of video quality and latency
over imperfect networks.

~~~
ianlevesque
These might be the most patient and detailed replies to low effort questions
I’ve ever seen.

------
moltar
Middle out

~~~
giancarlostoro
To this day I would love to see some Pied Piper type of breakthrough.

~~~
Mathnerd314
[https://blogs.dropbox.com/tech/2016/07/lepton-image-compression-saving-22-losslessly-from-images-at-15mbs/](https://blogs.dropbox.com/tech/2016/07/lepton-image-compression-saving-22-losslessly-from-images-at-15mbs/)

Not really a breakthrough, though; JPEG is just an old format. But maybe
machine learning stuff will get the 4x video compression seen in the show.

