
Salsify – A New Architecture for Real-time Internet Video - jremmons
https://snr.stanford.edu
======
mgamache
To me, the key innovation here is the tight integration between network
conditions and codec frame size. Standard codecs are created with specific
bandwidth requirements and they provide encoded frames that 'average' around
that size. You _could_ just re-initialize a codec at a lower bandwidth on the
fly, but you would have to send an I frame (large full frame) to kickoff the
new series of frames (as video most video frames are just updates of a
previous frame). Having a codec accept a bandwidth target per frame is a
really good idea.
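
A minimal sketch of what that could look like (C++, made-up names like
PerFrameEncoder, not Salsify's actual API):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct EncodedFrame {
        std::vector<uint8_t> payload;
        bool is_keyframe;
    };

    class PerFrameEncoder {
    public:
        // Conventional API: one average bitrate for the whole session.
        virtual void set_target_bitrate(uint32_t bits_per_second) = 0;

        // Per-frame API: a byte budget chosen just before encoding,
        // e.g. from the transport's current estimate of network capacity.
        virtual EncodedFrame encode(const uint8_t* yuv_frame,
                                    size_t target_frame_bytes) = 0;

        virtual ~PerFrameEncoder() = default;
    };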

~~~
jonex
Codecs used by real-time video systems are able to adjust the bitrate on the
fly. There's no keyframe request every time that happens unless the resolution
changes. How quickly they adjust might vary; software implementations
generally do it for the next encoded frame. The frame will still be somewhat
larger or smaller than the target size, since codecs can't accurately
predict the encoded size for given quality parameters.

The Salsify implementation in the paper has a slightly more accurate way of
producing a single frame: it encodes two versions with different quality
targets and takes the largest one below the frame-size target.
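
Roughly, with a hypothetical Encoder type (a sketch of the idea, not the
paper's code):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct RawFrame { /* raster data elided */ };
    struct EncodedFrame { std::vector<uint8_t> payload; };

    // Stub standing in for a real codec; higher quality => bigger output.
    struct Encoder {
        EncodedFrame encode(const RawFrame&, int quality) {
            return { std::vector<uint8_t>(static_cast<size_t>(quality) * 100) };
        }
    };

    // Encode the same input at two quality settings and send the best
    // version that fits the byte budget supplied by the transport.
    EncodedFrame pick_frame(Encoder& enc, const RawFrame& input,
                            size_t budget_bytes, int lo_q, int hi_q) {
        EncodedFrame low  = enc.encode(input, lo_q);
        EncodedFrame high = enc.encode(input, hi_q);
        // (The real system may also skip the frame entirely if even the
        // low-quality version overshoots the budget.)
        if (high.payload.size() <= budget_bytes)
            return high;  // better quality still fits
        return low;       // fall back to the smaller frame
    }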

~~~
nitrogen
For a resolution change, couldn't you just scale the last frame at the old
resolution to the new one and use that as the basis for more P-frames?
(Originally replied to the wrong comment.)

~~~
derf_
Short answer: Yes.

Longer answer: The codec needs to support it. Codecs actually allow prediction
from multiple reference frames, and maintain a buffer of them (2 to 16,
depending on the codec, profile, and level). An individual frame may refer to
several (potentially all) of those. So re-scaling up to 16 frames for every
frame you decode will get quite expensive, not to mention the generational
losses of doing this repeatedly for every resolution change. In practice what
happens is you scale individual blocks when they get referenced by the current
frame. But that has to be integrated into the motion compensation routines of
the codec.

Both VP9 and AV1 support this, for example.
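
A sketch of that block-level scaling (made-up names; nearest-neighbor
sampling for brevity and ignoring motion vectors, where real codecs use
proper subpel interpolation filters):

    #include <cstdint>
    #include <vector>

    struct Plane {
        int w, h;
        std::vector<uint8_t> pix;  // row-major luma samples
        uint8_t at(int x, int y) const { return pix[y * w + x]; }
    };

    // Predict one bsize x bsize block of the current frame from a
    // reference frame of a *different* resolution, scaling only the
    // samples this block actually touches.
    void predict_block_scaled(const Plane& ref, int cur_w, int cur_h,
                              int bx, int by, int bsize, uint8_t* out) {
        for (int y = 0; y < bsize; ++y) {
            for (int x = 0; x < bsize; ++x) {
                // Map current-frame coordinates into the reference frame.
                int rx = (bx + x) * ref.w / cur_w;
                int ry = (by + y) * ref.h / cur_h;
                out[y * bsize + x] = ref.at(rx, ry);
            }
        }
    }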

------
rawrmaan
Aside from the fact that the tech is obviously cool, I think the FAQ section
is really well-written. Props to the team.

~~~
sarreph
I agree, refreshingly top-notch FAQs!

~~~
josephpmay
It reminds me of the Def Con and Coachella FAQs

------
noelwelsh
I'm fairly sure that Ben Orenstein and a friend are forming a company to
commercialise this as a Screenhero replacement. Discussed on this podcast:
[http://artofproductpodcast.com/episode-39](http://artofproductpodcast.com/episode-39)

Very interested to see what they cook up (and kinda envious I didn't have the
idea / don't have the space in my life to have a crack at it myself---it
sounds very interesting).

------
khalilravanna
I've been taking the Financial Markets course by Robert Shiller, and he
continuously makes the point when talking about inventions and new ideas that
"it's crazy to me that this didn't exist before". It's usually the sign of a
really good invention when you have that thought. And that's the thought I'm
having looking at this combining the codec and transport protocol together:
"Why hasn't this been done before?" == "This is awesome!"

~~~
ggambetta
That's easy to say in hindsight, but it's easy to come up with all sorts of
crazy ideas, ask yourself _"why hasn't this been done before?"_, and then
find out that the answer is _"it has, it turned out to be a terrible idea,
and that's why I've never heard of it"_.

------
zParticle
A bigger frustration I experience is that some streaming seems to just "give
up", stalling and never resuming. I know the connection and server are okay
because I can usually force it to resume manually, e.g. by doing a page
refresh, so is it just bad server architecture or a codec problem?

~~~
jjoonathan
Youtube fixed this problem many years ago, but it recently became un-fixed.
New codec?

~~~
noir_lord
Maybe; it happens on FF fairly frequently but never (that I can recall) on
Chrome (same distro, Fedora 27).

------
bsder
Um ... from the paper ...

"6.1 Limitations of Salsify

No audio. Salsify does not encode or transmit audio."

Claiming that you beat a bunch of codecs that have synchronized audio (even
though they disable it) is kind of misleading ...

~~~
keithwinstein
Co-author here. Totally reasonable reaction, and we've heard this when the
paper was posted elsewhere (e.g. on Reddit), but have not heard it from
specialists, and honestly we suspect it's probably a red herring. Salsify's
gains on the "delay" metric are mostly coming from two things: (1) the way
that it restrains its compressed video to avoid building up in-network queues
(which audio must also transit) and provoking packet loss, and (2) the way
that it recovers more quickly from network glitches (check out the video).

If you wanted to add audio to Salsify, you would want to control a receiver-
side video and audio buffer to reduce audio gaps and keep a/v in sync during
periods of happy network, but this is unlikely to affect the system's ability
to recover more quickly from glitches or to avoid building up in-network
queues that delay audio and video alike. If you watch the video (or see Figure
6(f), Figure 7, and Figure 8), I don't think there's much reason to think
audio can justify what the Chrome/webrtc.org codebase is doing -- WebRTC's
frame delays are distributed over a broad range (so it's not like they're
synchronized to some fixed timebase either) and are very high, especially in
the seconds after a network glitch.

More to the point for our academic work, it would have been trivial to add
shitty audio that made no difference to the metrics. The hard-but-necessary
part is in designing an evaluation metric to assess (1) the quality of the
reconstructed audio (including how many gaps/rebuffering delays were there
when the de-jitter buffer went dry), (2) the delay of the reconstructed audio,
keeping in mind this is not constant over time, (3) the quality of the
audio/video synchronization, which also will not be constant over time. Then
measuring that in a fair way across Skype/Facetime/Hangouts/WebRTC/Salsify,
and then trying to decide which compromise on those three axes is desirable.
Somebody should do all that work at some point, but it's a major piece of work
to bite off and pretty far from anything we've done so far.
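
As a toy illustration of metric (1), here is a sketch (mine, not from the
paper) that counts gaps where a de-jitter buffer has gone dry at playout
time:

    #include <cstdint>
    #include <deque>

    // Walk the playout timeline: each tick needs the next audio frame;
    // if the de-jitter buffer doesn't have it yet, the listener hears a
    // gap. Frame ids are assumed to arrive in order for simplicity.
    int count_gaps(std::deque<uint64_t> buffered_ids,
                   uint64_t first_id, uint64_t last_id) {
        int gaps = 0;
        for (uint64_t t = first_id; t <= last_id; ++t) {
            if (!buffered_ids.empty() && buffered_ids.front() == t) {
                buffered_ids.pop_front();   // frame arrived in time
            } else {
                ++gaps;                     // buffer dry at tick t
            }
        }
        return gaps;
    }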

~~~
tatersolid
Opus, with its low delay and solid rate control, would seem to be the natural
pairing here. But I agree audio is likely not the real problem in this space.

Any reason you didn’t choose to start from VP9? Is the encoder still too slow
overall?

------
andygcook
Is this at all related to the company Salsify?
[https://www.salsify.com](https://www.salsify.com)

~~~
mseebach
Seems highly unlikely; at a quick glance there's no overlap.

From the FAQ:

> Why the name “Salsify”?

> It's not a very interesting reason. Salsify comes from an older project
> called “ALFalfa,” for the use of Application Layer Framing in video. Alfalfa
> gave way to Sprout, a congestion-control scheme intended for real-time
> applications, and now Salsify, a new design where congestion-control (in the
> transport protocol) and rate control (in the video codec) are jointly
> controlled. Alfalfa, Sprout, and Salsify are all plantish foods.

The company you linked seems to be meant as "sales-ify".

~~~
ecopoesis
Salsify the PXM company comes from salsify the root vegetable and is
pronounced the same way. Our logo is a stylized salsify flower.

------
mbesto
Slightly tangential...

 _Salsify is led by Sadjad Fouladi, a doctoral student in computer science at
Stanford University, along with fellow Stanford students John Emmons, Emre
Orbay, and Riad S. Wahby, as well as Catherine Wu, a junior at Saratoga High
School in Saratoga, California. The project is advised by Keith Winstein, an
assistant professor of computer science.

Salsify was funded by the National Science Foundation and the Defense Advanced
Research Projects Agency (DARPA). Salsify has also received support from
Google, Huawei, VMware, Dropbox, Facebook, and the Stanford Platform Lab._

Financially supported by the government and tech juggernauts, and executed by
top-tier doctoral students + a high school student + a top-tier university
professor.

Assuming this could be a game-changing innovation to further advance worldwide
communication, it's refreshing to see the positive externalities of a
combination of capitalistic (F500 tech co's) and socialistic (university,
government) systems executed by a seemingly diverse set of actors.

~~~
CapacitorSet
How is university and government funding socialistic? Neither involves the
workers' ownership of the means of production (and indeed, they exist in a
state that is anything but socialist).

~~~
omeid2
Socialism is not Communism.

~~~
CapacitorSet
That's correct, but it does not invalidate my point. Socialism involves the
workers' ownership of the means of production.

------
quickthrower2
Even Richard Hendricks didn't combine the codec and the transport. Genius.

~~~
topranks
Richard Hendricks is an idiot who is currently attempting a private takeover
of the Internet (presumably) on behalf of the NSA.

~~~
r32a_
I don't get why they say Pied Piper is decentralized and a "new open
internet" when Pied Piper is building it and is closed source.

~~~
KillerRabbitt
Because the majority of people watching the show either don't know or don't
care.

------
cbhl
(Disclaimer: This comment is my personal opinion, not that of my employer.)

Really exciting work.

Encoding multiple versions of a video and picking a smaller one in response to
congestion already happens for video-on-demand (think YouTube and Netflix
videos) in DASH. That said, with VOD you can encode the video slower than
real-time.

I can't imagine this ever making it into Skype/FaceTime/Hangouts/Duo. The big
corps will probably continue to focus on "more internet" (fiber optic, zero
rating, wi-fi hotspots, and internet traffic management practices).

~~~
a-dub
DASH uses big chunks of operator-selectable size, though, as it is
codec-agnostic. I wonder if coupling transport and codec could have benefits
for the massively scalable VOD case (i.e. pre-render a bunch of stuff up front
and run a "lite" codec that is coupled to the transport layer and aware of
network conditions).
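
For contrast, the conventional DASH-style selection (illustrative names, not
any particular player's API) happens once per multi-second segment from
measured throughput, rather than per frame:

    #include <cstdint>
    #include <vector>

    struct Representation { uint32_t bits_per_second; /* URL, etc. */ };

    // Pick the highest rung of the bitrate ladder that fits within
    // ~80% of the measured throughput, once per segment.
    const Representation& pick_representation(
            const std::vector<Representation>& ladder,  // sorted ascending
            uint32_t measured_throughput_bps) {
        const Representation* best = &ladder.front();
        for (const auto& r : ladder) {
            if (r.bits_per_second * 5ull <= measured_throughput_bps * 4ull)
                best = &r;
        }
        return *best;
    }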

------
nodja
The cost here seems to be bandwidth; is this because you're using VP8? Could
this be adapted for other codecs like AV1?

~~~
zaroth
The cost is encoding frames that won’t be sent, and a non-constant frame
rate. Bandwidth itself is utilized more efficiently.

~~~
nodja
Let me rephrase the question. Assuming that in the video the red background is
the network capacity and the lines are the video bitrate used, Salsify seems
to be using ~6000kbps and WebRTC ~2500kbps. Is this higher bitrate because
you're using VP8, or is it a limitation of the protocol? If it's because of
VP8, how hard would it be to adapt to modern codecs like HEVC and AV1?

~~~
donpark
I don't see anything in what they did that won't work with HEVC or AV1. They
just hacked VP8 to be able to save and restore codec state per frame so they
can generate multiple versions of the next frame, choosing the smaller one
when network conditions are bad. Their innovation is in preventing congestion
rather
than reacting to the aftermath of congestion.
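
A sketch of that save/restore trick (hypothetical names; the point is that
the codec's state becomes an explicit value you can copy and re-use):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct CodecState { /* reference buffers, entropy contexts, ... */ };
    struct RawFrame   { /* raster data elided */ };
    struct EncodedFrame {
        std::vector<uint8_t> payload;
        CodecState next_state;   // state the codec *would* advance to
    };

    // Stub standing in for a real trial encode from an explicit state.
    EncodedFrame encode_from(const CodecState& s, const RawFrame&, int q) {
        return { std::vector<uint8_t>(static_cast<size_t>(q) * 100), s };
    }

    // Two trial encodes from the *same* saved state; commit to whichever
    // version the network can afford and discard the other, so the
    // rejected trial never perturbs the codec's state.
    void send_one_frame(CodecState& state, const RawFrame& input,
                        size_t budget_bytes) {
        EncodedFrame lo = encode_from(state, input, /*quality=*/20);
        EncodedFrame hi = encode_from(state, input, /*quality=*/40);
        const EncodedFrame& chosen =
            (hi.payload.size() <= budget_bytes) ? hi : lo;
        state = chosen.next_state;   // advance the explicit state
        // transmit(chosen.payload);
    }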

~~~
mgamache
Right, they may have used VP8 because VP9 and HEVC are more CPU-intensive;
with VP8 they could technically encode more frames per second than the 60 FPS
input (which matters when each frame is encoded twice).

~~~
pishpash
They probably used it because the source code was the most readily available.

------
baxtr
It's 2018 and I still have many dropped calls and other weird stuff when I
talk with people on my mobile. FaceTime Audio is often a good alternative but
still not perfect. So, I really hope the audio version of this will be
commercialized soon.

------
lyinawake
Unfortunately this would only apply to one-on-one, low-latency video chats.
For streaming to an audience, which generally uses a distribution network
between the user and the video source to help handle load and geographical
distribution, the CDN itself has no influence on video encoding. The CDN would
need to jump in and do this back-and-forth negotiation and delivery of
lower-quality frames, which it is not currently suited for. I'd love to see it
come about, but it's not just the codecs we need to look at for adoption
beyond point-to-point video calls.

~~~
derf_
The other major limitation is that forking the encoder state significantly
inflates the number of reference buffers you need to keep, which greatly
increases memory requirements. That's not much of an issue for software, but
it can be a significant problem for hardware (a lot of real-time interactive
encoding is still done purely in software, however).
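
Back-of-the-envelope numbers (mine, not from the paper) for why that
inflation hurts, assuming 1080p YUV 4:2:0 and a codec that keeps 8 reference
buffers:

    #include <cstdio>

    int main() {
        const double mib = 1024.0 * 1024.0;
        const double frame_bytes = 1920.0 * 1080.0 * 1.5;  // ~3 MiB/frame
        const int refs  = 8;  // reference buffers (codec-dependent, 2-16)
        const int forks = 2;  // encoder states kept alive per frame
        std::printf("one state: %.0f MiB, forked: %.0f MiB\n",
                    refs * frame_bytes / mib,
                    forks * refs * frame_bytes / mib);  // ~24 vs ~47 MiB
    }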

------
cryptonector
There's a vegetable named "salsify", very yummy.
[https://duckduckgo.com/?q=salsify+vegetable&t=ffab&ia=recipe...](https://duckduckgo.com/?q=salsify+vegetable&t=ffab&ia=recipes)

------
hexane360
Barely related to this, but looking at the results (section 5.2) I'm amazed at
how much worse T-Mobile is for latency. AT&T and Verizon both give about 2 s
of delay for Hangouts, while T-Mobile gives 7 s of delay.

~~~
jremmons
The reason T-Mobile looks so bad is that the T-Mobile trace was from a 3G
network with very poor conditions, while the others (AT&T and Verizon) were
from LTE networks under relatively good conditions. You shouldn't compare the
quality of the carriers from our results.

~~~
hexane360
Darn, I forgot UMTS was 3G. I think I was confusing it with HSPA+ (which I
guess isn't really 4G either). Great work on the project.

------
amelius
> What would you say to tomorrow’s codec implementers?

> Standardize an interface to export and import the encoder’s and decoder’s
> internal state between frames!

Can't this be achieved using sandboxing/emulation/VM techniques?

~~~
pjc50
Not very efficiently, which is kind of the point here.

------
dang
Another recent discussion was
[https://news.ycombinator.com/item?id=16802079](https://news.ycombinator.com/item?id=16802079).

------
pishpash
Kudos for making things accessible. However, joint source-channel coding is
not news, especially at the level of scalable video coding (probably
20-year-old research by this point). In academia this isn't as exciting as it
sounds to industry.

~~~
cbhl
Do you know if scalable video coding actually ended up being implemented in
industry (YouTube/Netflix/Hulu/Amazon)?

~~~
tincholio
In a completely different way (i.e. DASH). SVC came out with bad timing, as
the move towards HTTP video was gaining momentum. Also, around 2008(?), when
the spec was finalized, there was no HW encoding support, so it was
pretty unusable in practice.

The idea of using layers, though, is much older (I remember reading papers
about this back in 2001 or so).

------
kappi
This is not new. If you google video telephony rate-adaptation techniques
based on network conditions, you will find many (even from the 1980s).

------
profalseidol
How does this compare to Pied Piper's algorithm?

------
jestar_jokin
"Is this a startup company?

No.

Are you sure? Your website looks like a startup company’s.

It's just the HTML template! They all look like this. [...]"

Brilliant

~~~
toomanybeersies
It is a surprisingly nice looking website for an academic project.

~~~
pishpash
Or a trial balloon for commercialization.

------
sercand
I saw this on Reddit before. Most were not impressed since it doesn't support
audio.

------
tarheeljason
> Salsify is led by [...] Catherine Wu, a junior at Saratoga High School in
> Saratoga, California.

Oh

~~~
ggg9990
Shows you the difference in opportunities for a smart kid in Bumblefuck vs. a
smart kid living in a $5 million house in Silicon Valley and having
transportation to Stanford.

~~~
jgh
I always laugh (while secretly crying) at the stories from high school that
people write on here. Seems like everybody went to top-tier _tech high
schools_; meanwhile, the most computer science I got was being taught Turing
by a gym teacher.

