Salsify – A New Architecture for Real-time Internet Video (stanford.edu)



Hi all -- Salsify co-author here. Surprised to see us here again, but happy to be part of the conversation (here's a previous one: https://news.ycombinator.com/item?id=16964112).

This work was led by my student Sadjad Fouladi. If you liked Salsify, you might really like Sadjad's talk in June at USENIX ATC about "gg", his system for letting people use AWS Lambda as a rented supercomputer (e.g. he can compile inkscape and Chromium really fast by outsourcing the computation to 8,000 Lambda nodes that all talk to each other directly over UDP): https://www.youtube.com/watch?v=Cc_MVldSijA (code here: https://github.com/StanfordSNR/gg)

You might also be interested in our current video project, led by my student Francis Yan, on trying to improve live video streaming. If you visit and watch some live TV you can help contribute to our study: https://puffer.stanford.edu


Hey Keith,

Amazing work; I am really impressed with what you are doing. I have never found any good reading in this area besides your work; the only other things I have seen are GStreamer's rtpjitterbuffer and libwebrtc, and I haven't really felt confident in what I've learned from either.

Have you thought about testing/comparing against other WebRTC implementations? It makes me a little sad to see WebRTC get a bad name just because one implementation has issues. WebRTC is a huge opportunity to get companies to invest in one thing, instead of reinventing the wheel and locking people in. Do you think things could be improved and eventually match Salsify?

Somewhat unrelated, but I am working on Pion WebRTC[0]. By design it is not coupled tightly with the encoder/decoder (but I want to give feedback to the user so it could influence it if they wanted). Do you have any suggested readings, and would you be ok with me contacting you directly? I really want to build something elegant that allows people to build amazing things with WebRTC. Right now I have been having people do things manually; I don't want to ship something without doing my due diligence.

thanks

[0] https://github.com/pion/webrtc


Thanks for your kind words -- you're certainly welcome to contact us if you think we can help. Yes, as I wrote elsewhere here, it would probably be possible to do everything Salsify does within the context of WebRTC, if you get to change the sender and receiver. If you're only changing the sender and you want to interoperate with receivers running the WebRTC.org codebase (e.g. Chrome), you have less flexibility. If you're only changing the sender and you want to work with existing high-performance video codecs and the API they expose, even less flexibility.

My main advice for implementers would be, benchmark your end-to-end glass-to-glass video latency (including the time spent waiting for the next frame to be captured & encoded, and then the end-to-end latency through to the display of that frame) over varied/unpredictable networks. In my experience, implementers sometimes get caught up focusing too much on network-layer measurements (IP latency) and can end up missing what in my view is the bottom line: glass-to-glass video latency. You can use our mahimahi mm-delay/mm-link/mm-onoff/mm-loss tools (part of Debian/Ubuntu) and the included traces to model some bad networks. And then see how you do on the same traces we use in our paper. If you can make a plot like our Figure 8(a) and it all looks good, that seems like good progress in my book.
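
For the analysis side, once you have per-frame capture and display timestamps (e.g. by stamping a frame index or barcode into each frame), computing the numbers is simple. A minimal sketch, assuming two hypothetical whitespace-separated logs of "frame_id timestamp_ms":

    // Sketch: per-frame glass-to-glass delay from two logs, "frame_id capture_ms"
    // and "frame_id display_ms" (hypothetical format). Measures frames, not packets.
    #include <algorithm>
    #include <cstdint>
    #include <fstream>
    #include <iostream>
    #include <unordered_map>
    #include <vector>

    int main(int argc, char *argv[]) {
      if (argc != 3) { std::cerr << "usage: g2g capture.log display.log\n"; return 1; }

      std::unordered_map<uint64_t, double> capture;   /* frame_id -> capture time (ms) */
      std::ifstream cap{argv[1]};
      uint64_t id; double ts;
      while (cap >> id >> ts) { capture[id] = ts; }

      std::vector<double> delays;                     /* per-frame glass-to-glass delay (ms) */
      std::ifstream disp{argv[2]};
      while (disp >> id >> ts) {
        auto it = capture.find(id);
        if (it != capture.end()) { delays.push_back(ts - it->second); }
      }
      if (delays.empty()) { std::cerr << "no matched frames\n"; return 1; }

      std::sort(delays.begin(), delays.end());
      std::cout << "frames matched: " << delays.size() << '\n'
                << "median delay:   " << delays[delays.size() / 2] << " ms\n"
                << "p95 delay:      " << delays[delays.size() * 95 / 100] << " ms\n";
    }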


Nice to see you, Keith. I agree that the network-latency part of the equation tends to get too much play, and that glass-to-glass is the better metric. On the other hand, the bufferbloat effort started when network latencies were often the dominant contributor (seconds!), and for those who have actually implemented things like your e2e algorithms and/or installed stuff like fq_codel/fq_pie/sch_cake, the encoding step now dominates.

'Course, I still kind of miss scan-lines, and had hoped we'd finally all have enough bandwidth to just blast raw frames over the network by now.

/me hides


First of all: great work!

Two questions.

(1) Are there any benefits over H.264/RTP/UDP in a point-to-point streaming scenario (like two non-stationary nodes directly connected via WiFi)?

(2) If it were implemented on fast enough hardware, are there any obstacles to achieving sub-20ms latency in a scenario like the one described in (1)?

Thanks!


Thanks!

(1) Salsify is mostly about how you control the video encoder (e.g., a VP8/VP9/AV1/H.264/H.265 encoder) and transport protocol (e.g., RTP/UDP). H.264, RTP, and UDP themselves don't say anything about the control part, i.e., how to (a) estimate the network path's varying capacity, (b) adapt the desired frame size to match that capacity, and (c) actually encode a frame of video to match the desired compressed frame size or bitrate.
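
To make that split concrete: once the transport hands you a capacity estimate, step (b) is basically arithmetic. A rough sketch (an illustration, not the exact algorithm from the paper; all names are made up):

    // Sketch of step (b): turn an estimated path capacity and a per-frame delay
    // budget into a target compressed frame size. Illustration only.
    #include <algorithm>
    #include <cstdint>

    uint32_t target_frame_size(double est_capacity_bytes_per_s,  /* from the transport */
                               uint32_t bytes_in_flight,         /* sent but not yet delivered */
                               double frame_delay_budget_s) {    /* how long this frame may take */
      const double budget = est_capacity_bytes_per_s * frame_delay_budget_s;
      const double available = budget - bytes_in_flight;   /* don't add to the standing queue */
      return static_cast<uint32_t>(std::max(0.0, available));
    }

Step (c), getting an encoder to actually produce a frame of roughly that size in a single pass, is the part that conventional encoder APIs make hard.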

If you have an unpredictable/variable network path, Salsify is probably better than the control strategies in systems with less tightly controlled encoders and transports, e.g. WebRTC.org/Chrome, Hangouts, Skype, or FaceTime. If the network is mostly constant or at least predictable, Salsify's not going to be helpful. So, bottom line is "maybe."

(2) Getting 20ms p95 glass-to-glass latency with compressed digital video is pretty difficult even under ideal circumstances, and basically impossible with a consumer webcam. At 60 Hz, just the interval between two frames is already 17 ms! So if you have one frame's worth of buffer between the camera/USB/encoder/sender and a frame's worth of buffer in the receiver/video card/display, you're already toast. You would really have to work hard to pipeline everything. Even the very best gaming monitors have latencies on the order of 10 milliseconds, and it's hard to buy a USB camera that even starts giving you any bits from the picture before the exposure is over. (And v4l2 doesn't return from VIDIOC_DQBUF until the frame is completely received, so you're probably looking at changing the kernel if you want to use UVC.) So <20ms I think is really hard. DJI just released an end-to-end custom-engineered system that claims 28ms latency (https://www.dji.com/fpv/info#specs).

If you look at Figure 8(a) from the Salsify paper, Chrome (using the WebRTC.org codebase) is getting per-frame latencies on the order of 600 milliseconds even when the network is totally constant and perfect, and Salsify's per-frame latencies are around 250 milliseconds but less consistent. Then you have to factor in the extra latency associated with waiting for the next frame -- if these systems are running at 10 fps (look at the extreme sparsity of the dots after the network hiccup, especially for WebRTC), there's 100 ms of extra glass-to-glass latency right there. Getting to <20ms @p95 is just another world from where these programs are today.


Just wanted to say that it's great you are crediting and empowering your students with your leadership. It's responsible and inspiring.


Without going into too much detail, how much would need to change about conventional, deployed VP9/H.264 + WebRTC systems to perform like your system? It seems to me that the vast majority of what these systems do is not in the way of your goals, so I kinda wonder why it's billed as something completely separate rather than something incremental.

AFAIK, codecs exposing rate-control functionality to applications is not a new idea, and most of WebRTC is functionally equivalent to any protocol of its sort; so before I go and read your paper, it'd be nice to know the reasoning behind the seemingly ground-up approach.


Realistically it would probably be a major refactor to adapt the WebRTC.org codebase to a Salsify-style design. The WebRTC.org codebase is about a million lines of code (about half of which is vendored third-party code, e.g. libvpx) so any serious refactor is probably out of the realistic capabilities of a university-based research group. For comparison -- the entire Salsify codebase is about 16,000 lines of C++ (including the codec), plus about 7,000 lines of assembly we took from libvpx. It does a heck of a lot less than WebRTC.org, but it's a lot easier to prototype new designs that way.

Salsify is mainly about the benefits you can get if you (a) extract the control loop out of the video codec, and make it expose a functional-style API, (b) use a Sprout-like congestion control algorithm that tries to follow evolving network capacity quickly and estimate "how many bytes can we send right now while trying to maintain a bound on end-to-end delay", and (c) have a single control loop that works every frame and has the choice of whether to send (1) a frame whose coded length is already known and is about equal to what we think the network can handle [it is very hard to get this from any existing video encoder on a single pass!], or (2) no frame at all.
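
Very roughly, the shape of (a) plus (c) looks something like the sketch below. Every name and type here is invented for illustration; it's not the actual interface, just the idea that the encoder state is an explicit value and the per-frame decision happens after the candidate sizes are known:

    // Caricature of a functional-style encoder plus a single per-frame decision.
    // All names/types invented for illustration.
    #include <algorithm>
    #include <cstdint>
    #include <optional>
    #include <vector>

    struct EncoderState { int quality = 32; };        /* explicit, copyable codec state */
    struct RawFrame     { /* pixels from the camera */ };

    struct CodedFrame {
      std::vector<uint8_t> bytes;        /* exact length known once encoding finishes */
      EncoderState         new_state;    /* decoder state after applying this frame */
    };

    /* Stand-in "pure" encoder: same (state, frame, quality) in, same coded frame out.
       A real implementation would run VP8 against the explicit state. */
    CodedFrame encode(const EncoderState &state, const RawFrame &, int quality) {
      CodedFrame out;
      out.bytes.assign(500u * static_cast<size_t>(std::max(1, 64 - quality)), 0);  /* fake size */
      out.new_state = state;
      out.new_state.quality = quality;
      return out;
    }

    /* One joint control loop: per frame, send the bigger candidate if it fits the
       transport's current byte budget, else the smaller one, else nothing at all. */
    std::optional<CodedFrame> choose(const EncoderState &state, const RawFrame &frame,
                                     size_t network_budget_bytes) {
      CodedFrame better = encode(state, frame, state.quality - 4);   /* lower quantizer = bigger */
      CodedFrame worse  = encode(state, frame, state.quality + 4);   /* higher quantizer = smaller */

      if (better.bytes.size() <= network_budget_bytes) { return better; }
      if (worse.bytes.size()  <= network_budget_bytes) { return worse; }
      return std::nullopt;   /* skip this frame; try again with the next one */
    }

The point of the explicit state is that both candidates are encoded from the same starting point, so whichever one the loop commits to (or neither), the sender and receiver still agree on what the decoder state is.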

You can certainly do this within the WebRTC protocol, but doing it within the WebRTC.org codebase is going to be a lot of work. :-( I think unfortunately the codebase has some pretty deeply ingrained assumptions about how the control flow is going to go. Inverting that (as we propose), I don't think is an easy incremental change.

Now, you might ask whether there are some incremental improvements you could make to WebRTC.org to get 80% of the benefits of Salsify without a major refactor. E.g., maybe you don't need the functional API if you just make some tweaks to the rate-control algorithm and congestion control. We don't know for sure, but I suspect that for somebody already experienced with that codebase, yes, there are probably gains to be had. There's another question about whether the gains of Salsify (which are on a particular type of flaky cellular network) might come at the expense of costs on other types of (dependable?) networks. To be confident about that, we'd probably need to try this stuff much more broadly and on real people (this is the motivation for Puffer).


Thank you for this informative answer. :-)

I guess a complementary question is in order: since you believe that the WebRTC protocol itself is not incompatible with the mechanisms that enable Salsify's performance, how much work would it be to adapt Salsify incrementally into a WebRTC implementation? (Ignoring mandatory codec support for a moment).

I've been developing an Opus codec mode for Bluetooth A2DP, and I have a similar sort of situation. I've been looking (casually) into using more interesting rate control based on channel performance to improve QoS with Bluetooth A2DP. I think Opus has a property you could only dream of: strict limits on frame size. :-)

Added: I just found that the video codec uses the VP8 bitstream; that was not immediately clear to me. It seems to me that this is a substantial selling point that should be made clearer on the webpage.

Since your interface doesn't require any change to the bitstream of a conventional codec, it's a heck of a lot closer to public use than I initially thought!


This is an interesting question! I think getting our codebase to use the WebRTC framing and setup would probably not be that hard. We could probably use the same libraries that WebRTC.org is built on (libjingle, etc.), and maybe even just take a lot of their code.

I don't think that means that a Salsify sender would be interoperable with, like, a Chrome receiver, though (even though we're just using VP8) -- we'd have to implement at least the receiver side of Salsify's congestion-control protocol inside WebRTC.org/Chrome, and Salsify sometimes likes to encode a VP8 frame that has to be interpreted relative to a certain (prior) decoder state.

On #2, honestly I've worked with libopus a bit for Puffer and it's just really pleasant to work with. As you say, a lot of the difficulties of interacting with a video encoder you just don't have in this context. It's also pretty easy to get "gapless/clickless" back-to-back playback of audio excerpts that were encoded completely independently, unlike with video where this is a huge pain in the neck and usually requires a SAP/IDR/closed GOP/sequence header+I-frame+P-frame (which takes a lot of bytes so you can't do it very often). See https://github.com/StanfordSNR/puffer/blob/master/src/opus-e... for more, if you are interested.
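
For a sense of how little ceremony there is, a minimal encode loop against the stock libopus API looks about like this (just a sketch; error handling, container packaging, and the details that make excerpts splice cleanly are left out):

    // Minimal libopus encode loop: 48 kHz stereo, 20 ms frames (960 samples/channel).
    // Sketch only; packaging and splicing details omitted.
    #include <opus/opus.h>
    #include <cstdint>
    #include <vector>

    std::vector<std::vector<uint8_t>>
    encode_pcm(const std::vector<int16_t> &pcm) {     /* interleaved stereo samples */
      int err = 0;
      OpusEncoder *enc = opus_encoder_create(48000, 2, OPUS_APPLICATION_AUDIO, &err);
      opus_encoder_ctl(enc, OPUS_SET_BITRATE(128000));

      const int frame_samples = 960;                  /* 20 ms at 48 kHz */
      std::vector<uint8_t> buf(4000);                 /* ample room for one packet */
      std::vector<std::vector<uint8_t>> packets;

      for (size_t off = 0; off + 2 * frame_samples <= pcm.size(); off += 2 * frame_samples) {
        opus_int32 n = opus_encode(enc, pcm.data() + off, frame_samples,
                                   buf.data(), static_cast<opus_int32>(buf.size()));
        if (n > 0) { packets.emplace_back(buf.begin(), buf.begin() + n); }
      }
      opus_encoder_destroy(enc);
      return packets;
    }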


This is a naive question, but would it also fix the problem if there was a video codec that could incrementally compress each frame? And if so, do you think such a codec is feasible to design? I don't know much about video codecs so this is just out of curiosity.


Yeah, that would definitely help if it were efficient. In general the reason people don't use scalable/layered/incremental codecs in 1:1 video transmission is that they're less efficient than just encoding a single stream.

One of the challenges here is that it's hard to know how big a compressed frame will be until you compress it, and that takes time. Ideally an encoder would be able to produce a compressed frame that's the exact size that can be accommodated by the network path, right away. In practice that's not so easy. Salsify gets around this by having the encoder encode two candidate coded frames (one bigger, one smaller) and then giving the application the option of sending either one (after encoding is finished and it knows their exact length and quality) or no frame at all. If the encoder could just do the job perfectly and instantly the first time, you wouldn't need all that.


Any chance you can get this added into WebRTC of the next browser iterations? Submit patches? Is it patented?


So, did you guys even bother looking for potentially confusing uses of that name?

Like maybe salsify.com?


From what I understand, Salsify requires a unique connection for each device being served a stream. Therefore (like WebRTC) you can't cache frames in a CDN; and each viewer requires a connection to the origin. This gets a little expensive (which is one of the reasons why WebRTC isn't very common relative to HTTP streaming). Have you explored any kind of multicast or cache-friendly variation of Salsify?



If this somehow fixes the situations where I have to click pause and un-pause to get a video to actually play, that’d be a step forward for humanity...


I think this is caused by your browser not allowing autoplay video, and the website not accounting for or checking for this state.

So by toggling the pause / play button, you get the video control in a state that finally mirrors the browser's state.


I was thinking more of the case where the video is playing fine, then starts buffering and pauses, and then sits there indefinitely after buffering unless I intervene by clicking pause/unpause, etc. Basically it's caused by an imperfect connection. Like myself, I think many people have thought, "oh, if I just wait it'll start playing again." But I suspect many of the people who make the video sites don't want each of the client pages to go into a polling loop that spams their servers, so they have a sort of back-off scheme where, if your internet link is not perfect, playback progressively gets worse and worse and then just sits in a slow retry loop. So the user hitting a button basically interrupts the slow retry loop and effectively commands the player instance to try streaming again.


REMB (actually the correct name would be GCC) is the de-facto rate-control and bandwidth/congestion estimation algorithm used by WebRTC (I have some short notes with links to the proposed algorithms in [0]), and it's a technology that for all practical purposes has been kind of abandoned since 2014.

So it's only natural that it should be possible to greatly improve upon what we're using today. I'm glad to see advances in this field, because it's sad that video calls are still a hugely worse experience than good old phone calls...

[0]: https://doc-kurento.readthedocs.io/en/latest/knowledge/conge...


Is anyone working on putting these ideas into a working product?

I'm taking guitar lessons over Skype and for the most part it works just fine. However, every once in a while we have to resort to recording and sending snippets over the wire to get the required fidelity. With digital amp simulators this is thankfully a trivial exercise, but it would be great to do without.

I've looked for other alternatives, but couldn't really find any that fit the bill. The last thing I looked at was a few Audio over IP products, but all of them were designed to be run on a LAN.


I'm surprised that sites that serve videos, like YouTube and Netflix, haven't already come up with this idea of using a codec optimized for transmission over the network.

I wonder if we can do this with WebSockets? (The decoder might need to be in WASM for performance to be worthwhile.)


I believe the main improvement here is latency, which in a video streaming context is a non-issue.


A little bit of the latency comes from bufferbloat, where network device buffers cause latency to increase.

https://www.bufferbloat.net


Glad to see Sprout-related work going on, Keith + Sadjad! I remember being amazed by Sprout a few years ago.


Anyone have suggestions for what protocol should be used for in-browser cloud-gaming streaming?


WebTransport (https://wicg.github.io/web-transport/) is looking interesting for the mid-term future perhaps. In any case, building on QUIC is probably a good idea.


https://parsecgaming.com/game-streaming-technology looks interesting for an off-the-shelf type thing, but I guess Salsify would work too. As far as keystrokes go, I guess you would just use TCP or one of its improved variants.


So, how is this better or worse than RTSP and/or H.265?


It does a different thing.

RTSP is mostly about control and framing. It doesn't specify any particular algorithm to estimate the network's capacity, how a video encoder should try to match that estimated capacity, or how to recover from lost packets.

H.265 is a format for compressed video that defines a bit-exact decoder. It doesn't specify the way that an encoder encodes anything into the compressed format, how the encoder should try to match an externally supplied target frame size / bitrate target (while also meeting a latency target), or what the API should be.

The Salsify techniques could work fine with RTSP, and could work fine using H.265 as the coded video format. The special thing about Salsify is really about where the control lies.

Traditionally (in Skype, FaceTime, or the WebRTC.org codebase), there is a drop-in codec with its own control loop (making frame-by-frame decisions), and a congestion-control protocol with its own control loop (making packet-by-packet decisions), and these control loops are at close enough timescales that they end up doing poorly together. And the API to the codec is generally too limited (especially when it's a very general API, as in WebRTC.org, that tries to abstract across pre-existing implementations of VP9/H.264/H.265 to give the application agility across different formats) to achieve the kind of rapid adaptation to network flakiness that you need over these cellular or bad Wi-Fi networks.

Salsify basically says, "hey, if your codec supports a functional-style API, and your transport protocol too, and you can extract the long-lived control state from each of those individual modules and just have one control loop that jointly controls both the codec module and the transport module, you can do a heck of a lot better."


Thank you for clarifying that for me.


TL;DR: they created a VP8 implementation that is purely functional and exploited that property to integrate the control loops of the codec and the transport protocol. Sounds promising.


is this worse or better than zoom?


Well, we've never installed a rootkit on anybody's computer, so at least we've got that going for us...

More to the point, we don't have an empirical end-to-end measurement of Zoom the way we do for Skype, FaceTime, Google Hangouts, and WebRTC-in-Chrome (with and without VP9-SVC). But my understanding is that Zoom is architected similarly to those systems and can be expected to behave within the same envelope. Would need to measure it to know for sure.

On the other hand, Zoom is an actual product with users, and Salsify is a research prototype that doesn't even have audio, much less users. So hard to compare outside the narrow technical questions of video quality and latency over imperfect networks.


These might be the most patient and detailed replies to low effort questions I’ve ever seen.


Different. This is a codec and protocol. Zoom is an app.


I think they mean is this better than what Zoom does.


Middle out


To this day I would love to see some Pied Piper type of breakthrough.


https://blogs.dropbox.com/tech/2016/07/lepton-image-compress...

Not really a breakthrough, though; JPEG is just an old format. But maybe machine-learning stuff will get the 4x video compression seen in the show.



