Low-Latency Video Processing Using Thousands of Tiny Threads [pdf] (usenix.org)
140 points by mpweiher 10 months ago | 31 comments



Co-author here -- thanks for linking to our paper. The basic theme is that (a) purely functional implementations of things like video codecs can allow finer-granularity parallelism than previously realized [e.g. smaller than the interval between key frames], and (b) lambda can be used to invoke thousands of threads very quickly, running an arbitrary statically linked Linux binary, to do big jobs interactively. (Subsequent work like PyWren, by my old MIT roommate, has now done (b) more generally than us.)

My student's talk, which is worth watching and has a cool demo of using the same system for bulk facial recognition, is here: https://www.usenix.org/conference/nsdi17/technical-sessions/...

Daniel Horn, Ken Elkabany, Chris Lesniewski, and I used the same idea of finer-granularity parallelism at inconvenient boundaries, and initially some of the same code (a purely functional implementation of the VP8 entropy codec, and a purely functional implementation of the JPEG DC-predicted Huffman coder that can resume encoding from midstream) in our Lepton project at Dropbox to compress 200+ PB of their JPEG files, splitting each transcoded JPEG at an arbitrary byte boundary to match the filesystem blocks: https://www.usenix.org/conference/nsdi17/technical-sessions/...

My students and I also have an upcoming paper at NSDI 2018 that uses the same purely functional codec to do better videoconferencing (compared with Facetime, Hangouts, Skype, and WebRTC with and without VP9-SVC). The idea is that having a purely functional codec lets you explore execution paths without committing to them, so the videoconferencing system can try different quantizers for each frame and look at the resulting compressed sizes (or no quantizer, just skipping the frame even after encoding) until it finds one that matches its estimate of the network's capacity at that moment.
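
To make that concrete, the exploration loop is roughly the following (a Python sketch under assumptions: a hypothetical pure encode(state, frame, quantizer) that returns the compressed bytes plus the successor state and never mutates anything; this is an illustration of the idea, not the actual code):

    # Sketch of per-frame quantizer exploration with a purely functional
    # encoder. encode() is assumed to be pure: it takes an explicit state and
    # returns (compressed_bytes, next_state), so we can try several options
    # and commit to at most one of them.

    def pick_encoding(encode, state, frame, target_bytes,
                      quantizers=(30, 40, 50, 60)):
        best = None
        for q in quantizers:
            compressed, next_state = encode(state, frame, q)   # no side effects
            if len(compressed) <= target_bytes:
                if best is None or len(compressed) > len(best[0]):
                    best = (compressed, next_state)   # largest frame that still fits
        if best is None:
            return None, state   # skip the frame; the sender's state is untouched
        return best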

One conclusion is that video codecs (and JPEG codecs...) should support a save/restore interface to give a succinct summary of their internal state that can be thawed out elsewhere to let you resume from that point. We've shown now that if you have that (at least for VP8, but probably also for VP9/H.264/265 as these are very similar) you can do a lot of cool tricks, even trouncing H.264/265-based videoconferencing apps on quality and delay. (Code here: https://github.com/excamera/alfalfa). Mosh (mobile shell) is really about the same thing and is sort of videoconferencing for terminals -- if you have a purely functional ANSI terminal emulator with a save/restore interface that expresses its state succinctly (relative to an arbitrary prior state), you can do the same tricks of syncing the server-to-client state at whatever interval you want and recovering efficiently from dropped updates.
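
In code, the interface we're arguing for is tiny. A rough Python sketch (the names and shapes here are hypothetical; alfalfa's real API is C++ and differs in detail):

    # Sketch of a save/restore interface for a purely functional codec.
    # Hypothetical names, not alfalfa's actual API.
    from typing import Protocol, Tuple

    class DecoderState(Protocol):
        def serialize(self) -> bytes: ...    # succinct summary of internal state

    class FunctionalCodec(Protocol):
        def decode(self, state: DecoderState, frame: bytes) -> Tuple[bytes, DecoderState]:
            """Decode one frame against an explicit state; return raster + successor state."""

        def encode(self, state: DecoderState, raster: bytes, quantizer: int) -> Tuple[bytes, DecoderState]:
            """Encode one frame against an explicit state; return compressed frame + successor state."""

        def diff(self, source: DecoderState, target: DecoderState) -> bytes:
            """Express `target` succinctly relative to an arbitrary prior `source` (the Mosh trick)."""

The point is that state becomes an explicit value you can save, ship to another machine, and resume from, rather than something hidden inside the encoder.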

For lambda, it's fun to imagine that every compute-intensive job you might run (video filters or search, machine learning, data visualization, ray tracing) could show the user a button that says, "Do it locally [1 hour]" and a button next to it that says, "Do it in 10,000 cores on lambda, one second each, and you'll pay 25 cents and it will take one second."
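
Back-of-the-envelope for the 25 cents, assuming roughly Lambda's pricing at the largest (1536 MB) memory tier, about $0.0000025 per 100 ms; treat that rate as an assumption, not a quote:

    # Rough cost of the "10,000 cores for one second each" button.
    price_per_100ms = 0.000002501        # USD at the 1536 MB tier (an assumption)
    workers = 10_000
    seconds_each = 1
    cost = workers * seconds_each * 10 * price_per_100ms
    print(f"${cost:.2f}")                # -> $0.25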


Many encoders have a "two pass" mode that saves statistics from a first, fast pass to guide the second pass. Usually these are very simple statistics, used to choose a quantizer and frame type in the second pass. Your method feels like a rather extreme version of this, one that can guide the whole RDO search with a very large state. And, of course, rarely are first passes done in parallel. It's super exciting to see people looking at this - I think there's a lot of untapped potential. The WebRTC case looks interesting too, though there you need extremely low latencies (1 frame), so I'm curious to find out how you tackle that.

One downside with your approach is that you have to code many keyframes that are eventually thrown away. Keyframes are often expensive to encode because their bitrate is so much higher. Have you considered "synthesizing" fake keyframes somehow, such as doing an especially fast and stupid encode for them? Also, artifacts from keyframes tend to be a lot different from those of inter-predicted frames, so would encoding a keyframe plus a second frame, then throwing both away, yield a quality improvement in the final pass (at the cost of more wasted computation)?


Thanks for your kind words! For the WebRTC/real-time case, here is our current draft (https://cs.stanford.edu/~keithw/salsify-paper.pdf). Would be eager for any comments or thoughts or ideas for experiments to run, as we have a few weeks to revise it before the final version is due.

Re: having to code many keyframes that are thrown away, in practice we're using vpxenc for the initial pass, and it's just a heck of a lot faster than our own C++ codec (whose benefit is that it can encode and decode a frame relative to a caller-supplied state). We then use our own encoder to re-encode the first frame of each chunk as an interframe (in terms of the exit state of the previous chunk), and that one encode ends up being slower than encoding six frames with vpxenc (see figure 4). So in a real system, you're absolutely right that this is another optimization opportunity, but I think for our purposes there's a lot of other low-hanging fruit we'd want to tackle first.
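
Schematically, the two passes look something like this (a Python sketch; the three helpers are stand-ins for the real vpxenc/alfalfa invocations, and it glosses over rebasing the rest of each chunk's frames):

    # Sketch of the two-pass chunk pipeline described above. vpx_encode,
    # reencode_first_frame, and exit_state_of are placeholders, not real tools.

    def encode_video(chunks, vpx_encode, reencode_first_frame, exit_state_of):
        # 1. Fast initial pass: every chunk encoded independently (one worker
        #    per chunk in practice), each starting with its own keyframe.
        independent = [vpx_encode(chunk) for chunk in chunks]

        # 2. Stitching pass: re-encode each chunk's first frame as an interframe
        #    in terms of the exit state of the previous chunk.
        stitched = [independent[0]]
        prev_state = exit_state_of(independent[0])
        for chunk in independent[1:]:
            rebased = reencode_first_frame(chunk, prev_state)
            stitched.append(rebased)
            prev_state = exit_state_of(rebased)
        return stitched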


Very interesting paper. Did you settle on a method for NAT traversal with future revisions?

It looks like the Lambda worker limit is now bumped up to 1000, with the possibility of automatic worker limit upgrades upon request if things like S3 triggers are being used.

Any idea what upper limits AWS is now allowing upon request?


The paper states "ExCamera is free software. The source code and evaluation data are available at https://ex.camera .", but there doesn't seem to be any source code at the linked site. The GitHub has some source code, but it appears to be incomplete with regard to the binaries also present there. Am I missing something? Have the intentions changed?


Everything is free software and available at https://github.com/excamera . Not sure what you mean about binaries -- afaik we only have source code checked into our repos.

(The video codec is https://github.com/excamera/alfalfa, and the lambda framework is https://github.com/excamera/mu .)


I'm referring to https://github.com/excamera/excamera-static-bins . It's not clear where the build scripts, etc., are for these projects.


$5.40 to encode a 15-minute video is... a lot. Amazon Elastic Transcoder runs at $0.03/minute, so it would be $0.45 to encode the same 15-minute video.

The value of doing video processing with general-purpose compute is either A) you can adopt new codecs quickly (to save operational costs), or B) you can do processing in batch with older/slower/off-peak compute resources (to save capital costs).

If a company were going to spend $XM on engineers to build a low-latency video-processing solution at scale, you'd go one layer of abstraction below thousands of tiny threads: purpose-built real-time encoding silicon.


This really depends on how much you're paying to make that 15 minute video. 15 minutes of video can cost 10 cents or 10 million dollars.


This is about processing video in real time in the cloud. The intro mentions that in the nonparallel case, 1 hour of 4K video can take 30 hours to process. Doing it in under an hour has real benefits.


AFAIK transcode-as-a-service vendors already offer real-time/streaming transcoding, so it's done as soon as you finish uploading the file.


Not in our experience -- 4K and VR encoding is pretty hard. We uploaded the same video to YouTube (4K Sintel) and counting from the end of the upload, it took 417 minutes before it was available to play in 4K VP9, and 36.5 minutes before it was available to play in 4K H.264. (ExCamera took 0.5-2.5 minutes by the same metric, i.e. 6-30x faster than real-time.) You can certainly do "ready-as-soon-as-uploaded" encoding if the resolution is small, upload bandwidth is bad, and the codec is simple, but as networks get faster, resolutions bigger, and video encoders more complex, it gets more difficult.

Even a real-time hardware encoder isn't enough for the target use-case here, where the user already has the video in the cloud and wants to apply a transformation (like an edit or filter) and then re-encode the results. ExCamera is trying to make that as interactive/quick as possible. "Real-time" on an hourlong video would still mean waiting like an hour for each operation.


But you weren't paying for this as a service, were you? You don't know that YouTube didn't put you in some queue, and you don't know how long YouTube actually took to encode it, or how quickly they could have encoded it if they wanted to. Or am I misunderstanding you?


We haven't tried a paid service, although my sense is that YouTube does represent the best-of-breed in commercially available parallel transcoding (free as it is to users). I'd encourage you to try it with a paid service and see where the time-to-finished 4K output shakes out. I'd be surprised if it's better but eager to learn either way. ExCamera is doing it at 6x-30x real time (which is important if you want to make a cloud video editor); I don't think you can buy that commercially today.

Ultimately, YouTube uses chunks of 128 frames each because it's too expensive to put key frames much more frequently than that, and other transcoding services will face a similar limitation. ExCamera's purely functional encoder lets it use chunks smaller than the interval between key frames, so the effective degree of parallelism can be a lot higher.
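
Rough arithmetic, assuming 24 fps and taking the 24-frame chunks mentioned elsewhere in the thread as the small-chunk case:

    # How chunk size bounds parallelism for an hour-long 24 fps video.
    fps, seconds = 24, 3600
    frames = fps * seconds          # 86,400 frames
    print(frames // 128)            # 675 independent 128-frame chunks
    print(frames // 24)             # 3,600 pieces once chunks can be smaller
                                    # than the keyframe interval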

(Obviously there's a lot of overhead from using a software codec -- we think hardware codecs should support a "save/restore state" interface too. I think we do have to prototype first in software to demonstrate the benefits before hardware people will take us seriously.)


YouTube is trying to satisfy its users as cheaply as possible. You cannot use it as a benchmark for state-of-the-art transcoding services (for which you would need to pay a reasonable fee).


You can live stream in 4K on YouTube and it's transcoded to multiple bitrates: https://www.youtube.com/watch?v=ZMQFsNqGavU


YouTube's actual re-encoding rate is faster than real time. This is only a balanced comparison of what is available, not of efficiencies.


Yeah, I read it and just shook my head at how silly it was...


Nice approach. I assume this can be adapted to run in a single process on a multi-core machine to increase parallelism.

But...

Why no comparison to the currently established approach of encoding chunks of 5 seconds / 128 frames, all in parallel?

The paper mentions this approach, but only compares to single- and multi-threaded vpx. As a reader, I'm left wondering whether that's because this approach leads to a similar overall encoding time.

The paper also fails to mention that keyframes are there at regular intervals to help seeking, especially for streaming video.

Also, the way of measuring YouTube's "encoding time" is ridiculous. It probably measures the length of the encode job queue and the currently available unused compute in gc2.


Thanks! Unfortunately we couldn't run 128-frame independent chunks (which we would call "ExCamera[128,1]") on an AWS Lambda instance because it doesn't have enough storage or RAM to store 128 frames of raw 4K video. We did benchmark ExCamera[24,x] (including the independent case, ExCamera[24,1]), which is the biggest we could fit on a lambda, and include those results in the paper. All of our tested configurations end up with a keyframe at least every 192 frames (the max configuration was ExCamera[12,16]), so we're sort of similar to the status quo in terms of keyframe (i.e. type-1/2 SAP) frequency.


Thanks for the insight.

The vpx runs were done on normal VMs. But I see that as runtimes get smaller, VMs are not economical.

The theoretical bound of splitting the source material into 5-second chunks and encoding them all in parallel is still a useful comparison to make, especially assuming an endless supply of jobs keeping the VMs busy.


I wonder if it’s possible to port this to a GPGPU. For some workloads, thousands of tiny threads run very fast on GPUs. If so, it might be much more power-efficient and therefore much cheaper to run.


This is really cool. They use VP8, which means that any web browser (except Safari) can decode the video, even over low-latency WebRTC. It also shows that a lot of VP8/VP9's encode slowness is due to limitations of libvpx, not the underlying format. It'll be interesting to see what they can achieve with VP9 or AV1.


I think this paper is important in ways not related to applicability or cost efficiency, but more as a "what if": what if we treated a warehouse-scale computer as a million-core device? What if we programmed that device like SIMD hardware or a GPU?


Great paper! One thing that isn't mentioned: how do you deal with audio?


Audio encodes far faster than video, so there isn't really a need to parallelize it. You'd probably combine this with Opus in WebRTC.


Audio is fast, but if you're attempting to encode hours of content in ~minutes, it's going to be a bottleneck.


If you end up being limited, there is a very similar trick you can do with audio, but even simpler. Unlike video, most audio formats don't have keyframes, but rather will converge to a correct decode a few packets after you start fresh or seek. So the solution to encode in parallel is to split the file into a bunch of chunks that overlap by a few packets (in the case of Opus, 80ms worth, or 4 packets, should be enough). You then encode all of these chunks, and then merge them together, throwing away the extra packets in the overlap. Unlike video, no final encode pass is needed.
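
Roughly, in code (a sketch assuming 48 kHz samples and 20 ms Opus packets; the actual Opus encode call is elided):

    # Sketch of the overlap-and-discard scheme described above.
    OVERLAP_PACKETS = 4    # 80 ms of priming at the head of every chunk but the first

    def split_with_overlap(samples, sample_rate=48000, chunk_seconds=60,
                           packet_ms=20, overlap_packets=OVERLAP_PACKETS):
        chunk = chunk_seconds * sample_rate
        overlap = overlap_packets * packet_ms * sample_rate // 1000
        return [samples[max(0, start - overlap):start + chunk]
                for start in range(0, len(samples), chunk)]

    def merge(encoded_chunks, overlap_packets=OVERLAP_PACKETS):
        # Keep the first chunk whole; drop the priming packets from every later one.
        merged = list(encoded_chunks[0])
        for packets in encoded_chunks[1:]:
            merged.extend(packets[overlap_packets:])
        return merged

    # Usage, with opus_encode standing in for the real per-chunk encoder:
    #   encoded = [opus_encode(c) for c in split_with_overlap(samples)]  # in parallel
    #   packets = merge(encoded)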


I wonder how this fares now, considering all that context switching just got more costly from the Meltdown mitigations.


Dubious. You may need more threads to get the same throughput, but that's the same as for anyone else. Constant increases in overhead aren't hard to model or accommodate in many parallel throughput problems.


It already had network latency to deal with, so I suspect it doesn’t matter. Also, KPTI can be disabled if one really wants to.



