
Low-Latency Video Processing Using Thousands of Tiny Threads [pdf] - mpweiher
https://www.usenix.org/system/files/conference/nsdi17/nsdi17-fouladi.pdf
======
keithwinstein
Co-author here -- thanks for linking to our paper. The basic theme is that (a)
purely functional implementations of things like video codecs can allow finer-
granularity parallelism than previously realized [e.g. smaller than the
interval between key frames], and (b) lambda can be used to invoke thousands
of threads very quickly, running an arbitrary statically linked Linux binary,
to do big jobs interactively. (Subsequent work like PyWren, by my old MIT
roommate, has now done (b) more generally than we did.)
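
A toy sketch of the fan-out pattern in (b): split a big job into tiny chunks and dispatch them all at once. The worker here is a local stub standing in for a Lambda invocation, and `invoke_worker`/`fan_out` are invented names, not the paper's API:

```python
from concurrent.futures import ThreadPoolExecutor

def invoke_worker(chunk_id, frames):
    """Stand-in for one Lambda invocation: each worker gets a small chunk
    of frames and would run a statically linked encoder binary on it."""
    return (chunk_id, f"encoded:{len(frames)}")

def fan_out(frames, chunk_size):
    """Split the job into tiny chunks and dispatch them all in parallel."""
    chunks = [frames[i:i + chunk_size] for i in range(0, len(frames), chunk_size)]
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        futures = [pool.submit(invoke_worker, i, c) for i, c in enumerate(chunks)]
        return [f.result() for f in futures]

results = fan_out(list(range(96)), chunk_size=6)
print(len(results))  # 16 chunks, one worker each
```

In the real system each submit would be an asynchronous SDK call against the Lambda service rather than a local thread.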

My student's talk, which is worth watching and has a cool demo of using the
same system for bulk facial recognition, is here:
https://www.usenix.org/conference/nsdi17/technical-sessions/presentation/fouladi

Daniel Horn, Ken Elkabany, Chris Lesniewski, and I used the same idea of
finer-granularity parallelism at inconvenient boundaries, and initially some
of the same code (a purely functional implementation of the VP8 entropy codec,
and a purely functional implementation of the JPEG DC-predicted Huffman coder
that can resume encoding from midstream) in our Lepton project at Dropbox to
compress 200+ PB of their JPEG files, splitting each transcoded JPEG at an
arbitrary byte boundary to match the filesystem blocks:
https://www.usenix.org/conference/nsdi17/technical-sessions/presentation/horn

My students and I also have an upcoming paper at NSDI 2018 that uses the same
purely functional codec to do better videoconferencing (compared with
Facetime, Hangouts, Skype, and WebRTC with and without VP9-SVC). The idea is
that having a purely functional codec lets you explore execution paths without
committing to them, so the videoconferencing system can try different
quantizers for each frame and look at the resulting compressed sizes (or no
quantizer, just skipping the frame even after encoding) until it finds one
that matches its estimate of the network's capacity at that moment.
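
The exploration idea above can be sketched in a few lines. The encoder here is a fake (its size model just shrinks output as the quantizer coarsens) and the function names are invented; what it illustrates is that a pure `encode` lets you discard attempts without unwinding any state:

```python
def encode(state, frame, quantizer):
    """Hypothetical purely functional encoder: returns the compressed size
    and the successor state, mutating nothing. The size model is fake:
    coarser quantizers simply yield smaller outputs."""
    size = len(frame) // (quantizer + 1)
    return size, (state, quantizer)

def pick_quantizer(state, frame, capacity_estimate, candidates=(8, 16, 32, 63)):
    """Try several quantizers; rejected attempts commit nothing because
    encode() is pure. Returns None to mean 'skip this frame' if even the
    coarsest candidate overshoots the capacity estimate."""
    for q in candidates:
        size, next_state = encode(state, frame, q)
        if size <= capacity_estimate:
            return q, next_state
    return None  # skip the frame entirely, even after encoding it

choice = pick_quantizer(state=None, frame=b"x" * 1200, capacity_estimate=100)
```

With these fake numbers, quantizer 8 produces 133 bytes (too big) and quantizer 16 produces 70, so the search settles on 16.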

One conclusion is that video codecs (and JPEG codecs...) should support a
save/restore interface to give a succinct summary of their internal state that
can be thawed out elsewhere to let you resume from that point. We've shown now
that if you have that (at least for VP8, but probably also for VP9/H.264/265
as these are very similar) you can do a lot of cool tricks, even trouncing
H.264/265-based videoconferencing apps on quality and delay. (Code here:
https://github.com/excamera/alfalfa).
Mosh (mobile shell) is really about the same thing and is sort of
videoconferencing for terminals -- if you have a purely functional ANSI
terminal emulator with a save/restore interface that expresses its state
succinctly (relative to an arbitrary prior state), you can do the same tricks
of syncing the server-to-client state at whatever interval you want and
recovering efficiently from dropped updates.
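
The save/restore property can be illustrated with a deliberately trivial "codec" (just a running predictor; all names invented). The point is the invariant, not the codec: decoding a suffix from a saved state must equal decoding it in one serial pass:

```python
def decode_frame(state, delta):
    """Toy codec whose only internal state is a running predictor.
    Purely functional: returns the output frame and the successor state."""
    value = state + delta
    return value, value

def decode_stream(state, deltas):
    """Decode a sequence of deltas, threading the state explicitly."""
    out = []
    for d in deltas:
        frame, state = decode_frame(state, d)
        out.append(frame)
    return out, state

deltas = [3, 1, -2, 5, 4, -1]
serial, _ = decode_stream(0, deltas)

# Split at an arbitrary point: save the exit state of the first half,
# "thaw it out elsewhere," and resume the second half from it.
first, saved = decode_stream(0, deltas[:3])
second, _ = decode_stream(saved, deltas[3:])
assert first + second == serial
```

A real VP8 state is of course much richer (probability tables, reference frames), but the interface contract is the same.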

For lambda, it's fun to imagine that every compute-intensive job you might run
(video filters or search, machine learning, data visualization, ray tracing)
could show the user a button that says, "Do it locally [1 hour]" and a button
next to it that says, "Do it in 10,000 cores on lambda, one second each, and
you'll pay 25 cents and it will take one second."
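
The 25-cent figure checks out against Lambda's 2017 price sheet (roughly $0.000002501 per 100 ms at the 1536 MB tier, plus $0.20 per million requests; those rates are my assumption here, not stated in the thread):

```python
# Approximate 2017 AWS Lambda pricing at the 1536 MB memory tier.
price_per_100ms = 0.000002501   # dollars per 100 ms of execution
price_per_request = 0.0000002   # dollars per invocation

workers = 10_000
seconds_each = 1

compute = workers * seconds_each * 10 * price_per_100ms  # 10 x 100 ms per second
requests = workers * price_per_request
total = compute + requests
print(round(total, 2))  # about $0.25 for 10,000 core-seconds
```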

~~~
TD-Linux
Many encoders have a "two pass" mode that saves statistics from a first, fast
pass to guide the second pass. Usually these are very simple statistics, used
to choose a quantizer and frame type in the second pass. Your method feels
like a rather extreme version of this, that can guide the whole RDO search
with a very large state. And, of course, rarely are first passes done in
parallel. It's super exciting to see people looking at this - I think there's
a lot of untapped potential. The WebRTC case looks interesting too, though
there you need extremely low latencies (1 frame) so I'm curious to find out
how you tackle that.

One downside with your approach is that you have to code many keyframes that
are eventually thrown away. Keyframes are often expensive to encode because
their bitrate is so much higher. Have you considered "synthesizing" fake
keyframes somehow, such as doing an especially fast and stupid encode for
them? Also, artifacts from keyframes tend to be a lot different from inter
predicted frames, so would encoding a keyframe, plus a second frame, then
throwing away both yield a quality improvement at the final pass (at the cost
of more wasted computation)?

~~~
keithwinstein
Thanks for your kind words! For the WebRTC/real-time case, here is our current
draft (https://cs.stanford.edu/~keithw/salsify-paper.pdf). Would be eager
for any comments or thoughts or ideas for experiments to run, as we have a few
weeks to revise it before the final version is due.

Re: having to code many keyframes that are thrown away, in practice we're
using vpxenc for the initial pass, and it's just a heck of a lot faster than
our own C++ codec (whose benefit is that it can encode and decode a frame
relative to a caller-supplied state). We then use our own encoder to re-encode
the first frame of each chunk as an interframe (in terms of the exit state of
the previous chunk), and that one encode ends up being slower than encoding
six frames with vpxenc (see figure 4). So in a real system, you're absolutely
right that this is another optimization opportunity, but I think for our
purposes there's a lot of other low-hanging fruit we'd want to tackle first.

------
cbhl
$5.40 to encode a 15-minute video is... a lot. Amazon Elastic Transcoder runs
at $0.03/minute, so it would be $0.45 to encode the same 15-minute video.

The value of doing video processing with general-purpose compute is either A)
you can adopt new codecs quickly (to save operational costs), or B) you can do
processing in batch with older/slower/off-peak compute resources (to save
capital costs).

If a company were going to spend $XM on engineers to build a low-latency video
processing solution at scale, you'd go for the "remove-a-layer-of-abstraction"
alternative to thousands of tiny threads: purpose-built real-time encoding
silicon.

~~~
lallysingh
This is about processing video in real time in the cloud. The intro mentions
that in the nonparallel case, 1 hr of 4k video can take 30hrs to process.
Doing it in <1hr has real benefits.

~~~
wmf
AFAIK transcode-as-a-service vendors already offer real-time/streaming
transcoding, so it's done as soon as you finish uploading the file.

~~~
keithwinstein
Not in our experience -- 4K and VR encoding is pretty hard. We uploaded the
same video to YouTube (4K Sintel) and counting from the end of the upload, it
took 417 minutes before it was available to play in 4K VP9, and 36.5 minutes
before it was available to play in 4K H.264. (ExCamera took 0.5-2.5 minutes by
the same metric, i.e. 6-30x faster than real-time.) You can certainly do
"ready-as-soon-as-uploaded" encoding if the resolution is small, upload
bandwidth is bad, and the codec is simple, but as networks get faster,
resolutions bigger, and video encoders more complex, it gets more difficult.

Even a real-time hardware encoder isn't enough for the target use-case here,
where the user already has the video in the cloud and wants to apply a
transformation (like an edit or filter) and then re-encode the results.
ExCamera is trying to make that as interactive/quick as possible. "Real-time"
on an hourlong video would still mean waiting like an hour for each operation.

~~~
chrisseaton
But you weren't paying for this as a service were you? You don't know that
YouTube didn't put you in some queue and you don't know how long YouTube
actually took to encode it, or how quickly they would be able to encode it if
they wanted to? Or am I misunderstanding you?

~~~
keithwinstein
We haven't tried a paid service, although my sense is that YouTube does
represent the best-of-breed in commercially available parallel transcoding
(free as it is to users). I'd encourage you to try it with a paid service and
see where the time-to-finished 4K output shakes out. I'd be surprised if it's
better but eager to learn either way. ExCamera is doing it at 6x-30x real time
(which is important if you want to make a cloud video editor); I don't think
you can buy that commercially today.

Ultimately, YouTube uses chunks of 128 frames each because it's too expensive
to put key frames much more frequently than that, and other transcoding
services will face a similar limitation. ExCamera's purely functional encoder
lets it use chunks smaller than the interval between key frames, so the
effective degree of parallelism can be a lot higher.
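
The parallelism argument is simple arithmetic. Assuming a 24 fps source (my assumption; the frame rate isn't stated here) and the chunk sizes mentioned in the thread:

```python
frames = 15 * 60 * 24          # a 15-minute video at an assumed 24 fps
keyframe_interval = 128        # YouTube-style keyframe-delimited chunks

# Conventional parallel transcode: one independent worker per
# keyframe-delimited chunk.
conventional_workers = frames // keyframe_interval

# Purely functional encoder: chunks can be smaller than the keyframe
# interval (ExCamera's evaluation uses 24-frame chunks).
chunk_size = 24
excamera_workers = frames // chunk_size

print(conventional_workers, excamera_workers)  # 168 vs. 900 workers
```

More than a 5x jump in the degree of parallelism, without adding any keyframes to the output.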

(Obviously there's a lot of overhead from using a software codec -- we think
hardware codecs should support a "save/restore state" interface too. I think
we do have to prototype first in software to demonstrate the benefits before
hardware people will take us seriously.)

------
ysleepy
Nice approach. I assume this can be adapted to run in a single process on a
multi-core machine to increase parallelism.

But...

Why no comparison to the currently established approach of encoding chunks of
5sec/128frames, all in parallel?

The paper mentions this approach, but only compares to single- and multi-
threaded vpx. As a reader I'm left wondering if it is because this approach
leads to a similar overall encoding time.

The paper also fails to mention that keyframes are also there at regular
intervals to help seeking, especially for streaming video.

Also, the way of measuring YouTube "encoding time" is ridiculous. This
probably measures the length of the encode-job queue and the currently
available unused compute in gc2.

~~~
keithwinstein
Thanks! Unfortunately we couldn't run 128-frame independent chunks (which we
would call "ExCamera[128,1]") on an AWS Lambda instance because it doesn't
have enough storage or RAM to store 128 frames of raw 4K video. We did
benchmark ExCamera[24,x] (including the independent case, ExCamera[24,1]),
which is the biggest we could fit on a lambda, and include those results in
the paper. All of our tested configurations end up with a keyframe at least
every 192 frames (the max configuration was ExCamera[12,16]), so we're sort of
similar to the status quo in terms of keyframe (i.e. type-1/2 SAP) frequency.

~~~
ysleepy
Thanks for the insight.

The vpx runs were done on normal VMs, but I see that as runtimes get smaller,
VMs are not economical.

The theoretical bound of splitting the source material into 5s chunks and
encoding them all in parallel is still a useful comparison to make, especially
considering an endless supply of jobs keeping VMs busy.

------
Const-me
I wonder if it’s possible to port this to GPGPU? For some workloads, thousands
of tiny threads run very fast on those. If so, it might be much more
power-efficient and therefore much cheaper to run.

------
TD-Linux
This is really cool. They use VP8, which means that any web browser (except
Safari) can decode the video, even over low-latency WebRTC. It also shows that
a lot of the VP8/VP9 encode slowness is due to libvpx, not the underlying
format. It'll be interesting to see what they can achieve with VP9 or AV1.

------
sitkack
I think this paper is important in ways not related to applicability or cost
efficiency, but more as a thought experiment: what if we treated a
warehouse-scale computer as a million-core device? What if we programmed this
device like SIMD or a GPU?

------
billysham
Great paper! One thing that isn't mentioned: how do you deal with audio?

~~~
TD-Linux
Audio encodes far faster than video, so there isn't really a need to
parallelize it. You'd probably combine this with Opus in WebRTC.

~~~
billysham
Audio is fast, but if you are attempting to encode hours of content in about a
minute, it's going to be a bottleneck.

~~~
TD-Linux
If you end up being limited, there is a very similar trick you can do with
audio, but even simpler. Unlike video, most audio formats don't have
keyframes, but rather will converge to a correct decode a few packets after
you start fresh or seek. So the solution to encode in parallel is to split the
file into a bunch of chunks that overlap by a few packets (in the case of
Opus, 80ms worth, or 4 packets, should be enough). You then encode all of
these chunks, and then merge them together, throwing away the extra packets in
the overlap. Unlike video, no final encode pass is needed.
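
A sketch of the overlap-and-discard scheme described above, assuming 20 ms packets and a 4-packet (80 ms) overlap. The toy encoder here is stateless, so the merge is exact by construction; in a real codec like Opus, the overlap is what lets each independently started chunk converge to a correct decode before its kept packets begin:

```python
def encode_audio(samples, packet_ms=20):
    """Toy stand-in for an Opus-style encoder: one 'packet' per 20 ms
    (treating one sample as one millisecond for simplicity)."""
    return [samples[i:i + packet_ms] for i in range(0, len(samples), packet_ms)]

def parallel_encode(samples, chunk_packets=50, overlap_packets=4, packet_ms=20):
    """Encode overlapping chunks independently, then drop the warm-up
    packets from the start of every chunk except the first."""
    chunk_ms = chunk_packets * packet_ms
    overlap_ms = overlap_packets * packet_ms
    merged, start, first = [], 0, True
    while start < len(samples):
        lead = 0 if first else overlap_ms
        packets = encode_audio(samples[start - lead:start + chunk_ms])
        merged.extend(packets if first else packets[overlap_packets:])
        start += chunk_ms
        first = False
    return merged

samples = list(range(4000))  # 4 seconds of fake audio
assert parallel_encode(samples) == encode_audio(samples)
```

Each chunk would go to its own worker in practice; the final merge is just list concatenation, which is why no second encode pass is needed.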

------
Namidairo
I wonder how this fares now, considering that all that context switching just
got more costly from the Meltdown mitigations.

~~~
lallysingh
Dubious. You may need more threads to sustain the same throughput, but that's
the same for everyone else. Constant increases in overhead aren't hard to
model or accommodate in many parallel-throughput problems.

