My student's talk, which is worth watching and has a cool demo of using the same system for bulk facial recognition, is here: https://www.usenix.org/conference/nsdi17/technical-sessions/...
Daniel Horn, Ken Elkabany, Chris Lesniewski, and I used the same idea of finer-granularity parallelism at inconvenient boundaries, and initially some of the same code (a purely functional implementation of the VP8 entropy codec, and a purely functional implementation of the JPEG DC-predicted Huffman coder that can resume encoding from midstream), in our Lepton project at Dropbox to compress 200+ PB of their JPEG files, splitting each transcoded JPEG at arbitrary byte boundaries to match the filesystem blocks: https://www.usenix.org/conference/nsdi17/technical-sessions/...
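To make the "resume from midstream" part concrete, here's a toy sketch of the state involved (hypothetical code, not Lepton's actual format): JPEG codes each block's DC coefficient as a delta from the previous block's, so a decoder dropped into the middle of the stream needs the predictor value at the split point; if every chunk records that small piece of state, the chunks can be processed independently.

    # Toy sketch (not Lepton's actual format): JPEG codes each block's DC
    # coefficient as a delta from the previous block's DC, so a decoder that
    # starts mid-stream needs the predictor value at the split point. If each
    # storage chunk records that tiny bit of coder state, chunks can be
    # decoded (or re-encoded) independently and in parallel.

    def encode_dc_deltas(dc_values, prev_dc=0):
        """Encode DC coefficients as deltas from the running predictor."""
        deltas = []
        for dc in dc_values:
            deltas.append(dc - prev_dc)
            prev_dc = dc
        return deltas, prev_dc          # final predictor = "exit state"

    def decode_dc_deltas(deltas, prev_dc=0):
        """Decode deltas back to DC values, starting from a supplied state."""
        out = []
        for d in deltas:
            prev_dc += d
            out.append(prev_dc)
        return out, prev_dc

    if __name__ == "__main__":
        dc = [100, 103, 101, 98, 120, 122, 119, 125]
        # Split the stream at an arbitrary boundary; save the predictor there.
        first, state = encode_dc_deltas(dc[:5])
        second, _ = encode_dc_deltas(dc[5:], prev_dc=state)
        # A worker can decode the second chunk alone, given the saved state.
        assert decode_dc_deltas(second, prev_dc=state)[0] == dc[5:]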
My students and I also have an upcoming paper at NSDI 2018 that uses the same purely functional codec to do better videoconferencing (compared with Facetime, Hangouts, Skype, and WebRTC with and without VP9-SVC). The idea is that having a purely functional codec lets you explore execution paths without committing to them, so the videoconferencing system can try different quantizers for each frame and look at the resulting compressed sizes (or no quantizer, just skipping the frame even after encoding) until it finds one that matches its estimate of the network's capacity at that moment.
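Roughly, the exploration looks like the sketch below, where encode() is a hypothetical stand-in for a purely functional codec call (compress one frame against a given state and return the bits plus the successor state), not alfalfa's real API:

    # Illustrative sketch only: encode() is a stand-in for a purely functional
    # codec call, not alfalfa's actual interface.

    def encode(state, frame, quantizer):
        """Stub 'codec': higher quantizer -> smaller output (hypothetical)."""
        size = max(1, len(frame) // quantizer)
        compressed = frame[:size]
        next_state = state + (("q", quantizer),)   # states are immutable values
        return compressed, next_state

    def pick_encoding(state, frame, candidate_quantizers, target_size):
        """Explore several execution paths from the same state, commit to one.

        Because the codec is purely functional, every attempt starts from the
        identical state; abandoning an attempt costs nothing but CPU time.
        """
        for q in sorted(candidate_quantizers):      # lowest quantizer = best quality
            compressed, next_state = encode(state, frame, quantizer=q)
            if len(compressed) <= target_size:
                return compressed, next_state       # best quality that still fits
        return None, state    # skip the frame entirely; decoder state untouched

    if __name__ == "__main__":
        frame = bytes(1000)
        bits, new_state = pick_encoding((), frame, [4, 8, 16, 32], target_size=150)
        print(len(bits) if bits else "frame skipped", new_state)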
One conclusion is that video codecs (and JPEG codecs...) should support a save/restore interface that gives a succinct summary of their internal state, which can be thawed out elsewhere to let you resume from that point. We've now shown that if you have that (at least for VP8, but probably also for VP9/H.264/H.265, as these are very similar) you can do a lot of cool tricks, even trouncing H.264/H.265-based videoconferencing apps on quality and delay. (Code here: https://github.com/excamera/alfalfa).

Mosh (mobile shell) is really about the same thing and is sort of videoconferencing for terminals -- if you have a purely functional ANSI terminal emulator with a save/restore interface that expresses its state succinctly (relative to an arbitrary prior state), you can do the same tricks of syncing the server-to-client state at whatever interval you want and recovering efficiently from dropped updates.
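Here's a generic illustration of that sync-from-an-acknowledged-state idea (toy code, not Mosh's or alfalfa's actual protocol): the sender keeps the last state the receiver acknowledged and diffs the current state against it, so a dropped update just means the next diff covers a bit more ground.

    # Generic illustration of the save/restore + diff idea (not Mosh's or
    # alfalfa's actual wire protocol). The sender keeps the last state the
    # receiver acknowledged and, at whatever interval it likes, sends a diff
    # from that state to the current one; a dropped diff just means the next
    # diff is computed against an older acknowledged state.

    def diff(old_state, new_state):
        """Return the updates needed to turn old_state into new_state."""
        return {k: v for k, v in new_state.items() if old_state.get(k) != v}

    def apply_diff(state, delta):
        """Purely functional apply: returns a new state, leaves the old one alone."""
        merged = dict(state)
        merged.update(delta)
        return merged

    class Sender:
        def __init__(self):
            self.acked = {}             # last state the receiver confirmed

        def make_update(self, current):
            return diff(self.acked, current)

        def on_ack(self, acked_state):
            self.acked = acked_state

    if __name__ == "__main__":
        sender = Sender()
        screen_v1 = {"row0": "hello", "row1": ""}
        screen_v2 = {"row0": "hello", "row1": "world"}
        u1 = sender.make_update(screen_v1)    # suppose this update is dropped
        u2 = sender.make_update(screen_v2)    # still diffed against the acked state
        receiver = apply_diff({}, u2)         # receiver recovers without ever seeing u1
        assert receiver == screen_v2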
For lambda, it's fun to imagine that every compute-intensive job you might run (video filters or search, machine learning, data visualization, ray tracing) could show the user a button that says, "Do it locally [1 hour]" and a button next to it that says, "Do it in 10,000 cores on lambda, one second each, and you'll pay 25 cents and it will take one second."
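(The 25-cent figure is about what the back-of-the-envelope arithmetic gives; the per-GB-second price and the memory size below are assumptions for the estimate, not numbers from the comment above.)

    # Back-of-the-envelope check of the "10,000 cores for one second each"
    # figure. The Lambda price and memory size are assumptions for this sketch.
    WORKERS = 10_000
    SECONDS_EACH = 1
    MEMORY_GB = 1.5                      # assumed memory per invocation
    PRICE_PER_GB_SECOND = 0.00001667     # assumed Lambda compute price

    cost = WORKERS * SECONDS_EACH * MEMORY_GB * PRICE_PER_GB_SECOND
    print(f"~${cost:.2f} for {WORKERS} one-second invocations")   # ~$0.25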
One downside with your approach is that you have to code many keyframes that are eventually thrown away. Keyframes are often expensive to encode because their bitrate is so much higher. Have you considered "synthesizing" fake keyframes somehow, such as doing an especially fast and stupid encode for them? Also, artifacts from keyframes tend to be a lot different from those of inter-predicted frames, so would encoding a keyframe plus a second frame, then throwing both away, yield a quality improvement in the final pass (at the cost of more wasted computation)?
Re: having to code many keyframes that are thrown away: in practice we're using vpxenc for the initial pass, and it's just a heck of a lot faster than our own C++ codec (whose benefit is that it can encode and decode a frame relative to a caller-supplied state). We then use our own encoder to re-encode the first frame of each chunk as an interframe (in terms of the exit state of the previous chunk), and that one encode ends up being slower than encoding six frames with vpxenc (see Figure 4). So in a real system, you're absolutely right that this is another optimization opportunity, but I think for our purposes there's a lot of other low-hanging fruit we'd want to tackle first.
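For readers following along, here's a deliberately simplified sketch of that two-stage flow; the stub functions are stand-ins for vpxenc and our re-encoder (the real system distributes the rebasing across workers rather than doing it serially in one place):

    # Simplified sketch of the two-stage flow described above. The stub
    # "codec" functions are stand-ins, not the actual vpxenc/alfalfa tools.
    from concurrent.futures import ThreadPoolExecutor

    def vpxenc_stub(chunk):
        """Stage 1 stand-in: encode a chunk independently, starting with a keyframe."""
        return ["KEY(%s)" % chunk[0]] + ["INTER(%s)" % f for f in chunk[1:]]

    def rebase_stub(prev_exit_state, keyframe):
        """Stage 2 stand-in: recode a chunk's leading keyframe as an interframe
        in terms of the previous chunk's exit state."""
        return "REBASED(%s, from=%s)" % (keyframe, prev_exit_state)

    def encode_video(chunks):
        # Stage 1: every chunk is encoded in parallel, one worker per chunk.
        with ThreadPoolExecutor() as pool:
            encoded = list(pool.map(vpxenc_stub, chunks))
        # Stage 2: stitch chunks together by rebasing each leading keyframe.
        output, exit_state = [], "start"
        for chunk_bits in encoded:
            rebased = [rebase_stub(exit_state, chunk_bits[0])] + chunk_bits[1:]
            output.extend(rebased)
            exit_state = chunk_bits[-1]     # stand-in for the chunk's exit state
        return output

    if __name__ == "__main__":
        frames = ["f%02d" % i for i in range(12)]
        chunks = [frames[i:i + 6] for i in range(0, len(frames), 6)]   # 6-frame chunks
        print("\n".join(encode_video(chunks)))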
It looks like the Lambda worker limit has now been bumped up to 1000, with the possibility of automatic limit increases upon request if things like S3 triggers are being used.
Any idea what upper limits AWS is now allowing upon request?
(The video codec is https://github.com/excamera/alfalfa, and the lambda framework is https://github.com/excamera/mu .)
The value of doing video processing with general-purpose compute is either A) you can adopt new codecs quickly (to save operational costs), or B) you can do processing in batch with older/slower/off-peak compute resources (to save capital costs).
If a company were going to spend $XM on engineers to build a low-latency video-processing solution at scale, you'd go for the "remove-a-layer-of-abstraction" version of thousands of tiny threads: purpose-built real-time encoding silicon.
Even a real-time hardware encoder isn't enough for the target use-case here, where the user already has the video in the cloud and wants to apply a transformation (like an edit or filter) and then re-encode the results. ExCamera is trying to make that as interactive/quick as possible. "Real-time" on an hourlong video would still mean waiting like an hour for each operation.
Ultimately, YouTube uses chunks of 128 frames each because it's too expensive to put key frames much more frequently than that, and other transcoding services will face a similar limitation. ExCamera's purely functional encoder lets it use chunks smaller than the interval between key frames, so the effective degree of parallelism can be a lot higher.
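Rough numbers, assuming 30 fps and using the chunk sizes mentioned in this thread (128-frame chunks vs. the 6-frame chunks from the reply above):

    # Rough arithmetic on the degree of parallelism; 30 fps is an assumption,
    # and the chunk sizes are the 128-frame and 6-frame figures from this thread.
    FPS = 30
    VIDEO_SECONDS = 60 * 60                        # an hour-long video
    total_frames = FPS * VIDEO_SECONDS             # 108,000 frames

    keyframe_chunks = total_frames // 128          # ~843 independent chunks
    small_chunks = total_frames // 6               # 18,000 independent chunks

    print(f"128-frame chunks: {keyframe_chunks} workers")
    print(f"  6-frame chunks: {small_chunks} workers")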
(Obviously there's a lot of overhead from using a software codec -- we think hardware codecs should support a "save/restore state" interface too. I think we do have to prototype first in software to demonstrate the benefits before hardware people will take us seriously.)
Why no comparison to the currently established approach of encoding chunks of 5sec/128frames, all in parallel?
The paper mentions this approach, but only compares to single- and multi-threaded vpx. As a reader I'm left wondering if it is because this approach leads to a similar overall encoding time.
The paper also fails to mention that keyframes are also placed at regular intervals to help with seeking, especially for streaming video.
Also, the way of measuring YouTube "encoding time" is ridiculous. This probably measures the length of the encode-job queue and the currently available unused compute in gc2.
The vpx runs were done on normal VMs. But I see that, as runtimes get smaller, VMs are not economical.
The theoretical bound of splitting the source material into 5s chunks and encoding them all in parallel is still a useful comparison to make, especially considering an endless supply of jobs keeping the VMs busy.