Life of a Netflix Partner Engineer – The case of the extra 40 ms (netflixtechblog.com)
172 points by saranshk on Jan 21, 2021 | 29 comments



I kind of wish the author touched more on the one-frame-at-a-time aspect. Why aren't they copying more data into the device decoder? It seems somewhat silly on a non-realtime OS to grab one frame, request a 15ms wait, and then grab another frame, when your deadline is 16.6ms. That essentially guarantees a stuttering playback at some point, no?

Is there not a better way to do this? There must be. Why not copy 10 frames at a time? Is the device decoder buffer that small? Is there no native Android support for pointing a device decode buffer to a data ingress source and having the OS fill it when it empties, without having to "poll" every 15ms? So many questions.


What makes you think the decoder actually has enough RAM to receive more than a single frame?

Formats like H.264 are designed around hard constraints so that OEMs can build HW decoders with only maybe 1-3 frames' worth of internal memory (this includes the reference frames required for forward/backward decoding of B frames), all to keep both cost and latency down. Having your decoding block add 5-10 frames of latency will cause many problems down the line.

It's really not a given that your decoding block will be able to take more than one frame at a time. 16ms really is plenty for video decode handling for most use cases.


> It seems somewhat silly on a non-realtime OS to grab one frame, request a 15ms wait, and then grab another frame, when your deadline is 16.6ms

A couple of reasons this isn't as silly as it seems:

1) ~All buffers in Android are pipelined, usually with a queue depth of 2 or 3 depending on overall application performance. This means that missing a deadline is recoverable as long as it doesn't happen multiple times in a row. I'd also note that since Netflix probably only cares about synchronization and not latency during video playback, they could have a buffer depth of nearly anything they wanted, but I don't think that's a knob Android exposes to applications.

2) The deadline is probably not the end of the current frame but rather the end of the next frame (i.e. ~18ms away) or further. The application can specify this with the presentation time EGL extension[1] that's required to be present on all Android devices (rough sketch of the call below the link).

[1]: https://www.khronos.org/registry/EGL/extensions/ANDROID/EGL_...
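
To make that concrete, here is roughly what scheduling a frame for a later vsync looks like with that extension. This is a minimal sketch, not Netflix's code; it assumes an already-initialized EGL display/surface, and error handling is omitted:

    // Ask the compositor to show the next frame at a chosen time using
    // EGL_ANDROID_presentation_time. Display/surface setup is assumed.
    #include <EGL/egl.h>
    #include <EGL/eglext.h>
    #include <time.h>
    #include <cstdint>

    static int64_t now_ns() {
        timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return int64_t(ts.tv_sec) * 1000000000LL + ts.tv_nsec;
    }

    void present_frame(EGLDisplay dpy, EGLSurface surface) {
        // Extension entry points have to be looked up at runtime.
        auto presentationTime =
            reinterpret_cast<PFNEGLPRESENTATIONTIMEANDROIDPROC>(
                eglGetProcAddress("eglPresentationTimeANDROID"));

        // Target the frame after next (~2 vsyncs at 60 Hz), so one late
        // wakeup doesn't immediately become a dropped frame.
        const int64_t kFrameNs = 16666667;
        if (presentationTime) {
            presentationTime(dpy, surface, now_ns() + 2 * kFrameNs);
        }
        eglSwapBuffers(dpy, surface);  // queue the rendered frame
    }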


My guess is it's simple and prevents A/V desync?


He touches on that here. They have a catch-up mechanism that was thwarted by the same bug: https://news.ycombinator.com/item?id=25428127


My guess is DRM. Less data in the buffer means if someone is trying to rip a stream they won't be able to speed up the process by dumping the buffer.


How much does that speed (of copying the buffer) really matter though?



That's what I was thinking; I've seen this before.


For some perspective on how other platforms try to prevent this problem, see the Multimedia Class Scheduler Service:

https://docs.microsoft.com/en-us/windows/win32/api/avrt/nf-a...

It was added in Windows Vista. It periodically adjusts the priority of threads to make sure they are getting enough CPU time.
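
For flavor, opting a playback thread into MMCSS looks roughly like this (a minimal sketch; "Audio" is one of the task categories Windows defines, and error handling is omitted):

    // Register a thread with MMCSS so Windows periodically boosts its
    // priority during playback. Link against Avrt.lib.
    #include <windows.h>
    #include <avrt.h>

    void playback_thread() {
        DWORD taskIndex = 0;
        HANDLE mmcss = AvSetMmThreadCharacteristicsW(L"Audio", &taskIndex);

        // ... time-sensitive decode/render work here ...

        if (mmcss) {
            AvRevertMmThreadCharacteristics(mmcss);  // undo the boost
        }
    }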


In a weird way, I love tackling these kinds of bugs. It can be frustrating but it’s an exercise in logical thinking and you often learn some interesting details of lower parts of the stack, debugging techniques, etc. as you go.


Super interesting to me as a computer science student!


This article was posted when it was new, and I still have the same opinion now as I did then.

Why was Netflix putting effort in to supporting a device that the vendor clearly didn't care about?

This unnamed vendor was, in late 2017, preparing to launch a new device running Android 5.0, which was EOL and had not seen an update since late 2015. Android 8.0 was already out at the time. There is no valid excuse; it's not like they were trying to offer continued support for a legacy device.

In a world that made sense, a company the size of Netflix should have the clout to say "hey hold up here, what are you trying to push this out of date crap for?"


> Why was Netflix putting effort in to supporting a device that the vendor clearly didn't care about?

Because they supported a device USERS care about. They were focusing on their customers and users. Why would they punish users for something some random other corpo did?


The article mentions it was for a "large European Pay TV company". Having Netflix function on it would be a large group of potential Netflix customers that already had a compatible device hooked up to a TV. Potential low-friction onboards. And potential complaints from existing customers that upgraded to the new box if it didn't work.


This was an unreleased product with zero users.


Irrelevant; you can choose to support EOL products or not. If your company does, you can gain customers, but they may not be worth a lot.


It's unlikely that Netflix could have stopped it from being released.


I disagree; in fact, I have a lot of respect for Netflix for continuing to try to support out-of-date devices. There's a handful of old tablets and TV boxes that support barely anything these days, but they're still capable of blitting MPEG to a screen, so there is still some use in them.

As you said though, no respect for vendors pushing out new devices that are EOL before they've even shipped.


Netflix's customers are users, not vendors. Just because the vendor doesn't care, doesn't mean the vendor's users don't care. In fact, the vendor and the vendor's users may well have competing interests.


> This unnamed vendor was, in late 2017, preparing to launch a new device running Android 5.0 which was EoL and had not seen an update since late 2015

The issue is that in order to get a license for real Android, your device has to meet some minimum specs, so if you want to create an inexpensive device that doesn't meet those specs, AOSP is still a reasonable choice for your OS. I'm not sure if it's still true, but it used to be that if you were shipping any AOSP-based device, Google wouldn't give you an Android license for any of your other devices, so manufacturers interested in serving more price-sensitive markets (e.g. India) would sometimes use AOSP across their whole product line.


Okay, but why not a newer AOSP?


From first-hand experience: because your board (that is, hardware/SoC) vendor only gives you a BSP for one version of Android, and they want to charge 500k for an update. The binaries are of course all blobs and the kernel is a precompiled binary monstrosity. Your company doesn't see a reason to pay for a newer version because the older one "works fine".


>I can’t predict all of the issues that our partners will throw at me, and I know that to fix them I have to understand multiple systems, work with great colleagues, and constantly push myself to learn more.

So what is his domain exactly? He went from shipping on other platforms/devices to his first Android device. So presumably he is learning multiple operating systems? Is it practical to have working knowledge of multiple OS internals?

Wouldn't they be better served by people specializing in one OS/platform? That would lead to quicker issue resolutions as someone isn't spending time learning a new OS.

(These aren't criticisms of anyone's ability, but rather questions about why the scope of responsibility is so large and requires so much mental effort.)


Some engineers will be bored to tears with a scope of responsibility that is any narrower.

The foundational skill for an engineer who supports partners is likely one's ability to absorb, navigate, and understand new systems quickly. Knowing more than one OS (or language, or CPU, etc) actually improves this skill.

There are some fundamental ideas that are common across OSes that could help reduce the mental burden, though the details differ significantly. And there are some activities that almost require one person to know multiple OSes' internals, such as reverse engineering Windows drivers to write Linux ones.


tl;dr: The root cause of their issue was this:

> The Android thread scheduler changes the behavior of threads depending whether or not an application is running in the foreground or the background. Threads in the background are assigned an extra 40 ms (40000000 ns) of wait time.

> A bug deep in the plumbing of Android itself meant this extra timer value was retained when the thread moved to the foreground. Usually the audio handler thread was created while the application was in the foreground, but sometimes the thread was created a little sooner, while Ninja was still in the background. When this happened, playback would stutter.
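
This is the same idea as Linux's per-thread timer slack: the kernel is allowed to defer timer expirations by up to the slack value to batch wakeups, and background apps get a much larger slack. A minimal sketch of just that effect in isolation (not the Android plumbing itself; the 40 ms value mirrors the quote above):

    // Demonstrate timer slack padding a short sleep on Linux.
    #include <sys/prctl.h>
    #include <time.h>
    #include <cstdint>
    #include <cstdio>

    static int64_t now_ns() {
        timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return int64_t(ts.tv_sec) * 1000000000LL + ts.tv_nsec;
    }

    int main() {
        prctl(PR_SET_TIMERSLACK, 40000000UL);  // 40 ms, as in the quote

        timespec req = {0, 15000000};          // request a 15 ms sleep
        int64_t t0 = now_ns();
        nanosleep(&req, nullptr);
        // The kernel may defer the wakeup by up to the slack, so this can
        // print anywhere from ~15 ms to ~55 ms.
        printf("slept %.1f ms\n", (now_ns() - t0) / 1e6);
        return 0;
    }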


Also previously posted on HN a month ago, with 200 comments.

https://news.ycombinator.com/item?id=25426195


Well, that was unexpectedly inspiring. I find I like building things that work, but the detective work of a challenging bug is often way more interesting!


I'm pleasantly surprised that this time the 40 ms wasn't from Nagle's algorithm.
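
For anyone who hasn't been bitten by that one: the classic ~40 ms there comes from Nagle's algorithm interacting with delayed ACKs on small writes, and the usual fix is setting TCP_NODELAY on the socket. A minimal sketch (error handling omitted):

    // Disable Nagle's algorithm on a TCP socket so small writes go out
    // immediately instead of waiting to be coalesced.
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    void disable_nagle(int sock_fd) {
        int one = 1;
        setsockopt(sock_fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
    }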



