Hacker News
Noisy neighbor detection with eBPF (netflixtechblog.com)
256 points by el_duderino 51 days ago | 64 comments



This is nifty, but not really congruent with my understanding of the "noisy neighbor" phenomenon. This work seems to reveal when there are more runnable threads than CPUs, leading to tasks waiting to run. The way I use "noisy neighbor" is that it is a concurrent task that trashes some microarchitectural resource, forcing the victim process to use more CPU time. For example, a process on another CPU core in the same cache domain that trashes all of the shared cache lines, or fills up all the load/store slots, or that uses more thermal power causing a global clock speed slowdown.


I thought so too - it seems like this is more about "who is being preempted by who" (although maybe noisy neighbor in the sense of "hogging up CPU time" does often imply "polluting hardware resources" to some degree, especially considering these machines probably have SMT)


Yep, in Netflix's case they pack bare-metal instances with a very large number of containers and oversubscribe them (similar to what Borg reports: hundreds of containers per VM is common), so there are always more runnable threads than CPUs and your runqueues fill up.
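For anyone who wants a rough feel for runqueue waiting without eBPF: the sketch below is not the article's method, just a coarse approximation that reads `/proc/<pid>/schedstat`. On Linux that file holds three numbers: nanoseconds spent on CPU, nanoseconds spent waiting on a runqueue, and the number of timeslices run. The function names are made up for illustration, and the file is assumed to be present, which depends on how the kernel was built.

```python
from pathlib import Path


def parse_schedstat(text):
    """Parse the three whitespace-separated fields of /proc/<pid>/schedstat."""
    on_cpu_ns, run_delay_ns, timeslices = (int(v) for v in text.split())
    return {
        "on_cpu_ns": on_cpu_ns,        # time spent running on a CPU
        "run_delay_ns": run_delay_ns,  # time spent runnable but waiting on a runqueue
        "timeslices": timeslices,      # number of timeslices run
    }


def wait_fraction(before, after):
    """Share of (running + waiting) time that was spent waiting, between two samples."""
    waited = after["run_delay_ns"] - before["run_delay_ns"]
    ran = after["on_cpu_ns"] - before["on_cpu_ns"]
    total = waited + ran
    return waited / total if total else 0.0


def read_schedstat(pid="self"):
    """Read and parse schedstat for a pid (hypothetical helper; Linux only)."""
    return parse_schedstat(Path(f"/proc/{pid}/schedstat").read_text())
```

Sampling `read_schedstat(pid)` twice and passing both samples to `wait_fraction` gives a per-process number that rises as the host gets more oversubscribed. Unlike the eBPF approach, it says nothing about which task did the preempting.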


I'm curious as to the capacity of the bare metal hosts you operate such that you can oversubscribe CPU without exhausting memory first or forcing processes to swap (which leads to significantly worse latency than typical scheduling delays). My experience is that most machines end up being memory bound because modern software—especially Java workloads, which I know Netflix runs a lot of—can be profligate memory consumers.


If you're min-maxing cost it seems doable? 1TB+ RAM servers aren't that expensive.


Workloads tend to average out if you pack dozens or hundreds into one host. Some need more CPU and some need more memory, but some average ratio emerges ... I like 4GB/core.


I wonder if they A/B tested the EEVDF scheduler? Allegedly it is supposed to reduce noisy neighbour effects in multi tenanted environments.

It would be interesting if kubernetes had stronger CPU isolation techniques. There is the static cpu manager which will dedicate cores for a specific container but I believe it isn’t smart enough to know about hyperthreads and so you can still get a noisy neighbour running on the other hyperthread of the core.

Ideally I would like to see kubernetes slice out a certain number of cores for the system daemons/kernel, IRQs, and NIC buffers. Some very low latency systems running on top of kubernetes recommend using isolcpus to do this but it would be nice to have something built into the kubernetes CPU manager to do this. A container running inside the guaranteed QoS class should have exclusive access to the cores and never be pre-empted.


We (co-author here) are using EEVDF already since it's in the latest mainline kernels. We haven't noticed any major differences using the default settings, but plan to get around to tuning it later.


Can someone explain to me: isn't it the kernel's responsibility to preempt the CPU-heavy userspace thread (noisy neighbour in our case) after a fixed slice of time anyway?


If a sleeping low-latency task becomes runnable due to events (network, storage, timers) then ideally it would start running straight away, preempting a throughput-oriented task.


Is this not the case in Linux land? On Windows 7, I could run a CPU based bitcoin miner that pegged every core (including hyperthreads) to 100% and still browse the internet with zero stutter, latency, or slowdown, because the Windows scheduler had no issues giving whatever time was needed by the UI or Window Event processing loop and just feed whatever was leftover to the number crunching app.


I think they have priority boosting for apps that are in focus. It doesn't work for tasks in general though. Network/disk throughput takes a hit when your CPU is pegged, for instance.


This is not true. I frequently run heavy compile jobs on my Windows machine that peg all cores to 100%. If I don’t tell Visual Studio to run the compile tasks with background priority, they will starve the window server of resources and make interacting with the system incredibly slow. I would not want to use such a machine for browsing.


Part of that was likely due to Windows doing its graphics at kernel level instead of in userspace.


Isn't noisy neighbor less of a problem on AWS since Nitro? Where I work we monitor that with some CPU steal metrics and it's very rare to see it nowadays.
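For reference, a steal metric like the one mentioned here can be derived from the aggregate `cpu` line of `/proc/stat`, where steal is the eighth counter (after user, nice, system, idle, iowait, irq and softirq). The sketch below is only an illustration with made-up function names, not what any particular monitoring agent does.

```python
def parse_cpu_line(line):
    """Return the integer counters from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(v) for v in fields[1:]]


def steal_percent(before, after):
    """Percentage of CPU time stolen by the hypervisor between two samples.

    Only the first eight counters are summed: guest and guest_nice are
    already accounted for inside user and nice.
    """
    deltas = [a - b for a, b in zip(after[:8], before[:8])]
    total = sum(deltas)
    return 100.0 * deltas[7] / total if total else 0.0
```

Steal only captures time the hypervisor withheld from the guest. Contention between containers inside the same instance does not show up in it.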


In this case the noise is coming from inside the house, er, the VM so it's not Nitro's problem.


Yep. In Netflix's case each Titus host can run hundreds of containers per bare-metal instance at any given time. One advantage of running a multi-tenant platform like this is that you get better observability on multi-tenancy issues since you're doing the scheduling yourself and know who is colocated with whom. It's much harder to debug noisy-neighbor issues when it's happening on the cloud provider side and your caches get thrashed by random other AWS customers.

One thing I was pitching internally when advocating for this platform is that when you have enough scale for the economics to make sense, you can reclaim some of AWS's margins instead of having your cold tiny VMs subsidize other AWS customers' higher perf. If you run the multi-tenant platform yourself, you can oversubscribe every app in a way that makes sense for your business and trade latency or throughput of software for $ on a per-container basis, so you can make much more granular and optimal decisions globally, versus having each team individually right-size their own app deployed on VMs and share CPU caches with randos.

I remember once at Netflix we investigated a weird latency issue on a random load balancer instance and got AWS involved: it turned out to be a noisy-neighbor on the underlying VM that gets chopped up into multiple customer-facing LB instances.


Aside: Is Titus still being developed?

GitHub repo says it was archived 2 years ago: https://github.com/Netflix/titus


But wasn’t Netflix using FreeBSD?


From a Brendan Gregg presentation[1]:

"Massive AWS EC2 Linux cloud, with FreeBSD appliances for content delivery"

[1] https://www.brendangregg.com/Slides/TracingSummit2014_FromDT...


Only for the video distribution piece. The control plane code is on Linux.


Is it open source?



That's a related tool for monitoring the utilization of BPF programs within the kernel, but NOT the focus of the article, which is detecting noisy neighbors in non-eBPF workloads.


I think the code as given in the article is all you get.


I had an obnoxious neighbor with a tv on his back porch who would watch Love is Blind so loud I could hear it throughout my house. Was kind of hoping this would be about that.


Back in the day when I used to live in a multi-apartment building someone was being so loud as to wake me up in the middle of the night and I could never figure out who it was. It was an old building so the noise would transfer across many floors/walls. I was trying to come up with an engineering solution to this but in the meantime I got a new job and had to move anyway. Probably a microphone array would work to triangulate the source, but it would also be hard to explain to the police.


Hypothetically you could talk to all your neighbors. But I understand the tradition is to communicate with apartment neighbors only through percussive modes of communication (i.e. pounding on the wall).


Learning to sleep with ear plugs in was the best investment of time I've ever made. Of course they don't totally kill out the sound but they lower it down by enough dB that you can comfortably ignore it, it takes piercing annoying noises and drowns them down to an ignorable level.


Agreed!

Good earplugs (like Mac's Earplugs) will (almost) completely eliminate the higher pitches. There's not a lot that one can do about lower pitches, but a white noise machine can help (either a dedicated white noise machine, or a small fan, etc).

But yeah - I grew up with quiet nights and it was a challenge to live in a dorm. Earplugs for the win! :)


> Probably a microphone array would work to triangulate the source

https://www.youtube.com/watch?v=IRELLH86Edo


I recently had a fire alarm with a failing battery and their way of telling you about the battery problem is a single loud warning beep every 70 or so seconds (something irregular of course), just enough to wake you up to an entirely silent room. Battery voltage was right on the edge and the detection circuit hopelessly naive so it would generally only beep at night when temperatures fell, often not even repeatedly. Horrible product.


Go low-tech: call 911, report a noise complaint, and then the police locate the source :-D


Then they come straight to your door for your statement.


I was hoping for a program that listens for dog barking and writes a timestamp log of every segment of active barking. Once you collect a couple weeks of data you can send it to the county.
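A minimal sketch of the logging half of that idea, assuming you already have a stream of (timestamp in seconds, loudness in dB) readings from some microphone setup. It only thresholds on loudness, so it cannot tell a dog from a lawnmower, and the function name and parameters are hypothetical.

```python
def loud_segments(samples, threshold_db, max_gap_s=2.0):
    """Group loud readings into (start, end) segments.

    samples: iterable of (timestamp_s, level_db) pairs in time order.
    Loud readings separated by more than max_gap_s start a new segment.
    """
    segments = []
    start = last_loud = None
    for t, level in samples:
        if level < threshold_db:
            continue
        if start is None:
            start = t
        elif t - last_loud > max_gap_s:
            # The quiet gap was long enough: close the previous segment.
            segments.append((start, last_loud))
            start = t
        last_loud = t
    if start is not None:
        segments.append((start, last_loud))
    return segments
```

Each returned pair can be written out as one log line, which would give the kind of timestamped record described above.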


The neighborhood was .3 acre lots with houses crammed fence to fence. We were on top of each other. The outdoor tv was a HUGE dick move that is becoming more and more accepted.


Not accepted, but because ACAB, what can we do?

I'd love to plug up all the noisy exhaust systems on every car and truck around me, but then I'd get in trouble despite ordinances basically saying, "maintain your exhaust system as to not be disturbing to others."

If cops aren't going to enforce this and citizens aren't allowed to, then what?


I don't know, snitches do get stitches after all. But writing software just to do snitching on a massive, analytical scale does fit the nerd stereotype though.


> could hear it throughout my house

Got to love American built homes that have poor or non-existent insulation.


I blame NIMBYism, at least for older multi-family. (Shoddy SFH with poor noise insulation is a different story.) If zoning were more relaxed, we would be tearing down 100+ year old multifamily in places like Massachusetts and putting in new homes with modern sound proofing between tenants.


Based on empirical data, new buildings, as they relate to multifamily dwellings, have little to no sound insulation. Older buildings are probably better in that regard.


Link?


Try Amsterdam.


Simple solution: figure out which episode they were on and play the next episode even louder.

Also it's a fantastic show.


> fantastic show

More like a fantastic show for rotting your brain. Just like any other "reality" show out there.


I was hoping it was about some tech to triangulate coordinates of illegal fireworks or gunfire. My wife has PTSD and our neighborhood has become almost unlivable because of illegal commercial grade fireworks year round.


If ShotSpotter[1] is anything to go by, doing this sort of stuff accurately is nontrivial.

[1] https://en.wikipedia.org/wiki/ShotSpotter#Accuracy


That's not a problem here, though. You're not after recognition accuracy, like, telling apart gunshots, fireworks and car engine backfires. You're after accurate positioning and timestamping of any noise source above some loudness threshold. That should be much, much simpler.
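To make the positioning part concrete, here is a toy sketch of time-difference-of-arrival (TDOA) localization by brute-force grid search. It assumes an idealized setting: synchronized microphones at known 2D positions, a fixed speed of sound, and no echoes or measurement noise. Real deployments have none of those luxuries, so treat it as an illustration of the principle only. All names and defaults are made up.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, roughly, in air at room temperature


def locate(mics, arrival_times, extent=50.0, step=0.5):
    """Estimate a 2D source position from arrival times at several microphones.

    mics: list of (x, y) positions in metres; arrival_times: seconds, same order.
    Only differences relative to the first microphone are used, so the
    (unknown) emission time does not matter.
    """
    best, best_err = None, float("inf")
    n = int(extent / step)
    for i in range(-n, n + 1):
        for j in range(-n, n + 1):
            x, y = i * step, j * step
            d0 = math.hypot(x - mics[0][0], y - mics[0][1])
            err = 0.0
            for (mx, my), t in zip(mics[1:], arrival_times[1:]):
                d = math.hypot(x - mx, y - my)
                predicted = (d - d0) / SPEED_OF_SOUND
                measured = t - arrival_times[0]
                err += (predicted - measured) ** 2
            if err < best_err:
                best, best_err = (x, y), err
    return best
```

With four microphones at the corners of a 10 m square and noise-free arrival times, the grid search recovers a source placed on a grid point. Whether it stays "much, much simpler" once reflections and clock drift are involved is a separate question.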


Is there any tech that can reliably tell the difference between fireworks and gunfire?

I live in a neighborhood where the question has come up a few times.


Supersonic rounds make a distinctive noise that machines could reliably distinguish from fireworks (humans can learn to do this too). But distinguishing arbitrary subsonic ammo vs. arbitrary fireworks seems like it would be at least an order of magnitude harder, and I don't think it could be made 100% reliable.


By ear, fireworks are regular in a way that gunfire is not.


Yeah, I don't like the terminology here either. Noisy => something that causes electrical or acoustic interference. Not something that uses too many resources.


Why on earth is the Netflix tech blog still on Medium? When I open the link half my screen is blocked by a popup asking me to sign up and upselling me on a membership. Why?


Please don't complain about tangential annoyances—e.g. article or website formats, name collisions, or back-button breakage. They're too common to be interesting.

https://news.ycombinator.com/newsguidelines.html


Ironically, at least for me, your comment applies far more to your comment than the grandparent's.


So does yours! (and mine)


Is it ‘too common’?

I’ve never seen another corporate public facing blog asking readers to sign up with some third party service and so on, not even as a banner ad.


I agree it is worth commenting on and not common. It makes no sense at all for a company like Netflix to use any external service to blog.


Unfortunately even big open source projects from the likes of Google (e.g. Dart, Flutter, etc.) use Medium for their tech blogs and announcements. I think engineers just write the content and give it to a product marketing team, who decide how to get it published and collect engagement numbers.

So it is less about technology worth sharing and more about marketing budget that needs spending.


Looks like we finally reached DTrace level.

I long for the day when Linux apps will be transparently compiled in a way that the time-critical bits will run as eBPF progs and communicate with user space through io_uring, and sched_ext will allow the system to be tailored to very specific workloads. Imagine games! How much of a difference that could make! It is a shame that games vendors simply ignore Linux.


So the thing I've heard from DTrace people is that eBPF still is not really "there". I've never gotten a clear answer about _why_ this is the case, though I feel like there were implications that, basically, eBPF programs are brittle/don't actually capture everything well?

Would love to hear more details on that front.


Why would this help with games?


They generally have a very latency sensitive input/rendering loop. Performance analysis can be challenging, but with the right tools it can be much easier.


The thing that we need in order for your dream to become a reality is excellent user space frameworks, so I encourage you (and anyone else) to go build one or (better) find one you like and contribute.



