Hacker News
Seeing through hardware counters: A journey to 3x performance increase (netflixtechblog.com)
344 points by mfiguiere 87 days ago | 130 comments

These posts make me both violently aroused and profoundly depressed: basically every employer I’ve ever had in 30 years doesn’t value such analysis. This is what engineering looks like, and it’s something no product manager can ever comprehend.

In my limited experience, there aren't many companies that actually have the load that requires this level of fine-tuning. Or they don't actually own the software involved; we've had performance issues at the company I work for at the moment (as a consultant / contractor), caused by SAP and a product called Tibco, a visual editor for querying and assembling data. There's a lack of transparency there.

Anyway, my point is that there are a lot more high-level things that need to be fixed - and this is the case in most, if not all, organizations I've worked at - before you get to the lower levels of performance. And this probably applies everywhere.

In this example, their software was already so optimized and tweaked that the bottleneck ended up being a low-level JVM issue.

Mind you, on the other hand, they mentioned something about subclasses and the like; it might be that a change in their Java code would have solved or amortized this issue as well.

I’ll betcha any fleet with more than 100 machines in it hasn’t been tuned at a low level and could be 1-10 machines if it were. That’s probably worth it. But you can’t convince an accountant that you can make money by spending it, and at a certain size the accountants make the decisions.

> These posts both make me violently aroused as well as profoundly depressed basically every employer I’ve ever had in 30 years doesn’t value such analysis.

Couldn't agree more, except on the timeline. In my experience, through the 90s and most of the 00s, proper engineering like this was the norm in Silicon Valley, and I loved it. It's in the last 10-15 years that PMs have taken over, and now it's all so superficial and uninteresting. I've moved away from hands-on development work because nobody is doing anything intellectually rewarding anymore. But I wish I were still doing hard-core engineering like it used to be.

What did you move to instead?

PM here: I actually also got pretty excited reading this. The challenge is selling analyses with unknown outcomes to the bean counters who end up deciding. Blog posts like this help.

This is a problem at the company/culture level; a root cause is probably capitalism and the pursuit of never-ending growth.

Growth is most of the time translated as acquiring more users/customers, and eventually making more money out of them.

Somewhere in the distant future, making money becomes cost saving; that's when companies start thinking about how to cut costs, and such projects/explorations become relevant.

Even then it is not easy to sell as a project, because there are so many unknowns. The discussion goes like:

  me: hey, I see spikes in CPU/mem usage, and want to investigate why.
  Product: is it important right now?
  me: I don't know exactly, but it might end up saving on our instance costs.
  Product: how many instances are we talking about?
  me: I don't know yet how big the problem is; it depends on what the result would be.
  Product: how long would it take you to do the research?
  me: I don't know what the problem is yet, so I can't estimate how long it takes to find it.
  Product: maybe we shouldn't do it then?
  me: (again) but it might end up saving on our instance costs.
  Product: ok, do you think you can finish the investigation in 2 days and share with me the timeline and the metrics this project would improve?
  me: ok, let's forget about this shit we made. what is the next "metric-improving, highly visible" project we have?
  Product: cool, I want you to work on this super exciting "Move button from Page A to Page B" project

I don’t think this is related to capitalism though; it’s more about attitude to risk.

If you always do the obvious things, you’ll end up as a mediocre company that might not survive. If you say yes to any project that comes to mind, you’ll burn through all your money on endless projects. It’s important to strike a balance and try to find the right investments others don’t see to create a competitive advantage.

It helps if the culture allows for that. The questions are still relevant, you just don’t always need precise answers. Probe 100 instances and see if the issue surfaces more than once. Set up a list of things you think you’ll need to be doing and an initial proposition of what you think might change. It’s the PM’s job to then convince others that it’s worth taking the risk.

Nailed it.

Fwiw I work with a lot of PMs and have a number on staff. I really appreciate their work. But they’re not engineers, by and large, and shouldn’t be directing engineering work. It’s critical to have a general manager who is an engineer and respects the work of the PMs, but can make decisions like “this is good speculative engineering work” and rewards failure and slipping on product roadmaps.

Frontend developers do not know how to test for performance and do not value performance at all. It’s equally frustrating and baffling. So much so that I have this growing voice in my head telling me to quit my job and found a performance-focused consultancy.

My observations contributing to the frustration:

* Developers believe performance is speed. It isn’t. Performance is a difference in measure between two (or more) competing qualities, the performance gap measured in either duration or frequency.

* Developers will double down on the prior mistake by making anecdotal observations without any forms of measurement and will vehemently argue their position in the form of a logic based value proposition. Example: X must be fast because Y and Z appear fast.

* Developers will further hold their unmeasured position by isolating their observations from how things actually work. Software is a complex system of many instructions; except for the most trivial of toy scripts, nothing executes in isolation. Developers follow the value proposition that a grain of sand is tiny and weighs little, and therefore does not contribute to the weight or size of the beach. For example, developers will argue to the death for using querySelectors over walking the DOM, but in Firefox DOM walking can be as much as 250000x faster than querySelectorAll, which is indeed significant.

I suspect there exists a variety of motives for why developers reason this way, but from the outside it looks like a house of cards built on argument-from-ignorance fallacies, outputting really slow software by really defensive people.
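The "measure, don't assert" point can be made concrete with even a crude harness (a sketch of mine, not from the thread; a real benchmark needs warmup, repetition, and statistics, e.g. JMH on the JVM):

```java
public class Measure {
    // Crude timing harness - illustrative only; use JMH or similar for
    // anything you intend to act on.
    static long timeNanos(Runnable r) {
        long t0 = System.nanoTime();
        r.run();
        return System.nanoTime() - t0;
    }

    public static void main(String[] args) {
        final long n = 1_000_000;
        final long[] loopSum = {0};
        final long[] formulaSum = {0};
        long tLoop = timeNanos(() -> { for (long i = 0; i < n; i++) loopSum[0] += i; });
        long tFormula = timeNanos(() -> { formulaSum[0] = n * (n - 1) / 2; });
        // Both compute the same answer; which is faster is a measurement,
        // not an argument from how fast it "appears".
        System.out.println(loopSum[0] == formulaSum[0]); // true
        System.err.println("loop: " + tLoop + " ns, formula: " + tFormula + " ns");
    }
}
```

The point is the shape of the claim: "X is N times faster than Y, measured like so", rather than "X must be fast because it looks fast".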

> Frontend developers […] do not value performance at all.

I do not think that’s restricted to front-end developers. Back-end developers rarely worry about inserting yet another http request in code called from user action in a web browser, for example. A tenth of a second here, a tenth there, it all adds up.

Product managers and users also don’t seem to care much or do not know how fast modern hardware is. I’ve frequently seen web page refreshes take 5 seconds or more without getting any user complaint, even when I explicitly ask them, and tell them how that could easily be halved.

Can someone clarify this?

> * Developers believe performance is speed. It isn’t. Performance is a difference in measure between two (or more) competing qualities, the performance gap measured in either duration or frequency.

Maybe I'm a dummy, but if someone says "make a fast webapp", I'll be doing things like reducing requests, optimizing queries, making things smaller, using the right data structures, manipulating data in a fast and/or memory efficient way.

This should result in lower loads, more stability, better availability, etc, so also "more performance". I do think of it as lots of focus on speed since, for the most part, I'm measuring how long it takes for a function/query/etc to run.

Happy to check out some articles/keyword suggestions.

The question that matters is: How much faster is it? If you cannot answer that why should anybody believe your new website is fast?

Okay, so we should all measure stuff.

Why does every thread on HN speak shit about frontend developers? As if BE is any better: most of them still use Java, which has bad memory and CPU performance, and unlike FE, BE has options (any native, GC-free language). Pretty sure 99% of Java developers don't have any idea how the JVM works, and 99.99% don't know how to tune it for better performance; they just move to the next shiny JVM implementation (omg GraalVM) and call it a day.

I really do not like the Java ecosystem, but the VM can be exceedingly fast. Sadly, as you mentioned, most won't even try to configure it right.

I used to be a performance engineer for telecoms systems. Tuning & fixing performance regressions saved millions in hardware costs & software licenses (think databases, licensed by core) in the late 90s/early 2000s. It was great fun, and highly rewarding (mentally).

I'd love to get back into it.

Would many companies really benefit from this kind of analysis? NFLX serves petabytes of data for a particular use case; the majority of Tier 1 companies would not have 100Mbps of traffic for their top service. Maybe there are 5-15 companies globally that will get value from this, and they will all support it.

I have seen far more companies wasting resources on projects of no value because the stakeholders believe themselves to be at the same level as Netflix or Google.

It seems quite common that small or medium companies adopt highly complex architectures that could be avoided by caring slightly about performance.

Those (slightly complex) architectures and tools buy them availability and reproducibility as well as scalability.

Typically negative availability and zero reproducibility?

I think you need to work at companies that are large enough to allow for people to dedicate their time to this level of analysis. Smaller companies are too busy to have staff focussing on things like this.

It’s not “too busy”. It’s that unless you have significant scale, the return for improvements like this are often not worth the engineering time. At companies with massive scale (read: large enough), even tiny improvements can pay for themselves.

Poppycock. It doesn't need to be at this deep a level. But most orgs - small, medium, large, or XL - have essentially zero appreciation for knowing what the heck is happening in their systems. There are a couple of lay engineers about with some deep skills & knowledge, whom the rest of the org has no idea what to do with & is unable to listen to. Apps get slow, cumbersome, crufty, shitty, tech debt piles higher, customers accept it but grow increasingly doubtful, and the company just fails to recognize, forever & ever, how close it's getting to the "thermocline of trust", where everyone internal & external just abandons the system & all hope. This is hugely prevalent, all over the place, & extremely few companies have management that is able to see, recognize, believe in its people, and witness the bad.

This is a case of eking more out, of optimizing, but the general diagnostician view is deeply needed widely. This industry has an incredibly difficult time accepting the view of those close to the machine, those who would push for better. Those who know are aliens to the exterior of boring normal business, and it's too hard for the pressing mound of these alien concerns to bubble up & register & turn into corporate-political will to improve, even though this is often a critical limitation & constraint on a company's acceptance/trust.

There are a lot of comments here passing this off. No one seems to get it. Scaling up is linear. Complexity & slowness are geometric or worse. Slowness, as it takes root, is ever more unsolvable.

You seem to be bringing in piles of other things here: technical debt, complexity, cruft, etc. My post is about none of those things. You also note that the prevalence of those things is because no one will listen to engineers “with some deep skills & knowledge”. At some point, you also have to assign some responsibility for failure to communicate effectively, and not just a failure to listen. Or maybe they are not listening also, and not understanding the full context? I’m not doubting that poor decisions happen, but I’m not sure it’s useful to assign unilateral blame to management based on a failure to listen to the engineers you’ve deemed special here.

> You also note that the prevalence of those things is because no one will listen to engineers “with some deep skills & knowledge”. At some point, you also have to assign some responsibility for failure to communicate effectively, and not just a failure to listen. Or maybe they are not listening also, and not understanding the full context?

This rings immensely hollow to me, & borders on victim blaming. Oh sure; telling the lower rungs of the totem pole it's their fault for not convincing the business to care- for not being able to adequately tune in the business to the sea of deeper technical concerns- has some tiny kernel of truth to it. Maybe the every-person coder could do better, maybe, granted. But I see the structural issues & organization issues as vastly vastly more immense impediments to understanding-our-selves.

There is such a strong anti-elitism bias in society. We don't like know-it-alls, whether their disposition is classically braggadocious or humble as a dove. We are intimidated by those who have real, deep, sincere & obvious masteries. We cannot outrun the unpleasantness of the alien, the foreign concerns, steeped in surety & depth, that we can scarcely begin to grok. Techies face this regularly, and are habitually ostracized & distanced from. Few, very few, are willing to sit at points of discomfort to trade with the alien, to work through what they are saying, to register their concerns.

> I’m not sure it’s useful to assign unilateral blame to management based on a failure to listen to the engineers

Again, granted! There absolutely are plenty of poor decisions all around. Engineers rarely have good sounding boards, good feedback, for a variety of reasons but the above forced alienation is definitely a sizable factor where engineers go wrong; being too alone in deciding, not knowing or not having people to turn to to figure shit out, to get not just superficial but critical review that strikes at the heart.

This again does not dissuade me from my core feelings on my core point. I think most companies specifically are hugely unable to assess their own products' & systems' health, unable to gauge the decay & rot within. Whether it's slog or real peril, there are few rangers tasked with scouting the terrain & identifying the features & failures. And the efforts at renewal/healing are all too often patchwork, haphazard, & random, done as emergency patches. These organisms of business are discorporated, lacking cohesion & understanding of themselves & what they are. Having real on-the-ground truthfinders, truthtellers, assessors, monitors - having people close to the machine who can speak for the machine, for the systems - that is rarely a role we embrace. Too often we simply rely on the same chain of management that is also responsible for self/group-promotion & tasking & reporting, which has far too many conflicting interests for us to expect it to deliver these more unvarnished, technically minded views.

For what it's worth I agree highly with this perspective. I'd invite you to be on my gang of post-apocalyptic systems engineers [1].

[1] https://www.usenix.org/system/files/1311_05-08_mickens.pdf


Cheap money and small but easily observable gains made everyone who didn't know better appreciate linear scaling improvements.

It's probably cheaper for a small company to just add another machine instead of hiring a whole team to do this.

I think you and the parent comment are both correct. Engineers who are skilled enough to do this type of analysis are expensive and another VM/Instance is a few hundred a month at most.

At the very small end of the scale this is very true. It doesn't take too much traffic volume to make it worth it though. It's just difficult to make anyone care. It doesn't take a particularly huge amount of traffic to make it worth spending a few weeks of engineering time to save $50K+ on annual hosting costs.

I think you’re considering only literal cost, and not the opportunity cost (which is typically much higher, and what you’re using to make investment decisions). Suppose you’ve an engineer and they’re paid $100k. Now you’d be tempted to say that anything that takes less than six months and saves more than $50k is worth it. But that’s not true at all. For one, that engineer is actually worth much more than $100k/yr to the company; it costs a lot more to hire and keep that person busy (recruiting, training, an office, a manager, product managers, project managers, etc.). But more importantly: what are the other things they could be doing? Small companies are rarely thinking about micro-efficiency because they are trying to change and grow their products. If this engineer is able to build a feature that will help them grow X% faster, that can have a massive multiplier effect on their prospects (and valuation, for which that $50k saving makes zero difference). Those are the things you’re comparing against, which is why the opportunity cost bar for pure $$ saving improvements is often much higher than it seems.

Yup. One of my previous jobs was at a very tiny company where everyone was busy, yet my job was basically this type of performance optimization. The reason? We were building a real-time processing product, and we were not hitting our frame time, meaning the product could not be shipped. In those situations, convincing the PM that you are doing important performance work is trivial.

And for most companies, the issue is not in the lower levels; they will have tons of suboptimal software that needs years of work to make more performant or to rewrite before you even get to the point where you need to do CPU level analysis.

I wonder what the difference would be vs. "just" running 3-4 copies of apps on a node.

We did that (as ops, not developing the actual app) a few times where the app scaled badly.

You need more staff to manage a cluster than a single machine though. Over the course of 6 months you'll probably spend more time on maintaining the cluster (with all its overhead) than doing an optimization analysis to net you 3x performance.

I’ve been at the largest companies on earth for about 20 years. They are precisely who can’t countenance such work.

Most of the blog posts I see that are this detailed are usually from larger companies.

I don’t think it’s related to size but rather corporate culture, and what behavior leads to profitability. Netflix is all about reducing cost in their technical infrastructure while providing a highly consistent, high-bandwidth, relatively low-latency experience. Exxon Mobil couldn’t care less. There are small pockets of Amazon that care, but most of Amazon is product-management-driven towards a business goal where cost and performance are only relevant when they interfere with the product goal. It’s less about size and more about priority for achieving the core business objectives. Companies that need this behavior but don’t prioritize it will sooner or later be overtaken by a Netflix. Companies that do not need this behavior but do prioritize it will be overtaken by a product-focused competitor while their engineers are flipping bits to optimize marginal compute utilization.

Seems like poor analysis on their part then. If they're leaving 3.5x perf on the floor, they're not spending their money very well.

Leaving 3.5x perf of what on the floor though?

We're working on applications that are either waiting for the disk, waiting for the DB, waiting for some HTTP server, or waiting for the user.

None of our customers will notice the difference if my button click event handler takes 50ms instead of 10ms, or if the integration service that processes a few dozen files per day spends 5 seconds instead of 1 second.

I'll easily trade 5x performance in 99% of my code if it makes me produce 5x more features, because most of the time my code runs in just a few milliseconds at a time anyway.

Of course, I'm wary of big-O; that one will bite ya. But a mere 3.5x is almost never worth chasing for us.

Netflix is not doing this work to improve user latency, they are doing this work to minimize cost:

> At Netflix, we periodically reevaluate our workloads to optimize utilization of available capacity.

The idea being that if you pay for fewer servers you spend less money

And at their scale, this makes sense. No one is going to do this work if it means they run 2 servers instead of 6.

> We're working on applications that's either waiting for the disk, waiting for the db, waiting for some http server or waiting for the user.

If you try to be good at waiting on many things, you can use one machine instead of a hundred

> None of our customers will notice the difference if my button click event handler takes 50ms instead of 10ms

They absolutely will, what

> If you try to be good at waiting on many things, you can use one machine instead of a hundred

We can't run on one, because customers run our application on-prem.

> They absolutely will, what

How can you be so sure?

Have you tested this? 40ms of additional interaction delay is extremely noticeable.

You're missing the point.

Sure, 10ms vs 40ms is measurable, and for the keen-eyed noticeable. But if you're only pressing the button once every 5 minutes, it doesn't matter. Similarly, if the button triggers an asynchronous call to a third-party webservice that takes seconds to respond, it doesn't matter. And so on.

Of course, for the things where users are affected by low latency, we try to take care. But overall that's a very, very small portion out of our full functionality.

A lot of businesses reasonably use Python and Ruby, for example.

If you can run 1 server instead of 3 though... that means no ansible, no kubernetes, no wild deployment strategies, 1/3 the likelihood of hardware failure, etc

This isn't true, at all. We are not "high availability" by any means but running on one server has significant risks, _especially_ if you don't use some form of automated provisioning. A destructive app logic bug is far more likely than a hardware failure, and in both cases the impact on your service is significant if you have one instance, but likely manageable if you have more than one.

Why wouldn't you use ansible to manage one server?

It's a great tool with a low learning curve that just requires SSH.

Well, depending on what you do, if performance or throughput matters - and usually it does - a 3.5x improvement in throughput means saving lots of money, so I'm sure most managers should see an advantage in that.

In this case, the engineers had a clear expectation and good reasons that something was wrong and such throughput should be possible. The arguments for that are simple enough that every manager should also understand this.

The other aspect is latency, but here too I would assume that managers should at least care a bit. They also have the old baseline to compare to, and can see that it is suddenly much worse, so they should care about why that is and how to fix it.

It’s a self-solving problem over long enough timescales.

There’s a reason YCombinator (as an example) insists on at least one strongly technical founder.

Bean counters rot companies by not seeing the wood for the trees.

The same applies to MBAs (no offence to anyone that has one). Engineering companies need to either be led by people the engineers respect, which means an engineer - or at the very least someone who understands the engineering thoroughly enough to get out of the way.

Fuck “project managers”.

The second you are subjected to one as an engineer you are being told you’re in the wrong place.

I don’t care what the stock options are, I don’t care what the free morning deviled eggs are like, that place is cancer and you need to escape it.

Yup, very few workplaces would know how to reward people with this skill level.

You have to know how to justify your work. Places like Netflix have teams like these because they literally print money, with each engineer often saving the company millions if not tens of millions of dollars annually. It’s hard to argue with results.

Almost every engineer I’ve worked with has this skill level, but they’re not given the leeway, direction, or advice to get to the conclusion. Netflix must have an amazing culture in these areas.

I've been a high performance systems engineer for a number of investment banks over the last 10+ years and people with deep knowledge like that are quite rare; that's why we get paid handsomely.

Mind sharing what field you work in that gathers so much talent like that?

Knowledge != skill. With the skill, you can acquire the knowledge given the right conditions and motivation. I would say the vast majority of engineers I work with have never had the need, opportunity, or experiential openings to invest the time to acquire the knowledge required - but they COULD. I actually find that quite sad, hence the depression I noted. Nowadays this extends all the way through school: machine architecture, low-level microcode, etc. aren’t universally taught in CS programs any more.

I’ve worked in all the industries, including a long time in yours. We probably even overlapped or maybe worked together!

Really, that just needs a manager who presents the results to higher-ups as raw cash savings.

"My team takes your team's apps and makes them run cheaper."

Very few workplaces value this skill because very few workplaces have a use for it. One service now runs 3x faster? You reduced the AWS bill for this one service by 2/3? That's not going to change the bottom line enough for anyone to care about, unless you're, well, Netflix.

Things like this mostly make sense at Netflix's scale. One grade lower, and spinning up a new VM is inherently cheaper than spending SWE time on solving this low-level problem.

It's a giant tragedy of the commons. For every individual company it makes sense to cut corners to get to market quicker or to save on engineering costs. For the industry as a whole it's a disaster.

This is an excellent description of a false-sharing performance issue, and I wouldn't envy the work involved in tracking that code down through the JVM. These sorts of issues are common inside the OS kernel itself. Kernels manage lots of shared resources, and as you add more CPUs, these sorts of issues emerge as bottlenecks. They're enough of a problem that the perf tool recently got a new sub-command, "perf c2c", which tries to help identify false sharing (and other cache ping-ponging) issues much quicker.
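For readers who haven't hit false sharing before, the usual mitigation - keeping two hot fields on separate cache lines - looks roughly like this in Java (a sketch of mine, not from the article; it assumes 64-byte lines, and note that HotSpot may reorder plain fields, so `@jdk.internal.vm.annotation.Contended` is the supported mechanism):

```java
public class PaddedCounters {
    static final class Padded {
        volatile long a;                      // hot field, written by thread 1
        long p1, p2, p3, p4, p5, p6, p7;      // padding: pushes b onto another line
        volatile long b;                      // hot field, written by thread 2
    }

    public static void main(String[] args) throws InterruptedException {
        Padded c = new Padded();
        // Each thread hammers its own counter; without padding, both
        // counters can share a cache line and ping-pong between cores.
        Thread t1 = new Thread(() -> { for (int i = 0; i < 1_000_000; i++) c.a++; });
        Thread t2 = new Thread(() -> { for (int i = 0; i < 1_000_000; i++) c.b++; });
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(c.a + " " + c.b); // correct either way; only speed differs
    }
}
```

The result is identical with or without the padding fields; the whole effect is in cache-coherence traffic, which is exactly why tools like `perf c2c` exist.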

If you're interested in learning more about this class of issues and the new tool, there was an excellent talk at LPC 2022 by Arnaldo Carvalho de Melo. And as luck would have it, the videos just came out, so I can link it here. In my opinion it was one of the best talks of the conference.


Thanks for sharing! The slides are here: http://vger.kernel.org/~acme/prez/linux-plumbers-2022/

I'm starting to see why systems programming people (C/Zig/Rust) sneer at Java/Node.js/.NET.

The JVM and V8 are just more layers of abstraction that get in the way when performance problems like this arise at scale.

The tools used in this article require an understanding of what the hardware is actually doing.

I've worked at a company that wrote Java code that needed to avoid false sharing. It was fairly understandable, and the tooling made it quite testable for performance regressions.

However, the issue in the Netflix blogpost is in the JVM C++ code. I think it's entirely possible to encounter the problem in any language if you're writing performance-critical code.

It’s often more common! Because the JVM will typically give you ok layouts “for free” and your hand-written native code might be too naive to do the same.

JVM will typically give you ok layouts presumably because almost everything is an object, and when accounting for various headers and instrumentation two distinct objects are more likely to be >= 64 bytes apart even when allocated in succession.

What I find a little odd is why those variables were only on different cache lines 1/8th (12.5%) of the time. What linker behavior or feature would result in randomly shifting those objects while preserving their adjacency? ASLR is the first thing that comes to mind, randomizing the base address of their shared region. But heap allocators on 64-bit architectures usually use 16-byte alignment rather than the random 8-byte alignment that would account for this behavior. Similarly, mmap normally would be page-aligned, even when randomized; or certainly at least 16-byte aligned?

Aiui, it's one object that happens to sometimes be allocated at an offset where two fields in it lie across a cache line boundary. ASLR only affects static offsets, not heap memory.

edit: Pointers will be 8-aligned. With 16-byte allocation, if one field lands at an 8-offset and the next at a 0-offset, you will sometimes get a cache-line crossing. Admittedly it should be 25%, not 12%... Maybe Java's allocator is only 8-aligned?
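A quick enumeration (my own sketch; the field offset is illustrative) reproduces both figures, and supports the "8-aligned allocator" guess: with two adjacent 8-byte fields at a fixed offset inside the object, 16-byte-aligned bases give a 25% chance of straddling a 64-byte line, while 8-byte-aligned bases give exactly the observed 12.5%:

```java
public class Crossing {
    // Fraction of allocation bases (mod 64) at which a pair of adjacent
    // 8-byte fields straddles a 64-byte cache-line boundary. The pair is
    // assumed to start at byte 8 inside the object (illustrative offset).
    static double crossingFraction(int alignment) {
        int crossings = 0, total = 0;
        for (int base = 0; base < 64; base += alignment) {
            int first = (base + 8) % 64;   // first field's address mod line size
            if (first == 56) crossings++;  // boundary falls between the two fields
            total++;
        }
        return (double) crossings / total;
    }

    public static void main(String[] args) {
        System.out.println(crossingFraction(16)); // 16-byte-aligned allocator
        System.out.println(crossingFraction(8));  // 8-byte-aligned allocator
    }
}
```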

Why 16 byte aligned? That’s 128 bit, double the word size right? What is the significance of that alignment?

long double on x86-64 Linux is 16 bytes, and therefore malloc implementations - or any other allocation routines, for that matter - must return pointers aligned to multiples of the largest primitive type. Interestingly, on x86-64 Windows long double is 8 bytes, so malloc on Windows returns pointers aligned to multiples of 8.

The tools used in this article require an understanding of what the hardware is actually doing.

Which doesn't apply to 99% of business applications out there.

Wait no, I want to spend months figuring out how to optimize user login for my 70 users to go from 20ms to 12ms.

I think user login is one of the few things in your application that is better if it takes longer (e.g. stronger password hashing).
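For instance, a deliberately expensive key derivation makes each login slower by design, because the same cost falls on anyone brute-forcing stolen hashes (a sketch; the iteration count and fixed salt are illustrative, and a real system must use a random per-user salt):

```java
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;

public class SlowHash {
    public static void main(String[] args) throws Exception {
        byte[] salt = new byte[16]; // fixed zero salt for the demo ONLY
        // The iteration count is the knob that trades login latency for
        // attacker cost; raising it makes both slower on purpose.
        PBEKeySpec spec = new PBEKeySpec("hunter2".toCharArray(), salt, 210_000, 256);
        byte[] key = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256")
                                     .generateSecret(spec).getEncoded();
        System.out.println(key.length); // 32 bytes = 256-bit derived key
    }
}
```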

Well, do you see programming as an obligatory tool to make stuff, or do you aspire to expertise and mastery of the craft?

I think those are both probably legitimate approaches, but if it's the latter you really want to understand how the hardware works.

It's fine for a side project, but a waste of time for a business. I'm not going to charge my employer or client hundreds of extra hours just so I can learn a cool new thing.

I too have side projects :)

It doesn't apply to 99% or maybe 99.9% of tasks, but given enough development time, it applies to most business applications eventually.

This issue could happen in any language; as a matter of fact, it was happening in native JVM code (C++).

How would C/zig/rust have helped? They don't expose CPU cache lines to you any more than Java does. And with C in particular, you lose more in memory safety bugs than you could ever hope to gain from increased performance, unless you're working in an industry where correctness doesn't actually matter.

Not for cache lines per se, but no runtime reflection means that any type information is static and won’t cause this kind of problem (+ no GC, which is nice and makes leaks less likely). That’s a pretty significant RAM and performance saving and gets you within shouting distance of hand-tuned assembly, especially since hot paths ARE frequently hand-tuned assembly.

I think that’s more what OP meant.

I’ll also note that while I don’t know what this service is doing, 300 RPS seems mighty low. That means each core is doing 25 RPS if you’re 12 wide. Maybe 50, since I don’t know what that 50% autoscaling piece means. Think about that: each request is burning 20-40ms of CPU time, which seems like a lot when you consider just how fast a modern CPU is. That being said, it’s not impossible that even tweaking the Java code itself might unlock more CPU, and that would be a more cost-effective solution than using native code, which is typically a bad fit for most web services (even Rust).

> Not for cache line per se, but no runtime reflection means that any type information is static and won’t cause this kind of problem (+ no GC which is nice and makes leaks less likely).

Possibly. Equally, maybe that runtime type information is what enables monomorphism that wouldn't be possible in C (Yes, C doesn't explicitly have virtual functions, but you end up needing to do the same thing, so you pass function pointers around and it's the same thing at runtime), and the result is better performance.

> That’s a pretty significant RAM and performance saving and gets you within shouting distance of hand tuned assembly, especially since hot paths ARE frequently hand-tuned assembly.

In my experience higher-level languages significantly outperform lower-level languages for realistic-sized business problems, because programmer effort is almost always the limiting factor. Most of the people using C/zig/rust because "muh fast" don't even know how to use a profiler, let alone the level of multiple-tool analysis that we see in the article. And again, this is the kind of thing you'd need to be doing in C to address performance issues in the same kind of code. Sure, maybe you don't have reflective type information so it shows up as a mispredicted branch rather than a cache issue - but guess what, you also need this kind of low-level tool to get that information, C won't tell you anything about branch prediction either.

This is true, but the problem is how it scales.

You set up a .NET/JVM project and it takes off, and you find out you're facing massive memory bloat of the managed heap. What do you do? The answer is basically: try using this slightly different allocation pattern by flipping a switch in the runtime, OR try to do a bunch of manual memory management anyway. You quickly discover the allocation patterns don't help much, so you turn to manual management.

Often enough, it's simple to understand where the memory allocation comes from, and you could probably fix it with an arena allocator or something simple. So you try that, but then you find out that many library functions in .NET/JVM that you need don't allow you to pass in preallocated buffers, leaving your ability to solve the memory problem crippled. You know where the memory is needed, when it expires, and when it can be reused, but you don't have the tools to apply this information anywhere.

At that point, you can either leave it be and buy more RAM, or rewrite from scratch in another language. Would be cool to have languages that are more of a hybrid, kind of like .NET unsafe, but then in a non-optional way.

In my experience managed languages work well when:

A) The business needs evolve quickly and/or cost cutting dev time is more valuable than having the application perform quickly.

B) the performance of the app isn’t that critical and developers will have a more pleasant experience in a better ecosystem

C) the managed language is mainly responsible for the control plane of a data path.

Anything beyond that, especially high performance code, will struggle. That’s why you see dev cycles spent on encryption algorithms, common low level routines, to the point of hand vectorizing assembly, etc. It’s a very well defined problem that’s fixed and has an outsized payoff because those things are used so often and used in places where the performance will matter.

There’s probably a 10x to 100x reduction in worldwide compute usage possible if you could wave a magic wand and have everything run as optimally as if the best engineers built everything and optimized everything to the levels we know how to do things now (computers have never been faster and never felt slower because of this).

However, it’s just not economical in terms of dev cycles per output and complaining otherwise is tilting at the windmills of market forces that give the edge in many scenarios to those languages (even when you factor in inefficiencies and/or extra time optimizing ). That’s what the person you’re replying to is stating and is something I’m 100% agreed with. This is someone who is a systems engineer who codes primarily in lower level languages and is generally a fan, especially Rust. I have probably written non trivial code in every popular language out there at this point and they’re just faster to get shit done in. Picking the right language is a mixture of figuring out what kind of talent you can attract, how much you can pay them, and what will satisfy your business needs within those constraints. A lot of people do choose incorrectly or suboptimally due to ignorance of how to choose, ignorance of the alternatives, picking because of personal familiarity the original author has, etc. Those choices are often more fatal when you choose a native language if your competitors choose better whereas the converse is less often true.

My point is not that higher level languages are not more productive, which is why I started out my post by saying "this is true". My point is that there is often no clear path forward to solve performance problems after you've invested in these higher level runtimes. The linked article is a good example: it solves the problem by modifying the runtime. Imagine that, having to modify the runtime to solve a performance problem. It's impressive and at the same time not in the realm of achievable engineering for most firms.

However, what is in the realm of achievable engineering is improving the performance of the code you write yourself, but a large part of that performance is opaquely hidden inside the runtime, with no way for you to change anything. If we were to create a language that has a smooth path from "managed-runtime" to "low-level-freedom" we would be able to adapt our codebases as they become more popular and performance starts to matter. What that would look like, no idea.

Eh. Maybe. And yet, here’s Netflix running a service in Java and explicitly not migrating. If performance were truly so important, surely it would be valuable to start migrating? Like all business decisions it’s a cost tradeoff, and centralizing the performance problems to experienced engineers who understand how to do this kind of work is a valid tradeoff. Sure, not all shops can do it.

Consider however:

* when you’re small your business bottlenecks are other things

* when you get larger you have more resources and can make a choice to switch or hire experts to remove the major bottlenecks

* improvements to the runtime improve your scaling for free

I’m not speaking theoretically. I worked at a startup that did indoor positioning. Our stack end to end was written in pure Java. We struggled to attain great performance but we did what we needed to and focused on algorithmic improvements. We managed to get far enough along the way to get acquired by Apple. Then I spent about 4 months porting the entire codebase verbatim to C++. It ran maybe about as fast (maybe slightly faster but not much). Switching to the native math libraries for the ML hot path gave the biggest speed up, but that’s more a problem with the Android ecosystem lacking those libraries (at least at the time, or us failing to use them if they did exist). Over the course of the next two years we eked out maybe an overall 5-10x CPU efficiency gain vs the equivalent Java version (if I’m remembering correctly - I regret not taking notes now) but towards the end it was definitely diminishing returns territory (eg changing a vector of shared_ptr objects on the critical path of the particle simulator netted something like a 5% speed up).

This was important work here because we got to a point where battery life actually started to meaningfully matter for the success of the project whereas as a startup we were trying to survive. But we were always conscious that Java was the better tradeoff for velocity and writing the localizer in c++ carried logistical challenges in growing it. In fact, starting in Java meant that we had an easier time because a lot of the initial figuring out of the structure of the code (via many refactorings and optimizations) had already happened in Java where it’s easier to move faster. Even within the startup we had discussions about migrating to C++ and it never felt like the right time.

My point is, good engineers know when to pick the right tool for the job and know what their risks are and what their contingencies are. If there’s necessity you’ll either change tools or fix your existing ones. Of course, not everyone does that, but my hunch is only those that are going to succeed anyway end up being fine. Kind of how the invisible hand of the market ends up working.

I think the idea of a smooth transition is a fantasy. Sometimes the architectures are so different that you’d have to fundamentally restructure your application to get that jump. It’s a map with many mountain ranges and valleys. There’s plenty of local optima and you can easily get stuck which requires you to fundamentally rearchitect things. For example, io_uring is very different. If you want to eke out optimal performance out of it, you need to build your application around that concept. It’s rare you get something like Project Loom in native landed but that’s a point in favor of Java managing things for you - free architectural speed up without you changing anything.

In my experience managed languages scale a lot better than unmanaged ones. A managed language does a small constant factor (2 is often quoted) worse than the theoretically optimal allocation. Manual memory management gets incredibly good results on microbenchmarks but gets worse and worse as a codebase gets larger.

The numbers in the blog are 350 rps for 48 vCPUs (hyper threads), which is 24 cores.

That's just a hair over 7 requests per hyper thread per second, or 135ms per request. That second number can be seen in the last graph, which shows an average request latency of about the same value; I'm guessing 120 milliseconds...

Without knowing what the code is doing, it's unfair to judge, but you've got to wonder what Netflix does to burn through $1B-per-annum in cloud infrastructure expenditure. Consider that at that rate they must have a truly epic deep discount, so that's probably equivalent to $3 to $5 billion at on-demand or retail pricing. Bonkers!

Wow that feels slow. 135ms smells like they're blocking heavily on some kind of database I/O and not processing anything else. But surely that can't be the case. They invested a lot in finding a cache line bouncing issue in the JVM, so this service must be valuable enough that you'd expect low-hanging fruit like that to have already been picked. But the only compute intensive parts of their business I can think of would be transcoding (not 135ms, and it would be native code in the background), maybe pre-rendering initial UI, and processing analytics (also background). My guess is maybe the pre-rendering (not sure if they do that) but 135ms feels steep when you consider what games accomplish in 10 (sure - not quite the same, but 135ms would be an eon to accomplish something in). I am kinda curious what this service is responsible for now. A 7 RPS service that's this valuable seems interesting.

You would make changes in your app, not have to patch JVM, deploy patched JVM etc.

I wouldn't expect caching vs. cache ping-ponging to be an easy and clear-cut decision in any language, and any external dependency is thus possibly an issue there.

From the article:

> We tend to think of modern JVMs as highly optimized runtime environments, in many cases rivaling more “performance-oriented” languages like C++. While it holds true for the majority of workloads

What you say more aligns with my thoughts. I've never heard someone express the opinion in this article. Then again, I don't know anyone quite as knowledgeable about the jvm as this article is.

This has been understood for years and years.

I remember there being a writeup about paint.net and what they had to do to keep performance. They were worrying a lot about memory and how to manipulate .net to manage it.

If you want performance, you don't get to ignore memory no matter the language, but some languages/platforms make it easier and some get in the way.

You can add erlang BEAM to that list. I was reading this article to see if it would inform me on how to better scale my elixir apps, and the answer is: no. It’s built to handle embarrassingly parallel computation by ensuring all lightweight processes enjoy isolation, while at the same time not having to think about threading at all when writing code.

well JS is just a horrible programming language at every level. Frankly it's a technological wonder it can be run as fast as it is now.

I really like these Netflix tech blog posts. They actually dive into areas that I’m not aware of and show me some technical details instead of high level waffling.

It’s interesting because the longer I work, the better my own code becomes, and as I make fewer mistakes myself, I’m starting to see more unexpected issues that have their root cause deeper down the stack. In my case that’s mostly dependencies.

At Netflix they go a step further and find issues with the JVM itself :)

puts incredibly nitpicky code reviewer hat on

Does the JVM allow using __attribute__ annotations? There's one for alignment, __attribute__((aligned(L1_DCACHE_LINE_SIZE))) (where L1_DCACHE_LINE_SIZE is defined by gcc -DL1_DCACHE_LINE_SIZE=`getconf LEVEL1_DCACHE_LINESIZE`) that could be used instead of the array, see: https://gcc.gnu.org/onlinedocs/gcc-12.2.0/gcc/Common-Variabl...

C11 added `_Alignas`, and it looks like C++11 has `alignas`, so you don't need to rely on vendor-specific extensions for alignment, unless the JVM restricts itself to an earlier standard.

LEVEL1_DCACHE_LINESIZE looks to be specific to Linux and Solaris, however.

I think the array might be better because it works on all architectures and all platforms.

It is great to see a tie from a higher level application performance problem all the way down to a processor architecture issue (the L1 Cache). I was working at Intel when VTune first came into existence and it was such an awesome way to see how small code changes results in significant differences in actual execution. This was especially true when the Pentium Pro came out with the uOps architecture.

One of the many good reasons to have a real 'full stack' appreciation of the platforms we work on.

For anyone interested in how the graphs were made: https://oss.oetiker.ch/rrdtool/

Happy to see people still using RRD after all these years.

I was totally happy to see these graphs after all these years, haha. Now I'm wondering if netflix has munin or cacti or something like that running in their infra.

Thanks for linking this. I used it in one of my first programming jobs, and forgot the name. Had occasion to use something similar recently and couldn’t find it.

Yeah I had "I know that look" reaction when seeing it.

The first actual TSDB in a way

Just in case somebody missed it from the article, there is a 1-hour long youtube video about the same issue: https://www.youtube.com/watch?v=G40VfIsnCdo

Reading the post I can imagine the days/weeks it took for those bright individuals (I spot Brendan Gregg among them, and being Netflix I can safely infer they are all very smart indeed) to find the issue behind this - and all the years of work on the tooling to make it possible.

Now, add the Kubernetes layer on top of this and I think you will add at very least one extra week of debugging...

I consider myself a good Java programmer and I would never figure something like this out on my own. I'm ok with this fact.

I would say that good chunks of this are things that people who've had to work in high performance computing will know about or would be able to figure out.

Having said that, this line is probably the most difficult part of figuring the whole thing out:

> Based on the fact that (5) “repne scan” is a rather rare operation in the JVM codebase, we were able to link this snippet to the routine for subclass checking

I am not a Java programmer, but surely something like the above is not common knowledge?

Amusingly this is something you can synthesize without knowing what the instruction is, precisely because you don’t know what the instruction is. Most code is simple arithmetic and memory operations that are relatively well known. The use of specialty instructions is pretty rare and often a good way to identify specific code sequences. For example, I don’t really hear much about “vgetmantsd” but the fact that I don’t is a good sign it does a specialized operation that would make it a good needle. (Ok, I can probably guess what it does, but ignore that.)

I too found that interesting. I would not have guessed that the use of 'repne scas' (scan a string) would be uncommon in a large codebase... perhaps there is some particular reason why that would be well known if you did a lot of Java profiling.

It strikes me as something that comes back from Google if you search ‘repne scan’?

I have no idea what the super class cache is used for, but as soon as I read that there was good performance 12% = 1/8 of the time, after trying to scale up a server workload, I immediately thought of false sharing.

Different people have different skills, but it’s still interesting to read about things you’ll never use.

EDIT: and as sibling points out, I haven’t heard of the REPNE scan before, so that’s something interesting to look up.

It's ok; for a good Java dev I'd assume they at least know how to troubleshoot a performance issue with JFR. If you need to go deeper into native / perf events, that's more the job of a very senior engineer or perf people.

The issue was not in Java, but in the C++ implementation of the JVM. I think you can be forgiven.

Nice article, but strange fix.

What they did:

    Klass* field1;
    uint8_t padding[ 64 ];
    Klass** field2;
What I usually do:

    alignas(64) Klass* field1;
    alignas(64) Klass** field2;

I am a little confused because I do not get this level of PMU access on AWS instances short of "metal". Usually if I want to analyze some performance problem I need to move the workload to "metal" instance types and usually the c6i.metal, since the Ice Lake PMU is so much more capable.

If I try to count machine clears on a non-metal instance perf tells me the event doesn't exist, and if I give it the raw event encoding it runs but says the event isn't supported.

This is a cool post. Initially I was discouraged by the list of things needed to properly diagnose it, but that doesn't seem unreasonable:

Java Flight Recorder


PerfSpect (Performance Monitoring Counters (PMCs))

Intel vTune

As folks keep looking to Rust for safety and performance, I wonder what in-language facilities exist to say "this variable should exist in a separate cpu cache line". That would be neat level of control in contrast to recognizing this in debugging + adding arbitrary padding.

That's not something that can be applied through compile-time analysis, nor is it something that the language can "solve". There is actually nothing to be solved unless you have an actual workload which justifies making a change like this.

As a matter of fact, applying such a pattern blindly would have the negative effect of increased memory pressure. You're increasing your application's runtime memory footprint by doing this.

I meant in the context that you knew ahead of time this would cause a cache coherency issue. I agree you're almost always finding this after the fact. :)

The JDK issue ticket linked towards the bottom of the post has some interesting discussions. Apparently Developers at RedHat ran into a similar problem and decided to build an 'agent' to help developers identify code patterns triggering the issue.

Slightly different approach to the Netflix post, but still an interesting line of effort to find problematic Java code!

For anyone interested: https://bugs.openjdk.org/browse/JDK-8180450?focusedCommentId...

Would an atomic mutable subclass cache (not sure what it's used for, downcasting?) be unnecessary in a language built around static rather than dynamic dispatch by default, like C/C++/Rust and perhaps Go? Or would it still speed up dynamically dispatched code, but is less practical or worthwhile so it isn't used in practice? (Though Rust's Arc also suffers from atomic contention similar to this blog post, when used across dozens of threads: https://pkolaczk.github.io/server-slower-than-a-laptop/)

Also it's somewhat ironic that the JVM source code (https://github.com/openjdk/jdk8u/blob/jdk8u352-b07/hotspot/s...) says "This code is rarely used, so simplicity is a virtue here" at the site of a bottleneck.

That was a lot lower level than I expected. I like how humbling it is to see such deep dives in areas that I would not know how to debug.

Disproportionately annoyed by inserting a whole cache line of padding between the variables instead of just enough to move them to different lines. That's probably just to make the blog post simpler though.

I thought, based on the title, that this was about epoxy counter tops.

they can run vtune on their aws nodes, what a privilege!

It’s not Netflix specific, but access to hardware counters is rarer among instances than you might realize if you only scan the table on Intel’s pages.
