Anyway, my point is that there are usually a lot more high-level things that need to be fixed - and this is the case in most, if not all, organizations I've worked at - before you get down to the lower levels of performance. And this probably applies everywhere.
In this example, their software is already so optimized and tuned that the bottleneck turned out to be a low-level JVM issue.
Mind you, on the other hand, they mentioned something about subclasses and the like; it might be that a change in their Java code would have solved or mitigated this issue as well.
Couldn't agree more, except on the timeline. In my experience, through the 90s and most of the 00s, proper engineering like this was the norm in Silicon Valley, and I loved it. It's the last 10-15 years that PMs have taken over, and now it's all so superficial and uninteresting. I've moved away from hands-on development work because nobody is doing anything intellectually rewarding anymore. But I wish I were still doing hard-core engineering like it used to be.
Because growth is most of the time translated as acquiring more users/customers, and eventually making more money from them.
Somewhere in the distant future, making money becomes cost saving; that's when companies start thinking about how to cut costs, and such projects/explorations become relevant.
Even then it's not easy to sell as a project, because there are so many unknowns. The discussion goes like:
me: hey, I see spikes in CPU/mem usage, want to investigate why.
Product: is it important now?
me: I don't exactly know, but it might end up saving our instance costs.
Product: how many instances are we talking about?
me: I don't know yet how big the problem is; it depends on what the result would be.
Product: how long would it take you to do the research?
me: I don't know what the problem is yet, so I can't estimate how long it will take to find it.
Product: maybe we shouldn't do it then?
me: (again) but it might end up saving our instance costs.
Product: ok, do you think you can finish investigation in 2 days and share with me timeline and metrics this project would improve?
me: ok, let's forget about this shit we made. what's the next "metric improving, highly visible" project we have?
Product: cool, I want you to work on this super exciting "Move button from Page A to Page B" project
If you only ever do the obvious things, you’ll end up as a mediocre company that might not survive. If you say yes to every project that comes to mind, you’ll burn all your money on endless projects. It’s important to strike a balance and try to find the right investments others don’t see, to create a competitive advantage.
It helps if the culture allows for that. The questions are still relevant; you just don’t always need precise answers. Probe 100 instances and see if it surfaces more than once. Set up a list of things you think you’ll need to do and an initial proposition of what you think might change. It’s the PM’s job to then convince others that it’s worth taking the risk.
My observations contributing to the frustration:
* Developers believe performance is speed. It isn’t. Performance is a difference in measure between two (or more) competing qualities, the performance gap measured in either duration or frequency.
* Developers will double down on the prior mistake by making anecdotal observations without any form of measurement, and will vehemently argue their position in the form of a logic-based value proposition. Example: X must be fast because Y and Z appear fast.
* Developers will further hold their unmeasured position by isolating their observations from how things actually work. Software is a complex system of many instructions; except for the most trivial of toy scripts, nothing executes in isolation. Developers follow the value proposition that a grain of sand is tiny and weighs little, and therefore does not contribute to the weight or size of the beach. For example, developers will argue to the death for using querySelectors to avoid walking the DOM, yet in Firefox DOM walking can be as much as 250,000x faster than querySelectorAll, which is indeed significant.
I suspect there exists a variety of motives for why developers reason this way, but from the outside it looks like a house of cards built on argument-from-ignorance fallacies, outputting really slow software defended by really defensive people.
I do not think that’s restricted to front-end developers. Back-end developers rarely worry about inserting yet another http request in code called from user action in a web browser, for example. A tenth of a second here, a tenth there, it all adds up.
Product managers and users also don’t seem to care much or do not know how fast modern hardware is. I’ve frequently seen web page refreshes take 5 seconds or more without getting any user complaint, even when I explicitly ask them, and tell them how that could easily be halved.
> * Developers believe performance is speed. It isn’t. Performance is a difference in measure between two (or more) competing qualities, the performance gap measured in either duration or frequency.
Maybe I'm a dummy, but if someone says "make a fast webapp", I'll be doing things like reducing requests, optimizing queries, making things smaller, using the right data structures, manipulating data in a fast and/or memory efficient way.
This should result in lower loads, more stability, better availability, etc, so also "more performance". I do think of it as lots of focus on speed since, for the most part, I'm measuring how long it takes for a function/query/etc to run.
Happy to check out some articles/keyword suggestions.
I'd love to get back into it.
I have seen far more companies wasting resources on projects of no value because the stakeholders believe themselves to be on the same level as Netflix or Google.
This is a case of eking more out, optimizing, but the general diagnostician view is deeply needed, widely. This industry has an incredibly difficult time accepting the view of those close to the machine, those who would push for better. Those who know are aliens to the exterior, boring, normal business, and it's too hard for the pressing mound of these alien concerns to bubble up & register & turn into corporate-political will to improve, even though this is often a critical limitation & constraint on the company's acceptance/trust.
There are a lot of comments here casually passing this off. No one seems to get it. Scaling up is linear. Complexity & slowness are geometric or worse. Slowness, as it takes root, becomes ever more unsolvable.
This rings immensely hollow to me, & borders on victim blaming. Oh sure; telling the lower rungs of the totem pole that it's their fault for not convincing the business to care - for not being able to adequately tune the business in to the sea of deeper technical concerns - has some tiny kernel of truth to it. Maybe the every-person coder could do better, maybe, granted. But I see the structural & organizational issues as vastly more immense impediments to understanding ourselves.
There is such a strong anti-elitism bias in society. We don't like know-it-alls, whether their disposition is classically braggadocious or humble as a dove. We are intimidated by those who have real, deep, sincere & obvious masteries. We cannot outrun the unpleasantness of the alien, the foreign concerns, steeped in surety & depth, that we can scarcely begin to grok. Techies face this regularly, and are ostracized & distanced from out of habit. Few, very few, are willing to sit at points of discomfort to trade with the alien, to work through what they are saying, to register their concerns.
> I’m not sure it’s useful to assign unilateral blame to management based on a failure to listen to the engineers
Again, granted! There absolutely are plenty of poor decisions all around. Engineers rarely have good sounding boards, good feedback, for a variety of reasons but the above forced alienation is definitely a sizable factor where engineers go wrong; being too alone in deciding, not knowing or not having people to turn to to figure shit out, to get not just superficial but critical review that strikes at the heart.
This again does not dissuade me from my core feelings on my core point. I think most companies are hugely unable to assess the health of their own products & systems, unable to gauge the decay & rot within. Whether it's slog or real peril, there are few rangers tasked with scouting the terrain & identifying the features & failures. And the efforts at renewal/healing are all too often patchwork, haphazard, & random, done as emergency patches. These organisms of business are discorporated; they lack cohesion & understanding of themselves & what they are. Having real on-the-ground truthfinders, truthtellers, assessors, monitors - having people close to the machine who can speak for the machine, for the systems - is rarely a role we embrace. All too often we simply rely on the same chain of management that is also responsible for self/group promotion & tasking & reporting, which has far too many conflicted interests for us to expect it to deliver these more unvarnished, technically minded views.
Cheap money and small but easily observable gains made everyone who didn't know better settle for linear scaling improvements.
We did that (ops, not developing the actual app) a few times when the app scaled badly.
We're working on applications that are either waiting for the disk, waiting for the DB, waiting for some HTTP server, or waiting for the user.
None of our customers will notice the difference if my button click event handler takes 50ms instead of 10ms, or if the integration service that processes a few dozen files per day spends 5 seconds instead of 1 second.
I'll easily trade 5x performance in 99% of my code if it makes me produce 5x more features, because most of the time my code runs in just a few milliseconds at a time anyway.
Of course, I'm wary of big-O; that one will bite ya. But a mere 3.5x is almost never worth chasing for us.
> At Netflix, we periodically reevaluate our workloads to optimize utilization of available capacity.
The idea being that if you pay for fewer servers you spend less money
If you try to be good at waiting on many things, you can use one machine instead of a hundred
> None of our customers will notice the difference if my button click event handler takes 50ms instead of 10ms
They absolutely will, what
We can't run on one, because customers run our application on-prem.
> They absolutely will, what
How can you be so sure?
Sure, 10ms vs 40ms is measurable, and for the keen-eyed noticeable. But if you're only pressing the button once every 5 minutes, it doesn't matter. Similarly, if the button triggers an asynchronous call to a third-party webservice that takes seconds to respond, it doesn't matter. And so on.
Of course, for the things where users are affected by low latency, we try to take care. But overall that's a very, very small portion out of our full functionality.
It's a great tool with a low learning curve that just requires SSH.
In this case, the engineers had a clear expectation, and good reason to believe, that something was wrong and that such throughput should be possible. The arguments for that are simple enough that every manager should also understand them.
The other aspect is latency, and here too I would assume that managers should at least care a bit. They also have the old baseline to compare against, and can see that it is suddenly much worse, so they should care about why that is and how to fix it.
There’s a reason YCombinator (as an example) insists on at least one strongly technical founder.
Bean counters rot companies by not seeing the wood for the trees.
The same applies to MBAs (no offence to anyone that has one). Engineering companies need to either be led by people the engineers respect, which means an engineer - or at the very least someone who understands the engineering thoroughly enough to get out of the way.
The second you are subjected to one as an engineer you are being told you’re in the wrong place.
I don’t care what the stock options are, I don’t care what the free morning deviled eggs are like, that place is cancer and you need to escape it.
Mind sharing what field you work in that gathers so much talent like that?
I’ve worked in all the industries, including a long time in yours. We probably even overlapped or maybe worked together!
"My team makes your team's apps run cheaper"
If you're interested in learning more about this class of issues and the new tool, there was an excellent talk at LPC 2022 by Arnaldo Carvalho de Melo. And as luck would have it, the videos just came out, so I can link it here. In my opinion it was one of the best talks of the conference.
The JVM and V8 are just more layers of abstraction that get in the way when performance problems like this arise at scale.
The tools used in this article require an understanding of what the hardware is actually doing.
However, the issue in the Netflix blogpost is in the JVM C++ code. I think it's entirely possible to encounter the problem in any language if you're writing performance-critical code.
What I find a little odd is why those variables were only on different cache lines 1/8th (12.5%) of the time. What linker behavior or feature would result in randomly shifting those objects while preserving their adjacency? ASLR is the first thing that comes to mind, randomizing the base address of their shared region. But heap allocators on 64-bit architectures usually use 16-byte alignment rather than the random 8-byte alignment that would account for this behavior. Similarly, mmap normally would be page-aligned, even when randomized; or certainly at least 16-byte aligned?
edit: Pointers will be 8-aligned. Random 16-byte allocation if one pointer is at 8-offset and the next is at 0-offset will sometimes give you a cache-line crossing. Admittedly it should be 25%, not 12%... Maybe Java's allocator is only 8-aligned?
Which doesn't apply to 99% of business applications out there.
I think those are both probably legitimate approaches, but if it's the latter you really want to understand how the hardware works.
I think that’s more what OP meant.
I’ll also note that while I don’t know what this service is doing, 300 RPS seems mighty low. That means each core is doing 30 RPS if you’re 12 wide. Maybe 60 since I don’t know what that 50% autoscaling piece means. Think about that. Each request is burning 16ms of CPU time. While I don’t know what this service does, this does seem like a lot when you consider just how fast a modern CPU is. That being said, it’s not impossible that even tweaking the Java code itself might unlock more CPU and that would be a more cost effective solution than using native code which is typically a bad fit for most web services (even rust).
Possibly. Equally, maybe that runtime type information is what enables monomorphism that wouldn't be possible in C (Yes, C doesn't explicitly have virtual functions, but you end up needing to do the same thing, so you pass function pointers around and it's the same thing at runtime), and the result is better performance.
> That’s a pretty significant RAM and performance saving and gets you within shouting distance of hand tuned assembly, especially since hot paths ARE frequently hand-tuned assembly.
In my experience higher-level languages significantly outperform lower-level languages for realistic-sized business problems, because programmer effort is almost always the limiting factor. Most of the people using C/zig/rust because "muh fast" don't even know how to use a profiler, let alone the level of multiple-tool analysis that we see in the article. And again, this is the kind of thing you'd need to be doing in C to address performance issues in the same kind of code. Sure, maybe you don't have reflective type information so it shows up as a mispredicted branch rather than a cache issue - but guess what, you also need this kind of low-level tool to get that information; C won't tell you anything about branch prediction either.
You set up a .NET/JVM project and it takes off, and you find out you're facing massive memory bloat of the managed heap. What do you do? The answer is basically: try using this slightly different allocation pattern by flipping a switch in the runtime, OR try to do a bunch of manual memory management anyway. You quickly discover the allocation patterns don't help much, so you turn to manual management.
Often enough, it's simple to understand where the memory allocation comes from, and you could probably fix it with an arena allocator or something similarly simple. So you try that, but then you find that many library functions in .NET/JVM that you need don't allow you to pass in preallocated buffers, crippling your ability to solve the memory problem. You know where the memory is needed, when it expires, and when it can be reused, but you don't have the tools to apply this information anywhere.
At that point, you can either leave it be and buy more RAM, or rewrite from scratch in another language. Would be cool to have languages that are more of a hybrid, kind of like .NET unsafe, but then in a non-optional way.
A) The business needs evolve quickly and/or cost cutting dev time is more valuable than having the application perform quickly.
B) the performance of the app isn’t that critical and developers will have a more pleasant experience in a better ecosystem
C) the managed language is mainly responsible for the control plane of a data path.
Anything beyond that, especially high performance code, will struggle. That’s why you see dev cycles spent on encryption algorithms, common low level routines, to the point of hand vectorizing assembly, etc. It’s a very well defined problem that’s fixed and has an outsized payoff because those things are used so often and used in places where the performance will matter.
There’s probably a 10x to 100x reduction in worldwide compute usage possible if you could wave a magic wand and have everything run as optimally as if the best engineers built everything and optimized everything to the levels we know how to do things now (computers have never been faster and never felt slower because of this).
However, it’s just not economical in terms of dev cycles per output and complaining otherwise is tilting at the windmills of market forces that give the edge in many scenarios to those languages (even when you factor in inefficiencies and/or extra time optimizing ). That’s what the person you’re replying to is stating and is something I’m 100% agreed with. This is someone who is a systems engineer who codes primarily in lower level languages and is generally a fan, especially Rust. I have probably written non trivial code in every popular language out there at this point and they’re just faster to get shit done in. Picking the right language is a mixture of figuring out what kind of talent you can attract, how much you can pay them, and what will satisfy your business needs within those constraints. A lot of people do choose incorrectly or suboptimally due to ignorance of how to choose, ignorance of the alternatives, picking because of personal familiarity the original author has, etc. Those choices are often more fatal when you choose a native language if your competitors choose better whereas the converse is less often true.
However, what is in the realm of achievable engineering is improving the performance of the code you write yourself, but a large part of that performance is opaquely hidden inside the runtime, with no way for you to change anything. If we were to create a language that has a smooth path from "managed-runtime" to "low-level-freedom" we would be able to adapt our codebases as they become more popular and performance starts to matter. What that would look like, no idea.
* when you’re small your business bottlenecks are other things
* when you get larger you have more resources and can make a choice to switch or hire experts to remove the major bottlenecks
* improvements to the runtime improve your scaling for free
I’m not speaking theoretically. I worked at a startup that did indoor positioning. Our stack, end to end, was written in pure Java. We struggled to attain great performance, but we did what we needed to and focused on algorithmic improvements. We managed to get far enough along to get acquired by Apple. Then I spent about 4 months porting the entire codebase verbatim to C++. It ran about as fast (maybe slightly faster, but not much). Switching to the native math libraries for the ML hot path gave the biggest speedup, but that’s more a problem with the Android ecosystem lacking those libraries (at least at the time, or us failing to use them if they did exist). Over the course of the next two years we eked out maybe a 5-10x overall CPU efficiency gain vs the equivalent Java version (if I’m remembering correctly - I regret not taking notes now), but towards the end it was definitely diminishing-returns territory (eg changing a vector of shared_ptr objects on the critical path of the particle simulator netted something like a 5% speedup).
This was important work here because we got to a point where battery life actually started to meaningfully matter for the success of the project whereas as a startup we were trying to survive. But we were always conscious that Java was the better tradeoff for velocity and writing the localizer in c++ carried logistical challenges in growing it. In fact, starting in Java meant that we had an easier time because a lot of the initial figuring out of the structure of the code (via many refactorings and optimizations) had already happened in Java where it’s easier to move faster. Even within the startup we had discussions about migrating to C++ and it never felt like the right time.
My point is, good engineers know when to pick the right tool for the job and know what their risks are and what their contingencies are. If there’s necessity you’ll either change tools or fix your existing ones. Of course, not everyone does that, but my hunch is only those that are going to succeed anyway end up being fine. Kind of how the invisible hand of the market ends up working.
I think the idea of a smooth transition is a fantasy. Sometimes the architectures are so different that you’d have to fundamentally restructure your application to get that jump. It’s a map with many mountain ranges and valleys. There’s plenty of local optima and you can easily get stuck which requires you to fundamentally rearchitect things. For example, io_uring is very different. If you want to eke out optimal performance out of it, you need to build your application around that concept. It’s rare you get something like Project Loom in native landed but that’s a point in favor of Java managing things for you - free architectural speed up without you changing anything.
That's just a hair over 7 requests per hyperthread per second, or about 135ms per request. That second number matches the last graph, which shows an average request latency of about the same value - I'm guessing 120 milliseconds...
Without knowing what the code is doing, it's unfair to judge, but you've got to wonder what Netflix does to burn through $1B-per-annum in cloud infrastructure expenditure. Consider that at that rate they must have a truly epic deep discount, so that's probably equivalent to $3 to $5 billion at on-demand or retail pricing. Bonkers!
> We tend to think of modern JVMs as highly optimized runtime environments, in many cases rivaling more “performance-oriented” languages like C++. While it holds true for the majority of workloads
What you say more aligns with my thoughts. I've never heard someone express the opinion in this article. Then again, I don't know anyone quite as knowledgeable about the jvm as this article is.
I remember there being a writeup about paint.net and what they had to do to keep performance up. They worried a lot about memory and how to coax .NET into managing it.
If you want performance, you don't get to ignore memory no matter the language, but some languages/platforms make it easier and some get in the way.
It’s interesting because the longer I work, the better my own code becomes, and as I make fewer mistakes myself, I’m starting to see more unexpected issues that have their root cause deeper down the stack. In my case that’s mostly dependencies.
At Netflix they go a step further and find issues with the JVM itself :)
Does the JVM allow using __attribute__ annotations? There's one for alignment, __attribute__((aligned(L1_DCACHE_LINE_SIZE))) (where L1_DCACHE_LINE_SIZE is defined by gcc -DL1_DCACHE_LINE_SIZE=`getconf LEVEL1_DCACHE_LINESIZE`) that could be used instead of the array, see: https://gcc.gnu.org/onlinedocs/gcc-12.2.0/gcc/Common-Variabl...
LEVEL1_DCACHE_LINESIZE looks to be specific to Linux and Solaris, however.
One of the many good reasons to have a real 'full stack' appreciation of the platforms we work on.
Happy to see people still using RRD after all these years.
The first actual TSDB in a way
Now, add the Kubernetes layer on top of this and I think you will add at very least one extra week of debugging...
Having said that, this line is probably the most difficult part of figuring the whole thing out:
> Based on the fact that (5) “repne scan” is a rather rare operation in the JVM codebase, we were able to link this snippet to the routine for subclass checking
I am not a Java programmer, but surely something like the above is not common knowledge?
Different people have different skills, but it’s still interesting to read about things you’ll never use.
EDIT: and as sibling points out, I haven’t heard of the REPNE scan before, so that’s something interesting to look up.
What they did:

    uint8_t padding[64];           // spacer roughly one cache line wide
    alignas(64) Klass*  field1;    // each field now starts on its own 64-byte line
    alignas(64) Klass** field2;
If I try to count machine clears on a non-metal instance perf tells me the event doesn't exist, and if I give it the raw event encoding it runs but says the event isn't supported.
Java Flight Recorder
PerfSpect (Performance Monitoring Counters, PMCs)
As folks keep looking to Rust for safety and performance, I wonder what in-language facilities exist to say "this variable should live on its own CPU cache line". That would be a neat level of control, in contrast to recognizing this in debugging and adding arbitrary padding.
As a matter of fact, applying such a pattern blindly would have a negative effect: increased memory pressure. You're increasing your application's runtime memory footprint by doing this.
Slightly different approach to the Netflix post, but still an interesting line of effort to find problematic Java code!
For anyone interested:
Also it's somewhat ironic that the JVM source code (https://github.com/openjdk/jdk8u/blob/jdk8u352-b07/hotspot/s...) says "This code is rarely used, so simplicity is a virtue here" at the site of a bottleneck.
It doesn't seem Netflix-specific?