More evidence for problems in VM warmup (tratt.net)
87 points by ltratt on Nov 17, 2022 | 34 comments



Note that certain things can cause your code to be un-jitted. If you run into an NPE, for example, what actually happens is that the compiled code segfaults. Adding null checks everywhere would be slow, so instead the JVM lets it blow up. When the jitted code segfaults, that bytecode goes back into interpreted mode and has to be compiled again with updated heuristics.
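
As a rough illustration (the class and the numbers are invented, and the exact heuristics are HotSpot's business, not mine), this is the kind of code where that can happen:

    // Hypothetical illustration: a hot method that the JIT has compiled without
    // explicit null checks, relying on the segfault trap instead.
    final class Order {
        final long amountCents;
        Order(long amountCents) { this.amountCents = amountCents; }
    }

    public class DeoptExample {
        // After enough calls this method gets JIT-compiled.
        static long total(Order[] orders) {
            long sum = 0;
            for (Order o : orders) {
                // If 'o' is null, the compiled code dereferences a null pointer,
                // the JVM's signal handler turns the fault into an NPE, and the
                // method is deoptimized back to the interpreter. It gets
                // recompiled later, possibly with an explicit null check.
                sum += o.amountCents;
            }
            return sum;
        }

        public static void main(String[] args) {
            Order[] orders = new Order[1_000];
            for (int i = 0; i < orders.length; i++) orders[i] = new Order(i);
            for (int i = 0; i < 20_000; i++) total(orders);  // warm up, trigger the JIT

            orders[500] = null;                              // now hit the rare path
            try {
                total(orders);
            } catch (NullPointerException e) {
                // Run with -XX:+PrintCompilation to see the method marked
                // "made not entrant" and recompiled afterwards.
            }
        }
    }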

Still, covering cases like this is easier than writing C. :)


Also, what's missing from this article is a relative comparison. What if the testing machines have some kind of cron job or process that's messing up the microbenchmark? I didn't see that comparison here.


Did you miss the bits about krun or running many iterations? Though that’s mostly from the older paper rather than the subject of this article.


I guess I missed that.


It's interesting how often the same stories play out for different interpreter implementations. Everyone tries:

- a multi-pass JIT

- interpreting the input directly to reduce time to first op

- making the fastest transpiler they can and skipping the interpreter

All of these are addressing different constraints and affect each other. For instance, a cheap transpiler is only slightly slower than an interpreter loop, and allows you to move the threshold for the JIT farther to the right. If you can avoid trying to optimize things that you will only be slightly successful at, you can invest more of your CPU budget in deeper optimization on the hottest paths. You are also running on-stack replacement less often, and in fewer scenarios, which may mean you make different tradeoffs there as well.
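
As a sketch of the counter-based tier-up decision I mean, for a hypothetical VM (the names and thresholds are invented):

    // Hypothetical sketch of tier-up logic in a VM with a cheap baseline
    // tier and an optimizing JIT reserved for the hottest code.
    final class FunctionProfile {
        enum Tier { CHEAP_TRANSPILE, OPTIMIZED }

        // Moving this threshold "to the right" (higher) means fewer functions
        // reach the optimizer, leaving more compile budget for the truly hot ones.
        static final int OPTIMIZE_THRESHOLD = 10_000;

        int invocations;
        int backEdges;   // loop iterations seen; the trigger for on-stack replacement

        void onCall()     { invocations++; }
        void onBackEdge() { backEdges++; }

        Tier tierFor() {
            // Everything starts in the cheap transpiler, which is only slightly
            // slower than an interpreter loop; only hot code pays for optimization.
            return (invocations + backEdges >= OPTIMIZE_THRESHOLD)
                    ? Tier.OPTIMIZED
                    : Tier.CHEAP_TRANSPILE;
        }
    }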


If I were the King of the Forest, our routing fabric would utilize something like nginx's ramping weight feature to throttle new servers for a few minutes before they reached full membership in the cluster.

As things are, we end up running something more like blue-green deployments and hit the dormant side with a stress testing tool. We haven’t really come up with a better solution, though we have steadily reduced both the time necessary to warm up the servers and the worst-case behavior if you accidentally skip the warming step. Today you will just have a very bad time; originally, circuits would blow like crazy.
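
The ramping idea, as a sketch against a hypothetical in-process load balancer rather than nginx itself (all names invented):

    import java.time.Duration;
    import java.time.Instant;

    // Hypothetical weighting: a freshly deployed backend ramps from a small share
    // of traffic to its full weight over a few minutes, so the JIT warms up under
    // gradually increasing load instead of taking a full share cold.
    final class RampingBackend {
        static final Duration RAMP = Duration.ofMinutes(5);
        final Instant joinedAt = Instant.now();
        final double fullWeight;

        RampingBackend(double fullWeight) { this.fullWeight = fullWeight; }

        double effectiveWeight() {
            double ageMs  = Duration.between(joinedAt, Instant.now()).toMillis();
            double ramped = Math.min(1.0, ageMs / RAMP.toMillis());
            // Start at 5% rather than zero so the new server sees some traffic at once.
            return fullWeight * Math.max(0.05, ramped);
        }
    }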


I appreciate that these issues definitely complicate research and benchmarking of VM performance, but I'm confused about this question: "If you're a normal user, the results suggest that you're often not getting the performance you expect". Is that true? Most VMs are not simple, predictable systems that would ever reach a "steady state"—they're dynamic systems that are constantly executing different parts of the codebase and taking wildly different codepaths at different times. V8 is a great example of this—the type of code executed on a page changes wildly depending on what you're doing and what actions you take. Why would we even want to optimize it for "reaching a steady state", when different parts of the codebase may be more or less useful at different times? It seems more important to me to work on optimizations that can allow us to deoptimize parts of the codebase that we don't think will be useful again, to save on memory, even if it involves sacrificing this theoretical notion of a "steady state".


> Most VMs are not simple, predictable systems that would ever reach a "steady state"—they're dynamic systems that are constantly executing different parts of the codebase and taking wildly different codepaths at different times. V8 is a great example of this—the type of code executed on a page changes wildly depending on what you're doing and what actions you take.

That is already the exact point being made by Laurie and Edd's research.

> Why would we even want to optimize it for "reaching a steady state", when different parts of the codebase may be more or less useful at different times?

A steady state for whatever the application is doing right now. Not a single, statically known steady state, without considering what it's doing.

> It seems more important to me to work on optimizations that can allow us to deoptimize parts of the codebase that we don't think will be useful again, to save on memory, even if it involves sacrificing this theoretical notion of a "steady state"

I think that's a completely solved problem, isn't it? You can GC a compiled method. What more do you want?


It smells like a caching problem rather than a GC problem. Surely the compiled method would be kept live by the pointer from its interpreted cousin or from its class’s vtable or whatever.


It’s a kind of weak pointer.


Yeah I see the comparison, it just feels… weak. With weak pointers there isn’t really much of a ‘gc keeps around objects that are used a lot’ thing going on whereas that fits much more nicely for a cache and for this situation.


> the type of code executed on a page changes wildly depending on what you're doing and what actions you take.

I can't speak for anyone else, but I feel like one of my goals for mature software is to start factoring out common bits of code so that a lot of the workloads either share code in common, or share output in common (avoiding code re-execution).

There's a high correlation between Choose Your Own Adventure code that's doing random stuff for every request and my unhappiness on a project. I usually find at least a few highly effective people on any team who sympathize or agree.

If at least 20% of the code isn't "in common" I'm uncomfortable. "In common" doesn't necessarily mean shared by every task, just by multiple tasks. E.g., every task runs 5 of these 20 common concerns.

I'm a sucker for complementary code and processes though. I am sure that some of my preferences in software design have an origin story in hardware sympathy, some in ergonomics/cognition, and a number in both. I couldn't begin to guess which opinions started with which observation at this point. Did I rationalize a human optimal solution or rationalize a hardware optimal one?


I think it depends on whether you are thinking about backend code or things like browsers.

The results section of the article leads with: "The research question in TCPT I am most interested in is RQ1 'Do Java microbenchmarks reach a steady state of performance?'"


After thinking about these issues for over a decade, and having the rug repeatedly pulled out from under me by Java, I've come to the conclusion that the easiest path forward is to stick code snippets into godbolt, or to disassemble optimized binaries from ahead of time compiled languages.

(I didn't say doing this is easy; just that it is easier than reasoning about a JIT)


What's the difference in practice between a JIT and an AOT compiler that supports optimizations based on runtime analysis (I forgot the technical term for that. Profile-guided optimization, maybe?)? I think the real issue is, the higher level the language, the harder it is to reason about what your code will actually get compiled down to. If you want predictable performance in Java, or any high-level language, in my experience you have to essentially write C code in that language, or at least get as close as possible. In Java, that usually means:

* Sticking to primitives or arrays of primitives as much as possible. The language has no support for user-defined value types!

* Sticking to regular for loops, while loops, and other similar control structures.

* Using off-heap memory via Unsafe or a similar API when feasible. (A rough sketch of this style follows the list.)
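
A rough sketch of the style, using only the standard library and plain primitives (the Unsafe/off-heap part is left out since it needs reflection hacks):

    // "C in Java": primitive arrays and plain loops; no boxing, no streams,
    // no iterator allocation, so what the JIT emits is comparatively predictable.
    public final class DotProduct {
        static double dot(double[] a, double[] b) {
            double sum = 0.0;
            for (int i = 0; i < a.length; i++) {
                sum += a[i] * b[i];
            }
            return sum;
        }
        // The "idiomatic" alternative (List<Double> plus streams) boxes every
        // element and gives the optimizer far more work before it can emit tight code.
    }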


The things that most Java applications need to do to reduce the variance of runtime outcomes are sadly a little more basic: use a modern VM, set the compilation flags appropriately, and make sure the code caches are large enough to hold all your hot functions.


My experience leads me to the exact opposite conclusions. For example, your code cache is either big enough or it's not. It's a pretty binary situation and a simple checkbox to ensure your app didn't run out of space. To quote from https://stackoverflow.com/questions/7513185/what-are-reserve...:

"Normally you'd not change this value. I think the default values are quite good balanced because this problems occur on very rare occasions only (in my experince)."

Similarly, your organization tends to run the version of Java that the whole company supports. I have never worked anywhere where someone was allowed to use a different version than the rest of the company for performance reasons.

In contrast, the advice to write low-level code for predictable, high performance has been virtually timeless and generally applicable across languages.


I'm not suggesting that devs should just wildcat a newer JDK, but surveys indicate that a significant proportion of organizations are still on 8 or even 7. If that's your org, this is probably performance left on the table.

As for the code caches, you're right that it's a binary outcome, but the signals are subtle, and if you aren't looking at the right stats, a non-expert without a well-calibrated gut feeling for how good the performance should be might just conclude that the app was written poorly, or that Java just sucks, or whatever.
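
One way to look at the right stats, as a minimal sketch using the standard MemoryPoolMXBean API (the pool names differ between JDK 8's single code cache and the segmented code heaps in JDK 9+):

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryPoolMXBean;
    import java.lang.management.MemoryUsage;

    // Print code cache usage so "is it big enough?" is an observation, not a guess.
    public class CodeCacheStats {
        public static void main(String[] args) {
            for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
                // JDK 8 exposes a single "Code Cache" pool; JDK 9+ splits it into
                // several "CodeHeap ..." pools (the segmented code cache).
                if (pool.getName().contains("Code")) {
                    MemoryUsage u = pool.getUsage();
                    System.out.printf("%s: used=%d KB, max=%d KB%n",
                            pool.getName(), u.getUsed() / 1024, u.getMax() / 1024);
                }
            }
        }
    }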


For this research to be useful, the authors should switch to large benchmarks.

VMs win in the average. Therefore, VMs have great cases, average cases, and crappy cases. The crappy cases will always exist, and that’s known to VM architects and it’s not a bug.

Essentially this work is like criticizing a professional gambler for his biggest losses when the gambler is ahead in the average (or conversely praising him for his biggest wins when he’s behind in the average).

Source: I build VMs for a living.


> The crappy cases will always exist, and that’s known to VM architects and it’s not a bug.

I think this is very much the lens of the VM developer. From a different POV, why isn't it valid to ask, "Why can't this also be fast?" Why is that a feature and not a bug?

> Essentially this work is like criticizing a professional gambler for his biggest losses when the gambler is ahead in the average

The losses are still losses. Maybe we should try to eliminate the "gambling?"


VMs are fundamentally about gambling. If you knew statically what to do in the compiler then you’d run an ahead of time (AOT) compiler. VMs are for those cases where an AOT would have been slower, and empirically, languages like JS, Java, and Lua perform much better with a dynamic VM than with an AOT and that’s why we use VMs.

So, eliminating gambling is like saying we should just do AOTs. We could but then we’d be slower.

> From a different POV, why isn't it valid to ask, "Why can't this also be fast?" Why is that a feature and not a bug?

A professional gambler ought not ask this question. A professional gambler ought instead ask: how am I doing in the average? Hence the flaw in this research. It’s bad gambling.


> A professional gambler ought not ask this question. A professional gambler ought instead ask: how am I doing in the average? Hence the flaw in this research. It’s bad gambling.

I should think "not exactly." A professional gambler would take a look at each game, and determine if the expected payoff makes it worth playing. Here is where the analogy falls down. Not every analogue of "game" is the same!

I worked for a while at a company that was effectively a JIT VM vendor, and we found scenarios where our JIT could outperform things like a naively written YACC/LEX compiler. (Or a web templating engine: if we cranked up the size of the JIT code cache, it ran at speeds expected of a C implementation.)


No need to overthink the analogy. Pro gamblers choose their games while VM developers don’t. The analogy works for those games the gambler did happen to play.

Let me give you more concrete data. On JSC we did tune for microbenchmarks in addition to macrobenchmarks. But the microbenchmarks would show pathologies not seen in the macrobenchmarks, those pathologies were unpredictable (you breathe on the VM and the behavior changes), and anyway real programs were more like macrobenchmarks than microbenchmarks. So, based on experience, we switched to using the microbenchmarks only if they reliably reproduced behavior that we first observed in a macrobenchmark. Got a macrobenchmark regression? See if any microbenchmarks also regressed, in hopes of finding something easier to study. Otherwise the microbenchmarks were just whack-a-mole: if you focus on one of them you end up making changes that hurt real stuff.

Sort of like if a gambler focused too much on one game, they’d end up losing in the average.


> Pro gamblers choose their games while VM developers don’t.

Again, your answer is very informed, but very much in the VM developer POV. Pro gamblers can and should choose their games. Pro developers can and should also choose their games to make sure their expected payoff works out.

While VM developers can "play" from the POV of the "house," and handle things in aggregate, we application developers choosing to use a VM or not are very much invested in our "next game."


Bigger benchmarks would probably be better, but don’t these papers show that the thing programmers commonly believe (that VMs will warm up and then run faster, especially for microbenchmarks) is pretty untrue? VMs don’t always, or even very often, reach a steady state, and that steady state may not be faster, even for microbenchmarks.

I’m surprised that the author of the post (who I generally find writes interesting, thoughtful and seemingly correct things) feels that the research talked about in this article was pushing the state of the art in reliable benchmarking and understanding VM warmup, but that your comment seems mostly dismissive. I would especially expect a VM developer to be interested in more reliable benchmarking. Do you think that that assessment by the author was incorrect, or that these things are widely known among VM developers?

I also don’t really understand your comment, eg you claim that VMs win on average but the numbers in the article show that, even for a case which ought to be good for a JIT (I might be wrong here?) with a small very hot section, VMs on average fail to end up in a steady state with better performance than the interpreted case. Is the idea that the good steady state performance you get some of the time will average out with the poor or inconsistent performance you get other times to get good performance? I can sort of see that argument being made about a big system with many instances but I don’t think it’s a correct one because in practice many systems like that will block on the slowest part and so have worse tail performance than a single component. I guess another idea could be that a real program has many parts and if 10% reach a steady fast state while the rest are inconsistent or slower, then on average the program can still be faster. But I don’t really buy this because I expect there are a few places that will matter far more and so you don’t average out very well, but my prior could be badly wrong here.


The authors cherry-picked benchmarks where the VM goes off the rails.

Because warmup is probabilistic, there must be cases where it goes off the rails.

So yeah, I’m dismissive.


Hello Filip -- I hope life is treating you well! I'm happy to clarify a couple of things that might be useful.

First, VM authors I've discussed this with over the years seem roughly split down the middle on microbenchmarks. Some very much agree with your perspective that small benchmarks are misleading. Some, though, were very surprised at the quantity and nature of what we found. Indeed, I discovered a small number had not only noticed similar problems in the past but spent huge amounts of time trying to fix them. There are many people who I, and I suspect you, admire, in both camps: this seems like something upon which reasonable people can differ. Perhaps future research will provide more clarity in this regard.

Second, for BBKMT we used the first benchmarks we tried, so there was absolutely no cherry picking going on. Indeed, we arguably biased the whole experiment in favour of VMs (our paper details why and how we did so). Since TCPT uses 600 (well, 586...) benchmarks it seems unlikely to me that they cherry picked either. "Cherry picking" is, to my mind, a serious accusation, since it would suggest we did not do our research in good faith. I hope I can put your mind at rest on that matter.


I don’t buy it.

- Academics don’t publish results that aren’t sexy. How many people like you ran the same experiment with a different set of benchmarks but didn’t publish the results because they confirmed the obvious and so were too boring? How many times did you or your coauthors have false starts in your research that weren’t published? You’re cherry picking just by participating in the perverse reward system.

- The complexity of the data analysis sure makes it look like you’re doing something smart, but in reality, it’s just an opportunity to cherry pick.

- These results are not consistent with what I’ve seen, and I’ve spent countless hours benchmarking VMs I wrote and VMs I compete with. I’ll believe my own eyes before I believe published research. This leads me to believe there is something fishy going on.

Anyway, my serious accusation stands and it’s a fact that for large real-ish workloads, VMs do “warm up” - they start slow and then run faster, as designed.


I not only welcome reasonable scepticism, but I do my best to facilitate it. I have accrued sufficient evidence over time of my own fallibility, and idiocy, that I now try to give people the opportunity to spot mistakes so that I might correct them. As a happy bonus, this also gives people a way of verifying whether the work was done in the right spirit or not.

To that end we work in the open, so all the evidence you need to back up your assertions, or assuage your doubts, has been available since the first day we started:

* Here’s the experiment, with its 1025 commits going back to 2015 https://github.com/softdevteam/warmup_experiment/ -- note that the benchmarks are slurped in before we’d even got many of the VMs compiling.

* You can also see from the first commit that we simply slurped in the CLBG benchmarks wholesale from a previous paper that was done some time before I had any inkling that there might be warmup problems https://github.com/ltratt/vms_experiment/

* Here’s the repo for the paper itself, where you can see us getting to grips with what we were seeing over several years https://github.com/softdevteam/warmup_paper

* The snapshots of the paper we released are at https://arxiv.org/abs/1602.00602v1 -- the first version ("V1") clearly shows problems but we had no statistical analysis (note that the first version has a different author list than the final version, and the author added later was a stats expert).

* The raw data for the releases of the experiment are at https://archive.org/download/softdev_warmup_experiment_artef... so you can run your own statistical analysis on them.

To be clear, our paper is (or, at least, I hope is) clear to scope its assertions. It doesn't say "VMs never warmup" or even "VMs only warmup X% of the time". It says "in this cross-language, cross-VM, benchmark suite of small benchmarks we observed warmup X% of the time, and that might suggest there are broader problems, but we can't say for sure". There are various possible hypotheses which could explain what we saw, including "only microbenchmarks, or this set of microbenchmarks, show this problem". Personally, that doesn't feel like the most likely explanation, but I have been wrong about bigger things before!


This seems common, but it is not a necessary property, of course. If predictable performance were a design constraint, you could do it. And there's a spectrum between "maximally predictable" and "very unpredictable".

Unpredictability could, e.g., result in a need for more generous overprovisioning of infrastructure and less deterministic scaling, or in some users having a game run slowly despite meeting the hardware requirements, so sometimes it might be better to have deterministic but somewhat slower on average execution.


> Here's an example of a "good" benchmark from our dataset which starts slow and hits a steady state of peak performance

Am I misinterpreting their graph? The difference between "slow" and "peak performance" seems to be a factor of about 1.005, so a whopping 0.5% improvement after warmup?


Perhaps that program was mostly doing some regexp thing, and so it was easy for the JIT to optimise a few function calls and a for loop correctly?


A good blog post has lots of hard parts (layout, scope, visuals, audience). Laurence Tratt, you nailed it for me! I loved the details you put in "Benchmarking methodology" and the clean layout with instructive visuals.


The best warmup is a long-running VM without memory leaks but with long-term stats and PGO-directed JIT, similar to HotSpot.




