Hacker News
HotSpot compiler: Stack allocation prototype for C2 (java.net)
137 points by The_rationalist on July 2, 2020 | hide | past | favorite | 45 comments



I'm anticipating a lot of confusion and people talking past each other in this thread.

C2 already does escape analysis, and scalar-replacement-of-aggregates to turn objects that don't escape a compilation unit into dataflow edges. Where possible that is far better than the stack allocation in this article, because you'd rather not write out to the stack if you don't have to, and you'd like unused parts of the object to disappear. Graal then adds partial-escape-analysis to float reification of an object to the last safe moment.

What this does on top of that is then to cover the corner case where an object would be safe for SRA, except something requires the object to have the same layout as on the heap. I think a concrete example of this is some intrinsics and merges where an object needs to be heap-allocated on just one branch, so that code following the merge can have a full object in both cases and not care that one has been allocated on the stack.

The key to avoiding confusion - 'allocating on the stack' in this case means literally allocating the full reified object in stack memory, whereas in C2 at the moment when people say 'stack allocation' informally they mean SRA.

C2 already removes reference type allocations when they don't escape a compilation unit, which is something I think C# doesn't yet do at all for some reason I don't really understand.


> C2 already does escape analysis, and scalar-replacement-of-aggregates to turn objects that don't escape a compilation unit into dataflow edges, which if possible is far better than the stack-allocation in this article because you'd rather not write out to the stack if you don't have to, and you'd like unused parts of the object to disappear.

Yes, and I'd also add that SROA typically enables lots of optimizations since it lets you treat object fields as though they were ordinary local variables.
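As a hypothetical illustration (the class and method are invented for this sketch), this is the shape of code where scalar replacement pays off: neither object is visible outside the method, so after inlining the constructors, the fields can be treated as ordinary locals.

```java
// Hypothetical illustration: both Point objects are invisible outside
// distSq, so after inlining C2 can scalar-replace them. The fields
// become plain dataflow values (registers) -- no allocation at all.
class ScalarReplacementDemo {
    static final class Point {
        final double x, y;
        Point(double x, double y) { this.x = x; this.y = y; }
    }

    static double distSq(double ax, double ay, double bx, double by) {
        Point a = new Point(ax, ay); // never escapes: candidate for SRA
        Point b = new Point(bx, by);
        double dx = a.x - b.x;       // field reads become local reads...
        double dy = a.y - b.y;       // ...so ordinary optimizations apply
        return dx * dx + dy * dy;
    }
}
```

Run with `-XX:+PrintEliminateAllocations` (debug builds) to observe whether the allocations were actually eliminated; nothing in the source guarantees it.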


It appears that the .NET Core CLR does perform some rudimentary escape analysis: https://github.com/dotnet/coreclr/pull/6653


Right - and that supports the stack-allocation, not SRA, which is what I think C# doesn't have - possibly because they prefer explicit value types over trying to do it automatically through SRA.


> What this does on top of that is then to cover the corner case where an object would be safe for SRA, except something requires the object to have the same layout as on the heap.

I'm a little surprised to see such noticeable performance improvements. Obviously benchmarks etc etc, but I didn't expect these cases to be quite so common.


Sure, it's surprising as an outsider. But this isn't brand new, either.

Recentish versions of .NET (Core) have introduced ValueTuple and ValueTask. These types avoid the heap allocations that Tuple and Task incur.

Hopefully Microsoft thinks this makes enough real-world difference to justify the pain of two sets of types.


Those are structs, though. This work is about providing similar performance advantages for objects (in certain conditions).


Which raises the question: if the compiler can automatically transform objects anyway, why do we even need value types/structs?

Or asked differently: what is the difference between an object with properly overridden hashCode() / equals() methods and which is effectively only being used as a data container, and a struct?


If you pass an object into a function, the code has to work with any subclass. This means the data has to be behind a pointer since the size is unknown and method calls have to go through a vtable. If you can see the full lifetime of an object you can specialize away until no pointers are left but in the general case this doesn't work, or makes performance worse because you unbox and rebox repeatedly whenever you call methods.

For final classes unpacking into registers might work, but at least in the JVM you can subclass final classes via reflection. If I understand correctly, this avoids repacking when calling methods that expect the normal memory layout by actually allocating the normal object layout - just on the stack.
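A sketch of the layout problem (all names invented): once a value flows into a parameter typed as a base class, the callee must handle any subclass, so it needs a real reference with the standard object header and vtable rather than unpacked scalars.

```java
// Invented example: sides() is virtual, so consume() cannot assume any
// particular layout for Shape -- it may receive a Square or anything
// else. Any Shape reaching a non-inlined consume() must therefore exist
// as a full, normally-laid-out object behind a pointer.
class DispatchDemo {
    static class Shape { int sides() { return 0; } }
    static class Square extends Shape { @Override int sides() { return 4; } }

    static int consume(Shape s) {
        return s.sides(); // dispatched through the vtable at runtime
    }
}
```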


Makes sense, thanks for the explanation.


Because escape analysis isn't as powerful as it might seem. Indirect calls in particular can cause objects to be marked as possibly escaping, which matters a lot in languages like Java where most calls are virtual.
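A hedged sketch of that failure mode (names invented): if the JIT cannot prove what the indirect callee does with its argument, it must conservatively assume the argument escapes, even when every actual implementation discards it.

```java
// Invented example: sink.accept(p) is a virtual call and an
// escape-analysis barrier. Unless the JIT can devirtualize and inline
// accept(), it must assume the callee stores p somewhere, so p cannot
// be scalar-replaced and must be heap-allocated.
class EscapeDemo {
    interface Sink { int accept(int[] p); }

    static int sum(Sink sink) {
        int[] p = {1, 2, 3};   // would otherwise be SRA-friendly...
        return sink.accept(p); // ...but escapes through the indirect call
    }
}
```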


> languages like Java where most calls are virtual

Most Java call sites are in practice monomorphic. They may look virtual according to the language spec, but that isn't how they're really implemented.
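For example (hypothetical code): this call site is virtual per the spec, but if profiling only ever sees one receiver class, the JIT can speculatively devirtualize it into a guarded direct call and inline it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical: List.get is virtual in the language, but if only
// ArrayList instances ever reach this site it is monomorphic at
// runtime, so the JIT can compile it as a cheap type-check guard
// plus a direct, inlinable call to ArrayList.get.
class MonomorphicDemo {
    static int first(List<Integer> xs) {
        return xs.get(0); // monomorphic in practice -> devirtualized
    }
}
```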


> if the compiler can automatically transform objects anyway, why do we even need value types/structs?

It cannot always do so without affecting the semantics. An array of structs will still have to be an array of references in most cases, for example. The compiler isn't sufficiently smart to figure out such things beyond trivial examples.
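Concretely (illustrative Java, names invented): the reference semantics force an indirection per element that the compiler cannot remove in general, so a flat layout has to be emulated by hand.

```java
// Illustrative: pts is an array of *references* -- each Point is a
// separate heap object, so iteration chases a pointer per element.
// The flat layout that value types would give has to be emulated
// manually with a primitive array.
class ArrayLayoutDemo {
    static final class Point {
        double x, y;
        Point(double x, double y) { this.x = x; this.y = y; }
    }

    static double sumXBoxed(Point[] pts) {
        double s = 0;
        for (Point p : pts) s += p.x; // one dereference per element
        return s;
    }

    static double sumXFlat(double[] xy) { // hand-flattened: x0,y0,x1,y1,...
        double s = 0;
        for (int i = 0; i < xy.length; i += 2) s += xy[i];
        return s;
    }
}
```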

In general this applies to any optimization: the compiler can figure out only a subset of the optimizations the programmer can. That's why not giving the programmer control over memory layout is objectively a bad idea.

Also, when you need performance, not relying on the compiler to figure out the optimizations is easier than trying to ensure those optimizations actually happen. Compilers are quite unpredictable.


So some JITs can be given hints that certain objects are just values, and things like identity will never be used, and thus they can be boxed and unboxed as needed, but this is a pretty low level mechanism, and it’s probably not the thing to expose to most users as it can be brittle both in terms of safety and performance. There are a few other internal annotations in the JDK which are useful in the same way but aren’t exposed for similar reasons.

Adding value types to a VM or language means exposing that sort of feature in a way that can be used by programmers and provides the sort of safety guarantees they are used to.


Those transformations are not allowed in every situation. Usually they are only possible because of aggressive inlining that eliminates function boundaries. Once you cross multiple functions after optimization, escape analysis is unlikely to work out. The best thing you could perhaps do is add optional restrictions that make escape analysis easier; the end result would be something very similar to the borrow checker in Rust. However, it would be much easier to not rely on escape analysis and just add value type semantics directly into the language.


I wonder what Oracle will think about this. Isn't this one of the "big" optimizations you only get with Graal EE $$$ right now? It's going to be ironic if external contributors keep C2 on par with Graal EE.


No, I don't believe this is in Graal EE. I think you may be mistaking it for partial escape analysis, which can float reification of an object on the heap to the branch where it is needed and let it stay virtualised on other branches. This post is about actually allocating the reified object on the stack. I don't believe Graal does that in any configuration.


Apparently no one else cared enough about buying Sun's assets, so I am quite happy that Java did not die with version 6, and something was made out of Maxine VM.


Those are the first results of the renewed interest from Microsoft in contributing to the JVM, cf https://news.ycombinator.com/item?id=21837508

I wonder what other lovely optimizations/improvements they will bring :} Especially which optimizations from the C# world they will transpose to the JVM world. According to the benchmarks game, C# is the fastest JITed language on earth.


The benchmarks game results compare apples to oranges. I stopped after seeing them test Java regexes against C# regexes that use the native PCRE library. In my professional experience, the .NET JIT is nearly always slower than the JVM one.


With .NET 5, the fastest non-PCRE C# regexredux implementation there completes in 5 seconds on my machine, as opposed to 8 seconds for the Java implementation. Still quite a bit away from C, but I'd say respectable for a purely managed implementation.

It helps that the team has done a lot of optimizations for regex and other parts of the standard lib in the past few months.


> I stopped…

You stopped without providing any alternative data beyond — .net is slower because I say it's slower.

You stopped without telling us about the benchmarks game C# program which uses System.Text.RegularExpressions


The benchmarks game doesn't do JIT warmup, which is extremely frustrating. The results are useless for long-running server applications. It essentially benchmarks which VM can compile fairly fast code quickly, not which one produces the fastest code.

Long-running VMs like HotSpot in "server" mode do a lot of expensive optimizations over time. The benchmarks game makes no attempt to warm up these VMs, so it doesn't actually measure how fast a VM is in practical hosting scenarios.

Anecdotally, I have heard that Java is still faster than C# in most benchmarks when both VMs have had time to do a full JIT. I've also heard that Golang is much slower than C# and Java when their JITs are allowed to fully warm up.

I've been meaning to build a "warmed up" version of benchmarks game specifically for testing VM languages but never got around to it. If someone else wants to pick up the torch I would be eternally grateful!
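A minimal hand-rolled sketch of the idea (JMH is the proper tool for this): run the workload enough times to give tiered compilation a chance to kick in before taking any measurement.

```java
// Minimal warmup sketch (JMH does this properly, with forks and
// statistics): iterate the workload until the JIT has had a chance to
// compile it at full optimization, then time the steady state.
class WarmupDemo {
    static long timeWarmed(Runnable work, int warmupIters, int measuredIters) {
        for (int i = 0; i < warmupIters; i++) work.run(); // let the JIT compile
        long t0 = System.nanoTime();
        for (int i = 0; i < measuredIters; i++) work.run();
        return System.nanoTime() - t0; // steady-state elapsed nanoseconds
    }
}
```

This is only a sketch: a serious harness also needs to defeat dead-code elimination and run in separate JVM forks, which is exactly what JMH automates.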


You haven't done a "warmed up" version of benchmarks game — so you don't know how much or how little difference it would make for those tiny tiny programs.

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...


For fast languages like C# and Java, many benchmarks finish in a few seconds. By their tests, the warmup adds ~0.3 seconds to these tiny programs. That's a long time: 10-20% in some benchmarks!

In a couple of benchmarks, Java would probably be faster than C++ if the JVM were allowed to warm up.


Updated for java 14.0.1: "Let's compare our fastest-of-6 no warmup measurements against the fastest-of-25 (or 55 or 175) with warmup JMH SampleTime p(0.0000) measurements"

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...


> In a couple benchmarks Java would probably be faster…

Please be specific.

JMH timing for that spectral-norm program was 0.175s to 0.283s faster than the 4.29s elapsed time (16s cpu time).

That's not 10-20%; it's 6.6%.

At best that might put the best Java spectral-norm program a little faster than the best Haskell program.

At best that might mean the best Java spectral-norm program took 2x (twice as long as) the best C++ program.

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

> … would probably be faster…

Please take those tiny tiny programs and JMH and make your own measurements — you might believe measurements you make yourself.


There is a JEP to support Java on Windows on AArch64.

http://openjdk.java.net/jeps/8248496


If you take the benchmarks game as gospel, Julia is faster than C# in aggregate.


I checked, and while I didn't do the maths, Julia only beat C# in 2 benchmarks; overall C# must be faster, even if those Julia benchmarks beat Java!


I don't understand the "and C#" part. C# has had stack allocation support (for primitives, structs, and fixed-size arrays) since the very first release.


You'll find the detail in https://github.com/microsoft/openjdk-proposals/blob/master/s...

This is about stack allocation of objects where safe (i.e. when you can prove they don't escape the local scope) for HotSpot, something that I think already exists in J9 (IBM's JVM).

There is equivalent work for .net core here: https://github.com/dotnet/runtime/blob/master/docs/design/co...

(So the title is somewhat accurate, but you have to do some digging)


> This document describes work to enable object stack allocation in .NET Core.

> In .NET instances of reference types are allocated on the garbage-collected heap.

> If the lifetime of an object is bounded by the lifetime of the allocating method, the allocation may be moved to the stack.

https://github.com/dotnet/runtime/blob/master/docs/design/co...


Cheney on the MTA and its various implementations (like Chicken Scheme [0]) have been allocating objects on the stack for ages. Would like to know how this compares.

[0] https://www.more-magic.net/posts/internals-gc.html


This is a bit different. My understanding (but I'm not an expert) is that Chicken uses the stack effectively as the thread allocation buffer, which is clever because it's already thread-local using an existing register, and it's already in cache. They then evacuate from the stack to the old generation (or a separate young generation? I'm not sure). This isn't what is being done in the work referenced in this article, as objects are known in this case to be long-term safe to allocate on the stack - it's not using it as an allocation buffer.

But my understanding of Chicken is limited - maybe you were asking rhetorically and you knew more?


Those are more like "allocating the stack in the GC nursery" than they are "allocating objects on the stack." They store the actual chain of function activation records as GC objects, and only "pop" the machine stack as part of running the GC.

C2 has two separate spaces in memory: one for the stack and another for the GC nursery (and the rest of the heap). It pops the machine stack when a Java function returns, and clears out the nursery as part of running the GC.

The difference is that the C2 approach to managing the stack (shared by C, etc.) loses some flexibility to gain performance. When the machine stack maps 1:1 with the call stack like that, objects allocated on the stack incur no GC cost: they are always freed automatically on function return simply by moving the stack pointer, never kept alive and promoted the way they can be in Chicken Scheme's approach.


I don't understand. We've always been able to allocate value types on the stack in C#.


Follow the links under the link. As I mentioned in my other comment, this is an optimization that enables (transparent) allocation of reference types on the stack. I.e. the compiler might be able to allocate the object that's created with 'new' on the stack.


What I understand is that this is stack-allocating reference types when the compiler can infer that it is safe to do so. Which I think is an optimization .NET Core doesn't yet do.

Also worth noting, I believe .NET Core doesn't necessarily stack-allocate value types either; a lot of conditions can make it unsafe, like a closed-over value type that could escape the local scope. So value types aren't always stack-allocated either, only when it is safe to do so.


> like a closed over value type that could escape the local scope

You mean, captured in closure?


I guess that's what they meant. However, that's a bit misleading, since the runtime doesn't care that it's a value type captured in a closure: it's already a class field in IL, and those are always allocated like their containing type.


Please fix the title; C# has had stack allocation since version 1.0 (value allocation, to be more precise).

The linked content is only about Java.


Stack allocation and value allocation are different. In C#, value types can be allocated on the stack, but reference types are always allocated on the heap. The OP describes an approach where reference types are stored on the stack.


But doesn't the 'stackalloc' keyword in C# let you intentionally allocate value types on the stack?


I'd say ref structs are more explicitly stack allocated since they're not allowed to be boxed. Stackalloc is used to allocate arrays on the stack (or rather a block of memory).



