
Allocation is cheap in .NET until it is not - matthewwarren
http://tooslowexception.com/allocation-is-cheap-until-it-is-not/
======
apardoe-MSFT
"Managed memory is free. Not as in free beer, as in free puppy."

The dev manager of Exchange used that line in a talk. Never were more insightful
words spoken. Devs will move from C++, where they obsess over every allocation,
to .NET, and they'll totally forget that allocation is expensive no matter what
the platform or runtime.

~~~
rjbwork
>they'll totally forget that allocation is expensive no matter what the
platform or runtime.

Well, it's easier to do in a managed language. When you literally don't have
to agonize or obsess over every allocation, because you aren't responsible for
cleaning it up (unmanaged resources notwithstanding), you tend not to.

P.S.: You're always free to drop down into C or C++ if you want to get some
speed, but of course you need to clean up after yourself there. A friend of
mine wrote a good guide on doing so, if anyone cares
[https://github.com/TheBlackCentipede/PlatformInvocationIndep...](https://github.com/TheBlackCentipede/PlatformInvocationIndepth/blob/master/main.pdf)

~~~
Nuzzerino
>You're always free to drop down into C or C++ if you want to get some speed

Wouldn't C# with structs and pointers do the job in many cases? I've been able
to get 50-fold increases in speed through heavy optimizations, without
switching to another language. Using C or C++ solely for a "speed boost" over
C# is not only unnecessary, but also creates more problems than it solves. If
you don't know how to optimize within C# (as a C# developer), how are you
going to succeed in writing efficient C++ code?

Once you learn the nuances and limitations of optimizing in C#, then you can
start looking into how and when other languages such as C can be used wisely.
To name an example, C makes it easier to micromanage assembly instructions (it
can be done in C# too, but not in a very practical way, and yes I mean assembly
and not IL). C also has more syntax and features suited to bitwise
micromanagement, whereas C# can be more awkward.
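
For illustration, here's roughly the kind of thing I mean by staying in C# (a
minimal, made-up sketch; it assumes a compiler and runtime with Span<T> and
stackalloc support): a struct plus stack-allocated storage keeps the hot path
free of heap allocations entirely.

    
        using System;

        struct Point            // value type: no object header, nothing for the GC to track
        {
            public double X, Y;
        }

        static class Hot
        {
            public static double SumDistances()
            {
                // Stack memory instead of a heap array: no allocation, no collection.
                Span<Point> points = stackalloc Point[64];
                for (int i = 0; i < points.Length; i++)
                    points[i] = new Point { X = i, Y = i * 2 };

                double total = 0;
                for (int i = 0; i < points.Length; i++)
                    total += Math.Sqrt(points[i].X * points[i].X + points[i].Y * points[i].Y);
                return total;
            }
        }
    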

~~~
pjmlp
Yes, they would, and the C# 7 improvements taken from the Midori experience
make it much better.

I think in general it is a culture problem.

Those of us who embraced managed languages, including for systems programming
(Oberon, D, ...), know that we can be productive 99% of the time and only have
to worry about speed-boost tricks for the remaining 1%, using a profiler and
low-level language tricks.

In the C and C++ communities there is a sub-culture of worrying too much, ahead
of time, about how much each line of code costs, thus spending too much time on
design decisions that actually have zero value in the context of the
application being delivered.

The problem is not taking those decisions, but rather taking them without
validating them with a profiler, or without regard to the goals that have to be
met for the application.

Beyond those goals, any low-level fine tuning, while fun, is needless
engineering.

~~~
sterlind
Midori was so beautiful. I think it would have succeeded as a .NET runtime
replacement with picoprocesses. It frustrates me that we didn't open-source
it.

~~~
pjmlp
As a believer in GC-enabled systems programming languages, I do feel it was
indeed a missed opportunity, especially to change the minds of those who think
C and C++ are the only way to write OSes.

------
pcwalton
This article demonstrates why generational GC with a bump-allocating nursery
is so important. Without a semispace copying collector (which is usually
impractical without a generational GC) you can't have bump allocation _at
all_. Not having that fast path is a huge performance loss, as this article
shows.
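
For the unfamiliar, that fast path really is just a pointer bump; a rough
sketch (not any runtime's actual code) looks something like this:

    
        // Toy bump allocator over a nursery: allocation is a copy, a compare and an add.
        unsafe struct Nursery
        {
            byte* _next;   // next free byte in the nursery
            byte* _limit;  // end of the nursery

            public byte* TryAllocate(int size)
            {
                byte* result = _next;
                if (result + size > _limit)
                    return null;          // nursery exhausted: fall back to a minor collection
                _next = result + size;    // "bump" past the newly allocated object
                return result;
            }
        }
    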

~~~
ridiculous_fish
Mark-sweep-compact is another GC algorithm that supports bump allocation, and
doesn't have the 2x overhead of semispace.

~~~
dfox
The point is that you need a moving GC for bump allocation to be possible.
Traditional mark-and-sweep is non-moving, while a semispace collector is the
simplest moving GC to describe. The practical takeaway from all that is that
the usual generational GC constructions, with semispace minor collections and
mark-and-sweep/compact major collections, are in the complexity/efficiency
sweet spot.

By the way, it is possible and trivial, although somewhat non-obvious, to
construct a non-moving generational GC. The general idea is that for simple M&S
you have some list of live objects, which you replace with multiple per-
generation lists (Claus Gittinger mentions this in his VM design talks, which
probably means that at some point such a GC was used by ST/X).
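
Very roughly, the idea looks like this (a toy sketch of my own, not ST/X's
code): objects never move, only their bookkeeping entries move between
per-generation lists, and a minor collection only walks the young list.

    
        using System;
        using System.Collections.Generic;

        class NonMovingGenHeap
        {
            readonly List<object> _young = new List<object>();
            readonly List<object> _old = new List<object>();

            public void Register(object obj) => _young.Add(obj);

            // Minor collection: only the young list is walked; survivors are
            // promoted by moving the bookkeeping entry, not the object itself.
            public void MinorCollect(Func<object, bool> isLive)
            {
                foreach (var obj in _young)
                    if (isLive(obj))
                        _old.Add(obj);
                    // dead objects would be handed back to the allocator here
                _young.Clear();
            }
        }
    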

~~~
ridiculous_fish
Moving collectors get you the best allocation throughput but impose other
costs, which are hard to measure because they are design constraints. Obviously
you cannot have a moving conservative collector, so you must have stack maps,
safe points, etc.

Or consider interactions with native code. How can native code hold a reference
to a potentially movable object? .NET allows pinned pointers (obviously hurting
compaction efficiency), while JNI uses double-dereferenced handles, of which
there's a limit (65k on Android!). Compare to, say, JavaScriptCore, which uses
a non-moving collector and simply conservatively scans native stacks.

Whether these costs matter depends on your use case, but it's important to
remember that we're rarely building isolated systems.

> non-moving generational GC

Yes, that's what Apple built for its ill-fated experiment with GC! Amusingly,
Apple also built its inverse: a moving manual memory manager. Google
MoreMasters for some retro fun!

~~~
dfox
The design space has one non-obvious but fundamental, strict dividing line: if
you can constrain the mutator code enough to be able to insert write barriers
for generational GC, you can also constrain it enough to have all the metadata
to support a moving GC, at least to the extent of opportunistic compaction
(e.g. what the CLR and SBCL on i386 do, both of which have conservatively
scanned stacks, because building stack maps for register-constrained platforms
like i386 is essentially impossible).

On the other hand, BEAM has, AFAIK, a non-moving generational GC implemented in
the aforementioned way, which in this case is trivial, as you don't need any
write barriers or a remembered set when heap objects are inherently immutable.
(It is somewhat relevant here that Boehm GC for some time had an API that
allowed you to signal to the GC the time extent of mutability of a given heap
object; AFAIK it has been a no-op since some 6.x version, and the
concurrent/incremental/generational bits of it work on the basis of mprotect()
and handling SIGSEGV.)

~~~
ridiculous_fish
Aside from the stack, another key challenge for moving GCs is hash tables
keyed on object identity. If the object can move, the raw address is no longer
a suitable hash.

.NET at one point stored an extra per-object word, which was either (via its
LSB) a random hash code or a pointer to a metadata object that held the lock,
etc.
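
You can observe that identity hash from C# via RuntimeHelpers.GetHashCode,
which stays stable even across a compacting collection (a small sketch, not the
runtime's internals):

    
        using System;
        using System.Runtime.CompilerServices;

        class IdentityHashDemo
        {
            static void Main()
            {
                var obj = new object();
                int before = RuntimeHelpers.GetHashCode(obj);  // identity hash, not a user override

                // Force a full compacting collection; the object may move in memory...
                GC.Collect(2, GCCollectionMode.Forced, blocking: true, compacting: true);

                // ...but the identity hash has to survive the move.
                int after = RuntimeHelpers.GetHashCode(obj);
                Console.WriteLine(before == after);            // True
            }
        }
    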

Python did this cute thing where moved objects would get an extra word
allocated to store the moved-from address, which was where the hash was
stored.

Apple's plan for its (not-released) ObjC moving GC was to directly teach the
GC about hash tables, so it could rehash on collection (!).

Do you know how other implementations handle this?

BEAM's GC is indeed a marvel both because of the immutability and also the
enforced process isolation, so that each process can do its own GC
independently. As you say, constrain the mutator...

~~~
dfox
CPython has a non-moving GC, so this is not an issue. On the other hand,
CPython's hashmap implementation is probably the most educational open-source
hashmap implementation you can find, because it is full of wonderful and
portable performance hacks (for one thing, the hash values of CPython's objects
are intentionally computed such that the distribution is non-uniform, which
allows tuning the hashmap implementation for common cases).

As for tight coupling between a moving GC and the hashmap implementation, there
is another reason why you want to do that: weak pointers, and hashmaps that are
either weak-keyed or weak-valued, both of which are useful for the VM
implementation itself (symbol tables, method dispatch caches, identity maps for
FFI...).

~~~
mrgriffin
> CPython has a non-moving GC, so this is not an issue.

I believe the GP was referring to PyPy.

From [https://morepypy.blogspot.co.uk/2009/10/gc-improvements.html](https://morepypy.blogspot.co.uk/2009/10/gc-improvements.html)

> The hash field is not necessarily there; it is only present in classes whose
> hash is ever taken in the RPython program (which includes being keys in a
> dictionary). It is an "identity hash": it works like object.__hash__() in
> Python, but it cannot just be the address of the object in case of a GC that
> moves objects around.

------
narag
This brought back memories of that pattern (flyweight?) where the data was
stored outside the objects, possibly in an array. An object was instantiated
only to hold an index into the array and allow access. That's dirt cheap!
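
Something along these lines (a made-up minimal sketch, not any particular
library): the bulk data sits in plain arrays, and the "object" is just a tiny
struct holding an index.

    
        // All particle data lives in parallel arrays; ParticleHandle only carries an index.
        class ParticleStore
        {
            public float[] X = new float[10000];
            public float[] Y = new float[10000];
        }

        struct ParticleHandle
        {
            readonly ParticleStore _store;
            readonly int _index;

            public ParticleHandle(ParticleStore store, int index)
            {
                _store = store;
                _index = index;
            }

            public float X => _store.X[_index];
            public float Y => _store.Y[_index];
        }
    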

~~~
manigandham
It's still commonly used. The NFX library has Pile which does this well for
holding large data:

[http://nfxlib.com/book/caching/pile.html](http://nfxlib.com/book/caching/pile.html)

[https://www.infoq.com/articles/Big-Memory-Part-1](https://www.infoq.com/articles/Big-Memory-Part-1)

~~~
Shoothe
This pattern was also used by Java and .NET for implementing cheap
String.substring calls, where all substrings would use the same underlying
array with just the offsets changed. Unfortunately, it turns out that people
read entire files into one big String and then keep a reference to just a small
piece of it (via substring), marking the big underlying array as reachable for
the GC and holding a lot of memory for no reason. That's why new
implementations of substring always copy :)
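
If you do want the old sharing behavior today, you have to ask for it
explicitly, e.g. by slicing a memory view over the string instead of calling
Substring (a hedged sketch, assuming a .NET with ReadOnlyMemory<char>
available):

    
        using System;

        class SubstringDemo
        {
            static void Main()
            {
                string big = new string('x', 1000000);

                // Substring copies the characters into a brand-new string, so the
                // big string can be collected once nothing else references it.
                string copy = big.Substring(0, 10);

                // Slicing a ReadOnlyMemory<char> shares the original buffer instead;
                // holding on to 'slice' keeps the whole million-char string alive,
                // which is exactly the leak the old shared-substring design caused.
                ReadOnlyMemory<char> slice = big.AsMemory(0, 10);

                Console.WriteLine(copy.Length);   // 10
                Console.WriteLine(slice.Length);  // 10
            }
        }
    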

~~~
kbsletten
I know this was changed recently-ish in Java, but I hadn't heard of anybody
doing the old substring trick in .NET. Do you know when they cut over?
~~~
matthewwarren
AFAIK this trick was never possible in .NET, because Substring always 'deep
copied' the relevant data into a completely new string.

------
ridiculous_fish
How does .NET support pinned pointers with a bump pointer allocator? Does it
just eagerly move pinned objects out of the contiguous heap?

~~~
kevingadd
Pinning typically just means it is left in place and exempted from compaction.
This does mean that you can end up with a performance penalty and nasty holes
in your heap layout. Sometimes marshaling code will opt to make a copy of the
data instead (and then perhaps pin that), it depends on the type. There's not
a lot of explicit documentation on this (probably because some of it is an
optimization). Pinned objects can't be moved without breaking semantics - once
you get a pinned-type GCHandle to an object, you can just directly get the
address and it won't ever change. (I believe once the GCHandle is
freed/finalized by the GC, it will automatically unpin the object.)

Typically this isn't a big problem - pinned data structures in .NET code are
either pinned for short periods of time (to pass to native code), or are
reusable large big buffers that stay pinned forever. Large buffers are always
allocated in the large object heap right away. You can always allocate native
memory directly in which case the GC doesn't care about it.
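
Roughly, the two pinning idioms look like this (a hedged sketch; the actual
native call is omitted):

    
        using System;
        using System.Runtime.InteropServices;

        class PinningSketch
        {
            static unsafe void Demo()
            {
                byte[] buffer = new byte[256];

                // Short-lived pin: the array can't move for the duration of the
                // fixed block - the usual pattern around a P/Invoke call.
                fixed (byte* p = buffer)
                {
                    // pass p to native code here
                }

                // Long-lived pin: the array stays put (leaving a hole the compactor
                // has to work around) until the handle is explicitly freed.
                GCHandle handle = GCHandle.Alloc(buffer, GCHandleType.Pinned);
                try
                {
                    IntPtr address = handle.AddrOfPinnedObject(); // stable while pinned
                    Console.WriteLine(address);
                }
                finally
                {
                    handle.Free(); // the pin ends here, explicitly
                }
            }
        }
    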

This may be changing since recent updates to C# and the runtime have
introduced the concept of interior pointers to objects, where you can have a
raw pointer to a field within a GCable object. Right now those are constrained
to living on the stack only, so the period of time in which the object can't
be moved/compacted as a result is relatively short.

~~~
johncolanduoni
> (I believe once the GCHandle is freed/finalized by the GC, it will
> automatically unpin the object.)

GCHandle is a struct so you have to explicitly call GCHandle.Free().

~~~
kevingadd
Makes sense. I make a point of calling Free but it wasn't clear to me whether
the pin was attached to the object reference (since the handle contains a
reference).

------
alkonaut
Is there work ongoing or planned to try to add/improve escape analysis, as the
article suggests?

~~~
kkokosa
In general - yes:
[https://github.com/dotnet/coreclr/issues/1784](https://github.com/dotnet/coreclr/issues/1784).
However, I can't say anything more specific...

~~~
alkonaut
I realize it's a pretty hard problem - and had Java not already demonstrated
the feasibility of it, I would have doubted it was possible at all without
major surgery to both the language and the runtime (special scoped types,
etc.). So I guess my question is: is there something about C# or .NET that
makes escape analysis much harder to do than it is in the Java world?

An evil example is

    
    
        class Something
        {
           // assigning to a static field makes 'this' escape the constructor
           private static Something _inst;
           public Something()
           {
              _inst = this;
           }
        }
    

Where the reference leaks just by instantiating it. Does Java detect that this
escaped the stack? How?

~~~
matthewwarren
This post has a nice investigation into 'Escape Analysis' in Java:
[https://shipilev.net/jvm-anatomy-park/18-scalar-replacement/](https://shipilev.net/jvm-anatomy-park/18-scalar-replacement/)

It shows that HotSpot doesn't handle it in all scenarios:

> But, EA is not ideal: if we cannot statically determine the object is not
> escaping, we have to assume it does. Complicated control flow may bail
> earlier. Calling non-inlined — and thus opaque for current analysis —
> instance method bails. Doing some things that rely on object identity bail,
> although trivial things like reference comparison with non-escaping objects
> gets folded efficiently.

> This is not an ideal optimization, but when it works, it works magnificently
> well. Further improvements in compiler technology might widen the number of
> cases where EA works well.

~~~
alkonaut
Thanks, that clears some of it up. It seems that Java runtimes that do EA
actually _do_ this sort of crazy difficult analysis that quickly breaks down
with branches and non-inlined code.

~~~
jcdavis
FWIW, Graal has partial escape analysis, which can avoid a lot of those
pitfalls by allowing allocations to escape through just some of the branches.
For instance, if you have:

    
    
        X thing = new ...;
        if (slowpath) {
            unlikely_function(thing);
        }
        ...
    

Even if unlikely_function isn't inlined, it can still perform scalar
replacement and push the allocation site into the branch (reconstructing the
state of the allocated object as it would have been at that point), which is a
big improvement.

This in turn lets the inliner be smarter about what it does and doesn't
inline, vs. C2, which tries to greedily inline everything, partly to assist
escape analysis.

