

D’s Garbage Collector Problem - copx
http://pointersgonewild.wordpress.com/2014/09/09/ds-garbage-collector-problem/

======
Locke1689
I'm not an expert on GC, but I do work on a compiler written in a GC'd
language that cares quite a bit about performance. I can share a few
guidelines that may prove useful to the author or anyone else having similar
trouble.

1) Profile. This does not mean gather a few charts from a primitive profiling
tool that simply tells you which functions have the most time spent in them.
You need a real profiling tool that gives you detailed analysis of where your
time is going. You need CPU stacks, time blocked, GC stacks -- everything you
can get your hands on. Your profile needs to tell you where, what, and how
much you're allocating. It needs to tell you what kind of allocations you are
doing and how long they are living.

2) Once you have your information, you need to figure out what you can do to
improve your performance. To do this, you need to know the characteristics of
your program and all the platforms it runs on. Is your code being JITed? How
long does your program spend in the JIT? Can you mitigate this by doing AOT
compilation, optimization profiles, or producing more JIT-friendly code? For
GC, what kind of GC are you running on? Is it conservative -- do you have to
worry about memory leaks? Is it stop-the-world -- do you have to worry about
pauses? Is it generational -- does the behavior change at inflection points?
Do you have to adapt your allocations to the sections of your program's run
time? Does allocation in one portion affect allocations in another portion,
e.g. will your allocations produce extra memory fragmentation when running on
a non-moving GC?

3) You need to own your code and its performance. If you're allocating in a
hot loop and the GC can't keep up, you need to rewrite that section to not
allocate. If you're fragmenting the heap in non-critical paths, you need to
stop allocating there, as well. If you're constantly recreating re-usable
ephemeral objects, you need to look at object pooling. If you need to refactor
sections of your code to avoid allocations and stop-the-world collections,
_you need to do that._ Blaming the GC will not make your application faster.
If you need to break out a debugger and jump into the GC to see what's taking
so much time, that's what you need to do.
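For the object-pooling case, a minimal free-list sketch in D (the language the post is about; `Node` and `NodePool` are illustrative names, not types from any real codebase):

```d
// A free-list pool: release() recycles objects instead of leaving garbage,
// so hot paths stop feeding the GC ephemeral allocations.
class Node { int value; Node next; }

struct NodePool
{
    Node freeList;

    Node acquire()
    {
        if (freeList is null)
            return new Node;       // fall back to the GC only when empty
        auto n = freeList;
        freeList = n.next;
        n.next = null;
        return n;
    }

    void release(Node n)
    {
        n.next = freeList;         // push back onto the free list
        freeList = n;
    }
}

void main()
{
    NodePool pool;
    auto a = pool.acquire();
    pool.release(a);
    auto b = pool.acquire();
    assert(a is b);                // the object was recycled, not reallocated
}
```

The trade-off is that pooled objects must be reset by hand, which is exactly the "own your code and its performance" cost described above.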

4) After you've optimized, re-evaluate. Are there actionable items you can
hand to the GC devs? Can you point out general points where GC strategy has a
known bad result and a known better implementation? File these on the GC.
Maybe you'll be able to delete some optimizations in future versions. :)

The first version of Roslyn was 30x slower than the old C# compiler written in
C++. We're now within 2x on 10-year-old hardware and beating it on newer. Perf
optimization works -- but not all programming is a stroll through a park. :)

~~~
copx
> Your profile needs to tell you where, what, and how much you're allocating.

Which languages offer such advanced profilers, though? I am using Lua (a GCed
language) right now, and I am not aware of any Lua profiler which profiles
allocation/the GC. I think D's profiler does not feature that either.

In fact I have only ever heard of Java having such advanced profiling tools,
and I guess Microsoft built the same thing for their version of Java (.NET).
But again, which other languages have tools which allow you to profile
allocation/GC behavior?
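(For D specifically there is at least some raw data to build on: newer druntime releases expose counters through core.memory.GC.stats, and a --DRT-gcopt=profile:1 runtime switch that prints GC totals at exit. A hedged sketch, assuming a recent druntime; field names may vary by release:)

```d
// Hedged sketch: query the GC's own counters before and after an
// allocation-heavy step to see how much heap it consumed.
import core.memory : GC;
import std.stdio : writeln;

void main()
{
    auto before = GC.stats;
    auto data = new int[](100_000);      // a deliberately large allocation
    auto after = GC.stats;
    writeln("GC heap in use grew by roughly ",
            after.usedSize - before.usedSize, " bytes");
}
```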

~~~
pjmlp
I am on mobile right now, but I would say Oberon (a bit basic), Haskell, Go,
OCaml (not sure), JavaScript and all commercial implementations of Smalltalk
and Common Lisp.

Ah, regarding .NET there are a few commercial profilers out there besides what
Microsoft offers. Like the JVM, the CLR exposes an API for such kind of
tooling.

------
acqq
It's OK to point at the potential problem. It can be that the GC in D can be
improved.

But for this specific case: whenever I want to make a fast compiler (and I've
made two in my career) I wouldn't set my goal to produce a gazillion objects
and expect some fairy to make it fast. I'd take care to use structures where
one allocation produces a lot of stuff, like arrays. You probably know:
addressing elements of an array in machine code is very fast; there are
instructions that do that very efficiently. So don't be afraid to use arrays,
and use indexes instead of pointers. Your code won't be much less readable,
but you can achieve a significant speedup, provided you don't also have
algorithms that slow down exponentially (and there's some chance you have
those too).

One of the benefits of arrays is that you're also deallocating a lot of stuff
at once, which also makes the whole process much faster. And whenever you feel
that you're not programming easily enough, think about the first Fortran
compiler, produced in the fifties: it generated quite good machine code,
almost as fast as the hand-coded version, and it had only a few _kilobytes_
of RAM to produce
such a result.
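A sketch of the arrays-and-indexes idea, with invented names (this is not Higgs code): tree nodes kept in one growable array and linked by index, so the whole tree is one allocation and one deallocation.

```d
// Nodes live in one contiguous array; children are referenced by index,
// not by pointer.
struct Node
{
    int kind;
    size_t left, right;   // indices into Ast.nodes
}

struct Ast
{
    Node[] nodes;

    size_t add(int kind, size_t left = 0, size_t right = 0)
    {
        nodes ~= Node(kind, left, right);
        return nodes.length - 1;        // index of the new node
    }
}

void main()
{
    Ast ast;
    auto l = ast.add(1);                // two leaves, then a parent
    auto r = ast.add(2);
    auto root = ast.add(3, l, r);
    assert(ast.nodes[root].left == l && ast.nodes[root].right == r);
    ast.nodes = null;                   // dropping the whole tree at once
}
```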

~~~
oelang
"it generated quite good machine code, almost as fast as the hand coded one"

I'm sorry, but this is complete bullshit. What is expected of a compiler these
days (SSA, advanced instruction scheduling, smart inlining decisions) is not
comparable with what the first optimizing compilers did.

~~~
acqq
If you want a more modern comparison between what the article complains about
and what's possible in a language with a GC, then look at this: Fabrice
Bellard implemented a whole x86 emulator in JavaScript, and with it Linux
boots (measured on my machine and in my browser) in three seconds. The Linux
kernel it boots is approximately 2 MB.

[http://bellard.org/jslinux/](http://bellard.org/jslinux/)

Whereas Maxime Chevalier-Boisvert (the author of the article we are all
commenting on) compiles

    
    
      a = 0;
      a;
      a;
      … // (a is used 60000 times)
      a;
    

in four seconds with her D code. A program with no loops and a single
variable, and just under 200 KB of input.

~~~
tachyonbeam
That is not a fair comparison at all.

1. An x86 emulator doesn't need to allocate much. It's probably an
interpreter. It can parse x86 instructions straight from a byte vector, and
use another byte vector for the RAM.

2. V8 and Firefox have much better garbage collectors than D, because they
need their GC to be fast, and so they've made sure it is.

3. Higgs is a much more complex piece of software than this.

4. The 4-second time, as I pointed out, is at least 75% GC (but possibly
more). That's the whole point of the blog post. It shouldn't be possible for
the GC to take most of the execution time.

~~~
acqq
How many allocations do you actually make to compile a single variable used
60000 times? I still claim you don't have to allocate much for that
particular input, as long as you use vectors/arrays, the ones you also
recognize Bellard probably uses and which I mentioned in my top comment as
the right way to make your code faster. It's really as simple as that.

~~~
tachyonbeam
This isn't C, it's JavaScript.

The semantics of a global variable read are more complicated than you think.
Global variables reside on the global object, which can grow dynamically.
Global properties can also be deleted at any time. Accessing a property which
doesn't exist should throw an exception. Properties could also be getter
functions, which would then result in a function call for every property read.

Have you written a JS compiler? Have you written a JIT compiler? Any kind of
compiler for a real-world language? A garbage collector?

If someone knows of a way to get allocation statistics, then I can answer the
question of how many allocations occur.

~~~
acqq
> Global variables reside on the global object, which can grow dynamically.

But in the example you give there's only one single variable.

> Global properties can also be deleted at any time.

Also doesn't happen in your example.

> Accessing a property which doesn't exist should throw an exception.

I agree.

> Properties could also be getter functions, which would then result in a
> function call for every property read.

But they aren't in your example.

And all these aspects result in a proportional increase in the generated
code, not in some unexpected complexity.

> If someone knows of a way to get allocation statistics

Run a Perl script (or a Python script, or do it by hand) that adds an
increment of a global nAllocations variable in front of every statement where
you do a "new" of something. (If most of the allocations are implicit, this
counting may have to be added to D's runtime, if it doesn't exist already. At
least the runtime is open source, AFAIK.) Run the program. Print that number.
My guess is that it must be on the order of six digits (up to millions) for
the performance you measure. That doesn't seem necessary to process 60000
evaluations of a single variable, if more arrays of fixed-size things were
used in the process. I know for you it's a "doctor, it hurts when I press
here" / "well, don't press there" moment, but really, that's how it is. A lot
of small objects allocated separately makes a lot of work for a GC, in every
language. The question is whether the language gives you the possibility to
do it more efficiently. D does: structs and arrays.

Yes, my bias is the bias of a C programmer, but there's a reason why
well-written C code is fast: the number of allocations is small, and most of
the stuff is preallocated before it's used. D lets you do the same.

Note: changing the memory representation of the "objects" you use doesn't
mean you have to change your existing algorithms. It's just how the stuff is
stored in memory. The changes to your code then aren't fundamentally
significant, but you'll certainly get the initial speedup you need.
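What that counting could look like in D (a sketch; `tracked` and `nAllocations` are made-up names, and real instrumentation would have to live in druntime to catch implicit allocations):

```d
// Route explicit allocation sites through a helper that bumps a
// global counter, so you can print the total at the end of a run.
__gshared size_t nAllocations;

T tracked(T, Args...)(Args args) if (is(T == class))
{
    ++nAllocations;
    return new T(args);
}

class Thing { int x; this(int n) { x = n; } }

void main()
{
    foreach (i; 0 .. 1000)
        tracked!Thing(i);       // instead of `new Thing(i)`
    assert(nAllocations == 1000);
}
```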

~~~
mendort
I don't think anyone would dispute that the compiler in question is far from
optimal for compiling this particular program.

However, if you are writing a general purpose compiler, you need to
accommodate any program. Each of the properties (e.g. no use of getters, only
one variable) that you point out about this program would take a compiler pass
to establish.

I haven't thought too much about it, but some of the properties are very
likely to be undecidable by static analysis for many programs.

It seems likely that using time to determine whether the program being
compiled has these properties would not be efficient for most real world
programs, even if it did provide enough information to choose a more optimal
compilation strategy in some cases.

~~~
tachyonbeam
Indeed, static analysis is generally rather costly. It's usually a fixed-point
computation that iterates over the code multiple times. You generally need to
allocate many objects to keep track of facts the analysis produces about
different instructions, which need to be propagated around. Running a
general-purpose static analysis on this test program would probably be more
expensive
than what I'm already doing.

In the case of my minimalist test program, it's easy to say "well, ha! that's
obviously not an accessor!", but in the general case, it's much harder.
Consider, for example, that there could be load() or eval() statements in the
code. That means invisible code you haven't seen yet, and can't analyze. Then
consider a much more trivial case where you find this somewhere in the code:

    function foo(o) { delete o.a; }

Now, you know that "o" is very unlikely to be the global object... But can you
prove that with absolute certainty? It's possible you can't, so then you must
assume that maybe the global variable "a" doesn't exist, and accesses to it
might throw an exception.

TL;DR: compiler design is no easy task.

------
barrkel
I had a 3 minute look at the code in the github repo.

The parser uses ~= (D's array concatenation operator) a lot. This always
copies and will lead to quadratic performance for parsing e.g. blocks of
statements. Similarly, the switch parser uses code like this:

    
    
        curStmts.length += 1;
        (*curStmts)[curStmts.length-1] = parseStmt(input);
    

This is similarly quadratic - it incrementally grows the statement array.

Look at algorithm performance first.
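For what it's worth, one way to sidestep both patterns is std.array's appender, which reserves capacity geometrically and appends in place (a sketch with ints, not a patch against the repo):

```d
// Build a 60,000-element array with amortized O(1) appends instead of
// growing it one slot at a time.
import std.array : appender;

void main()
{
    auto buf = appender!(int[])();
    foreach (i; 0 .. 60_000)
        buf.put(i);                  // geometric growth, no quadratic copying
    int[] result = buf.data;
    assert(result.length == 60_000);
}
```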

~~~
edmccard
>This always copies and will lead to quadratic performance

In D, "a ~ b" always copies, but "a ~= b" might not. Also, appending with ~=
grows the array geometrically, so the amortized performance will be better
than quadratic.

Doing "x.length += 1", however, grows the array incrementally and will have
poor performance.

(See [http://dlang.org/arrays.html](http://dlang.org/arrays.html) and
[http://forum.dlang.org/post/op.wuic5nateav7ka@stevens-macbook-pro.local](http://forum.dlang.org/post/op.wuic5nateav7ka@stevens-macbook-pro.local))

------
mostly_harmless
If it's a GC thing, you should also try using various D compilers to compare
performance.

There is dmd (the official D compiler), ldc (the LLVM D compiler), and gdc
(the GNU D compiler).

When I was messing around with D about a year ago, dmd compiled code the
fastest and gdc by far the slowest, but gdc produced the most performant
code. ldc was middle ground for both compile speed and program performance. Perhaps
the GC issue you are having is just a quirk of the current compiler you are
using.

~~~
qznc
The garbage collector is shared by all three projects, so you will not get a
better GC this way.

Maybe ldc/gdc can optimize a few allocations away compared to dmd, but I do
not believe that makes much difference for compiler code like the Higgs code
discussed here.

------
sphink
[SpiderMonkey dev here. I remember the talk you gave at Mozilla, quite a good
talk. I work on the GC much more than the compiler, which I don't know that
much about, but I'm only going to talk about the compiler. Ironic, given that
you were talking about the GC.]

What's your actual goal here? You gave an example of a pathological function
that takes a long time to compile. Does that matter?

We care a lot about compilation time. It matters a lot for Web workloads, and
distinguishes JS compilers from AOT compilers. But why do you care, especially
in this particular case? Do you have "real" code that is hitting the same
bottlenecks? Because if not, you have two easy solutions to this case: first,
leave everything as is. In a production system, you can be doing this sort of
compile in the background, and only switch over to it from the fast-start
interpreter when it's done. The second straightforward option is to just give
up. Fail to compile if things get too big or slow, and let it run in the
interpreter (or some other lower tier.)

If this _does_ resemble a realistic scenario, then we can get a little more
nuanced. We face the same problem in SpiderMonkey -- big functions can take
too long to compile and/or will blow out memory (which, honestly, is the
bigger concern. More on that later.) In JaegerMonkey days, we had "chunked"
compilation, where it artificially broke big functions down into pieces and
compiled them separately, essentially spilling all state between them. I know
we've talked about doing the same for IonMonkey; I'm not sure if it has been
implemented. In particular, asm.js hits this a lot because autogenerated
functions can easily get enormous. Luke has definitely talked a lot about
chunking asm.js compilation; I don't know if he ever got around to
implementing it yet. [You remember Luke; you sat behind him for a while in the
Mountain View office.]

Or how about doing it at a finer granularity? We have a number of places where
we start out trying to do the smart and clever thing, but when things get out
of hand we abort and fall back on something stupid but fast and small. Here,
that might mean being very pessimistic about your live ranges and
pessimistically chopping them up into small pieces. Could you keep your 2d
matrix, but cap its dimensions? You only track up to K variables at a time,
and if you exceed K you spill one and reuse its slot for the new one.
Obviously, your phi nodes (join points) would need to agree on a set of
variables, so maybe this gets too complicated or just doesn't work.

But anyway, I'm still wondering what you're after. You have a research
compiler. There is no need for it to be better than a production compiler in
all ways. Pick something it _is_ better at, such as producing really fast
code. The learnings from that will be more likely to be useful in a production
system anyway, because all current leading JS compilers are multi-tier -- you
have a fast-start interpreter, usually a still pretty quick-start baseline
compiler, and then an optimizing compiler. The distribution of static code
across those is something like 90%/9%/1% -- the majority of downloaded code
is executed at most once, so there's no point in spending time optimizing it.
But that means the optimizing compiler is free to spend some
time thinking about things. If your compiler is _really_ slow, just claim it's
meant for a 4th tier that only kicks in for really hot code. :) And it can
compile in the background.

Unless it burns too much memory. That won't stay isolated from the execution
thread. It'll pollute caches, hog the memory bus, and make fun of your mother.
So I'd worry more about capping memory usage to a reasonable level than
worrying about compile time, even if it means giving up or dropping down to a
dumber algorithm when things start to get big.

One other effect of being a later tier is that you can stick to compiling
simpler programs. By the time it makes it to your tier, you'll have observed
the actual types and behavior of the function for a while, and so you can
assume that it'll continue to behave the same way. This allows you to ignore
many of the numerous quirks of the JS language. You don't need to handle
getters if your hot code never sees them (or rather, you only have to handle
them in the parts that do see them.) The global object probably won't get
mutated, and almost certainly won't have properties deleted. You will need
mechanisms to discard your compiled program if any of these things change, and
you'll have to be clever about noticing these sorts of mutations, but it's
great to be able to assume that nothing too crazy will happen. Even if you
have to guard all those assumptions.

The GC does matter for both time and space, though. An advantage of a
generational collector isn't just shorter pauses, it's also that you can use
simple bump allocation in the nursery since you never need to do partial
deletes. Also, if you're creating a lot of short-lived junk then it can reduce
your peak memory usage by quite a bit, which is important for a 3rd or 4th
tier background compiler.
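The bump-allocation idea can be sketched in a few lines (`BumpRegion` is an invented name; a real nursery also needs object scanning and promotion to an older generation):

```d
// Nursery-style bump allocation: each allocation is a bounds check plus a
// pointer increment, and "collecting" the region resets a single offset.
struct BumpRegion
{
    ubyte[] memory;
    size_t offset;

    void[] alloc(size_t size)
    {
        size = (size + 15) & ~cast(size_t) 15;   // 16-byte granularity
        if (offset + size > memory.length)
            return null;                         // full: time to collect
        auto block = memory[offset .. offset + size];
        offset += size;
        return block;
    }

    void reset() { offset = 0; }                 // reclaim everything at once
}

void main()
{
    auto region = BumpRegion(new ubyte[](1024));
    assert(region.alloc(10) !is null);
    assert(region.offset == 16);                 // rounded up to 16 bytes
    region.reset();
    assert(region.offset == 0);
}
```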

------
mike_hearn
It'd be interesting to compare this to code running on the JVM. It's not
enormously surprising that D's garbage collector can be a performance
bottleneck, given the incredible amount of effort and manpower that has gone
into making the Java GCs fast and scalable. It's just a really hard problem.

Does D expose any kind of GC stats? I'd imagine that with D a lot of heap
pressure could be removed by stack allocating as much as possible, as you'd do
in a C++ program. Part of the reason Java needs such beastly GCs is because it
generates lots of tiny short lived objects. Although HotSpot can stack
allocate these days if a function is hot enough, with D the programmer should
be able to do it manually.
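A trivial illustration of the choice D gives the programmer (made-up types): structs are values with stack allocation by default, while classes are references allocated on the GC heap.

```d
// Value vs. reference: the struct costs no GC allocation at all.
struct PointS { double x, y; }       // value type: lives on the stack
class  PointC { double x, y; }       // `new PointC` allocates on the GC heap

double lenSquared()
{
    auto p = PointS(3, 4);           // no heap traffic, no GC pressure
    return p.x * p.x + p.y * p.y;
}

void main()
{
    assert(lenSquared() == 25);
}
```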

~~~
qznc
D cannot copy Java's GC strategies. Being used for low-level work, D by design
cannot move objects.

~~~
WalterBright
This is incorrect. D can move objects, although the current GC does not do so.

------
dstein64
Here's separate "GC issue" that Walter Bright has mentioned:
[http://www.reddit.com/r/programming/comments/22jwcu/how_i_ca...](http://www.reddit.com/r/programming/comments/22jwcu/how_i_came_to_write_d/cgnk4dm)

------
tdees40
When I see things like that, I can't help but think that if you want
performance, you've gotta go C.

Otherwise, you get stuff like this, that's just impossible to reason about.

~~~
tachyonbeam
It is unfortunately difficult to reason about. For instance, I use D's
associative arrays in my program. I know that dynamic allocations must occur
inside them, but I don't know what algorithm is being used or how it
allocates memory.

------
stonemetal
I wonder if selectively disabling D's collector would help? Turn it off during
the allocation phase to prevent thrashing. The runtime would be forced to
allocate from the OS rather than try to collect. It could then be enabled and
run when there was known to be garbage to collect.
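In D that is directly expressible with druntime's core.memory (a sketch; `compilePhase` is a made-up name, and note that GC.disable only suppresses automatic collections -- allocations still come from the GC heap, which simply grows):

```d
import core.memory : GC;

void compilePhase()
{
    GC.disable();              // no automatic collections during this phase
    scope (exit) GC.enable();  // re-enable even if an exception is thrown
    // ... allocation-heavy work: the heap grows instead of thrashing ...
}

void main()
{
    compilePhase();
    GC.collect();              // collect once, at a point we choose
}
```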

Another option would be to fall back on pre-allocated object pools rather
than naively calling new. I know it is a fairly common optimization in
Java/C# when the GC is causing performance issues.

~~~
WalterBright
Unfortunately, there are no "fire and forget" performant memory management
schemes. For an application that does a lot of allocation, it pays huge
dividends to analyze the code to figure out where memory is being allocated,
and why.

Often, data structures can be redone to minimize allocation, allocations can
be moved to the stack, object pools can be used, etc. There is a long list of
techniques that can be applied to improve performance. I know that much of
the effort I put into optimizing code goes into dealing with memory
management issues, regardless of the implementation language.

------
PaulHoule
If you write programs that allocate memory dynamically it is no surprise that
memory allocation and garbage collection are a big deal.

------
tormeh
I program in Scala, and I wish we were able to delete objects manually and
then have the GC clean up the stuff we forget. That would be awesome,
enabling the perfect fusion of managed and unmanaged memory. In the meantime,
starting out with huge heaps and setting references to null after use seems
like the best we JVM folks can do.

Edited to specify I'm using Scala.

~~~
WalterBright
core.memory.GC.free(void* p) should do that for you.

~~~
bjourne
What happens if another object holds a reference to p? Bad things?

~~~
WalterBright
Yes, bad things, i.e. memory corruption.

Manual memory management means it's up to the programmer to get it right.

