
Assault by GC - lnmx
http://marcgravell.blogspot.com/2011/10/assault-by-gc.html
======
beagle3
... and everything old is new again.

The solution (which, unsurprisingly, works!) is how you did things in
Fortran-77 and APL. There were no iterators, so all you did was pass
around one big array that contained all the data, and indices into it.

In a couple of years, they will realize that even structs are inefficient, and
will go to parallel arrays - at which point the transition to Fortran/APL/J/K
will be complete.
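In C#, that old pattern maps directly onto the article's array-of-structs approach. A minimal sketch (the `Customer` fields and the `CustomerStore` type here are illustrative, not the article's actual code):

```csharp
// One big array of value-type records; "pointers" are just ints
// indexing into the same array, so the GC sees a single allocation
// with no internal references to trace.
public struct Customer
{
    public int Id;
    public int ReputationScore;
    public int NextCustomerIndex;  // an index, not an object reference
}

public class CustomerStore
{
    private readonly Customer[] customers = new Customer[100_000];

    public int IdAt(int index) => customers[index].Id;

    public void Set(int index, Customer value) => customers[index] = value;
}
```

Because `Customer` is a struct, the array's storage is one contiguous block; following `NextCustomerIndex` is an array lookup rather than a pointer chase the collector has to care about.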

~~~
Sodel
I always like reading about how people did things back when dinosaurs ruled
the data center. Do you know of any web pages that detail this sort of thing?

~~~
beagle3
No, unfortunately.

I'm close to 40, dinosaurish by HN/Reddit standards, but I learned Fortran
mostly because back in the late 80s and early 90s essentially all numeric
processing code was written either in Assembly or Fortran; and at a later
stage, because I was the only one proficient in Fortran, I inherited lots of
legacy code at work.

Most people my age, even if they've been programming since age 9, haven't seen
any Fortran unless they're doing physics or using LAPACK.

~~~
true_religion
As a 20-something who has been unfortunately intimate with Fortran, I can see
why people in the past would have been fanatical about C, and heck, even talked
about the virtues of diving into assembly rather than dealing with Fortran.

------
stcredzero
_GEN-2 is the final lurking place for all your long-lived data..., checking
GEN-2 is a “stop the world” event – it (usually briefly) pauses your code, and
does what it needs to. Now imagine you have a huge set of objects which will
never be available to collect_

For many years now, some Smalltalk VMs have had "permspace." You just send a
message to a long-lived object, and it's moved to a different part of memory
considered permanent, and the GC never looks at it. No structs. Just call a
function on the permanent objects. Since it's in a different space, this is
also very efficient for the VM, since identifying a permanent object is just a
comparison of addresses.

~~~
JoachimSchipper
Wouldn't the GC still need to check for references _from_ permspace objects?
(It's probably still a noticeable performance boost.)

~~~
eru
Your language could disallow references from permspace to normal space. Or
even make permspace objects immutable.

~~~
fpgeek
Or just track the (presumably few) mutations of permspace objects for special
handling. In that case, permspace effectively becomes an infinitely long-lived
generation that is never collected, only used as a source of roots.

~~~
stcredzero
Bingo. I should have written "GC only uses it as a source of roots."

------
sehugg
GC is a tricky problem. Even when you have a system that performs GC in the
background (as Java has for almost 10 years), you are still susceptible to
your heap becoming fragmented, and you run the risk of requiring a full GC to
compact the heap.

Alexey Ragozin's blog is great if you're interested in the murky details of
generational GC:
<http://aragozin.blogspot.com/2011/06/understanding-gc-pauses-in-jvm-hotspots.html>

~~~
eru
> you still are susceptible to your heap becoming fragmented and run the risk
> of requiring a full GC to compact the heap.

But isn't the heap fragmentation a much bigger problem with malloc style
memory management?

~~~
tedunangst
No, because you can allocate from different arenas if you need to. Using plain
malloc may lead to fragmentation, but there's no law requiring you to use a
single lowest common denominator allocator for your entire program.

~~~
fpgeek
That doesn't contradict his observation. Changing your program to use custom
memory allocators can solve fragmentation issues, but that's a much bigger
step than the changes required to avoid too many (or too expensive) full GCs
(i.e. tuning the garbage collector directly, getting "permanent" data out of
the garbage-collected heap, minimizing mutations so there's less to scan...).

~~~
tedunangst
The only times I've had difficulty solving a fragmentation problem were when
a GC was involved, and rewriting the code in an effort to trick some enormous
black box into doing what I wanted was a lot more difficult than just writing
code that did what I wanted in the first place. ymmv.

~~~
eru
I guess in theory fpgeek is right, and his great-grandparent post is in the
vein of what I thought of answering. But I guess in practice tedunangst is
more right at the moment.

I'll welcome better garbage collection.

------
maximilianburke
Good article! As his conclusions mention, it reminded me a lot of what XNA
developers were going through when it was first released in 2007.

In general, garbage collectors are sensitive to the volume of objects in the
heap and the number of live references, since the collector works by scanning
all live objects and following every reference they contain. If you can't
reduce the volume of memory you use explicitly, then you can take a stab at
reducing the memory used by various language features. Avoiding the boxing of
value types (as happens with a non-generic list of integers), avoiding
delegates/events/generators, specifying capacities up front with
System.Collections.Generic.List<T>, etc., can all go a long way toward
lowering overhead. Reducing references (by moving to storing indices) can be
fairly easy and a significant gain as well.

Disposing of IDisposable objects (via using-blocks) when you're done with them
rather than waiting for the collector to harvest and/or finalize them also
helps considerably.

I think in general more explicit memory usage patterns can benefit even when
using a managed language -- you know more about your program and where
performance matters than the garbage collector ever will.
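Two of those points sketched in code (the numbers are arbitrary, and the types are standard .NET BCL ones):

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

public static class AllocationPatterns
{
    public static int Sum()
    {
        // A non-generic ArrayList stores object references, so every int
        // added is boxed into its own heap object for the GC to track.
        var boxed = new ArrayList();
        boxed.Add(21);                        // allocates a box

        // A generic List<int> keeps the values inline in one backing array;
        // pre-sizing the capacity avoids repeated grow-and-copy churn.
        var unboxed = new List<int>(capacity: 10_000);
        unboxed.Add(21);                      // no per-element allocation

        int total = (int)boxed[0] + unboxed[0];

        // Dispose IDisposables deterministically rather than waiting for
        // the collector to finalize them.
        using (var writer = new System.IO.StringWriter())
        {
            writer.Write(total);
        }
        return total;
    }
}
```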

------
lucisferre
This solution seems extreme, given that the GC is just a reality of working
with managed languages, which is most of the web these days (.NET, Java, Ruby,
etc.). Many, many heavy-traffic websites solve these problems without
resorting to language magic tricks - usually with effective use of caching, or
by separating out the heavily accessed data in a way that is optimized for
those read operations.

This is the second StackExchange article where it feels like square peg,
round hole. The first was [their ORM](http://samsaffron.com/archive/2011/03/30/How+I+learned+to+stop+worrying+and+write+my+own+ORM),
which isn't really an ORM at all (perhaps "micro ORM" is fair) in the
traditional sense.

~~~
ajross
More often they do it just by using non-garbage-collected heap strategies.
PHP, Perl, Python and (I believe) Ruby all use reference-counted allocators
and don't exhibit the kind of latency explosions you see with GC algorithms.

~~~
kmontrose
Pure reference counting doesn't handle cycles. I believe those languages (and
I'm pretty certain Python does) implement a periodic mark-and-sweep GC for
those cycles.

So you still have periods of high-latency collections (in .NET only Gen2
collections are actually noticeable; not sure about other platforms). You also
have the overhead of managing reference counts, which can be non-trivial,
especially if you're paging as a result.

Of course, comparing GC performance between C# and Python or Ruby is sort of
silly. If we were trying to hit our ~40ms Question/Show render targets on them
we'd probably have lower hanging fruit than GC.

~~~
ajross
No, they just fiat "cycles" in as a possible memory leak against which the app
developer has to guard. That's not so bad, really: it's far, far easier than
manually freeing memory. And in any case there are all sorts of resource leaks
that GC can't find anyway; this is minor in comparison.

Your last point seems flat wrong, btw. If you're in a GC environment with
worst-case latency requirements (real requirements, not just nice-to-haves or
99.9th percentiles, or whatnot) of 40ms, then you're in a _whole world of
hurt_. GC won't do that -- brush up on your C.

~~~
dmpk2k
They're not mainstream (that I'm aware of), but there are realtime GCs that
can guarantee a pause of less than 1ms. Metronome is the one I'm familiar
with.

~~~
ajross
True enough, there are research systems that have been built with hard
realtime properties. I don't know much about them. So I should have qualified
it with "all popular GC-based interpreters" or the like.

------
dlss
Why not model GEN-2 GC frequency, then do the following:

1\. When a gen-2 is likely to occur (looks like 3 vertical lines on your
chart), tell the load balancer to stop forwarding requests.

2\. Let any existing connections finish.

3\. Run the GC manually.

4\. Ask the load balancer to start sending requests again.

This seems cleaner than marring what sounded like already complex code with
additional cognitive burdens.
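Steps 1 and 3 can be hooked with .NET's full-GC notification API (the MSDN page linked further down this thread). A sketch, where the load-balancer calls are hypothetical placeholders; note that `RegisterForFullGCNotification` throws if concurrent GC is enabled:

```csharp
using System;

public static class GcDrain
{
    // Step 3: run once the (hypothetical) load balancer has drained us.
    public static int CollectWhileDrained()
    {
        GC.Collect();                   // force the full collection now
        GC.WaitForPendingFinalizers();
        return GC.CollectionCount(2);   // Gen-2 collections so far
    }

    public static void Run()
    {
        try
        {
            // Step 1: ask the runtime to warn us when a full (Gen-2)
            // collection is approaching; the 1-99 thresholds are tuning knobs.
            GC.RegisterForFullGCNotification(10, 10);

            if (GC.WaitForFullGCApproach(100) == GCNotificationStatus.Succeeded)
            {
                // Steps 1-2, hypothetical: loadBalancer.Drain(); await requests.
                CollectWhileDrained();
                // Step 4, hypothetical: loadBalancer.Enable();
            }

            GC.CancelFullGCNotification();
        }
        catch (InvalidOperationException)
        {
            // Raised when concurrent GC is enabled (the default in many
            // configs) - notifications require it to be switched off.
        }
    }
}
```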

~~~
kmontrose
While this is possible using the profiling interfaces, it's not really much
simpler IMO.

What happens if more than 1 server needs to GC at the same time (or
overlapping times)? You definitely don't want to pull everything out of
rotation, so you'll need some coordination, and how do you determine how many
servers are allowed to be pulled out? We know we can limp along on about 1/3
of our servers, but it's not a stellar end user experience; at some point
you're not solving the problem, you're spreading the pain.

There are some minor annoyances this would introduce, too.

We use "server in/out of rotation" as an at-a-glance "things are getting ill"
check: if a web server has been pulled (due to slow responses to requests),
it's almost invariably something that needs fixing _right now_. We'd lose that
if pulling servers out of rotation were normal operating procedure.

This would also be a royal pain for local and dev-tier testing, since you've
basically made the load balancer and frequent GCs a prerequisite for thorough
testing.

Of course, this remains one of the options when dealing with GC stalls; I
just think what Marc went with here made more sense.

------
btmorex
I'm not familiar with .NET at all, but isn't there some way that you can move
a group of objects outside of the GC system and manually manage their
allocation and deallocation?

Seems like doing that would be far cleaner than what they actually did and it
directly addresses the problem.

~~~
JoeAltmaier
Then you have to manage their lifetimes 'manually'. Iterate on that, and
ultimately you remove GC altogether.

If GC is the future (and it seems it is) of language runtime, then some kind
of control will be needed for situations like this. Different pools, explicit
tagging of 'long-lived' or 'cache value' or some such during allocation?

~~~
jerf
"Different pools, explicit tagging of 'long-lived' or 'cache value' or some
such during allocation?"

The idea that immediately came to my mind is that given that the site is
probably load-balanced, the best thing to do would be to take the site out of
the load balancer while the big GC runs, then put it back in. I wonder if
there's some way to hook that GC event. So, "pools", but at a higher level.

~~~
tedunangst
There is. I'm surprised they didn't try it.

<http://msdn.microsoft.com/en-us/library/cc713687.aspx>

~~~
sams99
Really?

We register for notifications from the GC using a magic threshold number that
could mean anything.

Then we quickly notify the rest of the web servers over our "message bus"
that a GC is pending. They let us know whether they are safe from GC at the
moment; if they are not, you are in a pickle.

Then we notify HAProxy that we are about to run a GC and ask it to tell us
when it is done draining the current requests and taking our web offline.

Once notified, we perform the GC manually.

Then we notify HAProxy that we are done with the GC so it can add us back to
the pool.

What could possibly go wrong?

~~~
tedunangst
Or you could just tell HAProxy you're going to be low priority for a little
bit and not worry about whether every single last request gets processed on
the GCing server.

~~~
sams99
You do realise that only about 0.09% of our requests were affected by this;
catching them all is the whole point here.

------
Roboprog
As a developer working primarily in Java, I'm jealous that this is an option
for the .NET (C#) people. Not jealous enough to want to work on Windows, or
have to trust Mono not to get sued, but jealous, nonetheless.

Garbage collection is a wonderful option. It gets old when it's the only
hammer you have.

What a joy to be able to maintain a large cache in your program, rather than
having to rely on an external service. What a joy for virtual memory to
actually work, rather than constantly (every 2 or 3 seconds) swapping back in
something that wouldn't otherwise be used for a while, and even more
frequently thrashing on L1/L2 caches.

------
moomin
The most interesting aspect of this from my perspective was the importance of
structs in C#. Java doesn't have this construct, making it harder to solve
these problems in-memory.

~~~
robfig
What is the distinction between a struct and a class with all public members?

~~~
maximilianburke
As with C++, structs in C# can have private members. The big difference
between structs and classes, though, is that structs are value types. This
means that, for example, function-local usage of struct instances can forgo
allocating them on the heap and instead allocate them in the program's
activation records (stack frames).

Because of the semantic differences you can't just go switching all your
classes to structs and expect a significant performance win. Code like
this:

    
    
        class Foo
        {
            public int bar;
        }

        public static void Func(Foo[] arrayOfFoos)
        {
            Foo f = arrayOfFoos[0];  // with a class: a reference; with a struct: a copy
            f.bar = 10;              // as a struct, this writes only to the copy
        }
    

will break in subtle ways, because you are no longer manipulating the
heap-allocated instance via a reference but rather a copy that was allocated
local to the function.

Arrays of structs also have better cache locality which can improve
performance on number crunching workloads and can also make it easier to
exchange data between managed code and native code through a P/Invoke
interface.

------
alok-g
A big problem I faced with this approach is that value types in C# are
significantly restricted as compared to reference types.

See here for details:
<http://social.msdn.microsoft.com/Forums/en/csharplanguage/thread/47d10fcb-2d69-451a-bb97-023f1f9113f3>

They got lucky that their Customer type was simple enough.

------
mkilling
XNA doesn't even have a generational GC, which makes writing games feel like a
constant exercise in tricking the GC.

The GC's marking phase is very slow because it has to traverse the whole
object graph each time, meaning you want to make the object graph as simple as
possible. Arrays of structs are very efficient for the GC to scan (it can
either collect the whole array or not), but working with structs is a
nightmare because there's no polymorphism.
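A common workaround is to fold the would-be class hierarchy into a tag field plus a switch, trading polymorphism for one GC-transparent struct array. A sketch (all names here are illustrative):

```csharp
// Instead of Bullet/Enemy subclasses, one struct with a kind tag;
// the whole array is a single allocation for the GC to scan.
public enum EntityKind { Bullet, Enemy }

public struct Entity
{
    public EntityKind Kind;
    public float X, Y, VelX, VelY;
}

public static class World
{
    public static void Update(Entity[] entities, float dt)
    {
        for (int i = 0; i < entities.Length; i++)
        {
            // Switching on the tag replaces a virtual Update() call.
            switch (entities[i].Kind)
            {
                case EntityKind.Bullet:
                    entities[i].X += entities[i].VelX * dt;
                    entities[i].Y += entities[i].VelY * dt;
                    break;
                case EntityKind.Enemy:
                    entities[i].VelY -= 9.8f * dt;  // e.g. apply gravity
                    break;
            }
        }
    }
}
```

Note the `entities[i].X += ...` form: indexing the array in place avoids the copy-semantics trap described earlier in the thread.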

For Fluffy[1] we just used classes and tried to optimize the hotspots as an
afterthought. I'm not quite satisfied with how it turned out; there's still a
periodic 20-40ms lag in the game.

There is a great article about keeping heap complexity low:
<http://blogs.msdn.com/b/shawnhar/archive/2007/07/02/twin-paths-to-garbage-collector-nirvana.aspx>

[1] <http://marketplace.xbox.com/en-US/Product/Fluffy-Operation-Overkill/66acd000-77fe-1000-9115-d802585508be>

~~~
elisee
On Xbox 360, XNA comes with the .NET Compact Framework which indeed includes a
pared-down, non-generational GC (<http://msdn.microsoft.com/en-
us/library/bb203912.aspx>). (On Windows though you get a full .NET framework.)

As a developer for the Xbox 360, I found that your best bet is to:

\- monitor the amount of memory allocated (print out GC.GetTotalMemory(false))

\- allocate everything during loading and force a GC when you're done loading
stuff.

\- ensure you're not creating any garbage at all during your main loop, which
will prevent the GC from kicking in at all.
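Those three tips sketched in code, under the assumption that "loading" is represented by some up-front allocations:

```csharp
using System;

public static class LoadPhase
{
    public static long LoadEverything()
    {
        // Allocate everything up front (placeholder for real asset loading).
        var buffers = new byte[4][];
        for (int i = 0; i < buffers.Length; i++)
            buffers[i] = new byte[1024];

        // Force a collection once loading is done, so the Compact
        // Framework's non-generational GC enters the game loop with a
        // settled heap.
        GC.Collect();
        GC.KeepAlive(buffers);  // keep the "assets" live past the collect

        // Monitor the heap: if this number keeps climbing during the
        // main loop, something is allocating garbage every frame.
        return GC.GetTotalMemory(false);
    }
}
```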

------
Nitramp
Structs seem like a very pragmatic solution to the truly hard problem of
writing a really good GC algorithm.

Java has a whole selection of very advanced GC algorithms, and from what I
hear, it's possible to mitigate most long-pause problems by careful tuning
(e.g. incremental+parallel might help here). But JVM tuning is such a black
art that really nobody knows how to do it ...

TL;DR worse is better.

