The solution, which (unsurprisingly) works, is how you did stuff in Fortran-77 and APL (circa 1958). There were no iterators, so all you did was pass around one big array that contained all the data, and indices into it.
In a couple of years, they will realize that even structs are inefficient and will go to parallel arrays - at which point the transition to Fortran/APL/J/K will be complete.
I'm close to 40, dinosaurish by HN/Reddit standards, but I learned Fortran mostly because back in the late 80s and early 90s essentially all numeric processing code was written either in Assembly or Fortran; and at a later stage, because I was the only one proficient in Fortran, I inherited lots of legacy code at work.
Most people my age, even if they've been programming since age 9, haven't seen any Fortran unless they're doing physics or using LAPACK.
Alexey Ragozin's blog is great if you're interested in the murky details of generational GC: http://aragozin.blogspot.com/2011/06/understanding-gc-pauses...
"Key point of generational GC is what it does need to collect entire heap each time, but just portion of it (e.g. young space)."
"This principle is based on hypnotize that most object “die young”, i.e. majority of objects are becoming garbage shortly after being allocated."
But isn't heap fragmentation a much bigger problem with malloc-style memory management?
I'll welcome better garbage collection.
For many years now, some Smalltalk VMs have had "permspace." You just send a message to a long lived object, and it's moved to a different part of memory considered permanent, and the GC never looks at it. No structs. Just call a function on the permanent objects. Since it's in a different space, this is also very efficient for the VM, since identifying a permanent object is just a comparison of addresses.
In general, garbage collectors are sensitive to the volume of objects in the heap and the number of live references, since the collector works by scanning all live objects and following all the references they contain. If you can't reduce the volume of memory you use outright, you can take a stab at reducing the overhead of various language features: avoiding boxed value-types (such as a non-generic list of integers), being careful with delegates/events/generators, specifying capacities for System.Collections.Generic.List<T>, etc., can all go a long way toward lowering overhead. Reducing the number of references (by storing indices instead) can be fairly easy and a significant gain as well.
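As a hedged sketch of that indices idea (all type and variable names here are invented for illustration):

    using System.Collections.Generic;

    class Customer { public string Name; }
    struct Order { public int CustomerIndex; public decimal Total; }  // index, not a reference

    class Demo
    {
        static void Main()
        {
            var customers = new Customer[100000];   // one big array: few GC-traced references
            var orders = new List<Order>(500000);   // capacity specified up front, structs inside

            // Lookup goes through an index rather than a pointer the collector must trace:
            Customer c = customers[orders[0].CustomerIndex];
        }
    }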
Disposing of IDisposable objects (via using-blocks) when you're done with them rather than waiting for the collector to harvest and/or finalize them also helps considerably.
I think in general more explicit memory usage patterns can benefit even when using a managed language -- you know more about your program and where performance matters than the garbage collector ever will.
This is the second StackExchange article where it feels like square peg, round hole. The first was their ORM (http://samsaffron.com/archive/2011/03/30/How+I+learned+to+st...), which isn't really an ORM at all (perhaps micro-ORM is fair) in the traditional sense.
So you still have periods of high-latency collections (in .NET only Gen2 collections are actually noticeable; not sure about other platforms). You also have the overhead of managing reference counts, which can be non-trivial, especially if you're paging as a result.
Of course, comparing GC performance between C# and Python or Ruby is sort of silly. If we were trying to hit our ~40ms Question/Show render targets on them we'd probably have lower hanging fruit than GC.
It's also possible in the case of object structures to have a sort of "ring" where the weak reference gets shuffled around said ring in order to allow a safe cycle without expecting the user of said objects to care (the trick is that objects' DESTROY methods are allowed to make new references to the object, which lets you do the shuffling just before it would normally have been freed).
I don't entirely see how managing reference counts would cause paging - in perl they're stored as part of the C-level struct for the value so the data's already nicely local to the memory you're already accessing - and when a ref count hits zero the value gets put into a "clean this up when this VM op has finished executing" array so it's pretty efficient.
I would suspect this sort of potential problem is why ObjC seems to be looking towards reference counting - the "smoothness" of not having any chance of a GC pause is probably, to Apple's eyes, a worthwhile trade-off for the disadvantages of reference counting.
Strikes me it might be interesting to be able to mark objects as reference counted as a halfway house between fully GC'ed and fully manual, but I'm not sure whether the amount of use that would get would justify the effort involved.
Paging can occur as a consequence of copying a reference.
    var p1 = A;   // let p1 be a reference to A
    // ... something in the interim; A's page may get evicted ...
    var p2 = p1;  // create a new reference to A via p1: this increments A's
                  // refcount, a write into A's header that can page A back in
This assumes an object carries its reference count with it, which every reference counting system I'm aware of does. I suppose you could technically keep reference counts in a common space, though growing that would be a bit rough.
A GC isn't a magic bullet or anything, but they don't need to cause any paging as a result of reference changes until a sweep; and even then, they only need to page in the live set which is more likely to be resident in memory.
I'm... skeptical Apple's introduction of ARC is really performance oriented. Seems more like they tried and failed to build a workable GC for Objective-C with Leopard. Haven't worked with much Objective-C since 10.4 though, so I'm not completely settled either way.
The collection thing I can definitely see though.
One thing I've never quite worked out is - without reference counting, how do you handle timely destruction?
I know that if I do something like:
    my $x = foo();

then the thing $x refers to is destroyed (and its DESTROY fires) as soon as $x goes out of scope, because its reference count hits zero. In a mark-and-sweep GC language, presumably there's some other additional feature that needs to be implemented in order to provide this?
The equivalent to timely destruction in C# is a using statement combined with implementing IDisposable.
    using (var stream = File.Open(somePath, FileMode.Open))
    {
        // use stream to do whatever
    }   // stream.Dispose() runs here, deterministically
C# classes (but not structs) can also declare destructors, but their execution schedule isn't guaranteed; they're useful for making absolutely sure resources are released eventually, but Dispose() is preferred for timeliness and performance.
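For completeness, a minimal sketch of how the two interact (Resource is an invented name):

    using System;

    class Resource : IDisposable
    {
        public void Dispose()
        {
            // deterministic cleanup: runs at the close of a using block
            GC.SuppressFinalize(this);  // the destructor below is now redundant
        }

        ~Resource()
        {
            // destructor (finalizer): eventual safety net if Dispose was never
            // called -- but it runs on the GC's schedule, not yours
        }
    }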
Your last point seems flat wrong, btw. If you're in a GC environment with worst-case latency requirements (real requirements, not just nice-to-haves or 99.9th percentiles, or whatnot) of 40ms, then you're in a whole world of hurt. GC won't do that -- brush up on your C.
I never claimed GCs solve all leaks either (they do a pretty good job with finalizer semantics, honestly), but they're awfully, awfully handy for solving very common (and sometimes tricky) ones.
I also did not mean to imply that our 40ms cut is a hard limit; it's a target average with a desire to minimize variability (Marc's work on which led to this post), and one that we're hitting pretty well. If you're working on a real-time system, you're right that you shouldn't be using these non-deterministic GCs.
There is no magic here, it is merely understanding the limitations of the .NET GC and adapting to it. XNA and trading platforms also are aware of these issues and have to follow similar patterns.
1. When a gen-2 collection is likely to occur (looks like 3 vertical lines on your chart), tell the load balancer to stop forwarding requests.
2. Let any existing connections finish.
3. Run the GC manually.
4. Ask the load balancer to start sending requests again.
This seems cleaner than marring what sounded like already complex code with additional cognitive burdens.
What happens if more than 1 server needs to GC at the same time (or overlapping times)? You definitely don't want to pull everything out of rotation, so you'll need some coordination, and how do you determine how many servers are allowed to be pulled out? We know we can limp along on about 1/3 of our servers, but it's not a stellar end user experience; at some point you're not solving the problem, you're spreading the pain.
There are some minor annoyances this would introduce too.
We use "server in/out of rotation" as an at-a-glance "things are getting ill" check; if a web server has been pulled (due to slow responses to requests), it's almost invariably something that needs fixing right now. We'd lose that if pulling servers out of rotation were normal operating procedure.
This would also be a royal pain in terms of local and dev tier testing, since you've basically made the load balancer and frequent GCs a pre-requisite for thorough testing.
Of course, this remains one of the options when dealing with GC stalls; I just think what Marc went with here made more sense.
It's just a simpler solution to have a group of servers with enough capacity to account for the random GC pauses. Having a short health check enables failing fast, which is generally better for users overall - their requests don't get stuck in overloaded servers for a long time (GC-caused or otherwise).
Seems like doing that would be far cleaner than what they actually did and it directly addresses the problem.
If GC is the future of language runtimes (and it seems it is), then some kind of control will be needed for situations like this. Different pools, explicit tagging of 'long-lived' or 'cache value' or some such during allocation?
I think that's reductio ad absurdum. It's typical in long-running applications to have a class of memory unlike the rest, that will live the entire lifetime of the application. I see no problem with being able to tell the GC "this memory will always be used," or just allocating it outside of the GC entirely, while still wanting all of the benefits of GC elsewhere in the application.
The idea that immediately came to my mind is that given that the site is probably load-balanced, the best thing to do would be to take the site out of the load balancer while the big GC runs, then put it back in. I wonder if there's some way to hook that GC event. So, "pools", but at a higher level.
We register for notifications from the GC using a magic threshold number that could mean anything.
Then we quickly notify the rest of the webs over our "message bus" that a GC is pending. They let us know whether they are safe from GC at the moment. If they are not, you are in a pickle.
Then we notify HAProxy that we are about to run a GC and ask it to tell us when it is done draining the current requests and taking our web offline.
Once notified, we perform the GC manually.
Then we notify HAProxy that we are done with the GC so it can add us back to the pool.
What could possibly go wrong?
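For the curious, the .NET half of that dance looks roughly like this. GC.RegisterForFullGCNotification, WaitForFullGCApproach and WaitForFullGCComplete are the real APIs; the HAProxy/peer calls and the running flag are hypothetical stand-ins:

    // After GC.RegisterForFullGCNotification(10, 10) -- those magic 1-99
    // thresholds -- a watcher thread can run something like:
    while (running)
    {
        if (GC.WaitForFullGCApproach() != GCNotificationStatus.Succeeded)
            continue;

        NotifyPeersGcPending();      // hypothetical: confirm the other webs are safe
        DrainViaHaproxy();           // hypothetical: wait for in-flight requests to finish
        GC.Collect();                // run the full collection manually
        GC.WaitForFullGCComplete();  // resume once it's done
        RejoinHaproxyPool();         // hypothetical: back into rotation
    }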
Reading that page, it is clear that this is either the exact scenario the page is covering, or at the very least one of the core use cases.
It's triggering my Wrong Thing To Do senses, but, well, when the GC isn't quite right, hacking around it may be the only option.
A GC is actually solving several distinct problems at once:
- handling memory with a dynamic lifetime (with cycle detection as an important corner case)
- improving allocation performance
- preventing heap fragmentation
- thread-local allocators (for multi-threaded scalability)
Any sufficiently complicated, long-lived program that manually manages memory contains an ad hoc, informally-specified, bug-ridden, slow implementation of a garbage collector.
I agree there's a place for some amount of "hinting" to the garbage collector about the lifetime of some special objects, but, to me, the important thing is that you only need it for the extreme cases.
Useful for long-lived data that doesn't change frequently.
Garbage collection is a wonderful option. It gets old when it's the only hammer you have.
What a joy to be able to maintain a large cache in your program, rather than having to rely on an external service. What a joy for virtual memory to actually work, rather than constantly (every 2 or 3 seconds) swapping back in something that wouldn't otherwise be used for a while, and even more frequently thrashing on L1/L2 caches.
How much would you wager? I'm willing to take your money :)
> Of course, this isn't an issue if you're primarily passing the flyweight objects around and not accessing all the fields.
If you examine codebases, you see that this is actually the prevalent use case for objects in general.
It turns out that in practice, parallel arrays are often significantly faster, because more often than not you _do_ access a lot of (memory-consecutive) records, and you _don't_ access all the fields.
Assume e.g. your structure has student names and numeric grades. For display purposes, you are likely to read both name and grade.
However, for just about every other purpose, you are likely to scan/process grades without ever looking at the name (e.g., to compute grade statistics; or to sort by grade). And in this case, you actually have much _better_ cache locality than you would have otherwise -- and also much better memory usage compared to individual allocation of objects.
Furthermore, if you take this approach, you find that your functions become type-concrete but use-generic, reducing your code footprint (both LOC and I-cache), because e.g. instead of "calcAverageGrade(student x)", you can just use a routine called "calcAverage(float x)" and call it on the grade vector.
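To make that concrete, a hedged C# sketch (Student and the field names are invented for the example):

    // Row-major: the routine is welded to the record type.
    struct Student { public string Name; public float Grade; }

    static float CalcAverageGrade(Student[] students)
    {
        float sum = 0;
        foreach (var s in students) sum += s.Grade;  // strides past Name on every record
        return sum / students.Length;
    }

    // Column-major: parallel arrays (string[] names; float[] grades) kept in
    // step by a shared index. The same routine now works for grades, prices,
    // temperatures... and scans a densely packed array.
    static float CalcAverage(float[] xs)
    {
        float sum = 0;
        foreach (var x in xs) sum += x;
        return sum / xs.Length;
    }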
APL, J and K take this approach to its logical conclusion. The result is programs that are 10-1000 times shorter than the idiomatically comparable Python, Java, C#, C++ etc., and that have runtimes that compare favorably with C and C++.
This doesn't hold if you're doing random access (which it sounds like they are, given that they're passing around lists of indices). In the random-access scenario, parallel arrays will require roughly O(M) times as many trips to main memory as the struct array, where M is the number of fields accessed.
Note that even if they do pass around a list of indices, accesses are often (in practice) batched - because records that arrived together tend to be needed together. And the indices are often sorted (because of how they were generated, or because they are easy to sort) which then produces a very predictable access pattern for memory and disk.
We can theorize about probable and improbable access patterns until tomorrow. My experience, and that of everyone I'm aware of who has used column-major order, is that in practice it almost always wins against row-major ("common") ordering, whether on disk or in memory.
The fastest databases around (kdb+, Vertica) use column-major order. That should tell you something (about disk, but also about memory).
Sorry, I'm really confused. Is this supposed to be "O(M/L)"?
The note about clumped records is valid. That is often the case. Without knowing more about their workload, it's difficult to say whether they have that behavior, but it's quite likely.
As for the fastest DBs, they use column order for a number of reasons. The big thing is that their access patterns are rarely random. They're optimized for scanning on keys. DBs like Fastbit can be extremely fast for whole-table scans. They aren't so fast if you need to find a specific entry. For maximum speed there, you need a dedicated index, and then it doesn't really matter whether the data is stored row-major or column-major.
No. As soon as you need to read more than <cache line bytes>, it no longer matters that the fields are near each other. So, e.g., if your record is 128 bytes but your cache line is 32 bytes, you have 4 cache misses per record. If you use only 4 fields from the record, you also have 4 cache misses per record with parallel arrays -- so we're at parity.
As M increases beyond the cache line size, the advantage (compared to column major) no longer increases. So it is at most a constant in practice.
> The big thing is that their access patterns are rarely random.
That's true for almost all programs, almost all of the time, actually -- which is why I was willing to take the other side of your original wager.
> They aren't so fast if you need to find a specific entry. For maximum speed there, you need a dedicated index, and then it doesn't really matter whether the data is stored row-major or column-major.
And again, that's true for many workloads as well.
The main advantage of row-major is simpler appending; if you write a lot and read little, row-major is faster; this is usually the case for e.g. audit logs.
In most typical cases, column major is equal or better.
Huh? Why would you have 4 cache misses for a 128-byte record? If you need to access 4 fields and they all fall onto the same cache line, you get only one cache miss. If your fields are spread far apart, you might have 4 cache misses, but that isn't a given. With parallel arrays, the fields will be in separate arrays and necessitate 4 cache misses.
> That's true for almost all programs, almost all of the time, actually -- which is why I was willing to take the other side of your original wager.
I didn't make the original wager. :)
I wrote earlier "if your cache line is 32 bytes".
If you need all the fields, you will have 4 cache misses. If you need only some of the fields, and they all fall into 1 cache line, then -- as you said -- you will only have one cache miss.
My point (poorly worded, upon rereading, but I didn't have much time, dayjob and all...) was that the potential advantage of record-major does NOT increase beyond what is packed into one cache line -- so it is a small constant, rather than deserving of O() notation.
> I didn't make the original wager. :)
Ah, apologies. (But maybe you'd still let me take your money? :) )
I see what you mean, now. You're right that it's a constant multiplier.
Because of the semantic differences, you can't just go switching all your classes to structs and realize a significant performance difference. Code like this behaves differently depending on whether Foo is a class or a struct:
    public static void Func(Foo[] arrayOfFoos)
    {
        Foo f = arrayOfFoos[0]; // if Foo is a struct, f is a copy of the element
        f.bar = 10;             // ...so this writes to the copy, not the array
    }
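One hedged footnote on that snippet: with a plain array you can still mutate in place by writing through the element, because an array indexer yields the struct itself rather than a copy (List<T> does not, and won't even compile):

    arrayOfFoos[0].bar = 10;   // OK: writes directly into the array's element
    // listOfFoos[0].bar = 10; // compile error CS1612 on a List<Foo>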
Arrays of structs also have better cache locality which can improve performance on number crunching workloads and can also make it easier to exchange data between managed code and native code through a P/Invoke interface.
See here for details:
They got lucky that their Customer type was simple enough.
The GC's marking phase is very slow because it has to traverse the whole object graph each time, meaning you want to make the object graph as simple as possible.
Arrays of structs are very efficient for the GC to scan (it can either collect the whole array or not), but working with structs is a nightmare because there's no polymorphism.
For Fluffy we just used classes and tried to optimize the hotspots as an afterthought. I'm not quite satisfied with how it turned out; there's still a periodic 20-40ms lag in the game.
There is a great article about keeping heap complexity low: http://blogs.msdn.com/b/shawnhar/archive/2007/07/02/twin-pat...
As a developer for Xbox 360, I found your best bet is to:
- monitor the amount of memory allocated (print out GC.GetTotalMemory(false))
- allocate everything during loading and force a GC when you're done loading stuff.
- ensure you're not creating any garbage at all during your main loop, which will prevent the GC from kicking in at all.
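A minimal sketch of that checklist, assuming XNA's Game class (LoadAllAssets and baseline are invented names):

    long baseline;

    protected override void LoadContent()
    {
        LoadAllAssets();                      // hypothetical: allocate everything up front
        GC.Collect();                         // force a collection once loading is done
        baseline = GC.GetTotalMemory(false);  // record the post-load heap size
    }

    protected override void Update(GameTime gameTime)
    {
        // Goal: zero allocations here, so the GC never kicks in mid-game.
        // While tuning, watch for creep against the baseline:
        long now = GC.GetTotalMemory(false);
        if (now > baseline)
            System.Diagnostics.Debug.WriteLine("allocated " + (now - baseline) + " bytes since load");
        base.Update(gameTime);
    }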
Java has a whole selection of very advanced GC algorithms, and from what I hear, it's possible to mitigate most long-pause problems by careful tuning (e.g. incremental+parallel might help here). But JVM tuning is such a black art that really nobody knows how to do it ...
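For a flavor of what that tuning looks like, these are real HotSpot flags of that era (incremental CMS plus a parallel young-gen collector), though the right mix is entirely workload-specific and MyServer is a placeholder:

    java -Xmx2g -verbose:gc -XX:+PrintGCDetails \
         -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSIncrementalMode \
         MyServer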
TL;DR worse is better.