The work on JVM garbage collectors over the last ten years has increasingly been about supporting bigger and bigger heaps efficiently, as physical machines have got bigger, and programs' (or at least application servers') heaps have grown to fill them. Even Shenandoah, a current prototype collector, is designed around large heaps.
But with microservices, we use our cavernous machines to house many small heaps instead. That leaves us with rather different garbage collection challenges.
So, it's possible that the Go developers will get to ignore decades of progress, as they love to do, and still be competitive, because that progress was towards a different future to the one we arrived at.
In three years, people will have realized it wasn't such a good idea, and moved on to the next "this will solve everything" fad.
Once you have seen enough of these fads go by, it's pretty obvious.
I'd be careful about extrapolating future directions of computer science from something that seems cool this year.
> In three years, people will have realized it wasn't such a good idea, and moved on to the next "this will solve everything" fad.
Microservices are simply the youngsters rediscovering that encapsulation is a damn good idea.
As I explain to people, microservices do NOT make your life easier. If you can exist in a monolith, that's way easier to implement.
However, once things start getting complicated, microservices make certain things possible.
Build/deploy/release cycles become independent with microservices, which can be great. Also you can write microservices in different languages.
But the APIs can become less rich. Try passing callbacks between microservices written in different languages. And it can be more of a pain to change APIs. And if there's any state to worry about, you need to manage that and maybe use a stateful protocol (which are harder to get right).
I accept that coarse-grained services can have independent lifecycles, but not microservices. I'm working on a microservice-based project, and two things we've learned are that there are almost no features we can implement in just one microservice, and that if we don't test microservices together, we get bugs. So, in practice, their lifecycles are tightly coupled.
Some people ding this for being a 'distributed monolith' rather than some kind of 'true microservices', but i don't see how we could factor this application into micro-sized bits that would not be tightly coupled.
> Also you can write microservices in different languages.
I worked somewhere that tried that for a while (although small services, rather than microservices). It turns out that it makes it really hard to share people, libraries, and learning between teams, and that the task-specific advantages of particular languages mostly aren't great enough to make that worthwhile. We managed okay with some bits in Java and some in Scala, but adding Clojure and Groovy was too much. I think there are some bits in R now, but they're owned by a completely separate part of the organisation, so Conway's law allows it.
The idea is that you don't do that. It's a matter of duplicating functionality across teams so that each team isn't beholden to another team. pushing that library doesn't suddenly break things, for example.
Whether you agree with it or not, when you choose microservices, you choose to avoid sharing libraries like that. In the small it's probably ok (across microservices in 1 team, depending on circumstances), but in the large the gain isn't worth the added maintenance hassle.
Developers often gloss over how truly difficult it is to make good API choices. And when the wrong choices are made, it goes wrong quickly.
Microservices probably amplify this effect, but it's hard to say.
I doubt it
You want to make your way in the CS field? Simple. Calculate rough time of amnesia (hell, 10 years is plenty, probably 10 months is plenty), go to the dusty archives, dig out something fun, and go for it. It’s worked for many people, and it can work for you.
— Ron Minnich
People today could do well reading
Computer Architecture and Parallel Processing Paperback – International Edition, January 1, 1986
by Kai Hwang (Author)
Micro-services is just a new hype buzzword. Classic case of "smaller is cuter" http://www.smbc-comics.com/?id=2932
Of cause JVM needs to focus on large heaps. Now it still stucks somewhere at 32GB for performance reason. And for heavy-lifting application like HBase/ElasticSearch, that is currently OK, but still makes it quite inconvenient to utilize larger memory(you might need deploy multiple instances on the same box, etc).
You can't invent a new buzzword and tell other people they are solving the wrong problem.
Short version: No it's not.
Long version: Heap is under 30 GB or over 40 GB. Never between. There is a JVM optimization that allows it to use 4 byte pointers up to 30GB of heap.
There is simply a gap: A 30 GB heap (with 4 bytes pointers) may be similar in capacity to a 40 GB heap (with 8 bytes pointers).
Guides and documentation (elastisearch) are simplifying this as "java can't do more than 30 GB heap", which is factually incorrect. You can do a 60GB if you want.
That's not true, dealing with large heaps is problematic in any garbage collected language because due to fragmentation you end up in stop-the-world full GC phases. Which for a big heap (i.e. bigger than 4 GB!), you can end up with latencies measured in seconds or even minutes.
The article mentions the CMS GC for Java, which has this problem. The G1 GC, introduced in Java 7 I think, is much better, however from what I've read it too isn't immune to full GCs. Of course for Java you also have the Azul "pauseless GC", which is supposed to never do this, but it is a commercial solution. And the article also mentioned another GC currently being contributed by Red Hat, which is good to hear.
Sure, if you go from 300MB to 30GB, the GC might behave differently and maybe some applications can't take it.
I think it is reasonable for designers to curate advances to address the problems they are trying to solve.
Namely: I don't agree with Go marketing their GC as if it magically provides no trade offs. Since reading the original design doc for it, I've assumed that Google was simply lazy/didn't care about Android, because the Android GC didn't provide this magic.
I don't believe this to be true. They always point out painstakingly, that the advances made on pause times come with less throughput, for example.
What the go people are doing (and with which I tend to agree and I believe it's hard to argue against the success they're having), is to say that they believe it's possible to build a GC which is good enough for most modern use cases and doesn't come with a lot of tuneables.
(However, even that comes with tradeoffs; it limits the use cases that go can be used for. For large enterprises, being able to minutely tune a GC might be an important thing)
It wasn't remotely clear that the advancement here was "We don't provide settings, because we're guesstimating your workload matches the standard Go use case!", and all the histrionics further masked this truth.
At this point I'm just repeating the article, though. :)
Erlang is more clear about their assumptions and goals. They don't claim to have an amazing new GC, they just say they have a simple generational GC, and that's all they need because they assuming many small heaps.
Are other fields as bad as programming? Do engineering people get to do the equivalent of having a paper published or getting to present at a conference because they implemented something in language A which was already done three years ago in language B? Do other fields have people grousing about how no one uses X, even though it is clearly "better," where "better" is basically an unsubstantiated opinion?
Some embarrassingly large number of "best practices" in our field are pretty much based on a feeling it "seems right" combined with a "well, it worked a couple times before" level of empiricism. Vital decisions about tooling, such as choice of programming language, are made by some combination of: following the herd, following one's own prejudices, or being enamored of a nifty idea.
One thing a company at the scale of Google could probably afford to do, would be to collect real data about the effects of programming practices.
Never said it was.
It's more a manifesto on programming philosophy born of "old man rage" in language form.
I suspect this has something to do with why I like it so much!
Should I be insulted by your finding it insulting? My thought on reading that phrase was, "Right on!"
K-12 education for sure. Completely fad (procurement) driven, utterly disconnected from reality (best available science) with egos and bent morality added for giggles.
Mind the feedback loops and incentives.
And don't trust the opinion of anyone who hasn't maintained code they (personally) wrote and deployed +5 years ago.
The very large heaps are partially driven by distributed in-memory storage systems.
More seriously, though, isn't most large in-memory data storage done off the garbage collected heap? I have to confess i'm actually not sure what anyone is doing with >20 GB of normal objects.
Microservices/API won't help with lowering these.
edit: thank you all for your answers, much appreciatedly
Singularity was a single address space OS, thus there were no "processes" in a traditional sense. They still had a notion of isolated spaces though partly to allow different heaps to be GCd independently, and to use different GC algorithms (with different barriers) exactly due to the lack of a one-size-fits-all approach. Messaging between these process-like-things was done by moving objects onto a "transfer heap" and back again, which boiled down to just a simple pointer write. There were no syscalls.
A very nice architecture, from a theoretical perspective. However it's not clear to me that for standard desktop-sized machines you really need segmented GC spaces these days. With an incremental collector you can probably get pause times in the milliseconds range for a 32GB heap and I guess 32GB is as much RAM as you need on a Pro-level machine. For servers of course you go bigger.
One interesting GC algorithm I came across did thread-local collections. Objects were tracked with write barriers to discover when they became 'globally reachable' and moved to a global heap. Threads could stop independently whilst others continued running to collect their thread local heaps. That opens up the possibility in desktop/mobile apps of having a main thread that's doing an animation of some kind in a low-garbage way, whilst the background thread does work as quickly as possible, with the background thread collecting independently whilst the animation thread still runs.
"A Unified Theory of Garbage Collection"
Basic idea is that tracing and reference counting are duals of each other. Reference counting finds dead objects, tracing finds live objects. Optimized variants of each of these tend to approach the other side.
When there were two GC papers:
Assessing the limits of program-specific garbage collection performance - "We demonstrate that there is considerable promise to reduce garbage collection costs by developing program-specific collection policies."
Or going back to 2012:
And then there were none: a stall-free real-time garbage collector for reconfigurable hardware - "We present the first implementation of a complete garbage collector in hardware (as opposed to previous "hardware-assist" techniques), using an FPGA and its on-chip memory. Using a completely concurrent snapshot algorithm, it provides single-cycle access to the heap, and never stalls the mutator for even a single cycle, achieving a deterministic mutator utilization (MMU) of 100%."
That's by David Bacon, one of the runtime implementation greats, who has been after low-pause GC for years. A golden oldie from him is:
A Pure Reference Counting Garbage Collector - "When processor or memory resources are limited, the Recycler usually runs at about 90% of the speed of the mark-and-sweep collector. However, with an extra processor to run collection and with a moderate amount of memory headroom, the Recycler is able to operate without ever blocking the mutators and achieves a maximum measured mutator delay of only 2.6 milliseconds for our benchmarks. End-to-end execution time is usually within 5%."
As for by-type GC, i remember reading a paper years (decades?) ago, perhaps from Microsoft, where they'd tried putting every class in its own region, the idea being that when you wanted to collect some particular region, you could exploit your knowledge of types and their fields to only scan regions which could possibly have pointers to it (eg to collect BlogPost, you have to scan User). Clever, but didn't work well in practice, because so much garbage is lists or strings that can be pointed to from a lot of places.
However, on googling, i find that Edsger Dijkstra wrote about it in 1975:
Hey, maybe that's old enough that Go can use it?
Progress in the wrong direction is not progress.
Ignoring "progress" that was in the wrong direction is progress.
1983: A Real-Time Garbage Collector Based on the Lifetimes of Objects, Lieberman/Hewitt
Since objects with short lifetimes account for a large
portion of storage use, it is worth optimizing a garbage
collector to reclaim storage for these objects
For example, Java a few generations ago was very poor at escaping heap allocation to the stack and so much the opposite skewing towards heap allocation over stack allocation, so I'm sure the JVMs of this era were an extremely good fit for the generational hypothesis.
The region markings are used as hints. Allocations are put into the region when possible and then the region is discarded at the end in one go, which is of course a well explored idea but this one is interesting in that you only scope out the regions and don't need to manually assign each allocation. Also allocations that naturally exist beyond the end of the region work properly without additional effort.
One issue is that it's often quite hard to beat a properly tuned generational collector even with techniques that intuitively 'feel' right. As one example, Azul experimented with aggressive hardware-assisted stack allocation on their custom CPU and abandoned it because they found it didn't really beat well tuned generational collection.
It isn't GC but NeXTSTEP had allocation regions under this theory; that's where alloc:withZone: came from. In reality programmers mixed up the zones and there were way too many crashes as a result. Without some way to enforce the separation you just don't get much benefit.
In databases, it's common to do arena based allocation. You allocate objects of similar lifetime (life of a query, life of a transaction, life of a procedure execution, etc.), and free them all at once when the lifetime is up.
When hacking postgresql, for instance, using pfree() is fairly uncommon. You just allocate, (which goes to the current arena by default), and then the arena is wiped out all at once later.
Of course, sometimes you might use a lot of memory temporarily in a loop, and you need to pfree() it each iteration. But that is the exception.
I think of each arena as a tiny heap (though of course they can grow large).
In this case, the bigger benefit wasn't the allocation/deallocation – but the fact that you would be working with memory on the same virtual memory page, IIUC.
I've also heard of GC algorithms using exactly this behavior for the nursery stage.
In C/C++ you have to manually decide how short lived allocations are ahead of time and either stack or arena allocate them. Freeing all the allocations on the stack or the arena is then very fast.
In a generational GC the programmer doesn't annotate allocations like that ahead of time. Instead it is assumed that all allocations might be short lived. How long they actually live is then tracked and eventually they're moved out of the "arena" (young gen space) by copying them. But again, freeing the contents of the young gen is very fast because you don't actually touch dead data at all, you only copy the live objects.
I've heard of GCs that can actually do "pre-tenuring": when the VM knows an allocation will definitely live for a long time based on profiling data it can be allocated directly into the old gen space. That reduces your minor GC pause times further.
All this adds overhead of course, but makes automatic something C/C++ devs would do themselves. Given the potentially catastrophic security consequences of screwing up in the C/C++ model I've become a lot less sympathetic to that approach over the years. How much extra performance is a security advisory and possible resulting botnet worth to you?
It's fair to draw an analogy to generational GC, but they aren't the same.
Rust is an example of a language that can do arena allocation without the dangers of C++.
EDIT: Based on my experience with postgresql, and what I've learned about rust so far, I am starting to believe that object lifetime is a fundamental concept in CS. It's possible that sophisticated GCs have, to some degree, hidden the importance.
This is exactly how the Gen0 nursery works. The surviving objects get copied to Gen1 before the Gen0 bump pointer is reset. Of course, in C++ arenas there cannot be surviving objects or that is implemented manually.
The main difference here is that what is in the nursery is decided by the GC, whereas a manual arena requires you to pre-decide what is in the arena or not.
WRT security Rust is my go-to these days(they also support both typed and un-typed arenas). Much better memory safety guarantees and C ABI makes embedding in a larger application straightforward.
There's an underlying assumption (both in the article and in comments here) that the results of garbage collection research are generally applicable across languages. It's not obvious that this is true. While you can certainly get a lot of ideas from previous research, memory allocation and access patterns can differ significantly between languages (due to different language features and common idioms), and what seemed like a win for one language might not actually apply for a different one.
1) they've solved for Go
2) this is comparatively better (almost mockingly so) than other VMs because of the advanced nature of their implemtation
The post suggest that 1) is overreaching and misleading: GC is always about trade-offs. It suggests that 2) is also false, because their implementation is not particularly advanced and just follows from the trade-offs they made in 1).
It does rather seem like over-marketing what they've done, regardless of whether it useful in many Go use-cases.
You're making a very uncharitable interpretation, and unjustifyably calling them guesses only makes it more so.
Even in the thread about 20% CPU usage increase mentions that Go implementer considers this much higher than expected and is likely a bug. But author make it sound in article as Go guys are calling it reasonable tradeoff for low latency.
As Go is a relatively ordinary imperative language with value types, its memory access patterns are probably comparable to C# where the generational hypothesis certainly holds and thus .NET uses a generational collector.
I think that's reasonable. I would say it is on the Go developers to point out why they think Go would have seriously different performance characteristics from similar C# network applications using Task-based asynchrony (similar to goroutine + channel based asynchrony in Go).
They didn't choose mark-and-sweep because they thought other algorithms are worse, go has special needs or anything else. At least this would be the first time I've heard about it. The same goes, from what I've seen, for most of the arrogance ascribed to them (in this article and elsewhere); it usually hinges on making up or misrepresenting things they've allegedly said.
b) I disagree with that summary of the article. I believe it's first paragraph should be an adequate indication of what it's point is, that's how you usually write articles, and it's distinctively not making that point.
And lastly, c) I also don't believe that this statement, if it were the point, is at all true. At best it's hyperbole, but even on the surface, of the list of 15 design criteria for a GC mentioned in the article, at least 7 are explicitly taken into account and considered an optimization goal for the go GC (not all on equal footing and some in a different way than the author might prefer, but claiming they were ignored is just false). I can definitely say that for pause times, pause distribution, compaction, concurrency, scaling, tuning and portability.
Comparing Go and C# as languages don't seem to be the goal of this exercise, either. Instead, the article was making the point that the garbage characteristics have generational components. This is true of C#, so I would expect it to be true of Go as well. If that is the case, whatever Go's current performance numbers are, they could probably be made better by a generational GC.
Unless you seek to optimize a single application, a benchmark is meant to be broadly representative of a wide category of applications whose performance is dominated by the component being benchmarked.
For instance, an application which allocates no memory is a poor benchmark of the memory allocator, and it is also a poor general representative of program performance if most programs' performance is dominated by memory management.
However, this discussion is about GC, not generics specialisation. Java has a guidebook explaining the tradeoffs and how to configure your collector:
"We continue to have significant memory bloat in standard Java Collection data structures. We added a higher overhead and slower G1 GC which will be default going forward. In Java 9 we also hope to break many user programs which depends on Sun internal APIs"
Now this sounds ridiculous because announcement of major product/feature are suppose to highlight advantages just like Go's new GC announcement blog.
Tradeoffs are discussed in mailing lists and bug trackers.
Also, I should note that I've never heard of a language specific GC algorithm. The closest I can think of would be conservative collection which is a variant of non-generational non-compacting mark/sweep for the case where you don't have pointer maps. Usually that means C++. In a conservative GC any memory contents that looks like it might be a pointer is treated as if it is. It turns out that integers that look like pointers but aren't are pretty rare, so whilst this type of GC can actually leak memory (!), in practice it is still very useful.
Early versions of Go used the Boehm garbage collector which worked okay for a 64 bit architecture but not as well for 32 bits since false pointers are more likely - another example of how a smallish language change can have big effects.
I'm not a GC researcher but I've had enough trouble getting reliable benchmark results to be skeptical of drawing any general conclusions from cross-language benchmarks. Performance tends to depend on implementation details.
Is it really? The linked paper is about Smalltalk. A lot of the research is about Java. Functional probably talks about Haskell and ML. All those languages rely on GC completely.
Is this hypothesis still true, when a language allows the programmer to mix different memory management techniques? For example, if you have smart pointer and stack allocation, do objects still "die young"? It might be that these young objects never reach the GC, because they are now allocated elsewhere, which means we have mostly long living objects under GC management.
I guess a selective use of the Boehm GC would be the only reasonable way to falsify the hypothesis. Did anybody try that?
Even .NET allows for it, but most .NET related research lives within MSR walls, and many don't follow it because Microsoft.
So we got Java as the main platform for GC research and only now, thanks to market pressure are we seeing such features planned for Java 10.
There are better options to run Python and Ruby than CPython and MRI in terms of performance, yet they lack mainstream adoption.
EDIT: I should clarify that "Python and Ruby developers" is meant to mean the language implementers, not ordinary users. And this isn't intended to be a dig at those languages--only that they clearly have different goals for their languages than implementors of Java, Go, C++, etc.
Also something that many aren't aware regarding Oracle's JVM development, is the decrease of C++ code, being replaced by intrinsics and better optimizations across releases.
JDK9 651408(C++) lines
JDK8 497588(C++) lines
I do not know the reason but newer JDK seems to have quite a bit more C++ code.
Also, the original implementations of Python and Ruby are so slow in general that it's not clear that improving the GC is the best use of one's time.
In the open-source Common Lisp world, CMUCL and SBCL have for years used Douglas Crosher's multigenerational collector. Guido and Matz would have done well to use it too -- it's much better, probably, than either of theirs. It's still far behind the JVM, but the point is, Python and Ruby didn't necessarily need the very latest research; they just needed 1995 technology instead of 1975 technology.
Another view on this is, that ideally, you don't perform short lived heap allocations in the first place. Go for example tends to require less heap allocations than Java, and the compiler tries to put them on the stack, if they don't escape the function in which they are allocated. So Go manages to avoid those allocations, which generational GCs are especially good at cleaning up.
The idea of generational GC is still considered by the Go team - they plan to have a special heap for each goroutine. This heap would be collected when the groutine exits. This approach would combine the sound idea of generational GC with a pretty good heuristic for the life time of the young generation.
On the other hand, very often it's convenient to just return some freshly allocated object out of function invocation (non-obvious, but trivial example: string), which is why the generational hypothesis tends to hold for almost any language.
If you handle strings, and you are copying and freeing strings all the time, that's just slow code.
And it's not the case that GC makes the slow code faster... it's that certain techniques enable GC to not fail catastrophically on slow code.
It's a little bit disturbing that folks are ready to extrapolate this to some kind of universal rule.
In a high-end video game, for example, we hardly allocate anything short-term at runtime. The whole program is architected to avoid it. When we allocate, it tends to be things with medium-to-long-term life (texture maps, sound effects, render targets, whatever) so a generational system is useless there. In fact we don't use GC on these kinds of things at all, because we also need to control exactly when they are deallocated so we can put something in their place, because memory is limited.
Stack allocation is what your computer is built from the ground up to do. It is not some kind of workaround or optimization, it is how software was originally designed to work.
I am not just talking about putting copies of things on the stack, but having few copies to begin with, etc.
Where elsewhere? The stack? One of the things people always get wrong about GC challenges is "what about stack allocation?". Modern servers come with up to 6TB of RAM. Applications that are interesting from a GC perspective (i.e., lots of concurrency, data structures etc.) commonly employ heaps that are between 10s and 100s of GBs today. The amount of memory available on stacks is 0.5-10% (thousands of threads are required for the high amount) at most. In other words, stack allocation does not remove the needle by much (although it may affect the frequency of young-gen collections). The Java team discovered that when they added escape analysis and stack allocation, and the effect was not dramatic. The stack simply does not matter that much in the grand scheme of things (arrays-of-structs, OTOH, matter a great deal).
Maybe not on the standard implementation, but just like C and C++, Java also enjoys some extensions.
Packed Objects in IBM J9
ObjectLayout in Azul
And in some standard form in Java 10
Go does escape analysis and stack allocation. Presumably this would be pointless if Go allocations rarely died young.
Bear in mind that the hypothesis talks about allocations in general. That means things that get put on the stack count as "dying young".
Not to mention that in Go you create and destroy stacks much more often (probably orders of magnitude compared to typical Java program).
No, this makes no sense. It's managed absolutely identical to how any plain stacks in C are managed. Stacks are created and destroyed when scheduling units (threads or goroutines) are created and destroyed. Whether that memory is allocated by some user level library or the kernel (albeit with POSIX threads, you can, and in fact it's common for thread memory to be allocated by the calling process beforehand) has absolutely no relevance whatsoever. All that matters is that the lifetime of the stack is 1:1 mapped to the lifetime of the schedulable entity.
Of course in Go goroutine stacks are dynamic, which is an optimization that makes Go programs use less memory, but that has no semantic effect.
However, in Go, this percentage is much higher because there are orders of magnitude more stacks.
Let me take this to the extreme: Imagine a language with no assignable variables at all, but with processes and channels. In that language, you're able to create an arbitrary mutable variable as a process with a channel, that can get `set` or `get` messages. The value of the "variable" is kept in the process's "stack". In fact, this is pretty much how the process calculi work (one of which, CSP, is one of the inspirations behind Go's design). That this is how you choose to design your programming abstractions makes no difference to the runtime implementation. Your stacks would behave just like normal variables behave in languages with different abstractions.
Yes they are, it doesn't matter who is allocating them, the kernel, or the program itself (which as I mentioned, also happens with POSIX threads).
> The data in them is heap data
No it isn't, not anymore than data in POSIX thread stacks is heap data. You do realise that the memory for POSIX thread stacks also comes from "the heap" (regardless of who is allocating it), and it's not from the process stack segment, right?
> their lifetime is plain heap lifetime
No, their lifetime is the lifetime of the function activation record, just like in C with POSIX threads. The lifetime of some value that lives on the stack is the lifetime of the stack frame, not the lifetime of the stack segment.
> and their memory management (which is what we're talking about and isn't at all trivial) is heap memory management (pretty much).
No, the memory management of values that live on the stack is stack memory management. Values live as long as their stack frame they belong to lives. Stack frame allocation is a pointer decrement, and stack frame deallocation is a stack frame increment (plus some amortised O(1) cost caused by copying stack mechanism).
The lifetime of the stack segments themselves does indeed resemble (or is) heap-based memory allocation, but that is true in C, Java, Go, and any other language, and doesn't matter (or, if you take the position that it matters, it matters the same in all these languages and environments). And allocating stack segments is trivial.
> It's just another way of abstracting objects via message passing.
??? Message passing!? Wat.
> Let me take this to the extreme: [...]
If you invent new language implementations and change the normal definition of various concepts, I guess anything can be true in that particular frame of reference...
When you create a runtime for lightweight threads (as I have), what matters isn't what the programmer sees but what the runtime sees. POSIX threads and lightweight threads appear the same to the programmer, but behave very differently from the runtime's perspective, and require very different memory management (and very different scheduling) algorithms. What matters to the runtime is: how big each stack is, whether it's constant in size or can grow and shrink, and how often it's allocated and deallocated. It's those parameters that make the difference, not the user abstraction. The point of my example was to clarify that the stack abstraction can be used in different ways. Yes, POSIX thread stacks are also managed in some heap, but their usage parameters make them very different from lightweight thread stacks; the latter behave like a plain object would.
In Erlang, Go and Quasar, the lightweight thread stacks are managed very differently from how the OS manages heavyweight thread stacks, and this is because the two behave very differently despite exposing the same abstraction. You keep repeating that lightweight threads and heavyweight threads are the very same abstraction but what matters is the usage parameters, and that's why their implementation is so different.
BTW, this is true not only for memory management, but for scheduling as well. The scheduling algorithm for heavyweight threads simply does not work well for lightweight threads and vice versa. The reason is that the patterns of blocking and unblocking, and the number of threads -- i.e. the parameters of the behavior -- are radically different, in spite of the abstraction being exactly the same.
A metric for what? For it making many Go users very happy, or for making rather big claims like "future of GC"?
to me, that suggests they think they have the future of GC
I haven't internalized all the parameters of the Go GC, but I wonder if there's a way to tune it on the fly. For example, if I want to run at 60 FPS (16ms/frame), and I've measured that actually doing all the CPU side work took 10ms, can I let the GC have ~5ms of GC time (rather than the <1ms advertised) at the end of my frame to increase throughput (read: less cycles burned overtime & better power efficiency)?
If not in Go, can I do this elsewhere like on the JVM?
Hopefully this isn't too off topic (since the article is about trade-offs of throughput vs latency).
Metronome has pause time in the hundreds of microseconds range. The tradeoff it makes is that, amongst other things, it requires read barriers i.e. every time you read a pointer from the heap you have to execute some extra code to interact with the GC subsystem. If you think about how often you read pointers from the heap in a typical program, well, that turns into a lot of barriers and a corresponding execution performance hit.
Funnny, I remember years ago reading one of the patents the GC is probably based on, and thinking the entire thing was brilliant.
Also consider fan-out. GC pauses add to the long-tail of latency. If your service depends on many backend services, your overall latency may well depend on the maximum of all your backends.
1. Allocate all your structures up front.
2. Your main video processing loop (or compression, encryption, signal processing, etc) does not do any memory allocation.
This is also like writing games in Java.
Even if it still isn't quite here today.
Memory safe system programming languages with a GC by default, value types which can be allocated on global data/stack/registers, generic programming, ability to have more control over resource management than just finalizers, allocate memory outside GC control when performance requires it.
 .NET Native is not yet there, but is incrementally getting a few of the System C# features
I really take issue with this. Imagine all of the progress that would have been delayed or quite possibly not achieved had people not persevered against the traditional thinking. Yes, GC is a hard problem but just because something's been researched for a long period of time does not mean there is no room left for innovation. Breakthroughs come some people that think differently not from people that think it's all been done before.
Check chapter 5.
Just like sorting or graph navigation algorithms get discussed, for example.
However somehow we get this pop culture of GC vs RC instead.
As for the rest, RC is GC.
Check chapter 5 in one of the most widely academic accepted books about GC algorithms.
Technically true, but in reality it is just a token effort to please Hans Bohem and get him to focus on designing the C++ concurrent memory model instead :).
As of today, I don't think any implementation actually offers optional GC, so those functions are alway just trivially implemented.
In a pure mark/sweep GC, as the heap size goes up, the work needed to complete a GC cycle and free some memory goes up too. Thus the more memory you use the more cores you need.
In more complicated collectors that isn't necessarily the case: collection is incremental and done only as much as needed to keep up with the creation of garbage.
As I say in the article, there are reports of people running JVM/G1 on terabyte sized heaps. The problem they have is that if they experience a concurrent mode failure (which G1 is not designed to optimise because in a properly tuned system it's meant to never happen) you're looking at, like, a 5-10 minute pause time whilst the entire heap is mark/sweep and cleaned out as much as possible. So with such huge heaps you'd better make sure the collector doesn't fall behind!
(note: that article is about a rather old version but the graphs illustrate the concept)
This may not help you if you can't tolerate warmup latency spikes. In that case you need to do some GC tuning to tell the collector enough that it can get it right from the start.
While this is true for traditional implementations with separate heap regions, it is not necessary. One can easily represent new and old generations as linked lists in per-object overhead and just change pointers in such overhead to move objects between generations. There is at least one Smalltalk VM that works in this way and IIRC traditional BEAM (ie. non-HiPE) Erlang VM has GC that works in essentially same way.
Sure, it's easy but non-optimal. A side-effect of a moving generational garbage collector is heap compaction and better temporal locality.
- Allocating memory in a generational GC is as fast as allocating something on the stack. It's basically changing the value in a register. It's order of magnitude faster than calling `malloc()`.
- A generational garbage collection is slower than freeing a stack frame, but has a lower per item overhead than calling `free()`.
tl;dr: for most programs, generational GCs are the fastest at allocating dynamic memory.
Also not all GC algorithms imply stop the world.
Regarding RC, it still is a GC algorithm, and Herb Sutter had a nice presentation where he shown how a cascade release of resources with RC algorithms can lead to indeterministic release time and worse, stack overflow.
that's true, but it's also a moot point because std::shared_ptr is almost never the semantics you want. You want std::unique_ptr.
Also, you can get away without allocating memory dynamically in a lot of cases (thus not incurring overhead of malloc).
After C++11 and Rust I'm starting to think garbage collection is holding us back from doing research into better approaches to automatic memory management.
unique_ptr still has the malloc()/free() overhead, I'm assuming the parent poster just misspoke. One major issue with shared_ptr (which unique_ptr doesn't have) is that it must work properly when different threads have different shared_ptr's (copies of each other) pointing to the same object. This adds overhead.
> Also, you can get away without allocating memory dynamically in a lot of cases (thus not incurring overhead of malloc).
Ok, but we're specifically talking about GC and RC here, so I don't see why you'd bring that up.
> After C++11 and Rust I'm starting to think garbage collection is holding us back from doing research into better approaches to automatic memory management.
GC literally is automatic memory management. (And as others have pointed out, so RC is a form of GC.)
Also, there's lots and lots of research into various different ways of doing automatic memory management.
 Alright, they don't literally use malloc/free, but you know what I mean.
well, my point was, in garbage-collected languages you don't have the option to never even engage the garbage collector by creating variables on the stack, while in languages like C++ you can avoid the overhead of malloc/free by using stack variables.
> GC literally is automatic memory management.
It is, but there are other approaches to automatic memory management that don't introduce another entity that does stuff to your program at runtime. C++'s smart pointers is one such approach, see also Automatic Reference Counting and Rust's borrow checking. I wish there was more research done into compile-time techniques to prevent resource leaks.
Oh, I see. I can see what you mean, but there are solid techniques for avoiding GC in e.g. JVM-based langauges -- for an example see e.g. Disruptor/Aeron.
I'll happily grant that it's a problem that it can be quite hard to see if you're actually avoiding GC when using these patterns. It'd be interesting if there were a "@no-gc" annotation which could enforce such patterns on the source code. (So, essentially you'd have a compile-time "GC by default, but you can opt-out and the compiler will tell you when you violate that". One can dream.)
Many reference based GC's do that, e.g. perl. Or have I entirely misunderstood your comment?
Having said all that, pool allocators can work really well for certain workloads, especially client-server stuff. Apache uses a pool allocator for serving requests. Pool allocators also fit well with languages like C/C++, but that's just because those languages are very poorly suited to GC, there are much better designed languages which integrate better with modern GC.
Almost, but not always. The Cheney on the MTA  paper describes a generational GC where the first generation is the stack. So for collections on the first generation, collection is equivalent to clearing a stack frame.
The only special things about the stack per se are a) arena allocation and b) special ISA and OS support. When people say "stack allocation" the important points are (a) as well as c) static lifetime analysis. These are different concepts: Rust, for example, achieves (c) on the heap as well as the native stack.
 I.e., three-pointer, although for the stack the OS manages at least one of them
That depends on the underlying heap, Microsoft Low Fragmentation Heap is very fast for example.
The big plus with reference counting is that you (can) get a very attractive scoped release of resources.
YMMV, of course.
What I'm getting at is, Rust is the only modern language (in vogue at the moment) that does not use garbage collection. It would be nice if more languages didn't require a garbage collector but gave the option to use one.
If there are some more of such languages please chime in and provide a link to them! :)
Not really, as it doesn't allow/collect cycles.
> without the performance degradation
Only for single-threaded code.
I agree with the rest of your comment!
There are different kind of smart pointers and you gotta think what you want to do with them. A GC is easier.
In an ideal world, garbage collector makes sense because we don't want to have to worry about life-cycle management when we are solving other problems. Memory management can indeed be an annoying implementation detail, but as it stands there are certainly limiting factors in hardware which are impacted by generic object-management for specific application domains.
Which in real life means a single developer ownership codebase.
> in real life
They rather pay for outsourced consultants that deliver something that works within a fixed price contract.
(Small) teams of high quality engineers do exist, but are very expensive.
Given the rarity of a well thought out codebase, I'd say it is. As soon as multiple people work on something confusion always ensues.
We are seeing massively longer pauses in production and the suggestion is to go to 1.8, which is not ready.
Reminds me of ruby - religious people practicing on production to get the nirvana.
Yes, it's getting better, or at least less bad, but yeesh. We'll just waste resources architecting around long pauses, and rewrite critical stuff in C, for a few more years...
That's what I get for being non-tech and browsing HN!
Because I don't know a lot about garbage collectors, but I am familiar with Mikes amazing work on LuaJIT and I trust that he knows enough about what he's talking about (based on the quality of his luajit work) for enough of his claims to be true for this to be interesting, useful or both.
Note that while Mike's is strictly non-copying/non-compacting (because of Lua's API) mine does copy between generations and compacts the oldest generation. This makes arena management simpler and faster. I think that's the only significant difference between my implementation and Mike's design.
I suspect it's still a bit buggy, though it passes my test suite™. Non-trival GCs can be tricky. And to reiterate, it's not especially fast, because no work has gone into making it fast. (It's not slow either, because the design is good, but don't expect performance competitive with current production GCs².)
2. Well, it probably smokes Ruby and Python, but those go solidly in the ‘toy GC’ category.