I don't have high hopes for this algorithm, because my intuition is that transactions aren't common enough to make a big difference. Looking forward to seeing the implementation and empirical data.
This is kind of a spin on generational collection, built around the observation that a lot of Go programs are HTTP/RPC/whatever servers.
Indeed, it's one reason why I've assumed that Rust's memory model might turn out to work well for services (not having had a chance to try it): there's a clear entry point where the request begins and ends, and all of the memory allocated from that point onward can be discretely tracked and freed upon completion. In the services that I've built, the majority of the allocations are for these short-lived objects, with a relatively smaller amount being related to long-lived objects like connections to other services or caches. I imagine that websites and services might behave like this, though I haven't studied it carefully.
I'll be interested to see how transaction-specific optimization turns out, and whether it can do better than a general purpose generational collector.
* Transactions are integrated into Ur/Web at a deep level, so, whenever we run out of space, we can always abort the execution, allocate a larger heap, and restart.
* As a further optimization, we use region-based memory management, inferring a stack structure to allow freeing whole sets of objects at key points during execution.
Every request served by net/http#Server, net/rpc#Server, and similar gets its own goroutine; that is to say, its own transaction. Memory allocated within that transaction will be associated with it. This should make a positive difference for server software.
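For illustration (my own sketch, not from the proposal): the per-goroutine dispatch below is how net/http actually works, and the comments frame it in the proposal's terms.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	// net/http serves each incoming request on its own goroutine, so under
	// this proposal each handler invocation is its own "transaction":
	// everything allocated below dies with the goroutine unless it escapes.
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		buf := make([]byte, 4096) // per-request, short-lived allocation
		copy(buf, r.URL.Path)
		fmt.Fprintf(w, "served %s\n", buf[:len(r.URL.Path)])
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```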
Seems like the main advantage of this transaction-oriented collector over a typical generational collector is that it precisely identifies the old-to-young pointers and thus may scan fewer objects in total.
The differences I read into it are:
- There is one heap rather than a young heap and a tenured heap, but tenured objects are marked as such.
- Rather than moving an object from one heap to another, the write barrier tags an object as tenured when it is written into a tenured object (and presumably does the same for all of its children); see the sketch below. Not all collectors move objects to the tenured heap eagerly (as opposed to lazily at collection time), but I'm pretty sure it has been done.
- 'Young generation' collections happen at a specific time: when a goroutine terminates. They work more or less like a quick version of a semispace copying collector, in that everything not marked reachable is simply dropped.
So there is interesting stuff in there, and it seems like a very appropriate approach for an event-driven environment, but it still seems like a generational collector to me.
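For concreteness, here's a rough sketch in Go of the tenuring write barrier described in the second bullet; the names (object, writeBarrier, tenure) are mine, not the runtime's:

```go
package main

import "fmt"

// object is a hypothetical heap-object header; tenured marks objects that
// have escaped their creating goroutine's transaction.
type object struct {
	tenured  bool
	children []*object
}

// writeBarrier would run on every pointer store dst.field = src. If a
// young object is stored into a tenured one, it (and everything reachable
// from it) gets tenured, so the goroutine-exit sweep can't free memory
// that other goroutines may still see.
func writeBarrier(dst, src *object) {
	if dst.tenured && !src.tenured {
		tenure(src)
	}
}

// tenure transitively marks an object and its children as tenured.
func tenure(o *object) {
	if o == nil || o.tenured {
		return
	}
	o.tenured = true
	for _, c := range o.children {
		tenure(c)
	}
}

func main() {
	global := &object{tenured: true}
	local := &object{children: []*object{{}}}
	writeBarrier(global, local) // "publishing" local tenures it and its child
	fmt.Println(local.tenured, local.children[0].tenured) // true true
}
```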
Also, I'm confused how the separation of the goroutines is maintained in this heap. When a goroutine advances its current pointer, what happens if it bumps into another goroutine's initial pointer?
That's a big difference. One of the largest advantages, if not the largest, of a standard generational GC in practice is bump allocation in the nursery. If I understand correctly, because this GC doesn't unconditionally tenure surviving objects (it only tenures objects that cross thread boundaries), it loses fast bump allocation: live objects in the nursery can become noncontiguous.
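For reference, bump allocation is what makes contiguous nurseries fast: an allocation is a bounds check plus a pointer increment. A toy sketch (not Go's actual allocator):

```go
package main

import "fmt"

// arena is a toy nursery: allocation is a bounds check plus a pointer bump,
// which is why generational GCs with contiguous nurseries allocate so fast.
type arena struct {
	buf []byte
	off int
}

func (a *arena) alloc(n int) []byte {
	if a.off+n > len(a.buf) {
		return nil // nursery full: a real GC would collect or promote here
	}
	p := a.buf[a.off : a.off+n]
	a.off += n // the "bump"
	return p
}

func main() {
	a := &arena{buf: make([]byte, 1<<20)}
	b := a.alloc(64)
	fmt.Println(len(b), a.off) // 64 64
}
```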
First-fit seems like it could be quite slow here.
The implication seems to be that they can at least avoid locking for allocation (and deallocation and promotion) as long as they find a fit within the reserved region for the goroutine, but I'm still confused how they maintain this separation safely.
Don't know how much of a factor it is in this design, but it might also just be expensive to introduce heap compaction to Go now, after going this far with heap objects having stable pointers. There might be a lot of random spots in the code that still assume that.
Not that there aren't benefits to compacting, including bump allocation, less bloat as programs run a long time, and help with locality. Just describing where Go's at.
> I'm still confused how they maintain this separation safely
Don't know about this design, but in other places they have some goroutine-local data structures and then do the expensive thread coordination stuff after they run out of resources. Seems plausible that could be the general approach here too.
EDIT: Found the post.
> In the CLR allocation is blazing fast: just increment the heap head pointer. You don't have to take care of problems traditional allocators face: find a big enough fitting free hole. This has a cost: when objects are destroyed, the free memory on the heap becomes fragmented. This needs to be compacted, and this is where the GC pause is needed, as the references for moved objects need to be updated. Basically the cost of finding a free hole in memory at allocation was moved here.
> This was done intentionally, as studies found that most real world applications produce a load where most objects are short lived: this cheap allocation is needed, and a multigenerational GC can take care of them cheaply after destroyed (as for most of the time you only scan and free the youngest generation for unreachable objects, limiting the GC pause time).
I'm confused as to what this means. Write barriers, just as suggested here, are the standard solution for this issue in generational garbage collectors, and they precisely identify the old-to-young pointers as well. (In fact, the nursery and tenured generations are often separated in memory, so you can test just by masking a few bits off.)
I agree with the sibling comment that this seems to be a simple generational GC, just with a different policy for tenuring than the typical textbook one.
There isn't anything fundamental to generational GC that requires imprecision; it's an implementation detail. In fact, David Ungar's original generational GC paper from 1984 uses precise remembered sets, as does GHC's GC, among others.
My point was not that TOC is fundamentally different from generational collection.
That's a pretty similar effect and could probably be tweaked further to achieve something that separates objects into different regions based on transactions.
Note that I'm not saying that they should've used RAII along with existing GC, I'm just trying to distinguish the transaction concept and the RAII concept.
This basic distinction leads to completely different implementations.
There was a promising proposal document, with a couple of interesting goals, but I guess it was abandoned:
You can use either a fixed-size arena allocator, or one using alloca (you'd better know what you're doing), and still allocate those things on the stack.
In cases where you know the bounds of your input and your runtime constraints, you can get some really incredible performance boosts (multiple orders of magnitude) with those approaches.
Not quite. SSO is a library-specific optimisation. LLVM's libc++ has a 22-character SSO, recent MSVC and recent GNU libstdc++ have a 15-character SSO, and Apache and older libstdc++ didn't have SSO at all (and used COW).
The SSO-23 POC/proposal does provide a 23-character SSO in a 24-byte string structure (such as LLVM's); MSVC 2013 and modern G++ strings are 32 bytes, and SSO-23's scheme provides space for 31 characters in the same stack space.
(All capacities exclude the final NUL, as it's required and unusable outside of C compatibility.)
In other words, the allocations and deallocations are determined statically, they aren't static allocations in and of themselves.
Also, fun wrinkle: seems like occasionally, wrapping a slowish, mostly-self-contained, garbage-heavy function call in a goroutine could be a useful hint to this collector. It seems like it's only useful when 1) a transactional GC on the calling thread or full GC would really be slower (e.g. calling thread keeps lots of live data so GC on it takes more work), and 2) the launch/communication costs (expect at least µsecs) are small enough relative to gains. But interesting it's even possible for it to work out that way.
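A contrived sketch of that pattern (function names are made up; whether it wins depends entirely on points 1 and 2):

```go
package main

import "fmt"

// garbageHeavy is a stand-in for a slowish, mostly-self-contained call
// that produces lots of short-lived allocations.
func garbageHeavy(n int) int {
	sum := 0
	for i := 0; i < n; i++ {
		s := make([]int, 128) // dies almost immediately
		s[0] = i
		sum += s[0]
	}
	return sum
}

func main() {
	// Running the call on its own goroutine hands its garbage to that
	// goroutine's transaction; when it exits, the collector can sweep
	// the lot without scanning the caller's (large, live) heap.
	res := make(chan int, 1)
	go func() { res <- garbageHeavy(1000000) }()
	fmt.Println(<-res)
}
```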
The part I'm saddest about is that because some of the goroutines abuse involved literally building in a race condition, the aggressive garbage collection made the resulting output less artistically interesting (on top of taking an hour to complete instead of a few minutes).
If you just want nondeterminacy, I'd suggest using math/rand directly, not accidents of scheduling... which, as was just proved, can disappear as quickly as they appear. In fact, it doesn't even take something as serious as a language upgrade for that to happen; that kind of indeterminacy can disappear if you so much as add a member to a struct or have another process on your system eating 100% of one CPU. It's not reliable, not even a little.
As to scheduling v. random functions, much of what I like about the original is that the scheduling is completely out of my hands and it's not guaranteed to be consistently distributed or reproducible. It's as though the computer were making its own decisions.
I have a project started to rewrite the whole thing in C++ using random, relying entirely on Manhattan distance estimations for speed, but working with C++ is incredibly unpleasant and difficult to justify when I need to find a real job.
More recently, I wrote a new routine which generates several of those and then averages the RGB channel values on a per-pixel basis, and then repeatedly re-averages the original images with those average images to create more "generations." I haven't written anything up about that because the code required to sum and average pixels in a PNG is significantly less interesting.
The essence being to take advantage of some form of isolation in a concurrent system to dynamically infer something that looks like scoping/lifetimes.
Is anyone using golang with a different language like C, Rust, etc.? Are you using it through linkage or through microservices?
What libraries written in a different language are you using with golang? What does your company / app / system / software do?
Do you recommend using libraries written in a different language, or using equivalent golang libraries?
But sometimes objects will survive the closing brace.
When considered as a 'cache', the entire heap is a fixed-size cache (LRU, or better, 2Q to deal with ephemeral objects). The actual memory object is at an offset in a mapped segment. (If mapped to a file, then voila, persistent memory.)
As objects are created, they are immediately placed in the 'process cache'. So, to take the example of the OP's 'transactional' memobjs, these are cache entries that get promoted to the LRU section of a 2Q, get utilized, and then are naturally 'garbage collected' by eviction.
This scheme is not universal. Very long lived processes that sporadically access 'ancient' objects will stress the VM/FS. One can extend the temporal range of this scheme by employing a set of rotating FS-backed memory segments, but of course there will be a limit (e.g. your FS).
Does that clarify the scheme?
[p.s. of course one can use a leveled cache hierarchy, with the optimal realization mapping to L1, L2, and wired memory. But a basic version can just use a single-level 2Q cache.]
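If it helps, here's a minimal single-level sketch of the 2Q idea in Go (heavily simplified: a FIFO probation queue plus an LRU main queue, with eviction playing the role of collection):

```go
package main

import (
	"container/list"
	"fmt"
)

// twoQ is a toy 2Q cache: new entries sit in a FIFO probation queue
// (catching ephemeral objects), and entries touched again are promoted
// to the LRU main queue. Eviction from either queue is the "GC".
type twoQ struct {
	cap       int
	probation *list.List               // FIFO of keys
	main      *list.List               // LRU of keys
	where     map[string]*list.Element // key -> element
	inMain    map[string]bool
	vals      map[string]int
}

func newTwoQ(cap int) *twoQ {
	return &twoQ{cap: cap, probation: list.New(), main: list.New(),
		where: map[string]*list.Element{}, inMain: map[string]bool{},
		vals: map[string]int{}}
}

func (c *twoQ) Get(k string) (int, bool) {
	e, ok := c.where[k]
	if !ok {
		return 0, false
	}
	if c.inMain[k] {
		c.main.MoveToFront(e) // LRU touch
	} else {
		// second access: promote from probation to main
		c.probation.Remove(e)
		c.where[k] = c.main.PushFront(k)
		c.inMain[k] = true
	}
	return c.vals[k], true
}

func (c *twoQ) Put(k string, v int) {
	if _, ok := c.where[k]; ok {
		c.vals[k] = v
		return
	}
	if len(c.vals) >= c.cap { // evict: probation first, ephemerals die young
		q := c.probation
		if q.Len() == 0 {
			q = c.main
		}
		dead := q.Remove(q.Back()).(string)
		delete(c.where, dead)
		delete(c.inMain, dead)
		delete(c.vals, dead)
	}
	c.where[k] = c.probation.PushFront(k)
	c.vals[k] = v
}

func main() {
	c := newTwoQ(2)
	c.Put("a", 1)
	c.Put("b", 2)
	c.Get("a")    // promotes "a" to the main queue
	c.Put("c", 3) // evicts from probation ("b"), not the promoted "a"
	_, okA := c.Get("a")
	_, okB := c.Get("b")
	fmt.Println(okA, okB) // true false
}
```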
Sadly no. I'm sure your explanation was good, but my familiarity with GC and low-level memory management is nominal at best. Thanks for trying. :)
Caches are bugs waiting to happen.
My late friend Alain Fournier once told me that he considered the lowest form of academic work to be taxonomy. And you know what? Type hierarchies are just taxonomy.
Possibly Mr. Fournier meant 'foundational' by "lowest". If not, he may wish to revisit the foundational work performed by "lowly" academics such as a Mr. Darwin in the 19th century.
[p.s. we need to alert the CPU designers that they have been cluelessly using caches all these years. If only they had higher guidance from an infallible mind..]
Corollary: there will always be programming jobs :-)
Both systems have active runtime sub-systems: GC, scheduler, and a compiler front-end.
Java, of course, has a virtual machine and intermediate representation (byte codes), and a JIT to make it performant. The runtime class loading mechanism is a key +input here. The JIT aspect is a mix, with the -input being the upfront startup cost.
Go has far better C linkage, access to underlying memory object image, and provides the (glaringly missing from JVM) unsigned integer types.
Java has a more modern type system and first-class metadata facilities, with the -input being "more rope". Go has an 'interesting' type system, with the -input being source-level fragility, but the +input here is "less rope".
Having extensively worked with concurrent systems in both languages, the distinction ultimately boils down to whether language-level continuations (fibers/green threads), as provided by Go, versus library/bytecode-engineering approaches (various Java libs), is a critical requirement.
I honestly think it should be clear at this point that Go is really displacing Python and the notion of Go vs. Java is a false dichotomy.
This is changing. There already are 3rd-party FFI libraries for java which are far less cumbersome to use than JNI and in the future there will be Project Panama which will provide that out of the box.
Excellent point. This doesn't seem to be brought up very often (or at least it's rarely emphasized) in Java/Go comparisons.
It's probably also less CPU-efficient than Azul's, since it uses forwarding pointers instead of page-table trickery. But if you have such a large heap, you'll also have many cores. Here's a performance comparison to G1.
> Java creates a lot more garbage.
Very true, Go is great about keeping things on the stack.
Also, sync.Pool helps a lot when used as needed.
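For example, the classic buffer-reuse pattern:

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool recycles buffers across requests, so hot paths stop producing
// garbage for the collector in the first place.
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

func handle(payload string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	defer func() {
		buf.Reset() // must reset before returning to the pool
		bufPool.Put(buf)
	}()
	buf.WriteString("processed: ")
	buf.WriteString(payload)
	return buf.String()
}

func main() {
	fmt.Println(handle("hello"))
}
```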
"It is intended that programs written to the Go 1 specification will continue to compile and run correctly, unchanged, over the lifetime of that specification. At some indefinite point, a Go 2 specification may arise, but until that time, Go programs that work today should continue to work even as future "point" releases of Go 1 arise (Go 1.1, Go 1.2, etc.)."
Granted that means it is effectively a freeze on backwards incompatible changes, but it does leave the door open for additions to the language.
Having said that, the go authors are definitely very conservative with language changes.
"Not everything was fixed, though. We're still learning (but the language is frozen for now). "
For example, there's a mess of different error handling techniques (e.g. err == io.EOF vs os.IsNotExist(err) vs perr, ok := err.(*os.PathError); ok).
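For instance, all three idioms side by side (a contrived but runnable sketch):

```go
package main

import (
	"fmt"
	"io"
	"os"
	"strings"
)

func main() {
	// Style 1: sentinel comparison.
	r := strings.NewReader("")
	if _, err := r.Read(make([]byte, 1)); err == io.EOF {
		fmt.Println("sentinel: got io.EOF")
	}

	// Style 2: predicate helper.
	if _, err := os.Open("/no/such/file"); os.IsNotExist(err) {
		fmt.Println("predicate: os.IsNotExist")
	}

	// Style 3: type assertion on a concrete error type.
	if _, err := os.Open("/no/such/file"); err != nil {
		if perr, ok := err.(*os.PathError); ok {
			fmt.Println("assertion:", perr.Op, perr.Path)
		}
	}
}
```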
This also prevents the language from evolving, as can be seen in the case of the context package (see 'Frozen' in this post https://github.com/golang/go/issues/14660#issue-138695431 )
While the above poster didn't explain his problem clearly, it is a valid complaint.
Example of one of them: https://play.golang.org/p/pku8VJteH1
That stupid issue means that packages like os return 'error' even when the value will always be of type '*PathError', so you have to be careful to note whether the package you're using screwed up by returning error structs when you do `if err != nil`; otherwise you'll compare a nil interface against a nil struct... not what you want.
This hinders the ability to make richer error types somewhat.
In addition, it means you always have to read the docs, because the type system cannot safely represent the possible errors you can get there.
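A minimal reproduction of that trap (my own example; the linked playground presumably shows something similar):

```go
package main

import (
	"fmt"
	"os"
)

// bad returns a concrete *os.PathError instead of the error interface.
func bad() *os.PathError {
	return nil // "no error"
}

func main() {
	var err error = bad() // nil *os.PathError boxed into a non-nil interface
	if err != nil {
		fmt.Println("looks like an error, but the value inside is nil!")
	}
}
```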
Second, this is bad because of places in the language (like io.Reader.Read) where a call returns two values (numRead, err) and both are useful.
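For context, a correct Read loop has to consume n before acting on err, since both can be meaningful on the same call:

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

func main() {
	r := strings.NewReader("hello, world")
	buf := make([]byte, 5)
	for {
		n, err := r.Read(buf)
		if n > 0 {
			fmt.Printf("read %q\n", buf[:n]) // use the data first
		}
		if err == io.EOF {
			break // err is only terminal after processing the n bytes
		}
		if err != nil {
			panic(err)
		}
	}
}
```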
Due to the lack of sum types, you can't actually tell in multiple-returns of errors whether the results are mutually exclusive or not. What's more, without sum types, there's no way to express to the type system correctly that you indeed are in a non-nil branch of a value or so on.
Finally, there are no macros or other ways in the language to have more expressive error checking than the brain-dead "if err != nil". Sure, what Go has is better than Java's terrible implementation of exceptions, but it's so much worse than what you get in Rust or OCaml or so on.
Objects that can be determined to stay within a local scope get allocated on the stack instead of a heap, so cleanup is very efficient.
The difference with Go is that in Swift the whole decision is made at compile time, whereas in Go objects that escape are reclaimed by the GC at runtime.
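For what it's worth, Go's compile-time escape decisions are visible with `go build -gcflags=-m`; a sketch:

```go
package main

// Build with `go build -gcflags=-m` to see the compiler's escape decisions.

type point struct{ x, y int }

// stays on the stack: p never escapes the frame, so cleanup is free.
func sumLocal() int {
	p := point{1, 2}
	return p.x + p.y
}

// escapes to the heap: the pointer outlives the frame, so the GC owns it.
func escapes() *point {
	p := point{1, 2}
	return &p // -gcflags=-m reports "moved to heap: p"
}

func main() {
	_ = sumLocal()
	_ = escapes()
}
```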
Does anyone feel confident that a 'good enough for everyone' GC will ever be produced? If not, surely the language designers should settle on picking a type of program that they will optimise for, and sticking to it.
No, they are not. The optimization discussed here adds something to the existing GC, and removes nothing.
> Does anyone feel confident that a 'good enough for everyone' GC will ever be produced?
The Go GC is constantly improving, and its performance in Go 1.6 and 1.7 is considered very good by most users.
Without full memory control you're always going to be making trade-offs in terms of performance. Hopefully the GC is "good enough" for your use case, but there will never be a one-size-fits-all GC.
In almost all practical programs the new GC (in 1.5) was a significant win, or at worst no change.
Also true of Ruby or PHP. Which perform abysmally.
"Most users" needs aren't what we look for to define technical excellence - if anything, I most often see "most users" used to guide the compromising of technical standards.
Go does target "most users". That's the demographic. There will always be edge cases for which the general design does not work.
I personally like the "ARC" approach of ObjC and Swift, where the compiler statically adds and optimizes reference-counting routines, promotes objects to the stack when possible, and requires the programmer to manually resolve cyclic references (and Xcode supports nice tools to debug and visualize memory in order to detect unintentional cycles). Accidental memory cycles are rare enough and quite debuggable, at the very small mental cost of marking some references as "weak" or "unowned". And the long-term performance and memory-exhaustion issues caused by GCs are completely bypassed.
There was a Swift talk about exactly this: struct vs. reference types, and making use of COW data structures to reduce ARC impact.
Probably not, GC is always full of tradeoffs. That's why the JVM has a bunch of tuneable GCs, and there are third-party JVMs with their own strategies and GC tradeoffs (e.g. Azul).
A while ago, I had an idea for a "tidal" collector. If you look at what a thread in a typical server application is doing, it's in a loop where it picks up some task (a network request, a job from a queue, etc.), then goes and does a load of work on it. If you plot the stack depth over successive loops, you should see a sort of rising and falling tide: low tide when the task is picked up, and high tide during processing. Along with that rising and falling tide, you should get a rising and falling volume of live objects. Because garbage collection has a cost proportional to the number of live objects, for best efficiency you want to collect when there are the fewest live objects. Which should be low tide.
To make use of this idea (which might be false, but at least sounds plausible), you'd need two things: a layout where each thread has its own memory (at least, enough of its own memory to hold all the objects that escape eden but become garbage during a request cycle; you might want a shared old generation), and some magic device which watches the stack depth and identifies the moment of low tide.
G1 might provide the right layout. I think every thread gets its own eden region; if you could tweak it so that each one promotes into its own survivor regions, and then find the survivor regions for a given thread, you would just need to decide when the right time to collect is. You might be able to do that on the back of the dynamic profiling infrastructure already in the JVM; you would want to sample each thread every now and then to identify the method, even the particular line of code, where low tide happens, and then compile in a GC trigger there.
I am a humble application programmer, so the only work I ever did on this idea was maybe telling people about it on newsgroups or forums (hi everybody!). I'd be amazed if nobody else has thought of it, but I've never heard of anyone trying it. Perhaps it's just a terrible idea.
The proposed Go design sounds really quite like this. With a huge advantage: because Go tends to use these fire-and-forget goroutines, the problem of identifying low tide goes away, and you can just collect when the goroutine dies.
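A crude way to play with the low-tide trigger today (a sketch only; runtime.GC() forces a full collection, which is much blunter than the per-thread collection the idea calls for):

```go
package main

import "runtime"

// handle is high tide: deep stacks and lots of live temporaries.
func handle(task int) {
	work := make([]byte, 1<<16)
	work[0] = byte(task)
}

func main() {
	tasks := make(chan int)
	go func() {
		for i := 0; i < 100; i++ {
			tasks <- i
		}
		close(tasks)
	}()
	for t := range tasks {
		handle(t)
		// Low tide: the task's temporaries are dead, so a collection
		// here has the least live data to scan.
		if t%10 == 0 {
			runtime.GC()
		}
	}
}
```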
Some of them have quite good escape analysis algorithms, which are somewhat similar to this.
The right answer is that Java's generational collector supports this transaction lifetime reasonably well in many cases.