Does this have much effect on Kandria’s latency results? Shirakumo tries to put a positive spin on things, but her results show that SBCL’s existing GC isn’t up to the task for real-time games (or many networked apps, for that matter).
As an aside, thanks for your efforts. Adding a new GC is a major effort, which improves things for all of us! :D
I have yet to benchmark it again; my setup was sloppy, as I had to run the same part of the game over and over because record/replay didn't work. (Nor do I have the time and energy right now; I have to survive the last semester of university.) But I suspect it should help, by reducing the sweeping and scavenging times, which were pretty bad.
I'm not sure I agree with the comment about "up to the task for real-time games", but maybe my standards for video games are low, and I have an abnormally fast machine, so I'm likely biased there too.
Rightly or wrongly, 60fps is becoming what 30fps used to be, and gamers can tell the difference (as can normal people with VR). It is what it is.
This latency problem is why I’m dabbling with raylib and LFE of all things. The BEAM cannot hold a candle to SBCL for performance code, but the GC latency can be managed better.
Anyway, I don’t want to go on a tangent, since I really appreciate your efforts; there are plenty of domains that the new SBCL GC will be a big deal for.
I wish I had anything interesting to say. I'm currently experimenting with it to see if it's a useful approach to build a game that can sustain a solid 90fps.
Just pushing triangles isn't a problem (so far), but I'm still unclear how much of the engine logic will need to live in a low-level language versus within LFE. The BEAM also presents certain architectural possibilities (especially for simulation) that are completely atypical for games.
It's promising enough that I'll continue pursuing this; we'll see where I end up in a few more months. Sorry I can't give more meat than that!
I would be very interested to hear about the architectural possibilities of the BEAM you mentioned. Also, the speed of the BEAM is not as bad as is often claimed, definitely not once you start looking at problems/tests with lots of inherent concurrency, and the sequential speed has markedly improved with the new AsmJit-based JIT.
Problems with GC also basically don't exist when running Erlang/LFE code; you'd have to put a lot of effort into creating GC problems. A lot! And, of course, the BEAM does automatic load balancing over multiple cores, which makes another class of problems go away: we never have to think about things like "I have my system set up to run on 2 cores, how should I modify it to run on 4?".
Can I ask a newbish question: will this new GC be available on all OSes and CPU architectures, or only on some? I don't see anything in the paper about it being limited to any particular platform, so my hopes are high :).
It should be pretty portable; porting is just a matter of time and interest. The main thing is that the compiler needs to use a (more precise, lower-overhead) software write barrier rather than the (coarser, slower) hardware write barrier that has been around the longest. I also have to wonder how well the bit-scanning algorithms fare on big-endian, or is that taking "all CPU architectures" too far?
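To illustrate the distinction (a conceptual sketch only, not SBCL's code; the card size and table layout are made up for illustration): a software write barrier means every pointer store into the heap also dirties one entry in a side table, so the collector scans only the dirty cards, instead of write-protecting pages and catching faults the way the hardware barrier did.

    ;; Conceptual sketch of a card-marking software write barrier -- not
    ;; SBCL's implementation.  Offsets are from the heap base, the 512-byte
    ;; card size is assumed, and the table covers at most 2^16 cards.
    (defconstant +card-shift+ 9)
    (defparameter *card-table*
      (make-array (expt 2 16) :element-type 'bit :initial-element 0))

    (defun note-pointer-store (heap-offset)
      ;; The extra store the compiler emits alongside each pointer write:
      ;; mark the card containing HEAP-OFFSET so the GC only scans dirty
      ;; cards for old-to-young pointers.
      (setf (bit *card-table* (ash heap-offset (- +card-shift+))) 1))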
The old gencgc was pretty cool for the single core era, and it sounds like it still holds up well. If I recall correctly, it was based on the Bartlett Mostly Copying paper, which is an elegant and practical GC design. https://www.hpl.hp.com/techreports/Compaq-DEC/WRL-TN-12.pdf. I miss these old papers that described this stuff in a way you didn’t have to be a math major to understand. I think the first version of that paper had the C++ code as an appendix: https://www.hpl.hp.com/techreports/Compaq-DEC/WRL-88-2.pdf.
Clarity in your technical communications matters. The Immix papers are similarly well written and clear. I don’t think it’s a surprise that both GC designs have also been independently implemented over and over. The Chaitin-Briggs register allocator is another example where I’d attribute at least some of the success in widespread industrial implementation to Briggs’ excellent and approachable PhD thesis describing the algorithm: https://www.cs.utexas.edu/users/mckinley/380C/lecs/briggs-th...
Good to know, it never sat right with me that Guile depended on BDW-GC for so long. That's what you use when you are implementing a toy language and don't have time to write your own GC.
It was never the biggest hurdle for guile performance. I do believe guix has the potential to hit the GC wall, but for most programs I think it has served guile well.
With that said, the upcoming immix collector is really friggin exciting.
> Clarity in your technical communications matters.
Indeed; it does.
I wish more people watched that Steele talk where he speaks about the importance of using accessible language. One-syllable words all the way through would perhaps make for a bit of a tedious read, but he makes a very good point about clarity in connection with simplicity.
(As the paper was posted in a more accessible form a few hours ago (thanks slyrus!), I reupped that submission and merged the comments from https://news.ycombinator.com/item?id=37295611 hither.)
OMG! I read the SBCL release notes every month, but when you put together recent highlights like this, I'm blown away. Android binaries? True static executables? Yes, please!
This is impressive progress, especially for a language so many have written off as moribund. And that's just one implementation of the spec. ECL, ABCL, and Clasp are all progressing at a brisk pace too. Maybe the only good language is a dead language.
I cannot emphasize enough how much of a game changer SB-SIMD has been for writing performant numeric code in SBCL. Being able to experiment with AVX intrinsics in a REPL has been quite the experience for me.
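For a flavor of that workflow, here is a minimal sketch; it assumes sb-simd is loaded, an AVX-capable CPU, accessor names as exported by the sb-simd-avx package (these may differ between sb-simd versions, so check your installation), and an array length that is a multiple of 4:

    ;; Sketch: add two double-float vectors four lanes at a time.
    (defun vadd! (dst a b)
      (declare (type (simple-array double-float (*)) dst a b)
               (optimize (speed 3) (safety 0)))
      (loop for i of-type fixnum below (length a) by 4
            do (setf (sb-simd-avx:f64.4-aref dst i)
                     (sb-simd-avx:f64.4+ (sb-simd-avx:f64.4-aref a i)
                                         (sb-simd-avx:f64.4-aref b i))))
      dst)

    ;; REPL loop: evaluate, (disassemble #'vadd!) to see the vaddpd
    ;; instructions, tweak declarations, re-evaluate -- no restart needed.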
Does anyone know how difficult implementing a real-time GC would be for SBCL or ECL? I know of that paper by Rodney Brooks (L -- A Common Lisp for Embedded Systems).
From a technical (i.e. useless) point of view, SBCL very nearly already has a real-time GC; with a few modifications it would qualify, since already:
1. The amount of work in a GC operation is bounded by the heap size
2. SBCL has a fixed maximum heap size
Two things would need to be done:
1. Ensure the areas where GC is inhibited are bounded
2. Call GC operations on a timer, and when they are done, ensure there is sufficient free space; RTGC cannot exist without specific bounds on the mutator, so you could almost certainly invent bounds on the mutator that would make the gencgc qualify as real-time.
#1 would need to be done for any actually useful RTGC anyways.
A slightly less snarky answer is that a non-toy GC is a lot of work. Different choices in GC will affect code generation (e.g. read and/or write barriers, and forwarding pointers for incremental GC[1]).
There's a reason the gencgc has been around as long as it has, and it's because it's "good enough" for a lot of people and the work needed to replace it (with any non stop-the-world GC anyways) is rather large. Even TFA is neither incremental nor concurrent, just parallel.
1: Stop the World GCs may also use forwarding pointers, but codegen doesn't need to know about them because they are fully resolved before the mutator continues.
> From a technical (i.e. useless) point of view SBCL very nearly already has a real-time GC
If your technical (theoretical) definitions are useless, you need to use different definitions. A practical definition of “real-time” GC is a minimum mutator utilization bound for any program given sufficient RAM, but the only GC I’m aware of that does this (despite the abundance of papers on GC with “real-time” in the title) is Klock’s regional collector[1]. Unfortunately, it seems to be impractically complex. [To be fair, I don’t think any implementation of malloc() in common use would satisfy this constraint either.]
The customary way to improve GC performance in Common Lisp is to avoid consing. Which, incidentally, is the customary way to write high performance C code too. Allocate your memory in a piece up front and arrange the data in a way that's friendly to your architecture and then do stuff. Nice thing about Common Lisp is that DISASSEMBLE makes it a fairly smooth process to check that the optimizer is actually doing what you want.
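A small sketch of that workflow (a hypothetical function, nothing SBCL-internal): declare types, keep the temporary on the stack, and then check the generated code.

    ;; Sketch: a hot path that avoids heap allocation.  The temporary
    ;; accumulator is declared DYNAMIC-EXTENT so the compiler may
    ;; stack-allocate it instead of consing.
    (defun centroid (points)
      (declare (type (simple-array double-float (*)) points)
               (optimize (speed 3) (safety 1)))
      (let ((acc (make-array 3 :element-type 'double-float :initial-element 0d0)))
        (declare (dynamic-extent acc))
        (loop for i of-type fixnum below (length points) by 3
              do (incf (aref acc 0) (aref points i))
                 (incf (aref acc 1) (aref points (+ i 1)))
                 (incf (aref acc 2) (aref points (+ i 2))))
        (values (aref acc 0) (aref acc 1) (aref acc 2))))

    ;; (disassemble 'centroid) -- look for calls into the allocator; if the
    ;; MAKE-ARRAY became a stack allocation, the loop runs without consing.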
And the new SBCL release aids this approach by expanding the cases where the compiler will stack allocate.
> If your technical (theoretical) definitions are useless, you need to use different definitions...
The question of "what is a real-time GC" is academic because nobody actually wants a real-time GC. They want a real-time system. And some GC algorithms can be used to construct a real-time system. Indeed, as others have noted, SBCL can be used to construct a real-time system if you don't do any allocations.
As far as I know, there is a better definition that is commonly used.
A soft real-time system is one in which every instruction that the programmer writes has a known, well-defined maximum time bound within which it executes. This can be achieved in a GC'd system if the GC pause time also has a known, well-defined bound, and if it is well defined which instructions can or can't be paused by the GC for that (maximum) amount of time.
This constraint is important for building systems because it allows the programmer to guarantee that certain parts of their program will fit within a given time bound, such as ensuring that they fit in the window for rendering a frame at 60 FPS, or in the per-packet processing budget needed to sustain 1G PPS, etc.
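As a sketch of how such a bound gets used (a hypothetical frame loop; the worst-case pause figure is something you'd measure for your own application, and sb-ext:gc is SBCL's standard way to trigger a collection): if the pause is known to fit in the slack at the end of a frame, you can trigger the collection there instead of letting it land mid-frame.

    ;; Sketch: only start a collection when enough of the ~16.7 ms frame
    ;; budget remains to absorb the (assumed, per-game) worst-case pause.
    (defconstant +frame-seconds+ (/ 1d0 60))
    (defconstant +worst-case-gc-seconds+ 0.004d0)

    (defun run-frame (update-fn render-fn)
      (let ((start (get-internal-real-time)))
        (funcall update-fn)
        (funcall render-fn)
        (let* ((elapsed (/ (- (get-internal-real-time) start)
                           internal-time-units-per-second))
               (slack (- +frame-seconds+ elapsed)))
          ;; Collect only when the pause provably fits in this frame.
          (when (> slack +worst-case-gc-seconds+)
            (sb-ext:gc)))))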
The commercial Java implementation JamaicaVM claims to offer such a soft real-time guarantee, with maximum GC pause times on the order of microseconds. They have a paper about it published by the ACM [0]. I haven't read the paper myself or used their product in any way; I'm only presenting their claims.
Hard real-time systems require a fixed time per operation, which would allow one to guarantee that, say, a given loop executes some operation at a precise frequency, which may be critical for things like analog signal processing. This is typically much harder to achieve in a GC'd system.
> A practical definition of “real-time” GC is a minimum mutator utilization bound for any program given sufficent RAM
I'm going to push back against that definition since:
1) RAM is usually bounded
2) A "never free" allocation strategy meets this requirement.
3) The "minimum mutator utilization bound" needs to be specified for a given time slice anyways; e.g. "GC takes no more than 50% of any 10ms window" rather than just "GC takes no more than 50% of the time overall." Throughput-optimized GCs already keep the GC's share of time very low over a long enough window; it's narrow windows that they struggle with.
I think a practical definition gives minimum bounds for:
M: mutator utilization (as a fraction) over a narrow time window T (where "narrow" depends on the problem domain; anything from 10us to 100ms can be reasonable)
O: heap size overhead (as a greater-than-unity multiple of the live data)
and conditions them on the peak allocation rate R and the heap size S, each of which may have some reasonable maximum value.
If you pick non-reasonable values for R, T, and S, then "obviously not realtime" GCs can be made to qualify. If you don't condition your GC on R, T, and S, then actual shipping real-time systems using GC don't qualify (e.g. Metronome/Staccato-based systems).
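To make the definition concrete (my own illustration, not taken from any of the cited systems): given a log of GC pauses, the MMU for window width T is the smallest fraction of any T-wide window left to the mutator. A crude, discretized version:

    ;; Sketch: approximate minimum mutator utilization for window WIDTH,
    ;; given PAUSES as (start . end) pairs in seconds over [0, TOTAL-TIME].
    ;; The sliding window is discretized by STEP, so this is approximate.
    (defun pause-overlap (pause start end)
      (max 0 (- (min (cdr pause) end)
                (max (car pause) start))))

    (defun mmu (pauses width total-time &key (step 1/1000))
      (loop for start from 0 to (- total-time width) by step
            for end = (+ start width)
            for paused = (loop for p in pauses sum (pause-overlap p start end))
            minimize (/ (- width paused) width)))

    ;; Example: a single 5 ms pause in a 1 s run gives an overall mutator
    ;; utilization of 99.5%, but (mmu '((0.1 . 0.105)) 1/100 1) is ~0.5 --
    ;; which is exactly the narrow-window point being argued here.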
> A "never free" allocation strategy meets this requirement.
Yeah, I could have put that better. “Sufficient” probably needs to be defined as a reasonable function of the program’s working set—I’d have said “independent of the program” instead of “reasonable”, except that’d make any non-compacting malloc necessarily non-real-time, as the (Robson) bound[1] there has to depend not only on the working set but also on the ratio between the smallest and largest allocations. On the other hand, maybe forbidding non-compacting mallocs is actually reasonable, because that bound is impractically large anyway.
> The "minimum mutator utilization bound" needs to be at a given time-slice anyways; e.g. "No more than 50% utilization in any 10ms window" rather than just "No more than 50% utilization."
No, the “minimum” in the name refers to a minimum (proportion of time) over all windows. See link in GP and the thesis referenced there for a discussion of the dependency on the window size (which is a bit funny for all allocators).
> No, the “minimum” in the name refers to a minimum (proportion of time) over all windows. See link in GP and the thesis referenced there for a discussion of the dependency on the window size (which is a bit funny for all allocators).
That's just false; I can't think of a GC that can maintain more than 0% utilization in a time window the length of a compare-and-swap operation (which could be hundreds of cycles on some CPUs).
A pattern I've noticed is that in languages with garbage collection, the garbage collection aspect is never finished. It is always a pain point and people are constantly waiting for promised future features and refinements.
There are still pauses, but the pauses are shorter because multiple cores are used to collect, which is nice for both throughput- and latency-sensitive apps. No-pause ("on-the-fly") collectors exist, but sufficiently-short-pause ("concurrent") collectors still do wonders, with sub-millisecond latency.
The new GC is not concurrent, it is still a stop-the-world collector. The improvement is that it can do the collection and compaction work in parallel on a multi-core machine.
As such, it will greatly increase throughput if given multiple cores, and will lead to lower pause times based on this improvement in throughput.
The mark-region GC has worse throughput when the heap is really large relative to the live set (as the cost model doesn't work out), and always had* worse latency on Kandria. But generally, throughput with more than two cores tends to be better.
* I expect latency to be better with compacting, but I haven't benchmarked, so I haven't a clue.
No, it simply means that it can better utilize multiple cores to do the GC work faster, so pauses will be proportionally shorter. With the current SBCL collector (or rather the old one, since this new one is already released), even if you ran on a 16-core system, the GC would still only use a single core to reclaim memory (in practice, to move some or all of your live objects to a different address, as it was a copying collector).
Steel Bank Common Lisp 2.3.8 released: “a mark-region parallel GC is available” - https://news.ycombinator.com/item?id=37295611