I love this opinion from games programmers because they never qualify it or talk about what their latency budgets are and what they do in lieu of a garbage collector. They just hand wave and say "GC can't work". The reality is you still have to free resources, so it's not like the garbage collector is doing work that doesn't need to be done. What latency budgets are you working with? How often do you do work to free resources? What are the latency requirements there? Even at 144 fps, that's about 7ms per frame. If you have a garbage collector that runs in 200us, you could run a GC on every single frame and use less than 3% of your frame budget on the GC pause. I'm -not- suggesting that running a GC on every frame is a good idea or that it should be done, but what I find so deeply frustrating is that the argument that GC can't work in a game engine is never qualified.
edit: wow for once the replies are actually good, very pleased with this discussion.
I don't think it's necessarily that it "can't" work as much as it takes away a critical element of control from the game developers and the times you find yourself "at the mercy" of the garbage collector is pretty damn frustrating.
The problem we ran into with garbage collection was that it was generally non-deterministic both in terms of when it would happen and how long it would take. We actually added an API hook in our JS to manually trigger GC (something you can do when you ship a custom JS runtime) so we could take at least the "when it happens" out of the picture.
That said, there were often fairly large variances in how long it would take and, while frame time budgets may seem to accommodate things, if you end up in a situation where one of your "heavy computation" frames coincides with a GC that runs long, you're going to get a nasty frame time spike.
We struggled enough that we discussed exposing more direct memory management through a custom JS API so we could more directly control things. We ultimately abandoned that idea for various reasons (though I left over 6 years ago so I have no idea how things are today).
This is basically it. People who have never worked on actual real-time systems just never seem to get that in those environments determinism often matters more than raw performance. I don't know about "soft" real-time (e.g. games, or audio/video) but in "hard" real-time (e.g. avionics, industrial control) it's pretty routine to do things like disable caches and take a huge performance hit for the sake of determinism. If you can run 10% faster but miss deadlines 0.1% more often, that's a fail. It's too easy for tyros to say pauses don't matter. In many environments they do.
The software was shipped with a memory map file so you know exactly what each memory address is used for. A lot of test procedures involved reading and writing at specific memory locations.
It was for avionics BTW. As you may have guessed, it was certified code with hard real time constraints. Exceeding the cycle time is equivalent to a crash, causing the watchdog to trigger a reset.
I think they posted the guidelines and it was a wonderful read about how they developed real time systems.
It seems extremely unlikely that any general purpose code you might adopt, which happens to invoke malloc() at all, would be fit for purpose in such a restricted environment without substantial modification; in which case you would just remove the malloc() calls as well.
> And even then, it might not ever be called given the way you are reusing the code, so...a dynamic check is the best way to ensure it never actually gets used.
In such a restricted environment, it is unlikely you just have unknown deadcode in your project. "Oh, those parts that call malloc()? They're probably not live code and we'll find out via a crash at runtime." That's like the opposite of what you want in a hard realtime system.
So, no — a static, compile/link time check is a strictly superior way to ensure it never gets used.
> a system where malloc() was forbidden. In fact it always returned null.
Never experienced the malloc() thing but do throw exceptions and fail fast under conditions like these so they're caught in testing.
Static memory is something we already do, but we're pretty interested in whether industry actually adopts redundant code paths, monitors, watchdogs (and to what extent - how many cycles can be missed?), etc.
Also, you probably want some padding so that newer versions of the CPU can be used without too much worry. It's possible for cycle counts of some routines to increase, depending on how new chips implement things under the hood.
[says a guy who was counting cycles, in the 1980s :-)]
If there were no performance impact, there would be no point. I'm not just being snarky; there's an important point here. Caches exist to have a performance impact. In many domains it's OK to think about caching as a normal case, and to consider cache hit ratio during designs. When you say "no performance impact" you mean no negative performance impact, and that might be technically true (or it might not), but...
But that's not how a hard real-time system is designed. In that world, uncached has to be treated as the normal case. Zero cache hit ratio, all the time. That's what you have to design against, even counting cycles and so on if you need to. If you're designing and testing a system to do X work in Y time every time under worst-case assumptions, then any positive impact of caches doesn't buy you anything. Does completing a task before deadline because of caching allow you to do more work? No, because it's generally considered preferable to keep those idle cycles as a buffer against unforeseen conditions (or future system changes) than try to use them. Anything optional should have been passed off to the non-realtime parts of the larger system anyway. There should be nothing to fill that space. If that means the system was overbuilt, so be it.
The only thing caches can do in such a system is mask the rare cases where a code path really is over budget, so it slips through testing and then misses a deadline in a live system where the data-dependent cache behavior is less favorable. Oops. That's a good way for your product and your company to lose trust in that market. Once you're designing for predictability in the worst case, it's actually safer for the worst case to be the every-time case.
It's really different than how most people in other domains think about performance, I know, but within context it actually makes a lot of sense. I for one am glad that computing's not all the same, that there are little pockets of exotic or arcane practice like this, and kudos to all those brave enough to work in them.
E.g. errors that didn't match a branch or input scenario during testing, which would go over budget without cache but which cache might keep under budget, preventing a crash.
Another could be power consumption, latency optimization, or improvement of accuracy. E.g. some signal analysis doesn't work at all if the real-time code is above some required Nyquist threshold, but faster performance improves the maximum frequency that can be handled, improving accuracy.
Essentially, you would need to prove the statement "if a system works well with caches off, then it works well with caches on" to the satisfaction of whatever authority is giving you such stringent requirements.
Hah. Who sez that audio and video products have 'soft' real time? Go on now.
I assume some dedicated devices are more or less hard real time, due to running way simpler software stacks on dedicated hardware.
There's a whole world out there of hard real time, the world is not simply made up of streaming video and cell phones.
The cool thing on HN is you can get down voted for simply making that observation. It's a sign of the times I'm afraid.
For example, if you're doing a take, you have to complete it during the blanking interval, but usually the hardware guarantees that. In the software, you want your take to happen in one particular vertical blanking interval (and yes, it really is a frame-accurate industry). But if you miss, you're only going to miss by one. We didn't (so far as I know) specify guarantees to the customer ("If you get your command to the router X ms before the vertical interval, your take will happen in that vertical"), so we could always claim that the customer didn't get the command to the router in time. Again, so far as I know - there may have been guarantees given to the customer, but I didn't know about them.
But that was 20 years ago, back in the NTSC 525 days.
Nice name, by the way. Do you know of any video cards that will do a true Porter & Duff composite these days? I recall looking (again, 20 years ago) at off-the-shelf video cards, and while they could do an alpha composite, it wasn't right (and therefore wasn't useful to us).
In terms of customers and how much they care, the North American market seems to care less than Europe.
Compare your “not a single click in an hour [for quality reasons]” to “not a single missed deadline in 30 years of the life expectancy of a plane, across a fleet of a few thousand planes [for safety reasons]”. That's the difference in requirements between hard and soft RT.
I did some soft real-time (video decoding) and I have a friend working on hard real-time (avionics), and we clearly didn't work in the same world.
RT video/audio failing never results in death. Whereas failures in "avionics, industrial control" absolutely can / do. That seems to be where OC was drawing the line.
A "minor inconvenience" like a recording session going wrong, a live show with stuttering audio, skipped frames in a live TV show, and so on?
People like deadmau5, Daft Punk, Lady Gaga all perform with Ableton Live and a laptop or desktop behind their rig. If it were anything more than a minor inconvenience, these people wouldn't use this.
It's very unlikely to have audio drop outs, a proper setup will basically never have them. But still if you have one audio dropout in your life, you're not dead, your audience isn't dead, a fire doesn't start, a medical device doesn't fail to pump, and so on.
And yes you can badly configure any system, but the point is you can't configure these to be 100% guaranteed; 99.99% is perfectly fine.
Edit: Sometimes people call these "firm" real-time systems: the deadline must be met for the system to operate correctly, but failure to meet deadlines doesn't result in something serious like death. (E.g. in a video game you can display frames slower than realtime and it kind of works but feels laggy; however, you can't also slow down the audio processing, because you'll get a lowered pitch, so you have to drop the audio.)
I started game programming on the Atari 800, Apple 2, TRS-80. Wrote NES games with 2k of ram. I wrote games in C throughout the 90s including games on 3DO and PS1 and at the arcade.
I was a GC hater forever, and I'm not saying you can ignore it, but the fact that Unity runs on C# with a GC, and that so many quality, popular shipping games exist using it, is really proof that GC is not the demon that people make it out to be.
Some games made with GC include Cuphead, Kerbal Space Program, Just Shapes & Beats, Subnautica, Ghost of a Tale, Beat Saber, Hollow Knight, Cities: Skylines, Broforce, Ori and the Blind Forest
It's successful in spite of that, but that doesn't make it any better.
You might spend more time fighting the GC than benefitting from it. And that seems to be the experience for large games - simpler ones might not care.
Unity offers a lot more than just a language, and developers have to choose: are they willing to put up with GC to get the rest of what Unity offers?
You're still at the mercy of the malloc implementation. I've seen some fairly nasty behaviour involving memory leaks and weird pauses on free coming from a really hostile allocation pattern causing fragmentation in jemalloc's internal data.
In fact, doing that is often a really bad idea in general because of the extreme importance of cache effects. In a high-performance game engine, you need to have a fine degree of control over where your game objects get placed, because you need to ensure your iterations are blazingly fast.
Ostensibly you could do the exact same thing in e.g. Python if you wanted, by disabling the collector with the gc module (`gc.disable()`) and just writing custom allocation and cleanup in e.g. Cython. Probably similar in many different managed-environment languages.
But instead what you can do is to reuse the "slots" you are handing out from your allocator's memory arena for allocations of some specific type/kind/size/lifetime. If you are controlling how that arena is managed, you will find yourself coming across many opportunities to avoid doing things a general purpose GC/allocator would choose to do in favor of the needs dictated by your specific use case.
For instance you can choose to draw the frame and throw away all the resources you used to draw that frame in one go.
Garbage collection emulates the intent of this method with generational collection strategies, but it has to use a heuristic to do so. And you can optimize your code to behave very similarly within a GC, but the interface you get to that strategy is full of workarounds. It is more invasive to your code than applying an actual manual allocator.
I've heard of this concept but a search for "mark-and-release per-frame allocation buffer" returned this thread. Is there something else I could search?
A generational GC achieves a similar end result, but has to heuristically discover the generations, whereas an arena allocator achieves the same result deterministically and without extra heap walking.
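Try searching for "linear allocator", "frame allocator", or "scratch arena" - those are the usual names. A minimal sketch in C of the whole mechanism (names are mine, not from any particular engine): allocation is a pointer bump, and releasing back to a saved mark frees everything after that mark in O(1).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* A linear / frame allocator. */
typedef struct {
    uint8_t *base;
    size_t   cap;
    size_t   used;
} Arena;

static void *arena_alloc(Arena *a, size_t size) {
    size = (size + 15) & ~(size_t)15;          /* keep 16-byte alignment */
    if (a->used + size > a->cap) return NULL;  /* arena exhausted */
    void *p = a->base + a->used;
    a->used += size;
    return p;
}

int main(void) {
    Arena frame = { malloc(1 << 20), 1 << 20, 0 }; /* 1 MiB scratch */
    assert(frame.base);

    for (int i = 0; i < 3; i++) {       /* stand-in for the game loop */
        size_t mark = frame.used;       /* mark */
        float *verts = arena_alloc(&frame, 4096 * sizeof *verts);
        (void)verts;                    /* ... build the frame ... */
        frame.used = mark;              /* release: thousands of
                                           "objects" freed at once */
    }
    free(frame.base);
}
```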
(1) malloc implementations generally allocate a page at a time and give the page back to the OS when all objects in the page are gone. `ptr = malloc(1); malloc(1); free(ptr);` doesn't give the single allocated page back to the OS.
If the program runs for any length of time, it will probably need the same memory again, so freeing it is a pessimization.
Standard C library free() implementations very, very rarely free memory back to the OS.
Many C/C++ allocators don't release to the OS often or ever.
Games are very friendly to that approach: with a bit of thought you can use arenas and object pools to cover 99% of what you need, and cut out all of the failure modes of a general purpose GC or malloc implementation.
Disable the garbage collector: in Go, that's `debug.SetGCPercent(-1)` from the `runtime/debug` package.
Go is a good language for web backend and other network services, but it's not a C replacement.
Additionally, the Go compiler isn't trying really hard to optimize your code, which makes it several times slower on a CPU-bound task. That's for a good reason: for Go's use case, compile time is a priority over performance.
Saying that there are no drawbacks in Go is just irrational fandom…
GCC also supports Go (gccgo) and can call native libraries just like from C.
I'm not saying there are no drawbacks in Go, just that I can't think of any advantages of C over Go.
At which point you're mostly just writing C in Go.
I would very much prefer a stripped down version of Go used for these situations rather than throwing more C at it. The main benefits of using Go are not the garbage collection; they're the tooling, the readability (and thus maintainability) of the code base, and the large number of folks who are versatile in using it.
Large user base? C is number 2. Go isn't even in the top 10.
Tooling? C has decades of being one of the most commonly used languages, and a general culture of building on existing tools instead of jumping onto the latest hotness every few months. As a result, C has a very mature tool set.
I would also like to see a stripped-down version of Go that disables most heap allocations, but I have no idea what it would look like.
There may be more "C programmers" by number but a Golang codebase is going to be more accessible to a wider pool of applicants.
Personally I would love a `--release` mode that had longer compile times in exchange for C-like performance, but I use Python by day (about 3 orders of magnitude slower than C) so I'd be happy with speeds that were half as fast as C. :)
Yes it would leak, to avoid leaking you could invoke the GC when you’re not in a critical section. Alternatively, if you don't use maps and instead structure all your data into arrays, slices and structs, you can just avoid allocations using arenas or similar. (You can use arrays and slices without the GC, but maps require it).
Instead, my miniature library obviated pools by simply having binary operators operate directly on one of the vector objects passed to them; if more than one vector was required for the operation internally, they would be "statically allocated" by defining them in the function definition's context (some variants would also return one of these internal vectors - which was only safe to use until a subsequent call of the same operator!).
The result this had on the calling code looked quite out of place for JS, because you would effectively end up doing a bunch of static memory allocation by assigning a bunch of persistent vectors for each function in its definition context, and then you would often need to explicitly reinitialize the vectors if they were expected to be zero.
... it was however super fast and always smooth - I wish it were possible to turn the GC off in cases like this when you know it's not necessary. It was more of a toy as a library, but I did write some small production simulations with it. I'm not sure how well the method would extend to comprehensive vector and matrix libraries; I think the main problem is that most users would not be willing to use a vector library this way, because they want to focus on the math and not have to think about memory.
All of these ideas require more care when using the library though.
You can see most operations act on the Vector and there are some shared temporary variables that have been preallocated. If you look through some of the other parts you can see closures used to capture pre-allocated temporaries per function as well.
The real run-time cost of memory management done well in a modern game engine written without OOP features is extremely low.
We usually use a few very simple specialized memory allocators, you'd probably be surprised by how simple memory management can be.
The trick is to not use the same allocator when the lifetime is different.
Some resources are allocated once and basically never freed.
Some resources are allocated per level and freed all at once at the end.
Some resources are allocated during a frame and freed all at once when a new frame starts.
And lastly, a few resources are allocated and freed randomly, and here the cost of fragmentation is manageable because we're talking about a few small chunks (like network packets)
Instead, we have different types of global arenas, bump allocators, etc. that you can use. These all pre-allocate memory once at start up, and... that's it.
When you have well defined allocation patterns, allocating a new "object" is just a `last += 1;`, and once you are done you deallocate thousands of objects by just doing `last -= size();`.
That's ~0.3 nanoseconds per allocation, and sub-nanosecond time to "free" a lot of memory.
For comparison, using jemalloc instead puts you at 15-25 ns per allocation and per deallocation, with "spikes" that go up to 200ns depending on size and alignment requirements. So we are talking here a 100-1000x improvement, and very often the improvement is larger because these custom allocators are more predictable, smaller, etc. than a general purpose malloc, so you get better branch prediction, less I-cache misses, etc.
Not really, our bump allocator is ~50 LOC, it just allocates a `Box<[u8]>` with a fixed size on initialization, and stores the index of the currently used memory, and that's it.
We then have a `BumpVec<T>` type that uses this allocator (`ptr`, `len`, `cap`). This type has a fixed-capacity, it cannot be moved or cloned, etc. so it ends up being much simpler than `Vec`.
If you need to store pointers and want to conserve a bit of memory, perhaps my compact_arena crate can help you.
I'm a huge Java nerd. I love me some G1/Shenandoah/ZGC/Zing goodness. But once you're writing a program to the point that you're tuning memory latency, as you are in many games anyway, baking in your application's generational hypothesis is pretty easy. Even in Java services you'll often want to pool objects that have odd lifetimes.
Is there a particular codebase you are thinking of here?
We can also look at it from the other direction: if your engine is adjusting its framerate dynamically based on the time it takes to process each frame, and you can do the entire work for a frame in 10ms, does that give you a target of 100 fps? If you tack on another half millisecond to run a GC pause, would your target framerate just be 95 fps?
And what do you do when the set of assets to be displayed isn't deterministic? E.g., an open world game with no loading times, or a game with user-defined assets?
Some games are double or triple buffered.
Rendering is not always running at the same frequency as the game update.
The game update is sometimes fixed, but not always.
I've had a very awful experience with GC in the past, on Android: the game code was full C/C++ with a bit of Java to talk to the system APIs, and I had to use the camera stream.
At the time (2010) Android was still a hot mess full of badly over engineered code.
The Camera API was simply triggering a garbage collect about every 3-4 frames, which locked the camera stream for 100ms (not nanoseconds, milliseconds!).
The Android port of this game was cancelled because of that, it was simply impossible to disable the "feature".
I never worked with Unity myself but I worked with people using Unity as their game engine, they all had problems with stuttering caused by the GC at some point.
You can try to search Unity forums about this subject, you'll find hundreds or thousands of topics.
What really bothers me with GC is that it solves a pain I never felt, and creates a lot of problems that are usually more difficult to solve.
This is a typical case of the cure being (much) worse than the illness.
Can you point out in the post where they expand on my point? The only thing I see is this:
> Again we kept knocking off these O(heap size) stop the world processes. We are talking about an 18Gbyte heap here.
which is exactly my point - even if you remove all O(heap size) locks, depending on the exact algorithm it might still be O(number of objects) or O(number of live objects) - e.g. arena allocators are O(1) (instant), generational copying collectors are O(live objects), while mark-and-sweep GCs (including Go's if I understand correctly after skimming over your link) are O(dead objects) (the sweeping part). Go's GC seems to push most of that out of Stop-The-World pause, instead it offloads it to mutator threads instead... Also, "server with short-lived requests" is AFAIK a fairly good usecase for a GC - most objects are very short-lived, so it would be mostly garbage with simple live object graph...
Still, a commendable effort. Could probably be applied to games as well, though likely different specific optimisations would be required for their particular usecase. I think communication would be better if you expanded on this (or at least included the link) in your original post.
I know people who have written games in common-lisp despite no implementations having anything near a low-latency GC. They do so by not using the allocator at all during the middle of levels (or using it with the GC disabled, then manually GCing during level-load times).
At the expense of a lot of work from the GC. In production we had 30% of the usage coming from the GC alone…
But for game development the problem isn't going to be the GC anyway: cgo is just not adapted to this kind of task (and you cannot avoid cgo here).
Currently CGO just sucks. It adds a lot of overhead. The next problem is target platforms. I don't have access to console SDKs but compiling for these targets with Go should be a major concern.
Go could be a nice language for making games, but in the state it is in, the only thing it is good for is hobby desktop games.
The trick here is to double the heap size, which would be completely unacceptable for a game that's making use of what the hardware offers.
FRAME RATE: having a GC collect at random frames makes for jerky rendering
SPEED: Object pools allow reuse of objects without allocing/deallocing, and can be a cache-aligned array-of-structs. Structs-of-arrays can be used for batch processing large volumes of primitives.
RELIABILITY: This is probably applicable to the embedded realm too, but if you can't rely on virtual memory (because the console's OS/CPU doesn't support it, or once again you don't want the speed impact) then you need to be sure that allocations always succeed from a fixed pool of memory. Pre-allocated object pools, memory arenas, ring buffers etc. are a few ways to ensure this.
There's probably a lot more, but those are the reasons that jump out at me.
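To make the RELIABILITY point concrete, here's a hedged sketch of a fixed pool with an intrusive free list in C (illustrative, not from any real engine): allocation either succeeds from pre-allocated storage or fails immediately and predictably, and no malloc ever runs during gameplay.

```c
#include <stdio.h>

#define MAX_PARTICLES 256

/* Each slot doubles as a free-list link while unused (an intrusive
 * free list), so the pool needs no extra bookkeeping memory. */
typedef union Particle {
    union Particle *next_free;            /* valid while slot is free */
    struct { float x, y, vx, vy; } live;  /* valid while slot is live */
} Particle;

static Particle pool[MAX_PARTICLES];
static Particle *free_head;

static void pool_init(void) {
    for (int i = 0; i < MAX_PARTICLES - 1; i++)
        pool[i].next_free = &pool[i + 1];
    pool[MAX_PARTICLES - 1].next_free = NULL;
    free_head = &pool[0];
}

static Particle *pool_alloc(void) {
    if (!free_head) return NULL;   /* pool exhausted: fail predictably */
    Particle *p = free_head;
    free_head = p->next_free;
    return p;
}

static void pool_free(Particle *p) {
    p->next_free = free_head;
    free_head = p;
}

int main(void) {
    pool_init();
    Particle *p = pool_alloc();
    p->live.x = 1.0f;
    pool_free(p);
    printf("allocated and recycled a slot, no malloc in sight\n");
}
```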
If you aren't going to use the GC, then you open up a lot of other performance opportunities by just using a language that didn't have one in the first place.
And if you run the GC manually, you really don't know how long it will take - read: determinism.
> you can also write cache-aligned arrays of structs in Go if you want to
Wasn't this thread about why people don't use GC, not about go? I don't remember.
If you're using an object pool, you're dodging garbage collection, as you don't need to deallocate from that pool, you could just maintain a free-list.
> you can allocate a slab and pull from it if you want to. the existence of a GC doesn't preclude these possibilities
To take it further, you could just allocate one large chunk of memory from a garbage collected allocator and use a custom allocator - you can do this with any language. But you're not using the GC then.
The answer to your question is probably: because they like the language, are productive in it, know the libraries and the feature can be turned off so it's an option.
The problem is GCs for popular languages are nowhere near this good. People will claim their GC runs in 200us, but it's misleading.
For example, they'll say they have a 200us "stop the world" time, but then individual threads can still be blocked for 10ms+. Or they'll quote an average or median GC time, when what matters is the 99.9th percentile time. If you run GC at every 120 Hz frame then you hit the 99.9th percentile time every minute.
Finally, even if your GC runs in parallel and doesn't block your game threads it still takes an unpredictable amount of CPU time and memory bandwidth while it's running, and can have other costs like write barriers.
julia> @benchmark GC.gc()
memory estimate: 0 bytes
allocs estimate: 0
minimum time: 64.959 ms (100.00% GC)
median time: 66.848 ms (100.00% GC)
mean time: 67.062 ms (100.00% GC)
maximum time: 73.149 ms (100.00% GC)
Julia's GC is generational; relatively few sweeps will be full. But seeing that 65ms -- more than 300 times slower than 200us -- makes me wonder.
Test was on an i9 7900X.
As far as I know this is still true of the Go GC. Write barriers are also there and impact performance vs. a fixed size arena allocator that games often use that has basically zero cost.
Does it? In most games I expect resource management to be fairly straightforward: allocation and freeing of resources will mostly be tied to game events which already require explicit code. If you already have code for "this enemy is outside the area and disappears", is it really that much work to add "oh and by the way you might also free the associated resources while you're at it"? I don't need a GC thread checking at random intervals "hey let's iterate over an arbitrary portion of memory for an arbitrary amount of time to see if stuff needs dropping yo!".
I realize that I'm quite biased though because I'm fairly staunchly in the "garbage collection is a bad idea that causes more problems than it solves" camp. It's a fairly extremist point of view and probably not the most pragmatic stance.
One place where it might not be quite as trivial would be for instance resource caching in the graphic pipeline. Figuring out when you don't need a certain texture or model anymore. But that often involves APIs such as OpenGL which won't be directly handled by the GC anyway, so you'll still need explicit code to manage that.
That being said I'd still choose (a reasonable subset of) C++ over C for game programming, if only to have access to ergonomic generic containers and RAII.
I write quite a lot of C. When I do I often miss generics, type inference, an alternative to antiquated header files and a few other things. I never miss garbage collection though (because it's a bad idea that causes more problems than it solves).
GC means not only that memory management is simpler, it means that it goes away for most of the programs out there.
Most programmers I know aren't writing code that run Twitter-like servers or avionics, nor are they programming the next Doom game. These people are writing apps and doing back end coding for some big company where real-time isn't an issue and the focus is on getting code out fast, with good average quality and "cheap" labor.
In this case, having a high-level language/runtime that doesn't require a programmer who can reason about allocations is key.
I can't even begin to describe the kind of codebases that I have seen. Two years ago I was working on a C/C++ legacy system that was thousands of lines of code and almost every file had memory leaks (that cppcheck could find by itself, mind you). Some of them were caused by delivery deadlines, most of them were caused by unskilled employees.
(oh, in case you're wondering: all those leaks were "solved" by having ten times the hardware and a scheduled restart of the servers)
Even if you are perfect, you will still fragment over time. At some point, you must move a dynamically allocated resource and incur all the complexity that goes with that--effectively ad hoc garbage collection.
There are only two ways around this:
1) You have the ability to do something like a "Loading" screen where you can simply blow away all the previous resources and re-create new ones.
2) Statically allocate all memory up front and never change it.
People went to extreme measures to avoid allocating memory in their games: manually pooling every in-game object & particle, not using string comparisons in C#, etc https://danielilett.com/2019-08-05-unity-tips-1-garbage-coll...
Unity itself finally has a new system they're previewing to average out the GC spikes over time, so a game, say, never drops below 60fps: https://blogs.unity3d.com/2018/11/26/feature-preview-increme...
As well, there is a new way of writing C# code for Unity called ECS that will avoid producing GC sweeps https://docs.unity3d.com/Packages/... (the ECS manual)
As for pooling objects, you'd go to those "extreme measures" as a matter of course in any other language as well. You wouldn't want to alloc and free every frame no matter the language.
And allocating memory is fine during runtime when you are in control of the allocation and the cleanup, whereas in Unity, the sweeps are fully out of your control, expensive, and will just sometimes happen in the middle of the action
> Being able to control the garbage collector (GC) in some way has been requested by many customers. Until now there has been no way to avoid CPU spikes when the GC decided to run. This API allows you disable the GC and avoid the CPU spikes, but at the cost of managing memory more carefully yourself.
It only took them 14 years and much hand wringing from both players and developers to address :)
A similar thing is going on with their nested scene hierarchy troubles, also releasing in 2018.3 with their overhaul to the prefab system, to sort of support what they call "prefabs" having different "prefabs" in their hierarchy without completely obliterating the children. What they have now is not ideal, but they're working on it.
Prior to that, if you made, say, a wheel prefab and a car prefab, as soon as you put the wheels into your car prefab, they lost all relation to their being a wheel, such that if you updated the wheel prefab later, the car would still just have whatever you had put into the car hierarchy originally, which naturally has been the source of endless headaches and hacky workarounds for many developers.
> The frame-rate difficulties found in version 1.01 are further compounded by an issue common with many Unity titles - stuttering and hitching. In Firewatch, this is often caused by the auto-save feature, which can be disabled, but there are plenty of other instances where it pops up on its own while drawing in new assets. When combined with the inconsistent frame-rate, the game can start to feel rather jerky at times.
That studio is part of Valve now!
> Games built in Unity have a long history of suffering from performance issues. Unstable frame-rates, loading issues, hitching, and more plague a huge range of titles. Console games are most often impacted but PC games can often suffer as well. Games such as Galak-Z, Roundabout, The Adventures of Pip, and more operate with an inherent stutter that results in scrolling motion that feels less fluid than it should. In other cases, games such as Grow Home, Oddworld: New 'n' Tasty on PS4, and The Last Tinker operate at highly variable levels of performance that can impact playability. It's reached a point where Unity games which do run well on consoles, such as Ori and the Blind Forest or Counter Spy, are a rare breed.
Hopefully as the engine continues to improve dramatically, this kind of thing will be left in the past
There's a lot of reasons shipping Unity games is hard but the GC and C# are not among them or at least much lower than, say, dealing with how many aspects of the engine will start to run terribly as soon as an artist checks a random checkbox.
> This refers to the engine's ability to exploit multiple streams of instructions simultaneously. Given the relative lack of power in each of the PS4's CPU cores, this is crucial to obtaining smooth performance. We understand that there are also issues with garbage collection, which is responsible for moving data into and out of memory - something that can also lead to stuttering. When your game concept starts to increase in complexity the things Unity handles automatically may not be sufficient when resources are limited.
As others have probably already mentioned, the worst side of GC is that it is unpredictable and hard to force in a way that matches your specific pattern. With manual/RAII memory management you can make pauses predictable and non-accumulating, with no collection debt, and fit them in before the "retrace" perfectly. Also, simply relying on automatic storage in time-critical paths is usually a "cannot" in GC environments.
^ if any, I can’t recall if 5.1 actually implemented true incremental gc back then
This is simply not the case. It's still just code after all. The problem was you were fighting the GC, but that's just the symptom. The clear problem was leaking something every frame. With all the tooling these days it's pretty easy to see exactly what is getting allocated and garbage collected, so you know where to focus your efforts.
Not exactly. Here is how the early PC 3D games I worked on did that: They would have a fixed-size data buffer initialized for each particular thing you needed a lot of, such as physics info, polygons, path data, in sort of a ring buffer. A game object would have a pointer to each segment of that data it used. If you removed a game object you would mark the segment the game object pointed to as unused. When a new object was created, a manager would just return a pointer to a dirty segment from the buffer, which the new object would overwrite with its data. Memory was initialized at load and remained constant.
One problem with doing things like that is that you would have a fixed pool. So there were like 256 possible projectiles in Battlezone (1998) in the world at any time, and if something fired a 257th, an old one just ceased to exist. Particle systems worked that way as well.
What was good about that was that you could perform certain calculations relatively fast because all the data was the same size and inline, so it was easy to optimize. I worked on a recent game in C# and the path finding was actually kind of slow even though the processor the game ran on was probably like 100 times (or more) faster. I understand there are ways to get C# code to create and search through a big data structure as fast as the programmers had to do it in C in the 90's. However it would probably involve creating your own structures rather than using standard libraries, so no one did it like that.
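For illustration, that fixed-pool-with-overwrite scheme can be sketched in a few lines of C (purely illustrative - I have no idea what the actual Battlezone code looked like):

```c
#include <string.h>

#define MAX_PROJECTILES 256

typedef struct {
    float x, y, z, vx, vy, vz;
    int   alive;
} Projectile;

static Projectile projectiles[MAX_PROJECTILES]; /* fixed at load time */
static int next_slot;                           /* ring cursor */

/* Firing the 257th projectile silently recycles the oldest slot;
 * memory use is constant no matter what happens in the game. */
static Projectile *spawn_projectile(void) {
    Projectile *p = &projectiles[next_slot];
    next_slot = (next_slot + 1) % MAX_PROJECTILES;
    memset(p, 0, sizeof *p);
    p->alive = 1;
    return p;
}

int main(void) {
    for (int i = 0; i < 300; i++)    /* fire 300 shots into 256 slots */
        spawn_projectile()->vx = 1.0f;
}
```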
Just have to say, loved that game to bits.
But in practice it can be horrible. You end up writing all kinds of weird code just to avoid allocations in certain situations. And yeah, GC pauses due to having lots of addons are definitely very noticeable for players.
Also leads to fun witch hunts on addons using "too much" memory, people consider a few megabytes a lot because they confuse high memory usage with high allocation rate... Our stuff started out as "lightweight" but it grew over the years. We are probably at over 5 MB of memory with Deadly Boss Mods (many WoW players can attest that it's certainly not considered lightweight nowadays, but I did try to keep it low back then).
But I think we still do a reasonable job at avoiding allocations and recycling objects...
The point is: I had to spend a lot of time thinking about memory allocations and avoiding them in a system that promised to handle all that stuff for me. Frame drops are very noticeable.
But there are languages with better GCs than Lua out there...
Somewhat related: I recently wrote about garbage collectors and latency in network drivers in high-level languages: https://github.com/ixy-languages/ixy-languages Discussed here: https://news.ycombinator.com/item?id=20945819
That's a scenario where short GC pauses matter even more, as the hardware usually only buffers a few milliseconds worth of data at high speeds.
Memory management is ultimately far simpler and easier in C++ than in C#.
Ultimately you wind up doing the exact same thing in C# that you'd do in C++. Aggressive pooling, pre-allocation, etc. A handful of Unity constructs generate garbage and can't be avoided. I believe enable/disabling a Unity animator component, which you'd do when pooling, is one such example.
It's all just a little extra painful because you have to spend all this time avoiding something the language is built around. Trying to hit 0 GC is annoying.
It also means, somewhat ironically, that GC makes the cost of allocation significantly higher than a C++ allocation. In C++ you're going to pay a small price the moment you allocate. But you know what? That's fine! Allocate some memory, free some memory. It ain't free, but it is affordable.
In Unity generating garbage means at some random point in the future you are going to hitch and, most likely, miss a frame. And that's if you're lucky! If you're unlucky you may miss two frames.
Bro, game devs talk about this non-stop. There are probably 1000 GDC talks about memory management.
Game devs don't spell out the fine details because they are generally talking to other game devs and there is assumed knowledge. Everyone in games knows about memory management and frame budgets.
> If you have a garbage collector that runs in 200us, you could run a GC on every single frame and use less than 3% of your frame budget on the GC pause.
And if pigs could fly the world would be different. ;)
Go has a super fast GC now. It had a trash GC (teehee) for many many years. But Go is not used for game dev client front-ends. If C# / Unity had an ultra fast GC that used only 3% of a frame budget that would be interesting. But Unity has a trash GC that can easily take 10+ milliseconds. (Their incremental GC is still experimental.) It's a problem that literally every Unity dev has to spend considerable resources to manage.
For 50+ years GC has been a poor choice for game dev. Maybe at some point that will change! The status quo is that GC sucks for games. The onus is on GC to prove that it doesn't suck.
Having done both they are remarkably similar. Lots of pooling and pre-allocation. Most GC based languages are a bit harder both because they tend not to give you a chunk of memory you can do whatever with and require a lot of knowledge about which parts will allocate behind your back. It's also harder to achieve because you typically get no say when the collector will run.
There are loads of commercial games that pay no attention to this though written in both kinds of language.
1) Don't allocate or free resources during gameplay. Push it to load time instead. This works for things like assets that don't change much.
2) Use an arena that you reset regularly. This works well for scratch data in things like job systems or renderers.
3) Pick a fixed maximum number of entities to handle, based on hardware capacity, and use an object pool. This works well for things that actually do get created and destroyed during gameplay, on a user-visible timescale.
Together, these get you really far without any latency budget going toward freeing resources. And there is always something else you could put that budget toward instead, which is why any amount of GC is often seen as a waste.
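A minimal sketch of how those three strategies compose in a frame loop (C, all names illustrative, not from any engine): everything is claimed up front, and the loop itself never calls malloc or free.

```c
#include <stdlib.h>

/* (2) per-frame scratch arena: reset is O(1) */
typedef struct { unsigned char *base; size_t cap, used; } Arena;

static void *arena_alloc(Arena *a, size_t n) {
    if (a->used + n > a->cap) return NULL;
    void *p = a->base + a->used;
    a->used += n;
    return p;
}

/* (3) fixed entity cap, sized for the hardware */
enum { MAX_ENTITIES = 4096 };
typedef struct { float x, y; int alive; } Entity;

int main(void) {
    /* (1) everything allocated up front, before gameplay starts */
    Arena scratch = { malloc(1 << 20), 1 << 20, 0 };
    Entity *entities = calloc(MAX_ENTITIES, sizeof *entities);

    for (int frame = 0; frame < 3; frame++) {   /* stand-in game loop */
        scratch.used = 0;                       /* (2) reset per frame */
        float *tmp = arena_alloc(&scratch, 1024 * sizeof *tmp);
        (void)tmp;                              /* render temporaries  */
        entities[frame % MAX_ENTITIES].alive = 1; /* (3) reuse slots   */
    }

    free(scratch.base);
    free(entities);
}
```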
When working on code such as game engines, it's not simply a matter of how long something takes to do, but also when it happens and how predictable it is. If I knew that GC takes 1ms, I could schedule it just after I call swapBuffers and before I begin processing data for the next frame, which isn't a latency critical portion.
The problem is that GC is unpredictable because the mark and sweep can take a long time, and this will prevent you from meeting your 60fps requirement.
In practice, we hardly ever do any kind of GC for games because we don't really need to. We use block memory allocators like Doug Lea's malloc (dlmalloc) in fixed memory pools if we must, but generally, you keep separate regions for things like object descriptors, textures, vertex arrays, etc. There's a ton of data to manage which can't be interleaved together in a generic allocator, so once you've gone that deep, there's really no point in using system malloc.
Malloc itself isn't a problem either, it's quick. It's adding virtual space via sbrk() and friends which can pause you, so we don't. On consoles, we have a fixed memory footprint, and on a bigger system, we pre-allocate buffers which we will never grow, and that's our entire footprint. We then chunk it up into sections for different purposes, short lived objects, long lived objects, etc. Frequently, we never even deallocate objects one by one. A common tactic is to have a per-frame scratch buffer, and once you're done rendering a frame, you simply consider that memory free - boom, you've just freed thousands of things without doing anything.
There are many things you can do instead of generic GC which are far better for games.
I disagree with the author of the original article about C++. C++ is as complex as you want to make it; you don't have to use all the rope it hands you to hang yourself. However, having the std library with data structures, smart pointers, the ability to do RAII, is invaluable to me. Smart usage of std::shared_ptr gets you automatic memory reclamation; what you don't have is cycle detection, so you take care never to have cycles by using weak pointers where necessary, and you get all the behavior of auto-GC without stop-the-world.
I second that. I always find the attitude of “C++ is bad so I’m going to stick to C” really bizarre. You can use C++ as a better C.
- Use type inference and references instead of pointers. Writing C-style code with these features makes it more readable.
- Don't like OO programming? Stick to structs with all members public. It's going to be a lot better than doing the same thing in C, and you shall not have void* casts.
- Use exceptions instead of return codes for errors, so that your code is not peppered with if statements at every line you call. With an exception you'll get additional info about crashes in dumps.
I wrote for an embedded system where the C++ standard library was not available, in a previous life. I ended up writing my code in C++ and “re-inventing” a couple of useful classes like std::string and std::vector. For the most part my code was very C like...
I happen to have the exact opposite opinion. Exception handling tends to feel too "magical" (read: non-deterministic, hard to behaviorally predict, etc.) relative to just returning an error code.
Otherwise, exceptions make code simpler, cleaner, and more reliable. They will never "go missing" unless you do something to make them go missing. Code that does those things is bad code. Don't write bad code. Do use exceptions.
In my opinion, exceptions also complicate code by taking error handling out of the scope of the code you are writing: since you can't know whether anything you call throws, you may not catch where you should, and some higher-level code will catch the exception after it has lost the context needed to handle it. If exceptions work well in a language - Python, Java - by all means use them, but in C++ they've caused too many problems for me to continue doing so. Even Google bans them in their C++ style guide.
Google's proscription on exceptions is purely historical. Once they had a lot of exception-unsafe code, it was too late to change. Now they spend 15-20% of their CPU cycles working around not being able to use RAII.
My example with libcurl was very simple, but in real systems, things are less clean. You may be using a network tool that hides libcurl from you, and your code is being passed into a callback without you even knowing it until it causes a problem. Other places where C++ exceptions fail would be when you dlopen() a module, or one of your binary C++ dependencies is compiled without exception handlers. The code will compile just fine, but exceptions will be lost, there's not so much as a linker warning.
Google uses RAII just fine, it's just you have to be careful with constructors/destructors since they can't return an error. There's no way it burns 15-20% CPU not using RAII - where are you getting these numbers? I used to work there, in C++ primarily, so I'm well familiar with their code base.
Instead you call a constructor that sets a dumb default state, and then another, "init()" function, that sets the state to something else, and returns a status result. But first it has to check if it is already init'd, and maybe deinit() first, if so, or maybe fail. Then you check the result of init(), and handle whatever that is, and return some failure. Otherwise, you do some work, and then call deinit(), which has to check if it is init'd, and then clean up.
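For anyone who hasn't lived it, a sketch of that two-phase init/deinit dance (C-style; names invented for illustration):

```c
#include <stdlib.h>

typedef struct {
    int   initialized;
    char *buffer;
} Codec;

/* Constructor-equivalent: can only set a dumb default state. */
static void codec_construct(Codec *c) {
    c->initialized = 0;
    c->buffer = NULL;
}

/* The real work, which can fail, lives in a separate init(). */
static int codec_init(Codec *c, size_t bufsize) {
    if (c->initialized) return -1;   /* or deinit-and-retry... */
    c->buffer = malloc(bufsize);
    if (!c->buffer) return -1;
    c->initialized = 1;
    return 0;
}

static void codec_deinit(Codec *c) {
    if (!c->initialized) return;     /* every call re-checks state */
    free(c->buffer);
    c->buffer = NULL;
    c->initialized = 0;
}

int main(void) {
    Codec c;
    codec_construct(&c);
    if (codec_init(&c, 4096) != 0) return 1; /* every caller must check */
    /* ... do some work ... */
    codec_deinit(&c);
}
```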
I knew someone else who worked at Google. 15-20% was his estimate. Bjarne ran experiments and found overheads approaching that.
Unless you know the details of how every function (and every function that function calls, and so on) handles errors, the easiest way to not pass functions to libcurl that throw is to not write those functions in a programming language with exceptions :)
do you program a lot with your feelings?
As others have pointed out, this isn't true for many video games. People write things so they're allocated upfront as much as possible and reuse the memory. A lot of old games had fixed addresses for everything as they used all available memory (this allowed for things like Gameshark cheats).
But there's a secondary issue. A GC imposes costs for the memory you never free. It doesn't know the data will never be freed, so it'll periodically check to see if the data is still reachable. Some applications have workarounds for this, like using array offsets instead of pointers/references (fewer pointers to follow).
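The offsets-instead-of-pointers workaround looks roughly like this (sketched in C for brevity; in a GC'd language the point is that an integer index is invisible to the collector, while a reference is not):

```c
#include <stdint.h>

/* Linking nodes by index instead of pointer: a tracing GC would have
 * no references to chase here, and the whole table is one flat block. */
enum { NO_NODE = -1, MAX_NODES = 1024 };

typedef struct {
    float   value;
    int32_t next;   /* index into nodes[], not a pointer */
} Node;

static Node nodes[MAX_NODES];

static float sum_list(int32_t head) {
    float total = 0.0f;
    for (int32_t i = head; i != NO_NODE; i = nodes[i].next)
        total += nodes[i].value;
    return total;
}

int main(void) {
    nodes[0] = (Node){ 1.0f, 1 };
    nodes[1] = (Node){ 2.0f, NO_NODE };
    return (int)sum_list(0);  /* 3 */
}
```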
Other issues I've seen include nondeterministic GC sometimes failing to garbage collect one level before we loaded another, OOMing the game. Have you ever forced a GC three times in a row to try and ensure you're not doubling your memory use on a memory constrained platform? I have. (This was exacerbated by cycles between the GCed objects and traditionally refcounted objects - you'd GC, which would run finalizers that would decref, which in turn would unroot some GCed objects allowing them to be collected, which in turn run more finalizers, which would decref more objects, ...)
The last time I encountered a similar (de)alloc spike in a non-GC gamedev system was much longer ago, and it was a particular gameplay system freeing a ton of graphics resources in a single frame, when doing something akin to a scene transition or reskinning of the world - which was easily identified with standard profiling tools, and easily fixed in under a day's work by simply amortizing the cost of freeing the resources over a few frames by delaying deallocs with a queue. More commonly there were memory leaks from refcounted pointer cycles, but those also generally pretty easily identified and fixed with standard memory leak detection tools.
The problem isn't GC per se. It's that most/all off-the-shelf language-intrinsic GCs are opaque, unhookable, uncustomizable, unreplaceable black boxes which everything is shoved into. malloc/free? Easily replaced, often hookable. C++'s new/delete? Overloadable, easily replaced, you have the tools to invoke ctors/dtors yourself from your own allocators. Localized GC for your lock-free containers ala crossbeam_epoch? Sure, perfectly fine. Graphics resource managers often end up becoming something similar to GCed systems on steroids, carefully managing when resources are created and freed to avoid latency spikes or collecting things about to be used again soon.
But the GCs built into actual languages? Almost always a pain in the ass.
In other words, the problem is GC, full stop.
UI solutions often/usually use some kind of GCed language for scripting. But the scale and scope of what UIs are dealing with are small enough that we don't see 100ms GC spikes.
Our tools and editors leverage GCed languages a lot - python, C#, lua, you name it - and as long as their logic isn't spilling into the core per-frame per-entity loops, the result is usually tolerable if not outright perfectly fine. We can afford enough RAM for our dev machines that some excess uncollected data isn't a problem.
And with the right devs you can ship a Unity title without GC related hitching. https://docs.unity3d.com/Manual/UnderstandingAutomaticMemory... references GC pauses in the 5-7ms range for heap sizes in the 200KB-1MB range for the iPhone 3 . That's monstrously expensive - 1/3rd of my frame budget in one go at 60fps, when I frequently go after spikes as small as 1ms for optimization when they cause me to miss vsync - but possibly manageable, especially if the game is more GPU-bound than CPU-bound. It certainly helps that Unity actually has some decent tools for figuring out what's going on with your GC, and that C# has value types you can use to reduce GC pressure for bulk data much more easily.
 Okay, these numbers are pretty clearly well out of date if we're talking about the iPhone 3, so take those numbers with a giant grain of salt, but at the same time they sound waaay more accurate than the <1ms for 18GB numbers I'm hearing elsewhere in the thread, based on more recent personal experience.
GC enthusiasts always lie about their performance impact, almost always unwittingly, because their measurements only tell them about time actually spent by the garbage collector, and not about its impact on the rest of the system. But their inability to measure actual impact should not inspire confidence in their numbers.
GC pause times are relevant but not critical. What's critical is that I can lay out my objects to maximize locality, prefetching, and SIMD compatibility. Look up Array of Structs vs Struct of Arrays for discussions on aspects of this.
This is not strictly incompatible with a GC in theory, however it's common that in GC'd languages this is either very difficult or borderline impossible. The JVM's lack of value types, for example, makes it pretty much game over. Combining a GC with good cache locality and SoA layout is possible, but it doesn't really exist, either. Unity's HPC# is probably the closest, but they banned GC allocations to do it.
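For anyone who hasn't seen the comparison, a minimal illustration of the two layouts (C; the particle fields are invented for the example):

```c
enum { N = 4096 };

/* Array of Structs: each particle's fields are adjacent, so a loop
 * that updates only positions drags velocities and colors through
 * the cache along with them. */
struct ParticleAoS { float x, y, vx, vy; unsigned color; };
struct ParticleAoS aos[N];

/* Struct of Arrays: each field is contiguous, so a loop over x/vx
 * touches only the bytes it needs and vectorizes trivially. */
struct ParticlesSoA {
    float x[N], y[N];
    float vx[N], vy[N];
    unsigned color[N];
} soa;

void update_positions_soa(float dt) {
    for (int i = 0; i < N; i++) {   /* SIMD-friendly: unit-stride loads */
        soa.x[i] += soa.vx[i] * dt;
        soa.y[i] += soa.vy[i] * dt;
    }
}

int main(void) {
    soa.vx[0] = 1.0f;
    update_positions_soa(1.0f / 60.0f);
    return 0;
}
```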
Most times it may run in 200 micro-seconds, but the one time it takes 20ms the user suffers from unacceptable stutter.
I can see how people would want to bypass the problem entirely by being a bit more careful up front.
The garbage collector also needs to track resources, which is an additional cost over just freeing them. You have little control over how the memory is allocated, which is an additional cost over a design that intelligently uses different allocation strategies for different types of resources. Then, even if you can control when the garbage collector is invoked, you have little control over what actually gets freed. What good are those 200µs if the stuff eating up memory isn't actually getting freed fast enough?
Maybe people often overestimate their performance needs. A garbage collector may be fast enough for more purposes than anticipated. Even then, managing memory intelligently may seem like a small price to pay compared to the prospect of eventually fighting a garbage collector to get out of a memory bottleneck.
Here's the harder question: how hard is it to do manual GC in a GC'd environment vs. a non-GC'd environment?
"Environment" is the key word here. Because if you're writing in a GC'd environment, there's a good chance that existing code - third party, first party, wherever it comes from - assumes that it can heap allocate objects and discard them without too much thought.
So for small scale optimizations where you own it all, it can work out fine. But if that optimization needs to bust through an abstraction layer, all of a sudden the accounting structure of the whole program has changed, and the optimization has turned into a major surgical operation.
I'm not completely sure if this is true, but I think that doing things in an "OO" style where, for example, every different event is its own type satisfying an Event interface basically means that different events can't occupy an array together, and each one may hold a pointer to some heap-allocated memory, so you really can't optimise this away without ripping up the entire program.
Rather than do so, I ended up running my Sim on a server with >100 cores, it was single threaded but they would all spin up and chomp the GC, a beautiful sight.
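The flat alternative being gestured at is a tagged union, so heterogeneous events can sit contiguously in one array with no per-event heap object. A hedged C sketch (event kinds invented for illustration):

```c
/* Events as a tagged union: every event has the same size, so they
 * can live contiguously in one array with no per-event allocation. */
typedef enum { EV_SPAWN, EV_DAMAGE, EV_MOVE } EventKind;

typedef struct {
    EventKind kind;
    union {
        struct { int entity; float x, y; }   spawn;
        struct { int entity; int amount; }   damage;
        struct { int entity; float dx, dy; } move;
    } as;
} Event;

static Event queue[1024];  /* one allocation, reused every frame */
static int   queue_len;

static void push_damage(int entity, int amount) {
    Event *e = &queue[queue_len++];
    e->kind = EV_DAMAGE;
    e->as.damage.entity = entity;
    e->as.damage.amount = amount;
}

int main(void) {
    push_damage(42, 7);
    return queue[0].as.damage.amount;  /* 7 */
}
```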
Another factor is just the general lack of transparency or knowability of where and how objects occupy memory in these languages.
If memory management is likely to be a concern it is absolutely much easier in an environment where it is prioritized than one where it is ignored.
Err not so fast there. It's quite common in C to allocate an array of objects (not pointers) and reuse them as they expire. Memset is enough to reinitialize them.
And the main point: lots of game dev is "C wrapped in C++"; game devs tend to rewrite everything for performance and predictability, and relying on the STL is usually a nope.
Edit: specifics from https://blog.golang.org/ismmkeynote - Go hugely improved GC latency from 300ms before version 1.5, down to 30ms, then again down to well under 1ms, usually 100-200us, for the stop-the-world pause. So I guess it is stop-the-world, but the pauses seem ridiculously small to me. They guarantee less than 500 microseconds "or report a bug", which seems more than fast enough for game framerates (16ms for 60Hz frames). Am I missing something?
GCs really let you trivially not care about a LOT of things... Until you need to care. Things like where and when you allocate, the memory complexity of a function call, etc. With games you need to care about all that stuff so much sooner.
Once you're taking the time to count/track allocations anyway, you might as well just do it manually. It just codifies something you're thinking about anyway.
People understand seconds, and any other measurement would require specifying a lot of computer-specific stuff, and if you're going to do that, then you might as well fully specify the workload, too, to answer those questions before they come.
It isn't meant to be a precise answer, it's meant to put the GC performance broadly in context.
Languages not used for anything anyone cares much about don't attract much bad press. There are reasons why games are written the way they are. It is not masochism.
Basically, think about how you can either write a function recursively and leverage the call stack that is built into the programming language, or write the exact same algorithm as a loop with an explicit stack object and manipulate the flow of the recursion yourself. They're essentially the exact same thing, however one is in complete control of the application writer and the other is in control of the language. If you use the loop, you can actually say "I'm going to run out of memory" before you hit the stack limit and do something smart about it; for instance, you might notice you're approaching the stack limit in the loop and just return the current answer. It's much harder to do this in a programming language with built-in recursion.
Essentially, this analogy captures the difference between memory management in many games and garbage collection in garbage-collected languages.
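A sketch of the explicit-stack version in C (illustrative), including the "do something smart near the limit" escape hatch that built-in recursion doesn't give you:

```c
#include <stdio.h>

#define STACK_CAP 1024

typedef struct Tree { int value; struct Tree *left, *right; } Tree;

/* Iterative traversal with an explicit, fixed-size stack: unlike
 * language-level recursion, we can see the limit coming and bail
 * out gracefully with a partial answer instead of crashing. */
static long sum_tree(Tree *root) {
    Tree *stack[STACK_CAP];
    int top = 0;
    long total = 0;

    if (root) stack[top++] = root;
    while (top > 0) {
        Tree *node = stack[--top];
        total += node->value;
        if (top + 2 > STACK_CAP)   /* approaching the limit... */
            return total;          /* ...return the partial answer */
        if (node->left)  stack[top++] = node->left;
        if (node->right) stack[top++] = node->right;
    }
    return total;
}

int main(void) {
    Tree leaf = { 2, NULL, NULL }, root = { 1, &leaf, NULL };
    printf("%ld\n", sum_tree(&root)); /* prints 3 */
}
```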
Static lifetime management, by contrast, gives the same benefits to memory errors that static typing does to type-mismatch errors: you know, at compile time, how long an object will live and when it will be released. And armed with this knowledge, you can then more effectively profile and optimize.
There are, however, much better ways to go about it than C provides. RAII in C++ and Rust and even ARC in the iOS runtimes allow you to get most of the benefits of automatic memory management while still providing strict, deterministic lifetime guarantees.
This is quite blatantly false. The reality is that every single time your GC pass traverses an object but doesn't free it, it's wasting time doing work that didn't need to be done.
Is the claim that 200us would be the worst case time for the GC to complete its work?
Is this worst case measurement one that the language itself guarantees across all platforms targeted by the game?
If you haven't measured the worst case times, and the system you are using wasn't designed by measuring worst case times, then we're not yet speaking the same language.
Though people who make games sometimes overstate the importance of this. Minecraft is arguably the most successful game of all time despite GC pauses. A competitive shooter like CS:GO, however, can't have that.
I have mad respect for all using low level C / Assembly / Rust in gamedev. But I have a hard time recommending anything other than C# / Unity for secondary school 14-15 year olds just starting their journey. The wealth of resources, tutorials and community online is astounding. And as an entrypoint into VR / AR it's difficult to top ;)
It's common to use stack and pool memory allocators, where this becomes no longer expensive work.
The places where you generally don’t want a GC are with low level engine code, like the renderer - code that might execute hundreds or thousands of times a frame, where things like data locality become paramount. GC does you zero favors there.
The trick is using libraries and techniques that only stack-allocate or use your pools. These are techniques you'd almost certainly use in a language without a GC anyway, but somehow people consider using C before using these techniques in a higher-level language.
I think people prefer C and C++ because they can control the memory layout of the app, and especially in console games they can do much more low-level optimisation than in a higher-level language. Besides, I'm guessing the pain in C++ doesn't come from memory allocation; you probably follow a well defined ruleset for freeing objects and you know you will be OK.
The program having a say in how the GC behaves would be one requirement IMO.
> you could run a GC on every single frame and use less than 3% of your frame budget on the GC pause
Could I? In what languages?
The point isn't "Go is good enough" - I don't know whether it is, because the requirements are never stated particularly clearly. If the GC took zero time, duh, it would work and people could use it; it would save them some error cases and make their programming lives easier. If the GC took 3 seconds, it doesn't work. There exists some latency value under which a GC is fast enough that there's no sane argument for -not- using it. What is that value? That's the question at hand.
The problem with GC is that it can blow out a frame's budget, and perhaps extend over more than one frame, taking away precious milliseconds and dropping 60fps down to 15.
> what they do in lieu of a garbage collector.
Use reference counting. If in the rare case you need circular references then you use a weak pointer, or reclassify your structure so that it's not circular.
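A minimal sketch of what that looks like in C (illustrative; real engines typically wrap this in macros or smart pointers):

```c
#include <stdlib.h>

typedef struct {
    int refcount;
    /* ... texture data, etc. ... */
} Resource;

static Resource *resource_create(void) {
    Resource *r = calloc(1, sizeof *r);
    r->refcount = 1;          /* creator holds the first reference */
    return r;
}

static Resource *resource_retain(Resource *r) {
    r->refcount++;
    return r;
}

/* Freeing is deterministic: it happens at the exact release() call
 * that drops the count to zero, never at a GC-chosen moment. */
static void resource_release(Resource *r) {
    if (--r->refcount == 0)
        free(r);
}

int main(void) {
    Resource *r = resource_create();
    Resource *alias = resource_retain(r);
    resource_release(alias);
    resource_release(r);   /* count hits zero: freed here, predictably */
}
```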
That said, where and how you manage the heap can be critical. I recall a few places where I've used the stack itself as a temporary (small) heap, which all gets cleaned up when you exit the function, taking up almost zero clocks to do so.
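The stack-as-a-small-heap trick, sketched in C (sizes and names invented for illustration):

```c
#include <string.h>

/* A local array used as a tiny temporary heap: every "allocation" is
 * a bump of `used`, and everything is reclaimed essentially for free
 * when the function returns. */
static float average(const float *samples, int n) {
    float scratch[1024];              /* the "heap" lives on the stack */
    int used = 0;

    if (n <= 0 || used + n > 1024) return 0.0f;  /* scratch too small */
    float *work = &scratch[used];     /* "allocate" n floats */
    used += n;

    memcpy(work, samples, (size_t)n * sizeof *work);
    float sum = 0.0f;
    for (int i = 0; i < n; i++) sum += work[i];
    return sum / (float)n;
}

int main(void) {
    float s[3] = { 1.0f, 2.0f, 3.0f };
    return (int)average(s, 3);  /* 2 */
}
```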
Note that those independent cores all share common resources. L3 cache & memory bandwidth are both shared, and a GC workload slams both of those resources pretty heavily. So it's still going to have an impact even though it's running on its own core.
EDIT: Unreal Engine is primarily written in C++, and has garbage collection. I'm not salty about the downvote, but I would like to understand why what I have said is apparently incorrect.
Unreal Engine's GC at least is pretty custom. A lot of effort has been put into it to allow developers to control the worst-case run time. It has a system that allows incremental sweeps (and multi-threaded parallel sweeping) and will pause its sweeps as it approaches its allotted time limits.
That said, in my experience with Unreal most large projects choose to turn off the per-frame GC checks and manually control it. For example one project I've seen only invoked the GC when the player dies, a level transition happens or once every 10 minutes (as a fallback).