Go, don't collect my garbage (cloudflare.com)
272 points by jgrahamc 10 months ago | 179 comments



If you make something that is performance critical, and in particular if it's concurrent, the normal rule is you avoid allocating on the heap at all. Use preallocated data (just operate on massive lists), or use pools of objects, etc.

Any language/runtime will support just allocating a large blob of data and then playing with it with code that looks like C. Effectively, once you want to write perf-sensitive code in a garbage-collected language, you have to stop writing idiomatic code for that language.

That can mean using data orientation (SoA for example is a very non-idiomatic thing to do in most C-like languages and especially OO ones). Not using heap allocation at all for any length of time is definitely non-idiomatic in Java/C#/Python/js etc., but that's what you need to do if you want any kind of performance.

There are two truths here: 1) Any language can be as fast as C and 2) When they are as fast as C, they also look like C regardless of what it was to begin with.
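To make "use pools of objects" concrete in Go (the language the article is about), here is a minimal sketch using sync.Pool over a preallocated scratch buffer; the sizes and names are made up, not taken from the article:

    package main

    import "sync"

    // A 64 KiB scratch buffer, sized up front so the hot path never grows it.
    type buffer struct {
        data [64 * 1024]byte
    }

    // A pool of reusable buffers: Get hands back an old buffer when one is
    // available instead of allocating a fresh one on every call.
    var bufPool = sync.Pool{
        New: func() interface{} { return new(buffer) },
    }

    func handle(payload []byte) {
        buf := bufPool.Get().(*buffer)
        defer bufPool.Put(buf) // return the buffer for reuse, not to the GC

        // ... performance-critical work happens inside buf.data ...
        copy(buf.data[:], payload)
    }

    func main() {
        handle([]byte("example request"))
    }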


I came into the comments to say basically what you said here. Trying to avoid GC? Preallocate!

For context, I'm a game developer working in Unity. We do basically all of our dev in C#, which is a language I love. But because it's garbage collected, you can see performance hits when the GC runs. In VR dev especially (targeting 90 FPS) this is unacceptable.

So we profile, look to see where we're allocating anything onto the heap and figure out a way to cache it or re-implement it in a way that makes no garbage. We recently shipped a mobile VR experience and one of the best parts was being able to profile my gameplay code and seeing the '0B's next to its allocations every frame.

I understand that this may not work for everyone, and I fully believe that CloudFlare's got some talented engineers on the team, but there are ways to think about and structure programs in a C-like fashion so that you're more cognizant of the memory you are allocating and using from the heap.


Yeah, when I was doing Android development they have a lint that tells you not to allocate in `onDraw()`. I always thought: what's the point of a garbage collected language if you then have to carefully avoid allocation?


Uh? You should also avoid memory (de)allocation in the 'realtime' parts of your code even in languages without GCs!


You should, but it isn't a disaster if you don't, unlike with GC languages.


I'm not so sure, it depends a lot on the memory allocator: if the memory allocator decides to defragment your heap when you call it, you'll get a lot of additional latency.


So that you can mostly avoid having to carefully avoid allocation!


Can you even do that? In Java you can't allocate any List-like data structure (LinkedList, ArrayList, arrays, and of course Maps) anywhere but the heap. Escape analysis might do it, but there is no way to force it.

To my knowledge this is not really possible in C# either. I mean, you can drop down to a C style way more easily there, or use stackalloc (same as Unsafe usage in Java), but I doubt that's a good idea.


A key idea is reducing churn, not necessarily avoiding allocation entirely. For example, if the code which renders a frame always needs a four-byte array each time it runs, it could reuse the same array each time, rather than allocating and then garbage collecting on every frame.
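A minimal sketch of that reuse pattern in Go (hypothetical names, not anyone's actual frame code): the buffer is allocated once and handed to the per-frame function to fill.

    package main

    import "fmt"

    // renderFrame fills dst in place instead of returning a freshly allocated
    // slice, so one buffer can be reused across frames.
    func renderFrame(dst []byte, tick int) {
        for i := range dst {
            dst[i] = byte(tick + i)
        }
    }

    func main() {
        frame := make([]byte, 4) // allocated once, before the loop
        for tick := 0; tick < 3; tick++ {
            renderFrame(frame, tick) // no per-frame garbage
            fmt.Println(frame)
        }
    }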


So, I'm not sure what the situation is with Java, but I suspect there might be some confusion in what you're reading.

In C#, anything instantiated as part of a class goes onto the heap. Value types that are declared and used only within the function are allocated onto the stack. For example:

  public class Example
  {
      public int a;  // Allocated onto the heap when Example is instanced
      public int b;  // Allocated onto the heap when Example is instanced
      public int c;  // Allocated onto the heap when Example is instanced
   
      public void CycleValues()
      {
          int temp = a;  // Allocated onto the stack when function is called
          a = b;  // a, b, and c are already allocated
          b = c;
          c = temp;
          // temp goes away when the function finishes executing
      }
   
      public int SumPositiveValues()
      {
          List<int> positiveList = new List<int>();  // Not a value type: allocates onto the heap
          if (a > 0)
              positiveList.Add(a);
          if (b > 0)
              positiveList.Add(b);
          if (c > 0)
              positiveList.Add(c);
   
          int sum = 0;  // Allocated onto the stack
          foreach (int i in positiveList)
          {
              sum += i;
          }
          return sum;
          // sum goes away when the function finishes executing
          // positiveList was allocated onto the heap, so it persists (although inaccessible) until garbage collection
      }
  }
In the example above, space on the heap is allocated when you create a new Example(). Calling CycleValues() doesn't allocate anything new onto the heap, so it isn't creating any garbage. Calling SumPositiveValues() creates a new List<int>(), which, not being a value type, is allocated onto the heap. When the function finishes, that memory isn't automatically freed; it becomes garbage. Optimising for heap performance relies a lot on removing new calls to classes when it is possible and makes sense to do so.


Well, that's what I meant. I'm not a game dev, but keeping simple values on the stack is most often not a problem. However, it's insane how many Lists and Arrays I need, and I'm a web developer. So I guess a game dev needs even more of these (Inventory, Stats, Chat, whatever).


Yeah, but the big key is to keep things instantiated only once (where reasonable). So, if I'm making an RTS, I keep only one list of selected units, and it persists even if I have no units selected. No reallocation or losing the reference means nothing is getting garbage collected.

There are bigger convos to be had about data architecture here, but now's neither time nor place for it.


You could always use a byte array or ByteBuffer and reinvent the data structures with raw bytes, right? I imagine that's what you would have to do... sounds exciting.


That is interesting. I wonder if it's easier to just write idiomatic code in a language that does not have GC, like Rust or C++.


It is not easier to write code in C++ (but it can be optimised further). I cannot speak to using Rust in production.


That's fair. Can you point to some examples of the non-idiomatic C# code you spoke about? I would love to read the code just for learning purposes.


Most of what I'm doing isn't writing non-idiomatic code, but following different idioms. Avoid instantiating classes (including Dictionaries and Lists) with new every frame, cache data you know you'll need to use again, avoid boxing/unboxing (casting to and from the 'object' type).

I can't share any of my own code, but the sort of things that I'm doing can be exemplified by these two functions from Unity's physics API: [1][2].

The first returns a newly allocated array of the data, while the second takes, as a parameter, the array to populate and returns only the number of elements that it populated. If I know I'm going to be doing a big raycast each frame, I can instantiate the output array to a reasonable size in my class declaration and not have to pay the price of the garbage created by allocating a new array every frame.

[1] https://docs.unity3d.com/ScriptReference/Physics.RaycastAll....

[2] https://docs.unity3d.com/ScriptReference/Physics.RaycastNonA...
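For anyone who doesn't know Unity, the same API shape in a hypothetical Go sketch looks like this: the caller owns the results slice and the function only reports how many entries it filled, so the call itself produces no garbage.

    package main

    import "fmt"

    type hit struct{ distance float64 }

    // raycastNonAlloc is a made-up analogue of the second API: it fills the
    // caller-provided slice and returns the number of hits written.
    func raycastNonAlloc(results []hit) int {
        n := 0
        for d := 1.0; d <= 3.0 && n < len(results); d++ {
            results[n] = hit{distance: d}
            n++
        }
        return n
    }

    func main() {
        results := make([]hit, 16) // sized once, reused every frame
        for frame := 0; frame < 2; frame++ {
            n := raycastNonAlloc(results)
            fmt.Println(frame, results[:n])
        }
    }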


Please stop prefacing comments with phrases to the effect of "I came here to say this"; it is already obnoxious on Reddit and is getting annoyingly frequent on HN as well. It adds no meaning.

I care what you have to say; that you came here to say it is irrelevant.

https://news.ycombinator.com/item?id=15683745 https://news.ycombinator.com/item?id=15590466 https://news.ycombinator.com/item?id=13614442 https://news.ycombinator.com/item?id=14306868 https://news.ycombinator.com/item?id=14327156 https://news.ycombinator.com/item?id=14532140 https://news.ycombinator.com/item?id=14531421


Your criticism only applies if they had said nothing beyond their first sentence.

Sure, their first sentence is irrelevant, but it's forgivable since they elaborate beyond it.


It annoys me personally therefore nobody should do it.


Is it just me, or is this kind of gatekeeping becoming increasingly common on HN? I'm really not a fan


Don't go gatekeeping the gatekeepers, it creates a hostile environment for the people creating a hostile environment.


> Don't go gatekeeping the gatekeepers

I realize you're joking, but I didn't think I was. It's not like I said "HN is no place for this type of comment". I just stated my opinion


It's a statement of agreement, followed by an elaboration on their point. Seems completely fine.


> Effectively, once you want to write perf-sensitive code in a garbage-collected language, you have to stop writing idiomatic code for that language.

Only on GC languages that don't offer proper language features for value types.

Sadly most of them failed to gain market share, like Modula-3.

Eiffel and D are some of the few around that offer such capabilities, and C# has been getting some love from System C# features, with spans, new ref types, and GC regions.


Yes, C# is slowly getting there (I've been a desktop C# dev for 14 years, so I have waited for these features to creep in). Value types certainly help, but in idiomatic C#, if you foreach over something you (sometimes) allocate an iterator on the heap. And it's not clear from the syntax when you do. You just have to know that a List<T> doesn't heap-allocate its enumerator but most other types do, for example. These details are mostly hidden from you.

Using the low-level features in C# (in the past structs, preferably in an AoS layout, then Span<T> etc. tomorrow) will effectively make it a "Safe C". When you use those, you can't use idiomatic C# (LINQ, heap allocs etc). You effectively agree to use a completely different subset of the language for your high-perf tasks, and when you do, it doesn't look like the C# you have in the rest of your app.


Then I guess you share my sadness that many Delphi features did not land in C# 1.0, especially where this kind of programming pattern is concerned.


We don't need types that mandate the same policy for every value, we need "this value is immutable" and then maybe hints about whether copying or aliasing is likely to perform better at each call site.


Depends on how long-running the process needs to be, and how allocation-heavy it gets.

My first large web app was old enough that we did it in C++ as CGI...

One of the first performance optimizations we did was accept that most object lifetimes roughly coincided with the end of the request, so we simply didn't deallocate anything and let the OS just free the whole thing at once. You can do the same in most GC'd languages by just tuning GC limits.

(the second performance optimization we did was to statically link everything; loading a static binary that's constantly in the cache is/was very fast, and it was enough to make CGI surprisingly competitive in terms of performance)

Preallocation certainly has its place too, but in this case we tried custom allocators and it simply didn't make much difference. So before rewriting your code, see if you can simply turn off deallocations/GC without too much memory pressure.
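In Go, for instance, that "just turn off deallocations/GC" experiment is a one-liner against the runtime's single knob; a minimal sketch, with made-up values:

    package main

    import "runtime/debug"

    func main() {
        // The drastic version: disable collection entirely for a short-lived,
        // request-scoped process and let the OS reclaim everything at exit.
        debug.SetGCPercent(-1)

        // The gentler version: let the heap grow to ~4x live data before the
        // next collection instead of the default 2x (GOGC=100).
        // debug.SetGCPercent(300)

        // ... per-request work that allocates freely ...
    }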


> One of the first performance optimizations we did was accept that most object lifetimes roughly coincided with the end of the request, so we simply didn't deallocate anything

This'd speed up development but wouldn't help performance with CGI, since spawning a process is very expensive. This is why CGI is dead.


Our actual measurements and server loads proved otherwise.

Actually, in reality things were exactly opposite of what you suggest: it provided no noticeable benefits for development, but massively improved performance in production.

Spawning a process on Linux is more expensive than spawning a thread or reusing the same process via a fastcgi type setup, but it's not all that expensive when you're spawning a statically linked app with no complex startup code and the app is guaranteed to be in the buffer cache at any moment.

What killed CGI for most uses was not process creation overhead, but dynamic linking and applications not written to take advantage of the fact they'd die quickly.

We knew that because we actually measured first, before deciding what strategy to take.

We got performance very close to a fastcgi type setup with the above method, and without the complexity of ensuring we didn't have any resource leaks anywhere.

Doing the above was enough to bring process startup overhead sufficiently far down our list of performance issues that it was not worth optimizing further.


The parent post said they used static linking to remove much of the process spawning overhead.


In your case you likely could have made your stack large and used that in some areas while making the data segment of your binary large and using that for other areas. Since the executable is memory mapped anyway the OS would allocate memory in large chunks without the overhead of trying to organize a heap.


Doing just straight allocations in C++ without de-allocations has pretty much the same effect. It'll ask for memory from the OS in large chunks, and without anything ever being de-allocated, unless your allocator is too smart for its own good, allocation can be extremely cheap.

I did actually try a custom allocator that'd just hand out sequential pieces of memory from a large buffer without keeping track of it at all, but if it made things any better, the effect was small enough that we couldn't reliably measure it.


Not exactly. You save on deallocations, but you still have the allocator doing system calls and then fitting the sizes you want into the heap. Granted the heap probably doesn't have to work hard to make them fit, but it still allocates. I don't know if under the hood system calls to mmap or virtual_alloc block other processes calling the same functions.


Try measuring it with various allocators.

Some will try too hard and be "too smart" and will add extra cost, but most will try to get out of your way until/unless there are deallocations leaving holes.

They'll typically ask for a multiple of pages at a time, and some increase the number of pages requested each time. A quick check with strace right now shows modern glibc in my instance allocating about 2MB at a time when running "find", for example. It will do more if you ask for larger chunks of memory at a time, of course.

In terms of fitting in, some allocators will segment the heap they request into separate free lists, but many will do this first on deallocations to combat fragmentation. E.g. here's a simple example that shows objects linearly allocated on the heap (on my system, anyway) despite different sizes:

    #include <stdio.h>
    #include <stdlib.h>
    
    int main () {
      void * a, * b, *c, *d;
    
      a = malloc(4);
      b = malloc(8);
      c = malloc(12);
      d = malloc(16);
    
      printf("a=%p\nb=%p\nc=%p\nd=%p\n", a,b,c,d);
      return 0;
    }


    $ ./test
    a=0x1a7d010
    b=0x1a7d030
    c=0x1a7d050
    d=0x1a7d070
Increasing the size of c and d to 64 and 128 bytes respectively to get over the alignment size, I get:

    $ ./test
    a=0x1678010
    b=0x1678030
    c=0x1678050
    d=0x16780a0
So it's still allocating linearly at those admittedly still-small allocation sizes.

In this case, the cost of doing the allocation typically boils down to seeing if there is space in one of the free lists, then if there's space at the high water mark of the current heap allocation, and only if there is not do another system call. When there is space, the cost tends to be a few comparisons, adding some book-keeping details, and increasing a pointer. It's extremely rare to even show up when profiling.

So the cost of those syscalls tends to get amortized over enough allocations that the cost is negligible. E.g. with my example above, glibc does a single allocation to cover the 4 allocations I did.

For a process that only ever allocates, the number of system calls will depend on how aggressive the allocator is about growing the size of the extra memory it requests, but in general the number of syscalls will be small.

I appreciate the attention to syscalls - it's a pet peeve of mine that people tend to think library calls are cheap, without realising the cost difference between a mostly user-space call and one that does a syscall, and hence a context switch, every time (read() and write() are particularly persistently abused - add buffering, people). But malloc() or C++ new are rarely amongst the offenders (that doesn't mean you should do lots of tiny allocations, though; if you do, profile and consider a pool allocator).
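As a Go-flavoured aside on the buffering point (a minimal sketch, not anything from this thread): wrapping the file turns one write(2) per line into a handful of syscalls per run.

    package main

    import (
        "bufio"
        "log"
        "os"
    )

    func main() {
        f, err := os.Create("out.txt")
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        // Without this wrapper, every WriteString below would be its own
        // syscall; with it, writes are batched into the default 4KB buffer.
        w := bufio.NewWriter(f)
        for i := 0; i < 100000; i++ {
            w.WriteString("another line\n")
        }
        w.Flush() // write out whatever is still buffered
    }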


> When they are as fast as C, they also look like C regardless of what it was to begin with.

… Unless your language offers appropriate zero-overhead abstractions. C++, for all its flaws, does this extremely well. Then again, C++ is of course (usually) not garbage collected. The problem with the “write like C” approach is that it forces you to write low-level code, and that is usually messy, not self-documenting, and doesn’t take advantage of advanced type system features (which affects correctness).


C++ allows zero-overhead abstractions, but in my experience it's pretty idiomatic for "stack" variables to implicitly do heap allocations or similar possibly-expensive operations (acquire mutexes/semaphores, exchange IPC messages, etc.) in their constructors. Maybe this isn't true of modern/Core Guidelines C++, but it's true of a lot of code I've seen.


That is very true of modern C++, because the idiomatic way of handling the acquisition and release of all resources is to acquire in a constructor (typically either explicitly for a single resource, or, for multiple resources, implicitly by including another object that acquires resources as a member variable).

C++ doesn't completely free you from considering whether a type allocates memory on the stack or on the heap, but it does let you limit which code needs to concern itself with where values are allocated to a very small portion of the code. Most code written to work against a std::vector will also work against a std::array unless it is relying on the ability of a std::vector to dynamically resize.


> C++ doesn't completely free you from considering whether a type allocates memory on the stack or on the heap

You’re completely right but the way this is phrased shows a common misconception about C++/RAII: it was never about freeing the programmer from considering how to manage memory. Rather, it was about automating the mechanics behind these considerations, and adding a layer of type safety (which C++11 tremendously improved).

If you don’t want to think about memory management, use a garbage collected language (but accept that it doesn’t scale to all scenarios). C++ forces you to think about resource management, it just makes the implementation of resource management vastly easier than C.


I agree about internal heap allocations in stack variables. I disagree about other expensive operations happening unpredictably in constructors. If a class does something expensive in its constructor, you can generally tell by the kind of class it is. If you know the difference between identity classes and value classes - value classes may allocate but that's about it. Identity classes may be expensive (i.e. more expensive than baseline, which is writing all the memory they use once) to construct but you can generally tell.


I believe rust would be a better example here, with zero-overhead being a primary design goal. For C++, it's more of a mixed bag.

As long as all the abstractions and language features are compile-time evaluated to what people may consider to be "C-like code", then it's all the same.

I don't like the term "C-like code", though. It's not C's fault that fast code looks like that, and it is certainly possible to write "prettier" C code than what people have in mind.


zero-overhead abstraction is also C++'s primary goal. In what way is it a mixed bag in C++ but not in Rust? Rust is basically C++ with lifetimes and a cleaner type system.


> zero-overhead abstraction is also C++'s primary goal

But it's not completely 'pure' on that front.

LLVM's coding standards prohibit RTTI and exceptions precisely because they aren't zero-overhead.

https://llvm.org/docs/CodingStandards.html#do-not-use-rtti-o...


Those features are much worse than merely not being zero-overhead. Not only do they have a very clear overhead, the overhead applies to all code by default. You have to explicitly disable those features at compile time to avoid the bloat.


Tough to work without exceptions in particular, as the C++ standard library uses them.


Because Rust knows about lifetimes and C++ doesn't.

What C++ needs is a way to specialize types so that they automatically know if they need heap or stack storage without the programmer having to be explicit.


> I believe rust would be a better example here

From what I’ve heard you’re almost certainly right. I just don’t know enough about Rust to comment on it (unlike C++).


Provided that by "extremely well" you mean being able to give up control over allocations, making it easier to spew vectors, and worse, maps, all around the place.

I think it was ryg who said on twitter that these abstractions are not really "zero cost". C++ abstractions get you in a mindset that might be detrimental to performance.

> The problem with the “write like C” approach is that it forces you to write low-level code

No, you have to think in what order to do things. Yes, you have to write low-level code, but not all the time. Ideally you write the low-level code in one place and the high-level code in another place.

If you do allocations in your high-level code (which is admittedly what you see in many code bases), you are doing it wrong. (speaking of separation of concerns, for example). This has ramifications on maintainability as well as performance.

I'm currently writing a little game engine, where I explore an architecture where these things are neatly separated out. (coincidentally, most allocation is done at compilation time, and much "code" is done by the linker).

There is of course still a place for script-style code that lacks an elaborate architecture. And I guess C++ is the right choice to do that if you need "tight loop performance" or need to interface with C code.


> C++ abstractions get you in a mindset that might be detrimental to performance.

That certainly depends on the application but having been involved with several high-performance libraries written in C++, I can say confidently that C++ abstractions are largely beneficial. For instance, you can often control and avoid memory allocations (but see below): in particular, if you write algorithms interfaces in the style of the standard library, your code will perform very few, if any, unnecessary allocations, and yet be fairly high-level.

> Yes, you have to write low-level code, but not all the time.

I was talking specifically in the context of writing low-level C-like code, as discussed by my parent comment. You wouldn’t write “high-level C code” outside of C, there’s no need.

> If you do allocations in your high-level code you are doing it wrong.

As a blanket statement, I don’t agree with this in C++. If you’re careful you can perform (even dynamic!) allocations in C++. Just make sure that you dispatch to a pool allocator. But of course I agree that this requires care, and lots of code should avoid them.


> I was talking specifically in the context of writing low-level C-like code, as discussed by my parent comment. You wouldn’t write “high-level C code” outside of C, there’s no need.

Not sure about that - if you want performance you might need to stick to the C way anyway, so maybe you can't take advantage of programming in a scripting language. I've heard of some cases where game programmers ported their game logic back from their script language to C/C++.

> For instance, you can often control and avoid memory allocations (but see below): in particular, if you write algorithms interfaces in the style of the standard library, your code will perform very few, if any, unnecessary allocations, and yet be fairly high-level.

Another problem with that approach is binary compatibility. Let's take the infamous std::string as an example (not even a templated class). The implementation is much more likely to change than e.g. a const char pointer plus an optional length, since memory management and whatnot is also involved.

If the implementation changes you are in trouble everywhere, including most of the places where a pointer and size would have been just fine (because no allocation is needed there).

That just happened to a coworker who now needs to set up another system and build new binaries on it to support a customer's "old" platform (Ubuntu 14.04).


Languages that only support immutable data types, like some functional languages, typically can't use object pools; or if they can, under the hood there's lots of copying going on to simulate the object pools.

Having said that, good lockless implementations along with good use of the right data structures can eliminate some or all of the data bus overhead (depending on the data types and uses). RCU[0] is one such example of single-object immutability that in effect allows for mutable data structures that can handle multi-threaded environments at scale.

[0] https://en.wikipedia.org/wiki/Read-copy-update


OTOH, immutability affords many more opportunities for safe, precise liveness analysis, and the optimizations that can be performed based on this. Frequently functional compilers produce highly mutative output from immutable inputs. After all, isn't the purpose of a compiler to strip away a level of abstraction? Sure, you can't explicitly use an object pool, but a Sufficiently Advanced Compiler could do a better job anyway.


A "Sufficiently Advanced Compiler" is probably a magical pipedream though.

No compiler has precognition or is directly cabled into the developer's brain (though I wish it were) to tell what they want to express.

The compiler might not be able to tell whether a preallocation is worth the effort if it cannot determine how often a function is called. Preallocs waste some startup time on first use compared to dynamically allocating on the stack or heap and throwing the memory away afterwards, and they also leak memory.

The compiler would have to be able to tell whether a function will be called often and is timing-sensitive, or whether it is called rarely and latency doesn't matter, or any of the millions of variations in the spectrum between those two extremes.

A perfect compiler would have to (essentially) solve the halting problem by being able to tell how programs will behave all the time.


Precognition, mind-reading, and solving the halting problem are indeed beyond modern compilers - but thankfully, some clever people have given us other options for achieving the goals you describe:

https://en.wikipedia.org/wiki/Profile-guided_optimization https://en.wikipedia.org/wiki/Just-in-time_compilation https://en.wikipedia.org/wiki/Partial_evaluation#Futamura_pr...


A perfect compiler would compile natural languages to computer instructions.

AKA the real "sudo make me a better facebook".


Not really. Natural language isn't anywhere near as precise as programming language code.

Some programming languages permit different implementations to behave differently (argument evaluation order in C for example), but natural language is a whole different league. What you describe would be an extremely impressive AI system, but I wouldn't call it a compiler.

There's an XKCD for everything: https://xkcd.com/568/


Indeed. At that level you might as well just order "Go make me some money." and the problem is that everyone else has that tool, too.


Go in effect does pre-allocs in that it does variable liveness analysis: any variable that is allocated within a function, but whose address isn't assigned to another variable from outside the function or returned from the function, is allocated on the stack. This means it is automatically freed when the function returns (or the stack frame is unwound). This also means that you can be certain that any state changes come from within the current function, or from any of its children if it is passed to a child.
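A minimal sketch (not from the article) of how that plays out; the comments describe what `go build -gcflags='-m'` typically reports for each case:

    package main

    type point struct{ x, y int }

    // sum's point never has its address passed out, so it stays on the stack.
    func sum(a, b int) int {
        p := point{a, b} // does not escape: stack allocated, freed on return
        return p.x + p.y
    }

    // leak returns the address of its local, so it must escape to the heap.
    func leak(a, b int) *point {
        p := point{a, b} // escapes to heap: it outlives the stack frame
        return &p
    }

    func main() {
        _ = sum(1, 2)
        _ = leak(3, 4)
    }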


Yeah, I know this optimization as escape analysis, but it's pretty easily defeated IIRC. I don't think it even applies across compilation units. In some ways, it's nice that the escape analysis is naive--it makes it easy to reason about and it behaves somewhat intuitively; however, in other cases it means unnecessary allocations.


I'd like to know more about this. I would think that the constraints afforded by functional languages would allow lots of awesome compiler optimizations that make relatively naive functional programs quite a bit faster than relatively naive imperative programs, but I often see imperative languages like Go outperform languages like Haskell and I'm not sure why. Also, reasoning about performance in Go is quite a bit easier than reasoning about Haskell, probably because the optimizations are more straightforward. These differences are particularly interesting to me because I'm building a toy functional language that compiles to Go.


If you want reliable performance, you should look at functional languages besides Haskell. Off the top of my head I'd say that Chez Scheme, OCaml, and MLton are all probably faster than Go


In my experience, OCaml seems to perform worse than Go in most interesting singlethreaded cases, and obviously it completely lacks a good parallelism story. I can't speak to Chez Scheme or MLton, but I may check them out in the future. Thanks for the recommendation!


>OCaml seems to perform worse than Go in most interesting singlethreaded cases

Could you provide an example?


Careful here. Let's take a simple example of allocation-heavy vs. allocation-light code in OCaml:

  open Batteries

  let bench name fn =
    let tstart = Unix.gettimeofday () in
    let result = fn () in
    let tend = Unix.gettimeofday () in
    Printf.printf "%-20s %.3f seconds\n" name (tend -. tstart);
    result

  let iterations = 100000
  let count = 1000

  let insert_list_test () =
    let result = ref 0 in
    for i = 1 to iterations do
      let work = ref [] in
      for j = 1 to count do
        work := j :: !work
      done;
      work := List.rev !work;
      result := !result + List.length !work
    done;
    !result

  let insert_list_test_rec () =
    let rec build_list i n acc =
      if i > n then acc
      else build_list (i+1) n (i :: acc)
    in
    Enum.map
      (fun _ -> build_list 1 count [] |> List.rev |> List.length)
      (1--iterations)
    |> Enum.fold (+) 0

  let insert_dynarray_test () =
    let result = ref 0 in
    for i = 1 to iterations do
      let work = DynArray.create () in
      for j = 1 to count do
        DynArray.add work j
      done;
      result := !result + DynArray.length work
    done;
    !result

  let main () =
    let x1 = bench "linked lists" insert_list_test in
    let x2 = bench "linked lists (rec)" insert_list_test_rec in
    let x3 = bench "dynamic arrays" insert_dynarray_test in
    assert (x1 = x2 && x2 = x3)

  let () = main ()
As it turns out, this gives us the following results on my laptop, give or take a few hundredths of a second:

  linked lists         0.648 seconds
  linked lists (rec)   0.657 seconds
  dynamic arrays       1.295 seconds
The reason that the linked list implementation (functional or imperative) is faster is that OCaml's GC uses a bump allocator for young objects, with allocations being inlined by the compiler; this essentially makes the linked list implementation behave like an arena allocator and is faster than the dynamic array version (which has to resize and copy the underlying array several times), despite having to do a gratuitous list reversal and doing an O(n) length calculation. (Note that this is not specific to OCaml: Most JVM GCs do the same thing.)

The problem that Go has here is that its GC is neither generational nor compacting, so it can't do that, and there's a fairly high constant overhead per allocation, whereas OCaml's or the JVM's temporary allocations are only marginally more expensive than alloca().

Note that there can be good engineering reasons to have a non-generational, non-compacting GC (for starters, compacting GCs make C interoperability more difficult). So, yes, if you have such a GC, you do have to watch out for avoiding extraneous allocations.


You didn't test with a static array, though:

    let insert_array_test () =
      let result = ref 0 in
      for i = 1 to iterations do
        let work = Array.make count 0 in
        for j = 1 to count do
          Array.set work (j - 1) j
        done;
        result := !result + Array.length work
      done;
      !result
I have the following numbers:

    # main ();;
    linked lists         2.983 seconds
    linked lists (rec)   2.827 seconds
    dynamic arrays       5.152 seconds
    static arrays        1.718 seconds


Yes, I was assuming a case where you don't know exactly how much memory you need ahead of time. If you have predictable requirements, you can just use stack allocations.

For example, note how the `ref 0` in the code does not actually incur overhead; the compiler does escape analysis and will stuff the counter in a register or stack location.


All valid points, but typically even the best case gc perf will still be far more expensive than avoiding heap allocations entirely.

When I was writing numeric stuff in common lisp, the performance critical stuff all ended up looking pretty c-like, and as far as I can see it's similar in OCaml (for really high performance stuff).

It's not a bad trade off, especially for anything following the 80:20 heuristic. That way 80% of your code can be written more expressively and less error prone.


> All valid points, but typically even the best case gc perf will still be far more expensive than avoiding heap allocations entirely.

Agreed. If you want a high-performance systems programming language, then you generally want one that allows you to transparently write code for GCed references, untraced references, and pass-by-reference parameters. Nim and Modula-3 come to mind as languages that support this, to some degree or another.

> and as far as I can see it's similar in OCaml (for really high performance stuff).

The big problem that OCaml has with numerics is that it boxes floats by default. The compiler does a lot of work to avoid that boxing where possible, but it's not always possible, even in common use cases. (The JVM shouldn't have that particular issue, though.) In this case, a fast allocator can at most mitigate the overhead associated with boxing.


I'd forgotten about that boxing issue, it was one of the things that discouraged my exploring it more (for numerical work) years ago.


I can't read OCaml code so well, but it seems that the list you are allocating completely fits within the nursery. If it was larger than the nursery, it would have to be promoted to a higher generation which I'm sure would slow down the linked list code a lot. Promoting a dynamic array isn't nearly as expensive because it is just one big object, not thousands of list nodes.


Yes, that's exactly the point I was getting at. We're dealing with temporary values for which you'd use an arena/pool allocator in C/C++. The nursery is essentially an arena allocator that works transparently for temporary values.

My point is also not that it's always better, but that it's not safe to make blanket assumptions about GC overhead.

And if you're dealing with long-lived data of varying sizes, then the manual allocation story isn't that great, either.


How do the results change if you use DynArray.make and pass the count instead of DynArray.create? I think the comment you're replying to was talking about preallocated static arrays, not dynamically resizing ones.


It's still marginally slower at 760 milliseconds, give or take (I suspect that's because major heap allocations are generally more expensive than nursery allocations). That said, in that case I could also use `List.init` and be even faster.

The problem is that often you don't know exactly how much memory you'll preallocate. This is where in (say) C++, you'll resort to an arena or pool allocator; a GC with a bump allocator will give you what is essentially a smart arena allocator by default for your temporary allocations.


> If you make something that is performance critical, and in particular if it's concurrent, the normal rule is you avoid allocating on the heap at all. Use preallocated data (just operate on massive lists), or use pools of objects, etc.

Which really defeats the point of the garbage collector, because the good ones do that kind of work for you! The good garbage collectors re-use pre-allocated ranges and pools of data.

Garbage collection usually works out to a performance / latency tradeoff. (A GC app runs faster than a non-GC app, but has latency.) If I get to the point where latency, even with a pause-free or concurrent GC, is a problem, then I know I chose the wrong language.


> Garbage collection usually works out to a performance / latency tradeoff. (A GC app runs faster than a non-GC app, but has latency.)

I think there are several layers to this. First of all, GC'd data is on the heap, and often you are already doomed if you do stuff (in spread out places) on the heap at all because of cache issues.

Secondly, if we consider "normal" applications that have small parts of perf-sensitive code while the majority of the codebase is not (such as a 3D rendering application where you hit a button and it goes off and renders an image for 1 minute), then you'd have plenty of time to set up your data, and then you just "operate" on it for the critical part of the time. The same isn't true in all scenarios.

Same thing with e.g. a game engine. It's fine to allocate massive amounts of memory, but then in each frame when you have a 16ms budget you allocate nothing. Because the delay loading a level doesn't matter whether it's 5 sec or 10, but you can't spend 17ms in a frame instead of 16ms.

Another problem (and benefit) in most GC'd languages is that the GC is there for developer convenience, and the class libraries know it's there, so all standard library features (such as just iterating over a collection of items) risk creating new heap-allocated objects, which might trigger new garbage collections. In e.g. C# I can't make a standard list with a custom allocator. And I can't "foreach" over something and be sure I didn't allocate, etc.

> A GC app runs faster than a non-GC app, but has latency

I didn't mean to compare an app that does "new Foo()" in the inner loop with one that does "malloc(...)" in the inner loop; I was trying to make the point that you can't do either in the critical part.

> then I know I chose the wrong language.

Right, but the thing is you might have chosen the perfect language for 90% of your code. The language choice might be too late to change and you just need that concurrent inner loop (rendering loop, encryption pass, whatever) to be quick, and the solution is a lot of time to just allocate nothing.


The issue is that the tradeoffs are unclear, it's not like you pay 10% performance penalty for using a GC - it's often much worse than that, and then you have to do manual work to "tune it". Often the tuning, following production problems, is not that different than using malloc()/free() to begin with.


When using a GC you can get 1:1 performance to non-GC if you preallocate. Both don't do anything to the memory, hence they have provably the same (possible) performance.

Most applications don't require tuning and have acceptable performance for real world usage. The reason people write in GC languages is because malloc() and free() have error cases you can avoid with a GC. A GC is a well tested structure and less likely to leak memory by forgetting a free() or doing a use-after-free.

Of course we have static analysis tools for this but those aren't perfect.

There are perfectly legit reasons to use a GC, one of them being not having to do malloc()/free(), at the cost of some performance characteristics which can be "tuned", as you describe it.


> "When using a GC you can get 1:1 performance to non-GC if you preallocate."

That is not always true. Also, preallocation can be impossible (due to limit on heap size) if there is no way to free memory without the GC running. Of course, you're right that there are tradeoffs, the only argument here is about the magnitude of the tradeoffs - the OP shows a hidden cost of GC that is not talked about often enough.


What are the cases where this is untrue? Are they mainstream or niche? It would be a shame to caution people away from GC by default because there exist niche cases where GC performance doesn't match manual management.


It's not always possible to get cache coherency with a ref-based language, which can be a 10-50x difference in performance.


Refs might be, but they aren't inherent to GC; you can have Refs without GC.

I would love to see a benchmark of GC vs non-GC using the same compiler and language for an accurate comparison. Probably also GC vs non-GC naive vs non-GC optimized.


free() is O(n) in the number of dead objects, and needs to defrag unless your object sizes are very stable. A GC can just copy the minority of live objects and then reuse the whole arena.


Uh huh. And what's the O on the copying? And on the repeated heap walking?


> Garbage collection usually works out to a performance / latency tradeoff. (A GC app runs faster than a non-GC app, but has latency.) If I get to the point where latency, even with a pause-free or concurrent GC, is a problem, then I know I chose the wrong language.

I think you mean "throughput / latency"? Latency itself is a measure of performance.

It's a fair bit more complicated than that once you start tuning the GC, because the GC can be tuned towards either throughput or latency, and it can be tuned aggressively towards either end. For example, for a given program you can tune the GC so you get fantastic throughput for batch programs. Use a copying GC, and set the heap size large enough that the percentage of live pages is very low when you collect. On the other hand, you can use something like the tri-color collector, which runs in parallel with the main thread and kills throughput, but has minimal impact on latency (sub-millisecond pauses).


> There are two truths here: 1) Any language can be as fast as C and 2) When they are as fast as C, they also look like C regardless of what it was to begin with.

Hence the maxim:

"The determined Real Programmer can write FORTRAN programs in any language."

It applies to C as well.


> SoA for example is a very non-idiomatic thing to do in most C-like languages and especially OO ones

Isn't it rather that switching between SoA and AoS is tedious? SOA style itself (using common indices into multiple tables) works ok in C, IMO.


Yes, SoA is just slightly odd (it doesn't look idiomatic to write points.X[i] instead of points[i].X), and the switching is so cumbersome that you have to settle for one way or the other.


> If you make something that is performance critical, and in particular if it's concurrent, the normal rule is you avoid allocating on the heap at all.

This is a very synthetic workload, and all allocations were done by the standard library.


I would be surprised if even non-idiomatic CPython could be as fast as C. Even apart from GC, every function call and property access is many C function calls under the hood. I imagine the other languages listed share similar (though less dramatic) issues (Go has bounds checking while C does not, for example).


Yes, obviously all the "can be as fast as C" is modulo safety switched off, and using static calls. If bounds checks aren't elided and dynamic calls aren't JIT'ed into single jumps just like a C function - then you can't match C.

Which is why in most languages you effectively have to write C (in that language). I'm not even sure if it's possible in (C)Python, but in C# it effectively means using a lot of "unchecked" blocks, pointers, etc., obviously skipping any interfaces or other vtable lookups, etc. The C# you end up with (only structs and no classes, no heap allocation, lots of raw pointers, lots of unchecked and unsafe, etc.) will resemble C a lot.


Quite true, but the big difference is that you just have to write "C like" code in a specific code path, ideally after validating it with a profiler.

In a language like C that isn't an option, and 100% of the code might be unsafe thanks to UB.


Right. I'm very happy to see these low level Span<T> things enter C# so it's a much nicer integration than it was in the past when the only option was to just offload the inner loop to a C library. But even that (Writing 10% of your app in C) is a much better proposition than writing all of it in C.


Is there a way for the compiler or the runtime to be designed such that it can detect that these heap allocations of the same "shape" and similar size are happening with sufficient frequency/regularity that some optimization is possible?


Yes, it can be done. I don't know about Go, but this is already done for other languages. In C/C++, you can override the memory allocation functions (malloc, delete, new, free, etc) based on the type of heap objects you are going to handle.

Most or all C and C++ standard libraries come with generic allocators that work OK in most scenarios, but not optimally for specific scenarios.


I think there’s a caveat that the language needs to support inline assembly.


It doesn't. Even JavaScript supports preallocating.


Not really. You can preallocate a bunch of objects in a big freelist (or multiple freelists) and manipulate that.


Manual memory management is just one aspect of performance. If you want to compete with C you’re gonna have to be able to hand tune some bottlenecks. The compiler isn’t going to produce optimal code.


Whenever I see GC, I immediately think of tradeoffs. One can optimize a GC algorithm (or GC configuration setting) to work well for one type of workload. Likewise, one can always create a pathological workload for any GC configuration because GC has unavoidable tradeoff between cpu time and memory footprint.

Therefore, I read his article with the intent of placing his findings in the "taxonomy" of GC tradeoffs.

In that vein, I noticed that his benchmarks and graphs do not include measurements for memory footprint. Instead, it has throughput statistics such as "ops/sec" or "sign/second".

I think I see why working memory size isn't emphasized. He's testing a crypto algorithm "ECDSA-P256 Sign". I'm not familiar with it but I assume we can think of it as a typical hash function[1] that doesn't require unpredictable dynamic allocation of memory as the algorithm processes the bytes of input. (A typical hash function uses the same fixed amount of memory whether hashing 1 kilobyte or 1 terabyte.)

If his synthetic benchmark to play with GOGC was using a different type of workload... such as a text parser that stressed the GC with thousands of different sized memory objects, we'd see memory footprint as part of the analysis.

[1] https://en.wikipedia.org/wiki/Elliptic_Curve_Digital_Signatu...


It's not just a fixed-length hash: it does computations on the curve points, which are represented by big.Ints that take a variable size and are likely to end up on the heap.

Also, Go crypto makes heavy use of interfaces, which also make stuff escape to the heap. If such things are a bottleneck, it might be worthwhile to "un-interface" them.


This doesn't read like a "GC is slow" problem to me, more like a "GC doesn't cope well in the face of a highly concurrent allocation workload". That's pretty bad though. As soon as you go down the rabbit hole of tweaking knobs to cajole GC into doing what you expect, darkness lies beyond..


It's the default setting that didn't cope well. The final setting seemed to get pretty decent performance. And if you only have one knob then the rabbit hole is pretty shallow.

The question not answered (as it wasn't asked) is how many workload profiles can be catered for? But as a sample of one, this appears successful.


The question though is does the setting that worked for this particular micro-benchmark actually work well for more generic workloads that have similar performance issues.


Does it matter? If they code up their target workload and get near-linear scaling with collection at 5000% or 12000%, it's still a win.

The important question is whether or not they can get the scaling with some setting.


>in the face of a highly concurrent allocation workload

Isn't this "highly concurrent" the main selling point of Go?


IIRC, the main selling point of Go is "it compiles really fast". :-)

If I understand the article, the individual goroutines aren't communicating; they simply compute a cryptographic hash. However, they are communicating by allocating memory for the hash and then discarding it immediately.

From this, I think we can learn that this implementation has a single-threaded, non-concurrent garbage collector, and the highly-concurrent goroutines are hammering the snot out of it.


The original PR line taken by the Go team was that languages like Java "communicate by sharing" whereas in Go you would "share by communicating". It wasn't so much about Green vs Native threading as it was about the core semantics of Goroutines and Channels. This original PR also had the annoying (to people such as myself) aspect of poo-pooing Java in the context of concurrency (of all things).

As it happens [lol] facts asserted themselves and (at least some in) the Go community saw the reason why the "communicating by sharing" paradigm ever got traction. "It's performance, stupid", to paraphrase Slick Willy, not because other language designers were thoughtless. So now it is "idiomatic" to see Go code using both paradigms (locking & passing channels). The original mantra was quietly dropped, along with the "systems programming PL" bit, somewhere along the line.

This OP type of article, in an interesting way, continues this pattern of "let's do it this [obvious] way" in the face of dealing with concurrency. No less an authority than Cliff Click has asserted that lock-free type algorithms pretty much require a GC.

Concurrency and memory management seem to be zero-sum games. Root cause is clearly Physics, and not the mechanism devised to deal with the physical realities.


Channels pass ownership of allocated objects, they don't perform copies or allocations themselves (once you've allocated the channel and the objects you want to pass).

I don't think the article says anything about concurrent allocations; in fact, the various signers don't need to share anything (mutable) to sign the requests they get.


> Channels pass ownership of allocated objects, they don't perform copies or allocations themselves (once you've allocated the channel and the objects you want to pass).

And? Did I say they did?

> I don't think the article says anything about concurrent allocations

No it doesn't. Point was that it is not a general purpose approach. GCs are.


I've just searched for the word "channel" in the article: zero results.

The "original mantra" is "Channels orchestrate, locks serialize" and hasn't been dropped by anyone.


2010: https://blog.golang.org/share-memory-by-communicating

There is nothing wrong with that approach. It does put a ceiling on performance, however.


It sounds like a bug in the GC for it to run so often, but even so, Go's GC has only one knob and they're adamant about keeping it that way. There's no rabbit hole to fall into here.


There can be advantages (and ZOMG so much confusion) in allowing multiple knobs, though. Java takes it to an extreme and we're somewhat in the territory of needing a black-box optimiser, but you can adjust the GC to suit a really wide variety of workloads and requirements. By leaving it at one knob, Google/Go is heavily limiting the uses of the language.


All GC algorithms have pessimal cases; I believe there is a mathematical proof of that, though I can never find it on the net through the chatter of all the other stuff about garbage collection. This is a pessimal case for the current Go GC, but that alone doesn't prove much because all algorithms have pessimal cases. (Even "manual memory management" can end up with pessimal cases for the automatic part, which is why you might see an improvement with your "manually managed" program by swapping malloc for jemalloc or something. What people call "manual memory management" still generally includes a lot of automatic stuff.)

I believe Go does something like set a threshold of memory use. Once that memory usage is hit, a collection is run. After that collection is run, it uses the new accurate amount of memory used to set the next threshold. The threshold of course starts small and has an exponential increase of some sort.

When the program first starts, the threshold is set fairly low. If you immediately start allocating lots of things and keeping them, a common case, it grows to a sensible size in a reasonable number of allocations and sweeps (which as you are growing into your initial footprint, amortize out pretty cheap), and in a lot of cases you won't think much about the GC until you are really pushing performance.

If, however, you write a program that isn't holding on to much RAM in the steady state, such as a crypto algorithm (which for all its algorithmic complexity typically doesn't take much RAM to run, as compared to, say, loading an image - and I don't know for sure, but it's possible this ends up all stack-allocated), and that algorithm allocates a lot of short-term stuff, perhaps even just allocates for its final answer, then you get into a situation where you are quickly generating enough garbage to cross the initial low threshold, but the GC immediately cleans it all up when it runs, so when it is deciding on what the new threshold should be it still picks a low number, causing it to repeat the cycle indefinitely.
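A tiny sketch (mine, not the article's benchmark) that makes this pacing visible: keep almost nothing live, allocate a lot, and the collection count climbs while the next-GC target stays small.

    package main

    import (
        "fmt"
        "runtime"
    )

    var sink []byte // forces the allocations below onto the heap

    func main() {
        var m runtime.MemStats
        for i := 0; i < 3; i++ {
            // lots of short-lived garbage, very little of it kept live
            for j := 0; j < 1000000; j++ {
                sink = make([]byte, 64)
            }
            runtime.ReadMemStats(&m)
            fmt.Printf("live heap ~%d KB, next GC at %d KB, collections so far: %d\n",
                m.HeapAlloc/1024, m.NextGC/1024, m.NumGC)
        }
    }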

Fixes will immediately leap to your mind, but remember you must evaluate your fixes holistically, not as if this is the only case in the world. If it were the only case in the world, it would be an easy problem, and Go never would have exhibited the problem in the first place. In particular, as you consider your "fix", consider also that a real program that used real crypto would be much less likely to see this pessimal case occur, because it is very likely the rest of the program would push the memory thresholds up to something sensible, so the fix of "do nothing" is actually quite likely to work well on non-artificial-benchmark programs. It is likely that the most obvious fix that comes to mind will result in making a program that is deliberately running some small server process or something appearing to leak memory after running for a long period of time, for instance.

(I suspect quite strongly in this case that the correct "fix" is to do nothing.)


> I suspect quite strongly in this case that the correct "fix" is to do nothing.

What about a fix that gives the user optional control over the thresholds, or even just a hint as to the desired behavior? Keep the defaults as they are, but add a “pragma” that would set the minimum threshold for GC? Or it could go a long way just to have an option to set a minimum time between collections. While it could easily make things worse for certain situations, I much prefer the option to hang myself over an immutable default that periodically does so without my input or consent.


The Go team were very proud of their garbage collector that “virtually eliminates” (IIRC) GC pauses. I guess it does that at the cost of severe program slowdown in cases like this? It's all about trade-offs, and I don't trust people who think there's only one thing to optimise for.


While the Go GC is definitely no slouch when it comes to engineering, the messaging around it has not been entirely forthright: https://blog.plan99.net/modern-garbage-collection-911ef4f8bd...


This blog is more of a rant. Official blog announcements and marketing material are not the place to delve deeply into the technical tradeoffs of designing a GC. He compares it with Java, which is rather ironic considering I have not seen Java official announcements talking about things not done. It would be like saying 'We have not been able to deliver value types, the module story is still half done, and we couldn't write an official HTTP client in the whole development cycle of Java 9' in the release announcement of Java 9.

Also there is no data provided to show how bad is Go's GC compare to Java


As anybody who has done high-performance, low-latency Java and had to write GC-less Java code will tell you, this is no surprise at all. In the Java world we at least had ways around it with things like sun.misc.Unsafe, avoiding language features that allocate, pooling, etc. While some of those things can be done in Go, I don't know if all the techniques are possible. Without them, I'm not sure Go can really compete in this arena.


Exactly right re: Java. You could just use Rust, and get automatic memory management without GC.


One does not simply "just use Rust".



This kind of tuning ends up being required for all languages that use garbage collection. The more you use it in production, the more you're doing meta programming with environment variables, command-line options and your allocation profiles. The problem is not so much some performance reduction, it's that the performance reduction is often variable and unexpected, as it is in this blog post.


Can anyone explain to me why he got better performance with GC turned on (and optimised) rather than off?


Judicious garbage collection reuses memory you have already allocated, and for which there may already be entries in caches, page tables, garbage collector metadata, and whatnot.

If you never garbage collect, the runtime has to repeatedly allocate new memory from the operating system, and all this additional metadata has to be juggled. If allocating new memory is a system call, that has costs. If you get lots of page faults because you keep using new pages, that has costs as well.
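
For reference, "never garbage collect" in Go corresponds to running with GOGC=off in the environment, or to the programmatic equivalent below (a minimal sketch); after this, the heap only grows, which is exactly the situation described above:

    package main

    import "runtime/debug"

    func main() {
        // Same effect as setting GOGC=off: automatic collections are
        // disabled, so every allocation that doesn't fit in memory
        // already obtained from the OS forces the runtime to request more.
        debug.SetGCPercent(-1)

        // ... workload ...
    }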


It sounds like instead of turning off GC, the algorithm was changed to an "automatic" allocation/deallocation scheme closer to what you might see with manual memory management in a C-like world. I wonder if the Go runtime has any optimizations to turn things into stack/static allocations in that scenario.


I'm not a Go expert, but my understanding of the blog was that the algorithm or any other aspect of the code was not changed. Only the garbage collection frequency.


My comment was in reply to the parent's:

    > If you never garbage collect, the runtime has to repeatedly allocate new memory from the operating system ...


Yeah, that was me. I still don't understand what you're saying. Who is changing what algorithm in what component?

If you tell the Go runtime not to garbage collect but keep allocating memory, that memory has to come from somewhere. That somewhere is the operating system.


FWIW, I think this mostly shows that the GOGC defaults will surprise you when your program keeps a small amount of live data but allocates quickly and is CPU bound. By default, it's trying to keep allocated heap size under 2x live data. If you have a tiny crypto bench program that keeps very little live data around, it won't take many allocations to trigger a GC. So the same code benchmarked here would trigger fewer GCs in the context of a program that had more data live. For example, it would have gone faster if the author had allocated a 100MB []byte in a global variable.
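
The trick mentioned at the end of that paragraph — sometimes called a heap "ballast" — is easy to sketch (the 100 MB figure is arbitrary): the extra live data raises the 2x target, so collections happen far less often, and because the ballast is never touched, physical memory use often barely changes.

    package main

    // A ~100 MB ballast: never read or written, but counted as live data
    // by the collector, which pushes the GC trigger point (2x live heap
    // by default) up to a more sensible size for this workload.
    var ballast = make([]byte, 100<<20)

    func main() {
        _ = ballast // keep a reference so it clearly stays live

        // ... the benchmark / workload runs unchanged ...
    }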

If your program is far from exhausting the RAM but is fully utilizing the CPU, you might want it to save CPU by using more RAM, equivalent to increasing GOGC here. The rub is that it's hard for the runtime to ever be sure what the humans want without your input: maybe this specific Go program is a little utility that you'd really like to use no more RAM than it must, to avoid interference with more important procs. Or maybe it's supposed to be the only large process on the machine, and should use a large chunk of all available RAM. Or it's in between, if there are five or six servers (or instances of a service) sharing a box. You can imagine heuristic controls that override GOGC in corner cases (e.g., assume it's always OK to use 1% of system memory), or even a separate knob for max RAM use you could use in place of GOGC. But the Go folks tend to want to keep things simple, so right now you sometimes have to play with GOGC values to get the behavior you want.


If any Cloudflare engineers are here, why is Rust not chosen as main language for such use cases?


I like Rust too, but it's a newer language and the Cloudflare team has a history of using Go since it's targeted pretty squarely at their kind of work. In the case of a benchmark it's especially unhelpful to make that suggestion, since the point of a benchmark is to learn where the limits are, not to invite porting everything to a different language to avoid a problem which might remain theoretical forever.


I don't think they're trying to suggest using rust, just curious why Cloudflare chose a language with a GC for workloads that apparently do not play well with GC.

"We already used Go" is a totally valid reason. "It's rare (or as you say potentially never an issue) that we actually need to avoid GC" is another.


“why is Rust not chosen as main language” seemed like an unnecessarily accusatory way to express that idea, especially given that language choices are usually driven by a number of factors rather than a single benchmark.

I've seen a lot of advocacy over the years like that which has seemed to inspire more backlash than adoption. I generally prefer something constructive in the tradition of showing a better way to do something — i.e. in this case, perhaps show a Rust crypto library which delivers better performance so people could judge whether that kind of thing is enough of a draw to be worth migrating or, in the case of Rust, taking advantage of its embedding characteristics.


First of all, I am not advocating using Rust; I am asking why they didn't choose a language without GC and with good performance for the given use case.

    “why is Rust not chosen as main language” seemed like an unnecessarily accusatory way to express that idea, especially given that language choices are usually driven by a number of factors rather than a single benchmark.
Do not cut the context: I said "why is Rust not chosen as main language ---> for such use cases? <---". By "use case" I meant writing performant code without a GC.


the context of the benchmark is https://blog.cloudflare.com/arm-takes-wing/

(Cloudflare engineer but I have nothing to do with the recent blog posts)


These tests are looking at our current code/workload. So we are testing how they perform under different circumstances. We could switch to another language but that would require rewriting large chunks of code. We've done that in the past and will likely do it in future. Rust is certainly interesting for our situation.


I would love to see an in-depth analysis of why Go's GC behaves this way for this specific benchmark in the first place. Just playing around with different values for `GOGC` seems like guesswork and could just paper over the real underlying issue. I know this probably requires reading the GC's source code, but that would certainly have been very educational.


I'm curious about this.

Go has a history of doing the simplest thing that covers the majority of cases, inherited from Plan 9. For one thing, although I don't know its current GC story, its original collector was very simple.

I wonder what would happen if you transplanted a recent collector from Java, where the strategy is to optimize the crap out of anything that moves, into Go?

Obviously, it wouldn't be as fast as not allocating, or at least not collecting, in hot code, but what would be the actual penalty?

On the other hand, what about, say, Pony and its per-thread collectors?


From a recent talk on the latest JDK GC, it seems to require even more tuning parameters than before. So the best performance achievable may be higher, but with more manual work. Which is along the same trade-off curve, with malloc/free on one end, and a very simple GC on the other.


Using the SetGCPercent option with a large value is dangerous: it may cause a program never to get GCed before virtual memory is exhausted. And when that happens, Go programs become very, very slow.

The current official Go runtime performs very poorly when virtual memory is involved. It can make the whole OS non-interactive, to the point that you must restart your computer.

A SetGCMemoryThreshold(maxMemory int) option would be better for many programs.


Definitely agree an option for a memory limit would be handy in a lot of cases. Sometimes a box exists to run one service and you know you want it to use something like 75% of the box's RAM (leaving some for cache, other procs, etc.). Other times, like this, you don't want GOGC=100's frequent collections but also don't want the risk of an explosion if live data increases.

Just adding some heuristics to override GOGC's behavior in extreme cases could help, e.g. assume by default it's OK to use 1% of system RAM and not OK to use >75% of it regardless of live data. I'm not sure we can realistically hope to get that--it adds complexity, folks dislike arbitrary thresholds, and any change will break users who have optimized around current behavior. It might be that you could approximate the desired behavior for reasonably well-behaved programs from a third-party package using runtime methods and a timer, but that's a drag.
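
A rough sketch of what that third-party approach might look like, assuming a soft target of, say, 512 MB: poll the runtime's own statistics on a timer, leave GC effectively off below the target, and force a collection once it's exceeded. Purely an approximation, not a substitute for a real runtime knob:

    package main

    import (
        "runtime"
        "runtime/debug"
        "time"
    )

    // softLimitGC disables the normal GOGC-driven collections and instead
    // forces a collection whenever the heap exceeds limitBytes, checked
    // once per polling interval. Illustrative only.
    func softLimitGC(limitBytes uint64, interval time.Duration) {
        debug.SetGCPercent(-1) // turn off proportional triggering
        go func() {
            var m runtime.MemStats
            for range time.Tick(interval) {
                runtime.ReadMemStats(&m) // note: briefly stops the world
                if m.HeapAlloc > limitBytes {
                    runtime.GC()
                }
            }
        }()
    }

    func main() {
        softLimitGC(512<<20, 100*time.Millisecond)
        // ... workload ...
    }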


Tuning is part of the job if you need high performance. It can take many forms, from code optimization, build flags, and platform configurations to even hardware tweaks. Most people hate this kind of stuff, but some of us think it's fun.


Graph complaint: the orange and blue lines aren't labeled, so you can't tell which is which.


This is a good illustration that Go isn't as simple as its proponents like to claim[1]. When you face complex engineering challenges, you need tools with a certain amount of complexity.

[1] especially when comparing to Java when it comes to GC customization.


Funny what different people get from these things.

I was most impressed that he managed to get very close to perfect scalability by twiddling the one knob on offer.

Java GC customisation is an art. This analysis was entirely mechanical. If the author wasn't just lucky, i.e. if this one knob really is sufficient for a large number of workload profiles, then that's very good news.


The main problem is that it isn't a very good knob for this use case, as "ratio of freshly allocated data to live data remaining" is only incidental to the goal of the person writing the program. Therefore you'd expect to have to retune it due to unrelated changes, like if you allocated a big chunk of static memory to hold results.

I expect Go's current behaviour is right for memory-intensive programs, where "limit the wasted memory as a ratio of the actual working set" is a reasonable high-level goal. For a CPU-intensive program you might have preferred a "max x% CPU overhead" knob, but just a minimum memory size would work, as this is mostly an artifact of the very small memory usage of the benchmark program.

Java's G1 has "max pause delay" as its preferred knob, but Go has stuck that knob on low since around version 1.6, which is a good choice for what is essentially a language for network services.


The problem with software knobs is that the one that would help you is almost never there, and you usually can't get similar functionality by mixing the ones that are there.

That said, there are exceptions. I don't think GC is something simple enough to be an exception, but those exist.


Interesting! IMO it showed that garbage collection isn't some silver bullet (of course), but that with a kick to a quite coarse control it's possible to achieve very good performance indeed.

I am surprised that the difference is so great - and that the GC seems to think it's a good idea to keep collecting 4 MB of data at a time - but I guess the Go team have been tuning heavily for interactive performance i.e. small pauses.

I guess in a situation like this you'd prefer something more like 'server' GC - maybe wait until you've allocated 1 GB then run through and collect everything you can.

In games, I wonder if you'd like to turn the GC off during frame rendering then collect while presenting the frame/waiting for VSync.
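
In Go terms, that might look something like the following sketch (renderFrame and waitForVSync are stand-ins for whatever the engine provides; whether a forced collection reliably fits in the vsync slack is another question):

    package main

    import (
        "runtime"
        "runtime/debug"
    )

    // gameLoop disables automatic collections so nothing pauses the frame
    // mid-render, then forces a collection in the slack before vsync.
    func gameLoop(renderFrame, waitForVSync func()) {
        debug.SetGCPercent(-1) // no automatic collections mid-frame
        for {
            renderFrame()  // gameplay + rendering: no GC pauses here
            runtime.GC()   // forced, blocking collection in the slack
            waitForVSync() // any remaining time is spent waiting
        }
    }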

For soft real-time audio - synthesizing a few samples at a time - what would you do? OpenAL wants you to queue up thousands or tens of thousands of samples, but that sucks if you need to react very quickly.

If anyone has any ideas, please say. I'd love to do soft real-time audio in a thread even in e.g. C#. (Edit: Maybe JIT some code that doesn't allocate? Or even have a separate interpreter written in C that has a constant-ish upper performance bound & no malloc.)


I don't know. That seemed like a fairly simple exploration of the problem and an easy library function to solve it. Would it be much simpler in any other language?


It's a simple problem with a simple fix, from a very simple benchmark - but it's already reaching the limits of how much tuning Go offers. If even such a basic benchmark needed one GC tuning knob, that seems suggestive that maybe a larger and more complex program really would need all the knobs that Java offers, the ones that Go claims are unnecessary.

Or maybe not - maybe that one knob really can solve all the problems. But I really was surprised that anyone would ever need to actively configure a GC for such a simple use case, so I've learnt something at least.


If you need the machinery Java has, you're already fucked. To a first approximation, no one knows how to tune a JVM. Whole books have been written multiple times on the subject, and everywhere I've been that used large-scale Java applications, tuning failures are still at or near the top of the list of most common production issues.

We don't appear to be much better off than we were with C. I guess the server melting is probably preferable to an exploitable smashed stack, but prevention of both issues appears to be to hire geniuses who don't make mistakes while doing enormously mistake-prone work.


> We don't appear to be much better off than we were with C. I guess the server melting is probably preferable to an exploitable smashed stack, but prevention of both issues appears to be to hire geniuses who don't make mistakes while doing enormously mistake-prone work.

"Everyone gets correctness, you need geniuses to get maximum performance" is a huge improvement over the reverse.


Except I see tons of outages caused by things like VM and application server tuning. Software that isn't working isn't correct or performing well.


Not running at all, while not ideal, is a lot better than silent data corruption.


"more complex program really would need all the knobs that Java offers, the ones that Go claims are unnecessary." - but that is really a move in the wrong direction... it's meta programming with Env variables, allocation sizes, command lines, and JDK versions. Sure, Go will probably eventually have all that, to support more tuning, but at that point, ti feels the war for "simple, predictable, low-overhead GC" has been lost, yet again.


> Sure, Go will probably eventually have all that, to support more tuning, but at that point it feels like the war for a "simple, predictable, low-overhead GC" has been lost, yet again.

Well, it seems we don't have a simple GC that performs well in all cases, and really why would we expect to? So choosing simplicity means leaving performance on the table, at least for now, and while there are languages that do that (Python) it doesn't seem like a good fit for Go.


> "it seems we don't have a simple GC that performs well in all cases" Exactly, and after 20+ years, it may be that it's not possible to have such a GC.


I think most pathological cases can be reduced to similarly simple examples. I'm sure these exist for other GCs as well.


Yes, the take-away here is not that Go is particularly bad, but that GCs are problematic. Notably, GC != automatic memory management.


That's not a reasonable conclusion to draw. You would need to preallocate to maximize performance regardless of GC in this case, and in other cases the performance and determinism gaps between GC languages and non-GC languages are almost entirely due to tertiary factors. For example, GC languages are more likely to prioritize friendliness over raw performance in other areas--language features, compiler optimizations, standard library design, etc. Non-GC languages really only need to appeal to the high-performance and high-determinism crowds.


> Notably, GC != automatic memory management.

Actually, GC == automatic memory management, just not automatic performant memory management


Simpler would be "just work like I expect".


I think the expectation the Go lot have been working towards is the expectation that individual pauses are very short. Contrast to a Java server GC, aiming towards overall efficiency. (Tell me if I'm wrong, I'm no expert.)


Yes, Go optimizes for low latency and Java for high throughput.


Depends on the Java GC being used.


Granted, but the stock GC is optimized for throughput.


The stock GC on OpenJDK, there are other JVMs to choose from.


As another commenter posted elsewhere, GC is all about tradeoffs, and for a given set of GC tradeoffs there will always be pathological cases. Another GC might work out of the box for this workload, but perform very badly on a workload at which Go's GC excels. As others have already mentioned, the fact that he was so easily able to tune the GC is a testament to Go's simplicity.

There are lots of good criticisms of Go to be had, but its runtime is pretty remarkable, in my opinion.


What about the use cases where what you expect is not what I expect? GC is, as a sibling comment says, about compromises around memory safety.


Doesn't work with GCs.

As another comment in this thread explained, GCs have CPU and Memory tradeoffs and you can easily make a GC that would work excellent in the OP benchmark but would suffer severely under other workloads.


Having worked with Java GC it's just a nightmare to tune. I prefer languages that have almost no options for GC tuning.


On Java 8+, you can use the G1 GC, which has very few knobs. On the other hand, the CMS GC has a lot of knobs you can tune depending on your workload. The JVM (HotSpot in this case) believes in flexibility over one-size-fits-all. YMMV.


In other words, there are actually knobs that have you choose the GC, and then choose settings within the GC.


Tuning is a hassle, but when the alternative is rewriting my important code over and over until I hit the right tradeoff between fast and readable, I'll take it.


This is why we will still have C for times to come.


If you can't figure out how to tune one documented option in Go to fix an issue caused by this one specific use case (which he did figure out, and it was resolved), then you're probably not using C either.


"Then you're probably not using C either."

In fact, I do - daily. I'm not stating that this exact issue is the key reason, but because C is very simple with not much of an abstraction layer between software and hardware, where most such (unexpected) performance issues arises. Yes, it might be a single flag to makes things run normal, but it might also take 1 week in production to find such "magic" bugs.



