Hacker News new | past | comments | ask | show | jobs | submit login

The binary-trees benchmark on The Debian Language Shootout[1] involves allocating millions of short-lived trees and traversing them. It is informative about GC performance even with the caveat that there are 'lies, damned lies, and benchmarks', because many real-world graph analysis and brute force tree search algorithms similarly allocate zillions of short-lived nodes. For non-GC languages like C/C++/Rust it gives a decent idea of the performance difference between malloc'ing and freeing individual objects vs doing bulk arena allocations:

  language         secs       GC'd language?
  ========         ====       ==============
  C++ (g++)        0.94
  Rust             1.09
  C (gcc)          1.54
  Free Pascal      1.99
  Intel Fortran    2.38
  Java             2.48       yes  <====
  Lisp (SBCL)      2.84       yes
  Ada (GNAT)       3.12
  OCaml            4.68       yes
  Racket           4.81       yes
  C# .NET          4.81       yes
  Haskell (GHC)    5.02       yes
  Erlang           5.19       yes
  F# .NET          6.06       yes
  Node.js          7.20       yes
  Julia            7.43       yes
  Chapel           7.96       yes
  Dart             9.90       yes
  Go              12.23       yes  <====
  Swift           16.15       * "Automatic Reference Counting"
  Smalltalk (VW)  16.33       yes
  PHP             18.64       yes
  Ruby            23.80       yes
  Python 3        48.03       yes
  Lua             48.15       yes
  Perl            53.02       yes
So Java has the fastest GC for this test, 2.48 secs vs 12.23 secs for Golang. The Java code is also notably perfectly idiomatic for multicore, it doesn't do heroic "avoid GC by writing C-like code manipulating a fixed global memory array" tricks. The Java code is also more concise.

The 'plain C' code that uses Apache Portable Runtime memory pools instead of standard malloc/free and uses OpenMP #pragma's strikes me as more 'heroic' than 'idiomatic', whereas C++ and Rust use standard libraries/crates and idiomatic patterns. (Note that OpenMP is 'standard' for high-performance C and well supported across GCC/LLVM/Microsoft/Intel compilers. But still....)

OCaml and Haskell made impressive showings for functional languages which are in practice the easiest for dealing with complicated tree algorithms, which is perhaps why the formally verified C compiler, CompCert, is implemented in OCaml, as is Frama-C for formally verifying ISO C programs, as is the Coq theorem prover, etc.

[1] https://benchmarksgame-team.pages.debian.net/benchmarksgame/... [Edited link]




Look at the Go benchmark program:

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

Needs to be faster? Use a pool like the C benchmark programs, but super simple: just a slice, local to the goroutine.

Trivially make the Go program 10x faster, 5 minutes of work, a few edits:

  type Tree struct {
          Left  int
          Right int
  }

  func itemCheck(id int, pool []Tree) uint32 {
          tree := &pool[id]
          if tree.Left != -1 && tree.Right != -1 {
                  return uint32(1) + itemCheck(tree.Right, pool) + itemCheck(tree.Left, pool)
          }
          return 1
  }

  func bottomUpTree(depth uint32, pool *[]Tree) int {
          var tree Tree
          if depth > uint32(0) {
                  tree.Left = bottomUpTree(depth-1, pool)
                  tree.Right = bottomUpTree(depth-1, pool)
          } else {
                  tree.Left = -1
                  tree.Right = -1
          }
          id := len(*pool)
          *pool = append(*pool, tree)
          return id
  }

  func inner(depth, iterations uint32) string {
          var (
                  pool []Tree
                  chk  = uint32(0)
          )
          for i := uint32(0); i < iterations; i++ {
                  a := bottomUpTree(depth, &pool)
                  chk += itemCheck(a, pool)
                  pool = pool[:0]
          }
          return fmt.Sprintf("%d\t trees of depth %d\t check: %d",
                  iterations, depth, chk)
  }
If it matters, your Go program can be made as fast as a C program.

Easily.


Several (about 10) years ago, I submitted essentially the same program. It was one of the fastest of all the implementations across all the languages.

However, Isaac rejected it, saying it violated the conditions.


Of course. The problem description [1] specifically calls for the following:

When possible, use default GC; otherwise use per node allocation or use a library memory pool. As a practical matter, the myriad ways to tune GC will not be accepted. As a practical matter, the myriad ways to custom allocate memory will not be accepted. Please don't implement your own custom "arena" or "memory pool" or "free list" - they will not be accepted.

In the other words, this problem measures the memory allocation performance with the "natural" strategy, which is either a default GC, a memory pool or a per-node allocation in the order of availability. If the target implementation doesn't support GC by default the code would be necessarily more complex, which makes up the performance benefit. So you can't really compare submissions with different strategies, because they are just doing different things and there is no reasonable way to make them do the same thing [2].

[1] https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

[2] You may still argue that every submission should use a per-node allocation for the fair comparison, but a per-node allocation with GC and without GC is pretty different.


My submission did violate the conditions. I was just recollecting what happened; it is not a complaint.

I am at variance with those conditions, though. I refer to the benchmarks suite itself, broadly. When programs dip into GMP, PCRE, etc., we are no longer comparing the speeds of the different language implementations per se. Now, they are mixed with FFI speeds of those implementations, the choices of the foreign language libraries/algorithms, etc..

In the same vein, as you remark in your second footnote, for a fair comparison, the same algorithm should be used in all programs. That will shed light on the strengths and weaknesses of the different language implementations better. I don't intend to say that the current system is useless; I am saying that using the same set of algorithms is likely to be more illustrative.


The point of this benchmark (that you're deliberately ignoring) is actually to have at least one program that realistically shows how bad a language's GC is. Go performs terribly because its GC makes lousy tradeoffs, which is the only reason you're annoyed at its requirements. It's actually probably the best benchmark in the whole set on that site.


Can this benchmark be written using sync.Pool mentioned in the article? If so, then that should be accepted since sync.Pool comes in stdlib.


Debian Maintainers doing Debian Things.


Now do the same in Java and it will be as fast, if not faster. The point is the benchmark is looking at GC performance. In large real world programs, it is not trivial to make such changes.


Do it. Show me that Java can avoid allocating/collecting the Tree objects. Go has no indirection for the Tree objects, in this example. The Trees are contiguous in memory.

Measure the performance.

I have no garbage collection performance problems in the various large Go projects I have worked on. But I am also a competent programmer, so "your mileage may vary".


For the time being, until Valhalla is finalized, Java can emulate this approach simply by using an int array.

After making the modifications, on my local machine, Java ran as low as ~820ms, compared to golang's ~932ms.

golang's GC is no silver bullet as we see here:

* https://blog.discord.com/why-discord-is-switching-from-go-to...

* https://news.ycombinator.com/item?id=21670110


Thanks for doing the test!


No problem. I'll try to download a Valhalla JVM build and see how it fairs.


How does Java do without using an array of ints, or Valhalla? In other words, what's the cost of Java's forced indirection?

What happens if you use

  pool := make([]Tree, 0, 256)
instead of

  var pool []Tree
in the Go so it doesn't waste time growing tiny slices?


After making the change in the golang version, I saw it go as low as 867.61ms.

The JVM is very good at inlining. It's not perfect obviously, but in this benchmark just using an `int[]` wrapped in a class (similar to an ArrayList) produced quicker results than golang. I'm not surprised, since the golang compiler doesn't generate the greatest code.

I tried with Valhalla, with the following array wrapper, and I saw it run as quick as 667ms.

    stretch tree of depth 22  check: 8388607
    2097152  trees of depth 4  check: 65011712
    524288  trees of depth 6  check: 66584576
    131072  trees of depth 8  check: 66977792
    32768  trees of depth 10  check: 67076096
    8192  trees of depth 12  check: 67100672
    2048  trees of depth 14  check: 67106816
    512  trees of depth 16  check: 67108352
    128  trees of depth 18  check: 67108736
    32  trees of depth 20  check: 67108832
    long lived tree of depth 21  check: 4194303
    0.667455718

This is the Java Tree and array wrapper code I used

    inline class Tree {
     public int left; 
     public int right;

     public Tree(int left, int right) {
      this.left = left;
      this.right = right;
     }
    }
    
    ///////////////////////////////////////////////////////////////////////////////
    // Copyright (c) 2001, Eric D. Friedman All Rights Reserved.
    //
    // This library is free software; you can redistribute it and/or
    // modify it under the terms of the GNU Lesser General Public
    // License as published by the Free Software Foundation; either
    // version 2.1 of the License, or (at your option) any later version.
    //
    // This library is distributed in the hope that it will be useful,
    // but WITHOUT ANY WARRANTY; without even the implied warranty of
    // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    // GNU General Public License for more details.
    //
    // You should have received a copy of the GNU Lesser General Public
    // License along with this program; if not, write to the Free Software
    // Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA  02111-1307, USA.
    ///////////////////////////////////////////////////////////////////////////////
    
    final class TIntArrayList {
     protected Tree[] _data;
     protected int _pos;
     public TIntArrayList() {
      _data = new Tree[0];
      _pos = 0;
     }
    
     public void ensureCapacity(int capacity) {
      if (capacity > _data.length) {
       int newCap = Math.max(_data.length << 1, capacity);
       Tree[] tmp = new Tree[newCap];
       System.arraycopy(_data, 0, tmp, 0, _data.length);
       _data = tmp;
      }
     }
     public int size() {
      return _pos;
     }
    
     public void add(Tree val) {
      ensureCapacity(_pos + 1);
      _data[_pos++] = val;
     }
    
     public Tree get(int offset) {
      return _data[offset];
     }
    
     public void resetQuick() {
      _pos = 0;
     }
    }


Would you mind posting a complete example that compiles and runs? Your above changes to Go#8 break it. E.g. the function 'run' needs to be updated to use your new 'bottomUpTree' and 'inner'.


I'm not going to paste 170 lines of code into the thread. Add this line:

  var pool []Tree
to each of the stretch and long-lived closures in run(), pass &pool as the second argument to bottomUpTree(), pass pool as the second argument to itemCheck().

The call to inner() doesn't change.

It's trivial.


+1 Not bad, your simple pool improved performance from about 10.9 secs to 1.4 secs on my laptop here. However, it does break the stated rules for the benchmark.[1] The Java version doesn't use a pool and doesn't change Tree pointers into ints as your version does. The C/C++/Rust versions are required to use a "library memory pool." The rules state,

  Please don't implement your own custom "arena" or "memory pool"
  or "free list" - they will not be accepted."
Donald Knuth uses integers instead of proper pointers in the source code for TeX for maximum portability and performance so your technique is legit used by master programmers but in general replacing pointers with integer indices into arrays only works for monolithic programs of small to medium size. For example, could the Chromium maintainers rewrite the 20 million lines of C++ into Golang replacing most pointers with integer offsets into arrays for performance? No, impossible with current tools and practices, the technique won't scale to that, so this benchmark rule isn't totally stupid and unfair. This binary trees benchmark was originally devised decades ago by Hans Boehm, a notable GC researcher, as a way of testing GC performance that he as a GC implementer found informative.

[1] https://benchmarksgame-team.pages.debian.net/benchmarksgame/...


You don't rewrite your whole program to use pooled allocations.

Just the parts of your program where allocation performance needs to be improved.

If part of your program is doing many allocations, and it's too slow, use a simple pool. Go makes pooled allocation easy and safe.

I have written huge Go programs (e.g., https://NN-512.com), and the garbage collector is not a problem. Quite the opposite, it's a major, major benefit.

Pooled allocations are rarely needed.


We can't say that this is a purely GC benchmark. Generating and checking these trees takes time independently of garbage collection. Without a detailed report from the programming language run-time about how much time was spent in GC, it's just a guess.


> So Java has the fastest GC for this test, 2.48 secs vs 12.23 secs for Golang.

Further not mentioning memory used Java/Go programs makes it very fair comparison. Because GC perf does not depend on memory allocated.


Right now in the datacenter CPU usage is considerably more expensive than RAM usage. Ram consumes comparatively little power, whereas burning hot CPUs+GPUs are the reason datacenters are favored near cooling water and power stations. 2.48 vs 12.23 seconds for Java and Go is a big deal for how many solar panels or tons of coal are needed to run an app on Xeon or Epyc instances, whereas 1.7GB vs 0.4GB for Java and Go, a 4x difference in low-power memory usage, is not so big of deal.

At any rate I did link to the full table so everyone can see the mem usage, source listings, etc.


Just for fun set the Java heap to .4 Gigs or use GOGC to set the Go heap to 1.7 Gigs. If Go is faster then try some other sizes and draw a graph to see what the lines look like.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: