

Parallel Roguelike Lev-Gen Benchmarks: Rust, Go, D, Scala and Nimrod - dom96
http://togototo.wordpress.com/2013/08/23/benchmarks-round-two-parallel-go-rust-d-scala-and-nimrod/

======
pcwalton
Really nice article.

> I think there may be a more concise way to parallelise parts of the problem
> in Rust, using something like (from the Rust docs):

    let result = ports.iter().fold(0, |accum, port| accum + port.recv());

We plan to have convenient fork/join style parallelism constructs, so that you
don't have to build it yourself using message passing or unsafe code. There is
a prototype in the `par.rs` module in the standard library, although it's
pretty rough at this point.

I'd be interested in seeing what the benchmark looks like with Rust 0.8, which
has a totally rewritten (and at this point much faster) scheduling and channel
implementation.

~~~
logicchains
I'd like to test it with the new Rust-coded runtime, especially if it's
faster, but I won't have time to build it for a couple of days. I actually
avoided building it earlier because I thought the new scheduler was slower and
needed time to mature (I think I read that it doesn't yet swap tasks between
threads, or something along those lines?), but I must have misread.

~~~
gillianseed
Yes, this was an interesting benchmark, nice work. One question, though: why
are you using an old version of GCC (4.7)?

4.8 has been out for quite some time. I did a quick comparison using your
benchmark between GCC 4.8.1 and Clang 3.3, and Clang still won, but the
difference was only ~3.5% on my machine.

~~~
logicchains
I was using LLVM 3.2 because Rust and LLVM-D use it, so I thought that was
fair. GCC 4.7 followed from this because I assumed it was of the same
generation as LLVM 3.2, whereas 4.8.1 is newer and so less fair to compare
against LLVM 3.2.

~~~
gillianseed
I see. Personally, I would prefer using the latest release available for each
language, as that more likely mirrors 'real world' scenarios.

But that's preferences for you, everyone has them :)

~~~
logicchains
You're right that it probably makes for a better benchmark, and I tried to use
recent versions for the other languages; I just didn't bother for C and C++
because I assumed they'd already be the fastest in class, so there was no need
to test on the latest compilers.

~~~
gillianseed
Well, my main interest in this benchmark was actually Rust and Go, as those
are the languages tested that I'm personally most interested in (nice to see
them performing quite well). I just found the use of older compiler versions
odd (though you've explained your rationale).

Any chance you would consider putting the compiler version used (for all
compilers, not just C/C++) next to the compiler name in the column, for better
disclosure?

Good luck on your game!

~~~
logicchains
Thanks! And sure, I'll put the compiler versions up sometime next week, along
with the compile times.

------
kid0m4n
Have sent in a pull request:

[https://github.com/logicchains/Levgen-Parallel-
Benchmarks/pu...](https://github.com/logicchains/Levgen-Parallel-
Benchmarks/pull/9)

Go performance improves from ~470 ms to ~360 ms on my Quad core MBP 15 (Late
2011)

~~~
logicchains
Thanks, that makes it faster than Scala.

~~~
kid0m4n
Welcome! Now have a look at this: [https://github.com/logicchains/Levgen-
Parallel-Benchmarks/pu...](https://github.com/logicchains/Levgen-Parallel-
Benchmarks/pull/11)

~~~
dbaupp
I think that optimization can actually be applied to all the languages.

------
unoti
Cool article! One thing that might be important to people if you're looking at
these languages besides just speed: Go and C will tend to have radically lower
memory usage (often like 10x) than most of the other languages there, such as
Scala. This can be very important depending on what your application is. For
me, using Go for game world servers was my choice because I can do so much
more simulation per dollar of server RAM.

~~~
levosmetalo
What's important is the marginal memory usage. I'd rather pay the 20-30 MB of
JVM memory tax upfront if it evens out with C/C++ for a long-running process
and lowers the possibility of memory leaks and memory corruption.

~~~
Aloisius
If it was just the 20-30 MB JVM tax, I might be with you. But in applications
that allocate and deallocate a lot of memory, the GC of every JVM
implementation I've used also hoards memory from the system (well there are a
few implementations that give it back, but they cause app performance to
nosedive).

If you're building a piece of software that needs to co-exist with other
memory intensive processes, the JVM's policy of taking memory and rarely
giving it back can cause unnecessary swapping. Heaven forbid you need to run
multiple JVMs on the same machine (like with Hadoop).

Then you start having to do silly things like specifying how much memory your
app is allowed to use which makes me feel like I'm on classic MacOS.

~~~
AlisdairO
This is a big issue for me too. IIRC the next version of the JVM will allow
you to specify a target memory usage as well as a max, which will encourage it
to GC more after memory spikes. You should be able to return some of this to
the system without too much penalty (as long as the spikes aren't too
frequent...).

~~~
levosmetalo
No need to wait for the "next gen" GC. Right now you can specify the minimum
percentage of free heap space that triggers giving memory back to the OS.
Yes, you need to specify a few more GC parameters, but it's certainly doable
to force the JVM to behave this way.

I used this approach in the past to keep memory usage lean. I had some spikes
in the application that required a lot of memory, so I had to keep Xmx high,
but I configured the app to release memory to the OS whenever more than 20%
of the heap was free. It worked like a charm, even on an old 1.5-era Sun JVM.

~~~
AlisdairO
That mostly works, but you do still run into the issue of the memory hanging
around until collection: if your JVM doesn't choose to collect (because it
has a high Xmx set), it keeps a bunch of RAM.

------
Aloisius
Odd. Why is the C version so much slower than the C++ version? The only thing
I notice that's odd is the unnecessary Room/Tile structs + memcpy in MakeLevs,
but that can't be responsible for that level of difference.

Also there is a small bug in the C version. When it goes to print out the
level, it only compares the first 100, so it'll print a different level from
the C++ version with a seed of say, 20.

~~~
mgse
Changing MakeRoom to be inline (GCC 4.6.3) turns out to be a significant
speedup.

There also appears to be a small benefit from modifying the Lev struct to put
Room first.

------
pbsd
This is not really too relevant to the article, but that random generator
makes me cringe. You can replace it with a decent quality one (xorshift) with
about as many lines of code:

    uint32_t genrand(uint32_t *seed)
    {
        uint32_t x = *seed;
        x ^= x << 13;
        x ^= x >> 17;
        x ^= x << 5;
        *seed = x;
        return x;
    }

~~~
dbaupp
It actually was changed from stdlib RNG -> xorshift -> this RNG when the
previous (non-parallel) article was posted.

------
jgale
Haskell was excluded from this benchmark because I can't figure it out.

~~~
logicchains
Take the code here: [https://github.com/logicchains/levgen-
benchmarks/blob/master...](https://github.com/logicchains/levgen-
benchmarks/blob/master/H.hs) and run it. Then, take the same code, change the
genRooms function to contain:

    where
        noFit    = genRooms (n-1) restInts rsDone
        tr       = Room {rPos=(x,y), rw=w, rh=h}
        x        = rem (U.unsafeHead randInts) levDim
        y        = rem (U.unsafeIndex randInts 1) levDim
        restInts = U.unsafeDrop 4 randInts
        w        = rem (U.unsafeIndex randInts 2) maxWid + minWid
        h        = rem (U.unsafeIndex randInts 3) maxWid + minWid

And change:

    let rands = U.unfoldrN 10000000 (Just . next) gen

to:

    let rands = U.unfoldrN 20000000 (Just . next) gen

The running time should double. Does it? Or does it increase by orders of
magnitude? The latter is what happens to me.

~~~
cgag
I just changed the random number bit, not the 10000000 part, and it took ~100x
longer. No idea why.

~~~
logicchains
Maybe a compiler bug?

~~~
cgag
I'm a total noob to haskell and I'm not sure I understand the code well enough
to ask an intelligent question about it, but I'd love to see you/someone ask
on the haskell mailing list/stackoverflow/reddit-haskell or something.

~~~
logicchains
I'll post a question on Stack Overflow then when I have time.

------
vanderZwan
Although it's nice to see which compiler performs best, I'm really curious as
to _why_ one compiler outperforms another for the same language. What's GCC's
achilles heel, for example? (It seems to consistently do worse for every
language it's used for.)

~~~
logicchains
GCC outputs a jmp in the assembly for GenRands rather than a cmov. This
involves branching, leading to branch misses, which waste time.

~~~
vanderZwan
Thank you for clearing that up. You probably won't see this after 11 days, but
I have to try: is that a missed optimisation or a deliberate design decision?

------
james4k
Why are you parallelizing these in such different ways? For example, in Go you
start 800 goroutines, but in others you start just 4 threads/tasks as a worker
pool. I would imagine these would give quite different results.

~~~
logicchains
It's done differently in Go because someone else wrote it, and I found it to
be faster than the version I wrote using a worker pool of 4 tasks. If you look
closely you'll see it doesn't actually ever run more than 4 goroutines
simultaneously.
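The worker-pool variant described here can be sketched as follows (an
illustration only, not the benchmark's actual code): a fixed pool of four
goroutines drains a channel of jobs, so any number of tasks can be queued
while at most four run at once. Squaring the job index stands in for
generating a level.

```go
package main

import (
	"fmt"
	"sync"
)

const numWorkers = 4 // matches the 4-task pool described above

// runPool fans numJobs trivial tasks out to numWorkers goroutines and
// sums the results; the j*j body is a stand-in for generating level j.
func runPool(numJobs int) int {
	jobs := make(chan int, numJobs)
	results := make(chan int, numJobs)
	var wg sync.WaitGroup

	for w := 0; w < numWorkers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range jobs {
				results <- j * j
			}
		}()
	}

	for i := 0; i < numJobs; i++ {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	close(results)

	sum := 0
	for r := range results {
		sum += r
	}
	return sum
}

func main() {
	fmt.Println(runPool(800)) // 800 tasks, but at most 4 run at once
}
```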

~~~
james4k
I noticed that, but you're still allocating for 800 goroutines in the end. I
guess in the scope of the benchmark, that is still relatively cheap. It just
was a red flag for me.

~~~
logicchains
No worries. A commenter here submitted a faster version, so it uses that now
instead anyway.

~~~
kid0m4n
Thanks :P And now an even faster version for your perusal:

[https://github.com/logicchains/Levgen-Parallel-
Benchmarks/pu...](https://github.com/logicchains/Levgen-Parallel-
Benchmarks/pull/11)

> 120 ms improvement over the original implementation

