Parallel Roguelike Lev-Gen Benchmarks: Rust, Go, D, Scala and Nimrod (togototo.wordpress.com)
93 points by dom96 on Aug 23, 2013 | 56 comments



Really nice article.

> I think there may be a more concise way to parallelise parts of the problem in Rust, using something like (from the Rust docs):
>
> let result = ports.iter().fold(0, |accum, port| accum + port.recv() );

We plan to have convenient fork/join style parallelism constructs, so that you don't have to build it yourself using message passing or unsafe code. There is a prototype in the `par.rs` module in the standard library, although it's pretty rough at this point.

I'd be interested in seeing what the benchmark looks like with Rust 0.8, which has a totally rewritten (and at this point much faster) scheduling and channel implementation.


I'd like to test it with the new Rust-coded runtime, especially if it's faster, but I won't have time to build it for a couple of days. I actually avoided building it earlier because I thought the new scheduler was slower and needed time to mature (I think I read that it doesn't yet swap tasks between threads, or something along those lines?), but I must have misread.


It's been maturing fast and most of those problems have been fixed, although I wouldn't be surprised if you continue to see problems (and if you do, filing issues would be much appreciated) :)


Yes, this was an interesting benchmark; nice work. One question though: why are you using an old version of GCC (4.7)?

4.8 has been out for quite some time. I did a quick comparison using your benchmark between GCC 4.8.1 and Clang 3.3 on my machine, and Clang still won, but the difference was only ~3.5%.


I was using LLVM 3.2 because Rust and LLVM-D use that, so I thought it was fair. GCC 4.7 followed from this because I assumed it was of the same generation as LLVM 3.2, and that 4.8.1 was newer and so less fair to compare against LLVM 3.2.


I see. Personally I would prefer using the latest release available for each language, as that more closely mirrors 'real world' scenarios.

But that's preferences for you, everyone has them :)


You're right that it probably makes for a better benchmark. I tried to use recent versions for the other languages; I just didn't bother for C and C++ because I assumed they'd already be the fastest in class, so there was no need to test on the latest compilers.


Well, my main interest in this benchmark was actually Rust and Go, as those are the languages tested that I'm personally most interested in (nice to see them performing quite well). I just found the use of older compiler versions odd (though you explained your rationale).

Any chance you would consider putting the compiler version used (for all compilers, not just C/C++) next to the compiler in the column, for better disclosure?

Good luck on your game!


Thanks! And sure, I'll put the compiler versions up sometime next week, along with the compile times.


I tested it with the new runtime, but it only ran at around 87% of the speed of the original. I also needed to add a task::deschedule() call to make the scheduler behave.


Have sent in a pull request:

https://github.com/logicchains/Levgen-Parallel-Benchmarks/pu...

Go performance improves from ~470 ms to ~360 ms on my quad-core MBP 15 (Late 2011).


Thanks, that makes it faster than Scala.



I think that optimization can actually be applied to all the languages.


Cool article! One thing that might be important if you're looking at these languages for more than just speed: Go and C will tend to have radically lower memory usage (often something like 10x) than most of the other languages there, such as Scala. This can be very important depending on what your application is. For me, using Go for game world servers was my choice because I can do so much more simulation per dollar of server RAM.


I would bet that C++, Rust, D, and really any language that doesn't have a large runtime and doesn't generally box its variables would also have lower memory usage.


Just to list a few forgotten ones that have faded away but could also fit the bill:

- Modula-2

- Modula-3

- Ada

- Turbo/Apple/Object Pascal

- Delphi

- Oberon, Active Oberon, Oberon-2, Component Pascal


What's important is the marginal memory usage. I'd rather pay a 20-30 MB JVM memory tax upfront if it levels out with C/C++ for a long-running process and lowers the possibility of memory leaks and memory corruption.


If it was just the 20-30 MB JVM tax, I might be with you. But in applications that allocate and deallocate a lot of memory, the GC of every JVM implementation I've used also hoards memory from the system (well there are a few implementations that give it back, but they cause app performance to nosedive).

If you're building a piece of software that needs to co-exist with other memory intensive processes, the JVM's policy of taking memory and rarely giving it back can cause unnecessary swapping. Heaven forbid you need to run multiple JVMs on the same machine (like with hadoop).

Then you start having to do silly things like specifying how much memory your app is allowed to use, which makes me feel like I'm on classic Mac OS.


This is a big issue for me too. IIRC the next version of the JVM will allow you to specify a target memory usage as well as a max, which will encourage it to GC more after memory spikes. You should be able to return some of this to the system without too much penalty (as long as the spikes aren't too frequent...).


No need to wait for the "next gen" GC. Right now you can specify the minimum percentage of free heap space that triggers giving memory back to the OS. Yes, you need to specify a few more GC parameters, but it's certainly doable to force the JVM to behave this way.

I used this approach in the past to keep memory usage lean. I had some spikes in the application that required a lot of memory, so I had to keep Xmx high, but I configured the app to trigger deallocation to the OS whenever there was more than 20% of free memory. Worked like a charm even on an old 1.5-era Sun JVM.


That mostly works, but you do still run into the issue of the memory hanging around until collection: if your JVM doesn't choose to collect (because it has a high Xmx set), it keeps a bunch of RAM.


The memory tax for GCed languages was written up in [1]. The blog post [2] (which is a great read that I recommend) summarized it like this: "As long as you have about 6 times as much memory as you really need, you’re fine. But woe betide you if you have less than 4x the required memory."

[1] http://www-cs.canisius.edu/~hertzm/gcmalloc-oopsla-2005.pdf [2] http://sealedabstract.com/rants/why-mobile-web-apps-are-slow...


The benchmark now includes maximum resident memory usage. Go does indeed perform quite well (at least with 6g), using around 30MiB, compared to around 25/26MiB for the non-garbage-collected languages. Scala is memory-heavy, as expected, and Rust seems to use a surprising amount of memory for some reason.


Don't forget about Nimrod, which performs extremely well for a garbage-collected language. And D, which does too.


True. I didn't actually realise the Nimrod implementation was garbage-collected; I just assumed for some reason that it managed memory in a hidden, manual way like the C++ implementation. Note that garbage collection may not actually occur in this benchmark; for the single-threaded benchmark at least, the Go runtime didn't make a single call to the GC, so D and Nimrod might be similar.
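
For what it's worth, one way to check that sort of thing from inside a Go program (just a sketch, not part of the benchmark; doWork here is only a stand-in for the level-generation loop) is to compare the runtime's GC counter before and after the work:

  package main

  import (
    "fmt"
    "runtime"
  )

  // doWork stands in for the real work; it exists only so the
  // example compiles and runs on its own.
  func doWork() {
    s := 0
    for i := 0; i < 10000000; i++ {
      s += i
    }
    _ = s
  }

  func main() {
    var before, after runtime.MemStats
    runtime.ReadMemStats(&before)
    doWork()
    runtime.ReadMemStats(&after)
    // NumGC counts completed GC cycles; a difference of zero means
    // the collector never ran while the work was going on.
    fmt.Printf("GC cycles during work: %d\n", after.NumGC-before.NumGC)
  }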


Although it wouldn't really be relevant for the level generation we're doing unless the memory usage was outrageous, I'm open to including memory usage in the table for comprehensiveness's sake. What would you say is the best way to measure RAM usage on Linux; is there something as simple as 'time ./ThisExecutable' is for measuring time?


Try the time command (not the shell builtin):

    $ command time -f 'max resident:\t%M KiB' ls /
    bin   etc	  initrd.img.old  lib32       media  proc  sbin  tmp  vmlinuz
    boot  home	  iso		  lib64       mnt    root  srv	 usr  vmlinuz.old
    dev   initrd.img  lib		  lost+found  opt    run   sys	 var
    max resident:	968 KiB


Thanks, I've included max resident memory statistics on the benchmark.


This is a really good writeup on summing up Private_Dirty and Shared_Dirty for a set of PIDs:

http://tech.brightbox.com/posts/2012-11-28-measuring-shared-...

https://news.ycombinator.com/item?id=4842782


That doesn't only apply to Go and C, it most likely applies to all of the languages in the benchmark which do not rely on the JVM.


Is this actually true on a marginal basis, especially in comparison to Go? My understanding was that JVM languages have a larger constant upfront memory footprint, but the difference isn't necessarily an increasing function of workload (perhaps of program size).

Edit: Also, how is this relevant for anything other than Scala specifically? It seems odd for C and Go to end up on one side and C++, Rust, D and Nimrod to be on the other side, if memory consumption is a big concern.


Generally, yes. Try some examples to see. I've done a lot of comparisons of game world simulators written in Go and Scala, accomplishing similar things with radically different memory footprints. I think the difference has to do with the fact that everything you do in Scala produces tons of little objects for anonymous functions and loop iterators and so on.

For many applications this doesn't matter, especially with how cheap RAM is. But for some applications, such as mobile games and rented server space where your profitability is a function involving RAM cost per user, this can be very significant.


This is absolutely true if you are writing idiomatic Scala code (especially if you are using their collections API).

That said, you can lower your memory requirements dramatically using pretty standard high-performance JVM techniques in Scala, much as you can in Java. I've designed server systems in Scala that produce no garbage for weeks at a time.

This can be a good option if you have some small subset of your system that needs to be memory-sensitive/high-performance, without having to lose the benefits of the JVM in the rest of the system.


Odd. Why is the C version so much slower than the C++ version? The only thing I notice that's odd is the unnecessary Room/Tile structs + memcpy in MakeLevs, but that can't be responsible for that level of difference.

Also, there is a small bug in the C version: when it goes to print out the level, it only compares the first 100, so it'll print a different level from the C++ version with a seed of, say, 20.


Changing MakeRoom to be inline (gcc 4.6.3) turns out to be a significant speedup.

There also appears to be a small benefit from modifying the Lev struct to have Room first.


Haskell was excluded from this benchmark because I can't figure it out.


Take the code here: https://github.com/logicchains/levgen-benchmarks/blob/master... and run it. Then, take the same code, change the genRooms function to contain:

  where
    noFit    = genRooms (n-1) (restInts) rsDone
    tr       = Room {rPos=(x,y), rw= w, rh= h}
    x        = rem (U.unsafeHead randInts) levDim
    y        = rem (U.unsafeIndex randInts 1) levDim
    restInts = U.unsafeDrop 4 randInts
    w        = rem (U.unsafeIndex randInts 2) maxWid + minWid
    h        = rem (U.unsafeIndex randInts 3) maxWid + minWid

And change:

let rands = U.unfoldrN 10000000 (Just . next) gen

to:

let rands = U.unfoldrN 20000000 (Just . next) gen

The running time should double. Does it? Or does it increase by orders of magnitude? The latter is what happens to me.


I just changed the random number bit, not the 10000000 part, and it took ~100x longer. No idea why.


Maybe a compiler bug?


I'm a total noob to Haskell and I'm not sure I understand the code well enough to ask an intelligent question about it, but I'd love to see you (or someone) ask on the Haskell mailing list, Stack Overflow, /r/haskell, or somewhere similar.


I'll post a question on Stack Overflow then when I have time.


Ease of use matters.


This is not really too relevant to the article, but that random generator makes me cringe. You can replace it with a decent quality one (xorshift) with about as many lines of code:

  uint32_t genrand(uint32_t *seed)
  {
    uint32_t x = *seed;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *seed = x;
    return x;
  }


It actually was changed from stdlib RNG -> xorshift -> this RNG when the previous (non-parallel) article was posted.


Although it's nice to see which compiler performs best, I'm really curious as to why one compiler outperforms another for the same language. What's GCC's Achilles heel, for example? (It seems to consistently do worse for every language it's used for.)


GCC outputs a jmp in the assembly for GenRands rather than a cmov. This introduces a branch, leading to branch mispredictions, which waste time.


Thank you for clearing that up. You probably won't see this after 11 days, but I have to try: is that a missed optimisation or a deliberate design decision?


Why are you parallelizing these in such different ways? For example, in Go you start 800 goroutines, but in others you start just 4 threads/tasks as a worker pool. I would imagine these would indeed give quite different results.


It's done differently in Go because someone else wrote it, and I found it to be faster than the version I wrote using a worker pool of 4 tasks. If you look closely you'll see it doesn't actually ever run more than 4 goroutines simultaneously.
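
For anyone wondering what that looks like: one common way to get "many goroutines, but only N doing work at once" is a buffered channel used as a semaphore. This is only a sketch of the pattern (with made-up names), not the actual benchmark code:

  package main

  import "sync"

  const numLevels = 800 // one goroutine per level
  const maxActive = 4   // but at most this many working at once

  // makeLevel stands in for the per-level generation work.
  func makeLevel(seed int) {
    _ = seed
  }

  func main() {
    sem := make(chan struct{}, maxActive)
    var wg sync.WaitGroup
    for i := 0; i < numLevels; i++ {
      wg.Add(1)
      go func(seed int) {
        defer wg.Done()
        sem <- struct{}{} // acquire a slot
        makeLevel(seed)
        <-sem // release the slot
      }(i)
    }
    wg.Wait()
  }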


I noticed that, but you're still allocating 800 goroutines in the end. I guess in the scope of the benchmark that is still relatively cheap; it was just a red flag for me.


No worries. A commenter here submitted a faster version, so it uses that now instead anyway.


Thanks :P And now an even faster version for your perusal:

https://github.com/logicchains/Levgen-Parallel-Benchmarks/pu...

> 120 ms improvement over the original implementation


The original was running up to 8 goroutines (my runtime.NumCPU() is 8).


I can't say anything about the other languages, but as far as Go is concerned, I suspect it's to achieve better load balancing. See Dmitri's explanation here:

https://groups.google.com/d/msg/golang-nuts/CZVymHx3LNM/esYk...

(Dmitri is one of the main people currently responsible for the further development of the Go scheduler.) If you only have 8 goroutines running at a time on 8 cores and one stalls, that wastes CPU time. If you have, say, 80, the scheduler will put another one to work, so the CPU is idle less often.


True, but that would only apply when you are dealing with tasks that are likely to stall. In this particular case we are doing CPU-intensive calculations in tight loops and no I/O. I scaled the number of goroutines from 1 to 8 (the total number of cores in my MBP with Hyper-Threading) and got the best performance with 8; performance started regressing beyond 8.
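
For concreteness, a pool of N worker goroutines pulling jobs off a channel looks roughly like this (a sketch with made-up names, not the benchmark code; here N is simply runtime.NumCPU()):

  package main

  import (
    "runtime"
    "sync"
  )

  // makeLevel stands in for the per-level generation work.
  func makeLevel(seed int) {
    _ = seed
  }

  func main() {
    numWorkers := runtime.NumCPU()
    runtime.GOMAXPROCS(numWorkers) // Go 1.1-era runtimes default to 1

    jobs := make(chan int, 800)
    var wg sync.WaitGroup
    for w := 0; w < numWorkers; w++ {
      wg.Add(1)
      go func() {
        defer wg.Done()
        for seed := range jobs {
          makeLevel(seed) // each worker pulls level seeds off the channel
        }
      }()
    }
    for i := 0; i < 800; i++ {
      jobs <- i
    }
    close(jobs)
    wg.Wait()
  }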



