> I think there may be a more concise way to parallelise parts of the problem in Rust, using something like (from the Rust docs):
>
> let result = ports.iter().fold(0, |accum, port| accum + port.recv() );
We plan to have convenient fork/join style parallelism constructs, so that you don't have to build it yourself using message passing or unsafe code. There is a prototype in the `par.rs` module in the standard library, although it's pretty rough at this point.
I'd be interested in seeing what the benchmark looks like with Rust 0.8, which has a totally rewritten (and at this point much faster) scheduling and channel implementation.
I'd like to test it with the new Rust-coded runtime, especially if it's faster, but I won't have time to build it for a couple of days. I actually avoided building it earlier because I thought the new scheduler was slower and needed time to mature (I think I read that it doesn't yet swap tasks between threads, or something along those lines?), but I must have misread.
It's been maturing fast and most of those problems have been fixed, although I wouldn't be surprised if you continue to see problems (and if you do, filing issues would be much appreciated) :)
Yes, this was an interesting benchmark, nice work. One question, though: why are you using an old version of GCC (4.7)?
4.8 has been out for quite some time. I did a quick comparison using your benchmark between gcc 4.8.1 and clang 3.3 on my system; clang still won, but the difference was ~3.5% on my machine.
I was using LLVM 3.2 because Rust and LLVM-D use that, so I thought it was fair. GCC 4.7 followed from this because I assumed it was of the same generation as LLVM 3.2, and that 4.8.1 was newer so less fair to compare to LLVM 3.2.
You're right that it probably makes for a better benchmark, and I did try to use recent versions for the other languages; I just didn't bother for C and C++, as I assumed they'd already be the fastest in class, so there was no need to test on the latest compilers.
Well, my main interest in this benchmark was actually Rust and Go, as those are the languages tested that I'm personally most interested in (nice to see them performing quite well). I just found the use of older compiler versions odd (though you explained your rationale).
Any chance you would consider putting the compiler version used (for all compilers, not just c/c++) next to the compiler in the column for better disclosure?
I tested it with the new runtime, but it only ran at around 87% of the speed of the original. I also needed to add a task::deschedule() call to make the scheduler behave.
Cool article! One thing besides just speed that might be important if you're looking at these languages: Go and C will tend to have radically lower memory usage (often around 10x) than most of the other languages there, such as Scala. This can be very important depending on what your application is. For me, using Go for game world servers was my choice because I can do so much more simulation per dollar of server RAM.
I would bet that C++ and Rust and D, and really any language that doesn't have a large runtime and doesn't generally box its variables, would also have less memory usage.
What's important is the marginal memory usage. I'd rather pay a 20-30 MB JVM memory tax up front if it levels out with C/C++ for long-running processes and lowers the possibility of memory leaks and memory corruption.
If it was just the 20-30 MB JVM tax, I might be with you. But in applications that allocate and deallocate a lot of memory, the GC of every JVM implementation I've used also hoards memory from the system (well there are a few implementations that give it back, but they cause app performance to nosedive).
If you're building a piece of software that needs to co-exist with other memory intensive processes, the JVM's policy of taking memory and rarely giving it back can cause unnecessary swapping. Heaven forbid you need to run multiple JVMs on the same machine (like with hadoop).
Then you start having to do silly things like specifying how much memory your app is allowed to use, which makes me feel like I'm on classic Mac OS.
This is a big issue for me too. IIRC the next version of the JVM will allow you to specify a target memory usage as well as a max, which will encourage it to GC more after memory spikes. You should be able to return some of this to the system without too much penalty (as long as the spikes aren't too frequent...).
No need to wait for the "next gen" GC. Right now you can specify the minimum percentage of free space in the heap that triggers giving memory back to the OS. Yes, you need to specify a few more GC parameters, but it's certainly doable to force the JVM to behave this way.
I used this approach in the past to keep memory usage lean. I had some spikes in the application that required a lot of memory, so I had to keep Xmx high, but I configured the app to trigger deallocation to the OS when there was more than 20% free memory. Worked like a charm even on an old 1.5-era Sun JVM.
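For reference, the relevant HotSpot knobs are the heap free ratio flags; roughly something like this (the heap sizes and jar name here are just placeholders, and exact behaviour depends on the collector):

$ java -Xms64m -Xmx2g -XX:MinHeapFreeRatio=10 -XX:MaxHeapFreeRatio=20 -jar app.jar

With -XX:MaxHeapFreeRatio=20, the heap shrinks after a GC whenever more than 20% of it is free, which is the behaviour described above.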
That mostly works, but you do still run into the issue of the memory hanging around until collection - so if your JVM doesn't choose to collect (because it has a high Xmx set), it keeps a bunch of ram.
The memory tax for GCed languages was written up in [1]. The blog post [2] (which is a great read that I recommend) summarized it like this: "As long as you have about 6 times as much memory as you really need, you’re fine. But woe betide you if you have less than 4x the required memory."
The benchmark now includes maximum resident memory usage. Go does indeed perform quite well (at least with 6g), using around 30MiB, compared to around 25/26MiB for the non-garbage-collected languages. Scala is memory-heavy, as expected, and Rust seems to use a surprising amount of memory for some reason.
True. I didn't actually realise the Nimrod implementation was garbage collected, I just assumed for some reason it managed memory in a hidden manual way like the C++ implementation. Note that garbage collection may not actually occur in this benchmark; for the single threaded benchmark at least, the Go runtime didn't make a single call to the GC, so D and Nimrod might be similar.
Although it wouldn't really be relevant for the level generation we're doing unless the memory usage was outrageous, I'm open to including memory usage in the table for comprehensiveness's sake. What would you say is the best way to measure RAM usage on Linux; is there something as simple as 'time ./ThisExecutable' is for measuring time?
$ command time -f 'max resident:\t%M KiB' ls /
bin etc initrd.img.old lib32 media proc sbin tmp vmlinuz
boot home iso lib64 mnt root srv usr vmlinuz.old
dev initrd.img lib lost+found opt run sys var
max resident: 968 KiB
Is this actually true on a marginal basis, especially in comparison to Go? My understanding was that JVM languages have a larger constant upfront memory footprint, but the difference isn't necessarily an increasing function of workload (perhaps of program size).
Edit: Also, how is this relevant for anything other than Scala specifically? It seems odd for C and Go to end up on one side and C++, Rust, D and Nimrod to be on the other side, if memory consumption is a big concern.
Generally, yes. Try some examples to see. I've done a lot of comparisons of game world simulators written in Go and Scala, accomplishing similar things with radically different memory footprints. I think the difference has to do with the fact that everything you do in Scala produces tons of little objects for anonymous functions and loop iterators and so on.
For many applications this doesn't matter, especially with how cheap RAM is. But for some applications, such as mobile games and rented server space where your profitability is a function involving RAM cost per user, this can be very significant.
This is absolutely true if you are writing idiomatic Scala code (especially if you are using their collections api).
That said, you can lower your memory requirements dramatically using pretty standard high performance JVM techniques in Scala much as you can in Java. I've designed server systems in scala that produce no garbage for weeks at a time.
This can be a good option if you have some small subset of your system that needs to be memory sensitive/high performance without having to lose the benefits of the JVM in the rest of the system.
Odd. Why is the C version so much slower than the C++ version? The only thing I notice that's odd is the unnecessary Room/Tile structs + memcpy in MakeLevs, but that can't be responsible for that level of difference.
Also there is a small bug in the C version. When it goes to print out the level, it only compares the first 100, so it'll print a different level from the C++ version with a seed of say, 20.
I'm a total noob to haskell and I'm not sure I understand the code well enough to ask an intelligent question about it, but I'd love to see you/someone ask on the haskell mailing list/stackoverflow/reddit-haskell or something.
This is not really too relevant to the article, but that random generator makes me cringe. You can replace it with a decent quality one (xorshift) with about as many lines of code:
#include <stdint.h>

/* xorshift32 (Marsaglia): small, fast, and decent quality.
   The seed must be initialised to a non-zero value. */
uint32_t genrand(uint32_t *seed)
{
    uint32_t x = *seed;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *seed = x;
    return x;
}
Although it's nice to see which compiler performs best, I'm really curious as to why one compiler outperforms another for the same language. What's gcc's Achilles heel, for example? (It seems to consistently do worse for every language it's used for.)
Thank you for clearing that up. You probably won't see this after 11 days, but I have to try: is that a missed optimisation or a deliberate design decision?
Why are you parallelizing these in such different ways? For example, in Go you start 800 goroutines, but in others you start just 4 threads/tasks as a worker pool. I would imagine these would indeed give quite different results.
It's done differently in Go because someone else wrote it, and I found it to be faster than the version I wrote using a worker pool of 4 tasks. If you look closely you'll see it doesn't actually ever run more than 4 goroutines simultaneously.
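The pattern is essentially a buffered channel acting as a counting semaphore. A minimal sketch of the idea (not the actual benchmark code; genLevel here is a stand-in for the real level-generation function):

package main

import "sync"

// genLevel stands in for the real level-generation work.
func genLevel(seed int) {}

func main() {
	const numLevels = 800
	const maxWorkers = 4

	sem := make(chan struct{}, maxWorkers) // buffered channel used as a counting semaphore
	var wg sync.WaitGroup

	for i := 0; i < numLevels; i++ {
		wg.Add(1)
		sem <- struct{}{} // blocks while maxWorkers goroutines are already running
		go func(seed int) {
			defer wg.Done()
			genLevel(seed)
			<-sem // free a slot for the next goroutine
		}(i)
	}
	wg.Wait()
}

Only 4 goroutines ever do work at the same time, even though 800 are created over the course of the run.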
I noticed that, but you're still allocating for 800 goroutines in the end. I guess in the scope of the benchmark, that is still relatively cheap. It just was a red flag for me.
I can't say anything about the other languages, but as far as Go is concerned, I suspect it's to achieve better load balancing. See Dmitri's explanation here:
(Dmitri is one of the main people currently responsible for the further development of the Go scheduler). If you only have 8 goroutines running at a time with 8 cores, and one stalls, that wastes CPU usage. If you have, say, 80, the scheduler will put another one to work so the CPU is idle less often.
True, but that would apply only when you are dealing with tasks which are likely to stall. In this particular case, we are doing CPU-intensive calculations in tight loops and no I/O. I scaled the number of goroutines from 1 to 8 (total cores in my MBP w/ Hyperthreading) and got the best performance with 8. Performance started regressing beyond 8.