Go vs C++: Ray tracer (part 3) (kidoman.com)
115 points by kid0m4n on Oct 5, 2013 | 60 comments



Someone posted something about Nimrod a while back that is totally related [0], and I think it is interesting that it keeps getting passed over. I am not a language expert, but I have become very curious about the multiple alternatives in the systems programming niche, among others. Nimrod has been in development since at least 2008 (with a 0.6 branch that year) [1], while the Go language made its first public debut in 2009 (I do not know how stable it was at the time, and I gave up after going through the first pages of the whole commit history to get a better answer). [2] At the very least, they were developed in the same time frame, are nominally similar languages, and address a lot of the same use cases as Rust, I suppose. Yet, unfortunately, few benchmarks include them.

I guess it will always be an indie language, but I wish more people would check it out. As I continue to learn, maybe I can contribute code to the project. We will see if I ever get that far.

[0] http://nimrod-code.org/

[1] http://nimrod-code.org/news.html

[2] http://en.wikipedia.org/wiki/Go_programming_language


>and I think it is interesting it keeps getting passed over.

If you think it's being passed over, then you need to try to generate interest in the language, such as by posting articles about it that you find, or, if you find none, writing one.

Same goes for benchmarking; it's unlikely that people who aren't interested in the language will write benchmark versions for that language, so it's up to those who are interested in it to provide them.

Which is what someone did in the Rogue Level Generation Benchmark, where I recall Nimrod performed very well.

http://togototo.wordpress.com/2013/08/23/benchmarks-round-tw...

From my own very cursory glance I'm not quite sure why you would compare Nimrod directly to Go, I think it's more aptly compared to something like Rust, with both having optional garbage collector, generics, macros etc.


> If you think it's being passed over then you need to try and generate interest in the language

I think that 616c is doing exactly this with his comment. Not everyone can write effective articles, and writing articles also takes a lot of time and effort. Writing comments advertising the language is usually the next best thing.

I wrote a couple of Nimrod benchmarks including the one you linked to. But it again takes time, and I would prefer to improve the language and its tools than to be writing tons of benchmarks.

I really wish more people gave Nimrod a serious chance because it is truly a brilliant programming language and it really deserves to get more exposure.

Nimrod's target user base very much clashes with Go's, I hear about Python programmers switching to Go because it reminds them of Python so much. Well, I don't see it. But Nimrod is definitely a fast and compiled Python with lots of extras.


Thank you dom96. I see you often in Nimrod forum posts and with cool code on Github. I look forward to brushing up on the documentation and eventually starting to code in Nimrod. You and Araq have some inspired/inspiring conversations on language design on that forum. Keep it up, it is a very fruitful read.


That's nice to hear. Join the IRC channel or read the IRC logs (http://build.nimrod-code.org/irclogs/) if you wanna read more discussions or take part in them in real time :)


> I hear about Python programmers switching to Go because it reminds them of Python so much.

Given what is thrown away, I also find it hard to grasp how the languages relate.


> From my own very cursory glance I'm not quite sure why you would compare Nimrod directly to Go, I think it's more aptly compared to something like Rust, with both having optional garbage collector, generics, macros etc.

(Disclaimer: I work on Rust and don't have a lot of experience with Nimrod other than reading the docs.)

Nimrod and Rust are in a similar space, yes; they're both systems languages that use compiler backends designed for C (in Nimrod's case, by compiling to C; in Rust's case, through LLVM). I especially like Nimrod's nice Python-like syntax.

Regarding "optional garbage collector", Nimrod and Rust are pretty different: last I looked, Nimrod follows the D route of tying safety to the GC, so if you don't use the GC you also lose memory safety. As far as I know, Rust is still unique among industry languages in that it takes the have-your-cake-and-eat-it-too approach of memory safety even if the GC is not used, via unique pointers and lifetimes.


(Disclaimer: I am the creator of Nimrod.)

What you say is entirely correct; however Nimrod's effect system is not tied to any runtime mechanisms like a GC and provides lots of other aspects of safety. For instance you can ensure at compile time that a particular piece of code performs no database operations.

Also the effect system will improve the safety for concurrent programming quite a lot; coming soon in some of the next versions. (Sorry, no papers about it yet.)


A pull request is always welcome :)


Nimrod seems rather unfortunately named. I'm not up on Abrahamic mythos so I didn't get the biblical reference -- to me Nimrod seems like a fighting words insult.

http://www.urbandictionary.com/define.php?term=nimrod


Well, well, a downvote? I guess I struck a nerve by asking why alternative languages, the underlying interest of such a blog post, are upsetting. But I am not sure why, given the context.

I guess I will stop asking about the differences between Go and Nimrod.


Don't be so oversensitive about the votes your comments get. The best thing you can do on HN is to post and leave for an hour. By then some interesting discussion might have popped up. Don't be bothered by the random downvoters. They are more afraid of you than you are of them.


It's not my first time here. I am well aware how it works. When I comment I usually prod someone to give me a contrary opinion, and I learn something new from someone far more knowledgeable than me. That is why I keep coming back to HN.

But I could have been more direct this time around, mea culpa.


I imagine a lot of downvotes happen by accident. Whenever I get one, inevitably someone will upvote it just to get it out of the grey. It's an upvote I probably wouldn't have gotten otherwise, as it usually just stays at 1 point. My conclusion: don't sweat them.


I'm guessing the downvote is because your post has no connection to a business card ray-tracer. Has someone done one in Nimrod? Was it posted here?

Because it feels like you're just shoehorning discussion of Nimrod into an unrelated conversation.

One way of generating interest in Nimrod would be to convert the original C++ "business card ray tracer" to Nimrod and do some benchmarks - ease of writing, understanding, code size, speed of execution, etc.


https://kidoman.com/images/go-vs-cpp-after-both-optimized.pn...

In this picture - am I right in interpreting it as showing that, at 2048x2048 and 8 cores, the optimised and tuned Go code is nearly three times faster than the multithreaded optimised and tuned C++ code? How come the C++ is the same for 1C and 8C? Is this picture from one of the previous articles?

EDIT: (No, I'm wrong, the C++ is single threaded!)

It seems it's single threaded - and the last graph isn't showing the current level of C++ single-threaded performance, which is ~36 seconds, with Go at 21s for 8 threads. The final state of play is Go at 18s multithreaded and C++ at 8s multithreaded, and Go at 81s single threaded and C++ at 36s single threaded. I read that from the third-last graph.

It's difficult to understand this ending of the article, as the subtitle is:

"Further optimizations and a multi-threaded C++ version" and "Hurray multi-threading"

I suggest the order of the article be changed around to have a 'recap on single-threaded C++' at the start, and then the new figures - so that it concludes in a straightforward manner.

This quote from the article is wrong:

"C++ is not more than twice as fast than an equivalent Go program at this stage."

Correct me if I'm wrong, but in every case that I can see the C++ code will execute twice before the Go code is finished. It is more than twice as fast.

By the same logic this "almost" is also wrong: "From taking 58.15 seconds (single threaded), it has now dropped to a extremely impressive 36.36 seconds (again single threaded), making it almost twice as fast as the optimized Go version."

36s < 81s/2


"C++ is not more than twice as fast than an equivalent Go program at this stage."

I believe that is a simple typo, where 'not' was meant to be a 'now'.


Initially I thought that, but I can't explain the 'almost' that follows?

also it would be idiomatic to say "now more than twice as fast as" instead of "now more than twice as fast than".


The god of all typos has been fixed :)


Well done, you should probably change the "almost" I pointed out to "more than" also.


This has been circulated on the Rust-dev mailinglist:

https://mail.mozilla.org/pipermail/rust-dev/2013-September/0...

Rust did quite well (given that it's not even production ready, I'd say it's impressive): https://mail.mozilla.org/pipermail/rust-dev/2013-September/0...

The only thing they noted was that the Go version "cheated" by precomputing some values, which someone removed to make the algorithms the same... maybe kid0m4n can comment on this :)


I was looking at the Rust-dev mailing list before going to sleep :)

I would argue that it is not cheating "anymore" as both the Go and C++ version are now equal. In fact, I wanna compare how Rust performs in this exact same test with all optimizations applied. Studying those optimizations will be fun itself.

I have also explained why optimizations are perfectly fine (IMHO) here:

https://github.com/kid0m4n/rays#why-optimize-the-base-algori...


Ok, but with different algorithms you are not testing the languages/compilers anymore but the algorithms ;)

I understand that now that the C++ and Go versions have changed, we can't compare them with the Rust version anymore unless someone applies the same changes... anyway, it was a very interesting read on the Rust ML, and frankly I was surprised to see Rust fare so well.


True...

But, I guess algorithmic advances will happen much less frequently compared to other micro optimizations. And it will keep things interesting across the board.


The Rust code that ML thread is discussing: http://github.com/huonw/card-trace


Hey, is it possible to optimize the code and send over a PR? I would love to rerun the benchmarks with an optimized Rust version, and my Rust is totally rusty (pun intended :P)


I'm sure it's possible. ;)

I'm occupied by other things now so won't do it myself, but people are free to work from my code.


Oh the possibilities :D


I wonder, do the Rust guys have any intention of supporting AMD's HSA or Mantle API in any way? I know Mozilla wants to make Rust take full advantage of multi-core systems, and AMD is doing that, too, so I wonder if there can be some synergy there, or if that's outside of their goals for Rust.


Looking at the GitHub repo, it seems you benchmark GCC 4.8.1 vs. Go 1.2rc1. The numbers for Go look promising if one considers that Google's Go implementation does not even have an advanced optimizer yet (in contrast to GCC).

>c++ -std=c++11 -O3 -Wall -pthread -ffast-math -mtune=native -march=native -o cpprays cpprays/main.cpp

Have you tried -O2? -O3 often generates slower code.

>i7 2600

Intel's compiler would probably generate faster code. That's why you can't just say "Go vs C++". You could let Go win this fight by compiling the C++ with Digital Mars. It is also a C++ compiler but it lacks a modern optimizer and the generated code is usually much slower.


A few remarks:

-mtune is redundant with -march=native turned on;

-use -Ofast instead of -O3/-ffast-math; it turns on some more options as well (although theoretically it might behave in a non-standard way with float computations, so you need to test this; it was never a problem for me)

-add -flto; it often helps significantly

-it probably won't matter for a simple program, but you may try compiling with -Ofast -march=native -flto -fprofile-generate; then run the program (it will generate .gcda files) and then recompile with the same options and -fprofile-use;

EDIT: Some quick tests show that all the options I mentioned help (MinGW, GCC 4.8.1). The original time on one thread was 3.670s; -Ofast and -flto take it to ~3.630s, and adding PGO moves it to 3.46s (there is some variance with all of those) - a whopping 5-6% improvement overall :)


I can't edit anymore, so replying to myself:

On my machine (i7 3770), inserting the famous Quake 3 hack in place of 1/sqrtf (http://en.wikipedia.org/wiki/Fast_inverse_square_root) is faster but not accurate with one iteration (the image is very distorted), and slower but accurate with two iterations.
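For reference, a minimal sketch of the trick being discussed (the function name `fast_rsqrt` is my own, not from the ray tracer). The bit-level hack gives a rough initial estimate of 1/sqrt(x), and each Newton-Raphson iteration roughly squares the accuracy:

```cpp
#include <cstdint>
#include <cstring>

// Quake III-style fast inverse square root (hypothetical helper name).
// One Newton-Raphson iteration gives ~0.2% error; two get much closer.
float fast_rsqrt(float x, int iterations) {
    const float half = 0.5f * x;
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);   // type-pun via memcpy, no UB
    bits = 0x5f3759df - (bits >> 1);       // the famous magic constant
    float y;
    std::memcpy(&y, &bits, sizeof y);
    while (iterations-- > 0)
        y *= 1.5f - half * y * y;          // one Newton-Raphson refinement
    return y;
}
```

Whether this beats a hardware 1/sqrtf depends entirely on the CPU, which matches the mixed results reported above.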


How would -flto help when all the code is in one module?


All I know about GCC flags is from testing a lot of combinations of them. I don't really know how -flto works, but maybe it does something for functions from included files?


Can you elaborate on why -O3 often generates slower code? I've never heard that before.


It is common knowledge among GCC users. In the past using -O3 was rare because it often generated downright broken code. There used to be an official warning about that.

The situation is better nowadays but still, as far as I know, no major Linux distro uses -O3 as the default for binary packages.

-O3 can generate slower code because of the aggressive inlining and loop unrolling enabled. These optimizations are very tricky because of their effect on cache use. Basically all that extra code can push other needed code/data out of the cache, which can cause a noticeable decrease in performance.


I think it's 'common knowledge' which has outlived its relevance, as I can't recall the last time I found -O2 outperforming -O3.

Practically every performance oriented open source program I come across also defaults to -O3 these days, or sometimes -Ofast which also enables -ffast-math.

>-O3 can generate slower code because of the aggressive inlining and loop unrolling enabled

-O3 turns on vectorization and inlining optimizations but I can't recall any loop unrolling options which are turned on at -O3.

-funroll-loops is not turned on at any of the -O levels (including -O3) due to it being one of the hardest optimizations to get right without any runtime data as a basis (which is why the only thing that turns it on is PGO - profile-guided optimization).

Note that I'm talking about modern versions of GCC; if you are using GCC 4.2.1 on OS X then this (-O2 > -O3) may still typically be the case.

>The situation is better nowadays but still, as far as I know, no major Linux distro uses -O3 as the default for binary packages.

I'd say they typically use the upstream optimization settings.


>I think it's 'common knowledge' which has outlived it's relevance as I can't recall the last time I found -O2 outperforming -O3.

I can, was about 4 months ago with GCC 4.8.0.

>practically every performance oriented open source program I come across also defaults to -O3 these days

How large is your sample size there? I have only seen -O3 in the default makefiles of audio/video encoders. Those tend to be a natural fit for -O3. In contrast, here is the current makefile of my favorite "performance oriented" FOSS program:

http://repo.or.cz/w/luajit-2.0.git/blob_plain/HEAD:/src/Make...

  CCOPT= -O2 -fomit-frame-pointer
  # Note: it's no longer recommended to use -O3 with GCC 4.x.
  # The I-Cache bloat usually outweighs the benefits from aggressive inlining.

>I can't recall any loop unrolling options which are turned on at -O3.

You are right (I just looked it up). Guess my memory failed me there.

>I'd say they typically use the upstream optimization settings

I wish! Packagers love to fool around with the upstream sources and makefiles to make them conform to whatever "standards" they have.


>How large is your sample size there? I have only seen -O3 in the default makefiles of audio/video encoders. Those tend to be a natural fit for -O3

Well, I very much implied 'performance-oriented' programs, as we were discussing 'performance' generated by compiler options, which indeed are a natural fit for -O3.

For which my 'sample size' would be software like encoders, archivers, emulators, 3d renderers etc.

Obviously there's little point in using -O3 on your text editor (yes, an extreme example); basically, for any non-performance-oriented software, -O3 will likely only serve to increase the binary size, as any potential gains will be unnoticeable.

>I wish! Packagers love to fool around with the upstream sources and makefiles to make them conform to whatever "standards" they have.

Not really my experience with Arch packages, but of course I haven't looked at the PKGBUILDs for even 1% of all available packages, basically only those performance-oriented packages on which I rely.


-O3 inlines functions and unrolls loops more aggressively, so the increased code size might not fit in the CPU cache.


Fair enough... let me do a quick test with -O2 and see how that fares


Might want to also give a go at -Os (optimize for small code size). On code that spends its time iterating on the same code over and over again this can be a big win.

[edit] Nope, definitely not better. I get -O2 being a slight win over -O3, and -Os being much worse.


Compiler optimization flags are very code and type specific.

(Note that I am comparing apples to oranges here, I used the C++ code used in Rust experiments found here: https://github.com/huonw/card-trace/blob/master/original.cpp )

I changed the C++ version's typedef float f to typedef double f, so using doubles instead of floats, compiling with the following flags:

    -m64 -march=corei7-avx -mtune=corei7-avx -Ofast -funroll-all-loops
and the run time dropped from 17.5 seconds to 11.2 seconds. If I remove -funroll-all-loops, the run time jumps to 14.2 seconds. The original 17.5 seconds were run with the vanilla code using float and -O3. Interestingly enough, if you use the aforementioned flags with floats instead of doubles, the program executes in 15.01 seconds instead. Using floats is bad for performance! Further, if you remove -funroll-all-loops when using floats, the performance increases, but with doubles it decreases.

So, when optimizing, play with compiler flags. Play with types. Play with whatever you have at your disposal and make no assumptions. This stuff is far more complex than believing that certain flags are better than others, it all depends on everything.


So it totally disables all loop unrolling, inlining... hmm


It does a couple of other things, including choosing instruction sequences that are more compact, afaik, but also favouring compactness over alignment and, obviously, jumps over unrolling. This isn't code that benefits terribly much from it, but it has been known to happen.


gccgo should give better numbers for Go as well. I guess we should pick one and stick with it. I don't mind trying Intel's compiler though.


    go build -gcflags -m
After some quick googling I can't find what the "m" flag does. Can anyone shed some light?


  go build -gcflags
If you run `go help build`, you will see:

  -gcflags 'arg list'
		arguments to pass on each 5g, 6g, or 8g compiler invocation.
Then you can run:

  go tool 6g
And see:

  -m	print optimization decisions


Author of the article here

It helps in finding out details of what the compiler (Xg) thinks about the inlining applicability (is that even a word?) of various funcs.


Speaking of gcflags, did you try Go with -gcflags=-B to see how it performs without bounds checking?

I know this isn't 'the right way' given that Go is supposed to be a safe language but it would be interesting to see how much difference it would make.


Have not tried that... will give it a shot

But I guess we are then going away from idiomatic Go


I guess; it's an unsupported compiler option (as in, it can disappear in any new release).

I looked at it more as an option for when you want to squeeze out possible extra performance in release builds.


I wonder how the C++ version would do with TBB. Also, I think it would be more interesting to compare more memory-intensive programs; I think C++ would shine even more with all the optimization opportunities there would be.


And what about adding the following compile options:

-m64 -msse3 -mfpmath=sse?


-m64 and -mfpmath=sse are defaults on x86-64, -msse3 is enabled implicitly by -march=native if the machine supports SSE3.


Adding that to the list of C++ options I am going to try


vector S(vector o,vector d, unsigned int& seed) {}

int T(vector o,vector d,float& t,vector& n) {}

Not sure why he's copying the vectors here.


Why not? Note that these are not std::vectors. Copying 12 bytes will probably be cheaper than indirect access through a reference.
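To make the size argument concrete, here is a minimal stand-in for such a 3-float vector (the `vec3` and `dot` names are mine; the original code calls its type `vector`). The whole struct is 12 bytes, so a by-value copy typically travels in registers:

```cpp
// Minimal stand-in for the ray tracer's 3-component vector
// (hypothetical name; the original code calls it `vector`).
struct vec3 {
    float x, y, z;   // 3 * 4 bytes = 12 bytes total
};

// By-value parameters: each call copies 12 bytes, which usually lands in
// registers and avoids the pointer dereference a const reference implies.
float dot(vec3 a, vec3 b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}
```

With a reference, the compiler may also have to assume aliasing between parameters, which can block optimizations that by-value copies permit.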


Also, he's mostly passing by value.


Passing by value in C++ can offer speed increases when the function will be copying the object anyways, due to move semantics, constructor eliding, etc.
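As an illustration of that point (hypothetical types, not from the article's code): a function that stores its argument anyway can take it by value and move from it, so a caller passing a temporary pays for a move instead of a copy.

```cpp
#include <string>
#include <utility>

// Sink-style constructor: takes `n` by value, then moves it into the
// member. Callers with a temporary get a move; callers with an lvalue
// get exactly one copy, never two (C++11 move semantics).
struct Widget {
    std::string name;
    explicit Widget(std::string n) : name(std::move(n)) {}
};
```

E.g. `Widget w(std::string("rays"));` constructs the member by moving the temporary, whereas a `const std::string&` parameter would force a copy inside the constructor.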



