
Google paper comparing performance of C++, Java, Scala, and Go [PDF] - pgbovine
https://days2011.scala-lang.org/sites/days2011/files/ws3-1-Hundt.pdf
======
tectonic
Their conclusions:

"We ﬁnd that in regards to performance, C++ wins out by a large margin.
However, it also required the most extensive tuning efforts, many of which
were done at a level of sophistication that would not be available to the
average programmer.

Scala's concise notation and powerful language features allowed for the best
optimization of code complexity. The Java version was probably the simplest to
implement, but the hardest to analyze for performance. Specifically the effects
around garbage collection were complicated and very hard to tune. Since Scala
runs on the JVM, it has the same issues.

Go offers interesting language features, which also allow for a concise and
standardized notation. The compilers for this language are still immature,
which reflects in both performance and binary sizes."

~~~
pilooch
When turning map into hash_map gives you a 30.4% increase in performance, you
can't call this an 'extensive' tuning effort... The 'extensive' tuning in the
paper (structure peeling, ...) accounts for minor improvements (< 10% total).

~~~
copper
The initial code was supposed to be written in a straightforward way. That
said, I was playing around with the code, and using tcmalloc with the original
C++ code gives a better speedup (something close to 40% with the provided
example).

------
ksolanki
_We find that in regards to performance, C++ wins out by a large margin.
However, it also required the most extensive tuning efforts, many of which
were done at a level of sophistication that would not be available to the
average programmer._

I am a fan of C++, so I liked the first part (that C++ wins, which is not
really surprising). However, it was not clear whether the sophistication they
talk about concerns _Google's internal data structures_ or the list of
optimizations in Section VI-D. Many of these optimizations are no big surprise
to a C++ programmer (where possible use hash_map, vector instead of list,
initialize data structures outside of loops and try to reuse them, and so on).

Overall, I do greatly appreciate the effort and their sharing the results.

~~~
bad_user
The problem with possible optimizations and tuning is that they require
constant effort. Projects often have serious deadlines, and you simply can't
constantly think about memory or allocation footprint and still do a good job
in the amount of time you have, unless you're well above average.

~~~
davidtgoldblatt
Most of the optimizations that got them big wins in C++ (using hash maps
instead of tree-based maps, dynamic arrays instead of linked lists, .empty()
rather than .size() > 0 on a linked list) are "library knowledge" issues more
than memory / allocation ones.

In fact, looking at the list of optimizations on page 9, the only one that
involved a low-level understanding of the machine and resulted in a
significant improvement was (unless I missed something) the use of
InlinedVector (and this is only barely such a case - lots of C++ codebases
have a similar class).

~~~
scott_s
Rather than library knowledge, I submit it's basic algorithmic and data
structure knowledge. Perhaps one could argue it's knowing how that algorithmic
and data structure knowledge maps to actual libraries.

By the way, std::list::size() is defined to be O(1) (see
[http://www.kuzbass.ru:8086/docs/isocpp/lib-containers.html#lib.container.requirements](http://www.kuzbass.ru:8086/docs/isocpp/lib-containers.html#lib.container.requirements)).
While it's possible, of course, to define a linked list where the size
operation is O(n), that's not an implementation of std::list. Hence, there was
no difference when std::list::empty() was used - although I agree it's better
to use it, since it's more meaningful.

~~~
davidtgoldblatt
In a confusing bit of ISO trivia, std::list implementations can do either -
the standard draws a distinction between "should" and "shall". If an
implementation "shall" do something, it's a requirement, and if it "should" do
something, it's the preferred choice among several. The size() function only
"should" be constant time. In fact, in GCC (which I assume is the Google
compiler of choice), std::list::size() is O(n)
([http://gcc.gnu.org/onlinedocs/libstdc++/latest-doxygen/a01426_source.html#l00845](http://gcc.gnu.org/onlinedocs/libstdc++/latest-doxygen/a01426_source.html#l00845)).

------
pgbovine
summary: author implemented a sophisticated compiler algorithm (loop header
recognition) in a straightforward canonical way in C++, Java, Scala, and Go,
measured performance/memory usage, and then (the most interesting part of the
paper) he had colleagues who were experts in each respective language write
highly-tuned optimized versions and reported on what it took to optimize in
each language.

~~~
CCs
So C++ wins the performance test by a large margin, Scala follows (3.6x), then
Java 64-bit (5.8x), Go (7x) and finally Java 32-bit (12.6x).

Scala without any optimization is still better than Java with all the
ninja-skills black magic applied. Scala optimized is pretty good - "just" 2.5x
slower than C++.

~~~
wtracy
Maybe I'm missing something, but I'm amazed to see Scala beating out Java.

~~~
gaius
Why? Remember that at this point in history, Java-the-language and Java-the-VM
are related in name only.

~~~
jbooth
From a performance perspective, the VM is what dominates, and Java operates
closer to the VM's assumptions about what it's going to run.

If you read the paper, you'll note that the Scala author significantly changed
the structure of the algorithm to conform with the way Scala does things
(recursion, etc.). So it's sort of an apples-to-oranges comparison as far as
Scala's concerned, too. Too bad they couldn't write ugly Scala code that would
give a better comparison.

------
aria
Really shocked they would place numbers in a table like that when the 'Scala
Pro' version is doing something algorithmically smarter.

They note this distinction in the paper, but it should be marked in the table,
because I'm sure 99% of people don't see that. Sloppy!

~~~
sili
That, plus the fact that they refused to apply the same optimizations to Java
as they did to C++, for some reason.

------
scottjad
Speaking of Scala Pro: "It should be noted that this version performs
algorithmic improvements as well, and is therefore not directly comparable to
the other Pro versions."

~~~
enneff
"It should be noted"?! What an understatement! What's the point of going to
the trouble of writing a paper if you're not even going to make your
comparisons fair? Crazy.

~~~
Djehngo
Because the initial comparison was fair.

~~~
igouy
Really?

"... the benchmark also used a HashMap<Type, Integer> when it should have just
stored a primitive int in Type; it also used a LinkedList where it should have
used an ArrayDeque."

[http://jeremymanson.blogspot.com/2011/06/scala-java-shootout.html](http://jeremymanson.blogspot.com/2011/06/scala-java-shootout.html)

------
onan_barbarian
Using a "older Pentium IV workstation". Older than what? Older than Grandpa?
Older than dirt? Older than the dinosaurs? Why?

Pentium IV - really? Let's see - we've had Core, Core 2, Nehalem and Sandy
Bridge since then, not to mention 3 in-between process shrink versions of the
same architecture.

Perhaps we could pass the hat around so that Google could afford a Sandy
Bridge workstation and discover what these interesting results look like on an
architecture that dates from some point in this decade - or at least, at some
point in the previous decade.

This stuff actually, really, truly makes a difference. And not necessarily in
favor or against any particular one of these languages...

~~~
acqq
Imagine: there is a paper discussing the techniques of weight lifting. Your
complaint is that there are newer models of weights than those they lifted.

~~~
jnhnum1
Newer processors don't just increase the performance of all programs equally.
They have things like improved branch prediction, cache prefetching, better
pipelining, and different cache sizes which can make a lot of performance
optimizations that you get from C / C++ less relevant.

~~~
acqq
Except for the "most optimized C++ code", you have the sources of all the
benchmarks. Please try to run them on any newer x86/64-based processor and
show us that the article's conclusions don't hold. I'm not holding my breath,
though.

~~~
scott_s
I agree that it _probably_ won't make a difference. But good experiments
remove as many of those _probablys_ as possible.

~~~
acqq
The parent commenter throws around "improved branch prediction, cache
prefetching, better pipelining, and different cache sizes" as mumbo-jumbo that
could mean anything. I'm in the business, so I can tell you: most of the
improvements give you just some "overall speedup", so you can happily buy
today's processor running at 3 GHz and be glad it's faster than an almost
decade-old P4 running at _the same_ 3 GHz. Add to that that you now have a
multi-core CPU, and that you have to "clear" the paths to the cores in order
to prevent them from slowing one another down, and also to compensate for the
bigger delay introduced by more modern RAM technologies, which trade a
_bigger_ delay for the ability to feed more cores.

Then measure the algorithms, which run on one core anyway, on the P4 and the
latest Core iX. Your slow languages won't be faster than your fast ones just
because the quoted changes were introduced to the processors in between.

~~~
scott_s
Please note that I did not disagree with your conclusion - I agree that it
_probably_ won't make a difference. If it makes you feel better, I'll say it's
a very high value of probably. But _I'm_ in the business of performing systems
experiments. Removing as many variables as possible is just good experimental
design. If you want to know what the performance will be like on modern
machines, then it's best to run on modern machines.

~~~
onan_barbarian
Dear lord, thank you for this bit of common sense.

It's actually quite hard to know what a given piece of code will do on a given
microarchitecture, even if on average it runs everything X% faster. You may
find you're the bit of code that bites the big one and runs X% slower on the
new microarchitecture (e.g. you were depending on branch mispredicts being
cheaper than they are), or suddenly your code runs way faster than competing
code (e.g. you're the superstar running 2X% faster because a sudden increase
in ILP exposes that you've got a main loop full of independent operations).

------
thesz
What I'd like to see there is a Haskell version, and how all those
implementations speed up when running on several cores.

This is much more interesting considering our current reality.

I once gained a 10% speedup on two cores by changing just two lines of our
optimization tool, written in Haskell.

------
copper
The most instructive part of this is the listings from the changelogs, along
with the resulting improved performance. I'm really enjoying the C++ tunings.
I wish they had shown the results with the Google hashes, though.

~~~
btmorex
So, the C++ performance numbers were with the "public" version? It wasn't
clear whether they left the internal version out completely (besides the
changelog) or simply didn't release the code. The published C++ numbers are
already way better than any other language's in terms of speed and memory use.
It would be surprising if there were a substantially faster version.

~~~
igouy
afaict Yes, the C++ performance numbers were with the "public" version.

afaict Yes, they left the internal version out completely (besides the
changelog).

Notice C++ Dbg has the same wc -l as C++ Opt.

~~~
copper
Isaac, do you think you could add this problem to the shootout? I _believe_
(and am therefore almost certainly wrong) that there isn't any benchmark there
that derives almost purely from compiler theory.

------
mulander
The paper uses wc -l (probably because sloccount doesn't handle Go and Scala?)
to count the lines of code.

Here is the output of sloccount:

    
    
        SLOC    Directory       SLOC-by-Language (Sorted)
        595     java_pro        java=595
        591     java            java=591
        488     cpp             cpp=488
        328     python          python=328
        0       go              (none)
        0       go_pro          (none)
        0       scala           (none)
        0       scala_pro       (none)
    
        generated using David A. Wheeler's 'SLOCCount'.
    

Compare it to the paper.

    
    
        Benchmark   wc -l
        C++ Dbg/Opt 850
        Java        1068
        Java Pro    1240
        Scala       658
        Scala Pro   297
        Go          902
        Go Pro      786

~~~
igouy
Here are GZip minimally compressed source-code sizes after removal of comments
and removal of duplicate whitespace characters -

    
    
        Benchmark   GZ Bytes    Factor  (wc -l Factor from paper)
        Java Pro    5198        1.65x   1.9x
        Java        4403        1.40x   1.6x
        C++         4229        1.34x   1.3x
        Go          3768        1.20x   1.4x
        Go Pro      3259        1.03x   1.2x
        Scala       3138        ====    ===
        Python      2755        0.87x   ????
        Scala Pro   1929        0.61x   0.5x

------
ivan_ah
Link to the code: <http://code.google.com/p/multi-language-bench/> This is a
great resource for learning the different languages. (I had never seen C++
template programming used in practice before.)

Challenge to the Pythonistas: rewrite the Python version so it uses the C++
code under the hood via SciPy Weave.

~~~
igouy
<http://shootout.alioth.debian.org/>

------
cygwin98
I was surprised by the fact that the Python version has 626 lines of code,
compared to 297 for Scala Pro. As Python is supposed to be more expressive, I
was expecting something like 150-ish LOC for Python.

Edit:

The Python version can be found at [http://code.google.com/p/multi-language-bench/source/browse/trunk/src/havlak/python/LoopTesterApp.py](http://code.google.com/p/multi-language-bench/source/browse/trunk/src/havlak/python/LoopTesterApp.py)

~~~
Spyro7
Actually, the Python version has about 328 lines of code, while the Scala
version has about 216 lines of code. You should visit the Google project page
and use something like cloc to analyze the files:

<http://code.google.com/p/multi-language-bench/>

<http://cloc.sourceforge.net/>

I confess that I do not know a lot about Scala, but it looks like some of the
functional aspects of the language allow for some savings in lines of code.

If I had the time right now, I doubt that it would be too difficult to shrink
the python version a bit. Just taking a glance at it, I don't see a single
list comprehension throughout their code. I am sure there are some other
language features that could probably have been better leveraged as well.

Just out of curiosity, where did you get the 626 LOC and the 297 LOC from? I
tried looking but I can't find them anywhere. Though that could just be a
product of my lack of sleep right now.

~~~
cygwin98
The 297 LOC for Scala Pro is from the paper; it also includes 13 lines of
Apache license boilerplate at the top and 14 lines of comments. The author
used "wc -l" to count the LOC, which isn't very scientific anyway. The 626 LOC
for Python is also from "wc -l". I looked at the code again and found that it
contains a lot of detailed comments and even test cases, so it's not fair to
say the Python version has 626 LOC.

I suspect the author provides it as a reference implementation.

------
olifante
TLDR: C++ fastest but hardest, Scala fast and elegant, Java simple but opaque,
Go immature.

------
jbpritts
C++ code, when sophisticated template metaprogramming techniques are used, can
be as fast as tuned C code, and can beat any language in terms of speed with
the exception of hand-tuned assembly. However, techniques like expression
templates, automatic loop unrolling, template specialization, and static
polymorphism require a VERY high degree of sophistication from the programmer.
In some sense, the programmer is forced to be a compiler. There are some
individuals who, despite this demand, can be highly productive. The others
wallow in complicated syntax, horrible type errors, and difficult-to-trace
run-time errors. Unless speed is absolutely essential, C++ should be avoided
at all costs.

~~~
pnathan
C++ TMP is quite painful to the uninitiated. Outside of that, C++ is not
_that_ bad, especially as you build up your project-specific abstractions.

------
choxi
This is cool, but I think it's important to remember that they're comparing
algorithmic benchmarks, while modern software has many different bounds
besides raw algorithmic processing. For example, I think I/O and database
interactions are the bottleneck for most web apps. I really liked the way this
article broke it down: <http://pl.atyp.us/wordpress/?p=2947>

~~~
gtani
Thx for that. For Java and Scala specifically, it's a pretty high-dimensional
space. Background:

<http://www.elis.ugent.be/JavaStats>

[http://www.bestinclass.dk/index.clj/2010/02/benchmarking-jvm-languages.html](http://www.bestinclass.dk/index.clj/2010/02/benchmarking-jvm-languages.html)

[http://isthisclojure.blogspot.com/2011/02/benchmark-clojurejvm-5-degrees-of.html](http://isthisclojure.blogspot.com/2011/02/benchmark-clojurejvm-5-degrees-of.html)

[https://docs.google.com/present/view?id=0AS8emH3-FLt3ZGRtbWJ...](https://docs.google.com/present/view?id=0AS8emH3-FLt3ZGRtbWJyOGdfMTFmcDZkcTk2cw&hl=en)

Examples of benchmarks gone awry:

[http://stackoverflow.com/questions/6146182/why-is-my-scala-code-running-slow](http://stackoverflow.com/questions/6146182/why-is-my-scala-code-running-slow)

[http://groups.google.com/group/scala-language/browse_thread/thread/94740a10205dddd2/](http://groups.google.com/group/scala-language/browse_thread/thread/94740a10205dddd2/)

~~~
igouy
a pretty high-dimension space :-)

[http://download.oracle.com/docs/cd/E15289_01/doc.40/e15060/t...](http://download.oracle.com/docs/cd/E15289_01/doc.40/e15060/toc.htm)

------
nimrody
Keep in mind that since C++ is usually used for performance-critical code, its
compilers are often highly tuned and produce good code (see Intel's ICC for
number-crunching code).

A less popular language - or one that is usually used for less-critical code -
will suffer.

C++ wasn't always so fast. See Kernighan and Pike, "The Practice of
Programming" for a simple example where Java surprisingly beats C++.

~~~
cageface
True, but this is one of the reasons people choose to code in C++. The quality
of tools & common libraries is as much a factor in language selection as the
syntax and semantics of the language itself.

~~~
nimrody
You mean - "forced to use C++" :)

I would love to use Digital Mars D or Scala for scientific applications if
they had compilers/runtimes as good as C++'s.

Perhaps one day the LLVM backend will be useful as a universal backend for
many less popular languages.

------
skimbrel
Wow.

Looks like C++ is still one of the performance kings among current programming
languages. Can't say I enjoy using it though.

It also looks like Go and Scala have a long way to go on compiler
optimization. Not surprising, since they're young languages compared to C++
and Java, and it's a decent amount of work to optimize the higher-order
constructs they provide.

~~~
fizx
Sweet $deity, the Scala compiler is so painful! You change one line of code in
a 40 KLOC code base, and now you wait 40s for the recompile.

~~~
tvorryn
Is it hard to use fsc? fsc should fix those compile times, especially for
minimal changes. In the paper it reduced compile time to 25-33% of scalac's
(13.9s to 3.8s and 11.3s to 3.5s).

~~~
fizx
I'll check it out; thanks for the tip!

------
edabobojr
The Java version spends 75% of its time in GC? That smells fishy to me.

Edit: I just got to Section VI of the paper, which does some GC tuning. In my
playing, I also found about a 20% speedup by changing the
HavlakLoopFinder.UnionFindMode class to a static class.

~~~
igouy
[http://jeremymanson.blogspot.com/2011/06/scala-java-shootout.html](http://jeremymanson.blogspot.com/2011/06/scala-java-shootout.html)

------
javanix
This paper served as a _great_ introduction to Go programming for me. I've
been trying to work through my own basic program in Go for the past few weeks
and their overview of the language features used cleared up a lot of loose
ends for me.

~~~
scott_s
Having the same algorithm implemented in several other languages helps, too.
It enables you to look at the implementation in the language you're most
comfortable with, then look at one of the other implementations and think "Oh,
so _that's_ how that's done in this language."

~~~
igouy
<http://shootout.alioth.debian.org/>

Although you shouldn't necessarily think "Oh, so that's how that's done in
this language."

------
xedarius
I wonder what JVM they used, as significant performance differences exist
between HotSpot and JRockit. I also wonder what compiler they used; again,
Microsoft's compiler will yield different results compared to gcc.

~~~
igouy
Here's the Makefile - gcc

[http://code.google.com/p/multi-language-bench/source/browse/trunk/src/havlak/cpp/Makefile](http://code.google.com/p/multi-language-bench/source/browse/trunk/src/havlak/cpp/Makefile)

------
16s
What is this InlinedVector<> they speak of in the C++ tuning? Is that a
Google-specific container, or is it in the standard library (or some other
library)?

~~~
jongraehl
It probably holds small vectors directly inside the sizeof(InlinedVector<X>)
bytes, instead of indirecting. I think it's a pretty common technique.

------
mhd
It probably won't take long until someone does the same exercise with other
languages. Maybe they should add it to the shootout.

Can't wait to see Modula-3's numbers…

~~~
igouy
> Maybe they should...

"they"? The space monsters who run the universe?

------
LiveTheDream
Loaded directly as a PDF for me.

~~~
rrrazdan
The Scribd link is an alternative.

~~~
kristianp
A bit of cross-promotion: Scribd is a Y Combinator company.

------
zohebv
One of the interesting details is that Go Pro required 786 lines of code, as
opposed to 850 lines of C++ code and 297 lines of Scala Pro code. Regular Go
code was 902 lines. It certainly doesn't look like Go scores any
expressiveness points over C++.

~~~
ianlancetaylor
"Go Pro" is unfortunately a very misleading name. I just spent an hour
cleaning up his use of Go. Nobody made any attempt to actually turn this code
into a well-written Go program. I did not realize that he was going to publish
it externally.

~~~
LiveTheDream
Would you consider touching up the Go version again to make it more idiomatic
and well written? I'd be interested in seeing good Go examples; I'm sure many
others here would as well.

~~~
uriel
The Go distribution already contains plenty of idiomatic, well written and
very readable code.

------
DonnyV
Who said that when you release a paper it needs to look like it came from a
textbook? I couldn't get past the first page because I felt like I was back in
high school.

~~~
sparky
_> Who said that..._ The conference. From
<http://days2011.scala-lang.org/node/91>:

 _Submissions must be in English and at most 12 pages total length in the
standard ACM SIGPLAN two-column conference format (10pt)._

Nearly all academic papers in computer science and engineering look like this.
Not to say they've converged on an optimal format, but it wasn't exactly a
creative decision on the part of these authors.

