
Perl vs Scala vs Go vs C (performance) - systems
http://blog.dryft.net/2010/05/new-results-for-perl-vs-scala-vs-go-vs.html
======
_delirium
Although these are important real-world cases, I think this kind of high-
throughput file processing mostly benchmarks something other than the core
language design. More or less: 1) how good are the I/O and string-processing
libraries? And 2) how obvious is it which ones to use, and how to use them,
for peak performance in your use case? As he noted, he got a 4x
speedup in Scala by just shuffling around his use of the Scala library
functions, so this sort of benchmarking is quite sensitive to seemingly minor
choices.
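
As a generic illustration (not his actual change), the same line-splitting job
can be written against the Scala/Java libraries in ways that differ mainly in
how much work gets redone per line - e.g. recompiling a regex on every call
versus compiling it once:

    import java.util.regex.Pattern
    
    object Fields {
      // Recompiles the regex pattern on every call - easy to write,
      // easy to overlook in a hot loop.
      def slow(line: String): Array[String] =
        line.split("\\s+")
    
      // Compile the pattern once and reuse it for every line.
      private val Whitespace = Pattern.compile("\\s+")
      def fast(line: String): Array[String] =
        Whitespace.split(line)
    }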

This kind of I/O benchmarking is also more sensitive than most to the
particular choice of hardware. Results might come out totally differently on a
different kind of hard drive, on 32-bit vs. 64-bit hardware, or on a different
OS or filesystem.

The various entries in the Wide Finder competition give a good overview of the
kinds of tricks that can make a difference when you throw another wrench in
the works by combining high-throughput I/O with multiprocessing (some of the
tweaks aren't multiprocessing-specific, though):
<http://www.tbray.org/ongoing/When/200x/2007/09/20/Wide-Finder>

~~~
barrkel
Indeed. And C gives you such crummy tools for string manipulation that you are
almost forced to hand-write anything interesting, so it's not surprising that
it ends up doing well.

~~~
JoachimSchipper
If your non-crummy tools are so bad that C ends up eating your lunch by a
factor of 100, I'm not sure that they really are non-crummy.

~~~
barrkel
Powerful tools improve productivity because you have to do less work to get
the job done. But they have a flip side: you need to be aware of how they work
to use them efficiently, otherwise you'll end up with silly performance
problems - e.g. I've seen naive string tokenization code that took a
substring, and then _deleted the matched substring_ from the start of the
string, all in a loop - quadratic work for a linear job. When the tools are
handed to you, it's easier to misuse them.
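
To make that concrete, here's roughly what that tokenization anti-pattern
looks like, sketched in Scala (hypothetical code, not the code I saw): each
pass copies the token _and_ rebuilds the entire remainder of the string, so
tokenizing n characters does O(n^2) work, while the index-based version below
it does the same job in O(n).

    // Hypothetical sketch of the anti-pattern: take the token as a
    // substring, then rebuild the whole remaining string - every iteration.
    def tokenizeNaive(input: String): List[String] = {
      var rest = input
      var tokens = List.empty[String]
      while (rest.nonEmpty) {
        val end = rest.indexOf(' ') match {
          case -1 => rest.length
          case i  => i
        }
        tokens = rest.substring(0, end) :: tokens              // copies the token
        rest = rest.substring(math.min(end + 1, rest.length))  // copies everything else too
      }
      tokens.reverse
    }
    
    // The cheap version: walk the string by index and copy only the tokens.
    def tokenizeIndexed(input: String): List[String] = {
      val tokens = List.newBuilder[String]
      var start = 0
      while (start < input.length) {
        var end = input.indexOf(' ', start)
        if (end == -1) end = input.length
        tokens += input.substring(start, end)
        start = end + 1
      }
      tokens.result()
    }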

When you have fewer tools, on the other hand, you need to think at a lower
level and work more directly on the problem. You have a lot more work to do,
but the probability that you're misusing the tools is smaller, because you'll
be building the tools yourself, manually.

~~~
_delirium
I wonder how much that actually accounts for C's generally good performance in
these kinds of ad-hoc benchmarks. There are really two "low-level" things about
C: 1) it's low-level in the actual language sense of being reasonably close to
the metal; and 2) it has a pretty bare-bones standard library, requiring most
algorithms to be hand-coded, if you don't link in a 3rd-party library. Mostly
people focus on #1, but as you point out, #2 might actually be a big part of
it.

I don't actually mean this as a slight against C, fwiw. In some ways, coding
in C feels liberating, because I'm culturally _allowed_ to do the nasty
special-case hack, instead of doing it The Right Way with the library function
that checks all the edge cases and error conditions and whatever.

~~~
barrkel
On this theme, there was a kind of "code-off" between Rico Mariani (.NET
performance architect at the time) and Raymond Chen (The Old New Thing blog,
Windows user mode developer).

Rico has a kind of index to it here:

<http://blogs.msdn.com/ricom/archive/2005/05/10/performance-quiz-6-chinese-english-dictionary-reader.aspx>

It was interesting to see how little Rico had to do from the start to end up
with quite a performant application. Eventually Raymond started pulling ahead,
but IIRC it was when he started using memory-mapped files for I/O.

Here's a summary as of version 5 of the unmanaged code:

<http://blogs.msdn.com/ricom/archive/2005/05/18/419805.aspx>

    
    
        Version                         Execution Time (seconds)
        Unmanaged v1                    1.328
        Unmanaged v2                    0.828
        Unmanaged v3                    0.343
        Unmanaged v4                    0.187
        Unmanaged v5 With Bug           0.296
        Unmanaged v5 Corrected          0.124
        Unoptimized Managed port of v1  0.124
    

It was quite interesting.

------
gjm11
The most surprising-looking thing here is the really wretched results from the
Go version. But when you actually take a look at the Go code, that stops being
so surprising; he's using a CSV-reading library whose author clearly wasn't
thinking much about performance. It doesn't contain anything really super-
dreadful -- O(n^2) algorithms for processing strings, etc. -- but it's full of
gratuitous constant-factor inefficiencies compared with what you'd be likely
to do in, say, C. Unnecessary temporary objects (presumably allocated on the
GCed heap), building short strings by { read a byte, push it into a string-
builder, repeat }, etc. It's the kind of code it makes sense to write in a
higher-level language, where (1) allocation overhead is likely to be swamped
by interpretation overhead and (2) you know that for anything that really
needs to be fast you'll punt to an extension library written in C. But Go is
not such a language, even though it tries to offer some of the same
conveniences.
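
For illustration - in Scala rather than Go, with hypothetical names - the
per-field pattern being described is roughly this:

    import java.io.InputStream
    
    // Roughly the shape described above: one call per byte, a fresh builder
    // per field, and a fresh String copy at the end of every field.
    def readField(in: InputStream): String = {
      val sb = new StringBuilder          // fresh heap object per field
      var b = in.read()                   // one virtual call per byte
      while (b != -1 && b != ','.toInt && b != '\n'.toInt) {
        sb.append(b.toChar)
        b = in.read()
      }
      sb.toString                         // yet another copy per field
    }

None of this is asymptotically wrong; it's constant-factor overhead on every
single byte, which hides behind interpreter dispatch in an interpreted
language but shows up directly in a compiled one.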

~~~
runT1ME
Go also uses a primitive garbage collector, so while Scala can create
boatloads of extraneous objects, their allocation cost is negligible and their
collection cost is effectively zero.

------
fleitz
Language performance benchmarks are generally pointless. The much more
important benchmark is how easy the code is to write and change. Write it
quickly, get the customer, and then, once you have that customer's money, if
it somehow doesn't perform well enough, use that money to profile and improve
the app or just buy more hardware.

In 18 months, a CPU that costs $200 will perform twice as fast as the current
$200 CPU; I've never seen bang for the buck like doubling the performance of
your code for $200. (Most apps are I/O-bound anyway, so you want to add disks,
not CPUs.)

Yes, for _SOME_ companies, like Google, it's much better to pay more for
programmers, but most companies don't have > 100,000 servers. If your company
is a startup that is doubling every year, then all you have to do is let
Moore's law scale your performance as you add servers.

Also, don't build a business where your profit margin depends on getting the
last 5% of performance out of a server.

~~~
runT1ME
>In 18 months, a CPU that costs $200 will perform twice as fast as the current
$200 CPU; I've never seen bang for the buck like doubling the performance of
your code for $200. (Most apps are I/O-bound anyway, so you want to add disks,
not CPUs.)

2001 called, they want their answer back. OK, that's a little harsh, but
really, your answer was perfect about six years ago. If you've taken a look at
chip performance recently, though, it has _certainly_ not been doubling every
year.

What you can get more cheaply is more cores, but that certainly isn't a free
lunch, as many have pointed out. Suddenly, having a good language that can be
performant and scale matters (Python, anyone?).

Go is going to have a lot of catching up to do, as writing good multithreaded
libraries is hard, and a good parallel GC is even harder. I don't mean to pick
on your opinion, because it's one shared by many, but things are going to get
_very_ interesting in the next 10 years as we really will see the end of
Moore's law in terms of linear CPU performance and instead get many-core
processors to play with.

~~~
fleitz
If you write parallel code you'll get a lot closer to a free lunch. To me the
transition is almost complete; you're starting to see that even the cheapest
laptops these days have two cores.

~~~
runT1ME
>If you write parallel code you'll get a lot closer to a free lunch.

That's like saying "if you spend lots of money on good food, you can get a
pretty good free lunch". Good parallel code is hard, to varying degrees. I'm
in agreement that it's going to be the future, but development is going to get
a lot harder in the next few years.

------
acqq
It seems that the author doesn't have a clear idea of what he actually wants
to measure. He just throws out heaps of code and gives some numbers; I haven't
seen him even try to specify what he wants to achieve. His posts just don't
make sense.

Something that really is useful is here:

<http://shootout.alioth.debian.org/>

For example:

<http://shootout.alioth.debian.org/u64q/compare.php?lang=go>

etc

------
AlisdairO
Interesting, though not unexpected, results for Scala. The JVM takes a while
to start up, so for short-lived, scripting-like applications it's not an ideal
choice.

~~~
runT1ME
Startup, plus the time to JIT the code. I bet if they accounted for that,
Scala would look even more performant.

------
JoachimSchipper
As could be expected, this just benchmarks the implementation. For instance,
Perl has lots of object creation while C uses a fast state-machine-like
approach.

Don't get me wrong - I really like C - but I'm not even sure that it would
still be faster than Perl if you switched implementations.

~~~
fleitz
Agreed.

Sure, in Scala there are a couple of things you can't do, like pointer
manipulation, so you are going to incur more allocations. But what I've found
in C# is that if you code it like a C app, it performs like a C app.

The biggest key to performance I've found in managed code is to not allocate
memory. It reduces stress on the GC, and improves your cache hit rate.

~~~
barrkel
I can't agree with avoiding allocation across the board. What you don't want
to do is create garbage that doesn't die until after a collection has already
occurred - the mid-life-crisis problem. But you shouldn't be unduly afraid of
allocating memory that won't live long.

I wrote up a quick test sample: <http://pastebin.com/YmbT08Uy>

All it does is fill an array with the values 0 to 511, then sum them up. The
catch is that it zeros out the array before filling in the values, in two
different ways: first (P), by re-using an array, and second (Q), by
reallocating the array and letting the garbage collector zero it out.

Here are test run timings:

    
    
        P: 8.749
        Q: 7.070
        Result: 0
    

And in reversed order, to show that order of compilation didn't make a
difference:

    
    
        Q: 7.081
        P: 8.746
        Result: 0
    

The code using the GC to implicitly zero out the array is still faster than
the code that reuses the array, even though the GC is having to collect a
large fraction of the 10 million 2048-byte arrays allocated by Q. P, on the
other hand, only allocates 100 of them.

Test machine: i7 920 @ 3.6GHz, 12GB memory, .NET 4 on Win7 64-bit.
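
For JVM readers, an analogous test in Scala would look something like this (a
sketch, not the pastebin code - that's C# on .NET 4 - and the iteration count
and single reused array here are illustrative assumptions):

    object GcZeroTest {
      final val Size = 512               // 512 ints = 2048 bytes, as above
      final val Iterations = 10000000    // assumed count, not from the post
    
      // P: reuse one array, zeroing it with an explicit loop each pass.
      def reusing(): Long = {
        val a = new Array[Int](Size)
        var sum = 0L
        var iter = 0
        while (iter < Iterations) {
          var i = 0
          while (i < Size) { a(i) = 0; i += 1 }    // manual zeroing
          i = 0
          while (i < Size) { a(i) = i; i += 1 }    // fill with 0..511
          i = 0
          while (i < Size) { sum += a(i); i += 1 }
          iter += 1
        }
        sum
      }
    
      // Q: allocate a fresh array each pass; it arrives pre-zeroed, and the
      // young-generation collector reclaims the garbage cheaply.
      def reallocating(): Long = {
        var sum = 0L
        var iter = 0
        while (iter < Iterations) {
          val a = new Array[Int](Size)
          var i = 0
          while (i < Size) { a(i) = i; i += 1 }
          i = 0
          while (i < Size) { sum += a(i); i += 1 }
          iter += 1
        }
        sum
      }
    
      def main(args: Array[String]): Unit = {
        def time(label: String, f: () => Long): Unit = {
          val t0 = System.nanoTime()
          val r = f()
          printf("%s: %.3f (sum=%d)%n", label, (System.nanoTime() - t0) / 1e9, r)
        }
        time("P", () => reusing())
        time("Q", () => reallocating())
      }
    }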

Don't forget that time spent in GC is proportional to the amount of live data
in the generation(s) being collected, not the amount of garbage. Collecting
garbage is actually free; it's finding out what's garbage that costs. If you
don't keep references around to memory (i.e. increase your live set), GC is
very cheap.

~~~
JoachimSchipper
I'm by no means an expert on the .NET platform, but are these equivalent? You
generate ~20GB of garbage, about half of which could comfortably fit in your
memory; more importantly, it appears that .NET, in some modes, does garbage
collection on a different thread from the one running the program. And it's
clear that using a different thread to zero memory is faster...

Also note that I was talking about _object_ creation, which may be a lot
heavier than array creation.

~~~
barrkel
The memory usage never climbs above 15MB - that's for the whole process, the
.NET runtime, code, system DLLs, etc. all in.

With object creation, reuse is far harder, whereas with an array you can just
zero it out. In other words, the argument in favour of reuse is even weaker
for objects than it is for arrays. I would say that if you have large arrays -
large enough to go into .NET's large object heap, at about 85K - you're better
off reusing those, because they only get collected during expensive gen2
collections, and they aren't compacted.

As to the fact that the GC can zero memory on a different thread, that's not
a problem - it's a benefit, as you get guaranteed-correct concurrent zeroing,
as opposed to the contortions you'd have to go through to do the same with
reuse. Again, an argument in favour of GC.

~~~
JoachimSchipper
I was never arguing in favour of reuse; I was arguing in favour of the C
approach (keep everything in local variables and build a state machine.)

And yes, concurrent zeroing is nice. I don't dispute that: I just wanted to
point out that the speedup you demonstrate above (partially) comes from using
additional resources (CPU time on another processor), not from using existing
resources more efficiently.

~~~
barrkel
If the problem can be solved with a finite state machine, you'd be silly to
allocate an amount proportional to the input size, for sure. I wasn't arguing
against the C approach; when I write compilers, I implement lexical analyzers
as state machines all the time. (In my day job, I write C most of the time.)
And even when state is in object fields, I copy them out into local variables
before doing work, and copy them back into the fields afterwards, because it's
much easier for a compiler to enregister locals than fields.
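
A tiny illustration of the locals-versus-fields point (Scala here, hypothetical
names): the hot loop reads and writes only locals, which a JIT can keep in
registers, and the field is touched exactly twice.

    class Scanner(input: Array[Char]) {
      private var pos = 0                 // state lives in a field between calls
    
      def countLetters(): Int = {
        var p = pos                       // copy the field out once
        var n = 0
        while (p < input.length && input(p).isLetter) {
          n += 1
          p += 1                          // hot loop touches only locals
        }
        pos = p                           // write the state back once, afterwards
        n
      }
    }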

What I was actually arguing against is the idea that it's a good idea, in a GC
environment, to avoid allocations when trying to improve performance.
Allocations are extremely cheap in a GC environment, and automatic memory
management by a precise copying collector is asymptotically faster than manual
memory allocation on a per-item basis. (If you use pools or zones, you can get
better manual memory management performance again, but then you lose
allocation granularity.)

It's not clear that the code I posted uses more resources. It uses some time
on another core (the i7 920 has 8 logical cores; CPU usage as Windows measures
it never exceeds about 17%), but it also completes sooner. The collector also
has the opportunity to use more optimized zeroing strategies, whereas I coded
the manual array zeroing as an explicit loop.

~~~
JoachimSchipper
I think we are in violent agreement, although I think your GC benchmark is
still "cheating". ;-)

~~~
barrkel
I fully agree - the point of using GC is that it cheats!

------
tumult
This is really stupid.

