

OCaml kicks ass on Tim Bray's Wide Finder 2 benchmark challenge. Again. - shadytrees
http://eigenclass.org/hiki/widefinder2-conclusions
You can see all the results here: http://wikis.sun.com/display/WideFinder/Results
======
tl
Article title: "...300X faster than naïve Ruby"

OCaml comes in at 300x faster than Ruby. Python comes in at 45x faster
than Ruby. However, there's more going on here than language:

* The ruby version is single threaded and the test is on a 32 core workstation (counting CPU time Python is only 4x faster and OCaml is only 17x faster)

* The author admits that there were multiple stabs at the OCaml version. What savings could come from optimizing the code in other languages?

* The test is IO heavy. Is OCaml (or another language) taking some shortcut (e.g. Unicode vs non-Unicode, Strings vs bare char[] arrays)?
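
A rough sanity check on the figures in the first bullet, as a minimal sketch: it assumes the single-threaded Ruby run is CPU-bound, so its CPU time is about equal to its wall-clock time (the thread does not confirm this). Under that assumption, dividing an entry's wall-clock speedup by its CPU-time speedup estimates how many cores the entry kept busy on average. The function name `implied_parallelism` is made up for illustration.

```python
def implied_parallelism(wall_speedup, cpu_speedup):
    """Average number of busy cores implied by two speedups measured against
    a single-threaded baseline whose CPU time ~= wall-clock time (assumption).

    With baseline time T: entry wall = T / wall_speedup and
    entry CPU = T / cpu_speedup, so CPU / wall = wall_speedup / cpu_speedup.
    """
    return wall_speedup / cpu_speedup


# Figures quoted above: (wall-clock speedup, CPU-time speedup) vs. naive Ruby.
entries = {"OCaml": (300, 17), "Python": (45, 4)}

for name, (wall, cpu) in entries.items():
    cores = implied_parallelism(wall, cpu)
    print(f"{name}: ~{cores:.2f} cores busy on average")
```

By this estimate the OCaml entry averaged roughly 17.6 busy cores and the Python entry 11.25, so a large share of both headline numbers comes from parallelism rather than from the language alone.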

~~~
mfp
"The ruby version is single threaded and the test is on a 32 core workstation
(counting CPU time Python is only 4x faster and OCaml is only 17x faster)"

I mentioned that on my blog. I also have an OCaml version that is 25x faster
in CPU time (i.e., as fast as the top C++ entries), but barely faster in
wall-clock time. It takes a couple dozen extra lines.

Keep in mind that the Wide Finder 2 benchmark was about parallelism from the
beginning; I said the Ruby version was naïve precisely because it wasn't
parallel. The fact that the language did matter to this extent came as a
relative surprise, because the most expensive operations in the Ruby version
actually take place in its core classes, written in C. It's just that it's so
slow everywhere else that the overall performance is still an order of
magnitude worse.

"The author admits that there were multiple stabs at the OCaml version. What
savings could come from optimizing the code in other languages?"

There are three OCaml versions, listed on the result table
<http://wikis.sun.com/display/WideFinder/Results>

AFAIK other entries received considerable optimization effort (I'd even go
further and say that most involved more) --- several went through half a dozen
revisions, even if the wiki doesn't reflect it.

You can take a look at the wide-finder mailing list to see how often each
participant was using the T2K (we used the ML to reserve time slots):
[http://groups.google.com/group/wide-finder/topics?hl=en&...](http://groups.google.com/group/wide-finder/topics?hl=en&start=160&sa=N)

wf2_multicore.ml was the first version I ran against the full dataset on the
T2K, and did quite well (8 minutes). The first 2(?) runs crashed because I
exhausted the memory space of the T2K, but the 3rd one completed successfully.

wf2_multicore2_block.ml took considerably more time because I switched from
line-oriented to block-based IO --- basically the technique all the fast
implementations used.
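
To illustrate the difference between line-oriented and block-based IO, here is a minimal sketch in Python. It is not the code from wf2_multicore2_block.ml; the function name `iter_lines_blockwise` and the default block size are assumptions chosen for illustration. The idea is to read large fixed-size blocks and split them into lines yourself, carrying any partial line over to the next block.

```python
import io


def iter_lines_blockwise(stream, block_size=1 << 20):
    """Yield complete lines (without the trailing newline) from a binary
    stream, reading block_size bytes at a time instead of line by line."""
    tail = b""  # partial line left over from the previous block
    while True:
        block = stream.read(block_size)
        if not block:  # empty read means end of input
            break
        lines = (tail + block).split(b"\n")
        tail = lines.pop()  # last piece has no newline yet (may be empty)
        yield from lines
    if tail:
        yield tail  # final line that had no terminating newline


# Usage on an in-memory stream; a real run would pass open(path, "rb").
sample = io.BytesIO(b"GET /a\nGET /bb\nGET /ccc")
for line in iter_lines_blockwise(sample, block_size=4):
    print(line)
```

The tiny block size in the usage example only exercises the carry-over logic; in practice the block would be much larger than a typical line.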

------
iigs
_The Wide Finder 2 implies lots of IO activity, which proved to be relatively
hard to optimize on the T2K, because you can easily saturate a core by
doing mere IO (i.e., a single core is barely able to cope with the sustained
read rate of the disk)._

This might be the most thought-provoking sentence I've read in weeks. It
certainly calls into question the credibility of people who say "the future
will be hundreds of cores per CPU".

Maybe this is fixable in OCaml somehow, but we might be seeing the limits of
our multi-core panacea future sooner than we think because of realities like
this.

edit: I haven't looked, but I wonder if the IO thread is just busy-waiting on
the disk, in which case this perhaps isn't the limit. The fundamental question
stays the same, I guess.

~~~
mfp
IO was only (barely) disk-bound when you gave a full core to the reader (i.e.,
you have to be careful not to use HW threads on the same core). This is not a
problem specific to OCaml --- I reproduced it with a standalone program
written in C that simply read the file: as soon as you have more stuff running
on the same core (in different hardware threads), the IO performance drops.
See [http://groups.google.com/group/wide-finder/browse_thread/thr...](http://groups.google.com/group/wide-finder/browse_thread/thread/332f306893b37b0e?hl=en#)

Also, and this came as quite a surprise, it turns out that mmap is slower than
read(2) on the T2K.
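
For readers unfamiliar with the two approaches, here is a minimal Python sketch of scanning a file with explicit reads versus a memory mapping. It is not the C test program mentioned above, and it makes no claim about which is faster; that result is specific to the T2K. The function names and the block size are assumptions for illustration. Note that Python's `mmap` raises an error when asked to map an empty file.

```python
import mmap
import os
import tempfile


def count_newlines_read(path, block_size=1 << 20):
    """Count newlines using explicit read calls on a file descriptor."""
    fd = os.open(path, os.O_RDONLY)
    try:
        total = 0
        while True:
            block = os.read(fd, block_size)
            if not block:  # empty read means end of file
                return total
            total += block.count(b"\n")
    finally:
        os.close(fd)


def count_newlines_mmap(path, block_size=1 << 20):
    """Count newlines by mapping the (non-empty) file and scanning slices."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return sum(m[i:i + block_size].count(b"\n")
                       for i in range(0, len(m), block_size))


# Tiny self-check on a throwaway file (not a benchmark).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"GET /ongoing/ HTTP/1.1\n" * 1000)
print(count_newlines_read(tmp.name), count_newlines_mmap(tmp.name))
os.remove(tmp.name)
```

Timing the two functions on a large file, on the machine you care about, is the only way to tell which one wins there.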

~~~
iigs
Wow, interesting. Thanks for the info and the link. This is a surprising and
disappointing aspect of this CPU.

Do you know if there's something about this (seemingly trivial) workload that
is pathologically bad for this processor, or do you believe that the CPU is
just wimpy? I've only ever had a glossy spec-sheet-level introduction to these
at work. Given this load I don't see the value of "4 threads per core" that
they proclaim on <http://www.sun.com/processors/UltraSPARC-T1/specs.xml>

~~~
mfp
It's just that the hardware threads are slow, I think --- compiling stuff on
the T2K also took forever. It also seems to me that there's little
value in having 4 threads per core: it forces you to parallelize programs that
ran fine with normal cores just to match the performance you'd get without
hardware threads...

------
schtog
I would love to see a Haskell implementation.

Also, the Ruby benchmark seems weird. It can't be that much slower, right? Just
how naive is the implementation?

