OCaml kicks ass on Tim Bray's Wide Finder 2 benchmark challenge. Again.

tl · on Oct 27, 2008

Article title: "...300X faster than naïve Ruby"

OCaml comes in at 300x faster than Ruby. Python comes in at 45x times faster than Ruby. However, there's more going on here than language:

* The ruby version is single threaded and the test is on a 32 core workstation (counting CPU time Python is only 4x faster and OCaml is only 17x faster)

* The author admits that there were multiple stabs at the OCaml version. What savings could come from optimizing the code in other languages?

* The test is IO heavy. Is OCaml (or other language) taking some shortcut (e.g. Unicode vs non-Unicode, Strings vs bare char[] arrays)?

mfp · on Oct 28, 2008

"The ruby version is single threaded and the test is on a 32 core workstation (counting CPU time Python is only 4x faster and OCaml is only 17x faster)"

I mentioned that on my blog. I also have an OCaml version that is 25x faster in CPU time (i.e., as fast as the top C++ entries), but barely faster regarding wall clock time. It takes a couple dozen extra lines.

Keep in mind that the Wide Finder 2 benchmark was about parallelism from the beginning; I said the Ruby version was naïve precisely because it wasn't parallel. The fact that the language did matter to this extent came as a relative surprise, because the most expensive operations in the Ruby version actually take place in its core classes, written in C. It's just that it's so slow everywhere else that the overall performance is still an order of magnitude worse.

"The author admits that there were multiple stabs at the OCaml version. What savings could come from optimizing the code in other languages?"

There are three OCaml versions, listed on the result table http://wikis.sun.com/display/WideFinder/Results

AFAIK other entries received considerable optimization effort (I'd even go further and say that most involved more) --- several went through half a dozen revisions, even if the wiki doesn't reflect it.

You can take a look at the wide-finder mailing list to see how often each participant was using the T2K (we used the ML to reserve time slots): http://groups.google.com/group/wide-finder/topics?hl=en&...

wf2_multicore.ml was the first version I ran against the full dataset on the T2K, and did quite well (8 minutes). The 2(?) first runs crashed because I exhausted the memory space of the T2K, but the 3rd one completed successfully.

wf2_multicore2_block.ml took considerably more time because I switched from line-oriented to block-based IO --- basically the technique all the fast implementations used.

silentbicycle · on Oct 27, 2008

OCaml is non-Unicode by default, though there is Unicode support in extlib (http://code.google.com/p/ocaml-extlib/).

I, too, am curious about a Haskell implementation - I suspect it would be comparably efficient to the OCaml one. As for optimizing the Ruby one, the major speedups come from being able to parallelize processing (not sure how feasible this is in Ruby) and OCaml's static typing system (Ruby loses big here, speedwise).

I also wonder how fast it would run in Lua; probably somewhat faster than Python. I don't have the time to rewrite it ATM, though.

iigs · on Oct 28, 2008

The Wide Finder 2 implies lots of IO activity, which proved to be relatively hard to optimize on the T2K, because it you can easily saturate a core by doing mere IO (i.e., a single core is barely able to cope with the sustained read rate of the disk).

This might be the most thought-provoking sentence I've read in weeks. It certainly calls the credibility of people who say "the future will be hundreds of cores per CPU".

Maybe this is fixable in OCaml somehow, but we might be seeing the limits of our multi-core panacea future sooner than we think because of realities like this.

edit: I haven't looked, but I wonder if the IO thread is just busy waited on the disk, and this perhaps isn't the limit. The fundamental question stays the same, I guess.

mfp · on Oct 28, 2008

IO was only (barely) disk-bound when you gave a full core to the reader (i.e., you have to be careful not to use HW threads on the same core). This is not a problem specific to OCaml --- I reproduced it with a standalone program written in C that simply read the file: as soon as you have more stuff running on the same core (in different hardware threads), the IO performance drops. See http://groups.google.com/group/wide-finder/browse_thread/thr...

Also, and this came as quite a surprise, it turns out that mmap is slower than read(2) on the T2K.

iigs · on Oct 28, 2008

Wow, interesting. Thanks for the info and the link. This is a surprising and disappointing aspect of this CPU.

Do you know if there's something about this (seemingly trivial) workload that is pathologically bad for this processor, or do you believe that the CPU is just wimpy? I've only ever had a glossy spec-sheet-level introduction to these at work. Given this load I don't see the value of "4 threads per core" that they proclaim on http://www.sun.com/processors/UltraSPARC-T1/specs.xml

mfp · on Oct 29, 2008

It's just that the hardware threads are slow, I think --- compiling stuff on the T2K also took forever. It also seems to me that there's seemingly little value in having 4 threads per core: it forces you to parallelize programs that ran fine with normal cores just to match the performance you'd get without hardware threads...

jlouis · on Oct 28, 2008

I don't think it affects credibility of hundreds of cores per CPU at all. I think 32 cores is about as high we will go in the first phase though. Disk I/O bound applications can't be faster by throwing more cores at the problem. Memory bound applications will flat out around the 32 core limit and then we will have to split up the memory as well into several "banks". When that has happened, we can again improve up to some hundred cores.

There is no reason to "fear" having more cores and I don't think it will affect us so much people are stating it will.

schtog · on Oct 27, 2008

I would love to see a Haskell-implementation.

Also the Ruby-benchmark seems weird. It can't be that much slower right? Just how naive is the implementation?