
New Rust hash table leads Benchmarks Game - joseraul
http://benchmarksgame.alioth.debian.org/u64q/performance.php?test=knucleotide
======
mbrubeck
In other Rust/benchmarksgame news, I just submitted a simple fix to the Rust
program for "reverse-complement" that makes it faster than the fastest C++
program, on my computer. The old version was spending 2/3 of its time just
reading the input into memory, because it wasn't allocating a large enough
buffer up front.

[https://github.com/TeXitoi/benchmarksgame-rs/pull/44](https://github.com/TeXitoi/benchmarksgame-rs/pull/44)
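
The gist of that kind of fix is pre-sizing the buffer so `read_to_end` doesn't repeatedly grow and copy it while input streams in. A minimal sketch of the idea (not the actual patch; `read_all` and the size hint are illustrative names/values):

```rust
use std::io::Read;

// Read an entire input into memory, reserving `size_hint` bytes up front so
// read_to_end fills the buffer in place instead of reallocating as it grows.
fn read_all<R: Read>(mut input: R, size_hint: usize) -> std::io::Result<Vec<u8>> {
    let mut data = Vec::with_capacity(size_hint);
    input.read_to_end(&mut data)?;
    Ok(data)
}

fn main() {
    // Stand-in for stdin: any Read works, e.g. a byte slice.
    let fake_input: &[u8] = b">seq\nACGT\n";
    let data = read_all(fake_input, 1 << 20).unwrap();
    assert_eq!(data.len(), 10);
    println!("read {} bytes", data.len());
}
```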

I'm also working on some additional changes that make it even faster than the
C version (again, on my computer) by improving how it divides work across
CPUs:

[https://github.com/TeXitoi/benchmarksgame-rs/pull/46](https://github.com/TeXitoi/benchmarksgame-rs/pull/46)

These improved Rust programs have not yet been added to the benchmarksgame
site. Previous entries are ranked at:

[http://benchmarksgame.alioth.debian.org/u64q/performance.php...](http://benchmarksgame.alioth.debian.org/u64q/performance.php?test=revcomp)

Minor improvements to the Rust programs for "mandelbrot" and "binary-trees"
are also awaiting review!

~~~
acqq
Thanks!

I like Rust for what it does in advancing the state of the art of the
languages, but I also like how this example demonstrates how hard it is to
avoid "unsafe" constructs and remain competitive.

[https://github.com/TeXitoi/benchmarksgame-rs/blob/master/src...](https://github.com/TeXitoi/benchmarksgame-rs/blob/master/src/reverse_complement.rs)

~~~
mbrubeck
For what it's worth, this alternate implementation has only one line of unsafe
code (a call to the libc "memchr" function) and is only 9% slower than the
fastest unsafe version:

[https://github.com/mbrubeck/benchmarksgame-rs/blob/reverse_c...](https://github.com/mbrubeck/benchmarksgame-rs/blob/reverse_complement_bytes/src/reverse_complement.rs)

It's very easy to write extremely fast safe Rust code. (The safe Rust version
above is faster than the fastest C++ submission, on my computer.) Using
"unsafe" for optimization is usually only helpful to get a few extra percent
speedup in an inner loop. If this were production code rather than the
benchmarks game, I'd probably ship the safe version.

~~~
kibwen
_> one line of unsafe code (a call to the libc "memchr" function)_

Is burntsushi's Rust implementation of memchr notably slower than libc's?

~~~
burntsushi
The rust-memchr crate uses libc's memchr if it's available and known to be
fast. Otherwise, it falls back to a pure Rust implementation (written by
bluss, not me).

I expect this to change once we get SIMD. ;-)

------
igouy
Previously std::collections::HashMap was used with the default hash function
--

[46.03 secs]
[http://benchmarksgame.alioth.debian.org/u64q/program.php?tes...](http://benchmarksgame.alioth.debian.org/u64q/program.php?test=knucleotide&lang=rust&id=1)

and then the hash function was changed to FnvHasher --

[17.10 secs]
[http://benchmarksgame.alioth.debian.org/u64q/program.php?tes...](http://benchmarksgame.alioth.debian.org/u64q/program.php?test=knucleotide&lang=rust&id=2)

and then better use of quad core with futures_cpupool --

[9.44 secs]
[http://benchmarksgame.alioth.debian.org/u64q/program.php?tes...](http://benchmarksgame.alioth.debian.org/u64q/program.php?test=knucleotide&lang=rust&id=3)

and now use of std::collections::HashMap has been replaced with an
experimental hash table inspired by Python 3.6's new dict implementation --

[5.30 secs]
[http://benchmarksgame.alioth.debian.org/u64q/program.php?tes...](http://benchmarksgame.alioth.debian.org/u64q/program.php?test=knucleotide&lang=rust&id=4)

_Afaict_, comparing #4 to #5 is all about the differences between that experimental hash table and std::collections::HashMap --

[9.14 secs]
[http://benchmarksgame.alioth.debian.org/u64q/program.php?tes...](http://benchmarksgame.alioth.debian.org/u64q/program.php?test=knucleotide&lang=rust&id=5)
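
That first jump (46s to 17s) is just the hasher swap. The actual submissions pulled in the `fnv` crate; this is only a sketch of the mechanics with a hand-rolled 64-bit FNV-1a, plugged into the standard HashMap via BuildHasherDefault:

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

// Hand-rolled 64-bit FNV-1a; constants are the standard FNV-1a
// offset basis and prime.
struct FnvHasher(u64);

impl Default for FnvHasher {
    fn default() -> Self {
        FnvHasher(0xcbf29ce484222325)
    }
}

impl Hasher for FnvHasher {
    fn finish(&self) -> u64 {
        self.0
    }
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= b as u64;
            self.0 = self.0.wrapping_mul(0x100000001b3);
        }
    }
}

fn main() {
    // Same HashMap API as before; only the hasher type parameter changes.
    let mut counts: HashMap<u64, u32, BuildHasherDefault<FnvHasher>> =
        HashMap::default();
    for &key in [42u64, 7, 42].iter() {
        *counts.entry(key).or_insert(0) += 1;
    }
    assert_eq!(counts[&42], 2);
    println!("count for 42: {}", counts[&42]);
}
```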

~~~
gpm
> experimental hash table

When discussing whether or not to use this, at least one person mentioned that
their company was using it in production. I don't think it really counts as
experimental.

~~~
burntsushi
"experimental" is descriptive, not prescriptive:

> Experimental hash table implementation in just Rust (stable, no unsafe
> code).

[https://github.com/bluss/ordermap](https://github.com/bluss/ordermap)

------
stcredzero
Java is showing quite impressive numbers! 50% overhead over native C
implementations was often cited as a good guess for the ultimate efficiency of
JIT code generation back in the _Self_/HotSpot days.

People who tried to castigate Go early on as having "Java-like speeds" were
really just showing their ignorance of the state of the art of JIT compilation
for managed languages and the JVM. Such outdated folk knowledge of performance
in the programming field seems to be a constant over the decades. (Programmers
have had such distorted views since the mid-'80s at least.) Maybe this kind of
knowledge needs to be used in job interview questions for a while? Very soon,
people will just memorize such trivia for interviews, but it would serve to
squash this form of folk-programming "alternative fact."

~~~
nostrademons
HotSpot has had great speeds for numeric computation at least since 2005. I
was doing financial software in Java in my first job out of college, our CTO
was an ex-Sun architect who literally wrote the book on Java, and the speeds
we got on numerical computations were basically equivalent to C.

The part where Java really falls down is in memory use & management, which you
can see on the binary-tree & mandelbrot benchmarks, where it's roughly 4x
slower than C. There are inherent penalties to pointer chasing that you can't
get around. While HotSpot is often (amazingly) smart enough to inline and
stack-allocate small private structs, typical Java coding style relies on complex
object graphs. In C++ or Rust these would all have well-defined object
ownership and be contained within a single block of memory, so access is just
"add a constant to this pointer, and load". In Java, you often need to trace a
graph of pointers 4-5 levels deep, each of which may cause a cache miss.
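
A sketch of the layout difference being described, in Rust for concreteness (the type names are made up): nesting by value gives one contiguous block, while boxing every level mimics Java's object graph.

```rust
use std::mem::size_of;

// Flat: Outer contains Middle contains Inner, all inline in one block.
// Accessing outer.middle.inner.x is "pointer + constant offset, load".
struct Inner { x: u64 }
struct Middle { inner: Inner, y: u64 }
struct Outer { middle: Middle, z: u64 }

// Java-like: each level is a separate heap allocation, so reaching x
// means chasing two pointers, each a potential cache miss.
struct BoxedMiddle { inner: Box<Inner> }
struct BoxedOuter { middle: Box<BoxedMiddle> }

fn main() {
    // One allocation, three u64s laid out contiguously: 24 bytes.
    let outer = Outer { middle: Middle { inner: Inner { x: 1 }, y: 2 }, z: 3 };
    assert_eq!(size_of::<Outer>(), 24);
    println!("{} {} {}", outer.middle.inner.x, outer.middle.y, outer.z);

    // Same logical structure, but spread across three allocations.
    let boxed = BoxedOuter {
        middle: Box::new(BoxedMiddle { inner: Box::new(Inner { x: 1 }) }),
    };
    assert_eq!(boxed.middle.inner.x, 1);
}
```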

Rule of thumb while I was at Google was to figure on real-world Java being
about 2-3x slower than real-world C++.

~~~
rbehrends
> The part where Java really falls down is in memory use & management, which
> you can see on the binary-tree & mandelbrot benchmarks, where it's roughly
> 4x slower than C.

binary-tree is not useful for comparing GCed and non-GCed languages. For non-
GCed languages, you are allowed to use a memory pool of your choice (the C
version uses the Apache Portable Runtime library), for GCed languages you are
required to use the standard GC with the default settings (no adjustment of GC
parameters permitted). This is apples and oranges.

For mandelbrot, the C version uses handcoded SIMD intrinsics. I.e. it's not
even portable to non-x86 processors.

~~~
flukus
> For non-GCed languages, you are allowed to use a memory pool of your choice
> (the C version uses the Apache Portable Runtime library), for GCed languages
> you are required to use the standard GC with the default settings (no
> adjustment of GC parameters permitted). This is apples and oranges.

Doesn't that match how a library would be used in the real world? A C library
can create its own memory pool, but a GC'd one has to live with however its
host is configured.

~~~
rbehrends
If I were to run a performance-critical application, I'd definitely tune the
GC accordingly. It's why the JVM offers several garbage collectors in the
first place, for example.

Also, GCed languages aren't prevented from using memory pools, but often they
are not part of their common libraries, because there's less need for them.

~~~
flukus
> If I were to run a performance-critical application, I'd definitely tune the
> GC accordingly.

But you have to tune it for the performance of the whole application (AFAIK);
you can't tune it for an individual algorithm like you can with C. It's a
one-size-fits-all approach.

~~~
rbehrends
1. That goes towards the other point that I made [1] about how
microbenchmarks have only limited relevance for the performance of large
applications (the performance of memory pools can also change as a result; as
an extreme case, multiple large memory pools can lead to swapping).

2. Many GCs allow you to tune performance for individual computations. For
example, Erlang allows you to basically start a new lightweight process with a
heap large enough so that collection isn't needed and to throw it away at the
end; OCaml's GC parameters can be changed while the program is running.

[1]
[https://news.ycombinator.com/item?id=13747876](https://news.ycombinator.com/item?id=13747876)

------
chris_overseas
It seems to me that there are a lot of apples-to-oranges comparisons here.
Some implementations use the language's standard library hash table while
others use third-party versions (with different algorithms and data structures
across all of them), some use multiple threads while others are
single-threaded, etc. As a result, I wouldn't read too much into the rankings
you see here.

~~~
bluejekyll
And some languages are allowed to use FFI to make their impl faster. There's
some rule about this that I don't understand, but oh well. It's all for fun,
not serious.

But come on, now Rust can legitimately be called "faster than C" ;)

At least until the Clang C version is added... or maybe it will still be
faster.

~~~
emn13
I suspect all languages without exception use some standard library
functionality in at least a few of those sample programs, and most "standard"
libraries aren't constrained to be self-hosting - so all of them use some
native code, probably written in C or C++. I suspect that's true of Rust too.

FFI is a fact of life. I can imagine it would be perverting the intent of the
game if you explicitly used FFI to delegate the actual core of the benchmark
program to C as opposed to using "standard" building blocks, but the
distinction is necessarily vague.

~~~
steveklabnik
> I suspect that's true of rust too.

rustc uses LLVM and jemalloc (by default). Other than that, it is all Rust.

------
galangalalgol
When SIMD goes stable, Rust may dominate that game. I still wish they would
use Clang so it was apples to apples with C and C++. Edit: actually, I wish
they would add Clang for those languages and keep GCC for comparison. Then I'd
want Fortran to add gfortran for the same reason.

~~~
staticassertion
Yeah, I think SIMD will be the next big jump for these benchmarks. I also wish
we could see clang used with the C/C++ cases.

~~~
Sean1708
Is there any reason why we can't have a "C clang"? Or has just nobody bothered
yet?

~~~
masklinn
Circa 2011 the maintainer of the benchmarks game decided to mostly allow only
one implementation of each language[0], following PyPy developers trying to
get program alternatives which weren't PyPy-pessimal.

[0] Some languages get a bye for some reason, e.g. MRI and JRuby but no PyPy;
and which implementation is blessed is also arbitrary, e.g. JavaScript is V8
but Lua is Lua.

~~~
bluejekyll
2011 didn't have as widespread Clang/LLVM usage. Now that it is so ubiquitous,
probably a good time to revisit for C...

C needs to defend its reputation now!

~~~
0xFFC
I am saying this as someone who has loved C his entire life. When I was in
college I implemented most of the assignments in C even though the professor
said Python was okay, because I loved C and thought I would learn more by
doing them in C.

So no hard feelings involved.

There is no reputation to defend. Do you mean the security problems
everywhere? The old, broken, nasty build systems? Not having a single good
package manager? The language (C/C++) is clearly intractable (parsing-wise);
it is 2017 and we still don't have a single good IDE for them. (I don't use an
IDE at all, but I'm sure you agree how important an IDE is for newcomers.)

If raw speed were all that mattered, hand-written assembly would outperform C,
and by that logic asm would be much better than C.

And to be honest, I am in the job-finding phase and preparing for interviews;
if that weren't my plan right now, I would abandon C and C++ (even C++{11,14})
for Rust in a heartbeat.

Rust is clearly an awesome language: well designed, with a great build system
(have you seen Cargo? it is wonderful) and a flexible design (you can write an
OS in Rust, since it has no mandatory runtime, and you can write a web app in
Rust). I am stuck with C and C++ for now, but if my opinion counts, Rust is
superior to C/C++ in every respect.

Even if Rust were a little slower (+/- 10%) I wouldn't mind. Its ecosystem is
so healthy that I would trade 10% of my program's performance for something as
nice as Rust.

~~~
bluejekyll
Right. The only thing C had going for it was that it was the fastest non-
assembly language.

I'm biased, I want all C development halted and moved over to Rust. If C is no
longer the _fastest_ , for some definition of that, then no one should be
defending it at this point.

Honestly, I want the LLVM optimized C version specifically so that this last
argument for C in any context will be taken off the table. I'm sure there are
some people saying, "but it's not using the same compiler backend and
optimizer...", I want this to counter that.

Rust all the things.

~~~
0xFFC
I am asking this genuinely: I always thought and heard that LLVM's optimizer
is inferior to GCC's. Am I wrong? Is there any scientific benchmark for this?

~~~
sqeaky
My understanding was that GCC was better in more scenarios most of the time,
but now it seems like Clang has caught up, and maybe Clang 4.0 is even faster
in more of them.

Here are some recent benchmarks from phoronix:
[http://www.phoronix.com/scan.php?page=article&item=gcc7-clan...](http://www.phoronix.com/scan.php?page=article&item=gcc7-clang4-jan)

~~~
emn13
Yeah. But in this kind of tuned benchmark, I doubt clang will do any better
across the board - especially not considering the fact that the current
programs are likely to be at least somewhat tuned specifically for gcc.

~~~
sqeaky
Clang does well enough to be the default compiler for Apple and Google. It
can't be that far behind.

~~~
emn13
Oh yeah, I have no reason to believe it's worse; it's just that I also doubt
it's going to help much either.

But... trial and error and all ;-).

------
richard_todd
Just FYI in case it happens to others: I was confused because the page I got
showed Rust coming in fourth. I had to reload it and re-sort the columns a
couple of times to see the new Rust #4 entry.

------
crb002
LLVM IR could squeeze out some more. I should take a crack at it. Did
something similar to show DuPont why Criterion rocked.

~~~
kzrdude
The hash table is implemented in safe Rust (by using std's Vec). It has some
inefficiencies that could maybe have been polished away using `unsafe`.

~~~
Manishearth
I had a quick look at it a while back, there aren't many (any?) inefficiencies
there like that.

------
saurik
> Some language implementations have hash tables built-in; some provide a hash
> table as part of a collections library; some use a third-party hash table
> library. (For example, use either khash or CK_HT for C language k-nucleotide
> programs.) The hash table algorithm implemented is likely to be different in
> different libraries.

> Please don't implement your own custom "hash table" - it will not be
> accepted.

> The work is to use the built-in or library hash table implementation to
> accumulate count values - lookup the count for a key and update the count in
> the hash table.

The C++ implementation is thereby testing an old version of a non-standard
extension of libstdc++ that I had never heard of and which was likely
contributed once by IBM and never really looked at again (by either
maintainers _or users_ ;P), while the C implementation is testing the
specified khash library, which is apparently something a number of people
actively contribute to and attempt to optimize, giving it some notoriety.

If I were to do this in C++, and I wasn't allowed to use my hash table, I
would almost certainly not be using __gnu_pbds::cc_hash_table<>. If I were to
just want to use something "included", I would use the C++11
std::unordered_map<> (note that this code is compiled already as C++11). But
we all know the STL is designed for flexibility and predictability, not
performance, and the culture of C++ is "that's OK, as if you actually care
about performance no off-the-shelf data structure is going to be correct". If
I decided I wanted speed, I know I'd want to check out Folly, and I might even
end up using khash.

Reading other comments, what happened here is that the Rust version is now
using some "experimental" hash table inspired by Python 3.6's new dict
implementation. This is just not a useful benchmark. What we are
benchmarking is "how maintained is the implementation's built in hash table
and is it tunable for this particular workload".

That's why you should not be surprised to see Java doing so well: the code
actually being written here is just some glue... your programming language has
to be incompetent to do poorly at this benchmark (especially as many
commenters here are using a "within a power of 2" rule of thumb). There are
even multiple listings for the Java one, and the one that is faster is using
it.unimi.dsi.fastutil.longs.Long2IntOpenHashMap?!?

What we really should be asking here is: why is _any_ language doing "poorly"
in this benchmark? It just isn't surprising that Rust is competitive with
C/C++, nor is it surprising that Java is also; what _is_ surprising is that
Swift, Go, Haskell, and C# are not, and so I bet the issue is something (such
as "is allocating memory for a thing which is not required") that can be
trivially fixed for each (though by the rules of engagement, it might... or
might not :/ as Java "cheated", right? ;P... require a minor fix upstream).

I mean, since the "work" explicitly is not "write a hash table using nothing
but primitives from this language", there is no particular reason why Perl and
Python (which is using a non-destructive array .replace, which is likely
_brutal_... again: I bet this is almost always a benchmark of ancillary memory
allocations) should be doing as poorly as they are: if we all made "optimize
for this benchmark" a top priority for a weekend hackathon, I bet we could get
every open source language to nail this under 25s. But do we care?

~~~
igouy
> If I were to do this in C++ … I would use the C++11 std::unordered_map<> …

[http://benchmarksgame.alioth.debian.org/play.html#contribute](http://benchmarksgame.alioth.debian.org/play.html#contribute)

~~~
saurik
The thesis of my comment was that we should be surprised that any language
does poorly on this benchmark, particularly ones that have similar kinds of
targeting, and that if we cared about this benchmark (and I claim we don't),
we should all pitch in, possibly _upstream_ to fix various languages and their
standard libraries to nail this benchmark. However, I also believe the rules
of this benchmark are awkward and even flawed, and that it isn't clear to me
that it is worth anyone's time to do that.

I am not lamenting that someone should spend more time on this: I am lamenting
that tons of people seem to care about it at all, it is not a "microbenchmark"
(as some are calling it), and I think the main lesson we can learn from it is
"there is something subtly wrong, either with the implementation that was
contributed for this benchmark, the language's runtime, or its standard
library", as given these rules we really should expect every language to
perform similarly.

And so, if we all cared about this benchmark, I bet we could figure out what
is going on and get every open source language down under 25s. Past that
point, I think the rules are such that this isn't even a fun game much less a
useful metric of anything worth measuring, and you are probably wasting your
time contributing. I guess, to make this subthread go somewhere: why do you
disagree?

~~~
igouy
> it is not a "microbenchmark"

Home page -- "Will your _toy benchmark program_ be faster if you write it in a
different programming language? It depends how you write it!"

[http://benchmarksgame.alioth.debian.org/](http://benchmarksgame.alioth.debian.org/)

------
insulanian
I'm not a systems programmer, but it makes me very happy to see Rust taking
off, with all its potential to, hopefully, replace the currently most-used
unsafe systems languages.

------
makmanalp
Is there something fishy going on with the NaiveHashMap here? What's
happening?

    
    
      impl Hasher for NaiveHasher {
          fn finish(&self) -> u64 {
              self.0
          }
          fn write(&mut self, _: &[u8]) {
              unimplemented!()
          }
          fn write_u64(&mut self, i: u64) {
              self.0 = i ^ i >> 7;
          }
      }

~~~
tveita
NaiveHasher (declared as "struct NaiveHasher(u64);") is a tuple struct with
one 64-bit field named "0". (Tuple struct fields are implicitly named 0, 1, 2, ...)

It implements a very simple hashing function for 64-bit numbers, each hashed
number _n_ overwrites the state with _n_ xor ( _n_ >> 7). finish() will return
the last written state.

This seems to work okay since the hashed structure "Code" contains exactly one
64-bit field.

I don't know if it's fishy, but it's certainly very _custom_. For a generic
hashing implementation you'd at least want to mix in the previous state. I
assume it was done more for brevity than performance though.
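
To make that concrete, here is a self-contained version of the pattern under discussion; only the Hasher impl matches the actual program, and the `Code` key type is reconstructed for illustration (all we know from the source is that it holds exactly one 64-bit field):

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hash, Hasher};

// Tuple struct: one unnamed u64 field, accessed as `self.0`.
#[derive(Default)]
struct NaiveHasher(u64);

impl Hasher for NaiveHasher {
    fn finish(&self) -> u64 {
        // Return the last written state.
        self.0
    }
    fn write(&mut self, _: &[u8]) {
        // Never reached: the key type below only ever calls write_u64.
        unimplemented!()
    }
    fn write_u64(&mut self, i: u64) {
        self.0 = i ^ i >> 7;
    }
}

// Stand-in for the benchmark's "Code" type: exactly one u64 field, so the
// derived Hash impl makes exactly one write_u64 call per key.
#[derive(Hash, PartialEq, Eq)]
struct Code(u64);

fn main() {
    let mut counts: HashMap<Code, u32, BuildHasherDefault<NaiveHasher>> =
        HashMap::default();
    *counts.entry(Code(128)).or_insert(0) += 1;
    *counts.entry(Code(128)).or_insert(0) += 1;
    assert_eq!(counts[&Code(128)], 2);

    // The hash itself: 128 ^ (128 >> 7) = 128 ^ 1 = 129.
    let mut h = NaiveHasher::default();
    h.write_u64(128);
    assert_eq!(h.finish(), 129);
    println!("ok");
}
```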

------
santaclaus
Cool! Is Rust's hash table open addressed?

~~~
sanxiyn
Yes. It's open addressing, linear probing, Robin Hood hashing.

[https://doc.rust-lang.org/std/collections/struct.HashMap.htm...](https://doc.rust-lang.org/std/collections/struct.HashMap.html)

~~~
masklinn
The program which reached the top of k-n does not use the standard library's
hashmap, it uses ordermap:
[https://github.com/bluss/ordermap](https://github.com/bluss/ordermap)

Though that's also an open addressed map.

------
geodel
It is a new Rust hash table from an external crate.

~~~
kazagistar
The default hash table has better security against malicious input because it
uses a slower, DoS-resistant hashing algorithm. It's the right default, but if
you really want performance and to compete with C/C++, you have to use an
algorithm that makes different tradeoffs, or you would be comparing apples to
oranges.

~~~
ajross
Do the hash tables in use by the C/C++ entries use low-security hash
functions? Seems like that needs evidence.

~~~
mbrubeck
The first-place Rust program uses this very simple low-security hash function:

    
    
        impl Hasher for NaiveHasher {
            fn write_u64(&mut self, i: u64) {
                self.0 = i ^ i >> 7;
            }
        }
    

The second-place C program uses the exact same hash function as the Rust
program, except it also truncates the result to 32 bits:

    
    
        #define CUSTOM_HASH_FUNCTION(key) (khint32_t)((key) ^ (key)>>7)
    

The third-place C++ program uses the identity function as its hash function:

    
    
        struct hash{
            uint64_t operator()(const T& t)const{ return t.data; }
        };
    

Sources:

- Rust:
[http://benchmarksgame.alioth.debian.org/u64q/program.php?tes...](http://benchmarksgame.alioth.debian.org/u64q/program.php?test=knucleotide&lang=rust&id=4)

- C:
[http://benchmarksgame.alioth.debian.org/u64q/program.php?tes...](http://benchmarksgame.alioth.debian.org/u64q/program.php?test=knucleotide&lang=gcc&id=1)

- C++:
[http://benchmarksgame.alioth.debian.org/u64q/program.php?tes...](http://benchmarksgame.alioth.debian.org/u64q/program.php?test=knucleotide&lang=gpp&id=3)

------
jeffdavis
Does this have anything to do with:
[https://news.ycombinator.com/item?id=13742865](https://news.ycombinator.com/item?id=13742865)
?

~~~
masklinn
No. The Rust program uses
[https://github.com/bluss/ordermap](https://github.com/bluss/ordermap) which
is an implementation/variant[0] of Hettinger's naturally ordered hash map.

------
kcdev
I'm really curious what this benchmark would be for JavaScript V8? Anyone have
the time to recreate the same functionality to test in Node?

~~~
steveklabnik
There are two "Node.js" entries on this page.

------
vegabook
Quite decent perf from the ML family at around 19 seconds (F# and OCaml). Top
of the functionals, at least, and twice as fast as Haskell.

Also, look how ginormous the binaries are for all the VM languages. Kinda
would have thought it would be the opposite, what with not needing to link in
as much runtime?

~~~
Wildgoose
You could even argue that Rust is a member of the ML family seeing as the ML
family of languages were major inspirations and furthermore I believe the
original implementation of Rust was written in OCaml.

~~~
Lev1a
Before Rust was written in Rust it was indeed written in OCaml.

------
snakeanus
Does anybody know what happened to the image charts like this one for example?
[https://web.archive.org/web/20121218042116/http://shootout.a...](https://web.archive.org/web/20121218042116/http://shootout.alioth.debian.org/u64/ats.php)

This was my favourite feature of the benchmarks game.

~~~
igouy
Those charts provide an instant without-thought comparison (which can be
helpful in other situations and with other data sets).

In this situation: it's helpful to slow-down, look at the source-code, think
about what's being compared …

------
gigatexal
The Rust version uses multiple CPUs via a pool concept (which looks a lot like
the multiprocessing module from Python, so kudos there). But the C version is
single-threaded from what I can tell. So Rust is safe but threaded, to be
faster than single-threaded C, which isn't that much slower. Hmm...

~~~
infogulch
That's not true. If you look at the comparisons, the CPU time taken by C is
actually more than Rust's, and the CPU load looks about even. Note the C
version uses:

    
    
        #pragma omp parallel sections

~~~
gens
>.. and the cpu load looks about even.

318% is 20% bigger than 265%.

~~~
infogulch
And 5.30s is about 20% faster than 6.46s. We're just throwing around numbers
now.

My point was that the CPU load is "about even" compared to what it would look
like if the C version were single-threaded, in which case it would be more
like 200-300% bigger CPU load instead of merely 20%.

