
Making Rust as Fast as Go - chrfrasco
https://www.christianfscott.com/making-rust-as-fast-as-go/
======
matklad
Note that the Rust and Go programs differ in a seemingly insignificant, but
actually important detail.

Rust does this

    
    
        next_dist = std::cmp::min(
            dist_if_substitute,
            std::cmp::min(dist_if_insert, dist_if_delete),
        );
    

Go does this

    
    
        nextDist = min(
             distIfDelete, 
             min(distIfInsert, distIfSubstitute)
        )
    

The order of the minimums is important for this dynamic programming loop. If I
change the Rust version to take the minimums in the same order (swapping
substitute and delete), the runtime drops from 1.878696288s to 1.579639363s.
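
A minimal sketch of the reordered Rust version, mirroring the Go order:

    
    
        next_dist = std::cmp::min(
            dist_if_delete,
            std::cmp::min(dist_if_insert, dist_if_substitute),
        );
    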

I haven't investigated this, but I would guess that this is the same effect
I've observed in

* [https://matklad.github.io/2017/03/12/min-of-three.html](https://matklad.github.io/2017/03/12/min-of-three.html)

* [https://matklad.github.io/2017/03/18/min-of-three-part-2.htm...](https://matklad.github.io/2017/03/18/min-of-three-part-2.html)

(reposting my comment from reddit, as it's a rather unexpected observation)

~~~
londons_explore
Since min(a, min(b,c)) == min(b, min(a,c)), perhaps the compiler should be
smart enough to swap the comparisons around if it makes it quicker?

~~~
dan-robertson
I suspect that statement is not true for floats. Possibly you don’t get the
same float from min(0,-0) as min(-0,0), and similarly with NaNs. Rust
specifies that if one input is NaN then the other is returned but doesn’t say
what happens if both are NaN.

~~~
fluffything
> Rust specifies that if one input is NaN then the other is returned but
> doesn’t say what happens if both are NaN.

It does. If both are NaN, a NaN is returned. Note, however, that when Rust says
that a NaN is returned, this means that any NaN can be returned. So if you
have min(NaN0, NaN0) the result isn't necessarily NaN0, it might be another
NaN with a different payload.
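
A quick check of that behavior (f32::min documents that when exactly one
argument is NaN, the other is returned):

    
    
        assert_eq!(f32::min(1.0, f32::NAN), 1.0);
        assert_eq!(f32::min(f32::NAN, 1.0), 1.0);
        // Both NaN: some NaN comes back, with no guarantee about its payload.
        assert!(f32::min(f32::NAN, f32::NAN).is_nan());
    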

~~~
dan-robertson
Right. That’s what I was getting at but I didn’t know the NaN-return rule.

------
chrfrasco
Hey all, as some keen-eyed commenters have pointed out, it looks like the rust
program is not actually equivalent to the go program. The go program parses
the string once, while the rust program parses it repeatedly inside every
loop. It's quite late in Sydney as I write this so I'm not up for a fix right
now, but this post is probably Fake News. The perf gains from jemalloc are
real, but it's probably not the allocator's fault. I've updated the post with
this message as well.

The one-two combo of 1) better performance on Linux & 2) jemalloc seeming to
fix the issue lured me into believing that the allocator was to blame. I’m not
sure what the lesson here is – perhaps more proof of Cunningham’s law?
[https://en.wikipedia.org/wiki/Ward_Cunningham#Cunningham's_L...](https://en.wikipedia.org/wiki/Ward_Cunningham#Cunningham's_Law)

~~~
arcticbull
Thanks for following up. Just as an FYI, there's a few bugs in your
implementation, the most obvious one is the use of ".len()" in a number of
places interspersed with ".chars().count()". These two return different
values. ".len()" returns then number of UTF-8 bytes in the input string, which
for ASCII is the same as ".chars().count()" obviously, but if you do attempt
any Unicode characters, your function won't work. ".chars()" provides Unicode
Scalar Values (USVs) -- which is a subset of code points, excluding surrogate
pairs [1]. Note also this is _not_ the same as a Go rune, which is a code
point _including_ surrogate pairs.
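
For example, with a string that shows up later in this thread:

    
    
        let s = "föö";
        assert_eq!(s.len(), 5);           // UTF-8 bytes
        assert_eq!(s.chars().count(), 3); // Unicode scalar values
    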

Secondly, you re-implemented "std::cmp::min" at the bottom of the file, and
I'm not sure if the stdlib version is more optimized.

Lastly, well, you caught the issue with repeated passes over the string.

I've fixed the issues if you're curious:
[https://gist.github.com/martinmroz/2ff91041416eeff1b81f624ea...](https://gist.github.com/martinmroz/2ff91041416eeff1b81f624ea585f83a)

Unrelated, I hate the term "fake news" as it's an intentional attempt to
destroy the world public's faith in news media. It's a cancer on civilized
society. Somewhere your civics teacher is crying into some whiskey, even
though of course you're joking.

[1]
[http://www.unicode.org/glossary/#unicode_scalar_value](http://www.unicode.org/glossary/#unicode_scalar_value)

~~~
arcticbull
This doesn't even begin to get into the question of what Levenshtein Distance
even means in a Unicode context. What's the Levenshtein Distance of 3 emoji
flags? I suppose we should be segmenting by grapheme clusters and utilizing a
consistent normalization form when comparing, but Rust has no native support
for processing grapheme clusters -- or for normalizations, I believe. The
UnicodeSegmentation crate might help.
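
A small sketch with the unicode-segmentation crate showing why this matters
for flags:

    
    
        use unicode_segmentation::UnicodeSegmentation;
        
        // Each flag is two regional-indicator code points,
        // but a single extended grapheme cluster.
        let flags = "🇦🇺🇯🇵🇺🇸";
        assert_eq!(flags.chars().count(), 6);
        assert_eq!(flags.graphemes(true).count(), 3);
    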

Based on some cursory research, the Go version differs in a more subtle way
too. A Rune is a Code Point, which is a superset of the Rust "char" type; it
includes surrogate pairs.

~~~
derefr
Levenshtein (edit) distance is fundamentally an information-theoretical concept
defined on bitstreams, as insertions/deletions/swaps _of individual bits_
within a stream. It has a lot in common with error-correcting codes, fountain
codes, and compression, which all also operate on bitstreams.

Any higher-level abstract mention of Levenshtein distances (e.g. of Unicode
codepoints) is properly supposed to be taken to refer to the Levenshtein
distance of a conventional (or explicitly specified) binary encoding of the
two strings.

~~~
grantwu
Can you point to a source that defines Levenshtein distance as only referring
to bitstreams?

A translation of the original article [1] that introduced the concept notes in
a footnote that "the definitions given below are also meaningful if the code
is taken to mean an arbitrary set of words (possibly of different lengths) in
some alphabet containing r letters (r >= 2)".

And if you wish to strictly stick to how it was originally defined, you'd need
to only use strings of the same length.

More recent sources [2] say instead "over some alphabet", and even in the
first footnote, describe results for "arbitrarily large alphabets"!

[1]
[https://nymity.ch/sybilhunting/pdf/Levenshtein1966a.pdf](https://nymity.ch/sybilhunting/pdf/Levenshtein1966a.pdf)

[2] [https://arxiv.org/pdf/1005.4033.pdf](https://arxiv.org/pdf/1005.4033.pdf)

~~~
arcticbull
And Unicode is the biggest alphabet haha.

------
molf
The Rust version uses `target.chars().count()` to initialise the cache, while
the Go version counts up to `len(target)`. These are not equivalent: the Rust
version counts Unicode code points, the Go version counts bytes.

I am confused by the implementations, although I have not spent any time
testing them. Both versions contain a mix of code that counts bytes (`.len()`
and `len(...)`) and Unicode code points (`chars()` and `[]rune(...)`). My
guess is that the implementation might not work correctly for certain non-
ASCII strings, but I have not verified this.

Of course, if only ASCII strings are valid as input for this implementation
then both versions will be a lot faster if they exclusively operate on bytes
instead.

~~~
eis
Yep.

Here's a Go playground example showing that the result is indeed wrong:

[https://play.golang.org/p/vmctMFUevPc](https://play.golang.org/p/vmctMFUevPc)

It should output 3 but outputs 5 because each ö is two bytes, len("föö") = 5.

I would suggest using "range" to iterate over the Unicode characters.
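
For example (range decodes runes as it walks the string, while len counts
bytes):

    
    
        package main
        
        import "fmt"
        
        func main() {
            s := "föö"
            fmt.Println(len(s))         // 5: UTF-8 bytes
            fmt.Println(len([]rune(s))) // 3: runes
            for i, r := range s {
                fmt.Printf("%d %c\n", i, r) // i is a byte offset, not a rune index
            }
        }
    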

~~~
earthboundkid
If you diff föö and f though, it correctly gives an edit distance of 2.

The code is weird because someone knew enough to convert the strings to slices
of runes but not enough to use the rune slices consistently. :-/

~~~
arcticbull
Not to mention rune slices are insufficient for things like flag emoji and
family emoji, which are a bunch of separate runes put together. The latter
apparently deletes one family member at a time when you hit "backspace".

~~~
bigbizisverywyz
Oh fab, just when I thought I had a fairly solid understanding of how to
handle Unicode strings, I learn something else that increases the complexity.

I have nothing but respect and gratitude for people that write good unicode
handling libraries, but even then the end developer has to learn a lot just to
be aware of what to look out for when handling strings.

Somewhere on GitHub, I think, somebody has posted a file with evil Unicode
strings.

~~~
arcticbull
In general, Unicode requires you to think differently about strings depending
on context. Here's my rule of thumb.

1. If you are transporting a Unicode string, reading/writing over the network
or to a file, think in terms of UTF-8 bytes. Do not attempt to splice the
string, treat it as an atomic unit.

2. If you are parsing a string, think in terms of code points (runes in Go,
chars in Rust). A good example would be the Servo CSS parser. [1]

3. If you're comparing/searching/inspecting/sorting a string in code, segment
by grapheme clusters and normalize, then do what you came to do. [2]

4. If you're displaying a string, think in terms of pixels. Do not attempt to
limit a string by length in "characters" (née grapheme clusters in the Unicode
world) but rather measure by what the renderer does with the string. Each
character can be a thoroughly arbitrary width and height.

5. If you're building a WYSIWYG editor, there's more to it than I even know
myself, but I suggest reading into what Xi did. It's going to be some
combination of everything above. [3]

[1] [https://github.com/servo/rust-cssparser/blob/master/src/toke...](https://github.com/servo/rust-cssparser/blob/master/src/tokenizer.rs)

[2] [https://github.com/unicode-rs/unicode-segmentation](https://github.com/unicode-rs/unicode-segmentation)

[3] [https://github.com/xi-editor/xi-editor](https://github.com/xi-editor/xi-editor)

~~~
account42
> 2. If you are parsing a string, think in terms of code points (runes in Go,
> chars in Rust). A good example would be the Servo CSS parser. [1]

If all your syntactically meaningful characters are in ASCII, you can also use
UTF-8 bytes in your parser.

Even if they aren't, no UTF-8 encoding of a character is a substring of the
encoding of any other character(s).
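
For instance, scanning raw bytes for an ASCII delimiter is safe, because every
byte of a multi-byte UTF-8 sequence has its high bit set:

    
    
        // 'ä' encodes as two bytes, yet ';' can never appear inside a
        // multi-byte sequence, so a plain byte scan finds the right spot.
        let s = "päx;qüy";
        assert_eq!(s.bytes().position(|b| b == b';'), Some(4));
    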

------
ishanjain28
I tried to benchmark Go/Rust versions as well.

I made 4 changes in the Rust version.

1. Moved up the line that reads cache[j+1] before any reads of cache[j]. This
removes one bounds check. (Improvement from 182,747ns down to 176,xyz ns
±4800.)

2. Moved from .chars().enumerate() to .as_bytes(), manually tracking the
current position with i/j variables. (Improvement from 176,xyz ns down to
140,xyz ns.)

3. Moved to the standard benchmark suite from main + a hand-rolled benchmark
system. (File read + load + parse into lines was kept out of the benchmark.)

4. Replaced the hand-rolled min with std::cmp::min. (Improvement from 140,xyz
down to 139,xyz, but the standard deviation was about the same, so it could
just be a fluke.)

In the Go version, I made three changes.

1. Doing the same thing as #1 in Rust actually increased the runtime from
190,xyz to 232,xyz, quite consistently too (I ran it 10+ times to confirm).

2. Replaced []rune(source), []rune(target) with []byte(source),
[]byte(target). (Improvement from 214,817ns to 190,152ns.)

3. Replaced the hand-rolled benchmark system with a proper benchmark system in
Go. (Again, file read + load + parse into lines was kept out of the benchmark.)

So, at the end of it, the Rust version was about 50k ns faster than the Go
version.

Edit #1:

In the Rust version, I had also replaced the cache initialization with
(0..=target.len()).collect() before doing anything else. This also gave a good
perf boost, but I forgot to note down the exact value.
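
For the Go side, change #3 presumably looks something like the standard
testing harness below (a sketch: `lines` and `levenshtein` stand in for the
repo's actual setup and implementation):

    
    
        package main
        
        import "testing"
        
        // lines is assumed to be read and parsed before the benchmark
        // runs, so file I/O stays out of the timed region.
        var lines []string
        
        func BenchmarkLevenshtein(b *testing.B) {
            for n := 0; n < b.N; n++ {
                for i := 1; i < len(lines); i++ {
                    levenshtein(lines[i-1], lines[i])
                }
            }
        }
    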

~~~
blablabla123
I'd be really surprised to hear that Go is supposed to be faster than Rust. Of
course Rust is a bit newer, but to me it always sounded like Go is fast
because it's statically compiled, yet doesn't have to be top-speed if that
would sacrifice conciseness. Given that this is an artificial example, this
looks more realistic:
[https://benchmarksgame-team.pages.debian.net/benchmarksgame/...](https://benchmarksgame-team.pages.debian.net/benchmarksgame/fastest/rust-go.html)

~~~
arcticbull
Rust and Go are contemporaries. Rust started in 2006 at Mozilla, and the first
Go public release from Google was in 2009, meaning it probably started at the
same time.

Except of course all the Plan 9 garbage (like Go's hand-rolled assembler)
brought in to underpin Go from the 80s ;)

~~~
entha_saava
> Except of course all the Plan 9 garbage (like Go's hand-rolled assembler)
> brought in to underpin Go from the 80s

This is unfair criticism. If Go had used LLVM, it would have hurt its selling
point (fast compile times), and the authors knew the Plan 9 toolchain well.

Go the language feels like it is from the 80s, but its toolchain is not at all
bad. An LLVM monoculture is the last thing one would want. Obligatory reminder
that LLVM has its flaws too.

~~~
blablabla123
Also, when this LLVM stuff doesn't work, it's a major pain to troubleshoot,
because the whole thing is a complexity monster. Go and its Plan 9 heritage
are more like 80s retro-future, and things like cross-compiling are super
easy.

~~~
arcticbull
In my opinion, compile times are irrelevant. Developers can always get faster
or larger machines if they need them. What matters is how the final product
performs on customer machines. Performance and memory usage of the final
product, plus your ability as an engineering team to avoid costly and
difficult mistakes, are basically the only things that matter.

~~~
entha_saava
> Developers can always get faster or larger machines if they need them.

Those who can get top notch hardware are already on top notch hardware and
those who can't are limited by management decisions.

Even with top-notch machines, compile times matter, because the differences in
compile times between C++ and Go are huge in moderately complex projects with
bad build systems, which are the norm.

> What matters is how the final product performs on customer machines.

Apparently today's developers value fast iteration speed, and that's fine. The
problem is they don't value users' resources, because they have top-notch
machines and don't care about performance.

------
codeflo
I recently did some experiments with creating small static Rust binaries,
custom linking, no_std et cetera. A lot of stuff around that kind of thing is
unstable or unfinished, which might be somewhat expected. But I’ve also come
to the conclusion that Rust relies on libc _way_ too much. That might be fine
on Linux, where GNU’s libc is well-maintained, is a bit questionable on MacOS
(as seen in this article), and is a complete distribution nightmare on
Windows (in no small part due to a series of very questionable decisions by
Microsoft).

My understanding is that Go doesn’t use the libc at all and makes system calls
directly, which IMO is the correct decision in a modern systems programming
language that doesn’t want to be limited by 40 years of cruft.

~~~
ekidd
As far as I know, the official system interface on Windows and several Unix
systems is via the standard library, not via direct syscalls. I don't know
about the MacOS. But in general, you may be required to dynamically link the
standard library on many platforms.

Linux guarantees syscalls are stable. And on Linux, you have the option of
telling Rust to cross-compile using a statically-linked musl-libc. (If you
also need to statically link OpenSSL or a few other common libraries, I
maintain [https://github.com/emk/rust-musl-builder](https://github.com/emk/rust-musl-builder),
and there's at least one similar image out there.)

~~~
pansa2
AFAIK on Windows, the hierarchy is:

C library => kernel32.dll => ntdll.dll => system calls

You don’t have to go via the C library - calling kernel32 directly is fine (I
believe this is what Go does). However, it’s very rare to call ntdll or to
make system calls directly.
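
For illustration, calling kernel32 directly from Rust, with no libc involved,
can be as simple as this (GetTickCount chosen only because it takes no
arguments):

    
    
        #[link(name = "kernel32")]
        extern "system" {
            fn GetTickCount() -> u32; // documented kernel32 export
        }
        
        fn main() {
            println!("ms since boot: {}", unsafe { GetTickCount() });
        }
    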

~~~
codeflo
Basically yes. ntdll is an implementation detail that shouldn’t be relied
upon. kernel32/user32 and friends are considered the “proper” interface to the
system and have been stable for decades.

~~~
vardump
There are some ntdll calls that are officially documented and ok to use. Of
course there are also a lot of calls you shouldn't use.

When necessary, it's fine to use even undocumented ones to support Windows 7
and older. It's not like those are going to change anymore.

------
devit
The main problem is that allocating a vector for each evaluation is completely
wrong: instead, it needs to be allocated once by making the function a method
on a struct containing a Vec, which makes the choice of allocator moot.

The second problem is that at least the Rust code is decoding UTF-8 every
iteration of the inner loop instead of decoding once and saving the result, or
even better, interning the characters and having versions of the inner loop for
32-bit chars and 8-bit and 16-bit interned indexes.

Furthermore, the code rereads cache[j] instead of storing the previous value,
and doesn't do anything to make sure that bounds checks are elided in the inner
loop (although perhaps the compiler can optimize that).
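
A sketch of the first three fixes (a reusable buffer in a struct, characters
decoded once by the caller, and the diagonal carried in a local instead of
rereading cache[j]); the names are illustrative, not the article's code:

    
    
        #[derive(Default)]
        struct Levenshtein {
            cache: Vec<usize>, // allocated once, reused across calls
        }
        
        impl Levenshtein {
            fn distance(&mut self, source: &[char], target: &[char]) -> usize {
                self.cache.clear();
                self.cache.extend(0..=target.len()); // reuses the capacity
                for (i, &sc) in source.iter().enumerate() {
                    let mut diag = self.cache[0]; // value from the previous row
                    self.cache[0] = i + 1;
                    for (j, &tc) in target.iter().enumerate() {
                        let above = self.cache[j + 1];
                        let cost = if sc == tc { 0 } else { 1 };
                        self.cache[j + 1] =
                            (above + 1).min(self.cache[j] + 1).min(diag + cost);
                        diag = above;
                    }
                }
                self.cache[target.len()]
            }
        }
    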

The code for computing the min seems to have been written mindlessly rather
than putting serious thought towards whether to have branches or not and in
what order (depending on an analysis of what the branch direction rates would
be).

Implausible benchmark results are almost always an indicator of the
incompetence of the person performing the benchmark.

~~~
burntsushi
> The second problem is that at least the Rust code is decoding UTF-8 every
> iteration of the inner loop instead of decoding once and saving the result

Indeed. This is a pretty damning difference. The `target` string is being
repeatedly UTF-8 decoded, whereas the same is not true in the Go version. The
Go version even _goes out of its way_ to do UTF-8 decoding exactly once for
each of `source` and `target`, but then doesn't do the same for the Rust
program.

> Implausible benchmark results are almost always an indicator of the
> incompetence of the person performing the benchmark.

Come on. We can do better than this. Please don't make it personal. We all
have to learn things at some point.

~~~
masklinn
> Indeed. This is a pretty damning difference. The `target` string is being
> repeatedly UTF-8 decoded, whereas the same is not true in the Go version.
> The Go version even goes out of its way to do UTF-8 decoding exactly once
> for each of `source` and `target`, but then doesn't do the same for the Rust
> program.

I'm really not sure that's an issue: utf8 decoding is very, very cheap, and
it's iterating either way.

It would have to be benched, but I wouldn't be surprised if allocating the
caches (at least one allocation per line of input) had way more overhead,
especially given the inputs are so very short.

I'm not going to claim Rust's utf8 decoder is the fastest around, but it's
_very_ fast.

~~~
burntsushi
It's cheap, but not _that_ cheap. It shouldn't be as cheap as just iterating
over a sequence of 32-bit integers.

But yes, I did benchmark this, even after reusing allocations, and I can't
tell a difference. The benchmark is fairly noisy.

I agree with your conclusion, especially after looking at the input[1]. The
strings are so small that the overhead of caching the UTF-8 decoding is
probably comparable to the cost of doing UTF-8 decoding.

[1] - [https://github.com/christianscott/levenshtein-distance-bench...](https://github.com/christianscott/levenshtein-distance-benchmarks/blob/master/sample.txt)

~~~
matklad
> It shouldn't be as cheap as just iterating over a sequence of 32-bit
> integers.

I wonder if there are any benchmarks about this? Specifically, it feels like
in theory iterating utf8 _could_ actually be faster if the data is mostly
ascii, as that would require less memory bandwidth, and it seems like the
computation is simple enough for memory to be the bottleneck (this is a wild
guess, I have horrible intuition about speed of various hardware things). In
_this_ particular benchmark this reasoning doesn’t apply, as strings are short
and should just fit in cache.

~~~
burntsushi
If all you need to do is validate UTF-8, then yes, mostly ASCII enables some
nice fast paths[1].

I'm not a UTF-8 decoding specialist, but if you need to traverse rune-by-rune
via an API as general as `str::chars`, then you need to do some kind of work
to convert your bytes into runes. Usually this involves some kind of
branching.

But no, I haven't benchmarked it. Just intuition. A better researched response
to your comment would benchmark, and would probably at least do some research
on whether Daniel Lemire's work[2] would be applicable. (Or, in general,
whether SIMD could be used to batch the UTF-8 decoding process.)

[1] -
[https://github.com/BurntSushi/bstr/blob/91edb3fb3e1ef347b30e...](https://github.com/BurntSushi/bstr/blob/91edb3fb3e1ef347b30e5bd792bb4d29ee19d163/src/ascii.rs#L1)

[2] - [https://lemire.me/blog/2018/05/16/validating-utf-8-strings-u...](https://lemire.me/blog/2018/05/16/validating-utf-8-strings-using-as-little-as-0-7-cycles-per-byte/)

------
tromp
Missed chance to shorten title to "Making Rust Go Fast" :-)

~~~
katktv
Making Rust As Fast As It Can Go

~~~
kreetx
Go Rust!

~~~
arcticbull
GO!!!!

------
mqus
Is it intentional that the benchmarks include not only running the program
itself but also compiling it? E.g., in the linked source code, bench.sh
includes the compilation step, which is known to be slow in Rust:

    
    
        #!/usr/bin/env bash
    
        set -e
        run() {
          cargo build --release 2> /dev/null
          ./target/release/rust
        }
    
        run;
    

Sure, if you run it many times in succession the compiler won't do much, but
the benchmarking script (run.js) doesn't really indicate that, and the blog
post doesn't mention it either.

EDIT: I was just being stupid, don't mind me. The times were taken within each
language and not externally.

~~~
chrfrasco
run.js is not doing the benchmarking. If you look at the source for each of
the programs being benchmarked, you'll see that the programs themselves are
responsible for the benchmarking.

------
alvarelle
You could also try the smallvec crate in this case, which puts small
allocations on the stack: [https://docs.rs/smallvec/](https://docs.rs/smallvec/)
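
A sketch of what that could look like in the article's function (the inline
capacity of 64 is an arbitrary guess at a typical line length):

    
    
        use smallvec::SmallVec;
        
        // Rows of up to 64 entries live on the stack; longer
        // ones spill to a single heap allocation.
        let mut cache: SmallVec<[usize; 64]> =
            (0..=target.chars().count()).collect();
    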

------
arcticbull
There's a bunch of issues with the Rust implementation, not least that in the
initial condition, where the source or target length is zero, it returns the
number of UTF-8 bytes of the other, while all other computations are performed
in terms of Unicode chars -- except at the end: `cache[target.len()]`, which
will return the wrong value if any non-ASCII characters precede it.

Further, each time you call `.chars().count()`, the entire string is re-
enumerated at Unicode character boundaries, which is O(n) and hardly cheap;
hence wrapping it in an iterator over a char view.

Also, re-implementing std::cmp::min at the bottom there may well lead to a
missed optimization.

Anyways, I cleaned it up here in case the author is curious:
[https://gist.github.com/martinmroz/2ff91041416eeff1b81f624ea...](https://gist.github.com/martinmroz/2ff91041416eeff1b81f624ea585f83a)

------
hu3
I'm surprised that a naive implementation in Go can outperform a naive
implementation in Rust.

~~~
empath75
I’m not. Hell, when I first started learning rust i frequently wrote code that
ran slower than _python_.

------
virtualritz
I tried this on a spare time project[1]. Runtime in a quick test went down
from 14.5 to 12.2 secs on macOS!

So a solid ~15% by changing the allocator to jemalloc.

However, I now have a segfault w/o a stack trace when the data gets written at
the end of the process.

Possibly something fishy in some `unsafe{}` code of a dependent crate of mine
that the different allocator exposed. :]

Still – no stack trace at all is very strange in Rust when one runs a debug
build with RUST_BACKTRACE=full.

[1] [https://github.com/virtualritz/rust-diffusion-limited-aggreg...](https://github.com/virtualritz/rust-diffusion-limited-aggregation)

~~~
saagarjha
I have found that jemallocator is currently broken on macOS Catalina, so that
might be the problem. If you can reproduce this issue reliably, I'd love to
hear about it because I can't myself unless I use a very specific toolchain
that produces -O3 binaries that are a real pain to work with.

~~~
virtualritz
It's 100% reproducible. Just check out the second-to-last commit on master of
the GitHub repo I linked to, and run the tool with any command that invokes
the nsi crate.

E.g.:

    
    
       > rdla dump foo.nsi
    

should produce the segfault before exiting the process.

Is there a jemallocator ticket where I can attach a report for this?

~~~
saagarjha
Thanks! I think you found the jemallocator bug, so I'll try your project and
follow up there.

~~~
virtualritz
I think that bug may be fixed in jemalloc by now. The version of jemalloc that
the jemalloc-sys crate tracks is from two years ago. I tried bumping jemalloc
to the latest master, but that makes the jemalloc-sys build fail (trivially,
with a missing script, but still).

Also there is this:
[https://github.com/gnzlbg/jemallocator/issues/136](https://github.com/gnzlbg/jemallocator/issues/136)

------
submeta
Impressed to see four posts about Rust on the front page of HN simultaneously.

------
anderskaseorg
I’ve found that Microsoft’s mimalloc (available for Rust via
[https://crates.io/crates/mimalloc](https://crates.io/crates/mimalloc))
typically provides even better performance than jemalloc. Of course, allocator
performance can vary a lot depending on workload, which is why it’s good to
have options.

------
savaki
This discussion seems to me like a microcosm of the differences in
philosophies between Rust and Go.

With Rust, you have much more control, but you also need a deep understanding
of the language to get the most out of it. With Go, the way you think it
should work is usually Good Enough™.

~~~
jeffdavis
I wouldn't put it that way. Both languages are fast at nice straight-line
code.

The main area I'd expect to see performance benefits for Rust (though I don't
have experience here) is larger rust programs. Rust's zero-cost abstractions
have more benefits as the abstractions nest more deeply. For a small program,
you don't really have a lot of abstractions, so Go will do just fine.

I think Go has a number of nice performance tricks up its sleeve, though, so
I wouldn't rule out Go on performance grounds too quickly.

------
novocaine
It may be that the system allocator is making an excessive number of syscalls
to do its work, whereas most custom allocators will allocate in slabs to avoid
this. You could try using dtruss or strace to compare the differences.
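
E.g., against the binary that the repo's bench script builds, something like:

    
    
        strace -c ./target/release/rust       # Linux: per-syscall counts
        sudo dtruss -c ./target/release/rust  # macOS equivalent
    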

------
savaki
A few folks have commented that there were logic errors in the Go version.
Specifically that

    
    
      len("föö") = 5
    

should instead have returned

    
    
      len("föö") = 3
    

I submitted a pull request,
[https://github.com/christianscott/levenshtein-distance-bench...](https://github.com/christianscott/levenshtein-distance-benchmarks/pull/3),
that fixes these issues in the Go implementation.

Interestingly enough, when I re-ran the benchmark, the Go version is roughly
19% faster than it was previously:

    
    
      old: 1.747889s
      new: 1.409262s (-19.3%)

------
fortran77
Related thread:
[https://news.ycombinator.com/item?id=21103027](https://news.ycombinator.com/item?id=21103027)

------
loeg
FreeBSD's system allocator _is_ jemalloc :-).

------
pcr910303
TLDR for people who didn't read:

The speed difference came from the allocator.

Rust switched from jemalloc to the system allocator per ticket #36963[0] for
various reasons (like binary bloat, valgrind incompatibility, etc...).

Go uses a custom allocator[1] instead.

To make 'Rust Go fast' (pun intended), one can use the '#[global_allocator]'
attribute to plug in a custom allocator (in this case, from the jemallocator
crate) and make allocations fast again.
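
That attribute usage looks like this (following the jemallocator crate's
documentation):

    
    
        use jemallocator::Jemalloc;
        
        #[global_allocator]
        static GLOBAL: Jemalloc = Jemalloc;
    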

[0]: [https://github.com/rust-lang/rust/issues/36963](https://github.com/rust-lang/rust/issues/36963)

[1]:
[https://golang.org/src/runtime/malloc.go](https://golang.org/src/runtime/malloc.go)

~~~
k__
The comments of Rust programmers here also suggest that the Rust
implementation is, indeed, different from the Go implementation.

~~~
pcr910303
It was just a summary of the post contents - the post suggests that the
biggest difference comes from the allocator.

------
maoeurk
Assuming this was run on a 64bit system, the Rust version seems to be
allocating and zeroing twice as much memory as the Go version.

edit: this has been pointed out as incorrect, Go ints are 8 bytes on 64bit
systems -- thanks for the correction!

    
    
      let mut cache: Vec<usize> = (0..=target.chars().count()).collect();
    

which can be simplified as

    
    
      let mut cache: Vec<usize> = vec![0; target.len()];
    

vs

    
    
      cache := make([]int, len(target)+1)
      for i := 0; i < len(target)+1; i++ {
        cache[i] = i
      }
    

Rust usize being 8 bytes and Go int being 4 bytes as I understand it.

So between doing more work and worse cache usage, it wouldn't be surprising if
the Rust version was slower even with the faster allocator.

~~~
rossmohax
Go int is 8 bytes

~~~
eis
It can be either depending on the system.

[https://golang.org/ref/spec#Numeric_types](https://golang.org/ref/spec#Numeric_types)
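
A quick way to check on a given platform:

    
    
        package main
        
        import (
            "fmt"
            "unsafe"
        )
        
        func main() {
            fmt.Println(unsafe.Sizeof(int(0))) // prints 8 on 64-bit platforms
        }
    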

