Performance comparison: counting words in Python, C/C++, Awk, Rust, and more (benhoyt.com)
368 points by goranmoomin on July 24, 2022 | 218 comments



Re: the Rust optimized implementation, I was able to get ~20-25% better performance by rewriting the for loops as iterators, using Rust's byte range pattern matching, and a buffered writer, which seems crazy, but it's true. I chalked it up to some crazy ILP/SIMD tricks the compiler is doing.
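For a concrete picture, here is a minimal sketch of the kind of rewrite described (illustrative only, not the actual PR linked below): an iterator pipeline over raw bytes, with a byte-range pattern doing the ASCII lowercasing. The buffered-writer part is simply wrapping stdout in a std::io::BufWriter before printing the counts.

    // Hypothetical sketch; names and structure are illustrative.
    fn words(buf: &[u8]) -> impl Iterator<Item = Vec<u8>> + '_ {
        buf.split(|b| b.is_ascii_whitespace())
            .filter(|w| !w.is_empty())
            .map(|w| {
                w.iter()
                    .map(|&b| match b {
                        // byte range pattern: ASCII uppercase -> lowercase
                        b'A'..=b'Z' => b + 32,
                        _ => b,
                    })
                    .collect::<Vec<u8>>()
            })
    }

    fn main() {
        for w in words(b"The quick brown FOX") {
            println!("{}", String::from_utf8_lossy(&w));
        }
    }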

I even submitted a PR[0], but Ben was tired of maintaining the project and decided to archive it (which, fair enough!).

[0]: https://github.com/benhoyt/countwords/pull/115


In case anyone is interested, I did an optimized, but much simpler, Rust implementation just today[0], which is faster than the optimized implementation on my machine. No indexing into arrays of bytes, no unsafe, etc., no "code golf" measures. Of course, credit where it's due -- these are simply tweaks to Andrew Gallant's code.

Looks like idiomatic Rust, which I think is interesting. Shows there is more than one way to skin a cat.

[0]: https://github.com/kimono-koans/countwords/blob/master/rust/...


Out of curiosity, what's the difference in runtime when compared to Andrew's optimized version on your machine?

For your other solution, if the time save is consistent from your machine to OP's (a big if), the Rust solution bumps up to 4th place.


> Out of curiosity, what's the difference in runtime when compared to Andrew's optimized version on your machine?

Results on an M1: my "idiomatic" version is 1.32 times faster than Andrew's original optimized version, whereas the optimized C version is 1.13 times faster than my "idiomatic" version. So, all things being equal, that'd make Rust 3rd, just ahead of C++, and behind Zig and C, if I'm reading the results correctly.

More important, to me, would be the other thing -- it's readable, idiomatic Rust. That is, if it had been 5% slower, I think we'd probably all prefer to maintain this code.


Can you share precisely how you're running the code and measuring timings?

I've seen folks flub the measuring for this benchmark by testing much smaller inputs, for example.


I'm using the kjvbible_x10.txt corpus and using hyperfine on the Mac.

I just took a look on Linux and my code that is now running 1.43x faster on the Mac is only 1.02x faster on Ubuntu 22.04. But, again, it's really that it's a Rust commercial -- by leaning on the stdlib, it's possible to get straightforward, but still really fast code.

I have no doubt you could make it faster than me and have!


These changes would prompt some additional remarks about the Rust implementation: most optimisations make code more verbose and less idiomatic, but in Rust it can sometimes be the opposite.


Very nice. That’s in fact really clean, and (at least to me) indeed surprising.

As rejection reasons go, “I wanted to name-drop person X” is interesting. But as you said, that’s the maintainer’s decision to make.


Can't win them all. I think I came along when they were just really sick of maintaining it. Ben and Andrew have all my respect.


Iterators are potentially a much better design too as they allow for more modular code.


BufWriter provides a default capacity of 8 KB. Rather than perform repeated syscalls to write to the stdout fd, this will syscall on flush https://doc.rust-lang.org/std/io/struct.BufWriter.html

Also, for loops desugar to into_iter, but the extra closure does provide an additional bump https://doc.rust-lang.org/book/ch13-04-performance.html
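To make the BufWriter point concrete, here is a minimal sketch (placeholder data, not the benchmark code): writes accumulate in the 8 KB buffer and become one write syscall on flush (or drop) instead of one per line.

    use std::io::{self, BufWriter, Write};

    fn main() -> io::Result<()> {
        let stdout = io::stdout();
        // Buffer the locked handle; each writeln! goes to the in-memory
        // buffer, and the actual write(2) happens on flush (or on drop).
        let mut out = BufWriter::new(stdout.lock());
        for (word, count) in [("the", 1), ("and", 2)] {
            writeln!(out, "{} {}", word, count)?;
        }
        out.flush()?;
        Ok(())
    }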


From my measurements, your optimized version runs at about the same speed as mine:

    $ cd /tmp/
    $ git clone -b ag/test-kimono https://github.com/BurntSushi/countwords
    $ cd countwords/rust/
    $ ./bench
    <snip>
    Summary
      './optimized-trie/target/release/countwords < kjvbible_x10.txt' ran
        1.22 ± 0.02 times faster than './kimono-simple/target/release/countwords < kjvbible_x10.txt'
        1.26 ± 0.02 times faster than './optimized-customhashmap/target/release/countwords < kjvbible_x10.txt'
        1.56 ± 0.03 times faster than './optimized-unsafe/target/release/countwords < kjvbible_x10.txt'
        1.58 ± 0.03 times faster than './kimono-optimized/target/release/countwords < kjvbible_x10.txt'
        1.60 ± 0.03 times faster than './optimized/target/release/countwords < kjvbible_x10.txt'
        3.97 ± 0.06 times faster than './simple/target/release/countwords < kjvbible_x10.txt'
        7.58 ± 0.21 times faster than './bonus/target/release/countwords < kjvbible_x10.txt'
You mentioned in another comment that you were benchmarking on an M1. Maybe there are some interesting differences there in the codegen, in how the CPU executes, or both.

Your 'fast-simple' version is a different story though. Personally, I would not put that in the "simple" classification as outlined by the OP. It is IMO very much not the first program someone would write for this. Bringing in 'hashbrown' and futzing with the hash table lookup/insert is definitely a perf tweak that you probably wouldn't want to do unless you had to, because it makes the code more complex. The 'Box<[u8]>' cements it as an 'optimized' variant.

Now it does look like a simpler variant of the "optimized" Rust program I wrote. I'm pretty impressed. I don't think I ever would have broken out of my local optima to discover that program. In particular, your approach appears to do 3 (and some change) passes over each buffer: 1) read up to '\n' (just the first line), 2) UTF-8 validation, 3) make lowercase and 4) split on whitespace. I realize chunking all of that up confers a lot of benefits, but still, 4 passes and it still being faster than 1 pass is very interesting.

I've glanced at the profiles of each program but haven't been able to solidly conclude anything more precisely about where exactly the benefit is. At this point, my next step would be to slowly translate my version into yours, and benchmark each step of the way until I could isolate the key change (assuming it is one change). But alas, I have run out of time this evening. :-)

Kudos!

(The other interesting thing to note here is that my 'trie' variant is now the fastest Rust submission. Previously, it was slower than the optimized variants on my older CPU. That's pretty neat.)


> In particular, your approach appears to do 3 (and some change) passes over each buffer: 1) read up to '\n' (just the first line), 2) UTF-8 validation, 3) make lowercase and 4) split on whitespace.

I really should have waited until I had more time to respond, but it really only should be 2, 3, and 4.

I tried unsafe/unchecked for UTF8, and, yes, it is a modest bump, but I wanted to do it without unsafe. And 3 and 4 are really pretty fast for what they are. They both work on bytes and the str as_bytes transmute is virtually cost free from what I can tell.
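For anyone following along, here is a rough sketch of that pipeline shape (line-at-a-time for brevity; the code being discussed reads large chunks and only scans for the trailing newline, but the passes are the same): UTF-8 validation, ASCII lowercasing, whitespace split, then counting. It is essentially the shape of a simple solution, not the linked optimized code.

    use std::collections::HashMap;
    use std::io::{self, BufRead};

    fn main() -> io::Result<()> {
        let mut counts: HashMap<String, u64> = HashMap::new();
        let stdin = io::stdin();
        for line in stdin.lock().lines() {
            let line = line?;                           // UTF-8 validation happens here
            let line = line.to_ascii_lowercase();       // ASCII-only lowercasing
            for word in line.split_ascii_whitespace() { // whitespace split
                *counts.entry(word.to_string()).or_insert(0) += 1;
            }
        }
        let mut ordered: Vec<_> = counts.into_iter().collect();
        ordered.sort_unstable_by(|a, b| b.1.cmp(&a.1));
        for (word, n) in ordered {
            println!("{} {}", word, n);
        }
        Ok(())
    }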


> It is IMO very much not the first program someone would write for this.

True.

> Bringing in 'hashbrown' and futzing with the hash table lookup/insert is definitely a perf tweak that you probably wouldn't want to do unless you had to, because it makes the code more complex. The 'Box<[u8]>' cements it as an 'optimized' variant.

I would say both these items were added much later, and they contribute substantially less to the bottom-line performance than you might think. The bulk of the performance is elsewhere.


> Kudos!

Thanks!

> I don't think I ever would have broken out of my local optima to discover that program.

It really is the answer to what is the most knuckle-headed thing one could try, but I was curious what Rust-with-the-guardrails could do.

> My 'trie' variant is now the fastest Rust submission

Cool. Very interested in why this is the case.


The "why" behind the trie is explained in the comments in the code. :-) The core of my hypothesis was to avoid hashing and hash table lookups at all.

The problem with the trie, though, is that it does a memory access per byte.

Whether and when this trade is beneficial is not totally clear to me, but clearly, it can vary.

Whether my hypothesis is actually correct is also not something I'm certain of. Verifying this would take some time with: 1) looking at the codegen and 2) some use of `perf` to extract CPU counters for things like branch and cache misses, and see if a correlation can be established.
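For readers who don't want to dig into the repo, here is a minimal sketch of the general idea (an illustration only, not the actual submission): a flat arena of nodes indexed by letter, so counting a word is one table lookup per byte and no hashing at all.

    // Counting trie over lowercase ASCII words; illustrative sketch.
    struct Trie {
        // children[i][c] = node reached from node i on letter c, or 0 for "none".
        children: Vec<[u32; 26]>,
        counts: Vec<u64>,
    }

    impl Trie {
        fn new() -> Self {
            Trie { children: vec![[0u32; 26]], counts: vec![0] }
        }

        // `word` must already be lowercased ASCII letters.
        fn add(&mut self, word: &[u8]) {
            let mut node = 0usize;
            for &b in word {
                let c = (b - b'a') as usize;
                let next = self.children[node][c] as usize;
                node = if next != 0 {
                    next
                } else {
                    // Allocate a fresh node and link it in.
                    let id = self.children.len();
                    self.children.push([0u32; 26]);
                    self.counts.push(0);
                    self.children[node][c] = id as u32;
                    id
                };
            }
            self.counts[node] += 1;
        }
    }

    fn main() {
        let mut trie = Trie::new();
        trie.add(b"the");
        trie.add(b"quick");
        trie.add(b"the");
    }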


For those curious about Rust optimization, the “optimized” version makes three changes compared to the “idiomatic” version:

* It uses byte strings instead of UTF-8 strings. In my opinion, that’s not an optimization, that’s changing the problem. Depending on the question you’re asking, only one of the two can be correct.

* It uses a faster hash algorithm. It’s not the first time this came up in a benchmark article. Rust’s decision to use a DOS-safe hash by default (and not provide a fast algorithm in the std, like other languages do) really seems to hurt it in that kind of microbenchmark.

* It uses get_mut+insert instead of the more convenient HashMap::entry method, because the latter would require redundantly allocating the key even in the repeat case. I’ve hit this problem in the past as well. Maybe the upcoming HashMap::raw_entry_mut will make this kind of optimization cleaner.
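A minimal sketch of that third point, assuming borrowed &[u8] words and owned Vec<u8> keys (illustrative, not the repo's code):

    use std::collections::HashMap;

    fn bump(counts: &mut HashMap<Vec<u8>, u64>, word: &[u8]) {
        // Convenient, but allocates a fresh key on *every* call:
        //     *counts.entry(word.to_vec()).or_insert(0) += 1;
        // The optimized pattern only allocates when the word is new:
        match counts.get_mut(word) {
            Some(n) => *n += 1,
            None => {
                counts.insert(word.to_vec(), 1);
            }
        }
    }

    fn main() {
        let mut counts = HashMap::new();
        for w in b"the quick the".split(|&b| b == b' ') {
            bump(&mut counts, w);
        }
        println!("{:?}", counts);
    }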


The problem as specified declares that the words we're counting are ASCII:

> ASCII: it’s okay to only support ASCII for the whitespace handling and lowercase operation

UTF-8 (quite deliberately) is a superset of ASCII. So a UTF-8 solution is correct for ASCII, but a bytes-as-ASCII solution works fine in Rust if you only need ASCII.

This is why Rust provides ASCII variants of a lot of functions on strings, and the same functions are available on byte slices [u8] where ASCII could be what you have (whereas their Unicode cousins are not available on byte slices).
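For example (just illustrating that standard-library surface, not any particular submission):

    fn main() {
        let mut word = b"The".to_vec();
        word.make_ascii_lowercase();                 // available on [u8] as well as str
        assert_eq!(&word[..], b"the");
        assert!(word.eq_ignore_ascii_case(b"THE"));  // also available on [u8]
        assert!(b' '.is_ascii_whitespace());         // and on u8 itself
    }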


>> ASCII: it’s okay to only support ASCII for the whitespace handling and lowercase operation

That's a somewhat specific list -- at least I didn't read that as a general "the program can assume that the input is only ASCII".

But then, the author seems to have accepted solutions that crash on non-UTF8 sequences and ones that byte-compare them, so probably either behavior was meant to be fine. I just don't get that from this rule.


> That's a somewhat specific list -- at least I didn't read that as a general "the program can assume that the input is only ASCII".

I don't think it assumes the input is only ASCII. If the problem is "given UTF-8 text, split on ASCII whitespace and convert ASCII uppercase letters to lowercase," you can do that correctly (and produce correct UTF-8 output) without really being UTF-8 aware. For why, see here: https://en.wikipedia.org/wiki/UTF-8#Encoding

> But then, the author seems to have accepted solutions that crash on non-UTF8 sequences and ones that byte-compare them, so probably either behavior was meant to be fine. I just don't get that from this rule.

That's a separate concern right? The rules are only about the behavior when the program is given UTF-8 input.


The issue is that this is a weird requirement. I have seen real world data sets that had all manner of exotic whitespace like nonbreaking spaces and vertical tabs peppered throughout, so I am sympathetic. But this situation isn't common.

That said, for a CLI program like this, usually approximate results are good enough anyway. And realistically, ASCII whitespace should cover pretty much all text.


It is a context-specific requirement, but "fully general lowercasing" would be an impossible requirement.

For e.g. Japanese text, I think you'd only have to add 1 or 2 characters to the set of whitespace characters. You also have to solve Japanese text segmentation, which is hard-to-impossible. If you want to canonicalize the words by transforming half-width katakana to full-width, transforming full-width romaji to ascii, etc., that's a lot of work, and which of those transformations are desired will be specific to the actual use of the program. If you want to canonicalize the text such that the same word written using kanji or using only hiragana end up in the same bucket, or that words that are written the same way in hiragana but written differently when using kanji end up in different buckets, or that names that are written the same way in kanji but written differently in hiragana end up in different buckets, or that loanwords incorrectly written using hiragana are bucketed with the katakana loanword, or that words written using katakana for emphasis are bucketed with the hiragana word (but katakana loanwords are not converted to hiragana and bucketed with the non-loanword that is made up of the same moras), well, that all sounds even more challenging than the hard-to-impossible problem you already had to solve to decide where words begin and end :)

Edit: One of the first concerns I mentioned, about full width romaji and half width katakana, and additionally concerns about diacritics, can be addressed using unicode normalization, so these things are pretty easy[0]. An issue you may still face after normalizing is that you may receive inputs that have incorrectly substituted tsu ツ for sokuon ッ (these are pronounced differently), because for example Japanese banking software commonly transmits people's names using a set of characters that does not include sokuon.

My point is that this is not just a hard problem but many different, incompatible problems, many of which are hard, and because of the incompatibilities you have to pick one and give up on the others. An English-speaking end user may not want their wordcount to perform full width romaji -> ascii conversion.

[0]: https://towardsdatascience.com/difference-between-nfd-nfc-nf...


Great write up! However I should've clarified, I wasn't talking about word segmentation in general, only about expanding the universe of valid "whitespace" grapheme clusters (or even just codepoints) to include the various uncommon, exotic, and non-Western items.


> It uses byte strings instead of UTF-8 strings. In my opinion, that’s not an optimization, that’s changing the problem. Depending on the question you’re asking, only one of the two can be correct.

Sure, but is it changing the problem to something easier than what the other languages are already doing, or to something more similar? I'd imagine the C code is basically just using byte arrays as well, for instance.


> * It uses byte strings instead of UTF-8 strings. In my opinion, that’s not an optimization, that’s changing the problem. Depending on the question you’re asking, only one of the two can be correct.

To be fair, that's what the C version does as well.


> It uses byte strings instead of UTF-8 strings. In my opinion, that’s not an optimization, that’s changing the problem. Depending on the question you’re asking, only one of the two can be correct.

No, that's very wrong. ripgrep has rich Unicode support for example, but represents file contents as byte strings. UTF-8 strings vs byte strings is an implementation detail.

I think you might benefit from reading the "bonus" Rust submission: https://github.com/benhoyt/countwords/blob/8553c8f600c40a462...

IMO, Ben kind of glossed over the bonus submission. But I personally think it was the biggest point in favor of Rust for a real world version of this task.


The question as posed only cares about ASCII.


>It uses byte strings instead of UTF-8 strings. In my opinion, that’s not an optimization, that’s changing the problem. Depending on the question you’re asking, only one of the two can be correct.

For tons of questions, both can be correct.


Re UTF-8: it would change the problem only if the definition of word breaks changed. E.g., if a word break is defined by whitespace, then it probably doesn't.


What happens when the ASCII character for space is embedded within a multibyte UTF-8 codepoint?


UTF-8 is designed so that this doesn’t happen: All bytes in a multibyte codepoint have the high bit set, placing them outside of the ASCII range.
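A quick way to see this property in code (sketch):

    fn main() {
        // Every byte of a multibyte UTF-8 character is >= 0x80, so a
        // byte-level split on the ASCII space (0x20) can never land inside one.
        let text = "café 日本語 tea";
        for chunk in text.as_bytes().split(|&b| b == b' ') {
            assert!(std::str::from_utf8(chunk).is_ok()); // each piece is still valid UTF-8
        }
    }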


This never happens. In UTF-8, bit 7 is set in all valid multibyte representations; ASCII characters cannot appear in them. Now, you can have an ASCII character in the middle of a multibyte sequence, but this isn't valid UTF-8 (and should be treated as an error by the decoder). "Words" generated by a splitter would also contain invalid UTF-8 if they were split on a space coming in between bytes in a multibyte sequence, but these can be detected. Word splitting will not generate invalid words from valid input, or vice versa; and if you know you have valid input, splitting on space, \t, \n, \r, etc. is perfectly cromulent in UTF-8.

Now if you want to also split on U+200B (ZERO WIDTH SPACE), U+202F (NARROW NO-BREAK SPACE), etc... hoo boy.


If UTF-8 allowed that, then a solution would split words wrong according to UTF-8 rules, but it still would not matter, since the problem statement says that "it's okay to only support ASCII for the whitespace handling and lowercase operation".

So if your solution gets word splitting (or lowercasing) wrong for non-ASCII input, it still gets a pass according to my reading.


Such "overlong" encodings are not valid UTF-8.


This was down-voted, which isn't really the best way to handle things that are wrong

Your parent was wondering about a hypothetical UTF-8 sequence [all sequences in this post are hexadecimal bytes] XX 20 XX in which the ASCII space character encoded as 20 is actually somehow part of a UTF-8 character. That's not a thing. UTF-8 is deliberately designed so that nothing like this can happen, along with several other clever properties; such properties are why UTF-8 took over the world.

Overlong sequences are like E0 80 A0 which naively looks like it's a UTF-8 encoding of U+0020 the space character from ASCII but it's not, because U+0020 is encoded as just 20. These encodings are called "overlong" because they're unnecessarily long, they're transporting a bunch of leading zero bits for the Unicode code point, the UTF-8 design says to reject these.

[ If you are writing a decoder you have two sane choices. If your decoder can fail, report an error, etc. you should do that. If it's not allowed to fail (or perhaps there's a flag telling you to press on anyway) each such error should produce the Unicode code point U+FFFD. U+FFFD ("The Replacement Character") is: Visibly obvious (often a white question mark on a black diamond); Not an ASCII character, not a letter, number, punctuation, white space, a magic escape character that might have meaning to some older system, filesystem separator, placeholder or wildcard, or any such thing that could be a problem. Carry on decoding the rest of the supposed UTF-8 after emitting U+FFFD. ]


I would expect a wordcount to pass these through as-is, along with any other invalid sequences of bytes with the high bits set. Performing unexpected conversions just seems like a way to make my future self have a bad time debugging.


I love using the UNIX shell (pipes and programs that do "one thing well") to the point where it hurts my ability to improve at other languages. But sometimes all you need is a throwaway command to do something once, not a beautifully optimized function that will be written, improved, compiled, profiled, reoptimized and then executed a billion times. The utter tersity of a bash one-liner just tickles me.


I'm with you with both shell pipelines and Perl "one-liners". The latter was a wonderful step up from the former back in the 1990s: a modern programming language designed around paving the cowpaths of shell pipelines. The power of both has always been inspiring. So many useful, one-off tasks can be accomplished with a single line in a terminal. So much potential power.

But my problem with both has always been that I needed them often, but not quite often enough to remember them without looking things up (again). Over the years, as my memory of these commands has strengthened, I've needed them less often. We don't use computers the same way we did back then. It's like a spaced repetition system where the spacing is set to guarantee forgetting.


> The utter tersity of a bash one-liner just tickles me.

To anyone else confused, the repo has both "sh" and "bash" versions, and the bash one isn't a one liner - it's the sh one.


How about extending this to comparing programming language efficiency from a developer's standpoint, looking at bytes of code, versus the execution time?

I took the source code file size from the repository, and the runtime from the blog post.

Then I made an arbitrary overall "PAIN SCORE" (lower is better) by multiplying code size * runtime. I suggest this is a worthwhile metric simply because lower is better on both axes, but of course, in the "real world" there will be different economic costs to CPU time and developer time depending on the use case. Here are the sorted results, from least "pain" to most:

    LANGUAGE   FILENAME        CODE SIZE      RUNTIME     PAIN SCORE (LOWER IS BETTER)
    Shell      optimized.sh      75 bytes      1.83 s      137.25
    Crystal    simple.cr        240 bytes      1.29 s      309.6
    Nim        simple.nim       424 bytes      0.77 s      326.48
    Python     simple.py        208 bytes      2.21 s      459.68
    Ruby       simple.rb        175 bytes      3.17 s      554.75
    Go         optimized.go    1514 bytes      0.40 s      605.6
    Python     optimized.py     464 bytes      1.33 s      617.12
    Zig        optimized.zig   2688 bytes      0.24 s      645.12
    Zig        simple.zig      1394 bytes      0.55 s      766.7
    Go         simple.go        688 bytes      1.12 s      770.56
    Nim        optimized.nim   1683 bytes      0.49 s      824.67
    Shell      simple.sh         60 bytes     14.81 s      888.6
    Ruby       optimized.rb     401 bytes      2.47 s      990.47
    JavaScript simple.js        532 bytes      1.88 s     1000.16
    C          optimized.c     4360 bytes      0.23 s     1002.80
    Rust       optimized.rs    3065 bytes      0.43 s     1317.95
    Swift      simple.swift     317 bytes      4.23 s     1340.91
    JavaScript optimized.js    1501 bytes      1.10 s     1651.1
    C          simple.c        2735 bytes      0.96 s     2625.6
    Rust       simple.rs       2239 bytes      1.38 s     3089.82
Sorting only by code size, the most concise implementations are: Shell, Ruby, Python, Crystal. Nobody was aiming to play code golf (i.e. minimize source code size), so these are fairly straightforward, idiomatic, readable implementations.

I am definitely a Crystal fan, in fact this afternoon I'm continuing to implement suggestions from my recent Show HN https://news.ycombinator.com/item?id=32081943 comments. :)


It would probably be prudent to strip comments before measuring the code sizes:

I did not check all files, but noticed that the simple rust code consists of about 60% comments (in terms of bytes), mostly due to a verbose blurb. It also spends about 15% of the non-comment bytes on printing nicer error messages, which is an interesting choice for a micro-benchmark. The simple C and FORTH versions similarly have a lot of comments.

Meanwhile a lot of the other files have very few or no comments.


I wrote the Rust version and the comments are a reflection of my style and my belief that benchmarks should come with some kind of analysis. The "simple" variant was also intended to demonstrate idiomatic code without a bunch of perf tricks. Nicer error messages in that case is fair game.

Indeed, a naive analysis based on code size without taking comments into account is pretty obviously wrong.


Programming language efficiency from a developer's standpoint is not measured in bytes written in the source code, because typing characters is never the bottleneck when writing code.

Even if it were, most code isn't "fire and forget", and so for most code the cost is dominated by maintenance cost, which has more to do with reading than writing code.

Languages that are concise are so because they either express information very densely, or express less information (via for example ignoring error handling or making it implicit with exceptions). In practice I find that such languages are much harder to maintain, because the missing information has to be rebuilt by the reader.


Extra pain score should be awarded for 2 character abbreviations, one character flags, mode switches, special character operators, difficult documentation, lack of error handling, missing type safety, difficulty of introspecting/debugging halfway through program, environment dependencies. Most of these weigh a lot more than just line count.


It'd be interesting to write code in a concise, easy to learn/grok language and then transpile to a faster systems language and recompute the PAIN score. I spent a few months last year pursuing that approach (py2many on github).

We'd need a stdlib that works across languages for the approach to be successful. nimpylib and similar libraries in other languages are a step in that direction.

Alternatively, the python stdlib itself could be rewritten in python and transpiled to other languages. Perhaps it'd help alternative python implementations in terms of compatibility.


Isn't it unfair not to count the runtime code size too for interpreted languages? Because for the compiled ones, they include all the necessary runtime in the executable.


Also a Crystal fan, but from afar. The compile times are a real productivity killer.


> a worthwhile metric

Previously —

"Completely Random and Arbitrary Point System!, or CRAPS![TM]"

https://web.archive.org/web/20010124100400/http://www.bagley...

Currently —

fwiw https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

> Sorting only by code size

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...


I think another interesting "pain score" is whether the code is written with built-ins from the language or the standard library (i.e. not new data structures or functions).


I would expect the AWK version to have very low pain despite its reputation. This is the sort of problem that AWK is well optimized for.


This is a rather meaningless comparison, since the differences are going to be dominated by:

1) The choice of libraries/datatypes used for strings and word->count map

2) How the source file is split into words - probably library function again, although in C/C++ one could choose to implement a super-optimized low level version that would blow the others away

IMO a performance comparison between languages is only meaningful if the runtime is not dominated by library functions, or if one admits it's really a standard library performance comparison, not a language comparison.


I understand what you're saying. However, if you look at it not from the point of view of the performance figures per se, but rather of which language, when written idiomatically, can solve a realistic problem performantly, then I think it makes more sense.


True - there's still value in these types of comparison, although the click-baity "C++ vs Python" title rather detracts from that.

I'm not sure how much value there is in this specific benchmark though, since in the real world you'd be using a utility to do this (e.g. Linux wc - word count), and it's not obvious what's dominating the runtime (I'd guess reading the file and/or splitting into words), so it's not clear what your takeaway should be (idiomatic X is good for what, exactly?) if you want to extrapolate this to some other task.

For that matter, in the real world people don't really choose language on a task specific basis... It's either use language X because the project demands it, or write some limited use utility in a scripting language if performance isn't a major concern.


Agreed. It’s a common desire to boil benchmarks down to a single number.

Some of those languages, however, are far more complex than the others. And as with any complex tool, they require a certain methodology. If you are to use heavy machinery like C++, forget iostreams and character-level processing on your hot paths.


Do you have a particular example problem in mind that might demonstrate language performance better? One of the Benchmark Game programs maybe?


What is there to benchmark between C++, C, Rust, or any other low level compiled language though? With any one of them you can essentially achieve the same assembly code generation, if you try hard enough.

In the end it either boils down to compiler optimisations or library implementations.


I think the interesting question there (and really any language comparison) is to compare the result of idiomatic implementations. Unfortunately, there is no clear answer on what is idiomatic, even with something like Python that is more opinionated than most on the question.


When there is "no clear answer on what is idiomatic", we are left with uninteresting "yes it is" / "no it isn't" Dead Parrot assertions.

http://montypython.50webs.com/scripts/Series_1/53.htm


I think the best way forward on the idiom question is to accept that we can't have a competitive/adversarial benchmark suite where idiom is a factor. I think you really need one author, or maybe a group of colleagues who trust each other, to write the whole suite and make their own choices about what idiom means to them. Folks who disagree with those choices can either write an article about how much performance you gain from making what changes, or just produce their own whole suite. Having a very high level comparison like "this is how languages stack up with these idiomatic choices, but this is a different chart with different choices" would be interesting, even though it would take more work to interpret it.


Seems that idiomatic is a matter of personal taste, used to arbitrarily accept or reject.

One way forward would be to ignore claims based on idiom, but measure how long it took to write a program that produced correct output.


I think it's more interesting to try to measure the languages that are designed to be "fast but also garbage collected", like Java and Go.


With C I rarely use libs outside of stdlib and when I do I pay attention to performance more than convenience.

Even if it's dominated by libs and data types, I think it has value, as it might reflect what actually happens out there in the wild rather than a purely academic exercise.


Exactly, it's mostly comparing different C library implementations.


My language of choice, LiveCode, is a superset of HyperTalk. The concept of "words" has been built-in (not a library) since the '80s.


Recently there was a contest to optimize the performance of a similar program[0] and a Zoom discussion of the optimizations[1].

The programs are not comparable in the following ways:

- Case: TFA requires (at least) ASCII lowercasing but the contest problem required no lowercasing.

- Ordering: TFA does not require sorting, but the contest problem required sorting.

- Memory: TFA imposes a requirement phrased as "don't read the whole file into memory", and this sounds like it's a resource-saving constraint, but it's usually a constraint that requires the program to spend additional resources. You could just mmap the file and store pointers into the mapped region (a minimal sketch of that approach follows the links below). It costs extra to do copies instead of no copies.

- Text: TFA is unclear on what assumptions may be made about the lengths of words. For the contest problem, the Hungarian wikipedia input's longest word is around 80k.

- Safe, Hashing, Stdlib: TFA imposes some restrictions on what constructs may be used that are not imposed in the contest problem.

For the contest version of this problem, it seems like you can tokenize, hash, and count strings at around 1GB/s. Adapting a solution to solve TFA's problem (but not to conform to its Safe/Hashing/Stdlib requirements) would probably not carry too large of a penalty, since it's like 3 instructions to ASCII-lowercase 32 bytes and 1 string copy per unique string should take negligible time compared to the hash table lookups. So there is some room for the optimized solutions to go a little faster, if more optimizations are permitted.

[0]: https://easyperf.net/blog/2022/05/28/Performance-analysis-an...

[1]: https://www.youtube.com/watch?v=R_yX0XjdSBY
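Sketch of the mmap approach mentioned in the Memory point above (assumes the memmap2 crate; the filename is just a placeholder):

    use std::fs::File;
    use memmap2::Mmap;

    fn main() -> std::io::Result<()> {
        let file = File::open("kjvbible_x10.txt")?;
        // Safety: the file must not be truncated or modified while mapped.
        let map = unsafe { Mmap::map(&file)? };
        let bytes: &[u8] = &map;
        // Words can now be kept as zero-copy subslices of the mapping.
        let words = bytes.split(|b| b.is_ascii_whitespace()).filter(|w| !w.is_empty());
        println!("{} words", words.count());
        Ok(())
    }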


Funny how you can take the "optimized" C++ and make it significantly faster (vs. GNU/Linux standard libraries) with little effort. A lookup table for to_lower, instead of bit math, helps slightly. Building with the LLVM-libc string functions helps a lot, because the GNU std::string_view::operator== is calling an AVX-optimized memcmp via the PLT, which is a terrible strategy for short strings. LLVM-libc has an inlined bcmp (note: bcmp can be significantly faster than memcmp, but on GNU they are aliases) that is resolved for the target platform at build time instead of run time.

Edit:

Even if you ignore LLVM-libc, just slapping this into the optimized.cpp and replacing the critical `==` with `our_bcmp` makes it 10% faster. IFUNC calls to micro-optimized SIMD functions are counterproductive. It is far, far better that the compiler can see all the code at build time.

  int our_bcmp (const char* a, const char* b, size_t sz) {
    for (size_t i = 0; i < sz; i++) {
      if (a[i] != b[i]) return 1;
    }
    return 0;
  }


EDIT: I think the use of the term "allocation" might be overloaded here from the original article; after thinking about this a bit more, I think the author is referring to a different way of accessing the map. If you look in godbolt https://godbolt.org/z/Yf64rodW6 you can see that the pointer version uses

  CALL    runtime.mapaccess2_fast64(SB)
whereas the 'raw' version uses

  CALL    runtime.mapassign_fast64(SB)


When reading the Go solution this bit stood out to me

> To reduce the allocations, we’ll use a map[string]*int instead of map[string]int so we only have to allocate once per unique word, instead of for every increment

I just tried benchmarking this with a simple setup and I get zero allocations for both approaches, although the "pointer" approach is slightly faster

  package main

  import (
      "testing"
  )

  func BenchmarkIncrementMapRawInt(b *testing.B) {
      var data = make(map[int]int)

      b.ResetTimer()
      for i := 0; i < b.N; i++ {
          key := i % 1000
          data[key]++
      }
  }

  func BenchmarkIncrementMapPointerInt(b *testing.B) {
      var data = make(map[int]*int)

      b.ResetTimer()
      for i := 0; i < b.N; i++ {
          key := i % 1000
          increment(data, key)
      }
  }

  func increment(counts map[int]*int, value int) {
      if p, ok := counts[value]; ok {
          *p++
          return
      }

      n := 1
      counts[value] = &n
  }

  $ go test -bench=. -benchmem .
  goos: darwin
  goarch: arm64
  BenchmarkIncrementMapRawInt-8        112288002         10.57 ns/op        0 B/op        0 allocs/op
  BenchmarkIncrementMapPointerInt-8    139728302          8.586 ns/op        0 B/op        0 allocs/op


Presumably this would allocate if the ints weren't "primitive", e.g. if you had 128-bit ints implemented as a tuple of two ints.


I like the OCaml version. It's both very easy to read and comes out really well in the benchmark.

I also looked at the "simple" Zig version, which came out really well in the benchmark, and to me it didn't look simple at all. It seems you need to make a lot of low level details explicit.

But IMHO AWK takes the crown here. :)


> But IMHO AWK takes the crown here. :)

Agreed! AWK is still the king of this stuff. For tasks like this, I kinda think AWK is nigh-unbeatable: so simple to write, so obvious what's going on (even if you've never seen any AWK program before, you're probably going to be able to figure out what's going on there), and decently performant.

AWK is the bee's knees.


On the other hand, I'm rather surprised that the Go version is almost as long and opaque as C.


Are you looking at the "simple" or the "optimized" versions? For the optimized, yes, the Go one is very similar to the C. For the simple, idiomatic version, the Go version [1] is much simpler than the C one [2]: 40 very straight-forward LoC vs 93 rather more complex ones including pointer arithmetic, tricky manual memory management, and so on.

[1] https://github.com/benhoyt/countwords/blob/c66dd01d868aa83dc... [2] https://github.com/benhoyt/countwords/blob/c66dd01d868aa83dc...


Yeah I meant the optimized versions.


TFA only talks about gawk, though, making this comparison meaningless, i.e. you can't benchmark languages, only language implementations. Same goes for other programming languages ofc, unless those have only a single implementation.

mawk is an order of magnitude faster than gawk, and gawk isn't even the default on many Linuxen.


> TFA only talks about gawk, though, making this comparison meaningless

Not exactly! To quote from TFA (I'm the author):

> Another “optimization” is to run it using mawk, a faster AWK interpreter than gawk. In this case it’s about 1.7 times as fast as gawk -b. I’m using mawk in the benchmarks for the optimized version.


d'oh sorry didn't read that far


Wow. Swift, touted as "safe by design and (...) runs lightning-fast"[1] is more of a screw-up than I thought. Almost twice as slow as Lua and behind even Pascal and Forth.

[1] https://developer.apple.com/swift/


It's a single implementation, I wouldn't put too much on it. I can definitely write a C++ implementation that's slower and less optimised than most of those results, particularly if I was using Boost.

iPhone and Mac apps run pretty well in my experience, and Swift is definitely faster (in general) than Python at least. There were some serious considerations to port over ML libraries to Swift due to its ease of use, similar to Python, while providing much better execution speed.


> There were some serious considerations to port over ML libraries to Swift due to its ease of use, similar to Python, while providing much better execution speed.

Google abandoned that Swift work (if that's what you're referring to).


Google and abandoning things, name a more iconic duo


Sure, but the fact that it was a project at all does imply Swift had a not-insignificant speedup vs Python, while in these benchmarks Python is about 2x the speed of Swift.


Note that only an unoptimized Swift solution was provided, and Swift String operations (like `lowercased`) are fully Unicode-compliant.


I only skimmed the article, but I suspect they are including the time necessary to spin up the executable and any runtime environment prior to executing the relevant code.

"Lightning fast" can mean a lot of things, and it might not mean "executables start quickly."


If that were the case, that makes Java's results fairly impressive, given the JVM's slow start time.


Maybe for tiny tiny programs "the JVM's slow start time" is only a few tenths of a second?

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...


If that is the case, it would explain why python is so slow. In my experience the runtime speed of python is good enough for most purposes, but startup time is definitely slow.


Swift executables start fairly quickly.


I don't know that I'd characterise it as a "screw up" based on one random person's solution to one particular problem.

In particular the "simple" solutions, which is all that's offered for Swift, will be whatever was most idiomatic/ obvious to a programmer of potentially quite variable quality.

It's unfortunate, I think, that there are "simple" solutions for some languages which have had performance revisions. If you needed an hour, or a friend, or even a community to "hint" how to improve it that wasn't the simple solution.

[ For example I feel like the Rust might be faster asking for ASCII lowercase, and using the unstable (not order-preserving) sort algorithm, but even if I'm correct such a change would logically be made to the optimised Rust, not the simple ]


Swift is in the same or better performance bracket than C#/Java, and unsafe usage can be close to the performance of C (maybe 1.3-2x slower). I'm not sure about ranting over a single poor test result :P


No need for hand-waving, there are publicly available benchmark results: https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

It's not the only benchmark in town and it needs to be taken with a grain of salt, but so does every benchmark.


Lattner himself (creator of Swift) said Swift had speed issues because of ARC that they were hoping to solve with better code analysis. Did they?


They have been improving this steadily (I would argue it is rarely an issue nowadays, and there have always been simple workarounds), but there's no reason a solution to the problem mentioned in the article should need to use ARC; it should use structs/protocols instead (which don't use reference counting). I've seen a lot of benchmarks that rate Swift poorly because of this.


The optimized version of C# is not really optimized, as there are things left on the table that could bring it much closer to C++.


>even Pascal and Forth.

Even?

Pascal is touted as being as fast as C


Perhaps bad phrasing on my part. I meant that in the are-they-still-even-maintaining-modern-optimizing-compilers-for-Pascal-and-Forth sense.


FreePascal maintains their own optimizer. It is one of the few languages that still have their own optimizer rather than using LLVM, although they would probably be better off using LLVM. They introduce a new optimization every month, and I usually notice the new optimization when my code starts crashing with the nightly compiler.

But they have a bunch of reasons why they cannot use LLVM like "LLVM will almost certainly never support all targets that FPC supports (Gameboy Advance, OS/2, WinCE, ...), or at some point drop support for targets that FPC still supports (as already happened with Mac OS X for PowerPC/PowerPC64)." [1]

Or "FPC's native code generators are much faster than LLVM's (even if you would neglect the overhead of FPC generating bitcode and the LLVM tool chain reading it back in), so especially while developing it may be more interesting to use FPC's own code generators"

But the test of this thread mostly benchmarks the hashmap implementation. I've got my own Pascal hash map; it is twice as fast as the one in their standard library.

And, looking at the test code, it does a double hashing, first a get, then an insert. If it did a find entry and update it in-place, it would be twice as fast, too.

Together that would be four times faster and as fast as the C version and still be simple

[1] https://wiki.freepascal.org/LLVM#Frequently_Asked_Questions


I've checked with ClickHouse and the result is better than I expected... it runs in 0.043 sec. on my machine, which is faster than any other result.

The code:

SELECT arrayJoin(splitByChar(' ', lower(line))) AS word, count() AS c FROM file('kjvbible.txt', LineAsString) WHERE notEmpty(word) GROUP BY word ORDER BY c DESC FORMAT Null

or:

clickhouse-local --query "SELECT arrayJoin(splitByChar(' ', lower(line))) AS word, count() AS c FROM file('kjvbible.txt', LineAsString) WHERE notEmpty(word) GROUP BY word ORDER BY c DESC" > /dev/null

It is using only a single thread.



>it runs in 0.043 sec. on my machine, which is faster than any other result.

Did you run the other benchmarks on your machine as well?


Yes (but only scripted, without compilation):

    Language | simple | optimized | notes
    grep     |   0.03 |      0.03 | grep baseline; optimized sets LC_ALL=C
    wc -w    |   0.18 |      0.25 | wc baseline; optimized sets LC_ALL=C
    SQL      |   0.26 |           | by Alexey Milovidov
    Perl     |   1.22 |           | by Charles Randall
    Python   |   1.42 |      0.86 |
    Tcl      |   5.30 |           | by William Ross
    Shell    |   9.66 |      1.79 | optimized does LC_ALL=C sort -S 2G


N.B. the Tcl script is absurdly inefficient. A single simple optimization cuts the run time in half.


I forgot to multiply the file 10 times. When I do, the result is 0.209 sec. which is still better than every other result.


You are also using a language function to read the file. In the 'official' github implementations they have to accept the data line by line from stdin - stdin likely being slower than reading a file directly.


In this type of blog post with comparisons including many languages, you usually see some unnatural approach for your favorite language.

However, in this case, the Python example is idiomatic.

Since the article calls for better approaches, I would suggest taking the opportunity to use the walrus operator and rsplit() for the optimized version. Something like this:

    remaining = ""
    c = Counter()
    while (chunk := sys.stdin.read(64 * 1024)):
        pre, post = chunk.lower().rsplit('\n', 1)
        c.update((remaining + pre).split())
        remaining = post
It should not affect performance too much, and the code gets more expressive.

However, the performance will be quite different depending on the Python version you use. Interestingly, Python 3.11 beta, which comes with a lot of performance tweaks, is slower for this exercise, while being reported to be faster on real-life tasks.


Your code assumes there is at least one newline in each 64K chunk, and assumes there are no words past the final newline.

It will fail on "abc" and give the wrong answer for "abc\ndef".

I prefer rpartition over rsplit to handle first case, and the loop-and-a-half construct instead of the while+walrus operator to handle the second, as in this modified version of your code:

    remaining = ""
    c=Counter( )
    while True:
        chunk = sys.stdin.read(64 * 1024)
        if not chunk:
            if not remaining:
                break
            pre = post = ""
        else:
            pre, mid, post = chunk.lower().rpartition("\n")
        c.update((remaining + pre).split())
        remaining = post


Indeed, rpartition is better suited for this task.


> Incidentally, this problem set the scene for a wizard duel between two computer scientists several decades ago. In 1986, Jon Bentley asked Donald Knuth to show off “literate programming” with a solution to this problem, and he came up with an exquisite, ten-page Knuthian masterpiece. Then Doug McIlroy (the inventor of Unix pipelines) replied with a one-liner Unix shell version using tr, sort, and uniq.

Since this writing and other linked resources present only one side of the affair, I will mention: the presentation in Pearls was quite unfair to Knuth, and the conclusions commonly drawn from it are somewhere between moderately and entirely unsound. (That is, as conclusions from the paper. I speak of unsoundness of logic only: these conclusions may be supportable from other sources.)

Knuth was challenged to demonstrate literate programming, and so that was what he demonstrated. He showed building something from the ground up assuming a very minimal environment, with problem analysis and all: of course that ended up more verbose than a Unix pipeline that glosses over problem analysis and starts with most of the tools you need to make a probably-acceptable solution!

McIlroy himself admitted a few years later that he had been “a little unfair” to criticise on engineering grounds what had only been an illustration of technique. Even in the paper, Bentley did admit a degree of culpability for this mismatch in his “criticism of programs” paragraph. (“He admires the execution of the solution, but faults the problem on engineering grounds. (That is, of course, my responsibility as problem assigner; Knuth solved the problem he was given on grounds that are important to most engineers-the paychecks provided by their problem assigners.)”)

But I do wish that Knuth had been given the opportunity to write a review of McIlroy’s review, for simultaneous publication.


1986 was very early days for modern operating systems. I suspect that the hardware environment (low resource) would be surprising to many phone owners here. So the efficacy had to do with the tools' low-level bindings, too.


Indeed. In 1986, Knuth was exclusively using the SAIL DEC10, running the WAITS OS. This environment limited him to approximately half a megabyte of data space (heap plus stack plus globals), and half a megabyte of code space. All statically linked, so all library and runtime stuff had to fit.

Also of note is that most CPUs of the day didn't execute instructions orders of magnitude faster than bandwidth to main memory, as is the case today; it was early days for worrying about making algorithms "cache aware". So, for instance, there was no big penalty for "pointer chasing" in linked lists as there is today. Additionally, the severe memory size limits affected various time vs. space tradeoffs in algorithm design.


And is it a compiled language at all, versus a scripting language of some kind?


Is what a compiled language? For Knuth's program in the Programming Pearls article? He wrote it in Pascal, so not a scripting language.


I thought he wrote it in his own language, WEB.


WEB is not a totally different programming language, it is a way to write programs. It is a hybrid (in the form presented in that article, though later versions were centered on other languages or language agnostic) of TeX and Pascal. The code that was written was proper Pascal, the documentation was proper TeX, and there was extra syntax for WEB use in naming and referencing code definitions. The WEB system had two additional components (and these terms are still used in many WEB-derived systems): WEAVE, TANGLE.

WEAVE takes a WEB source file and generates proper TeX out of it. TANGLE takes a WEB source file and generates proper Pascal.

The code written in WEB (and variants) is not necessarily in proper order for compilation, and can contain references to other blocks which are included as text-inclusions. I've never used WEB proper, but in org-mode there is org-babel. Its syntax is something like this (from memory, I use shortcuts so I don't have to memorize all the details and type them out):

  To handle user input, the program will read from a file.

  #+NAME: open-file (a better name would be used in a real program)
  #+BEGIN_SRC lisp :noweb yes
    (with-open-file (f filepath)
      <<parsing>>)
  #+END_SRC
elsewhere

  The actual parsing will look like:

  #+NAME: parsing
  #+BEGIN_SRC lisp :noweb yes
    ...
  #+END_SRC
And the order of these can be reversed, the correct output will be produced with a few other settings and options. With org-babel, when you tangle the org file it will generate one or more source files based on the options and settings you've used throughout the org file itself. Instead of WEAVE, you would just use the standard org export settings.


Interesting!


... (2021). Great read and fun project. See the discussion from when it was originally posted by the author.[0]

[0]: https://news.ycombinator.com/item?id=26463967


See also a different comparison (on ~the same problem) at StackExchange: https://codegolf.stackexchange.com/questions/188133/bentleys...

(Copying from my comment the last time this was posted: https://news.ycombinator.com/item?id=26467684)

There's also a nice book "Exercises in Programming Style" about just this problem.


See also Doug Bagley's "The Great Computer Language Shootout"

https://web.archive.org/web/20010616231931/http://www.bagley...


Curious if anyone knows why the swift version is slow? Is it the casting from subsequence to a string?


As I wrote elsewhere in this thread (though I was downvoted), I suspect the reason is that this benchmark is measuring the total execution time of a single run of an executable. This would include any time spent by the executable to bootstrap its runtime environment, initiate a virtual machine, and whatever else it needs to do in addition to the relevant code at hand to process the string.

Such a test will favor implementations that have extremely svelte bootstrapping, allowing them to immediately begin executing the relevant code and return a result.

I feel a more useful test would be for the relevant string processing code to be run tens to thousands of times within the process itself, so as to diminish the relative importance of bootstrap/warmup code and/or runtime optimization performed by VMs. Unless, of course, part of the intent was specifically to measure the impact of the bootstrapping.


In that case, it would change the Python result dramatically as well.

But I don't think it would be fair: I use Python a lot, and if it's for scripting, you are happy with the fact it's very easy to write. However, it's slow to start, and you pay that each time you run the script.

To me, it makes sense that in this exercise, which is heavily leaning toward scripting, we see the price of the VM start in the overall profiling.

Otherwise, let's use pypy, warm it up a few thousand times, and you may get closer to Go times.

But we don't use pypy for scripting.


I agree that the benchmark should be run for much longer to get a useful result.

But why does Swift have a long startup time in the first place? Shouldn’t it start near instantly like the C and Rust programs?


I don't think Swift's performance is due to start up time at all. I actually cloned the repo, and ran the benchmark and found that Swift's execution time scales drastically with the size of the input. The Swift team actually boasts about its quick start up time on the official website [1]. I ran a simple hello world benchmark to gauge Swift's start up time and got an output of 13 milliseconds.

  echo 'print("hello world")' > hello.swift && swiftc hello.swift -O -o hello
  time ./hello
[1] https://www.swift.org/server/


Yeah, the startup time for the Swift code is on the order of several milliseconds on my computer; the profiler running at 1000 Hz only gets four samples off of it.


I was going to make a general comment about this too. It's a huge penalty for Java too, and has no correlation with how well it performs outside of toy benchmarks.


… unless you’re using Java in the same way, such as an Azure Function or AWS Lambda. Or simply as a command-line tool.

IMHO, VM startup time should be included and the first few passes (before JIT kicks in) should also be included.

Java has supposedly “nearly C-like performance” until you read the fine print.

(This should apply to C# as well.)


> a huge penalty

Is that just your assumption or have you measured that penalty for this tiny tiny program?

Might "the JVM's slow start time" in this case be insignificant?

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...


Here's a very rough breakdown of the operations it does:

  1.33 s    56.7% specialized Collection<>.split(separator:maxSplits:omittingEmptySubsequences:)
  263.00 ms 11.2% Substring.lowercased()
  226.00 ms  9.6% specialized Dictionary.subscript.modify
  189.00 ms  8.0% readLine(strippingNewline:)
As you can see, calling split on the string is really slow. The reason it is slow is that it's using the generic implementation from Collection, rather than String, which doesn't really know anything about how String works. So to do the split it's doing a linear march down the string calling formIndex(after:), then subscripting to see if there's a space character there using Unicode-aware string comparison on that one Character.

Swift is really nice that it gives you "default" implementations of things for free if you conform to the right things in the protocol hierarchy. But, and this is kind of unfortunately a common bottleneck, if you "know" more you should really specialize the implementation to use a more optimized path that can use all the context that is available. For example, in this case a "split" implementation should really do its own substring matching and splitting rather than having Collection call it through its slow sequential indexing API. Probably something for the stdlib to look at, I guess.

The rest of the things are also not too unfamiliar but I'll go over them one by one; lowercased() is slow because it creates a new Substring and then a new String, plus Unicode stuff. Dictionary accesses are slow because Swift uses a secure but not very performant hash function. readLine is slow because the lines are short and the call to getline is not particularly optimized on macOS, reallocs and locks internally, then the result is used to create a new String with the newline sliced off.


Thank you for this response!

It's sad to see that Swift is so slow in this particular case given that it has the potential to be such an optimized language (strong typing, compilation, backing from Apple).


As they call out in the article, accurately counting words is a bit more complicated than splitting by space. You need to account for all sorts of punctuation etc.

When I've been asked this in programming interviews, I've almost always been expected to produce something like the code in the article (and do), but usually I'd point to something like the NLTK library as a better approach. It's polished and highly capable, and handles probably just about every edge case there is.


It's a bit of a shame that Rust's stdlib doesn't have a "fast but less safe" hasher. It's one of the easiest performance optimizations in my experience because it's very rare that an attacker controls input to your hashmap in a way that can actually lead to a DOS.

Not that it doesn't happen, it's just not something I've run into.

Just today I was testing the performance difference when hashing some Scylla queries, and got nearly 2x faster hashing by moving to fxhash.
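
A rough sketch of what that swap looks like (assuming the fxhash crate as a dependency; count_words is just a made-up illustration, not code from the benchmark repo):

  // FxHashMap is a type alias over std's HashMap that plugs in fxhash's faster,
  // non-DoS-resistant hasher, so the surrounding code doesn't have to change.
  use fxhash::FxHashMap;

  fn count_words<'a>(words: impl Iterator<Item = &'a str>) -> FxHashMap<&'a str, u64> {
      let mut counts = FxHashMap::default();
      for word in words {
          // entry() does a single hash + lookup per word
          *counts.entry(word).or_insert(0) += 1;
      }
      counts
  }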


These days I tend to use R for these type problems. It’s slow, but vectors as first-class data types simplify things a lot

  readLines(file) |> strsplit(" +") |> unlist() |> table() |> sort(decreasing = TRUE)


Got a bit better C++ version here which uses a couple libraries instead of std:: stuff - https://gist.github.com/jcelerier/74dfd473bccec8f1bd5d78be5a... ; boost, fmt and https://github.com/martinus/robin-hood-hashing

    $ g++ -I robin-hood-hashing/src/include -O2 -flto -std=c++20 -fno-exceptions -fno-unwind-tables -fno-asynchronous-unwind-tables -lfmt

    $ time ./a.out < kjvbible_x10.txt > /dev/null
    0,19s user 0,01s system 99% cpu 0,197 total
with the same build flags, optimize.cpp gives me

    0,22s user 0,01s system 99% cpu 0,233 total


C# really needs a GetOrAdd method on the regular Dictionary class (as exists on ConcurrentDictionary). This common pattern (from the optimized version) requires the hash function on "word" to be evaluated twice (once for TryGetValue and once for Add). Then, if there is a collision on the 32-bit hash, the equality check needs to be run twice as well. Not good, as all of these are O(N) operations (where N is the size of "word"). I suspect the optimized version would be materially faster if they implemented their own dictionary that includes this.

    if (!counts.TryGetValue(word, out var wordCountRef))
    {
        counts.Add(word, new Ref<int>(1));
    }


Doesn't the Unix Shell implementation break the "threading" constraint, in that each program in the pipe has its own main thread, and all the programs run concurrently?

Or should the "threading" constraint really have been stated as "you can use multiple threads, but only if they all have their own address space"?

Alternatively, given the multi-core capabilities of even budget/mobile/small systems these days (even the Raspberry Pi 4 has a quad-core CPU), isn't restricting the implementation to a single thread a weird artificial limitation in 2022? Also, a GPU-based implementation would be interesting, and GPUs get most of their power from their massively parallel core architecture.


If I were doing the benchmark, I would just run it on a VM with a single physical core. There are good reasons for comparing single-threaded performance, since it may not be necessary or desirable to scale that way. For example, suppose it's a wc API or something; limiting each request to a single thread can be better since you get fairer scheduling and less data sharing, which is almost always better for multiprocessing. Even shared immutable data can cause problems, since the CPU may not realize the data is actually immutable.

I agree GPUs would probably do well on this problem, though it depends on how expensive the memory copying to/from the GPU ends up being. However, if you are going for all-out performance, it seems like you could use SIMD, like memchr or something.
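
Purely as a sketch of the memchr idea (the memchr crate is an assumed dependency, and this hypothetical words() helper only splits on ASCII spaces, unlike the real tokenizers):

  // memchr_iter finds every b' ' with a SIMD-accelerated scan, so word
  // boundaries come out of a vectorized search instead of a byte-at-a-time loop.
  use memchr::memchr_iter;

  fn words<'a>(buf: &'a [u8]) -> impl Iterator<Item = &'a [u8]> + 'a {
      let mut start = 0;
      memchr_iter(b' ', buf)
          .chain(std::iter::once(buf.len())) // treat end-of-buffer as a final delimiter
          .filter_map(move |pos| {
              let word = &buf[start..pos];
              start = pos + 1;
              (!word.is_empty()).then_some(word) // skip runs of consecutive spaces
          })
  }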


> If I were doing the benchmark, I would just run it on a VM with a single physical core.

I think that's definitely an environment worth benchmarking on - but I don't think that it should be the only environment to benchmark on.

Also, I don't think it's a good reason to limit implementations to a single thread, even if that is your benchmark environment. It can be worth seeing how well an implementation that's capable of taking advantage of multiple cores/CPUs does when it's only given one core to work with.

It's probably worth optimising for and benchmarking different threading strategies too. Does your implementation create one thread per "work unit" and let the scheduler sort them all out, or does it create a thread pool with one thread per core (or maybe one per core, plus one) and assign work units to each thread until they're done? And how do those strategies work when they're only given a single core?

The single core case is definitely worth testing, but it seems odd to limit implementations to a single thread because of it. If you think you can go faster with a threaded implementation, you should be able to try that out.


> … worth seeing how well an implementation that's capable of taking advantage of multiple cores/CPUs, does when it's only given one core to work with.

Did something like that — programs written for multi-core forced onto one core, alongside programs not written for multi-core.

IIRC, that difference wasn't something anyone ever expressed interest in.

https://web.archive.org/web/20121231010227/http://benchmarks...


> When optimizing this, the first thing to do is compile with optimizations enabled (g++ -O2). I kind of like the fact that with Go you don’t have to worry about this – optimizations are always on.

I think it is more fair to say that in Go optimizations are always OFF.


The results indicate otherwise


No need to decode the C++ names: valgrind/callgrind/kcachegrind works just as well for C++ as it does for C.


What code was used for Common Lisp? I'm guessing a well written version would be in the top contenders for speed.



Right, this came up a year ago with a version that ranks among the top here: https://news.ycombinator.com/item?id=26465857

  (defmethod performance-count ((path-file string))
    (let ((map (make-hash-table :test 'equal)))
      (with-open-file (stream path-file :direction :input :if-does-not-exist nil)
        (when stream
          (loop for line = (read-line stream nil 'end)
                until (eq line 'end)
                do (let ((split (split-string #\space (string-downcase line))))
                     (dolist (word split)
                       (let ((index (gethash word map)))
                         (if index
                             (setf (gethash word map) (incf index))
                             (setf (gethash word map) 1))))))
          (let ((keys (sort (alexandria:hash-table-keys map)
                            (lambda (x y) (> (gethash x map) (gethash y map))))))
            (dolist (key keys)
              (format t "~A ~A~%" key (gethash key map))))))))


You don't need those extra newlines if you'd prefix every code line with two spaces, by the way. It makes it much more readable if you do that.


Nice! I didn't know that trick.


Which lisp compiler did you use?


I haven't run it, just tracked down the source because I was similarly curious about the slow runtime. If I do run it, it will be with SBCL. Looking at the repo, Hoyt did as well.


That code is slow because it's working with a really long list instead of a very fast hash table


I had a 1 minute look at the "optimized" C code - it's calling malloc for every word.


>I had a 1 minute look at the "optimized" C code - it's calling malloc for every word.

Neither simple.c nor optimized.c calls malloc "for every word". They only call malloc when inserting a previously-unseen word into the counting map.


Doesn't malloc return a pointer? Sounds unsafe


That's not the point. It's perfectly safe, and there's nothing else in C. The point is you don't have to call this function for every single word: it takes time, and a pointer increment doesn't.


Ooo... spooooky pointers... ancient prophecies speak of the Earth splitting in half and swallowing everyone who's ever thought about allocating memory.


Taking it in a slightly different direction – because in my work chunking is almost never necessary (too little text or too much memory), whereas handling Unicode characters and punctuation nearly always is – I would tend toward something like:

  from collections import Counter
  from re import finditer

  word_counts = Counter()
  with open(PATH, encoding = 'utf-8') as file:
      doc = file.read()
  word_counts.update(match.group().lower() for match in finditer(r'\w+', doc))
In Python, this could be a fairly performant way of taking unicode into account, because it doesnʼt use Pythonʼs for-loop, and regular expressions are rather optimized compared with writing low-level-style Python. (Maybe it should use casefold instead of lower.)


My language of choice, LiveCode, is a superset of HyperTalk. The concept of "words" has been built-in (not a library) since the '80s. Hash tables since the '90s.

A solution to every aspect of this problem but the line-by-line bit took just over a minute to write. Adding that took another minute.

The trade-off is that LC is slower than almost any of the languages listed, and optimization is almost impossible -- the idiomatic way is the fastest (almost). And if you don't like the way LC parses words, there is almost no way to change it without going into the source code (a big pain).


Hm, I tried comparing the simple.hs vs simple.py (with ghc 8.10.7 which I happened to have here, their stack.yaml uses 8.10.4) and it's 5.7s vs 1.9s, not quite as stark a difference. Maybe because I'm on Linux?


Their slow timings seem to be because they enable the threaded/multicore runtime without actually using it. A performance footgun I suppose.


q solution in kdb, runs in about 2.5s on my machine vs simple.py which takes 4.7s. Rough interpolation with the results table puts this on par with Rust/Go/OCaml or other compiled languages.

  \t desc count each group `$lower " " vs raze read0 `:kjvbible_x10.txt
It's also 1 line!

I am sure q pros or k purists can optimize this even more...

EDIT: Moving the lower earlier brings this down to 1.9s

  desc count each group `$ " " vs lower raze read0 `:kjvbible_x10.txt


Fun stuff! I ran a similar thing with a simple bioinformatics problem before (calculating the ratio of G and C against A+G+C+T), also with a whole bunch of contributors:

https://github.com/samuell/gccontent-benchmark#readme

Really hard - or impossible - to arrive at a definitive single number for one language, but the whole exercise is a lot of fun and quite informative IMO :)



This site makes Chrome (Android, current) stop responding.


Yeah, mine too. Ironic considering the subject...


This was always my approach when I had do “leetcode whiteboard interviews”.

Start with something not too far north of FizzBuzz but with a ton of scope to progressively enrich the problem until the clock runs out.

At $BIG_CO du jour people tend to talk about how much “signal” an interview produces, and starting somewhere that a meaningful number of candidates can’t do anything with offers very little “signal”, likewise very little Shannon information.

Great article!


I'm curious why Go is faster than Rust in the benchmarks. My best theory is that the program doesn't run long enough to trigger a GC, so Rust is paying the cost of freeing memory, but Go isn't. (Using an arena allocator in Rust would probably produce similar behavior).

Or maybe the difference is just noise? It's hard to tell without more details on how the benchmarks were run.
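
One cheap way to test the "paying to free memory" theory would be to leak the map right before exiting, so its destructor never runs and the OS reclaims everything at process exit. Hypothetical sketch only; `counts` stands in for the word-count map:

  use std::collections::HashMap;

  fn main() {
      let mut counts: HashMap<String, u64> = HashMap::new();
      counts.insert("hello".to_string(), 1); // ...real counting would go here...
      // ...print the sorted results, then:
      std::mem::forget(counts); // skip the map's Drop; nothing is freed by the program itself
  }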


Why not read the source code? :-)

I wrote comments explaining things: https://github.com/benhoyt/countwords/blob/8553c8f600c40a462...


Most of the time you see Rust lose on that kind of benchmark, it's because it isn't using buffered I/O from the standard library (buffered readers and writers are available and easy to use, but they're an option you have to think about and reach for).
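
A minimal sketch of what opting in looks like (not code from the benchmark repo): lock stdin/stdout once and wrap the writer in a BufWriter so output is batched instead of flushed line by line.

  use std::io::{self, BufRead, BufWriter, Write};

  fn main() -> io::Result<()> {
      let stdin = io::stdin();
      let stdout = io::stdout();
      // The plain Stdout handle is line-buffered; BufWriter batches many lines
      // into one underlying write.
      let mut out = BufWriter::new(stdout.lock());
      for line in stdin.lock().lines() {
          writeln!(out, "{}", line?.to_lowercase())?;
      }
      out.flush()?; // surface any final I/O error instead of relying on drop
      Ok(())
  }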


The naive implementation is slower mostly because of the DoS-resistant hasher in the Rust stdlib, and the smallish buffer a line represents.


It seems like this is just measuring a single pass at reading a file with about 100k words in it. I ran the benchmark and it finished almost instantly. So languages with startup overhead and warmup time (e.g. JIT, etc.) will totally underperform here. Makes sense, then, why Java is almost identical between optimised and non-optimised versions.


otoh the similar k-nucleotide program in the benchmarks game processes bigger files to amortize any startup overhead —

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

otoh instead of asserting "will totally underperform" please measure and share your measurements!

Sometimes "startup overhead" turns-out to be insignificant.

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...


As far as Java is concerned, using AOT or JIT caches is also an option; however, all these kinds of benchmarks tend to reveal a lack of knowledge of the Java ecosystem.

Same applies to .NET by the way.


> We usually think of I/O as expensive, but I/O isn’t the bottleneck here...The tokenization and hash table operations are the bottleneck by a long shot

Interesting. This is something I'll have to keep in mind.

Best practice has always been to consider I/O the slowest.

But, it's true... times have changed.

Maybe we shouldn't make this assumption anymore.


Here’s a couple of articles on the performance of IO and slowness of other stuff that I find particularly interesting:

https://itnext.io/modern-storage-is-plenty-fast-it-is-the-ap...

https://gregoryszorc.com/blog/2021/04/06/surprisingly-slow/


He didn't clear the disk cache when profiling, so of course the I/O overhead was negligible. Had he flushed the cache between runs, the numbers would have been very different.


Measurements of the similar k-nucleotide programs show that effect — N=250,000 compared to N=2,500,000

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...


Some of these efficiency gains are remarkable. Most of the code I write is just the quickest way to write a mundane function. I'm really curious what the energy savings would be if you went into a huge tech co and devoted ~10% of SWE time solely to optimizing the code base.


You can't walk into a huge tech co and make a dent because huge tech co already has hundreds of engineers on that job full time. Just look at the papers coming out of huge tech co's. They will publish if their technique saves them 0.2%. That's how tight things are already wired in the big leagues.

The real money is in huge non-tech co, or medium tech co. You could walk into a goldman sachs type of operation and point out savings in CPU time that would be 90-99%. But they're not going to care, because their corporate structures aren't wired to observe and act on such things.


If you make a dent on server bills on the order of several million dollars a year generally it's not too difficult to make a case that you should be allowed to continue.


For performance comparison between languages, this article is not very useful. Nothing against the author, this task is simply impossible :)

On the other hand, the article is a gold mine for learning performance analysis in different languages.


Interesting, it confirms what I've suspected for a while: that by using Python and not C++ for "general text mashing" tasks I'm not losing that much performance. Like, 2x, 5x, perhaps 10x. But not 100x.


> a simple interview problem

I've been asking different versions of that question since I started interviewing and it is a surprisingly good filter.

But with articles like this, now I need new questions..


Surprised to see Swift 4 times as slow as non-optimized Go.

Anyone have an explanation?

I know the Swift string type is really complex, but I always assumed it at least performed well...


Reading anything at all into the relative performance of these programs, written by different people and all doing radically different things, seems like a gigantic waste of time.

Some of these “simple” versions are custom-specifying allocators to minimize memory overhead, doing highly optimizable ASCII value manipulations, and doing byte-equality on strings rather than going through slower Unicode libraries. The “simple” Swift version does nothing at all to avoid generating many allocations/deallocations, and runs fully Unicode-aware algorithms to lowercase the input and perform Unicode-normalized string comparisons. It’s totally apples-to-oranges stuff.


I was wondering if this was indeed due to Unicode. I had hoped that the benchmark required languages to perform the same kinds of string checks; otherwise there's really no point.



You need to measure memory usage with it, take the product of the two (memory and speed), and thus get a very good number on performance for comparison.


Your Perl implementation is almost as fast as C++? Something weird is going on here. Maybe you are bottlenecked on disk read speeds or something.


Why is that surprising? The Perl interpreter is fast and dispatches just about everything to heavily optimized C, just like Python does, only better at it.

This is the kind of thing where perl really shines.


This is what Perl was made for, and it is good at it. I was going to be very suspicious if Python or Ruby ended up faster. On this chart, it's the fact that the simple C++ is so slow that is interesting.


Not really surprising. Perl approaches a domain specific language for text processing with a heavily-optimized 34 year old implementation.


Another one-minute look: multiplication on every byte... really? If this is supposed to be fast, get one big buffer for the words, and get another if you run out. The hash function can be the word itself for words up to length 8; longer words would need a mult per 8 bytes. You can do branchless strchr. Roll your own memmove: just always copy 24 bytes and terminate at the length you already know. The whole point of using C or C++ is that you can do non-idiomatic things when you have to.


> I tried running it using the PyPy optimizing compiler, but it’s significantly slower for some reason

I wonder why that is?


Nice 3rd place for Zig implementation. But is it missing `defer arena.deinit();`?


My takeaway is that unoptimized C++ is pretty close to the same performance as JavaScript.

Wild.



Yep, I don't have the energy tonight to inspect the source code, but how you write js matters a lot.

Little things like hoisting local variables out of the prototype and closure chains can make a massive difference.

I did some experiments with hot path js optimization several years ago and was pretty surprised at the performance you can squeeze out of the jitter.

Js is notorious for being difficult to do good perf analysis on because of the warming and hot path analysis. From my experiments it can sometimes take many seconds from startup for the jitter to kick in on a tight loop.


You can tell it will be BS just from the title treating C and C++ as if they were one language, not two very different languages.

Other comments here reveal how, exactly.


I remember the last time this came up and you bloviated about a bunch of nonsense[1], and ended up not being able to deliver. Talk about bullshit. At least your comment here is shorter.

[1]: https://news.ycombinator.com/item?id=26466672


This is the executive version of this test: "I gave this problem to my team and it took them two weeks to figure out a solution."


You included Awk in the title, but omitted Go, really?


Indeed. The original title is:

> Performance comparison: counting words in Python, Go, C++, C, AWK, Forth, and Rust.

Which means the submitter actively chose to remove Go from the title. Judging from their post history, they are aligned with the Rust community.


The original title is 2 characters over the HN title length limit. I suspect they dropped "Go" first, because it's the language with a 2-letter name, and then they editorialised it further to hint that there are more languages.


Keeping Awk over the vastly more popular Go hints at a more probable and frequent behaviour: bias.

Removing any of them would suffice.


The author maintains an AWK implementation:

https://github.com/benhoyt/goawk


Ben Hoyt isn't the same person as the HN submitter.


grep is indeed fast!


Our distributed NewSQL database project actually uses three of them, Go, Rust, and C++, for different layers. There are pros and cons for each of them, but it is still a challenge to maintain three languages in one system. Performance-wise, C++ and Rust are good; coding-wise, Go is the easiest, but Rust kind of enforces best practices.


I would not call it a performance comparison at all. When Python calls functions written in C, it is not Python's performance. Write those functions in plain Python and then see the results. Sure, for this basic example in the article it does not matter from a practical standpoint, but when you need to step away from canned cases, suddenly Python's performance sucks big time.


Can we move past the whole "it is not really python to use libraries written in C", especially when talking pure stdlib python?

Python is basically a DSL for C extensions. That is the whole point. It would be like criticizing any compiled language for essentially being a DSL for machine code, and not "Real Instructions".

Python's ability to interop with pre-built, optimized libraries with a lightweight interface is arguably its greatest selling point. Everyone and their dog knows purely interpreted CPython is slow. It doesn't need to be pointed out every single time python performance is brought up unless literally discussing optimizing the CPython vm.


I am not criticizing Python. It does well enough what it was made to do. It just makes no sense to call that particular example a "language performance comparison". It is anything but.


But you are still wrong. As mentioned, dicts are incredibly efficient data structures in Python (because they underpin everything), and the Counter class is pure Python. That's 100% pure Python. Saying dicts "don't count" because they are implemented in C would disqualify CPython entirely, as virtually everything under the hood is a PyObject struct pointer. It just so happens that "counting abstract objects" is the class of problem* CPython does basically all the time in normal VM execution anyway, so this task is easy and fast.

* looking up a reference of an attribute with a string key underpins the meat and potatoes of python's execution model


I have no dog in this fight, but I do want to point out that’s it’s not exactly true that Counter is implemented in pure Python:

* it’s a subclass of dict

* its update method (used by the code in the post) dispatches to a C implementation on its fast path


Yes, that's exactly what I said. dict.update is in C, because it's a core feature of the python vm. It's pure CPython. What do you think "pure python" is? There's no python hardware ISA (afaik). All cpython is manipulating data structures in C via Python VM opcodes. It just so happens that whatever opcodes that are dispatched in the course of solving this problem are quite efficient.

If you say "it does not count as Real Python if you dispatch to C", then you literally cannot execute any CPython vm opcodes, because it's all dispatching to C under the hood.


“Pure Python” commonly means implemented using only the Python language. Something written in pure Python ought to be portable across Python implementations. I was merely pointing out that this line

https://github.com/python/cpython/blob/4395ff1e6a18fb26c7a66...

isn’t exactly pure Python, because, under a different runtime (eg PyPy), the code would take a different path (the “pure Python” implementation of _count_elements[1] instead of the C implementation[2][3]). Yes, it's hard to draw exact lines when it comes to Python, especially as the language is so tied to its implementation. However, I think in this case it's relatively clear that the code that specific line is calling is an optimization in CPython, specifically intended to get around some of the VM overhead. Said optimization comes into play in the OP.

[1]: https://github.com/python/cpython/blob/4395ff1e6a18fb26c7a66...

[2]: https://github.com/python/cpython/blob/4395ff1e6a18fb26c7a66...

[3]: https://github.com/python/cpython/blob/4395ff1e6a18fb26c7a66...


> It doesn't need to be pointed out …

Apparently there will be someone who feels the need to point it out; and someone who feels the need to point out that it doesn't need to be pointed out; and …


Python's `collections.Counter` is written in Python and is a subclass of the builtin `dict` type. I don't think it's comparable to something like using `pandas` to solve the problem.

https://github.com/python/cpython/blob/main/Lib/collections/...


So we should disqualify the C and C++ impls because some libc functions are implemented using ASM, right?


Are you implying that Node.js is pure ECMAScript?


If it looks like python and quacks like python...



