Re: the Rust optimized implementation, I was able to get ~20-25% better performance by rewriting the for loops as iterators, using Rust's byte range pattern matching, and a buffered writer, which seems crazy, but it's true. I chalked it up to some crazy ILP/SIMD tricks the compiler is doing.
I even submitted a PR[0], but Ben decided he was tired of maintaining the project and archived it (which, fair enough!).
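To give a flavor of what I mean, here's a rough sketch (not the actual PR, and it ignores the article's chunked-reading rule for brevity): byte range patterns instead of if/else chains, an iterator chain instead of an index loop, and a BufWriter around stdout instead of unbuffered writes.

    use std::collections::HashMap;
    use std::io::{self, BufWriter, Read, Write};

    // Byte range pattern instead of an if/else chain.
    fn ascii_lower(b: u8) -> u8 {
        match b {
            b'A'..=b'Z' => b + 32,
            _ => b,
        }
    }

    fn main() -> io::Result<()> {
        let mut input = Vec::new();
        io::stdin().read_to_end(&mut input)?;

        let mut counts: HashMap<Vec<u8>, u64> = HashMap::new();
        input
            .split(|b| b.is_ascii_whitespace())
            .filter(|w| !w.is_empty())
            .for_each(|w| {
                let word: Vec<u8> = w.iter().map(|&b| ascii_lower(b)).collect();
                *counts.entry(word).or_insert(0) += 1;
            });

        // Buffered output: one flush instead of a write per line.
        let stdout = io::stdout();
        let mut out = BufWriter::new(stdout.lock());
        for (word, n) in &counts {
            writeln!(out, "{} {}", String::from_utf8_lossy(word), n)?;
        }
        Ok(())
    }

(Sorting the output by count is left out here, since the point is just the shape of the code.)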
In case anyone is interested, I did an optimized, but much simpler, Rust implementation just today[0], which is faster than the optimized implementation on my machine. No indexing into arrays of bytes, no unsafe, no "code golf" measures, etc. Of course, credit where it's due -- these are simply tweaks to Andrew Gallant's code.
Looks like idiomatic Rust, which I think is interesting. Shows there is more than one way to skin a cat.
> Out of curiosity, what's the difference in runtime when compared to Andrew's optimized version on your machine?
Results on an M1: my "idiomatic" version is 1.32 times faster than Andrew's original optimized version, whereas the optimized C version is 1.13 times faster than my "idiomatic" version. So, all things being equal, that'd make Rust 3rd, just ahead of C++, and behind Zig and C, if I'm reading the results correctly.
More important, to me, would be the other thing -- it's readable, idiomatic Rust. That is, if it had been 5% slower, I think we'd probably all prefer to maintain this code.
I'm using the kjvbible_x10.txt corpus and using hyperfine on the Mac.
I just took a look on Linux, and my code that is now running 1.43x faster on the Mac is only 1.02x faster on Ubuntu 22.04. But, again, the point is really that it's a Rust commercial -- by leaning on the stdlib, it's possible to get straightforward but still really fast code.
I have no doubt you could make it faster than me and have!
These changes would prompt some additional remarks about the Rust implementation: most optimisations make code more verbose and less idiomatic, but in Rust it can sometimes be the opposite.
From my measurements, your optimized version runs at about the same speed as mine:
    $ cd /tmp/
    $ git clone -b ag/test-kimono https://github.com/BurntSushi/countwords
    $ cd countwords/rust/
    $ ./bench
    <snip>
    Summary
      './optimized-trie/target/release/countwords < kjvbible_x10.txt' ran
        1.22 ± 0.02 times faster than './kimono-simple/target/release/countwords < kjvbible_x10.txt'
        1.26 ± 0.02 times faster than './optimized-customhashmap/target/release/countwords < kjvbible_x10.txt'
        1.56 ± 0.03 times faster than './optimized-unsafe/target/release/countwords < kjvbible_x10.txt'
        1.58 ± 0.03 times faster than './kimono-optimized/target/release/countwords < kjvbible_x10.txt'
        1.60 ± 0.03 times faster than './optimized/target/release/countwords < kjvbible_x10.txt'
        3.97 ± 0.06 times faster than './simple/target/release/countwords < kjvbible_x10.txt'
        7.58 ± 0.21 times faster than './bonus/target/release/countwords < kjvbible_x10.txt'
You mentioned in another comment that you were benchmarking on an M1. Maybe there's some interesting differences there in the codegen, how the CPU executes or both.
Your 'fast-simple' version is a different story though. Personally, I would not put that in the "simple" classification as outlined by the OP. It is IMO very much not the first program someone would write for this. Bringing in 'hashbrown' and futzing with the hash table lookup/insert is definitely a perf tweak that you probably wouldn't want to do unless you had to, because it makes the code more complex. The 'Box<[u8]>' cements it as an 'optimized' variant.
Now it does look like a simpler variant of the "optimized" Rust program I wrote. I'm pretty impressed. I don't think I ever would have broken out of my local optima to discover that program. In particular, your approach appears to do 3 (and some change) passes over each buffer: 1) read up to '\n' (just the first line), 2) UTF-8 validation, 3) make lowercase and 4) split on whitespace. I realize chunking all of that up confers a lot of benefits, but still, 4 passes and it still being faster than 1 pass is very interesting.
I've glanced at the profiles of each program but haven't been able to solidly conclude anything more precisely about where exactly the benefit is. At this point, my next step would be to slowly translate my version into yours, and benchmark each step of the way until I could isolate the key change (assuming it is one change). But alas, I have run out of time this evening. :-)
Kudos!
(The other interesting thing to note here is that my 'trie' variant is now the fastest Rust submission. Previously, it was slower than the optimized variants on my older CPU. That's pretty neat.)
> In particular, your approach appears to do 3 (and some change) passes over each buffer: 1) read up to '\n' (just the first line), 2) UTF-8 validation, 3) make lowercase and 4) split on whitespace.
I really should have waited until I had more time to respond, but it really only should be 2, 3, and 4.
I tried unsafe/unchecked for UTF8, and, yes, it is a modest bump, but I wanted to do it without unsafe. And 3 and 4 are really pretty fast for what they are. They both work on bytes and the str as_bytes transmute is virtually cost free from what I can tell.
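To be concrete about that trade-off, a sketch (not my actual code):

    // from_utf8 costs a validation pass over the buffer; from_utf8_unchecked
    // skips it but is unsafe and relies on the caller guaranteeing valid
    // UTF-8. After that, str::as_bytes is effectively free: it's just a
    // different view of the same bytes.
    fn chunk_to_str(buf: &[u8]) -> &str {
        std::str::from_utf8(buf).expect("chunk must be valid UTF-8")
        // The unchecked alternative I decided against:
        // unsafe { std::str::from_utf8_unchecked(buf) }
    }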
> It is IMO very much not the first program someone would write for this.
True.
> Bringing in 'hashbrown' and futzing with the hash table lookup/insert is definitely a perf tweak that you probably wouldn't want to do unless you had to, because it makes the code more complex. The 'Box<[u8]>' cements it as an 'optimized' variant.
I would say both of these items were added much later, and they contribute substantially less to the bottom-line performance than you might think. The bulk of the performance comes from elsewhere.
The "why" behind the trie is explained in the comments in the code. :-) The core of my hypothesis was to avoid hashing and hash table lookups at all.
The problem with the trie, though, is that it does a memory access per byte.
Whether and when this trade is beneficial is not totally clear to me, but clearly, it can vary.
Whether my hypothesis is actually correct is also not something I'm certain of. Verifying this would take some time with: 1) looking at the codegen and 2) some use of `perf` to extract CPU counters for things like branch and cache misses, and see if a correlation can be established.
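To make the shape of it concrete, here is a minimal sketch of the idea (not the actual trie submission):

    // Nodes live in a flat Vec; children[node][byte] is the index of the
    // child node, or 0 for "no child" (index 0 is the root, which is never
    // anyone's child). Counting a word does no hashing and no probing, just
    // one indexed load per byte -- which is exactly the trade noted above.
    struct Trie {
        children: Vec<[u32; 256]>,
        counts: Vec<u64>,
    }

    impl Trie {
        fn new() -> Trie {
            Trie { children: vec![[0u32; 256]], counts: vec![0] }
        }

        fn add(&mut self, word: &[u8]) {
            let mut node = 0usize;
            for &b in word {
                let next = self.children[node][b as usize];
                node = if next != 0 {
                    next as usize
                } else {
                    // Allocate a fresh node and link it in.
                    let id = self.children.len() as u32;
                    self.children.push([0u32; 256]);
                    self.counts.push(0);
                    self.children[node][b as usize] = id;
                    id as usize
                };
            }
            self.counts[node] += 1;
        }
    }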
For those curious about Rust optimization, the “optimized” version makes three changes compared to the “idiomatic” version:
* It uses byte strings instead of UTF-8 strings. In my opinion, that’s not an optimization, that’s changing the problem. Depending on the question you’re asking, only one of the two can be correct.
* It uses a faster hash algorithm. It’s not the first time this came up in a benchmark article. Rust’s decision to use a DOS-safe hash by default (and not provide a fast algorithm in the std, like other languages do) really seems to hurt it in that kind of microbenchmark.
* It uses get_mut+insert instead of the more convenient HashMap::entry method, because the latter would require redundantly allocating the key even in the repeat case. I’ve hit this problem in the past as well. Maybe the upcoming HashMap::raw_entry_mut will make this kind of optimization cleaner.
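For reference, the pattern being described looks roughly like this (a sketch based on the description, not the actual submission):

    use std::collections::HashMap;

    // entry() would need an owned String key up front for every word; looking
    // up first means the allocation only happens the first time a word is
    // seen.
    fn add_word(counts: &mut HashMap<String, u64>, word: &str) {
        if let Some(n) = counts.get_mut(word) {
            *n += 1;
        } else {
            counts.insert(word.to_owned(), 1);
        }
    }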
The problem specified declares the words we're counting are ASCII:
> ASCII: it’s okay to only support ASCII for the whitespace handling and lowercase operation
UTF-8 (quite deliberately) is a superset of ASCII. So a UTF-8 solution is correct for ASCII, but a bytes-as-ASCII solution works fine in Rust if you only need ASCII.
This is why Rust provides ASCII variants of a lot of functions on strings, and the same functions are available on byte slices [u8] where ASCII could be what you have (whereas their Unicode cousins are not available on byte slices).
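For example (a small illustration, not from any of the submissions):

    fn main() {
        let mut line = b"The QUICK brown Fox".to_vec();
        // The ASCII-specific helpers exist on [u8] as well as on str.
        line.make_ascii_lowercase();
        for word in line.split(|b| b.is_ascii_whitespace()).filter(|w| !w.is_empty()) {
            println!("{}", String::from_utf8_lossy(word));
        }
    }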
>> ASCII: it’s okay to only support ASCII for the whitespace handling and lowercase operation
That's a somewhat specific list -- at least I didn't read that as a general "the program can assume that the input is only ASCII".
But then, the author seems to have accepted solutions that crash on non-UTF8 sequences and ones that byte-compare them, so probably either behavior was meant to be fine. I just don't get that from this rule.
> That's a somewhat specific list -- at least I didn't read that as a general "the program can assume that the input is only ASCII".
I don't think it assumes the input is only ASCII. If the problem is "given UTF-8 text, split on ASCII whitespace and convert ASCII uppercase letters to lowercase," you can do that correctly (and produce correct UTF-8 output) without really being UTF-8 aware. For why, see here: https://en.wikipedia.org/wiki/UTF-8#Encoding
> But then, the author seems to have accepted solutions that crash on non-UTF8 sequences and ones that byte-compare them, so probably either behavior was meant to be fine. I just don't get that from this rule.
That's a separate concern right? The rules are only about the behavior when the program is given UTF-8 input.
The issue is that this is a weird requirement. I have seen real world data sets that had all manner of exotic whitespace like nonbreaking spaces and vertical tabs peppered throughout, so I am sympathetic. But this situation isn't common.
That said, for a CLI program like this, approximate results are usually good enough anyway. And realistically, the vast majority of text uses ASCII whitespace almost exclusively.
It is a context-specific requirement, but "fully general lowercasing" would be an impossible requirement.
For e.g. Japanese text, I think you'd only have to add 1 or 2 characters to the set of whitespace characters. You also have to solve Japanese text segmentation, which is hard-to-impossible. If you want to canonicalize the words by transforming half-width katakana to full-width, transforming full-width romaji to ascii, etc., that's a lot of work, and which of those transformations are desired will be specific to the actual use of the program.

If you want to canonicalize the text such that the same word written using kanji or using only hiragana end up in the same bucket, or that words that are written the same way in hiragana but written differently when using kanji end up in different buckets, or that names that are written the same way in kanji but written differently in hiragana end up in different buckets, or that loanwords incorrectly written using hiragana are bucketed with the katakana loanword, or that words written using katakana for emphasis are bucketed with the hiragana word (but katakana loanwords are not converted to hiragana and bucketed with the non-loanword that is made up of the same moras), well, that all sounds even more challenging than the hard-to-impossible problem you already had to solve to decide where words begin and end :)
Edit: One of the first concerns I mentioned, about full width romaji and half width katakana, and additionally concerns about diacritics, can be addressed using unicode normalization, so these things are pretty easy[0]. An issue you may still face after normalizing is that you may receive inputs that have incorrectly substituted tsu ツ for sokuon ッ (these are pronounced differently), because for example Japanese banking software commonly transmits people's names using a set of characters that does not include sokuon.
My point is that this is not just a hard problem but many different, incompatible problems, many of which are hard, and because of the incompatibilities you have to pick one and give up on the others. An English-speaking end user may not want their wordcount to perform full width romaji -> ascii conversion.
Great write up! However I should've clarified, I wasn't talking about word segmentation in general, only about expanding the universe of valid "whitespace" grapheme clusters (or even just codepoints) to include the various uncommon, exotic, and non-Western items.
> It uses byte strings instead of UTF-8 strings. In my opinion, that’s not an optimization, that’s changing the problem. Depending on the question you’re asking, only one of the two can be correct.
Sure, but is it changing the problem to something easier than what the other languages are already doing, or to something more similar? I'd imagine the C code is basically just using byte arrays as well, for instance.
> * It uses byte strings instead of UTF-8 strings. In my opinion, that’s not an optimization, that’s changing the problem. Depending on the question you’re asking, only one of the two can be correct.
To be fair, that's what the C version does as well.
> It uses byte strings instead of UTF-8 strings. In my opinion, that’s not an optimization, that’s changing the problem. Depending on the question you’re asking, only one of the two can be correct.
No, that's very wrong. ripgrep has rich Unicode support for example, but represents file contents as byte strings. UTF-8 strings vs byte strings is an implementation detail.
IMO, Ben kind of glossed over the bonus submission. But I personally think it was the biggest point in favor of Rust for a real world version of this task.
> It uses byte strings instead of UTF-8 strings. In my opinion, that’s not an optimization, that’s changing the problem. Depending on the question you’re asking, only one of the two can be correct.
Ad UTF-8: it would change the problem only if the definition of word breaks changed. E.g., if a word break is defined by whitespace, then it probably doesn't.
This never happens. In UTF-8, bit 7 is set in all valid multibyte representations; ASCII characters cannot appear in them. Now, you can have an ASCII character in the middle of a multibyte sequence, but this isn't valid UTF-8 (and should be treated as an error by the decoder). "Words" generated by a splitter would also contain invalid UTF-8 if they were split on a space coming in between bytes of a multibyte sequence, but these can be detected. Word splitting will not generate invalid words from valid input, or vice versa; and if you know you have valid input, splitting on space, \t, \n, \r, etc. is perfectly cromulent in UTF-8.
Now if you want to also split on U+200B (ZERO WIDTH SPACE), U+202F (NARROW NO-BREAK SPACE), etc... hoo boy.
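A quick way to convince yourself of the high-bit property (illustrative only):

    fn main() {
        // Every byte of a multi-byte UTF-8 sequence has the high bit (0x80)
        // set, so an ASCII space (0x20) can never appear inside one, and
        // splitting the raw bytes on ASCII whitespace never cuts a character
        // in half.
        let s = "naïve café 日本語";
        for chunk in s.as_bytes().split(|&b| b == b' ') {
            // Each piece is still valid UTF-8; this unwrap cannot fail here.
            println!("{}", std::str::from_utf8(chunk).unwrap());
        }
    }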
If UTF-8 allowed that, then a solution would split words wrong according to UTF-8 rules, but it still would not matter, since the problem statement says that "it's okay to only support ASCII for the whitespace handling and lowercase operation".
So if your solution gets word splitting (or lowercasing) wrong for non-ASCII input, it still gets a pass according to my reading.
This was down-voted, which isn't really the best way to handle things that are wrong.
Your parent was wondering about a hypothetical UTF-8 sequence [all sequences in this post are hexadecimal bytes] XX 20 XX in which the ASCII space character encoded 20 is actually somehow part of a UTF-8 character. That's not a thing. UTF-8 is deliberately designed so that nothing like this can happen, along with several other clever properties; those properties are why UTF-8 took over the world.
Overlong sequences are like E0 80 A0 which naively looks like it's a UTF-8 encoding of U+0020 the space character from ASCII but it's not, because U+0020 is encoded as just 20. These encodings are called "overlong" because they're unnecessarily long, they're transporting a bunch of leading zero bits for the Unicode code point, the UTF-8 design says to reject these.
[ If you are writing a decoder you have two sane choices. If your decoder can fail, report an error, etc. you should do that. If it's not allowed to fail (or perhaps there's a flag telling you to press on anyway) each such error should produce the Unicode code point U+FFFD. U+FFFD ("The Replacement Character") is: Visibly obvious (often a white question mark on a black diamond); Not an ASCII character, not a letter, number, punctuation, white space, a magic escape character that might have meaning to some older system, filesystem separator, placeholder or wildcard, or any such thing that could be a problem. Carry on decoding the rest of the supposed UTF-8 after emitting U+FFFD. ]
I would expect a wordcount to pass these through as-is, along with any other invalid sequences of bytes with the high bits set. Performing unexpected conversions just seems like a way to make my future self have a bad time debugging.
I love using the UNIX shell (pipes and programs that do "one thing well") to the point where it hurts my ability to improve at other languages. But sometimes all you need is a throwaway command to do something once, not a beautifully optimized function that will be written, improved, compiled, profiled, reoptimized and then executed a billion times. The utter terseness of a bash one-liner just tickles me.
I'm with you with both shell pipelines and Perl "one-liners". The latter was a wonderful step up from the former back in the 1990s: a modern programming language designed around paving the cowpaths of shell pipelines. The power of both has always been inspiring. So many useful, one-off tasks can be accomplished with a single line in a terminal. So much potential power.
But my problem with both has always been that I needed them often, but not quite often enough to remember them without looking things up (again). Over the years, as my memory of these commands has strengthened, I've needed them less often. We don't use computers the same way we did back then. It's like a spaced repetition system where the spacing is set to guarantee forgetting.
How about extending this to comparing programming language efficiency from a developer's standpoint, looking at bytes of code, versus the execution time?
I took the source code file size from the repository, and the runtime from the blog post.
Then I made an arbitrary overall "PAIN SCORE" (lower is better) by multiplying code size * runtime. I suggest this is a worthwhile metric simply because lower is better on both axes, but of course, in the "real world" there will be different economic costs to CPU time and developer time depending on the use case. Here's the sorted results, from least "pain" to most:
    LANGUAGE     FILENAME          CODE SIZE    RUNTIME    PAIN SCORE (LOWER IS BETTER)
    Shell        optimized.sh        75 bytes     1.83 s     137.25
    Crystal      simple.cr          240 bytes     1.29 s     309.6
    Nim          simple.nim         424 bytes     0.77 s     326.48
    Python       simple.py          208 bytes     2.21 s     459.68
    Ruby         simple.rb          175 bytes     3.17 s     554.75
    Go           optimized.go      1514 bytes     0.40 s     605.6
    Python       optimized.py       464 bytes     1.33 s     617.12
    Zig          optimized.zig     2688 bytes     0.24 s     645.12
    Zig          simple.zig        1394 bytes     0.55 s     766.7
    Go           simple.go          688 bytes     1.12 s     770.56
    Nim          optimized.nim     1683 bytes     0.49 s     824.67
    Shell        simple.sh           60 bytes    14.81 s     888.6
    Ruby         optimized.rb       401 bytes     2.47 s     990.47
    JavaScript   simple.js          532 bytes     1.88 s    1000.16
    C            optimized.c       4360 bytes     0.23 s    1002.80
    Rust         optimized.rs      3065 bytes     0.43 s    1317.95
    Swift        simple.swift       317 bytes     4.23 s    1340.91
    JavaScript   optimized.js      1501 bytes     1.10 s    1651.1
    C            simple.c          2735 bytes     0.96 s    2625.6
    Rust         simple.rs         2239 bytes     1.38 s    3089.82
Sorting only by code size, the most concise implementations are: Shell, Ruby, Python, Crystal. Nobody was aiming to play code golf (i.e. minimize source code size), so these are fairly straightforward, idiomatic, readable implementations.
It would probably be prudent to strip comments before measuring the code sizes:
I did not check all files, but noticed that the simple rust code consists of about 60% comments (in terms of bytes), mostly due to a verbose blurb. It also spends about 15% of the non-comment bytes on printing nicer error messages, which is an interesting choice for a micro-benchmark. The simple C and FORTH versions similarly have a lot of comments.
Meanwhile a lot of the other files have very few or no comments.
I wrote the Rust version and the comments are a reflection of my style and my belief that benchmarks should come with some kind of analysis. The "simple" variant was also intended to demonstrate idiomatic code without a bunch of perf tricks. Nicer error messages in that case is fair game.
Indeed, a naive analysis based on code size without taking comments into account is pretty obviously wrong.
Programming language efficiency from a developer's standpoint is not measured in bytes written in the source code, because typing characters is never the bottleneck when writing code.
Even if it were, most code isn't "fire and forget", and so for most code the cost is dominated by maintenance cost, which has more to do with reading than writing code.
Languages that are concise are so because they either express information very densely, or express less information (via for example ignoring error handling or making it implicit with exceptions). In practice I find that such languages are much harder to maintain, because the missing information has to be rebuilt by the reader.
Extra pain score should be awarded for 2 character abbreviations, one character flags, mode switches, special character operators, difficult documentation, lack of error handling, missing type safety, difficulty of introspecting/debugging halfway through program, environment dependencies. Most of these weigh a lot more than just line count.
It'd be interesting to write code in a concise, easy to learn/grok language and then transpile to a faster systems language and recompute the PAIN score. I spent a few months last year pursuing that approach (py2many on github).
We'd need a stdlib that works across languages for the approach to be successful. nimpylib and similar libraries in other languages are a step in that direction.
Alternatively, the python stdlib itself could be rewritten in python and transpiled to other languages. Perhaps it'd help alternative python implementations in terms of compatibility.
Isn’t it unfair to not count the runtime code size too for interpreted languages? Because for the compiled ones, they include all the necessary runtime in the executable.
I think another interesting "pain score" is whether the code is written with built-ins from the language or the standard library (i.e. not new data structures or functions).
This is a rather meaningless comparison, since the differences are going to be dominated by:
1) The choice of libraries/datatypes used for strings and word->count map
2) How the source file is split into words - probably library function again, although in C/C++ one could choose to implement a super-optimized low level version that would blow the others away
IMO a performance comparison between languages is only meaningful if the runtime is not dominated by library functions, or if one admits it's really a standard library performance comparison, not a language comparison.
I understand what you're saying. However if you look at it not from the point of the performance figures per se, but rather which language, when written idiomatically, can solve a realistic problem performantly, then I think it makes more sense.
True - there's still value in these types of comparison, although the click-baity "C++ vs Python" title rather detracts from that.
I'm not sure how much value there is in this specific benchmark, though, since in the real world you'd be using a utility to do this (e.g. Linux wc - word count), and it's not obvious what's dominating the runtime (I'd guess reading the file and/or splitting into words), so it's not clear what your takeaway should be (idiomatic X is good for what, exactly?) if you want to extrapolate this to some other task.
For that matter, in the real world people don't really choose language on a task specific basis... It's either use language X because the project demands it, or write some limited use utility in a scripting language if performance isn't a major concern.
Agreed. It’s a common desire to boil benchmarks down to a single number.
Some of those languages, however, are far more complex than the others. And like any complex tool, they require a certain methodology. If you are to use heavy machinery like C++, forget iostreams and character-level processing on your hot paths.
What is there to benchmark between C++, C, Rust, or any other low level compiled language though? With any one of them you can essentially achieve the same assembly code generation, if you try hard enough.
In the end it either boils down to compiler optimisations or library implementations.
I think the interesting question there (and really any language comparison) is to compare the result of idiomatic implementations. Unfortunately, there is no clear answer on what is idiomatic, even with something like Python that is more opinionated than most on the question.
I think the best way forward on the idiom question is to accept that we can't have a competitive/adversarial benchmark suite where idiom is a factor. I think you really need one author, or maybe a group of colleagues who trust each other, to write the whole suite and make their own choices about what idiom means to them. Folks who disagree with those choices can either write an article about how much performance you gain from making what changes, or just produce their own whole suite. Having a very high level comparison like "this is how languages stack up with these idiomatic choices, but this is a different chart with different choices" would be interesting, even though it would take more work to interpret it.
With C I rarely use libs outside of stdlib and when I do I pay attention to performance more than convenience.
Even if it's dominated by libs and data types, I think it has value, as it might reflect what actually happens out there in the wild rather than a purely academic exercise.
Recently there was a contest to optimize the performance of a similar program[0] and a Zoom discussion of the optimizations[1].
The programs are not comparable in the following ways:
- Case: TFA requires (at least) ASCII lowercasing but the contest problem required no lowercasing.
- Ordering: TFA does not require sorting, but the contest problem required sorting.
- Memory: TFA imposes a requirement phrased as "don't read the whole file into memory" and this sounds like it's a resource-saving constraint, but it's usually a constraint that requires the program to spend additional resources. You could just mmap the file and store pointers into the mapped region. It costs extra to do copies instead of no copies.
- Text: TFA is unclear on what assumptions may be made about the lengths of words. For the contest problem, the Hungarian wikipedia input's longest word is around 80k.
- Safe, Hashing, Stdlib: TFA imposes some restrictions on what constructs may be used that are not imposed in the contest problem.
For the contest version of this problem, it seems like you can tokenize, hash, and count strings at around 1GB/s. Adapting a solution to solve TFA's problem (but not to conform to its Safe/Hashing/Stdlib requirements) would probably not carry too large of a penalty, since it's like 3 instructions to ASCII-lowercase 32 bytes and 1 string copy per unique string should take negligible time compared to the hash table lookups. So there is some room for the optimized solutions to go a little faster, if more optimizations are permitted.
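The lowercasing really is that cheap because it's branch-free bit math that vectorizes well; a scalar sketch of the trick (in Rust here, purely illustrative):

    // Bytes in 'A'..='Z' get bit 0x20 set; everything else passes through
    // untouched. The compiler can vectorize this loop, which is roughly where
    // the handful-of-instructions-per-32-bytes figure comes from.
    fn ascii_lowercase_in_place(buf: &mut [u8]) {
        for b in buf {
            *b |= ((b.wrapping_sub(b'A') < 26) as u8) << 5;
        }
    }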
Funny how you can take the "optimized" C++ and make it significantly faster (vs. gnu/linux standard libraries) with little effort. A lookup table for to_lower, instead of bit math, helps slightly. Building with the LLVM-libc string functions helps a lot, because the GNU std::string_view::operator== is calling an AVX-optmized memcmp via the PLT, which is a terrible strategy for short strings. LLVM-libc has an inlined bcmp (note: bcmp can be significantly faster than memcmp, but on GNU they are aliases) that is resolved for the target platform at build time instead of run time.
Edit:
Even if you ignore LLVM-libc, just slapping this into the optimized.cpp and replacing the critical `==` with `our_bcmp` makes it 10% faster. IFUNC calls to micro-optimized SIMD functions are counterproductive. It is far, far better that the compiler can see all the code at build time.
    int our_bcmp (const char* a, const char* b, size_t sz) {
        for (size_t i = 0; i < sz; i++) {
            if (a[i] != b[i]) return 1;
        }
        return 0;
    }
EDIT: I think the use of the term "allocation" might be overloaded here from the original article; after thinking about this a bit more, I think the author is referring to a different way of accessing the map. If you look in godbolt https://godbolt.org/z/Yf64rodW6 you can see that the pointer version uses
CALL runtime.mapaccess2_fast64(SB)
whereas the 'raw' version uses
CALL runtime.mapassign_fast64(SB)
When reading the Go solution this bit stood out to me
> To reduce the allocations, we’ll use a map[string]*int instead of map[string]int so we only have to allocate once per unique word, instead of for every increment
I just tried benchmarking this with a simple setup and I get zero allocations for both approaches, although the "pointer" approach is slightly faster
    package main

    import (
        "testing"
    )

    func BenchmarkIncrementMapRawInt(b *testing.B) {
        var data = make(map[int]int)
        b.ResetTimer()
        for i := 0; i < b.N; i++ {
            key := i % 1000
            data[key]++
        }
    }

    func BenchmarkIncrementMapPointerInt(b *testing.B) {
        var data = make(map[int]*int)
        b.ResetTimer()
        for i := 0; i < b.N; i++ {
            key := i % 1000
            increment(data, key)
        }
    }

    func increment(counts map[int]*int, value int) {
        if p, ok := counts[value]; ok {
            *p++
            return
        }
        n := 1
        counts[value] = &n
    }

    $ go test -bench=. -benchmem .
    goos: darwin
    goarch: arm64
    BenchmarkIncrementMapRawInt-8        112288002    10.57 ns/op    0 B/op    0 allocs/op
    BenchmarkIncrementMapPointerInt-8    139728302     8.586 ns/op   0 B/op    0 allocs/op
I like the OCaml version. It's both very easy to read and comes out really well in the benchmark.
I also looked at the "simple" Zig version, which came out really well in the benchmark, and to me it didn't look simple at all. It seems you need to make a lot of low level details explicit.
Agreed! AWK is still the king of this stuff. For tasks like this, I kinda think AWK is nigh-unbeatable: so simple to write, so obvious what's going on (even if you've never seen any AWK program before, you're probably going to be able to figure out what's going on there), and decently performant.
Are you looking at the "simple" or the "optimized" versions? For the optimized, yes, the Go one is very similar to the C. For the simple, idiomatic version, the Go version [1] is much simpler than the C one [2]: 40 very straight-forward LoC vs 93 rather more complex ones including pointer arithmetic, tricky manual memory management, and so on.
TFA only talks about gawk, though, making this comparison meaningless; i.e. you can't benchmark languages, only language implementations. The same goes for other programming languages, of course, unless they have only a single implementation.
mawk is an order of magnitude faster than gawk, and gawk isn't even the default on many Linuxen.
> TFA only talks about gawk, though, making this comparison meaningless
Not exactly! To quote from TFA (I'm the author):
> Another “optimization” is to run it using mawk, a faster AWK interpreter than gawk. In this case it’s about 1.7 times as fast as gawk -b. I’m using mawk in the benchmarks for the optimized version.
Wow. Swift, touted as "safe by design and (...) runs lightning-fast"[1] is more of a screw-up than I thought. Almost twice as slow as Lua and behind even Pascal and Forth.
It's a single implementation, I wouldn't put too much on it. I can definitely write a C++ implementation that's slower and less optimised than most of those results, particularly if I was using Boost.
iPhone and Mac apps run pretty well in my experience, and Swift is definitely faster (in general) than Python at least. There were some serious considerations to port over ML libraries to Swift due to its ease of use, similar to Python, while providing much better execution speed.
> There were some serious considerations to port over ML libraries to Swift due to its ease of use, similar to Python, while providing much better execution speed.
Google abandoned that Swift work (if that's what you're referring to)
Sure, but the fact that it was a project at all does imply Swift had a not-insignificant speedup vs Python; meanwhile, in these benchmarks Python is about 2x the speed of Swift.
I only skimmed the article, but I suspect they are including the time necessary to spin up the executable and any runtime environment prior to executing the relevant code.
"Lightning fast" can mean a lot of things, and it might not mean "executables start quickly."
If that is the case, it would explain why python is so slow. In my experience the runtime speed of python is good enough for most purposes, but startup time is definitely slow.
I don't know that I'd characterise it as a "screw up" based on one random person's solution to one particular problem.
In particular the "simple" solutions, which is all that's offered for Swift, will be whatever was most idiomatic/ obvious to a programmer of potentially quite variable quality.
It's unfortunate, I think, that there are "simple" solutions for some languages which have had performance revisions. If you needed an hour, or a friend, or even a community to "hint" at how to improve it, then that wasn't the simple solution.
[ For example I feel like the Rust might be faster asking for ASCII lowercase, and using the unstable (not order-preserving) sort algorithm, but even if I'm correct such a change would logically be made to the optimised Rust, not the simple ]
Swift is in the same or better performance bracket than C#/Java, and unsafe usage can be close to the performance of C (maybe 1.3-2x slower). I'm not sure about ranting over a single poor test result :P
They have been improving this steadily (I would argue it is rarely an issue nowadays, and there have always been simple workarounds), but there’s no reason a solution to the problem mentioned in the article should need to use ARC, and should use structs/protocols instead (which don’t use reference counting). I’ve seen a lot of Benchmarks that rate Swift poorly because of this.
FreePascal maintains its own optimizer. It is one of the few languages that still has its own optimizer rather than using LLVM, although they would probably be better off using LLVM. They introduce a new optimization every month, and I usually notice the new optimization when my code starts crashing with the nightly compiler.
But they have a bunch of reasons why they cannot use LLVM like "LLVM will almost certainly never support all targets that FPC supports (Gameboy Advance, OS/2, WinCE, ...), or at some point drop support for targets that FPC still supports (as already happened with Mac OS X for PowerPC/PowerPC64)." [1]
Or "FPC's native code generators are much faster than LLVM's (even if you would neglect the overhead of FPC generating bitcode and the LLVM tool chain reading it back in), so especially while developing it may be more interesting to use FPC's own code generators"
But the test in this thread mostly benchmarks the hashmap implementation. I have my own Pascal hash map, and it is twice as fast as the one in their standard library.
And, looking at the test code, it does double hashing: first a get, then an insert. If it did a find-entry and updated it in-place, it would be twice as fast, too.
Together that would be four times faster, as fast as the C version, and still simple.
I've checked with ClickHouse and the result is better than I expected... it runs in 0.043 sec. on my machine, which is faster than any other result.
The code:
SELECT arrayJoin(splitByChar(' ', lower(line))) AS word, count() AS c FROM file('kjvbible.txt', LineAsString) WHERE notEmpty(word) GROUP BY word ORDER BY c DESC FORMAT Null
or:
clickhouse-local --query "SELECT arrayJoin(splitByChar(' ', lower(line))) AS word, count() AS c FROM file('kjvbible.txt', LineAsString) WHERE notEmpty(word) GROUP BY word ORDER BY c DESC" > /dev/null
You are also using a language function to read the file. In the 'official' github implementations they have to accept the data line by line from stdin - stdin likely being slower than reading a file directly.
In this type of blog post with comparisons spanning many languages, you usually see some unnatural approach for your favorite language.
However, in this case, the Python example is idiomatic.
Since the articles calls for better approaches, I would suggest to take the opportunity to use the walrus operator and rsplit() for the optimized version. Something like this:
reminding = ""
c=Counter( )
while (chunk := sys.stdin.read(64 * 1024)):
pre, post = chunk.lower().rsplit('\n', 1)
c.update((reminding + pre ).split())
reminding = post
It should not affect performance too much, and the code gets more expressive.
However, performance will be quite different depending on the Python version you use. Interestingly, Python 3.11 beta, which comes with a lot of performance tweaks, is slower for this exercise, while being reported to be faster on real-life tasks.
Your code assumes there is at least one newline in each 64K chunk, and assumes there are no words past the final newline.
It will fail on "abc" and give the wrong answer for "abc\ndef".
I prefer rpartition over rsplit to handle the first case, and the loop-and-a-half construct instead of the while+walrus operator to handle the second, as in this modified version of your code:
remaining = ""
c=Counter( )
while True:
chunk = sys.stdin.read(64 * 1024)
if not chunk:
if not remaining:
break
pre = post = ""
else:
pre, mid, post = chunk.lower().rpartition("\n")
c.update((remaining + pre).split())
remaining = post
> Incidentally, this problem set the scene for a wizard duel between two computer scientists several decades ago. In 1986, Jon Bentley asked Donald Knuth to show off “literate programming” with a solution to this problem, and he came up with an exquisite, ten-page Knuthian masterpiece. Then Doug McIlroy (the inventor of Unix pipelines) replied with a one-liner Unix shell version using tr, sort, and uniq.
Since this writing and other linked resources present only one side of the affair, I will mention: the presentation in Pearls was quite unfair to Knuth, and the conclusions commonly made of it somewhere between moderately and entirely unsound. (That is, as conclusions from the paper. I speak of unsoundness of logic only: these conclusions may be supportable from other sources.)
Knuth was challenged to demonstrate literate programming, and so that was what he demonstrated. He showed building something from the ground up assuming a very minimal environment, with problem analysis and all: of course that ended up more verbose than a Unix pipeline that glosses over problem analysis and starts with most of the tools you need to make a probably-acceptable solution!
McIlroy himself admitted a few years later that he had been “a little unfair” to criticise on engineering grounds what had only been an illustration of technique. Even in the paper, Bentley did admit a degree of culpability for this mismatch in his “criticism of programs” paragraph. (“He admires the execution of the solution, but faults the problem on engineering grounds. (That is, of course, my responsibility as problem assigner; Knuth solved the problem he was given on grounds that are important to most engineers-the paychecks provided by their problem assigners.)”)
But I do wish that Knuth had been given the opportunity to write a review of McIlroy’s review, for simultaneous publication.
1986 was very early days for modern operating systems. I suspect that the hardware environment (low resource) would be surprising to many phone owners here. So the efficacy had to do with the tools' low-level bindings, too.
Indeed. In 1986, Knuth was exclusively using the SAIL DEC10, running the WAITS OS. This environment limited him to approximately half a megabyte of data space (heap plus stack plus globals), and half a megabyte of code space. All statically linked, so all library and runtime stuff had to fit.
Also of note is that most CPUs of the day didn't execute instructions orders of magnitude faster than bandwidth to main memory, as is the case today; it was early days for worrying about making algorithms "cache aware". So, for instance, there was no big penalty for "pointer chasing" in linked lists as there is today. Additionally, the severe memory size limits affected various time vs. space tradeoffs in algorithm design.
WEB is not a totally different programming language, it is a way to write programs. It is a hybrid (in the form presented in that article, though later versions were centered on other languages or language agnostic) of TeX and Pascal. The code that was written was proper Pascal, the documentation was proper TeX, and there was extra syntax for WEB use in naming and referencing code definitions. The WEB system had two additional components (and these terms are still used in many WEB-derived systems): WEAVE, TANGLE.
WEAVE takes a WEB source file and generates proper TeX out of it. TANGLE takes a WEB source file and generates proper Pascal.
The code written in WEB (and variants) is not necessarily in proper order for compilation, and can contain references to other blocks which are included as text-inclusions. I've never used WEB proper, but in org-mode there is org-babel. Its syntax is something like this (from memory, I use shortcuts so I don't have to memorize all the details and type them out):
To handle user input, the program will read from a file.

    #+NAME: open-file (a better name would be used in a real program)
    #+BEGIN_SRC lisp :noweb yes
    (with-open-file (f filepath)
      <<parsing>>)
    #+END_SRC

Elsewhere, the actual parsing will look like:

    #+NAME: parsing
    #+BEGIN_SRC lisp :noweb yes
    ...
    #+END_SRC
And the order of these can be reversed, the correct output will be produced with a few other settings and options. With org-babel, when you tangle the org file it will generate one or more source files based on the options and settings you've used throughout the org file itself. Instead of WEAVE, you would just use the standard org export settings.
As I wrote elsewhere in this thread (though I was downvoted), I suspect the reason is that this benchmark is measuring the total execution time of a single run of an executable. This would include any time spent by the executable to bootstrap its runtime environment, initiate a virtual machine, and whatever else it needs to do in addition to the relevant code at hand to process the string.
Such a test will favor implementations that have extremely svelte bootstrapping, allowing them to immediately begin executing the relevant code and return a result.
I feel a more useful test would be for the relevant string processing code to be run tens to thousands of times within the process itself, so as to diminish the relative importance of bootstrap/warmup code and/or runtime optimization performed by VMs. Unless, of course, part of the intent was specifically to measure the impact of the bootstrapping.
In that case, it would change the Python result dramatically as well.
But I don't think it would be fair: I use Python a lot, and if it's for scripting, you are happy with the fact it's very easy to write. However, it's slow to start, and you pay that each time you run the script.
To me, it makes sense that in this exercise, which leans heavily toward scripting, we see the price of the VM start in the overall profiling.
Otherwise, let's use PyPy, warm it up a few thousand times, and you may get closer to Go times.
I don't think Swift's performance is due to start up time at all. I actually cloned the repo, and ran the benchmark and found that Swift's execution time scales drastically with the size of the input. The Swift team actually boasts about its quick start up time on the official website [1]. I ran a simple hello world benchmark to gauge Swift's start up time and got an output of 13 milliseconds.
Yeah the startup time for the Swift code is on the order of several milliseconds on my computer, the profiler running at 1000 Hz only gets four samples off of it
I was going to make a general comment about this too. It’s a huge penalty for Java too, and it has no correlation with how well it performs outside of toy benchmarks.
Here's a very rough breakdown of the operations it does:
      1.33 s     56.7%   specialized Collection<>.split(separator:maxSplits:omittingEmptySubsequences:)
    263.00 ms    11.2%   Substring.lowercased()
    226.00 ms     9.6%   specialized Dictionary.subscript.modify
    189.00 ms     8.0%   readLine(strippingNewline:)
As you can see, calling split on the string is really slow. The reason it is slow is that it's using the generic implementation from Collection, rather than String, which doesn't really know anything about how String works. So to do the split it's doing a linear march down the string calling formIndex(after:), then subscripting to see if there's a space character there using Unicode-aware string comparison on that one Character.
Swift is really nice that it gives you "default" implementations of things for free if you conform to the right things in the protocol hierarchy. But, and this is kind of unfortunately a common bottleneck, if you "know" more you should really specialize the implementation to use a more optimized path that can use all the context that is available. For example, in this case a "split" implementation should really do its own substring matching and splitting rather than having Collection call it through its slow sequential indexing API. Probably something for the stdlib to look at, I guess.
The rest of the things are also not too unfamiliar but I'll go over them one by one; lowercased() is slow because it creates a new Substring and then a new String, plus Unicode stuff. Dictionary accesses are slow because Swift uses a secure but not very performant hash function. readLine is slow because the lines are short and the call to getline is not particularly optimized on macOS, reallocs and locks internally, then the result is used to create a new String with the newline sliced off.
It's sad to see that Swift is so slow in this particular case given that it has the potential to be such an optimized language (strong typing, compilation, backing from Apple).
As they call out in the article, accurately counting words is a bit more complicated than splitting by space. You need to account for all sorts of punctuation etc.
When I've been asked this in programming interviews, I've almost always been expected to produce something like the code in the article (and do), but usually I'd point to something like the NLTK library as a better approach. It's polished and highly capable, and handles probably just about every edge case there is.
It's a bit of a shame that Rust's stdlib doesn't have a "fast but less safe" hasher. It's one of the easiest performance optimizations in my experience because it's very rare that an attacker controls input to your hashmap in a way that can actually lead to a DOS.
Not that it doesn't happen, it's just not something I've run into.
Just today I was testing out the performance difference hashing some Scylla queries and got nearly 2x faster hashing moving to fxhash.
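For anyone who hasn't done that swap before, it's about this much code (a sketch, assuming the external fxhash crate):

    use std::collections::HashMap;
    use std::hash::BuildHasherDefault;

    // std's HashMap keeps its API; only the hasher changes, trading DoS
    // resistance for speed.
    type FastMap<K, V> = HashMap<K, V, BuildHasherDefault<fxhash::FxHasher>>;

    fn main() {
        let mut counts: FastMap<&str, u64> = FastMap::default();
        for word in "the quick brown fox the".split_whitespace() {
            *counts.entry(word).or_insert(0) += 1;
        }
        println!("{:?}", counts);
    }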
C# really needs a GetOrAdd method on the regular Dictionary class (as exists on ConcurrentDictionary). This common pattern (from the optimized version) requires the hash function on "word" to be evaluated twice (once for TryGetValue and once for Add). Then, if there is a collision on the 32 bit hash then the equality check needs to be run twice as well. Not good as all of these are O(N) operations (where N is the size of "word"). I suspect the optimized version would be materially faster if they implemented their own dictionary that includes this.
    if (!counts.TryGetValue(word, out var wordCountRef))
    {
        counts.Add(word, new Ref<int>(1));
    }
Doesn't the Unix Shell implementation break the "threading" constraint, in that each program in the pipe will have a main thread each, and all the programs will run concurrently?
Or should the "threading" constraint really have been stated as "you can use multiple threads, but only if they all have their own address space"?
Alternatively, given the multi-core capabilities of even budget/mobile/small systems these days (even the Raspberry Pi 4 has a quad-core CPU), isn't restricting the implementation to a single thread a weird artificial limitation in 2022? Also, a GPU-based implementation would be interesting, and GPUs get most of their power from their massively parallel core architecture.
If I were doing the benchmark, I would just run it on a VM with a single physical core. There are good reasons for comparing single threaded performance since it may not be necessary or desirable to scale that way. For example, suppose it's a wc API or something; limiting each request to a single thread can be better since you get fairer scheduling and less data sharing which is almost always better for multiprocessing. Even shared immutable data can cause problems since the CPU may not realize the data is actually immutable.
I agree GPUs would probably do well on this problem, though it depends on how expensive the memory copying to/from the GPU ends up being. However, if you are going for all-out performance, it seems like you could use SIMD, like memchr or something.
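Something like this, sketched for the memchr idea (assuming the external memchr crate, and only splitting on literal spaces for brevity):

    // The SIMD-accelerated search jumps from space to space instead of
    // inspecting every byte in a scalar loop.
    fn words(buf: &[u8]) -> Vec<&[u8]> {
        let mut out = Vec::new();
        let mut start = 0;
        while let Some(i) = memchr::memchr(b' ', &buf[start..]) {
            if i > 0 {
                out.push(&buf[start..start + i]);
            }
            start += i + 1;
        }
        if start < buf.len() {
            out.push(&buf[start..]);
        }
        out
    }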
> If I were doing the benchmark, I would just run it on a VM with a single physical core.
I think that's definitely an environment worth benchmarking on - but I don't think that it should be the only environment to benchmark on.
Also, I don't think it's a good reason to limit implementations to a single thread, even if that is your benchmark environment. It can be worth seeing how well an implementation that's capable of taking advantage of multiple cores/CPUs, does when it's only given one core to work with.
It's probably worth optimising for and benchmarking different threading strategies too - does your implementation create one thread per "work unit" and let the scheduler work them all out, or does it create a one-thread-per-core (or maybe one per core, plus one) thread pool and assign work units to each thread until they're done? And how do those strategies work if they're only given a single core?
The single core case is definitely worth testing, but it seems odd to limit implementations to a single thread because of it. If you think you can go faster with a threaded implementation, you should be able to try that out.
> … worth seeing how well an implementation that's capable of taking advantage of multiple cores/CPUs, does when it's only given one core to work with.
Did something like that — programs written for multi-core forced onto one core, alongside programs not written for multi-core.
IIRC, that difference wasn't something anyone ever expressed interest in.
> When optimizing this, the first thing to do is compile with optimizations enabled (g++ -O2). I kind of like the fact that with Go you don’t have to worry about this – optimizations are always on.
I think it is more fair to say that in Go optimizations are always OFF.
I haven't run it, just tracked down the source because I was similarly curious about the slow runtime. If I do run it it will be with SBCL. Looking at the repo, Hoyt did as well.
That's not the point. It's perfectly safe and there's nothing else in C.
The point is you don't have to call this function for every single word. It takes time, and a pointer increment doesn't.
Ooo... spooooky pointers... ancient prophecies speak of the Earth splitting in half and swallowing everyone who's ever thought about allocating memory.
Taking it in a slightly different direction – because in my work chunking is nearly never necessary (too little text or too much memory) whereas Unicode characters and punctuation nearly always is – I would tend to something like:
    from collections import Counter
    from re import finditer

    word_counts = Counter()
    with open(PATH, encoding='utf-8') as file:
        doc = file.read()
    word_counts.update(match.group().lower() for match in finditer(r'\w+', doc))
In Python, this could be a fairly performant way of taking unicode into account, because it doesnʼt use Pythonʼs for-loop, and regular expressions are rather optimized compared with writing low-level-style Python. (Maybe it should use casefold instead of lower.)
My language of choice, LiveCode, is a superset of HyperTalk. The concept of "words" has been built-in (not a library) since the '80s. Hash tables since the '90s.
A solution to every aspect of this problem but the line-by-line bit took just over a minute to write. Adding that took another minute.
The trade-off is that LC is slower than almost any of the languages listed, and optimization is almost impossible -- the idiomatic way is the fastest (almost). And if you don't like the way LC parses words, there is almost no way to change it without going into the source code (a big pain).
Hm, I tried comparing the simple.hs vs simple.py (with ghc 8.10.7 which I happened to have here, their stack.yaml uses 8.10.4) and it's 5.7s vs 1.9s, not quite as stark a difference. Maybe because I'm on Linux?
A q solution in kdb runs in about 2.5s on my machine vs simple.py which takes 4.7s. Rough interpolation with the results table puts this on par with Rust/Go/OCaml or other compiled languages.
\t desc count each group `$lower " " vs raze read0 `:kjvbible_x10.txt
It's also 1 line!
I am sure q pros or k purists can optimize this even more...
EDIT: Moving the lower earlier brings this down to 1.9s
desc count each group `$ " " vs lower raze read0 `:kjvbible_x10.txt
Fun stuff! Did run a similar thing with a simple bioinformatics problem before (calculating the ratio of G and Cs against A+G+C+T), also with a whole bunch of contributors:
Really hard - or impossible - to arrive at a definitive single number for one language, but the whole exercise is a lot of fun and quite informative IMO :)
This was always my approach when I had to do “leetcode whiteboard interviews”.
Start with something not too far north of FizzBuzz but with a ton of scope to progressively enrich the problem until the clock runs out.
At $BIG_CO du jour people tend to talk about how much “signal” an interview produces, and starting somewhere that a meaningful number of candidates can’t do anything with offers very little “signal”, likewise very little Shannon information.
I'm curious why go is faster than rust in the benchmarks. My best theory is that the program doesn't run long enough to trigger a GC, so rust is paying the cost of freeing memory, but go isn't. (Using an arena allocator in rust would probably produce similar behavior).
Or maybe the difference is just noise? It's hard to tell without more details on how the benchmarks were run.
Most of the time you see Rust lose on that kind of benchmark because it isn't using buffered I/O from the standard library (buffers are available and easy to use, but they are an option you have to think about and search for).
It seems like this is just measuring a single pass at reading a file with about 100k words in it. I ran the benchmark and it finished almost instantly. So languages with startup overhead and warmup time (eg: JIT etc) will totally underperform here. Makes sense then why Java is almost identical between optimised and non-optimised versions.
As far as Java is concerned, using AOT or JIT caches is also an option; however, these kinds of benchmarks tend to reveal a lack of knowledge of the Java ecosystem.
> We usually think of I/O as expensive, but I/O isn’t the bottleneck here...The tokenization and hash table operations are the bottleneck by a long shot
Interesting. This is something I'll have to keep in mind.
Best practice has always been to consider I/O the slowest part.
He didn't clear the disk cache when profiling so of course the IO overhead was negligible. Had he flushed the cache between runs the numbers would have been very different.
Some of these efficiency gains are remarkable. Most of the code I write is just the quickest way to write a mundane function. I'm really curious what the energy savings would be if you went into a huge tech co and devoted ~10% of SWE time solely to optimizing the code base.
You can't walk into a huge tech co and make a dent because huge tech co already has hundreds of engineers on that job full time. Just look at the papers coming out of huge tech co's. They will publish if their technique saves them 0.2%. That's how tight things are already wired in the big leagues.
The real money is in huge non-tech co, or medium tech co. You could walk into a goldman sachs type of operation and point out savings in CPU time that would be 90-99%. But they're not going to care, because their corporate structures aren't wired to observe and act on such things.
If you make a dent on server bills on the order of several million dollars a year generally it's not too difficult to make a case that you should be allowed to continue.
Interesting, it confirms what I've suspected for a while: that by using Python and not C++ for "general text mashing" tasks I'm not losing that much performance. Like, 2x, 5x, perhaps 10x. But not 100x.
Reading anything at all into the relative performance of these programs written by different people which are all doing radically different things seems like a gigantic waste of time.
Some of these “simple” versions are specifying custom allocators to minimize memory overhead, doing highly optimizable ASCII value manipulations, and doing byte-equality on strings rather than going through slower Unicode libraries. The “simple” Swift version does nothing at all to avoid generating many allocations/deallocations, and it runs fully Unicode-aware algorithms to lower the input and perform Unicode-normalized string comparisons. It's totally apples-to-oranges stuff.
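To make that gap concrete in one language (Rust here, only because it exposes both operations in the standard library), the two kinds of lowering aren't even the same function, and plain == on strings is byte comparison with no normalization:

    // Sketch of the ASCII-vs-Unicode gap described above.
    fn main() {
        let word = "ÉCLAIR";

        // Unicode-aware: consults the case-mapping tables and allocates a new String.
        let unicode_lower = word.to_lowercase();        // "éclair"

        // ASCII-only: rewrites b'A'..=b'Z' in place and ignores every other byte.
        let mut ascii_lower = word.to_string();
        ascii_lower.make_ascii_lowercase();             // "Éclair"

        // The two disagree on anything outside ASCII, and `==` here is plain
        // byte equality with no Unicode normalization.
        assert_ne!(unicode_lower, ascii_lower);
        println!("{unicode_lower} vs {ascii_lower}");
    }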
I was wondering if this was indeed due to Unicode. I had hoped that the benchmarks required every language to perform the same kinds of string checks, otherwise there's really no point.
You need to measure memory usage alongside speed and take the product of the two; that gives you a much better single number for comparing performance.
Why is that surprising? The Perl interpreter is fast and dispatches just about everything to heavily optimized C, just like Python does; it is simply better at it.
This is the kind of thing where Perl really shines.
This is what Perl was made for, and it is good at it. I would have been very suspicious if Python or Ruby had ended up faster. On this chart, it is the fact that the simple C++ is so slow that is interesting.
Another one-minute look: a multiplication on every byte? Really?
If this is supposed to be fast, get one big buffer for the words. Get another if you run out.
The hash function can be the word itself for words up to length 8.
Longer words would need a multiply per 8 bytes (a rough sketch of this follows below).
You can do branchless strchr.
Roll your own memmove - just always copy 24 bytes and terminate at the length you already know.
The whole point of using C or C++ is that you can do non-idiomatic things when you have to.
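Here is a rough sketch of the "short word is its own hash" idea, written in Rust for consistency with the rest of the thread; the multiplier and the exact mixing are my own arbitrary choices, not part of the original suggestion:

    // Words of up to 8 bytes hash to exactly their own zero-padded bytes packed
    // into a u64; each additional 8-byte chunk costs one multiply to fold in.
    fn word_hash(word: &[u8]) -> u64 {
        let mut hash = 0u64;
        for chunk in word.chunks(8) {
            let mut buf = [0u8; 8];
            buf[..chunk.len()].copy_from_slice(chunk);
            // Arbitrary odd constant for the sketch; any decent mixer would do.
            hash = hash.wrapping_mul(0x9E37_79B9_7F4A_7C15) ^ u64::from_le_bytes(buf);
        }
        hash
    }

    fn main() {
        // A word of <= 8 bytes is literally its own hash value.
        assert_eq!(word_hash(b"the"), u64::from_le_bytes(*b"the\0\0\0\0\0"));
        println!("{:#x}", word_hash(b"beginning"));
    }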
Yep, I don't have the energy tonight to inspect the source code, but how you write JS matters a lot.
Little things like hoisting local variables out of the prototype and closure chains can make a massive difference.
I did some experiments with hot path js optimization several years ago and was pretty surprised at the performance you can squeeze out of the jitter.
JS is notorious for being difficult to do good perf analysis on because of the warmup and hot-path analysis. From my experiments it can sometimes take many seconds from startup for the JIT to kick in on a tight loop.
I remember the last time this came up and you bloviated about a bunch of nonsense[1], and ended up not being able to deliver. Talk about bullshit. At least your comment here is shorter.
The original title is 2 characters over the HN title length limit. I suspect they dropped "Go" first, because it's the language with a 2-letter name, and then they editorialised it further to hint that there are more languages.
Our distributed NewSQL database project actually uses three of them, Go, Rust, and C++, for different layers. There are pros and cons to each, but it is still a challenge to maintain three languages in one system's development. Performance-wise, C++ and Rust are good; coding-wise, Go is the easiest, but Rust kind of enforces the best practices.
I would not call it a performance comparison at all. When Python calls functions written in C, it is not Python's performance. Write those functions in plain Python and then see the results. Sure, for the basic example in the article it does not matter from a practical standpoint, but when you need to step away from canned cases, Python's performance suddenly sucks big time.
Can we move past the whole "it is not really Python if you use libraries written in C" thing, especially when talking about pure stdlib Python?
Python is basically a DSL for C extensions. That is the whole point. It would be like criticizing any compiled language for essentially being a DSL for machine code, and not "Real Instructions".
Python's ability to interop with pre-built, optimized libraries with a lightweight interface is arguably its greatest selling point. Everyone and their dog knows purely interpreted CPython is slow. It doesn't need to be pointed out every single time python performance is brought up unless literally discussing optimizing the CPython vm.
I am not criticizing Python. It does well enough what it was made to do. It just makes no sense to call that particular example a "language performance comparison". It is anything but.
But you are still wrong. As mentioned, dicts are incredibly efficient data structures in Python (because they underpin everything), and the Counter class is pure Python. That's 100% pure Python. Saying dicts "don't count" because they are implemented in C would disqualify the entirety of CPython, as virtually everything under the hood is a PyObject struct pointer. It just so happens that "counting abstract objects" is the class of problem* CPython handles all the time in normal VM execution anyway, so this task is easy and fast.
* looking up a reference of an attribute with a string key underpins the meat and potatoes of python's execution model
Yes, that's exactly what I said. dict.update is in C because it's a core feature of the Python VM. It's pure CPython. What do you think "pure Python" is? There's no Python hardware ISA (afaik). All CPython does is manipulate data structures in C via Python VM opcodes. It just so happens that whatever opcodes get dispatched in the course of solving this problem are quite efficient.
If you say "it does not count as Real Python if you dispatch to C", then you literally cannot execute any CPython VM opcodes, because it's all dispatching to C under the hood.
“Pure Python” commonly means implemented using only the Python language. Something written in pure Python ought to be portable across Python implementations. I was merely pointing out that this line
isn't exactly pure Python, because under a different runtime (e.g. PyPy) the code would take a different path (the “pure Python” implementation of _count_elements[1] instead of the C implementation[2][3]). Yes, it's hard to draw exact lines when it comes to Python, especially as the language is so tied to its implementation. However, I think in this case it's relatively clear that the code that specific line calls is an optimization in CPython, specifically intended to get around some of the VM overhead. That optimization comes into play in the OP.
Apparently there will be someone who feels the need to point it out; and someone who feels the need to point out that it doesn't need to be pointed out; and …
Python's `collections.Counter` is written in Python and is a subclass of the builtin `dict` type. I don't think it's comparable to something like using `pandas` to solve the problem.
[0]: https://github.com/benhoyt/countwords/pull/115