

Strlen(buf) – Not as simple as you’d think - halayli
https://medium.com/late-night-programming/29eb94f8441f

======
tqh
Kind of similar to the one I wrote for Haiku:
[http://cgit.haiku-os.org/haiku/tree/src/system/libroot/posix...](http://cgit.haiku-os.org/haiku/tree/src/system/libroot/posix/string/strlen.cpp)
Makes me wonder if having testbyte() inside the loop will be good for branch
prediction.

And IIRC repne scasb isn't bad for longer strings on new x86. It's the initial
setup that adds overhead. Still, you want to make a fast general solution.

~~~
halayli
The testbyte is only called once, at the end, when a NUL byte is detected in
the word, and will run in parallel.
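
For readers following along, here is a minimal sketch of the structure being discussed: a hot loop that tests a whole word for a zero byte, with the per-byte check only on the way out. The names `has_zero_byte` and `wordwise_strlen` are made up for illustration; this is not the article's code or the Haiku implementation. It assumes 64-bit words, and it reads up to 7 bytes past the terminator inside an aligned word, which ISO C does not strictly permit (real libc versions rely on platform guarantees for this).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Nonzero iff at least one byte of w is 0x00 (classic bit trick). */
static int has_zero_byte(uint64_t w)
{
    return ((w - 0x0101010101010101ULL) & ~w & 0x8080808080808080ULL) != 0;
}

/* Hypothetical word-at-a-time strlen, for illustration only. */
size_t wordwise_strlen(const char *s)
{
    const char *p = s;

    /* Go byte by byte until p is 8-byte aligned, so that each
     * aligned word read stays within a single page. */
    while (((uintptr_t)p % sizeof(uint64_t)) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }

    /* Hot loop: one load and one test per 8 bytes. */
    for (;;) {
        uint64_t w;
        memcpy(&w, p, sizeof w);   /* aligned 8-byte load */
        if (has_zero_byte(w))
            break;
        p += sizeof w;
    }

    /* Cold path, entered once: locate the zero byte in the final word. */
    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}
```

The per-byte test sits outside the hot loop, so the loop body has a single, well-predicted branch.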

~~~
tqh
I was thinking more about how branch prediction is handled in the pipeline.
Agner Fog states that on Haswell, having branches in the same 16-byte block
affects performance. Probably not enough to matter though.

------
Jweb_Guru
And, of course, an industrial-strength optimizer like LLVM goes even further
in using compile-time information to eliminate or replace calls to strlen in
many situations.

[http://llvm.org/docs/doxygen/html/SimplifyLibCalls_8cpp_sour...](http://llvm.org/docs/doxygen/html/SimplifyLibCalls_8cpp_source.html#l00771)
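
As a rough illustration of the kind of call an optimizer can remove, here are two small functions (hypothetical names `literal_length` and `is_empty`). With optimization enabled, a compiler can typically fold the first to the constant 5 and reduce the second to a check of the first byte. Whether a given compiler and flag combination actually does so is an assumption you would want to confirm by inspecting the generated assembly.

```c
#include <stddef.h>
#include <string.h>

/* The argument is a string literal, so the length is known at
 * compile time and the call can be folded to a constant. */
size_t literal_length(void)
{
    return strlen("hello");
}

/* Only emptiness matters here, so the whole scan can be replaced
 * by a test of the first byte (*s == '\0'). */
int is_empty(const char *s)
{
    return strlen(s) == 0;
}
```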

------
nwmcsween
That is actually a pretty ugly version of word-at-a-time strlen; here is my
version: [http://sprunge.us/VZFF](http://sprunge.us/VZFF). Compilers can (and
do) vectorize char-at-a-time functions; ICC especially does some weird magic.

~~~
tqh
The first for-loop looks like it only stops if *s is zero or the pointer s is
zero.

Also you need to calculate size from the byte in the word that is zero, not
the word itself.
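
To make the second point concrete, a small sketch (the helper name `first_zero_byte` is hypothetical, and this is not a fix for the linked paste): once a word is known to contain a zero byte, the length is the offset of that word from the start of the string plus the index of the zero byte inside it. Going through `memcpy` keeps the index in memory order regardless of endianness.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Index (0..7, in memory order) of the first zero byte in w,
 * or 8 if w contains no zero byte. */
size_t first_zero_byte(uint64_t w)
{
    unsigned char bytes[sizeof w];
    memcpy(bytes, &w, sizeof w);
    for (size_t i = 0; i < sizeof w; i++) {
        if (bytes[i] == 0)
            return i;
    }
    return sizeof w;
}
```

The final length would then be `(word_ptr - s) + first_zero_byte(w)`, where `word_ptr` points at the word in which the zero was detected.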

~~~
nwmcsween
Ah yes I missed the second for loop to get the location.

------
icodestuff
I wonder if there are any architectures where the naive strlen performs
better.
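
For reference, "naive" here presumably means something like the byte-at-a-time loop below (the name `naive_strlen` is just for illustration). It needs one pointer and byte-wide loads only, which is why it maps well onto small 8-bit CPUs.

```c
#include <stddef.h>

/* Byte-at-a-time strlen: advance until the terminating NUL. */
size_t naive_strlen(const char *s)
{
    const char *p = s;
    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}
```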

~~~
csense
The Zilog Z80 would definitely be in this category.

It would be really complicated for a compiler or human programmer to work
around the quirks of the Z80 instruction set, where only certain registers can
be used for certain operations. Even an optimal implementation of the word
solution would probably be quite a bit slower than a straightforward naive
implementation.

The Z80 instruction set only supports word reads from constant addresses (but
bytes can be read from an address specified in any register pair). The Z80 has
very few registers, so the more complicated algorithm will probably face
register pressure. Also, word comparison is only implemented for the HL, DE
register pair unless you want to use an index register which requires an
instruction prefix, which will make your code even slower.

But memory reads from an address in a register pair other than HL or an index
register can only go to the accumulator register A, thus you'd need HL for
both the comparison and the address. So you need more instructions to save and
restore HL. (You can't even use the fastest option EX DE,HL because DE is also
needed for the comparison.)

