
Why is strlen so complex in C? - azhenley
https://stackoverflow.com/q/57650895/938695
======
saagarjha
The accepted answer fails to understand that the standard library
implementation is not bound by the rules the C standard imposes on user code,
and it makes a number of false or overly prescriptive assertions. (It also
doesn't answer the question, but that's par for the course on Stack Overflow…)

~~~
seisvelas
> that's par for the course on Stack Overflow

This is tangential to the meat of your comment but I feel compelled to nitpick
based entirely on personal anecdote: I have been helped hundreds of times on
StackOverflow by people who had nothing to gain from it and yet provided
in-depth, insightful answers. In my experience, the farther you get away from
the big, overwhelmed tags, the better the quality is. But you don't want to get
so obscure that you are just ignored.

The Racket community on SO is particularly fantastic, and (in my experience)
the more 'web' related your question is, the worse answers you get.

~~~
saagarjha
Yes, I'm not saying that Stack Overflow doesn't have "birdies". For example, I
really enjoy this user's consistently high quality answers:
[https://stackoverflow.com/users/224132/peter-cordes](https://stackoverflow.com/users/224132/peter-cordes)

~~~
seisvelas
He once answered a question of mine about recursion with x86!

------
m463
Optimization hacks aside, I swoon at the simple (and verified) seL4
implementation (note strNlen):

    word_t strnlen(const char *s, word_t maxlen)
    {
        word_t len;
        for (len = 0; len < maxlen && s[len]; len++);
        return len;
    }

[http://sel4.systems/](http://sel4.systems/)

[https://github.com/seL4/seL4](https://github.com/seL4/seL4) (from file
src/string.c)

~~~
ridiculous_fish
The trailing semicolon on that for loop is quite load-bearing. I would have
some code review feedback.

~~~
anticensor
It can be replaced by continue; in this specific case.
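
As a sketch of that suggestion (not seL4's actual code), the loop below is the same except that the empty body is spelled `continue;`. The name `strnlen_continue` is made up, `size_t` stands in for seL4's `word_t` so the snippet compiles on its own, and the `assert` is a debug-only precondition added here.

```c
#include <assert.h>
#include <stddef.h>

/* Same logic as the seL4 snippet quoted above. The function name and the
 * use of size_t instead of word_t are assumptions made for this sketch. */
size_t strnlen_continue(const char *s, size_t maxlen)
{
    size_t len;

    /* Debug-only precondition, not present in the original. */
    assert(s != NULL || maxlen == 0);

    for (len = 0; len < maxlen && s[len]; len++)
        continue; /* explicit empty body instead of a bare ';' */
    return len;
}
```

The behavior is unchanged; the only difference is that the empty loop body is now hard to miss in review.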

------
bcaa7f3a8bbc
It's worth pointing out that the author was reading an outdated implementation
of strlen() from glibc. The generic implementation is still here with the same
code, and would be compiled if no ARCH-specific implementation is written, as
seen here:
[https://github.com/bminor/glibc/blob/master/string/strlen.c](https://github.com/bminor/glibc/blob/master/string/strlen.c).

But it's probably irrelevant to most systems today, as the assembly version is
almost always used, e.g.

* i386: [https://github.com/bminor/glibc/blob/master/sysdeps/i386/str...](https://github.com/bminor/glibc/blob/master/sysdeps/i386/strlen.S)

* i586/i686: [https://github.com/bminor/glibc/blob/master/sysdeps/i386/i58...](https://github.com/bminor/glibc/blob/master/sysdeps/i386/i586/strlen.S)

* i686 w/ SSE2: [https://github.com/bminor/glibc/blob/master/sysdeps/i386/i68...](https://github.com/bminor/glibc/blob/master/sysdeps/i386/i686/multiarch/strlen-sse2.S)

* i686 w/ SSE2 + BSF: [https://github.com/bminor/glibc/blob/master/sysdeps/i386/i68...](https://github.com/bminor/glibc/blob/master/sysdeps/i386/i686/multiarch/strlen-sse2-bsf.S)

* x86_64 w/ SSE2: [https://github.com/bminor/glibc/blob/master/sysdeps/x86_64/s...](https://github.com/bminor/glibc/blob/master/sysdeps/x86_64/strlen.S)

* ARM: [https://github.com/bminor/glibc/blob/master/sysdeps/arm/strl...](https://github.com/bminor/glibc/blob/master/sysdeps/arm/strlen.S)

* ARMv6: [https://github.com/bminor/glibc/blob/master/sysdeps/arm/armv...](https://github.com/bminor/glibc/blob/master/sysdeps/arm/armv6/strlen.S)

* ARMv6T2: [https://github.com/bminor/glibc/blob/master/sysdeps/arm/armv...](https://github.com/bminor/glibc/blob/master/sysdeps/arm/armv6t2/strlen.S)

* ARMv8a/AArch64: [https://github.com/bminor/glibc/blob/master/sysdeps/aarch64/...](https://github.com/bminor/glibc/blob/master/sysdeps/aarch64/strlen.S)

* ARMv8a/AArch64 w/ ASIMD: [https://github.com/bminor/glibc/blob/master/sysdeps/aarch64/...](https://github.com/bminor/glibc/blob/master/sysdeps/aarch64/multiarch/strlen_asimd.S)

etc.

~~~
zamalek
I was wondering how it would compare to the code produced by
auto-vectorization. I guess it doesn't matter with hand-crafted assembly.

~~~
wahern
The compiler can't autovectorize because it can't assume it's okay to read
past the end of the string. Only the human or the machine can make that
determination, using special knowledge about the environment, such as that,
when locating the NUL byte, out-of-bounds reads are okay so long as they don't
cross a page boundary. And "okay" is a stretch, because tools like Valgrind
ship with huge lists of manually maintained suppressions to account for hacks
like this that violate the standard rules.
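
To make the page-boundary point concrete, here is a rough sketch. Memory protection works at page granularity, so a load that stays inside the page holding a valid byte will not fault, even though it is still out of bounds as far as the standard is concerned. The 4096-byte page size and the name `read_crosses_page` are assumptions for illustration; real code would ask the platform for the page size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed page size for this sketch only; real code would query the
 * platform, for example sysconf(_SC_PAGESIZE) on POSIX systems. */
#define SKETCH_PAGE_SIZE 4096u

/* Returns true if reading n bytes starting at p would touch more than
 * one page, i.e. the read could fault even though its first byte is valid. */
bool read_crosses_page(const void *p, size_t n)
{
    assert(n > 0 && n <= SKETCH_PAGE_SIZE);

    uintptr_t offset = (uintptr_t)p % SKETCH_PAGE_SIZE;
    return offset + n > SKETCH_PAGE_SIZE;
}
```

Since the page size is a multiple of 8, an 8-byte load from an 8-byte-aligned address never crosses a page, which is why the word-at-a-time implementations align the pointer first.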

~~~
flukus
But in this case a human applied their special knowledge in glibc. What's
stopping a human from applying the same knowledge to the compiler itself for
this specific OS/arch?

~~~
wahern
Primarily because it would make it more difficult to detect and debug out-of-
bounds reads, so at the very least it's not something you'd want to do by
default, even at high optimization levels.

Possibly it might also be an awkward, complex, or dangerous (as in risk of
unintended consequences) optimization to selectively violate the memory model
that way. But I'm not familiar with the internals of optimizing compilers, so
that's just conjecture.

------
Ididntdothis
Have there been any efforts to add real string types and maybe arrays to C?
It seems a lot of complications come from the primitive or nonexistent
implementation of strings and arrays.
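
For a sense of what a "real" string type could look like in plain C, here is a minimal sketch of a counted string: a pointer plus an explicit length, so asking for the length no longer means scanning for the NUL. `struct str_view` and its helpers are hypothetical names, not part of the C standard library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical counted-string view: a pointer plus an explicit length. */
struct str_view {
    const char *data;
    size_t len; /* number of bytes; no terminator is needed to find the end */
};

/* Build a view from a NUL-terminated string, paying for the scan once. */
struct str_view str_view_from_cstr(const char *s)
{
    assert(s != NULL);

    struct str_view v = { s, strlen(s) };
    return v;
}

/* Constant-time length lookup, in contrast to strlen's linear scan. */
size_t str_view_len(struct str_view v)
{
    return v.len;
}
```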

~~~
raverbashing
It's such a shame that C gets some things like strings so wrong.

Writing in C shouldn't have to be so painful, but I guess they were the
pioneers in a lot of things and the "high level assembly" idea stuck.
(Premature optimization?)

Also, having objects and method calls, even if it's syntactic sugar deep down,
is the best kind of syntactic sugar.

~~~
MadWombat
This is kind of the point of C. It maps directly to the hardware. There are
very few decisions the compiler has to make about the code. And it is for that
reason that C does not allow for higher level abstractions. On one hand, it is
a pain to write. On the other hand, you know exactly what is going on under
the hood and nothing is ever going to stop you from shooting yourself in a
body part of your choice :)

~~~
kllrnohj
> This is kind of the point of C. It maps directly to the hardware.

No, it doesn't. It didn't in the past; it was a high-level abstraction. And it
doesn't today, because CPUs don't behave at all like C's defined virtual
machine.

~~~
justinmeiners
> CPUs don't behave at all like C's defined virtual machine

What "virtual machine" are you referring to?

~~~
anticensor
One that ISO 9899 refers to as its abstract machine.

------
bjourne
It wouldn't surprise me if the "unoptimized" strlen is just as fast or even
faster on modern x86 hw. Both algorithms need to process the same amount of
data. Thus they will fetch exactly the same number of cache lines from main
memory. Likely, the cost of fetching those cache lines dominates, meaning that
it doesn't matter that the "unoptimized" version does more processing per
byte. Only way to find out for sure is to run the benchmarks.
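
A minimal timing harness along those lines might look like the sketch below. It is not the benchmark anyone in this thread ran; `naive_strlen` and `time_strlen` are made-up names, and `clock()` measures CPU time only coarsely, so treat any numbers as rough.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Byte-at-a-time loop, the "unoptimized" variant under discussion. */
size_t naive_strlen(const char *s)
{
    const char *p = s;
    while (*p)
        p++;
    return (size_t)(p - s);
}

typedef size_t (*strlen_fn)(const char *);

/* Returns the CPU seconds spent calling fn on s a total of reps times. */
double time_strlen(strlen_fn fn, const char *s, int reps)
{
    assert(fn != NULL && s != NULL && reps > 0);

    /* Reading the pointer through a volatile object on every iteration and
     * accumulating into a volatile sink makes it harder for the compiler
     * to hoist or drop the calls. */
    const char *volatile input = s;
    volatile size_t sink = 0;

    clock_t start = clock();
    for (int i = 0; i < reps; i++)
        sink += fn(input);
    clock_t end = clock();

    (void)sink;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling `time_strlen(naive_strlen, buf, reps)` and `time_strlen(strlen, buf, reps)` on the same buffer, across several string lengths, gives a first comparison.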

~~~
BeeOnRope
It's not even close. A per-byte loop is going to be 4-6 times slower than
the 8-byte chunks + a little math shown in the question.

Let's say that all the stars align and you process an entire iteration of the
byte-by-byte loop in 1 cycle. On a 4 GHz machine that's ... 4 GB/s. That's
paltry compared to main memory bandwidth of 30-100 GB/s [1], not to mention
say L1 bandwidth of ~ 256 GB/s (available to a single core).

This idea that everything is memory limited is just false: it's pretty hard to
write code efficient enough that it can be limited by memory _bandwidth_ on a
single core.

---

[1] Admittedly, a single core can usually only access 20-30 GB/s of that, but
that's still >> 4 GB/s.
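
For reference, the "8-byte chunks + a little math" approach can be sketched roughly as below. This is a simplified illustration and not glibc's code; `has_zero_byte` and `wordwise_strlen` are made-up names. Note that the chunk loop can read bytes past the terminator, which is out of bounds by the standard's rules and will be flagged by tools like Valgrind or ASan on ordinary strings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ONES  0x0101010101010101ULL
#define HIGHS 0x8080808080808080ULL

/* Nonzero exactly when at least one of the 8 bytes in w is zero. */
static int has_zero_byte(uint64_t w)
{
    return ((w - ONES) & ~w & HIGHS) != 0;
}

size_t wordwise_strlen(const char *s)
{
    assert(s != NULL);
    const char *p = s;

    /* Byte loop until p is 8-byte aligned, so the chunk loads below
     * never straddle a page boundary. */
    while ((uintptr_t)p % sizeof(uint64_t) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }

    /* Aligned 8-byte chunks. This may read bytes beyond the terminator. */
    for (;;) {
        uint64_t w;
        memcpy(&w, p, sizeof w); /* avoids aliasing problems */
        if (has_zero_byte(w))
            break;
        p += sizeof w;
    }

    /* Find the exact terminator inside the final chunk. */
    while (*p)
        p++;
    return (size_t)(p - s);
}
```

The real implementations also work out the position of the zero byte from the mask instead of rescanning the last chunk, and are often unrolled.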

~~~
bjourne
Did you benchmark it? I did and found that on a Core-M with MSVC, the
optimized strlen beats the naive one by about 20%, and on a Xeon with gcc by
about 40%, depending on string length and other parameters. I would expect the
difference to be smaller on more recent processors. But without benchmarking I
don't think you can say.

~~~
BeeOnRope
Yes, I have benchmarked it in the past, which is where I got the 4-6x figure
from. Maybe it's worth a revisit, but unless the compiler is doing something
special with the byte version, I would be surprised if the speedup was much
less.

You may have to unroll both to get max speed.

If you could share your benchmark I would be interested to look at it.

~~~
bjourne
Benchmarks are always worth revisits. Mine is here
[https://github.com/bjourne/c-examples/blob/master/programs/s...](https://github.com/bjourne/c-examples/blob/master/programs/strlen.c)

