
Glibc's strlen implementation: Probably not what you'd guess - aston
http://sources.redhat.com/cgi-bin/cvsweb.cgi/libc/string/strlen.c?rev=1.1.2.1&content-type=text/x-cvsweb-markup&cvsroot=glibc
======
jws
I'm not done, but I thought I'd share my quick benchmarks:

    
    
      1 million strlens on the same random 100 byte string:
        Atom 330, 1.6GHz, gcc 4.3.2 (Debian Lenny)
          glibc:   1.3ns/char
          easy:    3.4ns/char
          obsd:    3.4ns/char
    
        Core2 Duo, 2.8GHz, gcc version 4.0.1 (Apple Inc. build 5484)
          libc:    0.12ns/char
          glibc:   0.39ns/char
          easy:    0.58ns/char
          obsd:    0.60ns/char
      
      easy: while ( *p++) c++;
      obsd: openbsd, for (s = str; *s; ++s); return s-str;
    

The preliminary conclusions are:

The glibc strlen is something like twice as fast as the naive implementation,
but there is something else out there that knocks its socks off.

Secondary conclusion would be: remember not to compare GHz across different
processors.
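
For anyone who wants to reproduce the "easy" and "obsd" rows, the two one-liners above expand to roughly the following. This is only a sketch: the function names are made up for illustration, and neither is copied from any libc.

```c
#include <assert.h>
#include <stddef.h>

/* "easy": advance the pointer and bump a counter until the terminator. */
static size_t strlen_easy(const char *p)
{
    size_t c = 0;

    assert(p != NULL);
    while (*p++)
        c++;
    return c;
}

/* "obsd" style: walk to the terminator, then subtract the start address. */
static size_t strlen_obsd(const char *str)
{
    const char *s;

    assert(str != NULL);
    for (s = str; *s; ++s)
        ;
    return (size_t)(s - str);
}
```

Both do one load and one compare per character, which fits the nearly identical timings in the table.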

~~~
DarkShikari
Apple's libc is probably SIMD-optimized. You can make strlen a ton faster with
some basic SSE instructions.
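
A rough sketch of what an SSE2 strlen can look like, to make the idea concrete. This is not Apple's code (nobody here has seen it). It assumes GCC or Clang for `__builtin_ctz`, the name `strlen_sse2` is invented, and it falls back to a plain loop when SSE2 is not available. Like the glibc trick, it reads whole aligned blocks, which goes beyond what ISO C strictly permits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>

static size_t strlen_sse2(const char *s)
{
    const __m128i zero = _mm_setzero_si128();
    /* Round down to a 16-byte boundary: every load is then aligned and
       so cannot straddle a page boundary. */
    const char *p = (const char *)((uintptr_t)s & ~(uintptr_t)15);
    unsigned mask;

    assert(s != NULL);
    /* One bit per byte of the block, set where that byte is zero. */
    mask = (unsigned)_mm_movemask_epi8(
        _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)(const void *)p), zero));
    /* In the first block, ignore bytes that lie before s. */
    mask &= 0xFFFFu << (unsigned)(s - p);

    while (mask == 0) {
        p += 16;
        mask = (unsigned)_mm_movemask_epi8(
            _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)(const void *)p), zero));
    }
    /* The lowest set bit marks the first zero byte in this block. */
    return (size_t)((p + __builtin_ctz(mask)) - s);
}
#else
/* Portable fallback so the sketch still builds without SSE2. */
static size_t strlen_sse2(const char *s)
{
    const char *p = s;

    assert(s != NULL);
    while (*p)
        p++;
    return (size_t)(p - s);
}
#endif
```

The loop handles 16 bytes per iteration with a single compare and a single branch, which is where the speedup over byte-at-a-time code would come from.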

~~~
jwilliams
... Yep - And OpenBSD is intended to run on as much hardware as possible -
hence this simpler form might suit their purposes.

~~~
dchest
Nope, they have implementations for different architectures, you can find them
here:

<http://www.openbsd.org/cgi-bin/cvsweb/src/lib/libc/arch/>

The linked code is for the case when there's no arch-specific implementation.

~~~
jwilliams
Thanks - sort of what I meant though :)

------
axod
Clever stuff. If you're interested in such things, the best book I've seen is
"Hacker's Delight" [http://www.amazon.com/Hackers-Delight-Henry-S-Warren/dp/0201...](http://www.amazon.com/Hackers-Delight-Henry-S-Warren/dp/0201914654/ref=pd_bbs_sr_1?ie=UTF8&s=books&qid=1236711085&sr=8-1)

Author covers really clever techniques to count bits, count non zero bytes
like this etc etc. Bit manipulation at its best.

~~~
pchristensen
Holy crap, all 15 reviews of that book are 5 stars! That's the best average
I've ever seen on Amazon.

~~~
axod
I'd give it 5 also. It's a ridiculously clever fun book.

I agree with some of the reviewers though, the title probably means less
people manage to find it than if it was called "Bit manipulation bible" or
something.

------
briansmith
That makes the X86 assembly version look straightforward:
<http://www.int80h.org/strlen/>. I think it is just a fallback implementation
that is used when no optimized version has been created for the target
processor. Every time I've looked my compiler has used a hand-optimized
version for every platform. Quite possibly, this is 100% dead code.

~~~
sketerpot
That looks like a pretty straightforward translation of the basic
while (*d++ = *s++); code. It uses higher-level instructions, but those probably get
microcoded to the same thing anyway. (Note: my information on the inner
workings of x86 CPUs may be a bit dated.)

------
aminuit
HN has a strange fascination with strlen implementations. For anyone who
missed it a few months ago, cperciva put up an interesting article about
strlen for UTF-8 (variable character width) strings.

[http://www.daemonology.net/blog/2008-06-05-faster-utf8-strle...](http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html)
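
For context on what "strlen for UTF-8" has to compute, here is the plain byte-at-a-time baseline. It is not the faster version from the linked article, just a sketch of the definition, assuming valid UTF-8 input. The name `utf8_strlen_simple` is invented.

```c
#include <assert.h>
#include <stddef.h>

/* Count code points in a NUL-terminated UTF-8 string. Every code point
   has exactly one byte that is NOT a continuation byte (10xxxxxx), so
   counting those bytes counts the characters. Assumes valid UTF-8. */
static size_t utf8_strlen_simple(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    for (; *s; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

With this, `"h\xC3\xA9llo"` is 6 bytes long but counts as 5 characters.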

------
there
for comparison: [http://www.openbsd.org/cgi-bin/cvsweb/src/lib/libc/string/st...](http://www.openbsd.org/cgi-bin/cvsweb/src/lib/libc/string/strlen.c?rev=1.7;content-type=text%2Fplain)

~~~
yan
And this is why I read (free|open)bsd's cvsweb repo when I'm curious as to how
things are implemented. I used to visit that repo in university when I was
bored just to read some clear, concise, well-documented code.

I used to use and love FreeBSD (Since switched to OS X as desktop os, still
run FreeBSD servers), but OpenBSD source code just looked more approachable.

McKusick's wonderful book and video course helped.

------
mynameishere
It's faster if you just store the length in a separate memory location. This
is very fast:

    
    
      getLength()
      {
        return length;
      }

~~~
tptacek
I'm worried nobody's going to mod this up because it doesn't talk about bit
twiddling hacks or assembly cycle counts, but it is in fact the real answer
to this problem. Don't use ASCIIZ when string processing is a bottleneck.
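
In C terms, "store the length" could look something like the sketch below. The `counted_str` type and its functions are hypothetical, made up here for illustration, not an existing library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical counted string: the length is stored, not searched for. */
struct counted_str {
    size_t length;
    char  *data;   /* kept NUL-terminated so it can still go to C APIs */
};

/* Pays the O(n) strlen cost once, at construction. Returns 0 on success. */
static int counted_str_init(struct counted_str *cs, const char *src)
{
    assert(cs != NULL && src != NULL);
    cs->length = strlen(src);
    cs->data = malloc(cs->length + 1);
    if (cs->data == NULL)
        return -1;
    memcpy(cs->data, src, cs->length + 1);
    return 0;
}

/* O(1) no matter how long the string is. */
static size_t counted_str_length(const struct counted_str *cs)
{
    assert(cs != NULL);
    return cs->length;
}

static void counted_str_free(struct counted_str *cs)
{
    assert(cs != NULL);
    free(cs->data);
    cs->data = NULL;
    cs->length = 0;
}
```

The trade-off is that every operation that changes the string has to keep `length` in sync.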

~~~
dchest
Probably because it's not a real answer to _this_ problem (counting the number
of bytes in a null-terminated string)?

~~~
tptacek
Meh. Fair point.

------
lacker
The lesson here is _use library functions_. Someone smarter than you has
probably optimized the hell out of them.

~~~
tlrobinson
Indeed, in college I was interviewing for a job, and the interviewer asked how
I would determine the length of a null terminated string, and I thought I
nailed it by giving a typical "while(*(str++)) c++;" implementation.

The "right" answer was just to use strlen()...

~~~
nostrademons
I had a job interview once where the question was "find all descendants of a
DOM node with a certain tag name, and return them in document order." I
thought for about 30 seconds, and then wrote up on the board:

    
    
        element.getElementsByTagName(tagName);
    

That was what he was looking for. ;-) Of course, then he had me do it out as
if getElementsByTagName didn't exist.

------
yan
It's cute.

Will I employ a similar technique in my own code? Absolutely not. It hinders
readability and byte comparison instructions in consecutive memory addresses
are so stupidly fast that I'll probably save less CPU time combined in all
executions of my program than the total amount of time it took me to think
this hack up.

If you're a maintainer of glibc though (which is known for its exceptionally
clear and straight-forward code,) then I might consider accepting a patch from
someone who thought it up.

edit: I thought similarly of DJB's loop unrolling when I saw it in qmail. It's
cute, but will it make any noticeable difference today? Probably not.

~~~
tlb
In places where I've done this sort of thing, I prefer

    
    
      #ifdef NO_CLEVER_OPTIMIZATIONS
          obvious version
      #else
          complicated fast version
      #endif
    

It's good as documentation, good as a test case, and good for isolating weird
problems like compiler optimization bugs.
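
A concrete sketch of that pattern, using the zero-byte test that the word-at-a-time strlen depends on. It is a slight variant: both versions are always compiled and the macro only selects which one callers get, so the obvious one can double as the test oracle. All names here are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Obvious version: test each of the four bytes in turn. */
static int has_zero_byte_obvious(uint32_t w)
{
    return (w & 0x000000FFu) == 0 || (w & 0x0000FF00u) == 0 ||
           (w & 0x00FF0000u) == 0 || (w & 0xFF000000u) == 0;
}

/* Complicated fast version: subtracting 1 from a 0x00 byte borrows and
   sets its high bit; "& ~w" discards bytes whose high bit was already
   set in the input. */
static int has_zero_byte_fast(uint32_t w)
{
    return ((w - 0x01010101u) & ~w & 0x80808080u) != 0;
}

#ifdef NO_CLEVER_OPTIMIZATIONS
#define has_zero_byte has_zero_byte_obvious
#else
#define has_zero_byte has_zero_byte_fast
#endif

/* The obvious version documents the intent and checks the fast one. */
static void check_has_zero_byte(void)
{
    static const uint32_t samples[] = {
        0x00000000u, 0x00000001u, 0x01010101u, 0x80808080u, 0xFFFFFFFFu,
        0x00FFFFFFu, 0xFF00FFFFu, 0x12345678u, 0x7F7F7F00u
    };
    size_t i;

    for (i = 0; i < sizeof samples / sizeof samples[0]; i++)
        assert(!has_zero_byte_fast(samples[i]) ==
               !has_zero_byte_obvious(samples[i]));
}
```

Building with `-DNO_CLEVER_OPTIMIZATIONS` then swaps in the obvious version everywhere, which is handy when chasing a suspected compiler bug.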

~~~
path
Agreed, that's a great way to handle the case

------
tgos
Sorry for asking such a stupid question, but isn't that code horribly broken?
It assumes that the next 3 bytes after a string are readable. What if I
malloc'ed memory for the string in such a way that the \0 is the last byte of
the memory page and the next byte after \0 is on the next page, which is not
mapped into my VM space? In such a case the CPU will raise a page fault and
the process will die with SIGSEGV or SIGBUS. Or is the glibc version of
malloc padding all memory to have at least 3 bytes beyond its last byte?

~~~
gjm11
The glibc strlen begins by checking the first few bytes, if necessary, so that
it can continue with the assumption that the string is 4-byte or 8-byte
aligned. So no, it's not horribly broken: all the reads are aligned, and
you're not going to get a page boundary in the middle of the word.

(Also, you're not going to be reading off the end of a malloc'ed block,
because essentially all malloc implementations, including the one in glibc,
return blocks whose start address and size are multiples of the architecture's
word size.)
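
To make the alignment argument concrete, here is a stripped-down sketch of the same structure with a fixed 4-byte word. It is not the glibc source and the name `strlen_wordwise` is invented. The word reads go through a `uint32_t` pointer and may touch bytes past the terminator, which works on common platforms but is outside what ISO C guarantees.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static size_t strlen_wordwise(const char *str)
{
    const char *p = str;
    const uint32_t *wp;

    assert(str != NULL);
    /* Step 1: go byte by byte until p is 4-byte aligned. */
    while (((uintptr_t)p & (sizeof(uint32_t) - 1)) != 0) {
        if (*p == '\0')
            return (size_t)(p - str);
        p++;
    }

    /* Step 2: aligned 4-byte reads. An aligned word lies entirely inside
       one page, so a read that contains the terminator cannot fault. */
    wp = (const uint32_t *)(const void *)p;
    for (;;) {
        uint32_t w = *wp;

        if (((w - 0x01010101u) & ~w & 0x80808080u) != 0) {
            /* Step 3: some byte of this word is zero; find the first. */
            p = (const char *)wp;
            if (p[0] == '\0') return (size_t)(p - str);
            if (p[1] == '\0') return (size_t)(p - str) + 1;
            if (p[2] == '\0') return (size_t)(p - str) + 2;
            return (size_t)(p - str) + 3;
        }
        wp++;
    }
}
```

Step 3 checks bytes through a `char` pointer rather than decoding the mask, so the result does not depend on endianness.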

------
AlexMeyer
A cool page is Paul Hsieh's webpage. There you will find some good discussion
on the trick used. <http://www.azillionmonkeys.com/qed/asmexample.html> Look
for part 5: "A fast implementation of strlen()".

------
huhtenberg
Interesting, but this is somewhat pointless if you ask me.

It relies on certain assumptions that are not part of the C standard, so
while this code works on the majority of platforms, it is not _portable_ C
code. In which case going all the way down to the assembly level makes more
sense. Especially considering there are typically dedicated CPU instructions
for the exact purpose of searching a zero in a contiguous block of memory.
Something like "rep scasb" on x86.
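
For reference, a sketch of what that looks like as GCC/Clang inline assembly. The scan for a zero byte uses the `repne` form of the prefix (repeat while not equal). This assumes an x86 target, AT&T-syntax inline asm, and the direction flag being clear as the usual ABIs require. The name `strlen_scasb` is invented, and other targets get a plain loop.

```c
#include <assert.h>
#include <stddef.h>

static size_t strlen_scasb(const char *s)
{
    const char *p = s;

    assert(s != NULL);
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    {
        size_t count = (size_t)-1;   /* effectively "no limit" */

        /* Compare AL (0) against the byte at [DI], advance DI and
           decrement CX, repeating until a match is found. */
        __asm__ volatile(
            "repne\n\t"
            "scasb"
            : "+D"(p), "+c"(count)
            : "a"(0)
            : "cc", "memory");
        /* DI was advanced past the terminator, so back off by one. */
        return (size_t)(p - s) - 1;
    }
#else
    /* Fallback for non-x86 targets so the sketch still builds. */
    while (*p)
        p++;
    return (size_t)(p - s);
#endif
}
```

Whether this beats a word-at-a-time loop depends on the CPU and the string length, so it is worth measuring before relying on it.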

~~~
axod
rep scasb is horribly slow.

~~~
tptacek
It's 2x slower per iteration, but much faster to invoke, and friendlier to the
microarchitecture. It's not as simple as you're making it out to be.

Also, using strlen() on a 4k string at all borders on malpractice.

~~~
axod
Go have the argument with glibc then (And all the other people who use
optimizations like this).

"friendlier to the micro-architecture" doesn't even make sense. Check the chip
timings for rep scasb. It's not friendly.

We're not really discussing whether you should be using strlen on large strings
or not, but even if it's used say a million times on strings of length 80 or
so, you'd see an improvement worth having.

Check any assembly language forum, book, etc and there will be discussion on
why rep scasb/movsb/cmpsb are lame.

Would you implement string copy with rep movsb as well?

~~~
tptacek
Friendlier to the microarchitecture means: fewer branches, fewer BTB entries,
less impact on the icache. Sorry, you wrote like you might have already known
that.

You know that VC++ _does_ implement copies with movsd/movsb, right?

Sorry, I don't read a lot of books and forums on assembly programming. Just
the PRM. I'm just stuck reading/writing a lot of assembly on projects.

~~~
axod
>> Friendlier to the microarchitecture means: fewer branches, fewer BTB
entries, less impact on the icache. Sorry, you wrote like you might have
already known that.

If it's less clock cycles to do branching and comparing by dword (which it is
for medium to long strings) than doing rep scasb, then what else matters...?

>> You know that VC++ does implement copies with movsd/movsb, right?

I've stepped through VC++ string copy code in softice many a time thanks.

Notice how I was asking about 'movsb', and you replied with 'movsd/movsb'.
Notice the difference?

~~~
tptacek
I have no idea what points you're trying to make here.

You can trade per-byte cycle counts for lower cost to invoke the routine, and
for not evicting cache and BTB entries.

On your second point, I assumed it was the "rep" part of the instruction that
you were railing against. Apparently it's the "not knowing the difference
between a byte and a dword" part. That's awesome. You can have the last word,
if you'd like.

~~~
axod
rep movsb/rep movsd works well for moving data. However, you obviously can't
use that approach for searching for a 0. That's why the code is optimized as
it was. My point is that using rep scasb is suboptimal.

Don't know what you're talking about "lower cost to invoke the routine", and
the cache/BTB entries would be negligible on a small routine like this.

You seem kinda angry and bitter whenever you reply to me :/ Chill out eh.

~~~
tptacek
It costs cycles to call a C function. I seem angry and bitter all the time.
But my point is just, there's an argument in favor of scasb.

~~~
axod
So you're comparing inlined rep scasb, with non-inlined alternative.
Interesting comparison I guess.

Sure, it would bloat the code a little to inline the optimized version, but it
could be done in tight inner loops if required.

~~~
tptacek
I'm assuming you're not inlining a function with a loop in it, but OK, you can
also just expand the 7 insns everywhere you call strlen.

------
hvdm
A very quick test shows that the glibc version is more than 3 times faster for
strings of length 80 compared to the bsd version.

