
 Counting Characters in UTF-8 Strings Is Fast - nickb
http://canonical.org/~kragen/strlen-utf8.html
======
anonymous1239
I can't reproduce these results. Running his program, GCC's own strlen()
soundly beats all the other strlens in all three test cases. Compiled without
optimization (gcc -g -Wall -o mystrlen mystrlen.c mystrlen.s) with gcc 4.1.2
on an AMD Opteron 265, I get strlen() taking about 0.04 each time, compared
with 0.18 for my_strlen() and 0.06 to 0.08 for ap_strlen_utf8_s (the elite
assembly version). Although honestly, you should not expect GCC unoptimized to
beat your hand-coded assembly anyway! These tests need to be done with
optimization.

Compiled with optimization (gcc -g -Wall -std=c99 -O2 -march=i686 -pedantic
-o mystrlen mystrlen.c mystrlen.s), I get strlen() taking 0.02, versus
my_strlen() at 0.04 and ap_strlen_utf8_s (the awesome assembly one) at 0.06 --
200% slower than strlen().

So I'm not sure his results are right, even in the direction. The general
theme, though (that strlen is not a very expensive operation) is true.

Here are the complete results:

    % gcc -g -Wall -o mystrlen mystrlen.c mystrlen.s
    mystrlen.c: In function 'timethings':
    mystrlen.c:57: warning: implicit declaration of function 'my_strlen_utf8_s'
    mystrlen.c:59: warning: implicit declaration of function 'ap_strlen_utf8_s'
    % ./mystrlen
    "": 0 0 0 0 0 0
    "hello, world": 12 12 12 12 12 12
    "naïve": 6 6 6 5 5 5
    "こんにちは": 15 15 15 5 5 5
    1: all 'a':
    1: my_strlen(string) = 33554431: 0.176222
    1: strlen(string) = 33554431: 0.041959
    1: my_strlen_s(string) = 33554431: 0.157842
    1: my_strlen_utf8_s(string) = 33554431: 0.158211
    1: my_strlen_utf8_c(string) = 33554431: 0.307177
    1: ap_strlen_utf8_s(string) = 33554431: 0.063365
    2: all '\xe3':
    2: my_strlen(string) = 33554431: 0.180825
    2: strlen(string) = 33554431: 0.043419
    2: my_strlen_s(string) = 33554431: 0.166384
    2: my_strlen_utf8_s(string) = 33554431: 0.161270
    2: my_strlen_utf8_c(string) = 33554431: 0.309674
    2: ap_strlen_utf8_s(string) = 33554431: 0.064023
    3: all '\x81':
    3: my_strlen(string) = 33554431: 0.176232
    3: strlen(string) = 33554431: 0.042303
    3: my_strlen_s(string) = 33554431: 0.157776
    3: my_strlen_utf8_s(string) = 0: 0.083548
    3: my_strlen_utf8_c(string) = 0: 0.311597
    3: ap_strlen_utf8_s(string) = 0: 0.083935

    % gcc -g -Wall -std=c99 -O2 -march=i686 -pedantic -o mystrlen mystrlen.c mystrlen.s
    mystrlen.c: In function 'timethings':
    mystrlen.c:57: warning: implicit declaration of function 'my_strlen_utf8_s'
    mystrlen.c:59: warning: implicit declaration of function 'ap_strlen_utf8_s'
    % ./mystrlen
    "": 0 0 0 0 0 0
    "hello, world": 12 12 12 12 12 12
    "naïve": 6 6 6 5 5 5
    "こんにちは": 15 15 15 5 5 5
    1: all 'a':
    1: my_strlen(string) = 33554431: 0.039751
    1: strlen(string) = 33554431: 0.020602
    1: my_strlen_s(string) = 33554431: 0.158968
    1: my_strlen_utf8_s(string) = 33554431: 0.159315
    1: my_strlen_utf8_c(string) = 33554431: 0.075991
    1: ap_strlen_utf8_s(string) = 33554431: 0.063607
    2: all '\xe3':
    2: my_strlen(string) = 33554431: 0.039673
    2: strlen(string) = 33554431: 0.021445
    2: my_strlen_s(string) = 33554431: 0.159388
    2: my_strlen_utf8_s(string) = 33554431: 0.158209
    2: my_strlen_utf8_c(string) = 33554431: 0.127230
    2: ap_strlen_utf8_s(string) = 33554431: 0.063322
    3: all '\x81':
    3: my_strlen(string) = 33554431: 0.039742
    3: strlen(string) = 33554431: 0.020674
    3: my_strlen_s(string) = 33554431: 0.158717
    3: my_strlen_utf8_s(string) = 0: 0.082357
    3: my_strlen_utf8_c(string) = 0: 0.076103
    3: ap_strlen_utf8_s(string) = 0: 0.064756
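
For anyone puzzled by the 5-versus-6 counts for "naïve" and the 0 results on
the all-'\x81' input: one common way to count UTF-8 characters is to count
every byte that is not a continuation byte (bit pattern 10xxxxxx). Here is a
minimal sketch of that idea. The name count_utf8_chars is made up for
illustration and this is not necessarily what the article's functions do
internally. It assumes a NUL-terminated string and does no validation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not the article's code): counts UTF-8 code points
 * by counting every byte that is NOT a continuation byte (10xxxxxx).
 * Assumes s is a NUL-terminated string; does not validate the encoding. */
size_t count_utf8_chars(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        /* ASCII bytes and multi-byte lead bytes each start a character;
         * continuation bytes (top two bits == 10) are skipped. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

Under this rule a string made only of 0x81 bytes counts as zero characters,
because every byte looks like a continuation byte, while a string of 0xe3
bytes counts one per byte, because each looks like a lead byte.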

------
LogicHoleFlaw
Detractors of UTF-8 often point to the "inefficiency" of counting characters
as a reason not to use it for Unicode. I'm glad to see that this is not a
valid concern.

~~~
ajross
Detractors of UTF-8 are just wrong, period. On modern machines, main memory
bandwidth is the bottleneck: you get about 10 cycles or more of free
processing _per_ _byte_ of input. The counting cost is irrelevant. And it's
important to note that "how many glyphs
does this string contain" is a pretty obscure question to be asking in the
first place. What's the problem being solved here?

The beauty of UTF-8 is that all (all!) of the common string operations
continue to work as-is without change. strlen/strdup/strcat/strcpy? Yup, same
behavior. Linear search for substrings or regexes? Yup, same behavior.
Alphabetic comparison? Check. Just use it, and don't mess with encodings or
you _will_ mess things up. If you don't know how it works, just trust it. If
you think you know how it works and still believe it doesn't meet your
requirements, you're wrong.
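
To make the "same behavior" point concrete, here is a small sketch using only
the standard strstr and strcmp on UTF-8 bytes. The wrapper names
utf8_contains and utf8_compare are hypothetical and exist only for
illustration. Note that the ordering you get from strcmp is code point order,
which is not the same thing as language-aware collation.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical wrapper: substring search on UTF-8 is plain byte search.
 * Lead bytes and continuation bytes use disjoint ranges, so a valid
 * UTF-8 needle cannot match starting in the middle of a character. */
int utf8_contains(const char *haystack, const char *needle)
{
    assert(haystack != NULL && needle != NULL);
    return strstr(haystack, needle) != NULL;
}

/* Hypothetical wrapper: strcmp compares bytes as unsigned char, and for
 * valid UTF-8 that byte order matches code point order. */
int utf8_compare(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(a, b);
}
```

For example, utf8_compare("a", "\xc3\xa9") is negative ("a" sorts before
"é"), and utf8_contains finds "caf\xc3\xa9" inside a longer UTF-8 string
without any decoding step.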

