
Really good article. You'll get nothing from me but heartfelt agreement. I especially liked that the article gave actual numbers on how inefficient UTF-8 supposedly is for storing Asian text (not very, apparently).

Also insightful, but obvious in hindsight: not even in UTF-32 can you index a specific character in constant time, due to combining characters and grapheme clusters.

The one property I really love about UTF8 is that you get a free consistency check as not every arbitrary byte sequence is a valid UTF8 string.

This helps a lot with detecting encoding errors early (to this day, applications are known to lie about the encoding of their output).

And of course there's no endianness issue, removing the need for a BOM, which makes it possible for tools that operate at the byte level to still do the right thing.

If only it had better support outside of Unix.

For example, try opening a UTF-8 encoded CSV file (using characters outside of ASCII, of course) in Mac Excel (recent versions; before that, it didn't understand UTF-8 at all) for a WTF experience somewhere between comical and painful.

If there is one thing I could criticize about UTF-8, it would be its similarity to ASCII (which is also its greatest strength): it causes many applications and APIs to boldly declare UTF-8 compatibility when all they can really handle is ASCII, emitting a mess (or blowing up) once they have to deal with code points outside that range.

I jokingly call this US-UTF8 when I encounter it (all too often, unfortunately), but the proliferation of "cool" characters like the Emoji we recently got is likely to help with this over time.




"The one property I really love about UTF8 is that you get a free consistency check as not every arbitrary byte sequence is a valid UTF8 string."

You don't get this at all using UTF-8. You only get it if you attempt to decode the string which even something like strlen doesn't do. Strlen will happily give you wrong answers about how many characters are in a UTF-8 string all day long and never ever attempt to check the validity of the string. Take your valid UTF-8 and change one of the characters to null, now it doesn't work in many circumstances with 'UTF-8' code.

Also, should the free consistency check ever actually trip, you're in a bigger pickle: you now have to figure out whether the string is wrongly encoded UTF-8 or whether someone sent you extended ASCII.

I've done a lot of work with Unicode apps. I used to keep a series of about five strings that I could paste into a 'UNICODE' application to invariably break it.

One was an extended ASCII string that happened to be valid UTF-8 sans BOM :)

One was a UTF-8 string with a BOM and a 0x00 inside :) (I call this string "how to tell if it was written in C")

One was a UTF-8 string with a BOM :)

One was a UTF-8 string with some common Latin characters, a couple of Japanese ones, and a character outside the BMP.

Two were UTF-16 strings, in LE and BE, with and without a BOM.


>> "The one property I really love about UTF8 is that you get a free consistency check as not every arbitrary byte sequence is a valid UTF8 string."

>You don't get this at all using UTF-8. You only get it if you attempt to decode the string which even something like strlen doesn't do.

I wasn't talking about using strlen (aside from when I was jokingly talking about US-UTF8, where I've seen instances of strlen() being used against UTF-8 strings). I was talking about using library functions designed to handle UTF-8 encoded character data (which strlen() and friends are not).

What I meant with "free consistency check" was that any library function that is designed to deal with UTF-8 data is by default put into a position where it can quite safely determine whether the input data given to it is in fact in UTF-8 or not.

This is not true for any other character encoding I know of (I don't know about the legacy 2-byte Asian encodings at all).

In legacy 8-bit character sets, there's nothing you can do to check whether you've been lied to, aside from analyzing the content, trying to guess the language, and mapping that to the frequency of characters in the character set you were told the string is in (pretty much infeasible).

With UTF-16 you can at least use some heuristics if you're dealing with common English text (every second byte would be 0), but you can't be sure, especially when the text is primarily non-ASCII.

Only with UTF-8 can you take one look at input data and determine with quite a bit of confidence whether it is in fact UTF-8 (it might still be pure ASCII, but that qualifies as UTF-8 too).

If you ever get lied to and somebody tries to feed you ISO-8859-1 claiming it to be UTF-8 (happens all the f'ing time to me), then any library or application designed to deal with UTF-8 can immediately detect this and blow up before you store that data with no way of ever finding out what encoding it was actually in.


"One was an extended ASCII string that happend to be valid UTF-8 sans BOM :)"

Do you mean you pasted a string of bytes that was valid UTF-8 into an app expecting UTF-8, and it didn't decide to convert it to ISO 8859-something based on some heuristic?

Sounds like correct behavior to me.


> You only get it if you attempt to decode the string which even something like strlen doesn't do.

Because strlen() is a count of chars in a null-terminated char[], not a decoder. Ever. It's character set agnostic.

> Strlen will happily give you wrong answers about how many characters are in a UTF-8 string all day long and never ever attempt to check the validity of the string.

Because, again, strlen() counts chars in a null-terminated char[]. It is giving you the right answer, you are asking it the wrong question.

> Take your valid UTF-8 and change one of the characters to null, now it doesn't work in many circumstances with 'UTF-8' code.

Which means it's not a valid UTF-8 decoder, but is instead treating the buffer as Modified UTF-8[1].

> that I could paste into a 'UNICODE' application

Clipboards or pasteboards in many operating systems butcher character sets when copying and pasting text. Generally, the clipboard cannot be trusted to do the right thing in every circumstance. On Windows in particular, text can get converted to the system character set, or to something rather arbitrary, when it is copied.

> One was a UTF-8 string with a BOM and a 0x00 inside :) (I call this string "how to tell if it was written in C")

> One was a UTF-8 string with a BOM :)

Don't use the BOM[2] in UTF-8. It's recommended against.

So really, your point is that some implementations are bad, and you have a bag of tricks for breaking implementations that don't handle all corner cases? That's pretty universal even in the non-Unicode world; there are bad implementations of everything. Windows is an especially bad implementation of most things Unicode.

A valid decoder will, indeed, consistency-check an arbitrary string of bytes as UTF-8. The OP is correct, and your corner cases don't refute his point.

[1]: http://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8

[2]: http://en.wikipedia.org/wiki/UTF-8#Byte_order_mark


A name like strlen suggests that it's designed to take the length of a string; if it were called count_null_ter_char_array, I'd tend to believe you. It's not character-set agnostic, it's monotheistic at the shrine of ASCII; it's all over the coding style.

Null is valid UTF-8, it just doesn't work with C 'strings'. I can get null out of a UTF-8 encoder with no problem.

My point is that UTF-8 is nowhere near the panacea being described, and if you have to touch the strings themselves, it's far better to use UTF-16 in the vast majority of cases. The only time you really want to use UTF-8 is when you're dealing with legacy codebases; it's a massive hack.


I do not understand how UTF-16 could be better for this reason. wcslen() works exactly like strlen(), but on wide chars instead of chars.


> A name like strlen suggests that it's designed to take the length of a string

It is. A "string" in C is a char[] (there is no "string" type). A char is a type that is a number. That number has no meaning aside from being a number. Conveniently, you can assign a char like so:

    char foo = 'b';
That sets the variable foo, of type char, to the value 98. That the 98 means anything, in particular, the letter 'b' in many character sets, is a complete accident and completely orthogonal to char's purpose. A "string" in C is a collection of chars. That is all. No encoding (especially not "the shrine of ASCII"), no purpose beyond being an array of numbers that end in 0, just a bunch of numbers.

You are misunderstanding "strings" in C, and by extension, strlen(). This is not a problem with UTF-8. This is a problem with you misunderstanding the C library and basic types. If you don't believe me (I'm right, but, your call), you can certainly download the C99 spec and investigate what a "char" is, what a "string" is (hint: there isn't such a thing at all), and what "strlen()" is designed to be.

Here's a simple, naive strlen():

    size_t strlen(const char *string) {
        const char *p = string;
        while (*p) p++;       /* advance until the terminating NUL */
        return p - string;    /* number of chars before the NUL */
    }
That's it. No "monotheism at the shrine of ASCII". It counts chars until it finds 0. It is giving you the right answer. That you don't understand the answer is not UTF-8's (or C's) problem at all. Now, if you want to talk about printf(), I'm listening -- because you might be able to conjure up a point there -- but you are not talking about printf(). This, and other comments, are way off-base on how strlen() works.

> Null is valid UTF-8, it just doesn't work with C 'strings'.

Sure it does! I can store a null in a char[] all day long. That just changes its behavior when passed to something that counts the length of a char[] before a terminating null (like, wait for it, strlen()). Watch!

    char buf[8] = "abcde\0f";
What we have here is a buffer of length 8, which contains these char values:

    97 98 99 100 101 0 102 0
Now, strlen(buf) is 5. That's because that's what strlen is designed to do. The actual length of the buffer is, amazingly, still eight, and if your code expects to work with all eight chars in the char[], then by golly, it can.

If you are using strlen() with any expectation of character set awareness or human alphabet behavior, you completely misunderstand the purpose of strlen().

Since you're so adamant that UTF-16 is better (but you completely misunderstand how C's typing works), I'm less inclined to accept your opinion on UTF-8 being a "massive hack". Explain to me what strlen() on a buffer containing a UTF-16 string does -- and, why that's better -- and I might come around.


Yea, reminds me of DBCS. UTF-8, however, doesn't use bytes below 0x80 as anything other than ASCII characters, unlike some DBCS encodings such as Shift-JIS.



