
I use UTF-8 for transmitted data and disk I/O, and I use UCS-4 (wchar_t on Linux/FreeBSD) for internal representation of strings in my software.

I generally agree with this article, but I disagree with its claim that UTF-8 is the only appropriate encoding for strings stored in memory, and with its suggestion that wchar_t should be removed from the C++ standard or made sizeof 1, as in the Android NDK.

Let me explain why.

In UTF-8, a single Unicode character can be encoded in more than one way if the decoder is not strict. For example, NUL (U+0000) can be encoded as 00 or as C0 80. The second form is an overlong encoding: it is longer than necessary and forbidden by the standard, but a naive parser may still extract NUL from it. If UTF-8 input is not properly sanitized, or there is a bug in a charset converter, this can lead to exploits such as SQL injection or arbitrary filesystem access: a malicious party can encode not only NUL but also ", /, \, etc. this way.
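
To make the overlong case concrete, here is a minimal Python sketch. `naive_decode_2byte` and `strict_decode` are hypothetical helpers written for this illustration, not part of any library; the naive one simply masks and shifts the bits without checking that the result actually needs two bytes.

```python
def naive_decode_2byte(b0: int, b1: int) -> int:
    """Decode a 2-byte sequence by masking bits only (no overlong check)."""
    return ((b0 & 0x1F) << 6) | (b1 & 0x3F)


def strict_decode(data: bytes):
    """Return the decoded text, or None if the bytes are not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


overlong_nul = bytes([0xC0, 0x80])    # overlong form of U+0000
overlong_slash = bytes([0xC0, 0xAF])  # overlong form of U+002F '/'

# The naive decoder happily produces NUL and '/' from the illegal forms.
naive_nul = naive_decode_2byte(*overlong_nul)
naive_slash = chr(naive_decode_2byte(*overlong_slash))

# Python's built-in decoder rejects both sequences.
strict_nul = strict_decode(overlong_nul)
strict_slash = strict_decode(overlong_slash)
```

A filter that looks for a literal `/` byte before decoding would miss the overlong form, which is why the check has to happen in (or after) a strict decoder.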

Also, a UTF-8 string can't be cut at an arbitrary position. The bytes that encode one code point (UTF-8 runes) must be processed as a whole, so they must all end up either on the left side or on the right side of the cut.

Reversing a UTF-8 string is tricky, especially when the input contains illegal byte sequences and the corresponding replacement characters (U+FFFD) must be preserved in the output.
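
One way to sketch such a reversal in Python is to decode with the `replace` error handler, reverse the code points, and re-encode. `reverse_utf8` is a hypothetical helper for illustration; how many U+FFFD characters a malformed multi-byte run produces depends on the decoder, and this reverses code points, not user-perceived characters.

```python
def reverse_utf8(data: bytes) -> bytes:
    """Reverse UTF-8 text by code point, keeping U+FFFD for invalid bytes."""
    # errors="replace" substitutes U+FFFD for undecodable input
    # instead of raising UnicodeDecodeError.
    text = data.decode("utf-8", errors="replace")
    return text[::-1].encode("utf-8")


# 'a', 'ñ' (C3 B1), one invalid byte (FF), 'z'
sample = b"a\xc3\xb1\xffz"
reversed_sample = reverse_utf8(sample)
```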

I think UTF-8 is inevitable for network-transmitted data and disk I/O, but our software should keep all in-memory strings in UCS-4 only, and take adequate security precautions wherever conversion between UTF-8 and UCS-4 happens.

And sizeof(wchar_t)==4 in the GCC ABI is not a design defect; wchar_t exists for a good reason. I admit that sizeof(wchar_t)==2 on Windows is utterly broken.




Concerning "cut at an arbitrary position": UTF-8 is actually the only codec that can deterministically continue a broken stream, because the bytes that start a character are distinguishable from continuation bytes.
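
A rough Python sketch of that resynchronization, assuming the stream is otherwise valid UTF-8 and was merely picked up mid-character. `is_continuation` and `resync` are hypothetical names for this example; continuation bytes are the ones of the form 10xxxxxx.

```python
def is_continuation(byte: int) -> bool:
    """True for UTF-8 continuation bytes (bit pattern 10xxxxxx)."""
    return (byte & 0xC0) == 0x80


def resync(chunk: bytes) -> bytes:
    """Drop leading continuation bytes so the chunk starts on a character."""
    i = 0
    while i < len(chunk) and is_continuation(chunk[i]):
        i += 1
    return chunk[i:]


stream = "héllo".encode("utf-8")  # b'h\xc3\xa9llo'
broken = stream[2:]               # starts in the middle of 'é'
recovered = resync(broken)
```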


> Also UTF-8 string can't be cut at arbitrary position.

Neither can any other kind of Unicode string, because of combining characters. That's why the Unicode standard (or an Annex) recommends algorithms for text segmentation.
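
A small Python illustration of the combining-character problem, using the standard `unicodedata` module. `safe_to_cut` is a hypothetical helper and only a rough approximation: it checks the canonical combining class, which misses cases such as ZWJ sequences and variation selectors that a full segmentation algorithm would handle.

```python
import unicodedata


def safe_to_cut(text: str, index: int) -> bool:
    """Rough check: do not cut right before a combining mark."""
    if index <= 0 or index >= len(text):
        return True
    # A nonzero canonical combining class means the code point
    # attaches to the character before it.
    return unicodedata.combining(text[index]) == 0


word = "e\u0301x"  # 'e' + COMBINING ACUTE ACCENT + 'x'

cut_inside = safe_to_cut(word, 1)  # would separate 'e' from its accent
cut_after = safe_to_cut(word, 2)   # falls between the accented 'e' and 'x'
```

So even with one fixed-width unit per code point, slicing by index can still tear a user-visible character apart.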

(And if you really need to cut at a certain length, you can easily backtrack and find the beginning of the sequence by looking for the first byte that is not a continuation byte, i.e. whose top two bits are not 10.)
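
The backtracking can be sketched in a few lines of Python, assuming the input is valid UTF-8. `safe_cut` is a hypothetical helper; it moves the cut point left while the byte at the cut is a continuation byte (top two bits 10), so the cut always lands on a character boundary.

```python
def safe_cut(data: bytes, limit: int) -> bytes:
    """Truncate valid UTF-8 to at most `limit` bytes without splitting a character."""
    if limit >= len(data):
        return data
    cut = max(limit, 0)
    # If data[cut] is a continuation byte, the cut falls inside a character:
    # step back until it sits on the first byte of a sequence.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut]


data = "aé€".encode("utf-8")  # 1 + 2 + 3 = 6 bytes

two = safe_cut(data, 2)   # cut would split 'é', so only 'a' survives
five = safe_cut(data, 5)  # cut would split '€', so 'aé' survives
```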



