
UTF-8 is ASCII-compatible. Everything with the high bit cleared (bytes 0x00-0x7F) is represented identically to ASCII. All code points >= 0x80 are represented with multiple bytes, each of which has the high bit (0x80) set.

UTF-8 is a very elegant construct for Unix-type C systems: since no multi-byte sequence ever contains a 0x00 byte, you could basically reuse all your NUL-terminated string APIs.




Sure. But it’s not at all clear to me that this trick would actually handle multibyte utf-8 chars correctly.


Consider the codepoint U+1F4A9 ("PILE OF POO").

This encodes to the byte sequence F0 9F 92 A9 in UTF-8. Notice that every one of these bytes has a value > 0x7F, which means they're all outside the ASCII range.
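
If you want to check those bytes yourself, here is a minimal sketch in Python. The variable names are just for illustration; the only library call is the built-in `str.encode`.

```python
# Encode U+1F4A9 and inspect each byte of its UTF-8 representation.
poo = "\U0001F4A9"
encoded = poo.encode("utf-8")

# Hex dump of the encoded bytes, uppercase and space-separated.
hex_bytes = " ".join(f"{b:02X}" for b in encoded)
print(hex_bytes)  # F0 9F 92 A9

# Every byte has the high bit (0x80) set, so none falls in the ASCII range.
all_high = all(b & 0x80 for b in encoded)
print(all_high)  # True
```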

That's one of the useful properties of UTF-8: you know that a code point requiring multi-byte encoding will never contain any bytes that could be confused for ASCII, because every byte of a multi-byte code point will be > 0x7F.

Which in turn means that if you use any processing mechanism that only alters bytes which are in the ASCII range, and passes all other bytes through unmodified, you are guaranteed not to modify or corrupt any multi-byte UTF-8 sequences.
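
As a rough illustration of that guarantee, here is a sketch in Python. `ascii_upper` is a hypothetical stand-in for "a mechanism that only alters ASCII bytes", not necessarily the trick discussed upthread: it uppercases a-z byte by byte and leaves everything else alone.

```python
def ascii_upper(data: bytes) -> bytes:
    """Uppercase ASCII letters a-z; pass every other byte through unchanged."""
    # 0x61-0x7A is 'a'-'z'; subtracting 0x20 maps each to 'A'-'Z'.
    return bytes(b - 0x20 if 0x61 <= b <= 0x7A else b for b in data)

# Mixed input: ASCII letters, two 2-byte code points, and one 4-byte code point.
original = "h\u00e9llo \U0001F4A9 w\u00f6rld".encode("utf-8")
result = ascii_upper(original)

# The output is still valid UTF-8, and the multi-byte sequences are untouched.
decoded = result.decode("utf-8")
print(decoded)
```

Only the ASCII letters change. The bytes F0 9F 92 A9 come out exactly as they went in, because none of them falls in the range the function touches.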


Oh that’s interesting, I didn’t realize utf-8 had that nice property.



