UTF-8 is ASCII-compatible. Everything with the high bit cleared (characters 0x00-...

monochromatic · on Oct 7, 2018

Sure. But it’s not at all clear to me that this trick would actually handle multibyte utf-8 chars correctly.

ubernostrum · on Oct 7, 2018

Consider the codepoint U+1F4A9 ("PILE OF POO").

This encodes to the byte sequence F0 9F 92 A9 in UTF-8. Notice that every one of these bytes has a value > 0x7F, which means they're all outside the ASCII range.
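A quick way to check this yourself is sketched below in Python. The variable names are only illustrative; the encoding call is the standard `str.encode`.

```python
# Encode U+1F4A9 as UTF-8 and inspect the individual bytes.
encoded = "\U0001F4A9".encode("utf-8")

# Hex dump of each byte, space-separated.
hex_dump = " ".join(f"{b:02X}" for b in encoded)
print(hex_dump)

# True when every byte lies outside the ASCII range 0x00-0x7F.
all_non_ascii = all(b > 0x7F for b in encoded)
print(all_non_ascii)
```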

That's one of the useful properties of UTF-8: you know that a code point requiring multi-byte encoding will never contain any bytes that could be confused for ASCII, because every byte of a multi-byte code point will be > 0x7F.

Which in turn means that if you use any processing mechanism that only alters bytes which are in the ASCII range, and passes all other bytes through unmodified, you are guaranteed not to modify or corrupt any multi-byte UTF-8 sequences.
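As a rough illustration, here is a minimal Python sketch of such a mechanism. It assumes the processing in question is ASCII-only uppercasing done directly on bytes; `ascii_upper` is a made-up helper for this example, not necessarily the trick discussed upthread.

```python
def ascii_upper(data: bytes) -> bytes:
    """Uppercase ASCII letters a-z; pass every other byte through unchanged."""
    out = bytearray(data)
    for i, b in enumerate(out):
        # Only bytes 0x61-0x7A ('a'-'z') are touched. Bytes above 0x7F,
        # which make up every multi-byte UTF-8 sequence, never match.
        if 0x61 <= b <= 0x7A:
            out[i] = b - 0x20
    return bytes(out)


original = "héllo \U0001F4A9"
processed = ascii_upper(original.encode("utf-8"))

# The result is still valid UTF-8, and the non-ASCII characters survive intact.
print(processed.decode("utf-8"))
```

Note that "é" is left lowercase: the function knows nothing about non-ASCII letters, it simply never touches their bytes.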

monochromatic · on Oct 9, 2018

Oh that’s interesting, I didn’t realize utf-8 had that nice property.