
Which I assume stands for "Windows-Transformation-Format-8(bits)".





Abstract

WTF-8 (Wobbly Transformation Format − 8-bit) is a superset of UTF-8 that encodes surrogate code points if they are not in a pair.


Can you still assume the bytes 0x00 and 0xFF are not present in the string (like in UTF-8)?

Yes. The only difference between UTF-8 and WTF-8 is that the latter does not reject otherwise valid UTF-8 byte sequences that correspond to code points in the range U+D800 to U+DFFF (which means that, in practice, a lot of things that say they are UTF-8 are actually WTF-8).
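To make that difference concrete, here is a minimal sketch in Python. It assumes CPython's "surrogatepass" error handler is a fair stand-in for the relaxed encoder described above; the helper name encode_allowing_surrogates is made up for illustration and is not part of any WTF-8 library.

```python
def encode_allowing_surrogates(text: str) -> bytes:
    """Encode like UTF-8, but let surrogate code points through.

    Hypothetical helper: it relies on Python's "surrogatepass" error
    handler, which emits the ordinary 3-byte form for U+D800..U+DFFF.
    """
    return text.encode("utf-8", "surrogatepass")


# A string holding one unpaired (lone) high surrogate.
lone = "a\ud800b"

# Strict UTF-8 refuses to encode the lone surrogate.
try:
    lone.encode("utf-8")
    strict_accepts = True
except UnicodeEncodeError:
    strict_accepts = False

# The relaxed encoder emits the usual 3-byte pattern for U+D800.
relaxed = encode_allowing_surrogates(lone)
print(strict_accepts)  # False
print(relaxed)         # b'a\xed\xa0\x80b'
```

The bytes ED A0 80 follow the normal three-byte UTF-8 layout, which is why a decoder that skips the surrogate check accepts them without complaint.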

Not really, since you are unlikely to end up with unpaired surrogates outside of UTF-16 unless you explicitly implement a WTF-16 decoder. Most other things are going to error out or remove/replace the garbage data when converting to another encoding.

And if you convert valid UTF-16 by interpreting it as UCS-2 and then do not check for invalid code points, you are going to end up with either valid UTF-8 or something that isn't even valid WTF-8, since that encoding disallows encoding the two halves of a surrogate pair individually.
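A rough sketch of that last rule in Python, again leaning on the "surrogatepass" handler as an assumption. The function name is_well_formed_wtf8 is hypothetical, and this is an illustration of the rule as stated in this thread rather than a reference implementation: decode while tolerating surrogates, then reject any high surrogate immediately followed by a low surrogate.

```python
def is_well_formed_wtf8(data: bytes) -> bool:
    """Hypothetical check: UTF-8 plus lone surrogates, minus split pairs."""
    try:
        # Tolerate surrogate code points, but keep every other UTF-8 rule.
        text = data.decode("utf-8", "surrogatepass")
    except UnicodeDecodeError:
        return False
    # A high surrogate directly followed by a low surrogate means a pair
    # was encoded as two 3-byte sequences, which is not allowed.
    for first, second in zip(text, text[1:]):
        if "\ud800" <= first <= "\udbff" and "\udc00" <= second <= "\udfff":
            return False
    return True


lone_high = b"\xed\xa0\x80"               # U+D800 on its own
split_pair = b"\xed\xa0\xbd\xed\xb8\x80"  # U+D83D then U+DE00, encoded separately
proper = "\U0001F600".encode("utf-8")     # the same pair as one 4-byte sequence

print(is_well_formed_wtf8(lone_high))   # True
print(is_well_formed_wtf8(split_pair))  # False
print(is_well_formed_wtf8(proper))      # True
```

The split_pair bytes are what a naive UCS-2-style conversion of a valid UTF-16 pair would produce, which is the case described above as not even valid WTF-8.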

WTF-16 is something that occurs naturally. WTF-8 isn't.



