Which I assume stands for "Windows-Transformation-Format-8(bits)".

mmoskal · 2024-11-25T01:30:25 1732498225

Abstract

WTF-8 (Wobbly Transformation Format − 8-bit) is a superset of UTF-8 that encodes surrogate code points if they are not in a pair.

hedora · 2024-11-25T03:29:49 1732505389

Can you still assume that the bytes 0x00 and 0xFF are not present in the string (as in UTF-8)?

int_19h · 2024-11-25T19:08:13 1732561693

Yes, to the same extent as in UTF-8: 0xFF never occurs, and 0x00 occurs only as the encoding of U+0000. The only difference between UTF-8 and WTF-8 is that the latter does not reject otherwise valid UTF-8 byte sequences that correspond to code points in the range U+D800 to U+DFFF (which means that, in practice, a lot of things that say they are UTF-8 are actually WTF-8).
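
To make that concrete, here is a minimal Python sketch. It assumes CPython's "surrogatepass" error handler as a stand-in for a WTF-8 codec, which is only an approximation: it produces the same bytes for a lone surrogate, but it is not a conforming WTF-8 implementation. The variable names are just for illustration.

```python
# A lone (unpaired) high surrogate: a code point, but not a Unicode scalar value.
lone = "\ud800"

# Strict UTF-8 refuses to encode it.
try:
    lone.encode("utf-8")
    strict_accepts = True
except UnicodeEncodeError:
    strict_accepts = False

# "surrogatepass" emits the ordinary 3-byte form for the surrogate,
# which is the byte sequence WTF-8 uses for an unpaired surrogate.
wtf8_bytes = lone.encode("utf-8", "surrogatepass")

# Strict UTF-8 also refuses to decode those bytes.
try:
    wtf8_bytes.decode("utf-8")
    strict_decodes = True
except UnicodeDecodeError:
    strict_decodes = False

# The lenient decoder round-trips them back to the lone surrogate.
round_trip = wtf8_bytes.decode("utf-8", "surrogatepass")

print(strict_accepts, strict_decodes, wtf8_bytes.hex())  # False False eda080
```

Note that the extra sequences only use the lead byte 0xED followed by continuation bytes, so no byte value shows up that could not already show up in UTF-8.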

account42 · 2024-11-27T09:44:27 1732700667

Not really, since you are unlikely to end up with unpaired surrogates outside of UTF-16 unless you explicitly implement a WTF-16 decoder; most other things are going to error out or remove/replace the garbage data when converting to another encoding.

And if you convert valid UTF-16 by interpreting it as UCS-2 and then don't check for invalid code points, you are going to end up with either valid UTF-8 or something that isn't even valid WTF-8, since that encoding disallows encoding paired surrogates individually.

WTF-16 is something that occurs naturally. WTF-8 isn't.
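
A rough Python sketch of the UCS-2 case described above. Both functions are hypothetical helpers written for this comment, not part of any library: `ucs2_naive_to_bytes` imitates a converter that treats every UTF-16 code unit as a code point, and `is_wtf8` is a simplified validity check built on the "surrogatepass" error handler plus the rule that a high surrogate must not be directly followed by a low surrogate.

```python
import struct

def ucs2_naive_to_bytes(text: str) -> bytes:
    """Hypothetical converter: treats UTF-16 code units as UCS-2 code points."""
    utf16 = text.encode("utf-16-le")
    units = struct.unpack(f"<{len(utf16) // 2}H", utf16)
    # Encode each 16-bit unit on its own, never recombining surrogate pairs.
    return b"".join(chr(u).encode("utf-8", "surrogatepass") for u in units)

def is_wtf8(data: bytes) -> bool:
    """Simplified check: generalized UTF-8 with no high+low surrogate pair."""
    try:
        s = data.decode("utf-8", "surrogatepass")
    except UnicodeDecodeError:
        return False
    return not any(
        0xD800 <= ord(a) <= 0xDBFF and 0xDC00 <= ord(b) <= 0xDFFF
        for a, b in zip(s, s[1:])
    )

# BMP-only input: the naive converter happens to produce valid UTF-8.
bmp_only = ucs2_naive_to_bytes("h\u00e9llo")

# Astral input: the surrogate pair D83D DCA9 gets encoded as two 3-byte
# sequences, which is neither valid UTF-8 nor valid WTF-8.
astral = ucs2_naive_to_bytes("\U0001F4A9")

print(bmp_only == "h\u00e9llo".encode("utf-8"))  # True
print(astral.hex(), is_wtf8(astral))             # eda0bdedb2a9 False
```

So the naive converter never produces a lone-surrogate sequence from valid UTF-16; it either stays within UTF-8 or lands outside WTF-8 altogether.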