Hacker News

Which characters are not available in UTF-8 that warrant using WTF-8?



Invalid UTF-16 with unpaired surrogates. Or rather, WTF-8 is an alternate encoding of UCS-2, in the loose sense of arbitrary sequences of 16-bit code units. The subset of UCS-2 that is valid UTF-16 encodes to valid UTF-8 when encoded with WTF-8. The encoding is invertible: valid UTF-8 decodes to valid UTF-16, and any other well-formed WTF-8 byte sequence decodes to a 16-bit sequence containing unpaired surrogates.
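
To make the encoding direction concrete, here is a minimal Python sketch, not the reference implementation. The name `encode_wtf8` is made up for illustration, and the sketch relies on Python's `surrogatepass` error handler to emit the three-byte form of a lone surrogate. It takes a list of 16-bit code units, merges properly paired surrogates into one supplementary code point, and passes unpaired ones through unchanged.

```python
def encode_wtf8(units):
    """Encode 16-bit code units (possibly ill-formed UTF-16) as WTF-8.

    Hypothetical helper for illustration only.
    """
    out = bytearray()
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # A proper pair becomes one supplementary code point,
            # so it comes out as ordinary four-byte UTF-8.
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            # A BMP scalar value or an unpaired surrogate is kept as-is.
            cp = u
            i += 1
        # "surrogatepass" lets a lone surrogate through as three bytes.
        out += chr(cp).encode("utf-8", "surrogatepass")
    return bytes(out)


print(encode_wtf8([0x0068, 0x0069]).hex())  # 6869 (plain "hi")
print(encode_wtf8([0xD83D, 0xDE00]).hex())  # f09f9880 (same as UTF-8 for U+1F600)
print(encode_wtf8([0xD800]).hex())          # eda080 (lone high surrogate)
```

The first two outputs are ordinary UTF-8; only the last one falls outside it, which is the "superset" part.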


Just read the abstract:

> WTF-8 (Wobbly Transformation Format − 8-bit) is a superset of UTF-8 that encodes surrogate code points if they are not in a pair. It represents, in a way compatible with UTF-8, text from systems such as JavaScript and Windows that use UTF-16 internally but don’t enforce the well-formedness invariant that surrogates must be paired.


Ok, but in practice, what does this mean for the characters? Are there certain characters unavailable?


It’s the unpaired surrogate code points. That’s the whole thing. It’s about encoding ill-formed UTF-16, which is distressingly common in the real world.
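
A quick way to see this in practice is the Python sketch below. It assumes CPython's built-in `utf-8` codec, which is strict about surrogates by default, and the two helper names are hypothetical. A string holding half of a surrogate pair fails to encode as UTF-8, and the three bytes that WTF-8 would use for it fail to decode as UTF-8.

```python
def encodes_as_strict_utf8(text: str) -> bool:
    """Return True if `text` can be encoded as strict UTF-8."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


def decodes_as_strict_utf8(data: bytes) -> bool:
    """Return True if `data` is accepted by the strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


whole_emoji = "\U0001F600"  # a well-formed supplementary character
lone_high = "\ud83d"        # only the first half of its UTF-16 pair

print(encodes_as_strict_utf8(whole_emoji))        # True
print(encodes_as_strict_utf8(lone_high))          # False
print(decodes_as_strict_utf8(b"\xed\xa0\xbd"))    # False

# The generalized three-byte form that strict UTF-8 refuses:
print(lone_high.encode("utf-8", "surrogatepass").hex())  # eda0bd
```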


Broken emojis? Apparently there are known issues where some frameworks split Unicode text at the wrong boundaries; maybe the author saw that regularly turn into a deeper mess.


It’s not just broken emoji, it’s straight-up broken content: UTF-8 cannot represent unpaired surrogates.

WTF-8 is necessary for Rust’s compatibility with Windows filesystems (it underlies OsString on Windows), since file names there are sequences of UTF-16 code units and thus may contain unpaired surrogates.



