The WTF-8 encoding (simonsapin.github.io)
235 points by andrewaylett on May 27, 2015 | 104 comments

Aw man. I was using "WTF-8" to mean "Double UTF-8", as I described most recently at [1]. Double UTF-8 is that unintentionally popular encoding where someone takes UTF-8, accidentally decodes it as their favorite single-byte encoding such as Windows-1252, then encodes those characters as UTF-8.

[1] http://blog.luminoso.com/2015/05/21/ftfy-fixes-text-for-you-...
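For readers who have not run into it, one round of the "Double UTF-8" mix-up described above can be reproduced in a few lines of Python. This is only a sketch: the sample character and the variable names are illustrative assumptions, and Windows-1252 stands in for "their favorite single-byte encoding".

```python
original = "\u00e9"                      # 'é'

# Correct UTF-8 bytes for 'é': 0xC3 0xA9
utf8_bytes = original.encode("utf-8")

# The mistake: read those two bytes as two Windows-1252 characters
mojibake = utf8_bytes.decode("windows-1252")

# ...and then encode the wrong characters as UTF-8 again
double_utf8 = mojibake.encode("utf-8")

# Undoing one round means running the same steps backwards
repaired = double_utf8.decode("utf-8").encode("windows-1252").decode("utf-8")

print(repr(mojibake), double_utf8, repr(repaired))
```

Each extra round of the mistake adds one more encode/decode pair, which is why a repair has to be applied repeatedly until the text stops changing.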

It was such a perfect abbreviation, but now I probably shouldn't use it, as it would be confused with Simon Sapin's WTF-8, which people would actually use on purpose.

This is actually where the name is from, I found it too funny to pass up: https://simonsapin.github.io/wtf-8/#acknowledgments https://twitter.com/koalie/status/506821684687413248

Sorry for hijacking it!

>  the future of publishing at W3C

That is an amazing example.

It's not even "double UTF-8", it's UTF-8 six times (including the one to get it on the Web), it's been decoded as Latin-1 twice and Windows-1252 three times, and at the end there's a non-breaking space that's been converted to a space. All to represent what originated as a single non-breaking space anyway.

Which makes me happy that my module solves it.

    >>> from ftfy.fixes import fix_encoding_and_explain
    >>> fix_encoding_and_explain(" the future of publishing at W3C")
    ('\xa0the future of publishing at W3C',
     [('encode', 'sloppy-windows-1252', 0),
      ('transcode', 'restore_byte_a0', 2),
      ('decode', 'utf-8-variants', 0),
      ('encode', 'sloppy-windows-1252', 0),
      ('decode', 'utf-8', 0),
      ('encode', 'latin-1', 0),
      ('decode', 'utf-8', 0),
      ('encode', 'sloppy-windows-1252', 0),
      ('decode', 'utf-8', 0),
      ('encode', 'latin-1', 0),
      ('decode', 'utf-8', 0)])

Hey, is there any way I could automate this kind of fix? It'd be awesome for web scraping.

Automating this fix is precisely what I'm showing off. And yes, it's damn useful for web scraping.


Neato! I wrote a shitty version of 50% of that two years ago, when I was tasked with uncooking a bunch of data in a MySQL database as part of a larger migration to UTF-8. I hadn't done that much pencil-and-paper bit manipulation since I was 13.

Awesome module! I wonder if anyone else had ever managed to reverse-engineer that tweet before.

The term "WTF-8" has been around for a long time. Here's an example from 2008:


I love this.

    The key words "WHAT", "DAMNIT", "GOOD GRIEF", "FOR HEAVEN'S SAKE",
    in this memo are to be interpreted as described in [RFC2119].

What about Double-UTF-8 -> D-UTF-8 ->"Duty-F-8"

Duty Fate?

You really want to call this WTF (8)? Is it April 1st today? Am I the only one who thought this article was about a new funny project called the "what the fuck" encoding, like when somebody announced he had written a to_nil gem https://github.com/mrThe/to_nil ;) Sorry but I can't stop laughing.

This is intentional. I wish we didn’t have to do stuff like this, but we do and that’s the "what the fuck". All because the Unicode Committee in 1989 really wanted 16 bits to be enough for everybody, and of course it wasn’t.

The mistake is older than that. Wide character encodings in general are just hopelessly flawed.

WinNT, Java and a lot of other software use wide character encodings: UCS2/UTF-16(/UTF-32?). And it was added to C89/C++ (wchar_t). WinNT actually predates the Unicode standard by a year or so. http://en.wikipedia.org/wiki/Wide_character , http://en.wikipedia.org/wiki/Windows_NT#Development

Converting between UTF-8 and UTF-16 is wasteful, though often necessary.

> wide characters are a hugely flawed idea [parent post]

I know. Back in the early nineties they thought otherwise and were proud of using it; the flaw is clear only in hindsight. But nowadays UTF-8 is usually the better choice (except maybe for some Asian languages and more exotic, later-added scripts that may require more space with UTF-8). I am not saying UTF-16 would be a better choice then; there are certain other encodings for special cases.

And as the linked article explains, UTF-16 is a huge mess of complexity with back-dated validation rules that had to be added because it stopped being a wide-character encoding when the new code points were added. UTF-16, when implemented correctly, is actually significantly more complicated to get right than UTF-8.

UTF-32/UCS-4 is quite simple, though obviously it imposes a 4x penalty on bytes used. I don't know anything that uses it in practice, though surely something does.

Again: wide characters are a hugely flawed idea.

Sure, go to 32 bits per character. But it cannot be said to be "simple" and will not allow you to make the assumption that 1 integer = 1 glyph.

Namely it won't save you from the following problems:

    * Precomposed vs multi-codepoint diacritics (Do you write á with
      one 32 bit char or with two? If it's Unicode the answer is both)

    * Variation selectors (see also Han unification)

    * Bidi, RTL and LTR embedding chars

And possibly others I don't know about. I feel like I am learning of these dragons all the time.

I almost like that utf-16 and more so utf-8 break the "1 character, 1 glyph" rule, because it gets you in the mindset that this is bogus. Because in Unicode it is most decidedly bogus, even if you switch to UCS-4 in a vain attempt to avoid such problems. Unicode just isn't simple any way you slice it, so you might as well shove the complexity in everybody's face and have them confront it early.
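The first bullet above (one code point or two for á) can be checked directly. A minimal sketch using Python's standard `unicodedata` module; the variable names are illustrative, and the only point is that both spellings exist and that even a 32-bit encoding stores them with a different number of units.

```python
import unicodedata

precomposed = "\u00e1"     # á as a single code point
decomposed = "a\u0301"     # 'a' followed by COMBINING ACUTE ACCENT

# Same rendered character, different code point sequences
print(len(precomposed), len(decomposed))
print(precomposed == decomposed)

# Normalization maps one spelling onto the other
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)

# Number of 32-bit units each spelling needs in UTF-32
utf32_units = (
    len(precomposed.encode("utf-32-le")) // 4,
    len(decomposed.encode("utf-32-le")) // 4,
)
print(utf32_units)
```

Comparing strings after normalizing both sides to the same form is the usual way around the "answer is both" problem.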

If you use a 32-bit scheme, you can dynamically assign multi-character (extended) grapheme clusters to unused code units to get a fixed-width encoding.

Perl6 calls this NFG [1].

[1] http://design.perl6.org/S15.html

^ link currently broken, the plain-text version is at https://raw.githubusercontent.com/perl6/specs/master/S15-uni...

You can't use that for storage.

> The mapping between negative numbers and graphemes in this form is not guaranteed constant, even between strings in the same process.

What's your storage requirement that's not adequately solved by the existing encoding schemes?

What are you suggesting, store strings in UTF8 and then "normalize" them into this bizarre format whenever you load/save them purely so that offsets correspond to grapheme clusters? Doesn't seem worth the overhead to my eyes.

In-memory string representation rarely corresponds to on-disk representation.

Various programming languages (Java, C#, Objective-C, JavaScript, ...) as well as some well-known libraries (ICU, Windows API, Qt) use UTF-16 internally. How much data do you have lying around that's UTF-16?

Sure, more recently, Go and Rust have decided to go with UTF-8, but that's far from common, and it does have some drawbacks compared to the Perl6 (NFG) or Python3 (latin-1, UCS-2, UCS-4 as appropriate) model if you have to do actual processing instead of just passing opaque strings around.

Also note that you have to go through a normalization step anyway if you don't want to be tripped up by having multiple ways to represent a single grapheme.

NFG enables O(N) algorithms for character level operations.

The overhead is entirely wasted on code that does no character level operations.

For code that does do some character level operations, avoiding quadratic behavior may pay off handsomely.

i think linux/mac systems default to UCS-4, certainly the libc implementations of wcs* do.

i agree it's a flawed idea though. 4 billion characters seems like enough for now, but i'd guess UTF-32 will need extending to 64 too... and actually how about decoupling the size from the data entirely? it works well enough in the general case of /every type of data we know about/ that i'm pretty sure this specialised use case is not very special.

The Unixish C runtimes of the world use a 4-byte wchar_t. I'm not aware of anything in "Linux" that actually stores or operates on 4-byte character strings. Obviously some software somewhere must, but the overwhelming majority of text processing on your linux box is done in UTF-8.

That's not remotely comparable to the situation in Windows, where file names are stored on disk in a 16 bit not-quite-wide-character encoding, etc... And it's leaked into firmware. GPT partition names and UEFI variables are 16 bit despite never once being used to store anything but ASCII, etc... All that software is, broadly, incompatible and buggy (and of questionable security) when faced with new code points.

We don't even have 4 billion characters possible now. The Unicode range is only 0-10FFFF, and UTF-16 can't represent any more than that. So UTF-32 is restricted to that range too, despite what 32 bits would allow, never mind 64.

But we don't seem to be running out -- Planes 3-13 are completely unassigned so far, covering 30000-DFFFF. That's nearly 65% of the Unicode range completely untouched, and planes 1, 2, and 14 still have big gaps too.
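The percentage is easy to verify. A quick sketch of the arithmetic, taking the ranges quoted in the comment at face value (the constant names are illustrative):

```python
PLANE_SIZE = 0x10000               # 65,536 code points per plane
TOTAL = 0x10FFFF + 1               # whole code space, U+0000..U+10FFFF

# Planes 3 through 13, i.e. U+30000..U+DFFFF
unassigned = 0xDFFFF - 0x30000 + 1

total_planes = TOTAL // PLANE_SIZE
unassigned_planes = unassigned // PLANE_SIZE
share = unassigned / TOTAL

print(total_planes, unassigned_planes, f"{share:.1%}")
```

That is 11 of 17 planes, which is where "nearly 65%" comes from.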

> But we don't seem to be running out

The issue isn't the quantity of unassigned codepoints, it's how many private use ones are available, only 137,000 of them. Publicly available private use schemes such as ConScript are fast filling up this space, mainly by encoding block characters in the same way Unicode encodes Korean Hangul, i.e. by using a formula over a small set of base components to generate all the block characters.

My own surrogate scheme, UTF-88, implemented in Go at https://github.com/gavingroovygrover/utf88 , expands the number of UTF-8 codepoints to 2 billion as originally specified by using the top 75% of the private use codepoints as 2nd tier surrogates. This scheme can easily be fitted on top of UTF-16 instead. I've taken the liberty in this scheme of making 16 planes (0x10 to 0x1F) available as private use; the rest are unassigned.

I created this scheme to help in using a formulaic method to generate a commonly used subset of the CJK characters, perhaps in the codepoints which would be 6 bytes under UTF-8. It would be more difficult than the Hangul scheme because CJK characters are built recursively. If successful, I'd look at pitching the UTF-88 surrogation scheme for UTF-16 and having UTF-8 and UTF-32 officially extended to 2 billion characters.

What do you make of NFG, as mentioned in another comment below?

NFG uses the negative numbers down to about -2 billion as an implementation-internal private use area to temporarily store graphemes. This enables fast grapheme-based manipulation of strings in Perl 6. Such negative-numbered codepoints could only be used for private use in data interchange between 3rd parties if UTF-32 were used, though, because neither UTF-8 (even pre-2003) nor UTF-16 could encode them.


Yes. sizeof(wchar_t) is 2 on Windows and 4 on Unix-like systems, so wchar_t is pretty much useless. That's why C11 added char16_t and char32_t.

I'm wondering how common the "mistake" of storing UTF-16 values in wchar_t on Unix-like systems is? I know I thought I had my code carefully choosing between UTF-16 and UTF-32 based on the size of wchar_t, only to discover that one of the supposedly portable libraries I used had UTF-16 no matter how big wchar_t was.

Unix-like systems except for MirBSD, which uses a 16-bit wchar_t

Oh ok it's intentional. Thx for explaining the choice of the name. Not only with the name itself but also by explaining the reason behind the choice, you managed to get my attention. I will try to find out more about this problem, because I guess that as a developer this might have some impact on my work sooner or later and therefore I should at least be aware of it.

I wonder what will be next? Calling a sports association "WTF"?



to_nil is actually a pretty important function! Completely trivial, obviously, but it demonstrates that there's a canonical way to map every value in Ruby to nil. This is essentially the defining feature of nil, in a sense.

With typing the interest here would be more clear, of course, since it would be more apparent that nil inhabits every type.

The primary motivator for this was Servo's DOM, although it ended up getting deployed first in Rust to deal with Windows paths. We haven't determined whether we'll need to use WTF-8 throughout Servo—it may depend on how document.write() is used in the wild.

So we're going to see this on web sites. Oh, joy.

It's time for browsers to start saying no to really bad HTML. When a browser detects a major error, it should put an error bar across the top of the page, with something like "This page may display improperly due to errors in the page source (click for details)". Start doing that for serious errors such as Javascript code aborts, security errors, and malformed UTF-8. Then extend that to pages where the character encoding is ambiguous, and stop trying to guess character encoding.

The HTML5 spec formally defines consistent handling for many errors. That's OK, there's a spec. Stop there. Don't try to outguess new kinds of errors.

No. This is an internal implementation detail, not to be used on the Web.

As to draconian error handling, that’s what XHTML is about and why it failed. Just define a somewhat sensible behavior for every input, no matter how ugly.

Is there a roadmap for Servo on Windows7+ ? Is this the best start point to dive in: https://github.com/servo/servo/issues/1908 ?

Yes, that bug is the best place to start. We've future proofed the architecture for Windows, but there is no direct work on it that I'm aware of.

What does the DOM do when it receives a surrogate half from Javascript? I thought that the DOM APIs (e.g. createTextNode, innerHTML setter, setAttribute, HTMLInputElement.value setter, document.write) would all strip out the lone surrogate code units?

In current browsers they'll happily pass around lone surrogates. Nothing special happens to them (v. any other UTF-16 code-unit) till they reach the layout layer (where they obviously cannot be drawn).

I also gave a short talk at !!Con about this, with some Unicode history background: http://exyr.org/2015/!!Con_WTF-8/slides.pdf

I found this through https://news.ycombinator.com/item?id=9609955 -- I find it fascinating the solutions that people come up with to deal with other people's problems without damaging correct code. Rust uses WTF-8 to interact with Windows' UCS2/UTF-16 hybrid, and from a quick look I'm hopeful that Rust's story around handling Unicode should be much nicer than (say) Python or Java.

Have you looked at Python 3 yet? I'm using Python 3 in production for an internationalized website and my experience has been that it handles Unicode pretty well.

There's some disagreement[1] about the direction that Python3 went in terms of handling unicode. Pretty good read if you have a few minutes.

1 http://lucumr.pocoo.org/2014/1/5/unicode-in-2-and-3/

Not that great of a read. Stuff like:

> I have been told multiple times now that my point of view is wrong and I don't understand beginners, or that the “text model” has been changed and my request makes no sense.

"The text model has changed" is a perfectly legitimate reason to turn down ideas consistent with the previous text model and inconsistent with the current model. Keeping a coherent, consistent model of your text is a pretty important part of curating a language. One of Python's greatest strengths is that they don't just pile on random features, and keeping old crufty features from previous versions would amount to the same thing. To dismiss this reasoning is extremely shortsighted.

Many people who prefer Python3's way of handling Unicode are aware of these arguments. It isn't a position based on ignorance.

Hey, never meant to imply otherwise. In fact, even people who have issues with the py3 way often agree that it's still better than 2's.

http://lucumr.pocoo.org/2014/1/9/ucs-vs-utf8/ is a nice comparison of Python’s (2 and 3) and Rust’s Unicode handling.

Python 3 doesn't handle Unicode any better than Python 2, it just made it the default string. In all other aspects the situation has stayed as bad as it was in Python 2 or has gotten significantly worse. Good examples for that are paths and anything that relates to local IO when your locale is C.

> Python 3 doesn't handle Unicode any better than Python 2, it just made it the default string. In all other aspects the situation has stayed as bad as it was in Python 2 or has gotten significantly worse.

Maybe this has been your experience, but it hasn't been mine. Using Python 3 was the single best decision I've made in developing a multilingual website (we support English/German/Spanish). There's not a ton of local IO, but I've upgraded all my personal projects to Python 3.

Your complaint, and the complaint of the OP, seems to be basically, "It's different and I have to change my code, therefore it's bad."

My complaint is not that I have to change my code. My complaint is that Python 3 is an attempt at breaking as little compatibility with Python 2 as possible while making Unicode "easy" to use. They failed to achieve both goals.

Now we have a Python 3 that's incompatible to Python 2 but provides almost no significant benefit, solves none of the large well known problems and introduces quite a few new problems.

I have to disagree, I think using Unicode in Python 3 is currently easier than in any language I've used. It certainly isn't perfect, but it's better than the alternatives. I certainly have spent very little time struggling with it.

That is not quite true, in the sense that more of the standard library has been made unicode-aware, and implicit conversions between unicode and bytestrings have been removed. So if you're working in either domain you get a coherent view, the problem being when you're interacting with systems or concepts which straddle the divide or (even worse) may be in either domain depending on the platform. Filesystem paths is the latter, it's text on OSX and Windows — although possibly ill-formed in Windows — but it's bag-o-bytes in most unices. There Python 2 is only "better" in that issues will probably fly under the radar if you don't prod things too much.

There is no coherent view at all. Bytes still have methods like .upper() that make no sense at all in that context, while unicode strings with these methods are broken because these are locale dependent operations and there is no appropriate API. You can also index, slice and iterate over strings, all operations that you really shouldn't do unless you really know what you are doing. The API in no way indicates that doing any of these things is a problem.

Python 2 handling of paths is not good because there is no good abstraction over different operating systems, treating them as byte strings is a sane lowest common denominator though.

Python 3 pretends that paths can be represented as unicode strings on all OSes, that's not true. That is held up with a very leaky abstraction and means that Python code that treats paths as unicode strings and not as paths-that-happen-to-be-unicode-but-really-arent is broken. Most people aren't aware of that at all and it's definitely surprising.

On top of that implicit coercions have been replaced with implicit broken guessing of encodings for example when opening files.

When you say "strings" are you referring to strings or bytes? Why shouldn't you slice or index them? It seems like those operations make sense in either case but I'm sure I'm missing something.

On the guessing encodings when opening files, that's not really a problem. The caller should specify the encoding manually ideally. If you don't know the encoding of the file, how can you decode it? You could still open it as raw bytes if required.

I used strings to mean both. Byte strings can be sliced and indexed no problems because a byte as such is something you may actually want to deal with.

Slicing or indexing into unicode strings is a problem because it's not clear what unicode strings are strings of. You can look at unicode strings from different perspectives and see a sequence of codepoints or a sequence of characters, both can be reasonable depending on what you want to do. Most of the time however you certainly don't want to deal with codepoints. Python however only gives you a codepoint-level perspective.

Guessing encodings when opening files is a problem precisely because - as you mentioned - the caller should specify the encoding, not just sometimes but always. Guessing an encoding based on the locale or the content of the file should be the exception and something the caller does explicitly.

It slices by codepoints? That's just silly, so we've gone through this whole unicode everywhere process so we can stop thinking about the underlying implementation details but the api forces you to have to deal with them anyway.

Fortunately it's not something I deal with often but thanks for the info, will stop me getting caught out later.

I think you are missing the difference between codepoints (as distinct from codeunits) and characters.

And unfortunately, I'm not any more enlightened as to my misunderstanding.

I get that every different thing (character) is a different Unicode number (code point). To store / transmit these you need some standard (encoding) for writing them down as a sequence of bytes (code units, well depending on the encoding each code unit is made up of different numbers of bytes).

How is any of that in conflict with my original points? Or is some of my above understanding incorrect?

I know you have a policy of not replying to people so maybe someone else could step in and clear up my confusion.

Codepoints and characters are not equivalent. A character can consist of one or more codepoints. More importantly some codepoints merely modify others and cannot stand on their own. That means if you slice or index into a unicode string, you might get an "invalid" unicode string back. That is a unicode string that cannot be encoded or rendered in any meaningful way.
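A small sketch of what that looks like in Python 3, where indexing and slicing work on code points. The sample word and the variable names are illustrative assumptions; the word is spelled with a combining diaeresis on purpose so that the slice falls between a base letter and its modifier.

```python
import unicodedata

word = "u\u0308ber"      # "über" written as 'u' + COMBINING DIAERESIS

first = word[:1]         # just 'u': the diaeresis has been cut off
rest = word[1:]          # begins with a combining mark that has no base letter
glued = "x" + rest       # the stray mark now attaches to the 'x'

# Five code points for four user-perceived characters
print(len(word))

# A non-zero combining class marks a code point that modifies its predecessor
starts_with_modifier = unicodedata.combining(rest[0]) != 0
print(first, starts_with_modifier, glued)
```

Nothing here raises an error, which is the point of the complaint: the API gives no hint that the slice landed inside a character.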

Right, ok. I recall something about this - ü can be represented either by a single code point or by the letter 'u' followed by the modifier.

As the user of unicode I don't really care about that. If I slice characters I expect a slice of characters. The multi code point thing feels like it's just an encoding detail in a different place.

I guess you need some operations to get to those details if you need. Man, what was the drive behind adding that extra complexity to life?!

Thanks for explaining. That was the piece I was missing.

bytes.upper is the Right Thing when you are dealing with ASCII-based formats. It also has the advantage of breaking in less random ways than unicode.upper.

And I mean, I can't really think of any cross-locale requirements fulfilled by unicode.upper (maybe case-insensitive matching, but then you also want to do lots of other filtering).

> There Python 2 is only "better" in that issues will probably fly under the radar if you don't prod things too much.

Ah yes, the JavaScript solution.

Well, Python 3's unicode support is much more complete. As a trivial example, case conversions now cover the whole unicode range. This holds pretty consistently - Python 2's `unicode` was incomplete.

> It is unclear whether unpaired surrogate byte sequences are supposed to be well-formed in CESU-8.

According to the Unicode Technical Report #26 that defines CESU-8[1], CESU-8 is a Compatibility Encoding Scheme for UTF-16 ("CESU"). In fact, the way the encoding is defined, the source data must be represented in UTF-16 prior to converting to CESU-8. Since UTF-16 cannot represent unpaired surrogates, I think it's safe to say that CESU-8 cannot represent them either.

[1] http://www.unicode.org/reports/tr26/

From the article:

>UTF-16 is designed to represent any Unicode text, but it can not represent a surrogate code point pair since the corresponding surrogate 16-bit code unit pairs would instead represent a supplementary code point. Therefore, the concept of Unicode scalar value was introduced and Unicode text was restricted to not contain any surrogate code point. (This was presumably deemed simpler than only restricting pairs.)

This is all gibberish to me. Can someone explain this in layman's terms?

People used to think 16 bits would be enough for anyone. It wasn't, so UTF-16 was designed as a variable-length, backwards-compatible replacement for UCS-2.

Characters outside the Basic Multilingual Plane (BMP) are encoded as a pair of 16-bit code units. The numeric values of these code units denote codepoints that themselves lie within the BMP. While these values can be represented in UTF-8 and UTF-32, they cannot be represented in UTF-16. Because we want our encoding schemes to be equivalent, the Unicode code space contains a hole where these so-called surrogates lie.

Because not everyone gets Unicode right, real-world data may contain unpaired surrogates, and WTF-8 is an extension of UTF-8 that handles such data gracefully.
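The "pair of 16-bit code units" can be computed by hand. A minimal sketch of the standard UTF-16 surrogate arithmetic; the function names are illustrative, not taken from any library.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into (lead, trail) UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000          # fits in 20 bits
    lead = 0xD800 + (offset >> 10)         # top 10 bits
    trail = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return lead, trail


def from_surrogate_pair(lead, trail):
    """Recombine a (lead, trail) pair into the code point it stands for."""
    return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)


lead, trail = to_surrogate_pair(0x1F4A9)
print(hex(lead), hex(trail))
print(hex(from_surrogate_pair(lead, trail)))
```

The values 0xD800..0xDFFF used for `lead` and `trail` are exactly the "hole" in the code space: they are reserved so that a 16-bit unit in that range can never be mistaken for an ordinary BMP character.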

This was gibberish to me too. I researched it a bit and wrote an explanation that would have made sense to the 2-hours-ago me: https://news.ycombinator.com/item?id=9614641

Every term is linked to its definition. https://simonsapin.github.io/wtf-8/#terminology Does this help?

I understand that for efficiency we want this to be as fast as possible. Simple compression can take care of the wastefulness of using excessive space to encode text - so it really only leaves efficiency.

If I were to make a first attempt at a variable-length but well-defined, backwards-compatible encoding scheme, I would use something like the number of bits up to (and including) the first 0 bit as defining the number of bytes used for this character. So,

    0xxxxxxx, 1 byte
    10xxxxxx, 2 bytes
    110xxxxx, 3 bytes

We would never run out of codepoints, and legacy applications can simply ignore codepoints they don't understand. We would only waste 1 bit per byte, which seems reasonable given just how many problems encoding usually represents. Why wouldn't this work, apart from already existing applications that do not know how to do this?

That’s roughly how UTF-8 works, with some tweaks to make it self-synchronizing. (That is, you can jump to the middle of a stream and find the next code point by looking at no more than 4 bytes.)

As to running out of code points, we’re limited by UTF-16 (up to U+10FFFF). Both UTF-32 and UTF-8 unchanged could go up to 32 bits.
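To make the self-synchronizing property concrete, here is a rough sketch in Python. The helper names are hypothetical, and the code only looks at bit patterns; it does not validate that the input is well-formed UTF-8.

```python
def sequence_length(lead_byte):
    """Length of a UTF-8 sequence, judged only from its first byte."""
    if (lead_byte & 0x80) == 0x00:
        return 1                      # 0xxxxxxx
    if (lead_byte & 0xE0) == 0xC0:
        return 2                      # 110xxxxx
    if (lead_byte & 0xF0) == 0xE0:
        return 3                      # 1110xxxx
    if (lead_byte & 0xF8) == 0xF0:
        return 4                      # 11110xxx
    raise ValueError("continuation byte or invalid lead byte")


def next_boundary(data, pos):
    """Skip continuation bytes (10xxxxxx) to reach the next code point start."""
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


data = "a\u00e9\u20ac\U0001F600".encode("utf-8")   # 1 + 2 + 3 + 4 bytes
print(next_boundary(data, 2), next_boundary(data, 7))
```

The difference from the scheme proposed above is that continuation bytes get their own prefix (10xxxxxx), so a reader dropped at an arbitrary offset can tell it is mid-sequence and resynchronize within a few bytes.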

Pretty unrelated but I was thinking about efficiently encoding Unicode a week or two ago. I think there might be some value in a fixed length encoding but UTF-32 seems a bit wasteful. With Unicode requiring 21 (20.09) bits per code point packing three code points into 64 bits seems an obvious idea. But would it be worth the hassle for example as internal encoding in an operating system? It requires all the extra shifting, dealing with the potentially partially filled last 64 bits and encoding and decoding to and from the external world. Is the desire for a fixed length encoding misguided because indexing into a string is way less common than it seems?
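For what it's worth, the packing idea is simple to sketch. The following is a hypothetical illustration, not an existing format: three 21-bit code points stored in one 64-bit word, with the top bit left unused.

```python
import math

BITS = 21
MASK = (1 << BITS) - 1


def pack3(a, b, c):
    """Pack three code points into one integer that fits in 64 bits."""
    for cp in (a, b, c):
        if not 0 <= cp <= 0x10FFFF:
            raise ValueError("not a Unicode code point")
    return (a << (2 * BITS)) | (b << BITS) | c


def unpack3(word):
    """Recover the three code points from a packed word."""
    return (word >> (2 * BITS)) & MASK, (word >> BITS) & MASK, word & MASK


bits_needed = math.log2(0x110000)      # size of the code space in bits
word = pack3(0x10FFFF, 0x41, 0x1F4A9)
print(round(bits_needed, 2), word.bit_length(), unpack3(word))
```

Every read or write of a single code point needs a shift and a mask, and a string whose length is not a multiple of three needs a convention for the unused slots, which is the hassle the comment is asking about.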

Opinions: no it’s not worth the hassle. Yes, "fixed length" is misguided. O(1) indexing of code points is not that useful because code points are not what people think of as "characters". (See combining code points.) http://lucumr.pocoo.org/2014/1/9/ucs-vs-utf8/

When you use an encoding based on integral bytes, you can use the hardware-accelerated and often parallelized "memcpy" bulk byte moving hardware features to manipulate your strings.

But inserting a codepoint with your approach would require all downstream bits to be shifted within and across bytes, something that would be a much bigger computational burden. It's unlikely that anyone would consider saddling themselves with that for a mere 25% space savings over the dead-simple and memcpy-able UTF-32.

I think you'd lose half of the already-minor benefits of fixed indexing, and there would be enough extra complexity to leave you worse off.

In addition, there's a 95% chance you're not dealing with enough text for UTF-32 to hurt. If you're in the other 5%, then a packing scheme that's 1/3 more efficient is still going to hurt. There's no good use case.

Coding for variable-width takes more effort, but it gives you a better result. You can divide strings appropriate to the use. Sometimes that's code points, but more often it's probably characters or bytes.

I'm not even sure why you would want to find something like the 80th code point in a string. It's rare enough to not be a top priority.

Why this over, say, CESU-8? Compatibility with UTF-8 systems, I guess?

According to the article, they wanted a superset of UTF-8, which CESU-8 is not. https://simonsapin.github.io/wtf-8/#cesu-8

Yes. For example, this allows the Rust standard library to convert &str (UTF-8) to &std::ffi::OsStr (WTF-8 on Windows) without converting or even copying data.

An interesting possible application for this is JSON parsers. If JSON strings contain unpaired surrogate code points, they could either throw an error or encode as WTF-8. I bet some JSON parsers think they are converting to UTF-8, but are actually converting to GUTF-8.

If you want to preserve unpaired surrogates that are hex-encoded in JSON strings, WTF-8 could help. But it’s unclear to me that you should: https://tools.ietf.org/html/rfc7159#section-8.2
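As a concrete data point, here is how this plays out with Python 3's standard `json` module, as far as I understand its behavior: an escaped lone surrogate is accepted and comes back as a one-code-point string, which strict UTF-8 then refuses to encode. The variable names are illustrative.

```python
import json

# The JSON text is the seven characters "\ud800" inside double quotes
lone = json.loads('"\\ud800"')

# An escaped surrogate *pair* is combined into one supplementary code point
paired = json.loads('"\\ud83d\\udca9"')

try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# The 'surrogatepass' error handler emits the three-byte generalized form
generalized = lone.encode("utf-8", "surrogatepass")

print(hex(ord(lone)), paired == "\U0001F4A9", strict_ok, generalized)
```

So a parser in this position has exactly the choice described above: raise an error, or carry the surrogate along in something that is no longer valid UTF-8.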

Serious question -- is this a serious project or a joke?

The name is unserious but the project is very serious, its writer has responded to a few comments and linked to a presentation of his on the subject[0]. It's an extension of UTF-8 used to bridge UTF-8 and UCS2-plus-surrogates: while UTF8 is the modern encoding you have to interact with legacy systems, for UNIX's bags of bytes you may be able to assume UTF8 (possibly ill-formed) but a number of other legacy systems used UCS2 and added visible surrogates (rather than proper UTF-16) afterwards.

Windows and NTFS, Java, UEFI, Javascript all work with UCS2-plus-surrogates. Having to interact with those systems from a UTF8-encoded world is an issue because they don't guarantee well-formed UTF-16, they might contain unpaired surrogates which can't be decoded to a codepoint allowed in UTF-8 or UTF-32 (neither allows unpaired surrogates, for obvious reasons).

WTF8 extends UTF8 with unpaired surrogates (and unpaired surrogates only, paired surrogates from valid UTF16 are decoded and re-encoded to a proper UTF8-valid codepoint) which allows interaction with legacy UCS2 systems.
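The rule in the parenthesis (pairs are combined, only unpaired surrogates are kept) can be sketched as follows. This is an illustration under stated assumptions, not the reference implementation: the function name is made up, the input is assumed to be a sequence of integers in the 16-bit range, and the byte output relies on Python's `surrogatepass` error handler.

```python
def utf16_units_to_wtf8(units):
    """Encode 16-bit code units (possibly ill-formed UTF-16) as WTF-8 bytes."""
    chars = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_lead = 0xD800 <= unit <= 0xDBFF
        has_trail = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_lead and has_trail:
            # Well-formed pair: decode to one supplementary code point
            code_point = 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            # BMP code unit or unpaired surrogate: keep the value as-is
            code_point = unit
            i += 1
        chars.append(chr(code_point))
    # 'surrogatepass' lets the remaining lone surrogates through as 3 bytes each
    return "".join(chars).encode("utf-8", "surrogatepass")


well_formed = utf16_units_to_wtf8([0x48, 0xD83D, 0xDCA9])
lone_lead = utf16_units_to_wtf8([0xD83D])
wrong_order = utf16_units_to_wtf8([0xDCA9, 0xD83D])
print(well_formed, lone_lead, wrong_order)
```

For well-formed input the result is byte-for-byte ordinary UTF-8; the output only differs from UTF-8 when the input contained a surrogate that had no partner.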

WTF8 exists solely as an internal encoding (in-memory representation), but it's very useful there. It was initially created for Servo (which may need it to have an UTF8 internal representation yet properly interact with javascript), but turned out to first be a boon to Rust's OS/filesystem APIs on Windows.

[0] http://exyr.org/2015/!!Con_WTF-8/slides.pdf

> WTF8 exists solely as an internal encoding (in-memory representation)


Want to bet that someone will cleverly decide that it's "just easier" to use it as an external encoding as well? This kind of cat always gets out of the bag eventually.

Better WTF8 than invalid UCS2-plus-surrogates. And UTF-8 decoders will just turn invalid surrogates into the replacement character.

The name might throw you off, but it's very much serious. It's like CESU-8 and Modified UTF-8, which both deal with various encoding issues in legacy systems by modifying UTF-8:


Note the WTF-8 entry has only been there for a few minutes, I just added it. It might be removed for non-notability.

  s/Note/Note that/

I thought he was tackling the other problem which is that you frequently find web pages that have both UTF-8 codepoints and single bytes encoded as ISO-latin-1 or Windows-1252

This is a solution to a problem I didn't know existed.

The nature of unicode is that there's always a problem you didn't (but should) know existed.

And because of this global confusion, everyone important ends up implementing something that somehow does something moronic - so then everyone else has yet another problem they didn't know existed and they all fall into a self-harming spiral of depravity.

Some time ago, I made some ASCII art to illustrate the various steps where things can go wrong:

    [user-perceived characters]
       [grapheme clusters] <-> [characters]
                ^                   ^
                |                   |
                v                   v
            [glyphs]           [codepoints] <-> [code units] <-> [bytes]

So basically it goes wrong when someone assumes that any two of the above are "the same thing". It's often implicit.

That's certainly one important source of errors. An obvious example would be treating UTF-32 as a fixed-width encoding, which is bad because you might end up cutting grapheme clusters in half, and you can easily forget about normalization if you think about it that way.
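A quick Python illustration of the fixed-width trap, as a sketch only: the example string is my own choice, and Python's `len()` counts code points, not grapheme clusters.

```python
import unicodedata

# One user-perceived character: 'e' followed by COMBINING ACUTE ACCENT.
decomposed = "e\u0301"
# NFC normalization composes it into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# UTF-32 is fixed-width per code point, but that is still 2 units here.
utf32_units = len(decomposed.encode("utf-32-le")) // 4

# Cutting after "one character" (one code point) drops the accent.
truncated = decomposed[:1]

print(len(decomposed), len(composed), utf32_units, truncated)
```

Both strings render the same, compare unequal, and have different lengths, which is exactly where the "code point == character" assumption bites.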

Then, it's possible to make mistakes when converting between representations, eg getting endianness wrong.
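For the endianness case, a small Python sketch (the euro sign is just an arbitrary example character):

```python
text = "\u20ac"  # EURO SIGN

# Little-endian UTF-16 stores the low byte first: AC 20.
le_bytes = text.encode("utf-16-le")

# Reading the same bytes as big-endian silently yields a different,
# perfectly valid code point (U+AC20) instead of an error.
wrong = le_bytes.decode("utf-16-be")

# With a byte order mark, the decoder can detect the byte order itself.
with_bom = text.encode("utf-16")
right = with_bom.decode("utf-16")

print(le_bytes, hex(ord(wrong)), right == text)
```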

Some issues are more subtle: in principle, the decision of what should be considered a single character may depend on the language, never mind the debate about Han unification - but as far as I'm concerned, that's a WONTFIX.

Let me see if I have this straight. My understanding is that WTF-8 is identical to UTF-8 for all valid UTF-16 input, but it can also round-trip invalid UTF-16. That is the ultimate goal.

Below is all the background I had to learn about to understand the motivation/details.

UCS-2 was designed as a 16-bit fixed-width encoding. When it became clear that 64k code points weren’t enough for Unicode, UTF-16 was invented to deal with the fact that UCS-2 was assumed to be fixed-width, but no longer could be.

The solution they settled on is weird, but has some useful properties. Basically they took a couple code point ranges that hadn’t been assigned yet and allocated them to a “Unicode within Unicode” coding scheme. This scheme encodes (1 big code point) -> (2 small code points). The small code points will fit in UTF-16 “code units” (this is our name for each two-byte unit in UTF-16). And for some more terminology, “big code points” are called “supplementary code points”, and “small code points” are called “BMP code points.”
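The (1 big code point) -> (2 small code points) mapping is just bit arithmetic. Here's a sketch in Python; the function names are made up for illustration, the constants are the standard surrogate ranges.

```python
from typing import Tuple

def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low

def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

high, low = to_surrogate_pair(0x1F4A9)
print(hex(high), hex(low))  # 0xd83d 0xdca9
```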

The weird thing about this scheme is that we bothered to make the “2 small code points” (known as a “surrogate” pair) into real Unicode code points. A more normal thing would be to say that UTF-16 code units are totally separate from Unicode code points, and that UTF-16 code units have no meaning outside of UTF-16. A number like 0xd801 could have a code unit meaning as part of a UTF-16 surrogate pair, and also be a totally unrelated Unicode code point.

But the one nice property of the way they did this is that they didn’t break existing software. Existing software assumed that every UCS-2 character was also a code point. These systems could be updated to UTF-16 while preserving this assumption.

Unfortunately it made everything else more complicated. Because now:

- UTF-16 can be ill-formed if it has any surrogate code units that don’t pair properly.

- we have to figure out what to do when these surrogate code points — code points whose only purpose is to help UTF-16 break out of its 64k limit — occur outside of UTF-16.
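To pin down what "don't pair properly" means, here's a sketch of a checker over a list of 16-bit code units. The function name and the list-of-ints input are assumptions for illustration, not anything from the spec.

```python
from typing import List, Sequence

def unpaired_surrogate_indices(units: Sequence[int]) -> List[int]:
    """Return the indices of surrogate code units that are not part of a
    well-formed (high, low) pair. An empty list means well-formed UTF-16."""
    bad = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed directly by a low one.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            bad.append(i)
        elif 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate that was not consumed as part of a pair.
            bad.append(i)
        i += 1
    return bad

print(unpaired_surrogate_indices([0x0041, 0xD83D, 0xDCA9]))  # []
print(unpaired_surrogate_indices([0xD83D, 0x0041, 0xDCA9]))  # [0, 2]
```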

This becomes particularly complicated when converting UTF-16 -> UTF-8. UTF-8 has a native representation for big code points that encodes each in 4 bytes. But since surrogate code points are real code points, you could imagine an alternative UTF-8 encoding for big code points: make a UTF-16 surrogate pair, then UTF-8 encode the two code points of the surrogate pair (hey, they are real code points!) into UTF-8. But UTF-8 disallows this and only allows the canonical, 4-byte encoding.

If you feel this is unjust and UTF-8 should be allowed to encode surrogate code points if it feels like it, then you might like Generalized UTF-8, which is exactly like UTF-8 except this is allowed. It’s easier to convert from UTF-16, because you don’t need any specialized logic to recognize and handle surrogate pairs. You still need this logic to go in the other direction though (GUTF-8 -> UTF-16), since GUTF-8 can have big code points that you’d need to encode into surrogate pairs for UTF-16.

If you like Generalized UTF-8, except that you always want to use surrogate pairs for big code points, and you want to totally disallow the UTF-8-native 4-byte sequence for them, you might like CESU-8, which does this. This makes both directions of CESU-8 <-> UTF-16 easy, because neither conversion requires special handling of surrogate pairs.
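To see the difference in bytes, here's a rough CESU-8-style encoder in Python. It leans on Python's `surrogatepass` error handler to encode a surrogate code point with the ordinary UTF-8 bit layout; the helper names are hypothetical and this is a sketch, not a validated CESU-8 implementation.

```python
from typing import List

def utf16_code_units(text: str) -> List[int]:
    """Return the UTF-16 code units of a string as integers."""
    raw = text.encode("utf-16-le", "surrogatepass")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]

def cesu8_encode(text: str) -> bytes:
    """Encode each UTF-16 code unit separately, so a supplementary code
    point becomes two 3-byte sequences instead of one 4-byte sequence."""
    return b"".join(
        chr(unit).encode("utf-8", "surrogatepass")
        for unit in utf16_code_units(text)
    )

pile = "\U0001F4A9"
print(pile.encode("utf-8").hex())   # f09f92a9      (UTF-8, 4 bytes)
print(cesu8_encode(pile).hex())     # eda0bdedb2a9  (CESU-8 style, 6 bytes)
```

For BMP-only text the two outputs are identical; they only diverge once supplementary code points show up.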

A nice property of GUTF-8 is that it can round-trip any UTF-16 sequence, even if it’s ill-formed (has unpaired surrogate code points). It’s pretty easy to get ill-formed UTF-16, because many UTF-16-based APIs don’t enforce wellformedness.

But both GUTF-8 and CESU-8 have the drawback that they are not UTF-8 compatible. UTF-8-based software isn’t generally expected to decode surrogate pairs — surrogates are supposed to be a UTF-16-only peculiarity. Most UTF-8-based software expects that once it performs UTF-8 decoding, the resulting code points are real code points (“Unicode scalar values”, which make up “Unicode text”), not surrogate code points.

So basically what WTF-8 says is: encode all code points as their real code point, never as a surrogate pair (like UTF-8, unlike GUTF-8 and CESU-8). However, if the input UTF-16 was ill-formed and contained an unpaired surrogate code point, then you may encode that code point directly with UTF-8 (like GUTF-8, not allowed in UTF-8).

So WTF-8 is identical to UTF-8 for all valid UTF-16 input, but it can also round-trip invalid UTF-16. That is the ultimate goal.
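Here's how I'd sketch that in Python, assuming the input is a list of 16-bit code units. The function names are mine, and the decoder below does not reject surrogate pairs encoded as two 3-byte sequences, which a real WTF-8 decoder should.

```python
from typing import List, Sequence

def wtf8_from_utf16_units(units: Sequence[int]) -> bytes:
    """Encode potentially ill-formed UTF-16 code units as WTF-8."""
    out = bytearray()
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        if is_high and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # Proper pair: combine into the real supplementary code point.
            cp = 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            # BMP code point, or an unpaired surrogate passed through as-is.
            cp = unit
            i += 1
        out += chr(cp).encode("utf-8", "surrogatepass")
    return bytes(out)

def utf16_units_from_wtf8(data: bytes) -> List[int]:
    """Decode WTF-8 back into UTF-16 code units (no validation of pairs)."""
    text = data.decode("utf-8", "surrogatepass")
    raw = text.encode("utf-16-le", "surrogatepass")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]

well_formed = [0x0048, 0xD83D, 0xDCA9]   # "H" + U+1F4A9
ill_formed = [0xD83D, 0x0041, 0xDCA9]    # two unpaired surrogates around "A"
print(wtf8_from_utf16_units(well_formed) == "H\U0001F4A9".encode("utf-8"))
print(utf16_units_from_wtf8(wtf8_from_utf16_units(ill_formed)) == ill_formed)
```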

By the way, one thing that was slightly unclear to me in the doc. In section 4.2 (https://simonsapin.github.io/wtf-8/#encoding-ill-formed-utf-...):

> If, on the other hand, the input contains a surrogate code point pair, the conversion will be incorrect and the resulting sequence will not represent the original code points.

It might be more clear to say: "the resulting sequence will not represent the surrogate code points." It might be by some fluke that the user actually intends the UTF-16 to interpret the surrogate sequence that was in the input. And this isn't really lossy, since (AFAIK) the surrogate code points exist for the sole purpose of representing surrogate pairs.

The more interesting case here, which isn't mentioned at all, is that the input contains unpaired surrogate code points. That is the case where the UTF-16 will actually end up being ill-formed.

The encoding that was designed to be fixed-width is called UCS-2. UTF-16 is its variable-length successor.

Thanks for the correction! I updated the post.

hmmm... wait... UCS-2 is just a broken UTF-16?!?!

I thought it was a distinct encoding and all related problems were largely imaginary provided you /just/ handle things right...

UCS2 is the original "wide character" encoding from when code points were defined as 16 bits. When codepoints were extended to 21 bits, UTF-16 was created as a variable-width encoding compatible with UCS2 (so UCS2-encoded data is valid UTF-16).

Sadly, systems which had previously opted for fixed-width UCS2, exposed that detail as part of a binary layer, and wouldn't break compatibility couldn't keep their internal storage at 16-bit code units and move the external API to 32.

What they did instead was keep their API exposing 16-bit code units and declare it was UTF16, except most of them didn't bother validating anything so they're really exposing UCS2-with-surrogates (not even surrogate pairs since they don't validate the data). And that's how you find lone surrogates traveling through the stars without their mate and shit's all fucked up.

The given history of UTF-16 and UTF-8 is a bit muddled.

> UTF-16 was redefined to be ill-formed if it contains unpaired surrogate 16-bit code units.

This is incorrect. UTF-16 did not exist until Unicode 2.0, which was the version of the standard that introduced surrogate code points. UCS-2 was the 16-bit encoding that predated it, and UTF-16 was designed as a replacement for UCS-2 in order to handle supplementary characters properly.

> UTF-8 was similarly redefined to be ill-formed if it contains surrogate byte sequences.

Not really true either. UTF-8 became part of the Unicode standard with Unicode 2.0, and so incorporated surrogate code point handling. UTF-8 was originally created in 1992, long before Unicode 2.0, and at the time was based on UCS. I'm not really sure it's relevant to talk about UTF-8 prior to its inclusion in the Unicode standard, but even then, encoding the code point range D800-DFFF was not allowed, for the same reason it was actually not allowed in UCS-2, which is that this code point range was unallocated (it was in fact part of the Special Zone, which I am unable to find an actual definition for in the scanned dead-tree Unicode 1.0 book, but I haven't read it cover-to-cover). The distinction is that it was not considered "ill-formed" to encode those code points, and so it was perfectly legal to receive UCS-2 that encoded those values, process it, and re-transmit it (as it's legal to process and retransmit text streams that represent characters unknown to the process; the assumption is the process that originally encoded them understood the characters). So technically yes, UTF-8 changed from its original definition based on UCS to one that explicitly considered encoding D800-DFFF as ill-formed, but UTF-8 as it has existed in the Unicode Standard has always considered it ill-formed.

> Unicode text was restricted to not contain any surrogate code point. (This was presumably deemed simpler that only restricting pairs.)

This is a bit of an odd parenthetical. Regardless of encoding, it's never legal to emit a text stream that contains surrogate code points, as these points have been explicitly reserved for the use of UTF-16. The UTF-8 and UTF-32 encodings explicitly consider attempts to encode these code points as ill-formed, but there's no reason to ever allow it in the first place as it's a violation of the Unicode conformance rules to do so. Because there is no process that can possibly have encoded those code points in the first place while conforming to the Unicode standard, there is no reason for any process to attempt to interpret those code points when consuming a Unicode encoding. Allowing them would just be a potential security hazard (which is the same rationale for treating non-shortest-form UTF-8 encodings as ill-formed). It has nothing to do with simplicity.
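As an illustration of both rules, a small Python sketch, assuming CPython's built-in `utf-8` codec behaves as a conforming strict codec (the helper name is mine):

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """True if a strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

overlong_slash = b"\xc0\xaf"          # non-shortest-form encoding of "/"
encoded_surrogate = b"\xed\xa0\x80"   # U+D800 in the UTF-8 bit layout

print(is_well_formed_utf8(b"/"))                # True
print(is_well_formed_utf8(overlong_slash))      # False
print(is_well_formed_utf8(encoded_surrogate))   # False

# The encoder side refuses surrogates as well.
try:
    "\ud800".encode("utf-8")
    encoder_accepts_surrogate = True
except UnicodeEncodeError:
    encoder_accepts_surrogate = False
print(encoder_accepts_surrogate)                # False
```

The overlong case is the classic hazard: a filter that scans bytes for "/" misses `C0 AF`, while a lenient decoder later turns it back into a slash.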
