
The WTF-8 encoding - andrewaylett
https://simonsapin.github.io/wtf-8/
======
rspeer
Aw man. I was using "WTF-8" to mean "Double UTF-8", as I described most
recently at [1]. Double UTF-8 is that unintentionally popular encoding where
someone takes UTF-8, accidentally decodes it as their favorite single-byte
encoding such as Windows-1252, then encodes _those_ characters as UTF-8.

[1] [http://blog.luminoso.com/2015/05/21/ftfy-fixes-text-for-
you-...](http://blog.luminoso.com/2015/05/21/ftfy-fixes-text-for-
you-4-0-changing-less-and-fixing-more/)

It was such a perfect abbreviation, but now I probably shouldn't use it, as it
would be confused with Simon Sapin's WTF-8, which people would actually use on
purpose.

~~~
SimonSapin
This is actually where the name is from, I found it too funny to pass up:
[https://simonsapin.github.io/wtf-8/#acknowledgments](https://simonsapin.github.io/wtf-8/#acknowledgments)
[https://twitter.com/koalie/status/506821684687413248](https://twitter.com/koalie/status/506821684687413248)

Sorry for hijacking it!

~~~
rspeer
> ÃƒÆ’Ã‚Æ’ÃƒÂ¢Ã‚â‚¬Ã‚Å¡ÃƒÆ’Ã‚â€šÃƒâ€šÃ‚Â the future of publishing at W3C

That is an amazing example.

It's not even "double UTF-8", it's UTF-8 _six times_ (including the one to get
it on the Web), it's been decoded as Latin-1 twice and Windows-1252 three
times, and at the end there's a non-breaking space that's been converted to a
space. All to represent what originated as a single non-breaking space anyway.

Which makes me happy that my module solves it.

    
    
        >>> from ftfy.fixes import fix_encoding_and_explain
        >>> fix_encoding_and_explain("ÃƒÆ’Ã‚Æ’ÃƒÂ¢Ã‚â‚¬Ã‚Å¡ÃƒÆ’Ã‚â€šÃƒâ€šÃ‚Â the future of publishing at W3C")
        ('\xa0the future of publishing at W3C',
         [('encode', 'sloppy-windows-1252', 0),
          ('transcode', 'restore_byte_a0', 2),
          ('decode', 'utf-8-variants', 0),
          ('encode', 'sloppy-windows-1252', 0),
          ('decode', 'utf-8', 0),
          ('encode', 'latin-1', 0),
          ('decode', 'utf-8', 0),
          ('encode', 'sloppy-windows-1252', 0),
          ('decode', 'utf-8', 0),
          ('encode', 'latin-1', 0),
          ('decode', 'utf-8', 0)])

~~~
voltagex_
Hey, is there any way I could automate this kind of fix? It'd be awesome for
web scraping.

~~~
rspeer
Automating this fix is precisely what I'm showing off. And yes, it's _damn_
useful for web scraping.

[https://github.com/LuminosoInsight/python-
ftfy](https://github.com/LuminosoInsight/python-ftfy)
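
If you just want the one-shot fix, ftfy's top-level fix_text() helper is the
usual entry point (a minimal sketch; the mojibake sample string is made up):

    
    
        # pip install ftfy
        import ftfy
        
        scraped = "The Mona Lisa doesnâ€™t have eyebrows."
        print(ftfy.fix_text(scraped))
        # -> The Mona Lisa doesn’t have eyebrows.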

------
chriswwweb
You really want to call this WTF (8)? Is it April 1st today? Am I the only one
who thought this article was about a new funny project called "what the fuck"
encoding, like when somebody announced he had written a to_nil gem
[https://github.com/mrThe/to_nil](https://github.com/mrThe/to_nil) ;) Sorry
but I can't stop laughing.

~~~
SimonSapin
This is intentional. I wish we didn’t have to do stuff like this, but we do
and that’s the "what the fuck". All because the Unicode Committee in 1989
really wanted 16 bits to be enough for everybody, and of course it wasn’t.

~~~
ajross
The mistake is older than that. Wide character encodings in general are just
hopelessly flawed.

~~~
frik
WinNT, Java and a lot more software use wide character encodings
UCS2/UTF-16(/UTF-32?). And it was added to C89/C++ (wchar_t). WinNT actually
predates the Unicode standard by a year or so.
[http://en.wikipedia.org/wiki/Wide_character](http://en.wikipedia.org/wiki/Wide_character)
,
[http://en.wikipedia.org/wiki/Windows_NT#Development](http://en.wikipedia.org/wiki/Windows_NT#Development)

Converting between UTF-8 and UTF-16 is wasteful, though often necessary.

> wide characters are a hugely flawed idea [parent post]

I know. Back in the early nineties they thought otherwise and were proud of
using it; in hindsight that was misguided. But nowadays UTF-8 is usually the
better choice (except maybe for some Asian and other later-added languages
that may require more space with UTF-8) - I am not saying UTF-16 would be a
better choice then; there are certain other encodings for special cases.

~~~
ajross
And as the linked article explains, UTF-16 is a huge mess of complexity with
back-dated validation rules that had to be added because _it stopped being a
wide-character encoding_ when the new code points were added. UTF-16, when
implemented correctly, is actually significantly _more_ complicated to get
right than UTF-8.

UTF-32/UCS-4 is quite simple, though obviously it imposes a 4x penalty on
bytes used. I don't know anything that uses it in practice, though surely
something does.

Again: wide characters are a hugely flawed idea.

~~~
jheriko
I think linux/mac systems default to UCS-4; certainly the libc implementations
of wcs* do.

I agree it's a flawed idea though. 4 billion characters seems like enough for
now, but I'd guess UTF-32 will need extending to 64 too... and actually how
about decoupling the size from the data entirely? It works well enough in the
general case of /every type of data we know about/ that I'm pretty sure this
specialised use case is not very special.

~~~
cpeterso
Yes. sizeof(wchar_t) is 2 on Windows and 4 on Unix-like systems, so wchar_t is
pretty much useless. That's why C11 added char16_t and char32_t.
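
A quick way to see the difference from Python, via the standard-library ctypes
module (just a sketch; the value depends on the platform and build):

    
    
        import ctypes
        
        # wchar_t is 2 bytes on Windows and 4 on typical Linux/macOS builds.
        print(ctypes.sizeof(ctypes.c_wchar))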

~~~
colomon
I'm wondering how common the "mistake" of storing UTF-16 values in wchar_t on
Unix-like systems is. I thought I had my code carefully deciding whether it
was UTF-16 or UTF-32 based on the size of wchar_t, only to discover that one
of the supposedly portable libraries I used produced UTF-16 no matter how big
wchar_t was.

------
pcwalton
The primary motivator for this was Servo's DOM, although it ended up getting
deployed first in Rust to deal with Windows paths. We haven't determined
whether we'll need to use WTF-8 throughout Servo—it may depend on how
document.write() is used in the wild.

~~~
Animats
So we're going to see this on web sites. Oh, joy.

It's time for browsers to start saying no to really bad HTML. When a browser
detects a major error, it should put an error bar across the top of the page,
with something like "This page may display improperly due to errors in the
page source (click for details)". Start doing that for serious errors such as
Javascript code aborts, security errors, and malformed UTF-8. Then extend that
to pages where the character encoding is ambiguous, and stop trying to guess
character encoding.

The HTML5 spec formally defines consistent handling for many errors. That's
OK, there's a spec. Stop there. Don't try to outguess new kinds of errors.

~~~
SimonSapin
No. This is an internal implementation detail, not to be used on the Web.

As to draconian error handling, that’s what XHTML is about and why it failed.
Just define a somewhat sensible behavior for every input, no matter how ugly.

------
SimonSapin
I also gave a short talk at !!Con about this, with some Unicode history
background:
[http://exyr.org/2015/!!Con_WTF-8/slides.pdf](http://exyr.org/2015/!!Con_WTF-8/slides.pdf)

------
andrewaylett
I found this through
[https://news.ycombinator.com/item?id=9609955](https://news.ycombinator.com/item?id=9609955)
\-- I find it fascinating the solutions that people come up with to deal with
other people's problems without damaging correct code. Rust uses WTF-8 to
interact with Windows' UCS2/UTF-16 hybrid, and from a quick look I'm hopeful
that Rust's story around handling Unicode should be much nicer than (say)
Python or Java.

~~~
copsarebastards
Have you looked at Python 3 yet? I'm using Python 3 in production for an
internationalized website and my experience has been that it handles Unicode
pretty well.

~~~
DasIch
Python 3 doesn't handle Unicode any better than Python 2; it just made unicode
the default string type. In all other aspects the situation has stayed as bad
as it was in Python 2 or has gotten significantly worse. Good examples of that
are paths and anything that relates to local I/O when your locale is C.

~~~
masklinn
That is not _quite_ true, in the sense that more of the standard library has
been made unicode-aware, and implicit conversions between unicode and
bytestrings have been removed. So if you're working in either domain you get a
coherent view, the problem being when you're interacting with systems or
concepts which straddle the divide or (even worse) may be in either domain
depending on the platform. Filesystem paths are the latter: they're text on OSX
and Windows — although possibly ill-formed on Windows — but bag-o-bytes in most
unices. There Python 2 is only "better" in that issues will probably
fly under the radar if you don't prod things too much.

~~~
DasIch
There is no coherent view at all. Bytes still have methods like .upper() that
make no sense at all in that context, while unicode strings with these methods
are broken because these are locale dependent operations and there is no
appropriate API. You can also index, slice and iterate over strings, all
operations that you really shouldn't do unless you really know what you are
doing. The API in no way indicates that doing any of these things is a
problem.

Python 2 handling of paths is not good because there is no good abstraction
over different operating systems, treating them as byte strings is a sane
lowest common denominator though.

Python 3 pretends that paths can be represented as unicode strings on all
OSes; that's not true. That is held up with a very leaky abstraction, which
means that Python code treating paths as unicode strings, and not as paths-
that-happen-to-be-unicode-but-really-aren't, is broken. Most people aren't
aware of that at all and it's definitely surprising.

On top of that implicit coercions have been replaced with implicit broken
guessing of encodings for example when opening files.

~~~
aidos
When you say "strings" are you referring to strings or bytes? Why shouldn't
you slice or index them? It seems like those operations make sense in either
case but I'm sure I'm missing something.

On the guessing encodings when opening files, that's not really a problem. The
caller should specify the encoding manually ideally. If you don't know the
encoding of the file, how can you decode it? You could still open it as raw
bytes if required.

~~~
DasIch
I used strings to mean both. Byte strings can be sliced and indexed without
problems because a byte as such is something you may actually want to deal
with.

Slicing or indexing into unicode strings is a problem because it's not clear
what unicode strings are strings of. You can look at unicode strings from
different perspectives and see a sequence of codepoints or a sequence of
characters, both can be reasonable depending on what you want to do. Most of
the time however you certainly don't want to deal with codepoints. Python
however only gives you a codepoint-level perspective.
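
A concrete illustration of the codepoint-level view, using a combining accent
(plain Python, nothing library-specific):

    
    
        s = "e\u0301"        # 'e' + COMBINING ACUTE ACCENT, displays as 'é'
        print(len(s))        # 2: Python counts codepoints, not perceived characters
        print(s[:1])         # 'e': the slice cuts the accent off
        print(s == "\u00e9") # False until you normalize (e.g. to NFC)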

Guessing encodings when opening files is a problem precisely because - as you
mentioned - the caller should specify the encoding, not just sometimes but
always. Guessing an encoding based on the locale or the content of the file
should be the exception and something the caller does explicitly.

~~~
aidos
It slices by codepoints? That's just silly: we've gone through this whole
unicode-everywhere process so we can stop thinking about the underlying
implementation details, but the API forces you to deal with them anyway.

Fortunately it's not something I deal with often but thanks for the info, will
stop me getting caught out later.

------
eridius
> _It is unclear whether unpaired surrogate byte sequences are supposed to be
> well-formed in CESU-8._

According to the Unicode Technical Report #26 that defines CESU-8[1], CESU-8
is a Compatibility Encoding Scheme for UTF-16 ("CESU"). In fact, the way the
encoding is defined, the source data _must_ be represented in UTF-16 prior to
converting to CESU-8. Since UTF-16 cannot represent unpaired surrogates, I
think it's safe to say that CESU-8 cannot represent them either.

[1]
[http://www.unicode.org/reports/tr26/](http://www.unicode.org/reports/tr26/)

~~~
SimonSapin
On further thought I agree.
[https://github.com/SimonSapin/wtf-8/commit/51abeef717a161ba9...](https://github.com/SimonSapin/wtf-8/commit/51abeef717a161ba9eea9624cd0a040d15bdbe7b)

------
j_jochem
From the article:

>UTF-16 is designed to represent any Unicode text, but it can not represent a
surrogate code point pair since the corresponding surrogate 16-bit code unit
pairs would instead represent a supplementary code point. Therefore, the
concept of Unicode scalar value was introduced and Unicode text was restricted
to not contain any surrogate code point. (This was presumably deemed simpler
than only restricting pairs.)

This is all gibberish to me. Can someone explain this in layman's terms?

~~~
cygx
People used to think 16 bits would be enough for anyone. It wasn't, so UTF-16
was designed as a variable-length, backwards-compatible replacement for UCS-2.

Characters outside the Basic Multilingual Plane (BMP) are encoded as a pair of
16-bit code units. The numeric values of these code units denote codepoints
that themselves lie within the BMP. While these values can be represented in
UTF-8 and UTF-32, they cannot be represented in UTF-16. Because we want our
encoding schemes to be equivalent, the Unicode code space contains a hole
where these so-called surrogates lie.

Because not everyone gets Unicode right, real-world data may contain unpaired
surrogates, and WTF-8 is an extension of UTF-8 that handles such data
gracefully.
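
To make the pair mechanism visible (a small Python illustration, nothing more):

    
    
        s = "\U0001F4A9"           # a codepoint outside the BMP
        print(s.encode("utf-16-be").hex())
        # 'd83ddca9': two 16-bit code units, both falling in the reserved
        # D800-DFFF surrogate range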

------
hvidgaard
I understand that for efficiency we want this to be as fast as possible.
Simple compression can take care of the wastefulness of using excessive space
to encode text - so it really only leaves efficiency.

If I was to make a first attempt at a variable length, but well defined,
backwards compatible encoding scheme, I would use something like the number of
bits up to (and including) the first 0 bit as defining the number of bytes used
for this character. So,

> 0xxxxxxx, 1 byte
> 10xxxxxx, 2 bytes
> 110xxxxx, 3 bytes

We would never run out of codepoints, and legacy applications can simply
ignore codepoints they don't understand. We would only waste 1 bit per byte,
which seems reasonable given just how many problems encodings usually cause.
Why wouldn't this work, apart from already existing applications that do not
know how to do this?

~~~
SimonSapin
That’s roughly how UTF-8 works, with some tweaks to make it self-
synchronizing. (That is, you can jump to the middle of a stream and find the
next code point by looking at no more than 4 bytes.)

As to running out of code points, we’re limited by UTF-16 (up to U+10FFFF).
Both UTF-32 and UTF-8 unchanged could go up to 32 bits.
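
The self-synchronization works because continuation bytes always look like
10xxxxxx, so a decoder can skip them to find the next boundary. A rough sketch
(the function name is made up):

    
    
        def next_codepoint_start(data: bytes, i: int) -> int:
            """Advance past position i to the start of the next code point."""
            i += 1
            while i < len(data) and (data[i] & 0xC0) == 0x80:  # continuation byte
                i += 1
            return i
        
        buf = "héllo".encode("utf-8")
        print(next_codepoint_start(buf, 1))  # 3: skips the second byte of 'é'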

------
danbruc
Pretty unrelated but I was thinking about efficiently encoding Unicode a week
or two ago. I think there might be some value in a fixed length encoding but
UTF-32 seems a bit wasteful. With Unicode requiring 21 (20.09) bits per code
point packing three code points into 64 bits seems an obvious idea. But would
it be worth the hassle for example as internal encoding in an operating
system? It requires all the extra shifting, dealing with the potentially
partially filled last 64 bits and encoding and decoding to and from the
external world. Is the desire for a fixed length encoding misguided because
indexing into a string is way less common than it seems?
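
(For concreteness, the packing itself is trivial; the question is whether the
shifting and the ragged tail are worth it. A throwaway sketch:)

    
    
        MASK21 = (1 << 21) - 1
        
        def pack3(a, b, c):
            # three 21-bit code points per 64-bit word, top bit unused
            return a | (b << 21) | (c << 42)
        
        def unpack3(w):
            return w & MASK21, (w >> 21) & MASK21, (w >> 42) & MASK21
        
        print([chr(cp) for cp in unpack3(pack3(ord("W"), ord("T"), ord("F")))])
        # ['W', 'T', 'F']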

~~~
SimonSapin
Opinions: no it’s not worth the hassle. Yes, "fixed length" is misguided. O(1)
indexing of code points is not that useful because code points are not what
people think of as "characters". (See combining code points.)
[http://lucumr.pocoo.org/2014/1/9/ucs-vs-
utf8/](http://lucumr.pocoo.org/2014/1/9/ucs-vs-utf8/)

------
TazeTSchnitzel
Why this over, say, CESU-8? Compatibility with UTF-8 systems, I guess?

~~~
i80and
According to the article, they wanted a superset of UTF-8, which CESU-8 is
not.
[https://simonsapin.github.io/wtf-8/#cesu-8](https://simonsapin.github.io/wtf-8/#cesu-8)

~~~
SimonSapin
Yes. For example, this allows the Rust standard library to convert &str
(UTF-8) to &std::ffi::OsStr (WTF-8 on Windows) without converting or even
copying data.

------
haberman
An interesting possible application for this is JSON parsers. If JSON strings
contain unpaired surrogate code points, a parser could either throw an error or
encode them as WTF-8. I bet some JSON parsers think they are converting to UTF-8,
but are actually converting to GUTF-8.
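
Python's json module is one data point here (behavior as observed in CPython 3;
the snippet is only an illustration): it hands back an unpaired surrogate from
a \u escape, and the result can't be encoded as strict UTF-8 afterwards.

    
    
        import json
        
        s = json.loads('"\\ud800"')        # a lone surrogate via a \u escape
        print(len(s), hex(ord(s)))         # 1 0xd800
        # s.encode("utf-8") raises UnicodeEncodeError; the permissive
        # 'surrogatepass' handler yields the generalized-UTF-8 bytes instead:
        print(s.encode("utf-8", "surrogatepass"))   # b'\xed\xa0\x80'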

~~~
SimonSapin
If you _want_ to preserve unpaired surrogates that are hex-encoded in JSON
strings, WTF-8 could help. But it’s unclear to me that you _should_ :
[https://tools.ietf.org/html/rfc7159#section-8.2](https://tools.ietf.org/html/rfc7159#section-8.2)

------
brokentone
Serious question -- is this a serious project or a joke?

~~~
masklinn
The name is unserious but the project is very serious, its writer has
responded to a few comments and linked to a presentation of his on the
subject[0]. It's an extension of UTF-8 used to bridge UTF-8 and UCS2-plus-
surrogates: while UTF8 is _the_ modern encoding you have to interact with
legacy systems, for UNIX's bags of bytes you may be able to assume UTF8
(possibly ill-formed) but a number of other legacy systems used UCS2 and added
visible surrogates (rather than proper UTF-16) afterwards.

Windows and NTFS, Java, UEFI, Javascript all work with UCS2-plus-surrogates.
Having to interact with those systems from a UTF8-encoded world is an issue
because they don't guarantee well-formed UTF-16, they might contain unpaired
surrogates which can't be decoded to a codepoint allowed in UTF-8 or UTF-32
(neither allows unpaired surrogates, for obvious reasons).

WTF8 extends UTF8 with unpaired surrogates (and unpaired surrogates only,
paired surrogates from valid UTF16 are decoded and re-encoded to a proper
UTF8-valid codepoint) which allows interaction with legacy UCS2 systems.
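
As a rough illustration of what that means for a lone surrogate, Python's
'surrogatepass' error handler happens to produce the same three bytes WTF8
prescribes for this case (unlike WTF8 it does not re-join valid pairs, so it's
closer to generalized UTF-8 overall):

    
    
        lone = "\ud800"                                  # unpaired surrogate
        wtf8ish = lone.encode("utf-8", "surrogatepass")
        print(wtf8ish)                                   # b'\xed\xa0\x80'
        print(wtf8ish.decode("utf-8", "surrogatepass") == lone)  # True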

WTF8 exists solely as an _internal_ encoding (in-memory representation), but
it's very useful there. It was initially created for Servo (which may need it
to have a UTF8 internal representation yet properly interact with
javascript), but turned out to first be a boon to Rust's OS/filesystem APIs on
Windows.

[0]
[http://exyr.org/2015/!!Con_WTF-8/slides.pdf](http://exyr.org/2015/!!Con_WTF-8/slides.pdf)

~~~
tjradcliffe
> WTF8 exists solely as an internal encoding (in-memory representation)

Today.

Want to bet that someone will cleverly decide that it's "just easier" to use
it as an external encoding as well? This kind of cat always gets out of the
bag eventually.

~~~
Dylan16807
Better WTF8 than invalid UCS2-plus-surrogates. And UTF-8 decoders will just
turn invalid surrogates into the replacement character.
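
E.g. in Python, a strict decode of those bytes raises, and a lossy decode
substitutes U+FFFD (a quick check, nothing WTF8-specific):

    
    
        bogus = b"\xed\xa0\x80 leaked out"        # lone surrogate, generalized UTF-8
        print(bogus.decode("utf-8", "replace"))   # surrogate bytes become U+FFFD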

------
udev
This is a solution to a problem I didn't know existed.

~~~
Veedrac
The nature of unicode is that there's _always_ a problem you didn't (but
should) know existed.

And because of this global confusion, everyone important ends up implementing
something that somehow does something moronic - so then everyone else has yet
another problem they didn't know existed and they all fall into a self-harming
spiral of depravity.

~~~
cygx
Some time ago, I made some ASCII art to illustrate the various steps where
things can go wrong:

    
    
        [user-perceived characters]
                    ^
                    |
                    v
           [grapheme clusters] <-> [characters]
                    ^                   ^
                    |                   |
                    v                   v
                [glyphs]           [codepoints] <-> [code units] <-> [bytes]

~~~
leni536
So basically it goes wrong when someone assumes that any two of the above are
"the same thing". It's often implicit.

~~~
cygx
That's certainly one important source of errors. An obvious example would be
treating UTF-32 as a fixed-width encoding, which is bad because you might end
up cutting grapheme clusters in half, and you can easily forget about
normalization if you think about it that way.

Then, it's possible to make mistakes when converting between representations,
eg getting endianness wrong.

Some issues are more subtle: in principle, the decision of what should be
considered a single character may depend on the language, never mind the debate
about Han unification - but as far as I'm concerned, that's a WONTFIX.

------
haberman
Let me see if I have this straight. My understanding is that WTF-8 is
identical to UTF-8 for all valid UTF-16 input, but it can also round-trip
invalid UTF-16. That is the ultimate goal.

Below is all the background I had to learn about to understand the
motivation/details.

—

UCS-2 was designed as a 16-bit fixed-width encoding. When it became clear that
64k code points wasn’t enough for Unicode, UTF-16 was invented to deal with
the fact that UCS-2 was assumed to be fixed-width, but no longer could be.

The solution they settled on is weird, but has some useful properties.
Basically they took a couple code point ranges that hadn’t been assigned yet
and allocated them to a “Unicode within Unicode” coding scheme. This scheme
encodes (1 big code point) -> (2 small code points). The small code points
will fit in UTF-16 “code units” (this is our name for each two-byte unit in
UTF-16). And for some more terminology, “big code points” are called
“supplementary code points”, and “small code points” are called “BMP code
points.”
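
For reference, the mapping itself is simple arithmetic (not spelled out above;
the constants come straight from the UTF-16 definition):

    
    
        cp = 0x1F600                 # a supplementary ("big") code point
        v = cp - 0x10000             # 20 bits left to distribute
        high = 0xD800 + (v >> 10)    # lead surrogate
        low = 0xDC00 + (v & 0x3FF)   # trail surrogate
        print(hex(high), hex(low))   # 0xd83d 0xde00
        print("\U0001F600".encode("utf-16-be").hex())   # 'd83dde00'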

The weird thing about this scheme is that we bothered to make the “2 small
code points” (known as a “surrogate” pair) into real Unicode code points. A
more normal thing would be to say that UTF-16 code _units_ are totally
separate from Unicode code _points_ , and that UTF-16 code _units_ have no
meaning outside of UTF-16. A number like 0xd801 could have a code _unit_
meaning as part of a UTF-16 surrogate pair, and _also_ be a totally unrelated
Unicode code _point_.

But the one nice property of the way they did this is that they didn’t break
existing software. Existing software assumed that every UCS-2 character was
also a code point. These systems could be updated to UTF-16 while preserving
this assumption.

Unfortunately it made everything else more complicated. Because now:

\- UTF-16 can be ill-formed if it has any surrogate code units that don’t pair
properly.

\- we have to figure out what to do when these surrogate code points — code
points whose only purpose is to help UTF-16 break out of its 64k limit — occur
outside of UTF-16.

This becomes particularly complicated when converting UTF-16 -> UTF-8. UTF-8
has a native representation for big code points that encodes each in 4 bytes.
But since surrogate code points are real code points, you could imagine an
alternative UTF-8 encoding for big code points: make a UTF-16 surrogate pair,
then UTF-8 encode the two code points of the surrogate pair (hey, they are
real code points!) into UTF-8. But UTF-8 disallows this and only allows the
canonical, 4-byte encoding.

If you feel this is unjust and UTF-8 should be allowed to encode surrogate
code points if it feels like it, then you might like Generalized UTF-8, which
is exactly like UTF-8 except this is allowed. It’s easier to convert from
UTF-16, because you don’t need any specialized logic to recognize and handle
surrogate pairs. You still need this logic to go in the other direction though
(GUTF-8 -> UTF-16), since GUTF-8 can have big code points that you’d need to
encode into surrogate pairs for UTF-16.

If you like Generalized UTF-8, except that you _always_ want to use surrogate
pairs for big code points, and you want to totally disallow the UTF-8-native
4-byte sequence for them, you might like CESU-8, which does this. This makes
both directions of CESU-8 <-> UTF-16 easy, because neither conversion requires
special handling of surrogate pairs.

A nice property of GUTF-8 is that it can round-trip any UTF-16 sequence, even
if it’s ill-formed (has unpaired surrogate code points). It’s pretty easy to
get ill-formed UTF-16, because many UTF-16-based APIs don’t enforce
wellformedness.

But both GUTF-8 and CESU-8 have the drawback that they are not UTF-8
compatible. UTF-8-based software isn’t generally expected to decode surrogate
pairs — surrogates are supposed to be a UTF-16-only peculiarity. Most
UTF-8-based software expects that once it performs UTF-8 decoding, the
resulting code points are real code points (“Unicode scalar values”, which
make up “Unicode text”), _not_ surrogate code points.

So basically what WTF-8 says is: encode all code points as their real code
point, _never_ as a surrogate pair (like UTF-8, unlike GUTF-8 and CESU-8).
However, if the input UTF-16 was ill-formed and contained an unpaired
surrogate code point, _then_ you may encode that code point directly with
UTF-8 (like GUTF-8, not allowed in UTF-8).

So WTF-8 is identical to UTF-8 for all valid UTF-16 input, but it can also
round-trip invalid UTF-16. That is the ultimate goal.

~~~
cygx
The encoding that was designed to be fixed-width is called UCS-2. UTF-16 is
its variable-length successor.

~~~
haberman
Thanks for the correction! I updated the post.

------
jheriko
hmmm... wait... UCS-2 is just a broken UTF-16?!?!

I thought it was a distinct encoding and all related problems were largely
imaginary provided you /just/ handle things right...

~~~
masklinn
UCS2 is the original "wide character" encoding from when code points were
defined as 16 bits. When codepoints were extended to 21 bits, UTF-16 was
created as a variable-width encoding compatible with UCS2 (so UCS2-encoded
data is valid UTF-16).

Sadly, systems which had previously opted for fixed-width UCS2, exposed that
detail as part of a binary layer, and wouldn't break compatibility could not
keep their internal storage at 16-bit code units while moving the external API
to 32 bits.

What they did instead was keep their API exposing 16-bit code units and
declare it was UTF16, except most of them didn't bother validating anything so
they're really exposing UCS2-with-surrogates (not even surrogate pairs since
they don't validate the data). And that's how you find lone surrogates
traveling through the stars without their mate and shit's all fucked up.

------
eridius
The given history of UTF-16 and UTF-8 is a bit muddled.

> _UTF-16 was redefined to be ill-formed if it contains unpaired surrogate
> 16-bit code units._

This is incorrect. UTF-16 did not exist until Unicode 2.0, which was the
version of the standard that introduced surrogate code points. UCS-2 was the
16-bit encoding that predated it, and UTF-16 was designed as a replacement for
UCS-2 in order to handle supplementary characters properly.

> _UTF-8 was similarly redefined to be ill-formed if it contains surrogate
> byte sequences._

Not really true either. UTF-8 became part of the Unicode standard with Unicode
2.0, and so incorporated surrogate code point handling. UTF-8 was originally
created in 1992, long before Unicode 2.0, and at the time was based on UCS.
I'm not really sure it's relevant to talk about UTF-8 prior to its inclusion
in the Unicode standard, but even then, encoding the code point range
D800-DFFF was not allowed, for the same reason it was actually not allowed in
UCS-2, which is that this code point range was unallocated (it was in fact
part of the Special Zone, which I am unable to find an actual definition for
in the scanned dead-tree Unicode 1.0 book, but I haven't read it cover-to-
cover). The distinction is that it was not considered "ill-formed" to encode
those code points, and so it was perfectly legal to receive UCS-2 that encoded
those values, process it, and re-transmit it (as it's legal to process and
retransmit text streams that represent characters unknown to the process; the
assumption is the process that originally encoded them understood the
characters). So technically yes, UTF-8 changed from its original definition
based on UCS to one that explicitly considered encoding D800-DFFF as ill-
formed, but UTF-8 as it has existed in the Unicode Standard has always
considered it ill-formed.

> _Unicode text was restricted to not contain any surrogate code point. (This
> was presumably deemed simpler than only restricting pairs.)_

This is a bit of an odd parenthetical. Regardless of encoding, it's never
legal to emit a text stream that contains surrogate code points, as these
points have been explicitly reserved for the use of UTF-16. The UTF-8 and
UTF-32 encodings explicitly consider attempts to encode these code points as
ill-formed, but there's no reason to ever allow it in the first place as it's
a violation of the Unicode conformance rules to do so. Because there is no
process that can possibly have encoded those code points in the first place
while conforming to the Unicode standard, there is no reason for any process
to attempt to interpret those code points when consuming a Unicode encoding.
Allowing them would just be a potential security hazard (which is the same
rationale for treating non-shortest-form UTF-8 encodings as ill-formed). It
has nothing to do with simplicity.

