
UTF-7: a ghost from the time before UTF-8 - luu
https://crawshaw.io/blog/utf7
======
devy
Along the same lines of Unicode horror stories, my personal favorite is
MySQL's original non-standard 3-byte Unicode implementation, called "utf8" and
later renamed to "utf8mb3". [1] It only covers the Basic Multilingual Plane
(BMP). The design decisions feel similar to those made for the MyISAM db
engine: in retrospect, they were the wrong optimizations and caused more
trouble than they brought benefits. It wasn't fixed until MySQL 5.5.3, which
introduced the "utf8mb4" charset. In a very obvious way, if you use emojis in
an application with a MySQL backend and aren't using the full 4-byte "utf8mb4"
charset, you will absolutely get bitten by that gotcha. [2][3][4]

[1]: https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8mb3.html

[2]: https://medium.com/@adamhooper/in-mysql-never-use-utf8-use-utf8mb4-11761243e434

[3]: https://stackoverflow.com/questions/202205/how-to-make-mysql-handle-utf-8-properly

[4]: https://www.eversql.com/mysql-utf8-vs-utf8mb4-whats-the-difference-between-utf8-and-utf8mb4/
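The failure mode is easy to see from byte lengths alone: anything outside the
BMP needs four bytes in UTF-8, which is exactly what a 3-byte "utf8" column
can't store. A quick Python illustration:

    for s in ["é", "中", "😀"]:  # Latin-1 supplement, BMP CJK, non-BMP emoji
        encoded = s.encode("utf-8")
        print(f"U+{ord(s):04X} {s!r} -> {len(encoded)} bytes")

    # U+00E9 'é' -> 2 bytes
    # U+4E2D '中' -> 3 bytes
    # U+1F600 '😀' -> 4 bytes   <- rejected or mangled by a utf8mb3 column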

~~~
avar
I wonder if anyone else has the experience of complaining about and pointing
this problem out for a decade or more before emojis became widespread, while
pretty much nobody cared ("they speak _what_ language?"). But as soon as
emojis became more widespread, suddenly proper Unicode support became
everyone's problem.

~~~
pkulak
Pretty smart to put the fun stuff at the end so that everyone has to worry
about everything in the middle.

~~~
bdamm
There must be a name for this approach to software design. That is, creating a
forcing function that must correctly handle the popular case which by proxy
handles a majority of less common but still important cases. (I'm aghast that
MySQL had this issue!)

~~~
marcodave
it might be called 🙂-dd (aka emoji-driven-design)

------
simias
> Even though this is 2018, occasionally someone will try to claim in
> conversation with me that UTF-16 is better than UTF-8.

I'm curious, what are their arguments? UTF-16 is pretty objectively bad at
everything: it's not compatible with ASCII and C strings the way UTF-8 is, and
it's not a decent constant-width encoding for codepoints the way UTF-32 is
(and even that is not necessarily useful, since you often care more about
characters than codepoints). AFAIK the only advantage of UTF-16 is that it's a
little more efficient when encoding text written in certain scripts.
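Both points are easy to demonstrate in Python: UTF-16 sprinkles NUL bytes
through plain ASCII text (which C's NUL-terminated strings choke on), while
UTF-8 leaves it byte-for-byte identical:

    text = "hello"

    print(text.encode("utf-8"))      # b'hello' -- identical to ASCII
    print(text.encode("utf-16-le"))  # b'h\x00e\x00l\x00l\x00o\x00' -- embedded NULs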

~~~
cryptonector
People who don't understand surrogate pairs and think that UTF-16 == UCS-2
tend to argue that UTF-16 is better because (in their minds) it's a fixed-
width character encoding (it's not).

Even UTF-32 is not a fixed-width character encoding!! It's a fixed-width
_codepoint_ encoding. Characters can be composed of more than one codepoint.
Even if you think "hey, I'll use pre-composed codepoints", you'll fail,
because not every legitimate, canonically decomposed character has a single-
codepoint pre-composition.
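A concrete example in Python: "m̃" (m plus a combining tilde, used e.g. in
Lithuanian dialect orthography) has no precomposed codepoint, so even NFC
leaves it as two codepoints:

    import unicodedata

    m_tilde = "m\u0303"  # 'm' + COMBINING TILDE -- one character, two codepoints
    nfc = unicodedata.normalize("NFC", m_tilde)

    print(len(nfc))                    # 2 -- no single-codepoint precomposition exists
    print([hex(ord(c)) for c in nfc])  # ['0x6d', '0x303']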

The disadvantages to UTF-16 far outweigh its one advantage (the one you
mentioned).

(There is one more advantage to UTF-16, and it's that Win32 uses it all over
the place. But that's not much of an advantage, and Windows does seem to be
making progress towards putting UTF-8 on an equal footing (if not better).)

~~~
avar
In practice most applications that require a chars(str) function can get away
with returning the wrong result for things outside the BMP, as opposed to
UTF-8, where you need to start caring as soon as you hit words like "café".

Even if you do require chars(str) for large strings outside the BMP, those
characters were so rare before emojis that you could spend a single bit on
"contains any non-BMP?" and almost always do the work in O(1) time, as opposed
to O(n) for UTF-8.
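A rough Python sketch of that trick (the class and its names are hypothetical;
real UTF-16 string types live in C or Java, this is just to show the shape):
scan once at construction, then answer length queries in O(1) whenever the
surrogate flag is clear.

    class UTF16String:
        """Hypothetical UTF-16 string caching a 'contains non-BMP?' flag."""

        def __init__(self, code_units: bytes):
            self.code_units = code_units  # UTF-16-LE bytes, two per code unit
            # One-time scan: a high byte in 0xD8-0xDB marks a high surrogate.
            self.has_non_bmp = any(
                0xD8 <= code_units[i + 1] <= 0xDB
                for i in range(0, len(code_units), 2)
            )

        def chars(self) -> int:
            if not self.has_non_bmp:
                return len(self.code_units) // 2  # O(1): one code unit per codepoint
            return len(self.code_units.decode("utf-16-le"))  # O(n) slow path

    print(UTF16String("café".encode("utf-16-le")).chars())  # 4
    print(UTF16String("😀".encode("utf-16-le")).chars())     # 1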

~~~
cryptonector
Sorry, that just generates garbage when dealing with things outside the BMP.
That can be a lot more common than you think, e.g., when dealing with Chinese
characters in a context where Han unification is not welcome (e.g., in China).

~~~
avar
Yes, you're right that it generates garbage, but that's besides the point.

The point is that a huge number of programmers, especially in the 90s and
early 00s would argue for UTF-16 on the basis of it being a fixed width
encoding _in practice_. Maybe they didn't know that it actually wasn't, or
they knew and didn't care because they never had to deal with anything outside
the BMP.

The overlap between Windows programmers producing software for e.g. in the
U.S. or European market and those that would have ever encountered a non-BMP
used to be _tiny_ until emojis came along.

So yes, while not in theory, in practice you could get away with treating
UTF-16 like fixed width encoding like UCS-2 for a huge number of applications
where you could reap the benefits of constant-time chars(str) and
charoffset(str, N).

~~~
cryptonector
The garbage is super annoying. Please stop. Human scripts are O(N), too bad.
You can build indices (must, for large documents), but you can't really avoid
this being O(N).

And we're not even talking about normalization.

People get upset about these things and blame Unicode, but the problems are
not with Unicode -- they are semantics problems with our scripts that Unicode
deals with about as well as can be hoped for.

The only thing I'd remove from Unicode is pre-compositions and the associated
normal forms NFC and NFKC. But note that that wouldn't remove the need for
normalization.
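For anyone following along, a quick Python illustration of the normal forms in
question: NFC applies the pre-compositions being complained about, and NFKC
additionally folds "compatibility" characters.

    import unicodedata

    decomposed = "cafe\u0301"  # 'e' + COMBINING ACUTE ACCENT

    print(unicodedata.normalize("NFC", decomposed))  # 'café' -- composed to U+00E9
    print(unicodedata.normalize("NFKC", "x\u00b2"))  # 'x2' -- superscript two folded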

------
rspeer
UTF-7 is "fun" because encoding libraries tend to support it, but since nobody
cares about it, edge cases in the implementation may go undiscovered for a
while.

Back on Python 2.7.5, the UTF-7 decoder didn't do range checking, so this
script [1] produced a "Unicode string" containing the codepoint U+DEADBEEF.
(The maximum valid codepoint is U+10FFFF.) This string would crash regexes,
corrupt databases, etc., so that allowed denial-of-service attacks against any
function that let you specify an arbitrary encoding.

(This is fixed in all extant versions of Python.)

[1]: https://gist.github.com/rspeer/7559750
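For context on how such a bug is even possible: UTF-7 packs UTF-16 code units
into a modified base64 section between '+' and '-', so a decoder has to
reassemble surrogate pairs and range-check the result. A Python 3
demonstration of the format (not of the exploit itself):

    print("café".encode("utf-7"))       # b'caf+AOk-'
    print("😀".encode("utf-7"))          # b'+2D3eAA-' -- surrogate pair D83D DE00
    print(b"+2D3eAA-".decode("utf-7"))  # '😀' -- decoder must pair and range-check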

------
perlgeek
Ok, I have a fun little UTF-7 story to share.

At $work, we run a heavily patched OTRS for keeping track of our tickets, and
have lots of systems automatically sending cron and other status mails to it
(bad, I know).

We got a bug report that the recipient email was displayed as Mojibake,
something like blabla+⻧⻯⻱@ourdomain.net

After digging into the source email and the OTRS code base, I found the
problem:

Some shitty MUA years ago failed to properly encode email headers and sent raw
8-bit "Subject:" headers. To deal with that, OTRS had a workaround that tried
to decode all (!) headers with the charset specified in the Content-Type
header, with a fallback to ASCII or Latin-1 if the decoding failed.

Now, a pretty old Windows system or application sent UTF-7 encoded emails to
something like blabla+autoreply=no+more=stuff-here@ourdomain.net, and OTRS
happily decoded the 'To:' header as UTF-7. And since UTF-7 is not ASCII-
transparent (a '+' starts an encoded run), that turned the address into
gibberish.

The "fix" was to only do the header decoding if the declared encoding is ASCII
compatible.
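A rough Python sketch of that kind of guard (the helper names are made up, and
OTRS does this in Perl): only trust the declared charset for header decoding
if it decodes the ASCII range unchanged.

    def is_ascii_compatible(charset: str) -> bool:
        """True if the charset decodes every ASCII byte to itself (hypothetical helper)."""
        sample = bytes(range(128))
        try:
            return sample.decode(charset) == sample.decode("ascii")
        except (UnicodeDecodeError, LookupError):
            return False

    def decode_header(raw: bytes, declared_charset: str) -> str:
        # Only apply the declared charset if it can't mangle plain-ASCII headers.
        if is_ascii_compatible(declared_charset):
            try:
                return raw.decode(declared_charset)
            except UnicodeDecodeError:
                pass
        return raw.decode("latin-1")  # fallback that never fails

    print(is_ascii_compatible("utf-8"))  # True
    print(is_ascii_compatible("utf-7"))  # False: '+' starts an encoded run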

The code still seems to be in OTRS today:
https://github.com/OTRS/otrs/blob/rel-6_0_14/Kernel/System/Encode.pm#L353

------
jperras
Many years ago, UTF-7 was a viable way to exploit Internet Explorer's somewhat
questionable default of interpreting a resource as UTF-7 if it found a UTF-7
sequence in the first 4096 bytes:

http://shiflett.org/blog/2005/googles-xss-vulnerability
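The trick was that a page sniffed as UTF-7 turns innocent-looking ASCII into
markup. In Python terms (the classic payload shape from writeups of that era):

    payload = b"+ADw-script+AD4-alert(1)+ADw-/script+AD4-"
    print(payload.decode("utf-7"))  # <script>alert(1)</script>

No literal '<' or '>' appears in the raw bytes, so filters that only escape
those characters pass the payload straight through.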

------
Sir_Cmpwn
UTF-7 is the worst, but far from the only one. When writing an email client
you will have to deal with no fewer than SIX distinct text encodings,
including several unique to IMAP.

~~~
michaelhoffman
Worse than UTF-EBCDIC?

------
fredley
Oh God, I was introduced to this while parsing attachment names from raw SMTP
emails. You think it's all fine until suddenly you come across a UTF-7-encoded
UTF-8 filename...

------
wglb
It is unfortunate that they left off RFC 4042
(https://tools.ietf.org/html/rfc4042), particularly since we are running out
of room in bytes to stuff more bits.

~~~
masklinn
> we are running out of room in bytes to stuff more bits.

The Unicode codespace is composed of 17 16-bit planes, and so has room for
1,114,112 codepoints. As of Unicode 11, 137,439 are allocated.

And while UTF-8 has been restricted to only 21 bits (which already allows
almost 1 million more USVs than are actually permitted), the encoding scheme
(and pre-2003 UTF-8) supports 31 bits of payload.
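You can see the post-2003 restriction in any modern decoder: 4-byte sequences
up to U+10FFFF are fine, but the old 5- and 6-byte lead bytes are rejected
outright. A quick Python check (the 5-byte sequence is what pre-2003 UTF-8
would have used for U+200000):

    print(chr(0x10FFFF).encode("utf-8"))  # b'\xf4\x8f\xbf\xbf' -- highest valid codepoint

    try:
        b"\xf8\xa0\x80\x80\x80".decode("utf-8")
    except UnicodeDecodeError as e:
        print(e)  # ... can't decode byte 0xf8 in position 0: invalid start byte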

~~~
gpvos
Note the date of the RFC.

~~~
wglb
Of all of the ones on that date, this is my favorite.

------
badrabbit
Working with email, this is such a painful thing to deal with. The IMAP legacy
is all over Outlook and email security/middleware appliances.

------
amaccuish
I'm looking forward to JMAP for the future of PIM.

------
codezero
For another "Unicode to ASCII" encoding, check out Punycode:
[https://en.wikipedia.org/wiki/Punycode](https://en.wikipedia.org/wiki/Punycode)
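Python ships codecs for both raw Punycode and the IDNA wrapping used for
domain names, for anyone who wants to poke at it:

    print("bücher".encode("punycode"))  # b'bcher-kva' -- ASCII part, then encoded deltas
    print("münchen".encode("idna"))     # b'xn--mnchen-3ya' -- Punycode with the ACE prefix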

------
maxxxxx
Is there anybody who really understands all the different encodings and
character sets? I always get a headache when I have to analyze a problem in
that area.

------
Dylan16807
> I looked inside three UTF-7 encoders and found they don't follow the RFC at
> all on this. Instead, they encode the UTF-16 to modified base64 without any
> zero bit padding, and then remove any base64 = padding from the result.

That sounds like padding to a " _character_ boundary" to me. I can't find
anywhere that defines the term as meaning an entire 3-byte/4-character base64
block.
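That matches what e.g. Python's encoder does; compare with standard base64 of
the same UTF-16 code units:

    import base64

    units = "é".encode("utf-16-be")  # b'\x00\xe9' -- the single UTF-16 code unit
    print(base64.b64encode(units))  # b'AOk=' -- standard base64 pads with '='
    print("é".encode("utf-7"))      # b'+AOk-' -- modified base64 drops the padding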

------
Aardwolf
I wish ASCII had a few more symbols and didn't waste 32 values on control
codes.

Missing characters imho:

fixed width space, degree symbol, copyright/trademark symbols, the opposite
direction of `, maybe a few more like pilcrow, section symbol, dagger, generic
currency symbol, card suits, arrows and a few mathematical ones like +-,
roughly equal, not equal, ...

Why do I care about ASCII here? Because programming language source code is
still written in it, and those are the symbols that appear on a standard US
keyboard layout.

I think those extra symbols would have made life easier in many cases, more
than some of the more obscure control characters ASCII has (such as the
useless carriage return vs. newline split we are still suffering from). They
could have gotten away with just 16 instead of 32 control characters imho,
even in the times when they had mechanical machines with bells :p

~~~
nothrabannosir
_> Why do I care about ASCII here? Because programming language source code is
still written in it, and those are the symbols that appear on a standard US
keyboard layout._

None of the programming languages I've used in the last ten years, save
perhaps Brainfuck, used ASCII. It was all Unicode, either through explicit
configuration at the top of the source file, or e.g. UTF-8 by default.

~~~
Aardwolf
I mean core language keywords and operators, not strings or custom variable
names. The core set is limited for good reason, but a few more symbols could
have helped; e.g. the degree symbol is very common and could have been used
for angles.

Did you know that in C++ you can have variables named with emoji? It compiles
with clang++ with -std=c++11 :)

APL is a famous non-ASCII programming language using arrows and such, but it
is very difficult to type on a modern computer, of course. Other than esoteric
languages, I don't know of any modern programming language using non-ASCII
symbols in its core keywords and operators.

~~~
lalaithion
Haskell has Unicode synonyms for many of it's constructs (enabled by a
compiler flag), so you can write eg. → instead of `->` and ∀ instead of
`forall`.

~~~
pmarreck
I get that "for free" via programming ligatures:

https://github.com/tonsky/FiraCode

Depending on your (irrational) feelings about fonts and typography, this may
either be the most amazing advance in coding readability you've ever
encountered, or a mild improvement, or even annoying. But I love them!

A number of coding-oriented fonts support them now:

https://medium.com/larsenwork-andreas-larsen/ligatures-coding-fonts-5375ab47ef8e

also, small grammar niggle: "it's" should only be used where it can be
replaced with "it is" and still make sense; otherwise it's always "its" :)

~~~
reificator
Fira code is great, but it can confuse others when they're looking over your
shoulder.

I consider it essential when working with Javascript, thanks to the
~~boneheaded and objectively wrong~~ personal-opinion-based decision to make
the strict equality operator longer than the coercive one. Making sure you
have the right one is important, and while linters can also solve the problem,
I like having it be very visual and obviously apparent.

Regarding `it's`:
https://www.youtube.com/watch?v=063jQAM6N8I

~~~
pmarreck
I thought I was familiar enough with Monty Python but I've never seen these,
hilarious! (also the gag at 3:10)

------
tracker1
Unless you're doing old-school serial/modem communication, I can't imagine
anyone wanting to use UTF-7.

