UTF-7: a ghost from the time before UTF-8 (crawshaw.io)
149 points by luu 3 months ago | 132 comments



Along the same lines of Unicode horror stories, my personal favorite is MySQL's original non-standard 3-byte Unicode implementation, called "utf8" and later renamed to "utf8mb3". [1] It only covers the Basic Multilingual Plane (BMP). The design decisions feel similar to those behind the MyISAM db engine: in retrospect, the wrong optimizations, causing more trouble than they brought benefits. It wasn't fixed until MySQL 5.5.3, with the introduction of the "utf8mb4" charset. In a very obvious way, if you use emojis in your application with a MySQL backend and are not using the full 4-byte "utf8mb4" charset, you will absolutely get bitten by that gotcha. [2][3][4]

[1]: https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8...

[2]: https://medium.com/@adamhooper/in-mysql-never-use-utf8-use-u...

[3]: https://stackoverflow.com/questions/202205/how-to-make-mysql...

[4]: https://www.eversql.com/mysql-utf8-vs-utf8mb4-whats-the-diff...
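
To make the gotcha concrete, here's a minimal Python sketch (nothing MySQL-specific is assumed): anything outside the BMP needs four UTF-8 bytes, which is exactly what a 3-byte "utf8" column cannot store.

    for ch in "eé好🎉":
        enc = ch.encode("utf-8")
        print(f"U+{ord(ch):04X}", enc.hex(), len(enc), "bytes")
    # U+0065 65 1 bytes
    # U+00E9 c3a9 2 bytes
    # U+597D e5a5bd 3 bytes
    # U+1F389 f09f8e89 4 bytes  <- rejected or silently truncated by a utf8mb3 column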


I wonder if anyone else has had the experience of complaining about and pointing out this problem for a decade or more before emoji became widespread, while pretty much nobody cared ("they speak what language?"). But as soon as emoji became widespread, suddenly proper Unicode support was everyone's problem.


Pretty smart to put the fun stuff at the end so that everyone has to worry about everything in the middle.


There must be a name for this approach to software design: creating a forcing function that makes you correctly handle the popular case, which by proxy handles a majority of less common but still important cases. (I'm aghast that MySQL had this issue!)


it might be called ️-dd (aka emoji-driven-design)


There are tons of Unicode characters outside the BMP that MySQL's original 3-byte implementation didn't support. [1] It just happened that emoji became far more widespread than those other non-BMP characters.

[1]: https://stackoverflow.com/questions/5567249/what-are-the-mos...


> if you use emojis in your application with MySQL backend and not using the full 4-byte "utf8mb4" charset, you will absolutely get bitten by that gotcha.

Did we use to work together? Because I've been one of those poor DBAs to inherit a system that was not using utf8mb4 charsets, and sure damn enough an emoji took down said system for hours affecting every client on the books.

They recovered, but limped along for another two years before eventually getting bought out and effectively soft-shutting down.

I was hired post-Armageddon to keep the lights on and eventually left once the acquisition completed.


Doesn't seem to be a rare problem.

When I did a web project at university in 2014, the assistant professor warned me at the beginning that this would happen with MySQL and emoji, because he'd had the problem the previous semester.

And then in 2017, years after I'd learned what a huge problem this was, I got a freelance app project where they already had a back-end developer. The app crashed multiple times after release before I finally realized the back-end dev didn't know anything about this and had simply clicked the DB together in phpMyAdmin with all the default settings...


You're probably right that it's not a rare problem, but at the right scale it can be a business ending problem.


True.

I'm just amazed that there seems to be a fair number of back-end developers who don't know anything about this. Probably because SQL systems are used at more conservative companies with older customers?


That's a good question. I think part of it can be attributed to the efforts the database world has made to demystify database operations and abstract away the processes, to the point where infrastructure-by-click became easy enough for these problems to be completely forgotten or overlooked by the modern developer doing business as a DBA (swidt?).

Can insert? Can query? Nothing more to do here.


   Did we use to work together? Because I've been one of those poor DBAs to inherit a system that was not using utf8mb4 charsets, and sure damn enough an emoji took down said system for hours affecting every client on the books.
Perhaps ;) My first major project at my current job was spending two months figuring out why our system didn't support emoji. MySQL was to blame; I want my two months back.


> In a very obvious way, if you use emojis in your application with MySQL backend and not using the full 4-byte "utf8mb4" charset, you will absolutely get bitten by that gotcha.

Hello JIRA Servicedesk, which at least half a year ago still preferred "old" utf8. Over the winter holidays I'll have to, basically, reinstall the entire fucking thing.


There should be a big FAT warning in the mysql/mariadb docs that "utf8" should NEVER be used, and that "utf8mb4" is almost certainly what's intended... A future major version cut should probably change the "utf8" alias to point to the proper implementation.

Came across this the last time I used mysql, fortunately before I was too far into the project. In any case, every time I've used mysql there's been something that pisses me off about it. Like indexing on a BINARY field using case-insensitive comparisons, as if it were actually text.


Also, mysqldump has to be separately forced to use utf8mb4 or your backups will be silently corrupted.


That I did not know. Fun.


When I worked at Mozilla, on MDN, we had this mysterious bug:

https://bugzilla.mozilla.org/show_bug.cgi?id=776048

The tl;dr is that MDN is a multilingual site, and supports categorization in a few ways, including by tagging. An example of the bug: the English MDN's CSS reference tagged articles with "CSS Reference". The French MDN's CSS reference tagged them "CSS Référence". Sometimes, the French tag appeared on English articles.

The source of the issue was MySQL's utf8_general_ci collation, which did not see "e" and "é" as distinct, so it was a toss-up as to which tag you'd get back from the database. The solution was to build and install a custom collation to teach MySQL to see accented and unaccented characters as distinct.
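
The effect is easy to reproduce outside MySQL. Roughly speaking, an accent-insensitive collation compares strings after throwing the accents away, something like this Python sketch:

    import unicodedata

    def fold_accents(s):
        # Decompose, then drop the combining marks: "Référence" -> "Reference"
        decomposed = unicodedata.normalize("NFD", s)
        return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()

    print(fold_accents("CSS Référence") == fold_accents("CSS Reference"))  # True

A lookup comparing with such a collation can't tell the two tags apart, so which row comes back is effectively arbitrary - hence the toss-up.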


Good job on improving the system! However it seems like using 'Référence CSS' would have been more correct grammatically and wouldn't have caused the problem.


The authors/editors for different languages were the ones who chose the styling of the tags. Don't know why they put it that way, but they did. Our job was just to make it work.


>Even though this is 2018, occasionally someone will try to claim in conversation with me that UTF-16 is better than UTF-8.

I'm curious, what are their arguments? UTF-16 is pretty objectively bad at everything, it's not compatible with ascii and C-strings like UTF-8 and it's not a decent constant-width encoding for codepoints like UTF-32 (and even that is not necessarily useful anyway since you often care more about characters than codepoints). AFAIK the only advantage of UTF-16 is that it's a little more efficient when encoding text written in certain scripts.


People who don't understand surrogate pairs and think that UTF-16 == UCS-2 tend to argue that UTF-16 is better because (in their minds) it's a fixed-sized character encoding (it's not).

Even UTF-32 is not a fixed-sized character encoding!! It's a fixed-sized codepoint encoding. Characters can be composed of more than one codepoint. Even if you think "hey, I'll use pre-composed codepoints", you'll fail because not every legitimate, canonically decomposed character has a single-codepoint pre-composition.
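
A small Python illustration of codepoints vs. characters (the same user-perceived character can be one codepoint or several, and UTF-32 only fixes the size of the former):

    import unicodedata

    composed   = "\u00e9"      # é as a single precomposed codepoint
    decomposed = "e\u0301"     # 'e' + combining acute accent: two codepoints, same character
    print(len(composed), len(decomposed))                        # 1 2
    print(len(decomposed.encode("utf-32-le")) // 4)              # still 2 "fixed-width" units
    print(unicodedata.normalize("NFC", decomposed) == composed)  # True - this one happens to have
                                                                 # a precomposition; many decomposed
                                                                 # sequences don't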

The disadvantages to UTF-16 far outweigh its one advantage (the one you mentioned).

(There is one more advantage to UTF-16, and it's that Win32 uses it all over the place. But that's not much of an advantage, and Windows does seem to be making progress towards putting UTF-8 on an equal footing (if not better).)


This is a historical oddity. At the time Apple and Microsoft (plus a few others) came up with Unicode, the assumption was that we'd be able to fit everything into 16 bits.

That turned out not to be true but binary compatibility made switching to UTF-8 impossible.


Not just binary compatibility, also API compatibility. All of the performance guarantees of NSString are based around the internal representation being UTF-16 (or ASCII). Similarly -[NSString length] is defined as the number of UTF-16 code units, and all of the subscripting / range operations are defined in terms of UTF-16 code units.


Same thing goes for Java and .NET.

It's somewhat ironic that those who were the earliest on the Unicode bandwagon ended up saddled with all this compatibility garbage...


What gets me is that this affects later languages too. For example, Swift's internal String storage is either ASCII or UTF-16, precisely because it needs to maintain NSString's performance guarantees when bridging to Obj-C.

That said, it seems the plan for Swift 5 is to change String's internal storage to be UTF-8, and then when bridging to NSString it will provide amortized constant-time UTF-16 codepoint access by calculating and caching "breadcrumbs" for fast lookup of UTF-16 offsets.


Does Win32 use UTF-16 or WTF-16?


For path and object names, it's the latter.

WTF-16 "standard" is just UTF-16 that allows invalid surrogate pairs. Windows kernel style handling.

It's a shame you were voted down for a perfectly legitimate (if misleading) question.

This is a very common gotcha most developers ignore until they encounter their first file path containing an unpaired surrogate... UTF-16 can't represent all possible Windows file names or paths.
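
Python makes the distinction easy to see (a small sketch; the lone surrogate below stands in for the kind of value Windows filename APIs can legally hand you):

    lone = "\ud800"               # an unpaired high surrogate
    try:
        lone.encode("utf-16-le")  # strict UTF-16 refuses it...
    except UnicodeEncodeError as e:
        print("not UTF-16:", e.reason)
    # ...while WTF-16-style handling just passes it through:
    print(lone.encode("utf-16-le", errors="surrogatepass"))  # b'\x00\xd8'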


Was it misleading? I was just trying to understand if Win32 used WTF-16 everywhere (and some strings mostly happened to be valid UTF-16), or if there was some place where it used UTF instead.


I'll bet they saw it and thought you were just being insulting (what-the-fuck-16), but it apparently stands for wobbly-transformation-format. Had to look this up myself to be sure.


I think object names and paths are WTF-16, that's where NT kernel is involved. Note that Microsoft doesn't call it that way, but it's a useful naming distinction regardless.

Applications can of course use whatever they please. They just need to be careful about possible unpaired surrogates in object or file names: don't fail on them, and pass them through without alteration.

Well, misleading in the sense that you did get the downvotes... That said, I find WTF-16 a pretty appropriate name for what Windows does internally at the kernel level.


In general kernels know nothing about codesets, but filesystems do (e.g., ZFS definitely does). I call this just-use-16 and just-use-8 depending on whether the strings are arrays of bytes or of uint16_t's.


Plus, while UTF-32 is currently defined as equivalent to UCS-4, the surrogate-pair tricks of UTF-8/UTF-16 are still theoretically possible even if they don't currently encode anything. Not that we have a reason today to expect a plane bigger than the Astral Plane, but in the 90s no one expected the Astral Plane to open up. Anyone picking UCS-4 today and pretending it's UTF-32, while not prepared to handle surrogate pairs for the Currently Imaginary Plane beyond, is potentially making the exact same mistakes as the UCS-2 early adopters and will have the same UTF-16/WTF-16 headaches should that day arrive.


I expect UTF-16 to die eventually and let us have more than 21 bits of codepoints, but we'll never ever need even 30 bits, let alone 32 bits, of codepoints.

I have to imagine that you're just jesting :0


It's not entirely a jest, but it is funny to think about. The question will be then, as it has been every time before: what do we mean by "Universal" in Unicode?

One very obvious jest is that at the current rate of emoji expansion it may be inevitable to open up the next plane just for emoji.

Less of a jest is that there are always more symbols to encode. We've been a wonderfully creative species, and we're still rediscovering old written languages and discovering new bits of physics and math that can benefit from new symbols. We've got tons of interesting fictional language encodings to consider that are not likely to be encoded in Unicode today because they are locked in that struggle between the needs of Academia, the interests of the Enthusiasts, and the money/complications of Intellectual Property laws. (Klingon, Tolkien's Elvish languages, D'ni, Timelord Sigils, etc.) But if those seesaws tip towards the enthusiasts or academia, that could change quickly. Meanwhile, creative artists keep inventing new glyphs and languages to encode.

That's just our one species on this planet. What if we discover more or get better at encoding existing non-human languages? Then we really get to test that "Universal" bit in the name, huh? (That's also obviously a bit of jest, but an interesting sci-fi hope/yearning there, too. It would be nice to encode some non-human languages for a change.)

So, all of the above, based on current assumptions (emoji growth will slow down, IP law remains vicious, Fermi's Paradox won't be solved in our lifetimes, etc.), may still be unlikely to push us past the Astral Plane, but that doesn't make it safe to assume we'll never need to open the next plane.


There's always more glyphs to encode, but I think it will be a long time before we reach hundreds of millions of glyphs, even if we assigned codepoints to every fictional script and every glyph used in dead scripts, and every variation used in medieval and ancient times, and every future new emoticon or glyph for new scripts or fashionable things.

Remember, you need a fairly large number of users for each glyph, otherwise there's no point. We're not going to have per-person glyphs, as no one could understand enough of them for them to be useful. There aren't going to be too many new scripts with users numbering just in the hundreds of thousands, or low single-digit millions. However, if that's wrong, if the population gets large enough and regional scripts become a thing... then yeah, we'd run out, but let's hope not -- Unicode is complex enough as it is.


> Remember, you need a fairly large number of users for each glyph, otherwise there's no point.

You are forgetting scripts like Linear A and Linear B, which Unicode encodes despite being of use to only a few dozen academics at a time, a few dozen (if that) (very) historic documents, and maybe hundreds of academic papers over time. Popularity/usage isn't a determiner for Unicode encoding of a script. There are plenty of good reasons to encode low-usage scripts and glyphs, if they help people communicate when they are used.

But yes, I agree that it is probably a long while until the Astral Plane runs out, even encoding every small academically interesting thing along the way. I just don't feel we can discount the possibility that the Astral Plane won't be the last plane simply because we can't imagine today what the "Imaginary Plane" would even contain.


I'm not forgetting those or any other historical scripts. I mentioned those. My comment about the number of users of glyphs was in the context of future new glyphs. The supply of old glyphs is pretty limited, so it's easy to accept them all even if the number of users is minute :)


I think you may be confused about the meaning of "plane" in Unicode.

A plane is a contiguous set of 65,536 (2^16) code points. The lowest-numbered plane was originally the only one; now it's simply Plane 0 (officially, the "Basic Multilingual Plane"). "Astral Plane" is a collective term to refer to planes beyond that, of which there are now 16, for a total of 17 planes.

Unicode currently promises to cap at 17 planes because that's the limit of what UTF-16 can encode with surrogate pairs; anything beyond Plane 16 (officially, "Supplementary Private Use Area-B") would require some other scheme to encode in 16-bit units.
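
The arithmetic behind that cap, spelled out (plain Python, nothing beyond what's stated above):

    bmp = 0x10000             # code points reachable in a single 16-bit unit
    pairs = 0x400 * 0x400     # 1024 high surrogates x 1024 low surrogates
    total = bmp + pairs
    print(total, total // 0x10000, hex(total - 1))   # 1114112 17 0x10ffff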

Unicode does not use anywhere near the amount of space even those 17 planes represent; you could encode seven full copies of current Unicode in the unassigned and non-reserved space available, and still have room left over (eight full copies if you un-reserve Planes 15 and 16).

If we ever truly needed more than 17 planes, we'd have trouble trying to do UTF-16, but that's a problem with UTF-16, not with Unicode. UTF-8 as currently defined would run into trouble beyond 32 planes, but we'd either abandon it for something else, or bolt on some inelegant hack to extend UTF-8 beyond four bytes.

So other than maintaining compatibility with existing encodings, there is nothing about Unicode's design that forbids tacking on more planes if and as needed. The hard part, as with so many things in programming, was going from "there's only one" to "there's more than one"; if we ever needed enough planes to exhaust both UTF-16 and UTF-8, that hard part would already be behind us.

> One very obvious jest is that at the current rate of emoji expansion it may be inevitable to open up the next plane just for emoji.

To put it in perspective, the total set of emoji in Unicode 11.0, which are spread out across multiple blocks in different planes due to historical reasons, add up to less than 2% of the capacity of a single plane, and 0.11% of the available space in a 17-plane Unicode.


I appreciate the technical description of a plane. I was indeed using the more colloquial "plane [collection]" as shorthand for "collection of planes" as in the "Astral Plane [collection]".


The scheme used in UTF8 can encode up to 42 bits with a single start byte, the last start byte being 0xFF followed by 7 bytes of the form 10xxxxxx.


Hence I said "UTF-8 as currently defined".

You could produce something that uses the same basic scheme as UTF-8 (using the leading byte to indicate the total number of bytes used for the code point), but it would not be UTF-8 as we know it (which caps at four bytes per code point), and different encoders/decoders would need to be developed.


In practice most applications that require a chars(str) function can get away with returning the wrong result for things outside the BMP, as opposed to UTF-8 where you need to start caring as soon as you hit words like "café".

Even if you do require chars(str) for large strings outside the BMP, those characters were so rare before emoji that you could waste a single bit on "contains any non-BMP?" and almost always do the work in O(1) time, as opposed to O(n) for UTF-8.


Sorry, that just generates garbage when dealing with things outside the BMP. That can be a lot more common than you think. E.g., when dealing with Chinese characters in a context where unification is not welcomed (e.g., in China).


Yes, you're right that it generates garbage, but that's beside the point.

The point is that a huge number of programmers, especially in the 90s and early 00s would argue for UTF-16 on the basis of it being a fixed width encoding in practice. Maybe they didn't know that it actually wasn't, or they knew and didn't care because they never had to deal with anything outside the BMP.

The overlap between Windows programmers producing software for, e.g., the U.S. or European market and those who would ever have encountered a non-BMP character used to be tiny, until emoji came along.

So yes, while not in theory, in practice you could get away with treating UTF-16 as a fixed-width encoding like UCS-2 for a huge number of applications, and reap the benefits of constant-time chars(str) and charoffset(str, N).


The garbage is super annoying. Please stop. Human scripts are O(N), too bad. You can build indices (must, for large documents), but you can't really avoid this being O(N).

And we're not even talking about normalization.

People get upset about these things and blame Unicode, but the problems are not with Unicode -- they are semantics problems with our scripts that Unicode deals with about as well as can be hoped for.

The only thing I'd remove from Unicode is pre-compositions and the associated normal forms NFC and NFKC. But note that that wouldn't remove the need for normalization.


> what are their arguments?

It’s often faster. When you process non-English text in UTF8, branch prediction fails all the time because all characters have random size from 1 to 3 bytes. When you process non-English text in UTF16, branch prediction predicts all the time: surrogate pairs are extremely rare so the 99.9% of the stream is 2 bytes/character.

> it's not compatible with ascii and C-strings like UTF-8

In modern C++, wchar_t is a built-in fundamental type: http://www.cplusplus.com/reference/cwchar/wchar_t/


> In modern C++, wchar_t is a built-in fundamental type: http://www.cplusplus.com/reference/cwchar/wchar_t/

But with the slight problem that it actually means UTF-32 on everything non-Windows (there's char16_t and char32_t since C++11, but somewhat obnoxiously, they're new distinct types, so you need explicit casts for pointer conversion)


It's even worse - it means "wide character", which is not necessarily Unicode. E.g. on all BSDs, if your locale is something like zh_TW-Big5, then wchar_t stores Big5 codepoints! You can use it to store UTF-32, of course, but all the wcs* standard functions won't treat it as such.

Now, there's __STDC_ISO_10646__, which, if it is defined by an implementation, means that wchar_t is guaranteed to be a Unicode codepoint. So e.g. glibc defines that, and on Linux you can therefore assume wchar_t as UTF-32 regardless of locale. But BSDs don't. And on Windows, nothing that uses 16-bit wchar_t can use that define, since 16 bits aren't wide enough to be conforming.


Correct me if I'm wrong:

UTF-16 can be faster if you're iterating through characters, but for parsing iterating through bytes is usually sufficient. It's fine to use something like plain old strchr to search for a { or a < in a UTF-8 string. Characters only get really important when you want to print the string.


As pointed out elsewhere, you are wrong because UTF-16 'characters' can still be comprised of compounded elements. (Base character plus additional composition elements to create a final character.)

In simple terms, you can't just treat it as an array to index to any given character.

There's also all the detriments that still apply. (Including UTF-16LE / UTF-16BE, BOM, not being able to concatenate two valid string sequences (blind) and always have a valid result, etc.)

You're also incorrect, or at least not correctly describing the search operation.

The assumption about finding specific characters (for example, command flags) in a string or array of strings is also... complicated.

As an example: https://en.wikipedia.org/wiki/Halfwidth_and_fullwidth_forms

For /command/ flags the half and full width equivalent characters might want to be 'folded' back over the traditional ASCII namespace.

Sometimes a specific loss of precision can be desirable.


> As pointed out elsewhere, you are wrong because UTF-16 'characters' can still be comprised of compounded elements. (Base character plus additional composition elements to create a final character.) In simple terms, you can't just treat it as an array to index to any given character.

I am aware of that having implemented UTF-8 myself, and I thought I was fairly clear in my comment about characters vs. bytes.

> The assumption about finding specific characters (for example, command flags) in a string or array of strings is also... complicated. As an example: https://en.wikipedia.org/wiki/Halfwidth_and_fullwidth_forms

Also aware of that. But it only matters if you're actually searching for those characters. Parsing something like XML or JSON, therefore, is as easy with UTF-8 as it would be with ASCII, because you're only looking for characters like < or { which are the same byte values in UTF-8. You don't need to worry about accidentally finding a continuation byte because UTF-8 sets high bits on continuation bytes for exactly this reason.
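
To make that concrete (Python here, but the property belongs to the encoding, not the language): every byte of a multi-byte UTF-8 sequence has the high bit set, so a search for an ASCII delimiter byte can never land in the middle of one.

    data = '{"name": "café 🎉"}'.encode("utf-8")
    print(data.index(b"{"), data.index(b":"), data.rindex(b"}"))  # plain byte search, just like ASCII
    print("é🎉".encode("utf-8").hex())  # c3a9f09f8e89 - every byte >= 0x80, so no false matches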


Is this UTF-16(BE|LE) and UTF-8 mixed?

I'm unclear about the context you're proposing. If you're trying to match for a Unicode character in either UTF-16 bytestream dialect then determining the matching format first is important. The mixed format just seems too crazy to touch.


> It's fine to use something like plain old strchr to search for a { or a < in a UTF-8 string.

Correct.

> only get really important when you want to print the string.

Print, split into fixed-length pieces, render, layout, typeset — GUI apps do that a lot with their strings.

Even web browsers do; that's probably why JavaScript strings are UTF-16 despite the web in general being mostly UTF-8. See section 6.1.4 "The String Type" on page 67: http://www.ecma-international.org/publications/files/ECMA-ST...


Very true, but once you're at the level of complexity where you're rendering a GUI are you really concerned about cache invalidation due to incorrect branch prediction iterating strings?

GUI rendering is very expensive for lots of reasons, but I don't believe that's one of them - even in the UTF-8 case. You're ultimately either drawing thousands of pixels or communicating over a (relatively compared to the CPU cache) slow bus with the GPU.


> You're ultimately either drawing thousands of pixels or communicating over a (relatively compared to the CPU cache) slow bus with the GPU.

In modern software, GPU draws pixels. Before it does that, CPU lays out these glyphs. Because GPUs are ridiculously fast these days, the layout step is typically slower than painting.

I use MS edge browser. The built-in profiler said this comments page took 14ms to layout and only 6ms to paint. This page contains just a few tiny images, the majority of the content is text. I think the layout step spent most of the time iterating over the characters, looking up glyphs and measuring various blocks of text on this page.


Wow, branch prediction is actually a pretty good point I hadn't thought about.


It's common, though, to decode to an internal 16- or even 32-bit representation. Combined with clever UTF-8 decoding state machines the overhead is minimal, whether you do this all at once or in a streaming, segmented fashion. Rare is the circumstance where you can just load UTF-16 directly into an array and go to town without preprocessing. In most application scenarios you need to translate to a normalization form (i.e. NFC, NFD, etc), which you can simply combine with your UTF-8 decoding. Such tight loops of arithmetic code are blazingly fast, and performance can improve dramatically by batching these steps separately rather than intermixing different phases of text processing in your application.

Perl6 decodes to NFG form which provides a 1:1 mapping between characters and scalar values.[1] That makes Unicode string munging almost as easy as ASCII.[2] But it requires always preprocessing and intern'ing your strings.

[1] Multi-codepoint graphemes are dynamically assigned a unique internal codepoint.

[2] I say almost because some Unicode text rules, like those for word and paragraph breaks, can never be implemented as simple as in ASCII.


I think it's also more space efficient when dealing with foreign languages primarily. I agree that UTF-8 is the better default, but that's one of the most compelling arguments in favor of UTF-16 (aside from it being the default on Windows).


> I think it's also more space efficient when dealing with foreign languages primarily.

For plain text, yes. UTF-16 covers the Basic Multilingual Plane in 2 bytes per character, UTF-8 covers the BMP in at most 3 bytes per character. However, since most (web) content is dominated by markup in ASCII, which is 1 byte per character in UTF-8 and 2 bytes per character in UTF-16, UTF-8 typically wins even with Asian scripts.

Taking some random zh Wikipedia page in UTF-8:

    curl "https://zh.wikipedia.org/wiki/威蘭運河" | wc -c
    78543
    curl "https://zh.wikipedia.org/wiki/威蘭運河" | wc -m
    71003
So, that's 1.11 bytes per character on average, compared to at least 2 bytes per character if it was encoded in UTF-16.

(Of course, adding compression is going to diminish the differences anyway. A very primitive compression algorithm in such cases would be recoding your UTF-16 to UTF-8 :p.)


  curl -Ss "https://zh.wikipedia.org/wiki/威蘭運河" | iconv -t utf-16 | wc -c
  142008
There are only 3799 non-ASCII characters on that page:

  curl -Ss "https://zh.wikipedia.org/wiki/威蘭運河" | tr -d '\1-\176' | wc -m
  3799
Compared with 67204 ASCII ones (tr -cd).


The 142008 is to be expected. The page has 71003 code points, 2 bytes per code point (I guess that zh.wikipedia.org is simplified Chinese and normally only uses characters from the BMP) gives 142006 bytes.

I'd guess that the two remaining bytes are the byte order mark.


Even this argument isn't great, though, because a lot of the time it doesn't matter. Like, for HTML, the majority of the time the UTF-8 version will be shorter regardless because of all of the single-byte characters used in the markup.


> I agree that UTF-8 is the better default

Better default for what? I agree UTF-8 is the better default for HTML sent over the network. In a different context, the better default can be different.

> side from it being the default on Windows

Also default in Java (including Android userland), C#, iOS & OSX, python 3 (unless compiled with UTF32 support but the default is UTF16), Symbian, OpenOffice, QT, most XML parsers (e.g. Xerces), and many other languages, libraries and frameworks. Even JavaScript uses UTF16 strings.


> python 3 (unless compiled with UTF32 support but the default is UTF16)

Not the case since PEP 393 / Python 3.3.

Also pretty much all Linux distros used wide builds.

> Also default in Java (including Android userland), C#, iOS & OSX, python 3 (unless compiled with UTF32 support but the default is UTF16), Symbian, OpenOffice, QT, most XML parsers (e.g. Xerces), and many other languages, libraries and frameworks. Even JavaScript uses UTF16 strings.

Most of these don't actually use UTF16 but UCS2 with surrogates (sometimes nicknamed WTF16).


> Most of these don't actually use UTF16 but UCS2 with surrogates

Most of these actually use UTF16.

.NET has the System.Globalization.StringInfo class, and Java has string methods like String.codePointCount. The [] operator returns 2-byte code units in these languages because of backward compatibility. Newer languages don't need to be backward compatible (see e.g. https://docs.swift.org/swift-book/LanguageGuide/StringsAndCh...), but their internal format is still UTF-16. If you need to interop with any of these languages, your life will be much easier if your C++ code uses UTF-16 as well.


> Most of these actually use UTF16.

All of .Net, Java and Javascript[0] allow unpaired surrogates, meaning none of them uses UTF-16. I expect the same happen for Symbian, OpenOffice, Qt, … So did Python on pre-FSR narrow builds.

Swift is migrating to UTF-8 as part of its ABI stabilisation: https://forums.swift.org/t/string-s-abi-and-utf-8/17676

[0] unclear for ObjC/NSString but given the string interface leaks it being an array of utf-16 code units I'd expect the same

edit: https://stackoverflow.com/a/33558934/8182118 indicates that both NSString and Swift strings can contain unpaired surrogates, so not UTF-16 either.


> meaning none of them uses UTF-16

Depends on your definition of “uses”, and I disagree with yours.

It’s always technically possible to make invalid data. For example, you can write a web app that will send Content-type: application/json and invalid utf-8 in the response.

Yes, technically these programming languages allow unpaired surrogates in strings. This is not necessarily a bad thing, there’re valid uses for such strings. For example, you can concatenate strings without decoding UTF16 into code points, i.e. the code will be slightly faster, but while you’re concatenating, in some moment of time the destination will contain an unpaired surrogate.

I hope you agree golang uses utf-8 strings. Despite that, the unicode/utf8 standard package has ValidString function, which means you can still create invalid strings in go: https://golang.org/src/unicode/utf8/utf8.go Does it mean golang uses wtf-8 instead of utf-8?


> Depends on your definition of “uses”, and I disagree with yours.

It really does not, unless you're also disagreeing with the definition of "is". Unpaired surrogates are not valid in a UTF-16 stream. If there are unpaired surrogates it's not UTF-16. If the system normally allows and generates unpaired surrogates, it's not dealing in UTF-16.

> It’s always technically possible to make invalid data. For example, you can write a web app that will send Content-type: application/json and invalid utf-8 in the response.

And that will blow up in the client, because your shit system has sent it garbage.

> Yes, technically these programming languages allow unpaired surrogates in strings.

Making them not-UTF-16.

> This is not necessarily a bad thing, there’re valid uses for such strings.

Like taking a pile of garbage and making it into a bigger pile of garbage.

Which is not relevant to the issue at hand: such piles of crap are not UTF-16.

> For example, you can concatenate strings without decoding UTF16 into code points

That's not actually an example of your claims, you can concatenate valid UTF-16 without decoding it into codepoints as well.

> I hope you agree golang uses utf-8 strings.

I most certainly do not, why would you hope I agree with an obviously incorrect statement, what is wrong with you?

> Despite that, the unicode/utf8 standard package has ValidString function, which means you can still create invalid strings

Well you've got cause and effect reversed but yes you can create non-utf8 go strings, making Go's "strings" not utf8 at all.

> Does it mean golang uses wtf-8 instead of utf-8?

No, WTF-8 is something well-defined[0].

Golang's "strings" are just random bags of bytes, much like C's, and they're no more utf-8 than C's: you may assume they are of whatever encoding you're interested in but with no guarantee whatsoever that's actually the case and your assumptions may well blow up in your face.

[0] https://simonsapin.github.io/wtf-8/


Yes, unpaired surrogates are invalid UTF16.

No, just because languages allow representing invalid strings in their type system doesn’t mean they use some other encoding.

Ability to represent invalid data is often a good thing. English language allows representing all kind of garbage, but this is OK, see e.g. Jabberwocky by Lewis Carroll.

There are programming languages that enforce string encoding and other constraints with strict type systems and/or runtimes. They are only practical for a very limited set of problems. The majority of real-world software has to deal with invalid data: compilers do because users type invalid programs; any sufficiently complex system does because it receives data from external components written in other languages, or from I/O. I think not being able to represent invalid strings is a bug, not a feature.

Mainstream i.e. practical languages allow representing invalid strings, provide functionality to validate & normalize them, and often raise runtime errors when you try to use them in a way that makes this a problem. For example, .NET throws an ArgumentException saying "Invalid Unicode code point found at index ##" when you try to normalize an invalid string.


Better default for everything. If you have particular needs, then you change from the default. This is why 'default values' exist.


> Better default for everything.

Software industry is huge and the requirements are dramatically different in different areas. What’s good default for a bash script is not necessarily a good default for OSX desktop app, or Linux web server.

When you’re working on native rich GUI, UTF16 is often better because majority of higher-level languages, libraries and frameworks use it, even on Linux (QT, Android). If you’ll pick something that’s not native, you’ll waste substantial time, both development and CPU, converting these strings back and forth for absolutely no value.

Sometimes you have to because you’re integrating different things together, e.g. UTF16 Android’s Java with UTF8 Android’s C++: http://banachowski.com/deprogramming/2012/02/working-around-... But quite often you are good just picking whatever’s the simplest option and not doing any of that.


> python 3 (unless compiled with UTF32 support but the default is UTF16)

The default is UTF-32 on Linux, and UTF-16 on Windows and macOS - simply so as to match the platform convention.


> AFAIK the only advantage of UTF-16 is that it's a little more efficient when encoding text written in certain scripts.

and IIRC East Asian languages (which benefit the most) almost exclusively use UTF-8


I don't know if this is still the case, but as of a few years ago, there was clear evidence that the Japanese would rather have their text corrupted via mojibake than silently use UTF-8. Of course, that's about avoiding Unicode entirely rather than preferring UTF-16 over UTF-8.


TIL about "mojibake"

(I visited Japan for 2 weeks for the first time about a month ago; had a great time)
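
For anyone else who just learned the word: mojibake is what you get when bytes written in one encoding are decoded as another. A two-line Python illustration:

    print("café".encode("utf-8").decode("latin-1"))                    # cafÃ©  - UTF-8 bytes read as Latin-1
    print("café".encode("latin-1").decode("utf-8", errors="replace"))  # caf�   - Latin-1 bytes read as UTF-8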


Do you know why that is? Why don't they like Unicode?


A couple reasons I'm aware of:

- Han unification. Chinese and Japanese characters that have the same etymology but are written differently are sometimes assigned to the same codepoint, and it's up to the font to distinguish. And application designers... tend not to detect the language and change the font accordingly. This leads to Japanese text being rendered in a Chinese font.

- Some Han de-unification has happened in Unicode now. Great. Which version should fonts and encodings support?

- Nobody actually knows the complete de facto mapping between Shift-JIS and Unicode. Yes, there are standards and ICU and Python modules and stuff; they're incomplete. This leads to data loss surrounding rare characters.

(Tell me you've got such a mapping and I'll give you some strings I found in the wild to decode with it.)


>Han unification.

I am still hoping that some day down the road we can fix this without stepping on each other's cultures and fonts/glyphs.


One legitimate concern I know of for email is that you could still get Japanese mobile phones as late as 2008 that didn't support UTF-8. People setting their mail filters to autotrash UTF-8 has also definitely happened (pro tip: never attempt to filter by charset; if you really want to, filter by script instead).

I believe another major issue is that a fair amount of software sets language environment for things such as fonts based on charset. Charsets such as ISO-2022-JP, GBK, and Big5 give you a clear clue as to what the underlying language is, but UTF-8 just tells you "here is CJK ideographs, good luck!" On the other hand, I've never really heard of any opposition to Unicode from Chinese or Korean locales, so I suspect a large part of it is just stubborn tech entrenchment in Japan as opposed to strong technical reasons.


And almost all actual uses are cases where markup (be it HTML, OOXML, or ODF) is sufficiently common that the total file size ends up being larger in UTF-16 than in UTF-8.


> east Asian languages (which benefit the most) almost only use UTF-8

China, a major user of East Asian languages, uses GB2312, certainly not UTF-8.

UTF-8 assigns 3 bytes to Chinese characters so it can use the 2-byte space for European characters; that's not exactly a big selling point when you're using East Asian languages.


UTF-7 is "fun" because encoding libraries tend to support it, but since nobody cares about it, edge cases in the implementation may go undiscovered for a while.

Back on Python 2.7.5, the UTF-7 decoder didn't do range checking, so this script [1] produced a "Unicode string" containing the codepoint U+DEADBEEF. (The maximum valid codepoint is U+10FFFF.) This string would crash regexes, corrupt databases, etc., so that allowed denial-of-service attacks against any function that let you specify an arbitrary encoding.

(This is fixed in all extant versions of Python.)

[1] https://gist.github.com/rspeer/7559750
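
For the curious, Python still ships the codec, which makes it easy to poke at how UTF-7 works: non-ASCII runs become '+'...'-' sections holding modified base64 of the UTF-16 code units. (The bug above involved crafted sections that decoded to out-of-range values.)

    print(b"caf+AOk-".decode("utf-7"))   # café  (U+00E9 -> UTF-16BE 00 E9 -> base64 "AOk")
    print("£10".encode("utf-7"))         # b'+AKM-10'
    print(b"+-".decode("utf-7"))         # +     (a literal plus sign is spelled '+-')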


Ok, I have a fun little UTF-7 story to share.

At $work, we run a heavily patched OTRS for keeping track of our tickets, and have lots of systems automatically sending cron and other status mails to it (bad, I know).

We got a bug report that the recipient email was displayed as Mojibake, something like blabla+⻧⻯⻱@ourdomain.net

After digging into the source email and the OTRS code base, I found the problem:

Some shitty MUA years ago failed to properly encode email headers, and sent 8-bit "Subject: " headers. To deal with that, OTRS had a workaround that tried to decode all (!) headers with the encoding specified in the Content-Encoding header, with a fallback to ASCII or Latin-1 if the decoding failed.

Now, a pretty old Windows system or application sent UTF-7 encoded emails to something like blabla+autoreply=no+more=stuff-here@ourdomain.net, and OTRS successfully decoded the 'To:' header as UTF-7. And since it's not ASCII compatible, it turned it into gibberish.

The "fix" was to only do the header decoding if the declared encoding is ASCII compatible.

The code still seems to be in OTRS today: https://github.com/OTRS/otrs/blob/rel-6_0_14/Kernel/System/E...


Many years ago, UTF-7 was a viable way in which you could try to exploit Internet Explorer's somewhat questionable default of interpreting a resource as UTF-7 if it finds a UTF-7 character in the first 4096 bytes:

http://shiflett.org/blog/2005/googles-xss-vulnerability
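
For reference: a payload like the one below is harmless-looking ASCII to anything that doesn't sniff UTF-7, but becomes markup once decoded; Python's codec shows the effect:

    payload = b"+ADw-script+AD4-alert(1)+ADw-/script+AD4-"
    print(payload.decode("utf-7"))   # <script>alert(1)</script>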


UTF-7 is the worst, but far from the only one. When writing an email client you will have to deal with no fewer than SIX distinct text encodings, including several unique to IMAP.


Worse than UTF-EBCDIC?


Oh God I was introduced to this parsing attachment names from raw SMTP emails. You think it's all fine until suddenly you come across a UTF-7-encoded UTF-8 filename...


It is unfortunate that they left off rfc4042: https://tools.ietf.org/html/rfc4042, particularly since we are running out of room in bytes to stuff more bits.


> we are running out of room in bytes to stuff more bits.

The unicode codespace is composed of 17 16-bit planes, and so has room for 1114112 codepoints. As of Unicode 11, 137439 are allocated.

And while UTF-8 has been restricted to only support 21 bits (which already allows almost a million more scalar values than Unicode actually permits), the encoding scheme (and pre-2003 UTF-8) supports 31 bits of payload.


Note the date of the RFC.


Of all of the ones on that date, this is my favorite.


Apart from the date I think many modern programmers aren't aware of the influential PDP-10 (Mark was a PDP-10 programmer, which is how we met in the early 80s).

The PDP-10 had 36-bit words which were byte addressable -- and bytes could range from 1-63 bits in length.

So beyond the humorous language (nonets, septets) and the inline PDP-10 assembly examples, this was an absurdist joke intended to appeal on several levels.


Working with email, this is such a painful thing to deal with. The IMAP legacy is all over Outlook and email security/middleware appliances.


I'm looking forward to JMAP for the future of PIM.


For another "Unicode to ASCII" encoding, check out Punycode: https://en.wikipedia.org/wiki/Punycode


Is there anybody who really understands all the different encodings and character sets? I always get a headache when I have to analyze a problem in that area.


> I looked inside three UTF-7 encoders and found they don't follow the RFC at all on this. Instead, they encode the UTF-16 to modified base64 without any zero bit padding, and then remove any base64 = padding from the result.

That sounds like padding to a "character boundary" to me. I can't find anywhere that defines the term as being an entire block of 3/4.


I wish ASCII had a few more symbols and didn't waste 32 values on control codes.

Missing characters imho:

fixed width space, degree symbol, copyright/trademark symbols, the opposite direction of `, maybe a few more like pilcrow, section symbol, dagger, generic currency symbol, card suits, arrows and a few mathematical ones like +-, roughly equal, not equal, ...

Why I care about ASCII here? Because programming language source code is still written in that, and it's also those symbols that appear on standard US keyboard layout.

I think those extra symbols would have made life easier in many cases, more than some of the more obscure control characters it has (such as the useless carriage return vs. line feed distinction we are still suffering from). They could have gotten away with just 16 instead of 32 control characters imho, even in the times when they had mechanical machines with bells :p


> Why I care about ASCII here? Because programming language source code is still written in that, and it's also those symbols that appear on standard US keyboard layout.

None of the programming languages I've used in the last ten years, save perhaps brainfuck, used ASCII. It was all unicode, either through explicit configuration atop the source code file, or e.g. UTF-8 by default.


I mean core language keywords and operators, not strings or custom variable names. The core set is limited for good reason, but a few more symbols could have helped; e.g. the degree symbol is very common and could have been used for angles.

Did you know that in C++ you can have variables named with emoji? It compiles with clang++ with -std=c++11 :)

APL is a famous non-ASCII based programming language using arrows and such, but very difficult to type on a modern computer of course. Other than esoteric languages, I don't know any modern programming languages using non-ASCII symbols in core keywords and operators.


You just want some extra keyboard characters. That’s got nothing to do with ASCII. You’ll notice that you also don’t actually have literal control character keys on your keyboard.


Haskell has Unicode synonyms for many of it's constructs (enabled by a compiler flag), so you can write eg. → instead of `->` and ∀ instead of `forall`.


I get that "for free" via programming ligatures:

https://github.com/tonsky/FiraCode

Depending on your (irrational) feelings about fonts and typography, this may either be the most amazing advance in coding readability you've ever encountered, or a mild improvement, or even annoying. But I love them!

A number of coding-oriented fonts support them now:

https://medium.com/larsenwork-andreas-larsen/ligatures-codin...

also, small grammar niggle: "it's" should only be used where it can be replaced with "it is" and still make sense; otherwise it's always "its" :)


Fira code is great, but it can confuse others when they're looking over your shoulder.

I consider it essential when working with Javascript, thanks to the ~~boneheaded and objectively wrong~~ personal-opinion-based decision to make the strict equality operator longer than the coercive one. Making sure you have the right one is important, and while linters can also solve the problem, I like having it be very visual and obviously apparent.

Regarding `it's`: https://www.youtube.com/watch?v=063jQAM6N8I


I thought I was familiar enough with Monty Python but I've never seen these, hilarious! (also the gag at 3:10)


Perl6 too.


https://docs.perl6.org/language/unicode_ascii for a list of Unicode characters in Perl 6 and their ASCII counterparts.


The characters commonly found in programming language syntaxes are a subset of the standard ANSI 101 keyboard layout. It makes them easy to type.

It is however, especially today, a completely separate issue from encoding. Having more characters in ASCII won’t make a dent without having them on most keyboards, and having them on a keyboard is enough to encourage adoption without having them in the ascii code plane.


For some reason, julia has no ascii infix xor. That is, you can write `xor(a,b)` or `a ⊻ b` (`^` is taken for "raise to power").


How do you type "⊻"? I think I'd rather avoid a language where I can't type all the operators. Still, could be worse, some languages allow unicode identifiers, and fun ensues.

(You can allow unicode to appear in source files - either in inline documentation or strings - without allowing unicode identifiers. But even in languages that support this, I usually escape non-ASCII characters)


Where did ⊻ come from? The only symbol I ever learned for xor was ⊕. (Though I see that according to wikipedia, "It is symbolized by the prefix operator J and by the infix operators XOR, EOR, EXOR, ⊻, ⩒, ⩛, ⊕, ↮, and ≢.")


Scala has core non-ASCII syntax elements.

(Though they also have ASCII equivalents of those you can use instead.)


I remember a comment a few years ago about Perl6 picking up the Japanese style quotation marks「 and 」for some purpose. Might have been tongue in cheek.


「 and 」are indeed used in Perl 6 when textually representing a Match object (the Match.gist). See https://docs.perl6.org/language/unicode_ascii for other acceptable unicode characters (and their ASCII counterparts) in Perl 6.


Control characters were essential back when ASCII was standardised. Please remember that ASCII wasn't just a text encoding standard, it was used for communication with external hardware as well.

In fact, due to the teletype legacy on modern terminals, we still use ASCII control codes now, eg killing an application via ctrl+c - that's sent as an ASCII control code, and the receiving pseudo-TTY then interprets it as an interrupt signal. Backspace is also an ASCII control code.

The point about carriage return (CR) and line feed (LF) is an interesting one. From a purely text markup perspective it does seem insane to have two separate characters. However from the perspective of printing data on a teletype it makes total sense: you want one control character to bring the head back to the start of the line, and a second control character to feed the paper. In fact even on modern pseudo-TTYs (eg on Linux), you can still define whether you want CHR(10) to do both a CRLF or just LF - and when programs do put the TTY into raw mode (eg readline, ncurses, etc) the preferred behaviour is to disable the auto-CR on CHR(10).

The interesting thing is your point seems focused on keyboard layouts - and there's nothing stopping a keyboard from having characters that are not part of ASCII. eg UK keyboards have a pound sign (GBP) and that's not included in ASCII. Neither is the Euro currency character and that appears on some keyboards as well. I have no idea what ¬ (next to the backtick button on my keyboard) is called but that's not an ASCII character either. Navigation buttons on keyboards (left, right, home, page up, etc) are not ASCII characters. Nor are the function keys (F1 et al). Escape is though - and has proven to be quite an important control character too as it's used heavily for inlining data into TTYs via ANSI escape sequences (eg colours, cursor movement, etc).

But getting back to the GBP and Euro point: there is no reason why you couldn't configure your own keyboard to map different keys to different code points outside of the ASCII range if you wanted to.

Honestly though, some of your examples of missing symbols don't really make a whole lot of sense. How would card suits having an ASCII code help developers (or anyone for that matter)? Arrows I could see a benefit of, and they do appear in some of the competing character encodings of the 70s and 80s. The opposite direction of quotation marks is a valid point too, as they might help make it easier to write nested strings. However in terms of written text I'm more than happy having editors auto-fix straight quotation marks with their angled counterparts.


  > Escape is though - and has proven to be quite
  > an important control character too
Making the ASCII escape character the same as the keyboard escape key is one of the most annoying things in the history of terminals. It means terminal apps can't just react to your escape key directly; in theory they have to wait for the next character, to see if it's part of an escape sequence (because eg. arrow keys are escape sequences).

This is why so many curses-based terminal apps don't use escape to go back a level or close a window, even though GUI apps do it all the time. Too hard to make it work predictably.


You say that, but it's actually a really easy problem to work around, because an escape sequence and an escape key press followed by some other character have vastly different timing. So you just look at the pause after the escape key to judge whether it's an escape sequence or not (it only needs to be a fraction of a second, too).
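
A minimal sketch of that timing heuristic (Python, POSIX; it assumes the terminal is already in raw/unbuffered mode, and the 50 ms threshold and follow-up read size are arbitrary choices):

    import os, select, sys

    def read_key(fd=None, timeout=0.05):
        fd = sys.stdin.fileno() if fd is None else fd
        first = os.read(fd, 1)
        if first != b"\x1b":
            return first
        # An escape *sequence* delivers its remaining bytes essentially instantly;
        # a human pressing the Esc key alone does not.
        ready, _, _ = select.select([fd], [], [], timeout)
        if not ready:
            return b"\x1b"                  # bare Esc key
        return b"\x1b" + os.read(fd, 8)     # e.g. b"\x1b[C" for the right-arrow key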

In fact, in the terminal stuff I've written in Go, escape sequences end up being picked up as one array even though the TTY is set to be unbuffered, just because of the time the code takes to iterate through the STDIN read loop. So I didn't even need to write any fancy timing logic for the terminal stuff I've written in Go.

Sure, in some rare edge cases you might have a situation when connecting to a remote TTY over a laggy network connection will cause the escape sequence to fail because of timing issues, but that's less of an issue these days and I think the terminal would be pretty unusable by that stage anyway.

> This is why so many curses-based terminal apps don't use escape to go back a level or close a window, even though GUI apps do it all the time. Too hard to make it work predictably.

Escape is used all the time in terminal apps though. Readline uses it when you have vim mode enabled (which means Bash also supports it), Vi obviously uses it heavily. Tmux, top, cfdisk all support escape. But most importantly ncurses stuff (eg dialog) does too.

So I think what you're talking about there is more of a UI design decision rather than terminal programs not being able to do it. Thinking back, I don't really recall escape being a common navigation idiom until it became popular on the web. Your typical 90s WIMP UI (eg Windows 9x, MacOS 7+) wasn't escape heavy either (or at least it wasn't a common way to navigate as I recall). Sure you'd use it on modal dialogs (eg file->open) but really very little else. So terminal software - being much older, generally speaking - never had that convention either. In fact it wasn't even that common on the web until modal dialogs became all the rage; before that, people used the back button heavily (which is why GUI file managers have forwards and backwards buttons in them - but that's a whole other tangent).

To be honest though, I much prefer the terminal approach anyway. Escaping out of screens feels very crude and the terminal approach for multiple "screens" of data was to use function keys or control sequences to precisely navigate from one screen to another. Or in the case of ncurses UIs, optionally using arrow keys to highlight the option you want. Personally I find the function key kind of workflow far easier to navigate than escaping stuff all the time but I guess it's what you're used to.

In any case, terminal apps don't need to wait for the next character to determine whether the user is pressing [esc] or [-->] (with the terminal emulator transmitting CSI+n+C). You just look at the timings instead.


Sending key codes in a batch should not cause their interpretation to be changed!

It might be the least bad solution right now, but it's not good.


From an idealistic point of view I agree, but both ASCII and ANSI escape codes predate the kind of terminal UIs we are discussing here and thus things just organically evolved that way.

You only have to look at the mess that is applications built on top of Electron using HTML, CSS, JS transpilers etc to see how we’ve not learned our lesson about evolved technologies becoming de facto.

There’s plenty wrong with writing UIs in the terminal due to the age of their design, but in regards to the argument raised by the GP, detecting the escape key is one of the less painful anti-patterns in that area, and thus a great many tools do use the escape key.


> I have no idea what ¬ is called

That's the logical not symbol, companion to ∧ (and) and ∨ (or).

It's in ISO-8859-1 (and presumably the British keyboard layout) while the others aren't, since they can be made with /\ and \/.


Hmm, what a shame that these three characters can't be typed with <Compose> /\, <Compose> \/ and <Compose> -|


Sarcasm? If you set up a .XCompose file you can type them any way you want. In my case they're mapped to <Compose> <^> <^>, <Compose> <v> <v>, and <Compose> <-> <,> (plus a few minor variations).

I always miss the programmable compose keys when I'm forced to use Windows. Charmap and alt-sequences are not an adequate substitute.


<Compose> <minus> <comma> is the shipped sequence for ‘¬’.


… which is the reason ‘\’ was invented.


> more than some of the more obscure control characters it has

It's perhaps an oddity now, but these control characters were most definitely necessary when ASCII was formalized in the 1960s from teletype machines.


> Why I care about ASCII here? Because programming language source code is still written in that, and it's also those symbols that appear on standard US keyboard layout.

> I think those extra symbols would have made life easier in many cases...

How would it make life any easier? The fact that they aren’t in ASCII doesn’t mean they can’t be on the keyboard. That’s an entirely separate issue. European keyboards have lots of non-ASCII characters on them.

Also as others have noted, programming languages are almost universally not ASCII. Though again I don’t know why you think that matters.


I made a flat file database for an imageboard script a couple of years ago in which I found a lot of value in using the control characters, to separate the content of posts (group separator) and the attributes of a post within that (record separator).


Bit late now. Although MS-DOS default code page 437 did have the card suits in the control character space, and a few other of the ones you mentioned in the upper >127 half of the page.


I wish more people used those control characters. Instead of building CSV, TSV, XML and JSON / YAML files with more and stranger methods of escaping characters why didn't we use the already existing and perfectly fine control characters for File, Record, Group and Unit Separators.
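
They still work fine today, as long as the payload is guaranteed never to contain them (which is also their weakness: there's no escaping mechanism). A tiny Python sketch:

    RS, US = "\x1e", "\x1f"        # record separator, unit separator (FS is "\x1c", GS is "\x1d")
    rows = [["id", "1"], ["title", 'café, "quoted", no escaping needed']]
    blob = RS.join(US.join(fields) for fields in rows)
    print([row.split(US) for row in blob.split(RS)])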


Control characters are still extremely common in embedded codes such as DataMatrix or QR Code.

https://www.gs1.org/standards/barcodes

I would extrapolate that RFID uses similar encoding, but I have no experience there.


Code128 has special control symbols that are distinct from the ASCII control characters and are used for framing/formatting in the context of GS1. DataMatrix also has a special symbol that is equivalent to FNC1.

The fact that scanners often encode FNC1 as some arbitrary ASCII character, with GS and ^ being common choices or even reformat the data to something resembling the human readable representation (ie. "(01)..." instead of FNC1 0 1 ...) is only about the scanner interface.

But IIRC QR code does not have any special out-of-ASCII control symbols and GS1 structures are encoded with ASCII GS instead of FNC1.


Unless you're doing old-school serial/modem communication, I can't imagine anyone wanting to use UTF-7.





