Hacker News
A Spectre is Haunting Unicode (dampfkraft.com)
353 points by hardmaru 7 months ago | 110 comments

So, since we have them and they're not going away, we should invent a meaning for them. Obviously, one of them will have the meaning "ghost character". Other than that, meanings like "mistake that would have been simple to correct if it had been caught quickly, but now it is too late", or "something that slips through the cracks of bureaucracy", or "mistake that does not cause an actual problem, but seems wrong".

I like these suggestions. It sounds like they are all suited to commenting legacy code.

Ouch. Too true.

I say just leave them alone so drunk teenagers can get them tattooed and get laughed at by native Japanese speakers.

They could be given sarcastic Unicode names like MEANINGLESS KANJI-STYLE TATTOO #1 to #12.

One of them should mean "that thing when someone likes the look of foreign characters and gets a tattoo or buys a shirt or something, without finding out what it means".

Any tattooed Japanese glyph is funny to Japanese people because we never had a culture of tattooing letters, especially in 楷書体 (regular script).

It somewhat reminds me of this: https://en.wikipedia.org/wiki/Cangjie_method#Early_Cangjie_s...

A particular "feature" of this early system is that if you send random lowercase words to the character generator, it will attempt to construct Chinese characters according to the Cangjie decomposition rules, sometimes causing strange, unknown characters to appear.


This is utterly wonderful - brought to mind the film Brazil's fly in a typewriter.

I'm now going to have to go down the rabbit hole of each of these ghost kanji. Good times.

I wonder if some interested Japanese ever got together to simply find new sensible meanings and words to retroactively apply to these ghost kanji. As a sort of creative pastime.

The article links to the Nico Nico Douga Wiki (a wiki for creative works), in which each of the 12 characters is imagined to be the name of a Japanese youkai (a type of spirit/demon/monster): http://dic.nicovideo.jp/a/%E5%B9%BD%E9%9C%8A%E6%96%87%E5%AD%... (Japanese). It’s also quite a neat coincidence that there are exactly 12 remaining unattested characters (down from ~60 before the thorough investigation), because 12 is a rather auspicious number in East Asian tradition.

For example, 妛 (山 mountain atop 女 female) is imagined to mean a yama-uba (traditional “mountain witch” youkai) who lives at the foot of the mountain (as befitting the “女” position in the character).

There’s a good tradition of reusing old or outdated characters in new contexts. For example, in Chinese, the archaic character 囧, meaning “window” (modern term 窗) has been resurrected to mean “embarrassed” or “awkward” due to its pictorial resemblance to a facial expression.

There also happen to be exactly 12 standard unified ideographs that were erroneously encoded in the CJK Compatibility Ideographs block (i.e. 﨎﨏﨑﨓﨔﨟﨡﨣﨤﨧﨨﨩).
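One way to see what sets these twelve apart is to check how normalization treats them. The sketch below assumes Python's standard `unicodedata` module; the helper name `has_canonical_mapping` is made up for illustration. Ordinary compatibility ideographs are replaced by a unified ideograph under NFC, while the twelve listed above should pass through unchanged.

```python
import unicodedata

# The twelve code points listed in the comment above, written as escapes.
UNIFIED_IN_COMPAT_BLOCK = (
    "\uFA0E\uFA0F\uFA11\uFA13\uFA14\uFA1F"
    "\uFA21\uFA23\uFA24\uFA27\uFA28\uFA29"
)

def has_canonical_mapping(ch: str) -> bool:
    """Hypothetical helper: True if NFC replaces ch with another character."""
    return unicodedata.normalize("NFC", ch) != ch

# U+F900 is an ordinary compatibility ideograph from the same block,
# included only for contrast.
duplicate = "\uF900"

stable = [ch for ch in UNIFIED_IN_COMPAT_BLOCK if not has_canonical_mapping(ch)]
print(len(stable))                       # how many of the twelve survive NFC
print(has_canonical_mapping(duplicate))  # whether U+F900 is replaced
```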

Obviously not exactly the same, but this article reminded me of the Scroll Lock key. Many people seem to have forgotten its original purpose: when PC Magazine asked an executive of keyboard manufacturer Key Tronic about the key's purpose, he replied, "I don't know, but we put it on ours, too."

Wikipedia remembers: "The Scroll Lock key was meant to lock all scrolling techniques, and is a vestige of the original IBM PC keyboard. In the original design, Scroll Lock was intended to modify the behavior of the arrow keys. When the Scroll Lock mode was on, the arrow keys would scroll the contents of a text window instead of moving the cursor. In this usage, Scroll Lock is a toggling lock key like Num Lock or Caps Lock, which have a state that persists after the key is released."


It still works that way too in a real vtty. Hit scroll lock to scroll through the text buffer at the terminal in a text-only session with the up and down arrows on BSD. Very handy.

My modern keyboard even still contains little arrow glyphs on the number pad keys to correspond with them.

Isn't that for num-lock? (or when num-lock is off?)

The fixating effect of digitalization is scary.

New characters were introduced by accident, and although they're useless, they're probably not going anywhere. But what about the inverse? Will character sets still be able to change and grow new characters now that they're standardized like this?

Character sets in general will certainly continue to grow — look no further than the ever-expanding emoji set. Unicode is designed to be able to incorporate new characters.

Whether the CJKV¹ set will grow is another matter entirely. I expect that there will be a couple more extensions for characters found in historical sources; the goal of the Unicode standard is — roughly speaking — to be able to digitize any and all existing texts, after all.

But for actual new characters to be added they would first have to become popular enough to warrant inclusion. Neither Chinese nor Japanese is a language that produces new Chinese characters the way it did historically. The barrier to get a character accepted beyond a fringe group is quite high, in any language, and there really is no strong need to create new characters.

1: Chinese, Japanese, Korean, Vietnamese (the latter two historically used Chinese characters as well)

Emojis are a mistake and were added too soon. We end up with a standard that is full of short-lived symbols from 2000-2025 that were once in fashion, and new generations of people will abandon them for something else when the culture changes.

Adding new stuff (excluding old languages) to Unicode should have a delay, at least 20-25 years from proposal to the standard (roughly a generation).

In the meantime it's possible to insert emoji into text as :), smiley, -smiley-, smiley.jpg, ::smiley:: or whatever you want and the system you use is free to change it into a picture.

Emoji already existed in documented (not to mention frequent) use when they were added to Unicode. As other commenters noted, the emoji originated in Japan and were encoded in the character sets used by mobile phones. Not adding them was not an option, because it would mean Unicode wouldn't be able to encode Japanese emails and text messages.

What happened in the past few years is that the emoji set was updated, because the Japanese core set missed symbols when considered from a global perspective. In some cases emoji have also been added pro-actively in cooperation with smartphone OS vendors; this is a point of contention, but I think it makes sense in this case.

> We end up with standard that is full of short lived symbols from 2000-2025 […]

That's fine. There is plenty of space in the Unicode standard. It also contains whole swaths of glyphs that were only used in some small country for a few decades, or that are obscure to the point of being almost irrelevant. But that is what Unicode is for. It is a means to encode the texts we produce as humans¹, not a prescriptive table of characters we ought to use.

1: Aliens too I guess, as long as they can provide proof of prior and significant enough use.

> That's fine. There is plenty of space in the Unicode standard.

But there is not infinite time from a font creator. Already it is impossible to know which of the newer characters will be supported by which OS/fonts/etc. Unicode characters are all but useless if they can't be displayed.

That's fine too. There is no need for every font to have support for Byzantine Musical Symbols or Coptic Epact Numbers. Operating Systems tend to have a collection of fonts installed that can handle every common use case. If you depend on glyphs introduced in special use areas you are more than likely aware of the need to install a special font.

Sure, you may need to be aware as a developer that a new character introduced in last month's Unicode update isn't likely to be supported yet, but that's no different from any other technology like CSS.

It would be nice if there was a reference set of glyphs. Like if in order to add a character to Unicode you had to also add a default glyph. Then at least if a font doesn't have that character the default could be displayed instead of a black box.

When I'm running Linux/X, if the current font doesn't know a character it renders the character using another font on the system that does. So then it's just a matter of including a set of fonts that covers every character, and I think a reasonable attempt has been made to do that - although maybe that's based on the fonts I've chosen to install, since that's something that's moderately important to me.

When I'm running Windows (for work) it only renders characters for the current font, and I get a box instead of a character. (Traditionally at least; maybe this has changed? - it's not important to my work so I haven't really investigated, but as a feature it's so useful it's hard to believe it hasn't happened yet.)

But as you can see this isn't a problem with Unicode but the configuration of your system. If it chooses to show you a character from another font, then that's good and convenient.

Note that Windows support for emoji is significantly better. I think Gtk+3/Gnome apps and Firefox have improved significantly this year (wow, lots of color!), but they're still lagging.

> So then it's just a matter of including a set of fonts that covers every character

Have you ever tried to literally do this? You won't succeed.

The Google Noto font family is getting there, but I think they're only caught up with like Unicode 6, and we're on Unicode 11 now. There are some recently-added scripts that you won't find in any font. The Unicode tables render them in proprietary fonts that are obfuscated in the PDF, usually without even any information about how to buy them.

The list of font contributors can be found here: http://www.unicode.org/charts/fonts.html. Choice quote:

> The Unicode Consortium currently uses over 390 different fonts to publish the code charts and figures for The Unicode Standard. The overwhelming majority of these fonts are specially tailored for this purpose and have been donated to the Consortium with a restricted license for use in documenting the standard.

You might be able to reach out to the vendors listed and try to buy the font, but it seems the majority were produced only for Unicode's use.

That doesn't make sense for some complex (shaped) scripts where you have things like zero width joiners.

Are there characters that really can't be drawn individually? Surely there's something better than a black box for pretty much any character.

Yes, I disagree that it doesn't make sense. You're not supposed to see a ZWJ, so the correct render of it is not visible. So even if you include it in your font - well, the correct thing for your font engine to do is to apply the semantics of it correctly and change the render of the surrounding characters.

Yes, but there are glyphs, especially in Indic languages, that have no representation in a presentation form in Unicode. If you are familiar with Arabic, each glyph has forms - isolated, beginning, medial, and final. Your "typical" Arabic string is composed of isolated code points, they go through a text shaper, and out come glyph indices that also have a Unicode code point associated with them. You do the same thing with an Indic language, except imagine that the glyph indices that come out have no corresponding Unicode code points. It's very surprising at first and some software can't handle these unassigned or "hidden" glyphs.

So my point is: how much value would there be in requiring a representation of a small fraction of the glyphs (only those with Unicode code points, many of which are ZWJs) in a script in a standards document when potentially hundreds of forms are necessary after shaping to represent the language?
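The Arabic half of that comparison can be inspected directly, because the legacy presentation forms have code points of their own. This is only a sketch using Python's standard `unicodedata` module and the letter BEH as an arbitrary example; it does no shaping, and it cannot show the Indic case, where the shaped glyphs have no code points to look up at all.

```python
import unicodedata

BEH = "\u0628"  # ARABIC LETTER BEH, the code point normally stored in text

# Legacy presentation-form code points for the same letter
# (isolated, final, initial, medial).
PRESENTATION_FORMS = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]

for form in PRESENTATION_FORMS:
    name = unicodedata.name(form)
    base = unicodedata.normalize("NFKC", form)
    print(f"U+{ord(form):04X} {name} -> U+{ord(base):04X}")

# Compatibility normalization folds every positional form back to the letter.
all_fold_to_beh = all(
    unicodedata.normalize("NFKC", form) == BEH for form in PRESENTATION_FORMS
)
print(all_fold_to_beh)
```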

That’s a big problem with the standard in my opinion: standardizing new characters requires submitting a font for them, but this font does not need to be open. So you have standardized characters that cannot be displayed unless someone else implements another font. That’s a lot of duplicated effort.

I'm not sure having over 100k glyphs in random fonts solves anything. I'm also not convinced font samples are something a character encoding should care about or try to standardize.

For practical use, Google's Noto font is under an open font license and covers so many glyphs its collection is split into multiple OTF files because of the 65k glyph limit per font. The goal of Noto is much the same as the one you propose - to have an open representation of every character (and in a consistent font).

> covers so many glyphs its collection is split into multiple OTF files because of the 65k glyph limit per font.

Not because of that. Out of the hundred Noto files, several are the same CJK characters with different country and Sans/Serif/Mono styles, and everything else combined would fit into a single file.

Even removing CJK, there are still more than 65,535 glyphs necessary to represent everything in the SMP and BMP less CJK. If you look in BMP without CJK, surrogates, and private use areas, you are looking at around 27,000 code points. If you look at the SMP (supplementary multilingual plane), there are around 90 blocks of 4096 code points assigned. That total is well over 65,535. And keep in mind many scripts also require unassigned glyphs which are not Unicode code points themselves. These unassigned glyphs count against the 65,535 TTF limit, though.


Sure, it would take two files to do everything outside CJK. What I said is still true, that everything covered by the non-CJK Noto fonts would fit in a single file (50k glyphs total).

My point is that you only need 3-4 files to cover Unicode. Noto is not split into a hundred different files for that reason, but for other reasons.

(Also the used space on the SMP is roughly 90 * 256.)
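The disagreement above comes down to one multiplication, so it may help to write it out. The figures below are only the rough numbers quoted in this thread, not authoritative Unicode statistics, and they count code points rather than glyphs, so shaped scripts would need more.

```python
GLYPH_LIMIT = 65_535   # per-font glyph limit cited in the thread
BMP_NON_CJK = 27_000   # rough BMP figure quoted above (no CJK, surrogates, PUA)
SMP_BLOCKS = 90        # rough number of SMP blocks quoted above

# The two competing assumptions about how large an SMP block is.
total_at_4096 = BMP_NON_CJK + SMP_BLOCKS * 4096
total_at_256 = BMP_NON_CJK + SMP_BLOCKS * 256

fits_at_4096 = total_at_4096 <= GLYPH_LIMIT
fits_at_256 = total_at_256 <= GLYPH_LIMIT

print(total_at_4096, fits_at_4096)  # 395640 False
print(total_at_256, fits_at_256)    # 50040 True
```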

People are going to use popular icons whether they're in unicode or not. The burden of making it work is going to exist either way. And something like an emoji doesn't need to be in a general-purpose font.

> In the meantime it's possible to insert emoji into text as :), smiley, -smiley-, smiley.jpg, ::smiley:: or whatever you want and the system you use is free to change it into a picture.

Sure. It’s also possible to encode any non-ASCII text as ASCII as ::yen-sign:: or ::em-dash:: or ::cyrillic-capital-letter-schwa:: and implement your own algorithm to replace them with their intended symbols. We just happen to have one fairly widely agreed upon standard for doing that, and it’s Unicode.

Weren't emojis added because Japanese phones had them, since a long, long time (before smartphones), and the main pursuit of unicode is to be able to represent all written characters? The first emojis would be close to the 20 year lapse now, and they are the most used.

1999, so yeah, the first set of almost 200 emoji are nearly 20 years old. And yes, they were added first in Japan and didn’t become a western fad until after the iphone added them (not claiming the iphone was the catalyst, but the timing was about the same).

The phrase "since a long, long time," applied to mobile phones, is causing me a certain amount of bogglement.

I'm also suspicious of the cost and value of adding emoji, but what is the alternative? They are here, they need to be represented somehow. If Unicode doesn't do it, it's possible that fragmented and proprietary encoding systems would fill the vacuum.

So you think that emojis should not have been added to Unicode because, in 10 years, they will be obsolete like Linear A (U+10600..U+1077F)?

> But for actual new characters to be added they would first have to become popular enough to warrant inclusion.

How would a character become popular if people cannot type it?

By tentatively placing it in the Private Use Area for example, or by organising support for it via other means (e.g., the German ẞ or ⏻).
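For anyone wanting to tell such provisional characters apart from encoded ones, the general category is enough. A minimal sketch with Python's standard `unicodedata` module follows; `is_private_use` is a hypothetical helper name, and the code points checked are arbitrary examples.

```python
import unicodedata

def is_private_use(ch: str) -> bool:
    """Hypothetical helper: True if ch lies in a Private Use Area (category Co)."""
    return unicodedata.category(ch) == "Co"

print(is_private_use("\uE000"))      # first code point of the BMP Private Use Area
print(is_private_use("\U000F0000"))  # first code point of the plane 15 private use area
print(is_private_use("\u1E9E"))      # capital sharp s, a regularly encoded letter

# Encoded characters have a name; private-use code points do not.
print(unicodedata.name("\u1E9E"))
print(unicodedata.name("\uE000", "<no name>"))
```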

Print already fixed it pretty hard. If anything, it's easier to sneak in a custom font and use the private use area than what you had to do to get a new character in lead type.

> New characters were introduced by accident, and although they're useless, they're probably not going anywhere.

I'd say that since they exist, people will find uses for them. The article links to a page of ghost stories that uses the ghost characters as names for them; I've also seen one TV show's title given as Magical�Pokaan (sic).

I'm in the process of proposing 7 new Taiwanese characters, and 14 more Hakka characters to the Unicode IRG.

They need scanned evidence of times when the characters are used. It then takes about 5 years to actually be introduced into the standard.

It's not a new question; I've seen arguments about Año Nuevo State Park and whether the tilde is necessary in English after some major publication discovered that they didn't have ñ in their standard font and their proofreaders didn't know Spanish.

The ~ in "Año" is never optional. Removing it leads to some incredibly inappropriate readings.

Año = year. Ano = anus.

You're answering about Spanish. Ano does not mean that in English.

But any context where you misspell "año" will suggest Spanish, so the misspelling will evoke anuses.

If it actually did evoke that in the average person, the mistake wouldn't happen so often.

And what if it evokes the very similar Portuguese, where "ano" does in fact mean year?

Seemingly it would then be ano novo instead of ano nuevo.

Aaand we're off!

(For me, the name of the place is "Año Nuevo State Park", in Spanish, so "Ano Nuevo" is simply misspelled. Don't let your technology control you.)

What does Ano mean in English?

Not everything needs to have a meaning. But if you really want an answer, it's a name that some people have.

There was this story a few days ago about adding a character, for the new era in the Japanese calendar due to the Emperor's abdication and crowning of the new Emperor.

They need to add a character for the name of the era, but it hasn't been decided yet. So the Unicode standard is going to need to rush it in.


German just saw the addition of the capital "ß" (U+1E9E) last year.

A character that existed for a while, just not in digital typefaces.

The upper-case ß is not in common use, though. If I'm not mistaken, its only use is correctly reproducing names in official documents that write a name in all-caps.

In common usage, when mapping to upper case, the lower-case ß gets mapped to double S (e.g. "Straße" -> "STRASSE"), which I guess is another fun edge case for Unicode implementations to cover.

If you want that to work nicely that would be a heck of a headache. Not only are "Straße" and "STRASSE" different lengths, but converting "SS" to lower case would now depend on context (which is probably the case even when you're solely considering the German language, consider the name "Bessel" for instance, or compound words like Linnaeusstraße).
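The length change is easy to reproduce. The sketch below uses Python's built-in `str.upper` and `str.lower`, which apply Unicode's locale-independent full case mappings; other languages and libraries may behave differently.

```python
word = "Straße"
shouted = word.upper()           # full case mapping turns ß into SS

print(shouted)                   # STRASSE
print(len(word), len(shouted))   # 6 7

# Lowercasing has no way to know that the SS came from an ß.
print(shouted.lower())           # strasse

# The capital sharp s (U+1E9E) does map back to ß when lowercased.
capital_sharp_s = "\u1E9E"
print(capital_sharp_s.lower() == "ß")  # True
```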

True, but that issue has existed for decades already.

    toLower(toUpper(a)) != a
    len(toLower(toUpper(a))) != len(a)
Both of these can be true. If your code assumes otherwise, it was already broken.

You'd assume the somewhat weaker properties of:

    toLower(toUpper(a)) == toLower(a)
    toLower(toLower(a)) == toLower(a)
should hold, though. With German even that is not guaranteed unless you really spend a lot of time figuring out when to convert SS to ß. I suppose making toLower convert ß to ss would avoid that particular problem, but I'm not sure the Germans would agree with that one (which is why I said no nice solution exists).

At the very least it looks like you need a different toLower for style purposes than one for string comparison, unless someone can come up with a string where replacing ß with ss changes the meaning (or some equivalent to this), in which case there's no hope.
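Python happens to split things exactly this way: `str.lower` is the display-oriented mapping and `str.casefold` is the comparison-oriented one, which folds ß to ss. The sketch below is an illustration rather than a recommendation; `same_ignoring_case` is a hypothetical helper, and it skips the Unicode normalization step a robust caseless comparison would also need.

```python
def same_ignoring_case(a: str, b: str) -> bool:
    """Hypothetical helper: caseless comparison via Unicode case folding."""
    return a.casefold() == b.casefold()

a = "Straße"

# The weaker property toLower(toUpper(a)) == toLower(a) fails here.
weaker_property_holds = a.upper().lower() == a.lower()
print(weaker_property_holds)     # False

# Case folding treats ß, ẞ, ss and SS as the same thing.
print(same_ignoring_case("Straße", "STRASSE"))       # True
print(same_ignoring_case("STRA\u1E9EE", "strasse"))  # True
```

The price is the one the comment anticipates: any pair of words that differ only by ß versus ss will compare as equal.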

> I suppose making toLower convert ß to ss would avoid that particular problem

But that function would not be toLower(), it would be toSwiss()

> unless someone can come up with a string where replacing ß with ss changes the meaning

"Er trinkt in Maßen" - He drinks with moderation.

"Er trinkt in Massen" - He drinks huge amounts.

The pronunciation is different, too: a vowel before "ß" is long, before "ss" short.

Well, there goes any hope of a sensible case-insensitive way to compare strings in German. Unless you can figure out which of those two should correspond to "ER TRINKT IN MASSEN".

I think drinking too much is why we are having this problem in the first place, so maybe people should just stop it. You'd end up with Maßßen and Maszen eventually. Ich geh dann mal zum Arz, mir jezt ein attezt holen. Da wird man doch bekloppt.

I maintain that ss is a legitimate substitute in standard German orthography, just as ae for ä etc. not only because I can't be arsed to switch key layouts, but because ß is a useless character. If you compare Fuß, Ruß and Mus; Muss and Bus, Plus; Museeum and Muster. It is derived from a ligature, which is not any less confusing between ss and sz. So Buße implies it was once written Busse (or Busze). Add to that the phonetic difference between jetzt, jezt, and Arzt, is rather ridiculous, I mean, what do we have the c for? We rather standardize an extra letter and use c chiefly in ch and sch, but why? Ich habe da so meine Cweifel, ehh Tsweivel, eh ... ach lassen wir dass.

Oh wait wtf, why

Case changes are locale-sensitive. For example, if your locale is Turkish, toUpper(“i”) is “İ”, the Turkish dotted-I. Meanwhile, toLower(“I”) is “ı”, the dotless I.

But now, if you don’t have your locale set to Turkish, then you have

toUpper(“ı”) = “I”, toLower(“I”) = “i”

so the character does not round trip.
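Python's `str.upper` and `str.lower` ignore the locale entirely, so they show the non-Turkish behaviour described above. The two helpers below are hypothetical and hand-rolled for illustration: they only special-case the two Turkish i pairs before falling back to the default mappings. Real code would more likely rely on a locale-aware library such as ICU.

```python
# Default (locale-independent) mappings: the dotless ı does not round trip.
print("ı".upper())           # I
print("ı".upper().lower())   # i

# Hypothetical Turkish-aware helpers: remap the i pairs first, then let
# the default mappings handle every other letter.
_TR_UPPER = str.maketrans({"i": "İ", "ı": "I"})
_TR_LOWER = str.maketrans({"I": "ı", "İ": "i"})

def turkish_upper(s: str) -> str:
    return s.translate(_TR_UPPER).upper()

def turkish_lower(s: str) -> str:
    return s.translate(_TR_LOWER).lower()

word = "kırmızı"
print(turkish_upper(word))                 # KIRMIZI
print(turkish_lower(turkish_upper(word)))  # kırmızı
print(turkish_upper("istanbul"))           # İSTANBUL
```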

That doesn't explain why upper lower conversion shouldn't round trip within one locale.

As an example, the Greek letter sigma has two lowercase forms, depending on whether it's at the end of a word. So 'ς'.upper().lower() == 'σ', even though all three characters are ordinary Greek letters you could even find within a single word.
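This one can be checked without any locale setup. The sketch below relies on CPython's built-in `str.lower`, which applies the final-sigma rule when lowercasing a whole word; the word used is just an unaccented example.

```python
final_sigma = "ς"

# A lone final sigma comes back as the ordinary sigma.
print(final_sigma.upper())                    # Σ
print(final_sigma.upper().lower())            # σ
print(final_sigma.upper().lower() == final_sigma)  # False

# Inside a word, lower() looks at context and restores the final form,
# so the whole word round trips even though the single letter does not.
word = "οδος"
print(word.upper())                   # ΟΔΟΣ
print(word.upper().lower() == word)   # True
```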

adds to personal notes: Never accept a job that involves manually dealing with locales

Good note, but the fact that ς doesn't round trip isn't locale-dependent.

Thanks. That's a much better example.

Actually, the Council of German Language has been recommending use of ẞ in daily usage instead of SS since 2017, and it’s starting to get more popular.

They are now allowing both ẞ and SS but are not recommending one over the other.

See Rule 25 E3:

> Bei Schreibung mit Großbuchstaben schreibt man SS. Daneben ist auch die Verwendung des Großbuchstabens ẞ möglich. Beispiel: Straße – STRASSE – STRAẞE.


It existed in a single edition of an orthography reference in Eastern Germany (and was removed in the next edition).

All other extant uses were designers and font makers writing blog posts about how much they want that character and discussing what it should look like.

Besides that it never existed. Unicode made a mistake. It was just that some people feel it should exist, in order to „complete“ the set of characters.

It has been used occasionally for the last century, long before there were blogs or posts. I collected a few examples here http://imgur.com/a/NaFgA

The most compelling argument is that this character has been used on German passports since at least 1988 (I don't know whether they started this even earlier). On passports, names are written in capital letters by international standard. That character is necessary to accurately represent people's names.

I'm kind of surprised that the international standard permits a letter that isn't found in any other alphabet. For those of us who have names in languages that use non-Latin alphabets, our passports spell them out using Latin letters (usually resulting in something that rarely sounds like the original when pronounced in a sensible way).

The machine-readable line gets substituted letters (ü --> ue). The "normal page" doesn't.

Gave me a tense moment in China, when a border agent questioned the difference between the name on my ticket and the name in the passport.

Of course, to Germans both forms are interchangeable, but I'd compare a Chinese letter sign by sign, too.

He finally accepted it when I showed him that the machine-readable line was identically spelled.

The character set has been pretty stable for a long time with the exception of post-war character simplification (which could be done again without changing the codes)

Actually the situation is more complicated than that. Japanese Kanji include some characters that were created by the Japanese, following the basic principles of Chinese characters, over the past 1000~ years. Everywhere that Chinese characters have been used there have been local variations. There have also been various hand-written simplifications, which in some cases became "new" characters that over time were repurposed.

It is also not true that simplified characters could be done again without changing code points; it is true for Japanese, but for Chinese simplification there were several characters that were "merged", i.e. two formerly different characters are now written the same way. I would actually argue that the idea of "code points" makes no sense for Chinese characters, and that instead an approach based on the components of each character should have been used.
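A toy table makes the "merged" case concrete. The mapping below is a hypothetical four-entry sample chosen for illustration, not a real conversion database: 發 (to emit) and 髮 (hair) both simplify to 发, and 後 (after) and 后 (queen) both simplify to 后.

```python
from collections import defaultdict

# Hypothetical toy table, not a real conversion database.
TRADITIONAL_TO_SIMPLIFIED = {
    "發": "发",
    "髮": "发",
    "後": "后",
    "后": "后",
}

# Invert the table to see which simplified forms have several sources.
simplified_to_traditional = defaultdict(list)
for trad, simp in TRADITIONAL_TO_SIMPLIFIED.items():
    simplified_to_traditional[simp].append(trad)

ambiguous = {
    simp: trads
    for simp, trads in simplified_to_traditional.items()
    if len(trads) > 1
}
print(ambiguous)
```

Going from traditional to simplified is a plain lookup in this sample, but the reverse direction needs context to pick among the candidates, which a one-to-one swap of glyphs on a fixed code point cannot express.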

Yes, everyone likes to talk about 働 and 峠. These characters are rare and the set is not expanding.

Anyway, sure the simplification could happen. You just take the code point for 國 and start rendering 国. The principle that the font affects the rendering is already set with Han unification (probably not a good decision but we are stuck with it now). At worst you end up with redundancies (even in the Chinese case automated transition from simplified to traditional characters is mostly right; if you're going the other way it is always right).

> even in the Chinese case automated transition from simplified to traditional characters is mostly right; if you're going the other way it is always right

Even Traditional-to-Simplified has its pitfalls because some characters have a 1:2 mapping: https://en.wikipedia.org/wiki/Ambiguities_in_Chinese_charact...

Good catch, although I think the nature of Japanese use of characters makes this less likely.

It's much easier to reproduce images / vector graphics on a computer than it is on a printing press. Reaction GIFs, custom pseudo-emoji, "stickers", and so on could be considered variations on characters. Also, handwriting is possible with a stylus, though tablets aren't popular enough for this to have become a big thing yet (it may someday). I wouldn't worry too much!

Like useless genes.

Controversial opinion but I believe the skin-tone modifiers fall into this category as well.

The answer is easy -- No, it won't be able to change and grow.

I have been to more than one character museum where they show the evolution of characters. For most of history you can see some change every 10-20 years, but they have stayed completely the same for the last 20 years or so.

Twenty? I don't know what you mean. If you pick up a Japanese book published in 1960 the characters are the same ones they use now. Nobody was using bone script or anything.

I did not talk specifically about Japanese, and I don't think the OP did either.

And I should clarify: characters mostly stop changing once the printing press is introduced, and stop almost completely once digitization is complete. Of course common languages were digitized long before, but all common languages seem to have had their digitization completed 10-20 years ago.

You are thinking about alphabet-based languages where the creation of a new word does not involve the creation of a new character. We are frequently creating new words which require "workarounds" (sometimes just using a romanisation) until it's properly encoded.

Who are we here? The Japanese have a syllabary. Even the Chinese are, as far as I know, almost never creating new characters (when something new is called for, as far as I know, a soundalike old character is the most common solution).

There are languages in India with tens of millions of speakers whose digital representations are still poorly implemented, partially because the speakers of those languages rely on English instead of their own language when using computers.

This reminds me of the famous mistake of "referer" being spelled with only 1 'r' in the HTTP standard.

As a kid that tinkered with programming/computers before I learned English, this has stuck as something I always misspell.

If they're in Unicode, it should be possible to register ghost characters as a ghost domain name, right?

I don't see why not. Hell, it's probably possible to register a domain name for characters that aren't even defined in Unicode yet.

So you could take a chance registering some not-yet-defined code point, and possibly end up with the next poop emoji as your e-mail address!

Worst lottery ever.

With Punycodes?
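On the encoding side, at least, nothing stops it. Here is a sketch using Python's built-in `idna` codec, which implements the older IDNA 2003 rules; the `.example` domain is a placeholder, and whether any registry would actually accept such a label is a policy question the code cannot answer.

```python
# 妛 is one of the ghost characters from the article; "example" is a
# placeholder top-level label, not a real registration.
domain = "妛.example"

# The idna codec converts each non-ASCII label to its xn-- (Punycode) form.
ascii_domain = domain.encode("idna")
print(ascii_domain)

# Decoding restores the original Unicode form.
print(ascii_domain.decode("idna") == domain)
```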

The Google Translate Chinese text-to-speech for that core set of ghost kanji is also spooky. https://translate.google.com/#zh-CN/en/%E5%A6%9B%E6%8C%A7%E6...

Like DNA coding mistakes

Any reason why these characters cannot be revoked if nobody uses them?

How do you prove a unicode character is unused? Look at every database field in the entire world? Every book someone has ever typed on a computer and printed? See what fonts include them or don't?

You can't define what an unused character looks like once it's in the standard... even less if it's already available on almost every modern computer.

People have already used these "ghost characters" when describing them, as jokes in their online handles, etc... This article and now this comment include "妛" which is one such character.

Furthermore, how would you even remove them? Tell all font authors "unicode point X now shouldn't be included in your font"... why would they ever bother to remove something from a font?

And what would be the benefit? Unicode offers enough codepoints that there's plenty of space left over for whatever garbage you wish to include, like new emojis and super-astral-aether-planar characters.


Encoding Stability

Applicable Version: Unicode 2.0+

Once a character is encoded, it will not be moved or removed.

This policy ensures that implementers can always depend on each version of the Unicode Standard being a superset of the previous version. The Unicode Standard may deprecate the character (that is, formally discourage its use), but it will not reallocate, remove, or reassign the character.

Presumably they can also be entered as part of a passphrase. Now that it's published it's even more likely.

This blog post used them.

Basically "NaN"s in character form :) - fascinating read!

> In the end only one character had neither a clear source nor any historical precedent: 彁. The most likely explanation is that it was created as a misreading of the 彊 character, but no specific indcident was uncovered.

I think I've uncovered a new English word, "indcident." I wonder what it means and how it is pronounced?
