A particular "feature" of this early system is that if you send random lowercase words to the character generator, it will attempt to construct Chinese characters according to the Cangjie decomposition rules, sometimes causing strange, unknown characters to appear.
I'm now going to have to go down the rabbit hole of each of these ghost kanji. Good times.
For example, 妛 (山 mountain atop 女 female) is imagined to mean a yama-uba (traditional “mountain witch” youkai) who lives at the foot of the mountain (as befitting the “女” position in the character).
There’s a good tradition of reusing old or outdated characters in new contexts. For example, in Chinese the archaic character 囧, meaning “window” (the modern term is 窗), has been resurrected to mean “embarrassed” or “awkward” because it resembles a facial expression.
New characters were introduced by accident, and although they're useless, they're probably not going anywhere. But what about the inverse? Will character sets still be able to change and grow new characters now that they're standardized like this?
Whether the CJKV¹ set will grow is another matter entirely. I expect that there will be a couple more extensions for characters found in historical sources; the goal of the Unicode standard is — roughly speaking — to be able to digitize any and all existing texts, after all.
But for actual new characters to be added, they would first have to become popular enough to warrant inclusion. Neither Chinese nor Japanese produces new Chinese characters the way they did historically. The barrier for a character to be accepted beyond a fringe group is quite high in any language, and there really is no strong need to create new characters.
1: Chinese, Japanese, Korean, Vietnamese (the latter two historically used Chinese characters as well)
Adding new stuff (excluding old languages) to Unicode should have a delay of at least 20-25 years from proposal to inclusion in the standard (roughly a generation).
In the meantime it's possible to insert emoji into text as :), smiley, -smiley-, smiley.jpg, ::smiley:: or whatever you want and the system you use is free to change it into a picture.
What happened in the past few years is that the emoji set was updated, because the Japanese core set was missing symbols when considered from a global perspective. In some cases emoji have also been added proactively in cooperation with smartphone OS vendors; this is a point of contention, but I think it makes sense in this case.
> We end up with standard that is full of short lived symbols from 2000-2025 […]
That's fine. There is plenty of space in the Unicode standard. It also contains whole swaths of glyphs that were only used in some small country for a few decades, or that are obscure to the point of being almost irrelevant. But that is what Unicode is for. It is a means to encode the texts we produce as humans¹, not a prescriptive table of characters we ought to use.
1: Aliens too I guess, as long as they can provide proof of prior and significant enough use.
But font creators don't have infinite time. Already it is impossible to know which of the newer characters will be supported by which OS, fonts, etc. Unicode characters are all but useless if they can't be displayed.
Sure, you may need to be aware as a developer that a new character introduced in last month's Unicode update isn't likely to be supported yet, but that's no different from any other technology like CSS.
When I'm running Windows (for work), it only renders characters from the current font, and I get a box instead of a character. (Traditionally, at least; maybe this has changed? It's not important to my work so I haven't really investigated, but font fallback is so useful it's hard to believe it hasn't happened yet.)
But as you can see, this isn't a problem with Unicode but with the configuration of your system. If it chooses to show you a character from another font, then that's good and convenient.
Note that Windows support for emoji is significantly better. I think Gtk+3/Gnome apps and Firefox have improved significantly this year (wow, lots of color!), but they're still lagging.
Have you ever tried to literally do this? You won't succeed.
The Google Noto font family is getting there, but I think it's only caught up with something like Unicode 6, and we're on Unicode 11 now. There are some recently added scripts that you won't find in any font. The Unicode code charts render them in proprietary fonts that are obfuscated in the PDF, usually without any information about how to buy them.
> The Unicode Consortium currently uses over 390 different fonts to publish the code charts and figures for The Unicode Standard. The overwhelming majority of these fonts are specially tailored for this purpose and have been donated to the Consortium with a restricted license for use in documenting the standard.
You might be able to reach out to the vendors listed and try to buy the font, but it seems the majority were produced only for Unicode's use.
So my point is: how much value would there be in requiring a representation of a small fraction of the glyphs (only those with Unicode code points, many of which involve ZWJ sequences) in a script in a standards document, when potentially hundreds of forms are necessary after shaping to represent the language?
For practical use, Google's Noto font is under an open font license and covers so many glyphs that its collection is split into multiple OTF files because of the 65k-glyph limit per font. The goal of Noto is much the same as the one you propose: to have an open representation of every character (and in a consistent font).
Not because of that. Out of the hundred Noto files, several are the same CJK characters with different country variants and Sans/Serif/Mono styles, and everything else combined would fit into a single file.
My point is that you only need 3-4 files to cover Unicode. Noto is not split into a hundred different files for that reason, but for other reasons.
(Also the used space on the SMP is roughly 90 * 256.)
Sure. It’s also possible to encode any non-ASCII text as ASCII as ::yen-sign:: or ::em-dash:: or ::cyrillic-capital-letter-schwa:: and implement your own algorithm to replace them with their intended symbols. We just happen to have one fairly widely agreed upon standard for doing that, and it’s Unicode.
How would a character become popular if people cannot type it?
I'd say that since they exist, people will find uses for them. The article links to a page of ghost stories that uses the ghost characters as names for them; I've also seen one TV show's title given as Magical�Pokaan (sic).
They need scanned evidence of the characters in actual use. It then takes about five years for a character to actually make it into the standard.
Año = year. Ano = anus.
And what if it evokes the very similar Portuguese, where "ano" does in fact mean year?
(For me, the name of the place is "Año Nuevo State Park", in Spanish, so "Ano Nuevo" is simply misspelled. Don't let your technology control you.)
They need to add a character for the name of the new era, but it hasn't been decided yet. So the Unicode standard is going to need to rush it in.
In common usage, when mapping to upper case, the lower-case ß gets mapped to double S (e.g. "Straße" -> "STRASSE"), which I guess is another fun edge case for Unicode implementations to cover.
toLower(toUpper(a)) != a
len(toLower(toUpper(a))) != len(a)
toLower(toUpper(a)) == toLower(a)
toLower(toLower(a)) == toLower(a)
At the very least it looks like you need a different toLower for style purposes than one for string comparison, unless someone can come up with a string where replacing ß with ss changes the meaning (or some equivalent to this), in which case there's no hope.
But that function would not be toLower(), it would be toSwiss()
> unless someone can come up with a string where replacing ß with ss changes the meaning
"Er trinkt in Maßen" - He drinks with moderation.
"Er trinkt in Massen" - He drinks huge amounts.
The pronunciation is different, too: a vowel before "ß" is long, before "ss" short.
But now, if you don’t have your locale set to Turkish, then you have
toUpper(“ı”) = “I”, toLower(“I”) = “i”
so the character does not round trip.
See Rule 25 E3:
> When writing in capital letters, one writes SS. Alternatively, the use of the capital letter ẞ is also possible. Example: Straße – STRASSE – STRAẞE.
All other extant uses were designers and font makers writing blog posts about how much they wanted that character and discussing what it should look like.
Besides that, it never existed. Unicode made a mistake. It was just that some people felt it should exist, in order to "complete" the set of characters.
The most compelling argument is that this character has been used on German passports since at least 1988 (I don't know whether they started this even earlier). On passports, names are written in capital letters by international standard. That character is necessary to accurately represent people's names.
Gave me a tense moment in China, when a border agent questioned the difference between the name on my ticket and the name in the passport.
Of course, to Germans both forms are interchangeable, but I'd compare Chinese characters sign by sign, too.
He finally accepted it when I showed him that the machine-readable line was identically spelled.
It is also not true that the simplification could be done again without changing code points. It is true for Japanese, but in the Chinese simplification several characters were "merged", i.e. two formerly different characters are now written the same way. I would actually argue that the idea of "code points" makes no sense for Chinese characters, and that an approach based on the components of each character should have been used instead.
Anyway, sure the simplification could happen. You just take the code point for 國 and start rendering 国. The principle that the font affects the rendering is already set with Han unification (probably not a good decision but we are stuck with it now). At worst you end up with redundancies (even in the Chinese case automated transition from simplified to traditional characters is mostly right; if you're going the other way it is always right).
Even Traditional-to-Simplified has its pitfalls because some characters have a 1:2 mapping: https://en.wikipedia.org/wiki/Ambiguities_in_Chinese_charact...
I've been to more than one character museum where they show the evolution of the characters. For most of their history you can see some change every 10-20 years, but they have stayed completely the same for the last 20 years or so.
And I should clarify that: characters mostly stop changing once the printing press is introduced, and stop changing almost completely once digitization is complete. Of course the common languages started getting digitized long before, but all of them seem to have had digitization completed only 10-20 years ago.
Worst lottery ever.
You can't redefine what an unused character looks like once it's in the standard... even less so if it's already available on almost every modern computer.
People have already used these "ghost characters" when describing them, as jokes in their online handles, etc... This article and now this comment include "妛" which is one such character.
Furthermore, how would you even remove them? Tell all font authors "Unicode code point X shouldn't be included in your font anymore"? Why would they ever bother to remove something from a font?
And what would be the benefit? Unicode offers enough codepoints that there's plenty of space left over for whatever garbage you wish to include, like new emojis and super-astral-aether-planar characters.
> Applicable Version: Unicode 2.0+
> Once a character is encoded, it will not be moved or removed.
> This policy ensures that implementers can always depend on each version of the Unicode Standard being a superset of the previous version. The Unicode Standard may deprecate the character (that is, formally discourage its use), but it will not reallocate, remove, or reassign the character.
I think I've uncovered a new English word, "indcident." I wonder what it means and how it is pronounced?