> Because cote, coté, côte and côté are all different words with the same upper-case representation COTE.
The last bit is wrong, at least in French. The upper case representation of “côté” is “CÔTÉ”. It infuriates me to no end that Microsoft does not accept that letters in caps keep their accents. I can believe that it might be language-dependent, though.
Also, a search engine that does not find “œuf” when you type “oe” is broken.
Very interesting article, though. I wish more developers were aware of this.
>It infuriates me to no end that Microsoft does not accept that letters in caps keep their accents.
That behaviour is actually locale-dependent in Microsoft apps. Setting Microsoft Word to "French (Canada)" uses accented capitals, while "French (France)" does not.
I imagine there's a lively internal debate about the latter. France's Académie Française actually says the behaviour you'd prefer is the proper one, but the local typographic culture is heavily influenced by a history of typewriter use where such accents weren't available.
I must admit I’ve never tried to set it to Canadian French. I’m not surprised that the standards might be higher. Also, there’s a box to tick somewhere to set it to a sane behaviour. It’s just the default, which needs to be changed every time, which nobody bothers to do.
There isn’t any debate really, in the sense that virtually nobody remotely interested in typesetting advocates for unaccented caps. All the French books on typography and spelling emphasise that accents change the meaning of words and that they should be kept, provided that you have a device that supports them. It’s just an urban legend amplified by stupid defaults in commonly-used applications.
I blame the school system which somehow convinced some people that dropping accents from capital letters is okay. It makes absolutely no sense whatsoever.
In uni, for the typography class, the learning material had a nice capital "PALAIS DE CONGRES" on the first page, suggesting a building full of eels.
Another thing OP does not mention is normalization: The German character ä can be represented as a single character in Unicode, but it can also be represented as the letter a followed by a combining diacritical mark. Both will be displayed the same way.
I guess most Germans are not aware of these differences and would expect that Ctrl+F treats them the same.
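For the curious, the difference is easy to see with Python's standard unicodedata module; this is just a quick illustration, not tied to any particular search implementation:

    import unicodedata

    single = "\u00e4"      # precomposed: U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
    combined = "a\u0308"   # decomposed: 'a' + U+0308 COMBINING DIAERESIS

    print(single == combined)                                 # False: different code points
    print(unicodedata.normalize("NFC", combined) == single)   # True after composing
    print(unicodedata.normalize("NFD", single) == combined)   # True after decomposing

A naive Ctrl+F that compares code points directly will treat the two forms as different unless it normalizes both sides first.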
I was working on the remove diacritics feature of Drupal this weekend so I am quite fresh with these affairs.
"Any self-respecting" -- this is very, very hard. If you are Danish, your keyboard most likely has a way to enter Ø and you would expect Ø and O to be handled as two different letters. However, if an English article mentions the Øresund bridge, it's not unreasonable to expect a search on Oresund to match it. Or, in Hungarian, "kor" means period or age, "kór" means disease and "kör" means circle... but searching for "kor" in Chrome finds all of them. And Unicode won't save you: Ø does not have decomposition rules but ó and ö both do. So the common practice of decomposing - removing marks - recomposing route can either be too little or too much. The upcoming solution for Drupal will allow users to edit remove diacritics rules per language, no other way to do it.
My ideal (but no one has implemented this yet) search would have a slider for strict/lax. In strict mode, only equivalent forms would match, so composed and decomposed would match as per Unicode rules, but nothing else. As you get laxer, you obey locale-specific replacements like ö -> oe in German or å -> aa in Danish. Even more lax, one could have homoglyphs match, so Latin a would match Cyrillic а. Then maybe α matches a and β matches b and ß, and of course C matches ℂ and weirdness like that. That already sounds like a few years of research to get it working world-wide.
In the last stage, a full DWIM-AI would do the matches. ;)
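A toy sketch of the strict/lax slider idea, in Python; the replacement and homoglyph tables below are invented for illustration and nowhere near complete:

    import unicodedata

    LOCALE_FOLDS = {
        "de": {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"},
        "da": {"å": "aa", "ø": "oe", "æ": "ae"},
    }
    HOMOGLYPHS = {"а": "a", "α": "a", "β": "b"}   # Cyrillic/Greek look-alikes

    def fold(text, laxness, locale="de"):
        text = unicodedata.normalize("NFC", text)   # level 0: only equivalent forms match
        if laxness >= 1:                            # level 1: locale-specific replacements
            for src, dst in LOCALE_FOLDS.get(locale, {}).items():
                text = text.replace(src, dst)
        if laxness >= 2:                            # level 2: homoglyphs
            text = "".join(HOMOGLYPHS.get(ch, ch) for ch in text)
        return text

    def lax_match(needle, haystack, laxness=0, locale="de"):
        return fold(needle, laxness, locale) in fold(haystack, laxness, locale)

Getting the tables and the level boundaries right for every locale at once is, as said, the multi-year research project.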
I once worked on what we called DYM tools (for "Did You Mean"). The goal was to assist native English speakers who were learning a second language find words they heard, in that language's electronic dictionary. We knew, for example, that native speakers of English have difficulty distinguishing the dental and retroflex consonants of Hindi, so the DYM allowed mismatches of those. A perfect spelling was at the top of the list returned from the dictionary, while a misspelling--such as writing a dental for a retroflex--resulted in the mismatched word being a little lower in the list. We tailored our DYMs for particular target languages (and always for native speakers of English). As you say, getting something like this to work for multiple languages at the same time would be difficult.
In Swedish I always want e to match é, and I would never want a to match ä.
é is just a different accent of e, not a different character in the alphabet. ä is a completely different character in my alphabet. Similarly ö is a completely different character in my alphabet and should not be considered equal to o, if the text is Swedish.
For someone in the US or France, the ö or ë probably is just a pronunciation diacritic as in Raphaël or Coöperation. Of course an American doing Ctrl+F on a website (likely the New Yorker!) would want to find "Coöperation" when searching for "Cooperation". Even weirder: when I go to the New Yorker and search for Cooperation I want to find "Coöperation" too!
So this is highly context dependent. Ideally I want collation/comparison to depend on the content of the page, not my browser/OS language.
Yes, that is correct. HTML can declare languages (which are also useful for automatically selecting fonts), so that should be used for searching, too, in addition to font selection, indexing, etc.
Another idea would be to do that in Swedish text, "é" will always be represented in decomposed form, while "ä" will always be represented in precomposed form, so that you can tell the difference.
That would be incorrect, as Unicode defines the two forms to be interchangeable. You'd do a decomposition before doing a search anyway in which case both forms become the same code points (which ones depends on the decomposition type).
The correct way to handle this is by tagging the text with a language tag, as has been mentioned in other replies to the parent post.
Do you have a better incentive in mind to mark text up properly?
I'd even suggest browsers mess up the formatting of non-English letters (by using wildly different fonts) to encourage better semantic markup, but it is a bit hard-core and everyone would shout "compatibility breakage" at them :)
If text is not marked up properly, then the user should be allowed to override that setting, but when it is marked up properly, such overriding should not be necessary.
As long as it is user configurable which fonts to use for which language, I think that it does not break compatibility to do that. Actually, I think it is a rather good idea. The document should only specify the language and the style (e.g. bold, emphasis, normal, fixed pitch, heading, etc) and then that combination is mapped to a font in the browser. (If the user has enabled use of CSS fonts, and such fonts are specified, then they would override those specified by the user. If the user has not enabled use of CSS fonts, then the user's fonts are always used.) This would be needed anyway due to the Han unification that Unicode does (and Unicode is very messy). (I mentioned before that Unicode can be good for searching, and if that is what you are doing with Unicode rather than for writing and displaying documents, then Han unification is probably desirable, although again the Duocode that I mentioned before may help even more.)
That is insufficient and wrong. "Apfel" should not match "Äpfel" in a search, because those are different words, one is the singular, the other the plural of "apple". However, "Aepfel" should match "Äpfel", since those are the same word, the former is an accepted replacement form. Similarly, "ss" and "sz" (very rare) should match "ß" and vice versa, so "Strasse" is the same as "Strasze" and "Straße".
Of course all of this is locale-dependent and it might be acceptable for ease of use to match "ä" to "a" for international users. But not in a german locale.
Proper I18N is very hard, and supporting Unicode is just a first step. Converting text between lower/upper/title-case can only be done knowing at least the locale of the text, and in rare cases a case-folding-roundtrip can even change the meaning of the text, so you will have to be able to speak the language.
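The round-trip problem is easy to reproduce with any locale-unaware case mapping, for example Python's built-in one:

    print("Maße".upper())    # 'MASSE'
    print("MASSE".lower())   # 'masse' -- now reads as "Masse" (mass), not "Maße" (measurements)

There is no way for the second step to restore the ß without knowing the word and the language.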
> Similarly, "ss" and "sz" (very rare) should match "ß" and vice versa, so "Strasse" is the same as "Strasze" and "Straße".
This is unworkable, since "Busse" (several autobuses) should not match "Buße" (repentance). But "Busse" (incorrect spelling of "Buße" due to character set limitations/Swiss German spelling I believe) should match "Buße" by your argument.
Anyway, modern Firefox lets you choose whether to match "Apfel" against "Äpfel".
You are right in that my argument is somewhat incomplete, one could be stricter and not match "Buße" when searching for "Busse" because that would be a somewhat uncommon almost-misspelling in German German.
But: "Busse" is not a always incorrect spelling, it's just an "emergency" one, if you really can't use "ß" for some reason. One example is allcaps text, here "BUSSE" and "BUSSE" are not distinguishable, although they have different meanings. Also, there is Swiss German, where replacing ß with ss is the normal form and not at all "accepted in an emergency" like in German German: https://www.galaxus.de/de/page/schweizerhochdeutsch-fuer-anf...
Incidentally this is one of the instances where you really have to know the language and the context to be able to do a lowercase -> uppercase -> lowercase roundtrip, otherwise you might screw up with "Buße -> BUSSE -> Busse", changing the meaning. Or not, if the current locale is de_CH instead of de_DE.
Altogether, since search will very often be case-insensitive, I think that while you may be correct, being so strict as to not match here would not be what the user would expect.
Yes, again, I18N is hard and sometimes it is impossible to do correctly for a machine.
Text search implementations do not have the luxury of assuming that the text they are used against is correctly spelled. Indeed one common use case for find and replace is searching for misspellings.
Yes, I know. Just more headaches, because you have one more equivalent variant to worry about (use of ẞ is correct but very rare), and fonts that do not support the new character ẞ yet or ever (so it will not be standard capitalisation for quite some time, if ever).
Even in a German locale, it doesn't work. "Müller" and "Mueller" are two different names, there is no band called "Moetley Cruee", etc. Also, "sz" hasn't been a replacement for "ß" since 1901.
Or, even better, the "find" feature should support multiple user-specified options about what to care about in the search, like is done for the "match case" option. Simply allowing regex covers most issues.
I agree about regex, it is a good idea to have. I would much rather that Firefox's search function supports regex rather than the incremental search it has, because I think regex would be useful.
But, regex is a separate issue than languages of text.
That again is a special case: Sz is never ß at the start of a word... Since there are composite words in German, you have to pay attention to "Schlussszene" being neither "Schlusßzene" nor "Schlussßene" but "Schlußszene". But I would not blame anyone or any program for getting this wrong.
Also, in this case, Sz is pronounced as the two distinct letters it is composed of. Whereas when "sz" replaces "ß", the pronunciation is just very similar to a plain "s".
That's the same point though - it's just as "wrong" to match 'ß' twice in Schlu[ß][sz]ene as it is to match 'a' once in [Ä]mpel.
Admittedly German isn't my first language. But it seems odd to be permissive in one case, but more restrictive in the other. In my experience there are two cases for searching: exact matches, like you'd expect from an editor's find and replace, and more permissive searches, where you might expect to skip through several matches to find what you're looking for. In this second case, what's the advantage of being more exclusive in one case and less exclusive in the other? Why not have a more permissive default (as mentioned by the other user's reply regarding Firefox's configurable search), even if it's technically incorrect?
Because the ß == ss case is special in that there are numerous words that can be written both ways correctly, there are numerous words that can be written both ways sometimes and often the meaning depends on whether the most common form is written with ss or ß. And Swiss German allows replacing ss for ß as a normal way of writing everything.
Whereas for umlauts ä, ö and ü, the forms with ae, oe and ue are always "emergency replacements", even the Swiss use äöü very frequently (though not always). Also, there are almost never ambiguities when replacing ä, ö, ü -> ae, oe, ue. Yet there are very frequent ambiguities when replacing ä, ö, ü -> a, o, u, because singular/plural, conjunctive and other forms are derived that way and sometimes there are just different words that only differ in one vowel being an umlaut.
So there is a difference in the chance of being wrong when picking one form of a word for another possibly equivalent one. When you do ss == ß, you are often correct. When you do ä, ö, ü == ae, oe, ue, you are almost always correct. When you do ä, ö, ü == a, o, u, you are almost always wrong.
I can't think of any case where Germans would use one of these and German speaking Swiss would not. But as another Swiss peculiarity, some of our words contain "üe", e.g. "Üetliberg", "gmüetlich".
Chrome's Ctrl+F behavior is damned annoying for handling linguistic data. When I search for ʰ (aspiration), I don't want h (the letter, which appears in some of the column headers and most of the notes entries) - conflating the two makes searching for ʰ practically worthless. If I couldn't turn off "ASCII characters match everything that looks vaguely like them" in Google Sheets (i.e. enable regex search), I wouldn't use Google Sheets.
...and now that I've complained about it I notice that there are extensions to enable regex search in the browser itself, nice
On my firefox (68.9.0esr), a Ctrl+F for "The German character a" finds no results.
Also, as a Finnish user, I would not be impressed by a search for "talli" (stables) matching "tälli" (blow), or "länteen" (westwards) matching "lanteen" ('of the hip').
In contrast, as a Swedish user, I'm spectacularly impressed when, say, a flight or train booking site lets me search for Goteborg rather than Göteborg. Because otherwise, you know, I probably can't get the fuck home from wherever I happen to be with whatever keyboard the particular machine I happen to be using has.
The number of times, in actual real life, that the inconvenience of the ambiguity outweighs the convenience of the overlap is not large.
I would assume that sites like this know these words as synonyms and don't simply do a lax string matching, because in this specific case "Gothenburg" would probably also work - it did when I was planning a route through Sweden with a German app that defaults to local spellings in foreign countries.
Flight and train booking sites need to support search for a very small set of place names: if they care about selling tickets they can list a synonym list for each destination, sidestepping language and spelling norms. Fuzzy completion in search forms as one types, like e.g. in Google Maps, is also effective.
"Should match" above was too strong. "Should match for users with locale set to English" is right though. In English, theoretically you should write "café" and "résumé" but practically speaking, it doesn't matter and no one cares if you omit the accents.
In the US, there's starting to be more push from people with ñ in their name to get support for the correct letter, but most government systems have zero support for anything but ASCII all capitals (sorry McWhoever).
I think it depends on the context. I need the search-and-replace dialog in my text editor to be strict about it, whereas I would find it weird if my web browser would not treat a, à, â, and ä the same way. After all, in the languages I speak they are all the same character, just with different accents. This is of course also language-dependent.
Firefox only gained that ability very recently in FF73 (https://bugzilla.mozilla.org/show_bug.cgi?id=202251), hence it's not yet in the current ESR. It does seem to have a toggle for turning it on/off, though.
Normalisation and - except for the soft-hyphen - ignoring of ignorable whitespace characters (ZWNJ, ZWJ, WJ etc.) on the other hand are still missing in Firefox (https://bugzilla.mozilla.org/show_bug.cgi?id=640856).
> On my firefox (68.9.0esr), a Ctrl+F for "The German character a" finds no results.
On mine (77.0.1) it finds the OP's "The German character ä" as well as yours. Conversely, searching for "The German character ä" finds the OP's original and your "a" version. There is a "match diacritics" button for enabling/disabling this behavior.
Speaking as someone whose native language has not one, but three of those weird letters, I would need it because it makes searching about a billion times easier. There's plenty of situations where typing my native language's characters is inconvenient for a multitude of reasons, and any particular document is not likely to hold so many individual words dissimilar only by using those particular different letters that the inconvenience of the ambiguity is going to outweigh the convenience of not having to convince my current device to type, say, an ä.
Would you also match it with α? Notice that "a" and "α" are never interchangeable, as opposed to "a" and "ä". What about their uppercase versions (which are visually identical)? Or all similar-looking letters between the Latin, Greek, and Cyrillic alphabets? And what about letters that are almost identical but have very different meanings, like "w" and "ω"; would you match them?
I have no idea. I don't speak Greek, and I am not familiar enough with its alphabet. The question, I suppose, is if there's a reasonable/intuitive enough mapping between the Greek and the Latin alphabet that it would be possible to search for Greek words using a non-Greek keyboard. If yes, then I suppose my answer would also be yes - but I'm happy to be corrected by someone who actually speaks Greek.
My point is that when it comes to things like searching, usefulness trumps purity. I search because I want to find shit. Not because I want to get perfect feedback on linguistic details from my software.
The usefulness of such behavior is questionable. Characters with and without diacritics are different graphemes, parts of different words. If I search for something, I do not want to get plenty of irrelevant results. Just that happened to me a few days ago, when I entered a rare word root into the search box in Firefox and was surprised that I got plenty of irrelevant results, because there was a common word differing only in diacritic marks.
Even if sometimes a text is written without diacritic marks where there should be some, it is usually consistent for the whole document, so if I search for some word that contains diacritics, I know whether to enter the word with or without them.
I guess we can all agree that there must always be an option to select between strict matches (without identification of different characters) and fuzzy matches, with varying degrees of "fuzziness" (diacritics, case, etc).
I don't think we "all" can ever agree on something "always" being true.
I can imagine cases where enough context is available that you can just do the one actually usable search. Heck, even Google doesn't give you any option, though I frequently hate that.
There was a great example from someone above: sometimes you search to _skip shit_ instead. If I am looking for ž, I might actually want to find where it first occurs as opposed to all the occurrences of z which will be many more.
My own opinion is that in case-sensitive mode, writing "a" should not find the accented "a" unless it is in decomposed representation (and even then, only if you are searching for "a" itself, since if you follow it by another letter, then it will no longer match since the combining accent mark intervenes), and it should never match a precomposed accented "a". In case-insensitive mode, the behaviour should be locale-dependent and configurable by the user.
I sometimes work with transliterations of cuneiform. I don't want s to match š, h to match ḫ, g to match g̃, r to match ř, or any accented vowels matching any differently accented or unaccented vowels. They mean drastically different things.
If you're going to have a "smart" search that decides to mesh these together, the option to turn off the "smartness" should be right there.
My Exchange server always finds my friend when I search for "Kivanc" even though the "i" and the "c" are Turkish letters without a dot and with a squiggle respectively.
However, I cannot get it to find my friend Nikolay when I type the whole thing, as it doesn't match the final Cyrillic letter with my "y".
I would say, in general, string matching leaves a lot to be desired, especially when I don't know how to actually type most of the letters of my colleagues' and friends' names. (It's easier on Mac than Windows 10 though - I can't figure out a dang thing for Windows 10 that is not English.)
If your Russian friend's name is Николай, then you might try to type it as "j", "i" or "ij". I only studied Russian for a short while, but й sounds more like "ij".
Very nice, thank you! Replacing the "backwards N with hacek" (not sure what the letter is called, sorry) with "i", "iy" or "j" all find him. Funny that his preferred romanization with just a "y" doesn't.
I was talking about that Ctrl+F should match ä (the single character) when the user searches for ä (the letter a plus the combining something mark). And vice versa.
There is only one key ä on the keyboard, the user probably doesn't have a choice whether that enters the single character version or the two-character version...
Just last week I noticed the description for the movie "Us" in the HBO Max app on iOS and AppleTV displayed all of the left-leaning and right-leaning quotes and double-quotes as question marks, but displayed ® correctly. My guess is something in their pipeline is using latin1 encoding[1].
Which is to say, just getting everything onto UTF-8 (or UTF-16 if you must) would be a huge win.
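That symptom (curly quotes replaced, ® intact) is exactly what a Latin-1-only step in the pipeline produces; a quick way to reproduce it:

    text = "\u2018Us\u2019 \u2013 \u00ae"              # curly quotes, en dash, registered sign
    print(text.encode("latin-1", errors="replace"))    # b'?Us? ? \xae'

The registered sign survives because it exists in Latin-1 (0xAE); the typographic quotes and dash do not, so they degrade to question marks.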
Unicode supports 150+ scripts which can be used to write thousands of languages. No matter how good your Unicode support, there is probably some obscure script for which you do the wrong thing. But, in the real world, most software (whether proprietary or open source) supports a limited set of languages, and you test it works correctly for all of them, and every now and again you might add support for a new language (due to business requirements, or, in the case of open source, because someone volunteers to do it)-at which point any issues specific to that language and its script are found and fixed.
The unicode consortium maintains open source software to handle a lot of hard cases (not just case folding but canonicalization, formatting, line breaking...). Doesn't magically make all your problems go away but takes away a lot of hard work that you're most likely to get wrong.
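For example, locale-aware collation is nearly a one-liner with ICU. A minimal sketch, assuming the PyICU binding is installed (pip install PyICU):

    import icu

    collator = icu.Collator.createInstance(icu.Locale("de_DE"))
    print(sorted(["Zucker", "Äpfel", "Apfel"], key=collator.getSortKey))
    # German collation puts Äpfel next to Apfel instead of after Z,
    # which is where a naive code point sort would put it.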
>characters that display as white space but actually are not because they represent a nordic rune carved on the side of the stone not visible from the front
This is particularly annoying when trying to sanitize input in a language like Java, where String.trim() only removes space (maybe tab?) and the pre-defined Regex classes in Pattern define some of the whitespace characters, but not all.
Which is particularly egregious, since in actual braille, that is a whitespace character (specifically, a space with no dots in it represents the character... space! 0x20).
There is considered to be semantics in individual glyphs. So IIRC there is a unicode minus sign and a unicode dash and even if they are visibly indistinguishable you're not supposed to mix them. I'm no unicode expert though. I doubt the unicode consortium is comprised of fools and nincompoops. Best assume they know what they're doing.
I'm not claiming that BRAILLE PATTERN BLANK is the same character as SPACE (I don't particularly disagree, but that's not my point); I'm claiming that BRAILLE PATTERN BLANK is a whitespace character, just like SPACE or NEWLINE or U+2003 EM SPACE (pretty much any of the U+200xs, really), regardless of whether it's the same character.
Yes, and I'm assuming the reason BRAILLE PATTERN BLANK is not correctly classified is at least as stupid as that, and can therefore be safely ignored for the purposes of having legitimate reasons for things.
Finally you've got to the point: that braille whitespace is not classified as whitespace (which on checking is true). Why didn't you say that at the start?
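For the record, this is easy to check against the character database; Python's str.isspace() roughly tracks the standard's whitespace classification:

    import unicodedata

    for ch in ("\u0020", "\u00a0", "\u2003", "\u2800"):
        print(unicodedata.name(ch), ch.isspace())
    # SPACE True
    # NO-BREAK SPACE True
    # EM SPACE True
    # BRAILLE PATTERN BLANK False   <- not classified as whitespace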
There are a lot of those in Unicode. Characters that are visually indistinguishable but exist to provide round-trip capability to some older character set.
Then in that case it is because some apparent pairs of letters are in fact a single letter in some languages. Per the DZ wiki page someone else has given, DZ is a distinct single letter in Hungarian and Slovak at least.
Double letters being a single letter isn't rare, such as ll and dd in Welsh. In a strange sense it arguably occurs in English in th and th, which I could argue are double-letter representations of single letters, which were originally eth and thorn.
Very cool video, but it sounds like that space-that-isn't-a-space would also apply to a runic interpunct as well as to the Latin interpunct he mentions.
> Unicode support in my experience seems to be one of those hand wavy things where most people respond to the question of “Do you support unicode?” with
> > Yeah we support emojis, so yes we support unicode!
And sometimes it's not even that simple, because things that say they support UTF-8 can't encode the full range.
Emoji have been a boon for unicode support. In order to support emoji, a lot of software ends up supporting unicode in general. I think I read that this was intentional by the unicode consortium - it is arguable that emoji don't belong in unicode, but they were included to enhance adoption.
It has more limited usefulness outside of North America and Japan when it comes to cultural or geographically limited characters, especially food, drink and clothing; the Consortium does recognise this, but with a measly 50+100 new emoji per year, it will be forever before we see this improved significantly.
Without that control, you would be free to call custom emoji from the vasty deep, but without vendor support, that would not do you much good. I find it preferable to have a limited pipeline of additions, but with considerable vendor support.
My take on this is that it's difficult to communicate the exact details of “Unicode support”, so perhaps it's alright to say that one supports Unicode, generally speaking, but Unicode is complicated and there are corners that aren't supported fully.
My take is that it is important to understand the basics:
- UTFs
- [canonical] equivalence (form, normalization)
- [canonical] decomposition
- [canonical] precomposition
- case mapping
- case-folding
- localization
- the reason for all this complexity
(hint: it's not the UC's fault)
You don't have to really grok the details, much less memorize them. It does not take much more than TFA's ~1,500 words to explain the concepts.
> - the reason for all this complexity
> (hint: it's not the UC's fault)
This is a common fallacy: just because some of the complexity is unavoidable and irreducible does not mean that the UC hasn't made things much worse than they needed to be, with things like code points (which are neither full characters (see e.g. U+0308, the combining umlaut) nor single characters (see e.g. U+01F1, the two letters D and Z)) or emoji.
But, yes, with the possible exception of normalization, all of those are things you'd need to deal with in some fashion regardless of Unicode.
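The U+0308 / U+01F1 examples above are easy to check against the character database, e.g. via Python's unicodedata:

    import unicodedata

    print(unicodedata.name("\u0308"))        # COMBINING DIAERESIS -- not a character on its own
    print(unicodedata.combining("\u0308"))   # 230 (non-zero: it attaches to the preceding code point)

    print(unicodedata.name("\u01f1"))        # LATIN CAPITAL LETTER DZ -- one code point, two letters
    print(len("\u01f1"))                     # 1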
Even combining codepoints are a thing that predates Unicode and simplifies a number of things, so they were a very useful thing to have in Unicode.
First off, diacritical marks have generally been just that: marks that get added to other characters. Notionally that is a great deal like combining codepoints. For example, typewriter fonts, and later ASCII, were designed to have useful overstrike combinations. So you could type á as a<BS>', both in typewriters and in ASCII! Yes, ASCII was a variable-length, multi-byte internationalized encoding for Latin scripts. For example, this is why Spanish used to drop diacritical marks when up-casing: because there was no room in those typewriter fonts for accents on upper-case letters, though nowadays we have much more luxurious fonts, and so you see marks kept on up-cased Spanish a lot more than one used to.
Second, diacritical marks as combining marks adds a great deal of flexibility and a fair degree of future-proofing to the system. This is a restatement of the first point, I know. Read on before telling me that flexibility is nice but expensive.
Third, it is much simpler to close canonical composition to new compositions by having canonical decomposition than it is to keep adding pre-compositions for every sensible and new character. Part of this is that we have a fairly limited codepoint codespace while pre-compositions are essentially a cartesian explosion, and cartesian explosion means consuming codepoint codespace very quickly. Cartesian explosions complicate software as well (more on this below). This really gets back to the point about flexibility.
Fourth, scripts and languages tend to have decomposition strongly built into their rules. It would complicate things to only let them have pre-composed characters! For example, Hangul is really a phonetic script, but is arranged to look syllabic. Now, clearly Hangul is a lot more like Latin in terms of semantics than like Hiragana, say, so you should want to process Hangul text as phonetic (letters) rather than syllabic, but the syllabic aspect of it can't be ignored completely. So Hangul gets precompositions (lots, thanks to cartesian explosion) but even in NFC Hangul is decomposed because that is much more useful for handling in software.
Emoji are mostly like ideographic CJK glyphs, except with color. The color aspect is what truly makes emoji a new thing in human scripts, but even color can be a combining mark. Emoji clearly illustrate the fact that our scripts are still evolving. This is a critical thing to understand: our languages and scripts are not set in stone! You could say that there was no need to add emoji all you want, but they were getting added in much software, and users were using them -- eventually the UC was going to have to acknowledge their existence.
Aside from emoji, which are truly an innovation (though, again, not the UC's), conceptual decomposition of characters had long been a part of scripts. It was there because it was useful. It's entirely natural, and a very good thing too, that it's there in Unicode as well. Decomposition, and therefore combining codepoints, was probably unavoidable.
Regarding digraphs and ligatures, these too are part of our scripts, and we can't very well ignore them completely.
This is not to say that the UC didn't make mistakes. Han unification was a mistake that resulted from not inventing UTF-8 sooner. Indeed, the UC did not even invent UTF-8. Some mistakes are going to be unavoidable, but decomposition and combining codepoints are absolutely not a mistake.
Finally, we're not likely to ever replace Unicode -- not within the next few decades anyways. However much you might think Unicode is broken, however widespread, right, or wrong that perception might be, Unicode is an unavoidable fact of developers' lives. One might as well accept it as it is, understand it as it is, and move on.
Combining diacritics are a good thing, I'm talking about Unicode's conflation^W deliberate lies that a code point is a meaningful unit of data in its own right rather than an artifact of using a multi-level encoding from characters like "ä" to bytes like "61 CC 88".
> For example, this is why Spanish used to drop diacritical marks when up-casing: because there was no room in those typewriter fonts for accents on upper-case letters
Huh, I remembered that, but thought it was some badly-designed computer typesetting system.
> Third, it is much simpler to close canonical composition to new compositions by having canonical decomposition than it is to keep adding pre-compositions for every sensible and new character.
Yeah, this is what convinced me that base+diacritic was a better design than precomposed characters.
> Emoji are [...] with color.
That - that they are not actually text - is one of several problems with them, yes.
> but decomposition and combining codepoints are absolutely not a mistake.
Decomposition and combining diacritics are not a mistake. The mistake is treating the decomposed elements as first-class entities.
> Finally, we're not likely to ever replace Unicode -- not [soon] anyways.
That has been said about literally every bad status quo in the history of problems that existed long enough for people to complain about them.
"The dragon is bad!"[0], and no amount of status-quo bias is going to change that.
> I'm talking about Unicode's conflation^W deliberate lies that a code point is a meaningful unit of data in its own right rather than an artifact of using a multi-level encoding from characters
What on Earth are you talking about. Where does the UC "lie" like this? Remember, the base codepoint is generally a character in its own right when it's not combined, and combining codepoints are mostly not characters in their own right.
> Yeah, this is what convinced me that base+diacritic was a better design than precomposed characters.
If anything, precompositions probably exist primarily to make transcoding from other previously-existing codesets (e.g., ISO-8859-*) easier.
> > Emoji are [...] with color.
>
> That - that they are not actually text - is one of several problems with them, yes.
How are Kanji text and emoji not? Both are ideographic. What is the distinction? This is not rhetorical -- I'm genuinely curious what you think is the distinction.
> > but decomposition and combining codepoints are absolutely not a mistake.
>
> Decomposition and combining diacritics are not a mistake. The mistake is treating the decomposed elements as first-class entities.
The base codepoints are generally characters in their own rights ("first-class entities") while the combining ones are generally not. In what way am I getting that wrong?
> > Finally, we're not likely to ever replace Unicode -- not [soon] anyways.
>
> That has been said about literally every bad status quo in the history of problems that existed long enough for people to compain about them.
You are making arguments for why Unicode should be replaced, though I think they are unfounded, but you'll need more than that to get it replaced. Even granting you are right, how would it be made to happen?
> and combining codepoints are mostly not characters in their own right.
The Unicode standard originally defined "character" as a code point (not vice versa), and until less than five years ago I could not have a discussion about the differences between characters and code points (and why the latter are bad) without some idiot showing up to claim there was no difference, since Unicode defined them to be the same. However on looking for a citation, it seems that http://www.unicode.org/glossary/ does not actually support that claim (although bits such as "Character [...] (3) The basic unit of encoding for the Unicode character encoding" do little to oppose it). So I can't actually prove that the lies of said idiots were deliberate on the part of the UC. Which is to say I was mistaken in assuming that equivalence to be a deliberate lie rather than a miscommunication (probably).
> How are Kanji text and emoji not?
Kanji are monochrome and can be written with physical writing implements such as pencils. Emoji aren't and can't; a chunk of graphite can't make green and purple marks for an image (not character) of an eggplant.
> I'm genuinely curious what you think is the distinction.
Apologies, I should have been a bit more explicit there.
> The base codepoints are generally characters in their own rights
And to the extent that that's the case, we should, where possible, be talking about the characters, not the implementation details of representing them.
> Even granting you are right, how would it be made to happen?
No idea, but I'm not inclined to pretend something isn't bad just because I can't do anything about it.
> So I can't actually prove that the lies of said idiots were deliberate on the part of the UC.
I started having to deal with Unicode back around 2001. Even back then the distinctions between code unit, codepoint, character, and glyph were all well-delineated. Very little has changed since then, really, just more scripts, some bug fixes, and -yes- the addition of emoji.
> Kanji are monochrome and can be written with physical writing implements such as pencils. Emoji aren't and can't; a chunk of graphite can't make green and purple marks for an image (not character) of an eggplant.
I mean, cave paintings were color and kinda ideographic, and you do know that you can use colored crayons/pencils/pens/paint/printers to make colored text, yes? :-)
I'm sure you can make a monochrome emoji font. In a few thousand years we might evolve today's emoji to look like today's Han/Kanji -- that's what happened to those, after all!
Seriously, I can't see how color is enough to make emoji not text. Kids nowadays write sentences in Romance languages and English using mostly emoji sometimes. It can be like a word game. In fact, writing in Japanese can be a bit like a word game.
Animation begins to really push it though, because that can't be printed.
> And to the extent that that's the case, we should, where possible, be talking about the characters, not the implementation details of representing them.
And we do. We talk about LATIN SMALL LETTER A and so on. Yes, U+0061 is easier sometimes.
> No idea, but I'm not inclined to pretend something isn't bad just because I can't do anything about it.
Fair. Myself I think of Unicode as... mostly faithful to the actual features of human written language scripts, with a few unfortunate things, mostly just Han unification, UCS-2, and UTF-16. Honestly, considering how long the project has been ongoing, I think it's surprisingly well-done -- surprisingly not-fucked-up. So I'm inclined to see the glass as half-full.
Let's put it this way: there are a lot of mistakes to fix once we get the time machine working. Mistakes in Unicode are very, very far down the priority list.
> I started having to deal with Unicode back around 2001.
I don't remember the date, but I first had to deal with unicode when it was a 16-bit encoding (pre- surrogate pairs).
> I mean, cave paintings were color and kinda ideographic
Sure, but they weren't text, which is what we were actually talking about.
> you can use colored [whatever]
That's a property of the rendering, not of the characters; eg <font color=red>X</font> is a colored font tag containing a non-colored character (er, a character without color information), not vice-versa.
> Let's put it this way: there are a lot of mistakes to fix once we get the time machine working. Mistakes in Unicode are very, very far down the priority list.
That's fair; I have a very long and very angry priority list, but I was mainly just objecting to
> all [rather than some] this complexity [is] not the UC's fault
and grabbing a couple of specific examples off the top of my head.
(Although now I realize I never got around to complaining about zero-width joiner, or 'skin tone' modifiers, or parity sensitive flag characters, or...)
Yes, but many logographic characters started life as ideographic and then evolved to being logographic via simplification and styling. Emoji might some day evolve the same way. The distinction between emoji and text is weak. The fact that they are used inline with other scripts certainly argues emoji are text, and it would be weird to require a markup representation instead.
I understand that Han unification in Unicode has resulted in the same character encoding distinct glyphs, depending on the language in use.
So there would be a Japanese Kanji that looks similar to, but distinct from, a Chinese character. And both would be encoded as the same character in Unicode. And that character would look one way when included in a Japanese document and another way when included in a Chinese document.
If the document contains both Japanese and Chinese, and the character appears in both parts, would a Japanese user expect to find both occurrences when entering one of them?
Han unification is kind of overblown. There are a few characters that look different in contemporary Japanese and Chinese fonts, but you can find pre-War Japanese books that use the "Chinese" style character. There are a few characters that really shifted like 令, where the man on the street might not realize they're the same, but most are obviously the same, like 乘 and 乗. I think most Japanese would expect a search for a Japanese-style character to turn up its Chinese-style counterpart. Not doing so would be mostly a pain because you'd have to figure out how to type the Chinese one with your Japanese keyboard.
> Ideally, the document format should specify which parts are in what language.
But this is "off-band" information and it kind of defeats the purpose of unicode. Moreover it is extremely annoying in practice: imagine that you write a text (e.g., a comment on HN) where you want to explain the character differences between japanese and chinese. How do you get to do that? There are no "language selector characters" in unicode!
Yes, it would really be great if the Unicode Consortium designated some Unicode codepoints to be combining characters specifying the type of glyph to be used for the preceding Han character… and in fact, it seems like they have (for some at least)!
The Variation Selectors Unicode block¹ has three variation selectors designated for this purpose:
• U+FE00 VARIATION SELECTOR-1
• U+FE01 VARIATION SELECTOR-2
• U+FE02 VARIATION SELECTOR-3
I assume they work like U+FE0E VARIATION SELECTOR-15 (for 'text presentation' of the preceding character) and U+FE0F VARIATION SELECTOR-16 (for 'emoji presentation' of the preceding character). You can see examples of that in action here².
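As a concrete example of how VS15/VS16 select a presentation (how each one renders depends entirely on the font and renderer):

    heart_text = "\u2764\ufe0e"    # U+2764 + VARIATION SELECTOR-15 -> text presentation
    heart_emoji = "\u2764\ufe0f"   # U+2764 + VARIATION SELECTOR-16 -> emoji presentation
    print(heart_text, heart_emoji)

The selectors in the block above are meant to work the same way: a base character followed by a selector requesting a particular glyph.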
It would be nice if it was possible to write mixed-language Chinese and Japanese here in HN comments. It would also be nice if it was possible to write math here. But neither are possible...
I believe the plan behind the Chinese characters in Unicode was always that different hanzi-using languages had different renderings of certain characters, and that rendering it properly would be up to the font.
Yes, but that was botched from the start. Latin alphabets include various "font variants" in unicode, you can write C, ℂ, ℭ, 𝒞, ⠉, 𝐂, Ⅽ, 𝙲 and maybe a few more…
I guess the accusations of Unicode being somewhat centered on western languages do have a point.
Was. Whoever thought it a good idea to merge letters with similar shapes to same code points must’ve been drinking way too much ancient Chinese civilization kool-aid.
IIRC, ISO-10646 initially preserved each of the 16-bit national codings for Han characters (including three separate Chinese encodings). I don't know if that has been retained at all in later versions of the standard and a cursory reading of the relevant wikipedia pages is not informative.
The original motivation for the efforts that resulted in Han unification was to help with library and bibliography management (some of these efforts were by non-CJK speakers). One of the original design goals of Unicode was to be able to represent all of the characters in existing character sets uniquely (so two distinct characters in some charset requires two distinct characters in Unicode), and another design goal was to be able to facilitate conversion between different character sets representing the same script. This dovetailed nicely with the existing efforts to unify CJK scripts for bibliographies, hence Han unification.
There aren’t many fonts that cover both Japanese and Chinese, much less beautifully, and as a user I’d say ideally the relevant font should be used for each respective language.
The idea that "case insensitivity" is a generally meaningful concept is a falsehood. It applies to a small subset of writing systems, and horribly breaks elsewhere.
If you count writing systems, a small subset has an upper/lower distinction: Greek, Latin, and Cyrillic, and maybe a handful of others. Not Arabic, Chinese, Japanese, Devanagari, Tamil, Thai, Thaana, Bengali, Hebrew, or a bunch of other scripts.
If on the other hand you count the languages that use writing systems with an upper/lower case distinction, then most use Latin, a much smaller set use Cyrillic; and the rest of the writing systems are used by one or two languages each, with the notable exception of the Arabic script.
So most scripts lack an upper/lower case distinction, but probably most languages use a script that does have such a distinction.
Yeah, I was lax, "small subset of writing systems" is probably true, but that small subset is used by a very large portion of humanity, so I don't think that the idea that "case insensitivity" is a generally meaningful concept is a falsehood.
More precisely, I would probably agree that the concept of case is not so useful and we'd probably live well if it were only a graphical attribute (or not even that), without any uppercase (or lowercase) characters in the character sets. But since the characters do exist as of now and are spread everywhere in text, and in almost every case a user of a bicameral * script (or at least the Latin one) who wants to search normal text wants a case-insensitive search, your software had better be able to perform it, and usually by default.
By the way, I refreshed my Unicode and it doesn't look like Unicode case-insensitive matching is that difficult: it would seem that you basically only have to perform the mappings listed in CaseFolding.txt (+ maybe the normalization steps); a quick demo follows below.
* I'm ashamed to admit that I either just learned the term or had completely forgot about it
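Indeed, in Python those CaseFolding.txt mappings are essentially what str.casefold() gives you:

    print("CÔTÉ".casefold() == "côté".casefold())        # True
    print("Straße".casefold() == "STRASSE".casefold())   # True (ß folds to ss)
    print("İ".casefold() == "i".casefold())              # False -- folding alone is not locale-aware;
    # the Turkish dotted/dotless i still needs locale-specific handling.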
Exactly, and among those languages that have a case distinction quite a few have at least one edge case (like German ß vs ss vs ẞ) where the transformation isn't bijective. Or it's position-dependent.
If you want case insensitivity, and you also want internationalization, you'd better hope you have a good library that handles the edge cases, or you'll get a bunch of errors. Case-sensitive is far easier to program.
When dealing with case you rarely (if ever) need bijective transformations...
Case-insensitive search/comparison requires a temporary mapping, not a permanent transformation; the German case is easily handled in Unicode's CaseFolding.txt by just mapping all three of them either to ss or to ß (depending on the implementer, if he prefers a length-invariant mapping or not).
Position-dependent? I read about it, but the Unicode case folding algorithm doesn't support it, so I imagine it's not considered useful for case-insensitive comparison.
You probably do want a good library when dealing with Unicode, but this case thing actually doesn't seem to be that complex to implement by hand...
I don't doubt case-sensitive is far easier to program, but you usually program for some user's sake, not for your own pleasure :)
And in most cases a normal user expects case-insensitive search
He forgot about stricter identifier rules, which almost no product got right: normalization (a quick demo at the end of this comment).
And the complete lack of unicode support in the most basic libraries or programs, like libc, libstdc++ or coreutils.
As long as you cannot compare strings, you cannot search for them.
As long as identifiers are treated as binary (e.g filesystem paths, names, ...), there are tons of security holes in the most basic things, like kernel drivers. Adding unicode support to any product does not make much sense then. Rather restrict it to Latin-1 or the current locale to be on the safe side. Nobody knows the security recommendation for identifiers.
And the article didn't even get started on emoji; there are, for example, two different ways you can define the skin color of an emoji. Most programming languages don't have unicode support, so it's up to the developer how to handle them.
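Back to the identifier point: if identifiers (user names, file names, package names) are compared as raw bytes, two visually identical strings simply don't match, which is where the spoofing and security trouble starts. A minimal illustration; Unicode's identifier guidance (UAX #31) suggests normalizing, typically to NFC or NFKC, before comparing:

    import unicodedata

    a = "caf\u00e9"     # precomposed é
    b = "cafe\u0301"    # 'e' + combining acute accent -- renders identically
    print(a == b)                                                               # False as raw strings
    print(unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b))   # True

    # NFKC additionally folds compatibility characters, e.g. the fi ligature:
    print(unicodedata.normalize("NFKC", "\ufb01le"))   # 'file'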
In this article, unusually, text is difficult because of naive expectations, not because of incompetence.
For example, the treatment of case-insensitive comparisons is quite good, but the author thinks that counting "characters" should be simple and well-defined, that conversion between uppercase and lowercase characters should be reversible, and that one can "obviously use well supported libraries that solve this problem for you".
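The "how many characters?" question alone has no single right answer; Python's len(), for instance, counts code points:

    import unicodedata

    s = "ä"
    print(len(s), len(unicodedata.normalize("NFD", s)))   # 1 2 -- same text, two lengths

    family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"   # man + ZWJ + woman + ZWJ + girl
    print(len(family))   # 5 code points, but (ideally) one visible glyph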
There's a ton of recorded history of decentralized character encodings. Some people long for the days when a character was a byte and a byte was a character, but that model doesn't fit the world's languages.
The unicode consortium has been historically inclusive enough to avoid alienating enough people to cause a revolt.
They have defined ranges where they will not assign code points as private use areas, and you can use those for whatever symbols you need, but of course there's no mechanism to resolve disputes over use, and you would need your own method to determine which font was appropriate, and it's likely to be challenging to use outside of applications you control.
I think the available extension mechanisms in Unicode are quite limiting and in fact close to useless because you can run into trouble when you use them.
What I want is refer to new symbols through some unique ID, a decentralized way to serve those symbols, including all relevant information (such as what those symbols mean, what they are based on, what glyphs are preferred/available, how the use of these symbols evolved over time, references to related symbols, etc. etc.).
If I want to invent a symbol for a Coke can, a link to an external website [1], a new mathematical operator, or even Covid19, I want to be able to do it now, not wait for Unicode to stamp the proposal.
Use images or svgs then. Or define your custom svg font. I don't see how a decentralized way of defining whatever could even remotely work, since all of this absolutely must work while offline and while users are inputting arbitrary text.
I'm not sure where it comes from and what's come of it; seeing as the next sentence is "server could track usage", I have the feeling it's from the editor from Google... ;) (Mark Davis)
I don't see how being offline would be a problem, since the same problem exists when you want to type text using certain fonts which you'd need to download first.
There is a current proposal to support pretty much exactly that, namely a generic Wikidata identifier (for emojis).
I'm not sure how I feel about that: on one hand I would like to have a widespread system to refer to stuff in an ontology, on the other hand it would probably make anything using plain text download stuff from the internet, and maybe even require a connection in simplistic implementations...
Well, my own software tends not to use Unicode. Sometimes, it only uses ASCII. Sometimes, it doesn't care about character coding, as long as it is compatible with principle of extended ASCII. Sometimes, it is a combination of the two (such as ASCII only for commands, but comments can include any characters that are compatible with principle of extended ASCII). In one case (VGMCK), it does parse UTF-8 in some contexts (such as #TITLE commands), but the only thing it does with the decoded data is to convert it to UTF-16, since the output format (VGM) contains UTF-16 text. A UTF-8 byte order mark at the beginning of the file is not acceptable, though.
But I have thought of some mechanisms to declare character sets and character mapping, to determine which font is appropriate, etc. Unicode is a valid choice, but even if you select Unicode as your character set of use, you must specify the language code and the Unicode version. You can also specify more than one mapping, and custom mappings, for example if you are using CSUR to write in Klingon and English, then you can declare both Unicode and CSUR, together with the relevant version numbers.
I thought of a similar mechanism for declaring character sets and character mapping for TAVERN32 (which doesn't exist yet, but it is meant to be an improved text adventure game VM having the advantages of both TAVERN and Glulx). One lump declares the character sets and character mapping, and can include fallbacks if wanted, so that if ASCII mappings are declared, then it can work with any implementation even if they do not know about that character set. The story file might also (optionally) include other lumps with bitmap fonts, so that also allows it to work even if the character set is unknown (as long as it can display graphics, which some implementations may be incapable of). However, in this case, the character mapping is also relevant for compression too, and not only for determining character sets. There is then the possibility that multiple codes will be mapped to the same ASCII character (or sequence of characters in ASCII or any other character set), avoiding the problem of compatibility.
Reminds me of the adage that standards are nice because there are so many to choose from. Fortunately, the Unicode Standard is largely--but not completely--agreed on. There's still a lot of Chinese and Cyrillic in non-Unicode encodings.
But if there were a "decentralized" version of unicode (the lower case 'u' here is intentional), I'm afraid it would be the wild west. In fact I can recall a time in the early 2000s where a group I was with was trying to deal with Hindi (Devanagari) electronic text. It seemed like every website had its own encoding; no one used the ISCII standard, and only a couple websites (none in India) used Unicode. Deciphering the encodings was made even more difficult by the fact that some of them were actually stroke encodings, not character encodings--as if the letter 'd' were made of the character 'c' plus the character 'l'. Humans had to use the website's font in order to read the page, and of course it was hopeless for a computer to read it without a hand-built encoding converter. It was awful.
> Currently, one organization decides what symbols/emoji we can use
Well, the whole emoji idea of encoding images using font codepoints is silly.
Alternative would be just use crypto-hashes of SVG / PNG images (perhaps encoded by some unicode codepoints), which could be resolved by some image repositories. There could be multiple repositories and software could use many of them.
What I think is that it is useful to use different character sets for different purposes. One character set cannot possibly be useful for everyone in all applications, even though that is what the Unicode Consortium tried to do. Unicode is just really a mess, I think.
PostScript allows characters to have names instead of only numbers (you can change the assignment of numbers to each name), or you can use CID keying which means given a character set name, that decides what each 16-bit number corresponds to. Adobe said you will have to register the character set name, but I think that you can use URIs (or UUIDs, which are a special case of URIs). You could also use UUIDs as XUIDs, by using 36#UUID as the organization number, and then the next four numbers make up the UUID.
I am working on a character set actually, but it isn't really decentralized nor is it meant to replace Unicode in all of its uses; it is meant only for a specific use, which is grid-based displays (such as terminal emulators). There is an 8-bit mapping, which can be used to encode text, and there is a 16-bit mapping, which can be used for fonts and for tables within terminal emulation software to convert character codes from other character sets. Additionally, it has the property that both the 8-bit and 16-bit mappings allow easily figuring out the width of a string, without needing any tables, without having any ambiguity, and without the possibility to be changed in future (you can actually just add together the widths of the bytes, which are determined using a simple rule; control codes have undefined widths). The 8-bit mapping is also compatible with principle of extended ASCII (like UTF-8 is, although the format is different from UTF-8).
I have also seen someone else on IRC also mentioning making up such a character set for grid-based displays, although theirs was simpler than mine, not having any non-ASCII single width characters at all. I consider that drawback unacceptable, so I made one which is almost as simple, but allows non-ASCII single width characters.
One thing that Unicode is good for though is searching collections of documents written in many languages. However, Unicode has many features which are excessive, unnecessary, or problematic, for that purpose, too.
Another alternative I have seen once on IRC is Duocode, where each character code has two halves, being the language code and the glyph code. For the purpose of searching documents in many languages, this Duocode idea might even be better than Unicode (especially if many of the messy features of Unicode are avoided).
And, for Chinese writing, there is Cangjie encoding. (I think Cangjie is mostly used as a input method, although it could also be used as a character set for Chinese writing.)
I would say that Unicode is complex, not that it is a mess. As Einstein (or somebody) said, Everything Should Be Made as Simple as Possible, But Not Simpler.
There are admittedly a few oddities in Unicode (like the treatment of some characters composed of others in Devanagari, IIRC), but they are blessedly rare. What's complicated is writing systems. Have a look at the Burmese script, or contextual variants in Arabic script, and you'll begin to understand why Unicode is complicated.
How characters should be composed should be a feature of the font, I think. This way, you can even make up your own writing systems and then it will work.
> PostScript allows characters to have names instead of only numbers
All defined Unicode codepoints do have names. "Mathematical Monospace Capital C" is codepoint U+1D672. Using character names as an encoding is inefficient but valid, you can do so in e.g. Perl: $foo = "\N{MATHEMATICAL MONOSPACE CAPITAL C}";
Btw., Perl is the one language I would point out for really excellent Unicode support, I know of none that comes even close.
I won't disagree with you about Perl, but Python comes close, including in its support of Unicode character properties in regex's using the newer regular expression library (https://pypi.org/project/regex/).
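Both features are available in Python as well, for what it's worth; the property syntax comes from the third-party regex module linked above:

    import unicodedata
    import regex   # the third-party module, not the stdlib 're'

    c = "\N{MATHEMATICAL MONOSPACE CAPITAL C}"
    print(hex(ord(c)), unicodedata.name(c))   # 0x1d672 MATHEMATICAL MONOSPACE CAPITAL C

    print(regex.findall(r"\p{Greek}+", "mixed Latin and ελληνικά"))   # ['ελληνικά']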
Yes, although that is not the point. PostScript isn't using character names as an encoding; there is an array to specify which character names are being encoded, so that you can encode everything with numbers instead.
But I agree it isn't really that good; having a character set name and then the numbers, would be better, I think. PostScript supports that too, though.
One of the requisites of a normal character set, and one of the aims of Unicode in particular, is random access; if you have references into an array it's a different thing, and in many contexts that can't work.