Hacker News new | past | comments | ask | show | jobs | submit login
Equivalence of Unicode strings is strange (2016) (databasearchitects.blogspot.com)
32 points by greghn on Aug 2, 2022 | hide | past | favorite | 33 comments

If he thinks that users wanting case-insensitive sort is so strange, I think the title of this should be "Equivalence of human text is strange"

Case-insensitive depends on the use-case I guess, but accent-insensitive really makes sense for (probably) most applications, considering that's also how dictionaries are sorted in german.

The semantic impact of diacritics is locale-dependent, some languages treat accented characters as a variant of the base (and thus with the same identity) while others treat them as separate letters.

So while the german or french alphabets are purely latin (the diacritics are not part of the alphabet), that is very much not the case of the 29 letters Turkish alphabet (where accented characters are separate but following the un accented form), or the Icelandic alphabet (where ö does not sort with o and ó), or the Norwegian alphabet (where IIRC ø and å sort at the end of the alphabet, nowhere near o or a).

To say nothing of Welsh, where “ng” is a digraph sorting between “g” and “h” (although I don’t think any welsh word starts with ng so it probably doesn’t matter).

It is not just locale dependent, but may also depend on the type of list you are sorting. So while German dictionaries ignore the umlauts as claimed, German phone books sort names with umlauts according their non-umlaut transcription (ä = ae, ö = oe, ü = ue). And I own an old encyclopedia that goes the other way and sorts words that are conventionally written with the non-umlaut transcription as if they were written with an umlaut, which is then ignored.

Dictionary: Goethe, Görlitz, Gotha, Gotik

Phone book: Görlitz, Goethe, Gotha, Gotik

Encyclopedia: Görlitz, Gotha, Goethe, Gotik

And, oh, yes, "ß" sorts right before "ss" and after "sr".

> dictionaries ignore the umlauts

almost, but not quite, since you can have words that only differ by umlaut; in that case, the one with the umlaut should sort after the one without one, so basically your alphabet now is a, ä, b, c, ... l, m, n, o, ö, p, ..., z. This btw is how modern Japanese dictionaries are sorted, too, where two entries with the only difference of a dakuten (as in しゅ, じゅ) are sorted as shown here.

I guess they also sort loan words starting with “ng” with “n” (example: https://en.wikipedia.org/wiki/Bhutanese_ngultrum)

Sorting also is context-dependent. https://unicode.org/reports/tr10/#Introduction: “even within the same language, dictionaries may sort differently than phonebooks or book indices”

Weird example (https://unicode.org/reports/tr10/#Contextual_Sensitivity): “Normally, all differences in sorting are assessed from the start to the end of the string. If all of the base letters are the same, the first accent difference determines the final order. %[…] In some French dictionary ordering traditions, however, it is the last accent difference that determines the order”

(That entire page is worth reading if you want to know about collation)

In Dutch the ij is a digraph. The dictionary sorts words starting with as an i followed by j, though many other places like the phone book sort it together with y.

(... after my boss told me he checked our submissions of free time by simply making a search for the word "vrij" in Slack, [meaning "free"] I started to use the unicode digraph ij to try and cheat the system.)

Dictionary sorting is a mess anyway.

For example Finnish there is the fact that ÅÄÖ are at the end. Make sense as Ä and Ö are entirely different letters and Å is common enough in names.

But where it actually gets fun is that V and W can be considered to be same letter and thus can be sorted together... Though so that W is after V... And that is not even considering the accents or letters from other latin-scripts...

And then there is compound words. Which can be sorted either be sorted based on base words and their use in compound words or with morphemes. Or just alphabetically...

And even fun articles aren't always clear cut even in English.

> but accent-insensitive really makes sense for (probably) most applications, considering that's also how dictionaries are sorted in german.

I doubt it's exactly insensitive in the way that's described in the article, because it's still likely deterministic, whereas in the article the behavior can be nondeterministic.

I'm having a trouble thinking of an example in German, but in Spanish, there are plenty of words that differ only by an accent mark: "papá" and "papa", "tú" and "tu", etc. I'm guessing that any given dictionary would still treat these identical-but-for-diacritic words consistently throughout the dictionary, rather than sometimes placing the one with the umlaut first and sometimes placing the one with the umlaut second.

Umlauts are always sorted like the base letter + e in German, so as are, or, ue. However, I think with accent-imsensitive they meant that accents that don't exist in German are simply ignored when sorting. So you wouldn't have German words where the letters only differ by accent. Umlauts are normal German letters, so they have sorting rules (albeit different ones from the same letter in other countries, e.g. in Swedish I think they simply come at the end of the alphabet).

You're correct about Swedish. The extra letters come at the end, in the order å, ä, ö. Interestingly enough this order is different to how Norwegian and Danish orders them.

Also, as opposed to German, they are not referred to as umlauts or accents. ä is a completely different letter compared to a, and they have absolutely no relation. An indication of this is that I can't even think of a word that describes the dots over the a, similar to how most English speakers wouldn't be be able to name the dot over the i.

As such, a search or sorting algorithm that puts equivalence between ä and a would be completely broken. However, an English speaker who only wants to look up a Swedish word in a dictionary would definitely want that equivalence. They may not even be able to type the Swedish letters. From this I conclude that it's the locale of the user that must dictate the way comparison works, not the language of the words that are being compered. This is thankfully how locales typically work.

In Swedish, these are extremely common letters, there are plenty of examples where words differ only in letters which would change using such algorithms. Such as älg/alg (elk/algae), or kö/ko (queue/cow).

It's not exactly right to say that a/å/ä and o/ö have absolutely no relation in Swedish. There are plenty of language cases where they act more like umlauts, such as stor/större (large/larger) or få/färre (few/fewer). Despite this, they are still always treated as separate letters.

Also, the history of the letter shapes are the same as the german umlauts, with e being written above a or o before writing ä or ö and o written above a before writing å.

In one sense you're right. There is a historical relation. However, I'd argue that that relation is so obscure these days that no native speaker would even think about it, unless it's pointed out of course.

But yes, the origin of the letters is indeed from their base, where ä came from an e written on top of the a. This is more clear in Norwegian where you write the same letter like this æ.

These days, very few people (outside of liguists working with the history of the language) would argue that ä is a variant of a though.

To be fair, vowel gradation (aka ablaut) is a rather weak relationship – one wouldn’t usually say, eg. that in English i, a, and u are related just because of the conjugation of Germanic strong verbs such as sing/sang/sung.

Ablaut is unrelated in this context though. These vowels arise from a much more recent sound change (conveniently called umlaut, because it gave rise to a bunch of umlauted vowels).

Fair point, it's about back vowels becoming front vowels of the same height.

Romanian is similar to Swedish with ă, â, î, ț, ș - these are all separate letters, not accents (though they are sorted after the Latin letter they resemble, not at the end of the alphabet - a, ă, â, b, c,..., s, ș, t, ț,... ), and similarly words often differ only in those letters - in/în (flax/in), par/păr (stick/hair).

However, for Romanian it is also very common in electronic documents to replace these letters (collectively called diacritics) with their base Latin forms, and it is not always easy to predict how a document will be spelled. So, it's often useful for text searches to actually conflate them. I'm not sure, but this may also have been common in things like word indexes for books, even before computers.

I'm curious if the same is true for Swedish.

If being faced with a situation where a Swedish word has to be written without the Swedish letters, such as when writing names in broken forms online (I have on of those letters in my name, and forms that tells me to "only use alphabetic letters" are annoyingly common), then either people simply use the letter without the dots, or they use formal transliteration.

The formal way (which is what is done for the international part of a passport for example) is å to aa, ä to ae, ö to oe.

Diacritics are a mess, and most of the time don't even resemble the "original" sound.

Well, they have the advantage that people will typically understand what word was meant even if using only Latin letters, with a bit of context. If they had been entirely different unrelated glyphs, it may be much harder to understand in situations in which you are limited to the Latin alphabet.

Of course, it's debatable whether diacritics were a better solution than using letter combinations. There are some canonical replacements already - sh for ș and tz for ț - and some could have been created for the extra vowels. This is especially puzzling since we already use letter combinations instead of diacritics for the Ç sound (c-e/i) and the soft G sound (g-e/i).

You're wrong about german, that sorting order is only common for phone books, see https://news.ycombinator.com/item?id=32316743

The article's behavior is non-determenistic only because author didn't finish designing the system and just kinda gave up instead.

For example in the last statement, one approach might be to keep all equivalent strings, then sort them using some other complete ordering function (like utf-8 codepoints) and always return first one. (Yes, this will increase complexity, but locales often do that)

But if you use eg. GROUP BY as in the author’s example it’s not really up to you which representative of the equivalence class is chosen (I wonder if there’s a Unicode algorithm for transforming a word to a canonical representative of its EC… probably not in the general case).

If you are a database user, sure, you are at mercy of database developer.

But if you are writing a database yourself, it is entirely up to you which representative will be returned by GROUP BY. You can take an easy way out and implement "return non-deterministic random member of the group" strategy. Or you can choose the representative with smallest UTF-8 point values, so it will be fully deterministic (I'd prefer this one personally). Or choose row with smallest insertion date. Or let user choose. There are tons of options.

French happens to have a quartet of words that makes collation order obvious: cote, coté, côte, and côté. (That's the order my French-English dictionary orders those words; my French dictionary swaps coté and côte.)

Is a dictionary ever going to be sorting those words against each other outside of a single copy in the main list? You're always consistent if you only do it once.

> Is a dictionary ever going to be sorting those words against each other outside of a single copy in the main list? You're always consistent if you only do it once.

If a dictionary listed "papa" before "papá" but "más" before "mas", I would call that an inconsistent lexicographic ordering.

Compound words are quite common: Papá Noel, papaíto, papas fritas, papista

If you're counting words with suffixes, I already checked the oxford spanish-english dictionary as the first useful google result, and it goes back and forth between accent and no accent, for example papa then papá then papada.

But that's not really the same thing as having the entire words papa and papá be sorted one way in one place, and another way in another place.

The acute in Spanish is a stress marker is it not, therefore (to this non-expert anyway) it makes more sense to treat the accented and unaccented vowels the same for collation purposes than would be the case for languages where the diacritic denotes a different phoneme. That said, I don't know enough Spanish to know whether the acute ever distinguishes separate words, which would perhaps weaken that argument.

The grave in Italian serves the same purpose AFAIK so would be interesting to compare the rules for that language.

There are many words in Spanish that are distinguished only by the position of the stress. In fact it is very common for verbs, for example:

- Cálculo: "calculation" - Calculo: "I calculate" - Calculó: "he calculated"

Having said that, the standard collation rules are:

- First, unmarked vowels. - Second, vowels with an acute accent. - Last, vowels with diaeresis (the diaeresis is very rare and in native words only appears in the syllables "güe" and "güi", were it means that the U should be pronounced, as otherwise it would be silent).

In my experience when you have a DB field where you take sorting very seriously (because you're writing a dictionary and/or the data is meant to be circulated in printed form) you'll want to use basic lexicographic sorting (by Unicode code points, LC_COLLATE="C.UTF-8" IIRC) and build a sort field yourself. I just tried the ICU collation demo at https://icu4c-demos.unicode.org/icu-bin/collation.html and while they do have both German phonebook and German general-purpose collation orders, the former cannot be used for dictionaries and the latter sorts abel, äbel, aber, äber as abel, aber, äbel, äber, which I find wrong. Instead of mucking with collation APIs, it's much easier to rewrite your strings and use lexicographic (codepoint-based) collation; in the case here, rewriting as a0bel, a1bel, a0ber, a1ber could work.

Applications are open for YC Winter 2024

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact