So while the German or French alphabets are purely Latin (the diacritics are not part of the alphabet), that is very much not the case for the 29-letter Turkish alphabet (where accented characters are separate letters that follow their unaccented forms), or the Icelandic alphabet (where ö does not sort with o and ó), or the Norwegian alphabet (where IIRC ø and å sort at the end of the alphabet, nowhere near o or a).
To say nothing of Welsh, where “ng” is a digraph sorting between “g” and “h” (although I don’t think any Welsh word starts with ng so it probably doesn’t matter).
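A digraph like “ng” can still matter in the middle of a word. Here is a minimal sketch of a digraph-aware sort key, assuming the usual 29-letter Welsh alphabet order and a naive greedy tokenizer. Both `WELSH` and `welsh_key` are names made up for this illustration. Greedy matching will misread words where n and g are really two separate letters, and accented vowels are not handled, so treat it as a toy rather than real Welsh collation.

```python
# Assumed Welsh alphabet order, with digraphs counted as single letters.
WELSH = ["a", "b", "c", "ch", "d", "dd", "e", "f", "ff", "g", "ng", "h",
         "i", "j", "l", "ll", "m", "n", "o", "p", "ph", "r", "rh", "s",
         "t", "th", "u", "w", "y"]
RANK = {letter: i for i, letter in enumerate(WELSH)}

def welsh_key(word):
    """Turn a word into a list of letter ranks, matching digraphs greedily."""
    word = word.lower()
    key, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        if len(pair) == 2 and pair in RANK:
            key.append(RANK[pair])  # digraph such as "ng" or "ch"
            i += 2
        else:
            ch = word[i]
            # Characters outside the alphabet sort after it, by code point.
            key.append(RANK.get(ch, len(WELSH) + ord(ch)))
            i += 1
    return key

print(sorted(["anadl", "angel"]))                 # plain code point order
print(sorted(["anadl", "angel"], key=welsh_key))  # "ng" ranks before "n"
```

With this key, "angel" sorts before "anadl" because the rank of "ng" is lower than the rank of "n", which is the opposite of plain code point order.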
Dictionary: Goethe, Görlitz, Gotha, Gotik
Phone book: Görlitz, Goethe, Gotha, Gotik
Encyclopedia: Görlitz, Gotha, Goethe, Gotik
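The first two orderings can be reproduced with two small key functions. This is only a sketch: it assumes the dictionary convention treats ö like o and the phone book convention treats ö like oe, and that the input uses precomposed characters. `dictionary_key` and `phonebook_key` are hypothetical names, and the encyclopedia ordering is not modelled here.

```python
def dictionary_key(word):
    # Assumed dictionary convention: umlauts sort like their base vowel.
    table = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})
    return word.lower().translate(table)

def phonebook_key(word):
    # Assumed phone book convention: umlauts sort like vowel + e.
    table = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})
    return word.lower().translate(table)

names = ["Gotik", "Gotha", "Görlitz", "Goethe"]
print(sorted(names, key=dictionary_key))  # ['Goethe', 'Görlitz', 'Gotha', 'Gotik']
print(sorted(names, key=phonebook_key))   # ['Görlitz', 'Goethe', 'Gotha', 'Gotik']
```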
Almost, but not quite, since you can have words that differ only by an umlaut; in that case, the one with the umlaut should sort after the one without, so basically your alphabet is now a, ä, b, c, ... l, m, n, o, ö, p, ..., z. This btw is how modern Japanese dictionaries are sorted, too: two entries that differ only by a dakuten (as in しゅ, じゅ) are sorted in the order shown here.
Sorting is also context-dependent. https://unicode.org/reports/tr10/#Introduction: “even within the same language, dictionaries may sort differently than phonebooks or book indices”
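One way to sketch the "umlaut only matters as a tie-break" rule is a two-part key: first the word with its marks stripped, then the word itself. This assumes Unicode decomposition (NFD) is a good enough notion of "base letter", which happens to cover both the umlaut and the dakuten case. `strip_marks` and `tie_break_key` are illustrative names, not anything from a library.

```python
import unicodedata

def strip_marks(text):
    """Drop combining marks (umlaut dots, dakuten, ...) after decomposing."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def tie_break_key(word):
    composed = unicodedata.normalize("NFC", word)
    # Primary: base letters only. Secondary: the marked form, so the
    # unmarked word wins when everything else is equal.
    return (strip_marks(composed).lower(), composed)

print(sorted(["Bart", "Bär", "Bar"], key=tie_break_key))  # ['Bar', 'Bär', 'Bart']
print(sorted(["じゅ", "しゅ"], key=tie_break_key))          # ['しゅ', 'じゅ']
```

Note that "Bär" lands before "Bart" here, which is not what a literal a, ä, b, ... alphabet would give; the mark only decides between otherwise identical words.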
Weird example (https://unicode.org/reports/tr10/#Contextual_Sensitivity): “Normally, all differences in sorting are assessed from the start to the end of the string. If all of the base letters are the same, the first accent difference determines the final order. […] In some French dictionary ordering traditions, however, it is the last accent difference that determines the order”
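A rough sketch of the difference between "first accent difference wins" and "last accent difference wins", using the classic cote/côte/coté/côté set. This is not the Unicode Collation Algorithm itself, just a toy two-level comparison; `split_marks`, `forward_key` and `backward_key` are made-up names, and ranking marks by their code point is an assumption of this sketch.

```python
import unicodedata

def split_marks(word):
    """Return (base letters, list of per-letter mark tuples)."""
    base, marks = [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and marks:
            marks[-1] += (ord(ch),)  # attach the mark to the previous letter
        else:
            base.append(ch.lower())
            marks.append(())
    return "".join(base), marks

def forward_key(word):
    base, marks = split_marks(word)
    return (base, marks)        # first accent difference decides

def backward_key(word):
    base, marks = split_marks(word)
    return (base, marks[::-1])  # last accent difference decides

words = ["côté", "cote", "coté", "côte"]
print(sorted(words, key=forward_key))   # ['cote', 'coté', 'côte', 'côté']
print(sorted(words, key=backward_key))  # ['cote', 'côte', 'coté', 'côté']
```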
(That entire page is worth reading if you want to know about collation)
(... after my boss told me he checked our submissions of free time by simply searching Slack for the word "vrij" [meaning "free"], I started to use the Unicode digraph ĳ to try and cheat the system.)
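For the curious, this is roughly why the trick can work against a naive substring search, and how compatibility normalization (NFKC) would undo it. Whether Slack's search normalizes like this is not something this sketch can tell you; it only compares plain Python strings.

```python
import unicodedata

message = "vr\u0133"  # "vrĳ", spelled with the single character U+0133

# A naive substring search does not see the two-letter spelling.
print("vrij" in message)                                 # False

# NFKC replaces the ligature with the two letters i and j.
print("vrij" in unicodedata.normalize("NFKC", message))  # True
```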
For example, in Finnish, Å, Ä and Ö come at the end of the alphabet. That makes sense, as Ä and Ö are entirely different letters and Å is common enough in names.
But where it actually gets fun is that V and W can be considered the same letter and thus sorted together... though with W after V... And that is not even considering the accents or letters from other Latin scripts...
And then there are compound words, which can be sorted either by their base words and how those are used in compounds, or by morphemes. Or just alphabetically...
And, for fun, articles aren't always clear-cut even in English.
I doubt it's exactly insensitive in the way that's described in the article, because it's still likely deterministic, whereas in the article the behavior can be nondeterministic.
I'm having trouble thinking of an example in German, but in Spanish, there are plenty of words that differ only by an accent mark: "papá" and "papa", "tú" and "tu", etc. I'm guessing that any given dictionary would still treat these identical-but-for-diacritic words consistently throughout the dictionary, rather than sometimes placing the one with the diacritic first and sometimes placing it second.
Also, as opposed to German, they are not referred to as umlauts or accents. ä is a completely different letter compared to a, and they have absolutely no relation. An indication of this is that I can't even think of a word that describes the dots over the a, similar to how most English speakers wouldn't be able to name the dot over the i.
As such, a search or sorting algorithm that puts equivalence between ä and a would be completely broken. However, an English speaker who only wants to look up a Swedish word in a dictionary would definitely want that equivalence. They may not even be able to type the Swedish letters. From this I conclude that it's the locale of the user that must dictate the way comparison works, not the language of the words that are being compared. This is thankfully how locales typically work.
In Swedish, these are extremely common letters, and there are plenty of word pairs that differ only in letters that such algorithms would conflate, such as älg/alg (elk/algae) or kö/ko (queue/cow).
Also, the history of the letter shapes is the same as that of the German umlauts, with e being written above a or o before it became ä or ö, and o written above a before it became å.
But yes, the origin of the letters is indeed from their base, where ä came from an e written on top of the a. This is clearer in Norwegian, where the same letter is written like this: æ.
These days, very few people (outside of linguists working with the history of the language) would argue that ä is a variant of a though.
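To make the two viewpoints concrete, here is a toy sketch with two sort keys: one that treats å, ä and ö as separate letters at the end of the alphabet, and one that folds them into a and o the way an English-locale user might expect. Real code would get this from locale or ICU collation data; the hand-written `SWEDISH` table and the function names are assumptions of this example.

```python
import unicodedata

SWEDISH = "abcdefghijklmnopqrstuvwxyzåäö"
RANK = {letter: i for i, letter in enumerate(SWEDISH)}

def swedish_key(word):
    # å, ä, ö are their own letters and sort after z.
    return [RANK.get(ch, len(SWEDISH) + ord(ch)) for ch in word.lower()]

def fold(word):
    # Strip combining marks, so ä becomes a and ö becomes o.
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def folded_key(word):
    return (fold(word), word)

words = ["zebra", "älg", "alg", "kö", "ko"]
print(sorted(words, key=swedish_key))  # ['alg', 'ko', 'kö', 'zebra', 'älg']
print(sorted(words, key=folded_key))   # ['alg', 'älg', 'ko', 'kö', 'zebra']
print(fold("älg") == fold("alg"))      # True: a folded search conflates them
```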
However, for Romanian it is also very common in electronic documents to replace these letters (collectively called diacritics) with their base Latin forms, and it is not always easy to predict how a document will be spelled. So, it's often useful for text searches to actually conflate them. I'm not sure, but this may also have been common in things like word indexes for books, even before computers.
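A minimal sketch of that kind of conflating search, assuming Unicode decomposition is enough to map ă, â, î, ș and ț (including the legacy cedilla forms) to their base letters. `fold` and `search` are hypothetical helpers written for this example, not part of any search library.

```python
import unicodedata

def fold(text):
    """Case-fold and strip combining marks (breve, circumflex, comma below)."""
    decomposed = unicodedata.normalize("NFD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def search(query, documents):
    """Return documents containing the query, ignoring diacritics and case."""
    needle = fold(query)
    return [doc for doc in documents if needle in fold(doc)]

docs = ["Academia de Științe", "Academia de Stiinte", "Academia de Arte"]
print(search("științe", docs))  # matches both spellings
print(search("stiinte", docs))  # same result when typed without diacritics
```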
I'm curious if the same is true for Swedish.
The formal way (which is what is done for the international part of a passport for example) is å to aa, ä to ae, ö to oe.
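As a sketch, that replacement is a three-entry translation table. The capitalized forms (Å to Aa and so on) are an assumption of this example; an all-caps context would use AA, AE, OE instead. `transliterate` is a made-up helper name.

```python
# Replacement table described above; the capitalized variants are assumed.
TABLE = str.maketrans({
    "å": "aa", "ä": "ae", "ö": "oe",
    "Å": "Aa", "Ä": "Ae", "Ö": "Oe",
})

def transliterate(name):
    return name.translate(TABLE)

print(transliterate("Åsa Sjöström"))  # Aasa Sjoestroem
```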
Of course, it's debatable whether diacritics were a better solution than using letter combinations. There are some canonical replacements already - sh for ș and tz for ț - and some could have been created for the extra vowels. This is especially puzzling since we already use letter combinations instead of diacritics for the Ç sound (c-e/i) and the soft G sound (g-e/i).
For example in the last statement, one approach might be to keep all equivalent strings, then sort them using some other total ordering function (like UTF-8 code points) and always return the first one. (Yes, this will increase complexity, but locales often do that.)
But if you are writing a database yourself, it is entirely up to you which representative will be returned by GROUP BY. You can take the easy way out and implement a "return a non-deterministic random member of the group" strategy. Or you can choose the representative with the smallest UTF-8 code point values, so it will be fully deterministic (I'd prefer this one personally). Or choose the row with the earliest insertion date. Or let the user choose. There are tons of options.
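The "smallest code points" option could look something like this sketch. It groups strings under an accent- and case-insensitive key and picks `min()` of each group; Python compares strings by code point, which gives the same order as comparing their UTF-8 bytes. `collation_key` and `group_by_insensitive` are illustrative names, not any database's API.

```python
import unicodedata
from collections import defaultdict

def collation_key(text):
    """Toy insensitive collation: case-fold and strip combining marks."""
    decomposed = unicodedata.normalize("NFD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def group_by_insensitive(rows):
    """Group equivalent strings; represent each group by its smallest member."""
    groups = defaultdict(list)
    for row in rows:
        groups[collation_key(row)].append(row)
    # min() does not depend on insertion order, so the result is deterministic.
    return {key: min(members) for key, members in groups.items()}

print(group_by_insensitive(["papá", "tú", "papa", "Papa", "tu"]))
# {'papa': 'Papa', 'tu': 'tu'}
```

The representative is the same no matter what order the rows arrive in, which is the whole point of the exercise.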
If a dictionary listed "papa" before "papá" but "más" before "mas", I would call that an inconsistent lexicographic ordering.
But that's not really the same thing as having the entire words papa and papá be sorted one way in one place, and another way in another place.
The grave in Italian serves the same purpose AFAIK, so it would be interesting to compare the rules for that language.
- Cálculo: "calculation"
- Calculo: "I calculate"
- Calculó: "he calculated"
Having said that, the standard collation rules are:
- First, unmarked vowels.
- Second, vowels with an acute accent.
- Last, vowels with diaeresis (the diaeresis is very rare and in native words only appears in the syllables "güe" and "güi", where it means that the U should be pronounced, as otherwise it would be silent).
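Those three rules can be sketched as a two-level key: letters first, then a rank per vowel (0 unmarked, 1 acute, 2 diaeresis) that only breaks ties. Treating ñ as its own letter after n, and letting the first accent difference decide, are assumptions of this sketch; `spanish_key` is a made-up name.

```python
import unicodedata

ALPHABET = "abcdefghijklmnñopqrstuvwxyz"
RANK = {letter: i for i, letter in enumerate(ALPHABET)}
# Marked vowel -> (base vowel, mark rank): acute = 1, diaeresis = 2.
MARKED = {"á": ("a", 1), "é": ("e", 1), "í": ("i", 1),
          "ó": ("o", 1), "ú": ("u", 1), "ü": ("u", 2)}

def spanish_key(word):
    bases, marks = [], []
    for ch in unicodedata.normalize("NFC", word.lower()):
        base, mark = MARKED.get(ch, (ch, 0))
        # Characters outside the alphabet sort after it, by code point.
        bases.append(RANK.get(base, len(ALPHABET) + ord(base)))
        marks.append(mark)
    return (bases, marks)  # marks only matter when the letters are equal

print(sorted(["Cálculo", "Calculó", "Calculo"], key=spanish_key))
# ['Calculo', 'Calculó', 'Cálculo']
```

With this key "papa" sorts before "papá" and "mas" before "más", so the identical-but-for-accent pairs are at least ordered consistently.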