Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
Unicode sorting is hard and why browsers added special emoji matching to regexp (hexops.com)
97 points by todsacerdoti on June 28, 2021 | hide | past | favorite | 55 comments


The takeaway is not "unicode is overly complex", it's that an algorithm for sorting strings consisting of any possible human language characters is super convoluted, and it's pretty amazing that unicode gives us a standard collation algorithm including localizations, that works remarkably well.

But... is complicated enough to implement that many environments apparently haven't felt there was enough demand to do so?


> algorithm for sorting strings consisting of any possible human language characters is super convoluted

not only that: it's different depending on locale and sometimes context (in Germany there's a distinction of whether you sort for a phone book listing or not).

Doing it right requires basically shipping a huge database with locale specific information which is seen as "bloat" by many projects and thus made optional or not available at all.

>and it's pretty amazing that unicode gives us a standard collation algorithm including localizations, that works remarkably well.

it's not good enough in many cases. Thankfully there's ICU, but many environments frown upon that because it's seen as bloated. If sorting works "fine" for US-ASCII, why link to a 60MB library?


I meant to include ICU in what I was saying works well when I said "unicode gives us a standard collation algorithm including localizations".

ICU stands for "International Components for Unicode." ICU collation is an implementation of the Unicode Collation Algorithm (which is really a suite of algorithms, or an algorithm with various parameterized branches), using the Unicode Common Locale Data Repository for localization -- that's also unicode, the CLDR is a unicode thing too, part of unicode. It's all unicode.

There are other implementations than ICU -- although maybe none as complete, because it is a lot/expensive, indeed. But ICU is an implementation of unicode algorithms. But for instance, unicode collation is also implemented -- in a way that will be the same as ICU modulo bugs or details the standard leaves up to implementations to vary -- in pure ruby by https://github.com/twitter/twitter-cldr-rb#sorting-collation


PS: Did you know the unicode collation algorithms also includes a way to produce a sequence of bytes for a unicode string, such that sorting the produced sequence of bytes in strict byte order will produce the proper collation order for the original strings?

Which is how you implement high-performance unicode collation sorting in say a db.

Unicode is a very well-designed thing!


So, do you mind sharing the name of that transformation?


The only language / framework I've used that seemed to get this right is OpenStep / Cocoa, which is presumably why Swift is mentioned as being 'notable'.

NSString would occasionally come up in performance analysis as being strangely slow compared with other languages, but this was because various operations like comparisons followed unicode.


> The preferred alphabetical sorting would be [ "Ähnlich", "Äpfel", "Bären", "Käfer", "küssen" ]

As an example of a conflicting locale, Swedish would have Äpfel after Käsen!


Like "Aalborg" being after "Zaragoza" in a Danish sorting. And that's more tricky with "aa" being treated as a single character in Danish.


This leads to fun things like the regex [a-z][a-z] not matching the string "aa" in SQL Server when using the Danish collation.


Aalborg is one of very few exceptions that even in Danish can commonly be spelled with “aa” as a replacement for “å” (the single character).

Just like ae for æ.


Unfortunately, the exceptions only appear in names and proper nouns, which are where you generally want sorting the most.


Yes, that is an interesting case!

"Ä" is a separate letter in the Swedish alphabet: https://en.wikipedia.org/wiki/Swedish_alphabet

In German on the other hand, "Ä" is part of the language but not part of the alphabet and it is simply expanded to "AE" as in the previous comment.


EDIT: following comment is specifically about German German, Swiss German etc do not always work the same

> In German on the other hand, "Ä" is part of the language but not part of the alphabet

Some people claim that, but I wouldn't take that as an authoritative statement. Plenty language authorities disagree, since "ä" and "ae" are not interchangeable.

> it is simply expanded to "AE" as in the previous comment

Only in some contexts (lists of names predominantly), generally "ä" is sorted after "a". E.g. the Duden dictionary goes "zahlen", "zählen", "Zah­len­an­ga­be".


There is also the topic of language revisions. In my experience, gnu tools sort swedish correctly by the old rules, those that collate w with v, i.e w and v sort as if they are equal (war before ver before whar).

However, in 2006 the arguably authorative dictionary changed to sort w after v. Gnu tools have yet to catch up with this, what I can see. :)


> While WASM languages could shell out to JavaScript browser APIs for collation, I suspect they won’t due to the lack of guarantees around those APIs.

That's just it, right? Unicode support has always been a tradeoff between precise coverage/larger binaries and loose coverage/smaller binaries. I'd be excited to see the data size for collation+normalization get under 1MB.


Yeah its true but disappointing. I ported some email parsing code from C to wasm a couple of years ago and ran into this problem. Emails are encoded using all sorts of different character encodings. Canonicalizing them to utf-8 is important for consumers, but the libicu database for character encoding is several megabytes. So I'm left with a sad set of choices. I could:

1) Not support character encodings outside of ASCII, utf-8 and utf-16

2) Ship a huge wasm bundle or

3) Call back into javascript from my wasm module and use the javascript wrappers around libicu. This would make it harder to use the library outside of javascript.

I'm not sure what the right solution is here. If the character conversion databases can be shrunk to <500kb then I could just ship them with the module. Or if wasm had a better standard library, I could use that. But right now the whole situation is unfortunately awkward.


If you only ship the default collation it is already well under 1MB. DUCET [1] for example is 406 KB in a naive format [2], which gzips (-9) into 164 KB and Brotli (-q 11) brings it down to 76 KB. Other Unicode tables combined, even in a directly-accessible form, do not exceed 200 KB with an appropriate encoding in my experience.

[1] https://www.unicode.org/Public/UCA/13.0.0/allkeys.txt

[2] Encodes `XXXX; [.AAAA.BBBB.CCCC] [*DDDD.EEEE.FFFF] ...` into three-byte code point and seven bytes per each collation element (`.` encoded as 0x01, `*` encoded as 0x02, the end of elements signalled with 0x00); everything is in big endian and sorted by code point. This is not suitable for direct access as is but close. (The only reason I've used a naive encoding here is that, I have never tried to compress DUCET in a directly-accessible form so I couldn't give an educated guess at all.)


Author here, I hinted at the end of the article that I am exploring exactly that: a compression algorithm for the DUCET specifically. My early findings suggest that I can bring it to under 50K and beat gzip/brotli compressions, but with those on top can go further https://github.com/jecolon/ziglyph/issues/3


28 KB with Brotli! I'd like to see the number with LZ4 (or similarly simple compression algorithms) so that the dependency can be minimized, but that sounds way better than I imagined.


FWIW:

    List<String> sorted = Stream.of("Bären",
                                    "Käfer",
                                    "küssen",
                                    "Ähnlich",
                                    "Äpfel")
                                .sorted(Collator.getInstance(Locale.ROOT))
                                .collect(Collectors.toList());
    // [Ähnlich, Äpfel, Bären, Käfer, küssen]
In Java, String's default comparison is lexicographic over UTF-16 code units. Whilst this isn't precisely what you'd choose today, i do think it makes sense for the default comparison not to be locale-sensitive.


It's not Unicode sorting is hard, it's mostly because CJK, ideograms, emojis, and many other non-alphabetical languages do not have order.

It's not a bug, it's by design.


> [...] CJK, ideograms, emojis, and many other non-alphabetical languages do not have order.

Uh, no? Most if not all scripts do have letter order, CJK(V) "ideographs" included. The actual problem is twofold: the letter order might not be unique, and the word order might be completely different from the letter order. CJK(V) "ideographs" exhibit both problems---their word order is phonetic, while letter order is based on shapes and different dictionaries use different rules---and it is very common in Chinese and Japanese that the string meant to be sortable has pronunciation attached (e.g. Wikipedia).

Also, believe it or not, even emojis do have order too [1], albeit arbitrary one.

[1] https://unicode.org/emoji/charts/emoji-ordering.html


Even English a few edge cases where collation order doesn't match letter order: titles are typically ordered ignoring initial A/The, names are often sorted by surname (even when written with given name first); famously, M'/Mc are are traditionally sorted with Mac, along with a host of special cases that get gleefully thrown out in this era of doing very dumb very fast.


It's not sensible to call out non-alphabetic (by which I imagine you exclude all phonemic systems including abjads, syllabaries, etc all of which have full character order.

Even in Latin scripts, sorting order is language-specific and can even be domain specific (e.g. some languages sort phone books differently from dictionaries).

For "CJKV" systems there are ordering methods such as radicals and, as in European languages, ordering can depend on what you want to accomplish.


Speaking for Japanese, there is certainly an "order" for words, but that order is based on the readings of words, which may be difficult to ascertain from just the string.

https://en.m.wikipedia.org/wiki/Goj%C5%ABon

Eg. Words starting with 一人 can be read as both いちにん (ichinin as in 一人前 - food for one person) or ひとり (hitori - 一人暮らし - living alone)


Also some independent words can have different readings. Eg. 紅葉 is both こうよう and もみじ.


大人気 is my favourite of those, as a particularly long ambiguous homograph where the two common readings don't even have overlapping meanings.


Indeed, that one can be funny especially when split over a line ending. You start reading it as one, and no, it was the other. I've had this experience several times with this specific word.


(1) 大人気 (おとなげ, otonage) - maturity

(2) 大人気 (だいにんき, daininki) - very popular


FYI, Korean alphabet (Hangul) is fully ordered[1], and Korean unicode block[2] strictly follows it.

[1]: https://en.wikipedia.org/wiki/Hangul#Alphabetic_order

[2]: https://unicode.org/charts/PDF/UAC00.pdf


Hangul has order because it's enumerable. Hanja on the other hand, does not have order and it's not enumerable.

(not enumerable as in written text as strokes can be combined and invented on the fly, enumerable if you think it's a finite charset on computers)


New characters can be invented on the fly, but not every random scribble would be considered a character. In practice, only a handful of different stroke types are used, which gives even newly invented nonsense characters a recognizable visual style. For any finite stroke count, there's only a finite number of combinations, so the infinte set of characters with unbounded stroke count can be enumerated.

It can also be sorted using the radical and stroke count, although then you can't enumerate it in order because there's an infinite number of possible characters for each radical.


It's so annoying how "CJK" groups Hangul with completely different writing systems. And even makes people think they're similar.


Well, Hanja can also appear inline in otherwise Hangul texts. Vietnamese has a similar issue though their use of an extended Roman alphabet ended up in a slightly different place, Unicode-wise. I don't remember the decision but I believe it might have been at the request of some RoK people.

There might also be an issue if old Hanja-only texts need to be handled differently from Hanzi rendering/processing. That would not surprise me.

Still, I agree it seems pretty weird.


Hangul and Hiragana are not so dissimilar AFAIK? But the point of saying "CJK" is the other direction: codepoints subject to Han unification may represent Hanzi, Kanji or Hanja.


Hangul is actually just an alphabet. Each square hangul character is composed of 2-3 letters. Hiragana is a syllabary.


> Korean alphabet (Hangul) is fully ordered

Buuuut historically, there have been several ways to arrange Hangul linearly, and the prevailing sorting in dictionaries differes between North and South Korea.


This is why I had to point out that order does exist, just that there are too many such orders.


Even English has no defined ordering for all characters. What is the ordering of punctuation marks? Do upper case or lower case letters go first? How do you deal with white space characters?

ASCII ordering is just an arbitrary standard that mapped numbers to letters so people could make computer hardware interoperate. It's not always a sensible sort ordering.


Pedantic, I know, but the original 1963 ASCII document did speak a lot about collation and ordering. Perhaps a lot of it doesn't make sense today, but the sort order wasn't accidental.

http://www.bahaistudies.net/asma/ascii.pdf


> Even English has no defined ordering for all characters. What is the ordering of punctuation marks? Do upper case or lower case letters go first? How do you deal with white space characters?

This isn't really true; there are well-defined, standardized answers to all of those:

- For purposes of alphabetizing, punctuation marks are completely ignored, as if they weren't present. "Stop me!" and "Stop, me!" are identical strings.

- Case is also ignored. "AAAAAGH" and "aaaAagh" are identical strings.

- Spaces are not ignored. Sort order is determined by the text before the first space. Shorter sorts before longer. "A bird", "a cat", and "a deer" all sort under "a" and above "ab".

Note that these are rules for English, not rules for arbitrary sequences of binary data. They look the way they do because they model the data as a sequence of English words.


It's not really well-defined if "ignored" means "sort arbitrarily".

You can't just say "Stop me!" and "Stop, me!" are identical. You have to define which comes first.


> You can't just say "Stop me!" and "Stop, me!" are identical. You have to define which comes first.

Why? All real-world applications of sorting must handle sorting identical objects. Saying that "Stop me!" and "Stop, me!" are identical doesn't cause any problems that aren't also caused by saying "Stop me!" and "Stop me!" are identical.


But they're not identical, they differ by a comma.

It's a desired feature of sorting algorithms that they should be deterministic, in which case arbitrarily ordered input should produce identical output.

Otherwise, having non-deterministic output just makes everything harder -- harder to make comparisons, more work to write tests, etc.


> But they're not identical, they differ by a comma.

This makes as much sense as saying that "Stop me!" and "Stop me!" aren't identical, because one of them is first and the other is second.

It's true, but it's not relevant to sorting them.

Your "desired feature of sorting algorithms" is most commonly realized as the question of whether the algorithm provides a "stable" sort -- that is, whether two values that compare equal are guaranteed to occur in the same order in the (sorted) output that they did in the (unsorted) input. There are applications where this is considered desirable, and many other applications where it isn't. But there are no applications in which it's not possible for two values to compare equal. For the question of stable sorting to arise in the first place, you have to have already realized that you're going to be sorting things that compare equal.


...which is why it's generally desirable for unequal things to never compare equal.

Stability is about preserving the order of items that compare equally. That's a different topic.

I'm talking about it being desirable not to underspecify the comparison function (sorting rules), so that unequal items are never compared as equal.


‘Shorter sorts before longer’ is the reason ASCII puts space first and textual punctuation before alphabetics.


The way most native English speakers (and writers) that I know, including myself, would sort such a set is to ignore the set (regexp) \W . The 'C' (blind bit sorting) Locale is more sensitive and a suitable default for simple machine focused environments.

E.G. of sorting how English humans tend to:

`A string with non-letters ignores the non-letters.

Ace


Does sorting words from different languages even make sense? Do all languages even define a sort order for their letters? What about languages that aren't based around letters? Frankly I think that the unicode consortium is talking out of school here.


There definitely needs to be a definitive sort order across languages. That it needs to be sensible is highly suspect. Emojis alone should signal that there fundamentally cannot be. The best they can do is define a sensible cross-language sort within a language family (e.g. Latin based) but even that’s highly suspect


> The best they can do is define a sensible cross-language sort within a language family [...] but even that's highly suspect.

How would you sort ["О", "Е", "Ζ", "Р", "Η", "С", "Κ"], for example?


You can sort according to the orthography of one and precisely one language. Words from another language would simply follow that, also if that language would sort them differently.


Well you certainly can sort any language's glyphs, using strategies such as that one. But will it be useful outside of that language? People who speak German will be aware of their sort order, and people who speak Korean will be aware of their sort order, but only a small subset of the population who speak German and Korean fluently will care that you went the extra mile to sort properly when sorting a list containing German and Korean words. It almost seems as though showing a globally-sorted list containing multiple languages is an anti-pattern, and you should group by character set first. However, that might be a political boondoggle if German/Anglo names are always displayed before or after the Korean names. Hmmm...


My name is one of the last in the alphabet. Should I be offended now?

Anyway, if you sort in German, Korean names will not be sorted as when you sort in Korean, but then German names will be somewhere else. Ç'est la vie. How else are you going to sort? By mapping everything to phonemes first? And what's the proper sorting order of phonemes? I'm sure German names will on average come either before or after Korean names in that case, too.


>The preferred alphabetical sorting would be [ "Ähnlich", "Äpfel", "Bären", "Käfer", "küssen" ] - Array.sort doesn’t do that.

All my literal wat. This is absolutely not the expected nor the preferred behaviour.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: