I tested the table with randomly generated strings, and was puzzled to discover expected lookups would fail with some table constructions. I narrowed it down to inconsistencies in collation.
It turns out that the collation order implemented by Windows was not transitive. You could have three code points, a, b and c, where a > b and b > c and c > a. Sort order is well defined within blocks, but there isn't necessarily a meaningful sort order across blocks; how should your Cyrillic letter compare with your Greek letter vs your Armenian letter?
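For anyone who wants to sanity-check their own comparator, here's a brute-force transitivity test, a sketch in Python using locale.strcoll as a stand-in (it won't reproduce the old Windows behaviour, which lived in the Win32 comparison APIs):

    # Check a collation function for transitivity over a small sample set.
    import itertools
    import locale

    locale.setlocale(locale.LC_COLLATE, '')  # whatever the system default is

    def check_transitivity(cmp, samples):
        for a, b, c in itertools.permutations(samples, 3):
            if cmp(a, b) < 0 and cmp(b, c) < 0 and not cmp(a, c) < 0:
                print("not transitive:", repr(a), repr(b), repr(c))
                return False
        return True

    # Latin, Cyrillic, Greek and Armenian letters, to cross block boundaries.
    samples = ["a", "\u00e4", "z", "\u0430", "\u03b1", "\u0531"]
    print(check_transitivity(locale.strcoll, samples))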
Universal case insensitivity is hard, if not impossible. It's best to preserve both a "canonical case" version and the raw data in a search index, if they differ.
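For illustration, a minimal sketch of that (Python; the structure is made up):

    # Index entries keep both the raw string and a folded "canonical case"
    # form; lookups compare only the folded form.
    index = []

    def add(raw):
        index.append((raw.casefold(), raw))  # (canonical case, raw data)

    def search(query):
        q = query.casefold()
        return [raw for folded, raw in index if folded == q]

    add("Straße")
    print(search("STRASSE"))  # ['Straße']: casefold maps ß to ss, unlike lower()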
Not that it has any impact on your point. But as a German who only learned about it fairly recently, I have rather ambivalent feelings about it. ;-)
In some cases that might be because the "oe" was never an "ö" to begin with, even if people may have shifted to pronouncing it that way. But in other cases I imagine it was intended, and did start out as an "ö" sound, but people just decided to write the name with "oe".
(Same with the other umlauts.)
Even beyond the issue of two concepts sharing the same glyph, the interpretation of the same diacritics differs between languages. English tends to drop diacritics, to the point that many people think English doesn't use them; German uses expansion (so ä becomes ae). As you mention, some languages are incomprehensible either way, so the diacritics need to be preserved. And sorting and collation are even more fun!
This does mean that Unicode's insistence that characters represent semantic differences rather than graphical differences can come across as rather arbitrary. The original purpose of Unicode was to unify different character sets, so it preserves character differentiation that existed in antecedent charsets but otherwise tends to unify characters in practice.
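As a sketch of how language-dependent the "right" folding is (Python; the German table here is hand-rolled, not from any library):

    # Two incompatible ways to ASCII-fold the same string: generic diacritic
    # stripping vs. the German expansion convention. Which one is "right"
    # depends on the language, not on the code point.
    import unicodedata

    def strip_diacritics(s):
        decomposed = unicodedata.normalize("NFD", s)
        return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    GERMAN = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
              "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"}

    def german_expand(s):
        return "".join(GERMAN.get(ch, ch) for ch in s)

    print(strip_diacritics("Müller"))  # Muller
    print(german_expand("Müller"))     # Mueller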
It seems like German could actually be at fault for your bogus transliteration issues in Finnish, then. Sorry about that! 8)
Looks like that's the case: https://docs.microsoft.com/en-us/windows/desktop/api/stringa...
It might have been an early version of this one: http://codecentral.embarcadero.com/Item/15171 - from 2001 - but I've written quite a few hash tables over the years, and may be blending the different recollections. I never used Vista, and I'm pretty sure my experience predated Windows 7.
I also wrote JclStrHashMap to support JclExprEval: https://github.com/project-jedi/jcl/blob/master/jcl/source/c...
I also wrote the Delphi runtime library TDictionary generic implementation, but that was more recent.
Update: I found it. The original discussion, from the newsgroups, has been ported to the web: http://www.delphigroups.info/2/62/478610.html
It was from 2004, so more like 15 years ago. It's mildly painful to read myself from back then too :)
Wouldn't you need to do case normalization before hashing, to make that work?
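For illustration, a minimal sketch of what that normalization-before-hashing might look like (Python; the class is hypothetical):

    # Keys are normalized (NFC) and case-folded before hashing, so "Foo"
    # and "FOO" land in the same slot.
    import unicodedata

    class CaseInsensitiveDict(dict):
        @staticmethod
        def _canon(key):
            return unicodedata.normalize("NFC", key).casefold()

        def __setitem__(self, key, value):
            super().__setitem__(self._canon(key), value)

        def __getitem__(self, key):
            return super().__getitem__(self._canon(key))

    d = CaseInsensitiveDict()
    d["Foo"] = 1
    print(d["FOO"])  # 1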
This article discusses some of the challenges:
In the first section, points 16, 22, and 23 are relevant; in the second section, look at points 8, 9, 10, 11, 12, 13, and 40.
Sorting based on Unicode strings is tricky, especially if you want cross platform compatibility. It's better today but there are still a lot of edge cases to consider if you have any hope of getting the same sort twice from "interesting" data.
Which is a cautionary tale whose solution is to use ICU, the solution adopted by Postgres.
We ran headfirst into this issue at my company, and we've actually been recommending the opposite (use the "C" locale on the database and treat collation as a render-level concern).
I have a whole write up explaining the technical motivations behind that recommendation: https://gist.github.com/rraval/ef4e4bdc63e68fe3e83c9f98f56af...
That is, to do anything you want in a locale-sensitive way, such as querying the database for a given case-insensitive string, or paginating, you'll need the DB index to be locale/collation aware, or else return every possible value and let the renderer sort it out.
As an example, how would you repeatedly return a range of results based on a string in a locale-aware way, e.g. to display paged results, if you defer the work to the renderer?
The only general solution I'm aware of that lets you "bypass" the DB collation is to use a locale-aware collation library like ICU to generate a binary sort key, which can be compared using plain binary comparison, and to store those keys in the DB. This still means the overall index is locale aware, but the DB doesn't need to know any collation rules: only the code that generates queries and handles the results needs to do the sort-key transformation.
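A minimal sketch of that with PyICU (assuming it's installed; column types and names are up to you):

    # The DB column stores these bytes; the DB itself only ever does a
    # plain binary comparison on them.
    import icu

    collator = icu.Collator.createInstance(icu.Locale("de_DE"))

    names = ["Zebra", "Äpfel", "apfel"]
    keys = {name: collator.getSortKey(name) for name in names}

    # Sorting by the binary keys gives the German collation order,
    # e.g. ['apfel', 'Äpfel', 'Zebra']:
    print(sorted(names, key=keys.get))

The caveat discussed elsewhere in this thread still applies: keys generated by different ICU versions aren't comparable, so the stored column has to be rebuilt when ICU is upgraded.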
It means that you have a single library doing all the conversion, which you can probably control more easily, rather than delegating this to the database, where you might need to support several vendors or at least various versions (and problems can arise even within a single DB version, as this Postgres issue shows).
Normal DB indexes are mostly for numbers and (short) ASCII strings. (Something like canonicalized UTF-8 is an edge case.) For strings that have encodings and locales, you likely need a full-text index, provided by your DB or by something like Solr / ElasticSearch. It addresses the oddities of human-oriented texts better.
It's a pretty common use case (so common that it is usually taught in an introduction to databases) to want an index that allows you to efficiently search by non-ASCII strings, like, say, a person's last name.
It may be implementable for e.g. French names: a name like "François" may be stored as UTF-8 with "ç" always represented as a composite pair (or always a single character), and the application layer knows and uses this.
It becomes a dubious idea for German names, where you may need to treat "Müller" and "Mueller" as the same name, but also keep e.g. only "Mueller" as the form referenced in legal papers.
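If PyICU is available, the German "phonebook" collation (DIN 5007-2) should handle that equivalence at primary strength, while the raw legal spelling lives untouched in its own column. A sketch:

    # German phonebook collation expands umlauts (ü -> ue), so at primary
    # strength "Müller" and "Mueller" should compare equal.
    import icu

    coll = icu.Collator.createInstance(icu.Locale("de_DE@collation=phonebook"))
    coll.setStrength(icu.Collator.PRIMARY)

    print(coll.compare("Müller", "Mueller"))  # 0, i.e. equal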
Once you start considering anything non-ASCII, things become hairy. (Of course, if you only work for the US market and don't care about any international travelers as customers, you can continue using these introductory courses as a guidance.)
People who use databases naturally end up wanting to store Unicode strings, and people naturally want to query them in locale-sensitive ways. Your Mueller example is a fine example of this. Databases have deep support for this, and you can also build your own compromise, so to speak, on top of the database with ICU or another library if it doesn't meet your needs.
Indexing is full of caveats, though.
If you're using surrogate keys for everything, it means your tools are broken or you don't trust them. It may be the least worst available option to plaster over them, but you shouldn't accept that as a best practice.
> For strings that have encodings and locales, you likely need a full-text index, provided by your DB or by something like Solr / ElasticSearch.
Really, an inverted index, which, as its name implies, does the opposite of what you want if you're trying to index something.
Those products "work" by doing text preprocessing (stemming and such), which anyone can do, and then they toss a bunch of garbage back and let the user sort through it.
That doesn't solve the problem a database index is trying to address.
Tools are not broken; reality is. So you have to work around the assumptions that come with blind usage of a technology.
You need to think about how to use a tool (such as a Unicode-capable DB text type) to get a correct result. And first you need to understand what constitutes a correct result. Otherwise you may end up with strings that are equally correct and nominally equal, but bytewise different, so search by index fails.
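For example, in Python:

    # Two renderings of "é" that are canonically equivalent but bytewise
    # different; a naive index lookup on the raw bytes misses one of them.
    import unicodedata

    composed = "caf\u00e9"         # é as one code point
    decomposed = "cafe\u0301"      # e + combining acute accent

    print(composed == decomposed)  # False: different code point sequences
    print(unicodedata.normalize("NFC", composed) ==
          unicodedata.normalize("NFC", decomposed))  # True after normalizing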
Retreating to ASCII is not the answer, and "full text search engines" don't even solve the same class of problem, such as say paginating a list of local towns by name.
The most important job of a DB is to store records, which largely consist of numeric (including boolean) values and strings. Strings often need to be outside the ASCII range, and even in existing cases where they aren't supported, that's a limitation of the application, and the users would really prefer Unicode strings.
Even if you just target English speaking people from the US or whatever, there are a variety of non-ASCII characters that come up all the time, including accented characters, various currency signs, etc.
But this will make some features really hard, if you need them. Pagination based on sorted values subject to collation wouldn't be queryable; you would need to either fetch all the data and sort it, or query for a key and the sorted column, sort, and then query for the details of the displayed items.
Selecting a range would also be potentially difficult.
However, if someone really wanted to accomplish this, they could probably use PostgreSQL's functional indices and unicode normalization to do it.
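Another route that keeps the DB collation-ignorant: precompute ICU sort keys into their own column and paginate on those. A sketch, assuming PyICU and psycopg2, with a hypothetical people(name text, name_sort bytea) table:

    # Keyset pagination stays a plain binary range scan for the C-locale
    # database; all collation knowledge lives in the application.
    import icu
    import psycopg2

    coll = icu.Collator.createInstance(icu.Locale("de_DE"))
    conn = psycopg2.connect("dbname=example")  # hypothetical DSN

    def next_page(last_name_seen, page_size=20):
        with conn.cursor() as cur:
            cur.execute(
                "SELECT name FROM people WHERE name_sort > %s "
                "ORDER BY name_sort LIMIT %s",
                (coll.getSortKey(last_name_seen), page_size))
            return [row[0] for row in cur.fetchall()]

This assumes the stored keys and the query-time keys come from the same ICU version and locale, of course.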
The GP relative to my reply.
The only real solution is to have your own collation routines, test the hell out of them for every release (you don’t want a compiler bug or the fixing of a compiler bug to introduce a subtle change in your collation code. The truly paranoid also worry about bug fixes in hardware), and to never stop supporting a collation, once you have ever released a product using it.
That’s what many databases do and why, for example, SQL Server has two binary collations. https://docs.microsoft.com/en-us/sql/relational-databases/co...: “There are two types of binary collations in SQL Server; the older BIN collations and the newer BIN2 collations. In a BIN2 collation all characters are sorted according to their code points. In a BIN collation only the first character is sorted according to the code point, and remaining characters are sorted according to their byte values. (Because the Intel platform is a little endian architecture, Unicode code characters are always stored byte-swapped.)”
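The quirk is easy to see outside SQL Server; here's Python comparing code-point order with UTF-16LE byte order:

    # For UTF-16LE storage, byte order and code-point order disagree.
    a, b = "z", "\u0100"  # 'z' is U+007A, 'Ā' is U+0100

    print(ord(a) < ord(b))  # True: code-point order (BIN2-style)
    print(a.encode("utf-16-le") < b.encode("utf-16-le"))
    # False: b'z\x00' > b'\x00\x01' bytewise (BIN-style for trailing chars)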
Of course, PostgreSQL is one of the few programs that actually cares about collation and case-insensitivity, so it has to do things the hard way.
ICU collations are not used by default, you have to explicitly enable them, and you can't set them as the database default. And even if you use ICU collations, I'm not sure if indices actually detect if the ICU versions don't match.
Maybe they improved some of the issues in PostgreSQL 11 (I haven't been following the ICU issue in detail), but I don't think this issue is fixed.
Also, with libc collations you have to rebuild indexes after a glibc upgrade:
Edit: Why the downvote? New versions of Unicode come out every year, and ICU collations are explicitly versioned. If you use ICU 51.1 and ICU 52.1, you are going to have the same kind of issues.
Edit: The docs say indices need to be rebuilt when ICU version changes: https://www.postgresql.org/docs/11/sql-altercollation.html
ICU offers a way to detect that the collation version has changed, but I thought there was no way to access older collation versions.
#define U_UNICODE_VERSION "9.0"
libcs need something similar. In safelibc I have nothing so far, but I'm at Unicode 11.0.
(Just because I can’t resist, I’ll mention that sort keys are a perfect application for Hu-Tucker codes: https://cs.stackexchange.com/a/49586).
More recent discussion of various possible approaches on PostgreSQL -hackers list: https://www.postgresql.org/message-id/flat/CAEepm%3D0uEQCpfq...
Proposal to add versions to libc collations in FreeBSD: https://reviews.freebsd.org/D17166
And glibc is still 2 years behind in its Unicode version: they are at 9.0, current is 11.0.
musl is at 6.0.
SELECT * FROM pg_catalog.pg_collation;
I constantly have to update my towupper() function for my safelibc. musl is hopelessly behind, and glibc is not much better with Unicode support. ICU constantly has breaking changes, but at least keeps its tables up to date.
Unicode itself changes its tables every year. This year we had 6 changes there.
DB or Web clients really need to pass the Unicode version in their APIs: like "sorted according to Unicode 11.0".
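If it helps, PyICU already exposes both version numbers that such an API could advertise (values here are just examples):

    # The Unicode and ICU versions the linked ICU library was built with.
    import icu

    print(icu.ICU_VERSION)      # e.g. "63.1"
    print(icu.UNICODE_VERSION)  # e.g. "11.0"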
Encoding translates a string from one language to another, and it's often a reversible translation. The obvious case is to translate from Unicode to binary and back. In Unicode, the alphabet is the set of all Unicode codepoints. In binary, you have a stream of octets and the alphabet is the set of integers between 0 and 255. So UTF-8 is a standard that specifies how to perform such a translation.
Since we're often going back and forth between Unicode and binary, and because real processors work with bytes, it's reasonable to call "encoding" translating a string to binary and "decoding" translating binary to a string. The math doesn't require that, though.
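A concrete round trip in Python:

    # The reversible translation between the two alphabets:
    s = "naïve"                    # string over the alphabet of code points
    b = s.encode("utf-8")          # string over the alphabet of octets 0..255
    print(list(b))                 # [110, 97, 195, 175, 118, 101]
    print(b.decode("utf-8") == s)  # True: the translation round-trips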
Collation represents a string as an expression (usually also a string) that is trivially comparable. For instance, suppose I wanted an English-friendly way to compare band names. I might map all codepoints that look like A or E to "A", then K, S and C to "C", etc., discarding any accents and maybe throwing out leading junk like "The" or "A".
That's why a case-insensitive collation won't let you have both Foo and FOO: they would both collate as FOO. It also means that, unlike encoding, collation is expected to be one-way.
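A toy version of such a key in Python (the folding rules are made up, in the spirit of the band-name example):

    # A one-way collation key: case and accents are deliberately collapsed,
    # so distinct inputs can share a key and the mapping can't be reversed.
    import unicodedata

    def collation_key(name):
        decomposed = unicodedata.normalize("NFD", name.casefold())
        stripped = "".join(c for c in decomposed
                           if not unicodedata.combining(c))
        for junk in ("the ", "a "):
            if stripped.startswith(junk):
                stripped = stripped[len(junk):]
        return stripped

    print(collation_key("Foo") == collation_key("FOO"))            # True
    print(collation_key("The Bügles") == collation_key("bugles"))  # True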
So how does collation algorithmic instability break an application?
Edit: What I mean is: what reasonable assumptions might a dev make in the application layer that would be good practice on the one hand, yet be wholly dependent on collation algorithm stability to keep from breaking?
If you think about any kind of binary tree based structure, it will have invariants like "the left child node will be less than the parent node, and the right child node will be greater than the parent node."
If those are violated just a little bit, at best data will simply disappear from the index. It's possible for updates to delete stuff, or for operations on the index to hang.
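A toy illustration in Python: build a sorted array under one set of rules, then binary-search it under slightly different rules:

    # The binary search trusts the ordering invariant, takes a wrong turn,
    # and an element that is present gets reported missing; this is roughly
    # how rows "disappear" from an index after a collation change.
    import bisect

    build_key = str.casefold   # "collation v1": case-insensitive
    query_key = lambda s: s    # "collation v2": raw code-point order

    data = sorted(["apple", "Banana", "cherry"], key=build_key)

    def contains(sorted_data, item, key):
        keys = [key(x) for x in sorted_data]  # assumed sorted under `key`
        i = bisect.bisect_left(keys, key(item))
        return i < len(keys) and keys[i] == key(item)

    print("Banana" in data)                     # True: the row really exists
    print(contains(data, "Banana", query_key))  # False: the "index" lost it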