
The dangers of streaming across versions of glibc: A cautionary tale (2014) - moehm
https://www.postgresql.org/message-id/flat/BA6132ED-1F6B-4A0B-AC22-81278F5AB81E%40tripadvisor.com
======
barrkel
About 20 years ago, I implemented a hash table that used binary trees for
buckets. It was supposed to support Unicode strings as keys, and I used the
Windows comparison functions to compare the keys, in order to handle
case-insensitive lookup, IIRC.

I tested the table with randomly generated strings, and was puzzled to
discover expected lookups would fail with some table constructions. I narrowed
it down to inconsistencies in collation.

It turns out that the collation order implemented by Windows was not
transitive. You could have three code points, a, b and c, where a > b and b >
c and c > a. Sort order is well defined within blocks, but there isn't
necessarily a meaningful sort order across blocks; how should your Cyrillic
letter compare with your Greek letter vs your Armenian letter?
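
A minimal sketch in C of how that breaks things: a toy comparator (not the
real Windows one) that looks sensible pairwise but is not transitive, plus a
plain binary search over an array the comparator considers sorted. The lookup
for "A" fails even though "A" is right there.

    /* Toy "collation": behaves like A < B < C < D, except that it claims
     * C < A, i.e. it is not transitive. */
    #include <stdio.h>
    #include <string.h>

    static int broken_cmp(const char *a, const char *b)
    {
        if (!strcmp(a, b)) return 0;
        if (!strcmp(a, "C") && !strcmp(b, "A")) return -1;   /* the anomaly */
        if (!strcmp(a, "A") && !strcmp(b, "C")) return 1;
        return strcmp(a, b);          /* otherwise the usual A < B < C < D */
    }

    /* Plain binary search over an array that is "sorted" by broken_cmp. */
    static const char *lookup(const char **sorted, size_t n, const char *key)
    {
        size_t lo = 0, hi = n;                       /* search [lo, hi) */
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            int c = broken_cmp(key, sorted[mid]);
            if (c == 0) return sorted[mid];
            if (c < 0)  hi = mid;
            else        lo = mid + 1;
        }
        return NULL;
    }

    int main(void)
    {
        /* Every adjacent pair is in order, so the array looks sorted... */
        const char *keys[] = { "A", "B", "C", "D" };

        /* ...but the search for "A" is compared against the middle element
         * "C", told that it is larger, sent into the right half, and lost. */
        printf("lookup(\"A\"): %s\n", lookup(keys, 4, "A") ? "found" : "NOT FOUND");
        printf("lookup(\"B\"): %s\n", lookup(keys, 4, "B") ? "found" : "NOT FOUND");
        return 0;
    }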

~~~
nine_k
Case transformations are locale-dependent. In French, the lower case of "I"
is "i" and the upper case of "i" is "I". In Turkish, which uses largely the
same letters, the lower case of "I" is "ı" and the upper case of "i" is "İ".
And in German, the upper case of "s" is "S", but the upper case of "ß" is
"SS", so you have to guess what the lower case of "SS" should be.

Universal case insensitivity is hard if not impossible. It's best to preserve
both a "canonical case" version and the raw data in a search index, if
different.
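
For what it's worth, the C library's own wide-character case mapping is
locale-dependent in the same way. A minimal sketch; whether the Turkish rule
actually kicks in depends on the locale data your libc ships, and it assumes a
tr_TR.UTF-8 locale is installed:

    #include <locale.h>
    #include <stdio.h>
    #include <wctype.h>

    int main(void)
    {
        /* In the "C" locale towupper(L'i') is plain 'I' (U+0049); with
         * Turkish locale data it should be the dotted capital İ (U+0130). */
        if (!setlocale(LC_ALL, "tr_TR.UTF-8")) {
            fprintf(stderr, "tr_TR.UTF-8 locale not installed\n");
            return 1;
        }
        printf("towupper('i') = U+%04X\n", (unsigned)towupper(L'i'));
        return 0;
    }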

~~~
krylon
There is an uppercase ß, actually:
[https://en.wikipedia.org/wiki/Capital_%E1%BA%9E](https://en.wikipedia.org/wiki/Capital_%E1%BA%9E)

Not that it has any impact on your point. But as a German who only learned
about it fairly recently, I have rather ambivalent feelings about it. ;-)

~~~
nine_k
Nice! German has funnier problems, though, because a vowel with an umlaut can
also be represented as that vowel + e; that is, "fuer" is a legitimate
representation of "für". How do you normalize that? Converting to one
canonical representation works most of the time, but sometimes you also need
letter-to-letter correspondence.

~~~
chousuke
Is that really legit in German? I know Finnish umlauts (ä, ö) sometimes get
mangled to ae or oe, but those are definitely not valid alternative spellings,
nor are they pronounced anything alike.

~~~
jcranmer
This brings up something that is perhaps easy to forget: the interpretation of
diacritics is far from universal. The diacritic in the character 'ä' can be
one of two semantically different marks. It can be a diaeresis, a diacritic
whose function is to mark that the vowel starts a new syllable rather than
forming part of a diphthong; or it can be an umlaut, whose purpose is to
indicate a different vowel sound altogether. As far as any charset is
concerned, though, despite those very different semantics, the two things are
the same diacritical mark [1].

Even beyond the issue of two concepts sharing the same glyph, the
interpretation of the same diacritics among different languages is
inconsistent. English tends to drop diacritics, to the point that many people
think English doesn't use them; German uses expansion (so ä becomes ae). As
you mention, some languages are incomprehensible either way, so the diacritics
need to be preserved. And sorting and collation are even more fun!

[1] This does mean that Unicode's insistence that characters represent
semantic differences rather than graphical differences can come across as
rather arbitrary. The original purpose of Unicode was to unify different
character sets together, so it preserves character differentiation that
existed in antecedent charsets but tends to otherwise unify characters in
practice.

~~~
tedunangst
Pretty much the only place you see diaeresis in English is in the New Yorker,
whenever they use a word like coördination.

------
gumby
This isn't a glibc issue; really it's a story that "character collations are
not stable over time" (mentioned downthread in the article, in a message
quoting Unicode Technical Standard #10).

The solution to that cautionary tale is to use ICU, which is the solution
Postgres adopted.

~~~
rraval
> the solution for which is to use icu

We ran headfirst into this issue at my company, and we've actually been
recommending the opposite (use the "C" locale on the database and treat
collation as a render-level concern).

I have a whole write up explaining the technical motivations behind that
recommendation:
[https://gist.github.com/rraval/ef4e4bdc63e68fe3e83c9f98f56af...](https://gist.github.com/rraval/ef4e4bdc63e68fe3e83c9f98f56af7a4)

~~~
mnw21cam
This has the advantage that your database operations run a heck of a lot
faster, but it has the potential disadvantage that primary key uniqueness may
not be maintained, if you think that alternate ways of writing the same
characters in Unicode matter for that.

~~~
greglindahl
Normalization is a separate issue; you can normalize and then use the C
collation order.

~~~
BeeOnRope
Sure, but either you are talking about a fixed normalization algorithm which
is not locale-aware, in which case it doesn't solve the locale-specific unique
key issue, or it is locale-aware and hence suffers the same problem of
time-varying behavior.

~~~
jrochkind1
You are using string/text values as a pk and trying to sort on em? I'd say
this is another reason not to do that.

~~~
BeeOnRope
Well I am not doing any of the things in the comment chain leading up to this,
but probably the mention of _primary_ key was a red herring. The GP's [1]
point was that some index constraint (they mentioned PK uniqueness, but it
could really be any constraint) might not be correctly maintained if the DB
was not aware of the correct collation order. So from the point of view of the
renderer, which is locale aware and uses locale-based collation, the DB is
violating the constraints.

---

[1] The GP relative to my reply

------
theamk
This, BTW, is a great example of why many filesystems are case-sensitive:
ignoring case requires collation support, and that can change all the time.
Treating filenames as opaque byte strings, on the other hand, makes
filesystems, databases and so on Always Work.

Of course, PostgreSQL is one of the few programs that actually cares about
collation and case-insensitivity, so it has to do it the hard way.

~~~
leeter
It has always amused me that NTFS and the NT kernel are both case-sensitive
when it comes to filenames. But Win32 emulates the case-insensitive legacy
behavior because people won't, or can't, fix legacy code. That said,
applications can properly opt into POSIX semantics.

~~~
voltagex_
Are you talking about WSL or some other way that applications can use POSIX on
Windows?

~~~
blattimwind
FILE_FLAG_POSIX_SEMANTICS
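
That flag is passed per CreateFile call. A minimal sketch (the file name is
just an example, and the kernel-wide ObCaseInsensitive setting can still
override the behavior):

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        /* With FILE_FLAG_POSIX_SEMANTICS the name lookup is case-sensitive,
         * so "Readme.txt" and "README.TXT" can be two different files. */
        HANDLE h = CreateFileA("Readme.txt",
                               GENERIC_READ,
                               FILE_SHARE_READ,
                               NULL,                      /* default security */
                               OPEN_EXISTING,
                               FILE_FLAG_POSIX_SEMANTICS,
                               NULL);
        if (h == INVALID_HANDLE_VALUE) {
            printf("open failed: %lu\n", GetLastError());
            return 1;
        }
        CloseHandle(h);
        return 0;
    }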

------
nezirus
AFAIK, this is fixed in PostgreSQL 10 (ICU), so no reason to panic if you are
on v10+. However, it nicely demonstrates problems with strcoll :-)

~~~
emddudley
Correct. Using ICU collations avoids this problem.

Also, with libc collations you have to rebuild indexes after a glibc upgrade:

[https://postgresql.verite.pro/blog/2018/08/27/glibc-upgrade....](https://postgresql.verite.pro/blog/2018/08/27/glibc-upgrade.html)

~~~
jakobegger
Even if you use ICU, you need to make sure you link with the same version of
ICU!

Edit: Why the downvote? New versions of Unicode come out every year, and ICU
collations are explicitly versioned. If you use ICU 51.1 and ICU 52.1, you are
going to have the same kind of issues.

~~~
anarazel
ICU collations in PG are versioned, and ICU has support for accessing the
different collation versions. So you might be stuck on an older collation
version without explicit action, but it won't yield wrong results.

~~~
jakobegger
Really? PostgreSQL can use an index created with an older version of ICU? I
didn't know that.

Edit: The docs say indices need to be rebuilt when the ICU version changes:
[https://www.postgresql.org/docs/11/sql-altercollation.html](https://www.postgresql.org/docs/11/sql-altercollation.html)

ICU offers a way to detect that the collation version has changed, but I
thought there was no way to access older collation versions.
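
For what it's worth, that detection hook is exposed by ICU4C itself. A minimal
sketch of reading a collator's version (the locale and error handling are just
illustrative; link with -licuuc -licui18n):

    /* Print a collator's version: the kind of value an application can
     * record next to an index and compare later to notice that the
     * collation rules have changed. */
    #include <stdio.h>
    #include <unicode/ucol.h>
    #include <unicode/uversion.h>

    int main(void)
    {
        UErrorCode status = U_ZERO_ERROR;
        UCollator *coll = ucol_open("de_DE", &status);
        if (U_FAILURE(status)) {
            fprintf(stderr, "ucol_open: %s\n", u_errorName(status));
            return 1;
        }

        UVersionInfo ver;
        char buf[U_MAX_VERSION_STRING_LENGTH];
        ucol_getVersion(coll, ver);      /* changes when the rules change */
        u_versionToString(ver, buf);
        printf("collation version: %s\n", buf);

        ucol_close(coll);
        return 0;
    }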

------
antirez
Relying on anything non-trivial provided by libc is always a risk, but here
the problem is that collation is not the right tool for the job (implementing
data structures where elements have to be compared). In Redis, memcmp() is
used, because byte-to-byte comparison is a much more stable story. This leaves
the user with the problem of finding a suitable representation that is
lexicographically sortable by its leading bytes in a way that makes sense for
the application, and that also contains, in the trailing bytes, what the user
really wants to return. Not viable for something higher level than Redis, I
guess.
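
A minimal sketch of the contrast (the strings and the en_US.UTF-8 locale are
just examples): strcmp()'s answer is fixed by the bytes, while strcoll()'s
answer is fixed by whatever collation tables the C library currently ships.

    #include <locale.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const char *a = "apple";
        const char *b = "Zebra";

        /* Byte order never changes: 'Z' (0x5A) sorts before 'a' (0x61),
         * so strcmp() says "Zebra" < "apple" today, tomorrow, anywhere. */
        printf("strcmp : %d\n", strcmp(a, b));

        /* Locale order depends on the collation data shipped with the C
         * library; in a typical en_US.UTF-8 locale "apple" sorts before
         * "Zebra", and the rules can shift across glibc versions, which is
         * the instability this thread is about. */
        setlocale(LC_COLLATE, "en_US.UTF-8");
        printf("strcoll: %d\n", strcoll(a, b));
        return 0;
    }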

~~~
cbsmith
It's not that simple. SQL in particular actually requires lexicographical
sorting, so you need to do it somehow. The problem is relying on
lexicographical sorting remaining fixed when it isn't. Per the Unicode
standard, lexicographical sorting rules are subject to change. So now you need
a mechanism to detect and manage those changes.

~~~
rurban
That's done via

    unicode/uchar.h:
    #define U_UNICODE_VERSION "9.0"

in ICU. pcre2 has PCRE2_CONFIG_UNICODE_VERSION 10; fribidi has
FRIBIDI_UNICODE_VERSION "6.2.0".

libcs need something similar. In safelibc I have nothing so far, but I'm at
11.0.

~~~
cbsmith
Right... so properly detecting when U_UNICODE_VERSION in the replication
target differs from U_UNICODE_VERSION in the source, and adjusting for it is
what's needed. PostgreSQL wasn't doing this in 2014.

~~~
rurban
Exactly. I'm just not happy with the types of these defines. Only major
updates need to be tracked, so the PCRE format seems to be the best. But all
the others use strings, not an integer.

------
macdice
PostgreSQL has the beginnings of a solution to this problem: version tracking
for collations. I think it should be extended to track versions in a more
fine-grained way so that it can detect and reject this scenario (and related
scenarios), and I think that libc implementations should provide a way to
expose the version. I have proposed this for FreeBSD libc, and I hope someone
proposes something similar for glibc (maybe me eventually). I think PostgreSQL
should continue to support both ICU and libc collations (I don't think it's
reasonable for every piece of software to use its own collation system, or for
everyone to switch to ICU; I think libc should do a slightly better job).

More recent discussion of various possible approaches on the pgsql-hackers
list: [https://www.postgresql.org/message-id/flat/CAEepm%3D0uEQCpfq...](https://www.postgresql.org/message-id/flat/CAEepm%3D0uEQCpfq_%2BLYFBdArCe4Ot98t1aR4eYiYTe%3DyavQygiQ%40mail.gmail.com)

Proposal to add versions to libc collations in FreeBSD:
[https://reviews.freebsd.org/D17166](https://reviews.freebsd.org/D17166)

------
rurban
The updated variant of this postgresql/glibc warning and story is
[https://postgresql.verite.pro/blog/2018/08/27/glibc-upgrade....](https://postgresql.verite.pro/blog/2018/08/27/glibc-upgrade.html)

And glibc is still 2 years behind in its Unicode version: they are at 9.0,
while the current is 11.0. musl is at 6.0.

------
bmurphy1976
The post is from 2014, FYI. No idea if this is still an issue.

~~~
JadeNB
nezirus
([https://news.ycombinator.com/item?id=18855711](https://news.ycombinator.com/item?id=18855711))
says it isn't. (I don't know how this squares with daamien's
([https://news.ycombinator.com/item?id=18855837](https://news.ycombinator.com/item?id=18855837))
contrary reply.)

~~~
nezirus
The Arch package is configured with ICU support, but I think you need to use
ICU collations explicitly: [https://postgresql.verite.pro/blog/2018/07/25/icu-extension....](https://postgresql.verite.pro/blog/2018/07/25/icu-extension.html)

    SELECT * FROM pg_catalog.pg_collation;

As usual with collations, you must be careful and know what you are doing
(old collations are also still available, for backwards compatibility).

------
rurban
Not just different glibcs, but also different Unicode versions.

I constantly have to update my towupper() function for my safelibc. musl is
hopelessly behind, and glibc is not much better with Unicode support. ICU
constantly has breaking changes, but at least it keeps its tables up to date.
Unicode itself changes its tables every year; this year we had 6 changes
there. DB or web clients really need to pass the Unicode version in their
APIs, like "sorted according to Unicode 11.0".

------
alexchamberlain
May I ask a stupid question? What's the difference between collation and
encoding? Some people in this thread have suggested "use byte strings"; is it
"enough" to "just" use UTF-8, like you should "just" use UTC (unless you know,
or should know, better)?

~~~
ben509
To get a bit into the theory: a string is a sequence of symbols drawn from
some "alphabet", and we can define operations like concatenation, counting,
matching and substring on it. In particular, a natural operation on strings is
lexicographic comparison, whereby you compare the first characters of two
strings; if they differ, the two strings compare that way; if they match, you
proceed to the second character, and so on.

Encoding translates a string over one alphabet into a string over another, and it's often a
reversible translation. The obvious case is to translate from Unicode to
binary and back. In Unicode, the alphabet is the set of all Unicode
codepoints. In binary, you have a stream of octets and the alphabet is the set
of integers between 0 and 255. So UTF-8 is a standard that specifies how to
perform such a translation.

Since we're often going back and forth between Unicode and binary, and because
real processors work with bytes, it's reasonable to call "encoding"
translating a string to binary and "decoding" translating binary to a string.
The math doesn't require that, though.
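
To make the Unicode-to-binary direction concrete, here is a minimal sketch of
a UTF-8 encoder for codepoints up to U+FFFF (a toy that ignores surrogates and
4-byte sequences; note that nothing about it involves ordering):

    #include <stdio.h>

    /* Encode one codepoint below U+10000 as UTF-8; returns the byte count. */
    static int utf8_encode(unsigned int cp, unsigned char out[3])
    {
        if (cp < 0x80) {                  /* 1 byte: 0xxxxxxx */
            out[0] = (unsigned char)cp;
            return 1;
        } else if (cp < 0x800) {          /* 2 bytes: 110xxxxx 10xxxxxx */
            out[0] = (unsigned char)(0xC0 | (cp >> 6));
            out[1] = (unsigned char)(0x80 | (cp & 0x3F));
            return 2;
        } else {                          /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
            out[0] = (unsigned char)(0xE0 | (cp >> 12));
            out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[2] = (unsigned char)(0x80 | (cp & 0x3F));
            return 3;
        }
    }

    int main(void)
    {
        unsigned char buf[3];
        int n = utf8_encode(0x00E4, buf);   /* U+00E4, 'ä' */
        for (int i = 0; i < n; i++)
            printf("%02X ", buf[i]);        /* prints: C3 A4 */
        printf("\n");
        return 0;
    }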

Collation represents a string as an expression (usually also a string) that is
trivially comparable. For instance, suppose I wanted an English-friendly way
to compare band names. I might map all codepoints that look like A or E to
"A", then K, S and C to "C", etc., discarding any accents and maybe throwing
out leading junk like "The" or "A".

That's why a case-insensitive collation won't let you have Foo and FOO,
because they would both collate as FOO. It also means, unlike encoding,
collation is expected to be one-way.
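
The C library exposes exactly this idea as strxfrm(): it turns a string into a
"sort key" whose plain byte comparison agrees with strcoll() on the originals.
A minimal sketch, assuming an en_US.UTF-8 locale is installed:

    #include <locale.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        setlocale(LC_COLLATE, "en_US.UTF-8");

        const char *a = "cote";
        const char *b = "côte";
        char ka[256], kb[256];

        /* strxfrm() returns the space the key needs; check it so the keys
         * are known to be complete before comparing them. */
        if (strxfrm(ka, a, sizeof ka) >= sizeof ka ||
            strxfrm(kb, b, sizeof kb) >= sizeof kb) {
            fprintf(stderr, "key buffer too small\n");
            return 1;
        }

        /* The two comparisons agree in sign under the same locale data; an
         * ordering baked into an index goes stale when that data changes,
         * which is the problem in the article. */
        printf("strcoll(a, b)        = %d\n", strcoll(a, b));
        printf("strcmp(key_a, key_b) = %d\n", strcmp(ka, kb));
        return 0;
    }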

~~~
jancsika
> Collation represents a string as an expression (usually also a string) that
> is trivially comparable. For instance, suppose I wanted an English-friendly
> way to compare band names. I might map all codepoints that look like A or E
> to "A", then K, S and C to "C", etc., discarding any accents and maybe
> throwing out leading junk like "The" or "A".

So how does collation algorithmic instability break an application?

Edit: What I mean is: what is a set of reasonable assumptions a dev would make
in the application layer that would be good practice on the one hand, yet also
be wholly dependent on collation algorithm stability to keep from breaking?

~~~
ben509
All the assumptions that make a database index work will break.

If you think about any kind of binary tree based structure, it will have
invariants like "the left child node will be less than the parent node, and
the right child node will be greater than the parent node."

If those are violated just a little bit, at _best_ data will simply disappear
from the index. It's possible for updates to delete stuff, or for operations
on the index to hang.

------
cryptonector
PostgreSQL should use its own internationalization library, or ICU, but not
the C library's I18N components. This would allow PG to ensure that any one
version of PG also uses the same version of Unicode.

~~~
cbsmith
Or, you know, you could just track the versioning of the library that you use
for collation...

------
butterisgood
Static linking can be good.

~~~
waynecochran
Of course, if you have two applications communicating with each other that
are statically linked with different versions of the library, you can still
have inconsistency problems.

~~~
CamperBob2
That's more a protocol design problem than anything else.

------
phonon
What does Postgres Aurora use?

------
hgq
2014

