
The sad state of foldcase and string comparisons - rurban
http://perl11.org/blog/foldcase.html
======
kwillets
Postgres is going through some of these issues as well with ICU support. They
don't necessarily have room for normalized strings in indexes, conversion is
slow, and hashing also has trouble deciding if strings are identical.

I've been trying to get some ideas together about how to get around these
issues. The Unicode Collation Algorithm doesn't clearly define how to do
incremental comparison (for string sorting, traversing a trie, etc.), but it
seems that ICU already does some optimizations, such as discarding the longest
common prefix in its strcmp via binary comparison, and then only converting
from the position of the first distinguishing character.
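
To make the prefix-skipping idea concrete, here is a rough Python sketch. It is not ICU's code, and it uses a plain case-folded NFD key rather than real collation weights. The names `fold_key`, `common_prefix_len`, `safe_boundary` and `compare_folded` are made up for illustration, and backing up to the nearest non-combining character is a simplifying assumption about where it is safe to cut the strings; a real implementation needs a more careful boundary test.

```python
import unicodedata


def fold_key(s: str) -> str:
    # Hypothetical comparison key: decompose, case-fold, decompose again.
    return unicodedata.normalize("NFD", unicodedata.normalize("NFD", s).casefold())


def common_prefix_len(a: str, b: str) -> int:
    # Cheap code-point-by-code-point scan; no normalization or folding here.
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i


def safe_boundary(a: str, b: str, i: int) -> int:
    # Assumption: back up while either string has a combining mark at i,
    # so a base character is never separated from its marks.
    while i > 0 and (
        (i < len(a) and unicodedata.combining(a[i]))
        or (i < len(b) and unicodedata.combining(b[i]))
    ):
        i -= 1
    return i


def compare_folded(a: str, b: str) -> int:
    # Skip the identical prefix, then convert and compare only the tails.
    i = safe_boundary(a, b, common_prefix_len(a, b))
    ka, kb = fold_key(a[i:]), fold_key(b[i:])
    return (ka > kb) - (ka < kb)
```

For example, `compare_folded("Stra\u00dfe", "STRASSE")` only has to fold the tails after the shared "S", and `compare_folded("cafe\u0301", "cafe")` backs up one position so the "e" stays attached to its accent before the tails are converted.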

~~~
Ultimatt
I recommend checking out how the Perl 6 MoarVM implementation has handled
these things, specifically with respect to optimising string matching and
indexing. Samantha McVey, the main dev looking at Unicode in the VM at the
moment, just implemented the Unicode collation rules on top of fully
normalised strings too:
[https://github.com/MoarVM/MoarVM/blob/master/docs/collation....](https://github.com/MoarVM/MoarVM/blob/master/docs/collation.asciidoc)

~~~
kwillets
Thanks for that link; this is the kind of re-engineering that I'm curious
about.

While we're at it, here's a proposal for normalized keys in pg:
[https://wiki.postgresql.org/wiki/Key_normalization](https://wiki.postgresql.org/wiki/Key_normalization).

