The sad state of foldcase and string comparisons (perl11.org)
4 points by rurban on Sept 16, 2017 | 3 comments

Postgres is going through some of these issues as well with ICU support. They don't necessarily have room for normalized strings in indexes, conversion is slow, and hashing also has trouble deciding if strings are identical.

I've been trying to get some ideas together about how to get around these issues. The Unicode Collation Algorithm doesn't clearly define how to do incremental comparison (for string sorting, traversing a trie, etc.), but ICU already seems to do some optimizations, such as skipping the longest common prefix in its strcmp via binary comparison and then converting only at the position where the strings first differ.
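
To make the prefix-skipping idea concrete, here is a rough Python sketch. It is not ICU's code and not real UCA collation: `fold_key`, `split_point` and `compare` are hypothetical names, and NFD plus case folding stands in for the expensive conversion step. The boundary check (back up while either remainder would start with a combining mark) is a simplification that ignores some rare edge cases a real implementation has to handle.

```python
import unicodedata


def fold_key(s: str) -> str:
    # Stand-in for the expensive "conversion" step: NFD, case fold, NFD again.
    # A real collation would build UCA sort keys (e.g. through ICU) instead.
    nfd = unicodedata.normalize("NFD", s)
    return unicodedata.normalize("NFD", nfd.casefold())


def split_point(a: str, b: str) -> int:
    # Length of the common prefix, found by plain code point comparison.
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    # Back up so that neither remainder starts with a combining mark;
    # otherwise the mark would be cut off from its base character.
    while i > 0 and any(
        i < len(s) and unicodedata.combining(s[i]) != 0 for s in (a, b)
    ):
        i -= 1
    return i


def compare(a: str, b: str) -> int:
    # Skip the shared prefix cheaply, then convert only the remainders.
    i = split_point(a, b)
    ka, kb = fold_key(a[i:]), fold_key(b[i:])
    return (ka > kb) - (ka < kb)
```

With this, `compare("Straße", "STRASSE")` returns 0 after converting only the part past the shared "S", and `compare("e\u0301x", "e\u0300x")` backs up to position 0 because the strings first differ on a combining accent.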

I recommend checking out how the Perl 6 MoarVM implementation has handled these things, specifically with respect to optimising string matching and indexing. Samantha McVey, the main dev looking at Unicode in the VM at the moment, just implemented the Unicode collation rules on top of fully normalised strings too: https://github.com/MoarVM/MoarVM/blob/master/docs/collation....

Thanks for that link; this is the kind of re-engineering that I'm curious about.

While we're at it, here's a proposal for normalized keys in pg: https://wiki.postgresql.org/wiki/Key_normalization .
