
On Unicode Normalization – or why normalization insensitivity should be the rule - lelf
https://blogs.oracle.com/nico/entry/normalization_insensitivity_should_be_the
======
adekok
They're right.

I spent the last 5 years trying to get i18n working in a protocol:

[https://tools.ietf.org/html/draft-ietf-radext-nai-15](https://tools.ietf.org/html/draft-ietf-radext-nai-15)

The working group came to the same conclusion. Whatever crap the client puts
in the packet is _inviolate_. Don't change it. You're not smart enough to do
that.

Instead, if you need to do comparisons, compare the _normalized_ strings:
create a "normalized" copy of each string, and use that copy only for the
comparison.
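
A minimal sketch of that pattern in Python (the choice of NFC below is purely
illustrative; a real protocol profile would pin down exactly which
normalization form applies):

    
    
    import unicodedata
    
    def same_identifier(a: str, b: str) -> bool:
        """Compare two identifiers via throwaway NFC-normalized copies.
    
        The original strings are never rewritten; normalization is applied
        only to the copies used for this one comparison."""
        return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
    
    # e-acute as one precomposed code point vs. "e" + combining acute accent
    assert same_identifier("\u00e9", "e\u0301")
    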

I wish the IETF precis working group had a "recommendations" document. It
would make everyone's lives much simpler.

------
jmount
Having the luxury of only watching Unicode from afar, it looks like a right
horror show. As a mathematician I respect how much harder it can be to work
over equivalence classes (especially when you don't have obvious normal forms)
than to work over mere instances. I am not sure all the Unicode players really
appreciate this (though I definitely agree: you can't safely re-normalize
externally supplied strings, you do have to work with equivalence classes at
some point).

What I keep hoping is the Unicode committees will eventually pile on one bad
idea too many and make string comparison as hard as a general instance of the
Post correspondence problem
[http://en.wikipedia.org/wiki/Post_correspondence_problem](http://en.wikipedia.org/wiki/Post_correspondence_problem)
and thus undecidable. They are presumably not there yet (due to unambiguity,
reading direction, and bounded length), but I keep hoping.

The idea is that many characters in Unicode have more than one equivalent
representation (and these representations can have different lengths). So,
roughly, checking whether two sequences of code points represent the same
string becomes a grouping problem:

For a sequence of integers 1...t define an ordered sequence partition P as a
sequence of contiguous sets of integers (P1,...,Ph) such that:

    
    
      1) Each Pi is a contiguous sequence of integers u...v with 1<=u<=v<=t.
    
      2) Union_i Pi = {1,...,t}
    
      3) max(Pi) < min(Pj) for all 1<=i<j<=h.
    
    

Check: for two sequences of code points

    
    
       a1,a2,...am 
       and b1,b2,...,bn
    

can the sequences of integers 1...m and 1...n be ordered-sequence partitioned
into A=(A1,...,Au) and B=(B1,...,Bu) such that the groups of code points
a_{Ai} and b_{Bi} (i=1...u) represent the same equivalent Unicode characters?
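
A rough sketch of that grouping check in Python, using a made-up equivalence
table (the real Unicode data is vastly larger); it brute-forces every way of
splitting both sequences into groups:

    
    
    # Toy "equivalence dictionary": each key is a tuple of code points,
    # each value is the abstract character it spells.  Invented purely
    # for illustration.
    EQUIV = {
        (0x00E9,): "e-acute",          # e-acute as one precomposed code point
        (0x0065, 0x0301): "e-acute",   # e + combining acute accent
        (0x0041,): "A",
    }
    
    def equivalent(a: tuple, b: tuple) -> bool:
        """Can a and b be split into ordered groups spelling the same characters?
    
        Brute force: try every legal prefix group of a, every legal prefix
        group of b, and recurse on the remainders.  With an ambiguous
        dictionary the number of partitions to try can explode."""
        if not a and not b:
            return True
        for i in range(1, len(a) + 1):
            cls = EQUIV.get(a[:i])
            if cls is None:
                continue
            for j in range(1, len(b) + 1):
                if EQUIV.get(b[:j]) == cls and equivalent(a[i:], b[j:]):
                    return True
        return False
    
    assert equivalent((0x00E9,), (0x0065, 0x0301))
    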

What stops this from encoding generally hard (NP-hard) problems is the lack
of ambiguity in the code-point -> character dictionaries, which ensures there
is only one partition of each code point sequence such that every group is a
valid character. Thus we can find the unique partition of each code point
sequence with a single left-to-right read, and then we just check whether the
partitions match.
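
A sketch of that unambiguous left-to-right read (same made-up table as above;
the single assert at the end is the entire comparison):

    
    
    # Same invented equivalence table as in the sketch above.
    EQUIV = {
        (0x00E9,): "e-acute",
        (0x0065, 0x0301): "e-acute",
    }
    
    def canonical(seq: tuple) -> list:
        """Greedy left-to-right read.  By the no-ambiguity assumption only
        one group can be formed at each position, so the partition found
        this way is the unique one."""
        out, i = [], 0
        while i < len(seq):
            for j in range(len(seq), i, -1):   # try the longest group first
                cls = EQUIV.get(seq[i:j])
                if cls is not None:
                    out.append(cls)
                    i = j
                    break
            else:
                raise ValueError(f"no valid group at position {i}")
        return out
    
    # Comparison is then just equality of the two canonical reads.
    assert canonical((0x00E9,)) == canonical((0x0065, 0x0301))
    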

So all we have to do to successfully encode NP-hard problems is trick the
standards committee into introducing a sufficient number of ambiguous
groupings (things like code-points "a1 a2 a3 a4" such that both "(a1 a2 a3)
(a4)" and "(a1 a2) (a3 a4)" are valid groupings). This will kill the obvious
comparison algorithms, and with some luck we get a code dictionary that lets
us encode NP-hard problems as string equivalence.

To get undecidable problems we just have to trick the committee into
introducing enough "fix insertions" where "a1 a2 a3 a4" can be grouped into
"(a1 x1 a2) (a3 a4 x2 x3)" by the insertion of the implied or "fix" code-
points x1, x2, x3. Then, with some luck, we could build a code dictionary that
could encode enough instances of Post correspondence problems to make Unicode
string comparison Turing complete and thus undecidable. (I need to check if a
single fixed code-book is enough, or if we need some trick to allow in many
different code-books.)

So I think all we need is some clever design (to actually get a dangerously
expressive encoding, not just the suspicion there is one) and stand up a
stooge linguist or anthropologist to claim a few additional "harmless lexical
equivalences" are needed to faithfully represent a few more languages.

----

What would I have liked instead of Unicode? Some sort of tagged string-region
implementation. Each string-region would have to specify which lexical
dictionary (and which version of the dictionary) it follows, and only one
language/lexical context would be used for the whole string-region. So each
section you are working with has a single set of rules meant to represent
only one language (instead of hoping for a universal representation). Some
care has to be taken that the region delimiters can never be confused with
in-region content, so we can identify the region boundaries before really
looking into any region (perhaps indicating that region designations should
not be in the string itself, a lot like when we added explicit lengths to
strings instead of relying on C-style zero termination).
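
A hedged guess at what one of those tagged string-regions might look like as
a data structure (every name below is invented; the point is only that the
dictionary and its version ride in the envelope rather than inside the
character stream):

    
    
    from dataclasses import dataclass
    
    @dataclass(frozen=True)
    class StringRegion:
        """One run of text governed by exactly one lexical dictionary."""
        dictionary: str   # e.g. "fr-FR" -- which lexical dictionary applies
        version: str      # e.g. "2015-06" -- which edition of that dictionary
        text: bytes       # the raw, untouched payload for this region
    
    @dataclass(frozen=True)
    class TaggedString:
        """An ordered sequence of regions; boundaries live out-of-band, so
        they can be found without inspecting any region's bytes (much as
        length-prefixed strings avoid scanning for a NUL terminator)."""
        regions: tuple
    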

