
Text normalization in Go - enneff
http://blog.golang.org/normalization
======
berdario
Just a small detail that isn't mentioned in the article:

in NFC form, "base characters and modifiers are combined into a single rune
whenever possible"

the interesting detail is "whenever possible": since NFC works by first
decomposing and then recomposing, there are some cases in which the characters
remain decomposed even after you run NFC normalization on them

an example is 𝅘𝅥𝅮 (U+1D160), whose NFC form is made of 3 different code
points

I tried to look at the algorithm for generating the composition table, and it
seems it's generated from the decomposition table... if that's so, I can't
understand how it could happen that some code points have an NFC form longer
than 1

more details:
[http://stackoverflow.com/questions/17897534/can-unicode-nfc-...](http://stackoverflow.com/questions/17897534/can-unicode-nfc-normalization-increase-the-length-of-a-string)

does anyone know the cause behind this?

~~~
lelf
1\. It's decompose, reorder, compose. So you can see some weird stuff like
ḍ̇=ḋ○̣ → NFD=d○̣○̇ → NFC=ḍ○̇

2\. It's not compression, it's normalisation. So it doesn't compose everything
it can. I cannot tell you the exact algorithm off the top of my head, but:

the reason for U+1D160 — it's in CompositionExclusions list.

~~~
berdario
Thanks, after looking up CompositionExclusions I discovered the rationale:

[http://unicode.org/reports/tr15/#Primary_Exclusion_List_Tabl...](http://unicode.org/reports/tr15/#Primary_Exclusion_List_Table)

> When a character with a canonical decomposition is added to Unicode, it must
> be added to the composition exclusion table if there is at least one
> character in its decomposition that existed in a previous version of
> Unicode. If there are no such characters, then it is possible for it to be
> added or omitted from the composition exclusion table. The choice of whether
> to do so or not rests upon whether it is generally used in the precomposed
> form or not.
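
A minimal sketch of how an exclusion list leaves a single code point with an NFC form longer than one. It does not run real NFC: it only inverts a small decomposition table into a composition table and skips excluded entries. The table is a hand-copied subset of UnicodeData.txt, and the names `decomp`, `excluded`, `buildCompositions` and `fullDecompose` are made up for this illustration.

```go
package main

import "fmt"

// decomp holds canonical decompositions for a few code points,
// copied by hand from UnicodeData.txt (illustrative subset only).
var decomp = map[rune][]rune{
	0x00E9:  {0x0065, 0x0301},   // é = e + combining acute
	0x1D15F: {0x1D158, 0x1D165}, // quarter note = notehead black + stem
	0x1D160: {0x1D15F, 0x1D16E}, // eighth note = quarter note + flag-1
}

// excluded mimics CompositionExclusions.txt for the entries above.
var excluded = map[rune]bool{0x1D15F: true, 0x1D160: true}

type pair struct{ first, second rune }

// buildCompositions inverts decomp, leaving out excluded code points.
func buildCompositions() map[pair]rune {
	comp := make(map[pair]rune)
	for r, d := range decomp {
		if excluded[r] || len(d) != 2 {
			continue
		}
		comp[pair{d[0], d[1]}] = r
	}
	return comp
}

// fullDecompose applies the decomposition table recursively.
func fullDecompose(r rune) []rune {
	d, ok := decomp[r]
	if !ok {
		return []rune{r}
	}
	var out []rune
	for _, c := range d {
		out = append(out, fullDecompose(c)...)
	}
	return out
}

func main() {
	comp := buildCompositions()
	_, ok := comp[pair{0x0065, 0x0301}]
	fmt.Println(ok) // true: e + acute can recompose to é
	_, ok = comp[pair{0x1D15F, 0x1D16E}]
	fmt.Println(ok) // false: nothing recomposes to U+1D160
	fmt.Printf("%U\n", fullDecompose(0x1D160)) // [U+1D158 U+1D165 U+1D16E]
}
```

Because no pair in the composition table maps back to U+1D15F or U+1D160, the three decomposed code points have nothing to recompose into.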

------
Nogwater
That "café" -> "cafeś" replacement is pretty scary. It looks like the built-in
strings.Replace function makes the same mistake:

    
    
      fmt.Println(strings.Replace("multiple cafe\u0301", "cafe", "cafes", 1)) // multiple cafeś

~~~
rsc
fmt.Println(strings.Replace("multiple cafeterias", "cafe", "cafes", 1))

~~~
Nogwater
Yeah, I get that. It's just that you might assume that the strings functions
would operate on character boundaries (as defined in the blog post) and not
based on runes (code points). Leaky abstractions and all that...
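
A rough sketch of what a boundary-aware replace could look like using only the standard library. strings.Replace matches byte sequences, so this hypothetical helper, `replaceFirstAtBoundary`, skips any match that is immediately followed by a combining mark. It is not full grapheme segmentation and it is not how the normalization package does it; it only covers the decomposed-accent case from the example.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// replaceFirstAtBoundary replaces the first occurrence of old in s with
// repl, ignoring matches whose next rune is a combining mark (which
// would mean the match ends in the middle of a character).
func replaceFirstAtBoundary(s, old, repl string) string {
	if old == "" {
		return s
	}
	offset := 0
	for {
		i := strings.Index(s[offset:], old)
		if i < 0 {
			return s // no acceptable match
		}
		start := offset + i
		end := start + len(old)
		// At end of string this yields utf8.RuneError, which is not a mark.
		next, _ := utf8.DecodeRuneInString(s[end:])
		if !unicode.IsMark(next) {
			return s[:start] + repl + s[end:]
		}
		offset = end // match splits a character; keep searching
	}
}

func main() {
	fmt.Println(replaceFirstAtBoundary("multiple cafeterias", "cafe", "cafes"))
	// multiple cafesterias
	fmt.Println(replaceFirstAtBoundary("multiple cafe\u0301", "cafe", "cafes") == "multiple cafe\u0301")
	// true: the input is left untouched
}
```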

~~~
enneff
The purpose of the normalization package is to help you work with text under
these constraints. I can't imagine many situations where strings.Replace would
be sufficient for reliably manipulating natural language. The cafe example is
there to demonstrate why you might need the package.

~~~
Nogwater
I wasn't thinking that I'd really want to pluralize text like this, but maybe
you'd want to turn people's names into links in HTML source or something. If
someone's name ends with an accent, and if the unicode isn't normalized,
strange things are bound to happen. The blog post is great at pointing this
out, and it sounds like people are working on a go.text/search package to
help, so that's good. I'm not saying Go is broken, just that this kind of
stuff can be really surprising.

~~~
enneff
Yep, working with natural languages is scary. :-)

------
hmmdar
Looks like this issue is pervasive in other languages as well. Out of
curiosity, I ran the same test in JavaScript and got the same result.

    
    
      s = "We went to eat at multiple cafe\u0301"
      "We went to eat at multiple café"
      s.replace('cafe', 'cafes');
      "We went to eat at multiple cafeś"
    

The interesting thing is that when the text is copy-pasted, backspacing first
deletes only the accent. At least in Chrome.

~~~
lstamour
FYI:

Node.js - [https://github.com/walling/unorm](https://github.com/walling/unorm)
YMMV, but looks good.

It can also serve as a polyfill for the eventual
[http://people.mozilla.org/~jorendorff/es6-draft.html#sec-str...](http://people.mozilla.org/~jorendorff/es6-draft.html#sec-string.prototype.normalize)

------
argon81
I actually took this a step further a few months back and implemented
Unicode's "Skeleton" algorithm:
[https://github.com/mtibben/tr39-confusables-go](https://github.com/mtibben/tr39-confusables-go)

This is useful for example, to ensure that users don't try and spoof each
other's usernames. Simply create and store a skeleton string for each
username, and keep a unique constraint on it
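
A toy sketch of the "store a skeleton and keep it unique" idea. The `confusables` table is a tiny hand-picked subset for illustration, not the real TR39 data, and `skeleton`, `registry` and `register` are hypothetical names rather than the API of the linked package. The real algorithm also applies NFD before and after the mapping, which is omitted here; an in-memory map stands in for the database unique constraint.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// confusables maps look-alike runes to a prototype string.
// Illustrative subset only; the real table is far larger.
var confusables = map[rune]string{
	'\u0430': "a", // CYRILLIC SMALL LETTER A
	'\u0435': "e", // CYRILLIC SMALL LETTER IE
	'\u043E': "o", // CYRILLIC SMALL LETTER O
	'1':      "l",
	'I':      "l",
}

// skeleton replaces every confusable rune with its prototype.
func skeleton(s string) string {
	var b strings.Builder
	for _, r := range s {
		if repl, ok := confusables[r]; ok {
			b.WriteString(repl)
		} else {
			b.WriteRune(r)
		}
	}
	return b.String()
}

var errTaken = errors.New("username is confusable with an existing one")

// registry enforces uniqueness on skeletons instead of raw usernames.
type registry struct{ bySkeleton map[string]string }

func newRegistry() *registry {
	return &registry{bySkeleton: make(map[string]string)}
}

func (r *registry) register(name string) error {
	key := skeleton(name)
	if _, taken := r.bySkeleton[key]; taken {
		return errTaken
	}
	r.bySkeleton[key] = name
	return nil
}

func main() {
	reg := newRegistry()
	fmt.Println(reg.register("paypal"))           // <nil>
	fmt.Println(reg.register("p\u0430yp\u0430l")) // username is confusable with an existing one
	fmt.Println(reg.register("paypa1"))           // username is confusable with an existing one
}
```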

~~~
zellyn
That confusables list is a good starting point, although you'll need to make
additions, and probably scale back a couple of the over-zealous ones (eg. rn
-> m)

I'm coming at this from a comment spam point of view, not usernames, btw.

------
jcampbell1
This document is good, but doesn't mention the case of ligatures. German's "ß"
is a problem, and it is not obvious how Go handles it.

In javascript:

    
    
         "ß".toUpperCase().length !== "ß".length;
    

Does weiss == weiß ?

~~~
patrickg
Does weiss == weiß ?

Yes and no. The Swiss would write the former; other German-speaking (writing)
countries would write the latter. The former is incorrect in Germany (after ie,
au, eu, ... you must not write ss, unless it's a name, such as the city Neuss).

The upper case of weiß would be WEISS. But it's hard from the upper case WEISS
to determine if the lower case is weiss or weiß. (This is why one should never
write people's names in bibliographies in small caps.)

~~~
qznc
You can also upper-case weiß as WEIß. It is mandatory for taxes and other
documents and recommended by the Post.

Technically, Unicode has a capital sharp s since 5.1.0, so we could write
WEIẞ.

~~~
patrickg
Yes, you can do that. But that's evil and ugly (mixing uppercase and lowercase
letters that way). I know it has to be done sometimes.

And I am glad that U+1E9E (LATIN CAPITAL LETTER SHARP S) is not an official
part of German orthography.

------
abvdasker
Not that there's anything wrong with it, but why are there so many HN articles
about Go?

------
titraxx
Is there an RSS feed for this blog's articles? I didn't find one.

~~~
patrickg
There is an atom feed:

    
    
        <link rel="alternate" type="application/atom+xml" title="blog.golang.org - Atom Feed" href="http://blog.golang.org/feed.atom"/>

