Hacker News

Just tried this in Perl6; looks like string comparisons Do The Right Thing™.

    > "\x65\x301".contains("\xe9")
    True



  > "\c[latin small letter e]\c[combining acute accent]" eq "\c[latin small letter e with acute]"
  True
Edit: And of course

  > "\c[dog face]".chars
  1
and not 2 as in the article.
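For contrast, the same checks in Python 3 (which compares raw code points) come out differently unless you normalize explicitly; a quick sketch:

```python
import unicodedata

s = "\x65\u0301"   # 'e' + COMBINING ACUTE ACCENT: two code points
print("\xe9" in s)                                # False: containment compares code points
print("\xe9" in unicodedata.normalize("NFC", s))  # True: NFC composes to precomposed 'é'
```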

PS: WTF? HN strips emojis :/ (and does it incorrectly when they are emoji sequences).


Swift is another major language that has correctly solved this problem in the same way: instead of representing and operating on strings as though they were naive arrays of bytes or code points, it treats them as arrays of characters, which Unicode thoroughly and intuitively defines in the same way that humans think about characters.
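The gap between code points and human-perceived characters is easy to see from a language that exposes code points; a small Python sketch (the ZWJ family-emoji sequence here is just an illustrative example):

```python
# Family emoji: MAN + ZWJ + WOMAN + ZWJ + GIRL -- one "character" to a human reader
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"
print(len(family))   # 5: Python counts code points, not grapheme clusters
```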


Swift is headed in the right direction.

But I think Perl6 is the only language that can do this magic:

  > 'Déjà vu' ~~ /:ignorecase:ignoremark deja \s vu/
  「Déjà vu」


> characters, which Unicode thoroughly and intuitively defines in the same way that humans think about characters

"Character" is a somewhat vague term, and Unicode prefers to use more specific terms like "code unit", "code point", "abstract character", etc.

In this case I think you may be referring to grapheme clusters, which come closer to how "humans think about characters" than Unicode abstract characters, which are building blocks of the technical encoding standard but in some cases don't really match a human concept of a graphical element of a writing system.

See also “Characters” and Grapheme Clusters in section 2.11 of https://www.unicode.org/versions/Unicode12.0.0/ch02.pdf, for example.
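A small Python illustration of the distinction (a sketch; the standard library exposes code points, not grapheme clusters):

```python
import unicodedata

s = "e\u0301"        # two code points, one grapheme cluster ('é')
print(len(s))                                 # 2
print(len(unicodedata.normalize("NFC", s)))   # 1: this cluster has a precomposed form
# Not every grapheme cluster composes: 'x' + combining acute has no precomposed code point
t = "x\u0301"
print(len(unicodedata.normalize("NFC", t)))   # still 2
```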


Oops - you're right. I was using the term "character" for both the intuitive notion and the documented one, but the documented term I'm actually referring to is grapheme cluster.


Normalization covers more than just combining characters. CJK fullwidth digits and punctuation can be problematic when you want the canonical forms for pattern matching:

  import unicodedata

  s = '１ ２'  # fullwidth DIGIT ONE, space, fullwidth DIGIT TWO

  print(unicodedata.normalize('NFD', s))   # １ ２  (unchanged)
  print(unicodedata.normalize('NFC', s))   # １ ２  (unchanged)
  print(unicodedata.normalize('NFKC', s))  # 1 2  (compatibility mapping folds fullwidth to ASCII)


It's called collation:

  > $*COLLATION.set(:!secondary:!tertiary:!quaternary)
  collation-level => 1, Country => International, Language => None, primary => 1, secondary => 0, tertiary => 0, quaternary => 0

  > '12' coll '１２'
  Same


Perl 6 normalizes to NFC by default for everything except filenames: https://docs.perl6.org/language/unicode

As to lelf's 1-character emoji, str.chars returns the number of characters in the string-- it would only return 2 if it returned the number of code units instead (which, the documentation notes, is what currently happens on the JVM).
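The character/code-unit distinction is easy to demonstrate from Python, which exposes code points: a non-BMP code point such as DOG FACE is one code point but a UTF-16 surrogate pair (a sketch, not the Perl 6 implementation):

```python
dog = "\U0001F436"                        # DOG FACE, U+1F436
print(len(dog))                           # 1 code point
print(len(dog.encode("utf-16-le")) // 2)  # 2 UTF-16 code units (surrogate pair)
```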


Perl 6 actually normalizes to NFG (Normalization Form Grapheme) https://docs.perl6.org/language/glossary#NFG


And at the C level you need something like safeclib's wcsnorm_s and then wcsfc_s for case-insensitive search. Or libunicode or ICU, but they are too slow and big, and failed to be useful in the GNU coreutils.

That's for strings. For identifiers (names, filenames, ...) there are a lot more rules to consider, and almost nobody supports Unicode identifiers safely.

There's also still no support for foreign strings in the most basic utilities, like expand, wc, cut, head/tail, tr, fold/fmt, od or sed, awk, grep, go, silversearcher, go platinum searcher, rust ripgrep, ... => http://perl11.org/blog/foldcase.html I maintain the multibyte patches for coreutils and have fixed it for my projects at least.
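A rough Python sketch of what a normalize-then-fold-case comparison in the spirit of wcsnorm_s/wcsfc_s does (caseless_eq is a hypothetical helper name; full Unicode caseless matching per the standard's section 3.13 is more involved than this):

```python
import unicodedata

def caseless_eq(a: str, b: str) -> bool:
    # Case-fold, then canonically normalize both sides before comparing
    fold = lambda s: unicodedata.normalize("NFC", s.casefold())
    return fold(a) == fold(b)

print(caseless_eq("Straße", "STRASSE"))   # True: 'ß' case-folds to 'ss'
```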


And Ruby does what I expect:

    > "\x65\u0301".include?("\u00e9")
    => false



