

JavaScript has a Unicode problem (2013) - inerte
http://mathiasbynens.be/notes/javascript-unicode

======
btn
One of the major difficulties with Unicode handling is not just that there are
poor implementations out there with legacy baggage, but a lot of poor advice
as well (or well-meaning advice that _seems_ correct, but misses some corner
case or some language). For example, this article wants to count "graphemes",
and the author goes through three versions of an algorithm to account for
surrogate pairs and various combining marks. All seems well in the test cases
the author shows, but combining marks are only one class of codepoints that
can join to form a grapheme, and the algorithm will fail for other grapheme
clusters such as 'நி' (Tamil letter NA + Tamil Vowel Sign I), or Hangul made
of conjoining Jamo (such as '깍': 'ᄁ' + 'ᅡ' + 'ᆨ'), or other control
characters.

Luckily, the Unicode Technical Committee has figured this out for you, and
UAX#29 provides an algorithm for determining grapheme cluster boundaries [1].
Yes, it's long and technical, it has many cases (and exceptions) to handle,
and it can't be expressed compactly in two lines of JavaScript; but it will
give you a well-defined and understood answer for all scripts in Unicode.

[1]
[http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Bounda...](http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries)
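
As a rough illustration of what a UAX #29 implementation buys you, here is a minimal sketch. It assumes a runtime that ships `Intl.Segmenter` with full ICU data (an API newer than this thread, and not something the article uses); `countGraphemes` is a hypothetical helper name.

```javascript
// Sketch: count extended grapheme clusters with Intl.Segmenter.
// In practice the segmenter is backed by the UAX #29 boundary rules.
function countGraphemes(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  let count = 0;
  for (const _segment of segmenter.segment(text)) {
    count += 1;
  }
  return count;
}

const tamilNi = "\u0BA8\u0BBF";          // Tamil letter NA + Tamil vowel sign I
const hangulJamo = "\u1101\u1161\u11A8"; // conjoining jamo that render as '깍'

// .length counts UTF-16 code units; countGraphemes counts clusters.
console.log(tamilNi.length, countGraphemes(tamilNi));       // 2 1
console.log(hangulJamo.length, countGraphemes(hangulJamo)); // 3 1
```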

~~~
giovannibajo1
One thing I never understand is why it is so important to count graphemes.

I have read dozens (hundreds?) of Unicode-related blog posts for many different
languages, with long debates and discussions about the hurdles of counting
graphemes, but they always forget to explain why one should need it; it's just
assumed that it's important or interesting. This specific post just says:
"Let's say you want to count the number of symbols in a given string, for
example. How would you go about it?" and then goes into a multi-page
explanation, which is even incomplete (as you correctly noticed).

I can't remember many cases in which it's been useful to count graphemes, in
my programming activity. I usually need to either:

1) count the number of bytes of the Unicode encoding I'm using / going to use,
for the purpose of low-level stuff like buffers/sockets/memory/etc.

2) ask a graphic library to tell me how big the string will be on the screen,
in pixels (with the given fonts, layout, hints, and whatnot).
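
For the first case, a minimal sketch of counting encoded bytes, assuming UTF-8 as the target encoding and a runtime with the standard `TextEncoder` global; `utf8ByteLength` is a hypothetical helper name.

```javascript
// Sketch: size of a string once encoded as UTF-8, compared with
// its length in UTF-16 code units. TextEncoder always emits UTF-8.
function utf8ByteLength(text) {
  return new TextEncoder().encode(text).length;
}

console.log("abc".length, utf8ByteLength("abc"));             // 3 3
console.log("\u00E9".length, utf8ByteLength("\u00E9"));       // 1 2
console.log("\u{1F4A9}".length, utf8ByteLength("\u{1F4A9}")); // 2 4
```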

Counting graphemes only sounds useful for things like a command-line terminal;
e.g., if I were to make a command-line user interface (à la getopt()) which
automatically word-wraps text in the usage screen at the 80th column, I would
need to count graphemes, in the unlikely case I had to support Tamil or Korean
for such a specialistic case.

tl;dr: counting graphemes is a very complicated problem you probably don't need
to ever solve.

~~~
btn
Counting graphemes may be over-used, but needing to know their boundaries is
important (and leads naturally to counting). For example, when you hit
"delete" in a text editor, you'll probably want it to delete whole graphemes
(and similarly for text selection); if you're doing text truncation, you may
measure it by pixels, but you'll want to chop off the excess bytes at a
grapheme boundary.
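
A minimal sketch of the "delete a whole grapheme" behaviour, assuming a runtime with `Intl.Segmenter`; `deleteLastGrapheme` is a hypothetical helper, not how any particular editor implements it.

```javascript
// Sketch: remove the last grapheme cluster from a string.
// Each segment reports the UTF-16 offset where its cluster starts,
// so the last offset seen is where the final cluster begins.
function deleteLastGrapheme(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  let lastStart = 0;
  for (const { index } of segmenter.segment(text)) {
    lastStart = index;
  }
  return text.slice(0, lastStart);
}

console.log(deleteLastGrapheme("ae\u0301") === "a");   // true: 'e' and its accent go together
console.log(deleteLastGrapheme("a\u{1F4A9}") === "a"); // true: both surrogates go together
```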

 _in the unlikely case I had to support Tamil or Korean for such a
specialistic case._

Why is it "unlikely" that you would want your software to support users of
other languages?

~~~
pavlov
In the case of a delete action in a text editor, are you sure that deleting
the whole grapheme is actually what the Tamil or Korean user wants?

You mentioned the following examples in your grandparent post:

- 'நி' (Tamil letter NA + Tamil Vowel Sign I)

- Hangul made of conjoining Jamo (such as '깍': 'ᄁ' + 'ᅡ' + 'ᆨ')

I don't speak either language, but it doesn't seem unreasonable to me that
pressing Delete would delete just the vowel sign in Tamil, or just the last
component within the Hangul character. In fact, that might be just what the
user wants?

~~~
taejo
> I don't speak either language, but it doesn't seem unreasonable to me that
> pressing Delete would delete just the vowel sign in Tamil, or just the last
> component within the Hangul character. In fact, that might be just what the
> user wants?

My Korean is pretty poor, but I think that's exactly what one wants. If you
mistype a letter, you want to retype that letter, not the whole syllable.
However, this should work uniformly: it shouldn't matter if the syllable is
represented as a single codepoint or made up of conjoining jamo.
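
One way to get that uniform behaviour, sketched under assumptions: decompose with `String.prototype.normalize("NFD")`, drop the last code point, and recompose with NFC. `deleteLastComponent` is a hypothetical name, and the sketch normalizes the whole string, which a real editor may not want.

```javascript
// Sketch: delete the last component of the text regardless of whether
// a Hangul syllable is stored precomposed or as conjoining jamo.
function deleteLastComponent(text) {
  const codePoints = Array.from(text.normalize("NFD")); // one entry per code point
  codePoints.pop();
  return codePoints.join("").normalize("NFC");
}

const precomposed = "\uAE4D";            // '깍' as a single code point
const conjoining = "\u1101\u1161\u11A8"; // '깍' as three conjoining jamo

// Both representations lose only the final consonant and become '까'.
console.log(deleteLastComponent(precomposed) === "\uAE4C"); // true
console.log(deleteLastComponent(conjoining) === "\uAE4C");  // true
```

Note that the same function would strip only the accent from a Latin "é", so whether this is the right behaviour is script-dependent.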

~~~
yew
If the Hangul and Tamil constructs are anything like ligatures (e.g. _fi_ in
the Latin alphabet), I would imagine that's the case most of the time. Plus
lots of special rules for which glyphs to treat as single symbols and which to
decompose (e.g. _&_ is technically a ligature but almost never decomposed).

------
stormbrew
Discussions of Unicode often centre around the issue of counting
symbols/graphemes/bytes/etc., and I often wonder what the use case is for
counting anything other than either the number of bytes (for storage) or the
size in device units of the output text from a rendering engine (for display).
All the options in between seem like pure exercise.

The reality seems to be that the 'size' of text is entirely dependent on
context, and even forward-thinking articles on the subject seem to get hung up
on counting things that don't matter.

~~~
byroot
Well, that depends. If you want to implement a simple `truncate`[0][1]
function, then you need to count graphemes.

[0]
[http://api.rubyonrails.org/classes/ActionView/Helpers/TextHe...](http://api.rubyonrails.org/classes/ActionView/Helpers/TextHelper.html#method-i-truncate)

[1]
[https://github.com/epeli/underscore.string](https://github.com/epeli/underscore.string)

~~~
al2o3cr
If you're truncating by character (or WHATEVER) counts, you are guaranteed to
be doing it wrong - maybe not in your native language, but in somebody's.

Heck, even in one graphemically-straightforward language you can get
silliness:
[http://www.images.generallyawesome2.com/photos/funny/photos/...](http://www.images.generallyawesome2.com/photos/funny/photos/testimonial-fail.png)

~~~
millstone
But truncation is very often needed when you have more text than space. What's
the alternative?

~~~
gliese1337
If it's storage space, you truncate by bytes, rounding down to the nearest
complete grapheme; no need to count graphemes. If it's display space, truncate
by pixels, in which case you need "size in device units of the output text
from a rendering engine". Again, no need to count graphemes.
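
The storage case could be sketched like this, assuming UTF-8 storage and a runtime with `TextEncoder` and `Intl.Segmenter`; `truncateToUtf8Bytes` is a hypothetical helper name.

```javascript
// Sketch: keep as many whole grapheme clusters as fit in maxBytes
// of UTF-8, so the cut never lands inside a cluster.
function truncateToUtf8Bytes(text, maxBytes) {
  const encoder = new TextEncoder();
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  let result = "";
  let usedBytes = 0;
  for (const { segment } of segmenter.segment(text)) {
    const size = encoder.encode(segment).length;
    if (usedBytes + size > maxBytes) break; // next cluster would overflow
    result += segment;
    usedBytes += size;
  }
  return result;
}

// 'a' is 1 byte; 'e' + U+0301 is one cluster of 3 bytes.
console.log(truncateToUtf8Bytes("ae\u0301", 3) === "a");        // true
console.log(truncateToUtf8Bytes("ae\u0301", 4) === "ae\u0301"); // true
```

It still has to find every boundary up to the cut point, even though it never reports a count.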

~~~
dustyleary
Counting graphemes and detecting "the nearest complete grapheme" are basically
the same problem.

The only reason counting graphemes is hard is that detecting grapheme
boundaries is hard.

------
eloff
These are less JavaScript problems than UTF-16 problems. The whole "one
character is not a code point" problem is common to Java, .NET, basically
all of Windows, and anything else that uses UTF-16 strings. The solution is
easy: if you need a one-to-one mapping of code points to characters, convert to
UTF-32 first. UTF-8 has the same problems; the only difference is that people
know characters and code points don't match up, whereas with UTF-16 there's a
bunch of people who are either new or should never have been programmers to
begin with and are clueless about it. Sadly, this number is so large that just
about any program that uses UTF-16 strings is broken for inputs where code
points != characters. This is partly the fault of the languages and libraries,
which give you functions like substring, reverse, etc. on UTF-16 strings, where
they basically have no consistent meaning. It should have been a storage
format, not a manipulation format.

~~~
fauigerzigerk
> These are less JavaScript problems than UTF-16 problems

The issues related to combining marks are not UTF-16 problems and are not
solved by converting to codepoints.
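
A small sketch of the distinction, using only standard string iteration (`Array.from` walks a string by code point); `codePointCount` is a hypothetical helper name.

```javascript
// Sketch: code units vs. code points vs. what a reader sees.
function codePointCount(text) {
  return Array.from(text).length; // string iteration yields whole code points
}

const astral = "\u{1F4A9}";  // one code point outside the BMP
const combining = "e\u0301"; // 'e' followed by a combining acute accent

// Going to code points fixes the surrogate pair...
console.log(astral.length, codePointCount(astral));       // 2 1
// ...but the combining sequence is still two code points for one visible letter.
console.log(combining.length, codePointCount(combining)); // 2 2
```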

Also, Java as well as many other UTF-16 based languages have much better
Unicode support than JavaScript (like access to codepoints and Unicode
character classes in regular expressions).

As always, if something can be done in a sloppy broken way JavaScript will
take advantage of it to the fullest.

~~~
eloff
They're typically solved by normalization, something JavaScript currently does
not do well.
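
For reference, a sketch of what normalization does and does not fix, assuming an engine that has `String.prototype.normalize` (added in ES2015, so not available everywhere when this was written); `codePointsAfterNfc` is a hypothetical helper name.

```javascript
// Sketch: NFC composes a sequence only when a precomposed code point exists.
function codePointsAfterNfc(text) {
  return Array.from(text.normalize("NFC")).length;
}

// 'e' + combining acute composes to the single code point U+00E9.
console.log(codePointsAfterNfc("e\u0301")); // 1
// Tamil NA + vowel sign I has no precomposed form, so it stays at two.
console.log(codePointsAfterNfc("\u0BA8\u0BBF")); // 2
```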

------
mlex
There's a fantastic, in-depth article on Unicode and NSString (the default
string class in Objective-C) that was published a couple days ago, which
covered a lot of the same material but from an Objective-C standpoint instead.

[http://www.objc.io/issue-9/unicode.html](http://www.objc.io/issue-9/unicode.html)

------
citrin_ru
About Unicode in JS and other languages, it is still worth reading "Unicode
Support Shootout: The Good, the Bad, the Mostly Ugly" by Tom Christiansen [1].

[1] e.g.
[http://dheeb.files.wordpress.com/2011/07/gbu.pdf](http://dheeb.files.wordpress.com/2011/07/gbu.pdf)

------
gcb0
I was alternating between 4 browsers while reading this. There was zero
consistency.

Fun times ahead...

------
al2o3cr
One other thing to watch out for - if you're using the sort of regexes the
author suggests, be VERY careful about any minification / uglification steps.
I recently had to chase down an issue where uglify was replacing Unicode
escapes with literal characters, causing strange "Invalid regular expression:
Range out of order in character class" errors on load.
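
This is not necessarily what uglify produced in that case, but here is a sketch of one way the error arises: a character class containing literal astral characters, compiled without the `u` flag, is read as UTF-16 code units. `compiles` is a hypothetical helper name.

```javascript
// Sketch: literal astral characters inside a character class.
function compiles(pattern, flags) {
  try {
    new RegExp(pattern, flags);
    return true;
  } catch (error) {
    return false; // a SyntaxError; the message text varies by engine
  }
}

// The pattern string holds the literal characters U+1F4A9 and U+1F4AB.
const literalRange = "[\u{1F4A9}-\u{1F4AB}]";

// Without "u" the class is read as D83D, then the range DCA9-D83D,
// then DCAB; DCA9 > D83D, so the range is out of order.
console.log(compiles(literalRange, ""));  // false
// With "u" the range is U+1F4A9 to U+1F4AB, which is valid.
console.log(compiles(literalRange, "u")); // true
console.log(new RegExp(literalRange, "u").test("\u{1F4AA}")); // true
```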

------
KwanEsq
Hmm, Lucida Console on Windows does not seem to work well with combining
glyphs.

------
VeejayRampay
Really precise and in-depth article, props.

