
JavaScript’s internal character encoding: UCS-2 or UTF-16? - mathias
http://mathiasbynens.be/notes/javascript-encoding
======
__alexs
> Both UCS-2 and UTF-16 are character encodings for Unicode.

UCS-2 is not an encoding that is generally compatible with Unicode. It's kind
of like saying that 7-bit ASCII and UTF-8 are character encodings for Unicode.

~~~
mathias
> It's kind of like saying that 7-bit ASCII and UTF-8 are character encodings
> for Unicode.

They can only encode a subset of Unicode, sure, but they’re still character
encodings for Unicode, right?

~~~
__alexs
Sure, but if I buy a product released in 2012 that advertises "Unicode"
support, and it turns out that it only supports a version of it from the era
of IE4, I'm probably going to whine.

------
hmottestad
"It produces a variable-length result of either one or two 16-bit code units
per code point"

    
    
      from the article.
    
    

"It produces a variable-length result of either one or two 16-bit code units
per code point"

    
    
      from wikipedia.org
    

I feel this should have been quoted or referenced in some way in the article.
Or it might just be a very rare case of coincidence.

------
lambda
The encoding is UTF-16, but what it calls "characters" are code units
<http://unicode.org/glossary/#code_unit>, not code points
<http://unicode.org/glossary/#code_point>.
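
A quick console illustration of the difference (a sketch; U+1D306 is just an
arbitrary astral code point):

    // One astral code point (U+1D306, 'TETRAGRAM FOR CENTRE') is exposed
    // to JavaScript as two 16-bit code units, so .length reports 2.
    var s = '\uD834\uDF06'; // a single code point outside the BMP
    console.log(s.length);        // 2 (counts code units, not code points)
    console.log(s.charCodeAt(0)); // 55348 (0xD834, high surrogate)
    console.log(s.charCodeAt(1)); // 57094 (0xDF06, low surrogate)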

~~~
mathias
Exactly. This makes it more like UCS-2 than like UTF-16.

------
apaprocki
For all the gory details about TC-39 work to possibly get rid of this
restriction in ECMAScript and support full Unicode, venture to the TC-39 wiki:

[http://wiki.ecmascript.org/doku.php?id=strawman:support_full...](http://wiki.ecmascript.org/doku.php?id=strawman:support_full_unicode_in_strings)

------
herge
Why doesn't everybody use UTF-8? How much overhead is incurred in encoding a
non-ASCII language (say, Chinese) in UTF-8 compared to UTF-16?
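
(A rough way to eyeball it, assuming the usual encodeURIComponent trick for
counting UTF-8 bytes:)

    // Back-of-the-envelope: most CJK characters take 3 bytes in UTF-8
    // versus 2 bytes in UTF-16, so pure Chinese text grows by about 1.5x.
    var s = '\u4E2D\u6587'; // "中文"
    var utf16Bytes = s.length * 2;                           // 4
    var utf8Bytes = unescape(encodeURIComponent(s)).length;  // 6
    console.log(utf16Bytes, utf8Bytes);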

~~~
Karellen
Because Windows uses UTF-16 as its Unicode encoding. Why? Because Unicode
support in Windows started in NT, and that used UCS-2. Why? Because when NT was
first being developed in 1989/1990, UTF-8 hadn't been invented yet, and
wouldn't be for another few years. And UTF-8 didn't have much traction at all
until after 2000, by which time it was too late.

Also, Windows has some problems with UTF-8 due to MB_LEN_MAX[0] being 2. Which
cannot be changed because that would break the ABI. (e.g. "char foo[MB_LEN_MAX
* len + 1];") Those problems aren't totally insurmountable as Windows does
have UTF-8 code pages which work, but it's not totally plain sailing.

In any case, going back now would be an admission of a mistake.

Thus, Windows is UTF-16. And the infection spreads....

[0] [http://www.kernel.org/doc/man-pages/online/pages/man3/MB_LEN...](http://www.kernel.org/doc/man-pages/online/pages/man3/MB_LEN_MAX.3.html)

------
patorjk
Very nice write-up. I was actually looking for something like this about a
week ago, and was referred to the ECMAScript spec (section 8.4), which talked
about "UTF-16 code units" - which I believe is just UCS-2. If this is the
case, I kind of wonder if the spec should be updated to make things a little
clearer, since the issue isn't straightforward for those who don't know a
lot about Unicode.

~~~
finnw
Not quite. Surrogates (<http://stackoverflow.com/q/5178202/12048>) are UTF-16
code units but they are not UCS-2 code units.

~~~
mathias
This.

JavaScript’s internal encoding is closer to UCS-2 than it is to UTF-16, but it
doesn’t guarantee that a string is valid UCS-2 (or valid UTF-16).

------
yonran
This means that for applications that want to store binary data as efficiently
as possible in localStorage (e.g. Offline Wikipedia
<https://news.ycombinator.com/item?id=3409512>), you can pack two bytes into
each string character. ECMAScript strings are just arrays of 16-bit unsigned
integers (e.g., '\ud800' is a valid JS string but is not valid UTF-16).
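
A minimal sketch of the packing idea (function names are made up; assumes an
even number of bytes):

    // Pack pairs of bytes into single 16-bit code units and back.
    function packBytes(bytes) {
      var chars = [];
      for (var i = 0; i < bytes.length; i += 2) {
        chars.push(String.fromCharCode((bytes[i] << 8) | bytes[i + 1]));
      }
      return chars.join('');
    }
    function unpackBytes(str) {
      var bytes = [];
      for (var i = 0; i < str.length; i++) {
        var code = str.charCodeAt(i);
        bytes.push(code >> 8, code & 0xFF);
      }
      return bytes;
    }

Note that packed code units can land in the surrogate range (0xD800-0xDFFF),
so whether such strings survive a localStorage round-trip intact is something
to test per browser.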

~~~
mathias
It’s worth pointing out that `'\ud800'` is not valid UCS-2 either, since UCS-2
technically doesn’t allow surrogate characters.

JavaScript strings are like UCS-2, except that they allow surrogate
characters.
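
For example (a small sketch):

    // A lone surrogate is a legal JavaScript string value, even though it
    // is neither valid UCS-2 nor valid UTF-16.
    var lone = '\uD800';
    console.log(lone.length);        // 1
    console.log(lone.charCodeAt(0)); // 55296 (0xD800)
    // APIs that must emit well-formed output can still reject it;
    // encodeURIComponent(lone) throws a URIError, for instance.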

