
"This is what made UTF-8 the favorite choice in the Web world, where English HTML/XML tags are intermixed with any-language text."

Except that Javascript is UTF-16, so no luck with 4-byte chars there.




> Javascript is UTF-16

No, it isn't. Javascript is no different from any other text. It can be encoded in any encoding. Where did you get the idea that JS is UTF-16?

EDIT: I misunderstood the intent of the comment I was responding to. JS uses (unbeknownst to me) UTF-16 as its internal representation of strings.


JavaScript source can be encoded in any way that the browser can handle, yes.

Within the JS language, strings are represented as sort-of-UCS-2-sort-of-UTF-16 [0]. This is one of the few problems with JS that I think merits a backwards-compatibility-breaking change.

[0] http://mathiasbynens.be/notes/javascript-encoding
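
A quick console sketch of the sort-of-UCS-2-sort-of-UTF-16 behaviour described above (the commented values are what a current engine returns; treat it as an illustration, not a spec quote):

  // U+1D306 lies outside the BMP, so JS stores it as two 16-bit units
  var s = "\uD834\uDF06";  // one character to a human reader
  s.length;                // 2 -- counts 16-bit units, not characters
  s.charCodeAt(0);         // 55348 (0xD834, the high surrogate)
  s.charCodeAt(1);         // 57094 (0xDF06, the low surrogate)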


GP means string values. To quote from the spec: "4.3.16 String value: primitive value that is a finite ordered sequence of zero or more 16-bit unsigned integer... Each integer value in the sequence usually represents a single 16-bit unit of UTF-16 text."


The "usually" there turns out to be important.

Javascript "strings" are, as the spec says, just arrays of 16 bit integers internally. Since Unicode introduced characters outside the Basic Multilingual Plane (BMP) i.e. those with codepoints greater than 0xFFFF it has no longer been possible to store all characters as a single 16 bit integer. But it turns out that you can store non-BMP character using a pair of 16 bit integers. In a UTF-16 implementation it would be impossible to store one half of a surrogate pair without the other, indexing characters would no longer be O(1) and the length of a string would not necessarily be equal to the number of 16 bit integers, since it would have to account for the possibility of a four byte sequence representing a single character. In javascript none of these things are true.

This turns out to be quite a significant difference. For example, it is in general impossible to represent a Javascript "string" using a conforming UTF-8 implementation, since that will choke on lone surrogates. If you are building an application that is supposed to interact with Javascript (a web browser, for example), this prevents you from using UTF-8 as your internal string encoding, at least for those parts that are accessible from Javascript.
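
One place this shows up directly is encodeURIComponent, which percent-encodes its input as UTF-8 (just an illustration of the general point, not the only case):

  encodeURIComponent("\uD834\uDF06");  // "%F0%9D%8C%86" -- a proper pair
                                       // has a UTF-8 encoding
  encodeURIComponent("\uD834");        // throws URIError -- a lone
                                       // surrogate has none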


The idea is from ECMA-262, sections 2 (Conformance), 4.3.16 (String value), 6 (Source text), 8.4 (String type)... That's basically THE reason why all JS engines are UTF-16 internally.


We really ought to suggest that be changed in ES7.


Have a look at this Stack Overflow question [1]. Javascript/ECMAScript strings are supposed to be UTF-16. That said, UTF-16 encodes the code points that need four bytes (those outside the BMP) just as easily as UTF-8 does, via surrogate pairs; a sketch of the mapping is below the link.

[1] http://stackoverflow.com/questions/8715980/javascript-string...
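
The mapping from a supplementary code point to its two UTF-16 code units is just bit arithmetic; a rough sketch (the function name is made up for illustration):

  function toSurrogatePair(codePoint) {
    // assumes codePoint >= 0x10000
    var offset = codePoint - 0x10000;
    var high = 0xD800 + (offset >> 10);    // top 10 bits
    var low  = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
    return [high, low];
  }

  toSurrogatePair(0x1D306);  // [0xD834, 0xDF06] in hex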


Concrete example of where JS has trouble:

  String.fromCharCode(0x010004).charCodeAt(0);  // => 4
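
That happens because fromCharCode truncates each argument to 16 bits (0x010004 & 0xFFFF is 4), so the astral code point is silently lost. To get an actual U+10004 string you have to supply the surrogate pair yourself; a sketch:

  String.fromCharCode(0xD800, 0xDC04);                // U+10004 as a pair
  String.fromCharCode(0xD800, 0xDC04).length;         // 2
  String.fromCharCode(0xD800, 0xDC04).charCodeAt(0);  // 55296 (0xD800)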



