
Node's Unicode Dragon - foobar2k
http://cirw.in/blog/node-unicode
======
stormbrew
Wish I'd known about this when I was pointing out in another HN thread how
UTF-16 is a terrible encoding because, among other reasons, it pushes the
corner case where you discover your encoding/decoding is broken out to the
very edge of likelihood. It's ridiculous that V8 doesn't properly support
UTF-16, but it's to be expected, I suppose.

UTF-8 does not have this problem. That's the way we should be moving.

~~~
sillysaurus2
This. Why doesn't everybody use UTF-8? Nobody seems to have any problems with
UTF-8. It seems to work almost perfectly, and it's efficient.

~~~
est
Because some of us are pissed that some BMP characters take 3 bytes in UTF-8;
that's 50% more storage space and 50% more time to read/write.

I like the design of Python 3.3's string representation: ASCII takes 1 byte,
BMP takes 2 bytes, everything else 4 bytes.

[http://www.python.org/dev/peps/pep-0393/](http://www.python.org/dev/peps/pep-0393/)

~~~
deathanatos
The good point (in my opinion) is not that "ASCII takes 1 byte, BMP takes 2
bytes, everything else 4 bytes", but rather that the exposed API _hides_ this
from you and exposes a sequence of code points. This, I hope, will reduce
errors, as _code points_, not code units, are often a better abstraction to
work with (for an arbitrary string-processing function).

So far as I know, Haskell is the only other language that exposes, as the
default-ish native interface, Unicode strings as a sequence or iterable of
code points (by just using UTF-32). Java, C#, your-language-here all expose
code units. C++'s templates are powerful enough that someone could write
unicode_str<encoding_to_store_as>, but I've not seen one.

See:
[http://www.unicode.org/glossary/#code_point](http://www.unicode.org/glossary/#code_point)
[http://www.unicode.org/glossary/#code_unit](http://www.unicode.org/glossary/#code_unit)
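
To make the code unit / code point distinction concrete, a minimal JavaScript
sketch (the code-point-aware APIs shown arrived in ES2015, well after this
thread):

```js
// .length and indexing count UTF-16 code units, not code points.
var dragon = "\uD83D\uDC09";   // U+1F409 DRAGON, stored as a surrogate pair
dragon.length;                 // 2 (two code units)
dragon.charCodeAt(0);          // 0xD83D (half a character)

// ES2015 added code-point-aware alternatives:
dragon.codePointAt(0);         // 0x1F409
[...dragon].length;            // 1 (string iteration walks code points)
```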

~~~
millstone
Code points are a better abstraction than code units, but they're still a
piss-poor one.

Consider the problem of producing a valid substring from a Unicode string.
It's important that you not split surrogate pairs, and it's true that working
with code points spares you from that particular problem. But it's also important
that you not split combining marks, and zero width joiners, and Hangul
syllables... (see
[http://www.unicode.org/reports/tr29/](http://www.unicode.org/reports/tr29/)
for all the gory details).

An average programmer cannot correctly extract a substring from a Unicode
string whether given the code units or the code points. These abstractions are
inadequate: instead you want something like grapheme clusters.
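
The failure mode is easy to demonstrate in JavaScript; Intl.Segmenter, used
below for the grapheme-cluster version, arrived years after this thread:

```js
// Slicing on code points can still split a user-perceived character:
var s = "e\u0301";             // "é" as 'e' + U+0301 COMBINING ACUTE ACCENT
[...s].length;                 // 2 (two code points)
[...s].slice(0, 1).join("");   // "e" (the accent is stranded)

// Grapheme-cluster segmentation is the abstraction you actually want:
var seg = new Intl.Segmenter("en", { granularity: "grapheme" });
[...seg.segment(s)].length;    // 1 (one grapheme cluster)
```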

~~~
pyre
This was my reaction too. It's Unicode all the way down... :)

------
justin_vanw
Man, I'm starting to think there is a cult around JSON.

If you need to accept arbitrary binary data, JSON is a profoundly bad choice.
At a minimum, you would expect them to base64-encode the data and put that
into a JSON string.
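
A minimal sketch of that approach in Node (using the modern Buffer.from API;
contemporary code would have used the Buffer constructor):

```js
// Base64 makes arbitrary bytes safe to carry inside a JSON string.
var payload = Buffer.from([0xc0, 0x80, 0xff]);   // bytes that aren't valid UTF-8
var json = JSON.stringify({ data: payload.toString("base64") });

// The receiver reverses it with no lossy decoding step in between:
var bytes = Buffer.from(JSON.parse(json).data, "base64");
```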

If you are looking at error reports, how is it even _remotely_ acceptable to
have them silently modified, with invalid bytes turned into Unicode
replacement characters?

The lesson here isn't some crappy hack workaround they found, it's a case
study in the lengths you'll have to go to when you insist on making technology
choices without considering the problem you want to solve.

~~~
derefr
Any wire-serialization format that wants to send arbitrary data should really
have a "raw binary payload" type. XML has CDATA. ASN.1 has bitstrings. BERT
has Binaries. But JSON doesn't really have anything like that.

I wonder... at some point, JavaScript could get a convenient literal syntax
for creating pre-filled ArrayBuffers, which would basically be the format JSON
would want to adopt. But would it? Are changes to JavaScript literal syntax
folded into JSON, or is JSON now its own thing that doesn't track JS any more?

~~~
Dylan16807
CDATA disallows null bytes, so it's even worse than non-support: illusory
support.

XML doesn't even allow _escaped_ null bytes, so you're basically forced to use
base64 or weird custom app-internal escapes.

JSON never tracked JavaScript. It has one version, period. But you could get
people to adopt a superset with a new data type, if you kept it _simple_.

~~~
ygra
Isn't CDATA _character_ data anyway and thus not even binary but in the
document's character set? Which makes it a poor choice for binary data even
without taking into account that XML forbids certain characters.

As for binary data in web services ... isn't it easier to just use Content-
Type for that and use the appropriate type for the payload? That wouldn't
require a textual data format that can contain arbitrary binary data.

~~~
derefr
That presumes you want to send _a lot_ of binary data. Sometimes you just have
five bytes or so. (E.g., as in the article, an _un-decoded_ string.)

(Still, you're right; I admit I've managed to avoid any work painful enough to
teach me XML arcana. I was actually thinking of one of the many variants of
"Binary XML" I had read about recently, and assumed the typing was bijective
to XML's own types. In other news, _BSON_ of all things has a raw-binary
type.)

------
baddox
Despite that being a rather interesting technical article, I am upset that my
expectation of an actual Unicode depiction of a dragon was not met.

~~~
cirwin
🐉

To see this dragon, either:

1. Use Safari or Firefox on OS X.
2. Install custom fonts for Linux or Windows.
3. Install [Chromoji](https://chrome.google.com/webstore/detail/chromoji-emoji-for-google/cahedbegdkagmcjfolhdlechbkeaieki) for Chrome.

~~~
pavlov
The dragon glyph is rendered correctly in IE10 on Windows 8 without any custom
fonts. Hooray for the most underestimated browser ever ;)

~~~
city41
Also true of mobile IE10

------
nonchalance
String encoding in general is a mess. Wait till you get to code pages.
Incidentally, the largest JS script I've ever seen pertained to encoding and
decoding characters under various codepages:
[https://raw.github.com/Niggler/js-codepage/master/cptable.js](https://raw.github.com/Niggler/js-codepage/master/cptable.js)
(GitHub complains: "Sorry about that, but we can't show files that are this
big right now.")

------
jrochkind1
The OP describes an environment where data goes from Node to Rails.

If you want to check a string for valid encoding and/or replace bad bytes with
the replacement character on the _Ruby_ end... it's not very obvious how to do
that with the Ruby stdlib API, and it takes a few tricks to do right.

So I wrote a gem for it:
[https://github.com/jrochkind/ensure_valid_encoding](https://github.com/jrochkind/ensure_valid_encoding)

------
state
Whew. This explains a bug from six months ago that drove me up the wall. I
could never figure it out.

------
shawnz
> Unfortunately for us, Javascript has never been updated to support UTF-16.
> Instead it continues to treat strings as UCS-2.

So really, they were parsing the JSON as if it were UTF-16 when it was
actually UCS-2. How is that an error in Node?

~~~
justincormack
JSON is defined as UTF-8, UTF-16, or UTF-32 [1]. The escaped characters are
UTF-16, not UCS-2. It is unfortunate that JavaScript can't parse it correctly!

[1] [http://www.ietf.org/rfc/rfc4627.txt](http://www.ietf.org/rfc/rfc4627.txt)

~~~
kansface
This is true of JSON, but it's not true of JavaScript, which gives no fucks
about UTF-16 (or valid surrogate pairs). It's a very strange world where JSON
and JavaScript have incompatible interpretations of strings.

[http://mathiasbynens.be/notes/javascript-encoding](http://mathiasbynens.be/notes/javascript-encoding)

~~~
gnaritas
Not really, as JSON is not valid JavaScript and requires its own parser. It's
based on JavaScript, but it is not JavaScript.

~~~
daxelrod
I was skeptical, but I did some searching, and you appear to be right! The
difference seems to come down to string handling:

[http://timelessrepo.com/json-isnt-a-javascript-subset](http://timelessrepo.com/json-isnt-a-javascript-subset)
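
The classic difference is U+2028/U+2029: legal unescaped inside a JSON string,
but (until ES2019 fixed it) a syntax error inside a JavaScript string literal.
A quick sketch:

```js
// `json` holds a raw U+2028 LINE SEPARATOR between two quote characters;
// the \u escape below is resolved by the JS parser, not left in the string.
var json = '"\u2028"';
JSON.parse(json);   // fine: a one-character string
eval(json);         // SyntaxError on pre-ES2019 engines
```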

~~~
gnaritas
Ha, same article where I first learned this.

------
dsj36
How did the error JSON include the undecodable bytes? JSON strings are all
Unicode sequences, so there would have had to be some way that the raw bytes
were mapped into code points.

On the other hand, if the offending bytes were blindly substituted into the
JSON, then it's not surprising that there were decoding issues down the
line...

~~~
jlarocco
From the article:

> The exceptions that were crashing us were caused by people using
> String.prototype.substr. That function works perfectly on strings that only
> contain Unicode 1.0 data, but as soon as you're storing UTF-16 in your UCS-2
> string there's a possibility that when you take a slice you'll split a valid
> surrogate pair into two invalid lonely surrogates.

To me, it seems like it'd be nearly impossible for somebody to trigger, but
there's always Murphy's law...
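
It's easy to trigger deliberately, though. A minimal sketch of the failure:

```js
var s = "a\uD83D\uDC09";     // "a🐉": 'a' plus U+1F409 as a surrogate pair
s.length;                    // 3 code units
var chunk = s.substr(0, 2);  // "a\uD83D": a lone high surrogate
// Encoding that chunk as strict UTF-8 (say, into a JSON response) must
// either fail or substitute U+FFFD, which is the bug from the article.
```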

~~~
twoodfin
These kinds of lone surrogates are pretty easy to create if you're doing the
right kind of processing on the right kind of data.

Suppose you receive a long piece of text wrapped in JSON, unpack it into a JS
String, then start processing it in fixed-size chunks. If your source text
contains any significant percentage of characters represented as surrogate
pairs, you'll eventually break one.
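
A rough sketch of a chunker that sidesteps this (a hypothetical helper,
assuming chunk sizes of at least 2):

```js
function chunkText(str, size) {
  var out = [];
  for (var i = 0; i < str.length; ) {
    var end = Math.min(i + size, str.length);
    var last = str.charCodeAt(end - 1);
    // Never end a chunk on a high surrogate whose partner is in the next chunk.
    if (end < str.length && last >= 0xd800 && last <= 0xdbff) end--;
    out.push(str.slice(i, end));
    i = end;
  }
  return out;
}
```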

------
scoopr
This same problem manifests in Java as well, where some methods that claim to
return UTF-8, on closer inspection, actually return "modified UTF-8", which is
broken in the same way. Notably, I ran across this with the JNI function
GetStringUTFChars, but it may also come up with DataOutputStream's writeUTF,
etc.
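
The breakage has the same shape as Node's: modified UTF-8 (a CESU-8 variant)
encodes each UTF-16 surrogate as its own 3-byte sequence instead of encoding
the code point itself. For U+1F409:

```js
// Real UTF-8 encodes the code point as one 4-byte sequence:
Buffer.from("\uD83D\uDC09", "utf8");   // <Buffer f0 9f 90 89>
// Modified UTF-8 / CESU-8 instead emits ed a0 bd ed b0 89 (each surrogate
// encoded separately), which strict UTF-8 decoders reject.
```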

------
bsaul
Reminds me of a previous discussion about Go being more "mature" than Node.js,
where I said having someone like Pike on board gives you more than 30 years of
"maturity". I'm pretty sure you wouldn't find this kind of leaky UTF encoding
handling in Go.

~~~
ygra
Well, Node builds atop an established language, while Go is a new development.
It's probably easier to build sane Unicode semantics into a new language than
to change the JS spec.

------
scott_karana
Is it just me, or is the two-column layout a bit tricky for readability?

(1440x900)

------
ChrisAntaki
I can't believe NodeJS doesn't support Dragon symbols. This is a dealbreaker.

