
Why we can't process Emoji anymore - tpinto
http://gist.github.com/1707371
======
oofabz
This is why UTF-8 is great. If it works for any Unicode character it will work
for them all. Surrogate pairs are rare enough that they are poorly tested.
With UTF-8, if there are issues with multi-byte characters, they are obvious
enough to get fixed.

UTF-16 is not a very good encoding. It only exists for legacy reasons. It has
the same major drawback as UTF-8 (variable-length encoding) but none of the
benefits (ASCII compatibility, size efficiency).

~~~
pixelcort
The problem with UTF-8 is that lots of tools have 3-byte limits, and
characters like Emoji take up 4 bytes in UTF-8.

~~~
pjscott
How many tools have 3-byte limits on UTF-8? The only one I can think of right
now is MySQL. (The workaround is to specify the utf8mb4 character set. This is
MySQL's cryptic internal name for "actually doing UTF-8 correctly.")

~~~
jrabone
MySQL is one of the worst offenders for broken Unicode and collation problems
arising therein. Neither it nor JavaScript deserve consideration for problems
that need robust Unicode handling.

~~~
mikeash
I actually switched my (low traffic, low performance needs) blog comments
database from MySQL to SQLite purely because I could not make MySQL and
Unicode get along. All I needed was for it to accept and then regurgitate
UTF-8 and it couldn't even handle that. I'm sure it can be done, but none of
the incantations I tried made it work, and it was ultimately easier for me to
switch databases.

~~~
pjscott
As an ugly last resort, you could store Unicode as UTF-8 in BLOB fields. MySQL
is pretty good about storing binary data. (I dread the day that I'll have to
do something more advanced with Unicode in MySQL than just storing it.)

~~~
mikeash
I no longer recall whether I tried that and failed, or didn't get that far.
Seems like a semi-reasonable approach if you don't need the database to be
able to understand the contents of that column. But on the other hand, SQLite
is working great for my needs too.

------
ender7
Apropos: <http://mathiasbynens.be/notes/javascript-encoding>

TL;DR:

\- Javascript engines are free to internally represent strings as either UCS-2
or UTF-16. Engines that choose UCS-2 tend to replace all glyphs outside of the
BMP with the replacement char (U+FFFD). Firefox, IE, Opera, and Safari all do
this (with some inconsistencies).

\- _However_ , from the point of view of the actual JS code that gets
executed, strings are always UCS-2 (sort of). In UTF-16, code points outside
the BMP are encoded as surrogate pairs (4 bytes). But -- if you have a
Javascript string that contains such a character, it will be treated as two
consecutive 2-byte characters.

    
    
      var x = '𝌆';
      x.length; // 2
      x[0];     // \uD834
      x[1];     // \uDF06
    

Note that if you insert said string into the DOM, it will still render
correctly (you'll see a single character instead of two ?s).
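
If you need the real code point count, a rough sketch (the function name here
is just illustrative; newer engines also offer String.prototype.codePointAt):

    function codePointLength(s) {
      var count = 0;
      for (var i = 0; i < s.length; i++) {
        var c = s.charCodeAt(i);
        // a high surrogate followed by a low surrogate counts as one code point
        if (c >= 0xD800 && c <= 0xDBFF &&
            s.charCodeAt(i + 1) >= 0xDC00 && s.charCodeAt(i + 1) <= 0xDFFF) {
          i++;
        }
        count++;
      }
      return count;
    }
    
    codePointLength('𝌆'); // 1, even though '𝌆'.length is 2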

~~~
notJim
I'm relatively comfortable with this stuff, but I am confused by your
response.

First you say that engines will "internally" replace non-BMP glyphs with the
replacement character, but then you give an example that seems to work fine
(and I think would work fine as long as you don't cut that character in half,
or try to inspect its character code without doing the proper
incantations[1].)

So, I guess what I'm asking is, at what point does the string become
"internal", such that the engine will replace the character with the
replacement character?

[1]: As given in the article you linked to.

~~~
Kaworu
I dare not try and reexplain the discussion in this bug report as my
understanding feels insufficient, but the entire discussion at
<http://code.google.com/p/v8/issues/detail?id=761#c14> (note, I've linked to
the 14th commment in the discussion, but there's more interesting stuff above)
talks about it. At the core is a distinction between v8's internal
representation of strings and it's API vs. what a browser engine which embeds
v8 might do.

------
praptak
Sometimes you need to know about encodings, even if you're just a consumer.
Putting just one non-7-bit character in your SMS message will silently change
its encoding from 7-bit (160 chars) to 8-bit (140 chars) or even 16-bit (70
chars), which might make the phone split it into many chunks. The resulting
chunks are billed as separate messages.
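
A back-of-the-envelope sketch of that cliff (the GSM-7 test here is a crude
ASCII-only stand-in for the real GSM 03.38 alphabet, and it ignores the
slightly smaller per-part limits once a message actually gets split):

    function smsParts(text) {
      var fitsGsm7 = /^[\x20-\x7E\r\n]*$/.test(text); // crude approximation
      var limit = fitsGsm7 ? 160 : 70;                // 7-bit vs 16-bit single-message limits
      return Math.max(1, Math.ceil(text.length / limit));
    }
    
    smsParts(new Array(161).join('a'));            // 1 -- 160 chars, 7-bit
    smsParts(new Array(161).join('a') + '\u0142'); // 3 -- one 'ł' pushes it all to 16-bit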

~~~
fwr
On iOS, using any non-Basic Latin character in an SMS makes it switch to
16-bit, even when there is no reason for that to happen. It's a thing that most
foreign-language speakers must live with.

By writing this excuse-laden post, this guy wasted a substantial amount of
time that he could have spent researching the issue properly. Your consumer
doesn't care how many bits an Emoji takes, and it doesn't matter to him that
you're running your infrastructure on poorly chosen software - there is
absolutely no excuse for not supporting this in a native iOS app, especially
now that Emoji is so widely used and deeply integrated into iOS.

How is that a problem they are focusing on, anyway, when their landing page
features awful, out-of-date mockups of the app (not even actual screenshots -
notice the positions of the menu bar items)? They are also featuring Emoji in
every screenshot - ending support might be a recent development, but I still
find that ironic.

~~~
speednoise
This was an internal email: <https://medium.com/tech-talk/1aff50f34fc>

~~~
jonny_eh
So Node.js already fixed the issue, nice!

------
pjscott
The quick summary, for people who don't like ignoring all those = signs, is
that V8 uses UCS-2 internally to represent strings, and therefore can't handle
Unicode characters which lie outside the Basic Multilingual Plane -- including
Emoji.

~~~
bbotond
Honestly that's a shame.

~~~
masklinn
It was fixed back in March though.

------
driverdan
If you search for V8 UCS-2 you'll find a lot of discussion on this issue
dating back at least a few years. There are ways to work around V8's lack of
support for surrogate pairs. See this V8 issue for ideas:
<https://code.google.com/p/v8/issues/detail?id=761>

My question is why does V8 (or anything else) still use UCS-2?

~~~
est
Because counting fixed 2-byte units is much faster for computers than counting
variable-width sequences of 1, 2, 3, or even 4 bytes.

~~~
speleding
This is not a real issue, because counting code points in a UTF-8 string is
easy too: the encoding is cleverly defined such that you just need to count
the bytes that are not continuation bytes (i.e. whose top two bits are not
10). Since UTF-8 strings are generally shorter, it can even be faster than
counting UTF-16 if you don't know the length in advance.
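
A sketch of that counting trick using Node's Buffer (continuation bytes are
the ones of the form 10xxxxxx):

    function codePointCount(utf8) {
      var count = 0;
      for (var i = 0; i < utf8.length; i++) {
        if ((utf8[i] & 0xC0) !== 0x80) count++; // skip continuation bytes
      }
      return count;
    }
    
    codePointCount(Buffer.from('héllo 😄', 'utf8')); // 7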

------
gkoberger
Took me a bit to realize that this is talking about the Voxer iOS app
(<http://voxer.com/>), not Github (<https://github.com/blog/816-emoji>).

~~~
whit537
Yeah, I was worried there for a sec. :^)

------
hkmurakami
_> Wow, you read though all of that? You rock. I'm humbled that you gave me so
much of your attention._

That was actually really fun to read, even as a now non-technical guy. I can't
put a finger on it, but there was something about his style that gave off a
really friendly vibe even through all the technical jargon. That's a definite
skill!

~~~
jgeorge
DeSalvo's source comments have always been an entertaining read. :)

------
beaumartinez
This is dated January 2012. By the looks of things, this was fixed in March
2012 [1].

[1] <https://code.google.com/p/v8/issues/detail?id=761#c33>

~~~
Cogito
I wonder if this has been rolled into Node yet.

[edit] Node currently uses V8 version 3.11.10.25, which was released after
this fix was made, but I'm not sure if the fix was merged to trunk.

[edit2] Actually, it looks like it has, though I can't identify the merge
commit.

------
ricardobeat
Please, if you're going to post text to a Gist at least use the .md extension:

<https://gist.github.com/4151124>

~~~
ctrlaltesc
Which enables an even more readable layout with gist.io
<http://gist.io/4151124>

------
pbiggar
A couple of reasons why it makes sense for V8 and other vendors to use UCS2:

\- The spec says UCS2 or UTF16. Those are the only options.

\- UCS2 allows random access to characters, UTF-16 does not.

\- Remember how the JS engines were fighting for speed on arbitrary
benchmarks, and nobody cared about anything else for 5 years? UCS2 helps
string benchmarks be fast!

\- Changing from UCS2 to UTF-16 might "break the web", something browser
vendors hate (and so do web developers)

\- Java was UCS2. Then Java 5 changed to UTF-16. Why didn't JS change to
UTF-16? Because a Java VM only has to run one program at once! In JS, you
can't specify a version or an encoding, and one engine has to run everything
on the web. No migration path to other encodings!

~~~
cmccabe
_UCS2 allows random access to characters, UTF-16 does not._

I'm not sure if that's really true. On IBM's site, they define 3 levels of
UCS-2, only one of which excludes "combining characters" (really code points).

[http://pic.dhe.ibm.com/infocenter/aix/v6r1/index.jsp?topic=%...](http://pic.dhe.ibm.com/infocenter/aix/v6r1/index.jsp?topic=%2Fcom.ibm.aix.nls%2Fdoc%2Fnlsgdrf%2Fiso10646_ucs-2.htm)

If you have combining characters, then you can't simply take the number of
bytes and divide by 2 to get the number of letters. If you don't have
combining characters, then you have something which isn't terribly useful
except for European languages (I think?)

Maybe someone more familiar with the implementation can describe which path
they actually went down for this... given what I've heard so far, I'm not
optimistic.

~~~
pbiggar
OK, I cracked into the V8 source to take a look at what actually happens. It
looks like the implementation does use random access for two-byte strings.
However, it also uses multiple string implementations (ASCII, 2-byte
strings, "consString" (I presume some kind of Rope), "Sliced Strings" (sounds
like a rope again, but might be shared storage of the string contents for
immutable strings)), so they could likely use other implementations with
whatever properties they choose.

See
[https://github.com/v8/v8/blob/3ff861bbbb62a6c0078e042d8077b2...](https://github.com/v8/v8/blob/3ff861bbbb62a6c0078e042d8077b27f6ae2fa8f/src/objects-inl.h#L2469)
and
[https://github.com/v8/v8/blob/3ff861bbbb62a6c0078e042d8077b2...](https://github.com/v8/v8/blob/3ff861bbbb62a6c0078e042d8077b27f6ae2fa8f/src/objects-inl.h#L2555).

~~~
cmccabe
Ugh, I just realized this whole article is a farce. V8 added UTF-16 support
earlier this year. Its support for non-BMP code points is now on par with
Java's (although it has many of the same limitations).

------
languagehacker
We seem to be seeing this more and more with Node-based applications. It's a
symptom of the platform being too immature. This is why you shouldn't adopt
these sorts of stacks unless there's some feature they provide that none of
the more mature stacks support yet. And even then, you should probably ask
yourself if you really need that feature.

~~~
fusiongyro
According to Cogito, this was fixed in March:

<http://news.ycombinator.com/item?id=4834731>

I want to agree with you simply because I don't like Node, but it's hardly
fair to damn something over a bug that was fixed 9 months ago.

------
freedrull
Why on earth would the people who wrote V8 use UCS-2? What about alternative
JS runtimes?

~~~
marshray
Because Unicode was sold to the world's software developers as a fixed-width
encoding claiming 16 bits would be all we'd ever need.

~~~
dmethvin
Yes, and several C/C++ conventions and types seemed to make that a safe
choice, for example wchar_t. Let's face it, collectively we really screwed
this one up. It's the biggest mistake since Microsoft chose the backslash as a
path separator in DOS 2.0.

~~~
magic_haze
It was actually IBM's fault: they used '/' to denote CLI args in the apps they
wrote for DOS 1.0, which didn't have any concept of directories. From Larry
Osterman's blog [1]:

> Here's a little known secret about MS-DOS. The DOS developers weren't
> particularly happy about this state of affairs - heck, they all used Xenix
> machines for email and stuff, so they were familiar with the *nix command
> semantics. So they coded the OS to accept either "/" or "\" character as the
> path character (this continues today, btw - try typing "notepad c:/boot.ini"
> on an XP machine (if you're an admin)). And they went one step further. They
> added an undocumented system call to change the switch character. And
> updated the utilities to respect this flag.

[1]
[http://blogs.msdn.com/b/larryosterman/archive/2005/06/24/432...](http://blogs.msdn.com/b/larryosterman/archive/2005/06/24/432386.aspx)

~~~
dmethvin
Larry has that wrong, IBM wasn't to blame.

IBM licensed DOS from Microsoft. Microsoft bought DOS (QDOS) from Seattle
Computer Products. That software got its convention of using "/" for
command-line switches from CP/M, for compatibility reasons; originally, both
CP/M and MS-DOS were available for the IBM PC.

CP/M borrowed the convention primarily from RT-11, the OS for the PDP-11,
although it wasn't consistently followed there. Programs on RT-11 were
responsible for parsing their own command line args, and not all of them used
the same convention.

Inside Windows itself, most APIs accept either forward or backward slashes in
paths (even both in the same path) without any special incantation. The
problem is mainly at the application level, where the whole forward/backward
slash thing gets messed up: technically you should accept either one from user
input, but most app code expects one or the other.

------
eps
They control their clients, so they could've just re-encoded emoji with a
custom 16-bit escaping scheme, made the backend transparently relay it over in
escaped form, and decoded it back to 17 bits at the other end.
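
Something along these lines, perhaps (the \uE000 private-use marker and the
helper names are purely illustrative):

    // replace each surrogate pair with a marker + hex code point before it
    // hits the UCS-2-only layer...
    function escapeAstral(s) {
      return s.replace(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g, function (pair) {
        var cp = (pair.charCodeAt(0) - 0xD800) * 0x400 +
                 (pair.charCodeAt(1) - 0xDC00) + 0x10000;
        return '\uE000' + cp.toString(16) + ';';
      });
    }
    
    // ...and undo it on the client
    function unescapeAstral(s) {
      return s.replace(/\uE000([0-9a-f]+);/g, function (_, hex) {
        var cp = parseInt(hex, 16) - 0x10000;
        return String.fromCharCode(0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF));
      });
    }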

Or am I missing something obvious here?

------
kstenerud
Small nitpick, but Objective-C does not require a particular string encoding
internally. In Mac OS and iOS, NSString uses one of the cfinfo flags to
specify whether the internal representation is UTF-16 or ASCII (as a space-
saving mechanism).

------
dgreensp
The specific problems the author describes don't seem to be present today;
perhaps they were fixed. That's not to say these conversions aren't a source of
issues, just that I don't see any show-stopper problems currently in Node, V8,
or JavaScript.

In JavaScript, a string is a series of UTF-16 code units, so the smiley face
is written '\ud83d\ude04'. This string has length 2, not 1, and behaves like a
length-2 string as far as regexes, etc. are concerned, which is too bad. But even though you
don't get the character-counting APIs you might want, the JavaScript engine
knows this is a surrogate pair and represents a single code point (character).
(It just doesn't do much with this knowledge.)
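
For instance, a regex sees two separate code units:

    '\ud83d\ude04'.length;        // 2
    /^.$/.test('\ud83d\ude04');   // false -- "." matches a single code unit
    /^..$/.test('\ud83d\ude04');  // true
    // (the later /u flag changes this: /^.$/u.test('\ud83d\ude04') is true)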

You can assign '\ud83d\ude04' to document.body.innerHTML in modern Chrome,
Firefox, or Safari. In Safari you get a nice Emoji; in stock Chrome and
Firefox, you don't, but the empty space is selectable and even copy-and-
pastable as a smiley! So the character is actually there, it just doesn't
render as a smiley.

The bug that may have been present in V8 or Node is: what happens if you take
this length-2 string and write it to a UTF8 buffer, does it get translated
correctly? Today, it does.
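
A quick check in current Node (Buffer.from is the newer API; older versions
spelled it new Buffer):

    var buf = Buffer.from('\ud83d\ude04', 'utf8');
    buf;                          // <Buffer f0 9f 98 84> -- the correct 4-byte sequence
    buf.toString('utf8').length;  // 2 -- round-trips back to the same surrogate pair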

What if you put the smiley directly into a string literal in JS source code,
not \u-escaped? Does that work? Yes, in Chrome, Firefox, and Safari.

~~~
jruderman
The invisible smiley was a font system problem, fixed in Firefox 19 Aurora
(assuming you're on Mac).

<https://bugzilla.mozilla.org/show_bug.cgi?id=715798>

------
dale-cooper
The UCS-2 heritage is kind of annoying. In Java, for example, chars (the
primitive type, which the Character class just wraps) are 16 bits. So one
instance of a Character may not be a full "character" but rather part of a
surrogate pair. This creates a small gotcha where the length of a string might
not be the same as the number of characters it contains, and where you just
can't split/splice a Character array naively (because you might split it at a
surrogate pair).

~~~
masklinn
Which, at the end of the day, doesn't really matter since a code point is not
a "character" in the sense of "the smallest unit of writing" (as interpreted
by an end-user): many "characters" may (depending on the normalization form)
or will (jamo) span multiple codepoints. Splitting on a character array is
always broken, regardless of surrogate pairs.

~~~
dale-cooper
Yes. What I'm saying is that it would feel less error-prone if the character
object was actually a codepoint. It's a leaky abstraction, you shouldn't need
to handle something that is tied to the internal representation of strings in
the jvm. Can one "character" span multiple codepoints? Do you have an example
of this?

~~~
masklinn
> It's a leaky abstraction, you shouldn't need to handle something that is
> tied to the internal representation of strings in the jvm.

And I'm saying it doesn't really matter, because unicode codepoints are
already a form of "leaky abstraction" which you'll have to handle (in that a
read/written "character" does not correspond 1:1 to a codepoint anyway).
Unicode is a tentative standardization of historical human production, and if
you expect _that_ to end up clean and simple you're going to have a hard time.

> Can one "character" span multiple codepoints?

Yes.

> Do you have an example of this?

Devanagari (the script used for e.g. Sanskrit) is full of them. For instance,
"sanskrit" is written "संस्कृतम्" [sə̃skɹ̩t̪əm]. If you try to select
"characters" in your browser you might get 4 (सं, स्कृ, त and म्) or 5 (सं,
स्, कृ, त and म्) or maybe yet another different count, but this is a sequence
of _9_ codepoints (regardless of the normalization, it's the same in all of
NFC, NFD, NFKC and NFKD as far as I can tell):

    
    
        स: DEVANAGARI LETTER SA
        ं: DEVANAGARI SIGN ANUSVARA
        स: DEVANAGARI LETTER SA
        ्: DEVANAGARI SIGN VIRAMA
        क: DEVANAGARI LETTER KA
        ृ: DEVANAGARI VOWEL SIGN VOCALIC R
        त: DEVANAGARI LETTER TA
        म: DEVANAGARI LETTER MA
        ्: DEVANAGARI SIGN VIRAMA
    

Note: I'm not a Sanskrit speaker and I don't actually know Devanagari (beyond
knowing that it's troublesome for computers, as are jamo), so I can't even tell
you how many "symbols" a native reader would see there.

~~~
dale-cooper
That's quite interesting, I had no idea! What I was hoping for was some kind
of term for one character or symbol, to use as a unit, but perhaps it's
impossible to create an abstraction like that.

I'm curious if a Sanskrit speaker would see each of the codepoints as a symbol
or not.

Edit: thinking about it, I guess if you asked a Sanskrit speaker how long a
word/sentence was, you'd get the answer...

~~~
masklinn
> What I was hoping for was some kind of term for one character or symbol, to
> use as a unit

There is one, kind-of: "grapheme cluster"[0]. This is the "unit" used by UAX29
to define text segmentation, and aliases to "user-perceived character"[1].

Most languages/APIs don't really consider them (although they crop up often in
e.g. browser bug trackers), let alone provide first-class access to them. One
of the very few APIs which actually acknowledges them is Cocoa's NSString
(Apple provides a document explaining grapheme clusters and how they relate to
NSString[2]), which has very good Unicode support (probably the best I know
of, though Factor _may_ have an even better one[3]). It handles grapheme
clusters by providing messages which work on codepoint ranges in an NSString;
it doesn't treat clusters as first-class objects.
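
For what it's worth, some newer environments do expose this unit directly; a
sketch with Intl.Segmenter (available in recent V8):

    var seg = new Intl.Segmenter('en', { granularity: 'grapheme' });
    Array.from(seg.segment('e\u0301x'), function (s) { return s.segment; });
    // [ 'é', 'x' ] -- the combining mark stays attached to its base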

> I guess if you asked a Sanskrit speaker how long a word/sentence was, you'd
> get the answer...

Indeed.

[0] <http://www.unicode.org/glossary/#grapheme_cluster>

[1]
[http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Bounda...](http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries)

[2]
[https://developer.apple.com/library/mac/#documentation/Cocoa...](https://developer.apple.com/library/mac/#documentation/Cocoa/Conceptual/Strings/Articles/stringsClusters.html)

[3] the original implementor detailed his whole route through creating
Factor's Unicode library, and I learned a lot from it:
<http://useless-factor.blogspot.be/search/label/unicode>

~~~
dale-cooper
Very interesting, going to read through that guys blog. Thanks for the links!

------
eloisant
Maybe nitpicking, but I don't think Softbank came up with the Emoji. Emoji
existed way before Softbank bought the Japanese Vodafone, and even before
Vodafone bought J-Phone.

So emoji were probably invented by J-Phone, while Softbank was mostly taking
care of Yahoo Japan.

------
adrianpike
Here's the thread in the v8 bug tracker about this issue:
<http://code.google.com/p/v8/issues/detail?id=761>

Is there a reason that the workaround in comment 8 won't address some of these
issues?

~~~
dgl
I don't think it's needed anymore.

If you read closely you'll see the original linked message is from January and
there's an update on that issue from March when a fix was made in V8.

------
clebio
Somewhat meta, but this would be one where showing subdomain on HN submissions
would be nice. The title is vague enough that I assumed it was something to do
with _Github_ not processing Emoji (which would be sort of a strange state of
affairs...).

~~~
ladon86
Not that strange, Github implements much of the Emoji set using different
shortcuts, see the reference here:

<http://www.emoji-cheat-sheet.com/>

Before I read the article I guessed that maybe the icon set had some licensing
issues for Github. Luckily, not so! (:smiley:)

~~~
clebio
That was basically my point. It would be strange if they _stopped_ processing
it.

------
pla3rhat3r
I love this article. So often it has been difficult to explain to people why
one set of characters can work while others will not. This lays out some great
historical info that will be helpful going forward.

------
cjensen
UCS-2 is only used by programs which jumped the gun and implemented Unicode
before it was all done. (It was 16 bits for a while, with Asian languages
sharing code points so that the font in use determined whether the text was
displayed as Chinese vs. Japanese, etc.) What century was V8 written in that
they thought UCS-2 was an acceptable thing to implement?

Good rule of thumb for implementers: get over it and use 32 bits internally.
Always use UTF-8 when encoding into a byte stream. Add UTF-16 encoding if you
must interface with archaic libraries.
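
The encoding side of that rule is tiny anyway; a sketch (no validation of
surrogates or out-of-range values) of turning a single code point, held as a
plain integer, into its UTF-8 bytes:

    function codePointToUtf8(cp) {
      if (cp < 0x80)    return [cp];
      if (cp < 0x800)   return [0xC0 | cp >> 6, 0x80 | cp & 0x3F];
      if (cp < 0x10000) return [0xE0 | cp >> 12, 0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F];
      return [0xF0 | cp >> 18, 0x80 | cp >> 12 & 0x3F,
              0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F];
    }
    
    codePointToUtf8(0x1F604).map(function (b) { return b.toString(16); });
    // [ 'f0', '9f', '98', '84' ]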

~~~
masklinn
> UCS-2 is only used by programs which jumped the gun and implemented Unicode
> before it was all done.

There's no such thing as "all done": Unicode 1.0 was 16-bit, and Unicode 6 was
released recently.

------
evincarofautumn
Failures in Unicode support seem usually to result from the standard’s
persistently shortsighted design—well intentioned and carefully considered
though it undoubtedly is. It’s a “good enough” solution to a very difficult
problem, but I wonder if we won’t see Unicode supplanted in the next decade.

All that aside: emoji should not be in Unicode. Full stop.

------
FredericJ
How about this npm module: <https://npmjs.org/package/emoji>?

------
xn
Here's the message decoded from quoted-printable:
<https://gist.github.com/4151707#file_emoji_sad_decoded.txt>

------
mranney
Note that this message is almost a year old now. The issue has been addressed
by the node and V8 teams.

------
shocks
Very informative, great read. Thanks!

------
alexbosworth
Fixed a good while ago for node.js

------
masklinn
Wow, the first half of the text is basically full of crap and claims which
don't even remotely match reality, and now I'm reaching the technical section
which can only get even more wrong.

~~~
masklinn
To whoever the downvoter was: no, seriously. For instance in the first few
paragraphs:

* emoji were invented by NTT DoCoMo, not Softbank

* even if that had been right, Softbank's copyrighting of their emoji _representations_ has no bearing on NTT and KDDI/au using completely different implementations (and I do mean completely; KDDI/au essentially use <img> tags)

* lack of cooperation is endemic to japanese markets (especially telecoms) and has nothing to do with "ganging up"

* if NTT and au/KDDI wanted to gang up on Softbank you'd think they'd share the same emoji

* you didn't _have_ to run "adware apps" to unlock the emoji keyboard (there were numerous ways to do so: dedicated apps that were usually quickly nuked, app "easter eggs", jailbreaking, or editing and restoring a phone backup)

That's barely the first third.

------
sneak
TLDR: node sucks

~~~
tptacek
It's v8's fault, and v8 does not suck.

~~~
prodigal_erik
Unicode 2.0 added surrogate pairs in 1996. Unfortunately, the first versions
of both Java and JavaScript predated this and got strings horribly wrong, and
now any conforming implementation of either is required to suck. The Right
Thing would be for almost everyone to work with only combining character
sequences, except for a rare few who need to know how to dissect one into its
codepoints and reassemble them correctly (just as people don't normally need
to extract high or low bits from an ASCII character).

~~~
jrabone
No. Combining characters and NF(K)C/D normalisation rules are a different
problem entirely - consider the "heavy metal umlaut" (i.e. Spın̈al Tap), where
no conversion to a single precomposed codepoint is possible - only "n"
followed by U+0308.

~~~
prodigal_erik
They're facets of the same problem. I shouldn't routinely be dealing with
either surrogates or combining marks; unless I have a specific reason, it's
only an opportunity to make a mistake that hardly anyone knows how to
troubleshoot. "n̈" should be an indivisible string of length one until I need
to ask how it would actually be encoded in UTF-16 or whatever.

~~~
jrabone
But that's the point - there is no such character. Given that the Unicode
consortium has added codepoints for every other bloody thing under the sun,
I'm amazed that there isn't one for n-diaeresis, but there you are.
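
You can see the gap in any engine that has String.prototype.normalize:

    'n\u0308'.normalize('NFC').length;  // 2 -- no precomposed n-diaeresis exists
    'e\u0301'.normalize('NFC').length;  // 1 -- é does have one (U+00E9)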

Add a small number of people who for artistic reasons decide that they want to
make life hard (Rinôçérôse I'm looking at you) and you just have to accept
that the length of your string might not equal the number of codepoints
contained therein...

------
csense
A two-character sequence for a smiley face that should be compatible with
everything in existence:

    
    
      :)
    

Problem solved. Why is this front page material (#6 as of this writing)?

