Why we can't process Emoji anymore (github.com)
258 points by tpinto 1608 days ago | 152 comments



This is why UTF-8 is great. If it works for any Unicode character it will work for them all. Surrogate pairs are rare enough that they are poorly tested. With UTF-8, if there are issues with multi-byte characters, they are obvious enough to get fixed.

UTF-16 is not a very good encoding. It only exists for legacy reasons. It has the same major drawback as UTF-8 (variable-length encoding) but none of the benefits (ASCII compatibility, size efficiency).
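
For the skeptical, a quick Python 3 sketch of those two benefits (the test string is just an arbitrary example):

    text = "Hello, world"
    print(text.encode("utf-8"))           # b'Hello, world' -- byte-for-byte the same as the ASCII
    print(len(text.encode("utf-8")))      # 12
    print(len(text.encode("utf-16-le")))  # 24 -- every ASCII character costs two bytes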


This comment is somewhat misleading. The issue at hand is orthogonal to any of the benefits of UTF-8 over UTF-16 (which are real, UTF-8 is great, you should use it.)

4-byte characters in UTF-8 are just as rare as surrogate pairs in UTF-16, because both are used to represent non-BMP characters. As a result, there is software that handles 3-byte characters (i.e., a huge percentage of what you'll ever see), but doesn't handle 4-byte characters.
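
To make the symmetry concrete, a small Python 3 sketch (the emoji here is just one example of a non-BMP code point):

    ch = "\U0001F604"                    # 😄 -- an example code point outside the BMP
    print(ch.encode("utf-8").hex())      # f09f9884 -- a 4-byte UTF-8 sequence
    print(ch.encode("utf-16-be").hex())  # d83dde04 -- a UTF-16 surrogate pair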

MySQL is a high-profile example of software which, until recently, had this problem: http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8m....


The problem is that handling 1 unit is very different from 2+ units, in terms of coding patterns, whereas 3 is not so different from 4+. In the latter case there's probably already a loop to handle multiple-unit characters, which will in many cases work without change for longer sequences (and if not, the code probably requires very little change to do so).

So whereas it's rather common for programs to mis-handle multiple-unit UTF-16 characters, it seems much less likely that programs will mis-handle 4+ unit UTF-8 characters.


The problem with UTF-8 is that lots of tools have 3 byte limits, and characters like Emoji take up 4 bytes in UTF-8.


Honest question: but isn't that just a broken implementation (and a very obvious brokenness at that)? It seems to me there's a big difference between someone not coding to the standard, and the standard making your task impossible.


The same could be said of UTF-16 implementations that don't support surrogate pairs.


But that's not true for UCS-2, which can't represent certain code points.

On the other hand, UTF-8 is ASCII compatible and more efficient for text that's primarily ASCII.


If most implementations are broken, it becomes the standard.


How many tools have 3-byte limits on UTF-8? The only one I can think of right now is MySQL. (The workaround is to specify the utf8mb4 character set. This is MySQL's cryptic internal name for "actually doing UTF-8 correctly.")


MySQL is one of the worst offenders for broken Unicode and collation problems arising therein. Neither it nor JavaScript deserve consideration for problems that need robust Unicode handling.


I actually switched my (low traffic, low performance needs) blog comments database from MySQL to SQLite purely because I could not make MySQL and Unicode get along. All I needed was for it to accept and then regurgitate UTF-8 and it couldn't even handle that. I'm sure it can be done, but none of the incantations I tried made it work, and it was ultimately easier for me to switch databases.


As an ugly last resort, you could store Unicode as UTF-8 in BLOB fields. MySQL is pretty good about storing binary data. (I dread the day that I'll have to do something more advanced with Unicode in MySQL than just storing it.)


I no longer recall whether I tried that and failed, or didn't get that far. Seems like a semi-reasonable approach if you don't need the database to be able to understand the contents of that column. But on the other hand, SQLite is working great for my needs too.


I don't think it's specifically a 3-byte limit, I think it's just that lots of tools decode UTF-8 into UCS-2 internally instead of UTF-16.


That is quite clearly broken, and any tool that does so should be fixed or dumped. This is not new, and Markus Kuhn has made UTF-8 test resources available for years at http://www.cl.cam.ac.uk/~mgk25/unicode.html


I've found this sub-page super useful over the years for testing http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt


I believe that MySQL is one such tool. In recent versions you can work around it by asking for the encoding "utf8mb4" instead of "utf8", but I think they have to be quite recent.

So yes, another way in which MySQL is quite clearly broken.


Which tools?

Honest question, as the three byte limit seems rather arbitrary and no more logical than, say, a four byte one.


It is totally arbitrary - there's no reason you can't have degenerate 6-byte encodings, and compliant decoders should cope with them. See Markus Kuhn's excellent UTF-8 decoder torture test page linked elsewhere in this thread.


Actually, a truly compliant UTF-8 decoder should reject all degenerate forms, because they can be used to bypass validation checks, and the spec does not allow such forms. A four-byte UTF-8 sequence is sufficient to represent any Unicode code point (and then some), so no compliant decoder should accept any sequences of 5 or 6 bytes.
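
As a quick sanity check, a conforming decoder such as Python 3's rejects the classic overlong form (a minimal sketch):

    overlong_slash = b"\xc0\xaf"   # a 2-byte "overlong" encoding of U+002F '/'
    try:
        overlong_slash.decode("utf-8")
    except UnicodeDecodeError as err:
        print(err)                 # rejected: 0xc0 is never a valid UTF-8 start byte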

Note that Kuhn's torture test page deliberately includes a lot of invalid sequences in order to make sure that they're gracefully handled. Section 4 is dedicated to degenerate encodings like this.


It's not 3 bytes so much as 16 bits, aka Unicode 1.0 limits. Which turn into 3 bytes in UTF-8.


File bugs with those tools. These sorts of issues should have been sorted out years ago, and any program that can't handle 3+ byte character encodings should be named and shamed.


And the problem with UTF-16 is that a lot of applications can't handle surrogate pairs, except a lot of Emoji are above the BMP, aren't they? So why is this a bigger deal for UTF-8 than UTF-16?


> except a lot of Emoji are above the BMP, aren't they?

All of the Unicode 6.0 emoji are.


UCS2 is better than UTF8 internally because it counts unicode characters faster than UTF8. Every character is just 2 bytes, instead of 1, 2, 3 or even 4 bytes.

In python:

    len(u'汉字') == 2
    len( '汉字') == 4 # or maybe 6, it varies based on console encoding and CPython options
    len(u'汉字'.encode('utf8')) == 6


I don't really think you can argue that UCS2 is better than UTF-8, because UCS2 is simply broken. It's not a reliable way to encode unicode characters. It's like arguing that my Ferrari which is currently on fire is a better car than your Lamborghini (which is not.) I mean, they do each have their merits, but a flaming car is not useful to anyone.

I think this is probably semantics, but I just wanted to point that out in case anyone is confused, which would be understandable because this shit is whack.


Why do you want to count Unicode characters? Why do you care if it is fast to do so? Why would you ever need to use character-based string indexing?

UTF-16 solves problems that don't exist.

(Honestly, I would love it if someone could explain what the purpose of counting characters is, because I don't know why you'd ever do that, except when you're posting to Twitter.)


Be careful not to confuse UTF-16 and UCS-2. UTF-16 is a variable-width encoding, so using it doesn't actually make counting anything easier. UCS-2 is fixed-width, and evil. With UCS-2 you can easily count the number of code points, as long as they fall in the BMP (!), but this is not the same as counting graphemes.


It's not a common use case, but I've had to do it. Luckily, it's fairly easy:

    int count_multibytes_encountered(const char *text, unsigned long len) {
        int count = 0;
        for (unsigned long i = 0; i < len; i++) {
            unsigned char b = (unsigned char) text[i];
            if ((b & 0xC0) == 0x80) {   // 10xxxxxx: a continuation byte of a multi-byte character
                count++;
            }
        }
        return count;
    }


[deleted]


You will never define an unambiguous way to index into a Unicode string by "characters" with less than 10 pages of specification. When two sides of the API interpret this contract differently, a security vulnerability is a very possible outcome.

Use unambiguous delimiters, JSON, XML, or even byte offsets into normalized UTF-8 instead. But don't do this please.


If you make an API, please don't do this. If you have to include indexes into a Unicode string, then make them indexes into a binary string in a known encoding. This works equally well for all encodings, and won't tempt people to do something as preposterous as using UCS-2.


Such an API could easily indicate bytes in a UTF-8 string instead.


> Why do you want to count Unicode characters?

Text editing and rendering. Some parts of the system cannot simply treat Unicode text as an opaque binary hunk of information.

> Why do you care if it is fast to do so?

Efficient full text search that can ignore decorative combining characters.


> Text editing and rendering

Unless you're working entirely with fixed-width glyphs (and you probably aren't, given that even fixed-width systems like terminal emulators sometimes use double-wide glyphs), you need to know the value of each character to know its width. That involves the same linear scan over the string that is required to calculate the number of glyphs in a variable-width encoding.


If you implement naïve Aho-Corasick text search over one-byte characters, it works without modification on UTF-8 text. It does not ignore combining characters, but UCS-2 also features combining characters (c.f. other comments in this same thread), so no matter what encoding you use, you must first normalize the Unicode text and the search string before you compare for equivalence (or compatibility, which is a looser notion than equality for Unicode code point sequences.)
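
Not Aho-Corasick itself, but a minimal Python 3 sketch of why plain byte-level search already works on UTF-8:

    haystack = "안녕하세요".encode("utf-8")
    needle = "녕하".encode("utf-8")
    print(haystack.find(needle))   # 3 -- a plain byte offset; UTF-8's design rules out false
                                   # matches that start in the middle of a character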


> Text editing and rendering. Some parts of the system cannot simply treat Unicode text as an opaque binary hunk of information.

Except these parts of the system have to work on unicode glyphs (rendered characters) which will span multiple codepoints anyway, so counting codepoints remains pointless. The only use it has is knowing how many code points you have. Yay.


How does fast character counting help with full-text search?


The best search algorithms can skip ahead upon a mismatch. A variable-length encoding requires branch instructions in the inner loop, leading to pipeline flushes and potentially dramatic slowdowns.


This is incorrect. Searching text with a variable-length encoding does not require extra branch instructions. If you're searching through UTF-8 text, you can just pretend it's a bunch of bytes and search through that.

This isn't counting problems with normalization, of course. You will have to put your needle and haystack both into the same normalization form before searching. But you had to do that anyway.
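
A minimal Python 3 sketch of that normalization step, using the standard library's unicodedata (the example strings are mine):

    import unicodedata

    def nfc(s):
        return unicodedata.normalize("NFC", s)

    haystack = "an\u0303o"    # "n" + COMBINING TILDE: decomposed "año"
    needle = "a\u00f1o"       # precomposed "año"
    print(needle in haystack)             # False -- different code point sequences
    print(nfc(needle) in nfc(haystack))   # True  -- identical after NFC normalization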


> Why do you want to count Unicode characters?

Because there are countries that use languages other than English?

I fucking hate you ASCII-centric ignorant morons sometimes. You know, for example:

- display a welcome message character by character from left to right

- Extract the first character because it's always the surname

- catch two non-ASCII keywords and find their indexes in a string

In the first example, should I just output it byte by byte, which displays as garbage until suddenly three bytes become a recognizable character?


Could you not have communicated your examples without the hostility?


I think you misunderstand. I wasn't asking why Unicode characters should be counted instead of bytes or ASCII characters, I was asking why you would even want to count characters at all.

> I fucking hate you ascii-centric ignorant morons

Nice.

> You ignorant, arrogant fuck.

This is why I quit posting under an alias, so I wouldn't be tempted to say such things.

> display welcome message character by character fro left to right

UTF-16/UCS-4/UCS-2 doesn't solve anything here. Counting characters doesn't help. For example, imagine if you try to print Korean character-by-character. You might get some garbage like this:

    ᄋ
    아
    안
    안ᄂ
    안녀
    안녕
    안녕ᄒ
    안녕하
    안녕하ᄉ
    안녕하세
    안녕하세ᄋ
    안녕하세요
Fixed width encodings do not solve this problem, and UTF-8 does not make this problem more difficult. I am honestly curious why you would need to count characters -- at all -- except for posting to Twitter.

Splitting on characters is garbage. (This example was done in Python 3, so everything is properly encoded, and there is no need to use the 'u' prefix. The 'u' prefix is a nop in Python 3. It is only there for Python 2.x compatibility.)

    >>> x
    '안녕하세요'
    >>> x[2:4]
    'ᆫᄂ'
I tried in the Google Chrome console, too:

    > '안녕하세요'.substr(2,2)
    "하세"
    > '안녕하세요'.substr(2,2)
    "ᆫᄂ"
I'm not even leaving the BMP and it's broken! You seem to be blaming encoding issues but I don't have any issues with encoding. It doesn't matter if Chrome uses UCS-2 or Python uses UCS-4 or UCS-2, what's happening here is entirely expected, and it has everything to do with Jamo and nothing to do with encodings.

    >>> a = '안녕하세요'
    >>> b = '안녕하세요'
    # They only look the same
    >>> len(a)
    5
    >>> len(b)
    12
    >>> def p(x):
    ...     return ' '.join('U+{:04X}'.format(ord(c)) for c in x)
    ...
    >>> print(p(a))
    U+C548 U+B155 U+D558 U+C138 U+C694
    >>> print(p(b))
    U+110B U+1161 U+11AB U+1102 U+1167 U+11BC U+1112 U+1161 U+1109 U+1166 U+110B U+116D
See? Expected, broken behavior you get when splitting on character boundaries.

If you think you can split on character boundaries, you are living in an ASCII world. Unicode does not work that way. Don't think that normalization will solve anything either. (Okay, normalization solves some problems. But it is not a panacea. Some languages have grapheme clusters that cannot be precomposed.)

Fixed-width may be faster for splitting on character boundaries, but splitting on character boundaries only works in the ASCII world.


> Counting characters doesn't help.

Why? If you can count characters (code points) then it's natural that you can split or substring by characters.

Try this in javascript:

    '안녕하세요'.substr(2,2)
Internally, fixed-length encoding is much faster than variable-length encoding.

> Unicode does not work that way.

It DOES.

> Splitting on characters is garbage.

You messed up Unicode in Python on so many levels. Those characters you see in the Python console are actually not Unicode. They are just bytes on sys stdout that happen to be correctly decoded and properly displayed. You should always use the u'' prefix for any kind of characters. '안녕하세요' is WRONG and may lead to unspecified behavior: it depends on your source file encoding, interpreter encoding and sys default encoding; if you display it in a console it depends on the console encoding, and if it's a GUI or HTML widget it depends on the GUI widget or Content-Type encoding.

> I'm not even leaving the BMP and it's broken!

Your unicode-fu is broken. It looks like your example provided identical Korean strings, which the ICU module in Chrome might have auto-normalized for you.

> You can't split decomposed Korean on character boundaries.

In a broken Unicode implementation, like the V8 JS engine in the Chrome browser.

> I happen to be using Python 3. It is internally using UCS-4.

For the love of BDFL read this

http://www.python.org/dev/peps/pep-0414/

http://docs.python.org/3/whatsnew/3.3.html


I'm sorry but you're wrong. I suggest you inform yourself better of the subject you're talking about before you call people "ignorant morons" next time.

dietrichepp is talking about Normalized Form D, which is a valid form of Unicode and cannot be counted using codepoints like you're doing.

Maybe you can try:

'𠀋'.substr(0,1)


yeah sure why not.

    >>> u'𡘓'[0:1]
    u'\U00021613'

    >>> u'Hi, Mr𡘓'[-1]
    u'\U00021613'

    >>> u'𠀋'[0:1]
    u'\U0002000b'

JavaScript won't work because of UCS-2 in the JS engine, duh.

Actually, JavaScript's handling of Unicode strings and binary strings is messed up; that's why Node.js invented Buffer:

http://nodejs.org/api/buffer.html


You've moved the goalposts:

  u'\U00021613'
This is a UTF-32 code unit, not a UTF-16 code unit. Even UTF-32 doesn't help when you have combining characters. I suggest you read dietrichepp's post again, he's talking about Normalization Form D.


Okay, if it's an explicit combining character, what's wrong with explicitly counting the parts separately?

You know normalized form is the norm, right?


There are four different normalization forms in Unicode. Maybe you should enlighten us about which one you're talking about.

Or just stop embarrassing yourself.


Reading all of your comments: so you are suggesting a Unicode object should not have len() or substring()?

A standard like that is totally not embarrassing.


I am suggesting that people read about unicode before designing supposedly cross-platform applications or programming languages. It's not that hard, just different than ASCII.


Since you understand Unicode so well, can you explain dietrichepp's theory that Unicode doesn't need counting or offsets?

http://news.ycombinator.com/item?id=4834931

And why is UCS4 (not variable-length) chosen in many Unicode implementations? Why is wchar_t always 32-bit in POSIX?


> Since you understand Unicode so well, can you explain dietrichepp's theory that Unicode doesn't need counting or offsets?

Unicode doesn't have "characters." If you talk about characters, all you've succeeded in doing is confusing yourself. Leave characters back in ASCII-land where they belong.

Counting code points is stupid. If you like counting code points, go sit in the corner. You don't understand unicode.

You can count graphemes, but it's not going to be easy. And most of the time, I don't see why you would need to do that.

> And why is UCS4 (not variable-length) chosen in many Unicode implementations? Why is wchar_t always 32-bit in POSIX?

wchar_t is a horrible abomination that begs for death. Nobody should use it. Use UTF-8 instead. I think Python used to use UCS4, but they don't anymore. It's a horrible representation because all your strings bloat up by 4x.


Code points aren't letters.

Consider the following sequence of code points: U+0041 U+0308 [edit: corrected sequence]

That equals this european letter: Ä

Two code points, one letter. MAGIC! You can also get the same-looking letter with a single code point using U+00C4 (unicode likes redundancy).

Not all languages have letters. Not all languages that have letters represent each one with a single code point. Please think twice before calling people "morons."


> Two code points, one letter.

Yes, I understand there are a million ways to display the same shape using various Unicode sequences. But how does that make code point counting impossible?

AND if you explicitly use COMBINING DIAERESIS instead of the single U+00C4, counting the diaeresis separately is wrong somehow?

Why don't we make a law stating that both ae and æ are a single letter?


I am responding to your earlier post which announced that UCS2 is better than UTF8 internally because it counts unicode characters faster than UTF8. Hopefully now you understand that just taking the number of UCS2 bytes and dividing by 2 does not give you the number of letters.

Just in case you don't, let's walk through it again.

UCS-2 big-endian representation of Ä:

0x00 0x41 0x03 0x08

Another UCS-2 big-endian representation of Ä:

0x00 0xc4

If you look at the number of bytes, the first example has 4. It represents one letter. The second example has 2. It also represents one letter. Conclusion: UCS2 does not "count unicode characters faster than UTF8." You still have to look at every byte to see how many letters you have, same as in UTF-8.

Do you grasp this? If not, maybe you are one of those "ascii-centric ignorant morons" I keep hearing so much about.


Name one Unicode implementation which shows utf16 `0x00 0x41 0x03 0x08` as length 1.

U+0041 U+0308 is two code points by definition. Thus length == 2.



Yes? `System.Globalization` or `ICU` can count graphemes, what's your point?

Those libraries are equivalent to: normalize(UTF-16 `0x00 0x41 0x03 0x08`) == length 1.

Back to my top comment: I stated that UCS2 counts faster than UTF-8 internally, because every BMP code point is just two bytes. What's wrong here? If variable-length is so good, why is py3k using UCS-4 internally? (Which means every character is exactly 32 bits. There, I said character again.)


> Back to my top comment, I stated that UCS2 counts faster than UTf8 internally

The part cmccabe tries to explain, and which you repeatedly fail to understand, is that UCS2 counts unicode code points faster than UTF-8, which is completely useless because "characters" (what the end-user sees as a single sub-unit of text) often span multiple codepoints, so counting codepoints is essentially a recipe for broken code and nothing else.

> If variable-length is so good why py3k is using UCS-4 internally?

It's not. Before 3.3 it used either UCS2 or UCS4 internally, as did Python 2; since Python 3.3 it switches the internal representation on the fly.

> Wich means every character is exactly 32 bits. There, I said character again.

yeah and you're wrong again.


> yeah and you're wrong again.

http://en.wikipedia.org/wiki/UTF-32

UTF-32 (or UCS-4) is a protocol to encode Unicode characters that uses exactly 32 bits per Unicode code point.

http://www.unicode.org/faq/utf_bom.html#utf32-1


> See? Expected, broken behavior you get when splitting on character boundaries.

Yeah, like your Jamo trick is complex for a native CJK speaker.

Think Jamo is hard? Check out Ideographic Description Sequences. We have like millions of 偏旁部首笔画 (radicals, components and strokes) that you can freestyle-combine.

And the fun part is the relative length of glyphs: 土 and 士 are different only because one line is longer than the other. How would you distinguish that?

But you know what your problem is?

It's like arguing with you that you think ส็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็ is only one character.

IMPOSSIBU?!!!???

And because U+202E exists on the lolternet, we deprive you of the ability to count 99% of normal CJK characters???!??!111!

Combining characters are normalized to a single character in most cases, and should be countable and indexable separately.

If you type combining characters EXPLICITLY, each one will naturally be counted separately; what's wrong with that?

Or else why don't we abandon Unicode and let every country deal with its own weird glyph composition shit?


By being unnecessarily insulting you're degrading your own position, and your argument is collapsing under the weight of your anger and sarcasm. I see Dietrich and Colin working with definitions of code point, character and glyph that illuminate why counting one way will lead to problems when you slip into thinking you're counting the other way. Then in your much-too-fired-up responses you conflate them again, and muddy us back to square one.

It seems to me you're deriding us for being native speakers of languages with alphabets, and also deriding us for wanting APIs that prevent developers from alphabet-language backgrounds from making the mistakes our assumptions would incline us towards. You're going to have to decide if you're angry because you like the "simplicity" of UTF-16, because we don't speak a CJK language as well as you do (maybe Dietrich or Colin does; I have no idea) or because you're just angry and this is where you've come to blow off steam. If it's the third, I hope you'll try Reddit first next time, since this kind of behavior seems to be a lot more acceptable there than here.


For fuck's sake I am not defending UTF16's simplicity, I am defending that:

fixed width can count code points (I worded it as "character") faster than variable-length

Then dietrichepp tries to educate me that two combined code points should be treated equally with another single code point. WTF, y u no normalization?

Downvote me as you like, but you can't change the fact that UCS4 is used internally in Unicode systems.

Any reason other than for faster code point counting?

-----------------

dietrichepp also offended me by claiming that Unicode characters should not be counted or offset. To quote:

> Why do you want to count Unicode characters? Why do you care if it is fast to do so? Why would you ever need to use character-based string indexing?


What in the world is your goal in this conversation? In your impotent rage you've only established that it's useless to count code points, completely counter to your original point in favor of UCS-2.


Counting code points in UTF-8 can actually be faster than in UCS2 in cases where you do not know the length in advance, such as null-terminated strings. The UTF-8 encoding is cleverly defined such that to find the number of code points you just need to count the bytes that are not continuation bytes (i.e., not of the form 10xxxxxx). Since UTF-8 strings are generally shorter than UCS2 it can be faster in some cases. Either way this is not a serious enough concern to use one encoding over the other.
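
A minimal Python sketch of that counting rule (the test string is arbitrary):

    def utf8_code_point_count(data):
        # a byte starts a new code point unless it is a continuation byte (10xxxxxx)
        return sum(1 for b in data if (b & 0xC0) != 0x80)

    s = "汉字 and \U0001F604"
    assert utf8_code_point_count(s.encode("utf-8")) == len(s)  # len() counts code points in Python 3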


Issues like this are why I hate internationalization.

If it was simple as making everything Unicode and it Just Working, it would be possible. But the number of difficulties and problems I've seen have made me decide -- and tell everyone I know -- to avoid dealing with internationalization if you value your sanity.

Issues discussed here:

* Different incompatible variable-length encodings

* Broken implementations

* Character count != length

Issues discussed elsewhere:

* It's the major showstopper keeping people away from Python 3

* Right-to-left vs. left-to-right [1]

* BOM at the beginning of the stream

Conceptual issues -- questions I honestly don't know the answer to when it comes to internationalization. I don't even know where to look to find answers to these:

* If I split() a string, does each piece get its own BOM?

* If I copy-paste text characters from A into B, what encoding does B save as? If B isn't a text editor, what happens?

* If a chr(0x20) is part of a multi-byte escape sequence, does it count as a space when I use .split()?

* When it encounters right-to-left, does the renderer have to scan the entire string to figure out how far to the right it goes? Wouldn't this mean someone could create a malicious length-n string that took O(n^2) time to process?

* What happens if I try to print() a very long line -- more than a screenful -- with a right-to-left escape in a terminal?

* If I have a custom stream object, and I write characters to it, how does it "know" when to write the BOM?

* Do operators like [] operate on characters, bytes, 16-bit words, or something else?

* Does getting the length of a string really require a custom for loop with a complicated bit-twiddling expression?

* Is it possible for a zero byte to be part of a multibyte sequence representing a character? How does this work with C API's that expect zero-terminated strings?

* If I split() a string to extract words, how do the substrings know the BOM, right-to-left, and other states that apply to themselves? What if those strings are concatenated with other strings that have different values for those states?

* What exactly does "generating locales" do on Debian/Ubuntu and why aren't those files shipped with all the other binary parts of the distribution? All I know about locale generation is that it's some magic incantation you need to speak to keep apt-get from screaming bloody murder every time you run it on a newly debootstrapped chroot.

* Is there a MIME type for each historical, current, and future encoding? How do Web things know which encoding a document's in?

* How do other tools know what encoding a document uses? Is this something the user has to manually tell the tool -- should I be saying nano thing.txt --encoding=utf8? If the information about the format isn't stored anywhere, do you just guess until you get something that seems not to cause problems?

* If you're using UTF-16, what endianness is used? Is it the same as the machine endianness, or fixed? What operations cause endian conversion?

* Should my C programs handle the possibility that sizeof(char) != 1? Or at least check for this case and spit out a warning or error?

* What automated tools exist to remove BOM's or change accented characters into regular ones, if other automated tools don't accept Unicode? Once upon a time, I could not get javac to recognize source files I'd downloaded which had the author's name, which included an 'o' with two dots over it, in comments. That was the only non-ASCII character in the files, and I ended up removing them; syncing local patches with upstream would have been a nightmare. Do people in different countries run incompatible versions of programming languages that won't accept source files that are byte-for-byte identical? It sounds ridiculous, but this experience suggests it may be the case.

[1] http://xkcd.com/1137/


I'll take a few of those questions.

> If I split() a string, does each piece get its own BOM?

Conceptually, each piece is a sequence of code points. The BOM stuff only comes into play when you turn it into an external encoding. And frankly, I would much rather use UTF-8, explicitly specify that the encoding is UTF-8, and not have to worry about adding a BOM.
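
A small Python 3 illustration (the utf-8-sig codec is the stdlib's way of writing a UTF-8 BOM):

    s = "hello"
    print(s.encode("utf-8"))       # b'hello'              -- no BOM
    print(s.encode("utf-8-sig"))   # b'\xef\xbb\xbfhello'  -- BOM added only by this codec
    print(s.encode("utf-16"))      # starts with a BOM; "utf-16-le"/"utf-16-be" do not add one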

> If a chr(0x20) is part of a multi-byte escape sequence, does it count as a space when I use .split()?

In valid UTF-8, all bytes in multibyte characters will have the high bit set. A space can only be represented as the 0x20 byte, and an 0x20 byte can only be a space. If you've got malformed input, then that's a whole other can of worms.
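
Sketch of why that makes byte-level split() safe on UTF-8 (Python 3, arbitrary test string):

    data = "汉字 \U0001F604 ok".encode("utf-8")
    parts = data.split(b" ")
    print([p.decode("utf-8") for p in parts])   # ['汉字', '😄', 'ok'] -- nothing was cut in half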

> Is it possible for a zero byte to be part of a multibyte sequence representing a character? How does this work with C API's that expect zero-terminated strings?

In UTF-8, the answer is no. In other multibyte encodings (e.g. UTF-16), you should not expect to be able to treat it at all like ASCII.

> If you're using UTF-16, what endianness is used? Is it the same as the machine endianness, or fixed? What operations cause endian conversion?

When reading external text, you can detect this from the BOM -- byte order, after all, is why you have a byte order marker. When converting from your internal format to UTF-16, you pick whatever is most convenient.

> Should my C programs handle the possibility that sizeof(char) != 1? Or at least check for this case and spit out a warning or error?

I don't know any popular non-embedded platform on which sizeof(char) != 1. That said, it can't hurt to get it Right.

> What automated tools exist to remove BOM's or change accented characters into regular ones, if other automated tools don't accept Unicode?

In Python, there's a library called "unidecode" which does a pretty good job of punching Unicode text until it turns into ASCII.


> Should my C programs handle the possibility that sizeof(char) != 1? Or at least check for this case and spit out a warning or error?

sizeof(char) == 1 by definition.


I have attempted to answer your questions, but as a general comment, most of them would be answered by having a basic understanding of Unicode and popular encodings for it. You're a programmer; understanding this is part of your job. (If you aren't a programmer, then my my, you sure do ask a lot of technical questions.) You could read this: http://www.joelonsoftware.com/articles/Unicode.html. Then read a bit more. Then forget everything. Then read a bunch more and then give a small talk on Unicode. Then suddenly it's 2AM and you're answering asinine questions about Unicode on HN. What I'm saying is, be careful.

One last general comment I have is that a lot of your questions relate to things that you shouldn't necessarily need to understand the exact details of to do a lot of things. Instead, you should use an off-the-shelf, battle-tested unicode library. As far as I know, they exist for every platform by now. Of course this doesn't free you from knowing stuff, but it means that instead of knowing exactly the range of the surrogate pairs, all you need is a mental model of what's going on. When you're surprised, you can begin to fill in the gaps.

1. Use UTF-8 if you're using byte strings, or your platforms unicode string type. If the latter, it will have its own internal encoding, and you'll work at the character level. In either case, as soon as a string comes in from elsewhere (network, disk, OS), sanitize it to a known state, i.e., a UTF-8-encoded byte string, or your platform's unicode string type. In case you can't do that, reject the string, log it, and figure out what's going on.

2. Use a non-broken UTF-8 implementation.

3. Yeah. Your UTF-8 implementation is handling this for you now.

4. Not familiar with this/don't use Python enough.

5. Haven't dealt with this, but it's definitely got some complications to it. I would guess more for layout than for programming, but I can't be sure.

6. The BOM tells your decoder whether the forthcoming stream is little endian or big endian. It is not actually part of the string itself. Admittedly, a lot of programs still have trouble with BOMs, which is why you're using UTF-8 (without BOMs, because you don't need one.)

7. No, when you .split, the BOM is no longer part of the string. You don't have a BOM again until you transmit the string or write it to disk, as it's not needed (your implementation uses whatever byte ordering it likes internally, or the one you specified if using a byte string.)

8. The string is probably transmitted in whatever the internal encoding of your OS is. That means UTF-16 on Windows and UTF-8 on Linux, AFAIK. If you're writing a desktop app, your paste event should treat this string as a pile of unsafe garbage until you've put it into a known state (i.e., a unicode object or byte string in a known encoding.) When you save it, it saves in whatever encoding you specify. You should specify UTF-8.

9. I'm not sure exactly what we're talking about here. The byte 0x20 is a valid UTF-8 character (the space), or part of a 2-byte or 4-byte character in UTF-16. However, as long as you're working with a unicode string type, your .split function operates on the logical character level, not the byte level. If you're using a byte string (e.g., Python's string type), then yes, the byte 0x20 is a space, because your split method assumes ASCII. If you try to append a byte string containing 0x20 to a unicode string, you should get an exception, unless you also specify an encoder which takes your byte string and turns it into a unicode string. Your unicode string implementation may have a default encoder, in which case the byte string would be interpreted as that encoding, and an exception would only be thrown if it's invalid (which means that if the default encoding was UTF-16, this would throw an exception, because a lone 0x20 byte is not valid UTF-16.) This answer is long, and HN's formatting options are lacking, so let me know, and I'll try to be clearer.

10. Again I haven't yet dealt with RTL, but the characters are in the same order internally regardless of whether they're to be displayed RTL or LTR. It's a sequence of bytes or characters, the encoder and decoder do not care what those characters actually are. So if I have "[RTL Char]ABC ", that's the order it will be in memory, even though it will display as " CBA." In UTF-8, this string would be 7 bytes long, in UTF-16, 10. In both cases, the character length of the string is 5.

11. I'm not sure why this would be a problem, provided your terminal can handle unicode, which most can in my experience (there's some fiddling you have to do in windows.) It should wrap or break the line the same as with RTL. I believe the unicode standard includes rules for how to break text, but not positive.

12. I'm not really sure what you mean. Your object will write whatever bytes it writes. If you're using a UTF-16 encoder, usually you can specify the Endianness and whether to write a BOM or not.

13. If you're using a unicode type in a language like python, [] operates on logical characters. If you're using a byte string type (python's string, a PHP or C string, for ex), [] operates on bytes.

14. If you're using a unicode type, split returns unicode objects, which have their own internal encoding. Again, right-to-left characters look exactly like any other character to the encoder and decoder. If you're using byte strings, you need to use a unicode-aware split function, and tell it what encoding the string is in. It will return to you strings in the same encoding (and endianness.)

16. Not familiar with this.

17. MIME types are separate from encoding. I can have an HTML page that is in UTF-8 or UTF-16; both have the MIME type text/html, same with text/plain and so on. MIME and encoding operate at different levels. Web things knowing the encoding is actually fairly complicated. The correct thing to do is to send an HTTP header that specifies the encoding, and add a <meta charset="bla"> tag in the HEAD of the page. If you don't do this, I think browsers implement various heuristics to attempt to detect the encoding. Having a type for every future encoding is an unreasonable demand, because the future is notoriously difficult to predict. If you have a crystal ball, I'm sure the standards committees would love to hear from you.

18. Also somewhat complicated! I think there are tools which can guess for you, using similar heuristics that I mentioned that browsers use. You should specify the encoding using your text editor, as you showed. It is not too much to ask to tell people to set their text editors up correctly. Your projects should have a convention that all files use, just like with code formatting. If someone tries to check in something that's not valid UTF-8, you can have a hook that rejects the commit if you want. Then they can fix it. The format is not stored anywhere, which is why you should have the convention and yell at people who mess up (not literally, be nice.) If you don't know what a file is, you can use aforementioned tools, or try a few encodings and see what works. Yes, it's a hassle, which is why you should set a convention.

19. You can specify LE/BE when you say what encoding something is. As in, you say, hey encode this here text as UTF-16 LE, and it says right-o, here we go!

20. A C char is not aware of the encoding, so that wouldn't have any effect! Some other guy said that a char is always a byte, so answer: no.

21. These are two wildly different things. A BOM is always FEFF or FFFE, depending on endianness, so you can look for those and chop off the two bytes, I guess. For dealing with accents, look into Unicode normalization forms; they define a specific way to compose and decompose accents. I'm not sure about your javac woes; there ought to be some way to tell javac what encoding to expect, like you can do with Python. It may be the case that javac guesses the encoding based on your locale, and his was different from yours.


> Issues like this are why I hate internationalization.

Same here.

The one to blame should be Unicode. It's actually a nearly impractical MULTIcode defined by a committee.


"Character". You keep saying that word …

    >>> len(u'épicé')
    5
    >>> len(u'épicé')
    7


Apropos: http://mathiasbynens.be/notes/javascript-encoding

TL;DR:

- JavaScript engines are free to internally represent strings as either UCS-2 or UTF-16. Engines that choose to go UCS-2 tend to replace all glyphs outside of the BMP with the replacement char (U+FFFD). Firefox, IE, Opera, and Safari all do this (with some inconsistencies).

- However, from the point of view of the actual JS code that gets executed, strings are always UCS-2 (sort of). In UTF-16, code points outside the BMP are encoded as surrogate pairs (4 bytes). But -- if you have a Javascript string that contains such a character, it will be treated as two consecutive 2-byte characters.

  var x = '𝌆';
  x.length; // 2
  x[0];     // \uD834
  x[1];     // \uDF06
Note that if you insert said string into the DOM, it will still render correctly (you'll see a single character instead of two ?s).
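
For anyone curious, the surrogate-pair arithmetic behind that example looks like this (a Python sketch; the function name is mine):

    def decode_surrogate_pair(high, low):
        assert 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF
        return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000

    print(hex(decode_surrogate_pair(0xD834, 0xDF06)))   # 0x1d306 -- the code point of '𝌆'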


I'm relatively comfortable with this stuff, but I am confused by your response.

First you say that engines will "internally" replace non-BMP glyphs with the replacement character, but then you give an example that seems to work fine (and I think would work fine as long as you don't cut that character in half, or try to inspect its character code without doing the proper incantations[1].)

So, I guess what I'm asking is, at what point does the string become "internal", such that the engine will replace the character with the replacement character?

[1]: As given in the article you linked to.


I dare not try to re-explain the discussion in this bug report as my understanding feels insufficient, but the entire discussion at http://code.google.com/p/v8/issues/detail?id=761#c14 (note, I've linked to the 14th comment in the discussion, but there's more interesting stuff above) talks about it. At the core is a distinction between v8's internal representation of strings and its API vs. what a browser engine which embeds v8 might do.


Safari uses UTF-16, not UCS-2. I believe this is true of other browsers as well. Otherwise this would render the replacement char, but it doesn't, it renders correctly:

javascript:var x = '𝌆';document.write(x);


Well, a JS string is just a series of UTF-16 code-units (per ES5, there is no impl choice here), so there isn't really any encoding per se (and it isn't necessarily a UTF-16 string, per the spec's definition thereof, as lone surrogates are valid). The fact that that works is more a testament to the DOM being UTF-16 than to JS.

(On the other hand, I'm sure you knew that. But probably there are people reading your comment who didn't. :))


You are technically correct, the best kind of correct! But I think we both agree there is absolutely no sense in which anything in browser engines is UCS-2, and that browsers will not in fact replace characters beyond the BMP with the replacement glyph, as the top-level comment claimed. It is kind of embarrassing that the top rated comment (as of writing) says completely false things.


I hate the implication in this comment and in the linked article that the spec is somehow immutable. The ECMAScript spec here is fundamentally flawed with regard to character encoding and needs to be fixed.

UCS-2 is not a valid Unicode encoding any more, because there are several sets of characters encoded outside the BMP. The spec should be updated to require UTF-16 support in all implementations.

If a modern programming language like JavaScript doesn't provide a way to represent characters outside the BMP in its character data type, that needs to be fixed too. Indexing and counting characters in a JavaScript string need to reflect the human and Unicode notion of characters, not the arbitrary 2-byte blocks that UCS-2 happens to use.

The language authors should be ashamed of this situation - having a modern language without proper Unicode support is simply awful.


Sometimes you need to know about encodings, even if you're just a consumer. Putting just one non 7-bit character in your SMS message will silently change its encoding from 7-bit (160 chars) to 8-bit (140 chars) or even 16 bit (70 chars) which might make the phone split it into many chunks. The resulting chunks are billed as separate messages.


On iOS, using any non-basic Latin character in SMS makes it switch to 16 bit, even when there is no reason for that to happen. It's a thing that most foreign language speakers must live with.

By doing this excuse-filled write-up, this guy wasted a substantial amount of time that he could have spent better researching the issue. Your consumer doesn't care how many bits Emoji take; it doesn't matter to him that you're running your infrastructure on poorly chosen software - there is absolutely no excuse for not supporting this in a native iOS app, especially now that Emoji is so widely used and deeply integrated in iOS.

How is that a problem they are focusing on, anyway, when their landing page features awful, out of date mockups of the app? (not even actual screenshots - notice the positions of menu bar items) They are also featuring Emoji in every screenshot - ending support might be a fresh development, but I still find that ironic.


Absolutely right. The customer does not care that you made a shortsighted decision to pick a language for a TEXT-based system that cannot correctly support non-BMP Unicode. There are no excuses; surrogates have been out there for years (Windows has been using UTF-16LE since NT 3.51 / Unicode 2), as have 4-byte UTF-8 encodings.

JavaScript is a joke in this respect, and is keeping horrors like Shift-JIS alive long after they should have been retired.


This was an internal email: https://medium.com/tech-talk/1aff50f34fc


So Node.js already fixed the issue, nice!


The GSM 03.38 charset specified for SMS is not straight 7-bit ASCII. See eg. http://www.dreamfabric.com/sms/default_alphabet.html


The quick summary, for people who don't like ignoring all those = signs, is that V8 uses UCS-2 internally to represent strings, and therefore can't handle Unicode characters which lie outside the Basic Multilingual Plane -- including Emoji.


Honestly that's a shame.


It was fixed back in March though.


If you search for V8 UCS-2 you'll find a lot of discussion on this issue dating back at least a few years. There are ways to work around V8's lack of support for surrogate pairs. See this V8 issue for ideas: https://code.google.com/p/v8/issues/detail?id=761

My question is why does V8 (or anything else) still use UCS-2?


The ES5 spec defines a string as being a series of UTF-16 code-units, which inherently means surrogates show through.

APIs like that tend to be low priority because they aren't used by browsers (which pass everything through as UTF-16 code-units, typically treating them as possibly-valid UTF-16 strings).


> My question is why does V8 (or anything else) still use UCS-2?

Because the ES spec defines a string as a sequence of UTF-16 code units (aka UCS-2-with-visible-surrogates), because, like many others (e.g. Java), the language's strings were created during / inherited from Unicode 1.0, which fit in 16 bits (UTF-16 is a retrofitting of Unicode 1.0's fixed width to accommodate the full range of later Unicode versions by adding surrogate pairs).


Because counting fixed 2-byte units is much faster for computers than counting units that vary between 1, 2, 3 or even 4 bytes.


This is not a real issue because counting code points in a UTF-8 string is easy too: the encoding is cleverly defined such that you just need to count the bytes that are not continuation bytes (those of the form 10xxxxxx). Since UTF-8 strings are generally shorter it can even be faster than counting UTF-16 if you don't know the length in advance.


Took me a bit to realize that this is talking about the Voxer iOS app (http://voxer.com/), not Github (https://github.com/blog/816-emoji).


Yeah, I was worried there for a sec. :^)


> Wow, you read through all of that? You rock. I'm humbled that you gave me so much of your attention.

That was actually really fun to read, even as a now non-technical guy. I can't put a finger on it, but there was something about his style that gave off a really friendly vibe even through all the technical jargon. That's a definite skill!


DeSalvo's source comments have always been an entertaining read. :)


This is dated January 2012. By the looks of things, this was fixed in March 2012[1]

[1] https://code.google.com/p/v8/issues/detail?id=761#c33


I wonder if this has been rolled into Node yet.

[edit] Node currently uses V8 version 3.11.10.25, which was released after this fix was made, but not sure if the fix was merged to trunk

[edit2] actually, looks like it has, though I can't identify the merge commit


Please, if you're going to post text to a Gist at least use the .md extension:

https://gist.github.com/4151124


Which enables an even more readable layout with gist.io http://gist.io/4151124


A couple of reasons why it makes sense for V8 and other vendors to use UCS2:

- The spec says UCS2 or UTF16. Those are the only options.

- UCS2 allows random access to characters, UTF-16 does not.

- Remember how the JS engines were fighting for speed on arbitrary benchmarks, and nobody cared about anything else for 5 years? UCS2 helps string benchmarks be fast!

- Changing from UCS2 to UTF-16 might "break the web", something browser vendors hate (and so do web developers)

- Java was UCS2. Then Java 5 changed to UTF-16. Why didn't JS change to UTF-16? Because a Java VM only has to run one program at once! In JS, you can't specify a version, an encoding, and one engine has to run everything on the web. No migration path to other encodings!


> UCS2 allows random access to characters, UTF-16 does not.

I'm not sure if that's really true. On IBM's site, they define 3 levels of UCS-2, only one of which excludes "combining characters" (really code points).

http://pic.dhe.ibm.com/infocenter/aix/v6r1/index.jsp?topic=%...

If you have combining characters, then you can't simply take the number of bytes and divide by 2 to get the number of letters. If you don't have combining characters, then you have something which isn't terribly useful except for European languages (I think?)

Maybe someone more familiar with the implementation can describe which path they actually went down for this... given what I've heard so far, I'm not optimistic.


OK, I cracked into the V8 source to take a look at what actually happens. It looks like the implementation does use random access for two-byte strings. However, it also uses multiple string implementations (ASCII, 2-byte strings, "consString" (I presume some kind of rope), "Sliced Strings" (sounds like a rope again, but might be shared storage of the string contents for immutable strings)), so they could likely use other implementations with whatever properties they choose.

See https://github.com/v8/v8/blob/3ff861bbbb62a6c0078e042d8077b2... and https://github.com/v8/v8/blob/3ff861bbbb62a6c0078e042d8077b2....


Ugh, I just realized this whole article is a farce. v8 added UTF-16 support earlier this year. Its support for non-BMP code points is now on par with java (although it has many of the same limitations)


We seem to be seeing this more and more with Node-based applications. It's a symptom of the platform being too immature. This is why you shouldn't adopt these sorts of stacks unless there's some feature they provide that none of the more mature stacks support yet. And even then, you should probably ask yourself if you really need that feature.


According to Cogito, this was fixed in March:

http://news.ycombinator.com/item?id=4834731

I want to agree with you simply because I don't like Node, but it's hardly fair to damn something over a bug that was fixed 9 months ago.


Why on earth would the people who wrote V8 use UCS-2? What about alternative JS runtimes?


Because Unicode was sold to the world's software developers as a fixed-width encoding claiming 16 bits would be all we'd ever need.


Yes it was, in 1991 when NT and Java and Cocoa were new or under development. In 1996, Unicode 2.0 came out with surrogate pairs and astral planes, and Unicode was no longer a 16 bit encoding.

I'm pretty sure work on V8 started after 1996.

More likely is the idea that the authors of V8 felt that UCS-2 was an acceptable speed/correctness trade-off.


Yes, and several C/C++ conventions and types seemed to make that a safe choice, for example wchar_t. Let's face it, collectively we really screwed this one up. It's the biggest mistake since Microsoft chose the backslash as a path separator in DOS 2.0.


It was actually IBM's fault: they used '/' to denote CLI args in the apps they wrote for DOS 1.0, which didn't have any concept of directories. From Larry Osterman's blog [1]:

> Here's a little known secret about MS-DOS. The DOS developers weren't particularly happy about this state of affairs - heck, they all used Xenix machines for email and stuff, so they were familiar with the *nix command semantics. So they coded the OS to accept either "/" or "\" character as the path character (this continues today, btw - try typing "notepad c:/boot.ini" on an XP machine (if you're an admin)). And they went one step further. They added an undocumented system call to change the switch character. And updated the utilities to respect this flag.

[1] http://blogs.msdn.com/b/larryosterman/archive/2005/06/24/432...


Larry has that wrong, IBM wasn't to blame.

IBM licensed DOS from Microsoft. Microsoft bought DOS from Seattle Computer Products QDOS. That software got its command line switches using "/" from CP/M for compatibility reasons; originally, both CP/M and MS-DOS were available for the IBM PC.

CP/M borrowed the convention primarily from RT-11, the OS for the PDP-11, although it wasn't consistently followed there. Programs on RT-11 were responsible for parsing their own command line args, and not all of them used the same convention.

Inside Windows itself, most APIs accept either forward or backward slashes in paths (even both in the same path) without any special incantation. The problem is mainly at the application level where the whole forward/backward slash thing gets messed up because technically you should accept either one from user input and most app code expects one or the other.


They control their clients, so they could've just re-encoded emoji with a custom 16-bit escaping scheme, made the backend transparently relay it in escaped form, and decoded it back to the full code points at the other end.

Or am I missing something obvious here?


Small nitpick, but Objective-C does not require a particular string encoding internally. In Mac OS and iOS, NSString uses one of the cfinfo flags to specify whether the internal representation is UTF-16 or ASCII (as a space-saving mechanism).


The specific problems the author describes don't seem to be present today; perhaps they were fixed. That's not to say these conversions aren't a source of issues, just that I don't see any show-stopper problems currently in Node, V8, or JavaScript.

In JavaScript, a string is a series of UTF-16 code units, so the smiley face is written '\ud83d\ude04'. This string has length 2, not 1, and behaves like a length-2 string as far as regexes, etc., which is too bad. But even though you don't get the character-counting APIs you might want, the JavaScript engine knows this is a surrogate pair and represents a single code point (character). (It just doesn't do much with this knowledge.)

You can assign '\ud83d\ude04' to document.body.innerHTML in modern Chrome, Firefox, or Safari. In Safari you get a nice Emoji; in stock Chrome and Firefox, you don't, but the empty space is selectable and even copy-and-pastable as a smiley! So the character is actually there, it just doesn't render as a smiley.

The bug that may have been present in V8 or Node is: what happens if you take this length-2 string and write it to a UTF8 buffer, does it get translated correctly? Today, it does.

What if you put the smiley directly into a string literal in JS source code, not \u-escaped? Does that work? Yes, in Chrome, Firefox, and Safari.


The invisible smiley was a font system problem, fixed in Firefox 19 Aurora (assuming you're on Mac).

https://bugzilla.mozilla.org/show_bug.cgi?id=715798


The UCS-2 heritage is kind of annoying. In Java, for example, chars (the primitive type, which the Character class just wraps) are 16 bits. So one instance of a Character may not be a full "character" but rather part of a surrogate pair. This creates a small gotcha where the length of a string might not be the same as the number of characters it has. And you just can't split/splice a Character array naively (because you might split it at a surrogate pair).


Which, at the end of the day, doesn't really matter since a code point is not a "character" in the sense of "the smallest unit of writing" (as interpreted by an end-user): many "characters" may (depending on the normalization form) or will (jamo) span multiple codepoints. Splitting on a character array is always broken, regardless of surrogate pairs.


Yes. What I'm saying is that it would feel less error-prone if the character object was actually a codepoint. It's a leaky abstraction; you shouldn't need to handle something that is tied to the internal representation of strings in the JVM. Can one "character" span multiple codepoints? Do you have an example of this?


> It's a leaky abstraction; you shouldn't need to handle something that is tied to the internal representation of strings in the JVM.

And I'm saying it doesn't really matter, because Unicode codepoints are already a form of "leaky abstraction" which you'll have to handle (in that a read/written "character" does not correspond 1:1 to a codepoint anyway). Unicode is a tentative standardization of historical human production, and if you expect that to end up clean and simple you're going to have a hard time.

> Can one "character" span multiple codepoints?

Yes.

> Do you have an example of this?

Devanagari (the script used for e.g. Sanskrit) is full of them. For instance, "sanskrit" is written "संस्कृतम्" [sə̃skɹ̩t̪əm]. If you try to select "characters" in your browser you might get 4 (सं, स्कृ, त and म्) or 5 (सं, स्, कृ, त and म्) or maybe yet another different count, but this is a sequence of 9 codepoints (regardless of the normalization, it's the same in all of NFC, NFD, NFKC and NFKD as far as I can tell):

    स: DEVANAGARI LETTER SA
    ं: DEVANAGARI SIGN ANUSVARA
    स: DEVANAGARI LETTER SA
    ्: DEVANAGARI SIGN VIRAMA
    क: DEVANAGARI LETTER KA
    ृ: DEVANAGARI VOWEL SIGN VOCALIC R
    त: DEVANAGARI LETTER TA
    म: DEVANAGARI LETTER MA
    ्: DEVANAGARI SIGN VIRAMA
Note: I'm not a Sanskrit speaker and I don't actually know Devanagari (beyond knowing that it's troublesome for computers, as are jamo), so I can't even tell you how many "symbols" a native reader would see there.


That's quite interesting, I had no idea! What I was hoping for was some kind of term for one character or symbol and use that as a unit, but perhaps it's impossible to create an abstraction like that.

I'm curious if a Sanskrit speaker would see each of the codepoints as a symbol or not.

Edit: thinking about it, I guess if you asked a Sanskrit speaker how long a word/sentence was, you'd get the answer...


> What I was hoping for was some kind of term for one character or symbol and use that as a unit

There is one, kind-of: "grapheme cluster"[0]. This is the "unit" used by UAX29 to define text segmentation, and aliases to "user-perceived character"[1].

Most languages/APIs don't really consider them (although they crop up often in e.g. browser bug trackers), let alone provide first-class access to them. One of the very few APIs which actually acknowledges them is Cocoa's NSString — and Apple provides a document explaining grapheme clusters and how they relate to NSString[2] — which has very good Unicode support (probably the best I know of, though Factor may have an even better one[3]). It handles grapheme clusters by providing messages which work on codepoint ranges in an NSString; it doesn't treat clusters as first-class objects.
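For what it's worth, much newer JavaScript engines also expose grapheme clusters, via Intl.Segmenter; a sketch, assuming an engine that ships it, using the Devanagari string from above:

    var sk = 'संस्कृतम्';
    Array.from(sk).length;   // 9 code points, as listed above
    var seg = new Intl.Segmenter('en', { granularity: 'grapheme' });
    Array.from(seg.segment(sk)).length;  // fewer than 9; the exact count depends on the
                                         // Unicode segmentation rules the engine implements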

> I guess if you asked a Sanskrit speaker how long a word/sentence was, you'd get the answer...

Indeed.

[0] http://www.unicode.org/glossary/#grapheme_cluster

[1] http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Bounda...

[2] https://developer.apple.com/library/mac/#documentation/Cocoa...

[3] the original implementor detailed his whole route through creating Factor's Unicode library, and I learned a lot from it: http://useless-factor.blogspot.be/search/label/unicode


Very interesting, going to read through that guy's blog. Thanks for the links!


Maybe nitpicking, but I don't think Softbank came up with emoji. Emoji existed way before Softbank bought the Japanese Vodafone, and even before Vodafone bought J-Phone.

So emoji were probably invented by J-Phone, while Softbank was mostly taking care of Yahoo Japan.


Here's the thread in the v8 bug tracker about this issue: http://code.google.com/p/v8/issues/detail?id=761

Is there a reason that the workaround in comment 8 won't address some of these issues?


I don't think it's needed anymore.

If you read closely you'll see the original linked message is from January and there's an update on that issue from March when a fix was made in V8.


Somewhat meta, but this would be one where showing the subdomain on HN submissions would be nice. The title is vague enough that I assumed it was something to do with _Github_ not processing Emoji (which would be sort of a strange state of affairs...).


Not that strange: Github implements much of the Emoji set using different shortcuts; see the reference here:

http://www.emoji-cheat-sheet.com/

Before I read the article I guessed that maybe the icon set had some licensing issues for Github. Luckily, not so! (:smiley:)


That was basically my point. It would be strange if they _stopped_ processing it.


I love this article. So often it has been difficult to explain to people why one set of characters can work while others will not. This lays out some great historical info that will be helpful going forward.


UCS-2 is only used by programs which jumped the gun and implemented Unicode before it was all done. (It was 16 bits for a while, with Asian languages sharing code points so that the font in use determined whether the text was displayed as Chinese vs. Japanese, etc.) What century was V8 written in that they thought UCS-2 was an acceptable thing to implement?

Good rule of thumb for implementers: get over it and use 32 bits internally. Always use UTF-8 when encoding into a byte stream. Add UTF-16 encoding if you must interface with archaic libraries.


> UCS-2 is only used by programs which jumped the gun and implemented Unicode before it was all done.

There's no such thing as "all done"; Unicode 1.0 was 16-bit, and Unicode 6 was released recently.


Failures in Unicode support seem usually to result from the standard’s persistently shortsighted design—well intentioned and carefully considered though it undoubtedly is. It’s a “good enough” solution to a very difficult problem, but I wonder if we won’t see Unicode supplanted in the next decade.

All that aside: emoji should not be in Unicode. Full stop.


How about this npm module: https://npmjs.org/package/emoji ?


Here's the message decoded from quoted-printable: https://gist.github.com/4151707#file_emoji_sad_decoded.txt


Note that this message is almost a year old now. The issue has been addressed by the node and V8 teams.


Very informative, great read. Thanks!


Fixed a good while ago for node.js


Wow, the first half of the text is basically full of crap and claims which don't even remotely match reality, and now I'm reaching the technical section which can only get even more wrong.


To whoever the downvoter was: no, seriously. For instance in the first few paragraphs:

* emoji were invented by NTT DoCoMo, not Softbank

* even if that had been right Softbank's copyrighting of their emoji representations has no bearing on NTT and KDDI/au using completely different implementations (and I do mean completely, KDDI/au essentially use <img> tags)

* lack of cooperation is endemic to Japanese markets (especially telecoms) and has nothing to do with "ganging up"

* if NTT and au/KDDI wanted to gang up on Softbank you'd think they'd share the same emoji

* you didn't have to run "adware apps" to unlock the emoji keyboard (there were numerous ways to do so, from dedicated — and usually quickly nuked — apps, to in-app "easter eggs", to jailbreaking, to phone backup edit/restore)

That's barely the first third.


TLDR: node sucks


TLDR: The V8 engine can't (supposedly) encode Unicode code points that need more than 16 bits, because it uses the UCS-2 encoding.


TLDR: v8 "sucks" (and doesn't support Unicode code points outside of the lowest ~64k characters).

Edit: v8 in general is pretty cool, but not supporting Unicode outside UCS-2 is pretty bad.


Most apps seem to "support" surrogate pairs by simply not being aware of them at all.

Good on the V8 developers for recognizing these conditions that their code didn't fully handle and refusing to muddle on through with broken processing.


It's v8's fault, and v8 does not suck.


Unicode 2.0 added surrogate pairs in 1996. Unfortunately, the first versions of both Java and JavaScript predated this and got strings horribly wrong, and now any conforming implementation of either is required to suck. The Right Thing would be for almost everyone to work with only combining character sequences, except for a rare few who need to know how to dissect one into its codepoints and reassemble them correctly (just as people don't normally need to extract high or low bits from an ASCII character).


No. Combining characters and NF(K)C/D normalisation rules are a different problem entirely. Consider the "heavy metal umlaut" (i.e. Spın̈al Tap), where no conversion to a single precomposed codepoint is possible: only "n" followed by U+0308.
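Easy to check from a console, assuming an engine with String.prototype.normalize (which arrived well after this thread):

    var n = 'n\u0308';           // 'n' followed by COMBINING DIAERESIS
    n.normalize('NFC');          // still 'n\u0308': there is no precomposed form to compose to
    n.normalize('NFC').length;   // 2
    '\u00F1'.normalize('NFD');   // 'n\u0303' -- ñ, by contrast, decomposes and recomposes cleanly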


They're facets of the same problem. I shouldn't routinely be dealing with either surrogates or combining marks; unless I have a specific reason, it's only an opportunity to make a mistake that hardly anyone knows how to troubleshoot. "n̈" should be an indivisible string of length one until I need to ask how it would actually be encoded in UTF-16 or whatever.


But that's the point: there is no such character. Given the Unicode consortium have added codepoints for every other bloody thing under the sun, I'm amazed that there isn't one for n-diaeresis, but there you are.

Add a small number of people who, for artistic reasons, decide that they want to make life hard (Rinôçérôse, I'm looking at you) and you just have to accept that the length of your string might not equal the number of codepoints contained therein...


Damn you for being right. :)


Not even a little bit accurate.


A two-character sequence for a smiley face that should be compatible with everything in existence:

  :)
Problem solved. Why is this front page material (#6 as of this writing)?



