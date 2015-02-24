Hacker News
Emoji.length == 2 (jonnew.com)
47 points by stanzheng 3 hours ago | 34 comments





There are multiple ways of counting "length" of a string. Number of UTF-8 bytes, number of UTF-16 code units, number of codepoints, number of grapheme clusters. These are all distinct yet valid concepts of "length."
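To make the first three notions of "length" concrete, here is a minimal Python sketch (the helper name `length_report` is made up for illustration). Counting grapheme clusters is left out because the Python standard library has no segmenter for that.

```python
def length_report(text: str) -> dict:
    """Return three different 'lengths' of the same string."""
    return {
        # Number of bytes once encoded as UTF-8.
        "utf8_bytes": len(text.encode("utf-8")),
        # Number of 16-bit code units; this is what JavaScript's .length reports.
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        # Number of Unicode code points; this is what Python's len() reports.
        "code_points": len(text),
    }


pile_of_poo = "\U0001F4A9"   # one code point outside the Basic Multilingual Plane
e_acute = "e\u0301"          # 'e' followed by COMBINING ACUTE ACCENT

print(length_report(pile_of_poo))
print(length_report(e_acute))
```

The single emoji is 4 UTF-8 bytes, 2 UTF-16 code units, and 1 code point, which is where the "length == 2" in the title comes from.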

For the purpose of allocating buffers, I can see the obvious use in knowing number of bytes, UTF-16 code units, or the number of codepoints. I also see the use in being able to iterate through grapheme clusters, for instance for rendering a fragment of text, or for parsing. Perhaps someone can shed light on a compelling use case for knowing the number of grapheme clusters in a particular string, because I haven't been able to think of one.

I'm not sure about calculating password lengths: if the point is entropy, the number of bytes seems good enough to me!

The password field bug is possibly compelling, but I don't think it's obvious what a password field should do. Should it represent keystrokes? Codepoints? Grapheme clusters? Ligatures? Replace all the glyphs with bullets during font rendering?

(Similarly, perhaps someone could explain why they think reversing a string should be a sensible operation. That this is hard to do is something I occasionally hear echoing around the internet. The best I've heard is that you can reuse the default forward lexicographic ordering on reversed strings for a use I've forgotten.)

reply


The Unicode standard describes in Annex 29 [1] how to properly split strings into grapheme clusters. And here [2] is a JavaScript implementation. This is a solved problem.

[1] http://www.unicode.org/reports/tr29/

[2] https://github.com/orling/grapheme-splitter
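As a rough illustration of what such a splitter does, here is a deliberately simplified Python sketch. It is not the algorithm from Annex #29 and not the linked library: it only glues combining marks, variation selectors, skin tone modifiers, ZWJ sequences, and pairs of regional indicators onto the preceding character, and it ignores other rules such as Hangul jamo and CR LF. The function names are made up for illustration.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER


def is_extender(ch: str) -> bool:
    """Characters that, in this simplified model, attach to the previous one."""
    return (
        # Combining marks; the variation selectors U+FE00..U+FE0F are in Mn.
        unicodedata.category(ch) in ("Mn", "Me", "Mc")
        # Emoji skin tone (Fitzpatrick) modifiers.
        or 0x1F3FB <= ord(ch) <= 0x1F3FF
    )


def is_regional_indicator(ch: str) -> bool:
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def rough_clusters(text: str) -> list:
    """Approximate grapheme clusters; a sketch, not a UAX #29 implementation."""
    clusters = []
    for ch in text:
        prev = clusters[-1] if clusters else ""
        if prev and (ch == ZWJ or is_extender(ch) or prev.endswith(ZWJ)):
            clusters[-1] += ch          # extend the current cluster
        elif prev and len(prev) == 1 and is_regional_indicator(prev) \
                and is_regional_indicator(ch):
            clusters[-1] += ch          # second half of a flag pair
        else:
            clusters.append(ch)         # start a new cluster
    return clusters


kiss = ZWJ.join(["\U0001F469", "\u2764", "\U0001F48B", "\U0001F469"])
print(len(kiss), len(rough_clusters(kiss)))
```

For real text, a full implementation of the annex (such as the linked library or ICU) is the safer choice; the sketch only shows why the cluster count can be far smaller than the code point count.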

reply


I'm the author. Thank you for sharing! I will check it out shortly.

reply


Before emoji, fonts and colors were independent. Combining the two creates a mess. Try using emoji in an editor with syntax coloring. We got into this because some people thought that single-color emoji were racist.[1] So now there are five skin tone options. The no-option case is usually rendered as bright yellow, which comes from the old AOL client. They got it from the happy-face icon of the 1970s.

Here's the current list of valid emoji, including upcoming ones being added in the next revision.[2]

A reasonable test for passwords is to run them through an IDNA checker, which checks whether a string is acceptable as a domain name component. This catches most weird stuff, such as mixed left-to-right and right-to-left symbols, zero-width markers, homoglyphs, and emoji.

[1] https://www.washingtonpost.com/news/the-intersect/wp/2015/02...

[2] http://unicode.org/emoji/charts-beta/full-emoji-list.html

reply


Unicode does not require fonts to use color, and the spec does try to deal with the case where you don't want to use color, explicitly talking about black-and-white renderings in multiple places. This is no different for the skin tone modifiers: it's perfectly okay to fall back to a greyscale emoji (indeed, it might make sense to render all emoji in greyscale or B&W in a text editor).

reply


> Before emoji, fonts and colors were independent.

Fonts and colors are still, for the most part, independent. Color or the lack thereof is a property of your font and text rendering subsystems. For instance Noto Emoji provides B&W emoji, and Noto Color Emoji provides colored ones.

reply


It's interesting that in certain subcultures, including a large portion of the tech community, things that are non-racial are now considered racist. Not only do you have to be "racially aware", as they say, but you have to be racially aware in the right way. Being "colorblind" isn't enough anymore. Even the emoticons have to express race!

Perhaps predictably, this has backfired and will continue to backfire quite spectacularly; it turns out that when you force people to start thinking along racial lines, they might not end up with the exact same ideas about race that you have. I suspect this may be a large contributing factor behind the recent resurgence of ethno-nationalism (see the Alt Right et al.).

reply


The thing that frustrates me the most about Unicode emoji is the astounding number of combining characters. For combining characters in written languages, you can do an NFC normalization and, with moderate success, get a 1 codepoint = 1 grapheme mapping, but "Emoji 2.0" introduced some ridiculous emoji compositions with the ZWJ character.

To use the author's example:

woman - 1 codepoint

black woman - 2 codepoints, woman + dark Fitzpatrick modifier

woman kissing woman - 7 codepoints, woman + ZWJ + heart + ZWJ + kissy lips + ZWJ + woman

It's like composing Mayan pictographs, except you have to include an invisible character in between each component.

Here's another fun one: country flags. Unicode has special characters 🇱 🇮 🇰 🇪 🇹 🇭 🇮 🇸 that you can combine into country codes to create a flag. 🇰+🇷 = 🇰🇷

edit: looks like HN strips emoji? Changed the emoji in the example into English words. They are all supposed to render as a single "character".
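Since the emoji themselves don't survive here, the same sequences can be built from escapes. This is a small Python sketch; the constant names and the `flag` helper are made up for illustration, and the kiss sequence is built exactly as listed above (the fully qualified form also carries a variation selector, U+FE0F, after the heart, which makes it 8 code points).

```python
ZWJ = "\u200d"             # ZERO WIDTH JOINER
WOMAN = "\U0001F469"
DARK_SKIN = "\U0001F3FF"   # darkest Fitzpatrick modifier
HEART = "\u2764"
KISS_MARK = "\U0001F48B"


def flag(country_code: str) -> str:
    """Map a two-letter country code to its pair of regional indicator symbols."""
    base = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A
    return "".join(chr(base + ord(c) - ord("A")) for c in country_code.upper())


examples = {
    "woman": WOMAN,
    "black woman": WOMAN + DARK_SKIN,
    "woman kissing woman": ZWJ.join([WOMAN, HEART, KISS_MARK, WOMAN]),
    "flag KR": flag("KR"),
}

for name, text in examples.items():
    # len() on a Python str counts code points.
    print(name, len(text), [f"U+{ord(c):04X}" for c in text])
```

Each of these is meant to render as one "character", yet the code point counts are 1, 2, 7, and 2.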

reply


Man, imagine if you could compose Chinese characters out of radicals like this.

I'm not sure if that would be a good thing or a bad thing.

reply


The ideographic description characters do provide a way to describe how to map radicals into characters, but don't actually provide rendering in such a manner.

There is active discussion on actually being able to build up complex grapheme clusters in such a manner, because it's necessary for Egyptian and Mayan text to be displayed properly. U+13430 and U+13431 have been accepted for Unicode 10.0 already for some Egyptian quadrat construction.

reply


Doesn't it already exist to an extent? That's pretty much how Korean is built isn't it?

reply


Jamo, Emoji (including flag combinators), Arabic, and Indic scripts all combine on an effectively per-character basis. There's not really any existing character that says "display any Unicode grapheme A and grapheme B in the same visual cell with A above B." The proposed additions to Egyptian hieroglyphs would be the first addition of such a generic positioning control character to my knowledge, albeit perhaps limited just to characters in the Egyptian Unicode repertoire.

Research on what to do vis à vis Mayan characters (including perhaps reusing Egyptian control characters for layout) is still ongoing, as is better handling of Egyptian.

reply


The one that still surprises me is Hangul (Korean script). Hangul characters are made of 24 basic characters (jamo) which represent consonant and vowel sounds, which are composed into Hangul characters representing syllables.

Unicode has a block for Hangul jamo, but they aren't used in typical text. Instead, Hangul are presented using a massive 11K-codepoint block of every possible precomposed syllable. ¯\_(ツ)_/¯

reply


The original version of Unicode was primarily intended to unify all existing character sets as opposed to designing a character database from fundamental writing script principles. That's why most of the Latin accented characters (e.g., à) come in precomposed form.

It is worth noting that precomposed Hangul syllables decompose to the Jamo characters under NFD (and vice versa for NFC). However, most data is sent and used with NFC normalization.
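The round trip can be checked with Python's standard `unicodedata` module. A minimal sketch, using the syllable U+D55C as an arbitrary example:

```python
import unicodedata

syllable = "\uD55C"  # one precomposed Hangul syllable (U+D55C)

# NFD splits the syllable into its conjoining jamo.
decomposed = unicodedata.normalize("NFD", syllable)
# NFC puts the jamo back together into the precomposed syllable.
recomposed = unicodedata.normalize("NFC", decomposed)

print([f"U+{ord(c):04X}" for c in decomposed])
print(recomposed == syllable)

# The same holds for precomposed Latin letters such as U+00E0 (a with grave).
print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", "\u00e0")])
```

Here the single syllable decomposes into three jamo code points (U+1112, U+1161, U+11AB) and recomposes to the original under NFC.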

reply


I would imagine this is a legacy from the Good Old Days when every Asian locale had its own encoding. Unicode imported the Hangul block from ISO-2022-KR/Windows-949 (different encodings of the same charset), which has only Hangul syllables.

reply


Somebody involved with Unicode must have had the same idea, because the ideographic description characters exist. However, I've never seen them used in practice because they don't actually render the character. You just get something like ⿰扌足, which corresponds to 捉.

https://en.wikipedia.org/wiki/Ideographic_Description_Charac...

reply


https://en.wikipedia.org/wiki/Cangjie_input_method

This isn't at the textual level, and the components are not strictly radicals, but this may interest you.

reply


> It's like composing Mayan pictographs

Which reminds me: We need a Jaguar emoji!

reply


That's all well and good, but at the end of the day, some unknowing developer has to write this functionality into whatever input-related code for some program that doesn't use OS-level components, and it just creates a mess.

reply


The Zero-Width-Joiner allows for some really strange things: https://blog.emojipedia.org/ninja-cat-the-windows-only-emoji....

One can basically achieve an unlimited number of emojis by concatenating the current ones.

reply


> Sometimes, I think people come up with these names just to add excitement to their lives.

Let's get outta here guys, we've been rumbled!

reply


I think the article "A Programmer's Introduction to Unicode" that was shared here recently is a good read and explains Unicode well.

https://news.ycombinator.com/item?id=13790575

reply


The issue doesn't really seem to be the emojis, but rather the variation sequences, which seem to be really awkward to work with, but I can sort of see why they're necessary. But the fact that we need special libraries to answer fairly basic queries about unicode text doesn't bode well.

reply


> But the fact that we need special libraries to answer fairly basic queries about unicode text doesn't bode well.

That's always been needed to actually properly work with unicode, what do you think ICU is? Few if any languages have complete native Unicode support. And it's hardly new, Unicode has an annex (#29) dedicated to text segmentation: http://www.unicode.org/reports/tr29/

reply


Makes me wonder whether or not that should be considered a bug.

reply


Unicode is fucked. All these bullshit emojis remind me of the 1980s when ASCII was 7 bits but every computer manufacturer (Atari, Commodore, Apple, IBM, TI, etc...) made their own set of characters for the 128 values of a byte beyond ASCII. Of course Unicode is a global standard so your pile-of-poop emoji will still be a pile-of-poop on every device even if the amount of steam is different for some people.

It's beyond me why this is happening. Who decides which bullshit symbols get into the standard anyway?

reply


> Unicode is fucked. All these bullshit emojis

Ah yes, all those bloody emoji taking the place of better worthier characters, those dastardly characters taking up all of one half of one 16th of a single Unicode plane (and Unicode has 17 planes, though only 15 are "public").

And the gall they have, actually being used and lighting up their section of plane 1 like a christmas tree while the rest of the plane lies in the darkness: http://reedbeta.com/blog/programmers-intro-to-unicode/heatma... what a disgrace, not only existing but being found useful, what has the world come to.

reply


Meet the shadowy overlords who approve emojis[0]

[0] http://www.latimes.com/business/technology/la-fi-tn-emoji-q-...

reply


If anything, their adaptability gives me confidence. They have little power to stop vendors from creating new emojis that are morphologically distinct from existing ones, so they might as well wrangle them into a standard.

reply


There is a Unicode encoding, UTF-32, which has the advantage of being fixed width. It is not popular for the obvious reason that even ASCII characters are expanded to 4 bytes. Additionally, the Windows APIs, among other interfaces, are not equipped to handle 4-byte code units.

reply


It's fixed width with respect to code points, but not with respect to any of the other things mentioned in the linked article. For example, the black heart with emoji variation selector (which makes it render red) is two code points.
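A quick way to see this in Python (a minimal sketch; the variable names are made up for illustration):

```python
# HEAVY BLACK HEART followed by VARIATION SELECTOR-16 (emoji presentation).
red_heart = "\u2764\ufe0f"

utf32 = red_heart.encode("utf-32-le")   # little-endian, no byte order mark
code_units = len(utf32) // 4            # each UTF-32 code unit is 4 bytes

# UTF-32 gives exactly one code unit per code point...
print(code_units, len(red_heart))
# ...but the user still sees a single heart, not two characters.
```

So UTF-32 removes the surrogate-pair problem of UTF-16, but one visible symbol can still span several fixed-width units.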

reply


Language is inherently complex, there's no way to solve this in any "cleaner" way than what we already came up with. Unfortunately the best way forward is to build up what we already have and cover all the warts with wrapper functions/libraries.

reply


And where are the sex emoji? The dirtiest thing I've been able to text is a heart and a pair of handcuffs ;-)

reply


The Love Hotel (U+1F3E9) is rather obvious, maybe the kiss mark (U+1F48B) as well, though the raunchiest ones (in actual use) are a bit more… discreet?: the aubergine (U+1F346) and splashing "sweat" (U+1F4A6).

reply



