Bear plus snowflake equals polar bear (andysalerno.com)
384 points by soopurman 45 days ago | 122 comments

This is definitely an example where "character" was the wrong word. We start out OK with bytes being distinct, but by the end we're talking about how a character is made out of several characters. If we think in terms of code points (Rust's native char type actually provides here a slightly different thing, a Unicode Scalar Value, which certain types of code point are not, but close enough) clearly a code point isn't made out of several code points, so we needed a different word.

I like squiggle. If you're writing a text rendering engine you might want to use "glyph", although you might already need that word for something else. But try to avoid "character", because that word already has far too many meanings, most of which won't be what you wanted.

I think the term you are looking for is grapheme cluster.
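
To make the bytes / code points / grapheme cluster distinction concrete, here is a minimal Python sketch using the polar bear from the article. It assumes the sequence is bear face + ZWJ + snowflake + variation selector-16, and the variable names are only for illustration. Python's standard library has no grapheme cluster segmentation, so this only counts the layers underneath the single visible squiggle.

```python
import unicodedata

# Assumed encoding of the polar bear emoji: bear + ZWJ + snowflake + VS16.
polar_bear = "\U0001F43B\u200D\u2744\uFE0F"

# One visible symbol, but several code points underneath.
code_points = [f"U+{ord(ch):04X}" for ch in polar_bear]
names = [unicodedata.name(ch) for ch in polar_bear]

# The same text measured in UTF-8 bytes and UTF-16 code units.
utf8_bytes = len(polar_bear.encode("utf-8"))
utf16_units = len(polar_bear.encode("utf-16-le")) // 2

for cp, name in zip(code_points, names):
    print(cp, name)
print("code points:", len(polar_bear))
print("UTF-8 bytes:", utf8_bytes)
print("UTF-16 code units:", utf16_units)
```

So one squiggle on screen is 4 code points, 5 UTF-16 code units, or 13 UTF-8 bytes, depending on who is counting.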


We're so lazy that we can't even be bothered to type "character" half the time, so "grapheme cluster" has no chance to catch on unless we can think of a "char"-style abbreviation. I'll start the bikeshed with "gracl", but ambitious folks may wish to argue for "clog" ("cluster of graphemes").

How about graph? One character longer than char and probably not already in use in the area of computer science.

Is this sarcasm?

Almost certainly a joke.

I laughed for a good two minutes. Still laughing cause I can't tell whether he's talking about nodes or vertices

Node because that's not a major software term yet.

It's really an Extended Grapheme Cluster. In our codebase (Darklang) we abbreviate it to EGC.

Elixir uses the term grapheme already:


Is that what Go uses "rune" for?


No. A "rune" is one codepoint, Golang just decided that they can't use the same name as everyone else (as is tradition in Go).

No, because a grapheme cluster is something like à, where the letter is composed of a and a combining ` (grave accent), which are two Unicode Scalar Values.

A Go rune is more like one Unicode Scalar Value.

However, be careful, terminology is subtle: what is a Code Point, what a Scalar Value, etc.
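
A small Python sketch of the à example, assuming the decomposed form is the letter a followed by U+0300 (combining grave accent). It uses the standard `unicodedata` module; the variable names are just for illustration.

```python
import unicodedata

# "a" followed by COMBINING GRAVE ACCENT: two scalar values, one visible letter.
decomposed = "a\u0300"

# NFC composes the pair into the single precomposed code point U+00E0.
composed = unicodedata.normalize("NFC", decomposed)

# NFD goes the other way and splits it back into two code points.
round_trip = unicodedata.normalize("NFD", composed)

print(len(decomposed), len(composed))   # number of code points in each form
print(decomposed == composed)           # plain comparison sees different strings
print(round_trip == decomposed)
```

Both forms render as the same grapheme cluster, which is why comparing strings without normalizing first can be surprising.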

clugr 'sounds' better than clog though (unless you want to evoke pipe somehow)...

clugr evokes more Freddy to me

character…grapheme cluster…


Unicode uses the term grapheme cluster to emphasize the fact that it can be composed of a cluster of codepoints, and perhaps also to distinguish it from the linguistic term grapheme. But ultimately in the context of Unicode I think you could just get away with using grapheme. That's what Perl did when it invented Normalization Form Grapheme (NFG).

You can refer to the definitions of these terms. [1] It's best not to reinvent how these terms are used. There has been far too much of that, already.

[1] http://www.unicode.org/glossary/

PS Be prepared for a long read.

> I like squiggle

Go uses "Runes" which is a pretty unambiguous and memorable term.

Though in Go's case a rune can't hold a ZWJ sequence, as runes are limited to 32 bits.

It's somewhat ambiguous because Unicode has a code block for actual runes which are not the same as the "runes" you're talking about


I believe this name is inherited from Plan 9 (which also happens to be the origin of UTF-8).


A Go "rune" is not the same as a "single rendered/visible character" though, a grapheme cluster/glyph/squiggle is from what I understand.

example: https://play.golang.org/p/jJdS2WRWDXs

I think zwjchar has a chance: it is unambiguous, so when the standard evolves in the future we could still distinguish it. Besides, technical people like precision and cryptic names.

This reminds me of a blog post that I can no longer remember that discusses how Chinese is a lot of arranging several related kanji into a single character to express a new idea (please link it if you can find it).

I would love to see Unicode allow arbitrary combinations beyond those defined using just ZWJ, for more flexibility (e.g. blizzard could be created by adding something like "snowflake x 5", which creates a single character with five snowflakes, without having to create an entirely new character representing blizzard from snowflake + ZWJ + snowflake).

As an aside, my favorite ZWJ magic is black flag + ZWJ + skull and crossbones = pirate flag.

Also see https://en.wikipedia.org/wiki/Blissymbols for more symbol language fun.
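
For the curious, here is a rough Python sketch that takes a ZWJ sequence apart again. It assumes the pirate flag is encoded as black flag + ZWJ + skull and crossbones + variation selector-16; `zwj_components` is a hypothetical helper name, not part of any library.

```python
import unicodedata

ZWJ = "\u200D"   # ZERO WIDTH JOINER
VS16 = "\uFE0F"  # VARIATION SELECTOR-16 (requests emoji presentation)

def zwj_components(sequence: str) -> list[str]:
    """Split a ZWJ sequence and name the first code point of each part."""
    parts = [part.replace(VS16, "") for part in sequence.split(ZWJ)]
    return [unicodedata.name(part[0]) for part in parts if part]

# Assumed encoding of the pirate flag emoji.
pirate_flag = "\U0001F3F4" + ZWJ + "\u2620" + VS16
print(zwj_components(pirate_flag))

# snowflake + ZWJ + snowflake is only the hypothetical "blizzard" from above.
# As far as I know it is not a defined sequence, so a renderer would most
# likely just show two separate snowflakes.
blizzard_idea = "\u2744" + VS16 + ZWJ + "\u2744" + VS16
print(zwj_components(blizzard_idea))
```

The joiner itself carries no meaning; whether a sequence collapses into one glyph depends on whether the font and platform recognize it.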

The notion that Chinese characters describe ideas that are formed from smaller characters that also describe ideas is a nice story, but it's not really the case for the majority of Chinese characters. See the Wikipedia page, Principles of formation section[0].

Many Chinese characters are constructed with two parts of smaller characters, one of which can indicate a general concept while the other provides a vague pronunciation hint. This isn't really good enough to provide you with enough information to guess at either the meaning or the pronunciation if you don't already know the word, but if you do already know the word from spoken language, then the hints might be enough to recognize the character when you read it too.

[0] https://en.wikipedia.org/wiki/Chinese_characters#Principles_...

If you'd like to decompose Chinese characters (e.g. if you're trying to use handwriting recognition and it's /almost/ right), try the Pingtype keyboard. Enter the character into the lower text box, and click Search. It'll decompose, and recompose characters based on the parts. I agree that the meaning is often hard to grasp from the glyphs, and can even be misleading.


Right, but there are still some nice examples, like:

木 = tree

森林 = forest


田 = field

力 = strength

男 = man (because he labours in the fields)

This breaks down very quickly as you get more abstract.

人 = person "rén"

门 = gate "mén"

们 = plurality marker "mėn"

也 = also "yě"

他 = he "tā"

他们 = them "tāmėn"

One can come up with a story for why person + gate is plural, but the inclusion of gate in the character for plurality is really more about the sound of gate.

This is even more interesting because prior to the 1920s 他 was a gender neutral third person pronoun. And in ancient Chinese 也 was probably pronounced differently to today's yě, perhaps more like the ā of tā. A great resource for looking up the history of individual characters is Wiktionary:



Wiktionary is neat because it also shows you pronunciations in several different Chinese languages (Mandarin, Cantonese, Minnan, Hakka etc) and the relationship to Japanese, Korean and Vietnamese, if it's there.

I think 他 is still gender neutral, I put down "he" out of laziness, since the feminine-only form is 她... but you make good points.

Yes totally. I wasn't contradicting GP. The pattern where part of the character indicates the sound is so common that it helps me a great deal when reading Chinese. My speaking/hearing is better than my reading. I might not know how to write a particular character but, if I can guess the sound, then that plus the surrounding context are enough for me to know what the word is.

I think you mean this article on “Yingzi”: http://www.zompist.com/yingzi/yingzi.htm

Yes! Thanks for the link :D

For the next Unicode emoji extension, we might as well just take a word vector model and PCA project it into the Unicode 1 dimensional space. Somebody who is better at linear algebra should comment and tell me how wrong I am.

SignWriting, a writing system for Signed Languages, would need something like this, because rotation and location of the graphemes matter. Imagine the letter a being rotated by 45 degrees and moved to the upper left corner of the drawing area. How would you encode this?

There's a symbol for the torso, for the fist, for touch and for the strength of touch. If you hit your own shoulder with a fist with the knuckles facing upwards, you rotate the symbol for the fist, paint the torso and the fist next to each other and put the touch symbol approximately where the shoulder would be.


It's brilliant, but horrible, since you need somewhere between 300 and 1000 dimensions of floating point word vector semantic space to get good separation. Maybe it's shrunk since I last checked.

Even with less granularity, that's still like 256 bytes per grapheme.

I think this would be great for machine translation though.

It's also similar to forming words from multiple kanji. E.g. like Chinese "penguin" is combined from "business" and "goose" (not completely accurate translation).

Imagine poor kids 100 years from now having to memorize all those emojis and their combinations at school. Or having to know 3000 basic emojis in order to be able to read news.

>Imagine poor kids 100 years from now having to memorize all those emojis and their combinations at school.

I mean, we all know what a "polar bear" is, right? If those kids know how to say "penguin", and they know how to write the parts, then they know how to write "penguin" in Chinese.

Meanwhile, the parts in "penguin" aren't used in any other word. The "pen" isn't a writing utensil. And what's a "guin"? If you didn't learn how to spell beforehand and were suddenly asked to write "penguin", maybe you'd write "pengwin" or "pengwen", or mishear it as "pengwing" (because birds have wings so it makes sense, right?) English writing is brute-force memorization of spelling with some patterns that often don't hold, just like Chinese is brute force memorization with some patterns regarding sound or meaning that often don't hold.

And we can see that all English speakers still struggle to spell some words that they don't encounter often, and some even struggle to spell words they do use often.

No, the "pen" in "penguin" means "head" - it's from Welsh. English spelling is a lot easier if you know the languages it was derived from or influenced by.

An advantage of non-phonetic spelling is that it doesn't privilege any one accent over any other, so allowing a polycentric acrolect with each variety picking up accent and vocabulary from the local dialect, but maintaining mutual comprehension in writing.

>English spelling is a lot easier if you know the languages it was derived from or influenced by.

Knowing a bunch of languages in order to have a proper context for English spelling is surely a higher threshold than memorizing the characters in your own language.

It’s still debated if that is the correct etymology. A different hypothesis has “penguin” deriving from the Latin “pinguis” (fat).

Certainly in Czech the two are related:

tučňák = noun, penguin

tučný = adjective, fat

I don't know if that helps determine the origin of the English word, but it's definitely funny to think that there’s a nation which calls penguins “fatties” :D

The Chinese character "企" has multiple meanings. One is "to stand on tiptoe".[1] So "企鹅 (penguin)" means "standing goose".

[1] https://en.wiktionary.org/wiki/%E4%BC%81

The character 企 also looks like a penguin by itself.

I wish it was that simple. Though some characters are, the majority have evolved through usage along with pronunciation. As a result, similar-looking characters can mean very different things and be pronounced differently. Mastering Chinese is about memorizing the 2000-3000 most used characters.

This reminds me of a game I made! Matching emojis to produce the given emoji prompt. I didn't use any complex math tho, and my list of combinations is arbitrarily made by hand.


Haha, juice box = apple + hammer.

dang, it would be cool if mods could mark a post as "allow emoji here".

I think the HN codebase is mostly frozen.

Dang has alluded to it being worked on, so long threads won't need to be paginated anymore.

Only tangentially related, but it's a question that has bothered me for a while: Is there a reason why composite math symbols aren't part of Unicode? Things like series, limits or integrals look like a really good fit for this kind of composition, and you could get rid of MathJax for 80% of its usage.

I gotta think it's because all the math you mention cannot render within a normal line height. That's not the end of the world for rendering, but maybe that counts as a bridge too far in Unicode? Or maybe it just boils down to "because it doesn't yet" and as soon as someone makes a good proposal it'll happen

Unicode already has a bunch of things that extend the height like those "corrupted text" generators.

Those are a ton of combining accents put together. Crucially: they don't make the line any taller—that's what makes them so hard to read! An integral should be a little taller than the line height but you still want the integral's limits to be legible. Maybe I'm being too prescriptive; ∫₀¹𝑥 𝑑𝑥 would read ok if that 1 were stacked on top of the 0 without any other changes.

My apologies if HN's formatter strips the special characters (edit: it didn't yay!): I wrote integral sign, subscript 0, superscript 1, math italic x, thin space, math italic d, math italic x. And maybe the idea is we should use words like "the integral from zero to one of x with respect to x." IDK.
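
A quick Python sketch that rebuilds that string from the description above and prints each code point's official name. The exact code points are my assumption from the description (for example U+2009 for the thin space); the standard `unicodedata` module supplies the names.

```python
import unicodedata

# Assumed code points: integral sign, subscript 0, superscript 1,
# math italic x, thin space, math italic d, math italic x.
integral = "\u222B\u2080\u00B9\U0001D465\u2009\U0001D451\U0001D465"

names = [unicodedata.name(ch) for ch in integral]

for ch, name in zip(integral, names):
    print(f"U+{ord(ch):04X}  {name}")
```

Every piece is an ordinary code point laid out left to right, which is why the limits sit side by side instead of stacking.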

There is a proposal by the Julia language team


Everybody, if you're a somebody in your field, please go sign this.

I guess the question is how much typesetting do you want to stick into unicode? It works OK for emoji because the results are always simple: it's just another individual emoji, with the same size & characteristics as the original.

That said unicode is not free from typesetting weirdness, see the character ﷽.

consider the following:


As noted, it's a bit more complicated than just putting things above or below a symbol. Also, if you look closer at the output from MathJax, you'll notice things like the setting of x+y=z has slightly different spacing around the + than the =. For that matter, properly typeset mathematics will also set the spaces around the colons differently in, e.g., f: X → { y ∊ *R* : |y| < 1 }. Math typesetting is not a simple matter. I'd also note that the existence of, e.g., ² or ₆ in Unicode is more a result of allowing the mapping of legacy encodings and that there is a preference in Unicode that in general, superscripts and subscripts not be handled through encodings but rather through the layout engine of the application which is why there is no superscripted decimal point or other such characters.

1) Combining emoji compose into a finite and reasonably small set of combined symbols. The set of mathematical formulas isn't finite in the same kind of way.

2) Layout of mathematical formulas is reasonably complicated. It doesn't make sense to force that complexity to be included in every text layout engine.

There is UnicodeMath[1], which was developed by Microsoft and is the default representation used by the Word equation editor.

It looks like this:


f(x)=∑_(k=-∞)^∞〖c_n e^(ⅈkx)〗

∫_(-∞)^(+∞)〖exp⁡(-a/2 x^2) ⅆx〗=√(2π/a)

I find it quite readable, even for quite complicated formulae like the above. You can also replace the unicode symbols with Latex-style escape strings, like \sum or \below.

[1]: https://unicode.org/notes/tn28/UTN28-PlainTextMath-v3.1.pdf
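
For comparison, a rough LaTeX transcription of the same two formulas. This is my own reading of the UnicodeMath above, not output from Word, and the c_n coefficient is kept exactly as written there.

```latex
\[
  f(x) = \sum_{k=-\infty}^{\infty} c_n e^{ikx}
\]

\[
  \int_{-\infty}^{+\infty} \exp\!\left(-\frac{a}{2} x^2\right) dx
    = \sqrt{\frac{2\pi}{a}}
\]
```

The UnicodeMath version is shorter mostly because it leans on operator precedence and the invisible 〖 〗 grouping brackets where LaTeX needs explicit braces.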

Thank $deity for emojis making sure most user-facing code ends up handling extended code points correctly, and teaching devs about the multibyte encodings.

I imagine in a world without them existing a lot of the non-ascii paths would not be regularly used.

It is so stupid that Emojis render differently on different platforms. In particular there are subtle differences among the facial expression emojis that make them pretty useless if you’re not sure the recipient is on the same platform that you’re choosing one on.

It's a reminder that these are glyphs, on the same plane as words, aimed at portability.

I think what you are looking for is stamps and png/gifs, which are also supported in most relevant chat platforms these days.

An encoding for words that allowed rendering as different words on different platforms would also be stupid.

  MacOS: grimace
  Windows: frown
  Android: scowl
  Samsung: growl
  Firefox: angry face
That’s pretty much what the emoji situation is today. I don’t understand what the actual use case is for the requirement that emojis differ across platforms.

Yes, I realize they are widely used, but they are widely used despite this stupidity, not because of it.

Do you have any examples of current emojis that clearly have a very different meaning on different platforms? What you're saying might have been true 5+ years ago, but over time emojis have become more and more similar.

Here is just one random article from Emojipedia about the history of the "folded hands" emoji: https://blog.emojipedia.org/emojiology-folded-hands/ There are many more examples.

In addition to that, almost all emoji keyboards now autocomplete the emoji based on standard names, so if you search for "disappointed" on most any emoji keyboard, you will get the same face.

For reference, here is the current official emoji set, including the standard names and images showing how they render on different platforms: https://unicode.org/emoji/charts/full-emoji-list.html

The famous example is the pistol emoji, which is rendered as either a real gun or a water gun depending on platform. Lots of scope for misunderstanding there!

This has not been true since early 2018. See: https://blog.emojipedia.org/all-major-vendors-commit-to-gun-... There was only a couple of years when vendors rendered it differently, and they worked toward a consensus, as they have for many other emojis.

Even if there was no vendor consensus, I'm not sure what the misunderstanding could be with this particular emoji. There is only one pistol emoji, and regardless of whether it is rendered as a water pistol or a revolver, it still is used to represent the concept of a pistol. There are better examples of emojis that used to be displayed with a facial expression or hand gesture that had a relatively different meaning depending on the platform. For example face with rolling eyes, person tipping hand (information desk person) and so on.

It still rendered as a revolver on the phone I used to write that comment. Guess I'm still on 2017 Android. Please do not invite me to any water gun parties.

> Do you have any examples of current emojis that clearly have a very different meaning on different platforms? What you're saying might have been true 5+ years ago, but over time emojis have become more and more similar.

Well... yes? That's because avalys is obviously correct. The "solution" defined by the Unicode consortium is so spectacularly stupid that everyone has unanimously agreed to move away from it by synchronizing their images, because it makes no sense to send an image unless you know what it will look like.

Rendering differently on an older platform is also an important difference (consider those who can't afford the latest phones). E.g. the gun one has a very different meaning on (some) older platforms from newer ones.

I think this is basically what's happening on the CJK plane with glyphs that have the same code, but slightly different representations depending on the font (e.g. Chinese or Japanese), with potentially wildly different meanings in the languages where they are used.

To your point, this is an aspect where integrating and regularly adding emojis to Unicode pushed a very technical system under popular attention, and attracted a lot of people from unrelated backgrounds who now have to understand how it works and what its goals are in the first place.

> I think this is basically what's happening on the CJK plane with glyphs that have the same code, but slightly different representations depending on the font (e.g. Chinese or Japanese), with potentially wildly different meanings in the languages where they are used.

It was even more dumb when they did it for CJK. Unicode is now neither fish nor fowl; you can't rely on things to look the same in different places, but you can't rely on them to mean the same either; there's no proper separation of concerns because they decided it's fine for the meaning of a character to change based on the font you're using.

Yes, it's not an ideal solution in many respects.

I kind of get why we got here, as encodings were a quagmire for a long time, and coming to a real clean, everybody's happy solution with unicode looked completely unrealistic.

We now have way better compatibility, got to settle for one encoding in most cases, and the annoyances are for now somewhat manageable (if/when China opens more to the global net, it might be a different story). Installing fonts is still easier than adding encoding compatibilities.

On a serious note: Yes, I agree. On a comical note: Reminds me of the great "Cheeseburger-gate" scandal of 2017, in which Google put the cheese under the burger patty instead of on top of. https://www.theverge.com/2017/10/30/16569346/burgergate-emoj...

Problem is that each tech company uses a proprietary set of emoji so they couldn't standardize even if they wanted. Many instant messaging apps work around this by using their own emoji set for all users but this is also often annoying especially when the emoji are ugly.

The solution is to allow users to use their own svg libraries when sending messages, and the ability to manage these libraries by copying inline symbols from messages sent by others.

This would promote an open culture of creating inline symbols. E.g. where is our "covid" symbol, and why do we have to wait for the Unicode consortium to define it for us, and app makers to implement it??

I like telegram's sticker system best. User defined images which you can organize and store in the UI for easy access.

Please don't. That would be the Tower of Babel. Afterwards we are all just going bla bla and don't understand each other at all, because everyone invents their own funny secret emoji language.

> Afterwards we are all just going bla bla and don't understand each other at all, because everyone invents their own funny secret emoji language.

Just like we all go "bla bla" on the phone because we can choose our own sounds?

Yes, we could have wide support for inline svg by now.

Relatedly, I’ve always rather enjoyed the talk “Love Hotels & Unicode”[1]

[1] https://web.archive.org/web/20120404020043/http://www.reignd...

If you want to see how an emoji will render on various platforms, or generally search what's available, Emojipedia is good for that: https://emojipedia.org/kiss-woman-woman-medium-dark-skin-ton...

Is there a good reason these composite emojis aren't all single code points? They have to at least mostly be separate images to be rendered, so it's not like it's an impossible combinatorial explosion. I remember there are only on the order of 1000 valid combinations.

I would assume for backward compatibility. If you don’t have the font with the updated emojis you can still get the reference.

There was a flap when people realized that you can combine the zero-width joiner with the "Prohibited" symbol to put that on top of anything you don't like.

These kind of remind me of ligatures. Not exactly the same since (as far as I'm aware) ligatures are always based on a direct blending/combination of the two characters' shapes, but still similar.

I think programming ligatures kind of break this assumption - particularly != becoming the not equals sign.

I'm working on a game for younger kids using pictures/emojis for "universal" conversation rather than text, so this gave me some good ideas. Thanks!

I've had success playing Concept with my 5yo. https://boardgamegeek.com/boardgame/147151/concept

Seconding the suggestion, concept is a great game for people of all ages learning a language. It's also a good game overall :)

Yes, I should note that I have also had substantial success playing Concept with friends, other family, and co-workers.

Although we've never actually kept score, through any of that.

But as a software developer, it’s always fun to think about edge cases, and squeezing almost 5KB into a 280-“character” tweet is fun

This makes me wonder if anyone has created a version of base64 that uses the vast, sprawling space of unicode to take advantage of these glyph-count-based restrictions.

If they have, I hope they called it uuuniencode.

There are several of these. base65536 is the one that seems to pop up the most often on HN, although base2048 is more useful for Twitter. On the GitHub page the dev helpfully links to the various implementations: https://github.com/qntm/base2048

This is what you're looking for:


It can store 385 bytes per tweet. This link includes a bit more technical explanation of how Twitter counts characters towards the limit. Apparently, using the entire range of unicode characters does not improve compression because of the double weighting of emojis and other characters as described in TFA. It links to a base131072 encoding which can only store 297 bytes per tweet.

Whoah, it's a crafting game.

Emojis need more serious attention. A lot of online conflicts start from miscommunication due to lack of facial expressions. To the point that it's driving a social crisis and division in the real world.

As a recent observer of a discussion wherein the dramatic differences between single, double, and triple-dot sentence terminators were argued, I think concerns about emoji facial expressions might be valid, but resolving them would still leave us in a world of crisis and division..

I can't stand ".." being used in communications. It's not a thing, and because of that it's incredibly ambiguous!

What was the outcome of that argument?

I think the main takeaway for the majority of participants was that I'm old and don't know how to talk on the Internet.

Humans came up with a way to express musical notes on paper. To me that seems much more challenging. The list of common emotions is not that numerous. We need some shortcut, a compact way of expressing them in writing. It would go a long way to score, or mark up, our words with some sort of emotion meter. Perhaps that's what music is to singing. Just maybe a visual way of doing that. Emojis are hard to use.


I'm not disagreeing, but note that humans have not come up with a way to express musical notes in Unicode or Emoji. The second vertical dimension is critical for music notation.

I'd suggest that the same idea applies to nuanced facial expressions. You could certainly devise a set of glyphs to stand in, but they would have to be learned by everyone.

The failures we see today are attempts at rendering expressions on drawn faces -- and to be fair, even a high res photographic still image of a real human making an expression could be easy to misinterpret. Especially across cultures and subcultures. My 15-year old niece has very strong opinions on how many periods I use to end sentences..

I think "fixing" emoji is probably a lost cause. But I am biased since I don't care at all. The most interesting and amusing thing I know about emoji is that when Apple changed the "gun" from a realistic-looking Glock/etc to a toy squirt gun, all the ingrates who used the image in a threatening manner ended up looking silly to iOS users. Android followed quickly.

I support the same sillification of negative and angry facial expression emoji as well, FWIW, and I think it's probably hopeless to try to cram nuance into any of them. Fortunately, we still have words.

compare, in order of relative "uh-oh!":





I also find (thumbs up emoji) and (ok emoji) to be somewhat ambiguous. I see them as potentially sarcastic. Really, sarcasm is the biggest barrier to effective online communication.

Apparently you cannot use emojis on HN.

Don't forget:




et cetera et alii ad nauseam

This seems like much too large a problem to tackle with emoji.

Even as complex as emoji are becoming, they really still don't address the issues behind online miscommunication. And I really doubt they ever could.

Maybe people need to learn how to express themselves precisely

on the other hand people still for some reason believe that "lol" only means "laughing out loud" which's lol itself.

so maybe it's impossible?

The Emoji department of the unicode consortium is really just three letter agencies making sure text parsing and font rendering libs will continue to have exploitable bugs. ;-)

I wonder where is the "substitution" of the codepoint sequence (from a sequence to a single codepoint) done? Very concretely and practically: is it the font doing the substitution? What else if not? How do they decide that a sequence should be substituted?

Even our character encodings have turned into bloatware.

Conspiracy theory:

1. specification authors want to make sure the extended grapheme cluster algorithms are widely adopted so that implementations can correctly deal with devanagari

2. they notice no one gives a shit about brown people and their writing systems

3. combining emojis requiring the use of the same underlying algorithms were popularised in order to push the adoption

I'm on Linux right now and, sadly, couldn't see the magic character properly. :(

Anyone know if I'm missing anything, or is there no support for 13.1 yet? My standard routine is to just install every noto font I can find (noto-fonts noto-fonts-cjk noto-fonts-emoji noto-fonts-extra).

It works for me with Firefox on NixOS using Noto Color Emoji, from, I'm assuming, noto-fonts-emoji.

Works here on a fresh (yesterday) Linux Mint install.

Same on fresh, stock openSUSE Tumbleweed with Gnome 40

Really enjoyed, thank you for sharing. It was interesting to learn how emojis are composed from component parts. I'm wondering where that will lead to in the future? Will we have a vast built-in library of emojis of ever increasing complexity?

From the title I thought this is similar to word arithmetic based on Word2Vec representations, where for example we approximately have:

King - Man + Woman = Queen

Or Brother - Man + Woman = Sister

And some sexist variants of these too. But the article is on a different topic.
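
A toy Python sketch of that kind of word arithmetic. The vectors are made up by hand (three dimensions standing for royalty, sibling and gender) purely for illustration; real Word2Vec embeddings are learned from text and have hundreds of dimensions. `analogy` and `cosine` are hypothetical helper names.

```python
import math

# Hand-made 3-D vectors: (royalty, sibling, gender). Not real embeddings.
VECTORS = {
    "king":    (1.0, 0.0,  1.0),
    "queen":   (1.0, 0.0, -1.0),
    "man":     (0.0, 0.0,  1.0),
    "woman":   (0.0, 0.0, -1.0),
    "brother": (0.0, 1.0,  1.0),
    "sister":  (0.0, 1.0, -1.0),
}

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(a, b, c):
    """Return the word closest to a - b + c, excluding the three inputs."""
    target = tuple(x - y + z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c]))
    candidates = [word for word in VECTORS if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(VECTORS[word], target))

print(analogy("king", "man", "woman"))
print(analogy("brother", "man", "woman"))
```

With real embeddings the result is only approximately the expected word, which is why the nearest neighbour by cosine similarity is used instead of an exact match.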

This reminds me of a fun iOS puzzle game from a few years back called "Alchemy" - you started with icons for earth, air, fire, and water and then combined them in a similar way to the title example to create others.

Yes it reminded me of that too. I thought it was a repost - pretty sure its original web version was posted on HN. I spent a lot of toilet visits on that game... became pretty unfun once the board is super full, though.

I thought this was going to be about word embeddings!

That Twitter limit is interesting. I wonder how much data can fit in 140 emojis, if all existing and valid combinations are allowed?
