It was just a tweak to emoji characters to mark them all as East Asian Full Width instead of Narrow or Ambiguous so that they displayed correctly when using a fixed width font in a terminal console. This probably only matters if you like to use emoji filenames (you mad person), but it felt like a wart so I reported it & had a short back and forth with the chair of the emoji-related subcommittee which resulted in a proposal which was eventually accepted by the committee into Unicode 9.0. The committee were great: took my tiny bug report seriously, wrote huge long treatises to justify the change & eventually voted it into the standard.
(This was pretty much my peak geek achievement of 2016 so far :) )
It's not that I use emoji filenames, it's that I deal with real-world natural language text all the time, including at the console.
(In terms of compatibility, my text-justifying function is going to stop working correctly for the period of time between when gnome-terminal updates to Unicode 9 and when Python 3.x does. Still worth it.)
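For anyone wondering what a text-justifying function has to do with East Asian Width, here is a minimal sketch, assuming Python's standard `unicodedata` module. The names `display_width` and `justify_right` are hypothetical helpers, not anything from gnome-terminal or Python itself, and the answer for emoji depends on which Unicode version your Python build ships.

```python
import unicodedata

def display_width(text: str) -> int:
    """Estimate how many terminal columns `text` occupies.

    Assumption: Wide ('W') and Fullwidth ('F') characters take two columns,
    combining marks take none, and everything else takes one.
    """
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks are drawn over the previous character
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width

def justify_right(text: str, columns: int) -> str:
    """Left-pad `text` so it ends at column `columns` (hypothetical helper)."""
    return " " * max(0, columns - display_width(text)) + text

print(display_width("abc"))
print(display_width("日本語"))
# The Unicode version this Python build was compiled against:
print(unicodedata.unidata_version)
```

If the terminal and the language runtime disagree about the Unicode version, the padding computed here will be off by one column for every character whose width class changed.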
This is good, actually, because the meaning of a string operation should be consistent when run on the same version of Python.
(If only this applied to the "default encoding". The default encoding should be UTF-8, not whatever you get by asking the user's likely-misconfigured locale. As it is, you can't rely on the default encoding if you want your code to work consistently.)
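A small sketch of the usual workaround, assuming you control the code that opens the file: pass `encoding="utf-8"` explicitly instead of relying on the locale. `roundtrip` is a hypothetical helper used only for illustration.

```python
import locale
import os
import tempfile

def roundtrip(text: str) -> str:
    """Write `text` to a temporary file and read it back, always as UTF-8."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.txt")
        # Explicit encoding: the result no longer depends on the user's locale.
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        with open(path, "r", encoding="utf-8") as f:
            return f.read()

# What open() would pick if no encoding were given (varies per machine):
print("locale default:", locale.getpreferredencoding(False))
print(roundtrip("¼ ỳ ｼ") == "¼ ỳ ｼ")
```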
This comment -- https://github.com/JuliaLang/julia/issues/3721#issuecomment-...
Although I had never worked with the Unicode Consortium, I [submitted a proposal] for an international symbol for an observer and it was eventually accepted.
(Top result: http://www.cbc.ca/news/trending/rifle-emoji-dropped-unicode-...)
It's one thing to have absolute, iron-fisted control over your own platform - it's another to intentionally seek to limit people's self-expression on other platforms by influencing the standard in this way.
One might have a right to do something and yet be wrong to do it.
Apple had every right to do what they did, but they were completely, totally and undeniably in the wrong to do it. Everyone associated with their action should be ashamed. Honestly, they should all resign: their behaviour demonstrates that they have no business being associated with this sort of work.
Also, if you think Apple was wrong, you must also think that Microsoft was (they voiced support), and everyone else at the meeting who agreed with the move. As the article says,
> Davis confirmed to BBC News on Monday that "there was consensus to remove" the emojis, but that he couldn't comment on the details.
So it's clearly not just Apple that thought this was the appropriate move.
There are literally thousands of characters that Apple doesn't encode for their platforms. I haven't heard whether Apple will be supporting:
Osage, a Native American language
Nepal Bhasa, a language of Nepal
Fulani and other African languages
The Bravanese dialect of Swahili, used in Somalia
The Warsh orthography for Arabic, used in North and West Africa
Tangut, a major historic script of China
>The two characters will still be part of the Unicode spec, but they'll be classed as black-and-white "symbols" instead of regular emoji
A bit early.
While there was considerable uproar over Emoji and there still often is over yet another fifteen symbols that everyone thinks no one would ever need or use, the bulk of the Unicode character set is still scripts for human languages. And some of those are only relevant for a very small minority, say, archaeologists. But that's fine. There's enough space, we're nowhere near to running out and Unicode enabled all sorts of cool things in computing that simply were not possible before or only with awful hacks and workarounds.
Could you give an example? I don't know anything about this stuff.
That's because every byte stored in the file, for example byte number 188, either means "¼" (as it does in ISO/IEC 8859-1, aka. Latin-1 or ANSI), or it means "ỳ" (as in ISO/IEC 8859-14) or "ｼ" (in JIS X 0201, one of the many Japanese encodings that were devised over the years.)
How do you know which encoding a certain file uses? In general YOU CAN'T and this was the source of many problems and "solutions" which caused even more problems over the years.
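The byte-188 example can be reproduced with Python's built-in codecs. This is only a sketch: `decode_byte` is a hypothetical helper, and `shift_jis` is used as a stand-in for JIS X 0201 on the assumption that its single-byte katakana range matches.

```python
import unicodedata

RAW = bytes([188])  # the single byte 0xBC

def decode_byte(codec: str) -> str:
    """Interpret the same byte under the given legacy encoding."""
    return RAW.decode(codec)

for codec in ("latin-1", "iso8859_14", "shift_jis"):
    ch = decode_byte(codec)
    # Print the code point and its name rather than the glyph itself,
    # so the output does not depend on the terminal's encoding.
    print(f"{codec:<11} U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Nothing in the byte itself says which of the three interpretations was intended, which is the whole problem.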
Well then, how did you mix symbols from different alphabets, say in a dictionary or in a post that talks about them, like this very post? YOU COULDN'T, short of doing ugly hacks and other subterfuges, like using GIFs for all foreign characters.
Unicode gave a distinct number (or "codepoint") to every character and symbol known to man (within reasonable limits) and this allowed a lot of things that we take for granted nowadays, including this very post, where I just copied and pasted various symbols from their Wikipedia pages and just expected it to work.
Unicode defines normalization algorithms (is é its own character, or e with a modifier character?).
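To make the é question concrete, here is a short sketch using the standard `unicodedata.normalize` function. `same_text` is a hypothetical helper name, and comparing under NFC is just one reasonable choice.

```python
import unicodedata

composed = "\u00e9"     # é as a single code point
decomposed = "e\u0301"  # e followed by COMBINING ACUTE ACCENT

def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(composed == decomposed)           # raw code points differ
print(same_text(composed, decomposed))  # equal once normalized
print(len(unicodedata.normalize("NFD", composed)))  # length after decomposing
```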
I can have a document which combines English, Russian, Arabic, and Chinese, and expect it to be readable and editable by many different tools.
English combines with top-down Chinese, and English combines with right-to-left Arabic, but top-down Chinese and right-to-left Arabic don't combine properly in the same document using Unicode -- the Arabic will be written bottom-up instead of top-down when embedded in the top-down Chinese.
I meant something simpler, like: "The word 'computer' in English is 计算者 [jì suàn zhě] in Chinese, Компьютер in Russian, and حاسوب in Arabic."
Try that without Unicode.
It's of course possible with TeX, and no doubt other solutions. Which is why I added "and expect it to be readable and editable by many different tools".
(As a real-world use case, look at Knuth's "The Art of Computer Programming" and see how he credits people using their full names, in their own written language.)
計算機 is a calculator, not a computer, in Taiwan: https://tw.images.search.yahoo.com/search/images?p=計算機
Compare with Baidu in the PRC: http://image.baidu.com/search/index?tn=baiduimage&ie=utf-8&w...
Edit: Oops, I didn't even realise that the parent poster asked about 計算著. That one I've never seen.
In this way, new FAQs and other updates spread to all the markets easily.
It's vastly easier to do this stuff when all the documents use the same text encoding. Even if no one can read everything, the fact that everything uses the same encoding means that any pair of languages you can read are technically readable.
2BD2 GROUP MARK 2015-May-05 Accepted 2016-May-29 Stage 6
The bitcoin symbol is in there too.
>The scope of the Unicode Standard (and ISO/IEC 10646) does not extend to encoding every symbol or sign that bears meaning in the world.
>This list has been round and round and round on this -- regular as clockwork, about once a year, the topic comes up again. And I see no indication that the UTC or WG2 are any closer to concluding that bunches of icons should start being included in the character encoding standards simply on the basis of their being widespread and recognizable icons.
>Where is the defensible line between "Fast Forward" and "Women's Restroom" or "Right Lane Merge Ahead" or "Danger Crocodiles No Swimming"?
Now it looks like they add whatever somebody thinks of. I guess it's related to the liberation from the BMP.
Until Unicode has a half-star character, it won't even be able to encode the average newspaper.
Recent article on the Unicode/emoji debate:
In all seriousness, I'm not sure emoji really belong in text encoding. Even though it's more convenient, based on where they're most frequently used I don't think they need to be universal.
1) everybody uses them on their phones, they're in Unicode, consistent and compatible between devices and messaging programs. In the far flung future, researchers will be able to study their linguistic role in communication, confident in understanding what the characters were.
2) everybody uses them on their phones, they're proprietary fonts and codepoints (in the Unicode private use area if you're lucky, just random data if you're not), there's no consistency between phone models, manufacturers, or cell networks. Future researchers can pound sand.
We were at #2 pre-Unicode. It was a goddamn mess, especially in Japan. Lord knows why anyone would prefer it. There's no value in being a snob about what kinds of incredibly frequently used characters we think are Worthy of inclusion, imo.
3) People who love colourful images will use stickers in Facebook Messenger, LINE, Viber, and soon iMessage. I'm sure WeChat has them too.
It's basically like 2), except we've moved from proprietary codepoints to proprietary protocols.
I don't mind characters like and or even good old ︎ (which has always been too tiny for its own good). These work in black and white, in different artistic styles, and they're a fairly limited set.
But now we're going down the road where we get new stuff like tacos and unicorns every year. And even though Unicode is an industry standard, the pictures need to look like Apple's bitmaps to avoid confusion, and the Unicode standard changes so often that you have to manually keep track of who can already see and whose computer/phone/browser/messenger software is too old.
> characters like and or even good old ︎ (which
Should have been:
> I don't mind characters like ((yellow smiling Emoji)) and ((thumbs up Emoji)) or even good old ︎((pre-Emoji Unicode smiley)) (which has always been too tiny for its own good). These work in black and white, in different artistic styles, and they're a fairly limited set.
> But now we're going down the road where we get new stuff like tacos and unicorns every year. And even though Unicode is an industry standard, the pictures need to look like Apple's bitmaps to avoid confusion, and the Unicode standard changes so often that you have to manually keep track of who can already see ((upside-down smiling Emoji)) and whose computer/phone/browser/messenger software is too old.
Text symbols (as opposed to emoji) have different rules. Basically, the symbol needs to be used in "running text" (i.e. normal text), like "containers with [recycling symbol] can be recycled" or "he bid 2[club]". Traffic signs for example are not normally used in the middle of text, so they aren't encoded in Unicode. To get the Bitcoin symbol encoded, I needed to show that it was used in text, not just as a standalone icon. The full rules for symbols in Unicode are at http://www.unicode.org/pending/symbol-guidelines.html
For the snowman in particular, it was added to Unicode because it was a symbol used in the character set for Japanese TV broadcasts, see http://www.unicode.org/L2/L2007/07391-n3341.pdf
TL;DR: Don't argue "Why does Unicode have a poop emoji but no symbol for X?" - the rules are totally different for emoji and symbols.
Edit: does HN strip out arbitrary Unicode characters now? I originally had Unicode characters in place of [recycling symbol] and [club], but they disappeared when I submitted.
The snowman, on the other hand, is a weather symbol for snow, I assume. It appears alongside other symbols for meteorological phenomena, so I imagine was added around the same time and with similar reasoning: http://www.fileformat.info/info/unicode/block/miscellaneous_...
They are mostly in Unicode for use in SMS, but there are plenty of use cases in other forms of text.
Heck, I'd be unhappy. I love adding emoticon and emoji and fun things to my emails.
So why don't we pick a very good set: perhaps every letter in every language in common use for the past 200 years? Then, for the oddball symbols that someone wants to mix in text, there can be some kind of SVG-like convention. This allows publishing textual information without requiring that every device maker updates their device to support a 1-off symbol.
The main purpose of Unicode is to encode the information. How the information is turned into its visual counterpart is outside the scope of unicode. For what it's worth this could be done by linking unicode code points to matching SVGs in a document. Wait, exactly that is already a W3C standard: https://www.w3.org/TR/SVG/fonts.html
Or, put another way:
'We have an unambiguous, cross-platform way to represent “PILE OF POO” (), while we’re still debating which of the 1.2 billion native Chinese speakers deserve to spell their own names correctly.'
> undue effort on every computer maker, ect, to keep up.
The effort to update the font files every few years? Unless you insist on supporting a new Unicode version the second it comes out, I don't see the big effort here? Of course there is effort for font makers, but this is quite centralised.
So which is it? Does each code point represent a visual image? A semantic meaning? Both? It depends? Something else?
I've tried to decipher that on my own and only learned that the answers to these sorts of questions are complicated, because it's very complicated to represent all written human language via one set of rules.
So I know some of the answers to my questions above, but I'm hoping someone with real expertise can provide the fundamental rules/policies - if there are any.
Look it's pretty simple, every code point represents a semantic meaning, except for:
1. those characters who also encode the width of their visual image (U+FF00..FFEF)
2. the one that means 'unknown' (U+FFFD)
3. those characters that change their visual representation depending on their position in the word (U+FB50..U+FDFF,U+FE70..U+FEFF)
4. those that change the visual image of another code point (U+FE00..U+FE0F)
5. those characters that have a visual image as their semantic meaning (too many to list)
6. those that are designated to have no semantic meaning at all (U+FDD0..U+FDEF)
7. those that have a meaning only in pairs (U+D800..U+DFFF)
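The exceptions listed above can be poked at with Python's `unicodedata` module. This is a rough sketch: `describe` is a hypothetical helper, and the sample code points are simply the first one from each quoted range.

```python
import unicodedata

def describe(cp: int) -> str:
    """Return code point, general category, name and decomposition."""
    ch = chr(cp)
    name = unicodedata.name(ch, "<no name>")
    decomposition = unicodedata.decomposition(ch)
    return f"U+{cp:04X} {unicodedata.category(ch)} {name} {decomposition}".rstrip()

# One sample each: fullwidth form, replacement character, Arabic
# presentation form, variation selector, noncharacter, surrogate.
for cp in (0xFF21, 0xFFFD, 0xFB50, 0xFE00, 0xFDD0, 0xD800):
    print(describe(cp))

# U+FFFD is what a lenient decoder substitutes for bytes it cannot interpret.
print(b"\xff".decode("utf-8", errors="replace") == "\ufffd")
```

The category column is where the oddities show up: surrogates report `Cs`, noncharacters report `Cn`, and the fullwidth and presentation forms carry a compatibility decomposition back to the ordinary letter.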
"every code point represents a semantic meaning" is completely consistent with the notion that some code points e.g. have differing visual representation depending on their position in the word.
Well the answer is clear: each code point represents one visual image, to which is associated one or more meanings.
For CJK characters, they unified all semantically similar han-characters, even when they have visual forms that are quite different between Japanese, Chinese and Korean.
If you want to write Japanese and Chinese in the same document, you need to mark up the section to tell the system that renders it, to render different visual forms for similar codepoints depending on whether they are used in Japanese or Chinese.
This isn't true. 青 and 靑 are the same character written differently; they have their own codepoints. Ditto for a huge number of simplified Chinese characters; 语 is mainland Chinese and 語 is the same character in Japanese.
I wouldn't know how to show you examples here, as 直 and 直 will display the same since they have the same code point, but a different number of strokes in Japanese and Chinese.
Han unification is generally seen as a bad choice in retrospect, but it was something Unicode had to do when it looked like 2^16 codepoints were all they were going to get.
Han characters that are traditionally viewed as variants of one another, or that are simplified from more complex logograms (such as 龜, which was simplified into 亀 in Japan and 龟 in mainland China) tend to have different codepoints, but the stylistically different ones usually belong to the same codepoint.
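A quick way to check whether two Han characters were unified is to compare their code points. The sketch below assumes nothing beyond built-in `ord`; `code_points` is a hypothetical helper name.

```python
def code_points(text: str) -> list[int]:
    """Return the code point of every character in `text`."""
    return [ord(ch) for ch in text]

# Variant forms of "turtle" mentioned above, and the 青 / 靑 pair
# from earlier in the thread.
for group in ("龜亀龟", "青靑"):
    print(" ".join(f"U+{cp:04X}" for cp in code_points(group)))
```

Distinct numbers mean the variants were encoded separately; a purely stylistic difference would show the same number and be left to the font.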
> the stylistically different ones usually belong to the same codepoint
Fair enough. Do you happen to know why 青 and 靑 weren't unified?
Also, why are you doing that check? Is it to see if something is lowercase? If so, your check will get the wrong answer for lowercase letters like å.
Unicode does have a way to check if something is uppercase/lowercase, when that distinction exists. This is in UnicodeData.txt.
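As a sketch of the difference, assuming Python, whose `unicodedata` module exposes the general category from UnicodeData.txt: the range test misses å, while the category lookup does not. Both function names are hypothetical.

```python
import unicodedata

def is_lower_ascii_range(c: str) -> bool:
    """The range check: only true for the 26 unaccented letters a-z."""
    return "a" <= c <= "z"

def is_lower_unicode(c: str) -> bool:
    """General category 'Ll' (Letter, lowercase) from the Unicode data."""
    return unicodedata.category(c) == "Ll"

for c in ("q", "\u00e5", "\u0131", "\u00c5"):  # q, å, ı, Å
    print(f"U+{ord(c):04X}", is_lower_ascii_range(c), is_lower_unicode(c))
```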
You might try asking an old IBM programmer just how "fine" they felt dealing with EBCDIC...
But how serious is the problem? How many times do you need to test if a given character is one of the 26 allowed letters of the English alphabet, and where you implement it by testing it against the range?
Typically you write it as "islower_english(c)" once, and be done with it. Is that really hard?
If you do think that's a serious problem, then what of those programmers who need to test for lowercase letters in "España", "München", "Diyarbakır", and "façade"?
We've broadly moved beyond that, but there's still value in grouping sets together in a way that makes certain kinds of frequent tests less computationally expensive than they would be if codepoints were randomly distributed.
EDIT: Plus, if it were that important, IBM could implement the function in hardware. (Perhaps they did.)
(I don't know anything about Unicode, so maybe it already has that.)
Programmatically, it is much easier to say "does a character lie between 0x12 and 0xBC" than to create a function like `isSymbolForTrafficInEurope()`
(And relevant to my country, "Swedish characters lowercase" would map to latin letters lowercase + åäö.)
Now, language-specific subsets¹ of those are a bit iffy to deal with. Especially when text can contain loan words from other languages, so in my experience it's rarely a useful thing to ask for.
¹ Yes, subsets. Latin letters lowercase is not the set abcdefghijklmnopqrstuvwxyz. It is the set
How do you condense that again into language-specific subsets? Every letter that appears in a word in a dictionary? Then at least é belongs to German as well, even though it's usually not considered part of the German Latin subset. Unicode stays clear of that issue by simply not defining what script subsets a character belongs to (rightfully so, IMHO).
The emoji code points can be represented differently on different systems given their meaning.
So it makes sense to have different emojis for different 'meanings'.
The 'moon' switch here does not mean 'moon' - it means 'standby' or whatever.
It may look noticeably different on different systems.
Think from a design perspective: you have 5 emojis to represent 'clouds, sky, earth' etc. - and then a different set of 5 to represent 'on, off, sleep, shutdown'. Those icons will be markedly different in terms of representation, groupings, colour coding, underlying functionality if they are integrated into an experience in any meaningful way.
Text your car with the 'shutdown' symbol to tell it to shut down.
Your bot texts your friend with a moon symbol to tell him you're asleep. Or whatever.
So if a system wants to render "on" differently than "straight vertical line", that's possible.
However, if "off" should be rendered differently than "circle", that's not possible. (Or only possible with out-of-band information or modifier characters which would still have to be defined)
It's a mess. If you want to write a document in Japanese that talks about a Chinese character which is written differently than its Japanese version, you can, or can't, achieve this in Unicode, depending on the character, its history, and the mood of the consortium the day it was assigned.
The reality is that Unicode is governed by people. Some of those people are grumpy reductionists who push for a minimum of symbols and a maximum of meaning-overloads, others are more liberal and tend to advocate the opposite, and the result is a compromise that is, in some areas, very messy.
So? How is that different from any regular character in real life?
101 for example means the number 101, an introductory class in university, slang for "anything introductory" in general, etc.
And let's not get started on the meanings of letters, e.g. a and e.
If people actually used these, it would make searching text for formulae much easier. Wikipedia editors and academic publishers, please note.
Also, there's no Unicode for screwdriver. Perhaps iFixit would like to campaign for that?
Congratulations on getting the power symbols in! When @edent writes "Will update ... when I stop dancing", was it "I got the power"?
I don't see how using them for anything else would have any use. I never searched for units when searching for formulas
BTW, just because a character exists doesn't mean it's the best choice for ordinary use. As https://en.wikipedia.org/wiki/%C3%85#Symbol_for_.C3.A5ngstr.... points out:
> Unicode also has encoded U+212B Å ANGSTROM SIGN. However, that is canonically equivalent to the ordinary letter Å. The duplicate encoding at U+212B is due to round-trip mapping compatibility with an East-Asian character encoding, but is otherwise not to be used.
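The canonical equivalence mentioned in that quote is easy to observe, assuming Python's `unicodedata.normalize`. The variable names here are just for illustration.

```python
import unicodedata

angstrom_sign = "\u212b"  # ANGSTROM SIGN
letter_a_ring = "\u00c5"  # LATIN CAPITAL LETTER A WITH RING ABOVE

# The raw strings differ, so a naive search for one will miss the other.
print(angstrom_sign == letter_a_ring)

# Normalizing to NFC maps the compatibility duplicate onto the ordinary letter.
normalized = unicodedata.normalize("NFC", angstrom_sign)
print(normalized == letter_a_ring)
```

This is one reason to normalize text before indexing or searching it.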
The idea is one (complex) encoding that will represent the info until the end of time. It creates a lot of trouble, but it's still a good idea.
The standards are not applied consistently. Even leaving emoji out of it, the chinese "character" 囍 never occurs in running text, but there it is in unicode.
FROM MEMORY, a while back there was an article on HN complaining that emoji seemed to magically bypass the requirements other characters needed to meet for inclusion in unicode, and that in fact they were commonly in violation. The taco symbol was called out as an example. I can no longer find this article, but it mentioned the running text requirement, and -- I believe -- specifically indicated that use in names does not count as use in running text. (For an idea of why that might be the case, check out http://tvtropes.org/pmwiki/pmwiki.php/Main/LuckyCharmsTitle .)
HOWEVER, I was not even able to find, on the unicode web site, any discussion of a running text requirement at all, for any kind of symbol. Some example proposals do refer to "running text" by name, but they don't indicate why. The example proposal given for adding characters to an existing block ( http://std.dkuug.dk/jtc1/sc2/wg2/docs/n2934.pdf , suggested as a prototype in http://www.unicode.org/faq/char_proposal.html ) does not mention "running text" at all, and doesn't appear to go to much trouble to document it, although some such documentation is given. The rough guidelines for character proposals at http://unicode.org/pending/proposals.html do not refer to "running text" at all, but they do suggest that, late in the process (specifically, on a proposal summary form, which is different from, and subsequent to, an actual proposal), "references to dictionaries and descriptive texts establishing authoritative information" are required.
I conclude that the Unicode standard's preferred criterion for chinese character inclusion is "would an authoritative chinese dictionary include this character", and while the answer to that question for 囍 is not unambiguous -- a lot of dictionaries don't include it -- it's easy to imagine that some do.
I would appreciate a pointer to the actual running text requirements, as well as what they are supposed to apply to, if anyone can provide that.
> It would be odd not to have an encoding for such a common character.
Outside of its use as a wedding decoration, which is plainly nonlinguistic, how common is it?
Also used in new year celebrations. Basically every single year you'd see tons of these printed.
It's a pun actually, double happiness.
CJK characters are, broadly, an example of the Unicode Consortium trying to be way too reductive about what they'd accept, leading to a lot of bad decisions like Han Unification, which caused a lot of damage and which the Consortium has generally now backed away from and recognized as a bad idea.
So, yes, if you look closely at CJK character sets in Unicode, you can find a lot of decision making that appears to contradict decision making elsewhere in the standard. This is in large part because the decisions they made wrt CJK characters turned out to be largely wrong, and they've since changed their approach.
The 天书 ( https://en.wikipedia.org/wiki/A_Book_from_the_Sky ), by design, consists solely of chinese characters that don't exist. (Theoretically. A couple of them, by oversight, did exist.) They are still recognizably "chinese characters" by virtue of being composed of the same components. Should they have unicode points?
囍 plainly exists, but has no textual use. Is it more similar to 靑 or to U+2764 "heavy black heart"?
I'm not saying it does, I'm saying Unification illustrates the fact that the Consortium's decision-making with respect to CJK has changed over time, has frequently been illogical, and shouldn't be pointed at as an example of anything good or sane or worthy of precedent.
The fact that 囍 has a code point but 福倒 doesn't have a codepoint is another example of the Consortium being unnecessarily reductive and intransigent about CJK.
> Do you think that 福倒 should have a code point?
Yes. If we want to be able to talk about it in text (like now), I want to be able to encode it in a standardized way.
> Should they have unicode points?
I'd lean towards no, as they're one-offs, not something broader that people want to discuss and use in text. But I'd be ok with adding them, too. We're not running out of space. There's no value in making CJK so much harder to interop with than everything else, in general.
> Yes. If we want to be able to talk about it in text (like now), I want to be able to encode it in a standardized way.
This doesn't make any sense. We talk about things in text by using words, not direct representations. A dog emoji is not necessary or desirable for discussing dogs in text, and a 福倒 emoji is not necessary or desirable for discussing 福倒s in text.
Should the wikipedia page https://en.wikipedia.org/wiki/Statue_of_Liberty be edited to replace the cumbersome phrase "statue of liberty" with the more modern and convenient U+1F5FD 'STATUE OF LIBERTY'?
"Necessary" is an ill-defined and reductive way of looking at communication. History has shown us that you can't draw bright lines between things you, in the abstract, have decided are the "necessary" subset, and expect the world to follow along.
Linguists have come to understand that you can only describe and follow human behaviour, not prescribe it.
Anyways, humans plainly found it necessary to annotate their text messages with pictorial indicators of their mood, to the point where it became so widely spread and such a mess that we felt it desirable to standardize the code-point representations. That it isn't desirable in all circumstances or appropriate in all registers of formality does not mean that it isn't an emergent behaviour which will continue to arise whether or not it is "necessary".
tl;dr I don't really give a shit that "dog emoji" isn't appropriate for an academic text on canine surgery. It's more than sufficient to me that it is used millions of times in text messages between regular human beings. Text needn't be formal text to deserve respect in encoding.
I took "If we want to be able to talk about it in text (like now), I want to be able to encode it in a standardized way" as implying that the two clauses were related to each other. Saying "if we want to be able to talk about it" means you're talking about what's necessary for that purpose.
I was under the impression that this describes the current state of affairs, and has since before Unicode came around. I know I've read an article about someone whose 姓 was 马 and whose personal name was a character composed of three 马 stacked left-to-right (which might have been pronounced cheng?) getting harassed because the government couldn't encode the name.
The running text requirement is for symbols, so it doesn't apply to Chinese: http://www.unicode.org/pending/symbol-guidelines.html
The word "running" doesn't appear on that page. (Actually, no requirements at all appear on that page; it speaks strictly in terms of strengthening or weakening the case for inclusion, not disqualifying.) Can you explain briefly why that page is evidence that the running text requirement does not apply to Chinese, and where it specifies what the running text requirement is?
Alternatively, what requirements do apply to Chinese, and would they preclude an invented character like one with 女 on the left and 离 on the right?
> Without that character in unicode, Chinese display systems and printers probably won't use unicode at all
is baseless. In its current uses, it doesn't appear on display systems and when printed it is almost always designed as an image, not printed as part of a font. Compare: http://pic10.nipic.com/20100928/5211371_231333032314_2.jpg
edit: I'd be interested in hearing your thoughts about "it does seem to have a more figurative than literal reference than most characters, in a way that I am not sure how to translate into English", in Chinese if necessary. (No guarantee I'll understand it, but I'm interested.)
In response to your edit, I mean that it has cultural resonance that is unusually strong in relation to its linguistic overtones, in many ways similar to the semantic timbre of a character like 福. The level of abstraction is different from English because of the ideographic nature of characters that means the visual appearance is emphasised, so the boundary that you pick out between reference and referent is more blurred.
I'm not really clear why exactly this character isn't more widely used in text, but I feel this might not be a bright dividing line from more common characters. I think inclusion of the 福倒 is a harder case to make, but the examples I quoted elsewhere in this thread make me think it should be included. Perhaps not what you were hoping for in terms of elaboration; the problem is perhaps more conceptual fuzziness on my side than language of expression.
Having said that, I note that 喜事 appears in my dictionaries with the gloss "wedding" (well, "any occasion meriting joy, particularly a wedding"), 囍事 does not, and since 囍 is a symbol of weddings which is generally assumed by the Chinese to have the same pronunciation as 喜 it makes for very natural wordplay to substitute it into the word for wedding. I would draw a pretty close analogy with the $ of "Micro$oft" -- it's use in running text, but it shouldn't be taken as evidence that $ is a letter in English.
It is pretty natural to jump from 喜 to 囍 because that is how Chinese works. You take radicals, and you bundle them up. You have the "busho" system where people in the past bundled up little bits and pieces and form new words. No reason why people in the present can't do the same.
Re: Micro$oft being outrageous if $ becomes a part of the alphabet. You are misapplying an English oriented viewpoint. In Chinese, there is no objection to forming words in that way, by incorporating radicals together. It's similar in theme to how in German, you can just keep stringing words together to form larger words. In fact, I actually think in the future, words like Micro$oft should entitle $ to become part of the alphabet! That's a very Chinese way of looking at things.
Language is not static. Systems that try to encode language are descriptive. They can never be prescriptive - otherwise we as a civilization die.
If 囍 wants to be a character point, let it be one. If 福倒 wants to be one, there should be one. Isn't the point of unicode to have enough space to include all these kinds of language artifacts (artifact as in a cultural / historical item thought up by humans) in order so people can uniquely reference each one? They are distinct logical units.
If the unicode rulebooks are too rigid, the rules need to change or the approach needs to change. It's useless to try to argue that xyz character in another language shouldn't/can't be a character - people will just stop using unicode if it doesn't suit their needs.
Reeks of colonialism, that's what it is.
EDIT: as an additional gloss, here's why I think 喜 and 囍 are sometimes used differently, even though by the dictionary definition they seem to be the same. I will explain why I think logically they are different concepts.
喜 is happiness, delight, joy. It is probably an adjective in the English sense (I can't map grammar rules through different languages easily).
事 is an occurrence, an item, something that happens.
When you put them together,
喜事 literally means something happy is happening.
The cultural meaning has turned that into a connotation of "wedding", but it could actually be a ton of happy things. Promotions, and yes - one other really big thing in a person's life: having a baby.
有喜 (means "having happiness") is the traditional way of referring to a woman being pregnant
You can turn that into 家有喜事 - meaning home having something happy - as in this household is having a baby. And you can use it without the 有 - and just use 喜事 to refer to having a baby.
This is different from a wedding.
囍 is a modification of 喜, by doubling up the character and treating it as a radical, people are referring to the idea that there are "two people having happiness" - like a doubled amount of happiness.
In the article linked http://www.chinatimes.com/newspapers/20160623000760-260115 - the 囍事 is used to specifically identify the "wedding" type of 喜事 - it's like trying to avoid the ambiguity and double-entendres that Chinese writing typically embraces and just present things matter-of-factly, which is ideal because the article is a newspaper article about customs of towns. Not really something you want people to have multiple interpretations of, like an essay or a poem, for example.
So logically, there is a difference when trying to use 喜 vs 囍 and I actually really appreciate the author's use of the double version in the text.
I know that not everyone reads these characters in this way, but I do - and I'm sure other people will notice this too. It's the best part of Chinese - not knowing, and not seeing the ambiguity, and one day, someone tells you about it .. and you're like - OMG that's what that means ...
For my earlier indication that this type of character modification is common in Chinese:
木 = wood
林 = common last name Lin, also means forest (uncommon on its own)
森 = common character for forest.
The English word "forest" is usually 森林
It's just a doubling and tripling of the 木 radical.
What does it matter that this character is super old - people thousands of years ago thought this up.
Also, if this character weren't so old, would you say that 森 and 林 are both forests and thus don't need separate character points in unicode? That's outrageous!
So now we have a modern version of this modification 喜 -> 囍
And I showed how I think they are different logical concepts.
The link http://www.chinatimes.com/newspapers/20160623000760-260115 showed how it can be used in typographical context.
hmm what's the issue with it being a unicode character point?
In the writing of this post, I think I've come to identify Chinese as an "ambiguity-first" language - I learned Chinese as my mother tongue, but stopped at an elementary school level, and switched over to learning English to a Bachelor's degree level.
In Chinese, puns, double-entendres, and ambiguity just "happen" by default, and you have to work your way to be crystal clear.
English is more straight-forward, with a speaker having to try to make puns or double-entendres.
In the case of 囍, it's a reduction in scope. Modern Chinese people had to create a new word just to narrow down the meaning of 喜 - so that it specifically refers to weddings.
Your whole line of thinking was that 喜 already had meanings inclusive of wedding, so 囍 can't possibly add any more meaning when it also means wedding. In actuality, it took away a bunch of extraneous connotations, and in Chinese, the reduction in complexity is so valuable that it's worth a new word.
I think that it's a mistake to try to over-literate and reduce languages into a set of rulebooks for character encoding - that's all I am trying to put forth - it's best for the person or peoples who speak the language to come up with the encoding for it. I have an elementary school knowledge of Chinese and already I am kinda miffed at why people have an objection to 喜 vs 囍
Imagine how the people who have Bachelor's degrees in Chinese must feel.
In this case, the codepoints were added in part because the proposers could show many printed works (user manuals, I guess) that included sentences such as "to turn the foobar on, press the ■ button", which shows that the glyph between "the" and "button" is in some way like the surrounding glyphs. Chessmen were added for similar reasons, even though very few people actually read either user manuals or chess literature.
Then what about §? or $? Or %? The list is endless.
Because the Unicode standards body doesn't want them in, or because those scripts don't have champions pushing for their inclusion?
>On the other hand, approving emoji and random icons delights Westerners.
Westerners? Notwithstanding the fact that emoji icons came from Japan, I'm fairly certain emotive icons are popular globally.
There is also some discussion here - https://news.ycombinator.com/item?id=9219162
In this instance, someone is complaining that they cannot type their name on a computer.
These aren't characters, but entire scripts that are not part of the standard. Nor are major scripts like kanji complete.
But, hey. Power button icon.
More importantly, ask why it seems unreasonable that a small number of very widely-used ISO standard symbols were incorporated quickly? Wouldn't that be the most reasonable expectation since it lacks the political heat of e.g. Han unification and doesn't require any research or debate to establish that they are used, have a precise meaning, and are not covered by existing codepoints?
I love the echoing nature of these counter-arguments, that a problem doesn't even exist unless it's "major" and "nobody is working on it". I wonder how many actual different human beings have responded to me in this thread...
Since you're hung up on http://unicode.org/standard/unsupported.html, let's read it and see how many languages are missing:
* Loma: 250K speakers, and this is to codify characters used for personal correspondence in the 1930-40s: http://www.unicode.org/L2/L2010/10005-n3756-loma.pdf
* Naxi Dongba: a pictographic script used by priests in an ethnic group of roughly 300K people: http://www.unicode.org/L2/L2011/11178-n4043.pdf
Your original claim was that “Unicode don't actually have anything like coverage of the entirety of every script and alphabet” but you're arguing about things which affect something like 0.008% of people – not even their primary usage – and for which there is work in progress to support!
Nobody is saying that Unicode is complete, but like any other human effort there's a limited amount of time to work on things. At some point things which are used daily by billions of people are going to get prioritized over things which are used infrequently by thousands of people, and it's hard to argue that this is wrong even if you – like me – want to have 100% of human language represented in Unicode.
You're going to seriously say that after your last few posts? Two posts into this exchange, you moved the goalposts, and you hammered that button repeatedly.
But at least you actually looked at the proof you repeatedly demanded, even though I had already mentioned the pages. You didn't bother reading much of it, or note that it goes well beyond a couple of scripts on that page to other incomplete scripts and as-yet entirely unimplemented scripts. But you at least made that minimum effort.
And the limited time to work on these things is exactly the issue. There are scripts not yet in the standard and major language scripts that aren't complete - but we've got "pile of poo" and a slew of emoji. And now, we've got four power button icons that a handful of people demanded.
You started this conversation with “Unicode don't actually have anything like coverage of the entirety of every script and alphabet.” It's hardly moving the goalposts to question how complete Unicode has to be to qualify as “anything like” or how much weight usage should have.
> But at least you actually looked at the proof you repeatedly demanded, even though I had already mentioned the pages. You didn't bother reading much of it, or note that it goes well beyond a couple of scripts on that page to other incomplete scripts and as-yet entirely unimplemented scripts. But you at least made that minimum effort.
Before you could call that proof, you have to clearly articulate the questions it could answer. Note that my first comment indicated a clear understanding of how Unicode works – the process is not in question here, only the thresholds you haven't articulated. All I've been trying to get you to state is precisely what your rules would be for coverage of human languages before we can add anything else and how much usage should factor into that. There's also the much harder question of coming up with a rule which says why a pictograph, the Phaistos disc symbols, etc. are valid for inclusion but a modern symbol used millions of times a day around the world to communicate is not.
While thinking about this, it's also worth remembering that despite your apparent belief that emoji are a Western novelty, the question was how to improve Unicode adoption in Japan and that required having an answer for the millions of people who were using systems which relied on non-standard encodings and by most accounts Japanese carriers were resistant to adopting Unicode without having a standard to replace those ad-hoc systems. I think that decision should have been handled differently (i.e. assigning an emoji plane) but it was driven by understandable technical reasons affecting large numbers of people on a daily basis. Since that decision was made, the additional cost to add a small number of non-controversial additions which do not require scholarly research or documentation does not seem excessive — we are, after all, talking about a small percentage of the new symbols in Unicode 9.0.
Your fonts don't have to support the entirety of Unicode. That's why we have font stacks and fallbacks.
"To start the device, press the ⏻ button on the device face"
There, used in a sentence.
"Every possible image" (e.g. an elephant icon in running text) is not.
A standard clip art library that covers universally understood symbols sounds like something that would be very useful.
What is "plain text format" though? If 'text' isn't limited to Western ASCII characters (which it very obviously shouldn't be considering many people use other character sets), then the idea of a text standard should be to encode all the glyphs people use, so "plain text format" becomes a canonical list of all the communicative symbols in all languages. That's what Unicode aims to be.
In my opinion, if they're used for communication, it doesn't seem unreasonable that such a canon of characters should include universal iconographic symbols like the standby icon.
This blog post is a nice example: I have absolutely no idea what these new code points are supposed to look like, since I only spent an afternoon implementing the Unicode best practices from the Arch wiki, instead of subscribing to some Unicode standard mailing list. (Except the one symbol which was redefined to a symbol that does not carry the semantic meaning of "standby symbol" anywhere outside of the Unicode standard.)
In my opinion there are two ways forward: one, burn the entire thing; or alternatively, force the Unicode committee to produce an authoritative and complete font, in triplicate, and in their own blood.
Meanwhile, a lot of the "Ys and Zs" added to Unicode have proved to be extremely useful. Unicode's math operator and letter-styling support is what made MathJax (and more generally MathML) possible. They've also helped big time when it comes to accessibility (e.g. screen readers) for mathematics on the internet. Should we have shunted that off to another standard and made the creators of screen readers completely restructure their offerings so they can deal with Unicode characters and "Mathicode" characters? Assuming anyone bothered to implement it, how would that be better than just adding a Unicode category and spending a meager amount of space?
Second, mathematical symbols: consider the case where I get a text file consisting mostly of 7-bit ASCII and some mathematical symbols, which may render as mathematical symbols or as Chinese characters, since there is no way to specify the encoding in a text file and so I have to guess the encoding. (That is not helped by the roughly 17 standardized encodings that mostly agree with UTF-8.)
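To make the guessing problem concrete, here is a minimal Python sketch. The sample string and the choice of Latin-1 as the wrong guess are my own assumptions for illustration; the only point is that the same bytes decode without any error into different text.

```python
# Same bytes, two guesses at the encoding. The sample text and the
# choice of Latin-1 as the "wrong" guess are illustrative assumptions.
text = "∀x ∈ ℝ"
raw = text.encode("utf-8")

right = raw.decode("utf-8")     # the encoding the writer actually used
wrong = raw.decode("latin-1")   # never raises: every byte maps to some character

print(right)
print(repr(wrong))              # mojibake, one character per input byte
```

Nothing in the file itself says which of the two decodings was intended, which is why a reader without out-of-band information has to guess.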
What does that have to do with Unicode adding anything? Are you really claiming that if we threw out Unicode like you recommend, and (if I'm understanding your point correctly) choose an encoding for the new version that looks nothing like ASCII the encoding mess would get better? I think continuing the migration of most transmission of text to UTF-8 and explicitly specifying encodings for everything that needs to stick with Latin-1, etc. is a better option, unless you propose codifying the new encoding in law to force adoption.
Tip: a quicker way is to copy the unrendered box and Google it.
No, last time I checked you are not legally allowed to do that.
The shapes of the reference glyphs used in these code charts are not prescriptive. Considerable variation is to be expected in actual fonts. The particular fonts used in these charts were provided to the Unicode Consortium by a number of different font designers, who own the rights to the fonts.
See http://www.unicode.org/charts/fonts.html for a list.
You may freely use these code charts for personal or internal business uses only. You may not incorporate them either wholly or in part into any product or publication, or otherwise distribute them without express written permission from the Unicode Consortium. However, you may provide links to these charts.
The fonts and font data used in production of these code charts may NOT be extracted, or used in any other way in any product or publication, without permission or license granted by the typeface owner(s).
The Unicode Consortium is not liable for errors or omissions in this file or the standard itself. Information on characters added to the Unicode Standard since the publication of the most recent version of the Unicode Standard, as well as on characters currently being considered for addition to the Unicode Standard can be found on the Unicode web site.
In the same way it's useful to standardize letters in various alphabets without standardizing their screen representation. There is semantic content associated with each of these symbols that persists even if there is significant variation in how they are presented. Of the ones you list, emojis are the only ones where this is any more a problematic approach than it is for letters in various alphabets. And as people who don't approve of Unicode adding emojis like to point out, emojis aren't that critical so having some loss in the translation isn't a huge deal.
Remember that before emoji standardization various cell phone manufacturers (particularly in Japan if I remember correctly) started using codepoints for whatever they pleased. The alternative to Unicode standardizing them was to have a repeat of the OEM font gold rush in the SMP.
> styled math letters (which would have been equally well served by simply rendering them in italics or in a special math font)
That was my first reaction as well, but there are a few problems with that approach:
* Math italic characters look very different from normal italics, and are shaped and kerned very differently because they are commonly used for single-letter variables which will be juxtaposed together in expressions. If your goal is to be able to preserve some math formulas in a purely line based text format, preserving this aspect makes a big difference in readability.
* Many of the math letters and "letter-like symbols" have associated semantic content (like bold for vectors), which it makes sense to preserve. MathML alleviates this to a significant degree but I don't believe these codepoints were intended only for MathML usage.
* On the technical side, OpenType math fonts need to carry associated metadata for many of these characters. Putting them in separate fonts complicates this, since these tables need to refer to glyphs (general codepoints are unsuitable in a number of cases) and each font file would have a different glyph address space.
Things were way worse than that: to add emoji to text, NTT DoCoMo used private-use codepoints, AU used embedded image tags and Softbank wrapped emoji codes in SI/SO escape sequences.
I disagree. Say the name of a letter in any alphabet, and people will draw it in ways that are similar enough for automatic recognition. This is not true for pictograms and emojis.
> The alternative to Unicode standardizing them was to have a repeat of the OEM font gold rush in the SMP.
I disagree. The alternative is a much simpler and faster standardization, of the kind I offered here: https://news.ycombinator.com/item?id=11958903 There is absolutely no need for a fixed codepoint for most of the non-BMP characters.
> If your goal is to be able to preserve some math formulas in a purely line based text format, preserving this aspect makes a big difference in readability.
So is rendering text in Arial vs. Comic Sans, but they haven't made separate codepoints for those.
Also, where this makes a lot of difference, would count as "specialized usage". I don't think it makes sense to have a single universal standard to standardize all specialized usage of human-readable data.
So you think instituting a system based on links not rotting would better preserve meaning? Not to mention that:
* Every text renderer that doesn't support your codepoint now displays a full URL, instead of a box, making text using these emojis very difficult to read.
* Instead of making implementation easy by requiring nothing new of text shaping libraries, they now have to be able to both connect to the internet and tie into a file cache.
The supplementary space was already there when we got to emojis, and UTF-8 and UTF-16 already had to deal with SMP codepoints for some of the less common CJK characters. Not everything above 0xFFFF is "weird" non-human language stuff. If the choice was "stick with UCS-2 and be totally fine language wise, or add more bits just for emojis and pictograms" I'd probably agree with you. If you think that's what happened, your timeline for this process is way off.
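For anyone who hasn't looked at what "dealing with codepoints above 0xFFFF" involves, here is a rough Python sketch of the UTF-16 surrogate-pair arithmetic. The function name and the choice of U+20000 (a CJK ideograph outside the BMP) as the sample are my own; the arithmetic is the standard UTF-16 rule, checked here against Python's own encoder.

```python
def utf16_surrogates(cp):
    """Split a code point above 0xFFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above 0xFFFF need surrogates")
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low

cp = 0x20000                          # sample CJK ideograph outside the BMP
ch = chr(cp)
print([hex(unit) for unit in utf16_surrogates(cp)])
print(ch.encode("utf-16-be").hex())   # the same two 16-bit units, as bytes
print(ch.encode("utf-8").hex())       # four bytes in UTF-8
```

The same code path handles an emoji and a rare CJK character, so emoji did not add any new encoding machinery.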
> So is rendering text in Arial vs. Comic Sans, but they haven't made separate codepoints for those.
Sure, but the different letter types carry crucial meaning in math formulas. "sup" in upright letters is the math operator supremum, "sup" in math italics is s * u * p. This kind of thing applies to every one of the mathematical letter variants.
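A small Python sketch of the "sup" case, assuming the italic variables are written with the Mathematical Italic letters from the Mathematical Alphanumeric Symbols block (the variable names are mine). It shows that the two spellings are different strings, and that compatibility normalization (NFKC) throws the distinction away.

```python
import unicodedata

upright = "sup"                                   # the operator name, plain ASCII
italic = "\U0001D460\U0001D462\U0001D45D"         # math italic s, u, p

for ch in italic:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

print(upright == italic)                                  # different code points
print(unicodedata.normalize("NFKC", italic) == upright)   # folding erases the difference
```

So the distinction survives copy and paste as plain text, and is lost only if something deliberately folds the letters back to ASCII.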
> Also, where this makes a lot of difference, would count as "specialized usage". I don't think it makes sense to have a single universal standard to standardize all specialized usage of human-readable data.
You're zooming way out on this one. Math symbols have a lot more in common with letters and "normal" symbols than "all specialized usage of human-readable data". Remember that when Unicode added the math symbols, things like MathJax were simply impossible. Being able to write at least some formulas, which consist of letters and symbols, without losing tons of semantic information seems like exactly the kind of thing character encodings should do.
It can still display a box.
> Instead of making implementation easy by requiring nothing new of text shaping libraries, they now have to be able to both connect to the internet and tie into a file cache.
The question is implementation of what. I think that it is not an onerous requirement from applications that need to display emojis or ancient Egyptian hieroglyphs. Their OS could provide this service for them just as it allows them the use of fonts.
> The supplementary space was already there when we got to emojis, and UTF-8 and UTF-16 already had to deal with SMP codepoints for some of the less common CJK characters. Not everything above 0xFFFF is "weird" non-human language stuff.
That is a good point, one of which I was not aware, but I still don't think it justifies standardization of Chinese characters, emojis, and ancient Greek musical notation by the same standards body.
> Sure, but the different letter types carry crucial meaning in math formulas.
The use of italics in text may also carry crucial meaning. But if a textual representation as sup is supported, I don't see why the specialized rendering should be supported, too, but for math and not plain text.
> Being able to write at least some formulas, which consist of letters and symbols, without losing tons of semantic information seems like exactly the kind of thing character encodings should do.
I agree, but I think that that semantic information is preserved when writing N or NN instead of 𝑵 or ℕ. Considering that Unicode isn't enough to write most mathematical formulas in convenient forms anyway and requires a specialized renderer anyway, I don't see the reason for this extra effort.
If it knows about your new codepoint. Everyone using an implementation that doesn't yet support it is going to show the full URL. If history repeats itself, these implementations will be the majority for at least a decade.
> The question is implementation of what. I think that it is not an onerous requirement from applications that need to display emojis or ancient Egyptian hieroglyphs.
Except they need to do literally nothing different. Hieroglyphs are vectorized, and emojis are either bitmapped (which existing font formats already supported) or vectorized. None of this required a single line of code in any text shaping library to change. Text shaping libraries generally don't even need to understand Unicode categories or other metadata: font files already contain all the relevant information (directionality, combining mark, etc.).
> The use of italics in text may also carry crucial meaning. But if a textual representation as sup is supported, I don't see why the specialized rendering should be supported, too, but for math and not plain text.
If you scrub all bold and italics from text, do any of the words turn into different words? That's what happens with sup (or other named operators). Same thing for blackboard letters, fraktur, etc.
> I agree, but I think that that semantic information is preserved when writing N or NN instead of 𝑵 or ℕ. Considering that Unicode isn't enough to write most mathematical formulas in convenient forms anyway and requires a specialized renderer anyway, I don't see the reason for this extra effort.
My point is it was better than nothing, which was the alternative when Unicode added these. By adding mathematical characters that worked exactly the same as all other characters, we could get some of the advantages of MathML without needing everybody to implement a special math renderer or learn any new markup, just download new font files.
If getting everyone to accept MathML or something similar in an expedited fashion was a reasonable proposition, then I might agree that they should've kept it out of Unicode. Those were not the facts on the ground when this decision was made. Note that even now that we have MathML, the few browsers that support it (IIRC Firefox & Safari only) have complete shit implementations that look terrible.
The crux of this is that making changes to a standard to support something new and propagating the changes is super hard; getting new standards accepted and implemented is a Herculean effort. Unicode's expansion into these domains required nobody to do anything differently, let alone decide to up and write an implementation of a completely different standard. I think this is a case where our alternative was to let the perfect be the enemy of the good, or at least the working.
And that's what happens when scrubbing subscripts or binomial coefficients. When you want to represent math as text, you need to change your representation (add multiplication signs, forgo subscripts, use confusing parentheses etc.). This is still true with Unicode. The contribution of the non-BMP special math characters is quite minimal.
> The crux of this is that making changes to a standard to support something new and propagating the changes is super hard; getting new standards to get accepted and implemented is a Herculean effort.
Sure, but the emoji craziness continues, and there's little sign it would ever stop. Instead of saying "this isn't text; if you want, call it 'special text', escape it, and let a different body standardize it", the body entrusted with standardizing text representation worries about how to represent a picture of two people kissing as text. What next? Kids would want to add tunes to their text messages. Would the Unicode Consortium add code points for MIDI? And maybe managers would want to standardize code points for organizational diagrams. Would that be the consortium’s responsibility, too? The BMP contains all the characters for reasonable text-art.
Suffice it to say this getting "out of control" and eating up the remaining space in our lifetimes, or our children's lifetimes would be pretty impressive. This means that the worst we have to fear is more boxes, assuming OSes don't keep up with their fallback fonts. Saying adding more symbols-designed-to-be-just-kind-of-placed-in-line-with-other-symbols-like-they-always-have-been is the first step towards MIDI and organization diagrams is like saying you're vegetarian because it's a slippery slope from eating meat to eating people.
Also note that unlike every other excess of the Unicode standard, these would require massive changes to the code that handles text. This means that if Unicode decided to do this, you wouldn't have to worry about negative effects because nobody would implement it.
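For a sense of scale on "the remaining space", the size of the code space is simple arithmetic. A quick Python check (constant names are mine; I'm leaving out the count of assigned characters since it depends on the Unicode version):

```python
import sys

PLANES = 17                        # plane 0 (the BMP) plus 16 supplementary planes
code_space = PLANES * 0x10000      # 65,536 code points per plane
surrogates = 0xDFFF - 0xD800 + 1   # reserved for UTF-16, never characters

print(code_space)                  # total code points, U+0000..U+10FFFF
print(code_space - surrogates)     # code points that can encode a character
print(sys.maxunicode + 1 == code_space)
```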
I think it would have made much more sense to have something like image tags: a special codepoint would introduce a link to a URL containing a sequence of glyphs, followed by an index into that sequence. Those glyphs would be guaranteed not to change (in any meaningful way), and devices would be free to cache them. This way, anything that isn't real text, would standardize representation, too, instead of just a vague "meaning". Another standard could relate those glyphs to one another in some way, giving them standard semantics and means of translation (i.e. "Egyptian hieroglyphics"). This would also allow each of those (emojis or hieroglyphics) to evolve their standards independent of a single universal standard that means little.
The dream of a 16-bit Unicode washed up on the rocks of CJK scripts. It's dead and it isn't going to be revived. You can argue for a simpler standard, with fewer assigned codepoints, but the original BMP isn't it and was never going to be it.
I concede this point. I still don't see why the Unicode Consortium should spend effort standardizing non-text as text.
^1 I am pretty sure that some use cases actually profit a lot, but neither text file formats (since most text does not contain 2^64 different characters zip would work nicely) nor networking (since most data on the internet is either video or torrents) seem to be among them. So probably they would not be huge fields.
In any case, UTF-8 has a place, and if you want easy manipulation and search, convert it to UTF-32, which is fixed width (four bytes per code point).
But that's only because Unicode has ventured well beyond what we consider to be text. The BMP is enough to represent all text (including math).
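A minimal Python sketch of the difference, with a sample string and helper name of my own choosing. UTF-8 uses one to four bytes per code point, while UTF-32 always uses four, so the i-th code point can be found with plain arithmetic.

```python
text = "a€😀"   # 1-, 3- and 4-byte characters in UTF-8

utf8_sizes = [len(ch.encode("utf-8")) for ch in text]
utf32_sizes = [len(ch.encode("utf-32-le")) for ch in text]
print(utf8_sizes)    # variable width
print(utf32_sizes)   # always 4 bytes per code point

def code_point_at(utf32, i):
    """Fetch code point i from UTF-32-LE bytes by offset arithmetic."""
    return utf32[4 * i : 4 * i + 4].decode("utf-32-le")

encoded = text.encode("utf-32-le")
print(code_point_at(encoded, 2))
```

Note that this is fixed width per code point only; a user-perceived character built from several code points still spans several UTF-32 units.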
There are no codepoints if it's not text. How do you use codepoints for embedding a video or a picture on your blog? You don't! But, if you want to treat something as if it were text, then I suggested doing something similar to an XML namespace: "the next segment is hieroglyphics, you can get their glyphs from here, and these are their indices...". That "extended text" is still not text, and it still doesn't use any Unicode codepoints, but it can work according to similar principles.
> Do you think it'd be more efficient to have to support 6 different standards than one?
Then why don't we let the Unicode Consortium take over standardizing video or audio? If something isn't text, why is it standardized by a text standardization body?
You can exchange text with embedded icons just as easily without requiring OS vendors to come up with their own versions of vaguely-defined pictures by... simply embedding pictures.
I can already see the people on whatever would be the HN 20 years from now complaining how "bloated" Unicode is, full of thousands of symbols that no one ever uses, and calling to replace the whole thing, costing the industry even more money to replace a standard yet again.
If you are trying to index into a string by "character" you are almost certainly already doing it wrong. Meaningful indexing almost always has to be by grapheme cluster. See Swift's string API as a great example of this done right.
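Here is a rough Python illustration of why indexing by code point goes wrong. Python strings index by code point, and the standard library has no grapheme-cluster segmentation, so `naive_clusters` below is a hypothetical helper of my own that only attaches combining marks to their base character. It is a deliberate simplification, not the full Unicode segmentation rules (UAX #29), and not what Swift does internally.

```python
import unicodedata

def naive_clusters(text):
    """Group each base character with the combining marks that follow it.

    Simplified approximation only: it ignores ZWJ emoji sequences,
    regional-indicator flags, Hangul jamo and other UAX #29 rules.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch      # non-zero combining class: attach to previous
        else:
            clusters.append(ch)
    return clusters

word = "cafe\u0301"                 # "café" written as e + COMBINING ACUTE ACCENT
print(len(word))                    # counts code points, not visible characters
print(word[3])                      # indexing splits the accent off its base
print(naive_clusters(word))
```

Real code should use a proper segmentation library rather than something like this; the sketch is only meant to show the gap between code points and what a reader sees.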