Unicode Standard, Version 12.0 (unicode.org)
67 points by lelf 20 days ago | 55 comments



Emojis are a plot to make English-speaking developers care about fixing their code to work with Unicode.


Not just Unicode, but Unicode characters outside of the Basic Multilingual Plane (BMP).

At the very least it is a pleasant unintended consequence. A decade ago, support for characters outside the BMP was hairy (certainly possible, though, in PDFs with LaTeX via XeLaTeX), and something you would only deal with if you dabbled in the obscure glyphs found there; now it is standard.
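A quick way to see the BMP boundary in practice — a minimal Python 3 sketch (the helper name is mine, not anything standard):

```python
# Check whether a string contains characters outside the Basic
# Multilingual Plane (i.e. code points above U+FFFF).
def needs_astral(s: str) -> bool:
    return any(ord(c) > 0xFFFF for c in s)

print(needs_astral("café"))  # False: every code point fits in 16 bits
print(needs_astral("😀"))    # True: U+1F600 lives in a supplementary plane
```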


If only they hadn't tried to cut it down to 16 bits near the start, we could have avoided a lot of the partial support that emojis expose.


Ideally, we'd have started from UTF-8 and never even bothered with 16 bit chars in the first place.


Maybe not all Asian scripts were planned for inclusion back then?

It seems strange that they would miscount so grossly otherwise.


It's all down to CJK. Originally they allocated 21k code points to CJK, and if that had been accurate then 16 bits would pretty much have fit everything.

But we currently have 88k CJK characters assigned out of possibly more than 100k total.

I can't easily find anything about how this went wrong and why they arrived at such a small number.


> we currently have 88k CJK characters assigned

If you also count the Unihan variations registered in Unicode's Ideographic Variation Database by various Japanese outfits, encoded using the VS17 to VS256 characters after the code point they modify, there's another 8k or 9k unique characters assigned.
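For concreteness, here is what such an ideographic variation sequence looks like at the code-point level — a Python sketch (辻 U+8FBB is a commonly cited base character; which variants are actually registered is defined by the IVD, not by this snippet):

```python
import unicodedata

base = "\u8fbb"      # 辻, a kanji with registered glyph variants
vs17 = "\U000e0100"  # VARIATION SELECTOR-17, first of the ideographic range

seq = base + vs17
# Two code points, but a supporting font renders them as one variant glyph.
print(len(seq))                # 2
print(unicodedata.name(vs17))  # VARIATION SELECTOR-17
```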


As somebody not even working within the light-cone of this field, what is going on with emojis? It seems like everything I read about unicode has at least a mention of them. Are they that important to users? Particularly technically challenging? Interesting from a theoretical standpoint?


Emojis are in the Unicode range with code points above 16 bits. (The original Unicode was limited to 16-bit code points.)
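Concretely, systems that assumed 16-bit characters have to represent those code points as UTF-16 surrogate pairs — a sketch of the arithmetic in Python:

```python
def to_surrogate_pair(cp: int) -> tuple:
    """Split a supplementary-plane code point into UTF-16 surrogates."""
    assert cp > 0xFFFF
    cp -= 0x10000
    high = 0xD800 + (cp >> 10)   # lead surrogate carries the top 10 bits
    low = 0xDC00 + (cp & 0x3FF)  # trail surrogate carries the low 10 bits
    return high, low

# U+1F600 GRINNING FACE becomes the pair D83D DE00.
print([hex(u) for u in to_surrogate_pair(0x1F600)])  # ['0xd83d', '0xde00']
```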


Every top comment is about Emoji… Law of triviality in action ;)

  Also in Version 12.0, the following Unicode Standard Annexes have notable modifications ⟨…⟩
  UAX #14, Unicode Line Breaking Algorithm
  UAX #29, Unicode Text Segmentation
  UAX #31, Unicode Identifier and Pattern Syntax
  UAX #38, Unicode Han Database (Unihan)
  UAX #45, U-Source Ideographs


Monospace is hard to read on mobile, please don't use it for quoting.


It's not triviality.

How many people do you really think care about Elymaic script? Or about Nandinagari?


Why would you bring up something relatively obscure and ignore all the other major non-emoji changes?

In addition to what grandparent listed:

> UTS #10, Unicode Collation Algorithm—sorting Unicode text

> UTS #39, Unicode Security Mechanisms—reducing Unicode spoofing

> UTS #46, Unicode IDNA Compatibility Processing—compatible processing of non-ASCII URLs

I think plenty of people care about these changes.
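The kind of spoofing UTS #39 targets is easy to demonstrate (a sketch; UTS #39's actual confusables data goes well beyond this one pair):

```python
import unicodedata

real = "paypal"
fake = "p\u0430ypal"  # second letter is Cyrillic, not Latin

print(fake == real)               # False, though they render near-identically
print(unicodedata.name(fake[1]))  # CYRILLIC SMALL LETTER A

# Normalization does not unify confusables; detecting them needs the
# dedicated UTS #39 data.
print(unicodedata.normalize("NFKC", fake) == real)  # False
```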


Presumably some people. Assigning code points is the Unicode consortium’s job; that’s what Unicode does. Nobody should be upset that they keep doing it.

But you ignore the substance of the parent’s post which is about the new elements of the Unicode standard which are not confined to assigning code points. There is substance there to be analyzed - there is material there about how Unicode should be used in defining identifier syntaxes which is of high relevance to an HN audience (eg in defining your new serverless framework, what characters should you allow in function names? Unicode now has a better answer for you than ‘[a-zA-Z_0-9-]’).
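Python 3 is one example: its identifier rules already follow UAX #31's XID_Start/XID_Continue properties, which you can probe via str.isidentifier():

```python
# str.isidentifier() implements UAX #31 (plus '_' as a start character).
print("café".isidentifier())   # True: non-ASCII letters are allowed
print("名前".isidentifier())    # True: so are CJK ideographs
print("2fast".isidentifier())  # False: digits cannot start an identifier
print("🙂".isidentifier())     # False: emoji are not identifier characters
```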

There are updates to Unicode security and idna support.

But no, sure, let’s complain about emoji and obscure languages.


>Assigning code points is the Unicode consortium’s job. That’s what unicode does. Nobody should be upset that they keep doing it.

Why not?

Something can be "somebody's job" and others can still, quite logically, be upset that they keep doing it. E.g.:

1) if the job wastes resources, is deemed silly, is done badly, etc.

2) if the job is detrimental to those others


It is a shame that the majority of these won't be seen by common devices. Google has stopped work on their Noto fonts initiative, and modern versions of Android are stuck at pre-Unicode-10 coverage.

Apple has been pretty good at adding Emoji on iOS, but macOS seems to be left behind.

As for Linux... Installing the Unifont gives coverage, but most distros don't seem to have a way to update base level fonts.

I'd love to be proved wrong about any of this. But it seems that the majority of systems don't care because their native scripts are already supported.


> Apple has been pretty good at adding Emoji on iOS, but macOS seems to be left behind.

The latest macOS version gets emoji updates regularly; it stays in sync with the latest iOS.


That's fine though. On Linux distributions, care is usually taken to ensure fonts cover all languages in common use today, but there is no need to cover every glyph in the standard right from the start. Some Unicode blocks eventually get a default font; some are so specialized that you have to install a special font, but that's perfectly doable.

Newer versions of distributions tend to come with updated versions of the fonts installed, so eventually support will increase.


>Apple has been pretty good at adding Emoji on iOS, but macOS seems to be left behind

I'd be surprised if there was more than a "between iOS/macOS major versions" discrepancy in their emoji.


> Google has stopped work on their Noto fonts initiative

:-( This may be my favorite Google project ever.


Emoji handling on Linux is so upsetting. It should not be this tricky.


I submit that it should: http://baldi.me/blog/emoji-in-sql


Hilarious.


Are the more fancy scripts supported by Unicode used by real people in production? By scholars? With special fonts? Or is it more like Unicode just wanting to support everything, even though the target audience is actually using something else?

Asking because I'm impressed by the aim of the whole Unicode project but having no real experience with it beyond the basics.


You will need a "special" font to visualise the text but depending on the writing system it may be enough for someone to simply make one new glyph for each of the characters in your system and add it to a general purpose "everything" font. For some writing systems you need more powerful technology because e.g. the system has complicated rules about how shapes fit together and are transformed by adjacent shapes.

For practical purposes there isn't "something else". We're well past the point where Unicode was adding things that worked fine on a specially modified edition of Microsoft Windows for the specific language (like Dungan, which needs extra characters not normally used in Cyrillic) or whatever, these are now often _really obscure_ writing systems where previously you'd only put them "on a computer" by uploading a picture of the writing. Now the computer can handle them as text because they're in Unicode.

For all the historical writing systems, and some of the minority systems that have very few users many of whom know another language that is more widely used and thus more useful to them in practice (imagine going on a forum to ask a question about maintaining the motor sledge you use, you know Russian and also Dungan - obviously you will ask in Russian, because that's a LOT more people who might answer) - in practice the new scripts in Unicode will only be used by academics to transcribe stuff. It still makes that easier, because they can use Unicode everywhere, not just in specialist tools that maybe another researcher built for the language they care about.


Mostly scholars. But even if nobody at all would be using it currently, the explicit goal of Unicode is to support all scripts. Unicode is meant to make all other text encodings obsolete so the world never has to think about text encodings again (which mostly worked so far). That goal can only be reached/maintained if every script anyone might plausibly want to use is contained in Unicode.


More specifically, scripts and glyphs that have documented and valid use cases. If you made up a script today, you would have to start using it first (and gain acceptance of it in some community) before it would be eligible for inclusion in the Unicode standard. A good example is the power symbol (⏻, Unicode 9.0). The proposal for it neatly documented that it was in wide use already — in manuals in particular.

Emoji are a slightly different beast though. Those seem to get included based on projected use cases.


They used to be included because the Japanese had them in their encoding systems, but the situation now is far more fuzzy. Which is odd for a standard.


>but the situation now is far more fuzzy

It's basically: "text/social comment/chat apps are big, let's add more BS icons for our Facebook/Apple/Google/MS/etc chat apps"


> Unicode is to support all scripts. Unicode is meant to make all other text encodings obsolete so the world never has to think about text encodings again

Technically speaking Unicode is not an encoding, but otherwise your point is mostly correct.


I guess UTF-8 is technically what we would call the encoding (with alternatives like UTF-32 with other tradeoffs). But what would be the correct word for Unicode, if not encoding? I guess I could always say Unicode standard, but that feels like just avoiding the issue (for example we usually say SMTP protocol, not SMTP standard).


"Character Set" is usually the phrase.

A character set can be encoded in a variety of ways, for Unicode / ISO-10646 the encoding UTF-8 is the most popular for a variety of reasons that I'm sure will one day be an exciting historical artefact for HN readers to remark upon.

I don't like the word character, because it tends to cause idiots to build software that thinks Unicode codepoints are the indivisible unit out of which strings are made, and that's no more true than for bytes. I prefer the nice fuzzy word "squiggle" when I mean the thing you as a human are perhaps imagining when saying "character" and to use nice technical terms like "pictogram", "grapheme", "glyph", "code point", "code unit", "symbol", and so on when I mean those specific technical things. But in the phrase "character set" that's what we ended up with, so be it.
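To make the distinction concrete — code points are not the indivisible "squiggles" — a short Python sketch:

```python
import unicodedata

# One squiggle, two code points: 'e' + COMBINING ACUTE ACCENT.
s = "e\u0301"
print(len(s))  # 2

# NFC composes it into the single precomposed code point U+00E9.
print(len(unicodedata.normalize("NFC", s)))  # 1

# Some squiggles have no precomposed form and stay multi-code-point,
# e.g. thumbs-up + skin tone modifier:
print(len("\U0001F44D\U0001F3FD"))  # 2 code points, one rendered glyph
```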


FWIW, I've now updated safeclib to 12.0.0 final from the previous 12.0.0-d1, and there were no changes in the case folding tables. And the changes from 11.0 are minimal, just 6 new entries. So it's just a minimal libc-specific update, thankfully.

For UTF-8-safe languages there would be 4 new scripts to add, but this affects only Rust, Java, and cperl. All others are Unicode-unsafe.
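For reference, the same case folding data shows up in Python's str.casefold (a sketch; not related to safeclib's own API):

```python
# Unicode full case folding: 'ß' folds to 'ss', so caseless
# comparison works where simple lowercasing does not.
a = "Straße"
b = "STRASSE"
print(a.casefold() == b.casefold())  # True
print(a.lower() == b.lower())        # False: lower() leaves 'ß' alone
```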


I would be more amazed to see Tengwar and more conlang scripts added finally.


[flagged]


If you aren't going to post substantively, don't post at all. We've asked you several times before, and eventually we ban accounts that won't follow the guidelines.

https://news.ycombinator.com/newsguidelines.html


I actually went into this post fully expecting to make the same snarky comment, but upon actually reading the article, they added 554 characters and 4 new scripts, along with stuff like Egyptian hieroglyph formatting support, vs. 61 emojis. Which made the idea of snarking about how they only cared about emojis look rather silly.

Anyway, since you asked, the fringe diversity group getting the new custom emojis is disabled people (mainly hearing-impaired, vision-impaired, and wheelchair users). So, y'know, on the order of 10% of the population. Very fringe.


Well, as the linked post says in the first paragraph, almost 90% of the 554 new characters aren’t emoji, and include four new scripts (used by several different languages) and additions for several other scripts. Emoji are only mentioned once in the first half of the article, and unless you consider accessibility-related emoji to be only for a “fringe diversity group”, I’m not sure how the newly added emoji [0] could be particularly controversial?

[0] http://www.unicode.org/emoji/charts/emoji-released.html


And only 14 of the new emoji are related to accessibility, of which 6 are gender-specific.

The rest is stuff like garlic and yo-yo. I mean think of yo-yoers what you will but I don't think they're a fringe diversity group :p.


It's about half-way into the post that it talks about emoji and not actual languages, so the answer is pretty obviously yes.


Given this release adds several new scripts, which actual languages do you think they don't care enough about? (+ of course there being several groups of people involved, so it's not an either/or thing)


The whole emoticon thing has become ridiculous. It's time for Unicode to split off the emoji, let the kids play with their images, and focus on more important, textual things. For example, the cuneiform block is not detailed enough to be used by scholars.


Do you think that the committee approving new emojis is qualified to seamlessly transition into discussing the minutiae of cuneiform formatting? It may be a guess on my part, but I'm pretty confident that there are two different (and probably non-overlapping) sets of people working on these two areas.


Perhaps Unicode could solve the emoji problem once and for all by providing a mechanism for encoding arbitrary .svgs as multi-kilobyte strings of modifier characters.


Emojis started being included in Unicode because they were already in the text encodings used by Japanese SMS. Multi-kB glyphs wouldn't exactly fly in SMS.


I dunno, by wildly extrapolating current trends, by 2050 a typical text-based message will already consume several MB of bloat, Javascript, subliminal advertisements, superliminal advertisements, JSONs encoding XML encoding JSONs encoding base64 encodings of..., 128-bit float variable font settings, prayers to the Packet Gnomes, and more Javascript.

In comparison, a few extra kb won't seem so bad.


Japan has never widely adopted SMS[0]. They have always used proper internet e-mail instead and, more recently, LINE messenger.

[0]Once they added global 3G/UMTS support their networks got SMS support, but it's only used by phone number verification services and 2FA, nobody actually sends them. Their homegrown 2G networks before that launched messaging using E-mail.


Hmm, they did include that block for encoding arbitrary bitmap images. (U+2800)


U+2800 etc. is for Braille, which is an alphabet. Nothing to do with bitmaps.


I think it was a joke referring to some people using Braille to display black and white images as text, e.g. https://loveeevee.github.io/Dots-Converter/retro.html


Yeah. Key to that use is that they didn't pick any specific patterns, they went with every possible combination of 8 dots being present/absent.
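That design makes the block trivial to compute with: each dot is one bit, so the 256 patterns fill U+2800..U+28FF exactly. A sketch (the helper name is mine):

```python
# Bit offsets per the Unicode Braille Patterns block:
#   dot 1 = 0x01   dot 4 = 0x08
#   dot 2 = 0x02   dot 5 = 0x10
#   dot 3 = 0x04   dot 6 = 0x20
#   dot 7 = 0x40   dot 8 = 0x80
DOT_BITS = {1: 0x01, 2: 0x02, 3: 0x04, 4: 0x08,
            5: 0x10, 6: 0x20, 7: 0x40, 8: 0x80}

def braille(dots) -> str:
    """Map a collection of dot numbers (1-8) to its Braille pattern."""
    return chr(0x2800 + sum(DOT_BITS[d] for d in set(dots)))

print(braille([1]))          # ⠁ U+2801 BRAILLE PATTERN DOTS-1
print(braille(range(1, 9)))  # ⣿ U+28FF, all eight dots set
```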


Nitpick: the U+2800 block specifically isn’t an alphabet. For example, U+2801 is called “Braille Pattern dots-1”. That’s like calling U+0069 “vertical bar with a dot on top”, rather than “Latin small letter I”

(Reason for that is that the dot patterns are heavily overloaded and even language specific. German digits are different from US ones, for example)


Of course I know these are two different committees. I'm not talking about that level of separation, but about splitting Unicode into Unicode proper and putting the emoji in another standard. Emoji standardisation became orthogonal to encoding present and past writing systems once the set of existing Japanese emoji was encoded. So it became a shit show as soon as the standard created previously non-existing characters as emoji. By comparison, if you create your own alphabet it won't be included, so newly created emoji should be treated the same way: excluded from the standard.


Until there's an emoji representing every imaginable demographic group engaged in every imaginable activity, Unicode is not complete.


In what way is the Cuneiform block insufficiently detailed?


Variants. There are a lot of differences in the forms of characters across time, place, and writing style. Scholars want to be able to reproduce text with fidelity, so the current standard is insufficient for them (I heard that from a scholar in the field). In a way it is similar to the issue created by Unihan: abstractly the unified characters are 'the same', but people still want to see them displayed with a given shape.



