
How I added 6 characters to Unicode (and you can too) - deafcalculus
http://www.righto.com/2016/10/inspired-by-hn-comment-four-half-star.html
======
1wd
The rationale given for including mirrored half-stars as separate codepoints
is right-to-left languages. I wondered why this was needed, since Unicode
already has a right-to-left mark (RLM).[1]

I found the answer in a comment on "Explain XKCD".[2] The RLM usually only
reorders characters, but does not mirror their glyphs. The exceptions are
glyphs with the "Bidi_Mirrored=Yes" property, which are mapped to a mirrored
codepoint.[3]
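The property is easy to see from Python's unicodedata module (a quick illustration of mine, not part of the proposal):

```python
import unicodedata

# Bidi_Mirrored characters get a mirrored glyph in right-to-left runs;
# paired punctuation is the classic example, while stars are not mirrored.
print(unicodedata.mirrored('('))       # 1: '(' mirrors in RTL text
print(unicodedata.mirrored('\u2605'))  # 0: BLACK STAR does not
```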

The half-stars proposal includes a note on that property: "Existing stars are
in the “Other Neutrals” class, so half stars should probably use the ON
bidirectional class. The half stars have the obvious mirrored counterparts, so
they can be Bidi mirrored. However, similar characters such as LEFT HALF BLACK
CIRCLE are not marked as mirrored. I'll leave it up to the Unicode experts to
determine if Bidi Mirrored would be appropriate or not."

[1] [https://en.wikipedia.org/wiki/Right-to-left_mark](https://en.wikipedia.org/wiki/Right-to-left_mark)

[2]
[https://www.explainxkcd.com/wiki/index.php/1137:_RTL](https://www.explainxkcd.com/wiki/index.php/1137:_RTL)

[3]
[http://www.unicode.org/Public/UNIDATA/BidiMirroring.txt](http://www.unicode.org/Public/UNIDATA/BidiMirroring.txt)

------
syphilis2
I also enjoyed this article about adding the electrical power on, off, sleep,
and standby symbols to Unicode. [http://unicodepowersymbol.com/we-did-it-how-a-comment-on-hackernews-lead-to-4-%C2%BD-new-unicode-characters/](http://unicodepowersymbol.com/we-did-it-how-a-comment-on-hackernews-lead-to-4-%C2%BD-new-unicode-characters/)
[https://news.ycombinator.com/item?id=11958682](https://news.ycombinator.com/item?id=11958682)

------
treve
The one I'm surprised about is not the stars, but actually the bitcoin
character. It's just a form of branding to me, and while I think there are
interesting uses for blockchain technology, public interest seems a bit
inflated. Plus, blockchain tech will likely outlive bitcoin itself.

~~~
sqeaky
It's not like there is some central Bitcoin company, so what is the brand?
Brands are generally owned by companies and are intellectual property in the
eyes of governments.

~~~
nerdponx
Whatever it is, it's likely to be short-lived and therefore a questionable
addition to Unicode.

~~~
sqeaky
Why do you think it will be short lived and on what scale is it short?

~~~
nerdponx
As other people said, blockchain technology will outlive bitcoin. In this case
I am saying "short" to mean a decade or so. I expect Unicode to last much
longer.

------
nacc
It is great to see Unicode being able to encode almost every symbol people can
think of; however, I am still struggling to make them appear on my screen. Is
there a good font that has great coverage of Unicode? Many times there is
clever use of Unicode, yet I can only see empty rectangles.

~~~
glitch
[https://en.wikipedia.org/wiki/Unicode_font#List_of_Unicode_f...](https://en.wikipedia.org/wiki/Unicode_font#List_of_Unicode_fonts)

~~~
rspeer
Keep in mind that what you want may not be one font that covers lots of glyphs
-- that makes the font take up lots more memory and take longer to load. And
you _definitely_ wouldn't want to use a high-coverage Unicode font as a
dynamically-loaded Web font.

Operating systems are fine at understanding that different fonts are necessary
for different glyphs, so what's better in a lot of cases is to have a _family_
of fonts that together cover all the glyphs you need. That's what Google Noto
[1] is doing.

[1] [https://www.google.com/get/noto/](https://www.google.com/get/noto/)

Symbola is a good font for covering a lot of symbols, while not representing
many text characters (on the assumption that you already have fonts you prefer
for text).

That said, there's a justification for having a few of the fonts on that
chart, like Lucida Sans Unicode and Arial Unicode MS, because they guarantee
consistency without you having to install a huge font family. GNU Unifont is
also interesting in a hackery kind of way, in that it achieves good coverage
by using only pixelly bitmaps.

But on the other hand, Code2000 is an awful font. It eats gobs of memory and
it looks bad. Don't use it just because it has a lot of glyphs.

~~~
cooper12
GNU Unifont is just a fallback font, which I think is what the parent really
needs, since they're most concerned with seeing the symbol at all, and I doubt
they care about consistent appearance with their main font.

[https://en.wikipedia.org/wiki/Fallback_font](https://en.wikipedia.org/wiki/Fallback_font)

------
markbao
I love this – but does it bother anyone else that the outlined and filled
stars have different sizes? What's the reason behind that?

HN strips the characters out from comments, but they're displayed in the
beginning of the article.

~~~
treve
Unicode does not dictate how glyphs are presented. It just describes and
categorizes them.

So how they look comes from the font that is used. For the proposal, these
fonts likely didn't exist yet, so it was probably just a (slightly sloppy)
Photoshop mock-up.

~~~
markbao
That's a good point, and I should have clarified: I'm referring to the full
stars (not the half-stars in the new proposal). Not a Unicode issue, but
definitely something I've seen at least on macOS machines.

------
edent
So glad the unicodepowersymbol.com stuff was helpful! We had a lot of fun
getting the proposal together.

If anyone wants to submit some new characters, all of our documents are on
GitHub
[https://github.com/jloughry/Unicode](https://github.com/jloughry/Unicode)

------
Animats
We need to hold the line somewhere. Preferably before corporate logos get into
Unicode. I've seen Facebook and Twitter icons as Unicode characters in the
user-definable space. This currently requires a downloaded font, but there's
probably some lobbyist somewhere trying to get them into Unicode.

It's getting really complicated. There are now skin-tone modifiers for emoji.

~~~
WalterBright
Unicode is turning into a few useful characters amid a sea of junk. This will
continue as long as people acquire status by getting "their" symbol(s) into
Unicode. I don't see any way this can change.

~~~
Animats
How are Windows and Java, which are somewhat tied to 16-bit Unicode, handling
this? It used to be that the astral planes didn't matter much, but now they
do.

~~~
j4_james
That's what surrogate pairs are for.[1] You're no longer working with one
code point per character, but even with 32-bit Unicode there's no real
guarantee of that (consider things like combining characters, accents, emoji
skin tones, etc.).

[1] [https://msdn.microsoft.com/en-us/library/windows/desktop/dd374069\(v=vs.85\).aspx](https://msdn.microsoft.com/en-us/library/windows/desktop/dd374069\(v=vs.85\).aspx)
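The mechanics are simple enough to sketch in a few lines of Python (a toy re-implementation to show the bit-shuffling, not how any particular runtime does it):

```python
def to_surrogate_pair(cp):
    """Split a supplementary-plane code point into a UTF-16 (high, low) pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    offset = cp - 0x10000             # a 20-bit value
    high = 0xD800 + (offset >> 10)    # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits -> low surrogate
    return high, low

print([hex(u) for u in to_surrogate_pair(ord('😀'))])  # ['0xd83d', '0xde00']
```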

~~~
WalterBright
Soon 20 bits won't be enough, either, and every Unicode program out there will
break :-(

~~~
ygra
Unicode is 21 bits wide. And there's lots of space left. Heck, Emoji still
make up very little of the total encoded characters, compared to “normal”
human writing systems. (And I'd argue that emoji _are_ by now a normal
addition to writing, considering how many people use them daily and can be
glad to have them interoperable across different platforms, carriers, and
devices. Something that hasn't been that way previously.)

------
amelius
Perhaps we should have an escape code for SVG in Unicode, so we can describe
any missing character.

~~~
wxs
Unicode Technical Report #51, which is where Emoji are laid out, talks a bit
about the current thinking of the committees on this:

> The longer-term goal for implementations should be to support embedded
> graphics, in addition to the emoji characters. Embedded graphics allow
> arbitrary emoji symbols, and are not dependent on additional Unicode
> encoding. Some examples of this are found in Skype and LINE—see the emoji
> press page for more examples.

> However, to be as effective and simple to use as emoji characters, a full
> solution requires significant infrastructure changes to allow simple,
> reliable input and transport of images (stickers) in texting, chat, mobile
> phones, email programs, virtual and mobile keyboards, and so on. (Even so,
> such images will never interchange in environments that only support plain
> text, such as email addresses.) Until that time, many implementations will
> need to use Unicode emoji instead

[1]
[http://unicode.org/reports/tr51/#Longer_Term](http://unicode.org/reports/tr51/#Longer_Term)

------
hf
I simply cannot wrap my head around the direction of the Unicode discourse.

We're discussing the appropriate code-point for different smiley faces,
obscure electrical symbols[0] or, in the present case, half stars to express
film or book ratings, yet we have _no_ complete set of sub- and superscripts!

Am I mistaken in thinking it odd that there's a complete Klingon alphabet but
no representation whatsoever for most Greek or Latin subscripts? Or what if,
heaven forbid, I'd want to use a 'b' index/subscript? Tough! Not even the
"Phonetic Extensions" block, where subscript-i comes from, provides it.

Refer to
[https://en.wikipedia.org/wiki/Unicode_subscripts_and_supersc...](https://en.wikipedia.org/wiki/Unicode_subscripts_and_superscripts#Latin_and_Greek_tables)
or look for SUBSCRIPT in
[http://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt](http://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt)

Surely there are one or two actual scientists on the Unicode Consortium? Or
even one odd soul still sporting a notion of consistency, who finds it only
logical to provide a "subscript b" if there's a "subscript a"?

How am I wrong?

[0]
[https://news.ycombinator.com/item?id=11958682](https://news.ycombinator.com/item?id=11958682)
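The gap is easy to confirm from the character names via Python's unicodedata (a quick check of mine, not an exhaustive survey of the standard):

```python
import unicodedata

def has_subscript(letter):
    """True if a dedicated LATIN SUBSCRIPT SMALL LETTER code point exists."""
    try:
        unicodedata.lookup(f'LATIN SUBSCRIPT SMALL LETTER {letter.upper()}')
        return True
    except KeyError:
        return False

# 'a' has a subscript form (U+2090), but 'b' has none at all.
print(has_subscript('a'), has_subscript('b'))  # True False
```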

~~~
jessaustin
ISTM a great deal of trouble and complication could have been prevented by
three special types of NBSP that meant "sub", "super", and "back to normal".
It's true that some glyphs will be special-cased by some fonts, but in general
the glyph is just shrunk and translated when sub- or super-scripted.

~~~
db48x
Yes, just like the LRE/LRO/RLE/RLO/PDF/etc characters.

------
gjasny
It would be cool to see the Powerline symbols added to Unicode. The
necessary user base should already be there.

See:
[https://github.com/powerline/fonts/blob/master/README.rst](https://github.com/powerline/fonts/blob/master/README.rst)

A zsh theme with those characters in use:
[https://gist.github.com/agnoster/3712874](https://gist.github.com/agnoster/3712874)

~~~
yes_or_gnome
I have to disagree. All but three of those pictographs are already in the
Unicode standard. You have to patch fonts A) because your preferred font may
not have them, and B) to make certain that the font meets Powerline's
expectations.

The ones that are "unique" are a bit annoying because they replace defined
characters in the Basic Multilingual Plane's Private Use section (U+E000-U+FFFF).
Even though the section is "Private Use", it is often already used by your
OS's system font. There are the Supplemental Private Use Areas A (U+F0000-U+FFFFD)
and B (U+100000-U+10FFFF), which can be overwritten safely.

I scare-quote "unique" because two of those characters are full-height arrows,
one right-pointing and the other left-pointing. These are already defined as
U+1F780 (🞀) and U+1F782 (🞂). It may be that in some fonts the triangles
either A) don't actually go from floor to ceiling, or B) have empty space
behind their hypotenuse.

The only truly unique character is the "git branch" pictograph. Maybe someone
could write up a convincing argument to include it, but I can't imagine one.
It's not a symbol you see too often, even in the git community. And I would bet
that if you looked hard enough, there's some mathematical symbol that would be
suitable.

Just FYI, I've used powerline fonts daily for the past ~3 years.

------
YeGoblynQueenne
That's great but what we really need (ahem- what _I_ really need) is more
maths-y characters, like ∑∏∫∀ and all the sub- and super- scripted letters:
ⁱⁿₙᵢ and so on.

I can never find a lower-case Greek subscripted α or β when I need one...

~~~
JadeNB
> That's great but what we really need (ahem- what I really need) is more
> maths-y characters, like ∑∏∫∀ and all the sub- and super- scripted letters:
> ⁱⁿₙᵢ and so on.

Agreed, but what we need even more than the _symbols_ is some ((La)TeXy, says
the mathematician) way of _combining_ them. For example (says the
mathematician who doesn't understand the complexity of text encodings), why do
we need a whole bunch of separate "subscript m", "subscript n", etc., glyphs,
rather than just one "subscript" combining mark?

------
WalterBright
Unicode is a brilliant idea, but it went off the rails with combining
characters, especially when there is both a code point for a character and a
combining set of characters that semantically are the same thing.

~~~
ygra
How would you solve things without combining characters? Especially the case
where you can have multiple diacritics on a letter. Encode every single
combination of all of them? Seems a bit wasteful, don't you think?

Precomposed characters exist because they existed in other encodings
previously and encoding such characters has been one of the core principles of
Unicode to ensure an easy upgrade path. Heck, we inherited box drawing
characters that way, which I think are more questionable than combining
diacritics.

~~~
WalterBright
At a minimum, I would not have any 2-character graphemes be semantically
identical with any single code point.

> Seems a bit wasteful

If they were all separate code points, how many are we talking about?

Also, consider that nearly every Unicode program handles them wrongly. That's
pretty wasteful of programmer time and money.

~~~
ygra
The precomposed characters only exist for compatibility with existing
character sets and encodings. If you don't want to deal with them in your
code, just normalize to NFD and they're gone. If Unicode didn't care about
compatibility to legacy character sets at all, adoption would have been very
different, I guess. By now it's probably a moot point since not supporting
Unicode is foolish at best, but in the early 90s things were very different.
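For instance, in Python the precomposed form disappears after one normalize call (assuming the standard unicodedata module):

```python
import unicodedata

s = '\u00E9'                             # 'é' as a single precomposed code point
nfd = unicodedata.normalize('NFD', s)
print([f'U+{ord(c):04X}' for c in nfd])  # ['U+0065', 'U+0301']: e + combining acute
print(unicodedata.normalize('NFC', nfd) == s)  # True: NFC restores the precomposed form
```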

As for diacritics, it depends on what you care about for precomposing them.
Actual usage for scripts currently in use? Then it's only a handful, and the
worst cases are probably Vietnamese and Ancient Greek, which have a bunch of
characters with more than one diacritic.

However, the current system with composable diacritics gives you plenty of
flexibility: Need a character with a diacritic that isn't used in any language
currently? Just compose them and you've got it. Font support may be spotty (note
that Unicode and font support are completely separate things – bashing Unicode
for bad fonts is a fairly useless endeavour), but at least you can represent
that grapheme in text without resorting to embedding images, or overlaying
glyphs by other means (cf. TeX). Those options are also not interoperable with
any other applications.

It also means that if some language now develops a script based on, say,
Latin, and invents a new diacritic that can go on different vowels, you'd only
have to encode a single new code point, not five or six of them. It scales far
better and also isn't tied to any specific writing system. I can use ´ on a or
on ω and it works the same.
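A small Python illustration of that flexibility (my own example): 'q' with a tilde has no precomposed code point, yet a combining mark produces it anyway.

```python
import unicodedata

q_tilde = 'q\u0303'   # q + COMBINING TILDE: no precomposed form exists
print(unicodedata.normalize('NFC', q_tilde) == q_tilde)  # True: stays decomposed

n_tilde = 'n\u0303'   # n + COMBINING TILDE: precomposed U+00F1 exists
print(len(unicodedata.normalize('NFC', n_tilde)))        # 1: collapsed to 'ñ'
```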

And could you elaborate on how “nearly every Unicode program handles them
wrongly”? I'd argue that most programs coming into contact with Unicode do
little more than passing it along without caring about the contents at all.
And trying to shoehorn human language into something an average programmer can
handle without error is likely impossible. Language is complex, writing is
complex; Unicode is complex as a result of that. This doesn't only apply to
text, mind you: there are lots of things that are complex and are often
implemented naïvely or wrongly by programmers who don't know any better. That
usually means those programs are broken and many programmers should know
better, not that we should try adjusting the world to broken programs.

~~~
WalterBright
> And could you elaborate on how “nearly every Unicode program handles them
> wrongly”?

A good chunk don't handle surrogate pairs correctly (or aren't even aware of
them); the rest get tripped up by combining characters. Even for those who
understand the issue, there are no clear answers: should a combining sequence
compare equal to a precomposed character? And of course there are 3 levels of
UCS support.
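The comparison question can at least be answered mechanically by normalizing both sides first, e.g. in Python (one conventional answer, not the only possible policy):

```python
import unicodedata

precomposed = '\u00E9'   # é as one code point
combined = 'e\u0301'     # e + combining acute, two code points

print(precomposed == combined)   # False: raw code-point comparison
print(unicodedata.normalize('NFC', precomposed)
      == unicodedata.normalize('NFC', combined))  # True: canonically equivalent
```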

The whole existence of an unnormalized form is a gigantic mistake that could
have been easily avoided - simply make the unnormalized form an illegal
sequence to begin with.

Unicode programming hasn't gotten as bad yet as timezone programming, but it's
well on its way :-(

------
tantalor
What about 1/4, 3/4, 1/5, etc...?

~~~
Symbiote
¼, ¾, ⅕.

For etc, start here: [http://unicode-search.net/unicode-namesearch.pl?term=fraction](http://unicode-search.net/unicode-namesearch.pl?term=fraction)

You can use "fraction slash" to make any fraction, using super/subscript
numbers: ⁷⁄₃₃
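Assembling such a fraction is mechanical, e.g. in Python (a small helper of my own that only handles the digits 0-9):

```python
# Map ordinary digits to their superscript and subscript counterparts.
SUP = str.maketrans('0123456789', '\u2070\u00B9\u00B2\u00B3\u2074\u2075\u2076\u2077\u2078\u2079')
SUB = str.maketrans('0123456789', '\u2080\u2081\u2082\u2083\u2084\u2085\u2086\u2087\u2088\u2089')

def fraction(num, den):
    """Render num/den with U+2044 FRACTION SLASH and super/subscript digits."""
    return str(num).translate(SUP) + '\u2044' + str(den).translate(SUB)

print(fraction(7, 33))  # ⁷⁄₃₃
```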

~~~
TazeTSchnitzel
I thought they were talking about fractions too, but then I realised they're
probably talking about fractions _of stars_.

~~~
Symbiote
Ah, I see. Something like ◔ "CIRCLE WITH UPPER RIGHT QUADRANT BLACK".

Someone requested something similar here [1], and someone else made it using
CSS here [2]. As the article explains though, it would need to be used in text
for the Unicode committee to accept it.

[1] [https://github.com/FortAwesome/Font-Awesome/issues/4147](https://github.com/FortAwesome/Font-Awesome/issues/4147)

[2] [http://codepen.io/denwo/pen/azjXzL](http://codepen.io/denwo/pen/azjXzL)

------
koltaggar
Best part is where you swap Andrew West's first name for Adam.

~~~
kens
Oops, sorry Andrew! Many apologies! (I watched way too much Batman as a child
and "Adam West" is wired into my brain.)

