
Unicode is hard - edent
https://shkspr.mobi/blog/2017/05/unicode-is-hard/
======
masklinn
> The £ is printed just fine on some parts of the receipt!

> ⨈Һ𝘢ʈ ╤ћᘓ 𝔽ᵁʗꗪ

Assuming the printer uses ESC/POS[0] (which is likely), the code page is part
of the printer's state. To change the code page, the driver sends a specific
ESC command (<ESC t n> aka <1B 74 n>, where n is the desired code page byte;
none of the options is "UTF-8", incidentally), and it can change the code page
before each printed character.
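
To make the mechanism concrete, here is a small sketch of the byte sequences such a driver sends. The code page numbers are vendor-specific; 0 = PC437 and 16 = Windows-1252 are common Epson assignments, assumed here purely for illustration.

```python
# Sketch (not vendor documentation) of the ESC t code page selection command.
# Code page numbers vary by manufacturer; 0 = PC437 and 16 = Windows-1252
# are common Epson assignments, assumed here for illustration.
ESC = b"\x1b"

def select_codepage(n):
    """Build ESC t n (hex 1B 74 n), which switches the printer's code page."""
    return ESC + b"t" + bytes([n])

# The same "£" glyph needs a different byte under each code page:
pc437_pound = select_codepage(0) + b"\x9c"                  # £ is 0x9C in PC437
wpc1252_pound = select_codepage(16) + "£".encode("cp1252")  # £ is 0xA3 here
```

Send the byte without switching pages first and the printer happily prints whatever glyph lives at that position in the current code page, which is exactly the failure mode on the receipt.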

So it's the driver software fucking up and either misencoding its content
(most likely) or selecting the wrong codepage. The £ might be displayed
correctly on the right side because it's e.g. hard-coded (properly encoded)
while the product label is dynamic and when that was added/changed no care was
taken with respect to properly transcoding. The printer absolutely doesn't
care, it just maps a byte to a glyph according to the currently selected
codepage.

[0] ESC because the protocol is based on proprietary ESCape codes[1], POS
because the entire thing's a giant piece of shit

[1]
[https://en.wikipedia.org/wiki/Escape_character#ASCII_escape_...](https://en.wikipedia.org/wiki/Escape_character#ASCII_escape_character)

~~~
na85
Sadly, my Oneplus 3T rendered the A and the F of your "what the fuck" as black
boxes.

Mind boggling that this is still a problem today.

~~~
boomboomsubban
I believe that means it's correctly interpreting the Unicode, but there isn't
a font that contains a character for that code. I think this is because the
"official" Android font is patented, another layer of absurd crap that leads
to many Unicode issues.

------
LeonM
My name is Léon, with the acute accent on the e. I usually leave out the
accent when I need to enter my name somewhere digitally, since in about 50% of
cases it's not handled correctly; it usually ends up as L'on.

Even in the travel world it goes wrong all the time. You'd expect large
international travel organisations (yes, talking to you Tui!) to be able to
handle UTF8 names since many of their customers and locations will have
special characters, but no. I was once nearly refused boarding because the
name on my ticket did not match the one on my passport...

~~~
makecheck
String-matching is really scary in Unicode, especially since the exact _form_
of the string matters with respect to composition — and that’s before you even
consider that some characters just plain _look like_ others or even _are_ the
same glyph. And strings can contain things like zero-width spaces that look
like nothing at all.

Sure, there are recommended practices but there have been enough mistakes
already (or lazy programmers) that it is hard to be confident that any string
with “interesting” symbols in it is exactly what it appears to be. And there
have been security problems related to the fact that many interfaces expect
the _user_ to know exactly what they’re reading, when even the programmer
often doesn't.

~~~
Klathmon
It almost sounds like there could be a lot of benefit from a "homophones
check" system but for Unicode glyphs (perhaps with a variable amount of
"closeness") being built into Unicode handling libraries.

Like how "е" looks identical to "e", which looks close to "ė" if you aren't
careful, which might be mistaken for "é" in smaller fonts even though all 4
letters are different unicode glyphs.

Being able to say "ɡrеɡ" is the same as "greg" even though 3 of the 4
characters are actually different would be extremely useful in some cases, and
in others would be extremely incorrect, so giving the developer the ability to
say how "exact" they need their checks to be in a "native" and easy way might
go a long way toward not only making this problem more "obvious" but also
toward forcing them to be explicit about what they are checking for.

~~~
asveikau
This comparison is language specific.

Do you group c with ç and č? In English you would. In France, Portugal,
Serbia, the Baltic States or Czech republic you may not.

~~~
ygra
It's also font-specific. Is т a homoglyph of T or m, for example? There isn't
really a good way to solve this because restricting systems to only use ASCII
(which also has homoglyphs, e.g. 0/O, 1/l, I/l, ...) is very user-unfriendly.

~~~
derefr
т _is_ a homoglyph of T—because one could mistake one for the other. They're
in a visual equivalence-class. That doesn't mean that you should normalize т
_into_ T, though. Those are separate considerations.

If you were granting e.g. domain names, or usernames, you'd be able to map
each character in the test string to its homoglyph equivalence-class, and then
ask whether anyone has previously registered a name using that sequence of
equivalence-class values. So someone's registration of "тhe" would preclude
registering "the", and vice-versa; but when you normalized "тhe", you'd still
get "mhe".

Of course, to use such a system properly, you'd have to keep the original
registered variant of the name around and use it in URL slugs and the like
(even if that means resorting to punycode), rather than trying to
"canonicalize" the person's provided name through a normalization algorithm.
Because they have "[the equivalence class of т]he", not "mhe"; someone _else_
has "mhe".
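
A toy sketch of that registration scheme. The confusables table here is a tiny hand-picked sample (a real system would derive it from Unicode's confusables data, UTS #39), and grouping "т" with "t" is itself a judgment call, as the thread notes.

```python
# Toy sketch of homoglyph-aware name registration. The confusables table is
# a tiny hand-picked sample; a real system would derive it from Unicode's
# confusables data (UTS #39).
CONFUSABLES = {
    "\u0435": "e",  # е CYRILLIC SMALL LETTER IE
    "\u0442": "t",  # т CYRILLIC SMALL LETTER TE (a choice; some fonts show it as "m")
    "\u0261": "g",  # ɡ LATIN SMALL LETTER SCRIPT G
}

def skeleton(name):
    """Map each character to a representative of its equivalence class."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in name)

registered = {}  # skeleton -> original spelling, kept for URL slugs etc.

def register(name):
    key = skeleton(name)
    if key in registered:
        return False  # some confusable variant is already taken
    registered[key] = name  # store the original spelling, never the skeleton
    return True
```

With this, registering "тhe" succeeds and blocks a later registration of "the", while the stored name remains "тhe", just as described above.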

~~~
Wyverald
> т is a homoglyph of T—because one could mistake one for the other.

I believe gp is talking about the font. In some fonts (especially
italic/cursive), the letter "т" looks like "m", and nothing like "T" \-- so
it's really hard to say with which one it's "visually equivalent".

------
d2p
I've seen this a lot too; but it's not the weirdest thing I've seen on a
receipt... We once ate at The Boot Room at Cheshire Oaks and when we got the
bill, the numbers didn't add up! (I don't add these things up but since the
things we ordered were fairly round numbers and should've been just below £20
and the bill was just over, it was obvious something was fishy).

I totalled the numbers up again and the total was exactly £1.50 less than the
total shown on the bill! My wife (having no faith in my basic adding skills)
pulled out her phone telling me "don't be silly" and added them up to get the
same result as I had.

We asked the waiter about it, who disappeared off to get his own calculator..
He added things up, looked confused and then took it off to the manager. She
then repeated the process on the calculator and also looked confused, unable
to explain what had happened. They gave us £1.50 in cash, apologised and then
kept the receipt (I guess they didn't want us posting that on twitter!).

To this day I've no idea what happened. You could suggest that some programmer
somewhere is getting rich off this, but it seems rather unlikely to me. I'd
really love to know what the cause was (and whether the manager ever reported
it further up the chain; because this seems like a rather serious error to
me.. how often does it happen? is it always £1.50? did the issue get
found/fixed?).

~~~
LeonM
I develop software algorithms for automatic processing of invoices and
receipts. I have analysed hundreds of them and you'd be surprised to see how
many contain errors like totals not matching the products, VAT breakdowns not
matching the percentages, rounding errors etc.

In my experience this is usually because the 'financial software' systems used
to create invoices are sold by companies with 90% salespeople and maybe one or
two developers. There seems to be little to no quality control. No one seems
to care, since 'financial software' is very lucrative anyway.

In the beginning I tried to report the faulty invoices to the suppliers,
thinking that they'd immediately press the big red emergency button and fix
it, but in most cases the service desk employee does not care about, or even
understand, what I am talking about. Most of them send the 'thank you for your
report, we are working very hard to fix the problem' email, but never actually
fix it.

~~~
pavel_lishin
I've seen a register app coded in Javascript.

~~~
Thiez
I doubt the floating point errors are going to make much of a difference with
the numbers the average register has to deal with. Javascript can represent
integers up to 9007199254740991 accurately, so if you do all your calculations
in cents your register can process a little over 90 trillion dollars before
things get problematic.

~~~
pavel_lishin
As dragonwriter points out, there are many, many places where non-integer math
ends up taking place - taxes, discounts, coupons, three-for-ones, etc.

> _if you do all your calculations in cents_

That would certainly have been a good idea also, I bet.

~~~
Thiez
How would a "three-for-one" action end up with floating point inaccuracies? I
imagine the price for three items would tend to be divisible by three.

~~~
wayn3
Floating point is intrinsically inaccurate. You can't use it to handle money.

With floating point, the assumption (x*y)/y = x does simply not hold.

~~~
Thiez
Floating point numbers can accurately represent integers, so if you have all
your prices in cents, you end up with (x * 3) / 3, where each number in that
calculation is an integer. No inaccuracy there. Of course there is no reason
for a register to actually perform this division in a three-for-one action, it
can just replace (x * 3) with x, or subtract (x * 2).

I agree as much as the next person that you shouldn't represent money with
floating point numbers, but I disagree that cash register software written in
Javascript must automatically be incorrect (or more so than similar software
in a different language). And I don't even like Javascript.
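
This point is easy to check with Python floats, which are the same IEEE-754 doubles JavaScript uses:

```python
# Verifying the claim: IEEE-754 doubles represent integers exactly up to
# 2**53, and correctly-rounded arithmetic on representable operands with
# representable results introduces no error.
price_cents = 19999.0                         # £199.99 in integer cents
assert (price_cents * 3) / 3 == price_cents   # 59997 and 19999 are both exact

# The classic errors only appear with non-representable values:
assert 0.1 + 0.2 != 0.3        # none of these has an exact binary form
assert 2.0**53 + 1 == 2.0**53  # past 2**53, consecutive integers collapse
```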

~~~
wayn3
Floating point can accurately represent more than just integers. That does not
mean it can accurately perform mathematical operations on them.

I'd assume that division is implemented as multiplying by the reciprocal
(because that's faster). If that's correct, then any division by 3 (or any
other number that is not of the form 2^i) breaks your cash register.

Because 1/3 represented as a floating point number equals 0.33333333.. and so
on but not indefinitely, 3 * 1/3 = 0.99999999.. - which would be equal to 1 if
you were using real numbers. But you don't.

------
arielm
I'm pretty sure the reason only some of the currency symbols are printed
incorrectly has to do with the database.

If you think about it, the item names are most likely coming from a database
that just might not be in the right encoding (latin1 is still the default in
MySQL I think). The symbols that do work are probably hard coded into the
receipt's template, and hence don't have this problem.

Why a shop owner would store the price and currency symbol in an item's
description is beyond me, but having worked in the POS world and seeing what
shop owners do with their items I'd definitely believe it.

~~~
lmm
Note that the encoding that MySQL calls "latin1" (and uses as its default) is
not, in fact, latin1. It is windows cp1252 except with 8 random characters
swapped around. I wish I was joking.
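
Byte 0x80 shows the gap between the two encodings; the characters MySQL swaps around presumably live in slots like the ones where cp1252 and real latin1 disagree:

```python
# Real ISO 8859-1 maps 0x80-0x9F to C1 control characters, while
# Windows-1252 fills most of that range with printable characters.
assert b"\x80".decode("latin-1") == "\x80"  # C1 control character in latin1
assert b"\x80".decode("cp1252") == "€"      # Euro sign in Windows-1252

# A few cp1252 slots (0x81, 0x8D, 0x8F, 0x90, 0x9D) are left undefined;
# encodings that extend cp1252 diverge in exactly such places.
try:
    b"\x81".decode("cp1252")
except UnicodeDecodeError:
    pass  # strict cp1252 has no character here
```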

~~~
0x0
Haha, and what mysql calls "utf8" is not, in fact, all of utf8. That's called
"utf8mb4".
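
The difference is byte length per character, quickly checked in Python:

```python
# MySQL's "utf8" (really utf8mb3) stores at most three bytes per character,
# so any code point above U+FFFF requires utf8mb4.
assert len("€".encode("utf-8")) == 3   # U+20AC: fits in MySQL "utf8"
assert len("😀".encode("utf-8")) == 4  # U+1F600: requires utf8mb4
```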

~~~
Joeri
It also can't sort unicode correctly according to the standard UCA algorithm.
The ticket for this is closed as a wontfix.

------
Houshalter
Ode to a shipping label;
[http://fun.drno.de/pics/english/ode_to_a_shipping_label.png](http://fun.drno.de/pics/english/ode_to_a_shipping_label.png)

------
bhaak
"The £ is printed just fine on some parts of the receipt!"

That's probably a hint that it isn't the printer's fault.

I would guess that some other system that is used to enter what's available on
the menu is using CP 437 and somewhere an encoding step (CP 437 to Unicode) is
missing so we get the ú character.

I wonder what character we would get if it was a "5€ cocktail" instead.
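
Both guesses can be checked in Python. The Latin-1/Windows-1252 byte for £ is exactly the CP437 code for ú, and assuming the upstream system holds Windows-1252 (€ has no Latin-1 encoding at all), a € would come out as Ç:

```python
# The £ -> ú corruption on the receipt, reproduced:
assert "£".encode("latin-1") == b"\xa3"
assert b"\xa3".decode("cp437") == "ú"

# The "5€ cocktail" thought experiment, assuming a Windows-1252 upstream:
assert "€".encode("cp1252") == b"\x80"
assert b"\x80".decode("cp437") == "Ç"   # it would print "5Ç cocktail"
```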

~~~
ourcat
Yes. I'd say this points more to an issue with how the product name was stored
in a database, rather than the printer itself.

Bad collation.

~~~
vardump
Bad collation? That just changes alphabetical order, no?

------
cbr

        In order to maintain backwards compatibility with existing
        documents, the first 256 characters of Unicode are identical to
        ISO 8859-1 (Latin 1).
    

This isn't true in a useful sense. It does look like it's true in Unicode
codepoint space [1] but in any specific encoding of Unicode it can't be the
case because latin1 uses all 0-255 byte values. For example, in utf8 it's only
an exact overlap for bytes 0-127 (7 bit ascii).

(Though maybe this means you could convert latin1 to utf-16 by interleaving
null bytes with the latin1 bytes?)

[1]
[https://en.m.wikipedia.org/wiki/Latin-1_Supplement_(Unicode_...](https://en.m.wikipedia.org/wiki/Latin-1_Supplement_\(Unicode_block\))

~~~
jcranmer
> (Though maybe this means you could convert latin1 to utf-16 by interleaving
> null bytes with the latin1 bytes?)

Yes. In fact, things like JS JITs end up storing strings as either UTF-16
strings or Latin1 strings internally to take advantage of this fact.
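
The interleaving trick is easy to verify; it works precisely because code points U+0000 to U+00FF coincide with Latin-1, and UTF-16BE writes each BMP code point as two big-endian bytes:

```python
# Convert Latin-1 to UTF-16BE by inserting a zero byte before each byte.
latin1_bytes = "café £5".encode("latin-1")
utf16be = b"".join(b"\x00" + bytes([b]) for b in latin1_bytes)
assert utf16be == "café £5".encode("utf-16-be")
```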

~~~
wereHamster
JavaScript uses (used, until a recent version) UCS-2, not UTF-16!

~~~
ygra
Most JavaScript implementations have a bunch of different string types used
internally, depending on what you're doing with the string. In-memory
representation has no bearing on the API visible to the outside world.

And while the JavaScript _APIs_ only allow you to deal with UCS-2, the string
contents themselves are, in fact, usually UTF-16.

------
TazeTSchnitzel
> So ASCII gradually morphed into an 8 bit language - and that's where the
> problems began.

Oh sweet summer child. No, ASCII itself was a problem. Before we had 8-bit
character sets, we had 7-bit character sets:

[https://en.wikipedia.org/wiki/ISO_646](https://en.wikipedia.org/wiki/ISO_646)

This is why IRC considers {|} and [\\] to be lowercase and corresponding
uppercase letters respectively: it was made by a Scandinavian, and in their
character sets, some accented letters occupy the positions ASCII uses for
[\\]{|}.
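
A sketch of the resulting "rfc1459" casemapping rule (some servers also fold ~ and ^, omitted here):

```python
# RFC 1459 casemapping: { | } are treated as the lowercase forms of [ \ ],
# a relic of Scandinavian ISO 646 variants reusing those byte positions.
_RFC1459_LOWER = str.maketrans("[]\\", "{}|")

def irc_lower(nick):
    return nick.lower().translate(_RFC1459_LOWER)

# "Olaf[AFK]" and "olaf{afk}" collide as nicknames under this rule:
assert irc_lower("Olaf[AFK]") == irc_lower("olaf{afk}")
```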

The story of character sets is the story of evolving common subsets: ISO 646
within ASCII, ASCII within “extended ASCII” (or at least, some variants
thereof), Latin-1 within the Unicode BMP, the Unicode BMP within Unicode.

Oh and by the way, before we had 7-bit character sets, we had 6-bit (e.g. IBM
BCD). And before those, we had 5-bit (e.g. Baudot code). And before that, we
had different telegraph codes (variations of Morse code)…

~~~
Animats
Even 5-bit Baudot-type codes were not standardized. A-Z and 0-9 are the same
on all Teletype machines, but there's (at least) ITA2, USTTY, Fractions
(⅛,¼,⅜,½,⅝,¾,⅞ only, for stock market use) and Weather Symbols (8 direction
arrows and 4 cloud cover symbols). I own five Teletype machines and only two
of them are 100% compatible.

------
jbg_
The code that sends the price to the printer was written with currency symbols
in mind, and selects the correct code page before sending the code for the £
symbol.

The code that sends the "product name" was not, and doesn't correctly
translate its input to the code page that the printer is using.

When I made a homemade POS system for a bar, years ago, I ran all the printers
in bitmap mode and rendered the receipts in software, to sidestep this and
other problems. The performance was still acceptable, but I think the reason
many POS systems don't go this route is compatibility; they have to work with
many models of printer and bitmap support is not universal, and even among
those printers that support it I am not sure if it is standardised.

------
anonymfus
>Each language needed its own code page. For example Greek uses 737 and
Cyrillic uses 855.

Cyrillic is not a language, it's an alphabet/script. Codepage 855 was used for
Cyrillic mostly in IBM documentation. In Russia codepage 866 was adopted on
DOS machines, because in codepage 855 characters were not ordered
alphabetically.

>Even today, on modern windows machines, typing alt+163 will default to 437
and print ú.

It's only true for machines where so called "OEM codepage" is configured as
codepage 437. But in Russia it's codepage 866 by default, so typing alt+163
prints г.
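
The same byte under the two OEM code pages, checked in Python:

```python
# Alt+163 produces byte 163; what glyph appears depends entirely on the
# configured OEM code page.
assert bytes([163]).decode("cp437") == "ú"  # US default (code page 437)
assert bytes([163]).decode("cp866") == "г"  # Russian OEM default
```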

~~~
edent
That's a good point. I've updated the post to reflect that it's an alphabet.
Thank you!

~~~
tiraniddo
It's worth noting that ALT+X gives you the default OEM code page for
compatibility with DOS _sigh_ whereas ALT+0X gives you Unicode. So typing
ALT+0163 will give you £.

~~~
anonymfus
>sigh whereas ALT+0X gives you Unicode. So typing ALT+0163 will give you £.

This is incorrect. It gives you an ANSI code page. On old Windows versions it
would be the default ANSI code page; on modern Windows it's the code page
associated with your input language. So if I type ALT+0163 with an English
keyboard layout I get £ from Win-1252, but the same combo after switching to
Russian gives me Ј from Win-1251.

Entering numbers bigger than 255 just causes wraparound. For example,
ALT+08355 also will give you £ instead of ₣.
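
The wraparound is reduction modulo 256: ₣ is U+20A3, decimal 8355, and 8355 mod 256 is 163, the Windows-1252 (and Latin-1) byte for £.

```python
# Why an Alt code for ₣ (U+20A3) wraps around to £:
assert ord("₣") == 8355
assert 8355 % 256 == 163
assert bytes([163]).decode("cp1252") == "£"
```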

------
kakwa_
>8859-1 defines the first 256 symbols and declares that there shall be no
deviation from that. Microsoft then immediately deviates with their Windows
1252 encoding.

>Everyone hates Microsoft.

If only it was only that... Microsoft has even worse encoding schemes. The
ugliest I encountered was an "encoding" based on glyph indexes in ttf files.

Conversion is a pain in that case, and uncertain... it also led me to some
not-so-beautiful code...

[https://raw.githubusercontent.com/kakwa/libemf2svg/master/in...](https://raw.githubusercontent.com/kakwa/libemf2svg/master/inc/font_mapping.c)

Even between Microsoft products (namely Office on Mac and Office on Windows),
this scheme is not handled properly (the string is incorrectly handled as a
UTF-16LE string by Office on Mac).

------
kps
Some of what's written here is not quite right. ASCII was developed in
cooperation with ISO, ECMA (the European Computer Manufacturers Association),
BSI (the British Standards Institution), and CCITT (the International
Telegraph and Telephone Consultative Committee), and it was clear from the
start that there
would be national/linguistic versions — this was the origin of ‘code pages’,
to use the IBMism. ISO 2022 / ECMA 35 had defined the means of designating
character sets (both 7-bit and 8-bit) by 1971, a decade before the IBM PC
chose to ignore the standard.

~~~
dfox
In fact, the original version of ASCII even left some of the codes under 128
undefined or available for local redefinition. This is why Smalltalk uses _
for assignment (it was a left arrow on the Alto) and why some still-used
encodings have a local currency symbol (e.g. ¥) in place of \\.

~~~
kps
Along with leaving codes 96–123 undefined, and significant differences in
control codes, 1963 ASCII had ← and ↑ in the positions used for _ and ^ in
1967.

------
Symbiote
The receipt also has the time in 24 hour format, then a zero-padded AM/PM
format a couple of lines below. Shoddy software, with no attention to detail.

In Britain, it would be easy not to notice the incorrect symbol when setting
up the machine. Elsewhere in Europe, it ought to get noticed quickly — but I
occasionally get receipts in Denmark where the shop's address (or even name!)
is corrupted, like "SkrÉdderi, LÎvstrÉde" instead of "Skrædderi, Løvstræde".

~~~
goatface
In the UK, there are plenty of computer systems that refuse to acknowledge
that people or places can have non-ASCII characters in their names, and ask
you to correct them... or even refuse to work, possibly because they are
comparing differently-broken encodings of them. Even paying tax by debit card
seems to be impossible with their chosen payment processor if the name on your
card or parts of your address do not fit the constraints of the English
alphabet.

~~~
Symbiote
My Danish street address includes non-ASCII letters.

There are easy transliterations, but I input them on my British accounts to
make a point. About half work correctly.

~~~
goatface
Avoiding any international characters, both when registering addresses and
when inputting them in forms, ends up being the path towards hopefully being
able to spend money with credit/debit cards, though. For various other
non-financial forms, fuzzing the systems with what should be common enough
European text and laughing as it fails is safer fun.

------
chmaynard
The salient property of all flavors of ASCII is that each character fits
nicely in an 8-bit word. This word size was commonly used in computer memory
at the time, and memory was very expensive.

My first programming job was writing software for the MUMPS operating system
on a DEC PDP-15, which had an 18-bit word size. PDP-15 MUMPS used 6-bit ASCII
(which was uppercase only) because three characters fit nicely in an 18-bit
word.

------
sixothree
The problem here is not the printer.

I'm willing to bet the problem here is that the descriptions of the items are
stored in the database as ascii and not unicode.

------
garyclarke27
Interesting article - reminds me of a recent experience when I registered a
few companies, one of which included R&D in the name. No problem for UK
Companies House: online registration within minutes. But it has been
surprising how much grief the & character causes with other systems. Banking
systems refuse to accept it; they only accept a very limited set of characters
for names. I should have used RnD like AirBnB. It is ridiculous, though, that
gymnastics like this are still required in 2017! In the EU most banks are
relaxed about account names, relying just on IBANs, but in places like Serbia
they are annoyingly anal and reject payments if the name does not match
exactly.

------
jorangreef
We don't always take the time to understand Unicode.

I wrote the following article for Node.js to try and clarify the intersection
of Unicode and filesystems, especially with regard to different normalization
forms, and using normalization only for purposes of comparison:

[https://nodejs.org/en/docs/guides/working-with-different-
fil...](https://nodejs.org/en/docs/guides/working-with-different-filesystems/)
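
The core pattern from that guide, sketched in Python: keep names byte-for-byte as the filesystem reports them, and normalize only throwaway copies for comparison. HFS+ stores "é" decomposed, while user input usually arrives precomposed.

```python
import unicodedata

# Two spellings of "café" that render identically but compare unequal:
precomposed = "caf\u00e9"   # é as one code point (U+00E9)
decomposed = "cafe\u0301"   # e + COMBINING ACUTE ACCENT
assert precomposed != decomposed  # naive equality fails

# Normalize copies to a common form (NFC here) purely for comparison:
assert unicodedata.normalize("NFC", precomposed) == unicodedata.normalize(
    "NFC", decomposed
)
```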

------
bencollier49
Not enough love for code page 437! If we had proper support for it I wouldn't
have so much trouble displaying proper smiley faces in the console. Linux, I'm
looking at you.

~~~
dfox
The problem with supporting all of cp437 is that it assigns a character to
every byte, including control codes.

Even in DOS this caused issues:

1) NUL is rendered the same as space in cp437, but as a solid block (all
pixels set) in many other DOS code pages. This causes strings output by some
software written in C(++) to end in a black rectangle (notably in the C++
version of Turbo Vision, including the Turbo C++ IDE). The background of many
TUI applications, consisting of thin 8px-spaced lines, is caused by the same
thing (see below for why it is rendered as thin lines)

2) DOS and language runtimes for DOS tend to ignore most control codes, but
still not all of them. In particular 0x07 BEL is a useful character (often
used as the dot in a selected radio button); the only way to get it on screen
is to write directly into the framebuffer.

3) The MDA-style character generator (present on essentially anything but
CGA) has special hardware logic that makes cp437 box-drawing characters one
pixel wider. This means that all "right-facing" box-drawing characters have to
occupy the code slots with this magic behavior, and you cannot use these magic
slots for normal characters that are wider than 7px. (This is also the reason
for the thin 8px-spaced lines.)

~~~
bitwize
On the other hand, it was sometimes tremendously useful. I remember a DOS
terminal emulator which could operate in "normal mode" (control codes were
interpreted normally) or "diagnostic mode" (control codes were printed as
their CP437 characters). Came in real handy when attempting to debug terminal
output with screen-clobbering characters in it.

------
jrochkind1
Character encoding is hard. Unicode is not hard, at least not that hard,
certainly compared to character encoding before Unicode. Unicode is the
solution, not the problem. The problem here is that something got confused
about what character encoding was in use somewhere -- debugging this is hard,
but the best solution is almost always "just make it a Unicode encoding,
ideally UTF-8, at every stage of the pipeline you can".

------
git-pull
I am author of a CJK language library for python called cihai
([https://cihai.git-pull.com](https://cihai.git-pull.com)).

So as part of this, and after years, I eventually realized the only way to
make a scalable tool to look up Han glyphs is to build upon UNIHAN: the
Unicode Consortium's Han unification effort.

I write about Unicode and UNIHAN in my own words here: [http://unihan-etl.git-
pull.com/en/latest/unihan.html](http://unihan-etl.git-
pull.com/en/latest/unihan.html)

The challenge with Unicode and hanzi is there are many historical and regional
variants to a single source Han grapheme of the same meaning.

So, each glyph or variant gets its own codepoint, or number, reserved. In
fact, this year, when Unicode 10.0 is cut, the new CJK Extension F will
introduce 7,473 characters
([http://unicode.org/versions/Unicode10.0.0/](http://unicode.org/versions/Unicode10.0.0/)).

Thankfully, my only task is to make the database accessible in as friendly a
way as possible. Which is actually a mammoth task: there are over 90 fields
used to denote dictionary indices, regional IRG [1] indices (the IRGs being
national-level workgroups that convene to add new characters), phonetics
(Mandarin, Cantonese jyutping, and more).

The fields are dense. They pack in objects that are most easily split up by
regular expressions. [https://github.com/cihai/unihan-
etl/blob/master/unihan_etl/e...](https://github.com/cihai/unihan-
etl/blob/master/unihan_etl/expansion.py)

So a UNIHAN field for kHanyuPinyin
([http://www.unicode.org/reports/tr38/#kHanyuPinyin](http://www.unicode.org/reports/tr38/#kHanyuPinyin)):

U+5364 kHanyuPinyin 10093.130:xī,lǔ 74609.020:lǔ,xī

U+5EFE kHanyuPinyin 10513.110,10514.010,10514.020:gǒng

The value for U+5364 contains two entries (separated by the space); each entry
has a location index before the colon (:) and a comma-separated list of
readings after it.
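
A sketch of parsing that field shape with a regular expression (a hypothetical helper, not the actual unihan-etl code):

```python
import re

# Entries are space-separated; each is "locations:readings", with commas
# allowed inside both halves.
ENTRY = re.compile(r"(?P<locations>[\d.,]+):(?P<readings>\S+)")

def parse_khanyupinyin(value):
    """Split a kHanyuPinyin value into (locations, readings) pairs."""
    entries = []
    for m in ENTRY.finditer(value):
        entries.append(
            (m.group("locations").split(","), m.group("readings").split(","))
        )
    return entries

# "10093.130:xī,lǔ 74609.020:lǔ,xī" -> two entries of (locations, readings)
```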

You may wonder where this all comes from. The effort is global, but a good
deal of it is thanks to people who took their time to contribute it,
organizationally or personally. Take a look in the descriptions of the fields
at
[http://www.unicode.org/reports/tr38/](http://www.unicode.org/reports/tr38/)
for bibliographic info.

In any event, the hope is to create a successor to cjklib
([https://pypi.python.org/pypi/cjklib](https://pypi.python.org/pypi/cjklib))
and have datasets for CJK available in datapackages
([http://frictionlessdata.io/data-packages/](http://frictionlessdata.io/data-
packages/)). That way, sources of data are sustainable and not tied down to
any one library.

[1]
[https://en.wikipedia.org/wiki/Ideographic_Rapporteur_Group](https://en.wikipedia.org/wiki/Ideographic_Rapporteur_Group)

------
faragon
"The printer doesn't know which code page to use, so makes a best guess."

The printer probably uses a default code page, and that's all. BTW, Unicode is
not hard. The "hard" part is reading the device manual and implementing the
encoding conversion properly. Also, in cases where no character selection is
possible, in most cases you can use the printer in graphics mode.

------
asimpletune
What if, in the distant future, the actual spelling of people's surnames
drifts due to normalization like this? I'd liken it to immigrants having their
names transliterated to a Latin alphabet at Ellis Island, or something like
that.

------
RedCrowbar
Łukasz Langa recently gave a PyCon talk [1] on the subject.

[1]
[https://www.youtube.com/watch?v=7m5JA3XaZ4k](https://www.youtube.com/watch?v=7m5JA3XaZ4k)

~~~
deathanatos
That talk is proof as to just how difficult Unicode is in practice:

* @15:32, "UTF-32 uses the same amount of bytes for (almost) all code points" — there is no "almost" about it; UTF-32 _always_ uses 4 octets per code point.

* There was some amount of conflation between code points and characters.

* It was implied that len() will always give you length-in-code-points in Python 3, whereas it doesn't in Python 2. In Python < 3.3, it's code units (just like it is in Python 2), which on a narrow build will be 16-bit and thus wrong for strings w/ code points outside the BMP. This particular problem wasn't solved until 3.3 with the introduction of PEP-393.

The author's main point, regarding the difference between text and how you
encode it, is good.

------
kris-s
Related PyCon talk about this:
[https://youtu.be/bx3NOoroV-M](https://youtu.be/bx3NOoroV-M)

------
kevin_thibedeau
> Unicode was born out of the earlier Universal Coded Character Set

Unicode was started independently and later harmonized with UCS.

------
callesgg
Some parts of Unicode are hard, like many characters looking almost exactly
alike.

------
kmicklas
Unicode isn't hard, dealing with software that doesn't use it is.

~~~
deathanatos
Software doesn't use it because our languages' and our systems' support for
effectively dealing with this stuff is utter garbage. For example:

* The overwhelming majority of languages don't give you code-point level iteration over strings by default (and you probably want grapheme), most opting for code _units_ — which is what an unsigned char ptr in C containing UTF-8 data will give you. (C++, Java, C#, Python < 3.3, and JavaScript all fall in this bucket)

* Linux, and most (all?) POSIX OSs store filenames as a sequence of bytes. What human chooses a sequence of bytes to "name" their files?

* Things like "how wide will this character display as in my terminal" are either impossible, or done with heuristics. Usually, it's not done at all; most DB CLIs I've used that output tabular data will corrupt the visual if any non-ASCII is output.

(Yes, some of this is in the name of "backwards compatibility".)
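
The mismatch between those levels is easy to demonstrate in Python:

```python
# One string, three different "lengths", matching the levels listed above:
s = "\U0001F4A9"                             # U+1F4A9, outside the BMP
assert len(s) == 1                           # code points (Python 3.3+)
assert len(s.encode("utf-16-le")) // 2 == 2  # UTF-16 code units (JS .length)
assert len(s.encode("utf-8")) == 4           # bytes (a C char pointer)
```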

~~~
jcranmer
> * The overwhelming majority of languages don't give you code-point level
> iteration over strings by default (and you probably want grapheme), most
> opting for code units — which is what an unsigned char ptr in C containing
> UTF-8 data will give you. (C++, Java, C#, Python < 3.3, and JavaScript all
> fall in this bucket)

Saying for (let ch of str) in JavaScript iterates over code points, not UCS-2
code units.

~~~
deathanatos
TIL! (Though, note that both indexing and .length operate in code units in
JS.)

------
k_sze
Mandatory XKCD [https://xkcd.com/927/](https://xkcd.com/927/)
[https://xkcd.com/1726/](https://xkcd.com/1726/)

------
teddyh
_The Absolute Minimum Every Software Developer Absolutely, Positively Must
Know About Unicode and Character Sets (No Excuses!)_ from 2003:

[https://www.joelonsoftware.com/2003/10/08/the-absolute-
minim...](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-
every-software-developer-absolutely-positively-must-know-about-unicode-and-
character-sets-no-excuses/)

~~~
nabla9
That is not the absolute minimum. Unicode is a complex beast and
oversimplification is dangerous.

When these absolute-minimum intros talk only about encoding, they mislead
people into thinking that's enough. I can't count the number of people who
have read Joel's article and have the misconception that all user-perceived
characters map to single code points. I was one of those people. Just because
the ASCII and Latin-1 character sets can be mapped to code points does not
mean that's how all of Unicode works.

At minimum every software developer must know four different levels:

* bytes,

* code points,

* combining character sequence

* grapheme clusters, extended grapheme clusters

Joel stops at the second level. He never gets to the point where he explains
how to encode user-perceived characters, or how to detect grapheme cluster
boundaries in a Unicode string.

examples: 각 , नी , நி
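
A Python look at why examples like these break the one-character-one-code-point assumption:

```python
import unicodedata

# "नी" is a single user-perceived character built from two code points:
# a consonant plus a dependent vowel sign.
devanagari_ni = "\u0928\u0940"
assert len(devanagari_ni) == 2
assert [unicodedata.name(c) for c in devanagari_ni] == [
    "DEVANAGARI LETTER NA",
    "DEVANAGARI VOWEL SIGN II",
]

# "각" can be one code point (NFC) or three jamo code points (NFD) --
# the same user-perceived character either way.
assert len(unicodedata.normalize("NFD", "\uac01")) == 3
```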

~~~
jstimpfle
Thanks, but I'm happy to be ignorant of levels three and four. I know they
exist, and that's enough for me. Something is really wrong if programmers in
most application domains have to care about that complexity.

~~~
nabla9
Knowing that they exist is the minimum you must know. Knowing what you don't
know is already knowledge.

Joel gives the impression that he doesn't know that he doesn't know.

Knowing that you can't break a Unicode text string, or insert text into the
middle of one, unless you know what language it uses is usually enough. They
are just binary blocks you can't modify unless you have some extra info or use
specific libraries.

------
devoply
Unicode is not hard. What's hard is the conversions between all these
different systems. That's the hard part. Unicode is simple enough to be done
flawlessly as long as you stick to unicode for everything.

~~~
ygra
If you only need to receive, store, and send text, Unicode is easy enough and
you can just treat it as a byte stream. Once you get into things like
manipulating text, comparisons and searches, or displaying text, things get
hairy and all kinds of fun algorithms from the various Unicode Technical
References and Notes make their appearance. _Those_ parts are the ones that
increase complexity.

Also, a major reason why Unicode is large and complex is because languages and
scripts are large and complex. Unless we all agree on using simple computer-
friendly languages and scripts that complexity is not going to change, and the
need of working with older scripts (e.g. for historians and researchers) still
requires something like Unicode. Unicode is the kind of thing that emerges
from a messy world, and unsurprisingly it's messy as well.

~~~
jrochkind1
Unicode is still _way_ less hard than anything else for manipulating text.
Global human written language is complicated; Unicode is a pretty ingeniously
designed standard, with solutions that work pretty darn well for almost any
common manipulation you'd want to do. Now, everything isn't always implemented
or easily accessible on every platform, and people don't always understand
what to do with it -- because global human written language is complicated --
but Unicode is a pretty amazing accomplishment, quite successful in various
meanings of 'successful'.

------
a3n
In ancient times we tried to build the Tower of Babel, that would reach to God
and Heaven. God said "Nah," made us all speak different languages and
scattered us around.

Now it looks like we're up to our old tower building ways again, except this
time with computers and data. So God smirked and gave us Unicode.

