
Unicode: Good, Bad, and Ugly (2011) - Tomte
https://www.azabani.com/pages/gbu/
======
danso
What a great, useful read...maybe its text density doesn't follow best
practices for a slidedeck but something about dividing up all those code
examples across slides made this the most engaging multi-language writeup
about Unicode. I always _think_ I know how complicated it is but I think the
OP set me to a new level of realized ignorance.

Though I also like that my perceived difficulties with Unicode when moving
from Ruby to Python are not just imaginary or out of ignorance, but seem to be
actual differences/flaws in implementation. Also did not know about the
`regex` module for Python, which aims to replace the standard `re` (and was
just updated this week):
[http://pypi.python.org/pypi/regex](http://pypi.python.org/pypi/regex)

~~~
Veedrac
regex is one of the most underused packages IMHO, even for people using
straight ASCII. It even has fuzzy matching!

------
re
These slides are from 2011, so some things have changed for the better in the
interim. For example, Python now uses a flexible internal string
representation that ensures that characters in a string are always whole code
points, addressing the "inherent brokenness" called out in the slides:
[https://www.python.org/dev/peps/pep-0393/](https://www.python.org/dev/peps/pep-0393/)
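
A quick way to see this for yourself, as a minimal sketch assuming Python 3.3 or later (the variable names are only for illustration): a character outside the BMP counts as one element of a `str`, even though it takes two UTF-16 code units or four UTF-8 bytes when encoded.

```python
# Assumes Python 3.3+ (PEP 393 strings).
p = "\N{MATHEMATICAL SCRIPT CAPITAL P}"  # U+1D4AB, outside the BMP

code_points = len(p)                           # length as seen by str
utf16_units = len(p.encode("utf-16-le")) // 2  # 2 bytes per UTF-16 code unit
utf8_bytes = len(p.encode("utf-8"))            # bytes in the UTF-8 encoding

print(code_points, utf16_units, utf8_bytes)
```

On the old "narrow" builds the first number would have been 2, which is the behaviour the slides complain about.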

~~~
Avernar
For what reason other than perhaps font rendering code do you need to index
code points in a string? Everything you think you need code points for
(character counting, truncation, etc) you actually need grapheme clusters.
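
To illustrate the gap between code points and what a user sees as one character, here is a rough sketch using only the standard library. `rough_clusters` is a hypothetical helper that only glues combining marks onto the preceding base character; real grapheme cluster rules (UAX #29) cover much more, such as emoji sequences and Hangul, and the third-party `regex` module mentioned above matches full clusters with `\X`.

```python
import unicodedata

def rough_clusters(text):
    """Hypothetical helper: group each character with the combining marks
    that follow it. This only approximates grapheme clusters."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch   # attach the mark to the previous cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters

word = "cafe\u0301"               # "café" spelled with a combining acute accent
clusters = rough_clusters(word)

print(len(word), len(clusters))   # code points vs. approximate clusters
print(word[:4])                   # truncating by code point drops the accent
```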

I wish Python went with UTF-8 instead of their multi width internal
representation.

~~~
Veedrac
The problem IMHO is more not regressing on lots of already-written code which
assumes O(1) indexing, and less a problem of principle.

~~~
Avernar
Most code works perfectly fine indexing bytes in a UTF-8 string. Anything that
looks for stuff in the ASCII range such as parsers would not need to be
changed.
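
A small sketch of why that works (the sample record is made up for illustration): in UTF-8, every byte of a multi-byte sequence is 0x80 or above, so searching or splitting on an ASCII delimiter can never land inside a character.

```python
# A made-up comma-separated record containing 2-, 3- and 4-byte characters.
text = "naïve,日本語,𝒫"
record = text.encode("utf-8")

# Every byte belonging to a non-ASCII character has the high bit set,
# so none of them can be mistaken for the ASCII comma (0x2C).
non_ascii_bytes = [b for ch in text if ord(ch) > 0x7F for b in ch.encode("utf-8")]
high_bit_only = all(b >= 0x80 for b in non_ascii_bytes)

# Byte-level split, then decode each field; no field is cut mid-character.
fields = [field.decode("utf-8") for field in record.split(b",")]

print(high_bit_only, fields)
print(record.index(b","), text.index(","))  # byte offset vs. code point offset
```

The offsets differ between the byte view and the code point view, but a parser that only slices between the delimiters it found never needs to convert between them.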

Python 3 already required all code to be looked over because of the string
literal change so it wouldn't be much different.

------
gkoberger
Broken on mobile and hard to navigate on desktop. Here's an easier version:
[http://output.jsbin.com/qewonisinu/1](http://output.jsbin.com/qewonisinu/1)

------
devy
In my own experience, the biggest Unicode implementation failure was MySQL's
choice of a 3-byte, BMP-only UTF-8 instead of full 4-byte Unicode
support.[1] In retrospect, the decision to go with a broken subset
implementation has caused more trouble, confusion, and incompatibility than
the claimed benefits of speed/performance/simplification were worth. A simple
Google search shows that almost everyone recommends avoiding `utf8` and using
`utf8mb4` instead. [2][3][4]

As a result, if your MySQL databases/tables/columns are using `utf8` instead
of `utf8mb4` (a MySQL invention) charset, you cannot store / retrieve emoji
characters properly.
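
A rough way to check text before it hits such a column, sketched in Python. `fits_in_three_byte_utf8` is a hypothetical helper, not part of any MySQL driver; it only tests whether every code point is inside the BMP, which is what a 3-byte UTF-8 limit amounts to.

```python
def fits_in_three_byte_utf8(text):
    """Hypothetical helper: True if every code point is in the BMP
    (U+0000..U+FFFF), i.e. needs at most 3 bytes in UTF-8."""
    return all(ord(ch) <= 0xFFFF for ch in text)

plain = "héllo 日本"
emoji = "\N{GRINNING FACE}"  # U+1F600, outside the BMP

print(fits_in_three_byte_utf8(plain))  # BMP-only text
print(fits_in_three_byte_utf8(emoji))  # needs a 4-byte sequence
print(len(emoji.encode("utf-8")))      # UTF-8 byte length of the emoji
```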

[1] [https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-
utf8...](https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html)

[2] [https://mzsanford.wordpress.com/2010/12/28/mysql-and-
unicode...](https://mzsanford.wordpress.com/2010/12/28/mysql-and-unicode/)

[3] [https://mathiasbynens.be/notes/mysql-
utf8mb4](https://mathiasbynens.be/notes/mysql-utf8mb4)

[4] [https://www.drupal.org/node/1314214](https://www.drupal.org/node/1314214)

------
Ulti
Perl 6 has some of the best unicode support out there as default behaviour.
Even supports all unicode digits as numeric literals. So:

    
    
        1 + ൭ == 8
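
For comparison, a short Python sketch: Python does not accept such digits as numeric literals in source code, but `int()` does parse strings made of Unicode decimal digits, so the same sum works on string input.

```python
import unicodedata

seven = "\N{MALAYALAM DIGIT SEVEN}"    # the digit used in the example above

category = unicodedata.category(seven) # general category of the character
value = unicodedata.digit(seven)       # its decimal digit value
total = 1 + int(seven)                 # int() accepts Unicode decimal digits

print(category, value, total)
```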

~~~
shiro
Impressive. But I wonder if I want to see something like this in the code:

    
    
        ٦1٥٠3 + ٤६੬៩ - ৭۹੧

------
Demiurge
I wish this was updated. I thought I'd check some things in python, and slide
37 about Python not treating characters as smallest units does not apply
anymore. This seems valid:

    
    
      Python 3.5.0 (default, Sep 22 2015, 12:32:59) 
      [GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.72)] on darwin
      Type "help", "copyright", "credits" or "license" for more information.
      >>> import re
      >>> g = "\N{GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI}"
      >>> print(g)
      ᾲ
      >>> print(re.search(r'\w', g))
      <_sre.SRE_Match object; span=(0, 1), match='ᾲ'>
      >>> p = "\N{MATHEMATICAL SCRIPT CAPITAL P}"
      >>> print(p)
      𝒫
      >>> print(re.search(r'\w', p))
      <_sre.SRE_Match object; span=(0, 1), match='𝒫'>
      >>> print(re.search(r'..', p))
      None
      >>> print(re.search(r'.', p))
      <_sre.SRE_Match object; span=(0, 1), match='𝒫'>
      >>>

------
MindTwister
Reads better if you install the fonts recommended at the end: Alfios and
Symbola ([http://users.teilar.gr/~g1951d/](http://users.teilar.gr/~g1951d/)),
and the font Everson Mono, available here:
[http://www.evertype.com/emono/](http://www.evertype.com/emono/)

------
tempodox
That presentation is evil. It doesn't behave as a web page but like some
stupid Keynote document.

~~~
ygra
[https://en.wikipedia.org/wiki/S5_%28file_format%29](https://en.wikipedia.org/wiki/S5_%28file_format%29)

You can also click on the Ø in the lower-right to remove all that and view it
as a normal web page.

------
dfc
Can someone put 2011 in the title?

------
lhecker
I guess the author must have been pretty happy when Swift came out, with all
of its glorious Unicode support...

------
forrestthewoods
Website doesn't work on iPad

~~~
scrollaway
Given that the site was around before the iPad was a thing, isn't it more fair
to say "iPad doesn't work with this site"?

~~~
TuringTest
_> Given that the site was around before the iPad was a thing, isn't it more
fair to say "iPad doesn't work with this site"?_

No. Because the slides have been published to the World Wide Web, they should
follow WWW standards, and then they could be viewed on any future device
that follows those standards. Those are the expectations of the publishing
platform, which this presentation doesn't meet.

The site instead chooses to break browser compatibility, and therefore many
standard actions are impossible:

- Navigating back and forth between slides.

- Selecting text (such as the URL in the first slide, that I had to type in
the address bar instead of being able to copy/paste it).

- Deep-linking a URL to any intermediate page (there's a workaround to this,
but it's not obvious).

Apparently, its limitations also include not being able to see this simple
content in any browser, like the native iOS one.

~~~
lqdc13
I very much dislike presentations that keep the slide history in the browser
history.

Now, to go back to the original page you were on before you went to the site,
you have to search through your history or click the back button 100 times.

I also never found a use for deep-linking to an intermediate page.

~~~
TuringTest
> Now, to go back to the original page you were on before you went to the
> site, you have to search through your history or click 100x times on the
> back button.

Useful trick: if you long-press on the back button in most browsers, it shows
a list that allows you to jump several pages back.

However, I was not asking to keep the slide history in the browser history;
the presentation doesn't even have back/forward buttons to get back to the
previous slide while staying in the page.

> I also never found a use for deep-linking to an intermediate page.

It's used for referencing the content of the slide, so that people coming from
an external site will see exactly the page that you're talking about.

I hate it when someone quotes some content in a slideshow and links you to the
first page, forcing you to guess which part of the file contains the content
they're talking about.

I hate

