
Why and how you ought to keep multibyte character support simple [pdf] - protomyth
https://www.openbsd.org/papers/eurobsdcon2016-utf8.pdf
======
ploxiln
On the "Caveats for xterm" page it says

    
    
      On other operating systems except OpenBSD, there is
      no way in hell to make the interaction of locales
      with terminal controls truly safe.
    

But consider a Linux system based on musl libc. Its policy is not very
different from OpenBSD's policy of UTF-8 and ASCII only; it's probably pretty
close even if not perfect:

[http://wiki.musl-libc.org/wiki/Functional_differences_from_g...](http://wiki.musl-libc.org/wiki/Functional_differences_from_glibc#Character_sets_and_locale)

~~~
stsp
It should probably say "On operating systems which support arbitrary locales,
..."

------
brandmeyer
I don't understand how the algorithm on page 21 works. Aren't many Unicode
characters formed with multiple code points, like
<basic-character><combining-mark>? If these are reversed to be
<combining-mark><basic-character>, then the textual output would actually be
different, wouldn't it?

Shouldn't rev(1) reverse graphemes instead of code points?
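
To make the question concrete, here is a small Python sketch (not the code from the slides) of what a reversal by code point does to a combining sequence. In Unicode a combining mark follows its base character, so a naive reversal attaches the mark to a different letter. The sample string is only an illustration.

```python
import unicodedata

# "e" + U+0301 COMBINING ACUTE ACCENT + "a": renders as "éa".
text = "e\u0301a"

# Naive reversal, one code point at a time.
reversed_text = text[::-1]

# The accent now follows the "a", so the result renders as "áe", not "aé".
names = [unicodedata.name(ch) for ch in reversed_text]
for ch, name in zip(reversed_text, names):
    print(f"U+{ord(ch):04X} {name}")
```

The reversed string is still well-formed Unicode; it just no longer says the same thing.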

~~~
stsp
Yes, rev(1) probably should handle combining characters.

But those are a property of Unicode, not UTF-8. UTF-8 encodes code points, and
we often try to get away without decoding them. Of course the resulting
Unicode can change its meaning but it's still valid Unicode (and valid UTF-8).
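
As a rough illustration of working on UTF-8 without decoding it, the Python sketch below reverses a byte string code point by code point using only the bit pattern of each byte. It is not OpenBSD's rev(1) code; `rev_utf8` is a hypothetical name, and the function assumes its input is already valid UTF-8.

```python
def rev_utf8(data: bytes) -> bytes:
    """Reverse valid UTF-8 code point by code point, without decoding."""
    out = bytearray()
    end = len(data)
    # Walk backwards. A byte of the form 0b10xxxxxx is a continuation byte;
    # any other byte starts a new code point, so copy that whole sequence.
    for i in range(len(data) - 1, -1, -1):
        if data[i] & 0xC0 != 0x80:
            out += data[i:end]
            end = i
    return bytes(out)


# Usage: one-, two- and three-byte sequences stay intact.
sample = "a\u00e9\u20ac".encode("utf-8")
print(rev_utf8(sample).decode("utf-8"))
```

Reversing the raw bytes instead (`sample[::-1]`) would put continuation bytes before their lead bytes, and the result would no longer decode as UTF-8.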

In some cases we already look at Unicode properties (such as a character's
column width). So perhaps we can find a nice way to fix this problem in
rev(1), some day.
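
One possible approach, sketched in Python under simplifying assumptions: look up the canonical combining class of each code point and keep combining marks glued to the base character before them. `rev_keep_marks` is a hypothetical name, and this only approximates grapheme clusters; it does not cover everything in the Unicode text segmentation rules (for example emoji joined with ZWJ).

```python
import unicodedata


def rev_keep_marks(text: str) -> str:
    """Reverse text while keeping combining marks after their base character."""
    clusters = []
    for ch in text:
        # A nonzero canonical combining class identifies a combining mark;
        # append it to the previous cluster if there is one.
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return "".join(reversed(clusters))


# Usage: the accent stays on the "e" after reversal.
print(rev_keep_marks("e\u0301a") == "ae\u0301")
```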

There are many more interesting Unicode issues we don't address in OpenBSD's
UTF-8 support (e.g. Han unification, precomposed vs. decomposed
normalization).
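
For readers unfamiliar with the normalization issue, a short Python sketch using the standard `unicodedata` module shows the same visible character in precomposed and decomposed form. The variable names are only for illustration.

```python
import unicodedata

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # "e" + COMBINING ACUTE ACCENT, two code points

# Both render as "é", but they compare unequal as code point sequences.
print(precomposed == decomposed)

# NFC composes where possible, NFD decomposes.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)
print(nfc == precomposed, nfd == decomposed)
```

Tools that compare or search text byte by byte will treat the two forms as different strings unless they normalize first.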

But we have to start somewhere.

Perhaps, eventually, someone will specify a minimal and sane variant of
Unicode, which removes all the ambiguities, edge cases, and silly symbols.
We'd probably switch over in a heartbeat.

~~~
ramshorns
What would a minimal and sane variant of Unicode be like? Removing the weird
behaviour of Unicode would necessarily mean removing support for some
characters, like those that only exist in decomposed form with combining
diacritics, and some types of scripts like right-to-left. Mapping code points,
characters and graphemes one-to-one seems like it would make text processing
easier at the cost of excluding a large portion of the character set.

I guess it would form a middle ground; US-ASCII is also a minimal subset of
Unicode where text processing is easy.

~~~
karlmdavis
Ding ding! Hard things are hard.

It seems... at least a bit arrogant for a developer who doesn't write any of
the languages that rely on these features to claim that they're insane and
excessive.

------
wtbob
My takeaway is that POSIX is completely broken and needs to be re-evaluated.

~~~
ChoHag
It'll fit right in then.

------
gberger
Why the random photos on each slide?

~~~
stsp
The photos are all from the area around Calgary, where some of the initial
ideas were born during an OpenBSD hackathon. IIRC we disabled Latin1 support
during this hackathon.

While giving this talk in Belgrade, Ingo apologized that he didn't have photos
from a Belgrade hike yet, so he used the Calgary ones instead.

~~~
odabaxok
This does not answer the original question. What is the purpose of these
photos?

~~~
coldtea
It's a little thing called decoration, you should look it up sometime...

~~~
odabaxok
To answer in your tone: You misspelled distraction. ;)

...but seriously, they do not add any value to the presentation. Also, every
photo has a caption, which can be a real distraction. I can imagine someone
trying to read all of them and losing track of the presentation on every
single slide. The author could have decorated with something on-topic, if he
felt the slides were too plain.

~~~
coldtea
> _To answer in your tone: You misspelled distraction. ;)_

Heh, less of a malevolent tone, and more of a reference to a Futurama episode
(s2e6).

------
FullyFunctional
Honest question: why did they keep the C locale instead of (like Plan 9) going
all-out UTF-8?

