
OpenBSD removes support for non-UTF8 locales - ingve
http://marc.info/?l=openbsd-cvs&m=143956261214725&w=2
======
kragen
I wonder what the pros and cons weighed in the discussion were.

Clearly not supporting Unicode text in non-UTF-8 locales (except through,
like, some kind of compatibility function, like recode or iconv) is the Right
Thing. One problem that I have is that current UTF-8 implementations typically
are not "8 bit clean", in the sense that GNU and modern Unix tools typically
attempt to be; they crash, usually by throwing an exception, if you feed them
certain data, or worse, they silently corrupt it.

Markus Kuhn suggested "UTF-8B" as a solution to this problem some years ago.
Quoting Eric Tiedemann's libutf8b blurb, "utf-8b is a mapping from byte
streams to unicode codepoint streams that provides an exceptionally clean
handling of garbage (i.e., non-utf-8) bytes (i.e., bytes that are not part of
a utf-8 encoding) in the input stream. They are mapped to 256 different,
guaranteed undefined, unicode codepoints." Eric's dead, but you can still get
libutf8b from
[http://hyperreal.org/~est/libutf8b/](http://hyperreal.org/~est/libutf8b/).
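
For a sense of what this looks like in practice, here is a minimal sketch using Python's `surrogateescape` error handler (PEP 383), which follows a similar idea: bytes that are not valid UTF-8 are smuggled through as lone surrogate code points (U+DC80 to U+DCFF) and restored on encoding. This is not libutf8b itself, and the sample bytes and the helper name `strict_decode_fails` are made up for illustration.

```python
# Sketch: lossless round trip of arbitrary bytes through a str,
# using Python's built-in "surrogateescape" error handler (PEP 383).

# Valid UTF-8 ("caf" + e-acute) followed by two bytes that can never
# appear in well-formed UTF-8. Sample data is purely illustrative.
raw = b"caf\xc3\xa9 \xff\xfe ok"


def strict_decode_fails(data: bytes) -> bool:
    """Hypothetical helper: True if a strict UTF-8 decode rejects data."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


# Each undecodable byte 0xNN becomes the lone surrogate U+DCNN.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler turns those surrogates back into bytes.
restored = text.encode("utf-8", errors="surrogateescape")

print(strict_decode_fails(raw))  # does the strict decoder reject the input?
print(ascii(text))               # ascii() avoids printing lone surrogates raw
print(restored == raw)           # is the round trip byte-for-byte?
```

The point of the design is that a tool can treat its input as text without either throwing on garbage or silently rewriting it.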

~~~
throwaway2048
I'm willing to bet a large amount that the non-UTF-8 encodings were broken and
nobody cared enough to bother fixing them.

OpenBSD does not hesitate to nuke legacy stuff that gets broken, which I feel
is ultimately for the best, because half-assed support that barely functions
is often worse than no support at all.

~~~
stsp
It was in fact intentionally broken to find out where removing single-byte
locales hurts our users most.

We have a hackathon coming up with devs committed to making UTF-8 work in more
base utilities. If that works out, and the sorest points for
latin1/koi-8/etc. users have been adequately addressed, 5.9 will ship with only
the UTF-8 locale (and of course the default "C" locale -- ASCII).

If this approach turns out to be wrong because we cannot get regressions
fixed, 5.9 will ship like 5.7 and 5.8 (with UTF-8 and single byte locales).

~~~
opk
My first thought was "what about the "C" locale?", so it's good to see that
question already answered.

I really wish there was some sort of standard "U" locale that would be the
same as "C" but UTF-8, and ISO rather than US format dates.

~~~
kazinator
That locale pseudo-exists. It's called "don't call the evil setlocale
function, write in C90 as much as possible, do your own UTF-8 encoding and
decoding, and implement the exact default date format you want with your own
strftime string or whatever."

~~~
Dylan16807
That doesn't exactly help me as a user, and possibly makes things worse as
some things respect locale and some don't.

------
gnuvince
As a French-speaking person, I cannot tell you how much the announcement[0]
that after 5.8, basic utilities, including mg(1), will be UTF-8 ready pleases
me. I'm a huge Emacs fan, but I like to use mg(1) for quick edits and this is
very exciting news for me!

[0]
[http://undeadly.org/cgi?action=article&sid=20150722182236](http://undeadly.org/cgi?action=article&sid=20150722182236)

~~~
busterarm
Funny thing about Emacs and OpenBSD...

...Emacs is the only package in the entire ports tree that can't use ASLR.

~~~
4ad
That can't possibly be true, the Go port also doesn't use ASLR.

~~~
busterarm
I'm just quoting what an OpenBSD dev mentioned to me. I guess there could be
more.

------
fletchowns
I dream of a world where everything is UTC, UTF-8, and metric.

~~~
RexRollman
Personally, I wish everyone used the 24-hour clock. Maybe the military has
messed me up, but I really prefer seeing something like 18:22 over 6:22pm. It
just seems simpler.

~~~
blackbeard
Yes, couldn't agree more with that.

Also dates in numeric order, i.e. yyyy/mm/dd, you know, like all the other
numbers we deal with, not dd/mm/yyyy or the crazy mm/dd/yy.

~~~
qznc
Use minus instead of slash and you have ISO 8601: yyyy-mm-dd

~~~
blackbeard
Sold.

~~~
peterfirefly
8601 also gives you proper week numbering (which Europeans tend to like) +
weeks start on Monday, _after_ the weekend.
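
As a rough illustration, assuming Python's standard `datetime` module (the helper name `iso_summary` and the sample dates are arbitrary), both the yyyy-mm-dd form and the week numbering are available directly. Note that the ISO week-numbering year can differ from the calendar year around New Year.

```python
from datetime import date


def iso_summary(d: date) -> str:
    """Hypothetical helper: ISO 8601 calendar date plus ISO week date."""
    # isocalendar() gives (ISO year, ISO week, ISO weekday), where
    # weekdays run Monday=1 .. Sunday=7.
    iso_year, iso_week, iso_weekday = d.isocalendar()
    return f"{d.isoformat()} = {iso_year}-W{iso_week:02d}-{iso_weekday}"


# An ordinary mid-year date.
print(iso_summary(date(2015, 8, 14)))

# A date in early January that still belongs to the previous ISO year,
# because ISO weeks start on Monday and week 1 must contain a Thursday.
print(iso_summary(date(2016, 1, 3)))
```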

~~~
baudehlo
In some countries the weekend isn't Saturday and Sunday though. So the start
of the week is kind of arbitrary.

~~~
saljam
Saudi Arabia used to have its weekend on Thursday & Friday. Recently they've
switched to Friday & Saturday.

------
jlarocco
Heh, I initially read it as "improves", and was wondering why they'd bother.
Removing it is surprising, but makes sense.

------
Animats
How does locale work on the keyboard side, then? What determines whether text
entry is right to left or left to right?

~~~
ori_b
Exactly the same as before -- your programs just expect UTF-8-encoded code
points as input.

