
Xterm(1) now UTF-8 by default on OpenBSD - protomyth
http://undeadly.org/cgi?action=article&sid=20160308204011
======
jhallenworld
I and others have pushed changes into XTerm to improve mouse support of
terminal-based applications. All terminal emulators should implement XTerm's
command set, especially these:

Bracketed paste mode: allows editor to determine that text is from a mouse
paste instead of typed in. This way, the editor can disable auto-indent and
other things which can mess up the paste. Libvte now supports this!
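
For anyone who hasn't seen it on the wire, here is a minimal Python sketch of the application side of bracketed paste. The escape sequences are the ones xterm documents; the helper name `split_typed_and_pasted` is made up for illustration and not taken from any editor.

```python
# Sketch of bracketed paste handling from the application's side.
# The escape sequences are xterm's; the helper below is hypothetical.

ENABLE_BRACKETED_PASTE = b"\x1b[?2004h"   # app asks the terminal to bracket pastes
DISABLE_BRACKETED_PASTE = b"\x1b[?2004l"  # app restores normal behaviour on exit
PASTE_START = b"\x1b[200~"                # terminal sends this before pasted text
PASTE_END = b"\x1b[201~"                  # and this after it


def split_typed_and_pasted(data: bytes):
    """Split raw input into (kind, payload) chunks; kind is 'typed' or 'pasted'."""
    chunks = []
    pos = 0
    while pos < len(data):
        start = data.find(PASTE_START, pos)
        if start == -1:
            chunks.append(("typed", data[pos:]))
            break
        if start > pos:
            chunks.append(("typed", data[pos:start]))
        body = start + len(PASTE_START)
        end = data.find(PASTE_END, body)
        if end == -1:
            # Unterminated paste: treat the remainder as pasted text.
            chunks.append(("pasted", data[body:]))
            break
        chunks.append(("pasted", data[body:end]))
        pos = end + len(PASTE_END)
    return chunks
```

An editor would write `ENABLE_BRACKETED_PASTE` at startup and skip auto-indent for any chunk tagged `pasted`.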

Base64 selection transfer: this is a further enhancement which allows the
editor to query or submit selection text to the X server. This allows editors
to fully control the selection process, for example to allow the selection to
extend through the edit buffer instead of just the terminal emulator's
contents.
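
Roughly, the selection transfer is xterm's OSC 52 sequence with a base64 payload. The sketch below only builds and parses the strings; the function names are hypothetical, and whether a given terminal honours the sequence (many restrict it by default) is an assumption you would have to check.

```python
import base64

OSC = "\x1b]"   # Operating System Command introducer
BEL = "\x07"    # one of the accepted terminators (the other is ESC \)


def osc52_set(text: str, selection: str = "c") -> str:
    """Build the sequence that submits `text` to a selection ('c' = clipboard)."""
    payload = base64.b64encode(text.encode("utf-8")).decode("ascii")
    return f"{OSC}52;{selection};{payload}{BEL}"


def osc52_query(selection: str = "c") -> str:
    """Build the sequence that asks the terminal to report a selection."""
    return f"{OSC}52;{selection};?{BEL}"


def osc52_parse_reply(reply: str) -> str:
    """Decode a reply of the form OSC 52 ; <selection> ; <base64> <terminator>."""
    prefix = OSC + "52;"
    if not reply.startswith(prefix):
        raise ValueError("not an OSC 52 reply")
    body = reply[len(prefix):]
    for terminator in (BEL, "\x1b\\"):
        if body.endswith(terminator):
            body = body[: -len(terminator)]
            break
    _selection, _, payload = body.partition(";")
    return base64.b64decode(payload).decode("utf-8")
```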

One patch of mine didn't take, but I think it's still needed: allow mouse drag
events to be reported even if the coordinates extend beyond the xterm window
frame. Along with this is the ability to report negative coordinates if the
mouse is above or to the left of the window. Why would this be needed? Think
of selecting text which is scrolled off the window. The distance between edge
and the mouse controls the rate of selection scrolling in that direction.

BTW, it's fun to peruse xterm's change log. For example, you can see all the
bugs and enhancements from Bram Moolenaar for VIM.
[http://invisible-island.net/xterm/xterm.log.html](http://invisible-island.net/xterm/xterm.log.html)

Thomas Dickey maintains a lot of other software as well, in particular
ncurses, vile and lynx:
[http://invisible-island.net/](http://invisible-island.net/)

~~~
caf
Bracketed paste mode is also useful for IRC, to prevent misfiring a huge paste
into a channel.

------
spedru
Every time some link or headline reads "now UTF-8 by default", the only
reasonable response in 2016 is "about time".

~~~
JoachimSchipper
That's not why this article is interesting. Rather, it highlights how
profoundly _not_ UTF-8 ready the (terminal) world is.

(It does work in practice, but in-band signaling over a channel carrying
complex data that receiver and sender interpret according to settings that do
not appear in the protocol at all is, predictably, terrible.)

------
thisrod
This reminded me of a Rob Pike comment. I can't find the text, but it was
along the lines of, "I recently tried Linux. It was as if every bug I fixed in
the 1980s had reverted."

~~~
kazinator
That was baseless posturing. A famous study and its follow-up found that the
utilities on GNU/Linux are more robust, and that was twenty years ago:

ftp://ftp.cs.wisc.edu/paradyn/technical_papers/fuzz-revisited.pdf [1995]

" _This study parallels our 1990 study (that tested only the basic UNIX
utilities); all systems that we compared between 1990 and 1995 noticeably
improved in reliability, but still had significant rates of failure. The
reliability of the basic utilities from GNU and Linux were noticeably better
than those of the commercial systems._ "

I doubt there has been much improvement in those commercial Unixes; they are
basically dead. (What would be the business case for fixing something in a
userland utility on commercial Unix?)

The maintainers of the free BSD's have been carrying that torch, but they
don't believe in features.

Stepping into a BSD variant is like a trip back to the 1980's. Not exactly the
real 1980's, but a parallel 1980's in which Unix is more robust---but the
features are all rolled back, so it's just about as unpleasant to use.

~~~
cokernel_hacker
I don't think Rob meant stability. Rob was probably referring to the reality
that modern Linux hasn't innovated itself past SVR4 by any appreciable amount.

We are still using X, still using terminals powered by control codes, etc.

Rob probably sees things like LANG and LC_ALL as bugs. His fix was UTF-8
everywhere, always. Where is Linux? Still in bag-of-bytes-o-rama.

~~~
SixSigma
That and getting rid of the TTY altogether.

We aren't using punched cards

EDIT: people hate when I say this, which amuses me. The TTY must die !!!!

~~~
nils-m-holm
> The TTY must die!!!!

Being sight-impaired, I have to disagree strongly! The TTY is the only thing
that lets me adjust the font size of all programs running in it without going
through lots of trouble.

(BTW: didn't downvote your comment.)

~~~
guard-of-terra
Browsers let you do that. KDE also does, so do other environments. For a quick
hack, set your screen DPI to 50.

~~~
nils-m-holm
Set your minimum font size to 32 points and browse the web for a while! Let me
know how it feels!

~~~
guard-of-terra
Browsers can set default zoom, not just font size.

~~~
nils-m-holm
Doesn't help. When the zoom factor is big enough, you have to scroll sideways
while reading.

Anyway, I have tried a lot of things over the years and _nothing_ even comes
close to using a text interface.

To name a few nuisances: controls moving outside of the screen, overlapping
elements in web content, unreadable buttons, unclickable input fields, tiny
fonts in menus, etc. Nothing of this happens with text interfaces.

Thanks for your input, though!

~~~
guard-of-terra
For reading, consider installing beeline reader (yes the name is stupid-ish),
in plugin or bookmarklet form.

------
igravious
I've been trying to teach myself some unicode code points because I'm getting
sick and tired of continually Googling them and copying and pasting the result
or bringing up a symbol character table.

In fact, I'd say keyboards are woefully out of date.

Specifically, I keep looking up † dagger (U+2020) and ‡ double-dagger (U+2021)
for footnotes, black heart (U+2665) to be romantic, black star (U+2605) to
talk about David Bowie's last album and ∞ to talk about actual non-finite
entities.

I only found out recently that Ctrl+Shift+u followed by the Unicode hexadecimal
outputs these in Ubuntu, presumably all Linuxen. AltGr+8 is great for
diaeresis while we're at it so you can go all hëävÿ mëtäl really easily.

edit: _black heart and star are not making it through, why Lord, why?!_
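
If memorising is the goal, a small Python sketch like this one prints the code point, the glyph and the official name side by side. The `describe` helper is just an illustration; it relies only on the standard `unicodedata` module.

```python
import unicodedata


def describe(code_point: int) -> str:
    """Return 'U+XXXX <glyph> <official Unicode name>' for one code point."""
    ch = chr(code_point)
    return f"U+{code_point:04X} {ch} {unicodedata.name(ch)}"


# Dagger, double dagger, black heart suit, black star, infinity.
for cp in (0x2020, 0x2021, 0x2665, 0x2605, 0x221E):
    print(describe(cp))
```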

~~~
elros
On OS X, if you type Command+Control+Space, it brings up a character insertion
menu where you can search by character name. I can get both daggers, black
star and black heart quite quickly that way.

~~~
msbarnett
Also on OS X, † is option-t, and ‡ is option-shift-7.

~~~
igravious
Ok, On Linux I have found ‡ and †

† is AltGr-Shift-%, and ‡ AltGr-Shift-:

I'll never remember them :(

~~~
lozf
U+2020 (†) and U+2021 (‡) aren't that hard to remember for the sake of a few
extra key-presses and wider compatibility.

------
gpvos
Wouldn't it be better if all those dangerous escape sequences (like
Application Program-Control, redefining function keys, alternate character
sets, etc.) were disabled by default in xterm? Anyone using the obsolete
software that uses them could enable them if they wish.

------
deathanatos
Repeat after me: UTF-8 is the sane default in this day and age. This is a good
change.

The whole "the ISO 6429 C1 control code 'application program command'" thing
is a bit surprising though. (I'm guessing this change doesn't actually avoid
this directly? If you sent an APC it'd still do it, it's just that APC is
multiple bytes in UTF-8, and hopefully a bit rarer?)
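
To make the byte-level part of that guess concrete, here is a small Python sketch. It only shows how APC (code point U+009F) is encoded; it says nothing about what xterm does once it has decoded it. The helper names are hypothetical.

```python
# C1 "application program command" (APC) is code point U+009F.
APC = "\u009f"


def encodings_of_apc():
    """Return the byte forms of APC under different conventions."""
    return {
        "8-bit (latin-1)": APC.encode("latin-1"),  # the single byte 0x9F
        "utf-8": APC.encode("utf-8"),              # two bytes: 0xC2 0x9F
        "7-bit equivalent": b"\x1b_",              # ESC followed by '_'
    }


def is_valid_utf8(data: bytes) -> bool:
    """True if `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# A stray 0x9F byte in binary output is a lone continuation byte,
# so a strict UTF-8 decoder rejects it rather than seeing APC.
assert not is_valid_utf8(b"\x9f")
assert is_valid_utf8(b"\xc2\x9f")
```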

> _Reinterpreting US-ASCII in an arbitrary encoding_

This way will likely work — at least, I thought. The vast majority of
encodings are a superset of ASCII, so reinterpreting ASCII as them _is_ valid.
The only one I know of that isn't is EBCDIC, and I've never seen it used.
(Said differently, non-superset-of-ASCII codecs are incredibly rare to
encounter, so the above assumption usually holds.) (The reverse,
reinterpreting arbitrary data as ASCII, is not going to work out as well.)
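
A quick way to check the superset claim, sketched in Python. The codec list is my own small sample of what Python ships, not an exhaustive survey, and `cp037` stands in for EBCDIC.

```python
# A sample of codecs expected to agree with ASCII on the printable range.
ASCII_COMPATIBLE = ["utf-8", "latin-1", "cp1252", "koi8-r", "iso8859-15", "mac-roman"]

# Every printable ASCII byte, space through tilde.
PRINTABLE_ASCII = bytes(range(0x20, 0x7F))


def decodes_like_ascii(codec: str) -> bool:
    """True if `codec` decodes printable ASCII bytes exactly as ASCII does."""
    return PRINTABLE_ASCII.decode(codec) == PRINTABLE_ASCII.decode("ascii")


for name in ASCII_COMPATIBLE + ["cp037"]:  # cp037 is an EBCDIC code page
    print(name, decodes_like_ascii(name))
```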

Though it is rather horrifying how easy it is to dump arbitrary data into a
terminal's stream. Unix does not make this easy for the program. The vast
majority of programs, I'd say, really just want to output text. Yet, they're
connected to a terminal. Or better, if perhaps a program could say, "I'm
outputting arbitrary binary data", or even "I'm outputting an
application/tar+gzip"; the terminal would then know immediately to not
interpret this input. And in the case of tar+gzip, it would have the
opportunity to do something truly magical: it could visualize the octets
(since trying to interpret a gzip as UTF-8 is insane); it could even just note
that the output was a tar, and _list the tar's contents like tar -t_. If the
program declares itself aware, like "application/terminal.ansi", then okay,
you know: it's aware; interpret away.

But it doesn't, so it can't. Part of the difficulty is probably that the TTY
is both input and output (not that the input can't also declare a mimetype or
something similar). And the vast majority of programs don't escape their user
input before sending it to a terminal; it's like one giant "terminal-XSS" or
"SQL-injection-for-your-terminal". And it is probably unreasonable to expect
it; I don't really know of any good libraries around terminal I/O; most
programs I see that do it assume the world is an xterm and just encode the raw
bytes, right there, and pray w.r.t. user input.
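
For what it's worth, the escaping step itself is small. Below is a minimal Python sketch, assuming the program already has decoded text and only wants to neutralise control characters; `sanitize_for_terminal` is a hypothetical helper, not an existing library function, and it deliberately keeps newline and tab.

```python
def sanitize_for_terminal(text: str) -> str:
    """Replace control characters with visible \\xNN escapes.

    Covers C0 controls (except newline and tab), DEL, and the C1 range
    U+0080..U+009F, so ESC-based and 8-bit control sequences are defused.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if ch in "\n\t":
            out.append(ch)
        elif cp < 0x20 or cp == 0x7F or 0x80 <= cp <= 0x9F:
            out.append(f"\\x{cp:02x}")
        else:
            out.append(ch)
    return "".join(out)


# Untrusted input that tries to switch the text colour to red:
print(sanitize_for_terminal("user said: \x1b[31mhello"))
```

This is a blunt filter: it also strips escape sequences a program might legitimately want to pass through, which is part of why the problem is awkward.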

catting the linux kernel's gzip into tmux can have consequences from "lol" to
"I guess we need a new tmux session".

It was also just today that I discovered that neither GNU's `ps` nor `screen`
support Unicode, at least, for characters outside the BMP.
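
For context on why non-BMP characters are the usual casualty: they take four bytes in UTF-8 and a surrogate pair in UTF-16, so code that assumes shorter sequences or 16-bit characters breaks on them. A small Python sketch follows (the helper name is made up, and this does not reproduce what `ps` or `screen` do internally):

```python
def utf16_code_units(text: str):
    """Return the UTF-16 code units of `text` as a list of integers."""
    raw = text.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


bmp_char = "\u2020"          # DAGGER, inside the BMP
astral_char = "\U0001F600"   # an emoji, outside the BMP

for ch in (bmp_char, astral_char):
    print(
        f"U+{ord(ch):04X}",
        "utf-8 bytes:", len(ch.encode("utf-8")),
        "utf-16 units:", [hex(u) for u in utf16_code_units(ch)],
    )
```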

~~~
comex
UTF-16 isn't a superset of ASCII, for one. Doesn't seem that anyone uses a
native UTF-16 terminal, but if you're trying to use grep or whatnot on a
UTF-16 encoded file, it'll happily silently not do what you want...
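
A minimal Python illustration of that failure mode, assuming a byte-oriented search like a naive grep; the function name is hypothetical:

```python
def naive_contains(haystack: bytes, ascii_word: str) -> bool:
    """Byte-level search, the way a tool unaware of the encoding would do it."""
    return ascii_word.encode("ascii") in haystack


text = "needle in a haystack"
as_utf8 = text.encode("utf-8")
as_utf16 = text.encode("utf-16-le")  # every ASCII char becomes <byte> 0x00

print(naive_contains(as_utf8, "needle"))    # found: UTF-8 keeps ASCII bytes intact
print(naive_contains(as_utf16, "needle"))   # not found: NUL bytes are interleaved
print("needle".encode("utf-16-le") in as_utf16)  # found once the pattern is re-encoded
```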

~~~
TazeTSchnitzel
畂桳栠摩琠敨映捡獴!

~~~
comex
唀吀䘀ⴀ㄀㘀 戀礀琀攀猀眀愀瀀猀 愀爀攀 愀氀猀漀 昀甀渀.

------
zkirill
This is really great! Just a few days ago I got very confused when I saw tofu
characters in xterm and had to switch to uxterm to see them (or set some
locale flag in my home dir).

------
plugnburn
UTF-8 must be the default and only encoding. Why does anything else still
exist?

~~~
jmnicolas
Yes but UTF-8 with or without byte order mark ? ;-)

~~~
plugnburn
Without. BOM (when used for UTF-8) is an obsolete crap invented by necrosoft
in order to make their software incompatible with normal.

~~~
TazeTSchnitzel
It's not a Microsoft invention, and MS's use of it is really quite sensible.
They had a problem of distinguishing UTF-16, UTF-8 and non-Unicode (possibly a
single-byte "extended ASCII" type encoding, possibly some multi-byte
monstrosity) text files. Since UTF-8 and ASCII-compatible encodings look
similar when there aren't many >U+007F characters in use, and identical if
none are in use, they could get confused. Prepending a Byte Order Mark solves
this problem, in that it makes a file unambiguously UTF-8 (or UTF-16, for that
matter).
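
The detection described here can be sketched in a few lines of Python. This is only an illustration of BOM sniffing using the constants from the standard `codecs` module; `sniff_bom` is a hypothetical helper and not how any particular Microsoft tool does it.

```python
import codecs


def sniff_bom(data: bytes):
    """Return a Python codec name if `data` starts with a known BOM, else None."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"   # decodes UTF-8 and drops the BOM
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"      # this codec reads the BOM to pick the byte order
    return None              # no BOM: could be UTF-8, ASCII, or a legacy code page


# Without a BOM, pure-ASCII content is byte-identical in UTF-8 and cp1252,
# which is exactly the ambiguity the BOM is meant to remove.
assert "abc".encode("utf-8") == "abc".encode("cp1252")
```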

------
kazinator
Great! Now just drop the embarrassing man(1) page reference, and you can call
it modernized.

Wow, I'm surprised that the people whose buttons this pushes are able to
make(1) a HN account, let alone have enough points to downvote.

Think about it. There is only one man page for xterm. If you type "man xterm"
with no section number you get that man page. If there existed an xterm(7)
page, you'd still get the xterm(1) man page by default. So why the hell write
the (1) notation every time you type the word xterm?

Man page section numbers are not useful or relevant, by and large, and
mentioning them only adds noise to a paragraph.

Even stupider is when the worst of the Unix wankers write man page section
numbers after ISO C function names. Example sentence: "Microsoft's malloc(3)
implementation is found in MSVCRT.DLL". #facepalm#

~~~
gjvc
>Think about it. There is only one man page for xterm. If you type "man xterm"
with no section number you get that man page. If there existed an xterm(7)
page, you'd still get the xterm(1) man page by default. So why the hell write
the (1) notation every time you type the word xterm?

Because the convention exists to define the type of the component. It's a
handy convention, and I'm betting there are a few people reading this who have
never used anything other than GNOME terminal so appending the section number
immediately helps the reader to place the component, otherwise they'd have to
look it up. etc

~~~
kazinator
So, if I don't know anything but Gnome terminal, and don't know what xterm is,
if I see "xterm", I have to look it up. However, if I see "xterm(1)", I
_don't_ have to look it up?

Strange.

(And how did I get to the situation in which I know what (1) means, yet I only
know Gnome terminal and don't know what xterm is?)

(What about the fact that xterm(1) is also a _hyperlink_ in the submitted page?
You could change the anchor text to "xterm(foo)" and it would still navigate
to the correct man page with one click.)

~~~
gjvc
unix has got much bigger problems than this

