
FWIW, I wouldn't try to parse escape sequences directly from the input bytestream -- it's easy to end up with annoying edge cases. :-/ In my experience you'll thank yourself if you can separate the logic into something like:

- First step (for a UTF-8-input terminal) is interpreting the input bytestream as UTF-8 and "lexing" into a stream of Unicode Scalar Values (https://www.unicode.org/versions/Unicode15.1.0/ch03.pdf#P.12... ; https://github.com/mobile-shell/mosh/blob/master/src/termina...).

- Second step is "parsing" the scalar values by running them through the DEC parser/state machine. This step is independent of the semantics of the individual escape sequences -- the state machine only cares about the structure (https://vt100.net/emu/dec_ansi_parser ; https://github.com/mobile-shell/mosh/blob/master/src/termina...).

- And then the third step is for the terminal to execute the dispatch/execute/etc. actions coming from the parser, which is where the escape sequences and control chars get implemented (https://www.vt100.net/docs/vt220-rm/ ; https://invisible-island.net/xterm/ctlseqs/ctlseqs.html ; https://github.com/mobile-shell/mosh/blob/master/src/termina...).
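The three steps above can be sketched with a toy state machine (a cut-down version covering only ground/escape/CSI states -- the names are mine, not Mosh's API, and a real parser per vt100.net/emu/dec_ansi_parser has many more states, collect actions, parameter parsing, etc.):

```python
# Toy sketch of step 2: a DEC-style state machine over Unicode scalar
# values. It only distinguishes ground / escape / CSI; output is a list
# of (action, data) pairs for step 3 to execute.

GROUND, ESCAPE, CSI_ENTRY = "ground", "escape", "csi"

class ToyParser:
    def __init__(self):
        self.state = GROUND
        self.params = ""
        self.actions = []          # (action, data) pairs for the terminal

    def feed(self, ch):            # ch is one Unicode scalar value
        if self.state == GROUND:
            if ch == "\x1b":
                self.state = ESCAPE
            elif ch < "\x20":
                self.actions.append(("execute", ch))     # C0 control
            else:
                self.actions.append(("print", ch))
        elif self.state == ESCAPE:
            if ch == "[":
                self.state, self.params = CSI_ENTRY, ""
            else:
                self.actions.append(("esc_dispatch", ch))
                self.state = GROUND
        elif self.state == CSI_ENTRY:
            if "\x40" <= ch <= "\x7e":                   # final character
                self.actions.append(("csi_dispatch", self.params + ch))
                self.state = GROUND
            else:
                self.params += ch                        # parameter chars
```

Feeding it "hi" followed by ESC [ 2 J yields two print actions and one csi_dispatch("2J") -- the terminal (step 3) is the only layer that knows 2J means "erase display".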

Without this separation, it's easier to end up with bugs where, e.g., a UTF-8 sequence or an ANSI escape sequence is treated differently if it's split between read() calls (https://bugs.chromium.org/p/chromium/issues/detail?id=212702), or invalid input isn't correctly recovered-from, etc.
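The split-read() problem is exactly what a stateful incremental decoder solves; Python ships one in codecs, and a hand-rolled step-1 decoder needs the same "carry the partial sequence across feeds" behavior:

```python
# Sketch of step 1: incremental UTF-8 decoding, so a multi-byte sequence
# split across read() calls is reassembled rather than mangled.
import codecs

decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")

# U+2500 (BOX DRAWINGS LIGHT HORIZONTAL) is e2 94 80. Split it across
# two simulated read() calls:
out = decoder.decode(b"\xe2\x94")    # "" -- incomplete, state is kept
out += decoder.decode(b"\x80")       # completes to "\u2500"
```

A naive decoder that calls bytes.decode() on each read() buffer independently would emit replacement characters here instead.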




> - First step (for a UTF-8-input terminal) is interpreting the input bytestream as UTF-8

> - Second step is "parsing" the scalar values by running them through the DEC parser/state machine.

Unfortunately, you may need to intermingle some logic between these two steps.

While VT100-style control sequences are usually introduced with an ESC, they can also be introduced by a single C1 control byte, e.g. 0x84 instead of ESC + D, or 0x9b instead of ESC + [. These C1 codes are raw bytes, not Unicode codepoints, and their encoding collides unpleasantly with UTF-8 continuation bytes, which occupy the same 0x80-0xBF range.

Further documentation: https://vt100.net/docs/vt220-rm/chapter4.html

Since there's no standard which specifies how UTF-8 should interact with the terminal parser, you're a little bit on your own here. But probably the simplest fix is to introduce a special case into the UTF-8 decoder which allows stray continuation bytes to be passed through to the DEC parser, rather than transforming them to replacement characters immediately.
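Something like this (a hypothetical sketch of the pass-through idea, not how any particular emulator does it -- and note it misfires on genuinely truncated sequences):

```python
def decode_with_c1_passthrough(data: bytes) -> str:
    """Toy decoder: valid UTF-8 decodes normally, but a byte in
    0x80-0x9F that is NOT part of a valid multi-byte sequence is
    passed through as the corresponding C1 control scalar
    (e.g. a raw 0x9B byte becomes U+009B, i.e. CSI)."""
    out, i = [], 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                         # plain ASCII
            out.append(chr(b))
            i += 1
            continue
        # try to decode a 2-, 3-, or 4-byte UTF-8 sequence here
        for length in (2, 3, 4):
            chunk = data[i:i + length]
            try:
                out.append(chunk.decode("utf-8"))
                i += length
                break
            except UnicodeDecodeError:
                continue
        else:                                # no valid sequence starts here
            if 0x80 <= b <= 0x9F:
                out.append(chr(b))           # pass through as raw C1
            else:
                out.append("\ufffd")         # otherwise, replacement char
            i += 1
    return "".join(out)
```

So a raw 0x9B 32 4A on the wire comes out as U+009B "2J" for the DEC parser, while well-formed UTF-8 text decodes unchanged.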


:-) The UTF-8/Unix FAQ and existing terminal emulators don't agree with you here. As you say, there's no spec for this, but here's what Kuhn's FAQ says (https://www.cl.cam.ac.uk/~mgk25/unicode.html#term):

"UTF-8 still allows you to use C1 control characters such as CSI, even though UTF-8 also uses bytes in the range 0x80-0x9F. It is important to understand that a terminal emulator in UTF-8 mode must apply the UTF-8 decoder to the incoming byte stream before interpreting any control characters. C1 characters are UTF-8 decoded just like any other character above U+007F."

The existing ANSI terminal emulators that support UTF-8 input and C1 controls seem to agree on this (VTE, GNU screen, Mosh). xterm, urxvt, tmux, PuTTY, and st don't seem to support C1 controls in UTF-8 mode. So I don't think poking holes in the UTF-8 decoder is necessary, especially since allowing C1 in UTF-8 mode is rare anyway.
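Under the FAQ's decode-first rule, an application that wants to send C1 CSI in UTF-8 mode sends it encoded as UTF-8 (0xC2 0x9B); the terminal decodes first, and the parser then treats the scalar U+009B exactly like ESC [. A minimal sketch of that ordering:

```python
# Decode-first, per Kuhn's FAQ: the UTF-8 decoder runs before any
# control-character interpretation, so CSI arrives as the two-byte
# UTF-8 encoding of U+009B, not as a raw 0x9B byte.
raw = b"\xc2\x9b2J"                  # UTF-8-encoded C1 CSI, then "2J"
text = raw.decode("utf-8")           # -> "\u009b2J"

def is_csi_introducer(ch: str) -> bool:
    # the parser checks decoded scalar values, never raw bytes
    return ch == "\u009b" or ch == "\x1b"  # (ESC needs its own state, of course)
```

With this rule the UTF-8 decoder stays strict, and a bare 0x9B byte on the wire is simply malformed UTF-8.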


Note that very few terminals implement C1 controls in UTF-8 mode, and in particular xterm (which is kind of the de facto standard) doesn't, because of the issues you outline. My opinion is that they should just die as a legacy feature. No programs depend on them.


That's a perfectly reasonable answer too. There are a couple of other VT100 features that are safe to omit if you aren't going for full "historical accuracy" -- VT52 mode, for instance, has been obsolete for 30-40 years now.

As an aside: I wonder how useful it'd be to assemble a report documenting all known terminal control sequences and other behaviors, what terminals they're available in, and how frequently they're used in modern software. There are some big gaps between the DEC documentation, ECMA035/043/048, and actual implementations of terminal emulators.


https://www.xfree86.org/current/ctlseqs.html <- seems to claim they are supported? Oh! Maybe I misunderstand, and you mean those two features simultaneously?



