
Text Processing in the Shell - pcr910303
https://blog.balthazar-rouberol.com/text-processing-in-the-shell
======
shp0ngle
Sometimes I find it strange - in both a good and a bad way - that we are, in
2020, learning tools and languages designed and built in the 80s, with the
models and constraints of the time, with 40 years of layers of backwards
compatibility, and sometimes actually going back to the 70s.

I am still learning tools designed around the constraints of teleprinters

Sure, it’s the same on the Windows side (and the macOS side, with its classic
OS compatibility layers still present, like all the HFS stuff). Not bashing
bash here.

Surely our computers have very different models of operation than the PDP-11,
yet we sometimes pretend they don't.

~~~
enriquto
The English language is several hundred years old, and DNA is several hundred
million years old. _Everything_ we are based on is legacy; I do not see why it
should be different in computing.

~~~
hnlmorg
In fairness to the OP, there's a lot of cruft in the way terminals work that
really isn't necessary any more but needs to be there because of backwards
compatibility. Such as:

\- formatting being in-lined via ANSI escape sequences,

\- and there's a massive disparity in which escape sequences different
terminal emulators support,

\- control codes being part of the same character set as printable characters,

\- changing the behaviour of the TTY requires either terminal emulator support
or OS support depending on the behaviour you require, because TTYs are defined
partially via kernel drivers (which require syscalls to alter) and partially
by escape sequences,

\- and in the case of kernel behaviour, those syscalls vary from one OS to
another. Some OSes don't even support TTY behaviours that other OSes do, so
you can't even guarantee that logic is cross-platform; you just have to wrap
around specific differences in syscalls,

\- resizing terminal UIs can be a nightmare -- often requiring capturing
resize signals (SIGWINCH) and redrawing -- because there's no native layout
system for drawing to the TTY.

This isn't meant as a criticism though, because there's a lot the general
design of terminals gets right (eg the kernel TTY driver allows us to kill
processes over remote shells like SSH and mosh). But I think terminals are one
of those things that work "good enough" that most of the ugliness is hidden
from everyday users. However, if we were to redesign UNIX terminals from the
ground up, there are a lot of things most engineers would like to change and a
lot of places where things could be improved. Like having an out-of-band
channel for sending metadata describing the pipeline that shouldn't be mixed
in with the byte stream.

~~~
JdeBP
... and pretty much all of those were addressed, outwith Unix, by the
evolution of the 1960s terminal I/O model into the console I/O model during
the 1980s.

You even forgot to mention one of the things that was addressed: input.
Terminal I/O input, done properly, requires a full ECMA-48 decoder state
machine, with bodges to accommodate non-conformant warts from the Linux KVT,
SCO Console, and RXVT. This is all too often _not_ done properly, because
people do not realize that there is ECMA-48 in _both_ directions; and it is a
mess, just looking at function keys alone and not even accounting for keypad
application/normal modes and a mouse/locator. Console I/O evolved into uniform
input event records for HIDs that did not require state machines to decode.

Note that the lack of a layout system is only applicable to character-mode
terminals. Block-mode terminals are a quite different kettle of fish.

~~~
hnlmorg
> _... and pretty much all of those were addressed, [outside?] Unix, by the
> evolution of the 1960s terminal I/O model into the console I/O model during
> the 1980s._

> _Note that the lack of a layout system is only applicable to character-mode
> terminals. Block-mode terminals are a quite different kettle of fish._

Indeed but the point isn't "are these solvable problems?" but rather "why are
we still using archaic tech?"

Designing a solution to those problems is actually the easy part. It is
shifting the ecosystem away from TTYs that's hard.

> _You even forgot to mention_

It wasn't intended as an exhaustive list :) There are plenty more issues I
haven't raised.

\---

In an ideal world I'd love to see UNIX terminals reinvented. The reality is
that things are "good enough" for most people that they simply don't notice
most of the issues, and migrating to the next evolution of UNIX terminals
would mean a break in backwards compatibility, which would be more disruptive
(initially, in a negative way) than making do with the warts we currently
have.

~~~
dublin
Check out Microsoft's new Windows terminal
([https://www.hanselman.com/blog/ItsTimeForYouToInstallWindows...](https://www.hanselman.com/blog/ItsTimeForYouToInstallWindowsTerminal.aspx)
) It may be the most modern, capable, and yet compatible terminal I've come
across, and works well regardless of the environment you want to run it
in/with. (To be fair, a modern Windows terminal is a good 20 years overdue,
but MS deserves credit for finally getting it right...)

~~~
hnlmorg
If it's backwards compatible with TTYs then it probably hasn't solved any of
the problems I mentioned and there's already a plethora of nice terminal
emulators out there (which is why I say most of the "ugliness" is hidden from
everyday users).

It should also be noted that Windows makes a few mistakes when it comes to
terminal design too:

\- For starters, cmd.exe is both a shell and a terminal emulator, and it's
impossible to separate the two.

\- To compound things, many common commands like dir, rm, etc are shell
builtins (this will be a throwback to DOS). So you cannot even use an
alternative shell on Windows without having to either invoke cmd.exe or
rewrite existing utilities.

\- And if that wasn't bad enough, cmd.exe builtins do not read from STDIN.
They instead use DOS syscalls to read keyboard input.

\- As you've probably guessed, cmd.exe isn't the only culprit that does this.
Any "Windows" command line program designed for or which uses DOS APIs will
not follow the standard streams idiom. Any command line software written for
NT, however, will. So you end up having to write all sorts of really nasty
hacks just to get the command line working on Windows (far _far_ nastier than
any of the hacks that happen on UNIX/Linux).

\- Then you have Powershell, which is an entirely separate command line in its
own right and largely - though not completely - incompatible with cmd.exe.

\- And WSL, which is also incompatible with cmd.exe _and_ Powershell.

At least on UNIX/Linux, you have one terminal methodology. From there you can
use whichever terminal emulator you want, whichever shell you want, whichever
programming language to write CLI tools, and whichever CLI tools you want to
download. Whereas on Windows you have four competing standards which don't
cooperate well.

~~~
JdeBP
It behooves one not to make egregious mistakes when talking about the mistakes
supposedly made by Windows. cmd.exe is not a terminal emulator at all, and
does not make DOS system calls (it being a Win32 program) at all.

* [http://jdebp.uk./FGA/a-command-interpreter-is-not-a-console....](http://jdebp.uk./FGA/a-command-interpreter-is-not-a-console.html)

And the fact that CMD and PowerShell provide different interpreted languages
is no different to the Korn shell, tclsh, and Perl providing different
languages.

~~~
hnlmorg
> _It behooves one not to make egregious mistakes when talking about the
> mistakes_

Admittedly it's been a little while since I last played around with custom
shells and terminals on Windows, and I had also rushed my post, so you're
right that some details were wrong. But you're just as far off with your
corrections as the points you were criticising, so you're really not in a
position to be making pronouncements about egregious mistakes.

I didn't say cmd.exe was a terminal emulator; I said it was multiple layers
_including_ the terminal emulator, but not exclusively the terminal. OK,
technically it's conhost.exe that provides the terminal emulation; I'd lazily
lumped that together with cmd.exe because cmd.exe depends on conhost.exe when
not run headless. The former requires the latter, so you can't just drop
cmd.exe into another terminal emulator and run it (other people have tried,
and there are extensive blog posts about the hacks they've had to do to get it
to work, like running conhost.exe off screen).

If you want to be 100% technically accurate, then cmd.exe is actually not like
any of the things we've described. It's certainly not equivalent to the Korn
shell or other language REPLs, as you stated. In fact, the "language" part of
cmd.exe is barely a macro language (again, due to its DOS heritage). Plus,
shells orchestrate with byte streams, whereas NT's streams work very
differently, and cmd.exe doesn't even behave correctly when used as a CLI tool
(which I'll get into below).

You're right that cmd.exe itself doesn't make DOS syscalls; as I said above, I
was rushing my post, which caused me to conflate two points. What I really
meant to say was:

1\. cmd.exe builtins read from the NT console API's stdin. Which means you
cannot fork out to cmd.exe as a CLI command because any prompts ("Are you sure
you wish to delete" type things) just whiz straight past without pausing for
input. This was highly annoying when I was developing my alternative Windows
shell and wanted to make use of rm, copy, etc rather than having to write
those commands all over again.

2\. Windows, and cmd.exe by extension, supports running other console
applications which don't use NT's console streams because they favour some of
the other hacks used in the DOS days. This means those applications also don't
work with alternative shells let alone alternative terminal emulators.

Also, I think it's disingenuous to cite your own blog post as a source. I
could link you to the GitHub repository where I've had to put in numerous
workarounds for the shell I've written to work with Windows. But instead I'll
link to something a little more recognised:

[https://devblogs.microsoft.com/commandline/windows-command-l...](https://devblogs.microsoft.com/commandline/windows-command-line-inside-the-windows-console/)

(I did have a hunt around for the blog posts from other developers building
console solutions for Windows and the similar problems they've run into, but
since it was around 5 years ago that I gave up first-party Windows support,
those blogs are now lost in the mists of the ether.)

------
RMPR
A little suggestion for the author: they mentioned xargs, and I think [GNU
parallel](https://www.gnu.org/software/parallel/) might warrant a mention too,
since it is a kind of modern successor that can use many computers to run
tasks.

~~~
pletnes
This you have to install, whereas xargs is everywhere. Also, with the -P flag
you can parallelize the most common cases.
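For instance, a quick sketch of -P (the gzip use case in the comment is just an illustration; the demo below uses echo so it runs anywhere):

```shell
# -P 4 runs up to four jobs at once; -n 1 passes one argument per
# invocation. A real-world use would be something like:
#   ls *.log | xargs -n 1 -P 4 gzip
# Harmless demo version:
printf '%s\n' 1 2 3 4 | xargs -n 1 -P 4 echo
```

Note that with -P the output order is not guaranteed, since jobs finish independently.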

~~~
assafmo
Personally I like parallel better because of the `--bar` option and `{}`,
`{.}`, `{/}` and `{/.}`. And I usually just use it with `-P 1` anyway.
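To illustrate those placeholders (assuming GNU parallel is installed; the path is made up):

```shell
# For the argument dir/a.txt:
#   {}   -> dir/a.txt   (the argument as-is)
#   {.}  -> dir/a       (extension removed)
#   {/}  -> a.txt       (directory part removed)
#   {/.} -> a           (both removed)
parallel echo {} {.} {/} {/.} ::: dir/a.txt
```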

------
latenightcoding
OP, you should mention Perl one-liners in upcoming chapters:
[https://catonmat.net/introduction-to-perl-one-liners](https://catonmat.net/introduction-to-perl-one-liners)

~~~
thesuperbigfrog
Agreed. Perl makes it easy to do complex text processing and can replace many
individual command line text processing tools.

Recent versions of Perl also support UTF-8 so they can support text processing
in different natural languages or internationalization needs. See
[https://en.wikibooks.org/wiki/Perl_Programming/Unicode_UTF-8](https://en.wikibooks.org/wiki/Perl_Programming/Unicode_UTF-8)

~~~
bmn__
> Recent versions of Perl

Make sure it's recent enough and released after 2002!

~~~
thesuperbigfrog
I was thinking of Perl versions 5.14 (released in 2011) and later since that
release fixed several Unicode-related bugs.

There have been other improvements and fixes in the versions up to 5.30, so
the Unicode support now is pretty transparent.

Many of the classic command line text processing tools are not Unicode aware.

------
SPBS
I've always viewed `awk '!a[$0]++'` as superior to `sort | uniq` because it
preserves order and does not have to sort the data before deduplicating. But
`sort | uniq` is much easier to remember.
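A quick demonstration of the difference (sample input made up):

```shell
# a[$0]++ is 0 (falsy) the first time a line is seen, so the negation
# prints only first occurrences, in their original order:
printf 'b\na\nb\na\n' | awk '!a[$0]++'      # b, then a
# sort | uniq deduplicates too, but reorders the input:
printf 'b\na\nb\na\n' | sort | uniq         # a, then b
```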

~~~
asicsp
Or just `sort -u` (if you are using GNU sort, not sure about other
implementations)

Another difference is that sort is optimized to handle large files [0]

[0]
[https://unix.stackexchange.com/questions/279096/scalability-...](https://unix.stackexchange.com/questions/279096/scalability-of-sort-u-for-gigantic-files)

------
fnord123
This is cool for English text. But once you get Unicode with various ways to
represent é, whew lad. This gets shitty quickly in the shell.

~~~
bmn__
English _is_ Unicode. Pretending otherwise would be quite naïve.
[https://www.azabani.com/pages/gbu/#slide4](https://www.azabani.com/pages/gbu/#slide4)

~~~
fnord123
Your claim is not clear.

"Unicode with various ways to represent é" is a shit show to parse using shell
tools. e.g. Try scraping Spanish language Twitter feeds. When I have done this
kind of work, I made a tool to canonicalize glyphs and had to put it between
every step of a pipeline.
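A sketch of that kind of canonicalization step, leaning on Python's stdlib here since no standard shell tool does Unicode normalization (the sample strings are made up):

```shell
# "café" spelled two ways: precomposed é (U+00E9) vs e + combining
# acute accent (U+0301). Byte-wise they differ, so dedup sees two lines:
printf 'caf\303\251\ncafe\314\201\n' | sort -u | wc -l    # two lines
# Normalizing every stage's input to one form (NFC) makes them equal:
printf 'caf\303\251\ncafe\314\201\n' |
  python3 -c 'import sys, unicodedata
for line in sys.stdin:
    sys.stdout.write(unicodedata.normalize("NFC", line))' |
  sort -u | wc -l                                         # one line
```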

------
assafmo
Lately I've started to use Perl instead of sed for replacing text. Its regex
support is much better IMO:

    cat a.txt | perl -pe 's/banana-(\d)/papaya-$1/g'

Or in-place:

    perl -i -pe 's/banana-(\d)/papaya-$1/g' a.txt

------
clircle
I'll preface my question by saying that I'm not a dev.

Why do this kind of work in the shell? Isn't it better to do this in a
programming language that can run on all operating systems? What are Windows
users supposed to do?

~~~
overgard
As a dev: I don't know. It's a well-written article and this stuff can be
handy in a pinch, but I've yet to see many real-world scenarios where a
complicated shell script is a good idea. For most of these examples, I would
probably rather write a five-line python script to ingest the data into
sqlite and then use actual queries.
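For what it's worth, that ingest-then-query approach can even stay inside a shell pipeline; a rough sketch using only the Python stdlib (the CSV data is inlined and hypothetical):

```shell
python3 - <<'EOF'
import csv, io, sqlite3

# Inline stand-in for a real data.csv
data = io.StringIO("name,qty\napple,3\napple,2\npear,1\n")
rows = list(csv.reader(data))[1:]  # skip the header row

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE t (name TEXT, qty INT)")
db.executemany("INSERT INTO t VALUES (?, ?)", rows)
for name, total in db.execute(
        "SELECT name, SUM(qty) FROM t GROUP BY name ORDER BY name"):
    print(name, total)
EOF
```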

~~~
nostoc
The use case is to quickly handle ad hoc scenarios.

If you need to quickly extract something from a csv, you could break out
python, or import it into a database, but using cut and grep (or csv-tools)
will take 5 seconds.

The point is if you need to do a specific task many times, do it in a
programming language. But if you have an ad hoc task, you're saving a lot of
time by being proficient in the shell.
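The kind of five-second extraction being described, on made-up data:

```shell
# Keep rows mentioning "error", then take the second comma-separated field:
printf 'web,ok\ndb,error: timeout\ncache,ok\n' | grep -i error | cut -d, -f2
# prints: error: timeout
```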

~~~
overgard
Ad hoc tasks quickly become regular tasks, though. If your "database" is CSV
files, you're better off figuring out a better way to structure it.

The way things are moving with containers, the idea that you're even going to
have these utilities on the server, and the idea that the server is writing
this stuff to a file system -- that's totally changing. So is this useful for
local stuff? Maybe, but isn't Excel probably more useful there?

------
nailer
Text processing is scraping. Modern shells have structured output from their
stdlib, allow you to pipe to 'where' and 'select', and can read JSON, YAML,
etc. natively.

~~~
gpanders
Which shells in particular are you referring to?

~~~
nailer
Mainly pwsh and nushell

------
MR4D
I wish man pages were this good!

~~~
bori5
[https://tldr.sh/](https://tldr.sh/)

~~~
leadingthenet
Even better:
[https://github.com/chubin/cheat.sh](https://github.com/chubin/cheat.sh)

This includes man pages from tldr, and more! The command line utility has been
a great help for me over the past few months.

~~~
bori5
Thank you, did not know about this one!

------
Gnouc
I invite you to read
[https://unix.stackexchange.com/q/169716/38906](https://unix.stackexchange.com/q/169716/38906)

------
411111111111111
I was just missing a small note akin to "which becomes all the more powerful
by combining these commands", followed by a totally readable example such as
this one:

    some_file=example.sh; tail -n +$(( $(wc -l $some_file | grep -o "[0-9]\+") - 5 )) $some_file
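(For the record, the one-liner above works out to "print the last six lines of the file", which tail can do directly:)

```shell
# tail -n +K prints from line K onward; K = total - 5 therefore means
# the last six lines, which is just:
tail -n 6 example.sh
```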

------
boshomi
Please also add "join".
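For anyone unfamiliar with it, join merges two files on a common key; both inputs must be sorted on that field (file names and contents here are made up):

```shell
printf '1 alice\n2 bob\n'  > ids.txt
printf '1 admin\n2 user\n' > roles.txt
# Joins on the first field of each file by default:
join ids.txt roles.txt
# prints:
#   1 alice admin
#   2 bob user
rm ids.txt roles.txt
```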

------
seemslegit
Just Don't. Unix-style text stream processing was super cool in the 80s and
90s, but it is born tech-debt today.

~~~
thekelvinliu
what do you do instead?

~~~
kuschku
Output data as JSON, manipulate and show it with jq
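A small example of that style, assuming jq is installed (the JSON is made up):

```shell
# -r prints raw strings; select() filters array elements:
echo '[{"name":"web","up":true},{"name":"db","up":false}]' |
  jq -r '.[] | select(.up) | .name'
# prints: web
```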

~~~
bryanrasmussen
Not all data comes from sources that you control and whose output format you
have chosen.

~~~
BaltoRouberol
[https://github.com/kellyjonbrazil/jc](https://github.com/kellyjonbrazil/jc)
can come in pretty handy there

~~~
enriquto
oh my god, why?! just why!?

What the world needs is the inverse program of "jc", where an otherwise
shell-unparseable JSON string is expanded into a flat list of lines, all of
the form "field.subfield=value".

~~~
Igrom
If you search for "gron", you will find a family of such tools, e.g.
[https://github.com/tailhook/rust-gron](https://github.com/tailhook/rust-gron).

~~~
enriquto
Thanks, that's just what I needed!

