Problem solving with Unix commands (vegardstikbakke.com)
356 points by v3gas 37 days ago | 211 comments

Gary Bernhardt[1] gave a great talk about practical problem solving with the unix shell: "The Unix Chainsaw"[2].

"Half-assed is OK when you only need half of an ass."

In the talk, he gives several demonstrations of a key aspect of why unix pipelines are so practically useful: you build them interactively. A complicated 4 line pipeline started as a single command that was gradually refined into something that actually solves a complicated problem. This talk demonstrates the part that isn't included in the usual tutorials or "cool 1-line command" lists: the cycle of "Try something. Hit up to get the command back. Make one iterative change and try again."

[1] You might know him from his other hilarious talks like "The Birth & Death of JavaScript" or "Wat".

[2] https://www.youtube.com/watch?v=sCZJblyT_XM

> In the talk, he gives several demonstrations of a key aspect of why unix pipelines are so practically useful: you build them interactively.

The standard Unix interface might have been interactive in the ’70s, back when hardware and peripherals were horribly non-interactive. But I don’t know why so many so-called millennial programmers (people my age) get excited about the alleged interactivity of the Unix that most people are familiar with. It doesn’t even have the cutting edge ’90s interactivity of Plan 9, what with mouse(!) selection of arbitrary text that can be piped to commands and so on. And every time someone comes up with a Unix-hosted tool that uses some kind of fold-up menu that informs you about what key combination you can type next (you know, like what all GUI programs have with Alt+x and the file|edit|view|… toolbar), people hail it as some kind of UX innovation.

I think the interactivity you describe might be a different thing from what your parent is talking about.

From what I understand, your parent talks about how the commands are built iteratively, with some kind of trial-error loop, which is a strength that is supposedly not emphasized enough. And I agree by the way. Nothing to do with how things are input.

That's correct. Articles/tutorials or an evangelizing fan often show the end result: the cool command/pipeline that does something cool and useful. The obvious question from someone unfamiliar with unix, upon seeing something like the pipeline in this article:

    comm -1 -3 <(ls -1 dataset-directory | \
                 grep '\d\d\d\d_A.csv'   | \
                 cut -c 1-4              | \
                 python3 parse.py        | \
                 uniq                      \
                 )                         \
               <(seq 500)
is "Why would I want to write a complicated mess like that? Just use ${FAVORITE_PROG_LANG:-Perl, Ruby, or whatever}." For many tasks, a short paragraph of code in a "normal" programming language is probably easier to write and is almost certainly a more robust, easier to maintain solution. However, this assumes that you knew what the problem was and that qualities like maintainability are a goal.

Bernhardt's (and my) point is that sometimes you don't know what the goal is yet. Sometimes you just need to do a small, one-off task where a half-assed solution might be appropriate... iff it's the right half of the ass. Unix shell gets that right for a really useful set of tasks.

This works because you are free to utilize those powerful features incrementally, as needed. The interactive nature of the shell lets you explore the problem. The "better" version in a "proper" programming language doesn't exist when you don't yet know the exact nature of the problem. A half-assed bit of shell code that slowly evolved into something useful might be the step between "I have some data" and a larger "real" programming project.
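As a rough sketch of that incremental exploration, the steps below rebuild a small version of the article's missing-files check one stage at a time. The temporary directory, the file names, and the `missing` helper are made up for illustration, and `sed` stands in for the article's `parse.py` to strip leading zeros.

```shell
# Hypothetical sample data: files 1, 3 and 5 exist, 2 and 4 are missing.
dir=$(mktemp -d)
touch "$dir/0001_A.csv" "$dir/0003_A.csv" "$dir/0003_B.csv" "$dir/0005_A.csv"

# Each attempt is the previous command plus one more stage:
#   ls "$dir"
#   ls "$dir" | grep '_A\.csv$'
#   ls "$dir" | grep '_A\.csv$' | cut -c 1-4
#   ls "$dir" | grep '_A\.csv$' | cut -c 1-4 | sed 's/^0*//'

# Once the stages look right, wrap them up (the helper name is illustrative).
missing() {
  all=$(mktemp)
  seq "$2" | sort > "$all"          # every number we expect, sorted for comm
  ls "$1" | grep '_A\.csv$' | cut -c 1-4 | sed 's/^0*//' | sort |
    comm -13 - "$all"               # keep lines that appear only in "$all"
  rm -f "$all"
}

missing "$dir" 5
```

With the sample files above this prints 2 and 4. The commented-out lines are the point: each one can be run and inspected on its own before the next stage is added.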

That said, there is also wisdom in learning to recognize when your needs have outgrown "small, half-assed" solutions. If the project is growing and adding layers of complexity, it's probably time to switch to a more appropriate tool.

Just yesterday I needed to extract, sort, and categorize the user agent strings for 6 months' traffic to a handful of sites (attempting to convince a company to abandon TLS 1.0/1.1).

The first half of the job was exactly the process you described: start with one log file, craft a grep for it, craft a `grep -o` for the relevant part of each relevant line, add `sort | uniq -c | sort -r`, switch to zgrep for the archived rotated files, and so on.
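A minimal sketch of that first half, assuming a made-up log format in which the user agent is the last quoted field on each line. The sample log and the `ua_counts` helper are hypothetical. It uses `sort -rn` rather than plain `sort -r` so that the counts are compared numerically.

```shell
# Hypothetical sample data standing in for one real access log.
log=$(mktemp)
cat > "$log" <<'EOF'
10.0.0.1 - - [01/Jan/2019] "GET / HTTP/1.1" 200 "-" "Mozilla/5.0 (X11; Linux)"
10.0.0.2 - - [01/Jan/2019] "GET /a HTTP/1.1" 200 "-" "curl/7.58.0"
10.0.0.3 - - [01/Jan/2019] "GET /b HTTP/1.1" 200 "-" "Mozilla/5.0 (X11; Linux)"
EOF

# Built up one step at a time:
#   grep 'GET' "$log"               # which lines matter?
#   grep -o '"[^"]*"$' "$log"       # only the last quoted field
#   ... | sort | uniq -c | sort -rn # count and rank
ua_counts() {
  grep -o '"[^"]*"$' "$1" | sort | uniq -c | sort -rn
}

ua_counts "$log"
```

For rotated, compressed logs the same pattern should work with `zgrep` in place of `grep`, as the comment describes.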

The other half of the ass was done in a different language, using the output from the shell, because I needed to do a thousand or so lookups against a website and parse the results.

Composable shell tools are a very under-appreciated toolbox, IMO.

To be fair, it's possible to make this block simpler and more readable than what you have there. The problem with a lot of bash scripts I've seen is that they just duck-tape layer after layer of complexity on top of each other, instead of breaking things into smaller, composable pieces.

Here's a quick refactor for the block that I would say is simpler and easier to maintain.

  function xform() {
    local dir="$1"
    ls -1 "$dir"           |
    grep '\d\d\d\d_A.csv'  |
    cut -c 1-4             |
    python3 parse.py       |
    uniq
  }
  comm -1 -3 <(xform dataset-directory) <(seq 500)

>is "Why would I want to write a complicated mess like that?" Just use ${FAVORITE_PROG_LANG:-Perl, Ruby, or whatever}".

Some discussion on the pros and cons of those two approaches, here:

More shell, less egg:


I had written quick solutions to that problem in both Python and Unix shell (bash), here:

The Bentley-Knuth problem and solutions:


That's not a “discussion on the pros and cons of those two approaches”; that's a skewed story about just one part of a particular review of an exercise done in a particular historical context. (More on that here: https://news.ycombinator.com/item?id=18699718)

Not that there isn't some merit to McIlroy's criticism (I know some of the frustration from trying to read Knuth's programs carefully), but at least link to the original context instead of a blog post that tells a partial story:



(One of the places where McIlroy admits his criticism was "a little unfair": https://www.princeton.edu/~hos/mike/transcripts/mcilroy.htm)

BTW, there's a wonderful book called “Exercises in Programming Style” (a review here: https://henrikwarne.com/2018/03/13/exercises-in-programming-...) that illustrates many different solutions to that problem (though as it happens it does not include Knuth's WEB program or McIlroy's Unix pipeline).


>(More on that here: https://news.ycombinator.com/item?id=18699718)

>BTW, there's a wonderful book called “Exercises in Programming Style” (a review here: https://henrikwarne.com/2018/03/13/exercises-in-programming-...) that illustrates many different solutions to that problem (though as it happens it does not include Knuth's WEB program or McIlroy's Unix pipeline).

I'm the same person who referred to my post with two solutions (in Python and shell) in that thread, here:


in reply to which, Henrik Warne talked about the book you mention above.

Ah, good luck. Please consider all the viewpoints when linking to that blog post; else we may keep having the same conversation every time. :-)

>Please consider all the viewpoints when linking to that blog post;

It should have been obvious to you, but maybe it wasn't: nobody always considers all viewpoints when making a comment, otherwise it would become a big essay. This is not a college debating forum. There is such a thing as "caveat lector", you know:


>else we may keep having the same conversation every time.

No, I'm quite sure we won't be. Nothing to gain :)

Let me put it this way: the last time the link was posted, I pointed out many serious problems with the impression it gives. Now, if the same link is posted again with no disclaimer, then either:

1. You don't think the mentioned problems are serious,


2. You agree there are serious problems but don't care and will just post it anyway.

Not sure which one it is, but it doesn't cost much to add a simple disclaimer (or at least link to the original articles). Else as long as I have the energy (and notice it) I'll keep trying to correct the misunderstandings it's likely to lead to.

FYI, you don't need the backslash if you end the line with a pipe... it's implied in that case.
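A small sketch of that rule, using made-up input: a line that ends with `|` is incomplete, so the shell keeps reading the next line as part of the same pipeline.

```shell
# No trailing backslashes: each line ends with a pipe, so the command continues.
count=$(printf '%s\n' b a b |
  sort |
  uniq |
  wc -l)
echo "$count"
```

This prints the number of distinct input lines, which is 2 here.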

I generalized interactivity to the Unix that most people seem familiar with.

“The interactive nature of the shell” isn’t that impressive in this day and age. Certainly not shells like Bash (Fish is probably better, but then again that’s very cutting edge shell (“for the ’90s”)).

Irrespective of the shell this just boils down to executing code, editing text, executing code, repeat. I suspect people started doing that once they got updating displays, if not sooner.

Some people figure out the utility of this right away. Many don't. Whenever I show my coworkers the 10-command pipeline I used to solve some ad-hoc one-time problem, many of them (even brilliant programmers and sysadmins among them) look at it as some kind of magic spell. But I'm just building it a step at a time. It looks impressive in the end, even though it's probably actually wildly inefficient and redundant.

But none of that is the point. The end result of a specific solution isn't the point. The cleverness of the pipeline isn't the point. The point is that if you are familiar with the tools, this is often the fastest method to solve a certain class of problem, and it works by being interactive and iterative, using tools that don't have to be perfect or in and of themselves brilliant innovations. Sometimes a simple screwdriver that could have been made in 1900 really is the best tool for the job!

> Irrespective of the shell this just boils down to executing code

Bernhardt's stated goal with that talk was to get people to understand this point (and hopefully use and benefit from the power of a programmable tool). "If [only using files & binaries] is how you use Unix, then you are using it like DOS. That's ok, you can get stuff done... but you're not using any of the power of Unix."

> Fish

Fish is cool! I keep wanting to use it, but the inertia of Bourne shell is hard to overcome.

> Fish is cool! I keep wanting to use it, but the inertia of Bourne shell is hard to overcome.

Back when I tried Fish some 5 or 6 years ago, I think, I was really attracted by how you could write multiline commands in a single prompt. I left it, though, when I found out that its pipes were not true pipes. The second command in a pipeline did not run until the first finished, and that sucked and made it useless when the first command was never meant to finish on its own or when it should've been the second command to determine when the first should finish.
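A small sketch of why concurrent pipes matter, written for a POSIX-style shell: `yes` never exits on its own, so this only terminates if both sides run at the same time and the reader going away stops the writer.

```shell
# `yes` prints "y" forever; `head` exits after three lines, the pipe closes,
# and `yes` stops on its next write (normally via SIGPIPE).
first_three=$(yes | head -n 3)
echo "$first_three"
```

In a shell that waited for the first command to finish before starting the second, this pipeline would never complete.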

It seems they've fixed that, but now I found that you can also write multiline commands in a single prompt in zsh, and I can even use j/k to move between the lines, and have implemented tab to indent the current line to the same indentation as the previous line in the same prompt. Also, zsh has many features that make code significantly more succinct, making it quicker to write. This seems to go right against the design principle of fish of being a shell with a simpler syntax, so now I don't see the point of even trying to move to it.

I feel that more succinct and quicker to write does not mean simpler.

Fish tries to have a cleaner syntax and probably succeeds in doing so. It may even be an attempt to bring some change to the status quo that is the POSIX shell syntax.

I didn't try fish anyway, because I like to not have to think about translating when following some tutorial or procedure on the Web. Zsh just works for that, except in a few very specific situations (for a long time, you could not just copy paste lines from SSH warnings to remove known hosts, but this has been fixed recently by adding quotes).

> I feel that more succinct and quicker to write does not mean simpler.

Indeed, it does not. They're design trade-offs of each other.

> Fish tries to have a cleaner syntax and probably succeeds in doing so. It may even be an attempt to bring some change to the status quo that is the POSIX shell syntax.

Indeed, it does, and it is (attempting to, though maybe not doing).

The thing is that, for shell languages, which are intended to be used more interactively for one-off things than for large scripting, I think being more succinct and quicker to write are more valuable qualities than being simpler.

If you are interested, I use a configuration for zsh that I see as "a shell with features like fish and a syntax like bash"

In .zshrc:


    ZSH="$HOME/.oh-my-zsh" # assumed definition; matches the path sourced below
    if [ ! -d "$ZSH" ]; then
        git clone --depth 1 git://github.com/robbyrussell/oh-my-zsh.git "$ZSH"
    fi


    plugins=(zsh-autosuggestions) # add zsh-syntax-highlighting if not provided by the system

    source "$HOME/.oh-my-zsh/oh-my-zsh.sh"

    PROMPT="%B%{%F{green}%}[%*] %{%F{red}%}%n@%{%F{blue}%}%m%b %{%F{yellow}%}%~ %f%(!.#.$) "


    if [ -f /usr/share/zsh-syntax-highlighting/zsh-syntax-highlighting.zsh ]; then
        source /usr/share/zsh-syntax-highlighting/zsh-syntax-highlighting.zsh # Debian
    elif [ -f /usr/share/zsh/plugins/zsh-syntax-highlighting/zsh-syntax-highlighting.zsh ]; then
        source /usr/share/zsh/plugins/zsh-syntax-highlighting/zsh-syntax-highlighting.zsh # Arch
    fi

Relevant packages to install: git zsh zsh-syntax-highlighting


WARNING: it downloads and executes Oh My Zsh automatically using git. You may want to review it first.

If it suits you:

Works on macOS, Arch, Debian, Ubuntu, Fedora, Termux and probably in most places anyway.

You may need this too:

    export TERM="xterm-256color"

For posterity (I actually needed to install this today):

You need to install zsh-autosuggestions by installing the package from your distribution (Debian, Arch) and source it, or just run:

    git clone https://github.com/zsh-users/zsh-autosuggestions ${ZSH_CUSTOM:-~/.oh-my-zsh/custom}/plugins/zsh-autosuggestions
in zsh and do exec zsh.

How is that not impressive for the vast majority of developers?

For the past couple decades, the only other even remotely mainstream place where you could get a comparable experience was a Lisp REPL. And maaaybe Matlab, later on. Recently, projects like R, Jupyter, and (AFAIK) Julia have been introducing people to interactive development, but those are specific to scientific computing. For general programming, this approach is pretty much unknown outside of Lisp and Unix shell worlds.

The author is an MS student in statistics. Seems that Unix is well-represented in STEM university fields.

Old-timey Unix (as opposed to things like Plan 9) won. When does widespread ’70s/’80s computing stop being impressive? You say “unknown” as if we were talking about some research software, or some old and largely forgotten software. Unix shell programming doesn’t have hipster cred.

> When does widespread ’70s/’80s computing stop being impressive?

When the majority adopts it, or at least knows about it.

> You say “unknown” as if we were talking about some research software, or some old and largely forgotten software. Unix shell programming doesn’t have hipster cred.

It's unknown to those that are only experienced in working in a GUI, which I believe is still the majority of developers. In my experience, those people are always impressed when they see me work on a screen filled with terminals, so it does seem to have some "hipster cred". :)

> When does widespread ’70s/’80s computing stop being impressive? You say “unknown” as if we were talking about some research software, or some old and largely forgotten software.

That's precisely what I'm talking about. The 70s/80s produced tons of insight into computer use in general, and programming in particular, that were mostly forgotten, and are slowly being rediscovered, or reinvented every couple years. Unix in fact was a step backwards in terms of capabilities exposed to users; it won because of economics.

Had Bell Labs been allowed to explore UNIX commercially, none of us would be having this discussion.

I'll toss a link to repl.it here.

It supports a large number of languages. I started using it while I was working through SICP. I've used the python and JS environments a little as well.

Smalltalk transcript, VB immediate window, Oberon (which inspired Plan 9), CamlLight, come to mind as my first experiences in such tooling.

The alternative is to write a program or script that does part of the job, run it (compiling if necessary), and see what happens. Then modify.

This loop is definitely slower than the shell or other REPL though.

It's much slower, and doesn't lend itself as well for building the program up from small, independently tested and refined pieces. The speed of that feedback loop really matters - the slower it is, the larger chunks you'll be writing before testing. I currently believe the popularity of TDD is primarily a symptom of not having a decent REPL (though REPL doesn't replace unit tests, especially in terms of regression testing).

BTW, there's another nice feature of Lisp-style interactive development - you're mutating a living program. You can change the data or define and redefine functions and classes as the program is executing them, without pausing the program. The other end of your REPL essentially becomes a small OS. This matters less when you're building a terminal utility, but it's useful for server and GUI software, and leads to wonders like this:


It's a different way of approaching programming, and I encourage everyone to try it out.

> leads to wonders like this

The engineer in me that learned about computers on a 286 with 4MB of RAM and a Hercules graphics card screams in shock and horror at the thought of letting a Cray-2's worth of computing power burn in the background. The hacker in me thinks the engineer in me should shut up and realize that live-editing shader programs is fun[1] and a great way to play with interesting math[2].

[1] http://glslsandbox.com/e#41482.0

[2] http://glslsandbox.com/e#52411.1

> The hacker in me thinks the engineer in me should shut up and realize that live-editing shader programs is fun[1] and a great way to play with interesting math[2].

Yeah, sure. My point is, I assume you're not impressed by shader technology here (i.e. it's not new), but the remaining parts are Lisp/Smalltalk 70s/80s stuff, just in the browser.

> I think the interactivity you describe might be a different thing from what your parent is talking about.

Actually no, they're not different things; both refer to the same activity of a user analyzing the information on the screen and issuing commands that refine the available information iteratively, in order to solve a problem. (I would have bought your argument had you made a distinction between "solving the problem" and "finding the right tools to solve the problem").

The thing is that the Unix shell is terribly coarse-grained in terms of what interactivity is allowed, so that the smaller refinement actions (what you call "input") must be described in terms of a formal programming language, instead of having interactive tools for those smaller trial-error steps.

There are some very limited forms of interactivity (command line history, keyboard accelerators, "man" and "-h" help), but the kind of direct manipulation that would allow the user to select commands and data iteratively is mostly absent from the Unix shell. Emacs is way better in that sense, except for the terrible discoverability of options (based on recall over recognition).

One of the dead ends of Unix UX is all the terse DSLs. I feel that terse languages like Vi’s command language [1] get confused with interactivity. It sure can be terse, but having dozens of tiny languages with little coherence is not interactive; it’s just confusing and error-prone.

One of these languages is the history expansion in Bash. At first I was taken by all the `!!^1` weirdness. But (of course) it’s better—and actually interactive—to use keybindings like `up` (previous command). Thankfully Fish had the good sense to not implement history expansion.

[1] I use Emacs+Evil so I like Vi(m) myself.

> select commands and data iteratively ... Emacs is way better in that sense

Bind up/down to history-search-backward/history-search-forward. In ~/.inputrc:

    # your terminal might send something else
    # for the up/down keys; check with ^v<key>
    # UP
    "\e[A": history-search-backward
    # DOWN
    "\e[B": history-search-forward
(note that this affects anything that uses readline, not just bash)

The defaults (previous-history/next-history) only step through history one item at a time. The history-search- commands step through only the history entries that match the prefix you have already typed. (i.e. typing "cp<UP>" gets the last "cp ..." command; continuing to press <UP> steps through all of the "cp ..." commands in ${HISTFILE}). As your history file grows, this ends up kind of like smex[1] (ido-mode for M-x that prefers recently and most frequently used commands).

For maximum effect, you might want to also significantly increase the size of the saved history:

    # no file size limit
    # runtime limit of commands in history. default is 500!
    # ignoredups to make the searching more efficient
    # (and make sure HISTFILE is set to something sane)
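The settings that went with those comments are not shown above. A minimal sketch of what such a block could look like in `~/.bashrc` follows. The variable names are bash's own, but every value here is an illustrative assumption, not the commenter's actual configuration.

```shell
# Illustrative values only; the original settings are not shown in the thread.
export HISTFILESIZE=-1                  # no file size limit (bash 4.3 or later)
export HISTSIZE=100000                  # runtime limit of commands in history
export HISTCONTROL=ignoredups           # skip a command identical to the previous one
export HISTFILE="$HOME/.bash_history"   # bash's usual default location
```

Note that `ignoredups` only drops a line that repeats the immediately preceding history entry. `erasedups` is the option that removes older duplicates as well.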

[1] https://github.com/nonsequitur/smex/

Terminals, shells and Emacs all suffer from the problem that you have to configure them out of their ancient defaults.

I also like setting my history so that it appends to the history file after each command, so that they don't get clobbered when you have two shells open:

  shopt -s histappend
  export PROMPT_COMMAND="history -a"

I’m a lone dev that works with moderate-size data and whatever UX solution you’re thinking of is slow or doesn’t exist.

Yes Bash is an untyped hell. But when I pipe 100 GB to stdout, my computer’s death wish is to show me the fucking data.

> mouse(!) selection of arbitrary text that can be piped to commands and so on


  xclip -out | ...
or do you mean something different?

Essentially selecting and operating on text in the same buffer. Saw it in Russ Cox’s demonstration of Acme.

The idea was taken from Oberon, which got inspired by Mesa/Cedar at Xerox PARC.

Basically any function/procedure that gets exported from a module can be used either on the REPL, or from the mouse, depending on its signature.

Quite powerful concept for those that like to take advantage of GUI based OSes.

Powershell is the only one that comes close to it. Maybe fish as well, but I have never used it.

One interactive feature I like about Bash though (or shell or whatever) is C-x C-e to edit the current line in a text editor (FC). Great when I know how to transform some text in the editor easily but not on the command line.

This is a reason why I like languages that have REPLs, and perhaps an advantage of dynamically typed, and especially functional languages that I feel is often not emphasized enough (judging by the recent discussions on the issue).

I love how I can just open up IEx (Elixir) and work my way through various functions in my app interactively. If something doesn't work as expected, I change the code and run the 'recompile' command. Then, once things work and eventually stabilize, I add typespecs for at least some of the benefits of an explicitly typed language. In some cases I might make (unit) tests part of the process, but that depends on the situation.

AWK makes most of that grep/cut/count-type stuff so much easier, 1-liners need something like 3 parts instead of 10.

(My grep doesn't work like in the video, but..)

  for f in *.rb;do awk '$1 ~ /class|module/ {print $2}' $f;done |
    awk '{a[$1]++} END{for (q in a) print a[q],q}' | sort -n
This seems to do what all the sed, cut, regrepping, wc etc do. AWK (in default mode, anyway) removes leading spaces, makes cutting and counting words easy. It took about 30 seconds to write, too.
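A small sketch of the counting idiom on made-up input, to show the two properties mentioned: default field splitting ignores leading whitespace, and an associative array does the tallying. The sample text and variable names are hypothetical.

```shell
# Hypothetical input; note the uneven leading whitespace.
sample='  class Foo
module Bar
    class Foo
def baz'

# awk splits on runs of blanks, so $1 is the keyword regardless of indentation.
counts=$(printf '%s\n' "$sample" |
  awk '$1 ~ /^(class|module)$/ {seen[$2]++} END {for (name in seen) print seen[name], name}' |
  sort -n)
echo "$counts"
```

This prints `1 Bar` followed by `2 Foo`. The final `sort -n` is needed because awk's `for (name in seen)` visits keys in an unspecified order.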

Every time I come up with some crazy bash construction someone shows me how I could have done it more elegantly with AWK and nothing else. <3


That's the tip of the iceberg. You can do anything in awk.

for a in *.c; do awk 'function lensort(a,zerp, x,tmp) {for (x = 1 ; x < zerp ; x++) {tmp = a[x]; if (length(a[x + 1]) < length(a[x])) {a[x] = a[x + 1]; a[x + 1] = tmp}}} /^(int|void|char|struct|long|float)/ {while (x < 2) {getline ; arr[$1] = (substr($0,1,1) == "{") ? "\n" : $0 "\n" ; x++}} END {lensort(arr,v); for (c in arr) {printf "%s\n",(length(arr[c]) > 0) ? arr[c] : ""}}' "${a}"; done

I have never been able to get my head around awk - how did you learn it?

Same as usual, read every book about it![0] and don't try to learn from the man page. I think the original book by A, W and K (the 2nd edition anyway) was best. Also Arnold Robbins' books Sed and Awk and Effective AWK Programming are great.

The GNU Awk User's Guide[1] is amazingly detailed, but the new AWK has bloated a lot, has 1000 features where the older one had a few dozen, and those core ones should be learnt first. (But people still like programming with it, and kept asking for features..) Hmm I don't think I use any of the new features, maybe I should...

But really, it's very simple; I can't think of anything that's been anything like as easy to learn, except maybe BASIC. With which AWK shares some similarities - the friendliness of it, the BEGIN and END..

[1] http://www.gnu.org/software/gawk/manual/gawk.html

[0] Or if there are too many, get names from lists of best books on a subject.

p.s. Bruce Barnett's AWK guide may be helpful, I've learned a lot from him on various UNIXy subjects.


The AWK programming language book is excellent!


You can find some discussion about it here: https://news.ycombinator.com/item?id=13451454

>This book was typeset in Times Roman and Courier by the authors, using an Autologic APS-5 phototypesetter and a DEC VAX 8550 running the 9th Edition of the UNIX operating system.


"The AWK programming language" has already been suggested and is great.

But awk is actually pretty simple. Imagine everything having an implicit loop around it, like so:

  # BEGIN{ ... } would go here
  for $0 in input.split('\n'):
      $1, $2, ... = $0.split()
      # if TEST {ACTION}; pairs ...
  # END { ... } would go here
so the first TEST in GP's awk would be `$1 ~ /class|module/` and the corresponding ACTION to take `{print $2}`.

If TEST is omitted it's True; if ACTION is omitted, it's print the whole line.
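The defaults described above can be tried directly. This is a minimal sketch on hypothetical three-line input, with one awk call per case. The variable names are only for illustration.

```shell
# Hypothetical three-line input.
input='class Foo
def bar
module Baz'

# TEST and ACTION both present: print field 2 of matching lines.
both=$(printf '%s\n' "$input" | awk '$1 ~ /class|module/ {print $2}')
# ACTION omitted: matching lines are printed whole.
no_action=$(printf '%s\n' "$input" | awk '$1 == "def"')
# TEST omitted: the action runs for every line.
no_test=$(printf '%s\n' "$input" | awk '{print $1}')
# BEGIN and END run once, before and after the implicit loop.
total=$(printf '%s\n' "$input" | awk 'BEGIN {n = 0} {n++} END {print n}')

printf '%s\n' "$both" "$no_action" "$no_test" "$total"
```

The first call prints `Foo` and `Baz`, the second prints `def bar`, the third prints the first word of every line, and the last prints the line count, 3.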

It's C. Learn C and you know awk. The only awk worth using is G(nu)awk anyway.

No, in that C doesn't have the pattern-action style of awk:

    /<pattern>/ { <action> }
"For every line which matches <pattern> perform <action> " is the fundamental flow of awk programs. It's procedural but the iteration is hidden; it's almost like each stanza is a callback which gets executed when the pattern fires. I'm kinda surprised no "real language" has adopted this design for an API.

This [0] is the most complete post I've read on the topic. It lays out all the relevant tools. Spending some time going through each tool's documentation/options pays off tremendously.

[0]: https://www.ibm.com/developerworks/aix/library/au-unixtext/i...

Wow, great find. Sad how hard it seems these days to come across an easy-to-follow primer on a topic without narrative fluff and/or ads everywhere. For those interested in a standalone copy there is a PDF of the content available here https://www.ibm.com/developerworks/aix/library/au-unixtext/a...

Writing clear tutorials is a fair amount of effort, more than I originally thought when I first did it.

I haven’t seen that one before and it looks pretty good. Interesting that it doesn’t mention sed or awk (edit: I'm wrong, it does mention sed & awk), let alone Perl. I would say that Perl’s so powerful for one liners and Unix text pipelines, I’d consider it required in a text processing reference.

Another one I like, and I think it’s mainly because of the philosophy contained in the title is “Ad hoc data analysis from the Unix command line” https://en.m.wikibooks.org/wiki/Ad_Hoc_Data_Analysis_From_Th...

Perhaps it's changed since you viewed it, but both sed and awk are described in the parent's link.

Oh, you're right, thanks! I doubt it changed, much more likely it was my mistake. ;) It was either operator error, or maybe the menu didn't display correctly while I was browsing using my iPad.

My main problem with this reference is that it encourages the use of non-portable utilities and flags. Double check against POSIX before writing any of this into your scripts:


See also: http://shellhaters.org/
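One illustration of the portability concern, taken from this thread rather than from the linked reference: the `\d` escape used in the `grep` pipeline quoted earlier is not part of POSIX basic or extended regular expressions, so whether it matches digits depends on the grep implementation. A sketch of a portable spelling, using made-up file names:

```shell
# Hypothetical file names.
names='0001_A.csv
notes.txt
0002_B.csv
12345_A.csv'

# [0-9] and the {4} interval are POSIX ERE, so this does not depend on \d support.
matches=$(printf '%s\n' "$names" | grep -E '^[0-9]{4}_A\.csv$')
echo "$matches"
```

Only `0001_A.csv` is printed. The anchors also reject `12345_A.csv`, which an unanchored four-digit pattern would accept.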

For actual text processing, I recommend a book.

* https://oreilly.com/openbook/utp/

Seems like a book specifically about text processing for the purpose of writing/formatting/typesetting documents.

Really nice link! Thanks fforflo.

The brilliant fun of working with the Unix CLI toolset is that there are millions of valid ways to solve a problem. I also thought of a “better” solution of my own that took an entirely different approach than most of the ones posted here. That’s not really the point.

What’s great about this article is that it follows the process of solving the problem step by step. I find that lots of programmers I work with struggle with CLI problem solving, which I find a little surprising. But I think it all depends on how you think about problems like this.

If you start from “how can I build a function to operate on this raw data?” or “what data structure would best express the relationship between these filenames?” then you will have a hard time. But if you think in terms of “how can I mutate this data to eliminate extraneous details?” and “what tools do I have handy that can solve problems on data like this given a bit of mungeing, and how can I accomplish that bit of mungeing?” and if you can accept taking several baby steps of small operations on every line of the full dataset rather than building and manipulating abstract logical structures, then you’re well on your way to making efficient use of this remarkable toolset to solve ad hoc problems like this one in minutes instead of hours.

If you bother to write a python script to parse the integers, why not use python to solve the whole problem?

This is one of the many reasons I think PowerShell did UNIX philosophy better: you don't need to parse text because the pipelines pass around typed objects. You can kinda almost get the same behavior from some UNIX commands by first having them dump everything into JSON and then having the other end parse the JSON for you, but you're still relying on a lot of text parsing. Personally I think it is high time the UNIX world put together a new toolset.

Take a look at osh (Object SHell): https://github.com/geophile/osh

It is a Python implementation of this idea: OS objects like files and processes are represented in Python. You construct pipelines as in UNIX, but passing Python objects instead of strings. E.g. to find the pids of /bin/bash commands:

    osh ps ^ select 'p: p.commandline.startswith("/bin/bash")' ^ f 'p: p.pid' $
- osh: Runs the tool, interpreting the rest of the line as an osh command.

- ^: Piping syntax.

- ps: Generate a stream of process objects, (the currently running processes).

- select: Select those processes, p, whose commandline starts with /bin/bash.

- f: Apply the function to the input stream and write function output to the output stream, (so, basically "map"). The function computes the pid of an input process.

- $: Print each item received in the input stream.

Osh also does database access (query results -> python tuples) and remote access. E.g., to get a process listing of (pid, commandline) on every node in a cluster:

    osh @clustername [ ps ^ f 'p: (p.pid, p.commandline)' ] $

That said, that might not be the best example, because that same first command with Unix utilities is just:

    pgrep -f /bin/bash

OK, how about this:

    osh timer 1 ^ \
        f 't: (t, processes())' ^ \
        expand 1 ^ \
        f 't, p: (p.pid, int(t), p.commandline)' ^ \
        sql "insert into process values(%s, %s, %s)"
- timer 1: Generate a timestamp every second.

- f 't: (t, processes())': Take the timestamp as input and generate a timestamp and sequence of Process objects.

- expand 1: Expand the sequence of Processes, so that there is one per output tuple, e.g. (123, (p1, p2)) -> (123, p1), (123, p2).

- f 't, p: (p.pid, int(t), p.commandline)': Take the timestamp and Process as input, and generate (pid, timestamp as int, command line) as output.

- sql "insert into process values(%s, %s, %s)": Take the triples from the previous command and dump them into a table in a database.
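
The expand step is probably the least obvious part, so here is a minimal Python sketch of what it does to a stream of tuples, assuming the semantics described in the bullet above. The `expand` function here is my own illustration, not osh's code, and strings stand in for Process objects.

```python
from typing import Iterable, Iterator, Tuple


def expand(rows: Iterable[Tuple], position: int) -> Iterator[Tuple]:
    """Yield one output tuple per element of the sequence at `position`."""
    for row in rows:
        for element in row[position]:
            # Rebuild the tuple with the sequence replaced by one element.
            yield row[:position] + (element,) + row[position + 1:]


# (timestamp, sequence of processes); strings stand in for Process objects.
rows = [(123, ("p1", "p2")), (124, ("p3",))]
flat = list(expand(rows, 1))
print(flat)
```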

Indeed, in PowerShell you could do:

    1..500 | ?{ !(Test-Path ('{0:0000}_A.csv' -f $_)) }

1..500 generates a sequence of numbers 1 through 500

| pipes the numbers

?{ … } is a filter that is evaluated for each item (number)

! negates the following expression

Test-Path tests that a file exists

-f formats the string left of -f with the parameters (zero-based) to the right of -f

'{0:0000}_A.csv' is a pattern which formats parameter 0 as 4 digits, zero-padded.

EDIT: Explanation

Great example. PowerShell is definitely underrated. I don’t care to use it as an interactive shell, and that likely turns many off to it, but now that it’s cross platform I think it’s massively underrated for these sorts of tasks.

I don't think that example really shows the benefits of the "Powershell way". There's hardly any need for objects and what-not when solving an easy problem like this; strings will do just fine. With Bash, a close equivalent would be:

  for f in {0001..0500}_A.csv; do test -e $f || echo $f; done

Why replace your hammers, screwdrivers, and chisels just because someone invented a 3D printer? They have tradeoffs. Powershell has some good ideas, and benefits from having been invented altogether, rather than evolving over four decades. But in practice it's not as efficient for doing simple things. It's oriented towards much more complex data structures, which is great... but there's no need to throw out your simpler tools just because you think they look ugly.

For sure, if you are used to bash, use it, because you will be productive.

But don't say PowerShell is not practical for simple things. It's very practical for simple and complex things. And I am productive with it.

They're full of footguns, esoteric behavior, and have arcane names. They're actually pretty awful tools now that it isn't the 70s anymore.

Yet when I visit the unix sysadmins' office I see people chaining commands to administer hundreds of boxes. On the windows side I rarely see powershell prompts. Powershell looks so much better in theory. However it's just an okayish scripting language with a good REPL. Unix tools are a far better daily driver.

That's probably because for daily tasks we have much better tooling in Windows already that doesn't require us to use the command line and interactively construct it. I can easily administer the configurations of thousands of computers through AD, for instance, and while I could use PowerShell to do so, using ADUC is just easier most of the time.

If you do a lot of work with Exchange though, you'll probably end up using PowerShell much more, because the web UI for it is not so great.

No matter what you think of the specific implementation, a lot of PowerShell's ideas are good ideas. Unfortunately UNIX culture is such that they'll probably never implement any of them.

> That's probably because for daily tasks we have much better tooling in Windows

ssh, docker, ansible, kubernetes, grafana, prometheus, etc... All coming from Linux/unix. This statement is clueless. Most of the cloud is not running microsoft, and for a good reason.

To automate, we have python, which has a much better syntax. It's pointless to use powershell.

And it takes a microsoft head, without knowledge of programming language's history to say that powershell's ideas actually come from powershell. Method chaining/fluent interface with a pipe instead of a dot does not look that new.

Also, some attempts have been made to implement posh clones on unix. Being redundant with either perl/python or bash/zsh, none succeeded.

PowerShell sort of cheats, which enables the nice object pipelines; all cmdlets are .net modules that are run within the same runtime. That makes PowerShell much closer to "normal" programming languages with repls than traditional shells. That is also why PowerShell model is not directly a good fit to the UNIX world.

I would like to see more work done in the realm of object shells (and have some ideas myself), especially around designs that meld in more of the UNIXy way of having independent communicating processes. But it is a difficult problem domain, and many approaches would involve rewriting a lot of the base system we take for granted, which is just a huge amount of work.

PowerShell had the benefit of having stuff like WMI, COM, the whole .NET, and of course all the resources, funding and marketing from MS. Even then it has seemed to have been an uphill struggle, despite there being far more a need for PS in the Windows world.

Fair point! I'd argue there's a difference in time spent in writing a python script to solve it all, and just parsing the ints as I did. Python was my first thought for how to parse the ints.

Great article. I didn't know about comm actually, so that was new for me!

With regard to the 'ints' it helps to not think about them as ints but rather just some text that follows the pattern ^[0][1-9] or some equivalent to that. "Starts with any number of zeros or no zeros followed by any number of numbers between 1 to 9 or none at all."

So long as you know the repeatable pattern you can always use sed to just replace the part of the pattern you want gone with nothing which effectively deletes it from the output. sed is like a Swiss army knife in that regard because you can do nice simple deletions like that and even iterate on them if you need or you can do quite complicated capture groups if you need to as well. Sed can get you unbelievably far in terms of shaping text in a stream.
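
If it helps to see the "replace the pattern with nothing" idea outside of sed, here is a small Python sketch of the same substitution. The function names are mine; the second variant is an assumption about what you'd want if a line can be all zeros, since the plain version turns `0000` into an empty string.

```python
import re


def strip_leading_zeros(text: str) -> str:
    """Mimic sed 's/^0*//' on every line of text."""
    return re.sub(r"^0*", "", text, flags=re.MULTILINE)


def strip_leading_zeros_keep_zero(text: str) -> str:
    """Variant that only strips zeros followed by another digit,
    so a line of '0000' becomes '0' instead of disappearing."""
    return re.sub(r"^0+(?=\d)", "", text, flags=re.MULTILINE)


print(strip_leading_zeros("0001\n0010\n0002"))
print(strip_leading_zeros_keep_zero("0000\n0010"))
```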

I have a few tricks I've learned with various tools that I thought were worth writing down. Hopefully you can find some more useful stuff.


Your regex got munged, but shouldn't it be:

to avoid matching empty lines?

Oh I see. Yes the output got a bit messed up.

You could use + if you want. I only do it if not doing will produce incorrect output for my given input, which happens occasionally but is rare.

Bear in mind in this case there is also _A.csv or _B.csv you need to account for as well which your version doesn't. Mine will still pick it up because I'm not specifying the end, as I'm assuming I've done the necessary preprocessing steps to get good data so it will produce the expected output.

I'm not usually fussed on being that strict when I write regex as I tend to do various preprocessing steps like grep -v ^$ to filter out any blank lines if I don't want them in the stream etc.

Ya, I dropped in an unwanted $.

You can use numfmt to parse the number:

    $ seq -w 0001 0005|numfmt 
Or just use plain sed:

    $ seq -w 0001 0005|sed 's/^0*//'

yeah that was super weird. Why write an article about the merits of shell tools if you put some python in the mix...

Python basically acts like a subshell with its own language... I don't see why anyone would think unix shell scripts are really that different from Python scripts, especially if you're not doing the subprocess control things that command shells are optimal for. Invoking python to do something doesn't seem any more awkward to me than invoking sed, awk, etc..

Python has no bindings to the underlying OS (unless you import os modules, and even then it has its own way of working), and does not interact directly with environment variables either, and as such can not be seriously considered like a "shell" tool at the same level as the other ones. It just was never designed to act in this way. And in the embedded world you don't want to install a python interpreter if you can do without.

If you're just sorting numbers, you don't need the OS or environment variables.

Removing leading zeroes doesn't require Python. One easy solution would be sed:

    $ echo -e '0001\n0010\n0002' | sed 's/^0*//'

Yeah plus seq can generate sequences with leading zeroes (something like seq -f %04.f 1 20).

So instead of scripting, he could have generated a sorted list of numbers from the files he had. Created a file with the sequence of numbers for the range and diffed/commed the whole thing. Voilà...

The seq provided in the GNU toolset has a -w flag to turn on "equal width" mode, so one can also get zero padded numbers out (from GNU seq) by turning on that mode and zero padding the input:

    $ seq -w 0001 0003

Nice, this works automatically like this:

    $ seq -w 98 102

  seq -f %04g 0 3
  echo {0000..0003} # if bash

Did not know that. Thanks! That's much easier.

Yes. Or if you're going to whip out Python, might as well make it all in Python.

Very much my thinking, especially when there are no significant commands you need to shell out to:

    import sys, pathlib
    basedir = pathlib.Path(sys.argv[1])
    for i in range(1, 501):
        if not (basedir / f'{i:04}_A.csv').is_file():
            print(i)  # report the number whose _A.csv file is missing

Just for fun some bash:

    for k in $(seq -f %04.f 1 501); do 
        f="${k}_A.csv"   # file name to test for this number
        ! [[ -f "$f" ]] && echo $f || :
    done
or more succinctly,

    for f in {0001..0501}_A.csv; do
        ! [[ -f "$f" ]] && echo $f || :
    done
and if you have GNU parallel installed:

    parallel -kj1 '! [[ -f "{}" ]] && echo {} || :' ::: {0001..0501}_A.csv

Nice one!

    $ echo -e '0001\n0010\n0002\n0' | bc

This is better relative to the `sed` solution as it handles the `0000` case well.

If one of the items is actually zero, this would delete it entirely, which probably isn't the desired result.

Yeah, no need for Python. Even the following seems to work fine:

$ printf "%d\n" 003


Well, that works up to a point. For some, that point might be considered a bit too close to zero...

    $ printf "%d\n" 009
    sh: 1: printf: 009: not completely converted

0-prefix octal notation continues to be a mistake.
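
For comparison, a quick Python sketch of the same trap. Parsing the string with an explicit base 10 gives the value people expect from zero-padded file names, while asking for base 8 reproduces the surprise. `parse_padded` is just a name I picked for the example.

```python
def parse_padded(text: str) -> int:
    """Parse a zero-padded decimal string such as '009' as base 10."""
    return int(text, 10)


print(parse_padded("009"))  # decimal reading of a padded number
print(parse_padded("010"))  # still decimal: ten
print(int("010", 8))        # explicit octal reading of the same digits: eight
```

Python 3 also refuses a bare `010` literal in source code and requires the `0o` prefix for octal, which avoids the ambiguity.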

I love when programs mysteriously fail when passed an IP address like

Thankfully modern tools have been moving away from supporting that notation and are less likely to explode all over unsuspecting users.

Yes, you are right. This seems to work better:

$ printf "%1.0f\n" 009


$ printf "%1.0f\n" 00125990


  $ printf %d 010

Given the situation in the article, you might as well do this:

  ls ????_A.csv | grep -o '[1-9][0-9]*'

Oh nice. sed still scares me a bit.

Definitely check out perl one liner patterns. Perl is less scary and more powerful and usually almost as short as sed commands. Perl can often replace pipelines that use both sed and awk. Perl one liners need judicious flags though, common patterns look like perl -ne, perl -pie, perl -lane ... these do very different things. Once you know them, it’s like a minor superpower.

I’m not terribly experienced with Unix tools but I reckon that it might be best to just use Perl instead. Then you just have to worry about PCRE instead of PCRE in addition to old-style regexps.

Then again, Perl is even scarier.

perl one-liners are pretty powerful and can replace awk/sed/cut/tr/etc. That being said, then you have to remember which command-line options you should give perl (was it -lane to do X, or was it -pie, or something else entirely?).

But yeah, the perl rabbit hole goes as deep as you want to, and in some sense it makes it more difficult to say "screw it, I'm redoing this in a real language" as complexity rises. And then you end up with a thousand lines of line noise.

Well Perl is a real language, so of course it makes it more difficult to say “screw it, I'm redoing this in a real language”.

That said the Perl that you write in a one-liner should be different than the Perl that you write when the complexity rises.

If you have more than a few hundred lines, you probably need to move some of it out into a module. There should also be tests for that module. It might even make sense to structure it like a CPAN module so that you can use the existing tools to test and install it.

I have to admit the last time I wrote anything larger than short scripts in perl was in the perl 4 period (on HP-UX to boot). I eagerly awaited the perl 5 version of the Camel book, but in the end I jumped ship to python before getting seriously into perl 5.

I haven’t coded Perl beyond one-liners. What attracts me to it are its regular expressions. So much of Unix scripting seems to involve regexps. So I figure Perl+utilities is the better option compared to utilities+Bash.

I wouldn’t wanna use it for more than short scripts. Perl 6 might be fun, but it doesn’t seem to have a large enough community.

If you want to improve your coding skills you should read “Higher Order Perl”. (made available for free online by the author)

If you want to improve your Perl code read “Modern Perl”. (There is more than one version, and I know the first version was made freely available online)

Perl is a better language for large codebases than most people give it credit for. That said it allows more creativity when it comes to your code. So you can make awful code just as easily as beautiful code. Perl6 makes the beautiful code easier to write and shorter, while making awful code a bit harder to write.

> There is more than one version, and I know the first version was made freely available online

Good news! Every version is freely available online. Here's the most recent:


It’s useful but highly cryptic in my usage.

A change in structure might be helpful:

    $ ls data
    0001.csv 0002.csv 0003.csv 0004.csv ...
    $ ls algorithm_a
    0001.csv 0002.csv 0004.csv ...
    $ diff -q algorithm_a data |grep ^Only |sed 's/.*: //g'
    0003.csv ...

Excellent point, haha!

For learning to get things done with Unix, I recommend the two old books "Unix Programming Environment" and "The AWK Programming Language". There are many resources to learn the various commands etc., but there is still no better place than those books to learn the "unix philosophy". This series is also good:


I think the best part about using Unix tools is it forces you to break down the problem into tiny steps.

You can see feedback every step of the way by removing and adding back new piped commands so you're never really dealing with more than 1 operation at a time which makes debugging and making progress a lot easier than trying to fit everything together at once.

It's basically functional programming. I find that my approach to writing code is very similar to how I work with the shell. The main difference, I guess, is that the command 'units' are slightly bigger, in the form of functions, but the way I iterate my solution to a problem is basically the same.

The problem with this is that there isn't a standard format forced on the args that follow the command name "cut".

What makes it worse is that there are seemingly patterns of standard format that get violated by other patterns. It's often based on when the utility was first authored and whatever ideas were floating around during the time. So sometimes characters can "clump" together behind a flag, under the assumption that multi-character flags will get two hyphens. Then some utilities or programs use a single hyphen for multi-character flags. Plus many other inconsistencies-- if I learn the basic range syntax for cut do I know the basic range syntax for imagemagick?

Those inconsistencies don't technically conflict since each only exists in the context of a particular utility. But it's a real pain to sanity to see those inconsistencies sitting on either side of a pipe, especially when one of them is wrong. (Or even when it's a single command you need but you use the wrong flag syntax.) That all adds to the cognitive load and can easily make a dev tired before it's time to go to sleep.

Oh, and that language switch from bash to python is a huge risk. If you're scripting with Python on a daily basis it probably doesn't seem like it. But for someone reading along, that language boundary is huge. Because the user is no longer limited to runtime errors and finicky arg formatting errors, but also language errors. If the command line barfs up an exception or syntax error at that boundary I'd bet most users would just give up and quit reading the rest of the blog.

Edit: clarification

Learning the idiosyncrasies of the tools involved is one of the tradeoffs. But there's no getting around it. These tools have been around for far too long to change them all in some misguided attempt at consistency--the semantics of most tools are so different, it wouldn't even make sense to try to enforce some consistency anyway.

You don't have to know every flag for every tool. You don't need to know if you can glob args together in a certain tool. These are different tools developed across decades by different people for different purposes. The fact that you can glue them all together on an ad-hoc basis is magical!

You learn by learning how to do one thing at a time--cutting characters 10-20, or grepping for a regex, or summing with awk, or replacing strings with sed, or translating characters with tr--and adding it to your mental toolbox. It's okay to have a syntax error because man is there and you can easily iterate the command to make it do what you want.

You aren't writing a program to stand the test of time. You're solving a problem in the moment!

It's true that this can be a pain; but this flexibility is also bash's greatest feature: a bash script can make use of almost any other program, regardless of the particular idioms that that program's author was partial to. This is exactly why bash has been so successful for so long and likely will continue to be so for a very long time.

Any attempts to tighten this down would raise the barrier for entry and therefore reduce the ecosystem that bash can operate in.

Also, it's a bit of a false dichotomy. Any other language is also susceptible to these sorts of inconsistencies. For example: Do I specify a range as [min, max] or as two separate parameters? Is it inclusive or exclusive? etc. At some point all programming interfaces come down to conventions, and if your language only supports one then you'll only be able to interop with the subset of the broader community that agrees with you.

> Oh, and that language switch from bash to python is a huge risk.

I was thinking about that and I came up with

  sed 's/^0*//'
as an alternative to the Python program. Another option that works for the same purpose is

  xargs -n1 expr 0 +
Edit: There's an earlier subthread with several options for this: https://news.ycombinator.com/item?id=19160875

I've often done this, usually not for a large dataset, but it's sometimes helpful to pipe text through Unix commands in Emacs. C-u M-| sort, for instance, will run the selection through sort and replace it in place.

If you're going the all python route, and even want to be able to run bash commands, and want something where you can feed the output into the input, I'd strongly recommend jupyter. (If you want to stay in a terminal, ipython is part of jupyter and heavily upgrades the built-in REPL and does 90% of what I'm mentioning here.)

You can break out each step into its own cell, save variables (though cell 5 will be auto-saved as a variable named _5) but the nicest thing is you can move cells around (check the keyboard shortcuts) and restart the entire kernel and rerun all your operations, essentially what you're getting with a long pipeline, only spread out over parts. And there are shortcuts like func? to pop up help on a function or func?? to see the source.

It's got some dependencies, so I'd recommend running it in a virtualenv via pipenv:

    pipenv install jupyter  # setup new virtualenv and add package
    pipenv run jupyter notebook
    pipenv --rm  # Blow away the virtualenv
Also, look into pandas if you want to slurp a CSV and query it.

I doubt you'll find many Emacs users that would prefer "C-u M-| sort" over "M-x sort-lines".

$ join -v 2 <(ls | grep _A | sort | cut -c-4) <(ls | grep -v _A | sort | cut -c-4)

The shortest one I could come up with, no need to use python.

`join -v 2` shows the entries in the second sorted stream that don't have a match in the first sorted stream, the rest is self-explanatory I hope.

Edit: $ join -v2 -t_ -j1 <(ls | grep _A | sort ) <(ls | grep -v _A | sort)

Is even shorter, it takes first field (-j1) where fields are separated by '_' (-t_)
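
In case the `join -v 2` semantics aren't obvious, here is roughly the same anti-join as a Python sketch: keep the keys from the second list that never show up in the first. The helper name and the sample file names are made up for the example.

```python
def unmatched_in_second(first, second):
    """Return the keys from `second` with no match in `first`, sorted."""
    return sorted(set(second) - set(first))


# Made-up listing in the article's naming scheme.
names = [
    "0001_A.csv", "0001_data.csv",
    "0002_data.csv",
    "0003_A.csv", "0003_data.csv",
]
with_a = [n[:4] for n in names if "_A" in n]         # like: grep _A | cut -c-4
without_a = [n[:4] for n in names if "_A" not in n]  # like: grep -v _A | cut -c-4
print(unmatched_in_second(with_a, without_a))
```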

Slightly shorter:

    ls -v|cut -d_ -f1|uniq -c|awk '$1<2{print $2}'
Tested by creating 500 sets of dual files and removing 10 `_A` randomly.

    for i in $(seq 1 500); do j=$(printf %04d $i); touch ${j}_data.csv; touch ${j}_A.csv; done
    for i in $(seq 1 10); do q=$((RANDOM % 500)); r=$(printf %04d $q); rm -v ${r}_A.csv; done
    removed '0438_A.csv'
    removed '0327_A.csv'
    removed '0150_A.csv'
    removed '0173_A.csv'
    removed '0460_A.csv'
    removed '0194_A.csv'
    removed '0073_A.csv'
    removed '0293_A.csv'
    removed '0404_A.csv'
    removed '0153_A.csv'
And then using the code above to verify the missing files


Elegant and short! But unless I'm missing something, your script will print even the datasets that have _A but not the corresponding _data?

A fair point but I was working within the context of the original problem which seemed to be "there's always a _data but not always an _A". If I was trying to provide a robust generic solution, I wouldn't be code golfing it...

You could use uniq -u to avoid the awk.

    ls|cut -d_ -f1|uniq -u
You win.

I like this solution - I'm not very used to using "cut" - or more generally to map from "files" to "fields/lines in a text stream".

I'm more inclined to ask:

given a list of files with this name, does a file of a different name exist on the file system?

But the more Unix approach is really:

how can I model my data as a text stream, and how can I then pose/answer my question?

(here: list all filenames in folder in sorted order - cut away the text indicating type - then count/display the non-repeat/single entries)
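
Spelled out as a Python sketch, that text-stream model looks something like this: sort the names, cut at the first underscore, then keep only the prefixes that are not repeated on an adjacent line (which is what `uniq -u` does). The function name and sample names are assumptions for illustration.

```python
from itertools import groupby


def prefixes_seen_once(filenames):
    """Sort, keep the part before the first '_', and return prefixes
    that occur exactly once in a run (like: sort | cut -d_ -f1 | uniq -u)."""
    prefixes = [name.split("_", 1)[0] for name in sorted(filenames)]
    return [key for key, group in groupby(prefixes) if len(list(group)) == 1]


names = [
    "0001_A.csv", "0001_data.csv",
    "0002_data.csv",
    "0003_A.csv", "0003_data.csv",
]
print(prefixes_seen_once(names))
```

Like the shell version, this reports any prefix with a single file, whichever of the two files that happens to be.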

My solution would probably be more like (with bash in mind, most of this could be "expanded" to fork out to more utils, like "basename -s" etc) :

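  for data_file in *_data.csv; do
    # Derive the matching algorithm file name from the data file name.
    alg_file="${data_file%_data.csv}_A.csv"
    if [[ ! -f "${alg_file}" ]]; then
      echo "Missing alg file: ${alg_file} for data file: ${data_file}"
    fi
  done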
Ed: this is essentially the same solution as:


Although more verbose. I think I prefer omitting the explicit if, though - and just using "test" and "or" ("[[", "||" ).

This can be shortened further to

  ls -1 | uniq -u -w4
using GNU uniq, for these special filenames. Unfortunately, POSIX does not define a -w option for uniq.

Perfect, ta.

Nice golf!

This was a nice read and a good introduction to text processing with unix commands.

I agree with the other user re python usage - that you may as well use it for the whole task if you're going to use it at all - but I don't think it's a major flaw. It worked for you right? I would suggest naming the python file a bit more descriptively though.

Interesting to read the other suggestions about dealing with this without python.

Thanks! Glad to hear!

My favourite one is 'pkill -9 java'. Fixes my laptop if it starts lagging.

Does that kill electron instances too? ;)

It's always good sport to kill java. Warms my heart every time.

I thought this was a neat demo of building up a command with UNIX tools. The python inclusion was a bit odd, yes.

I learned about sys.stdin in Python and cutting characters using the -c flag


After moving back to working on a Windows machine the last several years and being “forced” into using PowerShell, I now find myself using it for these sorts of tasks on Linux.

I now use PowerShell for any tasks of equal or greater complexity than the article. It’s such a massive upgrade over struggling to recall the peculiar bash syntax every time and the benefits of piping typed objects around are vast.

As a nice bonus, all of my PowerShell scripts run cross-platform without issue.

I've dabbled in PowerShell before, but I've always found the objects you get from cmdlets to be so much more opaque than the plain text you get from Unix output, which makes it harder to use the iterative approach to development the article and other commenters describe. Do you have any tips for poking around in PowerShell objects / a workflow that works for you?

I’ve tried to love it while using it as an interactive shell, but it’s hard for me to lose the Unix muscle memory and remember their verbose commands.

For anything more than a single pipe, or anything that requires loops or control flow, I switch to Powershell in Visual Studio Code with the PowerShell extension which has intellisense and helps to poke around the methods on each object. From there you can select subsets of your script and run with F8 which helps me prototype with quick feedback.

Use `gm` (alias for Get-Member)

e.g like

    ps | gm
Will tell you exactly the different object types and member methods and properties are returned from `ps`.

    ls | gm
Will tell you that ls returns two different object types (directories and files).

All the pipes and non-builtin commands (especially python!) look like overkill to me, I must say.

    for set in *_data.csv ; do
        num=${set%_data.csv} ; success=${num}_A.csv
        if [ ! -e "$success" ] ; then echo "$num" ; fi
    done
ETA: likely specific to bash, since I have no experience with other shells except for dalliances with ksh and csh in the mid-90s.

Yup, I'd probably have gone with a `for` loop also. A bit shorter:

  for set in *_data.csv; do
    [[ -f "${set/data/A}" ]] || echo "${set%_data.csv}"
  done
Edit: though I just write it out like this for formatting on HN. In real life, that would just be a one-liner:

for set in *_data.csv; do [[ -f "${set/data/A}" ]] || echo "${set%_data.csv}"; done

Just because I like GNU parallel:

    parallel -kj1 'f="{}"; [[ -f "${f/data/A}" ]] || echo $f' ::: *_data.csv

I usually do text processing in Bash, Notepad++ and Excel. Each has its own pros and cons, that's why I usually combine them.

Here you have the tools I use in Bash:

grep, tail, head, cat, cut, less, awk, sed, sort, uniq, wc, xargs, watch ...

As an aside I once found out you can replace 'sort | uniq' entirely with an obscure awk command so long as you don't require the output to be sorted. Iirc it performs twice as fast.

  cat file.txt | awk '!x[$0]++'

The awk command prints the first occurrence of each line in the order they are found in the file. I can imagine that sometimes that might be even better than sorted order.
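
The awk idiom is terse, so here is what it amounts to as a Python sketch: remember every line already seen and emit a line only the first time it appears. The function name is mine.

```python
def first_occurrences(lines):
    """Yield each distinct line once, in order of first appearance
    (the same effect as awk '!x[$0]++')."""
    seen = set()
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line


print(list(first_occurrences(["b", "a", "b", "c", "a"])))
```

No sorting happens anywhere, which is why the original order survives.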

sort has a -u option on my linux...

    -u, --unique
           with -c, check for strict ordering; without -c, output only the first of an equal run

If you're already in Windows land, you should consider leveraging PowerShell instead of bash. Pretty much all the same tooling is there, only with more descriptive names, tab completion on everything, passes typed object data instead of text parsing, etc.

Ahem... what is powershell core? (I take exception to your if condition). As someone on Arch- I enjoy it a lot.

Bash with Notepad++ and Excel? Do you use Wine or WSL?

It's kind of mandatory to use Windows in certain envs.

If you are using python in your pipeline, might as well go all in!

  from pathlib import Path

  all_possible_filenames = {f'{i:04}_A.csv' for i in range(1, 501)}  # 0001_A.csv .. 0500_A.csv

  cur_dir_filenames = {p.name for p in Path('.').iterdir()}  # names as strings, not Path objects

  missing_filenames = all_possible_filenames - cur_dir_filenames

  print(*missing_filenames, sep='\n')

The article solves the problem: for which numbers x between 1 and 500 is there no file x_A.csv? It looks like in this case it is equivalent to the easier problem: for which x_data.csv is there no corresponding x_A.csv?

    cd dataset-directory
    comm -23 <(ls *_data.csv | sed s/data/A/) <(ls *_A.csv)

This will fail for any filenames that contain newlines

Correct. It is intended for the filenames in the article. More generally, I try to write all my shell code to silently produce hard to track down errors when a filename contains newlines, in order to punish me for my carelessness if I ever accidentally create such a filename.

I got paid $175/hr as a data analyst contractor to basically run bash, grep, sed, awk, perl. The people that hired me weren't dumb, just non-programmers and became giddy as I explained regular expressions. The gig only lasted 3 months, but I taught myself out of a job: once they got the gist of it they didn't need me. Yay?

Nicely done using Unix utils. You can have a pure sed solution (save the `ls` invocation) that is much simpler, albeit obscure, that hinges on the fact that every number has a `data.csv` file.

Given a sorted list of these files (through `ls` or otherwise) the following sed code will print out the data files for which A did not succeed on them.


This works on the fact that there exists a data file for all successful and unsuccessful runs on data, so sed simply prints the files for which there does not exist an `A` counterpart.

If you want to only print out the numbers, you can add a substitution or two towards the end.

Edit: fixed the sed program

Actually the following is even shorter

So all together this gives the following

  ls|sed '/A/{N;d;}'

given the limited scope of files in the directory... not sure why it was necessary to use grep, instead of the built in glob?

  ls dataset-directory | egrep '\d\d\d\d_A.csv'
which FWIW wouldn't even work, on multiple levels: you need -1 on ls and no files end with A.csv


  ls -1 dataset-directory/*_A?.csv
ref: http://man7.org/linux/man-pages/man7/glob.7.html

Update: apologies, apparently my client cached an older version of this page. at that time the files were named A1.csv and A2.csv

Some ls man pages state the following about the -1 option: "This is the default when the output is not directed to a terminal."

I've never needed to use -1 when piping ls's output to another command.

> I am starting to realize that the Unix command-line toolbox can fix absolutely any problem related to text wrangling.

Am I the only one who thought, "No shit, Sherlock"? This is a fundamental of UNIX that many people don't seem to grasp.

Everybody realises this at some point. Nobody ever thought "I can use this for anything" when they first saw a shell. It takes time.

He’s an MS student. He’s just documenting and sharing his journey. As blogs do.


Use F# with a TypeProvider. Of course, I imagine it would take some work learning F# but once you learn it the sky is the limit in what you can do with this data.

If you don't mind "cd dataset-directory" beforehand, a shorter and possibly more correct version would be:

  comm -1 -3 <(ls *_A.csv | sed 's/_.*$//') <(seq -w 0500) | sed 's/^0*//'
The OP's solution doesn't seem correct because of the different ordering of the two inputs of `comm': lexicographical (ls) and numeric (seq).

Although -w is supported by both GNU and BSD versions of `seq', BSD's ignores leading zeros in input. Thus a more portable approach is:

  comm -1 -3 <(ls *_A.csv | sed 's/_.*$//') <(seq -f %04.f 500) | sed 's/^0*//'
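
The ordering issue is easy to see in a few lines of Python. This is only an illustration of lexicographic versus numeric order with made-up numbers, not a reproduction of the article's pipeline.

```python
values = (1, 2, 10, 100)

unpadded = [str(n) for n in values]
lexicographic = sorted(unpadded)       # how ls or sort would order the text
numeric = sorted(unpadded, key=int)    # how seq counts
print(lexicographic)
print(numeric)

# With zero padding to a fixed width, both orders agree again.
padded = [f"{n:04d}" for n in values]
print(sorted(padded) == sorted(padded, key=int))
```

Since comm expects both inputs in the same sort order, comparing on the zero-padded strings and stripping the zeros afterwards sidesteps the mismatch.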

Easier would be just use 'cat list_of_numbers | sort | uniq -u' to get the unique entries.

Shorter still:

    sort -u < list_of_numbers

And if you're using cat because it keeps the filename out of the way when editing the pipeline, then just put the redirect before the command instead, so instead of e.g.

  cat file | grep pattern | sort -u
you can write

  < file grep pattern | sort -u
and the filename is out of the way compared to

  grep pattern file | sort -u


Now I'll wait for someone to post a link to the "UUOC award".

This is not the same. For sequence [5,5,4,3,3,2,1,1] "sort -u" returns [1,2,3,4,5], while "sort | uniq -u" returns [2,4].
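A minimal sketch that reproduces the difference on exactly that sequence (the variable names are only for illustration):

```shell
# The sequence from the comment above, one number per line.
nums=$(printf '%s\n' 5 5 4 3 3 2 1 1)

# sort -u keeps one copy of every distinct line.
distinct=$(printf '%s\n' "$nums" | sort -u)

# uniq -u keeps only the lines that occur exactly once in the sorted input.
singletons=$(printf '%s\n' "$nums" | sort | uniq -u)

echo "sort -u:        " $distinct
echo "sort | uniq -u: " $singletons
```

For the missing-file problem it is the second behaviour you want: an ID that shows up only once is an ID whose partner file is missing.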

Huh, I didn't know that! Thanks.

Useless use of seq spotted. Seq does not exist on many systems. Bash has {0001..0500} instead.

Nice writeup though.

Well, to be fair, bash does not exist on many systems either.

For example I have used dragonflybsd and freebsd today and they both had "seq" but no "bash".

They have jot(1)


I learnt a lot from the book Data Science at the Command Line, now free and online at https://www.datascienceatthecommandline.com/

Set operations are very useful. Here's a summary:


Not the most efficient solution but this is what springs to mind for me:

    seq 1000 | xargs printf '%04d_A.csv\n' | while read -r f; do test -f $f || echo $f; done

Instead of writing a Python script to convert the numbers to integers, you can use awk: "python3 parse.py" becomes "awk '{printf "%d\n", $0}'"

Not sure I understand why it needs to even know there's numbers in the filename.

The problem seems to boil down to:

"Find all files with the pattern '[something]_data.csv' and report if '[something]_A.csv' doesn't exist"

Unless I'm missing something, all the sorting and sequence generation isn't adding anything.
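A rough sketch of that reading of the problem, assuming the two name patterns quoted above are the only ones that matter; the function name report_missing is made up here:

```shell
# report_missing DIR: for every DIR/<id>_data.csv, print <id> when
# DIR/<id>_A.csv does not exist. No sorting or sequence generation involved.
report_missing() {
    dir=$1
    for data in "$dir"/*_data.csv; do
        [ -e "$data" ] || continue      # the glob matched nothing
        id=${data%_data.csv}            # strip the suffix, keeps DIR/<id>
        [ -e "${id}_A.csv" ] || printf '%s\n' "${id##*/}"
    done
}
```

Called as "report_missing dataset-directory", it prints one ID per line and never needs to know that the IDs are numbers.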

Why even use awk rather than the shell's (well, bash's) builtin printf?

    $ printf '%d\n' "0005"

That might not always do what a naïve user expects:

    $ printf '%d\n' "0025"

You can apply the awk command to a pipe, so it is applied to each line of the file/stream.

Right - though that's solvable with xargs:

    $ echo "0005" | xargs printf '%d\n'
That said, my suggestion doesn't work anyway since the leading 0 marks it as octal, d'oh (as mentioned elsewhere in the thread).
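A short sketch of the pitfall and two ways around it that come up in this thread (awk and sed); the variable names are only for illustration:

```shell
padded=0025

# printf reads a numeric argument with a leading 0 as octal: 025 octal is 21.
# A value such as 0008 is not valid octal at all, so printf complains about it.
as_octal=$(printf '%d\n' "$padded")

# awk converts input fields as decimal, so the padding is simply dropped.
as_decimal=$(printf '%s\n' "$padded" | awk '{printf "%d\n", $0}')

# Plain text surgery also works: strip the leading zeros.
stripped=$(printf '%s\n' "$padded" | sed 's/^0*//')

echo "$as_octal $as_decimal $stripped"
```

One caveat with the sed variant: an input of all zeros is turned into an empty string, so the awk version is the safer default.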

More power to those who enjoy writing control flow in shell, but if I need anything beyond a single line I'm going with an interactive ipython session.

You could use one sed command to replace your grep, cut, and python. It feels cheap to use python to massage data in a post about the Unix command line.

Is there a nice alternative to seq or jot? Something neater than a for loop in awk?

In bash, you can create sequences with {A..B}. E.g.

    echo {1..10}

or to count backwards

    echo {10..0} boom!

ls | rb 'group_by { |x| x[/\d+/] }.select { |_, y| y.one? }.keys'


For heavier duty text processing, try

emacs -e myfuns.el

When it comes to mashing text, nothing beats emacs.

awk one liner: ls | awk '{split($1,x,"_"); split(x[2],y,"."); a[x[1]]+=1} END {for (i in a) {if (a[i] < 2) {print i}}}'

Zsh one liner (probably works in Bash too):

    for a in {0001..0500}; do [[ ! -f ${a}_A.csv ]] && echo $((10#${a})); done
The only trick I'm using is base transformation to remove padding in the echo...

I didn't realize Zsh (and Bash) was capable of removing zero padding in that way.

Everybody has their own style, but I would prefer to print the missing file pattern and avoid loops.

If you have GNU parallel installed (works in bash)

     parallel -kj1 '! [[ -f "{}" ]] && echo {} || :' ::: $(jot -w %04d_A.csv - 1 501)
or if preferred

     parallel -kj1 '! [[ -f "{}" ]] && echo {} || :' ::: {0001..0500}_A.csv

> I didn't realize Zsh (and Bash) was capable of removing zero padding in that way.

Well, it's for interpreting an integer in a different base, like octal or binary, up to base 24 (or more, I don't recall), but it can be abused to strip padding zeroes from variables. Using printf would probably be cleaner, but usually I only recall the C syntax...

I think I have parallel installed but I tend to use xargs out of habit, mostly because I was forced to use xargs in locked out production systems.

If the number of files weren't so big, I'd simply expand them on the ls command line and capture stderr:

    ls {0001..0500}_A.csv 1> /dev/null
It's a little noisier with the error messages but it's fast. With 500 files I'm sure I'll exhaust the shell parse(?) buffer:

    (ls {0001..0500}_A.csv 2>&1 1> /dev/null) | awk -F\' '{print $2}'
and it's too much of a complication to suppress stdout and pipe only stderr. ^__^;
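The redirection order is the whole trick in that last command. A minimal sketch with a stand-in function (noisy is made up for the demo) that writes one line to each stream:

```shell
# Stand-in for a command that writes to both stdout and stderr.
noisy() {
    echo "to stdout"
    echo "to stderr" >&2
}

# Redirections apply left to right: 2>&1 first points stderr at the pipe,
# then >/dev/null sends stdout away. Only stderr reaches sed.
captured=$(noisy 2>&1 >/dev/null | sed 's/^/got: /')
echo "$captured"
```

Written the other way round (>/dev/null 2>&1), both streams end up in /dev/null and the pipe sees nothing.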

The people that created the command line weren't L33T H4XOR NOOBS. They were brilliant PhD scientists. Let's not confuse the two.

> I am starting to realize that the Unix command-line toolbox can fix absolutely any problem related to text wrangling.

How many problems related to text wrangling arise simply by working with Unix tools?

“This philosophical framework will help you solve problems internal to philosophy.”

What a useless comment. The OP is an interesting walkthrough of solving a highly specific problem in a clever way using a common but often poorly understood toolset. Then you come in and leave a snarkbomb trashing the idea that learning how to use this toolset is worthwhile without providing any reasoning or alternatives.

Do you also trash posts about learning how to build your own furniture or troubleshooting car engines?

What elevated domain do you operate in that only has perfectly elegant solutions to beautifully architected problems that use only tools perfectly crafted to solve those exact problems? Doesn’t sound like very interesting work to me.
