
Problem solving with Unix commands - v3gas
http://vegardstikbakke.com/unix/
======
pdkl95
Gary Bernhardt[1] gave a great talk about practical problem solving with the
unix shell: "The Unix Chainsaw"[2].

"Half-assed is OK when you only need half of an ass."

In the talk, he gives several demonstrations of a key aspect of _why_ unix
pipelines are so practically useful: you build them _interactively_. A
complicated four-line pipeline started as a single command that was gradually
refined into something that actually solves a complicated problem. This talk
demonstrates the part that isn't included in the usual tutorials or "cool
1-line command" lists: the cycle of "Try something. Hit up to get the command
back. Make one iterative change and try again."
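
To make the cycle concrete, here's an illustrative sketch (hypothetical
directory and files) of how such a pipeline might grow, one up-arrow-and-edit
at a time:

    cd /var/log
    ls                                      # what's here?
    ls | grep '\.log$'                      # narrow to the logs
    ls | grep '\.log$' | xargs wc -l        # how big is each one?
    ls | grep '\.log$' | xargs wc -l | sort -n | tail -3   # the three biggest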

[1] You might know him from his other hilarious talks like "The Birth & Death
of JavaScript" or "Wat".

[2]
[https://www.youtube.com/watch?v=sCZJblyT_XM](https://www.youtube.com/watch?v=sCZJblyT_XM)

~~~
SomethingOrNot
> In the talk, he gives several demonstrations of a key aspect of why unix
> pipelines are so practically useful: you build them interactively.

The standard Unix interface might have been interactive in the ’70s, back when
hardware and peripherals were horribly non-interactive. But I don’t know why
so many so-called millennial programmers (people my age) get excited about the
alleged interactivity of the Unix that most people are familiar with. It
doesn’t even have the cutting edge ’90s interactivity of Plan 9, what with
mouse(!) selection of arbitrary text that can be piped to commands and so on.
And every time someone comes up with a Unix-hosted tool that uses some kind of
fold-up menu that informs you about what key combination you can type next
(you know, like what all GUI programs have with Alt+x and the file|edit|view|…
toolbar), people hail it as some kind of UX innovation.

~~~
jraph
I think the interactivity you describe might be a different thing from what
your parent is talking about.

From what I understand, your parent talks about how the commands are built
iteratively, with some kind of trial-and-error loop, which is a strength that is
supposedly not emphasized enough. And I agree by the way. Nothing to do with
how things are input.

~~~
pdkl95
That's correct. Articles, tutorials, and evangelizing fans often show only the
end result: the polished command/pipeline that does something useful. The
obvious question from someone unfamiliar with unix, upon seeing something like
the pipeline in this article,

    
    
        comm -1 -3 <(ls -1 dataset-directory | \
                     grep '\d\d\d\d_A.csv'   | \
                     cut -c 1-4              | \
                     python3 parse.py        | \
                     uniq                      \
                     )                         \
                   <(seq 500)
    

is "Why would I want to write a complicated mess like that?" Just use
${FAVORITE_PROG_LANG:-Perl, Ruby, or whatever}". For many tasks, a short
paragraph of code in a "normal" programming language is probably easier to
write and is almost certainly a more robust, easier to maintain solution.
However, this assumes that you knew what the problem was and that qualities
like maintainability are a goal.

Bernhardt's (and my) point is that sometimes you don't know what the goal is
yet. Sometimes you just need to do a small, one-off task where a half-assed
solution might be appropriate... _iff_ it's the right half of the ass. Unix
shell gets that right for a _really useful_ set of tasks.

This works because you are free to utilize those powerful features
incrementally, as needed. The interactive nature of the shell lets you explore
the problem. The "better" version in a "proper" programming language _doesn't
exist_ when you don't yet know the exact nature of the problem. A half-assed
bit of shell code that slowly evolved into something useful might _be_ the
step between "I have some data" and a larger "real" programming project.

That said, there is also wisdom in learning to recognize when your needs have
outgrown "small, half-assed" solutions. If the project is growing and adding
layers of complexity, it's probably time to switch to a more appropriate tool.

~~~
SomethingOrNot
I was generalizing about the interactivity of the Unix that most people seem
familiar with.

“The interactive nature of the shell” isn’t that impressive in this day and
age. Certainly not shells like Bash (Fish is probably better, but then again
that’s a very cutting-edge shell, “for the ’90s”).

Irrespective of the shell this just boils down to executing code, editing
text, executing code, repeat. I suspect people started doing that once they
got updating displays, if not sooner.

~~~
TeMPOraL
How is that not impressive for the vast majority of developers?

For the past couple decades, the only other even remotely mainstream place
where you could get a comparable experience was a Lisp REPL. And maaaybe
Matlab, later on. Recently, projects like R, Jupyter, and (AFAIK) Julia have
been introducing people to interactive development, but those are specific to
scientific computing. For general programming, this approach is pretty much
unknown outside of Lisp and Unix shell worlds.

~~~
SomethingOrNot
The author is an MS student in statistics. Seems that Unix is well-represented
in STEM university fields.

Old-timey Unix (as opposed to things like Plan 9) won. When does widespread
’70s/’80s computing stop being impressive? You say “unknown” as if we were
talking about some research software, or some old and largely forgotten
software. Unix shell programming doesn’t have hipster cred.

~~~
TeMPOraL
> _When does widespread ’70s /’80s computing stop being impressive? You say
> “unknown” as if we were talking about some research software, or some old
> and largely forgotten software._

That's precisely what I'm talking about. The 70s/80s produced tons of insights
into computer use in general, and programming in particular, that were mostly
forgotten, and are slowly being rediscovered or reinvented every couple of
years. Unix in fact was a step backwards in terms of capabilities exposed to
users; it won because of economics.

~~~
pjmlp
Had Bell Labs been allowed to explore UNIX commercially, none of us would be
having this discussion.

------
fforflo
This [0] is the most complete post I've read on the topic. It lays out all the
relevant tools. Spending some time going through each tool's
documentation/options pays off tremendously.

[0]: [https://www.ibm.com/developerworks/aix/library/au-unixtext/index.html](https://www.ibm.com/developerworks/aix/library/au-unixtext/index.html)

~~~
jihadjihad
Wow, great find. Sad how hard it seems these days to come across an easy-to-
follow primer on a topic without narrative fluff and/or ads everywhere. For
those interested in a standalone copy there is a PDF of the content available
here: [https://www.ibm.com/developerworks/aix/library/au-unixtext/au-unixtext-pdf.pdf](https://www.ibm.com/developerworks/aix/library/au-unixtext/au-unixtext-pdf.pdf)

~~~
wglb
Writing clear tutorials is a fair amount of effort, more than I originally
thought when I first did it.

------
skywhopper
The brilliant fun of working with the Unix CLI toolset is that there are
millions of valid ways to solve a problem. I also thought of a “better”
solution of my own that took an entirely different approach than most of the
ones posted here. That’s not really the point.

What’s great about this article is that it follows the process of solving the
problem step by step. I find that lots of programmers I work with struggle
with CLI problem solving, which I find a little surprising. But I think it all
depends on how you think about problems like this.

If you start from “how can I build a function to operate on this raw data?” or
“what data structure would best express the relationship between these
filenames?” then you will have a hard time. But if you think in terms of “how
can I mutate this data to eliminate extraneous details?” and “what tools do I
have handy that can solve problems on data like this given a bit of munging,
and how can I accomplish that bit of munging?” and if you can accept taking
several baby steps of small operations on every line of the full dataset
rather than building and manipulating abstract logical structures, then you’re
well on your way to making efficient use of this remarkable toolset to solve
ad hoc problems like this one in minutes instead of hours.

------
yuriko
If you bother to write a python script to parse the integers, why not use
python to solve the whole problem?

~~~
AnIdiotOnTheNet
This is one of the many reasons I think PowerShell did UNIX philosophy better:
you don't need to parse text because the pipelines pass around typed objects.
You can kinda almost get the same behavior from some UNIX commands by first
having them dump everything into JSON and then having the other end parse the
JSON for you, but you're still relying on a lot of text parsing. Personally I
think it is high time the UNIX world put together a new toolset.
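
As a minimal sketch of that JSON-pipeline style (assuming GNU stat and jq are
installed; the hand-rolled JSON here would break on filenames containing
quotes):

    # emit one JSON object per file, then filter on a typed field downstream
    stat -c '{"name":"%n","size":%s}' * | jq -r 'select(.size > 1024) | .name'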

~~~
skywhopper
Why replace your hammers, screwdrivers, and chisels just because someone
invented a 3D printer? They have tradeoffs. Powershell has some good ideas,
and benefits from having been designed all at once, rather than evolving over
four decades. But in practice it's not as efficient for doing simple things.
It's oriented towards much more complex data structures, which is great... but
there's no need to throw out your simpler tools just because you think they
look ugly.

~~~
AnIdiotOnTheNet
They're full of footguns and esoteric behavior, and they have arcane names.
They're actually pretty awful tools now that it isn't the 70s anymore.

~~~
grumpydba
Yet when I visit the unix sysadmins' office I see people chaining commands to
administer hundreds of boxes. On the Windows side I rarely see PowerShell
prompts. Powershell looks so much better in theory. However it's just an
okayish scripting language with a good REPL. Unix tools are a far better daily
driver.

~~~
AnIdiotOnTheNet
That's probably because for daily tasks we have much better tooling in Windows
already that doesn't require us to use the command line and interactively
construct it. I can easily administer the configurations of thousands of
computers through AD, for instance, and while I could use PowerShell to do so,
using ADUC is just easier most of the time.

If you do a lot of work with Exchange though, you'll probably end up using
PowerShell much more, because the web UI for it is not so great.

No matter what you think of the specific implementation, a lot of PowerShell's
ideas are good ideas. Unfortunately UNIX culture is such that they'll probably
never implement any of them.

~~~
grumpydba
> That's probably because for daily tasks we have much better tooling in
> Windows

ssh, docker, ansible, kubernetes, grafana, prometheus, etc... All coming from
Linux/unix. This statement is clueless. Most of the cloud is not running
microsoft, and for a good reason.

To automate, we have python, which has a much better syntax. It's pointless to
use powershell.

And it takes a Microsoft head, without knowledge of programming language
history, to say that PowerShell's ideas actually come from PowerShell. Method
chaining/fluent interfaces with a pipe instead of a dot do not look that new.

Also, some attempts have been made to implement PowerShell clones on unix.
Being redundant with either perl/python or bash/zsh, none succeeded.

------
dnet
Removing leading zeroes doesn't require Python. One easy solution would be
sed:

    
    
        $ echo -e '0001\n0010\n0002' | sed 's/^0*//'
        1
        10
        2

~~~
mshook
Yeah, plus seq can generate sequences with leading zeroes (something like seq
-f %04.f 1 20).

So instead of scripting, he could have generated a sorted list of numbers from
the files he had. Created a file with the sequence of numbers for the range
and diffed/commed the whole thing. Voilà...
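
A hedged sketch of that approach, assuming GNU seq and the article's naming
scheme:

    cd dataset-directory
    ls *_A.csv | cut -c 1-4 > have     # the numbers that got an A file
    seq -f %04.f 1 500 > want          # every number, zero-padded
    comm -13 have want                 # lines in want but not in have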

~~~
pwg
The seq provided in the GNU toolset has a -w flag to turn on "equal width"
mode, so one can also get zero padded numbers out (from GNU seq) by turning on
that mode and zero padding the input:

    
    
        $ seq -w 0001 0003
        0001
        0002
        0003
        $

~~~
anc84
Nice, this works automatically like this:

    
    
        $ seq -w 98 102
        098
        099
        100
        101
        102

------
boomlinde
A change in structure might be helpful:

    
    
        $ ls data
        0001.csv 0002.csv 0003.csv 0004.csv ...
        $ ls algorithm_a
        0001.csv 0002.csv 0004.csv ...
        $ diff -q algorithm_a data |grep ^Only |sed 's/.*: //g'
        0003.csv ...

~~~
v3gas
Excellent point, haha!

------
stiff
For learning to get things done with Unix, I recommend the two old books "The
Unix Programming Environment" and "The AWK Programming Language". There are
many resources to learn the various commands etc., but there is still no
better place than those books to learn the "unix philosophy". This series is
also good:

[https://sanctum.geek.nz/arabesque/series/unix-as-ide/](https://sanctum.geek.nz/arabesque/series/unix-as-ide/)

------
nickjj
I think the best part about using Unix tools is it forces you to break down
the problem into tiny steps.

You can see feedback every step of the way by removing and adding back piped
commands, so you're never really dealing with more than one operation at a
time, which makes debugging and making progress a lot easier than trying to
fit everything together at once.
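
In practice that looks like truncating and extending the pipeline to inspect
each stage, e.g. with the article's dataset-directory:

    ls dataset-directory | head                          # what do the names look like?
    ls dataset-directory | grep _A | head                # keep only the A files
    ls dataset-directory | grep _A | cut -c 1-4 | head   # keep only the numbers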

~~~
mercer
It's basically functional programming. I find that my approach to writing code
is very similar to how I work with the shell. The main difference, I guess, is
that the command 'units' are slightly bigger, in the form of functions, but
the way I iterate my solution to a problem is basically the same.

------
ben509
I've often done this, usually not for a large dataset, but it's sometimes
helpful to pipe text through Unix commands in Emacs. C-u M-| sort, for
instance, will run the selection through sort and replace it in place.

If you're going the all python route, and even want to be able to run bash
commands, and want something where you can feed the output into the input, I'd
strongly recommend jupyter. (If you want to stay in a terminal, ipython is
part of jupyter and heavily upgrades the built-in REPL and does 90% of what
I'm mentioning here.)

You can break out each step into its own cell and save variables (cell 5's
output is auto-saved as a variable named _5), but the nicest thing is you can
move cells around (check the keyboard shortcuts) and restart the entire kernel
and rerun all your operations, essentially what you're getting with a long
pipeline, only spread out over parts. And there are shortcuts like func? to
pop up help on a function or func?? to see the source.

It's got some dependencies, so I'd recommend running it in a virtualenv via
pipenv:

    
    
        pipenv install jupyter  # setup new virtualenv and add package
        pipenv run jupyter notebook
        pipenv --rm  # Blow away the virtualenv
    

Also, look into pandas if you want to slurp a CSV and query it.

~~~
omaranto
I doubt you'll find many Emacs users that would prefer "C-u M-| sort" over
"M-x sort-lines".

------
jancsika
The problem with this is that there isn't a standard format forced on the args
that follow a command name like "cut".

What makes it worse is that there seem to be patterns of standard format that
get violated by other patterns. It's often based on when the utility was first
authored and whatever ideas were floating around at the time. So sometimes
characters can "clump" together behind a flag, under the assumption
that multi-character flags will get two hyphens. Then some utilities or
programs use a single hyphen for multi-character flags. Plus many other
inconsistencies-- if I learn the basic range syntax for cut do I know the
basic range syntax for imagemagick?

Those inconsistencies don't _technically_ conflict since each only exists in
the context of a particular utility. But it's a real strain on your sanity to
see those inconsistencies sitting on either side of a pipe, especially when one of
them is wrong. (Or even when it's a single command you need but you use the
wrong flag syntax.) That all adds to the cognitive load and can easily make a
dev tired before it's time to go to sleep.
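
To make the inconsistency concrete, compare two "range" syntaxes (illustrative
commands):

    cut -c 1-4 names.txt                        # a cut range is START-END
    convert in.png -crop 640x480+10+10 out.png  # an imagemagick region is WxH+XOFF+YOFF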

Oh, and that language switch from bash to python is a huge risk. If you're
scripting with Python on a daily basis it probably doesn't seem like it. But
for someone reading along, that language boundary is huge. The user is no
longer dealing only with runtime errors and finicky arg-formatting errors, but
also with language errors. If the command line barfs up an exception or syntax
error at that boundary I'd bet most users would just give up and quit reading
the rest of the blog.

Edit: clarification

~~~
skywhopper
Learning the idiosyncrasies of the tools involved is one of the tradeoffs. But
there's no getting around it. These tools have been around for far too long to
change them all in some misguided attempt at consistency--the semantics of
most tools are so different, it wouldn't even make sense to try to enforce
some consistency anyway.

You don't have to know every flag for every tool. You don't need to know if
you can glob args together in a certain tool. These are different tools
developed across decades by different people for different purposes. The fact
that you can glue them all together on an ad-hoc basis is magical!

You learn by learning how to do one thing at a time--cutting characters 10-20,
or grepping for a regex, or summing with awk, or replacing strings with sed,
or translating characters with tr--and adding it to your mental toolbox. It's
okay to have a syntax error because man is there and you can easily iterate
the command to make it do what you want.
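
For instance, one small tool per job (GNU userland assumed):

    cut -c 10-20 file.txt                       # keep characters 10-20
    grep -E '^[0-9]{4}_' file.txt               # match lines against a regex
    awk '{ s += $1 } END { print s }' file.txt  # sum the first column
    sed 's/foo/bar/g' file.txt                  # replace strings
    tr 'a-z' 'A-Z' < file.txt                   # translate characters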

You aren't writing a program to stand the test of time. You're solving a
problem in the moment!

------
almostarockstar
This was a nice read and a good introduction to text processing with unix
commands.

I agree with the other user re python usage - that you may as well use it for
the whole task if you're going to use it at all - but I don't think it's a
major flaw. It worked for you, right? I would suggest naming the python file a
bit more descriptively though.

Interesting to read the other suggestions about dealing with this without
python.

~~~
v3gas
Thanks! Glad to hear!

------
maratc
    $ join -v 2 <(ls | grep _A | sort | cut -c-4) <(ls | grep -v _A | sort | cut -c-4)

The shortest one I could come up with, no need to use python.

`join -v 2` shows the entries in the second sorted stream that don't have a
match in the first sorted stream; the rest is self-explanatory, I hope.

Edit:

    $ join -v2 -t_ -j1 <(ls | grep _A | sort) <(ls | grep -v _A | sort)

This is even shorter; it joins on the first field (-j1), where fields are
separated by '_' (-t_).
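
A toy demonstration of the -v 2 behavior (made-up input):

    $ join -v 2 <(printf '%s\n' 1 2 4) <(printf '%s\n' 1 2 3 4 5)
    3
    5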

~~~
zimpenfish
Slightly shorter:

    
    
        ls -v|cut -d_ -f1|uniq -c|awk '$1<2{print $2}'
    

Tested by creating 500 sets of dual files and removing 10 `_A` randomly.

    
    
        for i in $(seq 1 500); do j=$(printf %04d $i); touch ${j}_data.csv; touch ${j}_A.csv; done
        for i in $(seq 1 10); do q=$((RANDOM % 500)); r=$(printf %04d $q); rm -v ${r}_A.csv; done
        removed '0438_A.csv'
        removed '0327_A.csv'
        removed '0150_A.csv'
        removed '0173_A.csv'
        removed '0460_A.csv'
        removed '0194_A.csv'
        removed '0073_A.csv'
        removed '0293_A.csv'
        removed '0404_A.csv'
        removed '0153_A.csv'
    

And then using the code above to verify the missing files

    
    
        0073
        0150
        0153
        0173
        0194
        0293
        0327
        0404
        0438
        0460

~~~
taviso
You could use uniq -u to avoid the awk.

~~~
maratc

        ls|cut -d_ -f1|uniq -u
    

You win.

~~~
e12e
I like this solution - I'm not very used to using "cut" - or more generally
to mapping from "files" to "fields/lines in a text stream".

I'm more inclined to ask:

given a list of files with this name, does a file of a different name exist on
the file system?

But the more Unix approach is really:

how can I model my data as a text stream, and how can I then pose/answer my
question?

(here: list all filenames in folder in sorted order - cut away the text
indicating type - then count/display the non-repeat/single entries)

My solution would probably be more like (with bash in mind; most of this could
be "expanded" to fork out to more utils, like "basename -s" etc.):

    
    
      for data_file in *_data.csv
      do
        alg_file="${data_file%_data.csv}_A.csv";
        if [[ ! -f "${alg_file}" ]];
        then
          echo "Missing alg file:\
          ${alg_file} for data \
          file: ${data_file}";
        fi;
      done
    

Ed: this is essentially the same solution as:

[https://news.ycombinator.com/item?id=19161358](https://news.ycombinator.com/item?id=19161358)

Although more verbose. I think I prefer omitting the explicit if, though, and
just using "test" and "or" ("[[", "||").

------
sagartewari01
My favourite one is 'pkill -9 java'. Fixes my laptop if it starts lagging.

~~~
fxfan
Does that kill electron instances too? ;)

------
samwhiteUK
I thought this was a neat demo of building up a command with UNIX tools. The
python inclusion was a bit odd, yes.

I learned about sys.stdin in Python and about cutting characters using cut's
-c flag.

~~~
v3gas
Thanks!

------
jclay
After moving back to working on a Windows machine the last several years and
being “forced” into using PowerShell, I now find myself using it for these
sorts of tasks on Linux.

I now use PowerShell for any tasks of equal or greater complexity than the
article. It’s such a massive upgrade over struggling to recall the peculiar
bash syntax every time and the benefits of piping typed objects around are
vast.

As a nice bonus, all of my PowerShell scripts run cross-platform without
issue.

~~~
nxrabl
I've dabbled in PowerShell before, but I've always found the objects you get
from cmdlets to be so much more opaque than the plain text you get from Unix
output, which makes it harder to use the iterative approach to development the
article and other commenters describe. Do you have any tips for poking around
in PowerShell objects / a workflow that works for you?

~~~
jclay
I’ve tried to love it while using it as an interactive shell, but it’s hard
for me to lose the Unix muscle memory and remember their verbose commands.

For anything more than a single pipe, or anything that requires loops or
control flow, I switch to Powershell in Visual Studio Code with the PowerShell
extension which has intellisense and helps to poke around the methods on each
object. From there you can select subsets of your script and run with F8 which
helps me prototype with quick feedback.

------
darrenf
All the pipes and non-builtin commands (especially python!) look like overkill
to me, I must say.

    
    
        for set in *_data.csv ; do
            num=${set/_*/}
            success=${set/data/A}
            if [ ! -e $success ] ; then echo $num ; fi
        done
    

ETA: likely specific to bash, since I have no experience with other shells
except for dalliances with ksh and csh in the mid-90s.

~~~
phaemon
Yup, I'd probably have gone with a `for` loop also. A bit shorter:

    
    
      for set in *_data.csv; do
        [[ -f "${set/data/A}" ]] || echo "${set%_data.csv}"
      done
    

Edit: though I just write it out like this for formatting on HN. In real life,
that would just be a one-liner:

for set in *_data.csv; do [[ -f "${set/data/A}" ]] || echo "${set%_data.csv}";
done

~~~
hellabites
Just because I like GNU parallel:

    
    
        parallel -kj1 'f="{}"; [[ -f "${f/data/A}" ]] || echo $f' ::: *_data.csv

------
ciucanu
I usually do text processing in Bash, Notepad++ and Excel. Each has its own
pros and cons, that's why I usually combine them.

Here you have the tools I use in Bash:

grep, tail, head, cat, cut, less, awk, sed, sort, uniq, wc, xargs, watch ...

~~~
james_s_tayler
As an aside, I once found out you can replace 'sort | uniq' entirely with an
obscure awk command, so long as you don't require the output to be sorted.
IIRC it performs twice as fast.

    
    
      cat file.txt | awk '!x[$0]++'

~~~
omaranto
The awk commands prints the first occurrence of each line in the order they
are found in the file. I can imagine that sometimes that might be even better
than sorted order.

------
ortekk
If you are using python in your pipeline, might as well go all in!

    
    
      from pathlib import Path

      # every filename that should exist: 0001_A.csv through 0500_A.csv
      all_possible_filenames = {f'{i:04}_A.csv' for i in range(1, 501)}

      # the names actually present in the current directory
      cur_dir_filenames = {p.name for p in Path('.').iterdir()}

      missing_filenames = all_possible_filenames - cur_dir_filenames

      print(*sorted(missing_filenames), sep='\n')

------
omaranto
The article solves the problem: for which numbers x between 1 and 500 is there
no file x_A.csv? It looks like in this case it is equivalent to the easier
problem: for which x_data.csv is there no corresponding x_A.csv?

    
    
        cd dataset-directory
        comm -23 <(ls *_data.csv | sed s/data/A/) <(ls *_A.csv)

~~~
ebeip90
This will fail for any filenames that contain newlines

~~~
omaranto
Correct. It is intended for the filenames in the article. More generally, I
try to write all my shell code to silently produce hard to track down errors
when a filename contains newlines, in order to punish me for my carelessness
if I ever accidentally create such a filename.

------
iheartpotatoes
I got paid $175/hr as a data analyst contractor to basically run bash, grep,
sed, awk, perl. The people that hired me weren't dumb, just non-programmers
and became giddy as I explained regular expressions. The gig only lasted 3
months, but I taught myself out of a job: once they got the gist of it they
didn't need me. Yay?

------
kritixilithos
Nicely done using Unix utils. You can have a pure sed solution (save the `ls`
invocation) that is much simpler, albeit obscure; it hinges on the fact that
every number has a `data.csv` file.

Given a sorted list of these files (through `ls` or otherwise) the following
sed code will print out the data files for which A did not succeed on them.

    
    
      /data/!N
      /A/d
      P;D
    
    

This works because there exists a data file for all successful and
unsuccessful runs, so sed simply prints the files for which there does not
exist an `A` counterpart.

If you want to only print out the numbers, you can add a substitution or two
towards the end.

    
    
      /data/!N
      /A/d
      s/^0*\|_.*//g;P;D
    

Edit: fixed the sed program

~~~
kritixilithos
Actually the following is even shorter

    
    
      /A/{N;d;}
    

So all together this gives the following

    
    
      ls|sed '/A/{N;d;}'
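
A quick check against sample names (here assuming only 0003 is missing its A
file):

      $ printf '%s\n' 0001_A.csv 0001_data.csv 0002_A.csv 0002_data.csv 0003_data.csv | sed '/A/{N;d;}'
      0003_data.csv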

------
LogicX
given the limited scope of files in the directory... not sure why it was
necessary to use grep instead of the built-in glob?

    
    
      ls dataset-directory | egrep '\d\d\d\d_A.csv'

which FWIW wouldn't even work, on multiple levels: you need -1 on ls and no
files end with A.csv

    
    
vs

      ls -1 dataset-directory/*_A?.csv
    

ref: [http://man7.org/linux/man-pages/man7/glob.7.html](http://man7.org/linux/man-pages/man7/glob.7.html)

Update: apologies, apparently my client cached an older version of this page.
At that time the files were named A1.csv and A2.csv.

~~~
olog-hai
Some ls man pages state the following about the -1 option: "This is the
default when the output is not directed to a terminal."

I've never needed to use -1 when piping ls's output to another command.

------
inp
Instead of creating a script in Python to convert the numbers to integers, you
can use awk: "python3 parse.py" becomes "awk '{printf "%d\n", $0}'"

~~~
darrenf
Why even use awk rather than the shell's (well, bash's) builtin printf?

    
    
        $ printf '%d\n' "0005"
        5

~~~
inp
You can apply the awk command on a pipe, so it applies to each line of the
file/stream.

~~~
darrenf
Right - though that's solvable with xargs:

    
    
        $ echo "0005" | xargs printf '%d\n'
        5
    

That said, my suggestion doesn't work anyway since the leading 0 marks it as
octal, d'oh (as mentioned elsewhere in the thread).

------
mklm
If you don't mind "cd dataset-directory" beforehand, a shorter and possibly
more correct version would be:

    
    
      comm -1 -3 <(ls *_A.csv | sed 's/_.*$//') <(seq -w 0500) | sed 's/^0*//'
    

The OP's solution doesn't seem correct because of the different ordering of
the two inputs of `comm': lexicographical (ls) and numeric (seq).

~~~
mklm
Although -w is supported by both GNU and BSD versions of `seq', BSD's ignores
leading zeros in input. Thus a more portable approach is:

    
    
      comm -1 -3 <(ls *_A.csv | sed 's/_.*$//') <(seq -f %04.f 500) | sed 's/^0*//'

------
wmu
Easier would be to just use 'cat list_of_numbers | sort | uniq -u' to get the
entries that appear only once.

~~~
phireal
Shorter still:

    
    
        sort -u < list_of_numbers

~~~
wmu
This is not the same. For sequence [5,5,4,3,3,2,1,1] "sort -u" returns
[1,2,3,4,5], while "sort | uniq -u" returns [2,4].
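
Demonstrated on that sequence:

    $ printf '%s\n' 5 5 4 3 3 2 1 1 | sort -u
    1
    2
    3
    4
    5
    $ printf '%s\n' 5 5 4 3 3 2 1 1 | sort | uniq -u
    2
    4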

~~~
phireal
Huh, I didn't know that! Thanks.

------
pletnes
Useless use of seq spotted. Seq does not exist on many systems. Bash has
{0001..0500} instead.

Nice writeup though.

~~~
enriquto
Well, to be fair, bash does not exist on many systems either.

For example I have used dragonflybsd and freebsd today and they both had "seq"
but no "bash".

~~~
lelf
They have jot(1)

------
adamchainz
I learnt a lot from the book Data Science at the Command Line, now free and
online at
[https://www.datascienceatthecommandline.com/](https://www.datascienceatthecommandline.com/)

------
pixelbeat__
Set operations are very useful. Here's a summary:

[https://www.pixelbeat.org/cmdline.html#sets](https://www.pixelbeat.org/cmdline.html#sets)


------
js2
Not the most efficient solution but this is what springs to mind for me:

    
    
        seq 500 | xargs printf '%04d_A.csv\n' | while read -r f; do test -f $f || echo $f; done

------
jon49
Use F# with a TypeProvider. Of course, I imagine it would take some work
learning F# but once you learn it the sky is the limit in what you can do with
this data.

------
Dowwie
More power to those who enjoy writing control flow in shell, but if I need
anything beyond a single line I'm going with an interactive ipython session.

------
dahfizz
You could use one sed command to replace your grep, cut, and python. It feels
cheap to use python to massage data in a post about the Unix command line.

------
oh5nxo
Is there a nice alternative for seq or jot? Something neater than a for-loop
in awk?

~~~
jabl
In bash, you can create sequences with {A..B}. E.g.

echo {1..10}

or to count backwards

echo {10..0} boom!

------
redka
ls | rb 'group_by { |x| x[/\d+/] }.select { |_, y| y.one? }.keys'

[https://github.com/thisredone/rb](https://github.com/thisredone/rb)

------
BentFranklin
For heavier duty text processing, try

emacs -l myfuns.el

When it comes to mashing text, nothing beats emacs.

------
Upvoter33
awk one liner:

    ls | awk '{split($1,x,"_"); split(x[2],y,"."); a[x[1]]+=1} END {for (i in a) {if (a[i] < 2) {print i}}}'

~~~
IronBacon
Zsh one liner (probably works in Bash too):

    
    
        for a in {0001..0500}; do [[ ! -f ${a}_A.csv ]] && echo $((10#${a})); done
    

The only trick I'm using is _base transformation_ to remove padding in the
echo...

~~~
hellabites
I didn't realize Zsh (and Bash) was capable of removing zero padding in that
way.

Everybody has their own style, but I would prefer to print the missing file
pattern and avoid loops.

If you have GNU parallel installed (works in bash)

    
    
         parallel -kj1 '! [[ -f "{}" ]] && echo {} || :' ::: $(jot -w %04d_A.csv - 1 500)
    

or if preferred

    
    
         parallel -kj1 '! [[ -f "{}" ]] && echo {} || :' ::: {0001..0500}_A.csv

~~~
IronBacon
> _I didn't realize Zsh (and Bash) was capable of removing zero padding in
> that way._

Well, it's for transforming an integer into a different base like octal or
binary (up to base 36 in zsh, 64 in bash, if I recall correctly), but it can
be abused to strip padding zeroes from variables. Using _printf_ should
probably be cleaner but usually I only recall the C syntax...
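
For example (works in both zsh and bash):

    $ echo $((2#1010))   # a binary literal
    10
    $ echo $((10#0042))  # force base 10, which also strips the zero padding
    42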

I think I have _parallel_ installed but I tend to use _xargs_ out of habit,
mostly because I was forced to use _xargs_ on locked-down production systems.

If the number of files weren't so big, I'd simply expand them on ls and
capture stderr:

    
    
        ls {0001..0500}_A.csv 1> /dev/null
    

It's a little noisier with the error messages but it's fast. With 500 files I'm
sure I'll exhaust the shell parse(?) buffer:

    
    
        (ls {0001..0500}_A.csv 2>&1 1> /dev/null) | awk -F\' '{print $2}'
    

and too many complications to suppress stdout and pipe only stderr. ^__^;

------
iheartpotatoes
The people that created the command line weren't L33T H4XOR NOOBS. They were
brilliant PhD scientists. Let's not confuse the two.

------
sureaboutthis
> I am starting to realize that the Unix command-line toolbox can fix
> absolutely any problem related to text wrangling.

Am I the only one who thought, "No shit, Sherlock"? This is a fundamental of
UNIX that many people don't seem to grasp.

~~~
adtac
Everybody realises this at some point. Nobody ever thought "I can use this for
anything" when they first saw a shell. It takes time.

------
SomethingOrNot
> I am starting to realize that the Unix command-line toolbox can fix
> absolutely any problem related to text wrangling.

How many problems related to text wrangling arise simply by working with Unix
tools?

“This philosophical framework will help you solve problems internal to
philosophy.”

~~~
skywhopper
What a useless comment. The OP is an interesting walkthrough of solving a
highly specific problem in a clever way using a common but often poorly
understood toolset. Then you come in and leave a snarkbomb trashing the idea
that learning how to use this toolset is worthwhile without providing any
reasoning or alternatives.

Do you also trash posts about learning how to build your own furniture or
troubleshooting car engines?

What elevated domain do you operate in that only has perfectly elegant
solutions to beautifully architected problems that use only tools perfectly
crafted to solve those exact problems? Doesn’t sound like very interesting
work to me.

~~~
SomethingOrNot
[http://www.art.net/~hopkins/Don/unix-haters/handbook.html](http://www.art.net/~hopkins/Don/unix-haters/handbook.html)

