
Shell Magic: Set Operations with uniq - cpitman
http://blog.deadvax.net/2018/05/29/shell-magic-set-operations-with-uniq/
======
sciurus
This is a neat hack! For two input files, the intersection and relative
complement can be done more straightforwardly with comm.

[https://en.wikipedia.org/wiki/Comm](https://en.wikipedia.org/wiki/Comm)

    
    
      # show only items in both a and b
      comm -1 -2 a_list b_list
    
      # show only items unique to a
      comm -2 -3 a_list b_list
    
      # show only items unique to b
      comm -1 -3 a_list b_list

~~~
partycoder
You can also write

    
    
        comm <flags> <(command 1) <(command 2)
    

to feed the output of a command in as an input file. This also works with diff
and other commands that take file arguments.

~~~
TheDong
More accurately, it can be used for any command which expects a file and
doesn't do anything too weird in reading it (e.g. doesn't seek to the
beginning and read it again)

The '<( ... )' syntax just expands to a path that refers to a file descriptor
connected to that command's stdout.

    
    
        $ ls <(echo hi)  
        /proc/self/fd/11
        $ vi <(echo hi)
        # opens vi with 'hi' as the contents

~~~
therein
Thank you. I was just thinking of looking into how it was implemented.

~~~
lloeki
It is called "process substitution", and can work the other way around:

[http://www.tldp.org/LDP/abs/html/process-sub.html](http://www.tldp.org/LDP/abs/html/process-sub.html)

With fish:

[https://fishshell.com/docs/current/commands.html#psub](https://fishshell.com/docs/current/commands.html#psub)

But there it only works one way:

[https://github.com/fish-shell/fish-shell/issues/1786](https://github.com/fish-shell/fish-shell/issues/1786)

------
markusbk
If your input is already sorted (like this article assumes), you can use "sort
-m", which is a lot faster. Also, to print only lines with duplicates, use
"uniq -d" instead of "uniq -c | grep 2\ ".

Union: Instead of

    
    
        cat a_list b_list | sort | uniq
    

do

    
    
        sort -m a_list b_list | uniq
    

Intersection: Instead of

    
    
        cat a_list b_list | sort | uniq -c | grep 2\ 
    

do

    
    
        sort -m a_list b_list | uniq -d
    

Relative complement: Instead of

    
    
        cat a_list b_list b_list | sort | uniq -c | grep 2\ 
    

do

    
    
        sort -m a_list a_list b_list | uniq -u
    

Note the change of approach here: instead of making lines from b_list appear
twice and grepping for that count, make lines from a_list appear twice and
have uniq only print lines that aren't repeated.
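
As a quick sanity check, here is a minimal sketch of the three `sort -m`
variants wrapped in functions. The function names are made up for
illustration, and it assumes both inputs are already sorted and contain no
internal duplicates.

```shell
# Hypothetical helpers; each expects two sorted, duplicate-free files.

# Lines present in either file.
set_union() { sort -m "$1" "$2" | uniq; }

# Lines present in both files.
set_intersection() { sort -m "$1" "$2" | uniq -d; }

# Lines of the second file that are missing from the first
# (the "a_list a_list b_list" form above).
only_in_second() { sort -m "$1" "$1" "$2" | uniq -u; }
```

With a_list holding a, b, c and b_list holding b, c, d, `only_in_second a_list
b_list` should print just d.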

------
jph
I like shell set operations scripts, because they are quick and easy.

I prefer `awk` over `uniq` and `comm` because awk tends to be faster at set
ops that can skip sorting and deduplicating.

Here's my script for union, intersection, etc. See README on GitHub.
Suggestions welcome.

[https://github.com/sixarm/setop](https://github.com/sixarm/setop)

    
    
        #!/bin/sh
        set -eu
        op="$1"; shift
    
        case  $op  in
          ∪|u|union|or|∨|add|addition|'+'|'|')
            awk '!seen[$0] {print} {seen[$0]=1}' "$@"
            ;;
          ∩|i|intersection|and|∧|'&')
            awk 'FNR==1{argind+=1} seen[$0]+=1 {next} END { for (key in seen) { if (seen[key]==argind) { print key } } }' "$@"
            ;;
          ⊖|d|diff|difference|xor|⊻)
            awk 'FNR==1{argind+=1} seen[$0]+=1 {next} END { for (key in seen) { if (seen[key]==1) { print key } } }' "$@"
            ;;
          ex|except|exclude|subtract|subtraction|'-')
            awk 'NR==FNR{seen[$0]=1;next} seen[$0]=0; END { for (key in seen) { if (seen[key]) { print key } } }' "$@"
            ;;
          extra)
            awk 'BEGIN{argindmax=ARGC-1} FNR==1{argind+=1} argind<argindmax {seen[$0]; next}!($0 in seen)' "$@"
            ;;
          disjoint|n|not|none)
            awk 'seen[$0]==1 {x=1;exit} {seen[$0]=1} END { print x ? "FALSE" : "TRUE"; exit !!x }' "$@"
            ;;
          *)
        esac

~~~
unixhero
A few comments. First, wow, I love how this looks like dark magic.

Those mathematical notations, are you using them because it makes it easier to
see how it corresponds to actual Set Theory/theorems? If so, could you just as
well have used an alphanumeric identifier like "left" "union" "right", or
would the code break without this notation? I'm in deep water here, I don't
know this. But set theory seems to pop up a lot in my line of work,
essentially doing joins on datasets using Tableau, so my interest in the
nitty-gritty of this field is increasing.

~~~
jph
Thanks! To answer your questions...

> because it makes it easier to see how it corresponds to actual Set
> Theory/theorems?

Yes. These are the Unicode symbols for set theory.

> could you just as well have used an alphanumeric identifier like "left"
> "union" "right" or

Yes. You can use any of the words in the case switch statements, such as
`setop union file1 file2`. You can also edit the script to add your own words
if you like.

You can see simpler versions of these scripts in our GitHub repos. For example
the `union` command is
[https://github.com/sixarm/union](https://github.com/sixarm/union)

> set theory seems to pop up a lot in my line of work

More and more in mine too. Thank you for your comments!

------
pixelbeat__
I've a summary of the operations at:

[http://www.pixelbeat.org/cmdline.html#sets](http://www.pixelbeat.org/cmdline.html#sets)

Note that comm output is a bit awkward to parse, so I use another coreutils
command, `join`, to process already-sorted data.
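
A minimal sketch of what that can look like with `join`. The helper names are
hypothetical, and it assumes each input is sorted and has exactly one
whitespace-free word per line, since `join` matches on the first field by
default.

```shell
# Hypothetical helpers; inputs are sorted files with one word per line.

# Lines present in both files.
join_intersection() { join "$1" "$2"; }

# Lines present only in the first file (-v 1 prints unpairable lines of file 1).
join_only_first() { join -v 1 "$1" "$2"; }

# Lines present only in the second file.
join_only_second() { join -v 2 "$1" "$2"; }
```

Unlike comm, the output has no leading tab columns, so it can be piped on
without further cleanup.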

~~~
kseistrup
Thanks for the list, there are many useful examples.

Another way to get the last day of the current month (== the number of days in
the month) is:

    
    
        : $(cal); echo $_

~~~
lloeki
For the curious, `:` is the noop builtin in bash. I use it mostly as I would
use `pass` in python, since empty conditionals or functions are a parse error.

    
    
        if whatevs; then
          :
        fi

~~~
kseistrup
And to elaborate even more:

‘:’ is the command, and ‘$(cal)’ – which equals the unquoted output from
running the ‘cal’ command – are the arguments. The last day of the month is
thus the last argument of ‘:’ and can be referenced with the ‘$_’ variable.
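
For shells where `$_` is not available, roughly the same idea can be sketched
with the positional parameters. `last_word` is a hypothetical helper, not a
standard command, and it relies on the same unquoted word splitting as the
`$(cal)` trick.

```shell
# last_word: print the final whitespace-separated word of its arguments.
last_word() {
    # Unquoted $* is word-split on purpose, like the unquoted $(cal) above.
    # Caveat: words containing glob characters such as * may be expanded.
    set -- $*
    [ "$#" -gt 0 ] || return 1
    # Drop everything except the last word.
    shift $(($# - 1))
    printf '%s\n' "$1"
}
```

Usage would be something like `last_word "$(cal)"`.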

------
perl4ever
The cut and join commands are also useful. I'm pretty sure you could prove all
basic database operations can be done using the shell.

~~~
Aelius
In fact I wrote an image/tag database solution entirely in shell/coreutils.

It's too disgusting to share.

~~~
perl4ever
After playing around with join just now, I remembered why I hate it - it
requires your inputs to be sorted which makes it abominably inconvenient.

------
hn-minutiae
Here's another collection of set operations - I refer to this a couple of
times a week at least.
[http://www.catonmat.net/download/setops.txt](http://www.catonmat.net/download/setops.txt)

~~~
unixhero
Whoa. That is a great resource.

------
derriz
Hard to resist this - "Unlike the intersection, the Set Difference is a bit
harder to scale up to more than two lists. It is concievable, and I may even
have done it, but I’ll leave it as an exercise to the reader to develop that."

An inefficient solution which involves unnecessary sorting: for each file_i, 0
<= i < n in the set of n files, cat it 2^i times before combining to pipe
through sort and uniq -c. Every possible set operation combination can be
determined by grepping the result for a particular combination of counts.
Intersection would be calculated by grepping for 2^n - 1 while symmetric
difference would require egrep to pick out any of 1, 2, 4,..,2^(n-1).
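
A rough sketch of that weighting scheme, assuming no file repeats a line
internally. `membership_counts` is a made-up name, and the output format
("mask line") is just one convenient choice.

```shell
# membership_counts FILE...: print "<mask> <line>" for every distinct line,
# where file number i (counting from 0) contributes 2^i to the mask.
membership_counts() {
    weight=1
    for f in "$@"; do
        n=0
        # Emit this file "weight" times: 1, 2, 4, ... copies.
        while [ "$n" -lt "$weight" ]; do
            cat "$f"
            n=$((n + 1))
        done
        weight=$((weight * 2))
    done | LC_ALL=C sort | uniq -c |
        # Strip uniq's padding so the count becomes a clean first field.
        awk '{ mask = $1; sub(/^ *[0-9]+ /, ""); print mask, $0 }'
}
```

With three files, `membership_counts a b c | awk '$1 == 7'` would keep the
lines found in all three, and `awk '$1 == 1'` the lines found only in the
first.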

~~~
perl4ever
The way I would do it, assuming we have list1.txt, list2.txt, list3.txt, ...
and want to calculate (1 - 2 - 3 - ...) :

1. Use sed to prepend "1<tab>" (that's a one digit and a tab char) to every
line of the first list, the one being subtracted from, and save as
"prefix.txt".

2. Use cat to combine all the lists, then sort | uniq -c | sort -n | sed
<reformat to make tab delimited> | sort again, and save as output.txt.

3. join prefix.txt and output.txt on each whole line and cut the second tab
delimited field to produce the final result.

So in order to be in the result, a list item must appear in exactly one list
and that must be the first list. That should be what we want (?)
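
The three steps above can be sketched roughly as follows. This is an
illustration, not the exact pipeline: the function name is hypothetical, the
whole-line match uses `comm -12` in place of `join`, and it assumes no list
has internal duplicates and no line contains a tab.

```shell
# first_minus_rest LIST1 LIST2...: lines of LIST1 that appear in no other list.
first_minus_rest() {
    tmp=$(mktemp -d) || return 1
    tab=$(printf '\t')
    # Step 1: tag every line of the first list with a count of 1.
    sed "s/^/1${tab}/" "$1" | LC_ALL=C sort > "$tmp/prefix.txt"
    # Step 2: count occurrences across all lists, as "count<tab>line".
    cat "$@" | LC_ALL=C sort | uniq -c |
        sed "s/^ *\([0-9][0-9]*\) /\1${tab}/" |
        LC_ALL=C sort > "$tmp/output.txt"
    # Step 3: keep tagged lines whose overall count is also 1, drop the tag.
    LC_ALL=C comm -12 "$tmp/prefix.txt" "$tmp/output.txt" | cut -f2-
    rm -rf "$tmp"
}
```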

------
bariumbitmap
The Debian 'corekeeper' package includes a cron script that uses this.

[https://sources.debian.org/src/corekeeper/1.6/debian/corekee...](https://sources.debian.org/src/corekeeper/1.6/debian/corekeeper.cron.daily/)

It's actually possible to simplify the set operations a little in that case.

[https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=895143](https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=895143)

------
nonsince
For completeness, to scale relative complement up to N lists you would need to
repeat the i'th list 2^(i-1) times, so the last list appears 2^(N-1) times. So
it'd look something like:

    
    
        rel_com() {
            for i in $(seq 1 $#); do
                for j in $(seq 1 $(bc <<< "2 ^ $(expr $i - 1)")); do
                    echo ${!i}
                done
            done
        }
        
        sort -m $(rel_com a b c) | uniq -c
    

Then each count is an N-bit number, with the i'th bit representing membership
in the i'th file.

------
ollybee
I've always used grep for this with the -f option to read a list of patterns
from a file and -v for inverted search as necessary.
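
A minimal sketch of that approach, with hypothetical helper names. The `-F`
and `-x` flags are additions not mentioned above: without them, each line of
the pattern file is treated as a regular expression that may match anywhere
in a line.

```shell
# Lines of $1 that also appear in $2 (fixed-string, whole-line matches).
grep_intersection() { grep -Fxf "$2" "$1"; }

# Lines of $1 that do not appear in $2 (-v inverts the match).
grep_difference() { grep -Fxvf "$2" "$1"; }
```

No sorting is needed, and the output keeps the order of the first file.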

------
sametmax
Nice. I've wondered several times how to have the equivalent of {1, 2, 3} ^
{3, 4} from Python, but in bash.
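
In Python, `^` on sets is the symmetric difference. One possible shell
equivalent, in the spirit of the article, is sketched below. `sym_diff` is a
made-up name, and it assumes neither file repeats a line internally.

```shell
# Lines that appear in exactly one of the two files.
sym_diff() { sort "$1" "$2" | uniq -u; }
```

With one file holding 1, 2, 3 and the other 3, 4, this should print 1, 2
and 4.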

------
wallstprog
On a related note:
[http://btorpey.github.io/blog/2017/05/10/join/](http://btorpey.github.io/blog/2017/05/10/join/)

------
kazinator
If you use _awk_, which has associative arrays, you avoid sorting, and can
keep everything POSIX.
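
A small sketch of the associative-array idea for two files. The helper names
are hypothetical. No sorting is needed and the output keeps the first file's
order.

```shell
# Both helpers load the second file into an array, then filter the first.
# Caveat: if the second file is empty, NR == FNR stays true while the first
# file is read, so awk_difference prints nothing instead of all of $1.

# Lines of $1 that also appear in $2.
awk_intersection() {
    awk 'NR == FNR { seen[$0]; next } $0 in seen' "$2" "$1"
}

# Lines of $1 that do not appear in $2.
awk_difference() {
    awk 'NR == FNR { seen[$0]; next } !($0 in seen)' "$2" "$1"
}
```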

------
beeforpork
I like bash's <(...) operator and often use diff -u with it:

    
    
      diff -u <(cat file1) <(cat file2)
    

Obviously, the 'cat' commands can be more complex commands, which make this
more interesting.

------
zouhair
No need to use cat:

    
    
      # sort a_list b_list | uniq
      a 
      b 
      c 
      d 
      e

------
emmelaich
See also

[https://github.com/pkrumins/set-operations-in-unix-shell](https://github.com/pkrumins/set-operations-in-unix-shell)

------
esterly
Thanks! Will be using these for debugging. I commonly compare `cat data.csv |
wc -l` vs `cat data.csv | sort -u | wc -l` to check whether all inputs are unique.
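
That comparison could be wrapped up along these lines. `line_counts` is a
hypothetical helper, and the "total unique" output format is arbitrary.

```shell
# line_counts FILE: print the total and the unique line count of FILE.
line_counts() {
    total=$(wc -l < "$1")
    unique=$(sort -u "$1" | wc -l)
    # Arithmetic expansion drops any padding wc may add around the number.
    echo "$((total)) $((unique))"
}
```

If the two numbers differ, the file contains duplicate lines.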

------
kbob
Aaargh! "cat | sort" is an antipattern.

Sort accepts a list of files as arguments. "sort [flags] file1 file2 ..." is
more concise, more efficient, and is less likely to run out of space.

~~~
BenjiWiebe
And for people like me who want the file name first on the line, this format
works in Bash at least:

    </dir/file sort | uniq >outfile

~~~
fiddlerwoaroof
I always find it really confusing when the command-line begins with a
redirect.

~~~
unixhero
Yes, how do you parse that, i.e. what does it even mean?

~~~
fiddlerwoaroof
It's just a convention that redirects come after the command they're
redirecting. There might be shell options that influence this, but I think all
these are generally equivalent:

    
    
        foo bar < a
        foo < a bar
        < a foo bar

~~~
paxunix
It is POSIX shell parsing behaviour that redirections can appear anywhere in
the command (obviously, not in the same word as another parameter, nor inside
a quoted string). They have to be stripped out by the shell before execution:
[http://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3...](http://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html#tag_18_07)
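
A small sketch to see this in action: the same simple command with the
redirection in three different positions. The variable names are arbitrary.
Note that this applies to simple commands. As far as I know, a compound
command such as a `while` loop only accepts redirections after it.

```shell
tmp=$(mktemp)
printf 'b\na\nb\n' > "$tmp"

after=$(sort -u < "$tmp")     # redirect after the arguments
middle=$(sort < "$tmp" -u)    # redirect between command name and argument
before=$(< "$tmp" sort -u)    # redirect before the command name

# All three run "sort -u" with the file on stdin.
[ "$after" = "$middle" ] && [ "$middle" = "$before" ] && echo "same output"
rm -f "$tmp"
```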

