
Removing duplicate lines from files keeping the original order with Awk - laz_arus
https://iridakos.com/how-to/2019/05/16/remove-duplicate-lines-preserving-order-linux.html
======
asicsp
I have a collection of such one-liners; for duplicates, including how to
form a key from multiple fields, see [1]

[1] [https://github.com/learnbyexample/Command-line-text-processi...](https://github.com/learnbyexample/Command-line-text-processing/blob/master/gnu_awk.md#dealing-with-duplicates)
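
For instance, a quick sketch of keying a dedup on more than one field (the
field numbers and file name are just for illustration):

    
    
      awk '!seen[$1,$3]++' file.txt
    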

~~~
asauce
Wow this is great. I want to become more competent with text processing in the
command line and this looks like a great place to start.

Thanks for linking this repo!

------
omh
Awk is wonderful. It's an odd way to write programs, but for quick one-off
processing tasks it almost can't be beaten.

Somewhat related blog post which I like to refer people to: "Command-line
Tools can be 235x Faster than your Hadoop Cluster"
[https://adamdrake.com/command-line-tools-can-
be-235x-faster-...](https://adamdrake.com/command-line-tools-can-
be-235x-faster-than-your-hadoop-cluster.html)

~~~
sdegutis
Articles like OP's about removing duplicate lines make me want to learn awk.

But it feels like it would be a net loss based on how seldom I currently need
to write one-off scripts.

Based on experience, I'd probably have a perfect use-case for it every 2-3
years.

~~~
sametmax
Even if you do, it's a better investment to learn a general high-level
language with strong scripting capabilities, one that is also good at many
other things.

Sure, on the days you need awk, you'll take 15 minutes instead of 2 to write
your script. So what?

But the rest of the year, you'll have a more versatile toolbox at your
disposal for automating things, testing, prototyping network processes,
making quick web sites or APIs, and exploring data sets.

That being said, I can see the point of learning awk because, well, it's fun.

~~~
ryl00
> Even if you do, it's a better investment to learn a general high-level
> language with strong scripting capabilities, one that is also good at many
> other things.

And we already have that: it's called perl. :)

~~~
sametmax
I tried very hard not to name a specific language so that the point wouldn't
get derailed by a language war.

------
augustk
And here is the ungolfed version:

    
    
      awk '{ if (! visited[$0]) { print $0; visited[$0] = 1 } }'
    

~~~
zufallsheld
Much more readable and understandable.

~~~
RBerenguel
You can always write AWK in a file and read the script with -f, making it
fully readable (and AWK is quite a pleasantly readable and surprisingly
versatile language to write at that point)
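
For instance (the file name here is just illustrative), you could put the
article's one-liner in dedupe.awk:

    
    
      !visited[$0]++
    

and run it with awk -f dedupe.awk input.txt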

~~~
asicsp
To add to this: if you've coded the one-liner first, you can convert it to a
script using the -o option.

For example:

    
    
        awk -o '{ORS = NR%2 ? " " : RS} 1'
    

gives (default output file is awkprof.out)

    
    
        {
            ORS = (NR % 2 ? " " : RS)
        }
    
        1 {
            print $0
        }

~~~
RBerenguel
I wasn't aware of this; it might come in handy for "one-liner edge cases".

------
kazinator
You need "gawk -M" for this for bignum support, so visited[$0]++ doesn't wrap
back to zero, otherwise it is not correct for huge files with huge numbers of
duplicates.
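
That is, something along these lines (a sketch):

    
    
       gawk -M '!visited[$0]++' file
    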

The portable one-liner that doesn't suffer from integer wraparound is actually

    
    
       awk '!($0 in seen) { seen[$0]; print }'
    

which can be golfed a bit:

    
    
       awk '!($0 in s); s[$0]'
    

$0 in s tests whether the line exists in the s[] assoc array. We negate that,
so we print if it doesn't exist.

Then we unconditionally execute s[$0]. This has an undefined value that
behaves like Boolean false. In awk if we mention an array location, it
materializes, so this has the effect that "$0 in s" is now true, though s[$0]
continues to have an undefined value.

~~~
zimpenfish
> huge files with huge numbers of duplicates

At least on the stock macOS awk, you can get up to 2^53 before arithmetic
breaks (it doesn't wrap, it just doesn't go up any more, which means the
one-liner still works).

    
    
        > echo '2^53-1' | bc
        9007199254740991
        > seq 1 10 | awk 'BEGIN{a[123]=9007199254740991;b=a[123]}{a[123]++}END{print a[123],b,a[123]-b}'
        9007199254740992 9007199254740991 1
    

Even with one character per line, you'd need an 18PB file before you got to
this limit, afaict.

------
hjk05
On a previous post people were complaining that math wasn’t as clear as code.
I’d argue that this is exactly the kind of code-like clarity math notation
provides you. It makes perfect sense, but only after 2 full pages describing
what’s going on in one line.

------
mehrdadn
Where I run into trouble with awk is gawk incompatibilities with the
implementation on Mac. The gawk manual really sucks at telling you what
exactly is an extension to the language, and I haven't been able to find a
good source -- you just have to either guess and check, or cross-check
against other implementations' manuals (like BSD's). Otherwise it's an
amazing tool...

~~~
asicsp
I thought the gawk book/documentation [1] did a good job of mentioning
differences between various implementations; do you have an example?

You might find this [2] helpful (oops, seems like it got deleted, see [3] -
thanks @bionoid)

[1]
[https://www.gnu.org/software/gawk/manual/gawk.html](https://www.gnu.org/software/gawk/manual/gawk.html)

[2]
[https://www.reddit.com/r/awk/comments/4omosp/differences_bet...](https://www.reddit.com/r/awk/comments/4omosp/differences_between_gawk_nawk_mawk_and_posix_awk/)

[3] [https://archive.is/btGky](https://archive.is/btGky)

~~~
mehrdadn
> do you have an example?

Sure, try this:

    
    
      echo 1 2 | awk '{ print gensub(/1/, "3", "g", $1); }'
    

The logical thing for them to do would be to mention in bold and/or big and/or
red font under gensub's documentation that it's an extension (e.g. try nawk),
whereas looking through it I don't see any mention at all:
[https://www.gnu.org/software/gawk/manual/html_node/String-Fu...](https://www.gnu.org/software/gawk/manual/html_node/String-Functions.html#String-Functions)
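
For comparison, the same substitution can be done portably with POSIX gsub,
which modifies the field in place instead of returning a new string:

    
    
      echo 1 2 | awk '{ gsub(/1/, "3", $1); print $1 }'
    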

If I may rant about this for a bit, GNU software manuals are generally rather
awful (though they're neither alone in this nor is it impossible to find
exceptions). They frequently make absolutely zero effort to display important
information more prominently and unimportant information less so (if you're
even lucky enough that they tell you the important information in the first
place). Like if passing --food will accidentally blow up a nuke in your
hometown, you can expect that if they documented it at all, they just casually
buried it in the middle of some random paragraph. Their operating assumption
seems to be that if you can't be bothered to spend the next 4 hours reading a
novel before writing your one-liner then it's just obviously your fault for
sucking so much.

~~~
Liquid_Fire
While I agree it should be more obvious, it does say in the opening section:

> Those functions that are specific to gawk are marked with a pound sign
> (‘#’). They are not available in compatibility mode (see section Command-
> Line Options)

~~~
mehrdadn
Oh dear lord. I've looked at that page probably twenty times in the past year
and still not seen the note about that pound sign. Thanks for pointing it out.
Man it's infuriating.

------
aidos
This is a nice little run-through of a real-life example.

Once you realise that awk's model is _'match pattern { do actions; }'_,
everything makes a whole lot more sense.
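
For example (the log file name is just a placeholder):

    
    
      awk '/ERROR/ { print $1, $2 }' app.log
    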

~~~
sohkamyung
Awk also supports BEGIN and END actions that take place at the start and at
the end of execution.

BEGIN might be used to initialise Awk variables or print initial messages,
while END can be used to print a summary of actions at the end.

See [1]
[https://www.grymoire.com/Unix/Awk.html#uh-1](https://www.grymoire.com/Unix/Awk.html#uh-1)
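
A small sketch of both (the input file name is hypothetical):

    
    
      awk 'BEGIN { print "summing field 1" } { total += $1 } END { print "total:", total }' numbers.txt
    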

~~~
RBerenguel
You can also use BEGIN or END as the only entry point, essentially using AWK
as you'd use any other programming language. Yes, sometimes this defeats the
point of what AWK excels at, but it's good to know.
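
For instance, a program that never reads any input:

    
    
      awk 'BEGIN { for (i = 1; i <= 5; i++) print i, i * i }'
    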

------
lifeisstillgood
There was a twitter thread ages ago where someone had written a collection
of (php?) utilities, and the twitterer posted a laughing slap-down saying why
write a utility when this one-liner and that one-liner will do?

There was a lot of pushback, and this article is a good example of why.

If I wanted to remove duplicate lines from a file I would almost certainly
_not_ use awk.

I have never spent the time to get good enough with the _whole new and
different language_ of awk, and am unlikely to need to (my large-scale file
processing needs seem small, and when I do have them it's almost always in
the context of other processing chains, so a normal language like python
would be the natural choice).

I could whip up something like this in python in less time than it would
take to google the answer, read up on why the syntax works that way and
verify I have not mistyped anything on a few test files.

Basically using awk takes me out of my comfort zone - for a one-off task it
loses me time, and for a production-like repeat task I am going to reach for
a slew of other solutions.

I mean the title of this page loses the exclamation mark - and it took me two
goes to spot it.

~~~
oftenwrong
For many years my AWK knowledge was limited to basic '{print $1}' style usage.
I never bothered to learn more. I tended to use perl when I needed a custom
text-processing operation. Later, as perl became less a part of my working
life, I began using ruby instead - they are pretty similar in spirit.

One day, nearly 20 years after it was published, I picked up a used copy of
The AWK Programming Language by Aho, Kernighan, and Weinberger. Yes, they are
credited in that order on the cover... I suspect intentionally. I only read
the first N chapters, but it was enough. I used AWK many times within the
following month, and I continue to use AWK on a daily basis. When the task is
complicated, I will still use ruby, but often enough AWK is easier.

The point: you think "Why would I learn X when I can use Y?", but you won't
really know the answer until you learn X. If I had never learned perl, python,
ruby, AWK, shell script, vi macros, then I would probably be editing files by
hand (!) like I sometimes catch developers actually doing (!!!). For a person
who doesn't know these tools, that might actually be the path of least
resistance. Investing some time here and there to learn new tools pays off in
the future in ways that are unpredictable.

~~~
jacobolus
The basic imperative Python version is much easier to remember and read
though, even for not-that-experienced Python programmers. I would expect
laypeople to be able to more-or-less figure out what it is supposed to do.

    
    
      seen = set()
      with open(filename, "r") as file:
        for line in file:
          if line not in seen:
            print(line, end="")  # line already includes its trailing newline
            seen.add(line)
    

Often (at least in my experience) this kind of operation is either (a) part of
some larger automated data processing pipeline for which it’s really nice to
have version control, tests, ... or (b) part of some interactive data
exploration by a programmer sitting at a repl somewhere, not just a one-off
action we want to apply to one file from the command line.

In those contexts, the Python (or Ruby or Clojure or whatever general-purpose
programming language) version is easy to type out more-or-less bug-free from
memory, debug when it fails, slot into the rest of the project, modify as part
of a team with varied experience, etc. etc.

~~~
skinner_
One advantage is that

    
    
      seen.add(line)
    

can be changed to

    
    
      seen.add(hash(line))
    

which can be significantly more memory efficient for files with long lines
(with the membership test changed to check hash(line) as well).

~~~
jacobolus
Or perhaps better: if needs change, the seen = set() object can be swapped
out for any alternative object seen = foo that provides foo.__contains__ and
foo.add methods.

This could involve saving previously seen lines in a radix tree, adding
multiple layers of caching, saving infrequently seen lines to disk or over the
network, etc. as appropriate for the use case.

------
rusk
Got to love awk. My weapon of choice for ad hoc arbitrary text processing and
data analysis. I’ve tried to replace it with more modern tools time and again
but nothing else really comes close in that domain.

~~~
vidarh
Interestingly, Ruby (MRI anyway) has command-line options that make it act
pretty similar to awk:

-n adds an implicit "while gets ... end" loop. "-p" does the same but prints the contents of $_ at the end. "-e" lets you put an expression on the command line. "-F" specifies the field separator like for awk. "-a" turns on auto-split mode when you use it with -n or -p, which basically adds an implicit "$F = $_.split" to the while gets .. end loop.

So "ruby -[p or n]a -F[some separator] -e ' [expression gets run once every
loop]'" is good for tasks that are suitable for "awk-like" processing but
where you may need access to other functionality than what awk provides..
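
For instance, the article's dedup could presumably be written in that style
as something like:

    
    
      ruby -ne 'print $_ unless ($s ||= {})[$_]; $s[$_] = true' file.txt
    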

~~~
asicsp
I'd say it is more similar to perl than awk for options like -F -l -a -n -e -0
etc. And perl borrowed stuff from sed, awk, etc.

I have a collection for ruby one-liners too [1]

[1] [https://github.com/learnbyexample/Command-line-text-processi...](https://github.com/learnbyexample/Command-line-text-processing/blob/master/ruby_one_liners.md)

~~~
rusk
Sed and Awk had a child and named her Perl. When Perl grew up she underwent an
epigenetic shift and became Ruby!

------
oftenwrong
See also, 'nauniq' (non-adjacent uniq), an implementation of this text-
processing task as a full utility, with some convenient options for reducing
memory usage:

[https://metacpan.org/pod/distribution/App-nauniq/script/naun...](https://metacpan.org/pod/distribution/App-nauniq/script/nauniq)

------
julienfr112
I was wondering how awk works internally. Does it compile the script and then
run it? Is it bytecode or lower level? A finite state machine?

~~~
benhoyt
They parse the script to a parse tree (abstract syntax tree) and then either
interpret that directly (tree-walking interpreter) or compile to bytecode and
then execute that. The original awk ("one true awk") uses a simple tree-
walking interpreter, as does my own GoAWK implementation. gawk and mawk are
slightly faster and compile to bytecode first.

If you're interested, you can read more about how GoAWK works and performs
here:
[https://benhoyt.com/writings/goawk/](https://benhoyt.com/writings/goawk/)

~~~
srean
Did gawk change to a bytecode interpreter recently? As far as I can recall it
used to be an AST walker. Wonder if I am misremembering things. Mawk used to
be more than 'slightly' faster than gawk. Maybe the recent change to bytecode
has brought their perf characteristics closer.

------
reacweb
I have taken it as a rule to use awk only for trivial tasks and to switch to
perl as soon as the syntax is slightly beyond my usual use cases. In perl, I
would do:

    
    
        perl -nle 'print unless exists $h{$_};$h{$_}++' < your_file

~~~
asicsp
perl borrowed stuff from awk, so you could also do

    
    
        perl -ne 'print if !$seen{$_}++'

~~~
rlonstein
Play a round of perl golf?

    
    
        perl -pne '$_=$#$_++?$_:""'
    

I'm rusty at this but shaved off six chars, five if you count the 'p' added to
switches.

~~~
reacweb
I have never seen this $#$var trick and google is not the friend of perl
operators. Do you have any explanation?

    
    
        perl -ne 'print if!++$#$_'
    

seems to work also

~~~
showdead
If you have array @foo in perl, $#foo is the index of the last element of the
array, which is just the size of the array minus 1. So if @foo is undefined,
$#foo is -1.

Using a variable instead of 'foo' is a symbolic reference, so this is
effectively using the symbol table as the associative array. This means that
this solution also gets it wrong if your file contains a line that matches the
name of a built-in variable in perl. That would be tough to debug!

If your file contains

    
    
        This is the first line of the file
    

then during execution of

    
    
        ++$#$_
    

the result is the same as if you had written

    
    
        ++$#{This is the first line of the file}
    

So the variable @{This is the first line of the file} goes from undefined to
an array of length 1, turning $#{This is the first line of the file} to 0.

Incidentally, this is why the snippet fails to work for a line repeated more
than once: for each occurrence of the expression, the value returned is in the
sequence -1, 0, 1, 2, 3, ... so it is only false for the second occurrence.

Using preincrement instead of postincrement means the values returned are 0,
1, 2, 3, ... which means that inverting the test makes it false for every
occurrence after the first.

------
jancsika
Me: Wow, that associative array looks very powerful. Is there a way to
leverage it to do something useful like convert curl-obtained JSON array of
file patches from Github's API to the mbox format that `git am` expects?

Unix: No, that JSON data is too structured. But if you have a more error-prone
format like CSV I can show you a neat trick to filter your bowlers by number
of spares.

~~~
james_s_tayler
can't u just process JSON data with jq?

[https://stedolan.github.io/jq/](https://stedolan.github.io/jq/)

~~~
jancsika
I'd still need to find a way to massage the data into the mbox format because
_that_ is the ancient format that git understands.

I'm not saying that there isn't a way to do that. Only that it can only be
done poorly with a big ugly (and probably buggy) spaghetti script that looks
nothing like what the expressive demo suggests it should look like.

~~~
james_s_tayler
If you're getting a bunch of stuff with curl from GitHub can't you just use
curl to get the patches directly from github?

Append .patch to the end of a PR or commit URL and it spits out the
mbox-formatted patch.

[https://github.com/jiphex/mbox/commit/f139c575e306a1691a31d8...](https://github.com/jiphex/mbox/commit/f139c575e306a1691a31d8c2f4b44f48984b1267.patch)

~~~
jancsika
If you do that with curl it will redirect to the login page. Github obviously
wants me to use their API.

I assume this wasn't always the case as the use case I'm referencing is a
build script I'm debugging.

~~~
james_s_tayler
I just tried running curl on it now and it came back fine without redirecting
me.

For a private repo you just need to set the correct options to curl.

curl -Lk --cookie "user_session=your_session_cookie_here"
[https://github.com/your_org/your_project/pull/123.patch](https://github.com/your_org/your_project/pull/123.patch)

~~~
jancsika
And what is the unix tool I use to automatically retrieve the session cookie
and paste it there?

~~~
james_s_tayler
Naturally, you use curl. You just instruct curl to save the cookies that the
login process sends back to you. Then you reuse those cookies on the
subsequent request.

Found an example here.

[https://gist.github.com/d48/3501047](https://gist.github.com/d48/3501047)

------
hk__2
The tradeoff of this solution is it stores all (unique) lines in memory. If
you have a large file with a lot of unique lines you might prefer using `sort
-u`, although it doesn’t keep the order.

~~~
triangleman
Are you sure this is the case? Perhaps it's only storing the hash of the
lines? If not then how do you do that?

IME this one-liner can churn through 100MB of log lines in a second. Other
solutions like powershell's "select-object -unique" totally choke on the file.

~~~
hk__2
It can’t just store the hash of the lines otherwise it would drop lines in
case of hash collision.

~~~
n4r9
It depends on the implementation, but typically hash tables are used to store
the elements and values of associative arrays:

[https://www.gnu.org/software/gawk/manual/html_node/Array-Int...](https://www.gnu.org/software/gawk/manual/html_node/Array-Intro.html)

I suspect that it's designed so that hash collisions are impossible until you
get to an unrealistic number of characters per line.

~~~
michaelmior
I doubt it's designed to silently break in some cases. Unrealistic isn't
realistic until one day it is and that is a bad day. I suppose it could just
throw an error in the case of a hash collision, but I doubt it.

~~~
n4r9
But what does it do, then? The page I linked states that it uses a hash table.
Hash tables apply a hash function to the key. Hash functions map arbitrary
input data onto data of a fixed size. It's inevitable that collisions will
occur. ~~even if you use some sort of clever workaround in the case of
collisions, eventually you use up all the available outputs.~~ (my bad)

I'm not claiming that it will silently break! I'd be very interested in
exploring the internals a little more and finding out how hard it is to get a
collision in various implementations and how they behave subsequently.

EDIT: I've read chasil's comment and agree that it must be storing raw keys in
the array. I guess awk uses separate chaining or something to get around hash
collisions.

------
founderling
Awk ' visited[$0]++' or how badly HN's automatic title formatter messes up
awk commands :)

Hint: You can edit it after you posted.

~~~
laz_arus
Done, thanks for the hint :)

~~~
founderling
You are !done yet.

~~~
kreetx
I think you meant !done[$bug]++ ?

------
mitnk
The man page of (n)awk [0][1] is surprisingly short and readable.

[0] `man awk` on mac

[1] online version
[https://www.mankier.com/1/nawk](https://www.mankier.com/1/nawk)

[2] gawk's man page works great as a reference
[https://www.mankier.com/1/gawk](https://www.mankier.com/1/gawk)

------
aabbcc1241
If I had known this, I wouldn't have made
[https://github.com/beenotung/uniqcp](https://github.com/beenotung/uniqcp)

------
triangleman
So then, what is the one-liner to preserve the filename rather than get a new
deduped.txt?

Also how do you apply that command to the next file using shell history?

~~~
gcmeplz
Use sponge!
[https://linux.die.net/man/1/sponge](https://linux.die.net/man/1/sponge)

    
    
      awk '!n[$0]++' fileName | sponge fileName
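
If you have GNU awk 4.1 or newer, the inplace extension is another option (a
sketch in the same spirit):

    
    
      gawk -i inplace '!n[$0]++' fileName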

------
pvaldes
open the file with emacs

menu edit -> select all

press 'escape' and 'x' keys together and write: delete-duplicate-lines

done

------
bigato
This is very memory intensive, which may not matter if the data volume is
small enough. But it is also a bit hard to understand, or at least not so
obvious at first sight. For most use cases sort -u would be ideal and way
simpler to understand, if you don't mind having a sorted file as output.

~~~
mitnk
> This is very memory intensive.

Only for ones not familiar with awk.

It would make a lot of sense after you understand how awk works (as the
article explains).

~~~
anc84
That does not make any sense. Whether it is memory intensive depends on awk,
not on the person being familiar with it.

So is it memory intensive or not?

~~~
chasil
The example AWK script will build an array of every unique line of text in the
file.

If the file is large, and mostly unique, then assume that a substantial
portion of the file will be loaded into memory.

If this is larger than the amount of RAM, then portions of the active array
will be paged to the swap space, and the drive will thrash as each new line
is read, forcing a complete rescan of the array.

This is very handy for files that fit in available ram (and zram may help
greatly), but it does not scale.

~~~
mannykannot
I don't know how awk (or this particular implementation) works, but it could
be done such that comparing lines is only necessary when there is a hash
collision, and also, finding all prior lines having a given hash need not
require a complete rescan of the set of prior lines - e.g. for each hash, keep
a list of the offsets of each corresponding prior line. Furthermore, if that
'list' is an array sorted by the lines' text, then whenever you find the
current line is unique, you also know where in the array to insert its offset
to keep that array sorted - or use a trie or suffix tree.

~~~
michaelmior
Sure, you only need to compare when there's a hash collision, but you still
need to keep all the lines in memory for later comparison.

~~~
mannykannot
Sure (though they could be in a compressed form, such as a suffix tree), but
that wasn't the issue I was addressing.

