Sort uses only a fixed amount of memory, you can sort files larger than memory, ...

justinsaccount · on Jan 17, 2016

'sort | uniq' is another special case of this, and it is much better to replace that with 'sort -u'

the 'sort' in 'sort | uniq' doesn't know you are going to be throwing away all the duplicate data.

If anyone is wondering, here is an implementation of the python approach i have lying around:

  #!/usr/bin/env python2
  import sys
  from collections import defaultdict
  
  c = defaultdict(int)
  
  for line in sys.stdin:
      c[line] += 1
  
  top = sorted(c.items(), key=lambda (k,v): v)
  for k, v in top:
      print v, k,

heydenberk · on Jan 17, 2016

Just for fun, here's a version using `Counter` from the same `collections` module which makes that blissfully simple:

    #!/usr/bin/env python2
    import sys
    from collections import Counter

    for pair in Counter(sys.stdin).most_common():
        print pair

thebelal · on Jan 17, 2016

This is even easier to do with awk.

    awk -e '!a[$0]++'

This also preserves the original input order, which is a nice property.

jtwaleson · on Jan 17, 2016

Don't you mean a python set? But yes, for use cases containing many duplicates where the result easily fits in memory, that is probably the fastest.

bacr · on Jan 17, 2016

Fun fact: they are nearly the same implementation. See: http://markmail.org/message/ktzomp4uwrmnzao6

masklinn · on Jan 17, 2016

As one would generally expect, the backing store of most hashsets is little more than a hashmap with zero-sized/no values.

In fact, that's exactly how Rust's standard library hashset is implemented since rust supports zero-sized types "in userland" (and unit `()` is a ZST):

    pub struct HashSet<T, S = RandomState> {
        map: HashMap<T, (), S>
    }

http://doc.rust-lang.org/src/std/collections/hash/set.rs.htm...

justinsaccount · on Jan 17, 2016

a set replaces 'sort -u' or 'sort | uniq'. A dictionary replaces 'sort | uniq -c'

fenollp · on Jan 17, 2016

Or sort -u` instead of `sort | uniq`.