Hacker News new | past | comments | ask | show | jobs | submit login

Sort uses only a fixed amount of memory, you can sort files larger than memory, but for such situations where you have only a few tens of millions of distinct values you can just use a python dictionary and it works even faster. While sort would shuffle data around a lot, the memory dictionary would just hold a key and a count as it gobbles the logs. It works because it is a special case of sorting where there are relatively few different values relative to the count of the whole list.



'sort | uniq' is another special case of this, and it is much better to replace that with 'sort -u'

the 'sort' in 'sort | uniq' doesn't know you are going to be throwing away all the duplicate data.

If anyone is wondering, here is an implementation of the python approach i have lying around:

  #!/usr/bin/env python2
  import sys
  from collections import defaultdict
  
  c = defaultdict(int)
  
  for line in sys.stdin:
      c[line] += 1
  
  top = sorted(c.items(), key=lambda (k,v): v)
  for k, v in top:
      print v, k,


Just for fun, here's a version using `Counter` from the same `collections` module which makes that blissfully simple:

    #!/usr/bin/env python2
    import sys
    from collections import Counter

    for pair in Counter(sys.stdin).most_common():
        print pair


This is even easier to do with awk.

    awk -e '!a[$0]++'
This also preserves the original input order, which is a nice property.


Don't you mean a python set? But yes, for use cases containing many duplicates where the result easily fits in memory, that is probably the fastest.


Fun fact: they are nearly the same implementation. See: http://markmail.org/message/ktzomp4uwrmnzao6


As one would generally expect, the backing store of most hashsets is little more than a hashmap with zero-sized/no values.

In fact, that's exactly how Rust's standard library hashset is implemented since rust supports zero-sized types "in userland" (and unit `()` is a ZST):

    pub struct HashSet<T, S = RandomState> {
        map: HashMap<T, (), S>
    }
http://doc.rust-lang.org/src/std/collections/hash/set.rs.htm...


a set replaces 'sort -u' or 'sort | uniq'. A dictionary replaces 'sort | uniq -c'


Or sort -u` instead of `sort | uniq`.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: