
Don't MAWK AWK - the fastest and most elegant big data munging language - blasdel
http://anyall.org/blog/2009/09/dont-mawk-awk-the-fastest-and-most-elegant-big-data-munging-language/
======
philh
Perl was written partly as a replacement for awk, and as such it has command-
line switches that make it more suitable than it might appear. You could get
very similar behaviour with a much shorter implementation using `perl -nai~`,
something like:

    
    
        BEGIN { open(VOCAB, ">vocab"); }
        if (!$imap{$ARGV}{$F[0]}) {
          $imap{$ARGV}{$F[0]} = ++$I{$ARGV};
        }
        if (!$jmap{$F[1]}) {
          $jmap{$F[1]} = ++$J;
          print VOCAB $F[1] . "\n";
        }
        print "$imap{$ARGV}{$F[0]} $jmap{$F[1]} $F[2]\n"
    

Apart from the BEGIN line, that's almost a direct translation of the awk. A
lot uglier, but for one-off things that isn't much of a problem.

(And if you want to claim awk has a three-line implementation, this is four
lines.)

Admittedly, it's not quite the same - instead of putting output from file1 in
file1n, it renames file1 to file1~ and puts its output back in file1. If you
want to change that, you have to add your own file-handling code. That would
only be a few lines. And it's probably never going to be as fast as mawk.

There are other cases where I suspect perl would beat awk, but maybe get
beaten by sed. Not to rain on awk's parade or anything - it's still cool. Just
not _that_ much cooler than perl. :)

~~~
brendano
aha, very nice! I was wondering how to do the awk-style structure in perl; it
was unfair of me not to research it.

Maybe it's just me, but I find it much harder to read than the awk syntax,
mostly because of the dollar signs. I think it's pretty crowded as a
four-liner. Awk's condition-action syntax helps a little here too.

    
    
        BEGIN { open(VOCAB, ">vocab"); }
        if (!$imap{$ARGV}{$F[0]}) { $imap{$ARGV}{$F[0]} = ++$I{$ARGV}; }
        if (!$jmap{$F[1]}) { $jmap{$F[1]} = ++$J; print VOCAB $F[1] . "\n"; }
        print "$imap{$ARGV}{$F[0]} $jmap{$F[1]} $F[2]\n"

------
aw3c2
sed, grep and awk are among the major reasons why I love Linux so much. It
took months until I first used them; now I use them daily, and they've made me
so much more productive than before.

------
fizx
Silly bash function I use all the time.

    
    
      function f {
        awk '{print $'$1'}'
      }
    
      cat tab-separated | f 2 > just-the-2nd-column

~~~
bsaunder
cut -f2 tab-separated > just-the-2nd-column

~~~
mmt
doesn't work if it's any-amount-of-whitespace separated

~~~
obecalp
-d'<tab>'

~~~
mmt
Nope:

    echo 'a b' | cut -d' ' -f2
    a b
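To make the difference concrete, here's a small sketch contrasting awk's any-run-of-whitespace splitting with cut's single-character delimiter (note the two spaces in the sample input):

```shell
# awk splits on any run of whitespace, so field 2 is 'b'
printf 'a  b\n' | awk '{print $2}'
# cut treats every single space as a delimiter, so field 2 is the
# empty field sitting between the two spaces
printf 'a  b\n' | cut -d' ' -f2
```

That's why `cut` can't substitute for the awk-based `f` function whenever the columns are separated by variable amounts of whitespace.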

------
neilc
1GB isn't exactly "Big Data". I'd expect most truly Big Data tasks to be more
I/O bound than computation bound -- at least if your "computation" consists of
text parsing and hash table lookups.

That said, it's interesting that mawk is so fast.

~~~
fizx
Depends. If you do a naive Ruby implementation, then you'll be CPU-bound quite
quickly.

    
    
      #!/usr/bin/env ruby
      while line = STDIN.gets
        puts line.split(/\s+/).first
      end
    

This pegs my CPU at only 2MB/s, well below the I/O capabilities of any modern
system. I guess the tool you're using matters, which I think was the original
point.
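For comparison, the equivalent of that Ruby loop is a one-liner in awk, and mawk in particular tends to run it much faster than the Ruby version (a sketch with inline sample data):

```shell
# Same job as the Ruby loop above: print the first
# whitespace-separated field of every line
printf 'foo bar\nbaz qux\n' | awk '{print $1}'
```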

------
henning
A pretty good showing from Java on this, even though Java's I/O system is
pretty annoying. I don't think the implementation he shows there is _too_
odious if you're used to Javaland pain.

~~~
fizx
It's not the end of the world. It just adds a few lines, but for multi-GB text
processing the runtime speedup and the decent concurrency support are worth
it.

    
    
      import java.io.BufferedReader;
      import java.io.InputStreamReader;
    
      public class Foo {
        public static void main(String[] args) throws Exception {
          BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
          String line;
          while ((line = reader.readLine()) != null) {
            // process each line here
          }
        }
      }

~~~
ZitchDog
It's not hard to write a utility that wraps an InputStream in an Iterator so
you can do things like:

    
    
      for(String line : readLines(System.in)) {
        //do something with line here
      }

~~~
fizx
Like org.apache.commons.io.IOUtils.lineIterator? Ultimately, I choose to
either be a Maven project and require half the world, or just a single file
that's easily compiled.

If it's the latter, I don't bother creating many abstractions.

------
mathogre
I used mawk a long time ago, but it became stale. Last version I used, I
believe, was 1.3.3. It was excellent - fast and accurate. I crunch a lot of
data, and it always outperformed gawk. I migrated away from it when it would
no longer compile on a Linux system. As I still had gawk, and gawk was fast
enough, I left mawk behind.

Now I'll have to see if I can get it to run on OS X. Hmmm... ;)

UPDATE

It's available on MacPorts. It should be on my machines tonight.

~~~
jcw
I'm installing it right now.

------
kvs
I wonder if the C/C++ code compiled with LLVM/Clang would make a dent in the
run time?

------
skwiddor
Use it in Unicode mode; that kills performance.
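If that's the bottleneck, forcing the byte-oriented C locale is a common workaround for gawk's multibyte slowdown (a sketch; for pure-ASCII data the output is identical either way):

```shell
# In a multibyte (UTF-8) locale, gawk processes input character by
# character; LC_ALL=C forces byte-oriented handling, which is much
# faster on large inputs
printf 'a b\nc d\n' | LC_ALL=C awk '{print $2}'
```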

