
Command-line tools are fast – is Python faster? - necaris
http://necaris.dreamwidth.org/130439.html
======
dalke
"More than a third faster than the shell script approach"

That's not quite a fair comparison. It uses 'awk' as the shell script
baseline. The reference article points out that 'mawk' is faster than awk;
switching to it took the run-time from 18 seconds to 12 seconds.

Also, the 'find' isn't needed here because of the directory flattening, and
because the list of filenames in 'data/*.pgn' is so small.

Also, the original command-line code was too complicated. There's a much
easier and faster way to get the same answer. But to get there I have to point
out that the original code and the re-implementation contain a bug. They
assign "0-0" and "0--0" as a win for black, when it should be a draw.
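
The bug is easy to reproduce by feeding a few hand-written result tags through the original splitting logic. This is only a small sketch: the `classify` function name and the sample tags are made up for illustration, while the awk body follows the reference pipeline (split on "-", then look at the last character before the first "-").

```shell
# classify: hypothetical helper wrapping the original awk logic.
# It splits the line on "-" and inspects the last character of the first piece.
classify() {
    printf '%s\n' "$1" | awk '{
        split($0, a, "-")
        res = substr(a[1], length(a[1]), 1)
        if (res == 1) print "white"
        if (res == 0) print "black"
        if (res == 2) print "draw"
    }'
}

classify '[Result "1-0"]'       # white
classify '[Result "0-1"]'       # black
classify '[Result "1/2-1/2"]'   # draw
classify '[Result "0-0"]'       # black, though it should be a draw
classify '[Result "0--0"]'      # black, same problem
```

Because only the character before the first "-" is examined, anything that starts with "0" is counted for black regardless of what follows.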

Timings are on my laptop, a 2011 MacBook Pro, reporting only the real time, and
ensuring the data is in cache.

    
    
        $ time cat *.pgn > /dev/null
    
        real	0m1.523s
    

(If I purge the cache, this goes to 1m45.176s!)

Using the awk script from the reference code, without the find:

    
    
        $ time grep -h "Result" *.pgn | awk '{ split($0, a, "-"); \
           res = substr(a[1], length(a[1]), 1); if (res == 1) white++; \
           if (res == 0) black++; if (res == 2) draw++;} \
           END { print NR, white, black, draw }'
        9878269 3762840 2853647 3260769
    
        real	0m44.169s
    
    

Replace awk with gawk:

    
    
        $ time grep -h "Result" *.pgn | gawk '{ split($0, a, "-"); \
           res = substr(a[1], length(a[1]), 1); if (res == 1) white++; \
           if (res == 0) black++; if (res == 2) draw++;} \
           END { print NR, white, black, draw }'
        9878269 3762840 2853647 3260769
    
        real	0m22.970s
    
    

Use a more efficient awk script, and assign '0-0' and '0--0' as wins for
black, to be bug compatible:

    
    
        $ time grep -h Result *.pgn  | awk '/0-[01]/||/0--0/ {black++} \
             /1-0/ {white++} /1.2-1.2/ {draw++} END {print NR, white, black, draw}'
        9878269 3762840 2853647 3260769
    
        real	0m11.539s
    

Switch to gawk:

    
    
        $ time grep -h Result *.pgn  | gawk '/0-[01]/||/0--0/ {black++} \
             /1-0/ {white++} /1.2-1.2/ {draw++} END {print NR, white, black, draw}'
        9878269 3762840 2853647 3260769
    
        real	0m8.065s
    

The result is about 5.5 times faster than the original awk pipeline (8.1
seconds versus 44.2), still using only the command line.
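
If bug compatibility is not a goal, the same pattern-matching style can count "0-0" and "0--0" as draws instead of wins for black. The following is a sketch only: the `count_results` function name and the five sample tags are hypothetical, and it has not been run against the full data set, so no timing or totals are claimed for it.

```shell
# count_results: hypothetical variant of the pattern-based awk script
# that treats "0-0" and "0--0" as draws rather than wins for black.
count_results() {
    awk '/0-1/ {black++}
         /1-0/ {white++}
         /1.2-1.2/ || /0-0/ || /0--0/ {draw++}
         END {print NR, white+0, black+0, draw+0}'
}

# Five made-up result tags: one white win, one black win, three draws.
printf '%s\n' '[Result "1-0"]' '[Result "0-1"]' '[Result "1/2-1/2"]' \
              '[Result "0-0"]' '[Result "0--0"]' | count_results
```

On real data it would be fed the same way as the scripts above, for example `grep -h Result *.pgn | count_results`. The `+0` in the END block just forces a numeric 0 to be printed when a counter was never incremented.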

