
A Million Text Files and a Single Laptop - myth_drannon
http://randyzwitch.com/gnu-parallel-medium-data/
======
justinsaccount
One thing worth noting is that the author of this post is using OS X. The
default ancient BSD versions of the tools they are using (cut, tr, sort) are
HORRIBLY SLOW:

    
    
      $ wc -l big.log
       1058408 big.log
      $ cat big.log |time gtr -d \" > /dev/null
              0.38 real         0.21 user         0.14 sys
      $ cat big.log |time tr -d \" > /dev/null
             20.97 real        20.77 user         0.10 sys
    
      $ cat big.log | time gcut -f 4  > /dev/null
              0.61 real         0.59 user         0.01 sys
      $ cat big.log | time cut -f 4  > /dev/null
              2.99 real         2.95 user         0.02 sys
     
      $ cat big.log | time gsort  > /dev/null
             28.52 real        27.03 user         0.80 sys
      $ cat big.log | time sort  > /dev/null
             54.80 real        53.20 user         0.85 sys
    

If you are using OS X, you owe it to yourself to install a non-broken userland.
It's clear that Apple doesn't give a shit about Unix at this point.

~~~
dguaraglia
What installation do you recommend? Anything I can brew right away?

EDIT: I found this article: [https://www.topbug.net/blog/2013/04/14/install-and-use-gnu-command-line-tools-in-mac-os-x/](https://www.topbug.net/blog/2013/04/14/install-and-use-gnu-command-line-tools-in-mac-os-x/)
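
For what it's worth, that article seems to boil down to a few brew formulas (the GNU tools install with a `g` prefix, matching the `gtr`/`gcut`/`gsort` used above -- formula names as of current Homebrew):

    
    
        brew install coreutils   # gcat, gcut, gsort, gtr, ...
        brew install findutils   # gfind, gxargs
        brew install gnu-sed     # gsed
        brew install grep        # ggrep
    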

------
sigil
Don't do this:

    
    
        ls | parallel -m -j $f "cat {} >> ../transactions_cat/transactions.csv"
    

It only works if all files enumerated by `ls` are smaller than `PIPE_BUF`,
which is 4096 bytes on Linux (but was historically only 512 bytes). Writes
larger than `PIPE_BUF` are not guaranteed to be atomic, so if you have input
files larger than 512 or 4096 bytes, your output file can end up with chunks
of file A interleaved with chunks of file B.

If I knew for sure all input files were smaller than PIPE_BUF, I'd do this
instead:

    
    
        find . -type f -print0 | xargs -0 -P8 cat > ../transactions.csv
    

That saves you the overhead of firing up a new shell just to reopen a file for
append and exec `cat`.
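
(If you do want to rely on that, you can check the limit on your system with `getconf`; on Linux this typically prints 4096:)

    
    
        $ getconf PIPE_BUF /
        4096
    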

~~~
coderzach
I don't think that's the case: the default behavior of GNU parallel is to
buffer each job's entire output before printing it. The output will only be
interleaved if the `--ungroup` option is used. Parallel also has
`--line-buffer`, which interleaves whole lines rather than arbitrary bytes.

~~~
sigil
Are you suggesting that the _command_ argument to `parallel` is shell code
that `parallel` parses and reinterprets?

Buffering the output would make more sense, but the OP's _command_ isn't
writing to stdout; it's appending directly to a file.

~~~
gcr
Good point. The command should simply move the `>> file` redirection outside
of the quotes, and then it will work.
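
That is, something like this (a sketch of that fix, untested):

    
    
        ls | parallel -m -j $f "cat {}" >> ../transactions_cat/transactions.csv
    

With the redirection outside the quotes, parallel's default output grouping
writes each job's buffered output to the file in one piece, so nothing
interleaves.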

------
mpweiher
The "parallel concatenation" sounds like a really bad idea, partly because you
have no guarantee that the files won't get interleaved, partly because it's
solving the wrong problem.

Parallelisation brings virtually no benefit to the concatenation itself, as
the process is completely I/O bound (and again, concatenating things in
parallel will only work by accident). The only problem the parallelisation
actually solves is the overhead of starting cat for each individual file.

This overhead can be eliminated quite easily by the standard Unix find/xargs
combination, which by default will do 5000 files at a time. At that point, you
will almost certainly be maxing out your I/O devices, no parallel processing
needed, and your files will all be in a predictable order with no
interleaving.
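
Concretely, something like this (a sketch of that approach; paths assumed
from the article):

    
    
        find . -type f -print0 | xargs -0 cat > ../transactions.csv
    

xargs batches the filenames and invokes cat once per batch; every invocation
shares the single stdout opened by the shell, so the output lands in one
predictable stream.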

------
vessenes
I was curious about the performance difference between these shell scripts and
sqlite3, single threaded but fast and hardened database.

I don't have forever, so I modified the linked-to scripts to make only 12,345
customer records (two orders of magnitude fewer).

I wrote a Python importer to sqlite, which is largely CPU bound on the Python
side -- Go would be much faster, but I wanted to stick with his other tools.
20 lines of code.

Times for 'setting up the data': in his case, catting everything into one
file. In mine, splitting the data up into rows and inserting into sqlite:

    
    
        #His Data Prep
        time ls | parallel -m -j 32 "cat {} >> ../transactions_cat"
        real 0m1.218s
    
        #My Data Prep
        time pypy import.py
        real 1m54s
    
        #His Count unique products (Note that as written it overcounts)
        time cut -d ' ' -f 5- ../transactions_cat | tr -d \" | tr ',' '\n' | parallel --pipe --block 1M sort -u | sort -u | wc -l
        real 1m52s
    
        #My Count Unique products
        time sqlite3 transaction.db "select count(distinct item) from txs"
        real 0m3.033s
    
        #Transactions per day
        #(his script -- note that as written it adds up the wrong columns)
        real 0m8.976s
    
        #My transactions per day
        time sqlite3 transaction.db 'select substr(date, 1, 10) as day, count(*) from txs group by day'
        real 0m8.625s
    
    
        #Transactions per store month
        # I used the single threaded oneliner from the article, he noted it was fastest.
        time cut -d ' ' -f 2,3,5 transactions.csv | tr -d '[A-Za-z\"/\- ]' | awk '{print (substr($1,1,5)"-"substr($1,6,6)), length(substr($1,14))+1}' | awk '{a[$1]+=$2;}END{for(i in a)print i" "a[i];}' | sort
        real 0m50.886s
    
        #Mine
        time sqlite3 transaction.db ' select substr(date, 1, 7) as month, store, count(*) from txs group by month, store'
        real 0m16.908s
    
    

In each case except the import, sqlite is faster. To me the biggest win isn't
just the speed or the added expressivity (though it's a real difference for
many of these queries), but the simplicity with which correctness can be
assessed. Two of his three scripts did not do what they said on the tin, and
the third (which seems to) carried a large disclaimer that he wasn't sure
he'd properly optimized it.
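
(As an aside, for flat delimited files sqlite3 can bulk-load without any glue
code at all. A minimal sketch -- the table and column names are guesses based
on the queries above, and unlike my importer this doesn't split items into
separate rows:)

    
    
        sqlite3 transaction.db <<'SQL'
        -- hypothetical schema, inferred from the queries above
        CREATE TABLE IF NOT EXISTS txs (date TEXT, store TEXT, item TEXT);
        .mode csv
        .import ../transactions_cat/transactions.csv txs
        SQL
    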

~~~
justinsaccount
Did you run those commands on OS X as well? See my other comment above about
how slow the old BSD versions of the text-processing tools shipped with OS X
are.

~~~
vessenes
I did run them on OS X, with the default tools.

------
heinrichhartman
Yes, it's entirely possible! I once crunched some 350GB of arxiv.org papers on
an old ThinkPad:

[https://github.com/HeinrichHartmann/related-work.net/tree/master/DataImport](https://github.com/HeinrichHartmann/related-work.net/tree/master/DataImport)

Demo here: [http://dev.related-work.net/](http://dev.related-work.net/)

------
placeybordeaux
I have been doing plenty of analysis of ~400GB of JSON on a machine with 32GB
of RAM using jq, tr, sort and grep.

So far I haven't found another solution that allows for creating 'queries' so
easily. Anyone who knows a way to arbitrarily cut up more JSON than can
easily fit in RAM, let me know.
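
(One partial answer is jq itself: its `--stream` mode parses the input as a
stream of [path, value] events, so a top-level array bigger than RAM can be
broken into line-delimited objects first. A sketch using the idiom from the
jq manual; `big.json` is a placeholder:)

    
    
        # one array element per output line, without loading the whole file
        jq -cn --stream 'fromstream(1|truncate_stream(inputs))' big.json > objects.jsonl
    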

~~~
dangerlibrary
I have long been an advocate of bashreduce.

I even worked for a company that ran a cluster of storage machines that each
had tiny amounts of RAM and contained enormous (many 100GB+) CSV text files -
this was right around the time Hadoop was starting to be a thing, but the
cluster was air-gapped for security and updates to external software were
incredibly tedious. We'd write small 'cut | grep | sort | awk' sh commands
and run them using a tiny perl script called 'prsh' - parallel rsh - to
aggregate the results.

------
vessenes
I have a gut feeling that this would be far, far faster with a sqlite import,
including the time spent creating the SQL table structures.

`select count(*) ... group by` queries in sqlite are significantly faster
than the times he's showing here, at the very least.

EDIT: I went ahead and ran some tests and added the results to another
comment. In short, sqlite is significantly faster and easier to debug.

~~~
gglitch
I'm on a phone and no longer have the essay up, but I too am puzzled as to
why SQL isn't considered a good fit for this work.

~~~
okket
Maybe some misguided "SQL is black magic, avoid at any cost" bias?

I'd even pull out PostgreSQL with its mighty scripting and analyzing
capabilities. OK, speed may come at the cost of disk space and memory
consumption, and it would require a not-so-fast import...

~~~
peterwwillis
You disable stuff that slows down processing and then import the data using
the SQL engine's own LOAD DATA commands. Not that slow either.

Really though, you first have to ensure your documents all match a standard,
since a lot of people just expect CSV, TSV and other documents to all be the
same, and they are not. So you have to do extra work anyway. Might as well
use SQL too.
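
With MySQL, for instance, that pattern might look something like this (table,
column and file names are made up, and LOCAL INFILE has to be enabled on both
client and server):

    
    
        mysql mydb <<'SQL'
        -- turn off per-row bookkeeping for the duration of the bulk load
        SET unique_checks = 0;
        SET foreign_key_checks = 0;
        LOAD DATA LOCAL INFILE 'transactions.csv'
          INTO TABLE txs
          FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"';
        SET unique_checks = 1;
        SET foreign_key_checks = 1;
        SQL
    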

