
Running Awk in parallel to process 256M records - ketanmaheshwari
https://ketancmaheshwari.github.io/posts/2020/05/24/SMC18-Data-Challenge-4.html
======
tetha
Hm. I'm fully aware that I'm currently turning into a bearded DBA, and I may
just be misreading or misunderstanding the article.

But, I started being somewhat confused by something:

> Fortunately, I had access to a large-memory (24 T) SGI system with 512-core
> Intel Xeon (2.5GHz) CPUs. All the IO is memory (/dev/shm) bound ie. the data
> is read from and written to /dev/shm.

> The total data size is 329GB.

At first glance, that's an awful lot of hardware for a ... decently sized but
not awfully large dataset. We're dealing with datasets of that size on 32G or
64G of RAM -- just a wee bit less.

The article presents a lot more AWK knowledge than I have. I'm impressed by
that. I acknowledge that.

But I'd probably put all of that into a postgres instance, compute indexes and
rely on automated query optimization and parallelization from there. Maybe
tinker with PG-Strom to offload huge index operations to a GPU. A lot of the
scripting shown would be handled by postgres, the parallelization happens
automatically based on indexes, and the string serialization overhead goes
away.

I do agree with the underlying sentiment of "We don't need hadoop". I'm
impressed that AWK goes so far. I'd still recommend postgres in this case as a
first solution. Maybe I just work with too many silly people at the moment.
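
A minimal sketch of that approach (the table layout, column names and file
names are invented for illustration):

    # load the TSV into postgres and let the planner parallelize the queries
    createdb pubs
    psql pubs -c "CREATE TABLE papers (id text, year int, title text, abstract text);"
    psql pubs -c "\copy papers FROM 'papers.tsv'"    # text format defaults to tab-delimited
    psql pubs -c "CREATE INDEX ON papers (year);"
    psql pubs -c "SET max_parallel_workers_per_gather = 8;
                  SELECT year, count(*) FROM papers GROUP BY year ORDER BY year;"

From there, derived tables and aggregations are plain SQL rather than
hand-rolled string processing.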

~~~
VHRanger
If you're dealing with such a static single-machine dataset, why not go for
SQLite instead of PostGres?

There'd be much less setup overhead

~~~
tetha
Sqlite is single-writer for a single database. I mean. Of course, VHRanger,
only your team will need to write to the database, and your team will make
sure that only one person on your team writes to it. Eh. I've been there
too many times. Oh, but yes, your team will also figure out the fallout if
things go wrong. Ah..

Ok. Maybe those are enterprise concerns: Sqlite doesn't scale to multiple
users reading and writing. Of course it's a read-only dataset, but do you know
the bouquet of views and derived tables data scientists create around a
read-only dataset? Hah. Oh, and of course those are not critical, but if they
get lost, shit hits the fan because it takes multiple weeks to rebuild them.

I've been in that swamp enough times to just install postgres and stop caring.
Takes me 2 more hours now, but avoids weeks of discussions in the future.

~~~
Volt
Remember that we're comparing this to awk…

------
tobias2014
I'm sure that with tools like MPI-Bash [1] and more generally libcircle [2]
many embarrassingly parallelizable problems can easily be tackled with
standard *nix tools.

[1] [https://github.com/lanl/MPI-Bash](https://github.com/lanl/MPI-Bash) [2]
[http://hpc.github.io/libcircle/](http://hpc.github.io/libcircle/)

~~~
pdimitar
I have lately used a shell `for` loop that emits only some of the iterated
values (not all are eligible for further processing) and fed the loop's output
directly to the `parallel` tool via the pipe operator.

The results were indeed embarrassingly parallel.
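
Something along the lines of this minimal sketch (the file glob, the
eligibility check and the per-item command are all invented):

    # emit only the files worth processing, then fan the work out to all cores
    for f in data/*.log; do
        [ -s "$f" ] && echo "$f"    # hypothetical check: skip empty files
    done | parallel -j "$(nproc)" gzip -9 {}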

I am a fan of some languages that seem better equipped to utilise our modern
many-core machines and I'd still write a longer-living system with better
guarantees in those languages -- but many people ignore shell goodness at
their own peril.

~~~
mturmon
Big agree. GNU parallel, xargs, and make -j are all very useful basic tools
for embarrassingly parallel workloads.

I've been developing simulation software that does Monte Carlo over
realizations of a simulated universe, and xargs and (later) parallel were
super useful for parts of the workload. All the parallel job instances run the
same simulation code, but for a different simulated universe, each controlled
by a random number seed, so you can generate an ensemble of simulations with
basically:

    
    
       head -n NumberOfSimsWanted seeds.txt | xargs -n 1 -P 8 simulation.py
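
The GNU parallel near-equivalent (assuming simulation.py takes a single seed
as its argument) would be roughly:

    head -n NumberOfSimsWanted seeds.txt | parallel simulation.py {}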

~~~
eythian
I have lately been working on a genetic algorithm system, written in single-
threaded Perl. When I started off wanting to do, say, 20 simulations, I just
used parallel, and it worked great on the 8 threads of my desktop machine. Now
that I've done a bit more and want to do long-running simulations that might
take a few hours, I've taken to just firing it all at AWS Batch and having a
script that uploads the results to S3. Now I can do 1-200 instances at the
same time, which my own hardware isn't up to in a reasonable timeframe.

~~~
pdimitar
I have a pretty strong workstation (iMac Pro) that I currently don't use all
the time, and a gaming PC that I've barely touched in the last 9-10 months.
Hit me up if you want me to lend you some CPU power sometime.

~~~
eythian
Cheers, but the configuring of it would be more hassle than it's worth, I
suspect. As it is, using AWS is very cheap as I can have it use spot
instances. So hundreds of CPU-hours used for on the order of US$10-20.

~~~
pdimitar
I suspect setting that up would indeed be a hassle. :)

Thanks for mentioning AWS Batch, I didn't know about it and will look it up.

------
tannhaeuser
It's odd that TFA has this focus on performance but doesn't mention _which_
awk implementation has been used; at least I haven't found any mention of it.
There are 3-4 implementations in mainstream use: nawk (the one true awk, an
ancient version of which is installed on Mac OS by default), mawk (installed
on e.g. Ubuntu by default), gawk (on RH by default last I checked), or busybox
awk. Tip: mawk is much faster than the others, and to get performance out of
gawk you should use LANG=C (also because some versions of gawk 3 and 4 crash
on complex regexps in Unicode locales).
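
For instance, forcing the C locale for a single run (a generic illustration
with made-up file names; LC_ALL=C overrides LANG and everything else):

    time LC_ALL=C gawk -f query.awk data.tsv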

~~~
ketanmaheshwari
Thank you for your comment. I should have clarified that I used gawk version
4.0.2. I mean to update the post but somehow am unable to push to the repo or
log in to my GitHub.

Edit: Done.

~~~
mfontani
They seem to be currently having a few problems with log-in:
[https://www.githubstatus.com/incidents/q3cfsrp1qb6l](https://www.githubstatus.com/incidents/q3cfsrp1qb6l)

------
hidiegomariani
this somewhat reminds me of taco bell programming:
[http://widgetsandshit.com/teddziuba/2010/10/taco-bell-programming.html](http://widgetsandshit.com/teddziuba/2010/10/taco-bell-programming.html)

~~~
ketanmaheshwari
Excellent! Didn't know there was a name for this. Agreed, the work is a case
of Taco Bell programming!

~~~
Scarbutt
There is also
[https://adamdrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html](https://adamdrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html)

------
Upvoter33
I've always wanted to build a parallel awk. And call it pawk. And have an
O'Reilly book about it. With a chicken on the cover. pawk, pawk, pawk! This is
a true story, sadly.

~~~
ketanmaheshwari
I too wanted to do almost exactly the same. I want to port awk to GPUs by
hacking its code and adding OpenCL tags at the right places. Someday!

------
svnpenn

        !($1 in a) && FILENAME ~ /aminer/ { print }
    

This uses a regular expression. As regex is not actually needed in this case,
you might be able to get better performance with something like this:

    
    
        !($1 in a) && index(FILENAME, "aminer") != 0 { print }

------
co_dh
I like the idea of using AWK for this. But you can give kdb/q a try. 250M rows
is nothing for kdb, and it seems you can afford the license.

~~~
lenkite
kdb/q is amazing. I have seen some experienced folks do jaw-dropping data
computations at the drop of a hat on gigabyte-sized data. Makes everything
else look laughably _puny_.

I suspect if it was open source, it would probably be the most popular big-
data storage and computing platform.

~~~
labelbias
Hmm... 256M records is still not that much for postgres or mysql, so it
shouldn't be too much for kdb/q. Several GB of data also shouldn't be too
much.

------
mjcohen
gawk has been my go-to text processing program for many years. I have written
a number of multi-thousand-line programs in it. I always use the lint option.
Catches many of my errors. One of these programs had to read a 300,000,000-byte
file into a single string so it could be searched. The file was in 32-byte
lines. At first, I read each line in and appended it to the result string, but
that took way too long since the result string was reallocated each time it
was appended to. So instead I read in about 1000 of the 32-byte lines,
appending them to a local string. This 32,000-byte string was then appended to
the result, so that append was only done about 10,000 times. Worked fine.
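
A minimal gawk sketch of that chunking idea (the file name is invented; the
exact chunk size doesn't matter much):

    gawk '{
        chunk = chunk $0                  # cheap append to a small local string
        if (++n % 1000 == 0) { big = big chunk; chunk = "" }
    }
    END {
        big = big chunk                   # flush the last partial chunk
        print length(big)                 # the whole file is now one searchable string
    }' records.txt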

------
FDSGSG
Spending _minutes_ on these tasks on hardware like this is pretty silly. awk
is fine if these are just one-off scripts where development time is the
priority; otherwise you're wasting tons of compute time.

Querying things like these on such a small dataset should take seconds, not
minutes.

~~~
ketanmaheshwari
Thank you for your comment. Most of the solutions indeed take less than a
minute. The solutions to problems 1 & 3 took 25 sec and the one to problem 5
took 26 sec.

The solution that took 9 minutes involved processing the abstract in each
record. The abstracts are quite sizeable for some of these publications.
Processing millions of them took time.

The solution that took 48 minutes involved a nested loop, effectively reaching
an iteration count of 216 years times 256M records, which comes to about 55B
iterations.

Hope this clarifies things a bit but I am not claiming this to be the most
optimized solution. I am sure there is scope for refinements -- this was my
take on it.

------
arendtio
First, I think it is great that you found a tool that suits your needs. A few
weeks ago I was mangling some data too (just about 17 million records) and
would like to contribute my experience.

My tools of choice were awk, R, and Go (in that order). Sometimes I could
calculate something within a few seconds with awk. But for various
calculations, R proved to be a lot faster. At some point, I reached a problem
where the simple R implementation I borrowed from Stack Overflow (which was
supposed to be much faster than the other posted solutions) did not satisfy my
expectations, and I spent 4 hours writing an implementation in Go which was an
order of magnitude faster (I think it was about 20 minutes vs. 20 seconds).

So my advice is to broaden your toolset. When you reach the point where a
single execution of your awk program takes 48 minutes, it might be worth
considering another tool. However, that doesn't mean awk isn't a good tool; I
still use it for simple things, as writing 2 lines in awk is much faster than
writing 30 in Go for the same task.

------
ineedasername
Definitely an under-appreciated tool. Very useful for one-off tasks that would
take a fair bit longer to code in something like python.

------
Zeebrommer
I am often impressed by the things that can be done with these old-school UNIX
tools. I'm trying to learn a few of them, and the most difficult part is these
very implicit syntax constructions. How is the naive observer to know that in
bash `$(something)` is command substitution, but in a Makefile `$(something)`
is just a normal variable? With `awk`, `sed` and friends it gets even worse,
of course.
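
A tiny illustration of the two meanings (written as a shell snippet, with the
Makefile behaviour noted in comments):

    # bash: $(...) is command substitution -- the command runs and its output is spliced in
    echo "today is $(date +%F)"
    # in a Makefile, $(NAME) merely expands the variable NAME; command
    # substitution there is spelled $(shell date +%F) instead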

Is the proper answer 'just learn it'? Are these tools one of these things
(like musical instruments or painting) where the initial learning phase is
tedious and frustrating, but the potential is basically limitless?

~~~
avar
Some of it you pick up or remember from the context of the file you're looking
at, but you really should take the time to read the manuals for these tools
from cover to cover at some point if you're making extensive use of them. In
the case of GNU bash & make:
[https://www.gnu.org/savannah-checkouts/gnu/bash/manual/bash.html](https://www.gnu.org/savannah-checkouts/gnu/bash/manual/bash.html)
&
[https://www.gnu.org/software/make/manual/make.html](https://www.gnu.org/software/make/manual/make.html)

------
schmichael
[https://mobile.twitter.com/awkdb](https://mobile.twitter.com/awkdb) was a
joke account made in frustration by a coworker trying to operate a Hadoop
cluster almost a decade ago. Maybe it's time to hand over the account...

------
gautamcgoel
Your system had 512-core Xeons? Did you mean that you had 5 12-core xeons? Or
512 cores total?

~~~
ketanmaheshwari
512 Intel Xeon CPUs configured over 16 nodes. This one:
[http://www.comnetco.com/sgi-uv300-the-most-powerful-in-memory-supercomputer](http://www.comnetco.com/sgi-uv300-the-most-powerful-in-memory-supercomputer)

~~~
gautamcgoel
Wow. I didn't realize you could run so many CPUs in one address space. This
thing is basically one huge computer. Would love to open up Gnome System
Monitor and see 512 cores!

~~~
cesaref
Core counts have gone up quite considerably in the last few years, certainly
in a compute farm context. 10 years ago, you'd be looking at 2U servers having
8 cores (dual quad-core) as high density. These days, 2U servers can pack 4
sockets, and processors have core counts in the high 20s, so if you've got
deep pockets you can get over 100 cores in 2U.

~~~
jerven
AMD Epyc can give you 128 cores / 256 threads in a 2-socket system. Not that expensive.

------
winrid
and here I am working on a big distributed system that has to handle 200k
records a day (and hardly does so successfully). sigh.

------
_wldu
Turning JSON data into tabular data using jq was pretty neat. So many JSON
APIs in use today, yet still a need for CSV and Excel docs.
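
For example, something like this (the input shape and field names are made
up):

    # flatten an array of JSON objects into CSV rows
    jq -r '.[] | [.id, .title, .year] | @csv' papers.json > papers.csv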

~~~
nojito
Excel can natively connect to and parse JSON nowadays.

------
nmz
You couldn't have used FS="\036" or "\r"?

------
tarun_anand
Amazing work. Keep it up.

------
nojito
Why not just use data.table?

The solution would be much less error prone and most likely much quicker as
well.

~~~
ketanmaheshwari
Thank you for the comment. One reason to use awk was that I wanted to see how
far I could go with it. I will check out data.table. Does it offer any kind of
parallelism?

~~~
nojito
Many common functions in data.table are parallelized under the hood by OpenMP.

------
gh123man
Slightly off topic, but as a Swift developer
([https://swift.org/](https://swift.org/)), the usage of Swift/T in this
project really confused me. Is Swift/T in any way related to Apple's Swift
language?

The naming conflict makes googling the differences fairly challenging.

~~~
ketanmaheshwari
The two Swifts are different. I was aware of the potential confusion, which is
why I did not use Swift in the title and mention explicitly in the blog that
this is not Apple Swift. Fun story: Apple contacted the Swift/T team before
launching Apple Swift and made a mention of it on their page.

~~~
gh123man
Cool! Thanks for the explanation. That answers my biggest question of who came
first.

------
skanga
Try mawk if you can. I find that it's even faster.

~~~
ketanmaheshwari
Indeed I did try mawk and found it to be faster. However, when I tried setting
the locale to "LC_ALL=C", the performance of awk and mawk was almost the same.
I also left it at awk in favor of portability -- mawk is not available on most
systems and needs to be installed.

------
flatfilefan
GNU Parallel + AWK = even less code to write.
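
For instance (hypothetical file and script names), splitting the input into
blocks and running the same awk program over each block in parallel:

    parallel --pipepart -a records.tsv --block 100M "awk -f query.awk" > results.out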

