
Using AWK and R to parse 25TB - markus_zhang
https://livefreeordichotomize.com/2019/06/04/using_awk_and_r_to_parse_25tb/
======
scottlocklin
You can solve many, perhaps most terascale problems on a standard computer
with big enough hard drives using the old memory efficient tools like sed,
awk, tr, od, cut, sort & etc. A9's recommendation engine used to be a pile of
shell scripts on log files that ran on someone's desktop...
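
A minimal sketch of that style (the gzipped, tab-separated log layout here is
invented); everything streams, and the only thing held in memory is awk's
array of distinct keys:

    # sum column 5 per key in column 2, then show the top 20 keys
    zcat logs/*.gz \
      | cut -f2,5 \
      | awk -F'\t' '{sum[$1] += $2} END {for (k in sum) print k "\t" sum[k]}' \
      | sort -t $'\t' -k2,2nr \
      | head -20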

~~~
devonkim
Furthermore, as computers get faster and cheaper in every dimension, the
scale at which “Big Data” tooling and effort makes economic sense keeps
growing with them. The limits of single nodes 15 years ago were pretty
serious, but most problems businesses have, even in the so-called enterprise,
can currently fit easily on a workstation costing maybe $5k and be crunched
through in a couple of hours or maybe minutes - a lot easier to deal with
than multiple Spark or HANA
nodes. Operationalizing the analysis to more than a single group of users or
problem is where things get more interesting but I’ve seen very, very few
companies that have the business needs to necessitate all this stuff at scale
- most business leaders still seem to treat analytics results in discrete
blocks via monthly / weekly reports and seem quite content with reports and
findings that take hours to run. Usually when some crunching takes days to run
it’s not because the processing itself takes a lot of CPU but because some
ancient systems never intended to be used at that scale are the bottleneck or
manual processes are still required so the critical path isn’t being touched
at all by investing more in modern tools.

I can support “misguided” Big Data projects from a political perspective if
they help fund fixing the fundamental problems (similar to Agile consultants)
that plague an organization, but most consultants are not going to do very
well by suggesting going back and fixing something unrelated to their core
value proposition itself. For example, if you hire a bunch of machine learning
engineers and they all say “we need to spend months or even years cleaning up
and tagging your completely unstructured data slop because nothing we have can
work without clean data” that’ll probably frustrate the people paying them
$1MM+ / year each to get some results ASAP. The basics are missing by default
and it’s why the non-tech companies are falling further and further behind
despite massive investments in technology - technology is not a silver bullet
to crippling organizational and business problems (this is pretty much the
TL;DR of 15+ years of “devops” for me at least).

~~~
edraferi
> The basics are missing by default

Absolutely. I really want to see advanced AI/ML tools developed to address
THIS problem. Don’t make me solve the data before I use ML, give me ML to fix
my data!

That’s hard, though, because data chaos is unbounded and computers are still
dumb. But I think there’s tons of room for improvement.

~~~
devonkim
I watched a talk by someone in the intelligence community space nearly 8 years
ago talking about the data dirt that most companies and spy agencies are
combing through and the kind of abstract research that will be necessary to
turn that into something consumable by all the stuff that the private sector seems
to be selling and hyping. So I think the old guard big data folks collecting
yottabytes of crap across the world and trying to make sense of it are well
aware and may actually get to it sometime soon. My unsubstantiated fear is
that we can’t attack the data quality problem with any form of scale because
we need a massive revolution that won’t be funded by any VC or that nobody
will try to tackle because it’s too hard / not sexy - government funding is
super bad and brain drain is a serious problem. In academia, who the heck
gets a doctorate for advancements in cleaning up arbitrary data to feed into
ML models when pumping out more incremental model and hyperparameter
improvements gives you a better chance of getting your papers accepted or
getting hired? I’m sure plenty of companies would love to pay decent money to
clean up data with lower cost labor than to have their highly paid ML
scientists clean it up, so I’m completely mystified what’s going on that we’re
not seeing massive investments here across disciplines and sectors. Is it like
the climate change political problem of computing?

~~~
dgacmu
> In academia, who the heck gets a doctorate for advancements in cleaning up
> arbitrary data to feed into ML models

Well - Alex Ratner [stanford], for one:
[https://ajratner.github.io/](https://ajratner.github.io/)

And several of Chris Re's other students have as well:
[https://cs.stanford.edu/~chrismre/](https://cs.stanford.edu/~chrismre/)

Trifacta is Joseph Hellerstein's [berkeley] startup for data wrangling:
[https://www.trifacta.com/](https://www.trifacta.com/)

Sanjay Krishnan [berkeley]: [http://sanjayk.io/](http://sanjayk.io/)

~~~
devonkim
I was asking somewhat rhetorically but am glad to see that there’s some
serious efforts going into weak supervision. At the risk of goalpost moving, I
am curious who besides those in the Bay Area at the cutting edge are working
on this pervasive problem? My more substantive point is that, given the
massive data quality problem across the ML community, I would expect these
researchers to be treated like a superhero class - so why aren’t they?

~~~
dgacmu
... they are?

There are a lot of people tackling bits and pieces of the problem. Tom
Mitchell's NELL project was an early one, using the web in all its messy
glory...[http://rtw.ml.cmu.edu/rtw/](http://rtw.ml.cmu.edu/rtw/)

Lots of other folks here (CMU), particularly if you add in active learning.
Hard messy problem that crosses databases and ML.

------
stirfrykitty
There was a similar article (2014) that is also interesting. I think too many
of us see new and shiny and immediately glom onto it, forgetting that the
UNIX/regex fathers knew a thing or two about crunching data.

[https://adamdrake.com/command-line-tools-can-be-235x-faster-...](https://adamdrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html)

~~~
close04
> often people use Hadoop and other so-called Big Data ™ tools for real-world
> processing and analysis jobs that can be done faster with simpler tools and
> different techniques.

Right tool for the right job, as always. For a 2-3GB dataset size you don't
need to bother with Hadoop just as for a 2-3PB dataset size you probably don't
need to bother with awk.

~~~
taeric
I'd like to think that it is feasible that most 2-3PB datasets can be easily
partitioned into GB datasets. I rather suspect it is more common to expand GB
datasets into PB ones, though. :(

------
samuell
I counted 15 awk calls in our latest pipeline processing drug compounds to
build predictive machine learning models of them:
[https://github.com/pharmbio/ptp-project/blob/master/exp/2018...](https://github.com/pharmbio/ptp-project/blob/master/exp/20180426-wo-drugbank/wo_drugbank_wf.go)

One of the most eye-opening aspects of awk (and this goes for other pipeable
command-line tools too) was how it supports iterative development of
pipelines regardless of data size.

We tried SQLite at some point for some of the stages because of some pretty
complicated selections, but it often wouldn't give back a single result
within minutes. Switching to AWK, I could immediately get _some_ output, so I
could quickly validate and iterate on the awk code until I got what I
expected. The actual execution was likewise always very fast.
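
To make that concrete, here is a sketch of the iterate-then-scale loop (the
file name and columns are invented):

    # develop the selection against the first 10k lines only
    zcat compounds.tsv.gz | head -n 10000 \
      | awk -F'\t' '$3 == "active" && $7 > 0.5 {print $1 "\t" $7}' | head

    # once the output looks right, run the identical awk over the full file
    zcat compounds.tsv.gz \
      | awk -F'\t' '$3 == "active" && $7 > 0.5 {print $1 "\t" $7}' > actives.tsv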

------
et2o
My biggest question is why on earth did they give you 25 TB of TSV genetic
data? :-)

I'm not sure what your sample was, but it seems like it would have been
better to use one of the special binary file formats for genetic data. You
wrote SNP chips, but in order to get to 25 TB I assume there must be imputed
calls, so it seems like BGEN might have been a lot easier.

This is speculation of course, I'm not sure exactly what your situation was.

~~~
wongarsu
If you unpack all of
[https://files.pushshift.io/reddit/comments/](https://files.pushshift.io/reddit/comments/)
you have many TB of JSON files that are just dumps of API responses that slowly
change schema over the years. It's also an incredibly useful dataset.

In the end CPUs are fast enough and compression algorithms good enough that I
would argue it doesn't really matter what format you use for storage, as long
as it's reasonably easy to read back.

~~~
et2o
In the case of genomics, there have been at this point decades of work
developing high performance file formats and there are large ecosystems of
tools around them. Lots of bioinformatics is really manipulating these files.
So using a supported file format makes a big difference.

------
olodus
Awk is kinda my new favorite scripting language. I've been using it for the
master's thesis I'm writing and am amazed at how quickly I can script in it.
And how easy it was to learn. From the beginning our results-visualisation
pipeline of awk+gnuplot was just supposed to be a quick hack to get something
we could look at, but it has held up through the whole thesis and has just
been extended and improved instead of being switched out. We still use Python
when we need some library help to get some data right, but damn, it's quick
to handle well-structured data with awk. Sad I didn't learn it earlier.
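
The kind of quick hack I mean looks roughly like this (the file and columns
are made up):

    # plot the third column against the first for rows tagged "latency"
    awk -F'\t' '$2 == "latency" {print $1, $3}' results.tsv \
      | gnuplot -p -e "set xlabel 'run'; set ylabel 'ms'; plot '/dev/stdin' using 1:2 with lines"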

------
nstrayer
Hi! Author of the post here. I can attempt to answer any questions if need be
although it looks like others have done a great job doing that already!

~~~
markus_zhang
Hi nstrayer thanks for the excellent article!

I work as a data analyst and I've never had to worry about big data, as the
DWH takes care of the aggregation for us, plus I only work on Windows.

I see now that it would be very useful to learn *nix tools in general, as it
seems that the skills to process (not to predict/analyze) terabytes of data
are very valuable and expensive to acquire, and could be one's bread and
butter.

~~~
banku_brougham
Windows is making it easier to access *nix terminal commands, but in my
experience making the OS switch (a Mac machine with a Unix terminal, or cloud
Linux) has been game-changing for me.

------
nstrayer
As a followup to this. We have now successfully run complex statistical models
across all 2.5 million SNPs on a single AWS instance in less than 3 hours just
by writing R code using the package I describe at the end of the article.

~~~
wikibob
Have you considered using AWS EC2 Spot Instances? The price can be 50 to 70%
cheaper.

You would have to add some additional reliability to your pipeline so it could
continue processing when instances are unexpectedly terminated. But this might
be well worth it as it sounds like your research group is cost-constrained.

AWS made some changes this year so the spot prices are more stable and
instances don't get shut down as frequently.

~~~
nstrayer
I did use spot instances for most of the clusters and a few of the processing
jobs! I got out of the habit of using them earlier due to losing them, but
now that they have the 'pay up to the on-demand price' option they're great!

------
bsg75
Have used Mawk [1] in similar cases for a runtime savings, provided that the
script works without any GNU Awk extensions.

[1] [https://invisible-island.net/mawk/](https://invisible-island.net/mawk/)

~~~
mzs
Thank you! For those that don't know, 15 years ago GNU awk was sometimes oddly
REALLY slow. Mike's awk was not. Plus when things were overall slower back
then, it mattered more.

But there were bugs in mawk and it seemed basically unmaintained. So you'd run
into something and have to use gawk or perl instead.

That's no longer the case; the xterm guy adopted it ten years ago, and now I
know!

~~~
bsg75
Last release was late 2017, but it has been very stable for me. Plus the
author responds to bug reports.

Mawk is my go-to version because of speed. GNU Awk when its extensions are
needed, or the task is over "small data" and the system default version is
sufficient.

------
nimrody
Very well written.

I think using 'make' with the -j parameter (# of parallel jobs) is more
useful than using GNU 'parallel'. The reason is that if one of the jobs fails
for some reason, you just re-run 'make' and only the required jobs are
started instead of restarting the entire computation.
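
A minimal sketch of that pattern (the chunks/ layout and the process_chunk
script are assumptions; recipe lines must be tab-indented):

    CHUNKS  := $(wildcard chunks/*.tsv)
    OUTPUTS := $(patsubst chunks/%.tsv,out/%.tsv,$(CHUNKS))

    .DELETE_ON_ERROR:    # drop half-written outputs if a job dies

    all: $(OUTPUTS)

    out/%.tsv: chunks/%.tsv
    	mkdir -p out
    	./process_chunk $< > $@

Run it as 'make -j 16'; after a failure, re-running the same command rebuilds
only the outputs that are still missing.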

~~~
IndustrialJane
With --results or --joblog and --resume-failed GNU Parallel can do this, too.
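
For example (the chunk files and the processing script are invented):

    mkdir -p out
    parallel --joblog run.log --resume-failed './process_chunk {} > out/{/.}.tsv' ::: chunks/*.tsv

Re-running the same command after a crash skips the jobs recorded as
successful in run.log and retries the failed or unfinished ones.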

------
geogra4
I remember frantically needing to parse a few hundred megs of log files and
join them up to the db rows that fired the individual errors.

Initially I was trying to use SQLite for it but I kept running out of memory
and crashing the system. Turned out using grep, join, sort, and paste got the
job done in seconds.
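
Roughly along these lines (the log format and the db export are invented):

    # pull the error ids out of the logs, one per line, deduplicated
    grep -o 'error_id=[0-9]*' app.log | cut -d= -f2 | sort -u > log_ids.txt

    # db_rows.tsv: id<TAB>details, exported from the database;
    # join needs both inputs sorted on the key
    sort -t $'\t' -k1,1 db_rows.tsv > db_sorted.tsv
    join -t $'\t' log_ids.txt db_sorted.tsv > errors_with_rows.tsv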

------
innomatics
Great post and thanks for sharing your learnings.

A couple of quick questions:

Was the 25TB raw data gathered from a single human genome?

What would be the size in bytes of a unique genomic fingerprint once raw data
is all fully processed into high confidence base values? (including non-coding
regions)

If we just look at coding regions and further compress by only looking at
SNPs, how many bytes is that?

Considering that each base has ~2B of information... it would be super
interesting to know how much space it takes to describe our uniqueness!

~~~
markus_zhang
Sorry, I saw this article and thought it was pretty interesting. This is NOT
my article, but I'd like to know what others would do in this situation.

BTW, not sure, but is it OK to post others' articles here? Maybe I should
add a short commentary in the title.

~~~
kohtatsu
The default assumption is that it's not your article, unless you prepend "Show
HN" (or there's something obvious like your username matching the domain
name).

------
usgroup
Lol, this is basically how I roll most of the time. However, what Linux is
really missing right now is command-line tools that saturate a GPU.

Just counterparts to all the favourites that utilise the GPU ... imagine GPU
awk.

~~~
_hl_
> imagine GPU awk

My intuition tells me that awk and other text processing tools won’t scale
well to a GPGPU. I might be wrong though. Is there any example of something
like grep etc working well on a GPU?

~~~
usgroup
Pffff naysayers ...

[https://www.cs.cmu.edu/afs/cs/academic/class/15418-s12/www/c...](https://www.cs.cmu.edu/afs/cs/academic/class/15418-s12/www/competition/bkase.github.com/CUDA-grep/finalreport.html)

~~~
_hl_
That's very surprising, assuming they didn't doctor the results by choosing
the workload all too carefully.

I would have expected a GPU regex to perform much worse, given that regex
matching is probably very branchy code. Especially since computation is
generally way faster than IO.

~~~
kuzehanka
Modern GPUs have no issues with branching.

~~~
reitzensteinm
What specifically are you referring to? Branching on GPUs has not
substantially changed for a decade. If all threads in a warp skip a branch, it's
free. If one takes it, the rest also pay the penalty and mask out the vector
units.

What's at play here is that the needle in a haystack search of regex is going
to spend almost all its time 0 or 1 deep in the state machine, so the threads
skip the branches and the penalty is not large.

------
xiaodai
I was onto a very similar idea recently! I used split to split a large CSV
file and then used disk.frame
([https://github.com/xiaodaigh/disk.frame](https://github.com/xiaodaigh/disk.frame)),
which is my package, to read it back in. Not 25TB though!
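
For reference, that kind of split step looks something like this (the file
name and chunk size are made up; the header is stripped first so every chunk
has the same shape):

    mkdir -p chunks
    head -n 1 big.csv > header.csv           # keep the header around separately
    tail -n +2 big.csv | split -l 1000000 - chunks/part_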

------
markus_zhang
I found this article particularly interesting as the author discusses a lot of
(failed) methods.

Since I've never dealt with big data before, I'm wondering what you would do
in this situation?

~~~
marcinzm
It seems he had a lot of issues due to Spark executors failing, which looks
like a configuration issue. My guess is that the executors were being killed
by the system OOM killer. Spark's memory management is counter-intuitive.
Spark spills intelligently to disk, so executors don't need a lot of memory
to process data if you're not doing interactive queries. However, Spark will
use all the memory it's given, and sometimes it will use more than that
(might be the OS actually, not sure). So the trick is to give Spark's
executors LESS memory (as a percentage of the node's memory) so there's a
buffer in case Spark uses more memory than allocated.
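
As a rough sketch of that advice (the numbers are illustrative, and the
overhead setting has gone by different names across Spark versions):

    # e.g. on a 61 GB node: ask for noticeably less than physical RAM per executor
    spark-submit \
      --executor-memory 40g \
      --conf spark.executor.memoryOverhead=8g \
      my_job.py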

~~~
nstrayer
Pretty much. I am sure if I truly understood the inner workings of Spark I
would have been able to get it to work. I didn't go into it too much in the
article, but I did tweak the executor memory a lot, going as far as
transcribing the AWS article on tuning into an R script that generated a
config exactly as they stated. Also, when I tried Glue with its supposedly
no-configuration setup, I still got the same problems.

~~~
marcinzm
Interesting. Another reason I can think of for it failing is lack of disk
space on the nodes. Spark will spill data to disk if it doesn't fit into
memory, and your nodes may not have had enough disk space for 25TB of data.

------
emmanueloga_
Unix pipelines, AWK, gnu parallel, R, all great stuff.

If you have such a specific task, why not just write an "actual program" (as
opposed to a pipeline of scripts)? From the looks of it, it sounds like this
problem could have been solved with, say, 50 lines of Java, C, Go, etc. Maybe
a bit more verbose, but it would give you full control, you wouldn't need to
look up how to use command-line parameters on S/O, and it would probably give
you a bit more performance.

~~~
IndustrialJane
I have done that more than once. I often end up with a solution that works on
the test set but which breaks after 10 TB just because <" "@example.com> is a
valid email address according to RFC-822 (Who the f __* thought it was a good
idea to allow spaces in email addresses?). Or some other exception that was
not part of the test set, and that was not identified before starting.

Dealing with exceptions is extremely error prone if these exceptions are not
mapped beforehand. Thus it can be very costly.

Similarly doing stuff in parallel is extremely error prone due to race
conditions: What does not happen when running on your 1 GB test set, may very
well happen when running on your 25 TB production data.

~~~
emmanueloga_
I get your point but the same error handling problems can appear in scripts
and pipelines, no?

In a program I'd try/catch defensively "just in case", if missing one line
out of 25TB is not a big deal.

For parallel processing I'd reach for the nearest standard library at hand on
the language of choice.

~~~
IndustrialJane
> For parallel processing I'd reach for the nearest standard library at hand
> on the language of choice.

That is a good example of what I mean: the nearest standard library is
likely to either buffer output in memory or not buffer at all (in which case
output from parallel jobs can interleave, and the start of one line can end
up joined to part of another). This means you cannot
deal with output bigger than physical RAM. And your test set will often be so
small that this problem will not show up.

GNU Parallel buffers on disk. It checks whether the disk runs full during a
run and exits with a failure if that happens. It also removes the temporary
files immediately, so if GNU Parallel is killed, you do not have to clean up
any mess left behind.

You _could_ do all that yourself, but then we are not talking 50 lines of
code. Parallelizing is _hard_ to get right for all the corner cases - even
with a standard library.

And while you would not have to look up how to use command line parameters on
S/O you _would_ be doing exactly the same for the standard libraries.

Better performance is also not a given: GNU sort has built-in parallel
sorting, so you clearly would not want to use a standard non-parallelized
sort.

Basically I see you have 2 choices: build it yourself from libraries, or
build it as a shell script from commands.

You would have to spend time understanding how to use the libraries and the
commands in both cases, and you are limited by whatever the library or the
command can do in both cases.

I agree that if you need tighter control than a shell script will give you,
then you need to switch to another language.

~~~
emmanueloga_
I agree with everything you said; as always, everything is a trade-off. Good
point about the trickiness of memory management with parallel processing!
You'd have to be extra careful to avoid hoarding RAM.

------
danielecook
Here is an alternative solution, although yours is a good one.

If you only need a single SNP or a group of SNPs within a region, you can use
tabix[1] to index gzipped TSVs and query by genomic position. The position of
SNPs can be obtained from a lookup table (2.5M is not very big even for R) or
from an API if you were to say - query by rsid.

tabix also works over HTTP (and S3) and can utilize RANGE queries to select
a subset of a file... so you only wind up downloading or reading a small
portion once it is indexed, and you can do something like this:

tabix https://www.file.url.tsv chr1:1-1000

The command above would return variants on chromosome 1 between 1 and 1000.
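
For completeness, building the index in the first place looks roughly like
this (the column numbers are assumptions about the TSV layout: chromosome in
column 1, position in column 2):

    # sort by chromosome and position, block-compress, then index
    sort -k1,1 -k2,2n snps.tsv | bgzip > snps.tsv.gz
    tabix -s 1 -b 2 -e 2 snps.tsv.gz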

The following variant browser works in this way:
[https://elegansvariation.org/data/browser/](https://elegansvariation.org/data/browser/)
- There's no formal database (e.g. MySQL) running here, just tabix (actually
bcftools, which uses tabix) to select variants in a particular region, wrap
them in JSON, and return them to the client.

Setting this up on S3 requires configuring CORS... the igv browser also uses
tabix indexes and provides guidance on how to set this up [2]

[1]
[https://www.htslib.org/doc/tabix.html](https://www.htslib.org/doc/tabix.html)
[2] [https://github.com/igvteam/igv.js/wiki/Data-Server-Requireme...](https://github.com/igvteam/igv.js/wiki/Data-Server-Requirements)

I created a similar solution to what you have done using this alternative
approach, writing a wrapper in R that invoked bcftools under the hood. The
dataset I was working with was a lot smaller (1.6M SNPs x 252 individuals),
but it should work with larger genotype sets as well.

------
totalperspectiv
Welcome to Bioinformatics! Excellent write up!

------
parhamn
While I get the whole "you can do it on one node without all the complexity"
thing, I do still wonder if map-reduce-synchronizer + coreutils is better than
the behemoths that are the distributed ETL platforms right now. All the system
would need to do is make a data file available on a node and capture stdout of
the unix pipeline. I know gnu parallel does some of this.

------
fpbarthel
Just wondering if this problem could have been solved by a properly indexed
table? The article says: “Eight minutes and 4+ terabytes of data queried later
I had my results“. 4+ TB seems way too much for 60k patients and sounds like
an inefficient table scan was performed.

~~~
fpbarthel
Also, wouldn’t partitioning only make sense if there is a sensible way to
separate data that is more likely to be accessed vs data less likely to be
accessed? As is common with date data, since recent entries are often more
relevant than old entries. For example, you could categorize SNPs by priority
and e.g. partition SNPs of high importance (frequently accessed) vs
medium importance (sometimes accessed) vs low importance (rarely accessed).

------
marcinzm
One correction to the article: snappy is not splittable, but Parquet files
using snappy ARE splittable. Parquet compresses blocks of data within a file
rather than compressing the file as a whole, so each block can be read and
decompressed independently.

------
sansnomme
What's the best pipeline/workflow management tool for command line programs
with a GUI? E.g. for resuming the process after it gets interrupted etc.

~~~
c0l0
I am not kidding: A graphical terminal emulator.

Mastering the usual command line interface (terminal emulator, interactive
shell, maybe a terminal multiplexer) is non-optional if you want to use CLI
tools at or close to peak effectiveness.

~~~
sansnomme
I mean task resumption after interruption etc., like Airflow-type tools. Not
quite Unix task-suspend options; this is about data pipelines. For
Hadoop-style MapReduce, you can split the task into jobs which can be resumed
and discarded etc. Shell scripting is not an elegant way to deal with this; a
proper orchestrator tool is better.

~~~
cure
You could try the tool my group builds, Arvados
([https://arvados.org](https://arvados.org)). We use Common Workflow Language
(CWL) as the workflow language. Arvados works great for very large
computations and data management at petabyte scale. It really shines in a
production environment where data provenance is key. Intelligent handling of
failures (which are inevitable at scale) is a key part of the Arvados design.

------
srean
Can someone add "join" to the language/standard please. Its awkward to have
split but not join.

~~~
dima55
There's a "join" tool in the GNU coreutils. "man join"

~~~
EForEndeavour
Somehow, I hadn't known of the existence of the `join` utility until this
moment. I really should devote some time to play around with it and paste,
sort, awk, etc.

------
jstrong
(my) Lesson Learned: don't use spark.

