
What you need may be “pipeline + Unix commands” only - nanxiao
https://nanxiao.me/en/what-you-need-may-be-pipeline-unix-commands-only/
======
rofo1
I feel like the art of UNIX is slowly fading into oblivion, especially with
the new generation of programmers/developers.

Eventually, they'll become the ones that decide the fate of software engineers
(by being hiring managers, etc.) and we'll see more and more monstrosities like
the one the article portrays, instead of clever use of UNIX tools where applicable.

There are so many things the software world is doing wrong that I am
surprised that, even with this inefficiency, it's such a viable and well-paid
profession. It's almost as if we are creating insanely complex solutions that
in turn require a large number of developers to support them, whereas we could
have chosen a much more practical solution which is self-sustaining.

~~~
redsavagefiero
This is something I've noticed in the last 8-10 years: the rise of the
python/js/java paradigm everywhere. Some of the associated interchange formats
(JSON) I enjoy much more than XML and flat files, but the misapplication of
tools is becoming an epidemic. When I can write

    
    
        awk -F "," '{for (x = 1 ; x <= NF ; x++) {if ($x ~ /[0-9]+/) {a[x] = a[x] + $x}}} END { for (p in a) {printf "%d = %d\n",p,a[p]}}'
    

to sum columns in 5 seconds, and people are scrambling with libraries to do
matrix operations, I tend to scratch my head and walk away. The aversion to
the command line is also something that bothers me, but I don't run into it as
much in my field.

~~~
joshuamorton
When I can write

    
    
        import pandas as pd
        data = pd.read_csv(filename)
        print(data.sum())
    

and have the same result, I'm going to do the one that is faster to write,
has fewer characters, and lets me understand what's going on.

And don't get me wrong, I've written some gnarly pipelined bash before,
although I'm by no means an expert, but that doesn't mean it's always the
right tool for the job.

~~~
redsavagefiero
I was going to be the long-haired *nix geek here, but I have no hair and the
world is moving on. I can't pick a bone with Python for data science/analysis
and personal convenience. _However_, as a principal engineer, if someone were
to say, for a trivial dataset like this, that we need Python and pandas for an
operation where Python + pandas was not already provisioned, the answer would
be no.

------
twic
I've been a pipeline junkie for a long time, but I've only recently started to
get into awk. The thing I can do with awk but not other tools is write
stateful filters, which accumulate information in associative arrays as they
go.

For example, if you want to do uniq without sorting the input, that's:

    
    
      awk '{ if (!($0 in seen)) print $0; seen[$0] = 1; }'
    

This works best if the number of unique lines is small, either because the
input is small, or because it is highly repetitive. Made-up example, finding
all the file extensions used in a directory tree:

    
    
      find /usr/lib -type f | sed -rn 's/^.*\.([^/]*)$/\1/p' | awk '{ if (!($0 in seen)) print $0; seen[$0] = 1; }'
    

That script is easily tweaked, e.g. to uniquify by a part of the string. Say
you have a log file formatted like this:

    
    
      2019-03-03T12:38:16Z hob: turned to 75%
      2019-03-03T12:38:17Z frying_pan: moved to hob
      2019-03-03T12:38:19Z frying_pan: added butter
      2019-03-03T12:38:22Z batter: mixed
      2019-03-03T12:38:27Z batter: poured in pan
      2019-03-03T12:38:28Z frying_pan: tilted around
      2019-03-03T12:39:09Z frying_pan: FLIPPED
      2019-03-03T12:39:41Z frying_pan: FLIPPED
      2019-03-03T12:39:46Z frying_pan: pancake removed
    

If you want to see the first entry for each subsystem:

    
    
      awk '{ if (!($2 in seen)) print $0; seen[$2] = 1; }'
    

Or the last (although this won't preserve input order):

    
    
      awk '{ seen[$2] = $0; } END { for (k in seen) print seen[k]; }'
    

I don't think there's another simple tool in the unix toolkit that lets you do
things like this. You could probably do it with sed, but it would involve some
nightmarish abuse of the hold space as a database.

~~~
klhugo
I admire your work. Clever usage of Unix tools is very handy. But for parsing
text, do you really see awk and Unix tools as a better solution than a simple
Python script?

Although I admit that the key argument for Unix tools is that they don’t get
updated. That sounds awful, but think about it: once it works, it works
everywhere, no matter the OS type, version or packages installed. That is
something experienced programmers always want from their solutions.

~~~
twic
Python is fantastic for little (or large!) bits of logic, but its handling of
input is clunky enough to put me off for tiny things. AFAIK the boilerplate
you need before you can work on the fields of each line is:

    
    
      import sys
      for line in sys.stdin:
        fields = line.split()
        # now you can do your logic
    

If you want to use regular expressions, that's another import.

Python also doesn't play well with others in a pipeline. You can use python
-c, but you can't use newlines inside the argument (AFAICT), so you're very
limited in what you can do.

~~~
ryl00
This is exactly where perl (namely, perl -ne) is so very, very useful.
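
For instance, the "uniq without sorting" filter from upthread becomes a
one-liner (just a sketch; -n wraps the code in a read loop over stdin, and -l
takes care of the trailing newlines):

    
    
        perl -lne 'print unless $seen{$_}++'
    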

~~~
twic
Yes, that helps a lot. The fact that Perl uses braces rather than whitespace
also makes it work much better in this situation.

I still wouldn't touch Perl with a bargepole, though. Sorry not sorry.

------
Svoka
I had a task the other day to aggregate some logs, so I wrote a one-liner
which did most of what I wanted. It took about 4 minutes to run.

Then I decided to run it on a larger dataset (because I needed to). Like a week
of logs, not a day of logs.

While it was running, I wrote a Rust CLI, which worked like `cat __/ *.log
| logparser` and did one day in 12 seconds, and a week in about two minutes.

And I gave up waiting on awk, btw. It is not always better to use the command
line. If you have gigabytes or tens of gigabytes of data, it can be easier
to write a small CLI tool to help you out.

Also, it was much easier to put significantly more complex logic into it
because of type checking and, you know, it being an actual high-level
programming language, not a hack-and-slash awk script.

EDIT: Looking back at my "one liner" vs. my "rust cli", I would not be able to
make meaningful adjustments to the one-liner; I can barely comprehend it
anymore. It is, to my sorrow, a write-only thing.

~~~
unhammer
If your awk script gets too long / unreadable you just put it in a file and
use some whitespace and longer variable names.

AWK scripts tend to be very readable (much more so than e.g. sed) as long as
they stick to the "stateful filters" use-case as
[https://news.ycombinator.com/item?id=19294195](https://news.ycombinator.com/item?id=19294195)
calls it, but yes they have their limits.

If speed is a concern, you may want to try using mawk instead of GNU awk/gawk.
I've had 4x speedups with mawk.
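
Swapping implementations is just a matter of pointing at a different binary
and timing both; a sketch with made-up file names:

    
    
        time gawk -f report.awk access.log > /dev/null
        time mawk -f report.awk access.log > /dev/null
    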

~~~
DenseComet
Based on his timings, rust achieved a 20x speedup vs a 4x speedup if he used
mawk.

~~~
hedora
Yep, and adding GNU parallel or xargs to the mix would give a number-of-cores
speedup, which would make mawk faster than the single-threaded Rust on a >6
core machine (roughly).
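
A rough sketch of that split (logs/*.log, map.awk and reduce.awk are all
placeholders): run a per-file "map" pass on every core, then reduce the
partial results in one final pass.

    
    
        # one mawk process per file, as many in flight as there are cores
        ls logs/*.log | xargs -P "$(nproc)" -n 1 mawk -f map.awk > partials.txt
        # single-threaded reduce over the small intermediate file
        mawk -f reduce.awk partials.txt
    

The partial outputs can interleave, which is fine as long as each emitted
record is a single short line.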

------
jimmy_ruska
* If it's simple transforms, use cli tools.

* If it requires aggregation and it's small, use cli tools.

* If this is data you're using over and over again then load it in the database and then do the cleaning, ELT.

* If it's 2 TB of data or under, still use bzip2, get splittable streams and pass them to gnu parallel.

* If it requires massive aggregations or windows, use spark|flink|beam.

* If you need to repeatedly process the same giant dataset, use spark|flink|beam.

* If the data is highly structured and you mainly need aggregations and filtering on a few columns use columnar DBs.

I've been using Dlang with ldc a lot because of how fast its compile-time
regex is, and for its built-in JSON support. Python3 + pandas is also a good
choice if you don't want to use awk.

~~~
hedora
Before reaching for spark, etc:

Sort is good for aggregations that fit on disk (TBs these days, I guess)

Perl does well too if the output fits in a hashtable in DRAM, so 10’s (or
maybe 100’s?) of GBs
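
A sketch of both, counting occurrences of one column (the field number and
file name are made up). GNU sort spills to temporary files, so the first
pipeline keeps working when the input is far larger than RAM; the Perl version
only needs the distinct keys to fit in memory:

    
    
        cut -f3 events.tsv | sort | uniq -c | sort -rn | head
        cut -f3 events.tsv | perl -ne 'chomp; $c{$_}++; END { print "$c{$_} $_\n" for keys %c }'
    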

------
nik1aa5
I have been using the command line for all types of work for years now. The
most satisfying thing is to realize that there is always more to learn. And
once you've grasped the basics, the tools fit together like LEGO bricks.

While I think it's important to make that argument, the posted article and the
one it refers to lack some guidance on how to reach "command line mastery". I
recently came across this great resource here on HN:

[https://github.com/jlevy/the-art-of-command-line](https://github.com/jlevy/the-art-of-command-line)

It gives a great overview of the toolbox you have on the command line.
Equipped with `man`, you're ready to optimize your everyday work. And always
remember to write everything down and ask yourself WHY something works the way
it works. The interface of the standard tools is thought out very well.
Getting comfortable with this mindset pays off.

------
kissgyorgy
This old article covers the same topic with a more complex example and a
surprising result about parsing 3.46 GB of data:

[https://adamdrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html](https://adamdrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html)

------
ims
Sometimes there's a middle ground: make your "map" and "reduce" steps separate
scripts.

If you want to do the parsing in Python instead of awk, just make a tiny
script that reads from stdin and writes to stdout - that way you can put it
between xargs or parallel and whatever else is in the pipeline.

The parallelization is a separate concern, so it doesn't need to be mixed in
with the parsing (or whatever) concern. The downloading is a separate concern;
use wget or requests in a Python script or whatever, it doesn't need to be
mingled with the parsing.
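
A minimal sketch of that shape, assuming a hypothetical parse.py that reads
raw lines on stdin and writes one normalized record per line on stdout:

    
    
        cat logs/*.log | python3 parse.py | sort | uniq -c | sort -rn
    

When the parsing itself becomes the bottleneck, the same script can sit behind
`parallel --pipe python3 parse.py` without any changes, precisely because it
only ever talks to stdin and stdout.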

------
speedplane
This article's primary example is a single static text file with 5M lines.
Sure, in that case, awk works great, but how often does that come up? In the
real world, those 5M lines are growing by several hundred thousand every day,
and after a few months they grow beyond what a single computer or awk can handle.
Further, users want real-time results, not just a few times a day when your
cron script runs.

Unix commands are great up to a few GBs of data; Excel is even better if
you're dealing with less than a few tens of MBs. But to deal with terabytes of
data quickly and efficiently, these tools totally break down.

~~~
JoeSmithson
> how often does that come up?

It's important to remember that a lot of things involved in human society
have _not_ exploded in size or complexity in the last 30 years.

Many data sets are basically proportional to the human population (health
records, criminal records, property records, etc.), and these have been
measured in the millions for 30+ years. In the same time, the amount of data a
single script can chew through has moved from millions of records into the
billions.

It's important, because if a government needs to, say, calculate something
involving "every building in the country", or "everybody with a criminal
record" they need to understand that this task, in 2019, can in fact be done
by a single programmer parsing flat text files on their MBP, and does not need
a new department.

This is a bit like Grace Hopper always pointing out the difference between a
microsecond and a nanosecond -
[https://www.youtube.com/watch?v=JEpsKnWZrJ8](https://www.youtube.com/watch?v=JEpsKnWZrJ8)

~~~
noir_lord
> Many data sets are basically proportional to the human population.

Awesome point well put.

------
0db532a0
Relevant reading: The GNU coreutils manual

[https://www.gnu.org/software/coreutils/manual/html_node/index.html](https://www.gnu.org/software/coreutils/manual/html_node/index.html)

------
wodenokoto
Does this add anything to the Taco Bell post linked in TFA?

I suggest changing the link to:
[http://widgetsandshit.com/teddziuba/2010/10/taco-bell-programming.html](http://widgetsandshit.com/teddziuba/2010/10/taco-bell-programming.html)

~~~
sokoloff
TacoBellArticle> I could have done the whole thing Taco Bell style if I had
only manned up and broken out sed, but I pussied out and wrote some Python.

That’s cringe-worthy...

~~~
EdwardDiego
Despite the gendered terms, what makes me cringe is the belief that X is
somehow "better" than Y - if you know Python, and have access to Python, and
express yourself in Python faster, then use Python. I use sed, awk, cut,
Python, as needed - whatever lets me solve my problem faster.

No flexing about how you use X needed.

------
nickjj
Unix commands definitely go a long way.

I've been freelancing for a long time but never automated invoicing people up
until recently.

So I combined grep, cut, paste and bc to parse a work log file to get how many
hours I worked on that project, what amount I am owed and how many days I
worked. I can run these analytics by just passing in the log file, plus a
YYYY/MM date (this month's numbers), a YYYY date (yearly numbers) or no date
(lifetime).

Long story short, the working prototype of it was 4 lines of Bash and took
about 10 minutes to make.

Now I never have to manually go through these work log files again and add up
invoice amounts (which I always counted up manually 3 times in a row to avoid
mistakes). If you're sending a bunch of invoices a month, that manual pass
actually took kind of a long time and was always error prone.
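
The core of that kind of script is tiny. A sketch, assuming a made-up log
format of one `YYYY/MM/DD<TAB>hours<TAB>description` entry per line rather
than the real one:

    
    
        # total hours billed in March 2019
        grep '^2019/03' work.log | cut -f2 | paste -sd+ - | bc
    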

------
schoen
When I was in middle school, a friend and I had an internship with a physicist
who wanted us to write some software to perform a simple transformation on
data sets that he downloaded by FTP from some experiment.

We spent about a week writing a program in QuickBASIC that successfully parsed
the files and performed the transformation.

Some years later, I realized that this would be a one-line awk script which I
could now write in 20-30 seconds. (Probably someone comfortable with Excel
could also perform the transformation in 20-30 seconds, although it might not
scale as well to larger files.)

------
angarg12
I agree with the sentiment that many solutions are over-engineered, but when
you need to process billions of records a day, you do need more complex
systems.

Bottom line: when facing an engineering problem, start with the simplest,
fastest to implement solution, and build complexity as necessary. The simple
solution suffices most of the time.

~~~
lstodd
You know, "billions a day" is only on the order of 10K per second. A single
machine can handle that.

~~~
monsieurbanana
That would be amazing, wouldn't it? It's not true though; the problem with
dealing with billions of operations a day is the spikes. Most of the time you
don't get a nice homogeneous rate for 24 hours straight.

~~~
fwip
Depends on how much latency matters. A lot of big data is batch processing,
for which data that is 3 hours old is more than good enough.

------
3pt14159
I really agree with aspects of this, and I think CLIs and Unix pipes are way
more powerful than we treat them, but be forewarned that there are problems
with doing everything with pipes.

You need to code more defensively with them. For example, it is rare, but
every so often a newline will fail to be emitted.

    
    
        kinda\n
        likethis\n
        \n 
        example\n
    

There are many other gotchas, but that one is a doozy because if you're using,
say, tab-delimited data and cut, you'll miss a line. It's one of the reasons I
use line-delimited JSON if at all possible.
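
Two cheap defenses, sketched with made-up schemas and file names: only pass
through records with the expected shape, or use line-delimited JSON so each
record is self-describing (jq treats stray blank lines between records as
plain whitespace):

    
    
        awk -F '\t' 'NF == 5' data.tsv | cut -f3      # 5 fields is an assumed schema
        jq -r '.user_id' events.ndjson                # .user_id is a made-up field
    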

Also, this constant re-parsing of text does mean your string validation needs
to be more paranoid. For example, some JSON parsers parse curly quotes as
normal programming quotes. Horrible practice, I know, but it could have been
avoided. Also, it's easy to accidentally do shit like this when you're in a
rush. Some string-matching tools that handle the different ways of encoding,
say, "ë" will also be more relaxed about matching quotes.

Anyway, all of this to say that I 100% agree with the posted and linked
articles, but each method has its own security considerations and software
folk should be aware of them before starting.

~~~
AnIdiotOnTheNet
This is one of the reasons I prefer PowerShell: it requires a lot fewer
text-parsing shenanigans. UNIX tools simply failed to evolve. Single IO stream
pipelining on raw ASCII was perfectly reasonable in the 1970s, but it isn't the
1970s anymore.

We should be composing tools with multiple typed io stream paths in GUIs (or
TUIs I suppose), leveraging two or even three dimensional layouts. All our
interfaces should be composed this way, allowing us to take them apart and
modify them at will to fit our workflow.

But that never happened. We never made a better hammer, we just try to squeeze
all our problems into ASCII-processing nails instead.

~~~
3pt14159
You know it is so funny that you're mentioning this. I completely agree.

It's to the point where I've been toying around with creating my own shell and
faking typed IO streams via Postgres+DSL. It's tricky though. Sometimes I want
pub-sub, other times I want event stream. Sometimes I want crash-on-failure,
other times I don't. There is this problem in software that I can't really
word precisely, but the closest I can come is "do it like this, except these
cases here, except-except those cases there" and these things kinda keep
stacking up until you have a program that has too much knowledge baked into
it.

Take, for example, emoji TLDs. Because emojis aren't consistent across
platforms they can get coerced into different types. I didn't know that when I
bought and used a couple emoji domains. When someone tried to click on a link
in Android and was met with a 404, I was so confused. I wasn't even seeing the
request come into nginx!

After I figured it out, I realized that emoji domains won't work. The
underlying assumption of TLDs is that there is one, and only one, way of
encoding something and that these things aren't coerced. That assumption is
wrong.

------
sn41
I use an awk interpreter called mawk.[1] This is noticeably faster than gawk
or other standard variants.

[1] [https://invisible-island.net/mawk/](https://invisible-island.net/mawk/)

~~~
dredmorbius
Often, yes, though not always, and mawk has some omissions relative to gawk.

Try multiple interpreters with timings.

Gawk's profiler can be invaluable.

------
havkom
> BTW, if your data set can be disposed by an awk script, it should not be
> called “big data”.

I think this statement is wrong. The popular meaning of the hype term “big
data” cannot be easily changed.

Rather, awk, sed and other tools that can read from stdin and write to stdout
are great tools for “big data” and often more efficient and suitable than
larger and more hyped systems.

~~~
plaidfuji
Among programmers I've found that the size of your "big data" is often implied
to correlate with the size of something else.

~~~
hedora
Yeah, like the amount of gibberish boilerplate you write, or the verbosity of
the error messages your script emits.

------
hackerm0nkey
Yup. Totally agree with OP. Early on in my career I had to generate on-the-fly
reports for hundreds of GB of data, and all it took was throwing some *NIX
commands around and eventually piping them into awk for the final bit, and it
was blazing fast.

These days, that gets called big data. No, it isn't...

------
nudpiedo
I don't get the motivation for this article; it links to the Taco Bell
programming article, which says exactly the same thing. I usually wouldn't
write an article to repeat what another article already says, especially when
it could have been just a comment on the original blog.

------
beagle3
Not directly command line, but very relevant:
[https://www.frankmcsherry.org/assets/COST.pdf](https://www.frankmcsherry.org/assets/COST.pdf)
"Scalability, but at what COST".

------
dontbenebby
Does anyone have suggestions on books to grow my scripting fu? (End of chapter
exercises tend to be useful for me)

I know bash, and know a lot of basic commands, but I'm not familiar with some
more advanced things. I don't know awk or sed for example.

~~~
photon_lines
If you're looking for good books, I can vouch for 'A Practical Guide to Linux
Commands, Editors, and Shell Programming', which has very thorough coverage and
end-of-chapter exercises. For PowerShell, I'm currently reading the free
PowerShell Notes for Professionals (
[https://books.goalkicker.com/PowerShellBook/](https://books.goalkicker.com/PowerShellBook/)
) and it's a great resource as well.

~~~
dontbenebby
Thanks for the book suggestion, that looks like exactly the kind of thing I'm
interested in.

------
TicklishTiger

        if your data set can be disposed by an
        awk script, it should not be called “big data”.
    

Why not? I don't see how awk is limited to a certain amount of data.

~~~
gsich
If your data fits on a single harddrive it's not big data. So I would set the
current limit to at least 14 TB.

~~~
gizmo686
I thought the boundary point was RAM. It is relatively simple to work with data
across multiple drives. When you pass the boundary of being able to work in a
single system's RAM, you generally need a more significant rework.

~~~
Aeolun
Most stream processing doesn’t rely too much on RAM, unless you literally need
all the data in memory at the same time.

~~~
groestl
If it _needs_ to be in RAM, then either you've got a big enough machine (and
then by definition it's not Big Data) or it's impossible. If you manage to get
by with limited RAM using smart algorithms, although the full dataset would
never fit in RAM, then it's Big Data. So I'd argue stream processing is Big
Data, exactly because it doesn't rely too much on RAM.

~~~
Aeolun
True, but I can stream process 20GB of data on my tiny 2GB RAM home server as
well.

That’s not really ‘big data’ in my opinion.

------
bibyte
This submission seems weirdly relevant.

[https://news.ycombinator.com/item?id=19271135](https://news.ycombinator.com/item?id=19271135)

~~~
sizzzzlerz
I was going to comment on exactly this article, except that I couldn't find it
quickly. It isn't weirdly relevant; it's totally relevant. It demonstrates
that, with sufficient knowledge of the command line, one can write the most
amazing tools, quickly and succinctly. Here, knowledge doesn't necessarily
mean knowing everything immediately, but also knowing what resources to
reference to find things out.

I've been programming for 40 years and using unix/linux since the 80's, and in
this little one-line script I discovered two things one can do with the
appropriate arguments that I'd never known. YMMV.

------
dredmorbius
Case in point from my own recent work: I've been analysing characteristics of
Google+ Communities, mostly looking for plausibly active good-faith instances.

There are 8.1 million communities in total, and thanks to some friendly
assistance, I'd identified slightly more than 100,000 with both 100 or more
members and visible activity within the preceding 31 days, as of early 2019.

The task of Web scraping those 100k communities, parsing HTML to a set of
characteristics of interest, and reducing _that_ to a delimited dataset of
about 16 MB, was all done via shell tools, and on very modest equipment.

Most surprising was that _parsing_ the HTML (using the HTML-XML utilities:
[http://www.w3.org/Tools/HTML-XML-utils/README](http://www.w3.org/Tools/HTML-XML-utils/README))
took longer than downloading the data.

Creating the datafile was done with gawk, and most analysis subsequently in R,
though quick-and-dirty summaries and queries can be run in gawk.

Performance: downloading (curl): 16 hours, parsing (hxextract & hxselect) 48
hours, dataset preparation (gawk): 2 minutes, analysis (gawk / R), a few
seconds for simple outputs.

The parsing step is painfully long, the rest quite tractable.

~~~
pdimitar
Are you planning on open-sourcing the downloader part? I'm very interested.

~~~
dredmorbius
Literally just a Bash while-read loop over community IDs. It's embarrassingly
trivial.

I'm planning on posting the data, probably to
[https://social.antefriguserat.de/](https://social.antefriguserat.de/) and
will include processing scripts.

This is the fetch-script, which saves both the HTML and HEAD responses:

    
    
        #!/bin/bash
        
        sample_file=$1
        
        comm_path='community-pages'
        base_url='https://plus.google.com/communities'
        
        i=0
        time sed -e 's,^.*/,,' $sample_file |
            while read commid;
            do
                i=$((i+1))
                echo -e "\n>>> $i  $commid <<<" 1>&2;
        
                url="${base_url}/${commid}"
                commfile="${comm_path}/${commid}.html"
                commhead="${comm_path}/${commid}.head"
        
                echo "curl -s -o '${commfile}' -D '${commhead}' '${url}'"
        
            done
    

The sample file is simply a list of G+ community IDs or URLs, e.g.:

    
    
        100000056330101053659
        100000310247038604843
        100000355641542704509
        100000408644688836681
        100000537266485621548
        100000813948204546252
        100001055751908082772
        100001158162744298957
        100001173291703462139
        100001193552641351693

~~~
noisy_boy
This seems to be a perfect use case for GNU Parallel[0] to download and
process, say 10 ids, in parallel. If you have already downloaded/processed and
have no need to do it again, then probably doesn't matter now.

[0]:
[https://www.gnu.org/software/parallel/](https://www.gnu.org/software/parallel/)

~~~
dredmorbius
Xargs, actually, though saturating my dinky Internet connection was trivial.
Ten concurrencies kept any one request from stalling the crawl though.

That's why the script echoed the curl command rather than run it directly. It
fed xargs.
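
The pattern is roughly this (a sketch with placeholder file names, not the
exact invocation): every echoed line is already a complete shell command, so
xargs only has to hand each one to a shell, ten at a time.

    
    
        ./fetch-communities.sh sample-ids.txt | xargs -P 10 -I CMD sh -c 'CMD'
    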

The other problem was failed or errored (non 3xx/4xx, or incomplete HTML -- no
"</html>" tag found) responses. There was no runtime detection of these.
Instead, I checked for those on completion of the first run and re-pulled
those in a few minutes, a few thousand from the whole run, most of which ended
up being 4xx/3xx ultimately.

------
dahart
> Every item on the menu at Taco Bell is just a different configuration of
> roughly eight ingredients.

HA! This has almost been my line for years regarding Mexican food. What I like
to say is: it’s amazing how every possible permutation of 8 ingredients has
been named. BTW I love Mexican food, lived in Mexico.

> The post mentions a scenario which you may consider to use Hadoop to solve
> but actually xargs may be a simpler and better choice.

I do feel like there’s a corollary to Knuth’s “premature optimization” quote
regarding web scaling; premature scaling and using tools much bigger than
necessary for the job at hand is pretty common.

------
01100011
It's not just data processing. There are simple solutions to all sorts of
tasks.

I run 'motion' on my Linux desktop at home to serve as a security camera when
no one is home. For months I've been manually starting and stopping it,
figuring I needed to set up an IoT system if I wanted to automate things, i.e.
IFTTT on our phones, an MQTT server in the cloud, etc. Then I realized: I
just need to start the camera when all of our phones are off the LAN. It took
about 15 minutes to set up, and now I never have to worry about forgetting to
stop or start the camera.
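
A minimal sketch of that kind of check, run from cron every few minutes. The
MAC addresses are made up, arp-scan stands in for whatever LAN presence test
you prefer, and it assumes motion is managed as a systemd service:

    
    
        #!/bin/bash
        # made-up phone MAC addresses
        phones=("aa:bb:cc:dd:ee:01" "aa:bb:cc:dd:ee:02")
        
        home=0
        scan=$(arp-scan --localnet 2>/dev/null)    # needs root; any presence check works
        for mac in "${phones[@]}"; do
            echo "$scan" | grep -qi "$mac" && home=1
        done
        
        if [ "$home" -eq 0 ]; then
            systemctl start motion    # nobody home: start recording
        else
            systemctl stop motion     # someone is home: stop recording
        fi
    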

------
DanielBMarkham
For my next book, I'm working through a concept I call "good enough
programming."

Good enough programming is something you code that provides something people
want -- and you never look at the code again. Find the problem, solve the
problem, walk away from the problem. That's not sexy. It's not going to get
you an article to write for a famous magazine, but it's good enough.

We have lost sight of "good enough" in programming, and without some kind of
guardrails, we end up doing stuff we like or stuff that sounds good to other
programmers. For instance, while I love cloud computing, I'm seeing "how-to"
articles written about setting up a VPC for doing something like playing
checkers. Yes, it was an oversimplified article, and you have to write that
way, but without _wisdom_, how is the reader supposed to know that? What
criteria do they use to determine whether it's a co-lo server, a lambda, or a
world-wide distributed cloud?

We're going like gangbusters selling programmers and companies on all kinds of
new and complex ways of doing things. They like it. We like it. But is it in
anybody's best interest over the long run?

Recently I rewrote a pet project for the third time. First time it was C#, SQL
Server, and an ORM. Then it was F#, MySQL, and linux. The last time it was
pure FP in F# and microservices.

Some of you may know where this is going.

Just as I finished writing the app in a real microservices format, I realized:
Holy cow! This whole thing was just a few Unix commands and some pipes.

My thinking went from all kinds of concerns about transactions and ORM-fun to
just some *nix stuff in a small script. The problem stayed the same.

Something else happened too. At each step, I did less and less maintenance.
The last rewrite has had no maintenance required at all. In my spare time, I'm
going to do the *nix one using no servers at all, on a static SPA. In a very
meaningful way, there's no app, there's no server, and there's nothing to
maintain. Yet I still get the functionality I need. And I never maintain it.

Of course that's not possible for every app, but the key thing I learned
wasn't the magic of serverless static SPAs or the joys of Unix. It was that _I
didn't know whether or not it was possible until I did it._ By thinking in a
pure FP fashion and deploying in true microservices, the rest just "fell out"
of the work. At first I was actually thinking in a way that would have only
led to more and more complexity and maintenance requirements.

My belief is that we get our thinking right first, use code budgets, and try
for a simple unix solution. If it doesn't work, why? At least then we've made
an effort to be good enough programmers. That beats most everybody else.

~~~
MrQuincle
Write a blog post about it! It would be nice to see the progression and read
about your insights from such complete reworks, even if it's a pet project.

Edit: Can I sign up somewhere to get a heads-up when your book is available?
Would be appreciated!

------
JJMcJ
One place installed a Hadoop cluster. Some of us suspected that replacing a
regex monstrosity with Lex/Yacc might have given the speed needed to process
the files.

EDIT: Lex/Yacc, or some faster parser generator; I'm not too knowledgeable
about that.

------
nudpiedo
I totally agree that most of the *nix tools are, most of the time, the best
ones, but things get trickier when there is a complex dynamic pipeline, e.g.
when the input determines which kinds of processing run, and those are
themselves dynamic based on other inputs.

~~~
fwip
I've found Nextflow to be an excellent solution to parallelizing and adding
extra logic to any cli pipeline. It also helps manage environment and track
metrics.

------
aboutruby
My favorite thing has been `| ruby -e "puts STDIN.to_a. ..."`, which lets you
run any kind of code on the standard input; much easier than remembering
awk/sed's various options, and much more powerful.

edit: the same thing can be done with Python/Perl

~~~
apabepa
Yes, easier if you know Ruby... sed/awk are good for munching strings in small
scripts, and they are also always installed on a Unix system (AFAIK).

~~~
dredmorbius
You'll find sed and awk in busybox, virtually always, even on minimal systems.
Including embedded devices, routers, Android, etc.

------
ChlorophZek
Yes, buddy, the simplest tools are the most powerful, and the fastest. Time is
money.

------
ambrop7
Unless your data contains spaces, tabs, or, god forbid, newlines. Unix
pipeline tools lack any sort of useful data structuring capabilities, making
them appropriate for one-off tasks at most.

~~~
mrighele
Spaces, tabs and newlines are not a problem.

The fact that there are some standard tools available doesn't mean you are
limited to them.

If you have a CSV file with spaces and newlines, use csvkit or a small Python
script importing the relevant library. If you have to parse a JSON file, you
can use jq to pick the relevant fields regardless of how the document is
formatted. You can even process binary data, as long as the file format is
understood by the tool.
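
For example (the file names and the jq path are made up; csvformat -T is
csvkit's tab-separated output mode, if I remember the flag right):

    
    
        csvformat -T contacts.csv > contacts.tsv           # CSV -> TSV, quoting handled for you
        jq -r '.items[] | [.id, .name] | @tsv' api.json    # pick two fields, emit TSV
    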

~~~
zimpenfish
Transcoding CSV to TSV using csvkit has been a life saver for me the last few
months due to an old industry's insistence on CSV as an information delivery
method. jq has previously been a lifesaver too.

------
AdieuToLogic
GUIs are the McDonald's of interfaces.

The choices are limited, and when you go there you know there's a good chance
that you will end up wondering why what you got caused so much pain.

------
jabl
This is the Big Data version of "90% of Oracle instances could be replaced
with sqlite".

Pragmatism almost always loses to CV-padding and office politics.

------
kschiffer
Nice article, though I think you should work on your blog's typography. Using
a bold typeface for body copy is not pleasant to read.

------
0x445442
Oh how I miss Ted Dziuba posts. He was a treasure trove of pragmatism for the
industry.

------
booleandilemma
At my company what you can do is determined by which AWS services you can
string together.

------
tomcooks
This is what you get by not answering RTFM to stupid questions.

------
kurczynski
I hate to be that guy, but they're NOT "Unix" tools, as the name GNU literally
states.

The post makes a good point that I fully agree with, just doesn't explain it
well enough.

~~~
mrighele
Many of the GNU tools are reimplementations of already existing tools.

For example the initial implementation of AWK was in 1977 [1], a few years
before GNU even existed [2], so it _is_ a Unix tool.

[1]
[https://en.wikipedia.org/wiki/AWK#History](https://en.wikipedia.org/wiki/AWK#History)
[2]
[https://en.wikipedia.org/wiki/GNU#History](https://en.wikipedia.org/wiki/GNU#History)

