What you need may be “pipeline +Unix commands” only (nanxiao.me)
249 points by nanxiao 51 days ago | 177 comments

I feel like the art of UNIX is slowly fading into oblivion, especially with the new generation of programmers/developers.

Eventually, they'll become the ones that decide the fate of software engineers (by being hiring managers, etc.) and we'll see more and more monstrosities like the one the article portrays, instead of cleverly using UNIX tools where applicable.

There are so many things that the software world is doing wrong that I am surprised that even at this level of inefficiency, it's such a viable and well-paid profession. It's almost as if we are creating insanely complex solutions that in turn require a large number of developers to support them, whereas we could have chosen a much more practical solution which is self-sustaining.

As a young programmer, I have certainly found some love for all these Unix tools, and I learn new things every day.

I think the problem is scale. Back in the day (before I was born), very few people were programmers and the resources they could use were limited. This means they didn’t need insanely complex solutions because they already needed complex solutions just to make it work on the limited hardware. People were trying to solve problems with computers. Nowadays you take a problem that could be solved by a microcontroller with three buttons and make it a cloud app with a web server, a web interface and all kinds of other things like containers.

We don't really tend to ask the question of what a good solution would look like. Often it is the case that you just use the technology the developer wants to learn.

You forget that using more developers means more headcount, and more headcount means I have more responsibility as a manager.

These crazy complex solutions also look a lot more difficult on the slides than the simple 3 layer architecture that’s often shown to me.

With something that simple, and requiring so few people, how am I ever going to convince my clients to pay me multiple millions of dollars for it.

I guess you should sell them the choice of either the big complex thing written by a large team for $2.3 MM, or the lean highly-optimized solution written by a handful of elite "10X" developers for $1.9 MM. (Of course the 10X developers are in high demand and very expensive.)

I think it's mostly a matter of chance more than anything else.

If you've jumped straight into programming, you'll probably consider any of those problems as a nail to your C/JS/Java/Python Hammer.

I was lucky to be initiated to the GNU / UNIX toolset by operation folks when doing tech support in a SAAS biz. We were dealing with a lot of text files and it didn't feel right to offload whatever my problem was to them, so I started learning from them and the scripts they wrote.

Mostly trivial stuff, such as in a directory, only select the relevant files, iterate through them to sum up something or find exactly the bit of information you need.

This allowed me to decrease time spent looking for the answer to my current case or colleague's case.

Of course, I then built those functionalities into a .bashrc (or was it bash_profile?) script to help my fellow support folks find the answer on their own.

Turns out, most people don't want to have anything to do with a command prompt, even if the hard part has been done for you. That's been a pretty good lesson BTW.

Fast forward a few years later to now and I'm realising a good amount of stuff I've been writing lately could potentially be GNU'd instead of writing a Python script.

So that's how it goes I guess.

I have found this to be the case too. Most people would rather have a GUI before even touching the command line. The most notable example is Git: every single one of my developers uses Sourcetree, and if I have to help them with something, I always have to pop open the terminal. It's gotten to the point where I'm considered "odd" because I use the command line. It's become a running joke among everyone.

I don't see how this can be a target of their joke: they have problems with their GUI (I am assuming that's what Sourcetree is), you solve them with your CLI. If they want to laugh, they'd better fix their own problems themselves, I guess?

They probably respect his knowledge and know he is the only competent one, but they simply don't care to learn the CLI method as they are lazy, or can't justify all the time it takes to learn it if he is there to fix their problems (more efficient). I've been on both sides of that situation before. There is always an expert in something you want to know, but can't justify the time. So don't be too hard on them. I support a Linux-based app at work and thus got pretty comfortable with the command line (vim, grep, awk, head, tail, cut, sort, ls, cp, top, cat...etc). To my knowledge I'm the only one in engineering with this knowledge (not in IT support). I also notice my fellow engineers will frequently use more complicated techniques for something that is a single piped command for me. I don't fault them for it though as they have no need to use Linux.

Jokes can be light-hearted and inclusive.

> It's gotten to the point where I'm considered "odd" because I use the command line. It's become a running joke among everyone.

Well, the joke is on them for choosing to only stick to the GUI.

>It's gotten to the point where I'm considered "odd" because I use the command line. It's become a running joke among everyone.

That's... very bizarre. Amongst the engineers I work with it's always an "oh cool, you know how to use bash/terminal" rather than "weird, why are you using terminal".

> It's gotten to the point where I'm considered "odd" because I use the command line

Me too. I'm thinking of doing a presentation on the Dunning-Kruger effect to see if it sparks some introspection in my colleagues.

> Turns out, most people don't want to have anything to do with a command prompt, even if the hard part has been done for you. That's been a pretty good lesson BTW.

My experience has been that some folks are resistant to the command line (but I wouldn't say most). This is too bad because I feel like it's a crucial part of development. I even wrote a post about it: https://letterstoanewdeveloper.com/2019/02/04/learn-the-comm...

In my specific case, it wasn't dev profiles but people on a spectrum going from I grok tech stuff to only browsing and the office suite.

So obviously, whenever I started doing trainings on how to use the cli / the toolkit, I could see what is best described as mild panic. Which I get, CLI isn't inviting at all, it's pretty daunting, there's no real emphasis on safety (as in not breaking anything), it's far from being easy to reason with when you're used to the Windows / GUI world.

The lesson was not so much about the prompt and more about understanding your target audience and catering to them really.

Ah, that makes sense. Yes, understanding your audience and meeting them where they are (or maybe just a bit closer to where you want them to get to) is always very important.

This is something I've noticed in the last 8-10 years. The rise of the python/js/java paradigm everywhere. Some of the associated LDIF (json) I enjoy much more than XML and flat files but the misapplication of tools is becoming an epidemic. When I can write: awk -F "," '{for (x = 1 ; x <= NF ; x++) {if ($x ~ /[0-9]+/) {a[x] = a[x] + $x}}} END { for (p in a) {printf "%d = %d\n",p,a[p]}}' to sum columns in 5 seconds and people are scrambling with libraries to do matrix operations I tend to scratch my head and walk away. The aversion to the command line is also something that bothers me but I don't run into it as much in my field.

When I can write

    import pandas as pd
    data = pd.read_csv(filename)
and have the same result, I'm going to do the one that is faster to write, fewer characters, and lets me understand what's going on.

And don't get me wrong, I've written some gnarly pipelined bash before, although I'm by no means an expert, but that doesn't mean it's always the right tool for the job.

I was going to be the long haired *nix geek here but I have no hair and the world is moving on. I can't pick bones with python for data science/analysis and personal convenience. _However_ as a principal engineer if someone was to say, for a trivial dataset, that we need python and pandas for an operation like this where python + pandas was not already provisioned the answer would be no.

>When I can write: awk -F "," '{for (x = 1 ; x <= NF ; x++) {if ($x ~ /[0-9]+/) {a[x] = a[x] + $x}}} END { for (p in a) {printf "%d = %d\n",p,a[p]}}'

To be fair, what the matrix libraries do is provide readability and clarity.

Show a programmer unfamiliar with awk your statement and they're going to be spending quite a while parsing it.

Show a programmer unfamiliar with numpy/pandas some matrix multiplication and they may understand it intuitively for the most part without even having to look up references.

edit: I'm actually an awk noob myself but after rereading your code for a second, it makes quite a bit of sense. "For all rows in the column, take all digits 0-9 and sum them". So perhaps not much is gained by the library

>The aversion to the command line is also something that bothers me but I don't run into it as much in my field.

The company I recently started with is really big on Splunk.

The fact that they're proud enough of coming up with the tagline "Taking the sh out of IT" to print it on branded t-shirts featured in their training material was a hint that I wouldn't be a huge fan of the product, personally.

Abstracting things away is great, but something about IT pros being proud of avoiding the command line rubs me entirely the wrong way.

I don't think so. What is old is new again; even this blog post is an example of a developer relearning that core lesson.

Also, I'd like to point out that in both this blog post and the Taco Bell one, the "UNIX way" is being compared against a straw man example, not a real example of over-engineering. Neither post actually provides any evidence of any real inefficiency. Both authors are just trying to explain and improve their own thinking about programming, not trying to cast judgement on a generation.

> I feel like the art of UNIX is slowly fading into oblivion

I think that by itself isn't a problem, but fading right along with it is the capacity to decompose and structure the problem domain.

Even if one ends up writing a solution in a different language for whatever reasons, starting out by mapping the problem with UNIX command line tools will result in a better understanding of the problem; an understanding that is language agnostic and can be transferred to any preferred method of implementation.

These complex solutions allow one operations engineer to manage thousands and thousands of servers/containers. Guys that just knew how to bang together bash and Perl scripts got laid off all over the place in favor of people that know cloud stuff.

It turns out that banging together bash scripts which contain invocations of cloud SDK tools has as much relevance in the k8s era as any before.

Maybe it’s Python instead of Perl but that is about the most-significant change.

I believe the author probably understands that fully and is instead referring to the situation where the optimal solution is a little bash script and somehow the developer designs this horribly complex solution that isn't any more performant, but seems more fancy. Sadly, I think developers sometimes do this just to learn new things and keep their resume up to date. It's rough out there from what I read (glad I'm not a developer).

I dunno -- there's definitely a community of developers who prefer not to use the command-line / Unix toolset, but at least half of those I know [especially the macOS users and those with some ops experience] routinely reach for the core Unix utilities, especially when doing ad hoc quick fixes / file processing. For my part, I can't imagine doing my job without them [or I guess more accurately, my job wouldn't exist in the first place without them]

I think git has a pretty significant amount to do with the persistence of command-line tool use among the younger set -- it's just so much more efficient / easier to use from the terminal.

> I feel like the art of UNIX is slowly fading into oblivion, especially with the new generation of programmers/developers. ...

People said the exact same thing in the 1990s, too. "Enterprise"-managed projects, using tech stacks like Win32 or Java, have generally tended to produce large, unwieldy monoliths.

Things fade and shine in succession. Good bits will always come back. It's a bit like the saying about the nature of mathematical truths: it doesn't matter who looks or when, they will re-emerge as they are. Composing tiny bits is always good, whether it's unix commands, lisp functions, or forth words..

Societies are large and full of random fluxes and waves.. right now it might be the time for Wirth 17 pages long solutions .. but McIlroy one liner will come back.

> composing tiny bits is always good, whether it's unix commands, lisp functions, or forth words..

It's not so much about "tiny" but about no fuss, low complexity, no-nonsense, etc. For example, Forth may be tiny, but it's unusable for most applications. Assembler languages are tiny, too.

> composing tiny bits is always good, whether it's unix commands, lisp functions, or forth words

The problem with this type of thinking is that "composing tiny bits" is obviously good to anyone. BUT there are vast differences between composing unix commands and composing lisp functions (and composing tiny JS libraries in NPM etc.) and some of these are good, while some are not.

It's actually very difficult to properly create a system out of composable, re-usable components. I would actually say that Unix pipes are a particularly bad model of how to do that (as are NPM micro-libraries). No well-defined interfaces, arcane naming etc. - they are all reasons why most modern versions of these tools have grown and grown.

Composing unix commands means endlessly adjusting the textual output of one tool to the input requirements of another. Most such coding only cares about the happy cases; it will march on with an incorrect datum because of some unexpected twist in a textual format. For instance, a field delimiting character suddenly occurs in a field. Or some code is scraping data by hard columns, but a field width overflows and the columns shift.

Agreed, and well said.

> right now it might be the time for Wirth 17 pages long solutions .. but McIlroy one liner will come back.

In case someone is interested in following up this reference, it’s Knuth’s 17 page long solution.

> I feel like the art of UNIX is slowly fading into oblivion, especially with the new generation of programmers/developers.

For what it's worth, I'm a first-year computer science major and the class I'm currently taking is very much focused on the "art of Unix." We've been doing shell scripting and regular expressions and the like. I quite enjoy it.

I think the issue you describe is partially a case of "when all you've got is a hammer..."

When all you've got is a computer and your own hands, you use the Unix way.

If you have a team (or the money to hire a team) of 20 developers, then you start looking for a problem that you can solve with that hammer.

Leads have the responsibility. Most of them are good and know their shit.

So I wouldn't worry too much. All the other small companies, the ones whose people don't know better, don't have too much technology inside anyway. Or at least often enough.

I've been a pipeline junkie for a long time, but i've only recently started to get into awk. The thing i can do with awk but not other tools is to write stateful filters, which accumulate information in associative arrays as they go.

For example, if you want to do uniq without sorting the input, that's:

  awk '{ if (!($0 in seen)) print $0; seen[$0] = 1; }'
This works best if the number of unique lines is small, either because the input is small, or because it is highly repetitive. Made-up example, finding all the file extensions used in a directory tree:

  find /usr/lib -type f | sed -rn 's/^.*\.([^/]*)$/\1/p' | awk '{ if (!($0 in seen)) print $0; seen[$0] = 1; }'
That script is easily tweaked, eg to uniquify by a part of the string. Say you have a log file formatted like this:

  2019-03-03T12:38:16Z hob: turned to 75%
  2019-03-03T12:38:17Z frying_pan: moved to hob
  2019-03-03T12:38:19Z frying_pan: added butter
  2019-03-03T12:38:22Z batter: mixed
  2019-03-03T12:38:27Z batter: poured in pan
  2019-03-03T12:38:28Z frying_pan: tilted around
  2019-03-03T12:39:09Z frying_pan: FLIPPED
  2019-03-03T12:39:41Z frying_pan: FLIPPED
  2019-03-03T12:39:46Z frying_pan: pancake removed
If you want to see the first entry for each subsystem:

  awk '{ if (!($2 in seen)) print $0; seen[$2] = 1; }'
Or the last (although this won't preserve input order):

  awk '{ seen[$2] = $0; } END { for (k in seen) print seen[k]; }'
I don't think there's another simple tool in the unix toolkit that lets you do things like this. You could probably do it with sed, but it would involve some nightmarish abuse of the hold space as a database.
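For comparison, the same stateful-filter idea can be sketched in Python. This is only an illustration of the technique, not a claim that it beats the awk versions; the function names are made up, and unlike awk (which treats a missing `$2` as an empty string) this sketch skips lines that have no second field.

```python
def first_per_key(lines, field=1):
    """Yield the first line seen for each distinct value of a whitespace-separated field."""
    seen = set()
    for line in lines:
        fields = line.split()
        if len(fields) <= field:
            continue  # line has no such field; skip it
        key = fields[field]
        if key not in seen:
            seen.add(key)
            yield line


def last_per_key(lines, field=1):
    """Return the last line seen for each key, ordered by each key's first appearance."""
    latest = {}
    for line in lines:
        fields = line.split()
        if len(fields) <= field:
            continue
        latest[fields[field]] = line
    return list(latest.values())


log = [
    "2019-03-03T12:38:16Z hob: turned to 75%",
    "2019-03-03T12:38:17Z frying_pan: moved to hob",
    "2019-03-03T12:38:22Z batter: mixed",
    "2019-03-03T12:38:27Z batter: poured in pan",
    "2019-03-03T12:39:46Z frying_pan: pancake removed",
]
first = list(first_per_key(log))
last = last_per_key(log)
print("\n".join(first))
print("\n".join(last))
```

Because Python dicts keep insertion order, `last_per_key` returns subsystems in the order they first appeared, which the awk `for (k in seen)` loop does not guarantee.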

  awk '{ if (!($2 in seen)) print $0; seen[$2] = 1; }'
You can even shorten this a bit! "awk '!seen[$2]++'" does the same thing -- awk will print the whole line when it's provided a truthy value. It's definitely more code-golfy than being explicit about what's actually going on though

Definitely very terse, but I'd call this idiomatic awk.

I did a lightning talk on awk last year and found this great article series from 2000 on all the powers of awk (including network access, but not yet email :) ).


I admire your work. Clever usage of unix tools is very handy. But for parsing text, do you really see awk and Unix tools as a better solution than a simple Python script?

Although I admit that the key argument for Unix tools is that they don’t get updated. That sounds awful, but think about it: once it works, it works everywhere, no matter the OS type, version or packages installed. That is something experienced programmers always want from their solutions.

Python is fantastic for little (or large!) bits of logic, but its handling of input is clunky enough to put me off for tiny things. AFAIK the boilerplate you need to get to working on the fields on each line is:

  import sys
  for line in sys.stdin:
    fields = line.split()
    # now you can do your logic
If you want to use regular expressions, that's another import.

Python also doesn't play well with others in a pipeline. You can use python -c, but you can't use newlines inside the argument (AFAICT), so you're very limited in what you can do.

This is exactly where perl (namely, perl -ne) is so very, very useful.

Yes, that helps a lot. The fact that Perl uses braces rather than whitespace also makes it work much better in this situation.

I still wouldn't touch Perl with a bargepole, though. Sorry not sorry.

Totally agree. On a related note, I came across a similar thing just the other day for rubyists:


Probably not as fast as many of the individual unix tools it could replace, but does look like a great way to leverage one's knowledge of ruby.

parsing text is what a lot of these scripts/mini-pipelines do.

the key argument for *nix tools is that they do one thing and only one thing extremely well. at a meta level these tools are units of functionality and you’re actually doing functional programming, on the command line, without realizing it.

The problem is there are like 5 different tools that do the same "one thing" well - awk, sed, grep, cut, etc.

Kind of simpler to learn one language that does lots of things well!

i’m going to respectfully disagree.

each tool you listed (with the exception of awk) does one thing and does it extremely well.

my goal is not to use a language to solve all possible variations on problems i have. my goal is to solve the problem.

another interesting side effect is that a lot of times this is super compact and good enough. when it’s not you can go to a programming language

^ agree -- I've seen lots of folks [newer users mostly] turn to grep when really what they wanted was sed. It's just a matter of learning which screwdriver is for which type of screw

Isn't sed's "one thing" a superset of grep's?

grep is for searching for stuff

sed is for editing streams

Pretty sure you can search for stuff with sed.

"I don't think there's another simple tool in the unix toolkit that lets you do things like this."

Perl can, since it borrowed a fair amount of awk. It's also almost as commonly already installed. The one liner equivalents to what you showed are pretty similar, for example: https://news.ycombinator.com/item?id=19294575

Though, I concede it falls outside the realm of "simple tool".

There's a wonderful quote about things like this in the UNIX-HATERS Handbook:

> However, since joining this discussion, a lot of Unix supporters have sent me examples of stuff to “prove” how powerful Unix is. These examples have certainly been enough to refresh my memory: they all do something trivial or useless, and they all do so in a very arcane manner.

So I assume you used to use some sort of 'grep XX | sort | uniq' (I still do) to get unique line output. Is this awk line now your default, or do you find yourself using both for convenience?

Do you alias these awk commands on all machines you work on? Put another way: I have not found a nice way to keep my custom aliases 'in sync' across different machines. Perhaps you have some recommendation or workflow that is really sweet?


I still default to sort -u (or sort | uniq -c if i need counts), partly from habit, but partly because it's often useful to have the output sorted anyway.

I have a script on my path called huniq ('hash uniq') that contains that awk program. I prefer scripts to aliases because they play better with xargs and so on.

I have a Mercurial repository full of little scripts like this, and other handy things, which lives on Bitbucket, and which i clone on machines i do a lot of work on. In principle, whenever i make changes to it i should commit and push them, then pull them down on other machines, but i'm pretty slack about it. It still helps more than not having anything, though.
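As an aside for anyone who wants the `sort | uniq -c` behaviour inside a Python script instead, a minimal sketch with `collections.Counter` follows. The helper name is hypothetical, and the ordering is plain Python string ordering, which can differ from what `sort` produces under a given locale.

```python
from collections import Counter


def count_lines(lines):
    """Count identical lines and return (line, count) pairs sorted by line."""
    counts = Counter(line.rstrip("\n") for line in lines)
    return sorted(counts.items())


for line, count in count_lines(["b\n", "a\n", "b\n"]):
    print(f"{count:7d} {line}")
```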

Thank you. I have just been working on something, and this

> awk '{ if (!($0 in seen)) print $0; seen[$0] = 1; }'

fits right in.

Perl equivalent is:

perl -ne 'print unless $SEEN{$_}++'

I had a task the other day to aggregate some logs. So I wrote a one liner, which did most of what I wanted. It took about 4 minutes to run.

Then I decided to run it on a larger dataset (because I needed to). Like a week of logs, not a day of logs.

While it was running, I wrote a Rust CLI, which worked like `cat /*.log | logparser` and did one day in 12 seconds, and a week in two minutes.

And I gave up waiting on awk, btw. It is not always better to use command line. If you have gigabytes or tens of gigabytes of data, it would be easier to write some cli tool to help you out.

Also, it was much easier to put significantly more complex logic into it because of type checking, and, you know, being actual high level programming language, not hack&slash awk script.

EDIT: Looking back on my "one liner" vs "rust cli" I would not be able to make meaningful adjustments to one liner comprehension. It is, to my sorrow, write-only thing.

If your awk script gets too long / unreadable you just put it in a file and use some whitespace and longer variable names.

AWK scripts tend to be very readable (much more so than e.g. sed) as long as they stick to the "stateful filters" use-case as https://news.ycombinator.com/item?id=19294195 calls it, but yes they have their limits.

If speed is a concern, you may want to try using mawk instead of GNU awk/gawk. I've had 4x speedups with mawk.

I also find that programming in languages that have a strong type system is very helpful if you're trying to make anything with logic you can't fit in one simple sentence.

For example, I was parsing logs. The entries were URL-encoded JSON, one document per line, and each could be invalid. I had to extract the 'id' field, count the number of entries with the same ID, and count the number of entries with the same ID and a special marker. Then take only the IDs with 10000+ results.

You can totally write cat/grep/awk/sort/head. But then I wanted to add a field, and it was hard to edit. Rust solved my problem, and I had a pleasant time writing code, not editing a foot-long line of untyped code.
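To make the shape of that task concrete, here is a hedged Python sketch of the aggregation as described. The original Rust tool and the log format were never posted, so everything specific here is an assumption: the marker field name `special`, the use of `urllib.parse.unquote` for decoding, and the sample lines are all invented for illustration.

```python
import json
from collections import Counter
from urllib.parse import quote, unquote


def aggregate(lines, marker_field="special", threshold=10000):
    """Count entries per 'id', and marked entries per 'id', skipping invalid lines.

    Returns {id: (total, marked)} for ids with at least `threshold` entries.
    """
    totals = Counter()
    marked = Counter()
    for line in lines:
        try:
            doc = json.loads(unquote(line.strip()))
        except json.JSONDecodeError:
            continue  # invalid document: ignore the line
        if not isinstance(doc, dict):
            continue
        entry_id = doc.get("id")
        if not isinstance(entry_id, (str, int)):
            continue  # missing or unusable id
        totals[entry_id] += 1
        if doc.get(marker_field):
            marked[entry_id] += 1
    return {i: (n, marked[i]) for i, n in totals.items() if n >= threshold}


# Tiny made-up sample; a low threshold keeps the demo readable.
sample = [
    quote(json.dumps({"id": "a", "special": True})),
    quote(json.dumps({"id": "a"})),
    "%7Bnot json",
    quote(json.dumps({"id": "b"})),
]
result = aggregate(sample, threshold=2)
print(result)
```

Adding another field to track means adding another counter and one more `if`, which is the kind of edit that gets painful in a long pipeline.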

Based on his timings, rust achieved a 20x speedup vs a 4x speedup if he used mawk.

Yep, and adding gnu parallel or xargs to the mix would give a # of cores speedup, which would make mawk faster than the single-threaded rust on a > 6 core machine (roughly).

Would you mind posting your awk write-only monstrosity?

It wasn't all awk; there were several commands connected by pipes, extracting data from JSON entries, some grep filtering, awk to aggregate data by key, etc.

I tried to look it up and couldn't find it, sorry. It was more than half a year ago.

Doesn't sound like performance was your end goal unless it was to demonstrate that your understanding of the unix tools could produce a solution slower than a custom built solution in the language of champions.

* If it's simple transforms, use cli tools.

* If it requires aggregation and it's small, use cli tools.

* If this is data you're using over and over again then load it in the database and then do the cleaning, ELT.

* If it's 2tb of data and under, still use bzip2, get splittable streams and pass it to gnu parallel.

* If it requires massive aggregations or windows, use spark|flink|beam.

* If you need to repeatedly process the same giant dataset use spark|flink|beam.

* If the data is highly structured and you mainly need aggregations and filtering on a few columns use columnar DBs.

I've been using Dlang with ldc a lot because of how fast its compile time regex is, and its built in json support. Python3+pandas is also a good choice if you don't want to use awk.

Before reaching for spark, etc:

Sort is good for aggregations that fit on disk (TBs these days, I guess)

Perl does well too if the output fits in a hashtable in DRAM, so 10’s (or maybe 100’s?) of GBs

For bzip2 why not just use pbzip2? Frankly, I wish distros would replace the stock bzip2 with pbzip2 (I think it's drop in compatible).

I have been using the command line for all types of work for years now. The most satisfying part is realizing that there is always more to learn. And once you've grasped the basics, they fit together like LEGO bricks.

While I think it's important to make that argument, the posted article and the one it refers to lack some guidance on how to reach "command line mastery". I recently came across this great resource here on HN:


It gives a great overview of the toolbox you have on the command line. Equipped with `man` you're ready to optimize your everyday work. And always remember to write everything down and ask yourself WHY something works the way it works. The interface of the standard tools is thought out very well. Getting comfortable with this mindset pays off.

This old article covers the same topic with a more complex example and a surprising result about parsing 3.46GB of data:


A few years ago I had a personal project where I needed to load some of Wikipedia's tables into MySQL on a cheap computer. On my first attempt to just load the downloaded SQL scripts, the computer churned for 3 days before I called it quits. This article inspired the alternative approach: strip the first few lines, down to the first collection of INSERT statements, use sed to convert '),(' to \n, then use sed again to strip the first opening and the last closing parens left over from the earlier operation. Now we have plain CSV. Import it into MySQL. The whole operation took about an hour or two, including importing into MySQL. Yes, I lost the indices, but it was a prototype, so nothing much was lost.
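The two sed steps can be sketched in Python roughly as follows. This assumes dump lines shaped like `INSERT INTO ... VALUES (...),(...);` with no column list, and it is exactly as naive as the sed version: a string value that itself contains `),(` would be split in the wrong place. The function name and sample statement are invented.

```python
def insert_to_rows(statement):
    """Split one multi-row INSERT statement into one string per row."""
    marker = " VALUES ("
    start = statement.index(marker) + len(marker)  # just past the first "("
    end = statement.rindex(")")                    # the last closing paren
    body = statement[start:end]
    # Naive split, same as replacing '),(' with a newline in sed.
    return body.split("),(")


rows = insert_to_rows("INSERT INTO `page` VALUES (1,'a'),(2,'b'),(3,'c');")
print("\n".join(rows))
```

For a throwaway prototype that trade-off may be fine; for anything longer-lived a real SQL parser is the safer route.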

Sometimes there's a middle ground: make your "map" and "reduce" steps separate scripts.

If you want to do the parsing in Python instead of awk, just make a tiny script that reads from stdin and writes to stdout - that way you can put it between xargs or parallel and whatever else is in the pipeline.

The parallelization is a separate concern, so it doesn't need to be mixed in with the parsing (or whatever) concern. The downloading is a separate concern; use wget or requests in a Python script or whatever, it doesn't need to be mingled with the parsing.
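A minimal sketch of such a filter, assuming a made-up per-line step (keep the first two whitespace-separated fields, tab-joined). The demo feeds it in-memory streams so it runs anywhere; in an actual pipeline you would import `sys` and call `run(sys.stdin, sys.stdout)` instead.

```python
import io


def transform(line):
    """Hypothetical parsing step: keep the first two fields, tab-separated."""
    fields = line.split()
    return "\t".join(fields[:2])


def run(stream_in, stream_out):
    """Read lines from one stream, write transformed lines to another."""
    for line in stream_in:
        if line.strip():  # ignore blank lines
            stream_out.write(transform(line) + "\n")


# Demo with in-memory streams standing in for stdin and stdout.
demo_in = io.StringIO(
    "2019-03-03T12:38:16Z hob: turned to 75%\n"
    "\n"
    "2019-03-03T12:38:22Z batter: mixed\n"
)
demo_out = io.StringIO()
run(demo_in, demo_out)
print(demo_out.getvalue(), end="")
```

Since the script only knows about streams, the downloading and the parallelism stay outside it, which is the separation of concerns being described.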

This article's primary example is a single static text file with 5M lines. Sure, in that case, awk works great, but how often does that come up? In the real world, those 5M lines are growing by several hundred thousand every day, and after a few months, grow beyond what a single computer or awk can handle. Further, users want real-time results, not just a few times a day when your cron script runs.

Unix commands are great up to a few GBs of data, Excel is even better if you're dealing with less than a few tens of MBs. But to deal with Terabytes of data quickly and efficiently, these tools totally break down.

> how often does that come up?

It's an important point to remember that a lot of things involved in human society have not exploded in size or complexity in the last 30 years.

Many data sets are basically proportional to the human population (health records, criminal records, property records etc), and these have been measured in the millions for 30+ years. In the same time the compute power of a single script has moved from the millions into the billions.

It's important, because if a government needs to, say, calculate something involving "every building in the country", or "everybody with a criminal record" they need to understand that this task, in 2019, can in fact be done by a single programmer parsing flat text files on their MBP, and does not need a new department.

This is a bit like Grace Hopper always pointing out the difference between a microsecond and a nanosecond - https://www.youtube.com/watch?v=JEpsKnWZrJ8

> Many data sets are basically proportional to the human population.

Awesome point well put.

I've never seen that Grace Hopper bit, but that's awesome! Thanks for sharing. Definitely going to show it to some co-workers

> But to deal with Terabytes of data quickly and efficiently, these tools totally break down

Scaling up to 400~500GB of logs with awk and parallel has not been a problem for me. I don't think TB will be particularly hard, especially if one is reasonably proficient with the tools. Of course, if one has the mental bias that one has to throw hadoop or spark at it, that's a significant obstacle right there. The Upton Sinclair effect also plays a role -- it is difficult to get a man to understand something, when his salary depends upon his not understanding it.

Of course at some scale simple unix tools become impractical, but usually people reach for flashier tools even at scales where unix tools will suffice.

Well, the article does explicitly say that if you can do this then you don’t have “big data”. Maybe you see TB level processing a lot in your line of work, but most developers never will. Whenever I deal with anything a bit bigger, I break off the smallest section I need to deal with and work with that.

> Maybe you see TB level processing a lot in your line of work, but most developers never will.

I doubt this is the case. If you're working on even a medium sized team, you'll see this level of data, whether it's internal server logs, publicly-sourced data, or a variety of other applications. Almost by definition, any company that runs a cluster is probably dealing with TBs of data (otherwise they could probably run it on one machine). I don't know the percentage of developers that deal with this, but I do know that it's pretty common.

> Almost by definition, any company that runs a cluster is probably dealing with TBs of data (otherwise they could probably run it on one machine).

This doesn't follow:

1) One reason to run a cluster is to use flaky commodity hardware instead of high-reliability specialized hardware. Note the transition from specialized hardware and computer rooms of the '80s and '90s to cloud computing on preemptible instances.

2) They might just be straight-up wrong / misguided. http://www.frankmcsherry.org/graph/scalability/cost/2015/01/... As a professional developer I've seen tons of distributed systems that could be replaced by a well-designed non-distributed system, but ELK and Mongo and etcd and Kafka and more generally cloud servers are easy off-the-shelf tools.

Just a few weeks ago I was working with an 11TB dataset which I processed with just unix command line tools on a single Linux VM.

You can get a lot of mileage out of parallel xargs and other off-the-shelf tools.

Indeed! GNU parallel, xargs, make -j, netcat, mawk/gawk (with C extensions if needed), jq, coreutils, textutils can get an insane amount of stuff done.

I have a simple bash script composed of 6 sed commands piped together to convert >100GB csvs (table dumps of recommendation data) into Redis binary protocol which are then ingested into a Redis Cluster using redis-cli --pipe. It takes somewhere around 15 minutes running on a modest bare metal server.
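To make the CSV-to-Redis idea concrete, here is a rough sketch of the conversion step. The comment above describes six sed commands; this version uses awk instead, because the Redis wire protocol (RESP) needs the byte length of every argument. The two-column `key,value` layout, the `SET` command, and the helper name `csv_to_resp` are assumptions for illustration, and the sketch only handles ASCII data without embedded commas or quotes.

```shell
# Hypothetical helper: turn "key,value" lines into Redis SET commands in
# RESP form: *<argc>, then $<length> and the argument, each ending in CRLF.
# LC_ALL=C makes length() count bytes, which is what the protocol expects.
csv_to_resp() {
    LC_ALL=C awk -F, '{
        printf "*3\r\n$3\r\nSET\r\n$%d\r\n%s\r\n$%d\r\n%s\r\n",
               length($1), $1, length($2), $2
    }'
}
```

The output can then be streamed straight into the server, along the lines of `csv_to_resp < dump.csv | redis-cli --pipe`, where `dump.csv` is a placeholder file name.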

Sampling or subsetting your data is almost always the correct response, especially for analysis.

If you're literally processing TB or PB of data, you'll want to parallelise. Though shell tools do this amazingly well.

> Sure, in that case, awk works great, but how often does that come up?

80% of bioinformatics.

> In the real world, those 5M lines are growing by several hundred thousand every day, and after a few months, grows beyond what a single computer or awk can handle.

This is a frustrating viewpoint, and it feels to me like the same poisonous worldview as "If your company isn't growing 10% month-over-month, it's not worthwhile and we won't invest in it." There are plenty of meaningful and useful things you can do for the world at a static size, and doing continued interesting work with a dataset of the same size doesn't mean you're not part of the "real world."

> In the real world, those 5M lines are growing by several hundred thousand every day

Or they don’t. Not all data are cloud-scale aggregations.

In my world your case is the corner case. Don't think I want to deal with multiple terabytes of data, as its provenance is questionable and this data services industries that the human population did without until about ten/fifteen years ago.

Users "want" vs do users "need" is the Q.

Also most people don't deal with terabyte sized data sets

Why would you need to look at the entire history of the log file each time?

Isn't it logical to have historical summary info and then the full log for say the past month?

> Isn't it logical to have historical summary info and then the full log for say the past month?

What if you want to change what's contained in your summary?

Relevant reading: The GNU coreutils manual


Does this add anything to the Taco Bell post linked in TFA?

I suggest changing the link to: http://widgetsandshit.com/teddziuba/2010/10/taco-bell-progra...

TacoBellArticle> I could have done the whole thing Taco Bell style if I had only manned up and broken out sed, but I pussied out and wrote some Python.

That’s cringe-worthy...

Despite the gendered terms, what makes me cringe is the belief that X is somehow "better" than Y - if you know Python, and have access to Python, and express yourself in Python faster, then use Python. I use sed, awk, cut, Python, as needed - whatever lets me solve my problem faster.

No flexing about how you use X needed.

Yeah, I know some women that are way better at unix than me. Heck I learned to program, to the extent that it wasn't autodidactic, mostly from women.

Anyway, it's poor form to put gender into the mix of what good coding looks like. Don't man up, do it hard core or bravely. Don't pussy out, chicken out. Don't try something ballsy, try something gutsy.

Unix commands definitely go a long ways.

I've been freelancing for a long time but never automated invoicing people up until recently.

So I combined grep, cut, paste and bc to parse a work log file to get how many hours I worked on that project, what amount I am owed and how many days I worked. I can run these analytics by just passing in the log file, a YYYY/MM date (this month's numbers), YYYY date (yearly numbers) or no date (lifetime).

Long story short, the working prototype of it was 4 lines of Bash and took about 10 minutes to make.

Now I never have to manually go through these work log files again and add up invoice amounts (which I always counted up manually 3 times in a row to avoid mistakes). If you're sending a bunch of invoices a month, this actually took kind of a long time and was always error prone.
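A sketch of what such a work-log summary could look like, under stated assumptions: the log format (tab-separated date, whole hours, note), the function name `worklog_summary`, and the hourly-rate argument are all made up for illustration. It keeps the grep, cut and paste steps from the comment but swaps bc for shell arithmetic so that it has no extra dependency; with fractional hours, piping the `3+5+2` expression to bc would be the natural choice.

```shell
# Hypothetical work-log format, one entry per line, tab-separated:
#   YYYY/MM/DD<TAB>whole hours<TAB>note
# Usage: worklog_summary LOGFILE RATE [DATE_PREFIX]
#   DATE_PREFIX is YYYY/MM, YYYY, or omitted for lifetime totals.
worklog_summary() {
    log=$1 rate=$2 prefix=${3:-}
    # Keep entries whose date starts with the prefix (empty prefix = all)
    entries=$(grep "^$prefix" "$log" || true)
    # Join the hours column into "3+5+2" and let the shell add it up
    sum_expr=$(printf '%s\n' "$entries" | cut -f2 | paste -sd+ -)
    hours=$(( ${sum_expr:-0} ))
    # Count distinct dates worked
    days=$(printf '%s\n' "$entries" | cut -f1 | sort -u | grep -c . || true)
    echo "hours=$hours owed=$(( hours * rate )) days=$days"
}
```

Calling `worklog_summary work.log 50 2019/03` would report one month, `worklog_summary work.log 50 2019` one year, and `worklog_summary work.log 50` the lifetime numbers.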

When I was in middle school, a friend and I had an internship with a physicist who wanted us to write some software to perform a simple transformation on data sets that he downloaded by FTP from some experiment.

We spent about a week writing a program in QuickBASIC that successfully parsed the files and performed the transformation.

Some years later, I realized that this would be a one-line awk script which I could now write in 20-30 seconds. (Probably someone comfortable with Excel could also perform the transformation in 20-30 seconds, although it might not scale as well to larger files.)

I agree with the sentiment that many solutions are over-engineered, but when you need to process billions of records a day, you do need more complex systems.

Bottom line: when facing an engineering problem, start with the simplest, fastest to implement solution, and build complexity as necessary. The simple solution suffices most of the time.

You know, "billions a day" is only on the order of 10K per second. A single machine can handle that.
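The arithmetic behind that estimate is easy to check in the shell. This is only the average rate; the helper name `per_second` is made up for the example.

```shell
# Average rate implied by N records per day (integer division).
per_second() {
    echo $(( $1 / 86400 ))   # 86400 seconds in a day
}

per_second 1000000000   # one billion a day -> prints 11574
```

A few billion a day lands in the tens of thousands per second on average, which says nothing yet about peak load.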

That would be amazing, wouldn't it? It's not true though. The problem with dealing with billions of operations a day is the spikes: most of the time you don't get a nice homogeneous rate for 24 hours straight.

Depends on how much latency matters. A lot of big data is batch processing, for which if the data is 3 hours old that's more than good enough.

> billions of records a day

A good GPU can do >1bn calculations per frame.

A good GPU today can do > 1 Tn calculations per frame.

I really agree with aspects of this, and I think CLIs and Unix pipes are way more powerful than we treat them, but be forewarned that there are problems with doing everything with pipes.

You need to code more defensively with them. For example, it is rare, but every so often a newline will fail to be emitted.

There are many other gotchas, but that one is a doozy because if you're using, say, tab delimited data and cut you'll miss a line. It's one of the reasons I use line delimited JSON if at all possible.

Also, this constant re-parsing of text does mean your string validation needs to be more paranoid. For example, some JSON parsers parse curly quotes as normal programming quotes. Horrible practice, I know, but it could have been avoided. Also, it's easy to accidentally do shit like this when you're in a rush. Some string matching tools that normalize the different ways of encoding, say, "ë" will also make quote matching more relaxed.

Anyway, all of this to say that I 100% agree with the posted and linked articles, but each method has its own security considerations and software folk should be aware of them before starting.
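One cheap defence against the missing-newline problem described above is to check the field count of every line before handing the data to cut. A minimal sketch, assuming tab-delimited input with a fixed number of columns; `check_fields` is a hypothetical helper name.

```shell
# Fail (exit status 1) if any line of tab-delimited input does not have
# exactly the expected number of fields, and report the offending lines.
# Usage: check_fields EXPECTED_COUNT < data.tsv
check_fields() {
    awk -F'\t' -v want="$1" '
        NF != want { printf "line %d: %d fields, expected %d\n", NR, NF, want; bad = 1 }
        END { exit bad + 0 }
    '
}
```

Two records that were fused by a dropped newline show up as a single line with too many fields, so a pipeline such as `check_fields 3 < data.tsv && cut -f2 data.tsv` stops instead of silently losing a row. It cannot catch damage that leaves the field count unchanged.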

This is one of the reasons I prefer PowerShell, it requires a lot fewer text parsing shenanigans. UNIX tools simply failed to evolve. Single io stream pipelining on raw ASCII was perfectly reasonable in the 1970s but it isn't the 1970s anymore.

We should be composing tools with multiple typed io stream paths in GUIs (or TUIs I suppose), leveraging two or even three dimensional layouts. All our interfaces should be composed this way, allowing us to take them apart and modify them at will to fit our workflow.

But that never happened. We never made a better hammer, we just try to squeeze all our problems into ASCII-processing nails instead.

You know it is so funny that you're mentioning this. I completely agree.

It's to the point where I've been toying around with creating my own shell and faking typed IO streams via Postgres+DSL. It's tricky though. Sometimes I want pub-sub, other times I want event stream. Sometimes I want crash-on-failure, other times I don't. There is this problem in software that I can't really word precisely, but the closest I can come is "do it like this, except these cases here, except-except those cases there" and these things kinda keep stacking up until you have a program that has too much knowledge baked into it.

Take, for example, emoji TLDs. Because emojis aren't consistent across platforms they can get coerced into different types. I didn't know that when I bought and used a couple emoji domains. When someone tried to click on a link in Android and was met with a 404, I was so confused. I wasn't even seeing the request come into nginx!

After I figured it out, I realized that emoji domains won't work. The underlying assumption of TLDs is that there is one, and only one, way of encoding something and that these things aren't coerced. That assumption is wrong.

But do you have similar tools as the core CLI tools on Linux like grep and sed? How do you discover functionality without man?

I use an awk interpreter called mawk.[1] This is noticeably faster than gawk or other standard variants.

[1] https://invisible-island.net/mawk/

Often, yes, though not always, and mawk has some omissions relative to gawk.

Try multiple interpreters with timings.

Gawk's profiler can be invaluable.
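Comparing interpreters can be as simple as a loop. The sketch below runs the same program under whichever of mawk, gawk and awk are installed and prints wall-clock seconds; `time_awks` is a made-up name, and whole-second resolution from `date +%s` is only meaningful on inputs large enough to take a while.

```shell
# Time one awk program under every awk interpreter that is installed.
# Usage: time_awks 'PROGRAM' FILE
time_awks() {
    for interp in mawk gawk awk; do
        # Skip interpreters that are not on this machine
        command -v "$interp" >/dev/null 2>&1 || continue
        start=$(date +%s)
        "$interp" "$1" "$2" >/dev/null
        end=$(date +%s)
        echo "$interp $(( end - start ))s"
    done
}
```

For example, `time_awks '{ n += NF } END { print n }' big.log`, with `big.log` standing in for a real input. On many systems plain `awk` is a link to one of the other two, so its line may simply repeat an earlier result.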

> BTW, if your data set can be disposed by an awk script, it should not be called “big data”.

I think this statement is wrong. The popular meaning of the hype term “big data” can not be easily changed.

Rather, awk, sed and other tools that can read from stdin and write to stdout are great tools for “big data” and often more efficient and suitable than larger and more hyped systems.

Among programmers I've found that the size of your "big data" is often implied to correlate with the size of something else

Yeah, like the amount of gibberish boilerplate you write, or the verbosity of the error messages your script emits.

Yup. Totally agree with OP. Early on in my career I had to generate on the fly reports for hundreds of GB of data and all it needed was to throw some *NIX commands around and eventually piping them to awk to do the final bit and it was blazing fast.

These days, these are called big data. No it isn't...

I don't get the motivation of this article; it links to the taco bell programming article, which says exactly the same. I usually wouldn't write an article to repeat what another article already says, or if it is something that could have been just a comment in the original blog.

Not directly command line, but very relevant: https://www.frankmcsherry.org/assets/COST.pdf "Scalability, but at what COST".

Does anyone have suggestions on books to grow my scripting fu? (End of chapter exercises tend to be useful for me)

I know bash, and know a lot of basic commands, but I'm not familiar with some more advanced things. I don't know awk or sed for example.

What really helped me is my desire to automate most aspects of my job. If I'm doing something more than once(and twice is more than once), I prefer to automate it. I'm really stupid, and I'm really, really lazy. I am not good at repetitive tasks, but computers are. If there's one thing I feel is important for every programmer to get, it's that our single reason for existence is to make computers do work for us. We should look for opportunities to do so, not always because it is the most efficient thing to do, but because it is what we do.

So I end up writing a lot of bash scripts. A lot of small python scripts. Stupid scripts. Over time, you build up a library of tricks. You read threads like this where you pick up new tricks. It isn't something you'll learn once. Some of the tools take years to really get the feel of.

If you're looking for good books, I can vouch for 'A Practical Guide to Linux Commands, Editors, and Shell Programming' which has very thorough coverage and end of chapter exercises. For PowerShell, I'm currently reading the free PowerShell Notes for Professionals ( https://books.goalkicker.com/PowerShellBook/ ) and it's a great resource as well.

Thanks for the book suggestion, that looks like exactly the kind of thing I'm interested in.

Thank you for sharing

    if your data set can be disposed by an
    awk script, it should not be called “big data”.

Why not? I don't see how awk is limited to a certain amount of data.

If your data fits on a single harddrive it's not big data. So I would set the current limit to at least 14 TB.

One time I met a company who insisted they were sending tens of TB of data per day and would need multi-PB per year storage compressed. Took one look at the data: All json, all GUIDS and bools. If we just pre-parse it, the entire dataset for a year fits in a few 100s GB uncompressed -- literally could fit on a macbook air for most of the year.

The funny thing about "big data" in my experience, is just how small it actually becomes when you start using the right tools. And yet so much energy goes into just getting the wrong tools to do more...

> The funny thing about "big data" in my experience, is just how small it actually becomes when you start using the right tools.

Rings way too true for me atm.

My current workplace is currently struggling, because one of our applications stores something like a combined 300G of analytics data in the database with the application data. Modifying the table causes hours of downtime because everyone claims that backwards compatible db changes are too hard. And everyone is scared because with more users there's "so much more analytics data" incoming. Yes, with 300Gb across 3-4 years.

And I'm just wondering why it's not an option to just move all of that into one decently sized mysql/postgres instance. Give it SSDs, 30 - 60Gb of ram for the hot dataset (1-2 month) and it'll just solve our problems. But apparently, "that's too hard to do and takes too much time" without further reason.

All of the source text for all of Google+ Communities posts is a few hundred GB. This for 8.1 million communities and ~10 million active users.

Add in images and the rest of the Web payload (800 KiB per page), and that swells to Petabyte range. But the actual scale of the user-entered text is stunningly small.

I thought the boundary point was RAM. It is relatively simple to work with data across multiple drives. When you pass the boundary of being able to work in a single system's RAM, you generally need a more significant rework.

"Big Data is any thing which is crash Excel."


Most stream processing doesn’t rely too much on RAM, unless you literally need all the data in memory at the same time.

If it _needs_ to be in RAM, then either you got a big enough machine (then by definition it's not Big Data) or it's impossible. If you manage to come by with RAM using smart algorithms, although the full dataset would never fit in RAM, then it's Big Data. So I'd argue, stream processing is Big Data, exactly because it doesn't rely too much on RAM.

True, but I can stream process 20GB of data on my tiny 2GB RAM home server as well.
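The reason this works is that a streaming tool only keeps its running state in memory, not the data. A minimal sketch with awk: the name `column_stats` and the choice of statistics are illustrative, and it assumes a numeric column.

```shell
# One pass, constant memory: only a count, a running sum and the extremes
# are kept, no matter how many lines stream through.
# Usage: column_stats [COLUMN] < data
column_stats() {
    awk -v col="${1:-1}" '
        NR == 1 { min = max = $col }
        { sum += $col; if ($col < min) min = $col; if ($col > max) max = $col }
        END { if (NR) printf "n=%d min=%s max=%s mean=%.2f\n", NR, min, max, sum / NR }
    '
}

seq 1 100 | column_stats   # prints n=100 min=1 max=100 mean=50.50
```

Anything that needs the whole dataset at once, such as an exact median, breaks this property and is where RAM starts to matter.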

That’s not really ‘big data’ in my opinion.

"EC2 High Memory instances offer 6 TB, 9 TB, and 12 TB of memory in an instance. "


That is harder to define. Server mainboards can hold more RAM than consumer mainboards. So with 32GB per slot and 4 slots I would set a limit to 128GB? Also this would make so many more tasks "big data". Games with 50+GB are not big data, neither is e.g. video conversion.

I think the limit is more in the double digits TB range right now.

See? As I said, this is not easily definable. A single HDD is though. What you could do is a single RAM module.

Idk, it's fuzzy, obviously, and a moving goalpost, but in the end it comes down to this: do I buy/rent a bigger server? Or do I put more engineers to the task of replacing the naive algorithm. The latter is Big Data and usually the case when you run out of RAM, not out of disk. And the first one is an underrated option.

You can still connect a fairly big NAS to a beefy server and do the processing, unless throughput rate becomes an issue. Saturating a 10Gbit link means you can probably process up to 100TB a day.
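The 100TB figure checks out as a back-of-the-envelope upper bound. In the sketch below, `tb_per_day` is a made-up helper; it assumes decimal units and a link that stays fully saturated for 24 hours.

```shell
# Bytes per day over a fully saturated link, reported in decimal terabytes.
# Usage: tb_per_day GIGABITS_PER_SECOND
tb_per_day() {
    bytes_per_sec=$(( $1 * 1000000000 / 8 ))          # bits -> bytes
    echo $(( bytes_per_sec * 86400 / 1000000000000 )) # per day, in TB
}

tb_per_day 10   # prints 108
```

Protocol overhead and disk speed will pull the real number below that ceiling.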

Medium data. Substantial data. Just-enough-data.

My favorite term is "annoying sized" data, not enough to warrant clusters and HPC, but enough to make a decent laptop crawl to a halt. It's that uncomfortable in-between that makes up the bulk of the data I usually encounter.

The one nice thing about AWS and its ilk is the ability to spin up a big chunky VM for a few hours/days for ad-hoc data processing for a pretty small amount of money. That can in many cases shift the boundary of annoying sized enough, with the added bonus that your own workstation is not fully occupied by the processing. Of course such an approach is not applicable to all scenarios, but it is a useful trick to keep up your sleeve.

A while back on RDS we did our migrations by standing up a super fat instance, fail over, migrate, fail back and shutdown the fat instance. This was a great asset, and AWS made this easy.

This is pretty much where we are at, especially with a lot of transforms and processing on high-speed video captures.

This submission seems weirdly relevant.


I was going to comment on exactly this article except that I couldn't find it quickly. It isn't weirdly relevant; it's totally relevant. It demonstrates that, with sufficient knowledge of the command line, one can write the most amazing tools, quickly and succinctly. Here, knowledge doesn't necessarily mean knowing everything immediately but also knowing what resources to reference to find out stuff.

I've been programming for 40 years and using unix/linux since the 80's and in this little one-line script, I discovered two things that one can do with the appropriate arguments that I've never known. YMMV.

Case in point from my own recent work: I've been analysing characteristics of Google+ Communities, mostly looking for plausibly active good-faith instances.

There are 8.1 million communities in total, and thanks to some friendly assistance, I'd identified slightly more than 100,000 with both 100 or more members, and visible activity within the preceding 31 days, as of early 2019.

The task of Web scraping those 100k communities, parsing HTML to a set of characteristics of interest, and reducing that to a delimited dataset of about 16 MB, was all done via shell tools, and on very modest equipment.

Most surprising was that parsing the HTML (using the HTML-XML utilities: http://www.w3.org/Tools/HTML-XML-utils/README) took longer than downloading the data.

Creating the datafile was done with gawk, and most analysis subsequently in R, though quick-and-dirty summaries and queries can be run in gawk.

Performance: downloading (curl): 16 hours, parsing (hxextract & hxselect) 48 hours, dataset preparation (gawk): 2 minutes, analysis (gawk / R), a few seconds for simple outputs.

The parsing step is painfully long, the rest quite tractable.

Are you planning on open-sourcing the downloader part? I'm very interested.

Literally just a Bash while-read loop over community IDs. It's embarrassingly trivial.

I'm planning on posting the data, probably to https://social.antefriguserat.de/ and will include processing scripts.

This is the fetch-script, which saves both the HTML and HEAD responses:

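    time sed -e 's,^.*/,,' "$sample_file" |
        while read -r commid; do
            # i, url, commfile and commhead are assigned in lines missing from this excerpt
            echo -e "\n>>> $i  $commid <<<" 1>&2
            # Print the curl command instead of running it
            echo "curl -s -o '${commfile}' -D '${commhead}' '${url}'"
        done

The sample file is simply a list of G+ community IDs or URLs, e.g.: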


This seems to be a perfect use case for GNU Parallel[0] to download and process, say 10 ids, in parallel. If you have already downloaded/processed and have no need to do it again, then probably doesn't matter now.

[0]: https://www.gnu.org/software/parallel/

Xargs, actually, though saturating my dinky Internet connection was trivial. Ten concurrencies kept any one request from stalling the crawl though.

That's why the script echoed the curl command rather than run it directly. It fed xargs.
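For readers who have not used the pattern, this is roughly how a stream of printed commands can be executed with bounded concurrency. It is a sketch, not the script used for the crawl: `run_concurrently` is a hypothetical name, and the exact xargs invocation in the original is not shown in the thread.

```shell
# Read one shell command per line from stdin and run up to N at a time.
# Usage: generate_commands | run_concurrently N
run_concurrently() {
    # -I {} makes each input line one item; sh -c executes it as a command.
    # Note that xargs applies its own quote handling to each line first.
    xargs -P "$1" -I {} sh -c {}
}
```

With ten workers this would be `... | run_concurrently 10`. Because xargs strips quotes before sh sees the line, commands containing characters such as `&` need extra care.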

The other problem was failed or errored (non 3xx/4xx, or incomplete HTML -- no "</html>" tag found) responses. There was no runtime detection of these. Instead, I checked for those on completion of the first run and re-pulled those in a few minutes, a few thousand from the whole run, most of which ended up being 4xx/3xx ultimately.
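The post-run check for truncated pages can be a single grep. This is a sketch under assumptions: one `*.html` file per community in a single directory, and `incomplete_pages` as a made-up helper name.

```shell
# List downloaded pages that lack a closing </html> tag, i.e. candidates
# for a re-pull. -L prints the names of files with no match, -i ignores case.
incomplete_pages() {
    grep -L -i '</html>' "$1"/*.html
}
```

The resulting file list can be mapped back to community IDs and fed through the same fetch loop for a second pass.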

> Every item on the menu at Taco Bell is just a different configuration of roughly eight ingredients.

HA! This has almost been my line for years regarding Mexican food. What I like to say is: it’s amazing how every possible permutation of 8 ingredients has been named. BTW I love Mexican food, lived in Mexico.

> The post mentions a scenario which you may consider to use Hadoop to solve but actually xargs may be a simpler and better choice.

I do feel like there’s a corollary to Knuth’s “premature optimization” quote regarding web scaling; premature scaling and using tools much bigger than necessary for the job at hand is pretty common.

It's not just data processing. There are simple solutions to all sorts of tasks.

I run 'motion' on my linux desktop at home to serve as a security camera when no one is home. For months I've been manually starting and stopping it, figuring I needed to setup an IoT system if I wanted to automate things. i.e. IFTTT on our phones, an MQTT server in the cloud, etc. Then I realized - I just need to start the camera when all of our phones are off the LAN. It took about 15 minutes to setup, and now I never have to worry about forgetting to stop or start the camera.

For my next book, I'm working through a concept I call "good enough programming."

Good enough programming is something you code that provides something people want -- and you never look at the code again. Find the problem, solve the problem, walk away from the problem. That's not sexy. It's not going to get you an article to write for a famous magazine, but it's good enough.

We have lost sight of "good enough" in programming, and without some kind of guardrails, we end up doing stuff we like or stuff that sounds good to other programmers. For instance, while I love cloud computing, I'm seeing "how-to" articles written about setting up a VPC for doing something like playing checkers. Yes, it was an oversimplified article, and you have to write that way, but without wisdom, how is the reader supposed to know that? What criteria do they use to determine whether it's a co-lo server, a lambda, or a world-wide distributed cloud?

We're going like gangbusters selling programmers and companies on all kinds of new and complex ways of doing things. They like it. We like it. But is it in anybody's best interest over the long run?

Recently I rewrote a pet project for the third time. First time it was C#, SQL Server, and an ORM. Then it was F#, MySQL, and linux. The last time it was pure FP in F# and microservices.

Some of you may know where this is going.

Just as I finished writing the app in a real microservices format, I realized. Holy cow! This whole thing was just a few Unix commands and some pipes.

My thinking went from all kinds of concerns about transactions and ORM-fun to just some nix stuff in a small script. The problem stayed the same.

Something else happened too. At each step, I did less and less maintenance. The last rewrite has had no maintenance required at all. In my spare time, I'm going to do the nix one using no servers at all on a static SPA. In a very meaningful way, there's no app, there's no server, and there's nothing to maintain. Yet I still get the functionality I need. And I never maintain it.

Of course that's not possible for every app, but the key thing I learned wasn't the magic of serverless static SPAs or the joys of unix. It was that I didn't know whether or not it was possible until I did it. By thinking in a pure FP fashion and deploying in true microservices, the rest just "fell out" of the work. At first I was actually thinking in a way that would have only led to more and more complexity and maintenance requirements.

My belief is that we get our thinking right first, use code budgets, and try for a simple unix solution. If it doesn't work, why? At least then we've made an effort to be good enough programmers. That beats most everybody else.

Write a blog post about it! It would be nice to see the progression and read about your insights of such complete reworks, even if it's a pet project.

Edit: Can I sign up somewhere to get a heads up when your book is available? Would be appreciated!

One place installed a Hadoop cluster. Some of us suspected that replacing a regex monstrosity with Lex/Yacc might have given the speed needed to process the files.

EDIT: by Lex/Yacc I mean some faster parser generator; I'm not too knowledgeable on that.

I totally agree that most of the *nix tools are most of the time the best ones, but things get trickier when there is a complex dynamic pipeline e.g. the input conditions the kind of processes and these are also dynamic based on other inputs.

I've found Nextflow to be an excellent solution to parallelizing and adding extra logic to any cli pipeline. It also helps manage environment and track metrics.

My favorite thing has been `| ruby -e "puts STDIN.to_a. ..."`, allows to run any kind of code on the standard input, much easier than remembering awk/sed various options and much more powerful.

edit: same thing can be done with Python/Perl

Yes, easier if you know ruby.. sed/awk is good for munching strings in small scripts and it is also always installed on a unix system (AFAIK).

You'll find sed and awk in busybox, virtually always, even on minimal systems. Including embedded devices, routers, Android, etc.

Yes, buddy, the simplest tools are the most powerful, and the fastest. Time is money.

Unless your data contains spaces, tabs, or, god forbid, newlines. Unix pipeline tools lack any sort of useful data structuring capabilities, making them appropriate for one-off tasks at most.

What are you on about? I can't think of one Unix command that manipulates structured data and doesn't offer a delimiter option. Furthermore the entire point (if you take a moment to think about it) of pipelines is, instead of all commands supporting your preferred delimiter, you need one command to translate (I don't know perhaps something named tr? ) in the pipeline to deal with such limitations.

Do you seriously think non-whitespace separated structured records is a novel idea which the simpler times of Unix didn't have to deal with? Have you looked at the passwd file? /rant
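As a small illustration of the delimiter point, using the passwd format mentioned above: cut takes the delimiter as an option, and tr converts it for tools further down the pipeline. The function name `name_and_shell` is made up for the example.

```shell
# Pull login name (field 1) and shell (field 7) out of passwd-style,
# colon-delimited records, then translate the delimiter to a tab.
name_and_shell() {
    cut -d: -f1,7 | tr ':' '\t'
}
```

Running `name_and_shell < /etc/passwd` yields a two-column, tab-separated listing.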

Spaces, tab and newlines are not a problem.

The fact that there are some standard tools available doesn't mean you are limited to that.

If you have a CSV file with spaces and newlines use csvkit or a small python script importing the relevant library. If you have to parse a JSON file you can use jq to pick the relevant fields regardless of how the document is formatted. You can even process binary data as long as the file format is understood by the tool.

Transcoding CSV to TSV using csvkit has been a life saver for me the last few months due to an old industry's insistence on CSV as an information delivery method. jq has previously been a lifesaver too.

Yeah I feel like there’s a pretty broad medium between Hadoop and awk. I generally find a python script to be a much clearer solution than a dense line of awk; the latter tends to turn into clever code golf pretty quickly.

GUI's are the McDonald's of interfaces.

The choices are limited and when you go there you know there's a good chance that you will end up wondering why what you got caused so much pain.

This is the Big Data version of "90% of Oracle instances could be replaced with sqlite".

Pragmatism almost always loses to CV-padding and office politics.

Nice article, though I think you should work on your blog's typography. Using a bold typeface for body copy is not pleasant to read.

Oh how I miss Ted Dziuba posts. He was a treasure trove of pragmatism for the industry.

At my company what you can do is determined by which AWS services you can string together.

This is what you get by not answering RTFM to stupid questions.

I hate to be that guy, but they're NOT "Unix" tools, as the name GNU literally states.

The post makes a good point that I fully agree with, just doesn't explain it well enough.

Many of the GNU tools are reimplementations of already existing tools.

For example the initial implementation of AWK was in 1977 [1], a few years before GNU even existed [2], so it _is_ a Unix tool.

[1] https://en.wikipedia.org/wiki/AWK#History [2] https://en.wikipedia.org/wiki/GNU#History

The GNU coreutils are a reimplementation of the Unix utilities. With some caveats, the same tools are available on BSD and Solaris derivatives which are both Unix.

So while GNU is obviously an important project, it would not be correct to say "xargs" is not a Unix tool.

I and many others take “Unix” to mean any UNIX or UNIX-like operating system, whereas “UNIX” with all capital letters means only the ones that are certified UNIX.

awk is in the POSIX norm. It is a Unix tool.

GNU awk is one of the popular awk implementations (usually referred to as gawk). I personally prefer mawk. awk is not GNU.

awk predates GNU by at least 7 years. It is most certainly a UNIX tool.
