
Command-line tools can be faster than your Hadoop cluster - wglb
http://aadrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html
======
danso
I'm becoming a stronger and stronger advocate of teaching command-line
interfaces even to programmers at the novice level...it's easier in many ways
to think of how data is being worked on by "filters" and "pipes"...and more
importantly, every time you try a step, something _happens_...making it much
easier to interactively iterate through a process.

That it also happens to be very fast and powerful (when memory isn't a
limiting factor) is nice icing on the cake. I moved over to doing much more on
the CLI after realizing that something as simple as "head -n 1 massive.csv" to
inspect the headers of corrupt multi-GB CSV files made my data-munging life
substantially more enjoyable than opening them up in Sublime Text.

~~~
hanoz
Your CSV peeking epiphany was in essence a matter of code vs. tools though
rather than necessarily CLI vs. GUI. On Windows you might just as well have
discovered you could fire up Linqpad and enter
File.ReadLines("massive.csv").First() for example.

~~~
alrs
The example was a multi-gigabyte CSV file. You just sucked the whole thing off
the disk into RAM so that you could shave off the first line.

If you're unlucky, you started swapping out to disk about halfway through.

~~~
recursive
That code you're replying to was carefully and correctly written. You just
replied as if you know how it works just so you could look like you know what
you're talking about.

If you're unlucky, someone who actually knows how File.ReadLines() works will
show up in an hour or two and explain that it's lazily evaluated.

~~~
alrs
:) touche

------
crcsmnky
Perhaps I'm missing something. It appears that the author is recommending
against using Hadoop (and related tools) for processing 3.5GB of data. Who in
the world thought that would be a good idea to begin with?

The underlying problem here isn't unique to Hadoop. People who are minimally
familiar with how technology works and who are very much into BuzzWords™ will
always throw around the wrong tool for the job so they can sound intelligent
with a certain segment of the population.

That said, I like seeing how people put together their own CLI-based
processing pipelines.

~~~
EpicEng
Exactly this just happened where I work. The CIO was recommending Hadoop on
AWS for our image processing/analysis jobs. We process a single set of images
at a time, which comes in at around 1.5GB. The output data size is about 1.2GB.
Not a good candidate for Hadoop but, you know... "big data", right?

~~~
threeseed
Another explanation is that your CIO is not an idiot but rather they know
about future projects that you don't. CIOs want to build capabilities (skills
and technologies), not just one-off implementations every time.

Not saying this is the case but CIO bashing is all too easy when you're an
engineer.

~~~
acdha
A good CIO would know that leaving out key parts of the project is unlikely to
produce good results. Even if the details aren't final, a simple “… and we
probably need to scale this up considerably by next year” would be useful when
weighing tradeoffs.

------
aadrake
Hi all, original author here.

Some have questioned why I would spend the time advocating against the use of
Hadoop for such small data processing tasks, as that's clearly not what it
should be used for anyway. Sadly, Big Data (tm) frameworks are often recommended,
required, or used more often than they should be. I know to many of us it
seems crazy, but it's true. The worst I've seen was Hadoop used for a
processing task of less than 1MB. Seriously.

Also, much agreement with those saying there should be more education effort
when it comes to teaching command line tools. O'Reilly even has a book out on
the topic:
[http://shop.oreilly.com/product/0636920032823.do](http://shop.oreilly.com/product/0636920032823.do)

Thank you for all the comments and support.

~~~
jeroenjanssens
Author of Data Science at the Command Line here. Thanks for the nice blog post
and for mentioning my book here. While we're talking about the subject of
education, allow me to shamelessly promote a two-day workshop that I'll be
giving next month in London:
[http://datascienceatthecommandline.com/#workshop](http://datascienceatthecommandline.com/#workshop)

------
a3_nm
I think it is unsafe to parallelize grep with xargs as is done in the article
because, beyond shuffling the delivery order, the output of the parallel greps
could get mixed up (the beginning of a line comes from one grep and the end of
the line from a different grep, so, reading line by line afterwards, you get
garbled lines).

See
[https://www.gnu.org/software/parallel/man.html#DIFFERENCES-B...](https://www.gnu.org/software/parallel/man.html#DIFFERENCES-BETWEEN-xargs-AND-GNU-Parallel)
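
A rough sketch of the safer variant, assuming GNU parallel is installed (the
final mawk step is a placeholder for the one in the article):

    
    
        # parallel buffers each job's output and only prints it when the job
        # finishes, so lines from different greps never get spliced together
        find . -type f -name '*.pgn' | \
        parallel --group "grep 'Result' {}" | \
        mawk '{ ... }'
    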

------
pkrumins
The example in the article with cat, grep and awk:

    
    
        cat *.pgn | \
        grep "Result" | \
        awk '
         {
            split($0, a, "-");
            res = substr(a[1], length(a[1]), 1);
            if (res == 1) white++;
            if (res == 0) black++;
            if (res == 2) draw++;
          }
          END { print white+black+draw, white, black, draw }
        '
    

Can be written much more succinctly with just awk, and you don't even need to
split the string or use substr:

    
    
        awk '
          /Result/ {
            if (/1\/2/) draw++;
            else if (/1-0/) white++;
            else if (/0-1/) black++;
          }
          END { print white+black+draw, white, black, draw }
        ' *.pgn

~~~
dice
Keep reading, he removes the cat and grep in the final solution.

~~~
omaranto
Yes, but he still keeps the awkward Awk code with the substr and such. I
haven't benchmarked it; maybe that's faster than the pretty regex matches.

~~~
lloeki
I believe this is to be a bit more educational about how to build a pipeline.
Also, iteratively building such solutions quickly often leads to such
"inefficiencies" but makes things easier to reason about. Besides, the awk step
may have been factored out in the end, so it wouldn't make sense to optimise
early. Also, by the time the author reaches the end, he gets IO-bound so
there's not much need to optimise further (in the context of the exercise).

------
zokier
The author begins with a fairly idiomatic shell pipeline, but in the search for
performance the pipeline transforms into an awk script. Not that I have
anything against awk, but I feel like that kinda runs against the premise of
the article. The article ends up demonstrating the power of awk over pipelines
of small utilities.

Another interesting note is that there is a possibility that the script as-is
could mis-parse the data. The grep should use '^\[Result' instead of
'Result'. I think this demonstrates nicely the fragility of these sorts of
ad-hoc parsers that are common in shell pipelines.

~~~
tracker1
It probably depends on what you are trying to accomplish... I think a lot of
us would reach for a scripting language to run through this (relatively small
amount of data)... node.js does piped streams of input/output really well. And
perl is the granddaddy of this type of input processing.

I wouldn't typically reach for a big data solution short of hundreds of gigs
of data (which is borderline, but will only grow from there). I might even
reach for something like ElasticSearch as an interim step, which will usually
be enough.

If you can dedicate a VM in a cloud service to a single one-off task, that's
probably a better option than creating a Hadoop cluster for most work loads.

------
rkwasny
Bottom line is - you do not need Hadoop until you cross 2TB of data to be
processed (uncompressed). Modern servers ( bare metal ones, not what AWS sells
you ) are REALLY FAST and can crunch massive amounts of data.

Just use proper tools and well-optimized code written in C/C++/Go/etc - not all
the crappy Java framework-in-a-framework^N architecture that abstracts away
thinking about the CPU speed.

Bottom line, the popular saying is true: "Hadoop is about writing crappy code
and then running it on a massive scale."

~~~
earino
Dell sells a server with 6TB of RAM (I believe), so I think the limit is way
over 2TB. If you want to be able to query it quickly for analytical workloads,
MPPs like Vertica scale up to 150+TB (at Facebook). I honestly don't know what
the scale is where you _need_ Hadoop, but it's gotten to be a large number very
quickly.

~~~
juliangregorian
They do, I checked. It comes in at a cool half million (Helloooo, investors!)

------
ricardobeat
Don't shoot me, but out of curiosity I wrote the thing in javascript:
[https://gist.github.com/ricardobeat/ee2fb2a6d704205446b7](https://gist.github.com/ricardobeat/ee2fb2a6d704205446b7)

Results: 4.4GB[1] processed in 47 seconds. Around 96MB/s; can probably be made
faster, and node.js is not the best at munging data...

[1] 3201 files taken from
[http://github.com/rozim/ChessData](http://github.com/rozim/ChessData)

------
notpeter
This article echoes a talk Bryan Cantrill gave two years ago:
[https://youtu.be/S0mviKhVmBI](https://youtu.be/S0mviKhVmBI)

It's about how Joyent took the concept of a UNIX pipeline as a true power tool
and built a distributed version atop an object filesystem, with a little
map/reduce syntactic sugar, to replace Hadoop jobs with pipelines.

The Bryan Cantrill talk is definitely worth your time, but you can get an
understanding of Manta with their 3m screencast:
[https://youtu.be/d2KQ2SQLQgg](https://youtu.be/d2KQ2SQLQgg)

~~~
cheng1
I have developed a one-liner toolset for Hadoop (for when I have to use it).
It's refreshing to see a ZFS-based take on the concept. I don't like the
JavaScript choice though.

GNU parallel should be a widely adopted choice. Lightweight. Fast. Low cost.
Extendable.

~~~
xer0x
You can use command-line tools for Manta without touching any Javascript.
That's probably the best way to go. Although I do like Javascript.

------
sam_lowry_
Next to using `xargs -P 8 -n 1` to parallelize jobs locally, take a look at
paexec, a GNU parallel replacement that just works.

See [https://github.com/cheusov/paexec](https://github.com/cheusov/paexec)

~~~
pmoriarty
What's the advantage of using paexec over GNU parallel?

~~~
ole_tange
See comparison here:
[http://www.gnu.org/software/parallel/man.html#DIFFERENCES-BE...](http://www.gnu.org/software/parallel/man.html#DIFFERENCES-BETWEEN-paexec-AND-GNU-Parallel)

------
jacquesm
See this very good comment by Bane:

[https://news.ycombinator.com/item?id=8902739](https://news.ycombinator.com/item?id=8902739)

------
mabbo
I had an intern over the summer, working on a basic A/B Testing framework for
our application (a very simple industrial handscanner tool used inside
warehouses by a few thousand employees).

When we came to the last stage, analysis, he was keen to use MapReduce so we
let him. In the end though, his analysis didn't work well, took ages to
process when it did, and didn't provide the answers we needed. The code wasn't
maintainable or reusable. _shrug_ It happens. I had worse internships.

I put together some command line scripts to parse the files instead - grep,
awk, sed, really basic stuff piped into each other and written to other files.
They took 10 minutes or so to process, and provided reliable answers. The
scripts were added as an appendix to the report I provided on the A/B test,
and after formatting and explanations, took up a couple pages.

~~~
m_mueller
On a tangent, I'd be interested in how you format heavily piped bash code for
documentation. Can comments be interspersed there?

~~~
mappu
Functions, mostly - the big `awk` command in the example goes into something
like

    
    
        # @param $1 whatever
        chess_extract_scores() {
             awk blah blah blah
        }
    

and then your whole pipeline simplifies to

    
    
        cat foo | grep bar | chess_extract_scores
    

which is pretty readable. You can even do most of this in a live bash session
with ^X ^E.
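
As for interspersing comments: a trailing | lets the shell continue the
pipeline on the next line, so you can annotate each stage too. Roughly
something like (using the function above):

    
    
        cat foo |                  # feed the raw files
        grep bar |                 # keep only the lines we care about
        chess_extract_scores       # tally the results
    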

~~~
plaes
You can actually do without cat:

grep bar foo | chess_extract_scores

[http://en.wikipedia.org/wiki/Cat_%28Unix%29#Useless_use_of_c...](http://en.wikipedia.org/wiki/Cat_%28Unix%29#Useless_use_of_cat)

~~~
nkuttler
Sure you can, but premature optimization is also a real thing
[http://en.wikipedia.org/wiki/Program_optimization#When_to_op...](http://en.wikipedia.org/wiki/Program_optimization#When_to_optimize)

------
knodi123
We have a proprietary algorithm for assigning foods a "suitability score"
based on a user's personal health conditions and body data.

It used to be a fairly slow algorithm, so we ran it in a hadoop cluster and it
cached the scores for every user vs. every food in a massive table on a
distributed database.

Another developer, who is quite clever, rewrote our algorithm in C, and
compiled it as a database function, which was about 100x faster. He also did
some algebra work and found a way to change our calculations, yielding a
measly 4-5x improvement.

It was so, so, so much faster that in one swoop we eliminated our entire
Hadoop cluster, and the massive scores table, and were actually able to _sort
your food search results by score_, calculating scores on the fly.

~~~
saym
May I ask: Who is we?

------
NyxWulf
This also isn't a straight either-or proposition. I build local command-line
pipelines and do testing and/or processing. When the amount of data to be
processed passes into the range where memory or network bandwidth makes the
processing more efficient on a Hadoop cluster, I make some fairly minimal
conversions and run the stream processing on the Hadoop cluster in streaming
mode. It hasn't been uncommon for my jobs to be much faster than the same jobs
run on the cluster with Hive or some other framework. Much of the speed boils
down to the optimizer and the planner.

Overall I find it very efficient to use the same toolset locally and then
scale it up to a cluster when and if I need to.

~~~
azylman
What toolset are you using that you can run both locally and on a Hadoop
cluster?

~~~
mdaniel
Almost all of them?

The vocabulary of the grandparent comment implies they are using hadoop's
streaming mode, and thus one can use a map-reduce streaming abstraction such
as MRJob or just plain stdin/stdout; both will work locally and in cluster
mode.

Or, if static typing is more agreeable to your development process, running
hadoop in "single machine cluster" mode is relatively painless. The same goes
for other distributed processing frameworks like Spark.
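
To make that concrete, a rough sketch (the jar path, HDFS paths, and awk
script names are placeholders, not from the article):

    
    
        # same awk scripts run locally...
        cat *.pgn | awk -f count_results.awk | awk -f sum_results.awk
    
        # ...and under Hadoop streaming
        hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
            -input /data/chess/pgn \
            -output /data/chess/results \
            -mapper 'awk -f count_results.awk' \
            -reducer 'awk -f sum_results.awk' \
            -file count_results.awk -file sum_results.awk
    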

------
decisiveness
If bash is the shell (assuming recursive search is required), maybe it would
be even faster to just do:

    
    
        shopt -s globstar
        mawk '/Result/ {
            game++
            split($0, a, "-")
            res = substr(a[1], length(a[1]), 1)
            if(res == 1)
                white++
            if(res == 0)
                black++
            if(res == 2)
                draw++
        } END {
            print game, white, black, draw
        }' **/*.pgn
    
    ?

------
taltman1
This is a great exercise in how to take a Unix command line and iteratively
optimize it with advanced use of awk.

In that spirit, one can optimize the xargs mawk invocation by 1) getting rid
of string-manipulation function calls (which are slow in awk), 2) using
regular expressions in the pattern expression (which allows awk to
short-circuit the evaluation of lines), and 3) avoiding use of field variables
like $1 and $2, which allows the mawk virtual machine to avoid implicit field
splitting. A bonus is that you end up with an awk script which is more
idiomatic:

    
    
      mawk '
      /^\[Result "1\/2-1\/2"\]/ { draw++ }
      /^\[Result "1-0"\]/ { white++ }
      /^\[Result "0-1"\]/ { black++ }
    
      END { print white, black, draw }'  
    

Notice that I got rid of the printing out of the intermediate totals per file.
Since we are only tabulating the final total, we can modify the 'reduce' mawk
invocation to be as follows:

    
    
      mawk '
      {games += ($1+$2+$3); white += $1; black += $2; draw += $3}
      END { print games, white, black, draw }'
    

Making the bottleneck data stream thinner always helps with overall
throughput.

------
philgoetz
First, you don't score points with me for saying not to use Hadoop when you
don't need to use Hadoop.

Second, you don't get to pretend you invented shell scripting because you came
up with a new name for it.

Third, there are very few cases if any where writing a shell script is better
than writing a Perl script.

------
MrBuddyCasino
To quote the memorable Ted Dziuba[0]:

"Here's a concrete example: suppose you have millions of web pages that you
want to download and save to disk for later processing. How do you do it? The
cool-kids answer is to write a distributed crawler in Clojure and run it on
EC2, handing out jobs with a message queue like SQS or ZeroMQ.

The Taco Bell answer? xargs and wget. In the rare case that you saturate the
network connection, add some split and rsync. A "distributed crawler" is
really only like 10 lines of shell script."

[0] since his blog is gone: [http://readwrite.com/2011/01/22/data-mining-and-taco-bell-pr...](http://readwrite.com/2011/01/22/data-mining-and-taco-bell-prog)
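
For flavor, the single-machine core of that approach might look something like
this (urls.txt, the worker count, and the mirror directory are invented for
the sketch, not taken from Dziuba's post):

    
    
        # fetch every URL, 16 at a time, mirroring each page under mirror/
        xargs -P 16 -n 1 wget -q -x -P mirror < urls.txt
    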

~~~
threeseed
Oh right, the "cool kids" approach.

Here's what the "sensible adults" think about when they see problems like
this.

Operational Supportability: How do you monitor the operation?

Restart Recovery: Do you have the ability to restart the operation midway
through if something fails?

Maintainability: Can we run the same application on our desktop as on our
production servers?

Extensibility: Can we extend the platform easily to do X, Y, Z after the
crawling?

I can't stand developers who come up with the xargs/wget approach, hack
something together and then walk away from it. I've seen it far too often and
it's great for the short term. Dreadful for the long term.

~~~
AnthonyMouse
The Unix people have thought of these things. You can easily do them with
command line tools.

> Operational Supportability: How do you monitor the operation ?

Downloading files with wget will create files and directories as it proceeds.
You can observe and count them to determine progress, or pass a shell script
to xargs that writes whatever progress data you like to a file before/after
calling wget.

> Restart Recovery: Do you have the ability to restart the operation mid way
> through if something fails ?

wget has command line options to skip downloading files that already exist. Or
you can use tail to skip as many lines of the input file as there are complete
entries in the destination directory.
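
Sketching that out (the flags are real, the paths are placeholders, and the
tail-based variant is only approximate when fetches run in parallel):

    
    
        # (a) let wget itself skip files that already exist locally
        xargs -P 8 -n 1 wget -nc -q -x -P mirror < urls.txt
    
        # (b) or skip as many input lines as there are completed files
        done=$(find mirror -type f | wc -l)
        tail -n +"$((done + 1))" urls.txt | xargs -P 8 -n 1 wget -q -x -P mirror
    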

> Maintainability: Can we run the same application on our desktop as on our
> production servers ?

I'm not sure how this is supposed to be an argument against using the standard
utilities that are on everybody's machine already.

> Extensibility: Can we extend the platform easily to do X, Y, Z after the
> crawling ?

Again, what? Extensibility is the wheelhouse of the thing you're complaining
about.

~~~
coderdude
> Downloading files with wget will create files and directories as it
> proceeds. You can observe and count them to determine progress, or pass a
> shell script to xargs that writes whatever progress data you like to a file
> before/after calling wget.

Which means using wget as your HTTP module and a scripting language as the
glue for the logic you'll ultimately need to implement to create a robust
crawler (robust to failures and edge cases).

> wget has command line options to skip downloading files that already exist.
> Or you can use tail to skip the number of lines in the input file as there
> exist complete entries in the destination directory.

Is wget able to check whether a previously failed page exists on disk [in some
kind of index] before making any new HTTP requests? It sounds like this would
try fetching every failed URL until it reaches the point where it left off
before the restart. If it's not possible to maintain an index of unfetchable
URLs and reasons for the failures then this would be one reason why wget
wouldn't work in place of software designed for the task of crawling (as
opposed to just fetching).

This is one of those tasks that seems like you could glue together wget and
some scripts and call it a day but you would ultimately discover the reasons
why nobody does this in practice. At least not for anything but one-off crawl
jobs.

Thought of another possible issue:

If you're trying to saturate your connection with multiple wget instances, how
do you make sure that you're not fetching more than one page from a single
server at once (being a friendly crawler)? Or how would you honor robots.txt's
Crawl-delay with multiple instances?

Edit: `previously fetched` -> `previously failed`

~~~
AnthonyMouse
> Which means using wget as your HTTP module and a scripting language as the
> glue for the logic you'll ultimately need to implement to create a robust
> crawler (robust to failures and edge cases).

This is kind of the premise of this discussion. You don't use Hadoop to
process 2GB of data, but you don't build Googlebot using bash and wget. There
is a scale past which it makes sense to use the Big Data toolbox. The point is
that most people never get there. Your crawler is never going to be Googlebot.

> Is wget able to check whether a previously failed page exists on disk [in
> some kind of index] before making any new HTTP requests? It sounds like this
> would try fetching every failed URL until it reaches the point where it left
> off before the restart. If it's not possible to maintain an index of
> unfetchable URLs and reasons for the failures then this would be one reason
> why wget wouldn't work in place of software designed for the task of
> crawling (as opposed to just fetching).

It really depends what you're trying to do here. If the reason you're
restarting the crawler is because e.g. your internet connection flapped while
it was running or some server was temporarily giving spurious HTTP errors then
you _want_ the failed URLs to be retried. If you're only restarting the
crawler because you had to pause it momentarily and you want to carry on from
where you left off then you can easily record what the last URL you tried was
and strip all of the previous ones from the list before restarting.

But I think what you're really running into is that we ended up talking about
wget and wget isn't really designed in the Unix tradition. The recursive mode
in particular doesn't compose well. It should be at least two separate
programs, one that fetches via HTTP and one that parses HTML. Then you can see
the easy solution to that class of problems: When you fetch a URL you write
the URL and the retrieval status to a file which you can parse later to do the
things you're referring to.

> If you're trying to saturate your connection with multiple wget instances,
> how do you make sure that you're not fetching more than one page from a
> single server at once (being a friendly crawler)? Or how would you honor
> robots.txt's Crawl-delay with multiple instances?

Give each process a FIFO to read URLs from. Then you choose which FIFO to add
a URL to based on the address so that all URLs with the same address are
assigned to the same process.

~~~
coderdude
> Give each process a FIFO to read URLs from. Then you choose which FIFO to
> add a URL to based on the address so that all URLs with the same address are
> assigned to the same process.

I wrote this in a reply to myself a moment after you posted your comment so
I'll just move it here:

Regarding the last two issues I mentioned, you could sort the list of URLs by
domain and split the list when the new list's length is >= n URLs and domain
on the current line is different from the domain on the previous line. As long
as wget can at least honor robots.txt directives between consecutive requests
to a domain, it should all work out fine.

It looks like an easily solvable problem however you go about it.
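
For instance, something along these lines (the bucket count, politeness delay,
and file names are invented for the sketch):

    
    
        # route each URL to a bucket by host, so one worker owns each host
        awk -F/ '{ print > ("bucket." length($3) % 4) }' urls.txt
    
        # one sequential, rate-limited wget per bucket
        for f in bucket.*; do
            xargs wget -nc -q -w 1 -x -P mirror < "$f" &
        done
        wait
    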

> It really depends what you're trying to do here.

I was thinking about HTTP requests that respond with 4xx and 5xx errors. It
would need to be possible to either remove those from the frontier and store
them in a separate list, or mark them with the error code so that it can be
checked at some point before being passed on to wget.

~~~
sillysaurus3
Open file on disk. See that it's 404. Delete file. Re-run crawler.

You'd turn that into code by doing grep -R 404 . or whatever the actual unique
error string is and deleting any file containing the error message. (You'd be
careful not to run that recursive delete on any unexpected data.)

Really, these problems are pretty easy. It's easy to overthink it.

~~~
pyre
> grep -R 404

This isn't 1995 anymore. When you hit a 404 error, you no longer get Apache's
default 404 page. You really can't count on there being any consistency
between 404 pages on different sites.

If wget somehow stored the header response info to disk (e.g.
"FILENAME.header-info") you could whip something up to do what you are
suggesting though.

~~~
sillysaurus3
Yeah, wget stores response info to disk. Besides, even if it didn't, you could
still visit a 404 page of the website and figure out a unique string of text
to search for.

------
kylek
I feel ag (the silver searcher, a grep-ish alternative) should be mentioned
(even though he dropped grep in his final awk/mawk commands) as it tends to be
much faster than grep, and considering he cites performance throughout.

~~~
ggreer
GitHub link for those who don't know about it:
[https://github.com/ggreer/the_silver_searcher/](https://github.com/ggreer/the_silver_searcher/)

I built ag for searching code. It can be (ab)used for other stuff, but the
defaults are optimized for a developer searching a codebase. Also, when
writing ag, I don't go out of my way to make sure behavior is correct on all
platforms in all corner cases. Grep, on the other hand, has been around for
decades. It probably handles cases I've never even thought of.

------
wglb
A similar story: [http://blogs.law.harvard.edu/philg/2009/05/18/ruby-on-rails-...](http://blogs.law.harvard.edu/philg/2009/05/18/ruby-on-rails-and-the-importance-of-being-stupid/): Tools used not quite the right way.

edit: with HN commentary:
[https://news.ycombinator.com/item?id=615587](https://news.ycombinator.com/item?id=615587)

------
sgt101
On a couple of GB this is true. Actually, if you have SSDs I'd expect any
non-compute-bound task to be faster on a single machine up to ~10GB, after
which the disk parallelism should kick in and Hadoop should start to win.

~~~
KaiserPro
Depends on the dick, depends on the storage.

HDFS is a pseudo block interface. If you have a real filesystem like Lustre
or GPFS, not only do you have the ability to use other tools, you can use that
storage for other things.

In the case of GPFS, you have configurable redundancy. Sadly, with Lustre, you
need decent hardware, otherwise you're going to lose data.

In all these things, paying bottom dollar for hardware and forgoing support is
a false economy. At scales of 1PB+ (which is about 1/2 a rack now) it's much,
much cheaper to use off-the-shelf parts with 24/7 support than "softwareing"
your way out.

~~~
radoslawc
> Depends on the dick

not really, sorry I had to

back to the topic, HDFS is really somewhat of a waste of disk space, especially
when used for something like munching logs

> At scales of 1PB+ (which is about 1/2 a rack now) it's much, much cheaper to
> use off-the-shelf parts with 24/7 support than "softwareing" your way out.

depends, if you need monthly reports from logs, as long as you don't lose
storage completely, then using even second-hand hardware or hardware
decommissioned from prod is the cheapest choice

~~~
KaiserPro
_Ahem_

Disk....

------
linuxhansl
So don't use Hadoop to crunch data that fits on a memory stick, or that a
single disk spindle can read in a few seconds.

Why is this first on the HN front-page?

Reminds me of the C++ is better than Java, Go is better than C++, etc, pieces.

Yes, the right tool for the right job. That's what makes a good engineer.

Somebody who thinks there is _no_ valid use case for Hadoop is a fool. (The
author did not say that, but many of the comments here seem to imply that
view)

~~~
lucaspottersky
> Why is this first on the HN front-page?

Because controversial topics are always fun! d:-)

------
guardiangod
Here is an analysis from a developer who looked at Hadoop:
[http://ossectools.blogspot.ca/2012/03/why-elsa-doesnt-use-ha...](http://ossectools.blogspot.ca/2012/03/why-elsa-doesnt-use-hadoop-for-big-data.html)

(ELSA is a logger that claims to be able to handle 100,000 entries/sec (!!))

When to Use Hadoop

This is a description of why Hadoop isn't always the right solution to Big
Data problems, but that certainly doesn't mean that it's not a valuable
project or that it isn't the best solution for a lot of challenges. It's
important to use the right tool for the job, and thinking critically about
what features each tool provides is paramount to a project's success. In
general, you should use Hadoop when:

    
    
        Data access patterns will be very basic but analytics will be very complicated.
        Your data needs absolutely guaranteed availability for both reading and writing.
        There are inadequate traditional database-oriented tools which currently exist for your problem. 
    

Do not use Hadoop if:

    
    
        You don't know exactly why you're using it.
        You want to maximize hardware efficiency.
        Your data fits on a single "beefy" server.
        You don't have full-time staff to dedicate to it.
    

The easiest alternative to using Hadoop for Big Data is to use multiple
traditional databases and architect your read and write patterns such that the
data in one database does not rely on the data in another. Once that is
established, it is much easier than you'd think to write basic aggregation
routines in languages you're already invested in and familiar with. This means
you need to think very critically about your app architecture before you throw
more hardware at it.

------
greenyoda
_Shell commands are great for data processing pipelines because you get
parallelism for free. For proof, try a simple example in your terminal._

    
    
        sleep 3 | echo "Hello world."
    

That doesn't really prove anything about data processing pipelines, since
_echo "Hello world."_ doesn't need to wait for any input from the other
process; it can run as soon as the process is forked.

    
    
        cat *.pgn | grep "Result" | sort | uniq -c
    

Does this have any advantage over the more straightforward version below?

    
    
        grep -h "Result" *.pgn | sort | uniq -c
    

Either the cat process or the grep process is going to be waiting for disk
I/Os to complete before any of the later processes have data to work on, so
splitting it into two processes doesn't seem to buy you any additional
concurrency. You would, however, be spending extra time in the kernel to
execute the read() and write() system calls to do the interprocess
communication on the pipe between cat and grep.

Also, the parallelism of a data processing pipeline is going to be constrained
by the speed of the slowest process in it: all the processes after it are
going to be idle while waiting for the slow process to produce output, and all
the processes before it are going to be idle once the slow process has filled
its pipe's input buffers. So if one of the processes in the pipeline takes 100
times as long as the other three, Amdahl's Law[1] suggests that you won't get
a big win from breaking it up into multiple processes.
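
(Concretely: if one of four stages takes roughly 100 units of time and the
other three take 1 unit each, the pipeline can never finish faster than the
100-unit stage, so the best-case speedup over running the stages back to back
is about 103/100, or roughly 3%.)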

[1]
[https://en.wikipedia.org/wiki/Amdahl%27s_law](https://en.wikipedia.org/wiki/Amdahl%27s_law)

Edit: As someone pointed out, my example needed "grep -h". Fixed.

~~~
omoikane
"grep <pattern> <files>" is not the same as "cat <files> | grep <pattern>", in
that the former will prefix lines with filenames if there is more than one
input file. What you want instead is "grep -h <pattern> <files>".

The advantage of using cat, therefore, is the few seconds of laziness saved in
not reading the manual.

~~~
salgernon
The advantage to using "cat foo | grep pattern" is that it is trivial to ^p
and edit the pattern before adding the next pipeline sequence.

~~~
philsnow
fwiw

    
    
        $ <filename grep <pattern>

No shell I'm aware of restricts you to placing redirections at the end; you
can throw them at the beginning, no problem.

------
colin_mccabe
About 5 years ago I worked at a company that took the "pile of shell scripts"
approach to processing data. Our data was big enough and our algorithms
computationally heavy enough that a single machine wasn't a good solution. So
we had a bunch of little binaries that were glued together with sed, awk,
perl, and pbsnodes.

It was horrible. It was tough to maintain -- we all know how hard even the
best awk and perl are to read. It was difficult to optimize, and you always
found yourself worrying about things like the maximum length of command lines,
how to figure out what the "real" error was in a bash pipeline, and so on.
When parts of the job failed, we had to manually figure out which parts of the
job had failed and re-run them. Then we had to copy the files over to the
right place to create the full final output.

The company was a startup and the next VC milestone or pivot was always just
around the corner. There was never any time to clean things up. A lot of the
code had come out of early tech demos that management just asked us to "just
scale up." But oops, you can't do that with a pile of shell scripts and custom
C binaries. So the technical debt just kept piling up. I would advise anyone
in this situation not to do this. Yeah, shell scripts are great for making
rough guesses about things in a pile of data. They are great for ad hoc
exploration on small data or on individual log files. But that's it. Do not
check them into a source code repo and don't use them in production. The
moment someone tries to check in a shell script longer than a page, you need
to drop the hammer. Ask them to rewrite it in a language (and ideally a
framework) that is maintainable in the long term.

Now I work on Hadoop, mostly on the storage side of things. Hadoop is many
things-- a storage system, a set of computation frameworks that are robust
against node failures, a Java API. But above all it's a framework for doing
things in a standardized way so that you can understand what you've done 6
months from now. And you will be able to scale up by adding more nodes, when
your data is 2x or 4x as big down the line. On average, the customers we work
with are seeing their data grow by 2x every year.

I feel like people on Hacker News often don't have a clear picture of how
people interact with Hadoop. Writing MapReduce jobs is very 2008. Nowadays,
more than half of our users write SQL that gets processed by an execution
engine such as Hive or Impala. Most users are not developers, they're
analysts. If you have needs that go beyond SQL, you would use something like
Spark, which has a great and very concise API based on functional programming.
Reading about how clunky MR jobs are just feels to me like reading an article
about how hard it is to make boot and root floppy disks for Linux. Nobody's
done that in years.

------
sabalaba
I've had the pleasure and displeasure of working with small datasets (~7.5GB
of images) in shell. One often needs to send SIGINT to the shell when it
starts to glob expand or tab complete a folder with millions of files. But
besides minor issues like that, command line tools get the job done.

~~~
_delirium
Until semi-recently, millions of files in a directory would not only choke up
the shell, but the filesystem too. ext4 is a huge improvement over ext3 in
that regard; with 10m files in an ext3 directory you ended up with long hangs
on various operations. And even with ext4, make sure not to NFS-export the
volume that directory is on!

~~~
ajuc
I've encountered this (or similar) issue on production.

We had a C++ system that wrote temporary files to /tmp when printing. /tmp was
cleared on system startup and it worked fine for years, but the files
accumulated. At some point it started randomly throwing file access errors
when trying to create these temporary files. Not for each file - only for some
of them.

The disk wasn't full, and some files could be created in /tmp while others
couldn't. It turned out, after a few days of tracking it down, that the
filesystem can be overwhelmed by too many similarly named files in one
directory - it couldn't create file XXXX99999 even if there was no such file
in that directory, but it could create files like YYYYY99999 :)

I just love such bugs where your basic assumptions turn out to be wrong.

------
liotier
'xargs -n' elicits fond memories of spawning large jobs to my openMosix
cluster! I miss openMosix.

~~~
pjmlp
Oh! I lost track of it.

------
sgt101
There's a perception that Hadoop commands are terribly complex. If you run

    
    
        $ spark-shell
    

you can execute (interactively)

    
    
        val file = spark.textFile("hdfs://...")
        val errors = file.filter(line => line.contains("ERROR"))
        errors.count()
    

and count the matching lines in a file - ok, the wget is not there, but this
is really not complex!

------
uxcn
This kind of approach can probably scale out pretty far before actually
needing to resort to true distributed processing. Compression, simple
transforms, R, etc... You can probably get away with even more by just using a
networked filesystem and inotify.
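
A rough sketch of the inotify idea, assuming inotify-tools is installed (the
directory, script name, and output file are made up):

    
    
        # process each file as soon as it finishes landing in incoming/
        inotifywait -m -e close_write --format '%w%f' incoming/ |
        while read -r f; do
            mawk -f count_results.awk "$f" >> results.txt
        done
    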

------
sleepythread
One common misconception is that you should use Hadoop whenever your data is
large. Usage of Hadoop should be driven more by the growth of the data than by
its size.

I agree that for the given use case the solution is appropriate and works
fine. The problem mentioned in the post is not a Big Data problem.

Hadoop would be helpful if millions of games were played every day and we
needed to update the statistics daily, etc. In that case, the given solution
would hit a bottleneck and some optimisation or code changes would be needed
to keep it running.

Hadoop and its ecosystem are not a silver bullet and hence should not be used
for everything. The problem has to be a Big Data problem.

~~~
yuanchuan
It is the buzz surrounding Hadoop that makes people misunderstand its use and
capability. I have met non-technical analysts who want RDBMS performance on
Hadoop. They expect seconds-to-minutes scale queries on hundreds of GB of
data.

I always throw this analogy at people who misunderstand Hadoop: a stone to
crack an egg, or a spoon?

Hadoop and RDBMS only have a thin overlapping region in the Venn diagram that
describes their capabilities and use cases.

Ultimately, it is cost vs efficiency. Hadoop can solve all data problems.
Likewise for RDBMS. This is an engineering tradeoff that people have to make.

~~~
pacala
> They expect seconds to minutes scale queries on hundreds of GB of data.

Use BigQuery from Google.

~~~
yuanchuan
On-premise cluster.

Cloud solutions are totally out due to the nature of the data. Not everything
can be done in the cloud.

If you have such a huge amount of data, the total time it takes to transfer it
there and compute is not as competitive as an on-premise solution, unless all
your data already live in the cloud.

~~~
pacala
I would look into [https://spark.apache.org/](https://spark.apache.org/) then.
You can get quite good performance out of it, but you need to spend more
effort in babysitting your data.

------
zobzu
Here's a probably unpopular opinion... Pipes make things a bit slow. A native
pipeless program would be a good bit faster - including an ACID db. Note that
doing this in Python and expecting it to beat grep won't work...

The other thing is that Hadoop - and some others - are slow on big data (peta,
or more) vs your own tools. They're necessary/used because of massive
clustering (10x the hardware deployed easily beats making your own
financially).

I suspect it's a general lack of understanding of the way computers work
(hardware, OS, i.e. system architecture) vs "why care, it works, and
python/go/java/etc are easy for me, I don't need to know what happens under
the hood".

~~~
gasping
> incl. an acid db.

Why would you want to use a database for this problem? The input data would
take time to load into an ACID db and we're only interested in a single
ternary value within that data. The output data is just a few lists of boolean
values so it has no reason to be in a database either.

This is a textbook stream processing problem. Adding a database creates more
complexity for literally no benefit assuming the requirements in the linked
article were complete. I would be baffled to see a solution to this problem
that was anything more than a stream processor, to say nothing of a database
being involved.

~~~
_delirium
If it really is just a one-shot with one simple-ish filter, I agree. But I
often find myself incrementally building shell-pipeline tangles that are sped
up massively by being replaced with SQLite. Once your processing pipeline is
making liberal use of the sort/grep/cut/tee/uniq/tac/awk/join/paste suite of
tools, things get slow. The tangle of Unix tools effectively does repeated
full-table scans without the benefit of indexes, and is especially bad if you
have to re-sort the data at different stages of the pipeline, e.g. on
different columns, or need to split and then re-join columns in different
stages of the pipeline. In that kind of scenario a database (at least SQLite,
haven't tried a more "heavyweight" database) ends up being a win even for
stream-processing tasks. You pay for a load/index step up front, but you more
than get it back if the pipeline is nontrivial.
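
As a rough illustration (file, table, and column names are invented; the real
win shows up when the pipeline re-sorts or re-joins the data several times):

    
    
        # load once, index once, then query instead of repeated sort/uniq passes
        sqlite3 -cmd '.mode csv' -cmd '.import results.csv results' games.db \
            'CREATE INDEX idx_result ON results(result);
             SELECT result, COUNT(*) FROM results GROUP BY result;'
    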

------
TeeWEE
The gotcha here is: he is talking about 1.75GB of data. Of course you don't
use Hadoop for this. Hadoop is for Big Data, not for a few gigs.

Use the right tool for the job. If you think you will scale to terabyte size,
don't start out with command-line tools.

------
ec664
This is blown out of proportion... the actual increase is probably a factor of
10-20x, not hundreds. The fact that EMR is used is a problem: provisioning and
bootstrapping the cluster alone probably account for half the time.

The fact that the shell commands were run repeatedly means that the data ends
up in the OS buffer cache and is basically in memory.

I'm not discounting that CLI is faster than Hadoop by an order of magnitude on
small datasets. Nor will I dive into Hadoop vs CLI. The answer to all that IMO
is that it depends. And in this case, it's not well warranted.

What I do take exception to is the Fox News style headlines that are
disproportional to the truth. EMR != Hadoop.

------
Swizec
Maybe I come from a weird world, or even a weird generation. But when I was in
high school, Linux fanboyism was at its peak and just like people get all
wound up on bands and such, us geeks got wound up on open-source and linux and
fck Micro$oft etc. etc. This was early-ish 2000's.

As a result. _Every_ serious programmer I know, especially those who are about
my age, lives their life in the CLI.

It always comes as a surprise when somebody suggests that there are
professional developers out there who do not predominantly use the CLI.

~~~
prodigal_erik
People are always surprised when I mention that the Microsoft devs I worked
with had free access to the highest tiers of Visual Studio, yet what they
actually worked in was vim and the internal fork of make. I don't know whether
that's still true; it's been a decade now.

~~~
wrs
When I was an MS dev (8 to 18 years ago), Visual Studio had nothing to do with
"real work" on Windows systems programming.

Quite a few people used a rather obscure editor called Source Insight
([http://www.sourceinsight.com/](http://www.sourceinsight.com/)) because of
its code-navigation abilities, which were similar to an IDE's but worked on
huge codebases that would take hours to actually parse and analyze "properly".
Sort of a supercharged ctags.

------
khaki54
If the entire dataset fits into memory on your laptop then it's not big data,
and the only reason for using MapReduce etc. is to build experience with it or
as a proof of concept for a larger dataset.

------
thehal84
A similar article that articulated this well.

[https://www.chrisstucchio.com/blog/2013/hadoop_hatred.html](https://www.chrisstucchio.com/blog/2013/hadoop_hatred.html)

------
JensRantil
Reminds me of "filemap" - a command-line-style map/reduce tool:
[https://github.com/mfisk/filemap](https://github.com/mfisk/filemap)

~~~
JensRantil
Related: sometimes I wish most of my data was in JSON streams so I could
simply map and reduce the data using jq
([http://stedolan.github.io/jq/](http://stedolan.github.io/jq/)), pipes and
possibly filemap.
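
Something like this, for instance (games.json and the .result field are made
up for the sketch):

    
    
        # "map" each record to its result, then "reduce" by counting per group
        cat games.json | jq -s 'group_by(.result) | map({result: .[0].result, n: length})'
    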

------
weitzj
There is also an interesting and fun talk to watch by John Graham-Cumming from
CloudFlare,
[http://www.youtube.com/watch?v=woCg2zaIVzQ](http://www.youtube.com/watch?v=woCg2zaIVzQ),
using Go instead of xargs. It kind of fits into "using the right tool for the
job". There is no Big Data involved, but it shows a sweet spot where it might
make sense (make it easier) not to use a shell script (i.e. retries, network
failure).

------
dkarapetyan
Oh yeah, and it turns out that when all is said and done the average data set
for most Hadoop jobs is no more than 20GB, which again fits comfortably on a
modern desktop machine.

------
raincom
Hadoop is replacing many data warehousing DBs like Netezza, Teradata, and
Exadata. In the process, many data warehousing developers have become Hadoop
developers who write SQL code; after all, Hadoop got a SQL interface via Hive.

Informatica (an ETL vendor) also provides a tool called PowerExchange, which
automatically generates MR code for Hadoop.

Whenever you hear Hadoop, first ask yourself whether it is just data
warehousing in disguise.

~~~
spydum
Yes, this is very much happening -- mostly based on the insane pricing
difference of supporting Hadoop clusters vs ntz or td infrastructure. Just
following a simple 3year lifecycle of HW depreciation essentially boosts your
performance for next to nothing. The same cannot be said of the big DWH
vendors

------
dundun
What is missed in the article and many of these comments is that Hadoop isn't
always about being the best tool for one job. It shines in its multitenancy:
when many users are running many jobs, each developed in their favorite
framework or language (bash/awk pipeline? No problem), running over datasets
bigger than single machines can handle.

It also comes in handy when your dataset grows dramatically in size.

------
2ion
Data analysis using a shell can be amazingly productive. We also had a talk
about this at TLUG
([http://tlug.jp/wiki/Meetings:2014:05#Living_on_the_command_l...](http://tlug.jp/wiki/Meetings:2014:05#Living_on_the_command_line_.28by_Kalin_Kozhuharov.29)).

------
impostervt
Any decent tutorials out there to get me up to speed on CL tools? I use grep
and a few others regularly, but have avoided sed and awk as they seem
difficult to jump into.

~~~
jodrellblank
A couple of days ago this clear, brief introduction to AWK was submitted:
[http://ferd.ca/awk-in-20-minutes.html](http://ferd.ca/awk-in-20-minutes.html)

------
robbles
Is there a cached version of the original article that's referenced in this
anywhere? Site appears to be down.

------
nraynaud
On a tangential note, sometimes I use a slower method for UI reasons. For
example avoiding blocking the UI, or allowing for canceling the computation,
or displaying partial results during the computation (that last one might
completely trash the cache).

------
haddr
Great article! PS: probably some hardcore Unix guy would tell you that you are
abusing cat. The first cat can be avoided, and you might gain even better
performance. Also, using GNU grep seems to be faster.

~~~
aadrake
Thank you for the compliment. Point taken on cat, but that's the way I like to
introduce the process to people. I took cat and grep out at the end of the
article anyway.

------
davecheney
1.75GB is not big data. It's not even small data.

~~~
steego
> It's not even small data.

We get it. Your data's pretty big.

~~~
davecheney
1.75GB is three CD-ROMs. It's not big data.

1.75GB of RAM is less than the virtual address space of 32-bit Windows XP;
it's not big data.

Protip: if it fits on the computer on your desk, it's not big data.

------
vander_elst
Until ~10 GB you'd better keep going with single-core machines; you'll see
some improvements with bigger sets, > 100 GB.

------
exabrial
tl;dr: You do not have a big data problem.

~~~
cortesoft
I am glad to know the 4 million requests per second I am processing isn't big
data...

------
hawleyal
Flamebait

------
dschiptsov
Everyone with basic knowledge of CS could realize that Hadoop is a waste.

Unfortunately, it isn't about efficiency at all. It's just memeization. Big
data? Hadoop! Runs everywhere. Same BS as the "Webscale? MongoDB!" meme.

~~~
threeseed
Well sorry but you don't have a clue what you're talking about.

I very much work in "big data", with about 2 terabytes of new data coming in
every day that has to be ingested and processed, with hundreds of jobs running
against it. The data needs to be queryable via an SQL-like language and
analyzed by a dozen data scientists using R or MapReduce.

There isn't anything else on the market today that has been proven to work in
environments like this and has the tooling to back it up. Unless you want to
prove everyone, e.g. Netflix, LinkedIn, Spotify, Apple, and Microsoft, wrong?

~~~
wglb
_Well sorry but you don't have a clue what you're talking about._

From the Guidelines:

 _Be civil. Don't say things you wouldn't say in a face to face conversation.

When disagreeing, please reply to the argument instead of calling names._

~~~
yourad_io
I don't see how he broke the guidelines.

------
smegel
What about if you are processing 100 Petabytes? And you are comparing to a
1000-node Hadoop cluster with each node running 64 cores and 1TB of main
memory?

~~~
dundun
Then you're hardly using commodity hardware anymore. While jobs like that
probably actually work on Hadoop, I'd imagine a problem like that might be
better suited for specialized systems.

~~~
TallGuyShort
IME, most installations where Hadoop is "successfully" used have it running on
pretty high-end machines. "Commodity hardware" really means standard hardware,
not cheap hardware (as opposed to buying proprietary appliances and
mainframes).

~~~
MichaelGG
Or it could be a company with 30M records a month that buys 100 x $200 servers
off eBay and is still unable to query their data.

~~~
TallGuyShort
I'm not sure what your point is with a hypothetical situation. Why wouldn't
they be able to query their data? All I'm saying is that, from my actual
experience with real users, it's best to build a Hadoop cluster with
high-quality hardware if you can.

------
skynetv2
It's a sensational headline... the reality is someone applied the wrong tool
and got bad results.

------
ronreiter
Hadoop is highly inefficient when using the default MapReduce configuration.
And a single MacBook Pro is much stronger than 7 c1.medium instances.

Bottom line - run the same thing over Apache Tez with a cluster that has the
same computational resources as your laptop, and I'm pretty sure you'll see
the same results.

------
wallflower
Awk and Sed aren't very accessible to most people who did not grow up learning
those tools.

The whole point of tools built on top of Hadoop (Hive/Pig/HBase) is to make
large scale data processing more accessible (by hiding the map-reduce as much
as possible). Not everyone will want to write a Java map-reduce in Hadoop.
However, many can write a HiveQL statement or a textual Pig script. Amazon
Redshift takes it even further - it is a _Postgres_-compatible database,
meaning you can connect your Crystal Reports/Tableau data analysis tool to it,
treating it like a traditional SQL database.
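
For instance, the article's tally might be a one-liner for an analyst (the
table and column names are invented, and this assumes the PGN results have
already been loaded into a Hive table):

    
    
        hive -e 'SELECT result, COUNT(*) FROM games GROUP BY result;'
    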

~~~
nfm
I think the author's point was that the example in question was orders of
magnitude smaller than "big data" and that it was more efficient to process it
on a single machine, not that Hadoop and friends aren't easy to use.

