
Command-line tools can be faster than a Hadoop cluster (2014) - matthberg
https://adamdrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html
======
fxtentacle
With all the Hipster tech being released recently, the headline statement
holds true for a lot of things, unfortunately.

We recently discussed new logging tools at work. It was either a redundant
Amazon EC2 cluster with ElasticSearch for $50K monthly, or two large bare
metal servers with rsyslog and grep for $400 monthly. The log ingestion and
search performance was roughly the same...

EDIT: To give everyone a sense of scale, those bare metal servers ($200 each)
are 2x Intel Xeon 6-core + 256GB RAM + 15x 10TB 7200 rpm drives. We retain
logs for 30 days and handle 4-5TB per day.

~~~
jammygit
What cloud advocates always say is that the $50k monthly will save you money
from not needing to hire a team to manage it for you, and that over the course
of 10+ years you will be ahead.

Is that true in anyone's experience? Every once in a while somebody posts
about their competing bare-metal system and it looks like a lot of people have
managed to cut their server costs by 99% (based on the numbers they post) by
avoiding the cloud as a service

Honestly curious

~~~
bcrosby95
We maintain 30 bare metal servers at a colo center, and between me (primarily
a developer) and the CTO we spend maybe 1 day per month "managing" them. The
last time we had a hardware failure was months ago. The last hardware
emergency was years ago.

Servers run on electricity, not sysadmin powered hamster wheels.

~~~
krab
Yes, the maintenance is cheap. The changes are more costly.

We run a dozen bare metal servers and I see the difference between spinning up
a new VM and setting up a new physical server. There's planning, OS
installation (we use preseed images but we weren't able to automate
everything), and sometimes the redundant network setup doesn't play well with
what the switches expect (so you need to call the datacenter).

Still, it works out in favor of the bare metal servers. But I'm looking
forward to a somewhat bigger scale that would justify a MaaS tool to avoid
this gruntwork.

~~~
andreimackenzie
I completely agree that some types of changes are much more expensive with a
bare metal architecture than with cloud.

6 years ago, I worked for a company in the mobile space. This was around the
time of the Candy Crush boom, and our traffic and processing/storage needs
doubled roughly every six months. Our primary data center was rented space
colocated near our urban office. For a while, our sysadmins could simply drive
over and rack more servers. We reached a point where our cages were full, and
the data center was not willing to rent us adjacent space. We were then
looking at a very large project to stand up additional capacity elsewhere to
augment what we had (with pretty serious implications for the architecture of
the whole system beyond the hardware) or move the whole operation to a larger
space.

This problem ended up hamstringing the business for many months, as many of
our decisions were affected by concern about hitting the scale ceiling. We
also devoted significant engineering/sysadmin resources to dealing with this
problem instead of building new features to grow the business. If the company
had chosen a cloud provider or even VPS, it would have been less critical to
try to guess how much capacity we'd need a few years down the road to avoid
the physical ceiling we dealt with.

~~~
krab
Yes, the cloud premium is also a kind of insurance - you know you'll probably
be able to double your capacity anytime you need it.

------
maxmunzel
I did some testing on the same (kind of) dataset and task:

First test: A single 2.9GB file

    time rg Result all.pgn | sort --radixsort | uniq -c
         13 [Result "*"]
    1106547 [Result "0-1"]
    1377248 [Result "1-0"]
    1077663 [Result "1/2-1/2"]
    rg Result all.pgn   1.12s user 0.55s system 99% cpu 1.680 total
    sort --radixsort    3.87s user 0.37s system 71% cpu 5.911 total
    uniq -c             2.69s user 0.02s system 45% cpu 5.909 total

Using Apache Flink and a naive implementation, it took 13.969 seconds.

Second test: same dataset, split between 4 files

    time rg Result chessdata/* | awk -F ':' '{print $2}' - | sort --radixsort | uniq -c
         13 [Result "*"]
    1106547 [Result "0-1"]
    1377248 [Result "1-0"]
    1077663 [Result "1/2-1/2"]
    rg Result chessdata/*        1.70s user 0.97s system 42% cpu  6.292 total
    awk -F ':' '{print $2}' -    5.47s user 0.07s system 88% cpu  6.289 total
    sort --radixsort             4.13s user 0.42s system 43% cpu 10.559 total
    uniq -c                      2.73s user 0.03s system 26% cpu 10.559 total

Flink: 12.724s

Conclusion: For this kind of workload, both approaches have comparable
runtimes, even though taco bell programming has the upper hand (as it should
for simply filtering a text file). It took me about equally long to implement
both. I think both approaches have their use cases.

I ran this locally on my Laptop with 4 logical cores.

~~~
ma2rten
Hadoop is very slow because it persists the data to disk before every stage.
You really wouldn't want to use Hadoop if you don't have a good reason to.
More modern tools like Spark and Flink fare better there.

------
dang
A thread from 2018:
[https://news.ycombinator.com/item?id=17135841](https://news.ycombinator.com/item?id=17135841)

2016:
[https://news.ycombinator.com/item?id=12472905](https://news.ycombinator.com/item?id=12472905)

2015:
[https://news.ycombinator.com/item?id=8908462](https://news.ycombinator.com/item?id=8908462)

~~~
threeseed
It's just the annual expedition for HN where everyone gets their turn to be
smarter than everyone else.

Even more irrelevant now because Hadoop is largely a dead-end technology.

------
thecleaner
The author is experimenting with 1.75 GB of data. At that scale, sure, a local
machine will be faster. Hadoop's real use case, though, is when your data
doesn't fit in memory, and even this is kind of debatable. It makes sense to
measure the performance with some prototypes and then make a final design
rather than just use whatever AWS offers. Besides, packaged services in AWS
are also a bit more costly than basic services like EC2 instances and network
goodies.

~~~
reallydontask
These days you can get servers with terabytes of RAM, so a lot of people
(most?) could fit their data in memory.

I just took a gander at HPE's website and you can get ProLiant servers with up
to 12 TB of RAM (you might be able to get them with more, I did not check in
detail).

~~~
thecleaner
HPE ?

~~~
arnolox
Hewlett-Packard Enterprise

------
beagle3
A classic from 2015 along the same lines: Scalability, but at what COST?

[http://www.frankmcsherry.org/graph/scalability/cost/2015/01/...](http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html)

------
krab
This reminds me of my experience from a company-internal hackathon. My
colleague started writing a Spark program that would process the data we
needed (a few hundred GB uncompressed). Before he finished writing it, I was
able to process all the data on a single machine with a Unix pipeline. The
computationally intensive steps were basically just grep, sort and uniq. When
he finished the program, it couldn't run because of some operational issues on
the cluster at the time, so we never even got a speed comparison.

For me, the moral is that cheap hardware saves money/time twice:

1. It's faster if a program can run on a single machine.

2. It's easier to write a program that runs on a single machine.

With this in mind, cloud works great for analytical data processing. Just
start a big enough machine, download the data, do the computation, upload the
result and turn the machine off, as sketched below. If you develop the program
on a sample of the data so you can do it locally, it gets even cheaper because
you only use the powerful server for a short time.
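
A minimal sketch of that workflow on a single big cloud VM, using the AWS CLI
as one example (the bucket, paths and processing script are hypothetical):

    # download the input data from object storage (bucket and paths are made up)
    aws s3 cp s3://my-bucket/raw/ ./data/ --recursive

    # do the actual crunching with an ordinary local pipeline or script
    ./process.sh ./data/ > result.csv

    # upload the result, then stop paying for the machine
    aws s3 cp result.csv s3://my-bucket/results/result.csv
    sudo shutdown -h now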

~~~
qaq
Given that you can soon get a beefy 64-core Threadripper-based workstation for
under $10K, just running the analysis locally looks like a very decent option.

------
dcolkitt
The two approaches aren't necessarily mutually exclusive. Spark can easily
shell out using pipe(). Plus you can use that to compose and schedule
arbitrarily large data sets through your bash pipeline on a multi-node
cluster.

Beyond that, while the Unix tools are amazing for per-line FIFO-based
processing, they really don't do a great job at anything requiring any sort of
relational algebra.

~~~
greggyb
join and comm would like a word with you.

You can't match SQL expressiveness, but you can definitely handle set-based
stuff.
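
A quick sketch of what that set-based handling looks like with coreutils (the
file names and fields here are made up; join and comm both expect their inputs
sorted on the compared field):

    sort a.txt > a.sorted
    sort b.txt > b.sorted
    comm -12 a.sorted b.sorted   # intersection: lines present in both files
    comm -23 a.sorted b.sorted   # difference: lines only in a.txt

    # relational-style equi-join on the first comma-separated column
    # (both CSVs already sorted on that column)
    join -t, -1 1 -2 1 users.csv orders.csv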

------
bandrami
Wait... are you telling me people over-engineer solutions to ultimately simple
problems? You're _kidding_.

------
supermatt
Very simple processing, not memory bound, tiny data set - of course it's going
to be faster locally when the command itself takes less time than the
networking, distribution, coordination and collation overhead of using any
distributed tool...

~~~
llarsson
You know that, I know that, and we can be happy that we have the experience to
know what the right tool for this job would be by sizing up and describing the
characteristics of the problem like you just did. But those with less
experience may not be able to do that unless shown stuff like this in
practice.

Some may think this problem requires MapReduce. The quote from the original
implementation blog post certainly seems to indicate so.

~~~
threeseed
MapReduce as a paradigm and technology was popular about a decade ago and then
died shortly after in favour of Hive, Spark etc.

Pretty confident that not a single developer anywhere in this world would
think of MapReduce first. Just like they wouldn't jump straight to Cobol.

~~~
jbergens
Some places still seem to go for large Kafka clusters just to calculate some
stats and forward some messages. I am sure some of their solutions use
MapReduce underneath.

~~~
threeseed
Very curious to understand more about these mythical developers who are
recommending a technology/paradigm that stopped being used a decade ago.

You've seen people writing pages and pages of Java code to do ETL ?

~~~
doteka
I worked for a large multinational last year that had multiple teams rolling
out terrible Hadoop solutions.

The rest of the world does not particularly care about SV fashion trends.

------
toolslive
Once you get to the stage where your laptop is just not enough anymore (or
your laptop has some cores you want to add to the processing as well), GNU
parallel might be of use.

[https://www.gnu.org/software/parallel/](https://www.gnu.org/software/parallel/)
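
As a hedged example of that kind of fan-out (the file name, pattern and block
size are illustrative), GNU parallel can split one big file into chunks and
run a grep per core:

    # count matching lines in 100MB chunks on all cores, then sum the partial counts
    parallel --pipepart -a games.pgn --block 100M grep -c '^\[Result' \
      | awk '{sum += $1} END {print sum}'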

------
sm4rk0
Is there any benefit of

    
    
      cat files* | grep pattern
    

over this

    
    
      grep -h pattern files*
    

aside from result color highlighting?

~~~
protanopia
With the first method, you can use xargs to run multiple copies of grep in
parallel, like he did later in the article.
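
Roughly in the spirit of the article's parallel version, combining find and
xargs (the directory, glob and pattern here are made up):

    # one grep process per file, up to 4 at a time
    find chessdata/ -type f -name '*.pgn' -print0 \
      | xargs -0 -n1 -P4 grep -F 'Result' \
      | sort | uniq -c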

~~~
nemetroid
The author switched to using _find_ before doing the parallelization.

------
commandlinefan
If you can fit your data on a single disk drive, you don't need Hadoop.

------
reagent_finder
When all you have is a hammer, every problem starts looking like a nail.

The basic premise is fine: If you have a simple problem, using simple tools
will give you a good result. Here you have text files, you just want to
iterate through them and find a result from ONE line that's the same in every
file, collate the results. No further analysis required.

Every problem in the world can be solved by a bash one-liner, right!?

There's an interesting dichotomy with bash scripts: One school says any bash
script over 100 lines should be rewritten in Python, because it's overcomplex
already. Another school says any Python script used daily over 100 lines
should be rewritten in bash so there are no delusions about it being easy to
maintain.

The original article is from 2013 and doesn't try to do any optimization (I
guess; the original article is unavailable at the time of writing this
comment), so it would be an interesting question to see what you could do at
the Hadoop end to make the query faster. I would imagine quite a lot.

------
StreamBright
The bottom 90% of data users are in the gigabytes range. Anything works.

~~~
keanzu
I've been in a meeting where the amazing scalable cloud solution for the huge
data warehouse was laid out. Turned out to be 500GB. Judging by the death
stares I got I don't think I was supposed to say "Wow, the whole thing would
fit on the 512GB SD card I bought last week".

------
JimmyRuska
We had a poorly performing service which read from a number of REST endpoints
and wrote to S3 in date-prefixed format. Offshore wrote 3,600 lines of code
targeting Kinesis Firehose. By just piping the URL endpoints to a named pipe
and cycling the S3 file in Python, my code was 55 lines and did the same thing
without Kinesis. Wrapping things in GNU parallel and using bash flags, it
handles any failure cases super gracefully, which is something the offshore
code did not do. The offshore code had a global exception catch-all, and would
print the error and return a success exit code... but I guess someone got to
put Kinesis on their resume.

------
tomerbd
I maintain a very small command line cheatsheet here that I come back to for
reference, mostly for data analysis tasks:
[https://tinyurl.com/tomercli](https://tinyurl.com/tomercli)

~~~
tomashertus
It's not accessible. Can you publish it?

~~~
tomerbd
yeah sorry here:
[https://sites.google.com/view/tomerbendavid/commandline](https://sites.google.com/view/tomerbendavid/commandline)

------
m0zg
Been saying that for years. Also, get this, 99.999% of companies do not need
"big data" or distributed systems of any kind. I feel like the old "cheap
commodity hardware" pendulum swung way too far. More expensive, less
"commodity" hardware can often be cheaper, if correctly deployed. I.e. you
don't need a distributed database if your database is below 1TB and QPS is
reasonable (and what's "reasonable" can surprise you today with large NVME
SSDs, hundreds of gigabytes of RAM, and 64-core machines being affordable).

------
jonstewart
This was a straw man article in 2014, it was a straw man article the other
times it’s been posted to HN in the intervening years, and it’s still a straw
man article in 2020. As noted in another comment here, the contemporary
technology of Apache Flink really isn’t far off command-line tools running on
a single machine. Meanwhile, HDFS has made a lot of progress on its overhead,
particularly unnecessary buffer copies. There are datasets where a Hadoop
approach makes sense. But not for ones where the data fits in RAM on a single
system. No one has ever argued that.

------
jsjohnst
While I personally would use a similar pipeline as OP for such a small data
set, saying Hadoop would take 50min for this is just flat wrong. It shows a
clear lack of understanding of how to use Hadoop.

------
openstep
Amen. You can do a lot with pipes, various utils (sed, awk, grep, gnu
parallel, etc.), sockets, so on and so forth. I see folks abuse Hadoop way too
often for simple jobs.

~~~
mikorym
I am always tempted to say too that "vim can be faster than IDE x"... But I
guess that is a bit more subjective.

------
pts_
That's because Hadoop is a big favorite of wining and dining suits, who scram
at the sight of the command line.

------
barrkel
If you're disappointed with the speed and complexity of your Hadoop cluster,
and especially if you're trying to crack a bit, you should give ClickHouse a
spin.

~~~
KptMarchewa
>crack a bit

What does that mean? I don't understand if you're trying to endorse ClickHouse
or make fun of it.

~~~
barrkel
Phone typo. Should have been 'nut'.

And yes, I'm endorsing ClickHouse; it scales down much better than Hadoop.

------
sandGorgon
If you're doing Spark or Hadoop today and are a Python shop... you should
definitely look at Dask: [https://dask.org/](https://dask.org/)

Works as well as Spark. Very lightweight. Works through Docker.

Integrated with Kubernetes from the ground up (runs on EKS/GKE, etc.).

And no serialization between Java/Python, no fat-JAR stuff, etc.

------
philshem
Command line tools like grep, awk, sed, etc. are great for structured,
line-based files like logs. For JSON documents I can add a recommendation for
jq:

[https://stedolan.github.io/jq/](https://stedolan.github.io/jq/)
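
For instance, a small sketch of the kind of pipeline this enables (the file
and field names are invented):

    # top users by event count from a file of JSON lines
    cat events.jsonl | jq -r '.user.id' | sort | uniq -c | sort -rn | head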

------
arthurcolle
Cloud computing is kind of a joke. Yeah keep paying someone for shared
"virtual computers", that sounds suspiciously similar to shared hosting from a
decade or 2 ago... Oh but this is different, you get isolation from
containers/VMs! Yeah ok, meanwhile new exploits emerge every couple weeks.
It's like tech debt ideology on steroids... just keep pumping out instances
until the company either goes hyperbolic, or goes bankrupt. Realistically,
just buy a few physical servers and actually work to build efficiency into the
system instead of just throwing compute at your public-facing web app.

I recently bought a Dell R710 just for fun and was pleasantly surprised that
even days after spinning up a bunch of VMs, I don't have a 30GB logfile of all
the failed attempts at getting into my instance (this was my experience
recently with 2 cloud providers!).

It'll be interesting to see how the market reacts when you have a
"first-of-its-kind" massive, massive security breach that affects popular
"pure play" internet companies hosted on top of the mythical "cloud."

Seriously, $READER, look at your cloud computing-dependent startup, and
calculate egress costs for your storage, as if you HAD to stop using cloud
tomorrow. How much does it cost you? How could you adapt? It's designed to
keep you dependent on 3rd parties... Idk, IMO it is really not great.

~~~
fxtentacle
I see one really positive point in cloud computing and that is that I can
soundly sleep at night :)

Of course, cloud is overpriced, slow, and suffers from noisy neighbors. And
keeping things running in the cloud is about the same amount of work as
keeping it running on bare-metal. But for customer-visible things, I want to
use cloud so that someone else has to get up in the middle of the night when
apache crashes.

Sleeping peacefully makes it worth it for me to pay $5,000 monthly to Heroku
when 2-3x $100 bare metal servers would do. Plus I can cheaply insure against
supplier negligence, whereas insuring against employee negligence would be
much more expensive.

~~~
pepemon
Noisy-neighbor tracking can be automated, and the problem partially solved
that way. With cloud you gain access to the provisioning API, and with a
Terraform/Ansible stack (for example) you are able to build up and manage
infra quickly, efficiently and declaratively. Bare metal provisioning can also
be automated (via a private cloud solution, for example) but you need a
dedicated team for that (and good OpenStackers aren't cheap). I was once
solo-managing 500+ hosts on a public cloud; there is no chance you can do this
without what you call "hipster tech" and a modern devops toolchain.

