
Command-line tools can be faster than a Hadoop cluster (2014) - 0xmohit
http://aadrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html
======
vlahmot
I feel like every time something like this comes up, people completely skip
over the benefit of having as many of your data processing jobs in one
ecosystem as possible.

Many of our jobs operate on low TBs and growing, but even if the data for a
job is super small I'll write it in Hadoop (Spark these days) so that the
build, deployment, and scheduling of the job is handled for free by our
current system.

Sure, spending more time listing files on S3 at startup than running the job
is a waste, but it costs far less than the man-hours to build and maintain a
custom data transformation.

The main benefit of these tools is not the scale or processing speed though.
The main benefits are the fault tolerance, failure recovery, elasticity, and
the massive ecosystem of aggregations, data types and external integrations
provided by the community.

~~~
aub3bhat
Having worked at a large company and extensively used their Hadoop Cluster, I
could not agree more with you.

The author of the blogpost completely misses the point. The goal with Hadoop
is not minimizing the lower bound on time taken to finish the job, but rather
maximizing disk read throughput while supporting fault tolerance, failure
recovery, elasticity, and the massive ecosystem of aggregations, data types
and external integrations, as you noted. Hadoop has enabled Hive, Presto and
Spark.

The author completely forgets that the data needs to be transferred in from
some network storage and the results need to be written back! For any
non-trivial organization (> 5 users), you cannot expect all of them to SSH
into a single machine. It would be an instant nightmare. This article is
essentially saying "I can directly write to a file in a local file system
faster than to a database cluster", hence the entire DB ecosystem is hype!

Finally, Hadoop is not a monolithic piece of software but an ecosystem of
tools and storage engines. E.g. consider Presto: software developers at
Facebook realized the exact problem outlined in the blogpost, but instead of
hacking bash scripts and command line tools, they built Presto, which
essentially performs similar functions on top of HDFS. Because of the way it
works, Presto is actually faster than the "command line" tools suggested in
this post.

[https://prestodb.io/](https://prestodb.io/)

~~~
qwertyuiop924
But, as I pointed out in another comment, what about systems like Manta, which
make transitioning from this sort of script to a full-on mapreduce cluster
trivial?

Mind, I don't know the performance metrics for Manta vs Hadoop, but it's
something to consider...

~~~
aub3bhat
From my experience, organizations have adopted Hive/Presto/Spark on top of
Hadoop, which actually solves a whole bunch of problems that the "script"
approach would not, with several added benefits. Executing scripts (cat, grep,
uniq, sort) does not provide similar benefits, even if it might be faster. A
dedicated solution such as Presto by Facebook will provide similar if not even
faster results.

[https://prestodb.io/](https://prestodb.io/)

~~~
qwertyuiop924
Ah, so it doesn't solve data storage, and runs SQL queries, which are less
capable than UNIX commands. If your data's stuck inside 15 SQL DBs, then
that'd make sense, but a lot of data is just stored in flat files. And you
know what's really good at analyzing flat files? Unix commands.
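The whole pipeline style at issue fits in a couple of lines. A minimal sketch
in the spirit of the article's chess example (the /tmp path and PGN contents
here are made up for illustration):

```shell
# Build two tiny PGN-style files, then tally game results the Unix way:
# grep pulls out the Result tags, sort groups identical lines, uniq -c counts.
mkdir -p /tmp/pgn_demo
printf '[Result "1-0"]\n' > /tmp/pgn_demo/a.pgn
printf '[Result "0-1"]\n[Result "1-0"]\n' > /tmp/pgn_demo/b.pgn

grep -h '^\[Result' /tmp/pgn_demo/*.pgn | sort | uniq -c
```

Swap the toy files for real data and the same three commands stream through
gigabytes without ever loading them into memory.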

~~~
aub3bhat
Did you even read it? Presto reads directly from HDFS, which is as close to
distributed "flat files" as you can get. As far as "SQL being less capable
than UNIX commands", you have got to be kidding me. SQL allows type checking,
conversion, joins all of which are difficult if not impossible with grep |
uniq | sort etc.

~~~
qwertyuiop924
I read it.

>Presto allows querying data where it lives, including Hive, Cassandra,
relational databases or even proprietary data stores. A single Presto query
can combine data from multiple sources, allowing for analytics across your
entire organization.

That doesn't sound like HDFS to me. I mean, I assume it _can_ read from HDFS,
but Presto is backend agnostic. You could probably write code to run it on
Manta. That would be neat for people who like Presto, I guess.

Type checking and conversions, no, and table joins only matter when you're
handling relational data.

Also, how many formats can Presto handle? Unix utilities can handle just about
any tabular data, and you can run them against non-tabular data in a pinch
(although nobody recommends it). I doubt Presto is _that_ versatile.

~~~
threeseed
Hive operates on top of HDFS.

Presto absolutely runs directly on HDFS.

~~~
qwertyuiop924
Huh. Well then, I don't understand HDFS, or Facebook needs to fix Presto's
front page. Both are reasonably likely.

------
cs702
Most "big data" problems are really "small data" by the standards of modern
hardware:

* Desktop PCs with up to 6TB of RAM and many dozens of cores have been available for over a year.[1]

* Hard drives with 100TB capacity in a 3.5-inch form factor were recently announced.[2]

 _CORRECTION: THE FIGURE IS 60TB, NOT 100TB. See MagnumOpus's comment below.
In haste, I searched Google and mistakenly linked to an April Fools' story.
Now I feel like a fool, of course._ Still, the point is valid.

* Four Nvidia Titan X GPUs can give you up to 44 Teraflops of 32-bit FP computing power in a single desktop.[3]

Despite this, the number of people who have unnecessarily spent money and/or
complicated their lives with tools like Hadoop is pretty large, particularly
in "enterprise" environments. A lot of "big data" problems can be handled by a
single souped-up machine that fits under your desk.

[1] [http://www.alphr.com/news/enterprise/387196/intel-
xeon-e7-v2...](http://www.alphr.com/news/enterprise/387196/intel-
xeon-e7-v2-servers-support-6tb-of-ram)

[2] [http://www.storagenewsletter.com/rubriques/hard-disk-
drives/...](http://www.storagenewsletter.com/rubriques/hard-disk-
drives/incredible-record-of-100tb-into-3-5-inch-hdd-with-hamrhelium/)

[3]
[https://news.ycombinator.com/item?id=12141334](https://news.ycombinator.com/item?id=12141334)

~~~
MagnumOpus
> Hard drives with 100TB capacity in a 3.5-inch form factor were recently
> announced

That is an April Fools story.

(Of course you can still get a Synology DS2411+/DX1211 24-bay NAS combo for a
few thousand bucks, but it will take up a lot of space under your desk and
keep your legs toasty...)

~~~
jamescun
Earlier this year, Seagate were showing off their 60 TB SSD[1] for release
next year.

So 100 TB in a single drive isn't too far off.

EDIT: Toshiba is teasing a 100 TB SSD concept, potentially for 2018 [2]

[1] 11th August - [http://arstechnica.co.uk/gadgets/2016/08/seagate-
unveils-60t...](http://arstechnica.co.uk/gadgets/2016/08/seagate-unveils-60tb-
ssd-the-worlds-largest-hard-drive/)

[2] 10th August -
[http://www.theregister.co.uk/2016/08/10/toshiba_100tb_qlc_ss...](http://www.theregister.co.uk/2016/08/10/toshiba_100tb_qlc_ssd/)

~~~
rusanu
60TB at 500MB/s transfer will take over a day to read. This is the problem of
drinking the ocean through a straw. Even with SSD transfer rates, it is still
a problem at scale. Clusters give you not only capacity, but also a
multiplication factor for transfer rates.
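The back-of-the-envelope arithmetic behind that claim, assuming 500 MB/s
sequential reads and vendor-style decimal units:

```shell
# Hours to read 60 TB at 500 MB/s, for one drive and for a 24-way stripe
awk 'BEGIN {
  mb     = 60 * 1000 * 1000        # 60 TB expressed in MB
  single = mb / 500 / 3600         # seconds converted to hours, one drive
  printf "single drive:  %.1f hours\n", single
  printf "24-way stripe: %.1f hours\n", single / 24
}'
```

A single drive lands at roughly 33 hours, i.e. the "over a day" above;
striping across 24 drives brings it down to under an hour and a half.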

~~~
faragon
Just use 24 of them interleaved/striped and it will take just over an hour to
load the data.

~~~
rusanu
But then you need small disks (e.g. 2TB). My point is that huge capacity
drives are not appropriate in _compute_ environments, as Hadoop is. They're
more for cold storage.

------
deadgrey19
This idea was the subject of a paper at a major systems conference. The paper
is called "Scalability! But at what COST?" - it goes well beyond the simple
example above to explore how most major systems papers produce results that
can be beaten by a single laptop. Here's the paper and the blog post
describing it.

[http://www.frankmcsherry.org/assets/COST.pdf](http://www.frankmcsherry.org/assets/COST.pdf)

[http://www.frankmcsherry.org/graph/scalability/cost/2015/01/...](http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html)

~~~
aub3bhat
The paper is frankly stupid and a great example of the difference between
practice and academia. It looks good because they are using a snapshot of the
Twitter network from 2010. In reality the workflow is complex, e.g. the
follower graph gets updated every hour. 10 different teams have their
different requirements as to how to set up the graph and computations. These
computations need to be run at different (hourly, weekly, daily) granularity.
100 downstream jobs are also dependent on them and need to start as soon as
the previous job finishes. The output of the jobs gets imported/indexed in a
database which is then pushed to production systems and/or used by analysts
who might update and retry/rerun computations. Unlike for a bunch of
out-of-touch researchers, the key concern isn't how "fast" calculations
finish, but several others such as ability to reuse, fault tolerance,
multi-user support etc.

I can outrun a Boeing 777 on my bike in a 3 meter race but no one would care.
The single laptop example is essentially that.

~~~
frankmcsherry
> The paper is frankly stupid and a great example of difference between
> practice and academia. it looks good because they are using a snapshot of
> Twitter network from 2010.

We used these data and workloads because that was what GraphX used. If you
take the graphs any bigger, Spark and GraphX at least couldn't handle it and
just failed. They've probably gotten better in the meantime, so take that with
a grain of salt.

> Unlike a bunch of out of touch researchers the key concern isn't how "fast"
> calculations finish, but several others such as ability to reuse, fault
> tolerance, multi user support etc.

The paper says these exact things. You have to keep reading, and it's hard I
know, but for example the last paragraph of section 5 says pretty much exactly
this.

And, if you read the paper even more carefully, it is pretty clearly not about
whether you should use these systems or not, but how you should not evaluate
them (i.e. only on tasks at a scale that a laptop could do better).

~~~
aub3bhat
"The paper says these exact things. You have to keep reading, and it's hard I
know, but for example the last paragraph of section 5 says pretty much exactly
this."

Thanks, that addresses my concern. I take back my comment.

But why stop at a Rust implementation? There are vendors optimizing this down
to FPGAs. This sort of comparison is hardly meaningful.

~~~
frankmcsherry
The only point of the paper is that the previous publications sold their
systems primarily on performance, but their performance arguments had gaping
holes.

The C# and Rust implementations have the property that they are easy and you
don't need to have any specific skills to write a for-loop the way we did (the
only "tricks" we used were large pages and unbuffered io in C#, and mmap in
Rust).

The point is absolutely not that these are the final (or any) word in these
sorts of computations; if you really care about performance, use FPGAs, ASICs,
whatever. There will always be someone else doing it better than you, but we
thought it would be nice if that person wasn't a CS 101 undergraduate typing
in what would literally be the very first thing they thought of.

------
mej10
> 1.75GB

> 3.46GB

These will fit in memory on modest hardware. No reason to use Hadoop.

The title could be: "Using tools suitable for a problem can be 235x faster
than tools unsuitable for the problem"

~~~
TickleSteve
This is exactly the point he was making.

People have a desire to use the 'big' tools instead of trying to solve the
real problem.

People both underestimate the power of their desktop machine and the 'old'
tools and overestimate the size of their task.

~~~
cheez
Yeah, someone was telling me they need big data for a million rows. I laughed
and said SQLite handles that...

~~~
matt_wulfeck
I would not want to be the one on-call for a million row SQLite database!

~~~
regularfry
I would. That's a pager that's never going off.

~~~
qwertyuiop924
Yeah. No significant software is bug-free, but if there was one, SQLite would
be a good candidate.

~~~
jschwartzi
Especially having looked at how thorough their test suite is.

~~~
qwertyuiop924
That's what I'm talking about. What other database tests against power outages
and hardware failure?

~~~
cheez
I'm sure many do...

~~~
qwertyuiop924
Do they do it for every release? What about OOM error testing? IO error tests?
Fuzz tests? UB tests? Malformed DB tests? Valgrind analysis? Memory leak
checking? Regression tests?

SQLite does all of that and more with every release
([https://sqlite.org/testing.html](https://sqlite.org/testing.html)). There's
a reason I consider their test coverage impressive...

~~~
cheez
I don't know. I remember reading that other DBs use SQLite's test suite.

~~~
qwertyuiop924
Huh. Well, I learned something new today.

------
qwertyuiop924
...And when you _do_ have 140 TB of chess data, you can move to Manta, and you
get to keep your data processing pipeline almost exactly the same. Upwards
scalability!

I don't know how the performance would stack up against Hadoop, but it'd work.

~~~
elcritch
Actually, I just posted essentially the same thing, before reading your
comment. I'm wondering as well how the performance would/will scale. It likely
depends on how the data is scattered/replicated, but presumably they've worked
out decent schedulers for the system. If not, it is open source! Lovin' it.

~~~
qwertyuiop924
Well, all the code running against the data would already have the
parallelization advantages of a shell script, as described in this article. It
would additionally probably be running across multiple nodes, meaning that the
IO speeds increase the number of records that can be processed simultaneously.

The disadvantage is that the data has to be streamed over a network to the
reducer node, which could add a good chunk of latency, depending on how fast
that network is. If you can do some reduction during the map, it would help,
but it's possible (and indeed likely) that Manta spawns one process and
virtualized node per object, meaning this is impossible. It also depends on
how many virtual nodes are running on the same physical hardware, since the
network latency is near zero if the reducer and mapper nodes are on the same
physical system (but then you're running into the same boundaries you hit on a
laptop, just on a much beefier system).

But if you're processing terabytes, the network latency is probably barely
factoring into your considerations, given how much time you're saving by
processing data in parallel in the first place.

~~~
elcritch
That's pretty similar to my thinking on the performance. Though your point
about the combination of shell script streaming and parallelization is a good
way to express it.

The real benefit of this system would show up compared to "traditional"
(modern?) big data tools like Spark, where the network latency cost of the
reduce phases should be comparable. But since Manta localizes the compute to
the data, there should be an overall order of magnitude less network transfer,
which should significantly reduce the cost of Manta-based solutions compared
to Spark/S3 solutions.

In theory at least; it'd be great to test this on equivalent hardware, or at
least equivalently priced hardware. But that would require a nice test data
set which I don't have the resources to set up. Any suggestions on data or
code that could test the above assumptions would be handy ( _ahem_ HN peeps
got anything?).

_Edits: grammar_

~~~
qwertyuiop924
I've got nothing on data or code. You could try running a comparable S3/EC2
setup against Manta on Joyent, but that would be relatively expensive, and I
have no idea of the differences between Amazon's and Joyent's datacenter
layouts, so such a test would not be optimal, although it would test each in
its most common use case.

It's also worth mentioning, for performance analysis, that Manta is backed by
ZFS and Zones, so it has the performance characteristics of those.

------
nekopa
Reminds me of one of my all time favorite comments about Bane's Rule:

[https://news.ycombinator.com/item?id=8902739](https://news.ycombinator.com/item?id=8902739)

~~~
matt_wulfeck
And yet a fourth dimension to the problem: time/difficulty multiplier.

    
    
        Going   Multiplier
        -----   ---------
        High    1x
        Wide    4x
        Deep    8x
    

Make the best decision given your current circumstances and engineering
resources.

~~~
combatentropy
Bane's "deep" approach saved tens of millions of dollars in one example,
months of time in another. In all cases it was many times easier for the team
to keep up, day after day, year after year.

Bane advocates good old-fashioned refactoring. With humble tools like Perl,
SQL, and an old desktop, he bested new, fancy, expensive products. Bane
deserves the salary of a CEO, or at least a vice president, for the good he
has brought to his company.

I think ego leads us to make choices that appeal in the short run but are bad
in the long run. Which is more impressive sounding: that you bought a
distributed network of a thousand of the newest, shiniest machines, running
the latest version of DataLaserSharkAttack; or that you cobbled together some
Perl, SQL, and shell one-liners on a four-year-old PC?

Also, good old-fashioned hard work is painful. It is a good kind of pain, like
working out your body, rather than a bad kind of pain, like accidentally
cutting yourself. But it is just good old-fashioned, humble, hard work to sit
down, work through the details, and come up with a better plan.

Before that, it is even more humble, hard work, to learn the things that Bane
had learned. Not just anybody could have done what he did. First you have to
learn the ins and outs of Perl, SQL, and all the little shell commands, and
all their little options. He knows about a lot of different programming
problems, like what a "semantic graph" is (I can't say I do), what an
"adjacency matrix" is (nope), whether something is an O(N^2) problem or an
O(k+n^2) problem (I know I've seen that notation before).

------
edude03
Arguably, if you can keep everything on one box it will almost always be
faster (and cheaper!) than any sort of distributed system. That said,
scalability is generally more important than speed, because once a task can be
distributed you can add performance by adding hardware. As well, depending on
your use case, you can often get fault tolerance "for free".

~~~
zzzcpan
There is nothing preventing distributed systems from being faster than one box
for this kind of thing. But they don't always bother to pursue efficiency on
that level, because things are very different once you have a lot of boxes,
and something that used to look important for a couple of boxes doesn't
anymore.

~~~
kuschku
Yes, there is, you have a lot of overhead in any case for the same tools.

~~~
zzzcpan
You don't have the same tools. You are probably thinking about emulating POSIX
filesystem API and things like that and using those command-line tools on top
of that in a single-box kind of way. That's not how you treat your distributed
system.

EDIT: For something that beats a single box easily I envision an interpreter
with JIT running on each node in a distributed system and on the same process
that stores data, having pretty much no overhead to access and process it.

~~~
qwertyuiop924
>You are probably thinking about emulating POSIX filesystem API and things
like that and using those command-line tools on top of that in a single-box
kind of way. That's not how you treat your distributed system.

Yeah, but Manta's mapreduce does something close, and it seems to work okay.

------
ben_bai
This seems to be the Manta [0] way: letting you run your beloved Unix command
pipeline on your Object Store files.

[0] [https://www.joyent.com/manta](https://www.joyent.com/manta) But the
YouTube videos with Bryan Cantrill are even better at explaining it.

------
bobivl
If you like to do data analyses in bash, you might also enjoy bigbash[1]. This
tool generates quite performant bash one-liners from SQL SELECT statements
that easily crunch GBs of CSV data.

Disclaimer: I am the main author.

[1] [http://bigbash.it](http://bigbash.it)
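For a flavor of the translation (this one is hand-written to show the idea,
not actual bigbash output, and the visits.csv layout is invented): `SELECT
country, COUNT(*) FROM visits GROUP BY country ORDER BY 2 DESC` becomes
roughly:

```shell
# Hypothetical input: a CSV with a header row and the country in column 2
printf 'id,country\n1,DE\n2,US\n3,DE\n' > /tmp/visits.csv

# skip header | project country | GROUP BY + COUNT(*) | ORDER BY count DESC
tail -n +2 /tmp/visits.csv | cut -d, -f2 | sort | uniq -c | sort -rn
```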

~~~
qwertyuiop924
That's pretty cool.

Do you think you can get it to support Manta? I think a lot of people in that
ecosystem could benefit from it if you could. I'd help, but I don't really
know Java all that well :-(.

------
chajath
It's all about picking the right tool for the job. I think shell scripting is
a great prototyping tool and often a good place to start. As the problem gets
more complex and bigger, eventually it will warrant a full scale development.

~~~
astrobe_
I think people overlook the fact that the author made an even stronger point
by using shell scripting, which is relatively inefficient compared to using a
compiled language. I guess it would hit the I/O cap without even going
parallel.

------
eggy
Date on article: Sat 25 January 2014

I am not a Big Data expert, but does that change any of the comments below
with reference to large datasets and memory available?

I use J and Jd for fun with great speed on my meager datasets, but others have
used them on billion-row queries [1]. Along with q/kdb+, it was faster than
Spark/Shark last I checked; however, I see Spark has made some advances
recently that I have not checked into.

J is interpreted and can be run from the console, from a Qt interface/IDE, or
in a browser with JHS.

[1]
[http://www.jsoftware.com/jdhelp/overview.html](http://www.jsoftware.com/jdhelp/overview.html)

~~~
justinsaccount
There isn't exactly a direct relationship between the size of the data set and
the amount of memory required to process it. It depends on the specific
reporting you are doing.

In the case of this article, the output is 4 numbers:

    
    
      games, white, black, draw
    

Processing 10 items takes the same amount of memory as processing 10 billion
items.

If the data set in this case was 50TB instead of a few GB, it would benefit
from running the processing pipeline across many machines to increase the IO
performance. You could still process everything on a single machine, it would
just take longer.

Some other examples of large data sets+reports that don't require a large
amount of memory to process:

    
    
      * input: petabytes of web log data. output: count by day by country code
      * input: petabytes of web crawl data. output: html tags and their frequencies
      * input: petabytes of network flow data. output: inbound connection attempts by port
    

Reports that require no grouping (like this chess example) or group things
into buckets with a defined size (ports that are in a range of 1-65535) are
easy to process on a single machine with simple data structures.
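That fixed-size output is what keeps memory constant: a streaming pass whose
only state is a handful of counters, however long the input is. A sketch with
made-up inline data standing in for a real result stream:

```shell
# Count games/white/black/draw in one pass; the only state is four counters,
# so memory use is the same for 10 lines or 10 billion.
printf '1-0\n0-1\n1/2-1/2\n1-0\n' | awk '
  { games++ }
  $0 == "1-0"     { white++ }
  $0 == "0-1"     { black++ }
  $0 == "1/2-1/2" { draw++ }
  END { printf "games=%d white=%d black=%d draw=%d\n",
               games, white, black, draw }'
```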

Now, as soon as you start reporting over more dimensions things become harder
to process on a single machine, or at least, harder to process using simple
data structures.

    
    
      * input: petabytes of web crawl data. output: page rank
      * input: petabytes of network flow data. output: count of connection attempts by source address and destination port
    

I kinda forget what point I was trying to make.. I guess.. Big data != Big
report.

I generated a report the other day from a few TB of log data, but the report
was basically

    
    
      for day in /data/* ; do #YYYY-MM-DD
        echo $day $(zcat $day/*.gz | fgrep -c interestingValue)
      done

------
matt_wulfeck
There are a lot of operational benefits to running on Hadoop/YARN as well. You
get node resiliency (host went down? Run the application over there). You also
get the Hadoop filesystem, which conveniently stores your data in S3 and
distributed HDFS.

These systems were designed by people who probably managed difficult etl
pipelines that were nothing but what the author suggests: simplified shell
scripts using UNIX pipes.

Besides, going up against Hadoop MR is easy. I'd like to see you compete
against something like Facebook's Presto or Spark, which are optimized for
network and memory.

------
eva1984
What is the point? Who would want to use Hadoop for something below 10GB?
Hadoop is not good at doing what it is not designed for? How useful.

~~~
virmundi
Kind of depends on what the 10 GB is. For example, on my project, we started
on files that were about 10 GB a day. The old system took 9 hours to enhance
the data (add columns from other sources based on simple joins). So we did it
with Hadoop on two Solaris boxes (18 virtual cores between them). Same data;
45 minutes. But wait, there's more.

We then created two fraud models that took that 10+ GB file (enhancement added
about 10%) and executed within about 30 minutes apiece, but concurrently. All
on Hadoop. All on arguably terrible hardware. Folks at Twitter and Facebook
had never thought about using Solaris.

We've continued this pattern. We've switched tooling from Pig to Cascading
because Cascading works in the small (your PC without a cluster) and in the
large. It's testable with JUnit in a timely manner (looking at you PigUnit).
Now we have some 70 fraud models chewing over anywhere from that 10+ GB daily
file set to 3 TB. All this in our little 50 node cluster. All within about 14
hours. Total processed data is about 50 TB a day.

As pointed out earlier, Hadoop provides an efficient, scalable, easy
distributed application development platform. Cascading makes life very Unix-
like (pipes and filters and aggregators). This, coupled with a fully async
eventing pipeline for workflows built on RabbitMQ, makes for an infinitely
expandable fraud detection system.

Since all processors communicate only through events and HDFS, we add new
fraud models without necessarily dropping the entire system. New models may
arrive daily, conform to a set structure, and are literally self-installed
from a zip file within about 1 minute.

We used the same event + Hadoop architecture to add claim line edits. These
are different from fraud models in that fraud models calculate multiple
complex attributes then apply a heuristic to the claim lines. Edits look at a
smaller operation scope. But in cascading this is pipe from HDFS -> filter for
interesting claim lines -> buffer for denials -> pipe to HDFS output location.

Simple, scalable, logical, improvable, testable. I've seen all of these. As
the community comes out with new tools, we get more options. My team is
working on machine learning and graphs. Mahout and Giraph. Hard to do all of
this easily with a home grown data processing system.

As always, research your needs. Don't get caught up in the hype of a new
trend. Don't be closed minded either.

------
banku_brougham
I agree that scalable infrastructure is needed to manage a production
pipeline, as others have explained well.

I found this article a useful reminder, because sometimes a job doesn't
require a fully grown infrastructure. I commonly get requests that don't
overlap with existing infrastructure and won't need any follow-up. In that
particular case a Hadoop cluster, heck, even loading into a pg db, would be
wasted effort.

But I wouldn't want to manage our clickstream analytics pipeline with shell
scripts and cron jobs.

Is there any lightweight tooling out there that can schedule/run basic
pipeline jobs in a shell environment?

~~~
hcrisp
Airflow? It might not be what you consider lightweight, though.

------
regularfry
In my experience you can classify people into two herds: those who, when faced
with a problem, solve it directly; and those who, faced with the same problem,
try to fit it to the tools they want to use. I like to think this is a
maturity question, but I can't say I've actually seen someone make the
transition from the latter to the former type.

~~~
reflexive
> those who, when faced with a problem, solve it directly

I think you mean "use the tools they already know".

> those who, faced with the same problem, try to fit it to the tools they want
> to use

I think you mean "use the correct tools for the job".

> I like to think this is a maturity question

It is, but the direction of maturity is from the first case to the second.

------
sfifs
This is just plain click bait.

Obviously if a dataset is small enough to possibly fit in memory, it will be
much faster to run on a single computer.

~~~
unethical_ban
I guess the author should have called it out more explicitly for some, but I
think that's the point.

I've seen the testimony dozens of times on HN, and I've heard it from a friend
who manages Hadoop at a bank, and I've seen it with people building scaled ELK
stacks for log analysis: People are too eager to scale out when things can be
done locally, given moderate datasets.

~~~
Anderkent
Though sometimes hadoop makes sense even if local computation is faster. For
example you might just be using hadoop for data replication.

~~~
gopalv
> For example you might just be using hadoop for data replication.

Good point. The reason someone who holds data for 7+ years uses hadoop is not
because it is faster.

The processing aspect of the system is only tangential to the failure
tolerance when you consider the age of the data set.

HDFS does waste a significant amount of IO merely reading through cold data
and validating the checksums, so that it is safe against HDD bit-rot
(dfs.datanode.scan.period.hours).

The general argument about failure tolerance is off-site backups, but the
backups tend to have availability problems (i.e machine failed, the Postgres
backup restore takes 7 hours).

The system is built for constant failure of hardware, connectivity and in some
parts, the software itself (java over C++) - because those are unavoidable
concerns for a distributed multi-machine system.

The requirement that it be fast takes a backseat to the availability,
reliability and scalability - an unreliable, but fast system is only useful
for a researcher digging through data at random, not a daily data pipeline
where failures cascade up.

------
dpcx
The Hadoop article linked is available at
[https://web.archive.org/web/20140119221101/http://tomhayden3...](https://web.archive.org/web/20140119221101/http://tomhayden3.com/2013/12/27/chess-
mr-job/)?

------
sgt101
Ok, now do it for >2TB.

Our prod Hadoop dataset is now >130TB, try that!

~~~
teh
What kind of data do you have? Is this mostly text, or more like compressed
time series?

~~~
cottonseed
We're analyzing genome sequence data on that scale: [https://github.com/hail-
is/hail](https://github.com/hail-is/hail)

------
amelius
But what would happen if those exact same command-line tools were used inside
a Hadoop node? What would be the optimum number of processors then?

~~~
TickleSteve
You would just be adding management overhead.

More software != more efficient software.

~~~
amelius
But faster because parallel.

------
linuxhansl
Yes, you do not need a 100 node cluster to crunch 1.75GB of data. I can do
that on my phone. What's the author's point?!

~~~
danudey
That hadoop's reputation of being worth the hassle and complexity that
managing it entails is undeserved.

------
cottonseed
Is there a place to download the database that isn't a self-extracting
executable? (Seriously?)

------
zobzu
Not just faster to run but also much faster to write

------
necessity
>cat | grep

why
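i.e. the classic "useless use of cat": grep takes filenames itself, so the cat
buys an extra process and a pipe for nothing (the demo file below is
illustrative):

```shell
printf 'noise\nResult\n' > /tmp/uuoc_demo.txt
cat /tmp/uuoc_demo.txt | grep Result   # works, but cat is redundant
grep Result /tmp/uuoc_demo.txt         # same output, one fewer process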

------
yelnatz
(2014)

Here's the previous comments from the submission a couple years ago:
[https://news.ycombinator.com/item?id=8908462](https://news.ycombinator.com/item?id=8908462)

------
aub3bhat
This article is a great litmus test for checking if someone has experience
working at scale (multi-terabytes, multiple analysts, multiple job types) or
not. Anyone who has had that experience will instantly describe why this
article is wrong. It's akin to saying a Tesla is faster than a Boeing 777 on a
100 meter track.

~~~
detaro
I'd hope people who have worked at scale still are capable of recognizing when
the tools they used there are totally overkill. I'd suspect they would, since
they'd also be more aware of their limitations (vs somebody without
experience, who has to believe the "you need big data and everything is easy"
marketing).

That you wouldn't use a Boeing 777 _IF_ your problem is just a 100m track is
the entire point of the article. It's explicitly not saying that you should
never use the big tools.

~~~
aub3bhat
They are not overkill at all, rather they are tuned towards different set of
performance characteristics. E.g. in the Boeing 777 example above,
transatlantic journey.

In the article above, the data and results stay on the local disk, however in
any organization, they need to be stored in a distributed manner, available to
multiple users with varying levels of technical expertise. Typically in NFS or
HDFS, preferably if they are records stored/indexed via Hive/Presto. At which
point the real issue is how do you reduce the delay resulting from
transferring data over the network. Which is what the original idea (moving
computation closer to data) behind Hadoop/MapReduce.

~~~
qwertyuiop924
_rolls eyes_

The point is that if you've got such tiny quantities of data, why are you
storing it in a distributed manner, and why are you breaking out the 777 for a
trip around the racetrack? Grab the 777 when you need it, and take the Tesla
when you need the performance characteristics of a Tesla.

