
What Happened to Hadoop - rspivak
https://architecht.io/what-happened-to-hadoop-211aa52a297
======
PaulHoule
Oh man.

The sharks have been circling around HDFS for a long time. Content marketers
have a growing list of reasons why you should use something else, but their
arguments don't ring true.

(Mainly because the "something else" doesn't really work; MapR pushed this for
years and finally went under.)

One of them is that HDFS is wasteful because it replicates everything three
times.

Maybe that's true, but erasure-coded systems don't perform as well under
degraded conditions as HDFS does. Back when I was running a three-machine test
cluster in my house, I ran multiple jobs before I realized that two of the
machines were turned off.

A large cluster has more parts that can fail and so is often in a degraded
condition -- and HDFS functions well under degraded conditions.
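
To put a rough number on the "wasteful" claim: triple replication stores about
3 bytes on disk per logical byte, while a Reed-Solomon 6+3 layout (which I
understand to be the default erasure-coding policy in Hadoop 3) stores about
1.5. A back-of-the-envelope sketch, assuming those parameters:

    # Rough storage-overhead comparison: 3x replication vs. RS(6,3) erasure coding.
    # Assumes a Reed-Solomon 6 data + 3 parity layout; illustrative only.

    def replication_overhead(replicas=3):
        # Bytes stored on disk per logical byte with N-way replication.
        return float(replicas)

    def erasure_overhead(data_units=6, parity_units=3):
        # Bytes stored per logical byte with a Reed-Solomon (data, parity) layout.
        return (data_units + parity_units) / data_units

    logical_tb = 1000  # say, 1 PB of logical data
    print(f"3x replication: {logical_tb * replication_overhead():.0f} TB raw")
    print(f"RS(6,3):        {logical_tb * erasure_overhead():.0f} TB raw")

The disk savings are real; the catch is that reading around a missing block
means reconstructing it from the surviving cells, which is exactly the
degraded-mode cost described above.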

The newer argument is that CPU utilization is low on HDFS clusters. That's
true, but easy to overstate, because low CPU utilization helps with
responsiveness when you are running ad-hoc jobs. You could do better than the
average HDFS cluster, but you can easily do worse. Certainly the 100%
utilization that bean counters think they can accomplish leads to infinite
latency and 0% goodput.
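
The latency point falls straight out of basic queueing math: response time
scales roughly like 1/(1 - utilization). A toy sketch, assuming a single-server
M/M/1 model purely for illustration:

    # Why pushing utilization toward 100% destroys latency.
    # Toy M/M/1 model: mean response time = service_time / (1 - utilization).

    def mean_response_time(service_time_s, utilization):
        if utilization >= 1.0:
            return float("inf")  # the queue never drains
        return service_time_s / (1.0 - utilization)

    for rho in (0.30, 0.50, 0.80, 0.95, 0.99, 1.00):
        t = mean_response_time(1.0, rho)
        print(f"utilization {rho:4.0%} -> mean response {t:>6.1f}x the unloaded service time")

At 50% utilization an ad-hoc job waits about twice its service time; at 99% it
waits a hundred times that; at 100% the queue never drains. That is why a
cluster sized for "full utilization" is useless for interactive work.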

The article above suggests S3 as a real alternative, but I got laughed out of
the room by a Gartner analyst years ago (rightly so) for suggesting exactly that,
because S3 is crazy expensive. At $100 a month per TB, you can afford to keep
moderate amounts of valuable data, but if your data is truly big you will go
bankrupt.
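
Taking that $100-per-TB-month figure at face value (and ignoring request and
egress charges), the scaling is what kills you; a quick sketch:

    # Back-of-the-envelope object-storage bill at the rate quoted above.
    # Assumes a flat $100 per TB per month; ignores request and egress costs.

    RATE_PER_TB_MONTH = 100.0

    for tb in (10, 100, 1_000, 10_000):  # 10 TB up to 10 PB
        monthly = tb * RATE_PER_TB_MONTH
        print(f"{tb:>6} TB -> ${monthly:>12,.0f}/month (${monthly * 12:,.0f}/year)")

A petabyte at that rate runs on the order of $1.2M a year, which is where
buying commodity disks and eating the 3x replication overhead starts to look
cheap by comparison.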

It's true that the MapReduce engine is awful for anything that doesn't reduce
to a single map pass and a single reduce pass, so Spark and other compute
fabrics have replaced it, but HDFS is still the champion for big data storage.
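
As a concrete contrast, here is a hypothetical multi-stage pipeline in PySpark;
classic MapReduce would have to express this as several chained jobs, each
writing its intermediate output back to HDFS between stages. Paths and column
names are made up:

    # Hypothetical PySpark sketch of a multi-stage pipeline that classic
    # MapReduce handles badly. Input path and columns are invented for the example.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("multi-stage-example").getOrCreate()

    events = spark.read.parquet("hdfs:///data/events")  # assumed input layout

    # Filter, aggregate, then re-aggregate -- one job; Spark plans the shuffles
    # itself instead of forcing a separate map+reduce pair per stage.
    daily = (
        events
        .filter(F.col("status") == "ok")
        .groupBy("user_id", "day")
        .agg(F.count("*").alias("events_per_day"))
    )
    top_users = (
        daily
        .groupBy("user_id")
        .agg(F.avg("events_per_day").alias("avg_daily"))
        .orderBy(F.desc("avg_daily"))
        .limit(100)
    )
    top_users.write.mode("overwrite").parquet("hdfs:///out/top_users")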

