Ask HN: Why is there nobody talking about hadoop anymore? - qlk1123
======
alexnewman
I was an early employee at Cloudera (first year, first dozen). I also
contributed to most of the Hadoop projects and had a commit bit.

- Hadoop was software done wrong. The tests took days to run and never
  passed. It was documented horribly.
- The open source companies all kneecapped each other.
- Hadoop destroyed Apache's rep.

Cloudera taught me a lot about how not to build a team and a company. Although
Mike Olson was the greatest CEO I ever worked for (as opposed to founding a
company with), the rest of the company became horribly political. We brought in
managers from Oracle. We had crazy personal projects that went out of control,
like Kudu and Impala. It breaks my heart.

I ended up talking to one of the original investors afterwards and found out
Cloudera was a boy band startup. By the time I was on the way out, the CTO was
playing Counter-Strike every day as I was heading home.

To this day there are some amazing engineers there, and I just don't
understand why.

~~~
qlk1123
Thanks for sharing!

> cloudera taught me a lot about how not to build a team and a company. ...

It would be interesting if you are willing to share more about this. Are
managers from Oracle play the reason? Did they bring you unnecessary
disciplines against engineers? Any examples why you describe cloudera then
"horribly political"?

~~~
alexnewman
"Are managers from Oracle play the reason?" -> I think bringing in a bunch of
managers from bigco to shape up your startup is nearly as bad as bringing in
"experienced engineers" because you have too many young engineers. Diversity
is strength, but so is culture.

"Did they bring you unnecessary disciplines against engineers" -> What's this
mean?

Any examples why you describe Cloudera then "horribly political" -> Not off
the top of my head, actually. I just remember it burning me.

------
mindcrime
It's been around long enough to not be a "buzzword" anymore. Now it's faded
into the background as just something that everybody is using (even if all
they use is HDFS). It's like nearly every other "once trendy, now commonplace"
technology that rolls through the IT landscape. It's the "darling of the day"
for a while, and then a new shiny comes along and becomes the new darling.
Meanwhile the Hadoop clusters still keep chugging along doing their jobs, just
like the COBOL code, MVS systems, iSeries boxes, JBoss servers, CORBA brokers,
Beowulf clusters, or whatever else used to be trendy and now isn't.

Of course sometimes things fade away just because they are obviated by
something clearly superior. For the map/reduce part of Hadoop that may well be
Spark. But even now, in my experience, most Spark setups are running on HDFS
and using YARN. _shrug_

------
shoo
[http://veekaybee.github.io/2017/03/20/hadoop-or-laptop/](http://veekaybee.github.io/2017/03/20/hadoop-or-laptop/)

Maybe we've also gotten over the "big data" hype wave, and there's more
understanding that in more cases it might be cheaper & more performant to
spend the budget on a single node with lots of RAM rather than standing up and
maintaining a distributed system.

------
sedocmiv
[https://www.youtube.com/watch?v=Y6Ev8GIlbxc&t=28m15s](https://www.youtube.com/watch?v=Y6Ev8GIlbxc&t=28m15s)

Just saw this, and it might explain why Hadoop is out of the spotlight. In
summary, Spark and Kafka seem to be better? I'm not sure, as I'm just starting
to enter this field.

~~~
nathairtras
Getting this out of the way first, I've only started exploring non-Hadoop/non-
HDFS Spark execution beyond some limited Amazon EMR work, but I'm interested
in learning more about it. What follows is a combination of work experience
and armchair research in the evenings. But I'm not claiming to be an expert.

Have grown to really appreciate Spark in the Hadoop space. Started with plans
to go with Impala, then went to Hive due to stability concerns, and finally to
Spark due to speed / flexibility. You can write code against a data frame, or
write Spark SQL, so you still have SQL.

HDFS has benefits over other storage approaches: if you are running Spark in
the same cluster, you get data proximity. You can go with a different storage
back-end, but that costs performance. "Performance of multiple query
and enrichment jobs concurrently executed resulted in 90% longer execution
times." [https://redhatstorage.redhat.com/2018/06/25/why-spark-on-
cep...](https://redhatstorage.redhat.com/2018/06/25/why-spark-on-ceph-
part-1-of-3/)

Unless you really have BIG data, you're taking on a lot of maintenance
overhead to support a cluster when you may do just fine without one.

Haven't had the freedom to explore other possibilities until recently, very
interested in how Spark on k8s is working out. (Same comment could be made
here as above and elsewhere - do you really need k8s? But I want to play with
k8s and learn more about it, so... for that purpose I 'need' it.)
[https://spark.apache.org/docs/latest/running-on-
kubernetes.h...](https://spark.apache.org/docs/latest/running-on-
kubernetes.html)

And there's always the cloud route. You can run an EMR job that uses files in
S3. There is a cost, but you do not need to support a cluster in the same way.
Or if you're feeling adventurous, use Lambda.
[https://www.qubole.com/blog/spark-on-aws-lambda/](https://www.qubole.com/blog/spark-on-aws-lambda/)

And Spark isn't the only option. Have started learning about Dask, which also
looks very interesting for performing some of the same tasks.
[https://dask.org](https://dask.org)

