Cloudera and Hortonworks merger means Hadoop’s influence is declining (venturebeat.com)
33 points by wenc on Oct 8, 2018 | 9 comments



The move makes sense. Cloudera has cash on hand and Hortonworks has excellent technical chops and the vision to match. The offering is not Hadoop only: Druid can be deployed with ease and Superset can hook into it nicely, all with proper metrics captured. Ambari is a wonderful single pane of glass to manage it all. YARN is really good at managing capacity by leveraging cgroups. Spark is one thing, but far from the only thing, e.g. Flink. Hadoop 3 introduces Docker containers right into YARN. It is all driven by JVM components, so security actually works. For all its limitations, HDFS has some pretty cool powers, and HBase is a beast for a couple of use cases. It is a versatile platform and is evolving well. Of course, the learning curve is pretty steep, but the payoffs are huge.


This article makes it sound like Spark needs to be run on Hadoop clusters, and that's not the case. Spark can run against object stores like AWS S3 and Azure Blob Storage.
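
Roughly, reading straight from an object store looks like this (the bucket, path, and column name are made up, and you'd need the hadoop-aws/s3a connector jars on the classpath):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3-example").getOrCreate()

    # Read Parquet directly from object storage -- no HDFS involved.
    df = spark.read.parquet("s3a://my-bucket/events/2018/10/")
    df.groupBy("event_type").count().show()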

I also don't agree with the author's assertion that Spark is "Scala centric". Yes, Spark is written in Scala, but PySpark is definitely a first class citizen. Databricks also maintains the MLflow project, which makes it easy to use Python with Spark: https://databricks.com/blog/2018/06/05/introducing-mlflow-an...
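
For a feel of the Python-side workflow, MLflow's tracking API is a couple of lines (the parameter and metric here are hypothetical):

    import mlflow

    # Record a run's hyperparameters and results to the tracking server.
    with mlflow.start_run():
        mlflow.log_param("max_depth", 5)
        mlflow.log_metric("rmse", 0.27)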


> I also don't agree with the author's assertion that Spark is "Scala centric". Yes, Spark is written in Scala, but PySpark is definitely a first class citizen.

To be fair, prior to Spark DataFrames (i.e. in the days of pure RDDs), the only way to get performance out of Spark was to write Scala code. The serialization overhead of PySpark precluded it from large-scale data engineering workloads, and most companies rewrote their PySpark code in Scala for production.
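
A toy sketch of the difference (column names made up): an RDD map with a Python lambda ships every record out to Python worker processes and back, while the DataFrame version is a Catalyst expression that stays in the JVM:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(10**6).withColumnRenamed("id", "amount")

    # RDD + Python lambda: every row is serialized out to Python and back.
    doubled_rdd = df.rdd.map(lambda row: row.amount * 2)

    # DataFrame expression: compiled by Catalyst, runs entirely in the JVM.
    doubled_df = df.withColumn("doubled", F.col("amount") * 2)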

Now that we have Spark DataFrames, PySpark performance is mostly on par with Scala Spark for many SQL-amenable operations. And with Apache Arrow in-memory support on Spark 2.3+, the Python serialization overhead problem largely goes away.
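
Concretely, this is just a config flag on 2.3+ (a sketch; it defaults to off and needs pyarrow installed):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    spark.conf.set("spark.sql.execution.arrow.enabled", "true")

    # With Arrow on, the JVM -> Python transfer happens in columnar batches
    # instead of pickling row by row.
    pdf = spark.range(10**6).toPandas()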

But Spark is still Scala-centric to some extent. The documentation is trilingual, but there is still a distinct Scala-first culture.


Thank goodness Hadoop's influence is declining. Hadoop has mostly been about one-time transformations on large machines. It doesn't exactly give you an efficient way to do a process, nor does it make doing one easy (the amount of specialty integration your app needs is a bit large). The move to Spark was needed, but it's still behind the times with its "micro-batching".


To me, Hadoop was primarily designed around large-scale semi-structured data like logs; this plays to its strengths of scale and schema-on-read.

If I'm not mistaken, Hadoop was created at Google to handle weblogs (side note: in 2014, Google announced that Hadoop was no longer being used internally).

Enterprise data, however, is primarily structured and relational, which is better suited to a database-like system. Hadoop was never designed for that use case; scalable cloud databases like Redshift, Aurora, etc. always seemed a better fit. Cloudera created technology like Impala and Kudu to address this, but I'm not sure about the uptake there.


No, Hadoop was started at Yahoo! by Doug Cutting, the original author of Lucene.

Hadoop was based on Google's published papers on the Google File System (GFS) and MapReduce. The later project, HBase, was a direct carbon copy of Google's BigTable paper.

By the time Hadoop reached maturity, Google had mostly moved on to newer technologies such as Colossus and Megastore, and it mostly doesn't use MapReduce anymore.

To my knowledge, Google has never used Hadoop internally, although you can lease a hosted version of Hadoop on Google Cloud Platform.


You're quite right. It was GFS and MapReduce that were invented by Google, and Doug Cutting based Hadoop on those technologies.


On-demand cloud compute and storage killed the Hadoop beast. Python/C++ tooling is slowly winning market share back, a throwback to the HPC glory days when MPI was king.

Once the IBM mainframe, the king of CAP, is put into a major AWS/MS/GCP data center, expect them to gobble up Cloudera. Or Principal Corporation goes nuts and starts taking on Guidewire.


> The deal signifies that the Hadoop market could no longer sustain two big competitors.

MapR is still out there.



