
Hadoop “fails us” - cdvonstinkpot
https://www.datanami.com/2017/03/13/hadoop-failed-us-tech-experts-say/
======
75dvtwin
My observation is that traditional database technologies are transforming
themselves into 'hybrids' (as far as document-oriented data types go).

Postgres is an example. It now has JSONB as a document-oriented field type,
and Postgres XL as a horizontally scalable ACID database. By approximately
August it will have the ability to maintain views in memory (aka the
lambda-architecture speed layer) via the PipelineDB extension, for fast
streaming analytics.
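As a rough illustration of the hybrid document/relational idea, here is a sketch using Python's sqlite3 module, with SQLite's JSON functions standing in for Postgres JSONB (json_extract approximates Postgres's `payload->>'user'`; the table and fields are invented, and it assumes a SQLite build with the JSON functions, which is standard in recent Python releases):

```python
import json
import sqlite3

# In-memory SQLite as a stand-in for Postgres: json_extract() here plays
# the role that the ->> operator plays on a Postgres JSONB column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.execute(
    "INSERT INTO events (payload) VALUES (?)",
    (json.dumps({"user": "alice", "action": "login", "ms": 42}),),
)

# Query a field inside the document, relational-style.
user = conn.execute(
    "SELECT json_extract(payload, '$.user') FROM events"
).fetchone()[0]
print(user)
```

The appeal of the hybrid approach is exactly this: schemaless documents stored and indexed inside an ordinary transactional table.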

It seems that a combination of Postgres (with extensions) + Kafka + Redis is
a strong stack for lambda architecture, and an initial data hub component of
the overall puzzle.
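The lambda-architecture split that such a stack targets can be sketched in a few lines of plain Python. This is an illustrative toy (names and numbers are made up), with the batch view playing the role a durable store like Postgres would, and the speed view playing the role of Redis or PipelineDB continuous views:

```python
from collections import Counter

# Hypothetical historical event stream; Kafka would carry these in practice.
events = [("alice", 1), ("bob", 1), ("alice", 1)]

# Batch layer: a view precomputed over all historical events.
batch_view = Counter()
for user, n in events:
    batch_view[user] += n

# Speed layer: incremental counts for events that arrived after the
# last batch run.
speed_view = Counter({"alice": 1})

# Serving layer: merge the batch and speed views at query time.
def query(user):
    return batch_view[user] + speed_view[user]

print(query("alice"))  # batch count of 2 plus speed count of 1
```

The design point is that the batch view can be rebuilt from scratch at any time, while the speed layer only ever covers the small window since the last rebuild.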

Meanwhile, Spark (or even Python+Dask) can be viewed as a distributed data
analytics platform that replaces 'non-UI-centric' BI. I think UI-centric BI
(e.g. ad hoc reports/visualizations) is still going to be dominated within
enterprises by Tableau/QlikView-type solutions.

For traditional BI-oriented data marts (organized downstream from the data
hub), traditional column-oriented databases, and the new open source ones,
probably make sense.

To me, the promise of Hadoop being a silver bullet for 'all the big data
needs' was always nothing more than unsubstantiated hype.

So it definitely failed the ones who believed the hype, but it did not fail
the others who did not buy into it.

------
frik
Kafka relies on ZooKeeper, from the Hadoop ecosystem. ZooKeeper is not so great.

------
shmerl
What about competition like HPCC[1]?

[1]: https://en.wikipedia.org/wiki/HPCC

------
ghc
The article is on point about how we need better data infrastructure to
support data scientists and analysts. In the past I've worked to develop very
scalable data infrastructure to support data science workloads on high-variety
sensor data, but it always felt like the only reason we were doing it was that
nobody had developed tools made for companies like ours.

We can build better data infrastructure for data scientists, but in practice
it's hard to sell "10x easier to use" into organizations with hadoop or half-
broken bespoke infrastructure because the IT groups running the show don't
really care that they're making their data scientists miserable.

If change is to come, it's going to have to be from data scientists embedded
within business units demanding better tools, because hadoop works just fine
if you don't care how hard it is for your users to access their data.

~~~
sixdimensional
To be clear, many of the comments are from Bob Muglia, CEO of Snowflake
Computing, which is a cloud-based columnar data warehouse solution. It is
presumably much easier to implement than Hadoop, so you have to take those
comments with a grain of salt.

Also, those entities who the article claims have a grasp on Hadoop are some of
the main people who helped develop it - lest we forget that the early work on
Hadoop came out of Yahoo. "Web-scale" companies by their very nature have to
own this technology, because almost nobody else on earth did what they have
done on that scale (except maybe governments and scientists).

However, I agree the article is right on. As someone who has done a lot of
work with different Hadoop distributions (Cloudera, Hortonworks and MapR to
name a few), I can say that compared to the performance and ease of use of
more traditional data platforms, Hadoop is a different and more complex
creature. It has evolved quickly in some regards and slowly in others. It's
way easier to get a Cloudera system up and running today (for example) than it
was in the past.

"Hadoop" by itself is a misnomer; it's not like it's one technology. HDFS,
MapReduce (and MR2), Hive, HBase, Spark, Pig, Flume, etc. etc. - it's an
ecosystem. It's also a mashup of many things depending on what distro you're
talking about - too many things - a "data hub", an "operational data store", a
"pre-data warehouse staging", an ETL platform, a "search" platform, a data
governance platform, a data archiving solution, a real-time data query /
streaming data processing etc. etc. etc.

I think Hadoop did one important thing to our overall awareness, and that was
to think of a "data platform", "data processing", "data architecture" and
"data infrastructure" as a major, and "big" strategic part of an organization.
I think the rise of "big data" thinking along with the rise of the "chief data
officer", "data scientists" and "data engineers" and the like all goes
together - a shift in thinking we made where we realized we could do more with
more data (for better or worse).

There are many new and old technologies that help people implement the data
infrastructure in their organizations - not just Hadoop. Hadoop just happened
to be a major player at the epicenter of this (and open source as well), and
if anything, has spurred technological advancement in databases, NoSQL and
NewSQL and OldSQL alike, ETL, data mining, AI/machine learning, etc. So it has
played its role, whether it specifically stays relevant or not as we perceive
it.

------
nazilla12
Disclaimer : I work for a corporate IT consulting giant.

The trend I'm seeing in the big data sphere at my company is a by-and-large
move away from technologies that implement MapReduce and complicated batch
processing with HDFS as a data store. More and more customers want insight as
soon as they get/produce their data, so we've seen a particularly large
increase in interest in technologies such as Kafka/Samza and Spark/PySpark.

I see a trend toward Kafka, but I think the community needs to jump behind it
too and keep it as a pipeline tool, not a querying engine.

I don't see Hadoop-based solutions going away any time soon though.

~~~
slv77
How do these customers plan on doing historical analysis of raw data without
Hadoop?

~~~
agibsonccc
Spark still runs on HDFS. These guys just use Spark SQL instead of MapReduce
jobs. Spark is heavily used for batch workloads.

------
c3534l
> Hadoop is great if you’re a data scientist who knows how to code in
> MapReduce or Pig

I guess that kind of makes sense. What programmers like might not necessarily
be a good basis for the long term. I do have to say, though, that part of the
failure of Hadoop is that it generated a lot of hype, and so better tools and
alternatives were developed to meet that need. So when you're saying a better
alternative is Spark or Kafka, I feel like it's almost as if you're saying
"Oracle is a failure that never materialized the promised benefits. Instead we
should be using Postgres and MySQL."
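For readers who haven't written one, the MapReduce programming model the quoted line refers to can be sketched in plain Python. This is a toy in-process word count, not how a real cluster executes it, but the map/shuffle/reduce phases are the same shape:

```python
from collections import defaultdict

def map_phase(line):
    # Emit (key, value) pairs, one per word.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Combine all values seen for a key into one result.
    return key, sum(values)

lines = ["hadoop failed us", "hadoop works fine"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
print(counts["hadoop"])  # "hadoop" appears in both lines
```

Even for a problem this simple, the three-phase structure is more ceremony than a one-line SQL `GROUP BY`, which is a big part of why SQL-on-Hadoop engines caught on.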

But I also know that with a lot of big data hype, businesses were wanting to
do Hadoop and NoSQL and all this stuff because it was the cool new thing, not
that they actually needed it. I've heard data scientists make the joke that
every business thinks they need these tools because they're having difficulty
running their business out of a spreadsheet.

I think it's important to remember that for most businesses, Spark, Hive,
whatever; those aren't the right tools, either. SQL is still what most
companies need. Businesses want machine learning, but usually what they need
is boring old statistics. In an industry that always wants to be ahead of the
curve, we tend to forget that it's not always the right thing to have the
newest toys. Sometimes companies do well utilizing the latest and greatest, but
sometimes they just use what already exists wisely. I suspect that for many of
those companies that used Hadoop and felt it didn't work for them, the problem
wasn't that they needed Spark instead, it was they were trying to solve
problems that didn't exist. Man, I'm too young to sound this old. But, eh,
yeah, we need to respect our elders' technology first, and consider the newest
stuff only when we have a definable need for it.
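The "SQL plus boring old statistics" point above can be illustrated with nothing but the Python standard library; the table and figures here are invented for the example:

```python
import sqlite3
from statistics import mean, stdev

# A plain relational table covers most business reporting needs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("east", 140.0), ("west", 90.0), ("west", 110.0)],
)

# An ordinary SQL aggregate ...
east_avg = conn.execute(
    "SELECT AVG(amount) FROM sales WHERE region = 'east'"
).fetchone()[0]

# ... and boring descriptive statistics. No cluster required.
amounts = [row[0] for row in conn.execute("SELECT amount FROM sales")]
print(east_avg, mean(amounts), stdev(amounts))
```

When the whole dataset fits on one machine, which is true for most businesses, this kind of thing is both simpler and faster to operate than any distributed stack.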

~~~
xapata
You were on target with the comment that businesses switched to Hadoop because
they had trouble with spreadsheets. I've been a culprit in helping folks make
that transition. I warned them, but they insisted. Who am I to refuse?

They had heard Hadoop didn't require conforming to a schema and would allow
data-driven insights. What they failed to verify in advance was whether those
insights would increase profit.

I don't see how any other database-ish technology would change the situation.
It ain't the tech, or the analysis, it's the business processes.

------
darose
This really strikes me as more of a marketing piece by Snowflake than a
well-researched piece of reporting. The article mostly just quotes one
person, Bob Muglia, who is, as they say on Wall Street, "talking his book",
i.e. giving an opinion that is not coincidentally in line with his own
financial interests. Sure, Hadoop is getting old and is quickly being
replaced by Spark. But loads of organizations have used, and continue to use,
Hadoop/Spark successfully. And the part about Kafka replacing Hadoop/Spark is
just silly. They're completely different technologies, used for very
different purposes, and many organizations use both side by side.

~~~
MilnerRoute
The article also quotes Bobby Johnson, who helped run Facebook's Hadoop
cluster, as well as the creator of Kafka (who ran Hadoop clusters at
LinkedIn).

For what it's worth, all three of them seemed pretty down on Hadoop.

~~~
nstart
I think the parent is right though. Side topic, but having seen how PR pieces
are crafted, this feels like something that Snowflake put together and then
passed on to Datanami with a "we have a blog post we'd like you to publish"
type of mail. The claim is somewhat unsubstantiated, but everything about it
reeks of trying to drive the reader to discover Snowflake at the start, and
to think of it again at the end.

A quick search of "hadoop" against the Snowflake domain, and of the term
"hadoop" against the term "snowflake", keeps showing that Snowflake has a
definite target in mind, which is to convert Hadoop users, or people
evaluating Hadoop, into choosing them instead. They even have a webinar
specifically for that segment of people.

Even further searching of Alex Woodie and mentions of Snowflake shows
multiple articles with the CEO across multiple domains, including Datanami
and EnterpriseTech.

All that is circumstantial, but I'm exercising a healthy bit of skepticism
that this piece is pure research done by Alex Woodie. A little more
objectively, if I examine the "points" of the article, what I can see is:

Bob Muglia has never met a happy Hadoop customer. It mentions a couple of
things that might replace Hadoop in the future.

Bob Muglia has only seen a few customers who've tamed Hadoop.

Some discussions with and about Facebook's experience with Hadoop, painting
Hadoop as hard work from the outset.

More discussions with other tech folk (Kafka and DataTorrent). One is an
alternative of sorts, and the other again discusses the pain of Hadoop.

And then back to Bob Muglia and who his target customers are for Snowflake
("Hadoop refugees"), and his belief that we are in the valley of despair
regarding Hadoop.

Which brings us to the final mental point of the article: ditch Hadoop sooner
rather than later, and here are the alternatives, where the main one pushed
from start to end is Snowflake.

I apologise if this was too far off topic. I think the discussion of Hadoop's
validity, or how it's being used, is valid. I also believe it's healthy to
call out suspect stuff like this, because the core of the article itself
provides little to no critical value.

------
ktamura
Hadoop didn't fail us. We failed Hadoop.

The decline of Hadoop as a software category is Software Product Marketing
101: it did not identify pervasive killer use cases critical to running a
business. Yes, it's true that Hadoop was a revolutionary way to store and
process massive datasets on commodity hardware, but what's the use case for
that? If you are Visa/AMEX (fraud detection), Facebook/Google (various
ML-based data products), or a few other types of companies with obvious
applications of massive data processing, yes, Hadoop has been great.

But here's the thing: beyond a few such corner cases, it never found a use
case that the enterprise data warehouse couldn't handle.

Then came Redshift, then BigQuery, and now Snowflake (as a BigQuery on AWS,
really). While there are some key technical differences between Redshift and
BigQuery/Snowflake, they are all _much_ cheaper than the previous generation
of data warehouses (Vertica, Netezza, Greenplum, etc.). The lower price meant
greater access, and developers who previously couldn't imagine using data
warehouses could finally spin one up with a credit card swipe.

Hadoop, too, took a lot of collateral damage because many developers realized
that they didn't need much of Hadoop beyond SQL-on-Hadoop.

Redshift was a beautiful feat of product strategy and marketing: They just
took what used to cost a lot and offered it for much less in an environment
where developers already had a lot of data (AWS). This was much simpler to
execute than what Hadoop had to do: introduce new technology, identify use
cases, and finally compete with incumbent solutions.

We failed Hadoop (as you can see from www.cloudera.com, even Cloudera, the
Hadoop company, hardly mentions Hadoop on its top page). Not the other way
around.

~~~
vgt
Minor correction: unless you consider ParAccel a part of Redshift (you
probably should), BigQuery's GA precedes the Redshift release by around a
year, and Dremel by at least six.

------
samspenc
I worked on Hadoop and HBase extensively from 2011 to 2013, working on engines
processing 30 billion raw data points a month and storing a subset of those,
and then we migrated to other Big Data technologies. Just wanted to add my
thoughts here.

Hadoop (and its general ecosystem, which includes HBase), is a fairly good
idea. Its core ideas - map/reduce on Hadoop, and a large distributed key/value
store for HBase - are actually pretty solid.

And for many years, there were simply no alternatives to Hadoop. Think of the
years 2008 to 2012/13. If you had to process terabytes or petabytes of data,
what were your solutions? No wonder Yahoo and Facebook (and others) put in so
much effort into their Hadoop solutions.

But, IMHO, there were several issues with Hadoop and their ilk.

1. The core infrastructure wasn't stable enough. Hadoop / HBase were supposed
to be distributed systems, and they worked well, but small failures could
bring down your entire cluster. Given that Hadoop and HBase were being used in
mission-critical systems in the cloud, and given the amount of DevOps or sys-
admin work that went into maintaining these, I'm not surprised people
eventually migrated to distributed systems that were easier to maintain and
run.

2. There are now plenty of "hosted on the cloud" solutions such as Amazon
DynamoDB or similar cloud solutions. When your company depends on 99.99% or
similar SLAs, you don't want to have downtime on your database systems and
spend time debugging complicated core dumps on your Hadoop or HBase clusters
when you can just store it "on the cloud" and be done with it. Sure, there's a
higher price point, but those are the trade-offs you live with.

3. If you want to be in-house, there are plenty of alternatives out there as
well today. Apache Spark for processing, Kafka for a messaging bus / streaming
data, ElasticSearch for large scale storage, with multiple indices. Many of
them are much more robust than Apache Hadoop / HBase, and I'm not surprised
they've gotten more traction recently.

Ultimately, I think Hadoop / HBase are just showing their age. They were
fantastic for the first wave of Big Data technologies, and you had little
alternative if you were building large-scale systems circa 2008 to 2013, but
now, you just have a plethora of choices from various vendors.

~~~
sixdimensional
Just curious, what did you migrate to?

I think the comment "first wave of Big Data technologies" is interesting. I
agree with it, but also, the term "big data" is relevant to a point in time, I
think. I mean, VLDB was the term before "big data". I think the term "big
data" became popular when technology such as Hadoop became more mainstream
and available instead of just being used in special cases.

I was working with terabytes of data in SQL Server prior to 2008 (around
2004/2005 - which was doable but pushing some limits) and then got into Hadoop
etc. afterwards. It was interesting to reach the limits of a tech we were
using and be there for the transition to Hadoop etc.

There were expensive commercial alternatives - MPP databases like Teradata,
Netezza or Vertica - or things like scientific computing platforms (which were
very limited to scientific computing scenarios). But nothing was as
mainstream and available as Hadoop became. And yet, Hadoop never really
completely beat the traditional MPP vendors until you started getting engines
like Impala, Presto, Spark, etc. that made SQL on top of Hadoop easier. And
that took quite a while; in fact, I'd say making SQL on top of Hadoop work
well (among many other things) is probably still ongoing in a way. We are
much, much further along with that now than we were.

On the other hand, it was interesting to see the new opportunities develop for
working with data other than just SQL - straight MapReduce, machine learning,
graph processing, DAG, etc.

Will be interesting to see where we go from here. I still think there is a lot
going on.

~~~
samspenc
We were running heavily on Amazon AWS, and I changed jobs shortly before the
migration, but I think they moved to a variety of Amazon solutions (Kinesis,
Redshift, etc.). I'm not sure what they are running on now.

To be honest, Hadoop on AWS through Elastic MapReduce (EMR) actually served us
really well. The cluster downtime we had in our early days was all "human
error" (configuration and Map/Reduce programming mistakes). Once we fixed
those, EMR served us like a champ.

The bigger issues we had were with HBase, actually, particularly during its
region splitting. It was great for storing massive amounts of data that you
could store and query fast (with the right key design), but we weren't able
to find people skilled enough in HBase ops to make us comfortable with
continuing to use it.
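The "right key design" point deserves a concrete sketch. HBase stores rows sorted by key, so monotonically increasing keys (timestamps, sequence IDs) hammer one region at a time; a common mitigation is to prefix keys with a salt so writes spread across regions. The names and bucket count below are illustrative, not from any real schema:

```python
import hashlib

NUM_BUCKETS = 8  # roughly, the number of regions you want writes spread over

def salted_key(user_id, timestamp):
    # A deterministic salt derived from the user id keeps all of one
    # user's rows together (so range scans per user still work) while
    # distributing different users across buckets/regions.
    salt = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % NUM_BUCKETS
    return f"{salt:02d}|{user_id}|{timestamp}"

key = salted_key("alice", 1489400000)
bucket = int(key.split("|")[0])
print(key, bucket)
```

The trade-off is that a full time-range scan now has to fan out one scan per bucket, which is exactly the kind of operational subtlety that made skilled HBase people hard to find.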

I have seen Teradata, Netezza and Vertica being used heavily at that time
(circa 2010-2013) but never worked on them directly. Like you said, they are
extremely expensive, and I can imagine why some of the larger installations
were trying to move to alternatives.

At the end of the day, I don't have any perfect solutions to recommend in
place of Hadoop or HBase. I think there are a variety of cloud-based
solutions from Amazon, Microsoft and Google, as well as hybrid open-source
solutions that do the job, but to each his own!

