
A New Analytics Toolbox with Apache Spark – Going Beyond Hadoop - scalemeblue
http://planetcassandra.org/blog/the-new-analytics-toolbox-with-apache-spark-going-beyond-hadoop/
======
monstrado
Saying things like "Going beyond Hadoop" is very misleading. Virtually all of
the Hadoop vendors out there, whether it's Cloudera, Hortonworks, or MapR,
commercially support Spark as a computation framework for Hadoop, and some
already have customers using Spark on HDFS to power mission-critical
applications. People still pretend that Hadoop is some batch-oriented system
with a distributed file system, but they couldn't be further from the truth.
Hadoop is a movement, an evolution of how data is analyzed in today's world.

The fact is, the vast majority of people who use Spark (or will use Spark)
will rely on Hadoop for a lot of underlying technologies, such as YARN
(resource management) or HDFS (distributed filesystem / in-memory caching).
Further, if you think Spark is somehow the "end-all, be-all" computation
framework, you're living in a fantasy world. The best part of Hadoop is that,
depending on your use case, you can bring a multitude of applications to your
data, whether it's Spark, Tez, MapReduce, HBase, Impala, Drill, Presto, Tajo,
Accumulo, ...the list goes on, and continues to evolve. Spark is in no way
replacing Hadoop; it's only strengthening it.

~~~
rs_atl
You have a point, but it's also a fact that lots of people use "Hadoop"
interchangeably with "MapReduce". And Spark can in fact replace the Hadoop
infrastructure entirely, as it's not a component of that ecosystem. The fact
that the various Hadoop vendors also support Spark only validates the point
that there's a need to "go beyond Hadoop".

~~~
monstrado
Just because people use Hadoop and MapReduce interchangeably doesn't make it
correct. I would love to hear how you think Spark can replace Hadoop, because
that is an astonishingly inaccurate statement. Which part of Spark reliably
distributes data? Which part of Spark handles enterprise-level security? Which
part of Spark can coordinate resources in multi-tenant environments? The
answer is none of them; it relies on Hadoop for all of that.

~~~
x0x0
Um, you sound like a vendor. Hadoop does mean MapReduce + HDFS in the common
usage of the majority of devs and admins. Claiming Hadoop is now some
distribution of tools is fine, but that's simply not what the common usage is.

It remains to be seen whether YARN will carry the day or not; my suspicion is
that many people are essentially going to be running Spark on HDFS. I don't
see much use for YARN unless you need to balance Hadoop and Spark workloads,
and weren't there claims that YARN was going to support e.g. MPI-style
computation that didn't pan out?

~~~
colin_mccabe
_Um, you sound like a vendor. Hadoop does mean MapReduce + HDFS in the common
usage of the majority of devs and admins. Claiming Hadoop is now some
distribution of tools is fine, but that's simply not what the common usage
is._

I am a Hadoop developer, and I can tell you that Hadoop does not mean "map-
reduce + hdfs". That's also not what people are installing when they install
Cloudera's distribution of Hadoop, Hortonworks' distribution of Hadoop, or
even Intel's distribution of Hadoop (which is being discontinued in favor of
adopting Cloudera's). This is more old information from 2008, being replayed
as current. YARN even lives in the Hadoop source code repository; it's hard to
get more "Hadoop" than that. Spark has its own repo, but it uses many classes
from Hadoop, like InputFormat.
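
For example, here's a minimal sketch of reading data through a stock Hadoop
InputFormat from Spark (the path and app name are made up):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

val sc = new SparkContext(new SparkConf().setAppName("inputformat-demo"))

// Read through Hadoop's TextInputFormat: keys are byte offsets
// (LongWritable), values are lines (Text), exactly as in MapReduce.
val records = sc.hadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///data/events.log")
val lines = records.map { case (_, text) => text.toString }
println(lines.count())
```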

 _It remains to be seen whether YARN will carry the day or not; my suspicion
is that many people are essentially going to be running Spark on HDFS. I don't
see much use for YARN unless you need to balance Hadoop and Spark workloads,
and weren't there claims that YARN was going to support e.g. MPI-style
computation that didn't pan out?_

Much confusion. Much sadness.

You run YARN (or its close competitor, Mesos) because you want to have
multiple jobs going on in the same cluster at once. You need things like per-
user queues, job control, reserving CPU and memory resources. The jobs going
on at once may be multiple MapReduce jobs, or they may be multiple Spark jobs.
Even Databricks, which employs many of the early Spark developers, doesn't
ship a product that runs Spark in standalone mode. They run on Mesos.
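
As a concrete sketch, submitting a Spark app into a named YARN queue looks
roughly like this (the queue and app names, and the resource numbers, are
hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Multi-tenant setup: the job lands in a named YARN queue with capped
// executor resources, instead of grabbing the whole cluster.
val conf = new SparkConf()
  .setAppName("nightly-etl")              // made-up app name
  .setMaster("yarn-client")               // schedule via YARN, not standalone
  .set("spark.yarn.queue", "analytics")   // made-up per-team queue
  .set("spark.executor.instances", "20")
  .set("spark.executor.memory", "4g")
val sc = new SparkContext(conf)
```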

------
capkutay
"But you don’t have to install Hadoop at all, unless you decide to use HDFS as
Spark’s distributed file system. If you don’t, you can choose one of the other
DFS’s that it supports"

What other file systems does Spark support? And how seamless is that
integration? I know that with Spark, I can write a Spark SQL query that goes
directly against data persisted in HDFS. Should I expect the same behavior if
I swap out HDFS for some other DFS?

~~~
tupshin
Spark doesn't depend on much of anything from the Hadoop ecosystem, including
HDFS. It supports S3, as well as NFS or any other locally mountable filesystem
right out of the box. The FAQ talks a bunch about Spark's weak ties to Hadoop.
[http://spark.apache.org/faq.html](http://spark.apache.org/faq.html)
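
For example (the paths and bucket names are made up; S3 also needs AWS
credentials wired into the Hadoop configuration):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("fs-demo"))

// Same RDD API, different storage backends.
val fromHdfs  = sc.textFile("hdfs://namenode:8020/logs/part-*")
val fromS3    = sc.textFile("s3n://some-bucket/logs/")   // needs AWS creds in the Hadoop conf
val fromLocal = sc.textFile("file:///mnt/nfs/logs/")     // NFS or any mounted filesystem

println(fromHdfs.count() + fromS3.count() + fromLocal.count())
```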

------
mrbonner
"All in-memory".. yeah and I have 1TB data file to process. How is that going
to help?

~~~
tupshin
1) Buy a box with 1 TB of RAM. Very doable these days, albeit still a bit
pricey.

2) Scale out. Spark can easily handle hundreds of nodes in a single cluster,
and the aggregate RAM across all of them can be used.

3) Cache intermediate and/or hot data sets, as opposed to the entire data set;
see the sketch below.
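
A minimal sketch of #3, with a made-up path and filter:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("cache-demo"))

// Keep only the hot, post-filter subset in memory, not the raw 1 TB input;
// partitions that don't fit spill to disk instead of failing.
val raw = sc.textFile("hdfs:///data/events")
val hot = raw.filter(_.contains("ERROR"))
             .persist(StorageLevel.MEMORY_AND_DISK)

// Only the first action scans the full input; later ones hit the cache.
println(hot.count())
println(hot.map(_.split(" ")(0)).distinct().count())
```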

~~~
ironchef
For #2 and #3, see here: [http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence](http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence)

We typically do #2 at our company and it's been fine so far. The bigger issue
isn't a single 1 TB data set; it's multiple large data sets, since one then
must handle the data shuffles during joins, etc. The ability to keep the RDD
in memory through the operations still tends to beat normal long-winded Hadoop
operations anyway...
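
Roughly, the pattern looks like this (paths and schemas are made up; a sketch,
not our actual pipeline): persist the side of the join that gets reused, so
it's computed once rather than re-scanned per join.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair-RDD implicits (join, etc.)

val sc = new SparkContext(new SparkConf().setAppName("join-demo"))

val users = sc.textFile("hdfs:///data/users.tsv")
  .map(_.split("\t")).map(f => (f(0), f(1)))   // (userId, country)
  .persist()                                   // reused by both joins below

val clicks = sc.textFile("hdfs:///data/clicks.tsv")
  .map(_.split("\t")).map(f => (f(0), f(1)))   // (userId, url)
val buys = sc.textFile("hdfs:///data/buys.tsv")
  .map(_.split("\t")).map(f => (f(0), f(1)))   // (userId, item)

// The join shuffles are unavoidable, but `users` is computed and cached
// once rather than re-read from disk for each join.
val clicksByCountry = clicks.join(users).values   // (url, country)
val buysByCountry   = buys.join(users).values     // (item, country)
println(clicksByCountry.count() + buysByCountry.count())
```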

~~~
rs_atl
Long-winded in more ways than one. I would still use Spark even if it were
slower than Hadoop, just to get a sane API.

