
The Hadoop Ecosystem Table - jmngomes
https://hadoopecosystemtable.github.io/
======
jonathan_mace
For what it's worth, I started putting together a visualization of big data
systems and how they interact. There are so many systems that it's difficult
to get a grasp on how they relate to each other. I got distracted with more
important things, so it's only partially complete.

[http://jonathanmace.github.io/bigdatasurvey/](http://jonathanmace.github.io/bigdatasurvey/)

------
bitcointicker
My recommendations...

For automated cluster building -
[https://ambari.apache.org/](https://ambari.apache.org/)

For analysing your data, dynamically building queries and sharing this with
other people in your company -
[https://zeppelin.incubator.apache.org/](https://zeppelin.incubator.apache.org/)

And coming soon - [https://www.zeppelinhub.com/](https://www.zeppelinhub.com/)

~~~
TallGuyShort
I'm especially excited about Zeppelin. Using IPython for SciPy and smaller
datasets is great. I would love for the big data space I work in and
Python's tooling to come together more.

~~~
nl
IPython/Jupyter works well against Spark. We have it working in production
like that, and both Google[1] and IBM[2] do the same.

[1]
[https://cloud.google.com/datalab/overview](https://cloud.google.com/datalab/overview)

[2]
[https://www.ng.bluemix.net/docs/services/AnalyticsforApacheS...](https://www.ng.bluemix.net/docs/services/AnalyticsforApacheSpark/index.html)

------
sciurus
Some of the entries in the table (e.g. Redis) seem to have nothing to do with
Hadoop.

~~~
threeseed
Redis has been used by at least a few people using Hadoop.

We've used it for caching intermediate results during a Spark job.
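
For illustration only, here is a minimal Python sketch of what caching an
intermediate result by key might look like. It is not the setup described
above: the Spark side is left out, `InMemoryStore`, `cache_key` and
`cached_stage` are hypothetical names, and the store is an in-memory stand-in
that only mimics the `get`/`set` calls a Redis client offers.

```python
import hashlib
import json


class InMemoryStore:
    """Hypothetical stand-in exposing only get/set, like a Redis client."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        # Returns None on a miss, mirroring how a Redis GET behaves.
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


def cache_key(stage, params):
    """Build a deterministic key from a stage name and its parameters."""
    payload = json.dumps(params, sort_keys=True).encode("utf-8")
    return f"{stage}:{hashlib.sha256(payload).hexdigest()}"


def cached_stage(store, stage, params, compute):
    """Return the cached result for (stage, params), computing it on a miss."""
    key = cache_key(stage, params)
    hit = store.get(key)
    if hit is not None:
        return json.loads(hit)
    result = compute(params)
    # Results are stored as JSON text so any string-valued store can hold them.
    store.set(key, json.dumps(result))
    return result
```

Swapping the stand-in for a real client should only require an object with
compatible `get` and `set` methods; expiry and invalidation are deliberately
left out of the sketch.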

~~~
sciurus
Sure. But you could have chosen _any_ datastore to cache your results in,
right? Redis doesn't integrate with Hadoop in any special way. Some datastores
do. For example, it does make sense to say Cassandra is a part of the Hadoop
ecosystem, due to the features in
[https://wiki.apache.org/cassandra/HadoopSupport](https://wiki.apache.org/cassandra/HadoopSupport)

------
virmundi
I don't understand why Cascading is missing. It's by far one of the easiest batch
flow controllers on the platform. You can test it in memory locally. When you
deploy to a real cluster, you know it will just work.

~~~
mtanski
I don't agree with the last statement. Based on experience with Hadoop (over
5+ years now) running locally is a poor indicator of running on the cluster.
Many sleepless nights have been spent trying to figure out why the job that
runs locally doesn't want to run on Hadoop.

I do like cascading and scalding tho. Only so many times you want to implement
job flow, filters and joins by hand in life.

~~~
virmundi
Maybe the statement could be a bit hyperbolic, but it could be tested. I
tested large complex flows locally, 59 steps, within JUnit. These tests ran
with every build. So the whole build took 6 minutes for 75 fraud models, but I
could easily focus on just my unit in Eclipse.

I haven't tried PigUnit for a while, but last time I did, it didn't support
macros and took minutes for what would take seconds in Cascading.

It's this difference that's cemented in my mind that Cascading is for
repeatable processes while Pig is for probing and experimentation. This is
not to say you can't reuse Pig scripts. I mean that I have greater confidence
in the things I can create repeated tests for.

------
Ianvdl
The ecosystem has grown so large that it is nearly impossible for anyone to
have any meaningful experience with all of it. Not that it's a bad thing
though, choice is always good.

------
cjp222
Trafodion is Apache Trafodion (incubating), providing a fully distributed
transactional ANSI SQL on top of HBase for OLTP and operational workloads. The
link is incorrect as well. Instead use the Apache link:
[http://trafodion.apache.org](http://trafodion.apache.org) .

------
mziel
Nice list, but Spark is treated superficially. Also extremely outdated
(Shark, Bagel).

SparkSQL should be in the SQL-on-Hadoop section. MLlib+ML should be in Machine
Learning section. If we include Storm and Giraph, we should include
SparkStreaming and GraphX.

------
fauria
For database comparisons, I really like Kristof Kovacs's page:
[http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis](http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis)

------
vonnik
Great table. Under machine learning, it should include
[http://deeplearning4j.org](http://deeplearning4j.org). (Co-creator here.) We
run on Hadoop and Spark.

------
RRRA
You might want to look at [http://db-engines.com](http://db-engines.com) and
you'll then have plenty more DBs to cover!

------
rubidium
If the author is present, I recommend putting a clickable TOC at the top that
takes you to the relevant section.

------
melted
Too bad most of it is in Java.

~~~
brianwawok
Says something about Java aye?

~~~
melted
Turd of a language, but very popular.

