
Using Spark and Zeppelin to Process Big Data on Kubernetes 1.2 - TheIronYuppie
http://blog.kubernetes.io/2016/03/using-Spark-and-Zeppelin-to-process-Big-Data-on-Kubernetes.html
======
minimaxir
I _really_ wish there were more tutorials like this on how to set up Spark and
other Big Data tools (TensorFlow) on cloud computing, as that personally has
been the primary barrier to starting work with extra large amounts of data.
(Most current tutorials require running a ton of console commands that are
obsolete.)

I took a Spark course on edX last year, but the environment was set up using a
customized Vagrant config with no real-world applicability. I definitely
prefer the Kubernetes approach.

~~~
mindcrime
_I really wish there were more tutorials like this on how to set up Spark and
other Big Data tools (TensorFlow) on cloud computing, as that personally has
been the primary barrier to starting work with extra large amounts of data._

Just as an FYI, we[1] are working on an open-source, cloud-based Machine
Learning / Big Data platform that might be of interest to you. It's not all
ready yet, but when it is, there will be a simple REST API that lets you
define the kind of setup you want, "push a button", and have it all deployed.
Our initial backend is AWS with plain-Jane EC2 instances, but it will be
possible to extend it to other configurations as well.

Right now we deploy a Spark/Hadoop cluster with Apache SystemML, Mahout and
MLlib installed. Zeppelin will be coming to the stack, as will other tools
like TensorFlow, SparkR, CaffeOnSpark, etc.

We'll be offering our own hosted service based on this, but it'll be open
source so you can deploy it in an environment of your own if you wish.

We'll do a "Show HN" when we have something ready, so keep an eye out if that
sounds interesting.

I also plan to write up some tutorials and documentation based on our
experience building this out, but the priority right now is getting it built.
:-)

[1]: [http://www.fogbeam.com](http://www.fogbeam.com)

~~~
TheIronYuppie
Disclosure: I work at Google on Kubernetes.

If you'd like to do this in containers/Kubernetes, we'd love to highlight your
work! Kubernetes runs great on AWS (as well as GCP, Azure and elsewhere), so
no cloud migration required.

~~~
jaz46
Shameless plug: Pachyderm is another way to run big data workloads on
Kubernetes:
[https://github.com/pachyderm/pachyderm](https://github.com/pachyderm/pachyderm).
We spoke at KubeCon SF last year and v1.0 is coming out next month!

~~~
TheIronYuppie
Disclosure: I work at Google on Kubernetes.

Great to hear, congrats on reaching 1.0! Please do reach out when you get
there, we'd love to show off your work to the community.

------
TheIronYuppie
Disclosure: I work at Google on Kubernetes.

Lots of folks have been interested in how to run common stacks on Kubernetes;
this is a great example of a really common scenario (i.e. using your cluster
for Spark processing).

~~~
manojlds
Is anyone actually using it (Spark on k8s) for intensive loads, though?

------
rcpt
How would one go about this using HDFS instead of relying on GCS, S3, etc.
for storage? Would HDFS run as a separate k8s service?

~~~
boulos
Given the failure mode of DataNodes in Hadoop (lose X replicas, adios), you'd
probably want something like the upcoming PetSet in Kubernetes 1.3. The risk
of losing nodes and having that take out your whole job is why we push so
hard on having people use our HDFS connector for GCS: it's really nice to
have a high-bandwidth shared object store that won't "crash".

------
faizshah
Semi-related, but does anyone know if Zeppelin supports Java yet?

------
markvdb
Totally off topic, but...

Am I the only one to think of
[https://en.wikipedia.org/wiki/Hindenburg_disaster](https://en.wikipedia.org/wiki/Hindenburg_disaster)
when seeing the words "spark" and "zeppelin" together? (This type of airship
is called a "zeppelin" in my native language...)

