
Can Spark Streaming Survive Chaos Monkey? - ddispaltro
http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
======
MoOmer
Interesting stuff as usual from Netflix, but what's really excellent for the
community is the identification and reporting of the issues discovered under
these conditions. Kudos to you all.

------
estefan
I notice they're using Spark standalone clusters.

I've had problems with executors dying when running under YARN, and it makes
keeping track of running jobs and finding log output more difficult. So unless
you really need to run on the same instances as other MR tech, it seems
standalone is the way to go.

If only AWS provided EMR AMIs for Spark standalone clusters, I could switch...

~~~
sgt101
Is it Spark on YARN, YARN itself, or AWS that's causing the issue?

~~~
estefan
I was trying to run Spark on EMR, so I was running it under YARN on EMR.

The problem I hit seems to have been encountered by one or two other people on
the mailing lists - it appeared to be YARN terminating Spark executors for
exceeding their memory limits. It was odd since I wasn't using the cluster for
anything else - YARN was only acting as the scheduler because it came installed
in the AMI, and I didn't want to use the EC2 scripts to stand up a standalone
cluster on plain EC2 instances.

It was difficult to debug because YARN makes things pretty opaque when tasks
just die - logs weren't copied around the cluster so I had to SSH to the
instance that died and try to find something to indicate what had happened,
and half the time the error messages were vague.

I also didn't work out how to view the job tracker UI when running under YARN.
"Learning Spark" says I need to proxy through the YARN cluster somehow but
doesn't give an example... So at the moment I have no progress information for
running jobs, only for finished jobs.

In the end, I upped the instance size I was using and everything went well. I
think my dataset must have been too large and Spark was spilling to disk, which
tripped things up. So at the moment I don't have confidence that I could
process huge datasets with Spark on EMR, which is a bit annoying :-(
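
For what it's worth, the usual first thing to check when YARN kills Spark
executors over memory is the per-container overhead YARN accounts for on top
of the executor heap. A minimal sketch, assuming the Spark 1.x-era
configuration keys that were current when this thread was written; the sizes
are illustrative, not recommendations:

    // Sketch only: YARN kills containers that grow past their requested size,
    // and Spark's off-heap usage often pushes an executor past
    // spark.executor.memory alone, so raising the overhead is a common fix.
    import org.apache.spark.{SparkConf, SparkContext}

    object EmrJob {                                           // hypothetical job name
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("emr-yarn-example")                     // hypothetical app name
          .set("spark.executor.memory", "4g")                 // heap requested per executor
          .set("spark.yarn.executor.memoryOverhead", "1024")  // extra MB YARN reserves per container
        val sc = new SparkContext(conf)
        // ... job body ...
        sc.stop()
      }
    }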

------
vbsteven
I would like to know which language they are using with Spark: Python, Java or
Scala. I have experience with both Python and Java, and while Python works very
well for prototyping new Spark flows, once a flow becomes too complex I fall
back to Java because of the generics on RDDs.

~~~
lmm
I'd really recommend looking at Scala - IMO it combines the best parts of
both. The syntax is often just as clear and concise as Python, but you also
get full static type safety.
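
As a rough illustration, here's a small Spark flow in Scala; the record type
and input path are made up, but the element type of every RDD is checked at
compile time while the code stays close to the Python equivalent in length:

    // Sketch only: View, the S3 path and the field layout are hypothetical.
    // The point is that each RDD's element type is tracked statically, so
    // e.g. summing a String field would fail to compile.
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._   // pair-RDD implicits (needed pre-1.3)

    object TypedFlow {
      case class View(user: String, title: String, seconds: Int)

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("typed-flow"))

        val views = sc.textFile("s3://example-bucket/views.tsv")  // hypothetical input
          .map(_.split("\t"))
          .filter(_.length == 3)
          .map(a => View(a(0), a(1), a(2).toInt))                 // RDD[View]

        val secondsPerTitle = views
          .map(v => (v.title, v.seconds))                         // RDD[(String, Int)]
          .reduceByKey(_ + _)

        secondsPerTitle.take(10).foreach(println)
        sc.stop()
      }
    }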

------
toomuchtodo
I'm curious how certain services can survive Chaos Monkey. Memcached is one
example; if you start destroying instances, you're going to stampede your
persistent datastore to get that memcached replacement hot again.

~~~
viraptor
The surviving part doesn't have to be in the server itself. Here's an idea for
memcached "surviving":

- set up N available servers

- make clients store to all N servers at the same time when they calculate the
value

- query M (where M <= N) servers before deciding you have to recalculate

- if you got a response from a server between 2 and M, re-store the value
everywhere (just pay attention to preserving timeouts)

And you get distributed, self-healing, chaos-monkey-resistant memcached
without any support on the server side.
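
Roughly, in code - a sketch only, where CacheNode is a hypothetical stand-in
for whatever memcached client library you use, not a real API:

    // Sketch of the scheme above: write to all N nodes, read from up to M,
    // and heal the cluster when only a later node still had the value.
    import scala.concurrent.duration.Duration

    trait CacheNode {                        // hypothetical client interface
      def get(key: String): Option[String]
      def set(key: String, value: String, ttl: Duration): Unit
    }

    class ReplicatedCache(nodes: Seq[CacheNode], readQuorum: Int) {

      // Store the freshly calculated value on every node.
      def put(key: String, value: String, ttl: Duration): Unit =
        nodes.foreach(_.set(key, value, ttl))

      // Ask up to `readQuorum` nodes before deciding to recalculate; a hit on
      // any node after the first triggers a re-store everywhere (a real
      // version would preserve the remaining TTL rather than reset it).
      def getOrCompute(key: String, ttl: Duration)(compute: => String): String = {
        val firstHit = nodes.iterator.take(readQuorum)
          .map(_.get(key))
          .zipWithIndex
          .collectFirst { case (Some(value), idx) => (value, idx) }

        firstHit match {
          case Some((value, idx)) =>
            if (idx > 0) put(key, value, ttl)   // self-healing re-store
            value
          case None =>
            val fresh = compute                 // miss everywhere we asked
            put(key, fresh, ttl)
            fresh
        }
      }
    }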

Also if you want to avoid a stampede, you could insert 1s-TTL placeholders that
mean "back off, someone else is calculating" into keys you know are popular and
may experience contention. Just make sure you use CAS so you don't overwrite
real data with the placeholder.
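
Something like this, as a sketch; addIfAbsent is a hypothetical stand-in for
memcached's add (or a CAS-guarded set) - a store that only succeeds when the
key is currently empty, so the placeholder can never clobber real data:

    // Sketch of the "back off, someone else is calculating" placeholder.
    import scala.concurrent.duration._

    object Stampede {
      val Placeholder = "__computing__"       // hypothetical sentinel value

      def fetch(key: String,
                get: String => Option[String],
                addIfAbsent: (String, String, Duration) => Boolean,  // store only if key is empty
                set: (String, String, Duration) => Unit,
                compute: => String): Option[String] =
        get(key) match {
          case Some(Placeholder) => None           // someone else is computing: back off
          case Some(value)       => Some(value)    // normal hit
          case None =>
            if (addIfAbsent(key, Placeholder, 1.second)) {  // claim the key for ~1s
              val fresh = compute
              set(key, fresh, 1.hour)              // illustrative TTL for the real value
              Some(fresh)
            } else None                            // lost the race: back off, retry later
        }
    }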

~~~
toomuchtodo
That's what I assumed. You lose capacity as you increase reliability, as
you're going to need to redundantly store that data somewhere in your
memcached cluster if you don't want to go back to the persistent layer.
Thanks!

------
eranation
Great write-up. Using Spark / GraphX and so far it's pretty awesome.

------
rojoca
Interesting title, given that Netflix has just released pricing for their
upcoming New Zealand launch, undercutting the local offering from Spark
(formerly Telecom).

~~~
EdwardDiego
Around the same time I introduced Spark at our workplace, Telecom announced
their multi-million-dollar name change.

I spent the first two weeks clarifying that I was indeed referring to Apache
Spark, not the Spark-Formerly-Known-As-Telecom. They'll always be Telecom to
me.

------
photorized
I just got a "Netflix can't play this title at this time" error message - are
you guys experimenting?

