
A State of Xen – Chaos Monkey and Cassandra - tweakz
http://techblog.netflix.com/2014/10/a-state-of-xen-chaos-monkey-cassandra.html
======
Aissen
Is there a blog where they post about what happened during _actual_ downtime?
Like the one on September 21st?

------
Oculus
_By trading away C (Consistency), we’ve made a conscious decision to design
our applications with eventual consistency in mind_

How does one go about writing a user-facing application with eventual
consistency?

~~~
Scaevolus
It depends on the application. For Netflix, transient glitches like having
outdated play history or catalog entries that take a while to propagate aren't
hugely detrimental -- the core value of serving the videos that users request
doesn't depend at all on having a consistent view of quickly-changing user
data.

For more complex applications (think Facebook), there are useful consistency
models other than strong consistency, with causal consistency being one of the
most promising:
[http://queue.acm.org/detail.cfm?id=2610533](http://queue.acm.org/detail.cfm?id=2610533)
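
To make the "transient glitch" idea concrete, here is a minimal sketch of last-write-wins merging between two replicas. It is only an illustration, not Netflix's code: the `Replica` and `Versioned` names, the `last_watched` key, and the integer timestamps are all hypothetical, and it assumes timestamps are comparable across replicas. A read from one replica can briefly return stale data, but once the replicas exchange state they converge on the same value.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp: int   # assumed comparable across replicas
    replica_id: str  # tie-breaker so merges are deterministic


class Replica:
    """Hypothetical replica that accepts writes locally and syncs later."""

    def __init__(self, replica_id: str) -> None:
        self.replica_id = replica_id
        self.store: Dict[str, Versioned] = {}

    def write(self, key: str, value: str, timestamp: int) -> None:
        self._merge(key, Versioned(value, timestamp, self.replica_id))

    def read(self, key: str) -> Optional[str]:
        entry = self.store.get(key)
        return entry.value if entry is not None else None

    def _merge(self, key: str, incoming: Versioned) -> None:
        # Keep whichever version has the larger (timestamp, replica_id) pair.
        current = self.store.get(key)
        if current is None or (
            (incoming.timestamp, incoming.replica_id)
            > (current.timestamp, current.replica_id)
        ):
            self.store[key] = incoming

    def sync_from(self, other: "Replica") -> None:
        # Pull every entry from the other replica and merge it locally.
        for key, entry in other.store.items():
            self._merge(key, entry)


a, b = Replica("a"), Replica("b")
a.write("last_watched", "episode-3", timestamp=1)
b.write("last_watched", "episode-4", timestamp=2)

stale = a.read("last_watched")  # replica a has not seen b's write yet

a.sync_from(b)
b.sync_from(a)
converged = (a.read("last_watched"), b.read("last_watched"))
```

The application has to tolerate the stale read in the middle. For something like play history that is usually acceptable; for data where it is not, a stronger model such as the causal consistency described in the linked article is the kind of thing to look at.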

~~~
_delirium
Do you know whether the linked article is what Facebook is currently using, or
a proposal? Facebook is an example of a webapp that fairly often has user-
visible weird behavior, like "read" notifications becoming unread again. But
I'm not sure if those are glitches, or a result of their consistency model.

~~~
MoOmer
Facebook wrote Cassandra for such scenarios.

[https://m.facebook.com/note.php?note_id=24413138919](https://m.facebook.com/note.php?note_id=24413138919)

------
chuckcode
Are any of these anti-chaos tools open source or shared with the community?
Would love to see more companies that I'm dependent on have this sort of
testing and robustness...

~~~
frankchn
Chaos Monkey (and related software) is open source:
[https://github.com/Netflix/SimianArmy/wiki](https://github.com/Netflix/SimianArmy/wiki)

------
ErikRogneby
2700+ Cassandra nodes! Anyone know how big Facebook's Cassandra is?

~~~
nemothekid
AFAIK, Facebook no longer uses Cassandra, but apparently Apple has 75,000+
Cassandra nodes.

------
lsc
hm.

from:
[http://xenbits.xen.org/xsa/advisory-108.html](http://xenbits.xen.org/xsa/advisory-108.html)

MITIGATION ==========

Running only PV guests will avoid this vulnerability.

Did Amazon reboot all of its VMs, or just the HVM VMs? Why was Netflix running
on HVM VMs?

~~~
RyanGWU82
Amazon rebooted lots of PV guests. Presumably they collocate HVM and PV guests
on the same box. If there were any HVM guests on the box, then there could be
the possibility of an attack. (I guess they could forcibly kick off the HVM
guests, but that wouldn't be very nice.)

Why shouldn't Netflix be running on HVM?

~~~
lsc
>Why shouldn't Netflix be running on HVM?

HVM, at least in the past, had a bunch more code that the guest DomU interacts
with than fully PV guests do. This has security implications.

Now, my knowledge of HVM is a few years, or more like half a decade, out of
date. For example, I don't even know how to force an HVM guest to use only PV
drivers (which would solve 90% of the problem), and I know that more and more
of this has moved into hardware, so it's possible that what was true five
years ago is not true now. But... yeah, I don't let untrusted users on HVM
guests, for the same reason I don't let untrusted users use pygrub or load
untrusted kernels directly.

~~~
grosskur
Most people agree HVM is the way to go on EC2:

[http://www.brendangregg.com/blog/2014-05-07/what-color-is-yo...](http://www.brendangregg.com/blog/2014-05-07/what-color-is-your-xen.html)

Forcing an HVM guest to use only PV drivers sounds like PVH, which is coming
in Xen 4.4:

[https://blog.xenproject.org/2014/01/31/linux-3-14-and-pvh/](https://blog.xenproject.org/2014/01/31/linux-3-14-and-pvh/)

~~~
cthalupa
HVM guests at Amazon will default to using PV drivers for IO and networking.
(The exception is SR-IOV/"Enhanced Networking", which does not use the PV
drivers.)

PVH is actually PV on top of an HVM container and is a bit different. You can
think of it as PV sitting on top of enough HVM bits to take advantage of the
hardware extensions Intel and AMD have invested so heavily in while still
being majority PV. This gives you the best of both worlds, including the
remaining PV performance benefits related to interrupts and timers that PV
drivers on HVM can't utilize.

------
roncohen
I'd love to see Netflix let the chaos monkey loose on their PostgreSQL
servers.

------
jshen
I wonder if their chaos system causes network partitions as well as node
failures.

~~~
sargun
The total toolkit they have is called the Simian Army --
[https://github.com/Netflix/SimianArmy](https://github.com/Netflix/SimianArmy)

They wrote a blog post about it here:
[http://techblog.netflix.com/2011/07/netflix-simian-army.html](http://techblog.netflix.com/2011/07/netflix-simian-army.html)

I don't think they have one that introduces network partitions, but inside a
datacenter, network partitions are rare.

~~~
derek
> ...but inside a datacenter, network partitions are rare

It's ... complicated.

[http://aphyr.com/posts/288-the-network-is-reliable](http://aphyr.com/posts/288-the-network-is-reliable)

~~~
sargun
Yeah, there was an excellent ACM article by him and Bailis. I think if the
network in a datacenter starts to partition or fail, it's time to evacuate
the datacenter / AZ. If a handful of machines fail, they should disengage. If
more than, say, 5% of the machines in the DC are having reachability issues at
any given point (in a modern DC that's like ~2000 machines), it's time to shut
it down.
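
As a rough sketch of that rule of thumb (the function names and the default threshold parameter are hypothetical, and the 5% figure is just the ballpark from the comment above, not a tested operational limit):

```python
def unreachable_fraction(unreachable: int, total: int) -> float:
    """Fraction of machines currently reporting reachability issues."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= unreachable <= total:
        raise ValueError("unreachable must be between 0 and total")
    return unreachable / total


def should_evacuate(unreachable: int, total: int, threshold: float = 0.05) -> bool:
    """True when more than `threshold` of the machines are unreachable."""
    return unreachable_fraction(unreachable, total) > threshold


# A handful of failures: individual machines disengage, the DC stays up.
handful = should_evacuate(unreachable=100, total=40_000)

# Just past the 5% mark: time to evacuate the datacenter / AZ.
widespread = should_evacuate(unreachable=2_001, total=40_000)
```

Taken together, the two figures in the comment (5% and ~2000 machines) imply a datacenter of roughly 40,000 machines, which is the total used in the example.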

