Introducing Chaos Engineering (netflix.com)
99 points by tweakz on Sept 11, 2014 | 28 comments

Man, I really have the impression that my video-on-demand service has a better understanding of ensuring availability and risk-management than either my bank or any government service.

Could you imagine, say, a utility company creating a title called 'Chaos Engineer'?

Can you imagine how difficult that would be to justify in a bank or government service? It would be very hard to convince someone in a position of power at such an institution that deliberately introducing random failures and problems into your production services is a good idea. Even if it is a good idea, it would be hard to make that argument.

If Netflix's system fails, the worst thing that could happen is you don't get to watch Orange is the New Black. If your bank's system fails, you may lose payments, fail to pay your bills on time, or worse.

I work for a large institution. The thought process doesn't work that way -- the suggestion of introducing chaos would be a no-go not because of business risk, but the risk of disrupting the layers of bullshit that people build between them and whatever the actual business is.

Netflix's cost of failure is very high -- customers get real-time discovery of your failures, and the service costs less than $10/mo, so they aren't deeply invested in it and will drop you like a hot potato.

When people are only paying $10/month, failures are cheap. I imagine low-cost customers tend to be more forgiving. I could be wrong on this.

The difference in many government and some banking situations is that you don't have much recourse. If the DMV goes down, you still need plates. If it's a really epic failure, there may be political fallout, otherwise, it's just meeting low expectations.

For many people, changing bank accounts is very difficult, which is one of the reasons that many banks are able to treat customers with something close to contempt.

You run a staging environment to mirror production [with fewer nodes per PoP] and throw things at it.

The only real difference is the load on staging would be generated artificially [e.g. ACH, credit card processing] using faked staging-only accounts. You even have a fake banking website for security audits that is a clone of production with the fake accounts.
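The artificial-load idea could be as simple as a generator replaying fake ACH-style events against staging-only accounts. A minimal sketch in Python (the account format, event shape, and batch size here are all made up for illustration, not any real ACH format):

```python
import random
import uuid

# Staging-only account numbers -- fake by construction, never valid in production.
FAKE_ACCOUNTS = [f"STG-{n:08d}" for n in range(100)]

def make_mock_ach_batch(size, seed=None):
    """Generate a batch of fake ACH-style events to replay against staging."""
    rng = random.Random(seed)
    batch = []
    for _ in range(size):
        # sample() guarantees two distinct accounts per transfer
        src, dst = rng.sample(FAKE_ACCOUNTS, 2)
        batch.append({
            "id": str(uuid.uuid4()),
            "type": rng.choice(["debit", "credit"]),
            "from": src,
            "to": dst,
            "amount_cents": rng.randint(1, 500_000),
        })
    return batch

if __name__ == "__main__":
    # In real use you'd POST this batch to your staging ingest endpoint
    # (hypothetical) and then start breaking things while it drains.
    for event in make_mock_ach_batch(5, seed=42):
        print(event)
```

Seeding the generator means you can replay the exact same load before and after a failure injection and diff the results.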

No risk to production and probably 80% of the benefits. Of course, it's probably 125% of the effort.

Show me a staging environment that's as robust as a production environment...

At the last company I worked at, they had like 30% of the staging environment running production stuff. They also had only one deploy target, which was, drum roll, production, which meant all of staging AND some of Dev/QA was live and could interact with the production environment. It was also fun to see production load at 300% of expected, burning, only to find out some event, data roll-up, query, whatever was being run in duplicate on 10 different machines because somebody forgot to manually edit the configs after rolling out a new version to staging/QA. Although I think this would qualify as "chaos engineering", I don't think it fits with what Netflix is going for.

Yeah, ok that had nothing to do with the OP, sorry, I just had to vent.

That really sucks, I'm sorry.

The kinds of things I work on now are entire "environments" we sell, so "production" testing is more along the lines of "go plug the box in and do stuff to it".

1) I stated it as a requirement.

2) The projects I work on at $DAY_JOB that I control all have 3 hardware nodes for HA in both staging & production. Staging is under artificial load that plays back the same series of events on dummy data. [e.g. Job A runs -> Completes -> Puts a mock job in Staging's Job Queue -> runs on Staging with mocked data -> I randomly restart a node to see what happens]
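The "randomly restart a node" step can be a tiny scheduled script. A rough Python sketch (the hostnames and systemd unit name are placeholders, and it assumes passwordless SSH to the staging boxes):

```python
import random
import subprocess

STAGING_NODES = ["staging-1", "staging-2", "staging-3"]  # placeholder hostnames

def pick_victim(nodes, rng=random):
    """Choose one node to kill. One at a time keeps quorum in a 3-node HA setup."""
    return rng.choice(nodes)

def restart_service(node, unit="myapp.service"):
    """Restart the app on the chosen node over SSH (unit name is a placeholder)."""
    result = subprocess.run(
        ["ssh", node, "sudo", "systemctl", "restart", unit],
        check=False,
    )
    return result.returncode

if __name__ == "__main__":
    victim = pick_victim(STAGING_NODES)
    print(f"restarting {victim}")
    restart_service(victim)
```

Run it from cron during the replayed load and watch whether the job queue drains cleanly anyway.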

Buying 3 1Us is $750 for a staging environment. If you have 1 DC availability, you can just put them on your Local LAN (Free).


Now, you suddenly have a rough approximation of a 3 Node / 1 DC environment.

But what if you have multiple DCs, 3 Nodes per DC, Hot/Cold Loadbalancers and have DNS Failover Between Them?!

Say, you run a LBx2+Nodex3/LBx2+Nodex3/LBx2+Nodex3 across 3 DCs.

Assuming you automate the build, you could just pay by hour. But if not...

Linode's Failover IPs let you do the Hot/Cold LB thing [$60 for 3 DCs worth]. 3 Nodes per DC [$40 Nodes for 4096 MB is probably more than enough to test most issues]. $360

So for $420/month, you could replicate a pretty big real world setup.
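Spelling out the arithmetic (using the prices quoted above, which are 2014 figures, not current Linode pricing):

```python
# Hot/cold LB failover IPs across 3 DCs, plus 3 nodes per DC at $40/mo each.
failover_ips = 60          # $60 for 3 DCs' worth of failover IPs
dcs = 3
nodes_per_dc = 3
node_price = 40            # 4096 MB plan per the quote above

node_cost = dcs * nodes_per_dc * node_price   # 9 nodes -> $360
total = failover_ips + node_cost
print(total)  # 420
```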

What if you have 25 servers like stackoverflow and a hot/cold DC setup that is all Linux? http://highscalability.com/blog/2014/7/21/stackoverflow-upda...

Linode's Failover IPs let you do the Hot/Cold LB thing [$40 for 2 DCs worth]. 13 Nodes per DC [$40 Nodes for 4096 MB is probably more than enough to test most issues]. $520

For $560/month, you can basically duplicate the 2-coast hot/cold setup with 26 servers.

But it's expensive!

...unless this is a one-man operation, not really. A fully loaded junior engineer is a minimum of $120k. A robust staging environment is very useful and isn't going to cost more than 1 engineer month to setup.

Yes, you might not use the same hardware. However, it is close enough and honestly...the same hardware 1:1 with real servers [based on the ebay servers] and a rack at Fremont w/ Hurricane Electric is $600/month. 13 servers is $3,250 in hardware. Get someone else on the East Coast in like Buffalo, NY and you can get as close as your budget affords. I doubt you'd hit 1 Junior Engineer Year worth of $$.

Not really free. Power and A/C are things you'll pay for, but it's not a notable cost [less than $100/month].

4chan has better availability and higher traffic than healthcare.gov. Think about that.

It's so true. The entertainment/consumer shopping industries seem to have surpassed banks/governments in availability. It's also pretty ridiculous knowing that some of these companies are tasked with keeping our data/money safe, and yet they still don't allow extremely complex passwords.

Scaling financial transactions requires real sharding, but you can't eliminate ACID from the mix... certain functionality just doesn't work like that.

In the end, one system needs to be responsible for a given account... best case you could have is only some accounts/users are affected. It's a different kind of problem.
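"One system responsible for a given account" usually means a deterministic account-to-shard mapping, so a shard failure only takes out the accounts that live on it. A toy sketch, with an illustrative shard count and hashing scheme (not how any particular bank does it):

```python
import hashlib

N_SHARDS = 16  # illustrative

def shard_for(account_id, n_shards=N_SHARDS):
    """Stable mapping from account ID to shard; every writer must agree on it."""
    digest = hashlib.sha256(account_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_shards

def same_shard(a, b):
    """Transfers within one shard can be a single local ACID transaction;
    cross-shard transfers need coordination (e.g. two-phase commit)."""
    return shard_for(a) == shard_for(b)
```

The point of the `same_shard` check is exactly the "best case" above: when a shard is down, only transfers touching that shard fail, not the whole bank.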

Facebook, for example, uses Cassandra with a very wide distribution of nodes, with an immediate ack on receive... you can't do that with a bank... transactional payment systems won't tolerate it. You could do a few things to alleviate this issue... read-only replicas, etc... even then it's a matter of failing soft (disabling only those systems/services that are unavailable instead of the whole thing).
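"Failing soft" here means: keep serving reads from a replica when the primary is down, and refuse only the writes, rather than taking the whole service offline. A minimal illustration with in-memory dicts standing in for real primary/replica stores:

```python
class StoreDown(Exception):
    pass

class BalanceService:
    """Toy fail-soft wrapper: reads fall back to a (possibly stale) replica;
    writes are rejected outright when the primary is unavailable."""

    def __init__(self, primary, replica):
        self.primary = primary      # authoritative store
        self.replica = replica      # read-only copy, possibly stale
        self.primary_up = True

    def get_balance(self, account):
        if self.primary_up:
            return self.primary[account], "fresh"
        return self.replica[account], "stale"   # degraded but available

    def withdraw(self, account, amount):
        if not self.primary_up:
            raise StoreDown("writes unavailable; try again later")
        self.primary[account] -= amount
        return self.primary[account]
```

Customers can still see an (approximate) balance during an outage; what they can't do is move money, which is the half of the system that genuinely can't tolerate an immediate-ack, eventually-consistent write path.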

I feel like there's lots of space for b2b disruption in banking. Banks really shouldn't be in the business of building secure and robust software. They don't manufacture their own vaults either.

Netflix puts out some great articles about architecture in the cloud. Auto-scaling, chaos monkey, and how they handle 'steal-time.' Does anyone know of any other company that publishes so much about cloud architecture? This is great stuff!

Most companies their size would use their own servers instead of the cloud.

Using AWS does not magically give you an HA infrastructure when you have a complicated service-oriented architecture like Netflix's. All the stuff mentioned here is still relevant even if they're running their own DC.

I wonder if Netflix has ever come out with some kind of, "So you want to get into chaos engineering, eh?" kind of article that explains the basics and some pitfalls/things to look out for.

I gave a talk at pagerduty about how this has been done for the last few years: https://blog.pagerduty.com/2014/03/injecting-failure-at-netf...

Nice! Thanks for sharing man!

I assume that other big internet companies also practice chaos engineering under a different name, but having such a name for the job is awesome. It highlights the difference from traditional stress testing. Names have surprising power. Growth Hacker was a bit annoying but a very effective title trend, and it helped to communicate a different approach to traditional marketing efforts. I think Chaos Engineer has the same potential.

You're probably right re:other companies doing similar things. Certainly Google does something similar via their "disaster days": http://queue.acm.org/detail.cfm?id=2371516

It's great to see Netflix taking disaster recovery and chaos mitigation seriously. Learning to work with constant failure is one of the biggest challenges for anyone working with distributed systems at scale, and concepts like the Chaos Monkey help enormously. I hope other companies follow suit, and soon.

How does one go about becoming a chaos engineer? I imagine up to this point it is a field one falls into accidentally and gains experience over time. I can imagine it becoming a topic taught at a college level in the near future.

Not really, you probably start with an entry-level sysadmin/devops position, and work your way from firefighting incident to firefighting incident.

You just need an appreciation that any down-time in a system is a symptom of larger problems, and the will to identify (and reproduce as chaos!) those problems.

Someone calling themselves 'the chaos commander' who uses multi-regional active-active jargon is looking for chaos engineers.


Lol at monkey picture, full rambo style :p
