
I want to have an AWS region where everything breaks with high frequency - caiobegotti
https://twitter.com/cperciva/status/1292260921893457920
======
jedberg
For those saying "Chaos Engineering", first off, the poster is well aware of
Chaos Engineering. He's an AWS Hero and the founder of Tarsnap.

Secondly, this would help make CE better. I actually asked Amazon for an API
to do this ten years ago when I was working on Chaos Monkey.

I asked for an API to do a hard power off of an instance. To this day, you can
only do a graceful power off. I want to know what happens when the instance
just goes away.

I also asked for an API to slow down networking, set a random packet-drop
rate, inject EBS failures, etc. All of these things can be simulated with
software, but it's still not exactly the same as when it happens outside the
OS.
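
For the network side, something like Linux's tc/netem gets you part of the
way from inside the guest (the interface name and numbers here are just an
example):

tc qdisc add dev eth0 root netem delay 200ms loss 1%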

Basically I want an API where I can torture an EC2 instance to see what
happens to it, for science!

~~~
aloknnikhil
> I asked for an API to do a hard power off of an instance. To this day, you
> can only do a graceful power off. I want to know what happens when the
> instance just goes away.

Wouldn't just running "halt -f" do the same?

~~~
0xEFF
Kernel panics work very well for this use case.

echo c > /proc/sysrq-trigger

~~~
derefr
Maybe for testing the crash-resilience of software running _on_ the node; but
not necessarily for testing how the SDN autoscaling glop you've got configured
responds to the node's death.

A panicking instance is still "alive" from its hypervisor's perspective
(either it'll hang, sleep, or reboot, but it won't usually _turn off_ its vCPU
in a way the hypervisor would register as "the instance is now off"); while if
a hypervisor box suffers a power cut, the rest of the compute cluster knows
that the instances on that node are now very certainly _off_.

~~~
kl4m
echo o > /proc/sysrq-trigger will shut down an AWS instance immediately. I've
used it a couple of times to make real sure my costly spot instance was
terminated.

[https://www.kernel.org/doc/html/v4.11/admin-guide/sysrq.html...](https://www.kernel.org/doc/html/v4.11/admin-guide/sysrq.html#what-are-the-command-keys)

~~~
derefr
[https://elixir.bootlin.com/linux/latest/source/kernel/power/...](https://elixir.bootlin.com/linux/latest/source/kernel/power/poweroff.c#L21)

Looks like SysRq-o does a _clean_ poweroff, not a dirty/immediate one — it
calls kernel_power_off(), not machine_power_off().

This means, importantly, that it can actually take some time to happen, as
drivers get a chance to run deinitialization code. It also means that a
wedged driver that doesn't respond during deinit can prevent the kernel from
halting.

Thus, while SysRq-o might be useful for killing a wedged _userland_, it's not
a panacea — in particular, it isn't guaranteed to complete a shutdown for
unstable _kernels_, or kernels with badly-written DKMS drivers attached. It's
not truly equivalent to a power cut.

------
dijit
Isn’t us-east-1 exactly that?

All jokes aside, I actually asked my google cloud rep about stuff like this;
they came back with some solutions but often the problem with that is, what
kind of failure condition are you hoping for?

Zonal outage (networking)? Hypervisor outage? Storage outage?

Unless it’s something like S3 giving high error rates, most things can
actually be done manually. (And this was the advice I got back, because
faulting the entire set of APIs and tools in unique and interesting ways is
quite impossible.)

~~~
londons_explore
> Unless it’s something like s3 giving high error rates

Just firewall off the real s3, and point clients at a proxy which forwards
most requests to the real s3 and returns errors or delays to the rest.
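
A rough sketch of such a proxy in Go, using the standard library's reverse
proxy (the upstream URL, port, and failure rate here are just placeholders;
virtual-hosted-style buckets would also need Host rewriting):

    package main

    import (
        "log"
        "math/rand"
        "net/http"
        "net/http/httputil"
        "net/url"
    )

    func main() {
        // The real S3 endpoint that most traffic gets forwarded to.
        upstream, err := url.Parse("https://s3.amazonaws.com")
        if err != nil {
            log.Fatal(err)
        }
        proxy := httputil.NewSingleHostReverseProxy(upstream)

        handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            // Fail ~5% of requests with a 500; pass the rest through untouched.
            if rand.Float64() < 0.05 {
                http.Error(w, "injected S3 failure", http.StatusInternalServerError)
                return
            }
            proxy.ServeHTTP(w, r)
        })

        log.Fatal(http.ListenAndServe(":9000", handler))
    }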

~~~
eru
I found that in practice for APIs, delays are often much worse than errors.

Mostly because programmers seem to have an easier time thinking about errors,
and their programming language might even encourage them to handle these kinds
of errors, but arbitrary delays often slip by unanticipated.

~~~
londons_explore
Google's approach to RPC deadline propagation solves this. I'm sad that the
vast majority of libraries don't support it.

The simple concept is that all services/APIs should take a deadline parameter.
The call should either complete before that time or return an error. If you
are writing code for a service and you get an incoming request, you pass the
same deadline along with any requests you make to other services.
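
A minimal sketch of the idea in Go, using context deadlines (the handler,
timeout, and downstream URL are made up for illustration); gRPC does this
deadline propagation across hops for you:

    package main

    import (
        "context"
        "log"
        "net/http"
        "time"
    )

    // handle attaches a deadline to the incoming request's context and
    // propagates it to the downstream call, so a slow dependency turns into
    // a bounded error instead of an unbounded wait.
    func handle(w http.ResponseWriter, r *http.Request) {
        ctx, cancel := context.WithTimeout(r.Context(), 500*time.Millisecond)
        defer cancel()

        req, err := http.NewRequestWithContext(ctx, "GET", "http://backend.internal/lookup", nil)
        if err != nil {
            http.Error(w, err.Error(), http.StatusInternalServerError)
            return
        }
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            http.Error(w, "deadline exceeded or backend failed", http.StatusGatewayTimeout)
            return
        }
        defer resp.Body.Close()
        w.WriteHeader(resp.StatusCode)
    }

    func main() {
        http.HandleFunc("/", handle)
        log.Fatal(http.ListenAndServe(":8080", nil))
    }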

~~~
nitrogen
As long as everything is idempotent. You don't want to have this happen:

Host A calls host B

Host B runs SQL+COMMIT

DB commit finishes

Host B deadline expires

DB success reaches host B

Also would need fanatical time synchronization. Maybe even PTP rather than
NTP?

~~~
londons_explore
Google has awesome clock sync in its datacenters (sub millisecond), and in
general all RPC calls should be idempotent (since the network or either host
can fail at any point).

I think one might be able to make this approach work with poor clock sync by
having the request contain the number of milliseconds left, rather than an
absolute time. It isn't great if network delays dominate, but typically it's
queueing delays that dominate most webservice responses.

~~~
dijit
Those clocks are not available to the general public, and is one of the
reasons that things like spanner have an edge over alternatives.

~~~
eru
Though nothing is stopping Amazon or Microsoft from putting atomic clocks into
their own datacentres.

------
davidrupp
[Disclaimer: I work as a software engineer at Amazon (opinions my own, obvs)]

The chaos aspect of this would certainly increase the evolutionary pressure on
your systems to get better. You would need really good visibility into what
exactly was going on at the time your stuff fell over, so you could know what
combination(s) to guard against next time. But there is definitely a class of
problems this would help you discover and solve.

The problem with the testing aspect, though, is that test failures are most
helpful when they're deterministic. If you could dictate the type, number, and
sequence of specific failures, then write tests (and corresponding code) that
help make your system resilient to that combination, that would definitely be
useful. It seems like "us-fail-1" would be more helpful for organic discovery
of failure conditions, less so for the testing of specific conditions.

~~~
cogman10
> The problem with the testing aspect, though, is that test failures are most
> helpful when they're deterministic.

Let's not let `perfect` get in the way of `good`.

Certainly a 100% traceable system would be ideal, but most systems are not
that.

There are still a TON of low-hanging, easy-to-find issues that would
automatically fall out of a system of random failures. Even if engineers have
to spend some time figuring out what the hell is going on, it would overall
improve their system, because it would shine a bright shiny flashlight on the
system to let them know "Hey, something is rotten here". From there, more
deterministic tests and better tracing can be added.

------
gregdoesit
When I worked at Skype / Microsoft and Azure was quite young, the Data team
next to me had a close relationship with one of the Azure groups who were
building new data centers.

The Azure group would ask them to send large loads of data their way, so they
could get some "real" load on the servers. There would be issues at the infra
level, and the team had to detect this and respond to it. In return, the data
team would also ask the Azure folks to just unplug a few machines - power them
off, take out network cables - helping them test what happens.

Unfortunately, this was a one-off, and once the data center was stable, the
team lost this kind of "insider" connection.

However, as a fun fact, at Skype we could use Azure for free for about a
year - every dev in the office, for work purposes (including work pet
projects). We spun up way too many instances during that time, as you'd
expect, and only got around to turning them off when Azure changed billing to
charge 10% of the "regular" pricing for internal customers.

~~~
eru
When I was at Google, as a developer you officially got unlimited space in the
internal equivalent of Google Drive.

I always wondered how many people got some questions from the storage team, if
they really needed all those exabytes.

------
ben509
I don't see a us-fail-1 region being set up for a number of reasons.

One, this is not how AWS regions are designed to work. What they're thinking
of is a virtual region with none of its own datacenters, but AWS has internal
assumptions about what a region is that are baked into their codebase. I think
it would be a massive undertaking to simulate a region like this.

(I don't think a fail AZ would work either, arguably it'd be worse because all
the code that automatically enumerates AZs would have to skip it, which is
going to be all over the place.)

Two, set up a region with deliberate problems, and idiots will run their
production workload in it. It doesn't matter how many banners and disclaimers
you set up on the console, they'll click past them.

When customer support points out they shouldn't be doing this, the idiot
screams at them, "but my whole business is down! You have to DO something!"
This would be a small number of customers, but the support guys get all of
them.

Three, AWS services depend on other AWS services. There are dozens of AWS
services, each like little companies with varying levels of maturity. They
ought to design all their stuff to gracefully respond to outages, but they
have business priorities and many services won't want to set up in us-fail-1.
When a region adds special constraints, it has a high likelihood of being a
neglected region like GovCloud.

------
bob1029
It sounds to me like what some people would like is a magical box they can
throw their infrastructure into that will automatically shit test all the
things that could potentially go wrong for them. This is poor engineering.
Arbitrary, contrived error conditions do not constitute a rational test
fixture. If you are not already aware of where failures might arise in your
application and how to explicitly probe those areas, you are gambling at best.
Not all errors are going to generate stack traces, and not all errors are
going to be detectable by your users. What you would consider an error
condition for one application may be a completely acceptable outcome for
another.

This is the reliability engineering equivalent of building a data warehouse
when you don't know what sorts of reports you want to run or how the data will
generally be used after you collect it.

~~~
cogman10
I disagree.

Not handling failures correctly is a time honored tradition in programming. It
is so easy to miss.

For example, how often have you seen a malloc check for `ENOMEM`?

Even though that is something that could be semi common. Even though that's
definitely something you might be able to handle. Instead, most code will
simply blow chunks when that sort of condition happens. Is the person that
wrote it "wrong"? That's debatable.

Some languages like Go make it even trickier to detect that someone forgot to
handle an error condition. Nothing obvious in the code review (other than
knowledge of the API in question) would get someone senior to catch those
sorts of issues.

So the question is, HOW do you catch those problems?

The answer seems obvious to me: you simulate problems in integration tests.
What happens when Service X simply disappears? What happens when a server
restarts mid-communication? Is everything handled, or does this cause the apps
to go into a non-recoverable mode?
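
A bare-bones sketch of that kind of test in Go, where the "service" is killed
before the call (the function under test here is a stand-in, not real code
from anyone's stack):

    package chaos_test

    import (
        "net/http"
        "net/http/httptest"
        "testing"
    )

    // callServiceX stands in for the client code under test.
    func callServiceX(url string) (*http.Response, error) {
        return http.Get(url)
    }

    func TestServiceXDisappears(t *testing.T) {
        srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            w.Write([]byte("ok"))
        }))
        srv.Close() // the dependency vanishes before we call it

        if _, err := callServiceX(srv.URL); err == nil {
            t.Fatal("expected an error once the dependency disappeared")
        }
    }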

These are all great infrastructure tests that can catch a lot of edge-case
problems that may have been missed in code reviews. Even better, that sort of
infrastructure testing can be generalized and applied to many applications.
Making rare events common in an environment makes it a lot easier to catch
hard to notice bugs that everyone writes.

It's basically just fuzz testing, but for infrastructure. Fuzz testing has
been shown to have a ton of value, and infrastructure fuzzing seems like a
natural, valuable extension of that, especially when high reliability and low
maintenance are things everyone should want.

~~~
im3w1l
As a community we have basically decided against handling of out of memory;
that it's better to crash, and design the program so that such a crash will
not cause corrupt state.

Consider over-commit. It means that the malloc call will succeed even if
there isn't available memory. Instead the system just hopes you won't make use
of the memory you asked for. And if you do make use of it, well, then the OS
might kill you. Just like that. No opportunity for error handling.

~~~
eru
Reliably crashing _is_ a way to handle out of memory errors.

When you don't handle it, you get something closer to undefined behaviour
instead.

------
falcolas
I don't work with the group directly, but one group at our company has set up
Gremlin, and the breadth and depth of outages Gremlin can cause is pretty
impressive. Chaos Testing FTW.

~~~
robpco
I’ve also had a customer who used Gremlin to dramatically improve their
stability.

------
jiggawatts
In the same vein, instead of the typical "debug" and "release" configurations
in compilers, I'd love it if there were also an "evil" configuration.

The evil configuration should randomise anything that isn't specified. No
string comparison type selected? You get Turkish. All I/O and networking
operations fail randomly. Any exception that can be thrown, is, at some small
rate.

Or to take things to the next level, I'd love it if every language had an
interpreted mode similar to Rust's MIR interpreter. This would tag memory with
types, validate alignment requirements, enforce the weakest memory model
(e.g.: ARM rules even when running on Intel), etc...

------
msla
A zone not only of sight and sound, but of CPU faults and RAM errors, cache
inconsistency and microcode bugs. A zone of the pit of prod's fears and the
peak of test's paranoia. Look, up ahead: Your root is now read-only and your
page cache has been mapped to /dev/null! You're in the Unavailability Zone!

~~~
mindcrime
Your conductor on this journey through the Unavailability Zone: the BOFH!

------
missosoup
That region is called Microsoft Azure. It will even break the control UI with
high frequency.

~~~
llama052
I was going to post this but you beat me to it.

We are forced to use Azure for business reasons where I work, and the
frequency of one-off failures and outages is insane.

------
rob-olmos
I imagine AWS and other clouds have a staging/simulation environment for
testing their own services. I seem to recall them discussing that for VPC
during re:Invent or something.

I'm on the fence, though, about whether I'd want a separate region for this
with various random failures. I think I'd be more interested in being able to
inject faults/latencies/degradation in existing regions, when I want them to
happen, for more control and the ability to verify any fixes.

Would be interesting to see how they price it as well. High per-API cost
depending on the service being affected, combined with a duration. Eg, make
these EBS volumes 50% slower for the next 5min.

Then after or in tandem with the API pieces, release their own hosted Chaos
Monkey type service.

------
bigiain
Show HN! Introducing my new SPaaS:

Unreliability.io - Shitty Performance as a Service.

We hook your accounting software up to api.unreliability.io and when a client
account becomes delinquent, our platform instantly migrates their entire stack
into the us-fail-1 region. Automatically migrates back again within 10 working
days after full payment has cleared - guaranteed downtime of no less than 4
hours during migration back to production region. Register now for a 30 day
Free Trial!

------
kentlyons
I want this at the programming language level too. If a function call can
fail, I want to set a flag and have it (randomly?) fail. I hacked my way
around this by adding a wrapper that would randomly err for a bunch of
critical functions. It was great for working through a ton of race conditions
in golang with channels, remote connections, etc. But hacking it in manually
was annoying and not something I'd want to commit.
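
A minimal version of that kind of wrapper in Go might look like the sketch
below (the names and the env-var switch are invented for illustration, not
what the parent comment actually used):

    package main

    import (
        "errors"
        "fmt"
        "math/rand"
        "os"
    )

    // chaos is flipped on by an environment variable, so the failpoints can
    // stay in the code without affecting normal runs.
    var chaos = os.Getenv("CHAOS") == "1"

    // failpoint returns an injected error with probability p when chaos mode is on.
    func failpoint(p float64) error {
        if chaos && rand.Float64() < p {
            return errors.New("injected fault")
        }
        return nil
    }

    // fetchRecord sketches a "critical function" guarded by a failpoint.
    func fetchRecord(id string) (string, error) {
        if err := failpoint(0.3); err != nil {
            return "", err
        }
        return "record-" + id, nil
    }

    func main() {
        for i := 0; i < 5; i++ {
            rec, err := fetchRecord("42")
            if err != nil {
                fmt.Println("handled failure:", err)
                continue
            }
            fmt.Println("got", rec)
        }
    }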

------
imhoguy
Failing individual computes isn't hard, some chaos script to kill VMs is
enough. Worst are situations when things seem to be up but not acceptable:
abnormal network latency, random packet drops, random but repeatable service
errors, lagging eventual consistency. Not even mentioning any hardware woes.

------
vemv
While these are not exclusive, personally I'd look instead into studying my
system's reliability in a way that is independent of a cloud provider, or even
of performing any side-effectful testing at all.

There's extensive research and prior work on all things resilience. One could
say: if one builds a system that is proven to be theoretically resilient, that
model should extrapolate to real-world resilience.

This approach is probably intimately related to pure-functional programming,
which I feel has not been explored enough in this area.

------
terom
There are multiple methods for automating AWS EC2 instance recovery for
instances in the "system status check failed" or "scheduled for retirement
event" cases.

I've yet to figure out how to test any of those CloudWatch alerts/rules. I've had
them deployed in my dev/test environments for months now, after having to
manually deal with a handful of them in a short time period. They've yet to
trigger once since.

Umbrellas when it's raining etc.

~~~
wmf
This is why it seems like it would be good to have explicit fault injection
APIs instead of assuming that the normal APIs behave the same as a real
failure.

------
swasheck
Wait. I thought this was ap-southeast-2

------
MattGaiser
Whichever region Quora is using.

------
haecceity
Why does Twitter often fail to load when I open a thread, but works if I
refresh? Does Twitter use us-fail-1?

~~~
mschuster91
You mean the "Click here to reload Twitter" thing? That's spam protection, I
hit this regularly after restarting Chrome (with ~600 tabs).

~~~
haecceity
Ya but clicking that doesn't even work. And I rarely go on Twitter so I don't
know why they would do it intentionally.

------
georgewfraser
I think people overestimate the importance of failures of the underlying cloud
platform. One of the most surprising lessons of the last 5 years at my company
has been how rarely single points of failure actually _fail_. A simple load-
balanced group of EC2 instances, pointed at a single RDS Postgres database, is
astonishingly reliable. If you get fancy and build a multi-master system, you
can easily end up creating more downtime than you prevent when your own
failover/recovery system runs amok.

------
kevindong
At my job, my team owns a service that generally has great uptime. Dependent
teams/services have gotten into the habit of assuming that our service will be
100% available which is problematic because it's obviously not. That false
assumption has caused several minor incidents unfortunately.

There has been some talk internally of doing chaos engineering to help
improve the reliability of our company's products as a whole. Unfortunately,
the most easily simulatable failure scenarios (e.g. entire containers going
down at once instantly, etc.) tend to be the least helpful, since my team
designed the service to tolerate those kinds of easily modelable situations.

The more subtle/complex/interesting failure conditions are far harder to
recognize and simulate (e.g. all containers hosted on one particular node
experience 10s latencies on all network traffic, stale DNS entries, broken
service discovery, etc.).

------
thethethethe
You can just do this yourself. Google intentionally breaks its systems for a
week every year; it's called DiRT week. DiRT takes weeks of planning before
people even start debugging.

Doing this constantly for all products in a single region would be absolutely
exhausting for SRE teams.

(Disclaimer: I work for ^GOOG and my opinions are my own)

~~~
eru
That is a good idea, but by yourself you can not go and yank the power cable
out of an AWS box.

(I also used to work for Google, as an SRE. Fun times.

Many of our DiRT exercises included the stipulation: 'by the way, assume
that <scarily over-competent coworker> is on vacation and not to be
disturbed'.)

~~~
thethethethe
> That is a good idea, but by yourself you can not go and yank the power cable
> out of an AWS box.

Sure you can. Simulate an Ec2 outage? Turn off your VMs. Lambda down? Turn off
your serving version. Etc...

I think the only things that are difficult to simulate would be control plane
outages for testing CI/CD pipelines--though CI/CD pipelines generally are not
production critical and probably aren't worth testing directly. If you want to
test recovery/repair operations when CI/CD is down, you can just prohibit the
victim from using it to fix things.

For non-automated API calls, you could just say that X API is down for the
duration of the test

~~~
eru
See the other comments. It's not a given that turning off these services
gracefully via API is equivalent to yanking a power cable.

------
djhaskin987
Friendly reminder that for any given single availability zone, the SLA that
AWS provides is one single nine. That means they expect that availability zone
to fail 10% of the time, or 6 minutes every hour. This very high failure rate
comes absolutely for free, no need for a special region. Therefore,
implementing a cross-availability-zone application that logs when packets are
dropped should give you some idea of how your application handles failure.

~~~
BillinghamJ
Yeah, but it's very rare that they don't hit many nines of uptime anyway.

This would be about getting them to actually match the real-world behaviour to
the SLA.

------
gberger
Relevant snippet in the Google SRE Book:

[https://landing.google.com/sre/sre-book/chapters/service-lev...](https://landing.google.com/sre/sre-book/chapters/service-level-objectives/#xref_risk-management_global-chubby-planned-outage)

Google introduced exactly this to one of their internal services so that
downstream dependencies can't rely on its extremely high availability.

------
t0mek
Toxiproxy [1] is a tool that lets you create network tunnels with random
network problems like high latency, packet drops or slicing, timeouts, etc.

Setting it up requires some effort (you can't just choose a region in your AWS
config), but it's available now and can be integrated with tests.

[1]
[https://github.com/Shopify/toxiproxy](https://github.com/Shopify/toxiproxy)

------
chirag64
No one seems to talk about the pricing aspect of this. Developers would want
these us-fail-1 regions to be cheap or free since they wouldn't be using this
for production purposes. And before you know it, a lot of hobbyist developers
will start using these as their production setup since they wouldn't mind a 1%
downtime if they could pay less for it.

~~~
laurent92
A us-fail-1 at lower price sounds like an excellent way to recycle SSD/HDDs
that reach end-of-safe-life but that might still run for years.

------
exabrial
Simply host on Google Cloud! They will terminate your access for something
random, like someone saying your name on YouTube while doing something bad.
They don't have a number you can call, and their support is run by the
stupidest of all AI algorithms.

------
raverbashing
There's an easier way: spot instances (and us-east-1 as mentioned)

As for things like EBS failing, or dropping packets, it's a bit tricky, as
some things might break at the OS level

And given sufficient failures, you can't swim anymore, you'll just sink.

------
castratikron
Sometimes during development instead of checking the return code I'll check
rand() % 3 or something similar. I'll run through the code several times in a
loop and run through a lot of the failure modes very quickly this way.

------
exabrial
Sort of counter-intuitive, but for small projects you want resilient hardware
systems as much as possible... the larger you scale out, the less reliable you
want the hardware to be, to force that resilience out of hardware and into
software.

------
jonplackett
This is such a clever idea. I wonder if amazon are smart enough to actually do
this.

------
foota
Just deploy a new region with no ops support, it'll quickly become that.

------
thoraway1010
A great idea! I'd love to run stuff in this zone. Rotate through a bunch of
errors, unavailability, latency spikes, power outages etc every day, make it a
12 hour torture test cycle.

------
martin-adams
I can see a use case for this being implemented on top of Kubernetes. I've no
idea if that's achievable, but it could go some way toward making your code
more resilient.

------
bootyfarm
I believe this is available as a service called “Softlayer”

------
chucky_z
it's us-west-1! :D

we've had a ton of instances fail at once because they had some kind of rack-
level failure and a bunch of our EC2s ended up in the same rack. :(

------
SisypheanLife
This would require AWS to invest in Chaos Monkeys.

------
jschulenklopper
That region should have `us-wtf-1` as code.

------
whoisjuan
Isn't this what Gremlin does?

------
6510
Sounds useful. Crank it up to 99% failure and it becomes interesting science.

~~~
caiobegotti
It actually sounds useful, to the point I wouldn't be surprised if in the near
future cloud providers bundled up some chaos monkey stack and offered that
with a neat price within their realms (dunno, maybe per VPC or project).

~~~
pseudosavant
They will definitely figure out a way to charge us more for hardware that is
less reliable.

~~~
emerged
That's a great idea: instead of throwing away failing hardware, toss it into
the chaos region and charge double.

------
mamon
Try us-east-1 :)

------
rdoherty
This is called chaos engineering and many companies built tooling to do
exactly this. Netflix pioneered/proselytized it years ago. Since you likely
don't just rely upon AWS services if your app is in AWS, you want something
either on your servers themselves or built into whatever low level HTTP
wrapper you use. Use that library to do fault injection like high latency,
errors, timeouts, etc.

~~~
infogulch
It's harder to do chaos engineering if you're not engineering it. What this is
really asking for is the service provider to sell chaos engineering as a
service (CEAAS?), on the services they provide. I've wanted this kind of thing
for testing cloud infrastructure before: you read about various failure states
and scenarios you might want to handle from docs but there's no way to trigger
them so you just have to hope that they work as described and your code is
correct. At the least, let users simulate the effect of the failures that are
_part of your API_.

This would be great for testing the pieces of the stack that the provider is
responsible for, but you may still want to inject chaos into the part of your
stack that you do control.

~~~
segmondy
Netflix runs on AWS, they are doing chaos engineering quite alright on it.

------
lordgeek
brilliant!

------
fred_is_fred
us-east-1?

~~~
NovemberWhiskey
Anecdotally, I hear the South American regions are the places where the really
canary stuff goes out first.

~~~
mitchs
I've heard a fun story from the old timers in my org about a fiber outage in
Brazil. A routine fiber cut occurred. They figure out how far from one end the
cut is (there is gear that measures the time to see a light pulse reflect off
of the cut end.) Then they pull out a map of where the fiber was laid, count
out the distance, and send a technician out to have a look at where they
expect the cut to be. All standard practice up until this point.

The technician updates the ticket after a while with "cannot find road." The
folks back in the office try to send them directions, but then the technician
clarifies, "road is gone." Our fiber, and the road it was buried under was
totally demolished in the few hours it took to get someone out there. The
developing world can develop at alarming rates.

Other tales from the middle of nowhere: people shoot at aerial fiber with
guns, or dig it up and cut it for fun. One time our technician was carjacked
on the way to doing a repair.

~~~
schoen
> there is gear that measures the time to see a light pulse reflect off of the
> cut end

Though not very relevant to your stories about Brazil, it's a neat technique
in its own right:

[https://en.wikipedia.org/wiki/Optical_time-domain_reflectome...](https://en.wikipedia.org/wiki/Optical_time-domain_reflectometer)

------
CloudNetworking
You can use IBM cloud for that purpose

~~~
toast0
Hey, I used their loadbalancers for a couple months, and they only failed
every 30 days, that's not high frequency.

~~~
paranoidrobot
I used to use their Citrix Netscaler VPX1000s at a previous job.

They were very reliable, imo. Aside from general Netscaler bullshit, we only
ever had issues with them when we'd try to get them to do too much, so that
the CPU or memory was overloaded.

We tried on a few occasions to get more cores allocated to them, but no. This
made terminating large numbers of SSL connections on them problematic.

~~~
toast0
I was using their shared loadbalancers, not the run-a-load-balancer-in-a-VM
option, because I was hoping for something more reliable than a single
computer. For the couple of months they were running, it was literally every
30 days, 10 minutes of downtime. So I went back to DNS round robin, because it
was better.

------
jariel
This is a really great idea.

------
code4tee
This is what chaos monkey does.

