I want to have an AWS region where everything breaks with high frequency (twitter.com/cperciva)
792 points by caiobegotti on Aug 10, 2020 | 185 comments



For those saying "Chaos Engineering", first off, the poster is well aware of Chaos Engineering. He's an AWS Hero and the founder of Tarsnap.

Secondly, this would help make CE better. I actually asked Amazon for an API to do this ten years ago when I was working on Chaos Monkey.

I asked for an API to do a hard power off of an instance. To this day, you can only do a graceful power off. I want to know what happens when the instance just goes away.

I also asked for an API to slow down networking, set a random packet drop rate, EBS failures, etc. All of these things can be simulated with software, but it's still not exactly the same as when it happens outside the OS.
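
For the "simulated with software" part, a rough sketch of the kind of in-process fault injection I mean - a net.Conn wrapper with invented names and rates; it only covers what the OS itself can see, which is exactly the limitation:

  // Hypothetical in-process fault injection: a net.Conn wrapper that adds
  // latency and random write failures. Names and rates are made up.
  package main

  import (
      "errors"
      "math/rand"
      "net"
      "time"
  )

  type FlakyConn struct {
      net.Conn
      DropRate float64       // probability that a write "fails"
      Delay    time.Duration // extra latency on every write
  }

  func (c FlakyConn) Write(p []byte) (int, error) {
      time.Sleep(c.Delay) // simulate a slow network path
      if rand.Float64() < c.DropRate {
          return 0, errors.New("injected: connection reset by peer")
      }
      return c.Conn.Write(p)
  }

  func main() {
      conn, err := net.Dial("tcp", "example.com:80")
      if err != nil {
          panic(err)
      }
      defer conn.Close()
      flaky := FlakyConn{Conn: conn, DropRate: 0.05, Delay: 100 * time.Millisecond}
      flaky.Write([]byte("GET / HTTP/1.0\r\nHost: example.com\r\n\r\n"))
  }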

Basically I want an API where I can torture an EC2 instance to see what happens to it, for science!


> For those saying "Chaos Engineering", first off, the poster is well aware of Chaos Engineering. He's an AWS Hero and the founder of Tarsnap.

Yeah, but did he win the Putnam?



Same thread where Drew from "getdropbox.com" says he also has a "sync and backup done right" idea...



I like how the guy recovered with "Just the once, though, huh?"

2007 HN was highly entertaining


Further context for those of us not in North America: "Putnam" refers to an undergrad mathematics competition.

https://en.wikipedia.org/wiki/William_Lowell_Putnam_Mathemat...


He did, but that particular accolade didn't seem relevant. :)


> He's an AWS Hero

I don't pay much heed to AWS Heroes. My city has an "AWS Hero" who hoarded the organization of the official AWS meetup for themself, and who is infamous for making liberal use of Cunningham's Law [1] for everything.

[1] https://meta.wikimedia.org/wiki/Cunningham%27s_Law


> I don't pay much heed to AWS Heroes.

Honestly, I don't either. There's a handful of us who are technical and a huge number whose contributions are primarily marketing.


And Colin isn't anything like that. At all. So aim that anger where it should be directed.


AWS Hero, Microsoft MVP... something about those sorts of recognition seems to attract certain types of people - types seeking the limelight - often not particularly good, ethical, moral, or pleasant types.


This sounds like a legit complaint. Does anyone call them out on this?


I think localstack[1] gets you a lot of this.

[1] https://localstack.cloud/


I was just about to suggest localstack. I use it religiously in personal projects; can't recommend it enough. I haven't started telling it to induce errors yet, but it definitely has that capability. And if you're running it in docker, some of the network stuff can be simulated that way as well.


Too bad it's a resource hog though. Startup time is really bad, and using it via Testcontainers takes patience.


Fascinating!

The mere existence of such an api would be an interesting source of problems when used accidentally/via buggy code ...

I wonder to what degree this functionality would end up being relied on as an "in the worst case, hard-kill things to recover" behavior that folks utilize for bad engineering reasons as opposed to good ones ...


Compare https://en.wikipedia.org/wiki/Crash-only_software and https://lwn.net/Articles/191059/

In a previous job we actually had to make a decision like this: our server software was generally stable enough, but we reset it whenever we did an update. Every once in a while it would break when running for long enough.

I opted for forcefully restarting the software every day, because I'd rather relive the pains of restarting than discover in production, for the first time, what it's like to run the system on its 501st day of uptime.

(To be honest, the system in question was mostly written in Erlang, so controlled restarts were already baked into its philosophy.)


I would assume said code should never be part of actual deployments or only part of unit tests, maybe an external project even.


Oh you, sweet summer child...


Totally. I have added wifi power outlets to installations because process restart, OS restart, and hardware restart did not get a system back up and running. Yes it was a startup, and yes it was remotely installed hardware.


Might be a bit strange since you're saying this is an example of bad planning, but my employer was discussing this today. We had a hard time finding options; what did you end up deploying, or did you develop your own in house?


You want to search for "PDU remote management"

Edit: Another term is Switched PDU


When I had this issue I just bought some cheap wifi 'smart sockets' which had some open-source firmware available and reflashed them. They even support reporting the voltage and current draw from the socket (though not particularly accurately) and toggling the power through a POST (with a password if set). Was very useful for power cycling various bits of hardware remotely.


Make it hard to accidentally call and hide it in the docs somewhere with items related to testing.


> Make it hard to accidentally call and hide it in the docs somewhere with items related to testing.

This would be more effective if so many other AWS features weren't hidden in non-obvious places in the docs.


Make these functions only work in us-chaos-1


Make it expensive.


I'd imagine there is only a comparatively small (but earnest) pool of people who would want such an offering.

On top of keeping people who don't want it out, you'd need it to be expensive to offset the lower utilization.

I'd be okay with a garbage-tier, too. It's not failing on purpose, but only because no one gives a shit. Maybe that's just standard AWS?


> I asked for an API to do a hard power off of an instance. To this day, you can only do a graceful power off. I want to know what happens when the instance just goes away.

Wouldn't just running "halt -f" do the same?


Possibly, but how can we be 100% sure without the ability to compare behavior? If I were following this line of research I'd still want to know if there's any difference in the nature of a failure when it comes from within the OS (possibly simulated by halt -f) and the situation the parent OP pointed out where the instance just goes poof without sending any kind of signal to the OS itself.


Though of course then the trouble is, if AWS is simulating specific behaviour, it won't be exactly the same as real problems when they occur. It is a bit better, but hard to say how much better.

I'd think the key on this is being able to simulate very specific partial-failure conditions. e.g. specific packet loss, loss of connections to EBS, etc. Just turning machines off I expect wouldn't be that valuable.


> Though of course then the trouble is, if AWS is simulating specific behaviour, it won't be exactly the same as real problems when they occur.

If Amazon simulated power failure, or network cable disconnect, or potentially even corrupt writes to disk, I personally feel it would be indistinguishable from the event really happening.


The problem is that they can't really realistically simulate that for a single instance unless you've got the whole physical machine to yourself, and of course the worst issues tend to be when it's half-broken, not entirely off/disconnected


Power loss or network disconnect could easily be simulated without needing the whole physical machine.


Kernel panics work very well for this use case.

echo c > /proc/sysrq-trigger


Maybe for testing the crash-resilience of software running on the node; but not necessarily for testing how the SDN autoscaling glop you've got configured responds to the node's death.

A panicking instance is still "alive" from its hypervisor's perspective (either it'll hang, sleep, or reboot, but it won't usually turn off its vCPU in a way the hypervisor would register as "the instance is now off"); while if a hypervisor box suffers a power cut, the rest of the compute cluster knows that the instances on that node are now very certainly off.


echo o > /proc/sysrq-trigger will shut down an AWS instance immediately. I've used it a couple of times to make really sure my costly spot instance was terminated.

https://www.kernel.org/doc/html/v4.11/admin-guide/sysrq.html...


https://elixir.bootlin.com/linux/latest/source/kernel/power/...

Looks like SysRq-o does a clean poweroff, not a dirty/immediate one — it calls kernel_power_off(), not machine_power_off().

This means, importantly, that it can actually take some time to happen, as drivers get a chance to run deinitialization code. This also means that a wedged driver that doesn't respond during deinit can prevent the kernel from halting.

Thus, while SysRq-o might be useful for killing a wedged userland, it's not a panacea — especially, it isn't guaranteed to complete a shutdown for unstable kernels, or kernels with badly-written DKMS drivers attached. It's not truly equivalent to a power-cut.


For reference: https://unix.stackexchange.com/a/66205 (you obviously need to be root)


To give an example of the difference: does it fsync() before stopping?


It would be really close yes, but not exactly the same as ripping the power cord out. It still gives the OS a hint that shutdown is coming.


Not exactly. "halt -f" depends on some variant of power management working. For extra fun, you could also have partial power-offs, like say a power cut to the motherboard, or any combination of other devices with their own power lines from a power supply. (in a current PC these would typically be PCI supplementary power, motherboard, SATA power, 4-pin 12v molex power, supplemental CPU power.)

Granted, most of those are out of scope for cloud development, but the concept of externally cutting off a VM is different than even calling whatever power service you have to cut power. In the 'real' world, enough of those failure triggers above would probably also trigger an automated power cycle if you're in a managed environment.


To elaborate a bit further - when things are fine, they're, well, fine. Unfortunately, the assumptions about how computers actually function go out of the window when enough things go wrong, like lightning striking the wrong thing. Things go weird sometimes: accepting TCP connections but not sending responses, or a box you can't ping but can SSH into, or vice versa. In a cluster with automatic failover, you want to be sure the bad node is dead dead dead, because a bad machine can cause harm to all the packets in the rack. (E.g. HA with floating IPs and ARP takeover.) Thus the requirement is to forcefully disable the bad node, which is where the acronym STONITH comes from - Shoot The Other Node In The Head. (HA clusters with more than two nodes existed, but were rarer than 2-node clusters simply due to cost.) Before cloud computing was so pervasive, high availability was implemented with serial cables between the physical hardware, and when one machine stopped responding, the equipment basically yanked its power cable out.


I want a stress testing feature where I can accidentally misconfigure an instance or resource somehow, then when I get a bill that's unexpectedly $10,000 higher at the end of the month, I can declare that I was only testing, and I don't have to pay it!


I wonder how much you would have to pay Amazon for them to send a tech down to the datacenter and pull the plug on your running node


It's AWS; surely half of the work would be finding the server that happens to be running your instance, anonymously sitting in a datacenter with thousands of other servers. There's also the question of how many other customers' instances are located on the same hardware.


Today you can use spot block instances for this. They are guaranteed to die off after your chosen time block of 1-6 hours.


As far as I know it still sends a graceful shutdown at the end of the time block. It sends exactly the same signal as if you use the shutdown API, which is the same as pressing the power button on the front of the machine.


Wouldn't a simple cron that disconnects the networking on an instance randomly and/or pegs all the cores be equivalent?


It won't simulate the block storage data loss, which is a key part of testing a database, filesystem or similar for robustness to those events.

Even killing a VM instantaneously on a host only simulates the loss of the guest OS's cache. The host OS still has its cache.

And even killing the host OS only simulates loss of the host OS's cache. The drives still have their caches powered, so might behave differently on power loss than host OS crash.

And killing a drive while the host keeps running is different again.

(And in terms of high-level software-observable data effects, abruptly killing power to a drive is not the same as slowly lowering the voltage or limiting the current to the drive, or seeing corrupt data on the I/O bus while the host system loses power (which is never instant), or... you can go quite deep with this.)

In the networked storage environment at AWS, who knows what events are possible to observe on EBS in the event of a system-level failure such as power loss in a data centre.

For example, EBS is a distributed system with replication, so if it has an implementation error that only affects certain untested, complex failure modes, it might lose recent writes in flight (as expected), but then some time later they might come back, or some of them might come back if it's sharded. Both are bad if software using it has already started state recovery after an outage. Distributed systems often have surprising incorrect recovery patterns, because it's a complex and subtle problem; that's why the Jepsen tests keep finding bugs in databases people have been using for years.

The observable events can differ depending on whether it's VM hosts, switches, block storage units, drives failing and in what ways.

For Linux guests we have abstracted away most differences we care about under fsync() or various combinations of O_DIRECT and virtual HDD cache disabling, or relying on known features of filesystems, but there's no easy way to be certain those abstractions actually provide the expected semantics on different kinds of system failure. Stress testing higher layers in the virtualization stack would go some way to verifying that they do in practice, not just in theory.


There are lots of ways to get a very close experience to a power pull, but nothing you can do from within the VM is quite the same as doing it outside the VM.


I believe you but I'm curious as to the subtle difference.

To my mind simply turning off the networking, physically cutting a cable, or hardware spontaneously combusting are not discernible events to an outside observer. What am I missing?


For tasks that are writing to disk, a network outage wouldn't stop the task?


If the disk is network attached it would, and if it isn't, what difference does it make?


Well, we would know that if we could try it.


On a fully virtualized system, echo o > /proc/sysrq-trigger is going to be as close to a forced shutdown as you can get. That won't test hardware/firmware issues, but neither will stopping a VM anyway.


Surely it eventually kills you if you just ignore the signal...?! (Though of course, by that time, we've missed the point of this exercise.)


Correct!


Nearly a decade ago I was working on Xen (or rather XenServer). We had lots of fun implementing these kinds of wonky devices for the virtual machines that would randomly drop network packets or fail to read the hard disk in arbitrarily bad ways.


Dumb question here from a CE beginner but can’t you have a Docker image for that service and turn it off?


OS-level 'turn off' options don't replicate what happens when you yank power on a rack of equipment.

Pretty much every option you have from the OS will let caches flush, will have in-progress writes complete.

Yank the power and none of that will happen. It'll let you actually see what level of fibbing your OS and hardware are telling you.

Oh you got a result from that flush to disk? It completed? Are you sure? Really really sure? Lets find out...


The main result of this testing is that you'll find bugs in all the abstraction layers that you can't control. You'll then be paranoid but unable to take any action. Have you ever seen the source code that's running on your SSD's CPU? Nope. And it's probably not a small program. My guess is that it works well when everything is OK, and fails catastrophically 0.0001% of the time when everything isn't OK. But you'll never know unless you try failing catastrophically a million times. Did the vendor even try failing catastrophically one million times before they started manufacturing (or at least sent the batch to retailers)? Did they do that, fail 1 time, and mark the bug closed as non-reproducible?

I have no idea! Maybe everything is actually great. Or maybe someone else will be oncall the week you hit that one in a million failure case. Or maybe it will happen to your competitors instead of you! Without testing, all we have is hope.


> You'll then be paranoid but unable to take any action.

I disagree.

While you can't control the bugs in the hardware or drivers or whatever, you can definitely ensure that your application detects corrupted data and can at least warn that something is wrong rather than starting back up and silently serving from corrupted data.

It also ensures that a node suddenly disappearing without a graceful "I'm leaving" message, and potentially with open transactions or other operations can be handled by the clients and nodes that ARE still around.
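
As a sketch of the first point (file names invented): a startup check that refuses to serve from a data file whose checksum no longer matches what was recorded when it was written:

  package main

  import (
      "crypto/sha256"
      "encoding/hex"
      "fmt"
      "log"
      "os"
      "strings"
  )

  // verify recomputes the SHA-256 of dataPath and compares it to the hex
  // digest stored in sumPath, returning an error instead of serving bad data.
  func verify(dataPath, sumPath string) error {
      data, err := os.ReadFile(dataPath)
      if err != nil {
          return err
      }
      want, err := os.ReadFile(sumPath)
      if err != nil {
          return err
      }
      sum := sha256.Sum256(data)
      if hex.EncodeToString(sum[:]) != strings.TrimSpace(string(want)) {
          return fmt.Errorf("%s is corrupt: checksum mismatch", dataPath)
      }
      return nil
  }

  func main() {
      // Hypothetical paths; in a real system this runs before the node rejoins.
      if err := verify("store/segment-0001.db", "store/segment-0001.db.sha256"); err != nil {
          log.Fatal(err) // refuse to start rather than serve corrupted data
      }
      fmt.Println("data verified, safe to serve")
  }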


You've hit on one of my big pet peeves -- I am going to add redundancy and erasure codes at the application layer because the disk could catch on fire or be sucked up by a tornado, so why do I need a bunch of unaudited software adding extra redundancy and erasure codes behind my back?

I get it for desktop users... they want to open some benchmark they downloaded and get a higher number than their friends (RAID-0-esque sharding data across individual flash chips), and they also don't have access to software that can add error correction codes (because the Windows installer can't recognize an SSD with 4 flash chips as 4 drives in a RAID array)... so the disk has to do it itself. But why is it a thing in datacenters? Just so you can install Linux+ext2+MySQL and call it a day? That seems crazy to me when better storage software exists.


Aren't there SSDs that go in the opposite direction and expose a key-value store instead of a linear address space? Seems like that will make reliability testing even harder.


Even with testing, it’s still just hope; to quote Buzz Lightyear, it’s hoping with style.


I think this is too pessimistic. If all you do is test your error detection and backup recovery mechanism (of course you have backups, right?), then that's an advantage.

Let’s say a hard power off causes data corruption that can’t be fixed. For lots of applications downtime is better than corruption, so in that case you at least will be able to test when you should take the system down and recover from a known good backup.


kill -9 on the VM process from the host?


If you're using Docker, yes, you can get a lot more options in testing by poking "from the outside", but that still doesn't test what happens when your docker host just dies.


That's not a dumb question at all.

I'm not the GP, but I assume it is because the GP would like to run his "production" setup (or even the production setup itself) under such circumstances. So far as I remember, the origin of the idea came from the Chaos Monkey, which would sabotage certain services in a production environment to ensure that everything was redundant and fail-safe.


> Basically I want an API where I can torture an EC2 instance to see what happens to it, for science!

  # cat > dropme.sh <<'EOFILE'
  #!/bin/sh
  set -eu
  read -r SLEEP
  tmp=`mktemp` ; tmp6=`mktemp`
  iptables-save > $tmp
  ip6tables-save > $tmp6
  for t in iptables ip6tables ; do 
    for c in INPUT OUTPUT FORWARD ; do $t -P $c DROP ; done
    $t -t nat -F ; $t -t mangle -F ; $t -F ; $t -X
  done
  sleep "$SLEEP"
  iptables-restore < $tmp
  ip6tables-restore < $tmp6
  rm -f $tmp $tmp6
  EOFILE
  # chmod 755 dropme.sh
  # ncat -k -l -c ./dropme.sh $(ifconfig eth0 | grep 'inet ' | awk '{print $2}') 12345 &
  # echo "60" | ncat -v $(ifconfig eth0 | grep 'inet ' | awk '{print $2}') 12345
If you're lucky the existing connections won't even die, but the box will be offline for 60 seconds.


Not a bad script, but it's basically going "shields up!", blocking all traffic, and then taking them down. Definitely useful, but won't test for the hard stuff, like if my DB was in the middle of something and then died, scrambling my data. What's that gonna look like on reboot?


Did you try to set up an on premis Eucalyptus cloud for that? Eucalyptus has an API compliant with the AWS API.


I really wish the world could agree on a better term than 'on premis' for 'not in the cloud'. At least 'on premises' is descriptive of the situation, although a bit clunky. 'On premis' doesn't make any sense at all. 'Premis' is not the singular of 'premises'; 'premises' is the singular of itself! I prefer 'on site' or 'self hosted' myself.


I think it's a plurale tantum

https://en.wikipedia.org/wiki/Plurale_tantum

That is, it seems to be grammatically plural but not have a distinctive singular.

     4. pl. A piece of real estate; a building and its adjuncts;
        as, to lease premises; to trespass on another's premises.
        [1913 Webster]


> Basically I want an API where I can torture an EC2 instance to see what happens to it, for science!

And one day there will be PETA [0] for EC2 instances!

[0]: https://www.peta.org/


Isn’t us-east-1 exactly that?

All jokes aside, I actually asked my google cloud rep about stuff like this; they came back with some solutions but often the problem with that is, what kind of failure condition are you hoping for?

Zonal outage (networking)? Hypervisor outage? Storage outage?

Unless it’s something like s3 giving high error rates then most things can actually be done manually. (And this was the advice I got back because faulting the entire set of apis and tools in unique and interesting ways is quite impossible)


Yeah, us-east-1 is pretty good at failing already. We lost us-east-1c for most of the day about a week ago due to a fiber line being cut. I'd estimate that AWS manages fewer than "three 9s" in us-east-1 on average. Not across the board, but at any given time something has a decent chance of not working, be it an entire AZ, or regional S3, etc. They're still pretty reliable, and I like the idea of a zone with built-in failure for testing things, but your joke about us-east-1 is based in solid fact.


> Unless it’s something like s3 giving high error rates

Just firewall off the real s3, and point clients at a proxy which forwards most requests to the real s3 and returns errors or delays to the rest.
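
A rough sketch of such a proxy in Go (error and delay rates invented; real S3 traffic would also need Host header and request-signing handling, so treat this as illustrative only):

  package main

  import (
      "log"
      "math/rand"
      "net/http"
      "net/http/httputil"
      "net/url"
      "time"
  )

  func main() {
      // Point clients at localhost:9000 instead of the real endpoint.
      target, _ := url.Parse("https://s3.amazonaws.com")
      proxy := httputil.NewSingleHostReverseProxy(target)

      handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
          switch {
          case rand.Float64() < 0.05:
              time.Sleep(2 * time.Second) // injected latency, then forward normally
          case rand.Float64() < 0.10:
              http.Error(w, "injected InternalError", http.StatusInternalServerError)
              return
          }
          proxy.ServeHTTP(w, r)
      })

      log.Fatal(http.ListenAndServe(":9000", handler))
  }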


I found that in practice for APIs, delays are often much worse than errors.

Mostly because programmers seem to have an easier time thinking about errors, and their programming language might even encourage them to handle these kinds of errors, but arbitrary delays often slip by unanticipated.


Google's approach to RPC deadline propagation solves this. I'm sad that the vast majority of libraries don't support it.

The simple concept is that all services/APIs should take a deadline parameter. The call should either complete before that time or return an error. If you are writing code for a service and you get an incoming request, you send the same deadline with any requests you make to other services.
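
A minimal sketch of the idea in Go, assuming the budget travels as a made-up "X-Deadline-Ms" header (milliseconds remaining) rather than whatever Google's stack actually puts on the wire:

  package main

  import (
      "context"
      "net/http"
      "strconv"
      "time"
  )

  func handler(w http.ResponseWriter, r *http.Request) {
      budget := 2 * time.Second // default if the caller sent no deadline
      if ms, err := strconv.Atoi(r.Header.Get("X-Deadline-Ms")); err == nil && ms > 0 {
          budget = time.Duration(ms) * time.Millisecond
      }
      ctx, cancel := context.WithTimeout(r.Context(), budget)
      defer cancel()

      // Pass the *remaining* budget along to the next hop (hypothetical URL).
      req, _ := http.NewRequestWithContext(ctx, http.MethodGet, "http://downstream.internal/do", nil)
      if deadline, ok := ctx.Deadline(); ok {
          req.Header.Set("X-Deadline-Ms", strconv.FormatInt(time.Until(deadline).Milliseconds(), 10))
      }

      resp, err := http.DefaultClient.Do(req) // errors out if the budget runs out first
      if err != nil {
          http.Error(w, "downstream failed or deadline exceeded: "+err.Error(), http.StatusGatewayTimeout)
          return
      }
      resp.Body.Close()
      w.WriteHeader(http.StatusOK)
  }

  func main() {
      http.HandleFunc("/work", handler)
      http.ListenAndServe(":8080", nil)
  }

The key property is that the budget only shrinks as it flows downstream, so a caller's timeout can't be silently exceeded by work happening several hops away.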


As long as everything is idempotent. You don't want to have this happen:

Host A calls host B

Host B runs SQL+COMMIT

DB commit finishes

Host B deadline expires

DB success reaches host B

Also would need fanatical time synchronization. Maybe even PTP rather than NTP?


Google has awesome clock sync in its datacenters (sub millisecond), and in general all RPC calls should be idempotent (since the network or either host can fail at any point).

I think one might be able to make this approach work with poor clock sync by having the request contain the number of milliseconds left, rather than an absolute time. It isn't great if network delays dominate, but typically it's queueing delays that dominate most webservice responses.


Those clocks are not available to the general public, and that is one of the reasons that things like Spanner have an edge over alternatives.


Though nothing is stopping Amazon or Microsoft from putting atomic clocks into their own datacentres.


This (or equivalent) will happen. You can either design your software to handle it correctly, or you can have buggy software.

Clock sync doesn't really help, especially if a request can trigger multiple independent actions that can individually succeed or fail.

The simplest example where this goes wrong is posting a post on Reddit. You write your text into a text box and submit. You get an Error 500. You cannot know if your post went through (even if you go check whether your post is visible, sometimes the replica that you got your response from hasn't gotten your post yet). And they don't do any deduping, so if you just retry, you may end up with multiple copies of your reply.


ap-southeast-2-b in my experience hahaha


[Disclaimer: I work as a software engineer at Amazon (opinions my own, obvs)]

The chaos aspect of this would certainly increase the evolutionary pressure on your systems to get better. You would need really good visibility into what exactly was going on at the time your stuff fell over, so you could know what combination(s) to guard against next time. But there is definitely a class of problems this would help you discover and solve.

The problem with the testing aspect, though, is that test failures are most helpful when they're deterministic. If you could dictate the type, number, and sequence of specific failures, then write tests (and corresponding code) that help make your system resilient to that combination, that would definitely be useful. It seems like "us-fail-1" would be more helpful for organic discovery of failure conditions, less so for the testing of specific conditions.


> The problem with the testing aspect, though, is that test failures are most helpful when they're deterministic.

Let's not let `perfect` get in the way of `good`.

Certainly having a 100% traceable system would be ideal, but most systems are not that.

There is still a TON of low-hanging, easy-to-find issues that would automatically fall out of a system of random failures. Even if engineers have to spend some time figuring out what the hell is going on, it would overall improve their system because it would shine a bright shiny flashlight on the system to let them know "Hey, something is rotten here". From there, more deterministic tests and better tracing can be added.


"The chaos aspect of this would certainly increase the evolutionary pressure on your systems to get better. You would need really good visibility into what exactly was going on at the time your stuff fell over, so you could know what combination(s) to guard against next time. But there is definitely a class of problems this would help you discover and solve. "

Error conditions you already know about are easy to test and code against. I would guess most system failures come from conditions nobody expected. When you have a randomly failing system you can discover them. Fixing problems and testing for them will then be easy in comparison.

For example: years ago I worked on a video streaming solution. One day we got a device that would garble our traffic randomly, slow it down, and so on. Things started crashing left and right. Within a month we squashed hundreds of bugs and had a rock solid system that was basically impossible to crash.

I always wondered, for AWS and other cloud systems, how you could prepare for problems. It's hard to predict what can fail in what ways, and you can't really force error conditions. I really like this idea of a cloud where everything breaks.


When I worked at Skype / Microsoft and Azure was quite young, the Data team next to me had a close relationship with one of the Azure groups who were building new data centers.

The Azure group would ask them to send large loads of data their way, so they could get some "real" load on the servers. There would be issues at the infra level, and the team had to detect this and respond to it. In return, the data team would also ask the Azure folks to just unplug a few machines - power them off, take out network cables - helping them test what happens.

Unfortunately, this was a one-off, and once the data center was stable, the team lost this kind of "insider" connection.

However, as a fun fact, at Skype we could use Azure for free for about a year - every dev in the office, for work purposes (including work pet projects). We spun up way too many instances during that time, as you'd expect, and only got around to turning them off when Azure changed billing to charge 10% of the "regular" pricing for internal customers.


When I was at Google, as a developer you officially got unlimited space in the internal equivalent of Google Drive.

I always wondered how many people got some questions from the storage team, if they really needed all those exabytes.


It sounds to me like what some people would like is a magical box they can throw their infrastructure into that will automatically shit-test all the things that could potentially go wrong for them. This is poor engineering. Arbitrary, contrived error conditions do not constitute a rational test fixture. If you are not already aware of where failures might arise in your application and how to explicitly probe those areas, you are gambling at best. Not all errors are going to generate stack traces, and not all errors are going to be detectable by your users. What you would consider an error condition for one application may be a completely acceptable outcome for another.

This is the reliability engineering equivalent of building a data warehouse when you don't know what sorts of reports you want to run or how the data will generally be used after you collect it.


I disagree.

Not handling failures correctly is a time honored tradition in programming. It is so easy to miss.

For example, how often have you seen a malloc check for `ENOMEM`?

Even though that is something that could be semi common. Even though that's definitely something you might be able to handle. Instead, most code will simply blow chunks when that sort of condition happens. Is the person that wrote it "wrong"? That's debatable.

Some languages like Go make it even trickier to detect that someone forgot to handle an error condition. Nothing obvious in the code review (other than knowledge of the API in question) would get someone senior to catch those sorts of issues.

So the question is, HOW do you catch those problems?

The answer seems obvious to me: you simulate problems in integration tests. What happens when Service X simply disappears? What happens when a server restarts mid-communication? Is everything handled, or does this cause the apps to go into a non-recoverable mode?
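
For instance, a sketch of the "Service X simply disappears" case using Go's httptest (package and handler names invented; a real test would exercise your actual client code against the fake):

  package myservice_test

  import (
      "net/http"
      "net/http/httptest"
      "testing"
      "time"
  )

  func TestSurvivesDependencyVanishing(t *testing.T) {
      // Stand-in for "Service X".
      fake := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
          w.Write([]byte(`{"ok":true}`))
      }))

      client := &http.Client{Timeout: 500 * time.Millisecond}

      // First call succeeds while the dependency is up.
      if _, err := client.Get(fake.URL + "/thing"); err != nil {
          t.Fatalf("unexpected error while dependency is up: %v", err)
      }

      // Now the dependency "simply disappears" mid-test.
      fake.Close()

      // The code under test should surface a handled error, not hang or panic.
      if _, err := client.Get(fake.URL + "/thing"); err == nil {
          t.Fatal("expected an error after the dependency went away")
      }
  }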

These are all great infrastructure tests that can catch a lot of edge case problems that may have been missed in code reviews. Even better, that sort of infrastructure testing can be generalized and applied to many applications. Making rare events common in an environment makes it a lot easier to catch hard-to-notice bugs that everyone writes.

It's basically just fuzz testing but for infrastructure. Fuzz testing has been shown to have a ton of value, and infrastructure fuzzing seems like a natural, valuable extension of that. Especially when high reliability and low maintenance are something everyone should want.


>For example, how often have you seen a malloc check for `ENOMEM`?

I've never concerned myself with that, but it's apparently a lot more complicated than just doing that check.

See:

https://news.ycombinator.com/item?id=20143277

https://scvalex.net/posts/6/

Some of the thread is hilarious, for instance:

"malloc is not allowed to return a non-null pointer to a memory block that cannot be written to. Linux does it anyway, and in doing so blatantly violates the standard.

...

That is my point, it doesn't fail. Either the kernel finds out a way to map the memory and it succeeds, or it kill the program and the instruction never runs. Code that doesn't run can't violate the standard."


As a community we have basically decided against handling out of memory; it's better to crash, and design the program so that such a crash will not cause corrupt state.

Consider over-commit. It means that the malloc call will succeed even if there isn't available memory. Instead the system just hopes you won't make use of the memory you asked for. And if you do make use of it, well then the OS might kill you. Just like that. No opportunity for error handling.


Reliably crashing _is_ a way to handle out of memory errors.

When you don't handle it, you get something closer to undefined behaviour instead.


Okay I'll bite - I have problems with this line of thinking.

You're right that you'll never be able to cover 100% of all cases, but using that logic, your specs will never test 100% of scenarios, so you shouldn't write specs.

I think the problem is you assumed it was "a magical box that people can rely on to cover 100% of test cases".

That's a poor leap. It's clearly not. It's a good system to test a set of network failures at varying degrees. Just like any engineered system, it needs to be documented what it can and cannot do.

I also eagerly await my do it all, 100% magic box.


> building a data warehouse when you don't know what sorts of reports you want to run or how the data will generally be used after you collect it.

Hi Bob, I can't tell you what reports I want or what we'll do with the data until you've first collected the data for analysis. Thanks!


Why are you vacuuming up data if you don't even know what sort of outcomes you want to look for? How do you know you are even pulling from the right sources?

When you do finally figure out why you are in business, you may find that all the data you gathered is worthless. Sometimes it's really innocuous stuff like storing datetime values without enough resolution or accuracy. Only when you know how you plan to use the data can you ensure you are gathering it properly.

Teaching developers to collect all the data they can in hopes of being able to write arbitrary reports down the line is reckless behavior. Hyperscalers love pushing this ideology with inane bullshit like "Datalakes", so you can see where a conflict of interest may arise in the industry.


Ever worked with an enterprise company?

I've worked on a project (in a peripheral role, fortunately) where the entirety of the CIO's mandate was to "use AI", full stop. So the team was tasked with putting together a data warehouse that would collect random logs and app metrics, and then sprinkle AI fairy dust of their choice onto it to generate Insights(tm).

This kind of thing happens all. the time. in large companies where everybody involved wants either money or bullet points for their resume, and the actual outcome is nearly irrelevant.


When I implement a new service, I have to make an executive decision about which metrics to collect (operation counts, durations, storage usage, etc.). Before the first incident, I cannot know with 100% certainty which metrics are useful. Only after the incident can I look at the metrics and see which ones did weird things before and during the incident, and that helps drive my decisions on what to alert on.

I don't see how I could do it differently. I can only truly know which metrics are useful from experience with past incidents, but I can only gain this experience when the metrics were already collected during the past incident.


I don't see a us-fail-1 region being set up for a number of reasons.

One, this is not how AWS regions are designed to work. What they're thinking of is a virtual region with none of its own datacenters, but AWS has internal assumptions about what a region is that are baked into their codebase. I think it would be a massive undertaking to simulate a region like this.

(I don't think a fail AZ would work either, arguably it'd be worse because all the code that automatically enumerates AZs would have to skip it, which is going to be all over the place.)

Two, set up a region with deliberate problems, and idiots will run their production workload in it. It doesn't matter how many banners and disclaimers you set up on the console, they'll click past them.

When customer support points out they shouldn't be doing this, the idiot screams at them, "but my whole business is down! You have to DO something!" This would be a small number of customers, but the support guys get all of them.

Three, AWS services depend on other AWS services. There are dozens of AWS services, each like little companies with varying levels of maturity. They ought to design all their stuff to gracefully respond to outages, but they have business priorities and many services won't want to set up in us-fail-1. When a region adds special constraints, it has a high likelihood of being a neglected region like GovCloud.


I don't work with the group directly, but one group at our company has set up Gremlin, and the breadth and depth of outages Gremlin can cause is pretty impressive. Chaos Testing FTW.


I’ve also had a customer who used Gremlin to dramatically improve their stability.


Along the same vein, instead of the typical "debug" and "release" configurations in compilers, I'd love it if there was also an "evil" configuration.

The evil configuration should randomise anything that isn't specified. No string comparison type selected? You get Turkish. All I/O and networking operations fail randomly. Any exception that can be thrown, is, at some small rate.

Or to take things to the next level, I'd love it if every language had an interpreted mode similar to Rust's MIR interpreter. This would tag memory with types, validate alignment requirements, enforce the weakest memory model (e.g.: ARM rules even when running on Intel), etc...


A zone not only of sight and sound, but of CPU faults and RAM errors, cache inconsistency and microcode bugs. A zone of the pit of prod's fears and the peak of test's paranoia. Look, up ahead: Your root is now read-only and your page cache has been mapped to /dev/null! You're in the Unavailability Zone!


Your conductor on this journey through the Unavailability Zone: the BOFH!


That region is called Microsoft Azure. It will even break the control UI with high frequency.


I was going to post this but you beat me to it.

We are forced to use Azure for business reasons where I work, and the frequency of one off failures and outages is insane.


Thank you, this is the exact comment I was looking for


I imagine AWS and other clouds have a staging/simulation environment for testing their own services. I seem to recall them discussing that for VPC during re:Invent or something.

I'm on the fence, though, about whether I'd want a separate region for this with various random failures. I think I'd be more interested in being able to inject faults/latencies/degradation in existing regions, and when I want them to happen, for more control and the ability to verify any fixes.

Would be interesting to see how they price it as well. High per-API cost depending on the service being affected, combined with a duration. Eg, make these EBS volumes 50% slower for the next 5min.

Then after or in tandem with the API pieces, release their own hosted Chaos Monkey type service.


Show HN! Introducing my new SPaaS:

Unreliability.io - Shitty Performance as a Service.

We hook your accounting software up to api.unreliability.io and when a client account becomes delinquent, our platform instantly migrates their entire stack into the us-fail-1 region. Automatically migrates back again within 10 working days after full payment has cleared - guaranteed downtime of no less than 4 hours during migration back to production region. Register now for a 30 day Free Trial!


I want this at the programming language level too. If a function call can fail, I want to set a flag and have it (randomly?) fail. I hacked my way around this by adding a wrapper that would, at random, return an error for a bunch of critical functions. It was great for working through a ton of race conditions in golang with channels, and remote connections, etc. But hacking it in manually was annoying and not something I'd want to commit.
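
The shape of it was roughly this (a from-memory sketch; the flag and names are invented):

  package faultinject

  import (
      "errors"
      "flag"
      "math/rand"
  )

  // Run with -inject-faults=0.1 to make roughly 10% of wrapped calls fail.
  var faultRate = flag.Float64("inject-faults", 0, "probability of injecting an error into wrapped calls")

  // Maybe wraps an error-returning call: err := faultinject.Maybe(conn.Send(msg))
  func Maybe(err error) error {
      if *faultRate > 0 && rand.Float64() < *faultRate {
          return errors.New("injected fault")
      }
      return err
  }

With the flag off it's a no-op, but it's still exactly the kind of thing I didn't want to leave scattered through committed code.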


Failing individual compute instances isn't hard; some chaos script to kill VMs is enough. The worst situations are when things seem to be up but aren't acceptable: abnormal network latency, random packet drops, random but repeatable service errors, lagging eventual consistency. Not even mentioning any hardware woes.


While these are not exclusive, personally I'd look instead into studying my system's reliability in a way that is independent of a cloud provider, or even of performing any side-effectful testing at all.

There's extensive research and work on all things resilience. One could say: if one builds a system that is proven to be theoretically resilient, that model should extrapolate to real-world resilience.

This approach is probably intimately related to pure-functional programming, which I feel has not been explored enough in this area.


There are multiple methods for automating AWS EC2 instance recovery for instances in the "system status check failed" or "scheduled for retirement event" cases.

I've yet to figure out how to test any of those CloudWatch alerts/rules. I've had them deployed in my dev/test environments for months now, after having to manually deal with a handful of them in a short time period. They've yet to trigger once since.

Umbrellas when it's raining etc.


This is why it seems like it would be good to have explicit fault injection APIs instead of assuming that the normal APIs behave the same as a real failure.


Wait. I thought this was ap-southeast-2


Whichever region Quora is using.


Why does Twitter often fail to load when I open a thread, yet if I refresh it works? Does Twitter use us-fail-1?


I don't know why, but it happens to everyone and it's been that way for a long time. Either their engineers are failing, or there's some sketchy monetary reason for it. You're not the only one.


I think they don't like browsers they can't fingerprint, or something like that.


You mean the "Click here to reload Twitter" thing? That's spam protection, I hit this regularly after restarting Chrome (with ~600 tabs).


Ya but clicking that doesn't even work. And I rarely go on Twitter so I don't know why they would do it intentionally.


I get this too. Seems to always be desktop


For me it's the exact opposite -- always on mobile, always when logged out (haven't extensively tested being logged in on mobile, since I'm barely logged in to Twitter on mobile.)


I think people overestimate the importance of failures of the underlying cloud platform. One of the most surprising lessons of the last 5 years at my company has been how rarely single points of failure actually fail. A simple load-balanced group of EC2 instances, pointed at a single RDS Postgres database, is astonishingly reliable. If you get fancy and build a multi-master system, you can easily end up creating more downtime than you prevent when your own failover/recovery system runs amok.


At my job, my team owns a service that generally has great uptime. Dependent teams/services have gotten into the habit of assuming that our service will be 100% available which is problematic because it's obviously not. That false assumption has caused several minor incidents unfortunately.

There has been some talk internally of doing chaos engineering to help improve the reliability of our company's products as a whole. Unfortunately, the most easily simulatable failure scenarios (e.g. entire containers go down at once instantly, etc.) tend to be the least helpful since my team designed the service to tolerate those kinds of easily modelable situations.

The more subtle/complex/interesting failure conditions are far harder to recognize and simulate (e.g. all containers hosted on one particular node experience 10s latencies on all network traffic, stale DNS entries, broken service discovery, etc.).


You can just do this yourself. Google breaks its systems intentionally for a week every year; it's called DiRT week. DiRT takes weeks of planning before people even start debugging.

Doing this constantly for all products in a single region would be absolutely exhausting for SRE teams.

(Disclaimer: I work for ^GOOG and my opinions are my own)


That is a good idea, but by yourself you can not go and yank the power cable out of an AWS box.

(I also used to work for Google, as an SRE. Fun times.

Many of our DiRT exercises usually included the stipulation: 'by the way, assume that <scarily over-competent coworker> is on vacation and not to be disturbed'.)


> That is a good idea, but by yourself you can not go and yank the power cable out of an AWS box.

Sure you can. Simulate an EC2 outage? Turn off your VMs. Lambda down? Turn off your serving version. Etc...

I think the only things that are difficult to simulate would be control plane outages for testing CI/CD pipelines--though CI/CD pipelines generally are not production critical and probably aren't worth testing directly. If you want to test recovery/repair operations when CI/CD is down you can just prohibit the victim from using it to fix things.

For non-automated API calls, you could just say that X API is down for the duration of the test


See the other comments. It's not a given that turning off these services gracefully via API is equivalent to yanking a power cable.


I remember a time when deployments to large web applications took weeks of planning.

I don't work at Google, but I imagine that making a continuously faulty region a dedicated goal would reduce the burden on everyone involved.


> I don't work at Google, but I imagine that making a continuously faulty region a dedicated goal would reduce the burden on everyone involved

This sounds nice in theory, but I think you are underestimating the vast complexity of large cloud platform internal systems. The amount of work to make this happen reliably would be immense. You would probably need a few people from each cloud SRE team with some SWE partners working for a year or longer to get this up and running. Then after that we would need to work out all of the bugs caused by these contrived failure modes and build automation to repair damage.

It's just too much work to justify when there is so much work to be done for regular prod already


Friendly reminder that for any given single availability zone, the SLA that AWS provides is one single nine. That means that they expect that availability zone to fail 10% of the time, or 6 minutes every hour. This very high failure rate comes absolutely for free, no need for a special region. Therefore, implementing a cross availability zone application that logs when packets are dropped should give you some idea of how your application handles failure.


Yeah but it's very rare that they don't hit many nines of uptime anyway.

This would be about getting them to actually match the real world behaviour to the SLA


Relevant snippet in the Google SRE Book:

https://landing.google.com/sre/sre-book/chapters/service-lev...

Google introduced exactly this to one of their internal services so that downstream dependencies can't rely on its extremely high availability.


Toxiproxy [1] is a tool that lets you create network tunnels with random network problems like high latency, packet drops or slicing, timeouts, etc.

Setting it up requires some effort (you can't just choose a region in your AWS config), but it's available now and can be integrated with tests.

[1] https://github.com/Shopify/toxiproxy


No one seems to talk about the pricing aspect of this. Developers would want these us-fail-1 regions to be cheap or free since they wouldn't be using this for production purposes. And before you know it, a lot of hobbyist developers will start using these as their production setup since they wouldn't mind a 1% downtime if they could pay less for it.


A us-fail-1 at lower price sounds like an excellent way to recycle SSD/HDDs that reach end-of-safe-life but that might still run for years.


To be honest, for the people who _really_ want this, I'd think around 10-100x pricing would be fine, or a base fee of 10k/mo or something


Simply host on Google Cloud! They will terminate your access for something random, like someone saying your name on YouTube while doing something bad! They don't have a number you can call, and their support is run by the stupidest of all AI algorithms.


There's an easier way: spot instances (and us-east-1 as mentioned)

As for things like EBS failing, or dropping packets, it's a bit tricky as some things might break at the OS level.

And given sufficient failures, you can't swim anymore, you'll just sink.


Sometimes during development instead of checking the return code I'll check rand() % 3 or something similar. I'll run through the code several times in a loop and run through a lot of the failure modes very quickly this way.


Sort of counter-intuitive, but for small projects you want resilient hardware systems as much as possible... the larger your scale-out, the less reliable you'd want them to be, to force that resilience out of hardware and into software.


This is such a clever idea. I wonder if amazon are smart enough to actually do this.


Just deploy a new region with no ops support, it'll quickly become that.


A great idea! I'd love to run stuff in this zone. Rotate through a bunch of errors, unavailability, latency spikes, power outages etc every day, make it a 12 hour torture test cycle.


I can see a use case for this being implemented on top of Kubernetes. I've no idea if that's achievable, but it could go some way toward making your code more resilient.


I believe this is available as a service called “Softlayer”


it's us-west-1! :D

we've had a ton of instances fail at once because they had some kind of rack-level failure and a bunch of our EC2s ended up in the same rack. :(


This would require AWS to invest in Chaos Monkeys.


I believe this is a service called “SoftLayer”


That region should have `us-wtf-1` as code.


Isn't this what Gremlin does?


Sounds useful. Crank it up to 99% failure and it becomes interesting science.


It actually sounds useful, to the point I wouldn't be surprised if in the near future cloud providers bundled up some chaos monkey stack and offered that with a neat price within their realms (dunno, maybe per VPC or project).


They will definitely figure out a way to charge us more for hardware that is less reliable.


That's a great idea: instead of throwing away failing hardware, toss it into the chaos region and charge double.


Try us-east-1 :)


This is called chaos engineering and many companies built tooling to do exactly this. Netflix pioneered/proselytized it years ago. Since you likely don't just rely upon AWS services if your app is in AWS, you want something either on your servers themselves or built into whatever low level HTTP wrapper you use. Use that library to do fault injection like high latency, errors, timeouts, etc.


This type of service would be a complement to those techniques, not a replacement for them. Ideally we could have both.


Came here to say this exact thing. There have been a variety of techniques to achieve this - some intrusive to your binaries (i.e. they require embedding specific libraries) and others that are more "external" (ex: tc/iptables).

The "real" challenge is not creating chaos but managing it and verifying that your apps are resilient to said chaos.


It's harder to do chaos engineering if you're not engineering it. What this is really asking for is the service provider to sell chaos engineering as a service (CEAAS?), on the services they provide. I've wanted this kind of thing for testing cloud infrastructure before: you read about various failure states and scenarios you might want to handle from docs but there's no way to trigger them so you just have to hope that they work as described and your code is correct. At the least, let users simulate the effect of the failures that are part of your API.

This would be great for testing the pieces of the stack that the provider is responsible for, but you may still want to inject chaos into the part of your stack that you do control.


Netflix runs on AWS, they are doing chaos engineering quite alright on it.


brilliant!


us-east-1?


I came here to post this and it isn't even a joke. Just true.


same here :-)


Anecdotally, I hear the South American regions are the places where the really canary stuff goes out first.


I've heard a fun story from the old timers in my org about a fiber outage in Brazil. A routine fiber cut occurred. They figure out how far from one end the cut is (there is gear that measures the time to see a light pulse reflect off of the cut end.) Then they pull out a map of where the fiber was laid, count out the distance, and send a technician out to have a look at where they expect the cut to be. All standard practice up until this point.

The technician updates the ticket after a while with "cannot find road." The folks back in the office try to send them directions, but then the technician clarifies, "road is gone." Our fiber, and the road it was buried under was totally demolished in the few hours it took to get someone out there. The developing world can develop at alarming rates.

Other tales from the middle of nowhere: people shoot at aerial fiber with guns. Or dig it up and cut it for fun. One time our technician was carjacked on the way to doing a repair.


> there is gear that measures the time to see a light pulse reflect off of the cut end

Though not very relevant to your stories about Brazil, it's a neat technique in its own right:

https://en.wikipedia.org/wiki/Optical_time-domain_reflectome...


Deployment ordering is intentionally not the same across services/teams, and said order can/does change over time within teams.


For us it’s AP-SE-2 with us-east-1 as a close second


You can use IBM cloud for that purpose


Hey, I used their loadbalancers for a couple months, and they only failed every 30 days, that's not high frequency.


I used to use their Citrix Netscaler VPX1000s at a previous job.

They were very reliable, imo. Aside from general Netscaler bullshit, we only ever had issues with them when we'd try to get them to do too much, so that the CPU or memory was overloaded.

We tried on a few occasions to get more cores allocated to them, but no. This made terminating large numbers of SSL connections on them problematic.


I was using their shared loadbalancers, not the run a load balancer in a VM option, because I was hoping for something more reliable than a single computer. For the couple months they were running, it was literally every 30 days, 10 minutes of downtime. So I went back to DNS round robin, cause it was better.


There's a microsoft azure/google cloud joke in there somewhere...


No, really, it's the IBM cloud that is the joke. This isn't the first I've heard of it, though I've not used it myself.

I'm a happy AWS user and I'll stay a happy AWS user not for their prices or features, but service. Which was the reason I was a Rackspace fan before it was sold and went down the tube.


Savage.


This is a really great idea.


This is what chaos monkey does.



