
The real RCA (IMHO) is that outages aren't being simulated in production as part of reliability engineering.

Whatever process was stuck in a loop or crashed, or whatever service (db, dns, etc.) was unavailable, that outage scenario can be simulated. Changes can have an automated rollback requirement.
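To make that concrete, here's a rough sketch of a rollback drill. Everything in it is hypothetical (`FlakyDependency`, `deploy_with_rollback`, the `state` dict); it isn't anyone's real tooling, just the shape of "inject the outage, check health, roll back automatically":

```python
from typing import Callable


class SimulatedOutage(Exception):
    """Raised by the fake dependency while a fault is injected."""


class FlakyDependency:
    """Hypothetical stand-in for a db/dns-style service with a fault switch."""

    def __init__(self) -> None:
        self.fault_injected = False

    def query(self) -> str:
        if self.fault_injected:
            raise SimulatedOutage("dependency unavailable")
        return "ok"


def deploy_with_rollback(
    apply: Callable[[], None],
    rollback: Callable[[], None],
    health_check: Callable[[], bool],
) -> bool:
    """Apply a change, then roll it back automatically if the health check fails.

    Returns True if the change was kept, False if it was rolled back.
    """
    apply()
    try:
        healthy = health_check()
    except Exception:
        # A health check that blows up counts as unhealthy.
        healthy = False
    if not healthy:
        rollback()
    return healthy


# Drill: simulate the outage first, then ship a change and see what happens.
dep = FlakyDependency()
state = {"version": 1}

dep.fault_injected = True  # simulated outage
kept = deploy_with_rollback(
    apply=lambda: state.update(version=2),
    rollback=lambda: state.update(version=1),
    health_check=lambda: dep.query() == "ok",
)
```

With the fault injected, the health check fails and `state` goes back to version 1 without a human in the loop. In a real system the health check would also need a deadline, otherwise a stalled dependency hangs the rollback logic itself.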

My takeaway is that CF has single points of failure they're aware of and, for business reasons, has decided not to add redundancy/failover.

> ...and formally verified code, this bug would not have happened.

That's what I mean: "we should have caught the bug", yeah, but that isn't reliability engineering. You assume there will be bugs/outages and prepare for them instead. What happens if the entire DB enters a weird state and starts spitting out valid results with incorrect values? What happens if it accepts connections and just stalls?
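Both of those failure modes can be guarded against without knowing the bug in advance. A minimal sketch, assuming a hypothetical `guarded_call` wrapper (the names, the timeout, and the plausibility rule are all made up for illustration): put a deadline on the call so a stall becomes an error, and sanity-check the result so well-formed garbage gets rejected.

```python
import concurrent.futures
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class DependencyStalled(Exception):
    """The dependency accepted the call but did not answer in time."""


class ImplausibleResult(Exception):
    """The dependency answered, but the value fails a sanity check."""


def guarded_call(
    fn: Callable[[], T],
    timeout_s: float,
    is_plausible: Callable[[T], bool],
) -> T:
    """Call fn with a deadline, then validate what it returned."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(fn)
        try:
            result = future.result(timeout=timeout_s)
        except concurrent.futures.TimeoutError:
            raise DependencyStalled(f"no answer within {timeout_s}s") from None
    finally:
        # Don't block on a stalled worker; let the caller fail over now.
        pool.shutdown(wait=False)
    if not is_plausible(result):
        raise ImplausibleResult(f"rejected value: {result!r}")
    return result


def stalled_db() -> int:
    """Simulates 'accepts connections and just stalls' (briefly, for the demo)."""
    time.sleep(0.3)
    return 1


def corrupt_db() -> int:
    """Simulates a well-formed answer with an incorrect value."""
    return -5


def in_range(value: int) -> bool:
    # Hypothetical invariant: this value can never be outside 0..100.
    return 0 <= value <= 100
```

`guarded_call(stalled_db, 0.05, in_range)` raises `DependencyStalled`, and `guarded_call(corrupt_db, 1.0, in_range)` raises `ImplausibleResult`. Note the worker thread isn't killed, only abandoned, so this bounds the caller's wait rather than the dependency's work.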

You prepare for bugs that don't yet exist; you fix bugs that do exist.




