
The day I crashed production 4 times - Swizec
https://swizec.com/blog/the-day-i-crashed-production-4-times/swizec/9159
======
awinder
This post was a fun & great read -- kudos to the author! Describing our
failures & misadventures will hopefully help newbies learn more (or at
least require fewer antacids when making big prod changes).

Did wanna question / add commentary around this though:

    
    
      You can’t build partitions on null fields.
      That throws a database error.
      That database error says “Yo you can’t write this log”
      That error throws an exception
      That exception bubbles up
      Your API request fails
      Every API request fails
      ... continues till later ...
      Why was a failure in logging able to crash the system it 
      was observing? No excuse. The move to AWS Lambda 
      automatically solves the problem but damn
    

Lambda alone will definitely help with scale-related downtime, but the switch
to queueing looks like it was the big difference maker. Which is true!
Queues help a lot, and AWS-hosted queues have great uptime at that. But
depending on what this system is doing & how important the data is -- a lot of
log/metrics systems fire & forget over UDP to prevent problems like this. Your
entire metrics backend could go poof, and yeah, metrics data could go into the
ether -- but you would not take down prod through it.
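
Here's a rough sketch of what fire & forget over UDP can look like, in Python. The `emit_metric` name, the JSON payload shape, and the 127.0.0.1:8125 target are all made up for illustration (8125 is just the conventional StatsD port); none of it comes from the post:

```python
import json
import socket


def emit_metric(name, value, host="127.0.0.1", port=8125):
    """Send one metric as a UDP datagram and never raise into the caller.

    Returns True if the datagram was handed to the OS, False otherwise.
    True does NOT mean anything received it: UDP gives no delivery guarantee.
    """
    try:
        # Hypothetical payload format; a real backend defines its own.
        payload = json.dumps({"name": name, "value": value}).encode("utf-8")
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            # Non-blocking: a full send buffer raises instead of stalling
            # the request that is trying to log.
            sock.setblocking(False)
            sock.sendto(payload, (host, port))
        return True
    except Exception:
        # Swallow everything on purpose: losing a metric is acceptable,
        # failing the API request because of it is not.
        return False


if __name__ == "__main__":
    # Works the same whether or not anything is listening on the port.
    emit_metric("api.request", 1)
```

The tradeoff is the one described above: if the collector is down or the payload is bad, the data point is silently dropped, but the caller carries on with the request either way.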

------
rad_gruchalski
TL;DR: no tests.

