I have been on the saving end of some of these AWS mistakes (billing/usage alerts are important, people), but alerts alone aren't always enough to stop the damage when the spend ramps up fast enough.
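For anyone who hasn't set one up, a cost alert is only a few boto3 calls against AWS Budgets. A minimal sketch, with a placeholder account ID, limit, and email:

```python
import boto3

budgets = boto3.client("budgets")

# Placeholder account ID, budget limit, and email; swap in your own.
budgets.create_budget(
    AccountId="123456789012",
    Budget={
        "BudgetName": "monthly-cost-guardrail",
        "BudgetLimit": {"Amount": "1000", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
    },
    NotificationsWithSubscribers=[
        {
            "Notification": {
                "NotificationType": "FORECASTED",  # fires on projected spend, earlier than ACTUAL
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 80,                   # 80% of the budget limit
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [
                {"SubscriptionType": "EMAIL", "Address": "oncall@example.com"}
            ],
        }
    ],
)
```

A FORECASTED notification tends to catch the fast-moving blowups a bit earlier than an ACTUAL one, though again, not always fast enough. Anyway, the ones I've seen: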
- Not stopping AWS Glue with an insane number of DPUs attached when it wasn't being used. Quote: "I don't know, I just attached as many as possible so it would go faster when I needed it".
- Bad queueing of jobs that got deployed to production right before the end-of-month reports came out. Ticket quote: "sometimes jobs freeze and stall, can kick up 2 retry instances then fails". Not a big problem mid-month, when there was only one job a week. Then the end of the month comes along, 100+ users auto-upload data to be processed, and 100+ c5.12xlarge instances spin up; each should finish in ~2 minutes, but they hang overnight and each spawns 2 retry instances (a sketch of the retry/timeout caps is just after this list).
- Bad queueing of data transfers (I am sensing a queueing theme) that pushed DB load high enough that the r5.24xlarge instances (one big DB for everything) autoscaled to the limit of 40 instances (a sketch of capping that is at the bottom).
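On the retry storm: a hard retry cap plus a per-attempt timeout is the cheap guardrail. I don't know what that team actually ran the jobs on, so purely as a sketch, with AWS Batch and a made-up job definition and image:

```python
import boto3

batch = boto3.client("batch")

# Hypothetical job definition; the point is retryStrategy and timeout,
# so a hung job gets killed instead of spawning retries all night.
batch.register_job_definition(
    jobDefinitionName="monthly-report-crunch",
    type="container",
    containerProperties={
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/report-job:latest",
        "command": ["python", "process.py"],
        "resourceRequirements": [
            {"type": "VCPU", "value": "4"},
            {"type": "MEMORY", "value": "8192"},
        ],
    },
    retryStrategy={"attempts": 1},            # run once, no retry pile-up
    timeout={"attemptDurationSeconds": 900},  # anything past 15 min counts as "frozen and stalled"
)
```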
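And for the runaway DB scaling, the boring fix is giving the autoscaler a ceiling you can actually afford instead of whatever the service limit happens to be. The exact call depends on what was scaling; for Aurora read replicas through Application Auto Scaling it looks roughly like this (cluster name and limits are placeholders):

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Placeholder cluster name; this sets the floor/ceiling for replica count.
# The actual scale-out behaviour still comes from a separate target-tracking
# policy (put_scaling_policy), this just caps how far it can go.
autoscaling.register_scalable_target(
    ServiceNamespace="rds",
    ResourceId="cluster:reporting-db",
    ScalableDimension="rds:cluster:ReadReplicaCount",
    MinCapacity=1,
    MaxCapacity=4,
)
```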