Sudden surge of traffic as all their users return to work?

ciceryadam · on Jan 4, 2021

Could be, it's the perfect time overlap between US-West, US-East, and Europe.

johnmaguire · on Jan 4, 2021

Yes - I wondered if they took some servers down prior to the break as a cost-saving measure, and forgot to reinstate them.

fragmede · on Jan 4, 2021

Doubtful. It's not impossible that a company the size of Slack would rely on a specific engineer logging on in the morning, ahead of a traffic spike, so the service can handle the load, but that would be a misuse of modern distributed cloud-based computing.

Hate on the cloud all you want, but AWS has (several flavors of) load balancers and various ways to automatically scale up and down resources (and if you're conservative, you can disable the 'down' part). If you're operating a major SaaS company like Slack and not taking advantage of them, something's gone wrong.

onefuncman · on Jan 5, 2021

It's easy to fall behind on bumping up the high watermark for your max autoscaling or for new traffic patterns to cause emergent instability. New code paths are taking unprecedented amounts of traffic all the time.
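To make the "high watermark" point concrete, here is a minimal sketch of the kind of check that could catch a stale autoscaling ceiling before a traffic spike does. Everything in it is hypothetical: the `ScalingGroup` record, the group names, and the 20% headroom threshold are made up for illustration, and in practice the numbers would come from whatever your cloud provider reports for each group.

```python
from dataclasses import dataclass


@dataclass
class ScalingGroup:
    """Hypothetical snapshot of one autoscaling group's capacity settings."""
    name: str
    desired: int   # instances the autoscaler currently wants
    max_size: int  # configured ceiling the autoscaler may not exceed


def headroom(group: ScalingGroup) -> float:
    """Fraction of the configured maximum that is still unused (0.0 to 1.0)."""
    if group.max_size <= 0:
        return 0.0  # a ceiling of zero leaves no room to scale at all
    return max(0.0, 1.0 - group.desired / group.max_size)


def groups_near_ceiling(groups, min_headroom=0.2):
    """Names of groups whose remaining headroom is below the assumed threshold."""
    return [g.name for g in groups if headroom(g) < min_headroom]


if __name__ == "__main__":
    fleet = [
        ScalingGroup("web", desired=45, max_size=50),
        ScalingGroup("worker", desired=10, max_size=40),
    ]
    for name in groups_near_ceiling(fleet):
        print(f"{name}: autoscaling max may need raising")
```

A check like this only covers one layer (group-level capacity); it says nothing about starvation inside a process or container.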

In 2021, how does one keep track of resource starvation at the process, container, OS, service, pod, cluster, availability zone, and region levels?