

The Meltdown That Brought Our Startup to Its Knees for 15 Hours - kanamekun
http://www.groovehq.com/blog/downtime

======
nasalgoat
As a former system admin for 20 years, my mind boggles at the idea that they
only had email monitoring. It really illustrates what's wrong with "DevOps" -
the people doing it don't have the basic common sense that a dedicated system
admin would have to have out-of-band notification systems.

Also, only one master database? No dumps to offsite storage? No secondaries?
Crazy talk. This is SysAdmin 101 stuff.

> No longer will infrastructure be a “feature” to be weighed and prioritized
> against others in our backlog. It’s the foundation of everything we have,
> everything we do, and it will be treated as such.

I'm glad someone is learning this lesson. I wish it wasn't under those
circumstances.

------
ChuckMcM
This is a great ops story; I encourage everyone to read it and try to
internalize its lessons.

One of the things our server monitoring system does is text and phone (thanks
Twilio!) when things go this far south. Of course, in Alex's case it might not
have helped, since his phone was dead and not charged, but at least one of the
team would have gotten the message. The only downside for me is that when my
family on the east coast texts me something in the "morning", which is like 4AM
pacific, I bolt awake thinking it's a server outage.

~~~
beachstartup
tip: use do-not-disturb on your phone. within the specified hours, only
certain numbers are allowed to notify you.

