

Frank Talk About Site Outages - keyist
http://codeascraft.etsy.com/2010/09/17/frank-talk-about-site-outages/

======
mcyger
Great overview. If I sold on Etsy, I would be more confident that they are
proactively striving to reach 100% uptime and quickly addressing actual
downtime.

I love the structure of their problem-solving references: root cause analysis,
timeline of events, post mortem, single point of failure. And, of course, the
fact that they actively measure the real customer experience from multiple
places around the world. Customers may not realize it's a local infrastructure
problem when they go to Etsy, so being able to proactively find and address
these issues makes for fewer customer support calls and more satisfied
customers.

Kudos to Etsy for the post.

------
Poiesis
Their CTO, Chad Dickerson, seems like a stand-up guy from everything I've
read. Used to love his stuff at InfoWorld. (Note: the post does not appear to
be from Chad.)

------
djb_hackernews
I'm surprised they've made it this far with multiple single points of failure.
It sounds as if they only have one database!

~~~
mcfunley
We are in the midst of migrating from some vertically partitioned Postgres
databases (each with a warm standby) into master/master MySQL shards.

The PG databases weren't SPOFs in the worst sense of the word. They could
fail over, but this isn't as outage-resistant as the new setup. And we did
have more than one, but each was still pretty monolithic. So are most
databases for most sites before they've grown up completely.

Also, keep in mind that it's pretty easy to code yourself a single point of
failure even if your hardware doesn't force it upon you. Working things like
that out of a big codebase takes time.
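
To make the master/master shard idea concrete, here is a rough sketch of how
a client might route a key to a shard and fall back to the other master when
one is down. Everything in it is an assumption for illustration (the host
names, the CRC32 hash, the `is_up` health callback); it is not a description
of Etsy's actual code.

```python
import zlib

# Hypothetical layout: each shard is a master/master pair of hosts.
SHARDS = [
    ("db-shard0-a", "db-shard0-b"),
    ("db-shard1-a", "db-shard1-b"),
]


def shard_for(user_id: int) -> int:
    """Map a key to a shard index with a stable hash (assumed scheme)."""
    return zlib.crc32(str(user_id).encode()) % len(SHARDS)


def pick_host(user_id: int, is_up) -> str:
    """Return a live master from the key's shard pair.

    `is_up` is a caller-supplied health check: host name -> bool.
    """
    pair = SHARDS[shard_for(user_id)]
    # Prefer one side per key so a given key usually hits the same master.
    preferred = user_id % 2
    for host in (pair[preferred], pair[1 - preferred]):
        if is_up(host):
            return host
    raise RuntimeError(f"both masters down for shard {pair}")
```

With a warm standby, losing the primary means a failover step before writes
resume. In the sketch above, losing one master only changes which host
`pick_host` returns, and an outage of a whole pair affects just the keys that
hash to that shard.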

------
Maven911
Now imagine having a job where you deal with outages all the time, putting
out fires and listening to stressed-out people all day... welcome to my
world :)

