

Extended Salesforce.com outage due to power problems - rdl
http://trust.salesforce.com/trust/status/

======
rdl
I am curious what kind of "power failure" could lead to a serious site outage
which takes many hours to diagnose and many more hours to fix. Anyone running
a serious business ($50/mo * total number of salespeople in the world),
hosting a cloud, etc. should have redundancy against any single piece of power
equipment failing. At absolute minimum, that means dual power buses feeding
any servers which can't be easily replicated, and replication across multiple
rooms in a colo. Beyond that, replicate across datacenters in a metro area for
highly latency-sensitive things, or across continents otherwise.

I would assume Salesforce is at the scale where going +1 on their
infrastructure isn't going to materially affect costs, so I am at a loss for
why they are exposed to an outage like this.

It's especially funny that their status site itself went down -- hosting your
status site separately from your own infrastructure is a very basic principle.
(Actually, there's probably a great startup in just hosting status pages for
companies -- rip off the Heroku time-series presentation to users, augment
monitoring with analyzing the Twitter firehose for "xxx is down" and "xxx is
slow" and "fucking xxx is not working", plus some ping monitoring, and maybe
app-level regression testing: if the size of a page drops to something like
10 bytes there is trouble, plus various other health checks.)

Makes sense As A Service because you want it to be wholly independent of your
own infrastructure, and also makes sense to run separately because you can be
"fair". Gomez, etc. report to site owners, but this would be aimed at
reporting to end users, with the ability for verified site owners to report on
RFO, TTR, etc.
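
To make the monitoring idea above a bit more concrete, here is a rough sketch
of two of the checks: flagging a page whose body has shrunk to almost nothing,
and matching "xxx is down" style complaints in a stream of text. Everything
here is an assumption for illustration: the names check_page and is_complaint,
the 512-byte threshold, and the exact phrases matched are all made up, and
nothing talks to a real Twitter or HTTP API.

```python
import re
from dataclasses import dataclass

# Hypothetical threshold: a body smaller than this is treated as suspicious.
MIN_BODY_BYTES = 512


@dataclass
class CheckResult:
    healthy: bool
    reason: str


def check_page(status_code: int, body: bytes,
               min_bytes: int = MIN_BODY_BYTES) -> CheckResult:
    """Classify one fetched page by HTTP status and body size."""
    if not 200 <= status_code < 300:
        return CheckResult(False, f"unexpected HTTP status {status_code}")
    if len(body) < min_bytes:
        # A page that normally weighs kilobytes but now returns a handful
        # of bytes is probably an error stub, even with a 200 status.
        return CheckResult(False, f"body is only {len(body)} bytes")
    return CheckResult(True, "ok")


def is_complaint(text: str, service: str) -> bool:
    """Return True if text says the service is down, slow, or not working."""
    pattern = re.compile(
        rf"\b{re.escape(service)}\s+is\s+(down|slow|not\s+working)\b",
        re.IGNORECASE,
    )
    return pattern.search(text) is not None
```

In practice you would feed check_page from whatever fetches the page, run it
from somewhere outside the monitored company's own infrastructure, and only
count complaints as a signal once they pass some rate threshold, since a
single tweet proves nothing.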

