"we’re not juking the stats here by omitting “scheduled” downtime"
Nice, you score some karma points there. Excluding "scheduled" downtime has always bugged me (including when my employer does it). When you're down, you're down; that's kind of the end of the story. Whether or not your customers can do what they paid you for is all that matters to them.
Next up: 37signals publishes office room temperature...
Yes, we have our own uptime data, but we're a young company and we're still scaling. Coincidentally, we're on a similar stack to 37signals. More data is better.
For instance, they could be monitoring from an internal network and their methodology might not pick up external network outages that take their DC off the air (just an example of why the method and definition are important...).
I don't doubt that they have a great track record of reliability, but this presentation seems a bit thin.
Do you have any other questions about methodology? I'd be happy to elaborate.
A great example: as I write this, I just realized that our public Pingdom network status reports are offline... I count this as downtime against our global availability stats, but it isn't an event that would show up in my Pingdom reports :)
For our application outages, the correlation has been pretty strong: if Pingdom can't get a 200 OK on the test pages we've set up, the site has been down. And I don't think we've had much, if anything, return a 200 OK while actually being down.
I'm sure we're still off by a couple of minutes here and there, but the big picture should be quite accurate.
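To put "a couple of minutes" in perspective, here's a quick back-of-the-envelope calculation (just a sketch with made-up numbers, assuming a 30-day month, not our actual reporting):

    # Back-of-the-envelope: how much does a couple of minutes of
    # measurement slop move a monthly uptime figure?
    MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 in a 30-day month

    def uptime_pct(downtime_minutes):
        return 100.0 * (1 - downtime_minutes / MINUTES_PER_MONTH)

    print(uptime_pct(5))  # -> 99.988%
    print(uptime_pct(7))  # -> 99.984% (two extra minutes barely register)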
Our test logs in, causes some data to be fetched from the database, and renders a page which we then check against what it should return.
We haven’t (to date) had any false positives or false negatives (when it alerts, the site is really down, and when it doesn’t alert, the site is really up).
This obviously isn’t a replacement for functional or integration tests to ensure that a commit doesn’t cause a piece of the app to stop working, but it does test the full infrastructure stack to make sure that it’s performing the way we expect it to.
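For the curious, the shape of that check is roughly the following. This is purely an illustrative sketch, not our actual monitoring code; the URLs, credentials, and expected string are all made up:

    # Rough sketch of an external full-stack check like the one
    # described above. Everything here is hypothetical.
    import sys
    import requests

    BASE = "https://app.example.com"          # hypothetical host
    EXPECTED = "Welcome back, healthcheck!"   # known-good page content

    def site_is_up():
        s = requests.Session()
        # Log in as a dedicated health-check user.
        r = s.post(BASE + "/login",
                   data={"user": "healthcheck", "password": "..."},
                   timeout=10)
        if r.status_code != 200:
            return False
        # Fetch a page that forces a database read and a full render,
        # then compare it against what it should return.
        r = s.get(BASE + "/healthcheck", timeout=10)
        # Anything other than a 200 OK with the expected content
        # counts as down.
        return r.status_code == 200 and EXPECTED in r.text

    if __name__ == "__main__":
        sys.exit(0 if site_is_up() else 1)

The content comparison is what catches the "returns a 200 OK but the page is wrong" case that a bare status check would miss.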
I'm just trying to point out that the definition of "uptime" is hazy at best. :)
(We also have thousands of internal health checks and alerts for other metrics.)
The key here is that this reflects the customer's view. (In other words, if we take a system down internally but there is no customer-facing outage, yay redundancy and failover, it would not show up here as long as the site continued to function with full health.)
Still, not bad numbers, all things considered. Running multiple DCs can be quite a pain in the ass and a money drain, and maybe skipping that is one of the advantages of having really loyal customers. A good example of there being many viable paths to a goal.
Re multiple sites, we're working on it.