"we’re not juking the stats here by omitting “scheduled” downtime"
Nice, you score some karma points there. Excluding "scheduled" downtime has always bugged me (including when my employer does it). When you're down, you're down; that's kind of the end of the story. Whether or not your customers can do what they paid you for is all that matters to them.
Go ahead and dismiss this, but I found it valuable. It's not often that a high-profile company publishes their own uptime data, and it has a clear and useful purpose to me. When I'm planning for uptime and making agreements with our customers, it's nice to have a point of reference. In other words, if 37Signals has a hard time getting to four-nines, I would anticipate similar struggles. I may write a four-nines (or five-nines) SLA, but I'll write the penalties in a way that doesn't shoot a hole in my boat.
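For anyone writing an SLA like this, the arithmetic behind "four-nines" is worth spelling out. A minimal sketch, assuming a 365-day year and a hypothetical helper name (`allowed_downtime_minutes` is not from the post): four nines leaves roughly 52.6 minutes of downtime per year, five nines roughly 5.3.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumes a non-leap year


def allowed_downtime_minutes(nines: int, period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Downtime budget for an availability target with the given number of nines.

    nines=4 means 99.99% availability, i.e. an allowed failure fraction of 10**-4.
    """
    if nines < 1:
        raise ValueError("nines must be at least 1")
    return period_minutes * 10 ** (-nines)


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(f"{n} nines: {allowed_downtime_minutes(n):.2f} minutes per year")
```

Passing a different `period_minutes` (for example, a 30-day month) gives the budget for whatever billing period the SLA penalties are tied to.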
Yes, we have our own uptime data, but we're a young company, and we're still scaling. Coincidentally, on a similar stack to 37Signals. More data is better.
The data is pretty meaningless unless they include a definition of downtime and how they are measuring it.
For instance, they could be monitoring from an internal network and their methodology might not pick up external network outages that take their DC off the air (just an example of why the method and definition are important...).
I don't doubt that they have a great track record of reliability, but this presentation seems a bit thin.
No, not really, although again the value of the stats depends on what is being monitored and how. For example, Pingdom can report that an SMTP interface looks just great while it is impossible to send mail on that port because of a bad disk or a failure somewhere else. We use Pingdom too, and it is a great tool, and transparency is definitely great, but any number of user-impacting events can and do go unnoticed, depending on how you are monitoring and what you are monitoring.
A great example is that as I write this, I just realized that our public Pingdom network status reports are offline... I count this as downtime against our global availability stats, but it isn't an event that would show up in my Pingdom reports :)
Stats are back... weird :-) http://about.hover.com/networkstatus (As an aside, making it easy for customers to see uptime/downtime and network events eases the burden on customer service and makes it much easier for potential customers to check your credibility. It's easy to implement and definitely a plus for the business overall.)
That's nice but feels more like a status page to me than a way to gauge long-term uptime (which is what we were going for here).
For our application outages, it's been pretty well correlated: if Pingdom can't get a 200 OK on the test pages we've set up, the app has been down. And I don't think we've had much, if anything, return a 200 OK while actually being down.
I'm sure we're still off by a couple of minutes here and there, but the big picture should be quite accurate.
Our test logs in, causes some data to be fetched from the database, and renders a page which we then check against what it should return.
We haven’t (to date) had either any false positives or false negatives (when it alerts, the site is really down, and if it doesn’t alert, the site is really up).
This obviously isn’t a replacement for functional or integration tests to ensure that a commit doesn’t cause a piece of the app to stop working, but it does test the full infrastructure stack to make sure that it’s performing the way we expect it to.
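To make the "check against what it should return" idea concrete, here is a rough sketch of a content-verifying check. This is not the actual Pingdom configuration or our real test; the function names and the marker string are hypothetical, and the login step is left out. The point is only that a 200 OK alone is not treated as "up" unless the rendered page also contains the text we expect.

```python
import urllib.error
import urllib.request


def page_is_healthy(status: int, body: str, expected_marker: str) -> bool:
    """A page counts as up only if it returned 200 AND contains the expected text."""
    return status == 200 and expected_marker in body


def check(url: str, expected_marker: str, timeout: float = 10.0) -> bool:
    """Fetch the URL and apply page_is_healthy; any network or HTTP error counts as down."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return page_is_healthy(resp.status, body, expected_marker)
    except (urllib.error.URLError, OSError):
        # Covers non-2xx responses (HTTPError), DNS failures, refused connections, timeouts.
        return False
```

If the test page is rendered from a database query, a marker that only appears when the query succeeds means the check exercises the web tier and the database together, which is the full-stack property described above.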
Thanks for the feedback. This data is based on external checks every 10 seconds.
(We also have thousands of internal health checks and alerts for other metrics.)
The key here is this reflects the view of the customer. (In other words if we down a system internally but there is no customer facing outage, yay redundancy and failover, then it would not show up here as long as the site was continuing to function with full health.)
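As an illustration of how external checks turn into an uptime figure, here is a minimal sketch. It assumes each failed check is charged as one full 10-second interval of downtime, which is a simplification and not necessarily how Pingdom or the internal dashboard computes it; `summarize` is a hypothetical name.

```python
CHECK_INTERVAL_SECONDS = 10  # external check frequency stated above


def summarize(results, interval_seconds=CHECK_INTERVAL_SECONDS):
    """Turn a sequence of check outcomes (True = up) into uptime stats.

    Assumption: every failed check counts as one full interval of downtime.
    """
    total = len(results)
    if total == 0:
        raise ValueError("no check results")
    failed = sum(1 for ok in results if not ok)
    return {
        "uptime_percent": 100.0 * (total - failed) / total,
        "downtime_seconds": failed * interval_seconds,
    }


if __name__ == "__main__":
    # One day of checks (8,640 at 10-second spacing) with six failures.
    day = [True] * 8634 + [False] * 6
    print(summarize(day))
```

With this kind of counting, an outage shorter than the check interval can be missed entirely, so the interval sets a floor on the resolution of the numbers.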
The view of the customer is the one that counts. Don't get me wrong, I really like what you've all done here. I'm just trying to understand it better, that's all. All the downvotes imply to me that I'm not communicating well this morning ;-)
I understand your communication just fine. HN is however not immune to its own little cults or pockets of personality worship. Questioning some of those people, however validly, will often result in downvoting for your gall.
I was a bit surprised when I checked out their hosting situation. They're only in one DC (suburban Chicago), and up until about a year ago they were only using one network provider. Now they're on three; apparently they had some real pain points before being multi-homed. I'm guessing the year's timing isn't accidental.
Still, not bad numbers considering. Running multiple DCs can be quite a pain in the ass and a money drain, and maybe being able to skip that is one of the advantages of having really loyal customers. A good example of there being many viable paths to a goal.
The timing is not tied to our move off of Rackspace and on to our own hardware, or our move to multiple providers. Noah and JD finally had time to work on making this data public and so we did. The fact that it happened at the end of the year is coincidence.
This is something we may make available in the future, but we believe this presentation is easier to digest for the majority of our customers. Right now Pingdom data is going straight into our internal metrics dashboard which produces the public pages, so minus our internal annotations, it's exactly what you'd see in Pingdom. In the past we used a different site monitoring product, and we had to import all the data before closing our account there. (So we can't make that available directly anymore.)