

Heroku Status Uptime Calculation (99.28% June 2012) - rdl
https://devcenter.heroku.com/articles/heroku-status-uptime-calculation

======
rdl
I'm not sure how I feel about this method of calculating uptime. If you have a
site with 100k users, and 10k are down for an hour, is that a 6 minute outage?

The most useful thing is probably to be internally consistent month to month
-- you really can't depend on self-reported uptime stats from multiple
providers, which might not use the same methodology. There are also other
performance metrics which matter -- if a site takes 3 minutes to respond, but
does so 100% of the time, that is "down" for some purposes but not for others.

Just adding up the numbers is 487 minutes out of 43200 minutes in June, or
98.87% uptime. I'm not sure if this was calculated the new way to be more
accurate or to avoid the psychologically bad "one nine of uptime".
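The two accounting methods above can be sketched in a few lines of Python (my own reconstruction of the arithmetic, not Heroku's actual formula -- the function names and structure are illustrative):

```python
# Two ways to count downtime for June (30 days = 43200 minutes).
TOTAL_MINUTES_JUNE = 30 * 24 * 60  # 43200

def weighted_outage_minutes(affected, total_users, duration_minutes):
    """Outage minutes prorated by the fraction of users affected
    (the method Heroku describes)."""
    return duration_minutes * affected / total_users

def simple_uptime_pct(outage_minutes, total_minutes=TOTAL_MINUTES_JUNE):
    """Naive uptime: subtract all tracked incident minutes,
    regardless of how many users each incident affected."""
    return 100.0 * (1 - outage_minutes / total_minutes)

# 10k of 100k users down for an hour counts as a 6-minute outage:
print(weighted_outage_minutes(10_000, 100_000, 60))  # 6.0

# Naively summing the 487 tracked incident minutes:
print(round(simple_uptime_pct(487), 2))              # 98.87
```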

If a provider outage also costs you recovery time of your own, though, the
provider's uptime figure alone isn't very useful. If your EBS volumes come
back dirty on EC2 and fscking them takes another few hours (due to size or
contention), your total outage can be longer, and that's all time where you
wouldn't have been down on a different platform.

~~~
bgentry
_Just adding up the numbers is 487 minutes out of 43200 minutes in June, or
98.87% uptime._

The number of minutes an issue was being tracked on our status site is not the
same as the number of minutes things were actually broken. Also, very few
incidents affect all apps at any point. And when things are being fixed it's
typical to see a long tail, especially for database failures with corrupt
volumes.

_I'm not sure if this was calculated the new way to be more accurate or to
avoid the psychologically bad "one nine of uptime"._

This is the same way we've always done it, but we never explained it before
this month. We also couldn't find any other providers explaining how they
calculate aggregate availability, so we thought it was worth explaining.

This is the most realistic method we're aware of for a distributed platform
with millions of apps and non-total failure modes.

~~~
rdl
Seems fair (this is a good argument).

I don't think there's a huge amount of precedent. There are systems which are
tightly coupled and thus all-or-nothing (most hosting pre-cloud, power
generation facilities, etc.), and systems which are highly distributed
(networks). Generally for the distributed systems, no one reports a global
uptime percentage, but a per-link uptime target and percentage.

