
Show HN: Calculate how long your website/API can be down - jmstfv
https://tryhexadecimal.com/sla-uptime-calculator
======
chrismorgan
Ah, SLAs. It seems to me from having read quite a few standard sorts of ones
for web systems or APIs that they’re a profitable racket, typically a total
scam.

Simplifying and extrapolating, it seems to me that they normally boil down to
this: “if our product that you pay us $1,000 per month for, upon which you
depend so that you would incur losses of $1,000 per _hour_ if it was down,
goes down¹ for more than 43 minutes in any given month (half an hour of
downtime is all hunky dory, right? 99.9% looks good, right?), we’ll give you
back 10% of your monthly fee for every hour it’s down, up to your monthly fee
(so now from minute 43 for the next ten hours you’ll only lose $900 per
hour!)”.
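
To put numbers on that hypothetical, here is a rough Python sketch. The 30-day month, the prorated credit, and the function names are assumptions made for illustration, not terms from any real SLA.

```python
def allowed_downtime_minutes(uptime_percent: float, days: int = 30) -> float:
    """Downtime budget in minutes for an uptime target over `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - uptime_percent / 100)


def net_loss(downtime_minutes: float,
             monthly_fee: float = 1000.0,
             loss_per_hour: float = 1000.0,
             uptime_percent: float = 99.9,
             credit_rate: float = 0.10) -> float:
    """Customer's losses minus the SLA credit for one month.

    Assumption: the credit accrues (prorated) at `credit_rate` of the
    monthly fee per hour of downtime beyond the allowance, capped at the fee.
    """
    allowance = allowed_downtime_minutes(uptime_percent)
    hours_over = max(0.0, downtime_minutes - allowance) / 60
    credit = min(monthly_fee, credit_rate * monthly_fee * hours_over)
    losses = loss_per_hour * downtime_minutes / 60
    return losses - credit


print(f"{allowed_downtime_minutes(99.9):.1f}")  # 43.2 minutes at 99.9%
print(f"{net_loss(43.2 + 600):.0f}")            # 9720 after ten hours over the allowance
```

With these inputs the credit is fully used up ten hours past the allowance, and every further hour of downtime costs the customer the full $1,000.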

From the ones I’ve seen, I can’t understand why anyone would _pay_ for them,
because they seem to be utterly toothless.

I mean seriously, I’d prefer to offer that kind of SLA to all users on any
product at no cost: it shows I’m _serious_ about the product and _care_, and
practically costs little (the absolute disaster case is a month’s revenue) and
can be perceived as a goodwill gesture if it ever needs to be invoked.

The idea of charging people extra just to say “yeah, I’ll actually _provide_
the service you already paid for” rubs me the wrong way. But I bet some
business types would baulk at not having this SLA line they can upsell on.

Am I being unreasonable?

¹ For some excessively strict definition of “down”, so that we might only
count half of the _actual_ downtime.

~~~
oarsinsync
> Am I being unreasonable?

No. Also, yes.

> The idea of charging people extra just to say “yeah, I’ll actually provide
> the service you already paid for” rubs me the wrong way.

So this is ultimately the crux of the issue, I think. What are you paying for
with the up front fee? I don't think I'd be willing to sell anyone a 100%
uptime 24/7/365 product for only $1000/month. I'd probably want 2 orders of
magnitude more for that guarantee, because my costs would scale significantly
in order to deliver that. I'd also need to build in proof to defend it when my
product becomes unreachable from your network, because of an issue with your
network / your ISP / an intermediate ISP / a hijacking attack / (one of) my
ISP(s).

The reality is the network (and this is true of almost any IP network) is
unreliable, so anything that involves a network should never really be sold
with 100% uptime guarantees. If you throw the internet in the middle, you
definitely cannot provide that guarantee.

Which then brings us to what was actually sold. Setting expectations up front
leads to happier trading partners. If you tell me you can deliver 100% uptime
for 30 days straight, and it goes down at any point at all in that duration,
I'm going to be legitimately upset. If you tell me there's only 99% uptime,
and there's a 6 hour outage, the scale of that being a problem for my business
is entirely my own problem.

I took on a product and the risks were made clear to me up front. If I chose
to not take out insurance, the costs to my business are entirely my own.
Sometimes that's just the cost of doing business: you can't afford to pay for
the insurance. That's one of the risks. It sucks when that happens.

Toothless SLA clauses like the one you've described do nothing to mitigate the
risk, so should not qualify. Unfortunately, people gloss over the details, but
as far as B2B commerce goes, _caveat emptor_.

As an example of a slightly less toothless SLA that I've seen negotiated, and
I'm paraphrasing the spirit here: "We will pay you $x to complete this job in
6 months. For each calendar day you are late, you will reimburse us $(x/30),
up to a maximum liability of $2x." and then the project was internally
scheduled to take up to 9 months. The risk of the contractor was factored in,
the expectation that they'd screw up was factored in, and the contractor was
heavily incentivised to not screw up (they did anyway).
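
As a sketch of how that penalty clause plays out, assuming the penalty is simply linear in days late and using a made-up contract value (the function name is also hypothetical):

```python
def late_penalty(contract_value: float, days_late: int) -> float:
    """Reimbursement owed: contract_value / 30 per calendar day late,
    capped at twice the contract value."""
    if days_late <= 0:
        return 0.0
    per_day = contract_value / 30
    return min(2 * contract_value, per_day * days_late)


# Hypothetical $30,000 contract
print(late_penalty(30_000, 10))  # 10000.0
print(late_penalty(30_000, 90))  # 60000.0 (cap reached after 60 days)
```

Under this reading the contractor has given back the entire fee after 30 days late and hits the $2x cap after 60.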

~~~
jrockway
> I don't think I'd be willing to sell anyone a 100% uptime 24/7/365 product
> for only $1000/month.

I worked at a company that sold a 24/7/365 product with a 100% uptime SLA for
$1000/month.

I don't think it really did what anyone wanted. On one side of the coin,
people valued the service at more than they paid for it. One customer wrote in
to complain that they lost millions of dollars because of a scheduled outage.
For a million dollars, you could have had a backup circuit, or at least called
us to reschedule the planned maintenance (which we tended to be flexible
about, and which wasn't covered under the SLA anyway). I am not sure what their
true motivations were, but I think they assumed that the fine print said "we
can guarantee you that there will be no natural disasters, etc." when in
reality it said "if our shit turns off, we won't charge you". While the
legalese made it clear what the SLA was, the customer's expectation from the
simplified wording ("100% uptime") was not what the document said. That is
risky in terms of customer satisfaction.

Other customers we would refund, but their automated ACH bill payment would
send the full amount next month, so we had to track them down to tell them
that they overpaid. This resulted in a lot of work for them for some trivial
amount of money. We did honor our agreement, but the accounts payable person
at the other company ended up having to do a lot of meaningless work that
didn't really benefit them -- they don't care if their company pays $900 or
$1000 per month for some service. I think we should have just sent them
cookies or something as opposed to making them redo all their expense reports
for $100 or whatever. At least they would personally be able to enjoy a
cookie.

Personally, I would never sell someone an SLA without having first measured my
uptime extensively, and then I wouldn't sell them something that I've never
observed in the past. If we had 5 minutes of downtime last year, I would be
hesitant to guarantee 99.999% uptime. I'm not a good business person, though.
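
For a sense of how thin that margin is, here is a small sketch comparing 5 minutes of observed downtime against a 99.999% target. It assumes a 365-day year; the function names are made up for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, assuming a non-leap year


def achieved_uptime_percent(downtime_minutes: float,
                            period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Uptime percentage actually delivered over the period."""
    return 100 * (1 - downtime_minutes / period_minutes)


def budget_minutes(target_percent: float,
                   period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Downtime allowed by an uptime target over the period."""
    return period_minutes * (1 - target_percent / 100)


observed = 5.0
target = 99.999
headroom_seconds = (budget_minutes(target) - observed) * 60

print(f"{achieved_uptime_percent(observed):.5f}")  # 99.99905
print(f"{budget_minutes(target):.3f}")             # 5.256
print(f"{headroom_seconds:.1f}")                   # 15.4
```

Last year's record would have met five nines with only about 15 seconds to spare, which is why a single good year is weak evidence for that guarantee.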

------
bobbiechen
There's a great recent paper [1] (and Adrian Colyer's summary in The Morning
Paper [2]) from the Google G Suite team about how uptime or error rate alone
do not capture the full user experience:

 _Indeed, none of the existing metrics can distinguish between 10 seconds of
poor availability at the top of every hour or 6 hours of poor availability
during peak usage time once a quarter. The first, while annoying, is a
relatively minor nuisance because while it causes user-visible failures,
users (or their clients) can retry to get a successful response. In contrast
the second is a major outage that prevents users from getting work done for
nearly a full day every quarter. In the following section, we describe a new
availability measure, user-uptime that is meaningful and proportional.
Afterwards, we’ll introduce windowed user-uptime, which augments it to be
actionable._

[1] https://www.usenix.org/system/files/nsdi20spring_hauer_prepub.pdf

[2] https://blog.acolyer.org/2020/02/26/meaningful-availability/
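
The two scenarios in the quoted passage turn out to add up to the same total downtime, which is why a plain time-based percentage cannot tell them apart. A quick sketch, assuming a 90-day quarter and treating "poor availability" as fully down for simplicity:

```python
SECONDS_PER_HOUR = 3600
QUARTER_DAYS = 90  # assumption: a 90-day quarter


def time_based_availability(downtime_seconds: float,
                            period_seconds: float) -> float:
    """Fraction of the period during which the service was up."""
    return 1 - downtime_seconds / period_seconds


hours_in_quarter = QUARTER_DAYS * 24
period = hours_in_quarter * SECONDS_PER_HOUR

blips = 10 * hours_in_quarter         # 10 s at the top of every hour
single_outage = 6 * SECONDS_PER_HOUR  # one 6-hour outage

print(blips, single_outage)  # 21600 21600
print(f"{100 * time_based_availability(blips, period):.3f}")          # 99.722
print(f"{100 * time_based_availability(single_outage, period):.3f}")  # 99.722
```

Both come out at roughly 99.72% for the quarter, even though only the second one stops people from working for most of a day.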

------
jedberg
This is a problem we wrestled with a lot at Netflix. What counts as down? It
was rare for the platform to be completely unavailable. Most downtime was
partial, where just some people had issues. How do you calculate that?

We ended up settling on predicting how much traffic there should be at any
given second, and then going from there. For example, if there should have
been 3000 stream starts in that second, and we got 2000 starts, that would
count as 1/3 of a second of downtime.
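
A minimal sketch of that accounting, assuming the per-second predictions already exist as inputs (how they were produced is not described here). The function name and the choice to ignore seconds where traffic exceeds the prediction are assumptions.

```python
from typing import Iterable, Tuple


def downtime_equivalent_seconds(samples: Iterable[Tuple[float, float]]) -> float:
    """Sum fractional downtime over (expected, actual) pairs, one per second.

    Each second contributes the fraction of expected stream starts that
    did not happen. Seconds at or above the prediction contribute zero.
    """
    total = 0.0
    for expected, actual in samples:
        if expected <= 0:
            continue  # nothing was expected, so nothing can be missing
        shortfall = max(0.0, expected - actual)
        total += shortfall / expected
    return total


# The example from the comment: 3000 expected, 2000 observed
print(f"{downtime_equivalent_seconds([(3000, 2000)]):.3f}")  # 0.333
```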

Our goal was 4 nines, which allowed for a total of one minute of downtime a
week. We managed to achieve that on many weeks of the year. One week in the
three years I kept close track we managed 100%. It was near Christmas, when no
one was working and deploying code. Shooting ourselves in the foot was the
number one cause of downtime.
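
For reference, the weekly budget behind "4 nines" works out as follows. This is a small sketch that reads "N nines" as an allowed downtime fraction of 10^-N; the function name is made up.

```python
def weekly_budget_seconds(nines: int) -> float:
    """Allowed downtime per week, in seconds, for an N-nines target
    (4 nines = 99.99% uptime)."""
    week_seconds = 7 * 24 * 60 * 60  # 604,800
    return week_seconds * 10 ** -nines


print(f"{weekly_budget_seconds(4):.2f}")  # 60.48
```

That is where the "one minute a week" figure comes from.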

~~~
mcharezinski
> Shooting ourselves in the foot was the number one cause of downtime.

What do you mean?

~~~
jedberg
The biggest cause of downtime was deployment. Either deploying code with a bug
that wasn't caught in testing, or changing a real-time configuration parameter
that wasn't properly scoped.

As opposed to a scaling issue that showed up later, or a node failure or all
the other things that could cause downtime.

------
jamieweb
As a side note, I was very impressed/delighted by the message on the site when
JavaScript is disabled:

> _Sorry pal, but this won't work without a JavaScript. You are probably
> doing that for privacy reasons, and I do respect that. I tried to handle the
> noscript case gracefully, but there is only so much I could do. You can
> download this website, inspect the source code, and run it locally. Or, if
> you think this website is trustworthy enough, you can whitelist it in your
> browser or a script blocker. I don't have any third-party trackers on this
> website (only a CDN), so you will be safe here._

------
domrdy
Very happy Hexadecimal customer. Keep up the good work :)

~~~
jmstfv
Thanks a lot for your support, Dom. Appreciate it!

------
Simulacra
Our API is a little different because of security, but our metric for
evaluation is less than two minutes of downtime per year.

~~~
Smaug123
I suspect you're being downvoted because this is so light on details, and
reads more like some sort of bragging than anything else. Would you be able to
share in what sort of way security makes your APIs different, for example, and
how that pertains to SLAs?

------
loige
I love the minimalistic style.

How long did it take you to build the first version? How many people? Which
tech are you using for it?

~~~
jmstfv
> How long did it take you to build the first version?

The calculator or Hexadecimal?

It took a couple of hours to build the calculator, and a couple more to polish
it.

Hexadecimal took a very, very long time to build (several months).
Engineering-wise, it is the hardest thing I've ever done. It was way over my
head when I started.

> How many people?

Just me.

> Which tech are you using for it?

The calculator is a part of the static site built with HTML, CSS, and
_sprinkles_ of ES6. I generate it with nanoc.

The web app is built on a vanilla Rails stack: Ruby, Rails, Postgres, Redis,
Sidekiq, Caddy. I don't use front-end frameworks (yes, I'm a dinosaur). For
more context: https://tryhexadecimal.com/running-costs

