Hacker News
Show HN: Calculate how long your website/API can be down (tryhexadecimal.com)
101 points by jmstfv on May 30, 2020 | 15 comments

Ah, SLAs. From the standard ones I’ve read for web systems and APIs, they seem to me a profitable racket, and typically a total scam.

Simplifying and extrapolating, it seems to me that they normally boil down to this: “if our product that you pay us $1,000 per month for, upon which you depend so that you would incur losses of $1,000 per hour if it was down, goes down¹ for more than 43 minutes in any given month (half an hour of downtime is all hunky dory, right? 99.9% looks good, right?), we’ll give you back 10% of your monthly fee for every hour it’s down, up to your monthly fee (so now from minute 43 for the next ten hours you’ll only lose $900 per hour!)”.
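The arithmetic in that hypothetical can be checked with a short Python sketch. The function names and the flat 30-day month are assumptions made for illustration; the fee, the hourly loss, and the 10%-per-hour credit are the figures from the example above, not any real vendor's terms.

```python
def allowed_downtime_minutes(uptime_pct: float, period_days: int = 30) -> float:
    """Downtime budget, in minutes, for an uptime percentage over a period."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (100 - uptime_pct) / 100


def sla_credit(monthly_fee: float, billable_hours_down: float,
               credit_pct_per_hour: float = 10) -> float:
    """Credit owed under the hypothetical SLA, capped at one monthly fee."""
    uncapped = monthly_fee * credit_pct_per_hour * billable_hours_down / 100
    return min(monthly_fee, uncapped)


def net_loss(loss_per_hour: float, monthly_fee: float,
             billable_hours_down: float) -> float:
    """Customer's loss after the SLA credit has been applied."""
    gross = loss_per_hour * billable_hours_down
    return gross - sla_credit(monthly_fee, billable_hours_down)


# 99.9% over a 30-day month leaves roughly 43.2 minutes of budget.
print(round(allowed_downtime_minutes(99.9), 1))

# Ten billable hours down: $10,000 lost, $1,000 credited back.
print(net_loss(loss_per_hour=1000, monthly_fee=1000, billable_hours_down=10))
```

Past the tenth billable hour the credit is capped, so the customer is back to losing the full $1,000 per hour.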

From the ones I’ve seen, I can’t understand why anyone would pay for them, because they seem to be utterly toothless.

I mean seriously, I’d prefer to offer that kind of SLA to all users on any product at no cost: it shows I’m serious about the product and care, and practically costs little (the absolute disaster case is a month’s revenue) and can be perceived as a goodwill gesture if it ever needs to be invoked.

The idea of charging people extra just to say “yeah, I’ll actually provide the service you already paid for” rubs me the wrong way. But I bet some business types would baulk at not having this SLA line they can upsell on.

Am I being unreasonable?

¹ For some excessively strict definition of “down”, so that we might only count half of the actual downtime.

> Am I being unreasonable?

No. Also, yes.

> The idea of charging people extra just to say “yeah, I’ll actually provide the service you already paid for” rubs me the wrong way.

So this is ultimately the crux of the issue I think. What are you paying for with the up front fee? I don't think I'd be willing to sell anyone a 100% uptime 24/7/365 product for only $1000/month. I'd probably want 2 orders of magnitude more for that guarantee, because my costs would scale significantly in order to deliver that. I'd also need to build in proof to defend it when my product becomes unreachable from your network, because of an issue with your network / your ISP / an intermediate ISP / a hijacking attack / (one of) my ISP(s).

The reality is the network (and this is true of almost any IP network) is unreliable, so anything that involves a network should never really be sold with 100% uptime guarantees. If you throw the internet in the middle, you definitely cannot provide that guarantee.

Which then brings us to what was actually sold. Setting expectations up front leads to happier trading partners. If you tell me you can deliver 100% uptime for 30 days straight, and it goes down at any point at all in that duration, I'm going to be legitimately upset. If you tell me there's only 99% uptime, and there's a 6 hour outage, the scale of that being a problem for my business is entirely my own problem.

I took on a product and the risks were made clear to me up front. If I chose to not take out insurance, the costs to my business are entirely my own. Sometimes that's just the cost of doing business, you can't afford to pay for the insurance. That's one of the risks. It sucks when that happens.

Toothless SLA clauses like the one you've described do nothing to mitigate the risk, so should not qualify. Unfortunately, people gloss over the details, but as far as B2B commerce goes, caveat emptor.

As an example of a slightly less toothless SLA that I've seen negotiated, and I'm paraphrasing the spirit here: "We will pay you $x to complete this job in 6 months. For each calendar day you are late, you will reimburse us $(x/30), up to a maximum liability of $2x." and then the project was internally scheduled to take up to 9 months. The risk of the contractor was factored in, the expectation that they'd screw up was factored in, and the contractor was heavily incentivised to not screw up (they did anyway)

> I don't think I'd be willing to sell anyone a 100% uptime 24/7/365 product for only $1000/month.

I worked at a company that sold a 24/7/365 product with a 100% uptime SLA for $1000/month.

I don't think it really did what anyone wanted. On one side of the coin, people valued the service at more than they paid for it. One customer wrote in to complain that they lost millions of dollars because of a scheduled outage. For a million dollars, you could have had a backup circuit, or at least called us to reschedule the planned maintenance (which we tended to be flexible about, and which wasn't covered under the SLA anyway). I am not sure what their true motivations were, but I think they assumed that the fine print said "we can guarantee you that there will be no natural disasters, etc." when in reality it said "if our shit turns off, we won't charge you". While the legalese made it clear what the SLA was, the customer's expectation from the simplified wording ("100% uptime") was not what the document said. That is risky in terms of customer satisfaction.

Other customers we would refund, but their automated ACH bill payment would send the full amount next month, so we had to track them down to tell them that they overpaid. This resulted in a lot of work for them for some trivial amount of money. We did honor our agreement, but the accounts payable person at the other company ended up having to do a lot of meaningless work that didn't really benefit them -- they don't care if their company pays $900 or $1000 per month for some service. I think we should have just sent them cookies or something as opposed to making them redo all their expense reports for $100 or whatever. At least they would personally be able to enjoy a cookie.

Personally, I would never sell someone an SLA without having first measured my uptime extensively, and then I wouldn't sell them something that I've never observed in the past. If we had 5 minutes of downtime last year, I would be hesitant to guarantee 99.999% uptime. I'm not a good business person, though.
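To put numbers on that hesitation, here is a rough Python sketch comparing observed downtime against a target. The function names and the 365-day year are assumptions for illustration; only the 5 minutes and the 99.999% figure come from the comment above.

```python
def observed_uptime_pct(downtime_minutes: float, period_days: int = 365) -> float:
    """Uptime percentage actually achieved over the period."""
    period_minutes = period_days * 24 * 60
    return 100 * (1 - downtime_minutes / period_minutes)


def margin_minutes(target_pct: float, downtime_minutes: float,
                   period_days: int = 365) -> float:
    """Downtime budget left under the target; negative means a breach."""
    budget = period_days * 24 * 60 * (100 - target_pct) / 100
    return budget - downtime_minutes


# Five minutes down in a year is just barely inside five nines.
print(observed_uptime_pct(5))
print(margin_minutes(99.999, 5) * 60, "seconds of headroom")
```

Under these assumptions the five-nines budget for a year is about 5.26 minutes, so a year with 5 minutes of downtime leaves roughly 15 seconds of headroom.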

I find SLAs mostly toothless as well, which is why I spend a lot of time trying to guesstimate the future downtime of a vendor and almost no time trying to negotiate to change it.

I’ll choose a vendor who promises nothing financial but who I think will hit four nines over one who will give me all my money back for every month they’re under four nines, but from whom I expect to cash some of those sweet, sweet SLA breach checks.

There's a great recent paper [1] (and Adrian Colyer's summary in The Morning Paper [2]) from the Google G Suite team about how uptime or error rate alone do not capture the full user experience:

> Indeed, none of the existing metrics can distinguish between 10 seconds of poor availability at the top of every hour or 6 hours of poor availability during peak usage time once a quarter. The first, while annoying, is a relatively minor nuisance because while it causes user-visible failures, users (or their clients) can retry to get a successful response. In contrast the second is a major outage that prevents users from getting work done for nearly a full day every quarter. In the following section, we describe a new availability measure, user-uptime that is meaningful and proportional. Afterwards, we’ll introduce windowed user-uptime, which augments it to be actionable.

[1] https://www.usenix.org/system/files/nsdi20spring_hauer_prepu...

[2] https://blog.acolyer.org/2020/02/26/meaningful-availability/
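The two scenarios in the quoted passage were presumably picked because they add up to the same total, which is why a plain time-based uptime percentage cannot tell them apart. A quick Python check, assuming a 90-day quarter (the quarter length and the function names are assumptions, not taken from the paper):

```python
QUARTER_DAYS = 90  # assumed length of a quarter
SECONDS_PER_HOUR = 3600
QUARTER_SECONDS = QUARTER_DAYS * 24 * SECONDS_PER_HOUR


def hourly_blips_seconds(blip_seconds: int = 10) -> int:
    """Total bad seconds if every hour of the quarter starts with a blip."""
    hours_in_quarter = QUARTER_DAYS * 24
    return blip_seconds * hours_in_quarter


def single_outage_seconds(outage_hours: int = 6) -> int:
    """Total bad seconds for one long outage in the quarter."""
    return outage_hours * SECONDS_PER_HOUR


def uptime_pct(bad_seconds: int) -> float:
    """Plain time-based uptime over the quarter."""
    return 100 * (1 - bad_seconds / QUARTER_SECONDS)


print(hourly_blips_seconds(), single_outage_seconds())
print(uptime_pct(hourly_blips_seconds()), uptime_pct(single_outage_seconds()))
```

Both patterns come to 21,600 bad seconds, so they report the same quarterly uptime even though the user experience is very different.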

This is a problem we wrestled with a lot at Netflix. What counts as down? It was rare for the platform to be completely unavailable. Most downtime was partial, where just some people had issues. How do you calculate that?

We ended up settling on predicting how much traffic there should be at any given second, and then going from there. For example, if there should have been 3000 stream starts in that second, and we got 2000 starts, that would count as 1/3 of a second of downtime.
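A minimal Python sketch of that accounting, assuming per-second series of predicted and actual stream starts are already available. The function name and the choice to ignore seconds where actual traffic exceeds the prediction are illustrative assumptions, not a description of Netflix's real implementation.

```python
from typing import Sequence


def weighted_downtime_seconds(expected: Sequence[float],
                              actual: Sequence[float]) -> float:
    """Sum, over each one-second bucket, the fraction of expected traffic lost."""
    total = 0.0
    for exp, act in zip(expected, actual):
        if exp <= 0:
            continue  # nothing was expected, so nothing can be counted as lost
        shortfall = max(0.0, exp - act)  # assumption: surplus traffic is not credited
        total += shortfall / exp
    return total


# The example from the comment: 3000 starts expected, 2000 observed.
print(weighted_downtime_seconds([3000], [2000]))  # about one third of a second
```

A full outage lasting one second counts as exactly one second, and partial outages scale in proportion to the missing traffic.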

Our goal was 4 nines, which allowed for a total of one minute of downtime a week. We managed to achieve that on many weeks of the year. One week in the three years I kept close track we managed 100%. It was near Christmas, when no one was working and deploying code. Shooting ourselves in the foot was the number one cause of downtime.

> Shooting ourselves in the foot was the number one cause of downtime.

What do you mean?

The biggest cause of downtime was deployment. Either deploying code with a bug that wasn't caught in testing, or changing a real-time configuration parameter that wasn't properly scoped.

As opposed to a scaling issue that showed up later, or a node failure or all the other things that could cause downtime.

As a side note, I was very impressed/delighted by the message on the site when JavaScript is disabled:

> Sorry pal, but this won't work without a JavaScript. You are probably doing that for privacy reasons, and I do respect that. I tried to handle the noscript case gracefully, but there is only so much I could do. You can download this website, inspect the source code, and run it locally. Or, if you think this website is trustworthy enough, you can whitelist it in your browser or a script blocker. I don't have any third-party trackers on this website (only a CDN), so you will be safe here.

Very happy Hexadecimal customer. Keep up the good work :)

Thanks a lot for your support, Dom. Appreciate it!

Our API is a little different because of security, but our metric for evaluation is less than two minutes of downtime per year.

I suspect you're being downvoted because this is so light on details, and reads more like some sort of bragging than anything else. Would you be able to share in what sort of way security makes your APIs different, for example, and how that pertains to SLAs?

I love the minimalistic style.

How long did it take you to build the first version? How many people? Which tech are you using for it?

> How long did it take you to build the first version?

The calculator or Hexadecimal?

It took a couple of hours to build the calculator, and a couple more to polish it.

Hexadecimal took a very, very long time to build (several months). Engineering-wise, it is the hardest thing I've ever done. It was way over my head when I started.

> How many people?

Just me.

> Which tech are you using for it?

The calculator is a part of the static site built with HTML, CSS, and sprinkles of ES6. I generate it with nanoc.

The web app is built on a vanilla Rails stack: Ruby, Rails, Postgres, Redis, Sidekiq, Caddy. I don't use front-end frameworks (yes, I'm a dinosaur). For more context: https://tryhexadecimal.com/running-costs
