
Ask HN: What happens when companies break their SLA uptime? - rileyt
Is there an easy way to monitor when a company's uptime drops below what is specified in their SLA and enforce the penalties? It also seems like something that should be published somewhere for new customers to see.<p>Slack has a stated uptime of 99.99% (4.5 minutes of downtime per month) in their SLA, but today alone they are already approaching 30 minutes of downtime, and it seems like this has been happening on a quarterly basis. Are they paying for this?
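To make the downtime budget concrete, here is a minimal Python sketch of the arithmetic. The function names are illustrative, and it assumes the SLA is measured over a calendar month, which the actual agreement may define differently.

```python
def allowed_downtime_minutes(uptime_pct: float, days: int = 30) -> float:
    """Minutes of downtime permitted over `days` days at a given uptime percentage."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - uptime_pct / 100)


def measured_uptime_pct(downtime_minutes: float, days: int = 30) -> float:
    """Uptime percentage implied by a given amount of downtime over `days` days."""
    total_minutes = days * 24 * 60
    return 100 * (1 - downtime_minutes / total_minutes)


# Budget at 99.99% for a 30-day and a 31-day month (assumed measurement windows).
for days in (30, 31):
    print(days, round(allowed_downtime_minutes(99.99, days), 2))

# Uptime implied by a single 30-minute outage in a 30-day month.
print(round(measured_uptime_pct(30, 30), 4))
```

Under these assumptions the budget comes out to a little over four minutes per month, so a single 30-minute outage would use up several months' worth of allowance.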
======
shoo
[https://get.slack.help/hc/en-us/articles/204113126-Plus-
Plan...](https://get.slack.help/hc/en-us/articles/204113126-Plus-Plan-Service-
Level-Agreement-SLA-)

It's interesting to read how downtime is precisely defined.

If Slack is available for all other customers 100% of the time, ignoring
scheduled downtime, but is never available for you, and they have at least
10,000 customers in total, then this suffices to hit their 99.99% uptime
target.

> If we fall short of our 99.99% uptime guarantee, we’ll refund customers on
> the Plus plan and above 100 times the amount your workspace paid during the
> period Slack was down.

In this case they would not refund any customers, including you, since they
had hit their 99.99% uptime guarantee averaged over all their customers.
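The averaging argument can be checked with a few lines of Python. This is only a sketch of my reading of the definition: it assumes every customer is weighted equally, and the function name is made up for illustration.

```python
def aggregate_uptime_pct(per_customer_uptime_pct):
    """Mean uptime percentage across customers, each customer weighted equally."""
    return sum(per_customer_uptime_pct) / len(per_customer_uptime_pct)


# 9,999 customers with perfect uptime and one customer who is always down.
customers = [100.0] * 9_999 + [0.0]
print(aggregate_uptime_pct(customers))  # 99.99
```

Under that assumption, one customer can be down the entire period while the aggregate figure still meets the 99.99% target.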

------
londons_explore
Generally, to claim a refund from a company you need to be monitoring them
yourself.

You need to be able to say "We have this thing in our office pinging your API,
and it was down for the last 96 minutes, therefore pay up!"

If you don't chase the company and present evidence, they won't pay for their
own SLA violations.

The company's own status dashboards usually don't provide sufficient proof for
a claim. "We are investigating reports that some customers may be encountering
connection difficulties" isn't enough info to prove that _you_ were seeing
downtime.

~~~
rileyt
This seems to disagree with what shoo has found above. If the downtime is
based on the percentage of customers that are down, you would never be able to
know that from your own monitoring alone. Also, how could you prove that your
monitoring is correct when making a claim?

~~~
caw
In Slack's case, yes they're proactive. But this is not the norm when
enterprises violate their SLA.

If a service outage is particularly bad, some providers will be proactive and
reach out with credits. This is rare.

In most other cases, the company relies on you to contact them and claim that
they were out of SLA, at which point they'll investigate and give you back a
pittance of what you paid that month.

The claims process is normally a fun go-around of finger-pointing.

See for instance the Amazon EC2 SLA -
[https://aws.amazon.com/compute/sla/](https://aws.amazon.com/compute/sla/). In
order to even begin to claim it, you need to be running in 2 AZs. You must
provide evidence in a ticket. Your service credit will be either 10% (99.0 -
99.9) or 30% (< 99.0). Whether you were down for 1 day or 14 days, it's the
same SLA credit. The credit is applied to future service, or may be applied as
a refund to the current bill at their discretion.
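As a rough illustration of how flat those tiers are, here is a small Python sketch. The tier boundaries are the ones quoted above, not taken from the current SLA text; treating exactly 99.9% and above as "no credit" is an assumption, and the function names are hypothetical.

```python
def service_credit_pct(monthly_uptime_pct: float) -> int:
    """Service credit percentage for a month, using the tiers quoted in this comment."""
    if monthly_uptime_pct < 99.0:
        return 30
    if monthly_uptime_pct < 99.9:  # upper boundary as quoted; handling of the edge is assumed
        return 10
    return 0


def uptime_pct_for_days_down(days_down: float, days_in_month: int = 30) -> float:
    """Monthly uptime percentage if the service was down for `days_down` whole days."""
    return 100 * (1 - days_down / days_in_month)


# One day down and fourteen days down land in the same tier.
for days_down in (1, 14):
    uptime = uptime_pct_for_days_down(days_down)
    print(days_down, round(uptime, 2), service_credit_pct(uptime))
```

Both outages fall below 99.0% for the month, so both yield the same 30% credit.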

------
debacle
Angry phone calls. Someone threatening to sue. No one actually suing. Someone
getting something for free.

------
geoah
Paying customers usually get credits back from Slack after downtime. I have a
recollection that free accounts also got something at some point.

* Slack refunds customers 100x amount paid during outage - [https://news.ycombinator.com/item?id=16487812](https://news.ycombinator.com/item?id=16487812)

* [https://get.slack.help/hc/en-us/articles/204113126-Service-L...](https://get.slack.help/hc/en-us/articles/204113126-Service-Level-Agreements-SLA-)

------
sigi45
In one project, company X had to pay company Y an average calculated loss for
downtime. The two did business as a joint venture. Five figures per hour.

At another company, our SLA was written into every big contract because
customer companies were asking for it.

Some specific SLAs meant on-call duty. Something like this costs money and
affects the monthly support and operations price.

------
InternetOfStuff
> already approaching 30 minutes

I've been having connection trouble essentially all day.

As far as I can tell this has been going on for hours.

~~~
rileyt
I guess the follow-up question is: how do we ensure companies like Slack aren't
grossly underreporting their downtime?

~~~
LinuxBender
Create a simple integration that runs from multiple locations, ISP networks,
and datacenters, and logs the results back to a monitoring system that records
HTTP status codes, integration results, and best / average / worst /
95th-percentile latency. Provide an easy way for Slack to see the results and
for others to reproduce them.
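A minimal sketch of such a probe in Python, using only the standard library. Everything here is an assumption for illustration: the `ProbeResult` record, the `probe` and `summarize` names, the location labels, and the rule that any 2xx or 3xx response counts as "up". It is not a complete monitoring system, and the target URL is left to the reader.

```python
import statistics
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeResult:
    location: str           # label for the vantage point (hypothetical, e.g. "office-isp")
    timestamp: float        # Unix time of the check
    status: Optional[int]   # HTTP status code, or None if no response was received
    latency_ms: float


def probe(url: str, location: str, timeout: float = 10.0) -> ProbeResult:
    """Issue one HTTP GET and record the status code and elapsed time."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        status = err.code    # server answered with a 4xx/5xx code
    except OSError:
        status = None        # DNS failure, refused connection, timeout, etc.
    latency_ms = (time.monotonic() - start) * 1000
    return ProbeResult(location, time.time(), status, latency_ms)


def summarize(results):
    """Success rate plus best / average / worst / 95th-percentile latency."""
    if not results:
        raise ValueError("no probe results to summarize")
    ok = sum(1 for r in results if r.status is not None and 200 <= r.status < 400)
    latencies = sorted(r.latency_ms for r in results if r.status is not None)
    summary = {"checks": len(results), "success_pct": 100 * ok / len(results)}
    if latencies:
        summary["best_ms"] = latencies[0]
        summary["avg_ms"] = statistics.mean(latencies)
        summary["worst_ms"] = latencies[-1]
    if len(latencies) >= 2:
        # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
        summary["p95_ms"] = statistics.quantiles(latencies, n=100)[94]
    return summary


# Usage idea: call probe(<endpoint you depend on>, "<location label>") on a
# schedule from each vantage point, store the ProbeResult rows, and publish
# summarize(rows) so that others can reproduce the numbers.
```

Keeping the raw per-check rows, not just the summary, is what makes the results usable as evidence later.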

~~~
rileyt
This might be an interesting service for companies that are paying for
software with uptime in their SLA and want to hold them accountable. It would
probably pay for itself pretty fast, considering how often companies like
Slack are down...

~~~
LinuxBender
There are companies that do this, but they do not get involved in the 3rd
party SLA legal issues.

In my humble opinion, each consumer of Slack and other web services should
implement their own monitoring. It probably would not hurt to open source the
monitoring so that Slack and its ilk can help keep your monitoring accurate
and honest. This would also empower others to replicate your methods for the
service providers they consume, without having to pay monitoring service
providers. Those providers will only show outages from their own perspective,
and some of them run from Amazon, which may be sub-optimal in this case.

Some existing open source tools are Nagios and Sensu, Sensu being a little
more dynamic and cloud-friendly, and the newer kid at the table. There are
Chef and Ansible scripts for both of these.

------
borplk
Usually nothing.

For super serious stuff it's a different story, but in the average
small-to-medium SaaS case you may end up with an apology email and $9 paid
back, or something to that effect.

