Your nines are not my nines (2019) (rachelbythebay.com)
106 points by thewarpaint | 31 comments



This is a concept I've had to explain to entirely too many teams over the years: 0.001% of requests failing as a (mostly) random distribution across all requests is very different from a 0.001% subset of requests that will fail (nearly) every time until the underlying issue is mitigated. They look the same on a high-level dashboard, but they are completely different conditions in terms of how the customer will feel them, and knowing which kind of problem you have also guides the investigation and troubleshooting process.
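A rough back-of-the-envelope sketch of the difference, with invented numbers (the failure rate matches the comment; the per-customer request count is made up):

  # Toy numbers: pick any request counts you like.
  FAILURE_RATE = 0.00001          # 0.001% of requests fail overall
  REQUESTS_PER_CUSTOMER = 10_000  # hypothetical traffic per customer

  # Mode A: failures land uniformly at random across all requests.
  # Some customers see a failure or two, and a simple retry succeeds.
  p_touched = 1 - (1 - FAILURE_RATE) ** REQUESTS_PER_CUSTOMER
  print(f"Mode A: ~{p_touched:.1%} of customers see at least one (retryable) failure")

  # Mode B: the same 0.001% is a fixed subset -- every request that hits one
  # bad shard or one unlucky tenant.  Those customers fail 100% of the time,
  # and retries never help until the underlying issue is mitigated.
  print("Mode B: 0.001% of customers are hard down; the aggregate dashboard looks identical")

Both modes show the same overall error rate, which is exactly why the high-level dashboard can't tell them apart.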


In addition, some requests are more important than others.

`/assets/app_bundle.js` failing will most likely be visible immediately and make everything else useless, unless you've been clever and only used JS to enhance the website/app experience rather than to replace it.

`/metrics/user-activity` failing won't (shouldn't) have any impact on the user experience

`/stripe/payment-succeeded-callback` failing could have disastrous impacts on the user, but not immediately be visible when it's failing.
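One way to act on that distinction is per-endpoint error budgets instead of one global error rate. A sketch only: the routes are the three above, but the SLO targets, the paging flags, and the helper function are all invented for illustration:

  # Hypothetical per-route SLOs; tune targets to how users actually feel failures.
  ROUTE_SLOS = {
      "/assets/app_bundle.js":              {"target": 0.9999, "page": True},   # users notice instantly
      "/metrics/user-activity":             {"target": 0.99,   "page": False},  # invisible to users
      "/stripe/payment-succeeded-callback": {"target": 0.9999, "page": True},   # invisible but disastrous
  }

  def should_page(route: str, success_ratio: float) -> bool:
      """Page only when a route that actually matters is burning its own budget."""
      slo = ROUTE_SLOS.get(route)
      return bool(slo) and slo["page"] and success_ratio < slo["target"]

  print(should_page("/metrics/user-activity", 0.95))               # False: nobody gets woken up
  print(should_page("/stripe/payment-succeeded-callback", 0.998))  # True: silent but costly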


The way it works with cloud providers is that you can file for a refund for an SLA breach. After all, those SLAs are at a service level for the customer. If you're yelling at support or engineering on the phone, you're likely getting the nines treatment the author describes; that is the wrong forum to hold the provider accountable, unless you're yelling about mitigation time (then, best of luck to you!).

Reading the fine print on the SLAs is extremely important, because they often do not say what you think they say.

https://aws.amazon.com/legal/service-level-agreements/

https://www.microsoft.com/licensing/docs/view/Service-Level-...

https://cloud.google.com/terms/sla/

I have seen refunds on the order of hundreds of thousands of dollars. It's cold comfort if the impact to you was on the order of millions of dollars, but it is still something. As you can see, it's not a free-money-a-thon; it's generally a percentage of your spend on the services that were not available.
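To make the shape of that concrete (all numbers and credit tiers below are invented for illustration; the real schedules are in the SLA documents linked above):

  # Hypothetical SLA credit calculation -- the tiers are made up.
  monthly_spend_on_affected_service = 500_000   # dollars
  measured_uptime = 0.989                       # a genuinely bad month

  if measured_uptime < 0.95:
      credit_pct = 1.00
  elif measured_uptime < 0.99:
      credit_pct = 0.30
  elif measured_uptime < 0.999:
      credit_pct = 0.10
  else:
      credit_pct = 0.0

  print(f"Credit: ${monthly_spend_on_affected_service * credit_pct:,.0f}")
  # -> Credit: $150,000 -- "hundreds of thousands" territory, but only because
  #    the spend is already large, and it covers none of your own lost revenue.

Note the credit applies only to spend on the affected service, not your whole bill.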

There typically is a defined process for submitting a refund ticket, which will result in an availability review. This documented process is not always easy to find.

The only one I could easily find is for Microsoft:

https://learn.microsoft.com/en-us/partner-center/request-cre...

(It's just a support topic when you're submitting a support ticket)


Leaving the work to the victim isn't exactly great, and getting a refund of some credits that you were going to spend with them anyway often doesn't come close to covering the reputational loss plus the time spent on the issue. The incentives for the large players are all about making more year-on-year profit.


I mean, yeah.

I suspect the economics of being a CSP might not be so favorable if SLA refunds were automatic and you didn't have to work for them.


"You are the bug on the windscreen of the locomotive. The train has no idea you were ever there." - Rachel by the Bay.

That's how monopolies work. They need not fear their customers.

In time, this becomes Orwell's "If you want a vision of the future, imagine a boot stamping on a human face – forever." Ask anyone who's had a dispute with the Apple app store.


Hot take:

I would love to have service providers publicly show the (downsampled!) alarms they actually use for operational excellence (from a read replica, etc.).

Doing so would prove that you actually have those alarms in place, since they're public and now a marketing point. That said, I get the concern about trolls and competitors trying to force a "low score".


Do you mean something like this page: https://heiioncall.com/status ? We use all of these internally at Heii On-Call https://heiioncall.com/ , and get paged when any of these triggers are alerting.

Edit: the subtle difference from the article is that it sounds like you want historical data, rather than present-state data.


Few things burn my butt more than chatting with five people (possibly not even working for the same company) who all see a service down while their status page shows green.

The fuck it is.

100% unavailability for 5% of your customers is very different from dropping 5% of requests in a uniform distribution.


The shining example is https://grafana.wikimedia.org .


Any time my coworkers start acting like we are amazing for how many requests we handle per second I send them to wikimedia.org. That'll smack the smug right outta most people.


Interesting that their fastest backend responses seem to be around 500 ms, up to a full second.


That seems pretty reasonable. That's for the app-server cluster to generate uncached results and propagate them. In general, the app server will do this when the change occurs, before a user actually asks for the page/information.

Users should only really see this when performing "mutable" operations (submitting edits, adding new pages or content) or when searching uncommon queries.

I doubt it's anywhere close to the critical path for the average guest, casual user, or even contributor. I'd suspect the only type of user who would find themselves hitting those appserver requests frequently would be moderators and admins.


Also: 500 ms on Wikimedia sites is very much still in the "okay" range, subjectively. They aren't really sites you make requests to every minute; if loading the next article took 500 ms every time, then so be it.


You could show it delayed by a week or so.


There's an old joke that goes something like, "Most of the people chasing five nines uptime achieved five eights."


I know of a case where an engineer asked for "nine fives" of reliability. The recipient naturally misread it.


Is the moral of the story people should start by chasing "one nine at a time" or something?


Sure, there's the issue of what your contract says and what the guarantee is, but all these companies do already track their metrics in ways that at least attempt to detect and respond to the problems the author describes.

They track their metrics not just at p50 (the median, roughly the typical experience) but also at p99, p99.9, etc., which capture the performance/reliability of the worst-affected tail, i.e. exactly the cases the author is describing. They already evaluate their systems from the perspective of how things perform for the worst-affected customers. Again, maybe the issue is the contract itself, sure, but they do already try to prevent a small handful of customers from being disproportionately affected by something.
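A synthetic sketch of what those aggregate percentiles do and don't show (the traffic split and latencies are invented):

  import random
  from statistics import quantiles

  random.seed(1)

  # 99 healthy customers with fast requests, one customer stuck behind a bad node.
  latencies, per_customer = [], {}
  for cust in range(100):
      base = 2.0 if cust == 99 else 0.05   # seconds; customer 99 is the unlucky one
      samples = [random.uniform(base, base * 1.5) for _ in range(100)]
      per_customer[cust] = samples
      latencies.extend(samples)

  cuts = quantiles(latencies, n=1000)      # 999 cut points over all traffic
  print(f"p50: {cuts[499]:.2f}s  p99: {cuts[989]:.2f}s  p99.9: {cuts[998]:.2f}s")

  worst = max(per_customer, key=lambda c: min(per_customer[c]))
  print(f"Customer {worst} never sees anything faster than {min(per_customer[worst]):.2f}s")

Here the unlucky customer is 1% of traffic, so p99/p99.9 catch it. If the broken subset were much smaller than 0.1% of traffic, even p99.9 could stay green while that customer is always slow, which is the article's complaint.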


I remember seeing a talk years ago about percentiles and how they lie: https://www.youtube.com/watch?v=lJ8ydIuPFeU

You should be exposing the maximum metric from your app; computing a percentile from an aggregated histogram is lossy.

[edit: Found the link, "How NOT to Measure Latency" by Gil Tene]
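A small sketch of that lossiness (the bucket boundaries and latencies are made up; this is not how any particular metrics library works internally):

  import bisect

  BUCKET_BOUNDS = [0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]   # seconds, histogram-style buckets
  counts = [0] * (len(BUCKET_BOUNDS) + 1)                  # last slot is the overflow bucket

  latencies = [0.08] * 9_990 + [9.7] * 9 + [42.0]          # one catastrophic 42 s request
  for v in latencies:
      counts[bisect.bisect_left(BUCKET_BOUNDS, v)] += 1

  # Any percentile estimate from these counts can only name a bucket range.
  # The 42-second outlier is just "somewhere above 10 s" -- unrecoverable
  # unless you also export an explicit max alongside the histogram.
  print("bucket counts:", counts)
  print("true max:", max(latencies), "s -- invisible beyond '> 10.0 s' in the histogram")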


Here's the thing, though. If I'm selling a product and I'm sending more than 10% of the money to a single vendor, I have several problems.

If a vendor who can completely stop my operation has an outage, and the SLA says they owe me that 10% as a refund, I'm still left dealing with the 10x of that amount I'm losing in revenue because one of my vendors is having a bad day.

Those guarantees - if they even honor them, and if you can spare the time to chase them down - are still a quick road to bankruptcy.

So at the end of the day I probably have to raise my costs 10% in order to guarantee that no single vendor can drop me to 0%. And if those two vendors share a vendor, I may still be screwed.
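A back-of-the-envelope version of that, with invented numbers:

  monthly_revenue = 1_000_000
  vendor_spend    = 100_000                           # 10% of revenue goes to this one vendor

  outage_days  = 1                                    # the vendor's "bad day"
  revenue_lost = monthly_revenue * outage_days / 30   # ~$33k of topline gone
  sla_credit   = vendor_spend * 0.10                  # a typical-shape credit: some % of spend

  print(f"Revenue lost: ${revenue_lost:,.0f}")
  print(f"SLA credit:   ${sla_credit:,.0f}  (credits against future spend, not cash)")

The credit covers a fraction of the loss, and only if you notice, file, and win the claim.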


Google loves to talk about billions of users. That is quite a few nines. Obviously there are fewer users of cloud than of search. But an engineer can only care about so many before they need to save their sanity. Human attention is the one thing that'll never scale.


I don't really get why cloud matters here. The exact same dynamic exists for on-prem services.


With on-prem, at least one part of the business can raise the alarm, and if it's a big chunk of the revenue, the rest of the business tends to sit up and take notice. With cloud, unless your business is Uber or the DoD, you're too small for the providers to sit up and take notice.


I don't find that to be the case at all. My place is a startup and we have a very close relationship with our Account Managers. We can have product leads on the phone within 24 hours if we encounter AWS internal issues.

I've found that AWS is MUCH better at dealing with problems than internal teams, since AWS has many more resources to throw at a problem.


What level of AWS usage does it take to be able to get product leads "on the phone"? I'm guessing it is at least thousands of dollars/month in spend? Or am I wrong?

And 24 hours can be a long time when TSHTF.


It’s a matter of control. You have practically zero direct control over the vendor-provided service because you are too small for them to care. If you control the system on-prem, you can at least attempt to fix it by hiring someone able to fix it or by diverting resources you already have.

Additionally, you can find ways to mitigate that failure from occurring or being as destructive in the future.

It’s important to factor that in when choosing between cloud providers and on-prem solutions.


I think the point of TFA is that, unless you have hundreds of on-prem services, one service going down for hours will significantly move the needle in your monitoring.


> The exact same dynamic exists for on-prem services.

With on-prem, you can choose to defer deploying that significant (and therefore higher risk) change when it's crunch time for your business. That can reduce regression impact considerably.


(2019)




