Ask HN: How to find 'error budget' as a DevOps Engineer?
3 points by _mqak on Jan 28, 2019 | 1 comment
I am trying to find an error budget for a site that has several outside APIs integrated in it for its core features. How can I find the error budget for it?



Error budget = the actual downtime duration your site can still afford within a given time frame.

If you know the SLAs of your outside APIs, you can compute your own maximum possible SLO and derive your full error budget from it. But your error budget will diminish over time, as you use it.

Say your site depends on 3 external APIs, each with a 99% SLA. Your best possible site SLO would then be 99% x 99% x 99% ≈ 97% (your site is, at best, as reliable as the product of the reliabilities of its dependencies).

That is, unless your site has some built-in tactics for the specific downtime scenarios of these APIs (caching, retry, slow down, graceful limitation of features, etc.).
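
As a rough sketch of the product rule above: the function name is made up for illustration, and it assumes the APIs fail independently and that your site is down whenever any one of them is down (no caching or fallback).

```python
from math import prod  # available in Python 3.8+


def composite_availability(dependency_slas):
    """Best-case availability when every dependency must be up at once.

    Assumes independent failures and no mitigation (hypothetical helper).
    """
    return prod(dependency_slas)


# Three external APIs, each with a 99% SLA, as in the example above.
best_slo = composite_availability([0.99, 0.99, 0.99])
print(f"Best possible SLO: {best_slo:.2%}")
```

With mitigation in place the real figure can be higher, so treat this number as the baseline for a site that has none.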

Should you pick a lower SLA than your SLO for your site then? Always. Things happen.

Let's take a 95% SLA for simplicity.

Your max error budget for a 30-day window would be, as a formula:

  + total time frame (say, 30 days = 720 hours)
  - target availability (at 95% avail. that would be 684 hours)
  - total downtime you've already had within this time frame
  = 36 hours or less
That's a start. Then you can track your own actual production indicators and adjust accordingly.
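
The same arithmetic as a small sketch, in case it helps. The function name and the 5 hours of downtime are invented for illustration; the 30-day window and 95% target are the ones from the example above.

```python
def remaining_error_budget_hours(window_hours, target_availability, downtime_hours):
    """Hours of downtime still affordable in the window (hypothetical helper).

    A negative result means the budget is already overspent.
    """
    allowed_downtime = window_hours * (1 - target_availability)
    return allowed_downtime - downtime_hours


window = 30 * 24  # 30 days = 720 hours

# Full budget at the start of the window, with no downtime yet.
full_budget = remaining_error_budget_hours(window, 0.95, 0)
print(f"Full budget: {full_budget:.1f} hours")

# Illustrative value: 5 hours of downtime already used in this window.
left = remaining_error_budget_hours(window, 0.95, 5)
print(f"Remaining after 5h of downtime: {left:.1f} hours")
```

Run it again whenever you log more downtime and you can watch the budget shrink over the window.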

Reminder: https://enqueuezero.com/the-difference-between-sli-slo-and-s...



