
A big problem seems to be stability/error reporting and averaging of statistics. I've frequently had the following experience:

- I can't push or something in general goes wrong with one of my repos (but not others).

- Gitlab's status page is green

- Other people are having issues and tweeting @gitlabstatus about it on Twitter, but there is no general across-the-board outage

This seems to indicate that Gitlab tolerates (and very often has) a fair amount of instability and errors across its platform, but just takes the average of these as its baseline of performance: i.e. it's a very spiky graph with a reasonably high average line fitted to it.

This tweet supports this impression:

https://twitter.com/gitlabstatus/status/1000001988183158785

"Errors should be down to normal" - the idea that there is an non-zero error rate that is openly described as "normal" is worrying. Not that I'd expect a constant zero error rate, but at least aiming for it should be a consideration.




It sounds like you've never worked on a global-scale service.

Services at this scale will have errors for all sorts of strange reasons; it doesn't mean the service is poorly engineered. In fact, if users don't notice these problems, it usually means the service is resilient and robust when it encounters strange situations.

Consider a really simple example, such as making a breaking change to your service API. What happens when a user doesn't refresh their web browser and continues running javascript that doesn't work against the new API? This can happen with smaller services too, but the odds are much higher when you're at global scale.
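
A toy illustration of the kind of breakage I mean (Python, with a made-up field rename; not any real service's API):

    # Hypothetical breaking change: the server renames a response field,
    # while a stale client that was loaded before the deploy still expects
    # the old name.

    def old_response():
        return {"user_name": "alice"}   # shape the old frontend was written against

    def new_response():
        return {"username": "alice"}    # shape after the breaking change

    def stale_client_render(response):
        # The still-open page effectively does this:
        return "Hello, " + response["user_name"]

    print(stale_client_render(old_response()))      # works
    try:
        print(stale_client_render(new_response()))  # the user sees an error
    except KeyError:
        print("stale client broke: field no longer exists")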

There are other strange problems that come with large services, which means all components should be fault tolerant if possible.


You’re conflating two separate things: internal errors and user-visible errors. While it’s true that errors are inevitable, robust systems try to handle the latter gracefully, with minimal disruption. If the person you replied to is accurately describing their experience, then a system with significant unrecovered, unacknowledged user-visible errors has serious robustness issues.

Also, please don’t make disparaging comments about other people’s experience unless it’s highly relevant. It doesn’t add anything and will likely derail the conversation.


OP's post indicates that the metrics are poorly engineered.

As per the really simple example: generally you'd be better off rolling out a second endpoint for the new API and only later stopping the old one. First, this doesn't break everyone who had your page up, and second, you can stop the rollout safely if you find a problem with the new API.
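
Something like this, roughly (a toy stdlib-only dispatcher; a real rollout would sit behind a proper router, and the paths/handlers here are made up):

    # Toy illustration of serving two API versions side by side: /v2 is
    # added alongside /v1, and /v1 keeps answering old clients until a
    # planned cutoff instead of breaking them immediately.

    def handle_user_v1(user_id):
        return {"user_name": user_id}   # old response shape

    def handle_user_v2(user_id):
        return {"username": user_id}    # new response shape

    ROUTES = {
        "/api/v1/user": handle_user_v1,  # kept alive for a grace period
        "/api/v2/user": handle_user_v2,  # new clients get pointed here first
    }

    def dispatch(path, user_id):
        handler = ROUTES.get(path)
        if handler is None:
            return 410, {"error": "gone"}
        return 200, handler(user_id)

    # Pages that were already open keep working against /v1:
    print(dispatch("/api/v1/user", "alice"))
    # Newly loaded pages call /v2; if it misbehaves, the rollout can be
    # stopped without having broken anyone:
    print(dispatch("/api/v2/user", "alice"))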


> Services at this scale will have errors for all sorts of strange reasons, it doesn't mean the service is poorly engineered.

Of course, and as I said, zero errors is not practicably achievable in this type of context. The issue is with the metrics, though: taking averages instead of looking at the troughs is still problematic.
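
To make that concrete with made-up numbers (a rough Python sketch, not Gitlab's actual data):

    # One day of hypothetical per-minute error rates: quiet almost all day,
    # with a five-minute spike.
    error_rates = [0.001] * 1435 + [0.30, 0.45, 0.40, 0.30, 0.20]

    average = sum(error_rates) / len(error_rates)
    worst_minute = max(error_rates)
    minutes_over_threshold = sum(1 for r in error_rates if r > 0.01)

    print(f"daily average error rate: {average:.2%}")        # looks green on a dashboard
    print(f"worst minute:             {worst_minute:.0%}")   # what users were tweeting about
    print(f"minutes over a 1% threshold: {minutes_over_threshold}")

The day-level average stays under a quarter of a percent, but anyone pushing during those five minutes saw something like a one-in-three failure rate, and a dashboard built around the average never flinches.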

> In fact, if users don't notice these problems it usually means the service is resilient and robust when it encounters strange situations.

True. But in the case of Gitlab, users are noticing these problems. Constantly. It's just that Gitlab's own metrics could be ignoring the problems (I've not done more than browse their Grafana instance a bit, so this is somewhat speculative) because they're focused on averages instead of specifics or thresholds.

> Consider a really simple example ...

lallysingh has already pointed this out, but I'll reiterate that this is a bad example, though an apt one, as I'll get to. You're right that ideally components should be fault tolerant, but frankly that's a big ask, especially for highly-scaled services supporting many, many components of various types. Ensuring that all of those components are completely fault tolerant is much more difficult than simply ensuring the old API continues to operate for a grace period while the new one is served from elsewhere.

I think your example is apt because it's indicative of a common excuse for bad engineering: the assumption that downtime or disruption is inevitable because of necessary software upgrades/improvements, combined with poorly planned orchestration.



