> Engineers building apps that depend on that know the limitations.
As someone who works at a slightly smaller tech company of similar age and with similar infrastructure, I assure you this is not the case. Engineers are building things that rely on other things that rely on other things. There's a point where people don't know what their dependencies are.
I wouldn't be surprised if nobody actually knew there was customer traffic in this class until this happened.
> Engineers are building things that rely on other things that rely on other things. There's a point where people don't know what their dependencies are.
That's what a distributed system is: a system in which you can't get your work done because a system you've never heard of has failed. (I had that attributed to Butler Lampson, but searching turns up Leslie Lamport instead.)
I've never worked in this type of operation; can you shed some light? I would have thought there'd be some type of documentation of the dependency hierarchy for change request checklists. Or are such things not always quite as comprehensive (or is it just not possible to document such complex interdependencies comprehensively)?
If you build a new service that uses Spanner, you'd list Spanner as a dependency in your design doc, and maybe even decide to offer an SLO upper-bounded by Spanner's. But you wouldn't list, or even know, the transitive dependencies introduced by using Spanner. You'd more or less have to be the tech lead of the Spanner team to know all the dependencies even one level deep (including whatever 1% experiments they're running and how traffic is selected for them). And even if you ask the tech lead and get a comprehensive answer, it won't be meaningful to anyone reading your launch doc (since they work on, say, Docs, with you), and will be almost immediately out of date.
Google infrastructure is too complicated to know everything. Most of the time, understanding the APIs you need to use (and their quirks and performance tradeoffs and deprecation timelines, etc.) is more than enough work.
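To make the gap between "dependencies you list" and "dependencies you actually have" concrete, here is a minimal sketch. The dependency map is entirely made up for illustration (only the Spanner name comes from the comment above; `docs-feature`, `storage-layer`, `lock-service` and the rest are hypothetical and say nothing about how Spanner is really built). It walks the graph breadth-first and reports everything reachable that the design doc would not have mentioned.

```python
from collections import deque

# Hypothetical direct-dependency map. Every service name here is invented
# for illustration; it is NOT a description of any real system.
DIRECT_DEPS = {
    "docs-feature": ["spanner"],
    "spanner": ["storage-layer", "lock-service", "experiment-config"],
    "storage-layer": ["disk-fleet"],
    "lock-service": [],
    "experiment-config": ["quota-service"],
    "disk-fleet": [],
    "quota-service": [],
}


def transitive_deps(service, direct_deps):
    """Return every service reachable from `service`, excluding itself."""
    seen = set()
    queue = deque(direct_deps.get(service, []))
    while queue:
        dep = queue.popleft()
        if dep in seen:
            continue  # already visited; also protects against cycles
        seen.add(dep)
        queue.extend(direct_deps.get(dep, []))
    seen.discard(service)  # a cycle could otherwise lead back to the start
    return seen


# What the design doc lists versus what the service really depends on.
listed = set(DIRECT_DEPS["docs-feature"])
actual = transitive_deps("docs-feature", DIRECT_DEPS)
unlisted = sorted(actual - listed)
print(unlisted)
```

The catch, as described above, is that nobody actually has `DIRECT_DEPS` in a complete and current form, so in practice this walk can only ever be run on a partial map.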
> not possible to have such complex interdependencies be comprehensively documented
And that is why you should switch certain classes of traffic down and eventually entirely off from time to time, to verify that everything you expect to keep working really does.
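A rough sketch of what "switching a class of traffic down and eventually off" could look like, assuming a simple hash-based admission check. The function names (`admit`, `brownout_schedule`) and the idea of bucketing on a request ID are my own assumptions for illustration, not a description of how any particular company does it.

```python
import zlib


def admit(request_id: str, allowed_fraction: float) -> bool:
    """Deterministically admit roughly `allowed_fraction` of requests.

    Each request ID hashes to a fixed bucket in [0, 1000), so a request
    that is rejected at one fraction stays rejected at every lower one.
    """
    bucket = zlib.crc32(request_id.encode()) % 1000
    return bucket < allowed_fraction * 1000


def brownout_schedule(steps: int) -> list:
    """Allowed fractions stepping evenly from 1.0 (all on) to 0.0 (all off)."""
    return [1 - i / steps for i in range(steps + 1)]


# Walk a hypothetical traffic class down in four steps and count what gets through.
requests = [f"req-{n}" for n in range(2000)]
admitted_counts = []
for fraction in brownout_schedule(4):
    admitted = sum(admit(r, fraction) for r in requests)
    admitted_counts.append(admitted)
    print(f"allowed={fraction:.2f} admitted={admitted}")
```

The point of stepping down rather than flipping a switch is that whoever unknowingly depends on that traffic class sees partial failures and has a chance to speak up before the last step takes it to zero.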