>Engineers building apps that depend on that know the limitations. As someone wh...

fencepost · on June 4, 2019

Engineers are building things that rely on other things that rely on other things. There's a point where people don't know what their dependencies are.

That's what a distributed system is: a system in which you can't get your work done because a system you've never heard of has failed. (I had that attributed to Butler Lampson, but searching turns up Leslie Lamport instead.)

ineedasername · on June 4, 2019

I've never worked in this type of operation. Can you shed some light? I would have thought there'd be some kind of documentation of the dependency hierarchy for change request checklists. Or are such things not always that comprehensive (or is it not possible to document such complex interdependencies comprehensively)?

endtime · on June 4, 2019

If you build a new service that uses Spanner, you'd list Spanner as a dependency in your design doc, and maybe even decide to offer an SLO upper-bounded by Spanner's. But you wouldn't list, or even know, the transitive dependencies introduced by using Spanner. You'd more or less have to be the tech lead of the Spanner team to know all the dependencies even one level deep (including whatever 1% experiments they're running and how traffic is selected for them). And even if you ask the tech lead and get a comprehensive answer, it won't be meaningful to anyone reading your launch doc (since they work on, say, Docs, with you), and will be almost immediately out of date.

Google infrastructure is too complicated to know everything. Most of the time, understanding the APIs you need to use (and their quirks and performance tradeoffs and deprecation timelines, etc.) is more than enough work.

> not possible to have such complex interdependencies be comprehensively documented

Yeah, this.
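
To make the direct-versus-transitive distinction above concrete, here is a minimal sketch in Python. It assumes you somehow had the whole dependency graph in one place, which is exactly what the comment says you don't have in practice. Every service name below other than "spanner" is invented for illustration and says nothing about Spanner's real dependencies.

```python
from collections import deque

# Hypothetical graph: service -> services it calls directly.
# Only "spanner" comes from the discussion; the rest are made-up placeholders.
DEPS = {
    "docs_feature": ["spanner"],
    "spanner": ["hypothetical_lock_service", "hypothetical_storage"],
    "hypothetical_lock_service": ["hypothetical_time_service"],
    "hypothetical_storage": [],
    "hypothetical_time_service": [],
}


def transitive_deps(graph, service):
    """Return every service reachable from `service`, excluding itself."""
    seen = set()
    queue = deque(graph.get(service, []))
    while queue:
        dep = queue.popleft()
        if dep in seen:
            continue  # already visited; also guards against cycles
        seen.add(dep)
        queue.extend(graph.get(dep, []))
    seen.discard(service)  # a cycle could lead back to the start
    return seen


# What the design doc would list versus what the service actually relies on.
direct = set(DEPS["docs_feature"])
hidden = transitive_deps(DEPS, "docs_feature") - direct

print("listed in the design doc:", sorted(direct))
print("never listed:", sorted(hidden))
```

Even in this toy graph, the design doc names one dependency while three more sit underneath it. In a real system the graph also changes constantly, so a snapshot like `DEPS` would go stale quickly.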

ineedasername · on June 4, 2019

Got it, thank you. This type of constructive knowledge sharing is a big part of what makes HN a great community.

londons_explore · on June 5, 2019

And that is why you should switch certain classes of traffic down and eventually entirely off from time to time, to verify that everything you expect to keep working really does.
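
A rough sketch of what such a drill checks, in Python: turn one thing off, see what actually breaks, and compare that with what you expected to break. Here the "reality" is simulated with a toy graph whose service names are all hypothetical; in a real drill the affected set would come from observing production, not from a graph you already know.

```python
# Hypothetical graph: service -> services it calls directly (all hard dependencies).
GRAPH = {
    "frontend": ["api"],
    "api": ["database", "cache"],
    "batch_reports": ["database"],
    "status_page": ["metrics"],
    "metrics": ["cache"],
    "database": [],
    "cache": [],
}


def affected_by_outage(graph, down):
    """Return the services that depend on `down`, directly or transitively."""

    def reaches(service, seen):
        for dep in graph.get(service, []):
            if dep == down:
                return True
            if dep not in seen:
                seen.add(dep)  # mark visited so cycles cannot loop forever
                if reaches(dep, seen):
                    return True
        return False

    return {s for s in graph if s != down and reaches(s, set())}


# What the team believes will break if "cache" is switched off.
expected = {"api", "frontend"}

# What breaks in the simulated drill.
actual = affected_by_outage(GRAPH, "cache")

# Anything here is a dependency nobody had written down.
surprises = actual - expected
print("unexpected failures:", sorted(surprises))
```

In this example the drill turns up `metrics` and `status_page`, which nobody expected to depend on the cache. Finding that during a planned drill is the point of switching traffic off on purpose.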