This was not long before microservices at Google, but it was before there was much general knowledge about them. Sadly, the internal lessons learned at Google have not seeped into the outside world. Here are some examples (with a rough sketch of each after the list).
1. What I just said about how to make requests traceable through the whole system without excess load.
2. Every service should keep statistics and serve them at a standard URL. Build a monitoring system that scrapes that operational data as a major input.
3. Your alerting system should support rules like, "Don't fire alert X if alert Y is already firing." That is, if you're failing to save data, don't bother waking up the SRE for every service that is failing because you have a backend data problem. Send them an email for the morning, but don't page people who can't do anything useful anyway.
4. Every service should be stood up in multiple data centers with transparent failover. At Google the rule was N+2 globally / N+1 per region: your service had to run in at least two more data centers than normal load required globally, and at least one extra data center in every region. The result is that if any one data center goes out, no service is interrupted, and if any two data centers go out, the only externally visible consequence should be that requests might get slow.
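To make point 1 concrete, here is a minimal Python sketch of the idea, not Google's actual machinery: a trace ID and a sampling decision get minted once at the edge and forwarded on every downstream call, so any request can be followed end to end while detailed tracing stays cheap. The header names and the sample rate are made up.

    import random
    import uuid

    TRACE_HEADER = "X-Trace-Id"        # hypothetical header names
    SAMPLED_HEADER = "X-Trace-Sampled"
    SAMPLE_RATE = 0.01                 # fully trace ~1% of requests

    def incoming(headers):
        """Call at the top of every request handler: reuse the caller's
        trace context if present, otherwise mint one (we are the edge)."""
        trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
        sampled = headers.get(SAMPLED_HEADER)
        if sampled is None:            # only the edge makes the sampling call
            sampled = "1" if random.random() < SAMPLE_RATE else "0"
        return {TRACE_HEADER: trace_id, SAMPLED_HEADER: sampled}

    def log(ctx, msg, detail=False):
        """Cheap log lines always carry the trace id; expensive detail is
        only emitted for sampled requests, which keeps the load bounded."""
        if detail and ctx[SAMPLED_HEADER] != "1":
            return
        print(f"[trace={ctx[TRACE_HEADER]}] {msg}")

    def call_backend(ctx, rpc, *args, **kwargs):
        """Every outbound call forwards the same two headers unchanged."""
        return rpc(*args, headers=ctx, **kwargs)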
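For point 2, a sketch using only the Python standard library; the /statusz path and the counter names are placeholders. The point is just that every server answers the same path with its current counters, so one central poller can scrape them all.

    import json
    import threading
    from http.server import BaseHTTPRequestHandler, HTTPServer

    STATS = {"requests": 0, "errors": 0}
    LOCK = threading.Lock()

    def bump(name, delta=1):
        with LOCK:
            STATS[name] = STATS.get(name, 0) + delta

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/statusz":        # the one standard URL
                with LOCK:
                    body = json.dumps(STATS).encode()
            else:                              # stand-in for real work
                bump("requests")
                body = b'"ok"'
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("", 8080), Handler).serve_forever()

The monitoring system then polls /statusz on every instance and feeds those numbers into dashboards and alert rules.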
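For point 3, a toy rule evaluator; the rule format is invented, but it shows the shape: a dependent alert gets downgraded to email when the alert it depends on is already firing.

    # Each rule reads: "don't page for `alert` if `unless_firing` is firing."
    INHIBIT_RULES = [
        {"alert": "frontend_save_errors", "unless_firing": "storage_backend_down"},
        {"alert": "billing_save_errors",  "unless_firing": "storage_backend_down"},
    ]

    def route(firing):
        """Split currently-firing alert names into pages vs. morning email."""
        pages, emails = [], []
        for alert in firing:
            suppressed = any(
                rule["alert"] == alert and rule["unless_firing"] in firing
                for rule in INHIBIT_RULES
            )
            (emails if suppressed else pages).append(alert)
        return pages, emails

    # With the storage backend down, only that alert pages; the rest go to email.
    pages, emails = route({"storage_backend_down",
                           "frontend_save_errors",
                           "billing_save_errors"})
    print("page:", pages)            # ['storage_backend_down']
    print("email:", sorted(emails))  # ['billing_save_errors', 'frontend_save_errors']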
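For point 4, a back-of-the-envelope check of a proposed footprint against the N+2/N+1 rule; the region names and counts are made up.

    def check_footprint(needed_global, needed_per_region, deployed_per_region):
        """needed_*: data centers required for normal load;
        deployed_per_region: region -> data centers actually provisioned."""
        problems = []
        if sum(deployed_per_region.values()) < needed_global + 2:
            problems.append("global footprint below N+2")
        for region, count in deployed_per_region.items():
            if count < needed_per_region[region] + 1:
                problems.append(f"{region} below N+1")
        return problems

    # Example: normal load needs 3 DCs globally (1 per region), so N+2 means
    # at least 5 globally and N+1 means at least 2 in every region.
    print(check_footprint(
        needed_global=3,
        needed_per_region={"us": 1, "eu": 1, "asia": 1},
        deployed_per_region={"us": 2, "eu": 2, "asia": 1},  # asia is short
    ))  # -> ['asia below N+1']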
Now compare that to what people usually do with Docker and Kubernetes. I just have to shake my head. They're buzzword-compliant but fail to do any of what they need to do to make a distributed system operationally tractable. And then people wonder why their "scalable system" regularly falls over and nobody can fix it.