
To me this shows Google hasn't put sufficient monitoring in place to know the scale of a problem and choose the correct scale of response.

For example, if a service has an outage affecting 1% of users in some corner case, it perhaps makes sense to do an urgent rolling restart of the service, perhaps taking 15 minutes (on top of diagnosis and response times).

Whereas if there is a 100% outage, it makes sense to do an "insta-nuke-and-restart-everything", taking perhaps 15 seconds.

Obviously the latter is a really large load on all surrounding infrastructure, so it needs to be tested properly. But doing so can reduce a 25-minute outage down to just 10.5 minutes (10 minutes to identify the likely part of the service responsible, 30 seconds to do a nuke-everything rollback).
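Back-of-envelope version of the tradeoff above, using only the hypothetical figures from this comment (none of these are measurements of a real system):

```python
# Hypothetical recovery times, all in minutes, taken from the comment
# above. DIAGNOSIS_MIN is the time to identify the responsible part of
# the service; the two restart strategies then differ only in how long
# the restart itself takes.
DIAGNOSIS_MIN = 10
ROLLING_RESTART_MIN = 15   # careful rolling restart
NUKE_RESTART_MIN = 0.5     # aggressive restart-everything rollback

rolling_total = DIAGNOSIS_MIN + ROLLING_RESTART_MIN  # 25 minutes
nuke_total = DIAGNOSIS_MIN + NUKE_RESTART_MIN        # 10.5 minutes

print(f"rolling: {rolling_total} min, nuke-everything: {nuke_total} min")
```

The point is that once diagnosis dominates, the restart strategy is the only lever left on total outage duration.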

The 15 seconds figure may be very wishful thinking. Often a service startup is a short burst of severe resource consumption. Doing it with 100% of the fleet at once may stall everything in an uncontrollable overloaded state.

Is infrastructure at this scale typically unable to do a cold start? I can believe that this is very difficult to design for, but being unable to do it sounds dangerous to me.

(Edit for the downvoters: I was genuinely curious how these kinds of things work at Google’s scale. Asking stupid questions is sometimes necessary for learning.)

I guess it depends what "infrastructure" means.

If you mean "all of Google" then a cold restart would probably be very hard. At Facebook a cold restart/network cutoff of a datacenter region (a test we did periodically) took considerable planning. There is a lot to coordinate — many components and teams involved, lots of capacity planning, and so on. Over time this process got faster but it is still far from just pulling out the power cord and plugging it in again.

If you mean a single backend component then cold starting it may or may not be easy. Easy if it's a stateless service that's not in the critical path. But it seems this GCP outage was in the load balancing layer and likely harder to handle. A parent comment suggested this could be restarted in 15s, which is probably far from the truth. If it takes 5s to get an individual node restarted and serving traffic you'd need to take down a third of capacity at a time, almost certainly overloading the rest.
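The arithmetic behind that last point can be made explicit. This is a toy model, not a description of any real fleet: assume each node needs a fixed restart time, and ask what fraction of capacity must be down simultaneously to finish within the claimed 15 seconds.

```python
# Toy model: to finish a full restart within TARGET_S seconds, where
# each node takes RESTART_S seconds to come back and serve traffic,
# you can fit only TARGET_S / RESTART_S sequential waves. Both figures
# are the hypothetical ones from the comment above.
TARGET_S = 15   # claimed time for a full fleet restart
RESTART_S = 5   # time for one node to restart and start serving

waves = TARGET_S // RESTART_S        # 3 sequential waves fit in 15s
fraction_down = 1 / waves            # so ~1/3 of capacity down per wave
surviving_load = 1 / (1 - fraction_down)  # remaining nodes carry 1.5x

print(f"{waves} waves, {fraction_down:.0%} down at a time, "
      f"survivors at {surviving_load:.1f}x normal load")
```

A load-balancing layer running anywhere near capacity almost certainly cannot absorb 1.5x normal load on the surviving two-thirds, which is why the 15s figure doesn't hold up.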

In some cases the component may also have state that needs to be kept or refilled. Again, at FB, cold starting the cache systems was a fairly tricky process. Just turning them off and on again would leave cold caches and overload all the systems behind them.

Lastly, needing to be able to quickly cold restart something is probably a design smell. In the case of this GCP outage rather than building infra that can handle all the load balancers restarting in 15s it would probably be easier and safer to add the capability of keeping the last known good configuration in memory and exposing a mechanism to roll back to it quickly. This wouldn't avoid needing to restart for code bugs in the service but it would provide some safety from configuration-related issues.
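The last-known-good idea can be sketched in a few lines. Names and structure here are illustrative, not any real GCP or FB internals: new configs are validated before being applied, the previous good config is retained in memory, and rollback is a pointer swap rather than a restart.

```python
# Illustrative sketch of keeping the last known good configuration in
# memory with a fast rollback path. ConfigHolder and its methods are
# hypothetical names, not a real API.
class ConfigHolder:
    def __init__(self, initial):
        self.current = initial
        self.last_known_good = initial

    def apply(self, new_config, validate):
        """Validate, promote current to last-known-good, then apply."""
        if not validate(new_config):
            raise ValueError("config rejected by validation, not applied")
        self.last_known_good = self.current
        self.current = new_config

    def rollback(self):
        """Revert to the last known good config without a restart."""
        self.current = self.last_known_good


holder = ConfigHolder({"version": 1})
holder.apply({"version": 2}, validate=lambda c: "version" in c)
# Bad behavior observed after the push: revert in memory, no restart.
holder.rollback()
print(holder.current)  # {'version': 1}
```

As the paragraph notes, this only protects against configuration-related issues; a code bug in the service itself would still require a restart.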

> Lastly, needing to be able to quickly cold restart something is probably a design smell.

For everyone not at a scale to afford their own transoceanic fiber cables, a major internet service disruption is equivalent to a cold start. And as long as hackers or governments are able to push utter bullshit to the global BGP tables with a single mouse click, this threat remains present.

The comment I was replying to mentions "at [Google] scale", so my answer was with that in mind.

When Amazon S3 in us-east-1 failed a few years ago, the reason for the long outage (6 hours? 8 hours? I don't recall) was that they needed to restart the metadata service, and it took a long time for it to come back with the mind-boggling amount of data on S3. Cold starts are hard to plan for precisely at this type of scale.

It can be done. It takes a heck of a lot longer than 15s though.

Everyone flushing the toilet at the same time to clean the pipes

'Whereas if there is a 100% outage, it makes sense to do an "insta-nuke-and-restart-everything", taking perhaps 15 seconds.'

Take some time to consider what a restart means, across many data centers on machines which have no memory of the world before the start of their present job...

> rollback to the last known good configuration

could very much be the "fast" option. A 15s restart, or anything close to it, across the entire fleet sounds quite unlikely.

15 second rollbacks don't exist at scale.
