For example, if a service has an outage affecting 1% of users in some corner case, it probably makes sense to do an urgent rolling restart of the service, taking perhaps 15 minutes (on top of diagnosis and response times).
Whereas if there is a 100% outage, it makes sense to do an "insta-nuke-and-restart-everything", taking perhaps 15 seconds.
Obviously the latter puts a really large load on all surrounding infrastructure, so it needs to be tested properly. But doing so can reduce a 25-minute outage to just 10.5 minutes (10 minutes to identify the likely part of the service responsible, 30 seconds to do a nuke-everything rollback).
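The back-of-envelope math here can be sketched out explicitly (all numbers are the hypothetical ones from the comment, not real incident data):

```python
# Back-of-envelope comparison of the two recovery strategies,
# using the hypothetical timings from the comment above.
DIAGNOSIS_MIN = 10          # time to identify the likely culprit
ROLLING_RESTART_MIN = 15    # gentle rolling restart
NUKE_RESTART_MIN = 0.5      # "nuke everything" restart (30 seconds)

rolling_total = DIAGNOSIS_MIN + ROLLING_RESTART_MIN   # 25 minutes
nuke_total = DIAGNOSIS_MIN + NUKE_RESTART_MIN         # 10.5 minutes

print(f"rolling: {rolling_total} min, nuke: {nuke_total} min, "
      f"saved: {rolling_total - nuke_total} min")
```

Note that diagnosis time dominates either way; the aggressive restart only shaves the recovery half.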
(Edit for the downvoters: I was genuinely curious how these kinds of things work at Google’s scale. Asking stupid questions is sometimes necessary for learning.)
If you mean "all of Google" then a cold restart would probably be very hard. At Facebook a cold restart/network cutoff of a datacenter region (a test we did periodically) took considerable planning. There is a lot to coordinate — many components and teams involved, lots of capacity planning, and so on. Over time this process got faster but it is still far from just pulling out the power cord and plugging it in again.
If you mean a single backend component, then cold starting it may or may not be easy. It's easy if it's a stateless service that's not in the critical path. But it seems this GCP outage was in the load balancing layer, which is likely harder to handle. A parent comment suggested this could be restarted in 15s, which is probably far from the truth: if it takes 5s to get an individual node restarted and serving traffic, you'd need to take down a third of capacity at a time, almost certainly overloading the rest.
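The capacity math implied here is worth spelling out (illustrative numbers only, not actual GCP figures):

```python
# If each node takes 5s to restart and serve again, fitting the whole
# fleet into a 15s window means restarting a third of it at once.
node_restart_s = 5
window_s = 15

batches = window_s // node_restart_s       # 3 sequential batches
fraction_down = 1 / batches                # 1/3 of capacity down at once
load_multiplier = 1 / (1 - fraction_down)  # load on the surviving nodes

print(f"{fraction_down:.0%} down at a time -> "
      f"{load_multiplier:.2f}x load on the rest")
```

A 1.5x load spike on a fleet that's already degraded is exactly the kind of thing that cascades.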
In some cases the component may also have state that needs to be kept or refilled. Again, at FB, cold starting the cache systems was a fairly tricky process. Just turning them off and on again would leave the caches cold and overload all the systems behind them.
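A quick sketch of why cold caches overload the backends (the hit rate and traffic numbers are hypothetical):

```python
# With a warm cache, only misses reach the backing store; a cold
# cache forwards everything until it refills.
requests_per_s = 100_000
warm_hit_rate = 0.99   # hypothetical steady-state hit rate

warm_backend_qps = requests_per_s * (1 - warm_hit_rate)  # ~1,000 qps
cold_backend_qps = requests_per_s                        # 100,000 qps

print(f"cold restart multiplies backend load by "
      f"{cold_backend_qps / warm_backend_qps:.0f}x")
```

At a 99% hit rate, the backends are sized for 1% of the traffic, so a cold restart hands them roughly 100x their normal load.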
Lastly, needing to be able to quickly cold restart something is probably a design smell. In the case of this GCP outage, rather than building infra that can handle all the load balancers restarting in 15s, it would probably be easier and safer to keep the last known good configuration in memory and expose a mechanism to roll back to it quickly. This wouldn't avoid needing to restart for code bugs in the service, but it would provide some safety from configuration-related issues.
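A minimal sketch of the "last known good" idea, assuming an in-process config holder. All names here are hypothetical, not any real GCP or load-balancer API:

```python
import threading

class ConfigHolder:
    """Keeps the active config plus the last known good one in memory,
    so rollback is an in-process pointer swap rather than a restart."""

    def __init__(self, initial_config):
        self._lock = threading.Lock()
        self._active = initial_config
        self._last_known_good = initial_config

    def apply(self, new_config):
        with self._lock:
            previous = self._active
            self._active = new_config
            # A real system would promote `previous` to last-known-good
            # only after the new config passed health checks for a while;
            # this sketch does it immediately for brevity.
            self._last_known_good = previous

    def rollback(self):
        """Fast path: revert to last known good without any restart."""
        with self._lock:
            self._active = self._last_known_good
            return self._active

    @property
    def active(self):
        with self._lock:
            return self._active
```

The key design point is that rollback touches no disk and no control plane: the known-good state is already resident, so reverting is as fast as taking a lock.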
For everyone not at a scale to afford their own transoceanic fiber cables, a major internet service disruption is equivalent to a cold start. And as long as hackers or governments are able to push utter bullshit to the global BGP tables with a single mouse click, this threat remains present.
Take some time to consider what a restart means across many data centers, on machines which have no memory of the world before the start of their present job... a 15-minute rolling restart could very much be the "fast" option. A 15s restart, or anything close to it, across the entire fleet sounds quite unlikely.