Having multi-hour outages makes me really want to go back to hiring a couple of physical servers in a rack somewhere.
Scaling is the only real selling point of cloud, and far more people think they need it than actually do.
We're trying to fit something that's generally very centralized into the same model. Where the servers are does matter. What OS they run and how reliable they are matters on an individual level. The environment your server runs in is quite important, and it's definitely your server, not "any server will do", not by a long shot.
If the cloud were just some generic source of CPU instructions, we would have a government-regulated source of CPU power for everyone. But depending on what you're running, RAM size, cache size, network latency, CPU architecture, drive type, and endless other variables come into play, and they all matter.
Depending on hardware you can't control to hit metrics you definitely need to control is going to make the system less reliable, and that's what we're seeing now with cloud computing.
I don't think on-prem will ever die out. It's like owning vs renting your office. There are pros and cons to each, and we'll eventually hit some kind of equilibrium.
Cloud, to me, is also shared risk. I read "Google Cloud outage" as "the shared infrastructure that multiple companies rely on is unavailable at the moment".
The mindset should be to run your services on distributed infrastructure with no single point of failure. Leverage cloud, fog, racks, PCs, whatever resources you can, but diversify your content/services so that no single kind of failure can take you down.
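To make that concrete, here's a minimal sketch of one small piece of that mindset: client-side failover across backends that don't share a provider. The endpoint URLs are hypothetical placeholders (not anything from this thread), and a real setup would usually do this at the DNS or load-balancer layer rather than in application code.

```python
# Minimal failover sketch: probe independent backends (different providers,
# your own rack, etc.) and use the first one that answers its health check.
# The URLs below are hypothetical placeholders.
import urllib.request
import urllib.error

ENDPOINTS = [
    "https://app.cloud-provider-a.example/healthz",  # e.g. a managed cloud region
    "https://app.colo-rack.example/healthz",         # e.g. your own rack in a colo
]

def first_healthy(endpoints, timeout=2):
    """Return the first endpoint whose health check returns 200, else None."""
    for url in endpoints:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status == 200:
                    return url
        except (urllib.error.URLError, OSError):
            continue  # this backend is down or unreachable; try the next one
    return None

if __name__ == "__main__":
    target = first_healthy(ENDPOINTS)
    print(target or "no healthy backend; page someone")
```

The code itself is trivial; the point is that failover only buys you anything if the backends genuinely don't share a provider, a region, or a control plane.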
But if you're a small-ish team with ambitions of building something that may one day need to scale quickly to support a large influx of traffic / customers (generally unannounced / unplanned), I think it's insane not to have a cloud strategy / presence.
I have never seen research on it, but my hunch is that, given that a large number of websites / services end up being impacted simultaneously, it's probably better to be down when everyone else is than to be the only one down.
In my experience, when there's a large-scale outage, that information is far more likely to reach the end user quickly, and in a fashion where it may not even impact their perception of your business (i.e. maybe they originally experienced the issue on someone else's website / app).
But you can be certain that when you're down while everyone else is up, your potential and existing customers / users are far more likely to blame you and start searching for alternatives.
They are currently rolling out a change that should completely fix the issue, and elsewhere I read that they are rolling back a config change they made recently.
At 15:10 UTC, we duplicated one Kubernetes Service into a new one. Since then, we have had no alerts.
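For anyone curious what "duplicating a Service" might look like in practice, here's a rough sketch using the official Kubernetes Python client. The Service name and namespace are made up, and I'm only guessing this mirrors what the parent comment actually did.

```python
# Rough sketch: copy an existing Kubernetes Service under a new name,
# keeping the same selector and ports. Names and namespace are hypothetical.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

old = v1.read_namespaced_service(name="my-svc", namespace="default")

new_svc = client.V1Service(
    metadata=client.V1ObjectMeta(name="my-svc-copy", namespace="default"),
    spec=client.V1ServiceSpec(
        selector=old.spec.selector,  # point at the same pods as the original
        ports=old.spec.ports,        # for NodePort/LoadBalancer, clear port.node_port first
        type=old.spec.type,
    ),
)

v1.create_namespaced_service(namespace="default", body=new_svc)
```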
* This is a multi-region (us-east1, us-central1, europe-west1, asia-northeast1 and asia-east1) outage (yeesh)
* Incident began at 2017-08-29 18:35 PDT. At the time of this comment, we're over 13 hours in (??)
I think Leslie needs to update this for Cloud, however.
The number of mistakes (silly and otherwise) keeps increasing, and I'm not talking about the technical problem itself! OK, end of rant.
You are confusing "don't know how" with "absolutely not incentivised to spend money on this"
Writing poorly on the status dashboard, mixing up timestamps, changing them retroactively and hoping people don't notice, forgetting and/or mistakenly swapping the names of affected regions, and consistently writing "next update at x o'clock" only to invariably publish the update several to 10+ minutes after that time: these are all mistakes made by whoever is already paid to update that page and communicate with customers.
We haven't seen any issues, so presumably not? Or is it just a subset?