[flagged] GCE down in all regions (cloud.google.com)
320 points by vox_mollis on Apr 12, 2016 | 175 comments

I've been overall very impressed with the direction of Google Cloud over the last year. I feel like their container strategy is much better than Amazon's ECS in that the core is built on open source technology.

This can wipe away a lot of goodwill, though. A worldwide outage is catastrophic and embarrassing. AWS has had some pretty spectacular failures in us-east (which has a huge chunk of the web running within it), but I'm not sure that I can recall a global outage. To my understanding, these systems are built specifically not to let failures spill over to other regions.

Ah well. Godspeed to anyone affected by this, including the SREs over at Google!

I'm totally impressed with gcloud. Slick, smooth interface. Cheap pricing. The fact the UI spits out API examples for doing what you're doing is really cool. And it's oh-so-fast. (From what I can tell, gcloud's SSD is 10x faster or 1/10th the cost of AWS.)

And this is coming from a guy that really dislikes Google overall. I was working on a project that might qualify for Azure's BizSpark Plus (they give you like $5K a month in credit), and I'd prefer to pay for gcloud than get Azure for free

Same, was considering GCP for the future, but this is bad. I'm not using them without some kind of redundancy with another provider. I hope they write a good post-mortem, these are always interesting at large scale.

How bad is it really? They started investigating at 18:51, confirmed a problem in asia-east1 at 19:00, the problem went global at 19:21, and was resolved at 19:26.

They posted that they will share results of their internal investigation.

That kind of rapid response and communication is admirable. There will be problems with cloud services - it's inevitable. It's how cloud providers respond to those problems that is important.

In this situation, I am thoroughly impressed with Google.

It's bad because it concerns all their regions at the same time, while competing providers have mitigations against this in place. AWS completely isolates its regions for instance [1], so they can fail independently and not affect anything else. That Google let an issue (or even a cascade of problems) affect all its geographic points of presence really shows a lack of maturity of the platform. I don't want to make too many assumptions, and that specific problem could have affected AWS in the same way, so let's wait for more details on their part.

The response times are what's expected when you are running one of the biggest server fleets in the world.

1: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-re...

Expecting that problems which happen everywhere else won't happen with a cloud provider is a pipe dream. Providers might be better at it because of scale, but no cloud provider can always be up. It happened at Amazon, now it's happened at Google. Eventually, finding a provider that never went down will be like finding the airline that never crashed.

Operating across regions decreases the chances of downtime, it does not eliminate them.
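As a rough illustration of why (assuming independent region failures, which a correlated outage like this one violates; the numbers are made up):

```python
# Hypothetical: each region is independently down 0.05% of the time.
# The chance of n regions being down at once shrinks geometrically --
# but only if the failures are truly independent.
p_down = 0.0005

for n in (1, 2, 3):
    print(n, "regions all down:", p_down ** n)
```

A single correlated cause (a bad config push, a leaked route) collapses all of these back to p_down, which is why multi-region alone is not a guarantee.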

> The response times are what's expected when you are running one of the biggest server fleets in the world.

That may be true, but actually delivering on that expectation is a huge positive. And more than having the right processes in place, they had the right people in place to recognize and deal with the problem. That's not a very easy thing to make happen when your resources cross global borders and time zones.

Look at what happened with Sony and Microsoft - they were both down for days and while Microsoft was communicative, Sony certainly was not. Granted, those were private networks, but the scale was enormous and they were far from the only companies affected.

> It happened at Amazon

AWS has never had a worldwide outage of anything (feel free to correct me). It's not about finding "the airline that never crashed", it's finding the airline whose planes don't crash all at the same time. It's pretty surprising coming from Google because 15 years ago they already had a world-class infrastructure, while Amazon was only known for selling books on the Internet.

Regarding the response times, I recognize that Amazon could do better on the communication during the outage. They tend to wait until there is a complete failure in an availability zone to put the little "i" on their green availability checkmark, and not signal things like elevated error rates.

Here's an example from this thread: http://status.aws.amazon.com/s3-20080720.html

I stand corrected, my statement was too broad.

AWS had two regions in 2008 [1]. That was 8 years ago, and I think you would agree that running a distributed object storage system across an ocean is a whole different beast than ensuring individual connectivity to servers in 2016.

1: https://aws.amazon.com/about-aws/global-infrastructure/

> AWS completely isolates its regions

Yeah... just don't look too closely under the covers. AWS has been working towards this goal but they aren't there yet. If us-east-1 actually disappeared off the face of the earth AWS would be pretty F-ed.

Our servers didn't go off, just lost connectivity. Same has happened to even big providers like Level3. Someone leaks routes or something and boom, all gone.

I'd be surprised if AWS didn't have a similar way to fail, even if it hasn't happened yet. This is obviously a negative for gcloud, no doubt, but it's hardly omg-super-concerning. I'm sure the post-mortem will be great.

Actually, according to the status report, they confirmed that the issue affected all regions at 19:21 and resolved it by 19:27. That's six minutes of global outage.

Disclaimer: I work for Google (not on Cloud).

The outage took my site down (on us-central1-c) at 19:13, according to my logs, so it was already impacting multiple regions by 19:13. (I have been using GCP since 2012 and love it.)

Thank you, I missed that on my first reading - I saw the status update was posted at 19:45, not the content within it stating the issue was resolved at 19:27. I updated my parent comment.

I concur. The response was first rate.

Behind the scenes, I'm sure they will iterate on failure prevention and risk analysis.

Absolutely. GCP has been fantastic.

Amazon S3 went down globally on July 20, 2008: http://status.aws.amazon.com/s3-20080720.html

Sadly, I think a global outage was more acceptable in 2008 than it is now...

Knowing Google though, they'll learn their lesson on how to improve their entire workflow right quick.

Keep in mind, S3 was still a very new project at that point. Launched March 2006 and the first of its kind.


Can you talk about this? I have been spectacularly unsuccessful at using ECS (and currently run my VMs on a vanilla Debian ec2 instance)

Switching from ECS to GKE (Google Container Engine) currently. Both seem overcomplex for the simpler cases of deploying apps (and provide a lot of flexibility in return), but I have found the performance of GKE (e.g. time for configuration changes to be applied, new containers booted, etc) to be vastly superior. The networking is also much better, GKE has overlay networking so your containers can talk to each other and the outside world pretty smoothly.

GKE has good commandline tools but the web interface is even more limited than ECS's is - I assume at some point they'll integrate the Kubernetes webui into the GCP console.

GKE is still pretty immature though, more so than I realized when I started working with it. The deployments API (which is a huge improvement) has only just landed, and the integration with load balancing and SSL etc is still very green. ECS is also pretty immature though.

The problem is that GCP doesn't offer an RDS-style managed service with PostgreSQL, and external vendors are mostly more costly than AWS RDS. That matters especially for small customer homepages where you want to run on managed infrastructure as cheaply as possible.

This is sad for sure. The new Cloud SQL 2.0 (MySQL) is really good, and if you use a DB-agnostic ORM you can probably make MySQL work for quite a while. It's sad to lose access to all the new PG features though, and I would love it if Google expanded their Cloud SQL offerings.

This is what made me use AWS as well.

I'm admittedly biased, but have you checked out Docker Cloud? http://cloud.docker.com

While not Docker Cloud specifically, when we eyeballed UCP we found it very underwhelming when pitted against Kubernetes.

To us it appeared to be yet another in a sea of orchestration tools that give you a very quick and impressive "Hello World", but then fail to adapt to real-world situations.

This is what Kubernetes really has going for it: every release adds more blocks and tools that are useful, composable, and targeted at real-world use (allowing many of us crazies to deal with the oddball and quirky behavior our fleet of applications may have), not just a single path for how applications would ideally work.

This has generally been a trend with Docker's tooling outside of Docker itself, unfortunately. Similarly, docker-compose is great for our development boxes but nowhere near useful for production. And it doesn't help that Docker's enterprise offerings still steer you towards docker-compose and the like.

Not to bash, but the page you linked is classic Docker - it says literally nothing about what "Docker Cloud" is.

"BUILD SHIP & RUN, ANY APP, ANYWHERE" is the slogan they repeat everywhere, including here, and it means even less every time they do it. What IS Docker Cloud? Is it like Swarm? Does it use Swarm? What kinds of customers is Docker Cloud especially good at helping? All these mysteries and more, resolved never.

I hadn't heard of it, actually. However it doesn't seem to support GCP which removes it from contention for us unfortunately.

So am I (I'm a YC alum)... but RDS is too important for us to move away from it. Let me put it this way: if you had an RDS equivalent in Docker Cloud, lots of people would switch. Docker is more popular than you know.

Heroku should be an interesting learning example for the tons of new-age cloud PaaS offerings I'm seeing. Heroku's database hosting has always been key to its adoption, to the extent that lots of people continue to use it even after they move their servers to bare metal. The considerations and price sensitivity around data are very different than for app servers.

I believe this is Tutum, which they bought some time ago. I tried Tutum before with Azure. After deleting the containers from the Tutum portal, it didn't clean everything up in Azure. To this day, the storage Tutum created is still sitting in my Azure account. LOL.

Docker Cloud still requires BYO cloud, however.

For the record, the Kubernetes dashboard comes pre-installed on all masters on GKE. So the UI is there, albeit not integrated into the Gcloud console.

Seconded--I can tell the ECS documentation is trying to help, but the foreign task/service/cluster model + crude console UI keeps telling me to let my workload ride on EC2 and maybe come back later.

What I figured out much later was that ECS is a thin layer on top of a number of AWS services: an AMI that I can use myself, EC2 VMs that I can run myself, and Security Groups + IAM roles that I can create on my own.

But the way they have built the ECS layer is very very VERY bad.. and I have an unusually high threshold for documentation pain.

I work on Convox, an open source PaaS. Currently it is AWS only. It sets up a cluster correctly in a few minutes. Then you have a simple API - apps, builds, releases, environment and processes - to work with. Under the hood we deploy to ECS but you don't have to worry about it.

So I do agree that ECS is hard to use but with better tooling it doesn't have to be.

I'm also a big fan of how GKE is shaping up.

Spotify is down due to this, which is, uh, pretty hilarious https://news.spotify.com/us/2016/02/23/announcing-spotify-in...

If Spotify wanted to be really sneaky, some amount of downtime might be good for them financially.

The bulk of their revenue comes from customers who subscribe on a per-month basis, while they pay out royalties on a per-song-played basis. This outage is reducing the amount they have to pay, and if the outage-elasticity-of-demand is low enough they could (hypothetically) come out ahead!

> while they pay out royalties on a per-song-played basis

I believe this is inaccurate. They pay out royalties on a share-of-all-plays basis, don't they?[0] So an outage wouldn't reduce the payout amount, it would just slightly alter the balance of payments for individual rightsholders.

[0]http://www.spotifyartists.com/spotify-explained/#royalties-i...: "That 70% is split amongst the rights holders in accordance with the popularity of their music on the service. The label or publisher then divides these royalties and accounts to each artist depending on their individual deals... Spotify does not calculate royalties based upon a fixed “per play” rate."
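A minimal sketch of that share-of-all-plays model (the names and numbers here are illustrative, not Spotify's actual terms):

```python
def prorata_payouts(pool, plays):
    # Split a fixed royalty pool by each rightsholder's
    # share of all plays on the service.
    total = sum(plays.values())
    return {holder: pool * n / total for holder, n in plays.items()}

# An outage scales every play count down together, so the shares --
# and therefore the payouts from the fixed pool -- don't change:
normal = prorata_payouts(70.0, {"label_a": 900, "label_b": 100})
outage = prorata_payouts(70.0, {"label_a": 450, "label_b": 50})
print(normal == outage)  # True
```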

Rather, as they don't have an SLA with end users, they get credits from Google, so they still earn money.

But the reputation is damaged.

I see your point. Based on @SpotifyStatus [0], it wasn't uncommon for Spotify to have service disruptions before they did the move though.

[0]: https://twitter.com/SpotifyStatus

I bet they and Google will have a positive post written up about it tomorrow though. :)

To be fair, AWS had some significant downtime in the second half of last year.

Not across multiple regions, though. It's not trivial to make an application cross-region, but at least there is a way to engineer around a single-region outage, unlike this one.

I think it's a great idea to diversify not just regions, but providers.

The future is per-application virtual networks that are agnostic to the underlying hosting provider. These networks work as an overlay, which means that your applications can be moved between providers without changing their architecture at all. You could even shut an application down in provider A and start it in provider B without any changes.

At Wormhole[1] we have identified this problem and solved it.

[1]: https://wormhole.network

I was getting really frustrated at the gym when the Spotify app wouldn't work. Didn't expect to find the answer here.

And they won't be down again in a long, long time.

Were Google services (search, email, drive, apps) impacted at all?

Google Custom Search was down for a similar time period. (Outage lasted longer than GCE, but seems likely related.)

No, they don't use gcloud to host their own apps. Yes, that's ridiculous.

Which is exactly why I won't use GCE. If Google isn't confident enough to use it for themselves, neither will I.

The fact that Amazon dogfoods AWS is a major advantage for them.

Even if Amazon.com uses AWS (the extent of which seems to be a mixture of marketing hype and urban legend), there are many ways AWS could fail that affect customers but leave Amazon.com unaffected.

It would be hilarious if they hosted their services on AWS or Azure.

I remember people speculating that Spotify most likely received a big discount for doing this. Guess you get what you pay for ;)

Google provides 99.95% - https://cloud.google.com/appengine/sla

That is an allowance of about 17 such 15-minute breaks per year, i.e. one small (or one large but quickly fixed) screwup per month :)
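The arithmetic behind that allowance, for anyone checking:

```python
sla = 0.9995
minutes_per_year = 365 * 24 * 60          # 525,600
budget = (1 - sla) * minutes_per_year     # downtime allowed per year
print(budget)        # ~262.8 minutes
print(budget / 15)   # ~17.5 fifteen-minute outages
```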

99.95% is correct, but the Compute Engine SLA is actually here: https://cloud.google.com/compute/sla

Today's incident did not impact App Engine at all.

(Disclaimer: I work in Google Cloud Support.)

This allows unlimited small outages:

"Downtime Period" means, for an Application, a period of five consecutive minutes of Downtime. Intermittent Downtime for a period of less than five minutes will not be counted towards any Downtime Periods.
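As I read that wording (my interpretation, not Google's official accounting), only contiguous outages of five minutes or more count at all:

```python
def counted_downtime(outage_minutes):
    # Outages shorter than five consecutive minutes never become
    # "Downtime Periods"; longer ones count in full.
    return sum(m for m in outage_minutes if m >= 5)

print(counted_downtime([4, 4, 4]))  # 0 -- three 4-minute blips count as nothing
print(counted_downtime([4, 16]))    # 16 -- today's outage would count
```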

It seems to be working fine, I just tested.

Outage lasted about 16 minutes

Happy to say I didn't notice; I am using a 517-song offline playlist ("EVE Online" by Michael Andrew) for my programming work.

That sounds awesome, care to share?

Sorry late - link is https://open.spotify.com/user/1231239981/playlist/3ka1SYnv2b... . Hope you see this post and enjoy it.

I see from the downvotes that my reply must have been seen as somewhat off-topic to the GCE issue. However, since Spotify came up as a "victim", I felt it prudent to mention that Spotify Premium has offline playlists to let users weather network issues of any kind. Also, for me personally, big playlists of quality music like this one are fantastic for my work.

That was nuts. Interested to read the post-mortem on this one. Our site went down as well. What could cause a sudden all-region meltdown like that? Aren't regions supposed to be more isolated to prevent this type of thing?

Seems to have only been down for about 10 minutes, so I'm thinking some sort of mis-configuration that got deployed everywhere...they were working to fix a VPN issue in a specific region right before it went down...


To err is human. To propagate error to all servers automatically is DevOps.

So, you're saying the servers are aladeen?

Our website was down as well, for 16 minutes. My guess is that it was a bad route that was pushed out simultaneously (probably not intentionally). It happened once before, sometime last year, if I remember correctly. We'll have to wait and see what the definitive cause was, though.

Sounds like a routing issue.

Maybe someone pushed the wrong BGP routes, hence why the quick fix and the initial issue with Cloud VPN.

Source: Totally guessing.

Unless the same bug would hit every provider, this is why it's good to consider multiple cloud services for failover.

And where do you put your system that directs traffic to one cloud and/or another... and what happens when that goes down?

You get an AS number, and announce your own IP space. DNS failover only sort-of works.

Or you subscribe to a "GSLB" service where they do this for you for a significant fee. Or you use a "man-in-the-middle as a service" system like Cloudflare, which does this at an extremely reasonable and/or free cost.

Of course, you still have to deal with the risk of route leaks, BGP route flapping/dampening, and other things which can take your IP addresses offline despite the fact you are multihoming with different carriers in different locations.

So perhaps you set up IP addresses on different ASNs and use both DNS- and IP-based failover.

But then you find a bug somewhere in your software stack which makes all of this redundancy completely ineffective. So you just take your ball, go home and cry.

Kind of the point... adding more possibilities for failure, at increased complexity and expense isn't always worth it... and I'd say usually isn't.

You put it in all your clouds, with low TTL DNS entries pointing at all those instances (or the closest one geographically maybe). Then if you're really paranoid you use redundant DNS providers as well.
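The decision such a setup effectively makes on each resolution is simple; here's a sketch (endpoint names are hypothetical):

```python
def pick_endpoint(endpoints, is_healthy):
    # Return the first healthy endpoint in priority order; if every
    # provider looks down, fall back to the primary rather than nothing.
    for ep in endpoints:
        if is_healthy(ep):
            return ep
    return endpoints[0]

endpoints = ["gce.example.com", "aws.example.com"]
down = {"gce.example.com"}
print(pick_endpoint(endpoints, lambda ep: ep not in down))  # aws.example.com
```

The hard part, as the replies note, isn't this logic -- it's getting the world's DNS resolvers to actually honor your TTLs.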

And then you discover that there are a LOT of craptastic DNS resolvers, middle boxes, AND ISP DNS servers out there that happily ignore or rewrite TTLs. With a high-volume web service, you can have a 1 minute TTL, change your A records, and still see a lovely long tail of traffic hitting the old IP for HOURS.

The point was that adding another point for potential failure still won't reduce the chance of failure... it's just something else that can and will break.

In any case, failures happen, and most systems are better off being as simple as possible and accepting the unforeseen failures than trying to add complexity to overcome them.

I wonder if it was an issue with their Maglevs[0]?

[0] http://research.google.com/pubs/pub44824.html

So in all seriousness, how do folks deal with this?

In this case, it ended up being a multi-region failure, so your only real solution is to spread it across providers, not just regions.

But I imagine it's a similar issue to scaling across regions, even within a provider. We can spin up machines in each region to provide fault tolerance, but we're at the mercy of our Postgres database.

What do others do?

Most people just deal with it and accept that their site will go down for 20 minutes every 3-4 years or so, even when hosting on a major cloud, because:

1) the cost of mitigating that risk is much higher than the cost of just eating the outage, and

2) their high traffic production site is routinely down for that long anyway, for unrelated reasons.

If you really, really can't bear the business costs of an entire provider ever going down, even that rarely (e.g. you're doing life support, military systems, big finance), then you just pay a lot of money to rework your entire system into a fully redundant infrastructure that runs on multiple providers simultaneously.

There really aren't any other options besides these two.

This here is right on.

I will add that if you can afford the time and effort, it's good to design your system from the beginning to work on multiple providers without many issues. That means trying as hard as you can to use as few provider-specific services as possible (RDS, DynamoDB, SQS, BigTable, etc.). In most cases, pjlegato's 1) will still apply.

But you get a massive side-benefit (main benefit, I think) in cost. There are huge bidding wars between providers and if you're a startup and know how to play them off each other, you could even get away with not having to pay hosting costs for years. GC, AWS, Azure, Rackspace, Aliyun, etc, etc are all fighting for your business. If you've done the work to be provider-agnostic, you could switch between them with much less effort and reap the savings.

If you are doing life support, military systems, or HA big finance, then you are quite likely to be running on dedicated equipment, with dedicated circuits, and quite often highly customized/configured non-stop hardware/operating systems.

You are unlikely to be running such systems on AWS or GCE.

And that's why IBM is still in the server business: There's nothing like a mainframe when it comes to uptime.

HP also has some good products in the highly available space - http://h20195.www2.hp.com/v2/getpdf.aspx/4aa4-2988enw.pdf , likely from their acquisition of Tandem.

> likely from their acquisition of Tandem.

Yep. Those were originally Itanium-only, so their success was somewhat… limited, compared to IBM's "we're backwards compatible to punch cards" mainframes.

Only recently did Intel start to port over the mission critical features like CPU hotswap to Xeons, so they can finally let the Itanic die, so we're hopefully going to see more x86 devices with mainframe-like capabilities.

IBM also owns Softlayer which is a great cloud provider for the more traditional VM/dedicated servers architecture.

And have similar failure rates. Human errors are inevitable.

> even when hosting on a major cloud

Hosting on anything, anywhere, really. Even if you build clusters with true 100% reliability running on nuclear power buried 100 feet underground, you still have to talk to the rest of the world through a network that can fall apart for a variety of reasons. If most of your users are on their mobile phones, they might not even notice your outages.

At some point adding an extra 9 to the service availability can no longer be justified for the associated cost.

Also, 20 minutes every 3 years is 5-9s anyways.

> Most people just deal with it and accept that their site will go down for 20 minutes every 3-4 years or so, even when hosting on a major cloud

If THAT is what I get for the prices of Google Compute Engine, I could just as well use OVH's cloud -- uptime isn't worse, and the price is a lot cheaper.

Depends entirely on your business, but what I do is just tolerate the occasional 15-minute outage. There's increasing cost to getting more 9's on your uptime, and for me, engineering a system that has better uptime than Google Cloud does, by doing multi-cloud failover, is way out of favorable cost/benefit territory.

That's the only sane thing to do.

It is impossible to ensure 100% uptime, and it gets increasingly harder to approach that as you put more separate entities between yourself and the client. The thing is, you'll be blamed for problems that aren't in your control and aren't really related to your service, but to the customer. For example, local connectivity, phone data service, misbehaving DNS resolvers, packet mangling routers, mis-configured networks, mis-advertised network routes, etc. Every single one of those examples can happen before the customer traffic even gets to the internet, much less where you have your servers housed.

All you can do is accept that there will be problems attributed to your service, rightly or not, work to mitigate and reduce the problems you can, and learn that it's not the end of the world.

One answer is to evaluate the uptime you get with a single cloud provider and figure out if it meets your needs. This outage means that for the year, GCE will have at most a .999948 == four and a half "nines" of uptime. From a networkworld article in 2015: http://www.networkworld.com/article/2866950/cloud-computing/... And 2015: http://www.networkworld.com/article/3020235/cloud-computing/...

The article quotes Ben Treynor as saying that Google aimed for and hit 99.95% uptime, which works out to about 4.4 hours of downtime per year.

My guess is that, despite cloud outages being painful, many applications are probably going to meet their cost/SLO goals anyway. Going up from 4 9s starts to get very expensive very quickly.
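Both figures quoted above are easy to sanity-check (taking ~27 minutes as the year-to-date outage):

```python
minutes_per_year = 365 * 24 * 60   # 525,600

# ~27 minutes of outage so far this year:
print(1 - 27 / minutes_per_year)   # ~0.999949

# The 99.95% target expressed as hours of downtime per year:
print((1 - 0.9995) * 8760)         # ~4.38 hours
```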

Most people and customers are tolerant of 15 minutes downtime here and there once or twice a year. Sure, there will be loudmouths who claim to be losing thousands of dollars in sales or business, but they're usually lying and/or not customers you want to have. They'll probably leave to save $1 per month with another vendor.

It sucks but the days of "ZOMG EBAY/MICROSOFT/YAHOO DOWN!!11!" on the cover/top of slashdot and CNET are gone. Hell, slashdot and CNET are basically gone.

IMHO, the next wave is likely multi-cloud. Enterprises that require maximum uptime will likely run infrastructure that spans multiple cloud providers (and optionally one or more company controlled data centers).

OneOps (http://oneops.com) from WalmartLabs enables a multi-cloud approach. Netflix Spinnaker also works across multiple cloud providers.

DataStax (i.e. Cassandra) enables a multi-cloud approach for persistent storage.

DynomiteDB (disclaimer: my project) enables a multi-cloud approach for cache and low latency data.

Combine the above with microservices that are either stateless or use the data technologies listed above and you can easily develop, deploy and manage applications that continue to work even when an entire cloud provider is offline.

Get enough things running on multi-cloud, and you could potentially see multi cloud rolling failures, caused by (for example) a brief failure in service A leading to a gigantic load-shift to service B...

This assumes all the software your stack uses and deploys is completely bug-free. While rare, bugs that have been in production for a long time can surface only when you hit certain conditions. Now all your services are down. 100% is impossible.

Also, if there is a problem with a component of your stack that could have run on a cloud service, chances are Google or Amazon will fix your edge case much quicker than you would.

By pretending it's still the golden age of the internet and using physical servers in those locations. You might have to hire some admins, though ;).

Hey now, we do want 100% uptime but let's not get hasty.

Guys, I've got it- instead of locking ourselves in with one vendor's platform and being subject to their mismanagement and arbitrary rules, why don't we buy our own hardware and do it ourselves?

We can free up our OpEx budget too! My sales rep sent me a TCO that shows it is way cheaper to run a data center than to pay a cloud subscription!

I'm calling the CFO!

Great idea! Can't wait for our own mismanagement and arbitrary rules! I always wanted to play hobbyist technical infrastructure specialist.

Will your own hardware have the same reliability as Google's?

I think it was sarcasm......

AWS had several major outages in the past, especially between 2009 and 2012. In some cases it was not only downtime but also data loss, which is the hardest part. There are 8,760 hours in a year; if you are down for a total of less than 8.76 hours, you're in the >99.9% uptime category (also called "three 9s"). Four 9s (99.99%) is considered a nice plus. Very few businesses really need that.
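The nines translate into yearly downtime allowances like so:

```python
HOURS_PER_YEAR = 8760

def allowed_downtime_hours(availability):
    # Hours of downtime per year permitted at a given availability level.
    return HOURS_PER_YEAR * (1 - availability)

for label, a in [("three 9s", 0.999), ("four 9s", 0.9999), ("five 9s", 0.99999)]:
    print(label, round(allowed_downtime_hours(a), 3), "h/yr")
```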

However, uptime is one thing. Data loss is another different beast.

AFAIK, this one for Google is only downtime. Being able to keep GCE up for all but a few hours of the year means more than 99.9% availability, which is what most customers need.

Operational excellence, or the ability to have your cloud up and running, comes only with a large customer set; Google is now gaining a lot of significant customers (Spotify, Snapchat, Apple, etc), and therefore I expect them to learn what's needed over the coming months.

2016 will be ok. 2017, in my view, will be near perfect.

If Google wants to differentiate themselves from AWS, they should offer an SLA on data integrity (at a premium, obviously). That's how you get thousands of enterprise customers.

Shameless plug: I've also extensively written about AWS, Azure and GCE here: https://medium.com/simone-brunozzi/the-cloud-wars-of-2016-3f...

Great timing for that new book :)

Seems kinda misleading from Google to claim repeatedly that they are hosted just on the same infrastructure of GCP, and not go down with it.

EDIT: Switched from "dishonest" to "misleading"; while it's abundantly clear that Google doesn't run on GCP, GCP feels like a second-class citizen to Google because you just cannot get Google uptime with it.

Google and GCP run on the same infrastructure, but this was a GCP problem, not a problem with that common infrastructure.

(I'm a Google SRE, I'm on the team that dealt with this outage)

This did impact common infrastructure. Some (non-cloud) Google services were impacted. We've spent years working on making sure gigantic outages are not externally visible for our services, but if you looked very closely at latency to some services you might have been able to see a spike during this outage.

My colleagues managed to resolve this before it stressed the non-cloud Google services to the point that the outage was "revealed". If this was not mitigated, the scope of the outage would have increased to include non-cloud Google services.

There's a lot of infrastructure at Google. And the claim is correct: GCP and Google sit on top of the same servers, the same backend systems. Are they on exactly the same servers? No, of course not -- there are a few servers out there :-)

This was a global network outage, so we can't talk about the "exact same servers". The implication seems to be that Google runs on GCP, yet a global network outage somehow affected every GCP customer except Google itself.

I do not envy the current on-call rotation.

I don't envy anyone with an on-call job.

I was so glad to get away after 5 years of 24/7/365. I had to drive home 5 hours from holiday once, leaving the rest of the family behind, spend 20 minutes sorting stuff out and drive back - the untold joy of pre-cloud startups :)

What do you recommend? I figure that if you're working on something without oncall, no one probably cares about it anyway. I'd rather have a good rotation than no rotation.

Staffing such that on-call is handled by presently-in-office staff. This is, as I understand, pretty much what Google does. When you're in the office, you're in the office, but when you're not, you're not. Having global coverage means ops in several timezones, and this is what Google accomplishes.

Not knowing when, at any time, your phone or pager will go off wears in interesting ways over time.

It depends on the team and type of oncall rotation for the service. My team (a SWE team) has its own oncall rotation as we don't have dedicated SREs for all of our services.

Since we're US based only, it means the oncall person will have pager duty while they sleep. Our pager can be a bit loud at night due to the nature of our services, so it's definitely not for everyone (luckily it's optional).

Is this at Google?

I'll note you're SWE not SRE. I'm talking mostly about dedicated Ops crew on pager.

It's one thing if you're responding to pages resulting from other groups' coding errors or failure-to-build sufficiently robust systems. Another if you're self-servicing.

One of my own "take this job and shove it" moments came after pages started rolling in at 2am, bringing me on-site until 6am. I headed back for sleep, showed up that afternoon and commented on the failure of any of the dev team to answer calls/pages/texts (site falling over, I had exceptionally limited access capabilities and was new on team). Response was shrugs.

Mine was "That wasn't your ass being hauled out of bed. See ya."

The opinions stated here are my own, not necessarily those of Google.

Yes, it is at Google. Our important and high visibility bits have SREs that help monitor our services (SREs actually approached us to take over some bits that were more important).

Google has a lot of oncall people who are never going to go into a data center (most Googlers never see a data center). So there are lots of oncall rotations with an SLA that can still be handled from bed if a page comes at 2am.

(I sadly can't give any examples)

This is not generally true for at least the big SRE-supported services at Google. I don't know what every team does, but my team's oncall shift (for example) is 10am-10pm, Mon-Thu or Fri-Sun. Another office covers the 10pm-10am part of the US day.
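The 12-hour split described above is easy to model: two offices hand the pager off so every UTC hour is covered and nobody is paged outside working hours. A minimal sketch of that follow-the-sun logic (the office names and shift boundaries here are hypothetical, not Google's actual setup):

```python
# Sketch of a follow-the-sun rotation: two offices split the day so
# nobody holds the pager outside their local working hours.
# Offices and shift boundaries are made up for illustration.

def oncall_office(utc_hour):
    """Return which office holds the pager at a given UTC hour.

    Assume a US office covering 10:00-22:00 local time (UTC-7,
    i.e. 17:00-05:00 UTC) and a European office covering the rest.
    """
    us_shift_utc = {(h + 17) % 24 for h in range(12)}  # 12-hour shift
    return "us" if utc_hour % 24 in us_shift_utc else "eu"

# Every hour of the day is covered by exactly one office,
# and the day is split evenly between the two.
assert all(oncall_office(h) in ("us", "eu") for h in range(24))
assert sum(oncall_office(h) == "us" for h in range(24)) == 12
```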

That's for first response ops though, what if a code change is needed to recover or something else that goes beyond the playbooks?

I guess today is a perfect example, I wonder how many out of hours engineers got paged.

Then Dev gets to deal with its own shit.

The magic of devOps is carting two pagers around.

There are still a few things in the world that aren't internet services.

Only a few though.

Games, mobile apps, desktop apps like Photoshop, Office, IntelliJ, etc., and some shrink-wrapped server-side apps. But you are right: some of these products are starting to have an online component as well.

Interestingly, nowadays all but the last one tend to require a service to be available.

I recommend others doing on call so I don't have to. I'm not an ops person, though I probably wouldn't mind some of the job, and hate being on call. I did it for a year at my current workplace (as a dev). All the problems I was capable of fixing I automated away, and got really annoyed that others didn't do the same for their areas of expertise. In hindsight, we probably should have had separate rosters for separate areas to encourage ownership, but we were a very small team (6 or so).

Developing software that other people deploy, as opposed to running a service people use directly, is pretty great for doing something people care about without being on call.

At my last job (comfortably small enterprise software shop), we had customers with more employees deploying and running our product than we ourselves had engineers. The only people who were first-line pageable were IT and the one engineer maintaining our demo server, which we eventually shut down.

There has to be a certain level of karma/schadenfreude in this happening the same week they are pushing their SRE book... did they handle it well? It seems so, but a lot of their book is about an ounce of prevention being worth a pound of on-call pagers going off.

Prevention is a big part of SRE, but an equally big part is formalizing a process to learn from the inevitable outages that come with running a large, complex, distributed system built by fallible humans.

You figure out what went wrong and fix it, of course, but more importantly, you figure out where your existing systems and processes (failover, monitoring, incident response, etc.) did and didn't work, and you improve them for the next time.

Ask HN: Is anyone using different cloud providers for failover and what's your DNS configuration?

Do any cloud providers allow announcing routes for anycast DNS?

As a long time service provider network engineer I appreciate network clueful companies and recommend none and packet.net.

Vultr advertises and supports anycast, if you're looking for a multi-location VPS provider. Others will do BGP, but it's a sales process.

Yes, I'm using two different providers and sync them up using master-master replication.

For DNS I use DNSMadeEasy's DNS Failover feature that automatically fails over to a different IP address when it's unable to ping the server.
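That kind of DNS failover boils down to: health-check the primary, and point the record at a secondary when the check fails. A minimal sketch of the selection logic (the IPs and the health-check callable are placeholders; a real setup relies on the DNS provider's own monitoring, not your own script):

```python
# Toy model of ping-based DNS failover: return the first healthy
# address from an ordered list, falling back down the chain.

def pick_record(endpoints, is_healthy):
    """endpoints: ordered list of IPs, primary first.
    is_healthy: callable IP -> bool (e.g. an ICMP or HTTP check)."""
    for ip in endpoints:
        if is_healthy(ip):
            return ip
    return endpoints[0]  # all checks failing: leave the primary in place

ips = ["203.0.113.10", "198.51.100.20"]  # RFC 5737 documentation addresses
assert pick_record(ips, lambda ip: True) == "203.0.113.10"
assert pick_record(ips, lambda ip: ip != "203.0.113.10") == "198.51.100.20"
```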

I, too, would be interested in info on any cloud providers that support anycast.

I'm no networking expert but packet.net has a page on this: https://www.packet.net/bare-metal/network/anycast/

I see the benefit of using anycast for your DNS, but is anycast actually a better option than DNS load balancing for my site? The idea behind using anycast is to use at least two different providers, so having only packet.net doesn't really cut it. Also, I can do DNS load balancing with any provider by using something like Azure's Traffic Manager, so I struggle to see the advantages.

It's very uncommon. In my experience the database becomes the issue.

If you're using something like Cassandra (C*), it's pretty easy to have replicated data to multiple zones in multiple clouds. RethinkDB has a similar replication system... there are many others as well.
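With Cassandra, spanning two clouds mostly means treating each cloud as a logical datacenter and replicating with NetworkTopologyStrategy. A sketch of generating that keyspace DDL (the datacenter names `aws_east` and `gce_east` are hypothetical labels, not anything standard):

```python
# Build a CREATE KEYSPACE statement that replicates data to several
# logical datacenters (e.g. one per cloud/region) using Cassandra's
# NetworkTopologyStrategy. DC names below are illustrative only.

def multi_dc_keyspace(name, replication):
    """replication: dict mapping datacenter name -> replication factor."""
    opts = ", ".join(
        "'{}': {}".format(dc, rf) for dc, rf in sorted(replication.items())
    )
    return (
        "CREATE KEYSPACE {} WITH replication = "
        "{{'class': 'NetworkTopologyStrategy', {}}};".format(name, opts)
    )

ddl = multi_dc_keyspace("app", {"aws_east": 3, "gce_east": 3})
assert "NetworkTopologyStrategy" in ddl
assert "'aws_east': 3" in ddl and "'gce_east': 3" in ddl
```

With a replication factor in each DC, `LOCAL_QUORUM` reads and writes keep latency within one cloud while the other cloud holds a full copy for failover.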

Not all data models fit into a non-SQL database, though, and some may work better in a more relational store. Caching and running read-only/partially up can be another approach.

Designing your data around this may be impractical for some applications, and on the small scale, likely more costly than necessary. Most systems can tolerate 15-30 minutes of downtime every couple years because of an upstream provider.

So which Google services went down with this? It looks like they are not eating their own dog food.

Has Amazon.com ever been affected by an AWS outage?

edit: at least it's impacted some of their ancillary services before http://www.theregister.co.uk/2015/09/20/aws_database_outage/

Amazon.com depends on AWS but not all the services and not necessarily in the critical path of serving up pages on the main website.

E.g. All the servers run on EC2 but they don't really use ELB or EBS. S3 is used heavily throughout the company but some of the newer services not so much.

A lot of Google's own services predate these public offerings. So my guess is even if they're on the same or similar technologies, they may be separate systems.

Google is self-hosted, but they might not use the same hardware GCE uses.

different hardware isn't really part of the equation - it's more that most of google's internal systems aren't _on_ gce, but _adjacent to_ gce. There's a cloud beneath that cloud, so to speak.

most of google does not run on google cloud.

may be the things like gmail have multicloud failover, say to AWS?

Google Custom Search also seems to have gone down globally today. Likely related; although GCE is back up, CSE is still out, leaving many sites without an international search feature.

I did not choose GCE because it does not have a server in Asia Pacific; Microsoft Azure, DigitalOcean, and AWS each have one. Sorry, correction: what I mean is South East Asia.

They have three zones in Asia Pacific[0].

[0] https://cloud.google.com/compute/docs/zones#available

Sorry, I mean South East Asia. https://azure.microsoft.com/en-us/status/

kahwooi possibly meant South East Asia? Latency to GCE (Taiwan) from Australia is a fair bit worse than to AWS (Sydney/Singapore) for example.

i migrated from aws to google yesterday. fml

Time Warner Cable's DNS server went down at roughly the same time (in Austin). I'm hoping that's just a coincidence.

Correct me if i'm wrong, but it looks like Cloud VPN went down, not all of GCE.

FYI I run about 15 servers in Asia, US East, and Europe on GCE with external monitoring and didn't get a peep from my error checking emails during that timeframe.

I know this will get downvoted, but clouds suck, and this is just one more manifestation of why. Unless you have a very spiky workload, save yourself long-term pain and don't go this route (applies if your monthly AWS/GCE/Azure bill is over a few K).

If you're in a position where you can spin up servers and get the job done, that means you're just using the cloud as rented servers, in which case you are absolutely right.

If however you're using the cloud as intended, and using all of the services it actually provides, I highly doubt you could run 23 data centers around the world with databases and firewalls and streaming logging and all they other stuff they provide at even a fraction of the cost.

In reality they provide remarkably little, with a lot of strings attached :). Let's take a service as basic as network transport: care to compare the cost?

I'm not seeing why not. Your data center could go down for a myriad of reasons (ISP goes down, HDs, tripping on power cable, etc). If that happens you're pretty much screwed. You could compensate by having multiple data centers with different infrastructure providers. If you do, you're probably spending more than the few K you referenced in your post.

Yes, it's bad that apparently all of the regions failed. Google will hear about it. People will get in trouble. But a screw up at this level is rare. If you use cloud, or even a VPS provider like Linode, you get auto-fail over and someone that is contractually obligated to deal with failures.

Or the FBI raids the colo and rips out everything that looks like a computer because another tenant was operating a Silk Road clone.

You are paying penalty in complexity, latency and poor tenant isolation when running on "cloud infrastructure" and when things blow up you have no recourse.

Do you have any examples of poor tenant isolation in AWS, GCE, or Azure?

Cloud complexity is also lower because you don't have to worry about power, cooling, upstream connectivity, capacity budgeting, etc. If 99.9-99.95% availability is fine for your application then you probably don't have to worry about your provider either.

On AWS, Netflix consumes enough resources that if they spike 40-50%, everyone is screwed. The software required to run a cloud like AWS is orders of magnitude more complex than what the average project would need, and that results in major screwups. Both major AWS outages were due to control plane issues; the second was the result of a massive Netflix migration that triggered throttles for everyone in the affected AZs. The throttles had been put in place after the first major outage, which lasted many hours.

> Do you have any examples of poor tenant isolation in AWS, GCE, or Azure?

I hate to feed a troll, but ...

Noisy neighbors are a problem all the way from sharing a server using VMs to the top-of-rack switches.

And if you try hard enough, you can always escape your VM and "read somebody else's mail."

five nines = 45?

No; 999.99% uptime.

1000% = 10 days per year

so much for the SRE book they published last week

16-17 minutes of down time isn't all that bad if you consider the SLA for GCE is 99.95%: https://cloud.google.com/compute/sla

So they can have 262 minutes of down time a year and still be within their SLA.
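That 262-minute figure is just the SLA shortfall multiplied by the minutes in a year, and the same arithmetic gives the "five nines" budget mentioned upthread. A quick check:

```python
# Downtime budget allowed by an availability SLA over one year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 (ignoring leap years)

def downtime_budget_minutes(sla):
    """sla: availability as a fraction, e.g. 0.9995 for 99.95%."""
    return (1 - sla) * MINUTES_PER_YEAR

assert round(downtime_budget_minutes(0.9995), 1) == 262.8   # GCE's 99.95%
assert round(downtime_budget_minutes(0.99999), 2) == 5.26   # five nines
```

So today's ~17 minutes used up a noticeable chunk of the annual 99.95% budget, but nowhere near all of it.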

Hmmm... would this have affected Netflix?

(rant) Yes, traffic to Netflix increased significantly as Spotify was down.

It's fairly well known that netflix runs primarily on AWS.

So probably no.

Cheers - apologies for the ignorant question. (feel rather silly!)

how come netflix is working on IPv6 then, when aws does not offer it?

ELBs do IPv6 at the edge and everything else (ELB->EC2) is IPv4.

Though notably, EC2-Classic ELBs only.

Netflix runs on AWS.
