This can wipe away a lot of goodwill, though. A worldwide outage is catastrophic and embarrassing. AWS has had some pretty spectacular failures in us-east (which has a huge chunk of the web running within it), but I'm not sure that I can recall a global outage. To my understanding, these systems are built specifically not to let failures spill over to other regions.
Ah well. Godspeed to anyone affected by this, including the SREs over at Google!
And this is coming from a guy who really dislikes Google overall. I was working on a project that might qualify for Azure's BizSpark Plus (they give you something like $5K a month in credit), and I'd still rather pay for gcloud than get Azure for free.
They posted that they will share results of their internal investigation.
That kind of rapid response and communication is admirable. There will be problems with cloud services - it's inevitable. It's how cloud providers respond to those problems that is important.
In this situation, I am thoroughly impressed with Google.
The response times are what's expected when you are running one of the biggest server fleets in the world.
Operating across regions decreases the chances of downtime, it does not eliminate them.
> The response times are what's expected when you are running one of the biggest server fleets in the world.
That may be true, but actually delivering on that expectation is a huge positive. And more than having the right processes in place, they had the right people in place to recognize and deal with the problem. That's not a very easy thing to make happen when your resources cross global borders and time zones.
Look at what happened with Sony and Microsoft - they were both down for days and while Microsoft was communicative, Sony certainly was not. Granted, those were private networks, but the scale was enormous and they were far from the only companies affected.
AWS has never had a worldwide outage of anything (feel free to correct me). It's not about finding "the airline that never crashed", it's finding the airline whose planes don't crash all at the same time. It's pretty surprising coming from Google because 15 years ago they already had a world-class infrastructure, while Amazon was only known for selling books on the Internet.
Regarding the response times, I recognize that Amazon could do better on the communication during the outage. They tend to wait until there is a complete failure in an availability zone to put the little "i" on their green availability checkmark, and not signal things like elevated error rates.
AWS had two regions in 2008. That was eight years ago, and I think you would agree that running a distributed object storage system across an ocean is a whole different beast from ensuring individual connectivity to servers in 2016.
Yeah... just don't look too closely under the covers. AWS has been working towards this goal but they aren't there yet. If us-east-1 actually disappeared off the face of the earth AWS would be pretty F-ed.
I'd be surprised if AWS didn't have a similar way to fail, even if they haven't. This is obviously a negative for gcloud, no doubt, but it's hardly omg-super-concerning. I'm sure the post-mortem will be great.
Disclaimer: I work for Google (not on Cloud).
Behind the scenes, I'm sure they will iterate on failure prevention and risk analysis.
Knowing Google though, they'll learn their lesson on how to improve their entire workflow right quick.
GKE has good command-line tools, but the web interface is even more limited than ECS's. I assume at some point they'll integrate the Kubernetes web UI into the GCP console.
GKE is still pretty immature though, more so than I realized when I started working with it. The deployments API (which is a huge improvement) has only just landed, and the integration with load balancing and SSL etc is still very green. ECS is also pretty immature though.
To us it appeared to be yet another in a sea of orchestration tools that give you a very quick and impressive "Hello World" but then fail to adapt to real-world situations.
This is what Kubernetes really has going for it: every release adds more blocks and tools that are useful, composable, and targeted at real-world use (allowing many of us crazies to deal with the oddball, quirky behavior of our fleet of applications), not just a single path of how applications would ideally work.
This generally has been a trend with Docker's tooling outside of Docker itself unfortunately.
Similarly docker-compose is great for our development boxes, but nowhere near useful for production.
And it doesn't help that Docker's enterprise offerings still steer you towards docker-compose and the like.
"BUILD SHIP & RUN, ANY APP, ANYWHERE" is the slogan they repeat everywhere, including here, and it means even less everytime they do it. What IS Docker Cloud? Is it like Swarm? Does it use Swarm? What kinds of customers is Docker Cloud especially good at helping? All these mysteries and more, resolved never.
Heroku should be an interesting learning example for the tons of new-age cloud PaaS offerings I'm seeing. Heroku's database hosting has always been key to adoption... to the extent that lots of people continue to use it even after they move their servers to bare metal. The consideration and price sensitivity around data is very different than for app servers.
But the way they have built the ECS layer is very, very, VERY bad... and I have an unusually high threshold for documentation pain.
So I do agree that ECS is hard to use but with better tooling it doesn't have to be.
I'm also a big fan of how GKE is shaping up.
The bulk of their revenue comes from customers who subscribe on a per-month basis, while they pay out royalties on a per-song-played basis. This outage is reducing the amount they have to pay, and if the outage-elasticity-of-demand is low enough they could (hypothetically) come out ahead!
I believe this is inaccurate. They pay out royalties on a share-of-all-plays basis, don't they? So an outage wouldn't reduce the payout amount, it would just slightly alter the balance of payments for individual rightsholders.
http://www.spotifyartists.com/spotify-explained/#royalties-i...: "That 70% is split amongst the rights holders in accordance with the popularity of their music on the service. The label or publisher then divides these royalties and accounts to each artist depending on their individual deals... Spotify does not calculate royalties based upon a fixed “per play” rate."
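In other words, the pool is fixed by revenue, and plays only determine the shares. A toy illustration in Python (all numbers invented):

    # Toy model of pro-rata royalties: the pool is a fixed share of revenue,
    # and play counts only decide how it is divided. Numbers are made up.
    def split_royalties(revenue, plays_by_holder, pool_share=0.70):
        pool = revenue * pool_share
        total_plays = sum(plays_by_holder.values())
        return {holder: pool * plays / total_plays
                for holder, plays in plays_by_holder.items()}

    # An outage cuts everyone's plays, but the pool is untouched; if the cuts
    # are proportional, even the individual shares stay the same.
    print(split_royalties(1_000_000, {"label_a": 6_000_000, "label_b": 4_000_000}))
    # {'label_a': 420000.0, 'label_b': 280000.0}
    print(split_royalties(1_000_000, {"label_a": 5_400_000, "label_b": 3_600_000}))
    # {'label_a': 420000.0, 'label_b': 280000.0} -- same payouts despite fewer plays

So an outage only shifts individual payouts if it changes the relative popularity of rightsholders; the total paid out stays pinned to revenue.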
The future is per-application virtual networks that are agnostic to the underlying hosting provider. These networks work as an overlay, which means your applications can be moved between providers without changing their architecture at all. You could even shut an application down in provider A and start it in provider B without any changes.
At Wormhole we have identified this problem and solved it.
The fact that Amazon dogfoods AWS is a major advantage for them.
That is something like 17 such 15-minute breaks per year, i.e. an allowance for one small (or one large but quickly fixed) screw-up per month :)
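The back-of-the-envelope math, for anyone checking:

    # 99.95% availability over a year, in minutes.
    minutes_per_year = 365 * 24 * 60           # 525,600
    allowed = minutes_per_year * (1 - 0.9995)  # ~262.8 minutes of downtime allowed
    print(allowed, allowed / 15)               # ~262.8, i.e. ~17.5 fifteen-minute breaks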
Today's incident did not impact App Engine at all.
(Disclaimer: I work in Google Cloud Support.)
"Downtime Period" means, for an Application, a period of five consecutive minutes of Downtime. Intermittent Downtime for a period of less than five minutes will not be counted towards any Downtime Periods.
I see from the downvotes that my reply must have been seen as somewhat off-topic to the GCE issue, but since Spotify came up as a "victim", I felt it prudent to mention that Spotify Premium has offline playlists that let users weather network issues of any kind. Also, for me personally, big playlists of quality music like this one are fantastic for my work.
Seems to have only been down for about 10 minutes, so I'm thinking some sort of misconfiguration got deployed everywhere... they were working to fix a VPN issue in a specific region right before it went down...
To err is human. To propagate an error to every server automatically is DevOps.
Maybe someone pushed the wrong BGP routes, hence why the quick fix and the initial issue with Cloud VPN.
Source: Totally guessing.
Or you subscribe to a "GSLB" service where they do this for you, for a significant fee. Or you use a "man-in-the-middle as a service" system like Cloudflare, who do this at an extremely reasonable price and/or for free.
Of course, you still have to deal with the risk of route leaks, BGP route flapping/dampening, and other things which can take your IP addresses offline despite the fact you are multihoming with different carriers in different locations.
So perhaps you set up IP addresses on different ASNs and use both DNS-based and IP-based failover.
But then you find a bug somewhere in your software stack which makes all of this redundancy completely ineffective. So you just take your ball, go home and cry.
In any case, failures happen, and most systems are better off being as simple as possible and accepting the unforeseen failures than trying to add complexity to overcome them.
In this case, it ended up being a multi-region failure, so your only real solution is to spread it across providers, not just regions.
But I imagine it's a similar issue to scaling across regions, even within a provider. We can spin up machines in each region to provide fault tolerance, but we're at the mercy of our Postgres database.
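The best we've managed so far is cross-region read replicas with client-side failover, roughly this sketch (hostnames hypothetical; assumes psycopg2 and streaming replication):

    import psycopg2

    # Primary first, then cross-region read replicas as fallbacks.
    HOSTS = [
        "pg-primary.us-east.example.com",
        "pg-replica.eu-west.example.com",
        "pg-replica.ap-south.example.com",
    ]

    def connect_with_failover(dbname, user, password):
        last_error = None
        for host in HOSTS:
            try:
                return psycopg2.connect(host=host, dbname=dbname, user=user,
                                        password=password, connect_timeout=3)
            except psycopg2.OperationalError as exc:
                last_error = exc  # host unreachable; try the next one
        raise last_error

That only buys us reads during an outage, though; promoting a replica to take writes is the hard part.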
What do others do?
1) the cost of mitigating that risk is much higher than the cost of just eating the outage, and
2) their high traffic production site is routinely down for that long anyway, for unrelated reasons.
If you really, really can't bear the business costs of an entire provider ever going down, even that rarely (e.g. you're doing life support, military systems, big finance), then you just pay a lot of money to rework your entire system into a fully redundant infrastructure that runs on multiple providers simultaneously.
There really aren't any other options besides these two.
I will add that if you can afford the time and effort to do so, it would be good to design your system from the beginning to work on multiple providers without many issues. That means trying as hard as you can to use as few provider-specific things as possible (RDS, DynamoDB, SQS, BigTable, etc.). In most cases, pjlegato's 1) will still apply.
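Concretely, that means hiding each provider-specific service behind a thin interface of your own, so a move costs you one new adapter instead of a rewrite. A rough sketch for queues (the interface and names are mine, not any standard):

    import json
    from abc import ABC, abstractmethod

    class Queue(ABC):
        """Provider-agnostic queue; application code only ever sees this."""
        @abstractmethod
        def publish(self, body: str) -> None: ...

    class SQSQueue(Queue):
        def __init__(self, queue_url):
            import boto3  # only the adapter knows about the provider SDK
            self._sqs = boto3.client("sqs")
            self._url = queue_url

        def publish(self, body: str) -> None:
            self._sqs.send_message(QueueUrl=self._url, MessageBody=body)

    class InMemoryQueue(Queue):
        """Test/fallback implementation with no provider dependency."""
        def __init__(self):
            self.messages = []

        def publish(self, body: str) -> None:
            self.messages.append(body)

    # Application code takes a Queue, so swapping providers is one adapter,
    # not a rewrite.
    def enqueue_signup(queue: Queue, user_id: str) -> None:
        queue.publish(json.dumps({"event": "signup", "user": user_id}))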
But you get a massive side-benefit (main benefit, I think) in cost. There are huge bidding wars between providers and if you're a startup and know how to play them off each other, you could even get away with not having to pay hosting costs for years. GC, AWS, Azure, Rackspace, Aliyun, etc, etc are all fighting for your business. If you've done the work to be provider-agnostic, you could switch between them with much less effort and reap the savings.
You are unlikely to be running such systems on AWS or GCE.
Yep. Those were originally Itanium-only, so their success was somewhat… limited, compared to IBM's "we're backwards compatible to punch cards" mainframes.
Only recently did Intel start porting mission-critical features like CPU hot-swap over to the Xeons, so they can finally let the Itanic die; hopefully we're going to see more x86 devices with mainframe-like capabilities.
Hosting on anything, anywhere, really. Even if you build clusters with true 100% reliability, running on nuclear power and buried 100 feet underground, you still have to talk to the rest of the world through a network that can fall apart for a variety of reasons. If most of your users are on their mobile phones, they might not even notice outages.
At some point adding an extra 9 to the service availability can no longer be justified for the associated cost.
If THAT is what I get for the prices of Google Compute Engine, I could just as well use OVH's cloud -- uptime isn't worse, and the price is a lot lower.
It is impossible to ensure 100% uptime, and it gets increasingly harder to approach that as you put more separate entities between yourself and the client. The thing is, you'll be blamed for problems that aren't in your control and aren't really related to your service, but to the customer. For example, local connectivity, phone data service, misbehaving DNS resolvers, packet mangling routers, mis-configured networks, mis-advertised network routes, etc. Every single one of those examples can happen before the customer traffic even gets to the internet, much less where you have your servers housed.
All you can do is accept that there will be problems attributed to your service, rightly or not, work to mitigate and reduce the ones you can, and learn that it's not the end of the world.
The article quotes Ben Treynor as saying that Google aimed for and hit 99.95% uptime, which works out to about 4.4 hours of downtime per year.
My guess is that, despite cloud outages being painful, many applications are probably going to meet their cost/SLO goals anyway. Going up from 4 9s starts to get very expensive very quickly.
It sucks but the days of "ZOMG EBAY/MICROSOFT/YAHOO DOWN!!11!" on the cover/top of slashdot and CNET are gone. Hell, slashdot and CNET are basically gone.
OneOps (http://oneops.com) from WalmartLabs enables a multi-cloud approach. Netflix Spinnaker also works across multiple cloud providers.
DataStax (i.e. Cassandra) enables a multi-cloud approach for persistent storage.
DynomiteDB (disclaimer: my project) enables a multi-cloud approach for cache and low latency data.
Combine the above with microservices that are either stateless or use the data technologies listed above and you can easily develop, deploy and manage applications that continue to work even when an entire cloud provider is offline.
Also, if there is a problem with one component of your stack that could have run on a cloud service, chances are Google or Amazon will fix your edge case much quicker than you would.
We can free up our OpEx budget too! My sales rep sent me a TCO that shows it is way cheaper to run a data center than to pay a cloud subscription!
I'm calling the CFO!
However, uptime is one thing. Data loss is another different beast.
AFAIK, this one for Google is only downtime. Keeping GCE up for all but a few hours of the year still means more than 99.9% availability, which is what most customers need.
Operational excellence, i.e. the ability to keep your cloud up and running, comes only with a large customer base; Google is now gaining a lot of significant customers (Spotify, Snapchat, Apple, etc.), and therefore I expect them to learn what's needed over the coming months.
2016 will be ok. 2017, in my view, will be near perfect.
If Google wants to differentiate itself from AWS, it should offer an SLA on data integrity (at a premium, obviously). That's how you get thousands of enterprise customers.
Shameless plug: I've also extensively written about AWS, Azure and GCE here: https://medium.com/simone-brunozzi/the-cloud-wars-of-2016-3f...
EDIT: Switched from "dishonest" to "misleading"; while it's abundantly clear that Google doesn't run on GCP, GCP feels like a second-class citizen to Google because you just cannot get Google uptime with it.
This did impact common infrastructure. Some (non-cloud) Google services were impacted. We've spent years working on making sure gigantic outages are not externally visible for our services, but if you looked very closely at latency to some services you might have been able to see a spike during this outage.
My colleagues managed to resolve this before it stressed the non-cloud Google services to the point that the outage was "revealed". If this was not mitigated, the scope of the outage would have increased to include non-cloud Google services.
Not knowing when, at any moment, your phone or pager will go off wears on you in interesting ways over time.
Since we're US-based only, the on-call person has pager duty while they sleep. Our pager can be a bit loud at night due to the nature of our services, so it's definitely not for everyone (luckily it's optional).
I'll note you're a SWE, not an SRE. I'm talking mostly about a dedicated ops crew on pager.
It's one thing if you're responding to pages resulting from other groups' coding errors or failure-to-build sufficiently robust systems. Another if you're self-servicing.
One of my own "take this job and shove it" moments came after pages started rolling in at 2am, keeping me on-site until 6am. I headed back for some sleep, showed up that afternoon, and commented on the failure of anyone on the dev team to answer calls/pages/texts (the site was falling over, and I had exceptionally limited access and was new on the team). The response was shrugs.
Mine was "That wasn't your ass being hauled out of bed. See ya."
Yes, it is at Google. Our important and high-visibility bits have SREs who help monitor our services (SREs actually approached us to take over some bits that were more important).
Google has a lot of on-call people who will never set foot in a data center (most Googlers never see one). So there are lots of on-call rotations with an SLA that can be handled from bed if the page comes at 2am.
(I sadly can't give any examples)
I guess today is a perfect example; I wonder how many out-of-hours engineers got paged.
At my last job (comfortably small enterprise software shop), we had customers with more employees deploying and running our product than we ourselves had engineers. The only people who were first-line pageable were IT and the one engineer maintaining our demo server, which we eventually shut down.
You figure out what went wrong and fix it, of course, but more importantly, you figure out where your existing systems and processes (failover, monitoring, incident response, etc.) did and didn't work, and you improve them for the next time.
Do any cloud providers allow announcing routes for anycast DNS?
Vultr advertises and supports anycast, if you're looking for a multi-location VPS provider. Others will do BGP, but it's a sales process.
For DNS I use DNSMadeEasy's DNS Failover feature that automatically fails over to a different IP address when it's unable to ping the server.
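If you'd rather roll that yourself, the core loop is tiny; the only provider-specific piece is the DNS update call, which is a hypothetical stand-in here:

    import socket, time

    PRIMARY, BACKUP = "203.0.113.10", "198.51.100.20"

    def is_up(ip, port=443, timeout=3):
        # TCP connect check rather than ICMP ping (raw ICMP needs privileges).
        try:
            with socket.create_connection((ip, port), timeout=timeout):
                return True
        except OSError:
            return False

    def update_dns_record(name, ip):
        # Hypothetical stand-in for your DNS provider's actual API call.
        print("would point", name, "at", ip)

    while True:
        update_dns_record("www.example.com", PRIMARY if is_up(PRIMARY) else BACKUP)
        time.sleep(30)

You're still at the mercy of resolver TTLs either way, which is why these services keep TTLs very low.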
Not all data models fit into a NoSQL database, though; some work better in a more relational one. Caching plus running read-only while partially up can be another approach.
Designing your data around this may be impractical for some applications and, at small scale, likely more costly than necessary. Most systems can tolerate 15-30 minutes of downtime every couple of years because of an upstream provider.
edit: at least it's impacted some of their ancillary services before http://www.theregister.co.uk/2015/09/20/aws_database_outage/
E.g. All the servers run on EC2 but they don't really use ELB or EBS. S3 is used heavily throughout the company but some of the newer services not so much.
FYI, I run about 15 servers in Asia, US East, and Europe on GCE with external monitoring, and I didn't get a peep from my error-checking emails during that timeframe.
If however you're using the cloud as intended, and using all of the services it actually provides, I highly doubt you could run 23 data centers around the world, with databases and firewalls and streaming logging and all the other stuff they provide, for even a fraction of the cost.
Yes, it's bad that apparently all of the regions failed. Google will hear about it. People will get in trouble. But a screw-up at this level is rare. If you use cloud, or even a VPS provider like Linode, you get auto-failover and someone who is contractually obligated to deal with failures.
Cloud complexity is also lower because you don't have to worry about power, cooling, upstream connectivity, capacity budgeting, etc. If 99.9-99.95% availability is fine for your application then you probably don't have to worry about your provider either.
I hate to feed a troll, but ...
Noisy neighbors are a problem all the way from sharing a server via VMs to top-of-rack switches.
And if you try hard enough, you can always escape your VM and "read somebody else's mail."
So they can have 262 minutes of downtime a year and still be within their SLA.
So probably no.