
Anyone using both AWS and GCP who can offer an opinion on the availability of each? As a GCP customer I am not very happy with theirs.



I use both services heavily at work. The networking in GCP is terrible. We experience minor service degradation multiple times a month due to networking issues in GCP (elevated latency, errors talking to the DB, etc.). We've even had cases where there was packet corruption at the bare metal layer, so we ended up storing a bunch of garbage data in our caches/databases. Also, the networking is less understandable on GCP compared to AWS. For instance, the external HTTP load balancer uses BGP and magic, so you aren't in control of which zones your LB is deployed to. Some zones don't have any LBs deployed, so there is a constant cross-zone latency hit when you run in one of those zones. It took us months to discover this, after consistent denials from Google Cloud support that anything was wrong with the specific zone our service was running in.

AWS, on the other hand, has given us very few problems. When we do have an issue with an AWS service, we're able to quickly get an engineer on the phone who, thus far, has been able to explain exactly what our issue is and how to fix it.


> We've even had cases where there was packet corruption at the bare metal layer,

I'd love to know how this happens in the modern world. I've seen it myself only once (not GCP, but our own network with Cisco equipment).

Is something in the chain not checking the packet's CRC?


Had something similar last year because of a core router fabric issue. A few years ago, there was a batch of new servers with buggy motherboards corrupting/dropping packets; can't begin to imagine how hard it was to diagnose.

That's in our own datacenters, not the cloud.


> can't begin to imagine how hard it was to diagnose.

Yeah, when it happened to me, it completely threw me for a loop. We had reports of corruption in video files, which started the debug cycle. It was shocking when we isolated the box causing the issue.

But I guess your bigger comment has to be right: About the only way to have this sort of error is at the hardware level, because basic CRC checking should otherwise raise some sort of alarm.


Keep in mind that hardware runs with firmware. What is called a hardware issue can actually be a software issue.

It wasn't just one box for us. Basically, the part number itself was defective (a motherboard NIC), every single one that was manufactured. This affected a variety of things, and since servers are bought in batches and shipped to multiple datacenters, it was damn near impossible to root cause.

CRC can be computed by the OS (kernel driver) or offloaded to the NIC. I think it's unlikely for buggy CRC code to ship in a finished product; it would be noticed immediately because nothing would work.
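
To illustrate the point about a working check catching corruption: a tiny sketch using Python's zlib.crc32 as a stand-in for the link-layer FCS (the payload and the flipped bit are made up):

    import zlib

    payload = b"some application data on the wire"   # hypothetical frame payload
    fcs = zlib.crc32(payload)                         # sender computes and appends the checksum

    # Flip one bit in transit, the way a bad NIC or switch fabric might.
    corrupted = bytearray(payload)
    corrupted[5] ^= 0x01

    print(zlib.crc32(payload) == fcs)           # True: a clean frame verifies
    print(zlib.crc32(bytes(corrupted)) == fcs)  # False: the receiver would drop this frame

If the broken piece is the offloaded checksum logic itself, or the check gets skipped somewhere along the path, corrupted frames sail through to the application, which matches the garbage-in-caches symptom described upthread.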


Just curious, is this in a specific region (or regions)?


GCP is incredibly bad at communicating when there are problems with their systems. Just terrible. It's only when our apps start to break that we notice something is down; then we look at the green dashboard, which is even more infuriating.


AWS is often the same way. No one seems to be good at communicating outage details.


I suspect there's a correlation between outages that are easy to detect and communicate and outages that automation can recover from so easily that you hardly notice.


I really don’t get this. There’s a huge number of complaints about poor communication from companies like Google and AWS during every outage. Yet they remain seemingly indifferent to how much customer trust they are losing, and the competitive edge the first one to get this right could gain.


I don't think they are losing any kind of customer trust.

Unless something is really fucked (like both GCP and AWS being down for us-east) incidents like these are not going to impact them at all.

The cost of either migrating to the other provider or, even worse, migrating to more traditional hosting companies is enormous and will require much more than "service was down for 2 hours in 2019". The contracts also cover cases like this and even if they don't, Google and Amazon can and will throw in some free treat as an apology.

On one hand I find this quite sad, but from a pragmatic point of view it makes sense.


If 20% of Google Cloud's customers leave after this outage because of poor communication, they'll prioritise accordingly and apply all those nice SRE theories to their infra. But this isn't happening, because <various reasons>, so... who cares?


I mean, I care. All else being equal I’m not sure why you wouldn’t want good communication to your customers.


How much cloud spend do you control? That's the reality of how decisions are made.


Many millions of dollars per year. I care about how my providers behave when they have issues, and I can't see why you think it's not at all relevant.


> "why you think it's not at all relevant"

Nobody said this.

> "I care about how my providers behave when they have issues"

We all do.

As the other commenters stated, the communication is poor because the clouds are still growing rapidly and there's not much reason to be better. We might also be underestimating just how much it would cost to provide better service, and whether it's worth the revenue loss (if any). Are you really going to shift all of your spend overnight because of an outage? And where are you going to go?

The reality of these decisions is far more nuanced than it may seem and the current state of support is probably already optimized for revenue growth and customer retention.


Their dashboard does show red on GCE and networking right now, for what it's worth. https://status.cloud.google.com/


What aren’t these on separate systems? I never had the impression that google cheaps out on things but this sounds exactly like the sort of shit that happens when people cheap out. Not even a canary system?


The idea that Google spends big on expensive systems is a huge lie.

Google started out using a Beowulf cluster that the founders wired themselves. From the very beginning, the goal of metrics collection was to optimize costs. While today it's seen as the cash cow, the focus has always been on cheap components strung together, relying on algorithms and code for stability and making the fewest possible demands on the underlying hardware.

To think that they won’t try to save money any time they can seems implausible.


AWS has what feel like monthly AZ brownouts (typically degraded performance or other control plane issues), with a yearly-ish regional brown/blackout.

GCP has quarterly-ish global blackouts, and generally on the data plane at that which makes them significantly more severe.


Are there any services that track uptime for various regions and zones from various providers? It's rare that everything goes down and thus the cloud providers pretend they have almost no downtime.


CloudHarmony used to track this at some level for free, but it looks like you now need to sign up or pay to get more than 1 month of history?

The last time I looked at it (back when it showed more info for free, IIRC), AWS had the best uptime of the three big cloud providers, with Azure in 2nd and GCP in 3rd.

IIRC, the memorable thing was that, shortly afterwards, the head of Google Cloud made a big announcement that CloudHarmony showed GCP had the best uptime, when it actually showed the worst. Google was calculating total downtime as downtime per region * number of regions, which penalizes providers with more regions: at the time Azure had ~30 regions and AWS had ~15 vs. ~5 for Google. If you looked at average region downtime or global-outage downtime instead, Google came out as the worst, not the best.
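
To make the shape of that flaw concrete, here's a toy comparison with entirely made-up downtime figures (not CloudHarmony's data):

    # Entirely hypothetical downtime figures, just to show how the metric behaves.
    providers = {
        "AWS":   {"regions": 15, "avg_region_downtime_hrs": 1.0},
        "Azure": {"regions": 30, "avg_region_downtime_hrs": 1.5},
        "GCP":   {"regions": 5,  "avg_region_downtime_hrs": 2.5},
    }

    for name, p in providers.items():
        summed = p["avg_region_downtime_hrs"] * p["regions"]  # the metric described above
        print(name, "summed:", summed, "hrs | per-region avg:", p["avg_region_downtime_hrs"], "hrs")

With these invented numbers, GCP has the worst per-region average (2.5 hrs) but the smallest summed total (12.5 hrs vs. 15 for AWS and 45 for Azure), simply because it has the fewest regions.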


I can't imagine that being easy or cheap to make given the staggering number of product offerings across even the few big providers and how subtle some outages tend to be.


Obviously we don't know the extent of the issue yet, but AFAIK there has never been an AWS incident that affected multiple regions for an application designed to use them (e.g. using region-specific S3 endpoints). GCP and Azure have had issues spanning multiple regions that would have affected applications designed for multi-region.
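
For anyone unfamiliar with the phrase, a rough sketch of what "region-specific S3 endpoints" means in practice, using boto3 (the regions and bucket names here are hypothetical):

    import boto3
    from botocore.exceptions import ClientError, EndpointConnectionError

    # One replica bucket per region, each accessed only through that region's endpoint.
    REPLICAS = [
        ("us-east-1", "myapp-assets-use1"),   # hypothetical bucket names
        ("us-west-2", "myapp-assets-usw2"),
    ]

    def fetch(key):
        """Try each regional replica in turn; a single-region S3 outage only costs a failover."""
        last_err = None
        for region, bucket in REPLICAS:
            s3 = boto3.client("s3", region_name=region)  # regional endpoint, not a global one
            try:
                return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            except (ClientError, EndpointConnectionError) as err:
                last_err = err
        raise last_err

The point is that each call is pinned to one region's endpoint, so an outage in one region costs a failover rather than taking the application down.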


> like using region specific S3 endpoints

AWS had the S3 incident affecting all of us-east-1: “Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable.”

https://aws.amazon.com/message/41926/


That's one region, not the multiple regions that OP mentioned.


Services in other regions depended transitively on us-east-1, so it was a multiple region outage.


Which services in other regions? I remember that day well, but I had my eyes on us-east-1 so I don't remember what else (other than status reporting) was affected elsewhere.


There was a massive push after that to have everything regionalized. It's not 100% but it's super close at this point.


S3 bucket names are a global namespace, so control plane operations have to be single-homed. As an example, global consensus has to be reached before returning a success response for bucket creation, to ensure that two buckets can't be created with the same name.


The availability of CreateBucket shouldn't affect the availability of customers' apps. This tends to be true anyway because of the low default limit on buckets per account (if your service creates buckets as part of normal operation, it will run out pretty quickly).

The difference with Google Cloud is that a lot of the core functionality (networking, storage) is multi-region and consistent. The only thing that's a bit like that in AWS is IAM; however, IAM is eventually consistent.


But isn't CreateBucket the single S3 operation where you need global consistency?


As far as I know bucket policy operations also require global consistency.


I find GCP quicker to post status updates about issues than AWS, but GCP also seems to run into more problems that span across multiple regions.

I'm overall happy with it, but if I needed to run a service with a 99.95% uptime SLA or higher, I wouldn't rely solely on GCP.
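
For context on how little downtime that target allows (plain arithmetic, no provider data), a quick sketch:

    def downtime_budget(sla):
        """Allowed downtime for a given availability target (30-day month, 365-day year)."""
        unavailable = 1.0 - sla
        minutes_per_month = unavailable * 30 * 24 * 60
        hours_per_year = unavailable * 365 * 24
        return minutes_per_month, hours_per_year

    for sla in (0.999, 0.9995, 0.9999):
        m, h = downtime_budget(sla)
        print(f"{sla:.2%}: {m:.1f} min/month, {h:.2f} hrs/year")

At 99.95% the whole yearly budget is roughly 4.4 hours, so a single multi-hour, multi-region incident burns it in one go.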


AWS has better customer service, and I don't remember the last time there was a huge outage like this besides the S3 outage.


There was a terrible day 2-3 months back in us-west-2 where CloudWatch went down for a couple of hours and took AutoScaling out with it, causing a bunch of services like DynamoDB and EC2 to improperly scale in tables and clusters; then, 12 hours later, Lambda went down for a couple of hours, degrading or disabling a bunch of other AWS services.


I've heard from people who have worked with both AWS and GCP that AWS has far better availability.


I've also heard similar from a teammate who previously worked with GCP. That said I know several folks who work for GCP and they are expending significant resources to improve the product and add features.




