As the updates to the status page say, we're working to resolve a networking issue. The Region isn't (and wasn't) "down", but obviously a spike in network latency for external connectivity is bad.
We are currently experiencing an issue with a subset of the fiber paths that supply the region. We're working on getting that restored. In the meantime, we've moved almost all Google.com traffic out of the Region to prefer GCP customers. That's why the latency increase is subsiding: we're freeing up the fiber paths by shedding our own traffic.
Edit (since it came up): that also means that if you're using GCLB and have other healthy Regions, it will rebalance to avoid this congestion/slowdown automatically. That seemed the better trade-off given the reduced network capacity during this outage.
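For the curious, here's a toy sketch of that idea (this is not GCLB's actual algorithm, and the region names and ordering are made up): a global load balancer routes each request to the most-preferred region that still passes health checks, so dropping an unhealthy region rebalances traffic automatically.

    # Toy illustration of health-based regional failover.
    # Region order is a hypothetical latency/preference ranking.
    REGIONS = ["us-east1", "us-central1", "us-west2"]

    def pick_region(healthy):
        """Return the most-preferred region that is currently healthy."""
        for region in REGIONS:
            if region in healthy:
                return region
        raise RuntimeError("no healthy regions")

    # During an incident, health checks drop us-east1 from the set,
    # and new requests rebalance to the next-preferred healthy region.
    print(pick_region({"us-central1", "us-west2"}))  # -> us-central1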
As one of my old bosses said: I don't care that the site/service is technically running, if the customers can't reach it, then IT'S DOWN.
My customers don't care that the network is down, the servers are down, or aliens have landed. The severity is the same and our infrastructure, regardless of the cause, was down.
During the impacted time period, we did a full DR failover to App Engine instances we spun up in west2. This was not a minor hiccup.
But the people who have to fix it desperately care about which specific part is down. That's just about the highest-priority information they need. Homing in on where the problem is, is one of the few ways to get to fixing it. Having a boss shout that "everything is down, it's all broken" is the opposite of identifying the problem.
I find the idea that it was "a ridiculous time to nitpick" hilarious.
What? You lost critical business functionality for 5 hours, and you'd rather have the boss shouting at the workers because the wording used doesn't accurately reflect the boss's understanding, instead of the workers working on solving the problem?
"OK, we have databases up, load balancers responding, DNS records check out, last change/deployment was at this time, all these services are up, and the latest test suite is running all green, this narrows down the places where a failure might be with some useful differential diagnosis, now we can move attention to.."
"I DON'T CARE THAT YOU THINK THINGS ARE WORKING, IF THE CUSTOMER CANNOT GET TO IT, IT'S DOWN"
"Thanks for that helpful input, let's divert troubleshooting attention from this P1 incident, and have a discussion about what "DOWN" means. You want me to treat the working databases as down because the customer can't get to them? Even though they're working?
It's like the hatred for "works on my machine". "WELL, I'M NOT RUNNING ON YOUR MACHINE". No, you aren't, but it demonstrates that the current build works and that the commands you're using are coherent and sensible; it excludes many possible causes of failure and adds useful information to the situation.
For troubleshooting and internal use of course you want to describe the outage in precise terms (while being very sure you are not downplaying the impact).
For talking to customers, a sufficiently slow response is the same as no response, and nothing is more irritating than being told 'it's not really down' when they can't use the service.
In my case, Cloud PR knows me, but I also knowingly risk my job (I clearly believe I have good enough judgment in what I post). If Urs and Ben think I should be fired, I'm okay with that, as it would represent a significant enough difference of opinion that I wouldn't want to continue working here anyway.
Finally, for what it's worth, I have been reported before for "leaking internal secrets" here on HN! It turned out to be a totally hilarious discussion with the person tasked with questioning me. Still not fired, gotta try harder :).
Whenever I talk about the inner workings of Google, I try to reference external talks, books, or white papers to go along with my comments. Luckily, a lot has already been said externally about how Google works.
If you haven't read them, you have to!
I would love to understand the thought process of someone going out of their way to remove someone's livelihood from them because of a comment on HN (when applied in the normal circumstance of adding additional information or correcting a misconception — I'm clearly not saying that bonehead comments shouldn't have consequences).
Maybe the person making the report said "Hey, I found some internal details on this external site. I'm not sure if this is allowed. Maybe someone who knows more should take a look at it, here's the link to the page."
Submitting a complaint to an internal review because “you’re not sure it’s allowed” is really petty.
In my opinion, and experience, folks who have good intentions usually pull you to the side to get a feel for a situation before filing a formal complaint.
This is not so difficult, though. You just need to adjust your starting point to someone who doesn't like boulos in the first place. That's not so hard IMO; it's a large org, and boulos seems to be a fairly prolific commenter here.
He certainly shares stuff I wouldn't be comfortable sharing, but then again he's a lot better connected and in the know than I am.
On the other hand, to anonymously submit a complaint feels, to me, like a personal attack: someone who simply doesn't like them for whatever reason. To me, that action seems petty.
One of the things I really like about working at Google is that they place a lot of trust in the judgement of the individual employees. I generally make it clear when I'm stating my personal opinion versus the "official" one (for whatever that means given how informal the project is), but I don't have to carefully go through an approved list of talking points, run my HN comments by the legal department, etc.
Obviously, in certain situations, things get more official and formal. For example, when I went to Google IO to give a talk, we did have some documentation and coaching beforehand about how to handle various questions we might get about non-public stuff, other projects related to ours, etc. We are also expected to run any slides by legal before being publicly shown in a venue with a wide audience like IO. But, even then, the legal folks I've worked with have been a pleasure to talk to.
The company's culture is basically "We hired you because you're smart. We trust you to use your brain." It would be squandering resources to not let their employees use their own intelligence and judgement.
I work at another FANG with a roughly equal engineering community and I don’t see my kind commenting as much at all!
It's probably okay to say that we know the problem and here are the steps we're taking to mitigate it. It would not be okay to say something with large-scale stock price implications for Google or another publicly traded corporation. For instance, a Google employee shouldn't say something like "faulty solar panels fried Google's 10 largest data centers and twelve others have been lost to rebel drone strikes", even if false, since it could have a drastic impact on the earnings and future value of Google, Google's customers, and Google's competitors.
Even less obvious things like Google's plans for adding privacy features to the Chromium open source project can have a serious impact (see https://www.barrons.com/articles/google-chrome-privacy-quest...).
Do companies realize how absurd this is?
ETA: It seems someone at Google had a change of heart, and most of what boulos posted in this thread has been added as updates to the official google status page. Better late than never, I guess, especially if this is the start of a trend in outage reporting.
I mostly responded because there was confusion downthread (and in the title) about being “down”. During an outage is a tricky time for comms, so short corrections are best until a full postmortem can be done.
We test our disaster alarms on a known schedule. And just a couple of years ago, during the peak vacation time in the summer, the alarm went off, off schedule.
This made the entire country panic. Were we being attacked? The agency that is supposed to let people know through public channels like TV, radio, etc. was silent. They were probably on vacation themselves. The websites and apps they'd set up were ridiculously underpowered and were basically DDoS'ed by the spike in traffic they were getting.
News outlets were also struggling, but did way better.
The only things that withstood the sudden burst in traffic without a hitch were Facebook and Twitter.
The official statement, I think, was that the alarm was triggered by accident (which had never happened before, I think). But it goes to show how badly our emergency response is set up.
Back when I worked there, the AWS status board was (and probably still is) terrible b/c Service teams owned that communication channel, not AWS Support. That really ought to have been changed. Service teams don't have the time or incentive to give real-time updates. Why not just let the people who know the customers best deal with parsing the TT and giving updates?
> Service teams don't have the time or incentive to give real-time updates. Why not just let the people who know the customers best deal with parsing the TT and giving updates?
The escalation team inside PS now drafts customer messaging within ~5 minutes of the impact being identified (usually about 5 minutes into an event), and if the impact is significant enough to post to the public dashboard, that may take another 5 minutes. Depending on the type of impact, affected customers will be notified via the personal health dashboard.
PS owns the tooling that does this, and is responsible for driving the process, but the service org's (e.g. EC2, S3, etc.) representative often makes the call on whether to post to the public status page or not (depending on the scale of the impact; e.g. a 20% API failure rate for 5% of customers probably won't make the status page, but affected customers will get notices).
TT is almost out ... but the PS tooling supports it and its replacement, and provides easy access and summaries for internal teams (so you don't need to refresh TT or subscribe to the ticket just to see what the status is).
Sadly, the closer you are to the action of a thing like this (for example, I'm on NetInfra SRE and we were part of the group that put in place the current mitigations you're seeing work now), the less you can say without fear of subtle inaccuracy or releasing non-public information.
Knowing an asteroid took out the entire continent tells you something about the repairability and the resources required to fix the problem, and generally provides context for later updates, as opposed to other causes like a cut fiber line, a burning datacenter, or a bad power supply.
"No" vs "No, I already have plans with X"
The first case gives you all the information needed (denial); in the second case, however, I understand the situation much better. I wouldn't call the text on the status page meaningless, though: it's pretty nice and concise already (which is what you want in a "crisis"). Just some brief description of the problem would be good, even if technically unnecessary.
You're right, there's no additional actionable information there; the status page contains everything I actually need to know. But a bit more information makes me feel better. I guess the difference is that your comment reassures me you actually know what's going on. The status page text (prior to the 14:31 update) could equally mean "we've got this under control" or "shit's broken and we don't know why".
Not sure why they closed that one at 9:12 just to open a new one at 10:25. We didn't see any traffic coming to us-east1 during that time period so I would assume the original issue is still the root cause.
Sorry for the confusion, and yes, the fiber link issue is the root cause. Draining the Google.com traffic presumably resolved the issue for you, though you may still be seeing elevated latency as the updates suggest.
Edit: added this to the top level comment so more folks see it.
I've been saying this repeatedly (and been downvoted for it repeatedly): if you want truly reliable systems, use simple, boring technology, don't fuck with it after it's set up, and run it yourself. 99.99% of all these outages are due to screwing up something that already works, something that if it was in your own rack you could just leave alone and not touch at all.
Fiber optic cables are a great technology, but they don't react well to being cut in half by a backhoe. Is the solution you are recommending that we stop using fiber optic cables, or that we stop using backhoes?
You can't create "technical debt" if you don't change anything in the first place.
> You can’t create “technical debt” if you don’t change anything in the first place.
Rubbish. The bits really do rot, and if you don’t do _something_ on occasion you end up with an entire data center no one wants to touch because the dust in the servers might be structural at this point.
I’m not saying go rewrite your apps against the Kafka instance your junior devs are fucking with, but you have to do something to fight the entropy.
The counter-story to yours is running that database on MongoDB in the cloud on a cluster. Instead you'd be having crazy MongoDB issues, data inconsistencies, connectivity issues when the cloud is down, etc etc.
The solution is somewhere in the middle. You can have modern, supported hardware running an LTS Linux, and that counts as boring.
But over time boring IT turns into legacy, and without some tension to the system pushing it forward your standards end up locking you into legacy forever.
Stuff breaks. So you fix it. Boring old stuff needs fixing too sometimes. Problem is, old stuff gets obsolete and you can't get replacement parts, because of progress (or something). It's the same story since the first looms were made centuries ago.
What you can't fix you can't really depend on. Our time scales are just compressed to ridiculousness because the pace of change is off the charts these days. So basically, you can't really depend on anything working more than a few months before falling over. Sucks.
Important things go onto clusters, or at least have a (hot or cold) standby server.
Or it may simply not meet the needs of users anymore.
I would hardly hold the air traffic control system up as a model to aspire to, for example. The only reason we run the old one is that the upgrade attempts all failed.
Tell that to your security team.
Don't screw with it and it will have security issues after a few months?
I've definitely seen this where I work - the "old guard" setup the system that put the company in a prime market position, the newer people are just doing API calls and scratching their heads if it doesn't work.
Here's a reddit link because YouTube is blocked here.
1) Don't fuck with it?
2) Make a mitigating code change. Patch / fix it (fuck with it)?
If you must fix it, the correct solution is to replace the affected software with the same (or almost the same) version of the software with the fix. No API changes, no other fixes.
Once an attacker is in your organization, he will look for exactly that kind of internal-only backend where exploits are already available and the attack vector is known.
There is no such thing as an internal-only backend where security is concerned.
Let's assume the attacker used social engineering to get credentials from an unprivileged user and uses these to log in to a remote desktop. (I know there are ways to prevent that, but I think enough examples have shown that a public-facing remote desktop is not too unrealistic.)
Once he is inside your company, he can reach the "internal-only" backend and use the privilege escalation bug you thought wasn't worth fixing to get root.
Are they really showing that? None of the major cloud providers, even constrained to a single region (or even AZ), seems on average less reliable than the on-prem datacenters I've seen.
> not allowing customers to do so shows how little you care about them.
While the solutions may not be as complete for all use cases as public-cloud-only ones, are any of the major cloud providers not working to enable hybrid-cloud deployments and selling their capacity to support them?
But it's true, it's much cheaper if you can find a way to replicate those or do without.
For small hobby projects I simply use a 3rd-party secondary DNS service.
The more "the cloud" replaces many, many servers at lots of different places, the more the outages (which once happened all the time, but to many different organizations at different times) will become big enough to notice.
So, yeah, not just your imagination.
This is just for the last few months...?
But Google's vendors might have less. One would hope that Google is auditing claims of independence from vendors at least somewhat, but at some level they have to rely on vendor representation and SLAs if they aren't going to do it all themselves.
Accidents happen. Regularly. :D
There are only 3 things I can say about this situation.
1) These issues are currently unrelated.
2) We learn a lot from these situations.
3) A lot of these types of issues can be mitigated by running in more than one region.
I really can't promise that today's situation will never happen again. There are a lot of moving pieces in our system, and sometimes there are things outside of Google's control.
And, not to be snarky, but many of the other responses that are along the lines of "It's not really that difficult to run in multiple clouds" - let's just say I have trouble believing these commenters have real world experience actually doing this. I'm not saying it's impossible, but it is extremely difficult for any system of reasonable complexity with a dev team of, say, 10 or more people.
And even if you can stomach the cost, you give up the ability to really use any of the proprietary (and oftentimes awesome) functionality of a particular provider, which can put your dev velocity at a big disadvantage.
Once you have deployed your stack on Kubernetes, you can pretty much run it on any cloud or infrastructure with minor tweaks at most.
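To illustrate that portability (a minimal sketch, assuming the official kubernetes Python client, clusters that already exist on each provider, and hypothetical kubeconfig context names): the same Deployment spec can be applied unchanged to clusters on different clouds.

    from kubernetes import client, config

    # Hypothetical kubeconfig contexts for clusters on two providers.
    CONTEXTS = ["gke-prod", "eks-prod"]

    # One provider-agnostic Deployment spec, reused everywhere.
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "web"},
        "spec": {
            "replicas": 3,
            "selector": {"matchLabels": {"app": "web"}},
            "template": {
                "metadata": {"labels": {"app": "web"}},
                "spec": {"containers": [{
                    "name": "web",
                    "image": "nginx:1.25",
                    "ports": [{"containerPort": 80}],
                }]},
            },
        },
    }

    for ctx in CONTEXTS:
        api = client.AppsV1Api(config.new_client_from_config(context=ctx))
        api.create_namespaced_deployment(namespace="default", body=deployment)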
This is not to excuse the downtime in any way.
Redundant Array of Independent Data Clouds.
I guess for RAID 5 I would need a minimum of 3 regions or 3 separate cloud providers.
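Napkin math for the analogy (a sketch; the numbers are assumed, and real regions don't fail independently): a RAID-5-style layout across 3 regions tolerates any single-region outage, so you're only down when 2 or more regions fail at once.

    from math import comb

    p = 0.001            # assumed per-region unavailability (99.9% each)
    n, tolerated = 3, 1  # 3 "disks" (regions), survives any 1 failure

    # Probability that more regions fail simultaneously than we tolerate.
    p_outage = sum(comb(n, k) * p**k * (1 - p) ** (n - k)
                   for k in range(tolerated + 1, n + 1))
    print(f"{1 - p_outage:.7f}")  # ~0.9999970, vs 0.999 for one region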
Plus the fact that without serious investment, you're probably more liable to decrease availability by going multi-cloud thanks to the increased system complexity.
I can get a lot of work done while Outlook is down. Hell, probably more work done.
If our build server is down I can work for a couple hours (unless we’ve done something very bad). Same for git or our bug database or wiki or or or. When I get stuck on one thing I can swap to something else every couple of hours. And there is always documentation (writing or consuming).
But if some idiot, hypothetically speaking of course, puts most of these services into the same SAN, then we are truly and utterly screwed if there is a hardware failure.
Similarly if you make one giant app that handles your whole business, if that app goes down and there are no manual backups you might as well send everybody home.
I went to get a drink the other day and the place looked funny. They’d tripped a circuit breaker and the whole kitchen lost power. But the registers and the beverage machines were on a separate circuit. And since they sold drinks and food in that order, they stayed open and just apologized a lot. Whoever wired that place knew what they were doing.
It's happened more than once with Azure and GCP. I think it happened once with AWS, but I'm not positive there.
I recall Azure had some sort of multi-region database failover disaster that took several regions offline, and GCP has had several global elevated latency/error rate events, but I don't think that any cloud provider has been "down" in the sense that the word is usually used.
Here’s one that’s on Azure. Not a 100% total outage like above, but bad enough most I know in the industry would call it being down:
If I get a free moment, I’ll dig up other examples, but those were ones that were easy to find.
IMO, you’re better off with a private data center or colo and separate integrations with cloud.
The only S3 event here was limited to us-east-1:
Some APIs were impacted because they are global by nature (e.g. create-bucket). But S3 was working fine in all other regions for existing buckets.
However, many websites were affected because they didn't use any of the existing S3 features that allow for regional redundancy; S3 had been so reliable that they didn't know (or think) they needed critical assets in a second-region bucket they could fail over to.
Admittedly, even the AWS status page was impacted, because it also relied on S3 in us-east-1.
S3 has done a lot of work to improve matters since, and mechanisms have been put in place to ensure that all AWS services don't have inter-region dependencies for "static" operation.
However, it is still incorrect to claim that it was all of S3. Many customers who use S3 only in other regions were totally unaffected.
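The kind of fallback those sites were missing isn't much code, either. A minimal sketch, assuming boto3 and hypothetical bucket names that are already kept in sync (e.g. via S3 Cross-Region Replication):

    import boto3
    from botocore.exceptions import ClientError, EndpointConnectionError

    # Hypothetical, pre-replicated buckets, in preference order.
    BUCKETS = [("us-east-1", "assets-us-east-1"),
               ("us-west-2", "assets-us-west-2")]

    def fetch_asset(key):
        """Read an object, falling back to the next region on failure."""
        last_err = None
        for region, bucket in BUCKETS:
            s3 = boto3.client("s3", region_name=region)
            try:
                return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            except (ClientError, EndpointConnectionError) as err:
                last_err = err  # region unavailable; try the next one
        raise last_err

    data = fetch_asset("img/logo.png")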
* DevOps teams can go multi-cloud relatively easily when using infrastructure-as-code tooling (Terraform, Packer, etc.) and traditional DevOps practices
* Why manage a fleet of vanilla boxes when you can use vanilla boxes with Kubernetes and not get gouged by cloud providers in the first place?
You don't need to jump off the hype train if you never got on in the first place.
Using the providers path isn’t necessarily gouging, but it isn’t cost optimized either. The answer depends on you.
That said, cloud is like any tenant/landlord relationship. Your rights are linked to time and are whatever your contract provides. If you didn’t like Office 2007, you didn’t buy it. If you don’t like Office 365, 2021 edition, too bad.
Of course that only works as long as you're swapping out largely replaceable parts. If you built everything around some proprietary service then yeah, you've tied yourself to that anchor.
Cost+speed of scalability, and managed services. If you rarely need to scale, your workloads are all predictable, and you don't need managed services/support, you should just buy some VPSes or dedicated boxes.
You are really limiting your tech stack by using standardized things like Jenkins, Docker, K8s, MQTT, Kafka.
"Outsourcing" those functions to cloud services can be big win for a small team. Like all engineering, it's a trade off.
With multiple regions, as long as your provider offers all of the services, you can have a carbon copy. Much easier.
It depends on your needs, your architecture, your risk tolerance, etc. I think for most people "Use multiple regions" is the answer that strikes the correct balance. It probably isn't the correct answer for everyone.
Certain terms and conditions may apply :) Carbon copy of a static website or one whose data is only a one-way flow from some off-cloud source of truth? Sure! Multi-master or primary-secondary with failover? Stray too far from the narrow path of specialized managed solutions and things get very complex, very quickly. That being said - it's mostly just the nature of the beast. If you're not able to tolerate a regional outage, multi-region is a pill you're going to have to swallow, no buts about it.
For the large majority of businesses investing in infrastructure-as-code far outweighs any crazy HA, redundant, multi-provider, whizzbang whatever setup you may have.
But the degree of independence provided by AZs is not constant across providers, despite similar terminology.
Are you implying that the cause of this outage is not Google's fault? If so, can you go into more details about that?
From the dashboard. Looks like this can be blamed on an Act of Backhoe.
Datacenters also sometimes have other single points of failure such as DNS, but those are within the company's control.
As mentioned in another thread, in this case, Google has rerouted google.com traffic out of the region to try to mitigate the congestion.
For some customers it is the right thing; for other customers it may not be.
Every provider will have failures. So the question mostly boils down to: does paying for more than one region cost more or less than the lost productivity or revenue of an outage like this?
For some places, the most costly thing they spend money on is employees. If your whole company comes to a stop for even one hour, it may cost more than the engineering effort for multi-zone, multi-region, or multi-cloud for your critical environments.
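Back-of-the-envelope version of that trade-off (every number here is made up; plug in your own):

    # Hypothetical numbers: does multi-region pay for itself?
    employees = 200
    loaded_cost_per_hour = 75.0     # $/employee-hour, assumed
    outage_hours_per_year = 4       # e.g. one incident like today's

    outage_cost = employees * loaded_cost_per_hour * outage_hours_per_year
    multi_region_overhead = 40_000  # assumed annual eng + infra cost

    print(outage_cost)                          # 60000.0
    print(outage_cost > multi_region_overhead)  # True -> worth it here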
AFAIK Amazon are running a lot of actual production loads on AWS. Dogfooding can be extremely valuable, especially if a massive portion of your staff have the same profession as your target market.
I've been using Google Cloud in a new role I started recently. There's definitely some parts of GCP I like, but whenever I use the Web Console I get the distinct impression nobody at Google actually uses it. If they did, I'm fairly sure all the annoying little warts I encounter would not exist.
EC2 was released in 2006.
Amazon.com's last non-EC2 server was retired in 2012.
But a lot of features of amazon.com still don't run on the main AWS offerings.
GCP has not been out for that long. Also, it's quite a bit easier to run an e-commerce site than to run the web's largest search engine, as well as the largest email provider and the largest maps provider. Each of these has an order of magnitude more traffic than amazon.com.
I'm sure they'll get there though, just not the same scale. Not even close.
For what it's worth, the internal-only systems also have warts ;)
* filtering traces by services has been broken in App Engine flex environments for more than a year.
* copy/pasting identifiers between places is a nightmare
* their IAM design is somehow worse than AWS's. It's so impressively bad I can't even be mad. My favourite part of their IAM approach is how they have consolidated the majority of the IAM controls in the IAM page, but then random services like GCS have theirs defined elsewhere.
* not being able to do basic time zooming of metric graphs on the App Engine dashboard.
* multi-account paper cuts. Almost everyone on my team has their personal and work google accounts logged in. Whenever I send them a link to a dashboard or whatever, they end up getting a permission denied, without fail.
These are all just off the top of my head. Many of them seem silly and minor (and they are!) but there’s enough of them that I kinda dread doing anything in the Cloud console now. I need to take more time to get productive in the gcloud CLI I guess...
Google multi-account support within a single browser is a pain. It kinda works until it doesn't. I'm sidestepping this issue by using distinct Chrome profiles for work and personal.
On the other hand, I've not found Amazon's multi-account situation to be cozy either. IIRC you literally have to log out and log in again, or use assume-role, and the switch applies to all open tabs.
I always considered the Google Cloud approach of a "single account, multiple projects" a lot cleaner than the AWS "hundreds of accounts" approach. Do you not find this the case?
The UI was maddeningly obtuse. This is from the second time I tried. They did fix it eventually.
Very complex system for distributing new keys taking payments.
When you really care about high availability and security you really don't want all your systems run with the same software, hardware, and coded by the same teams.
What do Google (or Amazon/MSFT) do to ensure software echo chambers are not created within their infrastructure that could potentially cause mass-scale outages by way of the same bug (or bugs) propagating through their systems?
GCP, AWS, and Azure are the great decentralization of the internet.
If you want heterogeneous environments you have to cobble it together yourself by using multiple services.
I recently left Google to start a startup and now everything is falling apart.
Glass house and all that... but I also share the same glass house as you... I don't want bad luck.
... and it's only a fluke that this happened to Google in us-east1 and not AWS in X region, and then you (and I) would be having a hell of a time! :/
Their last one was laughable in its lack of self-awareness.
Can you explain what's better about the AWS one? They both do, approximately, the same thing: provide a few paragraphs of background, approximately one paragraph describing the actual issue, and a few paragraphs describing concrete followups. The AWS one has more timestamps.
You aren't confusing this with the postmortem, are you?
Nobody really wanted to be [enterprise software]-certified, but it was a way to get their employers to pay for them to go to the conference with cool talks and perks and such.
We delayed the training most of the day, and couldn't say it was AWS' fault because they were sitting in the audience, waiting to get certified.
People were about to riot, that was not a fun day.
The whole point when something like this happens is for you to ensure that a region going down will not impact you - not to laugh at people who use another cloud or to assume that X is better than Y. That being said, there have been several Google-related failures lately that don't help build confidence in the GCP offering; if you're just starting in the cloud space, this may actually impact the choices you make when picking your cloud provider.
So my point was to _not_ to laugh at those at google (or those using their services), because AWS might be next.
The whole 'I share the same glass house' was a sort of karma thing: if someone who uses AWS is laughing at Google and karma came round and took out AWS, it wouldn't only affect the guy laughing at Google; I'd be affected as well, along with a multitude of other people... and the tables could easily be turned.
The point is that AZs are higher level than DCs, so that they provide pretty decent independence guarantees (though you can further derisk with multi-region.)
Well, in AWS. Google's zones have weaker independence assurances (actually, as I read it, no assurances), stating only that a zone “usually has power, cooling, networking, and control planes that are isolated from other zones”  as opposed to AWS’s “Availability Zones are physically separated within a typical metropolitan region” and “In addition to discrete uninterruptable power supply (UPS) and onsite backup generation facilities, they are each fed via different grids from independent utilities to further reduce single points of failure. Availability Zones are all redundantly connected to multiple tier-1 transit providers.” 
The same decisions that make regions fail as a unit also make intra-region traffic cheaper. This is true for all large cloud providers. If you are okay paying more for internal network traffic, you can go multi-regional. But multi-AZ is still better than single-AZ. It's up to you to decide if it's worth it. For that you need good SLAs and (IMO) support contracts.
Please, be kind and decent to each other, especially when things are hard.
I wish these guys and gals luck on getting things working.
"This is a frustrating outage for us, a huge part of the attraction in Google Cloud has been the premise that we get the underlying reliability of Google's infrastructure. If we'd known what the reliability of Google in practice this year would look like, we might have stayed with AWS."
"Why are the stupid SRE's at Google even paid such absurd numbers if they can't even go a whole month without multiple hours of downtime."
Criticizing companies is fine; just please remember there are real people there.
"Kind and Decent" doesn't seem like a high bar.
If "please be kind and decent" is too much of an ask, I pray we never work together.
If this statement you quoted is something you're not comfortable with, I have a hard time believing you have ever encountered criticism in your life.
Criticizing Google is fine, but sometimes, the best deployments to production can go wrong.
> “Our testing processes were insufficient in this case and we are reviewing and making changes to our testing and deployment process to avoid incidents like this in the future.”
They're going after the definition of "good" for their deployments.
If you're a paying customer, you should be free to criticize as you damn well please.
Downvoters, please link here the yelling you have seen.