
Google Cloud Global Loadbalancer Outage - brian-armstrong
https://status.cloud.google.com/incident/cloud-networking/18012
======
intsunny
Why can't Google just use UTC timestamps? Or at least include them alongside
their "US/Pacific" timestamps.

I don't want to remember if "US/Pacific" currently has daylight savings or
not.

It's a very strange decision, especially considering that GCP has numerous
regions outside of "US/Pacific".
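
Until the status page includes UTC, the conversion can be scripted. This is a minimal Python sketch, assuming that "US/Pacific" on the status page corresponds to the IANA zone `America/Los_Angeles`. The two sample timestamps are made up for illustration and are not taken from the incident report.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # standard library since Python 3.9

# Assumption: the status page's "US/Pacific" is the IANA zone below.
PACIFIC = ZoneInfo("America/Los_Angeles")


def pacific_to_utc(naive_local: datetime) -> datetime:
    """Interpret a naive wall-clock time as US/Pacific and convert it to UTC."""
    return naive_local.replace(tzinfo=PACIFIC).astimezone(timezone.utc)


# Hypothetical timestamps: one in summer (PDT, UTC-7), one in winter (PST, UTC-8).
summer = pacific_to_utc(datetime(2018, 7, 17, 12, 0))
winter = pacific_to_utc(datetime(2018, 1, 17, 12, 0))
print(summer.isoformat())  # 2018-07-17T19:00:00+00:00
print(winter.isoformat())  # 2018-01-17T20:00:00+00:00
```

The same wall-clock time maps to two different UTC hours depending on the season, which is exactly the thing a reader should not have to remember.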

~~~
_wmd
Just another case of their idiotic culture leaking into their products. An
early "design" decision led to all their production kit being set to
US/Pacific, triggering frequent DST bugs, and last I heard (many years ago) it
was still the case.

Coordinating a tz change over a network of that size is probably infeasible,
so may as well push the pain on to customers

~~~
ngrilly
"Idiotic culture" is a bit harsh.

~~~
_wmd
You clearly haven't spent much time with App Engine

------
manigandham
The global load balancer is one of the best offerings from GCP but we are
always concerned about the single point of failure it causes.

Unfortunately this isn't the first time it's broken and it's starting to look
like a bad choice. We use a CDN in front so were able to switch traffic around
but it seems it's better to do the load balancing ourselves too instead of
using GLB.

~~~
dsl
Any load balancer is always a single point of failure. This is why lots of
folks go multi-cloud.

You should also be looking at multi-CDN.

~~~
p0rkbelly
I have yet to read solid case studies of real multi-cloud at scale. E.g.
Active-Active load-balanced between multiple providers. Plenty of companies
use multiple clouds, but it tends to be a line-of-business decision: Team A
likes AWS, Team B likes Azure, etc.

~~~
user5994461
It's fairly common. No need to call it multi cloud, companies have had
multiple datacenters in multiple countries for a long time.

The only challenge is that you need global geographic load balancers and that
means F5 and at least a million dollars.

Also, you will find out later that some dependencies were only running in a
single location and services failed with the datacenter.

~~~
spydum
I think that’s quite an exaggeration - there are quite a few DNS providers who
can do intelligent DNS LB for you and don’t care what your backend is
(gcp/AWS/azure/onprem). Won’t even cost you a million bucks.

------
iowahansen
Grrr. So much for global redundancy.

What is going to be faster? Updating DNS records with TTL 3600 to point to a
single data center or Google fixing their problem.

We host DNS at AWS, but servers in GCP. Should we use AWS's automatic DNS
failover feature to cover for such a case?

~~~
colmmacc
AWS engineer here, I was lead for Route 53.

We generally use 60 second TTLs, and as low as 10 seconds is very common.
There's a lot of myth out there about upstream DNS resolvers not honoring low
TTLs, but we find that it's very reliable. We actually see faster convergence
times with DNS failover than using BGP/IP Anycast. That's probably because DNS
TTLs decrement concurrently on every resolver with the record, but BGP
advertisements have to propagate serially network-by-network. The way DNS
failover works is that the health checks are integrated directly with the
Route 53 name servers. In fact every name server is checking the latest
healthiness status every single time it gets a query. Those statuses are
basically a bitset, being updated /all/ of the time. The system doesn't "care"
or "know" how many health statuses change each time; it's not delta-based.
That's made it very very reliable over the years. We use it ourselves for
everything.

Of course the downside of low TTLs is more queries, and we charge by the query
unless you ALIAS to an ELB, S3, or CloudFront (then the cost of the queries is
on us).
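
The "bitset, not delta-based" idea above can be illustrated with a toy model. This is only a sketch of the concept, not Route 53's actual implementation: the class and function names (`HealthBitset`, `Endpoint`, `answer_query`) are hypothetical, the addresses are documentation-range placeholders, and the case where both endpoints are unhealthy is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    address: str   # IP address returned in the DNS answer
    check_id: int  # index of this endpoint's bit in the health bitset


class HealthBitset:
    """Full health state for every check, overwritten wholesale on each update."""

    def __init__(self, num_checks: int) -> None:
        self._bits = (1 << num_checks) - 1  # start with every check healthy

    def replace(self, bits: int) -> None:
        # Not delta-based: the whole bitset is replaced, so the work is the
        # same whether one status changed or all of them did.
        self._bits = bits

    def is_healthy(self, check_id: int) -> bool:
        return bool((self._bits >> check_id) & 1)


def answer_query(primary: Endpoint, secondary: Endpoint, health: HealthBitset) -> str:
    """Consult the latest health state on every single query."""
    if health.is_healthy(primary.check_id):
        return primary.address
    return secondary.address


health = HealthBitset(num_checks=2)
primary = Endpoint("203.0.113.10", check_id=0)
secondary = Endpoint("198.51.100.20", check_id=1)

print(answer_query(primary, secondary, health))  # 203.0.113.10
health.replace(0b10)  # bit 0 cleared: primary is now unhealthy
print(answer_query(primary, secondary, health))  # 198.51.100.20
```

Because the answer is computed from the current bitset at query time, clients switch over as soon as their cached record expires, which is why the TTL dominates failover time in this model.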

~~~
iowahansen
Interesting, thank you. So a potential mitigation strategy could look like
this:

\- Route 53 failover record
  * primary record: Google global load balancer IP
  * secondary record: Route 53 Geolocation set (really need that latency)

\- Elastic Load Balancer record per region
  * routes to mirror region GCP IP address (ELB's application load balancer
    seems to be able to point to AWS external IPs)
  * optionally spin up mirror infrastructure in AWS

Seems brittle. Does Azure support global load balancing with external IPs?

Does anyone have such (or similar) setup actually in production? How did it
work today?

~~~
fastest963
I haven't been able to make an ELB target be an external IP. What did you mean
by "ELB's application load balancer seems to be able to point to AWS external
IPs"?

~~~
iowahansen
[https://aws.amazon.com/elasticloadbalancing/details/#details](https://aws.amazon.com/elasticloadbalancing/details/#details)

IP addresses as Targets: You can load balance any application hosted in AWS or
on-premises using IP addresses of the application backends as targets. This
allows load balancing to an application backend hosted on any IP address and
any interface on an instance. You can also use IP addresses as targets to load
balance applications hosted in on-premises locations (over a Direct Connect or
VPN connection), peered VPCs and EC2-Classic (using ClassicLink). The ability
to load balance across AWS and on-prem resources helps you migrate-to-cloud,
burst-to-cloud or failover-to-cloud.

Looks like you need an active VPN connection to access external IPs.

~~~
trout
That feature requires you to use a private IP address, so if you have a VPN or
Direct Connect to another location you could load balance across locations. In
the case of the global load balancers those will be public addresses though.

"The IP addresses that you register must be from the subnets of the VPC for
the target group, the RFC 1918 range (10.0.0.0/8, 172.16.0.0/12, and
192.168.0.0/16), and the RFC 6598 range (100.64.0.0/10). You cannot register
publicly routable IP addresses."

[1]
[https://docs.aws.amazon.com/elasticloadbalancing/latest/netw...](https://docs.aws.amazon.com/elasticloadbalancing/latest/network/target-group-register-targets.html)
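
The quoted rule is easy to check before attempting to register a target. This is a minimal Python sketch using only the standard `ipaddress` module and the ranges named in the quote. The function name `can_register_as_target` and the `vpc_subnets` parameter are hypothetical, and the sketch does not call any AWS API.

```python
import ipaddress

# Ranges named in the quoted documentation (RFC 1918 and RFC 6598).
ALLOWED_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16", "100.64.0.0/10")
]


def can_register_as_target(ip: str, vpc_subnets: tuple[str, ...] = ()) -> bool:
    """Return True if `ip` falls in a VPC subnet or one of the allowed private ranges."""
    addr = ipaddress.ip_address(ip)
    networks = ALLOWED_RANGES + [ipaddress.ip_network(s) for s in vpc_subnets]
    return any(addr in net for net in networks)


print(can_register_as_target("10.1.2.3"))     # True  (RFC 1918)
print(can_register_as_target("100.64.1.1"))   # True  (RFC 6598)
print(can_register_as_target("203.0.113.7"))  # False (publicly routable placeholder)
```

A global load balancer's public address fails this check, which matches the point above: IP targets only help when the remote side is reachable over a private link.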

------
LiquidFlux
Our services have just become available again, initial downtime starting at
20:25 GMT, uptime returning at 20:56 GMT

EDIT: App Engine and Kubernetes environments in our EU region appeared to go
down, Compute Engine was okay.

------
stanmancan
This has been causing issues for us for the last couple of hours. We migrated
a web application over to GAE last month. It's been rock solid and this is the
first issue we've had. Happy to know we can sit back and let them sort it out.

------
EZ-E
Still having issues as of 20:20 UTC - Firebase Database messages not being
delivered to clients. It's been a full hour so far. The delivery rate also had
a minor drop earlier around 03:00AM to 05:00AM but no incident had been
listed.

------
WCityMike
I suspect this is what caused the technical difficulties with HQ Trivia.

------
beilabs
Could have sworn I was going crazy last night; was attempting to write a
scraper to download attachments / tickets from pivotal tracker for a yearly
accounting report. Everything was working fine, then suddenly not. At around
2am I gave up and just put it down to the 'gods' telling me to go to bed.

This was the first article I saw after I woke up, somewhat refreshed.
/feeling-vindicated

~~~
sidcool
Wow, 2 AM, some persistence. But as someone who's very sensitive to sleep
deprivation (I get violently ill if I don't sleep well for more than 2 days in
a row), I'd advise against making it a habit. Especially if you are in your
30s, it has bad aftereffects. PM for more info.

~~~
voltagex_
Can't PM with no contact info in your profile - email is not public - this is
HackerNews.

------
hnarn
It's interesting and scary, mostly scary, that we now are almost at the point
where Google going down equals the Internet going down.

~~~
fortylove
To be honest, I hadn't even noticed. What popular sites went down?

~~~
hnarn
If you check the other comments you'll see that Spotify, Snapchat, Discord
etc. were all affected. We're not talking about "sites" but any application
_or_ site built on Google's infrastructure, even partly.

~~~
Operyl
Not "every application," just applications making use of the GLB or services
built upon it.

------
theclaw
Apparently Snapchat, Spotify and some game servers were affected too.

~~~
Puer
Definitely noticed the issues on Snapchat.

------
pgrote
Is there an easy way to see what cloud based resources sites are using? For
instance, PushBullet is down now and this might explain why. Several other
sites affected, too.

~~~
RobinUS2
Generally, look up the IPs (dig <hostname>) and then attempt a host/whois on
that IP. For PushBullet it's hidden behind CloudFlare, so it's hard to see; you
would have to find an exposed endpoint (which there won't be if they've done it
well).
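
The dig-then-reverse-lookup routine can be scripted roughly as follows. This is a sketch, not a reliable detector: the suffix table is an illustrative and incomplete assumption, the function names are hypothetical, and `lookup` needs network access and will report "unknown" for anything behind a CDN or without reverse DNS.

```python
import socket

# Illustrative, incomplete mapping of reverse-DNS suffixes to providers.
PTR_SUFFIXES = {
    ".amazonaws.com": "AWS",
    ".googleusercontent.com": "Google Cloud",
    ".1e100.net": "Google",
}


def provider_from_ptr(ptr_name: str) -> str:
    """Guess the hosting provider from a reverse-DNS (PTR) name."""
    name = ptr_name.rstrip(".").lower()
    for suffix, provider in PTR_SUFFIXES.items():
        if name.endswith(suffix):
            return provider
    return "unknown"


def lookup(hostname: str) -> dict[str, str]:
    """Resolve a hostname and guess the provider behind each IP (needs network)."""
    results: dict[str, str] = {}
    for info in socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP):
        ip = info[4][0]
        try:
            ptr = socket.gethostbyaddr(ip)[0]
        except OSError:  # no PTR record or resolver failure
            ptr = ""
        results[ip] = provider_from_ptr(ptr)
    return results
```

Calling `lookup("example.com")` returns a dict of IP address to guessed provider; a whois query on the IP would still be needed when the PTR name is missing or uninformative.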

~~~
p1mrx
My Chrome extension (IPvFoo) displays all the subresource IP addresses in a
table:

[https://chrome.google.com/webstore/detail/ipvfoo/ecanpcehffn...](https://chrome.google.com/webstore/detail/ipvfoo/ecanpcehffngcegjmadlcijfolapggal)

(There's a right-click option to look up each address on bgp.he.net, but that
doesn't happen automatically, for privacy.)

------
drexlspivey
I got a Could not connect to POP server 'pop.gmail.com:995' (SSL=ssl): SSL
connect attempt failed because of handshake problems

------
qaq
Is there a single cloud that can match uptime of a single top tier DC?

~~~
jasonvorhe
Uhm, no, because a cloud service spans over several data centers and features
a lot more moving parts than just a solid data center.

Let's turn this around: could a typical data center + server uptime + service
uptime equal that of a major cloud provider?

~~~
qaq
In my experience yes but too few data points.

------
cdiddy2
Yup, Discord impacted for me

~~~
xb95
Yes, yes we are.

~~~
cdiddy2
time to spread resources across clouds? or would that be too expensive?

~~~
RobinUS2
this one appears to be just global LBs, so a load balancer in AWS/Azure that
hits the actual backends over a VPN or something would have worked, but that's
just this case

~~~
jkaplowitz
Even Google Cloud's non-global load balancers, which live within a single
region, wouldn't have hit this particular outage.

~~~
merb
Yep, we had non-global LBs in europe-west3 and I could not detect any outage.

------
throwaway923842
I sure wish we hosted on AWS today. Work sucked big time.

------
avip
Something no one has mentioned yet, could it be that the engineering force at
Google is no longer what it used to be?

~~~
jasonvorhe
Because of one data point? Seriously?

~~~
rrampage
I think OP was jovially referring to a similar comment regarding Amazon
because of their recent Prime day outage.

