
Google Cloud Is Having IO Issues in US-EAST1 - Fabricio20
https://status.cloud.google.com/incident/compute/19012
======
teovall
This outage is affecting Discord.

[https://status.discordapp.com/](https://status.discordapp.com/)

~~~
VolatileVoid
Is it just me, or are Discord's status updates legendarily bad, to the point
of being entirely useless? As a user of the Discord app, don't tell me the
"API is having issues." Just tell me, "Discord is currently down. We are
working on it and apologise for the disruption." Everything else is
unnecessary line noise.

~~~
p1necone
Mentioning the cause of the issue seems better than just "shits broke yo" to
me.

~~~
VolatileVoid
Really? What in their status update read to you as "the problem is on our
side, not yours"? Keep in mind that I'm a programmer, and even I couldn't
figure that out.

~~~
Operyl
Because if it was a problem on your end, it wouldn't be listed on their status
page....?

------
hn_throwaway_99
I am in general a big fan of GCP, especially for startups. In many cases I
find I can be productive faster in GCP than in AWS, in the sense that it's
easier to do "wrong" things in AWS if you don't have a top-notch devops
person.

That said, it seems like GCP has a real, significant problem with overall
reliability. While I don't have numbers to compare, I seem to see frequent
outage notifications on HN for GCP. They really need to focus more on
reliability than on new features.

~~~
ljhsiung
In case you're curious, here are their status logs:
[https://status.cloud.google.com/summary](https://status.cloud.google.com/summary)

Looking at Cloud Compute, the service responsible for today's outage and, I
think, the big one a few months ago (13 hours!!!), it still achieves 99%
uptime (having been down 81 hours total this year).

While 99% uptime "sounds" great, it's paltry in comparison to AWS, which has
set the standard of 99.99% (also known as "four nines") reliability,
translating to about an hour of downtime per year.
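
For concreteness, the arithmetic in a few lines of Python (using an
8,760-hour year):

    HOURS_PER_YEAR = 365 * 24  # 8760

    # Downtime budget implied by each availability level:
    for label, availability in [("two nines", 0.99),
                                ("three nines", 0.999),
                                ("four nines", 0.9999)]:
        downtime = HOURS_PER_YEAR * (1 - availability)
        print(f"{label}: {downtime:.2f} hours/year of downtime allowed")

    # And the uptime implied by 81 hours of downtime this year:
    print(f"81 hours down: {1 - 81 / HOURS_PER_YEAR:.2%} uptime")

Four nines works out to roughly 53 minutes of downtime a year, and 81 hours
of downtime comes to roughly 99.1% uptime.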

That just gives you some more perspective and concrete numbers. Perhaps one
of the reasons you find GCP "faster" is that more resources are devoted to
speed than to reliability.

~~~
lostmyoldone
Where are AWS's logs for all service outages, partial or not?

I can't seem to find them, only "selected" outages. The recent,
several-hours-long service degradation in Frankfurt EC2 isn't in the
historical list; it seems to exist only in the rolling status history.

AWS and GCloud also seem to report disruptions completely differently: AWS
reports on tons of smaller components, which makes any "aggregate" uptime
figure mean something completely different than if you report on larger
aggregates, as GCloud seems to do.

Unless I'm missing something, I don't see how one could reasonably compare
service availability without running large numbers of "canary" instances on
both providers to actually measure aggregate availability.
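
A minimal version of such a canary could look like the sketch below (Python,
stdlib only); the two URLs are placeholders for health endpoints you'd
actually run on each provider:

    import time
    import urllib.request

    ENDPOINTS = {
        "gcp-canary": "http://gcp-instance.example.com/healthz",  # placeholder
        "aws-canary": "http://aws-instance.example.com/healthz",  # placeholder
    }
    stats = {name: {"ok": 0, "total": 0} for name in ENDPOINTS}

    for _ in range(60):  # one probe per minute for an hour; run longer in practice
        for name, url in ENDPOINTS.items():
            stats[name]["total"] += 1
            try:
                with urllib.request.urlopen(url, timeout=5) as resp:
                    if resp.status == 200:
                        stats[name]["ok"] += 1
            except Exception:
                pass  # timeouts and refused connections count as downtime
        time.sleep(60)

    for name, s in stats.items():
        print(f"{name}: {s['ok'] / s['total']:.4%} measured availability")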

------
mostlystatic
Cloud SQL has been down for me for 45 minutes now. The status page still
doesn't mention it, but the Known Issues page for technical support says:

"We are experiencing an issue with Cloud SQL instances hosted in the us-east1
region, beginning at Saturday, 2019-12-07 10:00 US/Pacific. Symptoms: Some
Cloud SQL instances hosted in this region are becoming unavailable, and are
refusing connections. Self-diagnosis: Connections to the Cloud SQL instance
are rejected. Workaround: None at this time Our engineering team continues to
investigate the issue. We will provide an update by Saturday, 2019-12-07 12:04
US/Pacific with current details."

------
synack
The most recent update indicates that this outage is only affecting a single
availability zone.
[https://groups.google.com/forum/m/#!msg/gce-operations/oYoFJ...](https://groups.google.com/forum/m/#!msg/gce-operations/oYoFJiQzgVk/VubkMZ-pBgAJ)

------
Illniyar
Well, there go the four nines.

~~~
microcolonel
No worries, that's what the other datacenters are for. :-)

------
exabrial
Interesting, do they use a SAN or is it local storage?

~~~
ithkuil
Google Cloud Persistent Disks are implemented on a kind of Storage Area
Network (AFAIK it's based on IP networking; IIRC it's iSCSI over Colossus/D,
not Fibre Channel or other off-the-shelf SAN tech).

~~~
joatmon-snoo
It's not iSCSI (at least I don't think so).

The closest thing to public literature about Colossus and D that we've ever
published is the Procella paper, which describes the abstractions that
Colossus provides for it. In some ways it's similar to GFS (RPC interface,
writes are generally append/overwrite) but many things are completely
different now.

~~~
ithkuil
Yeah, that's what Colossus is, but it's not (was not?) suitable for directly
implementing a block store such as a (remote) disk.

You need something that provides the block-store abstraction on top of the
primitives exposed by Colossus/D. Think of something like what modern SSDs do
in order to work efficiently with the underlying flash memory.

Then you have to hook that adapter into your virtualization stack (e.g. KVM)
so you can boot from the volume and mount it from inside the VM. You could
implement a kernel module or do it internally in KVM/QEMU somehow, but iSCSI
provides a straightforward way to implement this in user space: you have a
process on your physical machine that speaks iSCSI upstream and Colossus/D
RPC downstream.

(I don't know if they still do this, but I have a vague memory of somebody
describing the stack of an early version of GCP while I was working there a
long time ago.)
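
For a rough picture of that shape, here's a toy sketch of such a user-space
daemon in Python. It's emphatically not real iSCSI (which is far more
involved): a trivial length-prefixed read/write protocol stands in for iSCSI
upstream, and an in-memory dict stands in for the Colossus/D RPC client
downstream. All names here are made up for illustration:

    import socket
    import struct

    BLOCK_SIZE = 4096
    backend = {}  # stand-in for the downstream store: block number -> bytes

    def recv_exact(conn, n):
        # Read exactly n bytes from the socket, or return None on EOF.
        buf = b""
        while len(buf) < n:
            chunk = conn.recv(n - len(buf))
            if not chunk:
                return None
            buf += chunk
        return buf

    def serve(conn):
        while True:
            header = recv_exact(conn, 9)  # 1-byte op + 8-byte block number
            if header is None:
                return
            op, lba = header[0], struct.unpack(">Q", header[1:])[0]
            if op == 0:  # READ: unwritten blocks read as zeros
                conn.sendall(backend.get(lba, bytes(BLOCK_SIZE)))
            elif op == 1:  # WRITE: the real daemon would issue an RPC here
                data = recv_exact(conn, BLOCK_SIZE)
                if data is None:
                    return
                backend[lba] = data

    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 3260))  # 3260 is the iSCSI port number
    srv.listen(1)
    while True:
        conn, _ = srv.accept()
        with conn:
            serve(conn)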

~~~
vl
It’s a custom log-based block device implemented on top of Colossus/D in the
user-mode lib in the virtualizer (Vanadium). The guest OS communicates using
NVMe or VirtIO; Vanadium intercepts and calls the PD lib.

This design has only one hop to the storage node. Low-latency workloads
benefit from this design; high-bandwidth workloads sometimes actually benefit
from off-loading PD to another host. To do iSCSI with one hop, you'd need to
implement an iSCSI interceptor, and you'd basically have the same design with
less flexibility for guest OSes.
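
To make "log-based block device" concrete, here's a minimal sketch of the
general idea, illustrative only (none of these names or details are Google's):
every write appends a block to a log, and an index maps each logical block
number to its newest offset:

    import os

    BLOCK_SIZE = 4096

    class LogBlockDevice:
        def __init__(self, path):
            self.log = open(path, "a+b")  # append-only data log
            self.index = {}               # logical block number -> newest offset

        def write_block(self, lba, data):
            assert len(data) == BLOCK_SIZE
            self.log.seek(0, os.SEEK_END)
            offset = self.log.tell()
            self.log.write(data)          # old versions stay behind in the log
            self.log.flush()
            os.fsync(self.log.fileno())   # durable before acking the guest
            self.index[lba] = offset      # index always points at the newest copy

        def read_block(self, lba):
            if lba not in self.index:
                return bytes(BLOCK_SIZE)  # unwritten blocks read as zeros
            self.log.seek(self.index[lba])
            return self.log.read(BLOCK_SIZE)

    dev = LogBlockDevice("/tmp/pd.log")
    dev.write_block(7, b"x" * BLOCK_SIZE)
    assert dev.read_block(7) == b"x" * BLOCK_SIZE

A real implementation also needs garbage collection of stale log entries and
a durable index, but the read/write path has the same one-hop shape.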

The irony, of course, is that all this is a lot of legacy technology
needlessly wasting compute power: the guest filesystem tries to talk in 4K
blocks to a “block device”, which goes through multiple layers of queues,
then is re-mapped to another abstraction, which goes over the network to
multiple hosts, etc. Not a single cloud customer ever said “we are so excited
to manage volume sizes and bandwidth quotas for PD”. A better design would be
to implement a true data-center-level filesystem to better support container
workloads and leave PD for legacy cases, but Google’s storage management is
so detached from reality that it’s impossible to do a cross-organizational
project like this.

~~~
ithkuil
What do you mean by "true data-center-level"? An FS with POSIX semantics
suitable for normal apps doing file IO, or something else?

~~~
vl
I would break it down like this: 1) FS-core: provide FS-like semantics on
top of D (or Colossus/D). Apart from the data-format and GC challenges, the
missing piece is a fast DC-level locking service (see the sketch after this
list).

2) Then, for VMs, you can do an FS driver, jump to the VMM, and boom, you're
done; multiple legacy levels of re-packing and redirection are gone. For
shared-access cases you do an NFS/Samba interceptor and then the same code
path as above.
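
For 1), the missing locking piece might look, in spirit, like a lease-based
lock service (think Chubby). Here's a toy, single-node, in-memory sketch,
purely to pin down the semantics; the hard part, not represented here, is
making it fast and fault-tolerant at DC scale:

    import time

    class LeaseLockService:
        # Toy lease-based lock service: a lock is held until released or
        # until its lease expires; holders must renew periodically.
        def __init__(self, lease_seconds=10.0):
            self.lease_seconds = lease_seconds
            self.locks = {}  # lock name -> (owner, lease expiry)

        def acquire(self, name, owner):
            now = time.monotonic()
            holder = self.locks.get(name)
            if holder is None or holder[1] < now or holder[0] == owner:
                self.locks[name] = (owner, now + self.lease_seconds)
                return True
            return False  # someone else holds a live lease

        def renew(self, name, owner):
            holder = self.locks.get(name)
            if holder and holder[0] == owner and holder[1] >= time.monotonic():
                self.locks[name] = (owner, time.monotonic() + self.lease_seconds)
                return True
            return False  # lease lost; caller must stop assuming ownership

        def release(self, name, owner):
            if self.locks.get(name, (None, 0))[0] == owner:
                del self.locks[name]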

This system would be highly beneficial not only for Cloud but for other
Google properties as well: it would provide a normal POSIX FS to Borg jobs.
The amount of acrobatics required to use any open-source package is enormous
and has by now exceeded the cost of developing such an FS multiple times
over; ask the YouTube, MySQL, package-management, etc. groups.

Another example of Google’s storage craziness is cross-DC storage. This
should be a low-level Colossus responsibility. Instead, a godzillion teams
implement their own: GCS, PD, Spanner, Placer, etc. Crazy.

~~~
joatmon-snoo
Haha, yeah, I do wonder about the insanity on a pretty regular basis :)

I will say, though, that most of the time I don't want to use open-source
stuff. Observability is pretty crap, I neither want nor need software that
uses write() without fsync(), and the assumptions most OSS makes about
filesystems give me nightmares on Borg.
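
For anyone wondering about the write()-without-fsync() point: a successful
write() only means the data reached the kernel's page cache, not stable
storage; you need an fsync() to force it down. In Python, for example:

    import os

    with open("journal.log", "ab") as f:
        f.write(b"record\n")    # lands in Python's userspace buffer
        f.flush()               # moves it into the kernel page cache
        os.fsync(f.fileno())    # asks the kernel to persist to stable storage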

------
Jasper_
The beauty of distributed systems is that you spread your services across lots
of different computers, so that when one of them goes down, they all go down.

~~~
eyjafjallajokul
Not sure what you mean here. Would a similarly architected non-distributed
system have higher levels of availability?

~~~
xvector
He was joking :) A buggy or poorly designed distributed system might behave
more like a house of cards than like independent silos.

------
Animats
Is this the beginning of Google's wind-down of "cloud services", as part of
the de-emphasis of Alphabet and the refocus on the core ad business?

~~~
ilogik
yes, they're moving out of a profitable business that they know a lot
about.

and they start this move with an outage during the busiest time of the year

why not?

~~~
leesalminen
> yes, they're moving out of a profitable business that they know a lot
> about.

Rackspace is trying to get out of the public cloud game. We had a $20k/mo
account with them, and the rep called me every month asking if we wanted
their help (professional services) migrating to AWS. I was baffled, but they
admitted they don’t want to be in the cloud game anymore and wanted to get
customers off their platform.

~~~
kick
The notification on their website is interesting:

 _Rackspace announced that it has completed the acquisition of Onica, an
Amazon Web Services (AWS) Partner Network (APN) Premier Consulting Partner and
AWS Managed Service Provider._

