
Google Cloud outage brings down Layer - wikyd
http://status.layer.com/
======
johnm1019
I am so turned off when I click a "Pricing" link and get a contact form. Even
more so when I read that, "our pricing team" [will get back to you].

So, you have an entire team of people who will try and maximize how much I
pay? Sounds like a great experience doing business with you. /heavysarcasm

~~~
homero
I automatically skip any product where I have to speak to a human at any time.

~~~
theprotocol
Is it me, or are a lot of web-based service providers very chatty lately?

I won't name and shame any particular ones, but I will say I've found myself
regretting signing up for trials of certain services because of the almost
sycophantic attention I'd receive from the oh-so-personable and friendly CEOs
who make it a point to personally message all customers. I usually respond,
initially, but then it quickly becomes pushy and intrusive, e.g. "Hi, I've
noticed you haven't used [x] feature yet." "Hello? Are you getting my emails?"
"Hello?"

I don't mean to be rude, but I didn't sign up for the "omg you're so friendly
and amazingly helpful" show. I just wanted to try the service out. Kindly stop
breathing down my neck! :/

~~~
j_s
This happens because it works, though not necessarily so much for the HN
crowd.

~~~
bogomipz
Does it though? In my experience a good product or service doesn't require
constant spamming. It has nothing to do with whether someone reads HN or not.

~~~
j_s
Does your request for confirmation incorporate the ancient HN discussion I
linked?

------
nwrk
Don't like the attitude. Pointing fingers doesn't help paying customers
trapped by Layer's poor design choices.

~~~
smt88
Especially when they seem to be referencing only a single region. Multi-region
deployments are the most basic protection against outages when using IaaS.

~~~
andyfleming
They should at least be in multiple availability zones. Multiple regions often
come with a lot of challenges, but there isn't much reason not to be
redundant in multiple AZs.

~~~
Artemis2
Were they or were they not in multiple AZs? Developing for multiple
availability zones is trivial when creating cloud-first software (and it's
irresponsible not to use AZs!); multi-region comes with its own set of
problems.

~~~
inlined
In the report they say they're looking at moving to a new region, but Google
apparently told them that us-central1-a was down. The "-a" makes it an AZ. It
sounds like they're only on one AZ and may not fully understand the
difference.

[correction: they accidentally called usc1-a a region, but everything
mentioned in their outage was a zone. They specifically called it a
"deployment zone" not an availability zone, so it sounds like an issue of
inexperience with best practices.]

[obligatory disclaimer: I'm a Google employee. I don't have a relationship
with Layer]
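
To make the zone/region distinction concrete, here is a minimal sketch. It assumes the usual GCE naming where a zone is the region name plus a single-letter suffix (as in us-central1-a above); `split_zone` and `zones_by_region` are hypothetical helpers for illustration, not part of any Google API.

```python
from __future__ import annotations


def split_zone(zone: str) -> tuple[str, str]:
    """Split a zone name such as 'us-central1-a' into (region, suffix).

    Assumption: a zone is '<region>-<single letter>'. Anything else,
    including a bare region name, is rejected.
    """
    region, sep, suffix = zone.rpartition("-")
    if not sep or not region or len(suffix) != 1 or not suffix.isalpha():
        raise ValueError(f"not a zone name: {zone!r}")
    return region, suffix


def zones_by_region(zones: list[str]) -> dict[str, set[str]]:
    """Group zone suffixes by region to show how widely a deployment is spread."""
    grouped: dict[str, set[str]] = {}
    for zone in zones:
        region, suffix = split_zone(zone)
        grouped.setdefault(region, set()).add(suffix)
    return grouped


# A deployment pinned to one zone: one region, one zone.
single = zones_by_region(["us-central1-a"])

# The same region spread over two zones.
spread = zones_by_region(["us-central1-a", "us-central1-b"])
```

Under these assumptions, `single` holds one region with one zone, while `spread` holds the same region with two zones. Passing a bare region such as "us-central1" raises `ValueError`, which is the mix-up described above.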

------
knorker
> As we are now several hours into this outage and do not have satisfactory
> timeline for resolution, we have begun the process of migrating our hosts
> into another deployment zone within GCE

Wait, what? Isn't running in multiple zones something like rule #1 or #3 in
"how to run in the cloud"?

So why did they not already do this?

------
mbesto
> _As we are now several hours into this outage and do not have satisfactory
> timeline for resolution, we have begun the process of migrating our hosts
> into another deployment zone within GCE. We will have a baseline set of
> services migrated within the hour and evaluate our ability to operate in a
> split deployment. Should we need to pursue a complete migration of hosts
> across zones then we would expect another 4-5 hours to return to full
> operational capacity._

Wait, their service isn't set up to operate in a split environment out of the
box? I think it's time SaaS companies start documenting their IaaS setup so
purchasers can do a high level audit before they decide to use it for
potentially a core part of their own product/service.

~~~
tschellenbach
I like how Algolia does that,
[https://www.algolia.com/infra](https://www.algolia.com/infra) (their
blogposts and presentations go into much more detail)

Currently thinking of creating a similar page for getstream.io, at the moment
we always explain it during sales/onboarding calls. (we replicate our data to
3 different instances across multiple AZs)

~~~
blakewatters
Thanks for the link. We are taking a look at what Algolia has done here and
will likely put together a public infrastructure overview page for Layer as
well.

------
flyt
It doesn't do any good to point the finger at your vendors when your service
goes down; that data isn't useful for your customers. Never forget the lesson
of
[http://www.whoownsmyavailability.com/](http://www.whoownsmyavailability.com/)

~~~
ngrilly
I'm not sure I agree. Customers like to know why it doesn't work. If it was a
physical machine, they would have said something like "the disks are broken
and we are replacing them". But it is cloud and they said "Google persistent
disks are currently unavailable and they are fixing it".

~~~
knorker
But the real reason is "we didn't set up our system properly".

This is like saying "Hitachi Storage hard drives broke" when you actually mean
"we didn't run RAID".

~~~
ngrilly
You can't compare persistent disks failing in a whole zone, with a RAID array
failing in a single machine.

There is a reason why Amazon and Google take EBS/Persistent Disk failures
very seriously: they are not supposed to be unavailable for several hours
unless the whole datacenter is unable to operate (flood, fire, etc.), and
that's not the case here.

If your RAID fails, and you have a support contract which guarantees
restoration within 1 hour, and it's not restored within 1 hour, then I think
you can legitimately say something was wrong at your provider. It's not
pointing fingers. Everyone makes mistakes. It's taking responsibility.

That said, I agree they should have run in multiple zones, as recommended by
Google, if they need/want to avoid that kind of downtime.

But I maintain Google Compute Engine Persistent Disks are not supposed to fail
in such a way, and I'm quite sure Google will do whatever they can to avoid
this in the future, instead of saying "don't point fingers at us, it's supposed
to happen".

~~~
jpatokal
Two clarifications: the disks were not "unavailable", they had high latency
(slow I/O) in one zone only (us-central1-a); and this affected only SSD PDs,
not "regular" PDs. Per the SLA [1], it's "downtime" when PDs are completely
unavailable for >5 minutes in at least two zones, and neither condition was
met here.

[1]
[https://cloud.google.com/compute/sla](https://cloud.google.com/compute/sla)
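
Read literally, the two conditions above amount to a simple check. This is a minimal sketch that assumes only the summary in this comment, not the SLA's full text. `ZoneIncident` and `counts_as_downtime` are hypothetical names, the 180-minute duration is an illustrative stand-in for "several hours", and the check ignores whether the zone outages overlap in time.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ZoneIncident:
    zone: str
    fully_unavailable: bool  # False means degraded, e.g. high latency
    minutes: float           # how long the condition lasted


def counts_as_downtime(
    incidents: list[ZoneIncident],
    min_minutes: float = 5,
    min_zones: int = 2,
) -> bool:
    """Apply the rule as summarized above: complete unavailability for
    more than `min_minutes` in at least `min_zones` distinct zones."""
    affected = {
        i.zone
        for i in incidents
        if i.fully_unavailable and i.minutes > min_minutes
    }
    return len(affected) >= min_zones


# This incident as described: slow I/O (not a full outage) in one zone.
this_incident = [ZoneIncident("us-central1-a", fully_unavailable=False, minutes=180)]
meets_sla_definition = counts_as_downtime(this_incident)

# A hypothetical case that would qualify: full outage in two zones.
two_zone_outage = [
    ZoneIncident("us-central1-a", fully_unavailable=True, minutes=10),
    ZoneIncident("us-central1-b", fully_unavailable=True, minutes=10),
]
```

Here `meets_sla_definition` is `False` on both counts: the disks were degraded rather than unavailable, and only one zone was involved.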

All that said, people choose SSD because it's faster and has higher
throughput, so SSDs _not_ being fast is obviously a real problem for
applications relying on this, and rest assured we are indeed doing whatever we
can to avoid this in the future.

Disclaimer: I work in Google Cloud Support.

~~~
pdeva1
This is a typical Google Cloud Support response (I used to host on GCloud):
stretching the definitions to somehow get out of responsibility. If the SSDs
have super high latency, then for most purposes they are indeed 'unavailable'.
There is a reason why the user provisioned SSDs and not a regular disk.

------
ben_jones
Most concise summary of Layer I could find on the internet quickly.

> Layer is an amazingly elegant and light-weight solution for video
> communication. Layer is currently in a private beta primarily focused on
> Video, Voice and Chat on Android and iPhone. [1]

Comment was in 2014.

[1]: [https://www.quora.com/What-is-the-difference-between-PubNub-...](https://www.quora.com/What-is-the-difference-between-PubNub-and-Layer)

~~~
JasonSage
If you just go to layer.com, the first text you see on the page does a pretty
good job of spelling out what it is. At least, it did for me. It's also more
up-to-date than that comment, it would seem.

~~~
tschellenbach
Layer is just a building block for adding chat to your app. Similar to how you
would use Elastic for search or Sendgrid for email.

~~~
jameskegel
That reminds me, I wonder what ever came of the Adria Richards v. Sendgrid
issue.

------
wiradikusuma
Slightly OOT: Does anyone know a good alternative to Layer?

~~~
CometChat
CometChat works seamlessly on web, mobile & desktop! Your users can be on any
platform and communicate with each other.

Check out the demo here:
[https://www.cometchat.com/demo](https://www.cometchat.com/demo)

