
App Engine down - soofaloofa
http://code.google.com/status/appengine
======
davidjgraph
Before the doom and gloomers come out, this is the first time since leaving
beta I can remember it happening.

We left AWS about 18 months ago after one of the outages and switched to GAE.
I've counted 3-4 big downtimes for AWS compared to this one on GAE. That's
still a good decision (for now)....

~~~
acdha
One thing to remember: this took down all of app engine for at least an hour.
AWS has had only 17 minutes of downtime affecting all of us-east this year
(that network glitch a couple days after PyCon) - the rest of it has been a
subset of the service amplified by people rediscovering that they weren't as
redundant as they thought.

The correct lesson to draw is that any one point of infrastructure is a risk, so
you need to scale wide. This is possible to do with AWS regions, or other
providers - even internal bare iron if you're so inclined, but impossible to
do with GAE because you're committed to a single-vendor API as well as their
infrastructure.

~~~
stickfigure
GAE applications are distributed across multiple data centers[1], so in theory
you get "scale wide" automatically. Unfortunately it looks like there was some
sort of flaw in the architecture. I believe this is the first systemwide
failure of the HRD.

The real question is: Can you and your ops team build a "scale wide" system
better than Google?[2] How much effort are you willing to put into it, when
those development resources could be put into making features instead?

[1] For apps which use the high-replication datastore. Old (deprecated)
master/slave apps are served out of a single datacenter.

[2] This is a reasonably serious question to ask. In GAE, you're one app among
many, so a you-specific scaling solution might be easier and more robust than
the generalized one Google builds. But not necessarily.

~~~
acdha
I completely agree that Google is likely to do better than many teams, modulo
your second point (which I completely agree with - generic is much harder than
specific). For me it really just comes down to the lock-in aspect: with GAE if
you decide that Google isn't taking the platform in the right direction for
your business you're looking at something close to rewriting your application.
This is far from the most likely outcome - although pricing can be interesting
- but it's the kind of thing you really want to consciously acknowledge,
similar to the way Netflix has accepted the risks of AWS by investing heavily
in failure management tools.

~~~
latchkey
I keep hearing the lock-in argument over and over and I'm not quite sure that
there is a solid basis for it beyond a level of paranoia. It's fine to be
paranoid, I just don't want it to hold me back unnecessarily.

Looking at things more closely, the only thing that you're truly locked into
with GAE is the esoteric nature of the datastore. This isn't any worse than
picking say MySQL vs. Oracle or Riak vs. Mongo. Most applications end up
depending on some sort of specific functionality of each database that they
are written for. While it would be difficult to migrate to another solution
for storing your data, it wouldn't be impossible.

There is no way to predict what direction any product might take in the
future. Look at the way that Oracle is treating MySQL now. Tons of vendor
lockdown there.

The only reason to migrate away from GAE would be if you find out that your
application doesn't work well on it (pricing, scalability, etc) or if Google
decides to kill GAE entirely. Hopefully you do the analysis of your
application before you decide to use GAE (ie: you can't blame GAE for you
deciding to use it) and with a 3 year deprecation promise, I'm pretty
confident that it will be around for a while longer.

~~~
MatthewPhillips
> Looking at things more closely, the only thing that you're truly locked into
> with GAE is the esoteric nature of the datastore. This isn't any worse than
> picking say MySQL vs. Oracle or Riak vs. Mongo.

Sure it is. If you pick the wrong DB you're only locked into that DB. You pick
GAE and you're locked into GAE's DB... and GAE. I can move my MySQL DB to
another cloud provider.

~~~
marekmroz
@latchkey I think you are missing the point that MatthewPhillips is making.
With GAE's DB there is nowhere to move, so if you want to move you would have
to migrate your data to a different DB engine. If your hosting/cloud provider
uses something standard like MySQL you can find another provider or roll your
own if you decide to migrate out.

------
davis_m
I think this is larger than just GAE.

<http://internettrafficreport.com/namerica.htm>

It seems like large portions of the internet are down.

~~~
EwanToo
Internet Traffic Report, while a nice concept, is unfortunately very
misleading.

Their sample size is extremely small, and most of those are permanently down.

Have a look through their list of north american routers and find one of them
where packet loss has gotten worse as their main overall graph for packet loss
would suggest - I've just been through them all and couldn't find one.

~~~
calinet6
Their _baseline_ values are very misleading.

But their relative metrics can still be useful.

For example, it's very inaccurate to say that 51% of the internet is down.

But it's precise to say that packet loss among the working nodes has increased
about 30% in the last 24 hours, and sharply.

~~~
EwanToo
But that's likely 2 or 3 nodes, not a meaningful sample

------
fidotron
It's time we remembered the whole strength of the internet was that it was
distributed and we avoided introducing single points of failure. We have ended
up using vast amounts of infrastructure for no reason other than developer
convenience (often with respect to security), when having local direct
connections is often more suitable than shooting everything into the cloud.

~~~
lurker14
Which is better? Having a day of downtime each year, or not launching at all?

~~~
peterwwillis
Which is better? Using a fallacious comparison to suggest cloud computing is
the only viable option, or comparing the pros and cons of different computing
models to choose the best one for you?

~~~
davidkatz
While the argument was perhaps coming on a bit too strong, it's hard to deny
the ease of deployment on cloud services. It's probably a safe bet to say that
for most early stage startups the cloud is a good move.

~~~
peterwwillis
I'm at a loss in these discussions. I don't understand this developer-point-
of-view.

Can you specifically give me examples of why using a cloud provider is better
for a startup than, for example, using a couple desktops in your garage?

You can't say it's because of backups because the cloud doesn't provide a
backup (unless you purchase an extra data backup solution with your cloud
provider?). And correct me if i'm wrong, but you still have to set up your
development environment on your local computer to write the code, install
libraries to test with, etc.

What exactly are the steps involved in "deploying" that you couldn't do on
your laptop, or a VPS?

~~~
dpritchett
_The Germans referred to a Schwerpunkt (focal point and also known as
Schwerpunktprinzip or concentration principle) in the planning of operations;
it was a center of gravity or point of maximum effort, where a decisive action
could be achieved. Ground, mechanised and tactical air forces were
concentrated at this point of maximum effort whenever possible. By local
success at the Schwerpunkt, a small force achieved a breakthrough and gained
advantages by fighting in the enemy's rear. It is summarized by Guderian as
“Klotzen, nicht kleckern!” (literally "boulders, not blots" and means "act
powerfully, not superficially")._ [1]

For a product prototype, the initial primary goal is "get it online so that we
can start validating our assumptions". System administration skills and in-
house server administration teams are _valuable_ but not necessary.

[1] <http://en.wikipedia.org/wiki/Blitzkrieg#Schwerpunkt>

~~~
gtaylor
This is a much more succinct explanation than my own. This is the point I was
trying to make.

Even given that I have a good bit of sysadmin skills, I am needed more as a
software developer right now in the early goings. I expect, as you've pointed
out, that priorities will change with time and growth. We may even move to
bare metal eventually, if we find ourselves needing and able to do so.

Excellent analogy.

~~~
peterwwillis
So to summarize you both: You want a rapid development platform that doubles
as a production system and costs nothing to maintain. That does sound useful!

~~~
gtaylor
What? Did you read any of what was written? Point-by-point breakdown for you:

> You want a rapid development platform

No, we want low-maintenance infrastructure.

> that doubles as a production system

It _is_ a production system. It _does_ successfully serve many thousands of
users for us every day. We've yet to have an outage that wasn't our own fault.

> and costs nothing to maintain

What? I specifically said we're willing to pay more not to have to spend as
much time on infrastructure.

It's OK if you're too set in your ways to even attempt to level with
alternative points of view, but at least try to read a little more thoroughly.
And maybe admit that you're not willing to budge, so nobody wastes time trying
to explain an alternative point of view.

~~~
peterwwillis
I wasn't dismissing your point of view. Heroku is what I described. AWS, to a
certain extent, is what I described. And I said it was a production system,
too; you didn't have to over-emphasize what I had already said, as if I didn't
just say it. "Costs nothing to maintain" is in comparison to paying to
maintain it yourself. But thanks for knee-jerking.

------
chrisfarms
> "App Engine is currently experiencing serving issues. The team is actively
> working on restoring the service to full strength. Please follow this thread
> for updates."

-- Max Ross (Google) maxr@google.com via googlegroups.com

[https://groups.google.com/forum/?fromgroups=#!topic/google-a...](https://groups.google.com/forum/?fromgroups=#!topic/google-appengine-downtime-notify/SMd2pDJsCPo)

------
daave
And they've sent the all-clear:

At this point, we have stabilized service to App Engine applications. App
Engine is now successfully serving at our normal daily traffic level, and we
are closely monitoring the situation and working to prevent recurrence of this
incident.

This morning around 7:30AM US/Pacific time, a large percentage of App Engine’s
load balancing infrastructure began failing. As the system recovered,
individual jobs became overloaded with backed-up traffic, resulting in
cascading failures. Affected applications experienced increased latencies and
error rates. Once we confirmed this cycle, we temporarily shut down all
traffic and then slowly ramped it back up to avoid overloading the load
balancing infrastructure as it recovered. This restored normal serving
behavior for all applications.

We’ll be posting a more detailed analysis of this incident once we have fully
investigated and analyzed the root cause.

Regards,

Christina Ilvento on behalf of the Google App Engine Team

[https://groups.google.com/forum/#!topic/google-appengine-dow...](https://groups.google.com/forum/#!topic/google-appengine-downtime-notify/SMd2pDJsCPo/discussion)
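
To make the "shut down all traffic and then slowly ramp it back up" step concrete, here is a minimal sketch of a staged ramp. This is not Google's actual mechanism, which the notice does not describe; the function names, the linear schedule, and the 10% starting fraction are all assumptions for illustration.

```python
def ramp_schedule(full_rate, steps, start_fraction=0.1):
    """Return per-interval admission limits that grow linearly
    from start_fraction * full_rate up to full_rate.

    All parameters are hypothetical; a real system would tune them
    against observed backend health rather than a fixed schedule.
    """
    if steps < 2:
        return [full_rate]
    schedule = []
    for i in range(steps):
        fraction = start_fraction + (1.0 - start_fraction) * i / (steps - 1)
        schedule.append(round(full_rate * fraction))
    return schedule


def admit(requests_waiting, limit):
    """Admit at most `limit` requests this interval.

    Returns (admitted, still_waiting). Requests over the limit are held
    back so the recovering backend is not hit with the whole backlog.
    """
    admitted = min(requests_waiting, limit)
    return admitted, requests_waiting - admitted


if __name__ == "__main__":
    backlog = 2500
    for limit in ramp_schedule(full_rate=1000, steps=4):
        admitted, backlog = admit(backlog, limit)
        print(f"limit={limit} admitted={admitted} backlog={backlog}")
```

The point of capping admissions is that backed-up traffic drains gradually instead of overloading each job the moment it comes back, which is the cascading pattern the notice describes.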

------
abhijitr
Meanwhile... Gmail etc are working quite fine. So the claim that if you build
on GAE you "take advantage of the same infrastructure used for Google
services!!" starts to ring a bit hollow.

~~~
rytis
I wonder if anyone ever believed this claim to be true...

~~~
wildmXranat
Or if the definition of the word 'same' is somewhat fluid enough to get away
with it

------
cyberpanther
I'm seeing a bunch of Google properties down also. Maybe they are running on App
Engine? Like <https://developers.google.com/>

~~~
libria
Also,

<http://dartlang.org>

<http://golang.org>,

<http://code.google.com/codejam/>

<http://www.chromeexperiments.com/>

Gotta give 'em props for dogfooding.

~~~
moepstar
Props, really?

That's what i _expect_ them to do - otherwise i can't see anyone
trusting/using them if even they themselves avoid their own product(s)..

------
cilvento
At about 7:30am US/Pacific time this morning, Google began experiencing slow
performance and dropped connections from one of the components of App Engine.
Many App Engine applications are experiencing slow responses and an inability
to connect to services. We currently show that a majority of App Engine users
and services are affected. We are actively working on restoring service as
quickly as possible.

We are posting regular updates to our downtime-notify list here:
[https://groups.google.com/forum/?fromgroups=#!topic/google-a...](https://groups.google.com/forum/?fromgroups=#!topic/google-appengine-downtime-notify/SMd2pDJsCPo)

Thanks, Christina, Google App Engine Product Manager

------
kjhughes
What's the earliest sign of trouble you've had?

Pingdom reports my GAE-hosted site has been down since 2012-10-26 10:37:38
EST, a bit over an hour now.

UPDATE: My site is back. Delayed report from Pingdom says site came back
online after 50 minutes. Performance is sketchy still. We're probably not in
the clear yet.

At least we can now get to the status dash:

<http://code.google.com/status/appengine>

~~~
warrentr
I think the first tweet i saw was 7:35 PST

------
jsdalton
It's really quite remarkable (to be honest, inexcusable is probably a better
word) that their status page is failing as well. My expectations for a company
with Google's resources and infrastructure are a lot higher than that.

Nothing on their Twitter account either: <https://twitter.com/app_engine>

A poor handling of a systems failure in my opinion.

~~~
rdwallis
If you subscribe to:

[https://groups.google.com/forum/?hl=en&fromgroups#!forum...](https://groups.google.com/forum/?hl=en&fromgroups#!forum/google-appengine-downtime-notify)

They'll email you when issues occur and info becomes available.

It took about 30 minutes after the crash for me to receive an email which
seems very reasonable.

~~~
IheartApplesDix
If you find that reasonable, I have a slice in Brooklyn to sell you.

------
Yoms
Latest update:

"At approximately 7:30am Pacific time this morning, Google began experiencing
slow performance and dropped connections from one of the components of App
Engine. The symptoms that service users would experience include slow response
and an inability to connect to services. We currently show that a majority of
App Engine users and services are affected. Google engineering teams are
investigating a number of options for restoring service as quickly as
possible, and we will provide another update as information changes, or within
60 minutes."

[https://groups.google.com/forum/?fromgroups=#!topic/google-a...](https://groups.google.com/forum/?fromgroups=#!topic/google-appengine-downtime-notify/SMd2pDJsCPo)

------
brutuscat
Status reports from the mailing list
[https://groups.google.com/d/topic/google-appengine-downtime-...](https://groups.google.com/d/topic/google-appengine-downtime-notify/SMd2pDJsCPo/discussion)

------
debacle
I'm really happy I don't host in the cloud. How quickly are the cost savings
of cloud computing obliterated by PR, customer service, and system
administration time when an outage like this occurs?

~~~
tomgallard
Surely hosting yourself exposes you to just as much, if not more risk? Problem
in the datacentre where you're co-lo'd, or one of your servers blows up?

I think people not trusting the cloud is similar to how people feel safer
driving their cars than taking a plane. The stats say the plane's safer, but
people prefer being in control. People like the idea of being in control of
their servers, even if that means there's hundreds of extra things that can go
wrong compared to a cloud provider.

We also get a lot more publicity when a cloud provider has an outage as LOTS
of sites go down at once. Hardly anyone notices when service X who self-host
go down for a few hours...

~~~
debacle
We're coloed across three datacenters spanning the US (one might be in TO I
think) and if a datacenter were to go down, we have a hot backup that's no
more than 12 hours stale.

The only real manual maintenance that we've got is a rolling reimaging of
servers based on whatever's in version control, which usually takes a few
hours twice a year, but we'd probably do that if we were in the cloud anyway.

When you can script away 90% of your system administration tasks, hosting in
the cloud doesn't really make a ton of sense.

~~~
tomgallard
But a DNS based failover is still going to take an hour or so to propagate
right (given that a lot of browsers/proxies/DNS servers don't respect TTL very
well at all)? And then you end up with a system with stale data, and the mess
of trying to reconcile it when your other system comes back up.

I'd take an hour long Appengine outage once a year over that anytime!

~~~
peterwwillis
Your name server or stub resolver is what respects DNS TTL, not your browser
or proxy. Everyone - including people hosting on AWS - needs to be able to
fail over DNS, if the AWS IP you're using is in a zone that just went down,
for example.

Any time you have an outage you need to contact your service provider to get
an estimate of downtime. If they can't give you one, assume it'll take forever
and cut the DNS over. The worst case is some of your users will start to come
back online slowly. If you don't cut over, the worst case is all your users
are down until whenever the service provider fixes it, and you get to tell
your users "we're waiting for someone else to deal with it", which won't make
them very happy.

12 hour stale data sounds kind of long to me. 4 hours sounds more reasonable.
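
The cut-over rule above can be sketched as a small decision function. This is only an illustration: the function name and the 30-minute tolerance are assumptions, not something from the comment, which only says to treat a missing estimate as "forever".

```python
from typing import Optional


def should_cut_over_dns(
    eta_minutes: Optional[float],
    max_tolerable_minutes: float = 30.0,  # assumed threshold, tune per service
) -> bool:
    """Decide whether to fail DNS over to the standby site.

    eta_minutes is the provider's downtime estimate, or None if the
    provider could not give one. No estimate is treated as an
    indefinite outage, so the answer is always to cut over.
    """
    if eta_minutes is None:
        return True
    return eta_minutes > max_tolerable_minutes


if __name__ == "__main__":
    print(should_cut_over_dns(None))   # no estimate from the provider
    print(should_cut_over_dns(10.0))   # short, bounded outage
    print(should_cut_over_dns(120.0))  # long outage
```

Resolvers that ignore TTLs make the switch slower for some users, but that only delays recovery for a subset; not cutting over leaves everyone down until the provider recovers.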

~~~
codeka
I've seen plenty of crappy ISP DNS servers ignore TTL values and cache DNS
entries for many hours longer than they're supposed to. Unfortunately, it's
all too common.

------
foolery
Yup, all of the HRD apps are down. But the M/S apps are working.

~~~
jis
Well, my one M/S app is also down.

~~~
aderaynal
I have 2 M/S apps and 3 HRD. They are all down :(

~~~
cfontes
Mine is also down... :( Has anyone got any information about it?

------
aidos
Dropbox [0] is showing a 500 for me too. I'm very confused as to what has just
happened to the internet...

[0] <https://www.dropbox.com/>

------
tomnewton
My Google contact said that 'SRE are all over it. Hope to have more details
soon.' but that was about 30 minutes ago.

Does tumblr.com use app engine? They're down...

~~~
tszming
@tumblr (<https://twitter.com/tumblr/status/261840787350896640>)

Tumblr is experiencing network problems following an issue with one of our
uplink providers. We will return to full service shortly.

------
libria
Hm, bad week for the Cloud. Can't even get to the status page; hopefully it's
not hosted on App Engine.

So going forward, what's the best way to protect against cloud downtime? Have
a hot/standby failover with a different provider? Prepare customers'
expectations for the possibility of server outages? Do a ton of research, pay
$$$ for lots of nines uptime, and lambast the host when they don't deliver?

~~~
josh2600
Build redundancy into your software to deal with single provider failure.
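
A rough sketch of what that can look like at the application level, assuming each provider is wrapped in a callable that raises on failure. The names here are hypothetical, and real redundancy also needs data replication between providers, which this does not cover.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def call_with_fallback(providers: Sequence[Callable[[], T]]) -> T:
    """Try each provider in order and return the first successful result.

    Each entry is a zero-argument callable (hypothetical wrapper around
    one hosting provider's endpoint) that raises an exception on failure.
    """
    last_error = None
    for provider in providers:
        try:
            return provider()
        except Exception as exc:  # any failure means: try the next provider
            last_error = exc
    raise RuntimeError("all providers failed") from last_error


if __name__ == "__main__":
    def primary():
        raise ConnectionError("primary provider is down")

    def standby():
        return "served from standby"

    print(call_with_fallback([primary, standby]))
```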

~~~
Achshar
That is not always a feasible option, especially for young projects with
limited capital.

------
bsaul
I would love to see people at Google showing all the internal tools
they're using to detect and solve this kind of issue. I can only imagine a
war room with screens all over the place showing a gigantic number of red
flashing lines :) Hope it doesn't last for long though; 10 minutes ago I was
just praising what a good choice App Engine has been so far...

------
hugofierro
I hope it's not due to DiRT Exercises (SRE Disaster Recovery Test). Looking
forward to reading the post-mortem report!

------
notreadbyhumans
It's a bit nuts that they're hosting the status pages on the same
infrastructure.

------
albumedia
GAE has been very good since I started using it. The entire internet seems to
be slow...even HN

------
jeremi23
Now even the Google AppEngine status page is down.

------
vanwilder77
The Google App Engine page is down as well!

------
peterwwillis
Here's a zen koan for you:

If every website on the internet is hosted in the cloud, and the cloud goes
down, is there an internet?

------
cfontes
things are starting to run again...

My site is back up :D SLOW but up.

------
ams6110
passpack.com seems to be affected.

------
singingwolfboy
www.howfuckedisappengine.com

