This is the reason I abandoned AE, and part of why adopting a platform that isn't standardized is incredibly dangerous. The problem is that technical debt constantly accrues even when you aren't making changes.
Even though the API was unchanged, HRD differs subtly enough that breakage can occur in any non-trivial project. Edge cases (how indices behave within transactions comes to mind, but there are plenty more examples) take on new semantics compared to M/S, so this "upgrade" involves not only thorough testing and auditing but likely also code changes and potentially significant engineering hours.
http://goo.gl/HVuaC: These techniques are not needed with the (now deprecated) Master/Slave Datastore, which always returns strongly consistent results for all queries.
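To make that semantic gap concrete, here's a toy sketch (plain Python, not the real Datastore API; all names here are made up) of why code that relied on M/S's always-strong queries can silently misbehave on HRD, where global queries are only eventually consistent while gets by key and ancestor queries remain strong:

```python
# Hypothetical toy model of the HRD consistency split: writes are
# immediately visible to strong reads (get-by-key / ancestor query),
# but the global query index lags behind by a replication delay.

class ToyDatastore:
    def __init__(self, index_lag=1.0):
        self.entities = {}   # key -> value, visible immediately
        self.index = {}      # key -> (value, visible_at), lags behind
        self.index_lag = index_lag

    def put(self, key, value, now):
        self.entities[key] = value
        # The global index only catches up after a replication delay.
        self.index[key] = (value, now + self.index_lag)

    def get(self, key):
        # Strongly consistent read, like a get-by-key or ancestor query.
        return self.entities.get(key)

    def query(self, now):
        # Eventually consistent read, like a global (non-ancestor) query.
        return {k: v for k, (v, visible_at) in self.index.items()
                if visible_at <= now}

ds = ToyDatastore(index_lag=1.0)
ds.put("greeting", "hello", now=0.0)

assert ds.get("greeting") == "hello"             # strong read sees the write
assert "greeting" not in ds.query(now=0.5)       # global query doesn't, yet
assert ds.query(now=2.0)["greeting"] == "hello"  # eventually it does
```

A read-modify-write that was safe under M/S breaks on HRD exactly in that middle window, which is why the migration needs auditing and not just a data copy.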
This means a project written and signed off circa 2011 incurs mandatory engineering costs just to keep running in a functioning and supported fashion. An AE app will never quite resemble that ancient perl5 behemoth running uninterrupted since 1997, because the underlying implementation and recommended APIs are constantly modified and replaced (Datastore, NDB, Python major version).
"A strong test suite will save your soul!" I hear you say, tests that a small project might have survived without if targeting any other platform, and testing on AppEngine is also yet another moving target (for example, testing nested subrequests was all but impossible using the SDK until relatively recently).
The promise was a carefree life for a project willing to code against their proprietary APIs; the reality is a constantly moving target, "not quite free" autoscaling and the threat that while you're asleep an unannounced change will take down your app (I could name a few, but as many will attest this has happened regularly since launch).
Yeah, I got sucked in with the same promise and I had the exact same sour experience. Including panicky calls from the client when the app suddenly stopped working. The maintenance windows used to plop right in the middle of my client's busy time, once a month at least and often more.
The worst part is the apologists, like email@example.com in the original bug report:
> I got here from HackerNews, but after seeing the original poster spam the forums in multiple places and have a bad attitude, I can't blame Google for not fixing what looks to me like a non-issue.
> Fuck 'em.
They're the ones who always reply to your request for help while you're trying to fix a suddenly dead application and placate a totally screwed client.
I'm not saying that App Engine is a panacea, but regardless of how you write your code and what technologies you use, there'll be some sort of mandatory maintenance and system administration that you have to do every so often.
Maybe it'll go the way of Wave, or perhaps the technology style itself will simply be supplanted by something newer and better. Regardless, I'd say that the 1997 perl5 app (MySQL 3.2 and perl5.003 were already circulating) still has much better supportability prospects over this time frame than App Engine ever did or ever will.
Then, upcoming platform changes are released to sandbox environments six months or so before they go live, so you can see whether your tests still pass and have time to keep up with things. You do have to keep up, though.
Maybe I've been spoiled by using Windows for 20 years but I feel that we should be able to expect better from vendors than this. That goes double if you're paying them, though it sounds in Salesforce's case as if they pay you, because I can't see any other way in which this arrangement would make sense.
I think they do. They take that pretty seriously; breaking changes to the API are rare and usually confined to obscure edge cases.
Funny (?) thing is, as an engineer at Google, stuff like that happened to me ALL THE TIME. I don't even want to think about how much of my time was spent simply migrating to the "latest greatest" replacement for some critical service that was being deprecated.
Well, imagine now that you have a directory or lock service where you can store things and perform atomic updates. When you do a write to something in it, it fans out to all of its clients, and they all wake up (nearly) simultaneously and receive the update. They then have to do whatever processing you do with new data of that type.
If they all do this at the same time, then you have no processes left to service incoming requests. They're all identically busy with whatever mutexes held in order to apply those config changes safely, so no other work happens on those clients while they load in the new data.
It's not so much that it's taking a mutex and is getting stuck for a little bit, since that's going to happen no matter what. It's that all of the children do it at the same time, so there's nobody to service your hit, and you're guaranteed to get stuck. If it was spread out, then only some percentage of incoming requests would get stuck behind this. The others would get lucky and would hit another instance which either had already run it or hadn't yet run it.
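The usual mitigation for this kind of fan-out stampede is to add random jitter before each client applies the update, so the reload work is spread over a window instead of landing on every instance at once. A minimal sketch (hypothetical numbers, not any real system's internals):

```python
import random

# Toy demonstration: 100 clients each spend 1 second applying a config
# update. Without jitter they all start the instant the update fans out,
# so all 100 are busy simultaneously. With a random delay drawn from a
# 30-second window, only a handful are ever busy at the same moment.

random.seed(42)
NUM_CLIENTS = 100
APPLY_COST = 1.0        # seconds a client is busy applying the update
JITTER_WINDOW = 30.0    # seconds over which to spread the work

def apply_times(jitter_window):
    """Each client's (start, end) busy interval after the update lands."""
    times = []
    for _ in range(NUM_CLIENTS):
        start = random.uniform(0, jitter_window)
        times.append((start, start + APPLY_COST))
    return times

def max_concurrent_busy(intervals):
    """Worst-case number of clients busy at the same instant."""
    events = [(s, 1) for s, _ in intervals] + [(e, -1) for _, e in intervals]
    busy = peak = 0
    for _, delta in sorted(events):
        busy += delta
        peak = max(peak, busy)
    return peak

no_jitter = max_concurrent_busy([(0.0, APPLY_COST)] * NUM_CLIENTS)
with_jitter = max_concurrent_busy(apply_times(JITTER_WINDOW))

assert no_jitter == NUM_CLIENTS      # every instance stalls at once
assert with_jitter < NUM_CLIENTS // 2  # jitter keeps most instances free
```

With the jittered schedule, an incoming request only gets stuck if it happens to land on one of the few instances mid-reload, instead of being guaranteed to stall.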
I'm not saying this is what's going on here, but it sure sounds familiar.
The thundering herd problem applied to waking up child processes is one possible explanation, but there are dozens of other explanations that are just as likely, based on the information we're provided with.
Again, I don't know if this is what happened here. I've just seen this sort of thing before.
My first instinct when I see a report like this is: he probably has some cronjob running he forgot about; perhaps one whose performance decreased with O(n^2).
By which I'm not saying that Google is right in not replying for days, but by which I am saying that as a customer, there are easy ways to get attention beyond shouting and threatening. Show it's an interesting problem and you're bound to get some techie's attention.
This is a daily outage that affects all our master/slave appengine applications. We know these applications are 'deprecated', but we're still paying significant money for the service and therefore hadn't expected 'deprecated' to mean 'won't be fixed when there are problems'.
Migration to HRD is not trivial even with the tool provided by Google. HRD has a different consistency model, blob keys and associated image serving URLs will change by migrating, and last time we checked any deletes that happen during migration (which can take days) will not make it into the migrated app.
So it doesn't surprise me to read about weird performance degradations. GAE has suffered from such problems for years.
Maybe they don't care about small customers and would love to see them move to Heroku or to good old virtual servers. It would be polite to say so upfront, though.
It was obvious when I first tried it that GAE has crappy support.
In this case, the GAE feature that underlies this issue is the Master/Slave (MS) datastore. It's been deprecated for ages in favour of the High-Replication Datastore (HRD).
Or you start freaking out. And I don't think that's entirely irrational.
This is, among other things, why the 'post mortem' has become somewhat popular -- because it allows us to judge "Yeah, those guys DO know what they're doing, they're on top of things, the chances of outages are getting constantly smaller, not larger."
Has Google ever published such a "post-mortem" after an outage? Has Google ever even admitted there was an outage publicly?
But also, yeah, rational or not, people like to have someone to talk to. In customer service in general, there are many studies showing that customer satisfaction will be higher when customers are treated 'nicely' _without a solution_ than when they are treated brusquely but their problem is solved. This is not actually rational, and I'm not saying I'd like vendors to strive towards that model -- but it is apparently human psychology that vendors may want to take account of.
On the other hand, Google seems to be doing pretty fine how it is going. Although I don't know how GAE is doing, really, compared to competitors.
How Google does it, though, is basically no support at all, right? It's beyond 'good support' or 'bad support' -- with the possible exception of AdWords, is there any Google product where you can ever talk to a human about any support issue at all? For email that might be fine, especially when the email product is pretty darn reliable. For enterprise critical software... it would sure make me nervous.
Insanity #2: I need somebody to talk to when a service interruption occurs
You hear about an earthquake in California, you call your aunt to make sure she is ok.
You are getting bad weather in the area you live, your mom calls and checks on you.
The server you use disappears off the internet and your providers status page hasn't been updated for a week, you '...'?
When something goes wrong, it's not an event that affects everybody (even if it is), it's an event that affects you. As long as humans are still involved in the purchasing and managing of servers, you'll always need someone to call and yell at/be soothed by.
If it was one person experiencing the problem, you would be right. But it's a number of people.
As other commenters note: they DID provide a minimal test case that showed the problem wasn't their side.
Unfortunately with GAE support screaming and yelling is pretty much the only recourse.
I had a Nexus 7 go AWOL at Christmas and I've never had such a shambolic customer service experience.
They have absolutely no respect or customer service ethos when it comes to people who are actually paying them real cash money.
Not in a million years would I sign off on hosting a production project on App Engine.
I had a support guy tell me I had to get Apple's legal team to contact Google so I could use the "Mac" trademark, because I happened to be selling a piece of software that ran on OS X. Like that's ever going to happen. My ad simply said "Try ____, a better way to _____ on Windows and Mac.", linking them to http://www.apple.com/legal/trademark/guidelinesfor3rdparties... didn't quite cut it, apparently, even though it clearly states that such use is acceptable under "2. Compatibility" near the top of the page. i.e. I can say my product runs on Mac if in fact it runs on Mac.
Approving a ten word ad takes Google over a week, in my experience. Baffling.
All this with their adwords $100 free trial. All that trial did was convince me that I should never ever in the life of the universe commit any money to Google, because they made it starkly apparent that I would never get what I paid for... running honest ads for honest products in a reasonable timeframe. I went with other ad networks in the end and had zero trouble whatsoever, and infinitely faster approval times. I suppose I may have had a smaller audience, but the headaches Google causes aren't worth the extra money.
Someone is going to come along and pull the rug out from under Google eventually. You can't rest on your laurels forever.
Khan Academy may have it easier; I'm sure Google won't let them down.
Really, go somewhere else, spend less money and have better support.
(At the expense of, if you're lucky, Google will give you almost zero headaches)
But you can bet that if Khan Academy has a problem it will be looked into with extra attention.
Also, Khan academy receives funding from Google ( http://en.wikipedia.org/wiki/Khan_academy ), so I don't think it boils down to only having a Premier account.
Whether you believe it or not, a Premier account is what you need if you want support. You could argue that $500/mo is too expensive, but it is what it is.
What I don't believe is that the support provided by Google is good or sufficient. Based on experiences with paying Google Apps, I'd say it's not.
Either you click "Start Free Trial" or "Contact Sales". The former is automatic registration; the latter involves human interaction.
I guess Google Apps illustrates this the best. IMHO it's somehow the glue that keeps your stuff together when you use Google as hoster.
1. Is GAE outside of their .9995 SLA* uptime? If not, then it probably isn't important enough to spend time looking into. Customers cannot expect better than the agreed-upon uptime percentage, and hosting companies are obligated to reimburse customers if they fall below the SLA. Both of these points are covered in the SLA doc.
2. Is it reproducible? So far, the bug report mentions 2 people out of all GAE users. Are 2 people enough to say it's a problem with GAE? One person is panicked, and the other provides few details for the bug report.
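For context on point 1, the arithmetic on what a 99.95% monthly uptime SLA actually permits is simple (assuming a 30-day month; this is my own back-of-the-envelope figure, not Google's published number), and a daily outage sustained for ten days would likely blow through it:

```python
# Allowed downtime per month under a 99.95% uptime SLA.
MINUTES_PER_MONTH = 30 * 24 * 60            # 43,200 minutes in a 30-day month
allowed_downtime = MINUTES_PER_MONTH * (1 - 0.9995)

print(round(allowed_downtime, 1))  # → 21.6 minutes of downtime per month
```

So ten consecutive daily outages need only average a bit over two minutes each to exceed the budget.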
There is no SLA for M/S applications (which run on completely different infrastructure). GAE has only ever offered a SLA for HRD applications.
It looks like the OP of the bug report is using a deprecated feature which, according to the Project Member, is causing latency issues at a specific time daily. But that can't be the whole story, since another commenter who is using the new HRD is having the same problem. It's frustrating even for people who are just reading this. All it demonstrates is the lack of communication from Google when something goes awry. Come on Google, stop reinforcing my stereotypes about your customer support!
Selling to a consumer is different from selling to a business: you may have a great product at a great price, but if you offer terrible CS, everyone in the B2B world is going to avoid you. It's a space where support is valued more than the product itself.
Therefore, unless you start offering decent CS, you can lower your price all you want; I'll be sticking with AWS.
HRD is High Replication Datastore: https://developers.google.com/appengine/docs/adminconsole/mi...
M/S is deprecated, and HRD is the new hotness (and it conveniently costs more).
I can't say I think much of Google's response here. Nearly two weeks before the first comment, then shut down after 2 days with a question directed at who knows whom, and no explanation?
The analysis elsewhere on here suggests they're violating SLA, so this should get more attention. I'm guessing support is under-resourced @ google, and the culture of support is a bit shabby (no acknowledgement of inconvenience or indication or evidence of work undertaken in the background) - hardly surprising for a large-scale software business based on free services.
I would recommend trying your apps on OpenShift or OpenStack, which don't have the vendor lock-in you face right now.
"Having control over" something is a scale, it's not binary.
Customer support from Google has always been like this, as far as I've experienced and heard. There is no way to actually reach and converse with anyone, regardless of whether you are paying them for the service or what kind of request it is.
Once a Google employee randomly replied to a complaint of mine about Google+ (I didn't even +mention them). After a few comments and him confirming that it was added to the bugs list, I asked if it was okay to +mention him in the future with similar issues. It was okay. I did. He never showed his face again. (His profile still says "Works at Google+".)
Another Google employee I know online also never replies to anything concerning Google. I know he works on the Google+ project, but I can only hope he passes on any bugs I +mentioned him in.
For YouTube, you can post in their forums but can merely hope for a reply. Copyright complaint disputes are no priority, either.
I haven't used many paid products, but I have read about their customer support being one of the very worst and also have never been able to find a single e-mail address or phone number to get support at for any service.
Edit: By the way, I would have moved away from Google App Engine a long time ago if my app went down every morning during rush hour for 10 days straight.
Is it really helpful for the public to comment on my support request? Seems like the signal to noise ratio would be quite low, and then you get inane comments like:
> I got here from HackerNews, but after seeing the original poster spam the forums in multiple places and have a bad attitude, I can't blame Google for not fixing what looks to me like a non-issue.
You have to believe that the choice of tools has some bearing on the quality of the response from Google. Seems like there is very little incentive for any "Project members" to trawl through open bug reports when no one is ever responsible.
"M/S is deprecated and there is a clear and straightforward path to migrating to HRD."
M/S was deprecated on April 4, 2012, so the notice has been out there for some time, and the High Replication Datastore has been available for over 2 years now. Whether less than a year is too short a deprecation period is another issue.
You can host your own applications using AppScale or Typhoonae.
Moving applications is easy. Moving data is much worse.
As noted the only attempt at diagnosis is completely wrong (even the reporter is not on MS) and very late.
A few people (who are acting obnoxious as hell) report a problem that can be solved by moving away from a deprecated system, yet they fail to even read the notice because they're busy smashing exclamation marks into the issue tracker.
When your datastore gets deprecated, you act sooner rather than later.
For example, this guy was a paying Google customer but couldn't get help http://www.sultansolutions.com/google-voice-lost-number/
These are paying customers who are paying a non-trivial amount of money for support (though not the "Premium" support in this case, which is an extra $500 per month for GAE).
In this case, Google's response is seemingly "It might (or might not) be a system-wide issue, but we don't care - we won't fix it".
There's no indication in this case that someone paying $500 a month for the premium support would get a better answer.
Same issue from last year that took weeks to be resolved (check last comments!):
Some Pros and Cons of Google App Engine in this blog-post:
I have had some issues with Google Docs (paid for premier commercial account). Some documents we had stored simply vanished from our account.
After getting the runaround for 3-4 days, a Google engineer finally told us they couldn't help us recover the documents THEY 'lost' unless we had the URL to the document ...
Thankfully someone on our team had kept the URL when I first shared that document with them (1+ year after the document had been created).
"sentimentally is a tool that determines sentiment of your emails. Once determined, it helps you gauge your relationships with co-workers, customers, friends, or other individuals based on the tone of your conversations with these people."
Can anyone explain why this is not possible for them?
Wonderful Google support apart, there are a lot of alternatives out there.
Maybe this is the wrong forum, but are there any infrastructure templates for setting up a scalable web/db/loadbalancer/memcached stack for a simple traditional web service, in my case a game?
I want to be able to sleep at night, and easily scale up by adding some more machines in case of higher load.
I could use denormalized MySQL/Postgres or MongoDB for speed. Preferred language is Python (or maybe C# or Java).
Response from us was initially muted because it looked like it only affected M/S apps, but it turns out (a) it can impact HRD as well, and (b) we're pretty unhappy about the level of impact for many M/S apps, so we're looking at ways to resolve it. It's a high priority and we're looking at a number of ways to address it. It's also a pretty interesting issue, because indirectly it's caused by (a) the large scale at which App Engine is running, and (b) the large extent to which GAE is running free applications.
Regardless, apologies to those who felt support was unresponsive. We are working very hard to improve support. For the sophisticated audience that comes to these pages, please link to me on Google+ to get my attention if we are failing you (https://plus.sandbox.google.com/110401818717224273095).