We left AWS about 18 months ago after one of the outages and switched to GAE. I've counted 3-4 big downtimes for AWS compared to this one on GAE. That's still a good decision (for now)....
The correct lesson to draw is that any one point of infrastructure is a risk, so you need to scale wide. This is possible to do with AWS regions or other providers - even internal bare iron if you're so inclined - but impossible with GAE, because you're committed to a single-vendor API as well as their infrastructure.
The real question is: Can you and your ops team build a "scale wide" system better than Google? How much effort are you willing to put into it, when those development resources could be put into making features instead?
 For apps which use the high-replication datastore. Old (deprecated) master/slave apps are served out of a single datacenter.
 This is a reasonably serious question to ask. In GAE, you're one app among many, so a you-specific scaling solution might be easier and more robust than the generalized one Google builds. But not necessarily.
Looking at things more closely, the only thing that you're truly locked into with GAE is the esoteric nature of the datastore. This isn't any worse than picking say MySQL vs. Oracle or Riak vs. Mongo. Most applications end up depending on some sort of specific functionality of each database that they are written for. While it would be difficult to migrate to another solution for storing your data, it wouldn't be impossible.
There is no way to predict what direction any product might take in the future. Look at the way that Oracle is treating MySQL now. Tons of vendor lock-in there.
The only reason to migrate away from GAE would be if you find out that your application doesn't work well on it (pricing, scalability, etc) or if Google decides to kill GAE entirely. Hopefully you do the analysis of your application before you decide to use GAE (ie: you can't blame GAE for you deciding to use it) and with a 3 year deprecation promise, I'm pretty confident that it will be around for a while longer.
This seems too simplistic. Off the top of my head, you may leave because:
* you want to do something new in your app that is not possible/effective to do within GAE
* you find another option that is cheaper
* you find that some of the assumptions and architectural decisions you took on day 0 no longer hold
* you did the initial analysis, and it was wrong
I do agree with GAE being awesome and lock-in not being so bad, but I doubt the "you have it all figured out before building it so it's going to work forever" idea.
GAE is not "all or nothing". You can still run exotic services in other hosts. Or, for that matter, use GAE for specific services in your "other cloud" app. You get two bills. Not much of a downside.
Sure it is. If you pick the wrong DB, you're only locked into that DB. If you pick GAE, you're locked into GAE's DB... and GAE. I can move my MySQL DB to another cloud provider.
With the datastore on GAE, you can get your data out of it and move it into something like MongoDB. I'd argue that it would probably be less code to move to MongoDB because the datastore has all sorts of esoteric issues that you have to code around (like the way that entity groups and transactions are handled).
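The export side is the painful part; once you have the entities dumped out somewhere (say newline-delimited JSON from the bulk downloader or whatever export path you prefer), loading them into MongoDB is a few lines. A rough sketch with pymongo - the file, database, collection, and field names here are made up for illustration:

    # Sketch: load a JSON export of datastore entities into MongoDB.
    # Assumes each line of entities.json is one entity serialized as a dict,
    # with its old datastore key stored under "key" -- adjust to your export format.
    import json
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    collection = client["myapp"]["entities"]  # hypothetical database/collection names

    with open("entities.json") as f:
        for line in f:
            entity = json.loads(line)
            # Reuse the old datastore key string as the Mongo _id so references survive.
            entity["_id"] = entity.pop("key")
            collection.replace_one({"_id": entity["_id"]}, entity, upsert=True)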
In terms of the rest of your application, it is just a standard webapp in whatever platform you choose. The .war file I have for GAE will run just fine in Tomcat. The only real lockin is the way you store your data.
Not quite. You can roll out your own App Engine platform with AppScale:
Sadly, TyphoonAE (http://code.google.com/p/typhoonae/) seems abandoned.
Of course master/slave even has scheduled downtime.
Wonder what's happening.
It seems like large portions of the internet are down.
Their sample size is extremely small, and most of those are permanently down.
Have a look through their list of North American routers and try to find one where packet loss has gotten worse, as their main overall packet-loss graph would suggest - I've just been through them all and couldn't find one.
But their relative metrics can still be useful.
For example, it's very inaccurate to say that 51% of the internet is down.
But it's precise to say that packet loss among the working nodes has increased about 30% in the last 24 hours, and sharply.
That said, it's still interesting how the overall traffic trends so sharply downwards. I wonder if they have more data than they are showing in the graphs.
I noticed a couple of days ago that some of our DNS entries were mysteriously removed from Level 3 servers, which out of old habit are used for resolution (some of the IPs go back to UUNET/WorldCom/MCI).
Now the interesting bit is they were for private-subnet IPs. They're working fine everywhere else.
Today the last of their DNS servers removed the entries, so I had users switch to Google's public DNS (8.8.8.8), and all's well with our apps.
Level 3's entries for our external stuff are there; just the private-subnet stuff is removed.
If others do this too and resolve with Level 3....
edit: just found this: http://tracker.outages.org/reports/view/59
And according to them, all routes are up and running just fine; not only that, but ping times aren't elevated.
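If anyone wants to check what different resolvers are handing back for a name, here's a quick sketch using dnspython; the hostname and resolver IPs are just placeholders:

    # Sketch: compare what different resolvers return for the same name.
    import dns.resolver  # pip install dnspython (the resolve() call is the newer API)

    NAME = "internal.example.com"       # placeholder hostname
    RESOLVERS = {"level3": "4.2.2.2",   # a classic Level 3 resolver
                 "google": "8.8.8.8"}   # Google public DNS

    for label, server in RESOLVERS.items():
        r = dns.resolver.Resolver(configure=False)
        r.nameservers = [server]
        try:
            answers = r.resolve(NAME, "A")
            print(label, [a.address for a in answers])
        except Exception as exc:
            print(label, "lookup failed:", exc)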
51% of a very small self-selecting set of machines are down, mostly small businesses or home users which have obviously been shut down or renamed.
If internettrafficreport automatically removed all devices which returned no results for 7 days, they'd be a slightly more useful resource.
What appears to be happening here is that you're still vulnerable because the C&C infrastructure ultimately has a single command source, and so can be vulnerable if some code is pushed that affects the whole system. Your homegrown cloud will suffer from the same vulnerability, it may just be more or less easy to manage depending on how specialized your needs are compared to the more general requirements of GAE.
Edit: actually, maybe I missed what you're advocating with "local direct connections". This might make more sense from a user's perspective: if everyone ran their own little cloud, a failure may bring down reddit, but not reddit, heroku, pinterest, etc simultaneously. That's actually an interesting point, but I'm not sure if it really matters if they sync up their downtimes (since they would still have downtime, and maybe more or less of it depending on how much they could afford to invest into managing their distributed solution). I'm also not sure if that really solves the problem, since there are other concentrations in the network, just less visible ones (there are a fairly small number of major datacenters around the world, for instance, and managing your own colocated server doesn't matter if the whole building goes dark).
I do agree that at the very least we need to maintain an ecosystem of "cloud" providers, however.
Can you specifically give me examples of why using a cloud provider is better for a startup than, for example, using a couple desktops in your garage?
You can't say it's because of backups because the cloud doesn't provide a backup (unless you purchase an extra data backup solution with your cloud provider?). And correct me if I'm wrong, but you still have to set up your development environment on your local computer to write the code, install libraries to test with, etc.
What exactly are the steps involved in "deploying" that you couldn't do on your laptop, or a VPS?
For a product prototype, the initial primary goal is "get it online so that we can start validating our assumptions". System administration skills and in-house server administration teams are valuable but not necessary.
Even given that I have a good bit of sysadmin skills, I am needed more as a software developer right now in the early goings. I expect, as you've pointed out, that priorities will change with time and growth. We may even move to bare metal eventually, if we find ourselves needing and able to do so.
> You want a rapid development platform
No, we want low-maintenance infrastructure.
> that doubles as a production system
It is a production system. It does successfully serve many thousands of users for us every day. We've yet to have an outage that wasn't our own fault.
> and costs nothing to maintain
What? I specifically said we're willing to pay more not to have to spend as much time on infrastructure.
It's OK if you're too set in your ways to even attempt to level with alternative points of view, but at least try to read a little more thoroughly. And maybe admit that you're not willing to budge, so nobody wastes time trying to explain an alternative point of view.
For us (a small three-man team), the full portfolio of AWS services lets us shove responsibility for some of our infrastructure off to Amazon, which saves us loads of time (EC2, ELB, S3, CloudFront, SQS, Route53, SES, and some light DynamoDB in our case). Even with their repeat EBS issues, we've engineered around the common failure points, and do so with a tiny number of self-managed VMs relative to our traffic. Even if we were to go down every few months, our hosted services do better than a one-man devops team could ever do on his own in his garage, or even in a co-lo. Though AWS is not inexpensive, we gladly pay up in order to focus on our own software. Sure beats hiring another person, in our case.
The common counter-argument is that if we managed it ourselves, we'd have the ability to resolve outages on our own, because it'd be our responsibility. That is just the thing: we don't want it to be our responsibility. We've got one ops guy (me), and we need to be iterating on our product fast at this point. We'll instead plan for failure, design our architecture to continue operating under some degree of failure, and keep shipping.
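By "plan for failure" I mostly mean boring stuff: wrap every external dependency in retries with backoff and have a degraded fallback ready. A minimal sketch of the pattern - fetch_profile and the fallback value here are made-up stand-ins, not anything from our actual codebase:

    # Sketch: retry a flaky dependency with jittered exponential backoff, then degrade.
    import random
    import time

    def fetch_profile(user_id):
        # Stand-in for any flaky external call (datastore, HTTP API, etc.).
        if random.random() < 0.5:
            raise RuntimeError("transient failure")
        return {"user_id": user_id, "name": "someone"}

    def with_retries(fn, attempts=4, base_delay=0.5):
        """Call fn(); on failure, sleep with jittered exponential backoff and retry."""
        for attempt in range(attempts):
            try:
                return fn()
            except Exception:
                if attempt == attempts - 1:
                    raise
                time.sleep(base_delay * (2 ** attempt) + random.random() * 0.1)

    def get_profile(user_id):
        try:
            return with_retries(lambda: fetch_profile(user_id))
        except Exception:
            # Degrade instead of failing the whole request.
            return {"user_id": user_id, "name": None}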
Not to say that the "cloud" is a silver bullet. It's not. However, especially from the developer point of view, it lets our entire (tiny) team stay more focused on product development.
Addendum: I've managed servers in my basement, in co-lo, and in the "cloud". Each of these routes is more (or less) appropriate in various cases, but AWS has been a boon to our particular usage case.
Obviously this doesn't happen because it's hard, but also because companies have a vested interest in piping all sorts of data through them for analytics purposes. This is not in the interest of the users at all.
SPOF, security, and control are the major issues with IaaS and PaaS offerings.
-- Max Ross (Google) email@example.com via googlegroups.com
At this point, we have stabilized service to App Engine applications. App Engine is now successfully serving at our normal daily traffic level, and we are closely monitoring the situation and working to prevent recurrence of this incident.
This morning around 7:30AM US/Pacific time, a large percentage of App Engine’s load balancing infrastructure began failing. As the system recovered, individual jobs became overloaded with backed-up traffic, resulting in cascading failures. Affected applications experienced increased latencies and error rates. Once we confirmed this cycle, we temporarily shut down all traffic and then slowly ramped it back up to avoid overloading the load balancing infrastructure as it recovered. This restored normal serving behavior for all applications.
We’ll be posting a more detailed analysis of this incident once we have fully investigated and analyzed the root cause.
Christina Ilvento on behalf of the Google App Engine Team
Gotta give 'em props for dogfooding.
That's what I _expect_ them to do - otherwise I can't see anyone trusting/using them if even they themselves avoid their own product(s)...
We are posting regular updates to our downtime-notify list here: https://groups.google.com/forum/?fromgroups=#!topic/google-a...
Christina, Google App Engine Product Manager
Pingdom reports my GAE-hosted site has been down since 2012-10-26 10:37:38 EST, a bit over an hour now.
UPDATE: My site is back. Delayed report from Pingdom says site came back online after 50 minutes. Performance is sketchy still. We're probably not in the clear yet.
At least we can now get to the status dash:
Nothing on their Twitter account either: https://twitter.com/app_engine
Poor handling of a systems failure, in my opinion.
They'll email you when issues occur and info becomes available.
It took about 30 minutes after the crash for me to receive an email, which seems very reasonable.
"At approximately 7:30am Pacific time this morning, Google began experiencing slow performance and dropped connections from one of the components of App Engine. The symptoms that service users would experience include slow response and an inability to connect to services. We currently show that a majority of App Engine users and services are affected. Google engineering teams are investigating a number of options for restoring service as quickly as possible, and we will provide another update as information changes, or within 60 minutes."
I think people not trusting the cloud is similar to how people feel safer driving their cars than taking a plane. The stats say the plane's safer, but people prefer being in control. People like the idea of being in control of their servers, even if that means there are hundreds of extra things that can go wrong compared to a cloud provider.
We also get a lot more publicity when a cloud provider has an outage, as LOTS of sites go down at once. Hardly anyone notices when service X, which self-hosts, goes down for a few hours...
Cross-site links being cut due to engineering works.
Overheating due to air-conditioning failures.
The only real manual maintenance that we've got is a rolling reimaging of servers based on whatever's in version control, which usually takes a few hours twice a year, but we'd probably do that if we were in the cloud anyway.
When you can script away 90% of your system administration tasks, hosting in the cloud doesn't really make a ton of sense.
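To give a flavor of what "script it away" means in practice, most of it boils down to loops like the sketch below, run from cron or by hand; the hostnames and commands are placeholders, and in reality it's a pile of shell scripts plus config management rather than one file:

    # Sketch: run the same health check / repair step across a pool of hosts over SSH.
    import subprocess

    HOSTS = ["web01.example.com", "web02.example.com"]  # placeholder hostnames
    CHECK = "systemctl is-active nginx"                 # placeholder health check
    REPAIR = "sudo systemctl restart nginx"             # placeholder repair step

    for host in HOSTS:
        check = subprocess.run(["ssh", host, CHECK], capture_output=True, text=True)
        if check.returncode != 0:
            print(host, "unhealthy, attempting restart")
            subprocess.run(["ssh", host, REPAIR])
        else:
            print(host, "ok")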
I'd take an hour long Appengine outage once a year over that anytime!
Any time you have an outage you need to contact your service provider to get an estimate of downtime. If they can't give you one, assume it'll take forever and cut the DNS over. The worst case is some of your users will start to come back online slowly. If you don't cut over, the worst case is all your users are down until whenever the service provider fixes it, and you get to tell your users "we're waiting for someone else to deal with it", which won't make them very happy.
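The cutover itself is worth scripting ahead of time so it's one command when you're under pressure. A rough sketch against Route 53 with boto3 - any DNS provider with an API works just as well; the zone ID, hostname, and standby IP below are placeholders:

    # Sketch: point www at the standby IP with a short TTL during an outage.
    import boto3

    route53 = boto3.client("route53")
    route53.change_resource_record_sets(
        HostedZoneId="Z_EXAMPLE",          # placeholder hosted-zone ID
        ChangeBatch={
            "Comment": "fail over to standby during provider outage",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "www.example.com.",
                    "Type": "A",
                    "TTL": 60,             # keep the TTL short so you can switch back
                    "ResourceRecords": [{"Value": "192.0.2.10"}],  # standby box's IP
                },
            }],
        },
    )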
12 hours of stale data sounds kind of long to me; 4 hours sounds more reasonable.
How big is your ops team? I'm guessing it's more than 0.
You have to trust your cloud provider. They control everything you do. If their security isn't bulletproof, you're screwed. If their SAN's firmware isn't upgraded properly to deal with some performance issue, you're screwed. If their developers fuck up the API and you can't modify your instances, you're screwed. You have to put complete faith in a secret infrastructure run for hundreds of thousands of clients so there's no customer relationship to speak of.
That's just the "trust" issue. Then there's the issue of actual redundancy. It's completely possible to have a network-wide outage for a cloud provider. There will be no redundancy, because their entire system is built to be in unison; one change affects everything.
Running it yourself means you know how secure it is, how robust the procedures are, and you can build in real redundancy and real disaster recovery. Do people build themselves bulletproof services like this? Usually not. But if you cared to, you could.
We run about 11 client clusters on ~250 servers across 3 data centers in the US and Europe. Each of our clients' uptime is very, very close to 100%, and we've NEVER lost everything, even for 1 second.
I don't feel like I'm lacking control; I feel like somebody else is taking care of that really annoying shit that happens all the time, no matter how well you design your system.
Also a good hosting company will handle identifying/fixing/replacing bad hardware for you.
"The cloud is great because I can blame someone else" is obviously a tenuous argument.
Does tumblr.com use app engine? They're down...
Tumblr is experiencing network problems following an issue with one of our uplink providers. We will return to full service shortly.
So going forward, what's the best way to protect against cloud downtime? Have a hot/standby failover with a different provider? Prepare customers' expectations for the possibility of server outages? Do a ton of research, pay $$$ for lots of nines uptime, and lambast the host when they don't deliver?
The greatest thing about cloud-hosting is that you can just sit by and let them fix it. It usually takes about half an hour, or a couple of hours if the outage is severe, but usually less than the time it takes for an update of DNS records (unless you've got some proxy in front of your IPs, which would be another point of failure).
And then, even with these severe outages, the overall monthly uptime is still better than 99.9%, and it's really hard to beat that, so just relax and let them fix it.
You need to decide how much uptime you're willing to pay for, how much your service can degrade for how long, and methodically address each level of the hierarchy between you and your customers – and it might be the case that you decide that the ongoing costs of your engineering support for e.g. wide geographic separation just aren't sustainable at the level your customers are willing to pay, particularly if you have something like a CDN helping keep your site partially responsive during less than catastrophic failures.
1) It's very complex and expensive
2) You're looking at DNS to hot failover, in most cases.
If GAE can recover in less than 30 minutes and sticks to, say, one outage a year, you just can't justify the kind of cost you're looking at with option 2 (seriously, it's a lot of cash).
If every website on the internet is hosted in the cloud, and the cloud goes down, is there an internet?
My site is back up :D SLOW but up.