
A piece of hard-earned advice: us-east-1 is the worst place to set up AWS services. You're signing up for the oldest hardware and the most frequent outages.

For legacy customers, it's hard to move regions, but in general, if you have the chance to choose a region other than us-east-1, do that. I had the chance to transition to us-west-2 about 18 months ago and in that time, there have been at least three us-east-1 outages that haven't affected me, counting today's S3 outage.

EDIT: ha, joke's on me. I'm starting to see S3 failures as they affect our CDN. Lovely :/

Reminds me of an old joke: Why do we host on AWS? Because if it goes down then our customers are so busy worried about themselves being down that they don't even notice that we're down!

Reminds me of an even older joke (from 80's or 90's):

Q: Why don't computers crash at the same time?

A: Because network connections are not fast enough.

(I think we are starting to get there)

These are both pretty good. Added to color fortune clone https://github.com/globalcitizen/taoup

I'm getting the same outage in us-west-2 right now.

The dashboard doesn't load, nor does content using the generic S3 url [1], but we're in us-west-2 and it works fine if you use the region specific URL [2]. In practice this means our site on S3/Cloudfront is unaffected.

[1]: https://s3.amazonaws.com/restocks.io/robots.txt

[2]: https://s3-us-west-2.amazonaws.com/restocks.io/robots.txt

Good catch. My bet is that because s3.amazonaws.com originally referred to the only region (us-east-1), the service that automatically resolves a bucket's region is itself hosted in us-east-1. I think AWS recommends putting the region in the URL for that reason, though that's easier said than done. I'd also bet a few of Amazon's own services use the short version internally and are having issues because of it.
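A minimal sketch of the difference between the two URL forms above. `s3_url` is a hypothetical helper, not part of any SDK; it just makes explicit which endpoint a request will hit:

```python
def s3_url(bucket, key, region=None):
    """Build an S3 URL for a bucket/key pair.

    With region=None this uses the legacy global endpoint
    (s3.amazonaws.com), which resolves the bucket's region via
    infrastructure in us-east-1. Passing a region pins the request
    to that region's own endpoint, bypassing the global lookup.
    """
    host = "s3.amazonaws.com" if region is None else f"s3-{region}.amazonaws.com"
    return f"https://{host}/{bucket}/{key}"

# Global endpoint (subject to us-east-1 issues):
print(s3_url("restocks.io", "robots.txt"))
# -> https://s3.amazonaws.com/restocks.io/robots.txt

# Region-pinned endpoint (keeps working when us-east-1 is down):
print(s3_url("restocks.io", "robots.txt", region="us-west-2"))
# -> https://s3-us-west-2.amazonaws.com/restocks.io/robots.txt
```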

Seeing it in eu-west-1 as well. Even the dashboard won't load. Shame on AWS for still reporting this as up; what use is a Personal Health Dashboard if it's to AWS's advantage not to report issues?

Now it's in the PHD, backdated to 11:37:00 UTC-6. How could it take an hour to even admit that an issue exists? We have alerts set on this, but they're useless when they arrive this late.

Same here, and it's 100% consistent, not 'increased error rates' but actually just fully down. I'd just stop working but I have a demo this afternoon... the downsides of serverless/cloud architectures, I guess.

Heh that "increased error rates" got a chuckle out of me, I guess 100% is technically an increase.

Well what if you'd hosted it on your hard drive and it crashed? It seems like the probability of either is similar nowadays.

The difference there is you can potentially do something about it, vs having to wait on an upstream provider to fix an issue for everybody.

"you can potentially do something about it" vs. "you have to do something about it"

Perspective is everything.

Grab a different machine, git clone your repo, good to go.

What are the odds of the server with your repo and your own hard drive crashing at the same time?

Strangely, your comment made me read this entire post about working out probabilities.. http://www.statisticshowto.com/how-to-find-the-probability-o...

Quite interesting really!

If we assume that the events are largely uncorrelated+ then we are multiplying the probabilities, and our chances of a wipe-out are far lower.

+I would suggest that any event taking down both my machine and GitHub's/Bitbucket's servers at once would be of such magnitude that I would no longer be worried about my project, being more focused on basic survival...
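The multiplication rule above can be sketched in a few lines. The failure rates here are made up purely for illustration:

```python
def joint_probability(*probs):
    """Probability that all independent events occur at once:
    the product of the individual probabilities."""
    result = 1.0
    for p in probs:
        result *= p
    return result

# Illustrative (made-up) daily failure rates:
laptop_down = 0.01   # 1% chance my hard drive dies today
github_down = 0.001  # 0.1% chance the repo host is down today

# Chance of losing both copies on the same day, assuming independence:
print(joint_probability(laptop_down, github_down))  # ≈ 1e-05, i.e. 0.001%
```

The independence assumption is exactly the "+" caveat: if one event (fire, regional grid failure) can take out both copies, the real joint probability is much higher than the product.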

Our services in us-west-2 have been up the whole time.

I think the problem is globally accessible APIs are impacted. As others have noted, if you can use region/AZ-specific hostnames to connect, you can get through to S3.

CloudFront is faithfully serving up our existing files even from buckets in US-East.

S3 bucket creation was down in us-west-2, because it relied on us-east-1 (I expect that dependency will get fixed after this), but all S3 operations should have continued to function in us-west-2, other than cross-region replication from us-east-1.

IIRC the console for S3 is global and not region specific even though buckets are.

Also, cross-region replication is a new-ish thing: https://aws.amazon.com/blogs/aws/new-cross-region-replicatio...
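For reference, this is roughly the shape of a cross-region replication configuration as passed to boto3's `put_bucket_replication`. Bucket names and the IAM role ARN are placeholders, and versioning must be enabled on both buckets for replication to work:

```python
# Sketch of an S3 cross-region replication configuration.
# All names/ARNs below are placeholders, not real resources.
replication_config = {
    "Role": "arn:aws:iam::123456789012:role/s3-replication-role",  # placeholder
    "Rules": [
        {
            "ID": "replicate-everything",
            "Prefix": "",          # empty prefix = replicate all objects
            "Status": "Enabled",
            "Destination": {
                # Replica bucket created in another region, e.g. us-west-2:
                "Bucket": "arn:aws:s3:::example-replica-bucket",
            },
        }
    ],
}

# With credentials configured, applying it would look like:
# import boto3
# boto3.client("s3").put_bucket_replication(
#     Bucket="example-source-bucket",
#     ReplicationConfiguration=replication_config,
# )
```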

Same outage in ca-central-1

I can confirm this as well.

Huh, I'm not seeing it on my us-west-2 services. Interesting.

My advice is: don't keep your eggs in one basket. AZs offer localised redundancy, but as cloud capacity is cheap and plentiful, you should be using two or more regions, at least, to house your solution (if it's important to you.)

EDIT: less arrogant. I need a coffee.

But now you're talking about added effort. Multi-AZ on AWS is easy and fairly automatic, multi-region (and multi-provider) not so much. It's easy to say things like this, but people who can do ops are not cheap and plentiful.

The only difficult aspect of multi-region use is data replication, which I can confirm is a (somewhat) difficult problem. This issue was with S3 which has an option to automatically replicate data from the bucket's region to another one. It's a check box. A simple bit of logic in the application and you can move between regions with ease.

Even data replication has options for this, too.

And I work in Ops.
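The "simple bit of logic in the application" might look something like this sketch. The function names are hypothetical, and `fetch` stands in for whatever regional read your app performs (e.g. an S3 GET against that region's endpoint):

```python
def read_with_failover(key, regions, fetch):
    """Try each region in order and return the first successful read.

    `fetch(region, key)` is whatever regional read the application
    performs; it's expected to raise on failure.
    """
    last_error = None
    for region in regions:
        try:
            return region, fetch(region, key)
        except Exception as err:
            last_error = err
    raise RuntimeError(f"all regions failed for {key!r}") from last_error

# Example with a fake fetch: the primary region is down, the replica answers.
def fake_fetch(region, key):
    if region == "us-east-1":
        raise ConnectionError("increased error rates")
    return b"robots.txt contents"

region, data = read_with_failover("robots.txt", ["us-east-1", "us-west-2"], fake_fetch)
print(region)  # us-west-2
```

With S3 replication enabled, the replica bucket lags the source slightly, so this only suits reads that tolerate eventual consistency.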

Well, you've explained how to do multi-region in S3. Now let's cover EC2, ELB, EBS, VPC, RDS, Lambda, ElastiCache, API Gateway, and all the other bits of AWS that make up my services. And then we can move on to failover application logic.

I picked out S3 as this issue is directly related to it, yet the solution is simple: turn on replication and have your application work with it (which is on the developers, not ops.)

EC2: why are you replicating EC2 instances or AMIs across regions? Why aren't you using build tools to automatically create AMIs for you out of your CI processes?

ELB: Eh? Why do I need ELBs to be multi-regional? I'm a little confused by this one, sorry.

EBS: My systems tend to be stateless, storing as much log, audit, or data in external systems such as RDS, DynamoDB, S3, etc. Storing things on the local system's storage is a bit risky, but if you have to there are disk replication solutions available. EFS comes to mind for making that easier. Backups also come to mind in the event of data loss.

VPC: Why does a VPC need to be cross regional? This one is also lost on me.

RDS: Replication is easy -- it's done for you. Convincing developers that their application needs to work with a backup endpoint is, at times, harder than the replication problem itself. More often than not, it's simply a case of switching to a read-only mode whilst you recover the write copy of your RDS instance, but this is the role of the developers, not ops.

Lambda, ElastiCache, API Gateway... all these things aren't arguments against my original point: architect correctly. Yes it involves more work (from the developers' perspective, mostly), but more often than not, in the event of a failure you're left head and shoulders above your nearest competition, soaking up the profits as a result.

Based on your responses, however, I think we can safely agree to disagree and move on.

Have a great day! I hope you weren't too badly affected by the S3 outage!

EDIT: typo.

>EC2: why are you replicating EC2 instances or AMIs across regions?

Exactly to avoid single region outages?

I think point was that you shouldn't replicate but just deploy to both.

Gamache's point is that making your production environment cross-regional means setting up all those things in another region and managing them as well. It's not a tickbox.

Our webservers were hit by this outage. In order to make these cross-regional, I'd need to set up VPCs properly, security groups, instances, datastores (several databases), so on and so forth. I don't store anything on the local disk, but I'm not going to run a server in Europe hitting my db servers in us-east-1. AWS doesn't offer all the databases we use. Cloudformation isn't trivial to use once you get past the tutorial examples either.

Basically, your comment is a version of "you're holding it wrong!"

The US is made up of several regions. You don't have to leave the country to go multi-region, you only need to go west or east from your current location in the US.

Some solutions present more difficulties than others, that's for sure. From the limited information you've given me, your solution is far from being a unique situation that poses many difficulties.

CloudFormation in YAML format is pretty easy. I recommend Terraform, however, which is much nicer again for this kind of stuff. It makes it rather "trivial" to get a multi-region solution in place.

As for the database replication: I highly doubt the solutions you're using don't offer replication, and if they don't, and they're not some very esoteric, highly specialised engines, then I would replace them with something that does.

It reads to me as though your primary point of contention is your databases. Not an easy problem to solve, I'll admit, but not impossible either.

Two different vendors if you can afford it. It's a bit of a hassle though.

I like to stick to one, but I have seen some success stories with an AWS/GCE mix :-)

HashiCorp's Terraform makes it a lot easier to go multi Cloud, and abstracting away configuration of the OS and applications/state with Ansible makes the whole process a lot easier too.

It shouldn't be technically possible to lose S3 in every region; how did Amazon screw this up so badly?

I believe the reports here are misleading: if you try to access your other regions through the default s3.amazonaws.com it apparently routes through us-east first (and fails), but you're "supposed to" always point directly at your chosen region.

Disclosure: I work on Google Cloud (and didn't test this, but some other comment makes that clear).

Amen. We set up our company cloud 2 years ago in us-west-2 and have never looked back. No outage to date.

If you have a piece of unvarnished wood handy...

Is us-east-2 (Ohio) any better (minus this aws-wide S3 issue)?

us-east-2 is brand new and us-east-1 is the oldest region. Any time there is an issue, it is almost always us-east-1. If possible, I would migrate out of us-east-1.

Probably valid, though in this case while us-west-1 is still serving my static websites, I can't push at all.

The s3 outage covered all regions.

Really? Even Australia? Can you provide evidence of this so I know for any clients that call me today? :)

EDIT: Found my answer. "Just to stress: this is one S3 region that has become inaccessible, yet web apps are tripping up and vanishing as their backend evaporates away." -- https://www.theregister.co.uk/2017/02/28/aws_is_awol_as_s3_g...

That's a really good point!
