[flagged] Hold off on AWS Aurora and Postgres until further notice
67 points by sudhirj on Dec 29, 2017 | 22 comments
We've had a sudden crash of the underlying storage (apparently) on one of our production workloads - and because the crash is in the underlying storage the MultiAZ setting makes no difference. Neither the primary nor the replica work.

The point-in-time restore can't complete a restore to the last available point, so this looks like data loss to me. Points earlier in time work fine, so it might be an issue with a recent patch or failure.

I'm currently in conversation with support, will update when I have more information.

The cloud is not immune from failure. Moving to the cloud just moves the line of responsibility for who manages a service, but it does not remove the risks.

Hardware and software fail. Amazon does a better job of mitigating the damage from failures than I do, but their stuff still fails.

I'm sorry for your data loss. I hope the recovery isn't too painful. Live and learn.

That’s a tough situation to be in. Hopefully Amazon is working on a solution for you.

Cloud is definitely not bullet-proof and the risks are always there. I think it generally still wins, however, because as cloud solutions get better, with better SLAs, companies can get better at what they specialize in and offer.

At least that is the goal and how I understand it.

We've been using Postgres on RDS for years and were considering moving to Aurora. I am very curious to know more about the fix and planned remedy.

Current suspects are GIN index problems, but nothing confirmed yet.

Per my remarks here, I think that there is an obscure bug in GIN vacuuming that the community has yet to identify:


It's possible that this upstream PostgreSQL issue affected you. What is known for sure is that there was a bugfix for GIN, that just missed the latest round of point releases. See:


This fix alone could have made all the difference here (though this is a guess). I assume you won't have this fix with Aurora just yet (not sure -- I don't work for Amazon). But, I'd mention this to AWS support if it seems at all relevant. Hopefully they'll get in touch with the community if that's what it is.

Good luck

Thanks, will pass this on.

Ugh, that's bad -- sorry to hear that. We GIN index several JSONB fields as well.

It's the Friday-before-a-holiday curse. I'm sorry for your loss.

Very interested in what Amazon's final position on this is. I assumed part of the value proposition was that they handled redundancy and recovery.

"because the crash is in the underlying storage the MultiAZ setting makes no difference."

@OP - I don't understand this line... can you explain more?

Aurora read replicas are just compute heads decoupled from the storage. They increase availability and read performance, but don't do anything for the reliability of the storage.


"Each 10GB chunk of your database volume is replicated six ways, across three Availability Zones."

So, if there has been a storage layer failure, most likely it was a software bug in Aurora itself.

I don't see anything on the status page. I know the status page lags an actual incident, but I'd imagine something would be up by now. Is this a widespread problem?

The data has been recovered, no post mortem yet.

That's lucky. Was your service down for the entire time until the data was recovered?

We moved to a new DB as a clean slate, took fresh orders, and then merged the old ones in as they became available. Was able to do that because all primary keys were uuids and the order history API was being served somewhere else.
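A minimal sketch of the merge described above, for anyone wondering why UUID primary keys make it safe. Everything here is an assumption for illustration: the `orders` table, its columns, and the `merge_recovered_orders` helper are hypothetical, and an in-memory SQLite database stands in for Postgres (the `ON CONFLICT ... DO NOTHING` clause exists in both). Because the IDs are generated independently and never collide, recovered rows can be inserted next to fresh ones, and replaying the same batch is a no-op.

```python
import sqlite3
import uuid


def merge_recovered_orders(conn, recovered_rows):
    """Insert recovered (id, item) rows, skipping ids that already exist.

    Returns the number of rows actually added.
    """
    count_sql = "SELECT COUNT(*) FROM orders"
    before = conn.execute(count_sql).fetchone()[0]
    conn.executemany(
        "INSERT INTO orders (id, item) VALUES (?, ?) "
        "ON CONFLICT (id) DO NOTHING",
        recovered_rows,
    )
    conn.commit()
    return conn.execute(count_sql).fetchone()[0] - before


# Hypothetical clean-slate database that is already taking fresh orders.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, item TEXT)")
conn.execute(
    "INSERT INTO orders (id, item) VALUES (?, ?)",
    (str(uuid.uuid4()), "fresh order"),
)

# Rows that became available later from the old database.
recovered = [
    (str(uuid.uuid4()), "old order A"),
    (str(uuid.uuid4()), "old order B"),
]

inserted = merge_recovered_orders(conn, recovered)
replayed = merge_recovered_orders(conn, recovered)  # same batch again
total = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

With auto-increment integer keys the same approach would not work, since the new database would hand out IDs that the old rows already use.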

How did you handle other tables like users and sessions?

This microservice didn't have them, but even where we do have them, they're all keyed by UUID, so it's possible to start from a clean slate pretty fast. Filling in existing users later should also be easy; the problems will occur if and when users try to sign up again with the same key (email / FBid etc).
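To make the sign-up-again problem concrete, here is a rough sketch of how the clashing accounts could be found before merging. The record layout, the `email` field, and the `find_natural_key_conflicts` helper are all hypothetical, not what we actually run. UUIDs never clash between the old and new databases, but a natural key such as an email can appear in both under different UUIDs, and those are the rows that need manual reconciliation.

```python
def find_natural_key_conflicts(fresh_users, recovered_users, key="email"):
    """Return (natural_key, fresh_id, recovered_id) for every recovered user
    whose natural key was registered again under a different id."""
    # Normalize the key so "A@Example.com" and "a@example.com" match.
    fresh_by_key = {u[key].strip().lower(): u["id"] for u in fresh_users}
    conflicts = []
    for user in recovered_users:
        natural = user[key].strip().lower()
        fresh_id = fresh_by_key.get(natural)
        if fresh_id is not None and fresh_id != user["id"]:
            conflicts.append((natural, fresh_id, user["id"]))
    return conflicts


# Hypothetical data: one user signed up again while the old data was offline.
fresh_users = [
    {"id": "new-1", "email": "a@example.com"},
    {"id": "new-2", "email": "c@example.com"},
]
recovered_users = [
    {"id": "old-1", "email": "A@Example.com"},
    {"id": "old-2", "email": "b@example.com"},
]

conflicts = find_natural_key_conflicts(fresh_users, recovered_users)
```

Rows that do not show up in `conflicts` can be merged as-is; the rest need a policy, for example keeping the older account and re-pointing the newer one's data to it.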

Hello! Was this ever resolved? It seems like there was one update from the poster, but then nothing.

Resolved, managed to get the data back, don't have a postmortem yet.

For what it’s worth, this appears localized to your account. Sorry for your situation, though.

sorry to hear this, good luck. always sucks on a fri.

