
Hold off on AWS Aurora and Postgres until further notice - sudhirj
We've had a sudden crash of the underlying storage (apparently) on one of our production workloads - and because the crash is in the underlying storage, the Multi-AZ setting makes no difference. Neither the primary nor the replica works.

The point-in-time restore can't complete a restore to the last available point, so this looks like data loss to me. Points earlier in time work fine, so it might be an issue with a recent patch or failure.

I'm currently in conversation with support, will update when I have more information.
======
acranox
The cloud is not immune to failure. Moving to the cloud just moves the line
of responsibility for who manages a service, but it does not remove the risks.

Hardware and software fail. Amazon does a better job mitigating the damage
from failures than I do, but theirs still fails.

I'm sorry for your data loss. I hope the recovery isn't too painful. Live and
learn.

------
deckarep
That’s a tough situation to be in. Hopefully Amazon is working on a solution
for you.

Cloud is definitely not bullet-proof and risks are always there. I think it
generally still wins, however, because as cloud solutions get better, with
better SLAs, companies can focus on what they specialize in and offer.

At least that is the goal and how I understand it.

------
kurttheviking
We've been using Postgres on RDS for years and were considering moving to
Aurora. I am very curious to know more about the fix and planned remedy.

~~~
sudhirj
Current suspects are GIN index problems, but nothing confirmed yet.

~~~
petergeoghegan
Per my remarks here, I think that there is an obscure bug in GIN vacuuming
that the community has yet to identify:

[https://postgr.es/m/CAH2-Wz=GTnAPzEEZqYELOv3h1Fxpo5xhMrBP6aM...](https://postgr.es/m/CAH2-Wz=GTnAPzEEZqYELOv3h1Fxpo5xhMrBP6aMGEKLKv95csQ@mail.gmail.com)

It's possible that this upstream PostgreSQL issue affected you. What is known
for sure is that there was a bugfix for GIN that just missed the latest round
of point releases. See:

[https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit...](https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=3b2787e1f8f1eeeb6bd9565288ab210262705b56)

This fix alone could have made all the difference here (though this is a
guess). I assume you won't have this fix with Aurora just yet (not sure -- I
don't work for Amazon). But, I'd mention this to AWS support if it seems at
all relevant. Hopefully they'll get in touch with the community if that's what
it is.

Good luck

~~~
sudhirj
Thanks, will pass this on.

------
stocktech
It's the Friday-before-a-holiday curse. I'm sorry for your loss.

------
tyingq
Very interested in what Amazon's final position on this is. I assumed part of
the value proposition was that they handled redundancy and recovery.

------
nolite
" because the crash is in the underlying storage the MultiAZ setting makes no
difference."

@OP - I don't understand this line... can you explain more?

~~~
vhold
Aurora read replicas are just compute heads decoupled from the storage. They
increase availability and read performance, but don't do anything for the
reliability of the storage.

[https://aws.amazon.com/rds/aurora/faqs/](https://aws.amazon.com/rds/aurora/faqs/)

"Each 10GB chunk of your database volume is replicated six ways, across three
Availability Zones."

So, if there has been a storage layer failure, most likely it was a software
bug in Aurora itself.
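The published Aurora design uses a 4/6 write quorum and a 3/6 read quorum over those six copies, two per AZ. A toy sketch of why that tolerates an AZ outage plus one extra node (the quorum figures are from Amazon's published design; everything else here is an illustration, not Aurora internals):

```python
# Toy model of Aurora-style quorum durability: six copies of each
# 10GB segment, two per Availability Zone across three AZs.
COPIES_PER_AZ = 2
AZS = 3
TOTAL = COPIES_PER_AZ * AZS   # 6 copies
WRITE_QUORUM = 4
READ_QUORUM = 3

def survives(lost_copies: int, quorum: int) -> bool:
    """A quorum is still reachable if enough copies remain."""
    return TOTAL - lost_copies >= quorum

# Losing a whole AZ (2 copies) still leaves a write quorum:
assert survives(COPIES_PER_AZ, WRITE_QUORUM)

# Losing an AZ plus one more node breaks writes but not reads:
assert not survives(COPIES_PER_AZ + 1, WRITE_QUORUM)
assert survives(COPIES_PER_AZ + 1, READ_QUORUM)
```

The catch, as noted above, is that a correlated software bug in the storage layer corrupts all six copies at once, so the quorum math offers no protection against it.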

------
fjania
I don't see anything on the status page. I know the status page lags an actual
incident, but I'd imagine something would be up by now. Is this a widespread
problem?

------
sudhirj
The data has been recovered, no post mortem yet.

~~~
mrep
That's lucky. Was your service down for the entire time until the data was
recovered?

~~~
sudhirj
We moved to a new DB as a clean slate, took fresh orders, and then merged the
old ones in as they became available. We were able to do that because all primary
keys were uuids and the order history API was being served somewhere else.
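The clean-slate trick works because UUID primary keys from the old and new databases can never collide, so the merge reduces to an insert-if-absent. A minimal sketch, with plain dicts standing in for the two order tables (table shape and field names are made up for illustration):

```python
import uuid

# Stand-ins for the new (clean-slate) and recovered (old) order tables,
# both keyed by a UUID primary key.
new_db = {uuid.uuid4(): {"item": "widget", "source": "fresh"}}
old_db = {uuid.uuid4(): {"item": "gadget", "source": "recovered"}}

def merge_recovered(new_db, old_db):
    """Insert recovered rows the new DB doesn't already have."""
    merged = 0
    for pk, row in old_db.items():
        if pk not in new_db:   # with serial integer keys this check would
            new_db[pk] = row   # be ambiguous (both DBs start at 1);
            merged += 1        # with UUIDs it is always safe
    return merged

merge_recovered(new_db, old_db)
assert len(new_db) == 2
```

Running the merge again is a no-op, so the old data can be folded in incrementally as it becomes available.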

~~~
mrep
How did you handle other tables like users and sessions?

~~~
sudhirj
This microservice didn't have them, but even those are all keyed by UUID, so
it's possible to start with a clean slate pretty fast. Filling in existing users
later should also be easy; the problems will occur if and when users try to
sign up again with the same natural key (email / FB id, etc.).
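Re-signup is exactly where the UUID scheme needs help: two different UUIDs can end up claiming the same email. A sketch of deduplicating on the natural key rather than the primary key (all names here are illustrative):

```python
import uuid

# Users keyed by UUID, with email as the natural key that must stay
# unique. A returning user who signs up again would get a fresh UUID,
# so dedup has to happen on the email, not the primary key.
users = {}        # uuid -> row
email_index = {}  # email -> uuid

def sign_up(email: str) -> uuid.UUID:
    """Create a user, or return the existing account for this email."""
    if email in email_index:       # same natural key: reuse the old
        return email_index[email]  # account instead of forking a new one
    pk = uuid.uuid4()
    users[pk] = {"email": email}
    email_index[email] = pk
    return pk

first = sign_up("a@example.com")
second = sign_up("a@example.com")  # re-signup after the clean slate
assert first == second and len(users) == 1
```

In Postgres the same idea would be a unique constraint on the email column with an upsert on conflict, rather than an in-memory index.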

------
CaliforniaKarl
Hello! Was this ever resolved? It seems like there was one update from the
poster, but then nothing.

~~~
sudhirj
Resolved, managed to get the data back, don't have a postmortem yet.

------
QuinnyPig
For what it’s worth, this appears localized to your account. Sorry for your
situation, though.

------
jaydestro
sorry to hear this, good luck. always sucks on a fri.

------
BallinBige
Sorry

