
Amazon Aurora Backtrack - jeffbarr
https://aws.amazon.com/blogs/aws/amazon-aurora-backtrack-turn-back-time/
======
ulkesh
"Aurora will try to retain enough log information to support that window of
time."

It's good to know that Aurora will try. It's not like it needs to be reliable
or anything.

~~~
chimeracoder
> It's good to know that Aurora will try. It's not like it needs to be
> reliable or anything.

"Reliable" isn't a binary concept. There's no service that is 100% reliable.

Having a solution that works 99.99% of the time (which is "only" four nines)
is probably good enough that people are willing to use it as a last-resort.
(From what it seems, it's included for free, so it's not like this costs you
anything anyway - at worst, you end up where you would be if they hadn't
released this).

~~~
ddorian43
It's not free, look at "estimated 10$/month".

------
duggan
Aurora keeps coming along in leaps and bounds, congratulations to the team,
this is a fantastic achievement!

I only wish that every new feature didn't inevitably come with the caveat that
it's only for the MySQL flavour of Aurora.

I understand both the engineering and product development reasons for doing so
(different stack and MySQL is undoubtedly a much larger customer base), but it
always makes these announcements a little underwhelming as an Aurora Postgres
user.

~~~
ben509
I have no problem with the fact that the MySQL users get to be my beta-
testers.

~~~
erik_seaberg
If my beta testers shrug off relational integrity, that makes me the real beta
tester.

------
Bucephalus355
Ok Oracle has had this feature for at least a decade, it’s called a “flashback
query”. Obviously Aurora costs 10% of Oracle, but still, I thought this was
going to be a huge feature-add considering the HN comment count.

That being said, I love AWS, am Pro-Certified, and work with it everyday.

I know Oracle is a giant mean bully company, but at least their arrogance was
never of the “world-destabilizing” kind like Facebook.

EDIT: changed rollback query to flashback query (flashback query can be used
both to view or to actually change the DB)

~~~
icebraining
Can you link to any docs about it? All I can find is the ROLLBACK statement in
a transaction.

~~~
Bucephalus355
Sorry rollback was wrong, the correct term was “flashback”.

[https://docs.oracle.com/cd/E11882_01/backup.112/e10642/rcmfl...](https://docs.oracle.com/cd/E11882_01/backup.112/e10642/rcmflash.htm#BRADV89737)

~~~
icebraining
Thanks!

------
ben509
Reading the paper[1] linked from Jeff's post:

> In Aurora, we have chosen a design point of tolerating (a) losing an entire
> AZ and one additional node (AZ+1) without losing data, and (b) losing an
> entire AZ without impacting the ability to write data. We achieve this by
> replicating each data item 6 ways across 3 AZs with 2 copies of each item in
> each AZ. We use a quorum model with 6 votes (V = 6), a write quorum of 4/6
> (V w = 4), and a read quorum of 3/6 (V r = 3). With such a model, we can (a)
> lose a single AZ and one additional node (a failure of 3 nodes) without
> losing read availability, and (b) lose any two nodes, including a single AZ
> failure and maintain write availability. Ensuring read quorum enables us to
> rebuild write quorum by adding additional replica copies.

There are many 2 AZ regions in AWS, of course. I don't think you can stripe 3
copies per AZ, an AZ failure drops you to potentially 2/6, and if you allow
for 2/6 and 3/6 writing you could have a split brain. Any thoughts how they
manage that?

[1]
[https://www.allthingsdistributed.com/files/p1041-verbitski.p...](https://www.allthingsdistributed.com/files/p1041-verbitski.pdf)
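
The quorum arithmetic in the quoted passage can be checked with a short sketch. It only illustrates the standard quorum conditions (V_r + V_w > V and 2 * V_w > V) using the numbers from the paper; it is not Aurora's code, and the function names and the 3-copies-per-AZ layout are assumptions made for the example.

```python
def quorum_is_consistent(v: int, v_w: int, v_r: int) -> bool:
    """Standard quorum rules: every read overlaps a write, and any two writes overlap."""
    return v_r + v_w > v and 2 * v_w > v


def availability(v_w: int, v_r: int, alive: int) -> dict:
    """Which operations still reach a quorum with `alive` copies remaining."""
    return {"read": alive >= v_r, "write": alive >= v_w}


# Numbers from the paper: 6 copies, write quorum 4, read quorum 3.
V, V_W, V_R = 6, 4, 3
consistent = quorum_is_consistent(V, V_W, V_R)

# 3 AZs with 2 copies each: losing one AZ leaves 4 copies, AZ+1 leaves 3.
after_az_loss = availability(V_W, V_R, alive=4)
after_az_plus_one = availability(V_W, V_R, alive=3)

# Hypothetical 2-AZ layout with 3 copies per AZ: losing one AZ leaves 3 copies.
two_az_after_az_loss = availability(V_W, V_R, alive=3)

# Lowering the write quorum to 3 would let two disjoint halves both accept writes.
split_brain_safe = quorum_is_consistent(V, 3, V_R)
```

Under these assumptions a 3+3 layout keeps reads available after an AZ failure but not writes, and dropping the write quorum to 3 breaks the overlap rule, which is the split-brain concern raised above.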

~~~
exawsthrowaway
I left AWS before Aurora was introduced, but I suspect it's similar to how S3
fulfills a similar promise. From the S3 FAQ[1]:

> Amazon S3 Standard, S3 Standard-Infrequent Access, and Amazon Glacier
> storage classes replicate data across a minimum of three AZs to protect
> against the loss of one entire AZ. This remains true in Regions where fewer
> than three AZs are _publicly_ available.

Italic emphasis is mine.

[1] [https://aws.amazon.com/s3/faqs/](https://aws.amazon.com/s3/faqs/)

------
gtsteve
This is nice but it appears that the entire database instance gets rolled back
to that point. It'd be a lot nicer if it could be done at a per-db or per-
table granularity.

Realistically I'd never use this feature because of the risk of data loss. I'd
restore a new instance from backups and copy the lost data back over manually.

~~~
ryanianian
> It'd be a lot nicer if it could be done at a per-db or per-table
> granularity.

As another commenter pointed out, per-table would be scary for referential-
integrity. And per-db _kinda_ makes sense, but if you've got totally different
use-cases hosted by the same MySQL you may be using the service incorrectly.

> I'd restore a new instance from backups and copy the lost data back over
> manually.

That's always been possible, but even with the best instances and most iops it
could take hours to do that. So it's really designed for a different scenario.

~~~
gtsteve
Regarding referential integrity I don't disagree. It would be a process to be
performed by an expert who understands the data structures and what the
effect of rolling them back would be.

I might be wrong but I believe Aurora backups can be restored quite a bit
quicker than that, right? We are evaluating Aurora at work, hence my interest.

------
wpietri
Very interesting. They describe it as a rewind. Does anybody know if it's
really a rewind, where each log record is reversible? Or do they do the easier
thing of saving snapshots and then replaying the log from snapshot to desired
point?

~~~
misframer
Aurora uses a log-structured storage system, so they just point to a previous
version that they haven't garbage collected.

~~~
awgupta
that's correct.

------
thelastidiot
Amazon is the new IBM. Knock yourself out and jump into the AWS ecosystem. In
a few years down the line, you'll understand that you've lost the leverage you
had to potentially take your public cloud business somewhere else when you
have so many dependencies on Amazon tech. Basic principles from my view: don't
adopt anything but standard EC2/S3 services and create diversity not only in
your teams but in your infrastructure policies.

~~~
potle
what are some alternatives that one can use for rds postgresql and sqs?

~~~
013a
There probably isn't a lot of fear with using RDS. They just do management for
you, it isn't any fundamentally different technology that would cause
heartache during a migration. Just stay away from Aurora if you care about
this.

SQS: There's AWS MQ, which is based on ActiveMQ and supports AMQP. If you're
going with a more CNCF-focused stack and want to use NATS, I'm not aware of
any hosted options.

~~~
RhodesianHunter
You don't even need to stay away from Aurora since it's MySql/Postgres SQL API
compatible.

You would just want to stay away from Aurora specific features like the one in
OP.

~~~
koolba
Features like database rewind aren’t something that would be bound to your
core app structure either. While you could structure business processes around
it, it’s more of a shit hits the fan restore scenario, not a daily or weekly
action.

Replicating it on a different platform could also be fine with a combination
of logical and physical backups.

~~~
cookiecaper
Point-in-time database restores are a best practice that is provided by
maintaining database write logs for the timeframe that you expect to have
point-in-time restoration for. You don't have to use Aurora to get them,
Aurora just has the clicky buttons to make it a clicky-button matter. Any
serious DBA should know how to do this _without_ Amazon's platform wrapping it
up in a GUI.

~~~
awgupta
We've had point in time restore for quite some time. Backtrack is different.
It moves you to a different point using the same instance. Since we don't do
destructive writes to blocks (it is log-structured storage), we can simply
mark a portion of the log as "ignored". It is a server feature, not a UI
enhancement.
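
A toy model of "mark a portion of the log as ignored" might look like the sketch below. It is an illustration of the idea only, not Aurora's storage layer: the `LogStructuredStore` class, its methods, and the integer LSNs are all hypothetical.

```python
class LogStructuredStore:
    """Toy append-only store: every write is a (lsn, key, value) record."""

    def __init__(self):
        self.records = []   # append-only; existing records are never overwritten
        self.ignored = []   # (first_lsn, last_lsn) ranges that reads must skip
        self.next_lsn = 1

    def write(self, key, value):
        lsn = self.next_lsn
        self.records.append((lsn, key, value))
        self.next_lsn += 1
        return lsn

    def backtrack(self, target_lsn):
        """Mark every record after target_lsn as ignored; nothing is deleted."""
        last = self.next_lsn - 1
        if target_lsn < last:
            self.ignored.append((target_lsn + 1, last))

    def _is_ignored(self, lsn):
        return any(lo <= lsn <= hi for lo, hi in self.ignored)

    def read(self, key):
        """Latest value for key among records that are not ignored, else None."""
        for lsn, k, v in reversed(self.records):
            if k == key and not self._is_ignored(lsn):
                return v
        return None


store = LogStructuredStore()
store.write("balance", 100)
good_lsn = store.write("balance", 120)
store.write("balance", 0)            # the mistaken write
store.backtrack(good_lsn)            # same store, earlier point in the log
after_backtrack = store.read("balance")
store.write("balance", 130)          # new writes continue after the ignored range
after_new_write = store.read("balance")
```

The mistaken record is still physically in the log; reads simply skip it, which is why no separate instance has to be restored in this model.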

------
qiuyesuifeng
TiDB has already supported this (similar) feature about 2 years ago and it has
been adopted by the gaming users:
[https://www.pingcap.com/blog/2016-11-15-Travelling-Back-in-T...](https://www.pingcap.com/blog/2016-11-15-Travelling-Back-in-Time-and-Reclaiming-the-Lost-Treasures/)

------
estsauver
I'm slightly confused, is this the same as the existing point-in-time restore
that's available for other RDS instances?

Edit: Main difference seems to be new cluster vs. in place.

~~~
scrollaway
No; look at the screenshots. You can pick _any_ point in time given the window
you request.

~~~
manigandham
That's usually what point in time restores allow, by using snapshots as a base
and WAL since then.
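
A minimal sketch of that "snapshot as a base plus WAL since then" approach, assuming a toy key-value state and integer timestamps. `restore_to` and the data shapes are invented for the example and do not correspond to any RDS or Postgres API.

```python
import bisect


def restore_to(snapshots, wal, target_time):
    """Rebuild state as of target_time.

    snapshots: list of (time, state_dict), sorted by time.
    wal:       list of (time, key, value) writes, sorted by time.
    """
    times = [t for t, _ in snapshots]
    i = bisect.bisect_right(times, target_time) - 1
    if i < 0:
        raise ValueError("no snapshot at or before the target time")
    base_time, base_state = snapshots[i]
    state = dict(base_state)  # copy so the snapshot itself stays untouched
    # Replay only the writes made after the snapshot, up to the target.
    for t, key, value in wal:
        if base_time < t <= target_time:
            state[key] = value
    return state


snapshots = [(0, {}), (100, {"a": 1, "b": 2})]
wal = [(50, "a", 1), (90, "b", 2), (110, "a", 5), (130, "b", 9), (150, "a", 7)]

at_135 = restore_to(snapshots, wal, 135)   # snapshot at 100, then two WAL records
at_95 = restore_to(snapshots, wal, 95)     # snapshot at 0, then two WAL records
```

The replay loop is where the time goes in this model: the further the target is from the nearest snapshot, the more log has to be applied.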

------
aionic
It's a relatively classic invention, take something that exists and repackage.
A snapshot and a log replay accomplishes something pretty similar. AWS slapped
a ui and some orchestration around it. The cloud lock stuff makes sense
(although if having an easy "undo button" on your db layer is mission critical
to your business you might have other interesting challenges).

~~~
awgupta
That's not correct. We made a change to mark portions of our log-structured
storage as though they should not have occurred. It is a totally different
approach than point-in-time restore.

------
cody8295
I don't know anything about Aurora and maybe I'm missing something. But why
not just wrap everything in a TRANSACTION and then do a ROLLBACK if there's an
issue?

~~~
ben509
This would be for deploying a database migration that hoses something. For
instance, we had a migration that touched a bunch of tables and, due to a
triggered procedure, it blew away all the modification dates.

TBH, I'm not sure how useful it is compared to the normal PITR. My usual
guidance on recovering after a database fuck up is "put the app into
maintenance mode and the database in read-only, do your investigation, run
PITR, port user activity to the backup, do spot checks to determine everything
got through, roll the app back to before the change, take the old database
down, point the app at the new database, bring the system back online."

I'm sure a manager is thinking, "oh, we can be back online in 5 minutes,
amazing!" But just because something broke doesn't mean users stopped hitting
your database! You can't just hit Undo and throw all that out!

~~~
awgupta
You can also clone the database volume in Aurora and then just backtrack one
of the volumes. That should help you ensure you have a version available for
forensic analysis.

------
craigkerstiens
At this time looks like it only applies to MySQL, will be curious to hear
if/when it becomes available for PostgreSQL.

------
setheron
How is this different than point in time restore already available?

~~~
matthewmacleod
PITR on say RDS is going to be a fair bit slower than this, I'd imagine.

~~~
delta-v
PITR is a lot slower especially if there are large transactions in the binlog.

------
manish_gill
The seamlessness of this feature is quite amazing. Backups are usually a huge
pain to deal with (I've recently been dealing with Postgres/Barman quite a
bit). And disaster scenarios aside (for which AWS already does replications
across regions), I think a frequent purpose of backups is really to do this
"Undo", go back in time and pretend something didn't happen.

All this makes me really really wanna use Aurora. :)

~~~
takeda
What issues did you have with barman? My experience with it was quite
pleasant.

~~~
manish_gill
Various issues with recovery during testing. Missing WALs in the data files
that `list-files` command says should be present. Backup failure because of a
different number of parallel workers and so on. Combined with the fact that I
had to write my own code to push it to S3 (no out of the box support).

I suppose it works fine on its own for small to medium databases, and it's a
fantastic product. I just wish it was a bit better. :)

~~~
takeda
About WALs, did you use replication slots? That feature makes WAL logs only
removed after all standbys process them.

As for S3, WAL-E is probably what you wanted to use. Barman was more intended
for on premises installation.

------
rustyworm
Interesting - but what if your database has constant activity? "Oops, my SQL
bad" becomes "Oops, my rewind lost 410 transactions"?

~~~
gtowey
This is exactly my concern. There is no production system I have ever worked
on which this feature would ever be used.

What if you had an e-commerce site? Other customers were placing orders,
right? Credit cards being charged? You didn't catch your mistake instantly. So
you have a window of maybe, 5 minutes or an hour in which other things
happened. You can't simply forget those other transactions and throw away the
data.

~~~
joemag
There have been many well publicized events where a production database was
lost, or in some way corrupted. I’ve lived through one such event myself. When
that happens, you usually go to a backup, and typically run into two problems:

1) Point in time back ups can be hours old.

2) More importantly, “backups” are useless, it’s the “restores” that are
valuable. And very few organizations have a well practiced muscle memory for
restoring from a backup.

A turn key restore solution, with a per second granularity can both
significantly decrease the loss window, and recovery time. Hope nobody gets to
use it, but when you have to, it can be the difference between a big and a small
outage.

------
polskibus
How did Aurora begin its life? Was it written from scratch or forked from
existing open source database?

~~~
gtsteve
It's forked from MySQL 5.6 and 5.7. It's not open-source itself however so
presumably they signed an agreement with Oracle to do this.

~~~
nimrody
MySQL has a GPL license (not AGPL) so I think they can modify and not
distribute the source as long as they provide a _service_ and not _binaries_.

------
brettgo1982
How is this better than their already existing PITR?

Why would someone want to rollback their own production database instead of
PITR to a new database and switching over to it? Surely you would end up
losing data because you wouldn't be able to reconcile the new data written to
it.

------
truth_seeker
CockroachDB also supports this.
[https://news.ycombinator.com/item?id=11958660](https://news.ycombinator.com/item?id=11958660)

~~~
awgupta
That is a different feature (although a cool one). They provide the ability to
run a query as of a point in time. We are moving the database backward in time
(which matters for running applications)

------
sleepychu
> _We’ve all been there! You need to make a quick, seemingly simple fix to an
> important production database._

Have we though? This could be one of those safety nets that makes me worse not
better.

------
ethanpil
Does Amazon release the source for these features? Would love to see these
ported to other flavors of mySQL.

~~~
tejasmanohar
Nope. And, FWIW, I'd bet Aurora has deviated far from core MySQL at this
point, too.

------
truth_seeker
If you use Datomic with DynamoDB, this feature is available at Query level.

------
edge17
I’m confused, how is this different than an Undo log?

~~~
misframer
This takes advantage of log-structured storage so there isn't any "undoing".

~~~
edge17
In a database Undo, in spite of the name, is the act of applying a transaction
log to return to a particular state. I used to work on databases earlier in my
career, so selling this as a new invention seems somewhat bizarre to me.

~~~
delta-v
That's not what Aurora is doing. What you have described is more like "git
revert". It's O(number of transactions in the undo log).

Aurora is a log structured db, they can just reset their DB to a particular
LSN, i.e. doing "git reset --hard", which is O(1).
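
The cost difference can be sketched with two toy stores. Both are hypothetical illustrations, not how MySQL or Aurora are implemented: `undo_revert` walks an undo log one record at a time, while `VersionedStore.reset_to` only moves a pointer to an older version.

```python
def undo_revert(state, undo_log, target_len):
    """Apply undo records in reverse until the log is target_len long."""
    steps = 0
    while len(undo_log) > target_len:
        key, old_value = undo_log.pop()
        if old_value is None:          # None means the key did not exist before
            state.pop(key, None)
        else:
            state[key] = old_value
        steps += 1
    return steps


class VersionedStore:
    """Toy store where versions[lsn] is an immutable state as of that LSN."""

    def __init__(self):
        self.versions = [{}]
        self.head = 0

    def write(self, key, value):
        # Copying the dict keeps the toy simple; a real system would not copy.
        new_version = dict(self.versions[self.head])
        new_version[key] = value
        self.versions.append(new_version)
        self.head = len(self.versions) - 1

    def reset_to(self, lsn):
        self.head = lsn                # one pointer move, however far back
        return 1

    def read(self, key):
        return self.versions[self.head].get(key)


state, undo_log, store = {}, [], VersionedStore()
for i in range(1000):
    undo_log.append(("k", state.get("k")))   # remember the value being replaced
    state["k"] = i
    store.write("k", i)

undo_steps = undo_revert(state, undo_log, 10)   # back to the state after 10 writes
reset_steps = store.reset_to(10)
```

In this model the undo path does work proportional to the number of records being reversed, while the reset does a constant amount regardless of distance.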

~~~
awgupta
that's correct.

------
qurashee
Amazon rediscovering PITR, nice :P

~~~
ben509
No, they've had that for ages:
[https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_...](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_PIT.html)

~~~
manigandham
This post is about Aurora, their own custom db. They could've easily called it
PITR to match the rest of the industry.

~~~
awgupta
This is a different feature than PiTR. We have both. They serve different
needs.

