
Large Scale NoSQL Database Migration Under Fire - kawera
https://medium.com/appsflyer/large-scale-nosql-database-migration-under-fire-bf298c3c2e47
======
SuddsMcDuff
We've taken a very similar approach when migrating data from one DB to another
(MySQL to Redis in our case, but the principle should apply to any pair of
databases). We split it into four phases:

* Off - Data written only to MySQL (starting state)

* Secondary - Data written to both MySQL and Redis, with MySQL as the source of truth.

* Primary - Data written to both MySQL and Redis, with Redis as the source of truth.

* Exclusive - Data written exclusively to Redis.

As mentioned in the article, the _Secondary_ phase allowed some time for the
new database to be populated. And the distinction between _Primary_ and
_Secondary_ phases gave us a rollback option if something went wrong.
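
A minimal sketch of how that phase switch might look in code. The `MigratingStore` wrapper and `FakeStore` stand-ins are hypothetical names, not from the original comment; real MySQL and Redis clients would replace the fakes:

```python
from enum import Enum

class Phase(Enum):
    OFF = "off"              # writes go to MySQL only (starting state)
    SECONDARY = "secondary"  # dual writes; MySQL is the source of truth
    PRIMARY = "primary"      # dual writes; Redis is the source of truth
    EXCLUSIVE = "exclusive"  # writes go to Redis only

class FakeStore:
    """Stand-in for a real MySQL/Redis client."""
    def __init__(self):
        self.data = {}
    def set(self, key, value):
        self.data[key] = value
    def get(self, key):
        return self.data.get(key)

class MigratingStore:
    def __init__(self, mysql, redis, phase=Phase.OFF):
        self.mysql, self.redis, self.phase = mysql, redis, phase

    def write(self, key, value):
        # MySQL receives writes in every phase except Exclusive
        if self.phase is not Phase.EXCLUSIVE:
            self.mysql.set(key, value)
        # Redis receives writes in every phase except Off
        if self.phase is not Phase.OFF:
            self.redis.set(key, value)

    def read(self, key):
        # reads always go to the current source of truth
        if self.phase in (Phase.OFF, Phase.SECONDARY):
            return self.mysql.get(key)
        return self.redis.get(key)
```

Rolling back from Primary to Secondary is then just flipping `phase`, since both stores have been receiving every write.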

~~~
nimrody
The difficulty here is what happens when one of the databases is temporarily
unavailable (network error, etc.). You cannot have a "transaction" cover writes
to both systems, so you either have to manually undo one of the writes or risk
the two systems getting out of sync.

~~~
toomuchtodo
This is typically solved (in my experience) with a reconciliation process based
on transaction GUIDs, plus backfill from the non-source-of-truth store whenever
the data isn't found in the source of truth. As long as a transaction made it
into one of your data stores, consistency isn't lost (and if writes failed to
both data stores, alarms should go off).
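
A toy sketch of one such reconciliation pass, under the assumption that records in both stores are keyed by a transaction GUID assigned at write time. Plain dicts stand in for the two stores, and `reconcile` is a hypothetical helper, not anything from the article:

```python
def reconcile(source_of_truth, secondary):
    """One reconciliation pass over two stores keyed by transaction GUID.

    Backfills each store with writes that only reached the other one.
    Returns the GUIDs that were missing from the source of truth, which
    is the condition worth alerting on.
    """
    src_ids = set(source_of_truth)
    sec_ids = set(secondary)

    # Writes that reached only the source of truth: backfill the secondary.
    for txid in src_ids - sec_ids:
        secondary[txid] = source_of_truth[txid]

    # Writes that reached only the secondary: backfill the source of truth.
    missing = sec_ids - src_ids
    for txid in missing:
        source_of_truth[txid] = secondary[txid]
    return missing
```

Run periodically, this keeps the two stores converging as long as every write lands in at least one of them.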

------
aynsof
Interesting write-up. I'd love to see stats on how the new database is
performing. Have they reduced it from running on 45 4xlarge instances? Do
backups still take a day? Was it a good financial decision?

~~~
barkanido
Financially it was a good decision. The current cluster is a lot smaller than
the original one, and traffic + data have grown since. We currently maintain 2
clusters of 5 i3.4xlarge machines. That's a total of 10 machines and is a lot
cheaper than what we had before. The DB is performing great. It is flash based
and 99.98% of the queries have <1ms latency. Each XDR end holds around 3.1B
records, with a replication factor of 2. Midterm load is around 3 (very low)
and we are doing around 190K reads/sec plus 37K writes/sec at peak load.

------
seanwilson
> The following post describes how we migrated a large NoSql database from one
> vendor to another in production without any downtime or data loss.

Are there any good write-ups where a migration went really wrong and how it
was fixed?

------
manigandham
Or... they could just run this entire thing using ScyllaDB on a single mid-
size VM with local SSDs, with headroom to spare. Put one in each DC for
active/active replication. No enterprise contract needed.

~~~
barkanido
ScyllaDB was actually too late to enter our POC (we had a somewhat tight
schedule for migration) but it was a valid candidate nevertheless.

------
redwood
Would be interested in learning why they chose the technology they did:
did the use case require ultra-low-latency lookups?

~~~
barkanido
Yes. Low latency lookups are a requirement. That said, even double the
latency we have now would be okay. More important than latency were actually
throughput and high availability, and Aerospike demonstrated these well.

------
danbruc
_~2000 write IOPS

~180000 read IOPS_

What are those IOPS in this case? Queries? Transactions? Disk block accesses?

~~~
drodgers
I'm pretty sure they mean operations on the block storage layer (EBS) as
reported by AWS CloudWatch monitoring.

It's a standard measure:
[https://en.wikipedia.org/wiki/IOPS](https://en.wikipedia.org/wiki/IOPS)

~~~
danbruc
That is why I asked. I am only aware of IOPS being used in connection with
disk I/O, and it seems a rather unusual measure for a database, where caches
hopefully avoid hitting the disk too often, at least for reads. Writing 8
MiB/s (assuming 4 KiB blocks) does not seem especially noteworthy, even given
three-fold redundancy. On the other hand, reading 700 MiB/s, which would not
be multiplied by the redundancy, is a comparatively big number, especially
because caches should limit the disk traffic to a small fraction.
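
The arithmetic behind those two bandwidth figures, assuming the same 4 KiB block size as above:

```python
BLOCK = 4 * 1024  # bytes per I/O, assuming 4 KiB blocks
MIB = 2**20       # bytes per MiB

write_bw = 2_000 * BLOCK / MIB    # ~8 MiB/s of writes
read_bw = 180_000 * BLOCK / MIB   # ~700 MiB/s of reads
print(f"{write_bw:.1f} MiB/s writes, {read_bw:.0f} MiB/s reads")
```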

~~~
z3t4
Databases usually don't use caches. It's cheaper to just ask the file-system
for the data.

~~~
danbruc
This is not true at all, bordering on utter nonsense. Databases try hard to
keep the correct set of blocks in memory because that is essential for their
performance. Heck, many of the fastest database systems advertise themselves
as in-memory databases, avoiding disks altogether.

