
The Day the Replication Died - mecredis
http://www.kickstarter.com/backing-and-hacking/the-day-the-replication-died
======
thaumaturgy
Nice writeup. I've been wrestling with MySQL replication for a few months,
too. I have a multi-master setup with an additional slave off-site that stores
backups.

I've switched entirely to row-based replication because it offers one really
nice feature: in the event of a desync, I can resync the slaves with a query
like this on the master,

    
    
        create table new_table like bad_table;
        insert into new_table select * from bad_table;
        drop table bad_table;
        rename table new_table to bad_table;
    

...and bam, everything's groovy. On a VPS, I can have the slaves resync'd
within a few minutes this way on tables with > 1m rows.

The other pretty important thing to have is fast notifications of trouble. It
sucks waiting for the slaves to catch up on the binlog from the master. I
wrote a shell script that runs from cron every few minutes, checks the
replication status on all hosts, and lets me know if there's trouble.
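
For reference, the check itself boils down to something like this on each
slave (a sketch; these are standard SHOW SLAVE STATUS fields):

    
    
        SHOW SLAVE STATUS\G
        -- alert if Slave_IO_Running or Slave_SQL_Running is not "Yes",
        -- or if Seconds_Behind_Master climbs away from zero.
    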

Catching up from a few days behind in replication can take hours; catching up
from only a few hours behind takes just a few minutes.

I'm really looking forward to trying out Percona XtraDB Cluster and seeing if
it handles any of this stuff better than stock MySQL.

~~~
robbles
Could you explain why that technique is only viable for row-based replication?

It looks really useful and I don't quite understand enough about the
replication process to see why you couldn't rely on this for mixed replication
as well.

~~~
blantonl
This should work in a mixed setup; remember, replication in MySQL is pretty
much a replay of SQL statements.

~~~
toast0
I don't know how mixed mode decides between row-based and statement-based for
a given query, but if it chose statement-based, the queries would run and you
still wouldn't be resynchronized.

(In statement-based mode, I've done resyncs by iterating over each row,
setting a not-super-important column to a different value and then back,
including values for all the columns that need to be resynchronized. It's a
lot more queries, but it's also easier to rate-limit if your tables are big
enough that doing it in two queries would back everything up.)
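
A minimal sketch of that per-row approach (table and column names are
hypothetical; run on the master, and each statement replays verbatim on the
slaves, overwriting whatever drifted values they hold):

    
    
        -- first pass: nudge a low-importance column, while also writing the
        -- master's values for every column that needs to be resynchronized:
        UPDATE bad_table
           SET flip_col = 1, col_a = 'master value a', col_b = 42
         WHERE id = 1234;
        -- second pass: put the nudged column back:
        UPDATE bad_table
           SET flip_col = 0, col_a = 'master value a', col_b = 42
         WHERE id = 1234;
    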

------
sparkman55
Can someone point to a big Postgres replication failure event? I know MySQL is
getting better, but I still see public failures like this much more frequently
than big Postgres (or Oracle) failures.

~~~
thaumaturgy
Sure: stock Postgres doesn't have multi-master capability at all. Postgres-XC
does (<http://postgres-xc.sourceforge.net/>), but I didn't know about that
when I finally decided to move from a mix of MySQL and Postgres to MySQL only.

~~~
sparkman55
From the article:

> the replicas were out of sync with the master

Most replicated relational database environments are single-master
environments; this mirrors the read-oriented traffic of many database use
cases.

Most multi-master solutions require you to leave parts of SQL and ACID behind;
the advantage is significantly better write scalability and write
availability.

MySQL replication has been around for ages; it seems to blow up spectacularly
on occasion. Postgres replication is relatively new, and I'm wondering if it
also has similar issues. Anecdotal evidence suggests that it does not: I have
never been awakened by panicked developers or operators because of Postgres
replication, but MySQL replication has caused more than one sleepless night.

~~~
stcredzero
_> MySQL replication has been around for ages; it seems to blow up
spectacularly on occasion._

For us "old school" guys who have been programming for a while, this is truly
eyebrow-raising. Hearing that DB replication "seems to blow up spectacularly
on occasion"...there's something wrong with this picture. A tool like DB
replication should be at a much higher level of reliability than application
code. It also confirms a lot of the grousing I've heard about the engineering
hubris of MySQL over the last decade.

Also reminds me of Alan Kay's quip about what makes programming "not quite a
field."

~~~
thaumaturgy
I want to rush to MySQL's defense here, but I can't. If we forget for a moment
about the history of it or the engineering challenges specific to MySQL and so
on, what we have is an infrastructure application that one-way copies data to
a remote instance of the same application, with almost no error handling and
very few consistency checks, and that halts the data copy in the event of an
error without notifying anybody that there's a problem.

MySQL replication feels like a hack, not the sort of thing that people can use
as part of a reliable infrastructure.

My MySQL replication wishlist would be:

1. a bidirectional communication protocol, so that slaves can ask the master
for a fresh copy of some particular data in the event of an error;

2. built-in notifications for anything that might make an alcoholic out of a
sober sysadmin;

3. periodic idle-time consistency checking (master: "I have X tables with Z
definitions and N rows each"; slave: "something is wrong with my copy of Y, I
need rows 1 - 100"). A rough manual stand-in for this is sketched below.
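
Stock MySQL can approximate that last item by hand: run the same statements on
the master and on each slave, and compare the output (a minimal sketch; the
table name is hypothetical):

    
    
        CHECKSUM TABLE some_table;        -- checksums should match across hosts
        SELECT COUNT(*) FROM some_table;  -- row counts should match too
    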

I have a couple of projects in my pipeline right now that are being held up
entirely by the feeling that MySQL is not yet reliable enough and I need to
build better monitoring and automated error-handling systems first.

~~~
falcolas
What you're describing is essentially any flavor of Galera Cluster for MySQL.
Unfortunately, it comes with its own tradeoffs.

However, I do see something a bit... off... about your second concern. Why
would you want a database to handle its own monitoring? I'd personally rather
just hook a set of MySQL monitoring plugins into Nagios. That way, all of my
monitoring is in one place.

Consistency checking across terabytes of data on multiple servers is a hard
problem. It would be great if it could be solved, but I'm not holding my
breath for them (well, any DB vendor for that matter) to get it right.

~~~
thaumaturgy
I've had issues with Nagios in the past and am not its biggest fan.
Centralized monitoring is nice, but infrastructure-critical software shouldn't
require people to add on monitoring applications IMO.

~~~
falcolas
Do you have examples of other critical software incorporating its own
monitoring solution, such as Apache, PostgreSQL, or even Linux?

How would you monitor memory, disk and load usage?

Sure, Nagios has its issues, but there are quite a few alternatives that can
do the same thing, all of which can interface with your database for
monitoring.

------
druiid
I'm still trying to find time for a writeup of how we use MySQL Galera, but
I'll take this chance to note again that any big MySQL houses out there should
really take a look at it and see if it will work within their environment.
Basically, it's a true multi-master MySQL environment with a shared-nothing
architecture (kind of like MySQL Cluster). We've run into our share of bugs,
but we're perhaps more of an edge case: we create a TON of temporary tables
and were encountering a bug which was slowly eating up memory. That has been
fixed for a couple of releases now and everything has been groovy.

Basically though, the cluster takes care of ejecting bad slaves so you'll
never need to worry about the replication status, etc.

Oh, and feel free to contact me through my profile if you have any questions
about it. We've had a good long run with it now and pretty much know about
most/all of the things to watch out for.

~~~
falcolas
The biggest problem with Galera cluster right now is that you can't have
large, long lived transactions. Due to the way that the non-write nodes verify
data consistency, large transactions can cause some serious slowdown issues
for the entire cluster.

~~~
druiid
True, and that would certainly qualify as an 'edge case' scenario, although if
you increase the number of applier threads, I believe it should mitigate this
particular issue somewhat (more threads open to do other applies at the same
time). Obviously, if you need to have dozens of long writes at the same time,
this doesn't solve that.
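
For reference, the knob in question would be something like this (assuming a
Galera/Percona XtraDB Cluster build; the right value is workload-dependent):

    
    
        -- allow more parallel applier threads on each node, so one long
        -- write doesn't serialize everything queued behind it:
        SET GLOBAL wsrep_slave_threads = 16;
    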

------
falcolas
Ugh. A very painful occurrence. A couple of recommendations, in the event that
Kickstarter reads this thread:

* Run pt-table-checksum on a daily or weekly cron, and fix what it finds
immediately (see the drift query sketched after this list).

* Switch off RDS:

  * Your DB admins will thank you for taking off the kid gloves.

  * You can restore from binary backups quickly.

  * Performance will get better.

* You can use something like pt-slave-restart to get replication running
quickly and re-sync the DBs after the immediate crisis is over.

* If you don't already have it, hook up Nagios to RDS with some MySQL
monitoring plugins like PMP (the Percona Monitoring Plugins).
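
For the first item, the day-to-day drift check can be a single query against
the table that pt-table-checksum maintains (a sketch assuming the tool's
default percona.checksums location; this is essentially the query from its
documentation, run on each replica):

    
    
        SELECT db, tbl, SUM(this_cnt) AS total_rows, COUNT(*) AS chunks
        FROM percona.checksums
        WHERE master_cnt <> this_cnt
           OR master_crc <> this_crc
           OR ISNULL(master_crc) <> ISNULL(this_crc)
        GROUP BY db, tbl;
    

Any rows returned are tables that have drifted on that replica.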

Data drift, and the resultant replication downtime, is an unfortunate reality
of asynchronous replication. It doesn't have to be multiple hours' worth of
scramble, however, if you catch it early (with monitoring) and are prepared to
handle it (with pt-table-sync or binary backups).

------
jamesaguilar
Maybe this is a MySQL noob question, but why do they have to do this
complicated thing to get a unique id for each project backer?

    
    
        CREATE TABLE project_backers (
          # The number they're looking for.
          id INTEGER AUTO_INCREMENT PRIMARY KEY,
          user_id INTEGER,     # user_table fk syntax here.
          project_id INTEGER   # project_table fk syntax here.
        );
    

Insert into this table every time someone backs a project and you get the
monotonically ascending number for free.
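
For instance (hypothetical values):

    
    
        INSERT INTO project_backers (user_id, project_id) VALUES (7, 42);
        SELECT LAST_INSERT_ID();  -- the new backer's globally unique number
    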

Again, I'm a total SQL noob, so maybe this isn't really doing what they want.

~~~
nknighthb
I suspect they were trying to make the data look reasonably nice for project
creators. Your (not incorrect) solution has the downside of ever-increasing
numbers quickly reaching into the millions or billions, while their solution
has numbers no larger than the number of backers on a project (unlikely to be
more than tens of thousands).

They also save the storage cost of going to a BIGINT, but I doubt that was a
major factor.

~~~
jamesaguilar
True, but couldn't you pretty easily derive the logical ID via some sort of
pagination scheme? That is, when you really want to see it at all. It seems
like this would be a pretty low-cost query as long as you set your indices up
properly, and I'm assuming that for the hundred-thousand-backer projects
you'll never need to display _every_ backer in order.
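
(A sketch of that derivation, using the hypothetical schema above; with an
index on (project_id, id), it's a cheap range scan:)

    
    
        SELECT COUNT(*) AS backer_number
        FROM project_backers
        WHERE project_id = 42    -- the project being displayed
          AND id <= 1234567;     -- this backer's global auto-increment id
    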

Another slight concern is how big this project_backers table would get, since
it's a combination. Let's assume every project has an average of 10k backers
and there are about 100k projects. That makes 1B relationships, at roughly 128
bits per row (plus whatever associated data): 16 GB. Not too bad for such an
important relationship. Even with 100x growth and no garbage collection or
archiving, you could fit it on (a couple of) flash disks for the foreseeable
future. I don't know if MySQL supports delta encoding, but that would also
probably make the table cheaper to store.

~~~
nknighthb
You could, but then you have multiple identifiers for the same object, and you
end up having to decide "wait, which ID do I want here?", and during long
nights when you're hopped up on five cans of Red Bull, the numbers start to
run together, and you don't know if you're looking at a real ID or a logical
ID, and even when you do know, you end up running queries directly against the
database trying to convince yourself that the mapping can't possibly be
correct.

Been there, done that, consumed a lifetime supply of Mountain Dew in a year,
and handed in a multi-page resignation letter[1]. I'd rather use their
solution, which ultimately minimizes the overall complexity of the system.

[1] OK, the resignation letter wasn't really about IDs, but the ID problems
were a symptom of larger problems.

~~~
jamesaguilar
Interesting. I would expect that the logical id would be displayed only, never
stored, but that would make debugging display issues difficult.

Thank you for your thoughts. It's helpful to hear the thoughts of database
professionals, especially in domains where I have no knowledge, like
relational databases.

------
stcredzero
Statement-based replication and anything involving it just looks like a
disaster waiting to happen. There's a definite whiff of hubris here. There are
just so many potential ways for something to go wrong.

I'd guess there was a big win in terms of efficiency for this to have been
attempted. This is the sort of thing that you'd want formal methods or an
environment like Haskell for. Either that, or a decade+ of engineering
wizardry with relational systems and some nifty mathematical proofs in your
arsenal.

------
jtchang
These are the kinds of issues that I hate dealing with, and they would make me
consider hosting on a platform-as-a-service like Heroku. Does Heroku have
auto-sharding IDs built in?

~~~
stcredzero
If you're using statement-based replication on Heroku, it still won't save
you. I have no idea if you can or can't do so. Statement-based replication
seems to be one of those things where you can shoot yourself in the foot if
you don't design with it in mind.

~~~
jrochkind1
While there are (or at least were) some third-party MySQL add-ons on Heroku,
in general Heroku uses postgres instead. I have no idea how/if these issues
apply to postgres.

~~~
stcredzero
Only insofar as people use statement-based replication on Heroku.

------
tedchs
> For those who love TLA's, here's some of our current stack: AWS, RoR, RDS,
> EC2, ES, DJ, SASS.

<http://www.kickstarter.com/backing-and-hacking/welcome-to-backing-and-hacking>

------
blantonl
This is a great postmortem about a truly unique MySQL issue.

However, this post left me wondering how they _actually recovered_ from this
issue. I'm thinking the only way is to re-provision the replica infrastructure
from the master's presumably authoritative data. Or was there another recovery
approach?

~~~
AaronBBrown
Because MySQL replication is simple, mature, and has a rich toolset, it is
usually easy to recover from most replication problems. The Percona Toolkit
(formerly Maatkit) has two tools that make this fairly painless: pt-table-
checksum and pt-table-sync. The former checks for consistency using MySQL's
built-in checksumming abilities, and the latter repairs found inconsistencies.

Another useful tool is pt-slave-restart, which will limp you along and skip
replication errors automatically until you can repair the problem.
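
For reference, the manual equivalent of what pt-slave-restart automates is a
stock-MySQL incantation on the broken slave:

    
    
        STOP SLAVE;
        SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;  -- skip one bad binlog event
        START SLAVE;
    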

------
neya
Wow, Kickstarter runs on Rails? I never knew...anyone have an idea of how many
users they have / how many page views they serve? Just curious...

~~~
jonathanjaeger
This is the Kickstarter stats page: <http://www.kickstarter.com/help/stats>

It doesn't cover traffic or insight into how they keep the site up and running
with that much traffic and that many users, but it's still interesting to see
everything about the projects there.

~~~
neya
Thank you for the link :)

