Migrating Our Django App from MariaDB Galera to PostgreSQL and Patroni (pretix.eu)
81 points by _rami_ on March 11, 2018 | 55 comments



These traditional relational databases are fundamentally single-master systems. They have all gotten better at replication and failover, but multi-master is not something that can just be bolted on, and it will never go well.

Most companies don't really need multi-master either; the performance level of modern servers is so high that failover is usually good enough. If it's really needed, it's better to look at options like CockroachDB or TiDB, which are built from the ground up to be multi-master.

Also, it's unclear from the article why the followers would be so out of sync. Is there a really bad network? Especially if they set quorum writes on the master, at least one follower should always be up to date.


> Also, it's unclear from the article why the followers would be so out of sync. Is there a really bad network? Especially if they set quorum writes on the master, at least one follower should always be up to date.

Yeah, I should have gone into more detail there. I'm talking about cases like after an outage when there is a lot to catch up with, or worse, if there are conflicting timelines (e.g. after a power outage) and synchronization stops working completely -- the broken follower will still answer queries and I've not yet found a simple way to monitor this condition.


Why do you want SQLite for development? Running Postgres on your dev machines is a Docker one-liner that you can put in the README.md for your devs to copy-paste.

Most sane CI services also offer Postgres.


Our development setup currently does not need Docker at all. In fact, it does not need a lot more than git and a recent Python version: https://docs.pretix.eu/en/latest/development/setup.html

I agree that this is an opinionated setup and I do see the value of a development setup that mirrors production closely.

(1) During testing, in-RAM SQLite databases are really fast and perfectly capable of running tests in parallel. We have a test suite of ~2000 tests that currently needs ~4 min to run on a modern notebook with four threads, so this matters.

(2) pretix is an open source project. If we only had a small number of developers on an in-house team, I wouldn't care at all. However, there are many people contributing small and large features and fixes to pretix, and many of them are drive-by contributions: people fixing a problem they just experienced and then leaving again. I want to make it easy for these people, even if they use operating systems where installing Docker first might be a hoop they do not want to jump through.
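
For illustration, the switch looks roughly like this (a simplified sketch, not our actual settings.py; the environment variable name is made up):

    # Simplified sketch (not pretix's real settings.py): SQLite stays the
    # zero-dependency default for development and tests, while an environment
    # variable switches a developer over to PostgreSQL when needed.
    import os

    if os.environ.get("PRETIX_DB") == "postgresql":
        DATABASES = {
            "default": {
                "ENGINE": "django.db.backends.postgresql",
                "NAME": "pretix",
                "USER": "pretix",
                "HOST": "127.0.0.1",
            }
        }
    else:
        DATABASES = {
            "default": {
                "ENGINE": "django.db.backends.sqlite3",
                # For "manage.py test", Django automatically uses an in-memory
                # SQLite database, which is what keeps the test suite fast.
                "NAME": os.path.join(os.path.dirname(__file__), "db.sqlite3"),
            }
        }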

That said, deprecating MySQL at some point would also force all self-hosting users of pretix to go through all the steps we did here, so this wouldn't be viable in the short term.


I may add that our development setup has no external service dependencies; Redis is optional as well. Of course, we wouldn't recommend that in production, but we're kinda proud that it's possible.


I second this. I like to have my Django projects use SQLite and a Redis mock until I actually need Postgres- and Redis-specific functionality, at which point I switch to docker-compose.

Before that point, I enjoy the simplicity.


2000 tests in 4 minutes actually sounds really slow. We use Postgres and our Django tests run at roughly 9000/min on 3-year-old laptops.

Start your pg container once and use Django's -k (--keepdb) flag when running tests to avoid database setup time on every run. We also take care to separate DB access from business logic and to unit test those separately.


Sure, the main reason for this is that nearly all of our tests touch the database and operate on a comparatively high level; we have way more things in a fuzzy category between functional and integration tests than actual unit tests. This is of course a flaw in our test writing, but it's what we have right now ;)


Top-down integration tests might be slow, but they're a really top-notch way to actually know your code works in dynamic-language, MVC-style web apps. Can't blame you there.


Agreed, and having some experience with both MySQL and SQLite, I'd say running SQLite in testing cannot cover problems you'd find with MySQL (e.g. SQLite is so limited with concurrency that you cannot use it to confirm concurrency issues will be dealt with in your production MySQL). I would assume the same holds for SQLite<->PostgreSQL.


Sure, there are often differences between the databases that cause problems. Our Travis CI configuration runs our test suite against all three different databases: https://travis-ci.org/pretix/pretix/builds/351704035

However, having SQLite at hand for local development/testing makes things easier.

In my experience, real-world concurrency issues (that you are talking about) are always hard to find in manual/automated testing, or do you have a good strategy for finding them?


> In my experience, real-world concurrency issues (that you are talking about) are always hard to find in manual/automated testing

I agree, and I don't have a good, generic strategy for that. I didn't give the best example.


Depends.

For apps that I distribute generically for use with Django, I usually stick only to ORM features that are supported across all backends. And then I just have tox set up to run the tests with an in-memory SQLite database, since that's fast (and Django's test suite already exercises its ORM on all the backends, so I can trust the ORM will work on other DBs).

My personal blog, though? I run Postgres locally on my laptop (via Postgres.app) to make the dev/testing environment as similar as possible. Most places I've worked have done similarly, running either Postgres or MySQL locally, and I have run into bugs that only manifested on MySQL (not because of bugs in Django, but because of MySQL acting in the way MySQL acts).


Agreed. I originally did this, but the first time I ran into a prod bug because SQLite and Postgres differ in their implementation of some more advanced methods, I immediately moved off it and never looked back.

A lot of people stick with SQLite in dev because a new Django project defaults to it.

Even if you don't use Docker, `brew install postgresql` is just as simple. The commands to set up a user are like 3 lines. 20 minutes of doc searching and a one-time README setup.


This doesn't need Docker, and not even 20 minutes.

    initdb -D mypostgresdatadir             # creates the DB
    postgres -D mypostgresdatadir -p 1234   # runs DB on given port
Done.

On systems like Ubuntu, these binaries live at e.g. `/usr/lib/postgresql/9.5/bin/postgres`.


I cringe at the thought of using so few features of Postgres that you can actually use SQLite.


Me too, at times. Django makes it bearable though, and going forward we'll probably use more advanced features of PostgreSQL in places where we can fail gracefully or fall back to simpler behaviour on other databases.
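
As a rough, hypothetical sketch of what I mean (not actual pretix code; the model fields are made up):

    # Hypothetical sketch: use a PostgreSQL-only feature when we are running
    # on PostgreSQL and fall back to portable ORM behaviour on SQLite/MySQL.
    from django.db import connection

    def search_orders(qs, term):
        if connection.vendor == "postgresql":
            # Full-text search via django.contrib.postgres, only on PostgreSQL.
            from django.contrib.postgres.search import SearchVector
            return qs.annotate(search=SearchVector("code", "email")).filter(search=term)
        # Simpler fallback that works on every backend Django supports.
        return qs.filter(code__icontains=term)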


[flagged]


Moving to all-Postgres certainly is a future option for us, but forcing every installation of pretix (an open source application) to go through this migration is not something we'll do overnight ;)


Please be civil.


> We talked to a few database experts at the side of one or two conferences and this made us lose any trust that Galera is able to perform even remotely well in terms transaction isolation, constraint enforcement, etc., even though we did not observe such problems in practice ourselves.

There's a statement really short on specifics. Galera has nothing to do with constraint enforcement; that's a function of the underlying MySQL/MariaDB database, and the major constraint enforcement issue there is CHECK constraints, which only MariaDB 10.2 actually supports, and that's a very recent thing. Similarly with transaction isolation: that's also a function of the MySQL/MariaDB engine, with the exception of Galera nodes accepting write-sets from other nodes. While those introduce the concept of your whole transaction being rejected, they don't introduce any other negative isolation effects, except if you were hoping to have your transactions talk to each other with dirty reads.

Like a lot of these "we migrated from X to Y" stories, all the negatives they refer to here that are Galera-specific are resolvable. The "MySQL sucks" part of it, e.g. the issue with the query planner, sure. Multi-master PostgreSQL will be extremely useful if a product as easy to use as Galera is produced.


Here is how to do it with zero downtime:

Each customer has a completely separate data set, except maybe for login information. This makes it a good candidate for sharding. Sharding is also a good idea because eventually one big ticket sale event is going to hog the whole system and having a way to isolate it allows the rest of the customers to keep working.

1) Add a shard ID to the customer table. If it's blank, use the existing database.

2) Create additional master-slave pairs with the same DB schema

3) Create a tool to migrate customer data between databases

4) Migrate customers individually, creating minimal downtime during their transition. With streaming replication this can be turned to zero downtime.
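
In Django terms, steps 1) and 2) map pretty naturally onto a database router; a rough, hypothetical sketch (model, field and alias names are made up):

    # Hypothetical sketch of steps 1) and 2): a Django database router that
    # sends a customer's data to the shard recorded on the customer row and
    # falls back to the existing "default" database when no shard is assigned
    # yet. Model, field and alias names are invented for illustration.
    class CustomerShardRouter:
        def _shard_for(self, hints):
            instance = hints.get("instance")
            customer = getattr(instance, "customer", None)
            if customer is not None and customer.shard_id:
                return "shard_%d" % customer.shard_id
            return "default"

        def db_for_read(self, model, **hints):
            return self._shard_for(hints)

        def db_for_write(self, model, **hints):
            return self._shard_for(hints)

        def allow_relation(self, obj1, obj2, **hints):
            # Only relate objects living in the same database.
            return obj1._state.db == obj2._state.db

Each shard alias would point at its own master-slave pair in DATABASES, and the migration tool from step 3) would flip shard_id once a customer's rows have been copied over.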

WordPress has similar scaling issues and has talked a lot about how they do it.


I'm curious whether pretix attempted using MariaDB/Galera with single-write mode, such that the three servers still run synchronous replication, but only one of them gets the writes.

I suspect (disclaimer: not running Galera myself) "WSREP has not yet prepared node for application use" and deadlocks would be mitigated, and am curious to learn if that were indeed the case.


No, we haven't. You are absolutely right that this would probably mitigate the deadlocks. However,

(1) The deadlocks/WSREP were a problem annoying us, while the query performance was a problem leading to real customer complaints. (2) As I see it right now, we would have needed to take complicated steps to ensure that all application nodes always connect to the same database node; just a hardcoded priority would probably not be enough. (3) A mechanism for (2) would likely make us lose more of the advantages of Galera, e.g. short failover downtimes.


Since you already use HAProxy you should get that almost for free. Your app would talk to the database by communicating through HAProxy, which would direct traffic to the single writer; HAProxy's health tests would detect who the single writer is, and would adapt in the event of failover.


In my experience it did indeed remove a lot of deadlocks when we went from multi-master writing to electing a single "writable" node. So in the end it is pretty much the same as deploying some read-only slaves, plus automatic master promotion.

But not quite: even with this setup you might get a lot more weird deadlocks compared to a normal master-slave setup, even when running SELECTs. Before we ditched Galera we were never really able to solve a couple of classes of deadlocks, one involving CREATE TEMPORARY TABLE AS SELECT. Also, not all Galera deadlocks are really deadlocks; some are actually Galera certification errors, reported as deadlocks to the client, so this distinction must be drawn as well.

Remember, a slave just stupidly replays stuff from the binlog onto a known state in a single thread, in the same serialised order that the master has already been able to commit. Galera, on the other hand, will attempt to certify and commit individual transactions. Certification does not really guarantee that the transaction will be able to be committed when it comes to that, not even in the case of a single writer node, in many circumstances.

I'd say, after working with Galera/PXC for several years at considerable scale, that the premise of multi-master writing only works in very narrow domains, and even then you need to write a lot of application code to recover from Galera idiosyncrasies.


fwiw, CREATE [TEMPORARY] TABLE AS SELECT can be problematic in MySQL even without Galera. It is outright disallowed if MySQL 5.6+ Global Transaction ID (GTID) is enabled, for example. The issue is that it's combining DDL and DML in a single statement/transaction, but DDL is still not yet inherently transactional in MySQL.

The workaround is just to use two statements: CREATE TABLE (or possibly CREATE TABLE LIKE) and then a separate INSERT...SELECT.
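
In a Django project the same two-statement workaround could be wrapped in a migration, roughly like this (a sketch; table and migration names are made up):

    # Hypothetical sketch: split CREATE TABLE ... AS SELECT into a GTID-safe
    # CREATE TABLE ... LIKE plus a separate INSERT ... SELECT, wrapped in a
    # Django migration. Table and migration names are made up.
    from django.db import migrations

    class Migration(migrations.Migration):
        dependencies = [("orders", "0042_previous")]

        operations = [
            migrations.RunSQL(
                sql=[
                    "CREATE TABLE orders_order_archive LIKE orders_order;",
                    "INSERT INTO orders_order_archive SELECT * FROM orders_order WHERE expired = 1;",
                ],
                reverse_sql=["DROP TABLE orders_order_archive;"],
            ),
        ]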

That all said, yes, Galera should ideally handle this gracefully instead of ever deadlocking. But I don't have enough Galera experience to comment on that aspect.


The docs state that creating tables is not supported, but I am unsure about the '[TEMPORARY]' part. In practice, it is not verboten; see below for 5.6:

  mysql> SHOW VARIABLES LIKE '%gtid%';
  +---------------------------------+-----------+
  | Variable_name                   | Value     |
  +---------------------------------+-----------+
  | binlog_gtid_simple_recovery     | OFF       |
  | enforce_gtid_consistency        | ON        |
  | gtid_deployment_step            | OFF       |
  | gtid_executed                   |           |
  | gtid_mode                       | ON        |
  | gtid_next                       | AUTOMATIC |
  | gtid_owned                      |           |
  | gtid_purged                     |           |
  | simplified_binlog_gtid_recovery | OFF       |
  +---------------------------------+-----------+
  9 rows in set (0.00 sec)
  
  mysql> CREATE TABLE test AS SELECT 1;
  ERROR 1786 (HY000): CREATE TABLE ... SELECT is forbidden 
  when @@GLOBAL.ENFORCE_GTID_CONSISTENCY = 1.
  
  mysql> CREATE TEMPORARY TABLE test AS SELECT 1;
  Query OK, 1 row affected (0.03 sec)
  Records: 1  Duplicates: 0  Warnings: 0
So perhaps it is still 'not supported', as in our experience it does indeed cause some issues, probably avoidable if you are willing to relax isolation to READ COMMITTED.

I wonder if REPEATABLE READ is a good default anyway.
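
If I recall correctly, Django 2.0's MySQL backend accepts an isolation_level option in OPTIONS, so something along these lines should let you experiment (a sketch; the database name is a placeholder):

    # Minimal sketch, assuming Django 2.0+'s MySQL backend and its
    # "isolation_level" option: run connections at READ COMMITTED instead of
    # the MySQL/MariaDB server default of REPEATABLE READ.
    DATABASES = {
        "default": {
            "ENGINE": "django.db.backends.mysql",
            "NAME": "pretix",
            "OPTIONS": {
                "isolation_level": "read committed",
            },
        }
    }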


the "WSREP" error is something you never have to worry about with Galera assuming you are correctly deploying beneath a proxy server like HAProxy that handles this aspect for you. The proxy server is set up to only provide nodes that are available.

We also run some applications in "single-node" mode where HAProxy routes all connections to one node at a time, but if a node goes down it switches the app to another one. That's the whole point of HA.


> MariaDB Galera is really easy to set up and maintain.

I take exception to that. When you log into all of your nodes after a network snafu and discover that every single node has attempted to perform a full state transfer, involving the destruction of the data directory...

  1. rm /var/mysql
  2. Attempt to do FST
  3. Crap happens
After a while, all nodes were left with no data. Off to backups :)

I am sure there is a way to prevent that, but Galera is still not easy ops-wise.


Maybe it's my small-scale experience, but so far, multi-master relational databases with automated failover seem like a lot of risk for just a little payoff. If you can handle the risk, and the operational/development cost, and you need the payoff, go for it. I'm not in that spot.

If I have a master01 with master02 replicating as a standby, I can switch between these two masters within 5-15 minutes, depending on my setup and at what infrastructural level I do the switch: I could reconfigure the application, switch a DNS entry, or use a load balancer like MaxScale. It's downtime, but it's a low-risk recovery with well-known impacts.

With a multi-master setup, I have to do at least two things. First, I must ensure my application's transaction safety: read-write splits with an application with bad transaction management are "fun", and write-splitting will end up with even more of a mess. And then I need to set up and operate the multi-master setup, which is a non-trivial decision and selection imo. Just look at the number of possible solutions for Postgres.

This in turn allows the system to automatically fail over in case of trouble, which my current infrastructure would have had to do 3 times over two years, and it would have helped 2 times at most. Except if the failover itself fails and ends up harder to handle than the database failure, like in your case or in other really scary Postgres failover horror stories.

And interestingly enough, in our B2B context, our customers actually prefer a well-known, low-risk failure plan, even if it is 30 minutes of outage.


This is very much our experience as well. As complexity increases, the marginal utility goes down, as operating concerns skyrocket.

You really need pretty special operational needs to run multi-master relational databases.

It is an engineering tradeoff between complexity and fragility.

In our case Galera/PXC ended up with a significantly worse availability record than our previous simple master-slave failover system. See my other comment for more details.


> When you log into all of your nodes after a network snafu and discover that every single node has attempted to perform a full state transfer, involving the destruction of the data directory...

I support customers that are using Galera at Red Hat, and while we have seen lots of network snafus and situations where individual nodes had to be brought back online, I've never seen all nodes attempt to state-transfer from each other like that, nor have I ever seen a data directory "destroyed". The problem situations with Galera in 100% of cases involve operators acting too hastily and not understanding what's going on as they do it.


By destroyed I mean the data directory is wiped as a first step towards a full state transfer.

By the time we restored stable shell access to the nodes, after a lot of intermittent networking, all 3 nodes had indeed wiped themselves. The cluster was no more.

Of course, it may also have had something to do with the SeveralNines CC that we were trialling at the time. Perhaps it was trying to do something hasty, as you put it.

In any case, a lot of moving parts to understand fully.


Galera's SST is going to rsync the files over, but I can't imagine how you got an SST to initiate from a wiped data directory that would somehow wipe another one. Would love to see a reproducer for the condition you describe.


Ouch. FTR, Patroni does a similar thing, but instead of rm /var/lib/postgres it does a mv /var/lib/postgres /var/lib/postgres.bak.date ;)


A much nicer approach, if you have the luxury of data disk usage below 50%.


Patroni has now implemented a sync endpoint that you can check with HAProxy: https://github.com/zalando/patroni/pull/578
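
For anyone curious, the check is just an HTTP status probe; roughly this, assuming the endpoint behaves as described in that PR (200 only on a synchronous standby):

    # Hypothetical sketch: the same probe HAProxy's "option httpchk GET /sync"
    # would do, written out in Python. Assumes Patroni's REST API (default
    # port 8008) returns HTTP 200 from /sync only on a synchronous standby.
    import requests

    def is_sync_standby(host, port=8008):
        try:
            return requests.get("http://%s:%s/sync" % (host, port), timeout=2).status_code == 200
        except requests.RequestException:
            return False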


Am I the only one thinking this is a terrible Buy versus Build decision?

You can get HTTP load balancing, PostgreSQL with automatic failover, and queuing up and running in a matter of minutes from AWS or GCE.


Perhaps they don't want to be locked into a vendor's ecosystem, or they want newer versions of Postgres or a different configuration?


I was wondering about this as well. I hope the OP responds.


Interesting article.

> Your browser is usually clever enough to use the working IP address if the other one is not working.

I found this not to work well. Have you tested it? For me it took 1-2 minutes for Chromium to switch over to the second entry after it had loaded the page from the first entry once (with a white page and spinning spinner during that time); this is usually not acceptable for site visitors, as they will get impatient and leave after just a couple of seconds.

What I'm using instead is multiple A records in combination with active removal of down IP addresses from the response, using e.g. Route53 health checks and 10-second TTLs.
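
Concretely that boils down to one health-checked record per frontend IP; a rough boto3 sketch of what I mean (zone ID, health check IDs and addresses are placeholders):

    # Rough sketch with placeholder IDs/addresses: two health-checked A
    # records for the same name, so Route53 stops returning an IP whose
    # health check fails. TTL stays at 10 seconds so clients re-resolve fast.
    import boto3

    route53 = boto3.client("route53")
    route53.change_resource_record_sets(
        HostedZoneId="Z_PLACEHOLDER",
        ChangeBatch={"Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "tickets.example.com.",
                    "Type": "A",
                    "SetIdentifier": "frontend-%d" % i,
                    "MultiValueAnswer": True,
                    "TTL": 10,
                    "HealthCheckId": hc_id,
                    "ResourceRecords": [{"Value": ip}],
                },
            }
            for i, (ip, hc_id) in enumerate(
                [("203.0.113.10", "hc-id-1"), ("203.0.113.11", "hc-id-2")], start=1
            )
        ]},
    )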


While I think the article provides good information overall, I would not recommend developing with SQLite if you're using some other database in production. I have run into too many cases where doing this leads to hard-to-find bugs.


Our CI runs our test suite against all of SQLite, MySQL and PostgreSQL, which has worked well enough that I can only remember one or at most two bugs caused by this, but yes, wise words.


Thanks for the detailed write-up, especially the dead-ends/failed approaches.


This link doesn't open in either Chromium 46 (ERR_SPDY_INADEQUATE_TRANSPORT_SECURITY) or Firefox 45 (shows just a blank page). Is it only me? I think there is something wrong with their server setup.


Worked for me because my ad-blockers stopped their analytics script. It loads over HTTPS, but seems to include code that will load some other URL over HTTP. That would be an HSTS violation and should cause your browser to bail out.


Hi! :) Do you explicitly see that HTTP URL, or is it just a guess? I'm struggling to find it right now :(


In line 43 of piwik.js I see an HTTP (not HTTPS) URL being constructed. I don't have the time to try to de-minify it and see what it does with that URL, but it definitely seems to be an HTTP URL.


That's weird indeed (although I don't see any connections to an http: URL in the dev console). If someone else reports a similar problem, I'll just remove Piwik altogether.


Hmm, weird, cannot reproduce this in either Chrome or Firefox, with or without ad blocker. Anyone else having this problem?


Did you evaluate Citus?


We haven't in depth, as they mostly advertise their sharding and distributed query features etc. and less their HA features (if that is even in Citus Community).

Sharding data is something we won't do before we absolutely need it, since disk capacity is not at all our constraint right now and our scaling problem isn't that it's hard to serve many tenants at once; rather, we may need to be able to serve one tenant with very high concurrency the minute their tickets go on sale (which would only affect one shard in a usual sharding setup).

Are you using Citus for HA? Would love to hear your thoughts!


Citus is a sharding solution; it's not about multi-master or high availability.


> MariaDB has a hard limit of using one index per table in a single query.

Wait, seriously? What is the rationale behind something like this?


This seems not to be generally true. MySQL > 5.5 and MariaDB > something less than 5.3 support index_merge in their query plans. Skimming the MySQL page, this allows MySQL to evaluate different parts of the WHERE clause on different indexes and merge the results later on. Most examples on the MySQL manual page also use just a single table in their queries.



