Migrating 1200 databases from MySQL to Postgres (psyenix.blogspot.com)
250 points by yupyup on Aug 16, 2017 | 82 comments



I wrote an open-source tool that migrates between any two relational databases.

https://github.com/seanharr11/etlalchemy

This includes automatic migration of the schema, constraints, indexes and obviously data. It leverages fast bulk loading tools like mysqlimport, and PostgreSQL's COPY FROM. It doesn't need any config. Just plug and play.
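For a sense of the mechanism, this is roughly the kind of bulk load it delegates to on the Postgres side (a minimal sketch; the table and file names here are made up):

    -- Load a CSV dump produced on the MySQL side into an
    -- already-created Postgres table in one bulk operation.
    COPY customers (id, name, created_at)
    FROM '/tmp/customers.csv'
    WITH (FORMAT csv, HEADER true);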

The tool uses SQLAlchemy to abstract column types, and handles various column type conversions between any flavor of SQL.

The main things it still needs are tests, and scalability support for tables bigger than a few GB!


Going to give this a try. Thanks for posting!


Thanks much - going to test it between MSSQL and MySQL in a few days here!


Let me know how it goes...please open issues if you find them!


I'm working on migrating an app right now (just one MySQL database) but can highly recommend [pgloader](http://pgloader.io/). It has a bunch of built-in and configurable rules to cast different column types to PostgreSQL and the developer has been very responsive in helping me make weird column types work, too.


Pgloader can be very powerful. Despite never having used Common Lisp, I found it easy to add custom column conversion functions without modifying pgloader itself. You can write additional Lisp files with custom procedures and load them at runtime.


If I may ask, what are your reasons to migrate from MySQL to Postgres?

We run MySQL in production and I'm in the process of de-normalizing it. If I had compelling reasons I might push for a migration to Postgres, but I've never found any genuine ones. Uber even migrated the opposite way, from Postgres to MySQL (although I think Reddit engineers disagreed with their reasons for the move).


+1 for PGLoader. Still wrestling with getting it to work for a conversion, but the experience has been a pleasure so far and surprisingly easy.


I like the approach and think something like this is fun, but I don't understand the business case at all.

I would have migrated MySQL 5.5 to 5.6 first, then cleaned the shit out of it, normalized everything, added a few indexes, and suddenly realized that there is not much left to do.

I like working with PostgreSQL, but if you have 1.2k dbs which are more or less not cleaned up and crappy, what are they used for anyway?


I agree. It sounds like they decided to do a cleanup ahead of an investment and not much of a real business case. Your migration path seems much, much simpler - and that's assuming the cleanup is even needed in the first place.


One thing the author did not touch on: since the tables were denormalized-ish in the original MySQL DB, did his application lose significant performance by having to perform joins for every single query in the renormalized PostGres instance? Or were the DBs small enough and properly indexed that it didn't really matter?

It might have been worth testing whether it was worth it to materialize certain views and refresh them every 15-30 minutes or so.
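Something like this, as a minimal sketch (the view and table names are hypothetical):

    -- Precompute the expensive join once, then refresh on a schedule
    CREATE MATERIALIZED VIEW order_summaries AS
      SELECT c.id, c.name, count(o.id) AS order_count
        FROM customers c
        JOIN orders o ON o.customer_id = c.id
       GROUP BY c.id, c.name;

    -- Refreshed every 15-30 minutes from cron or the app; CONCURRENTLY
    -- (9.4+) avoids blocking readers but requires a unique index
    CREATE UNIQUE INDEX ON order_summaries (id);
    REFRESH MATERIALIZED VIEW CONCURRENTLY order_summaries;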


Alternatively did the application speed up after it was normalized? A normalized data set can be substantially smaller which may allow the entire data set to fit in memory.

Also, it's officially "PostgreSQL" or informally "Postgres". Not "PostGres".


At least in my experience, joins in 100s of GB databases are not a performance problem. I've always been somewhat baffled by the claims I see pretty regularly that joins are a major performance concern. Was this a big deal in the past?


> Was this a big deal in the past?

In MySQL, yes. It lacked the concept of foreign keys.


Lack of foreign keys does not affect performance.

In fact, having foreign keys is a net-negative performance-wise because it means that each foreign key must be checked for every insert/update/delete.


Foreign keys can improve performance, in some cases:

https://stackoverflow.com/a/8154375/533120


Errm, we're talking about performance of joins


Joins will use the regular index matching it against the join condition just like a where clause would.
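For example, with nothing but an ordinary index on the joined column (hypothetical schema), and no foreign key anywhere:

    CREATE INDEX orders_customer_id_idx ON orders (customer_id);

    EXPLAIN
    SELECT c.name, o.total
      FROM customers c
      JOIN orders o ON o.customer_id = c.id
     WHERE c.id = 42;  -- plan uses orders_customer_id_idx for the join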


They don't help with that either.


Perhaps parent is conflating indexes with constraints? I believe InnoDB creates indexes automatically with foreign keys.
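It does (InnoDB requires an index on the referencing column and creates one if needed); a quick sketch with made-up tables:

    -- MySQL/InnoDB: declaring a FOREIGN KEY creates an index on the
    -- referencing column if a usable one doesn't already exist
    CREATE TABLE parent (id INT PRIMARY KEY) ENGINE=InnoDB;
    CREATE TABLE child (
      id        INT PRIMARY KEY,
      parent_id INT,
      FOREIGN KEY (parent_id) REFERENCES parent (id)
    ) ENGINE=InnoDB;

    SHOW INDEX FROM child;  -- includes an index on parent_id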


One thing that Postgres lacks is accent insensitive collations. Having clients with databases in Spanish, this is one of the reasons I wouldn't consider migrating to Postgres. I know I can use the unaccent extension, but I consider it a poor substitute to proper collations. I guess this isn't a problem for most people because it's never mentioned.


It appears work is active (May 2017) in this area, but accent-insensitive collations are still a ways off: https://blog.2ndquadrant.com/icu-support-postgresql-10/

edit: fulafel corrects me below that you can get accent-insensitive collations working using PostgreSQL 10's new ICU collation support. Note that PostgreSQL 10 is currently pre-release.


That post says that they currently work, but there is a new OS-independent collation method coming.

Edit re your edit: the already working collations are in PG 9.x - the OS independent collations are coming in 10.x. (Hence the post title, "More robust collations with ICU support in PostgreSQL 10")


But do these "already working" collations include accent-insensitive variants? I didn't think so, and I don't think they are coming in PostgreSQL 10 either. Do you have other information on this?


They do. The problem is that equality is still strict binary equality for text, for obscure technical reasons. So while you can get extremely flexible customizations of collation with ICU [1], you can't have case-insensitive equality, or equality that ignores accents. This is recognized as a problem, and is probably going to be fixed in v11.

[1] https://www.postgresql.org/message-id/CAH2-Wz=bcgmk97YaZ3rs4...
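For illustration, the v10 syntax looks like this (the collation name is made up; "ks-level1" asks ICU to weigh only base letters when sorting):

    -- PostgreSQL 10: define an ICU collation; ORDER BY honors it,
    -- but '=' on text remains strict binary comparison
    CREATE COLLATION es_ci_ai (provider = icu, locale = 'es-u-ks-level1');

    SELECT name FROM clients ORDER BY name COLLATE "es_ci_ai";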


> The problem is that equality is still strict binary equality for text, for obscure technical reasons

AKA performance.


My misunderstanding seems to have been that "accent-insensitive" turns out to be a precise technical term from MySQL meaning "sort UTF-8 using the American locale and collation, but pretend accented characters are equal to their non-accented versions", which is not a standard collation order in any locale.


I think this is incorrect. In MySQL, utf8_spanish_ci and utf8_swedish_ci have different accent rules, for example.


But then what would be the use case for accent-ignoring Spanish collation rules, rather than the default Spanish collation rules that sort accented chars according to normal Spanish sorting? This sounds strictly inferior in any human-facing application.


The docs say that locales and LC_COLLATE work, are you sure?


> One thing that Postgres lacks is accent insensitive collations.

So what if the functionality is not in core, but in an extension? You make it sound as if it does not exist at all and all we can do is gnash our teeth in frustration.

http://www.public-software-group.org/pg_collkey

First released in 2006, for PostgreSQL 8.


Why not just use a functional index? I'd expect it to have much better performance.
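A sketch of that approach (unaccent() isn't marked IMMUTABLE, so the usual trick is an immutable wrapper; the table and function names are hypothetical):

    CREATE EXTENSION IF NOT EXISTS unaccent;

    -- Immutable wrapper so the function is allowed in an index
    CREATE FUNCTION f_unaccent(text) RETURNS text
      LANGUAGE sql IMMUTABLE AS
    $$ SELECT unaccent('unaccent', $1) $$;

    CREATE INDEX clients_name_ai_idx ON clients (f_unaccent(lower(name)));

    -- Queries must use the same expression to hit the index
    SELECT * FROM clients
     WHERE f_unaccent(lower(name)) = f_unaccent(lower('José'));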


If you are migrating to AWS RDS, I recommend the AWS Database Migration Service. I migrated my live database (~50GB) from MySQL to Postgres (both RDS) with zero downtime.

I used the AWS Schema Conversion Tool for the initial PG schema, and customized the generated schema for my specific needs.


A few of these steps could be solved with this tool, including schema/index migrations, and even initial (fast) data transfer: https://github.com/seanharr11/etlalchemy


I'm not against the idea of using Postgres, but I have a sneaking suspicion that you could have gotten that 30% speed benefit from MySQL if you did some performance work on it. In other words, I don't feel that migrating from one relational database to another similar relational database was worth that much trouble.


That's interesting, as my dev experience would allow me to tune queries better in Postgres (the visual query planner makes it easier and there are more join types), while tuning through configuration on either database would involve a lot more trial and error. I would be more likely to know what to do after migrating.


I went through a similar process a few years ago (4k dbs into a single db, all MySQL). You'll get something working fairly quickly and then spend days debugging data anomalies that have crept in over the years.

We ended up with a Frankenstein mixture of bash scripts, a sequence of SQL transformations, some Python, and some PHP to decode some of the data that had been stored in pickled / PHP-encoded packages.

It's not something I want to repeat any time soon :-)


It truly sounds like a nightmare. How long did it take to move everything to psql, and how long until it was stable enough that you didn't have to keep an eye on it?


To clarify, it all stayed on MySQL; we just pulled everything into a single db.

I can't remember exactly, but the whole project was done in a real hurry, about 4 weeks I think. It included a rewrite of the application / website (about 20k lines of backend code, HTML and JS / CSS), migration of all the data, creation of new transcoders, and migration of a lot of asset/video data from another server.

No idea how we got through it all in retrospect, but I had to. A few days after we finished I got married and went off on my honeymoon, so I guess I decided it was stable enough the day I released it...! :-)


One of the main things I ran into when migrating from MySQL to Postgres was that the default text columns (TEXT, VARCHAR, etc.) behave differently when searching. MySQL matches text case insensitive, while Postgres matches it case sensitive. This resulted in searches for content suddenly not returning data. Luckily there's an official CITEXT extension for Postgres [1], which matches text case insensitive and adds the correct indices for doing so efficiently.

[1] https://www.postgresql.org/docs/9.1/static/citext.html
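Usage is close to drop-in; a minimal sketch with a made-up table:

    CREATE EXTENSION IF NOT EXISTS citext;

    -- citext compares case-insensitively in WHERE, UNIQUE, etc.
    CREATE TABLE users (
      id    serial PRIMARY KEY,
      email citext UNIQUE
    );

    INSERT INTO users (email) VALUES ('Alice@Example.com');
    SELECT * FROM users WHERE email = 'alice@example.com';  -- matches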


"MySQL matches text case insensitive"

This is of course only the case when you use a case-insensitive collation (_ci).

But I agree this can be a problem because case insensitive collations are used a lot in MySQL.

Other collations can be trouble too. For example, the collation that returns results for both 'ß' and 'ss' (German).
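For example (assuming a utf8 connection charset):

    -- MySQL: equality follows the collation, not just for case
    SELECT 'straße' = 'strasse' COLLATE utf8_unicode_ci;  -- 1: ß equals ss
    SELECT 'résumé' = 'resume'  COLLATE utf8_unicode_ci;  -- 1: accents ignored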


I'm not entirely sure what the technical reason for this change was. Of course, cleaning up your DBs and whatnot is a good thing. But why did they move to Postgres instead of simply cleaning up?


Would love to do this to the monstrosity I inherited at my new job, but it's a Sisyphean task.

One day.


One time I saw a database, with a single table, with a single text column that contained the old database.


This sounds similar to what I saw at a client a few years back. All XML responses from a REST API were inserted into a single VARCHAR(MAX) column (one row per response, but a response could have multiple "records") in a SQL Server database.

All interaction with this was through a byzantine web of XPath expressions using the severely neutered dialect of XQuery that shipped with SQL Server 2008.

Apologies for the mini-rant, your comment appears to have triggered some painful memories.


It's databases all the way down...


You should see the one here. An old flat-file database monstrosity from the 80s, imported into Oracle, complete with all of the duplicated data. It's the most non-relational relational database ever.


Would a database stored in an XML file, shared between 60+ clients over NFS, beat that for monstrousness?


Well, this is used now by about 4000 clients, so you be the judge. There are so many indexes, and so much spaghetti, just to make sure every possible "relationship" is accounted for. Amazing it works at all.

Just using an XML file is pretty crazy though.


How concurrent is that? I can't imagine how writing to the file is even achieved.


"Badly". At least once a week someone managed to overwrite a change another client had just made. When I left, they were running a cron to copy it every minute as a "quick restore" when it happened.


Maybe there is an email thread shared between all clients for acquiring / releasing the write lock on the XML file.


Semaphore-as-a-service Slack chat bot.


Seen that.

Shared drive with Excel files; people would shout at each other to 'release the lock'.


Yeah, first real job had the customer database as a VB frontend onto an Access file shared over {whatever Win95 was using in 1996-97}. After a few snafus, the CS / Sales people learned that they had to single-task with the database.


That made me snort my coffee.


Lol, if it sent emails, I might know of such a system


This is a special level of hell.


I have 177 tables, 764 stored procedures and 44 joins.

Every single table has either a uniqid() or a UUID() which isn't the primary key (they all have auto-incrementing integer keys) but is 'joined' on in code.

When he needs to 'join', he uses coalesce() on selects.

Referential integrity doesn't exist.

Oh, and the cherry on the shit sandwich: we have four separate systems that talk to this database, written in 3 different languages: Java (Android scanners), C# (factory scanners) and PHP (the main system).

So there is no way to accurately know which SPs are used by what without grepping the entire codebase looking for call <foo>.

The PHP is badly written in the old pure-procedural style, running on an outdated version of PHP/Debian/MySQL.

Basically if you took the absolute worst approach to everything this would be the end result.

It's good fun!


Is it a universally accepted truth now that it's generally better to use Postgres than MySQL?

Would anybody here use MySQL for a new project? If so, why?


Having used both from the prototyping stage to very large scale my answer: It doesn't matter.

They are both plenty feature complete to work well for relational data type problems. Grab the one you have more exposure to and you'll be fine.

That said, I'm still using MySQL/InnoDB (latest Percona) because the built-in replication has had more time to bake and its concurrency model (think 100k+ qps) is plain better. For further reading see:

https://eng.uber.com/mysql-migration/


Yes, I would use MySQL. I'm very familiar with it for one, and the extra storage engine options turn a sure thing into a no-brainer decision.

Ultimately, I think familiarity is and will always be the biggest driver for experienced devs who choose a DB.


What storage engines and options do you use?


InnoDB is my default go-to; TokuDB when on-disk size is a consideration; the ARCHIVE engine for the exceptional cases where I need even more rarely-accessed data on disk; and MyISAM for its full-text search capabilities (usually denormalized from a backing InnoDB table). Those are the big ones.

There are also NDB and Galera Cluster, but you want a specific use case to justify the added complexity and overhead.

I've seen Blackhole used in production, but that was an odd one involving a... unique... replication setup. The CSV engine is pretty awesome for huge data imports, but again, pretty unusual for actual use.
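For anyone unfamiliar, the engine is chosen per table; a sketch with made-up tables:

    -- Hot transactional data on InnoDB
    CREATE TABLE orders (
      id   BIGINT PRIMARY KEY,
      body TEXT
    ) ENGINE=InnoDB;

    -- Compressed, insert/select-only cold storage
    CREATE TABLE orders_log (
      logged_at DATETIME,
      body      TEXT
    ) ENGINE=ARCHIVE;

    -- Denormalized copy kept on MyISAM for FULLTEXT search
    CREATE TABLE orders_search (
      order_id BIGINT,
      body     TEXT,
      FULLTEXT KEY ft_body (body)
    ) ENGINE=MyISAM;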


XtraDB if you can, InnoDB if you can't.


FWIW, MariaDB 10.2 no longer defaults to XtraDB, now considering InnoDB mostly on-par. [0]

[0] https://mariadb.com/kb/en/mariadb/why-does-mariadb-102-use-i...


I chose MySQL (well, MariaDB) for a project that started in 2013. I chose it because I have the most experience maintaining it and have a pretty good handle on how to tune it.


Ditto. I still choose it for small-to-medium-sized webapps because, let's face it, the data storage is not their pain point.

MySQL is fast enough, usually has enough features, I know how to tune it and its queries, and I normally only bump into 2-3 things per project that would've been easier with PostgreSQL. And I can work around them.

The backwards-incompatible SQL_MODE changes between MySQL 5.6 and 5.7 made me reconsider, as legacy apps needed updating.

tl;dr: yes. But PostgreSQL would be easier for some items


> The backwards-incompatible SQL_MODE changes between MySQL 5.6 and 5.7 made me reconsider, as legacy apps needed updating.

Uhm, err, that change is to stop the database from silently throwing away your data. It's not about "fast enough" or "easy". It's about your database actually saving your data.

If you're happy for your "database" to keep quietly throwing away data in order to keep legacy apps "working", you can just disable strict mode.
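The difference in a minimal sketch:

    -- With strict mode off, MySQL truncates silently (warning only)
    SET sql_mode = '';
    CREATE TABLE t (c VARCHAR(3));
    INSERT INTO t VALUES ('abcdef');  -- stores 'abc', with a warning

    -- 5.7's default includes STRICT_TRANS_TABLES: the same insert errors
    SET sql_mode = 'STRICT_TRANS_TABLES';
    INSERT INTO t VALUES ('abcdef');  -- ERROR 1406: Data too long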


It's not just changes to stop data being thrown away. This issue has been painful: https://github.com/doctrine/doctrine2/pull/6143

Across multiple businesses, they've upgraded servers and got MySQL 5.7 installed, only to wonder why the app stopped working. People didn't expect that from a point release / the default new operating-system package. We could argue the point, but that's the expectation of businesses vs software engineers.


Nope, because using MySQL instead of Postgres essentially cripples your dev tools once you know about everything PG opens up for you.

http://www.brightball.com/articles/why-should-you-learn-post...


Most e-commerce systems start off on Magento, which is MySQL-only. So a LOT of MySQL projects are simply the answer to:

1. Did you use a Magento/Drupal/WordPress/etc. PHP framework?

2. Did you host on a $20-per-month server with unlimited bandwidth and a free database?


Like other comments here state: in general, use what you know. That said, PostgreSQL has ~always been the better RDBMS. Some would argue MySQL has at times been a better ~key-value store with an ~SQL interface.

Apparently PostgreSQL is difficult to run in a properly restricted BSD jail. Other than that, I'd be hard-pressed to come up with reasons to prefer MySQL on technical merit over either PostgreSQL or, say, Redis or something.


I'm running PostgreSQL in a FreeBSD VNET jail without issue. The network stack is not shared, as it's a VNET jail. I have it set for per-jail SysV IPC, new in FreeBSD 11, which was the last thing I needed to fully isolate it from other jails.

Is there anything else that's needed to properly restrict the jail?


Mostly that, AFAIK, a jail with sysvipc set isn't really isolated any more: it works, but it defeats some of the purpose of using a jail in the first place.

https://forums.freebsd.org/threads/59371/

Apparently this can (should?) be tweaked a bit since 11-RELEASE:

https://www.freebsd.org/cgi/man.cgi?query=jail&sektion=&n=1

> allow.sysvipc: A process within the jail has access to System V IPC primitives. This is deprecated in favor of the per-module parameters (see below). When this parameter is set, it is equivalent to setting sysvmsg, sysvsem, and sysvshm all to ``inherit''.

I interpret that as it still being the case that by setting this to the equivalent of the deprecated setting, one loses much of the protection a jail normally gives.


https://grimoire.ca/mysql/choose-something-else

That's essentially my reasons for preferring PostgreSQL.


> Now is when one of the main reasons to take Postgres as our new database server. Postgres allows the use of Views(include link) supporting INSERT, UPDATE and DELETE meanwhile the relation between views and tables are one to one, ....isn’t it awesome?

Just a heads up, I think you missed a link in there
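For context, the feature the article seems to mean is that since 9.3 a simple single-table view is automatically updatable; a sketch with made-up names:

    CREATE TABLE customers_raw (
      id     serial PRIMARY KEY,
      name   text,
      legacy text
    );

    -- No triggers or rules needed: a view over one table,
    -- without aggregates etc., accepts writes directly
    CREATE VIEW customers AS
      SELECT id, name FROM customers_raw;

    INSERT INTO customers (name) VALUES ('Ada');       -- routed to customers_raw
    UPDATE customers SET name = 'Ada L.' WHERE id = 1;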


When talking about replaying live activity on the test system:

>We took the third option because the other two options were quite intrusive and they might interfere at the normal use of our app because in both cases they were going to be in between our users request and our app

One of those two rejected options was goreplay. It is passive, like tcpdump. Right from the docs:

"Gor is not a proxy: you do not need to put 3-rd party tool to your critical path. Instead Gor just silently analyzes the traffic of your application and does not affect it anyhow."


Now that is what I call a proper migration. Great content! I was wondering how somebody could end up with corrupt data; I guess it was probably due to MySQL not strictly enforcing types.



I tried to migrate from Firebird to Postgres once and failed; it needs more administration than Firebird.


What was the reason to first migrate and then repair the data? The repair scripts could've been run against the old setup too, and you would've had less stuff to migrate.



