Migrating 1200 databases from MySQL to Postgres (psyenix.blogspot.com)
250 points by yupyup on Aug 16, 2017 | 82 comments



I wrote an open-source tool that migrates between any two relational databases.

https://github.com/seanharr11/etlalchemy

This includes automatic migration of the schema, constraints, indexes and obviously data. It leverages fast bulk loading tools like mysqlimport, and PostgreSQL's COPY FROM. It doesn't need any config. Just plug and play.
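For a sense of the mechanism, this is roughly the kind of bulk load it delegates to on the Postgres side (a minimal sketch; the table and file names here are made up):

    -- Load a CSV dump produced on the MySQL side into an
    -- already-created Postgres table in one bulk operation.
    COPY customers (id, name, created_at)
    FROM '/tmp/customers.csv'
    WITH (FORMAT csv, HEADER true);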

The tool uses SQLAlchemy to abstract column types, and handles various column type conversions between any flavor of SQL.

The main things it still needs are tests, and scalability support for tables bigger than a few GB!


Going to give this a try. Thanks for posting!


Thanks much - going to test it between MSSQL and MySQL in a few days here!


Let me know how it goes...please open issues if you find them!


I'm working on migrating an app right now (just one MySQL database) but can highly recommend [pgloader](http://pgloader.io/). It has a bunch of built-in and configurable rules to cast different column types to PostgreSQL and the developer has been very responsive in helping me make weird column types work, too.


Pgloader can be very powerful. Despite never having used Common Lisp, I found it easy to add custom column conversion functions without modifying pgloader itself. You can write additional Lisp files with custom procedures and load them at runtime.


If I may ask, what are your reasons to migrate from MySQL to Postgres?

We run MySQL in production and I'm in the process of de-normalizing it. If I had compelling reasons I might push for a migration to Postgres, but I've never found any genuine ones. Uber even migrated the opposite way, from Postgres to MySQL (although I think Reddit engineers disagreed with their reasons for the move).


+1 for PGLoader. Still wrestling with getting it to work for a conversion, but the experience has been a pleasure so far and surprisingly easy.


I like the approach and think something like this is fun, but I don't understand the business case at all.

I would have migrated MySQL 5.5 to 5.6 first, then cleaned the shit out of it, normalized everything, added a few indexes, and suddenly realized that there is not much left to do.

I like working with PostgreSQL, but if you have 1.2k dbs which are more or less not cleaned up and crappy, what are they used for anyway?


I agree. It sounds like they decided to do a cleanup ahead of an investment and not much of a real business case. Your migration path seems much, much simpler - and that's assuming the cleanup is even needed in the first place.


One thing the author did not touch on: since the tables were denormalized-ish in the original MySQL DB, did his application lose significant performance by having to perform joins for every single query in the renormalized PostGres instance? Or were the DBs small enough and properly indexed that it didn't really matter?

It might have been worth testing whether it was worth it to materialize certain views and refresh them every 15-30 minutes or so.
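Something like this, as a minimal sketch (the view and table names are hypothetical):

    -- Precompute the expensive join once, then refresh on a schedule
    CREATE MATERIALIZED VIEW order_summaries AS
      SELECT c.id, c.name, count(o.id) AS order_count
        FROM customers c
        JOIN orders o ON o.customer_id = c.id
       GROUP BY c.id, c.name;

    -- Refreshed every 15-30 minutes from cron or the app; CONCURRENTLY
    -- (9.4+) avoids blocking readers but requires a unique index
    CREATE UNIQUE INDEX ON order_summaries (id);
    REFRESH MATERIALIZED VIEW CONCURRENTLY order_summaries;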


Alternatively did the application speed up after it was normalized? A normalized data set can be substantially smaller which may allow the entire data set to fit in memory.

Also, it's officially "PostgreSQL" or informally "Postgres". Not "PostGres".


At least in my experience, joins in 100s of GB databases are not a performance problem. I've always been somewhat baffled by the claims I see pretty regularly that joins are a major performance concern. Was this a big deal in the past?


> Was this a big deal in the past?

In MySQL, yes. It lacked the concept of foreign keys.


Lack of foreign keys does not affect performance.

In fact, having foreign keys is a net-negative performance-wise because it means that each foreign key must be checked for every insert/update/delete.


Foreign keys can improve performance, in some cases:

https://stackoverflow.com/a/8154375/533120


Errm, we're talking about performance of joins


Joins will use the regular index matching it against the join condition just like a where clause would.
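For example, with nothing but an ordinary index on the joined column (hypothetical schema), and no foreign key anywhere:

    CREATE INDEX orders_customer_id_idx ON orders (customer_id);

    EXPLAIN
    SELECT c.name, o.total
      FROM customers c
      JOIN orders o ON o.customer_id = c.id
     WHERE c.id = 42;  -- plan uses orders_customer_id_idx for the join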


They don't help with that either.


Perhaps parent is conflating indexes with constraints? I believe InnoDB creates indexes automatically with foreign keys.
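It does (InnoDB requires an index on the referencing column and creates one if needed); a quick sketch with made-up tables:

    -- MySQL/InnoDB: declaring a FOREIGN KEY creates an index on the
    -- referencing column if a usable one doesn't already exist
    CREATE TABLE parent (id INT PRIMARY KEY) ENGINE=InnoDB;
    CREATE TABLE child (
      id        INT PRIMARY KEY,
      parent_id INT,
      FOREIGN KEY (parent_id) REFERENCES parent (id)
    ) ENGINE=InnoDB;

    SHOW INDEX FROM child;  -- includes an index on parent_id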


One thing that Postgres lacks is accent insensitive collations. Having clients with databases in Spanish, this is one of the reasons I wouldn't consider migrating to Postgres. I know I can use the unaccent extension, but I consider it a poor substitute to proper collations. I guess this isn't a problem for most people because it's never mentioned.


It appears work is active (May 2017) in this area, but accent-insensitive collations are still a ways off: https://blog.2ndquadrant.com/icu-support-postgresql-10/

edit: fulafel corrects me below that you can get accent-insensitive collations working using PostgreSQL 10's new ICU collation support. Note that PostgreSQL 10 is currently pre-release.


That post says that they currently work, but there is a new OS-independent collation method coming.

Edit re your edit: the already working collations are in PG 9.x - the OS independent collations are coming in 10.x. (Hence the post title, "More robust collations with ICU support in PostgreSQL 10")


But do these "already working" collations include accent-insensitive variants? I didn't think so, and I don't think they are coming in PostgreSQL 10 either. Do you have other information on this?


They do. The problem is that equality is still strict binary equality for text, for obscure technical reasons. So while you can get extremely flexible customizations of collation with ICU [1], you can't have case-insensitive equality, or equality that ignores accents. This is recognized as a problem, and is probably going to be fixed in v11.

[1] https://www.postgresql.org/message-id/CAH2-Wz=bcgmk97YaZ3rs4...
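For illustration, the v10 syntax looks like this (the collation name is made up; "ks-level1" asks ICU to weigh only base letters when sorting):

    -- PostgreSQL 10: define an ICU collation; ORDER BY honors it,
    -- but '=' on text remains strict binary comparison
    CREATE COLLATION es_ci_ai (provider = icu, locale = 'es-u-ks-level1');

    SELECT name FROM clients ORDER BY name COLLATE "es_ci_ai";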


> The problem is that equality is still strict binary equality for text, for obscure technical reasons

AKA performance.


My misunderstanding seems to have been that "accent-insensitive" turns out to be a precise technical term from MySQL meaning "sort UTF-8 using the American locale and collation, but pretend accented characters are equal to their non-accented versions", which is not a standard collation order in any locale.


I think this is incorrect. In MySQL, utf8_spanish_ci and utf8_swedish_ci have different accent rules, for example.


But then what would be the use case for accent-ignoring Spanish collation rules, rather than the default Spanish collation rules that sort accented chars according to normal Spanish sorting? This sounds strictly inferior in any human-facing application.


The docs say that locales and LC_COLLATE work, are you sure?


> One thing that Postgres lacks is accent insensitive collations.

So what if the functionality is not in core, but in an extension? You make it sound as if it does not exist at all and all we can do is gnash our teeth in frustration.

http://www.public-software-group.org/pg_collkey

First released in 2006, for PostgreSQL 8.


Why not just use a functional index? I'd expect it to have much better performance.
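A sketch of that approach (unaccent() isn't marked IMMUTABLE, so the usual trick is an immutable wrapper; the table and function names are hypothetical):

    CREATE EXTENSION IF NOT EXISTS unaccent;

    -- Immutable wrapper so the function is allowed in an index
    CREATE FUNCTION f_unaccent(text) RETURNS text
      LANGUAGE sql IMMUTABLE AS
    $$ SELECT unaccent('unaccent', $1) $$;

    CREATE INDEX clients_name_ai_idx ON clients (f_unaccent(lower(name)));

    -- Queries must use the same expression to hit the index
    SELECT * FROM clients
     WHERE f_unaccent(lower(name)) = f_unaccent(lower('José'));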


If you are migrating to AWS RDS, I recommend the AWS Database Migration Service. I migrated my live database (~50GB) from MySQL to Postgres (both RDS) with zero downtime.

I used the AWS Schema Conversion Tool for the initial PG schema, and customized the generated schema for my specific needs.


A few of these steps could be solved with this tool, including schema/index migrations, and even initial (fast) data transfer: https://github.com/seanharr11/etlalchemy


I'm not against the idea of using Postgres, but I have a sneaking suspicion that you could have gotten that 30% speed benefit from MySQL if you did some performance work on it. In other words, I don't feel that migrating from one relational database to another similar relational database was worth that much trouble.


That's interesting, as my dev experience would allow me to tune queries better in Postgres (the visual query planner makes it easier and there are more join types), while tuning through configuration on either database would involve a lot more trial and error. I would be more likely to know what to do after migrating.


I went through a similar process a few years ago (4k dbs into a single db, all MySQL). You'll get something working fairly quickly and then spend days debugging data anomalies that have crept in over the years.

We ended up with a Frankenstein mixture of bash scripts, a sequence of SQL transformations, some Python, and some PHP to decode some of the data that had been stored in pickled / PHP-encoded packages.

It's not something I want to repeat any time soon :-)


It truly sounds like a nightmare. How long did it take to move everything to psql, and how long until it was stable enough that you didn't have to keep an eye on it?


To clarify, it all stayed on MySQL; we just pulled everything into a single db.

I can't remember exactly, but the whole project was done in a real hurry, about 4 weeks I think. It included a rewrite of the application / website (about 20k lines of backend code, HTML and JS / CSS), migration of all the data, creation of new transcoders, and migration of a lot of asset/video data from another server.

No idea how we got through it all in retrospect, but I had to. A few days after we finished I got married and went off on my honeymoon, so I guess I decided it was stable enough the day I released it...! :-)


One of the main things I ran into when migrating from MySQL to Postgres was that the default text columns (TEXT, VARCHAR, etc.) behave differently when searching. MySQL matches text case insensitive, while Postgres matches it case sensitive. This resulted in searches for content suddenly not returning data. Luckily there's an official CITEXT extension for Postgres [1], which matches text case insensitive and adds the correct indices for doing so efficiently.

[1] https://www.postgresql.org/docs/9.1/static/citext.html
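Usage is close to drop-in; a minimal sketch with a made-up table:

    CREATE EXTENSION IF NOT EXISTS citext;

    -- citext compares case-insensitively in WHERE, UNIQUE, etc.
    CREATE TABLE users (
      id    serial PRIMARY KEY,
      email citext UNIQUE
    );

    INSERT INTO users (email) VALUES ('Alice@Example.com');
    SELECT * FROM users WHERE email = 'alice@example.com';  -- matches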


"MySQL matches text case insensitive"

This is of course only the case when you use a case-insensitive collation (_ci).

But I agree this can be a problem because case insensitive collations are used a lot in MySQL.

Other collations can be trouble too. For example, the collation that returns results for both 'ß' and 'ss' (German).
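For example (assuming a utf8 connection charset):

    -- MySQL: equality follows the collation, not just for case
    SELECT 'straße' = 'strasse' COLLATE utf8_unicode_ci;  -- 1: ß equals ss
    SELECT 'résumé' = 'resume'  COLLATE utf8_unicode_ci;  -- 1: accents ignored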


I'm not entirely sure what the technical reason for this change was. Of course, cleaning up your DBs and whatnot is a good thing. But why did they move to Postgres instead of simply cleaning up?


Would love to do this to the monstrosity I inherited at my new job, but it's a Sisyphean task.

One day.


One time I saw a database, with a single table, with a single text column that contained the old database.


This sounds similar to what I saw at a client a few years back. All XML responses from a REST API were inserted into a single VARCHAR(MAX) column (one row per response, but a response could have multiple "records") in a SQL Server database.

All interaction with this was through a byzantine web of XPath expressions using the severely neutered dialect of XQuery that shipped with SQL Server 2008.

Apologies for the mini-rant, your comment appears to have triggered some painful memories.


It's databases all the way down...


You should see the one here. An old flat-file database monstrosity from the 80s, imported into Oracle, complete with all of the duplicated data. It's the most non-relational relational database ever.


Would a database stored in an XML file, shared between 60+ clients over NFS, beat that for monstrousness?


Well, this is used now by about 4000 clients, so you be the judge. There are so many indexes, and so much spaghetti, just to make sure every possible "relationship" is accounted for. Amazing it works at all.

Just using an XML file is pretty crazy though.


How concurrent is that? I can't imagine how writing to the file is even achieved.


"Badly". At least once a week someone managed to overwrite a change another client had just made. When I left, they were running a cron to copy it every minute as a "quick restore" when it happened.


Maybe there is an email thread shared between all clients for acquiring / releasing the write lock on the XML file.


Semaphore-as-a-service Slack chat bot.


Seen that.

Shared drive with Excel files; people would shout at each other to 'release the lock'.


Yeah, first real job had the customer database as a VB frontend onto an Access file shared over {whatever Win95 was using in 1996-97}. After a few snafus, the CS / Sales people learned that they had to single-task with the database.


That made me snort my coffee.


Lol, if it sent emails, I might know of such a system


This is a special level of hell.


I have 177 tables, 764 stored procedures and 44 joins.

Every single table has either a uniqid() or a UUID() which isn't the primary key (they all have auto-incrementing integer keys) but is 'joined' on in code.

When he needs to 'join', he uses coalesce() on selects.

Referential integrity doesn't exist.

Oh, and the cherry on the shit sandwich: we have four separate systems that talk to this database, written in 3 different languages: Java (Android scanners), C# (factory scanners) and PHP (the main system).

So there is no way to accurately know which SPs are used by what without grepping the entire codebase looking for call <foo>.

The PHP is badly written in the old pure-procedural style, running on an outdated version of PHP/Debian/MySQL.

Basically if you took the absolute worst approach to everything this would be the end result.

It's good fun!


Is it a universally accepted truth now that it's generally better to use Postgres than MySQL?

Would anybody here use MySQL for a new project? If so, why?


Having used both from the prototyping stage to very large scale my answer: It doesn't matter.

They are both plenty feature complete to work well for relational data type problems. Grab the one you have more exposure to and you'll be fine.

That said, I'm still using MySQL/InnoDB (latest Percona) because the built-in replication has had more time to bake and its concurrency model (think 100k+ qps) is plain better. For further reading see:

https://eng.uber.com/mysql-migration/


Yes, I would use MySQL. I'm very familiar with it for one, and the extra storage engine options turn a sure thing into a no-brainer decision.

Ultimately, I think familiarity is and will always be the biggest driver for experienced devs who choose a DB.


What storage engines and options do you use?


InnoDB is my default go-to; TokuDB when on-disk size is a consideration; the ARCHIVE engine for the exceptional cases where I need even more rarely-accessed data on disk; and MyISAM for its full-text search capabilities (usually denormalized from a backing InnoDB table). Those are the big ones.

There are also NDB and Galera Cluster, but you want a specific use case to justify the added complexity and overhead.

I've seen Blackhole used in production, but that was an odd one involving a... unique... replication setup. The CSV engine is pretty awesome for huge data imports, but again, pretty unusual for actual use.
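For anyone unfamiliar, the engine is chosen per table; a sketch with made-up tables:

    -- Hot transactional data on InnoDB
    CREATE TABLE orders (
      id   BIGINT PRIMARY KEY,
      body TEXT
    ) ENGINE=InnoDB;

    -- Compressed, insert/select-only cold storage
    CREATE TABLE orders_log (
      logged_at DATETIME,
      body      TEXT
    ) ENGINE=ARCHIVE;

    -- Denormalized copy kept on MyISAM for FULLTEXT search
    CREATE TABLE orders_search (
      order_id BIGINT,
      body     TEXT,
      FULLTEXT KEY ft_body (body)
    ) ENGINE=MyISAM;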


XtraDB if you can, InnoDB if you can't.


FWIW, MariaDB 10.2 no longer defaults to XtraDB, now considering InnoDB mostly on-par. [0]

[0] https://mariadb.com/kb/en/mariadb/why-does-mariadb-102-use-i...


I chose MySQL (well, MariaDB) for a project that started in 2013. I chose it because I have the most experience maintaining it and have a pretty good handle on how to tune it.


Ditto. I still choose it for small-to-medium-sized webapps because, let's face it, the data storage is not their pain point.

MySQL is fast enough, usually has enough features, I know how to tune it and its queries, and I normally only bump into 2-3 things per project that would've been easier with PostgreSQL. And I can work around them.

The backwards-incompatible SQL_MODE changes between MySQL 5.6 and 5.7 made me reconsider, as legacy apps needed updating.

tl;dr: yes. But PostgreSQL would be easier for some items


> The backwards-incompatible SQL_MODE changes between MySQL 5.6 and 5.7 made me reconsider, as legacy apps needed updating.

Uhm, err, that change is to stop the database from silently throwing away your data. It's not about "fast enough" or "easy". It's about your database actually saving your data.

If you're happy for your "database" to keep quietly throwing away data in order to keep legacy apps "working", you can just disable strict mode.
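The difference in a minimal sketch:

    -- With strict mode off, MySQL truncates silently (warning only)
    SET sql_mode = '';
    CREATE TABLE t (c VARCHAR(3));
    INSERT INTO t VALUES ('abcdef');  -- stores 'abc', with a warning

    -- 5.7's default includes STRICT_TRANS_TABLES: the same insert errors
    SET sql_mode = 'STRICT_TRANS_TABLES';
    INSERT INTO t VALUES ('abcdef');  -- ERROR 1406: Data too long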


It's not just changes to stop data being thrown away. This issue has been painful: https://github.com/doctrine/doctrine2/pull/6143

Across multiple businesses, they've upgraded servers and got MySQL 5.7 installed, only to wonder why the app stopped working. People didn't expect that from a point release / the default new operating-system package. We could argue the point, but that's the expectation of businesses vs software engineers.


Nope, because using MySQL instead of Postgres essentially cripples your dev tools once you know about everything PG opens up for you.

http://www.brightball.com/articles/why-should-you-learn-post...


Most e-commerce systems start off on Magento, which is MySQL-only. So a LOT of MySQL projects are simply the answer to:

1. Did you use a Magento/Drupal/WordPress/etc. PHP framework?

2. Did you host on a $20-per-month server with unlimited bandwidth and a free database?


Like other comments here state: in general, use what you know. That said, PostgreSQL has ~always been the better RDBMS. Some would argue MySQL has at times been a better ~key-value store with an ~SQL interface.

Apparently PostgreSQL is difficult to run in a properly restricted BSD jail. Other than that, I'd be hard-pressed to come up with reasons to prefer MySQL on technical merit over either PostgreSQL or, say, Redis or something.


I'm running PostgreSQL in a FreeBSD VNET jail without issue. The network stack is not shared, as it's a VNET jail. I have it set for per-jail SysV IPC, new in FreeBSD 11, which was the last thing I needed to fully isolate it from other jails.

Is there anything else that's needed to properly restrict the jail?


Mostly that, AFAIK, a jail with sysvipc set isn't really isolated any more: it works, but it defeats some of the purpose of using a jail in the first place.

https://forums.freebsd.org/threads/59371/

Apparently this can (should?) be tweaked a bit since 11-RELEASE:

https://www.freebsd.org/cgi/man.cgi?query=jail&sektion=&n=1

> allow.sysvipc: A process within the jail has access to System V IPC primitives. This is deprecated in favor of the per-module parameters (see below). When this parameter is set, it is equivalent to setting sysvmsg, sysvsem, and sysvshm all to ``inherit''.

I interpret that as it still being the case that by setting this to the equivalent of the deprecated setting, one loses much of the protection a jail normally gives.


https://grimoire.ca/mysql/choose-something-else

That's essentially my reasons for preferring PostgreSQL.


> Now is when one of the main reasons to take Postgres as our new database server. Postgres allows the use of Views(include link) supporting INSERT, UPDATE and DELETE meanwhile the relation between views and tables are one to one, ....isn’t it awesome?

Just a heads up, I think you missed a link in there
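For context, the feature the article seems to mean is that since 9.3 a simple single-table view is automatically updatable; a sketch with made-up names:

    CREATE TABLE customers_raw (
      id     serial PRIMARY KEY,
      name   text,
      legacy text
    );

    -- No triggers or rules needed: a view over one table,
    -- without aggregates etc., accepts writes directly
    CREATE VIEW customers AS
      SELECT id, name FROM customers_raw;

    INSERT INTO customers (name) VALUES ('Ada');       -- routed to customers_raw
    UPDATE customers SET name = 'Ada L.' WHERE id = 1;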


When talking about replaying live activity on the test system:

>We took the third option because the other two options were quite intrusive and they might interfere at the normal use of our app because in both cases they were going to be in between our users request and our app

One of those two rejected options was goreplay. It is passive, like tcpdump. Right from the docs:

"Gor is not a proxy: you do not need to put 3-rd party tool to your critical path. Instead Gor just silently analyzes the traffic of your application and does not affect it anyhow."


Now that is what I call a proper migration. Great content! I was wondering how somebody could end up with corrupt data; I guess it was probably due to MySQL not strictly enforcing types.



I tried to migrate from Firebird to Postgres once and failed; it needs more administration than Firebird.


What was the reason to first migrate and then repair the data? The repair scripts could've been run against the old setup too, and you would've had less stuff to migrate.



