
I've operated MySQL in production at scales ranging up to _very_ large, so I place a lot of emphasis on how well that can be done. I posted some of this a couple of years ago, but I would describe running MySQL in production as a minefield. I kept a list of the mines I stepped on; here are the two with the biggest blast radius:

* If you have pre-5.6.4 date/time columns in a table and perform any kind of ALTER TABLE on that table (even one that doesn't touch the date/time columns), it will TAKE A TABLE LOCK AND REWRITE YOUR ENTIRE TABLE to upgrade them to the new date/time types. In other words, your carefully-crafted online DDL statement becomes fully offline and blocking for the entire operation. To add insult to injury, the full table upgrade was UNAVOIDABLE until 5.6.24, when an option (still defaulted to off!) was added to decline the automatic upgrade of date/time columns (see the first sketch after this list). If you couldn't upgrade to 5.6.24, you had two choices for any table with pre-5.6.4 types: make no DDL changes of any kind to it, or accept downtime while the full table rewrite was performed. To be as fair as possible, this is documented in the MySQL docs, but it is mind-blowing to me that any database team would release this behavior into production. In what world is upgrading date/time types to add a bit more fractional precision so important that every online DDL operation on that table should be silently, automatically, and unavoidably converted into an offline operation just to perform the upgrade? To me, this is indicative of the same mindset that shipped MySQL for so many years with unsafe, silent downgrading of data as the default mode of operation.

* Dropping a table takes a global lock that blocks ANY QUERY from executing until the underlying files for the table have been removed from the filesystem. In many circumstances this would go unnoticed, but I took a 7-minute production outage when I dropped a 700GB table that was no longer used (see the second sketch after this list for one way to limit the blast radius). Apparently the delay comes from how long the underlying Linux filesystem takes to delete such a large table file. This was an RDS instance, so I had no visibility into the filesystem, and the EBS backing probably made it worse, but still: what database takes out a GLOBAL LOCK TO DROP A TABLE? After the incident, I googled and found this description of the problem (https://www.percona.com/blog/2009/06/16/slow-drop-table/), which otherwise isn't well documented. It's almost as if you have to anticipate every possible way MySQL could screw you and google for it in advance if you want to avoid production downtime.
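
For the first mine, here is a rough sketch of the defensive pattern (the table and column names are hypothetical, and it assumes 5.6.24+ for the avoid_temporal_upgrade variable): explicitly request an in-place, non-locking ALTER so MySQL errors out instead of silently falling back to a blocking table rewrite.

    -- Sketch only; "orders" and "note" are hypothetical names.
    -- 5.6.24+: decline the automatic upgrade of pre-5.6.4 temporal columns
    -- (the variable defaults to OFF, so it must be set explicitly):
    SET SESSION avoid_temporal_upgrade = ON;

    -- Request online DDL explicitly. If MySQL can't honor ALGORITHM=INPLACE
    -- and LOCK=NONE (e.g. because it wants to rewrite the table), the
    -- statement fails with an error instead of silently going offline:
    ALTER TABLE orders
      ADD COLUMN note VARCHAR(255),
      ALGORITHM=INPLACE, LOCK=NONE;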

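For the second mine, a sketch (again with a hypothetical table name) of how to at least contain the damage: RENAME TABLE is a fast metadata-only change that gets the table out of the application's path immediately, so the expensive DROP, and its server-wide stall, can be deferred to a quiet maintenance window. It doesn't remove the lock; it just schedules the pain.

    -- Sketch with a hypothetical table name.
    RENAME TABLE big_unused_table TO big_unused_table_drop_me;

    -- Later, during a low-traffic maintenance window, pay the cost of the
    -- slow filesystem unlink when the stall will hurt the least:
    DROP TABLE big_unused_table_drop_me;
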
There are others, too, but to this day those two still raise my blood pressure when I think about them. Besides MySQL, I've pushed SQL Server and PostgreSQL pretty hard in production and never hit gotchas like that. Unlike MySQL's, those teams appear to understand the priorities of people who run production databases: they make it very clear when there are big, potentially disruptive changes, and they don't make those changes obligatory, automatic, and silent the way MySQL did. IMO, MySQL has its place for very specific workloads, but if you don't know DB engines and your own workload well enough to be sure yours is one of them, you should default to PostgreSQL.


