
> Adding a column to 10 million rows takes locks and doesn’t work.

It does not take locks, except very briefly.

1. Make a new empty table that has the same structure as the table you wish to add a column to. Add your new column to the empty table.

2. Put triggers on the old table that, whenever a row is added or updated, copy the row into the new table or update the copy already there.

3. Run a background process that goes through the old table doing dummy updates:

    UPDATE table SET some_col = some_col WHERE ...
where the WHERE clause picks a small number of rows (e.g., just go through the primary key sequentially). Since you aren't actually modifying the table, all this does is trigger the trigger on the specified rows.

4. When you've hit everything with a dummy update, rename the current table to a temp name, and rename the new table to the current table. This is the only step that needs a lock.
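The recipe above, sketched as MySQL-flavored SQL (the table, column, and trigger names here are all illustrative, and the batching WHERE clause is simplified):

```sql
-- 1. Clone the structure and add the new column to the empty copy.
CREATE TABLE items_new LIKE items;
ALTER TABLE items_new ADD COLUMN notes TEXT NULL;

-- 2. Mirror writes into the new table (an AFTER INSERT trigger
--    of the same shape is needed as well).
CREATE TRIGGER items_mirror AFTER UPDATE ON items
FOR EACH ROW
  REPLACE INTO items_new (id, data) VALUES (NEW.id, NEW.data);

-- 3. Dummy updates over small primary-key ranges fire the trigger.
UPDATE items SET data = data WHERE id BETWEEN 1 AND 10000;

-- 4. The only step that needs a lock: an atomic rename swap.
RENAME TABLE items TO items_old, items_new TO items;
```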

There are tools for MySQL to automate much of this. There was a post either here or on Reddit a while back about this which linked to them. I'm sorry but I didn't save a link to it so you'll have to search if you want it.

Alternatively, you can use a database such as PostgreSQL, which stores metadata about tables in other tables, allowing you to not only add a column without locking the table but do so as part of a transaction with other changes that can all be rolled back atomically on failure.
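Concretely, because the ALTER is transactional in Postgres, it can be bundled with other changes (a sketch with made-up names):

```sql
BEGIN;
ALTER TABLE users ADD COLUMN last_seen timestamptz;
UPDATE schema_info SET version = version + 1;
COMMIT;  -- a ROLLBACK here would atomically undo the new column too
```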

PostgreSQL also supports concurrent index creation, so if you realize later you need an index on your amazingly large table you can have it built in the background while you are still using the table. (Managing indexes was another locking issue mentioned in the article.)

> PostgreSQL also supports concurrent index creation

I use this all the time, and am flabbergasted how people can do without it. I feel like migration frameworks should make it the default with Postgres.

It's too bad it can't be mixed with transactional DDL, but because indexes are not logical changes, I don't really care as much, even if it is dissatisfying.

So, all in all, for those who want to take advantage of this feature in Postgres:

Stop doing this:

    CREATE INDEX foo_ix ON foo (bar);
Start doing this:

    CREATE CONCURRENT INDEX foo_ix ON foo (bar);
For the cost of one keyword, your index additions can be a non-event.

True to SQL's forgettable syntax (did you know that indexes are not addressed by the SQL standard?), I got it slightly wrong:

    $ psql
    fdr=> \h CREATE INDEX
    Command:     CREATE INDEX
    Description: define a new index
    CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ name ] ON table [ USING method ]
        ( { column | ( expression ) } [ COLLATE collation ] [ opclass ] [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [, ...] )
        [ WITH ( storage_parameter = value [, ... ] ) ]
        [ TABLESPACE tablespace ]
        [ WHERE predicate ]
So, rather:

    CREATE INDEX CONCURRENTLY foo_ix ON foo (bar);

This is also true for MS SQL. However online index operations do require the expensive licenses, unfortunately. Postgres is amazing.

Ack. SQL 2008 Enterprise hurt badly, especially as we have a couple of machines with 8 physical CPUs.

+1 for PostgreSQL.

And, reddit uses PostgreSQL.

apparently they are doing it wrong


This is a deep question. Brainstorming I get:

- Steve and Alex founded reddit fresh out of school, and schools don't generally teach databases beyond the theory

- Fresh out of school, they didn't have the opportunity to learn it from someone with more experience

- They were busy building the rest of the site, and so didn't spend the time to delve into these concepts

- Paul either didn't feel the need to explain it to them, or had reasons similar to the above for not knowing it himself

Hmm... I'm out of ideas. Anyone else?

"allowing you to not only add a column without locking the table"

To be clear, the actual advantage is that adding a column in Postgres is an O(1) operation if the default value is NULL. It still requires taking a lock, but for many workloads you won't notice it. You still need to be aware of it, though, because it can cause problems if you have long-running transactions.

Alternatively, if using Active Record/Rails, you can use Large Hadron Migrator gem: http://backstage.soundcloud.com/2011/05/introducing-the-larg...

In trying to understand why such a negative reaction: SoundCloud's take on this goes beyond just Ruby or Rails.

They detail an approach that extends tzs' suggestion:

1. Get the maximum primary key value for the table

2. Create new table and journal table

3. Activate journalling with triggers

4. Perform alter statement on new table

5. Copy in chunks up to max primary key value to new table

6. Switch new and original table names and remove triggers

7. Replay journal: insert, update, deletes
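The journal-based variant above, sketched in MySQL-ish SQL (all names are illustrative, and in practice the journal also records which action occurred):

```sql
-- 1. High-water mark for the chunked copy.
SELECT MAX(id) FROM items;

-- 2. New table and journal table.
CREATE TABLE items_new LIKE items;
CREATE TABLE items_journal LIKE items;

-- 3. Triggers append INSERT/UPDATE/DELETE rows to items_journal.
-- 4. Alter the still-empty new table.
ALTER TABLE items_new ADD COLUMN notes TEXT NULL;

-- 5. Copy in chunks, only up to the recorded maximum id.
INSERT INTO items_new (id, data)
  SELECT id, data FROM items WHERE id BETWEEN 1 AND 10000;

-- 6. Swap names and drop the triggers; 7. replay items_journal.
RENAME TABLE items TO items_old, items_new TO items;
```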

Not only did they have success with this approach, they also studied Twitter's approach[1] and Facebook's[2], and explain why those didn't work for them.

[1] https://github.com/freels/table_migrator

[2] http://www.facebook.com/note.php?note_id=430801045932

Presumably not that exact algorithm.

There's a race between (1) and (3) where new entries could be created (increasing the max primary key value) before the triggers are in place.

Or (shameless plug) use tokudb. We have many online schema change features that work within mysql. You can add and remove columns while the database is running, without resorting to complicated tricks with triggers. You just add the column, it completes immediately, and you start using it right away. The work of changing existing rows happens in the background, in a transactionally consistent manner, so you never see the old schema again. It's pretty sweet.

FWIW, I often look at TokuDB with envy, but there is no way I will switch to MySQL from PostgreSQL for it: there are so many other things I'd lose that I would rather put up with my other issues. That said (and I totally admit it was over a year ago that I was looking into various database implementation companies, so I might have you confused with a different company), it seems to me like your technology would be amazing used as a standalone index (as opposed to also storing the heap), and adding indexes to PostgreSQL is both simple and "part of the point of using it in the first place".

(I also would have expected the PostgreSQL market to be better for this kind of thing, although I'm not in this business, so I'd love to understand where I'm wrong: the general argument being that relying on a third-party commercial index is a much lower commitment--and thereby a simpler sell--than relying on a third-party commercial storage engine, in addition to how most companies using MySQL seem to either know so little about database technology that their criteria was simply "popularity" or, alternatively, so much about database technology that to them MySQL is perfectly fine and they don't really need your solution.)

For the moment, I can say that MySQL has a nicely defined storage engine API which is why we release that, and that if you need something lower-level, please come talk to us because that's not the only thing we develop.

However, part of the point of storing things in a write-optimized data structure is that you can afford to store "the heap" directly in the index, and get good performance on range queries. Anyway, if you're interested, email us and we can talk about how we might be able to help.

TokuDB is interesting but priced a little outside what I'd be willing to pay. For the same price I can throw more machines at the problem and use Cassandra.

I'm sure it's great, but my first reaction when I looked up what it was was "oh no, another MySQL storage engine."

> It does not take locks, other than for very briefly.

You are right. I think this is a misunderstanding in the original material: if one does not use a default expression (i.e., the default is NULL), then the implementation they are using -- Postgres -- will just twiddle the catalog information. My guess is that they did not know this detail of the system: adding a NULL column to a table of billions of records is, for most purposes, as fast as adding one to an empty table.

No need to do such a fancy switchover of tables, although one will still need to backfill a desired default, as you have indicated in step 3.

For people wanting to then use this solution, the detailed explanation is that you:

1. Make the column without a default.

2. Modify your application logic, as you would otherwise already have to, to fill in the column during insert and maintain it during update, but not to rely on or use the column for updating other data.

3. Alter the column to add the default, to be used for new rows past that point (you do this after step 2, as otherwise you won't know which values should be defaulted in step 4 and which values need special attention).

4. Update, in reasonably-sized batches, all of the old rows that are still null to the "correct value" (which might be the default, but might be a calculation based on other data).

(edit, as I forgot the conclusion:) At this point you will have all of your data filled in correctly, so you alter the column to add a not null constraint (if you wanted that, which you probably did, as otherwise you likely wouldn't have needed a non-null default in the first place) and then go back into your application and go about business as usual, using your new column in ways that would affect other data.
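In Postgres syntax, those four steps plus the conclusion look roughly like this (the names, values, and batching predicate are illustrative):

```sql
-- 1. No default, so this only touches the catalogs: effectively O(1).
ALTER TABLE users ADD COLUMN status text;

-- 2. (application now fills in status on insert and update)

-- 3. The default applies only to rows created after this point.
ALTER TABLE users ALTER COLUMN status SET DEFAULT 'active';

-- 4. Backfill the remaining NULLs in reasonably-sized batches.
UPDATE users SET status = 'active'
 WHERE status IS NULL AND id BETWEEN 1 AND 10000;

-- Conclusion: once no NULLs remain, enforce the constraint.
ALTER TABLE users ALTER COLUMN status SET NOT NULL;
```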

This is exactly the mechanism that pt-online-schema-change from Percona uses.


Wouldn't that double the size, in the database, of a very large table?

Depends on the DB. Doing that process on an Oracle DB with a huge table in archivelog mode can generate terabytes of archive logs.

> It does not take locks, other than for very briefly.

A lock (even a very brief one) can be infeasible on a sufficiently high-traffic table, though there's always the possibility of, say, a 30-second maintenance window.

MySQL 5.5 supports an atomic table switch, meaning no lock is necessary at all.
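If I understand correctly, this refers to RENAME TABLE, which swaps several tables in one atomic step (names illustrative):

```sql
RENAME TABLE items TO items_old, items_new TO items;
```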
