Reddit's database has only two tables (inburke.com)
296 points by kevinburke on Sept 2, 2012 | 79 comments

> Adding a column to 10 million rows takes locks and doesn’t work.

It does not take locks, except very briefly.

1. Make a new empty table that has the same structure as the table you wish to add a column to. Add your new column to the empty table.

2. Put triggers on the old table that, whenever a row is added or updated, make a copy of the row in the new table or update the copy already there.

3. Run a background process that goes through the old table doing dummy updates:

    UPDATE table SET some_col = some_col WHERE ...
where the WHERE clause picks a small number of rows (e.g., just go through the primary key sequentially). Since you aren't actually modifying the table, all this does is trigger the trigger on the specified rows.

4. When you've hit everything with a dummy update, rename the current table to a temp name, and rename the new table to the current table. This is the only step that needs a lock.
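
A minimal sketch of those four steps in Postgres-flavored SQL (all table, column, and trigger names are hypothetical, and `ON CONFLICT` assumes Postgres 9.5+):

```sql
-- 1. clone the structure and add the new column
CREATE TABLE items_new (LIKE items INCLUDING ALL);
ALTER TABLE items_new ADD COLUMN score integer;

-- 2. copy every insert/update into the new table
CREATE FUNCTION mirror_items() RETURNS trigger AS $$
BEGIN
    INSERT INTO items_new (id, title, score)
         VALUES (NEW.id, NEW.title, NULL)
    ON CONFLICT (id) DO UPDATE SET title = EXCLUDED.title;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER items_mirror AFTER INSERT OR UPDATE ON items
    FOR EACH ROW EXECUTE PROCEDURE mirror_items();

-- 3. dummy-update one small primary-key range at a time
UPDATE items SET title = title WHERE id BETWEEN 1 AND 10000;

-- 4. the only step that needs a lock: swap the names
BEGIN;
ALTER TABLE items RENAME TO items_old;
ALTER TABLE items_new RENAME TO items;
COMMIT;
```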

There are tools for MySQL to automate much of this. There was a post either here or on Reddit a while back about this which linked to them. I'm sorry but I didn't save a link to it so you'll have to search if you want it.

Alternatively, you can use a database such as PostgreSQL, which stores metadata about tables in other tables, allowing you to not only add a column without locking the table but do so as part of a transaction with other changes that can all be rolled back atomically on failure.

PostgreSQL also supports concurrent index creation, so if you realize later that you need an index on your amazingly large table, you can have it built in the background while you are still using the table. (Managing indexes was another locking issue mentioned in the article.)

> PostgreSQL also supports concurrent index creation

I use this all the time, and am flabbergasted how people can do without it. I feel like migration frameworks should make it the default with Postgres.

It's too bad it can't be mixed with transactional DDL, but because indexes are not logical changes, I don't really care as much, even if it is dissatisfying.

So, all in all, for those who want to take advantage of this feature in Postgres:

Stop doing this:


Start doing this:


For the cost of one keyword, your index additions can be a non-event.

True to SQL's forgettable syntax (did you know that indexes are not addressed by the SQL standard?), I got it slightly wrong:

    $ psql
    fdr=> \h CREATE INDEX
    Command:     CREATE INDEX
    Description: define a new index
    CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ name ] ON table [ USING method ]
        ( { column | ( expression ) } [ COLLATE collation ] [ opclass ] [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [, ...] )
        [ WITH ( storage_parameter = value [, ... ] ) ]
        [ TABLESPACE tablespace ]
        [ WHERE predicate ]
So, rather:
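
That is, with CONCURRENTLY immediately after INDEX (index and table names invented for illustration):

```sql
CREATE INDEX CONCURRENTLY comments_author_idx
    ON comments (author_id);
```

Note that CONCURRENTLY cannot run inside a transaction block, which is the transactional-DDL tradeoff mentioned earlier.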


This is also true for MS SQL. However, online index operations require the expensive Enterprise licenses, unfortunately. Postgres is amazing.

Ack. SQL 2008 Enterprise hurt badly, especially as we have a couple of machines with 8 physical CPUs.

+1 for PostgreSQL.

And, reddit uses PostgreSQL.

apparently they are doing it wrong


This is a deep question. Brainstorming I get:

- Steve and Alex founded reddit fresh out of school, and schools don't generally teach databases beyond the theory

- Fresh out of school, they didn't have the opportunity to learn it from someone with more experience

- They were busy building the rest of the site, and so didn't spend the time to delve into these concepts

- Paul either didn't feel the need to explain it to them, or had reasons similar to the above for not knowing it himself

Hmm... I'm out of ideas. Anyone else?

"allowing you to not only add a column without locking the table"

To be more clear, the actual advantage is that adding a column in postgres is an O(1) operation if the default value is NULL. It still requires taking a lock, but for many workloads you won't notice it. You still need to be aware of it though, because it can cause problems if you have long-running transactions.

Alternatively, if using Active Record/Rails, you can use Large Hadron Migrator gem: http://backstage.soundcloud.com/2011/05/introducing-the-larg...

For those trying to understand why such a negative reaction: SoundCloud's take on this goes beyond just Ruby or Rails.

They detail an approach that extends tzs' suggestion:

1. Get the maximum primary key value for the table

2. Create new table and journal table

3. Activate journalling with triggers

4. Perform alter statement on new table

5. Copy in chunks up to max primary key value to new table

6. Switch new and original table names and remove triggers

7. Replay journal: insert, update, deletes

Not only did they have success with this approach, they studied Twitter's approach[1] and Facebook's[2], and explain why those didn't work for them.

[1] https://github.com/freels/table_migrator

[2] http://www.facebook.com/note.php?note_id=430801045932

Presumably not that exact algorithm.

There's a race between (1) and (3) where new entries could be created (increasing the max primary key value) before the triggers are in place.
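
One way to close that race is to reverse the order: install the journalling triggers first, then read the boundary (a sketch; table and column names are hypothetical):

```sql
-- triggers are already live, so nothing inserted from here on is missed
SELECT MAX(id) FROM items;        -- copy boundary

-- chunked copy of everything at or below the boundary
INSERT INTO items_new
     SELECT * FROM items
      WHERE id BETWEEN 1 AND 10000;  -- repeat per chunk up to MAX(id)
```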

Or (shameless plug) use tokudb. We have many online schema change features that work within mysql. You can add and remove columns while the database is running, without resorting to complicated tricks with triggers. You just add the column, it completes immediately, and you start using it right away. The work of changing existing rows happens in the background, in a transactionally consistent manner, so you never see the old schema again. It's pretty sweet.

FWIW, I often look at tokudb with envy, but there is no way I will switch to MySQL from PostgreSQL for it: there are so many other things I'd lose that I would rather put up with my other issues. That said, and I totally admit it was now over a year ago that I was looking into various database implementation companies (and thereby might have you confused with a different company), it seems to me like your technology would be amazing if used as a standalone index (as opposed to also storing the heap), and adding indexes to PostgreSQL is both simple and "part of the point of using it in the first place".

(I also would have expected the PostgreSQL market to be better for this kind of thing, although I'm not in this business, so I'd love to understand where I'm wrong: the general argument being that relying on a third-party commercial index is a much lower commitment--and thereby a simpler sell--than relying on a third-party commercial storage engine, in addition to how most companies using MySQL seem to either know so little about database technology that their criteria was simply "popularity" or, alternatively, so much about database technology that to them MySQL is perfectly fine and they don't really need your solution.)

For the moment, I can say that MySQL has a nicely defined storage engine API which is why we release that, and that if you need something lower-level, please come talk to us because that's not the only thing we develop.

However, part of the point of storing things in a write-optimized data structure is that you can afford to store "the heap" directly in the index, and get good performance on range queries. Anyway, if you're interested, email us and we can talk about how we might be able to help.

TokuDB is interesting but priced a little outside what I'd be willing to pay. For the same price I can throw more machines at the problem and use Cassandra.

I'm sure it's great, but my first reaction when I looked up what it was was "oh no, another MySQL storage engine."

> It does not take locks, except very briefly.

You are right. I think this is a misunderstanding in the original material: if one does not use a default expression (i.e., the default is NULL), then the implementation they are using -- Postgres -- will just twiddle the catalog information. My guess is that they did not know this detailed system artifact: adding a NULL-valued column to a table of billions of records is, for most purposes, as fast as adding one to an empty table.

No need to do such a fancy switchover of tables, although one will still need to backfill a desired default, as you have indicated in step 3.

For people wanting to then use this solution, the detailed explanation is that you:

1. Make the column without a default.

2. Modify your application logic, as you would otherwise already have to, to fill in the column during insert and maintain it during update, but not to rely on or use the column for updating other data.

3. Alter the column to add the default, to be used for new rows past that point. (You do this after step 2, as otherwise you won't know which values should be defaulted in step 4 and which values need special attention.)

4. Update, in reasonably-sized batches, all of the old rows that are still NULL to the "correct value" (which might be the default, but might be a calculation based on other data).

(edit, as I forgot the conclusion:) At this point you will have all of your data filled in correctly, so you alter the column to add a NOT NULL constraint (if you had wanted that, which you probably did, as otherwise you likely wouldn't have needed a non-null default in the first place) and then go back into your application and go about business as usual, using your new column in ways that would affect other data.
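
The whole flow as Postgres DDL (table and column names are made up; step 2 is application logic, not DDL):

```sql
-- 1. add the column with no default: a metadata-only change
ALTER TABLE things ADD COLUMN score integer;

-- 3. add the default, applied to new rows from here on
ALTER TABLE things ALTER COLUMN score SET DEFAULT 0;

-- 4. backfill old rows in reasonably-sized batches
UPDATE things SET score = 0
 WHERE score IS NULL AND id BETWEEN 1 AND 10000;

-- finally, enforce the constraint once everything is filled in
ALTER TABLE things ALTER COLUMN score SET NOT NULL;
```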

This is exactly the mechanism that pt-online-schema-change from Percona uses.


Wouldn't that double the size, in the database, of a very large table?

Depends on the DB. Doing that process on an Oracle DB with a huge table in archivelog mode can generate terabytes of archive logs.

> It does not take locks, except very briefly.

A lock (even a very brief one) can be infeasible on a sufficiently high-traffic table, though there's always the possibility of, say, a 30-second maintenance window.

MySQL 5.5 supports an atomic table switch, meaning no lock is necessary at all.

This is called a http://en.wikipedia.org/wiki/Entity-attribute-value_model or http://en.wikipedia.org/wiki/Triplestore. I think the author is understating the price, though. There's a lot of existing software you could reuse if your data was stored in more conventional relations, and "manually enforce consistency" is a pipe dream. Your code has expectations about your data, so in the abstract you do still have a schema, and not writing it down merely prevents any tools from helping you keep your data sane over time. I've seen Notes databases decay to the point that not even the dev team could explain how a document got into its contradictory state nor how the apps should (much less currently would) handle it. The few people diligent enough to do completely correct work without a checkable schema, aren't the people who would be tempted to try.
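
For reference, the two-table layout the article describes looks roughly like this (column details are reconstructed guesses, not reddit's actual DDL):

```sql
-- one row per entity, with a handful of universal columns
CREATE TABLE thing (
    id    bigint PRIMARY KEY,
    type  varchar(16),   -- 'link', 'comment', 'account', ...
    ups   integer,
    downs integer,
    date  timestamp
);

-- one row per attribute per entity; values are stringly typed
CREATE TABLE data (
    thing_id bigint REFERENCES thing (id),
    key      varchar(64),
    value    text
);
```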

I recently read The Pragmatic Programmers book SQL Antipatterns: Avoiding the Pitfalls of Database Programming (http://pragprog.com/book/bksqla/sql-antipatterns) and chapter 6 talks specifically about this kind of DB design (entity-attribute-value) and the benefits/pitfalls. It's a great read and I highly recommend it.

I bought that book after my first time developing on top of Magento.

Pure curiosity: Do you know of any other OSS projects that went down this route?

Yeah, I'd been thinking "Magento" from about halfway into that article.

Wordpress uses the EAV db pattern too (have a look at the wp_options and wp_meta tables).

Of course, Magento and Wordpress use it in fundamentally different ways - so this project I've got to build a combined search across a site with both Magento (EAV for products) and Wordpress (non-EAV for posts/pages) in use, is in that "it'll sit on the backburner until it becomes critical or somebody else solves it for me first" state…

Squiz's Matrix (open-source CMS) also uses an EAV model.

It should also be noted that this isn't the only way to do this either.

The other three most common ways I've seen to store differently structured data in a consistent way are:

  * Edge table
  * Binary table
  * Universal table
The edge table approach maps edges into key:value tables, one per data type. So you might have an integer data table that is ID<key>:Value<int>, and it only stores the data values and a key to identify each value. Then you have a thing table that stores Name<string>:Flag<type>:ID<key>, where the flag column says "int" and the ID column contains the key of some row in your integer table.

There are as many tables as there are types of data, + 1 table for the definition of all things stored.
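
A sketch of the edge-table layout (all names invented):

```sql
-- the "+ 1" definition table: what each thing is called, its type flag,
-- and the key of its value in the matching value table
CREATE TABLE thing (
    id   integer PRIMARY KEY,
    name varchar(64),
    flag varchar(8),   -- 'int', 'text', ...
    ref  integer       -- key into the value table named by flag
);

-- one value table per data type
CREATE TABLE int_values  (id integer PRIMARY KEY, value integer);
CREATE TABLE text_values (id integer PRIMARY KEY, value text);
```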

The binary approach groups values of the same name into a table of that name. So if you have a property called "username" you'd have a table called "username", and this approach usually still includes the edge approach such that the values are stored in tables with ID:Value structure by type of value. Effectively partitioning the value table by name of the property being stored.

There are as many tables as there are types of data, + the count of unique property names.

The universal table approach uses a single table to store all data, effectively containing all of the edges in a graph. It's a conceptual full outer join of all binary tables. Imagine 100's of columns, and for each row the vast majority of those are NULLs.

The triplestore can be thought of as a condensed version (not type-safe) of the universal table. More space efficient, but usually at the cost of the datatype safety offered by the database.

You can combine these methods to store any type of semi-structured data in a traditional RDBMS and these approaches are used a lot if you know where to look for evidence of them.

SharePoint uses a version of the universal table. Oracle (for the ORA_XML storage type) uses a combination of the above under the hood to map XML into traditional database tables.

They all have pros and cons relating to query speed (for certain types of query), storage efficiency, indexing, types, etc. And as always you need to know why you're choosing something.

SharePoint also partitions their universal table dynamically. But they start by saying that each "SiteCollection" has its own database/tablespace. As SiteCollections can contain a single site, this is basically equivalent to Reddit creating new tables per subreddit.

Elsewhere (up or down on this page) I saw that the SQLAlchemy author advises not to do this kind of thing. I simply say don't do if you don't know strongly why you should do it. Don't cargo-cult... go and read up on the problem space and only implement one of these solutions when you know the effects of doing so.

This is one of those times when analysis paralysis can be your friend and prevent you from building a mess you didn't need. If anyone starts with one of these designs and you haven't got a decade of experience to know why... you are probably doing it wrong.

Note: Steve was exaggerating a bit, or this post misrepresents what he said. I can think of 12 just off the top of my head.

That said, many of them fit the same pattern:

...and each had a complement with "data" in place of "thing".

Though there were also ones like:

...for the many-to-many relations.

That accounts for most of the big ones (though there are also a handful of smaller specialty tables, too.)

Also, the above is as of March 2011 -- a lot has moved to Cassandra since then.

Thanks for the comment, that makes more sense. I was trying to figure out in my head how to do the lookups to render comments with votes, and which comments the user voted on, all from one data table.

Also I guess the table schemas are slightly different for the different "thing" tables.

Nope, all "thing" tables have exactly the same schema as each other. Ditto "data" and "xref" tables.

That quote is just painful to read, littered with FUD and not a single bit of evidence to back it up.

You should worry about the database because it's probably your canonical storage of data, which for most of us is the most important part of our product/service/whatever. A good schema enforces consistent data, invariants, and all sorts of other stuff that you don't want to be dealing with on a manual (and buggy) basis.

Schema updates do not need to be slow. They might not always be as elegant as you hope but the big databases are improving on that front, and as tzs mentions - there are tricks that can be employed. With the latest and greatest PG, I believe we're even starting to get event triggers, so it may well be possible to do schema updates with replication. I also have a feeling the binary replication in PG 9 and up can even do it out of the box, with hot standby to still allow responses. I'm not entirely convinced replication is a backup solution, so maybe that was an operations antipattern. That's some baseless assertion from me though :)

If deployments are a pain, work to alleviate the pain. They are pretty mechanical, even if involved, which leads very nicely to being automated.

Seriously, we're smart people, let's not throw at least 30 years of research out the window in favour of glorified entity-attribute-value schemas.

Pretty sure Reddit has thousands of tables - last time I looked, it was really hard to see this but this is how it seemed like it was working. It has "thing"/"data" tables for every subreddit - created on the fly (a crime for which any DBA would have you put to death, normally). While I'm honored they use my library (SQLAlchemy) for relational database access, their actual usage of the relational DB couldn't be more...well... let's just say please don't imitate their style. If you want to build a Reddit, use Mongo/Cassandra or something like that. They'd very likely have done so themselves if NoSQL products were mature when they first developed their platform (I am vaguely recalling/guessing here on that one).

Edit: if any reddit devs want to correct me here, feel free, as I found the reddit source extremely difficult to follow back when I looked.

> It has "thing"/"data" tables for every subreddit - created on the fly (a crime for which any DBA would have you put to death, normally).

That's not correct. There isn't a "table" for a subreddit. There is a thing/data pair that stores metadata about a subreddit, and there is a thing/data pair for storing links. One of the properties of a link is the subreddit that it is in. Same with comments: there is one thing/data pair for comments, and the subreddit a comment is in is a property.

> They'd very likely have done so themselves if NoSQL products were mature when they first developed their platform (I am vaguely recalling/guessing here on that one).

Actually, still today I tell people that even if you want to do key/value, Postgres is faster than any NoSQL product currently available for doing key/value.

The main issue I run into with storing key/values in MySQL/Postgres is querying those values if they hold complex data (JSON, serialized data, etc.), since the NoSQL products have features that let you dig into those meta fields when querying. (I haven't used a lot of NoSQL products yet, just dabbled with them.)

Do you guys know of good techniques to do that with a traditional database like MySQL? I know MySQL has some XML parsing features if you store metadata as XML. I've experimented with it but never used it in production. I sometimes put JSON into a column for dynamic data. Usually I only do that for fields that I know won't need to be queried, but that occasionally comes back to haunt me.

Don't know for MySQL, but Postgres just added some support for JSON in the latest (beta) versions. Haven't used it and I don't know what it's capable of.

The support for JSON is only validation, otherwise it's just like a plain text column.

A better choice would be to use an hstore column, which is sort of like JSON but with different syntax. Postgres supports indexing hstore subfields, querying on them, etc.
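
A quick illustration of hstore with a subfield index (table and index names illustrative):

```sql
CREATE EXTENSION hstore;

CREATE TABLE items (
    id    serial PRIMARY KEY,
    attrs hstore
);

-- a GIN index makes containment queries on subfields cheap
CREATE INDEX items_attrs_idx ON items USING gin (attrs);

INSERT INTO items (attrs) VALUES ('color => red, size => M');

SELECT id FROM items WHERE attrs @> 'color => red';
```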

Postgres has XML, binary, JSON, geometry ...

And of course, hstore.

And array.

> That's not correct. There isn't a "table" for a subreddit. There is a thing/data pair that stores metadata about a subreddit, and there is a thing/data pair for storing links. One of the properties of a link is the subreddit that it is in. Same with comments: there is one thing/data pair for comments, and the subreddit a comment is in is a property.

Having a bit of trouble parsing this, but I think you mean that the "thing/data" tables are per type, where type is "subreddit", "comment", "links". Which would indicate a fixed schema.

Can you clarify if Reddit just has a fixed number of tables? I remember seeing some "table.create()" in there but I wasn't sure what I was looking at.

You are correct that reddit has a fixed number of tables (waves hands a bit). There is one table per type, so when a new type is created, a new table is created. That rarely happens.

The reason you see those create statements in the code is for bootstrapping for open source users -- the first time they run reddit, it creates the tables.

Every "thing" (a single table) has an attribute called "type". Type is a string, and is something like "comment" or "post" or "badge".

"table.create()" doesn't create a new table, I think it creates in index on "type", so you can treat it a bit like a separate table.

Yes, it is not a "real" database. It's a key-value store, which doesn't lose data.

> "table.create()" doesn't create a new table,

if it's a sqlalchemy.schema.Table, create() emits DDL for "CREATE TABLE" to the database (trust me, I wrote it). I'm guessing "table" here is some other object local to the reddit codebase.

anyway, how many "thing" tables are there total?

How do you handle the size of the table (which I assume has many rows)? Do the inserts start to slow down as it grows? How much RAM did you have for a master DB instance? Thanks for any input!

Reddit actually does use Cassandra, and you can read about their adventures and such with it here: http://blog.reddit.com/search/label/Cassandra

I guess now I know why reddit is so slow.

Among other problems, with a table structure like this it's hard to make good indexes.

No, it's not.

Everything (everything) is cached in memcachedb, which is also where they store their global variables.

It's a terrible design, but it's a web forum. All they need is good horizontal scaling.

Compared to what? Reddit is damn fast.

And this table structure has nothing to do with Reddit speed. The pages you get, 99.999% of them come from Cassandra and caches, pre-rendered.

So no, reddit is not slow, even less "so slow", and no, the table structure has nothing to do with its speed.

Signed-in reddit is one of the slowest sites I use. It is, however, very fast for what it is doing. I have no complaints about the speed, only wishful thinking.

I concur. Signed-in reddit is very slow. Signed-out reddit is better but still feels slow. Perhaps I am spoiled by Facebook, Google & co.

Well, this sort of explains the question I was thinking of... which was: how do they deal with such a massive table? At some point the indices will grow very large and need to be split up somehow, or they won't fit in RAM.

Postgres has a feature called index partitioning, which solves such problems. There is also table partitioning. Creating/merging such tables and indexes should be done by your app automatically if you expect it to grow.
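
In the Postgres of that era this meant inheritance-based partitioning plus constraint exclusion, where each child table (and its index) stays small enough to be manageable; a sketch, assuming a hypothetical `things` table with a `created` timestamp:

```sql
-- the parent stays empty; each child holds one date range
CREATE TABLE things_2012_09 (
    CHECK (created >= '2012-09-01' AND created < '2012-10-01')
) INHERITS (things);

-- each child gets its own small, RAM-friendly index
CREATE INDEX things_2012_09_created_idx ON things_2012_09 (created);
```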

> created on the fly (a crime for which any DBA would have you put to death, normally)

Is the crime in creating tables on the fly? Or creating tables of identical structure on the fly?

A relational table is meant to store data, but is not meant to represent data by its own existence.

"Represent data by its own existence" means, if you wrote an app where every time a user created a new account, a brand new table called "user_account_XYZ" would be created to store that user's account information. This is the classic "tables-on-the-fly" antipattern. It's not the same as a horizontal sharding or replicated system where tables are structurally broken out for scaling purposes.

We of course have "on the fly" schemes for all kinds of things that are normally manually generated; some template languages automatically generate backing source files, cloud-based hosting spins up and tears down servers on demand, and database replication or sharding schemes will generate new schemas and tables to accommodate growth or change in the overall dataset. That's not what I mean when I say "on the fly" - I mean an application that relies upon creating new relational tables on the fly in order to represent data.


Except the second one would be a crime even if not on the fly.

I can't understand why this is a crime. Could you tell us (me) why it (creating new tables on the fly) should be prohibited?

As the OP, I would also appreciate Reddit dev input on what they're using now... I am summarizing a summary of a Reddit talk from 2010.

See my comment here: http://news.ycombinator.com/item?id=4468905

There isn't one table per subreddit.

I'm not a current dev, FYI, I left a year ago. But as far as I know, it still works the same way as a year ago. Which is the same as it worked 3 years ago.

> If you want to build a Reddit, use Mongo/Cassandra or something like that.

Can they handle terabyte-scale data loads these days?

I can't really imagine mongo coping well with reddit, given the global write lock situation.

This is a thin wrapper around a good, but two-year-old, High Scalability post:


which was discussed on HN recently:


and on HN long ago:


EDIT: Fixed "long ago" vs "recently". Thanks, sync.

My fear with headlines like this is that people with no business working at the scale that Reddit does will suddenly eschew years of best-practice SQL and be like "my dad's pizza shop CRM only needs two tables, JUST LIKE REDDIT".

Everyone please, use your brain before repeating such specific configurations. Reddit is quite exceptional. Your burgeoning to-do list app is not.

Umm, assuming the GitHub repo is what they actually use (I assume so, given how often it is committed to), there are two tables per object: reddit_thing_link, reddit_data_link, reddit_rel_thing_savehide, etc.

Seriously? Sounds like a half-baked reimplementation of a https://en.wikipedia.org/wiki/Triplestore to me ...

The post he refers to is more than 2 years old; things may have changed.

They've built a database in their database. Also known as the inner platform antipattern: http://en.wikipedia.org/wiki/Inner-platform_effect http://thedailywtf.com/Articles/The_Inner-Platform_Effect.as...

No, they have <thing> and <thing data> tables for each entity. This is a grave misunderstanding of what was actually said.

Datomic takes a similar approach. A unit of Entity, Attribute, Value, and Time is called a datom. Although this approach doesn't require a schema, Datomic makes us define one for some benefits.

Basically they rediscovered RDF.

They should switch to an RDF triplestore, at least they would be able to exploit some of the RDF-only optimizations and their validation tools.

This is roughly the path I'm taking on my current project. There are some things that have a consistent, "traditional" schema. However, many of the feature-important things are being stored as JSON blobs. This means that I don't have to worry about schema migrations when I add a new feature; instead, I just define its JSON schema. This works particularly well because I don't need to join on these items. The downside is that I do have to do occasional manual processing on them (sum up all values of type FOO) that would be trivial if everything were perfectly normalized.
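
Such a hybrid might look like this (a sketch; table and column names are hypothetical):

```sql
-- fixed, queryable columns plus a per-feature JSON payload
CREATE TABLE events (
    id      serial PRIMARY KEY,
    user_id integer NOT NULL,     -- joins and filters use real columns
    kind    varchar(32) NOT NULL,
    payload json                  -- feature-specific structure lives here
);
```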

> many of the feature-important things are being stored as JSON blobs. ... when I add a new feature I just define its JSON schema.

Some no-sql stores work like that. You may be better off using one of them.

I looked at that (and have done that in the past), but I really missed some of the ORM features I got with relational. I wound up writing more glue code than I liked for things that would otherwise be free. This is my compromise world (and Postgres is getting better at handling its JSON).

I used to work at a company with a huge LDAP database that had been growing in uncontrolled and organic ways for years. If you've never worked with LDAP before, it's basically a big key-value store, with some optional schema enforcement. We spent an inordinate amount of time manually cleaning it up and trying to enforce some kind of data integrity and we were always bemoaning the lack of a real structural schema.

I think the kind of ad-hoc generic data typing described in this article is sometimes the right solution but it comes at a cost.

I wish I had read this article a year ago, before I started developing my latest app. I have that exact problem, where adding a column to a 10M+ row table takes over an hour. So instead I end up with "things" tables all over the place.


The last company I worked at made an application that used the Entity-Attribute-Value pattern in the database. The stated reason was to be dynamic, so we didn't have to worry about adding new columns and the associated downtime (assuming the DB got huge, which of course it would because this app would surely be a huge success). We had that problem on our main app (with 10s of millions of rows) where adding a column was always tricky, so I think management over-corrected. The other supposed win was that since the model didn't change, the code didn't need to be updated.

The data that was being stored fit into the relational model pretty well. But thanks to E.A.V. it was very difficult to query. The kinds of questions we often looked at (how many records from this zip code) would have been trivial without the E.A.V. Today you might use a NoSQL database (which were just starting to get noticed at the time), but in reality it fit into MySQL just fine.
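
The zip-code example, in both shapes (both schemas invented for illustration):

```sql
-- conventional schema: trivial
SELECT COUNT(*) FROM people WHERE zip = '02139';

-- EAV: every attribute you touch is another join against the value table
SELECT COUNT(*)
  FROM entity e
  JOIN attr_value av ON av.entity_id = e.id
 WHERE av.attr = 'zip' AND av.value = '02139';
```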

The real sad part is, we never used that functionality in the 2-3 years after it was developed while I was there. The app wasn't big enough for adding columns to take much time at all. All that "flexibility" we needed? We didn't use it, because it would have taken additional time to implement the additional front-ends and update the other backend systems.

Even if we wanted to keep things smaller, we could have gone with a table per type of record (record_type_one, record_type_two, etc) instead of one big records table. That would have made schema changes easier.

And of course, the code did need to be updated. Always. Sure there were no new columns that might cause problems if they didn't have default values, but you can never make changes to an app without code changes. We still had to implement the new interface. We had to implement the code to post that new kind of record to the systems it got processed by. Making the code handle the DB changes would have taken less time than the day or two a designer might work on the front end. It never would have been a bottleneck.

Looks like you guys did a bit of a premature optimization there. I know how painful it can get :(

That's exactly what it was.

It was designed to replace a growing set of systems that were all tiny forks of the same basic code base over a couple of years. Managing all that had become a mess, and it did need to be replaced. But there was an opportunity and we ended up trying to reach for the stars when we should have aimed a bit lower.

The irony is that after the system had been in production for a few months, we noticed it had terrible performance that was getting worse with load.

It turned out the programmer who had written some parts of the system had it recalculating way too much data, things that didn't apply to what was going on and that couldn't have changed. It was probably an artifact from initial development (I'll do it this way to get it going, then cleanup later...). Once we caught and fixed that, it was much much faster.

That was the only optimization I remember it needing while I was there.

