Hacker News new | comments | show | ask | jobs | submit login

This is called a http://en.wikipedia.org/wiki/Entity-attribute-value_model or http://en.wikipedia.org/wiki/Triplestore. I think the author is understating the price, though. There's a lot of existing software you could reuse if your data was stored in more conventional relations, and "manually enforce consistency" is a pipe dream. Your code has expectations about your data, so in the abstract you do still have a schema, and not writing it down merely prevents any tools from helping you keep your data sane over time. I've seen Notes databases decay to the point that not even the dev team could explain how a document got into its contradictory state nor how the apps should (much less currently would) handle it. The few people diligent enough to do completely correct work without a checkable schema, aren't the people who would be tempted to try.

I recently read The Pragmatic Programmers book SQL Antipatterns: Avoiding the Pitfalls of Database Programming (http://pragprog.com/book/bksqla/sql-antipatterns) and chapter 6 talks specifically about this kind of DB design (entity-attribute-value) and the benefits/pitfalls. It's a great read and I highly recommend it.

I bought that book after my first time developing on top of Magento.

Pure curiosity: Do you know of any other OSS projects that went down this route?

Yeah, I'd been thinking "Magento" from about halfway into that article.

Wordpress uses the EAV db pattern too (have a look at the wp_options and wp_meta tables).

Of course, Magento and Wordpress use it in fundamentally different ways - so this project I've got to build a combined search across a site with both Magento (EAV for products) and Wordpress (non-EAV for posts/pages) in use, is in that "it'll sit on the backburner until it becomes critical or somebody else solves it for me first" stateā€¦

Squiz's Matrix (open-source CMS) also uses an EAV model.

It should also be noted that this isn't the only way to do this either.

The other three most common ways I've seen to store differently structured data in a consistent way are:

  * Edge table
  * Binary table
  * Universal table
The edge table approach maps edges into key:value tables for the type of table. So you might have an integer data table that was ID<key>:Value<int>, and it only stored the data values and a key to identify the value. And then you have a thing table which stored Name<string>:Flag<type>:ID<key> where the flag column said "int" and the ID column then contains the key of some row in your integer table.

There are as many tables as there are types of data, + 1 table for the definition of all things stored.

The binary approach groups values of the same name into a table of that name. So if you have a property called "username" you'd have a table called "username", and this approach usually still includes the edge approach such that the values are stored in tables with ID:Value structure by type of value. Effectively partitioning the value table by name of the property being stored.

There are as many tables as there are types of data, + the count of unique property names.

The universal table approach uses a single table to store all data, effectively containing all of the edges in a graph. It's a conceptual full outer join of all binary tables. Imagine 100's of columns, and for each row the vast majority of those are NULLs.

The triplestore can be thought of as a condensed version (not type-safe) of the universal table. More space efficient, but usually at the cost of the datatype safety offered by the database.

You can combine these methods to store any type of semi-structured data in a traditional RDBMS and these approaches are used a lot if you know where to look for evidence of them.

SharePoint uses a version of the universal table. Oracle (for the ORA_XML storage type) uses a combination of the above under the hood to map XML into traditional database tables.

They all have pros and cons relating to query speed (for certain types of query), storage efficiency, indexing, types, etc. And as always you need to know why you're choosing something.

SharePoint also partitions their universal table dynamically. But they start with saying that each "SiteCollection" has it's own database/tablespace. As SiteCollections can contain a single site, this is basically equivalent to Reddit creating new tables per subreddit.

Elsewhere (up or down on this page) I saw that the SQLAlchemy author advises not to do this kind of thing. I simply say don't do if you don't know strongly why you should do it. Don't cargo-cult... go and read up on the problem space and only implement one of these solutions when you know the effects of doing so.

This is one of those times when analysis paralysis can be your friend and prevent you from building a mess you didn't need. If anyone starts with one of these designs and you haven't got a decade of experience to know why... you are probably doing it wrong.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact