Hacker News new | past | comments | ask | show | jobs | submit login
A data modeling challenge: relational vs document oriented for modeling orders (maxschireson.com)
31 points by mschireson on May 10, 2011 | hide | past | favorite | 11 comments



It's perfectly acceptable to store point-in-time values--whole strings!--in your relation model, along side the FK. They actually record different information:

1. the strings are for reconstructing the order as it, at the time it was placed.

2. the FKs are for managing and categorizing the order data, eg: How many orders did customer X place in March? What's the margin on SKU Y? etc. The FKs facilitate this. Update the SKU description, assuming the product is still the same product, and all the data flows through properly to the reports.

Really, the relation model is not optimized for "first-order" applications, such as saving and displaying a complex entity. Document dbs are great at this.

Instead, it's optimized for second and third-order applications, like "show me all the clients that ordered X in March but didn't order Y in April." The whole point of the relation model is to make the data orthogonal to any particular application. Try asking those second-order questions of your document-based DB. Yes, it can be done, but at what effort, and with what degree of confidence in the results? cf. http://howfuckedismydatabase.com/nosql/

EDIT: all that said, I think there really is a benefit to document dbs: maybe you don't care about second order applications yet. Maybe you just need to bang something out today.

Since the relational model is orthogonal to your application, it can be a real bitch to get it "right". How do you know what you will use this data for tomorrow? You don't, you can make some guesses, but this is where the normal forms come in: "follow these rules, and you can use your data for anything!" Well, fine, but the resulting models are often very hard to grok, and hard to work with. For someone w/ better things to do, this can turn into a real time sink.

Perhaps it is better to take on a little technical debt today, with the expectation that tomorrow, if need be, you'll convert or extract to a more relational form.


all that said, I think there really is a benefit to document dbs: maybe you don't care about second order applications yet.

If you don't care about second order applications yet, then you don't have anybody asking questions about what's happening in the business yet. Make sure you're equipped to answer these questions, whether via relational queries or map-reduce jobs or whatever. Also make sure that you can turn around requests for a lot of different kinds of these queries, both scheduled and ad-hoc. Business folks call this "reporting" and it isn't optional, because it's one of the business' main ways of staying grounded in reality. Is the new marketing campaign working? Are the demographics of our users match our predictions? How did last week's outage affect customer retention? The worst case scenario is when the answers to those questions are nonexistent, and the second worst case is if they're hidden in log files, but after those two the worst case scenario is if the only people who can write queries against your datastore to answer those questions are the hard-core developers who ought to be adding new features or working on infrastructure.

So however you store your data, make sure it isn't rocket science to write the queries that the business guys need to answer their questions. Otherwise your rocket scientists will be wasted answering marketing questions like, "How many iPhone users who purchased our service in February used it in both March and April?" instead of improving your technology.

P.S. I believe reporting is a major motivation behind the SQL-like languages that pop up for non-relational databases, but my experience with the problem I described was actually with a very large and complicated relational database, so SQL is no panacea. It took us a long time to create a reporting team that was self-sufficient enough to shield our senior database developers from the flow of menial reporting work.


Yes, complex queries require a programmer who understands map/reduce. The degree of confidence corresponds pretty directly with the programmer's competence.

Is writing a function in javascript more or less difficult than writing a complex SQL query? I'd call it a tie at best.

So you have the upfront technical expense of setting up a relational schema which enables you to (maybe) save time later writing analytic queries.


Agree on using both point in time values and FKs.

Agree also that there are plenty of second and third order apps where RDBMS is great.

The non-DB comic is funny.


You took a real beating on your blog, which prompted my edit.

It may be useful at times to frame the debate in terms of time to market, secondary apps, etc. The relational model necessarily presupposes you will use the database for more than one thing, and over an extended period of time, thus warranting a high up-front investment and agnostic model. This may be right in some cases, but often times it's more important to publish early, bootstrap yourself, and start iterating.


Thanks. Didn't expect as much interest as I got - would definitely have worked longer on the post if I thought thousands of people would have read it :)

My view is not that relational is bad, but that the world needs an alternative.


I enjoy using mongo / other document stores, but this article is lame. The contrast is between a single document storing an order, with sub-documents containing details, versus a table per sub-doc type. They claim you have to denormalize data in the relational version, which isn't true, and use that claim to assert that updates are complicated.

Guess what? Denormalizing data can make things complicated in mongo-land, too. This isn't just a relational db problem.


Yeah, total straw man. For e.g.:

Similarly, the order lines don’t contain product names, but rather foreign keys which point to a product table

Well, the product name could be the foreign key. Or it could not be. If you have 2 different products with the same name (or 1 person with 2 different names) then you have some complexity. Doesn't matter what kind of database you're using.


I don't believe it's a straw man, and you haven't refuted the posters view other than to say you don't believe the historical preservation is necessary without elaboration.

I'd also point out that some relational luminaries advocate a similar perspective, such as Pat Helland.


the basic point of the article is that author likes documents better than normalized record sets. There are at least 2 sets of concepts involved - documents vs. records and denormalization vs. normalization. The author preference for document format has no bearing on the necessity and effectiveness of normalization if record approach is chosen.


I was playing around with the idea of redoing our customer database at work. Our customer database contains for each customer their email address, name, when their service expires, how many licenses they have, and things like that. Essentially a bunch of named scalar values.

It also contains their transaction history--essentially an event log. Some of the transactions have auxiliary information such as specific line items on an order.

A person subscribing to our service can use it on multiple computers (up to a number determined by exactly what they purchased). So there is a list of the computers they are using it on, with some information about that use.

The state of a customer could be represented nicely by a nested data structure, like a Perl hash or a JSON document, and so some kind of document oriented storage seems like it could work well. Most changes to a customer's data consist of adding new transactions, and occasionally updating the expiration date.

We have a customer service interface that our support people use to view and change a customer's data. Getting all the data by fetching a single JSON document would be much simpler than the multiple queries that are now done to grab the data from the MySQL database it is in, spread across many normalized tables.

In general, with one notable exception, almost everything that deals with a customer is dealing with one customer at a time. For instance, a support person is dealing with the customer they have on the phone. The order processor is dealing the customer's account for the order that it is currently processing.

It's not even clear we need a database at all for this. Why not just a file per account, with the file containing a serialized Perl hash or JSON document or something similar, with a simple classic Unix file locking mechanism used to prevent corruption? This is easy to scale, for both reads and writes.

That brings me to the notable exception to "dealing with one customer at a time". Reports. Reports like to do things involving selecting data from a large number of customers, related by various factors such as having purchased particular SKUs in a particular time frame, or having their credit cards charged through a particular gateway, or things like that.

Reports seems to be something that fits in a lot better with the traditional relational approach.

If the set of reports needed was known up front, I think one could design a system that uses a document approach for storing customer data, but also maintains a relational reports database to hold data specifically for reports. We tend to have a changing set of reports--we do a lot of poking around and playing with data to try to find ways to improve our system and I'm often looking at things I had no idea a week before I'd ever need to look at for a report.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: