I think the linked blog post goes a bit overboard. Atwood himself says repeatedly that you should start with a normalized schema; he just also argues that denormalizing for performance shouldn't carry such a strong stigma, given that it's sometimes the only way to make truly massive databases perform. We can argue over whether Atwood is optimizing a bit too early, but the general idea he's expressing is neither new nor (at least among experienced DBAs) disturbing.
Except that the premise is wrong! Denormalizing will not speed up the database. It's a false argument; it just won't do that.
There are specific table structures, combined with specific query types, that can be denormalized to speed things up. But a typical table won't speed up in the slightest; most of the time it will slow down!
It will slow down because when you need only one column out of several, the wider each row is, the slower the access: fewer rows fit on each disk page, so the same scan has to touch more pages.
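A minimal sketch of that point (hypothetical table and column names; this assumes a plain row-store where a scan reads whole pages):

    -- Wide, denormalized row: every scan of screen_name drags the
    -- bytes of all the other columns through the buffer pool too.
    CREATE TABLE users_denorm (
        user_id     INTEGER PRIMARY KEY,
        screen_name VARCHAR(40),
        phone       VARCHAR(20),
        bio         TEXT,
        avatar_url  VARCHAR(255)
    );

    -- Narrow row: many more rows fit on each page, so the same
    -- scan reads far fewer pages.
    CREATE TABLE user_screen_names (
        user_id     INTEGER PRIMARY KEY,
        screen_name VARCHAR(40)
    );

    SELECT screen_name FROM user_screen_names;  -- cheap
    SELECT screen_name FROM users_denorm;       -- pays for the row width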
Denormalizing is even worse for large sites. By normalizing you can often put some tables on one machine and others on a different one. Take screen_name and phone number, which Jeff combined: screen_name is used by the chat program, and phone number by the profile page.
Run the chat program on a totally different set of machines. All you need to send it is the user_id.
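One way that split might look (hypothetical names again; nothing but the user_id ties the two tables together, so each can live on its own server):

    -- On the chat cluster:
    CREATE TABLE chat_users (
        user_id     INTEGER PRIMARY KEY,
        screen_name VARCHAR(40) NOT NULL
    );

    -- On the profile server:
    CREATE TABLE profiles (
        user_id INTEGER PRIMARY KEY,
        phone   VARCHAR(20)
    );

    -- The chat program is handed only a user_id and resolves it locally:
    SELECT screen_name FROM chat_users WHERE user_id = 42;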
I get what you're saying. But I think "screen_name" is used here to indicate the IM links that some sites let users put on their profile. They then create the little ymsgr:sendIM?user_name hyperlinks so that you can IM that person.
So that IS actually something that would go on the profile page, right alongside the phone number.
I'd always heard that normalization should only be done at sane data boundaries, i.e., if you have a bunch of fields that always go together, they should be in the same table.
The thing with screen_name was just an example of how normalizing could help with speed; it wasn't meant to be definitive. Plus, for the profile page you could just fetch the data from that other server.
> I'd always heard that normalization should only be done at sane data boundaries, i.e., if you have a bunch of fields that always go together, they should be in the same table.
No, not exactly. It's too big a subject to explain in a post like this, and there's plenty written on it if you search.
But basically, normalization removes duplicated data from the database. It's also used to allow multiple X to exist for a single Y without having columns like X_1, X_2, X_3, etc. And finally, it's used to show relationships between things.
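As a minimal sketch of the "multiple X per single Y" point (hypothetical names): instead of phone_1/phone_2/phone_3 columns, a child table lets a user have any number of phone numbers:

    CREATE TABLE users (
        user_id     INTEGER PRIMARY KEY,
        screen_name VARCHAR(40)
    );

    -- One row per phone number, however many there turn out to be:
    CREATE TABLE user_phones (
        user_id INTEGER NOT NULL REFERENCES users(user_id),
        phone   VARCHAR(20) NOT NULL,
        PRIMARY KEY (user_id, phone)
    );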
There are some people who will simply split a table into two tables along logical boundaries (still just one row in each table per original row). This is not normalization; it's sometimes called vertical partitioning. It can be done because some data is accessed rarely, or because the database cannot handle such a big table, but it isn't done for normalization's sake.
Not long ago I submitted an Ask HN requesting information on database design practices, but it never got popular. It seems like it would be quite obvious to most here how to do quality database design, but I cannot find any good resources online. If any of you have that info, or more specifically standard practices that are not necessarily in books, please reply, because I tend to feel somewhat lost on the topic.
1. Paul Graham was wrong in that essay, and also he's an idiot.
2. Anything by Jeff Atwood
3. Jeff Atwood was wrong in that blog post, and also he's an idiot.
4. How I made my startup (succeed|fail) by working 114 hours a day and eating only cardboard and pocket lint.
4a. How we built a web app in 4 minutes for only $0.37
5. Being a (programmer|hacker|founder) is teh awesome.
Not saying that all the articles are on those subjects, but they are HN gold, that's for sure. If you could somehow work some of these topics into your question, it might do better.
I hear ya, notdarkyet. A post about the price of oil in Yemen will get 200 comments, but a post about database design practices will get 5 comments on Hacker News. Go figure.
There are millions of references on the subject, but perhaps you could start with a good intro, one of my favorites:
It's tricky because most database writing is about SQL and OLTP and other practical things, but studying those things without first understanding the basic concepts of set theory and predicate logic is kind of like studying C and pointers without first learning about computer hardware. If you learn about pointers first they're confusing, but if you learn about hardware first they're obvious.
Ted Codd's big idea is fairly simple: "we can store data in logical, rather than physical, structures and then use logic statements, rather than algorithms, to retrieve the data."
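In SQL terms that looks like this (a small sketch reusing the hypothetical users/user_phones tables from earlier in the thread): the query is a logic statement about which rows qualify, and the engine, not you, picks the algorithm.

    -- Which users have at least one phone number on file?
    -- No loops, no index choices, no access path: the planner
    -- derives all the physical details from this logical statement.
    SELECT u.screen_name
    FROM users u
    WHERE EXISTS (
        SELECT 1 FROM user_phones p WHERE p.user_id = u.user_id
    );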
Wikipedia can often end up being more confusing than enlightening, but the "informal introduction" on this page is a place to start:
Celko used to post on Usenet. I was looking to see if he had anything useful to say, and someone asked what books he recommended. He replied:
"The real trick is to spend some time with a good theory book first (I like my DATA & DATABASES), so you have an overview of RDBMS concepts. It makes the language much easier. Can you imagine trying to larn Fortran without knowing algebra?
I like Rick van der Lans book for a language reference and intro. Then move on to Hernandez and Viescas. And finish up with my stuff."
And in the beginning there were two different teams at IBM trying to implement Ted Codd's ideas. The ISBL team stuck faithfully to Codd's ideas, and the SEQUEL team kind of bastardized them. But SQL won in the marketplace. Hugh Darwen is still bitter:
If it sounds like I'm being pedantic, realize that three things must be taken into account when asking whether a group of SQL tables represents a normalized group of relations: table structure, row structure, and access methods.
It's possible to put an entire, fully third-normal-form relational database into a single SQL table.
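One way to read that claim (my own hedged sketch, not necessarily what was meant): encode every relation as (relation, key, attribute, value) rows in one physical table. The logical design is still a set of normalized relations; the single SQL table is just the storage encoding, and the access methods do the work.

    -- One physical table holding many logical relations:
    CREATE TABLE db (
        rel  VARCHAR(30)  NOT NULL,  -- which logical relation this row belongs to
        k    VARCHAR(100) NOT NULL,  -- that relation's key, serialized
        attr VARCHAR(30)  NOT NULL,  -- attribute name
        val  VARCHAR(100),           -- attribute value
        PRIMARY KEY (rel, k, attr)
    );

    -- Reconstructing a logical "users" relation takes self-joins:
    SELECT s.k   AS user_id,
           s.val AS screen_name,
           p.val AS phone
    FROM db s
    JOIN db p ON p.rel = s.rel AND p.k = s.k AND p.attr = 'phone'
    WHERE s.rel = 'users' AND s.attr = 'screen_name';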
This submission title looks like a troll, but I'll bite.
The good thing about Coding Horror is that it gets people thinking about issues that they might not have considered before. While the details may not be 100% correct, I think the general principle he's trying to communicate is valid.
I can't speak for Jeff, but one thing he would probably admit is that he isn't perfect. I've found that he doesn't necessarily write about stuff he's an expert on, but mostly stuff he's learning about, or issues that people should perhaps be thinking about.
I'm sure there are plenty of other websites that debate the finer points of database design... perhaps they too should stop blogging in case someone takes their advice too literally. :P