Normalize First

gecko · on July 16, 2008

I think that the linked blog post goes a bit overboard. Atwood himself says repeatedly that you should start with a normalized schema; he just also says that denormalizing for performance reasons should not have such a strong stigma attached, given that it's the only way to make big, massive databases work correctly. We can argue over whether Atwood is arguing to optimize a bit too early, but the general idea he's expressing is neither new nor (at least among experienced DBAs) disturbing.

_csoo · on July 16, 2008

"[denormalizing for performance reasons is] the only way to make big, massive databases work correctly"

How about a column-oriented database? http://en.wikipedia.org/wiki/Column-oriented_DBMS

Would that help much? (I don't know much about databases :S)

ars · on July 16, 2008

Except that the premise is wrong! Denormalizing will not help you speed up the database. It's a false argument. It just won't do that.

There are specific table structures combined with specific query type that can be denormalized to speed things up. But a normal table will not speed up in the slightest, in fact most of the time it will slow down!

It will slow down because if you need only 1 column out of several, the larger each row is the slower the access will be.

Denormalizing is even worse for large sites. By normalizing you can often put some tables on one machine, and others on a different one. For example screen_name and phone number, which jeff combined. Screen_name is used by the chat program, and phone number by the profile page.

Run the chat program on a totally different set of machines. All you need to send it is the user_id.

IsaacSchlueter · on July 16, 2008

I get what you're saying. But I think "screen_name" is used here to indicate the IM links that some sites let users put on their profile. They then create the little ymsgr:sendIM?user_name hyperlinks so that you can IM that person.

So, that IS actually something that would go on the profile page, right with the phone number.

I'd always heard that normalization should only be done at sane data boundaries. IE, if you have a bunch of fields that always go together, they should be in the same table.

ars · on July 16, 2008

The thing with screen_name was just an example of how normalizing could help with speed. It wasn't meant to be definitive. Plus for the profile page you could just get the data from that other server.

>I'd always heard that normalization should only be done at sane data boundaries. IE, if you have a bunch of fields that always go together, they should be in the same table.

No, not exactly. It's too big of a subject to explain in a post like this, plus there is tons of text on it if you search.

But, basically normalization removes duplicated data from the database. It's also used to allow multiple X to exist for a single Y without having columns like X_1, X_2, X_3, etc. And finally it's used to show relationships between things.

There are some people will will simply split a table into two tables along logical boundaries (but still, just one row in each table). This is not normalization. It can be done if some data is accessed rarely, or because the database can not handle such a big table, but it's not because of normalization.

notdarkyet · on July 16, 2008

Not long ago I submitted an Ask HN that requested information on database design practices but it never got popular. It seems like it would be quite obvious to most here on how to perform quality database design, but I cannot find any good resources online. If any or you here have that info, or more specifically standard practices that are not necessarily in books, please reply because I tend to feel somewhat lost on the topic.

IsaacSchlueter · on July 16, 2008

Articles that always seem to bubble up on HN:

1. Paul Graham was wrong in that essay, and also he's an idiot.

2. Anything by Jeff Atwood

3. Jeff Atwood was wrong in that blog post, and also he's an idiot.

4. How I made my startup (succeed|fail) by working 114 hours a day and eating only cardboard and pocket lint.

4a. How we built a web app in 4 minutes for only $0.37

5. Being a (programmer|hacker|founder) is teh awesome.

Not saying that all the articles are on those subjects, but they are HN gold, that's for sure. If you could somehow work some of these topics in to your question, it might do better.

edw519 · on July 16, 2008

You missed a few:

http://news.ycombinator.com/item?id=72463

babul · on July 16, 2008

Well look at the demographic of the HN user and the obvious becomes even more obvious.

edw519 · on July 16, 2008

I hear ya, notdarkyet. A post about the price of oil in Yemen will get 200 comments but a post about database design practices will get 5 comments on hacker news. Go figure.

There are millions of references on the subject, but perhaps you could start with a good intro, one of my favorites:

http://www.amazon.com/Database-Design-Mere-Mortals-Hands/dp/...

Hope that helps.

And keep on posting. Sooner or later the crowd oughta pick up on it a get your ball rolling.

davidmathers · on July 16, 2008

It's tricky because most database writing is about SQL and OLTP and other practical things, but studying those things without first understanding the basic concepts of set theory and predicate logic is kind of like studying C and pointers without first learning about computer hardware. If you learn about pointers first they're confusing, but if you learn about hardware first they're obvious.

Ted Codd's big idea is fairly simple: "we can store data in logical, rather than physical, structures and then use logic statements, rather than algorithms, to retrieve the data."

Wikipedia can often end being more confusing than enlightening but the "informal introduction" on this page is a place to start:

http://en.wikipedia.org/wiki/Relation_(mathematics)

Here's a page I just found about predicate/first-order logic:

http://www.rbjones.com/rbjpub/logic/log019.htm

This book by Chris Date covers a lot of basic relational database ideas, but his writing is a bit tedious:

http://www.amazon.com/dp/159059746X

After that the main mental challenge is just learning to think in sets. Joe Celko has some writings specifically about that in his various books.

After that, the best book I've seen for SQL and practical issues is The Art of SQL:

http://oreilly.com/catalog/9780596008949/

Hope that helps. There's a lot of really bad writing about relational databases out there.

davidmathers · on July 16, 2008

Celko used to post on usenet. I was looking to see if had anything useful to say and someone asked what books he recommended. He replied:

"The real trick is to spend some time with a good theory book first (I like my DATA & DATABASES), so you have an overview of RDBMS concepts. It makes the language much easier. Can you imagine trying to larn Fortran without knowing algebra?

I like Rick van der Lans book for a language reference and intro. Then move on to Hernandez and Viescas. And finish up with my stuff."

Rick has a book specifically for MySQL:

http://www.amazon.com/dp/0131497359/

davidmathers · on July 16, 2008

Oh! These might be interesting. Charles Peirce was Tedd Codd's teacher and clearly had a big influence on his thinking:

http://en.wikipedia.org/wiki/Charles_Peirce#Mathematics

And, in the beginning there were 2 different teams at IBM trying to implement Ted Codds ideas. The ISBL team stuck faithfully to Codds ideas and the Sequel team kind of bastardized them. But SQL won in the marketplace. Hugh Darwen is still bitter:

http://www.dcs.warwick.ac.uk/~hugh/TTM/HAVING-A-Blunderful-T...

pmorici · on July 16, 2008

If you read the mySQL manual it pretty much lays out the various performance issues for you.

davidmathers · on July 16, 2008

Oh man, this is just as bad as the Coding Horror post.

Four lists:

1. Things which are relations

-- relations

--

2. Things you can use to represent relations

-- pen and paper

-- perl/python/ruby/javascript/etc arrays & hashes

-- c arrays & pointers

-- SQL RDBMS tables

--

3. Things which the concept of relational normalization applies to

-- relations

--

4. Things which the concept of relational normalization doesn't apply to

-- pen and paper

-- perl/python/ruby/javascript/etc arrays & hashes

-- c arrays & pointers

-- SQL RDBMS tables

--

If it sounds like I'm being pedantic, realize that 3 things must be taken into account when talking about whether or not a group of SQL tables represents a normalized group of relations: table stucture, row structure, and access methods.

It's possible to put an entire, fully 3rd normal form, relational database into a single SQL table.

icey · on July 16, 2008

I saw this over proggit and it made me laugh hard enough that I bookmarked it.

Now I think of it every time I see Atwood's name:

http://lost-theory.org/realultimatepower/ (Yes, it's a ripoff of http://realultimatepower.net)

dhotson · on July 16, 2008

This submission title looks like a troll, but I'll bite.

The good thing about Coding Horror is that it gets people thinking about issues that they might not have considered before. While the details may not be 100% correct, I think the general principle he's trying to communicate is valid.

I can't speak for Jeff, but one of the things that he will probably admit is that he isn't perfect. I've found that he doesn't necessarily write about stuff that he is an expert on, but mostly stuff he's learning about or issues that people should perhaps be thinking about.

I'm sure there are plenty of other websites that debate the finer points of database design.. perhaps they too should stop blogging in case that someone takes their advice too literally. :P

jawngee · on July 16, 2008

Atwood runs Wordpress on a Windows box. Enough said.

dhotson · on July 16, 2008

There's really no need to be dogmatic about it.

If it's stupid and it works, it's not stupid.

thwarted · on July 16, 2008

This must be some new method of doing boolean algebra and discrete math I'm not familiar with.

jawngee · on July 16, 2008

Follow his twitter feed to find your answer.