

Ask YC/HN: Database Resources and Help - notdarkyet

Hey all, I am recently getting in to web development on my own and am struggling to find good resources on the organization of databases.  I am currently in college right now but cannot take the database class for another two semesters because the required prerequisites.<p>The issue that troubles me is appropriately organizing tables and efficiently organizing the information.  I am familiar with normalization and some of the basic concepts involved with such but I feel as though I am missing some possible rules of thumb or methods that individuals go about when designing the databases.<p>For example, take a site like HN or Digg, what would be the appropriate way to organize user accounts, profiles, each submitted site, the comments for each site, etc.  Logically, I would think that if a user made a comment, it would be appropriate to insert that comment into both the table for the user as well as the table for the site be commented on, but that may be an extremely naive approach to handling such a task. I understand that to most here this seems as though it is most likely an obvious question but if I had definitive knowledge to make an educated decision, I would do so.<p>Ideally I would love some quality online resources or a simple explanation over the way you hackers go about the design phase.  Good books would be appreciated but I am tight on cash so nothing too pricey.<p>Thanks for any responses you guys give.  I understand this is quite a noob question and there might be a more appropriate forum out there for such a query, but I really value the analysis from most of this community. Cheers!
======
gcv
I learned most of what I know about database design through practice and
making many enlightening mistakes. In all likelihood, I still don't design
databases correctly, but at least going through some bad designs and suffering
their consequences gave me an intuitive feel for what I should do. (That
intuitive feel has translated into loathing for RDBMSes and relational data
modeling in general, but that's an entirely different subject.)

In databases, the cleanest design often has bad performance characteristics.
This clean design, called "normal form," which has several different levels,
boils down to minimizing the number of times a particular piece of data is
stored in the database. In your example, it would typically be a mistake to
store the comment twice. It wastes disk space, and all inserts and updates on
the comment have to be done twice. Under some circumstances, copies of the
comment may get out of sync.

To address your example, here is a simple, incomplete, and probably
inefficient but fairly well-normalized table structure for something like HN:

    
    
      users: id, nickname, password_hash, joined_on, karma, ...
      stories: id, user_id (person who posted the story), url, content, posted_on, votes, ...
      comments: id, user_id (person who posted the comment), content, posted_on, story_id, parent_comment_id, votes, ...
    

In this table structure, stories refer to the person who posted the story by
the poster's user primary key ID field. I made this ID separate from the
user's name, and it would typically be monotonically increasing number (1, 2,
3, ...) or a GUID. (Some schools of thought advocate using a meaningful
identifier, and would recommend making the user's nickname the primary key.)
In the stories table, the user_id is foreign key into the users table, which
simply means that it refers to entries in the users table. Comments also refer
to users by the user's primary key, and they also refer to stories by primary
keys in the stories table. In addition, comments refer to other comments,
which allows parents to have a primitive hierarchical structure.

