
Ask YC: what's the best way to model a threaded comment system like YC? - pibefision
I'was thinking that maybe the best way would be use some king of nested set, but all the structures that i know are faster for reading purposes. And the tree here changes very often here, because voting changes it's structure.<p>Any idea?<p>This theory does not apply very well to a system like this (http://dev.mysql.com/tech-resources/articles/hierarchical-data.html)
======
kcl
It's best to keep things in-memory if you can.

Since you mention MySQL I assume that's what you're using to persist data.
Storing a tree in a flat database is kind of a challenge. This page
(<http://www.sitepoint.com/article/hierarchical-data-database>) suggests two
ways: adjacency lists, and modified preorder tree traversal. MPOTT is a cute
idea, but that's about it. It will make your database schema more complex,
make your algorithms more complex, and require preexisting entries to be
changed when new ones arrive. It's not faster than doing things in memory, and
it's just asking for trouble.

Try using adjacency lists and reconstructing the tree in memory each time you
serve it. This will let you get away with a single query for both reading and
writing. The dev.mysql.com link you mention warns against using the adjacency
list model, but that's because it's an area where MySQL itself is inadequate.
It's fine when you fix things in another language.

Here's a minimal schema:

topic, id, parent, payload

ID is unique and non-zero. Parent is the ID of the parent node (0 if top-
level). Topic is shared by all comments in a given topic. When you select,
select by topic and then recreate your tree by going through and
reestablishing "children" lists for each of your parents. You can then use
traditional tree algorithms.

This method is easily modified to support caching.

~~~
sah
I've done threaded comments with adjacency lists like this in MySQL in the
past, and it worked well. Note also that it's compatible with scores moving
siblings around relative to one another -- just calculate sibling position
based on both score and date. Your schema might look something like this:

id, date, topic_id, parent_id, score, payload

------
pg
<http://ycombinator.com/arc/arc2.tar>

------
mmcgrana
You might consider how much of the structure and ordering you actually need to
maintain in the database, and how much you could compute in your app servers
after pulling the raw data from the database.

For example, say that your comments table stored the user id, the post id, the
parent comment id (where the parent is the one the user clicked "reply" on),
the vote count, and a timestamp. Then to render a page for all of the comments
for a particular post, you could just pull all of the comments keyed by that
post id and then order/arrange them however you want in your application code,
such as by timestamp or number of votes. This doesn't require any tricky db
writes and generally removes load from the db (which is desirable since its
the hardest part of the app to scale).

The harder part is dealing with a case like the user comments page on neww.yc,
where you show all of the user's comments along with of the descendants of
those comments. Aside from the traditional approaches to storing hierarchies
in relational dbs like the ones you linked to, I know of two other
possibilities. One is to store the path of nodes to the root as
lexicographically-comparable strings. The other is to create a separate table
that stores ancestry relationships, in which for each node you have one row
for each of its ancestors in the tree up to the root. Again, both of these
approaches would allow you to pull out exactly the comments that you needed,
but you would have to do the arranging in the application code.

~~~
RyanGWU82
I worked at a company where we used a decimal number to represent a tree for a
threaded forum. It meant that we could use an ORDER BY clause and quickly get
all the descendants sorted, like this:

7.401 7.40101 7.40102 7.4010201 7.40103 7.4010301

The downside was that it wasn't nearly this straightforward -- numbers seemed
to be assigned haphazardly. And the scheme wasn't documented well, so it was
nearly impossible for new programmers to understand.

I would not recommend this approach again. If you need high-performance
searches for descendants, a separate table relating ancestors to descendants
will be much more straightforward in the long-run.

------
llimllib
Thoughts: 1) it's not relational data; is a database necessarily the right
place for you to store it?

2) If you do decide to use the database, you can increase the write speed of
the nested sets model by leaving large gaps between the left and right values.

That is, if the single child of the root node has left and right values of 0
and 1000, and you insert another child, you give it left and right of 1000 and
2000, and update the parent.

Then you'll only have to renumber if you get a wicked long thread.

~~~
tyler
I'm not sure I agree that we're not talking about relational data. I mean...
each record is -related- to others in a parent-child sort of way. Maybe I'm
misunderstanding what you mean by relational?

~~~
llimllib
Ooh, bad memories, my boss made the same argument at me last week.

Anyway! The relational model uses tables to model all data. Though you can
squeeze a comment tree into a relational data mode any number of ways, it is
most clearly expressed by a hierarchical model as in XML, S-exprs or a tree
data type in your favorite language.

So I'm not saying it can't be expressed as relational data, what I mean is
that that representation is not its most natural or useful form.

At work I use a query like:

    
    
        select s1.id, s2.id, s3.id, s4.id ...
        FROM subject s1
        LEFT JOIN subject s2 WHERE s2.id=s1.parent
        LEFT JOIN subject s3 WHERE s3.id=s2.parent
        LEFT JOIN subject s4 WHERE s4.id=s3.parent
        ....
        WHERE s1.id=?
    

to get all parents of a given node. It's ugly, inefficient, and fragile. You
can use the nested set model to simplify this particular query, but inserting
rows becomes extremely complex and painful; you really just trade one problem
for another.

So, while you can model a tree as relational data, it's not a good match.

~~~
noblethrasher
I think what the parent meant is that such a tree _is_ fundamentally
relational (not merely that it can be modeled as such). Since a relation is
just a mapping between tuples and a set of logical values (TRUE and FALSE in
most cases) then clearly there is a well-defined relation among the notion of
'parent' and 'child'. Thus the problem is the de-facto standard query
language. It seems to me that we could use new special query language (or an
extension to SQL) that could handle creating trees from pure relations in a
performant and lexically elegant way. Or maybe it’s already been tried and I
missed it. (edit: looks like a lot of smart people have already figured it
out)

------
bootload
_"... what's the best way to model a threaded comment system like YC? ..."_

A classic read is jwz's article on threaded messages ~
<http://www.jwz.org/doc/threading.html> jwz was a key hacker on mozilla ~
<http://en.wikipedia.org/wiki/Jamie_Zawinski>

------
clueless
<http://news.ycombinator.com/item?id=33888>

~~~
clueless
Also here is MY ineffecient way of doing it:

function article($link){

print_story($link->id);

print_comment(0,$link->id);

}

function print_comment($parent_id, $link_id){

$comments = $db->get_col("SELECT comment_id FROM table_comments WHERE
comment_link_id=$link_id and comment_parent = $parent_id ORDER BY
comment_votes DESC");

if ($comments) {

    
    
    		$comment = new Comment;
    		foreach($comments as $comment) {		
    				$comment->print();
    			}
    				print_comment($comment_id,$link_id);
    		}
    								
    	}
    

}

TABLE `table_comments` (

    
    
      `comment_id`, `comment_parent`, `comment_link_id`, `comment_votes`
    

)

If you store and manipulate all the data in memory, I don't see why this would
be that bad?

~~~
domnit
You can fetch all of the comments by comment_link_id in a single query, then
work with the result to recursively print the tree.

------
lampy
Here's a way I've previously "validated" some of my designs in this area.

Look at Slashcode (<http://www.slashcode.com/>), the code that powers
Slashdot. They have quite a few threading features and they obviously figured
out performance issues.

------
a-priori
The naïve approach is to give comments a parent (the reverse relationship
gives you its children), and a boolean specifying whether it's a "front page"
post. When selecting, order by votes. When rendering, do a depth-first search.

Is there any reason why that won't work?

~~~
pibefision
It's a little more complicated.

When you store the tree on the DBMS, you have some algorithtms to do it well.
For Example, the Nested Set (check the link up).

In that algorithm, the nodes are stored in a hierarchy that does not have any
information about the votes.

So, you have a perfect tree, but with the first vote, you must update all the
structure to reflect the new order of the child.

Nested set is great for speed reading, but not for writing often.

Sorry if I'm not very clear.

I was testing rails plugins : acts_as_nested_set and acts_as_better_nested and
all have the same problem.

~~~
neelesh
I used a slightly different approach of combining acts_as_tree (for faster
updates/inserts) and added a dotted_id for faster reads. The dotted id is a
string,as in 0001.0002.0007. Index this column and its fairly fast. The
obvious limitation this approach creates is the depth of your comment tree.
approx =dotted id col size/#of digits in max pk -1. Not an issue for many.
When Im updating the hierarchy , I update root of the hierarchy being moved
and then a single UPDATE to update the dotted ids in that subtree. If you are
using MySQL, then the max length of a column that can be indexed is also
limited. Don't remember the exact length.

------
tim2
Make a very simple database model with at least these columns: comment_id int,
created timestamp, parent int, score int,...

When you need to view a page of comments that has not been loaded, load it
into cache and then use your cache from there on out. When updates come in,
write them directly to the cache, re-sort the cache as necessary, and then
have the database updates run in the background.

For the cache tree model I use something like:

{'root_comment_id':123,'children_sorted_by_points':[]}

Cache expiration left as an exercise for you.

This can very easily handle something the size of hacker news.

edit: And this easily handles "many reads and many writes and an ever-changing
order".

------
optimal
More thoughts please on a system with both many reads and many writes and an
ever-changing order (i.e., sorted by dynamic rankings instead of "time
posted," a constant)!

~~~
pibefision
When you add the "dynamic ranking" to the rdbms model, it's turns very
complicated.

Would be a good idea to model the three on XML and store complete trees on the
dbms? (thinking loud!) :)

------
tlrobinson
Adjacency list in the database, build the tree in memory, cache the result in
memcached if you need to.

------
xirium
Have a message table and an ancestor table. The ancestor table allows you to
find all ancestors of a message or all decendents of a message. This allows
you to quickly retrieve a branch of the discussion tree.

------
shabda
MPTT is generally the fastest way to get relational to tree data. If you are
udsing Django code.google.com/p/django-mptt/

------
pibefision
One more idea :

Is here the solution? <http://arclanguage.org/item?id=3426>

