

Ask HN: How to handle a large collection of tags? - pipagiorgos

I have been thinking about how to handle a large number of "items" (videos, for example) that each carry quite a lot of tags (say 20 on average), the way last.fm or YouTube do, where tags are aggregated and used for data mining. Has anyone done this efficiently for millions of items and millions of tags, and can you share how it was done?

For example, in a relational database, should I give the tag "funny" a unique id and link every item with that tag to it through a middle table? Or is it better to store "funny" again and again, with a new tag_id each time, and then aggregate the tags by counting occurrences of the same word (no middle table in this case)?

Or has anyone done something similar with a NON-relational database, for example on Google App Engine? Is there any way to assign tags to items and aggregate them (count, group by, sum...) efficiently? I'm trying to avoid running into scaling problems too soon. How far can I go with simple hardware? Thx.
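
To make the first option concrete, here is a minimal sketch of the middle-table layout. It uses Python's built-in sqlite3 module only as a stand-in for whatever relational database ends up being used; the table names (items, tags, item_tags), the index name, and the tag_item helper are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    -- each distinct tag word is stored exactly once
    CREATE TABLE tags  (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    -- the "middle table": one row per (item, tag) pair
    CREATE TABLE item_tags (
        item_id INTEGER NOT NULL REFERENCES items(id),
        tag_id  INTEGER NOT NULL REFERENCES tags(id),
        PRIMARY KEY (item_id, tag_id)
    );
    -- lets "all items with tag X" avoid a full scan of item_tags
    CREATE INDEX item_tags_by_tag ON item_tags (tag_id);
""")


def tag_item(conn, item_id, tag_name):
    """Attach tag_name to an item, creating the tag row on first use."""
    conn.execute("INSERT OR IGNORE INTO tags (name) VALUES (?)", (tag_name,))
    conn.execute(
        "INSERT OR IGNORE INTO item_tags (item_id, tag_id) "
        "SELECT ?, id FROM tags WHERE name = ?",
        (item_id, tag_name),
    )


conn.executemany(
    "INSERT INTO items (id, title) VALUES (?, ?)",
    [(1, "cat video"), (2, "dog video")],
)
tag_item(conn, 1, "funny")
tag_item(conn, 2, "funny")
tag_item(conn, 2, "dogs")

# Aggregate: how many items carry each tag, most used first.
tag_counts = conn.execute("""
    SELECT t.name, COUNT(*) AS n
    FROM item_tags AS it
    JOIN tags AS t ON t.id = it.tag_id
    GROUP BY t.id, t.name
    ORDER BY n DESC, t.name
""").fetchall()
print(tag_counts)
```

With this layout the word "funny" is stored once, and counting is a GROUP BY over the middle table.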
======
spoiledtechie
I have taken the ASP.NET approach to counting and managing tags. It is quite
ingenious when you have millions of tags and each item has about 20 of them.

I create one table that contains all 20 million tags, each with a unique ID,
which 99% of the time is just an auto-incrementing integer. I sometimes also
add a counter column: every time a tag gets added to an item, I increment the
counter by one. That keeps track of how many times the tag has been used.
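
A rough sketch of such a tag table with a usage counter, written in Python with the built-in sqlite3 module purely for illustration. The column names and the use_tag helper are assumptions, not part of the original setup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tags (
        id        INTEGER PRIMARY KEY,         -- integer that just counts up
        name      TEXT    NOT NULL UNIQUE,     -- the tag word, stored once
        use_count INTEGER NOT NULL DEFAULT 0   -- how often the tag was applied
    )
""")


def use_tag(conn, name):
    """Return the id for a tag, creating it if needed, and bump its counter."""
    conn.execute("INSERT OR IGNORE INTO tags (name) VALUES (?)", (name,))
    conn.execute(
        "UPDATE tags SET use_count = use_count + 1 WHERE name = ?", (name,)
    )
    row = conn.execute("SELECT id FROM tags WHERE name = ?", (name,)).fetchone()
    return row[0]


funny_id = use_tag(conn, "funny")
use_tag(conn, "music")
use_tag(conn, "funny")

# Map each tag name to its counter.
counts = dict(conn.execute("SELECT name, use_count FROM tags"))
print(funny_id, counts)
```

Keeping the counter on the tag row means a "most used tags" list can be read straight from this table, at the cost of one extra UPDATE per tagging.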

When I tag an item with 20 tags, I store all of them in a single column on the item.

So let's say I have the tags

1, 5, 8, 20000, 35, 36.

At this point it doesn't matter to me what they correspond to in the database.
What does matter is keeping this column as small and simple as possible,
because with 200000 mil tags and another 2 mil items, the database can get
huge.

So my method is to insert the tags like so.

1:5:8:20000:35:36

I then wrote methods to find the tags, export them, and so on. The tags above
are now also searchable with a simple regular expression, "\d+".

That row of numbers is in one single column for my item and I call the column
tags.
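
A minimal sketch of that packing and the "\d+" lookup, in Python. The helper names pack_tags, unpack_tags and has_tag are hypothetical and only illustrate the idea.

```python
import re

# Matches each run of digits in the packed column, i.e. each tag id.
TAG_ID = re.compile(r"\d+")


def pack_tags(tag_ids):
    """Join integer tag ids into one colon-separated string for the column."""
    return ":".join(str(tag_id) for tag_id in tag_ids)


def unpack_tags(column_value):
    """Pull every tag id back out of the packed column value."""
    return [int(match) for match in TAG_ID.findall(column_value)]


def has_tag(column_value, tag_id):
    """Check for a whole id, so that 35 does not match 350 or 1035."""
    return tag_id in unpack_tags(column_value)


packed = pack_tags([1, 5, 8, 20000, 35, 36])
print(packed)
print(unpack_tags(packed))
print(has_tag(packed, 35), has_tag(packed, 2000))
```

Note that the check compares whole ids. A plain substring search on the column would also hit 2000 inside 20000, which is why the sketch unpacks first.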

I have done this for many sites.

Hope this helps.

~~~
spoiledtechie
Also, if you don't find the answer here, there is always
www.stackoverflow.com.

~~~
pipagiorgos
Thanks a lot guys.

spoiledtechie... that was an excellent suggestion. A quick browse reveals
that my question has probably been asked before... I had forgotten about
stackoverflow... Thx.

