

How FriendFeed uses MySQL to store schema-less data - jgrahamc
http://bret.appspot.com/entry/how-friendfeed-uses-mysql?

======
joshu
I probably should write up the (very similar, but with a few significant
differences) my design for the never-executed Delicious backend datastore.

~~~
eserorg
That would be very interesting to read.

We're looking at implementing a tagging system for navigating through a large
proprietary datastore of oil and gas well data.

The problem we're running into is scaling conjunction queries without blowing-
out our hardware budget -- 100's of tags per item with tens of millions of
items, updated daily.

~~~
joshu
The right data structure for tagging is probably an inverted index. Look at
Lucene, Solr, or Sphinx for this.

~~~
eserorg
You're absolutely right.

I hacked up a prototype last night that creates a reverse index of:

tags => item GUID's => a precomputed bloom filter for all tags in that GUID

Each tag has its own reverse-index, which is stored in a separate file
assembled using O_APPEND. The entries in each file are pre-sorted based on a
predefined ranking algorithm.

Queries are mapped to cluster nodes using a CRC32 hash modulo across the
number of cluster nodes.

Tag conjunction queries are run by computing a bloom filter for the tags in
the query. The tag with the fewest GUID's is determined (approximately)
through a LUT stored in main-memory. The query bloom filter is then
sequentially compared using a bitwise-and with each of the bloom filters in
the appropriate reverse-index file.

After 10 matches are found, the results are rendered.

No check is done for false positives.

The problem I'm having now is in prepending new entries to the start of a
reverse-index file. This operation is not performant. Since it's out-of-band
with the query stream, it's tolerable for the time being.

------
hackernews
this was discussed 413,442 posts back

<http://apps.ycombinator.com/item?id=496946>

~~~
dpapathanasiou
That strange feeling you sometimes get:
<http://www.youtube.com/watch?v=G2eUopy9sd8>

------
JoeAltmaier
This matches all my intuitions about how I was going to do this; now I have a
cookbook (several recipies!)! This is why Hacker News is so valuable. Nothing
beats experienced advisors.

------
SingAlong
I tried this method for my app. But it turned out that I needed a lot of
fields to be in their own columns since i would have to use them with queries
in the WHERE clause or ORDER clause.

So be careful while choosing the columns you would like to serialize or json-
ify and dump.

Just my 2c.

~~~
technoweenie
Part of this article talks about the index tables they use. The rationale for
having separate index tables is they can create or drop them without locking
tables or anything like that. It takes a little more manual work to keep
things updated though.

I too use a hybrid approach on my site. Only columns that aren't specifically
referenced in indexes aren't kept in the serialized attribute hash.

------
cnu
Check out a open sourced version written on top of SQLAlchemy.
<http://github.com/bickfordb/waffle>

