
Ranked Searches with SQL - danw
http://particletree.com/notebook/ranked-searches-with-sql/
======
abstractbill
For anyone wanting to make their searches fast, I would strongly recommend
ditching the database.

Write a program that sucks your entire database into memory (memory is cheap -
unless you really have _tons_ of data, you can afford to do this). Store
records as objects or simpler structures, and index them in hash-tables.
When you get a query, just look up the objects using the query as a set of
keys into your hash-tables. If you take advantage of your knowledge of the
data, you'll be able to do any intersections, filtering and sorting way faster
than the database can.
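A minimal sketch of the hash-table approach, with hypothetical records and field names (the real justin.tv index would of course be loaded from the database and be far larger):

```python
from collections import defaultdict

# Hypothetical in-memory records, as if loaded from the database once at startup.
records = {
    1: {"title": "live coding stream", "tags": ["coding", "live"]},
    2: {"title": "cooking show", "tags": ["cooking", "live"]},
    3: {"title": "coding tutorial", "tags": ["coding"]},
}

# Build an inverted index: each word maps to the set of record ids containing it.
index = defaultdict(set)
for rid, rec in records.items():
    for word in rec["title"].split() + rec["tags"]:
        index[word].add(rid)

def search(*terms):
    """Intersect the posting sets for every query term, then sort the hits."""
    if not terms:
        return []
    ids = set.intersection(*(index[t] for t in terms))
    return sorted(records[i]["title"] for i in ids)

# "coding" AND "live" matches only record 1.
print(search("coding", "live"))
```

The set intersections and the final sort are exactly the "intersections, filtering and sorting" step: because everything is already in RAM, each lookup is a hash probe rather than a disk seek.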

This is how I did things when I rewrote justin.tv's search engine. The bulk of
the work took me two days, and it's _fast_ (it'll be live in a few days, when
we let our designer loose on it - right now it's functional but ugly).

~~~
dfranke
Here we go again. I'll play my traditional opening move:

"How are you handling ACID guarantees?"

and then I'll direct you here:

<http://news.ycombinator.com/item?id=59885>

~~~
abstractbill
"How are you handling ACID guarantees?"

Maybe I'm missing something but I don't see how ACID plays into things. You
don't imagine the search engine is putting stuff _in_ to the database do you?

From the link above: "Do any of these with your 'big hash table' in less than
1 second"

1 second is eternity as far as my search engine is concerned. If I can't
handle a request in less than a couple of milliseconds, I'd say I've failed.
I've used this same approach in the past on a search engine that indexed 14
million records btw, and I didn't fail ;-)

~~~
dfranke
14 million? Okay, that's not very much. A RAM-based approach is fine there.

~~~
abstractbill
Heh, fair enough. Do many startups really have more than a few tens of
millions of records to search though?

~~~
dfranke
I can't speak statistically, but the one I work for now does and the one I'm
about to start will too (but in the latter case I'm writing a DBMS, so that's
kind of a pathological example).

If all your content is human-generated and O(n) in size, then 10^7 is a lot.
The big numbers come when your data is produced by a simulation, or when
you're doing OLAPey things with O(n^2)-sized intermediate results. Real-life
example: keeping a log of all users, the IPs they've logged in from, and when,
and looking for evidence of sockpuppet accounts.
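The simplest version of that sockpuppet check can be sketched as inverting the login log, IP by IP; the data here is made up, and a real log at the scale dfranke describes (with pairwise account comparisons blowing up to O(n^2) intermediates) is precisely what stops fitting in RAM:

```python
from collections import defaultdict

# Hypothetical login log: (user, ip, timestamp) rows.
logins = [
    ("alice",    "10.0.0.1", "2008-01-01T10:00"),
    ("bob",      "10.0.0.2", "2008-01-01T11:00"),
    ("mallory1", "10.0.0.9", "2008-01-02T09:00"),
    ("mallory2", "10.0.0.9", "2008-01-02T09:05"),
]

# Invert the log: IP -> set of accounts seen logging in from it.
users_by_ip = defaultdict(set)
for user, ip, _ts in logins:
    users_by_ip[ip].add(user)

# IPs shared by more than one account are sockpuppet candidates.
suspects = {ip: users for ip, users in users_by_ip.items() if len(users) > 1}
print(suspects)
```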

~~~
neilc
I'd be curious to know why you're writing a DBMS from scratch. That seems like
kind of an odd thing to do, from a value-for-effort POV, in this day and age.

~~~
dfranke
Not from scratch. I'm building it on top of Hadoop and HBase.

------
henning
Or you could use Lucene (either the native Java version or one of its many
ports) and leave the information retrieval to the experts.

Do you really want to implement stemming, WordNet integration, etc. yourself?
No, you don't, if you know what's good for you.

