

Design details of Audiogalaxy.com’s high performance MySQL search engine - slackerIII
http://www.spiteful.com/2008/02/29/design-details-of-audiogalaxycoms-high-performance-mysql-search-engine/

======
elq
> "we never had to deal with sharding the index across multiple servers or a
> hash table of words that wouldn’t fit in memory. Speaking of which, I’d love
> to read some papers on how internet-scale search engines actually do that.
> Does anyone have any recommendations?"

I can speak from experience on a very large search engine (not on a Google
scale in # of docs, but within an order of magnitude - and Google scale in
terms of qps [estimated - Google doesn't publish such numbers]).

Re: "sharding the index across multiple servers" - every document has an id.
Mod the id against some number (preferably much larger than the number of
partitions/shards you have), split your index servers into N clusters, assign
mods to a cluster (do so in a way that avoids hotspots), and have a "query
aggregator" that sends an incoming query to one server in every partition. The
aggregator then merges the result sets and re-sorts based on a sort key passed
by the search node.
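
A minimal sketch of that routing and merging, not the actual system. The
names (NUM_BUCKETS, NUM_CLUSTERS, cluster_for, aggregate), the bucket and
cluster counts, and the (sort_key, doc_id) result shape are all assumptions
made for illustration:

```python
import heapq
from itertools import islice

NUM_BUCKETS = 1024  # assumed: much larger than the number of clusters
NUM_CLUSTERS = 4    # assumed

# Explicit bucket -> cluster table, so a hot bucket can be reassigned later
# without changing how doc ids map to buckets. Round-robin is just a default.
BUCKET_TO_CLUSTER = {b: b % NUM_CLUSTERS for b in range(NUM_BUCKETS)}


def cluster_for(doc_id):
    """Map a document id to the cluster that indexes it."""
    return BUCKET_TO_CLUSTER[doc_id % NUM_BUCKETS]


def aggregate(per_cluster_results, limit):
    """Merge per-cluster result lists into one globally sorted list.

    Each input list holds (sort_key, doc_id) tuples already sorted by
    descending sort_key, as returned by one search node per cluster.
    """
    merged = heapq.merge(*per_cluster_results, key=lambda r: r[0], reverse=True)
    return list(islice(merged, limit))


results = aggregate([[(0.9, 4), (0.2, 8)], [(0.7, 5)]], limit=2)
```

Because every node returns its results pre-sorted, the aggregator only needs
a k-way merge and never has to look inside the documents.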

Re: "hash table of words that wouldn’t fit in memory" - the vocabulary I had
to work with included at least 7 (human) languages with _many_ artificial
words. The # of hash entries tended to hover around 2.7M tokens. How? Do not
include numbers in the index (there's an infinite number of them :)), ignore
case, and tokenize. Tokenization is relatively easy except for CJK
languages; for those, either have fluent/native speakers define the tokenizing
semantics or find/buy a library.
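
As a rough illustration of those three rules (drop numbers, ignore case,
tokenize), here is a sketch. The function name index_tokens and the regex
are assumptions, and it deliberately does nothing useful for CJK text, which
needs a real segmenter:

```python
import re

# Runs of letters/digits; Unicode-aware, underscore excluded.
TOKEN_RE = re.compile(r"[^\W_]+")


def index_tokens(text):
    """Return the tokens of `text` that would be kept in the vocabulary."""
    tokens = []
    for tok in TOKEN_RE.findall(text.lower()):  # ignore case
        if tok.isdigit():                       # do not index pure numbers
            continue
        tokens.append(tok)
    return tokens


tokens = index_tokens("Windows 95 and MS-DOS")
```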

~~~
slackerIII
The sharding thing is such an interesting problem. I'm thinking about the case
where a user searches for "A B". Assume A has a few million matching IDs, B
also has a few million, and (of course) the indexes for A & B are stored on
different servers. Do you have to pull the full result sets for A & B back to
a single aggregator to intersect them? I'd love to learn more about the
optimizations for that problem.

~~~
elq
Ah. Well, keep in mind that we're splitting the documents up in an arbitrary
way WRT the data, so every cluster/partition has a complete copy of the index
for the documents that happen to be bucketed there. And no cluster owns a
single "document class/cluster".

So, if you want to search for, say "microsoft windows", the aggregator just
sends something like "PHRASE('microsoft', 'windows')", each query node finds
the document vector/set for microsoft and the document vector for windows and
does an intersection of the doc ids. The node then has to scan that set,
grab the document position array from each document, and filter out any
documents where windows doesn't occur at a microsoft position + 1.
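
A sketch of that per-node phrase check for the two-word case. The index
layout (token -> {doc_id: positions}) and the name phrase_match are
assumptions for illustration, not the real data structures:

```python
def phrase_match(index, first, second):
    """Return sorted doc ids where `second` occurs right after `first`.

    `index` maps token -> {doc_id: list of token positions in that doc}.
    """
    first_postings = index.get(first, {})
    second_postings = index.get(second, {})
    hits = []
    # Intersect the doc id sets first, then check positions only in the overlap.
    for doc_id in first_postings.keys() & second_postings.keys():
        second_positions = set(second_postings[doc_id])
        if any(p + 1 in second_positions for p in first_postings[doc_id]):
            hits.append(doc_id)
    return sorted(hits)


index = {
    "microsoft": {1: [0, 5], 2: [3]},
    "windows": {1: [1], 2: [7], 3: [0]},
}
matches = phrase_match(index, "microsoft", "windows")
```

Doc 2 contains both words but not adjacently, so only doc 1 survives the
position filter.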

All of the conjunctions, disjunctions, wildcard expansions, near operations,
phrase operations, etc. are executed on the query node. All of the complex
sort evaluation also happens on the query node. The aggregator only merges
result sets and performs any necessary global sorting.

~~~
slackerIII
Ah, of course. I misunderstood your previous comment. The system is
partitioned by document, not by token - that makes a lot more sense. Oh well,
I guess that isn't quite as hard of a problem as I thought :) Thanks for the
follow up.

------
stillmotion
Man, Audiogalaxy was awesome.

~~~
slackerIII
It sure was. Working there was amazing, mainly because of all the cool tech I
got to build, but also because it meant that part of my job was _using
Audiogalaxy_. Good times...

~~~
aswanson
You _worked_ there? Thanks, I remember there was one song I was searching for
forever and Audiogalaxy was the only engine that found it.

~~~
slackerIII
Cool, glad you liked it -- it is always fun to talk to people who used the
site.

I was there from 1999 to 2002. It was exceptionally good at finding rare
music, at least partially because we never partitioned our network. When you
searched, it searched all 1 million+ users that were currently running the
Satellite.

~~~
ajkirwin
I have to echo what they said.

Audiogalaxy was simply wonderful. Even now, I would still say it's better than
current P2P offerings.

You could find so much.. and the client was so very lightweight!

~~~
mdemare
It was fantastic for finding rare music. Lots of music that I can't find
nowadays anywhere - paid or otherwise.

And I loved being able to control my home client from work through a web
interface.

~~~
rms
waffles.fm?

email me for an invite

