
How to write a search engine - abstractbill
http://www.acmqueue.com/modules.php?name=Content&pa=showpage&pid=143
======
notabel
Unfortunately, the author largely glosses over what I consider the most
difficult part of search--extracting semantic information from unstructured
source materials (also referred to as "magicks"). However, the discussion of
bandwidth and server allocation - of starting with what you can afford, and
spending your effort/money on what matters (i.e. your algorithms) resonates
with a lot of the ideas behind YC. When it comes right down to it, anyone who
can sweet-talk a VC can get fat pipes and big boxes--the (human) brains behind
the (algorithmic) brains are what differentiate Google from Lycos.

~~~
bootload
'... the author largely glosses over what I consider the most difficult part
of search--extracting semantic information from unstructured source materials
...'

This is a hard set of problems. The question is how they go about it. My guess
is they work from whole to part: first classify a document they find by its
extension/MIME type, then analyse the document in detail for commonly
occurring information (based on statistics).

I remember reading about how Google analysed web authoring statistics ~
http://code.google.com/webstats For instance, 'a' links within HTML pages,
which reveal links to other documents, can be parsed to extract lots of useful
information, as found here ~
http://code.google.com/webstats/2005-12/element-a.html
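To make that concrete, here is a minimal sketch of pulling 'a' href targets out of a page with Python's standard-library HTMLParser. The sample HTML and URLs are invented for illustration; a real crawler would feed fetched pages through something like this.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href targets from <a> elements as the parser streams tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

parser = LinkExtractor()
parser.feed('<p>See <a href="http://example.com/a">one</a> '
            'and <a href="/b">two</a>.</p>')
print(parser.links)  # ['http://example.com/a', '/b']
```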

But unstructured data is a different beast. How, for instance, does Google
work out that 'cm' (by which I mean centimetre) is both 'Columbus McKinnon'
and a unit of measurement, as highlighted in a returned search? [0]

One way could be using techniques similar to the 'Normalised Google Distance'
algorithm, or NGD, developed by Rudi Cilibrasi and Paul Vitanyi (National
Research Institute for Mathematics and Computer Science, Amsterdam). They
build a database model of how closely words relate to each other and use it to
score word combinations: the closer the combination, the closer the
association. So my example of 'cm' would have a close association with
'Columbus McKinnon' and 'centimetre', etc., as returned by the Google search ~
http://www.google.com/search?q=cm
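The NGD formula from Cilibrasi and Vitanyi's paper is short: given search-engine page counts f(x), f(y), and f(x,y) for the terms alone and together, and the total index size N, NGD(x,y) = (max(log f(x), log f(y)) - log f(x,y)) / (log N - min(log f(x), log f(y))). A minimal sketch, with hit counts invented purely for illustration:

```python
from math import log

def ngd(fx, fy, fxy, n):
    """Normalised Google Distance from raw hit counts.

    fx, fy : hit counts for each term on its own
    fxy    : hit count for pages containing both terms
    n      : total number of pages in the index
    """
    lx, ly, lxy = log(fx), log(fy), log(fxy)
    return (max(lx, ly) - lxy) / (log(n) - min(lx, ly))

# Terms that always co-occur are at distance 0:
print(ngd(1000, 1000, 1000, 1e8))  # 0.0
```

Smaller values mean the terms tend to appear on the same pages, which is the sense in which 'cm' sits close to both 'centimetre' and 'Columbus McKinnon'.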

You can read more about this in 'Google's search for meaning' [1] and the
abstract, 'Automatic Meaning Discovery Using Google' [2].

Reference

[0] Slashdot, ' Deriving Semantic Meaning From Google Results' [Accessed
Wednesday, 7 March, 2007]

http://science.slashdot.org/article.pl?sid=05/01/29/1815242

[1] New Scientist, 'Google's search for meaning' [Accessed Wednesday, 7 March,
2007]

http://www.newscientist.com/article.ns?id=dn6924

[2] arxiv.org,Computer Science, abstract 'Automatic Meaning Discovery Using
Google' [Accessed Wednesday, 7 March, 2007]

http://www.arxiv.org/abs/cs.CL/0412098

------
abstractbill
Some of this is a little dated - for example, there's not much point in
managing a crawler yourself when you can just pay Alexa to do it for you.

I still found a few useful tips though.

~~~
python_kiss
The article is meant to be taken as a general guide. The author does suggest
that we use an existing crawler rather than code one ourselves. Refer to this
passage:

"Crawler. If you don't use an open source crawler, my advice is a super-simple
multistep crawler. This is very important advice that will cut months off your
development time, so if you ignore everything else, don't ignore this."
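In that spirit, a "super-simple multistep crawler" is little more than a frontier queue plus a visited set, with fetching and parsing done in separate steps. Here is a sketch against an in-memory stand-in for real page fetching; the page graph and the `fetch_links` callback are invented for illustration.

```python
from collections import deque

# Hypothetical in-memory "web": page URL -> list of outgoing links.
# A real crawler would fetch the page and parse its <a> hrefs here.
PAGES = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}

def crawl(seed, fetch_links):
    """Breadth-first crawl: dequeue a URL, fetch its links, queue new ones."""
    frontier, seen = deque([seed]), {seed}
    while frontier:
        url = frontier.popleft()
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return seen

print(sorted(crawl("a", lambda u: PAGES.get(u, []))))  # ['a', 'b', 'c']
```

Keeping each step this dumb is exactly the author's point: the hard, differentiating work lives in the ranking algorithms, not in the plumbing.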

------
budu3
Thanks to Doug Cutting, you don't have to write a search engine. You just use
Nutch, Lucene, and Hadoop.

