
Ask HN: How did you learn about search engines/text processing ? - yr
Any good videos/books/code ?
======
tdmackey
Introduction to Information Retrieval by Manning et al is a great text on the
subject: [http://nlp.stanford.edu/IR-book/information-retrieval-
book.h...](http://nlp.stanford.edu/IR-book/information-retrieval-book.html)

------
dejv
You can take a look at Managing Gigabytes ([http://www.amazon.com/Managing-
Gigabytes-Compressing-Multime...](http://www.amazon.com/Managing-Gigabytes-
Compressing-Multimedia-
Information/dp/1558605703/ref=sr_1_1?ie=UTF8&s=books&qid=1272964874&sr=8-1))

It is a nice book, but it might be a little bit outdated.

------
dhotson
The basic data structure behind most text databases and search engines is an
inverted index.

Basically it's a map of words to a list of documents containing that word.

e.g. {"hello": [1, 2], "world": [1, 3, 4], ...}

(the numbers are document IDs)

So for example, the word 'hello' occurs in documents 1 and 2. 'world' occurs
in documents 1, 3 and 4.

Doing boolean queries is also really easy with an inverted index. You
basically get the document set for each word, then take the union of the sets
for an OR query, or the intersection for an AND query.

Pretty cool right?
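To make it concrete, here's a toy sketch in Python (the documents and IDs are
made up for illustration):

```python
# Build a toy inverted index: each word maps to the set of document
# IDs containing it.
docs = {
    1: "hello world",
    2: "hello there",
    3: "world peace",
    4: "small world",
}

index = {}
for doc_id, text in docs.items():
    for word in text.split():
        index.setdefault(word, set()).add(doc_id)

# OR query: union of the posting sets.
def query_or(*words):
    return set().union(*(index.get(w, set()) for w in words))

# AND query: intersection of the posting sets.
def query_and(*words):
    sets = [index.get(w, set()) for w in words]
    return set.intersection(*sets) if sets else set()

print(sorted(query_or("hello", "world")))   # → [1, 2, 3, 4]
print(sorted(query_and("hello", "world")))  # → [1]
```

Real engines store sorted posting lists and intersect them with merge-style
algorithms rather than hash sets, but the idea is the same.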

~~~
_delirium
The convention of calling this the "inverted" index never ceases to trip me up
whenever I see it, since when it comes to IR, I always think of a normal index
as being a word->locations mapping, like the index in the back of a book.

~~~
ctd
Heh, just happened to me, too. Have to remember we're in a technical context
here. :)

------
grrrr
The book by Manning (freely available online) has already been recommended. I
would start with this.

In addition, there is a wealth of online video lectures that may inspire you:
[http://www.datawrangling.com/hidden-video-courses-in-math-
sc...](http://www.datawrangling.com/hidden-video-courses-in-math-science-and-
engineering) and <http://videolectures.net/mlss04_hofmann_irtm/> and
<http://videolectures.net/Top/Computer_Science/>

As far as search engines go, it's certainly worth playing around with
Lucene. It's well implemented and you'll learn a lot of what really matters
when it comes to indexing and retrieval.

For the text processing (classification, data extraction) side, it may also be
worth brushing up on your stats (a good excuse to learn R) and checking out
Mahout <http://lucene.apache.org/mahout/>

~~~
arethuza
It is pretty nice that the different implementations of Lucene all use the
same index file formats.

There are some pretty nice tools to go with Lucene - I've used Luke quite a
bit: <http://code.google.com/p/luke/>

------
uggedal
For a high level overview I'd recommend Tim Bray's On Search series:
[http://www.tbray.org/ongoing/When/200x/2003/07/30/OnSearchTO...](http://www.tbray.org/ongoing/When/200x/2003/07/30/OnSearchTOC)

~~~
pasbesoin
I mentioned this myself -- I deleted the comment when I noticed your prior
comment.

Tim was nice enough to reply to my email query some years ago and point me to
this (already written). It's not comprehensive, but it was helpful. I guess
I'm mostly adding this comment to attest to the generosity inherent in the
sharing of such information. Aka, "thanks".

------
rmc00
I would echo the recommendations for Introduction to Information Retrieval. If
you want something with the same concepts but a little less math, I liked
Search Engines: Information Retrieval in Practice by Bruce Croft et al. as
well. If you happen to be a python programmer, Natural Language Processing
with Python by Steven Bird has some great examples of text processing.

------
vlad
I'm literally in the library right now, working on my last homework for a
Search Engines course taught by Distinguished Professor Bruce Croft. We're
using his new book, Search Engines: Information Retrieval in Practice. You can
find the slides that accompany the book here:

<http://www.search-engines-book.com> \- Slides, Data Sets

<http://www.pearsonhighered.com/croft1epreview/toc.html> \- Book Table of
Contents

The book expands on the slides and also includes homework problems, some
requiring the use or modification of the open-source Galago Search Toolkit.

<http://www.galagosearch.org/quick-start.html>

------
ghotli
I've been deep into building a geocoder the past month. While we may get rid
of Solr eventually, it was a great foot in the door to information retrieval.
It helps that I have a problem to solve and a deadline, so I'm motivated to
read and work through these books. These three texts have been very helpful.
The last book is an excellent overview of text processing and some real world
problems you may encounter writing your search engine.

Solr 1.4 Enterprise Search Server [http://www.amazon.com/Solr-1-4-Enterprise-
Search-Server/dp/1...](http://www.amazon.com/Solr-1-4-Enterprise-Search-
Server/dp/1847195881)

Programming Collective Intelligence [http://www.amazon.com/Programming-
Collective-Intelligence-Bu...](http://www.amazon.com/Programming-Collective-
Intelligence-Building-Applications/dp/0596529325)

Building Search Applications: Lucene, LingPipe, and Gate
[http://www.amazon.com/Building-Search-Applications-Lucene-
Li...](http://www.amazon.com/Building-Search-Applications-Lucene-
LingPipe/dp/0615204252)

------
gtani
Whatever web app framework you favor, there should be plugins for Solr and
Sphinx that make fulltext indexing with reasonable defaults pretty easy, e.g.
Thinking Sphinx for Rails. I used to use acts_as_solr (I think a lot of people
use Sunspot now, and Xapian).

Play with a database or docs in a filesystem, and compare Solr and Sphinx runs
while changing parameters like stopwords, token separators, stemmers, and
UTF-8 and ISO-Latin to ASCII mappings. See if you can get decent
precision/recall metrics. There are quite a few degrees of freedom, depending
on the database.
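If you want to score those runs yourself, the per-query metrics are just set
arithmetic. A minimal sketch in Python (the document IDs and relevance
judgments here are invented; in practice you'd hand-label them or use a
benchmark collection):

```python
# Precision/recall for a single query, given the set of documents the
# engine returned and a hand-labeled set of truly relevant documents.
def precision_recall(retrieved, relevant):
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = {1, 2, 3, 5}   # what the engine returned
relevant = {2, 3, 4}       # what you judged relevant

p, r = precision_recall(retrieved, relevant)
print(round(p, 2), round(r, 2))  # → 0.5 0.67
```

Averaging these over a batch of queries gives you a rough number to compare
stemmer/stopword configurations against each other.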

[http://www.computationalmedicine.org/challenge/cmcChallengeD...](http://www.computationalmedicine.org/challenge/cmcChallengeDetails.pdf)

<http://stackoverflow.com/questions/tagged/sphinx>

------
probably
Text Processing in Python: <http://gnosis.cx/TPiP/>

------
gregschlom
I must cite Programming Collective Intelligence by Toby Segaran
(<http://oreilly.com/catalog/9780596529321>). Although not entirely focused on
search engines, it's an awesome book for anyone who wants to get their hands
on some of the most useful algorithms for web apps, without having to deal
with the math.

I downloaded a torrent version, then bought the paperback version straight
after.

~~~
imp
I agree, PCI is a really cool book. All you need is some basic Python
knowledge and it takes you through so many actual data mining examples. I used
some of that code to analyze data using Amazon Elastic MapReduce, and got
results that I thought would have taken an actual CS degree to get.

------
DrJokepu
Search and Text Processing course at university. Unfortunately I can't find
anything related amongst MIT's online course materials.

------
jacquesm
By building a small search engine. It took about 4 months and it was
definitely worth it. Some of the problems that seemed simple at first glance
turned out to be terribly hard (such as reliably separating out the body text
of a web page), and some that I thought would be hard turned out to be
relatively easy (the actual index).

It was a lot of fun, even though when I started out I was already fairly sure
that I would have neither the stamina nor the funds to commercialize it. As a
learning experience, though, it was great.

------
keefe
Downloading Lucene is a nice place to start if you're a Java person.

