Ask HN: How did you learn about search engines/text processing?
45 points by yr 1447 days ago | comments
Any good videos/books/code?


tdmackey 1447 days ago | link

Introduction to Information Retrieval by Manning et al is a great text on the subject: http://nlp.stanford.edu/IR-book/information-retrieval-book.h...

-----

dhotson 1447 days ago | link

The basic data structure behind most text databases and search engines is an inverted index.

Basically it's a map of words to a list of documents containing that word.

eg {"hello": [1, 2], "world": [1,3,4], ...}

(the numbers are document IDs)

So for example, the word 'hello' occurs in documents 1 and 2. 'world' occurs in documents 1,3 and 4.

Doing boolean queries is also really easy with an inverted index. You get the document set for each word, then take the union of the sets for an OR query, or the intersection for an AND query.

Pretty cool right?
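
To make the idea above concrete, here is a minimal Python sketch of an inverted index with OR and AND queries. The function names (build_index, search_or, search_and) and the four sample documents are made up for illustration, and splitting on whitespace is a simplifying assumption; a real engine would do proper tokenization.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():  # naive whitespace tokenization
            index[word].add(doc_id)
    return index

def search_or(index, words):
    """OR query: union of the document sets of all words."""
    result = set()
    for word in words:
        result |= index.get(word, set())
    return result

def search_and(index, words):
    """AND query: intersection of the document sets of all words."""
    postings = [index.get(word, set()) for word in words]
    return set.intersection(*postings) if postings else set()

# Hypothetical documents chosen to reproduce the mapping in the example above.
docs = {
    1: "hello world",
    2: "hello there",
    3: "world news",
    4: "small world",
}
index = build_index(docs)

print(sorted(index["hello"]))                          # [1, 2]
print(sorted(index["world"]))                          # [1, 3, 4]
print(sorted(search_or(index, ["hello", "world"])))    # [1, 2, 3, 4]
print(sorted(search_and(index, ["hello", "world"])))   # [1]
```

Sets keep the sketch short. Real engines usually store sorted lists of document IDs so that intersections can be done by merging.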

-----

_delirium 1447 days ago | link

The convention of calling this the "inverted" index never ceases to trip me up whenever I see it, since when it comes to IR, I always think of a normal index as being a word->locations mapping, like the index in the back of a book.

-----

ctd 1447 days ago | link

Heh, just happened to me, too. Have to remember we're in a technical context here. :)

-----

dejv 1447 days ago | link

You can take a look at Managing Gigabytes (http://www.amazon.com/Managing-Gigabytes-Compressing-Multime...)

It is a nice book, but it might be a little bit outdated.

-----

grrrr 1447 days ago | link

The book by Manning (freely available online) has already been recommended. I would start with this.

In addition, there is a wealth of online video lectures that may inspire you: http://www.datawrangling.com/hidden-video-courses-in-math-sc... and http://videolectures.net/mlss04_hofmann_irtm/ and http://videolectures.net/Top/Computer_Science/

In so far as search engines go it's certainly worth playing around with Lucene. It's well implemented and you'll learn a lot of what really matters when it comes to indexing and retrieval.

For the text processing (classification, data extraction) side, it may also be worth brushing up on your stats (a good excuse to learn R) and checking out Mahout: http://lucene.apache.org/mahout/

-----

arethuza 1447 days ago | link

It is pretty nice that the different implementations of Lucene all use the same index file formats.

There are some pretty nice tools to go with Lucene - I've used Luke quite a bit: http://code.google.com/p/luke/

-----

vlad 1447 days ago | link

I'm literally in the library right now, working on my last homework for a Search Engines course taught by Distinguished Professor Bruce Croft. We're using his new book, Search Engines: Information Retrieval in Practice. You can find the slides that accompany the book here:

http://www.search-engines-book.com - Slides, Data Sets

http://www.pearsonhighered.com/croft1epreview/toc.html - Book Table of Contents

The book expands on the slides and also includes homework problems, some requiring the use or modification of the open-source Galago Search Toolkit.

http://www.galagosearch.org/quick-start.html

-----

uggedal 1447 days ago | link

For a high level overview I'd recommend Tim Bray's On Search series: http://www.tbray.org/ongoing/When/200x/2003/07/30/OnSearchTO...

-----

pasbesoin 1447 days ago | link

I mentioned this myself -- I deleted the comment when I noticed your prior comment.

Tim was nice enough to reply to my email query some years ago and point me to this (already written). It's not comprehensive, but it was helpful. I guess I'm mostly adding this comment to attest to the generosity inherent in the sharing of such information. Aka, "thanks".

-----

ghotli 1447 days ago | link

I've been deep into building a geocoder the past month. While we may get rid of Solr eventually, it was a great foot in the door to information retrieval. It helps that I have a problem to solve and a deadline, so I'm motivated to read and work through these books. These three texts have been very helpful. The last book is an excellent overview of text processing and some real world problems you may encounter writing your search engine.

Solr 1.4 Enterprise Search Server http://www.amazon.com/Solr-1-4-Enterprise-Search-Server/dp/1...

Programming Collective Intelligence http://www.amazon.com/Programming-Collective-Intelligence-Bu...

Building Search Applications: Lucene, LingPipe, and Gate http://www.amazon.com/Building-Search-Applications-Lucene-Li...

-----

gregschlom 1447 days ago | link

I must cite Programming Collective Intelligence by Toby Segaran (http://oreilly.com/catalog/9780596529321). Although not entirely focused on search engines, it's an awesome book for anyone who wants to get their hands on some of the most useful algorithms for web apps, without having to deal with the math.

I downloaded a torrent version, then bought the paperback version straight after.

-----

imp 1447 days ago | link

I agree, PCI is a really cool book. All you need is some basic Python knowledge and it takes you through so many actual data mining examples. I used some of that code to analyze data using Amazon Elastic MapReduce, and got results that I thought would have taken an actual CS degree to get.

-----

rmc00 1447 days ago | link

I would echo the recommendations for Introduction to Information Retrieval. If you want something with the same concepts but a little less math, I liked Search Engines: Information Retrieval in Practice by Bruce Croft et al. as well. If you happen to be a Python programmer, Natural Language Processing with Python by Steven Bird has some great examples of text processing.

-----

DrJokepu 1447 days ago | link

Search and Text Processing course at university. Unfortunately I can't find anything related amongst MIT's online course materials.

-----

gtani 1447 days ago | link

Whatever web app framework you favor, there should be plugins for Solr and Sphinx that make fulltext indexing with reasonable defaults pretty easy, e.g. Thinking Sphinx for Rails. I used to use acts_as_solr (I think a lot of people use Sunspot now, and Xapian).

Play with a database or docs in a filesystem, and compare runs of Solr and Sphinx while changing parameters like stopwords, token separators, stemmers, and UTF-8 and ISO-Latin to ASCII mappings. See if you can get decent precision/recall metrics. There are quite a few degrees of freedom, depending on the database.
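
To give a feel for what those knobs do, here is a rough, standard-library-only Python sketch of a tokenizer with a configurable separator pattern, a stopword list, and accent folding to ASCII. It is not how Solr or Sphinx implement analysis; the names STOPWORDS, fold_to_ascii and tokenize and the tiny stopword list are assumptions made up for this example, and stemming is left out.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "of", "and"}

def fold_to_ascii(text):
    """Decompose accented characters and drop anything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

def tokenize(text, separators=r"[^a-z0-9]+", stopwords=STOPWORDS):
    """Fold to ASCII, lowercase, split on the separator pattern, drop stopwords."""
    folded = fold_to_ascii(text).lower()
    tokens = [t for t in re.split(separators, folded) if t]
    return [t for t in tokens if t not in stopwords]

tokens = tokenize("The café of Zürich, and an über-index")
print(tokens)  # ['cafe', 'zurich', 'uber', 'index']
```

Changing the separator pattern (for instance keeping hyphens inside tokens) or the stopword set changes what ends up in the index, which is the kind of variation the experiment above is meant to measure.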

http://www.computationalmedicine.org/challenge/cmcChallengeD...

http://stackoverflow.com/questions/tagged/sphinx
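
For the precision/recall check mentioned above, a minimal Python sketch could look like this. It assumes you have, for each test query, the set of document IDs the engine returned and a hand-labeled set of relevant IDs; the function name and the sample IDs are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant documents retrieved / all documents retrieved
    recall    = relevant documents retrieved / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical result: the engine returned 4 documents, 2 of the 3 relevant ones.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
print(p)  # 0.5
print(r)
```

Averaging these numbers over a set of test queries gives a simple way to compare one configuration against another.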

-----

probably 1447 days ago | link

Text Processing in Python: http://gnosis.cx/TPiP/

-----

jacquesm 1446 days ago | link

By building a small search engine. It took about 4 months and it was definitely worth it. Some of the problems that seem simple at first glance were terribly hard (such as reliably separating out the body text of a web page), and some that I thought would be hard turned out to be relatively easy (the actual index).

It was a lot of fun. When I started out I was already fairly sure that I would have neither the stamina nor the funds to commercialize it, but as a learning experience it was great.

-----

keefe 1446 days ago | link

Downloading Lucene is a nice place to start if you're a Java person.

-----



