Hacker News
Ask HN: How did you learn about search engines/text processing?
46 points by yr on May 4, 2010 | 19 comments
Any good videos/books/code?

Introduction to Information Retrieval by Manning et al. is a great text on the subject: http://nlp.stanford.edu/IR-book/information-retrieval-book.h...

You can take a look at Managing Gigabytes (http://www.amazon.com/Managing-Gigabytes-Compressing-Multime...)

It is a nice book, but it might be a little bit outdated.

The basic data structure behind most text databases and search engines is an inverted index.

Basically it's a map of words to a list of documents containing that word.

eg {"hello": [1, 2], "world": [1,3,4], ...}

(the numbers are document IDs)

So for example, the word 'hello' occurs in documents 1 and 2. 'world' occurs in documents 1,3 and 4.

Doing boolean queries is also really easy with an inverted index. You basically get the document set for each word and then take the union of the sets for an OR query, or the intersection for an AND query.

Pretty cool right?
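
To make the description above concrete, here is a minimal Python sketch of an inverted index with OR and AND queries. It is only an illustration under simple assumptions: whitespace tokenization, lowercase terms, and made-up sample documents chosen so that the result matches the "hello"/"world" example. The function names are hypothetical and not taken from any library.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase word to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def query_or(index, words):
    """OR query: union of the document sets of all words."""
    result = set()
    for word in words:
        result |= index.get(word, set())
    return result


def query_and(index, words):
    """AND query: intersection of the document sets of all words."""
    postings = [index.get(word, set()) for word in words]
    return set.intersection(*postings) if postings else set()


# Hypothetical sample documents, keyed by document ID.
docs = {1: "hello world", 2: "hello there", 3: "world news", 4: "small world"}
index = build_inverted_index(docs)

hello_or_world = query_or(index, ["hello", "world"])
hello_and_world = query_and(index, ["hello", "world"])
```

Real engines keep the document lists sorted and compressed rather than as in-memory sets, but the union/intersection idea is the same.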

The convention of calling this the "inverted" index never ceases to trip me up whenever I see it, since when it comes to IR, I always think of a normal index as being a word->locations mapping, like the index in the back of a book.

Heh, just happened to me, too. Have to remember we're in a technical context here. :)

The book by Manning (freely available online) has already been recommended. I would start with this.

In addition, there is a wealth of online video lectures that may inspire you: http://www.datawrangling.com/hidden-video-courses-in-math-sc... and http://videolectures.net/mlss04_hofmann_irtm/ and http://videolectures.net/Top/Computer_Science/

As far as search engines go, it's certainly worth playing around with Lucene. It's well implemented and you'll learn a lot of what really matters when it comes to indexing and retrieval.

For the text processing (classification, data extraction) side, it may also be worth brushing up on your stats (a good excuse to learn R) and checking out Mahout: http://lucene.apache.org/mahout/

It is pretty nice that the different implementations of Lucene all use the same index file formats.

There are some pretty nice tools to go with Lucene - I've used Luke quite a bit: http://code.google.com/p/luke/

I would echo the recommendations for Introduction to Information Retrieval. If you want something with the same concepts but a little less math, I also liked Search Engines: Information Retrieval in Practice by Bruce Croft et al. If you happen to be a Python programmer, Natural Language Processing with Python by Steven Bird has some great examples of text processing.

I'm literally in the library right now, working on my last homework for a Search Engines course taught by Distinguished Professor Bruce Croft. We're using his new book, Search Engines: Information Retrieval in Practice. You can find the slides that accompany the book here:

http://www.search-engines-book.com - Slides, Data Sets

http://www.pearsonhighered.com/croft1epreview/toc.html - Book Table of Contents

The book expands on the slides and includes homework problems, some of which require using or modifying the open-source Galago Search Toolkit.


For a high level overview I'd recommend Tim Bray's On Search series: http://www.tbray.org/ongoing/When/200x/2003/07/30/OnSearchTO...

I mentioned this myself -- I deleted the comment when I noticed your prior comment.

Tim was nice enough to reply to my email query some years ago and point me to this (already written). It's not comprehensive, but it was helpful. I guess I'm mostly adding this comment to attest to the generosity inherent in the sharing of such information. Aka, "thanks".

I've been deep into building a geocoder the past month. While we may get rid of Solr eventually, it was a great foot in the door to information retrieval. It helps that I have a problem to solve and a deadline, so I'm motivated to read and work through these books. These three texts have been very helpful. The last book is an excellent overview of text processing and some real world problems you may encounter writing your search engine.

Solr 1.4 Enterprise Search Server http://www.amazon.com/Solr-1-4-Enterprise-Search-Server/dp/1...

Programming Collective Intelligence http://www.amazon.com/Programming-Collective-Intelligence-Bu...

Building Search Applications: Lucene, LingPipe, and Gate http://www.amazon.com/Building-Search-Applications-Lucene-Li...

Whatever web app framework you favor, there should be plugins for Solr and Sphinx that make full-text indexing with reasonable defaults pretty easy, e.g. Thinking Sphinx for Rails. I used to use acts_as_solr (I think a lot of people use Sunspot now, and Xapian).

Play with a database or docs in a filesystem and compare runs of Solr and Sphinx while changing parameters such as stopwords, token separators, stemmers, and UTF-8 and ISO-Latin to ASCII mappings. See if you can get decent precision/recall metrics. There are quite a few degrees of freedom, depending on the database.
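
As a rough illustration of the knobs mentioned above, here is a small Python sketch that is independent of Solr and Sphinx. It folds accented characters to ASCII, splits on a configurable separator pattern, drops stopwords, and computes precision and recall for one query. The stopword list, separator pattern, and function names are all assumptions made up for this example, and stemming is left out entirely.

```python
import re
import unicodedata

# Hypothetical stopword list; real engines ship much longer ones.
STOPWORDS = {"the", "a", "of", "and"}


def fold_to_ascii(text):
    """Strip accents by decomposing characters and dropping non-ASCII marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def tokenize(text, stopwords=STOPWORDS, separators=r"[^a-z0-9]+"):
    """Fold to ASCII, lowercase, split on separators, and drop stopwords."""
    folded = fold_to_ascii(text).lower()
    return [t for t in re.split(separators, folded) if t and t not in stopwords]


def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved, recall = hits / relevant, for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


tokens = tokenize("The Café of Zürich")
precision, recall = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
```

Rerunning the same hand-judged queries after each parameter change, and comparing the two numbers, is one simple way to see which settings actually help on your data.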



Text Processing in Python: http://gnosis.cx/TPiP/

I must cite Programming Collective Intelligence by Toby Segaran (http://oreilly.com/catalog/9780596529321). Although not entirely focused on search engines, it's an awesome book for anyone who wants to get their hands on some of the most useful algorithms for web apps, without having to deal with the math.

I downloaded a torrent version, then bought the paperback version straight after.

I agree, PCI is a really cool book. All you need is some basic Python knowledge and it takes you through so many actual data mining examples. I used some of that code to analyze data using Amazon Elastic MapReduce, and got results that I thought would have taken an actual CS degree to get.

Search and Text Processing course at university. Unfortunately I can't find anything related amongst MIT's online course materials.

By building a small search engine. It took about 4 months and it was definitely worth it. Some of the problems that seem simple at first glance were terribly hard (such as reliably separating out the body text of a web page), and some that I thought would be hard turned out to be relatively easy (the actual index).

It was a lot of fun. When I started out I was already fairly sure that I would have neither the stamina nor the funds to commercialize it, but as a learning experience it was great.

Downloading Lucene is a nice place to start if you're a Java person.
