
The book by Manning (freely available online) has already been recommended. I would start with this.

In addition, there is a wealth of online video lectures that may inspire you: http://www.datawrangling.com/hidden-video-courses-in-math-sc... and http://videolectures.net/mlss04_hofmann_irtm/ and http://videolectures.net/Top/Computer_Science/

As far as search engines go, it's certainly worth playing around with Lucene. It's well implemented, and you'll learn a lot about what really matters when it comes to indexing and retrieval.
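
If it helps to see the core idea before diving into Lucene, here is a toy inverted index with a simple TF-IDF-style score. To be clear, this is not Lucene's API or its actual scoring formula; the function names (tokenize, build_index, search) and the sample documents are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Naive whitespace tokenizer; real engines use proper analyzers.
    return text.lower().split()


def build_index(docs):
    # Inverted index: term -> {doc_id: term frequency in that doc}.
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def search(index, n_docs, query):
    # Score each document by summing tf * idf over the query terms.
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue
        # Rarer terms get a higher weight (illustrative idf variant).
        idf = math.log(1 + n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first; ties broken by doc id for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical sample documents.
docs = {
    1: "lucene builds an inverted index",
    2: "an index maps terms to documents",
    3: "statistics help with text classification",
}
index = build_index(docs)
results = search(index, len(docs), "inverted index")
```

Here results ranks document 1 ahead of document 2, since document 1 matches both query terms and "inverted" is the rarer one. Lucene adds a great deal on top of this (analyzers, compressed postings, segment merging, better ranking), which is where most of the learning is.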

For the text processing (classification, data extraction) side, it may also be worth brushing up on your stats (a good excuse to learn R) and checking out Mahout: http://lucene.apache.org/mahout/



It is pretty nice that the different implementations of Lucene all use the same index file formats.

There are some pretty nice tools to go with Lucene - I've used Luke quite a bit: http://code.google.com/p/luke/



