

Beating Google With CouchDB, Celery and Whoosh (Part 1) - markokocic
http://andrewwilkinson.wordpress.com/2011/09/27/beating-google-with-couchdb-celery-and-whoosh-part-1/

======
timclark
[http://andrewwilkinson.wordpress.com/2011/09/29/beating-goog...](http://andrewwilkinson.wordpress.com/2011/09/29/beating-google-with-couchdb-celery-and-whoosh-part-2/)
- part two actually contains the interesting content.

------
bialecki
Nice post.

For those considering something like this, you might want to use Scrapy, a
Python web crawling framework, instead of rolling your own crawler.

I found that project about a year ago while looking for something like this.
I've used it for a few things, and it does a nice job of abstracting away most
of the core scraping architecture while leaving room to extend it as
necessary. After playing with it for a few weeks, you realize just how easy it
is to grab a bunch of data from the web if you need it.

~~~
fuzzythinker
I would also recommend Haystack for anyone wanting to take this to the next
step, since Django is already being used and Haystack lets you swap Whoosh for
Solr if needed.
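
To illustrate the backend swap, here is a minimal sketch of Django settings for Haystack. It assumes the 2.x-style `HAYSTACK_CONNECTIONS` setting (older releases were configured differently). The index path, the Solr URL, and the `USE_SOLR` flag are placeholders for illustration, not values from the article.

```python
# Sketch of Django settings for django-haystack (2.x-style configuration).
# The index path and Solr URL are hypothetical placeholders.

WHOOSH_CONNECTION = {
    "ENGINE": "haystack.backends.whoosh_backend.WhooshEngine",
    "PATH": "/tmp/whoosh_index",  # hypothetical on-disk index location
}

SOLR_CONNECTION = {
    "ENGINE": "haystack.backends.solr_backend.SolrEngine",
    "URL": "http://127.0.0.1:8983/solr",  # hypothetical Solr endpoint
}

# Hypothetical flag: flip it to move from Whoosh to Solr.
# The search index has to be rebuilt after changing backends.
USE_SOLR = False

HAYSTACK_CONNECTIONS = {
    "default": SOLR_CONNECTION if USE_SOLR else WHOOSH_CONNECTION,
}
```

The application's search code should not need to change when the flag flips, which is the appeal of going through Haystack rather than calling Whoosh directly.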

------
jsherer
Whoosh is fast enough for small amounts of data, or when the searches do not
need to be performed in realtime. But, I agree, something like Lucene/Solr or
Sphinx might be a bit better for indexing large amounts of content (i.e., the
provided example of a webcrawler indexing web pages).

~~~
robterrell
Seems like using Cloudant's BigCouch fork, with its built-in Lucene indexing,
might be a better choice for this.

~~~
arethuza
Thanks - I was reading the article thinking "what I'd really want was CouchDB
with built in Lucene indexing and searching" - and there it is! :-)

~~~
timanglade
To be clear, our (Cloudant’s) integrated Search product is only available in
our paid product or on hosted accounts at <https://cloudant.com>. Open-source
BigCouch does not have Search built in, but you can still use it in
conjunction with Lucene or ElasticSearch (integration is straightforward).

------
StavrosK
Unfortunately, Whoosh is too slow for any nontrivial amount of content. You're
much better off using Solr or ElasticSearch.

