
Build your own web search engine - DanielHimmelein
http://himmele.blogspot.com/2011/05/build-your-own-internet-search-engine.html
======
duggan
Have you investigated <http://www.elasticsearch.org/>?

I'd really recommend reading up on the Solr (I notice you mention Solr in your
first post) and ElasticSearch projects; these guys, along with Lucene, have
collectively solved many of the problems you're investigating.

They're both open source, and Solr at least has extensive mailing lists, so
you can see the sorts of problems people face when building generalized search
engines.

~~~
mattmiller
Nutch will do the crawling and indexing for you. Solr has a web interface
built in. You can build your own SE in a couple hours. Then you can do some
clever machine learning based on usage data with Mahout.

<http://nutch.apache.org/> <http://mahout.apache.org/>
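
For anyone wondering what "indexing" means at its simplest, here is a minimal
sketch of an in-memory inverted index with AND-style keyword search. This is
only an illustration, not how Nutch, Solr or Lucene are implemented; the names
`tokenize`, `build_index` and `search` are made up for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids that contain it.

    docs: dict of doc_id -> document text (hypothetical input format).
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    # Copy the first posting set so intersecting does not mutate the index.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

A real engine adds ranking, stemming, compression and on-disk storage on top
of this, which is where the projects above earn their keep.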

------
sajid
I recommend reading 'Managing Gigabytes' by Witten, Moffat and Bell:

<http://www.amazon.com/Managing-Gigabytes-Compressing-Multimedia-Information/dp/1558605703>

------
kenjackson
Best of luck to the author.

This does remind me of a broader question -- why is there no popular open
source search engine? This seems much more tractable than an open source social
network. I wouldn't be surprised if it could get money/resources from major
players like Facebook, Apple, Oracle. Apache has a lot of the pieces, but no
consumer-facing front-end that ties it all together to search the web (AFAIK).

~~~
duggan
Edit: there appears to be such a project, at least on the crawling side:
<http://www.commoncrawl.org/>

I'd say there's a combination of factors, the first (and most important) being
that Google is good enough for most people.

You'd need to coordinate crawling so as not to turn it into a giant DDoS
machine; speed will be an issue due to geo distribution, variable hardware and
result sets.
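
A minimal sketch of the "don't become a DDoS machine" part, assuming each
crawler enforces a fixed minimum delay per host. The `PolitenessGate` class
and its one-second default are hypothetical; coordinating this across many
independent volunteers is the hard part and is not shown.

```python
import time
from urllib.parse import urlparse


class PolitenessGate:
    """Tracks the earliest time each host may be fetched again."""

    def __init__(self, min_delay_seconds=1.0, clock=time.monotonic):
        self.min_delay = min_delay_seconds
        self.clock = clock  # injectable so the gate can be tested without sleeping
        self.next_allowed = {}  # host -> earliest time of the next allowed fetch

    def wait_time(self, url):
        """Return how many seconds to wait before fetching this URL's host."""
        host = urlparse(url).netloc
        now = self.clock()
        return max(0.0, self.next_allowed.get(host, now) - now)

    def record_fetch(self, url):
        """Mark the host as just fetched, pushing back its next allowed time."""
        host = urlparse(url).netloc
        self.next_allowed[host] = self.clock() + self.min_delay
```

A crawler loop would call `wait_time(url)`, sleep for that long, fetch, and
then call `record_fetch(url)`.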

Validity and reliability of the data would also be issues, and would probably
require several peers to "agree" to consistency, but in a way that does not
allow easy gaming of results.
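
As a rough sketch of what peers "agreeing" could look like, assume each peer
reports a content hash for the same URL and a value is accepted only if at
least `quorum` peers report it. The function name and input format are
hypothetical, and a plain vote like this does nothing about one party running
many fake peers, so it does not solve the gaming problem on its own.

```python
from collections import Counter


def agreed_value(reports, quorum):
    """Return the value reported by at least `quorum` peers, else None.

    reports: dict of peer id -> value that peer reports (e.g. a content hash).
    """
    if not reports:
        return None
    # Find the most frequently reported value and how many peers reported it.
    value, votes = Counter(reports.values()).most_common(1)[0]
    return value if votes >= quorum else None
```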

I suppose they're all solvable, though I think there would have to be a
powerful incentive to do so. I imagine it'd be quite pricey too for the
individual, though perhaps Gabriel Weinberg* could weigh in there.

[*] <http://www.gabrielweinberg.com/>

~~~
kenjackson
_I'd say there's a combination of factors, the first (and most important)
being that Google is good enough for most people._

This is actually why I think it's important to have a good open source
alternative. Google is good today. And frankly, if Bing wasn't around, Google
could probably stop doing any work on search for the next five years with
impunity.

------
jacquesm
Talk to the guy behind gigablast.com

------
nirvana
I believe the solution to much of the problem lies in Riak. It is Erlang-based,
has MapReduce, is document-oriented, has full-text search built in, is
Solr-compatible (though sketchy on details there), is very scalable, and,
importantly, is operationally easy for a small team.

I too found an impedance mismatch with CouchDB for what I'm working on (which
is much like a search engine, but not quite), and found Riak to be a good
solution.

