
Gigablast Search Engine, Now Open Source (C/C++) - conductor
https://www.gigablast.com/
======
conductor
Gigablast (founded in 2000 by Matt Wells) announced [0] on July 30 that it is
open-sourcing its engine under the Apache License, version 2. The engine is
written in a mixture of C and C++ and comprises more than 500,000 lines of
code; see the GitHub page [1].

Some facts about the engine:

The code compiles into a single executable file which can scale to thousands
of servers.

It is easily configurable and has good documentation [2].

The code is very stable; it has been running in production since 2002.

Document processing is done using plugins, so you can write a plugin for any
type of document.

\---

I would like to see a search engine based on this on the darknets,
particularly on I2P.

[0] - [http://www.prnewswire.com/news-releases/gigablast-now-an-
ope...](http://www.prnewswire.com/news-releases/gigablast-now-an-open-source-
search-engine-217624911.html)

[1] - [https://github.com/gigablast/open-source-search-
engine](https://github.com/gigablast/open-source-search-engine)

[2] -
[https://www.gigablast.com/admin.html](https://www.gigablast.com/admin.html)

~~~
X4
@conductor THANK YOU! THANK YOU! Thank you sooo much for posting this!!

I really really needed that just right now! You helped me soo much =) Thank
you sir!

~~~
X4
Why on earth does somebody downvote a thank you post? That's very rude.

~~~
AsymetricCom
2/10 extra points for effort

------
throwawayyyz
I love how the code is such a mess. You can really tell one guy just wrote
this whole thing over the span of a decade... It's just one patch on top of
another and the comments are pretty amusing. Also funny to see hardcoded
algorithms for pre-defined site paths and whole domains such as
facebook/myspace/vimeo. This is truly a makeshift search engine on a massive
scale.

EDIT: Gotta say, this has some very useful pieces of code. I'm working on a
niche-specific crawler and am battling the url stripping/cleanup part of it.
This is very useful: [https://github.com/gigablast/open-source-search-
engine/blob/...](https://github.com/gigablast/open-source-search-
engine/blob/master/Url.cpp#L249)
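
For anyone fighting the same cleanup problem, here is a minimal sketch of the kind of steps involved. It is not a port of Gigablast's Url.cpp; `cleanUrl` is a made-up name, and the sketch assumes absolute URLs without a userinfo part (user:pass@host), which it would wrongly lowercase.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper, not Gigablast's code. Performs three simple cleanups:
//   1. drops the "#fragment",
//   2. lowercases the scheme and host (both are case-insensitive),
//   3. gives a URL with an empty path the path "/".
// Assumption: no userinfo ("user:pass@") in the URL; relative URLs are only
// stripped of their fragment.
std::string cleanUrl(std::string url) {
    // 1. The fragment is never sent to the server, so it is useless for crawling.
    const std::string::size_type hash = url.find('#');
    if (hash != std::string::npos) url.erase(hash);

    // 2. Locate the end of "scheme://host[:port]".
    const std::string::size_type schemeEnd = url.find("://");
    if (schemeEnd == std::string::npos) return url;  // relative URL: stop here

    std::string::size_type hostEnd = url.find_first_of("/?", schemeEnd + 3);
    if (hostEnd == std::string::npos) hostEnd = url.size();

    for (std::string::size_type i = 0; i < hostEnd; ++i) {
        url[i] = static_cast<char>(std::tolower(static_cast<unsigned char>(url[i])));
    }

    // 3. "http://example.com" and "http://example.com/" name the same resource.
    if (hostEnd == url.size()) url += '/';
    return url;
}
```

The path is deliberately left untouched, since paths are case-sensitive on most servers. A real crawler would need more (default ports, percent-encoding, session parameters), which is where the linked file gets interesting.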

~~~
runarb
Just found a little gem myself. I am working on another open source search
engine[0], and needed a way to make badly behaving document filters time out.

Unfortunately the document filter in question does spawn child processes,
so the normal way of using fork() and a monitoring process was not working.
However, using ulimit like this should work:
[https://github.com/gigablast/open-source-search-
engine/blob/...](https://github.com/gigablast/open-source-search-
engine/blob/master/gbfilter.cpp#L265) . Hadn’t thought about spawning a new
shell and letting it have control like that :)

0: [https://github.com/searchdaimon/enterprise-
search](https://github.com/searchdaimon/enterprise-search)
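
Roughly, the idea looks like the sketch below. This is an illustration under my own assumptions, not the code from gbfilter.cpp: `buildLimitedCommand` and `runLimited` are made-up names, and the filter command is whatever you would normally run.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <string>

// Hypothetical helper (not Gigablast's code): prefixes a shell command with
// "ulimit -t N". Resource limits set in the shell are inherited by every
// process it starts, so child processes spawned by the filter are limited too.
// Note that -t limits CPU seconds per process, not wall-clock time.
std::string buildLimitedCommand(int cpuSeconds, const std::string &filterCmd) {
    assert(cpuSeconds > 0);
    char prefix[64];
    // snprintf bounds the write to the size of prefix.
    std::snprintf(prefix, sizeof(prefix), "ulimit -t %d; ", cpuSeconds);
    return std::string(prefix) + filterCmd;
}

// Runs the limited command through the shell and returns the raw status
// value reported by std::system().
int runLimited(int cpuSeconds, const std::string &filterCmd) {
    return std::system(buildLimitedCommand(cpuSeconds, filterCmd).c_str());
}
```

Because the limit is on CPU time, a filter that hangs while sleeping or blocked on I/O would not be killed by this alone.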

~~~
conductor
There is a possible buffer overflow right there (if the HOME directory path is
long enough). Why don't people use snprintf?
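
A minimal sketch of what I mean, with a made-up helper `buildPath` rather than the actual gbfilter.cpp code: snprintf takes the buffer size, never writes past it, and its return value tells you whether the output was truncated.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Hypothetical helper (not from Gigablast): formats "<home>/<name>" into a
// fixed-size buffer. Returns true only if the complete path fit.
bool buildPath(char *out, std::size_t outSize, const char *home, const char *name) {
    assert(out != nullptr && home != nullptr && name != nullptr);
    if (outSize == 0) return false;
    // snprintf writes at most outSize bytes including the terminating NUL and
    // returns the length the full string would have needed.
    const int needed = std::snprintf(out, outSize, "%s/%s", home, name);
    return needed >= 0 && static_cast<std::size_t>(needed) < outSize;
}
```

With sprintf the same call would write past the end of the buffer as soon as HOME is longer than expected; here the caller gets a truncated, NUL-terminated string and a false return value to act on.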

~~~
runarb
_> Why don't people use snprintf?_

Old habits perhaps? When I look back at it I remember that my first books on C
were full of problematic sprintf and strcpy use. It may then be easy to
continue using what you first learned, even when you know better. It's
basically the "Baby duck syndrome"[0] for C functions.

0:
[http://en.wikipedia.org/wiki/Imprinting_(psychology)#Baby_du...](http://en.wikipedia.org/wiki/Imprinting_\(psychology\)#Baby_duck_syndrome)

------
frik
Features:
[http://www.gigablast.com/features.html](http://www.gigablast.com/features.html)

Interesting read, its history:
[http://www.gigablast.com/press.html](http://www.gigablast.com/press.html)

The great thing about this project is that it comes with good documentation
for administrators and developers who want to extend it, as Gigablast has been
sold to enterprise customers.

Admin docs - how to build the source, troubleshooting, etc.:
[http://www.gigablast.com/admin.html](http://www.gigablast.com/admin.html)

Developer docs - even explain how to use Bash and Git, what to do on hardware
failures, etc.:
[http://www.gigablast.com/developer.html](http://www.gigablast.com/developer.html)

Two search engine features are currently disabled because of a code overhaul:
Boolean query support and the spellchecker. As Google is removing more and
more such advanced features from its search engine ("+", anyone?), it would be
great if these features made a comeback, either from the original developer
or with help from the open source community.

Thanks for open-sourcing it.

------
busterc
I'm really glad to see this open sourced. It could easily lead to a boom of
niche web search engines.

BTW, long ago I hoped Gigablast would become a popular Google competitor; no
such luck. I remember asking Matt if I could provide an official IE toolbar
(when they were all the rage); sadly, he declined. My hope has shifted to
DuckDuckGo.

I look forward to forking!

------
runarb
Does anyone have any insight into what they (he?) plan to do now? Do they plan
to continue development and operations, or are they open sourcing it because
they are shutting down and want their work to at least live on in some form?

------
c001
The code does not seem to be neatly written: I randomly checked a few files
and found that const methods and exceptions are not used properly. Here is
a sample function:

    const char *CountryCode::getAbbr(int index) {
        if(index < 0 || index > s_numCountryCodes) index = 0;
        return(s_countryCode[index]);
    }

[https://github.com/gigablast/open-source-search-
engine/blob/...](https://github.com/gigablast/open-source-search-
engine/blob/master/CountryCode.cpp#L1328)
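
For illustration, a sketch of how such a lookup could be written. The table and its contents are made up for the example, and it assumes the count variable holds the number of elements in the array, which the quoted snippet does not show. Under that assumption, `index > s_numCountryCodes` lets `index == count` through, one past the end.

```cpp
#include <cstddef>

namespace {
// Made-up stand-in for the real table; entry 0 acts as the "unknown" fallback.
const char *const kCountryCodes[] = {"zz", "us", "de", "no"};
const std::size_t kNumCountryCodes = sizeof(kCountryCodes) / sizeof(kCountryCodes[0]);
}  // namespace

class CountryCode {
public:
    // Declared const because the lookup does not modify the object, so it can
    // be called through a const reference or pointer.
    const char *getAbbr(int index) const {
        // >= instead of >: valid indexes run from 0 to count - 1.
        if (index < 0 || static_cast<std::size_t>(index) >= kNumCountryCodes) {
            index = 0;
        }
        return kCountryCodes[index];
    }
};
```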

------
ck2
Gigablast was like the old Google, it was really neat years ago, but sadly
never kept up.

There are lots of details about its development on WebMasterWorld; it only
uses a handful of servers.

------
emjaykay
[https://github.com/emmjaykay/open-source-search-
engine](https://github.com/emmjaykay/open-source-search-engine)

I couldn't get it to compile on my Ubuntu 13 machine without some errors and
warnings, so I forked it and made some changes. I don't know Git very well, so
I don't know how to merge, etc.

~~~
theGimp
I looked at your fork, and it looks like you've already committed your source
code to GitHub. All you would have to do now is submit a pull request.

However, given the scale of the project and the fact that the code has been in
production for more than 10 years, it's more likely the errors you faced were
due to:

\- your local environment not being configured ideally, or

\- "configuration code" that you did not modify. :)

~~~
emjaykay
Thanks for the tip about GitHub.

It says in html/admin.html to just type make to compile.

    
    
        You will need the following packages installed
        apt-get install make
        apt-get install g++
        apt-get install libssl-dev (for the includes, 32-bit libs are here)
        1. Run 'make' to compile. (e.g. use 'make -j 4' to compile on four cores)

~~~
theGimp
Indeed you are right :)

I have yet to try installing libssl-dev though, as I don't have root access on
the machine I was testing on.

------
mindcrime
I don't know much about Gigablast, but this sounds pretty cool. If nothing
else, it's another alternative to Lucene/Solr or Nutch for people working on
search applications.

~~~
gregwebs
This isn't an alternative to general purpose text search engines: it is
specialized for searching the internet.

~~~
mindcrime
Right, so it's an alternative in the cases where someone might use Lucene/Solr
for indexing and searching general Internet content. That's all I meant: it's
an alternative in certain very specific cases.

------
ithkuil
Does anyone know what the advantages are here of using async I/O via signals
instead of epoll? Does Gigablast use this technique for historical reasons?
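
For comparison, this is the epoll style I have in mind, as a minimal Linux-only sketch. It is not Gigablast's code: `epollPipeDemo` is a made-up name and a pipe stands in for a socket. Readiness is collected by a blocking `epoll_wait` call in the normal control flow instead of being delivered to a signal handler.

```cpp
#include <sys/epoll.h>
#include <unistd.h>

// Hypothetical demo: registers the read end of a pipe with epoll, writes one
// byte to the pipe, and returns the number of ready descriptors reported by
// epoll_wait. Returns -1 if any setup step fails.
int epollPipeDemo() {
    int fds[2];
    if (pipe(fds) != 0) return -1;

    const int ep = epoll_create1(0);
    if (ep < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    epoll_event ev{};
    ev.events = EPOLLIN;   // notify when the descriptor becomes readable
    ev.data.fd = fds[0];   // user data handed back by epoll_wait

    int ready = -1;
    if (epoll_ctl(ep, EPOLL_CTL_ADD, fds[0], &ev) == 0 &&
        write(fds[1], "x", 1) == 1) {
        epoll_event out{};
        // Wait up to one second for an event on any registered descriptor.
        ready = epoll_wait(ep, &out, 1, 1000);
        if (ready == 1 && out.data.fd != fds[0]) ready = -1;
    }

    close(ep);
    close(fds[0]);
    close(fds[1]);
    return ready;
}
```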

------
throwawayyyz
Does anyone know what ever happened to Matt Wells' EventGuru.com project?

~~~
frik
I found the Event Guru Blog. The last post is from Apr 17, 2012: "New Site
Design":
[http://www.gigablast.com/egblog.html](http://www.gigablast.com/egblog.html)

The page is gone; Archive.org has no copy (due to a robots.txt flag), but
Google still has a cached copy of the blog:

[http://webcache.googleusercontent.com/search?q=cache:9lS6Ngk...](http://webcache.googleusercontent.com/search?q=cache:9lS6NgkcetMJ:www.gigablast.com/egblog.html+http://www.eventguru.com/egblog.html&cd=1&hl=en&ct=clnk)

~~~
throwawayyyz
Pretty odd to see 1995-era web design for a service launched in 2012. Thanks
for digging :)

