Gigablast Search Engine, Now Open Source (C/C++) (gigablast.com)
112 points by conductor on Aug 3, 2013 | 25 comments

Gigablast (founded in 2000 by Matt Wells) announced [0] on July 30 that it is open-sourcing its engine under the Apache License, version 2.0. The engine is written in a mixture of C and C++ and counts more than 500,000 lines of code; see the GitHub page [1].

Some facts about the engine:

The code compiles into a single executable file which can scale to thousands of servers.

It is easily configurable and has nice documentation [2].

The code is very stable; it has been running in production since 2002.

Document processing is done using plugins, so you can write a plugin for any type of document.


I would like to see a search engine based on this on the darknets, particularly on I2P.

[0] - http://www.prnewswire.com/news-releases/gigablast-now-an-ope...

[1] - https://github.com/gigablast/open-source-search-engine

[2] - https://www.gigablast.com/admin.html

@conductor THANK YOU! THANK YOU! Thank you sooo much for posting this!!

I really really needed that just right now! You helped me soo much =) Thank you sir!

Why on earth does somebody downvote a thank you post? That's very rude.

2/10 extra points for effort

I love how the code is such a mess. You can really tell one guy just wrote this whole thing over the span of a decade... It's just one patch on top of another and the comments are pretty amusing. Also funny to see hardcoded algorithms for pre-defined site paths and whole domains such as facebook/myspace/vimeo. This is truly a makeshift search engine on a massive scale.

EDIT: Gotta say, this has some very useful pieces of code. I'm working on a niche-specific crawler and am battling the url stripping/cleanup part of it. This is very useful: https://github.com/gigablast/open-source-search-engine/blob/...

Just found a little gem myself. I am working on another open source search engine[0], and needed a way to make badly behaving document filters time out.

Unfortunately the document filter in question does spawn child processes, so the normal way of using fork() and a monitoring process was not working. However, using ulimit like this should work: https://github.com/gigablast/open-source-search-engine/blob/... . Hadn't thought about spawning a new shell and letting it have control like that :)

0: https://github.com/searchdaimon/enterprise-search

There is a possible buffer overflow right there (if the HOME directory is long enough). Why don't people use snprintf?

>Why don't people use snprintf?

Old habits perhaps? When I look back at it I remember that my first books on C were full of problematic sprintf and strcpy use. It may then be easy to keep using what you first learned, even when you know better. It's basically the "Baby duck syndrome"[0] for C functions.

0: http://en.wikipedia.org/wiki/Imprinting_(psychology)#Baby_du...

Features: http://www.gigablast.com/features.html

Interesting read, its history: http://www.gigablast.com/press.html

The great thing about this project is that it comes with good documentation for administrators and developers who want to extend it, since Gigablast has been sold to enterprise customers.

Admin docs - how to build the source, troubleshooting, etc.: http://www.gigablast.com/admin.html

Developer docs - even explain how to use Bash and Git, what to do on hardware failures, etc.: http://www.gigablast.com/developer.html

Two search engine features are currently disabled because of a code overhaul: Boolean query support and the spellchecker. Google is removing more and more such advanced features from its search engine ("+", anyone?), so it would be great if these features made a comeback, either from the original developer or with help from the open source community.

Thanks for open-sourcing it.

I'm really glad to see this open sourced. It could easily lead to a boom of niche web search engines.

BTW, long ago I hoped Gigablast would become a popular Google competitor; no such luck. I remember asking Matt if I could provide an official IE toolbar (when they were all the rage); sadly, he declined. My hope has shifted to DuckDuckGo.

I look forward to forking!

Does anyone have any insight into what they (he?) plan to do now? Do they plan to continue development and operations, or are they open sourcing it because they are shutting down and want their work to at least live on in some form?

The code does not seem to be neatly written: I randomly checked a few files and found that const methods and exceptions are not used properly. Here is a sample function:

    const char *CountryCode::getAbbr(int index) {
        if(index < 0 || index > s_numCountryCodes) index = 0;
        return(s_countryCode[index]);
    }


Gigablast was like the old Google: it was really neat years ago, but sadly never kept up.

There are lots of details about its development on WebmasterWorld; it only uses a handful of servers.


I couldn't get it to compile on my Ubuntu 13 machine without some errors and warnings, so I forked it and made some changes. I don't know git very well, so I don't know how to merge, etc.

I looked at your fork, and it looks like you've already committed your source code to GitHub. All you would have to do now is submit a pull request.

However, given the scale of the project and the fact that the code has been in production for more than 10 years, it's more likely the errors you faced were due to:

- your local environment not being configured ideally, or

- "configuration code" that you did not modify. :)

Thanks for the tip about github.

It says in html/admin.html to just type make to compile.

    You will need the following packages installed
    apt-get install make
    apt-get install g++
    apt-get install libssl-dev (for the includes, 32-bit libs are here)
    1. Run 'make' to compile. (e.g. use 'make -j 4' to compile on four cores)

Indeed you are right :)

I have yet to try installing libssl-dev though, as I don't have root access on the machine I was testing on.

I don't know much about Gigablast, but this sounds pretty cool. If nothing else, it's another alternative to Lucene/Solr or Nutch for people working on search applications.

This isn't an alternative to general purpose text search engines: it is specialized for searching the internet.

Right, so it's an alternative in the cases where someone might use Lucene/Solr for indexing and searching general Internet content. All I meant is that it's an alternative in certain very specific cases.

Still an alternative to nutch and friends

Does anyone know what the advantages are here of using async I/O via signals instead of epoll? Does Gigablast use this technique for historical reasons?

Does anyone know what ever happened to Matt Wells' EventGuru.com project?

I found the Event Guru Blog. The last post is from Apr 17, 2012: "New Site Design": http://www.gigablast.com/egblog.html

The page is gone; Archive.org has no copy (due to a robots.txt flag), but Google still has a cached copy of the blog:


Pretty odd to see 1995-era web design for a service launched in 2012. Thanks for digging :)
