
MG4J – A free full-text search engine for large document collections - luu
http://mg4j.di.unimi.it/
======
verytrivial
That name sounds very familiar, as does the feature set. Managing Gigabytes[1],
or "mg", was the output of University of Melbourne and RMIT research in the
1990s. It went on to be commercialized as SIM and later TeraText[2] and has
largely disappeared into the government intelligence indexing and consulting-
heavy systems space (where it is now presumably being trounced by Palantir).

[1] [https://www.amazon.com/Managing-Gigabytes-Compressing-
Indexi...](https://www.amazon.com/Managing-Gigabytes-Compressing-Indexing-
Documents/dp/1558605703) - Note the review from Peter Norvig!

[2] [http://www.teratext.com/](http://www.teratext.com/)

~~~
timb07
That's exactly what I thought - I worked on index construction for MG back in
1994. (Note, although my name is Tim Bell, I'm not Timothy C. Bell, the
coauthor of "Managing Gigabytes".)

------
vigna
I don't know how this project ended up here at this moment, but as one of the
authors let me answer the main questions.

1) The name is just a coincidence. I learned originally about indexing from
the "Managing Gigabytes" book, and that's the reason for the name, but the
book is now completely obsolete, and, even at that time, it contained a
significant number of red herrings. There's no connection or code or idea
sharing of any kind.

2) MG4J is our playground for doing research in information retrieval. This
means, for example, that we designed new data structures, such as Elias-Fano
indexing, which give MG4J ridiculously fast times in benchmarks (see
[https://github.com/lintool/IR-Reproducibility](https://github.com/lintool/IR-
Reproducibility)). Elias-Fano is now the main Facebook indexing algorithm and
it is slowly percolating to Lucene (look in the sources).

3) You can define your queries using a very rich interval language with a very
fast implementation based on new algorithms. You can easily create parallel
indices with text and tagging and ask whether a phrase falls into an area
tagged as "location", for example.

4) MG4J is a project of two people and at this time I'm the only maintainer.
You cannot expect it to be as refined as Lucene or Solr. But you can very
easily hack into it (even without modifying the sources), which is why it has
been popular with people experimenting with indexing. For example, there are
many tools to manipulate indices: splitting them with a specified strategy,
combining them, etc.

5) So if you want an out-of-the-box solution for indexing, forget about it. If
you want a fun playground for doing research or a very efficient backbone on
which to build your infrastructure, MG4J might be useful to you. We used it
recently for [http://wikirank.di.unimi.it/](http://wikirank.di.unimi.it/) .
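
For readers who have not met the Elias-Fano representation mentioned above, here is a minimal Python sketch of the textbook encoding of a sorted list of document IDs. It is not MG4J's actual code, which is Java and adds skipping and other machinery; the names `ef_encode` and `ef_decode` are made up for illustration. Each value is split into a few low bits stored verbatim and a high part stored as the position of a set bit in a bit vector (here simply a Python integer).

```python
def ef_encode(values, universe):
    """Encode a non-decreasing list of ints in [0, universe).

    Returns (low_bits, lows, upper): the number of low bits per value,
    the list of low parts, and the upper-bits bit vector as a Python int.
    """
    n = len(values)
    if n == 0:
        return 0, [], 0
    # Number of low bits: floor(log2(universe / n)), never negative.
    low_bits = max(0, (universe // n).bit_length() - 1)
    mask = (1 << low_bits) - 1
    lows = []
    upper = 0
    prev = 0
    for i, v in enumerate(values):
        if v < prev or v >= universe:
            raise ValueError("values must be sorted and inside [0, universe)")
        prev = v
        lows.append(v & mask)
        # The i-th value sets the bit at position (high part + i),
        # so equal high parts still land on distinct positions.
        upper |= 1 << ((v >> low_bits) + i)
    return low_bits, lows, upper


def ef_decode(low_bits, lows, upper):
    """Rebuild the original list by scanning the upper bit vector."""
    out = []
    i = 0    # index of the next value to emit
    pos = 0  # current bit position in the upper vector
    while upper:
        if upper & 1:
            high = pos - i
            out.append((high << low_bits) | lows[i])
            i += 1
        upper >>= 1
        pos += 1
    return out


doc_ids = [3, 4, 7, 13, 14, 15, 21, 43]
encoded = ef_encode(doc_ids, 64)
print(ef_decode(*encoded) == doc_ids)
```

A real implementation would pack the low parts into a bit array and add a select structure over the upper bits for fast random access; this sketch only shows the layout.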

------
dumbfounder
Blast from the past! Distributed is a bit of a stretch; I think you need to
coordinate all of that yourself. It is no more distributed than Lucene (I
think).

Their fastutil stuff is pretty interesting though for creating highly
optimized algorithms. Lots of primitive-based data structures that are fast
and memory efficient.
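
To illustrate why primitive-based collections save memory, here is a rough Python analogy. This is not fastutil, which is a Java library; the helper names `boxed_size` and `packed_size` are made up for the example. It compares a list of integer objects with the same values stored as raw 64-bit integers in the standard `array` module.

```python
import sys
from array import array


def boxed_size(values):
    """Approximate bytes for a list plus the int objects it points to."""
    return sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)


def packed_size(values):
    """Bytes for the same values stored unboxed as signed 64-bit ints."""
    return sys.getsizeof(array("q", values))


doc_ids = list(range(1000, 101000))
print(boxed_size(doc_ids), packed_size(doc_ids))
```

The exact numbers depend on the interpreter, but the packed form avoids one object header and one pointer per element, which is the same kind of saving primitive collections give over boxed ones in Java.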

------
styfle
How does this compare to Elasticsearch or Solr?

~~~
drdaeman
I think it makes more sense to compare it with Lucene (which both
ElasticSearch and Solr are based on) or, say, Xapian.

Based on a PDF
[http://grupoweb.upf.es/WRG/mir2ed/pdf/slides_appendixA.pdf](http://grupoweb.upf.es/WRG/mir2ed/pdf/slides_appendixA.pdf)
page 14 (linked from
[http://stackoverflow.com/q/5028314/116546](http://stackoverflow.com/q/5028314/116546)),
I think the difference is that MG4J has constant RAM usage (as opposed to
Lucene's linear usage), but is somewhat more CPU intensive.

Haven't used either directly.

~~~
fizx
The comparison is with an 8-year-old version of Lucene. Lucene is (optionally)
constant RAM now.

------
woliveirajr
Some links on the unimi.it site are broken.

------
bawllz
How is this on the first page of Hacker News?

~~~
anigbrowl
It's hard to answer that without any idea of why you find its presence
surprising.

