LinkedIn open sources IndexTank: search engine and service

emmett · on Dec 22, 2011

This is awesome news. Massively advances the current state of the art in open source search.

Definitely considering replacing our search backend at TwitchTV with this...

hajrice · on Dec 22, 2011

Hey Emmett, we're one of the companies interested in continuing IndexTank's platform.

If you want to hear more, just ping me at emil@helpjuice.com

citricsquid · on Dec 22, 2011

Related to your startup and not this post: you should work on the introduction/explanation video on your home page. From a quick watch it has some problems. The lack of any script (or, if you had one, you didn't rehearse it) means the time I am investing in watching your pitch as a potential customer is time spent watching you think and decide what to do next. The video on your tour page (http://helpjuice.com/tour) isn't great, but it is much, much better as an introduction video to your product.

mattdeboard · on Dec 22, 2011

What is the differentiation between using this and using Solr? ElasticSearch?

What does "real-time" mean in this context? Is it indexing database content in real-time? Is it in reference to the look-ahead, predictive query completion LinkedIn has?

What would compel someone like me -- a dev who has ownership over the very significant search piece of my company's primary product -- to give this serious evaluation?

nl · on Dec 22, 2011

I looked at it some before IndexTank was bought (and I've done a reasonable amount of Solr work).

The biggest conceptual difference seemed to be that IndexTank was specifically written to autoscale - it was designed from the ground up to run on cloud providers, and to instantiate new resources as needed. It also has no central point of failure.

Solr Cloud (and things like Solandra) deliver some of this functionality to Solr.

Argorak · on Dec 22, 2011

Well, ElasticSearch is written with this in mind as well, so what's the huge difference between those?

sandGorgon · on Dec 22, 2011

If you had to incorporate search today, would you use IndexTank or Solr?

nl · on Dec 22, 2011

Solr, because I know it well. But I'd love to play with IndexTank.

gnubardt · on Dec 22, 2011

I'd imagine they mean indexing (and being able to search on) data in real time, given LinkedIn's previous open source projects around real-time search (http://javasoze.github.com/zoie/).

Lucene (which Solr uses as its index) cannot expose newly indexed data immediately after it's added.

Lucene exposes IndexReaders for searches, which offer a snapshot view of the index. In order to search across new documents IndexReaders need to be re-opened, a somewhat expensive operation. Expensive enough to prevent it from happening after each document is added, especially if they're added frequently.

The latest version of Lucene supports "near real time" search, but afaik it's not widely used (with Solr).
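
To make the snapshot behaviour described above concrete, here is a minimal toy sketch in Python. It is not Lucene's API: `ToyIndex`, `ToyReader`, and `open_reader` are hypothetical names made up for illustration, and the only point is that a reader opened before a write does not see that write until a new reader is opened.

```python
class ToyReader:
    """Hypothetical point-in-time view over a fixed set of documents."""

    def __init__(self, docs):
        self._docs = docs  # immutable tuple captured when the reader was opened

    def search(self, term):
        # Naive whitespace match; real engines use an inverted index.
        return [d for d in self._docs if term in d.lower().split()]


class ToyIndex:
    """Hypothetical in-memory index; illustrates snapshots only, not Lucene."""

    def __init__(self):
        self._docs = []

    def add(self, text):
        self._docs.append(text)

    def open_reader(self):
        # Copying the documents freezes what this reader can see.
        return ToyReader(tuple(self._docs))


index = ToyIndex()
index.add("open source search")
reader = index.open_reader()      # snapshot taken here
index.add("real time search")     # added after the snapshot

stale_hits = reader.search("search")               # old reader: misses the new doc
fresh_hits = index.open_reader().search("search")  # reopened reader: sees both
```

In this toy, reopening is just a copy. In a real engine the reopen is the expensive step, which is why it is not done after every added document.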

mattdeboard · on Dec 22, 2011

Yeah, NRT is 4.0; our content is such that right now that kind of flexibility isn't required. (Once-a-day batch db writes that update the index in NRT via signaling)

nl · on Dec 22, 2011

IndexTank is built on Lucene too. I'm not sure if it is the real time branch or not, though.

nachopg · on Dec 22, 2011

It is not exactly built ON Lucene. It reuses very specific constructs. The main one is the structure that holds the comprised index. And that is only used for the long term index. The realtime part of the index has been written for IndexTank exclusively.

zfran · on Dec 22, 2011

http://indextank.com/documentation/faq

biznickman · on Dec 22, 2011

Great news, but I'm still willing to pay for someone to manage the operational side of this :) Know of any solutions? I'm aware of websolr, but their configuration process wasn't as simple as IndexTank's.

nestlequ1k · on Dec 22, 2011

Same here. IndexTank's service and pricing were great. Hoping someone can match it.

toisanji · on Dec 22, 2011

I'd like to see how this compares to Lucene/Solr. With Solr it's easy to index hundreds of millions of docs, but it's a pain to write a custom scorer.

espeed · on Dec 22, 2011

IndexTank provides real-time document indexing and its algorithm incorporates real-time metrics, like vote data. And it scales horizontally.

riffraff · on Dec 22, 2011

Why did you find writing a custom scorer a pain? I've done it for raw Lucene and it's trivial (in my case I added real-time data to the formula using an external value source). I am not sure why it would be harder for Solr (I always got away with sort order until now :).

alexro · on Dec 22, 2011

Last time I read about IndexTank I noticed that their query language wasn't that sophisticated; it could basically find only exact matches. Has it improved? Is it possible to do fuzzy matches?

ADD: also, does it support non-english languages at all?

nachopg · on Dec 22, 2011

IndexTank right now supports prefix search, stemming and a basic implementation of a Did You Mean feature. Regarding languages, it supports tokenization for every western language, and not long ago we added support for CJK too.

gexla · on Dec 22, 2011

And a new startup offers a hosted IndexTank service in 3,2,1...

For anyone looking for a job at LinkedIn, making impactful contributions to this project could be a way in.

sycr · on Dec 22, 2011

Yeah, really though.

The indextank repo proper is interesting (and useful) enough, but indextank-service (https://github.com/linkedin/indextank-service) made my jaw drop a little. It's a full administrative stack for deploying indextank as a service.

SlightGenius · on Dec 22, 2011

Does IndexTank still integrate social inputs?

"IndexEngine: a real-time fulltext search-and-indexing system designed to separate relevance signals from document text. This is because the life cycle of these signals is different from the text itself, especially in the context of user-generated social inputs (shares, likes, +1, RTs)."

diego · on Dec 22, 2011

It integrates anything that can be represented as a number. Prices, number of badges, importance of titles, it doesn't matter. You can combine any of those inputs into a relevance formula that is evaluated at query time. Of course IndexTank won't find those inputs for you, you have to provide them.
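
A rough sketch of what "a relevance formula evaluated at query time" can look like, in plain Python. This is not IndexTank's scoring syntax or client API. The field names (`text_score`, `votes`) and the formula itself are assumptions chosen for illustration. The idea shown is that the numeric signals live apart from the text, so updating one changes the ranking on the next query without re-indexing the document.

```python
import math

# Hypothetical documents: a text-match score plus caller-supplied numeric signals.
docs = [
    {"id": "a", "text_score": 1.0, "votes": 100},
    {"id": "b", "text_score": 2.0, "votes": 0},
]


def relevance(doc):
    # Illustrative formula only: text match boosted by a dampened vote count.
    return doc["text_score"] * (1 + math.log10(1 + doc["votes"]))


def rank(documents, formula):
    # The formula runs at query time, so it always sees the current signal values.
    return [d["id"] for d in sorted(documents, key=formula, reverse=True)]


before = rank(docs, relevance)   # "a" wins on votes despite the weaker text match

docs[1]["votes"] = 999           # signal update only; the text is untouched
after = rank(docs, relevance)    # "b" now ranks first
```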

mgkimsal · on Dec 22, 2011

Are the historical values of those signals kept and queryable? Such that I could check document ranking with signals X, Y and Z today and 3 days ago and check the impact of the signal changes?

atombender · on Dec 22, 2011

Anyone know about how IndexTank's facets scale with the cardinality of the attribute? We tried using ElasticSearch's facet system for tags, but we have about 150k tags, and this does not play well with ES. (It's very stupid about how it caches them.)

santip · on Dec 23, 2011

IndexTank categories are not designed for the tags use case, and will not work properly. They're intended for a relatively small number of categories for which each document has a single value. The number of different values of a category can be large but the number of categories cannot. If you want to implement something like tags, then each tag should be a category because you'll want more than a single tag per document. We were in the process of designing a new feature to support this kind of use case, and maybe we'll start a branch to implement it and hopefully the community will collaborate.
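
A small Python sketch of the data model described above, for readers trying to picture the limitation. None of this is IndexTank code: the document layout and the helper names `facet_counts` and `tags_as_categories` are hypothetical. It shows single-valued categories being counted as facets, and why modelling tags as one category per tag makes the number of categories grow with the number of distinct tags.

```python
from collections import Counter

# Hypothetical documents: each category holds exactly one value per document.
docs = [
    {"id": 1, "categories": {"type": "post", "author": "ann"}},
    {"id": 2, "categories": {"type": "post", "author": "bob"}},
    {"id": 3, "categories": {"type": "comment", "author": "ann"}},
]


def facet_counts(documents):
    # One Counter per category name, counting how many docs carry each value.
    counts = {}
    for doc in documents:
        for name, value in doc["categories"].items():
            counts.setdefault(name, Counter())[value] += 1
    return counts


facets = facet_counts(docs)  # few categories, any number of values: fine


def tags_as_categories(tags):
    # Workaround sketch: a document with several tags needs one category per tag.
    return {f"tag_{t}": "yes" for t in tags}


tagged = [
    tags_as_categories(["java", "search"]),
    tags_as_categories(["search", "cloud"]),
]
# The category count now tracks the number of distinct tags (150k in the question).
distinct_categories = {name for cats in tagged for name in cats}
```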

atombender · on Dec 24, 2011

Thanks for clearing that up.

swah · on Dec 22, 2011

Those kinds of services are mostly being written in Java these days, and everyone would agree they constitute awesomer software than another JavaScript blablabla library... so how can Java be dead? I should learn Java...

fufulabs · on Dec 22, 2011

In terms of ease of installation > working state, how does it compare to ElasticSearch or Solr?

iag · on Dec 22, 2011

Very impressive, LinkedIn. Good move.