This is about your startup rather than this post: you should work on the introduction/explanation video on your home page. From a quick watch it has some problems. The lack of a script (or, if you had one, of any rehearsal) means that the time I invest in watching your pitch as a potential customer is spent watching you think and decide what to do next. The video on your tour page (http://helpjuice.com/tour) isn't great, but it is much, much better as an introduction to your product.
What is the differentiation between using this and using Solr? ElasticSearch?
What does "real-time" mean in this context? Is it indexing database content in real-time? Is it in reference to the look-ahead, predictive query completion LinkedIn has?
What would compel someone like me -- a dev who has ownership over the very significant search piece of my company's primary product -- to give this serious evaluation?
I looked at it some before IndexTank was bought (and I've done a reasonable amount of Solr work).
The biggest conceptual difference seemed to be that IndexTank was specifically written to autoscale - it was designed from the ground up to run on cloud providers, and to instantiate new resources as needed. It also has no central point of failure.
Solr Cloud (and things like Solandra) deliver some of this functionality to Solr.
I'd imagine they mean indexing (and being able to search on) data in real time, given LinkedIn's previous open source projects around real-time search (http://javasoze.github.com/zoie/).
Lucene (which Solr uses as its index) cannot expose newly indexed data immediately after it's added.
Lucene exposes IndexReaders for searches, which offer a snapshot view of the index. In order to search across new documents IndexReaders need to be re-opened, a somewhat expensive operation. Expensive enough to prevent it from happening after each document is added, especially if they're added frequently.
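To make the snapshot behaviour concrete, here is a toy Python model of it. This is only a sketch: `ToyIndex` and `ToyReader` are made-up names, not Lucene's API, and the tuple copy merely stands in for whatever makes reopening a real reader costly.

```python
class ToyIndex:
    """Toy in-memory index; NOT Lucene's API, just the snapshot idea."""

    def __init__(self):
        self._docs = []

    def add(self, doc):
        self._docs.append(doc)

    def open_reader(self):
        # Copying the documents stands in for the cost of opening a reader:
        # the reader only ever sees the index as it was at this moment.
        return ToyReader(tuple(self._docs))


class ToyReader:
    def __init__(self, snapshot):
        self._snapshot = snapshot

    def search(self, term):
        # Naive whole-word match over the frozen snapshot.
        return [d for d in self._snapshot if term in d.split()]


index = ToyIndex()
index.add("solr uses lucene")
reader = index.open_reader()
index.add("indextank reuses lucene constructs")  # added after the reader opened

stale_hits = reader.search("lucene")               # old reader misses the new doc
fresh_hits = index.open_reader().search("lucene")  # reopened reader sees both
```

The old reader keeps answering from its snapshot until someone pays for a reopen, which is why reopening after every single added document does not scale.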
The latest version of Lucene supports "near real time" search, but afaik it's not widely used (with Solr).
Yeah, NRT is 4.0; our content is such that right now that kind of flexibility isn't required. (Once-a-day batch db writes that update the index in NRT via signaling)
It is not exactly built ON Lucene. It reuses very specific constructs. The main one is the structure that holds the comprised index. And that is only used for the long term index. The realtime part of the index has been written for IndexTank exclusively.
Great news, but I'm still willing to pay for someone to manage the operational side of this :) Know of any solutions? I'm aware of websolr, but their configuration process wasn't as simple as IndexTank's.
Why did you find writing a custom scorer a pain?
I've done it for raw Lucene and it's trivial (in my case I added real-time data to the formula using an external value source). I am not sure why it would be harder for Solr (I have always gotten away with sort order until now :).
Last time I read about IndexTank I noticed that their query language isn't that sophisticated; it could basically find only exact matches. Has it improved? Is it possible to do fuzzy matches?
ADD: also, does it support non-English languages at all?
IndexTank right now supports prefix search, stemming, and a basic implementation of a Did You Mean feature. Regarding languages, it supports tokenization for every Western language, and not long ago we added support for CJK too.
The indextank repo proper is interesting (and useful) enough, but indextank-service (https://github.com/linkedin/indextank-service) made my jaw drop a little. It's a full administrative stack for deploying indextank as a service.
"IndexEngine: a real-time fulltext search-and-indexing system designed to separate relevance signals from document text. This is because the life cycle of these signals is different from the text itself, especially in the context of user-generated social inputs (shares, likes, +1, RTs)."
It integrates anything that can be represented as a number. Prices, number of badges, importance of titles, it doesn't matter. You can combine any of those inputs into a relevance formula that is evaluated at query time. Of course IndexTank won't find those inputs for you, you have to provide them.
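A rough Python sketch of that idea, for anyone who wants to see it spelled out. Everything here is hypothetical (the `texts` and `signals` dicts, the `relevance` formula, the weights); it is not IndexTank's API or formula syntax, only an illustration of keeping numeric signals apart from the text and combining them at query time.

```python
import math

# Document text, indexed once.
texts = {"a": "cheap red shoes", "b": "red shoes on sale"}

# Numeric signals live separately and can change without touching the text.
signals = {
    "a": {"likes": 3, "price": 20.0},
    "b": {"likes": 40, "price": 35.0},
}


def text_match(query, text):
    # Count how many query terms appear in the document (toy text score).
    words = text.split()
    return sum(1 for term in query.split() if term in words)


def relevance(match, sig):
    # Hypothetical formula: text match boosted by likes, penalised by price.
    return match * math.log(1 + sig["likes"]) - 0.01 * sig["price"]


def search(query):
    scored = []
    for doc_id, text in texts.items():
        match = text_match(query, text)
        if match:
            # Signals are read at query time, so updates apply immediately.
            scored.append((relevance(match, signals[doc_id]), doc_id))
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]


before = search("red shoes")
signals["a"]["likes"] = 500  # signal update only; the text is not re-indexed
after = search("red shoes")
```

With these made-up numbers the ranking flips from b-then-a to a-then-b after the likes update, without the text of either document being touched.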
Are the historical values of those signals kept and queryable? Such that I could check document ranking with signals X, Y and Z today and 3 days ago and check the impact of the signal changes?
Anyone know about how IndexTank's facets scale with the cardinality of the attribute? We tried using ElasticSearch's facet system for tags, but we have about 150k tags, and this does not play well with ES. (It's very stupid about how it caches them.)
IndexTank categories are not designed for the tags use case, and will not work properly. They are intended for a relatively small number of categories for which each document has a single value. The number of different values of a category can be large but the number of categories cannot. If you want to implement something like tags, then each tag should be a category because you'll want more than a single tag per document. We were in the process of designing a new feature to support this kind of use case, and maybe we'll start a branch to implement it and hopefully the community will collaborate.
Those kinds of services are mostly being written in Java these days, and everyone would agree they constitute awesomer software than another JavaScript blablabla library... so how can Java be dead? I should learn Java...
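A toy Python model of why tags fit this scheme poorly. This is an assumption-laden sketch, not IndexTank's implementation: `docs`, `tagged` and `facet_counts` are hypothetical, and the only thing carried over from the comment above is "one value per category per document".

```python
from collections import Counter

# Each document holds exactly one value per category.
docs = {
    "d1": {"type": "video", "lang": "en"},
    "d2": {"type": "video", "lang": "es"},
    "d3": {"type": "blog", "lang": "en"},
}


def facet_counts(doc_ids, docs):
    # One counter per category; each document contributes one value to each.
    counts = {}
    for doc_id in doc_ids:
        for category, value in docs[doc_id].items():
            counts.setdefault(category, Counter())[value] += 1
    return counts


facets = facet_counts(["d1", "d2", "d3"], docs)

# Tags need several values per document, so under this model every tag
# has to become its own category (hypothetical "tag:..." naming).
tagged = {
    "d1": {"tag:python": "yes", "tag:search": "yes"},
    "d2": {"tag:search": "yes"},
}
tag_facets = facet_counts(list(tagged), tagged)
```

Two tags already mean two categories here, so something like 150k tags would mean 150k categories, which is exactly the dimension described above as the one that cannot grow large.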
Definitely considering replacing our search backend at TwitchTV with this...