
Features of Solr vs. ElasticSearch - friendlytuna
http://solr-vs-elasticsearch.com/
======
superkt
Hi! I'm the author of <http://solr-vs-elasticsearch.com>

It's something I threw together in a couple hours, and figured I'd iterate and
improve over the next couple days, so please bear with the mistakes.

I fixed the more glaring errors (copy field, dynamic fields, Django, etc.), and
will continue to do so as comments come in.

------
whalesalad
I've been playing with ElasticSearch + Tire[0] over the span of the last week.
It's a joy to use. Sunspot + Solr isn't a bad alternative, though.

Tire's docs are a bit lacking but it maps more-or-less 1:1 with ElasticSearch,
so it's not too bad.

Very pleased with the performance ElasticSearch provides. The installation was
a bit foreign for me personally (a java service container? openjdk 6 or 7?)
but it's lightning quick and very flexible.

I chose to go the ElasticSearch route mainly due to my need for Geospatial
indexing. Solr does it too, but Foursquare[1] uses ElasticSearch and so that
got me interested in learning more. Geo queries are really fast. My last
experience with Geo involved GeoDjango and all kinds of obnoxious hacks to
PostgreSQL to make it work. With ElasticSearch you tell it to index a point
and boom you're off to the races.

[0]: <https://github.com/karmi/tire>

[1]: <https://foursquare.com/about/>
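
For anyone wondering what "tell it to index a point" looks like, here is a rough sketch of the two request bodies involved, built as plain Python dicts so nothing here talks to a server. The document type `venue` and the field `location` are made-up names, and the search uses the `filtered` query form from ElasticSearch releases of this era, so treat it as an illustration rather than a reference.

```python
import json


def geo_point_mapping(doc_type, field):
    """Mapping body that declares `field` as a geo_point."""
    return {doc_type: {"properties": {field: {"type": "geo_point"}}}}


def geo_distance_search(field, lat, lon, distance):
    """Search body: match everything, then keep documents within `distance`."""
    return {
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                "filter": {
                    "geo_distance": {
                        "distance": distance,
                        field: {"lat": lat, "lon": lon},
                    }
                },
            }
        }
    }


# Hypothetical names: doc type "venue", field "location".
mapping = geo_point_mapping("venue", "location")
search = geo_distance_search("location", 40.7128, -74.0060, "2km")

if __name__ == "__main__":
    print(json.dumps(mapping, indent=2))
    print(json.dumps(search, indent=2))
```

The mapping would be sent once when the index is created; after that, documents only need a `location` value with `lat` and `lon`.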

~~~
Freaky
I used Solr for the previous generation of FreshBSD[0], and migrated to
ElasticSearch and Tire over a year ago and haven't looked back. The docs for
both ElasticSearch and Tire leave something to be desired, but it's still so
much nicer to use and a lot faster out of the box, at least for my modest
needs.

[0]: <http://freshbsd.org/>

~~~
ecaron
Solr's docs also leave a lot to be desired. The wikis are a mess and knowing
what is still relevant (e.g. not deprecated) is frequently a crap shoot.

It'd be great if their docs had versioning (like the Apache HTTP Server
Project), but I suspect that isn't on anyone's roadmap.

------
cedrichurst
I love both Solr and ElasticSearch but the big missing comparison for me is:
are there any books available? Or even comprehensive tutorials beyond the
basics? I love ElasticSearch but it was a huge pain getting up-to-speed on
everything. Figuring out things like EdgeNGrams (something I already knew how
to do in Solr and Lucene) meant digging into the source code. I'm not shy
about doing that myself, but giving that advice to a consulting client would
be a non-starter. With the explosive growth of ES just in the last year or
two, it's really time for someone to start working on a book. Packt, Manning,
O'Reilly, any news?
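
For readers who haven't met edge n-grams: the idea is to index every leading prefix of a token so that prefix (autocomplete-style) lookups become exact term matches. A minimal sketch in plain Python, independent of either engine; the function name and defaults are my own and not taken from Solr or ElasticSearch:

```python
def edge_ngrams(token, min_gram=1, max_gram=20):
    """Return the leading prefixes of `token` with lengths min_gram..max_gram."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


# Index side: store every prefix of the token.
indexed_terms = set(edge_ngrams("elasticsearch", min_gram=2, max_gram=8))

# Query side: a typed prefix is looked up as an exact term.
matches_prefix = "elas" in indexed_terms
```

In both engines this is done by a token filter in the analysis chain rather than in application code.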

------
meritt
Add geospatial to your comparison chart please. The way in which these
implement support varies widely in performance and accuracy. I've yet to find
one that actually uses R-Trees. Geohashes seem to be all the rage these days.
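
For context, a geohash interleaves longitude and latitude bits and writes them in base 32, so nearby points tend to share a string prefix and an ordinary inverted index can serve proximity lookups. A small self-contained encoder, as a sketch of the standard scheme rather than of either engine's implementation:

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=9):
    """Encode a latitude/longitude pair as a geohash of `precision` characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half of the interval
        else:
            bits <<= 1
            rng[1] = mid  # keep the lower half of the interval
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)
```

A shorter hash is always a prefix of a longer one for the same point, which is what makes prefix queries usable as coarse bounding-box searches. The known weakness is that two points on either side of a cell boundary can be close together and still share no prefix.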

~~~
timr
_"I've yet to find one that actually uses R-Trees."_

That's because R-Trees don't scale well with random write loads.

R-tree insertion performance is extremely dependent on insertion order (search
for "sort tile recursive"). They're best used for problems where the data can
be bulk-loaded and left alone. If random writes are an important part of the
problem (as they are for most web-based tools), R-Trees are a bad idea.
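
To make the bulk-loading point concrete, here is a minimal sketch of the leaf-packing step of sort-tile-recursive for 2D points: sort by x, cut into vertical slices, sort each slice by y, then pack runs of `capacity` points into leaves. It only builds the leaf level (the upper levels repeat the same step on the leaf boxes), and the function names are mine.

```python
import math


def str_pack_leaves(points, capacity):
    """Group 2D points into R-tree leaves using sort-tile-recursive packing."""
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    if not points:
        return []
    leaf_count = math.ceil(len(points) / capacity)
    slice_count = math.ceil(math.sqrt(leaf_count))
    slice_size = slice_count * capacity  # points per vertical slice
    by_x = sorted(points, key=lambda p: p[0])
    leaves = []
    for i in range(0, len(by_x), slice_size):
        # Within one vertical slice, order by y and pack consecutive runs.
        vertical_slice = sorted(by_x[i:i + slice_size], key=lambda p: p[1])
        for j in range(0, len(vertical_slice), capacity):
            leaves.append(vertical_slice[j:j + capacity])
    return leaves


def bounding_box(leaf):
    """Return (min_x, min_y, max_x, max_y) for a non-empty leaf."""
    xs = [p[0] for p in leaf]
    ys = [p[1] for p in leaf]
    return (min(xs), min(ys), max(xs), max(ys))
```

On a 4x4 grid with a capacity of 4 this yields four non-overlapping square leaves, which is the tight packing that one-at-a-time random inserts generally fail to produce.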

------
zimbatm
I'm not an ElasticSearch expert but it seems that the scenario for "Field
copying" would be supported with the multi_field indexing
(<http://www.elasticsearch.org/guide/reference/mapping/multi-field-type.html>).
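
As far as I can tell from that page, a multi_field mapping looks roughly like the following: one source value indexed two ways, analyzed for full-text search and untouched for sorting or faceting. The field name `title` and sub-field name `raw` are placeholders I picked for the example.

```python
def multi_field_mapping(field, raw_name="raw"):
    """Mapping fragment indexing `field` both analyzed and not_analyzed."""
    return {
        field: {
            "type": "multi_field",
            "fields": {
                # Default sub-field shares the parent name and is analyzed.
                field: {"type": "string", "index": "analyzed"},
                # Second copy is stored as a single unanalyzed term.
                raw_name: {"type": "string", "index": "not_analyzed"},
            },
        }
    }


title_mapping = multi_field_mapping("title")
```

The unanalyzed copy would then be addressed as `title.raw`, which is close to what a Solr `copyField` into a string-typed field gives you.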

~~~
johnnymonster
That is correct. It appears that this comparison is a bit inconsistent. Also,
there are client libs for javascript as well.

------
aidos
Good overview - shows you just how powerful these engines are.

I can't speak for ElasticSearch but there are a couple things in the Solr list
that I'm not sure about.

_"Multiple document types per schema"_ - You can use dynamic fields so that
you don't even need to define your document schema.

_"Schema change requires restart"_ - I think in MultiCore it happens when
you swap cores (which is a good way of running Solr) [0]

[0]: <http://stackoverflow.com/questions/10417422/solr-schema-changes-arent-picked-up-unless-solr-is-stopped-for-3-seconds>

~~~
nzadrozny
Furthermore, with respect to schema changes, ElasticSearch will refuse to make
backwards-incompatible changes. So for either search engine, you'll need to
get comfortable at some point with the procedure for creating a new index with
the new schema or mapping, reindexing your data, and hot-swapping the Solr
Core or ElasticSearch Alias.
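
On the ElasticSearch side, the hot swap is a single request to the `_aliases` endpoint that removes the alias from the old index and adds it to the new one in the same call. A sketch of the request body as a Python dict; the alias and index names are invented for the example:

```python
def alias_swap_actions(alias, old_index, new_index):
    """Body for POST /_aliases that repoints `alias` from old_index to new_index."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical names: the application always queries "products".
swap = alias_swap_actions("products", "products_v1", "products_v2")
```

Because both actions travel in one request, clients querying the alias should not see a moment where it points at nothing.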

~~~
codewright
Hot-swaps of ElasticSearch aliases are how we do it at my company. It's how we
produce a rolling archive.

------
bradbeattie
3rd party integration of ElasticSearch with Django:
<http://haystacksearch.org/>. So I'm not sure why the article says N/A.

~~~
famousactress
Yeah, especially weird since it's the same project that does 3rd party for
Solr, and ElasticSearch is increasingly a favored engine of the author.

------
ecaron
After the 3rd thing that was wrong about Solr, I stopped caring to write
anything more than this comment.

~~~
rgrieselhuber
Given that this appears to be a community resource and not sponsored by either
SOLR or Elastic Search people, I'm sure your specific critiques would be
useful.

~~~
nzadrozny
Looks like a sales/SEO play for a Solr/ElasticSearch consultant. Still seems
pretty helpful as a community resource. I emailed the author to see if he's
interested in setting up a public GitHub repo to take pull requests.

Personally, I'd like to see similar comparisons for other search engines, like
Sphinx and Postgres Full-Text. When I talk to people about search engines, the
first questions they ask me are to compare one against some other.

~~~
codewright
Which is especially egregious, since both fall apart in more serious use
cases.

~~~
nzadrozny
Can you expand on what you mean?

~~~
Zombieball
Not sure about codewright's use cases. However in my own brief experimentation
with SOLR I ran into performance issues with garbage collection.

I set up a cluster of about 15 cc2.8xlarge machines (5 shards with 3 replicas
each) containing 240 GB worth of documents (48 GB per shard). Each node was
given on the order of 40 GB of heap space. While performing load tests with a
relatively small load (~150 QPS), after a few minutes the garbage collector on
the nodes would kick in and run for on the order of 15 to 30 s. This had a
cascading effect of causing ZooKeeper to think nodes were down, start leader
re-election, etc.

Admittedly I am quite inexperienced when it comes to dealing with applications
using such large heap sizes. Though I tried a few different JVM options with
respect to GC I was unsuccessful in resolving the problem.

If any folks here happen to have some good resources regarding GC and large
Solr clusters I would definitely be interested.

~~~
fizx
That huge heap is extremely counterproductive, because large heaps have
terrible GC performance, and you're actually stealing memory from the natively
memory-mapped files that make up your index.

Try it again with sane GC parameters, e.g.:

    
    
        -Xmx<N>G -Xms<N>G -XX:NewSize=<N/2>G -XX:MaxNewSize=<N/2>G -XX:+UseConcMarkSweepGC -XX:+DisableExplicitGC -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -XX:+CMSIncrementalMode
    

Where <N> is a value between 2-8.
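
If it helps, the flag line can be generated from N with a small helper like this. It is only a convenience sketch: the helper name is made up, and it writes the new-generation size in megabytes so that odd values of N still give a whole number.

```python
def jvm_gc_flags(heap_gb):
    """Build the GC flag list above for a heap of `heap_gb` gigabytes (2-8)."""
    if not 2 <= heap_gb <= 8:
        raise ValueError("heap_gb should be between 2 and 8")
    new_mb = heap_gb * 1024 // 2  # new generation fixed at half the heap
    return [
        f"-Xmx{heap_gb}G",
        f"-Xms{heap_gb}G",
        f"-XX:NewSize={new_mb}m",
        f"-XX:MaxNewSize={new_mb}m",
        "-XX:+UseConcMarkSweepGC",
        "-XX:+DisableExplicitGC",
        "-verbose:gc",
        "-XX:+PrintGCTimeStamps",
        "-XX:+PrintGCDetails",
        "-XX:+CMSIncrementalMode",
    ]


if __name__ == "__main__":
    print(" ".join(jvm_gc_flags(4)))
```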

Edit: I was benchmarking a similarly sized (though very differently
configured) Solr cluster for a well-known internet company, and was able to
tune it to do 5000qps, with p50 ~2ms and p99 ~20ms.

~~~
Zombieball
Thanks for the tips. I was considering testing again with more partitions on
smaller machines, perhaps N x m1.xlarge with 8 GB of heap space.

I was starting to think that since the heap space was so big perhaps I should
be worrying about page sizes as well. While I tried various GC settings
(UseConcMarkSweepGC, ConcGCThreads, UseG1GC, etc.), I didn't take a stab at
playing with the size of the new generation. Could you explain the reasoning
behind this? Is the idea that most objects die young so try to increase the
number of short run minor GCs and avoid bigger Major GCs? I am quite
interested.

Edit: Regarding the cluster you were working on. Would you be able to give
general dimensions to the number of nodes & partitions in your cluster +
memory for each? Just trying to get a general guideline to aim for.

~~~
fizx
In general, I fix the newgen size mostly to avoid the optimizer choosing
something braindead in a pathological case. 50/50 is safe, but not optimal.

In general, you should have enough unallocated memory on the box to cover your
working dataset (it'll get used by caches and memmaps). If you can, find a way
to exploit data locality. I shoot for (number of cores * 1-4)-ish partitions
per box depending on workload. Using bigger boxes is usually better, because
you can avoid communication latency and variance that arises from having tons
of boxes.
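
As a back-of-the-envelope check, the rule of thumb can be written down like this. The 60.5 GB of RAM per box is an assumption on my part about the machines in the earlier setup; the 40 GB heap and 48 GB shard come from the post above, and the function names are made up.

```python
def page_cache_headroom_gb(ram_gb, heap_gb, index_gb):
    """Memory left after the JVM heap and a fully cached index (negative = short)."""
    return ram_gb - heap_gb - index_gb


def partitions_per_box(cores):
    """Rough (low, high) partition range: cores * 1 up to cores * 4."""
    return (cores * 1, cores * 4)


# Assumed 60.5 GB box, 40 GB heap, 48 GB shard: the index cannot stay cached.
big_heap = page_cache_headroom_gb(60.5, 40, 48)

# Same box with an 8 GB heap: the whole shard fits in the OS page cache.
small_heap = page_cache_headroom_gb(60.5, 8, 48)
```

With the large heap the box comes up 27.5 GB short of holding the index in cache; with the small heap there are 4.5 GB to spare.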

If you want to know more, you can email me at kyle@onemorecloud.com.

------
Hikari
Both are nice and will do the job without too much pain. I've been running an
ES cluster for about a year now. I appreciate how easy it is to set up, but the
documentation is terrible. The ES docs should be a cross between RethinkDB's
and Redis's; that would make life easier for everybody.

~~~
dguaraglia
Couldn't agree more. I think the problem with ElasticSearch docs is they
assume the user already understands the inner workings of the Lucene search
engine (after all, ElasticSearch is just a nice restful wrapper on top of
that.)

If, as was my case, the most complex search you've ever made before was a
fulltext search on a database field then you'll be lost for a good couple days
until you understand what's going on.

------
tarr11
Surprised ElasticSearch doesn't support dynamic fields. That is one of the
most useful features in Solr.

~~~
sandGorgon
This is incorrect. Dynamic templates are pretty much the same thing.

<http://elasticsearch-users.115913.n3.nabble.com/Apply-dynamic-template-to-all-fields-of-an-object-td4018786.html>
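
Roughly, a dynamic template that behaves like a Solr `*_s` dynamic field could look like the mapping below. The document type `item`, the template name, and the `_s` suffix convention are placeholders, and the small matcher only imitates the wildcard check locally so the sketch runs without a cluster.

```python
from fnmatch import fnmatchcase


def suffix_string_template(doc_type, pattern="*_s"):
    """Mapping that treats any new field matching `pattern` as an unanalyzed string."""
    return {
        doc_type: {
            "dynamic_templates": [
                {
                    "strings_by_suffix": {
                        "match": pattern,
                        "mapping": {"type": "string", "index": "not_analyzed"},
                    }
                }
            ]
        }
    }


def matching_fields(field_names, pattern="*_s"):
    """Local stand-in for the server-side wildcard match on field names."""
    return [name for name in field_names if fnmatchcase(name, pattern)]


template = suffix_string_template("item")
hits = matching_fields(["color_s", "price_i", "brand_s"])
```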

------
hajrice
Having played with both, I personally find <http://www.searchify.com/> _much_
better than both Solr and ElasticSearch. At least based off the search
results.

~~~
diek
It looks like searchify is a hosted search solution, whereas Solr and
ElasticSearch are distributions of search servers that can be deployed on your
own hardware.

~~~
nzadrozny
Technically Searchify is based on the open-sourced code from the previously
proprietary IndexTank. So in theory you can run your own:
<https://github.com/linkedin/indextank-engine>.

I'm skeptical about claims of different quality of relevancy results, since
IndexTank/Searchify is also based on Lucene (looks like 3.0.1 in the canonical
repo), and should share all the same fundamental relevancy and scoring
functionality.

~~~
hajrice
Yep, but hosting it is a pain.

It's actually heavily tweaked (this is what the IndexTank team has told me),
and apparently contains components of Solr

~~~
nzadrozny
There are hosted services for Solr and ElasticSearch, too. Such as fizx's and
my own <http://websolr.com> and <http://bonsai.io>

------
johnnymonster
With elasticsearch, you are not able to change shard count after initial index
creation. You are able to change replicas at any time.
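
In request terms, the shard count goes into the body used to create the index, while the replica count can be sent later to the index's `_settings` endpoint. A sketch of both bodies as Python dicts; the numbers and helper names are arbitrary:

```python
def create_index_body(shards, replicas):
    """Body for creating an index; the shard count is fixed from here on."""
    return {
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        }
    }


def update_replicas_body(replicas):
    """Body for PUT /<index>/_settings; replicas can change on a live index."""
    return {"index": {"number_of_replicas": replicas}}


created = create_index_body(shards=5, replicas=1)
scaled = update_replicas_body(replicas=2)
```

Changing the shard count means creating a new index and reindexing into it.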

~~~
superkt
Fixed. thanks..

------
KaoruAoiShiho
This seems biased in favor of Solr... it tries very hard to keep Solr and ES
balanced but in reality it's not that balanced.

------
adient
Yokozuna is Riak + Solr, not ES.

~~~
superkt
Oops. Typo. Fixed.

