
Optimizing Solr (Or How To 7x Your Search Speed) - raccoonone
http://carsabi.com/car-news/2012/03/23/optimizing-solr-7x-your-search-speed/
======
stevenp
My experience working with Solr is that a lot of the time people don't have a
good working knowledge of how to optimize an index because it's so easy not
to. At my last job, the initial implementation involved storing the
full text of millions of documents, even though they never needed to be
retrieved (just searched). If you're running Solr as a front-end search for
another database, the best way I've seen to optimize performance is just to
make sure you're not storing data unnecessarily.

Maybe everyone should already know this, but I was working on a very smart
team, and we totally missed it initially. Setting "stored" to false for most
fields resulted in a 90% reduction in index size, which means less data to fit
into RAM.

~~~
raccoonone
Yep, totally agree with this. Last month we spent an hour or two going through
the schema and removing any fields that didn't need to be stored, and making
sure that only fields we actually query on had indexed=true. I didn't
test before and after results, but qualitatively it seemed to be faster
afterwards.

------
fizx
Hey, Websolr founder here.

Websolr's indexes return in under 50ms for queries of average complexity.

The more expensive queries usually involve "faceting" or sorting a large
number of results. For example, say you search GitHub for "while." GitHub
used to do language facets, where it would tell you that out of a million
results, 200103 files were in javascript, 500358 files were in C, etc.

The problem with this is that you have to count over a million records, on
every search! Unlike most search operations which are IO bound, the counting
can be CPU-bound, so sharding on one box will let you take advantage of
multiple cores.
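As a rough sketch of what such a facet request looks like (the field name is
hypothetical; `facet` and `facet.field` are standard Solr parameters):

```python
from urllib.parse import urlencode

# Hypothetical facet query: match "while", return only 10 documents,
# but ask Solr to count every matching document per language. The
# counts cover all matches, which is why faceting over millions of
# results is CPU-heavy even when few rows come back.
params = {
    "q": "while",
    "rows": 10,
    "facet": "true",
    "facet.field": "language",
}
query_string = urlencode(params)
```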

raccoonone is "sorting on two dimensions, a geo bounding box, four numeric
range filters, one datetime range filter, and a categorical range filter."
This should put him in a CPU-bound range (in particular because of the sort).

Websolr has customers on sharded plans, but they are usually used in custom
sales cases where we're serving many, many millions of documents. We'll look
at adding sharding as an option to our default plans, so that they'll be more
accessible for people like raccoonone. In the meantime, if you send an email
to info@onemorecloud.com, we'll try to accommodate use cases like this.

Edit: Also, other possible optimizations include (1) indexing in the same
order you will sort on, if you know ahead of time, and (2) using the
TimeLimitingCollector.
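For the second point, Solr exposes the time-limiting collector through the
`timeAllowed` request parameter (in milliseconds); a minimal sketch, with a
hypothetical query:

```python
from urllib.parse import urlencode

# Cap result collection at 500 ms; if the cap is hit, Solr returns
# the hits it has collected so far rather than running to completion.
params = {
    "q": "category:sedan",
    "sort": "price asc",
    "timeAllowed": 500,
}
query_string = urlencode(params)
```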

~~~
raccoonone
Awesome, lemme know when you guys have sharding available! Would love to not
have to worry about running our own index again.

~~~
fizx
It's available now, just not as part of a standard plan. I'll send you an
email, just to make sure we're getting you the best setup for your needs.

------
matan_a
There are quite a few other performance related points to think about for Solr
speed for queries and indexing.

Here are some that come to mind right now that are very useful:

\- Be smart about your commit strategy if you're indexing a lot of documents
(commitWithin is great). Use batches too.

\- Many times, I've seen Solr index documents faster than the database could
create them (considering joins, denormalizing, etc). Cache these somewhere so
you don't have to recreate the ones that haven't changed.

\- Set up and use the Solr caches properly. Think about what you want to warm
and when. Take advantage of the Filter Queries and their cache! It will
improve performance quite a bit.

\- Don't store what you don't need for search. I personally only use Solr to
return IDs of the data. I can usually pull that up easily in batch from the DB
/ KV store. Beats having to reindex data that was just for show anyway...

\- Solr (Lucene really) is memory greedy and picky about the GC type. Make
sure that you're sorted out in that respect and you'll enjoy good stability
and consistent speed.

\- Shards are useful for large datasets, but test first. Some query features
aren't available in a sharded environment (YMMV).

\- Solr is improving quickly and v4 should include some nice cloud
functionality (zookeeper ftw).
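A few of the points above can be sketched in one place (the document fields,
URL path, and the 10-second window are all illustrative; `commitWithin`, `fq`,
and `fl` are standard Solr parameters):

```python
import json
from urllib.parse import urlencode

# Indexing side: send documents in batches, and let commitWithin fold
# the commit in within 10 s instead of forcing one commit per batch.
update_path = "/solr/update/json?" + urlencode({"commitWithin": 10000})
batch = json.dumps([
    {"id": "42", "title": "example doc"},
    {"id": "43", "title": "another doc"},
])

# Query side: reusable constraints go in fq so they hit the filter
# cache independently of q, and fl=id returns only IDs, leaving the
# display data to be fetched from the primary DB / KV store.
query = urlencode({
    "q": "title:example",
    "fq": "type:product",
    "fl": "id",
})
```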

~~~
mthreat
These are good points. Solr/Lucene tuning is an art, so much so that some
search consulting companies charge tens (or hundreds) of thousands of dollars
for these services. That's the value proposition of Searchify's hosted search
- if you just want search, you shouldn't have to worry about shards, commit
strategies, batching, GC, etc. You just want to add your documents, search
them, and get great, fast results, without having to become a Lucene expert in
the process.

If this sounds interesting, check us out at <http://www.searchify.com> \- We
offer true real-time, fast hosted search, without requiring you to learn the
innards of Solr or Lucene.

~~~
matan_a
Good stuff. I see you're on Heroku as well which is always a win.

Now if someone could put SenseiDB on the cloud, I'd pay for it...

------
gpapilion
I'm curious what your queries look like, because these performance numbers are
awful.

I'm currently running an index that is 96 million documents (393 GB) using a
single shard with a response time of 18ms.

If you're comfortable with it, I'd suggest profiling Solr. We found that we
were spending more time garbage collecting than expected, and spent some time
speeding it up and minimizing its impact. Most of this was related to our IO
though.

Second, don't use the default settings. Adjust the cache sizes, ramBufferSizeMB,
and other settings so they are appropriate for your application.
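For instance, in solrconfig.xml (the numbers below are placeholders, not
recommendations; tune them against your own hit rates and indexing volume):

```xml
<indexDefaults>
  <ramBufferSizeMB>128</ramBufferSizeMB>
</indexDefaults>

<query>
  <filterCache class="solr.FastLRUCache" size="4096" initialSize="4096"
               autowarmCount="256"/>
  <queryResultCache class="solr.LRUCache" size="1024" initialSize="1024"
                    autowarmCount="128"/>
  <documentCache class="solr.LRUCache" size="1024" initialSize="1024"/>
</query>
```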

I'd also start instrumenting your web application so that you can start
testing removal of query options that may be creating your CPU usage issue.
You get a lot of bang for your buck this way, and you may find the options you
were using provide no meaningful improvement in search. A metric like mean
reciprocal rank can go a long way toward confirming that quality holds up.

~~~
astubbs
Re GC - were you using standard or parallel GC?

------
falcolas
Our company had to set up a Solr implementation with some pretty crazy
requirements (hundreds of shards, tens of thousands of requests per second,
etc), and we ended up with 4 machines - one for indexing, one as a backup
indexer/searcher, and 2 just doing load balanced searches. Replication was
interesting but easy to set up (since it's basically an rsync of the indexes
between servers).

The end result works very well, though it's a real memory hog when you get
into the "hundreds" of shards on an individual server.

~~~
raccoonone
What was the reason for having hundreds of shards on each server? Were you
still seeing performance benefits to sharding it that aggressively?

~~~
falcolas
Multi-tenant data separation required by contract.

------
markelliot
One thing I think would be valuable to know here is how many threads each
shard is using, and what effect changing that number would have.

(rather: why is it useful to explicitly shard vs running one big instance with
all of the memory and the same total number of threads? queuing theory would
lead me to believe the latter would be better)

------
ABS
take a look at this presentation if you are interested in NRT Solr (although
it was done before Solr added the latest NRT features):

Tuning Solr in Near Real Time Search environments:
<https://vimeo.com/17402451>

------
snikolic
Thoughtful sharding is not an optimization, it's a _requirement_ at scale.

------
sudoman69
Does anyone have experience with adding shards on the fly? We have a
requirement where we get millions of docs every day and we need to have an
environment that can handle real-time as well as previous days' data... any
thoughts on this will be appreciated...

~~~
simonw
The shards that are used in a Solr query are specified at runtime (you pass a
list of shard URLs as part of the search query string) so adding new shards on
the fly should Just Work.
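A minimal sketch of that (the hostnames are hypothetical; `shards` is the
standard distributed-search parameter):

```python
from urllib.parse import urlencode

# The shard list is just another request parameter, so a shard added
# on the fly is picked up by simply including it in the next request;
# nothing on the searching node needs to be reconfigured.
shards = ",".join([
    "solr1.example.com:8983/solr",
    "solr2.example.com:8983/solr",  # the newly added shard
])
params = urlencode({"q": "type:log", "shards": shards})
```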

------
zargath
Can anybody recommend a good way to get started with Solr?

~~~
simonw
If you're using Django, Haystack makes it trivially easy to set up a Solr
index against your existing Django models.

<http://haystacksearch.org/>

~~~
riffraff
in a similar vein, if you're using Ruby/Rails, Sunspot (sunspot_rails) is awesome.
<http://sunspot.github.com/>

------
mthreat
I'd love to have them try out Searchify's hosted search and see how fast it
is. The key to fast search is RAM, which is why we run our search indexes from
RAM (not cheap), and most queries are served within 100ms. If you're the
author of the blog post, please contact me, chris at searchify, if you'd like
to do this comparison, and I'll set you up with a test acct.

~~~
espeed
Hi Chris, what is Searchify's relationship with the old IndexTank
(<http://indextank.com>) team?

~~~
mthreat
There's no formal relationship, although I have met several of the IndexTank
guys and they're a cool group. And Searchify is based on the IndexTank
open-source project.

------
phene
I improved solr performance by switching to elasticsearch. =)

~~~
grogenaut
Same here. Just make sure to set your initial shards correctly so that you can
grow for a while. Once you have to increase shards, you have to re-index :(.
But that's fine. Just go with 4x or more the shards you need now and you'll be
able to scale out to that many boxes.

------
chenli
This is the founder of Bimaple. We provide hosted search and license our
engine software with significantly better performance and capabilities than
Lucene: (1) Supporting "Google-Instant" search experiences on your data; (2)
Powerful error correction by doing fuzzy search; (3) Optimized for mobile
users by doing instant fuzzy search with a speed 10x-100x higher than Lucene;
(4) Optimized for geo-location apps; (5) Designed and developed from the
ground up using C++. We have demonstrations on our homepage. If interested in
using our service or software, please email contact AT bimaple.com.

~~~
nkurz
Hi Chen,

It seems like an interesting product. Advertising here on HN is always tricky
--- it's a balancing act between self-promotion and restraint. I voted your
comment up in the hope that you'll stick around and tell us more about it. But
perhaps more on the tech side, and less on the marketing.

Have you published any papers describing your approach? Or white papers with
more meaty details? I work on the Apache Lucy project, and am very interested
in things that work better than Lucene.

~~~
chenli
Hi nkurz,

Thanks for voting my comment up. Based on your request, here is some more
information. Let me know if you want to move this discussion to a
private email discussion.

We have two published white papers comparing our engine with Lucene/Solr:

(1) A comparison white paper in the case of traditional keyword search:
<http://www.bimaple.com/files/Bimaple-Keyword-Search-Comparison-With-Solr.pdf>
. We have a demonstration site to support instant, fuzzy search on Stack
Overflow data (600K question titles as of January 2011):
<http://demo.bimaple.com/stackoverflow>

(2) A comparison white paper in the case of geo-location keyword search:
<http://www.bimaple.com/files/Bimaple-Map-Search-White-Paper-201108.pdf> . We
have a demonstration site to support instant, fuzzy search on 17 million US
business listings: <http://www.omniplaces.com> . It has an iPhone app at:
<http://itunes.apple.com/us/app/omniplaces/id466162583?mt=8>

If you have questions, please feel free to contact me at chenli AT bimaple DOT
com. Thank you.

