
How we get high availability with Elasticsearch and Ruby on Rails - konklone
https://18f.gsa.gov/2016/04/08/how-we-get-high-availability-with-elasticsearch-and-ruby-on-rails/
======
shockzzz
I have no idea why the phrase "high availability" is in this post.

~~~
mburns
>We call these extensions "high availability" because this approach means that
re-indexing a production system can happen much faster, reducing downtime for
our users.

Agree with their use of the term or not, they give you their reasoning at the
end of the article.

~~~
shockzzz
That's crazy misleading. This is just a post saying, "hey, this is a way to
sync data faster." Awesome! Much kudos.

But stale data isn't "downtime." This is tech marketing at, like, MongoDB
level.

~~~
ma2rten
Except it's not marketing from the vendor in question. This is a page by the
US government.

------
serguzest
"27 reports per second" what?

I use the bulk API with the .NET NEST client.

I can easily put 1,000 documents per second (each of which also includes 3-10
nested documents) on a 4-core i5 machine. Serialization is the cheapest
operation in my case. I would blame Ruby in your workflow.

~~~
xentronium
It seriously depends on the structure of the documents and your analyzer
setup. I agree that it should be in the hundreds of records per second,
though.

------
acehyzer
Elasticsearch is awesome. It may be a good idea to use the bulk API that is
built into Elasticsearch, use some joins in your SQL query, and index more
than just one record at a time. In my implementation, I batched my query to
50,000 records at a time, which were then indexed into Elasticsearch. For the
2.7 million records I indexed this week, it took a total of 54 queries to the
database (50,000 records returned at a time). Just one more idea to streamline
your indexing without slamming your DB quite so hard.
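A minimal sketch of that batching approach in Ruby (the article's stack). The index name, document shape, and `each_slice` batching are illustrative assumptions, not the commenter's actual code; each payload would then be POSTed to the `_bulk` endpoint.

```ruby
require "json"

# Build the newline-delimited JSON body that Elasticsearch's _bulk
# endpoint expects: one action line, then the document source, per doc.
def bulk_body(index, docs)
  docs.flat_map do |doc|
    [{ index: { _index: index, _id: doc[:id] } }.to_json, doc.to_json]
  end.join("\n") + "\n"
end

# Slice a large result set into batches (standing in for a SQL query
# paged 50,000 rows at a time) and emit one bulk payload per batch.
def bulk_payloads(records, batch_size: 50_000)
  records.each_slice(batch_size).map { |batch| bulk_body("reports", batch) }
end
```

At 50,000 records per batch, the 2.7 million records above work out to the 54 round trips mentioned.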

~~~
brndn
I started using Elasticsearch recently and I was wondering: does the indexing
happen in real time during the index request? How do you know how long the
indexing process takes?

~~~
chrisatumd
There's an index.refresh_interval setting. It defaults to 1s, so by default
your data will be available for querying within one second after being
indexed.
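For reference, that setting can be changed per index via the settings API; a common trick during bulk loads is to disable refresh entirely and restore it afterwards (the index name here is hypothetical):

```
PUT /reports/_settings
{ "index": { "refresh_interval": "-1" } }

# ... run the bulk load ...

PUT /reports/_settings
{ "index": { "refresh_interval": "1s" } }
```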

~~~
nemo44x
In general, yes. But keep in mind that the Elasticsearch JVM's GC could fire
up right after the document is indexed and possibly run for a few seconds if
there is a lot of memory pressure. When the GC is done, Elasticsearch will
continue to process queries, but it may be that the indexed data hasn't been
refreshed yet. So a query run "1 second" after the index operation may not in
fact return the document. However, this would be a very rare case.

------
sqlcook
I've indexed ~1 million docs a second, but with proper routing could probably
even 5x that. Total cluster size was 50 terabytes at the end.

~~~
true_religion
How many machines did you have on the cluster?

~~~
sqlcook
100 data nodes.

Basically, if you want fast ingress, keep shards small; once they get past
~5-10 GB, ingress slows down significantly. Also, this was on ES 1.5; I have
not tested the latest 2.0+ builds.

~~~
sandGorgon
I assume you are also replicating your nodes... how does replication impact
ingress? What happens when shards exceed 10 GB? Do you split them?

~~~
sqlcook
If you want the fastest ingress, disable replicas until your ingress is done;
it's faster to create replicas at the end of ETL for that given index. You
also want to disable auto allocation, which stops shard movement during
ingress; re-enable it afterwards.
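A sketch of the two settings changes being described, using the index settings and cluster settings APIs (the index name is hypothetical; `cluster.routing.allocation.enable` is the allocation switch in question):

```
# Before the load: drop replicas and freeze shard allocation
PUT /myindex/_settings
{ "index": { "number_of_replicas": 0 } }

PUT /_cluster/settings
{ "transient": { "cluster.routing.allocation.enable": "none" } }

# After the load: re-enable allocation and add replicas back
PUT /_cluster/settings
{ "transient": { "cluster.routing.allocation.enable": "all" } }

PUT /myindex/_settings
{ "index": { "number_of_replicas": 1 } }
```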

On a 100-node cluster I had roughly 500 GB on each node. This was not a single
index but multiple indexes, with roughly 8 shards per index per node. Shard
count is pretty important to get correct.

I did not manually control document routing (it was hard given the type of
data I was ingressing), so it was set to auto, and during the load I observed
hotspots in the cluster (you have to look at the BULK thread/queue length):
some nodes were getting bursts of docs while others were idle. Roughly 40-50%
of the nodes in the cluster were underutilized, and maybe 5-10% had hot spots
from time to time.

Also, depending on what you use to push data in (I used the ES Hadoop plugin),
you have to account for shard segment merges, which literally pause ingress
for a brief moment while segments in a given shard are merged. You have to set
the retry count to -1 (infinite) and the retry delay to something like a
second or two, otherwise you will end up with dropped documents.
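In the ES-Hadoop connector, the retry knobs being described are the `es.batch.write.retry.*` settings; a sketch of the relevant job configuration (the values are the ones suggested in the comment, not the connector's defaults):

```
# elasticsearch-hadoop connector settings
es.batch.write.retry.count = -1    # retry rejected bulk batches indefinitely
es.batch.write.retry.wait  = 2s    # wait between retries
```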

~~~
sandGorgon
This is brilliant! If you have your ES and Hadoop config posted somewhere,
that would be awesome.

------
IndianAstronaut
Is this sort of parallelism doable with Solr as well?

~~~
brightball
I don't see why it wouldn't be. The main differentiator between Solr and
Elasticsearch is that ES handles constant incoming data more consistently, so
it's a much better fit for realtime scenarios.

Just batch-loading the data one time shouldn't create much of a problem for
Solr either.

