
Deploying Elasticsearch on a 150 node cluster to index 10B documents - burtonator
https://www.spinn3r.com/blog/elasticsearch-at-scale.html
======
packetized
"Periodically we do full index merges of multiple indexes into weekly indexes.
Then weekly into monthly. etc. ... We do this with an internal tool we've
developed named (for lack of anything creative) "index rewriter" similar to
spark, hadoop, or map reduce just for doing parallel/concurrent scans of
Elasticsearch and then writing the data to a new index."

ES 2.3 brings the reindex API [0], which is an absolute godsend. Also, that is
a remarkable amount of computronium brought to bear on what I would charitably
describe as a moderately-sized dataset. Is the 500ms query time an absolute
hard requirement?
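
In case it's useful to anyone, the new API is just a POST of a source and dest index. A rough sketch with the Python requests library against a local node; the index names are made up:

    import requests

    # Copy every document from one index into another; index names here are
    # hypothetical, adjust to whatever daily/weekly scheme you use.
    body = {
        "source": {"index": "content-2016.05.01"},
        "dest": {"index": "content-2016-w18"},
    }
    resp = requests.post("http://localhost:9200/_reindex", json=body)
    resp.raise_for_status()
    print(resp.json())  # includes "took", "total", "created", etc.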

And I may be admitting my ignorance here, but there's also this statement:

"However, Elasticsearch doesn't have a way to efficiently tell us how many
documents were returned in a given response."

Are you looking for something above & beyond the "hits" value returned in a
query response? Or am I missing something? ex:

    
    
    {
      "responses": [
        {
          "took": 16,
          "timed_out": false,
          "_shards": {
            "total": 3,
            "successful": 3,
            "failed": 0
          },
          "hits": {
            "total": 38,
            "max_score": null,
            "hits": [
              { ...

[0]: https://www.elastic.co/guide/en/elasticsearch/reference/2.3/docs-reindex.html

~~~
craigching
So, just a clarification, the "hits.total" is the total number of documents
that were found. The only way to know the number of documents in the current
set ("hits.hits") is to parse the hits and count. But even that shouldn't be
too expensive unless I'm missing something here.
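
In other words, something like this (a Python sketch; a trimmed-down literal stands in for the real response body):

    import json

    # raw_body stands in for the HTTP response text; a cut-down literal
    # is used here so the snippet stands alone.
    raw_body = '{"hits": {"total": 38, "max_score": null, "hits": [{"_id": "1"}, {"_id": "2"}]}}'

    resp = json.loads(raw_body)
    total_matches = resp["hits"]["total"]    # everything that matched the query
    page_size = len(resp["hits"]["hits"])    # what actually came back in this response
    print(total_matches, page_size)          # -> 38 2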

~~~
jonaf
Not even that. The number of hits returned has a default, which is 10. If you
want more, you set the size parameter to indicate how many results to return.
There is a maximum, 10,000 by default I think.

If you do want to return all results, you can use the scroll API.
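
Roughly like this (a sketch with the Python requests library against a local node; the index name, size, and query are placeholders):

    import requests

    base = "http://localhost:9200"

    # Open a scroll context; "1m" is how long ES keeps it alive between calls.
    resp = requests.post(base + "/content/_search?scroll=1m",
                         json={"size": 1000, "query": {"match_all": {}}}).json()
    hits = resp["hits"]["hits"]
    while hits:
        for hit in hits:
            pass  # process each document here
        # Pull the next page using the scroll id from the previous response.
        resp = requests.post(base + "/_search/scroll",
                             json={"scroll": "1m", "scroll_id": resp["_scroll_id"]}).json()
        hits = resp["hits"]["hits"]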

There's no guessing with the results in Elasticsearch if you read the
(surprisingly accurate) documentation.

Edit (moved from reply to myself): I guess I should clarify before someone
corrects me: you could get fewer results than the size parameter. But in that
case the total hits will be less than your size parameter, and the number of
returned documents is simply hits.total.

~~~
craigching
No, "hits.total" is the total number of documents that matched, regardless of
what you specified as the number of hits to return.

~~~
jonaf
Right, but if you set the size to 10, then you return either 10 or hits.total
documents, whichever is smaller. If hits.total is greater than 10, you know
you returned 10 documents; otherwise, you returned hits.total documents.

~~~
craigching
Ok, yeah, I see what you're saying and that makes sense. But, even my response
is premature because you generally have to parse the whole response to get
"hits.total" anyway. So unless you have written a partial parser (does an open
source one exist?) you are going to be parsing the whole document to get even
"hits.total".

~~~
jonaf
Based on the bug filed[1], "Return number of documents in the result counts as
HTTP header or in JSON," the implication is that including the number of
documents in the result in the JSON would be an acceptable solution. If that's
true, then they're OK with parsing the JSON.

Although, to be fair, the related issue[2] specifically asks for an HTTP
header.

JSON parsing is generally pretty fast, but I can understand it being
unnecessary overhead (although in general, there are other more significant
opportunities for optimization).

If you've ever implemented a JSON stream reader/writer, then you've done
partial JSON parsing. Check out the tool `jq`[3], which will parse partial
JSON surprisingly well and fast (and it's just a command-line utility).

[1] https://github.com/elastic/elasticsearch/issues/18312

[2] https://github.com/elastic/elasticsearch/issues/16993

[3] https://stedolan.github.io/jq/

------
jonaf
I'm always excited to read about Elasticsearch being used for something
besides logs! We have built quite the powerful realtime streaming big data
platform where I work (Bazaarvoice), and we use Elasticsearch heavily.

I was recently at an Elasticsearch meetup hosted by HomeAway and was pleased to
see that many of the patterns that emerged at Bazaarvoice were generally
repeated in HomeAway's deployment. I definitely encourage anyone considering
Elasticsearch to explore its utility for more than just logging. So many times
I read about or hear folks using Elasticsearch to index "billions of
documents!!" only to find that only a few million documents are actually in an
open index.

Does anyone know of any heavy users of Elasticsearch for non-logging workloads
in AWS? So far it seems like most of these folks are running out of a colo or
their own data center. I've been running Elasticsearch in AWS for a few years
now and simply can't imagine dealing with the inconvenience of managing the
hardware down to the specific number of nodes. (If you asked me, I'd say I
don't know how many nodes -- lots!)

~~~
lateguy
I agree with you. I work for the education company Accredible. We use
Elasticsearch hosted on AWS for an employment directory built from student
data. It's been working perfectly for us for the last year; we're very happy
with it.

------
landryraccoon
> The only other way to determine the amount of documents returned is to JSON
> parse the entire result which is not free and would add latency to each
> request.

Why do they have to bill synchronously (in realtime)? Why not return the
response, then calculate the fee asynchronously? The customer is probably
being billed at discrete intervals (e.g., weekly or monthly) anyway.

~~~
burtonator
It's a good idea (and thanks for the feedback) but we would still have to burn
the CPU (which isn't free) to parse the response.

Much easier if it's just an integer header field.

~~~
craigching
What's your request size and how big are the documents you're needing to
parse? Even if you're returning 1000 documents, if you're not fetching the
source as part of the request (do you need to?) or if you don't have a ton of
fields, it shouldn't be too expensive to parse that to get the total number of
hits.
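
For reference, turning the source off is just a flag on the search body; hits then come back as metadata only. A sketch with the Python requests library, index name and query made up:

    import requests

    body = {
        "size": 1000,
        "_source": False,  # don't return the stored document bodies
        "query": {"match": {"title": "elasticsearch"}},
    }
    resp = requests.post("http://localhost:9200/content/_search", json=body).json()
    print(resp["hits"]["total"], len(resp["hits"]["hits"]))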

~~~
jonaf
I'm assuming he needs the source, otherwise he should look at the count API.
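
For reference, _count runs the query but returns only the match count. A sketch with the requests library, hypothetical index and query:

    import requests

    # _count runs the query but only returns how many documents matched.
    body = {"query": {"match": {"title": "elasticsearch"}}}
    resp = requests.post("http://localhost:9200/content/_count", json=body).json()
    print(resp["count"])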

~~~
craigching
The count API just returns counts. I'm assuming he needs some fields to
display which should be smaller than source if source is "large."
Unfortunately we are left to guess at his statements because not enough detail
is provided.

------
arosenbaum
The "CPU usage across our Elasticsearch cluster" chart shows a very large
spread of utilization at almost all times. This is read-only since you are re-
writing offline, right? What do you think is driving this variation?

Disclosure: I am VP, Product Strategy at MarkLogic...

