
Elasticsearch 2.0.0-rc1 released - vruiz
https://www.elastic.co/blog/elasticsearch-2-0-0-rc1-released
======
lobster_johnson
I'm glad they are cleaning up a lot and not worrying about making breaking
changes. Elasticsearch has always been full of (to me) weird and obscure
behaviour.

2.0 is mostly cleanup, and doesn't really bring any big new features. The
change that brings most utility is the merging of "filtering" and "querying"
into a single query DSL. This maps better to how developers think about
search, and reduces the mental overhead of having to decide if, say, a boolean
operation should use a filter or query.
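
To make the merged DSL concrete, here is a minimal sketch of a 2.0-style request body in which scoring and non-scoring clauses sit side by side in one `bool` query. The field names `title` and `status` and the helper `filtered_search` are made up for illustration; only the `bool` / `must` / `filter` layout comes from the query DSL.

```python
import json


def filtered_search(text, status):
    """Build a search body with one scoring clause and one non-scoring clause.

    `title` and `status` are hypothetical field names.
    """
    return {
        "query": {
            "bool": {
                # `must` clauses contribute to the relevance score.
                "must": [{"match": {"title": text}}],
                # `filter` clauses only include/exclude documents.
                "filter": [{"term": {"status": status}}],
            }
        }
    }


body = filtered_search("release notes", "published")
print(json.dumps(body, indent=2))
```

The point is that the decision between "filter" and "query" becomes a choice of clause inside a single query rather than a choice between two separate DSLs.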

~~~
yehosef
umm.. 2.0 is much more than cleanup. The pipeline aggregations are huge (it's
already really great for analytics - pipeline aggregations bring it to an
entirely new level.) Also, doc_values by default will help many people store
much bigger data with less memory without thinking about it (you can do it now
yourself but you have to know that you should). It also has better stored
compression options that come with the move to Lucene 5. This is a major
release.

~~~
lobster_johnson
I didn't say it wasn't a major release.

As I understand it, pipeline aggregations aren't strictly necessary -- you
could do it client-side, so it's more of a convenience and optimization. Doc
values are another optimization. This release is full of optimizations,
cleanups and various low-visibility stuff of this kind. Few actual new
features.
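
As a rough illustration of the "do it client-side" point, the sketch below computes a derivative over histogram buckets after the response has come back, which is roughly what a derivative-style pipeline aggregation would do on the server. The bucket shape (a list of dicts with `doc_count`) mirrors a typical aggregation response, and the helper name `bucket_derivative` is hypothetical.

```python
def bucket_derivative(buckets, key="doc_count"):
    """Return the difference between each bucket's value and the previous one.

    The first bucket has no predecessor, so its derivative is None.
    """
    result = []
    previous = None
    for bucket in buckets:
        value = bucket[key]
        result.append(None if previous is None else value - previous)
        previous = value
    return result


buckets = [{"doc_count": 3}, {"doc_count": 5}, {"doc_count": 4}]
print(bucket_derivative(buckets))
```

The trade-off is that every bucket has to travel over the wire before the client can do this, which is where the server-side version becomes an optimization.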

~~~
yehosef
You don't actually use elasticsearch, do you?

~~~
lobster_johnson
We do use ElasticSearch.

------
LunaSea
And still no pagination for TermsAggregation and TopHitsAggregation ...

~~~
yehosef
Clinton explains in
[https://github.com/elastic/elasticsearch/issues/9112](https://github.com/elastic/elasticsearch/issues/9112)
why they don't support this. If you really need it, you can implement it
yourself by limiting the aggregation using filters
[https://www.elastic.co/guide/en/elasticsearch/reference/curr...](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#_filtering_values)
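
One way to read "limiting the aggregation using filters" is to slice the term space with the terms aggregation's `include` pattern and request one slice at a time. This is only a sketch under assumptions: the helper `terms_slice_body` is hypothetical, the prefixes are assumed to be plain characters that need no regex escaping (hex digits of a hash, for example), and the slices will not be evenly sized.

```python
def terms_slice_body(field, prefix, size):
    """Build a terms aggregation restricted to terms starting with `prefix`.

    Assumes `prefix` contains no regex metacharacters.
    """
    return {
        "size": 0,  # only the aggregation is wanted, not the hits
        "aggs": {
            "slice": {
                "terms": {
                    "field": field,
                    "include": prefix + ".*",
                    "size": size,
                }
            }
        },
    }


# One request body per leading hex digit of a hypothetical `hash` field.
bodies = [terms_slice_body("hash", digit, 100) for digit in "0123456789abcdef"]
print(len(bodies))
```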

It's obviously not real pagination but the GitHub issue explains why it isn't
practical. If you know of other systems that have pagination of aggregations,
perhaps you can reference those sources in the GitHub issue and they can
learn some tricks for how to do it.

As for top hits, the "from" option doesn't let you do what you want?

~~~
LunaSea
I read the GitHub issue but I don't think it's practical or even possible in
some cases to keep all the previous pages in memory, especially if you have a
web service.

My use case is that I have 20M+ documents with non-unique hashes and
each query should return an arbitrary amount of documents matching the query /
filter as well as calculated meta-data based on the results in the aggregation
field.

Now the issue is that if you want to have only one document per hash, you
need to use a TermsAggregation on the hash field followed by a
TopHitsAggregation of size 1 to obtain the actual document rather than the
hash field.

At this point you have many many buckets containing a single document but:

- you can't paginate them since TermsAggregation doesn't let you do pagination
for the reasons you explained and linked to above

- you can't calculate an aggregation on all the returned documents since all
of them are in their own separate bucket (by hash)
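
For readers following along, the terms + top_hits combination described above looks roughly like the request body below. It is a sketch, not the exact query in use: the field name `hash`, the aggregation names `by_hash` and `sample`, and the helper `one_doc_per_hash_body` are assumptions.

```python
def one_doc_per_hash_body(query, max_buckets):
    """Build a body that buckets by `hash` and keeps one hit per bucket."""
    return {
        "size": 0,  # the documents come back inside the buckets instead
        "query": query,
        "aggs": {
            "by_hash": {
                "terms": {"field": "hash", "size": max_buckets},
                "aggs": {
                    # One representative document per hash bucket.
                    "sample": {"top_hits": {"size": 1}}
                },
            }
        },
    }


body = one_doc_per_hash_body({"match_all": {}}, 50)
print(sorted(body["aggs"]["by_hash"]))
```

The terms aggregation takes a `size` but no offset, which is where the pagination problem in the first bullet comes from.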

~~~
yehosef
If I understand your case, you'd have to do it in multiple steps - you would
get the results and paginate them, and record the hashes in a different store
(e.g. Redis) - then run the aggregations on a paginated set of hashes. I
understand it's a pain, but I think it's just a hard/messy problem. Like I
said, if you can find other people that are solving this problem (could be
there are..) then you can reference them to the ES people, I'm sure they'd
like to find solutions.

It's also possible you could restructure your data to make it easier to
extract the way you want.. I'm not sure if that's possible in your case, but
many times in the NoSQL world, modeling your data in the best way for how you
want to extract it is key for success.

~~~
LunaSea
Like I said, I have 20M+ records so potentially tens of thousands of results
per query.

This would mean that every time a user triggers this API call we have to
insert those tens of thousands of documents into Redis and run an operation at
the end. Even without taking into account the JSON serialisation /
deserialisation costs, this would take forever where we need something that
runs in around one or two seconds at the maximum.

The problem is actually pretty simple and can be summed up by this sentence:
"I want to group documents by field X, take one document per group of X and
return documents from index OFFSET to index OFFSET + LIMIT".
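
Spelled out in plain Python over an in-memory list, that sentence amounts to the sketch below. It only pins down the intended semantics; the helper name `group_first_page` is made up, and "first" here means first in input order.

```python
def group_first_page(docs, field, offset, limit):
    """Keep the first document seen for each value of `field`, then page."""
    first_by_group = {}
    for doc in docs:
        # setdefault leaves an existing entry alone, so the first doc wins.
        first_by_group.setdefault(doc[field], doc)
    # Dicts preserve insertion order, so groups appear in first-seen order.
    return list(first_by_group.values())[offset:offset + limit]


docs = [
    {"id": 1, "hash": "a"},
    {"id": 2, "hash": "a"},
    {"id": 3, "hash": "b"},
    {"id": 4, "hash": "c"},
]
print(group_first_page(docs, "hash", offset=1, limit=2))
```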

In MongoDB for example this would be quite easy with a $group + $first
operation in the aggregation framework. Sadly MongoDB lacks the nice full-text
search features that ElasticSearch has. It's highly possible though that going
forward we'll have to hack a full-text-like search on MongoDB and switch from
ElasticSearch to MongoDB since stuff like this doesn't seem to be possible.
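
A minimal sketch of the $group + $first idea as an aggregation pipeline, written as the Python list one would hand to a MongoDB driver. The helper `group_first_pipeline` and the choice to sort on the group key are assumptions; without a preceding `$sort` stage, which document counts as "first" in each group is not guaranteed.

```python
def group_first_pipeline(field, offset, limit):
    """Build a pipeline: one document per `field` value, then skip/limit."""
    return [
        # $$ROOT refers to the whole input document.
        {"$group": {"_id": "$" + field, "doc": {"$first": "$$ROOT"}}},
        # A stable order is needed for skip/limit to page consistently.
        {"$sort": {"_id": 1}},
        {"$skip": offset},
        {"$limit": limit},
    ]


pipeline = group_first_pipeline("hash", offset=20, limit=10)
print(len(pipeline))
```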

~~~
lobster_johnson
Sounds like something Postgres would be very good at, using window functions
or (probably recursive) CTEs; its query language is much richer than both
ElasticSearch and MongoDB.
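
To show what the window-function approach could look like, here is a small sketch. It runs against an in-memory SQLite database from Python's standard library purely so it is self-contained (SQLite 3.25 or newer is assumed for window functions); the same `ROW_NUMBER() OVER (PARTITION BY ...)` query shape should carry over to Postgres with its own parameter placeholders. The table `docs` and its columns are hypothetical.

```python
import sqlite3

# Rank rows within each hash, keep rank 1, then page over the survivors.
DEDUP_PAGE_SQL = """
    SELECT id, hash, body FROM (
        SELECT id, hash, body,
               ROW_NUMBER() OVER (PARTITION BY hash ORDER BY id) AS rn
        FROM docs
    ) AS ranked
    WHERE rn = 1
    ORDER BY hash
    LIMIT ? OFFSET ?
"""


def dedup_page(conn, offset, limit):
    """Return one row per hash, from OFFSET to OFFSET + LIMIT."""
    return conn.execute(DEDUP_PAGE_SQL, (limit, offset)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, hash TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO docs (id, hash, body) VALUES (?, ?, ?)",
    [(1, "a", "first a"), (2, "a", "second a"), (3, "b", "only b"),
     (4, "c", "first c"), (5, "c", "second c")],
)
page = dedup_page(conn, offset=1, limit=2)
print(page)
```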

Postgres' text indexing isn't as advanced as ElasticSearch's (and indexing
performance is definitely lower), but it's not bad at all. If you only need
basic term support and not actual term vectors, you can use plain Postgres
arrays, which have all sorts of supported operators, and can be translated
into tables and back to perform some really neat in-memory queries.

One downside, of course, is that you don't get any sharding for free, and if
your dataset is large enough you might have to manually partition your tables,
either using Postgres partitions or by explicitly sharding (e.g., using
pg_shard).

~~~
yehosef
Good suggestion. 20M records is not very big so Postgres might be a great
choice.

------
meesterdude
I am using probably 5% of what ES is capable of, and now I feel even further
behind. But, it's amazing to see some of these things get polished and
improved on an already powerful system.

They were even kind enough to make an upgrade plugin:
[https://github.com/elastic/elasticsearch-migration](https://github.com/elastic/elasticsearch-migration)

~~~
LunaSea
I just wish that they would focus on adding basic features rather than adding
the more complex ones first.

~~~
meesterdude
what basic features are you hankering for?

~~~
LunaSea
[https://news.ycombinator.com/item?id=10372623](https://news.ycombinator.com/item?id=10372623)

[https://news.ycombinator.com/item?id=10373842](https://news.ycombinator.com/item?id=10373842)

------
Pyppe
Could someone confirm if doc_value support for analyzed strings is part of the
2.0 release?

I find no mention of it in the release notes, so I'm guessing it's not...
:\

------
nitin88g
Any link for a complete list of new features/changes from the current version
to 2.0?

~~~
kevinastone
[https://www.elastic.co/guide/en/elasticsearch/reference/2.0/...](https://www.elastic.co/guide/en/elasticsearch/reference/2.0/breaking-changes-2.0.html)

~~~
gegtik
did you read the contents of that link before posting?

