Lots of folks will be mad, but removing multiple mapping types is a nice change too. It was a feature that never really made sense. Index-per-type was always the better strategy, even going back to the 0.23 days.
As others in this thread will no doubt point out though - the ES folks are moving awfully fast. I still support 1.7.5 clusters that will take heroic efforts to update. I'd love to use the new features on those clusters, but there simply isn't a valid business case to take on the risk. This isn't like small NPM packages that you can update on a whim - some of these systems require terabytes of re-indexing to upgrade :/
The direction Elasticsearch is taking does look promising, though. With sorted indices, a single mapping type, and the other changes, we might give it another try, even after having switched to Algolia.
Is there a safe way now to query Elasticsearch directly without the need to go via proxy scripts on the server? This just adds so much overhead to the queries compared to Algolia.
I cannot give a timeframe, but it is one of the top features on the ES roadmap. :)
Source: I am an engineer working on this feature.
This works pretty well already if you are running on your own hardware and have a good network. We've been running a three data center setup across the US for four years. Next year we may extend it across the Atlantic.
But querying from a publicly facing front end would be a poor idea - would you expose a database directly to the front end?
It is a fantastic idea to query the index directly from the frontend, and it could be made safe with a read-only index or an API key with read-only scope.
The current design, with an unnecessary security layer outside of Elasticsearch, is a poor one: it adds too much administrative overhead and far too much latency.
More charitably, I can understand why they felt the need to make a hard break, but difficult upgrades plus fast release cycle means a fair bit of friction.
The no downtime upgrades will be nice, but on a big production system, I wouldn't feel comfortable upgrading major versions every 6 months.
From the discussion on the GitHub issues, it feels like this was something that had caused some Elastic developers pain, and they decided to kill it.
There were two fundamental problems: 1) types had different mappings which was confusing since internally it's the same index and there is only one mapping. 2) for the use case where you have one type per index, you still had to arbitrarily create a "type".
It could have been done by:
1) making the mapping definition only at the index level - there's no such thing as a mapping for a type (this is how it works internally anyway.)
2) making the "type" field optional and specified via a query string instead of the URL path. This would have left all the internals alone. E.g., there could still have been an internal meta "_type" field with a default value of "default" or something. People who needed multiple types, specifically for more complex parent-child, could still have used them.
The current approach is far more complicated, because they have to change the internals to support both the new and old way during the transition, and deal with a lot of internal things breaking because everything expects a "_type". You can check out the GitHub issues and see the work involved.
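To make the alternative concrete, here is a sketch of what the commenter is proposing — not the actual Elasticsearch API, just an illustration where the mapping lives only at the index level and "type" is an ordinary keyword field that a client filters on (field names are made up):

```python
import json

# Hypothetical index-level mapping: "type" is just another field,
# replacing the old meta "_type" that lived in the URL path.
index_mapping = {
    "mappings": {
        "properties": {
            "type":  {"type": "keyword"},
            "title": {"type": "text"},
            "body":  {"type": "text"},
        }
    }
}

def typed_query(user_query, doc_type="default"):
    """Wrap a user query with a filter on the document-level type field."""
    return {
        "query": {
            "bool": {
                "must": [user_query],
                "filter": [{"term": {"type": doc_type}}],
            }
        }
    }

body = typed_query({"match": {"title": "elasticsearch"}}, doc_type="comment")
print(json.dumps(body, indent=2))
```

Under this scheme, omitting the `doc_type` argument falls back to a "default" type, which is roughly the behaviour the commenter suggests for single-type indices.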
Are there any good alternatives to ES? Has Solr moved forward?
Ultimately, we settled on continuing to have different "types" in Kibana, but we treated them as data concerns rather than architectural concerns of Elasticsearch. At a high level, this meant that we added a new "type" field to documents to track the type itself, and then we prefixed fields with the type as well to preserve the ability to use the same userland id on different "types" in the index and such. The type/id prefixing thing doesn't get exposed beyond the layer that queries Elasticsearch for the kibana index.
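The type/id prefixing scheme described above can be sketched roughly like this — a toy illustration of the idea, not Kibana's actual code (the separator and field layout here are assumptions):

```python
# Sketch: documents carry a "type" field, userland ids and field names are
# prefixed with the type, and none of this is exposed beyond the layer
# that talks to the index.
def to_index_doc(doc_type, user_id, attrs):
    """Build the stored id and document from a userland id and attributes."""
    doc = {"type": doc_type}
    doc.update({f"{doc_type}.{k}": v for k, v in attrs.items()})
    return f"{doc_type}:{user_id}", doc

def from_index_doc(doc_id, doc):
    """Recover the userland view from a stored id and document."""
    doc_type, _, user_id = doc_id.partition(":")
    prefix = doc_type + "."
    attrs = {k[len(prefix):]: v for k, v in doc.items() if k.startswith(prefix)}
    return doc_type, user_id, attrs

# The same userland id can now coexist under two different "types":
vis_id, vis = to_index_doc("visualization", "1", {"title": "CPU"})
dash_id, dash = to_index_doc("dashboard", "1", {"title": "Ops"})
assert vis_id != dash_id
```

Because the prefixes also namespace the field names, two "types" can map the same field name differently without colliding in the single index-level mapping.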
Once that change was ready in the code, we also had to consider the need for users to migrate their existing multiple-type index to this new format. The upgrade assistant in x-pack basic handles all of this automatically, but folks can certainly repurpose the same reindexing operation we perform on .kibana for their own indices.
The underlying steps and reindex operation for this are outlined on the docs: https://www.elastic.co/guide/en/kibana/current/migrating-6.0...
The actual data transformation happens in step 3.
I hope this helps!
If you just wanted to be able to store different types in an index - you still can, just that the type will not be stored in a meta "_type" field but in a document-level "type" (or whatever) field. Of course, the API does change since it's not in the URL, and if you're creating custom doc ids you'll probably have to include the type in them (comment-123, post-123). So it's annoying, but I think mostly something that can be worked around.
If you're using multiple types for parent-child, the situation is more bleak. They are still going to have a "join field" but there can only be one type of relation. While often this is ok, there are definitely reasonable use cases where it's not.
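For reference, the 6.x join-field mapping being discussed looks like this (shown as a Python dict for readability; the field name `my_join_field` and the `question`/`answer` relation are illustrative):

```python
import json

# Sketch of a 6.x join-field mapping: parent-child is expressed as a
# "join" typed field with a declared parent -> child relation, instead
# of separate mapping types.
mapping = {
    "mappings": {
        "doc": {
            "properties": {
                "my_join_field": {
                    "type": "join",
                    "relations": {"question": "answer"}  # parent -> child
                }
            }
        }
    }
}
print(json.dumps(mapping, indent=2))
```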
Currently the traditional parent-child hasn't been removed from the core because they need to support 5.x indices. But it will be phased out unless there is a big uproar.
Not to rub salt in the wound, but building a product around a feature provided by a single vendor (ie, not a standard feature or something developed in-house) means that you've just committed to maintaining that software or paying someone else good money to do so.
Anybody who's built products around eg Oracle has learned this at least twice!
But apparently migrating is still simpler than making a BigCorp pay for the ES security package.
> Elasticsearch 6.x
>> Indices created in 5.x will continue to function in 6.x as they did in 5.x.
ROTFL. And I opened the comments wondering what breaks this time.
Every time I hear about a new release, ElasticSearch becomes a worse and
worse option for storing logs.
Indeed, it wasn't designed for log storage, though it happened to match this
use scenario well (now less so with every release).
> There are far better ways to handle log analysis, particularly when your primary query is counting things that happen over T instead of finding log entries that match a query (which it always is)
Oh? This is the first time I hear that my use case (storing logs from syslog
for diagnostics at a later time) counts things over time. Good to know. I may
ask you later for more insight about my environment.
> streaming analysis is a much better fit than indexing, just lesser known.
Well, I do this, too, and not only from logs. I still need arbitrary term
search for logs.
The overhead per log record, building multiple indexes at log-line rate: there are just so many reasons not to do your use case in ES that I don't even think about it. I think it's a poorer fit than reporting, to be honest.
And you keep talking about how much you know and how ELK is literally worse than grep for searching off fields in logs for troubleshooting, but offer no alternative setups or use cases. You're hand-waving.
I've seen some of the performance issues of ELK at scale, and I'd be interested in what's out there, because it's not my expertise. But you are just yelling "dataflow" and "streaming analytics".
You shouldn't have used a universal quantifier so authoritatively. There are
plenty of sysadmins who use ES for this case; you apparently just happened to
only be exposed to using it with websites.
Then, what do ES+Kibana give me over grep? Search over a specific field (my logs
are parsed to a proper data structure), which includes type of event
(obviously, different types for different daemons), a query language, and
a frontend with histograms.
Mind you, troubleshooting around a specific event is but one of the things
sysadmins do with logs. There are also other uses, all landing in the realm of
reporting.
As an SRE, I’ve built high volume log processing at every employer in multiple verticals, including web. I know what sysadmins do. Not a fan of the condescension and assumptions you’re making. I have an opinion. We differ. That’s fine. Let it be fine.
You must be from the species that can predict each and every report before
it's needed. Good for you.
Also, I didn't claim that I don't use reports known in advance; I do use them.
But there are cases when preparing such a report for just seeing one trend is
an overkill, and there's still troubleshooting that is helped by the query
language. Your defined-in-advance reports don't help with that.
> I spend what time I can trying to show those very same sysadmins you’re talking about why ES is a poor architecture for log work, particularly at scale.
OK. What works "particularly at scale", then?
Also, do you realize that "particularly at scale" is quite a rare setting, and
"a dozen or less of gigabytes a day" scale is much, much more common, and ES
works (worked) reasonably well for that?
A dozen or less gigabytes a day means: use grep. This is just like throwing Hadoop at that log volume.
This was an opportunity to learn from someone with a different perspective, and I could learn something from yours, but instead, you’ve made me regret even saying anything. I’m sorry, I just can’t engage with you further.
(Edit: I’m genuinely mystified that discussing alternative architectures is somehow arrogant “pissing on” people. Why personalize this so much?)
I may need to tone down my sarcasm, but likewise, you need to tone down your
arrogance about working at Google or compatible.
But still, thank you for the search keyword ("dremel"). I certainly will read
the paper (though I don't expect too many very specific ideas from
a publication ten pages long), since I dislike the current landscape of only
having ES, flat files, and paid solutions for storing logs at a rate of a few
GB a day.
> A dozen or less gigabytes a day means: use grep. This is just like throwing Hadoop at that log volume.
No, not quite. I do also use grep and awk (and App::RecordStream) with that.
I still want to have a query language for working with this data, especially
if it is combined with easily usable histogram plotter.
(I didn't downvote you btw)
I’ve extracted insight from millions of log records per second on a single node with a similar setup, in real time, with much room to scale. The key to scaling log analysis is to get past textual parsing, which means using something structured, which usually negates the reason you were using ElasticSearch in the first place.
Google’s first Dataflow talk from I/O and the paper should give you an idea of what log analysis can look like when you get past the full text indexing mindset. Note that there’s nothing wrong with ELK, but you will run into scaling difficulty far sooner than you’d expect trying to index every log event as it comes. It’s also bad to discover this when you get slashdotted, and your ES cluster whimpers trying to keep up. One thing streaming often gets you is latency in that situation instead of death, since you’re probably building atop a distributed log (and falling behind is less bad than falling over).
The key here is: are you looking for specific log events or are you counting? You’re almost always counting. Why index individual events, then?
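The counting-not-indexing point can be made concrete with a minimal sketch — a rolling per-window counter instead of an index of individual events (field names and window size are assumptions; a real pipeline would sit atop Kafka, Dataflow, or similar):

```python
from collections import Counter

# Sketch of streaming log counting: bucket events into fixed time windows
# and keep only counts, never indexing individual log lines.
class WindowedCounter:
    def __init__(self, window_seconds=60):
        self.window = window_seconds
        self.buckets = {}  # window start timestamp -> Counter of event keys

    def add(self, ts, key):
        start = ts - (ts % self.window)
        self.buckets.setdefault(start, Counter())[key] += 1

    def count(self, ts, key):
        return self.buckets.get(ts - (ts % self.window), Counter())[key]

wc = WindowedCounter(60)
for ts, status in [(0, "500"), (30, "500"), (61, "200")]:
    wc.add(ts, status)
```

State here is O(windows x distinct keys) rather than O(events), which is why falling behind under load costs latency instead of tipping the system over.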
As for scaling, it scales very well (you can read through the Elastic blog/use cases to see plenty of examples). That's not to say there aren't levels of scale it won't handle. But I would venture to say that for 99% of the people out there, it will solve their problems very well.
> The images are available in three different configurations or "flavors". The basic flavor, which is the default, ships with X-Pack Basic features pre-installed and automatically activated with a free licence. The platinum flavor features all X-Pack functionality under a 30-day trial licence. The oss flavor does not include X-Pack, and contains only open-source Elasticsearch.
Even if it didn't have the full power of the Elastic JSON Queries, for simple SELECT COUNT() ..GROUP BY, it would have been a nice addition...oh well, back to counting open and closed brackets...
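For illustration, here is the bracket-counting the commenter is lamenting — the aggregation body a plain `SELECT status, COUNT(*) ... GROUP BY status` would replace (the field name "status" is made up):

```python
import json

# The Elasticsearch terms aggregation equivalent of a simple
# SELECT status, COUNT(*) FROM logs GROUP BY status
agg_body = {
    "size": 0,  # buckets only, no individual hits
    "aggs": {
        "by_status": {
            "terms": {"field": "status"}
        }
    }
}
print(json.dumps(agg_body, indent=2))
```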
The data is mainly used for dashboards (which is Highcharts) so the aggregation functions map to something called a “series”, which is what you’d expect if you’ve ever used Highcharts. Anyway I think it’s quite cool how they did it.
A SQL interface would help a lot, even better if it came with a JDBC driver
P.S. Writing the level of SQL for ES that you describe isn't very difficult - in my project we got a working implementation in 2-3 weeks. Take Calcite (recommended but complex, imho) or Facebook's Presto SQL parser (not recommended, but simpler).
I'd be interested to hear about what worked and what didn't. It's also important to try the 1.2 version. I had played with 1.1 and there were problems (not failures - just inefficient ES queries.)
Groovy is a fantastic language. It really is a hidden gem. It is my language of choice for cross-platform work, especially in enterprise.
Our scripting language "Painless" is faster and more secure than we could achieve with Groovy, so in Elasticsearch 5.0 we made Painless the default and deprecated Groovy.
In 6.0, Groovy is gone.
We didn't do it to be minimalist, but we couldn't in good conscience continue to ship an insecure scripting language when we had an alternative.
Disclosure: I work at Elastic on security.
Disclaimer: I'm an Elasticsearch dev employed by Elastic.
In the case of SQL I can start an in-memory SQLite and run my tests (Symfony PHP).
Test data can be loaded from fixtures or captured/snapshotted using `docker commit` to create specific test images.
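The in-memory SQLite trick mentioned above looks roughly like this — a Python sketch of the same idea rather than the commenter's Symfony/PHP setup (table and fixture contents are made up):

```python
import sqlite3

# Each test gets a throwaway in-memory database loaded from fixtures,
# so tests run fast and leave no state behind.
def make_test_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE logs (level TEXT, message TEXT)")
    conn.executemany("INSERT INTO logs VALUES (?, ?)",
                     [("error", "disk full"), ("info", "started")])
    return conn

conn = make_test_db()
rows = conn.execute(
    "SELECT level, COUNT(*) FROM logs GROUP BY level ORDER BY level"
).fetchall()
```

There is no equivalently lightweight stand-in for Elasticsearch, which is part of why its tests tend to need Docker images or a shared cluster.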