Isn't your list entirely dependent on the use case?
I have been using ES for years, with over 1 million daily users.
It provides simple search functionality.
It runs on a single node with 4GB of memory, and has done so for more than 5 years with hardly any issues.
If the dataset fits on one node - good. You might still be missing out on reliability (should that node fail or just plain restart).
Simple catalog and/or web page search usually fits in RAM, so the disk:RAM advice becomes less relevant, and the type of drive would be mostly irrelevant too.
A shard can handle concurrent writes from clients: there are multiple write threads, scaling with the number of logical processors exposed to Elasticsearch, and we have invested in making Lucene accommodate this concurrency [0][1][2].
I'm kind of surprised this article doesn't mention anything about how many nodes you want in your cluster, since ES performance starts to degrade once you get past 40 or so nodes.
Certainly a problem with hundreds of nodes. One issue is that GC pauses become more problematic in big clusters, as the chance of hitting a stop-the-world pause on at least one node for any given request gets higher.
Going beyond a few dozen nodes takes a fair amount of tuning and troubleshooting.
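A rough illustration of that effect (a toy model: independent nodes and a made-up per-node pause probability):

    # chance that at least one of N nodes is mid stop-the-world pause
    # when a fan-out request touches all of them
    p_pause = 0.001   # assume each node spends 0.1% of its time paused
    for n in (10, 40, 100, 300):
        p_hit = 1 - (1 - p_pause) ** n
        print(f"{n:>3} nodes: {p_hit:.1%}")   # 1.0%, 3.9%, 9.5%, 25.9%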
No. ELK is a slow and expensive way to store and retrieve logs. The reason people use it is that nothing else exists. (I was blown away when I started using it at my last job. I used the fluentd Kubernetes daemonset to extract logs from k8s and into ES... and the "cat a file and send it over the network" thing uses 300+MB of RAM per node. There is an alternate daemon that can be used now, but wow. 300MB to tail some files and parse JSON.)
I think a better strategy is to store logs in flat files with several replicas. Do your metric generation in real time, regexing a bunch of logs on a bunch of workers as they come in. (I handled > 250MB/s on less than one Google production machine, though I did eventually shard it up for better schedulability and disaster resilience. Also, those 10Gb NICs start to feel slow when a bunch of log sources come back after a power outage!)
For simple lookups like "show me all the logs in the last 5 minutes", you can maintain an index of timestamp -> log file in a standard database, and do additional filtering on whatever is retrieving the files. You can also probably afford to index other things, like x-request-id and maybe a trigram index of messages, and actually be able to debug full request cycles in a handful of milliseconds when necessary. For complicated queries, you can just mapreduce it. After using ES, you will also be impressed at how fast grepping a flat file for your search term is.
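A sketch of that timestamp -> log file index (schema and paths are hypothetical; SQLite standing in for the "standard database"):

    import sqlite3, subprocess

    db = sqlite3.connect("logfiles.db")
    db.execute("""CREATE TABLE IF NOT EXISTS log_files
                  (path TEXT PRIMARY KEY, start_ts INTEGER, end_ts INTEGER)""")

    def files_for_window(start_ts, end_ts):
        # any file whose time range overlaps the query window
        rows = db.execute(
            "SELECT path FROM log_files WHERE end_ts >= ? AND start_ts <= ?",
            (start_ts, end_ts))
        return [path for (path,) in rows]

    def search(start_ts, end_ts, pattern):
        # grep does the fine-grained filtering over the few candidate files
        for path in files_for_window(start_ts, end_ts):
            subprocess.run(["grep", "-H", pattern, path])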
The problem is, the machinery to do this easily doesn't exist. Everything is designed for average performance at large scale, instead of great performance at medium scale. Someday I plan to fix this, but I just don't see a business case, so it's low priority. Would you fund someone who can make your log queries faster? Nope. "Next time we won't have a production issue that the logs will help debug." And so, there's nothing good.
I set up Loki on my Kubernetes cluster last night, as I've been meaning to try it and this was a good excuse.
Basically, it appears that you can do very minimal tagging of log data at ingestion time (done via promtail, requiring a restart for every config change, just like fluentd), but not any after-the-fact searching. I didn't play with it at all because I've been down that road before; parsing logs is hard, and you need an interactive editor over the full history to get it right. (Some fun examples... I use zap for logging, which emits JSON structured logs. But... I also interact with the Kubernetes API, which has its own logger. When something bad happens, instead of returning an error, it just prints unstructured logs to the log file. So you'll have to handle that if you're writing a parser. nginx does this too; you can configure it for JSON, but sometimes it prints non-JSON lines. What?)
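For what it's worth, the only fix I know of is a parser that treats JSON as the happy path and wraps everything else (a sketch; the fallback field names are made up):

    import json

    def parse_line(raw: str) -> dict:
        """Parse a structured (JSON) log line, falling back to a wrapper
        record for the unstructured lines other libraries sneak in."""
        try:
            record = json.loads(raw)
            if isinstance(record, dict):
                return record
        except json.JSONDecodeError:
            pass
        # non-JSON line (client-go, nginx, ...): keep it, but mark it
        return {"msg": raw.rstrip("\n"), "unstructured": True}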
Loki is good for getting your logs off the rotated-every-10MB "kubectl logs" pipeline... but it doesn't really help with after-the-fact debugging unless you really want to read all the logs, in which case you're back to grep.
I am getting more and more motivated to do something. At the very least, Loki's log storage itself seems pretty okay; put logs in, get logs back, so it saves me from having to write that part at least.
Loki and similar solutions are leveraging object storage (not just Ceph) as a way to store chunks of logs relatively cheaply, and scale performance. Loki can work on a single system, storing logs on the local filesystem, but that will eventually become an availability or performance bottleneck. Putting logs in object storage allows multiple systems to store/query/etc.
Flat files and grep work up to a point. For huge datasets it can be _really_ hard to answer some questions using pipes, grep, awk, etc. that something like a structured query makes pretty simple.
We tried using CloudWatch Logs Insights at work and I was blown away by how fast it was on indexed data (we saw 10+ GB/s searches across a few hours of logs). The only problem is that it was prohibitively expensive, so we ended up not going with it.
The biggest thing for me is that I don't want to own a log searching service/software. My customers don't see any benefit from me investing my time in a better log searching platform than what is already available out there, IMO. I want to let someone who is an expert on querying logs solve that problem so that I can solve my specific problem.
Yeah, being able to do structured queries quickly is the key. I don't think Elasticsearch is actually that great at that; it really feels designed for fulltext searches and suffers in terms of speed when dealing with the semi-structured nature of logs. (Overall, the cardinality of log messages is not as high as you'd think, but ES doesn't know this.)
I also agree that operating your own log search is kind of a pain. Often the node sizes are much larger than what you're using for your actual application. I wrote a bunch of Go apps and they use 30MB of RAM each; then you read an article about making Elasticsearch work and find that you suddenly need 5 nodes with 64G of RAM each... when you can fit a week of log data in the RAM of just one of those nodes. It's hard to get excited about paying for it when it's your own node, and it's even harder when someone else offers it as a service (because that RAM ain't free).
That is why I like some sort of mapreduce setup; you can allocate a small amount of RAM and a large amount of disk on each of your nodes, and these queries can use excess capacity that you have lying around. When your data gets big enough to need 320GB of RAM... you can still use the same code. You just buy more RAM.
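A minimal sketch of the map/reduce shape I mean, run over whatever log files a worker has on local disk (paths and the counting job are made up):

    import re
    from collections import Counter
    from concurrent.futures import ProcessPoolExecutor
    from pathlib import Path

    PATTERN = re.compile(r"status=5\d\d")   # e.g. count 5xx responses per file

    def map_file(path: Path) -> Counter:
        # stream the file, so each worker needs only a small, fixed amount of RAM
        counts = Counter()
        with path.open() as f:
            for line in f:
                if PATTERN.search(line):
                    counts[path.name] += 1
        return counts

    def reduce_counts(parts) -> Counter:
        total = Counter()
        for part in parts:
            total.update(part)
        return total

    if __name__ == "__main__":
        files = sorted(Path("/var/log/app").glob("*.log"))
        with ProcessPoolExecutor() as pool:
            print(reduce_counts(pool.map(map_file, files)))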
Basically, the design of Elasticsearch confuses me. It's designed to be a globally-consistent replicated datastore with Lucene indexing. How either of those helps with logs confuses me. You write your logfile to 3 nodes... if one of them blows up, now you only have two copies. A batch job can get the replication back at its leisure, if you really care. But in all honesty, you were just going to delete the data in a week anyway. (If you're retaining logs for some sort of compliance reason, then you do probably want real consistency. But you can trim the logs down to the data you need in real time, and write them to a real database in an indexable/searchable format.)
It is indeed good reading, but as you say, old. Since then, we have invested tremendously in the [data replication][0] and [cluster coordination][1] subsystems, to the point that we have closed the issues that Kyle had opened. We have fixed all known issues related to divergence and lost updates of documents, and now have [formal models of our core algorithms][2]. The remaining issue along these lines is that of [dirty reads][3], which we [document][4] and have plans to address (sorry, no timeline, it's a matter of prioritization). Please do check out our [resiliency status page][5] if you're interested in following these topics more closely.
Thanks for all of the feedback in this entire thread.
I'm always happier when I see a follow up analysis by the Jepsen team, as a third party verification that the issues have been fixed and no major new ones introduced. Any chance Elastic is going to contract out to them for a follow up?
(Setting aside the fact that Logstash is a JRuby/Java app that easily eats said gigabytes of heap)
Do JSON logs take gigabytes?
If they do for you, then yes, gigabytes of disk and memory are pretty much guaranteed. Also, things tend to pile up with time (even on a few weeks' horizon).
In all honesty, I believe a finely crafted native code solution for this problem could achieve 3x lower RAM usage and 2-3x the indexing/search performance. Going beyond that is also possible but would take remarkable engineering.
In the near future I can totally see someone producing a great open-source distributed search project that is at least on par with today's ES core feature set.
ES is good, but it takes a small army to keep on top of the performance and management. And then there are the upgrades, which will fix a few bugs and introduce new ones too.
> I believe a finely crafted native code solution for this problem could achieve 3x lower RAM usage and 2-3x the indexing/search performance
Many people believe that about everything written in Java. The story often ends up a lot more complicated, though. In the domains where it shines, Java is surprisingly hard to beat without very significant effort. And then you must also contemplate what that same significant effort would achieve if directed at the Java solution, or at cherry-picking particular performance hotspots out of it.
I'm basing it on my limited experience of rewriting things from optimized but messy C++/D to the JVM. Yes, the end result is much simpler, but it is memory hungry and ~2x slower (after optimizations and tuning). Sometimes you can fit Java in less than ~2x of the original footprint, but at the cost of burning CPU on frequent GC cycles.
Not every application is the same, but the moment it involves manipulating a lot of data in Java, you fight the platform to get back the control you require for those things. And at the end of the day there is a limit at which reasonable people stop and just give up on dodging memory allocations, creatively reusing objects, and wrangling off-heap memory without the help of the type system.
One project worth keeping an eye on is Loki (https://github.com/grafana/loki), which eschews full text search for more basic indexing off of "labels", i.e. it works a lot like Prometheus.
After working with a client for multiple years continually hitting bottlenecks and complexity with the EFK stack, I'm really looking forward to something different.
What kind of performance improvements have you achieved through GC tuning? I'm surprised it's significant enough to start there. Might be something I should look more closely at.
If the heap is under 32GB, the JVM can use "compressed oops": 4-byte "pointers" that address 8-byte-aligned offsets inside the 32GB heap, instead of full 8-byte pointers. This makes everything more compact. Since Java doesn't have structs, everything is a pointer and arrays of things are arrays of pointers. So a 48GB heap won't fit a lot more data than a 32GB heap.
But the short of it: Java uses 32-bit numbers for object references if it can address the whole heap with them. Given the default 8-byte alignment of objects, we have 2^32 * 8 bytes = 32GB of addressable heap.
Once we go beyond that, 64-bit references are used and suddenly we are wasting a lot of heap space (Java is very much objects everywhere). Usually up to around 40% of the heap is references.
See my other reply on big iron vs small instances (with 31g of heap counting as "small").
In short, there are many variables at play, but without any context I'd generally recommend trying many small instances first. Depending on your hardware, scale and application, things may easily change in favor of big iron.
Well, I was thinking more of running more than one ES instance on the same machine, listening on different ports, but your points in the other reply still apply.
I actually did that once - splitting one NUMA machine into 2 ES instances each isolated to its own socket (numactl etc).
Sadly, at the time there was a big perf problem with indexing documents with 20kb+ binary blobs (a non-indexed field), so the change to 2 nodes vs one had little noticeable effect on the throughput benchmarks I did with our data back then (~2015, ES 1.4). In the end I dropped this split and focused on the other, bigger problems.
Would love to revisit this exercise with more free time and better tools.
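Roughly what the split looked like, for reference (ports and paths are illustrative, and this is the modern flag syntax rather than what ES 1.4 used back then):

    # pin one ES instance per socket/NUMA node
    numactl --cpunodebind=0 --membind=0 ./bin/elasticsearch -Epath.data=/data/es0 -Ehttp.port=9200 &
    numactl --cpunodebind=1 --membind=1 ./bin/elasticsearch -Epath.data=/data/es1 -Ehttp.port=9201 &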
Currently we are running ES on big iron in production.
There are both advantages and disadvantages.
On plus side:
- easier to manage several big machines than a shitload of small instances, potentially on top of a virtualization solution (debugging the overlay network etc. at night is not fun)
- nodes are more resilient to big requests (both big bulk indexing and resource-heavy searches). Your risk of hitting sudden OOM is practically zero (although ES does have circuit-breakers that try to prevent processing requests that would cause OOM)
- while not having compressed oops is sad, not having multiple copies of JIT and compiled code cache etc. is a plus
- using a modern concurrent GC is more sensible on bigger heaps. G1 is actually fine on 64g. Some of us are running Shenandoah on Java 13 in production; I'm looking forward to applying it on my clusters (or getting back to testing ZGC). A sketch of the relevant jvm.options lines follows after this list.
Disadvantages:
- NUMA. Try to pick a more recent Java, as it's tuned to run better on NUMA machines; still, not all of the JVM is NUMA-aware
- the fault domain is larger, and getting a hot node up to speed (in sync) after a restart can easily take about an hour. Getting it up from empty storage is going to take a bloody while
- elastic.co folks seem to have recommended settings for lots of small nodes, not big ones, so you are on your own to discover the proper limits for every setting
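As mentioned under the GC point above, a sketch of the jvm.options lines involved (heap size is illustrative; Shenandoah and ZGC still sit behind the experimental flag on JDK 13):

    # fixed, large heap on big iron
    -Xms64g
    -Xmx64g
    # G1 is fine at this size
    -XX:+UseG1GC
    # or, on JDK 13, one of the newer concurrent collectors:
    # -XX:+UnlockExperimentalVMOptions -XX:+UseShenandoahGC
    # -XX:+UnlockExperimentalVMOptions -XX:+UseZGC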
Doing some quick googling on Elasticsearch and ZGC, I just found this: https://github.com/elastic/elasticsearch/issues/44321 where the official response from Elastic is that only G1GC and CMS are tested and supported.
I still think it will be worth trying Shenandoah or ZGC if/when we ever get around to doing some experiments with larger heaps.
Last time I ran with ZGC, the main problem was that the ES OOM circuit breakers went nuts, since they are not prepared for memory being mmapped 3 times. Their calculations end up off by a factor of 3, blocking most requests for no good reason.
Running with the OOM circuit breakers disabled is something I'm currently considering.
Most production systems are barely on JDK 8. G1GC is the most used GC for high-performance production systems at many large companies (1B+ USD revenue) I have worked for.
Thanks; trying things out is not an option in so many cases. It is up to vendors to decide which JVM to recommend, and I cannot overrule them. For personal use, I am on 11.
The article isn't that good, because it mostly just repeats (some of) the information in the official documentation verbatim but sadly mixes it with a lot of things that are simply incorrect or misunderstood. It also omits a lot of stuff that is actually important.
The hierarchy breakdown in the article is misleading. Lucene indexes fields, not documents. Understanding this is key. More fields == more files. Segments are per field not per ES index. A lucene index is not the same as an Elasticsearch index.
Segments are not immutable but an append-only file structure. Lucene creates a new segment every time you create a new writer instance or when the Lucene index is committed, which happens every second in ES by default and is something you could configure to a higher value. So, new segment files are created frequently, but not on a per-document basis. ES/Lucene does indeed constantly merge segment files as an optimization. Force merge is not something you should need to do often, and certainly not while the index is being written to heavily. A good practice with log files is to do this after you roll over your indices. With modern setups, you should be reading up on index lifecycle management (ILM) to manage this for you.
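For example, the two knobs mentioned above, driven from Python against a local cluster (index name and values are made up; ILM can take care of the rollover and force merge for you):

    import requests

    ES = "http://localhost:9200"

    # refresh less often while bulk indexing, so fewer tiny segments get created
    requests.put(f"{ES}/logs-000001/_settings",
                 json={"index": {"refresh_interval": "30s"}})

    # once the index has rolled over and is no longer written to,
    # merge it down to a single segment
    requests.post(f"{ES}/logs-000001/_forcemerge",
                  params={"max_num_segments": 1})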
The notion that ES crashes at ~140 shards is complete bullshit based on a misunderstanding of the above. It depends on what's in those shards (i.e. how many fields). Each field has its own set of files for storing the inverted index, field data, etc. So, how many shards your cluster can handle depends on how your data is structured and how many of them you have. This also means you need lots of file handles.
Understanding how heap memory is used in ES is key, and this article does not mention the notion of memory-mapped files; it even goes as far as suggesting the filesystem cache is not important!! This too is very misguided. The reality is that most index files are memory-mapped (i.e. not stored in the heap) and they only fit in memory if you have enough filesystem cache available. Heap memory is used for other things (e.g. the query cache, write buffers, small data structures with metadata about fields, etc.), and there are a lot of settings to control that, which you might want to familiarize yourself with if you are experiencing throughput issues. Per-index heap overhead is actually comparatively modest. I've had clusters with 1000+ shards with far less memory. This is not a problem if those shards are smallish.
The 32GB memory limit is indeed real if you use compressed pointers (which you need to configure on the JVM, ES does this by default). Far more important is that garbage collection performance tends to suffer with larger heaps because the GC has more work to do. The default heap settings with ES are not great for large heaps. ES recommends having at least half your RAM available for caching; more is better. Having more disk than file cache means your files don't fit into RAM. That can be OK for write-heavy setups and might be OK for some querying (depending on which fields you actually query on), but generally, having everything fit into memory results in more predictable performance.
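To make that concrete, a hedged sketch for a hypothetical node with 64GB of RAM (numbers are illustrative, not a recommendation):

    # jvm.options: keep the heap at or below half of RAM, and under the compressed-oops cutoff
    -Xms26g
    -Xmx26g
    # the remaining ~38GB is left to the OS page cache for the memory-mapped index files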
GC tuning is a bit of a black art, and unless you have mastered it, don't even think about messing with the GC settings in ES. It's one of those things where copy-pasting some settings somebody else came up with can have all sorts of negative consequences. Most clusters that lose data do so because GC pauses cause nodes to drop out of the cluster. Misconfiguring this makes that more likely.
CPU is important because ES uses CPU and thread pools for a lot of things and is very good at e.g. writing to multiple segments concurrently. Most of these thread pools are sized based on the number of available CPUs and can be controlled via settings that have sane defaults. Also, depending on how you set up your mappings (another thing this article does not talk about), your write performance can be CPU intensive. E.g. geospatial fields involve a bit of number crunching, and some of the more advanced text analysis can also suck up CPU.
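A quick way to see how those pools are sized and whether they are queueing or rejecting work (nothing special, just the _cat API against a local node):

    import requests

    # per-node view of the write and search pools: active threads, queue depth, rejections
    r = requests.get("http://localhost:9200/_cat/thread_pool/write,search",
                     params={"v": "true"})
    print(r.text)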
> The 32GB memory limit is indeed real if you use compressed pointers (which you need to configure on the JVM, ES does this by default).
But the article makes it sound like you should stay below 32G of heap to avoid using oop compression, and this is completely wrong. Setting -Xmx over 32G on a 64-bit architecture disables oop compression by default, and setting it lower enables it. The official ES docs also recommend you keep your heap below 32G, but they say you should do this to ensure oop compression is on: https://www.elastic.co/guide/en/elasticsearch/reference/curr....
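If in doubt, the JVM will tell you directly: with -Xmx31g it reports UseCompressedOops as true, and with -Xmx33g as false (plain java, nothing ES-specific):

    java -Xmx31g -XX:+PrintFlagsFinal -version | grep UseCompressedOops
    java -Xmx33g -XX:+PrintFlagsFinal -version | grep UseCompressedOops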
>The hierarchy breakdown in the article is misleading. Lucene indexes fields, not documents. Understanding this is key. More fields == more files. Segments are per field not per ES index. A lucene index is not the same as an Elasticsearch index.
Unless I'm mistaken, segments are not per field.
A Lucene index might be broken into segments depending on your merge policy, but the segment file structure allows each segment to act as an effectively separate Lucene index that you can search, and it contains data for all of the index's fields.
And I too find it unfortunate that the term "index" is reused by ES to mean a collection of Lucene indexes. Once someone wants to dive deeper, it only creates ambiguity.
- the article doesn't clarify whether it's on hardware or VMs
- 140 shards per node is certainly on the low side; one can easily scale to 500+ per node (if most shards are small, typically a power-law distribution)
- more RAM is better, and there is a disk:RAM ratio you need to keep in mind (30-40 for hot data, 200-300 for warm data); see the worked example after this list
- heaps beyond 32g can be beneficial, but you'd have to go for 64g+; 32-48g is a dead zone
- not a single line about GC tuning (I find the default CMS to be quite horrible even at the recommended ~31g sizes)
- CPUs are often a bottleneck when using SSD drives
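To make the disk:RAM point concrete, a rough back-of-the-envelope (assuming the ratio means bytes on disk per byte of RAM, for a hypothetical node with 64GB of RAM):

    ram_gb = 64
    hot_ratio, warm_ratio = 35, 250   # middle of the 30-40 and 200-300 ranges above

    print(f"hot data per node:  ~{ram_gb * hot_ratio / 1024:.1f} TB")   # ~2.2 TB
    print(f"warm data per node: ~{ram_gb * warm_ratio / 1024:.1f} TB")  # ~15.6 TB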