I don't think Elasticsearch is a good logging system (sinkingpoint.com)
224 points by vimda on Sept 28, 2021 | 111 comments



I believe the complaints here are a case of 'not using it correctly'.

The 'Reverse index' (Lucene's inverted index) is a fundamental data structure used to enable very fast search. Other data structures, like KD trees, are used for non-text data types. If you're not doing full text search, don't use `text` fields. If you're not querying the data, why store it in the first place?
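
For instance, a minimal sketch (index and field names are made up) of a mapping that keeps log labels as exact-match `keyword` fields instead of analyzed `text`:

```
import requests

# Hypothetical logging index: keyword fields are matched exactly and skip
# the full text analysis chain, which is usually what you want for labels.
mapping = {
    "mappings": {
        "properties": {
            "@timestamp": {"type": "date"},
            "service":    {"type": "keyword"},
            "level":      {"type": "keyword"},
            "status":     {"type": "integer"},
        }
    }
}
requests.put("http://localhost:9200/app-logs", json=mapping).raise_for_status()
```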

Full text search is incredibly useful for log files when combined with alerting. If you get log entries indicating that a disk is full, a service has stopped, or a user account is blocked, Elasticsearch can (with the right license) send emails or post to Slack.

Static mappings can be a pain but if you're constantly increasing the maximum field count for an index, use different indices for different log sources. Come up with an index pattern or alias that allows querying all those indices at the same time.

The main task here is reconciling the different logs so the index mappings are easily searchable, effectively as a union. Elastic Common Schema helps a lot with this. Elasticsearch mappings are easier to build when you first consider the queries you're going to be running on the data. You can then design the mapping with the right structure, field types, and settings.
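
As a rough illustration of the per-source-index-plus-alias approach (index and alias names are hypothetical):

```
import requests

ES = "http://localhost:9200"

# Point one alias at several per-source indices so queries and Kibana can hit
# "logs-all" while each source keeps its own mapping and field budget.
requests.post(f"{ES}/_aliases", json={
    "actions": [
        {"add": {"index": "logs-nginx-2021.09", "alias": "logs-all"}},
        {"add": {"index": "logs-app-2021.09",   "alias": "logs-all"}},
    ]
}).raise_for_status()

# A search against the alias fans out to every index behind it.
resp = requests.post(f"{ES}/logs-all/_search",
                     json={"query": {"term": {"service": "checkout"}}})
```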


My point was that by the time you filter on keyword fields (and other exact matching fields), the number of logs is small enough that an efficient full text search isn't necessary. That doesn't mean that full text search itself isn't useful, just that maintaining an inverted index is overkill in the logging case


This has been my experience. Obviously different people use logs for different things, but in my case I'm usually looking for information about something bad that already happened, within a very specific window of time, and within a specific section of the application. 99% of the time, that means I am filtering until there are only a handful of entries that match, at which point I don't need full text search at all.


I am not really sure about this.

A few days ago, a colleague asked me why a certain Google cloud instance does not exist. I did not know either, so I searched for this name in the Google audit log, and found when and by whom it was decommissioned.

But it was a full-text search, given the instance name. I probably could do it (in theory) as a field match, if I knew which field it was, and which format it was in (I am talking about project/abc/location/xyz type of junk that precedes the actual instance name).

And yes it was slow (this instance was deleted months ago, and Google tries to search the most recent logs first).


This sounds like the 1% of my experiences not served by filtering.

Naturally your experiences will be different from mine!


Completely agree.

My gripe with ES is that it won't let you do post-pass filtering at all. If you create an index with a few keyword fields indexed and then some unindexed fields, you can't query the unindexed fields.

Grafana's Loki seems to be exactly what we are looking for, although I haven't played with it.


I guess what they want is to use the elasticsearch query language but let it optionally do “expensive” non indexed filtering like a SQL database would let you do.

Without knowing for sure I imagine they originally expected the application side to handle this but many of the current solutions don’t do that. And they expose and overload the elastic search query language as the primary search interface with no additional app logic. The elastic search query “is” the search application.

Making some assumptions here, but that might reconcile the different viewpoints on why it does or doesn't make sense.


The problem with this thinking is that in most cases having the server send all of the data back to the client to do their own search is going to be far more expensive than running the search (even of unindexed data) on the server.

And I am only talking about server-side costs here, as moving data between server and client has costs in both serialization and transmission. Yes, I can make up regexes that wind up throwing this cost comparison out the window (e.g., lookbehinds), but in the vast majority of cases this is true.

I think the main reason that ElasticSearch does not do this is that they would have to provide grep-like or regex support, and those would give different answers than the lexical search system they provide otherwise. Explaining that difference to clients would be a nightmare.

Note: in most places I wind up using ElasticSearch I absolutely hate that it is lexical search rather than grep or regex... especially when I am looking for exact text. This is particularly a problem in Jira where I have to be very careful about word boundaries.


You can update the index with the new field specification and reindex your content.

Your complaint really doesn't make sense, how would you query an unindexed field? Elasticsearch is a _search_ engine, which means it needs to index content that is to be discoverable. What you're saying with unindexed fields is you're completely fine with those not being included in any search or filtering.
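
That workflow looks roughly like this (index and field names are invented): create a new index where the field is now indexed, then copy the documents across with the `_reindex` API.

```
import requests

ES = "http://localhost:9200"

# New index where the previously unindexed field is now a searchable keyword.
requests.put(f"{ES}/app-logs-v2", json={
    "mappings": {"properties": {"request_id": {"type": "keyword"}}}
}).raise_for_status()

# Copy documents over; Elasticsearch re-applies the new mapping as it indexes.
requests.post(f"{ES}/_reindex", json={
    "source": {"index": "app-logs-v1"},
    "dest":   {"index": "app-logs-v2"},
}).raise_for_status()
```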


Your response is, "Why can't you perfectly predict which columns/keywords will be necessary later on, or otherwise re-index the whole system at the drop of a hat for one query? And why would you think a search engine would be able to perform an unindexed, ad hoc search?"

Compared to my experience, you have a foundational difference of understanding with how systems are actually used.


You use the index to identify a subset of records and scan those for unindexed criteria.

It's ok to fail if the indexed criteria are not selective enough. In fact it's usually preferable to a long timeout.
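
A hedged sketch of that pattern, done client-side since Elasticsearch won't do the second pass for you (index and field names are invented):

```
import re
import requests

ES = "http://localhost:9200"

# Indexed keyword/date filters narrow things down to a small candidate set...
body = {
    "size": 1000,
    "query": {"bool": {"filter": [
        {"term":  {"service": "billing"}},
        {"range": {"@timestamp": {"gte": "now-1h"}}},
    ]}},
}
hits = requests.post(f"{ES}/app-logs/_search", json=body).json()["hits"]["hits"]

# ...and the unindexed criterion is applied on the client. If the indexed
# filters aren't selective enough, failing here beats a long server timeout.
pattern = re.compile(r"disk .* full")
matches = [h for h in hits if pattern.search(h["_source"].get("message", ""))]
```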


Exactly, it's a search engine. It probably doesn't make sense to use it as a storage engine for logs unless you need to search all of them efficiently.


There's also cLoki. It's a new project that puts a Loki gateway over a ClickHouse backend store. We're looking at it and plan a presentation from the author(s) at the next ClickHouse SF Bay Area Meetup.

https://github.com/lmangani/cLoki


Will runtime fields help you with post-pass filtering? https://www.elastic.co/blog/introducing-elasticsearch-runtim...
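
Roughly, a runtime field is defined in the search request itself and computed by a script, so you could filter on something you never indexed. A sketch with made-up field names (assuming `status` is a mapped numeric field):

```
import requests

# Runtime field (ES 7.11+) computed at query time; nothing new is indexed.
body = {
    "runtime_mappings": {
        "status_class": {
            "type": "keyword",
            "script": {"source": "emit(doc['status'].value >= 500 ? '5xx' : 'other')"},
        }
    },
    "query": {"term": {"status_class": "5xx"}},
}
resp = requests.post("http://localhost:9200/app-logs/_search", json=body)
```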


I have little knowledge of the log aggregation domain, but generally indices are great for read mostly loads. It seems to me that for log aggregation writes are more frequent than searches; cheap writes and the occasional brute force search.

For alerting you might be better off running each new line against a set of filters/watchers. It seems wasteful to run it after indexing.

Again, no experience or knowledge on the domain, so I might be completely off.


> I have little knowledge of the log aggregation domain, but generally indices are great for read mostly loads.

Generally, you write and read to/from the same index in Elasticsearch. Where this falls apart is that you'll often want to change the configuration for an index based on whether it's write or read heavy. The main thing that changes in this scenario is the number of primary and replica shards (Lucene indices) for the Elasticsearch index.

Indices with a high write, low search workload will generally require more primary shards and fewer replicas. Low write, high search workloads require the opposite: fewer primaries and more replicas.

The problem comes when you need high write and high search rates. Using a single cluster with lots of primaries and lots of replicas will overwhelm the hosts and you end up with terrible performance. The general pattern with Elasticsearch is to run two clusters. Index into one cluster, then use cross-cluster-replication (CCR) into a different cluster you run queries against.

There's an incredible amount of nuance to all of this. I've worked with many clusters and they all have different usage and configuration requirements. There's no magic formula for calculating configuration values; it all comes down to experience, monitoring, and experimentation.
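
As a hedged illustration of the two profiles above (the shard counts are arbitrary examples, not recommendations):

```
import requests

ES = "http://localhost:9200"

# Write-heavy logging index: more primaries to spread indexing, few replicas.
requests.put(f"{ES}/logs-ingest", json={
    "settings": {"number_of_shards": 6, "number_of_replicas": 1}
}).raise_for_status()

# Read-heavy index: fewer primaries, more replicas to spread the query load.
requests.put(f"{ES}/logs-search", json={
    "settings": {"number_of_shards": 2, "number_of_replicas": 3}
}).raise_for_status()
```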


At the core of Lucene, as you index a document, it first creates an index containing a single document, and everything else is merge operations (operating in log N, merging larger and larger chunks). So the nice thing is that you can use the same query language, in fact the exact same implementation, to run a search query in alerting mode: you would create this single-document index (which you'd do anyway to make it searchable) and run the query against it before adding it to the other documents.
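
Elasticsearch exposes a related idea through the percolator field type: the alert queries themselves are indexed, and each incoming log line is matched against them. A minimal sketch with made-up names, not necessarily how you'd wire it up in production:

```
import requests

ES = "http://localhost:9200"

# Index that stores alert rules as percolator queries; the fields of the
# documents being matched ("message") must be mapped here too.
requests.put(f"{ES}/log-alerts", json={
    "mappings": {"properties": {
        "query":   {"type": "percolator"},
        "message": {"type": "text"},
    }}
}).raise_for_status()

# Register an alert rule.
requests.put(f"{ES}/log-alerts/_doc/disk-full?refresh=true", json={
    "query": {"match": {"message": "disk full"}}
}).raise_for_status()

# For each new log line, ask which registered rules it triggers.
resp = requests.post(f"{ES}/log-alerts/_search", json={
    "query": {"percolate": {"field": "query",
                            "document": {"message": "sda1 disk full"}}}
})
print([hit["_id"] for hit in resp.json()["hits"]["hits"]])
```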


Loki not indexing the log lines doesn't mean you can run complex queries on it. I've made a video to explain this concept: https://youtu.be/UiiZ463lcVA


Kibana and Loki both load full messages in their list page. You end up loading megabytes (sometimes hundreds of megabytes) of data but it only displays a few kilobytes.

I don't know when we forgot the basic paradigm of list -> click -> details where loading the details is a separate http call. This is what datadog does, and the difference is staggering. Almost enough to make me abandon Kibana/Elastic and pay datadog.

I can't let a ELK discussion pass without mentioning vector.dev (https://vector.dev) which I'm not affiliated with aside from being a very happy user (for log ingestion).


+1 for vector. Very stable, performing and feature-rich.


Looks like Vector has a Loki outputter.


This is spot on based on our experience.

I would add that the default ELK settings aren't terribly log-friendly, and having to janitor index policies, sharding, lifecycle policies, VM resources, etc. etc., _which you have to do even with the managed Elastic Cloud offer_, is way too much effort just to find and aggregate your TimeoutExceptions.

We moved to NewRelic and while its dashboards are not _quite_ as fancy or powerful as Kibana, it's as close to zero-configuration as you can hope. It also has a bunch more features that show it's a monitoring system first and foremost.

Sure, it's a SAAS that we can't self-host, but diagnostic logs aren't business-critical so if we got locked out or priced out tomorrow, we would suffer no real disruption while looking for a replacement.


At high volumes, at my job, we have yet to find a third-party log SaaS that performs better than self-managed Elastic, or even performs well enough to be usable at all.

New Relic could not handle the query side of around 5+ TB of logs per day (I know, a ludicrous amount of logs, but that's what it is). Their architecture does not really allow for that; for small volumes I guess it would be enough. Ingest it handled fine, but querying the logs in a reasonable time frame without a timeout is where it couldn't handle the high volume.

Also, their support, even while trying to win us over, i.e. at their best moment, was nothing really stellar. They favoured sending us sales/presales people to solve technical problems.

It didn't leave us a good aftertaste.


Not advocating for this decision, but did you investigate Splunk? In my experience, that’s the paid logging service that competes with ELK. It will be expensive, so you have to consider the total cost of ownership (e.g., ELK requires some experienced people to run it at your volumes) but it works AFAIK.


there's expensive and then there's splunk.

but you get what you pay for. splunk will handle your load unless you're google.


I love splunk. Our clusters process 10s of billions of structured log events daily. We have search, reports, PagerDuty integration, dashboards, etc. It is crazy expensive but is the best system I've used in this space. We are having to save costs with so much data, so we are lowering retention time and moving the data to snowflake for data older than a week. More and more, we are leveraging Looker for reporting out of Snowflake and relying more on Prometheus monitoring for alerting. But Splunk would still be my ideal service if we had less total data.


I second that. I love splunk as well.

Costs can also be reduced by spending some development effort on abbreviating logs and being smart about deciding what to log and where.


> Our clusters process 10s of billions of structured log events daily.

What's that run you?


> there's expensive and then there's splunk.

This got me curious, so OK, Splunk's pricing pages are very obtuse and they are really pushy about getting you to contact sales directly so they can bleed you, but I managed to get to this "actually has a number in it" page for their Log Observer services[0], and... it looks cheaper than NewRelic, especially at scale?

NR charges $0.25 per ingested GB after the first 100 free GB; Splunk apparently only charges a flat $0.10, if you choose ingest pricing.

I guess that NR includes (a free tier of) a bunch of alerts, monitoring etc. features in their package, while they're separate packages for Splunk. Still, that doesn't seem wildly expensive at a glance. Where's the catch?

[0] https://www.splunk.com/en_us/software/pricing/faqs/devops.ht...


The primary issue with using Splunk is that pretty much all other solutions will seem inferior afterwards. Great product, terrible business partner.


Yep. We're using Splunk with TBs of logs a day and it's been great.


>At high volumes, at my job, we have yet to find a good third-party log SaaS that performs

It's not even just performance; the costs are always astronomically higher too.


I think for somewhat smaller volumes Sumo Logic can be a choice. The kind of search queries, regex, and capture options give the feel of log parsing on a *nix box.


At an old job we used NewRelic right up until they tried to 3x our annual bill :/


You are not alone.


How are diagnostic logs not business critical? If you have an outage while your non-mission critical logs are offline what are you going to do?


If it's a live outage, we can still SSH into a machine and grep the console output / local rotating log files. (For that matter, I still prefer to do that when I'm just testing new stuff in dev/staging environments)

NewRelic "only" stores the logs for 30 days and displays them in a nice web UI with searching, alerting, sharing, and a bunch of other stuff. It's not like they cease to exist without it.


SSH'ing to a machine and grepping through logs is much harder when the number of machines is greater than ~5 (say). SSH'ing into hosts doesn't scale as you keep adding hosts.


We use Docker Swarm so we can SSH into a manager and run `docker service logs` regardless of which hosts are actually running the services. I assume other orchestration systems have equivalent features.


When someone says "we ssh and grep" you know pretty much how small their operation is.


<shameless-plug>

Former co-founder and CTO of InfluxData here, currently building a new company in this space. My strongly-opinionated view on this is that Elasticsearch is not a time-series database and asking it to handle large volumes of logs (fundamentally a time-series use case) is always going to be painful and expensive.

We've built a product called EraSearch that mimics the Elasticsearch APIs for ease of integration but is built with a significantly more efficient (read time-series) architecture. We can handle ingest volumes with about 1/10th of the hardware required for Elasticsearch while still offering comparable (or faster) query performance. If you are generating large amounts of logs (~1TB per day or more), my guess is that this will resonate with you.

If any of this sounds interesting, drop me a note at todd@era.co - I'd love to hear more about your use case. Or even if you just want to talk about time-series data, I'm game. ;)

</shameless-plug>


I’d love to try your product out in combination with Jaeger. Is it possible?


We've actually been doing some internal work with OpenTelemetry and Jaeger - we would love some feedback on it. Drop me a note at todd@era.co and we can get you set up with a demo instance.


About 5 or 6 years ago, my previous job introduced ElasticSearch to our infrastructure.

Our existing codebase had a pattern where all logs for a transaction were stored in a single, big log and then that log was uploaded to a server to be stored.

We moved this large log to ElasticSearch, formatted it differently, labeled a bunch of columns, used Logstash to standardize variable names, etc.

We did this for our main services and kept those services separated as different indexes.

Each log recorded which user made the web request, response codes, etc. We had detailed logging and general logging stored in 2 separate places and threw out the detailed logs after a very short while. We had fields for all common detail work.

It. Was. Perfect. Everyone could use it. Our in-house customer support team used those logs to help diagnose customer issues, our tech team used it to track issues. Our NOC used it to investigate issues.

We had tens of dashboards that were on a rotating view. We had at-a-glance server health tracking through it, all from active traffic that was being formatted and used.

-

And then the next company I worked with that used ElasticSearch tried to use it in a world where each logger.info() was its own row in ElasticSearch. That just seemed like a horrible, horrible idea.

It really does seem to come down to how you use it.


Elasticsearch is good because it just ingests whatever you send to it, which allows you to deliver solutions rather quickly.

Having said this, I agree there are better solutions. (Also, Elasticsearch shines because of its full text search capability, which is not often exploited in case of logs.)

Loki is fine (or better said, it will be fine once they finally release a version without the write-out-of-order constraint), but I find its lack of a high-availability solution a bit frustrating.

ClickHouse, on the other hand, is just magnificent. I use it in combination with Vector as a message pipeline solution (an alternative to Fluentd, let's say).

So, yes, Elasticsearch is just not great and not only for logs, but for everything else that doesn't require full text search, in my opinion.


Out of Order support is available in Loki's main branch and included in the next release. It's already live in production on Grafana Cloud. https://grafana.com/blog/2021/09/16/avoid-dropped-logs-due-t...

High availability in Loki is supported in distributed mode. Helm chart here: https://github.com/grafana/helm-charts/tree/main/charts/loki...


Yes, I'll try the next release, that's why I said it wasn't released yet :)

Regarding HA, I meant something beyond a k8s deployment.


Using ClickHouse for log storage and analysis is discussed here - https://news.ycombinator.com/item?id=26316401


> Grafana Labs' Loki is very exciting. Instead of storing a costly Inverted Index, Loki only indexes on fields (the equivalent of keyword fields in ElasticSearch)

One can configure Elasticsearch to index only on fields too, so I'm not sure "only indexes on fields" is a differentiating factor. The real advantage of Elasticsearch, or any search engine in general, is arbitrary boolean filtering, as many log aggregation systems have started to use inverted indexes too. In addition, Elasticsearch has its own column-oriented data structure specifically for aggregation. Static sharding is a problem, but not necessarily a big one, as many companies do not have enough scale to hit it yet.

BTW, we should really take what Uber claims, and what they actually use, with a grain of salt. Case in point: they internally used Elasticsearch for years to aggregate all the logs in their marketplace, for both real-time use cases and historical data spanning months. Their Pinot-based solution and the promotion-oriented GPU DB didn't go anywhere.


> Uber has not open sourced this work so we are unable to benchmark it and see how it performs

I implemented their design here, specifically for importing zeek logs:

https://github.com/JustinAzoff/zeek-clickhouse

I don't have the elastic compatible query api though, or the smarts that auto materialize popular columns.

It works though, does a good job at soaking up any sort of log type and handling fields being added or removed.


Agreed!

> Ubers Clickhouse as a Log Storage thing

We built hosted ClickHouse-based logging as a service https://logtail.com, just launched with Show HN last week.

Disclaimer: I'm the founder, happy to answer any questions


Apart from being overkill for logging, Elasticsearch has, I feel, a high maintenance cost as the system starts to scale. Somehow that cost is not very visible at the start. Grab, a successful Singapore-based start-up, blogged about some other pain points a couple of years back as well ( https://engineering.grab.com/how-built-logging-stack ).


What's worse (and I've seen this trend a lot) is to use Elasticsearch for metrics. My god.


Elasticsearch, with a mapping that has full-text indexing disabled and a relatively large time range, works very well for timeseries data. It does filtering and metric and bucket aggregations crazy fast. And it's very easy to add a new node; just run on spot instances. We push an event per request, which might have 100-500 unique keys, 100 million records per day, and no solution we've tried comes close to the ease of Elastic.

Try to replicate the freedom (no-schema, cardinality, adding a new node to cluster) of Elasticsearch with InfluxDB or others and you will hit cardinality problems real fast.


I'm struggling with this currently. I want to record some metrics with our codebase, like counting each time a certain thing happens. Our code logs requests and response times to ElasticSearch, so there's the option of just "logging" these metrics so ElasticSearch has them too. It just seems like a mismatch to me. Earlier I wanted to create a dashboard that graphed some derived stats off of that request data, like server utilization, which depends on doing some math with average request count and average response time, and it just didn't seem like ElasticSearch dashboards easily supported that. I was able to do it at the AWS LB level with CloudWatch. Not having had exhaustive training with ElasticSearch, maybe what I'm looking for is a separate metrics system like statsd and Grafana.


It's fine for metrics; we've been using it that way for a few years.


Why do you consider this a bad thing?


I'm not convinced it is a bad thing. Good scale, relatively low latency, keeps your infrastructure costs down, and many of the other alerting options (looking at you Datadog and NewRelic) are crazy expensive.

Open to hearing other opinions though.


Because you're suddenly putting metrics (that should belong to a time-series database) into a full-text search database.

The result is a dashboard that takes ages to load just to show a trend in values.

I think it's using the wrong tool for the job, but maybe it's just me.


Elasticsearch is fine for time series data. A lot of tasks are actually easier with time series data. You add a field called `@timestamp` to your documents and a lot of analysis becomes possible, like date histograms, date range queries, ML jobs, etc.
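
For example, a typical date histogram over log data looks roughly like this (index and field names are assumptions):

```
import requests

# Error volume per minute over the last hour, bucketed on @timestamp.
body = {
    "size": 0,
    "query": {"bool": {"filter": [
        {"term":  {"level": "ERROR"}},
        {"range": {"@timestamp": {"gte": "now-1h"}}},
    ]}},
    "aggs": {"per_minute": {"date_histogram": {
        "field": "@timestamp", "fixed_interval": "1m"}}},
}
resp = requests.post("http://localhost:9200/app-logs/_search", json=body)
```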

Immutable time series data like logs and metrics are a great fit for Elasticsearch due to the way Lucene stores data. Documents in Lucene are immutable, so an update in Elasticsearch creates a new document and places a tombstone marker on the old one. Immutable data means you never pay for those inefficiencies.

Dashboards don't load the entire dataset by default. I can't remember what the exact default time range is but I think it's ~15 minutes or so. They're fairly quick to render in Kibana.

Elasticsearch is a great tool for observability data (logs, metrics, and APM data). Elastic's tooling makes a lot of this really easy in most cases.


I think you may want to check your mappings/templates. There are a lot of data types for this kind of data and they don't rely on the inverted index that you would use for searching fields. Lucene, which Elasticsearch is built on, has a feature called "doc values" that stores data as column-oriented fields. This is what makes aggregations, sorting, and grouping fast for numeric and keyword fields.

One of the main strengths of Elasticsearch is that you can use it for searching and aggregating in a single query. But you need to ensure you are searching on fields that are indexed for search and sorting/aggregating on fields that are indexed for that.
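
A hedged example of what that split can look like in a mapping (names invented):

```
import requests

# "bytes" is not indexed for search ("index": false) but keeps doc values
# (on by default for numerics), so it can still be aggregated and sorted.
# "service" gets both; "message" gets only the inverted index for search.
mapping = {"mappings": {"properties": {
    "bytes":   {"type": "long", "index": False},
    "service": {"type": "keyword"},
    "message": {"type": "text"},
}}}
requests.put("http://localhost:9200/access-logs", json=mapping).raise_for_status()
```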


> (and I've seen this trend a lot)

any idea why? Seems like an odd approach to me.


Because that's what Elastic is positioning Elasticsearch for. If you look at the features being added these last few years, so much is about time series stuff — aggregation, time-based partitioning, warm/cold storage, etc. Relatively little focus on structured search.


Because people like to have everything (logs + metrics) in the same place. And probably Beats makes it easier to just ship everything in Elasticsearch (but of course this doesn't make it good)


When I see a nonzero value in a metric about “things that shouldn’t happen,” the very next thing I’m going to ask for is examples.

It’s generally considered too expensive to derive metrics from logs in this way, but it’d be a killer debugging experience.


I was interviewing earlier this year and saw at least three companies actively planning to migrate from their existing relational databases into Elasticsearch.

It seems to commonly be perceived as a panacea of databases.


in most cases elastic is easier to scale than a relational or timeseries db


I mean, kind of? It depends on what your data looks like and how you're querying it.

If you're not doing free-text search, and your data will fit in memory in a big relational database VM for the foreseeable future, why Elasticsearch?


sure relational db is ok if fits in memory, but if we're talking about TBs of metrics i'd rather look for another way


ClickHouse is written in C++ and is open source: https://github.com/ClickHouse/ClickHouse


Where can I find resources on tools that integrate with Clickhouse, ex. are there any tools for gathering server metrics and sending them to Clickhouse?


I often find myself wondering if all this is really that much better than the old rsyslog/syslog-ng stuff we used to do. These days it's all some timeseries on top of journald; has anyone tested the remote journald options? Also, at what point do we break logs out from metrics? Trying to do everything in one tool is so un-*nix.


What we do internally is use syslog-ng (https://github.com/syslog-ng/syslog-ng) to read the journald socket and push to a remote and into Kafka. I think journald works well as a structured logging tool, but it's certainly deficient in other ways


I have looked into ElasticSearch + Kibana as a solution to aggregate logs. There may be plenty of options to replace ElasticSearch (ClickHouse, even Postgres, heck even journald), but a nice UI where you can simply search for that random piece of text you need to sift through the logs is the sticking point.

Until now, I have not seen a web interface to log as powerful as Kibana that can work with anything other than ElasticSearch.

This is why I chose to stop my search and pay for Datadog to do this correctly, and simply let me search for that keyword in the logs when I need it the most (and not worry about whether I indexed stuff correctly, or balanced some whatever in ElasticSearch, or remembered to set up something far too technical for a log system). Datadog allows you to keep a short period's worth of data in the index and "expire" old content into archives while retaining the ability to add it back to the index if needed for any investigation.


journald is not good at handling a lot of data, nor is it good at managing imported data. (It could be improved, probably "easily", but its main feature is that it's an "always on", not terribly dumb log target; it's not a long term log management system.)


Hmm. Never tried to use journald at any reasonable scale beyond tens of servers. Good to know its characteristics.

To be honest I wasn't looking for a long term log management system, and that is why journald even came to mind. If it could aggregate logs from several servers and retain them for a week while expiring older logs to an archive source, it's sufficient for my needs.


Exactly why I wrote my comment. :) Because it seems it's able to do that, but not really. And it seems easy to fix, but of course patches are welcome. (Hopefully.)

https://github.com/systemd/systemd/issues/5242

Sure, it's not terribly hard to work around it with a cron (or systemd-timer) script, but why go uphill, when there are better tools.


Grafana?


Haven’t played with the logs part of Grafana recently, but would it work on top of say Clickhouse? I thought it was more tuned for the Loki use case… is it not?


I'd agree, it isn't good for exploratory queries. But if you have some predefined ES queries for correlating log messages to metrics, it can be useful to have it all in one dashboard.


> I’ve really pushed ElasticSearch to its limits, with hundreds of terrabytes of data across dozens of machines and tens of thousands of shards and in all that time I’ve found that it really only works well for one of those situations.

This is a non sequitur


If I remember correctly, the core Splunk patents are from around 2006. Once the patents expire, could an open source project duplicate the Splunk indexing structures? As much as Elasticsearch has tried, Splunk is better for logs.


I moved to a company that uses ElasticSearch as an observability stack. It has been such an awful experience. It’s just painful creating new dashboards and alerts, such that people avoid having to deal with it at all. One example is that alerts are completely independent from dashboards. It’s so difficult to get a full graph related to the alert condition to gather more context. The reason it’s like this, imo, is that they are trying to do too much with the same product and so each feature feels compromised.

At my last company I was very enthusiastic about monitoring where we were using Splunk and SignalFX. It was just fast, seamless, and reliable.


ES and the ELK stack is the worst pile of shit I've had to deal with. Everybody I know who has worked with it has had problems. Obscure messages that nobody can figure out what they mean. Dropping logs silently, sometimes because it can't reconcile two index mappings. Constant baby-sitting and needing to add more hardware even if there are no indications that it needs it, super hard to tune or understand. A proper db doesn't need this much work and shipping logs (syslog) has been solved since the 80s. Move to Loki.


You cannot do much analytics on Loki, there is no index so each query is costly


You can do a surprising amount with Loki.

To get the performance, run more queriers (the horizontally scaled read path that sits in front of the object storage).

We're improving performance constantly, and we already run Loki at a very large scale (multi-region, multi-tenant, etc) with aggressive internal SLOs. We see customers doing network analytics, log analytics, line-of-business data analytics all on Loki and it works really well. Customers also use it as part of Cloud hosted ETL processes and to drive alerts. What we run in the Cloud today is the OSS version on k8s.

That said... if we're missing some use-case or just a scenario that you feel should work and it doesn't work for you, then we really want to learn about it. If there's an out of the box experience that isn't great, then let us know. Feel free to ping me any details you're not willing to put on HN at david.kitchen@grafana.com but I'm very cool with public conversations too.


Hi.

I have a question. I have put some time into learning the Grafana stack. The heaviness of Elastic, both self-hosted and the cloud version, has led me to seek alternatives. Is there a way to get logs into Loki that can be run offline, when jobs have finished and hosts have been shut down? It seems Loki recommends the Promtail agent, which looks a bit heavy-handed (https://grafana.com/docs/loki/latest/getting-started/get-log... )

I have relaxed requirements on gathering logs for analytics. I’d much rather just run a tool that goes through log files and ingests them into Loki. Results of several jobs would be compared as a whole.

From what I’ve seen Loki with Grafana can do a lot. For example this video: https://m.youtube.com/watch?v=7zmRhHd-ohk

Thank you!


One of the engineers has suggested trying:

```cat <files> | promtail -stdin```

Though we like the idea as a feature request... "Have promtail support reading a file or directory and exiting once complete".

So we've put that on the plan.


Ah, my thanks. I had somehow understood Promtail wrong. Its name, the mention of Prometheus, scraping and such at first told me it was closer to something like node exporter. I had watched the Loki release presentation and a few others, but still managed to misunderstand. Oh well. Now things make more sense. Moving forward with the installation... :)


Promtail is a process that watches a directory and effectively tails the files and sends them to Loki.

Agree that the name isn't ideal as it's not clear that this is a "tail and send text files" process.

But yup, Promtail is what you want.


I wonder if the Loki designers thought about the features that ES provides before rewriting the whole thing in a new way? Search and Analyze is pretty cool. Almost all of Loki could be an analyzer plugin. Automatic schema detection is cool. It's not that crazy to administer.


The premise of the article is that it is unnecessary and inefficient to index the full text fields, as searching on fields is often more than enough.

Other solutions like Loki, which only index non-full-text fields, are suggested.

Why not an RDBMS? Over 10 years I have come full circle from SQL/RDBMS -> NoSQL of various types -> SQL/RDBMS.

edit: Is it really that other databases types are better suited to high ingest rate / write-heavy light-read workloads?


(shameless plug from a Sumo Logic employee)

SaaS vendors such as Sumo Logic are a much better solution for logs. They have multiple tiers of data (continuous, frequent, infrequent) with different performance and cost trade-offs. Most companies would be better off dividing logs into different categories than running and tuning Elasticsearch.


Of course it's not! It is a search engine, and anyone who knows its history will know it was not intended to be a logging system.


About the field limit: an application shouldn't have more than 1000 fields in its structured logging.

Usually when this happens, it's because the source uses what could be considered user input as keys, for example access logs with stuff such as headers or query parameters, where you create a new field for each header or query parameter. In the latter case, all it takes to exhaust the available fields in the index is some nasty bot trying random query strings on your site. This can be easily solved in the ingest/Logstash pipeline.
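
One way to cap this, assuming the offending keys all live under a single object such as `url.query`, is to map that object as one `flattened` field so arbitrary keys never become new mapped fields (a sketch; pruning or renaming the keys in the pipeline works too):

```
import requests

# "flattened" stores the whole object as one field, so random query-string
# keys from bots no longer eat into the index's field limit.
mapping = {"mappings": {"properties": {
    "path": {"type": "keyword"},
    "url":  {"properties": {"query": {"type": "flattened"}}},
}}}
requests.put("http://localhost:9200/access-logs-v2", json=mapping).raise_for_status()

# Leaf values remain queryable, just all treated as keywords.
body = {"query": {"term": {"url.query.utm_source": "newsletter"}}}
resp = requests.post("http://localhost:9200/access-logs-v2/_search", json=body)
```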

It is true that you have to spend time to carefully map and process the documents you're ingesting in elasticsearch, but once you do I don't think there's any other solution with the same performance and features for logs, especially considering the licensed features (like alerting from anomaly detection machine learning jobs)

The elastic stack is complex and might be hard to grasp (their own training courses are a great help though), requires significant computing resources, and managing a big deployment can be a full time job, but once implemented correctly you can do awesome stuff with it.

Loki is a simpler and slower solution which does fewer things, so if you don't need what Elastic provides, it could be a better fit as it's probably easier to manage.


As the name implies, it has been a great search system in my experience, although with a somewhat steep learning curve.


Just use Scalyr.


A million times this. Crazy fast flexible searching. I'm nervous they'll lose focus on the logging product after the SentinelOne acquisition but so much easier than anything else.


pure frustration, and after a few hundred words it is about how cool the challenging software is. well, it doesn't solve the problem, you just have some different problems at that scale. your problems require architecture


It's not by itself. You need to put Graylog in front of it :)


I can't say I agree with this article. Elasticsearch is a bit of a Swiss army knife and definitely needs some setup and configuring. But if you set it up properly, it does logs extremely well.

And it does actually come with useful defaults and well integrated tools. For example, if you use the various Beats agents to push data to Elasticsearch, they come with a default schema and out of the box dashboards that they can create for you. These are actually useful. I don't know that many tools in this space that have out of the box dashboards pre-configured, with matching schemas, etc.

It's also not that hard to create your own schemas and dashboards. Dynamic mapping is a rookie mistake that you might get away with if you don't have a lot of structured data. The classic mistake here would be mapping hundreds/thousands of fields. Not a great idea when you have billions of documents; the per-document overhead adds up pretty quickly (disk, memory, indexing speed, querying overhead, etc). Much better is to only map those fields that matter to you and disable dynamic mapping for everything else.
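
Something like this, as a rough sketch (field names are just examples):

```
import requests

# Map only the handful of fields you actually query; with "dynamic": false
# everything else is kept in _source (so still retrievable) but builds no
# per-field index structures.
mapping = {"mappings": {
    "dynamic": False,
    "properties": {
        "@timestamp": {"type": "date"},
        "service":    {"type": "keyword"},
        "level":      {"type": "keyword"},
        "message":    {"type": "text"},
    },
}}
requests.put("http://localhost:9200/app-logs", json=mapping).raise_for_status()
```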

And then there are features like rollups, data streams, index lifecycle management, etc. that make a lot of sense if you are storing logs at large scale. But you have to of course know to use them. As these were x-pack features, you won't be getting that with Opensearch.

Now there are of course many other solutions. But frankly it's a bit of a messy field of half-integrated tools with a high degree of DIY in setting them up. Log analytics is only as good as the dashboards you get with it. Usually that means Grafana these days. How the data is stored is only part of the problem.

Grafana is basically an old fork of Kibana (v3) that has had lots of love and attention since then. It's a bit less feature-rich perhaps, but very capable. Incidentally, they went down the same re-licensing path as Elasticsearch: https://www.infoq.com/news/2021/04/grafana-licence-agpl/. So, you might want to think twice about putting that on a server unless you are completely fine having AGPLv3-licensed code close to where your own code is deployed (hint: if you are uncertain, you probably should not be doing that). Loki is licensed similarly.

You can definitely cobble together some dashboards pretty easily using e.g. Grafana and any of its backends. But the out-of-the-box experience for that is a lot of work figuring out what to use, how to operate it properly, etc. Elastic Cloud is expensive at scale but kind of plug and play on this front.

Probably the next best thing would be Opensearch. It's a fairly recent fork of Elasticsearch and Kibana. And unlike Elasticsearch/Kibana or Grafana/Loki, it remains under the Apache 2.0 license. Most of what you would need for a decent log analytics system is in there.

What you get by default from Google Cloud, AWS, etc. is pretty bare-bones in comparison in their cloud logging.


My experience with ElasticSearch is that it is amazing for log aggregation until you reach a certain volume and then the operational cost (both $ and time/complexity) of running a large ES cluster starts to really bite hard.

To take one recent(ish) example from my last job. We built a bunch of streaming pipelines to enrich log data during ingest so we could use the ES ML jobs to do some proactive alerting on our application logs. It was all really nice, but then we had to set up a regular patching cycle for our ES cluster. Generally everything in ES is stored in an index, so doing a blue/green deployment works pretty well. Except that the ML jobs have some sort of in-memory state that can't be migrated, so the ML jobs would get migrated but in a weird, half-working state. So we ended up having to manually delete and recreate the "half-working" ML jobs whenever we rolled out the new patched cluster.

Having an "all in one" solution like ES is great until you hit a certain scale and the inherent complexity of such a system starts to really make itself apparent.


Thanks for sharing your experience with ML.

We're experienced with managing a big scale elastic cluster, but never managed the ML jobs, so knowing about this limitation is definitely useful!


Anyone that's used Splunk for any amount of time will feel like they are in a straightjacket with ES.

Anyone stuck paying for Splunk will think they probably should be.


AGPL is just GPL where usage over a network requires giving users the source code of the AGPL codebase. I'll take it over a custom licence where I have to worry if I need to release the source code of everything around it (SSPL) or if I'm providing it as a managed service (EL).


Oh man, https://github.com/elastic/elasticsearch-py/issues/1734 is a disappointing read. I know ES wants to save their business, but alienating users isn't exactly the path to success.


FWIW, the ES team implemented a change that I requested via GitHub, and I’m just some guy who’s not even a paying customer. The change involved the behavior of force merge, and probably didn’t affect too many people. It’ll allow my employer (and hopefully some other companies) to make more efficient use of our disk space.

I’m really appreciative of the way they considered and then implemented my request.


I know someone who works at Elastic and it sounds like a really nice workplace (in terms of culture, average competency level across all employees, etc.), and that seems to be reflected in their products.


That's good to hear.


I'm not sure I regard that particular example as actually bad.

The response of "use the client version that matches the server version" seems perfectly fair.


elastic 5.x is ANCIENT and no one in their right mind should still use that for any production workload



