Ask HN: What's your preferred logging stack in Kubernetes
72 points by ryanisnan 56 days ago | 61 comments
Hi HN,

I'm looking for advice and insight on what y'all might use for an internally hosted logging solution in Kubernetes. Currently we use a self-hosted Graylog setup, but are finding it difficult to maintain as our system grows.

Here's our current setup:

  - Multiple clusters
  - Logs aggregated to a single Graylog setup, itself running in Kubernetes
  - Logs are sent to Graylog via Fluentbit

Some problems we've had are:

  - Index management in Graylog's Elasticsearch cluster is a PITA when you have many differently shaped log formats going to shared indices (managing separate indices per source is also a pain)
  - Management of MongoDB in Kubernetes is frustrating and has been a reliability challenge

I'd love for us to be able to use a hosted logging solution but $$$ obviously. I'm aware of many other alternatives, but one of the things I've painfully learned is that a basic feature matrix only tells a very small piece of any story. The real education comes from running this type of tech and living with it through scale and various lifecycle events.

Some questions I have:

  - What logging solutions are you using in your Kubernetes environment, and how has your experience been?
  - How do you handle log retention and storage costs?

TIA



> What logging solutions are you using in your Kubernetes environment, and how has your experience been?

We have been storing all the logs from all the containers running in our Kubernetes clusters in VictoriaLogs for the last year. It works smoothly and uses a very small amount of RAM and disk space. For example, one of our VictoriaLogs instances contains 2 terabytes of logs while using 35 gigabytes of disk space and 200MB of RAM on average.

> How do you handle log retention and storage costs?

VictoriaLogs provides a single clear command-line flag for limiting disk space usage - `-retention.maxDiskSpaceUsageBytes`. It automatically removes the oldest logs when disk space usage reaches the configured limit. See https://docs.victoriametrics.com/victorialogs/#retention-by-... .

P.S. I may be biased, because I'm the core developer of VictoriaLogs. I recommend trying VictoriaLogs in production alongside other log management solutions and then choosing the one that best fits your particular needs from an operations, cost, usability and performance PoV.


Take a look at Coroot [0], which stores logs in ClickHouse with configurable TTL. Its agent can discover container logs and extract repeated patterns from logs [1].

[0] https://github.com/coroot/coroot

[1] demo: https://community-demo.coroot.com/p/qcih204s/app/default:Dep...


They all sort of suck, to be honest. The least sucky has actually been hosted Google Cloud Logging of late; it's just "not bad" enough to get the job done.

When I worked at Postmates we had a proprietary log search built on ClickHouse which was excellent. The same idea was also implemented concurrently at Uber (yay multiple discovery) and is documented at a relatively high level here: https://www.uber.com/blog/logging/

If a gun were placed to my head, I would rebuild that over running the existing logging solutions.

I also worked for several months on building my own purpose-built logging storage and indexing engine based on trigram bitmap indices for accelerated regex searches, a la CodeSearch, but I ran out of motivation to finish it, and commercialisation seemed very difficult; too much competition, even if that competition is bad. I really should get around to finishing it enough that it can be OSSed at least.


Can you stream self-hosted clusters' logs into GCP?


You probably could. The easiest way would probably be to stand up a GKE cluster quickly and then copy out the fluentd config.


sounds like Chronosphere


I would recommend ClickHouse, which provides very efficient compression and thus reduces data size drastically.

Apart from that, it provides various other features:

- Dynamic data types [0], which are very useful for the semi-structured fields that logs very often contain.

- You can configure column and table TTLs [1], which provide an efficient way to handle retention (see the sketch after the links below).

At my previous job (Cloudflare), we migrated from Elasticsearch to ClickHouse, got nearly a 10x reduction in data size, and saw a 5x performance improvement. You can read more about it [2] and watch the recording here [3].

Recently, ClickHouse engineers published a wonderfully detailed blog post about their logging pipeline [4].

[0] https://clickhouse.com/docs/en/sql-reference/data-types/dyna...

[1] https://clickhouse.com/docs/en/engines/table-engines/mergetr...

[2] https://blog.cloudflare.com/log-analytics-using-clickhouse

[3] https://vimeo.com/730379928

[4] https://clickhouse.com/blog/building-a-logging-platform-with...
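To make the TTL point above concrete, here is a minimal sketch using the clickhouse-driver Python client. The host, table name, columns and the 30-day window are made-up placeholders for illustration, not Cloudflare's actual schema.

  # Minimal sketch: a ClickHouse log table with TTL-based retention,
  # created via the clickhouse-driver Python client.
  # Host, table name, columns and the 30-day TTL are illustrative assumptions.
  from datetime import datetime
  from clickhouse_driver import Client

  client = Client(host="clickhouse.example.internal")

  client.execute("""
      CREATE TABLE IF NOT EXISTS logs (
          ts      DateTime,
          service LowCardinality(String),
          level   LowCardinality(String),
          message String
      )
      ENGINE = MergeTree
      ORDER BY (service, ts)
      TTL ts + INTERVAL 30 DAY DELETE
  """)

  # Insert a couple of example rows; anything older than 30 days gets dropped
  # automatically by background merges.
  client.execute(
      "INSERT INTO logs (ts, service, level, message) VALUES",
      [
          (datetime(2024, 7, 1, 12, 0, 0), "api", "error", "upstream timeout"),
          (datetime(2024, 7, 1, 12, 0, 1), "api", "info", "request served in 42ms"),
      ],
  )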


Loki backed by S3 and queried via Grafana is a good, mostly FOSS solution. It installs pretty easily via Helm, and S3 gives a reasonable balance between cost, ease, and durability if you're in AWS already.


That's kind of what we were thinking about as a next stack to experiment with.

How difficult have you found scaling Loki?

One concern I have with our current setup is the frequency that we need to step in and manually intervene.


What kind of issues do you see where you have to step in and intervene? We have been running Loki for a long time, and it barely requires any manual intervention apart from version upgrades.


One thing that I find hard is configuring the retention time. The docs are unclear and there are about three configuration parameters involved. It scares me as a non-expert because I don't want to fill up all the disk space, but I also don't want to keep too few logs.


You can log straight to S3 and use lifecycle policies there to clean up old logs.
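For example, a minimal boto3 sketch of that approach; the bucket name, prefix and the 30-day expiration are placeholders.

  # Minimal sketch: expire old log objects via an S3 lifecycle rule.
  # Bucket name, prefix and the 30-day window are illustrative placeholders.
  import boto3

  s3 = boto3.client("s3")

  s3.put_bucket_lifecycle_configuration(
      Bucket="my-loki-chunks",
      LifecycleConfiguration={
          "Rules": [
              {
                  "ID": "expire-old-logs",
                  "Filter": {"Prefix": ""},    # apply to every object in the bucket
                  "Status": "Enabled",
                  "Expiration": {"Days": 30},  # delete objects after 30 days
              }
          ]
      },
  )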


Yes, I agree with rootsu. We have been running Loki in production for over two years and rarely have to step in and fix things. It's very stable and performant.


Loki is not bad, but PSA: don't use it with anything other than real AWS S3. The performance with MinIO is awful (and can't be good, because of how MinIO works). It might be a bit better with SeaweedFS.


Can you elaborate on why its performance is bad and what the reason is?


It simply stores objects as files on disk. It then distributes the chunks around the cluster (so you need to reassemble them when reading), and lastly, reading a file is not O(1): there is a "discovery" process to locate the objects, where the servers chat with each other, rather than having the location stored somewhere so the lookup is O(1).


Yeah this. Make sure you provision lots of memcache chunk caches as well because S3 is slow as shit.



I'm currently migrating from Elasticsearch to Loki, and it's much simpler to run while still meeting our requirements.

I think Elasticsearch had its day when it was used to derive metrics from logs and to run aggregate searches. But now that logging is often paired with metrics from Prometheus or a similar TSDB, we don't run such complex log queries anymore, so we find ourselves questioning whether it's worth running such an intensive and complex Elasticsearch installation.


We migrated away from an Elastic stack to a Loki stack, and were able to store an order of magnitude more data for less money. Maybe we did Elastic wrong, but we tried various managed solutions and always ran into limits. The new Loki stack has always given quicker answers too.


There are many ways to get Elastic wrong: index templates and field types, shard sizes, number of primary and replica shards, node heap size, and those are only a fraction. It's very easy to get Loki right in comparison.


Elasticsearch being strongly typed creates a lot of overhead, I think, since you need to manage the schema for all your logs. Loki only expects certain (indexed) fields, which are key/value pairs, so you can throw all kinds of data into it and only mess with the schema when you're querying.


Hi.

I was a big Elasticsearch user for several years. I wasn't convinced by Grafana Loki, which is far less expensive because the data is stored in object storage, but has poor read performance because it's not a real search engine.

Then I discovered Quickwit, which combines the advantages of both worlds. With Quickwit you can ingest logs the way you're already used to: through OTLP/gRPC, or with a log collector like Fluentbit or Vector, which can collect the stdout of your pods and forward it to Quickwit using the HTTP API, etc.

And you can then use Grafana with pretty much the same features available for the Elasticsearch datasource.

https://quickwit.io/docs/log-management/send-logs/using-flue...

The read performance of Quickwit is incredible because of its amazing indexing engine (which is kind of like Lucene rewritten in Rust), and the storage is very cheap and without limitations other than your cloud provider's capabilities.

https://quickwit.io/blog/quickwit-binance-story
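If anyone wants to kick the tires without wiring up Fluentbit or Vector first, here is a minimal Python sketch of pushing NDJSON straight to the ingest API. The host, port, index id and field names are assumptions for illustration; check the Quickwit docs for the exact endpoint and index mapping.

  # Minimal sketch: pushing a few log records to Quickwit's HTTP ingest API.
  # Host, port, index id ("k8s-logs") and field names are assumptions;
  # see the Quickwit docs for the exact endpoint and mapping.
  import json
  import requests

  QUICKWIT = "http://quickwit.example.internal:7280"
  INDEX_ID = "k8s-logs"

  records = [
      {"timestamp": "2024-07-01T12:00:00Z", "pod": "api-7d9f", "level": "error",
       "message": "upstream timeout"},
      {"timestamp": "2024-07-01T12:00:01Z", "pod": "api-7d9f", "level": "info",
       "message": "request served in 42ms"},
  ]

  # The ingest endpoint expects newline-delimited JSON.
  body = "\n".join(json.dumps(r) for r in records)
  resp = requests.post(f"{QUICKWIT}/api/v1/{INDEX_ID}/ingest", data=body)
  resp.raise_for_status()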


Careful with Loki if you ever plan to export logs out of it. It can only export a limited number of logs over a very limited time range, and searching over a long time range is a PITA if your log volume is medium to high.

That being said, if it's set-up-and-forget, then Loki is about as low-resource-friendly as you can get without spending big $$$ to maintain it.

ELK is a massive resource hog and is best kept in the cloud, but if storage and compute costs are irrelevant compared to search experience, then ELK is unbeatable.


If you need efficient exporting of any amount of logs, then take a look at VictoriaLogs, the log management system I work on. It is designed in a way that allows exporting an arbitrary number of logs via the standard querying API. See https://docs.victoriametrics.com/victorialogs/querying/#comm...
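For anyone curious what that looks like, a minimal Python sketch of such an export, assuming the default HTTP port (9428) and the /select/logsql/query endpoint from the docs; the LogsQL filter and field names are made up for illustration.

  # Minimal sketch: streaming logs out of VictoriaLogs via its HTTP query API.
  # Host, port (9428) and the LogsQL filter are assumptions for illustration.
  import requests

  resp = requests.get(
      "http://victorialogs.example.internal:9428/select/logsql/query",
      params={"query": 'namespace:"prod" error'},
      stream=True,  # results are returned as a stream of JSON lines
  )
  resp.raise_for_status()

  with open("exported-logs.jsonl", "w") as out:
      for line in resp.iter_lines():
          if line:
              out.write(line.decode("utf-8") + "\n")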


The Loki CLI does paging to get larger time windows back - I wish the Grafana GUI did something similar, but that is at least a workaround.


Since 2017, at two different companies, I’ve sent logs via UDP to Sumo Logic, via their collector hosted in our cluster. Sumo Logic is reasonably priced, super powerful, easy to use, and really flexible. Can’t recommend it enough.

We do log collection and per-service log rate limiting via https://github.com/NinesStack/logtailer to make sure we don't blow out the budget because someone deployed with debug logging enabled. Fluentbit doesn't support that per service. Logs are primarily for debugging, and we send metrics separately. Rate limiting logs encourages good logging practices as well, because people want to be sure they have the valuable logs when they need them. We dashboard which services are hitting the rate limit; this usually indicates something more deeply wrong that otherwise wouldn't have been caught.
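(Not logtailer's actual implementation, but a toy Python token-bucket sketch of the per-service rate limiting idea, with made-up budgets, in case the concept is unfamiliar.)

  # Toy sketch of per-service log rate limiting: each service gets a token
  # bucket; lines beyond the budget are dropped and counted so a dashboard
  # can show which services are hitting the limit. Budgets are made up.
  import time
  from collections import defaultdict

  RATE = 100.0   # allowed log lines per second, per service
  BURST = 500.0  # burst capacity

  buckets = defaultdict(lambda: {"tokens": BURST, "last": time.monotonic()})
  dropped = defaultdict(int)

  def allow(service: str) -> bool:
      b = buckets[service]
      now = time.monotonic()
      b["tokens"] = min(BURST, b["tokens"] + (now - b["last"]) * RATE)
      b["last"] = now
      if b["tokens"] >= 1.0:
          b["tokens"] -= 1.0
          return True
      dropped[service] += 1  # export this counter to your metrics system
      return False

  def forward(service: str, line: str) -> None:
      if allow(service):
          print(f"{service}: {line}")  # stand-in for shipping to the collector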

This logging setup gives us everything we’ve needed in seven years of production on two stacks.


hello,

idk ... imho. - as always

* keep things "stupid-simple" ~ rsyslog to some centralized (linux)system

* i want something "more modern" & with a GUI ~ grafana loki

* "more capable" but still FOSS ~ ELK

* i'm enterprisy, i want "more comfort" and i want to pay for my logging-solution / for the "peace of mind" ~ splunk

* i'm making a "hell of money" with that system so it better performs well, provides a lot of insight etc. and i don't care what i pay for it ~ dynatrace

did i miss something!? ;))

just my 0.02€


What to use as a centralized system for `keep things "stupid-simple"`? I'd recommend taking a look at VictoriaLogs - the system I work on. It is designed with the KISS principle in mind, so it is very easy to set up, operate and use. VictoriaLogs is a single relatively small executable, which runs optimally with minimal configuration. You need to specify only the directory for storing the ingested logs. All the other configs work well out of the box for most use cases. See https://docs.victoriametrics.com/victorialogs/

See how to send logs from rsyslog to VictoriaLogs - https://docs.victoriametrics.com/victorialogs/data-ingestion...


Stop using logging. You're using logging wrong, and there is no using it right. Logging (unqualified) is for temporary debugging data. It shouldn't go anywhere or be aggregated unless you need to be debugging, and then it should go to the machine of the developer doing the debugging.

Request logging should be done in a structured form. You don't need an indexing solution for this kind of request logging; it's vaguely timestamp-ordered, and that's about it. If you need to search it, it gets loaded into a structured data query engine like Spark, BigQuery, or Athena.
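A minimal sketch of what that looks like in practice: one JSON object per request, written to stdout, that loads cleanly into Spark/BigQuery/Athena later. The field names are illustrative, not a standard schema.

  # Minimal sketch: one JSON object per request, written to stdout and shipped
  # as-is to object storage for later querying. Field names are illustrative.
  import json
  import sys
  import time
  import uuid

  def log_request(method: str, path: str, status: int, duration_ms: float) -> None:
      record = {
          "ts": time.time(),
          "request_id": str(uuid.uuid4()),
          "method": method,
          "path": path,
          "status": status,
          "duration_ms": round(duration_ms, 2),
      }
      sys.stdout.write(json.dumps(record) + "\n")

  log_request("GET", "/api/orders/42", 200, 12.7)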

Audit logging belongs in a durable database, and it requires being written and committed before the request finishes serving. Logging frameworks that dump to disk or stdout obviously fail this requirement.


Yes. And when a customer complains about an issue, tell them that your software has no bugs and they should reevaluate what they are doing, because obviously, nothing unexpected happens in your software and any trail of execution that would help developers figure out what happened is just for lamers (see also Go's stance against stack traces).

/s, if it wasn't obvious enough.

I hate software that fails silently because "you don't need logs". It usually shows that it was developed by someone who never had to act as a sysadmin and just never learned the value of good logs. Thinking anyone can predict all the possible ways a service can fail is misguided at best.


Even bad logs are better than no logs. Because at least, you can grep the source code for the log message and you have a starting point for debugging.


In addition to that, the absence of logs is useful information as well: customer swears they submitted the form? OK, show me the request line…


I did not say "Don't log anything". I said "Stop using logging", and I'll qualify that (though my original post already did) with "so much". You're using logging for a ton of irrelevant actions, and you're trying to sort needles from haystacks when you're better off not piling on the hay.


I agree with the sentiment but logging is incredibly simple to implement and is almost completely vendor agnostic (it's up to the log aggregator to make sense of the log file, not the software to output data in a specific format).

In addition, there's so much software that logs out of the box, with limited support for metrics and traces, that it's not practical to ignore it.

In general though, I agree it's good to avoid if practical. Tracing is much more helpful in most cases especially when helpful context data is attached to traces.


> not the software to output data in a specific format

I like to output structured logs anyway. Not JSON, as it's not really human-readable, but at least in "logfmt" format [1], as it's both human-readable and easy to parse.

FYI, Go provides `log/slog` which offers structured logging in both formats (JSON and logfmt).

[1] - https://brandur.org/logfmt
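Rolling a minimal logfmt writer is also easy in languages that don't ship one. Here's a toy Python sketch with simplified quoting rules (real libraries handle escaping of embedded quotes and the other edge cases):

  # Toy logfmt emitter: quotes any value containing a space, no escaping of
  # embedded quotes. Real logfmt libraries handle the edge cases.
  def logfmt(**fields) -> str:
      parts = []
      for key, value in fields.items():
          value = str(value)
          if " " in value:
              value = f'"{value}"'
          parts.append(f"{key}={value}")
      return " ".join(parts)

  print(logfmt(level="info", msg="request served", path="/api/orders/42", duration_ms=12.7))
  # level=info msg="request served" path=/api/orders/42 duration_ms=12.7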


I meant more that there's not a single, standard way. Lots of software uses logging libraries with arbitrary serialization formats, but you generally need to take each schema into account (one piece of software might use "error" while another uses "err").

Compare that to OpenTelemetry traces (which many vendors support), which have standardized field names along with the ability to attach arbitrary data.


Filebeat + ELK stack is pretty good. You can easily run Filebeat as a DaemonSet and have it detect all your pods and logs. This is what I'm using right now.

Otherwise Loki. I've also seen it used and I think it's fine. That's more "pure" logging, whereas ELK has more advanced searching/indexing/dashboards, etc.


EFK. Loki just sucks if you’re used to Kibana searches.


It's quite sufficient for logs that you'd just cat/grep anyways.


...but its superpower is that Grafana can put metric panels right beside error log panels. If a graph shows you a spike in errors, you zoom in on the spike and the error logs panel zooms to the same period. So it's like cat/grep with a visual cue.


Except there is no grep? You can’t just type a search into the box on an unindexed field can you? It’s a bit more like Prometheus in that stuff has to be labelled and you just get all of it? Or am I using it wrong?


You can; try appending |= "your text string" to your LogQL query.


You can also pipe through a regex search, through a JSON parser, and so on. Very convenient and powerful. Comparing it to cat + grep (+ jq) seems fitting.
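To make that concrete, a minimal Python sketch hitting Loki's query_range HTTP API with a LogQL expression that chains a label selector, a |= line filter and a |~ regex; the host and the {app="api"} label are assumptions for illustration.

  # Minimal sketch: querying Loki's query_range API with a LogQL expression
  # that chains a label selector, a |= line filter and a |~ regex filter.
  # Host and the {app="api"} label are assumptions for illustration.
  import time
  import requests

  LOKI = "http://loki.example.internal:3100"
  query = '{app="api"} |= "timeout" |~ "status=(500|502|503)"'

  resp = requests.get(
      f"{LOKI}/loki/api/v1/query_range",
      params={
          "query": query,
          "start": int((time.time() - 3600) * 1e9),  # last hour, in nanoseconds
          "end": int(time.time() * 1e9),
          "limit": 100,
      },
  )
  resp.raise_for_status()

  for stream in resp.json()["data"]["result"]:
      for ts, line in stream["values"]:
          print(ts, line)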


I'm not entirely sold on it yet, but Quickwit seems to be the current trendy solution.


Latest HN thread on quickwit (Binance built a 100PB log service with Quickwit): https://news.ycombinator.com/item?id=40935701

I also wrote a benchmark on Loki vs. Quickwit: https://quickwit.io/blog/benchmarking-quickwit-loki


Thanks for the article, it was useful for me. You have a typo btw: "correclty used"


We pull in everything across our AWS infra via OTel collectors (metrics/logs/traces) and forward it to HyperDX (self-hosted, with storage backed by ClickHouse). You'll find that ClickHouse is a ton more efficient than Elastic when it comes to observability use cases, which helps a lot with keeping costs under control. There's typically less schema management needed on ClickHouse, as it has more flexible map types for chaotic structured logs. The OTel collector is also very flexible in adding filtering rules to throw out noisy messages.


I'm using Loki with S3 storage (not AWS, but OpenStack Swift that my hoster sells). Can't say I'm amazed: logcli is not pleasant to use, the Grafana integration is very bare-bones, and I spent more time than I'd like to make it work smoothly, but in the end it works, so no big issues either. Log retention is configured in Loki and storage costs are low; it compresses things well.


Can someone ELI5 why logging with Loki is better than using a database like SQLite or PostgreSQL?


Log data is often massive so you don’t want to be paying for a gigantic Postgres instance for storage. Logs also do not get updated, so the transactionality of Postgres is wasted.


Scalability and cost: Loki stores the actual log data on S3 and only keeps an index of a few fields. Log queries can then efficiently target the (hopefully small) set of files containing the data, and Loki can re-parse those specific files from S3 to display the log results.


Log workloads tend to be OLAP, so an OLTP database isn't very well optimized for the use case.

Usually you can relax consistency and sometimes even partition tolerance to get something cheaper/faster https://en.wikipedia.org/wiki/CAP_theorem


Has anyone tried running SigNoz? I see a lot of comments about ClickHouse, and they offer an observability stack built on ClickHouse.


At Propel, we moved from Honeycomb and CloudWatch to SigNoz and ClickHouse, and we're pretty happy with the transition. We've significantly reduced our costs and gotten similar functionality. We like the traces that SigNoz gives us. ClickHouse is a beast. The open-source SigNoz has some limitations that are kind of random, like the limit on alerts.


We removed those limitations from open source. Now you can create unlimited alerts and dashboards in SigNoz OSS too. P.S. I am one of the maintainers.


Cool. Thanks for the update on this. That's awesome


Is anyone else using CloudWatch? We aren't logging huge amounts, but I can't tell if I'm missing something from the rest of this thread…


We use CloudWatch for less-often-queried data, but it has minimal indexing, so it's not very good at lots of ad-hoc queries for discovery. More feature-rich platforms also allow things like ad-hoc metric generation and a more robust ingest pipeline.

In addition, CloudWatch metrics aren't great (for custom metrics), so you end up using CloudWatch + <something else>, when ideally you'd just use only <something else>.


Self-hosted Loki + Traces using Tempo + Grafana.


one word: axoflow



