Ask HN: How do you manage logs for your backend services?
250 points by tusharsoni on Jan 3, 2020 | 139 comments
I have been working on a few side projects that usually have Go backend services. Right now, they log straight to a file and it works just fine since I only have a couple of instances running.

If I were to scale this to more instances/servers, I would need a centralized logging service. I've considered ELK but it's fairly expensive to either self-host or buy as a managed subscription.

Any advice?

People jumped to recommending things before asking:

What's the volume of logs?

What's the expected usage? (debugging? audit?)

Are you using text, or structured logs? (or can you switch to structured easily?)

How far back do you want to keep the logs? How far back do you want to query them easily?

Are you in an environment with an existing simple solution? (GCP, AWS)

Well the question in the title was how do you manage your logs, so people are answering.

I'm cool with the answers about what people are actually using. But there are also recommendations, and that's what OP asked for at the end.

One should always read beyond the title. The actual request was for advice on what tusharsoni was thinking of doing.

Google's Stackdriver.

I've been using Google App Engine for some ten years now and I'm still dumbfounded that this is still an ongoing struggle for other platforms. It collates logs from a variety of sources, presents requests as a single unit, has sophisticated searching capabilities, and the UI doesn't suck.

Best of all, it just works... there's zero configuration on GAE. Such a time saver.

I find Stackdriver ugly compared to SumoLogic or Datadog. It also has ingestion limits: we lose logs when load becomes considerable.

From what I've seen at demo booths in conferences, Datadog's logging is impressive but also incredibly expensive. At the rate that we produce logs we'd be paying over $30k/mo. They claim that we can use log rehydration and not ingest all logs but then we can't really have alerts on them so what's the point in having them. Yes, I understand that you can look at the logs when things are going wrong but you can also know _from_ the logs when things are going wrong.

You can create metrics and alerts from filtered logs in Datadog.

The process would be: log data -> add index filters -> go to live tail and create a metric on a filtered log event -> create monitor on metric.

edit: also, you log 24 billion messages a month? I think that's what it would be to cost $30k for their platform per month

24B/month is less than 10k/second.

10k qps is certainly not a dev/test instance, but if you were to log all RPCs including database lookups, you could easily get there with even just 100-ish user-facing requests per second.

Indeed, I did the calculation myself as well. If you're logging every RPC including database lookups in production you might have a problem with your logging principles (signal vs noise etc) and if you really need that log data for every request but can't afford $1.27 per million you might have a product problem.
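For anyone who wants to redo the arithmetic in this subthread, here is a rough Python sketch. It assumes a 30-day month and takes the 24 billion messages and $1.27 per million figures quoted above at face value; the function names are made up for illustration.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumption: a 30-day month


def events_per_second(events_per_month):
    """Average event rate implied by a monthly volume."""
    return events_per_month / SECONDS_PER_MONTH


def monthly_cost(events_per_month, usd_per_million):
    """Monthly bill if every event is charged at a flat per-million price."""
    return events_per_month / 1_000_000 * usd_per_million


rate = events_per_second(24_000_000_000)
cost = monthly_cost(24_000_000_000, 1.27)
print(f"{rate:.0f} events/second")   # roughly 9,259, i.e. just under 10k/second
print(f"${cost:,.0f} per month")     # roughly $30,480
```

Real pricing usually has tiers and retention options, so treat this as a back-of-the-envelope check only.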

indeed, 10k qps in a large production instance is actually quite typical. Datadog is expensive when it comes to logs.

I just checked and we've got roughly half of the number you came up with (~11B). We store logs for 30 days instead of 7 which does increase the price.

Also it's worth noting that not all of the logs are necessarily our application logs but could also be audit logs of third party services we use (Okta, GSuite, AWS, etc.) to detect anomalies or potential breach attempts. We have a pretty comprehensive alerting pipeline based on our logs so we're unfortunately unable to pay that much for logging. I understand that this doesn't apply to everyone but we're able to run a self-hosted logging pipeline for a fraction of that cost without a dedicated team running it (the Infrastructure team, the team I'm on, currently maintains this pipeline along our other responsibilities).

At which conferences do you see Datadog booths?

Both AWS re:Invent and KubeCon had Datadog booths

AWS Re:Invent had a DataDog booth.

IMO Stackdriver on GCP is too expensive too.

IME the thing that usually gets people isn't that it's difficult, it's that it can be expensive. Didn't realize Stackdriver's free plan was so generous (50GB a month!), but its price ($0.50/GB) will scare away a lot of the folks I've consulted for.

I highly recommend Datadog's logging platform. One important lesson I've learned in my career is never to run your own observability platform if you can afford for someone else (whose entire product is observability) do it for you.

I've used ELK (managed and hosted), Splunk, NewRelic, Loki, and home grown local/cloud file logs and nothing has been as cheap, easy, and powerful as Datadog. They charge per million log events indexed but also allow you to exclude log events by patterns/source/etc and they ingest but ignore those rows (you pay $0.10/gb for those ignored logs).

The 12 factor way to do logging is very easy with Datadog, as you can tell the agent to ingest from stdout, containers, or file sources. The application then stays agnostic to the log aggregator, since the agent collects and sends logs to the platform.

Not only is it cheap and easy to set up, it also gives you the option to take advantage of the other features of Datadog that can be built on your log data. Metrics based on log parsing, alerts, configuration via terraform, etc become possible when you ship your logs to their platform.

I've seen our production apps log 10-20k messages per second without Datadog breaking a sweat but I'm not sure if they have any limits.

> One important lesson I've learned in my career is never to run your own observability platform if you can afford for someone else (whose entire product is observability) do it for you.

Why is that?

1. You have to keep your logging, monitoring, alerting infrastructure up yourself

2. You have to monitor, log, and alert on that infrastructure yourself with something else

3. You usually have to spend more money, both in hosting and in developer/ops time, on getting something mediocre compared to a provider that exclusively does observability as a product

4. Logging etc are a commodity and you should have a really good reason to build or run something yourself if it is not your core competency

5. Observability is much harder than it sounds and providing a cohesive platform that ties log events to traces to metrics to alerts, and keeping it up and highly available is a hard problem that you shouldn't do if observability isn't your core competency/product

Yep, if you can afford it make it someone else's problem. No team should need to have dedicated ops people that are responsible for keeping the logging platform running. This is a solved problem, and running it yourself costs the company more in engineering time (paying a person to do the job) than the SaaS solution.

E.g. 30K/month isn't much when you compare it to paying a sole engineer 15K/year + cloud costs of hosting your own logging solution + any licenses. There's also only a slim chance the home grown logging platform will be as resilient as a SaaS product.

I for one am very grateful to DataDog's support team as they've been very helpful in debugging logging problems we've had in the past.

I think more generic advice would be to focus on what brings value to your customers and makes your product better.

> I thought using loops was cheating, so I programmed my own using samples. I then thought using samples was cheating, so I recorded real drums. I then thought that programming it was cheating, so I learned to play drums for real. I then thought using bought drums was cheating, so I learned to make my own. I then thought using premade skins was cheating, so I killed a goat and skinned it. I then thought that that was cheating too, so I grew my own goat from a baby goat. I also think that is cheating, but I’m not sure where to go from here. I haven’t made any music lately, what with the goat farming and all.

We love JSON logs and previously just sent most of them to systemd's journald and used a custom tool to view them. But maybe a year ago Grafana released https://github.com/grafana/loki and we've been using it on https://oya.to/ ever since.

IIRC, the recommended way to integrate it with Grafana is via promtail but we weren't too keen on the added complexity of yet-another service in the middle so we developed a custom client library in Go to just send the logs straight to Loki (which we should probably open source at some point).

I don't think there's any fancy graph integration yet, but the Grafana explore tab with log level/severity and label filtering works well enough esp. since they introduced support for pretty printed JSON log payloads.

I have tried out loki, too. But I was not satisfied because you have to run an extra server for it. I have only a very small app so I was searching for a much simpler solution and found https://goaccess.io/ . The nice thing is that it is very flexible (you can pipe your logs in command line, but also run as a server) and if you are using standard tools like e.g. nginx or apache, the setup only takes an evening :)

Here are a few more monitoring tools listed https://turtle.community/topic/monitoring

It is always a question of complexity and how much time you want to spend :) At first I used Multitail to peek into server logs via ssh. Then I switched to GoAccess and if you really have a greater infrastructure I would maybe switch to Loki or ELK.

I've used GoAccess in the past, but didn't find it a good fit esp. since I prefer JSON logs. I don't know if it supports this - maybe I didn't read enough docs - but at the time, I remember it was easier to spend an hour whipping up a tool in Go that did exactly what I wanted.

As for loki, it's a separate server but the setup takes maybe 10-30 minutes of reading some docs, maybe changing some config files and the systemd unit file to keep it up and running is less than 10 lines (most of which is boilerplate).

Of course I have the benefit of a client library so I can just call a function on a struct at the end of a request with no need to worry about serializing the relevant data into some predetermined format, compression, etc.

> I've used GoAccess in the past, but didn't find it a good fit esp. since I prefer JSON logs. I don't know if it supports this

Yes, GoAccess supports HTML, JSON and CSV: goaccess --output=json

I thought goaccess (which is spectacular) was for HTTP logs, and not a general purpose logging solution

> we developed a custom client library in Go to just send the logs straight to Loki (which we should probably open source at some point).

I'd be interested in seeing this, we're likely to start using Loki at Sourcegraph soon and would likely want this approach as well.

I'll push it to https://github.com/oyato/promqueen when it's ready. Hopefully I will have time to work on it over the next week or two.

Awesome! Much appreciated :)

I don't have a lot of experience with it yet, but Loki looks promising for small projects. You'd still use it as a centralized logging server, but it's not as resource-expensive as something like self-hosting ELK.

I've only been using it for my homelab, and haven't even moved everything to it yet - but I like it so far. I already use Grafana+influxdb for metrics so having logs in the same interface is nice.


I've been benching it in production.

On a 1core 2gb vps it can do around 1200 logs/sec.

Compared to ELK, we're saving several hundred $ per month.

If you can't do 1200 logs/sec with that hardware + elasticsearch, you've done something horrendously wrong.

You can do 1200 logs/sec with a container limited to 1 core and 256mb of memory running elasticsearch.

The savings are in RAM costs, since the ELK stack basically won't run in under 4GB. Larger logs with stack traces are also much slower in ELK than in Loki. There's a reason there are no cheap ELK services.

Really depends on the size of each log and the complexity of the tokenizers. With 1 core you have a time budget of less than a millisecond per log statement for processing, and that doesn't include the relevant ES/Lucene operational overheads.

This is extremely doable for some workloads, but not others. Really depends on what you're stuffing in.
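The "less than a millisecond" figure follows directly from the 1200 logs/sec number mentioned upthread. A tiny Python sketch of that calculation, assuming the single core is fully available for log processing (the function name is just for illustration):

```python
def budget_ms_per_log(logs_per_second, cores=1):
    """Milliseconds of CPU time available per log line if all cores stay busy."""
    return cores * 1000.0 / logs_per_second


budget = budget_ms_per_log(1200)
print(f"{budget:.2f} ms per log line")  # about 0.83 ms on one core
```

Anything the indexer does per line (parsing, tokenizing, writing segments) has to fit inside that budget on average.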

We use Loki in production. It's pretty good! There are still some issues with memory usage (particularly the logcli tool) and query performance, but it's a great start.

Loki is really hard to scale up

ELK is expensive, in terms of hardware and time to configure/manage. But it does scale to large volumes in a way that your current ad hoc 1970s logging will not.

I had a good experience with a local decentralized logging system, essentially daemontools-style service logs, that then fed in to ELK. ELK provided the bulk storage and analysis; and the local daemontools logs provided the immediate on-machine per-individual-service recent log access, and decoupled logging from the network connection to logstash.

* http://jdebp.uk./Softwares/nosh/guide/commands/export-to-rsy...

* http://jdebp.uk./Softwares/nosh/guide/commands/follow-log-di...

One of the advantages of this approach is that one can do the daemontools-style logging first, very simply, without centralization, and with comparatively minor expense; and then tack on ELK later, when volume or number of services gets large enough, without having to alter the daemontools-style logging when doing so.

Of course, it can be something else other than ELK, fed from the daemontools-style logs.

One thing that I recommend against, ELK or no, is laboriously re-treading the path of the late 1970s and the 1980s, starting from that "logging straight to a file". Skip straight through to the 1990s. (-:

* http://jdebp.uk./FGA/do-not-use-logrotate.html

For POSIX systems, syslog is the standard way. For systemd-based ones, journald may be preferable because of its additional features; both support sending of logs to a remote server. I'd suggest to avoid custom logging facilities (including just writing into a file) whenever possible, since maintaining many services with each using custom logging becomes quite painful.
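As a rough illustration of sending application logs over the syslog protocol instead of to a custom file, here is a minimal Python sketch using the standard library's `logging.handlers.SysLogHandler`. To keep it self-contained, a local UDP socket on an ephemeral port stands in for the remote syslog server; in a real setup the address would be your central host (an assumption for this example, as is the logger name `myservice`).

```python
import logging
import logging.handlers
import socket

# Stand-in for a remote syslog server: a UDP socket on an ephemeral local port.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))
receiver.settimeout(2.0)

# A (host, port) tuple makes SysLogHandler send datagrams over UDP.
handler = logging.handlers.SysLogHandler(
    address=receiver.getsockname(),
    facility=logging.handlers.SysLogHandler.LOG_USER,
)
logger = logging.getLogger("myservice")  # hypothetical service name
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.warning("disk usage high on /var")

# Each datagram starts with a <PRI> prefix (facility * 8 + severity),
# here <12> for facility "user" (1) and severity "warning" (4).
datagram, _ = receiver.recvfrom(4096)
print(datagram)

handler.close()
receiver.close()
```

The application only knows "log to syslog"; where the messages end up (rsyslog, journald, a remote collector) is decided outside the code.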

Graylog is a really nice free product, and although it can look a bit scary, it's not that hard to get set up, especially since the introduction of the Elasticsearch REST API, meaning you no longer have to make Graylog join the ES cluster as a non-data node.

You can spin it up on a single machine with ES and start playing with it. I usually forward all of my logs to rsyslog, then that duplicates the logs out - they go to flat file storage, and to graylog for analysis.

We run on GCP and I have to say that Stackdriver Logging with the google fluentd agent is actually pretty good and relatively cheap. I don't like Stackdriver's metrics at all, but Logging feels more like part of GCP. Fluentd has given me far fewer problems than logstash or filebeat did when we were running an ELK setup. The search UI is obviously nowhere near as nice as something like Kibana, but it gets the job done. If you aren't on GCP, it's not worth it, but if you are, the whole setup is "good enough" that you might not need to set something more sophisticated up (I'm still looking at Loki though because I can't help myself).

We use flume forwarding to s3 and then athena to query the logs. Flume processes each logfile with morphline (which is akin to logstash) and parses each rawlog into json before pushing to s3.

We used to run an elk stack but hit a bottleneck crunching logs with logstash. We found flume's morphline to be performant enough and the nice property of flume is that you can fanout and write to multiple datasources.

It's ironic, but because Athena is kind of a flaky product (lots of weird Hive Exceptions querying lots of data) and because it's really only good at searching if you know what you're looking for, we're considering having flume write to an elasticsearch cluster (but still persisting logs to long-term storage on s3).

I've always wondered why companies for whom logs are important but not their core focus / product bother implementing stuff like this themselves. Surely a saas service that does just logs can do it cheaper and with more features? Is it compliance?

We are in the unfortunate circumstance where we have high traffic but are budget conscious. S3 is cheap and Athena is as well. We significantly reduced cost moving from a fairly large elk cluster on ec2 to a handful of flume machines running morphline. We’ve looked at datadog and scalyr and even went so far as to implement a flume sink to scalyr but the scalyr quote was way too high.

What are you using to connect flume to s3? The HDFS sink?

Yah, with the s3a connector

IMO ELK is worth it. All the filtering, sorting and graphing means you can easily do post mortems (which makes you more likely to do them at all), you can get detailed performance and other metrics without spending days setting up robust testing, and it makes it easy to correlate events from your entire infrastructure. Just make sure NTP is enabled everywhere :)

I would budget some time after setting it up to weed out uselessly verbose logging and rotating old logs out of RAM and onto cheap storage. You'll love it.

Papertrail/Timber.io - cheapest way to aggregate logs and has simple search functionality.

Scalyr - my personal favorite. Just a little more costly than Papertrail, but can do as much as any full service SaaS - powerful queries, dashboards, and alerts. Takes some practice to learn, but their support is very helpful.

Sumologic - Fully featured log aggregator. It works pretty well but their UI is super duper annoying. You have to use their tabbing system on their page; you can't open multiple dashboards/logs in multiple browser tabs. For the money, I personally preferred Scalyr, but this is a reasonable option.

Splunk - a great place to $plunk down your $$$$. I think their cheapest plan was $60k/yr, but I will admit that it was easy to get going and use and also had the most features. It's not a bad bang for your buck as long as you have lots of bucks to spend.

We use and like https://www.scalyr.com/.

Same for us, though they have outages more often than I'd prefer. Half the time when we get a PagerDuty alert, it's because Scalyr is screwed up and not ingesting our logs, not anything wrong with our systems.

But when it's not having an outage, it's lightning fast and easy to use. I especially like that you can do basically 100% of the configuration (aside from stuff like billing info) by uploading JSON files to them, so you can keep all your parsing and alerting configuration in your own version control system.

Second this. Love scalyr. Very fast/efficient setup and fast/efficient interface. Great API which allows config to be stored in version control.

ditto. we've got maybe 4,000 logs/second going to scalyr. it's fast, has good features, and is easily searchable (with structured logs).

yup, also my favorite of all the saas platforms.

I was just about to start looking into doing this myself and for the foreseeable future, I'll probably just use `dsh`... since I'm a cheapskate, have been trying to reduce my usage on cloud tools, and I just found out about it today:


Once installed, change the default from rsh to ssh where it's installed e.g. `/usr/local/Cellar/dsh/0.25.10/etc/dsh.conf`

Then set up a group for machines, in this case I'm calling it "web"

> mkdir -p .dsh/group

> echo "some-domain-or-ip" >> .dsh/group/web

> echo "some-domain-or-ip-2" >> .dsh/group/web

Then fire off a command:

> dsh -M -g web -c -- tail -f /var/log/nginx/access.log

> some-domain-or-ip [... log message here ...]

> some-domain-or-ip-2 [... log message here ...]

The flags I used are:

-M "Show machine name"

-g "group to use"

-c run concurrently

That's about as easy as I can think of... /shrug while I like the idea of centralized logging services I haven't really found one I actually cared for... most just run rampant with noise, slow UIs, and strange query languages no one wants to learn. I guess I could start a machine up with `dsh` on it in my cluster and then write the output from dsh to a file... easy centralized logging on the cheap, ha!

Wow. Maybe just use syslog for this case? It's designed for centralized logging

Yes... Uhh.. `dsh` works with tailing syslog too. I don't know why it wouldn't?

Sorry for my example not being general enough?

That hurt to read.

Presumably you're running your service as a container, and presumably you only run one master process per container. Print your logs to stdout/stderr, and then you can use any log-capturing mechanism (fluent bit?) to stream them to any log-indexing system.
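A minimal sketch of the "print to stdout and let something else collect it" approach, in Python. It writes one JSON object per line so that a collector can parse fields without custom rules. The `JsonFormatter` class, the `make_logger` helper, and the `fields` convention are all invented for this example; they are not part of any particular log shipper.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record):
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        # Structured fields passed as extra={"fields": {...}}; this is a
        # convention made up for this sketch, not a logging-module feature.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def make_logger(name, stream=sys.stdout):
    """Build a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate output via the root logger
    return logger


log = make_logger("api")
log.info("request finished", extra={"fields": {"path": "/health", "status": 200}})
```

The application never learns where the logs go; the container runtime or an agent reading stdout handles routing.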

Tweak your system so you can selectively turn on log collection, and collect useful metrics 24/7. Logs almost never matter until you're diagnosing an active problem, anyway, and then only a few logs will be necessary. Differentiating the type and quality of particular log messages is also very useful.

I used Papertrail at my last job, search function works well, and it was easy to use.

Seconding Papertrail. I don't love it but it works and it's cheap enough and I don't need to love every tool I use.

I do constantly wonder if there's a logging/searching solution small and lightweight enough to fit on the cheapest DO droplet. $7/gb/mo on Papertrail vs $5/mo with a 25GB SSD? That'd be a no-brainer, but ELK and Graylog won't run with that low of hardware.

Depends on how risky you're willing to be with your log data. Something like what you're describing could lose all of your data with a single instance or hard drive failure while a saas like Paper trail has your data replicated across data centers.

You should first consider what you need logging for.

Is it just ad-hoc text "debug" logs? Structured transaction or request logs to feed some sort of analytics? This impacts the backend trade-offs quite a bit.

Are you trying to do monitoring via logs? Don't. Export metrics via Prometheus (or something like it); it's much cheaper and more dependable than extracting metrics via a log collection system.
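To make the "metrics, not logs" point concrete, here is a stdlib-only Python sketch of a counter rendered in the Prometheus text exposition format. It is deliberately not the official client library; `RequestCounter` and the metric name are hypothetical, and a real service would normally use a maintained client and expose the text over HTTP.

```python
from collections import Counter


class RequestCounter:
    """Toy counter with a single 'status' label, rendered as Prometheus text."""

    def __init__(self, name):
        self.name = name
        self.counts = Counter()

    def inc(self, status):
        """Increment instead of writing a log line per request."""
        self.counts[status] += 1

    def render(self):
        lines = [f"# TYPE {self.name} counter"]
        for status, value in sorted(self.counts.items()):
            lines.append(f'{self.name}{{status="{status}"}} {value}')
        return "\n".join(lines) + "\n"


requests_total = RequestCounter("http_requests_total")
for status in ("200", "200", "500"):
    requests_total.inc(status)

print(requests_total.render(), end="")
```

Three requests cost three integer increments here, rather than three log lines that have to be shipped, indexed, and aggregated later.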

If you haven't read the chapter of 12factor on logging I highly recommend it https://www.12factor.net/logs

This is coming from an ops person, do that and I'll be happy. Essentially the goal is to externalize all your log routing to stdout, then wrap tooling around your application to route it wherever you want it to go. It's geared toward heroku but same rules apply in docker land and more traditional VM environments.

We either send logs ECS -> Cloudwatch for our AWS stuff or docker swarm -> fluentd -> s3 for an onsite appliance (also anything syslog ends up in Sumologic through a collector we host in the environment). From there the logs get consumed by our SIEM (in our case Sumologic, which is a great service). We keep logs hot for 90 days and ship archives back to s3 for a year. Set up the lifecycle management stuff, keeping log files forever is not only a waste but can actively hurt you if they ever get exposed in a breach.

I highly recommend formatting your logs as JSON (even if you have to do it by hand like in Apache). If you do that and go to Splunk, Sumologic, or ELK all your fields will be populated either automatically or with a single filter. Saves writing or buying your own and if you add a field there's no action for you to take.

nginx/apache default logging is complete trash. Look at the variables they expose, there's a lot of stuff in there you'll want to add to the log format to make your life loads easier. I have a format I use I'd be willing to send you if you want it.

I don't recommend logstash (the L in ELK), ever (except maybe if you're java across the board). It's way too damn heavy to run on a workload host, fluentd is much lighter (and not java, why would I deploy java for a system tool ever?). Maybe as a network collector you throw syslog at but that would be it.

For your use case Sumologic's free service would be great. You can get I think up to 200m a day with a week's retention for free and you'll get exposed to what a SIEM can do for you (ingestion rate and retention period are typically how hosted solutions are billed; you'll need an email address on a non-free email domain to get a free account from them). IMO you have to get to some fairly insane log rates for me to ever recommend running the ELK stack yourself; it needs way too much care and feeding if you want to run it correctly with good security.

I'm interested in these webserver configs you got for logs, please

This is what I previously used for apache 2.4 (this is very out of date for what I use now since the company I'm with isn't apache but the gist is there). I don't have the nginx one readily in front of me.

I snipped a few fields out of this that were 'me setup' specific so small chance the formatting is off. The field variables for nginx are a hell of a lot less obtuse than the apache ones.

LogFormat "{ \"application_name\":\"%v\", \"application_canonical-port\":\"%p\", \"application_client-ip\":\"%a\", \"application_local-ip\":\"%A\", \"application_local-port\":\"%{local}p\", \"application_pid\":\"%P\", \"fastly-client-ip\":\"%{fastly-client-ip}i\", \"request_x-forwarded-for\":\"%{X-Forwarded-For}i\", \"request_x-tracer\":\"%{X-TRACER}i\", \"request_geo-ip\":\"%{GEOIP_ADDR}e\", \"request_geo-continent\":\"%{GEOIP_CONTINENT_CODE}e\", \"request_geo-country\":\"%{GEOIP_COUNTRY_CODE}e\", \"request_host\":\"%{Host}i\", \"request_auth-user\":\"%u\", \"request_content-type\":\"%{Content-Type}i\", \"request_timestamp\":\"%t\", \"request_uri\":\"%r\", \"request_referer\":\"%{Referer}i\", \"request_user-agent\":\"%{User-Agent}i\", \"response_code\":\"%>s\", \"response_bytes\":\"%b\", \"response_seconds\":\"%T\", \"response_microseconds\":\"%D\", \"response_content-type\":\"%{Content-Type}o\" }" extendedcombined

It'll come out looking like this (thrown through jsonlint.com)


"application_name": "%v", # Server name from the vhost config (not necessarily the hostname if you have aliases and depending on how you handle bare ip requests and if you use named vhosts

"application_canonical-port": "%p", # 80 or 443 depending on if you're tls or not, I never figured out the point of this and it's actively confusing

"application_client-ip": "%a", # The client ip calling apache, not necessarily the users IP if you have lbs/reverse proxies in the mix

"application_local-ip": "%A", # Host ip the application is running on

"application_local-port": "%{local}p", # Actual port the application is running on

"application_pid": "%P", # Apache pid that handled the request

"fastly-client-ip": "%{fastly-client-ip}i", # This is a header that the fastly service will add to tell you the actual client ip, we don't use them anymore but they're good (expensive). They actively defend this field, I spent an hour or so poking at it to see if I could add false data without success (of course if you don't protect your origins someone could falsify the data there instead).

"request_x-forwarded-for": "%{X-Forwarded-For}i", # X-forwarded-for, occasionally useful if I suspect the client was monkeying with request data, there's a similar header for the protocol the load balancer saw in AWS land

"request_x-tracer": "%{X-TRACER}i", # I used to use this pattern for when we wanted to set an arbitrary header we could trace a request with, usually by QA

"request_geo-ip": "%{GEOIP_ADDR}e", # Relevant to mod_geoip, I don't use this anymore

"request_geo-continent": "%{GEOIP_CONTINENT_CODE}e", # Relevant to mod_geoip, I don't use this anymore

"request_geo-country": "%{GEOIP_COUNTRY_CODE}e", # Relevant to mod_geoip, I don't use this anymore

"request_host": "%{Host}i", # Hostname the client sent

"request_auth-user": "%u", # Relevant if you're using basic auth, usually not relevant

"request_content-type": "%{Content-Type}i", # Content-Type the client requested, comes up if we expect request monkeying

"request_timestamp": "%t", # Request timestamp

"request_uri": "%r", # Request URI, I don't remember if this logs the get fields or not

"request_referer": "%{Referer}i", # Request referer header if there is one

"request_user-agent": "%{User-Agent}i", # Request user agent

"response_code": "%>s", # Response we got

"response_bytes": "%b", # Response size

"response_seconds": "%T", # Response time (not necessarily how long it took the client to get it) in seconds

"response_microseconds": "%D", # Same thing in microseconds

"response_content-type": "%{Content-Type}o" # Response returned content-type header, occasionally relevant if we expect monkeying


When this makes its way into Splunk or ELK they will automatically parse out the header fields. Sumologic will do it if you pass a query through "| json auto nodrop"

Sumologic will handle sub structures (if you're passing an object where one of the fields is a hash object, not relevant for apache/nginx), I don't know about the others. Sumologic will intelligently handle non-json data (like the timestamps and tagging rsyslog adds).
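If you just want to poke at JSON access logs like these without a hosted tool, a few lines of Python go a long way. This sketch uses the field names from the LogFormat above (`response_code`, `response_microseconds`, `request_uri`); the sample lines are made up, and `slow_or_failed` and its thresholds are arbitrary choices for the example.

```python
import json


def slow_or_failed(lines, min_status=500, min_micros=1_000_000):
    """Yield parsed entries that are server errors or slower than min_micros."""
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        # The LogFormat quotes every value, so numbers arrive as strings.
        status = int(entry.get("response_code", 0))
        micros = int(entry.get("response_microseconds", 0))
        if status >= min_status or micros >= min_micros:
            yield entry


# Made-up sample lines, trimmed to the three fields used here.
sample = [
    '{"request_uri":"GET /health HTTP/1.1","response_code":"200","response_microseconds":"850"}',
    '{"request_uri":"GET /report HTTP/1.1","response_code":"200","response_microseconds":"2400000"}',
    '{"request_uri":"POST /login HTTP/1.1","response_code":"502","response_microseconds":"1200"}',
    "not json at all",
]

hits = list(slow_or_failed(sample))
for entry in hits:
    print(entry["response_code"], entry["request_uri"])
```

Because every field already has a name, adding a new one to the LogFormat needs no change to a parser like this.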


>I have a format I use I'd be willing to send you if you want it.

Throw the nginx one on a paste in please

Keep logs local, and send only errors and noteworthy events to a central server like sentry.io, rollbar, or something else.

If you log everything to a single server, as you noticed yourself, it will become very expensive and difficult to filter out the stuff that you really need when looking for errors or when you try to hunt down a specific problem.
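A rough Python sketch of that split, using two handlers on one logger: everything goes to the local handler, and only ERROR and above reaches the one that would forward to a central service. `ListHandler` is a hypothetical in-memory stand-in so the example runs on its own; in practice the local one would be a file handler and the central one a client for whatever error tracker you use.

```python
import logging


class ListHandler(logging.Handler):
    """Hypothetical stand-in for a real destination; keeps records in memory."""

    def __init__(self, level=logging.NOTSET):
        super().__init__(level)
        self.records = []

    def emit(self, record):
        self.records.append(record)


local = ListHandler()                       # would be a FileHandler in practice
central = ListHandler(level=logging.ERROR)  # would forward to the central service

logger = logging.getLogger("orders")  # hypothetical service name
logger.setLevel(logging.DEBUG)
logger.propagate = False
logger.addHandler(local)
logger.addHandler(central)

logger.info("cart updated")
logger.error("payment gateway timeout")

print(len(local.records), "local,", len(central.records), "central")  # 2 local, 1 central
```

The handler level does the filtering, so application code keeps calling the same logger everywhere.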

I used sentry.io for a project and it worked well for what it is - error tracking. But when users reach out with issues, it is useful to look at logs and having them local means I would have to look at each log file per instance which could be on multiple servers.

When running on AWS, I generally just use CloudWatch. For non-AWS hosts, and/or when I need something more feature rich: DataDog is a solid hosted service with reasonable pricing.

It is expensive. We forward logs to Splunk (we run our own instances). Splunk is really solid. All the logs are JSON and require certain fields. We use it for trend analysis, alerting, graphs, reports, and digging into production issues. It digs through terabytes of data relatively quickly.

I think Splunk is the reason I can’t recommend ELK; Splunk simply makes ELK look almost non-functional. This was years ago, but we tried Splunk for a year, then switched to ELK due to pricing. Our number of searches dropped to almost zero after that, because usability was so poor in comparison.

As a result we didn’t utilize the data we had, or in many cases reverted to using grep.

If you want a cheaper alternative, Humio has become rather good and is relatively easy to use.

Splunk can get really expensive. And there's something about renting your own data that rubs me the wrong way.

AWS CloudWatch Logs has come a long way. The new Insights UI is great. No need for us to manage ELK for logs anymore.

Looking into it for the current ELK replacement, looks great so far (especially simple enough query language). Offtopic: what do you use for load balancing logs in S3? I’m seeing basically two in-AWS options: Athena (but it lacks built-in fancy UI) or ship LB logs to the CloudWatch too.

Do all developers have AWS console access to check cloudwatch logs?

For cost, stackdriver works the best. Idk if they have a custom agent but Fluentd works great to ship logs to your platform of choice.

AWS cloudwatch is also good for cost but has much slower query speeds (the slowness makes me think it's not Lucene based?).

If you could splurge on hosted services, my favorite logging goes to datadog and it has all the other bits of observability built in for down the road.

https://landscape.cncf.io/ is usually my go-to if you wanna find best-in-class solutions to host yourself.

I've had the misfortune of setting up Application Insights for logging across a distributed system.

It's awful. The integration with most (C#) logging frameworks is horrible, and the adaptive sampling, which is hard to turn off, means that Application Insights randomly drops logs, which makes any sort of distributed tracing of events really difficult.

To top that, there's a delay of 5-10 minutes from when the logs are written until they're queryable, which is a huge pain when debugging your setup.

DISCLAIMER - I'm the CEO of Coralogix.com

ELK can be pretty expensive and is kind of a pain to manage. For simple use cases, though, hosted ELK by AWS should do the trick and wouldn't cost too much. Small startups and dev shops should choose a SaaS logging tool, since most of them start at $10-$30, which is cheaper than anything you'll spin up on your own.

Looking at the market right now, looks like logs, metrics, and SIEM are going to combine in the next 2-3 years.


- Is there a tool that lets you navigate structured logs easily, without bringing in heavy machinery like the ELK stack, and that works in the terminal?

- One that can also filter for only ERRORS plus a few lines above/below?
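Not a polished tool, but the second request is small enough to sketch in Python: a `grep -C`-style filter for newline-delimited JSON. It assumes each line is a JSON object with a `level` key, which is a guess about the log shape rather than anything stated in the question.

```python
import json
from collections import deque
from typing import Iterable, Iterator


def errors_with_context(lines: Iterable[str], before: int = 2, after: int = 2,
                        level_key: str = "level") -> Iterator[str]:
    """Yield ERROR lines from newline-delimited JSON plus surrounding context lines."""
    recent = deque(maxlen=before)  # lines seen but not yet emitted
    remaining_after = 0            # context lines still owed after the last error
    for line in lines:
        line = line.rstrip("\n")
        try:
            is_error = str(json.loads(line).get(level_key, "")).upper() == "ERROR"
        except (ValueError, AttributeError):
            is_error = False  # not JSON (or not an object): treat as plain context
        if is_error:
            yield from recent
            recent.clear()
            yield line
            remaining_after = after
        elif remaining_after > 0:
            yield line
            remaining_after -= 1
        else:
            recent.append(line)
```

In a terminal you would feed it `sys.stdin` and print each yielded line, so it can sit at the end of a `tail -f app.log | python filter.py` pipeline (the script name is just a placeholder).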

For the endpoint itself, consider switching to syslog so you get a bunch of stuff for free (auto-rotation, docker logging, ...) and it's easier to change decisions later (pipe to splunk/elk/...). It's thought out and pretty easy!
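As a rough illustration of how little code the syslog route needs, here is a Python sketch using the standard library's `SysLogHandler`. The logger name and the default address are placeholders; on many Linux hosts you would pass `address="/dev/log"` to talk to the local daemon instead of UDP port 514.

```python
import logging
import logging.handlers


def make_syslog_logger(name: str, address=("localhost", 514)) -> logging.Logger:
    """Return a logger that sends records to a syslog daemon (UDP when given a host/port)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.handlers.SysLogHandler(address=address)
    # The "name: " prefix plays the role of the syslog tag, so the daemon can route on it.
    handler.setFormatter(logging.Formatter(f"{name}: %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

Rotation, forwarding, and any later switch to splunk/elk then become syslog daemon configuration rather than application changes.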

Expanding a bit here: Splunk is free for < 500MB/day and has one of the easiest UIs, so you can run it on the same node or elsewhere, and deploy via docker to skip most of the weird setup. So just record syslog, volume mount it into a Splunk docker container, enable syslog reading + nix metrics, and done. (Too bad there's no free cloud for the same...)

We work with a ton of log tools as a company bringing graph analytics + vis to investigations of largely log & event data, and I've increasingly shifted to that when I need easy basics.

fluentd is great. You can set up forwarding nodes that relay logs to one or multiple masters, which then persist into whatever layer(s) you want. Tolerance and failover are baked in. There are tons of connectors and, best of all, the fluentd logging driver ships with docker, so there's almost zero setup to get your container logs to fluentd. It also works nicely with kubernetes!

I haven't used it in maybe 3 years or so, so some of this could be misremembered a bit, but I didn't have a great experience with fluentd.

Trying to do much customization was kind of painful, the config file structure is kind of confusing and the docs were sparse and differ depending on the plugin you're using, and there's no validation of the config so if you have any of the arguments slightly wrong it'll fail silently.

The modularity of routing and filtering logs seems like it would be great, but it turns out not to be all that flexible, you really have to follow the framework's preconceived idea of how the processing pipeline should work. I forget the details but we were trying to do some processing on the logs, like ensuring they're valid json and adding a few fields to them based on other metadata source, and it would have required writing our own plugin. In other ways it felt too modular for its own good, like it's up to individual plugins to implement anything they need and the core doesn't seem to provide you with much. Even things like an error handling or retry framework are not built in, so if a plugin throws an unexpected exception it's not handled gracefully. One plugin we were using would just stop sending logs until we restarted the fluentd process if it ever got an http failure from being temporarily unable to connect to the api, because they hadn't built in their own retry/recovering mechanism. Granted this was for a pretty new logging SAAS service we were trying out, if you're sticking with the basics like files and syslog it's probably more robust, but at that point you also might not need the flexibility promised by a tool like fluentd.

Despite their advice not to write a lot of Ruby code, the fluentd plugins are Ruby code, so I ended up just writing one big custom plugin with all the rewrite/routing logic in it. It was very simple to do, and so much easier to read than the convoluted fluentd config files you end up with to do anything mildly complicated with filtering.

My main advice is avoid ELK. I have no clue how Elastic managed to convince the world that Elasticsearch should be the default log database when it is _terrible_ for logs.

If you're logging structured JSON, then you'll hit a ton of issues - Elasticsearch can't handle, say, one record with {foo: 123} and another with {foo: "abc"} - it'll choke on the different types and 400 error on ingest.

Even if you try to coerce values to string, you'll hit problems with nested values like {foo: "abc"} vs {foo: {bar: "baz"}}. So now you have to coerce to something like {"foo.bar": "baz"}, and then you have to escape the dot when querying...

Finally, if you solve all the above, you'll hit problems with high cardinality of unique fields. Especially if you are accepting arbitrary nested values, at some point someone is going to log something like {users_by_email: {<email>: <user id>}} and now you have one field for every unique email...

These problems are tractable but a massive hassle to get right without majorly locking down what applications are allowed to log in terms of structured data.
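To make the coercion step concrete, here is a small Python sketch of the "flatten nested values to dotted keys and turn every leaf into a string" idea described above. It only illustrates the transformation; whether dotted keys behave well in a given Elasticsearch index depends on the mapping, and it does nothing about the field-cardinality problem.

```python
from typing import Any, Dict


def flatten_for_indexing(record: Dict[str, Any], sep: str = ".",
                         prefix: str = "") -> Dict[str, str]:
    """Flatten nested dicts into dotted keys and coerce every leaf value to a string."""
    flat: Dict[str, str] = {}
    for key, value in record.items():
        full_key = f"{prefix}{sep}{key}" if prefix else str(key)
        if isinstance(value, dict):
            # Recurse so {"foo": {"bar": "baz"}} becomes {"foo.bar": "baz"}.
            flat.update(flatten_for_indexing(value, sep, full_key))
        else:
            # Strings everywhere means 123 and "abc" no longer clash on type.
            flat[full_key] = str(value)
    return flat
```

With this, `{foo: 123}` and `{foo: "abc"}` both end up as string fields, and `{foo: {bar: "baz"}}` becomes a separate `foo.bar` key.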

As a separate issue, Elasticsearch does fuzzy keyword matching by default. eg. if you search for "foo" you'll get results for "fooing" and "fooed" but not necessarily for "foobar" (because it's splitting by word - and the characters it considers part of one "word" aren't obvious). This is great if you want to search a product catalog, but horrible when you're trying to find an exact string in a technical context. Yes you can rephrase your query to avoid it, but that's not the default and most people won't know how to structure their query perfectly to avoid all the footguns.

Finally, as others are saying here, Elasticsearch is just painful and heavy to manage.

As for what to use instead...I don't have good answers. I haven't exhaustively checked out all the other products being mentioned, but in my experience a lot of them will have similar issues around field cardinality, which means it'll always be possible to cripple your database with bad data. This is less of an issue if you're just running a few services, but in larger orgs it can be nigh impossible to keep ahead of.

For smaller scale deployments, don't underestimate just shipping everything to timestamp+service named files as newline-delimited JSON, and using jq and grep for search and a cronjob to delete/archive old files.
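A minimal Python sketch of that small-scale setup, assuming a `<date>_<service>.ndjson` naming scheme (the scheme and function names are illustrative, not something prescribed above):

```python
import json
import time
from pathlib import Path
from typing import Optional


def append_event(log_dir: Path, service: str, event: dict,
                 now: Optional[float] = None) -> Path:
    """Append one event as a JSON line to <log_dir>/<YYYY-MM-DD>_<service>.ndjson."""
    now = time.time() if now is None else now
    day = time.strftime("%Y-%m-%d", time.gmtime(now))
    path = log_dir / f"{day}_{service}.ndjson"
    log_dir.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps({"ts": now, "service": service, **event}) + "\n")
    return path


def delete_older_than(log_dir: Path, max_age_days: int,
                      now: Optional[float] = None) -> list:
    """Delete .ndjson files last modified more than max_age_days ago; return their names."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    for path in sorted(log_dir.glob("*.ndjson")):
        if path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return removed
```

Searching is then something like `jq 'select(.level == "error")' 2020-01-03_api.ndjson` or a plain `grep`, and `delete_older_than` is the part you would call from the cronjob.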

When it comes to the "read from local source and ship elsewhere" component, I've had the best luck with filebeat (specifically for files -> kafka). Most others tend to read as fast as they can then buffer in memory or disk if they can't write immediately, whereas filebeat will only read the source file as fast as it can write downstream.
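The "only read as fast as you can write downstream" behaviour is essentially backpressure. Here is a toy Python sketch of the idea with a bounded queue; it is not how filebeat is implemented, and `send` stands in for whatever downstream writer (e.g. a kafka producer) you actually use.

```python
import queue
import threading
from typing import Callable, Iterable, Optional


def ship_with_backpressure(source: Iterable[str], send: Callable[[str], None],
                           max_buffered: int = 100) -> int:
    """Read lines from source only as fast as send() delivers them; return the count shipped."""
    buffer: queue.Queue = queue.Queue(maxsize=max_buffered)
    done = object()
    shipped = 0
    error: Optional[Exception] = None

    def writer() -> None:
        nonlocal shipped, error
        while True:
            item = buffer.get()
            if item is done:
                return
            if error is not None:
                continue  # after a failure, just drain so the reader never blocks
            try:
                send(item)
                shipped += 1
            except Exception as exc:
                error = exc

    thread = threading.Thread(target=writer, daemon=True)
    thread.start()
    for line in source:
        if error is not None:
            break
        buffer.put(line)  # blocks while the queue is full, which pauses reading
    buffer.put(done)
    thread.join()
    if error is not None:
        raise error
    return shipped
```

The key property is that at most `max_buffered` lines ever sit in memory: a slow downstream slows the reader instead of growing a buffer.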

Note however that all such components are awful to configure, as they seek to provide a (often turing-complete) configuration file for transforming your logs before shipping them, and like most turing-complete configuration scripts, they're less readable and more buggy than the equivalent would've been in any real programming language.

Ok, rant over. Sorry, a good logging system is kind of my white whale.

Second every line of this. If you can find someone else to handle all this stuff with quality that you can accept - do that and focus on your app instead. All that field types and cardinality mess is “easily” prevented with a bit of self-discipline, which you’re going to develop anyway if you’re really going to use your logs for incident response or statistics.

I came to the same conclusion about ELK. But you know, enterprisey + consultants = uptake.

Try Scalyr.

Don't forget that the forward slash "/" is a special character in Elasticsearch. So try searching service logs by HTTP route and you are completely f&^$ed.

Our (https://bestprice.gr/) services/“programs” generate three different types of events:

- Short events (no longer than ~1k in size) where the cost to generate and transmit is very low(sent to dedicated service via UDP). We can generate dozens of them before we need to care about the cost to do so. They are binary-encoded. The service that receives those datagrams generates JSON representations of those events and forwards them to all connected clients(firehose) and also publishes them to a special TANK partition.

- Events of arbitrary size and complexity, often JSON encoded but not always -- they are published to TANK(https://github.com/phaistos-networks/TANK) partitions

- Timing samples. We capture timing traces for requests that may take longer-than-expected time to be processed, and random samples from various requests for other reasons. They capture the full context of a request(complete with annotations, hierarchies, etc). Those are also persisted to TANK topics

So, effectively, everything’s available on TANK. This has all sorts of benefits. Some of them include:

- We can, and do, have all sorts of consumers that process those generated events, looking for interesting/unexpected events and reacting to them (e.g. notifying whoever needs to know about them)

- It’s trivial to query those TANK topics using `tank-cli`, like so:

  tank-cli -b <tank-endpoint> -t apache/0 get -T T-2h+15m -f "GET /search"  | grep -a Googlebot
This will fetch all events from a 15-minute window starting 2 hours ago that include “GET /search”, and grep then keeps only the ones that mention Googlebot.

All told, we are very happy with our setup, and if we were to start over, we’d do it the same way again.
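For readers curious about the "short binary-encoded events over UDP, turned into JSON by the receiver" pattern described above, here is a Python sketch. The wire layout is entirely hypothetical and is not the format bestprice.gr or TANK actually uses.

```python
import json
import struct

# Hypothetical wire layout (NOT the real format used in the setup above):
# event type (uint16), unix timestamp in ms (uint64), payload length (uint16), UTF-8 payload.
HEADER = struct.Struct("!HQH")


def encode_event(event_type: int, ts_ms: int, payload: str) -> bytes:
    """Pack one event into a compact datagram."""
    body = payload.encode("utf-8")
    return HEADER.pack(event_type, ts_ms, len(body)) + body


def datagram_to_json(datagram: bytes) -> str:
    """Decode a datagram produced by encode_event into a JSON string."""
    event_type, ts_ms, length = HEADER.unpack_from(datagram)
    body = datagram[HEADER.size:HEADER.size + length]
    if len(body) != length:
        raise ValueError("truncated datagram")
    return json.dumps({"type": event_type, "ts_ms": ts_ms, "payload": body.decode("utf-8")})
```

The sender side is a single `sock.sendto(encode_event(...), (host, port))` on a UDP socket, which is what keeps the per-event cost low; the receiving service does the more expensive JSON conversion once for all consumers.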

Allow me to rep the tool I help build: Coralogix, which is a managed log analytics service. You haven't said what your budget is, but our pricing starts at $15/month to handle 5 GB/month of logs - certainly cheaper than running ELK yourself.


Very happy with Coralogix. We use the service on many of our Heroku dynos.

I've recently spent a few days adding proper instrumentation to a .NET Core side project using CloudWatch Logs.

I don't know anything about the Go ecosystem, but if there is a nice solution for structured logging then CloudWatch Logs is very easy to implement and very cheap, you can easily make decent dashboards with it, and if needed in the future you can forward the logs on to Elasticsearch.

I'm using a library called Serilog in my project to log everything in a consistent structured format; it gets all kinds of metadata automatically appended and the JSON payload winds up in CloudWatch Logs. Then I've got a couple of custom metrics to measure throughput, as well as some log filter metrics to track latency of the service and its downstream dependencies.

It works very well and was surprisingly quick to put together. I believe cost wise my usage level is covered pretty much for free. Can't complain!
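The idea behind deriving latency metrics from structured logs can be sketched offline in Python. This is not CloudWatch's metric filter syntax, just the same computation done by hand, and the `latency_ms` field name is an assumption.

```python
import json
import math
from typing import Iterable


def latency_summary(lines: Iterable[str], field: str = "latency_ms") -> dict:
    """Summarise a numeric field across JSON log lines, skipping lines without it."""
    values = []
    for line in lines:
        try:
            value = json.loads(line).get(field)
        except (ValueError, AttributeError):
            continue  # not JSON, or not a JSON object
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            values.append(float(value))
    if not values:
        return {"count": 0}
    values.sort()
    p95_index = max(0, math.ceil(0.95 * len(values)) - 1)  # nearest-rank percentile
    return {"count": len(values), "avg": sum(values) / len(values), "p95": values[p95_index]}
```

This only works because the logs are structured: with free-text lines you would be writing regexes per message format instead of reading one field.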

We used to use an ELK cluster but it was always breaking - I'm sure this stuff can be reliable but we just wanted an easy way to search ~300GB of logs (10GB/day)

Somehow I came across scalyr and it's just phenomenally fast - and cost less than our ELK cluster. Definitely worth trying if it provides the features you need.

ELK has way too much care and feeding for it to be worth it until your log ingestion rates start to get crazy.

I used to use Logentries for a while. It's essentially a hosted ELK. I think they got bought-out or something, and the service went downhill. Don't remember exactly, but it was slow and clunky, and support wasn't great either.

Then discovered Scalyr. It was awesome. The UI isn't pretty, but it's super-powerful. It's fast. You have to format your logs to make the most of it, but it's worth it.

Unfortunately their alerting wasn't as flexible at the time (for example, it wouldn't include contextual info from the matching logs that triggered the alert). Besides that, we decided to consolidate things and move to Datadog.

Datadog is pretty great. The monitoring and alerting features are solid. They then added logging, APM and keep adding more. It's not that cheap, but overall works great for us.

I'm pretty late to this party, but I've been using rsyslog and "EK" (skip the L - it's way too slow and resource hungry).

rsyslog / syslog-ng handles shipping logs to a central server and it's dead simple to keep local logs and a central log at the same time. Every language can spit logs to syslog very quickly. And then you can use plugins to inject your logs from rsyslog directly into Elasticsearch, which is incredibly fast.

Other critiques about ES still apply especially when it comes to managing conflicting keys in structured logs, but most complaints about fragility and scaling are because of Logstash, which I agree, is Terrible for logging.

I've written this up in detail if anyone is interested.

Yes please give some configuration and performance details.

I don't have anything regarding performance detail, but I wrote this detailed post about configuration.


At trivago, we rely on the ELK stack to support our logging pipeline. We use a mix of filebeat and Gollum (github.com/trivago/gollum) to write logs into Kafka (usually we encode them using Protocol Buffers); these are later read by Logstash and written into our ELK cluster.

For us, ELK has scaled quite well both in terms of queries/s and writes/s and we ingest ~1.5TB of logs daily just in our on-premise infrastructure. The performance/monitoring team (where I work) consists of only 4 people and although we take care of our ELK cluster it is not our only job (nor a component that requires constant attention).

Nobody mentioning Splunk? Too obvious or am I not understanding the ask?

I’ve heard from different people that it is trapware. You get it, invest time, and build processes around it. And when you start to get a valuable amount of data, your bill grows into deep space. At the same time they said that it is good, but very expensive.

Splunk is great but very expensive, like you mentioned. They're actively getting disrupted by Sentinel and Backstory and Elastic though, so either pricing will change or the recommendation will change.

I don't recommend Splunk because I don't want someone recommending we run Splunk onsite ever again.

Running your own log infrastructure is the absolute worst. As in, we had to put a staff devops engineer on just that for 2 months the last time the company I was at needed an upgrade.

Correct, it's not uncommon to have one engineer full time on an internal Splunk infrastructure at a mediumish org. Not everything has to be Cloud. This is what people are paid to do.

Yeah we're not putting a person on logging when we can outsource it for a fraction of the cost.

If it's small, text files that you rotate per day and delete after 1-3 month(s).

If it's big, Graylog is great.

If it's too big, /dev/null, best logs gathering since 1971.
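For the "small" case above, daily rotation with automatic deletion is already in Python's standard library. A minimal sketch, where the 90-day retention is just one value inside the 1-3 month range mentioned:

```python
import logging
import logging.handlers


def make_daily_rotating_logger(name: str, path: str, keep_days: int = 90) -> logging.Logger:
    """Log to `path`, start a new file at midnight, and keep at most `keep_days` old files."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.handlers.TimedRotatingFileHandler(
        path, when="midnight", backupCount=keep_days, encoding="utf-8"
    )
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

Rotated files get a date suffix appended to the base name, and anything beyond `keep_days` rotated files is removed by the handler itself, so no separate cleanup job is needed on a single host.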

does /dev/null support sharding?

Alright! /dev/null is webscale!

I have used sumologic, your servers send the logs to sumo and you can setup your alerts/monitoring/dashboards etc on their side.

We use https://www.humio.com/ and we love it

I have had a TODO on my list to explore using Toshi + Tantivy (Rust projects as a replacement for Elasticsearch) and using it to supplement a simpler (ripgrep + agrind) file-based search on logs centralized using rsyslog. Haven’t gotten around to playing with them yet. Hopefully sometime this year.

I could not find an equivalent to Kibana though :-(

Check out the free version of Graylog.

Graylog, it's a purpose-built package for log management over Elasticsearch! We transport our logs over ActiveMQ from our apps and they're read off the broker via an OpenWire input. The setup can handle several thousand writes per second on modest hardware.

How much data is being generated by your logs?

If it’s less than 500MB/day you may use Splunk for free forever.

We're using Seq as log server which you must host yourself. This fine and free product is from the same guys as serilog which is known in the dotnet world.


In the old days, I had some daily cron jobs to upload the logs to a centralized place. Then you use grep to find which file contains the log you want. Because you are fine with files, I guess you don't need to make it too complicated.

The less experience you have the more you should pay a SaaS. Then when you gain more experience you start using frameworks like self managed ELK stack more. If that is not enough at some point you can roll out your own.

LogDNA is best and it's cheap enough you don't have to worry about it.

Been happy with LogDNA as well!

Here’s some inspiration from Avery Pennarun: https://apenwarr.ca/log/20190216

We used Sumologic for a long time and still do. You can query your logs in an SQL-way (you can do joins for instance) and last I checked they have a free tier.

Their ability to figure out json for mostly-but-not-entirely json log records in the query line as well as LogReduce and LogCompare are great. Hot take, they're way better than Splunk.

I built CabinJS after I was frustrated with all existing solutions.


It's unclear if it handles logs from sources other than browsers and node?

We use DataDog. We run their logging agent in a container which tails the log file and syncs with their cloud service.

I used logdna in a previous company. This is a really nice hosted, cheap logging service (like papertrail)

I am currently trying out logdna. Their charting is underwhelming.

Cloudwatch. Insights makes it so much easier to query but be sure to log out JSON.

What do you guys do with docker logs? They seem to just accumulate over time.

Cloudwatch logs + insights or Fluentd - kinesis firehose - elasticsearch

Every developer for themselves because the devops guy.

Clickhouse + Grafana or Prometheus + Victoriametrics

Try Graylog!!!!!

We use getseq.net, a pretty smooth aggregator and it keeps it all on your own infra, so no/less GDPR concerns. Proven really powerful with tons of services pushing logs to it.


syslog / stderr -> Kafka -> centralized ELK

Files / log rotation is completely the wrong approach because log entries are mostly innately structured, rich data that occurs at a specific time. Serializing and then parsing log lines again is wasted effort. Messaging is a better fit than lines in files, which create log-management headaches like not rotating, losing messages on rotation, compressing/decompressing, and a lengthy soup of destructured data that fills up local disks.

Logging to files on local disks is wrong and often creates privacy problems. Logging to cloud services is also expensive, a legal quagmire, and raises data portability concerns.
