
Ask HN: How do you manage logs for your backend services? - tusharsoni
I have been working on a few side projects that usually have Go backend services. Right now, they log straight to a file and it works just fine since I only have a couple of instances running.

If I were to scale this to more instances/servers, I would need a centralized logging service. I've considered ELK but it's fairly expensive to either self-host or buy as a managed subscription.

Any advice?
======
viraptor
People jumped to recommending things before asking:

What's the volume of logs?

What's the expected usage? (debugging? audit?)

Are you using text, or structured logs? (or can you switch to structured
easily?)

How far back do you want to keep the logs? How far back do you want to query
them easily?

Are you in an environment with an existing simple solution? (GCP, AWS)

~~~
twalling
Well the question in the title was how do you manage your logs, so people are
answering.

~~~
viraptor
I'm cool with the answers about what people are actually using. But there are
also recommendations, and that's what OP asked for at the end.

------
stickfigure
Google's Stackdriver.

I've been using Google App Engine for some ten years now and I'm still
dumbfounded that this is still an ongoing struggle for other platforms. It
collates logs from a variety of sources, presents requests as a single unit,
has sophisticated searching capabilities, and the UI doesn't suck.

Best of all, it just works... there's zero configuration on GAE. Such a time
saver.

~~~
batter
I find Stackdriver ugly compared to SumoLogic or Datadog. It also has
ingestion limits; we lose logs when the load becomes considerable.

~~~
dmlittle
From what I've seen at demo booths at conferences, Datadog's logging is
impressive but also incredibly expensive. At the rate that we produce logs,
we'd be paying over $30k/mo. They claim that we can use log rehydration and
not ingest all logs, but then we can't really have alerts on them, so what's
the point of having them? Yes, I understand that you can look at the logs when
things are going wrong, but you can also know _from_ the logs when things are
going wrong.

~~~
jjjensen90
You can create metrics and alerts from filtered logs in Datadog.

The process would be: log data -> add index filters -> go to live tail and
create a metric on a filtered log event -> create monitor on metric.

edit: also, you log 24 billion messages a month? I think that's what it would
take to reach $30k/month on their platform

~~~
boulos
24B/month is less than 10k/second: 24e9 messages / (30 days × 86,400 s/day) ≈ 9,260 messages/second.

10k qps is certainly not a dev/test instance, but if you were to log all RPCs
including database lookups, you could easily get there with even just 100-ish
user-facing requests per second.

~~~
jjjensen90
Indeed, I did the calculation myself as well. If you're logging every RPC,
including database lookups, in production, you might have a problem with your
logging principles (signal vs. noise, etc.), and if you really need that log
data for every request but can't afford $1.27 per million, you might have a
product problem.

------
jjjensen90
I highly recommend Datadog's logging platform. One important lesson I've
learned in my career is never to run your own observability platform if you
can afford for someone else (whose entire product is observability) to do it
for you.

I've used ELK (managed and hosted), Splunk, NewRelic, Loki, and home-grown
local/cloud file logs, and nothing has been as cheap, easy, and powerful as
Datadog. They charge per million log events indexed, but also allow you to
exclude log events by pattern/source/etc.; they ingest but ignore those
rows (you pay $0.10/GB for those ignored logs).

The 12-factor way to do logging is very easy with Datadog, as you can tell the
agent to ingest from stdout, containers, or file sources; the application
stays agnostic of the log aggregator, since the agent collects and sends logs
to the platform.
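
For what it's worth, a minimal sketch of that 12-factor setup in Go, standard library only (the field names are my own convention here, nothing Datadog-specific):

    package main

    import (
        "encoding/json"
        "os"
        "time"
    )

    // entry is one log line; the fields are just a convention I made up,
    // not something any particular agent requires.
    type entry struct {
        Time    time.Time         `json:"time"`
        Level   string            `json:"level"`
        Message string            `json:"msg"`
        Fields  map[string]string `json:"fields,omitempty"`
    }

    func logJSON(level, msg string, fields map[string]string) {
        // One JSON object per line on stdout; the agent tails and ships it.
        json.NewEncoder(os.Stdout).Encode(entry{
            Time:    time.Now().UTC(),
            Level:   level,
            Message: msg,
            Fields:  fields,
        })
    }

    func main() {
        logJSON("info", "request handled", map[string]string{
            "path":   "/search",
            "status": "200",
        })
    }

Point the agent at stdout (or the container) and each line ships as one structured event, with the app knowing nothing about the aggregator.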

Not only is it cheap and easy to set up, it also gives you the option to take
advantage of the other features of Datadog that can be built on your log data.
Metrics based on log parsing, alerts, configuration via terraform, etc become
possible when you ship your logs to their platform.

I've seen our production apps log 10-20k messages per second without Datadog
breaking a sweat but I'm not sure if they have any limits.

~~~
firstbabylonian
> One important lesson I've learned in my career is never to run your own
> observability platform if you can afford for someone else (whose entire
> product is observability) do it for you.

Why is that?

~~~
jjjensen90
1. You have to keep your logging, monitoring, and alerting infrastructure up
yourself

2. You have to monitor, log, and alert on that infrastructure yourself with
something else

3. You usually have to spend more money, both in hosting and in developer/ops
time, on getting something mediocre compared to a provider that exclusively
does observability as a product

4. Logging etc. are a commodity, and you should have a really good reason to
build or run something yourself if it is not your core competency

5. Observability is much harder than it sounds; providing a cohesive
platform that ties log events to traces to metrics to alerts, and keeping it
up and highly available, is a hard problem that you shouldn't take on if
observability isn't your core competency/product

~~~
tedk-42
Yep, if you can afford it, make it someone else's problem. No team should need
dedicated ops people responsible for keeping the logging platform running.
This is a solved problem, and it costs the company more in engineering time
(paying a person to do the job) than the SaaS solution does.

E.g. $30k/month isn't much when you compare it to paying a dedicated engineer
+ the cloud costs of hosting your own logging solution + any licenses.
There's also a slim chance the home-grown logging platform will be as
resilient as a SaaS product.

I for one am very grateful to DataDog's support team as they've been very
helpful in debugging logging problems we've had in the past.

------
WinonaRyder
We love JSON logs and previously just sent most of them to systemd's journald,
using a custom tool to view them. But maybe a year ago Grafana released
[https://github.com/grafana/loki](https://github.com/grafana/loki) and we've
been using it on [https://oya.to/](https://oya.to/) ever since.

IIRC, the recommended way to integrate it with Grafana is via promtail, but we
weren't too keen on the added complexity of yet another service in the middle,
so we developed a custom client library in Go to send the logs straight
to Loki (which we should probably open source at some point); a rough sketch
of the idea is below.
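
Something like this, assuming Loki's JSON push endpoint (POST /loki/api/v1/push taking streams of [timestamp-in-ns, line] pairs; the URL and labels are made up, and a real client would batch and retry):

    package main

    import (
        "bytes"
        "encoding/json"
        "fmt"
        "net/http"
        "time"
    )

    type lokiStream struct {
        Stream map[string]string `json:"stream"` // labels
        Values [][2]string       `json:"values"` // pairs of [timestamp in ns, log line]
    }

    type lokiPush struct {
        Streams []lokiStream `json:"streams"`
    }

    func pushToLoki(base, line string, labels map[string]string) error {
        ts := fmt.Sprintf("%d", time.Now().UnixNano())
        body, err := json.Marshal(lokiPush{
            Streams: []lokiStream{{Stream: labels, Values: [][2]string{{ts, line}}}},
        })
        if err != nil {
            return err
        }
        resp, err := http.Post(base+"/loki/api/v1/push", "application/json", bytes.NewReader(body))
        if err != nil {
            return err
        }
        defer resp.Body.Close()
        if resp.StatusCode/100 != 2 {
            return fmt.Errorf("loki push failed: %s", resp.Status)
        }
        return nil
    }

    func main() {
        _ = pushToLoki("http://localhost:3100", `{"level":"info","msg":"hello"}`,
            map[string]string{"job": "myapp", "env": "prod"})
    }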

I don't think there's any fancy graph integration yet, but the Grafana explore
tab with log level/severity and label filtering works well enough esp. since
they introduced support for pretty printed JSON log payloads.

~~~
thojest
I have tried out Loki too, but I was not satisfied because you have to run an
extra server for it. I have only a very small app, so I was searching for a
much simpler solution and found [https://goaccess.io/](https://goaccess.io/) .
The nice thing is that it is very flexible (you can pipe your logs in on the
command line, but also run it as a server), and if you are using standard
tools like nginx or apache, the setup only takes an evening :)

Here are a few more monitoring tools listed
[https://turtle.community/topic/monitoring](https://turtle.community/topic/monitoring)

It is always a question of complexity and how much time you want to spend :)
At first I used Multitail to peek into server logs via ssh. Then I switched to
GoAccess and if you really have a greater infrastructure I would maybe switch
to Loki or ELK.

~~~
WinonaRyder
I've used GoAccess in the past, but didn't find it a good fit esp. since I
prefer JSON logs. I don't know if it supports this - maybe I didn't read
enough docs - but at the time, I remember it was easier to spend an hour
whipping up a tool in Go that did exactly what I wanted.

As for loki, it's a separate server but the setup takes maybe 10-30 minutes of
reading some docs, maybe changing some config files and the systemd unit file
to keep it up and running is less than 10 lines (most of which is
boilerplate).

Of course, I have the benefit of a client library, so I can just call a
function on a struct at the end of a request with no need to worry about
serializing the relevant data into some predetermined format, compression, etc.

~~~
darekkay
> I've used GoAccess in the past, but didn't find it a good fit esp. since I
> prefer JSON logs. I don't know if it supports this

Yes, GoAccess supports HTML, JSON, and CSV output: `goaccess --output=json`

------
johntash
I don't have a lot of experience with it yet, but Loki looks promising for
small projects. You'd still use it as a centralized logging server, but it's
not as resource-expensive as something like self-hosting ELK.

I've only been using it for my homelab, and haven't even moved everything to
it yet - but I like it so far. I already use Grafana+influxdb for metrics so
having logs in the same interface is nice.

[https://grafana.com/oss/loki/](https://grafana.com/oss/loki/)

~~~
tmikaeld
I've been benching it in production.

On a 1-core, 2 GB VPS it can do around 1,200 logs/sec.

Compared to ELK, we're saving several hundred $ per month.

~~~
chucky_z
If you can't do 1200 logs/sec with that hardware + elasticsearch, you've done
something horrendously wrong.

You can do 1200 logs/sec with a container limited to 1 core and 256mb of
memory running elasticsearch.

~~~
tmikaeld
The savings are in RAM costs, since the ELK stack basically won't run under
4 GB. And larger logs with stack traces are much slower compared to Loki.
There's a reason there are no cheap ELK services.

------
JdeBP
ELK _is_ expensive, in terms of hardware and time to configure/manage. But it
does scale to large volumes in a way that your current ad hoc 1970s logging
will not.

I had a good experience with a local decentralized logging system, essentially
daemontools-style service logs, that then fed into ELK. ELK provided the bulk
storage and analysis; the local daemontools logs provided the immediate
on-machine, per-individual-service access to recent logs, and decoupled
logging from the network connection to logstash.

* [http://jdebp.uk./Softwares/nosh/guide/commands/export-to-rsyslog.xml](http://jdebp.uk./Softwares/nosh/guide/commands/export-to-rsyslog.xml)

* [http://jdebp.uk./Softwares/nosh/guide/commands/follow-log-directories.xml](http://jdebp.uk./Softwares/nosh/guide/commands/follow-log-directories.xml)

One of the advantages of this approach is that one can do the daemontools-
style logging _first_, very simply, without centralization, and with
comparatively minor expense; and then tack on ELK later, when the volume or
number of services gets large enough, without having to alter the
daemontools-style logging when doing so.

Of course, it can be something else other than ELK, fed from the daemontools-
style logs.

One thing that I recommend _against_, ELK or no, is laboriously re-treading
the path of the late 1970s and the 1980s, starting from that "logging straight
to a file". Skip straight through to the 1990s. (-:

* [http://jdebp.uk./FGA/do-not-use-logrotate.html](http://jdebp.uk./FGA/do-not-use-logrotate.html)

------
defanor
For POSIX systems, syslog is the standard way. For systemd-based ones,
journald may be preferable because of its additional features; both support
sending logs to a remote server. I'd suggest avoiding custom logging
facilities (including just writing to a file) whenever possible, since
maintaining many services that each use custom logging becomes quite
painful.
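
For what it's worth, Go can talk to the local syslog daemon from the standard library (log/syslog; it's frozen and Unix-only, and the "myservice" tag below is made up), so a minimal sketch looks like:

    package main

    import (
        "log"
        "log/syslog"
    )

    func main() {
        // Connect to the local syslog daemon; "myservice" is the tag.
        w, err := syslog.New(syslog.LOG_INFO|syslog.LOG_LOCAL0, "myservice")
        if err != nil {
            log.Fatal(err)
        }
        defer w.Close()

        w.Info("service started") // severity-specific helpers: Info, Warning, Err, ...

        // Or route the standard logger through syslog wholesale.
        log.SetOutput(w)
        log.Println("this line goes to syslog too")
    }

For a remote server you'd use syslog.Dial with a network address instead of syslog.New, or just let the local daemon handle the forwarding.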

------
iDemonix
Graylog is a really nice free product, and although it can look a bit scary,
it's not that hard to get set up - especially since the introduction of the
Elasticsearch REST API, meaning you no longer have to make Graylog join the ES
cluster as a non-data node.

You can spin it up on a single machine with ES and start playing with it. I
usually forward all of my logs to rsyslog, then that duplicates the logs out -
they go to flat file storage, and to graylog for analysis.

------
thraxil
We run on GCP and I have to say that Stackdriver Logging with the google
fluentd agent is actually pretty good and relatively cheap. I don't like
Stackdriver's metrics at all, but Logging feels more like part of GCP. Fluentd
has given me far fewer problems than logstash or filebeat did when we were
running an ELK setup. The search UI is obviously nowhere near as nice as
something like Kibana, but it gets the job done. If you aren't on GCP, it's
not worth it, but if you are, the whole setup is "good enough" that you might
not need to set something more sophisticated up (I'm still looking at Loki
though because I can't help myself).

------
natebutler
We use flume forwarding to s3 and then athena to query the logs. Flume
processes each logfile with morphline (which is akin to logstash) and parses
each raw log into json before pushing to s3.

We used to run an elk stack but hit a bottleneck crunching logs with logstash.
We found flume's morphline to be performant enough and the nice property of
flume is that you can fanout and write to multiple datasources.

It's ironic, but because Athena is kind of a flaky product (lots of weird Hive
exceptions when querying lots of data), and because it's really only good at
searching if you know what you're looking for, we're considering having flume
write to an elasticsearch cluster (while still persisting logs to long-term
storage on s3).

~~~
RhodesianHunter
I've always wondered why companies for whom logs are important, but not their
core focus/product, bother implementing stuff like this themselves. Surely a
SaaS service that does just logs can do it cheaper and with more features? Is
it compliance?

~~~
natebutler
We are in the unfortunate circumstance where we have high traffic but are
budget conscious. S3 is cheap and Athena is as well. We significantly reduced
cost moving from a fairly large elk cluster on ec2 to a handful of flume
machines running morphline. We’ve looked at datadog and scalyr and even went
so far as to implement a flume sink to scalyr but the scalyr quote was way too
high.

------
l0b0
IMO ELK is worth it. All the filtering, sorting and graphing means you can
_easily_ do post mortems (which makes you more likely to do them at all), you
can get detailed performance and other metrics without spending days setting
up robust testing, and it makes it easy to correlate events from your entire
infrastructure. Just make sure NTP is enabled everywhere :)

I would budget some time after setting it up to weed out uselessly verbose
logging and to rotate old logs out of RAM and onto cheap storage. You'll love
it.

------
particlesplus
Papertrail/Timber.io - cheapest way to aggregate logs and has simple search
functionality.

Scalyr - my personal favorite. Just a little more costly than Papertrail, but
can do as much as any full service SaaS - powerful queries, dashboards, and
alerts. Takes some practice to learn, but their support is very helpful.

Sumologic - fully featured log aggregator. It works pretty well, but their UI
is super duper annoying: you have to use their tabbing system on their page,
and you can't open multiple dashboards/logs in multiple browser tabs. For the
money, I personally preferred Scalyr, but this is a reasonable option.

Splunk - a great place to $plunk down your $$$$. I think their cheapest plan
was $60k/yr, but I will admit that it was easy to get going and use and also
had the most features. It's not a bad bang for your buck as long as you have
lots of bucks to spend.

------
zygy
We use and like [https://www.scalyr.com/](https://www.scalyr.com/).

~~~
koreth1
Same for us, though they have outages more often than I'd prefer. Half the
time when we get a PagerDuty alert, it's because Scalyr is screwed up and not
ingesting our logs, not anything wrong with our systems.

But when it's not having an outage, it's lightning fast and easy to use. I
especially like that you can do basically 100% of the configuration (aside
from stuff like billing info) by uploading JSON files to them, so you can keep
all your parsing and alerting configuration in your own version control
system.

------
rubyn00bie
I was just about to start looking into doing this myself, and for the
foreseeable future I'll probably just use `dsh`... since I'm a cheapskate,
have been trying to reduce my usage of cloud tools, and I just found out about
it today:

[https://www.netfort.gr.jp/~dancer/software/dsh.html.en](https://www.netfort.gr.jp/~dancer/software/dsh.html.en)

Once installed, change the default from rsh to ssh where it's installed, e.g.
`/usr/local/Cellar/dsh/0.25.10/etc/dsh.conf`

Then set up a group of machines; in this case I'm calling it "web":

    mkdir -p .dsh/group
    echo "some-domain-or-ip" >> .dsh/group/web
    echo "some-domain-or-ip-2" >> .dsh/group/web

Then fire off a command:

    dsh -M -g web -c -- tail -f /var/log/nginx/access.log
    some-domain-or-ip [... log message here ...]
    some-domain-or-ip-2 [... log message here ...]

The flags I used are:

-M "Show machine name"

-g "Group to use"

-c "Run concurrently"

That's about as easy as I can think of... /shrug. While I like the idea of
centralized logging services, I haven't really found one I actually cared
for... most just run rampant with noise, slow UIs, and strange query languages
no one wants to learn. I guess I could start a machine up with `dsh` on it in
my cluster and then write the output from dsh to a file... easy centralized
logging on the cheap, ha!

~~~
edoceo
Wow. Maybe just use syslog for this case? It's designed for centralized
logging.

~~~
rubyn00bie
Yes... Uhh.. `dsh` works with tailing syslog too. I don't know why it
wouldn't?

Sorry for my example not being general enough?

------
peterwwillis
Presumably you're running your service as a container, and presumably you only
run one master process per container. Print your logs to stdout/stderr, and
then you can use any log-capturing mechanism (fluent bit?) to stream them to
any log-indexing system.

Tweak your system so you can selectively turn on log collection, and collect
useful metrics 24/7. Logs almost never matter until you're diagnosing an
active problem anyway, and then only a few logs will be necessary.
Differentiating the type and quality of particular log messages is also very
useful.

------
newjobseeker
I used Papertrail at my last job, search function works well, and it was easy
to use.

~~~
freehunter
Seconding Papertrail. I don't _love_ it but it works and it's cheap enough and
I don't need to love every tool I use.

I do constantly wonder if there's a logging/searching solution small and
lightweight enough to fit on the cheapest DO droplet. $7/GB/mo on Papertrail
vs $5/mo with a 25GB SSD? That'd be a no-brainer, but ELK and Graylog won't
run on hardware that limited.

~~~
RhodesianHunter
Depends on how risky you're willing to be with your log data. Something like
what you're describing could lose all of your data with a single instance or
hard drive failure, while a SaaS like Papertrail has your data replicated
across data centers.

------
lokar
You should first consider what you need logging for.

Is it just ad-hoc text "debug" logs? Structured transaction or request logs to
feed some sort of analytics? This impacts the backend trade-offs quite a bit.

Are you trying to do monitoring via logs? Don't. Export metrics via Prometheus
(or something like it); it's much cheaper and more dependable than extracting
metrics via a log collection system. A sketch of what "export metrics" looks
like in Go is below.
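
Assuming the official client, github.com/prometheus/client_golang (the metric name and label here are placeholders):

    package main

    import (
        "net/http"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promauto"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    // A counter incremented in the request path, instead of grepping logs later.
    var requestsTotal = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "myapp_requests_total",
            Help: "Requests handled, by status code.",
        },
        []string{"code"},
    )

    func handler(w http.ResponseWriter, r *http.Request) {
        requestsTotal.WithLabelValues("200").Inc()
        w.Write([]byte("ok"))
    }

    func main() {
        http.HandleFunc("/", handler)
        http.Handle("/metrics", promhttp.Handler()) // Prometheus scrapes this endpoint
        http.ListenAndServe(":8080", nil)
    }

Prometheus scrapes /metrics on its own schedule, so the hot path just increments a counter in memory; no log pipeline involved.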

------
whatsmyusername
If you haven't read the chapter of 12factor on logging, I highly recommend it:
[https://www.12factor.net/logs](https://www.12factor.net/logs)

This is coming from an ops person: do that and I'll be happy. Essentially the
goal is to write all your logs to stdout and externalize the routing, wrapping
tooling around your application to route them wherever you want them to go.
It's geared toward Heroku, but the same rules apply in Docker land and in more
traditional VM environments.

We either send logs ECS -> CloudWatch for our AWS stuff, or docker swarm ->
fluentd -> s3 for an onsite appliance (anything syslog also ends up in
Sumologic through a collector we host in the environment). From there the logs
get consumed by our SIEM (in our case Sumologic, which is a great service). We
keep logs hot for 90 days and ship archives back to s3 for a year. Set up the
lifecycle management stuff; keeping log files forever is not only a waste but
can actively hurt you if they ever get exposed in a breach.

I highly recommend formatting your logs as JSON (even if you have to do it by
hand, like in Apache). If you do that and go to Splunk, Sumologic, or ELK, all
your fields will be populated either automatically or with a single filter.
It saves writing or buying your own parsers, and if you add a field there's no
action for you to take.

nginx/apache default logging is complete trash. Look at the variables they
expose; there's a lot of stuff in there you'll want to add to the log format
to make your life loads easier. I have a format I use that I'd be willing to
send you if you want it.

I don't recommend logstash (the L in ELK), ever (except maybe if you're Java
across the board). It's way too damn heavy to run on a workload host; fluentd
is much lighter (and not Java - why would I deploy Java for a system tool,
ever?). Maybe as a network collector you throw syslog at, but that would be it.

For your use case Sumologic's free service would be great. You can get, I
think, up to 200m a day with a week's retention for free, and you'll get
exposed to what an SIEM can do for you (ingestion rate and retention period
are typically how hosted solutions are billed; you'll need an email address on
a non-free email domain to get a free account from them). IMO you have to get
to some fairly insane log rates before I'd ever recommend running the ELK
stack yourself; it takes way too much care and feeding if you want to run it
correctly with good security.

~~~
edoceo
I'm interested in these webserver configs you got for logs, please

~~~
whatsmyusername
This is what I previously used for Apache 2.4 (it's very out of date compared
to what I use now, since the company I'm with doesn't use Apache, but the gist
is there). I don't have the nginx one readily in front of me.

I snipped a few fields out of this that were 'my setup'-specific, so there's a
small chance the formatting is off. The field variables for nginx are a hell
of a lot less obtuse than the Apache ones.

    LogFormat "{ \"application_name\":\"%v\", \"application_canonical-port\":\"%p\", \"application_client-ip\":\"%a\", \"application_local-ip\":\"%A\", \"application_local-port\":\"%{local}p\", \"application_pid\":\"%P\", \"fastly-client-ip\":\"%{fastly-client-ip}i\", \"request_x-forwarded-for\":\"%{X-Forwarded-For}i\", \"request_x-tracer\":\"%{X-TRACER}i\", \"request_geo-ip\":\"%{GEOIP_ADDR}e\", \"request_geo-continent\":\"%{GEOIP_CONTINENT_CODE}e\", \"request_geo-country\":\"%{GEOIP_COUNTRY_CODE}e\", \"request_host\":\"%{Host}i\", \"request_auth-user\":\"%u\", \"request_content-type\":\"%{Content-Type}i\", \"request_timestamp\":\"%t\", \"request_uri\":\"%r\", \"request_referer\":\"%{Referer}i\", \"request_user-agent\":\"%{User-Agent}i\", \"response_code\":\"%>s\", \"response_bytes\":\"%b\", \"response_seconds\":\"%T\", \"response_microseconds\":\"%D\", \"response_content-type\":\"%{Content-Type}o\" }" extendedcombined

It'll come out looking like this (thrown through jsonlint.com)

    {
      "application_name": "%v", # Server name from the vhost config (not necessarily the hostname if you have aliases, depending on how you handle bare-IP requests and whether you use named vhosts)
      "application_canonical-port": "%p", # 80 or 443 depending on whether you're TLS or not; I never figured out the point of this and it's actively confusing
      "application_client-ip": "%a", # The client IP calling apache; not necessarily the user's IP if you have LBs/reverse proxies in the mix
      "application_local-ip": "%A", # Host IP the application is running on
      "application_local-port": "%{local}p", # Actual port the application is running on
      "application_pid": "%P", # Apache PID that handled the request
      "fastly-client-ip": "%{fastly-client-ip}i", # A header the Fastly service adds to tell you the actual client IP; we don't use them anymore but they're good (expensive). They actively defend this field: I spent an hour or so poking at it to see if I could add false data, without success (of course, if you don't protect your origins someone could falsify the data there instead)
      "request_x-forwarded-for": "%{X-Forwarded-For}i", # X-Forwarded-For, occasionally useful if I suspect the client was monkeying with request data; there's a similar header for the protocol the load balancer saw in AWS land
      "request_x-tracer": "%{X-TRACER}i", # I used this pattern for when we wanted to set an arbitrary header we could trace a request with, usually by QA
      "request_geo-ip": "%{GEOIP_ADDR}e", # Relevant to mod_geoip, I don't use this anymore
      "request_geo-continent": "%{GEOIP_CONTINENT_CODE}e", # Relevant to mod_geoip, I don't use this anymore
      "request_geo-country": "%{GEOIP_COUNTRY_CODE}e", # Relevant to mod_geoip, I don't use this anymore
      "request_host": "%{Host}i", # Hostname the client sent
      "request_auth-user": "%u", # Relevant if you're using basic auth, usually not relevant
      "request_content-type": "%{Content-Type}i", # Content-Type the client requested, comes up if we suspect request monkeying
      "request_timestamp": "%t", # Request timestamp
      "request_uri": "%r", # Request URI, I don't remember if this logs the GET fields or not
      "request_referer": "%{Referer}i", # Request Referer header, if there is one
      "request_user-agent": "%{User-Agent}i", # Request user agent
      "response_code": "%>s", # Response status code
      "response_bytes": "%b", # Response size
      "response_seconds": "%T", # Response time (not necessarily how long it took the client to get it) in seconds
      "response_microseconds": "%D", # Same thing in microseconds
      "response_content-type": "%{Content-Type}o" # Response Content-Type header, occasionally relevant if we suspect monkeying
    }

When this makes its way into Splunk or ELK, they will automatically parse out
the fields. Sumologic will do it if you pass the query through "| json
auto nodrop".

Sumologic will handle substructures (if you're passing an object where one of
the fields is a hash object - not relevant for apache/nginx); I don't know
about the others. Sumologic will also intelligently handle non-JSON data (like
the timestamps and tagging rsyslog adds).

~~~
edoceo
Thanks!!

------
deepersprout
Keep logs local, and send only errors and noteworthy events to a central
server like sentry.io, rollbar, or something else.

If you log everything to a single server then, as you noticed yourself, it
will become very expensive, and it gets difficult to filter for the stuff you
really need when looking for errors or trying to hunt down a specific problem.

~~~
tusharsoni
I used sentry.io for a project and it worked well for what it is - error
tracking. But when users reach out with issues, it is useful to look at logs,
and having them local means I would have to look at each instance's log file,
which could be spread across multiple servers.

------
bradstewart
When running on AWS, I generally just use CloudWatch. For non-AWS hosts,
and/or when I need something more feature rich: DataDog is a solid hosted
service with reasonable pricing.

------
sethammons
It is expensive. We forward logs to Splunk (we run our own instances). Splunk
is really solid. All the logs are JSON and require certain fields. We use it
for trend analysis, alerting, graphs, reports, and digging into production
issues. It digs through terabytes of data relatively quickly.

~~~
mrweasel
I think Splunk is the reason I can't recommend ELK; Splunk simply makes ELK
look almost non-functional. This was years ago, but we tried Splunk for a
year, then switched to ELK due to pricing. Our number of searches dropped to
almost zero after that, because the usability was so poor in comparison.

As a result we didn’t utilize the data we had, or in many cases reverted to
using grep.

If you want a cheaper alternative, Humio has become rather good and is
relatively easy to use.

------
cmclaughlin
AWS CloudWatch Logs has come a long way. The new Insights UI is great. No need
for us to manage ELK for logs anymore.

~~~
rzerda
We're looking into it as a replacement for our current ELK; it looks great so
far (especially the query language, which is simple enough). Off-topic: what
do you use for load balancer logs in S3? I'm seeing basically two in-AWS
options: Athena (but it lacks a built-in fancy UI) or shipping the LB logs to
CloudWatch too.

------
lycidas
For cost, Stackdriver works the best. Idk if they have a custom agent, but
Fluentd works great to ship logs to your platform of choice.

AWS CloudWatch is also good for cost but has much slower query speeds (the
slowness makes me think it's not Lucene-based?).

If you can splurge on hosted services, my favorite for logging is Datadog,
and it has all the other bits of observability built in for down the road.

[https://landscape.cncf.io/](https://landscape.cncf.io/) is usually my go-to
if you wanna find best-in-class solutions to host yourself.

------
geewee
I've had the misfortune of setting up Application Insights for logging across
a distributed system.

It's awful. The integration with most (C#) logging frameworks is horrible, and
the adaptive sampling, which is hard to turn off, means that Application
Insights randomly drops logs, which makes any sort of distributed tracing of
events really difficult.

To top that off, there's a delay of 5-10 minutes from when the logs are
written until they're queryable, which is a huge pain when debugging your
setup.

------
mamcx
Also:

- Is there a tool that allows navigating structured logs easily, and works in
the terminal, without bringing in heavy machinery like the ELK stack?

- One that can also filter to only ERRORS plus a few lines above/below?

------
lmeyerov
for the endpoint itself, consider switching to syslog so you get a bunch of
stuff for free (auto-rotation, docker logging, ...) and it's easier to change
decisions later (pipe to splunk/elk/...). it's well thought out and pretty easy!

~~~
lmeyerov
Expanding a bit here: Splunk is free for < 500MB/day and has one of the
easiest UIs, so you can run it on the same node or elsewhere, and deploy via
docker to skip most of the weird setup. So just record syslog, volume-mount to
a Splunk docker, enable syslog reading + *nix metrics, and done. (Too bad
there's no free cloud version of the same...)

We work with a ton of log tools, as a company bringing graph analytics+vis to
investigations over largely log & event data, and I've increasingly shifted to
that setup when I need easy basics.

------
zMiller
fluentd is great. You can set up forwarding nodes that relay logs to one or
multiple masters, which then persist into whatever layer(s) you want. Fault
tolerance and failover are baked in. Tons of connectors, and best of all, the
fluentd Docker logging driver ships with Docker, so it's almost zero setup to
get your container logs to fluentd. Works nicely with Kubernetes too!

~~~
cactus2093
I haven't used it in maybe 3 years or so, so some of this could be
misremembered a bit, but I didn't have a great experience with fluentd.

Trying to do much customization was kind of painful: the config file structure
is kind of confusing, the docs were sparse and differ depending on the plugin
you're using, and there's no validation of the config, so if you have any of
the arguments slightly wrong it'll fail silently.

The modularity of routing and filtering logs seems like it would be great, but
it turns out not to be all that flexible; you really have to follow the
framework's preconceived idea of how the processing pipeline should work. I
forget the details, but we were trying to do some processing on the logs, like
ensuring they're valid JSON and adding a few fields to them based on other
metadata sources, and it would have required writing our own plugin.

In other ways it felt too modular for its own good, like it's up to individual
plugins to implement anything they need and the core doesn't provide you with
much. Even things like error handling or a retry framework are not built in,
so if a plugin throws an unexpected exception it's not handled gracefully. One
plugin we were using would just stop sending logs until we restarted the
fluentd process if it ever got an HTTP failure from being temporarily unable
to connect to the API, because they hadn't built in their own retry/recovery
mechanism. Granted, this was for a pretty new logging SaaS service we were
trying out; if you're sticking with the basics like files and syslog it's
probably more robust, but at that point you also might not need the
flexibility promised by a tool like fluentd.

~~~
wrs
Despite their advice not to write a lot of Ruby code, the fluentd plugins
_are_ Ruby code, so I ended up just writing one big custom plugin with all the
rewrite/routing logic in it. It was very simple to do, and so much easier to
read than the convoluted fluentd config files you end up with to do anything
mildly complicated with filtering.

------
ekimekim
My main advice is avoid ELK. I have no clue how Elastic managed to convince
the world that Elasticsearch should be the default log database when it is
_terrible_ for logs.

If you're logging structured JSON, then you'll hit a ton of issues -
Elasticsearch can't handle, say, one record with {foo: 123} and another with
{foo: "abc"} - it'll choke on the different types and 400 error on ingest.

Even if you try to coerce values to string, you'll hit problems with nested
values like {foo: "abc"} vs {foo: {bar: "baz"}}. So now you have to coerce to
something like {"foo.bar": "baz"}, and then you have to escape the dot when
querying...

Finally, if you solve all the above, you'll hit problems with high cardinality
of unique fields. Especially if you are accepting arbitrary nested values, at
some point someone is going to log something like {users_by_email: {<email>:
<user id>}} and now you have one field for every unique email...

These problems are tractable but a massive hassle to get right without majorly
locking down what applications are allowed to log in terms of structured data.

As a separate issue, Elasticsearch does fuzzy keyword matching by default.
E.g. if you search for "foo" you'll get results for "fooing" and "fooed" but
not necessarily for "foobar" (because it's splitting by word - and the
characters it considers part of one "word" aren't obvious). This is great if
you want to search a product catalog, but horrible when you're trying to find
an exact string in a technical context. Yes, you can rephrase your query to
avoid it, but that's not the default, and most people won't know how to
structure their query perfectly to avoid all the footguns.

Finally, as others are saying here, Elasticsearch is just painful and heavy to
manage.

As for what to use instead... I don't have good answers. I haven't
exhaustively checked out all the other products being mentioned, but in my
experience a lot of them will have similar issues around field cardinality,
which means it'll always be possible to cripple your database with bad data.
This is less of an issue if you're just running a few services, but in larger
orgs it can be nigh impossible to keep ahead of.

For smaller scale deployments, don't underestimate just shipping everything to
timestamp+service named files as newline-delimited JSON, and using jq and grep
for search and a cronjob to delete/archive old files.
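
In Go that's barely any code; a quick sketch (the file naming scheme and fields are just my own convention):

    package main

    import (
        "encoding/json"
        "fmt"
        "os"
        "time"
    )

    func main() {
        // One file per service per day, e.g. myservice-2019-07-06.ndjson.
        name := fmt.Sprintf("myservice-%s.ndjson", time.Now().UTC().Format("2006-01-02"))
        f, err := os.OpenFile(name, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
        if err != nil {
            panic(err)
        }
        defer f.Close()

        // json.Encoder writes a trailing newline after each value: NDJSON for free.
        enc := json.NewEncoder(f)
        enc.Encode(map[string]interface{}{
            "time":  time.Now().UTC(),
            "level": "error",
            "msg":   "payment gateway timeout",
        })
    }

Then `grep` and `jq 'select(.level == "error")'` get you surprisingly far before you need a real search backend.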

When it comes to the "read from local source and ship elsewhere" component,
I've had the best luck with filebeat (specifically for files -> kafka). Most
others tend to read as fast as they can then buffer in memory or disk if they
can't write immediately, whereas filebeat will only read the source file as
fast as it can write downstream.

Note however that all such components are awful to configure, as they seek to
provide an (often Turing-complete) configuration language for transforming
your logs before shipping them, and like most Turing-complete configuration
scripts, they're less readable and more buggy than the equivalent would've
been in any real programming language.

Ok, rant over. Sorry, a good logging system is kind of my white whale.

~~~
rzerda
I second every line of this. If you can find someone else to handle all this
stuff with a quality you can accept, do that and focus on your app instead.
All that field-type and cardinality mess is "easily" prevented with a bit of
self-discipline, which you're going to develop anyway if you're really going
to use your logs for incident response or statistics.

------
markpapadakis
Our ([https://bestprice.gr/](https://bestprice.gr/)) services/“programs”
generate three different types of events:

- Short events (no longer than ~1k in size) where the cost to generate and
transmit is very low (sent to a dedicated service via UDP). We can generate
dozens of them before we need to care about the cost of doing so. They are
binary-encoded. The service that receives those datagrams generates JSON
representations of those events, forwards them to all connected clients
(firehose), and also publishes them to a special TANK partition.

- Events of arbitrary size and complexity, often JSON-encoded but not always
-- these are published to
TANK ([https://github.com/phaistos-networks/TANK](https://github.com/phaistos-networks/TANK)) partitions

- Timing samples. We capture timing traces for requests that may take
longer-than-expected time to process, and random samples from various requests
for other reasons. They capture the full context of a request (complete with
annotations, hierarchies, etc). Those are also persisted to TANK topics.

So, effectively, everything’s available on TANK. This has all sorts of
benefits. Some of them include:

- We can, and do, have all sorts of consumers that process those generated
events, looking for interesting/unexpected ones and reacting to them (e.g.
notifying whoever needs to know about them, etc.)

\- It’s trivial to query those TANK topics using `tank-cli`, like so:

    
    
      tank-cli -b <tank-endpoint> -t apache/0 get -T T-2h+15m -f "GET /search"  | grep -a Googlebot
    

This will fetch all events starting 2 hours ago, spanning up to 15 minutes,
that include “GET /search”.

All told, we are very happy with our setup, and if we were to start over, we’d
do it the same way again.

------
solatic
Allow me to rep the tool I help build: Coralogix, which is a managed log
analytics service. You haven't said what your budget is, but our pricing
starts at $15/month to handle 5 GB/month of logs - certainly cheaper than
running ELK yourself.

[https://coralogix.com](https://coralogix.com)

~~~
rvanmil
Very happy with Coralogix. We use the service on many of our Heroku dynos.

------
james_s_tayler
I've recently spent a few days adding proper instrumentation to a .NET Core
side project using CloudWatch Logs.

I don't know anything about the Go ecosystem, but if there is a nice solution
for structured logging, then CloudWatch Logs is very easy to implement and
very cheap, you can easily make decent dashboards with it, and if needed in
the future you can forward the logs on to Elasticsearch.

I'm using a library called Serilog in my project to log everything as
consistent structured logs that get all kinds of metadata automatically
appended, and the JSON payload winds up in CloudWatch Logs. Then I've got a
couple of custom metrics to measure throughput, as well as some log filter
metrics to track latency of the service and its downstream dependencies.

It works very well and was surprisingly quick to put together. I believe
cost-wise my usage level is covered pretty much for free. Can't complain!

------
fiveguys94
We used to use an ELK cluster but it was always breaking - I'm sure this stuff
can be reliable, but we just wanted an easy way to search ~300GB of logs
(10GB/day).

Somehow I came across Scalyr and it's just phenomenally fast - and it costs
less than our ELK cluster did. Definitely worth trying if it provides the
features you need.

~~~
whatsmyusername
ELK needs way too much care and feeding to be worth it until your log
ingestion rates start to get crazy.

------
gingerlime
I used Logentries for a while. It's essentially a hosted ELK. I think they got
bought out or something, and the service went downhill. I don't remember
exactly, but it was slow and clunky, and support wasn't great either.

Then I discovered Scalyr. It was awesome. The UI isn't pretty, but it's
super-powerful. It's fast. You have to format your logs to make the most of
it, but it's worth it.

Unfortunately their alerting wasn't as flexible at the time (for example, it
wouldn't include contextual info from the matching logs that triggered the
alert). Besides that, we decided to consolidate things and move to Datadog.

Datadog is pretty great. The monitoring and alerting features are solid. They
then added logging, APM and keep adding more. It's not that cheap, but overall
works great for us.

------
enobrev
I'm pretty late to this party, but I've been using rsyslog and "EK" (skip the
L - it's way too slow and resource hungry).

rsyslog / syslog-ng handles shipping logs to a central server, and it's dead
simple to keep local logs and a central log at the same time. Every language
can spit logs to syslog very quickly. And then you can use plugins to inject
your logs from rsyslog directly into Elasticsearch, which is incredibly fast.

Other critiques about ES still apply, especially when it comes to managing
conflicting keys in structured logs, but most complaints about fragility and
scaling are because of Logstash, which, I agree, is terrible for logging.

I've written this up in detail if anyone is interested.

~~~
karmakaze
Yes please give some configuration and performance details.

~~~
enobrev
I don't have anything regarding performance details, but I wrote this detailed
post about configuration.

[https://www.reddit.com/r/devops/comments/9g1nts/rsyslog_elasticsearch_logging/](https://www.reddit.com/r/devops/comments/9g1nts/rsyslog_elasticsearch_logging/)

------
jorgelbg
At trivago, we rely on the ELK stack to support our logging pipeline. We use a
mix of filebeat and Gollum (github.com/trivago/gollum) to write logs into
Kafka (usually encoding them with Protocol Buffers); these are later read by
Logstash and written into our ELK cluster.

For us, ELK has scaled quite well both in terms of queries/s and writes/s, and
we ingest ~1.5TB of logs daily just in our on-premise infrastructure. The
performance/monitoring team (where I work) consists of only 4 people, and
although we take care of our ELK cluster, it is not our only job (nor a
component that requires constant attention).

------
duelingjello
syslog / stderr -> Kafka -> centralized ELK

Files / log rotation is completely the wrong approach, because log entries are
mostly innately structured, rich data that occurs at a specific time.
Serializing and then parsing log lines again is wasted effort. Messaging is a
better fit than lines in files, which create log-management headaches: not
rotating, losing messages on rotation, compressing/decompressing, and a
lengthy soup of destructured data that fills up local disks. A sketch of the
app -> Kafka leg is below.
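
Assuming something like the segmentio/kafka-go client (broker address and topic are placeholders):

    package main

    import (
        "context"
        "time"

        "github.com/segmentio/kafka-go"
    )

    func main() {
        // The Writer batches and retries internally.
        w := &kafka.Writer{
            Addr:     kafka.TCP("localhost:9092"),
            Topic:    "logs",
            Balancer: &kafka.LeastBytes{},
        }
        defer w.Close()

        // Each log entry is one message; structure survives end to end.
        err := w.WriteMessages(context.Background(),
            kafka.Message{
                Time:  time.Now(),
                Value: []byte(`{"level":"warn","msg":"disk at 90%"}`),
            },
        )
        if err != nil {
            panic(err)
        }
    }

From there a consumer (Logstash's Kafka input, or anything else) indexes into ELK at its own pace, so a slow indexer never blocks the app.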

Logging to files on local disks is wrong and often creates privacy problems.
Logging to cloud services is also expensive, a legal quagmire, and raises data
portability concerns.

------
indigodaddy
Nobody mentioning Splunk? Too obvious or am I not understanding the ask?

~~~
xenator
I've heard from different people that it's trapware. You get it, invest time,
and build processes around it. And when you start to accumulate a valuable
amount of data, your bill grows into deep space. At the same time, they say
that it is good, but very expensive.

~~~
freehunter
Splunk is great but very expensive, like you mentioned. They're actively
getting disrupted by Sentinel and Backstory and Elastic though, so either
pricing will change or the recommendation will change.

------
ariel_coralogix
DISCLAIMER - I'm the CEO of Coralogix.com

ELK can be pretty expensive and is kind of a pain to manage. For simple use
cases, though, hosted ELK from AWS should do the trick and wouldn't cost too
much. Small startups and dev shops should choose a SaaS logging tool, since
most of them start at $10-$30, which is cheaper than anything you'll spin up
on your own.

Looking at the market right now, looks like logs, metrics, and SIEM are going
to combine in the next 2-3 years.

------
keyle
If it's small, text files that you rotate per day and delete after 1-3
month(s).

If it's big, Graylog is great.

If it's too big, /dev/null, best logs gathering since 1971.

~~~
james_s_tayler
does /dev/null support sharding?

~~~
Too
Absolutely: [https://devnull-as-a-service.com/](https://devnull-as-a-service.com/)

~~~
james_s_tayler
Alright! /dev/null is webscale!

------
sumedh
I have used Sumologic: your servers send the logs to Sumo, and you can set up
your alerts/monitoring/dashboards etc. on their side.

------
nkobber
We use [https://www.humio.com/](https://www.humio.com/) and we love it

------
reacharavindh
I have had a TODO on my list to explore using Toshi + Tantivy (Rust projects,
as a replacement for Elasticsearch) to supplement a simpler (ripgrep + agrind)
file-based search over logs centralized using rsyslog. Haven't gotten around
to playing with them yet. Hopefully sometime this year.

I could not find an equivalent to Kibana though :-(

------
dunnotbh
Check out the free version of Graylog.

------
exabrial
Graylog - it's a purpose-built package for log management on top of
Elasticsearch! We transport our logs over ActiveMQ from our apps, and they're
read off the broker via an OpenWire input. The setup can handle several
thousand writes per second on modest hardware.

------
gesman
How much data is being generated by your logs?

If it’s less than 500MB/day you may use Splunk for free forever.

------
SeriousM
We're using Seq as a log server, which you must host yourself. This fine and
free product is from the same folks as Serilog, which is well known in the
.NET world.
[https://datalust.co/seq](https://datalust.co/seq)

------
franzwong
In the old days, I had some daily cron jobs to upload the logs to a
centralized place. Then you use grep to find which file contains the log you
want. Since you are fine with files, I guess you don't need to make it too
complicated.

------
narnianal
The less experience you have, the more you should pay for a SaaS. Then, as you
gain more experience, you start using frameworks like a self-managed ELK stack
more. If that is not enough, at some point you can roll out your own.

------
hemantv
LogDNA is the best, and it's cheap enough that you don't have to worry about it.

~~~
jwegner
Been happy with LogDNA as well!

------
cpach
Here’s some inspiration from Avery Pennarun:
[https://apenwarr.ca/log/20190216](https://apenwarr.ca/log/20190216)

------
winrid
We used Sumologic for a long time and still do. You can query your logs in an
SQL-like way (you can do joins, for instance), and last I checked they have a
free tier.

~~~
whatsmyusername
Their ability to figure out JSON for mostly-but-not-entirely-JSON log records
at the query line, as well as LogReduce and LogCompare, is great. Hot take:
they're way better than Splunk.

------
niftylettuce
I built CabinJS after I was frustrated with all existing solutions.

[https://cabinjs.com](https://cabinjs.com)

~~~
tmikaeld
It's unclear whether it handles logs from sources other than browsers and Node.

------
stephenboyd
We use DataDog. We run their logging agent in a container which tails the log
file and syncs with their cloud service.

------
weitzj
I used LogDNA at a previous company. It's a really nice, cheap hosted logging
service (like Papertrail).

~~~
snisarenko
I am currently trying out logdna. Their charting is underwhelming.

------
paulmendoza
Cloudwatch. Insights makes it so much easier to query but be sure to log out
JSON.

------
jaequery
What do you guys do with docker logs? They seem to just accumulate over time.

------
bribri
CloudWatch Logs + Insights, or Fluentd -> Kinesis Firehose -> Elasticsearch

------
dillonmckay
Every developer for themselves because the devops guy.

------
jmakov
Clickhouse + Grafana or Prometheus + Victoriametrics

------
badrabbit
Try Graylog!!!!!

------
janpieterz
We use getseq.net, a pretty smooth aggregator, and it keeps everything on your
own infra, so no (or fewer) GDPR concerns. It's proven really powerful, with
tons of services pushing logs to it.

------
toomuchtodo
Graylog

