
Ask HN: How do you handle logging? - ElFitz
Hi!

I work as the backend developer at a mobile app startup, and we don't currently have any centralized logging.

So... how do you do it? Is there any way to have something similar to AWS X-Ray, to trace a single chain of events across platforms? Unless it's a bad idea? I really don't know ^^'
======
wenc
1) Log to local disk (most people will tell you this is bad practice and that
you should directly log to socket or whatever, but it's more likely for your
network to be down than for your disk to fail).

In Python, use the RotatingFileHandler to avoid running out of space.
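A minimal sketch of that handler (file name, sizes, and format are illustrative):

```python
import logging
import logging.handlers

# Cap local disk usage at ~100 MB: 10 files of 10 MB, oldest deleted first.
handler = logging.handlers.RotatingFileHandler(
    "app.log", maxBytes=10 * 1024 * 1024, backupCount=10)
handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
logging.getLogger().addHandler(handler)
```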

2) Incrementally forward your log files to a server using something like
fluentd that can pre-aggregate and/or filter messages.

Big advantage of logging to disk: if logging server is unreachable, forwarder
can resume once it's up again. If you log directly over network, if things
fail the very log messages you need to troubleshoot the failure are
potentially gone.

3) Visualize. Create alerts.

I've evaluated a bunch of logging solutions. Splunk is the best, and
affordable at low data volumes (they have a pricing calculator, you can check
for yourself). It's medium-hard to set up.

Sumo Logic is the easiest to set up, and at low data volumes, prices are
similar to Splunk. You can get something working within an hour or less.

ELK stack is free only in bits but not in engineering time.

I've not actually tried Sentry.io but I saw it at PyCon and it looks pretty
impressive. If you only care about tracking errors/events and not about
general-purpose logging functionality per se, I would take a serious look at
it.

~~~
codemac
> _it's more likely for your network to be down than for your disk to fail_

For most people, the network being down means they can't reach the disk.

Buffering unsent logs via local disk or RAM is critical due to network
flakiness for sure, but not logging over the network as well is a bad idea
100% of the time.

~~~
scarface74
He specifically mentioned mobile apps. When developing mobile apps, you almost
always operate in a “semi connected” state where ideally you can function
without network access and rely on syncing.

~~~
aflag
He's actually working in the backend, not in the mobile app itself.

------
enobrev
Everything logs to syslog (I generally use rsyslog) in JSON format.
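A hedged Python sketch of the application side of this (field names are illustrative; enobrev's actual pipeline uses rsyslog, per the link below):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object for syslog to forward."""
    def format(self, record):
        return json.dumps({
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        })

# In production you'd attach this to a SysLogHandler, e.g.:
#   handler = logging.handlers.SysLogHandler(address="/dev/log")
# For the sketch, a stream handler shows the output shape:
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.getLogger("app").addHandler(handler)
```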

All syslog instances push to a central instance, also running rsyslog. This
allows us to tail logs on each instance, as well as tail / grep system-wide on
the central instance.
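A minimal forwarding sketch for each instance, assuming rsyslog's legacy directive syntax (host name is illustrative):

```
# /etc/rsyslog.d/50-forward.conf -- host name is illustrative
# Disk-assisted queue: buffer locally while the central instance is down
$ActionQueueType LinkedList
$ActionQueueFileName fwdq
$ActionResumeRetryCount -1
# Forward everything over TCP (@@); a single @ would be UDP
*.* @@logs.internal.example:514
```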

Central instance pushes everything directly into elasticsearch.

Using Kibana for searching and aggregating. Using simple scripts for
generating alarms and reports.

Every day a snapshot of the previous day is uploaded to S3 and indexes from 14
days ago are removed. This allows us to easily restore historical data from
the past, but also keeps our ES instance relatively thin for daily usage /
tracking / debugging. It also makes it possible to replace our central log
instance without losing too much.

All devs use some simple convention (ideally built into the logging libs) to
make searching and tracing relatively easy. These include "request ids" for
all logs pertaining to a single process of work, and "thread ids" for tracing
multiple related "requests".

I documented how I have rsyslog and elasticsearch set up here:
[https://www.reddit.com/r/devops/comments/9g1nts/rsyslog_elas...](https://www.reddit.com/r/devops/comments/9g1nts/rsyslog_elasticsearch_logging/)

~~~
porker
How do you change everything on a system to use JSON format? My syslog
(Debian) is filled with text-line entries, and I've not seen a setting to
change this.

~~~
enobrev
By "Everything" in my post, I mean all of our own applications. Some services
let you format logs as JSON, like nginx with log_format[1]. For others, you
may find app-specific configuration or plugins for log formatting, or simply
use plain grep / Kibana text search.

I imagine in those cases something like logstash may help, but I don't really
know as I tend to avoid logstash.

1:
[https://stackoverflow.com/a/42564710/14651](https://stackoverflow.com/a/42564710/14651)

------
vinay_ys
Since others have answered with specific tech stacks, I'll give a more
generalized/abstracted answer. While getting started, here are a few high-
level principles I found useful to adhere that will make your life easier
later:

Think of a multi-stage pipeline for getting raw data from your
transactional/interaction systems and extracting insights and intelligence out
of them.

Stage-1: Ingestion – Keep this simple. Don't mess this up. It's a serious
headache if you do.

1. Generate a request-id or message-id at the genesis of the request and
propagate it throughout your call graph from client to servers (across any
number of API call-graph hops).

2. At each node in the call graph, emit whatever logs you want to emit, and
include this id.

3. Use whatever format you find natural and easy to use in each tech stack
environment. The key is to make the logging instrumentation very natural and
normal to each codebase, such that the instrumentation does not get
accidentally broken while adding new features.

4. Build a plumbing layer (agnostic of what is being logged) that can locally
buffer these log messages, periodically compress and package them with added
sequence and integrity verification mechanisms, and reliably transmit them
to a central warehouse. Use this across all your server-side nodes. Build a
similar one for each of your client-side platforms.

5. At the central warehouse, immediately persist these log packages durably,
and only then respond to the client indicating it is safe to purge those
packages on their local nodes.
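Points 1 and 2 can be sketched in Python with a context-local ID stamped onto every record (names are illustrative, not a specific framework's API):

```python
import contextvars
import logging
import uuid

# Point 1: one ID per request, minted at the genesis or taken from upstream.
request_id = contextvars.ContextVar("request_id", default="-")

class RequestIdFilter(logging.Filter):
    # Point 2: stamp every record emitted during the request with that ID.
    def filter(self, record):
        record.request_id = request_id.get()
        return True

def start_request(incoming_id=None):
    request_id.set(incoming_id or uuid.uuid4().hex)

log = logging.getLogger("svc")
log.addFilter(RequestIdFilter())
```

A formatter with `%(request_id)s` then puts the ID on every line with no effort at each call site.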

Stage-2: Use-case driven ETLs.

6. Come up with use-cases to consume this data. Define the data tables (facts
and dimensions) needed to support these consumption use-cases.

7. Build a high-performance stream processing system that can ETL (extract,
transform, load) the raw log packages in different formats into the defined
consumable data tables.

Stage-3: Actual Use-case data applications.

Run your analytics and machine learning systems on top of these stable
consumable data formats.

Keep the stages separate and decoupled in code and systems. Don't do end-to-
end optimizations and break the boundaries. Recognize that the
actors/stakeholders involved in each stage are different. The job of data team
is to be the guardian of these stages and run the systems and org processes to
support it.

~~~
meowface
This is true if you plan to develop a SIEM or ELK completely from scratch.
Interesting as general background info, but I can't see this information being
practically useful to anyone who just wants to log stuff. It'd be like
building a washing machine and drying machine from scratch because you want to
wash your clothes.

You seem to be describing low level principles, not high level ones. A high
level principle would be "forward your logs to a centralized logging service
and let the logging library and the service do 100% of the work for you",
which I think is what nearly everyone should do (and which most are already
doing).

~~~
zawerf
His first bullet point is probably the most practical thing I know about
logging. Log lines are useless without context especially when the lines are
interleaved. But this is easily fixed just by prefixing them with a context id
that you filter on later. This requires no frameworks and even works with
unstructured text logging.

Super simple, super useful, not everyone does it.

~~~
meowface
True, but this is semi-automatically handled by any structured logging
library. No need to reinvent the wheel or force yourself to remember
to prefix or postfix every log message with one or more "%s"s (or equivalent)
and the ID(s) to interpolate. I think that's one of the main reasons to use a
structured logging library in the first place, and maybe the main purpose of
structured logging.

A simple example for a Python Flask app:
[http://www.structlog.org/en/stable/examples.html](http://www.structlog.org/en/stable/examples.html)

    
    
    import uuid
    import structlog

    logger = structlog.get_logger()
    log = logger.new(request_id=str(uuid.uuid4()))
    log.info("user logged in", user="test-user")
    # gives you:
    # event='user logged in' request_id='ffcdc44f-b952-4b5f-95e6-0f1f3a9ee5fd' user='test-user'
    

Sure, if you absolutely must use unstructured logging, you need to remember to
do the format string prefixing or postfixing for every single message. But why
put yourself in that position if you don't need to? Other than maybe when
maintaining large legacy apps that aren't worth the effort to add structured
logging to.

~~~
SahAssar
The point as I read it was to do this once, and only once per request. So if
you have a few different microservices that call each other you generate the
id once, either at the first service it hits or (preferably) at the load
balancer, and then propagate it down to all other services. This is especially
useful if you use queueing or methods of doing tasks not bound to the same
process as what the request hits.

If I just copied your example I would probably have a few different IDs for
the same request in different parts of the application (unless it was a
single-service app directly exposed to the internet).

~~~
meowface
True, this is complicated by a microservice model. I haven't worked with
microservices much, but I figure there must be some libraries and tooling out
there that can make this simpler. From some quick Googling, it looks like this
is a component of some microservice frameworks. But I understand that this is
a case where you'd often have to implement this yourself into your
architecture, like at a load balancer / reverse proxy. So point 1 is valid.

------
stickfigure
I use the stackdriver logging in Google Cloud Platform.

My GAE apps and google services just log there automatically. My non-GCP
services require a keyfile and couple lines of fairly trivial setup.

I have a single logging console across my entire system with nearly zero
effort and expense. It works incredibly well. Doing this in-house is a waste
of engineering resources.

~~~
Axsuul
How has their client been with searching/tailing?

~~~
stickfigure
Works great? They may have other tools, but I use the web interface. It has a
sophisticated search language. Logs are conveniently grouped by request. The
UI could be snappier, but I really have no major complaints.

~~~
Axsuul
Thanks! Wow their 50GB free per month is super generous too.

------
jacobsenscott
As a startup you should be using one of the many logging services out there -
definitely don't waste time rolling your own or trying to install some open
source log aggregator in an EC2 instance or something.

For error tracking, which is mostly what you'll care about, use a service like
honeybadger, or rollbar, or whatever fits well with your stack.

For performance metrics use a dedicated service for that as well. NewRelic, or
Skylight, or whatever works well for your stack.

------
jrockway
Yes, you want to have a single chain of events across all of your
infrastructure. This is called "distributed tracing". There are a few
solutions available; I recommend Jaeger.

You do need to instrument your applications to emit traces, but don't go
overboard. Make sure everything can extract the trace ID from headers /
metadata and that requests they generate include the trace ID. Most languages
have plugins for their HTTP / gRPC server and client libraries to do this
automatically.
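That propagation step can be sketched in Python (the header name follows Zipkin's B3 convention; the function names are mine):

```python
import uuid

TRACE_HEADER = "X-B3-TraceId"  # Zipkin/B3 propagation header

def get_or_start_trace(request_headers):
    # Reuse the incoming trace ID, or start a new trace at the edge.
    return request_headers.get(TRACE_HEADER) or uuid.uuid4().hex

def outgoing_headers(trace_id, headers=None):
    # Every downstream request carries the same trace ID.
    headers = dict(headers or {})
    headers[TRACE_HEADER] = trace_id
    return headers
```

In practice the HTTP/gRPC plugins mentioned above do exactly this for you, so you rarely write it by hand.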

You will want your edge proxy to start the trace for you; this is very easy
with Envoy and ... possible ... with nginx and the opentracing plugins.

I use structured logs (zap specifically), so I wrote an adaptor that accepts a
context.Context for a given request, extracts the trace ID from that (and
x-b3-sampled), and logs it with every line. This means that when I'm looking
at logs, I can easily cut-n-paste the ID into Jaeger to look at the entire
request, or if I'm looking at a trace, type the ID into Kibana and see every
log line associated with the request. (The truly motivated engineer would
modify the Jaeger UI to pull logs along with spans since they're both stored
in ES. Someday I will do this.)

As for log storage and searching, every existing solution is terrible and you
will hate it. I used ELK. With 4 Amazon-managed m4.large nodes... it still
takes forever to search our tiny amount of logs (O(GB/day)). It took me days
to figure out how to make fluentd parse zap's output properly. And every time
I use Kibana, I curse it as the query language does overly-broad full-text
searches, completely ignoring my query and then spending a minute to return
all log lines that contain the letter "a" or something. "kubectl logs xxx |
grep whatever" was my go-to searching solution. Fast and free.

If anyone wants to pay me to write a sane log storage and searching system...
send me an email ;)

~~~
dmoy
> If anyone wants to pay me to write a sane log storage and searching
> system... send me an email ;)

You can pry lingo/sawzall from my cold, dead hands.

~~~
jrockway
I am thinking more along the lines of interactive querying, not analysis. Show
me all log lines for this request ID, or show me all log lines for the last 5
minutes from every instance of the job, etc.

Google has a system for that... but when I was there, it was awful. Meanwhile
in the real world, we have ELK... and it's even worse. People stop looking at
logs once the kubelet rotates it. It's just too slow and flaky.

(But yes... one key aspect to lingo/sawzall's design is that logs are sharded.
And naturally, logs are sharded. Each program produces a log file over a
period of time, and so (time, pod) forms a natural shard. Introduce something
like ELK, and your sharding is thrown away, so you can never properly
parallelize searching. A properly designed logging system would maintain
shards and ensure that workers have replicas of those shards, so that you can
use lots of computers to quickly get you the result you want. Of course, as
much should be indexed as practical, so you can find the shard you're looking
for without looking at every shard. Lots of work that could be done here, and
it's all super easy. That's why it makes me mad that nobody has done this.)

~~~
jsmeaton
Have you used sumologic or splunk? I feel like they both have the capabilities
you’re talking about.

------
jshawl
Disclaimer: I work for both Papertrail and Loggly's parent company:
SolarWinds.

For general purpose logging, we deploy Papertrail's remote_syslog2
([https://github.com/papertrail/remote_syslog2](https://github.com/papertrail/remote_syslog2)),
which is more or less a set-it-and-forget-it setup: e.g. specify which text
files you want to aggregate, then watch them flow into the live tail viewer.

For logging in more limited environments (can't sudo or apt-get install), we
use Loggly's HTTP API ([https://www.loggly.com/docs/http-endpoint/](https://www.loggly.com/docs/http-endpoint/)).
Also, Loggly's JSON support allows us to answer questions like "how many
signup events failed since the last deployment?" or "what is the most common
signup error?"

Bonus! If you're looking for trace-level reporting and integrating that with
your logs, check out the AppOptics and Loggly integration:
[https://www.loggly.com/blog/announcing-appoptics-loggly-integration/](https://www.loggly.com/blog/announcing-appoptics-loggly-integration/)

~~~
pragmatic
remote_syslog2 project doesn't seem to be very active. Still supported and
maintained?

~~~
jshawl
That is a great question - remote_syslog2 is a fairly mature project and is
still the recommended way to aggregate your application/text log files. It
does one job and does it well!

There is still active server-side development that does not show up in the rs2
repo on GitHub.

I will forward this comment along to our product team as feedback - thanks!

------
binarylogic
I'm biased because my team and I created Vector [0], but I'd highly recommend
investing in a vendor-agnostic data collector to start. You can use this to
collect your data and send it wherever you please. This will afford you the
flexibility to make changes as you learn more, which will be inevitable.

[0]: [https://github.com/timberio/vector](https://github.com/timberio/vector)

------
jbob2000
Don't try to roll it out all in one shot. Just work on solving problems.
Database is timing out? Add some logging there. Requests getting dropped
between proxy and app servers? Add some logging there.

If you try to add logging across the entire infrastructure in one shot, you
won't know what logs you _actually_ need. And when it comes time to diagnose a
problem, you probably won't be capturing the correct data.

~~~
weq
This is a good point.

For me, this looked like logging to a ring buffer and then dumping that log
with an associated error report when an exception occurred. That was good
enough for 99% of the errors I debugged, and we never actually needed a
log-shipping solution. Logs were kept on disk and uploaded on demand when
investigating specific issues.
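A minimal Python sketch of that pattern (not weq's actual code): keep the last N formatted lines in a ring buffer and hand them off whenever an error arrives.

```python
import collections
import logging

class RingBufferHandler(logging.Handler):
    """Keep only the last `capacity` formatted lines; pass the whole
    buffer to `report` when a record at ERROR or above arrives."""

    def __init__(self, capacity, report):
        super().__init__()
        self.buffer = collections.deque(maxlen=capacity)
        self.report = report  # e.g. attach buffer to an error report

    def emit(self, record):
        self.buffer.append(self.format(record))
        if record.levelno >= logging.ERROR:
            self.report(list(self.buffer))

dumps = []
log = logging.getLogger("ringdemo")
log.setLevel(logging.DEBUG)
log.addHandler(RingBufferHandler(capacity=100, report=dumps.append))

for i in range(5):
    log.debug("step %d", i)
log.error("boom")
# dumps[0] now holds "boom" plus the five debug lines leading up to it
```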

It depends on what kind of startup you are in, what kind of product you ship,
what kind of user base you have, and what kind of solution you have. If you
cobble together a set of SaaS solutions, ETL will be your integration
challenge.

------
gtsteve
Well I threw together a system which assigns a guid to each request and
reports this guid to the user if something goes wrong. The guid is sent when
calling across services internally so you can trace log lines across API calls
and services.

The logs are written from containers to CloudWatch and consequently forwarded
to ElasticSearch where we use Kibana and LogTrail [0] to view the logs and
search them.

It's nowhere near as nice as XRay and other APM solutions but it hardly took
any time to throw together. Fundamentally, this is how XRay works, only there
is a specific format for the ID.

However, XRay now supports our runtime so we'll take another look at that. It
looked like an interesting option at the time.

For a mobile app you'd want to assign a guid or some sort of user id to the
device itself so you can track the distinct API calls it makes. I believe XRay
and other systems support this but we don't have a mobile app so I don't know
how that'd work for you.

[0]
[https://github.com/sivasamyk/logtrail](https://github.com/sivasamyk/logtrail)

------
badrabbit
I am shocked that no one has mentioned Graylog so far.

Check it out. It's done wonders for me. You can manipulate, sort, retain, and
do other things with log events. It uses Elasticsearch to store the logs.

It has SIEM like functionality with alerts and they are continuing to make it
more suitable as a SIEM replacement.

And it does have cloudtrail support.

~~~
nullwarp
My only real complaint with Graylog is that between v2 and v3 all the
modules/packs (or whatever they call them) broke, and the useful ones are
still broken now.

Maybe it's better now than when I tried, but it was a real negative to import
some of them only to find out later they were incompatible.

~~~
badrabbit
It could be. I was never exposed to 2.X

------
colechristensen
Centralized logging: SaaS services that do this are a dime a dozen. Sumologic,
Datadog, Elastic, etc.

You seem to be interested in tracing or APM [1], which also has many
providers.

Lots of people do a local Elasticsearch, Logstash, Kibana stack which can be
done without licensing with a variety of forwarders.

You might be most interested in Envoy Proxy or Elastic APM (there are many
others)

[https://www.envoyproxy.io](https://www.envoyproxy.io)

[https://www.elastic.co/products/apm](https://www.elastic.co/products/apm)

1.
[https://en.wikipedia.org/wiki/Application_performance_manage...](https://en.wikipedia.org/wiki/Application_performance_management)

------
kirktrue
It sounds like you're looking for something like distributed tracing (vs.
vanilla logging).

Zipkin ([https://github.com/openzipkin](https://github.com/openzipkin)) and
OpenTracing ([https://github.com/opentracing](https://github.com/opentracing))
purport to be vendor/platform agnostic tracing frameworks and have support
with various servers/systems/etc.

X-Ray was pretty trivial to use in AWS land w/ Java as a client.

------
ElFitz
I really didn't expect to get this many passionate opinions on the matter.

It took me some time to... build up the courage to read through all of your
answers, and you have been of tremendous help. I've learned quite a lot. Thank
you very much! I deeply appreciate it!

I'll steer clear of self-hosted ELK for now, mostly because, being the only
backend developer, I can't really take the risk of holding the whole team
back while getting it up and running or maintaining it.

I'll look into Splunk, Sumo Logic, Sentry & a few others, while keeping in
mind the more general guidelines that were laid down here.

Also, thank you for the terminology! It's much easier to find the proper
resources now that I know what to look for!

Edit: I'll also take some time to reply to the different comments; it really
felt rude of me to be procrastinating while you all had taken the time to
properly answer.

------
Sevii
Log to disk. Rotate every hour and upload to S3. Download from S3 as needed
and query via grep, awk, etc.
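In Python, the rotation half of this can be a TimedRotatingFileHandler; the upload half is a cron job (aws s3 cp, or boto3) that ships completed files. File name and retention are illustrative:

```python
import logging
import logging.handlers

# Rotate the local file on the hour and keep two days' worth locally;
# a separate job uploads completed files to S3 and prunes them.
handler = logging.handlers.TimedRotatingFileHandler(
    "hourly.log", when="H", interval=1, backupCount=48)
logging.getLogger("batch").addHandler(handler)
```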

------
cbanek
> trace a single chain of events across platforms

Since it sounds like you also control the app, maybe make an HTTP header that
the app sends that has some kind of UUID for that transaction. When your
backend gets it, keep passing it on and logging it as part of your context
when you emit log lines. Then using whatever log aggregation system you use,
you can search for that UUID.

As for collecting your logs, I like ELK stacks, and they are easy to set up
and get all your syslogging to go there. There are also ready made helm charts
to install these into a kubernetes cluster if you're using that, and they will
automatically scoop up everything logged to stdout/stderr.

------
linsomniac
Apache logs go to rsyslog via logger (apparently the best option with Apache
2.4). Syslogs go to a central rsyslog server over RELP (which has mostly been
reliable, but a recent bug in rsyslog caused us to have to reload a week's
worth of logs).

Central rsyslog server uses mmnormalize/liblognorm to parse the apache logs
and load them into Elasticsearch.

haproxy logs directly to rsyslog via a domain socket, RELP to central server,
lognorm to load into ES.

ELB logs go into S3, and logstash pulls them down and loads them into ES.

The remainder of syslog messages just go into files on the central server.

We also have Sentry set up with some newer applications logging into that.

------
danesparza
First, centralized logging is not just a good idea -- it's key when you start
working with multiple servers (which will most likely be almost right away).
You need to be able to trace requests / responses / errors across your
platform. Many tools (including logging library -> database and a custom log
search / viewer) can give you this. Just pick something that works for your
budget and development process and start there. To track a single chain of
events, you'll just need to have a GUID that you pass between calls in a
single request (used for logging).

Next, you'll want to track analytics centrally. Etsy and Netflix have been
pioneers in this area. Their engineering blogs are very good to follow. Think:
something like a timeseries database (like Influx / Prometheus) and getting
data into it. Use tools like Grafana to get data out of it in dashboards or
reports. This is separate from your application debug / error logging system.

The next step after this is developing something that consumes data from both
of those systems and provides alerts based on unusual activity -- something
that provides early warning to devops.

------
nodesocket
Recommend DataDog Logs[1]. It integrates with cloud providers and pulls logs
from resources like load balancers, S3 buckets, etc. Additionally, you can
ingest from files on servers using the DataDog agent, and finally there are
language SDKs to push log events from code.

[1] [https://docs.datadoghq.com/logs/](https://docs.datadoghq.com/logs/)

------
jharohit
Log in-application directly to a local Fluent Bit instance (spool locally in
case Fluent Bit is down, with log rotation) -> collect in a centralized
Fluentd -> self-hosted memory-optimized ES (because the default options and
ES Cloud are shit) -> Grafana for monitoring & alerting.

Having spent months on this with the team, we found it to be the best
high-performance stack for cloud & on-premise solutions for our clients.

~~~
tilolebo
how do you visualize logs with Grafana when they are stored in ES?

~~~
jharohit
Grafana has an ES connector

------
atmosx
The only worthy math teacher I came across once said "The right answer to
99.9% of the questions I will ask you in the oral exam is: _it depends_"
while smiling cunningly. To answer your question: _it depends_. What you're
really looking for, however, is not _logging_; what you're looking for is
_observability_, which has 3 pillars:

- Logs

- Metrics

- Tracing and/or APM

The above are true for systems and applications, but let's talk applications.
Your decision should be based on an assessment of at least the following:

- Do you have compliance requirements? (e.g. GDPR)

- What is your logs/metrics/traces retention period? (let's assume 30 days)

- What are your logs/metrics/traces lifecycle requirements? (Are you going to
need logs older than 30 days? If not, I'd say don't bother: delete
everything; keeping them around has managerial and hosting costs.)

I advise taking a look at Elasticsearch:

- Elasticsearch for hosting logs

- For sending logs, metrics and traces you can use filebeat, metricbeat and
Elasticsearch APM or Jaeger.

If you are a small startup, I'd say go with Elastic Cloud and use their
tools. They do all you need and more.

[1]: I prefer metricbeat over prometheus/grafana because it solves the
high-availability headache for those who already have an ES cluster, and you
don't have to support (set up, monitor, manage, scale) an additional stack.
You can use a _push_ model, which has its own pros and cons.

ps. No affiliation with elastic, I just spent some time with a variety of
their products and like what I see so far.

------
__exit__
I've used external services such as Sentry[0] and NewRelic[1], which allow
one to access detailed debugging and performance checks on specific errors
and API endpoints.

That's aside from the classic print statements and grepping log files
manually.

[0]: [https://sentry.io](https://sentry.io)

[1]: [https://newrelic.com/](https://newrelic.com/)

------
a10c
The company I work for uses Splunk to ingest about 70TB/day.

Our services send to fluentd running on each instance, which aggregates and
flushes to a Kinesis stream in AWS. KCL workers are responsible for putting
it through a separate pipeline that allocates the logs to specific indexes
depending on the service(s) they come from, as well as applying ACLs on a
per-index basis.

------
monocasa
I have a custom printf that logs to a ringbuffer in MRAM, which is sort of
like battery backed SRAM that doesn't need a battery.

------
gbuk2013
Structured JSON logs go to Elasticsearch and local disk. ES gets "info"
level; disk also gets "debug", but only 2 weeks' worth.

------
EamonnMR
We've had great luck with our switch from "write everything to files" to
Graylog. We've got a bunch of different microservices, and having all of the
logs in one searchable place has been a boon. That, and our switch to
Kubernetes had made our logfiles harder to get at.

------
jillesvangurp
There are plenty of logging stacks to choose from. I've used Elastic a lot
and it has improved a lot. You can spin up a cheap cluster for under $200 and
start instrumenting your servers to send stuff to it. You'll want to set up
index lifecycle management to ensure you don't run out of disk space on your
cluster. You scale by throwing more money at it. Basically you need to think
in terms of millions of messages per day and retention periods. A $200
cluster should be able to retain tens of millions of messages.

That's how you get started. There are plenty of tools on this stack to do APM,
security auditing, request logging, etc. If you are using a decent application
server stack that produces metrics, it can handle those too.

------
aflag
I don't log anything. Deploy and forget. Sharks don't keep reliving their
mistakes. Sharks swim fast, take a bite of whatever they see and move on. Are
you a shark or a little fish? There's no time to answer that anymore, I've
already moved on.

------
jesterson
All OS and app log messages go to local syslog. Local syslog forwards all
messages to a central facility with Graylog2 on it, where all
search/visualisation/analysis happens.

This has worked flawlessly for years and is relatively easy to set up.
Couldn't recommend it more.

------
madhadron
> to trace a single chain of events across platforms?

For each individual instance of some class of things, generate a unique
identifier. For example, each network request the mobile app makes to the
backend should have a request ID. The mobile app includes that request ID in
all its log entries and sends it with the request. The backend plumbs it
through everywhere and all its log entries have it, too. If you have multiple
instances of things in the backend, like batches of queries sent to a
database, log an identifier for them as well.

Then you dump all of this into one big index in some semi-structured data
store and use the identifiers to pull out all related entries.

------
MichaelRazum
Not sure if this is the best solution, but it has worked so far for me: ELK +
Redis + Curator. Everything in a Docker container, single-machine setup.
Curator deletes old logs. Redis is responsible for caching; logs are put
directly into Redis. I think one of the most important metrics:

Performance: a 4-core machine with 32 GB RAM handles about 3000 logs per
second at 70% CPU usage and 80% SSD usage. Quite happy with the setup, since
the SSD can be upgraded to a faster one. Also, a more powerful machine could
handle about 10000 logs per second. Would love to hear numbers from Splunk or
similar solutions.

Costs: nearly zero, plus some time to set up and bring Redis + Curator into
play.

------
peu4000
We're in the middle of a cloud migration, but in our dockerized environment
we're sending logs directly from stdout to cloudwatch using Docker's
cloudwatch plugin.

In our legacy environment we're writing to files and sending them up to
cloudwatch using awslogs.

Cloudwatch is kind of ass for logging, but they added insights somewhat
recently; it upgraded cloudwatch logs from being unusable to just being a pain
in the ass to use.

This works for us so far because it's super simple and we don't have a major
need for log analytics, just the occasional production debugging session.

I did a PoC for fluentd + logdna/logz/etc and that also seemed to work pretty
well.

------
viraptor
For structured / trace logging like X-Ray, you need to do quite a lot of work
in the app. It doesn't happen automatically. You can get a bit of it "for
free" from NewRelic APM, which can do some sampling of execution traces, but
it's mostly around function calls, not custom spans. (You can define those
too.)

If you just need text output logging, there are a few solutions already
described. But at this point you should really make a decision: are you after
simple text logs, or can you put in the work to get structured events or
tracing out of your app?

------
nineteen999
We use Splunk with lots of redundancy (i.e. multiple forwarders, indexers and
search heads per site).

I'd probably use Graylog or some ELK stack variation though if our client
would let us, since Splunk is $$$.

------
PascalW
[https://github.com/grafana/loki](https://github.com/grafana/loki) is very
promising in this space. Dead-easy to run.

------
yogsototh
Basic: log to syslog

Advanced: log structured objects (keys and string values) to Riemann. Write
smart rules in Riemann, then send those to ES and explore the structured
objects in Kibana.

------
exabrial
Over ActiveMQ using a Logback OpenWire plugin, then off to Graylog using an
ActiveMQ input plugin.

Works great, can handle thousands of messages per second on modest hardware.

------
twblalock
For tracing, Zipkin is a good place to end up.

But before you get there, you can standardize on a "request ID" header that
gets passed through your call stack and logged by whatever services receive
it. You can search for it in your log aggregator (SumoLogic, Splunk, etc.) and
get a good idea of which services your request went to, what time they got it,
how long it took, etc.
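As a sketch of that pattern in Python's stdlib logging, assuming a header named X-Request-Id (the header name is just a common convention, not a standard), where `handle` stands in for any request handler:

```python
import logging
import uuid

# Put the request ID into every log line via the format string.
logging.basicConfig(format="%(asctime)s %(request_id)s %(message)s")
log = logging.getLogger("app")
log.setLevel(logging.INFO)

def handle(headers):
    # Reuse the caller's ID if present, otherwise start a new trace here.
    request_id = headers.get("X-Request-Id") or uuid.uuid4().hex
    log.info("handling request", extra={"request_id": request_id})
    # Propagate the same ID on outgoing calls to downstream services.
    outgoing_headers = {"X-Request-Id": request_id}
    return outgoing_headers
```

In a real service you'd stash the ID in a middleware or context variable rather than threading it by hand, but the idea is the same: one ID, generated at the edge, carried everywhere.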

------
humility
1) Build a proper (local) logging service in your app. With Node.js I use
winstonjs/winston.

2) Use time series databases to log your server metrics. Eg. InfluxDB

3) Familiarize yourself with CLI tools like cat, less, tail, grep, sed for
when you have to get your hands dirty with raw data.

4) Logrotate is a great choice to cap the size of different program logs.
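For logs your own process writes, Python's stdlib RotatingFileHandler gives you the same size cap point 4 describes, in-process; a minimal sketch (the file name and limits are arbitrary):

```python
import logging
from logging.handlers import RotatingFileHandler

# Cap each log file at ~10 MB and keep 5 rotated copies (~50 MB total),
# the in-process analogue of what logrotate does for external programs.
handler = RotatingFileHandler("app.log", maxBytes=10_000_000, backupCount=5)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))

log = logging.getLogger("app")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.info("service started")
```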

------
emmelaich
If you're asking specifically about associating events at different levels,
see Elastic APM, or other products with an APM component.

NewRelic might also help.

If you're asking about logging generally - it's a vast subject and you
probably need to ask more specific questions. On a StackExchange site
probably.

------
o1lab
One of the shortcomings I've found in many backend logging mechanisms is that
there are no APIs to enable/disable logs at runtime.

I've made a small attempt to fix it in Node.js:

[https://github.com/o1lab/dynamic-debug](https://github.com/o1lab/dynamic-
debug)

------
flurdy
Just remember to scrub the logs for privacy sensitive data.

Logs are the number one source of privacy violations: PIN codes, passwords,
social security numbers, etc. People remember to hash the data in their
databases, but logs are often forgotten about, then stored and archived
containing data that is illegal to retain under many laws such as GDPR. And
they are obviously also a security risk.

Developers usually remember to scrub the data out of actual log messages, but
forget that traces and rawer logged data also end up in log aggregators.

I am sure I have accidentally done it as well, though I try my hardest not to.

* Twitter: [https://twitter.com/TwitterSupport/status/992132808192634881](https://twitter.com/TwitterSupport/status/992132808192634881)

* Monzo: [https://www.zdnet.com/article/monzo-admits-to-storing-paymen...](https://www.zdnet.com/article/monzo-admits-to-storing-payment-card-pins-in-internal-logs/)

* Github: [https://www.bleepingcomputer.com/news/security/github-accide...](https://www.bleepingcomputer.com/news/security/github-accidentally-recorded-some-plaintext-passwords-in-its-internal-logs/)

* Facebook: [https://www.wired.com/story/facebook-passwords-plaintext-cha...](https://www.wired.com/story/facebook-passwords-plaintext-change-yours/)
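One way to scrub at the logging layer is a stdlib logging.Filter; a Python sketch, where the patterns and placeholder strings are illustrative, not exhaustive (pattern-based scrubbing is a last-resort safety net, not a guarantee):

```python
import logging
import re

class ScrubFilter(logging.Filter):
    """Redact common sensitive patterns before a record is emitted."""
    PATTERNS = [
        (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),            # US SSN shape
        (re.compile(r"(password|pin)=\S+", re.I), r"\1=[REDACTED]"),
    ]

    def filter(self, record):
        msg = record.getMessage()
        for pattern, repl in self.PATTERNS:
            msg = pattern.sub(repl, msg)
        # Replace the message in place so every handler sees the scrubbed text.
        record.msg, record.args = msg, None
        return True

log = logging.getLogger("scrubbed")
log.addFilter(ScrubFilter())
```

Attaching the filter to the logger (rather than one handler) means file, console, and forwarder outputs all get the scrubbed version.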

~~~
davinic
Often you're not scrubbing individual data, you're logging the contents of an
entire object. In this case, depending on the language, I will usually use an
annotation/decoration on the object itself so the logging platform will know
not to log it. I have also seen this method used in environments subject to
HIPAA regulations.
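In a language without Java-style annotations, dataclass field metadata can play the same role; a hypothetical Python sketch (Payment, loggable, and the "sensitive" marker are all made-up names for illustration):

```python
from dataclasses import dataclass, field, fields

SENSITIVE = {"sensitive": True}  # marker attached via field metadata

@dataclass
class Payment:
    card_holder: str
    amount_cents: int
    card_pin: str = field(metadata=SENSITIVE)

def loggable(obj):
    """Serialize a dataclass for logging, replacing fields marked sensitive."""
    return {
        f.name: "[REDACTED]" if f.metadata.get("sensitive") else getattr(obj, f.name)
        for f in fields(obj)
    }
```

The win over pattern scrubbing is that redaction is declared once, next to the field definition, instead of being re-discovered at every log call site.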

------
mrkeen
Instead of comments+data+logging as three sources of truth of what the system
does/did, only do logging.

If your system observed something, it writes it down (logs it). If you want
your system to react to that thing happening, then the log is going to have to
be machine-parsable.

------
Nursie
When tracing across microservices, the approach we took was to embed a request
id in the incoming request headers, then use it in every log line and
propagate it on to other microservices as they are called.

Helps us see what happened across the whole system.

------
tiborsaas
I run my node.js apps with PM2 and it supports logging out of the box. It
probably won't scale very well, but for a single server app to run side
projects it's perfect.

------
MorGrin
Hey! I work for Coralogix, maybe you'd like to check us out :)
[https://coralogix.com/](https://coralogix.com/)

------
paulmendoza
AWS CloudWatch. I log everything as JSON and then use CloudWatch Insights to
query it quickly. It is the cheapest solution I have found and is pretty easy
as well.
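A minimal sketch of that JSON-per-line approach in Python; the formatter below is hand-rolled, not an AWS library, and the field names are arbitrary. CloudWatch Logs Insights auto-discovers fields in JSON log lines, so each key becomes directly queryable:

```python
import json
import logging
import sys
import time

class JsonFormatter(logging.Formatter):
    """One JSON object per line; log aggregators can then query fields directly."""
    def format(self, record):
        return json.dumps({
            "ts": time.time(),
            "level": record.levelname,
            "msg": record.getMessage(),
            # Merge in any structured fields passed via extra={...}.
            **getattr(record, "extra_fields", {}),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("json")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("order placed", extra={"extra_fields": {"order_id": 42, "user": "u1"}})
```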

------
gerdandy15
Scalyr.com and Humio have some of the strongest logging platforms on the
market. Their approach of not creating indexes to search makes them easy to
work with.

~~~
morten-gram
Organizations are asking for modern approaches to log management issues. The
availability of an index-free approach will enable them to get faster results
at a fraction of the cost compared to traditional approaches. Index-free
logging provides an entirely different approach that doesn't involve indexing
at all. This new wave of index-free log management incorporates three key
ideas that are increasingly attractive to users looking for a modern
approach:

* Reduce the amount of data you have to manage;

* Reduce the amount of data you analyze;

* Trade off a slight decrease in analytics on historical data for much faster
ingest, greater flexibility, and better efficiency in hardware usage.

See more here: [https://www.itopstimes.com/monitor/humio-
index-free-logging-...](https://www.itopstimes.com/monitor/humio-index-free-
logging-is-the-way-forward/)

------
ycmimi
You can try our SaaS solution for free!
[https://datawiza.com](https://datawiza.com) It gives you full observability
into all your APIs. Compared to existing solutions, we save you from tedious
and heavy work such as configuring logs, parsing, extracting, and enriching
data in your logs, building dashboards, etc. All you need to do is install
our software in 2 minutes, and you get comprehensive dashboards for all your
API activity.

------
jmcd17
We just recently started using CHAOSSEARCH (chaossearch.io) in combination
with Fluentd. Highly recommended!

------
davewasthere
On old projects, NLog to a local log file and ELMAH pointing to a centralised
database.

On newer projects, Serilog with a combo of text file logging, a SQL sink, and
just recently centralised Elasticsearch.

------
DeepYogurt
Give MozDef a look if you want alerting on your logs.

------
beamatronic
System.out.println()

------
xyz-x
About 7 years ago, when I was trying to monitor .NET apps, there weren't that
many alternatives available. The Elasticsearch-Logstash-Kibana stack had just
gotten off the ground and I needed a way to send it structured logs from a
large number of machines. Logary.tech was born to solve that problem.

Since then Logary has expanded with excellent support for sending both metrics
and tracing data to a large number of targets. In production, I use this
setup:

client browser -> Logs & metrics to Logary Rutta HTTP ingestion endpoint via
Logary JS

nginx-ingress -> Traces to Jaeger Agent via opentracing C++ client
nginx-ingress -> Metrics to InfluxDB
nginx-ingress <- Metrics via Prometheus scrape annotation

Our NextJS site and GraphQL server:

site -> Traces to Jaeger Agent via opentracing
site -> stdout logs via Logary JS (also get added as Logs in the Span of OpenTracing)
site <- Metrics via Prometheus scrape annotation and prom-client

api -> Traces to Jaeger Agent via Logary's F# API
api -> Metrics to InfluxDB via Logary's F# API
api <- Metrics via Prometheus scrape annotation and Logary.Prometheus
api -> Events to Mixpanel via Logary.Targets.Mixpanel
api -> Logs to Stackdriver via Logary.Targets.Stackdriver (hosted on GCP)

Also, Kubernetes ships logs via FluentD to Stackdriver in GCP, but they are
not structured, and the remaining infrastructural services also send traces to
Jaeger if they can.

Logary Rutta is a stand-alone log router, written in Hopac + F# (like
Concurrent ML), and used by some of the largest Swedish software companies for
thousands of logs and metrics per second. It's capable of shipping to a large
number of targets
[https://github.com/logary/logary/tree/master/src/targets](https://github.com/logary/logary/tree/master/src/targets)
Since it talks HTTP and UDP with a number of encodings (JSON, plain, binary),
it's easy to plug into an existing infrastructure and existing log shippers.
It can also connect point-to-point to itself with a high-perf binary encoding.
Because you can send any JSON into it, it's very easy to get started with
together with mobile apps.

Logary for JS currently has support for user logs, and I'm currently testing
rudimentary metrics and browser info.

Logary for .Net supports the OpenTelemetry spec, structured logging and
metrics.

Of course you can pick any toolchain you want, but I've had great success (and
great fun!) writing and using the above. You can see I don't keep logs on
disk; they just fill it up, and if your network is down your service is down
anyway, so then you know it's the network.

Once in Logary, you can choose where you send them. I've done an analytics/ETL
pipeline based on Logary with its Stackdriver+BigQuery+GooglePubSub targets
and with Flink, with great success as well. Logary is free to use for non-
profit and then I have a pricing calculator on the home page, for when you
start selling the software you build. Pricing aside, how Logary is structured
and how I've used it might give you some hints on how to do it yourself.

------
bt848
Never centralize logging. Log at the leaves and store it there. Push search
predicates down to servers running on each leaf when/if needed. Log to sockets
always; your FD can be a regular file if you want but keep the flexibility to
change it later.
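In Python's stdlib logging that flexibility amounts to swapping handlers; a hypothetical sketch where today's destination is a plain file and a network destination could be added later without touching any call sites (the host and port are made up):

```python
import logging
from logging.handlers import SocketHandler

log = logging.getLogger("leaf")
log.setLevel(logging.INFO)

# Today: the destination is just a regular file on the leaf node.
log.addHandler(logging.FileHandler("leaf.log"))

# Later: add (or substitute) a network destination. SocketHandler connects
# lazily on first emit and ships pickled LogRecords over TCP.
# log.addHandler(SocketHandler("logs.internal.example", 9020))

log.info("stays flexible")
```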

~~~
millettjon
Leaf logging is also susceptible to attackers deleting log files. Central
logging is effectively append only from the leaf and thus provides a security
benefit.

~~~
weq
So the attacker starts DoSing your central log server and you do what?

~~~
bananocurrency
How is my logging server being attacked from outside my network?

