Monitoring demystified: A guide for logging, tracing, metrics (techbeacon.com)
487 points by malechimp on July 31, 2020 | 92 comments

A lot of excellent information in that blog post and linked from it... but if you're wondering where to start:

1. Write good logs... not too noisy when everything is running well, meaningful enough to let you know the key state or branch of code when things deviate from the good path. Don't worry about structured vs unstructured too much, just ensure you include a timestamp, file, log level, func name (or line number), and that the message will help you debug.

2. Instrument metrics using Prometheus; there are libraries that make this easy: https://prometheus.io/docs/instrumenting/clientlibs/ . Counts get you started, but you probably want to think in aggregates and ask about rates and percentiles. Use histograms for this: https://prometheus.io/docs/practices/histograms/ . Use labels to create a more complex picture, e.g. a histogram of HTTP request times with a label for the HTTP method means you can see all requests, just the POSTs, or maybe HEAD and GET together, etc., and then create rates over time, percentiles, etc. Do think about the cardinality of label values: HTTP methods are good, but request identifiers are bad in high-traffic environments. Labels should group, not identify.
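To make the labels-and-histograms advice concrete, here is a minimal sketch in plain Python rather than the real Prometheus client library. The class name `LabeledHistogram`, the bucket bounds and the method names are all made up for illustration. The point is only that per-label bucket counts can be viewed together or separately, and that the number of stored series grows with the number of distinct label values.

```python
from bisect import bisect_left
from collections import defaultdict

# Bucket upper bounds in seconds (assumed values, chosen for the example).
BUCKETS = (0.05, 0.1, 0.25, 0.5, 1.0, float("inf"))


class LabeledHistogram:
    """Toy histogram keyed by one label (the HTTP method)."""

    def __init__(self, buckets=BUCKETS):
        self.buckets = buckets
        # label value -> per-bucket counts; one stored series per label value
        self.counts = defaultdict(lambda: [0] * len(buckets))

    def observe(self, method, seconds):
        # bisect_left finds the first bucket whose upper bound is >= seconds
        self.counts[method][bisect_left(self.buckets, seconds)] += 1

    def merged(self, methods=None):
        """Sum bucket counts over some label values (all of them by default)."""
        chosen = list(self.counts) if methods is None else methods
        total = [0] * len(self.buckets)
        for method in chosen:
            for i, count in enumerate(self.counts.get(method, ())):
                total[i] += count
        return total

    def quantile_upper_bound(self, q, methods=None):
        """Upper bound of the bucket containing the q-quantile (0 < q <= 1)."""
        counts = self.merged(methods)
        target = q * sum(counts)
        running = 0
        for bound, count in zip(self.buckets, counts):
            running += count
            if running >= target:
                return bound


h = LabeledHistogram()
for method, secs in [("GET", 0.03), ("GET", 0.07), ("GET", 0.2),
                     ("POST", 0.4), ("POST", 0.9), ("HEAD", 0.01)]:
    h.observe(method, secs)

print(h.merged())                                  # all requests
print(h.merged(["POST"]))                          # just the POSTs
print(h.merged(["HEAD", "GET"]))                   # HEAD and GET together
print(h.quantile_upper_bound(0.95))                # coarse p95 over everything
print(len(h.counts), "series for 3 label values")  # cardinality grows with labels
```

If the label were a request id instead of a method, `h.counts` would gain one entry per request, which is the cardinality problem described above.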

Start with those things. Tracing follows good logging and metrics because it takes more effort to instrument an entire system, whereas logging and metrics are valuable even when only small parts of a system are instrumented.

Once you've instrumented... Grafana Cloud offers a hosted Grafana, Prometheus metrics scraping and storage, and Log tailing and storage (via Loki) https://grafana.com/products/cloud/ so you can see the results of your work immediately.

If it's a big project, you have a lot of options and I assume you know them already. This is when you start looking at Cortex and Thanos, Datadog and Loki, and tracing with Jaeger.

I’d add to no. 1 by saying include a correlation id or request id so you have a way to filter the logs into a single linear stream related to the same action.
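A rough sketch of both ideas using only Python's standard library: a log format with timestamp, level, file, function and line, plus a request id stamped onto every line. The logger name, the `request_id` field and the `handle_request` function are assumptions for the example, not anything from the article.

```python
import contextvars
import io
import logging
import uuid

# Holds the id of the request currently being handled ("-" outside a request).
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Copy the current request id onto every record so the formatter can use it."""

    def filter(self, record):
        record.request_id = request_id_var.get()
        return True


stream = io.StringIO()  # stands in for stdout or a log file
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s %(filename)s:%(funcName)s:%(lineno)d "
    "request_id=%(request_id)s %(message)s"
))
handler.addFilter(RequestIdFilter())

logger = logging.getLogger("correlation-demo")
logger.handlers.clear()
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False


def handle_request(user):
    # Mint one id per request; every log line inside shares it.
    token = request_id_var.set(uuid.uuid4().hex)
    try:
        logger.info("registering user=%s", user)
        logger.info("sending welcome email user=%s", user)
    finally:
        request_id_var.reset(token)


def lines_for(request_id):
    """Filter the log into the single linear stream for one request."""
    needle = f"request_id={request_id} "
    return [line for line in stream.getvalue().splitlines() if needle in line]


handle_request("alice")
handle_request("bob")
print(stream.getvalue(), end="")
```

Grepping for one `request_id=` value then gives the linear stream for a single action, even when requests interleave.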

Absolutely. Consistently being able to get a single linear stream is when tracing becomes super powerful.

Also, adding a context tag can be very powerful, especially in times of microservices and event-driven stuff like monthly payments.

Imagine: a user registers; the POST gets an id and the context "registering". Then they add a credit card (a new id, context "credit payments"). After 14 days the bill goes out: same id, same context.

> Grafana Cloud offers a hosted Grafana, Prometheus metrics scraping and storage, and Log tailing and storage (via Loki)

I haven't looked at their pricing before, but for small-ish environments, their standard plan looks really good and simple. None of the "per host, but also per function, and extra for each feature, and extra for usage" approach like other providers (datadog, I'm looking at you).

> None of the "per host, but also per function, and extra for each feature, and extra for usage" approach like other providers (datadog, I'm looking at you).

I was thinking "God, this is exactly why I hate Datadog" as I was reading your description and got a great laugh when I reached the end. Their billing is absolutely byzantine.

I don't know that I've ever seen a company that had such a stark difference between great engineering/product and awful business/sales practices. Their product is really the best turn-key option out there, but I'm always hesitant to use its features without double checking it's not going to add 50% to my bill. Their sales teams are some of the worst I've dealt with, and I deal with a lot of vendors. They're starting to get a really bad reputation as well.

I'm a customer that uses most of their tools (no network performance monitoring since it's less useful than a service mesh and no logging because we need longer history than most and cost would be prohibitive).

Is it really that expensive compared to other vendors? I thought their newer logging tool was a lot cheaper than Splunk, and their APM tool for distributed tracing is also pretty cheap compared with something like New Relic. Sure, it's more expensive than free tools that you need to set up yourself, but the velocity it gives your teams is so much better than having to use something like Grafana with tools like Prometheus. Again, sure, it can be done for cheap, but given the time it takes to manage those tools and the velocity you lose doing so, it doesn't seem worth it for smaller companies, though I can see it making more sense as you scale a company.

It's not the cost per se, though I do think they're pretty high for some features. It's the pricing models and the associated patterns.

For instance, you have to pay Datadog per host you install the agent on. In addition to the per host cost, you have to pay per container you run on that host (past a very small baseline per host), and the per container cost turns out to be nearly as high as the per host cost if you have reasonable density. Why am I paying Datadog per container I run? Aside from a not particularly useful dashboard, why do a process namespace and some cgroup metrics nearly double my bill? They are literally just processes on a server. Because Datadog wants you to run more hosts, so you install more agents.

Every feature they add also seems to be charged separately, but is not behind any sort of feature gate. This means new features just show up for my developers, and they have no clue if it costs money to use them. I can't just disable or cap, for example, their custom metrics per user, per project, or at all. So when my developers see a useful feature and start using it, all of a sudden I have an extra $10k on my monthly bill. Even more fun are features that show up and are initially free but then start charging.

This is such a pain that we've had to tell dev teams not to use Datadog features outside of a curated list. Every product has some rough edges, but with Datadog the patterns are all set up such that you end up paying them thousands of extra dollars. Again, great product, but not a business I would be interested in associating with again given the choice.

It's not so much the total cost, but the fact that there's so much nickel-and-diming. When Trace Analytics came out they tried getting us to turn it on, and it's like... we're already paying for APM and you want to charge us more? At least tell us how much more, and they couldn't. I think it probably ended up not being a ton of money, but just the question was enough for us to not do it. From working with other providers, it's also much easier working with our finance team if we can say 'it costs at most this' instead of 'it costs at least this'.

> Their product is really the best turn-key option out there, but I'm always hesitant to use its features without double checking it's not going to add 50% to my bill.

You might want to check out New Relic One, especially with the new pricing model. I think they even added a Prometheus integration recently?

For APM and Front-end monitoring, you can try https://www.atatus.com/

Twice I fell into the trap of datadog, having paid more than $300 for a single month in each case.

The simplicity of it, dashboards, notebooks, logs etc, is what makes it so appealing though.

If it saves you more than an hour or two, $300 seems super worth it for a system making you money...

It depends where you are in the world. When I was working in Switzerland, most SaaS prices were no-brainers for us. But since I now work in Latin America for small companies with local customers, all the different services and tools you might want to use, with prices targeted at "western" customers, much more quickly add up to the equivalent of having multiple people on staff full time.

Still it is often not worth to roll your own, so it is nice to have alternatives for different price points and company scales.

> It depends where you are in the world.

Exactly this. We operate in Eastern Europe with local clients, offering on-prem SaaS. If I added all my clients' servers on datadog it would very easily eat through our profit margins.

> Still it is often not worth to roll your own

I tried hard not to, but in the end, after spending 1 week trying to set up netdata and failing, I decided not to spend another week trying to set up grafana/influx/prometheus (lots of docs to go through), and just have some bash scripts send metrics to a $10 digital ocean node service that sends me emails/sms when something "looks bad" (e.g. high cpu temperature, stopped docker containers, etc).

I gave up on aggregated logging for the time being, since I can just ssh on each server and check journal and docker logs if I need to (as long as the hard drives don't crash).

The trick with the Netdata agent is to try the install script if running into trouble.

Yeah, having looked at what the script does I decided to 'containerize' the agent, and that led to other issues like configuring email alerts etc.

I was already a week deep into looking at various options and had to deliver on basic metrics and alerting, so I figured a couple of bash scripts, that log into local files with log rotation, systemd, and a dump/memory only receiving end running on nodejs for the alerts would be much faster and easier to maintain.

So far so good.

Tangent: what's the difference between on-prem SaaS and just "software"?

I guess it refers to them servicing (updating, reacting to downtime, etc.) the software, while it being deployed on premise? In contrast to the clients IT department doing so.

Yes that's right.

We provide both the hardware and the software. Our clients are pretty small, with no IT staff, and no technical expertise.

it used to be called "private cloud", wasn't it?

If you're in an established company making money - yes. If you're bootstrapping a service and counting on $50 total monthly cost while initial users are signing up - no.

From the post "The key, he says, is using the right transaction identifiers so that calls can be traced across components, services, and queues".

I think this is a key feature not many people implement, especially in today's world of overblown microservices. Having a transaction id from the time the request hits the reverse proxy until the database write is so helpful in debugging and saves a ton of time.

100%. If you manage to get OpenTracing to work, it is a brilliant debug tool.

You can also just propagate a uuid throughout your system. I've used both uuids and OpenTracing. Just don't get hung up if OpenTracing seems too complex.

I agree with this wholeheartedly. You can even define a standard and let downstream services opt in over time. Simple wins like this should not be put off because "someday" we're going to implement a complex distributed tracing solution.
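For what it's worth, the "just propagate a uuid" approach can be sketched in a few lines. Everything here is hypothetical: the `X-Request-ID` header name is only a common convention, and `reverse_proxy`, `api_service` and `database_write` are stand-ins for real hops. The rule each hop follows is to reuse the id it was given and mint one only if none arrived, which is what lets services opt in one at a time.

```python
import uuid

# Assumed header name; use whatever standard your services agree on.
HEADER = "X-Request-ID"


def ensure_request_id(incoming_headers):
    """Reuse the caller's id if present, otherwise mint one at this hop."""
    for name, value in incoming_headers.items():
        if name.lower() == HEADER.lower() and value:
            return value
    return str(uuid.uuid4())


def outgoing_headers(request_id, extra=None):
    """Headers for a downstream call, always carrying the request id."""
    headers = dict(extra or {})
    headers[HEADER] = request_id
    return headers


def reverse_proxy(incoming):
    request_id = ensure_request_id(incoming)
    return api_service(outgoing_headers(request_id))


def api_service(incoming):
    request_id = ensure_request_id(incoming)
    return database_write(request_id)


def database_write(request_id):
    # A real service would log or store the id alongside the write.
    return {"request_id": request_id, "wrote": True}


print(reverse_proxy({"x-request-id": "abc-123"}))  # id supplied by the client
print(reverse_proxy({}))                           # id minted at the edge
```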

So, in terms of full implementation, is the constraint the dev time to implement, or getting various factions to agree to something?

I guess, is it political, or technical?

Asking for a friend.


Often it's political, but political friction can feed into technical friction if there are also a half dozen different half wrappers around half-baked HTTP libraries. Also, if someone has included OpenTracing as a shadow library in a company-wide library (JVM territory), but you want it as a top-level dependency, you have to write translators.

I agree with 2. I have a presentation at https://www.polibyte.com/2019/02/10/pytennessee-presentation... which goes into how to get started with Prometheus. Not as in "how to set it up", but more about what to instrument and why, how to name things, etc. Despite the title, there's very little in it that's specific to Python.

> Don't worry about structured vs unstructured too much, just ensure you include a timestamp, file, log level, func name (or line number), and that the message will help you debug.

If you include all this information and the logs are not structured, you won't get much information out of them.

Why should you have to use Prometheus? There are plenty of options, and good reasons why you might want to push data rather than pull. Measurements should minimize perturbation of the system being measured, and the (computer) system generating data is likely best placed to determine when and how, when that matters -- e.g. in HPC, where jitter is important.

I would add that if you add a traceId to #1 and use something like https://www.jaegertracing.io/, you get even more.

I don't think Loki is production ready. Needs more work. It's going to be great with Grafana though.

What else do you think it needs? I’ve been playing around with v1.5 the past few days in Grafana and I like its simplicity. Grafana now has a Loki datasource, which is nice.

Gavin from Zebrium here. Completely concur with #1. We are big advocates of writing good logs and not having to worry about structured vs unstructured (and even if you structure your logs, you'll still probably have to deal with unstructured logs in third party components).

Our approach to deal with logs is to use ML to structure them after the fact (and we can deal with changing log structures). You can read about it in a couple of our blogs like: https://www.zebrium.com/blog/using-ml-to-auto-learn-changing... and https://www.zebrium.com/blog/please-dont-make-me-structure-l....

A few things I have learnt along the way:

Logs are great, but only once you've identified the problem. If you are searching through logs to _find_ a problem, it's far too late.

Processing/streaming logs to get metrics is a terrible waste of time, energy and money. Spend that producing high quality metrics directly from the apps you are looking after/writing/decomming (example: don't use access logs to collect 4xx/5xx and make a graph; collate and push the metrics directly).

Raw metrics are pretty useless. They need to be translated into business goals: "service x is producing 3% 5xx errors" vs "% of visitors unable to perform action x".

Alerts must be actionable.

Alert rules must be based on sensible, clear-cut rules: "service x's response time is breaching its SLA", not "service x's response time is double its average for this time in May".

> Processing/streaming logs to get metrics is a terrible waste of time, energy and money. Spend that producing high quality metrics directly from the apps you are looking after/writing/decomming

Yeah nah, but, okay, nah yeah.

Generating metrics in the app is much more intrusive, and requires that you figure out the metrics you need ahead of time. It adds dependencies, sockets, and threads to your app.

Unless you're very careful, it's also easy to end up double-aggregating, computing medians of medians and other meaningless pseudo-statistics - if you're using the Dropwizard Metrics library, for example, you've already lost.

If you output structured log events, where everything is JSON or whatever and there are common schema elements, you can easily pull out the metrics you need, configure new ones on the fly, and retrospectively calculate them if you keep a window of log history.
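A small sketch of what "pull out the metrics you need" could look like with JSON-lines events. The schema (`event`, `path`, `status`, `ms`) and the sample data are invented for the example, and the percentile uses the simple nearest-rank method. The second metric is computed over the stored history after the fact, which is the retrospective part.

```python
import json
import math
from collections import Counter

# Invented sample events sharing a common schema.
EVENTS = [
    {"event": "http_request", "path": "/pay", "status": 200, "ms": 12},
    {"event": "http_request", "path": "/pay", "status": 200, "ms": 30},
    {"event": "http_request", "path": "/pay", "status": 500, "ms": 250},
    {"event": "http_request", "path": "/home", "status": 404, "ms": 8},
    {"event": "http_request", "path": "/home", "status": 200, "ms": 40},
    {"event": "cache_evict", "keys": 3},
]
raw_log = "\n".join(json.dumps(e) for e in EVENTS)  # the kept window of history


def events(text, kind):
    """Yield parsed events of one kind from JSON-lines text."""
    for line in text.splitlines():
        record = json.loads(line)
        if record.get("event") == kind:
            yield record


def status_classes(text):
    """Metric 1: request count per status class (2xx, 4xx, ...)."""
    return Counter(f"{e['status'] // 100}xx" for e in events(text, "http_request"))


def percentile(values, q):
    """Nearest-rank percentile over raw values, q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


# Metric 2, decided on later: p95 latency for one path, from the same history.
pay_ms = [e["ms"] for e in events(raw_log, "http_request") if e["path"] == "/pay"]

print(dict(status_classes(raw_log)))
print(percentile(pay_ms, 95))
```

Because the percentile is taken over raw events rather than over pre-aggregated numbers, nothing is aggregated twice.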

When i've worked on systems with both pre- and post-calculated metrics, the post-calculated metrics were vastly more useful.

The huge, virtually showstopping, caveat here is that there is lots of decent, easy-to-use tooling for pre-calculated metrics, and next to nothing for post-calculated metrics. You can drop in some libraries and stand up a couple of servers and have traditional metrics going in a day, with time for a few games of table tennis. You need to build and bodge a terrifying pile of stuff to get post-calculated metrics going.

Anyway if there's a VC reading this with twenty million quid burning a hole in their pocket who isn't fussy about investing in companies with absolutely no path to profitability, let me know, and i'll do a startup to fix all this. I'll even put the metrics on the blockchain for you, guaranteed street cred.

> Unless you're very careful, it's also easy to end up double-aggregating,

Oh no, never do anything fancy on the client end. Yeah, that's total trash. Any client that does any kind of aggregating is a massive pain in the arse.

Counters are good enough for 90% of everything you want. You can turn counters into hits per second easily. Plus they are more resistant to time-based averaging. If you do your stats correctly, you can even have resetting counters create nice smooth graphs (non-negative derivatives are a godsend).
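As an illustration of turning a raw counter into hits per second while surviving resets, here is a minimal sketch. The function name and sample data are made up. When the counter goes backwards it assumes the process restarted from zero, which is one common convention; some tools simply drop the negative step instead.

```python
def non_negative_rate(samples):
    """Convert counter samples into per-second rates.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    Returns a list of (timestamp, rate) with one entry per consecutive pair.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        delta = v1 - v0
        if delta < 0:
            # Counter went backwards: assume a restart from zero, so the
            # increase since the restart is just the new value.
            delta = v1
        rates.append((t1, delta / (t1 - t0)))
    return rates


# A request counter scraped every 10s; the app restarts between t=20 and t=30.
samples = [(0, 100), (10, 160), (20, 220), (30, 15), (40, 75)]
print(non_negative_rate(samples))
```

A naive derivative would show a large negative spike at the restart; here the graph only dips for one interval.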

> Dropwizard

Yes, this is a library that argues strongly against the use of metrics. From what I recall, 1 node of Cassandra will output close to 50,000 metrics by default. That is too much.

When a team I worked with were migrating away from splunk to graphite/grafana, they shat out something close to a million metrics. 99.8% were totally useless.

> You need to build and bodge a terrifying pile of stuff to get post-calculated metrics going.

Yes! I think that's my main objection. It's so bloody expensive to do post-hoc metrics. You can buy in Splunk, but that's horrifically expensive. Or you can use an open source version and lose 4 person-years before you even get a graph.

> if you're using the Dropwizard Metrics library, for example, you've already lost.

Can you go into a bit more detail here? Curious to know where Dropwizard goes wrong.

I prefer to use the Prometheus client libraries where possible. Prometheus' data model is "richer": metric families and labels, rather than just named metrics. Adapting from Dropwizard to Prometheus is a pain, and never results in the data feeling "native" to Prometheus.

I think they just mean the host is already aggregating, so any further aggregation compounds the distortion of the data. For example, StatsD’s default is to ship metrics every 10s, so if you graph it and your graph rolls up those data points into 10 minute data points (because you’re viewing a week at once), then you’re averaging an average. Or averaging a p95. People often miss that this is happening, and it can drastically change the narrative.

Yes, exactly this. It's the fact that you're doing aggregation in two places. Since you're always going to be aggregating on the backend, aggregating in the app is bad news.
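A tiny numeric illustration of aggregating in two places, with invented latency numbers: one busy window of fast requests and one quiet window of slow ones. Averaging the per-window averages weights both windows equally, so the two slow requests dominate.

```python
import statistics

# Latencies in ms, grouped by the flush window in which they were reported.
windows = [
    [10, 10, 10, 10, 10, 10, 10, 10],  # busy window, fast requests
    [500, 700],                         # quiet window, slow requests
]
all_values = [v for window in windows for v in window]

# Aggregating once, over the raw values.
true_mean = statistics.mean(all_values)
true_median = statistics.median(all_values)

# Aggregating twice: per window in the app, then again on the backend.
mean_of_means = statistics.mean([statistics.mean(w) for w in windows])
median_of_medians = statistics.median([statistics.median(w) for w in windows])

print(true_mean, mean_of_means)
print(true_median, median_of_medians)
```

The same effect applies when a dashboard rolls up pre-computed p95 values: the result is no longer a p95 of anything.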

It may be interesting to think about the class of aggregate metrics that you can safely aggregate. Totals can be summed. Counts can be summed. Maxima can be maxed. Minima can be minned. Histograms can be summed (but histograms are lossy). A pair of aggregatable metrics can be aggregated pairwise; a pair of a total and a count lets you find an average.

Medians and quantiles, though, can't be combined, and those are what we want most of the time.

Someone who loves functional programming can tell us if metrics in this class are monoids or what.
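One way to picture the safely combinable class is a small aggregate that carries a count, a total, a minimum and a maximum. The `Aggregate` type below is a made-up sketch, not any library's API. Its `merge` is associative and the empty aggregate acts as an identity, which is the shape of a monoid; the average is derived from the total and count at read time rather than stored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Aggregate:
    count: int = 0
    total: float = 0.0
    minimum: float = float("inf")   # identity for min
    maximum: float = float("-inf")  # identity for max

    @classmethod
    def of(cls, values):
        """Build an aggregate by merging one single-value aggregate at a time."""
        agg = cls()
        for v in values:
            agg = agg.merge(cls(1, v, v, v))
        return agg

    def merge(self, other):
        """Combine two aggregates; the order of merging does not matter."""
        return Aggregate(
            self.count + other.count,
            self.total + other.total,
            min(self.minimum, other.minimum),
            max(self.maximum, other.maximum),
        )

    @property
    def mean(self):
        # Derived from the total/count pair, never merged directly.
        return self.total / self.count if self.count else None


host_a = Aggregate.of([10, 20, 30])
host_b = Aggregate.of([100])
combined = host_a.merge(host_b)

print(combined)
print(combined.mean)
```

There is no equivalent `merge` for a stored median: two medians alone do not determine the median of the combined data.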

There is an unjustly obscure beast called a t-digest which is a bit like an adaptive histogram; it provides a way to aggregate numbers such that you can extract medians and quantiles, and the aggregates can be combined:


The problem is post calculating is so slow. At least from my naive viewpoint. I can load dozens of graphs in datadog in seconds, can change tags or time frame and takes literally a second to load. Our Splunk dashboards can take over a minute to load, and reload for any change is more waiting.

Splunk taking minutes to load dashboard is not a problem imposed by post-calculation, it's more of a problem of lack of schema.

Most post-calculation works on free text logs and thus has to regex its way to a solution. But it doesn't have to be that way; that's why the original poster talked about a lack of tooling in the post-calculation world.

You're optimizing for the wrong thing. The hard part about this space isn't extracting value from data, it's physically shipping the data through the infrastructure and into the relevant systems. Metrics are so great compared to logs precisely because they're precalculated (read: highly compressed) before leaving the originating service.

That does sound like you are assuming that everyone always talks about large-scale environments with dozens or more machines?

Everything we're talking about here is trivial until you reach a minimum scale. "Dozens of machines" is still trivial, to be honest.

Not for many people running such environments. So you see either very minimal setups without tooling to help (which is survivable at small scales, but inefficient and mind-numbing), or towers of complexity that followed widely shared advice from people always assuming massive scale.

I miss the powerful metrics and logging systems that I used in Amazon.

> Processing/streaming logs to get metrics is a terrible waste of time, energy and money. > Spend that producing high quality metrics directly from the apps

Absolutely not. Most application metric systems generate metrics as text strings with a simple format that is parsed by the metric collector.

This is what we also call a structured log. Parsing such text strings takes very little CPU.

All logs and metrics represent events. A good approach is to prefer numerical values where possible, but only for quantities that are comparable. Metrics are for the "how many?" question.

But never forget to log text events, because you need to answer the "what happened?" question.

Don't be afraid of generating too many different metrics but avoid too frequent datapoints and unnecessary verbosity in logs.

Never dump complex objects "just in case". Treat overlogging and underlogging as a bug.

Spend time every day in reviewing the metric dashboards and improve them constantly.

If it takes more than 10 seconds to add a new non-obvious chart (e.g. to calculate a ratio between 2 metrics or a percentile or other computation), throw away your charting system.

Lying with numbers is very easy: always look at distributions, not just instant values. Some metrics must be represented as percentiles and min/avg/max are meaningless.

Percentiles are good for ignoring meaningless outliers, but always count the outliers to ensure that you are not ignoring meaningful data. Especially during incidents.

Metrics and text logs tell a story together. Process, correlate and visualize them together as much as possible.

> Processing/streaming logs to get metrics is a terrible waste of time, energy and money. Spend that producing high quality metrics directly from the apps you are looking after/writing/decomming

The other side is that I don't know what metrics I'll want until later.

When do you think it's better to pull metrics from structured logs vs generating metrics in app?

You can go the https://www.honeycomb.io/ way and make structured logs your metrics. It will cost you a lot in storage, but simplifies a lot. Just throw properly structured logs into storage as long as you query them efficiently (which honeycomb provides)

I think the only times it ever really makes sense to use logs to generate metrics are fairly limited:

1. You haven't yet instrumented the application with metrics.

2. The logs are from a third party tool that doesn't emit metrics.

3. The log format is well defined and doesn't change (I'd still prefer native metrics)

Otherwise the issue is that logging messages can and do change over the lifetime of an application. Relying on the content of the log for metrics becomes an implicit API that's not obvious to developers working on the code. I've seen issues of broken monitoring and alerting because a refactor changed log formatting and content. Much better to be explicit about metrics and instrument them directly.

Aha! that is the eternal question.


Short answer: almost never. Structured logs are expensive in terms of infra, management and query time. Storing logs just in case is much more expensive at any kind of scale compared to metrics alone.

Long answer:

A lot of it depends on what the service/program is meant to be doing.

If we take a proxying web service router, for example, listening on example.com/*, we would want metrics to tell us how well it's doing at its specific job, and how well any upstream services are doing.

So for each service URL we'd want at least a hit count for 2xx, 3xx, specific 4xx and 5xx return codes. We'd also want the time taken to process that request.

We'd also probably want to know the total number of active connections to the back end, and the total number of clients connected. Memory and CPU usage would also be a given.

From that we could easily ascertain the health of upstream services, the performance, and total load (which is useful for autoscaling of either the service router, or the upstream apps)

I think it requires sitting down with a piece of paper and imagining your service/app breaking, and then working back to see how that would look. Once you've done that, you can figure out some counters to keep track of those things.

On the rest I agree but on

> Raw metrics are pretty useless. They need to be manipulated into buisness goals: service x is producing 3% 5xx errors vs % of visitors unable to perform action x

I think in general the business goal metrics are OK, but you still need to keep lower level metrics as well. Otherwise it would be more difficult to pinpoint the exact failure; you will just know that a % of visitors is unable to perform action X. In a moderately complex system, a user-level action X is probably composed of several low-level services.

I agree wholeheartedly.

I was trying to get across that just because you collect metrics it doesn't make them useful. I encourage people to generate metrics for everything, we can always join them together later to make something useful.

I think what I should have said is: "Collect metrics for everything, but be sure to display them in a way that's relevant to the customer"

Gavin from Zebrium here. We've found that if only you somehow knew what you were monitoring for in logs, they can be a great source of detecting (and then describing) the long tail of unknown/unknowns (failure modes with unknown symptoms and causes). Our approach is to be able to find these patterns in near real-time using ML. This blog by our CTO explains the tech with some good examples: https://www.zebrium.com/blog/is-autonomous-monitoring-the-an....

> If you are searching through logs to _find_ a problem, its far too late.

To be fair, this is addressed in the article which links to Netflix's blog on the topic and how they do so effectively at their scale: https://netflixtechblog.com/lessons-from-building-observabil...

Agreed with most of what was said there. Still, I find that people mostly treat the SLA as the only thing worth tracking for alerting and raising incidents. A lot has been said about the importance of defining solid SLIs (Service Level Indicators) that are aligned with SLOs (Service Level Objectives). SLAs are usually given to external users of a SaaS and are not very useful for an SRE team.

> Processing/streaming logs to get metrics is a terrible waste of time, energy and money.

Only if you're using Elastic.

I was going to try out Elastic APM for self-hosted APM option. Would that be the same case of a waste of time, energy, money? TIA for any insights!

The difference in metrics seems to be proportional to the level of understanding about how the organization works.

sometimes log volume would be too high (high TX count)

although in most of the systems in my career it has not been the case.

The Art of Monitoring [1] covers most of this stuff in a unified manner.

You are introduced to some basics (push vs. pull monitoring), then it proceeds to simple system metrics collection (cpu, memory) via collectd, moves on to log ingestion, and ends with extracting application-specific metrics from JVM and Python applications.

I highly recommend it, even for seasoned professionals.

[1] https://artofmonitoring.com/

I never see an important system management principle brought up: If you get a user complaint (for some value of "user") and not an alert, you should fix the monitoring system so that you don't get another occurrence of it or related problems. Obviously that's within reason, depending on the circumstances; the effort might not be worth it.

We log extensively. Here are some of my thoughts on it:

- at least in C++, the requirement to be able to log from pretty much anywhere can lead to messy code that either passes a reference to your logger to all classes that might possibly need it, or you've got an extern global somewhere. Yuck.

- logging can enable laziness. Being able to log that something weird happened can be considered a sufficient substitute for proper testing.

- logs are only as useful as the info they contain. This can mean state needs to be passed around all over the place just so that it can all be eventually logged on one line (it saves your data team from having to do a 'join')

- if your logger doesn't support cycling log files it's useless. If something goes wrong you can easily fill a disk.

I'd disagree with 2 and 4.

2. Given a large enough system you will encounter situations where the only action you can take is to log "this really shouldn't happen" and try to roll back as cleanly as possible. This may be due to either complexity or a bug manifesting in a layer completely different than where it occurred (I've seen a null reference crash on "if(foo) foo->bar();" in the past)

4. I believe loggers should ideally know as little as possible about your logs. Logs can be rotated externally, can be buffered and sent to other hosts without touching the disk, can be ignored. Ideally, the system should care, not the app.

> I've seen a null reference crash on "if(foo) foo->bar();" in the past

References can't be null. Regardless, that's a valid check for a null pointer and I don't think what you wrote is at all possible (unless maybe in some multithreaded scenario?).

> if your logger doesn't support cycling log files it's useless. If something goes wrong you can easily fill a disk.

Few applications should be logging to disk directly. Services running under systemd or any modern orchestration platform should log to stdout/stderr and let the system manage the stream.

> the requirement to be able to log from pretty much anywhere can lead to messy code

Ah, Milewski's example of insight from the supposedly useless mathematical stuff: https://bartoszmilewski.com/2014/12/23/kleisli-categories/ (and the corresponding lecture video).

Expanding on your second point, logging is also not a substitute for proper error handling.

It’s weird to see the stuff by Jay Kreps (of Kafka ~fame~) listed in the logs section. His writing is specifically _not_ about logs the observability tool, but logs the data structure such as you’d see at the heart of a database.

No. The original Kafka paper does talk about logs in the observability sense as a premise to solve the aggregation problem.


> There is a large amount of “log” data generated at any sizable internet company. This data typically includes (1) user activity events corresponding to logins, pageviews, clicks, “likes”, sharing, comments, and search queries; (2) operational metrics such as service call stack, call latency, errors, and system metrics such as CPU, memory, network, or disk utilization on each machine. Log data has long been a component of analytics used to track user engagement, system utilization, and other metrics.

> We have built a novel messaging system for log processing called Kafka [18] that combines the benefits of traditional log aggregators and messaging systems....Kafka provides an API similar to a messaging system and allows applications to consume log events in real time.

A quote from the LinkedIn blog post linked in the article:

“But before we get too far let me clarify something that is a bit confusing. Every programmer is familiar with another definition of logging—the unstructured error messages or trace info an application might write out to a local file using syslog or log4j. For clarity I will call this "application logging". The application log is a degenerative form of the log concept I am describing”

Fair enough. But I don't think quoting this for logging in the tracing sense is wrong here. He does acknowledge that trace logs are a degenerative form of logs from the perspective of log processing. The only difference is in the semantics of human-readable text vs. binary logs.

Very true. Jay Kreps's log is completely unrelated to the topic of this article. This added to my feeling that this "guide" is rather a collection of fragments put together without a real understanding of the subject on the author's part.

Is there an open source solution for processing streams of structured and unstructured logs and routing them onward? I see solutions for moving logs to Elastic or Kafka but nothing for evaluating the log.

This is a problem that is both solved again and again, but also all the available solutions are bad.

In my experience what happens is:

1. you start with a "ship logs from X to Y" product

2. you add more sources and more destinations, making it more of a central router. you add config options for specifying your sources and dests.

3. since the way you checkpoint or consume or pull or push certain sources or dests doesn't generalize, you end up buffering internally to present a unified "I have received / sent this message successfully" concept to your inputs and outputs.

4. you want to do some basic transforms on the logs as you go. you implement "filters" or "transforms" or "steps" and make them configurable. your config now describes a graph of sources -> filters -> dests

5. your filters need to be more flexible. you add generic filters whose behaviour is mostly controlled by their config options. your configs grow more complicated as you use multiple layers of differently-configured filters

6. you have a bad Turing-complete programming language embedded in your config file. getting simple tasks done is possible, getting complex tasks done becomes an awful, inefficient and unreadable mess.

My solution to this cycle has been to just write simple hard-coded applications that can only do the job I need them to do. If they need a different configuration later I edit the source. I'm writing my transforms in a real programming language and I avoid the additional complexity of abstractions. Of course, that comes with its own costs but I consider it well worth it.
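
To make that concrete, here is a rough sketch of the kind of hard-coded tool I mean, not any particular program. Everything in it is an assumption for illustration: the destination names (`alerts`, `archive`), the `level` field, and the rule that debug events get dropped. The routing rules are ordinary code, so changing them means editing `route`.

```python
import json
import sys
from typing import Iterable, Iterator, Tuple


def parse(line: str) -> dict:
    """Turn one raw line into an event dict.

    Structured lines are assumed to be JSON objects; anything else is
    wrapped as an unstructured message.
    """
    line = line.rstrip("\n")
    try:
        event = json.loads(line)
        if isinstance(event, dict):
            return event
    except json.JSONDecodeError:
        pass
    return {"message": line}


def route(event: dict) -> str:
    """Hard-coded routing rules (hypothetical destinations)."""
    level = str(event.get("level", "")).lower()
    if level in ("error", "critical"):
        return "alerts"
    if level == "debug":
        return "drop"
    return "archive"


def process(lines: Iterable[str]) -> Iterator[Tuple[str, dict]]:
    """Yield (destination, event) pairs, skipping blanks and dropped events."""
    for line in lines:
        if not line.strip():
            continue
        event = parse(line)
        dest = route(event)
        if dest != "drop":
            yield dest, event


def run(stdin=sys.stdin, out=sys.stdout, err=sys.stderr) -> None:
    """Read lines from stdin; write alerts to stderr, the rest to stdout."""
    for dest, event in process(stdin):
        target = err if dest == "alerts" else out
        target.write(json.dumps({**event, "dest": dest}) + "\n")
```

Call `run()` at the bottom of the script and pipe logs into it. There is no config file to grow into a language: a new rule is a new `if` in `route`, with the cost that every change is a code change and a redeploy.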

The "OG" in the space is collectd, which is still my favorite choice if you are responsible at the operating system level: https://collectd.org/wiki/index.php/Chains

https://github.com/elastic/logstash was one of the first modern approaches. I started using it less the more often I ran into JRuby related bugs.

https://github.com/trivago/gollum is my pick from the golang ecosystem.

There are many more variants depending on how much complexity you are trying to apply. If you need to apply machine learning models, for example, you're probably going to end up with something similar to Apache Storm, though I don't know if its operational story has improved enough to consider it over other alternatives. I lost track years ago between Apache Spark and the half dozen other stream processing projects.

Riemann [1]. You can create custom endpoint for accepting almost any kind of messages, logs or data.

[1] http://riemann.io/

> Is there an open source solution for processing streams of structured and unstructured logs and routing them onward?


It doesn't route them onward - it will collect, aggregate and provide you the tools to correlate/analyze logs across your environment. Enable the built in network monitoring tools too and you have not only a powerful tool to help you with application management, but security as well (hence its namesake).

Beware - in peeling back the layers of your environment you can really get sucked in. I never seem to have enough hardware to do what I want with SO but it's pretty amazing what you can do with it.

EDIT - wow, I'm a little shocked that no one else has brought Security Onion up. I guess they need to up their advertising game!

You can actually do this log manipulation in fluent-bit (you can write Lua if you need to) although the forwarding cannot be routed to different locations.

https://vector.dev/ sounds pretty close

Maybe you're looking for something like this: https://docs.tremor.rs/

I haven't found anything; we are moving to hosted Humio really soon. It uses Kafka.

> Logging is critical to detecting attacks and intrusions.

Yes, but not universally - and just collecting logs will not take you far. Logging everything and trying to approach security via "collect all data" is both expensive and inaccurate, and one of the major inefficiencies in modern cyber.

This is done efficiently at scale by both Cylance and Crowdstrike, but is certainly only one part of a defense in depth strategy.

There are viable products around human threat hunting which would be impossible without a 'collect all the data' component.

You are correct, and this is the key part - what % of organizations have money, skills and people to build a robust enough capability around threat hunting, for example?

I’ve been super lucky to meet various orgs and their security in all geographies and many industries and my gut feeling is 1 out of 10 teams.

Security Onion does an amazing job at collecting and correlating, especially for an open source product. The traditional trade-off with open source is there - a bit of up-front effort for longer-term value.

Recently, I was searching for a service which offers this functionality at a very basic level. I tried several options and was really disappointed with all of them. The only one that I found to be usable was https://logdna.com/. I've now been using it for a couple of weeks and it works OK. It offers logging, alerts, metrics/dashboards, and some other things, all at a reasonable price.

Am I the only one that can’t reach the “save and exit” privacy button on mobile?

It’s hard for me to think that this is not intentional when the “Accept all” is usable but the alternative isn’t...


If you don't need all the fancy metrics, and just want something simple to keep an eye on your services, alert you if they fail, and automatically restart them, check out my stealthcheck service. It's all of 150 lines of free-range, zero-dependency Go:

