1. Write good logs... not too noisy when everything is running well, meaningful enough to let you know the key state or branch of code when things deviate from the good path. Don't worry about structured vs unstructured too much, just ensure you include a timestamp, file, log level, func name (or line number), and that the message will help you debug.
2. Instrument metrics using Prometheus, there are libraries that make this easy: https://prometheus.io/docs/instrumenting/clientlibs/ . Counts get you started, but you probably want to think in aggregation and to ask about the rate of things and percentiles. Use histograms for this https://prometheus.io/docs/practices/histograms/ . Use labels to create a more complex picture, i.e. A histogram of HTTP request times with a label of HTTP method means you can see all reqs, just the POST, or maybe the HEAD, GET together, etc... and then create rates over time, percentiles, etc. Do think about cardinality of label values, HTTP methods is good, but request identifiers are bad in high traffic environments... labels should group not identify.
Start with those things, tracing follows good logging and metrics as it takes a little more effort to instrument an entire system whereas logging and metrics are valuable even when only small parts of a system are instrumented.
Once you've instrumented... Grafana Cloud offers a hosted Grafana, Prometheus metrics scraping and storage, and Log tailing and storage (via Loki) https://grafana.com/products/cloud/ so you can see the results of your work immediately.
If it's a big project, you have a lot of options and I assume you know them already, this is when you start looking at Cortex and Thanos, Datadog and Loki, tracing with Jaegar.
Imagine: user registers, the post gets and id, and context registering. Then he adds a credit card. (a new id, context credit payments) after 14 days the bill goes out, same id, same context.
I haven't looked at their pricing before, but for small-ish environments, their standard plan looks really good and simple. None of the "per host, but also per function, and extra for each feature, and extra for usage" approach like other providers (datadog, I'm looking at you).
I was thinking "God, this is exactly why I hate Datadog" as I was reading your description and got a great laugh when I reached the end. Their billing is absolutely byzantine.
I don't know that I've ever seen a company that had such a stark difference between great engineering/product and awful business/sales practices. Their product is really the best turn-key option out there, but I'm always hesitant to use its features without double checking it's not going to add 50% to my bill. Their sales teams are some of the worst I've dealt with, and I deal with a lot of vendors. They're starting to get a really bad reputation as well.
I even posted about it in another thread a couple weeks ago about great software.
Is it really that expensive when compared to other vendors? Thought their newer logging tool was a lot cheaper than splunk and their apm tool for distributed tracing is also pretty cheap when compared with something like new relic. Sure it's more expensive than free tools that you need to setup yourself. But the velocity it lets your teams have is so much better than having to use something like grafana with tools like Prometheus. Again, sure it can be done for cheap, but the time it takes to manage those tools and the velocity that you lose when doing that doesn't seem like it's worth it for smaller companies but I can see it making more sense as you scale a company.
For instance, you have to pay Datadog per host you install the agent on. In addition to the per host cost, you have to pay per container you run on that host (past a very small baseline per host), and the per contain cost turns out to be nearly as high as the per host cost if you have reasonable density. Why am I paying Datadog per container I run? Aside from a not particularly useful dashboard, why does a process namespace and some cgroup metrics nearly double my bill? They are literally just processes on a server. Because Datadog wants you to run more hosts, so you install more agents.
Every feature they add also seems to be charged separately, but is not behind any sort of feature gate. This means new features just show up for my developers, and they have no clue if it costs money to use them. I can't just disable or cap, for example, their custom metrics per user, per project, or at all. So when my developers see a useful feature and start using it, all of a sudden I have an extra $10k on my monthly bill. Even more fun are features that show up and are initially free but then start charging.
This is such a pain that we've had to tell dev teams not to use Datadog features outside of a curate list. Every product has some rough edges, but with Datadog the patterns are all setup such that you end up paying them thousands of extra dollars. Again, great product, but not a business I would be interested in associating with again given the choice.
You might want to check out New Relic One, especially with the new pricing model. I think they even added a Prometheus integration recently?
The simplicity of it, dashboards, notebooks, logs etc, is what makes it so appealing though.
Still it is often not worth to roll your own, so it is nice to have alternatives for different price points and company scales.
Exactly this. We operate in Eastern Europe with local clients, offering on-prem SaaS. If I added all my clients' servers on datadog it would very easily eat through our profit margins.
> Still it is often not worth to roll your own
I tried hard not to, but at at the end, after spending 1 week trying to setup netdata and failed, I decided not to spend another week trying to setup grafana/influx/prometheus (lot's of docs to go through), and just have some bash scripts send metrics on a $10 digital ocean node service that sends me emails/sms when something "looks bad" (eg. high cpu temperature, stopped docker containers, etc).
I gave up on aggregated logging for the time being, since I can just ssh on each server and check journal and docker logs if I need to (as long as the hard drives don't crash).
I was already a week deep into looking at various options and had to deliver on basic metrics and alerting, so I figured a couple of bash scripts, that log into local files with log rotation, systemd, and a dump/memory only receiving end running on nodejs for the alerts would be much faster and easier to maintain.
So far so good.
I think this is a key feature not many people implement especially in today's world of over blown micro services, having a transaction id from the time the request hits the reverse-proxy till the database write is so helpful in debugging, saves a ton of time.
I guess, is it political, or technical?
Asking for a friend.
if you include all these information and the logs are not structured, you won't get much information out of them.
Our approach to deal with logs is to use ML to structure them after the fact (and we can deal with changing log structures). You can read about it in a couple of our blogs like: https://www.zebrium.com/blog/using-ml-to-auto-learn-changing... and https://www.zebrium.com/blog/please-dont-make-me-structure-l....
Logs are great, but only once you've identified the problem. If you are searching through logs to _find_ a problem, its far too late.
Processing/streaming logs to get metrics is a terrible waste of time, energy and money. Spend that producing high quality metrics directly from the apps you are looking after/writing/decomming (example: dont use access logs to collect 4xx/5xx and make a graph, collate and push the metrics directly)
Raw metrics are pretty useless. They need to be manipulated into buisness goals: service x is producing 3% 5xx errors vs % of visitors unable to perform action x
Alerts must be actionable.
Alerts rules must be based on sensible clear cut rules: service x's response time is breeching its SLA not service x's response time is double its average for this time in may.
Yeah nah, but, okay, nah yeah.
Generating metrics in the app is much more intrusive, and requires that you figure out the metrics you need ahead of time. It adds dependencies, sockets, and threads to your app.
Unless you're very careful, it's also easy to end up double-aggregating, computing medians of medians and other meaningless pseudo-statistics - if you're using the Dropwizard Metrics library, for example, you've already lost.
If you output structured log events, where everything is JSON or whatever and there are common schema elements, you can easily pull out the metrics you need, configure new ones on the fly, and retrospectively calculate them if you keep a window of log history.
When i've worked on systems with both pre- and post-calculated metrics, the post-calculated metrics were vastly more useful.
The huge, virtually showstopping, caveat here is that there is lots of decent, easy-to-use tooling for pre-calculated metrics, and next to nothing for post-calculated metrics. You can drop in some libraries and stand up a couple of servers and have traditional metrics going in a day, with time for a few games of table tennis. You need to build and bodge a terrifying pile of stuff to get post-calculated metrics going.
Anyway if there's a VC reading this with twenty million quid burning a hole in their pocket who isn't fussy about investing in companies with absolutely no path to profitability, let me know, and i'll do a startup to fix all this. I'll even put the metrics on the blockchain for you, guaranteed street cred.
Oh no, never do anything fancy on the client end. yeah thats total trash. Any client that does any kind of aggregating is a massive pain in the arse.
Counters are good enough for 90% of everything you want. You can turn counters into hits per second easily. Plus they are more resistant to time based averaging. If you do your stats correctly, you can even has resetting counters create nice smooth graphs (non negative derivatives are a god send)
Yes, this is a library that argues strongly against the use of metrics. From what I recall 1 node of casasndra will output close to 50,000 metrics by default. That is too much.
When a team I worked with were migrating away from splunk to graphite/grafana, they shat out something close to a million metrics. 99.8% were totally useless.
> You need to build and bodge a terrifying pile of stuff to get post-calculated metrics going.
Yes! I think thats my main objection. Its so bloody expensive to do post-hock metrics. you can buy in splunk, but thats horrifically expensive. Or you can use an open source version and loose 4 person years before you even get a graph.
Can you go into a bit more detail here? Curious to know where Dropwizzard goes wrong.
I prefer to use the Prometheus client libraries where possible. Prometheus' data model is "richer" -- metric families and labels, rather than just named metrics. Adapting from Dropwizzard to Prometheus is a pain, and never results in the data feeling "native" to Prometheus.
It may be interesting to think about the class of aggregate metrics that you can safely aggregate. Totals can be summed. Counts can be summed. Maxima can be maxed. Minima can be minned. Histograms can be summed (but histograms are lossy). A pair of aggregatable metrics can be aggregated pairwise; a pair of a total and a count lets you find an average.
Medians and quantiles, though, can't be combined, and those are what we want most of the time.
Someone who loves functional programming can tell us if metrics in this class are monoids or what.
There is an unjustly obscure beast called a t-digest which is a bit like an adaptive histogram; it provides a way to aggregate numbers such that you can extract medians and quantiles, and the aggregates can be combined:
Most post calculation works on free text logs and thus has to regex it's way to a solution.
But it doesn't have to be that way; that's why the original poster talked about a lack of tooling in the post-calculation world
> Processing/streaming logs to get metrics is a terrible waste of time, energy and money.
> Spend that producing high quality metrics directly from the apps
Absolutely not. Most application metric systems generate metrics as text strings with a simple format that is parsed by the metric collector.
This is what we also call a structured log. Parsing such text strings takes very little CPU.
All logs and metrics represent events. A good approach is to prefer numerical values where possible, but only for quantities that are comparable. Metrics are for the "how many?" question.
But never forget to log text events, because you need to answer the "what happened?" question.
Don't be afraid of generating too many different metrics but avoid too frequent datapoints and unnecessary verbosity in logs.
Never dump complex objects "just in case". Treat overlogging and underlogging as a bug.
Spend time every day in reviewing the metric dashboards and improve them constantly.
If it takes more that 10 seconds do add a new non-obvious chart (e.g. to calculate a ratio between 2 metrics or a percentile or other computation) throw away your charting system.
Lying with numbers is very easy: always look at distributions, not just instant values. Some metrics must be represented as percentiles and min/avg/max are meaningless.
Percentiles are good for ignoring meaningless outliers, but always count the outliers to ensure that you are not ignoring meaningful data. Especially during incidents.
Metrics and text logs tell a story together. Process, correlate and visualize them together as much as possible.
The other side is that I don't know what metrics I'll want until later.
When do you think it's better to pull metrics from structured logs vs generating metrics in app?
1. You haven't yet instrumented the application with metrics yet.
2. The logs are from a third party tool that don't emit metrics
3. The log format is well defined and doesn't change (I'd still prefer native metrics)
Otherwise the issue is that logging messages can and do change over the lifetime of an application. Relying on the content of the log for metrics becomes an implicit API that's not obvious to developers working on the code. I've seen issues of broken monitoring and alerting because a refactor changed log formatting and content. Much better to be explicit about metrics and instrument them directly.
almost never. structured logs are expensive in terms of infra, management and query time. Storing logs just in case is much more expensive at any kind of scale compared to metrics alone.
A lot of it depends on what the service/program is meant to be doing.
If we take a proxying webs service router for example listening on example.com/* We would want metrics to tell us how well its doing for its specific job, and any upstream services.
So for each service URL we'd want at least a hit count for 2xx, 3xx, specific 4xx and 5xx return codes. We'd also want the time taken to process that request.
We'd also probably want to know the total number of active connection to back end, and total clients connected. Memory and CPU usage would also be a given.
From that we could easily ascertain the health of upstream services, the performance, and total load (which is useful for autoscaling of either the service router, or the upstream apps)
I think it requires sitting down with a peice of paper and imagining your service/app breaking, and then working back to see how that would look. Once you've done that, you can figure out some counters to keep track of those thins.
> Raw metrics are pretty useless. They need to be manipulated into buisness goals: service x is producing 3% 5xx errors vs % of visitors unable to perform action x
I think in general the business goals metrics are OK but you still need to keep lower level metrics as well, otherwise it would be more difficult to pinpoint the exact failure, you will just know that a % of visitors is unable to perform action X. In a moderately-complex system a user-level action X is probably composed by several low-level services.
I was trying to get across that just because you collect metrics it doesn't make them useful. I encourage people to generate metrics for everything, we can always join them together later to make something useful.
I think what I should have said is: "Collect metrics for everything, but be sure to display them is a way thats relevant to the customer"
To be fair, this is addressed in the article which links to Netflix's blog on the topic and how they do so effectively at their scale: https://netflixtechblog.com/lessons-from-building-observabil...
Only if you're using Elastic.
although in most of the systems in my career it has not been the case.
You are introduced to some basics (push vs. pull monitoring), then proceeded with simple system metrics collection (cpu, memory) via collectd, then goes to logs ingestion and ends up extracting application-specific metrics from jvm and python applications.
I highly recommend it, even for seasoned professionals.
- at least in C++, the requirement to be able to log from pretty much anywhere can lead to messy code that either passes a reference to your logger to all classes that might possibly need it, or you've got an extern global somewhere. Yuck.
- logging can enable laziness. Being able to log that something weird happened can be considered a sufficient substitute for proper testing.
- logs are only as useful as the info they contain. This can mean state needs to be passed around all over the place just so that it can all be eventually logged on one line (it saves your data team from having to do a 'join')
- if your logger doesn't support cycling log files it's useless. If something goes wrong you can easily fill a disk.
2. Given a large enough system you will encounter situations where the only action you can take is to log "this really shouldn't happen" and try to roll back as cleanly as possible. This may be due to either complexity or a bug manifesting in a layer completely different than where it occurred (I've seen a null reference crash on "if(foo) foo->bar();" in the past)
4. I believe loggers should ideally know as little as possible about your logs. Logs can be rotated externally, can be buffered and sent to other hosts without touching the disk, can be ignored. Ideall, the system should care, not the app.
References can't be null. Regardless, that's a valid check for a null pointer and I don't think what you wrote is at all possible (unless maybe in some multithreaded scenario?).
Few applications should be logging to disk directly. Services running under systemd or any modern orchestration platform should log to stdout/stderr and let the system manage the stream.
Ah, Milewski's example of insight from the supposedly useless mathematical stuff: https://bartoszmilewski.com/2014/12/23/kleisli-categories/ (and the corresponding lecture video).
> There is a large amount of “log” data generated at any sizable internet company. This data typically includes (1) user activity events corresponding to logins, pageviews, clicks, “likes”, sharing, comments, and search queries; (2) operational metrics such as service call stack, call latency, errors, and system metrics such as CPU, memory, network, or disk utilization on each machine. Log data has long been a component of analytics used to track user engagement, system utilization, and other metrics.
> We have built a novel messaging system for log processing called Kafka  that combines the benefits of traditional log aggregators and messaging systems....Kafka provides an API similar to a messaging system and allows applications to consume log events in real time.
“But before we get too far let me clarify something that is a bit confusing. Every programmer is familiar with another definition of logging—the unstructured error messages or trace info an application might write out to a local file using syslog or log4j. For clarity I will call this "application logging". The application log is a degenerative form of the log concept I am describing”
In my experience what happens is:
1. you start with a "ship logs from X to Y" product
2. you add more sources and more destinations, making it more of a central router. you add config options for specifying your sources and dests.
3. since the way you checkpoint or consume or pull or push certain sources or dests doesn't generalize, you end up buffering internally to present a unified "I have recieved / sent this message successfuly" concept to your inputs and outputs.
4. you want to do some basic transforms on the logs as you go. you implement "filters" or "transforms" or "steps" and make them configurable. your config now describes a graph of sources -> filters -> dests
5. your filters need to be more flexible. you add generic filters whose behaviour is mostly controlled by their config options. your configs grow more complicated as you use multiple layers of differently-configured filters
6. you have a bad turing complete programming language embedded in your config file. getting simple tasks done is possible, getting complex tasks done becomes an awful, inefficient and unreadable mess.
My solution to this cycle has been to just write simple hard-coded applications that can only do the job I need them to do. If they need a different configuration later I edit the source. I'm writing my transforms in a real programming language and I avoid the additional complexity of abstractions. Of course, that comes with its own costs but I consider it well worth it.
https://github.com/elastic/logstash was one of the first modern approaches. I started using it less the more often I ran into JRuby related bugs.
https://github.com/trivago/gollum is my pick from the golang ecosystem.
There are many more variants depending on how much complexity you are trying to apply. If you need to apply machine learning models, for example, you're probably going to end up with something similar to Apache Storm, though I don't know if it's operational story has improved enough to consider it over other alternatives, I lost track years ago between Apache Spark and the half dozen other stream processing projects.
It doesn't route them onward - it will collect, aggregate and provide you the tools to correlate/analyze logs across your environment. Enable the built in network monitoring tools too and you have not only a powerful tool to help you with application management, but security as well (hence its namesake).
Beware - in pealing back the layers of your environment you can really get sucked in. I never seem to have enough hardware to do what I want with SO but it's pretty amazing what you can do with it.
EDIT - wow, I'm a little shocked that no one else has brought Security Onion up. I guess they need to up their advertising game!
Yes, but not universally - and just collecting logs will not take you far. Logging everything and trying to approach security via the ’collect all data’ is both expensive and inaccurate, and one of the major inefficiencies in modern cyber.
There are viable products around human threat hunting which would be impossible without a 'collect all the data' component.
I’ve been super lucky to meet various orgs and their security in all geographies and many industries and my gut feeling is 1 out of 10 teams.
It’s hard for me to think that this is not intentional when the “Accept all” is usable but the alternative isn’t...