Logging is for events with identity; metrics are for events without identity. They are not the same thing.
Sometimes I care about the bulk behaviour of my application. I hit 2757 RPS and suddenly find a catastrophic gap in the performance landscape of my application -- that's useful. I don't care much about the particulars of the requests. What matters is the statistical behaviour.
But sometimes the application crashes when a particular input is given. These are outliers, they don't show up clearly in statistical behaviour. In that case I care about the details.
Metrics are warnings that smokers die younger, on average, than non-smokers.
Logs are a biopsy showing which kind of lung cancer you have.
No sensible production system or PaaS treats them as identical, even if you can derive metrics from logs.
Well put! Also, there are the cases when the system does not behave as expected (for example, mis-configuration, or wrong input from the user) - in those cases there are no exceptions or crashes, but a good log will tell you why it happens.
An additional practical difference: because metrics track less, they consume far fewer resources, which lets you gain insight that isn't practical to get from logs.
I'd expect in a well instrumented system that a single request could easily affect hundreds of metrics. Trying to get that level of information from logging alone isn't practical due to the volume of data it'd require.
If you want to track the overall behaviour of hundreds of codepoints, metrics are good; if you want to track the behaviour of every event at a handful of codepoints, logs are good.
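Roughly, the per-request cost difference looks like this (a Python sketch using only the stdlib; do_work and the request fields are made up):

    import logging, time
    from collections import Counter

    log = logging.getLogger("app")
    metrics = Counter()          # in-memory counters, flushed/scraped periodically

    def handle_request(req):
        start = time.monotonic()
        status = do_work(req)    # hypothetical handler

        # Metrics: a few integer bumps, no identity -- cheap at any volume.
        metrics["requests_total"] += 1
        metrics[f"responses_{status}"] += 1
        metrics["latency_ms_sum"] += int((time.monotonic() - start) * 1000)

        # Log: one record with identity -- this is what you dig through when
        # a particular input misbehaves.
        log.info("handled request", extra={
            "request_id": req.id,
            "user": req.user,
            "status": status,
        })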
I think you are conflating the need to centralize logs with the need to have them around at all. The OP is saying that centralizing them might not always make sense, and I tend to agree (and I say this as someone who maintains a popular open source log collector).
If I were to play the devil's advocate, the real need for raw log data in a centralized location is for folks outside of Ops: data analysts and data scientists.
You are preaching to the choir here. Fluentd (as a proxy for my logging-related beliefs) pushed Docker to have logging long before the logging driver, and now it is one of the officially supported logging drivers.
What you don't seem to realize is that the cost of centralized logging is not always worth it. Machines are ephemeral and so are many application-related problems. It's one thing to counter the OP by saying that centralized logging has merits (and I believe the OP agrees with that statement) and another to say centralized logging is always a must.
If the logs might contain the stacktrace of a request that brought down your application, how do you look at them if your health-checker helpfully terminates the now-considered-wedged app VM, and your logs also "helpfully" disappear right along with it?
Half the point of logs is that they're the examinable blast-wave of something that might no longer exist.
But why do you care if a request took down the application, unless it happens repeatedly? Any sufficiently complex system will have transient errors. Why waste time tracking them down?
If it happens repeatedly, then you add central logging until you find the problem.
Maybe the app comes back up with all its users missing. Maybe you get an email a day later telling you you've been hacked and listing details of your database. Or maybe the one instance wedged itself in a weird way (crashed with a lock open, etc.) and destabilized the system, and you spent sixteen hours getting everything back up.
The severity of a bug has nothing to do with its commonality. Sometimes there's something that happens once, and bankrupts your business. Security vulnerabilities are such.
However, I'm more confused by the statement "add central logging"—how are you doing this, and how much time does it cost you? If you mean enabling logging code you already wrote using ops-time configuration (in effect, increasing the log-level) then I can see your point. If you mean adding logging code, then you're making your ops work block on developers.
Either way, what is the cost you're imagining to central logging, that you would consider adding it in specific cases where it's "worth it", but not otherwise? It's just a bunch of streams going somewhere, getting collated, getting rotated, and then getting discarded. The problem itself is about as well-known as load-balancing; it's infrastructure you just set up once and don't have to think about. It doesn't even have scaling problems!
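For what it's worth, the "set it up once" version can be tiny. A Python stdlib sketch (the collector host and the LOG_LEVEL variable are made up), with verbosity controlled by ops-time configuration rather than code changes:

    import logging
    import logging.handlers
    import os

    root = logging.getLogger()
    # Ops-time knob: raise or lower verbosity without touching application code.
    root.setLevel(os.environ.get("LOG_LEVEL", "INFO"))

    # Local stream, picked up by whatever supervises the process.
    root.addHandler(logging.StreamHandler())

    # Central stream: forward the same records to a syslog collector, which
    # handles collation, rotation, and eventual discard.
    root.addHandler(logging.handlers.SysLogHandler(
        address=("logs.example.internal", 514)))  # hypothetical collector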
> But sometimes the application crashes when a particular input is given. These are outliers, they don't show up clearly in statistical behaviour. In that case I care about the details.
But my question is, why do you care about the outliers if the customer experience is not harmed by them? And if it is harmed, then it should show up in your 99th percentile metrics, even as an outlier, at which point you can turn on aggregated logging to look for that one thing. And if that one thing doesn't happen again, then did it really matter?
> But my question is, why do you care about the outliers if the customer experience is not harmed by them?
The effect of errors is not necessarily proportional to their frequency.
Distributed systems of any consequence are not meaningfully modelled with smooth, continuous functions suitable for high school calculus. There is a great deal of chaos and catastrophe.
Metrics -- including customer visible ones like "zero HTTP responses per second" -- raise the question. Logs provide the physical clues.
> Metrics -- including customer visible ones like "zero HTTP responses per second" -- raise the question. Logs provide the physical clues.
Exactly my point. First your monitoring tells you that there is zero HTTP response, then you turn on central logging until you find the problem.
If the problem goes away before you turn the logging on, then it wasn't really a problem, right? And if it happens again, then you can turn on central logging until you find it.
turning on logging post-facto won't go back in time to catch a rare bug. And since you need the added resources that logging requires to be available at all times anyway, you might as well just leave logging on. What's your objection to having recent logs?
> turning on logging post-facto won't go back in time to catch a rare bug
I guess my point is that if a rare bug is rare enough, then it doesn't really matter, especially if the effect isn't catastrophic.
> And since you need the added resources that logging requires to be available at all times anyway, you might as well just leave logging on. What's your objection to having recent logs?
There are a lot more resources required to log everything all the time than to log a subset of things some of the time. Even having recent logs of everything is a huge overhead compared to more detailed logs of just some things.
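Back-of-envelope with made-up but plausible numbers: at 1,000 requests per second and ~1 KB of log output per request, you're shipping and indexing on the order of 86 GB of raw logs a day, while a few hundred counters and histograms cost about the same few megabytes whether you serve ten requests or ten million.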
> I guess my point is that if a rare bug is rare enough, then it doesn't really matter, especially if the effect isn't catastrophic.
That's only true if you run a trivial application. If you are a bank for example, it matters very much why the outcome of a transaction came out wrong, even if the effect wasn't catastrophic. Every circumstance and every bit of identity to the transaction is important. "It's rare" or "I don't believe in logging" doesn't cut it.
> There are a lot more resources required to log everything all the time than to log a subset of things some of the time.
Resources are a matter of budgetary concerns, mostly. What does a bug in production cost you? Are you liable? How fast? What's your SLA? Those things matter.
Don't forget safety-critical software. I don't want avionics software developers thinking "that bug doesn't matter because it's very rare". Even if the effect is non-catastrophic.
For certain classes of software, zero bugs is not an unreasonable or unrealistic goal.
> Resources are a matter of budgetary concerns, mostly.
True, but most people unfortunately don't have unlimited budgets to work with.
Also, there is more than a monetary cost. Every system in your architecture incurs a non-zero cost to maintenance, reliability, cognitive load, and management. For example, if you are in AWS, every unused box adds time to the API call for listing the instances. Every extra box that is there for "just in case" does too. It's important to balance these things.
> Resources are a matter of budgetary concerns, mostly.
After building my own logging system, which was expensive in dev time + maintenance, I switched our company over to papertrail ( https://papertrailapp.com ); they handle quite large volumes of logs for less than $100 a month.
They provide an interface to search the logs with low latency.
I'm not sure what type of organisation can't afford it.... (cost scales with usage)
That works if you're building Reddit, Netflix, or even most Google services. It doesn't work if you're doing financial software or anything that handles money or is critical for someone else's business.
I haven't looked into it, so I'm not sure, but I'm also not yet convinced that server logs can increase the integrity of banking transactions. I'd love to get some more info on specifics for this.
For example, a lot of people just assume that a bank can't be eventually consistent, but if you make them stop and think about it, they realize that they can and are. The ATM isn't always connected to the bank's servers. That's why your card has a withdrawal limit. That's how much they are willing to risk for eventual consistency.
I feel like the same is true for logging: other checks can be put in place to mitigate risk to an acceptable level.
It's not that the logs themselves increase the integrity of banking transactions; it's that they help you find the bugs, and fixing those bugs is what increases the integrity of banking transactions.
Say that you've built your system like any good highly-available system so that you rigorously check consistency of the program invariants, log if there are any anomalies, and then either try to recover or back out with a user error if something goes wrong. You've configured your monitoring to alert if any of the consistency checks fail. Now you get an alert that you've errored out and served a 500 on one in every million requests.
For Reddit, Facebook, Google, or other free services, it's no big deal: one in a million means that you're serving a couple hundred, maybe a few thousand 500s per day, max, and then the user just refreshes and gets over it.
But in a financial transaction app, this is a big deal. The basic assumption you have to make is that if you have a consistency problem that your monitoring caught, you probably have consistency problems that you didn't catch. And the right thing to do is investigate until you understand what is going on and what the impact is. So you pull the logs of all surrounding requests, you pull the logs of any requests that were in flight at the same time, you pull the logs for the user that initiated the request and see exactly what they were doing. And then you try to reproduce the problem, test for it, and fix it.
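Concretely, the kind of check-and-log I have in mind is roughly this (a sketch; the account API and field names are hypothetical):

    import logging
    import uuid

    log = logging.getLogger("transfers")

    class ConsistencyError(Exception):
        pass

    def transfer(src, dst, amount):
        txn_id = uuid.uuid4().hex
        before = src.balance + dst.balance

        src.withdraw(amount)      # hypothetical account API
        dst.deposit(amount)

        # Program invariant: money is neither created nor destroyed.
        after = src.balance + dst.balance
        if after != before:
            # Log with full identity so the surrounding requests can be pulled
            # later; monitoring alerts on these records.
            log.error("invariant violation txn=%s src=%s dst=%s before=%s after=%s",
                      txn_id, src.id, dst.id, before, after)
            raise ConsistencyError(txn_id)   # surfaces as a 500 upstream
        return txn_id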
The difference is entirely in the tolerances built into the system, and how that dictates that you respond to errors. In some domains, a one-in-a-million error is a "well, we'll catch it next time" event. In others, a one-in-a-million error is a "drop everything you're doing, fix it, and ensure that no similar errors exist" event.
BTW, the strictest tolerances usually are not in B2C companies, they are in suppliers of large B2C companies (or the government). The bank is completely willing to risk a few hundred dollars per customer on an ATM card, but if the ATM vendor loses that money without it being part of the spec they gave the bank, they've lost the contract. Most of these enterprise B2B contracts are based on trust, and if the customer observes you losing money accidentally, they'll wonder what else you're doing accidentally.
Imagine customer support where you really need a record of the individual transactions for that particular customer. A monitoring solution won't give you that level of detail.
Individual transactions aren't found in application logs; they are found in the data store. Also, again, it's important to stress that central logging should be available, just not on all the time. If it's a specific sequence of events that triggers a problem, turn on logging and then ask them to do the sequence again. If they don't need to do the sequence again, then it wasn't really of consequence. And if it happens for multiple users, then customer service should be able to let the developers know to try to find the cause, which they can do by adding more metrics or perhaps some temporary central logging.
A better compromise would be always having logging on but only retaining very recent logs. You need the infrastructure to support logging and correlation anyway and if you have a short enough retention period you don't have to worry about dealing with growing infrastructure just to support it.
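In Python stdlib terms that compromise is just a rotating handler with a short window (a sketch; the path and the seven-day window are arbitrary):

    import logging
    import logging.handlers

    handler = logging.handlers.TimedRotatingFileHandler(
        "/var/log/app/app.log",   # hypothetical path
        when="D", interval=1,     # rotate once a day
        backupCount=7)            # keep a week; older files are deleted automatically
    logging.getLogger().addHandler(handler)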
Yeah, I think most people store them for a limited time, such as a week or a month. Most issues that pop up will only be relevant that long. You can keep permanent archives of individual data, or even just summary metrics, since storage is cheap. Better to start with ground truth, because you often don't know what data you'll need in hindsight.
I think that's the difference between a log and an audit trail.
Parent makes a good point imo. Systems are now sufficiently complex that logging everything is a useless indicator. And if you're only using a subset of your logging anyway (by generating metrics), why collect anything more before you know there's a problem?
That, and their cost is relatively low from a business standpoint (and I say this as a security person myself). Even the worst breaches generally don't have much effect on the bottom line (Target is a great example). Ashley Madison and Code Spaces are the only ones I can think of off the top of my head that were bad enough to actually have a material effect on the bottom line.