Guide to Logging

jedberg · on Sept 28, 2015

I have a long rant about logging, but the TL;DR is: aggregated logging in general is useless don't bother, everything you need to know should be in your monitoring system. Basically, if something is broken and it's a problem, it will happen again, so you just start logging when something breaks. How do you know something breaks? Log your application metrics. Log your customer experience in the form of metrics and monitoring.

But don't aggregate every code exception and warning. If they aren't affecting your customer experience, who cares? And if they are affecting it, then it should be expressed as an application metric in your monitoring system.

Edit: To make my point a little clearer, I believe that central logging and aggregating should be available, just not turned on all the time.

jacques_chester · on Sept 28, 2015

Logging is for events with identity, metrics are for events without identity. They are not the same thing.

Sometimes I care about the bulk behaviour of my application. I hit 2757 RPS and suddenly find a catastrophic gap in the performance landscape of my application -- that's useful. I don't care much about the particulars of the requests. What matters is the statistical behaviour.

But sometimes the application crashes when a particular input is given. These are outliers, they don't show up clearly in statistical behaviour. In that case I care about the details.

Metrics are warnings that smokers die younger, on average, than non-smokers.

Logs are a biopsy showing which kind of lung cancer you have.

No sensible production system or PaaS treats them as identical, even if you can derive metrics from logs.

henrik_w · on Sept 28, 2015

Well put! Also, there are the cases when the system does not behave as expected (for example, mis-configuration, or wrong input from the user) - in those cases there are no exceptions or crashes, but a good log will tell you why it happens.

I've given a more detailed example of this in example 2 here: http://henrikwarne.com/2013/05/05/great-programmers-write-de...

bbrazil · on Sept 28, 2015

An additional practical difference is that due to tracking less, that metrics can take much fewer resources allowing you to gain insight that's not practical from logs.

I'd expect in a well instrumented system that a single request could easily affect hundreds of metrics. Trying to get that level of information from logging alone isn't practical due to the volume of data it'd require.

If you want to track overall behaviour of hundreds of codepoints then metrics are good, if you want to track behaviour for all of your events at a handful of codepoints then logs are good.

kiyoto · on Sept 28, 2015

I think you are conflating between the needs of centralizing logs and having them around at all. The OP is saying that always centralizing them might not always make sense, and I tent to agree (and I say this as someone who maintains a popular open source log collector)

If I were to play the devil's advocate, the real needs for raw log data in a centralized location is for folks outside of Ops: data analysts and data scientists.

trjordan · on Sept 29, 2015

Log data should always be centralized if the machines they're stored on are ephemeral.

jacques_chester · on Sept 29, 2015

All machines are ephemeral. Some people just don't realise it.

kiyoto · on Sept 29, 2015

You are preaching to the choir here. Fluentd (as a proxy for my logging-related beliefs) pushed Docker to have logging long before the logging driver, and now it is one of the officially supported logging drivers.

What you don't seem to realize is that the cost of centralized logging is not always worth it. Machines are ephemeral and so are many application related problems. It's one thing to counter the OP saying that centralized logging has merits (and I believe the OP agrees with that statement) and another to say centralized logging is always a must.

jedberg · on Sept 29, 2015

Why? My whole point is that the data isn't useful beyond the life of the machine so why store it?

derefr · on Sept 29, 2015

If the logs might contain the stacktrace of a request that brought down your application, how do you look at them if your health-checker helpfully terminates the now-considered-wedged app VM, and your logs also "helpfully" disappear right along with it?

Half the point of logs is that they're the examinable blast-wave of something that might no longer exist.

jedberg · on Sept 29, 2015

But why do you care if a request took down the application, unless it happens repeatedly? Any sufficiently complex system will have transient errors. Why waste time tracking them down?

If it happens repeatedly, then you add central logging until you find the problem.

derefr · on Sept 29, 2015

Maybe the app comes back up with all its users missing. Maybe you get an email a day later telling you you've been hacked and listing details of your database. Or maybe the one instance wedged itself in a weird way (crashed with a lock open, etc.) and destabilized the system, and you spent sixteen hours getting everything back up.

The severity of a bug has nothing to do with its commonality. Sometimes there's something that happens once, and bankrupts your business. Security vulnerabilities are such.

However, I'm more confused by the statement "add central logging"—how are you doing this, and how much time does it cost you? If you mean enabling logging code you already wrote using ops-time configuration (in effect, increasing the log-level) then I can see your point. If you mean adding logging code, then you're making your ops work block on developers.

Either way, what is the cost you're imagining to central logging, that you would consider adding it in specific cases where it's "worth it", but not otherwise? It's just a bunch of streams going somewhere, getting collated, getting rotated, and then getting discarded. The problem itself is about as well-known as load-balancing; it's infrastructure you just set up once and don't have to think about. It doesn't even have scaling problems!

jedberg · on Sept 28, 2015

> But sometimes the application crashes when a particular input is given. These are outliers, they don't show up clearly in statistical behaviour. In that case I care about the details.

But my question is, why do you care about the outliers if the customer experience is not harmed by them? And if it is harmed, then it should should up in your 99th percentile metrics, even as an outlier, at which point you can turn on aggregated logging to look for that one thing. And if that one thing doesn't happen again, then did it really matter?

jacques_chester · on Sept 28, 2015

> But my question is, why do you care about the outliers if the customer experience is not harmed by them?

The effect of errors is not necessarily linear to their frequency.

Distributed systems of any consequence are not meaningfully modelled with smooth, continuous functions suitable for highschool calculus. There is a great deal of chaos and catastrophe.

Metrics -- including customer visible ones like "zero HTTP responses per second" -- raise the question. Logs provide the physical clues.

jedberg · on Sept 28, 2015

> Metrics -- including customer visible ones like "zero HTTP responses per second" -- raise the question. Logs provide the physical clues.

Exactly my point. First your monitoring tells you that there is zero HTTP response, then you turn on central logging until you find the problem.

If the problem goes away before you turn the logging on, then it wasn't really a problem, right? And if it happens again, then you can turn on central logging until you find it.

felixgallo · on Sept 28, 2015

turning on logging post-facto won't go back in time to catch a rare bug. And since you need the added resources that logging requires to be available at all times anyway, you might as well just leave logging on. What's your objection to having recent logs?

jedberg · on Sept 28, 2015

> turning on logging post-facto won't go back in time to catch a rare bug

I guess my point is that if a rare bug is rare enough, then it doesn't really matter, especially if the effect isn't catastrophic.

> And since you need the added resources that logging requires to be available at all times anyway, you might as well just leave logging on. What's your objection to having recent logs?

There are a lot more resources required to log everything all the time then to log a subset of things some of the time. Even having recent logs of everything is a huge overhead compared to more detailed logs of just some things.

xorcist · on Sept 28, 2015

> I guess my point is that if a rare bug is rare enough, then it doesn't really matter, especially if the effect isn't catastrophic.

That's only true if you run a trivial application. If you are a bank for example, it matters very much why the outcome of a transaction came out wrong, even if the effect wasn't catastrophic. Every circumstance and every bit of identity to the transaction is important. "It's rare" or "I don't believe in logging" doesn't cut it.

> There are a lot more resources required to log everything all the time then to log a subset of things some of the time.

Resources are a matter of budgetary concerns, mostly. What does a bug in production cost you? Are you liable? How fast? What's your SLA? Those things matter.

ryandrake · on Sept 28, 2015

Don't forget safety-critical software. I don't want avionics software developers thinking "that bug doesn't matter because it's very rare". Even if the effect is non-catastrophic.

For certain classes of software, zero bugs is not an unreasonable or unrealistic goal.

jedberg · on Sept 28, 2015

> Resources are a matter of budgetary concerns, mostly.

True, but most people unfortunately don't have unlimited budgets to work with.

Also, there is more than a monetary cost. Every system in your architecture incurs a non-zero cost to maintenance, reliability, cognitive load, and management. For example, if you are in AWS, every unused box adds time to the API call for listing the instances. Every extra box that is there for "just in case" does too. It's important to balance these things.

daurnimator · on Sept 29, 2015

> Resources are a matter of budgetary concerns, mostly.

After making my own logging system, which was expensive in dev time + maintenance, I switched our company over to papertrail ( https://papertrailapp.com ) they handle quite large volumes of logs for less than $100 a month.

They provide an interface to search the logs with low latency.

I'm not sure what type of organisation can't afford it.... (cost scales with usage)

nostrademons · on Sept 28, 2015

That works if you're building Reddit, Netflix, or even most Google services. It doesn't work if you're doing financial software or anything that handles money or is critical for someone else's business.

jedberg · on Sept 28, 2015

I haven't looked into it, so I'm not sure, but I'm also not yet convinced that server logs can increase the integrity of banking transactions. I'd love to get some more info on specifics for this.

For example, a lot of people just assume that a bank can't be eventually consistent, but if you make them stop and think about it, they realize that they can and are. The ATM isn't always connected to the bank's servers. That's why your card has a withdrawl limit. That's how much they are willing to risk for eventual consistency.

I feel like the same is true for logging. That other checks can be put in place to mitigate risk to an acceptable level.

nostrademons · on Sept 28, 2015

It's not that the logs themselves increase the integrity of banking transactions, it's that they find bugs that increase the integrity of banking transactions.

Say that you've built your system like any good highly-available system so that you rigorously check consistency of the program invariants, log if there are any anomalies, and then either try to recover or back out with a user error if something goes wrong. You've configured your monitoring to alert if any of the consistency checks fail. Now you get an alert that you've errored out and served a 500 on one in every million requests.

For Reddit, Facebook, Google, or other free services, it's no big deal: one in a million means that you're serving a couple hundred, maybe a few thousand 500s per day, max, and then the user just refreshes and gets over it.

But in a financial transaction app, this is a big deal. The basic assumption you have to make is that if you have a consistency problem that your monitoring caught, you probably have consistency problems that you didn't catch. And the right thing to do is investigate until you understand what is going on and what the impact is. So you pull the logs of all surrounding requests, you pull the logs of any requests that were in flight at the same time, you pull the logs for the user that initiated the request and see exactly what they were doing. And then you try to reproduce the problem, test for it, and fix it.

The difference is entirely in the tolerances built into the system, and how that dictates that you respond to errors. In some domains, a one-in-a-million error is a "well, we'll catch it next time" event. In others, a one-in-a-million error is a "drop everything you're doing, fix it, and ensure that no similar errors exist" event.

BTW, the strictest tolerances usually are not in B2C companies, they are in suppliers of large B2C companies (or the government). The bank is completely willing to risk a few hundred dollars per customer on an ATM card, but if the ATM vendor loses that money without it being part of the spec they gave the bank, they've lost the contract. Most of these enterprise B2B contracts are based on trust, and if the customer observes you losing money accidentally, they'll wonder what else you're doing accidentally.

mostlyjason · on Sept 28, 2015

Imagine customer support where you really need a record of the individual transactions for that particular customer. A monitoring solution won't give you that level of detail.

jedberg · on Sept 28, 2015

Individual transactions aren't found in application logs, they are found in the data store. Also, again, it's important to stress that central logging should be available, just not on all the time. If it's a specific sequence of events that triggers a problem, turn on logging and then ask them to do the sequence again. If they don't need to do the sequence again, then it wasn't really of consequence. And if it happens for multiple users, then customer service should be able to turn on let the developers know to try and find the cause which they can do by adding more metrics or perhaps some temporary central logging.

brazzledazzle · on Sept 28, 2015

A better compromise would be always having logging on but only retaining very recent logs. You need the infrastructure to support logging and correlation anyway and if you have a short enough retention period you don't have to worry about dealing with growing infrastructure just to support it.

mostlyjason · on Sept 28, 2015

Yeah I think most people store them for a limited time, such as a week or a month. Most issues that pop up will only be relevant that long. You can have permanent archives for individual data since storage is cheap, or even summary metrics. Better to start with ground truth because you often don't know what data you'll need in hindsight.

ethbro · on Sept 28, 2015

I think that's the difference between a log and an audit trail.

Parent makes a good point imo. Systems are now sufficiently complex that logging everything is a useless indicator. And if you're only using a subset of your logging anyway (by generating metrics), why collect anything more before you know there's a problem.

Tl;dr: avoid premature log verbosity

stouset · on Sept 28, 2015

Mindsets like this are a large component of why security vulnerabilities are ubiquitous.

jedberg · on Sept 28, 2015

That and their cost is relatively low from a business standpoint (and I say this as a security person myself). Even the worst breaches generally don't have much effect on the bottom line (Target is a great example). Ashely Madison and Code Spaces are the only ones I can think of off the top of my head that were bad enough to actually cause a material effect to the bottom line.

kiyoto · on Sept 28, 2015

I think there are two kinds of logging that's conflated into two in the industry: logging for devops and logging for analytics.

For logging for devops, I 100% agree with you. Looking at application metrics rather than raw logs is far more productive, and the raw logs should only be consulted after you have triaged the situation based on the metrics monitored.

However, there is another kind of logging, and that's for data science and analytics. Here, it's hugely helpful to have centralized logging. Hell, it is a must. The last thing you want is to have data scientists with a shaky Linux knowledge to ssh into your prod machines. At the same time, logs are the best source of customer behavior data to inform product insights, etc. By centralizing these logs and making them available on S3 or HDFS or something, you can point them there and have everyone win.

Among Fluentd users, we definitely see both camps. As a matter of fact, one of the reasons that I think people like Fluentd is that because it enables both monitoring and log aggregation within it.

jedberg · on Sept 28, 2015

You're absolutely right. I was only talking about DevOps logging (should have made that clearer). Logging for data science is a totally different ball game.

nostrademons · on Sept 28, 2015

Does that change your opinion about having a centralized log data store? If you need it anyway for your data scientists, why not give your security/devops people access to it when they need to debug a problem?

jedberg · on Sept 28, 2015

It doesn't change my opinion because the logging for data scientists is different than for DevOps. For data scientists I assume it would be all application information going into either a queue or stream processor, or being inserted directly into a database, or being pulled out of a database during ETL.

Stuff going to syslog isn't generally going to be used for data science.

a3voices · on Sept 28, 2015

One alternative is to replicate some of your data into a different DB that is safe for the data scientists to use. And that way, they have raw data to play with instead of logs.

nostrademons · on Sept 28, 2015

Curious what you mean by "some of your data"? What kinds of data? Usually, I think of the logs as the raw data, everything else is derived data & analysis.

a3voices · on Sept 28, 2015

Well it would be data from your databases, and anything they would find useful excluding private user information.

EvanAnderson · on Sept 28, 2015

This perspective completely ignores the use case of off-the-shelf software where you have limited or no capability to add "application metrics". Logs are invaluable in that scenario, which is arguably the majority of software deployments.

jedberg · on Sept 28, 2015

You can extract application metrics from the local logs as they are output (and then redirected to /dev/null).

shanemhansen · on Sept 28, 2015

I've done this for Cassandra. While there are tons of jmx metrics exposed for Cassandra, last I checked the recommended way of logging compaction times was to tail the log. I even have a little project that tails logs and sends timestamps and values to graphite based on regexp:

https://github.com/shanemhansen/graphitetailer

teacup50 · on Sept 28, 2015

What?

The only way to find many issues is by looking at logs of unexpected (and especially unhandled) exceptions.

If you have so many of those that they're impossible to dig through, FIX YOUR CODE.

jedberg · on Sept 28, 2015

You can keep the logs on the local box and take a peek at them once in a while if you want to clean up the code (which you should), but there is no reason to spend a ton of money and time aggregating the logs.

ceejayoz · on Sept 28, 2015

> You can keep the logs on the local box and take a peek at them once in a while...

That doesn't work very well if you're using stuff like AWS Lambda functions, short-lived VMs, Docker containers that come and go regularly, etc.

Plus, off-server logs are invaluable if the server is compromised.

jedberg · on Sept 28, 2015

> That doesn't work very well if you're using stuff like AWS Lambda functions, short-lived VMs, Docker containers that come and go regularly, etc.

Actually that's the use case where it works best. If you have constantly changing infrastructure, how useful are those logs anyway? Unless its something that is happening across the fleet, in which case it should get picked up by your application metrics. Also this is the perfect case for stream processing of the logs where you look for things in real time and then throw away the actual logs (no need to keep them around after you've processed them).

> Plus, off-server logs are invaluable if the server is compromised.

Only if your systems are mutable. If they are immutable then you're much better off looking at the incoming data through an application firewall or looking for unexpected data changes in the data store. The application server should just be a conduit for the user to interact with the data store.

jacques_chester · on Sept 28, 2015

Proposing the ideal world as a solution to the problems of the real world is an engineering antipattern with a long, storied and unhappy history.

seiji · on Sept 28, 2015

Centralized logging is critical to managing security. No real world production systems should be logging locally.

jedberg · on Sept 28, 2015

The longer version of my rant stipulates that you have a good application firewall that is looking at the incoming traffic and you are using immutable infrastructure for the fronted so an attacker can't do much damage if they do get in.

Keep in mind I consider the existence of exceptions to be an application metric that should be logged, so if there is a security issue causing exceptions that should show up in the monitoring, and then you can look at the exceptions that happen going forward.

If the box is compromised, your local IDS should catch that (assuming you are allowing writes on the box at all)

seiji · on Sept 28, 2015

you have a good application firewall that is looking at the incoming traffic

I will concede if you have a perfectly configured all-knowing security oracle in front of your application then you don't need proper logging. :)

immutable infrastructure for the fronted so an attacker can't do much damage if they do get in.

I've seen attackers get access then just dump the site databases or copy all the site's code sitting on the compromised servers. Immutable infrastructure doesn't protect against reading things over the network.

The class of security problems is larger and less clearly defined than what can be pre-programmed into static monitoring or analysis tools. It's always good to store logs for after-the-fact forensics if something does go wrong. How often do we see attacks, but then the attacked company doesn't even know what data was compromised or exfiltrated due to lack of logging? It happens daily.

If you do happen to have a magic all-knowing security oracle, you should probably productize it and make a trillion dollars.

jedberg · on Sept 28, 2015

> I will concede if you have a perfectly configured all-knowing security oracle in front of your application then you don't need proper logging.

Heh, that's not quite was I saying. :) I was just saying that any analysis you'll do on application logs you can also do with an application firewall.

> How often do we see attacks, but then the attacked company doesn't even know what data was compromised or exfiltrated due to lack of logging? It happens daily.

Again, I would contend that you won't find this information in application logs anyway. You'd find it in your IDS logs that are monitoring outbound network traffic.

seiji · on Sept 28, 2015

you won't find this information in application logs anyway.

Anecdote: I've previously caught an in-progress exploit by seeing mysql errors logged from the application because the exploits were doing dumb things like SELECT * against a table with tens or hundreds of millions of rows. Sometimes the little things let you know.

IDS is nice in theory, but it's not really de rigueur for non-specialized platforms these days. It's about as practical as requesting companies keep complete netflow logs: great in theory, but almost nobody does it.

If these security approaches were deployable in a complete drop-in fashion, we could push for industry wide adoption, but right now everything is custom tailored to individual architectures and environments. ain't nobody got time for that when there's hustlin' to be done and worse is better and perfect is the enemy of good and fail fast and be lean and flaunt your tail feathers towards all the VCs.

jedberg · on Sept 28, 2015

> Anecdote: I've previously caught an in-progress exploit by seeing mysql errors logged from the application because the exploits were doing dumb things like SELECT * against a table with tens or hundreds of millions of rows. Sometimes the little things let you know.

Wouldn't that require you to actively watch the logs going by? In a sufficiently large system, you can't really watch the logs scroll by and gain anything from it. In which case you would need some real time filtering, which basically means hand coding an IDS. :)

seiji · on Sept 28, 2015

Wouldn't that require you to actively watch the logs going by?

lol, in that case, yes. we had 500 servers reporting centralized syslog and many people would just tail -f the central log to make sure not too much crazy shit was happening.

(in a perfect software system (spherical cow) obviously all errors would be tagged and categorized with proper monitoring and alerting thresholds. but, in the real world you have a 800,000 lines of php spread across 1,000 files written by 300 mostly junior people (who only stay for 6 to 14 months at a time) over the past 10 years. you make the best of what you've got.)

xorcist · on Sept 28, 2015

What has immutable infrastructure to do with logging?

Here's an example of what you typically want to do: "Give me a list of all customers whose contact information was viewed by customer representative X and who's had a Paypal withdrawal made since that event".

How are you going to accomplish that without logs in sufficient detail?

You might call your audit trail something else if you wish, but you are required to do it (immutable logging inaccessible by applications) if you have a nontrivial application with any sort of compliance requirements.

jedberg · on Sept 28, 2015

Those are application data logs, which if you have a compliance requirement will be stored directly in a database as the actions happen, by the application itself.

That's not the kind of data you would get with stuff being spit out to syslog.

otterley · on Sept 28, 2015

AFAIK there's nothing in SOC-II or other regulatory regimes that requires you to use a particular storage or retrieval technology for your application or audit logs.

You can totally use syslog as a transport protocol for these if you like. You just want to ensure that you're using the reliable form of the protocol (i.e. TCP, preferably with sender-side disk buffering to guard against unavailability of the collector).

When set up correctly, the practical differences between using, say, syslog-ng PE and some other message bus (e.g. Kafka) for recording events becomes relatively small.

We need to take care not to conflate metrics, audit logs, transport mechanisms, encodings, indices, and storage formats. They're all very different pieces of the complete picture and deserve separate scrutiny.

jedberg · on Sept 29, 2015

Sorry what I meant was the kind of data you'd be collecting isn't normally done through syslog. Sure, you can use syslog, but it's not usually the default transport for that kind of data.

thebournepopret · on Sept 28, 2015

You have no idea what you're talking about. It is clear you have not had to perform any incident response or forensics.

Accept the fact that remote logging is necessary (and cheap) for both security and stability reasons.

jedberg · on Sept 28, 2015

> You have no idea what you're talking about. It is clear you have not had to perform any incident response or forensics.

This was a nice ad hominem attack but I'll respond anyway. I actually have multiple certifications in computer forensics, and have done forensics and incident response for eBay, PayPal, reddit and Netflix.

> Accept the fact that remote logging is necessary (and cheap) for both security and stability reasons.

I'm an open minded person and I'm willing to change my opinion in the face of new facts, but you haven't actually presented any new facts. Do you have any use cases that support your statement?

I have a few facts that counter them. Central logging is definitely not cheap. It costs a lot of money to store those logs at rest, and more money to store them in a way that is searchable, as those data structures expand pretty quickly. It also isn't necessary to stability, given that we made stability go up after we ditched central logging at Netflix (I will be the first to admin this is correlative and not causative, but still, it isn't necessary for stability).

Liuser · on Sept 28, 2015

When security at Netflix needs to investigate for incidents, or to analyze data for anomalies, how do they go about doing it? If I recall correctly, Netflix is an Elasticsearch / Kibana shop right? Are there multiple clusters that they gather info from? How is visibility done for the overall org?

I'm genuinely curious how the security team goes procedures of analysis there.

jedberg · on Sept 28, 2015

I'm not sure how much detail I can get into, but yes, there is a large Elasticsearch cluster with a lot of application data as well as web application firewalls and IDS data.

lenova · on Sept 28, 2015

In all my years of lurking on HN, telling jedberg that he has no idea about large scale networking and security procedures is probably the funniest thing I've ever read...

Hint: google his username.

Zombieball · on Sept 28, 2015

I personally don't understand this. Say your team owns 2 or 3 interconnected services. A customer reports an issue. How do you go about tracing the route cause? You don't know which server services a request so now you need to fetch logs from all your servers. So now you go and retrieve local logs from all of your servers to your local dev box (which could be considerable size given the size of your infrastructure and request volume) and then try and grep them? What if you need low level wire and request logs to debug the issue? After doing this enough times you are probably going to wish you had centralized and searchable log infrastructure.

ketralnis · on Sept 28, 2015

Unless you think you can predict every single bug that you'll ever write, and also how it will affect your customers' experience, how do you know you can track it at all?

And if you're not logging it, once you do figure out that 50% of your customers are leaving because your payment endpoint throws an exception on Amex cards, how do you know what's causing that? Now it's just a silly detective game for something you could have been just logging in the first place.

And if you're only able to find out via your customer metrics, it's already too late. If you were just logging the exception, you'd have said "huh that's funny" when you deployed the code, rather than when you have enough data to realise that you've already lost the customers.

Personally I'd rather just have the logs.

jedberg · on Sept 28, 2015

> If you were just logging the exception, you'd have said "huh that's funny" when you deployed the code, rather than when you have enough data to realise that you've already lost the customers.

I'm assuming that you still have the local logs on the local server and you're looking at one instance as you deploy to find those exceptions.

I'm also assuming that we're talking about a system too complex to actually watch logs go by in real time.

If the log level is light enough that you can actually read the logs as you deploy, then by all means do so. But if you're aggregating them just to be able to search them because you can't read them as they go by, then I'm saying there are better things to be doing than storing all those logs.

mugsie · on Sept 28, 2015

Couple of reasons - 1. You need to stick to compliance requirements. 2. Tracking individual transactions (user complains about $Problem - if you attach a request ID to each request, and pass it between all components, you can find the exact problem for that request) 3. Building retrospective graphs is much easier when you have the data to hand, which you can them opitimise into a metric.

Logging is not the only solution, but neither is metrics / monitoring only.

henrik_w · on Sept 28, 2015

Where I used to work, we had a great system for logging/tracing only sessions we were interested in, while not generating any output for all the other sessions. I wrote about it here: http://henrikwarne.com/2014/01/21/session-based-logging/

mugsie · on Sept 28, 2015

That looks really cool! Will have a look at how we could implement it.

One problem is that we run across multiple services, on multiple machines, so we would have to have some way of passing a customer ID into our AZ level logstash forwarders, and find a way of collating ERROR level logs, and sending those transactions up.

We also have the problem of customers asking for retrospective issue debug, so having all the logs is useful.

agopaul · on Sept 29, 2015

I'm impressed by how only a few people mentioned what I think it's the main point of application/system logging: debugging problems. Especially in small projects when you don't work 'at scale and you don't have metrics.

Am I the only one that stash GBs of low-level log messages in order to diagnose bugs that occurred in the past?

pbowyer · on Sept 28, 2015

Having just started to go down the 'log everything centrally' route (to keep track of when third party services are down, what actions User 1234 performed the day they reported an issue, and which transactional emails seem to have disappeared), can you explain how I'd change this from logging to metrics & monitoring?

One driver has been the lack of information I've had for debugging production issues (e.g. user X not able to log in against authentication provider Y) and so storing extra info is needed (vs the standard web server logs)

pbowyer · on Sept 28, 2015

Thinking further, I guess I'm conflating logging for devops and logging for analytics and logging for audit trails.

Not sure how to separate it out without data duplication (the boundaries aren't clear cut?) and for a small system it prob doesn't matter - I'm nowhere near Netflix, PayPal et al for scale.

jedberg · on Sept 28, 2015

A good rule of thumb is that if you can send all your logs to one place and tail -f is still a reasonable way to consume them, just send them all to one place.

mostlyjason · on Sept 28, 2015

Hey guys I'm one of the technical editors for the guide. We created it because logging is really valuable for troubleshooting and monitoring systems. However, it's a complex topic and so many resources today are spread around on different forums. We wanted to have a place to consolidate all that useful information into something much easier to read. If it helps a dev or ops person solve a problem faster or easier than we've done our job!

There is still a lot of content to be written for languages like python, ruby, and more. We'd love to have more contributors, or even just suggestions for the site. I can also get in touch with any of the authors if you have questions for them.

kordless · on Sept 28, 2015

> many resources today are spread around on different forums

Centralizing the all the knowledge that's been written about a given topic isn't scalable in terms of contributions. It's far better to focus on enabling contributions than it is to focus on building a solution in which to put the contributions. There's a reason products like StackExchange are more successful at providing features for encouraging contributions than topic specific sites.

> There is still a lot of content to be written for languages like python, ruby, and more.. We'd love to have more contributors,

There is already a ton of contributions to logging solutions on the Internet. Why not build a solution categorizing those, instead of trying to build new content in a centralized (and I might add commercially curated) solution?

Also, where's the source to this "open source" project? I'm not seeing any links to code.

mostlyjason · on Sept 28, 2015

I think readability is a concern. Often when I'm learning a new technology I'd rather sit down with a good book on the topic rather than reading a bunch of forum posts. I can be pretty confident that it will cover the most important concepts, building up from basics to more advanced ones, and showing all the connections in between. A lot of times reading regular posts, you have to build all these connections yourself. We do link to really good resources and documentation where they are available though so if you have any to suggest please let me know!

I think the open source part of the title refers to the fact that we cover all different technologies from a variety of perspectives, not just a single vendor solution. Also, we allow contributions from anyone in the community. I couldn't find an easy way for WordPress to share or collaborate on source code, but if you think of a way, let me know. In the meantime, we're giving access to people who want to contribute.

kordless · on Oct 1, 2015

I'd suggest building the content off a Github repo, rendering from markdown files. This solves the problem of contribution enablement. Rackspace has a docs/developer project that I was involved with for a very short time which has its source code available for use. Start there and get your content moved off Wordpress. Nothing against Wordpress, but it's not built for this. Also, code snippets.

Rackspace site on Github: https://github.com/rackerlabs/developer.rackspace.com

I agree with the readability part. It's important to curate from the stance of educating the multitudes quickly. Having disparate content can be an issue in starting the education, but variety is the spice of life - and use cases as time goes on. I'd recommend content + reference as a solution here. Cookbook style, if you will.

Keep in mind I've spent a fair amount of clock cycles thinking about this than most people. I've rejected things like logging standards as well. That doesn't mean I'm right, but it is, at the least, a data point for you.

mugsie · on Sept 28, 2015

https://www.loggly.com/ultimate-guide/?s=parse+WSGI+logs&pos...

0 results

https://www.loggly.com/ultimate-guide/?s=parse+python+logs&p...

Gives you a page on NodeJS

Ultimate ... not so much.

jt2190 · on Sept 28, 2015

I've downvoted you because the world doesn't need yet another post about how some open source project that has the audacity to call themselves "the ultimate resource" for their initial release doesn't live up to some-random-person-on-the-internet's idea of "ultimate".

I'm guessing from the specifics of your example queries that you already have some knowledge about WSGI and Python that you could contribute to the guide. Why not do that instead?

mugsie · on Sept 28, 2015

And I commented, as it calls itself ultimate, and it has articles about Java, Node, Apache, Linux (including SystemD) and Windows.

If it was titled "We want to build the Ultimate Logging Resource" I would be OK with it.

I would have no problem contributing, and may well do so over the next while, but titles like this get my goat

Edit: Also, while I know this will be downvoted to hell - Open Source? It is by a commercial company, with -all- most of the examples being their product.

mostlyjason · on Sept 28, 2015

I think that's a fair point. It's really something we want and believe will become ultimate over time as more people contribute to it. It's got a good start but needs a lot more work. I think it's useful to show the vision of what we want to become, because logging is a complex topic and so many resources today are spread around on different forums. That's why we wanted to have a place to consolidate all that useful information into something much easier to read. I think it's probably worth a adjustment to the tagline and mission statement on the front page to make this more obvious. Thanks for the suggestion!

mostlyjason · on Sept 28, 2015

Great suggestion! I'm one of the tech editors for the guide and we'd love to have contributors for python and other languages! http://www.loggly.com/ultimate-guide/become-a-contributor/

mentat · on Sept 28, 2015

For even "haproxy". Surely no one uses that right?

mugsie · on Sept 28, 2015

did a quick scan - nginx just returns generic "linux" / "systemd" (and node for some reason).

ruby - you guessed it - node.

mostlyjason · on Sept 28, 2015

I'm one of the tech editors for the guide and we'd love to have contributors for nginx and ruby! http://www.loggly.com/ultimate-guide/become-a-contributor/

strickjb9 · on Sept 28, 2015

Your site's readability would be greatly improved if you darkened the text color. #444 or darker is good enough for a white background.

packetized · on Sept 29, 2015

'Become a Contributor' seems to mean 'please give us your email address to be added to our mailing list.'

This seems moderately suspect.

def8cefe · on Sept 29, 2015

Loggly is crazy expensive for what they have on offer (at least given my use case which is aggregating syslog and Windows event logs). You're much better setting up Logstash or Graylog2 on-premises or in public cloud.

on Sept 28, 2015

[deleted]

gk1 · on Sept 28, 2015

It's built on Wordpress, so yes.

http://www.wordpress.org

avn2109 · on Sept 28, 2015

Was I the only one who was really hoping this was going to be about best practices for cutting down trees?

mirimir · on Sept 28, 2015

I find nothing on tor or lighttpd :(

acheron · on Sept 28, 2015

My four-year-old watches this show, whence I've learned all about logging: http://www.amazon.com/Mighty-Machines-In-The-Forest/dp/B0000...

0xdeadbeefbabe · on Sept 28, 2015

The subtext that I've just started to notice, as I watch with my four year old, is that Canada is awesome.

def8cefe · on Sept 29, 2015

I'm going to take this even further off-topic but if you enjoy cheesy vintage Canadian children's television Hammy Hamster AKA Tales of the Riverbank is required viewing.