Contextual logging makes structured logging even more powerful. For example, you can attach an ID to the context of an HTTP request when you receive it, and that ID then gets logged in every operation performed while serving the request. If you are investigating what happened for a specific request, you can simply search for its ID.
This works with any repeatable task and identifier, such as runs of a cron job or user IDs.
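To make this concrete, here is a minimal Go sketch of the pattern (handler names and the ID format are just illustrative): a middleware generates a request ID, stores a logger carrying it in the context, and every log line downstream includes it automatically.

```go
package main

import (
	"context"
	"crypto/rand"
	"encoding/hex"
	"log/slog"
	"net/http"
	"os"
)

type loggerKey struct{}

// withRequestLogger generates an ID for each request and stores a logger
// that already carries it in the request context.
func withRequestLogger(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		buf := make([]byte, 8)
		rand.Read(buf)
		requestID := hex.EncodeToString(buf)

		logger := slog.Default().With("requestID", requestID)
		ctx := context.WithValue(r.Context(), loggerKey{}, logger)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// loggerFrom retrieves the request-scoped logger anywhere downstream.
func loggerFrom(ctx context.Context) *slog.Logger {
	if l, ok := ctx.Value(loggerKey{}).(*slog.Logger); ok {
		return l
	}
	return slog.Default()
}

func handleOrder(w http.ResponseWriter, r *http.Request) {
	log := loggerFrom(r.Context())
	log.Info("charging card")            // every line carries requestID=...
	log.Info("order stored", "items", 3) // ...without repeating it at each call site
}

func main() {
	slog.SetDefault(slog.New(slog.NewJSONHandler(os.Stdout, nil)))
	http.Handle("/order", withRequestLogger(http.HandlerFunc(handleOrder)))
	http.ListenAndServe(":8080", nil)
}
```

Searching the aggregated logs for that requestID then returns everything the service did for that one request.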
Adding userId, clientId, sessionId, and requestId to every log line and every event in the data warehouse is one of those "one weird tricks" that actually does improve your life.
You can track them internally (passing them through the process/request flow), but keep two versions of the logs: PII and non-PII, storing the PII only in the PII logs, with much stricter access restrictions. This alone considerably mitigates the problem, since you often don't need details like a user ID to troubleshoot.
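A minimal sketch of that split, assuming Go's log/slog and invented file names and fields: two loggers write to differently restricted sinks, and only an internal ID goes to the general log.

```go
package main

import (
	"log/slog"
	"os"
)

func main() {
	// General log: pseudonymous/internal IDs only, broad read access.
	appLog := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	// PII log: separate sink with much stricter access restrictions.
	piiFile, err := os.OpenFile("pii.log", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o600)
	if err != nil {
		panic(err)
	}
	defer piiFile.Close()
	piiLog := slog.New(slog.NewJSONHandler(piiFile, nil))

	accountID := "acct-7f3a" // stable internal ID you can correlate on later if you must

	appLog.Info("password reset requested", "accountID", accountID)
	piiLog.Info("password reset requested",
		"accountID", accountID,
		"email", "jane@example.com") // the PII itself only ever lands in the restricted log
}
```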
Using anonymous IDs that you can (manually) correlate to real users, plus a reasonable log retention period, is perfectly GDPR compliant and lets you retain all the benefits.
Nitpick: an ID that can be correlated back to a real user is pseudonymous, not anonymous, according to GDPR[1].
You should still pseudonymize your PII wherever you can, but additional protection measures are required.
You should have a short log retention period (i.e. not something like two years, unless you are legally required to store the data for that long). If the retention period is too long, then even if you can prove you need the data, you still have to handle data erasure requests (a.k.a. "the right to be forgotten"). Setting up a way to delete all log entries for a given user ID could be an alternative to a short retention period.
The other feature you want is encryption (both in transit and at rest). Encryption is not a hard requirement in GDPR, but almost no "technical" measure is. GDPR Article 32 mentions various technical measures (including encryption and pseudonymization) and requires controllers to implement them while "[t]aking into account the state of the art, the costs of implementation and the nature, scope, context and purposes of processing as well as the risk of varying likelihood and severity for the rights and freedoms of natural persons".
Here we have to do some interpretation, but the generally accepted interpretation[2] is that "state of the art" refers to methods and techniques which are widely available, and that cost of implementation is strongly tied to risk: high-cost technical measures are not required unless the risk is equally high.
Since encryption and pseudonymization are simple, cheap, and have widely available open-source implementations, it's a good idea to make sure everything is encrypted.
Restricting access to log data is another measure that is trivial enough to implement that it becomes a practical requirement. In the past I've seen a lot of cases where the entire company or org had unrestricted access to logs containing PII. That probably won't fly with GDPR.
Lastly, I would make sure the logging infrastructure, and everything connected to it, follows industry-standard security practices (at least the OWASP Top 10). It sounds obvious, but I've seen a lot of cases where logs were treated as a "non-critical" part of the system ("they're just logs!") and were never reviewed or tested for security. If you suffer a data breach and the investigation reveals your logs were not properly secured, you'll likely be fined.
So in short, if you want to play it safe, all of the below:
- Pseudonymous IDs
- Short retention period (in accordance with other laws)
- Encryption in transit and at rest
- Restricted access to the log system
- Industry-standard security for the logging infrastructure
Except like, not really? If you need to remove someone's data for GDPR reasons, then "if match == userId then delete" is pretty straightforward in your log aggregation store.
Many log aggregation stores are not optimized for performing row-level updates or deletes like this. In my experience, the majority of log aggregation stores are immutable and support only time-based retention.
(Though perhaps one can meet compliance needs by keeping these logs only for a fixed maximum period of time, e.g. 30 days, and keeping only appropriately anonymized data longer.)
Saying "we need to keep these logs for 30 days to allow us to troubleshoot problems. We can't reasonably delete them sooner, but they get deleted after 30 days" is a valid way to comply. You have a justifiable reason to keep them, the interval is reasonably short, and you have good technical reasons not to do it faster.
If your internal compliance people don't like it you can also rephrase it as "we are removing the data starting right now, the procedure takes 30 days". You have one month to even respond to removal requests, and can stretch that by another two. As long as you are not intentionally causing delays these are perfectly reasonable time frames.
Of course you still have to do all the other stuff for GDPR compliance, like making sure you have rules about who gets access to the log system instead of just giving it to the entire company, making sure the data is stored on encrypted drives, etc.
A log aggregation store that can handle deletes is a security and compliance problem. Try proving to an auditor that a hacker couldn't have hacked in and then covered their tracks by deleting the logs.
That’s an incredibly weak response. Laws you can’t fuck with; auditors can fuck off. I’d love to see you explain to the EU why you’re violating their laws because some auditor wanted to check a box. I sure hope your auditors are assuming legal responsibility.
Don't log anything you're not allowed to log. But in some industries (like finance) you need an immutable logging system and if you could easily delete evidence of a crime or security breach that would be a bug not a feature.
What if you anonymize the actual user entity with that user ID instead? Even if you have that user ID in your logs, the name or any sensitive field would be something like "GDPR says HI".
This is necessary but not sufficient. Logs can contain other data that could be used to narrow down the user base enough to guess which user it is, and then from the logs alone you have de-anonymized an ID and can see everything that user did, or likely did.
In reality you need multiple different steps here: anonymous IDs, well-defined reasonable retention periods, strong access control and audit logging, and a privacy policy that says why the data is collected (for service quality typically) and how/when it will be deleted.
There's no one clever trick to GDPR; the law was intentionally designed to require businesses to apply holistic best practice. Whether it has done that well or not is another matter, but that was at least the aim.
First, as another reply above has mentioned, other data in the logs (such as IP address, list of friends, browser fingerprint) can be used to de-anonymize the pseudonymous ID.
Second, GDPR makes it quite clear (for the reasons above) that pseudonymized data is still considered personal data. Pseudonymization reduces the risks but does not remove them entirely; it should generally be combined with other measures such as encryption.
IMO manual logging, and especially keeping it consistent, is hard in a team: language skills, cultural background, personal preferences, etc.
Much better is well-thought-out error handling. It shows exactly when and where something went wrong, and if your error handler supports it, even context information like all the variables on the current stack is reported.
Add manageable background jobs, which you can restart after fixing the code, to the recipe...
Ciao! This is all gold, thanks for sharing! I agree that consistency is the key. But it is the key in many fields; that is why well-designed abstractions and continuous integration exist: to enforce consistency.
Error handling as well can be very helpful to communicate what your system is doing, but errors are not the only state you want to look for.
In theory (though it is something I have rarely seen used), a logging library can be wrapped in an abstraction, where useful, to enforce consistency. For example, if you wrap your logger in something that conventionally reads like "ThisLoggerIsCriticalDontMoveItAsYouWillDoWithOtherLogs(logger)", you are communicating something more about how that logger should be treated.
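A rough Go sketch of that wrapping idea, with invented names: the wrapper's type and method names carry the intent, and the method signature forces every call site into the same shape.

```go
package main

import (
	"log/slog"
	"os"
)

// CriticalLogger is a thin wrapper whose type and method names signal intent:
// these calls are load-bearing and should not be reshuffled like ordinary logs.
type CriticalLogger struct {
	inner *slog.Logger
}

// PaymentFailed forces callers to supply exactly the fields the team agreed on,
// so the resulting log line looks the same no matter who wrote the call site.
func (c CriticalLogger) PaymentFailed(orderID string, amountCents int, reason string) {
	c.inner.Error("payment_failed",
		"orderID", orderID,
		"amountCents", amountCents,
		"reason", reason)
}

func main() {
	critical := CriticalLogger{inner: slog.New(slog.NewJSONHandler(os.Stdout, nil))}
	critical.PaymentFailed("ord-123", 4999, "card_declined")
}
```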
For Java applications, we built a structured logging library which does a few things -
- Add OTel-based instrumentation to generate traces
- Take the PII that the API Gateway injects in plain text into each request (user ID, etc.), salt-and-hash it, and propagate it internally to downstream services via Baggage (a rough sketch of this hashing step follows below)
- Inject all this context, like the trace ID and the hashed PII, into each log line
- Provide Log4j and Logback Layout implementations to structure logs in JSON format
Logs are compressed and ingested into AWS S3, so it is also not expensive to store this volume of logs.
AWS provides a feature called S3 Select to search structured logs/info in S3. We built a Go CLI tool based on Cobra which is aware of the structure we have defined and lets us search the logs in all possible ways, even with PII info, without saving it anywhere.
In just two months, with two people, we were able to build this stack, integrate it into 100+ microservices, and get rid of CloudWatch. This not only saved us a lot of money on the CloudWatch side but also improved our ability to search the logs with a lot of context when issues happen.
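The library described above is Java, but the hashing step itself is small; here is a rough Go sketch of the idea (the salt handling and ID values are assumptions): a secret salt plus HMAC turns the plain-text user ID from the gateway into a stable pseudonymous token that can be logged and propagated.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// pseudonymize turns a plain-text identifier into a stable token that can be
// logged and propagated (e.g. via OpenTelemetry Baggage) without exposing the
// raw value. HMAC with a secret salt keeps the mapping reproducible for
// correlation, but not reversible by anyone who only sees the logs.
func pseudonymize(salt []byte, value string) string {
	mac := hmac.New(sha256.New, salt)
	mac.Write([]byte(value))
	return hex.EncodeToString(mac.Sum(nil))[:16] // truncated to keep log lines readable
}

func main() {
	// Assumption: the salt lives in a secret manager and is shared by all services.
	salt := []byte("rotate-me-and-keep-me-out-of-source-control")
	userID := "user-42" // e.g. injected in plain text by the API gateway
	fmt.Println("hashed user id:", pseudonymize(salt, userID))
}
```

Every service sharing the salt produces the same token, so logs from different microservices can still be joined on it without exposing the raw ID.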
Hey, we're in a pretty similar place logging-wise, and I would really like to know more about your solution. If at all possible, I'd like to understand your rationale and implementation architecture more.
If this is “wild” to you then you are incredibly, incredibly green behind the ears. This is a very common reality. What you’re really saying is “I could do so much better than this”, which…good for you?
Because this is a WordPress website, and two decades ago this seemed like a reasonable idea. And WordPress is the epitome of "good enough". There are plenty of WordPress plugins to fix this, but you might forget to set one up until this happens to you.
What do you want someone to say to this? It’s incredibly easy to sit back and suggest changes to infrastructure. Smugness aside, nothing being said here is going to make the site come up.
My habit lately has been to have a "request event" object that picks up context as it works its way through the layers and is then fully saved to disk, referenced by its unique event number. In Go this is usually just a map. These logs are usually very large and get rotated into archive and deletion very quickly.
Then in my standard error logs I always include this event ID plus an actual description of the error and its context from the call site. These logs are usually very small and easy to analyze to spot the error, and every log line includes the event ID that was being processed when it was generated.
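A rough Go sketch of that pattern (all names invented): a map-backed event accumulates context as it passes through the layers, gets flushed to the big short-lived log, and the small error log only carries the event ID.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os"
	"sync/atomic"
)

var eventCounter atomic.Int64

// RequestEvent accumulates context as the request moves through the layers.
type RequestEvent struct {
	ID     int64
	Fields map[string]any
}

func newRequestEvent() *RequestEvent {
	return &RequestEvent{ID: eventCounter.Add(1), Fields: map[string]any{}}
}

// Flush writes the full event to the large, short-lived log
// (in real use: a file that gets rotated and deleted quickly).
func (e *RequestEvent) Flush() {
	line, _ := json.Marshal(map[string]any{"event": e.ID, "fields": e.Fields})
	fmt.Fprintln(os.Stdout, string(line))
}

func chargeCard(e *RequestEvent) error {
	e.Fields["gateway"] = "acme-pay"
	e.Fields["amountCents"] = 4999
	return fmt.Errorf("card declined")
}

func main() {
	e := newRequestEvent()
	defer e.Flush()

	if err := chargeCard(e); err != nil {
		// The small, long-lived error log only needs the event ID and a description;
		// the full context lives in the flushed event referenced by that ID.
		log.Printf("event=%d charge failed: %v", e.ID, err)
	}
}
```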
I was just talking to some acquaintances the other day where I was asking them what they used for structured logging and they looked at me like "what's that" and I remembered people don't do it everywhere.
I think SQLite is maybe the best option if you can get a bit clever around the scalability implications.
For instance, you could maintain an in-memory copy of the log DB schema for each http/logical request context and then conditionally back it up to disk if an exception occurs. The request trace SQLite db path could then be recorded in a metadata SQLite db that tracks exceptions. This gets you away from all clients serializing through the same WAL on the happy path and also minimizes disk IO.
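A rough sketch of that idea in Go, assuming the mattn/go-sqlite3 driver and invented paths and table layout: the per-request trace lives in an in-memory SQLite database and is only persisted (via VACUUM INTO, available since SQLite 3.27) when the request fails.

```go
package main

import (
	"database/sql"
	"fmt"
	"os"

	_ "github.com/mattn/go-sqlite3" // any SQLite driver should work similarly
)

// traceRequest keeps the request trace in a private in-memory database and
// writes it to disk only when the request ends in an error.
func traceRequest(requestID string, work func(trace *sql.DB) error) error {
	db, err := sql.Open("sqlite3", ":memory:")
	if err != nil {
		return err
	}
	defer db.Close()
	db.SetMaxOpenConns(1) // keep the pool on one connection so ":memory:" stays a single DB

	if _, err := db.Exec(`CREATE TABLE trace (ts DATETIME DEFAULT CURRENT_TIMESTAMP, msg TEXT)`); err != nil {
		return err
	}

	workErr := work(db)
	if workErr != nil {
		// Persist the whole in-memory trace in one shot.
		// requestID is trusted here; don't splice untrusted input into SQL like this.
		path := fmt.Sprintf("traces/%s.db", requestID)
		if _, err := db.Exec(fmt.Sprintf(`VACUUM INTO '%s'`, path)); err != nil {
			return fmt.Errorf("saving trace: %w (original error: %v)", err, workErr)
		}
		// A metadata DB mapping requestID -> trace path would be updated here.
	}
	return workErr
}

func main() {
	os.MkdirAll("traces", 0o755)
	err := traceRequest("req-123", func(trace *sql.DB) error {
		trace.Exec(`INSERT INTO trace (msg) VALUES (?)`, "loading user")
		trace.Exec(`INSERT INTO trace (msg) VALUES (?)`, "charging card")
		return fmt.Errorf("card declined")
	})
	fmt.Println("request result:", err)
}
```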
"A breakpoint for logging is usually scalability, because they are expensive to store and index."
I hope 2024 is the year we realize that if we make log levels dynamically updatable, we can have our cake and eat it too. We feel stuck in a world where all logging is either useless because it's off, or on and expensive. All you need is a way to easily change the log level without restarting, and this gets a lot better.
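Go's log/slog, for example, supports exactly this via slog.LevelVar; the HTTP endpoint below is just one illustrative way to flip it at runtime.

```go
package main

import (
	"log/slog"
	"net/http"
	"os"
)

func main() {
	// The level can be changed at any time without restarting the process.
	level := new(slog.LevelVar) // defaults to Info
	slog.SetDefault(slog.New(slog.NewJSONHandler(os.Stderr, &slog.HandlerOptions{Level: level})))

	// Debug lines cost almost nothing while the level is Info: they are
	// dropped before any formatting happens. Flip the level only when you
	// actually need the detail.
	http.HandleFunc("/debug/loglevel", func(w http.ResponseWriter, r *http.Request) {
		switch r.URL.Query().Get("level") {
		case "debug":
			level.Set(slog.LevelDebug)
		case "info":
			level.Set(slog.LevelInfo)
		}
		w.Write([]byte("level=" + level.Level().String() + "\n"))
	})

	slog.Debug("not emitted while the level is info")
	slog.Info("server starting")
	http.ListenAndServe(":8080", nil)
}
```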
That's probably true for some uses of logging, but for information about historical events you're stuck with whatever information you get from the log level you had in the past.
One great thing about Go is its built-in structured logging package "log/slog" since Go 1.21. Not only can you output in multiple structured formats, but Go also widely uses its `context.Context` type to pass request-level information, so you can easily attach requestID, sessionID, etc.
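One nuance: the built-in slog handlers don't read values out of the context by themselves, so a common pattern (sketched here with an invented context key) is a thin wrapper handler that copies a request ID from the context into every record logged through the *Context methods.

```go
package main

import (
	"context"
	"log/slog"
	"os"
)

type ctxKey string

const requestIDKey ctxKey = "requestID"

// ctxHandler wraps another slog.Handler and attaches the request ID from the
// context (if present) to every record it handles.
type ctxHandler struct {
	slog.Handler
}

func (h ctxHandler) Handle(ctx context.Context, r slog.Record) error {
	if id, ok := ctx.Value(requestIDKey).(string); ok {
		r.AddAttrs(slog.String("requestID", id))
	}
	return h.Handler.Handle(ctx, r)
}

func main() {
	slog.SetDefault(slog.New(ctxHandler{slog.NewJSONHandler(os.Stdout, nil)}))

	ctx := context.WithValue(context.Background(), requestIDKey, "req-42")
	slog.InfoContext(ctx, "order placed", "items", 3) // log line includes requestID=req-42
}
```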
It's trivial in C/C++ due to GCC being a first-class citizen in Linux, but idk how it's done for interpreted languages, Java, etc. If anyone can chime in I'm curious.
In C or C++, you just run 'ulimit -c unlimited' in your shell before running your program. When it crashes, a GDB-friendly core dump is generated. Then you can load it in gdb ('gdb myexecutable mycoredump'), and it takes you to the exact line where it crashed, including showing you the stack trace, letting you view local variables at every frame of the stack, etc. Every C++ IDE supports loading a core file, so it's literally an interactive debugger at the time you most need it. It's a life-saver.
Keep in mind you have to compile with debug symbols enabled to be able to make sense of the coredump. However, you can then strip your binary, as long as you keep an unstripped copy around to help you with debugging.
This has nothing to do with GCC being a first-class citizen in Linux. It’s a kernel feature. The kernel doesn’t care which compiler or debugger you’re using. You can dump core of any process regardless of the language it’s written in. Every modern OS supports that.