> Back when I was a junior developer, there was a smoke test in our pipeline that never passed. I recall asking, “Why is this test failing?” The Senior Developer I was pairing with answered, “Ohhh, that one, yeah it hardly ever passes.” From that moment on, every time I saw a CI failure, I wondered: “Is this a flaky test, or a genuine failure?”
This is a really key insight. It erodes trust in the entire test suite and will lead to false negatives. If I couldn't get the time budget to fix the test, I'd delete it. I think a flaky test is worse than nothing.
"Normalisation of Deviance" is a concept that will change the way you look at the world once you learn to recognise it. It's made famous by Richard Feynman's report about the Challenger disaster, where he said that NASA management had started accepting recurring mission-critical failures as normal issues and ignored them.
My favourite one is: Pick a server or a piece of enterprise software and go take a look at its logs. If it's doing anything interesting at all, it'll be full of errors. There's a decent chance that those errors are being ignored by everyone responsible for the system, because they're "the usual errors".
I've seen this go as far as cluster nodes crashing multiple times per day and rebooting over and over, causing mass fail-over events of services. That was written up as "the system is usually this slow", in the sense of "there is nothing we can do about it."
Oof, yes. I used to be an SRE at Google, with oncall responsibility for dozens of servers maintained by a dozen or so dev teams.
Trying to track down issues with requests that crossed or interacted with 10-15 services, when _all_ those services had logs full of 'normal' errors (that the devs had learned to ignore) was...pretty brutal. I don't know how many hours I wasted chasing red herrings while debugging ongoing prod issues.
We're using AWS X-Ray for this purpose: every service passes on and logs the X-Ray trace ID generated at first entry into the system. Pretty helpful. And yes, there should be consistent log handling and monitoring. Depending on the service, we distinguish between the error level (expected user errors) and the critical level (which makes our monitoring go red).
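Roughly, the propagation part looks like this. A minimal sketch using only Go's standard library, not the actual X-Ray SDK; the middleware and handler here are made up, though X-Amzn-Trace-Id is the header X-Ray itself uses:

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"log/slog"
	"net/http"
	"os"
)

const traceHeader = "X-Amzn-Trace-Id" // header X-Ray propagates between services

// withTraceID reuses the incoming trace ID if one is present, mints one at
// first entry into the system otherwise, and tags every log line with it.
func withTraceID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get(traceHeader)
		if id == "" {
			buf := make([]byte, 8)
			rand.Read(buf)
			id = hex.EncodeToString(buf)
		}
		logger := slog.New(slog.NewTextHandler(os.Stderr, nil)).With("trace_id", id)
		logger.Info("handling request", "path", r.URL.Path)
		// Any downstream call made from here should copy the same header on.
		w.Header().Set(traceHeader, id)
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) { w.Write([]byte("ok\n")) })
	http.ListenAndServe(":8080", withTraceID(mux))
}
```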
It often isn't as simple as using a correlation identifier and looking at logs across the service infrastructure. A misconfiguration or hardware issue may well be intermittent and only show up as an error logged before or after the request, while the response itself carries incorrect data inside a properly formatted envelope.
I guess that's one of the advantages of serverless: by definition there can be no unrelated error in state beyond the request (because there is none), except in the infrastructure definition itself. And a misconfiguration there always shows up as an error when the particular resource is called; at least I haven't seen anything else yet.
You don't even have to go as far from your desk as a remote server to see this happening, or open a log file.
The whole concept of addressing issues on your computer by rebooting it is 'normalization of deviance', and yet IT support people will rant and rave about how it's the users' fault for not rebooting their systems whenever users with high uptimes complain about performance problems or instability, as if it's not the IT department itself that has loaded that user's computer to the gills with software that's full of memory leaks, litters the disk with files, etc.
I agree with what you're saying, but this is a bad example:
> Pick a server or a piece of enterprise software and go take a look at its logs. If it's doing anything interesting at all, it'll be full of errors.
It's true, but IME those "errors" are mostly worth ignoring. Developers, in general, are really bad at logging, and so most logs are full of useless noise. Doubly so for most "enterprise software".
The trouble is context. Eg: "malformed email address" is indeed an error that prevents the email process from sending a message, so it's common that someone will put in a log.Error() call for that. In many cases though, that's just a user problem. The system operator isn't going to and in fact can't address it. "Email server unreachable" on the other hand is definitely an error the operator should care about.
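To make that distinction concrete, a minimal sketch in Go with log/slog (the relay address and sendWelcomeMail are made up): the malformed address is only worth a warning, the unreachable relay is worth an error someone will actually act on.

```go
package main

import (
	"log/slog"
	"net/mail"
	"net/smtp"
)

const smtpAddr = "mail.example.internal:25" // hypothetical relay

// sendWelcomeMail logs at a severity that matches who can act on the failure:
// a malformed address is the user's problem, an unreachable relay is ours.
func sendWelcomeMail(to string, body []byte) {
	if _, err := mail.ParseAddress(to); err != nil {
		// Expected user error: worth noting, not worth waking anyone up.
		slog.Warn("skipping mail, malformed address", "to", to, "err", err)
		return
	}
	if err := smtp.SendMail(smtpAddr, nil, "noreply@example.com", []string{to}, body); err != nil {
		// Infrastructure problem the operator actually needs to look at.
		slog.Error("mail relay unreachable or rejecting mail", "err", err)
	}
}

func main() {
	sendWelcomeMail("not-an-address", []byte("Subject: hi\r\n\r\nwelcome"))
}
```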
I still haven't actually done it yet, but someday I want to rename that call to log.PageEntireDevTeamAt3AM() and see what happens to log quality..
> The trouble is context. Eg: "malformed email address" is indeed an error that prevents the email process from sending a message
I’m sure you didn’t quite mean it as literally as I’m going to take it, and I’m sorry for that. But any process that gets as far as attempting to send an email to something that isn’t a valid e-mail address is, in my opinion, an issue that should not be ignored.
If your e-mail sending process can’t expect valid input, then it should validate its input and not cause an error. Of course this is caused by saving invalid e-mail addresses as e-mail addresses in the first place, which in itself shows that you’re in trouble, because it means you have to validate everything everywhere since you can’t trust anything. And so on. I’m obviously not disagreeing with your premise. It’s easy to imagine why it would happen, and also why it would in fact end up in the “error.log”, but it’s really not an ignorable issue. Or it can be, and it likely is in a lot of places, but that’s exactly the GP’s point, isn’t it? A culture which allows that will eventually cause the spaceship to crash.
I think we as a society are far too cool with IT errors in general. I recently went to an appointment where they had some digital parking system where you’d enter your license plate. Only the system was down, and the receptionist was like “don’t worry, when the system is down they can’t hand out tickets”. Which is all well and good unless you’re damaged by working in digitalisation and can’t help but do the mental math on just how much money that is costing the parking service. It’s not just the system that’s down, it’s also the entire fleet of parking patrol people who have to sit around and wait for it to come back before they can work. It’s the support phones being hammered, and so on. And we just collectively shrug it off because that’s just how IT works, “teehee”. I realise this example is probably not the best, considering it’s parking services, but it’s like that everywhere, isn’t it?
Attempting to send an email is one of the better ways to see if it's actually valid ;)
Last time I tried to order pizza online for pickup, the website required my email address (I guess cash isn't enough payment and they need an ad destination), but I physically couldn't give them my money because the site had one of those broken email regexes.
The article you link ends by agreeing with what I said, so I’m not exactly sure what to take it as. If your service fails because it’s trying to create and send an email to an invalid address, then you have an issue. That is not to say you need excessive validation, but most email libraries I’ve ever used or built will give you runtime errors if you can’t provide something that at least looks like x@x.x, which is exactly what you want to avoid.
I guess it’s because I’m using the wrong words? English isn’t my first language, but what I mean isn’t that the email actually needs to work, just that it needs to be in something that looks like an email format.
LOG_CRIT and LOG_ALERT are two separate levels of "this is a real problem that needs to be addressed immediately", over just the LOG_ERR "I wasn't expecting that" or LOG_WARNING "Huh, that looks sus".
Most log viewers can filter by severity, but also, the logging systems can be set to only actually output logs of a certain severity. e.g. with setlogmask(3)
If you can get devs to log with the right severities, ideally based on some kind of "what action needs to be taken in response to this log message" metric, logs can be a lot more useful. (Most log messages should probably be tagged as LOG_WARNING or LOG_NOTICE, and should probably not even be emitted by default in prod.)
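For what it's worth, here is the same idea outside C syslog, sketched with Go's log/slog rather than setlogmask(3): the handler's minimum level decides what actually reaches the sink. A toy example, not anyone's production config:

```go
package main

import (
	"log/slog"
	"os"
)

func main() {
	// In prod, only emit records that call for action; everything below the
	// threshold is filtered out by the handler itself.
	logger := slog.New(slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{Level: slog.LevelError}))

	logger.Info("request served", "ms", 42)          // dropped
	logger.Warn("retrying flaky upstream", "try", 2) // dropped
	logger.Error("email server unreachable")         // emitted: someone should act
}
```

Whether warnings belong in prod output then becomes a one-line configuration decision instead of a code change.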
> someday I want to rename that call to log.PageEntireDevTeamAt3AM()
In my experience, the problem usually is that severity is context sensitive. For example, an external service temporarily returning a few HTTP 500s might not be a significant problem (you should basically expect every web service to do that occasionally), whereas it consistently returning them over a longer duration can definitely be a problem.
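One rough way to encode that: count upstream failures over a rolling window and only escalate when the rate stays high. The window size and threshold below are made-up numbers, just to show the shape:

```go
// A single 500 from an upstream is noise; a sustained run of them is a
// problem. Track failures over a rolling window and escalate on the rate,
// not on the individual error.
package main

import (
	"log/slog"
	"time"
)

type errorWindow struct {
	window   time.Duration
	failures []time.Time
}

func (w *errorWindow) record(t time.Time) {
	w.failures = append(w.failures, t)
}

// rate returns how many failures fell inside the window ending at now,
// pruning older entries as it goes.
func (w *errorWindow) rate(now time.Time) int {
	cutoff := now.Add(-w.window)
	kept := w.failures[:0]
	for _, t := range w.failures {
		if t.After(cutoff) {
			kept = append(kept, t)
		}
	}
	w.failures = kept
	return len(kept)
}

func main() {
	w := &errorWindow{window: 5 * time.Minute}
	now := time.Now()
	for i := 0; i < 12; i++ {
		w.record(now.Add(time.Duration(i) * 20 * time.Second)) // simulated upstream 500s
	}
	if n := w.rate(now.Add(4 * time.Minute)); n > 10 { // hypothetical threshold
		slog.Error("sustained upstream failures", "count", n) // someone should look
	} else {
		slog.Info("occasional upstream failures", "count", n) // expected background noise
	}
}
```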
> I still haven't actually done it yet, but someday I want to rename that call to log.PageEntireDevTeamAt3AM() and see what happens to log quality..
The second best thing (after adding metrics collection) we did as a dev team was forcing our way into the on-call rotation for our application. Instead of grumpy sysops telling us how bad our application was (because they had to get up in the night to restart services and whatnot) but not giving us any clue to go on to fix the problems, we could do triage as the issues were occurring and actually fix them, now with a mandate from our manager, because those on-call hours were coming out of our budget. We went from multiple on-call issues a week to me gladly taking weeks of on-call rotation at a time, because I knew nothing bad was gonna happen. Unless netops did a patch round for their equipment, which they always seem to forget to tell us about.
> I want to rename that call to log.PageEntireDevTeamAt3AM() and see what happens to log quality
I managed to page the entire management team after hours at megacorp. After spending ~7 months being tasked with relying on some consistently flaky services, I'd opened a P0 issue on a development environment. At the time I tried to be as contrite as possible, but in hindsight, what a colossal configuration error. My manager swore up and down he never caught flak for it, but he also knew I had one foot out the door.
Horrors from the enterprise: a few weeks ago a solution architect forced me to roll back a fix (a basic null check) that they "couldn't test" because it's not a "real world" scenario (testers creating incorrect data would crash the business process for everyone)...