> Back when I was a junior developer, there was a smoke test in our pipeline that never passed. I recall asking, “Why is this test failing?” The Senior Developer I was pairing with answered, “Ohhh, that one, yeah it hardly ever passes.” From that moment on, every time I saw a CI failure, I wondered: “Is this a flaky test, or a genuine failure?”
This is a really key insight. It erodes trust in the entire test suite and will lead to false negatives. If I couldn't get the time budget to fix the test, I'd delete it. I think a flaky test is worse than nothing.
"Normalisation of Deviance" is a concept that will change the way you look at the world once you learn to recognise it. It's made famous by Richard Feynman's report about the Challenger disaster, where he said that NASA management had started accepting recurring mission-critical failures as normal issues and ignored them.
My favourite one is: Pick a server or a piece of enterprise software and go take a look at its logs. If it's doing anything interesting at all, it'll be full of errors. There's a decent chance that those errors are being ignored by everyone responsible for the system, because they're "the usual errors".
I've seen this go as far as cluster nodes crashing multiple times per day and rebooting over and over, causing mass fail-over events of services. That was written up as "the system is usually this slow", in the sense of "there is nothing we can do about it."
Oof, yes. I used to be an SRE at Google, with oncall responsibility for dozens of servers maintained by a dozen or so dev teams.
Trying to track down issues with requests that crossed or interacted with 10-15 services, when _all_ those services had logs full of 'normal' errors (that the devs had learned to ignore) was...pretty brutal. I don't know how many hours I wasted chasing red herrings while debugging ongoing prod issues.
We're using AWS X-Ray for this purpose: every service passes on and logs the X-Ray trace ID generated at first entry into the system. Pretty helpful. And yes, there should be consistent log handling and monitoring. Depending on the service, we distinguish between the error level (expected user errors) and the critical level (which makes our monitoring go red).
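Roughly, the propagation part looks like this. A minimal sketch using only Go's standard library, not the actual X-Ray SDK; the middleware and handler here are made up, though X-Amzn-Trace-Id is the header X-Ray itself uses:

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"log/slog"
	"net/http"
	"os"
)

const traceHeader = "X-Amzn-Trace-Id" // header X-Ray propagates between services

// withTraceID reuses the incoming trace ID if one is present, mints one at
// first entry into the system otherwise, and tags every log line with it.
func withTraceID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get(traceHeader)
		if id == "" {
			buf := make([]byte, 8)
			rand.Read(buf)
			id = hex.EncodeToString(buf)
		}
		logger := slog.New(slog.NewTextHandler(os.Stderr, nil)).With("trace_id", id)
		logger.Info("handling request", "path", r.URL.Path)
		// Any downstream call made from here should copy the same header on.
		w.Header().Set(traceHeader, id)
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) { w.Write([]byte("ok\n")) })
	http.ListenAndServe(":8080", withTraceID(mux))
}
```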
It often isn't as simple as using a correlation identifier and looking at logs across the service infrastructure. A misconfiguration or hardware issue may well be intermittent and only show up as an error logged before or after the request, while the response itself carries incorrect data inside a properly formatted envelope.
I guess that's one of the advantages of serverless: by definition there can be no unrelated error in state beyond the request (because there is none), except in the infrastructure definition itself. And a misconfiguration there always shows up as an error when the particular resource is called; at least I haven't seen anything else yet.
You don't even have to go as far from your desk as a remote server to see this happening, or open a log file.
The whole concept of addressing issues on your computer by rebooting it is 'normalization of deviance', and yet IT support people will rant and rave about how it's the users' fault for not rebooting their systems whenever users with high uptimes complain about performance problems or instability, as if it's not the IT department itself that has loaded that user's computer to the gills with software that's full of memory leaks, litters the disk with files, etc.
I agree with what you're saying, but this is a bad example:
> Pick a server or a piece of enterprise software and go take a look at its logs. If it's doing anything interesting at all, it'll be full of errors.
It's true, but IME those "errors" are mostly worth ignoring. Developers, in general, are really bad at logging, and so most logs are full of useless noise. Doubly so for most "enterprise software".
The trouble is context. Eg: "malformed email address" is indeed an error that prevents the email process from sending a message, so it's common that someone will put in a log.Error() call for that. In many cases though, that's just a user problem. The system operator isn't going to and in fact can't address it. "Email server unreachable" on the other hand is definitely an error the operator should care about.
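To make that distinction concrete, a minimal sketch in Go with log/slog (the relay address and sendWelcomeMail are made up): the malformed address is only worth a warning, the unreachable relay is worth an error someone will actually act on.

```go
package main

import (
	"log/slog"
	"net/mail"
	"net/smtp"
)

const smtpAddr = "mail.example.internal:25" // hypothetical relay

// sendWelcomeMail logs at a severity that matches who can act on the failure:
// a malformed address is the user's problem, an unreachable relay is ours.
func sendWelcomeMail(to string, body []byte) {
	if _, err := mail.ParseAddress(to); err != nil {
		// Expected user error: worth noting, not worth waking anyone up.
		slog.Warn("skipping mail, malformed address", "to", to, "err", err)
		return
	}
	if err := smtp.SendMail(smtpAddr, nil, "noreply@example.com", []string{to}, body); err != nil {
		// Infrastructure problem the operator actually needs to look at.
		slog.Error("mail relay unreachable or rejecting mail", "err", err)
	}
}

func main() {
	sendWelcomeMail("not-an-address", []byte("Subject: hi\r\n\r\nwelcome"))
}
```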
I still haven't actually done it yet, but someday I want to rename that call to log.PageEntireDevTeamAt3AM() and see what happens to log quality..
> The trouble is context. Eg: "malformed email address" is indeed an error that prevents the email process from sending a message
I’m sure you didn’t quite mean it as literally as I’m going to take it, and I’m sorry for that. But any process that gets as far as attempting to send an email to something that isn’t a valid e-mail address is, in my opinion, an issue that should not be ignored.
If your e-mail sending process can’t expect valid input, then it should validate its input and not cause an error. Of course this is caused by saving invalid e-mail addresses as e-mail addresses in the first place, which in itself shows that you’re in trouble, because it means you have to validate everything everywhere since you can’t trust anything. And so on. I’m obviously not disagreeing with your premise. It’s easy to imagine why it would happen, and also why it would in fact end up in the “error.log”, but it’s really not an ignorable issue. Or it can be, and it likely is in a lot of places, but that’s exactly the GP’s point, isn’t it? A culture which allows that will eventually cause the spaceship to crash.
I think we as a society are far too cool with IT errors in general. I recently went to an appointment where they had some digital parking system where you’d enter your license plate. Only the system was down, and the receptionist was like “don’t worry, when the system is down they can’t hand out tickets”. Which is all well and good unless you’re damaged by working in digitalisation and can’t help but do the mental math on just how much money that is costing the parking service. It’s not just the system that’s down, it’s also the entire fleet of parking patrol people who have to sit around and wait for it to come back before they can work. It’s the support phones being hammered, and so on. And we just collectively shrug it off because that’s just how IT works, “teehee”. I realise this example is probably not the best, considering it’s parking services, but it’s like that everywhere, isn’t it?
Attempting to send an email is one of the better ways to see if it's actually valid ;)
Last time I tried to order pizza online for pickup, the website required my email address (I guess cash isn't enough payment and they need an ad destination), but I physically couldn't give them my money because the site had one of those broken email regexes.
The article you link ends by agreeing with what I said, so I’m not exactly sure what to take it as. If your service fails because it’s trying to create and send an email to an invalid address, then you have an issue. That is not to say you need excessive validation, but most email libraries I’ve ever used or built will give you runtime errors if you can’t provide something that at least looks like x@x.x, which is exactly what you want to avoid.
I guess it’s because I’m using the wrong words? English isn’t my first language, but what I mean isn’t that the email actually needs to work, just that it needs to be in something that looks like an email format.
LOG_CRIT and LOG_ALERT are two separate levels of "this is a real problem that needs to be addressed immediately", over just the LOG_ERR "I wasn't expecting that" or LOG_WARNING "Huh, that looks sus".
Most log viewers can filter by severity, but also, the logging systems can be set to only actually output logs of a certain severity. e.g. with setlogmask(3)
If you can get devs to log with the right severities, ideally based on some kind of "what action needs to be taken in response to this log message" metric, logs can be a lot more useful. (Most log messages should probably be tagged as LOG_WARNING or LOG_NOTICE, and should probably not even be emitted by default in prod.)
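For what it's worth, here is the same idea outside C syslog, sketched with Go's log/slog rather than setlogmask(3): the handler's minimum level decides what actually reaches the sink. A toy example, not anyone's production config:

```go
package main

import (
	"log/slog"
	"os"
)

func main() {
	// In prod, only emit records that call for action; everything below the
	// threshold is filtered out by the handler itself.
	logger := slog.New(slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{Level: slog.LevelError}))

	logger.Info("request served", "ms", 42)          // dropped
	logger.Warn("retrying flaky upstream", "try", 2) // dropped
	logger.Error("email server unreachable")         // emitted: someone should act
}
```

Whether warnings belong in prod output then becomes a one-line configuration decision instead of a code change.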
> someday I want to rename that call to log.PageEntireDevTeamAt3AM()
In my experience, the problem usually is that severity is context sensitive. For example, an external service temporarily returning a few HTTP 500s might not be a significant problem (you should basically expect every web service to do that occasionally), whereas it consistently returning them over a longer duration can definitely be a problem.
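One rough way to encode that: count upstream failures over a rolling window and only escalate when the rate stays high. The window size and threshold below are made-up numbers, just to show the shape:

```go
// A single 500 from an upstream is noise; a sustained run of them is a
// problem. Track failures over a rolling window and escalate on the rate,
// not on the individual error.
package main

import (
	"log/slog"
	"time"
)

type errorWindow struct {
	window   time.Duration
	failures []time.Time
}

func (w *errorWindow) record(t time.Time) {
	w.failures = append(w.failures, t)
}

// rate returns how many failures fell inside the window ending at now,
// pruning older entries as it goes.
func (w *errorWindow) rate(now time.Time) int {
	cutoff := now.Add(-w.window)
	kept := w.failures[:0]
	for _, t := range w.failures {
		if t.After(cutoff) {
			kept = append(kept, t)
		}
	}
	w.failures = kept
	return len(kept)
}

func main() {
	w := &errorWindow{window: 5 * time.Minute}
	now := time.Now()
	for i := 0; i < 12; i++ {
		w.record(now.Add(time.Duration(i) * 20 * time.Second)) // simulated upstream 500s
	}
	if n := w.rate(now.Add(4 * time.Minute)); n > 10 { // hypothetical threshold
		slog.Error("sustained upstream failures", "count", n) // someone should look
	} else {
		slog.Info("occasional upstream failures", "count", n) // expected background noise
	}
}
```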
> I still haven't actually done it yet, but someday I want to rename that call to log.PageEntireDevTeamAt3AM() and see what happens to log quality..
The second best thing (after adding metrics collection) we did as a dev team was forcing our way into the on-call rotation for our application. Instead of grumpy sysops telling us how bad our application was (because they had to get up in the night to restart services and whatnot) but not giving us any clue to go on to fix the problems, we could do triage as the issues were occurring and actually fix them, now with a mandate from our manager, because those on-call hours were coming out of our budget. We went from multiple on-call issues a week to me gladly taking weeks of on-call rotation at a time, because I knew nothing bad was gonna happen. Unless netops did a patch round for their equipment, which they always seem to forget to tell us about.
> I want to rename that call to log.PageEntireDevTeamAt3AM() and see what happens to log quality
I managed to page the entire management team after hours at megacorp. After spending ~7 months being tasked with relying on some consistently flaky services, I'd opened a P0 issue on a development environment. At the time I tried to be as contrite as possible, but in hindsight, what a colossal configuration error. My manager swore up and down he never caught flak for it, but he also knew I had one foot out the door.
Horrors from the enterprise: a few weeks ago a solution architect forced me to roll back a fix (a basic null check) that they "couldn't test" because it's not a "real world" scenario (testers creating incorrect data would crash the business process for everyone)...