Clickbait title. It's about a bug that has the value "ten years" as part of the issue, not a bug that took ten years to solve like the title implies...
At least it's not "You wouldn't believe why we could not solve this bug!" :)
It's about issues that would have lasted for ten years, without the possibility of being fixed, if we hadn't discovered the root cause - the caching policy issue.
Anyway, 10 years is way too long for caching. A week is just fine: that's plenty long to keep the number of requests down, but not so long that you can never change anything.
Plus if your user hasn't been there in a week, why are you still in their cache?
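To make the one-week suggestion concrete, here is a minimal sketch. It assumes the caching policy is expressed as an HTTP Cache-Control max-age value in seconds; the thread does not say which header the article's code actually set, and the names (ONE_WEEK_SECONDS, cache_control_header) are made up for illustration.

```python
# Hypothetical names; assumes the policy is an HTTP Cache-Control max-age.
SECONDS_PER_DAY = 60 * 60 * 24
ONE_WEEK_SECONDS = 7 * SECONDS_PER_DAY


def cache_control_header(max_age_seconds: int) -> str:
    """Build a Cache-Control value that lets any cache keep the response."""
    return f"public, max-age={max_age_seconds}"


# A one-week policy: stale copies age out on their own within seven days.
headers = {"Cache-Control": cache_control_header(ONE_WEEK_SECONDS)}
print(headers["Cache-Control"])  # public, max-age=604800
```

With a value like this, a bad response can still get cached, but clients will ask for a fresh copy after at most a week.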
I have a non-reproducible bug (at least I think it's just one and I've failed to reproduce it) in my software.
It causes the entire UI to disappear and the user ends up opening a new instance of the program. I only know it exists because about once every few months a user can't upgrade and when I remote in and look in the task manager, an instance of the program is running.
I have an idea how to track it down and fix it but there's always something more important to be done.
Imagine a patient dies during surgery because the surgeon didn't wash their hands. Would you give the surgeon a second chance, or just fire them and revoke their license for not following basic hygiene procedures?
Likewise, a programmer who doesn't provide maintainable deliverables is also doing a disservice to the company, and to society as a whole.
There's a difference between a legitimate error and negligence. In this case, the problem looks like negligence and was easily avoidable.
I believe this comment has value, but it could stand to be worded better. The last paragraph about firing in particular is probably the most out of line with HN's tone.
The code quality seemed low, and to me it was negligence and a disservice to the employer, who had to compromise their reliability for quite a long time and spend significant effort root-causing the problem.
I hope I wasn't out of line by vouching the comment to reply to it. I see it is dead again, but I think maybe a lesson was learned? Anyway, thanks for keeping HN's discussions at such an exceptional level. There really is nowhere else like it that I'm aware of.
How exactly would differently written code ("5 seconds creating a variable") have helped here? Nothing in the article indicates that they even realized that it was a caching issue, and if they had, the header would have been trivial to grep for.
Because variables have identifiers, and the identifier can allow the programmer to be more expressive and explicit about what the number means. Then, in the context of the code, you use the variable instead of just the value, making the code more readable.
I'm aware of the principle, but that does not answer how that would have helped them here or how this indicates anything about the quality of the entire code base. Nothing in the writeup suggests that they had difficulty understanding this particular piece of code or that they were searching for something specific but didn't find it due to bad naming.
Exposing this value as an external configuration parameter in a place that is more often looked at might have helped someone to realize caching might be an issue earlier, but it could easily have been ignored as well.
Explicitly saying "I want 10 years" is different from saying "a large number, do the math to know more". This is how identifiers can help convey the programmer's intention.
Then, you can define the time length as an arithmetic operation, such as 10 * (60 * 60 * 24) * 365.
These might still be considered magic numbers to a certain extent, but it's simpler to recognize numbers such as 60, 24 and 365.
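As a rough sketch of that idea (the identifiers are invented here and are not taken from the article's code), the ten-year value could be spelled out like this:

```python
from datetime import timedelta

# Hypothetical identifiers, chosen only to illustrate the naming argument.
SECONDS_PER_DAY = 60 * 60 * 24
DAYS_PER_YEAR = 365  # deliberately ignores leap days, as in the comment above
CACHE_LIFETIME_YEARS = 10

# Same arithmetic as 10 * (60 * 60 * 24) * 365, but the name states the intent.
CACHE_LIFETIME_SECONDS = CACHE_LIFETIME_YEARS * SECONDS_PER_DAY * DAYS_PER_YEAR

# Alternative: let the standard library do the unit conversion.
CACHE_LIFETIME_VIA_TIMEDELTA = int(
    timedelta(days=DAYS_PER_YEAR * CACHE_LIFETIME_YEARS).total_seconds()
)

print(CACHE_LIFETIME_SECONDS)  # 315360000
```

Anyone grepping for "YEARS" or "LIFETIME" would land on this definition, whereas a bare 315360000 gives the reader nothing to search for.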