
The trickiest bug I've encountered - kenjackson
http://mokafive.tumblr.com/post/41912162643/the-case-of-the-mystery-data-blocks
======
beat
Sounds vaguely similar to the trickiest bug I ever had (and the first really
hard bug I ever dealt with on my own). Mine was in AIX 3.2.5, circa 1995. We
were buffering a latency-prone data stream between read and write processes
using shared memory buffers. The original design used shmat(), which was
limited to three buffers total on AIX. I rewrote the buffering using mmap() to
create memory-mapped anonymous file buffers - the number was effectively
unlimited and could be tuned via configuration. Around 100 buffers gave
optimum performance, with a 300%+ throughput improvement in real-world
conditions - a huge win on a huge problem!

Then it blew up in production. Like hard crashes shortly after starting. Upon
investigation, I found that entire pages of mapped memory were being
overwritten with nulls, more or less randomly (4096 bytes at a time).

Turned out the bug was in mmap() due to the _order_ in which patches were
applied on various servers. The dev/qa servers were patched at different times than the
production servers. That, that was hairy. And for a junior programmer to have
to explain this to the tech leads and IBM support - I don't even know how many
times I heard variants on "What's wrong with your code, really? mmap() isn't
buggy!"

Ah, the 1990s. No Stack Overflow. No ssl (not even http). We did network
programming by wrapping raw sockets in C and writing stream parsers in lex and
yacc. Kids these days, they don't know hacking!

~~~
tjradcliffe
I once #defined strlen to return a short. This was in the days of the 16-bit
to 32-bit transition on Windows, and I was running a project to eliminate
spurious warnings from our code. There were several hundred places where the
return value of strlen was being assigned to a short, every one of which
created a warning.

I was a senior developer but it was my first commercial job--I'd been in
academia previously. I ran the change by the most senior technical people
in the company, all the way up to the guy who'd written the original
application as employee #1. He OK'd it because, as he said, "strlen returns a
short on the Mac anyway, and since our code runs on Mac as well as Windows
it's a limitation we have to respect anyways."

A few years later the company stopped supporting the Mac.

A few years after that (and well after I'd left the company) one user site
started getting crashes. I heard later that it took four senior devs a week to
track down the cause, and much head-scratching because strlen was documented
to return an int. They eventually found my #define, along with a comment that
this eliminated so-many hundred warnings in the build, and that the change had
been approved by the most senior people (I wasn't totally naive).

It turned out the problem at the specific site having the issue was users
putting the entire contents of files into what amounted to tool-tips. It was
totally unexpected user behaviour, but they'd found a place they could cache
some useful data and we let them do it, so it should have worked.

Today I'd write a script that auto-edited all the cases where the problem
occurred, and regression test the hell out of it, but yeah: the '90s were a
different time!
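
To make the failure mode concrete, here is a small sketch of what forcing a string length into a short does once the string outgrows 16 bits. The exact #define isn't shown in the comment above, so strlen_as_short() is a hypothetical stand-in rather than a reconstruction of it. The sketch assumes a 16-bit short. The result of the out-of-range conversion is implementation-defined, so it only shows that the value is wrong, not which wrong value you get. strlen_checked() and make_long_string() are likewise illustrative names.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the macro: the length is forced into a short,
 * so anything above SHRT_MAX no longer comes back as the real length. */
short strlen_as_short(const char *s)
{
    return (short)strlen(s);
}

/* Checked alternative: stores the length and returns 0 if it fits in a
 * short, otherwise returns -1 and leaves *out untouched. */
int strlen_checked(const char *s, short *out)
{
    assert(s != NULL && out != NULL);
    size_t n = strlen(s);
    if (n > (size_t)SHRT_MAX)
        return -1;
    *out = (short)n;
    return 0;
}

/* Build a heap-allocated string of n 'x' characters; caller frees it.
 * Returns NULL if the allocation fails. */
char *make_long_string(size_t n)
{
    char *s = (char *)malloc(n + 1);
    if (s == NULL)
        return NULL;
    memset(s, 'x', n);
    s[n] = '\0';
    return s;
}
```

A 40,000-character "tool-tip" passes through strlen_as_short() without complaint and comes out as a bogus length, while strlen_checked() rejects it up front. That is roughly the difference between silencing the warnings and handling the case they were warning about.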

~~~
unwind
I'd just like to point out that at least in the draft C89 standard
([http://port70.net/~nsz/c/c89/c89-draft.html#4.11.6.3](http://port70.net/~nsz/c/c89/c89-draft.html#4.11.6.3)),
strlen() had the same prototype as it has today: in other words it returns
_size_t_. Not int.

------
nathanb
In my day job, I have to write reams of complicated code that can slow down
the system or make maintenance more annoying...just because a user _can_ do
something (even though doing that is unsupported).

That's what it means to write business-class software. Nobody worth having as
a customer is going to build their business on your platform if your attitude
whenever something goes wrong is "you shouldn't have been doing that in the
first place".

(I am surprised to hear this story though. I actually found a bug in the NTFS
buffer cache a few years ago which was introduced in (as I recall) Windows
Server 2012. Maybe the Server organization are way more on the ball than the
consumer OS organization, which is definitely possible. But they took it
seriously and fixed it in a patch.)

~~~
derefr
> Nobody worth having as a customer is going to build their business on your
> platform if your attitude whenever something goes wrong is "you shouldn't
> have been doing that in the first place".

My favorite set of APIs is AWS. You know why? Because they've realized they
hold two very weighty sticks that they can use when designing, and they've put
them in place all over.

1\. They can make any arbitrary message to the API _cost the user money_ every
time they send it, to disincentivize using that part of the API thoughtlessly.
That's whether or not they expect this to be an actual revenue stream at the
rates people are charged for reasonable usage.

2\. They can put a "soft cap" on any arbitrary resource, so that you have to
phone them and get the cap raised if you want more than [some reasonable
number] of something. This likewise disincentivizes bad designs that use a
nigh-infinite number of costly somethings to accomplish tasks that could be
just as easily accomplished some more idiomatic, less costly way.

AWS doesn't _prevent_ you from doing stupid things... but it makes you really
not want to. I love it.

~~~
click170
Have you ever tried to set up AWS IAM permissions for a user pursuant to the
principle of least privilege? Because Amazon's APIs are about as far from
friendly as you can get in this respect.

Their docs make it easy to make the mistake of thinking that fine-grained
controls are available for most things, but when it comes to really important
things like being able to segregate a production and Dev VPC, their APIs
basically force you to grant permissions to everything or nothing.

Some examples of things I've hit:

Not being able to restrict a user to only change a specific routing table.

Not being able to restrict a user to only change a specific elastic NIC.

I'm consistently surprised at what's missing from their API and couldn't
disagree more about being happy with it.

~~~
derefr
These things _are_ possible... but this gets at another aspect of AWS's design
in particular.

I do a lot of my AWS work in CloudFormation. When I hit a wall, the answer is
pretty much always to stand up an EC2 instance that can speak SNS, grant it
larger-than-necessary permissions to my VPC, teach CloudFormation about it as
a custom resource type, and have it serve as a proxy for the not-configurable-
enough resource, allowing it to assert its own policy and make third-party
calls before making the real callback into your VPC [or not.] It's the AWS
equivalent of writing a factory method to wrap a badly-written constructor.

To generalize that thought: IAM "users" are made to either be people (e.g.
your developers, your ops people), or representative tokens for entire third-
party organizations (e.g. a CI bot.) Despite the existence of IAM roles, IAM
isn't really made to assert "machine-agent"-granular permissions.

Instead, what you really want is to imagine a third-party service running in
the AWS cloud that does exactly what you want. You would grant that third-
party's IAM user overly-wide permission to play with your VPC, but trust it to
only do what it should, because, obviously, you have a business relationship
and it would be dumb of them to abuse it.

As soon as you can see what API needs to exist, you can turn around and
_become_ that very same imaginary third-party: make a separate AWS account,
stand up an API server in it that takes requests to do what your "clients"
want, and then, in turn, make requests to the AWS APIs on their behalf to
accomplish those things.

AWS isn't a high-level framework; it's a kit of low-level tools. (This is
really what the PaaS vs IaaS distinction implies, I think.) AWS is built
assuming that you're willing and able to take their tools and pipe/script them
together to build the higher-level components you need. And, since AWS is for
web services, that assumption comes in the form of expecting you to be able to
pipe, hook, or wrap any of their APIs to/with/in your own API.

------
digi_owl
When a dev goes "why would anyone do that?!?" you know you are in for a bad
day...

~~~
dredmorbius
When it's the QA manager saying it, leave.

------
zamalek
Getting Microsoft to fix bugs is _hard._ We had a bug in the .Net runtime.
Took _ages_ to reach even agreement on the memory dumps: probably because bugs
in the .Net runtime are like unicorns, even I had a level of disbelief that
.Net was to blame.

Once you prove it, it's completely free. Ultimately memory dumps are the way
to go when it comes to MSFT bug reports, if you can catch their bug red-handed
and snap it to disk things go real smooth.

~~~
sseveran
Yeah, I found a great one in .net. If you used HttpWebRequest to fetch a url
that had an invalid GZIP header the app would hard crash. This is because the
request and decompression were being handled on a threadpool thread. An
exception would be thrown and no one would be around to catch it. Fortunately
this was in a web crawler so it took quite a while to build a repro case and
an exact diagnosis. I believe it was fixed 4 or 5 years after I reported it in
2008.

------
alexdowad
To me, stuff like this is a powerful argument for using open-source software
(or at least software which _you_ can access the source for), whenever
possible. When things go very wrong, at least you can dig into the source code
and try to fix it.

~~~
MaulingMonkey
It's certainly an argument for having source access. For OS / "platform" level
bugs it's no panacea however - you may very well need to keep your workaround,
either because you can't rely on end users upgrading to the latest version, or
simply need a stopgap until you finish the patch, submit the patch upstream,
address all the issues brought up in the review, have it integrated into the
next stable release, have it trickle downstream into the stable releases of
individual distros, etc etc etc...

------
ww520
Ahhh, the fun with caching in the file system. Incidentally one of the trickiest
bugs I encountered also had to do with the Windows Cache Manager.

It was the interaction between the Cache Manager and the Memory Manager in
managing the rare transient state ModifiedNoWrite of the cache pages when
dealing with reentry IO read requests. The cache page status became Modified
when its content was filled in from disk, but it's marked as NoWrite to avoid
being flushed out by the Memory Manager. The physical page backing the cache
page can't be reused since it's dirty (Modified) but the Memory Manager can't
flush it out (NoWrite). Slowly over time as more pages are read, the system
would run out of physical pages.

The Cache Manager was supposed to change the cache page status back to Standby
after the read returned from the lower layer. But with reentry IO read
requests, it wouldn't do so when the upper layer IO request buffer was passed
straight down. The workaround was to allocate a separate buffer to interact
with the Cache Manager and copy the content back to the upper layer IO request
buffer, incurring an extra copy.
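
The shape of that workaround can be sketched in user space as a "bounce buffer". This is only an analogy under stated assumptions, not Windows kernel code: lower_read_fn, read_via_bounce() and demo_lower_read() are hypothetical names, and malloc/memcpy stand in for whatever allocation and copy routines a real filter driver would use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical signature for the lower-layer read: fills `buf` with `len`
 * bytes and returns 0 on success, non-zero on failure. */
typedef int (*lower_read_fn)(void *buf, size_t len, void *ctx);

/* Instead of handing the caller's buffer straight to the lower layer, read
 * into a private buffer and copy the result back. Costs one extra copy.
 * Returns 0 on success, -1 if allocation fails, else the lower layer's code. */
int read_via_bounce(lower_read_fn lower, void *ctx,
                    void *caller_buf, size_t len)
{
    assert(lower != NULL && caller_buf != NULL);

    void *bounce = malloc(len ? len : 1);
    if (bounce == NULL)
        return -1;

    int rc = lower(bounce, len, ctx);
    if (rc == 0)
        memcpy(caller_buf, bounce, len); /* the extra copy */

    free(bounce);
    return rc;
}

/* Demo lower layer: fills the buffer with the byte that `ctx` points to. */
int demo_lower_read(void *buf, size_t len, void *ctx)
{
    memset(buf, *(const unsigned char *)ctx, len);
    return 0;
}
```

The point of the indirection is that the lower layer only ever sees a buffer the caller owns outright, at the price of one memcpy per read.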

By the end I knew more about the Windows Cache Manager and Memory Manager than
I ever needed to know.

------
vermooten
How about fixing the white-text-on-a-pale background bug while you're at it?

------
meshko
The experience of dealing with Microsoft's support sounds exactly the same as
what I had to deal with. In my situation there was no workaround possible
short of rebooting the system (it was a kernel resource leak), so after
spending 3 months iterating through different ways to reproduce the problem
(including "we don't support Windows on VMWare", after asking me to send them a
VM image to reproduce the problem), going through 3 levels of support, I got
to the people who were able to get me a fix. Alas, it was a private patch
which was only available upon request, which didn't help much as I was working
for an ISV.

------
j_s
Am I correct in believing that from a high-level view, this NTFS error
resembles the ext4 bug currently on the front page?

[https://news.ycombinator.com/item?id=9576917](https://news.ycombinator.com/item?id=9576917)

Edit: That must be why this post from 2013 (update the title!) discussing an
issue fixed in 2011 hit the front page...

~~~
jimrandomh
Similar, but the NTFS one was worse in that it was potentially a security
vulnerability, since the corrupted data that _it_ wrote inappropriately into
the middle of files came from elsewhere in the filesystem, rather than zeroes.

(Software is hard and often buggy, filesystems included. Check your backups!
Even if your filesystem is perfect, I spilled water on my laptop yesterday,
and you could, too.)

~~~
mikeash
If data isn't backed up, it doesn't exist. If it's not backed up offsite, it's
not backed up. These are words to live or die by.

I don't think it's likely that I'll spill water on your laptop, though.

~~~
janfoeh
3) if you haven't successfully tried to restore it, it's not backed up.

As the old saying goes, nobody cares about backups — people care only about
restores.

~~~
quotedmycode
Yeah, I like to extend this out and make it generic like this:

If you don't test it, how do you know?

For example, one person told me he can't understand antivirus software and why
people buy it, because he never got a virus. I asked him "how do you know you
didn't get a virus?". He just looked at me, not saying a word. I hope my point
got across though. If you aren't checking, you don't know. Same could be said
about hacking these days. You secure your system, that's good, but if you
don't have something to detect hackers, you are the same as the guy without
antivirus and the guy without tested backups. You just have no idea whether or
not you are protected.

~~~
FeepingCreature
Though to be fair, the people telling me that I might have a virus are the
people who want me to give them money.

"How do you know you didn't get a virus?"

I don't. But it's not epistemically clean to let other people set your priors
for things like risk, if they have a financial interest in making you worry.

------
ksk
When you reported the bug did you include a repro? I would appreciate it if
someone could post it here.

To be specific, I wanted to know the specific conditions the article talks
about under which this causes an issue. Flush will cause writes (obviously) so
issuing writes to a file which is pending a flush is an interesting scenario.

~~~
vollmond
>> We reported the bug to Microsoft and included a short program that easily
reproduced the issue.

Now we just need a copy of it.

------
beagle3
Ah, reminds me of a Windows bug that I chased in Win2000 Server. (For all I
know it might still be there - I stopped writing Windows code in 2007).

When you wrote to a file (using WriteFile or fwrite()), it first extended the
file length, and then committed the buffer. This is supposed to be atomic -
that is, you should never be able to see the length already extended but the
data not yet there. And it apparently was atomic if both reads and writes came
from the local machine, or both came from the network - however, if the write
was on the local machine but the read was from the network, locking was
missing, and it WAS possible to read zeros instead of the real data (but only
because of a race condition - reading the same file again later would give the
expected answer)

Tried to get Microsoft to at least confirm this bug, to no avail - there was
no one interested in talking to a lone freelance developer back then.

------
dba7dba
I think the trickiest bug that I heard of was USAF F-22 jets losing all
computer systems (navigation, communication etc etc) as they crossed the date
line while flying to Hawaii for the first deployment on the island.

The flight of 4 jets was able to return to US mainland only because the
accompanying tanker was able to guide them back.

------
chris_wot
Yeah, well Microsoft never managed to find why Liberation fonts caused a BSOD:

[https://news.ycombinator.com/item?id=5468390](https://news.ycombinator.com/item?id=5468390)

------
freebish
In retrospect, I wonder if there was a better way to report that bug. Clearly
the first person didn't get it, at all.

~~~
perlgeek
Public disclosure as a security bug (information leak)?

------
Diamons
I love the blog design

~~~
snarkyturtle
I hate it, skinny text, low contrast, non-consistent background. It's
everything wrong with modern web design. If, say, as you scrolled down the
gradient went away, I'd be fine with it, but since it's fixed it's
distracting.

~~~
Benjamin_Dobell
Agreed! I genuinely wanted to read the article but found myself squinting just
to try to read the text on that awful background. It was excruciating so I just
gave up.

------
tonetheman
I might agree with at least one thing the MS guy said. I really do not trust
the OS so I would be zeroing out memory. That is me, I am an un-trusting soul.

~~~
ikeboy
>I really do not trust the OS so I would be zeroing out memory.

If you don't trust the OS, don't use it. If you use it, you're giving it
access to all your files, full stop.

~~~
randyrand
Sounds simple, but reality is much different. What if the intended users of your
software use it?

~~~
ikeboy
Anyone using your software on a system they don't trust should have no
expectation of the data being safe.

