
The Habitat of Hardware Bugs - ChickeNES
https://www.embeddedrelated.com/showarticle/988.php
======
HeyLaughingBoy
_It's bug-free if and only if they can't sell it with bugs_

This is so very true. A long time ago I was writing a device driver for a
chip. I kept running into a problem and spent days looking for the bug in my
code. After all, it _had_ to be my code. No way the chip would fail to work in
this mode: thousands of customers would be screaming bloody murder.

Finally, I gave up and called my rep at TI. And found out... they knew about
the bug and were in the process of fixing it. Why weren't all those customers
complaining of a bug in the chip's most basic mode? Well "actually you guys
are only the second company to buy this version of the chip..."

~~~
joezydeco
Oh man, chip errata documents are incredibly scary things. It makes you wonder
how the thing works at all.

Here's the current errata for the Freescale iMX6D/Q. _All 225 pages of it._

[http://cache.freescale.com/files/32bit/doc/errata/IMX6DQCE.p...](http://cache.freescale.com/files/32bit/doc/errata/IMX6DQCE.pdf)

~~~
rasz_pl
200 pages? That's average for Microchip

~~~
joezydeco
Yeah I guess this is actually pretty good for an SoC the size of the iMX.

------
ChuckMcM
That is a pretty reasonable way of looking at it. One of the things that made
NetApp interesting when I was there was ONTAP, a completely custom OS with one
memory space and no user mode. When you thought about it, that made sense: all
you need for a NAS box is a really feature-rich Ethernet driver :-). Anyway,
what it meant was that NetApp would uncover problems in CPUs and chipsets that
nobody in the "PC" world would ever see. Race conditions on the frontside bus,
PCI Express traffic that would freeze up the chipset, etc. It was also true of
drive firmware. Drives have all these commands which look good in the manual,
except no PC ever calls them in production. As a result they don't get a lot
of testing. We discovered that 'write zeros', a command for zeroing out a
disk, on some firmware revs was really "write mostly zeros, except when you
don't." Never good when you're trying to initialize RAID stripes. As a result
there was always a "NetApp version" of the drive firmware which had been
qualified, but customers always believed it was just a way of preventing them
from using commodity drives[1].
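A basic qualification check for that kind of firmware bug is to issue the zeroing command and then read every block back, trusting nothing. A minimal sketch in Python, where the path is a placeholder for the real block device:

```python
# Sketch: verify that a "zero this disk/region" command actually wrote zeros.
# The path is a placeholder; on a real system you would open the block
# device itself after issuing the drive's zeroing command.

BLOCK_SIZE = 4096

def find_nonzero_blocks(path, num_blocks):
    """Read num_blocks blocks from path and return the indices of any
    blocks containing a nonzero byte ("write mostly zeros" shows up here)."""
    bad = []
    with open(path, "rb") as dev:
        for i in range(num_blocks):
            block = dev.read(BLOCK_SIZE)
            if any(block):  # any nonzero byte means the zeroing lied
                bad.append(i)
    return bad
```

An empty result is the only acceptable answer before laying RAID stripes on top of the region.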

Any time you step off the beaten path and try to use a complex technology in
an "unusual" way, you are blazing a trail which may not have been traveled
before. Always good to be on the lookout for undocumented bugs.

[1] It did have that effect but it wasn't the motivation.

~~~
_yosefk
This works at higher levels of abstraction, too. For instance, NetApp filers
have a deduplication feature, where identical files are detected and stored
once instead of several times. When one of the files is changed, supposedly a
copy on write happens. Yet in practice I saw, more than once, two identical
files, with completely identical time stamps, owned by two different users,
where only one file was modified intentionally by a program run by one of the
users (that program logged its actions to a file, and the other user's log
would be empty, plus there was no chance that both modified their files at the
exact same time). I concluded that NetApp's deduplication wasn't on the beaten
path - or perhaps something in the timing or other specifics of our creation
and modification of identical files was unusual.

~~~
tw04
The first problem with your example is that NetApp deduplication occurs at the
block level, not the file level. The second problem is that, given the number
of systems in the field utilizing it, if your example were accurate there
would be literally THOUSANDS of people up in arms.

Furthermore, their deduplication is post-process, so even if dedup were to
somehow modify atime, which it doesn't, you wouldn't have seen the access time
change until as much as 24 hours after the file was modified.
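For what it's worth, block-level dedup with copy-on-write can be sketched in a few lines; the point is that a write to one file replaces that file's block references and cannot leak into another file's view. A toy model of the general technique (not NetApp's implementation):

```python
import hashlib

class DedupStore:
    """Toy block-level dedup: identical blocks share one stored copy."""

    def __init__(self):
        self.blocks = {}   # sha256 digest -> block bytes (stored once)
        self.files = {}    # filename -> list of block digests

    def write_file(self, name, data, block_size=4096):
        digests = []
        for i in range(0, len(data), block_size):
            block = data[i:i + block_size]
            d = hashlib.sha256(block).digest()
            self.blocks[d] = block        # duplicate content collapses here
            digests.append(d)
        self.files[name] = digests        # only THIS file's references change

    def read_file(self, name):
        return b"".join(self.blocks[d] for d in self.files[name])
```

Rewriting one file swaps out only its own digest list, so a sharing partner keeps reading the old blocks.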

Troll on.

~~~
_yosefk
Then it wasn't deduplication. I swear it was a NetApp file server, two files,
one modified by a program that logged the change, the other getting the same
bits as the first, the time stamps were completely identical. Dedup was just a
guess.

------
mcshicks
At least for digital ASICs, if you can control the temperature and core
voltages, one strategy to determine whether the issue is hardware is to see if
the problem changes (either stops or happens more frequently) at high
temperature/low core voltage compared to low temperature/high voltage. If it
does, that's usually a pretty good sign it's not a software/firmware problem.
If you can't vary the voltage across the datasheet operating limits, you can
try just temperature, but in my experience it's better to do both.
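Automated, this is a "shmoo" sweep over the voltage/temperature corners. A rough sketch, where `set_core_voltage`, `set_temperature`, and `run_stress_test` are hypothetical stand-ins for whatever bench-control and test code a given lab has:

```python
# Sketch of a voltage/temperature "shmoo" sweep. The three callables are
# hypothetical hooks into real bench equipment and test firmware.

def shmoo(voltages, temps, set_core_voltage, set_temperature, run_stress_test):
    """Run the stress test at every (voltage, temperature) corner.

    Returns {(volts, deg_c): passed}. A failure pattern that tracks the
    corners (e.g. fails only at low voltage / high temperature) points at
    hardware rather than software.
    """
    results = {}
    for v in voltages:
        set_core_voltage(v)
        for t in temps:
            set_temperature(t)
            results[(v, t)] = run_stress_test()
    return results
```

At a minimum, hitting the four corners of the datasheet operating range is usually worth the bench time.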

------
userbinator
A good example of a pretty serious DRAM bug that showed up on PCs a short
while ago --- yet with surprisingly little coverage in the media etc.:

[https://www.ece.cmu.edu/~safari/pubs/kim-isca14.pdf](https://www.ece.cmu.edu/~safari/pubs/kim-isca14.pdf)

~~~
vardump
What makes that different from rowhammer?

------
planteen
I spent weeks trying to stamp out a bug that turned out to be a signal
integrity issue between the processor and DRAM. It was horrible. The bug would
only happen after about 30 minutes and looked like memory corruption. I spent
tons of time looking for an interrupt corrupting memory.
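A trick that can shorten that kind of hunt is a canary region: fill memory the code never touches with a known pattern and scan it periodically; if the pattern decays, the corruption isn't your interrupt handler. A toy sketch, with a bytearray standing in for the real RAM region:

```python
# Toy sketch of a pattern-based memory canary. On a real board this would
# target a reserved physical RAM region; a bytearray stands in for it here.

PATTERN = 0xA5  # alternating-bits fill; its complement 0x5A is also common

def fill_pattern(buf):
    for i in range(len(buf)):
        buf[i] = PATTERN

def find_corruption(buf):
    """Return (offset, value) for every byte that no longer matches."""
    return [(i, b) for i, b in enumerate(buf) if b != PATTERN]
```

Logging the offsets and values over time also tells you whether the damage clusters (a stuck line or timing-marginal bit) or looks random.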

~~~
_yosefk
People who lie about having checked signal integrity suck. I have a horror
story of my own along these lines, with very creative memory corruption.

