
The source of the e1000e corruption bug (2008) - davegauer
https://lwn.net/Articles/304105/
======
kazinator
> _But the other one was just as important: the e1000e driver should never
> have left its hardware configured in a mode where a single stray write could
> turn it into a brick._

... because, like, it's totally _okay_ for the hardware to be bricked by a
single stray write?

Make like Microsoft and blame the driver. :)

It should be a basic manufacturing test to repeatedly write pseudo-random
garbage to the entire I/O register space, and check that the hardware never
gets into a state that is not recoverable by either a simple power cycle, or,
failing that, a factory reset.
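
A toy sketch of what such a test could look like, run against a simulated
device rather than real hardware. Everything here is hypothetical (the
`ToyDevice` class, its register numbers, the 0xA5 unlock value); it is not the
e1000e's register layout. It only illustrates the check: write garbage, power
cycle, then verify the non-volatile state survived.

```python
import random


class ToyDevice:
    """Hypothetical device: 16 volatile registers plus one non-volatile byte."""

    NVM_UNLOCK_REG = 3   # assumed: writing 0xA5 here unlocks non-volatile writes
    NVM_DATA_REG = 7     # assumed: writes here reach non-volatile storage if unlocked
    GOOD_NVM = 0x42      # assumed factory contents

    def __init__(self, locked_by_default=True):
        self.locked_by_default = locked_by_default
        self.nvm = self.GOOD_NVM
        self.power_cycle()

    def power_cycle(self):
        # Volatile state resets; the non-volatile byte survives.
        self.regs = [0] * 16
        self.unlocked = not self.locked_by_default

    def write(self, reg, value):
        self.regs[reg] = value
        if reg == self.NVM_UNLOCK_REG:
            self.unlocked = (value == 0xA5)
        elif reg == self.NVM_DATA_REG and self.unlocked:
            self.nvm = value

    def is_healthy(self):
        return self.nvm == self.GOOD_NVM


def fuzz_registers(device, writes, seed):
    """Write pseudo-random garbage, power cycle, report whether the device recovered."""
    rng = random.Random(seed)
    for _ in range(writes):
        device.write(rng.randrange(16), rng.randrange(256))
    device.power_cycle()
    return device.is_healthy()
```

With `locked_by_default=False`, one stray write to the data register is enough
to fail the check, which is the situation the quoted sentence warns about.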

~~~
saurik
A long time ago in college I was in a silly play (as an actor) and also
helping build a robot for the performance: I did the software, and a friend of
mine did the hardware (and notably, he was like "a god of hardware"). One day,
the robot _caught on fire_... like smoke seriously started coming out of it
and we had to quickly turn it off and the logic board was charred.

My friend, who of course knew his hardware was _perfect_, immediately looked
at me and was all "what did you do?!" and I was super confused as all I had
done was written some code to let us control it remotely with a game
controller. We go through my code, and he finds a place where I set two
variables to true at the same time, and he explains to me that by doing that I
shorted the robot's power transformer and made it catch on fire... :/.

What I am mostly known for these days is working on "jailbreaking" of iOS and
Android devices. I can tell you that many early Android devices were easily
bricked by just flashing a bad restore image from restore mode after having
flashed a bad normal image. iOS is super resilient, and we have only very
seldom managed to truly brick a device... one story, though, is quite topical:
pod2g was demonstrating a new fuzzer he had written to do just what you
describe to one of the NAND controllers to look for exploitable bugs... but on
stage during the demo it managed to find just the right sequence of register
changes to brick his demo device ;P.

Honestly, given the mental model of most hardware developers and how much
trust is put in drivers, I'd just be glad that the e1000e merely bricked
itself instead of burning your entire house down ;P.

~~~
formerly_proven
Driving all FETs in an H-bridge with a microcontroller to save parts costs.

 _pop_ _smoke_

Didn't save part costs.

~~~
javawizard
You jest, but there's truth here: it's disingenuous (as I suspect the GP
knows) for GP's friend to claim that the hardware is perfect when in fact it's
missing a simple protection against software bugs causing it to catch fire.
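
For illustration only, here is the software-side version of that kind of
guard. It is a minimal sketch assuming the failure was two switches on one
bridge leg being enabled at once; `HBridgeLeg` and `ShootThroughError` are
made-up names, and a real design would put the interlock in the gate-drive
hardware so that buggy software cannot bypass it.

```python
class ShootThroughError(Exception):
    """Raised when a request would turn on both switches of one leg."""


class HBridgeLeg:
    """Hypothetical model of one half-bridge leg: a high-side and a low-side switch."""

    def __init__(self):
        self.high_on = False
        self.low_on = False

    def set(self, high, low):
        # Both switches on at once would short the supply rail to ground,
        # so reject the request and leave the previous state untouched.
        if high and low:
            raise ShootThroughError("refusing to enable both switches on one leg")
        self.high_on, self.low_on = high, low
```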

~~~
Espressosaurus
If I had a dollar for every time a silicon or analog designer said "well don't
do that" when I identify a combination of register writes that will damage the
hardware...

------
ndesaulniers
Whew, kernel bugs due to complexities of live patching code!
([https://nickdesaulniers.github.io/blog/2020/04/06/off-by-two...](https://nickdesaulniers.github.io/blog/2020/04/06/off-by-two/))
high fives...anyone...anyone...?

~~~
marcan_42
Oh my, that last bit from Linus there. Been there done that, on a modern
system!

So I was making Linux run on a PlayStation3 booting from an exploit, so that
we could have full access to the hardware/GPU (not the locked down PS3 Linux
support that it came with, and had since been retroactively removed in an
update). This is running on top of a built-in hypervisor, on a PowerPC CPU.

I had to debug early bring-up code of the Linux bootloader I was writing from
scratch (AsbestOS), and at that point you don't have any real hardware access.
There is an internal serial port but it's not accessible to VMs, graphics is
way too hard to bring up, USB is also a pain. No usable LEDs either. So what is
there? Panic. The hypervisor panic hypercall had an argument with two modes:
you could either reboot, or shut down with a beep. So that was my boolean
output primitive, which at one point I used to "print" out an address, bit by
bit, while debugging the assembly bring-up code.

[https://github.com/marcan/asbestos/blob/master/stage2/start....](https://github.com/marcan/asbestos/blob/master/stage2/start.S#L81)
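
The idea of leaking a value one bit at a time through a two-state output can
be sketched like this. This is not the AsbestOS code (that is assembly, linked
above); the function names, the mode strings, and the choice of which mode
means 1 are all assumptions. It also assumes each panic ends the run, so every
boot yields exactly one bit and the observer reassembles the value afterwards.

```python
def panic_mode_for_bit(value, bit_index):
    """Pick the panic mode that encodes one bit of `value` (hypothetical mapping)."""
    return "shutdown_with_beep" if (value >> bit_index) & 1 else "reboot"


def reconstruct(observations):
    """Rebuild the value from {bit_index: observed_mode} gathered over many runs."""
    value = 0
    for bit_index, mode in observations.items():
        if mode == "shutdown_with_beep":
            value |= 1 << bit_index
    return value


# One "boot" per bit: record what the console did, then reassemble.
address = 0x80001234  # arbitrary example value
seen = {i: panic_mode_for_bit(address, i) for i in range(32)}
recovered = reconstruct(seen)
```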

Once I got into C and I could start writing more interesting code, I
eventually graduated from panic/reboot to using Ethernet. The Ethernet device
was virtualized to some extent. You still had to write DMA descriptors to get
packets out, but you didn't have to worry about low-level hardware init. So I
stuffed some hand-crafted broadcast UDP packets in there, and the descriptors
to blast them out, and called the "start transmit" Ethernet device hypercalls
to get printf-over-UDP working. Fun thing is the PS3 has a built-in Ethernet
switch with VLANs, so you needed to stuff VLAN tags into the packets if you
wanted them to make it out of the Ethernet port. Until the PS3 slim, when they
got rid of that switch, and then you no longer needed tags. Fun.

[https://github.com/marcan/asbestos/blob/master/stage2/debug....](https://github.com/marcan/asbestos/blob/master/stage2/debug.c)
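
A rough sketch of what hand-crafting such a packet involves, for anyone who
has not built one by hand. It is not taken from debug.c: the function names,
ports, the 0.0.0.0 source address, and the VLAN ID are placeholders, the UDP
checksum is left at zero (permitted over IPv4), and padding to the minimum
Ethernet frame size is left out. The optional 802.1Q tag is the four bytes
inserted between the source MAC and the EtherType.

```python
import struct


def ipv4_checksum(header: bytes) -> int:
    """Ones' complement sum of 16-bit words, as used for the IPv4 header."""
    total = 0
    for i in range(0, len(header), 2):
        total += (header[i] << 8) | header[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_udp_broadcast_frame(src_mac, payload, src_port, dst_port, vlan_id=None):
    """Build a broadcast Ethernet/IPv4/UDP frame, optionally 802.1Q tagged."""
    udp_len = 8 + len(payload)
    # UDP header: source port, destination port, length, checksum (0 = unused).
    udp = struct.pack("!HHHH", src_port, dst_port, udp_len, 0) + payload

    # IPv4 header with the checksum field zeroed; filled in below.
    ip = struct.pack(
        "!BBHHHBBH4s4s",
        0x45, 0, 20 + udp_len,      # version/IHL, TOS, total length
        0, 0,                       # identification, flags/fragment offset
        64, 17, 0,                  # TTL, protocol (17 = UDP), checksum
        bytes(4),                   # source 0.0.0.0 (placeholder)
        b"\xff" * 4,                # destination 255.255.255.255
    )
    ip = ip[:10] + struct.pack("!H", ipv4_checksum(ip)) + ip[12:]

    eth = b"\xff" * 6 + src_mac     # broadcast destination, then source MAC
    if vlan_id is not None:
        # 802.1Q tag: TPID 0x8100, then priority 0 and the 12-bit VLAN ID.
        eth += struct.pack("!HH", 0x8100, vlan_id & 0x0FFF)
    eth += struct.pack("!H", 0x0800)  # EtherType: IPv4
    return eth + ip + udp
```

Calling it with `vlan_id=None` gives the untagged frame; passing a VLAN ID
makes the frame four bytes longer and leaves everything after the tag
unchanged.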

I used this same UDP-over-LV1-hv-ethernet debugging tool to debug early kernel
startup too, after the bootloader hands off, since we had to revamp chunks of
the Linux on PS3 early memory startup code (as the memory config while booting
in "Game" mode was very different from the formerly officially supported
"Other OS" mode) way before graphics was up. This code was eventually
upstreamed:

[https://github.com/torvalds/linux/blob/master/arch/powerpc/p...](https://github.com/torvalds/linux/blob/master/arch/powerpc/platforms/ps3/gelic_udbg.c)

By comparison, when I was doing (unofficial) Linux on PS4 we had the luxury of
running on bare metal, and there were testpoints for a serial port on the
motherboard, so we could just poke data out of there with some trivial UART
code, and use the existing Linux earlycon support for the same after handoff
(though the baud rate multiplier was wrong... and also, bizarrely enough, it's
an "XScale" ARM variant 8250 port because the PS4's southbridge is a
repurposed Marvell ARM SoC with a PCIe bridge to the x86!).

~~~
ndesaulniers
Wow! That's a great read, you should post a copy to your blog.

I remember running Yellow Dog Linux on my PS3 (as a kid) but didn't know the
GPU was inaccessible.

For the hypervisor, did you have to do any kind of RAM training?

Are folks running Linux on PS4s these days? I haven't been keeping up too much
with the system exploit scene (had exploited Xbox 360, PSP) other than the
excellent YouTube channel "Modern Vintage Gamer."

------
jdblair
This sounds familiar. Around 2002 I bricked a Pentium motherboard while
probing the SPI address space looking for the fan and temperature sensors.

~~~
tzs
I bricked an expensive high end motherboard by ignoring the "Win98 or later"
requirement listed on the box and trying to install Win95.

Actually, two of them...when the first one bricked I assumed that it had just
been defective, so tried again on the second one we had.

It turned out that something in the Win95 device scan happened to trigger a
BIOS flashing, which flashed garbage into the BIOS.

I really hate old buses where there is no mandatory and standard form of
device ID, so the only ways a driver can find its device are (1) ask the user,
or (2) read and write registers at likely addresses looking for ones that
respond the same way its device would, hoping that at addresses where there is
something that is not your device, your commands don't accidentally make what
is there do something terrible.

------
DoofusOfDeath
I really respect Jonathan Corbet's writing skills. The article was clear,
informative, and engaging.

------
rini17
Oh the memories...this was something similar, but just an innocent RESET ATA
command.

[https://www.zdnet.com/article/mandrake-linux-9-2-kills-some-...](https://www.zdnet.com/article/mandrake-linux-9-2-kills-some-cd-rom-drives/)

------
thomasjudge
Love the understatement: "As a general rule, bricking the hardware is a level
of overhead which goes well beyond the acceptable parameters."

------
dehrmann
> As it happens, doing nothing is a highly optimized operation

