
A Practical Guide to Watchdogs for Embedded Systems - fra
https://interrupt.memfault.com/blog/firmware-watchdog-best-practices
======
monocasa
One neat thing I've seen that doesn't get called out enough is a high-priority
timer with a slightly shorter period than your watchdog. When you pet the
watchdog, you pet this timer too. Then in the timer ISR, write out the trap
frame to brain-dead non-volatile memory (we had battery-backed SRAM, and then
MRAM on newer boards). Then when the board reboots and checks in, you can pull
down what it was doing when the watchdog triggered.
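
A minimal sketch of that idea in C, for illustration only. It assumes a Cortex-M-style stacked exception frame, and it simulates the non-volatile memory with a plain static variable so it can run on a host. The names `trap_frame_t`, `crash_record_t`, `pre_wdt_snapshot` and `crash_record_take` are hypothetical and not taken from the article or any vendor SDK.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Registers a Cortex-M core stacks on exception entry (assumed layout). */
typedef struct {
    uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr;
} trap_frame_t;

#define CRASH_MAGIC 0xDEADC0DEu

typedef struct {
    uint32_t magic;      /* CRASH_MAGIC when the record holds a snapshot */
    trap_frame_t frame;  /* what the CPU was doing when the timer fired */
} crash_record_t;

/* Stand-in for battery-backed SRAM / MRAM. On a real target this would be
 * placed in a no-init section by the linker script (assumption). */
static crash_record_t nv_record;

/* Called from the high-priority timer ISR with a pointer to the stacked
 * frame. The magic is set after the copy; real firmware would also need
 * volatile accesses or a barrier to guarantee that ordering. */
void pre_wdt_snapshot(const trap_frame_t *stacked)
{
    assert(stacked != NULL);
    nv_record.frame = *stacked;
    nv_record.magic = CRASH_MAGIC;
}

/* Called early at boot. Returns 1 and copies the frame out if a snapshot
 * exists, 0 otherwise. The record is consumed so it is reported only once. */
int crash_record_take(trap_frame_t *out)
{
    assert(out != NULL);
    if (nv_record.magic != CRASH_MAGIC)
        return 0;
    *out = nv_record.frame;
    nv_record.magic = 0;
    return 1;
}
```

The timer that calls `pre_wdt_snapshot` would be reloaded in the same place the watchdog is pet, so it only fires when the watchdog is about to.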

~~~
unwind
Yes!

I implemented something similar on a particularly annoying 8/16-bit controller
just a few weeks ago. Extra fun since it had no instruction to read the
program counter (and no general purpose register wide enough to hold it).

I hope ARM adds that (or IC vendors like ST if possible) to the core itself,
just mirror the PC to a register on reset. Should be trivial in hardware.

~~~
elcritch
Hmm, maybe you could possibly instrument the functions[1] to write the
function address to a register, presuming you have enough program space.
Presuming you're in C and your compiler supports that.

1: [https://mcuoneclipse.com/2015/04/04/poor-mans-trace-free-of-...](https://mcuoneclipse.com/2015/04/04/poor-mans-trace-free-of-charge-function-entryexit-trace-with-gnu-tools/)
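
For reference, a rough sketch of what such instrumentation can look like with GCC. When code is compiled with `-finstrument-functions`, GCC inserts calls to `__cyg_profile_func_enter` and `__cyg_profile_func_exit`. The variable `last_entered_fn` is a hypothetical stand-in for whatever register or no-init RAM word survives the reset on a given part.

```c
#include <stdint.h>

/* Stand-in for a register or no-init RAM word that survives a reset
 * (assumption; the real location is hardware specific). */
volatile uintptr_t last_entered_fn;

/* GCC calls this hook on every function entry when the code is built with
 * -finstrument-functions. The attribute keeps the hook itself from being
 * instrumented, which would otherwise recurse. */
__attribute__((no_instrument_function))
void __cyg_profile_func_enter(void *this_fn, void *call_site)
{
    (void)call_site;
    last_entered_fn = (uintptr_t)this_fn;
}

/* Exit hook: nothing to record in this sketch, but the symbol must exist. */
__attribute__((no_instrument_function))
void __cyg_profile_func_exit(void *this_fn, void *call_site)
{
    (void)this_fn;
    (void)call_site;
}
```

After a watchdog reset, the saved address can be looked up in the map file to see which function was entered last. The cost is one extra call on every function entry and exit.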

------
NelsonMinar
I was surprised at the poor state of watchdogs in PC-class Linux systems. I
needed one recently and was bummed at the state of the old watchdog daemon /
softdog kernel module. It works, but it is not nearly as easy to get going (on
Ubuntu) as I expected. systemd also has its own watchdog and I can't figure it
out.

Anyway turns out I really needed a full PC hardware watchdog. I ended up
buying some $8 anonymous piece of Chinese hardware that's USB powered. It hits
the motherboard reset switch if the motherboard hard drive activity light
hasn't flashed in a while. Dumb thing, but it seems to work.

~~~
agapon
Almost all consumer motherboards have a chipset with embedded hardware
watchdog that's well supported by respective drivers in FreeBSD / Linux / etc.
Some vendors like Intel may lock down that functionality, but most (Asus,
Gigabyte, etc) don't.

~~~
NelsonMinar
Do you have any more info on that chipset watchdog and how I'd use it in
Linux? Is it this, the TCO Watchdog?
[http://www.madore.org/~david/linux/iTCO-wdt-test.html](http://www.madore.org/~david/linux/iTCO-wdt-test.html)

~~~
agapon
Yes, iTCO_wdt is for Intel chipsets; sp5100_tco is for AMD chipsets (including
FCH).

The names are a bit weird to my taste. FreeBSD ones sound a little bit better
(but maybe I am just more used to them): ichwd (ICH WD) and amdsbwd (AMD
SouthBridge WD).
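
A small sketch of how a userspace program might talk to such a driver on Linux, assuming the driver (for example iTCO_wdt) is loaded and exposes the standard watchdog character device, usually `/dev/watchdog`. The helper names `wdt_open`, `wdt_pet` and `wdt_disarm_and_close` are made up for this example.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Open the watchdog device, e.g. "/dev/watchdog". Opening it arms the timer.
 * Returns a file descriptor, or -1 on error. */
int wdt_open(const char *path)
{
    assert(path != NULL);
    return open(path, O_WRONLY);
}

/* Pet the watchdog: any write to the device restarts the countdown.
 * Returns 0 on success, -1 on error. */
int wdt_pet(int fd)
{
    return write(fd, "\0", 1) == 1 ? 0 : -1;
}

/* "Magic close": writing 'V' right before close() asks a driver that
 * supports it to disable the timer. Drivers configured with nowayout
 * keep the watchdog running regardless. */
int wdt_disarm_and_close(int fd)
{
    int rc = write(fd, "V", 1) == 1 ? 0 : -1;
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

A daemon would call `wdt_pet` from its main loop at an interval comfortably shorter than the configured timeout. Be careful when experimenting: once the device is open, failing to pet it reboots the machine.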

~~~
NelsonMinar
Thanks! I somehow spent several hours researching how to run watchdogs on
Linux without finding this. Seems like exactly what I need. While I'm here,
systemd's watchdog docs are also pretty good:
[http://0pointer.de/blog/projects/watchdog.html](http://0pointer.de/blog/projects/watchdog.html)

------
joezydeco
One scenario missed in the list of causes is a corrupted runtime image.

An advanced topic is enabling support for the watchdog in your bootloader and
having a defined recovery path when the system fails to load or, worse, the
application falls into a boot loop.

If you have the space, you can fall back to a recovery image or duplicate of
the application. If you don’t have the space, falling into a DFU mode is a
good plan.
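
One way the bootloader side of this could be sketched, assuming a small state block that survives a watchdog reset (no-init RAM or similar). Everything here is illustrative: `boot_state_t`, `boot_select`, `boot_mark_healthy` and the limit of three failed boots are hypothetical choices, not something from the article.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BOOT_MAGIC        0xB007C0DEu
#define MAX_FAILED_BOOTS  3u

typedef enum { BOOT_APP, BOOT_RECOVERY, BOOT_DFU } boot_target_t;

/* Lives in memory that survives a watchdog reset (assumption). */
typedef struct {
    uint32_t magic;         /* BOOT_MAGIC once the block is initialized */
    uint32_t failed_boots;  /* attempts since the app last reported healthy */
} boot_state_t;

/* Bootloader side: count this attempt and pick what to start. */
boot_target_t boot_select(boot_state_t *st, int have_recovery_image)
{
    assert(st != NULL);
    if (st->magic != BOOT_MAGIC) {  /* cold power-up: contents are undefined */
        st->magic = BOOT_MAGIC;
        st->failed_boots = 0;
    }
    if (st->failed_boots >= MAX_FAILED_BOOTS)
        return have_recovery_image ? BOOT_RECOVERY : BOOT_DFU;
    st->failed_boots++;  /* assume failure until the app says otherwise */
    return BOOT_APP;
}

/* Application side: call once the app has been running well for a while. */
void boot_mark_healthy(boot_state_t *st)
{
    assert(st != NULL);
    st->failed_boots = 0;
}
```

With the watchdog enabled in the bootloader, a corrupted image that hangs or crashes before calling `boot_mark_healthy` gets three attempts, after which the device falls back instead of looping forever.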

------
fra
Watchdog resets are one of the more frustrating types of issues to debug.
Chris's overview of how to implement watchdogs properly and investigate resets
is an amazing resource I wish I'd had earlier in my career.

~~~
RealityVoid
Heh, I work in embedded and, IMO, they were one of the most fun issues to
debug at the beginning of my career. Mostly, if I had such an issue, I enabled
the ISR triggering and, just before the context got trashed, I looked at where
the program was. Things got pretty obvious then.

Once you get the hang of it, it becomes pretty straightforward. And it's great
training on what the HW looks like and how it functions!

Granted, on lower-powered systems, with fewer, more primitive debugging
options (looking at you, PIC) things miiight look a bit more painful.
Thankfully, such systems are (much) smaller.

IMO, the _most_ painful issues to debug are memory corruption issues happening
on large systems without MPU enabled.

~~~
fra
You’re right, if it reproduces at your desk you’ve got some good tools at your
disposal.

Collecting enough state to fix them from a customer report is tricky, I would
say.

------
retSava
The watchdog is a nice feature to have a borked system reboot, lifesaver in
the field if feces hits the fan.

What's less fun is if there is too little protection against electrostatic
fields/EMI on the JTAG clock pin. On the small Cortex-M class devices we work
with, some of them can't shut off the JTAG part of the chip, meaning that when
operating, if there are enough (I think 8) logic flips on the TCK pin in _any_
amount of time, the JTAG part wakes up, sets the HALT ON BOOT flag. Next time
the device reboots (due to firmware update, or watchdog, ...), it will stop
and stay in JTAG debug mode. Not nice. You need to manually power cycle the
thing.

We detect this by periodically checking the JTAG power domain, and if it is
on, tell the server this so that we avoid rebooting it (eg automatically after
firmware update). This way we've found poor hw implementations and tough EMI
environments by proxy of JTAG power domain :D.

~~~
scoutt
I've worked with Cortex-M since they launched and I've never seen or heard
about this problem. Of course, I don't know the kind of environment your
devices work in, but many of my projects run in automotive.

TCK usually has a pull-up/termination (but I've also seen pull-downs
with/without caps). Do you see this issue even with the pull-up/down?

~~~
retSava
This was primarily in a lighting solution, and it wasn't the only problem with
that early hw. We noticed lights not coming back online after a firmware
update without a power cycle. Since it was very early hw, IIRC it didn't have
any shielding, and also IIRC no pull on TCK.

We now see this primarily on 1st/early iteration hw and on prototypes, but
checking this proactively has saved us from a lot of headaches of the type
"why doesn't it come back online - hw, or something in the update, or
what...".

------
cmroanirgo
A neat enough article, but surprised it didn't talk about an electronic
watchdog: basically pulsing a gpio pin to trigger a recharge of a cap which
holds a transistor active for a second or so, and that transistor drives eg a
relay. An alternate method uses a gpio to reset a 555 timer. This will allow
machinery to cut off when the embedded circuit stops looping. That is, any
attached machinery would have a guaranteed NO (normally open) circuit and can
only be engaged when all the watchdogs are working properly.

Some mcu pins also go into an unknown state (neither guaranteed high nor low),
so resetting a cpu can have bad consequences if it's driving big machinery, if
not designed correctly.

On one project I had a PC sending software watchdog pings to several
independent devices, and each of those had an _actual_ hardware watchdog (as
opposed to the CPU-resetting one in the article). I used the watchdog to
physically control the power to contactors: no watchdog = no power = nothing
activates.

The system controlled firing of gas burners and fans etc, but the design was
very safe, heaps of redundancy and was guaranteed to fail into a safe mode at
any instant.

------
senderista
I would like to find a reliable software watchdog that kills a process when a
timer expires (for preventing zombie processes from violating lease timeouts).

~~~
foota
I mean, you probably shouldn't rely on the local system time for determining
lease timeouts for correctness, if it matters.

~~~
senderista
I mean something like CLOCK_MONOTONIC.

~~~
foota
Don't you still need to worry about the clock not going up in pace with
others? Like if a system has a lease until "100" but the clock doesn't tick
(or more realistically ticks slowly), then another system could think it had
the lease if it observed a local system time of 101? Maybe I'm
misunderstanding how you're using the leases though.

~~~
senderista
Yes you do, and normally an upper bound epsilon for clock rate skew is
explicitly assumed. That's obviously fragile but there are many highly
available systems that have managed to get away with it, partly by keeping
leases short.
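
A rough sketch of the holder-side arithmetic, assuming the relative rate difference between the holder's clock and the grantor's clock is bounded by `skew_ppm` parts per million, and that `acquired_at_ns` is a local CLOCK_MONOTONIC reading taken before the lease request was sent. The function names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Read CLOCK_MONOTONIC as nanoseconds. */
int64_t mono_now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* How long the holder may act on a lease of lease_ns, measured on its own
 * clock. The lease is shortened by the assumed skew bound, so a slow local
 * clock cannot outlive the grantor's view of the lease. */
int64_t lease_safe_ns(int64_t lease_ns, int64_t skew_ppm)
{
    return lease_ns - (lease_ns * skew_ppm) / 1000000;
}

/* Returns 1 while the holder may still treat the lease as valid. */
int lease_still_held(int64_t acquired_at_ns, int64_t now_ns,
                     int64_t lease_ns, int64_t skew_ppm)
{
    return (now_ns - acquired_at_ns) < lease_safe_ns(lease_ns, skew_ppm);
}
```

For a 10 second lease and an assumed 200 ppm bound, the holder gives up 2 ms early. A process-killing watchdog would compare against the same shortened deadline. None of this helps if the process is paused between the check and the action it guards, which is part of why the approach is fragile.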

------
Paul_S
Never in my career have I ever heard of any other term used than "kicking" the
watchdog. Are the other terms popular in America?

~~~
alxlaz
Why would you kick the poor thing? :(

Not in America but terms I've heard include feeding the watchdog, keeping it
alive, petting, greeting, barking at it, calling it and, my favourite,
shushing it.

------
fwsgonzo
This was an awesome read. Thanks for writing this.

