
Software Updates for IoT Devices and the Hidden Costs of Homegrown Updaters [pdf] - ralphmender
https://mender.io/resources/guides-and-whitepapers/_resources/Mender%2520White%2520Paper%2520_%2520Hidden%2520Costs%2520of%2520Homegrown.pdf
======
seanalltogether
I've been contracting with an IoT company for 3 years now. It's interesting to
see guys who coded exclusively for radio or infrared based remotes get pulled
into the software development world.

V1: 5 years ago everything was raw sockets and custom messaging formats with
hand coded firmware and all data stored in a custom vector format, builds were
distributed on google drive and flashed by hand.

V2: 3 years ago we dragged them kicking and screaming into http and hand coded
json apis, firmware was still custom, data still stored in custom vector
format but updates were now done on a non secure server with a hash check.

V3: this past year they started on a small box with a micro linux distro, apis
are provided by standardized library, data now stored in sql, updates done
over https.

Things are better now, except they still expect to sell and support those
first 2 options for the next 10 years.

~~~
nitrogen
Part of what you are describing is exactly why IoT is sometimes called IoS.
We've moved from solid, fast, low-latency, low-power hardware to
unpredictable, slow, jittery, high-latency crap. Take the Philips Hue bridge
for example. Raw binary protocol lighting tech from the 1980s can outperform
it in terms of latency, throughput, and jitter.

It's unacceptable for a button to take action after some random delay between
100ms and 5s. It's even worse if there's a remote https round trip required,
as network lag adds another layer of unpredictability.

~~~
seanalltogether
Raw binary protocol lighting tech from the 1980s was stateless, but nobody is
willing to accept that nowadays for home automation.

"Turn device on" - Great, I can do that fast.

"Turn device off" - Great, I can do that fast.

"Is device on or off?" - Hold on while I poll a serial RF signal, device by
device, to determine state.

All of the slowdown coming from our tech and others that I've seen is because
hardware guys still think these old stateless solutions are acceptable and then
have to hack something dirty on top to turn it into a stateful solution.

~~~
ianhowson
> guys still think these old stateless solution are acceptable and then have
> to hack something dirty on top to turn it into a stateful solution

Isn't this the foundation of modern web development?

------
tomc1985
This is FUD.

Partition twice the amount of space you need on the IoT device's fixed media,
and trickle-download your update to the empty partition. Once that's finished,
verify the download, then switch your bootloader. I have to update 500+ field
installations with a new OS and this is the approach I'd take if we had the
space and bandwidth.

(Instead we're sending out an army of techs and account reps armed with USB
sticks)
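
A rough sketch of that A/B flow in Python, for illustration only. The two partitions are modelled as plain files, and a hypothetical `boot_slot` text file stands in for whatever the bootloader actually reads (for instance a U-Boot environment variable). The file names, slot names, and the SHA-256 check are all assumptions, not anything from the comment above.

```python
import hashlib
import tempfile
from pathlib import Path


def inactive_slot(active: str) -> str:
    """Return the slot that is not currently booted (hypothetical names a/b)."""
    return "b" if active == "a" else "a"


def apply_update(root: Path, chunks, expected_sha256: str) -> str:
    """Stream an image into the inactive slot, verify it, then switch slots.

    `root` holds slot_a.img, slot_b.img and boot_slot (illustrative layout).
    Returns the slot that will boot next.
    """
    flag = root / "boot_slot"
    active = flag.read_text().strip()
    target = inactive_slot(active)
    image = root / f"slot_{target}.img"

    digest = hashlib.sha256()
    with image.open("wb") as out:
        for chunk in chunks:  # trickle download: one chunk at a time
            out.write(chunk)
            digest.update(chunk)

    if digest.hexdigest() != expected_sha256:
        return active  # verification failed: keep booting the old slot

    # Write the new flag to a temp file and rename it over the old one, so
    # the flag holds either the old or the new value, never a partial write.
    tmp = root / "boot_slot.tmp"
    tmp.write_text(target)
    tmp.replace(flag)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        root = Path(d)
        (root / "boot_slot").write_text("a")
        payload = [b"new-", b"firmware"]
        good = hashlib.sha256(b"".join(payload)).hexdigest()
        print(apply_update(root, payload, good))
```

The running system is never touched while the image is written, which is why a failed or interrupted download is harmless here. A real updater would also fsync the image before flipping the flag.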

~~~
geofft
I'm not sure what part you're referring to as FUD - the cost?

I definitely agree with the rest of the comment. It's a good solution. I
inherited a software product that worked like this at a past job, and it was
great. (And now I'm trying to convince my current job that we want to move to
this model for normal servers in datacenters, very much not IoT.)

A couple of complications:

- You should think about the fact that this means your root partition
changes. Either you want to structure your system with separate read and write
partitions and bind-mount the relevant directories from the write partition,
or you want to make it completely read-only / stateless. Remember that
/var/log is traditionally on your local disk, so if you don't do anything
special, you'll even get two /var/logs on each device, which may or may not be
what you want.

- You do want a management server, as this document suggests, to track which
devices have actually updated and which haven't, so you can manually send
people after devices that are just behind a terrible internet connection.

- You want some mechanism for detecting if the new version doesn't work and
rolling back; this is basically as simple as setting a "I just tried partition
X, if it doesn't work don't try it again" flag in the bootloader on boot, and
clearing it once userspace is up (and when the partition gets rewritten with a
new version).

- The updates should be signed etc. as described in the document. Depending
on your threat model, you might also want to prevent replay attacks, where an
attacker forces a downgrade by serving an old signed image under a
higher-versioned filename; either use HTTPS to an update server you control,
or use signed metadata files with timestamps.
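
The "try it once" rollback flag from the list above could look roughly like this. It is a sketch under assumptions: `BootState` stands in for whatever persistent variables a given bootloader offers, and the function names are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BootState:
    """Hypothetical persistent bootloader variables."""
    good: str                        # last slot known to boot fully
    candidate: Optional[str] = None  # freshly written slot, not yet proven
    tried: bool = False              # set by the bootloader before trying it


def stage_update(state: BootState, slot: str) -> None:
    """Updater side: a new image was written to `slot`."""
    state.candidate = slot
    state.tried = False


def choose_slot(state: BootState) -> str:
    """Bootloader side: pick the slot to boot and record the attempt."""
    if state.candidate is None:
        return state.good
    if state.tried:
        # The candidate was attempted and never confirmed: give up on it.
        state.candidate = None
        state.tried = False
        return state.good
    state.tried = True  # "I just tried this one, don't try it again"
    return state.candidate


def confirm_boot(state: BootState, booted: str) -> None:
    """Userspace side: called once the system is up and healthy."""
    if booted == state.candidate:
        state.good = booted
        state.candidate = None
        state.tried = False
```

If the new image hangs before `confirm_boot` runs, the next power cycle finds `tried` still set and falls back to the known-good slot without anyone touching the device.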

The fact that you get image-based deployments instead of dealing with apt
upgrades from arbitrarily-old versions (and thus inevitably slightly drifting
configs on devices installed at different times) is fantastic.
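
For the signed-metadata point in the list above, here is a minimal sketch of a device-side check that refuses downgrades and stale metadata. Assumptions: the metadata is a small JSON document with integer `version` and Unix `timestamp` fields (both invented here), and HMAC-SHA256 with a shared key is used only to keep the example stdlib-only. A real deployment would more likely use an asymmetric signature so devices hold no signing secret.

```python
import hashlib
import hmac
import json
import time

WEEK_S = 7 * 24 * 3600  # illustrative freshness window


def sign(metadata: dict, key: bytes) -> str:
    """Server side: authenticate the canonical JSON form of the metadata."""
    blob = json.dumps(metadata, sort_keys=True).encode()
    return hmac.new(key, blob, hashlib.sha256).hexdigest()


def accept_update(metadata, signature, key, installed_version,
                  now=None, max_age_s=WEEK_S):
    """Device side: return True only for authentic, fresh, newer metadata."""
    if not hmac.compare_digest(sign(metadata, key), signature):
        return False  # tampered, or signed with another key
    now = time.time() if now is None else now
    if now - metadata["timestamp"] > max_age_s:
        return False  # stale metadata: possibly a replayed old release
    # The version comes from the signed body, never from the filename.
    return metadata["version"] > installed_version
```

The key design choice is that the version the device compares against lives inside the authenticated body, so renaming an old image does nothing.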

~~~
Cyph0n
How do you handle key and/or certificate storage at the client side? Depending
on the threat model, the update verification step can be subverted.

~~~
geofft
You install manually / out-of-band (in my old job, you installed by copying to
a USB stick from a desktop machine), and the updater runs within the OS,
because you can safely write to the unused partition while the OS runs. So it
has the full set of functionality that you ship with your OS - Python,
OpenSSL, GPG, whatever. You're not downloading the update from the bootloader
or anything (which would take too long).

Once the device has been installed, there is always at least one working
partition on the device - the partition that was last booted. So you don't
need a minimal recovery partition or anything. (You could build a recovery
command-line option in, if you want, but it's just a custom way of booting the
normal partition.)

~~~
Cyph0n
How do you protect keys and/or certificates stored on the device from being
exfiltrated or replaced by an adversary?

~~~
geofft
For all of my use cases, either physical access counts as game over and is out
of the threat model, or we're using Secure Boot for verifying the bootloader
and/or TPMs for keeping secrets, at which point this is a problem with known
solutions that aren't specific to image-based updates. If you're using Secure
Boot with read-only images, one thing to try is dm-verity, which is how Chrome
OS solves this problem - it's a Merkle hash tree over the entire block device
that's checked lazily as blocks are accessed. If a block has been tampered
with, you can configure dm-verity to either panic the system or return an I/O
error for that block. (Or you can just read the entire image and verify it up
front at the cost of slower boot time.)
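
To make the "checked lazily" idea concrete, here is a toy Merkle tree over fixed-size blocks in Python. It only illustrates the general technique. It is not dm-verity's on-disk format or kernel interface, and the 4096-byte block size, SHA-256, and the rule of pairing an odd node with itself are all assumptions of this sketch. It expects a non-empty image and an in-range block index.

```python
import hashlib

BLOCK = 4096  # bytes per data block (illustrative)


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_tree(image: bytes) -> list:
    """Return the tree levels, leaf hashes first, single root hash last."""
    level = [h(image[i:i + BLOCK]) for i in range(0, len(image), BLOCK)]
    levels = []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # pair an odd node with itself
        levels.append(level)
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    levels.append(level)
    return levels


def read_block(image: bytes, index: int, levels: list, trusted_root: bytes) -> bytes:
    """Return block `index` after checking its hash path against the root.

    Only the sibling hashes on the path to the root are touched, so the
    cost is logarithmic in the image size rather than a full re-hash.
    """
    data = image[index * BLOCK:(index + 1) * BLOCK]
    node = h(data)
    pos = index
    for level in levels[:-1]:
        sibling = level[pos ^ 1]
        node = h(node + sibling) if pos % 2 == 0 else h(sibling + node)
        pos //= 2
    if node != trusted_root:
        raise IOError(f"block {index} failed verification")
    return data
```

Only `trusted_root` needs protecting (for example by a signature checked at boot); the stored tree levels can sit on untrusted storage, because altering a sibling hash changes the recomputed root as well.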

In particular, for all my use cases, there's already some mechanism for the
device gaining a secure channel to the rest of the infrastructure, so if
you're worried about keeping updates secret (which you may or may not be!),
just protect updates by that mechanism.

------
msarnoff
It's not a full OTA solution, but fwup
([https://github.com/fhunleth/fwup](https://github.com/fhunleth/fwup)) handles
the packaging and application of Linux firmware update images quite well.
Apache licensed, supports A/B updates (see tomc1985's comment), Ed25519
digital signature verification, and u-boot support.

------
jon-wood
This is wonderfully timed, I’m currently looking for options when it comes to
doing firmware updates on Linux based IoT devices. Does anyone have any
recommendations?

~~~
sjmulder
A friend who is building a Linux appliance is using Yocto:
[https://www.yoctoproject.org/](https://www.yoctoproject.org/). He also talked
about Mender, mentioned in the other comment. I don't know how much overlap
there is between these projects.

~~~
ralphmender
Disclaimer: I'm with Mender.io and author of the article.

The Yocto Project is a popular build system for your own embedded Linux
distribution.

Mender integrates with the Yocto Project through a layer
([https://github.com/mendersoftware/meta-mender](https://github.com/mendersoftware/meta-mender)),
but is a separate project for end-to-end OTA, which includes the client and
the management server. Both are licensed under Apache 2.0, so they are freely
available and you're not locked into a hosted-only backend.

Although this is an older blog post, here is how you can port the Mender
client to a non-Yocto build system: [https://mender.io/blog/porting-mender-to-
a-non-yocto-build-s...](https://mender.io/blog/porting-mender-to-a-non-yocto-
build-system)

------
karmicthreat
I like mender. It gives me a cheap 80% solution that covers updating and
management. That said I would really like something as easy as docker for
building devices. Yocto has a learning curve like a cliff.

I really like resin.io's container system, but I want to self-host.

------
paulgerhardt
If you're starting a new IoT project, build the updater first and use that for
pushing new builds. By the time you're ready for production it will be rock
solid and quick to boot.

------
kahlonel
I wonder how a Git-based client-side agent would fit in. Why not use an
already proven tech for the rollout, and then use custom installation scripts
(also within the git repo) for doing software setup? The "server" here would
just be a normal git server, with the commit ID serving as the version number.

~~~
wmf
That's addressed by the article. What about a management server, power/network
loss, atomic updates, validation, etc? If you try to write custom installation
scripts to do all that it's going to be a lot of work.

------
keithnz
We have 1000s and 1000s of devices and can easily update them. It's not hard.
The devices also have multiple micros and they can individually be updated and
rolled back. It's not really hard to implement. Though in our case we had to
build a lot of the infrastructure anyways for other reasons.

