Software Updates for IoT Devices and the Hidden Costs of Homegrown Updaters [pdf] (mender.io)
84 points by ralphmender 12 months ago | 40 comments



I've been contracting with an IoT company for 3 years now. It's interesting to see guys who coded exclusively for radio- or infrared-based remotes get pulled into the software development world.

V1: 5 years ago everything was raw sockets and custom messaging formats, with hand-coded firmware and all data stored in a custom vector format; builds were distributed on Google Drive and flashed by hand.

V2: 3 years ago we dragged them kicking and screaming into HTTP and hand-coded JSON APIs. Firmware was still custom and data was still stored in a custom vector format, but updates were now done from a non-secure server with a hash check.

V3: This past year they started on a small box with a micro Linux distro. APIs are provided by a standardized library, data is now stored in SQL, and updates are done over HTTPS.

Things are better now, except they still expect to sell and support those first 2 options for the next 10 years.


Part of what you are describing is exactly why IoT is sometimes called IoS. We've moved from solid, fast, low-latency, low-power hardware to unpredictable, slow, jittery, high-latency crap. Take the Philips Hue bridge for example. Raw binary protocol lighting tech from the 1980s can outperform it in terms of latency, throughput, and jitter.

It's unacceptable for a button to take action after some random delay between 100ms and 5s. It's even worse if there's a remote HTTPS round trip required, as network lag adds another layer of unpredictability.


Raw binary protocol lighting tech from the 1980s was stateless, but nobody is willing to accept that nowadays for home automation.

"Turn device on" - Great I can do that fast

"Turn device off" - Great I can do that fast

"Is device on or off?" - Hold on while I poll a serial rf signal device by device while I determine state.

All of the slowdown coming from our tech and others that I've seen is because hardware guys still think these old stateless solutions are acceptable and then have to hack something dirty on top to turn them into a stateful solution.


> guys still think these old stateless solutions are acceptable and then have to hack something dirty on top to turn them into a stateful solution

Isn't this the foundation of modern web development?


Maybe the industry should listen to their kicking and screaming.

So the engineering problem could be solved with a microcontroller and some binary communication protocol - yet now you're solving the same problem, but you need a system-on-a-chip, so you can run a Linux distro, so you can run a web server, so you can serve JSON. Because... why?

I don't think that's the spirit of the linked paper either. Yes, we should find standardized, secure update mechanisms. But why do we have to bring the web stack into this again, even though none of this has anything to do with the web?

Generally, I don't see how adding more layers and moving parts increases security.


> So the engineering problem could be solved with a microcontroller and some binary communication protocol - yet now you're solving the same problem, but you need a system-on-a-chip, so you can run a Linux distro, so you can run a web server, so you can serve JSON. Because... why?

I can actually walk you through each piece of this and explain why things are better now, just because I've been in this swamp for so long.

1. Raw socket communication is bad because if the socket is dropped, you have to reinitialize it. On mobile it's going to get dropped a lot. Plus new commands require lots of custom dev instead of using a standard REST library (see the sketch after this list).

2. Binary packets are simple but not flexible. "Hey, we need to change the device name to allow 64 characters now instead of 32, but you can't break all the older hubs that still restrict it to 32 characters in the binary packet, so battery level could now start at offset 108 or 140." Now repeat that for 5 other properties over the course of a year.

3. The bridge data was stored in a custom vector format instead of SQL, so we can't get the engineers to cleanly migrate that data as new requirements come down from management. Testing takes 4 times as long to make sure nothing breaks.

4. I had a different client who built an entire bridge around UDP, thinking this would mean everything would be 1 or 2 milliseconds faster for home automation. In the end they had to rebuild TCP on top of UDP to ensure correct state.

5. Custom firmware: they used to just download new firmware over HTTP, check the hash, and then apply it directly. If anything bad happens during this process the bridge is bricked - there's no rollback mechanism, no integrity check, just a dead hub. Moving to a micro Linux distro at least brings in sanity checks for updates.
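
To make point 1 concrete: reconnect, retry, and backoff handling come from a standard REST library instead of hand-rolled socket code. A rough sketch only - the bridge hostname, endpoint, and JSON fields are made up:

  import requests
  from requests.adapters import HTTPAdapter
  from urllib3.util.retry import Retry

  # Retries with backoff are configured once on the session instead of being
  # reimplemented around every raw socket call.
  session = requests.Session()
  retries = Retry(total=5, backoff_factor=0.5,
                  status_forcelist=[500, 502, 503, 504])
  session.mount("https://", HTTPAdapter(max_retries=retries))

  resp = session.get("https://bridge.local/api/devices", timeout=5)  # hypothetical endpoint
  resp.raise_for_status()
  for device in resp.json():
      print(device["name"], device.get("battery"))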


Solving these problems does not require moving to Linux, though.

> Raw socket communication is bad because if the socket is dropped, you have to reinitialize it

I don't know what to say here. This is a fundamental requirement of network programming and not a difficult one to meet.

Best case, you can re-use an existing connection. Worst case, you have to establish one from scratch, which puts you in exactly the same position as an HTTP-based solution.

Once you factor in how long it takes to establish an SSL session on a small CPU, you really want to think about holding connections open.

> new commands require lots of custom dev instead of using a standard REST library

You still need to implement the REST client and server code.

> Binary packets...

If you're reading fields out of a struct with no version checking or inline layout description, with full knowledge that the layout might change, again, I have no words. This is basic software engineering. The packet length ought to tell you which frame format you've got...

JSON doesn't really solve this. You've got limited RAM, remember? There's no limit on the length of a JSON document.

At least use protobufs or Avro or something.
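
To be concrete about the version-checking point, here is a minimal Python sketch; the frame layouts and field names are made up purely for illustration:

  import struct

  # Hypothetical status frame that grew a longer name field over time. The
  # first byte carries an explicit format version, so old and new hubs can
  # parse each other's packets without guessing offsets.
  FRAME_V1 = struct.Struct("<B32sB")   # version, 32-byte name, battery level
  FRAME_V2 = struct.Struct("<B64sB")   # version, 64-byte name, battery level

  def parse_status(packet: bytes) -> dict:
      version = packet[0]
      if version == 1:
          _, name, battery = FRAME_V1.unpack(packet[:FRAME_V1.size])
      elif version == 2:
          _, name, battery = FRAME_V2.unpack(packet[:FRAME_V2.size])
      else:
          raise ValueError("unknown frame version %d" % version)
      return {"name": name.rstrip(b"\x00").decode(), "battery": battery}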

> bridge data was stored in a custom vector format

Not all the world's an SQL database. An array of structs will fit even the tiniest computers and serve most of the same use cases.

> built an entire bridge around udp thinking this would mean everything would be 1 or 2 milliseconds faster

Eliminating TCP session startup and teardown could easily explain this. It sounds like a reasonable decision for some applications.

> they used to just download new firmware over http, check the hash and then apply it directly

Most cheap IoT devices do exactly this. It's not good, but it's common, and if you have a simple recovery bootloader it works well enough.

Using A/B partition switching makes this mechanism basically foolproof (once you add verification of the firmware blobs).

So you can do one of two things:

1. Run Linux and try to do in-place updates
2. Run something smaller and use the extra flash space for a backup partition

I would go for (2) every time on robustness and simplicity grounds. My clients prefer it too as it makes their devices cheaper and simpler.


I get the point you're trying to make, but making everything custom adds cognitive load that isn't necessary and restricts your design space. The more time you spend thinking about your custom solutions, the less time you can spend on features that differentiate your product.

I had to troubleshoot an issue that a higher-up in the company was having with their bridge at home. I asked him to hit a specific endpoint on his bridge in the browser and send me the result - a win for HTTP.

We just released Swagger docs for internal devs and 3rd parties to plug into without needing to spend time writing excessive documentation on what our data formats look like - a win for JSON.

The list goes on for all the extras you get for free when working with standardized tools.


> The list goes on for all the extras you get for free when working with standardized tools.

I guess this is the balance between how much less you pay for standardized tools and tech (and R&D), and how much more you pay for all the BOM extensions required. Time to market may be important too, favoring off-the-shelf tech. In the long run, and with big quantities, the balance might actually justify the custom-made approach.

However, it is with big quantities deployed that the issues of firmware upgrades (reliability, ease of use, security) become really serious, tipping the balance back towards the standard tech.


With big quantities, BOM cost becomes the overriding factor.

It's the reverse of the web dev scalability argument. In web dev, you go for high abstraction and divisible components because it's the only way to handle more requests per second while keeping the software flexible.

In hardware, to scale the business, you need to keep BOM cost low. To do this, you must use cheaper (smaller) hardware and specialize the software to fit. You need to eliminate abstraction and keep the software as tight as possible. Even bootup time matters -- 30 seconds for Linux to boot adds $1 of manufacturing floor time per device.

Linux adds massive BOM cost, and once you're shipping more than 5000 devices per year, it's usually worthwhile to eliminate it.


This is FUD.

Partition twice the amount of space you need on the IoT device's fixed media, and trickle-download your update to the empty partition. Once that's finished, verify the download, then switch your bootloader. I have to update 500+ field installations with a new OS, and this is the approach I'd take if we had the space and bandwidth.

(Instead we're sending out an army of techs and account reps armed with USB sticks)
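
The happy path of download-verify-switch is short, too. A rough Python sketch, assuming a U-Boot-style setup where the boot slot is picked by an environment variable writable with fw_setenv (the URL, partition path, expected hash, and variable name are all made up):

  import hashlib
  import subprocess
  import urllib.request

  UPDATE_URL = "https://updates.example.com/firmware-v2.img"    # hypothetical
  EXPECTED_SHA256 = "0" * 64        # placeholder; obtain this over a trusted channel
  INACTIVE_PART = "/dev/mmcblk0p3"  # the slot we are NOT currently booted from

  def download_to_partition(url, device):
      sha = hashlib.sha256()
      with urllib.request.urlopen(url) as src, open(device, "wb") as dst:
          while True:
              chunk = src.read(1 << 20)   # trickle down in 1 MiB chunks
              if not chunk:
                  break
              sha.update(chunk)
              dst.write(chunk)
      return sha.hexdigest()

  digest = download_to_partition(UPDATE_URL, INACTIVE_PART)
  if digest != EXPECTED_SHA256:
      raise SystemExit("hash mismatch; bootloader stays pointed at the old slot")

  # Only after verification do we flip the boot slot. A failed or interrupted
  # download never touches the partition we're currently running from.
  subprocess.run(["fw_setenv", "boot_slot", "b"], check=True)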


I'm not sure what part you're referring to as FUD - the cost?

I definitely agree with the rest of the comment. It's a good solution. I inherited a software product that worked like this at a past job, and it was great. (And now I'm trying to convince my current job that we want to move to this model for normal servers in datacenters, very much not IoT.)

A couple of complications:

- You should think about the fact that this means your root partition changes. Either you want to structure your system with separate read and write partitions and bind-mount the relevant directories from the write partition, or you want to make it completely read-only / stateless. Remember that /var/log is traditionally on your local disk, so if you don't do anything special, you'll even get two /var/logs on each device, which may or may not be what you want.

- You do want a management server, as this document suggests, to track which devices have actually updated and which haven't, so you can manually send people after devices that are just behind a terrible internet connection.

- You want some mechanism for detecting if the new version doesn't work and rolling back; this is basically as simple as setting an "I just tried partition X, if it doesn't work don't try it again" flag in the bootloader on boot, and clearing it once userspace is up (and when the partition gets rewritten with a new version). A sketch of this flag follows the list.

- The updates should be signed etc. as described in the document. Depending on your threat model, you might want to prevent replay attacks where an attacker serves an old, signed update under a higher-versioned filename to force a downgrade; either use HTTPS to an update server you control, or use signed metadata files with timestamps.
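
To make the rollback flag concrete, here is one shape it can take, assuming a U-Boot-style environment reachable from userspace with fw_setenv; the variable names are mine, not any standard:

  import subprocess

  # Bootloader side (pseudologic): before jumping into the newly written slot,
  # it sets try_slot=<slot>. If try_slot is still set at the next boot, the
  # previous attempt never reached userspace, so boot the old slot instead.
  #
  # Userspace side, run late in boot once services are healthy:
  def mark_boot_successful(slot):
      subprocess.run(["fw_setenv", "try_slot"], check=True)         # no value clears the flag
      subprocess.run(["fw_setenv", "good_slot", slot], check=True)  # remember the known-good slot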

The fact that you get image-based deployments instead of dealing with apt upgrades from arbitrarily-old versions (and thus inevitably slightly drifting configs on devices installed at different times) is fantastic.


I looked at the document and it immediately reminded me of the kind of person who brings up tons of questions at meetings that may or may not be relevant to the actual task. There is a recurring thread of problems in CS being really simple but then needlessly complicated by well-intentioned nerds unable to see the forest for the trees. I say FUD because I see helpful blog posts promoting new tools and software when they should be promoting techniques.

Seriously, those stack diagrams are massively overkill designs for OTA update infra.

As for your specific points, I left out a LOT of detail.

  - You should think about the fact that this means your root partition changes
Yes, or no, depending on your bootloader. GRUB and others can chainload to other partitions after control is given. Of course you'll need to think about how to re-provision user-specific data, though the first idea that pops in my mind would be to simply read it from the old install on first boot. (IRL we are solving this with a first-boot 'activation' app, which re-downloads user data on completion)

  - You do want a management server, as this document suggests...
Yes, but this is way easier than needing to learn and roll out someone else's complicated piece of tech: an API server that can receive updates from the field and download scripts/run the commands to start the process.

Also, if you've been supporting this kind of product for a while you probably already have your own infra.

  - You want some mechanism for detecting if the new version doesn't work and rolling back;
Of course. Though, if your hardware is uniform you can get away with a lot less of this than one might think, due to the updates being image-based and not package-based. The easiest way is to simply compute a SHA hash of the download to make sure it wasn't corrupted (make sure the value you compare it with came from a trusted conversation with your server). If it fails to boot, the newly provisioned code needs to know to re-point the bootloader at the previous install, though again with uniform hardware your main concerns are transport-related.

  - The updates should be signed etc. as described in the document.
Fairly trivial with PGP et al.
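
For example, with a detached signature shipped next to the image and the gpg binary that's already on the device (key distribution and rotation left out of this sketch):

  import subprocess

  def image_signature_ok(image_path, sig_path):
      # gpg --verify checks a detached signature against the image using the
      # public keys already in the device's keyring.
      result = subprocess.run(["gpg", "--verify", sig_path, image_path],
                              capture_output=True)
      return result.returncode == 0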


> I say FUD because I see helpful blog posts promoting new tools and software when they should be promoting techniques.

Please do not try to redefine the meaning of an existing term. "FUD" already has a well-known meaning, different from what you tried to use it as: https://en.wikipedia.org/wiki/FUD


How do you handle key and/or certificate storage at the client side? Depending on the threat model, the update verification step can be subverted.


You install manually / out-of-band (in my old job, you installed by copying to a USB stick from a desktop machine), and the updater runs within the OS, because you can safely write to the unused partition while the OS runs. So it has the full set of functionality that you ship with your OS - Python, OpenSSL, GPG, whatever. You're not downloading the update from the bootloader or anything (which would take too long).

Once the device has been installed, there is always at least one working partition on the device - the partition that was last booted. So you don't need a minimal recovery partition or anything. (You could build a recovery command-line option in, if you want, but it's just a custom way of booting the normal partition.)


How do you protect keys and/or certificates stored on the device from being exfiltrated or replaced by an adversary?


For all of my use cases, either physical access counts as game over and is out of the threat model, or we're using Secure Boot for verifying the bootloader and/or TPMs for keeping secrets, at which point this is a problem with known solutions that aren't specific to image-based updates. If you're using Secure Boot with read-only images, one thing to try is dm_verity, which is how Chrome OS solves this problem - it's a Merkle hash of the entire block device that's checked lazily as blocks are accessed. If a block has been tampered with, you can configure dm_verity to either panic the system or return an I/O error for that block. (Or you can just read the entire image and verify it up front at the cost of slower boot time.)
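
A toy sketch of the Merkle idea, for illustration only (this is not the actual dm_verity on-disk format):

  import hashlib

  BLOCK_SIZE = 4096

  def merkle_root(image):
      # Hash each block, then hash pairs of child hashes up to a single root,
      # which is what gets checked against a trusted (signed) value.
      level = [hashlib.sha256(image[i:i + BLOCK_SIZE]).digest()
               for i in range(0, len(image), BLOCK_SIZE)]
      if not level:
          level = [hashlib.sha256(b"").digest()]
      while len(level) > 1:
          if len(level) % 2:
              level.append(level[-1])                  # duplicate the odd node out
          level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                   for i in range(0, len(level), 2)]
      return level[0]

  # Verifying one block only needs its sibling hashes up the tree, which is why
  # the check can happen lazily as blocks are read.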

In particular, for all my use cases, there's already some mechanism for the device gaining a secure channel to the rest of the infrastructure, so if you're worried about keeping updates secret (which you may or may not be!), just protect updates by that mechanism.


Unless you have a TPM or something you just don't; this is an unsolvable problem in general.


Disclaimer: I work on the Mender project.

Signing and verification in Mender is covered here: https://docs.mender.io/artifacts/signing-and-verification


All of you interested in embedded tips might find Jack Ganssle's The Embedded Muse interesting. The current and back issues are here:

http://www.ganssle.com/tem-subunsub.html

Relevant to tomc1985's comment, Jack surveyed the methods embedded developers use for handling updates in these two articles:

http://www.ganssle.com/tem/tem288.html

http://www.ganssle.com/tem/tem289.html


Second that! Have been reading TEM since 2004. Great fortnightly source of wisdom for anyone dealing with microcontrollers!


A short tale of three firmware updaters.

Device #1: A "root boot" loader in flash, very simple, loaded by a boot ROM. Its only job is to find the first good "large" chunk of boot code, get it into RAM, and run it. The "large" chunks are in one of two slots (the selection logic is sketched in the code below). The update process is simple (update and test each slot independently), and if the update is interrupted there's always an image to boot from. It takes the developer responsible for FW updates about a day to write the code and get it rock solid. An update takes about five seconds.

Device #2: Implements a complicated file system. Some files need to be updated, some are untouchable, some need other special treatment (and the documentation is, unsurprisingly, dead wrong about things). The update process is very slow (tens of minutes) and if it is interrupted you wind up with a brick. It takes about two man-years to get the update process stable enough to ship.

Device #3: Implements the standard USB firmware update protocol (DFU). There's a mountain of host-side code to deal with the edge conditions, and firmware updates are unpredictable. Devices get bricked pretty often. Takes maybe six months to get the update process stable enough to ship.

Thinking of #2 and #3 still makes me mad. This stuff just isn't hard.
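
The Device #1 slot-selection logic, sketched in Python for readability (the real thing is a small amount of boot code, and the header layout here is invented):

  import struct
  import zlib

  SLOT_OFFSETS = (0x20000, 0x120000)   # two image slots in flash, hypothetical
  HEADER = struct.Struct("<4sII")      # magic, payload length, CRC32 of payload

  def pick_slot(flash):
      for offset in SLOT_OFFSETS:
          magic, length, crc = HEADER.unpack_from(flash, offset)
          payload = flash[offset + HEADER.size:offset + HEADER.size + length]
          if magic == b"IMG1" and len(payload) == length and zlib.crc32(payload) == crc:
              return offset            # first good image wins
      return None                      # nothing bootable; fall back to recovery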


Heh - I've mentioned this here before - I got very lucky a while back: between finalising the design and BOM of our IoT widget and ordering parts for the first production run, the price of 4GB micro SD cards dropped below the price of 2GB ones. That gave me an entire full-sized extra partition to play with for updates, which reduced a _lot_ of my headaches...



That's precisely what Android Things does for you for updates :) Plus automatic rollback if the new version is problematic.


It's not a full OTA solution, but fwup (https://github.com/fhunleth/fwup) handles the packaging and application of Linux firmware update images quite well. Apache licensed, supports A/B updates (see tomc1985's comment), Ed25519 digital signature verification, and u-boot support.


This is wonderfully timed - I'm currently looking for options when it comes to doing firmware updates on Linux-based IoT devices. Does anyone have any recommendations?


The source of this article is Mender.io, which is an open source embedded Linux distro that includes built-in OTA updates:

http://mender.io/

I can't speak to their quality as I haven't tried it yet. I have used Resin.io quite a bit and it's great if it fits your use case (and the pricing model makes sense for you).


I occasionally contribute to a fairly large home automation project named Home Assistant (https://home-assistant.io/)

They have an all-in-one operating system preconfigured for a Raspberry Pi or Intel NUC, dubbed "hass.io", that uses ResinOS under the covers:

https://home-assistant.io/hassio/

https://resinos.io/

Under the covers, ResinOS is a minimal Yocto embedded Linux + Docker + some OTA stuff. Consider looking into that before building your own.


Resin is definitely quite high up on my list of options; I've done some prototyping with their SaaS service as well. My only hesitation is that the pricing is pretty prohibitive for the sort of volumes we're looking at in the medium term. Like you, my initial contact with them was through Hass.io, which has been fantastic.

Definitely going to do some digging into Yocto and Mender though.


Since Mender and Yocto have already been mentioned, I have to bring up Buildroot with SWupdate. Batteries aren't included like with Mender, but you also don't have to assemble layers from the far reaches of the internet like Yocto.


A friend who is building a Linux appliance is using Yocto: https://www.yoctoproject.org/. He also talked about Mender, mentioned in the other comment. I don't know how much overlap there is between these projects.


Disclaimer: I'm with Mender.io and author of the article.

The Yocto Project is a popular build system for creating your own embedded Linux distribution.

Mender integrates with the Yocto Project via a layer (https://github.com/mendersoftware/meta-mender), but is a separate project for end-to-end OTA that includes the client and the management server. Both are licensed under Apache 2.0, so it is freely available and you're not locked into a hosted-only backend.

Although this is an older blog post, here is how you can port the Mender client to a non-Yocto build system: https://mender.io/blog/porting-mender-to-a-non-yocto-build-s...


This page on the Yocto wiki https://wiki.yoctoproject.org/wiki/System_Update provides an overview of many of the current offerings for software update systems.


I like Mender. It gives me a cheap 80% solution that covers updating and management. That said, I would really like something as easy as Docker for building devices. Yocto has a learning curve like a cliff.

I really like resin.io's container system, but I want to self-host.


If you're starting a new IoT project, build the updater first and use that for pushing new builds. By the time you're ready for production it will be rock solid and quick to boot.


I wonder how a Git-based client-side agent would fit in. Why not use an already proven tech for the rollout, and then use custom installation scripts (also within the git repo) for doing software setup? The "server" here would just be a normal git server, with the commit ID serving as the version number.
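
Roughly what I have in mind, as a sketch (assumes git and an install script already live on the device; the paths and branch name are made up):

  import subprocess

  REPO_DIR = "/opt/app"    # hypothetical checkout location on the device

  def update_from_git():
      subprocess.run(["git", "-C", REPO_DIR, "fetch", "origin"], check=True)
      subprocess.run(["git", "-C", REPO_DIR, "checkout", "--force", "origin/stable"], check=True)
      subprocess.run(["sh", REPO_DIR + "/install.sh"], check=True)
      # The commit ID doubles as the installed version number.
      rev = subprocess.run(["git", "-C", REPO_DIR, "rev-parse", "HEAD"],
                           capture_output=True, text=True, check=True)
      return rev.stdout.strip()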


That's addressed by the article. What about a management server, power/network loss, atomic updates, validation, etc? If you try to write custom installation scripts to do all that it's going to be a lot of work.


We have 1000s and 1000s of devices and can easily update them. It's not hard. The devices also have multiple micros, and they can individually be updated and rolled back. It's not really hard to implement. Though in our case we had to build a lot of the infrastructure anyway for other reasons.



