V1: Five years ago everything was raw sockets and custom messaging formats, with hand-coded firmware, all data stored in a custom vector format, and builds distributed on Google Drive and flashed by hand.
V2: Three years ago we dragged them kicking and screaming into HTTP and hand-coded JSON APIs. Firmware was still custom and data was still stored in a custom vector format, but updates were now done from a non-secure server with a hash check.
V3: This past year they started on a small box with a micro Linux distro: APIs are provided by a standardized library, data is now stored in SQL, and updates are done over HTTPS.
Things are better now, except they still expect to sell and support those first 2 options for the next 10 years.
It's unacceptable for a button to take action after some random delay between 100ms and 5s. It's even worse if a remote HTTPS round trip is required, as network lag adds another layer of unpredictability.
"Turn device on" - Great I can do that fast
"Turn device off" - Great I can do that fast
"Is device on or off?" - Hold on while I poll a serial RF signal device by device to determine state.
All of the slowdown coming from our tech and others I've seen is because hardware guys still think these old stateless solutions are acceptable, and then have to hack something dirty on top to turn them into stateful solutions.
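The stateful alternative is simple enough to sketch: the hub keeps a state table updated by incoming device events, so answering "is device on?" never touches the radio. This is a minimal illustration, not any real product's API; all names here are invented.

```python
# Sketch: the hub caches the last reported state of each device, so a
# state query is a dictionary lookup instead of a device-by-device poll.
# All names are illustrative.

class HubStateCache:
    def __init__(self):
        self._state = {}  # device_id -> last reported state

    def on_rf_event(self, device_id, state):
        # Called whenever a device reports a change over the radio.
        self._state[device_id] = state

    def is_on(self, device_id):
        # Answers instantly from cache; None means "never heard from it".
        return self._state.get(device_id)

cache = HubStateCache()
cache.on_rf_event("lamp-1", "on")
print(cache.is_on("lamp-1"))   # on
print(cache.is_on("lamp-2"))   # None
```

The tradeoff is that the cache can go stale if a device changes state without reporting it, which is exactly the dirty-hack territory the comment describes.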
Isn't this the foundation of modern web development?
So the engineering problem could be solved with a micro-controller and some binary communication protocol - yet now you're solving the same problem, but need a system-on-a-chip, so you can run a linux distro, so you can run a web-server, so you can serve JSON. Because... why?
I don't think that's the spirit of the linked paper either. Yes, we should find standardized, secure update mechanisms. But why do we have to bring the web stack into this again, even though none of this has anything to do with the web?
Generally, I don't see how adding more layers and moving parts increases security.
I can actually walk you through each piece of this and explain why things are better now, just because I've been in this swamp for so long.
1. Raw socket communication is bad because if the socket is dropped, you have to reinitialize it. On mobile it's going to get dropped a lot. Plus new commands require lots of custom dev instead of using a standard REST library.
2. Binary packets are simple but also not flexible. "Hey we need to change device name to allow 64 characters now instead of 32, but you can't break all older hubs that still restrict to 32 characters in the binary packet, this means battery level could start at offset 108 or 140" Now repeat that for 5 other properties over the course of a year.
3. The bridge data was stored in a custom vector format instead of sql, so we can't get the engineers to cleanly migrate that data as new requirements come down from management. Testing takes 4 times as long to make sure nothing breaks.
4. I had a different client who built an entire bridge around udp thinking this would mean everything would be 1 or 2 milliseconds faster for home automation. In the end they had to rebuild tcp on top of udp to ensure correct state.
5. Custom firmware, they used to just download new firmware over http, check the hash and then apply it directly. If anything bad happens during this process the bridge is bricked, there's no rollback mechanism, no integrity check, just a dead hub. Moving to a micro linux distro at least brings in sanity checks for updates.
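The offset headache in point 2 is usually solved with an explicit version field rather than guessing layouts. A minimal sketch, with entirely hypothetical field layouts:

```python
import struct

# Sketch: a one-byte version field at a fixed offset lets old and new hubs
# parse each other's packets, instead of guessing whether battery level
# lives at offset 108 or 140. These layouts are invented for illustration.

V1 = struct.Struct("<B32sB")   # version, 32-byte name, battery level
V2 = struct.Struct("<B64sB")   # version, 64-byte name, battery level

def parse_status(packet: bytes) -> dict:
    version = packet[0]
    layout = {1: V1, 2: V2}[version]
    _, raw_name, battery = layout.unpack(packet[:layout.size])
    return {"name": raw_name.rstrip(b"\x00").decode(), "battery": battery}

pkt = V2.pack(2, b"Kitchen light", 87)
print(parse_status(pkt))  # {'name': 'Kitchen light', 'battery': 87}
```

Each new property then adds a layout entry rather than a year of offset archaeology, though every parser still has to ship before the first v2 packet goes out.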
> Raw socket communication is bad because if the socket is dropped, you have to reinitialize it
I don't know what to say here. This is a fundamental requirement of network programming and not a difficult one to meet.
Best case, you can re-use an existing connection. Worst-case, you have to establish one from scratch, which puts you in exactly the same position as an HTTP-based solution.
Once you factor in the significant time required to establish an SSL session on a small CPU, you really want to think about holding connections open.
> new commands require lots of custom dev instead of using a standard REST library
You still need to implement the REST client and server code.
> Binary packets...
If you're reading fields out of a struct with no version checking or inline layout description, with full knowledge that the layout might change, again, I have no words. This is basic software engineering. The packet length ought to tell you which frame format you've got...
JSON doesn't really solve this. You've got limited RAM, remember? There's no limit on the length of a JSON document.
At least use protobufs or Avro or something.
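The unbounded-JSON concern is usually handled with a length prefix and a hard cap, which also gives you the length-based framing mentioned above. A sketch, assuming an arbitrary 1 KiB RAM budget:

```python
import json
import struct

# Sketch: if JSON does go on the wire, a length prefix with a hard cap
# keeps a small-RAM device from being asked to buffer an unbounded
# document. The 1 KiB cap is an arbitrary example value.

MAX_FRAME = 1024

def encode_frame(obj) -> bytes:
    payload = json.dumps(obj).encode()
    if len(payload) > MAX_FRAME:
        raise ValueError("payload exceeds device RAM budget")
    return struct.pack("<H", len(payload)) + payload

def decode_frame(data: bytes):
    (length,) = struct.unpack_from("<H", data)
    if length > MAX_FRAME:
        raise ValueError("refusing oversized frame")
    return json.loads(data[2:2 + length])

frame = encode_frame({"cmd": "set_name", "name": "Kitchen light"})
print(decode_frame(frame))  # {'cmd': 'set_name', 'name': 'Kitchen light'}
```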
> bridge data was stored in a custom vector format
Not all the world's an SQL database. An array of structs will fit even the tiniest computers and serve most of the same use cases.
> built an entire bridge around udp thinking this would mean everything would be 1 or 2 milliseconds faster
Eliminating TCP session startup and teardown could easily explain this. It sounds like a reasonable decision for some applications.
> they used to just download new firmware over http, check the hash and then apply it directly
Most cheap IoT devices do exactly this. It's not good, but it's common, and if you have a simple recovery bootloader it works well enough.
Using A/B partition switching makes this mechanism basically foolproof (once you add verification of firmware blobs).
So you can do one of two things:
1. Run Linux and try to do in-place updates
2. Run something smaller and use the extra flash space for a backup partition
I would go for (2) every time on robustness and simplicity grounds. My clients prefer it too as it makes their devices cheaper and simpler.
I had to troubleshoot an issue that a higher up in the company was having with their bridge at home. I asked him to hit a specific endpoint on his bridge in the browser and send me the result, a win for HTTP.
We just released Swagger docs for internal devs and 3rd parties to plug into without needing to spend time writing excessive documentation on what our data formats look like, a win for JSON.
The list goes on for all the extras you get for free when working with standardized tools.
I guess this is the balance between how much less you pay for standardized tools and tech (and R&D), and how much more for all the BOM extensions required. TTM may be important too, favoring off-the-shelf tech. In the long run, and with big quantities, the balance might actually justify the custom-made.
However, it is with the big quantities deployed that the issues of firmware upgrade -- reliability, ease of use, security -- become really serious, tipping the balance again towards the standard tech.
It's the reverse of the web dev scalability argument. In web dev, you go for high abstraction and divisible components because it's the only way to handle more requests per second while keeping the software flexible.
In hardware, to scale the business, you need to keep BOM cost low. To do this, you must use cheaper (smaller) hardware and specialize the software to fit. You need to eliminate abstraction and keep the software as tight as possible. Even bootup time matters -- 30 seconds for Linux to boot adds $1 of manufacturing floor time per device.
Linux adds massive BOM cost, and once you're shipping more than 5000 devices per year, it's usually worthwhile to eliminate it.
Partition twice the amount of space you need on the IoT device's fixed media, and trickle-download your update to the empty partition. Once that's finished, verify the download, then switch your bootloader. I have to update 500+ field installations with a new OS, and this is the approach I'd take if we had the space and bandwidth.
(Instead we're sending out an army of techs and account reps armed with USB sticks)
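The download-verify-switch flow described above can be sketched in a few lines. A dict stands in for the real partitions and bootloader state, and the digest check is ordinary SHA-256; none of this is any particular product's implementation.

```python
import hashlib

# Sketch: stream the new image into the inactive slot, verify its digest,
# and only then point the boot flag at it. A dict stands in for real
# partitions and persistent bootloader state.

def apply_update(disk: dict, image: bytes, expected_sha256: str):
    inactive = "B" if disk["boot"] == "A" else "A"
    disk[inactive] = image                 # trickle-download goes here
    if hashlib.sha256(disk[inactive]).hexdigest() != expected_sha256:
        raise ValueError("bad image, boot flag untouched")
    disk["boot"] = inactive                # the switch is the last step

disk = {"A": b"fw-v1", "B": b"", "boot": "A"}
new_image = b"fw-v2"
apply_update(disk, new_image, hashlib.sha256(new_image).hexdigest())
print(disk["boot"])   # B
```

Because the running slot is never written and the switch happens last, an interrupted or corrupted download just leaves a stale inactive slot behind.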
I definitely agree with the rest of the comment. It's a good solution. I inherited a software product that worked like this at a past job, and it was great. (And now I'm trying to convince my current job that we want to move to this model for normal servers in datacenters, very much not IoT.)
A couple of complications:
- You should think about the fact that this means your root partition changes. Either you want to structure your system with separate read and write partitions and bind-mount the relevant directories from the write partition, or you want to make it completely read-only / stateless. Remember that /var/log is traditionally on your local disk, so if you don't do anything special, you'll even get two /var/logs on each device, which may or may not be what you want.
- You do want a management server, as this document suggests, to track which devices have actually updated and which haven't, so you can manually send people after devices that are just behind a terrible internet connection.
- You want some mechanism for detecting if the new version doesn't work and rolling back; this is basically as simple as setting a "I just tried partition X, if it doesn't work don't try it again" flag in the bootloader on boot, and clearing it once userspace is up (and when the partition gets rewritten with a new version).
- The updates should be signed etc. as described in the document. Depending on your threat model, you might want to prevent replay attacks that cause an attacker-controlled downgrade by giving it a higher-versioned filename; either use HTTPS to an update server you control, or use signed metadata files with timestamps.
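The rollback flag in the third point above amounts to a tiny state machine. A sketch, with a plain dict standing in for the flash/NVRAM a real bootloader would use:

```python
# Sketch of the "I just tried partition X" flag: the bootloader marks a
# slot as "trying" before booting it, and userspace clears the mark once
# it's up. If power is cut mid-boot, the mark survives and the next boot
# falls back to the other slot. A dict stands in for NVRAM.

def choose_slot(state: dict) -> str:
    target, fallback = state["target"], state["fallback"]
    if state.get("trying") == target:
        # We already tried this slot and never came back up: roll back.
        return fallback
    state["trying"] = target
    return target

def mark_boot_ok(state: dict):
    # Called from userspace once the system is confirmed up.
    state.pop("trying", None)

state = {"target": "B", "fallback": "A"}
print(choose_slot(state))   # B  (first attempt at the new slot)
# ...power cut before mark_boot_ok() ever runs...
print(choose_slot(state))   # A  (automatic rollback on the next boot)
```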
The fact that you get image-based deployments instead of dealing with apt upgrades from arbitrarily-old versions (and thus inevitably slightly drifting configs on devices installed at different times) is fantastic.
Seriously, those stack diagrams are massively overkill for OTA update infra.
As for your specific points, I left out a LOT of detail.
> You should think about the fact that this means your root partition changes

> You do want a management server, as this document suggests...

Also, if you've been supporting this kind of product for a while you probably already have your own infra.

> You want some mechanism for detecting if the new version doesn't work and rolling back

> The updates should be signed etc. as described in the document.
Once the device has been installed, there is always at least one working partition on the device - the partition that was last booted. So you don't need a minimal recovery partition or anything. (You could build a recovery command-line option in, if you want, but it's just a custom way of booting the normal partition.)
In particular, for all my use cases, there's already some mechanism for the device gaining a secure channel to the rest of the infrastructure, so if you're worried about keeping updates secret (which you may or may not be!), just protect updates by that mechanism.
Signing and verification in Mender is covered here: https://docs.mender.io/artifacts/signing-and-verification
Relevant to tomc1985's comment, Jack surveyed how embedded developers handle updates in these two articles:
Device #1: A "root boot" loader in flash, very simple, loaded by a boot ROM. Its only job is to find the first good "large" chunk of boot code, get it into RAM and run it. The "large" chunks are in one of two slots. The update process is simple (update and test each slot independently), and if the update is interrupted there's always an image to boot from. It takes the developer responsible for FW update about a day to write the code and get it rock solid. An update takes less than about five seconds.
Device #2: Implements a complicated file system. Some files need to be updated, some are untouchable, some need other special treatment (and the documentation is, unsurprisingly, dead wrong about things). The update process is very slow (tens of minutes) and if it is interrupted you wind up with a brick. It takes about two man-years to get the update process stable enough to ship.
Device #3: Implements the standard USB firmware update protocol (DFU). There's a mountain of host-side code to deal with the edge conditions, and firmware updates are unpredictable. Devices get bricked pretty often. Takes maybe six months to get the update process stable enough to ship.
Thinking of #2 and #3 still makes me mad. This stuff just isn't hard.
I can't speak to their quality as I haven't tried it yet. I have used Resin.io quite a bit and it's great if it fits your use case (and the pricing model makes sense for you).
They have an all-in-one operating system preconfigured for a Raspberry Pi or Intel NUC dubbed "hass.io" that uses ResinOS under the covers:
Under the covers, ResinOS is a minimal Yocto embedded Linux + Docker + some OTA stuff. Consider looking into that before building your own.
Definitely going to do some digging into Yocto and Mender though.
The Yocto Project is a popular build system for your own embedded Linux distribution.
Mender integrates with the Yocto Project via a layer (https://github.com/mendersoftware/meta-mender), but is a separate end-to-end OTA project that includes both the client and the management server. Both are licensed under Apache 2.0, so it's freely available and you're not locked into a hosted-only backend.
Although this is an older blog post, here is how you can port the Mender client to a non-Yocto build system: https://mender.io/blog/porting-mender-to-a-non-yocto-build-s...
I really like resin.io's container system, but I want to self-host.