As a sort of PSA, I want to plug the work that we have done at balena.io (formerly known as resin.io) to solve problems like these for everyone deploying Linux IoT devices, including open source / non-paying users.
We've built the open source balenaOS and the (newly) open sourced openBalena management server that anyone can use without paying us a penny. If you're about to manufacture a fleet of Linux devices and put them out there, and you don't want to pay for a service (so you're not using balenaCloud) but also don't want to solve problems we've already spent years fine-tuning (such as NTP, DNS, cellular modem support, read-only filesystems, host updates, etc.), use the above projects and save yourself a mountain of pain in the ensuing years.
I hate being alarmist, but working with device fleets over the last 7 years, I've seen things you wouldn't believe. I've seen devices catch on fire, drown in storms, launch DDoS attacks, and everything in between. It's incredibly hard to overestimate how bad things can get when you put your devices out there with no ability to find them and update them if push comes to shove.
 balenaOS: https://www.balena.io/os/
 openBalena: https://www.balena.io/open/
 openBalena github: https://github.com/balena-io/open-balena
...off the shoulder of Orion?
I would go a bit more radical: I'd say you'd have a problem.
Also, your story is quite fascinating, can you tell us more?
I am very, and I mean VERY interested in working on something like that, but not sure how to transition out of the generic CRUD web programmer to your area... :(
The difficulty comes in configuration, where things are much more complex than the cloud, since every device is "special". Different customers, keys, settings of all sorts. So we allow individual environment variable settings for each device.
The real complexity is in ensuring you can reach every device as long as possible, and that the devices behave well when they can't reach the cloud. We set up a VPN to ensure the former, and have a pull-based architecture where the device is responsible for "catching up" to the fleet when it comes back online to handle the latter.
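A rough sketch of the pull-based idea in Python, to make it concrete. This is not balena's actual code; the function names, the dict layout and the service names are all made up for illustration. The device fetches the fleet's target state whenever it is online and works out for itself what to change, and per-device variables simply override the fleet defaults:

```python
def plan_catch_up(current, target):
    """Return the steps a device needs to move from `current` to `target`.

    Both arguments map a service name to the release it runs
    (a hypothetical layout, chosen only for this sketch).
    """
    steps = []
    # Anything running locally that the fleet no longer wants goes first.
    for name in sorted(current.keys() - target.keys()):
        steps.append(("remove", name))
    # Then install what is missing and update what is out of date.
    for name in sorted(target):
        if name not in current:
            steps.append(("install", name, target[name]))
        elif current[name] != target[name]:
            steps.append(("update", name, target[name]))
    return steps


def effective_env(fleet_env, device_env):
    """Per-device environment variables override fleet-wide defaults."""
    return {**fleet_env, **device_env}


if __name__ == "__main__":
    # A device that was offline while the fleet moved on by several releases.
    current = {"api": "v1", "legacy": "v1"}
    target = {"api": "v3", "ui": "v1"}
    for step in plan_catch_up(current, target):
        print(step)
    print(effective_env({"TZ": "UTC", "LOG": "info"}, {"LOG": "debug"}))
```

The point of doing it this way is that a device which missed several releases does not replay them; it only compares where it is with where the fleet is now.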
There's a bunch of other things we've solved, like container deltas to improve download times by something like 10-70 times (and reduce bandwidth costs) but honestly the biggest difficulty has been in integrating cloud/hardware/developer/network/OS etc seamlessly, so people can succeed at building fleets, and fast. It's like there's 10,000 papercuts moreso than one or two big problems to solve. It's a long chain of things that all have to work right to make the system work, and each one has to be done in a way so that it doesn't break after some time passes (see the OP for an example).
To answer your last question with (yet another) plug, you can probably run through our getting started guide in an hour and have your first device that you can deploy a js project to pretty fast. If you spend a few days, you can start to have some real accomplishments. It's not that hard to make the jump these days if you have a web background (hell, most of the founders didn't have IoT experience when we started) but there are things you need to learn that relate to hardware and linux if you want to do something more advanced. You can learn at your pace though, and I thoroughly recommend starting with a project you enjoy, something like, but not necessarily https://www.balena.io/blog/make-a-web-frame-with-raspberry-p...
The target context is a medical facility in a remote location without Internet (or extremely limited Internet), though. Is there a way to distribute updates by USB stick?
What's the best place to read stories of people who have tried deploying with Balena on RasPi-type servers and/or contexts without Internet access, and learn about their successes and difficulties?
And if you're interested in a board that is as easy to program as the Pi but more robust, you can check out -- https://www.balena.io/blog/introducing-project-fin-a-board-f...
- Ability to run on 6V to 24V power
- Low-power coprocessor
- Battery-powered real-time clock
Given that my plan is to run off of 5V USB power and I don't intend to write special software for the coprocessor, can you help me understand whether there might still be good reasons to use a Fin?
The other question I have is about the stability of docker itself at scale. I worry about the reliability of essentially developer/server tooling in an embedded space where there is no ops team to help it along. Have you had any issues?
We've done a lot of things to make sure balenaEngine is rock solid in an embedded environment, and our fleetOps team is pretty hardcore in helping customers get out of hard situations (which helps us further evolve the OS and the engine to avoid those by default).
possibly the only work of fiction ever inspired by NTP, IoT and cryptographic protocols.
This struggle with embedded ip address seems to echo this part of my story, "These hastily made things flooded the market and soon replaced other well-documented things. At times something failed and its inventors could not say why, they just assembled a new one or went bankrupt."
Do you have a list of your stories available?
Paced by the Animals
The Trix Rabbit of Thorium
A relative newbie finding himself responsible for nontrivial design and implementation decisions for fleets of robots. Luckily they were always updatable. But if you asked me to set up the NTP story for them (which they had but people smarter than me worried about it) I would have Googled it for a while and just hoped that I didn't miss any fundamental understanding of how to use NTP.
p.s. this article felt like it was the perfect length. It shares the perfect amount of detail succinctly.
The author responded to the initial problem in the old-fashioned internet "we're all here to make things work" kind of way, without letting himself get taken advantage of.
And then when the problem decided not to play nicely, he increased the pressure in a civilized way.
These days most companies would have just said, "Sucks to be you" and cut off the dummy IoT company.
This also illustrates why big companies like Apple maintain their own NTP services.
(It could have been set up properly to be distributed here, but they didn't do it)
However, people just go to a list of NTP servers and copy a few into the code instead of using the distributed pool. Then it's no surprise that NTP in the product eventually stops working, while one very overwhelmed guy who happened to run one of the servers ends up in serious trouble; see https://news.ycombinator.com/item?id=18753835.
ARIN and RIPE control allocations, but do not control routing to allocated ranges.
But it's true that newcomers these days can't truly "own" their own addresses.
Nevertheless, there are three practical difficulties.
1. Static IP addresses are unavailable on most home connections, and many home broadband providers throttle UDP traffic, dropping packets when the pps rate is high. That makes one's home unsuitable for hosting an NTP server.
2. Unlike Stratum-2, you cannot simply use a "cloud" service for your dedicated server. To run an NTP Stratum-1 server, you have to physically host your server with the customized hardware in a datacenter, which costs 100 dollars/month in my city, not including network transit and bandwidth. I really want to run one, but I cannot afford it.
3. Shortwave / GPS reception is usually not available in a datacenter, and an antenna installation is usually not allowed. You can be creative: one good option is using the time provided by mobile base stations. But it takes experience.
That would definitely not be Stratum-1, so I wouldn’t recommend it.
Makes for interesting reading.
1. NEVER, ever hardcode an individual NTP server (in the form of an IP or domain). DO NOT just go to a list of NTP servers, then copy a few into your code. DON'T ping pool.ntp.org and write down its IP address. DON'T DO ANY OF THESE! PLEASE!
2. DO NOT use Stratum 0 and Stratum 1 servers. Please use Stratum 2 and below. Practically, if you follow Rule No.1, then you are always following this rule.
3. If the scale of your system is small, in the hundreds or a few thousand, PLEASE USE pool.ntp.org; this is the NTP community cluster backed by a DNS load balancer. Always resolve the name through DNS, and make sure the IP is not cached locally for too long. If you need more than one server, use 0.pool.ntp.org, 1.pool.ntp.org, 2.pool.ntp.org, etc. (3 is often enough).
4. If the scale of your system is large, such as tens of thousands, or you are making a new system, you SHOULD request a customized prefix from pool.ntp.org, such as debian.pool.ntp.org; it helps the community manage the traffic. If your system is a large commercial one, you ARE REQUIRED to donate some servers to the NTP Pool to compensate the community. Another option is running your own private NTP cluster. The policy is here: https://www.ntppool.org/en/vendors.html
5. If possible, run a standard NTP implementation, like NTPd, chrony, or something else as long as it's written professionally. Nowadays even lightbulbs run Linux, so why don't you run a standard NTPd?
But if you can't, then make sure...
(a) Implement NTPv4, DO NOT use NTPv1.
(b) Read the new SNTP RFC if you are implementing an SNTP client. http://www.faqs.org/rfc/rfc4330.txt
(c) DO NOT synchronize time at the beginning of an hour, or at 00:00 UTC! Select a random minute within the hour for synchronization.
(d) Use an exponentially increasing retry interval. DO NOT keep retrying when the server is unreachable, or you are launching a DDoS attack!
(e) Support the Kiss of Death packet: your client should immediately stop querying a server, cease and desist, once a KoD packet is received.
(f) Make sure the client will stop requesting the builtin list of servers, once an alternative server is set by the user.
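For anyone who really does have to roll their own client, here is a minimal Python sketch of the scheduling side of the rules above (pool hostnames, random sync time, exponential backoff, KoD, user-supplied servers). It is only an illustration: the function and class names, the 64-second base and the one-day cap are my own assumptions, not values from any RFC or from ntpd, and "debian" is used below only because it is the example prefix mentioned above.

```python
import random

POOL_ZONES = ("0", "1", "2")


def pool_hostnames(vendor=None):
    """Hostnames to resolve through DNS at every sync; never store the resolved IPs.

    `vendor` stands for a vendor prefix granted by the pool operators.
    """
    suffix = f"{vendor}.pool.ntp.org" if vendor else "pool.ntp.org"
    return [f"{zone}.{suffix}" for zone in POOL_ZONES]


def sync_offset_seconds(rng=random):
    """Rule (c): pick a random second within the hour instead of firing at :00."""
    return rng.randrange(3600)


def retry_delay(failures, base=64.0, cap=86400.0, rng=random):
    """Rule (d): the delay doubles with each consecutive failure, up to `cap`.

    A random factor spreads out clients that all failed at the same moment.
    `base` and `cap` are arbitrary values chosen for this sketch.
    """
    delay = min(cap, base * (2 ** min(failures, 20)))
    return delay * rng.uniform(0.5, 1.0)


class ServerList:
    """Tracks which servers the client is still allowed to query."""

    def __init__(self, builtin, user=None):
        # Rule (f): a user-supplied list replaces the built-in one entirely.
        self.active = list(user) if user else list(builtin)

    def kiss_of_death(self, server):
        # Rule (e): stop querying a server as soon as it sends a KoD packet.
        if server in self.active:
            self.active.remove(server)


if __name__ == "__main__":
    print(pool_hostnames())
    print(pool_hostnames("debian"))
    print([round(retry_delay(n)) for n in range(6)])
```

None of this replaces a real NTP daemon; it just shows that being a polite client is a few dozen lines, not a research project.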
These should have been written in all textbooks related to practical networking lectures, but apparently they aren't. People don't even realize that their actions are harmful, and we have all the problems...
The NTP community is a complete tragedy of the commons. Even many government institutions cannot keep up with the abusive traffic, and stopped providing public NTP servers.
Today, if we don't count Microsoft and Apple's NTP, almost all public NTP servers are provided by the volunteers from https://www.pool.ntp.org. By using DNS, it forms an NTP cluster to distribute the load. These people provide time for the entire Internet, and they are the people who withstand all the abuses day by day.
People just assume they are some random super servers that always work, without being responsible for their actions, such as hardcoding IP addresses, writing abusive retry code (without exponential increment of timeout), and making a cronjob that initiates a synchronization exactly at midnight (without randomization), effectively a DDoS.
If a device comes with hardcoded NTP addresses, it usually indicates that its program is poorly written and the manufacturer is irresponsible. Those devices have the worst homebrew NTP implementations on the planet:
1. They send ancient NTPv1 packets, while the latest version is NTPv4.
2. They synchronize their time on the beginning of an hour, effectively making a flooding attack. Another larger flooding attack starts at 00:00 UTC.
3. Their retry interval is around 3 minutes; if they fail to reach the server, they generate even more traffic to it rather than backing off with an exponentially increasing interval.
4. They still try to talk to the default hardcoded servers, even if an alternative server list is set.
5. They don't support the Kiss of Death packet, so nothing can stop them if they go wild.
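To make points 1 and 5 concrete, here is a small Python sketch of the relevant header bytes. As far as I know the layout (leap indicator, version and mode packed into the first byte, stratum in the second, reference ID in bytes 12 to 15) matches the NTPv4 packet format, but the helper names are mine and this is deliberately simplified: it treats any server-mode reply with stratum 0 as a Kiss-o'-Death and sends no timestamps at all.

```python
NTP_PACKET_SIZE = 48
MODE_CLIENT = 3
MODE_SERVER = 4


def build_request(version=4):
    """Build a minimal 48-byte client request for the given NTP version."""
    # First byte: leap indicator (2 bits) = 0, version (3 bits), mode (3 bits).
    first_byte = (0 << 6) | (version << 3) | MODE_CLIENT
    return bytes([first_byte]) + bytes(NTP_PACKET_SIZE - 1)


def parse_header(packet):
    """Extract the few header fields this sketch cares about."""
    if len(packet) < NTP_PACKET_SIZE:
        raise ValueError("packet too short to be NTP")
    first_byte = packet[0]
    return {
        "version": (first_byte >> 3) & 0x7,
        "mode": first_byte & 0x7,
        "stratum": packet[1],
        "reference_id": packet[12:16],
    }


def kiss_code(packet):
    """Return the kiss code (such as 'RATE' or 'DENY') of a KoD reply, else None."""
    header = parse_header(packet)
    if header["mode"] != MODE_SERVER or header["stratum"] != 0:
        return None
    return header["reference_id"].decode("ascii", errors="replace")


if __name__ == "__main__":
    print(hex(build_request()[0]))           # version 4 client request
    print(hex(build_request(version=1)[0]))  # what the ancient clients send
    kod = bytes([0x24, 0]) + bytes(10) + b"RATE" + bytes(32)
    print(kiss_code(kod))
```

A client that checks `kiss_code()` on every reply and backs off or stops when it is not None already behaves better than the devices described above.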
Stratum 0/1 servers are the most vulnerable: they have the highest accuracy, with a reference clock. Although the only acceptable uses of Stratum 1 are passing time downstream or scientific purposes, there are only a handful of them and they are often listed publicly, so those manufacturers spot them and put them in their devices by default.
Stratum 0/1 servers are usually provided by universities, or by unpaid volunteers for the public good of the Internet. If a single server gets hardcoded into those mass-manufactured devices, serious consequences can follow: the volunteer may literally go bankrupt, or the whole institute/school may be knocked off the Internet; and when you go to the manufacturer asking them to pay for the damage they are responsible for, you are threatened by a lawyer from California. The whole Internet community should honor the spirit of self-sacrifice of these NTP volunteers.
The NTP community pool is Stratum 2+, suitable for general use. It has similar problems with abuse: once you're in and become well-known on the net, there's no way out and you keep receiving bad traffic, because some clueless people have hardcoded your IP address or put it in a cache that never expires. Fortunately, given reasonable bandwidth, it is often a negligible issue and safe to ignore. But there are exceptions. One of my NTP servers got DDoSed one day because an ISP cached the IP address for pool.ntp.org with a large TTL, and the IP address happened to be mine! The traffic was 40 Mbps...
In contrast, NTPd has proper rate-limiting mechanisms built in, such as KoD and a sensible polling interval, so blocking NTP does NOT cause more user traffic. What increases is the abuser traffic. The damage caused by a standard NTPd and a silly sysadmin is much less significant, and is negligible compared to the Internet of Scary Things.
By the way, not only hardware devices but also software can contain dangerous NTP code.
As long as manufacturers still write broken code and are unaware of the proper way to use NTP, nothing can be done to solve this issue. Many of those involved in these misuses and abuses are totally unaware of what they are doing. The proper way to use NTP should have been written in all textbooks related to practical networking lectures.
Flawed Routers Flood University of Wisconsin Internet Time Server
Open Letter to D-Link about their NTP vandalism
Recent NTP pool traffic increase
On the other hand, the internet is a much bigger place. Things are orders of magnitude more complex. The feedback loops that made NTP work well in a 1990s university environment are mostly absent. When a problem happened then, I'd see something in the logs or in packet captures, figure out what was happening, and get the responsible person quickly on the phone. That's not even hazily possible these days.
As much as I'd love to think putting a stern warning in textbooks would fix this, I doubt that would matter at all. What we really need is a major increase in observability or traceability. And failing that, what we'll get is common resources getting sliced up so they fit within domains of accountability.
Also, I think NTP needs more publicity. We need people to be aware of it before we could get feedback. The community then can have a watchdog team that spots misuses and publishes alerts.
It would be "time.google.com" and you would need a Google account.
If customers can't update their devices' software (or you can't push a remote update), then you need to get the software right in version 1. This seems to be a foreign concept to software developers nowadays, who are used to the world of endless updates and patches. It takes a different kind of development process and a lot of QA to do it right.
Baking in domains or IPs you don't own always seems like a bad idea.
Do you like the writing style and inclusion of gifs?
I guess I'm so used to useless ads and graphics in blog posts, I honestly, scout's honour, did not notice and have no recollection whatsoever!
A damning statement on the Internet of today perhaps, but it neither enhanced nor detracted from my reading of the article.
I'm personally not a fan, as I feel it takes away from the content of the post, but I'd be curious to see how it impacts the readership metrics.
Also, I used to work in IoT, and you are fortunate with your outcome. So many OEMs are worse.
Is it practical for you to only play the animations on mouse over, and pause them otherwise? I think that might allow us the entertainment value of the gifs, without the annoyance of them looping endlessly while we're trying to read.
Keep writing, it was an excellent post, but only include pictures if they are relevant.
I rarely (~5%) find pictures of any sort improve the understanding, even in "mainstream" news sources.
There are tons of different RTCs you can get. Sometimes you don’t really have an “RTC” as much as you have an RC oscillator and a guess at what its frequency should be. A simple crystal oscillator could be as bad as 200 ppm, or about four minutes per fortnight. RC oscillators are worse.
Decent wristwatches generally have temperature compensated crystal oscillators (TCXO) which can be calibrated at the factory, often by adjusting the counter values periodically (e.g. every N counts, add X to the counter). NTP is better than this, but only if you run it as a daemon.
45 minutes in a year is under 1 part in 10000.
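The ppm arithmetic in this subthread is easy to check. A quick Python sketch (function names are mine, and a year is taken as 365 days):

```python
SECONDS_PER_DAY = 86_400


def drift_seconds(ppm, days):
    """Worst-case accumulated error for a clock that is off by `ppm` parts per million."""
    return ppm * 1e-6 * days * SECONDS_PER_DAY


def drift_ppm(error_seconds, days):
    """Frequency error, in ppm, implied by an observed error over `days`."""
    return error_seconds / (days * SECONDS_PER_DAY) * 1e6


if __name__ == "__main__":
    # A 200 ppm crystal left alone for a fortnight.
    print(drift_seconds(200, 14) / 60, "minutes")
    # 45 minutes of drift over a year, expressed in ppm.
    print(drift_ppm(45 * 60, 365), "ppm")
```

So 200 ppm over 14 days comes to roughly four minutes, and 45 minutes in a year is a bit under 86 ppm, which is indeed below 1 part in 10000 (100 ppm).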
What happened to NetThing’s customers after they ceased trading? Who took over the lighting management of the car parks etc. ?
In conversation with the software eng it was implied that they intended to send someone on site to each of over 500 sites to reimage the devices. That must have cost them way more than £70/month and the way that after ~10 months the number of devices actually went up to over 1,000 suggests they were happy to just keep paying.
The thing is, it was essentially no work. All I did was remove a firewall rule. I had to run NTP anyway for my regular customers. Initially more time was spent just in email back and forth and honestly I was enjoying that.
Because of it being basically no work, I had a moral problem with trying to find the absolute highest amount of money they would bear.
I know that is wrong and it does me no good, but I couldn't get past it.
What did annoy me was their inability to pay bills on time, and time I spent chasing invoices and creating custom late payment paperwork that is never relevant for my usual customers.
That was the main impetus for doubling the rate, and despite me jokingly suggesting that their product was not good enough to be profitable (I have no real data on that either way) I suspect they had much bigger organisational problems to be consistently paying late and ending up insolvent.
Now that I think about it, I really don't know why an NTP catch up script was needed.
Basically VMware time was not reliable. By default, Windows will not catch up unless the clock gets to around 5 minutes off. My script checked every day to see what the drift was and corrected it if it was more than 5 seconds.
The underlying reason for concern was logging - we wanted to make sure that our log times were comparable.
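For the curious, the core of such a check is tiny. This Python sketch is not the original script (which ran on Windows and which I have not seen); it just shows the standard four-timestamp offset estimate plus the 5-second threshold mentioned above, with hypothetical function names:

```python
def clock_offset(t1, t2, t3, t4):
    """Estimate how far the local clock is behind the server, in seconds.

    t1: client send time, t2: server receive time,
    t3: server send time, t4: client receive time.
    A positive result means the server clock is ahead of the local one.
    """
    return ((t2 - t1) + (t3 - t4)) / 2.0


def needs_correction(offset_seconds, threshold=5.0):
    """Only step the clock when the drift exceeds the threshold."""
    return abs(offset_seconds) > threshold


if __name__ == "__main__":
    offset = clock_offset(100.00, 107.05, 107.06, 100.10)
    print(offset, needs_correction(offset))
```

Averaging the two differences cancels out the network delay as long as it is roughly the same in both directions, which is plenty good enough for keeping log timestamps comparable.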
Thanks, great and educational writeup.
This is true. I know a TV station that has a tiny satellite antenna bolted to the outside wall to run its internal NTP for all of its wired and wireless devices because GPS simply doesn't work inside the building.
I'm not sure if that's a problem because of all the electronic equipment, or the construction of the building, or the fact that the building sits almost underneath a 500-foot-tall tower with several 10 to 20 kW transmitting antennae on it.
Got a huge kick out of that.
IMO that's not so much a problem of "modern IoT" as a problem of a modern, managerial-driven society that has led to a proliferation of Ford-model workers at ANY level, not only the lowest one.
People simply can't reason autonomously anymore, at any level, and can't really understand "the big picture" of pretty much anything: just think of the periodic "cry" whenever $FamousFreeServices goes down and the queue of polemics that follows...
I very, very rarely explain what "AWS" is when I'm casually writing about cloud stuff. It's table stakes. You should know, or you aren't gonna appreciate reading it anyway.
You might find yourself pleasantly surprised you’re providing an NTP server in the NTP global DNS pool.
That is, go to google.com or DuckDuckGo.com and search for it...
I mean, open the browser, click on location bar (on the top) and write...
That is, if you are on Windows, click Start menu (which is now 4 rectangles),...
When I need to google a term, I highlight it, and then press ⌘C, ⌘T, ⌘V, enter. (Copy the search term, open a new tab, paste into url/search bar, search term)
I've gotten quite fast at the keyboard sequence; it takes maybe one second total. I imagine I could make this process even faster with a plugin, but I see no need.
I would like to think that most Windows machines would let you be similarly performant by default. But if not, that's further evidence in my book that Windows just sucks...
I will note that some acronyms can be annoyingly un-googlable, as the same one stands for a wide variety of different terms. This problem does not apply to NTP, however, which comes up right away.
The behavior appears to have changed at some point, though, because it now opens searches in a new tab. I'll probably change my behavior now. Thanks for the reminder. ^_^
My point was that the OP can't guess their readers' level of knowledge, and it would be impossible for them to cater to all levels (as my attempt to explain searching... failed to show :-/) If readers don't know what NTP is, they should be able to either ignore the blog post or find the missing bits of their knowledge by themselves.
If this were a project blog post explaining their latest features, I'd agree with you. If the point is outreach, then yes, they should make it accessible. But he's telling a story. A story that requires a relatively deep understanding of the history and practice of operating internet services. Him writing "NTP (Network Time Protocol)" will not make the story much clearer. And if he explains the whole background, then it's no longer a story, it's a general-audience essay. That's a lot of work for you to expect from somebody that you aren't paying.
Obscure but critical way servers set their accurate time, but maintained by even more obscure people with limited recognition and reward.
That said, I went and looked it up after. Because that's what you do.
If you're stepping in on somebody's semi-public journal, it's probably incumbent upon the reader to do their spadework if they care.