The school for sysadmins who can’t timesync good (libertysys.com.au)
112 points by zdw on June 26, 2021 | 31 comments

Archive.org links for the series in case the site goes down due to the HN Hug of Death:

Part 1 - The Problem with NTP: https://web.archive.org/web/20210627035347/https://libertysy...

Part 2 - How NTP Works: https://web.archive.org/web/20210627035910/https://libertysy...

Part 3 - Installation and Configuration: https://web.archive.org/web/20210308233351/https://libertysy...

Part 4 - Monitoring and Troubleshooting: https://web.archive.org/web/20210308233515mp_/https://libert...

Part 5 - Myths, Misconceptions, and Best Practices: https://web.archive.org/web/20210308232954mp_/https://libert...

Oh dear... I've been triggered by that reference to the HP-UX boxes at Queensland police in the early 2000s... yes, it was as bad as can be imagined... no, actually, it was worse.

A large number of them were out-of-date and at their end-of-life. HP was charging a super premium for keeping them in support beyond their normal end-of-life period... some reseller pointed this out as a justification for why it would be cheaper to replace them than to keep them in support. It backfired: QLD police just took them out of support without replacing the hardware. State-level critical infrastructure running on obsolete equipment with no vendor support....

It definitely got better when MOG moved lots of tech people into their own agency. So, so much better. Like, a whole department of people not answerable to anyone but their COO, who had basically no idea how to support agencies.

I inherited an "enterprise environment" to look after that had been set up to talk to on-prem NTP services via VPN, but that had failed over time. Cybersec had closed the route without notice, and the environment eventually drifted out of sync and was completely unable to get updates. It hadn't had any updates for 3 years. There were still other elements of the VPN that could talk to parts of both networks used between the two big agencies it supported. That system was classified as sensitive. Also, the firewall hadn't had a definitions review in 4 years, and a .NET Core alpha release was being used.

Fortunately I was able to nuke the whole thing because of the low number of users.

My adventures in NTP resulted in finding out that different NTP servers handle leap seconds differently. Google's servers gradually stretch the second out over time, which violates the NTP standard of just adding or subtracting the second. So if your setup depends on very accurate time, make sure you know which NTP servers you're using.
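To make the smear concrete, here is a minimal sketch. It assumes a linear smear spread over a 24-hour window centered on the leap second (noon to noon UTC), which is the scheme Google has described publicly; other providers may use a different window or curve. The function name and parameters are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

def smeared_seconds(now, leap, window=timedelta(hours=24)):
    """Seconds of a positive leap second already absorbed at `now`.

    Assumption: linear smear over `window`, centered on `leap`
    (the UTC instant at which a non-smearing server would step).
    """
    start = leap - window / 2
    fraction = (now - start) / window  # timedelta / timedelta gives a float
    # Clamp: nothing absorbed before the window, the full second after it.
    return min(1.0, max(0.0, fraction))

leap = datetime(2017, 1, 1, tzinfo=timezone.utc)
print(smeared_seconds(leap - timedelta(hours=6), leap))
```

Under these assumptions a smeared clock can be up to 0.5 s away from a server that steps the whole second at midnight, which is why mixing smearing and non-smearing sources in one configuration is risky.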

It’s a good read, and I really wish time was always that simple. Sometimes you have multiple platforms on different clock systems and distributed systems running across them. Did you know Google and Amazon smear leap seconds? GPS/Galileo/that-China-one don’t have leap seconds (but all are different TAI offsets) but that Russian GPS one does have leaps. They all have slightly different versions of utc.

Most of the time none of that matters and you can just install chrony and point it to whatever.pool.ntp.org and you’re off to the races. But boy does it suck when you have to know.

> GPS/Galileo/that-China-one don’t have leap seconds (but all are different TAI offsets) but that Russian GPS one does have leaps. They all have slightly different versions of utc.

This is not wrong, but it's missing a large chunk of information. All non-UTC-based geonavigation satellites also broadcast both the offset between their internal time and UTC and whether there's an impending leap second.
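As a rough illustration of how a receiver uses that broadcast offset, here is a minimal sketch for GPS. The function and parameter names are hypothetical. The GPS-minus-UTC offset is passed in rather than hard-coded because a real receiver reads it from the navigation message (it has been 18 s since the start of 2017). The sketch ignores week-number rollover and cannot represent the leap second itself (23:59:60).

```python
from datetime import datetime, timedelta, timezone

# GPS time started at 1980-01-06 00:00:00 UTC and has not had leap seconds since.
GPS_EPOCH = datetime(1980, 1, 6, tzinfo=timezone.utc)

def gps_to_utc(week, seconds_of_week, gps_minus_utc):
    """Convert a full GPS week count + seconds-of-week to UTC.

    `gps_minus_utc` is the leap-second offset in seconds, as broadcast
    by the satellites (assumed already decoded by the receiver).
    """
    gps_time = GPS_EPOCH + timedelta(weeks=week, seconds=seconds_of_week)
    return gps_time - timedelta(seconds=gps_minus_utc)

print(gps_to_utc(2000, 0, 18).isoformat())
```

The same shape applies to the other constellations, each with its own epoch and offset.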

You can nowadays buy a chip-scale atomic clock for $2000, which is a modest fraction of the price of an entire server. Every datacenter should have one.

Nice comprehensive series, but I couldn't get to page 4 - the site timed out. On the Windows side of things, I'm more familiar with "w32tm" and "net time." My time sync post has the highest number of views on my site, from people googling "how to set time clock on domain" so their cell phones match their computers at work. It would be interesting to see how the Windows protocols differ from *nix.

The Win32 daemon only provides coarse time adjustment. Basically doing an ntpdate to adjust the clocks once per day or so. Good enough for domain logins, but a couple orders of magnitude worse than the regular NTP protocol.

Of course on Linux most of the arcane details of the ntp daemon aren't relevant because most distros end up running SystemD with timesyncd instead. I discovered this when all of my T1 time sources (GPS receivers) stopped working after an update. As usual you can disable the systemd bit, but it doesn't like it.

> most distros end up running SystemD with timesyncd instead

which tends to be "good enough" for the average desktop user, but anything even vaguely server-ish should run a full NTP implementation such as chrony and not an SNTP one such as systemd-timesyncd.

a few years ago at $dayjob we had a fleet of CoreOS hosts. CoreOS, at the time, defaulted to systemd-timesyncd using pool.ntp.org addresses.

our CoreOS hosts, obviously, ran Docker containers.

systemd has a neat "feature" where if your network configuration changes, it'll trigger a time synchronization through timesyncd.

when a new Docker container was started, this counted as a "network config change" and caused a time synchronization.

by itself, this isn't too bad. it caused time syncs to happen more often than they need to, strictly speaking, but shouldn't have caused any further problems.

except...enter "falsetickers". hosts in the NTP pool are run by volunteers. an individual host in the pool may have the incorrect time.

the infrastructure for the NTP pool has monitoring for this, and will kick a host out of the DNS rotation if it's wrong. except this won't happen immediately - there'll always be some lag between when the host starts being wrong and when the monitoring system kicks it out.

and if your hosts are synchronizing their time more often than necessary, it increases the chance they'll do a time sync in one of these small windows where a falseticker is being advertised by the pool.

a full NTP implementation is specifically designed to handle this, of course. a client polls multiple servers, and will discard significant outliers.

SNTP? not so much. I haven't looked at timesyncd to see if it's improved since then, but at the time it would pick one of the [0-3].pool.ntp.org hosts at random, send it one NTP packet, and then jump the time to that response.

...and that's the story of how some of my company's production hosts would have their system time autonomously jump to be 5-10 minutes fast, maintain that time for several minutes to an hour, and then jump back to the correct time, all without human intervention.
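The difference between trusting one sample and polling several servers can be sketched as follows. This is a deliberately simplified stand-in (median plus a tolerance), not the actual selection and clustering algorithm that ntpd or chrony implement; the function name, the tolerance value, and the sample offsets are all made up for illustration.

```python
from statistics import mean, median

def combine_offsets(offsets, tolerance=0.1):
    """Return (estimate, rejected) from per-server clock offsets in seconds.

    Simplified outlier rejection: anything further than `tolerance`
    from the median is treated as a falseticker and ignored.
    """
    if len(offsets) < 3:
        # With only one or two sources there is no majority to outvote a bad one.
        raise ValueError("need at least 3 servers to outvote a falseticker")
    mid = median(offsets)
    kept = [o for o in offsets if abs(o - mid) <= tolerance]
    rejected = [o for o in offsets if abs(o - mid) > tolerance]
    return mean(kept), rejected

# Three sane servers and one that is 7 minutes fast.
samples = [0.004, -0.002, 0.001, 420.0]
estimate, rejected = combine_offsets(samples)
print(estimate, rejected)
```

A client that picks a single server at random and steps to its answer has a one-in-four chance here of jumping 420 seconds, while the combined estimate stays within a few milliseconds.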

I cheat and have an authoritative NTP server locally and then override dns for pool.ntp.org and friends.

Then at least if I’m off we’re all off together.

> I cheat and have an authoritative NTP server locally and then override dns for pool.ntp.org and friends.

Generally if you've made the effort to have internal recursive DNS server(s) for your network, then just enable NTPd or chrony as well and have a single source of Time Truth for your network.

Point to ≥4 NTP servers, even using pool.ntp.org, and you probably don't have to worry about false ticker(s) either.

For bonus points, hook up a GPS with a PPS output to the local one so it's stratum 1.

I do this, with all the trimmings (running in kernel space, PTP simulation, etc). I appreciate that a good estimation of the time inside the non-deterministic OS is being made, but I haven't quite wrapped my head around what it means to extract the time from that non-deterministic OS.

How big is that unmeasured error?

I do this at home with a Pi, it was a fun project.

SNTP is something else (a text based protocol IIRC) than ntpdate-style oneshot point in time sync, which, while still vulnerable to hitting one server and getting the wrong time, uses the ntp (binary IIRC) protocol to do so.

> The Win32 daemon only provides coarse time adjustment.

True in XP (it was a crappy SNTP implementation), but it was overhauled significantly in Windows 10/Server 2016 and above because of Azure requirements. It can now guarantee accuracy within 1 second at all times, and even better when the NTP server is local (https://docs.microsoft.com/en-us/windows-server/networking/w...)

That’s still absolutely rubbish.

> That’s still absolutely rubbish.

Knowing how accurate the timers within Windows are, this is actually not "rubbish" as you say, at least relative to Windows. Windows is designed for general computing, not for superprecise timings. Use Linux for that use case rather than just shoveling NTPD onto Windows (spoiler: NTPD uses the Media timers in Windows; those are definitely not designed for that use case, and there are too many applications that break if they are forced onto the high-precision timers).

P.S. Linux can hold precise timings, but there are certain configurations that will break this assumption. Double check if this is important to you.

I think you can also disable w32time and install an NTP client and point it to your preferred NTP server(s).

The title makes it seem as if there's some architectural thing people don't understand. It's a good extensive article, but it's too much information for most sysadmins.

If you really need high precision time synchronization, for example when triangulating signals on different machines, you should look at ptpd (https://github.com/ptpd/ptpd).

What's with our current confusion between adjectives and adverbs? It seems to be getting more prominent everywhere. I don't think it's an influence from EFL like "learnings".

Hah, that'll teach me! But people are using this phrasing more commonly, thinking different and so on..

"who can’t timesync good"? It's really poor writing.

It's a riff on the "Derek Zoolander Center For Kids Who Can't Read Good And Wanna Learn To Do Other Stuff Good Too" from the comedy Zoolander.

Hrm, couldn't get to page 3. Guess he needs a school for scaling static content delivery.

works ok for me, what error do you get?

Just hung, never loaded.

Be nice to my server - try going through cloudfront:

1. https://d38if4m2in2lkc.cloudfront.net/2016/09/the-school-for...

2. https://d38if4m2in2lkc.cloudfront.net/2016/10/the-school-for...

3. https://d38if4m2in2lkc.cloudfront.net/2016/10/the-school-for...

4. https://d38if4m2in2lkc.cloudfront.net/2016/10/the-school-for...

5. https://d38if4m2in2lkc.cloudfront.net/2016/12/the-school-for...

I probably won't read (m)any of the comments below, but if I had to pick the "do not miss" parts, they would be https://d38if4m2in2lkc.cloudfront.net/2016/10/the-school-for... (It really isn't a typical consensus algorithm.) and https://d38if4m2in2lkc.cloudfront.net/2016/12/the-school-for... (Them: "1 NTP peer is better than 2"; me: "Don't make me come down there"), but really, you should just go read https://tools.ietf.org/html/rfc8633

At least it's not time syncing for ants
