For less demanding users, a Raspberry Pi and a cheap GPS module can do the trick[1]. I get less than +/-300us in error according to chronyc on the clients.
Note that if you get one of these GPS modules, some do not expose the PPS pin on the headers, so it might require some board modification (they use it just to drive an LED). I got one like this[2] that has it exposed.
Also note that the small antennas only work outside; the slightly larger square ones work inside, but only very near a window. I got an active antenna[3] as I wanted the receiver further inside the room.
And finally, the NEO-6M module I linked to is quite old; the newer NEO-7M and NEO-8M lock on faster, etc., but for me this was sufficient.
Oh and I had to disable serial echo[4], almost forgot about that.
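For anyone replicating this, a minimal chrony configuration for such a setup might look like the following. This is a sketch: it assumes gpsd is publishing NMEA time on shared-memory segment 0 and that the kernel PPS device is /dev/pps0; the offset compensating for NMEA serial delay is something you calibrate for your own module.

```
# /etc/chrony/chrony.conf (excerpt)
# Coarse time-of-day from the NMEA sentences via gpsd's SHM segment;
# the offset compensates for serial/parsing delay (calibrate this).
refclock SHM 0 refid NMEA offset 0.2 delay 0.5
# Precise second edge from the PPS pin, locked to the NMEA source
# for numbering the whole seconds.
refclock PPS /dev/pps0 lock NMEA refid PPS
```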
That reminded me of Mitxela's GPS clock[0]. Using GPS modules to get an incredibly accurate time is really interesting, I need to find some time to make one of these.
There exists an open source NTP server project by Netnod in Sweden.
The servers have been up and running, providing NTP (including NTP AUTH) services since about 2015. The FPGA-based platform has also been providing experimental NTS services since about last summer.
> Microprocessor based Network Time Protocol (NTP) servers suffer from a large amount of timestamp jitter, due to the hardware and Operating System (OS) being shared among other applications.
So instead of configuring Linux correctly and dedicating a core to your NTP server, you have created custom hardware. Congratulations, now you have two problems.
I have been working on algorithmic trading and it is not that hard to reliably (like 100% of the time) respond within a couple of microseconds.
You will never notice it, though, because your network will introduce way more variability. This is especially the case with NTP, where usually a single box is shared between a large number of servers throughout your network.
> I have been working on algorithmic trading and it is not that hard to reliably (like 100% of the time) respond within a couple of microseconds.
As someone who works in the flight test instrumentation industry (that primarily uses PTP rather than NTP), my first thought was "is a couple of microseconds supposed to be good?". With PTP, I usually achieve RMS offsets of < 50 nanoseconds over gigabit Ethernet.
I sort of agree with your point in that it's probably pointless implementing NTP inside of an FPGA, but I'd like to extend it further and say that if you really care about accurate timing, you should be using PTP rather than NTP anyway. In that case, an FPGA solution makes much more sense. However, in the systems I've worked with, the FPGA only implements a few pieces of core functionality - like the frame time stamping and the numerically-frequency-controlled clock. The actual protocol is usually implemented in software, except in the case of transparent switches.
That said, most modern NICs have PTP hardware, but you still usually need some external logic to actually use these synchronized clocks.
On a "normal" box, the time is usually a combination of various sources, none of which was really designed to provide accurate, sub-microsecond absolute time. (Except possibly for this "PTP hardware" in NICs that I don't know anything about.)
So even if you somehow devise a piece of hardware to pass the accurate time to the operating system, there would be no way to track it accurately.
Now, I think that really accurate absolute timing on machines isn't all that critical. For example, for consistency protocols it is usually enough to guarantee being within 1s of true time.
For algorithmic trading we were mostly interested in responding as quickly as possible to the event that came from the stock exchange. The packets always came with a known delay and the timing wasn't all that important except for debugging. Everything was done on a single box (so no need to synchronize with anything else) and even on that box the events were processed such that there was no need to coordinate with other threads, each thread just consumed, processed and published results.
It's just about inserting the exact time the frame will be sent into a sent frame, and noting the exact time the frame is received. All that's left is a small amount of clock jitter and cable delays.
High end switches support PTP themselves so you don't need to worry about queueing delays, but you don't need a switch that supports PTP to get high resolution, low jitter, low uncertainty time-- at least on reasonably sized networks that are usually not saturated.
There are many algorithmic trading applications that rely upon high quality time derived from a single coherent source, along with telecommunications, control systems, instrumentation, etc. And there are other distributed systems approaches where true ordering is nice: yes, you can provide ordering within a system of dependent events using a Lamport clock, vector clock, etc, but with high quality time you can also correctly sequence causally related events originating outside your tightly coupled system in many use cases.
That said: PTP relies on a tightly coupled master and realistically is intended for a tightly coupled, hierarchical system. NTP is a better "internet" time protocol -- lots of logic for clock precedence, averaging of multiple sources, slower control system, etc.
It could be made to work fairly well with hardware / driver support. Timing information of frames is known rather precisely, and delay spread is <100ns in most environments.
Heh, as soon as I read this I thought "algo trader?". If you don't mind me asking: outside of trading, how often do you see systems seriously manage jitter (specifically I'm thinking about interrupt management)? I'm in HPC, and whilst some such management is present, it seems seldom thought about.
Heh, I thought somebody would point this out.
See, they decided to create custom hardware just to do NTP. How is instead dedicating a single core a worse solution? Setting aside a dedicated core is basically a configuration detail.
There are no gains from custom hardware as network variability would mask them. If you want really good time guarantee because having right time is critical for your application, you put atomic clocks in your servers as one well known company does.
Also, it is customary to have dedicated machines for various functions. You would not normally want to mix some different types of loads that conflict with each other, for example high security with low security, high throughput with real time, etc.
You could just configure a single machine dedicated to NTP (which is customary), but instead of sharing cores between the OS and NTP (and introducing jitter), you can have a separate core for the OS and a separate one for NTP, so that NTP can work undisturbed.
So if you maintain an NTP instance, have a problem with jitter, and read this article, my hint is: don't set up a custom FPGA to fix your problem. Just dedicate a single core on a machine you already have and you will get results about as good as you can get.
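To make the "dedicate a core" suggestion concrete, here is a sketch of what that configuration could look like on Linux. The core number (3) and the chronyd service name are illustrative; adapt them to your machine and daemon.

```
# Kernel command line: keep core 3 away from the general scheduler,
# disable its periodic tick, and move RCU callbacks off it.
isolcpus=3 nohz_full=3 rcu_nocbs=3

# Pin the NTP daemon to the isolated core via a systemd drop-in, e.g.
# /etc/systemd/system/chronyd.service.d/affinity.conf
[Service]
CPUAffinity=3
```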
> How is instead dedicating a single core a worse solution?
It is pretty much guaranteed to perform worse, if that's what you mean. Is it worth the effort and extra complexity? Probably not.
The timestamping jitter with an FPGA would probably be on the order of tens of nanoseconds while we're talking microseconds with a software solution.
> There are no gains from custom hardware as network variability would mask them.
I _think_ multiple cascaded jitters in a system add with the square root of the sum of squares, so there would still be an improvement.
Consider a system where you had 10 microseconds of inherent jitter, then you added another 10 microseconds on top of that. The total is not 20 microseconds - it would be about 14.
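That root-sum-of-squares behaviour is easy to check (a quick sketch, not tied to any particular hardware):

```python
import math

def combined_jitter(*sources_us):
    # Independent jitter sources add in quadrature
    # (root-sum-of-squares), not linearly.
    return math.sqrt(sum(j * j for j in sources_us))

# Two independent 10 us sources combine to ~14.1 us, not 20 us.
print(round(combined_jitter(10, 10), 1))  # 14.1
```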
The precision achievable with FPGAs is well beyond 10s of nanoseconds.
In extreme-accuracy applications, the high-rate transceivers are used with capacitive dividers or internal signal propagation delays defining the reference intervals for time-to-digital conversion; these and similar approaches yield precision at picosecond scale and beyond.
Right, but Ethernet timestamping usually occurs on a symbol or byte level inside the MAC, so for GigE, you're looking at 8 nanoseconds per symbol. This could be improved by oversampling the symbols, but at least with GigE, there's (I think) 4 cycles of uncertainty each time the link comes up due to how the clocking agreement between master/slave is established. It's also somewhat difficult to run large counters faster than a few hundred MHz inside of an FPGA. I've worked with systems that use two different clocks - one small counter for Ethernet timestamping, and another main clock, with a periodic latching between the two to correlate them, but it's uncommon.
There's also the fact that most PHYs operate by having a FIFO with data being clocked in and clocked out on different clock domains. The way I understand it, the PHY FIFO fills up half way, and only then does it start draining into the MAC. This is to allow for oscillator frequency errors between each device. This is also why there's a maximum frame size and max oscillator tolerance specified for Ethernet - so that this FIFO doesn't under or overrun before the entire frame is received.
I think SyncE can improve this quite a bit.
Of course, FPGAs themselves are capable of much tighter timings, but in practice, it's usually implemented this way.
The approaches I am referring to are indeed oversampled and occur in a "bump in the wire" or on a parallel data path.
Moreover, the oversampling occurs at a much higher frequency than the maximum counter frequency of the FPGA -- each clock period is divided into a fractional offset from the rising edge by the capacitive divider or delay line.
It is not uncommon at all for dedicated PTP/NTP hardware to implement synchronization and timestamping in this way, even commodity NICs supporting hardware timestamping often implement it on at least a parallel data path.
> Oh I wasn't meaning specifically in the context of NTP, just generally.
So you decided to just answer with a generalization without regard for the particular case of NTP?
Are you frequently building custom hardware when default OS settings do not suit your needs?
Linux has a lot of configuration potential. It is wise to explore and understand various knobs available before you decide to complicate your life and build something very complex that could easily be replaced with a one line shell script.
I was just asking about your thoughts of how outside of trading jitter is seldom considered. I appreciate it's off-topic from the actual HN post but I was curious given your experience in trading.
The uncontrollable and unobservable SMM interrupts of most modern CPUs add sufficient jitter to the reference clock sampling that there is no correct configuration.
You generally must use an FPGA or appropriate DSP/microcontroller to achieve precision beyond ~10s of nanoseconds, which is entirely achievable even with NTP between hosts that have low contention 10+Gbps per transceiver lane interfaces / properly configured DCB/QoS so the time synchronization packets always egress without delay. PTP can, of course, achieve even better precision.
Round trip time / transmission time ("respond within a couple of microseconds") are irrelevant; it's the rising or falling edge of the packet burst that timestamps the sample, and the interval between these samples that's used to train the clocks. This can be accurate to within femtoseconds at the extreme.
It's going to offer an incremental improvement in accuracy for the server's derived clock when the reference clock is not affected by that and other sources of jitter.
Beyond the accuracy, the hardware is significantly cheaper and more efficient than a server with a general purpose x86 CPU.
For a dedicated network reference clock, an FPGA or ASIC solution is simply better in every measurable way.
It is more complex, to be sure, but the complexity needn't be your concern.
Yes, as we inch towards the mid-21st century, doing things fast is the unremarkable part. Doing things fast and synchronously over the network is the trickier one.
Interesting project, especially to gain insights into the FPGA programming part.
I think it is essentially a mixture of PTP and NTP now.
I guess this will work within the same local network, as the major inaccuracy of NTP comes from the asymmetric path delays at the network layer over the Internet.
PTP solves this by incorporating these hardware timestamps exactly. But this works only within the same LAN.
The main difference between PTP and NTP is that PTP relies on hardware support in switches and routers. Those are not cheap. If they had the same support for NTP, it would perform as well.
A highly accurate stratum-1 NTP server can be built with a common computer NIC. No need to mess with FPGAs (unless that's your thing). The Intel I210 is about $50. It has a PPS input and output. With some calibration, the timestamping can be accurate to a few tens of nanoseconds.
NTP can work very well between directly connected NICs. But without hardware support in the switches/routers, that accuracy degrades quickly in the network. A single switch can easily add hundreds of nanoseconds worth of jitter and tens of nanoseconds worth of asymmetry.
Yes, NICs with support for hardware timestamping are common (it's typically in the MAC, not PHY), but switches that have a good support for PTP, either as a boundary clock, or transparent clock, are not cheap. At least I have not seen one yet. Do you have any examples?
Some switches support NTP as a server and client (equivalent to the PTP boundary clock), but there don't seem to be any using hardware timestamping. It's just the classic ntpd using software timestamps, good to a few tens of microseconds at best.
And yes, NTP could definitely perform as well as PTP if the switches had proper support. In my tests with directly connected NICs, the synchronization is stable to a few nanoseconds, same as with PTP. At the protocol level, they use the same timestamps.
Probably all current Cisco offerings? Ubiquiti industrial switches? A whole crowd of second tier vendors like Lantech or Korenix? These are just those I had direct experience with.
Of these, Cisco definitely does boundary clock on L3, on at least several models of their routers.
> In my tests with directly connected NICs, the synchronization is stable to a few nanoseconds, same as with PTP.
Yes, network sync is a piece of cake if you drop the whole network bit. That said, I am slightly skeptical about ns-level precision with NTP. Did you measure the synchronization between the two devices with a scope?
I personally enjoyed the challenge of setting up PTP at home. Why would a hacker scoff at nanosecond-level timekeeping -- isn’t the entire internet a “telco/enterprise” thing?
To do PTP "right" requires every switch to support it and a NIC with hardware timestamps. Also, I've seen claims that PTP is no more precise than a good implementation of NTP.
PTP gives <1us synchronization. From my testing, NTP is ~20-60us after about 10 minutes of sync, but it intentionally drifts the phase around. On average, NTP is pretty close.
If you look at the white rabbit FPGA PTP updates, it's in the ns range.
Any kind of GPS + most Intel NICs will get you PTP with an accurate clock. If you didn't need to sync too many devices, you could use a single system with a bunch of NICs as your "switch".
This post didn’t sound right to me, but I realized that my raspi4 GPS NTP server has been running ntpd and not chrony. Chrony is better at modeling non-deterministic timing behavior, so I swapped to that.
It’s been ten minutes now and chronyc tracking has been marching the offset down. It’s sub 1 us at this point.
System time : 0.000000123 seconds fast of NTP time
Last offset : +0.000000366 seconds
How to get this precise time out of a non-deterministic OS? Beats me. Once I figure that out I can finish my clock project.
My best lead is to step through the different Python timing and scheduler implementations and see which has the lowest jitter relative to the PPS on an oscilloscope.
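A crude first pass at that comparison, before involving the scope, is just measuring how late time.sleep() wakes up relative to its target (a sketch; absolute numbers vary wildly with kernel config and load):

```python
import time

def measure_sleep_jitter(period_s=0.005, samples=100):
    """Measure how late time.sleep() wakes up relative to its target.

    Returns (mean, worst) lateness in microseconds -- a rough proxy
    for scheduler jitter, not a substitute for checking against the
    PPS on a scope.
    """
    lateness_us = []
    target = time.monotonic() + period_s
    for _ in range(samples):
        time.sleep(max(0.0, target - time.monotonic()))
        lateness_us.append((time.monotonic() - target) * 1e6)
        target += period_s
    return sum(lateness_us) / len(lateness_us), max(lateness_us)

mean_us, worst_us = measure_sleep_jitter()
print(f"mean lateness {mean_us:.1f} us, worst {worst_us:.1f} us")
```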
Assuming you're using a PPS signal and a kernel driver, presumably there's an interrupt handler or perhaps a capture timer peripheral that is capturing a hardware timer when the PPS edge occurs. It doesn't matter too much when the userspace code gets around to adjusting the hardware timer as long as it can compute the difference between when the PPS edge came in and when it should have come in. The Linux API for fine tuning the system time works in deltas rather than absolute timestamps, so it is once again fairly immune to userspace scheduling jitter.
Even good hardware oscillators can have a fair amount of drift, say 50us per second (50 ppm), but they tend to be stable over several minutes outside of extreme thermal environments. Therefore, it's pretty easy to estimate and compensate for drift using a PPS signal as a reference. Presumably, that compensation is part of what takes a while for the time daemon to converge on.
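The drift estimate itself is simple once you have the PPS: time successive PPS edges with the local clock, and any systematic deviation from exactly 1 s is the frequency error. A sketch, with made-up numbers:

```python
def estimate_drift_ppm(pps_intervals_s):
    """Estimate local oscillator frequency error in ppm, given the
    locally measured durations of successive PPS intervals (each
    interval is exactly 1 s of true time by definition of the
    reference)."""
    mean = sum(pps_intervals_s) / len(pps_intervals_s)
    return (mean - 1.0) * 1e6

# An oscillator running 50 us/s fast measures each true second as
# roughly 1.00005 s of local time, i.e. ~+50 ppm:
print(estimate_drift_ppm([1.00005] * 10))
```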
Additionally, the clock sync daemon likely takes a while to converge because it isn't directly controlling the system time. Rather, it is sending hints to the kernel for it to adjust the time. The kernel decides how best to do that, and it does it in a way that attempts to avoid breaking other userspace programs that are running. For example, it tries to keep system time monotonically increasing. This means that there's relatively low gain in the feedback loop, and so it takes a while to cancel out error.
It's possible for a userspace program to instead explicitly set system time, but that really isn't intended to be used in Linux unless time is more than 0.5 seconds off. The API call to do that is inherently vulnerable to userspace scheduling jitter, but it's fine since 0.5 seconds is orders of magnitude longer than the expected jitter. You get the system time within the ballpark, and then incrementally adjust it until it's perfect.
If you're not using a kernel driver to capture the PPS edge's timestamp, then you're going to have a rougher time. Either you're just going to have to accept the fact that you can't do better than the scheduling jitter (other than assume it averages out), or you're going to have to do something clever/terrible. One idea would be to have your userspace process go to sleep until, say, 1ms before you expect the next PPS edge to come in. Then, go into a tight polling loop until the edge occurs. As long as reading the PPS pin from userspace is non-blocking and your process doesn't get preempted, you should be able to get at least within microseconds. You can poll system time in the same tight loop, allowing you to fairly reliably detect whether the process got preempted or not.
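The sleep-then-spin idea above can be sketched like this. Note that read_pin is a caller-supplied stand-in for whatever non-blocking GPIO read your library offers -- it is hypothetical, not a real API:

```python
import time

def wait_for_pps_edge(read_pin, next_edge_ns, spin_window_ns=1_000_000):
    """Sleep until ~1 ms before the expected PPS edge, then busy-poll.

    read_pin is a caller-supplied non-blocking callable returning the
    current PPS line level (hypothetical; wire it to your GPIO
    library of choice). Returns time.monotonic_ns() at the rising edge.
    """
    # Coarse sleep, leaving a margin larger than typical scheduling
    # jitter; being late here only shortens the spin.
    now = time.monotonic_ns()
    if next_edge_ns - now > spin_window_ns:
        time.sleep((next_edge_ns - now - spin_window_ns) / 1e9)
    # Tight poll for the low->high transition.
    prev = read_pin()
    while True:
        cur = read_pin()
        if cur and not prev:
            return time.monotonic_ns()
        prev = cur
```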
Thank you for the detailed response! The PPS is currently driving a hardware interrupt on the raspberry pi that is read in by kernel mode software. My project is to drive an external display. Normally I would bypass the raspberry pi altogether and connect the PPS signal to the strobe input of the SIPO shift register. The problem is that the PPS signal cannot be trusted to always exist. Using a raspberry pi has a few benefits: setting the timezone based on location, handling leap seconds, and smoothing out inconsistent GPS data. So while opting to use system time to drive the start of second adds error, I think the tradeoff for reliability is worth it.
I have considered adding complexity, such as adding a hardware mux to choose whether to use the GPS PPS signal or the raspberry pi's start-of-second. I should walk before I run though.
If you want to precisely generate a PPS edge in software with less jitter than you can schedule, you can use a PWM peripheral. Wake up a few milliseconds before the PPS edge is due, get the system time, and compute the precise time until the PPS is due. Initialize the PWM peripheral to transition that far into the future, then go back to sleep until a bit after the transition should have happened, and disable the PWM peripheral.
This works because a thread of execution generally knows what time it is with higher precision than it can accurately schedule itself.
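As a sketch of that pattern -- where arm_oneshot_ns() is a hypothetical stand-in for whatever your board's PWM/timer peripheral API actually looks like:

```python
import time

def schedule_precise_edge(arm_oneshot_ns, next_edge_ns,
                          wakeup_margin_ns=3_000_000):
    """Produce an output transition at next_edge_ns with less jitter
    than the process can schedule itself.

    arm_oneshot_ns(delay_ns) is a hypothetical stand-in: it arms a
    hardware one-shot that flips the pin delay_ns from now.
    """
    # Coarse sleep: wake a few ms early; scheduling jitter here is
    # harmless as long as we stay inside the margin.
    now = time.monotonic_ns()
    if next_edge_ns - now > wakeup_margin_ns:
        time.sleep((next_edge_ns - now - wakeup_margin_ns) / 1e9)
    # Reading the clock is far more precise than scheduling, so this
    # delta is accurate even though our wakeup instant wasn't.
    delay_ns = next_edge_ns - time.monotonic_ns()
    arm_oneshot_ns(delay_ns)
    # Sleep past the transition before reconfiguring the peripheral.
    time.sleep(max(0.0, delay_ns + 1_000_000) / 1e9)
```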
I'm not sure I understand how you're using a PPS signal to drive a display, though. Is it an LED segment display? I assume you want it to update once a second, precisely on the edge of each second. Displays generally exist for humans, though, and a human isn't going to perceive a few milliseconds of jitter on a 1Hz update.
Nixie tubes driven by a pair of cascaded HV5122 (driver + shift register). The strobe input is what updates the output registers with the recently shifted in contents. The driver takes 500 ns to turn on and the nixie tubes take about 10 us to fire once the voltage is applied.
I know it's absurd to worry about the last few ms, but it's part of what interests me about the project. The goal is to make The Wall Time as accurate as I can. I could go further with a delay-locked loop fed from measuring nixie tube current. There is room to push down to dozens of nanoseconds of error relative to the PPS source, but I am content with the 10s of microseconds. I can't imagine ever having access to a camera that could capture that amount of error.
Thanks for the tip. Hardware timers are best. I'll likely have to take some measurements to calibrate the computation time of getting the system time and performing the subtraction.
Sounds like fun! For what it's worth, ublox GPS modules and their clones should be configurable to always produce a PPS signal regardless of whether or not they have a satellite fix. The module would probably do a better job than software on a pi could during transient periods without a fix (due to how accurate the oscillators need to be in a GPS module). So, as long as you can trust the GPS module to exist and be powered, you should be able to reliably clock your display update with it. The only reason really to generate your own PPS would be if you want it to work without a GPS module at all, perhaps by NTP or something; you're then of course again looking at only a millisecond or so of accuracy.
I'm using an uputronics GPS/RTC hat that has a u-blox M8 engine. I set it to stationary mode for extra accuracy. I'll have to look into other configuration options.
NTP gets worse if you sync more than two devices across a broader network with other switched traffic, more into low 100s of µs. PTP does not degrade similarly and yes, most of PHYs made since middle of the last decade support it.
> If you look at the white rabbit FPGA PTP updates, it's in the ns range
As I recall, I had even better performance than that, around tens of picoseconds. But I guess the advertised 1 ns is a conservative estimate. The precision is incredible but it's not magic: they squeeze the maximum amount of determinism out of custom hardware and fiber optic links. It is a bit of a pain to set up, as you need to calibrate each link individually every time you change the fiber or the SFP.
> To do PTP "right" requires every switch to support it and a NIC with hardware timestamps.
I agree, but PTP will in fact work over regular commercial switches on a LAN. The problem is that it will introduce jitter if there's other traffic on the network, but as long as the paths from master to slave (terms used in the standard) remain symmetric, you can filter this out and achieve performance almost as good as if you were using PTP transparent switches.
The NIC with hardware timestamps part should be pretty easy if you're already implementing it using an FPGA like this project did - in fact, in a sense that seems to be exactly what they're doing with NTP. Finding switches that support it might be a little harder.
Many, many computers on the internet have an RTC with second resolution and a timer tick resolution in the microseconds. Reference time resolution that's that much higher than your timer tick period is useless.
Jitter is the variation in latency of all sources.
The total latency is: A) the delay in the original packet being sent + B) the latency of the network itself + C) the latency of the NTP server receiving the packet + D) the latency of the process answering the packet + E) the latency of the kernel sending the response + F) the latency of the network itself for the return packet + G) the latency of the client receiving the return packet and giving it an accurate local time mark.
On an uncongested LAN, B & F can be very low in absolute terms and low in variation. This server constrains C, D, & E to basically nothing. So only A & G -- the limitations of the client itself -- remain. How long does it take to receive the frame, get it to system RAM, dispatch an interrupt (which may be coalesced), service the interrupt, and context switch to/wake up the right client process? Alternatively, the network driver can capture a timestamp at the moment the packet is received, eliminating a lot of this variation.
NTP clients assume half of the delay is client->server and half is server->client, for the purpose of computing offset. That is, they subtract half the roundtrip delay off the received time. So reducing delay in C, D, & E further reduces the amount that this guess can be wrong by.
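That on-wire calculation, spelled out (this is the standard NTP formula, nothing implementation-specific):

```python
def ntp_offset_delay(t1, t2, t3, t4):
    """Standard NTP on-wire calculation.

    t1: client transmit, t2: server receive,
    t3: server transmit, t4: client receive (all in seconds).
    The offset formula bakes in the symmetric-path assumption;
    path asymmetry turns directly into offset error.
    """
    offset = ((t2 - t1) + (t3 - t4)) / 2.0
    delay = (t4 - t1) - (t3 - t2)
    return offset, delay

# 1 ms each way, 0.5 ms server turnaround, clocks already in agreement:
offset, delay = ntp_offset_delay(0.0, 0.001, 0.0015, 0.0025)
# offset ~0; delay = 2 ms (round trip minus server processing time)
```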
The board there is overkill -- if you really want something with a Zynq including an Arm Cortex A9 that can run Linux, the Arty Z7 from Digilent is another Zynq board that's essentially a one-stop shop for this: https://store.digilentinc.com/arty-z7-zynq-7000-soc-developm...
As someone who doesn't know much about FPGA boards, what does it mean when you write "From there you can instantiate either Xilinx's Microblaze MCU"?
Also, one thing that is confusing to me is that the source code linked on this page is a .h and a .c file. Is the FPGA on these Zynq boards programmable with C code?
“Soft” means the CPU is added to your overall design as a block and is compiled along with your design to the FPGA bitstream. This way, the CPU ends up being implemented (instantiated) on the FPGA. This approach eats into your overall FPGA resource budget and leads to lower CPU performance due to the FPGA overhead.
The alternative to a soft core is a “hard” CPU core, which simply means that the CPU is included in a separate area of silicon (usually on the same die). The Zynq 7000 is a good example.
[1]: https://n4bfr.com/2020/04/raspberry-pi-with-chrony/2/
[2]: https://www.aliexpress.com/item/4001136384325.html?spm=a2g0s...
[3]: https://www.aliexpress.com/item/33059221782.html?spm=a2g0s.9...
[4]: https://raspberrypi.stackexchange.com/a/104296