The uncontrollable and unobservable System Management Mode (SMM) interrupts of most modern x86 CPUs add enough jitter to the reference clock sampling that no software configuration can eliminate it.
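If you want to observe that jitter directly, one crude approach is to spin on the TSC and flag gaps too large to be your own code. A minimal sketch, assuming an x86 CPU with an invariant TSC and the thread pinned to an otherwise idle core (the threshold is illustrative, not a calibrated value):

```c
/* Crude SMI/stall detector: spin reading the TSC and report large gaps.
 * Assumes x86 with an invariant TSC; pin this to an isolated core
 * (e.g. taskset + isolcpus) or scheduler preemption will dominate. */
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

int main(void)
{
    const uint64_t threshold = 100000; /* ~tens of us at a few GHz; tune per CPU */
    uint64_t prev = __rdtsc();

    for (;;) {
        uint64_t now = __rdtsc();
        if (now - prev > threshold)
            printf("stall: %llu cycles -- possible SMI or preemption\n",
                   (unsigned long long)(now - prev));
        prev = now;
    }
}
```

Gaps that persist with interrupts steered away and the core isolated are a reasonable proxy for SMM activity.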
You generally need an FPGA or a suitable DSP/microcontroller to achieve precision beyond roughly tens of nanoseconds. That level is entirely achievable even with NTP, provided the hosts have low-contention interfaces running 10 Gbps or more per transceiver lane and properly configured DCB/QoS, so the time synchronization packets always egress without queuing delay. PTP can, of course, achieve even better precision.
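For reference, this is roughly what asking the kernel for NIC hardware receive timestamps looks like on Linux, using the SO_TIMESTAMPING socket option. A sketch under the assumption that the NIC/driver supports hardware timestamping and that it has already been enabled via the SIOCSHWTSTAMP ioctl (e.g. with linuxptp's hwstamp_ctl; chrony and ptp4l handle this themselves):

```c
/* Receive one UDP packet and print its NIC hardware timestamp.
 * Assumes Linux with a NIC/driver that supports hardware RX timestamping,
 * already enabled via SIOCSHWTSTAMP. Error handling kept minimal. */
#include <stdio.h>
#include <time.h>
#include <sys/uio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/net_tstamp.h>  /* SOF_TIMESTAMPING_* flags */
#include <linux/errqueue.h>    /* struct scm_timestamping */

#ifndef SCM_TIMESTAMPING
#define SCM_TIMESTAMPING SO_TIMESTAMPING
#endif

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port = htons(123) }; /* NTP */
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));

    int flags = SOF_TIMESTAMPING_RX_HARDWARE | SOF_TIMESTAMPING_RAW_HARDWARE;
    setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &flags, sizeof(flags));

    char pkt[1500], ctrl[512];
    struct iovec iov = { pkt, sizeof(pkt) };
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = ctrl, .msg_controllen = sizeof(ctrl) };
    if (recvmsg(fd, &msg, 0) < 0)
        return 1;

    for (struct cmsghdr *c = CMSG_FIRSTHDR(&msg); c; c = CMSG_NXTHDR(&msg, c)) {
        if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_TIMESTAMPING) {
            struct scm_timestamping *ts =
                (struct scm_timestamping *)CMSG_DATA(c);
            /* ts[2] holds the raw hardware (PHC) timestamp; ts[0] is software */
            printf("hw rx: %lld.%09ld\n",
                   (long long)ts->ts[2].tv_sec, ts->ts[2].tv_nsec);
        }
    }
    return 0;
}
```

The point is that the timestamp is taken by the NIC at the wire, so none of the host-side jitter above ever touches it.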
Round-trip time and transmission time ("respond within a couple of microseconds") are irrelevant: a rising- or falling-edge feature of the packet burst is what timestamps each sample, and the interval between those samples is what trains the clocks. At the extreme, this can be accurate to within picoseconds.
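To make the training step concrete, here is a minimal sketch of the usual approach: a PI servo that steers the local oscillator's frequency from successive edge-timestamped offset samples. The gains and offsets are invented for illustration, and a 1 s sample interval is assumed (so an offset change of N ns per sample corresponds to a drift of N ppb):

```c
/* Toy PI clock servo: converts edge-timestamped offset samples into
 * frequency corrections. Gains and sample data are illustrative only. */
#include <stdio.h>

int main(void)
{
    /* offset[i] = reference edge timestamp - local edge timestamp, in ns,
       taken once per second */
    double offset_ns[] = { 850.0, 790.0, 700.0, 640.0, 560.0, 510.0 };
    int n = sizeof(offset_ns) / sizeof(offset_ns[0]);
    double kp = 0.7, ki = 0.3;   /* assumed gains, not tuned values */
    double integral = 0.0;

    for (int i = 0; i < n; i++) {
        integral += offset_ns[i];
        /* At 1 sample/s, 1 ns of offset per second equals 1 ppb of drift,
           so the servo output maps directly to a frequency adjustment. */
        double freq_adj_ppb = kp * offset_ns[i] + ki * integral;
        printf("sample %d: offset %5.0f ns -> steer %8.1f ppb\n",
               i, offset_ns[i], freq_adj_ppb);
    }
    return 0;
}
```

linuxptp's default servo is a PI controller of this general shape; chrony uses a regression-based estimator instead, but the idea of steering frequency from timestamped offsets, rather than stepping phase, is the same.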
It will offer an incremental improvement in the accuracy of the server's derived clock when the reference clock is not itself affected by SMM and the other sources of jitter described above.
Beyond the accuracy, the hardware is significantly cheaper and more efficient than a server with a general-purpose x86 CPU.
For a dedicated network reference clock, an FPGA or ASIC solution is simply better in every measurable way.
It is more complex, to be sure, but the complexity needn't be your concern.