I have measured it several times in various places with fairly consistent result...

I have measured it several times in various places with fairly consistent results. Of course, if you are on a platform which doesn't offer VDSO for your clock, or which disables or virtualizes `rdtsc` then the results could be much longer.

One of the places I measure it is in uarch-bench [1], where running `uarch-bench --clock-overhead` produces this output:

    ----- Clock Stats --------
                                                      Resolution (ns)               Runtime (ns)
                           Name                        min/  med/  avg/  max         min/  med/  avg/  max
                     StdClockAdapt<system_clock>      25.0/ 27.0/ 27.0/ 29.0        27.1/ 27.4/ 27.6/ 30.6
                     StdClockAdapt<steady_clock>      25.0/ 26.0/ 26.9/ 94.0        27.0/ 27.0/ 27.1/ 32.6
            StdClockAdapt<high_resolution_clock>      26.0/ 27.0/ 27.0/ 28.0        27.1/ 27.5/ 27.7/ 30.0
                  GettimeAdapter<CLOCK_REALTIME>      25.0/ 26.0/ 25.7/ 27.0        25.1/ 25.5/ 25.6/ 48.3
           GettimeAdapter<CLOCK_REALTIME_COARSE>       0.0/  0.0/  0.0/  0.0         7.2/  7.3/  7.3/  7.3
                 GettimeAdapter<CLOCK_MONOTONIC>      24.0/ 25.0/ 25.5/ 27.0        24.7/ 24.7/ 24.9/ 27.2
          GettimeAdapter<CLOCK_MONOTONIC_COARSE>       0.0/  0.0/  0.0/  0.0         7.0/  7.2/  7.2/  7.3
             GettimeAdapter<CLOCK_MONOTONIC_RAW>     355.0/358.0/357.8/361.0       357.4/358.2/358.1/360.5
        GettimeAdapter<CLOCK_PROCESS_CPUTIME_ID>     432.0/437.0/436.4/440.0       434.7/436.0/436.2/440.9
         GettimeAdapter<CLOCK_THREAD_CPUTIME_ID>     422.0/426.0/426.1/431.0       424.6/427.1/427.2/430.4
                  GettimeAdapter<CLOCK_BOOTTIME>     363.0/365.0/365.3/368.0       364.2/364.5/364.7/367.7
                                       DumbClock       0.0/  0.0/  0.0/  0.0         0.0/  0.0/  0.0/  0.0

The Runtime column shows the cost. Ignoring DumbClock (which is a dummy inline implementation returning constant zero), note that the clocks basically group themselves into 3 groups: around 7 ns, 25-27 ns and 300-400 ns.

The 7 ns group are those that are implemented just by reading a shared memory location, and don't need any rdtsc call at all. The downside, of course, is that this location is only updated periodically (usually during the scheduler tick), so the resolution is limited.

The 25ish ns group are those that are implemented in the VDSO - they need to do an rdtsc call, which is maybe half the time, and then do some math to turn this into a usable time. Note that CLOCK_REALTIME falls into this group on my system.

The 300+ ns group are those that need a system call. This used to be ~100 ns until Spectre and Meltdown mitigations happened. Some of these cannot easily be implemented in VDSO (e.g., those that return process-specific data), and some could be, but simply haven't.

For what it's worth, I wasn't able to reproduce your results from the SO question. Using your own test program (only modified to print the time per call), running it with no sleep and 10000 loops gives:

    $ ./clockt 0 10 10000
    init run 15256
    trial 0 took 659834 (65 cycles per call)
    trial 1 took 659674 (65 cycles per call)
    trial 2 took 659578 (65 cycles per call)
    trial 3 took 659550 (65 cycles per call)
    trial 4 took 659548 (65 cycles per call)
    trial 5 took 659556 (65 cycles per call)
    trial 6 took 659552 (65 cycles per call)
    trial 7 took 659556 (65 cycles per call)
    trial 8 took 659546 (65 cycles per call)
    trial 9 took 659544 (65 cycles per call)

On my 2.6 GHz system, 65 cycles corresponds to 25 ns, so those results are exactly consistent with the uarch-bench results shown above. So either your system is weird, or you weren't running enough loops, or ... I'm not sure.

[1] https://github.com/travisdowns/uarch-bench