> When you check the elapsed time, you subtract the old time from the new time. But if the new time is less than the old time, you get an overflow. Because we save time as an unsigned integer, you get a very large number that will for sure be larger than any timeout value you set, including our default 3 minutes.
If they were using unsigned arithmetic then they would've gotten the expected value. For example, 0 - 0xffffffffffffffff is 1 when using 64-bit types, just as one might expect. Unsigned (modulo) arithmetic works exactly how you'd want in a situation like this; it works until the interval exceeds the range of the type. So either they're describing things wrong, or their code wasn't doing what they thought.
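A minimal sketch of the arithmetic (made-up values; assumes the timestamps are plain uint64_t):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* Pretend the counter wrapped: "old" is near the top of the range and
     * "new" has already wrapped past zero. */
    uint64_t old_ts = 0xfffffffffffffffeULL; /* 2^64 - 2 */
    uint64_t new_ts = 3;

    /* Unsigned subtraction is modulo 2^64, so the elapsed value is still
     * correct as long as the real interval fits in 64 bits. */
    uint64_t elapsed = new_ts - old_ts;

    printf("elapsed = %llu\n", (unsigned long long)elapsed); /* prints 5 */
    return 0;
}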
Of course, in any event this wouldn't be the root of their problem: time stamp counter drift across cores. The Linux kernel has a lot of code to decide whether to use the RDTSC-optimized clock_gettime vDSO implementation or fall back to a syscall. And when it does use the RDTSC method, IIRC it has code to train the time stamps to minimize any cross-core drift and mask out any potential difference (reducing granularity). The moral of the story is to just use clock_gettime+CLOCK_MONOTONIC; at least on Linux it'll use RDTSC in most cases where that's practical.
Sorry if something was explained awkwardly or incorrectly; I'm trying to get the hang of writing posts like this: technical posts that are interesting and fun to read.
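Roughly something like this is usually all you need (a sketch, not anyone's actual code; the 3-minute value just mirrors the default mentioned in the post):

#include <stdint.h>
#include <time.h>

/* Monotonic nanoseconds; on Linux this typically goes through the vDSO,
 * so it usually doesn't cost a full syscall. */
static uint64_t mono_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Has more than timeout_ns elapsed since start_ns? */
static int timed_out(uint64_t start_ns, uint64_t timeout_ns) {
    return mono_ns() - start_ns > timeout_ns;
}

int main(void) {
    uint64_t start = mono_ns();
    /* ... do the work being timed ... */
    if (timed_out(start, 180ULL * 1000000000ULL)) /* 3 minutes */
        return 1;
    return 0;
}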
> If they were using unsigned arithmetic then they would've gotten the expected value.
And we did, but we did not expect to get the input values we got, which are explained by your last paragraph: "time stamp counter drift across cores." This was the real problem, and something new we learned.
> use clock_gettime+CLOCK_MONOTONIC
I forgot to include this because I was focused on the solutions we ended up using, but after some research this is 100% correct! We were afraid that too many syscalls during query execution could produce a huge performance hit, so we decided to look for a different solution.
I didn't think about the vDSO, and that it's used for `clock_gettime`, so thank you for the information and correction!
It's really hard for me to get a feel for what is okay to use and what the standard way to do something is without actually implementing everything and seeing how it behaves, so information like this helps me a lot.
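To quantify the overhead concern for ourselves, one option is to just time a batch of calls; a rough sketch (numbers will obviously vary by machine and clocksource):

#include <stdint.h>
#include <stdio.h>
#include <time.h>

static uint64_t mono_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

int main(void) {
    const int iters = 10000000;
    uint64_t sink = 0;
    uint64_t start = mono_ns();
    for (int i = 0; i < iters; i++)
        sink += mono_ns();          /* keep the calls from being optimized away */
    uint64_t total = mono_ns() - start;
    printf("avg %.1f ns per clock_gettime (sink=%llu)\n",
           (double)total / iters, (unsigned long long)sink);
    return 0;
}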
It turns out lots of systems are subtly broken and need workarounds. This is where it's handy to have a decent standard library instead of learning all of this the hard way.
__rdtsc drift isn't an AMD bug; it's by design. The only safe way to use the rdtsc instruction is to pin code to the same physical core between calls. They probably just got lucky in that the drift always exceeded the time that elapsed between calls on other CPUs, or the Linux scheduler's behavior on Ryzen 7 differs from other CPUs (I'm not familiar with Ryzen 7, but Threadripper is basically non-NUMA and needed scheduler patches under Windows to fix some pathological conditions).
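To make "pin code to the same physical core" concrete, here's a rough sketch (Linux-specific, assumes x86 and a compiler that provides __rdtsc via x86intrin.h; CPU 0 is an arbitrary choice):

#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

int main(void) {
    /* Pin this thread to CPU 0 so both __rdtsc() readings come from the same
     * core's TSC; without this, comparing the two values can be meaningless
     * on parts whose TSCs drift apart. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    uint64_t t0 = __rdtsc();
    /* ... the work being measured ... */
    uint64_t t1 = __rdtsc();
    printf("delta = %llu cycles\n", (unsigned long long)(t1 - t0));
    return 0;
}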
It seems more like a QoI (quality of implementation) issue. Most Intel chips maintain synchronized (same rate of change) TSCs. That knowledge is encoded in the Linux kernel:
/*
 * Intel systems are normally all synchronized.
 * Exceptions must mark TSC as unstable:
 */
if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL) {
        /* assume multi socket systems are not synchronized: */
        if (num_possible_cpus() > 1)
                return 1;
}
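If you want to see what the kernel actually decided on a given box, reading the selected clocksource is a quick check (a sketch; the sysfs path below is the standard Linux one):

#include <stdio.h>

int main(void) {
    /* The kernel exposes the clocksource it actually selected here; if it
     * decided the TSC is unstable this will usually say "hpet" or "acpi_pm"
     * instead of "tsc". */
    FILE *f = fopen("/sys/devices/system/clocksource/clocksource0/current_clocksource", "r");
    if (!f) {
        perror("fopen");
        return 1;
    }
    char buf[64];
    if (fgets(buf, sizeof buf, f))
        printf("current clocksource: %s", buf);
    fclose(f);
    return 0;
}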
> They probably just got lucky in that the drift always exceeded the time that elapsed between calls on other CPUs
And this is also true. :D
But I would say we were lucky to notice this problem at all because it really doesn't happen on that many CPUs, at least on the ones we tried out.