
Core_bench: better micro-benchmarks through linear regression (2014) - jasim
https://blog.janestreet.com/core_bench-micro-benchmarking-for-ocaml/
======
jasim
Couple of things I found interesting:

> 2: Time.now is an expensive function that takes a while to execute and
> typically requires control transfer to a VDSO. (On my older laptop this
> takes 800+ nanos to run. On more expensive server class machines, I have
> seen numbers as low as 40 nanos.)

Then there is system noise, which means you still need a large sample size.
Even then the result has high variance, and we still haven't accounted for GC.

So they do this:

> This brings us to how Core_bench works: Core_bench runs f in increasing
> batch sizes and reports the estimated time of f () by doing a linear
> regression. In the simplest case, the linear regression uses execution time
> as the predicted variable and batch size as the predictor. Several useful
> properties follow as a consequence...
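
A minimal sketch of the idea in the quote, in Python rather than OCaml. This is not Core_bench's actual implementation: the names `fit_line` and `estimate_ns_per_call` and the default batch sizes are made up for illustration. It times `f` in increasing batches and fits time against batch size by ordinary least squares, so the slope estimates the cost of one call and the intercept absorbs fixed per-measurement overhead such as the clock reads.

```python
import time


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_ns_per_call(f, batch_sizes=(1, 2, 4, 8, 16, 32, 64, 128, 256, 512)):
    """Time f in increasing batches and regress elapsed ns on batch size.

    The slope is the estimated cost of one call (it also includes the
    Python loop overhead per iteration); the intercept is the fixed cost
    of taking a measurement, e.g. the two clock reads.
    """
    times = []
    for n in batch_sizes:
        start = time.perf_counter_ns()
        for _ in range(n):
            f()
        times.append(time.perf_counter_ns() - start)
    return fit_line(batch_sizes, times)
```

Calling `estimate_ns_per_call(lambda: sum(range(100)))` returns a `(slope, intercept)` pair in nanoseconds. The numbers will vary from run to run and machine to machine.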

While we're on the topic of benchmarking, what is the best way to do
micro-benchmarks for JavaScript code (both Node and browser)? I tried
node --prof, but the report is a little difficult to parse. Chrome's
performance tab would've been helpful, but it doesn't seem to profile
WebWorkers.

~~~
vardump
> On more expensive server class machines, I have seen numbers as low as 40
> nanos.

That's pretty unusual. Usually larger NUMA systems take longer to retrieve
time than single socket consumer ones. They need to synchronize between
multiple sockets.

~~~
dragontamer
Unnecessary in the general case.

RDTSC is the assembly instruction needed to read the current timestamp for the
current core (not necessarily synchronized to the whole system). Modern RDTSCs
do NOT read the cycle-count, because frequencies change due to "Turbo Boosts"
and other such optimizations. (See here for more details:
https://randomascii.wordpress.com/2011/07/29/rdtsc-in-the-age-of-sandybridge/)

A modern RDTSC reads at the "base clock". If your base clock is 3.6 GHz, the
RDTSC ticks at 3.6GHz (even if your processor can turbo to 4GHz or 5GHz like
the i9-9900k). So RDTSC is not the same from processor-to-processor, but it is
consistently the most granular micro-benchmarking clock available on x86
systems.
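
To illustrate the base-clock point, here is a small Python sketch of the conversion from a TSC tick delta to elapsed time. The tick delta is a made-up value, and the 3.6 GHz base and 5 GHz turbo figures are just the ones from the example above. The sketch assumes a TSC that ticks at the base clock, as described here.

```python
def tsc_ticks_to_ns(tick_delta, tsc_hz):
    """Convert a TSC tick delta to nanoseconds for a TSC ticking at tsc_hz."""
    return tick_delta * 1e9 / tsc_hz


BASE_HZ = 3.6e9   # base clock: the rate the TSC is assumed to tick at
TURBO_HZ = 5.0e9  # turbo frequency: NOT the TSC rate, shown for contrast

ticks = 3_600_000  # hypothetical delta between two RDTSC reads

correct = tsc_ticks_to_ns(ticks, BASE_HZ)   # divide by the base clock
wrong = tsc_ticks_to_ns(ticks, TURBO_HZ)    # dividing by turbo under-reports

print(f"base clock:  {correct:.0f} ns")
print(f"turbo clock: {wrong:.0f} ns")
```

Dividing by the turbo frequency would make the same interval look about 28% shorter than it really was, which is why the conversion has to use the rate the TSC actually ticks at.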

The main issue with RDTSC is that task-switches may cause your thread of
execution to change cores or mess up your timing, especially in long runs. So
Windows / Linux have higher-level performance counters that take task switches
into account, but have lower granularity. For Windows, this is
"QueryPerformanceCounter", which ticks on my system at roughly 3MHz. Still
useful for microbenchmarks, and the guarantee for cross-thread behavior is
useful more often than not.

Microsoft documents "QueryPerformanceCounter" as using rdtsc in most cases.
https://docs.microsoft.com/en-us/windows/desktop/dxtecharts/game-timing-and-multicore-processors

According to Agner Fog's instruction tables, the rdtsc instruction takes ~40
clocks or so, which suggests you can read a high-speed clock in as little as
10ns on a 4GHz x86 computer if you use the raw assembly instruction.

Add a few nanoseconds for a function call, setting up the stack, and some math
to "normalize" the rdtsc clock... and the CPUID to clear out pipelines... and
50ns total seems reasonable.
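
The arithmetic above, plus a rough way to check what a timer call costs on a given machine, sketched in Python. The helper names are made up for illustration. The measured figure covers Python's `time.perf_counter_ns()`, so it includes interpreter and function-call overhead on top of whatever the OS timer costs, and it will not match a raw rdtsc.

```python
import time


def cycles_to_ns(cycles, clock_ghz):
    """Cycles divided by cycles-per-nanosecond (GHz) gives nanoseconds."""
    return cycles / clock_ghz


def timer_call_cost_ns(samples=100_000):
    """Average cost in ns of one time.perf_counter_ns() call.

    Includes Python loop and call overhead, so treat it as an upper bound
    on the cost of the underlying OS timer.
    """
    start = time.perf_counter_ns()
    for _ in range(samples):
        time.perf_counter_ns()
    end = time.perf_counter_ns()
    return (end - start) / samples


print("rdtsc estimate:", cycles_to_ns(40, 4.0), "ns")
print("measured timer call:", timer_call_cost_ns(), "ns")
```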

There are some slower clocks available from the motherboard or I/O system. But
those should be avoided unless you are running a 2008-era x86 processor (i.e.
Nehalem/Westmere, the only Intel processors with "Turbo" frequencies that
changed RDTSC timing). Older systems don't have turbo; newer systems lock
RDTSC to the base clock.

------------

Anyway, I'm no JavaScript guru. But surely there's a way to pass the RDTSC
instruction "up" from the assembly level to JavaScript code? Or alternatively,
maybe an OS-level timer function that's built on top of RDTSC
(QueryPerformanceCounter on Windows, or CLOCK_MONOTONIC_RAW on Linux).
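
As an illustration of the OS-level timer route from a high-level language (Python here, not JavaScript), a minimal sketch. The helpers `monotonic_ns` and `time_once_ns` are hypothetical names. The sketch uses `CLOCK_MONOTONIC_RAW` where the platform exposes it and otherwise falls back to `time.perf_counter_ns()`, which per PEP 418 is backed by QueryPerformanceCounter on Windows.

```python
import time


def monotonic_ns():
    """Read a monotonic clock in ns, preferring CLOCK_MONOTONIC_RAW if present."""
    raw = getattr(time, "CLOCK_MONOTONIC_RAW", None)
    if raw is not None and hasattr(time, "clock_gettime_ns"):
        return time.clock_gettime_ns(raw)
    # Fallback for platforms without CLOCK_MONOTONIC_RAW (e.g. Windows).
    return time.perf_counter_ns()


def time_once_ns(f):
    """Elapsed ns for a single call of f, including the cost of the clock reads."""
    start = monotonic_ns()
    f()
    return monotonic_ns() - start


print(time_once_ns(lambda: sum(range(1000))), "ns")
```

A single reading like this is noisy and includes the cost of the two clock reads, which is the problem the batching approach in the article is meant to address.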

~~~
vardump
> Microsoft documents "QueryPerformanceCounter" as using rdtsc in most cases.
> [https://docs.microsoft.com/en-
> us/windows/desktop/dxtecharts/...](https://docs.microsoft.com/en-
> us/windows/desktop/dxtecharts/..).

And the cases where it doesn't are systems with more than two CPU sockets.
There it falls back on other timers that can take microseconds to query.

