
The slow currentTimeMillis() - jsnell
http://pzemtsov.github.io/2017/07/23/the-slow-currenttimemillis.html
======
amluto
This post has a confused notion of why the TSC isn't used. It has nothing to
do with integration with NTP -- it's because the kernel thinks the TSC doesn't
work right on the machine in question. The kernel logs should explain why.
Also, on some hypervisors, there are TSC problems, but the situation is slowly
improving.

Also:

> The second value is the nanoseconds time, relative to the last second,
> shifted left by gtod->shift. Or, rather, it is the time measured in some
> units, which becomes the nanosecond time when shifted right by gtod->shift,
> which, in our case, is 25. This is very high precision. One nanosecond
> divided by 2^25 is about 30 attoseconds. Perhaps, Linux kernel designers
> looked far ahead, anticipating the future when we’ll need such accuracy for
> our time measurement.

That's not what's going on. This is straightforward fixed point arithmetic,
where the point just happens to be variable.
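
To illustrate: the kernel pre-computes a multiplier and a shift for the
clocksource so the hot path needs only a multiply and a shift, never a
division. A minimal sketch of the idea in C (the names are illustrative, not
the kernel's actual struct fields):

    #include <stdint.h>

    /* Fixed-point conversion of clocksource ticks (e.g. TSC cycles) to
     * nanoseconds: ns = (ticks * mult) >> shift.  With shift == 25, the
     * intermediate value is nanoseconds scaled by 2^25, which is where
     * the "30 attoseconds" unit in the quote comes from. */
    static inline uint64_t ticks_to_ns(uint64_t ticks, uint32_t mult,
                                       uint32_t shift)
    {
        return (ticks * mult) >> shift;
    }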

~~~
Const-me
> why the TSC isn't used

Because on most modern CPUs, the clock frequency is extremely unstable: Turbo
Boost, SpeedStep, Cool'n'Quiet, etc. These features make the TSC unsuitable
for time measurement.

~~~
bitcharmer
Please check what features like constant_tsc or nonstop_tsc are for before
spreading misinformation.

~~~
Const-me
That thing is Intel-only, or at least the Linux kernel thinks so:
[https://github.com/torvalds/linux/blob/master/arch/x86/kerne...](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/tsc.c)

And even on Intel, TSC is not reliable:
[https://lwn.net/Articles/388286/](https://lwn.net/Articles/388286/)

~~~
makomk
AMD CPUs have supported constant_tsc and nonstop_tsc since the Phenom era too.

------
heinrichhartman
Theo Schlossnagle and Riley Berton have recently implemented and discussed
timing functions with the following performance characteristics (25-50ns):

    
    
      Operating System  Call          System Call Time  Mtev-variant Call  Speedup
      Linux 3.10.0      gettimeofday  35.7 ns/op        35.7 ns/op         x1.00
      Linux 3.10.0      gethrtime     119.8 ns/op       40.4 ns/op         x2.96
      OmniOS r151014    gettimeofday  304.6 ns/op       47.6 ns/op         x6.40
      OmniOS r151014    gethrtime     297.0 ns/op       39.9 ns/op         x7.44
    

Here is the article: [https://www.circonus.com/2016/09/time-but-
faster/](https://www.circonus.com/2016/09/time-but-faster/)

They also use TSC and (CPU-bound) background threads to update shared memory.
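
Roughly, the shape of that technique is: a dedicated (CPU-bound) updater
thread keeps a recent timestamp published in shared memory, and readers pay
only an atomic load instead of a system call. A very rough sketch (this is
not libmtev's actual code; the real library calibrates against the TSC
rather than spinning on clock_gettime):

    #include <stdatomic.h>
    #include <stdint.h>
    #include <time.h>

    static _Atomic uint64_t cached_ns;   /* last published wall-clock time */

    /* Updater thread body: keep refreshing the cached value. */
    static void *updater(void *arg)
    {
        (void)arg;
        for (;;) {
            struct timespec ts;
            clock_gettime(CLOCK_REALTIME, &ts);
            atomic_store_explicit(&cached_ns,
                                  (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec,
                                  memory_order_release);
        }
        return NULL;
    }

    /* Readers: one atomic load, no system call. */
    static uint64_t fast_now_ns(void)
    {
        return atomic_load_explicit(&cached_ns, memory_order_acquire);
    }

The updater would be started once with pthread_create (and typically pinned
to a core, since it never sleeps).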

The implementation is available as C library for Linux and Illumos/OmniOS:
[https://github.com/circonus-
labs/libmtev/blob/master/src/uti...](https://github.com/circonus-
labs/libmtev/blob/master/src/utils/mtev_time.c)

------
the8472
Related and required reading if you're doing microbenchmarking:
[https://shipilev.net/blog/2014/nanotrusting-
nanotime/](https://shipilev.net/blog/2014/nanotrusting-nanotime/)

Also, the _tsc_ CPU flag alone is not sufficient, since early TSC
implementations had issues that made them unusable for walltime measurements.
There's also _constant_tsc_ and _nonstop_tsc_, which are relevant because they
indicate that the TSC keeps ticking independent of frequency scaling and CPU
low-power modes; only then does it become usable for low-overhead userspace
timekeeping.

HotSpot and the Linux vDSO generally take care of doing the right thing
depending on the CPU flags, so the TSC is used for _currentTimeMillis_ and
_nanoTime_ if the kernel can determine that it is reliable; otherwise more
expensive but accurate methods are used.

The article's own conclusion:

> The title isn’t completely correct: currentTimeMillis() isn’t always slow.
> However, there is a case when it is.

It only happens under circumstances where the TSCs are not used.

~~~
joosters
These points are all made in the original article...

~~~
the8472
My point is that the heading is misleading. On modern systems Linux/HotSpot
generally do the right thing(tm) and no tinkering with sysfs should be
needed, provided the right CPU flags (check /proc/cpuinfo) are available.
Some virtualization solutions may cause issues here.
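
For reference, a rough way to check for those flags programmatically (a
minimal sketch that scans the "flags" lines of /proc/cpuinfo):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char line[4096];
        int constant = 0, nonstop = 0;
        FILE *f = fopen("/proc/cpuinfo", "r");
        if (!f)
            return 1;
        while (fgets(line, sizeof line, f)) {
            if (strncmp(line, "flags", 5) == 0) {
                if (strstr(line, "constant_tsc")) constant = 1;
                if (strstr(line, "nonstop_tsc"))  nonstop  = 1;
            }
        }
        fclose(f);
        printf("constant_tsc: %d, nonstop_tsc: %d\n", constant, nonstop);
        return 0;
    }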

------
curun1r
Is this really new? I remember when System.nanoTime was introduced in the
early 2000s largely because of the reasons stated here.
System.currentTimeMillis was slow and had jitter which made it non-monotonic
but it was the best method to call if your timestamp needed to interact with
the world outside the JVM. In contrast, if all you cared about was relative
times, you'd get a speed and accuracy boost from using System.nanoTime since
it actually was monotonic, and while a single nanosecond value was pretty
meaningless, the difference between two nanosecond values was basically
accurate.

Benchmarks are a class of problems where the overall time doesn't matter, only
the relative time (end - start). The fact that he's using
System.currentTimeMillis for his start and end values doesn't give me
confidence that he's very knowledgeable about what's to follow.

------
brendangregg
Good analysis, and yet again, this is why we use TSC at Netflix. I benchmarked
them all a while ago. I also put "clocksource" on my tools diagram, so people
remember that clocksource is a thing:
[http://www.brendangregg.com/Perf/linux_perf_tools_full.png](http://www.brendangregg.com/Perf/linux_perf_tools_full.png)

------
jwilk
> The vDSO is mapped to a new address each time (probably, some security
> feature). However, the function is always placed at the same address as
> above (0x00007ffff7ffae50) when run under gdb (I’m not sure what causes
> this).

GDB (≥ 7) disables ASLR by default:

[https://sourceware.org/gdb/current/onlinedocs/gdb/Starting.h...](https://sourceware.org/gdb/current/onlinedocs/gdb/Starting.html#index-
set-disable_002drandomization)

------
hyperpape
"The way to test the performance of System.currentTimeMillis() is
straightforward"

I'd be worried about that claim. The article executes System.currentTimeMillis
in a loop and sums the values. That takes care of dead code elimination, but
it will still be subject to other distortionary effects
([https://shipilev.net/#benchmarking-1](https://shipilev.net/#benchmarking-1)).
Maybe the fact that System.currentTimeMillis() is implemented using native
code reduces some of the JIT induced benchmarking pitfalls, but I would still
prefer to try and use JMH to test.

I'm also surprised that the performance of System.currentTimeMillis can fall
so far under some circumstances. In one of Cliff Click's talks (maybe
[https://www.youtube.com/watch?v=-vizTDSz8NU](https://www.youtube.com/watch?v=-vizTDSz8NU)),
he mentions that some JVM benchmarks run it many millions of times a second.

------
msluyter
This is especially relevant if you're running in the cloud.

We've recently run into slow performance with gettimeofday on Xen systems,
such as EC2 (see: [https://blog.packagecloud.io/eng/2017/03/08/system-calls-
are...](https://blog.packagecloud.io/eng/2017/03/08/system-calls-are-much-
slower-on-ec2/)). This hits particularly hard if you're running instrumented
code. Switching the clocksource from xen to tsc seems to help, and we're still
evaluating the ramifications of doing so (various sources, including the
above, seem to suggest that tsc is unsafe and prone to clock skew, but others
suggest that it's OK on more modern processors).

------
js2
Most Android developers will know this already, but currentTimeMillis() is
probably not the call you want for event timing:

[https://developer.android.com/reference/android/os/SystemClo...](https://developer.android.com/reference/android/os/SystemClock.html)

------
pjc50
The Windows version is absolutely brilliant: a shared area mapped into every
process. Because of the way page tables work, this costs only one page (4k) of
RAM and only has to be written once by the OS.

~~~
bonzini
The Linux version does the same, actually.

As explained in the article, the Windows version is so small because it can
only provide a resolution limited to the timer tick's resolution (2048 Hz in
this case, but unless a program explicitly requests a high resolution it can
be as low as 64 Hz IIRC). The cost of having a higher resolution is also a
higher number of interrupts per second, which wastes power unnecessarily when
you are idle[1].

The Linux version, instead, can provide a resolution limited only by the
actual clocksource, which means nanoseconds in the case of the TSC. There is
no background process that "updates the fields of this structure at regular
intervals"; the updates only need to happen when the speed of the clock
changes, for example due to an NTP adjustment. While in the blog post the
update is measured to happen every millisecond, it can be configured to
happen less often without affecting the resolution of gettimeofday. There is
also a special mode that makes the kernel tick as slowly as 1 Hz on CPUs that
have only one runnable process (if you have more than one runnable process,
the scheduler needs to do context switches, so the kernel cannot tick slowly).

[1] [https://randomascii.wordpress.com/2013/07/08/windows-
timer-r...](https://randomascii.wordpress.com/2013/07/08/windows-timer-
resolution-megawatts-wasted/)

~~~
jevinskie
XNU has an equivalent to vDSO: the commpage. It is also used for time.

------
bitcharmer
In a properly configured environment, the TSC is the default clocksource. I
have no idea why one would want to precisely measure time with any other
clock source.

On relatively contemporary systems, clock_gettime(CLOCK_REALTIME) will give
you nanosecond-resolution wall-clock time, and going through JNI costs around
11-13 ns, so with a simple native wrapper in Java the overhead is negligible.
Although interesting, the whole study of the cost of obtaining the current
time in Java with currentTimeMillis() and clock sources other than the TSC is
somewhat irrelevant.
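
The native half of such a wrapper is tiny. A hedged sketch (the Java class
and method names are made up, and the Java-side native declaration and
System.loadLibrary call are omitted):

    #include <jni.h>
    #include <time.h>

    /* Native side of a hypothetical `static native long realtimeNanos();`
     * on a class named Clock: wall-clock time in nanoseconds since the
     * epoch, via clock_gettime(CLOCK_REALTIME). */
    JNIEXPORT jlong JNICALL
    Java_Clock_realtimeNanos(JNIEnv *env, jclass cls)
    {
        struct timespec ts;
        (void)env; (void)cls;
        clock_gettime(CLOCK_REALTIME, &ts);
        return (jlong)ts.tv_sec * 1000000000LL + (jlong)ts.tv_nsec;
    }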

~~~
usefulcat
The Linux kernel may choose to not select TSC as the clocksource, regardless
of the configuration. For example, we have a 4 socket machine running CentOS 6
that refuses to use the TSC timer because at boot time the kernel determines
(experimentally, I believe) that TSC is not reliable on that hardware. So it
falls back to using HPET instead.

We also have a similar 2 socket machine running a similar version of CentOS
that has no problems using TSC. ¯\\_(ツ)_/¯

------
loeg
So there's a performance bug in Java's currentTimeMillis() implementation on
Linux, where it uses gettimeofday() instead of clock_gettime() with a coarse
timer (even the TSC method is ~3x slower or worse than the coarse timer, which
has sufficient precision for currentTimeMillis()). Is there a bug filed for
this? It seems to warrant one.
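
For reference, the coarse timer is just another clock id; a minimal sketch of
what a coarse-clock-backed millisecond counter could look like at the libc
level (not what HotSpot actually does; CLOCK_REALTIME_COARSE is
Linux-specific):

    #include <stdint.h>
    #include <time.h>

    /* Milliseconds since the epoch from the coarse clock: resolution is
     * the kernel tick (typically 1-4 ms), which is plenty for a
     * millisecond-granularity API, and the call never touches the
     * hardware clocksource on the fast path. */
    static int64_t coarse_millis(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_REALTIME_COARSE, &ts);
        return (int64_t)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
    }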

~~~
dbdr
[http://bugs.java.com/bugdatabase/view_bug.do?bug_id=JDK-8185...](http://bugs.java.com/bugdatabase/view_bug.do?bug_id=JDK-8185891)

------
blauditore
For what it's worth, currentTimeMillis() shouldn't be used for interval
measurements anyway, as it might not be continuous in some scenarios. A better
alternative is System.nanoTime().

------
Ensorceled
People have been discovering and rediscovering that gettimeofday is really
slow for a long time now. I remember switching to OS-specific equivalents for
AIX, SunOS, Ultrix, HP-UX, etc. back in the late 80s and early 90s.

~~~
nqzero
yeah, this isn't new news. but it is a very nice write-up and simple enough
that any technical person should be able to grasp it

------
fokinsean
I started off being able to mostly comprehend what's going on, but about
halfway down I got lost in the assembly and realized I barely remember much
from my OS/Architecture courses in school.

This was a solid reminder that my everyday work sits atop the shoulders of
giants, and I am much appreciative.

------
kodablah
An easy way to do this from the outside would be to create a JVMTI agent, and
make currentTimeMillis call into your JNI version. Here's an example in Rust
of me doing it for a bit more involved purpose:
[https://github.com/cretz/stackparam](https://github.com/cretz/stackparam).
Then people only have to pass your shared lib as a cli arg to get the benefit.

------
amelius
> Some background thread regularly updates three double-words there, which
> contain the high part of the value, the low part, and the high part again.

So to implement a clock, they are using a different thread that updates a
counter every millisecond or so? Isn't this a bit inefficient?

~~~
bonzini
It's not a thread, it's an interrupt handler. Windows (on x86) uses the real-
time clock's periodic timer, which can be made to tick at any power-of-two
frequency between 1 Hz and 32768 Hz.

~~~
amelius
Is that interrupt handler installed for every process that needs the timer? Or
is there only one handler globally?

~~~
avdicius
It's a hardware interrupt from a hardware clock that is handled by the OS
kernel. Regular user processes are usually totally forbidden from touching
that hardware stuff.

------
dingo_bat
Really love these kinds of in-depth investigations. However, I still want to
know how Windows manages to update the value in the shared memory so
consistently without any impact on system performance.

~~~
revelation
Why would there be a system performance impact? This is just updating a memory
value (that happens to be mapped into the VM of every process) at 2 kHz.

~~~
dingo_bat
Silly me, the time is only updated once every half a millisecond. That's
virtually zero CPU usage on a 2GHz chip.

~~~
amluto
It's worse than you might think. There are at least three things happening
2000x/second: getting an interrupt, writing shared memory, and returning.
Getting a CPU interrupt takes something like 1k cycles. Dealing with the APIC
may be even worse on a system without X2APIC, which is all too common due to
crappy firmware (never benchmarked it, though), and returning takes many
hundreds of cycles because x86 is very bad at these things. That shared memory
write is basically free unless NUMA is involved, and then it can be fairly
bad.

So we're talking maybe 4M cycles per second here. In practice, there's
probably quite a bit of extra software overhead.

------
empath75
It's sort of amazing that 600 ns is considered slow.

~~~
ptero
For many of the things we work on, this is not just slow, it is _atrociously_
slow. For analysis of real-time systems I want to put many data tap points
throughout the processing chain. If I add 50 (not uncommon), at 5ns a pop most
systems (again, IME) are not going to care about 0.25 microsecond extra
latency. At 600ns I add 30 microseconds to their processing chain and will
likely have to spend some of my time explaining to the owners why my tap
points will not affect the system I am helping fix.

------
ramgorur
I think I've also read somewhere about a bug in the Java Random class; it was
something about the sequence starting over from iteration 1 after N
iterations, where N is not very big (i.e. a clearly visible cyclic repeat in
the numbers). But I can't find it anymore; if someone knows its whereabouts,
please let me know.

------
pzemtsov
I've added an update on nanoTime() to the article.

