The slow currentTimeMillis() (pzemtsov.github.io)
323 points by jsnell 11 months ago | 52 comments

This post has a confused notion of why the TSC isn't used. It has nothing to do with integration with NTP -- it's because the kernel thinks the TSC doesn't work right on the machine in question. The kernel logs should explain why. Also, on some hypervisors, there are TSC problems, but the situation is slowly improving.


> The second value is the nanoseconds time, relative to the last second, shifted left by gtod->shift. Or, rather, it is the time measured in some units, which becomes the nanosecond time when shifted right by gtod->shift, which, in our case, is 25. This is very high precision. One nanosecond divided by 2^25 is about 30 attoseconds. Perhaps, Linux kernel designers looked far ahead, anticipating the future when we’ll need such accuracy for our time measurement.

That's not what's going on. This is straightforward fixed point arithmetic, where the point just happens to be variable.

> why the TSC isn't used

Because on most modern CPUs, clock frequency is extremely unstable: Turbo Boost, SpeedStep, Cool'n'Quiet, etc. These features make the TSC unsuitable for time measurement.

Please check what features like constant_tsc and nonstop_tsc are for before spreading misinformation.

That thing is Intel-only, or at least the Linux kernel thinks so: https://github.com/torvalds/linux/blob/master/arch/x86/kerne...

And even on Intel, TSC is not reliable: https://lwn.net/Articles/388286/

AMD CPUs have supported constant_tsc and nonstop_tsc since the Phenom era too.

There have also been some multi-socket configurations where TSC will gradually become desynchronized across sockets.

That lkml thread was back in 2010, before invariant TSC was a thing.

> intel-only

So... >99% of servers and the vast majority of desktops?

This is covered in the fine article.

On modern CPUs the tsc does not count CPU cycles.

Theo Schlossnagle and Riley Berton have recently implemented and discussed timing functions with the following performance characteristics (25-50ns):

  Operating System  Call          System Call Time  Mtev-variant Call  Speedup
  Linux 3.10.0      gettimeofday  35.7 ns/op        35.7 ns/op         x1.00
  Linux 3.10.0      gethrtime     119.8 ns/op       40.4 ns/op         x2.96
  OmniOS r151014    gettimeofday  304.6 ns/op       47.6 ns/op         x6.40
  OmniOS r151014    gethrtime     297.0 ns/op       39.9 ns/op         x7.44
Here is the article: https://www.circonus.com/2016/09/time-but-faster/

They also use TSC and (CPU-bound) background threads to update shared memory.

The implementation is available as C library for Linux and Illumos/OmniOS: https://github.com/circonus-labs/libmtev/blob/master/src/uti...

Related and required reading if you're doing microbenchmarking: https://shipilev.net/blog/2014/nanotrusting-nanotime/

Also, the tsc CPU flag alone is not sufficient, since early tsc implementations had issues that made them unusable for walltime measurements. There are also constant_tsc and nonstop_tsc, which are relevant because they indicate that the TSC keeps ticking independent of frequency scaling and CPU low-power modes; only then does it become usable for low-overhead userspace timekeeping.

HotSpot and the Linux vDSO generally take care of doing the right thing depending on CPU flags, so the TSC is used for currentTimeMillis and nanoTime if the kernel can determine that it is reliable; otherwise, more expensive but accurate methods are used.

The article's own conclusion:

> The title isn’t completely correct: currentTimeMillis() isn’t always slow. However, there is a case when it is.

It only happens under circumstances where the TSCs are not used.

These points are all made in the original article...

My point is that the heading is misleading. On modern systems linux/hotspot generally do the right thing(tm) and no tinkering with sysfs should be needed, if the right CPU flags (check the flags line in /proc/cpuinfo) are available. Some virtualization solutions may cause issues here.

Is this really new? I remember when System.nanoTime was introduced in the early 2000s largely because of the reasons stated here. System.currentTimeMillis was slow and had jitter which made it non-monotonic but it was the best method to call if your timestamp needed to interact with the world outside the JVM. In contrast, if all you cared about was relative times, you'd get a speed and accuracy boost from using System.nanoTime since it actually was monotonic and while a single nanosecond value was pretty meaningless, the difference between two nanosecond values was basically accurate.

Benchmarks are a class of problems where the overall time doesn't matter, only the relative time (end - start). The fact that he's using System.currentTimeMillis for his start and end values doesn't give me confidence that he's very knowledgeable about what's to follow.

Good analysis, and yet again, this is why we use TSC at Netflix. I benchmarked them all a while ago. I also put "clocksource" on my tools diagram, so people remember that clocksource is a thing: http://www.brendangregg.com/Perf/linux_perf_tools_full.png

> The vDSO is mapped to a new address each time (probably, some security feature). However, the function is always placed at the same address as above (0x00007ffff7ffae50) when run under gdb (I’m not sure what causes this).

GDB (≥ 7) disables ASLR by default:
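You can check and change this from the GDB prompt (these are standard GDB commands; `disable-randomization` defaults to `on`):

```
(gdb) show disable-randomization
(gdb) set disable-randomization off
```

With `disable-randomization off` the debuggee runs with ASLR enabled again, and the vDSO should land at a different address on each run.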


> The way to test the performance of System.currentTimeMillis() is straightforward

I'd be worried about that claim. The article executes System.currentTimeMillis in a loop and sums the values. That takes care of dead code elimination, but it will still be subject to other distortionary effects (https://shipilev.net/#benchmarking-1). Maybe the fact that System.currentTimeMillis() is implemented using native code reduces some of the JIT induced benchmarking pitfalls, but I would still prefer to try and use JMH to test.

I'm also surprised that performance of System.currentTimeMillis can fall down so far under some circumstances. In one of Cliff Click's talks (maybe https://www.youtube.com/watch?v=-vizTDSz8NU), he mentions that some JVM benchmarks run it many millions of times a second.

This is especially relevant if you're running in the cloud.

We've recently run into slow performance with gettimeofday on Xen systems such as EC2 (see https://blog.packagecloud.io/eng/2017/03/08/system-calls-are...). This hits particularly hard if you're running instrumented code. Switching the clocksource from xen to tsc seems to help, and we're still evaluating the ramifications of doing so (various sources, including the above, suggest that tsc is unsafe and prone to clock skew, but others suggest that it's OK on more modern processors).

Most Android developers will know this already, but currentTimeMillis() is probably not the call you want for event timing:


The Windows version is absolutely brilliant: a shared area mapped into every process. Because of the way page tables work, this costs only one page (4k) of RAM and only has to be written once by the OS.

The Linux version does the same, actually.

As explained in the article, the Windows version is so small because it can only provide a resolution limited to the timer tick's resolution (2048 Hz in this case, but unless a program explicitly requests a high resolution it can be as low as 64 Hz IIRC). The cost to having a higher resolution is also a higher number of interrupts per second, which wastes power unnecessarily when you are idle[1].

Instead, the Linux version can provide a resolution limited only by the actual clocksource, which is nanoseconds in the case of the TSC. There is no background process that "updates the fields of this structure at regular intervals"; the updates only need to happen if the speed of the clock changes, for example due to an NTP update. While in the blog post it's measured to happen every millisecond, it can be configured to be lower without affecting the resolution of gettimeofday. There is also a special mode where you make it tick as slowly as 1 Hz on CPUs that have only one runnable process (if you have more than one runnable process, the scheduler needs to do context switches so the kernel cannot tick slowly).

[1] https://randomascii.wordpress.com/2013/07/08/windows-timer-r...

XNU has an equivalent to vDSO: the commpage. It is also used for time.

The locking is also neat - no software locking, and yet you are guaranteed to get correct values. Simply read High, Low, High2, and then check that High and High2 are equal.

(Hitting the memory address will require some level of CPU level cache coherency protocol to kick in, but that's much cheaper than the equivalent mutex).

Interestingly, this only works on x86 because it's a TSO architecture. Windows on ARM must be doing something else, since ARM is weakly ordered.

Memory barriers: these are instructions that force memory ordering. You can read about the ARM implementation of memory barriers here;


On a properly configured environment TSC is the default clocksource. I have no idea why one would want to precisely measure time with any other clock source.

On relatively contemporary systems clock_gettime(CLOCK_REALTIME) will give you a nanosecond-resolution wall clock time, and going through JNI will cost you around 11-13 ns, so with a simple native wrapper in Java the cost is negligible. Although interesting, the whole study of the cost of obtaining the current time in Java with currentTimeMillis() with clock sources other than TSC is somewhat irrelevant.

The Linux kernel may choose to not select TSC as the clocksource, regardless of the configuration. For example, we have a 4 socket machine running CentOS 6 that refuses to use the TSC timer because at boot time the kernel determines (experimentally, I believe) that TSC is not reliable on that hardware. So it falls back to using HPET instead.

We also have a similar 2 socket machine running a similar version of CentOS that has no problems using TSC. ¯\_(ツ)_/¯

So there's a performance bug in Java's currentTimeMillis() implementation on Linux, where it uses gettimeofday() instead of clock_gettime() with a coarse timer (even the TSC method is ~3x slower or worse than the coarse timer, which has sufficient precision for currentTimeMillis()). Is there a bug filed for this? It seems to warrant one.

For what it's worth, currentTimeMillis() shouldn't be used for interval measurements anyway, as it might not be continuous in some scenarios. A better alternative is System.nanoTime().

People have been discovering and rediscovering gettimeofday is really slow for a long time now. I remember switching to OS specific equivalents for AIX, SunOS, Ultrix, HP-UX, etc. back in the late 80s and early 90s.

yeah, this isn't new news. but it is a very nice write-up and simple enough that any technical person should be able to grasp it

I started off being able to mostly comprehend what's going on, but about halfway down I got lost in the assembly and realized I barely remember much from my OS/architecture courses in school.

This was a solid reminder my every day work sits atop the shoulders of giants and I am much appreciative.

An easy way to do this from the outside would be to create a JVMTI agent, and make currentTimeMillis call into your JNI version. Here's an example in Rust of me doing it for a bit more involved purpose: https://github.com/cretz/stackparam. Then people only have to pass your shared lib as a cli arg to get the benefit.

> Some background thread regularly updates three double-words there, which contain the high part of the value, the low part, and the high part again.

So to implement a clock, they are using a different thread that updates a counter every millisecond or so? Isn't this a bit inefficient?

It's not a thread, it's an interrupt handler. Windows (on x86) uses the real-time clock's periodic timer, which can be made to tick at any power-of-two frequency between 1 Hz and 32768 Hz.

Is that interrupt handler installed for every process that needs the timer? Or is there only one handler globally?

It's a hardware interrupt from a hardware clock that is handled by the OS kernel. Regular user processes are usually totally forbidden from touching that hardware stuff.

Really love these kinds of in-depth investigations. However, I still want to know how Windows manages to update the value in the shared memory so consistently without any impact on system performance.

https://blogs.msdn.microsoft.com/oldnewthing/20050902-00/?p=... rather implies it updates at the "tick" frequency, not actually every millisecond. Someone should do the followup test to see how frequently it actually changes.

Why would there be a system performance impact? This is just updating a memory value (that happens to be mapped into the VM of every process) at 2 kHz.

Silly me, the time is only updated once every half a millisecond. That's virtually zero CPU usage on a 2GHz chip.

It's worse than you might think. There are at least three things happening 2000x/second: getting an interrupt, writing shared memory, and returning. Getting a CPU interrupt takes something like 1k cycles. Dealing with the APIC may be even worse on a system without X2APIC, which is all too common due to crappy firmware (never benchmarked it, though), and returning takes many hundreds of cycles because x86 is very bad at these things. That shared memory write is basically free unless NUMA is involved, and then it can be fairly bad.

So we're talking maybe 4M cycles per second used here. In practice, there's probably quite a bit of extra software overhead.

If it's part of a register map or can be computed from a register, you can just map that section of the registers into that 4k page and then use the function call to compute the return value without a context-switch. The hardware then becomes responsible for updating the value for you. The only thing you're left with is possible read latency, but that should be relatively small.

That's my theory.

It's sort of amazing that 600 ns is considered slow.

For many of the things we work on, this is not just slow, it is atrociously slow. For analysis of real-time systems I want to put many data tap points throughout the processing chain. If I add 50 (not uncommon), at 5ns a pop most systems (again, IME) are not going to care about 0.25 microsecond extra latency. At 600ns I add 30 microseconds to their processing chain and will likely have to spend some of my time explaining to the owners why my tap points will not affect the system I am helping fix.

It's all relative. It may sound fast but on the other hand it's less than two million times a second. There are many cases where spending an extra 600ns in a tight inner loop would absolutely obliterate a program's performance, whereas an implementation taking just 3ns would not.

Compared to 4 ns on Windows, code that calls the method frequently can end up with significantly different performance profiles across the platforms.

If you do e.g. HFT, an additional 600 ns might kill you.

I think I've also read somewhere about a bug in the Java Random class: after N iterations the sequence starts over from iteration 1, where N is not very big (i.e., a clearly visible cyclic pattern in the numbers). But I can't find it anymore; if someone knows its whereabouts, please let me know.

I've added an update on the nanoTime() to the article
