
Show HN: IPC between multiple Java processes/JVMs with nanosecond latency - mikaelj1
http://mappedbus.io
======
mbell
For those looking for a high-performance inter-thread library, I've found
LMAX's Disruptor library has worked quite well:
[https://lmax-exchange.github.io/disruptor/](https://lmax-exchange.github.io/disruptor/)
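
If anyone wants a feel for it, here's a minimal sketch against the Disruptor
3.x API (the event type and values are made up):

```java
import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.util.DaemonThreadFactory;

public class DisruptorSketch {
    // Mutable event, pre-allocated in the ring buffer to avoid GC pressure
    static class LongEvent {
        long value;
    }

    public static void main(String[] args) {
        int bufferSize = 1024; // must be a power of two
        Disruptor<LongEvent> disruptor = new Disruptor<>(
                LongEvent::new, bufferSize, DaemonThreadFactory.INSTANCE);

        // Consumer runs on its own thread, handed events in sequence order
        disruptor.handleEventsWith((event, sequence, endOfBatch) ->
                System.out.println("got " + event.value));
        disruptor.start();

        RingBuffer<LongEvent> ring = disruptor.getRingBuffer();
        for (long i = 0; i < 10; i++) {
            long seq = ring.next();        // claim the next slot
            try {
                ring.get(seq).value = i;   // fill the pre-allocated event
            } finally {
                ring.publish(seq);         // make it visible to the consumer
            }
        }
    }
}
```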

~~~
fiatmoney
The LMAX folks are legit, and any given YouTube video by them is likely to be
extremely informative.

------
halayli
What will the reader do while polling? If it spins/polls, that consumes CPU
resources unnecessarily. And if the reader polls frequently, the OS will
preempt the process and it will pay the price.

Typically it takes around 50 ns to 80 ns to access RAM via the north bridge.
Most probably the CPU cache is skewing his measurements.
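
For context, a rough sketch of what a busy-polling reader over a shared
memory-mapped file looks like (hypothetical path and layout, written with the
Java 9+ VarHandle API rather than the library's actual sun.misc.Unsafe code):

```java
import java.io.RandomAccessFile;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class SpinningReader {
    // Acquire view of an int slot, so the JIT can't hoist the load out of the loop
    static final VarHandle COMMIT = MethodHandles.byteBufferViewVarHandle(
            int[].class, ByteOrder.nativeOrder());

    public static void main(String[] args) throws Exception {
        try (RandomAccessFile file = new RandomAccessFile("/tmp/bus", "rw");
             FileChannel channel = file.getChannel()) {
            MappedByteBuffer buf =
                    channel.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            // Busy-poll the commit word: no syscall, no sleep, 100% of one core
            while ((int) COMMIT.getAcquire(buf, 0) == 0) {
                // spin
            }
            System.out.println("received " + buf.getLong(8));
        }
    }
}
```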

~~~
kasey_junk
In low-latency use cases you are doing everything in your power to avoid RAM
access. Preventing preemption by "wasting" CPU resources is a common
technique.

~~~
halayli
The OS scheduler will penalize you if you waste CPU resources.

~~~
theyoungestgun
Then penalize the OS scheduler by removing your spinning CPU from its
available resources ;)
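
Concretely: reserve a core from the kernel scheduler (e.g. the isolcpus boot
parameter on Linux) and pin the spinning thread to it. A sketch assuming the
OpenHFT Java-Thread-Affinity library:

```java
import net.openhft.affinity.AffinityLock;

public class PinnedSpinner {
    static volatile boolean running = true;

    public static void main(String[] args) {
        // Grab a reserved core; pair this with isolcpus=<core> on the kernel
        // command line so the scheduler never places other tasks there.
        AffinityLock lock = AffinityLock.acquireLock();
        try {
            while (running) {
                // busy-poll the shared memory region here
            }
        } finally {
            lock.release();
        }
    }
}
```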

------
clumsysmurf
Any chance this works on Android?

~~~
vardump
So ARMv7 you mean? This depends on the CPU's hardware-level memory model. Not
absolutely sure, didn't look that hard, but I _think_ this doesn't implement
the ARMv7 memory model correctly, so maybe it doesn't always function right.
It _might_ be possible for a message to be committed while some cache lines
containing message bytes are still dirty.

This implementation is using sun.misc.Unsafe, after all.

Different ARM CPUs implement different memory models, so what applies to one
ARM design might not apply to another. ARMv8 is probably the easiest to
support thanks to its new load-acquire and store-release instructions.

Again, not sure without further analysis. And no time to do it.
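
To make the worry concrete, here is the ordering a commit has to guarantee,
sketched with the Java 9+ VarHandle API (MappedBus itself uses
sun.misc.Unsafe, and the offsets here are invented):

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;
import java.nio.MappedByteBuffer;

public class Committer {
    static final VarHandle COMMIT = MethodHandles.byteBufferViewVarHandle(
            int[].class, ByteOrder.nativeOrder());

    // The payload stores must become visible no later than the commit flag.
    // x86's strong store ordering gives you that even for plain stores;
    // ARMv7 reorders stores freely, so it needs the explicit release barrier.
    static void publish(MappedByteBuffer buf, long payload) {
        buf.putLong(8, payload);       // 1. write the message bytes
        COMMIT.setRelease(buf, 0, 1);  // 2. release store of the commit flag
    }
}
```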

~~~
motoboi
Not only ARMv7, but also whether it runs on Dalvik or ART.

~~~
vardump
The CPU memory model and atomics support are what matter here. Dalvik and ART
are unlikely to have anything to do with it, as long as they provide access.

------
jinmingjian
Come on, stop joining the headline party! :)

You should make sure you truly understand what the word "latency" means.

Your story is not unique [1]. The basic point is: latency != 1/throughput.

I have no interest in your code. Just from your README, I am sure you have
made mistakes.

Did you ask yourself a few questions before publishing an eye-catching
headline? Such as:

1. What is the cost of one CAS op (and maybe a volatile access)?

2. What is the timing accuracy in Java?

3. What is the latency of one main memory access?

4. OK, more pitfalls in Java, the OS and micro-benchmarking coming...

If you had a basic understanding of questions 1-3, you would not claim "20 ns
is possible. The best I've gotten with a bare bones optimized test was about
16 ns" yourself.

[1] [http://www.infoq.com/articles/High-Performance-Java-Inter-Thread-Communications/](http://www.infoq.com/articles/High-Performance-Java-Inter-Thread-Communications/)

~~~
vardump
> Come on, stop joining the headline party! :)

Seriously, saying things like that makes you look really bad - no one will
take you seriously.

I got that 16 ns with cache-line-aligned, carefully tuned assembler. No false
sharing. And it most definitely didn't use CAS, but FAA. I used the CPU's TSC
timers.

That said, I'm pretty sure you can achieve close to that in Java as well.
20-40 ns latency between two CPU cores on the same socket wouldn't surprise
me at all. Throughput can be even higher if you start to batch messages.

Anyways:

1) The cost of CAS depends on contention. About 15 ns in any case without
contention.

2) Timing accuracy in Java... Not that I really use Java much, but I'd
imagine it's exactly the same as in C++, if you can access the TSC somehow.
Nothing prevents you from ping-ponging messages between two threads a few
million times either -- that way even a low-resolution timer is more than
enough (see the sketch after this list). Yes, 20ish ns latency is doable.

3) Who cares about memory latency? The point is to communicate via cache-line
sharing; going to main memory would defeat that.

4) ?
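
A minimal sketch of that ping-ponging approach (class and constants
invented): round-trip a token a million times and divide by the hop count, so
even a coarse timer resolves per-hop latency:

```java
import java.util.concurrent.atomic.AtomicLong;

public class PingPong {
    static final AtomicLong token = new AtomicLong(0);
    static final int ROUNDS = 1_000_000;

    public static void main(String[] args) throws InterruptedException {
        Thread ponger = new Thread(() -> {
            for (int i = 0; i < ROUNDS; i++) {
                while (token.get() != 1) { } // busy-poll for the ping
                token.set(0);                // pong it back
            }
        });
        ponger.start();

        long start = System.nanoTime();
        for (int i = 0; i < ROUNDS; i++) {
            token.set(1);                    // ping
            while (token.get() != 0) { }     // busy-poll for the pong
        }
        long elapsed = System.nanoTime() - start;
        ponger.join();

        // 2 * ROUNDS one-way hops; timer error amortizes to ~nothing per hop
        System.out.printf("~%d ns per one-way hop%n", elapsed / (2L * ROUNDS));
    }
}
```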

~~~
hyperpape
[http://shipilev.net/blog/2014/nanotrusting-nanotime/](http://shipilev.net/blog/2014/nanotrusting-nanotime/)
By coincidence, this was reposted today. Findings: nanoTime probably has >=
15 ns latency and >= 30 ns granularity.

~~~
vardump
So do 1M ping-pongs give me sufficient resolution?

It's not like it makes sense to measure a single transaction, but rather,
say, 1M in a chain.

~~~
jinmingjian
Yes. As you are aware, it is hard to measure latency correctly, especially at
the nanosecond level.

------
dicroce
Nanosecond? I don't think so... unless by "nanosecond latency" you mean a
latency that is representable in nanoseconds... but that's not what it means
to me. Tens or hundreds of microseconds I would buy, but not nanoseconds.

~~~
vardump
If you busy poll, getting to 50 ns latency from producer to consumer is easy,
and 20 ns is possible. The best I've gotten with a bare bones optimized test
was about 16 ns. You tend to get the best results with high-frequency
dual-core CPUs with hyper-threading disabled.

The bigger issue is that the reader needs to busy-poll, consuming 100% CPU
time on one core. Maybe energy consumption could be reduced by using
MONITOR/MWAIT - not sure if that's possible from user mode or only in the
kernel.

Another issue in this implementation is the use of compare-and-swap. For this
purpose, fetch-and-add (LOCK XADD on x86) would be more efficient:
multiple-writer contention is _much_ worse with CAS (see the sketch below).
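
A sketch of the difference (class and field names invented):
AtomicLong.getAndIncrement() compiles down to LOCK XADD on x86 and completes
in one atomic op regardless of contention, while the CAS loop may retry
indefinitely when many writers collide.

```java
import java.util.concurrent.atomic.AtomicLong;

public class SlotClaim {
    final AtomicLong writeSequence = new AtomicLong();

    // fetch-and-add: wait-free, one atomic instruction
    long claimWithFaa() {
        return writeSequence.getAndIncrement();
    }

    // compare-and-swap: retries under multi-writer contention
    long claimWithCas() {
        long current;
        do {
            current = writeSequence.get();
        } while (!writeSequence.compareAndSet(current, current + 1));
        return current;
    }
}
```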

Record size is also fixed; you can't mix messages of different sizes.

Overhead per message is 12 bytes: two ints that can each be either 0 or 1,
and an int called Metadata. Seeing that the commit and rollback fields each
need only one byte, 2 bytes of overhead for commit and rollback should have
been enough. You're going to get false sharing whether it's an int or a byte.

~~~
dicroce
Ugh. Polling? You better have serious throughput to justify that....

~~~
rdtsc
> You better have serious throughput to justify that....

It is often about latency, not just throughput, although sometimes they go
hand in hand. For example, you can achieve pretty high throughput if you take
the whole network stack out of the kernel and talk directly to the network
card.

[http://www.intel.com/content/dam/www/public/us/en/documents/...](http://www.intel.com/content/dam/www/public/us/en/documents/presentation/dpdk-packet-processing-ia-overview-presentation.pdf)

But I've seen this spin polling done when latency needed to be optimized.

