
Persistence modules for the invesdwin-context module system - gsubes
https://github.com/subes/invesdwin-context-persistence/blob/master/README.md
======
Animats
Yes, Linux pipes suck as an IPC mechanism. I'd like to see a benchmark using
QNX message passing on the same hardware.

~~~
rdtsc
Doesn't Linux have something similar: System V messages, or even Unix datagram
sockets, say, using named paths for queue names?

~~~
Animats
System V messages weren't bad, but they aren't used much.

The real trick is tight IPC and CPU scheduling integration. You want a send
from process A to process B to result in an immediate transfer of control from
process A to process B, preferably on the same CPU. The data you just sent is
in the CPU's cache. QNX is one of the few OSs where somebody thought about
this.

With unidirectional or pipe-like IPC, the sender sends, which unblocks the
receiver, but the sender doesn't block. So the OS can't just toss control to
the receiver. The receiver goes on the ready-to-run list and, quite likely,
another CPU starts running it. Meanwhile, the sending process runs for a short
while longer and then typically blocks reading from some reply pipe/queue. It
takes two extra trips through the scheduler that way. Worse, if the CPU is
busy, sending a message can put you at the end of the line for CPU time, which
makes for awful IPC latency under load.

It's one of the classic mistakes in microkernel design.

~~~
rdtsc
> With unidirectional or pipe-like IPC, the sender sends, which unblocks the
> receiver, but the sender doesn't block. So the OS can't just toss control to
> the receiver.

Interesting. I remember looking at the API, but I just didn't have enough
experience or context then to dig deeper and answer those questions. I stayed
away from message queues and opted for shared memory, mostly because they
seemed obscure and I was afraid I would hit some corner-case bug and be stuck
on my own debugging low-level kernel code.

> You want a send from process A to process B to result in an immediate
> transfer of control from process A to process B, preferably on the same CPU.

I can see a message-passing centric system having some specific optimizations
in the scheduler. Say, once a few messages have been sent, a DAG forms of
which senders send to which receivers. Sorting that DAG with a topological
sort might be interesting, then making scheduling decisions based on it. That
is, if sender1 sends a message to receiver1 and receiver1 then sends to
receiver2, maybe it is more efficient to run them in that order -- sender1,
receiver1, receiver2.

I saw that done in a realtime system that processed low-latency data. The
graph was static, but this sorting trick sometimes allowed processing data
with the latency of only one frame.
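
Roughly what I have in mind, as a toy Java sketch (the names are made up, and
it assumes the observed send graph is acyclic):

    import java.util.*;
    
    // Toy sketch: derive a run order from an observed sender -> receiver graph
    // via Kahn's topological sort. Assumes the graph is a DAG.
    public class MessageDagOrder {
        public static List<String> runOrder(Map<String, List<String>> sendsTo) {
            Map<String, Integer> inDegree = new HashMap<>();
            sendsTo.forEach((from, tos) -> {
                inDegree.putIfAbsent(from, 0);
                for (String to : tos) {
                    inDegree.merge(to, 1, Integer::sum);
                }
            });
            Deque<String> ready = new ArrayDeque<>();
            inDegree.forEach((node, deg) -> { if (deg == 0) ready.add(node); });
            List<String> order = new ArrayList<>();
            while (!ready.isEmpty()) {
                String node = ready.poll();
                order.add(node);
                for (String to : sendsTo.getOrDefault(node, List.of())) {
                    if (inDegree.merge(to, -1, Integer::sum) == 0) {
                        ready.add(to);
                    }
                }
            }
            return order;
        }
    
        public static void main(String[] args) {
            Map<String, List<String>> dag = Map.of(
                    "sender1", List.of("receiver1"),
                    "receiver1", List.of("receiver2"));
            System.out.println(runOrder(dag)); // [sender1, receiver1, receiver2]
        }
    }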

~~~
Animats
There's a discussion of this in this old QNX design document.[1] See the IPC
section. Integration between scheduling and message passing is essential if
you're making lots of little IPC calls. When control passes from one process
to another via an IPC call, the receiving process temporarily inherits the
priority and remaining CPU quantum of the caller. This makes it work like a
subroutine call for scheduling purposes - nobody goes to the back of the CPU
queue. So IPC calls don't incur a scheduling penalty.

[1]
[http://www.qnx.com/developers/docs/6.3.0SP3/neutrino/sys_arc...](http://www.qnx.com/developers/docs/6.3.0SP3/neutrino/sys_arch/kernel.html#NTOIPC)

------
faragon
TL;DR: If I understood it correctly, it describes the performance differences
when a single thread does the JNI calls to LevelDB and requests from the other
threads are serialized and sent to it over different IPC mechanisms.

I would like to see where the time is spent, e.g. whether pipe communication
is slow because of small requests, because of how the serialization is
implemented (e.g. time spent on spin locks and mutexes), etc.

~~~
gsubes
Our tests showed that even with larger messages (100k price ticks per request)
pipes were still an order of magnitude slower.

------
sctb
We updated the title from “Memory Mapping 15x faster than Named Pipes and 2x
faster than Queue (Java)”, which was referring to these benchmarks from the
project page:

    
    
      ArrayDeque (synced)       Records:   127.26/ms  in  78579 ms    => ~50% slower than Named Pipes
      Named Pipes on TMPFS      Records:   263.80/ms  in  37908 ms    => why ~5% slower on TMPFS?
      Named Pipes               Records:   281.15/ms  in  35568 ms    => using this as baseline
      SynchronousQueue (fair)   Records:   924.90/ms  in  10812 ms    => ~3 times faster than Named Pipes
      LinkedBlockingQueue       Records:  1988.47/ms  in   5029 ms    => ~7 times faster than Named Pipes
      Mapped Memory             Records:  3214.40/ms  in   3111 ms    => ~11 times faster than Named Pipes
      Mapped Memory on TMPFS    Records:  4237.29/ms  in   2360 ms    => ~15 times faster than Named Pipes

~~~
gsubes
Why these title changes?!? Does one really have to write separate blog posts
with the same title so they don't get changed?

~~~
sctb
From the guidelines:

> _Otherwise please use the original title, unless it is misleading or
> linkbait._

The long-standing policy is to represent the submitted content as accurately
as possible and let readers pick out what's interesting to them, not what the
submitter found interesting. A comment in the thread is a fine place to call
such things out if a blog post is overkill.

------
rdtsc
Shared memory is fast but dangerous if not done right; there are lots of
opportunities for data races. I used it for low-latency data access, and it
took an inordinate amount of time to debug. With the hardware at the time
there was no other alternative given the constraints.

Two other things that would be fun to benchmark are Unix sockets and System V
IPC messages (does anyone use those? Probably the most obscure IPC around
these days). Hmm, maybe some of those are already used behind the scenes by
some of the Java methods described.
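
If someone wants to add Unix sockets to such a benchmark: newer JDKs (16+)
expose Unix domain sockets directly through NIO, so no JNI is needed. A rough
sketch of a single round trip (socket path and payload made up):

    import java.io.IOException;
    import java.net.StandardProtocolFamily;
    import java.net.UnixDomainSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.ServerSocketChannel;
    import java.nio.channels.SocketChannel;
    import java.nio.file.Files;
    import java.nio.file.Path;
    
    // Toy sketch of a Unix domain socket round trip (requires JDK 16+).
    public class UnixSocketPing {
        public static void main(String[] args) throws Exception {
            Path path = Path.of("/tmp/ipc-bench.sock");
            Files.deleteIfExists(path);
            UnixDomainSocketAddress addr = UnixDomainSocketAddress.of(path);
    
            try (ServerSocketChannel listener =
                     ServerSocketChannel.open(StandardProtocolFamily.UNIX)) {
                listener.bind(addr);
    
                Thread echoServer = new Thread(() -> {
                    try (SocketChannel peer = listener.accept()) {
                        ByteBuffer buf = ByteBuffer.allocate(8);
                        while (peer.read(buf) > 0) { // echo back whatever arrives
                            buf.flip();
                            peer.write(buf);
                            buf.clear();
                        }
                    } catch (IOException e) {
                        throw new RuntimeException(e);
                    }
                });
                echoServer.start();
    
                try (SocketChannel client = SocketChannel.open(addr)) {
                    ByteBuffer buf = ByteBuffer.allocate(8);
                    buf.putLong(42).flip();
                    client.write(buf);   // send one record
                    buf.clear();
                    client.read(buf);    // blocks until the echo comes back
                }
                echoServer.join();
            } finally {
                Files.deleteIfExists(path);
            }
        }
    }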

------
prodigal_erik
Where does the speed advantage come from? Is it because LinkedBlockingQueue
doesn't spin-wait before giving up and rescheduling the consumer's thread?

~~~
gsubes
I am wondering about this myself; even SynchronousQueue (where the spin wait
was taken from) is slower than Mapped Memory. Though ASpinWait spins an order
of magnitude longer before sleeping, so maybe that is the difference. Or the
fact that Memory Mapping goes off-heap and instantiates no objects during the
transfer.
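
Not the actual implementation, but the basic idea of the mapped memory
transport boils down to roughly this (toy single-producer/single-consumer
sketch; a real version needs proper memory barriers, e.g. via Unsafe or
VarHandles, rather than plain ByteBuffer accesses):

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    
    // Rough idea only: a single-producer/single-consumer slot in a memory-mapped
    // file. Byte 0 is a flag (0 = empty, 1 = full), bytes 8..15 hold the payload.
    // Both processes map the same file; the transfer itself allocates no objects.
    public class MappedSlot {
        private final MappedByteBuffer buf;
    
        public MappedSlot(String file) throws Exception {
            try (RandomAccessFile raf = new RandomAccessFile(file, "rw");
                 FileChannel ch = raf.getChannel()) {
                // the mapping stays valid after the channel is closed
                buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, 16);
            }
        }
    
        public void write(long value) {
            while (buf.get(0) != 0) { Thread.onSpinWait(); } // spin until slot is empty
            buf.putLong(8, value);
            buf.put(0, (byte) 1);                            // publish
        }
    
        public long read() {
            while (buf.get(0) != 1) { Thread.onSpinWait(); } // spin until slot is full
            long value = buf.getLong(8);
            buf.put(0, (byte) 0);                            // release
            return value;
        }
    }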

------
optforfon
I tried to figure out how to do memory mapping with Boost once... It was
complicated. I wish there were a good cross-platform solution that was clear
and easy to use.

~~~
jschwartzi
Are you talking about just mapping a file into memory? Because there are only
two function calls you need to do something like that in POSIX:

open(), and then mmap().

If you're talking about POSIX shared memory, you can do that with shm_open().
The only thing you have to do is have both processes use the same name for the
shared memory area. Additionally, you can use POSIX named semaphores as a
synchronization primitive.

It's pretty easy to wrap these functions up in a C++ class. You could
conceivably share an entire C++ class between two processes using these
primitives.
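
And if cross-platform is the goal, the Java analogue of open()+mmap() is a
single FileChannel.map() call (putting the file on a tmpfs like /dev/shm gets
you roughly what shm_open() gives you on Linux). A minimal sketch, path and
size made up:

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    
    // Rough Java analogue of open()+mmap(): both processes open the same path
    // and end up sharing the same physical pages.
    public class SharedRegion {
        public static MappedByteBuffer map(String path, int size) throws Exception {
            try (RandomAccessFile raf = new RandomAccessFile(path, "rw");
                 FileChannel ch = raf.getChannel()) {
                return ch.map(FileChannel.MapMode.READ_WRITE, 0, size);
            }
        }
    
        public static void main(String[] args) throws Exception {
            MappedByteBuffer region = map("/dev/shm/example-region", 4096);
            region.putLong(0, 42L); // visible to any other process mapping the same file
        }
    }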

~~~
faragon
"You could conceivably share an entire C++ class between two processes using
these primitives."

Not really. Maybe just memory-mapping a plain class/struct, without virtual
function tables, etc. (pointers are not necessarily valid from process to
process).

