
Rationale: Or why am I bothering to rewrite nanomsg? - aleksi
https://nanomsg.github.io/nng/RATIONALE.html
======
elevation
From the article:

> nanomsg has dozens of state machines, many of which feed into others, such
> that tracking flow through the state machines is incredibly painful.

> Worse, these state machines are designed to be run from a single worker
> thread.

Despite the negative tone, the author gives me the impression that nanomsg has
a simple, consistent architecture that just needs to be documented better or
perhaps refactored.

State machines are useful for precisely the reasons the author states: their
behavior and performance are easy to reason about, and they enable concurrent
processing of multiple tasks when you're limited to a single execution
thread.

The simplicity of state machines makes them useful for secure code or embedded
processes where debug visibility can be poor. Embedded environments like
microcontrollers also benefit from the single-thread concurrency, but it can
also be handy on more capable OSes if you want to avoid the latency of fork()
or pthread_create().

State machines are a really useful tool; there are worse complaints you could
make about a code base.
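
To make the single-thread concurrency point concrete, here's a minimal sketch
(hypothetical connection states and events, nothing from nanomsg itself): one
thread advances whichever task has an event pending.

    #include <stdio.h>

    typedef enum { ST_IDLE, ST_READING, ST_WRITING, ST_CLOSED } state_t;
    typedef enum { EV_READABLE, EV_WRITABLE, EV_EOF } event_t;

    typedef struct { int fd; state_t state; } conn_t;

    /* The whole behavior is one pure transition function. */
    static void step(conn_t *c, event_t ev) {
        switch (c->state) {
        case ST_IDLE:
            if (ev == EV_READABLE) c->state = ST_READING;
            break;
        case ST_READING:
            if (ev == EV_WRITABLE) c->state = ST_WRITING;
            else if (ev == EV_EOF) c->state = ST_CLOSED;
            break;
        case ST_WRITING:
            if (ev == EV_EOF) c->state = ST_CLOSED;
            break;
        case ST_CLOSED:
            break;
        }
    }

    int main(void) {
        /* One thread, three concurrent "tasks". */
        conn_t conns[3] = {{3, ST_IDLE}, {4, ST_READING}, {5, ST_WRITING}};
        step(&conns[0], EV_READABLE);
        step(&conns[1], EV_EOF);
        step(&conns[2], EV_EOF);
        for (int i = 0; i < 3; i++)
            printf("fd %d -> state %d\n", conns[i].fd, (int)conns[i].state);
        return 0;
    }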

~~~
im_down_w_otp
I came to the same conclusion more or less.

Does Go provide meaningful abstractions for writing and managing state
machines? Using gen_fsm or gen_statem in Erlang is a critical tool for
writing software that _must_ follow some known protocol correctly. Likewise,
session types in Rust go one better than what I get in Erlang (i.e.
compile-time assurances vs. runtime ones).

So I was left thinking that Go may be missing meaningful abstractions or
facilities for state-machine modelling. Or perhaps more of nanomsg needs to
lean even further on state machines to define and operate its internal
machinery (e.g. the `inproc` raciness mentioned)?

In either case I was confused as to how the conclusion was that state machines
were making everything more confusing and less deterministic, because the
point of them is the opposite of that. However, in languages or tools which
have poor support for working with them, I'm sure ad-hoc interaction with them
can be obfuscating and confusing.

~~~
Matthias247
> Does Go provide meaningful abstractions for writing and managing state
> machines?

Not really, no more than C. I think in Go the need is somewhat reduced, since
you don't have to use state machines for actually-linear control flow (like
making an HTTP request or reading data from a socket without being
interrupted).

However, there is still a need for state machines, especially when the
control flow isn't linear (more than one kind of event can trigger state
changes). In that case we are mostly back to things like switch/case
statements.

There are some interesting patterns for running state machines in separate
goroutines and letting threaded code interact with them. We can e.g. see
those in Go's HTTP/2 implementation (
[https://github.com/golang/net/blob/master/http2/server.go](https://github.com/golang/net/blob/master/http2/server.go)
). In that code the serve() method is basically a giant state machine that
runs on its own thread. However, that's definitely not easy code to write,
understand, and maintain. A single-threaded state machine is probably much
easier for an average programmer to grasp.

------
blattimwind
I have to admit, after all these years, that I take everything coming from
that general direction with a huge grain of salt. Crossroads I/O was supposed
to be the great zmq successor and failed entirely; nanomsg was supposed to be
an even better redesign of zmq and failed; and now nanomsg-ng is supposed to
be an even better design iteration on nanomsg.

Meanwhile the old/bad/bloated/poorly designed zmq just kept working fine all
these years and even got a bunch of useful new features along the way.

~~~
tel
Because ZMQ is vastly better managed from a human POV. The energy required to
make a project like this succeed surpasses the individual.

~~~
sevensor
I think that's the important insight behind the success of ZMQ. Pieter
Hintjens understood that a software library had a social as well as a
technical context. ZMQ was built to be a delight to work with, not just a
technical achievement. I have lots of respect for Sústrik, but I get the
impression that he's willing to sacrifice everything else to have the right
implementation.

Really, I think it takes both a Hintjens and a Sústrik to make something
great. (Or a Lennon and a McCartney, a Jobs and a Wozniak, a Rodgers and a
Hart.) There's a tension between giving people something with integrity, and
giving people something they'll love.

~~~
hood_syntax
After reading a lot of writing by both Hintjens and Sústrik I came to the same
conclusion. It's such a shame they fell out, I feel like with a little
compromise on both sides they could have come to a workable place. There would
have been heads butting and voices being raised (figuratively) but a lot of
great work can be done in a place like that. The problem is that it's tiring
to fight those battles on a daily basis, and it gets old really fast.

------
nitwit005
> Sadly, this initial effort, while it worked, scaled incredibly poorly — even
> so-called "modern" operating systems like macOS 10.12 and Windows 8.1 simply
> melted or failed entirely when creating any non-trivial number of threads.
> (To me, creating 100 threads should be a no-brainer, especially if one
> limits the stack size appropriately. I'm used to being able to create
> thousands of threads without concern.)

Both work just fine with hundreds of threads, and both offer built-in thread
pools, for that matter. I have 345 processes running on OS X right now.

~~~
kev009
I think this is some kind of psychological projection/rationalization because
the statement "Having been well and truly spoiled by illumos threading (and
especially illumos kernel threads)" is actually quite laughable.

------
dmitrygr

       > even so-called "modern" operating systems
       > like macOS 10.12 and Windows 8.1 simply
       > melted or failed entirely when creating
       > any non-trivial number of threads. (To
       > me, creating 100 threads should be a no-brainer
    

This is nonsense. The NT kernel happily creates thousands of threads. I just
made a little test app to try; no issues at all.
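
The shape of such a test is a few lines (a sketch, not the actual test app;
small reserved stacks, as the article suggests):

    #include <windows.h>
    #include <stdio.h>

    static DWORD WINAPI idle_thread(LPVOID arg) {
        (void)arg;
        Sleep(10000);                       /* just park the thread */
        return 0;
    }

    int main(void) {
        int created = 0;
        for (int i = 0; i < 5000; i++) {
            HANDLE h = CreateThread(NULL, 64 * 1024, idle_thread, NULL,
                                    STACK_SIZE_PARAM_IS_A_RESERVATION, NULL);
            if (h == NULL) break;
            CloseHandle(h);
            created++;
        }
        printf("created %d threads\n", created);
        return 0;
    }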

~~~
farazbabar
Modern operating systems are very efficient at scheduling threads with
negligible workloads, compared to threads that are performing under load. I
suspect that most toy test code, and the default workload of the thousands of
threads an operating system has running at any given time, are simply not
CPU-intensive enough to matter.

Compared to this, a library designed to perform actual work using hundreds of
threads on a modern OS with, let's say, 8 CPU cores simply won't work, due to
the constant context-switching overhead. Consider the LMAX Disruptor as an
example of this behavior: developers need to configure the number of threads
to equal the number of available CPUs if they wish to do busy-wait processing
of inbound messages on the ring buffer. Chronicle Queue actually uses JNI to
pin message-processing threads to individual CPU cores, to make sure
operating-system scheduling does not end up flushing L1/L2 caches and that
message processing for a given queue continues to occur on the same
thread/CPU combo.
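
The pinning trick itself is only a few lines on Linux - a sketch using
pthread_setaffinity_np (Chronicle's actual JNI code will differ):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    /* Restrict the calling thread to a single core. */
    static int pin_to_core(int core) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    int main(void) {
        if (pin_to_core(2) == 0)
            printf("pinned to core 2\n");
        return 0;
    }

Compile with -lpthread; after the call the scheduler can no longer migrate
the thread off that core, so its L1/L2 contents stay put.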

So it depends on your use-case and application.

~~~
otabdeveloper1
There's no such thing as a "context switching overhead". You don't understand
how operating systems work.

The OS context switches to the next available thread at fixed time intervals.
It doesn't matter if you have 5 threads or 5000, the number of context
switches is the same.

The only difficulty is deciding which thread is the "next available" one. This
is the job of the process scheduler.

Normally, the process scheduler is a complex heuristic affair that tries to
pick a more deserving thread based on some black magic rules of thumb. Under
certain loads this black magic can backfire, picking some threads more
frequently while letting others starve.

If you know your workload, you can use a scheduler without magic heuristics.
These kinds of schedulers are called "real time". Try them, you'll be
surprised.
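
For instance, on Linux you can opt into SCHED_FIFO with a couple of calls (a
sketch; needs root or CAP_SYS_NICE):

    #include <sched.h>
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        struct sched_param sp = { .sched_priority = 50 };
        /* 0 = the calling process; FIFO = run until you block or yield */
        if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
            fprintf(stderr, "sched_setscheduler: %s\n", strerror(errno));
            return 1;
        }
        printf("now running under SCHED_FIFO, priority 50\n");
        return 0;
    }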

~~~
viraptor
> There's no such thing as a "context switching overhead".

So what do you call the time spent changing the system's internal information
about the currently executing thread, switching page tables via cr3 and the
invalidation that causes, and the time wasted on cache misses (although a lot
of that can be reclaimed with CPU pinning)? Because that's what people
normally call "context/thread-switching overhead".

> The OS context switches to the next available thread at fixed time
> intervals. It doesn't matter if you have 5 threads or 5000, the number of
> context switches is the same.

Not just at fixed intervals. If your threads are doing IO work, more threads
will mean more context switches, because they will issue IO requests that
force waits/yields. If they do pure-CPU work that doesn't apply, but we're
mostly talking about networking in these comments.

~~~
ezdiy
The TLB is only an indirect cause. That's because the kernel scheduler
preempts processes fairly infrequently (100 or 1000 Hz, or dynamic, but still
capped to a small number).

Scheduling quanta are so large precisely to keep the TLB-flush overhead of a
context switch low. If the network mandates more interaction (say, 100k req/s
across all workers), each quantum tick must process the bundle of ~1000
requests which piled up while the worker was asleep. This works as designed:
you're supposed to use up all of your quantum, not terminate it early by
issuing blocking IO per request. One prerequisite for this is that your
network/disk protocol _must_ be pipelineable (most are, because that's how we
deal with network/seek latencies).

But at a certain point the overhead of this pipelining itself becomes so
great (message queues too deep) that you have to switch to threading.

Hardcore threading advocates, on the other hand, need to account for the
overhead of atomics (for locking, or for "lockless" algorithms). An atomic
operation must wait for all pending write-backs to flush. Threading gets a
lot of bad rep not because "kernels suck at it", but because the person
making such a statement wrote their program as an exercise in lock contention
and/or too much write-cache pollution per single atomic.

The threading-vs-process tradeoff = deep-pipeline overhead vs. frequent
queue-flush+locking overhead.

Typically, you need to meet somewhere in the middle for best performance,
which is when you end up with threads with job queues - those basically
emulate process-induced queues within the thread model.
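
A sketch of that middle ground: a worker thread draining a (hypothetical) job
queue in batches, paying for the lock once per batch instead of once per item:

    /* Compile with -lpthread. No overflow handling - it's a sketch. */
    #include <pthread.h>
    #include <stdio.h>

    #define QCAP 1024

    static int q[QCAP], qhead, qtail;
    static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  qcond = PTHREAD_COND_INITIALIZER;

    static void push(int job) {
        pthread_mutex_lock(&qlock);
        q[qtail++ % QCAP] = job;
        pthread_cond_signal(&qcond);
        pthread_mutex_unlock(&qlock);
    }

    static void *worker(void *arg) {
        (void)arg;
        for (;;) {
            int batch[QCAP], n = 0;
            pthread_mutex_lock(&qlock);
            while (qhead == qtail)
                pthread_cond_wait(&qcond, &qlock);
            while (qhead != qtail)        /* drain the whole queue at once */
                batch[n++] = q[qhead++ % QCAP];
            pthread_mutex_unlock(&qlock);
            for (int i = 0; i < n; i++) {
                if (batch[i] < 0) return NULL;  /* poison pill: stop */
                printf("job %d\n", batch[i]);
            }
        }
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, worker, NULL);
        for (int i = 0; i < 10; i++) push(i);
        push(-1);
        pthread_join(t, NULL);
        return 0;
    }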

~~~
dmitrygr

       > Threading gets a lot of bad rep
       > not because "kernels suck at it",
       > but because person making such a
       > statement wrote their program as
       > an exercise in lock contention
    

Well put. I'm going to have this printed on a plaque and hung above my desk.

------
nfoz
Do people generally consider nanomsg to be a "failed experiment"? Is anyone
using it for their projects?

The author's tone makes it seem like I shouldn't use nanomsg. nanomsg in turn
makes it seem like I shouldn't use ZeroMQ.

So what would people recommend I use right now for a project? Are these
issues all that serious?

My intended use would be for a simple friendly pub-sub API for programs to
talk to each other, locally or across a network.

------
rumcajz
Original author of zmq/nanomsg here.

After all those years dealing with the problem of implementing network
protocols I believe that this entire tangle of problems exists because we are
dealing with something like 35 years of legacy in two different but subtly
interconnected areas: concurrency/parallelism and network programming APIs.

The area of concurrency/parallelism started quite reasonably with the idea of
processes. But then, at some point, people felt that processes were too
heavyweight and introduced threads (I'm still trying to find out who the
culprit is, but it looks like they've covered their tracks well). When even
threads became too heavyweight, people turned to all kinds of callback-driven
architectures, state-machine-driven architectures, coroutines, goroutines etc.
Now we have all of those and we are supposed to make them work together
flawlessly, which is an enterprise doomed from the beginning.

On the network-programming side, BSD sockets (introduced in 1983) are the
only common ground we have. They are long past their expiry date and they
don't adapt to many use cases, but there's no alternative. There are more
modern APIs out there but, AFAICS, none of them provides enough added value
on top to become the new universal standard.

It should also be said that the creation of new universal APIs is hindered by
a host of weird network-protocol designs out there in the wild. The API
designer faces a dilemma: either they go for a sane API and rule at least
some weird protocols out, or they try to support everything and end up with
one mess of an API. Not a palatable choice to make.

Then there's the area where the two problems interact. Originally, you were
supposed to listen for incoming TCP connections, fork a new process for each
one, and access the socket from a simple single-threaded program with no
state machines, using only blocking calls. Today, you are supposed to have a
pool of worker threads, listen on file descriptors using poll/epoll/kqueue,
then schedule the work onto the worker pool by hand. This raises the
complexity of any network protocol implementation by a couple of orders of
magnitude. Also, you get a lot of corner cases and undefined behaviour,
especially at shutdown, plus weird performance characteristics, and I am not
even speaking of the increased attack surface.
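
For contrast, that original model fits in a page - a sketch of the
fork-per-connection echo server, blocking calls only:

    #include <netinet/in.h>
    #include <signal.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void) {
        int ls = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in a = {0};
        a.sin_family = AF_INET;
        a.sin_port = htons(5555);           /* INADDR_ANY by default */
        bind(ls, (struct sockaddr *)&a, sizeof(a));
        listen(ls, 16);
        signal(SIGCHLD, SIG_IGN);           /* auto-reap children */
        for (;;) {
            int c = accept(ls, NULL, NULL);
            if (fork() == 0) {              /* child: simple linear code */
                char buf[512];
                ssize_t n;
                close(ls);
                while ((n = read(c, buf, sizeof buf)) > 0)
                    write(c, buf, n);       /* echo it back */
                _exit(0);
            }
            close(c);                       /* parent keeps listening */
        }
    }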

All in all, it's a miracle that with the tools we have we are able to write
any network applications at all.

These days I am working on attacking the issue on both fronts. On the
concurrency side it's [http://libdill.org](http://libdill.org) -- essentially
not very interesting, just a reimplementation of goroutines for C. However,
what's worth looking at is the idea of "structured concurrency", a system for
managing the lifetimes of coroutines in a systematic manner:
[http://libdill.org/structured-concurrency.html](http://libdill.org/structured-concurrency.html)
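
A sketch of the shape (see the site for the authoritative API):

    #include <libdill.h>
    #include <stdio.h>

    coroutine void worker(int i) {
        msleep(now() + 100);
        printf("worker %d done\n", i);
    }

    int main(void) {
        int b = bundle();             /* owns the coroutines launched below */
        for (int i = 0; i < 3; i++)
            bundle_go(b, worker(i));
        bundle_wait(b, now() + 1000); /* join them all, with a deadline */
        hclose(b);                    /* closing cancels any stragglers */
        return 0;
    }

The point is that no coroutine can outlive the bundle that launched it, so
coroutine lifetimes nest the way lexical scopes do.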

On the other front, network programming, I am trying to put together a
proposal for a revamp of the BSD socket API. The goal is to make it possible
to layer many protocols on top of each other, as well as one alongside the
other. It's a work in progress, so take it with a grain of salt:
[https://raw.githubusercontent.com/sustrik/dsock/master/rfc/sock-api-revamp-01.txt](https://raw.githubusercontent.com/sustrik/dsock/master/rfc/sock-api-revamp-01.txt)

~~~
spacenick88
So for me the really big question in all of this is "Are threads really too
heavyweight?". This obviously needs the constraint "on a sane, modern OS".

For me the most sane C (non-datagram) networking model, at least on Linux, is
threads, each calling accept() concurrently (AFAIK few people know this is
supported) and then handling the accepted connection until it is closed. For
systems where you only want to handle a fixed number of connections, like
databases, you keep your number of threads fixed; for others, you start a new
thread once every existing thread is already handling a connection. It gets
rid of thread pools (since you only call pthread_create() when your number of
concurrent connections increases), of async- and callback-hell, and makes all
your handling code linear.
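
A sketch of that model - a fixed pool of eight threads, all blocked in
accept() on the same listening socket (the kernel hands each new connection
to exactly one of them):

    #include <netinet/in.h>
    #include <pthread.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static void *serve(void *arg) {
        int ls = *(int *)arg;
        for (;;) {
            int c = accept(ls, NULL, NULL); /* safe to call concurrently */
            char buf[512];
            ssize_t n;
            while ((n = read(c, buf, sizeof buf)) > 0)
                write(c, buf, n);           /* linear handling code */
            close(c);
        }
    }

    int main(void) {
        int ls = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in a = {0};
        a.sin_family = AF_INET;
        a.sin_port = htons(5555);
        bind(ls, (struct sockaddr *)&a, sizeof(a));
        listen(ls, 128);
        pthread_t t[8];
        for (int i = 0; i < 8; i++)         /* fixed pool, database-style */
            pthread_create(&t[i], NULL, serve, &ls);
        for (int i = 0; i < 8; i++)
            pthread_join(t[i], NULL);
        return 0;
    }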

Every other day I keep seeing "Blabla uses async epoll so it handles 10k
connections", but a) what serious work can you do with 10k connections, i.e.
125 kB/s per connection @ 10 Gbit/s, b) at what cost to the
readability/maintainability of the code, and c) are you sure you haven't just
moved your bottleneck somewhere else? Also, I've never seen any benchmark
showing how this actually beats threads on a modern Linux box.

As for shitty OSs I say fuck them

~~~
daurnimator
> "Blabla uses async epoll so handles 10k connections" but a) what serious
> work can you do with 10k connections i.e. 125 kB/s per connection @ 10
> Gbit/s.

One example where I've done+needed this is an XMPP server.

A much more common example is an HTTP server with lots of idle keepalive
connections.

------
e12e
Looks interesting, especially the ZeroTier transport. Although I wonder why
that's needed, and what it means: with ZeroTier you already have IPv4/6
connectivity - what's the benefit of going down below that?

Would that mean a "WireGuard transport" would make sense as a default secure
transport?

My other concern is that this starts to sound very big - have you been able to
maintain clear modularisation of the code?

~~~
kej
> with ZeroTier you already have IPv4/6 connectivity - what's the benefit of
> going down below that?

LibZT [1] provides a socket-like programming interface without requiring the
full ZeroTier software and its system-wide virtual interfaces. I could see
wanting to use something like nanomsg on top of that even though it's not an
actual socket implementation.

[1] [https://github.com/zerotier/libzt](https://github.com/zerotier/libzt)

------
daurnimator
> OpenSSL has its own struct BIO for this stuff, and I could not see an easy
> way to convert nanomsg's usock stuff to accommodate the struct BIO.

It's actually quite easy to write a custom BIO to work with your own state
machines and buffering. This could have been a 1-day project....

------
theincredulousk
This is at least partially ill-advised. It comes off as an expertly done but
same-old refactoring project that could be titled "I didn't understand this,
and would understand it better if it were designed based on my personal
preferences". This is reinforced by the probability, approaching 1, that
nobody needs another "six of one, half dozen of the other" message-queue
framework, and by the implied belief that the C++ library is somehow too
bloated for embedded environments. While that was true at one time, no
reasonably modern embedded system that requires multi-threading to "100s" of
threads, or uses 100s of live sockets for message-queue I/O, has to avoid C++
for being too heavy. This is just ZeroMQ alternative #N - not anything
objectively better, and certainly not "nano" for embedded systems. The "one
true messaging framework" is a unicorn - everyone feels like it should exist
but nobody can make it.

 _But for many cases this is not necessary. A simple callback mechanism would
be far better, with the FDs available only as an option for code that needs
them. This is the approach that we have taken with nng._

Replacing a state machine with callbacks... something something something
you're gonna have a bad time. Especially considering the gripes are about
readability, following control flow, and race conditions: callbacks are
objectively worse for all of those things. Control flow _is_ hard to read in
state-machine frameworks, because the primary flow is dictated by something
like "nextState(thisState, action)", so you can't follow it with code lookup.

The problem here (and almost always) is lack of documentation or
visualization (or picking the wrong abstraction level for the formally
defined states). The beauty is that the definition is almost by default
naturally easy to parse (tables of states in a header file, etc.). It takes
some extra effort, but it is a one-shot job to write something that generates
graphviz state charts or something similar from the state-table definition.
You could write a custom dot-syntax generator from a C-style table definition
in what, three hours? Doxygen already does this for much more complicated
stuff. Googling reveals this is nothing new:
[https://gist.github.com/freqlabs/24d88ad8e687891c970a69f16f1...](https://gist.github.com/freqlabs/24d88ad8e687891c970a69f16f1387b8)
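
A condensed sketch of that three-hour job (the table contents here are made
up):

    #include <stdio.h>

    typedef struct { const char *from, *event, *to; } transition_t;

    static const transition_t table[] = {
        { "IDLE",      "CONNECT", "HANDSHAKE" },
        { "HANDSHAKE", "OK",      "READY"     },
        { "HANDSHAKE", "FAIL",    "CLOSED"    },
        { "READY",     "CLOSE",   "CLOSED"    },
    };

    int main(void) {
        /* Emit graphviz dot straight from the transition table. */
        printf("digraph fsm {\n");
        for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
            printf("  %s -> %s [label=\"%s\"];\n",
                   table[i].from, table[i].to, table[i].event);
        printf("}\n");
        return 0;
    }

Pipe the output through dot -Tpng and the chart stays in sync with the code
for free.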

All that said, State Machines are (currently) the one true abstraction for a
given program because that is what a computer is, and every program is, to
begin with. If you're not using them explicitly, it just means you have a
poorly defined/documented state machine. Maybe someday there will be a better
model of computing, or a better way to model programs. For now, the human
brain isn't getting any better at keeping track of computer programs, and
anything more than single-threaded functional-style code is almost certainly
not any "absolute" improvement in readability.

I firmly believe that visualization has become necessary due to complexity,
and it is past time to embrace it. There is some stigma that visualization is
for fakers - "real programmers only use a bare text editor" - or that it is
for children learning programming, or for non-engineering folks that need
pictures because they're dumb. To be a bit hyperbolic: if we want any chance
at keeping up with "the machines", we're going to need a better
general-purpose, more workable abstraction than text files. There is no more
canonical example of this than the issues pointed out here - state machines
are the right abstraction for complex systems, but complex state machines are
incredibly difficult to follow in source code.

~~~
tuukkah
> _All that said, State Machines are (currently) the one true abstraction for
> a given program because that is what a computer is, and every program is, to
> begin with._

I don't understand these claims. To me, a computer "is" processors that step
through memory locations interpreting them as operations and operands - not a
state machine. Equally, hardly any program is a state machine.

Is a Haskell program a state machine, or "poorly defined"? I'd suggest other
models of computation such as typed lambda calculus provide a much better
basis for defining and documenting computer programs.

~~~
theincredulousk
The immediate side-effects of any program and the processor are _only_ to
read, manipulate, and write data. Data is stored in a finite memory whose
cells each hold the states 0 and 1.

Every single thing that can be represented or manipulated by a program is
just an array of 0s and 1s that is read, manipulated, and written. The values
in the array are the state, and everything a program is capable of doing only
moves it to a different state of memory.

Every single program ever written is a state machine; so is every microchip,
for that matter.

[https://en.wikipedia.org/wiki/Turing_machine](https://en.wikipedia.org/wiki/Turing_machine)

~~~
tuukkah
I see. I think your argument falls apart when you wave away the differences
(e.g. memory) between a finite state machine, a Turing machine and an actual
computer.

Theoretically, the Turing machine and lambda calculus are equivalent:
[https://en.m.wikipedia.org/wiki/Church–Turing_thesis](https://en.m.wikipedia.org/wiki/Church–Turing_thesis)

However, in practice, the specifics of programming languages are essential:
[https://en.m.wikipedia.org/wiki/Turing_tarpit](https://en.m.wikipedia.org/wiki/Turing_tarpit)

When you include the state of the memory in the state of the state machine,
the state space explodes, which is more than inconvenient.

On the other hand, in lambda calculus and Haskell, you don't even have to
think about state.

------
larrik
I think you meant "Rationale"

------
senatorobama
Can someone explain what's so special about these libraries that makes them
"better" than raw sockets?

~~~
braywill
They're not better than raw sockets, they're simply an abstraction layer on
top of them.

For example, building a message queuing service using raw sockets that works
on Windows, Mac OS, and Linux is quite the undertaking. With ZeroMQ (and I'm
assuming nanomsg as well), it's quite simple.

ZeroMQ and nanomsg are like raw socket toolboxes.
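
For example, the publisher side of a pub-sub feed is only a few lines with
nanomsg's documented pub/sub API (a sketch):

    #include <nanomsg/nn.h>
    #include <nanomsg/pubsub.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        int pub = nn_socket(AF_SP, NN_PUB);
        nn_bind(pub, "ipc:///tmp/feed.ipc"); /* or tcp://... over a network */
        /* A subscriber elsewhere would do:
             int sub = nn_socket(AF_SP, NN_SUB);
             nn_setsockopt(sub, NN_SUB, NN_SUB_SUBSCRIBE, "weather", 7);
             nn_connect(sub, "ipc:///tmp/feed.ipc");
             char *buf = NULL;
             nn_recv(sub, &buf, NN_MSG, 0);
             nn_freemsg(buf);                                          */
        for (;;) {
            const char *msg = "weather|sunny";
            nn_send(pub, msg, strlen(msg), 0);
            sleep(1);
        }
    }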

