
ZeroMQ - Disconnects are Good for You - mrud
http://lucumr.pocoo.org/2012/6/26/disconnects-are-good-for-you/
======
jrockway
Yeah, don't use REQ/REP.

I used to be a big ZeroMQ fan, but honestly, the library trades sanity for
performance and has collapsed in on itself because the code is no longer
maintainable. Last I checked, it's being completely rewritten in a different
programming language. Maybe the result will be fine, but the core features of
retrying connections and remembering dropped messages for clients that
disappear temporarily are easy enough to write yourself.

(I do like Mongrel2's use of that feature to allow me to make an HTTP request
and _then_ start the backend server. And the right place for this is a
library. It's just that ZeroMQ had too many features and too much code.)

~~~
cageface
_It's just that ZeroMQ had too many features and too much code._

Wasn't ZeroMQ supposed to be the simple, lightweight alternative?

I'm feeling better about my decision to ignore all the whole *MQ circus.

~~~
rumcajz
Too much code?

0MQ ~15k LOC
RabbitMQ ~107k LOC
AMQP Qpid ~542k LOC
ActiveMQ ~1160k LOC

~~~
newobj
I'm sorry, but comparing 0MQ to these other systems is caca doodoo. I'm telling
you, as a developer who has on multiple occasions considered using 0MQ, that it
lowers my trust in you as an evangelist for the platform to suggest that this
is a relevant or worthwhile comparison to make.

~~~
rumcajz
Consider my comment in context. The comment I was replying to says that 0MQ/XS
is not lightweight because it has too much code. That's simply not true, and
that's what I was alluding to.

------
tome
An implementation of this kind of usage pattern is already provided in the ZMQ
guide:

<http://zguide.zeromq.org/page:all#Client-side-Reliability-Lazy-Pirate-Pattern>

If the client has not received a response by the timeout, it should close the
connection itself and open a new one. Whether this is a suitable solution for
the blog author's issue I don't know, but it works well for RPC connections
with little or no state.

~~~
rarrrrrr
Indeed. ZMQ has other capabilities beyond REQ/REP exactly for this situation,
and helps you layer "patterns" on top of them.

I found working through all five chapters of the ZeroMQ guide unusually
educational. It's full of the wisdom of people writing message oriented
software for years, and includes frank discussions and solutions for several
of these performance and reliability situations. (Don't miss the adventures of
the suicidal snail in chapter 5!)

I found it worthwhile even to spend the time to work through all the examples
in both C and Python.

In the author's situation, the normal loop of the client shouldn't be to just
call blocking receive forever, as he discovered. Instead it should loop,
polling the socket with some reasonable timeout, and between iterations do
things like check for shutdown signals, parent process exiting, and the other
typical housekeeping tasks. Then you only call receive when poll has told you
there are messages waiting, and then you call it without blocking.

This sort of loop gives an obvious place to also integrate timeouts. You can
also watch multiple sockets. Blocking receive forever is appropriate for a
prototype sort of client but as things grow, generally more sophistication is
needed.
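Sketched with pyzmq (the function and parameter names and the timeout value
here are my own, not from the guide), that loop looks roughly like:

```python
import zmq

def client_loop(sock, handle, should_stop, timeout_ms=250):
    """Poll with a timeout instead of blocking in recv() forever."""
    poller = zmq.Poller()
    poller.register(sock, zmq.POLLIN)
    while not should_stop():
        events = dict(poller.poll(timeout_ms))
        if events.get(sock) == zmq.POLLIN:
            # poll said a message is waiting, so this recv cannot block
            handle(sock.recv(zmq.NOBLOCK))
        # between iterations: check shutdown flags, whether the parent
        # process is still alive, and any other housekeeping
```

Registering more sockets with the same poller is how you grow this from a
prototype into the multi-socket event loop described above.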

~~~
the_mitsuhiko
> In the author's situation, the normal loop of the client shouldn't be to
> just call blocking receive forever, as he discovered. Instead it should
> loop, polling the socket with some reasonable timeout, and between
> iterations do things like check for shutdown signals, parent process
> exiting, and the other typical housekeeping tasks. Then you only call
> receive when poll has told you there are messages waiting, and then you call
> it without blocking.

I think in my case the confusing behavior came from the fact that I started
using a project that used ZeroMQ, built part of its implementation on REP/REQ
sockets, and showed that behavior. Then I went to the ZeroMQ documentation,
and it does not present the REP/REQ examples with a caveat that they might
block the client if the "server" goes away unexpectedly.

------
m0th87
> This could probably be improved by having a background thread that uses a
> ZeroMQ socket for heartbeating.

Don't use heartbeats on REQ/REP, because they won't work well with the
lockstep communication style of those socket types. Also, you have to be
careful because ZeroMQ sockets are not thread-safe, so the background and
active threads must coordinate through a lock, or work in an implementation
that handles this implicitly for you.

In ZeroRPC, we solve this by using XREQ/XREP with heartbeats. This has worked
out pretty well in practice.

------
tcwc
Rather than polling, zeromq >= 2.2 allows you to set ZMQ_RCVTIMEO on the
socket, which seems to be what the author is after. It would be nice to be
notified of disconnected peers, but the timeout + retry approach has been good
enough for me.
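In pyzmq that looks like the sketch below. The port number is made up and
nothing is listening on it, which is exactly the point: send() still succeeds
because ZeroMQ only queues the message, and it's recv() that times out.

```python
import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.REQ)
sock.setsockopt(zmq.RCVTIMEO, 300)    # give up on recv() after 300 ms
sock.setsockopt(zmq.LINGER, 0)        # don't block on close with queued messages
sock.connect("tcp://127.0.0.1:5599")  # made-up port; nothing is listening here
sock.send(b"ping")                    # succeeds: the message is just queued
try:
    reply = sock.recv()               # raises zmq.Again when the timeout expires
except zmq.Again:
    reply = None                      # timed out: close the socket and retry
sock.close()
ctx.term()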

------
chubot
Great article. ZeroMQ had a "smell" that I couldn't put my finger on, and this
article kind of nailed it. In retrospect I guess the smell is that it tightly
couples both sides of the network to make performance claims. It sacrifices
robustness for performance.

I guess that it was developed for financial trading applications. Maybe it
will work fine for those -- you have a few machines and high network
connectivity between them. But people started doing "data center" stuff with
0MQ. Then you have geographical separation, and WAN latency and reliability.

------
hogu
I think the problem is that people come to zeromq expecting a high-level
library that handles all the details. Zeromq does not do that; you need to
handle reliability and disconnect behavior yourself. I agree that the default
behavior in this case could be saner, but it's pretty easy to build reliable
request-reply in many different ways, as illustrated in the guide, so I'm fine
with it.

The benefit, though, is that in zeromq you get to (and are forced to) choose
exactly how your messaging patterns are reliable (or not).

------
willvarfar
The better solution? That the 0mq libs do the right thing and don't get
wedged. It shouldn't be on the users of the API to handle this.

EDITED: my point is general; it should be the 0mq libs doing the timeouts and
keepalives and so on, and only pushing meaningful error handling like "the
server has gone away and cannot reconnect" back up to the user.

~~~
StavrosK
There's a problem when you restart servers at the wrong moment, though, as the
article mentions...

~~~
rdtsc
So ... in other words, there is a serious problem.

A server restart, a server crash, or a network outage, and now potentially
thousands of clients are in a bad state they cannot recover from. And this was
by design. That was the point of the post, I think. This isn't so much an
oversight as a bad design decision.

~~~
rarrrrrr
Agreed, but the bad design is not in ZMQ, but the way it is being used.

A client should never just wait forever for a response to a message. Any
reliable system has to implement something like timeouts or immediate message
acknowledgement (at which point maybe you can wait forever for a reply).

There's a comprehensive discussion of this in chapter 4 "Reliable Request
Reply" of the ZMQ guide.

TCP doesn't give you a guaranteed response either, just guaranteed delivery
(or an error). In this case, that is exactly what the author is getting with
ZMQ. The client's request socket makes a successful delivery, then the server
crashes before generating a response, and the client waits forever for a
response that will not come.

------
lucian1900
It seems to me that the better solution might be just using Twisted and
regular networking techniques.

~~~
StavrosK
So whenever there's a small problem with something, the solution is to discard
the whole thing and go down a layer?

I don't like some of Python's warts, but you don't see me writing assembly.

~~~
lucian1900
One could argue whether using Twisted as opposed to ZMQ is going down a layer
at all.

------
o1iver
Sure, the REQ/REP sockets are limited, especially because they force the
Request/Reply/Request/Reply/... sequence. I don't think any complex
applications use this. I recently built an application using DEALER/ROUTER
sockets, where you can send multiple requests without having to wait for
responses, etc. Additionally, no application should rely on receiving a
response; the poller he suggests solves this problem nicely (although I don't
think it's necessary to wrap it in send/recv methods, as pyzmq offers a nice
polling API).
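A minimal DEALER/ROUTER sketch in pyzmq (the socket names and the inproc
endpoint are mine, not from any particular application) shows the difference
from REQ/REP: several requests go out before any reply is read.

```python
import zmq

ctx = zmq.Context()

server = ctx.socket(zmq.ROUTER)   # ROUTER sees each peer's identity frame
server.bind("inproc://rpc")

client = ctx.socket(zmq.DEALER)   # DEALER, unlike REQ, doesn't enforce lockstep
client.connect("inproc://rpc")

# Pipeline three requests without waiting for any reply.
# The empty delimiter frame keeps the envelope REQ-compatible.
for i in range(3):
    client.send_multipart([b"", b"request-%d" % i])

# The server reads identity + delimiter + body and routes each reply back.
for _ in range(3):
    ident, empty, body = server.recv_multipart()
    server.send_multipart([ident, empty, b"reply-to-" + body])

replies = [client.recv_multipart()[1] for _ in range(3)]
```

Because ROUTER addresses replies by identity frame, the server is also free to
answer out of order or not at all, which is what makes timeouts workable here.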

------
stonemetal
_Carries messages across inproc, IPC, TCP, and multicast._

So when you are using actual sockets across the network, it uses TCP, and ZMQ
should be able to detect disconnects rather easily.

------
boothead
The solution I use is the one I mentioned to Armin on twitter:
<https://gist.github.com/2994781>

It's not really ideal that a synchronous connection doesn't have notification
of a connection failure, but this has been working fine for us for ages.

~~~
rumcajz
Armin is right that the timeout works, but it delays the signal about
connection failure from TCP. If anyone feels like implementing automatic
resend inside 0MQ/XS (in case of timeout or TCP connection failure), give it a
go and submit a patch. If no one submits the patch, I'll fix the problem once
I have more free time.

~~~
X-Istence
As far as I was aware, there was a patch like that available for ZMQ 3.0, but
it was nixed for 3.1. Same with a patch to deal with the issue of 1000 clients
on an XREP socket, where the first 999 disappear and thus don't need to have
their messages processed and should just be dropped on the XREP socket... that
was added in 3.0 and then reverted in 3.1.

I've worked around a lot of issues in ZeroMQ with retries, heartbeats, and
stuff like that, but it just feels kludgy, like I am writing code that should
be part of the library. I've looked at adding the functionality myself from
the old patchsets that were available, and neither was straightforward to
implement. The code (at least ZeroMQ's) is somewhat of a mess and difficult to
follow/understand, and I am saying that as a C++ programmer working on large
enterprise applications.

~~~
rumcajz
Yep. The patches were reverted because they broke wire-level compatibility
with older versions of 0MQ. The good news is that protocol versioning will be
available in the next release of crossroads.io, and thus patches like that can
be applied without breaking backward compatibility (the library will just
speak two different protocol versions).

As for the code quality, I would say the code is complex to the extent that
it's almost impossible to maintain and improve. (In case of need, here's an
overview of the architecture: <http://www.aosabook.org/en/zeromq.html>)
However, it's not a mess in the sense of being lousy. I've spent literally
months of my time cleaning it up. Although it may have gotten worse since I
started working on the fork.

~~~
X-Istence
I can understand the need for them to be reverted, although I honestly think
that the patches provided functionality that should have already existed.
Worst case, the patches could have stayed and been made available through
certain flags (much like the HWM, LWM, and others).

I wouldn't say that the code quality is lousy; au contraire, I think overall
the code quality is very good, just very difficult to figure out where
everything goes. I found it difficult to "brain map" it such that I could get
proficient at following what was going on while reading different sections of
the code.

Is there any effort under way in crossroads to make the notification
level-triggered and not just edge-triggered? For example, sticking ZeroMQ into
a libev loop is a pain because if two messages arrive at the same time I only
get notified once, and I have to loop through and zmq_recv(), which, if the
ZeroMQ backend is sending really fast, means I now starve all my other
sockets!
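The usual workaround (sketched here in pyzmq; the function name and the batch
cap are my own invention) is to re-check the ZMQ_EVENTS socket option after
every read and to bound the work done per wakeup:

```python
import zmq

def drain(sock, handle, max_batch=100):
    """Read until ZMQ_EVENTS stops signalling POLLIN or the cap is hit.

    One edge-triggered wakeup on the socket's FD can stand for many queued
    messages, so EVENTS has to be re-checked after each recv; the cap keeps
    a fast sender from starving every other socket in the event loop.
    """
    handled = 0
    while handled < max_batch and sock.getsockopt(zmq.EVENTS) & zmq.POLLIN:
        handle(sock.recv(zmq.NOBLOCK))
        handled += 1
    return handled
```

If the cap is hit, the loop should schedule another drain() pass itself rather
than wait for the FD, since no new edge will fire for the already-queued
messages.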

~~~
rumcajz
As for the options: Yes. There's a patch that supports protocol versioning (a
version socket option) already applied to crossroads.io. That in turn makes it
possible to improve the protocol without breaking backward compatibility.

As for code complexity: It's hard to brain-map for me as well, which is pretty
alarming given that I am the author :) I am working on a major rewrite of the
codebase (a simplification of the threading model), but that'll take some time
to finish.

As for level-triggering, that's a problem with POSIX, not 0MQ. There's no way
in POSIX to simulate a file descriptor in user space. The closest you can get
is eventfd(), which is a) Linux-specific and b) doesn't allow signaling both
!POLLIN and !POLLOUT at the same time :(

------
kephra
This badly reminds me of my MQSeries experience.

I wonder: is there any MQ that does not suck?

~~~
freyrs3
ZeroMQ isn't an MQ (Message Queue). It's a message-passing library. You can
use it to build message queues, though.

~~~
sausagefeet
I really despise the name ZeroMQ for this reason. The first time you hear it:
is it an MQ, and Zero is just a fun, clever name, or is it not an MQ? People
seem to go with the latter. Then what is it? Why define itself by what it
isn't?</rant>

------
shasty
The TCP stack takes care of this problem; this is an insane attempt at
post-mature optimization.

