I used to be a big ZeroMQ fan, but honestly, the library trades sanity for performance and has collapsed in on itself because the code is no longer maintainable. Last I checked, it's being completely rewritten in a different programming language. Maybe the result will be fine, but the core features of retrying connections and remembering dropped messages for clients that disappear temporarily are easy enough to write yourself.
(I do like Mongrel2's use of that feature to allow me to make an HTTP request and then start the backend server. And the right place for this is a library. It's just that ZeroMQ had too many features and too much code.)
The rewrite you mention is, I believe, Crossroads I/O ( http://www.crossroads.io/ ). Martin Sustrik was one of the creators of ZeroMQ to begin with.
He's got a writeup as to why he should have used C to begin with ( http://www.250bpm.com/blog:4 ).
1. As for lost requests, the right thing to do is to time out the request and re-send it if the response hasn't been received. This should be done inside 0MQ/XS; however, it's pretty easy to do in the application, so nobody so far has felt inconvenienced enough to implement the feature.
2. As for the implementation language C would have been better than C++, but that's only an implementation detail and of not much interest to 0MQ/XS users.
3. Finally, yes, the 0MQ/XS functionality should be implemented in the kernel. Here's the Linux kernel implementation:
Here are the userspace examples:
Here's the discussion group:
However, the problem with a kernel implementation is that it is -- obviously -- not multi-platform. Thus we'll need the user-space implementations in the future, at least for obscure operating systems.
Out of curiosity, was it impossible to prevent the fork from happening? I do like the idea of ZeroMQ quite a lot but this can only work if it will end up on kernel level at one point and if two projects compete for that spot that probably does not help its cause much.
Anyway, there's only one Linux kernel implementation and I am not aware of anyone trying to do the same (except the dubious attempt to get DBus into kernel).
Finally, the ultimate goal of the whole project is to make messaging an integral part of the Internet stack. That means standardising the protocols in IETF. And having at least 3 independent implementations makes it almost suitable for fast-tracking.
Wasn't ZeroMQ supposed to be the simple, lightweight alternative?
I'm feeling better about my decision to ignore the whole *MQ circus.
0MQ ~15k LOC
RabbitMQ ~107k LOC
AMQP Qpid ~542k LOC
ActiveMQ ~1160k LOC
Might want to try this comparison again when 0MQ (or is it xroads?) supports clustering, high availability, durability, web management, federation, & STOMP.
He wonders "wasn't ZeroMQ supposed to be the lightweight alternative" -- which means he never bothered to go read the source code or use it to find out.
Then he feels "validated" in not using any *mq stuff (never mind that, say, ZeroMQ and RabbitMQ are totally different in scope and implementation), because he read a random comment in this very thread that reads more like a /. flame.
Lastly, he doesn't even do much work (if any) with messaging. From his response below, it seems like he considered messaging options for some project, couldn't figure them out and/or didn't proceed with the project, and does non-message-related iOS work now.
Yes, because a random, 1 paragraph (and content-less, at that) comment on a social media site is very good grounds for verifying your technical decision.
(And referring to a series of projects as "circuses" without any deeper knowledge about them besides casual internet mentions, is also very mature).
If the client has not received a response by the timeout, it should close the connection itself and open a new one. Whether this is a suitable solution for the blog author's issue I don't know, but it works well for RPC connections with little or no state.
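For what it's worth, here is a minimal sketch of that close-and-reopen idea using plain TCP sockets from the Python standard library rather than ZeroMQ itself. The name request_with_retry and its parameters are made up for illustration, and it assumes the request is safe to send more than once.

```python
import socket


def request_with_retry(address, payload, timeout=1.0, retries=3):
    """Send payload and wait for a reply; on timeout, reconnect and resend.

    Assumes the request is idempotent: the server may already have
    processed an attempt whose reply never arrived.
    """
    for _ in range(retries):
        try:
            # A fresh connection per attempt, so any half-finished exchange
            # on the previous one is discarded when it is closed.
            with socket.create_connection(address, timeout=timeout) as conn:
                conn.sendall(payload)
                reply = conn.recv(4096)  # raises a timeout error if nothing arrives
                if reply:
                    return reply
                # Empty read: the peer closed without answering, so retry.
        except OSError:
            # Covers timeouts, refused connections, and resets.
            pass
    raise TimeoutError(f"no reply after {retries} attempts")
```

Something like request_with_retry(("127.0.0.1", 5555), b"ping") would then either return the reply bytes or raise after the last attempt; the address here is only a placeholder.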
I found working through all five chapters of the ZeroMQ guide unusually educational. It's full of the wisdom of people writing message oriented software for years, and includes frank discussions and solutions for several of these performance and reliability situations. (Don't miss the adventures of the suicidal snail in chapter 5!)
I found it worthwhile even to spend the time to work through all the examples in both C and Python.
In the author's situation, the normal loop of the client shouldn't be to just call blocking receive forever, as he discovered. Instead it should loop, polling the socket with some reasonable timeout, and between iterations do things like check for shutdown signals, parent process exiting, and the other typical housekeeping tasks. Then you only call receive when poll has told you there are messages waiting, and then you call it without blocking.
This sort of loop gives an obvious place to also integrate timeouts. You can also watch multiple sockets. Blocking receive forever is appropriate for a prototype sort of client but as things grow, generally more sophistication is needed.
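A rough sketch of that loop shape, written against a plain socket with the standard selectors module so it runs without ZeroMQ installed. With pyzmq the same structure would presumably use zmq.Poller and a non-blocking recv instead. The names run_client, handle and should_stop are invented for the example.

```python
import selectors
import socket


def run_client(sock, handle, should_stop, poll_timeout=0.5):
    """Poll sock with a timeout and do housekeeping between iterations."""
    sock.setblocking(False)
    sel = selectors.DefaultSelector()
    sel.register(sock, selectors.EVENT_READ)
    try:
        # should_stop() is where shutdown signals, parent-process checks
        # and similar housekeeping would go.
        while not should_stop():
            # Wait at most poll_timeout seconds, so the check above still
            # runs when no message ever arrives.
            if not sel.select(timeout=poll_timeout):
                continue
            try:
                data = sock.recv(4096)  # non-blocking: poll said it is ready
            except BlockingIOError:
                continue  # spurious wakeup, nothing to read after all
            if not data:
                break  # peer closed the connection
            handle(data)
    finally:
        sel.unregister(sock)
        sel.close()
```

More sockets can be registered on the same selector, and a per-request deadline can be checked in the same place as should_stop.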
I think in my case the confusing behavior came from the fact that I started using a project that used ZeroMQ, built part of its implementation on REP/REQ sockets, and showed that behavior. Then I went to the ZeroMQ documentation, and it does not present the REP/REQ examples with a caveat that they might block the client if the "server" goes away unexpectedly.
Don't use heartbeats on REQ/REP, because they won't work well with the lockstep communication fashion of those socket types. Also, you have to be careful because ZeroMQ sockets are not thread-safe, so the background and active thread must coordinate through a lock, or work in an implementation that handles this implicitly for you.
In ZeroRPC, we solve this by using XREQ/XREP with heartbeats. This has worked out pretty well in practice.
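To make the heartbeat idea concrete, here is a small transport-agnostic sketch of the bookkeeping involved. This is not ZeroRPC's actual code; the HeartbeatMonitor class and its parameters are hypothetical, and the sending and receiving of heartbeat messages is left to the caller.

```python
import time


class HeartbeatMonitor:
    """Tracks when to send a heartbeat and whether the peer still looks alive."""

    def __init__(self, interval=1.0, max_missed=3, clock=time.monotonic):
        self.interval = interval
        self.max_missed = max_missed
        self._clock = clock
        now = clock()
        self._last_seen = now
        self._last_sent = now

    def on_message(self):
        """Any incoming traffic, heartbeat or data, proves the peer is alive."""
        self._last_seen = self._clock()

    def should_send(self):
        """True when our own heartbeat is due; also resets the send timer."""
        now = self._clock()
        if now - self._last_sent >= self.interval:
            self._last_sent = now
            return True
        return False

    def peer_alive(self):
        """False once max_missed intervals have passed with no traffic."""
        return self._clock() - self._last_seen < self.interval * self.max_missed
```

The idea is to call should_send() and peer_alive() once per poll iteration, and on_message() whenever anything arrives, all from the one thread that owns the socket, which sidesteps the thread-safety problem mentioned above.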
I guess that it was developed for financial trading applications. Maybe it will work fine for those -- you have a few machines and high network connectivity between them. But people started doing "data center" stuff with 0MQ. Then you have geographical separation, and WAN latency and reliability.
The benefit though, is that in zeromq you get to (and are forced to) choose exactly how your messaging patterns are reliable (or not)
EDITED: my point is general; it should be 0mq libs doing the timeouts and keepalives and so on and only pushing meaningful error handling like "the server has gone away and cannot reconnect" back up to the user.
The right thing to do is to re-send the request after a disconnection or after the timeout has expired. It can be done easily in the application; however, if you want it inside the library, feel free to submit a patch.
When dealing with network code, you need 1) Timeouts, 2) Keepalives.
Which keepalives and timeouts you use depends entirely on your needs. The problem is that most libraries and protocols don't have either, and most example code never shows this. (Any TCP example that does a read() or write() without any form of timeout is a DoS waiting to happen.)
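As an illustration for plain TCP, a sketch of setting both on a Python socket. The helper name harden and the default values are arbitrary choices for the example. The tuning options (TCP_KEEPIDLE and friends) are platform-specific, so the code only sets the ones the local socket module exposes.

```python
import socket


def harden(sock, io_timeout=5.0, idle=30, interval=10, probes=3):
    """Apply an I/O timeout and TCP keepalive probes to a TCP socket."""
    # 1) Timeout: recv()/send() raise instead of blocking forever.
    sock.settimeout(io_timeout)
    # 2) Keepalive: the kernel probes an idle connection and reports an
    #    error if the peer has silently disappeared.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    tuning = (
        ("TCP_KEEPIDLE", idle),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", probes),      # failed probes before giving up
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is not None:  # not every platform defines these
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock
```

Note that kernel keepalives only detect a dead connection; an application-level timeout is still needed for a peer that is connected but never answers.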
A server restart, a server crash, or a network outage, and now potentially thousands of clients are in a bad state they cannot recover from. And this was by design. That was the point of the post, I think. This isn't so much an oversight as a bad design decision.
A client should never just wait forever for a response to a message. Any reliable system has to implement something like timeouts or immediate message acknowledgement (at which point maybe you can wait forever for a reply).
There's a comprehensive discussion of this in chapter 4 "Reliable Request Reply" of the ZMQ guide.
TCP doesn't give you guaranteed response either. Just guaranteed delivery (or error.) In this case, this is exactly what the author's getting with ZMQ. The client's Request socket makes a successful delivery, then the server crashes before generating a response, and the client waits forever for the response that will not come.
I don't like some of Python's warts, but you don't see me writing assembly.
so when you are using actual sockets across the network, it uses TCP. So ZMQ should be able to detect disconnects rather easily.
It's not really ideal that a synchronous connection doesn't have notification of a connection failure, but this has been working fine for us for ages.
I've worked around a lot of issues in ZeroMQ with retries, heartbeats, and stuff like that, but it just feels kludgy and like I am writing code that should be part of the library. I've looked at adding the functionality myself from the old patchsets that were available, and neither was straightforward to implement. The code (at least ZeroMQ) is somewhat of a mess and difficult to follow/understand, and I am saying that as a C++ programmer working on large enterprise applications.
As for the code quality, I would say the code is complex to the extent where it's almost impossible to maintain and improve it. (In case of need, here's the overview of the architecture: http://www.aosabook.org/en/zeromq.html) However, it's not a mess in the sense of being lousy. I've spent literally months of my time cleaning it up. Although it may have gotten worse since I started working on the fork.
I wouldn't say that the code quality is lousy, au contraire I think overall the code quality is very good, just very difficult to figure out where everything goes. I found it difficult to "brain" map it such that I could get proficient at following what was going on while reading different sections of the code.
Is there any effort under way in crossroads to make the notification level-triggered and not just edge-triggered? For example, sticking ZeroMQ into a libev loop is a pain because if two messages arrive at the same time I only get notified once and I have to loop through and zmq_recv(), which, if the ZeroMQ backend is sending really fast, means I now starve all my other sockets!
As for code complexity: It's hard to brain-map it for me as well, which is pretty alarming given that I am the author :) I am working on a major rewrite of the codebase (simplification of the threading model) but that'll take some time to finish.
As for level-triggering, that's a problem with POSIX, not 0MQ. There's no way in POSIX to simulate a file descriptor in the user space. The closest you can get is eventfd() which is a) Linux-specific b) doesn't allow signaling both !POLLIN and !POLLOUT at the same time :(
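One common workaround for the starvation side of this, sketched here without any ZeroMQ calls: give each notified source a read budget, and put it back in line if the budget runs out, since with edge-triggered notification nothing will announce the leftover messages. Here try_recv stands in for a hypothetical non-blocking receive that returns None when nothing is waiting; drain and service are names invented for the example.

```python
from collections import deque


def drain(try_recv, handle, budget=16):
    """Handle up to budget messages from one source.

    Returns True if the budget ran out, meaning messages may remain and
    no new notification will arrive to announce them.
    """
    for _ in range(budget):
        msg = try_recv()
        if msg is None:
            return False  # fully drained; safe to wait for the next edge
        handle(msg)
    return True


def service(sources, handle, budget=16):
    """Round-robin over notified sources until every one is drained."""
    pending = deque(sources)
    while pending:
        try_recv = pending.popleft()
        if drain(try_recv, handle, budget):
            pending.append(try_recv)  # revisit after the others had a turn
```

In a real event loop the pending queue would be serviced before going back to sleep, so a fast sender gets its messages handled in slices instead of monopolising the loop.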
I wonder - is there any MQ that does not suck ?
ZeroMQ is not an MQ but it does not suck. That particular behavior is just confusing and should probably be pointed out in the docs, even if it's supposed to be obvious. Also it would be nice if you could poll for transport-level disconnect events.
(I also use 0mq but only for a disposable internal queue)