ZeroMQ - Disconnects are Good for You (pocoo.org)
108 points by mrud 1322 days ago | 54 comments



Yeah, don't use REQ/REP.

I used to be a big ZeroMQ fan, but honestly, the library trades sanity for performance and has collapsed in on itself because the code is no longer maintainable. Last I checked, it's being completely rewritten in a different programming language. Maybe the result will be fine, but the core features of retrying connections and remembering dropped messages for clients that disappear temporarily are easy enough to write yourself.

(I do like Mongrel2's use of that feature to allow me to make an HTTP request and then start the backend server. And the right place for this is a library. It's just that ZeroMQ had too many features and too much code.)

-----


> Last I checked, it's being completely rewritten in a different programming language.

The rewrite you mention is, I believe, Crossroads I/O ( http://www.crossroads.io/ ). Martin Sustrik was one of the creators of ZeroMQ to begin with.

He's got a writeup as to why he should have used C to begin with ( http://www.250bpm.com/blog:4 ).

-----


Hi, Martin Sustrik here. A couple of points:

1. As for lost requests, the right thing to do is to time out the request and re-send it if the response hasn't been received. This should be done inside 0MQ/XS; however, it's pretty easy to do in the application, so nobody so far has felt inconvenienced enough to implement the feature.

2. As for the implementation language, C would have been better than C++, but that's only an implementation detail and of little interest to 0MQ/XS users.

3. Finally, yes, the 0MQ/XS functionality should be implemented in the kernel. Here's the Linux kernel implementation:

https://github.com/250bpm/linux/tree/sp-linux

Here are the userspace examples:

https://github.com/250bpm/sp-userland

Here's the discussion group:

https://groups.google.com/forum/?fromgroups#!forum/sp-in-lin...

However, the problem with a kernel implementation is that it is -- obviously -- not multi-platform. Thus we'll still need the user-space implementations in the future, at least for obscure operating systems.

-----


> Hi, that's Martin Sustrik here. Couple of points:

Out of curiosity, was it impossible to prevent the fork from happening? I do like the idea of ZeroMQ quite a lot, but this can only work if it ends up at the kernel level at some point, and if two projects compete for that spot, that probably does not help its cause much.

-----


The fork was over trademark policy, not any technical matter. So no, it was not possible to prevent it. To avoid trademark restrictions, it was necessary to change the name of the project.

Anyway, there's only one Linux kernel implementation and I am not aware of anyone else trying to do the same (except the dubious attempt to get DBus into the kernel).

Finally, the ultimate goal of the whole project is to make messaging an integral part of the Internet stack. That means standardising the protocols at the IETF. And having at least 3 independent implementations makes it almost suitable for fast-tracking.

-----


http://www.crossroads.io/ looks like the sane path to follow after the community implosion of zeromq.

-----


"community implosion" ? Please do share.. I've only been dabbling with ØMQ, but it seems a lot more active than crossroads.

-----


What is the community implosion you're referring to?

-----


> It's just that ZeroMQ had too many features and too much code.

Wasn't ZeroMQ supposed to be the simple, lightweight alternative?

I'm feeling better about my decision to ignore the whole *MQ circus.

-----


Well, ZeroMQ has nothing to do with AMQP (RabbitMQ being the prototypical example), so there's not much of a circus.

-----


People compare them all the time, including the ZeroMQ people themselves:

http://www.zeromq.org/docs:welcome-from-amqp

-----


It's not really a useful comparison. You can use ZeroMQ to build a lightweight message queue, but if you really need a message queue, it would be best to just use a full-blown one. ZeroMQ is more of a message-passing toolkit.

-----


Too much code?

0MQ: ~15k LOC
RabbitMQ: ~107k LOC
AMQP Qpid: ~542k LOC
ActiveMQ: ~1160k LOC

-----


I'm sorry, but comparing 0MQ to these other systems is caca doodoo. I'm telling you, as a developer who has on multiple occasions considered using 0MQ: it lowers my trust in you as an evangelist for the platform to suggest that this is a relevant or worthwhile comparison to make.

-----


Consider my comment in context. The comment I was replying to said that 0MQ/XS is not lightweight because it has too much code. That's simply not true, and that's what I was alluding to.

-----


> 0MQ ~15k LOC RabbitMQ ~107k

Might want to try this comparison again when 0MQ (or is it xroads?) supports clustering, high availability, durability, web management, federation, & STOMP.

-----


The only circus is the incorrect conflation of 0MQ with actual durable message-queue systems.

-----


Did you use something else instead or just decide not to design around message-queues at all?

-----


I switched to iOS development. ;)

-----


Does he sound like the right person to ask about the matter?

He wonders whether "ZeroMQ was supposed to be the lightweight alternative" -- which means he never bothered to read the source code or use it to find out.

Then he feels "validated" in not using any *MQ stuff (despite, say, ZeroMQ and RabbitMQ being totally different in scope and implementation), because he read a random comment in this very thread that reads more like a /. flame.

Lastly, he doesn't even do much work (if any) with messaging. From his response below, it seems he considered messaging options for some project, couldn't figure them out and/or didn't proceed with the project, and does non-message-related iOS work now.

-----


> I'm feeling better about my decision to ignore the whole *MQ circus.

Yes, because a random, one-paragraph (and content-less, at that) comment on a social media site is very good grounds for validating your technical decision.

(And referring to a series of projects as a "circus" without any deeper knowledge of them besides casual internet mentions is also very mature.)

-----


Do you have an alternative you use?

-----


An implementation of this kind of usage pattern is already provided in the ZMQ guide:

http://zguide.zeromq.org/page:all#Client-side-Reliability-La...

If the client has not received a response by the timeout, it should close the connection itself and open a new one. Whether this is a suitable solution for the blog author's issue I don't know, but it works well for RPC connections with little or no state.
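
For reference, the Lazy Pirate client from the guide boils down to something like this (a minimal pyzmq sketch of the pattern; the endpoint, timeout, and retry count are made-up values):

    import zmq

    ENDPOINT = "tcp://localhost:5555"  # hypothetical server address
    TIMEOUT_MS = 2500
    RETRIES = 3

    context = zmq.Context()

    def request(payload):
        retries_left = RETRIES
        while retries_left:
            client = context.socket(zmq.REQ)
            client.connect(ENDPOINT)
            client.send(payload)
            if client.poll(TIMEOUT_MS) & zmq.POLLIN:
                reply = client.recv()
                client.close()
                return reply
            # No reply in time: a REQ socket stuck mid-conversation cannot
            # be reused, so abandon it and retry with a brand-new one.
            client.setsockopt(zmq.LINGER, 0)
            client.close()
            retries_left -= 1
        raise RuntimeError("server seems to be offline")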

-----


Indeed. ZMQ has other capabilities beyond REQ/REP exactly for this situation, and helps you layer "patterns" on top of them.

I found working through all five chapters of the ZeroMQ guide unusually educational. It's full of the wisdom of people who have been writing message-oriented software for years, and includes frank discussions of, and solutions for, several of these performance and reliability situations. (Don't miss the adventures of the suicidal snail in chapter 5!)

I found it worthwhile even to spend the time to work through all the examples in both C and Python.

In the author's situation, the normal loop of the client shouldn't be to just call blocking receive forever, as he discovered. Instead it should loop, polling the socket with some reasonable timeout, and between iterations do things like check for shutdown signals, the parent process exiting, and other typical housekeeping tasks. Then you only call receive when poll has told you there are messages waiting, and you call it without blocking.

This sort of loop gives an obvious place to integrate timeouts as well. You can also watch multiple sockets. Blocking receive forever is appropriate for a prototype sort of client, but as things grow, more sophistication is generally needed.
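
A skeleton of that loop, assuming pyzmq (the endpoint and the shutdown flag are placeholders for whatever the application actually uses):

    import zmq

    def handle(msg):
        print(msg)  # stand-in for real message processing

    context = zmq.Context()
    sock = context.socket(zmq.SUB)        # any readable socket type works
    sock.connect("tcp://localhost:5556")  # hypothetical endpoint
    sock.setsockopt(zmq.SUBSCRIBE, b"")

    poller = zmq.Poller()
    poller.register(sock, zmq.POLLIN)

    shutdown_requested = False  # would be set by a signal handler, say

    while not shutdown_requested:
        events = dict(poller.poll(timeout=1000))  # milliseconds
        if sock in events:
            handle(sock.recv(zmq.NOBLOCK))  # poll said a message is waiting
        # between iterations: check for shutdown signals, the parent
        # process exiting, timeouts, and other housekeeping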

-----


> In the author's situation, the normal loop of the client shouldn't be to just call blocking receive forever, as he discovered. Instead it should loop, polling the socket with some reasonable timeout, and between iterations do things like check for shutdown signals, the parent process exiting, and other typical housekeeping tasks. Then you only call receive when poll has told you there are messages waiting, and you call it without blocking.

I think in my case the confusing behavior came from the fact that I started using a project that used ZeroMQ, built part of its implementation on REQ/REP sockets, and exhibited that behavior. Then I went to the ZeroMQ documentation, and it does not present the REQ/REP examples with a caveat that they might block the client if the "server" goes away unexpectedly.

-----


> This could probably be improved by having a background thread that uses a ZeroMQ socket for heartbeating.

Don't use heartbeats on REQ/REP, because they won't work well with the lockstep communication pattern of those socket types. Also, you have to be careful because ZeroMQ sockets are not thread-safe, so the background and active threads must coordinate through a lock, or work in an implementation that handles this implicitly for you.

In ZeroRPC, we solve this by using XREQ/XREP with heartbeats. This has worked out pretty well in practice.
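
This is not ZeroRPC's actual implementation, just a rough single-threaded sketch of the idea over a DEALER (XREQ) socket, which sidesteps the thread-safety problem entirely; the frame contents, endpoint, and intervals are invented:

    import time
    import zmq

    HEARTBEAT_INTERVAL = 1.0  # seconds between PINGs
    HEARTBEAT_LIVENESS = 3    # unanswered intervals before declaring death

    context = zmq.Context()
    sock = context.socket(zmq.DEALER)
    sock.connect("tcp://localhost:5555")  # hypothetical endpoint

    liveness = HEARTBEAT_LIVENESS
    next_ping = time.time()

    while liveness > 0:
        if time.time() >= next_ping:
            sock.send(b"PING")  # invented heartbeat frame
            next_ping = time.time() + HEARTBEAT_INTERVAL
            liveness -= 1       # decays until we hear anything back
        if sock.poll(int(HEARTBEAT_INTERVAL * 1000)) & zmq.POLLIN:
            sock.recv_multipart()  # any incoming traffic counts as life
            liveness = HEARTBEAT_LIVENESS

    print("peer considered dead")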

-----


Rather than polling, zeromq >= 2.2 allows you to set ZMQ_RCVTIMEO on the socket, which seems to be what the author is after. It would be nice to be notified of disconnected peers, but the timeout-plus-retry approach has been good enough for me.
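
In pyzmq that looks roughly like this (the endpoint is a placeholder):

    import zmq

    context = zmq.Context()
    sock = context.socket(zmq.REQ)
    sock.setsockopt(zmq.RCVTIMEO, 3000)   # give up on recv() after 3s
    sock.connect("tcp://localhost:5555")  # hypothetical endpoint

    sock.send(b"ping")
    try:
        reply = sock.recv()
    except zmq.Again:
        # timeout hit: tear the stuck REQ socket down and retry anew
        sock.setsockopt(zmq.LINGER, 0)
        sock.close()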

-----


Great article. ZeroMQ had a "smell" that I couldn't put my finger on, and this article kind of nailed it. In retrospect, I guess the smell is that it tightly couples both sides of the network to make its performance claims. It sacrifices robustness for performance.

I guess it was developed for financial trading applications. Maybe it works fine for those -- you have a few machines and high network connectivity between them. But people started doing "data center" stuff with 0MQ. Then you have geographical separation, WAN latency, and reliability problems.

-----


I think the problem is that people come to ZeroMQ expecting a high-level library that handles all the details. ZeroMQ does not do that; you need to handle reliability and disconnect behavior yourself. I agree that the default behavior in this case could be saner, but it's pretty easy to build reliable request-reply in many different ways, as illustrated in the guide, so I'm fine with it.

The benefit, though, is that with ZeroMQ you get to (and are forced to) choose exactly how your messaging patterns are reliable (or not).

-----


The better solution? That the 0mq libs do the right thing and don't get wedged. It shouldn't be on the users of the API to handle this.

EDIT: my point is general; it should be the 0mq libs doing the timeouts and keepalives and so on, and only pushing meaningful error handling like "the server has gone away and cannot reconnect" back up to the user.

-----


The problem with that is that a 0MQ socket abstracts multiple underlying connections. Reporting errors would mean making the connections visible to the user. There would have to be connection IDs, an accept function, error notifications, etc. In the end the whole thing would boil down to an ugly version of standard BSD TCP sockets.

The right thing to do is to re-send the request after a disconnection or after the timeout has expired. It can be done easily in the application; however, if you want it inside the library, feel free to submit a patch.

-----


I've been experimenting with being completely asynchronous (and working on being connection-less). The protocol layer just wraps up payloads and unpacks them. There is a background heartbeat, and when the heartbeat is not met there is a notification, but the user is in charge of deciding whether this should be considered a disconnection. This is mostly inspired by how Oz does distribution. I don't have any good results yet, though.

-----


Receiving and handling I/O errors is easy; the harder part is when something goes wrong on the peer and you don't receive an error.

When dealing with network code, you need 1) Timeouts, 2) Keepalives.

What kind of keepalives and timeouts depends entirely on your needs. The problem is that most libraries/protocols have neither, and most example code never shows this. (Any TCP example that does a read() or write() without any form of timeout is a DoS waiting to happen.)
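
To illustrate with plain Python sockets -- the same read that could hang forever becomes a bounded wait (host and port are placeholders):

    import socket

    # the timeout applies to connect and to subsequent send/recv calls
    conn = socket.create_connection(("example.com", 80), timeout=5.0)
    conn.sendall(b"GET / HTTP/1.0\r\nHost: example.com\r\n\r\n")
    try:
        data = conn.recv(4096)  # raises socket.timeout after 5 seconds
    except socket.timeout:
        conn.close()            # peer is unresponsive; give up or retry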

-----


There's a problem when you restart servers at the wrong moment, though, as the article mentions...

-----


So... in other words, there is a serious problem.

Restarting a server / a server crash / a network outage, and now potentially thousands of clients are in a bad state they cannot recover from. And this was by design. That was the point of the post, I think. This isn't so much an oversight as a bad design decision.

-----


Agreed, but the bad design is not in ZMQ; it's in the way it is being used.

A client should never just wait forever for a response to a message. Any reliable system has to implement something like timeouts or immediate message acknowledgement (at which point maybe you can wait forever for a reply).

There's a comprehensive discussion of this in chapter 4, "Reliable Request-Reply", of the ZMQ guide.

TCP doesn't give you a guaranteed response either, just guaranteed delivery (or an error). In this case, that is exactly what the author is getting with ZMQ. The client's REQ socket makes a successful delivery, then the server crashes before generating a response, and the client waits forever for a response that will never come.

-----


It seems to me that the better solution might be just using Twisted and regular networking techniques.

-----


So whenever there's a small problem with something, the solution is to discard the whole thing and go down a layer?

I don't like some of Python's warts, but you don't see me writing assembly.

-----


One could argue whether using Twisted as opposed to ZMQ is going down a layer at all.

-----


When something is so broken that it wedges, it may be time to just use a TCP socket.

-----


Sure, REQ/REP sockets are limited, especially because they force the request/reply/request/reply/... sequence. I don't think any complex applications use this. I recently built an application using DEALER/ROUTER sockets, where you can send multiple requests without having to wait for responses, etc. Additionally, no application should rely on receiving a response; the poller he suggests solves this problem nicely (although I don't think it necessary to wrap it in send/recv methods, as pyzmq offers a nice polling API).
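
For illustration, the DEALER side of such a setup might look like this in pyzmq (the endpoint is a placeholder, and the empty delimiter frame assumes the peer expects REQ-style framing):

    import zmq

    ctx = zmq.Context()
    sock = ctx.socket(zmq.DEALER)
    sock.connect("tcp://localhost:5555")  # hypothetical endpoint

    # no lockstep: queue several requests without waiting for replies
    for i in range(3):
        sock.send_multipart([b"", b"request-%d" % i])

    # collect replies whenever they arrive, possibly out of order
    while sock.poll(1000) & zmq.POLLIN:
        _, reply = sock.recv_multipart()
        print(reply)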

-----


> Carries messages across inproc, IPC, TCP, and multicast.

So when you are using actual sockets across the network, it uses TCP, which means ZMQ should be able to detect disconnects rather easily.

-----


The solution I use is the one I mentioned to Armin on twitter: https://gist.github.com/2994781

It's not really ideal that a synchronous connection doesn't have notification of a connection failure, but this has been working fine for us for ages.

-----


Armin is right that the timeout works but delays the signal about connection failure from TCP. If anyone feels like implementing automatic resend inside 0MQ/XS (in case of timeout or TCP connection failure), give it a go and submit a patch. If no one submits a patch, I'll fix the problem once I have more free time.

-----


As far as I was aware, there was a patch like that available for ZMQ 3.0, but it was nixed for 3.1. Same with a patch to deal with the issue of 1000 clients on an XREP socket where the first 999 disappear, so their messages don't need to be processed and should just be dropped on the XREP socket... that was added in 3.0 and then reverted in 3.1.

I've worked around a lot of issues in ZeroMQ with retries, heartbeats, and things like that, but it just feels kludgy, like I am writing code that should be part of the library. I've looked at adding the functionality myself from the old patchsets that were available, and neither was straightforward to implement. The code (at least ZeroMQ's) is somewhat of a mess and difficult to follow/understand, and I am saying that as a C++ programmer working on large enterprise applications.

-----


Yep. The patches were reverted because they broke wire-level compatibility with older versions of 0MQ. The good news is that protocol versioning will be available in the next release of crossroads.io, so patches like that can be applied without breaking backward compatibility (the library will just speak two different protocol versions).

As for the code quality, I would say the code is complex to the extent that it's almost impossible to maintain and improve. (In case of need, here's an overview of the architecture: http://www.aosabook.org/en/zeromq.html) However, it's not a mess in the sense of being lousy. I've spent literally months of my time cleaning it up. Although it may have gotten worse since I started working on the fork.

-----


I can understand the need for them to be reverted, although I honestly think the patches provided functionality that should have already existed. Worst case, the patches could have stayed and been made available through certain flags (much like HWM, LWM, and others).

I wouldn't say the code quality is lousy -- au contraire, I think overall the code quality is very good -- it's just very difficult to figure out where everything goes. I found it difficult to "brain-map" it such that I could get proficient at following what was going on while reading different sections of the code.

Is there any effort under way in Crossroads to make the notification level-triggered and not just edge-triggered? For example, sticking ZeroMQ into a libev loop is a pain, because if two messages arrive at the same time I only get notified once, and I have to loop and zmq_recv(), which, if the ZeroMQ backend is sending really fast, means I now starve all my other sockets!
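
One workaround for the starvation side of that (an assumption on my part, not Crossroads' answer): drain a bounded batch per wakeup, then explicitly re-check ZMQ_EVENTS and reschedule yourself, so one busy socket can't monopolize the loop. In pyzmq terms:

    import zmq

    MAX_BATCH = 100  # arbitrary cap per wakeup

    def on_readable(sock, handle, reschedule):
        # Called by the outer event loop (libev etc.) when the ZMQ fd fires.
        # `handle` processes one message; `reschedule` is whatever the loop
        # uses to queue an immediate callback (both are hypothetical hooks).
        for _ in range(MAX_BATCH):
            try:
                handle(sock.recv(zmq.NOBLOCK))
            except zmq.Again:
                return  # fully drained; the next fd edge will wake us
        # Batch cap hit. The fd is edge-triggered, so no new edge may come;
        # re-check the socket state and reschedule if messages remain.
        if sock.getsockopt(zmq.EVENTS) & zmq.POLLIN:
            reschedule()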

-----


As for the options: yes. There's a patch that supports protocol versioning (a "version" socket option) already applied to crossroads.io. That in turn makes it possible to improve the protocol without breaking backward compatibility.

As for code complexity: it's hard to brain-map for me as well, which is pretty alarming given that I am the author :) I am working on a major rewrite of the codebase (a simplification of the threading model), but that'll take some time to finish.

As for level-triggering, that's a problem with POSIX, not 0MQ. There's no way in POSIX to simulate a file descriptor in user space. The closest you can get is eventfd(), which is a) Linux-specific and b) unable to signal both POLLIN and POLLOUT at the same time :(

-----


This badly reminds me of my MQSeries experience.

I wonder -- is there any MQ that does not suck?

-----


ZeroMQ isn't an MQ (message queue). It's a message-passing library. You can use it to build message queues, though.

-----


I really despise the name ZeroMQ for this reason. The first time you hear it, you wonder: "Is it an MQ, and Zero is just a fun, clever name?" or "Is it not an MQ?" People seem to go with the latter. Then what is it? Why define itself by what it isn't? </rant>

-----


> I wonder - is there any MQ that does not suck ?

ZeroMQ is not an MQ, but it does not suck. That particular behavior is just confusing and should probably be pointed out in the docs, even if it's supposed to be obvious. Also, it would be nice if you could poll for transport-level disconnect events.

-----


I love RabbitMQ with the passion of a thousand suns right now.

(I also use 0mq but only for a disposable internal queue)

-----


The TCP stack takes care of this problem; this is an insane attempt at POST-mature optimization.

-----



