Hacker News
SurgeMQ: MQTT Message Queue at 750,000 MPS (zhen.org)
81 points by signa11 on Dec 6, 2014 | 39 comments

It's good to see new MQTT servers. As one of the developers of IBM's product in this space I'd like to mention it: http://www-03.ibm.com/software/products/en/messagesight

I'm biased and MessageSight has existed a lot longer so it's not fair to compare them but if people want to see numbers our performance report is here: http://www-01.ibm.com/support/docview.wss?uid=swg24035590 (that report is for 1.1, the equivalent for 1.2 released a couple of weeks ago isn't out yet).

Good luck and a fair wind to everyone in the MQTT space!

These numbers are great, but they are also highly dependent on hardware architecture and tightly coupled to network architecture. Another big contributor to messaging performance is message size. One can claim 10M messages/sec, but for 1-byte messages, 10M messages/sec is hardly impressive.

I can say that on today's hardware, achieving 3-7M msgs/sec of single-stream, guaranteed, in-order messaging is possible. This is with 120-byte messages, which amounts to about 3-7 Gbps of sustained messaging in clusters of 2 to 200+ machines. What do you need to achieve this?

* Kernel-bypass network architecture, e.g. Solarflare 10/40GbE

* 10GbE switches

* decoupling of the transport layer onto dedicated processor cores

* busy-polling processes

* business logic placed on dedicated cores.

* use of shared memory and/or memory mapping to transport messaging between processes.

It all sounds fairly convoluted to throw up processes and burn a core on rx/tx, but if you need to process millions of messages a second at fairly low latency, this is about the limit of where it goes. Using a traditional application with a socket tied to a single process will get you to around 1 million messages/sec with some very large buffers and efficient business logic, but that will be about it. The next step would be to start sharding streams, but that comes with its own complexities.

I think you are overstating the difficulty of achieving good network throughput on modern hardware. Architecture definitely matters but saturating 10GbE is pretty achievable without going to extraordinary lengths. While doing kernel bypass for networking is more efficient (and preferable for other reasons), it is not necessary for pushing a huge number of messages in my experience.

The biggest bottleneck tends to be the computational cost of parsing the protocol format. Nonetheless, I've seen GeoJSON documents moved over networks at per node messaging rates similar to those in the article.

In my experience, just passing messages around w/ dummy logic should yield numbers about 10x greater, at around 10x lower latency.

With architectural tuning, the messages/sec in the various scenarios can be increased 5x (~4M msgs/s) while decreasing the average latency of processing a message to the theoretical minimums of around 5-6 microseconds to get from wire to application memory and back out to wire.

Even with efficient protocol formats (mold64), one reaches ceilings too soon. The fact that the author mentions that they are neither CPU bound nor network IO bound leads me to believe that they are probably bound by memory IO and losing efficiency to context switching.

Author here. The message size tested is 1024 bytes.

Are you CPU bound or (net) IO bound?

When I ran this I gave the server and client 2 cores each. Neither were maxing out the cores.

At about 856K MPS (sent + received) @ 1024 bytes/msg, that's less than 1 GB/s. So I can't imagine it being (net) io bound either.

So inconclusive...unfortunately. Need to profile a bit more to know.

Edit: This is TCP based, though, so it's possible that it's net-io bound given the protocol overhead. At this message rate, each packet is almost guaranteed to be full size. So we are doing about 600K pps, approximately.

Are you passing messages on the same host? When doing message passing between processes on the same machine, memory mapping and shared memory will be your friends. Check this architecture out for intra-host message passing: https://lmax-exchange.github.io/disruptor/

If you are doing tcp over loopback, the MTU is 64K, IIRC. You should be able to pack more messages per packet. Since you aren't CPU bound, you are probably memory bound.

Also, if you are on a host without kernel-bypass acceleration, wire-to-application-memory takes about 10 microseconds, so wire-to-memory-to-wire will probably land somewhere around 20 microseconds.

lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 16384

Mac OS Yosemite..

> When I ran this I gave the server and client 2 cores each. Neither were maxing out the cores.

A grep for GOMAXPROCS didn't turn up anything, so I'm not sure what you mean by that. This may help with core utilization:

   func init() {
       // Use all available cores; pre-Go-1.5 runtimes default GOMAXPROCS to 1.
       runtime.GOMAXPROCS(runtime.NumCPU())
   }

I ran with GOMAXPROCS=2 on the command line. The post showed the actual commands that I used in each case.

Ah, missed that. In any event, that sets the number of Go Processors per Go runtime. It has no bearing on the number of "cores" utilized.

There are some typos and broken links. I like the code, though; lots of work has been done.

It's worth bearing in mind that messages per second is important, but it's easy to get fixated on benchmark porn.

Different queueing systems have different guarantees, disciplines and semantics. These affect user-facing behaviour, which tends to be more important at first blush than throughput.

I like zillions of messages per second as much as the next fellow. But frequently you need to worry about things like:

* Are messages delivered once and only once?

* Are messages delivered at least once?

* Can messages be dropped entirely under pressure? (aka best effort)

* If they drop under pressure, how is pressure measured? Is there a back-pressure mechanism?

* Can consumers and producers look into the queue, or is it totally opaque?

* Is queueing unordered, FIFO or prioritised?

* Is there a broker or no broker?

* Does the broker own independent, named queues (topics, routes etc) or do producers and consumers need to coordinate their connections?

* Is queueing durable or ephemeral?

* Is durability achieved by writing every message to disk first, or by replicating messages across servers?

* Is queueing partially/totally consistent across a group of servers or divided up for maximal throughput?

* Is message posting transactional?

* Is message receiving transactional?

* Do consumers block on receive or can they check for new messages?

* Do producers block on send or can they check for queue fullness?

And there's probably a bunch more I've forgotten.

The thing is that answers to these questions will fundamentally change both the functional and non-functional nature of your queueing system.

For example, a queue system giving best-effort, unordered, non-durable behaviour is going to run a lot faster. It also pushes a lot of work onto the application programmer. On the other hand, once-and-only-once, durable, consistent queues are a lot slower and screech to a halt under most partition conditions. But they also fit what most application developers expect to happen upon their first encounter with queueing systems.

I work on a section of Cloud Foundry in my day job, and other teams have seen that different tasks require different queueing approaches.

For example, stuff like metrics is still useful under conditions of dropped messages, out-of-order messages and so on, because what's interesting is the statistics, not any one single measurement.

But a message like "start this app" requires much higher guarantees of ordering, durability, delivery certainty. People get mad if your PaaS doesn't actually run the application you asked it to run.

So, just remember: queues are not queues. You need to compare delivered apples with lossy oranges.

As a note, the author observes that MQTT provides an option to select which delivery semantics you prefer (at-least-once, at-most-once / best-effort, once-and-only-once), but I can't see which one the benchmark is run for.

Outstanding comment. I'm working on a distributed queue and those are exactly the questions I ask myself when building different parts of the system, but your exposition is something I'm going to cut&paste and read every time I work on my project. Btw, we both work at Pivotal :-) Just a question: "Is queueing partially/totally consistent across a group of servers or divided up for maximal throughput". I'm not sure I understand this fully, could you please expand on that a bit with some more context? Thanks.

> Btw, we both work at Pivotal :-)

I'm at Labs in NYC on secondment to Cloud Foundry. You should come see us and do a tech talk!

> "Is queueing partially/totally consistent across a group of servers or divided up for maximal throughput"

I didn't do a good job of explaining this.

Basically, assuming the most popular queueing discipline -- FIFO -- you can either set up brokers to be highly available, or to scale approximately linearly, but not both.

This is because HA + FIFO + at-least-once queueing requires servers to coordinate the state of a queue and to either write to disk or replicate messages. It's really, really hard and it can ruin your day when there's a partition.

If you relax all the guarantees that make life easier for application developers, you can just send any old message to any old server. No server needs to coordinate with any other and so scaling is closer to linear.

Amazon's SQS is a really good case study in what queues with looser guarantees can achieve.

I love NYC! That's tempting :-)

Thanks a lot for further specifying the cross-consistency guarantee. Basically, the system I'm developing is a lot more like Amazon's SQS: it gives guarantees about message durability/replication but only provides weak ordering, for three reasons. 1) What you said: scalability. 2) Availability, since the system is available in the minority partition as well. 3) My queue is fundamentally designed around the idea that messages must be acknowledged by consumers, or they are reissued after a job-specific retry time (which can be set to 0 if you really want at-most-once delivery, but that's a rare requirement). When you have auto-re-queue, strict ordering is basically useless since it is violated every time a consumer does not acknowledge a message fast enough.

However, not strictly ordered does not mean random: the system tries to approximate a FIFO using each node's wall-clock timestamps. This means that while messages may be delivered out of order, usually what is queued first is served first. That is no guarantee at all from the POV of the developer, but it is a more general assurance that if there are N users waiting in a web application for some thing to get processed, the first in queue will likely be served before the last.

p.s. sorry for errors, writing with my daughter pulling my arms :-)

Total consistency requires global consensus, and that puts a definite brake on performance.

Yep, but are we talking about consistency in the order of messages? That was not clear to me.

I principally was thinking of order, yes. When we application developers hear "queue" we typically assume it's going to have a FIFO discipline.

Thank you for covering this; numbers are great, but it's the details that matter (and end up biting you in the ass). I have nearly a decade running and working with high-volume transactional message systems, and it came up fairly regularly that Product A (WMQ-LL, 0MQ, Tibco FTL, 29West, or Solace) was faster than Product B (WMQ, AMQP- Rabbit, Active, Swift- or Tibco) ignoring recovery, transactionality, durability, reliability, ordering, and delivery (a favorite head scratcher is when you expect FIFO and get unordered). Matching the requirements to the system and setting expectations, especially expectations around how and when it goes wrong, is so much more important than the benchmarks.

> And there's probably a bunch more I've forgotten

+ Latency vs throughput.

What would be really informative (and this is not remotely directed only at Jian Zhen's work here) is to see metrics across a spectrum for a given modality on a standard topology and capacity. That way it would be almost clear at a glance which approach is the best fit for a given domain.

Also I wonder if the time has come for the software geeks to visit their fellow geeks in the Mech-E departments and check out the work to date on flow analysis … ;)

Excellent question regarding latency.

Tyler Treat, who wrote an excellent blog post on MQ performance, tested this and got a mean latency of 98.751015 ms for 1,000,000 messages.


I have not. But I am going to assume you are telling me to learn more about how to calculate my performance numbers. :) Fair point and I will bookmark for future reading.

Good benchmarking requires a lot of work. What you did is actually useful within the narrow context you're applying it to -- quickly validating design changes.

But if, for example, you wanted to submit it to SIGMETRICS for publication, they'd expect a whole bunch of extra work and background material.

One thing that's tricky in our profession is that the space of all possible configurations and inputs is gigantic and we only tend to sample the very, very small parts of it that occur to us.

Yeah, for the record, the benchmarks I've run aren't at all scientific. Should also be looking at percentiles. That said, surge looks promising so nice work, Jian :)

My dream would be to build a framework which orchestrates testing across distributed senders/receivers. That would give you a much more accurate representation of performance and reliability instead of a single machine.

Author here. A bit of a pleasant surprise to wake up and see this on HN.

Thanks for the detailed response. Here are some answers that hopefully help clarify things a bit.

* Are messages delivered once and only once?

A: MQTT allows QoS 0 (at most once), 1 (at least once), and 2 (exactly once.) The performance numbers in the blog are for QoS 0.

However, SurgeMQ implements all three and there's unit tests for all three. I just haven't done the performance tests for QoS 1 and 2.

* Are messages delivered at least once?

SurgeMQ supports it though the numbers posted are for QoS 0 (at most once)

* Can messages be dropped entirely under pressure? (aka best effort)

No. Currently no messages are dropped.

* If they drop under pressure, how is pressure measured? Is there a back-pressure mechanism?

See above.

* Can consumers and producers look into the queue, or is it totally opaque?

Not sure what this means..sorry..

* Is queueing unordered, FIFO or prioritised?

Ordered, FIFO...the MQTT spec requires that messages from publishers be delivered in the same order to the subscribers.

* Is there a broker or no broker?


* Does the broker own independent, named queues (topics, routes etc) or do producers and consumers need to coordinate their connections?

The broker uses topics to route. A publisher publishes to a topic; subscribers subscribe to multiple topics w/ optional wildcards.

* Is queueing durable or ephemeral?

Ephemeral currently. Though MQTT spec requires that any unack'ed QoS 1 and 2 messages be redelivered when the server restarts or client reconnects. So once SurgeMQ meets that spec, it could be considered somewhat durable.

* Is durability achieved by writing every message to disk first, or by replicating messages across servers?


* Is queueing partially/totally consistent across a group of servers or divided up for maximal throughput?

Currently SurgeMQ is a single server w/ no clustering ability. However, MQTT spec does mention the bridge capability that is a poor man's cluster. Not yet implemented.

* Is message posting transactional?

QoS 1 and 2 start to be more transactional. QoS 0 is strictly fire and forget.

* Is message receiving transactional?

See above.

* Do consumers block on receive or can they check for new messages?

Block on receive.

* Do producers block on send or can they check for queue fullness?

Block on send.

Great reply.

I wasn't expecting to make you fill out a survey. I was just trying to show off scars I and others I know have accumulated over the years :)

Heh. No worries. Excellent list of questions. Like antirez, I am evernoting (is that a verb now?) this for future reference.

I believe by 'peeking into the queue' he means: is receiving the only way to see the messages, or can you inspect the list of messages yet to be received without actually calling receive?

Ah..k...no peeking beyond just the head of the queue.

>> * Can consumers and producers look into the queue, or is it totally opaque?

> Not sure what this means..sorry..

I'm definitely an amateur on this topic, but I suppose he's talking about privacy issues. If you're delivering secrets, you don't want every actor to be able to read the messages of all the others.

Ah, I missed that one.

Some queues let you "peek". It's not common because it weakens the whole concept of a queue and tends to be difficult to implement sanely.

I was looking at RabbitMQ - do you recommend an MQ that isn't a SaaS?

RabbitMQ is originally an installable package that can support a wide range of behaviours -- you still need to be mindful of which you choose. Depending on your settings, you can see orders of magnitude changes in throughput.

Disclaimer: RabbitMQ belongs to Pivotal, the same company I work for.

Author here.

There's really no comparison tbh. RabbitMQ has been battle tested for years and SurgeMQ is just in its infancy. And RabbitMQ has a lot more enterprise security features that SurgeMQ doesn't have today. So if you are looking for a solution today, go w/ RabbitMQ.

SurgeMQ hopefully will get there someday, but it's not ready yet.

RabbitMQ is AMQP and mostly an "enterprise-class" thing. MQTT is more of an IoT messaging protocol; I don't think they're trying to do the same thing?

Yeah, was just looking for an opinion - these guys know their stuff. I never get to leave the frontend so I don't know which MQ to choose ;)

How does this compare to RabbitMQ?
