The LMAX Architecture - 100K TPS at Less than 1ms Latency

bodski · on Oct 30, 2011

Cliff Click makes a good post [1] discussing the LMAX/Disruptor architecture and how it can give large performance gains in specific circumstances (such as those generally encountered by LMAX). In particular, Disruptor works best when there is a one-to-one ratio of disruptor threads and cpu cores. Great to see Cliff rolling up his sleeves and looking into this stuff from LMAX.

Additionally Cliff points out that there really needs to be some kind of standard CPU/Socket affinity API for VMs so that the relevant threads can share L3/L2 cache and reduce bus saturation. The Managed Runtime Initiative [2] is an interesting early effort to address this and other VM common concerns.

Edit: Looks like memory latency (not bus saturation) is the big issue if producer and consumer threads end up on different CPU sockets, hence Cliff's call for an explicit CPU/socket affinity API.

[1] http://www.azulsystems.com/blog/cliff/2011-09-23-a-pair-of-s... (scroll down to Disruptor section)

[2] http://www.managedruntime.org/

jbellis · on Oct 31, 2011

Too bad there's been essentially zero activity on MRI for over a year.

mjpt777 · on Oct 31, 2011

A good posting showing the effect of setting thread affinity for version 1.0 of the Disruptor.

http://java.dzone.com/articles/java-threads-steroids

haberman · on Oct 30, 2011

The article starts by saying that low latency is the main requirement for financial trading, but then quotes all performance numbers in throughput ("6 million orders per second"). I couldn't find any mention of average or worst-case latency, and throughput numbers alone tell you nothing about latency.

In particular, since each input event has to be journaled and replicated (which both involve I/O) before it can be processed, there is a potentially large and unpredictable delay for each event.

Another issue is that this architecture assumes that 1 CPU core is enough to run all business logic processors, and that 1 machine's memory is enough to hold your state. If your processing is something CPU-intensive or your state is large, you'll hit a scaling wall that requires you to manually shard your input across multiple machines.

davefarley · on Nov 1, 2011

You are right predictable latency is key, but that is one of the big wins that we have found with the Disruptor and our surrounding infrastructure. The Disruptor is not the only thing that we have worked on to achieve this. For example we have spent time experimenting with journalling and clustering techniques to give us high-throughput, predictable, mechanisms for protecting our state.

As Martin said in his reply our system is implemented as a number of services, which communicate via pub-sub messaging, it is not quite as naive as you assume, we still have a distributed system. For the high-throughput, low latency functions of the system, (e.g. order matching in our exchange, or risk evaluation in our broker) they are each serviced on a single business logic thread, on separate servers.

Our latency measurements include these multiple hops within our network and represent an inbound instruction arriving at our network to the time that we have a fully processed, outbound, response back at the edge of our network, as Martin pointed out, modern networking can be quiet efficient when used well.

We have designed to allow us to shard our system when the need arises, but we are actually a long way away from that need. Even though these high performance services keep their entire working set in memory, that is still a relatively small number when compared to the amount of memory available in modern commodity servers. We currently have lots of head-room!

We think that this is a very scalable approach and under-used. Keeping the working set in memory is pretty straight forward for many business applications, and has the huge benefit of being very simple to code. Much more straight-forward than the, more conventional, shuffling of data around and translation from one form to another that is such a common theme in more conventional systems.

The sharding decision is simple, for our problem domain we have two obvious dimensions for sharding, accounts and order-books. Each instance (shard) would continue to have it's business logic processed on a single thread. I think that this is normal for most business problems, it is a matter of designing the solution to avoid shared state between shards.

We are not advocating "no parallelism", rather we advocate that any state should be modified by a single thread, to avoid contention. Our measurements have shown that the cost of contention almost always outweighs the benefit. So avoid contention, not parallelism.

  Dave Farley 
  (Head of Software development at LMAX)

redcircle · on Oct 30, 2011

They don't call the system low latency --- it is the processing that occurs in the single-threaded main loop that has to be low latency. They of course want the end-to-end system to be low latency, but the term here describes the inner work loop, because their approach doesn't work if you have higher latency work items in the main loop.

If you apply "low latency" to the inner loop only, then you can resolve your second critique too: they won't be supporting anything that is CPU-intensive (since that isn't low latency). Also, all state has to fit in a single node, to keep things low latency.

rickmode · on Oct 30, 2011

You're saying this approach doesn't fit problem domains that are CPU-intensive. That's a valid point.

The article doesn't directly address this but does compare LMAX to a typical database backed business application. The computation here is not usually CPU-bound.

mjpt777 · on Oct 31, 2011

The LMAX system is more about latency than throughput. When designing for very low-latency the result can be a system that achieves very great throughput if the appropriate techniques are employed. Single threaded applications are very suitable for low-latency because of the avoidance of lock contention and predictability they bring.

Some means of reliable delivery of messages, to and from this single thread, is necessary to make a useful application. These messages must be delivered in the event of system failures. To address this need the Disruptor is employed to pipeline and run in parallel, the replication, journalling and business logic for these messages. The whole system is asynchronous and non-blocking.

In our architecture we have multiple gateway nodes that handle border security and protocol translation to and from our internal IP multi-cast binary protocol for delivery to highly redundant nodes. Lots of external connections can be multiplexed down from the outside world this way.

The 6 million TPS refers to our matching engine business logic event processor. We have other business logic processors for things like risk management and account functions. These can all communicate via our guaranteed message delivery system that can survive node failures and restarts, even across data centres.

Modern financial exchanges can process over 100K TPS and have to respond with latency in the 100s of microseconds firewall to firewall, thus including the entire internal infrastructure. For those tracking the latest developments will see it is possible to have single digit microsecond network hops with IP multicast with user space network stacks and RDMA over 10GigE. Even a well tuned 1GigE stack can achieve sub 40us for a network hop. For reference single digit microseconds is in the same space as a context switch on a lock with the kernel arbitrating. Most financial exchanges rely on having data on multiple nodes before the transaction is secure. A number of these nodes can be asynchronously journalling the data down to disk. At LMAX we tend to have data in 3 or more nodes at any given time.

In my experience of profiling many business applications the vast majority of the time is either spent in protocol translation such as XML or JSON to business objects, or within the JDBC driver doing buffer copying and waiting on the database to respond, when the application domain is well modelled.

Often applications are not well modelled for their domain. This can result in algorithms that, rather than be O(1) for most transactions, have horrible scale up characteristics because of inappropriate collections representing relationships. If you have the luxury of developing an in-memory application requiring high performance it quickly becomes apparent the cost of a CPU cache miss is the biggest limitation to latency and throughput. For this one needs to employ data structures that exhibit good mechanical sympathy for CPU and memory subsystem. At LMAX we have replaced most of the JDK collections with our own that are cache friendly and garbage free.

So far we have had no issue processing all transactions for a given purpose on a single thread, or holding all the live state in memory for a single node. If we ever cannot process all the transactions necessary on a single thread then we simply shard the model across threads/execution contexts. We only hold the live mutating data in-memory and archive out to database completed transactions as they are then read only.

Martin (LMAX CTO)

vilya · on Oct 30, 2011

It's clear from Fowler's discussion of "mechanical sympathy" that he didn't understand the memory hierarchy. I found it pretty shocking that someone in his position could not know this stuff. He really needs to read Ulrich Drepper's "What every programmer should know about memory" (http://www.akkadia.org/drepper/cpumemory.pdf).

It's such a fundamental topic that everyone ought to at least know the basics of it. People architecting large systems, like Fowler, have a special responsibility to know these things: if they aren't taken into consideration in the system design, it's easy to dig a performance hole too deep to climb out of.

koevet · on Oct 30, 2011

Can you be more specific?

vilya · on Oct 31, 2011

About the performance issues?

Well I work on software that does image processing at the moment & we've chosen to represent images as separate channels (i.e, all the red values, then all the green values, then all the blue values, etc). If we had interleaved the channels instead, many common operations that only need to look at one channel - or just one at a time - would be a lot slower.

When data is loaded into the cache, it's loaded in chunks called cache lines. These are usually 64 or 128 bytes long, depending on your CPU. So assuming a 64 byte cache line, if you ask for the value at address 4 then it'll load in all the values in addresses 0-63. If you then ask for the value at address 8, it'll already be cached & therefore quick to access; but if you ask for the value at address 64, it'll have to fill another cache line first - a cache miss.

So back to the images, say we're just looking at the alpha channel of an RGBA image. With separate channels we get 16 alpha values in each cache line (each channel is a float, so 4 bytes). If the channels were interleaved then we'd only get 4 alpha values in each cache line, so the CPU will have to fill 4 times as many cache lines.

Because so much of our code deals with images (and not just ours - our clients too), if we'd chosen an interleaved channel representation and were now finding that too slow, we'd be pretty stuck. So it's really important to consider these issues up front, but of course you can't if you don't know at least a little bit about how the hardware works.

Hopefully that explains it a bit better?

jasonwatkinspdx · on Oct 30, 2011

He states pretty explicitly that he doesn't feel he has an intuitive understanding of hardware's performance quirks right in that section. Why should you then be surprised?

vilya · on Oct 31, 2011

Because this is a guy who's supposed to be an authority on software engineering & reading something like that makes me wonder how on earth he ever came to be an authority. It's a bit like someone convincing you they're a racing driver then admitting they're not really sure what the clutch does. (Yes that's hyperbole, but still...)

jasonwatkinspdx · on Oct 31, 2011

To abuse your analogy there are many fast drivers that have little to know mechanical knowledge.

You seem to be under the impression that to be an authority one must have comprehensive and detailed knowledge, particularly in areas you care about. I believe this is rarely the case. It's far more important to understand the essential. The key to being a good leader is not to know everything better than everyone, but to understand how to help everyone contribute their best, then get out of the way.

vilya · on Oct 31, 2011

> You seem to be under the impression that to be an authority one must have comprehensive and detailed knowledge, particularly in areas you care about.

Detailed and comprehensive knowledge is exactly what I expect from an authority on a subject. If they don't have it, then on what basis are they considered an authority?

It sounds to me - and I apologise if I've misunderstood - like you are arguing that a software architect doesn't need to understand the performance impact of their designs to be considered an authority on their subject? If so, then I respectfully disagree. It's possible to be a (bad) software architect without understanding that stuff, but I expect better from people who are supposed to be experts in the field.

> The key to being a good leader is not to know everything better than everyone, but to understand how to help everyone contribute their best, then get out of the way.

That's all very well but we're not talking about being a leader, we're talking about being an expert.

jasonwatkinspdx · on Oct 31, 2011

> what I expect from an authority on a subject

It's about the subject area. Fowler is an expert on OO architecture, not low level optimization. I think you're holding him to the wrong standard, and I think it's rather absurd to insist that anyone has a duty to be what you think they should be professionally.

vilya · on Nov 1, 2011

The way I see it, they're linked. The memory hierarchy has a direct impact on a number of architectural concerns: data organisation, choice of components, communication patterns between components, etc. It can make the difference between a system that meets its functional requirements and one that doesn't. These are all things an architect is supposed to care about, so to me it seems reasonable to expect an expert architect to be aware of the factors influencing them. I get the feeling we're not going to agree about this though.

jacques_chester · on Oct 30, 2011

As usual, it depends on what you define as a "transaction".

If you don't need ACID, I can probably jerry-rig something in C that'll give you hundreds of thousands of "transactions" per second. It may have problems with retrieving the data, but hey, look at that sucker go!

OK. Time to be more reasonable.

1. Relaxed Guarantees

There's no actual durability in the conventional spinning-rust sense. Instead you're hoping that a fleet of identical instances all receive the same event objects in the same order. If they don't, you're going to have mysterious problems when some of the systems get out of sync.

According to fowler, business transactions are broken into events which are processed individually. This essentially means that some transactions aren't atomic. LMAX can credit A and fail to debit B because the bits are done independently.

Consistency is palmed off to the input queues.

2. Smart data structures

They profiled the code and found places where they could swap out standard Java collections for their own problem-specific variants.

What they refer to as "Disruptors" are smart ring buffers; or as they are sometimes called, "Queues". I realise that this is not as cool-sounding as "Disruptor" (they should have called the "Business Logic Processor" the "Phaser"!), but it seems to more or less describe the interface, which is that things go in a given order and come out in approximately that order.

Actually, it's possible I've misunderstood how that structure works. It might also be that the system is skipping over elements that aren't yet ready, in which case this looks more like an active priority queue, interface-wise.

3. Conclusion

Impressive work from the LMAX team. But let's remember to keep stuff in perspective. It has always been the case that ACID exacts a high toll on performance. To the extent that you relaxed it you could always go faster.

Too often we in our industry see the shiny big performance number and forget that it isn't free. Like everything in life there is an opportunity cost. Choose carefully.

aaronblohowiak · on Oct 30, 2011

Re: 1. There is no "hoping", the input disrupter garauntees a total ordering to events as written to a journal. this journal is then processed by BLPs.

RE: atomiticity -- the BLP processes one message at a time, which ensures that you do not have multiple threads clashing, but you do not have MVCC or any form of rollback, which he addresses thusly:

>LMAX's in-memory structures are persistent across input events, so if there is an error it's important to not leave that memory in an inconsistent state. However there's no automated rollback facility. As a consequence the LMAX team puts a lot of attention into ensuring the input events are fully valid before doing any mutation of the in-memory persistent state. They have found that testing is a key tool in flushing out these kinds of problems before going into production.

I am more concerned with the hand-waving around the failure case -- falling over to an alternate BLP on failure does not prevent you from duplicate instructions; if a processing an event would create multiple output events for the output disrupter, but the blp is terminated before all output events are sent you either a) must have some kind of multi/exec on output events or b) must write code that is able to resume the processing of a message from an intermediary point or c) must otherwise prevent or accomodate duplicate output events from the same source event.

This is a result of the lack of "transactionality" that you are referring to, and I would love to read more about how they address this particular sticky wicket when a system fails.

mjpt777 · on Oct 31, 2011

Single nodes can die in the system without issue. They often do! Since we use IP multicast the network failure is transparent as a replica takes up the primary role.

The one issue to be managed with this type of system is exceptions in the business logic thread. This can be handled via a number of prevention techniques. First, apply very strict validation on all input parameters. Second, take a test driven approach to development; at LMAX we have 10s of thousands of automated tests. Third, code so methods are either idempotent, or changes are only applied at the end when the business logic is complete using local variables. With this combination of approaches we have not seen a production outage due to business logic thread failure in over a year of live operation.

_grrr · on Oct 30, 2011

As far as I understand three of the key features are:

1. Reduced queue contention: queues are typically implemented with a list, e.g. linked list, this introduces contention (queues spend a lot of time empty or very full) for the head and tail of the queue which are often the same dummy node. The ring buffer removes this contention.

2. Machine Sympathy vis a vis cache striding and ensuring concurrent threads are not invalidating each others level 1/2 cache.

3. Pre-allocation of queue data structures to ensure GC is not a factor.

Personally I think the LMAX team have done well in advancing the state of the art in what is often a key component in event driven, high throughput low latency systems such as those used in banks for trading, exchanges and market data.

koevet · on Oct 30, 2011

In regards to your comments:

1. relaxed guarantees, "As a consequence the LMAX team puts a lot of attention into ensuring the input events are fully valid before doing any mutation of the in-memory persistent state.". It looks like they deal with inconsistency at the business logic level instead of delegating to the persistence store. Probably, they don't have very complex transactional logic.

2. Smart data structures, here http://disruptor.googlecode.com/files/Disruptor-1.0.pdf you have a deeper technical view of the Disruptor and the rationale behind "mechanical sympathy".

jacques_chester · on Oct 30, 2011

> It looks like they deal with inconsistency at the business logic level instead of delegating to the persistence store.

Actually it looks the other way around. As I read it, the "Input Disruptor" does the validation.

Do the three components all run in the same thread, or are they on different JVMs?

koevet · on Oct 30, 2011

The Disruptor components are not part of a single threaded process, as far as I understand

"Also these three tasks are relatively independent, all of them need to be done before the Business Logic Processor works on a message, but they can done in any order. So unlike with the Business Logic Processor, where each trade changes the market for subsequent trades, there is a natural fit for concurrency."

I think that the Disruptor can run on the save JVM as the business logic, also to avoid remoting-related bottlenecks.

aaronblohowiak · on Oct 30, 2011

The BLP is single-threaded, and the component that modifies the application's current state. It is my understanding that "Events" in event-sourcing are immutable and that state is just a memoized computation over the total journal of events -- so the layer of atomiticity that is provided is at the event level (as the BLP only processes one event at a time.)

jacques_chester · on Oct 30, 2011

So the system is as "weak as its strongest link".

Hmm.

_grrr · on Oct 30, 2011

The hand off between producer and consumer can be configured to be both synchronous (single threaded) as well as asynchronous (one producer thread, many consumers threads).

espeed · on Oct 30, 2011

In the footnotes, he lists "what's in a transaction" (http://martinfowler.com/articles/lmax.html?t=1319912579#foot...).

mjpt777 · on Oct 31, 2011

All transactions processed by the LMAX system are ACID. The application is modelled such that each input event represents one transaction. Where business transaction span multiple nodes, rather than handle the complexity of retaining ACID properties across multiple messages, our reliable messaging system ensures delivery to a remote system and queuing a message is included in the same transaction context as the business logic. This is true of services that use a high speed journal (event-sourced) and those that perform more traditional database operations. Long running transactions become steps in a state model.

erichocean · on Oct 30, 2011

For people who want to implement this approach in C++, FastFlow is the library to use (and it's crazy fast): http://calvados.di.unipi.it/dokuwiki/doku.php?id=ffnamespace...

fsaintjacques · on Oct 30, 2011

https://github.com/fsaintjacques/disruptor--

erichocean · on Oct 31, 2011

Wow, great link! Thanks for sharing.

andrewcooke · on Oct 30, 2011

it seems to me that you need something else that i didn't see described in the article (perhaps i missed it?):

the output disruptor has to be able to recognise and discard duplicate requests for output. otherwise, when you replay the input to a new logic processor (after the old one died) you're going to get repeat outputs.

is that right or have i misunderstood something else? it seems like it places additional constraints on the design (the output ring buffer must be large enough to store sufficient old data, perhaps?). thanks.

alternatively, the systems that the output talks to could be idempotent (more exactly, the operations triggered by the messages to the systems are idempotent), but i suspect that is not (always?) possible.

mjpt777 · on Oct 31, 2011

The business logic simply outputs state changes that other systems subscribe to. If the subscribing system misses the update our reliable message delivery will ensure the message gets to the subscribing system and processed. Our system guarantees message processing and not just delivery.

Queries against the business domain model are just input events. The business logic will serialise the requested part of the model and publish it out. This type of query can easily be handled by any number of replica nodes.

All messages in the system carry a sequence number so duplicate messages can be detected and ignored.

andrewcooke · on Nov 1, 2011

Thanks. In retrospect this seems like the only way you can guarantee eventual consistency anyway, so I suppose I should have assumed it.

mbucc · on Oct 30, 2011

I have been using event-based persistence, and while performance is definitely a bonus, the win for me is the ability to replay the transaction log.

For example, let's say you log each page view and include the referer in that transaction. Initially, the only state you maintain (in memory) is the count of page views.

Three months after your system is on-line, you decide you want to take a closer look at who is sending you traffic. So, you update the business logic that processes the pageview_transaction to count page views by referer, perhaps a count of domains by day (for up to 365 days). Replay the log and you will now have that state for the entire history of your app.

I expect there will be benefits for debugging as well, since I can replay the transaction log with a local, annotated version of the app to figure out specific sequence of real-world transactions that surfaced the bug.

And you can use SQL if you like. You just have to keep your transaction processing fast (e.g., by using an in-memory db).

If you are interested, here's a Python implementation: https://github.com/PetrGlad/python-prevayler.

espeed · on Oct 30, 2011

The code for the open-sourced "Disruptor" is here (http://code.google.com/p/disruptor/).

rikf · on Oct 30, 2011

I came across this a while ago on twitter fascinating stuff there is loads more info on the team blog. http://blogs.lmax.com/

xtacy · on Oct 30, 2011

The "network of queues" architecture reminds me of Prof. Matt Welsh's PhD thesis titled SEDA: Staged Event Driven Architecture. It has benefits of both the concurrency offered by event driven and parallelism offered by threaded architectures. http://www.eecs.harvard.edu/~mdw/proj/seda/

Of course, the network of queues isn't the only thing that makes the system possible, but it closely resembles processing in a distributed system; a system of workers.

jacques_chester · on Oct 30, 2011

He name checks SEDA in the piece.

spenrose · on Oct 30, 2011

Many of the lessons he mentions here jibe with what I learned co-developing a system based on interacting single-threaded processes. I especially like the emphasis on how important and difficult it is to test and measure and the value of regular restarts for long-term reliability. The shout-out to CQRS is worth amplifying as well.

ds13 · on Oct 30, 2011

Good overview video presentation on LMAX: http://www.infoq.com/presentations/LMAX

Presented by Martin Thompson and Michael Barker on Dec 16, 2010

Some Java framework are already using Disruptor: http://jdon.org

jebblue · on Oct 30, 2011

I didn't see a place to add comments on his article page.

In general, this reminds me a lot of architectures designed for embedded systems today (which are how software was designed for PC's in the early days).

The huge up side is performance the huge down side is that it completely ignores the significance of what a relational database offers to the company as a whole.

We need to be looking at ways to make SQL databases faster, not at ways to avoid its use.

colinb · on Oct 30, 2011

Embedded systems are designed to squeeze the absolute maximum out of the hardware upon which they run. All this processor affinity, cache stride calculation, and avoidance of cross-core conflicts is just how business is done.

The upside is performance, because that's what is considered important in this case. The downside, which you've neglected to mention is difficulty of maintenance due to decreased comprehensibility of the system as a whole.

A relational database may not have that much value to the company as a whole.

By all means look at ways to make SQL DBMSes quicker. Lots of clever people have spent decades doing just that. I'm sure they're not finished yet. [incidentally Mohan et al, were using sequential writes for their [undo - or redo? I don't recall which] logs back when CPU speeds where measured in double-digit MHz - DB2 and all that. While its been a while since I had cause to look at the code inside any DBMS, I suspect the same is true today.]

But talking about SQL here is just a hammer in search of a thumb. Pick the tool for the job.

[in summary, I'm more than happy to keep building high-performance low-latency embedded systems in ways that might make an applications programmer weep, but I'm quite glad that the folks who take care of my company's payroll are running industrial-strength transactional systems]

davefarley · on Nov 1, 2011

I agree with your comments about picking the right tools for the job, but it is my contention that our code, at LMAX, is cleaner as a result of our architecture, not more obscure.

The code that matters to our business, the business logic processors in our services, is a clean, well modelled [ mostly ;-) ] implementation of our domain problem with NO technology constraints - no DB code, no magic annotations, no complex threading code, just single-threaded stateful POJOs that provide the solutions to our business problems. For me that is one of the most important benefits of our approach, not just the significant performance benefits that we get as a result, but the simplicity of the programming model.

We have entirely isolated the complex bits to infrastructure, most of our day to day work is in writing business logic code. High performance code is surely focussed on doing the minimum amount of work for the maximum function. How better to achieve that than have a software simulation of the business problem, a domain model in the DDD sense? Yes you need to pick your collections wisely to represent the relationships within your domain model, but other than that modelling the problem well is a key attribute of both high-performance systems and good code - at least to my way of thinking.

   Dave Farley

dr_rezzy · on Oct 31, 2011

I like the self awareness here but I kind of cringed at some of the points made by the author. LMAX is definitely a counter to the one size/approach/design/trend fits all thinking in software. I guess its true when they say execution is everything. In the end, a great example of a well executed innovative design.

ludwigvan · on Oct 30, 2011

The lack of threads reminds me of JavaScript's handling of events. There are no threads in JavaScript too, and everything is processed in an event loop. The article also discusses node.js as I expected, but I would appreciate someone familiar with node.js compare their approach to what node does.

dfc · on Oct 30, 2011

Is disruptor a common term/phrase?

wr1472 · on Oct 30, 2011

This was posted on HN a few months back, when the article was first published. Can't find the link though.

kb101 · on Oct 30, 2011

This is great stuff, thanks very much for posting this.

kb101 · on Oct 31, 2011

This must be the only place on the web where saying thank you gets you pissed on. I'm going back to lurking, can't figure this place out. Please downvote my "karma" to zero & adios.

veyron · on Oct 31, 2011

If everyone said "thank you" to every post, those would swamp any meaningful discussion. Think about it. On this article alone there would be over 100 such comments.

Generally, you can give a +1 by upvoting. If you have something to add to the discussion, you can add a thank you in that comment, but in isolation a "thank you" really adds no value.

espeed · on Oct 31, 2011

You say thank you by voting up the URL or comment. Literally saying "thank you", "I agree", or stuff like "+1" is frowned upon because it doesn't add to the discussion (see http://ycombinator.com/newsguidelines.html).