Additionally, Cliff points out that there really needs to be some kind of standard CPU/socket affinity API for VMs so that the relevant threads can share L3/L2 cache and reduce bus saturation. The Managed Runtime Initiative is an interesting early effort to address this and other common VM concerns.
Edit: Looks like memory latency (not bus saturation) is the big issue if producer and consumer threads end up on different CPU sockets, hence Cliff's call for an explicit CPU/socket affinity API.
 http://www.azulsystems.com/blog/cliff/2011-09-23-a-pair-of-s... (scroll down to Disruptor section)
In particular, since each input event has to be journaled and replicated (which both involve I/O) before it can be processed, there is a potentially large and unpredictable delay for each event.
Another issue is that this architecture assumes that 1 CPU core is enough to run all business logic processors, and that 1 machine's memory is enough to hold your state. If your processing is something CPU-intensive or your state is large, you'll hit a scaling wall that requires you to manually shard your input across multiple machines.
As Martin said in his reply, our system is implemented as a number of services that communicate via pub-sub messaging, so it is not quite as naive as you assume: we still have a distributed system. The high-throughput, low-latency functions of the system (e.g. order matching in our exchange, or risk evaluation in our broker) are each serviced on a single business logic thread, on separate servers.
Our latency measurements include these multiple hops within our network: they represent the time from an inbound instruction arriving at our network to a fully processed, outbound response back at the edge of our network. As Martin pointed out, modern networking can be quite efficient when used well.
We have designed the system to allow us to shard when the need arises, but we are actually a long way from that need. Even though these high performance services keep their entire working set in memory, that is still relatively small compared to the amount of memory available in modern commodity servers. We currently have lots of head-room!
We think that this is a very scalable and under-used approach. Keeping the working set in memory is pretty straightforward for many business applications, and has the huge benefit of being very simple to code. It is much more straightforward than the shuffling of data around, and translation from one form to another, that is such a common theme in more conventional systems.
The sharding decision is simple: for our problem domain we have two obvious dimensions for sharding, accounts and order-books. Each instance (shard) would continue to have its business logic processed on a single thread. I think that this is normal for most business problems; it is a matter of designing the solution to avoid shared state between shards.
We are not advocating "no parallelism", rather we advocate that any state should be modified by a single thread, to avoid contention. Our measurements have shown that the cost of contention almost always outweighs the benefit. So avoid contention, not parallelism.
(Head of Software development at LMAX)
If you apply "low latency" to the inner loop only, then you can resolve your second critique too: they won't be supporting anything that is CPU-intensive (since that isn't low latency). Also, all state has to fit in a single node, to keep things low latency.
The article doesn't directly address this but does compare LMAX to a typical database backed business application. The computation here is not usually CPU-bound.
Some means of reliable delivery of messages, to and from this single thread, is necessary to make a useful application. These messages must be delivered even in the event of system failures. To address this need, the Disruptor is employed to pipeline the replication, journalling and business logic for these messages and run them in parallel. The whole system is asynchronous and non-blocking.
In our architecture we have multiple gateway nodes that handle border security and protocol translation to and from our internal IP multi-cast binary protocol for delivery to highly redundant nodes. Lots of external connections can be multiplexed down from the outside world this way.
The 6 million TPS refers to our matching engine business logic event processor. We have other business logic processors for things like risk management and account functions. These can all communicate via our guaranteed message delivery system that can survive node failures and restarts, even across data centres.
Modern financial exchanges can process over 100K TPS and have to respond with latency in the 100s of microseconds firewall to firewall, thus including the entire internal infrastructure. Those tracking the latest developments will see that it is possible to have single-digit microsecond network hops with IP multicast, user-space network stacks and RDMA over 10GigE. Even a well tuned 1GigE stack can achieve sub-40us for a network hop. For reference, single-digit microseconds is in the same space as a context switch on a lock with the kernel arbitrating. Most financial exchanges rely on having data on multiple nodes before the transaction is secure. A number of these nodes can be asynchronously journalling the data down to disk. At LMAX we tend to have data in 3 or more nodes at any given time.
In my experience of profiling many business applications, when the application domain is well modelled the vast majority of the time is spent either in protocol translation, such as XML or JSON to business objects, or within the JDBC driver doing buffer copying and waiting on the database to respond.
Often applications are not well modelled for their domain. This can result in algorithms that, rather than being O(1) for most transactions, have horrible scale-up characteristics because of inappropriate collections representing relationships. If you have the luxury of developing an in-memory application requiring high performance, it quickly becomes apparent that the cost of a CPU cache miss is the biggest limitation on latency and throughput. For this, one needs to employ data structures that exhibit good mechanical sympathy for the CPU and memory subsystem. At LMAX we have replaced most of the JDK collections with our own, which are cache friendly and garbage free.
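As an illustration of the kind of collection swap being described, here is a toy long-to-long map backed by flat primitive arrays with open addressing, so lookups walk contiguous memory and allocate no boxed keys or node objects. This is my own sketch, not LMAX's actual collections; it omits resizing and treats key 0 as "empty".

```java
public class LongLongMap {
    private final long[] keys;
    private final long[] values;
    private final int mask;

    LongLongMap(int capacity) {          // capacity must be a power of two
        keys = new long[capacity];
        values = new long[capacity];
        mask = capacity - 1;
    }

    void put(long key, long value) {     // key must be non-zero in this toy version
        int i = (int) (key & mask);
        // Linear probing: collisions land in the adjacent slot, which is
        // likely already in the same or the next cache line.
        while (keys[i] != 0 && keys[i] != key) i = (i + 1) & mask;
        keys[i] = key;
        values[i] = value;
    }

    long get(long key, long missing) {
        int i = (int) (key & mask);
        while (keys[i] != 0) {
            if (keys[i] == key) return values[i];
            i = (i + 1) & mask;
        }
        return missing;
    }
}
```

Compared to `HashMap<Long, Long>`, there are no Entry node objects to chase through memory and nothing for the garbage collector to reclaim on the hot path.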
So far we have had no issue processing all transactions for a given purpose on a single thread, or holding all the live state in memory for a single node. If we ever cannot process all the transactions necessary on a single thread then we simply shard the model across threads/execution contexts. We only hold the live mutating data in-memory and archive out to database completed transactions as they are then read only.
Martin (LMAX CTO)
It's such a fundamental topic that everyone ought to at least know the basics of it. People architecting large systems, like Fowler, have a special responsibility to know these things: if they aren't taken into consideration in the system design, it's easy to dig a performance hole too deep to climb out of.
Well I work on software that does image processing at the moment & we've chosen to represent images as separate channels (i.e., all the red values, then all the green values, then all the blue values, etc). If we had interleaved the channels instead, many common operations that only need to look at one channel - or just one at a time - would be a lot slower.
When data is loaded into the cache, it's loaded in chunks called cache lines. These are usually 64 or 128 bytes long, depending on your CPU. So assuming a 64 byte cache line, if you ask for the value at address 4 then it'll load in all the values in addresses 0-63. If you then ask for the value at address 8, it'll already be cached & therefore quick to access; but if you ask for the value at address 64, it'll have to fill another cache line first - a cache miss.
So back to the images, say we're just looking at the alpha channel of an RGBA image. With separate channels we get 16 alpha values in each cache line (each channel is a float, so 4 bytes). If the channels were interleaved then we'd only get 4 alpha values in each cache line, so the CPU will have to fill 4 times as many cache lines.
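The planar-vs-interleaved difference above can be sketched as follows (array sizes and values are made up; this is not the commenter's actual code):

```java
public class ChannelLayout {
    // Planar: all R values, then all G, all B, all A. The alpha channel is
    // one contiguous run, so each 64-byte cache line holds 16 alpha floats.
    static float sumAlphaPlanar(float[] planar, int pixels) {
        float sum = 0f;
        int alphaStart = 3 * pixels;            // R, G, B blocks come first
        for (int i = 0; i < pixels; i++) {
            sum += planar[alphaStart + i];      // sequential reads
        }
        return sum;
    }

    // Interleaved: R,G,B,A,R,G,B,A,... Alpha is every 4th float, so each
    // cache line only contributes 4 alpha values - 4x as many line fills.
    static float sumAlphaInterleaved(float[] interleaved, int pixels) {
        float sum = 0f;
        for (int i = 0; i < pixels; i++) {
            sum += interleaved[4 * i + 3];      // strided reads
        }
        return sum;
    }
}
```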
Because so much of our code deals with images (and not just ours - our clients too), if we'd chosen an interleaved channel representation and were now finding that too slow, we'd be pretty stuck. So it's really important to consider these issues up front, but of course you can't if you don't know at least a little bit about how the hardware works.
Hopefully that explains it a bit better?
You seem to be under the impression that to be an authority one must have comprehensive and detailed knowledge, particularly in areas you care about. I believe this is rarely the case. It's far more important to understand the essential. The key to being a good leader is not to know everything better than everyone, but to understand how to help everyone contribute their best, then get out of the way.
Detailed and comprehensive knowledge is exactly what I expect from an authority on a subject. If they don't have it, then on what basis are they considered an authority?
It sounds to me - and I apologise if I've misunderstood - like you are arguing that a software architect doesn't need to understand the performance impact of their designs to be considered an authority on their subject? If so, then I respectfully disagree. It's possible to be a (bad) software architect without understanding that stuff, but I expect better from people who are supposed to be experts in the field.
> The key to being a good leader is not to know everything better than everyone, but to understand how to help everyone contribute their best, then get out of the way.
That's all very well but we're not talking about being a leader, we're talking about being an expert.
It's about the subject area. Fowler is an expert on OO architecture, not low level optimization. I think you're holding him to the wrong standard, and I think it's rather absurd to insist that anyone has a duty to be what you think they should be professionally.
If you don't need ACID, I can probably jerry-rig something in C that'll give you hundreds of thousands of "transactions" per second. It may have problems with retrieving the data, but hey, look at that sucker go!
OK. Time to be more reasonable.
1. Relaxed Guarantees
There's no actual durability in the conventional spinning-rust sense. Instead you're hoping that a fleet of identical instances all receive the same event objects in the same order. If they don't, you're going to have mysterious problems when some of the systems get out of sync.
According to Fowler, business transactions are broken into events which are processed individually. This essentially means that some transactions aren't atomic. LMAX can credit A and fail to debit B because the bits are done independently.
Consistency is palmed off to the input queues.
2. Smart data structures
They profiled the code and found places where they could swap out standard Java collections for their own problem-specific variants.
What they refer to as "Disruptors" are smart ring buffers; or as they are sometimes called, "Queues". I realise that this is not as cool-sounding as "Disruptor" (they should have called the "Business Logic Processor" the "Phaser"!), but it seems to more or less describe the interface, which is that things go in a given order and come out in approximately that order.
Actually, it's possible I've misunderstood how that structure works. It might also be that the system is skipping over elements that aren't yet ready, in which case this looks more like an active priority queue, interface-wise.
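For the interface as described - things go in a given order and come out in approximately that order - a minimal single-producer/single-consumer ring buffer might look like this (my own sketch, not the Disruptor's code):

```java
public class Ring {
    private final long[] slots;
    private final int mask;        // size must be a power of two
    private long head = 0;         // next slot to read
    private long tail = 0;         // next slot to write

    Ring(int size) { slots = new long[size]; mask = size - 1; }

    boolean offer(long value) {
        if (tail - head == slots.length) return false;  // full
        slots[(int) (tail & mask)] = value;  // cheap bit-mask index, no modulo
        tail++;
        return true;
    }

    Long poll() {
        if (head == tail) return null;                  // empty
        long v = slots[(int) (head & mask)];
        head++;
        return v;
    }
}
```

The "smart" parts of the real Disruptor sit on top of this shape: sequence counters instead of plain head/tail fields, memory barriers instead of locks, and multiple consumers tracking their own positions in the same buffer.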
Impressive work from the LMAX team. But let's remember to keep stuff in perspective. It has always been the case that ACID exacts a high toll on performance. To the extent that you relaxed it you could always go faster.
Too often we in our industry see the shiny big performance number and forget that it isn't free. Like everything in life there is an opportunity cost. Choose carefully.
RE: atomicity -- the BLP processes one message at a time, which ensures that you do not have multiple threads clashing, but you do not have MVCC or any form of rollback, which he addresses thusly:
>LMAX's in-memory structures are persistent across input events, so if there is an error it's important to not leave that memory in an inconsistent state. However there's no automated rollback facility. As a consequence the LMAX team puts a lot of attention into ensuring the input events are fully valid before doing any mutation of the in-memory persistent state. They have found that testing is a key tool in flushing out these kinds of problems before going into production.
I am more concerned with the hand-waving around the failure case: failing over to an alternate BLP on failure does not prevent duplicate instructions. If processing an event would create multiple output events for the output disruptor, but the BLP is terminated before all output events are sent, you either a) must have some kind of multi/exec on output events, b) must write code that is able to resume the processing of a message from an intermediary point, or c) must otherwise prevent or accommodate duplicate output events from the same source event.
This is a result of the lack of "transactionality" that you are referring to, and I would love to read more about how they address this particular sticky wicket when a system fails.
The one issue to be managed with this type of system is exceptions in the business logic thread. This can be handled via a number of prevention techniques. First, apply very strict validation on all input parameters. Second, take a test-driven approach to development; at LMAX we have tens of thousands of automated tests. Third, code so that methods are either idempotent, or so that changes are computed in local variables and only applied at the end, when the business logic is complete. With this combination of approaches we have not seen a production outage due to business logic thread failure in over a year of live operation.
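The third technique (compute into locals, commit at the end) can be sketched as follows; the account fields and the transfer operation are my own illustration, not LMAX code:

```java
public class TransferProcessor {
    long balanceA = 100, balanceB = 0;

    void transfer(long amount) {
        // 1. Strict validation before any mutation.
        if (amount <= 0) throw new IllegalArgumentException("amount must be positive");
        if (amount > balanceA) throw new IllegalStateException("insufficient funds");
        // 2. Compute results into local variables; an exception here
        //    leaves the persistent in-memory state untouched.
        long newA = balanceA - amount;
        long newB = balanceB + amount;
        // 3. Commit at the end: either both fields change or neither does.
        balanceA = newA;
        balanceB = newB;
    }
}
```

Because a single thread owns the state, this "validate, compute, commit" discipline is all that is needed to keep the in-memory model consistent; there is no concurrent writer to race against.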
1. Reduced queue contention: queues are typically implemented with a list, e.g. a linked list. This introduces contention for the head and tail of the queue, which (since queues spend a lot of time either empty or very full) are often the same dummy node. The ring buffer removes this contention.
2. Machine sympathy vis-a-vis cache striding, ensuring concurrent threads are not invalidating each other's L1/L2 caches.
3. Pre-allocation of queue data structures to ensure GC is not a factor.
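Points 2 and 3 can be sketched together: slots are allocated once up front and reused (no per-message allocation, so no GC pressure), and hot counters are padded so that two threads' cursors never share a cache line. This is illustrative only; the field names are my own, not the Disruptor's.

```java
public class PreallocatedRing {
    static final class Event { long value; }   // mutable slot, reused forever

    // Padding on both sides keeps this counter on its own 64-byte cache
    // line, so updates by one thread don't invalidate another thread's line.
    static final class PaddedCounter {
        long p1, p2, p3, p4, p5, p6, p7;       // padding before
        long value;                            // the hot field
        long q1, q2, q3, q4, q5, q6, q7;       // padding after
    }

    final Event[] events;
    final PaddedCounter cursor = new PaddedCounter();

    PreallocatedRing(int size) {
        events = new Event[size];
        for (int i = 0; i < size; i++) events[i] = new Event();  // allocate up front
    }

    void publish(long v) {
        Event e = events[(int) (cursor.value % events.length)];
        e.value = v;                           // overwrite in place, no new objects
        cursor.value++;
    }
}
```

(The JVM may reorder fields, so hand-rolled padding like this is best-effort; the idea is what matters here.)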
Personally I think the LMAX team have done well in advancing the state of the art in what is often a key component in event driven, high throughput low latency systems such as those used in banks for trading, exchanges and market data.
1. relaxed guarantees, "As a consequence the LMAX team puts a lot of attention into ensuring the input events are fully valid before doing any mutation of the in-memory persistent state.". It looks like they deal with inconsistency at the business logic level instead of delegating to the persistence store. Probably, they don't have very complex transactional logic.
2. Smart data structures, here http://disruptor.googlecode.com/files/Disruptor-1.0.pdf you have a deeper technical view of the Disruptor and the rationale behind "mechanical sympathy".
Actually it looks the other way around. As I read it, the "Input Disruptor" does the validation.
Do the three components all run in the same thread, or are they on different JVMs?
"Also these three tasks are relatively independent, all of them need to be done before the Business Logic Processor works on a message, but they can be done in any order. So unlike with the Business Logic Processor, where each trade changes the market for subsequent trades, there is a natural fit for concurrency."
I think that the Disruptor can run on the same JVM as the business logic, also to avoid remoting-related bottlenecks.
the output disruptor has to be able to recognise and discard duplicate requests for output. otherwise, when you replay the input to a new logic processor (after the old one died) you're going to get repeat outputs.
is that right or have i misunderstood something else? it seems like it places additional constraints on the design (the output ring buffer must be large enough to store sufficient old data, perhaps?). thanks.
alternatively, the systems that the output talks to could be idempotent (more exactly, the operations triggered by the messages to the systems are idempotent), but i suspect that is not (always?) possible.
Queries against the business domain model are just input events. The business logic will serialise the requested part of the model and publish it out. This type of query can easily be handled by any number of replica nodes.
All messages in the system carry a sequence number so duplicate messages can be detected and ignored.
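A sketch of that duplicate detection: a consumer remembers the highest sequence number it has applied and drops anything at or below it. Illustrative only, not LMAX's actual wire format.

```java
public class DedupConsumer {
    private long lastApplied = 0;
    private long state = 0;

    // Returns true if the message was applied, false if it was a duplicate.
    boolean onMessage(long sequence, long delta) {
        if (sequence <= lastApplied) return false;  // replayed duplicate: ignore
        state += delta;
        lastApplied = sequence;
        return true;
    }

    long state() { return state; }
}
```

During failover, the replacement node can replay the journal through this check and re-emit outputs safely: events already processed are recognised by sequence number and skipped.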
For example, let's say you log each page view and include the referer in that transaction. Initially, the only state you maintain (in memory) is the count of page views.
Three months after your system is on-line, you decide you want to take a closer look at who is sending you traffic. So, you update the business logic that processes the pageview_transaction to count page views by referer, perhaps a count of domains by day (for up to 365 days). Replay the log and you will now have that state for the entire history of your app.
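Replaying the log with the updated logic might look like this (the event fields are made up for illustration):

```java
import java.util.*;

public class Replay {
    // The event that was journalled all along, even while the business
    // logic only counted total page views.
    record PageView(String url, String referer) {}

    // New logic added months later: derive per-referer counts by replaying
    // the same historical events through the updated processor.
    static Map<String, Integer> countByReferer(List<PageView> log) {
        Map<String, Integer> counts = new HashMap<>();
        for (PageView pv : log) {
            counts.merge(pv.referer(), 1, Integer::sum);
        }
        return counts;
    }
}
```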
I expect there will be benefits for debugging as well, since I can replay the transaction log with a local, annotated version of the app to figure out specific sequence of real-world transactions that surfaced the bug.
And you can use SQL if you like. You just have to keep your transaction processing fast (e.g., by using an in-memory db).
If you are interested, here's a Python implementation: https://github.com/PetrGlad/python-prevayler.
Of course, the network of queues isn't the only thing that makes the system possible, but it closely resembles processing in a distributed system; a system of workers.
Presented by Martin Thompson and Michael Barker on Dec 16, 2010
Some Java frameworks are already using the Disruptor:
In general, this reminds me a lot of architectures designed for embedded systems today (which are how software was designed for PCs in the early days).
The huge upside is performance; the huge downside is that it completely ignores the significance of what a relational database offers to the company as a whole.
We need to be looking at ways to make SQL databases faster, not at ways to avoid their use.
The upside is performance, because that's what is considered important in this case. The downside, which you've neglected to mention, is difficulty of maintenance due to decreased comprehensibility of the system as a whole.
A relational database may not have that much value to the company as a whole.
By all means look at ways to make SQL DBMSes quicker. Lots of clever people have spent decades doing just that. I'm sure they're not finished yet. [Incidentally, Mohan et al. were using sequential writes for their [undo - or redo? I don't recall which] logs back when CPU speeds were measured in double-digit MHz - DB2 and all that. While it's been a while since I had cause to look at the code inside any DBMS, I suspect the same is true today.]
But talking about SQL here is just a hammer in search of a thumb. Pick the tool for the job.
[in summary, I'm more than happy to keep building high-performance low-latency embedded systems in ways that might make an applications programmer weep, but I'm quite glad that the folks who take care of my company's payroll are running industrial-strength transactional systems]
The code that matters to our business, the business logic processors in our services, is a clean, well modelled [ mostly ;-) ] implementation of our domain problem with NO technology constraints - no DB code, no magic annotations, no complex threading code, just single-threaded stateful POJOs that provide the solutions to our business problems. For me that is one of the most important benefits of our approach, not just the significant performance benefits that we get as a result, but the simplicity of the programming model.
We have entirely isolated the complex bits to infrastructure; most of our day to day work is in writing business logic code. High performance code is surely focussed on doing the minimum amount of work for the maximum function. How better to achieve that than to have a software simulation of the business problem, a domain model in the DDD sense? Yes, you need to pick your collections wisely to represent the relationships within your domain model, but other than that, modelling the problem well is a key attribute of both high-performance systems and good code - at least to my way of thinking.
Generally, you can give a +1 by upvoting. If you have something to add to the discussion, you can add a thank you in that comment, but in isolation a "thank you" really adds no value.