
The LMAX Architecture - 100K TPS at Less than 1ms Latency - koevet
http://martinfowler.com/articles/lmax.html?t=1319912579
======
bodski
Cliff Click makes a good post [1] discussing the LMAX/Disruptor architecture
and how it can give large performance gains in specific circumstances (such as
those generally encountered by LMAX). In particular, the Disruptor works best
when there is a one-to-one ratio of Disruptor threads to CPU cores. Great to
see Cliff rolling up his sleeves and looking into this stuff from LMAX.

Additionally Cliff points out that there really needs to be some kind of
standard CPU/socket affinity API for VMs so that the relevant threads can
share L3/L2 cache and reduce bus saturation. The Managed Runtime Initiative
[2] is an interesting early effort to address this and other common VM
concerns.

Edit: Looks like memory latency (not bus saturation) is the big issue if
producer and consumer threads end up on different CPU sockets, hence Cliff's
call for an explicit CPU/socket affinity API.

[1] <http://www.azulsystems.com/blog/cliff/2011-09-23-a-pair-of-somebody-elses-concurrency-bugs>
(scroll down to the Disruptor section)

[2] <http://www.managedruntime.org/>

~~~
jbellis
Too bad there's been essentially zero activity on MRI for over a year.

~~~
mjpt777
A good post showing the effect of setting thread affinity for version 1.0 of
the Disruptor:

<http://java.dzone.com/articles/java-threads-steroids>
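
Since there is no standard affinity API, for illustration here is roughly what
setting affinity from Java looks like: a Linux-only sketch that calls glibc's
sched_setaffinity through JNA. The class and method names are my own, and the
64-bit mask assumes at most 64 logical CPUs; this is not code from the linked
article.

    import com.sun.jna.Library;
    import com.sun.jna.Native;

    // Pin the calling thread to one CPU core via glibc's sched_setaffinity.
    // Linux-only sketch; the single-long mask assumes <= 64 logical CPUs.
    public class ThreadAffinity {
        interface CLib extends Library {
            CLib INSTANCE = (CLib) Native.loadLibrary("c", CLib.class);
            // int sched_setaffinity(pid_t pid, size_t cpusetsize, cpu_set_t *mask)
            int sched_setaffinity(int pid, int cpusetsize, long[] mask);
        }

        public static void pinCurrentThreadTo(int cpu) {
            long[] mask = { 1L << cpu };  // one bit per logical CPU
            // pid 0 means "the calling thread" on Linux
            int rc = CLib.INSTANCE.sched_setaffinity(0, 8, mask);
            if (rc != 0) throw new IllegalStateException("sched_setaffinity failed");
        }
    }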

------
haberman
The article starts by saying that low latency is the main requirement for
financial trading, but then quotes all performance numbers in throughput ("6
million orders per second"). I couldn't find any mention of average or worst-
case latency, and throughput numbers alone tell you nothing about latency.

In particular, since each input event has to be journaled and replicated
(which both involve I/O) before it can be processed, there is a potentially
large and unpredictable delay for each event.

Another issue is that this architecture assumes that 1 CPU core is enough to
run all business logic processors, and that 1 machine's memory is enough to
hold your state. If your processing is something CPU-intensive or your state
is large, you'll hit a scaling wall that requires you to manually shard your
input across multiple machines.

~~~
davefarley
You are right that predictable latency is key, but that is one of the big wins
that we have found with the Disruptor and our surrounding infrastructure. The
Disruptor is not the only thing that we have worked on to achieve this. For
example, we have spent time experimenting with journalling and clustering
techniques to give us high-throughput, predictable mechanisms for protecting
our state.

As Martin said in his reply, our system is implemented as a number of services
which communicate via pub-sub messaging; it is not quite as naive as you
assume, since we still have a distributed system. The high-throughput, low
latency functions of the system (e.g. order matching in our exchange, or risk
evaluation in our broker) are each serviced on a single business logic
thread, on separate servers.

Our latency measurements include these multiple hops within our network and
represent the time from an inbound instruction arriving at our network to the
point where we have a fully processed outbound response back at the edge of
our network. As Martin pointed out, modern networking can be quite efficient
when used well.

We have designed to allow us to shard our system when the need arises, but we
are actually a long way away from that need. Even though these high
performance services keep their entire working set in memory, that working set
is still relatively small compared to the amount of memory available in
modern commodity servers. We currently have lots of headroom!

We think that this is a very scalable, and under-used, approach. Keeping the
working set in memory is pretty straightforward for many business
applications, and has the huge benefit of being very simple to code. It is
much more straightforward than the shuffling of data around and translation
from one form to another that is such a common theme in more conventional
systems.

The sharding decision is simple: for our problem domain we have two obvious
dimensions for sharding, accounts and order-books. Each instance (shard) would
continue to have its business logic processed on a single thread. I think
that this is normal for most business problems; it is a matter of designing
the solution to avoid shared state between shards.

We are not advocating "no parallelism", rather we advocate that any state
should be modified by a single thread, to avoid contention. Our measurements
have shown that the cost of contention almost always outweighs the benefit. So
avoid contention, not parallelism.
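
To make the single-writer point concrete, the pattern reduces to something
like the sketch below - my own minimal stand-in, not LMAX code: many threads
may produce events, but exactly one thread owns and mutates the state, so the
state itself needs no locks.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Minimal single-writer sketch: producers enqueue events, but ONE
    // thread owns and mutates the state, so no locks guard the state itself.
    public class SingleWriter {
        static class Order {
            final String account;
            final long amount;
            Order(String account, long amount) { this.account = account; this.amount = amount; }
        }

        private final BlockingQueue<Order> inbound = new ArrayBlockingQueue<Order>(1 << 16);
        private final Map<String, Long> positions = new HashMap<String, Long>(); // writer thread only

        public void submit(Order o) throws InterruptedException {
            inbound.put(o); // any thread may produce
        }

        public void runWriterLoop() throws InterruptedException {
            while (!Thread.currentThread().isInterrupted()) {
                Order o = inbound.take(); // the only consumer
                Long current = positions.get(o.account);
                positions.put(o.account, (current == null ? 0L : current) + o.amount); // uncontended
            }
        }
    }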

    
    
      Dave Farley
      (Head of Software Development at LMAX)

------
vilya
It's clear from Fowler's discussion of "mechanical sympathy" that he didn't
understand the memory hierarchy. I found it pretty shocking that someone in
his position could not know this stuff. He really needs to read Ulrich
Drepper's "What every programmer should know about memory"
(<http://www.akkadia.org/drepper/cpumemory.pdf>).

It's such a fundamental topic that everyone ought to at least know the basics
of it. People architecting large systems, like Fowler, have a special
responsibility to know these things: if they aren't taken into consideration
in the system design, it's easy to dig a performance hole too deep to climb
out of.

~~~
koevet
Can you be more specific?

~~~
vilya
About the performance issues?

Well, I work on software that does image processing at the moment, and we've
chosen to represent images as separate channels (i.e., all the red values,
then all the green values, then all the blue values, etc.). If we had
interleaved the channels instead, many common operations that only need to
look at one channel - or just one at a time - would be a lot slower.

When data is loaded into the cache, it's loaded in chunks called cache lines.
These are usually 64 or 128 bytes long, depending on your CPU. So assuming a
64 byte cache line, if you ask for the value at address 4 then it'll load in
all the values in addresses 0-63. If you then ask for the value at address 8,
it'll already be cached & therefore quick to access; but if you ask for the
value at address 64, it'll have to fill another cache line first - a cache
miss.

So back to the images, say we're just looking at the alpha channel of an RGBA
image. With separate channels we get 16 alpha values in each cache line (each
channel is a float, so 4 bytes). If the channels were interleaved then we'd
only get 4 alpha values in each cache line, so the CPU will have to fill 4
times as many cache lines.
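
In code, the difference is just the access stride - a toy sketch (the layout
and names are invented for illustration): summing the alpha channel of a
planar image uses every byte it pulls into cache, while the interleaved
version wastes three quarters of each cache line.

    // Toy comparison of planar vs. interleaved RGBA storage when summing
    // only the alpha channel.
    public class ChannelLayout {
        // Planar: all R values, then all G, then all B, then all A.
        static float sumAlphaPlanar(float[] alpha) {
            float sum = 0f;
            for (int i = 0; i < alpha.length; i++) {
                sum += alpha[i]; // contiguous reads: every cached byte is used
            }
            return sum;
        }

        // Interleaved: R,G,B,A,R,G,B,A,...
        static float sumAlphaInterleaved(float[] rgba) {
            float sum = 0f;
            for (int i = 3; i < rgba.length; i += 4) {
                sum += rgba[i]; // stride-4 reads: 3/4 of each cache line is wasted
            }
            return sum;
        }
    }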

Because so much of our code deals with images (and not just ours - our clients
too), if we'd chosen an interleaved channel representation and were now
finding that too slow, we'd be pretty stuck. So it's really important to
consider these issues up front, but of course you can't if you don't know at
least a little bit about how the hardware works.

Hopefully that explains it a bit better?

------
jacques_chester
As usual, it depends on what you define as a "transaction".

If you don't need ACID, I can probably jerry-rig something in C that'll give
you hundreds of thousands of "transactions" per second. It may have problems
with retrieving the data, but hey, look at that sucker go!

OK. Time to be more reasonable.

1. Relaxed Guarantees

There's no actual durability in the conventional spinning-rust sense. Instead
you're hoping that a fleet of identical instances all receive the same event
objects in the same order. If they don't, you're going to have mysterious
problems when some of the systems get out of sync.

According to Fowler, business transactions are broken into events which are
processed individually. This essentially means that some transactions aren't
atomic: LMAX can credit A and fail to debit B because the two steps are done
independently.

Consistency is palmed off to the input queues.

2. Smart data structures

They profiled the code and found places where they could swap out standard
Java collections for their own problem-specific variants.

What they refer to as "Disruptors" are smart ring buffers; or as they are
sometimes called, "Queues". I realise that this is not as cool-sounding as
"Disruptor" (they should have called the "Business Logic Processor" the
"Phaser"!), but it seems to more or less describe the interface, which is that
things go in a given order and come out in approximately that order.

Actually, it's possible I've misunderstood how that structure works. It might
also be that the system is skipping over elements that aren't yet ready, in
which case this looks more like an active priority queue, interface-wise.
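
For reference, the rough shape being described is a single-producer/single-
consumer ring buffer, something like this bare-bones sketch of my own (the
real Disruptor adds pre-allocated entries, cache-line padding, batching, and
pluggable wait strategies):

    import java.util.concurrent.atomic.AtomicLong;

    // Bare-bones single-producer/single-consumer ring buffer. Capacity is a
    // power of two so "sequence & mask" replaces the modulo operation.
    public class SpscRing<T> {
        private final Object[] slots;
        private final int mask;
        private final AtomicLong head = new AtomicLong(); // next slot to read
        private final AtomicLong tail = new AtomicLong(); // next slot to write

        public SpscRing(int capacityPowerOfTwo) {
            slots = new Object[capacityPowerOfTwo];
            mask = capacityPowerOfTwo - 1;
        }

        public boolean offer(T item) {          // producer thread only
            long t = tail.get();
            if (t - head.get() == slots.length) return false; // full
            slots[(int) (t & mask)] = item;
            tail.set(t + 1);                    // volatile write publishes the slot
            return true;
        }

        @SuppressWarnings("unchecked")
        public T poll() {                       // consumer thread only
            long h = head.get();
            if (h == tail.get()) return null;   // empty
            T item = (T) slots[(int) (h & mask)];
            head.set(h + 1);                    // frees the slot for reuse
            return item;
        }
    }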

3. Conclusion

Impressive work from the LMAX team. But let's remember to keep things in
perspective. It has always been the case that ACID exacts a high toll on
performance; to the extent that you relax it, you can go faster.

Too often we in our industry see the shiny big performance number and forget
that it isn't free. Like everything in life there is an opportunity cost.
Choose carefully.

~~~
aaronblohowiak
Re: 1. There is no "hoping"; the input disruptor guarantees a total ordering
of events as written to a journal. This journal is then processed by the BLPs.

Re: atomicity -- the BLP processes one message at a time, which ensures that
you do not have multiple threads clashing, but you do not have MVCC or any
form of rollback, which he addresses thusly:

>LMAX's in-memory structures are persistent across input events, so if there
is an error it's important to not leave that memory in an inconsistent state.
However there's no automated rollback facility. As a consequence the LMAX team
puts a lot of attention into ensuring the input events are fully valid before
doing any mutation of the in-memory persistent state. They have found that
testing is a key tool in flushing out these kinds of problems before going
into production.

I am more concerned with the hand-waving around the failure case -- failing
over to an alternate BLP on failure does not protect you from duplicate
instructions. If processing an event would create multiple output events for
the output disruptor, but the BLP is terminated before all output events are
sent, you either a) must have some kind of multi/exec on output events, b)
must write code that is able to resume the processing of a message from an
intermediate point, or c) must otherwise prevent or accommodate duplicate
output events from the same source event.

This is a result of the lack of "transactionality" that you are referring to,
and I would love to read more about how they address this particular sticky
wicket when a system fails.

~~~
mjpt777
Single nodes can die in the system without issue. They often do! Since we use
IP multicast, the failure is transparent to the network as a replica takes up
the primary role.

The one issue to be managed with this type of system is exceptions in the
business logic thread. These can be handled via a number of prevention
techniques. First, apply very strict validation on all input parameters.
Second, take a test-driven approach to development; at LMAX we have tens of
thousands of automated tests. Third, code so that methods are either
idempotent, or accumulate changes in local variables and apply them only at
the end when the business logic is complete. With this combination of
approaches we have not seen a production outage due to business logic thread
failure in over a year of live operation.
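
The third technique is the classic "validate everything, compute into locals,
commit last" shape; a minimal sketch with invented domain types, not LMAX
code:

    // Sketch of "validate fully, compute into locals, commit at the end",
    // so a failure partway through never leaves state half-mutated.
    public class TransferProcessor {
        static class Account {
            long balance;
        }

        public void transfer(Account from, Account to, long amount) {
            // 1. Validate everything before touching any state.
            if (amount <= 0) throw new IllegalArgumentException("bad amount");
            if (from.balance < amount) throw new IllegalStateException("insufficient funds");

            // 2. Compute the results in local variables only.
            long newFrom = from.balance - amount;
            long newTo = to.balance + amount;

            // 3. Commit at the very end; nothing above this line can fail.
            from.balance = newFrom;
            to.balance = newTo;
        }
    }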

------
erichocean
For people who want to implement this approach in C++, FastFlow is the library
to use (and it's crazy fast):
<http://calvados.di.unipi.it/dokuwiki/doku.php?id=ffnamespace:about>

~~~
fsaintjacques
<https://github.com/fsaintjacques/disruptor-->

~~~
erichocean
Wow, great link! Thanks for sharing.

------
andrewcooke
it seems to me that you need something else that i didn't see described in the
article (perhaps i missed it?):

the output disruptor has to be able to recognise and discard duplicate
requests for output. otherwise, when you replay the input to a new logic
processor (after the old one died) you're going to get repeat outputs.

is that right or have i misunderstood something else? it seems like it places
additional constraints on the design (the output ring buffer must be large
enough to store sufficient old data, perhaps?). thanks.

alternatively, the systems that the output talks to could be idempotent (more
exactly, the operations triggered by the messages to the systems are
idempotent), but i suspect that is not (always?) possible.

~~~
mjpt777
The business logic simply outputs state changes that other systems subscribe
to. If the subscribing system misses the update, our reliable message delivery
will ensure the message gets to the subscribing system _and_ is processed. Our
system guarantees message processing, not just delivery.

Queries against the business domain model are just input events. The business
logic will serialise the requested part of the model and publish it out. This
type of query can easily be handled by any number of replica nodes.

All messages in the system carry a sequence number so duplicate messages can
be detected and ignored.
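
That dedup check is simple to picture - a sketch with my own names, not LMAX
code: track the highest sequence applied per source, and ignore anything at or
below it.

    import java.util.HashMap;
    import java.util.Map;

    // Duplicate suppression by sequence number: per source, only messages
    // with a strictly higher sequence than the last one applied are accepted.
    public class Deduplicator {
        private final Map<String, Long> lastApplied = new HashMap<String, Long>();

        /** Returns true if the message is new and should be processed. */
        public boolean accept(String sourceId, long sequence) {
            Long last = lastApplied.get(sourceId);
            if (last != null && sequence <= last) return false; // duplicate or stale
            lastApplied.put(sourceId, sequence);
            return true;
        }
    }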

~~~
andrewcooke
Thanks. In retrospect this seems like the only way you can guarantee eventual
consistency anyway, so I suppose I should have assumed it.

------
mbucc
I have been using event-based persistence, and while performance is definitely
a bonus, the win for me is the ability to replay the transaction log.

For example, let's say you log each page view and include the referer in that
transaction. Initially, the only state you maintain (in memory) is the count
of page views.

Three months after your system is on-line, you decide you want to take a
closer look at who is sending you traffic. So, you update the business logic
that processes the pageview_transaction to count page views by referer,
perhaps a count of domains by day (for up to 365 days). Replay the log and you
will now have that state for the entire history of your app.
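
In code, that replay is just a fold over the log into whatever new projection
you want - a sketch with an invented event shape:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Replay sketch: fold the historical event log into a brand-new
    // projection (page views per referer) that was never tracked live.
    public class ReplayExample {
        static class PageView {
            final String url;
            final String referer;
            PageView(String url, String referer) { this.url = url; this.referer = referer; }
        }

        static Map<String, Long> viewsByReferer(List<PageView> log) {
            Map<String, Long> counts = new HashMap<String, Long>();
            for (PageView pv : log) {
                Long current = counts.get(pv.referer);
                counts.put(pv.referer, current == null ? 1L : current + 1L);
            }
            return counts;
        }
    }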

I expect there will be benefits for debugging as well, since I can replay the
transaction log with a local, annotated version of the app to figure out the
specific sequence of real-world transactions that surfaced the bug.

And you can use SQL if you like. You just have to keep your transaction
processing fast (e.g., by using an in-memory db).

If you are interested, here's a Python implementation:
<https://github.com/PetrGlad/python-prevayler>.

------
espeed
The code for the open-sourced "Disruptor" is here
(<http://code.google.com/p/disruptor/>).

------
rikf
I came across this a while ago on Twitter. Fascinating stuff; there is loads
more info on the team blog: <http://blogs.lmax.com/>

------
xtacy
The "network of queues" architecture reminds me of Prof. Matt Welsh's PhD
thesis titled SEDA: Staged Event Driven Architecture. It has benefits of both
the concurrency offered by event driven and parallelism offered by threaded
architectures. <http://www.eecs.harvard.edu/~mdw/proj/seda/>

Of course, the network of queues isn't the only thing that makes the system
possible, but it closely resembles processing in a distributed system: a
system of workers.

~~~
jacques_chester
He name-checks SEDA in the piece.

------
spenrose
Many of the lessons he mentions here jibe with what I learned co-developing a
system based on interacting single-threaded processes. I especially like the
emphasis on how important and difficult it is to test and measure and the
value of regular restarts for long-term reliability. The shout-out to CQRS is
worth amplifying as well.

------
ds13
Good overview video presentation on LMAX:
<http://www.infoq.com/presentations/LMAX>

Presented by Martin Thompson and Michael Barker on Dec 16, 2010

Some Java frameworks are already using the Disruptor: <http://jdon.org>

------
jebblue
I didn't see a place to add comments on his article page.

In general, this reminds me a lot of architectures designed for embedded
systems today (which is how software was designed for PCs in the early
days).

The huge upside is performance; the huge downside is that it completely
ignores the significance of what a relational database offers to the company
as a whole.

We need to be looking at ways to make SQL databases faster, not at ways to
avoid their use.

~~~
colinb
Embedded systems are designed to squeeze the absolute maximum out of the
hardware upon which they run. All this processor affinity, cache stride
calculation, and avoidance of cross-core conflicts is just how business is
done.

The upside is performance, because that's what is considered important in this
case. The downside, which you've neglected to mention, is difficulty of
maintenance due to decreased comprehensibility of the system as a whole.

A relational database may not have that much value to the company as a whole.

By all means look at ways to make SQL DBMSes quicker. Lots of clever people
have spent decades doing just that. I'm sure they're not finished yet.
[Incidentally, Mohan et al. were using sequential writes for their [undo - or
redo? I don't recall which] logs back when CPU speeds were measured in
double-digit MHz - DB2 and all that. While it's been a while since I had cause
to look at the code inside any DBMS, I suspect the same is true today.]

But talking about SQL here is just a hammer in search of a thumb. Pick the
tool for the job.

[in summary, I'm more than happy to keep building high-performance low-latency
embedded systems in ways that might make an applications programmer weep, but
I'm quite glad that the folks who take care of my company's payroll are
running industrial-strength transactional systems]

~~~
davefarley
I agree with your comments about picking the right tools for the job, but it
is my contention that our code, at LMAX, is cleaner as a result of our
architecture, not more obscure.

The code that matters to our business, the business logic processors in our
services, is a clean, well modelled [ mostly ;-) ] implementation of our
domain problem with NO technology constraints - no DB code, no magic
annotations, no complex threading code, just single-threaded stateful POJOs
that provide the solutions to our business problems. For me that is one of the
most important benefits of our approach, not just the significant performance
benefits that we get as a result, but the simplicity of the programming model.

We have entirely isolated the complex bits to infrastructure; most of our
day-to-day work is in writing business logic code. High performance code is
surely focussed on doing the minimum amount of work for the maximum function.
How better to achieve that than to have a software simulation of the business
problem, a domain model in the DDD sense? Yes, you need to pick your
collections wisely to represent the relationships within your domain model,
but other than that, modelling the problem well is a key attribute of both
high-performance systems and good code - at least to my way of thinking.

    
    
       Dave Farley

------
dr_rezzy
I like the self-awareness here, but I kind of cringed at some of the points
made by the author. LMAX is definitely a counter to the "one
size/approach/design/trend fits all" thinking in software. I guess it's true
when they say execution is everything. In the end, a great example of a well
executed, innovative design.

------
ludwigvan
The lack of threads reminds me of JavaScript's handling of events. There are
no threads in JavaScript either, and everything is processed in an event loop.
The article also discusses node.js as I expected, but I would appreciate
someone familiar with node.js comparing their approach to what Node does.

------
dfc
Is disruptor a common term/phrase?

------
wr1472
This was posted on HN a few months back, when the article was first published.
Can't find the link though.

------
kb101
This is great stuff, thanks very much for posting this.

~~~
kb101
This must be the only place on the web where saying thank you gets you pissed
on. I'm going back to lurking, can't figure this place out. Please downvote my
"karma" to zero & adios.

~~~
veyron
If everyone said "thank you" to every post, those would swamp any meaningful
discussion. Think about it. On this article alone there would be over 100 such
comments.

Generally, you can give a +1 by upvoting. If you have something to add to the
discussion, you can add a thank you in that comment, but in isolation a "thank
you" really adds no value.

