It's actually the other way around. As a big fund looking to trade a large number of shares in the public market, you'll quickly realize that the market tends to move away from you, and statistically, you're more likely to get a bad deal than a good one. Even if you try to be smart about execution by splitting your orders into chunks, randomizing order sizes, and similar tactics, there is still a huge information asymmetry between you and more sophisticated players. In many cases, they can classify your orders based on different characteristics of your order flow (such as its latency profile), distinguishing them from the so-called toxic flow coming from other HFT firms.
The purpose of these private rooms is to separate your orders from those players so that you trade against other uninformed parties, making your chances of getting a good or bad deal closer to 50/50.
This is not exactly how it works. You're right that a big fund executing on a public market will incur (potentially excessive) impact, but the purpose of these private rooms is not to prevent trading against informed parties! Often, the counterparties that a big fund might find in these private rooms will in fact be the same market makers and liquidity providers present on public exchanges.
The difference is that in these private rooms, liquidity providers are often able to understand their customers better. For example, big passive index funds aren't buying and selling due to some adverse knowledge of future price movements; they are merely following the index. If market makers can distinguish the passive indexers from the smart, sophisticated hedge funds, they can provide liquidity to the passive indexers at a better price.
Attempts at this already exist, the IEX [1] exchange being an example, albeit on a less ambitious scale than your idea:
> It's a simple technology: 38 miles of coiled cable that incoming orders and messages must traverse before arriving at the exchange’s matching engine. This physical distance results in a 350-microsecond delay, giving the exchange time to take in market data from other venues—which is not delayed—and update prices before executing trades
IntelligentCross Midpoint (a dark pool) is a better example, since it actually does matching periodically, every couple of milliseconds [1]. IEX just introduces additional latency for everyone.
I think it's an interesting thought experiment. What would happen if the stock market were quantized to a blind one trade per-minute granularity?
I suspect this would put everyone on more even footing, with less focus on beating causality and light lag, placing more focus on using the acquired information to make longer-term decisions. This would open things up to anyone with a computer and some disposable income, though it would disappoint anyone in the high-frequency trading field.
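For concreteness, such a blind per-interval cross could work roughly like a uniform-price call auction: collect orders unseen during the interval, then clear everything at the single price that maximizes matched volume. A toy sketch of the clearing step (all names are hypothetical; imbalance, tie, and market-order handling omitted):

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct Order { bool is_buy; double limit; std::int64_t qty; };

    // Pick the candidate price that maximizes executable volume.
    // Real crosses also handle imbalances, ties, and market orders.
    double clearingPrice(const std::vector<Order>& orders) {
        double best_price = 0.0;
        std::int64_t best_volume = 0;
        for (const Order& candidate : orders) {
            std::int64_t buys = 0, sells = 0;
            for (const Order& o : orders) {
                if (o.is_buy && o.limit >= candidate.limit) buys += o.qty;
                if (!o.is_buy && o.limit <= candidate.limit) sells += o.qty;
            }
            const std::int64_t volume = std::min(buys, sells);
            if (volume > best_volume) {
                best_volume = volume;
                best_price = candidate.limit;
            }
        }
        return best_price;
    }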
> What would happen if the stock market were quantized to a blind one trade per-minute granularity?
Like one share of stock trades each minute in each name? Or one trade randomly executes?
If the former, you stop trading the stock and start trading something pointing at it. If the latter, the rich get to trade.
> less focus on beating causality and light lag
You’d have to ban cancelling orders, otherwise you bid and offer and then cancel at the last minute. Either way, you’d be constantly calculating the “true” price while the market lags and settling economic transactions on that basis. (My guess is the street would settle on a convention for the interauction model price.)
If you’re upset about stock markets looking like casinos, the problem isn’t the fast trading. It’s the transparency. Just don’t report trades until the end of the day.
If you aesthetically don’t like HFT, that’s a tougher problem as the price of the stock points at something tied to reality, and reality runs real time.
It has the same utility as in the opening cross, the most algorithmically-trafficked moment of trading after the closing cross. The last order can incorporate more information than an earlier one. Given the book is assembled transparently, that means an order submitted close to the deadline can “see” other orders in a way they couldn’t “see” it.
You would change the rules, but I think the result would largely remain the same. As a market participant with the fastest access to data from other markets, news, and similar sources, as well as low order entry latency, you would still be able to profit from information asymmetry.
Imagine that a company announces the approval of its new vaccine a few milliseconds before the periodic trade occurs. As an HFT firm, you have the technology to enter, cancel, or modify your orders before the periodic auction takes place, while less sophisticated players remain oblivious to what just happened. The same applies to price movements on venues trading the same instrument, its derivatives, or even correlated assets in different parts of the world.
On the other hand, you risk increasing price volatility (especially in cases where there is an imbalance between buyers and sellers during the periodic auction) and making markets less liquid.
It's not easy to get data structures like this right in C++. There are a couple of problems with your implementation of the queue.
Memory accesses can be reordered by both the compiler and the CPU, so you should use std::atomic for your producer and consumer positions to get the barriers described in the original LMAX Disruptor paper.
In the get method, you're returning a pointer to the element within the queue after bumping the consumer position (which frees the slot for the producer), so it can get overwritten while the user is accessing it.
And then your producer and consumer positions will most likely end up in the same cache line, leading to false sharing.
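For illustration, here's a minimal sketch of an SPSC ring buffer that addresses all three points (not a drop-in fix for your code; it uses the default sequentially-consistent ordering for simplicity and assumes 64-byte cache lines):

    #include <array>
    #include <atomic>
    #include <cstddef>
    #include <optional>

    template <typename T, std::size_t N>  // N must be a power of two
    class SpscQueue {
    public:
        bool push(const T& item) {
            const std::size_t head = head_.load();
            if (head - tail_.load() == N) return false;  // full
            buffer_[head & (N - 1)] = item;
            head_.store(head + 1);  // publish only after the slot is written
            return true;
        }

        // Copy the element out *before* bumping the consumer position,
        // so the producer can't overwrite it while the caller uses it.
        std::optional<T> pop() {
            const std::size_t tail = tail_.load();
            if (head_.load() == tail) return std::nullopt;  // empty
            T item = buffer_[tail & (N - 1)];
            tail_.store(tail + 1);  // the slot is free only after the copy
            return item;
        }

    private:
        std::array<T, N> buffer_{};
        // Keep the two positions on separate cache lines to avoid false sharing.
        alignas(64) std::atomic<std::size_t> head_{0};
        alignas(64) std::atomic<std::size_t> tail_{0};
    };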
>> In the get method, you're returning a pointer to the element within the queue after bumping the consumer position (which frees the slot for the producer), so it can get overwritten while the user is accessing it. And then your producer and consumer positions will most likely end up in the same cache line, leading to false sharing.
I did not realize this. Thank you so much for pointing this out. I'm going to take a look.
>> use std::atomic for your producer
Yes, it is hard to get these data structures right. I used Martin Fowler's description of the LMAX algorithm, which did not mention atomics. https://martinfowler.com/articles/lmax.html
I'll check out the paper.
I sincerely doubt the big HFT firms use anything of Fowler’s. Their optimizations come down to making their own hardware. Low-latency work is very context dependent, and Amdahl’s law applies here.
I have absolutely no idea how this works in Java, but in C++, there are a few reasons you need std::atomic here:
1. You need to make sure that modifying the producer/consumer position is actually atomic. This may end up being the same instruction that the compiler would use for modifying a non-atomic variable, but that will depend on your target architecture and the size of the data type. Without std::atomic, it may also generate multiple instructions to implement that load/store or use an instruction which is non-atomic at the CPU level. See [1] for more information.
2. You're using positions for synchronization between the producer and consumer. When incrementing the reader position, you're basically freeing a slot for the producer, which means that you need to make sure all reads happen before you do it. When incrementing the producer position, you're indicating that the slot is ready to be consumed, so you need to make sure that all the stores to that slot happen before that. Things may go wrong here due to reordering by the compiler or by the CPU [2], so you need to instruct both that a certain memory ordering is required here. Reordering by the compiler can be prevented using a compiler-level memory barrier - asm volatile("" ::: "memory"). Depending on your CPU architecture, you may or may not need to add a memory barrier instruction as well to prevent reordering by the CPU at runtime. The good news is that std::atomic does all that for you if you pick the right memory ordering, and by default, it uses the strongest one (sequentially-consistent ordering). I think in this particular case you could relax the constraints a bit and use memory_order_acquire on the consumer side and memory_order_release on the producer side [3].
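Applied to an SPSC ring buffer like the one sketched earlier in the thread (atomic head_/tail_ positions, power-of-two capacity N), the relaxed/acquire/release variant might look like this (a sketch, not a drop-in fix):

    bool push(const T& item) {
        // Only the producer writes head_, so a relaxed load is enough here.
        const std::size_t head = head_.load(std::memory_order_relaxed);
        // Acquire pairs with the consumer's release store: the consumer's
        // copy-out must be visible before we reuse the slot.
        if (head - tail_.load(std::memory_order_acquire) == N) return false;
        buffer_[head & (N - 1)] = item;
        // Release: the slot write above can't be reordered past this store.
        head_.store(head + 1, std::memory_order_release);
        return true;
    }

    std::optional<T> pop() {
        // Only the consumer writes tail_, so a relaxed load is enough here.
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        // Acquire pairs with the producer's release store: the producer's
        // slot write must be visible before we read the element.
        if (head_.load(std::memory_order_acquire) == tail) return std::nullopt;
        T item = buffer_[tail & (N - 1)];
        // Release: the copy above can't be reordered past freeing the slot.
        tail_.store(tail + 1, std::memory_order_release);
        return item;
    }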
Fowler's implementation is written in Java, which has a different memory model from C++. For another example of the Java memory model contrasted with another language's, see Jon Gjengset's port of ConcurrentHashMap to Rust.
Most of the benefit comes from the fact that you end up with far fewer TLB misses, since a single mapping covers a large chunk of memory. A predictable memory access pattern helps with cache misses thanks to hardware prefetching, but as far as I know, on most CPUs the hardware prefetcher won't prefetch across a boundary that would cause a TLB miss.
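For explicitly allocated buffers on Linux, the two usual routes are MAP_HUGETLB (needs huge pages reserved up front) and transparent huge pages via madvise. A minimal sketch:

    #include <sys/mman.h>
    #include <cstddef>

    // Sketch: two common ways to get a huge-page-backed buffer on Linux.
    // size should be a multiple of the huge page size (typically 2 MiB).
    void* allocHuge(std::size_t size) {
        // Option 1: explicit huge pages; requires pages reserved via
        // /proc/sys/vm/nr_hugepages, otherwise this mmap fails.
        void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p != MAP_FAILED) return p;

        // Option 2: fall back to a normal mapping and ask the kernel to
        // back it with transparent huge pages where it can.
        p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) return nullptr;
        madvise(p, size, MADV_HUGEPAGE);
        return p;
    }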
The part about getting everything into huge pages sounds interesting. Any idea where I can find some resources on that? Most of what I was able to find only tells you how to do that for heap allocations.
Thanks, cool stuff. Especially liblppreload.so described in [2] and [3]. I'll give it a try. Do you have any tips on how to achieve the same for the stack?
I haven't done this myself, but given that ELF doesn't have a dedicated .stack region, I guess you first have to find out which memory address range the process will use for its stack.
The beginning of the stack should be deducible from the addresses of local variables upon entering main().
The end of the stack depends on how big the stack is and whether it grows upwards or downwards. The former is dynamic in nature but usually configured at the system level on Linux, and for the latter, I'm not sure, but I think it grows downwards.
IMO the easiest way (but certainly not the only way) is to allocate a new stack and switch to it with makecontext(). The manpage has a full code example. You just need to change the stack allocation. This approach has a few drawbacks but is hard to beat in terms of simplicity.
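Condensed from the makecontext(3) man page example, with the stack allocation swapped for a huge-page-backed mmap (error handling mostly omitted):

    #include <sys/mman.h>
    #include <ucontext.h>
    #include <cstddef>
    #include <cstdio>
    #include <cstdlib>

    constexpr std::size_t kStackSize = 8 * 1024 * 1024;  // multiple of 2 MiB

    static ucontext_t main_ctx, worker_ctx;

    static void workerMain() {
        // Everything called from here runs on the huge-page-backed stack.
        std::puts("running on the new stack");
    }  // returning here resumes main_ctx via uc_link

    int main() {
        // Needs huge pages reserved (e.g. /proc/sys/vm/nr_hugepages),
        // otherwise this mmap fails.
        void* stack = mmap(nullptr, kStackSize, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (stack == MAP_FAILED) return EXIT_FAILURE;

        getcontext(&worker_ctx);
        worker_ctx.uc_stack.ss_sp = stack;
        worker_ctx.uc_stack.ss_size = kStackSize;
        worker_ctx.uc_link = &main_ctx;  // where to go when workerMain returns
        makecontext(&worker_ctx, workerMain, 0);

        swapcontext(&main_ctx, &worker_ctx);  // runs workerMain on the new stack
        return 0;
    }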