
I think I actually saw these folks present at JavaOne a couple years ago? Either that or there's more than one shop branding itself as "HFT" that uses Java.

I worked in the industry and it's always a little funny to see who calls themselves HFTs vs quants.

Basically, there's a bit of a spectrum of fast vs smart. In general it's hard to do incredibly smart stuff fast enough to compete in the "speed-critical" bucket of trades and vice-versa there's barely any point in being ultra-fast in the "non-speed-critical" bucket because your alphas last for minutes to hours.

Just from this read, I feel like these folks are just a hair to the right of "fast" in the [fast]---------[smart] continuum. I mostly make this appraisal based on these paragraphs:

> To gain those few crucial microseconds, most players invest in expensive hardware: pools of servers with overclocked liquid-cooled CPUs (in 2020 you can buy a server with 56 cores at 5.6 GHz and 1 TB RAM), collocation in major exchange datacentres, high-end nanosecond network switches, dedicated sub-oceanic lines (Hibernian Express is a major provider), even microwave networks.

> It’s common to see highly customised Linux kernels with OS bypass so that the data “jumps” directly from the network card to the application, IPC (Interprocess communication) and even FPGAs (programmable single-purpose chips).

That's nice but that's where the cutting edge of the speed game was in 2007ish. Everything mentioned here is table stakes at this point (colocation, dedicated fiber, expensive switches, bypassing the kernel in network code, etc). The fact that FPGAs are prefaced with "even" is the biggest thing I focus on: FPGAs and/or custom silicon are where the speed game is right now. Similarly, "even microwave networks" is also table stakes at this point (you can get on Nasdaq's wireless[0] just by paying).

This is the kind of game where capex for technology is dwarfed by the margin you're slinging around every day in trading, so you see some pretty absurd hardware justified.

[0] http://n.nasdaq.com/WirelessConnectivitySuite

Edit: Also shout-out to a different comment in this thread mentioning ISLD, a story I considered telling as well: https://news.ycombinator.com/item?id=24896603




I discovered something amazing when working with some people who were writing HFT software.

Why do you need 1TB of RAM in these machines? Because when you're Java based, you want to avoid stop-the-world GC pauses. These trading systems only have to be up from 9:30AM-4:30PM EST, so they simply disable GC altogether! At the end of a trading day, restart the app or reboot the system.
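On a recent JDK the literal version of that trick is the Epsilon no-op collector (JEP 318, mentioned further down this thread). A minimal sketch of the launch flags, with a made-up heap size and jar name:

    java -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC \
         -Xms768g -Xmx768g -XX:+AlwaysPreTouch \
         -jar trading-engine.jar

Epsilon allocates but never collects, so you size the heap to survive until the daily restart or the JVM dies with an OutOfMemoryError. Before Epsilon, the same effect was approximated by simply giving the JVM a heap so large that no collection triggers during the session.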


Speaking from experience with JVM HFT applications (we used Scala).

There are a lot of tricks though to not require 1TB.

And allocation in general is a bad idea even if you don't collect, because it scatters stuff all over memory and messes up cache locality. You really, really don't want to allocate in a performance-sensitive JVM application if you can avoid it. It's the opposite of a lot of what I was told and taught (e.g. never do object pooling), but empirically, in my experience, allocations are the biggest slowdown. You can get an application a lot faster just by opening up the memory allocation tab in a JMC flight recording and refactoring the biggest allocators; usually there is a lot of easy-to-optimize low-hanging fruit that gives good performance improvements, even better than focusing on hot spots in code (in my personal experience).

By far the biggest allocator in trading is going to be market data and calculations on it. For reading market data from the exchange it's best to leave the raw data in memory and access it with a ByteBuffer / sun.misc.Unsafe. Under this pattern a class has a single field, the memory address to pass into sun.misc.Unsafe, and everything from there on is done with offsets from that address. For calculations it's better to write things as static functions, or use object pooling.
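A minimal sketch of that flyweight pattern, assuming a made-up fixed record layout (the field names, offsets and the loadUnsafe helper are mine, not from the parent comment):

    import sun.misc.Unsafe;
    import java.lang.reflect.Field;

    // Flyweight over a raw off-heap market data record: the object holds a single
    // value (the record's base address) and every accessor is an Unsafe read at an offset.
    final class QuoteFlyweight {
        private static final Unsafe UNSAFE = loadUnsafe();

        // illustrative layout: bytes 0..7 price (fixed point), bytes 8..15 size
        private static final int PRICE_OFFSET = 0;
        private static final int SIZE_OFFSET = 8;

        private long address; // the one field: base address of the current record

        void wrap(long recordAddress) { this.address = recordAddress; }

        long price() { return UNSAFE.getLong(address + PRICE_OFFSET); }
        long size()  { return UNSAFE.getLong(address + SIZE_OFFSET); }

        private static Unsafe loadUnsafe() {
            try {
                Field f = Unsafe.class.getDeclaredField("theUnsafe");
                f.setAccessible(true);
                return (Unsafe) f.get(null);
            } catch (ReflectiveOperationException e) {
                throw new ExceptionInInitializerError(e);
            }
        }
    }

The instance is reused by re-wrapping it over the next record, so reading a full feed allocates nothing after startup.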

In the course of optimizing a trading engine I wrote lots and lots of code to get allocations down to zero. It's definitely doable, but best done from the start, I refactored an existing trading engine to do that, it was not very fun.


1TB of RAM is cheap though, versus spending engineering hours.

Reminds me of the classic WTF "That would've been an option too" https://thedailywtf.com/articles/That-Wouldve-Been-an-Option...


> 1TB of RAM is cheap though, versus spending engineering hours.

It's not about the amount of RAM, it's how fast and predictable the overall system is. Note in particular the remark about cache locality; that can be the difference between nanoseconds and microseconds.


What the GP is referring to (sun.misc.Unsafe) is basically going back to manual pointer reads/writes (i.e. writing more or less C code in Java). The benefit is you get C speed/memory layout/cache locality, but with the downside of writing it in a less suited language (Java).

HOWEVER, if you DO _allocate_ much, then 1TB seems like a great choice, since GC pauses are killers w.r.t. latency compared to cache issues. A cache line (often 64 bytes) would probably only hold a couple of Java objects at most (the minimum is probably 16 or more bytes for each small object), so cache locality won't be improved much by a GC. (Yes, the G1 GC in newer JDKs does neighbour compacting iirc, so you get a little cache locality, but only if the layout was bad from the start; avoiding GC entirely is better.)
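If you want to check the object-size numbers above, OpenJDK's JOL tool (org.openjdk.jol:jol-core) prints the exact layout; the class here is just an example:

    import org.openjdk.jol.info.ClassLayout;

    public class ObjectSizeDemo {
        static final class Price { long value; }   // a "small" object: one long field

        public static void main(String[] args) {
            // On a typical 64-bit HotSpot with compressed oops this shows a 12-byte
            // header, 4 bytes of alignment gap, then the 8-byte long: 24 bytes total,
            // so a 64-byte cache line indeed fits only a couple of such objects.
            System.out.println(ClassLayout.parseClass(Price.class).toPrintable());
        }
    }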


Can you explain why you would code Java this way instead of dropping down to C?


Probably because they started out in Java and don't want a full rewrite once they've gotten this far (it might be that they don't have many C++ devs anyhow). I remember working on similar codebases 10 years back when J2ME games were still a thing (J2ME runtimes usually had horrible GCs and some people really went overboard in trying to avoid them) and it wasn't that fun at all (even if the QuestDB codebase that is linked elsewhere in this thread seems to be saner).

Also, JNI for calling C/C++ code is fairly bad and error-prone (if you want to combine code); JNA is better. I think this is one of the bigger reasons Oracle is funding GraalVM: it promises "seamless" interoperability, and breaking things up into Java and C/C++ parts might be a good option in the future.


The JNI replacement is called Panama; it doesn't have anything to do with GraalVM.

GraalVM is the new name of MaxineVM, one of the meta-circular JVMs. Other well-known ones are JikesRVM and SquawkVM.

What Oracle and others in the Java community are doing is reducing the need to drop down to C with support for value types and more fine grained control over native memory.

Java 16 will have the first preview release of the new low level memory APIs.


The meta-circularity is part of the implementation details; cross-language calling (Java/JS/LLVM) seems to be the main selling point. So if you want GC-less logic, then Clang -> LLVM bitcode running on the JVM is an option. No idea if it also simplifies native bindings though (that is more of a Panama issue, and improving that is a good idea as well).


Productivity and safety of Java.

Those unsafe bits are a tiny portion of the overall code.


This was a great read, thanks for that! Never heard of The Daily WTF before.


Congratulations!

https://xkcd.com/1053/


Of course there’s an XKCD for this! Ah, the wonders of the internet.

The article in question was written in 2008: I was 11 years old. I guess I now fall into the category of “heard of it by the age of 30”.


> ...a new intern named Bob. Bob raised a finger and said "Yes, I have a question...

Damn those smart-ass interns!


agree. hardware is getting cheaper and cheaper


Except Memory / DRAM.


Ha, these are many of the same things one might do to make a chess engine fast on the JVM. This only strengthens my belief that the best computer science education you can get is studying chess programming.


I had a side project of 'flattening' classes at class-loading time into large MappedByteBuffers, removing all allocations needed for ser-des and recursing on belongs-to references, copying buffer slices from sockets into the buffers (and reusing the socket buffers), and using Netty for sockets (the selector allocates like crazy). Very fun project with Javassist...


Do you know of any public examples of how this sort of Java looks? I imagine you lose out on being able to take advantage of much of the JVM ecosystem and I struggle to see what using Java even adds anymore.


Here is one: https://github.com/questdb/questdb. Disclaimer: I work on this project. The main reason we use Java is speed of development (which has increased with the amount of base libraries written) and ease of testing.


Would you do it again? Coming from C/C++ first, then Java, and now C#, the latter just feels so much more pragmatic, with e.g. structs, slices and stackalloc (and unsafe blocks with pointers in a pinch) allowing for GC-less programming without resorting to turning pointers into integers and using function calls for memory access all around. (I noticed that you do query compilation via the ASM toolkit, I guess?)


It is a good question. I feel there is a happy medium where the boilerplate is in Java and the more intricate data processing routines are in C++; that makes both worlds simpler. I like C++ better. The things you mentioned Java truly sucks at, indeed, but you don't always need them. When it comes to IDEs, testing, compilation speed, cross-platform code and finding talent, Java is way easier than C++.


Yeah, I see that outlook; my comment was actually mostly about C# (I've been using it for the past year, and the things above are in it, with the rest of the base being Java-ish).


RAM is not the only aspect of a Garbage Collector....

Over a decade ago, we used a third-party pre-trade risk system that was implemented in Java. Since it was a "service" (we connected to a TCP port), the underlying tricks they used to make it "fast" were transparent to us... until it was not.

They highly tuned their GC to where there was seldom any GC. One day, the third-party made a change to a supervisory service to generate more periodic monitoring emails. The file handles from this actor were apparently not GC'd and held by the process until the system ran out of file handles. That made the service stop working properly.

But, in addition to alerting, that service had a more important job: it was a post-trade risk "watchdog" to the pre-trade risk gateways, which we sharded across.

The various pre-trade risk gateways, upon not hearing from the watchdog authority, then 1) began cancelling outstanding orders and 2) stopped allowing new orders. We saw this happening haphazardly over 10 minutes and had little time and capability to recover; this also happened near the end of the trading day, when some orders (MOC, LOC) are not cancellable. It was across our whole organization, so it affected many strategies, and it wasn't as simple as "stop all"; although we had many checks and recovery procedures, this was a pretty special Chaos Monkey.

We ended up with a >$1B basket of random stocks that cost >$10M to liquidate over the next few days.

$10M evaporating in 10 minutes because of "GC optimizations" and poorly considered OS settings.

The #1 risk in algo/HFT trading is not financial, but operational.

[Some of the technical details may be slightly off, as it was third-party; my outlook is pieced together from post-mortems. Also, no other party (e.g. broker, SIPC, market participant, the third-party) was financially affected besides our firm.]


This is similar to how missiles don't need GC.

https://devblogs.microsoft.com/oldnewthing/20180228-00/?p=98...


It has: it explodes and the memory is free again. Just not for further usage.


The memory is freed all over the target, I suppose


Even if you completely disable GC, Java is still allocating tons of ephemeral objects on the heap, which in turn leads to expensive and unpredictable page faults. In contrast, C++'s default is to allocate objects on the stack unless you knowingly call new/malloc.

Of course, it's possible to write Java in such a way to minimize this type of heap-thrashing. But by that point, you're already doing the equivalent of C++'s manual memory book-keeping anyway. It's not like "just provision a ton of memory and turn off GC" is a free lunch.


If you add -XX:+AlwaysPreTouch to the JVM arguments, the JVM will pre-touch the entirety of the heap at startup time to avoid unpredictable page faults through the life of the application.

I'd imagine other HFT companies may also pay for the Azul JVM which goes even further and comes with a kernel module that the JVM coordinates with for memory allocation. The kernel module pre-reserves x% of system memory at the time it is loaded.


Is the Azul JVM open source? I thought it was just the support that cost money. (I'm just not familiar enough with it.)


It is commercial, and sadly their GC algorithms are patented. I'm really anxious for those patents to start running out because the functionality is quite clever.

Basically, one main reason you need to stop the world in common GCs right now (and thus cause pauses) is that if you want to compact memory, a GC thread has no idea whether other threads are reading or writing an object while it's being moved, leading in the worst case to corrupt or lost writes.

What Azul does is make the pages in question invalid for reading/writing, so any access to them page-faults, and then the collector moves the objects without worrying about other threads. Since the hardware provides the page fault "for free", this protection doesn't cost anything in terms of runtime performance in the common case (compared to the new ZGC made by Oracle, which has read/write barriers that steal some mutator performance). In the worst case, if another thread tries to read/write memory that is being moved, a specialized page-fault handler detects that the page was in motion and does a slower read/write, but ONLY in the seldom cases where this occurs (rather than all the time, as with e.g. ZGC).


Hey thanks. That was a fantastic explanation.


why does the hardware provide page-fault for free again?


Page faults aren't free (they're actually kind of expensive), but the paging functionality gives you "free" read/write barriers, instead of those being instruction sequences inside the main program (i.e. the read/write barriers are handled as page faults by the paging system the OS already provides, so every read/write in user code doesn't need to be aware of moving pointers).


There's two: Zing (commercial, pay for a license) and Zulu (free OpenJDK builds, pay for support).


That's covered in Java by escape analysis and allocation on the stack.


We have written a database in zero-GC Java, and one thing I have not seen any evidence of is "escape analysis".

    // Imports needed to run this JMH benchmark (Rnd is our own RNG, linked in a reply below):
    import java.util.concurrent.TimeUnit;

    import org.openjdk.jmh.annotations.*;
    import org.openjdk.jmh.profile.GCProfiler;
    import org.openjdk.jmh.runner.Runner;
    import org.openjdk.jmh.runner.RunnerException;
    import org.openjdk.jmh.runner.options.Options;
    import org.openjdk.jmh.runner.options.OptionsBuilder;

    @State(Scope.Thread)
   @BenchmarkMode(Mode.AverageTime)
   @OutputTimeUnit(TimeUnit.NANOSECONDS)
   public class EscBenchmark {

       Rnd rnd = new Rnd();

       public static void main(String[] args) throws RunnerException {
           Options opt = new OptionsBuilder()
                   .include(EscBenchmark.class.getSimpleName())
                   .warmupIterations(5)
                   .measurementIterations(5)
                   .forks(1)
                   .addProfiler(GCProfiler.class)
                   .build();

           new Runner(opt).run();
       }

       @Benchmark
       public int testEscapeAnalysis() {
           int[] tuple = {0, 2}; // esc analysis? where are you?
           return tuple[rnd.nextPositiveInt() % 2];
       }
   }

And the output of GC profiler:

  Benchmark                                                     Mode  Cnt     Score     Error   Units
  EscBenchmark.testEscapeAnalysis                               avgt    5     8.234 ±   0.029   ns/op
  EscBenchmark.testEscapeAnalysis:·gc.alloc.rate                avgt    5  2647.216 ±   9.275  MB/sec
  EscBenchmark.testEscapeAnalysis:·gc.alloc.rate.norm           avgt    5    24.000 ±   0.001    B/op
  EscBenchmark.testEscapeAnalysis:·gc.churn.G1_Eden_Space       avgt    5  2643.140 ± 177.137  MB/sec
  EscBenchmark.testEscapeAnalysis:·gc.churn.G1_Eden_Space.norm  avgt    5    23.963 ±   1.613    B/op
  EscBenchmark.testEscapeAnalysis:·gc.count                     avgt    5   157.000            counts
  EscBenchmark.testEscapeAnalysis:·gc.time                      avgt    5   103.000                ms


Scalar replacement has several requirements, and escape analysis is only part of the story. One of the restrictions with scalar replacement for arrays is that indexing has to be constant at JIT time, which your example clearly isn't.
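For comparison, a variant of the parent's benchmark with a JIT-time-constant index; on a typical HotSpot this should report roughly 0 B/op under the same GC profiler (hedged, since scalar replacement is never guaranteed):

    @Benchmark
    public int testConstantIndex() {
        int[] tuple = {0, 2};
        // the array does not escape and the index is constant,
        // so C2 can scalar-replace it instead of allocating on the heap
        return tuple[1];
    }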


The point would be that escape analysis could have proven that the array was not used outside of the function and allocated it fully on the stack, thus not providing any GC pressure. Scalar replacement isn't the point of interest really.


Which version of Java are you using? And what is Rnd? The class included with the JDK is Random. We rely on escape analysis to elide such object creation for us and it works reasonably well. If something this trivial doesn't work for you, file a bug with them. We have had success with that as well.


This ran on Java 11. This isn't so much of an issue for us; we are trying to avoid allocations, even ones as trivial as those. There are other examples that allocate where they should not; for example, this lambda will allocate.

  () -> System.out.println(1)
I lost hope in escape analysis quite frankly.

Rnd is something I have written because Java's Random is slow and clunky.

https://github.com/questdb/questdb/blob/master/core/src/main...


Escape analysis is very JVM implementation specific, but inline classes should finally sort it out.


The JVM does scalar replacement, not stack allocation. The former is much more limited in functionality, and much more finicky. You also have about zero control over it, compared to, say, C# structs.


I guess what you mean is the no-op garbage collector which is available in Java 11 http://openjdk.java.net/jeps/318

Even this isn't fail-proof, right?

> Last-drop latency improvements. For ultra-latency-sensitive applications, where developers are conscious about memory allocations and know the application memory footprint exactly, or even have (almost) completely garbage-free applications, accepting the GC cycle might be a design issue. There are also cases when restarting the JVM -- letting load balancers figure out failover -- is sometimes a better recovery strategy than accepting a GC cycle. In those applications, long GC cycle may be considered the wrong thing to do, because that prolongs the detection of the failure, and ultimately delays recovery.

So essentially you need to know the memory footprint of your application. To err on the side of caution, just get as much memory as is available on the market.

At this point, like another reply here mentions, aren't you just better off writing C++ code?


> *I guess what you mean is the no-op garbage collector which is available in Java 11*

I think I heard of the same pattern before 2018, so I guess there are other ways to do it. Perhaps Java has some option like a memory-usage threshold at which the GC is run.


Depends if your returns are based on writing faster code or shipping correct code faster. Once the advantages of the former are eliminated, optimize for the latter.


Game devs face similar challenges, especially on phones. Though they have to be less radical in their solutions—which afaik usually boil down to ‘preload the level into memory, use object pools, and don't do any new allocations’.


Why don't they use fixed size arrays like in embedded systems and not allocate memory dynamically?


I wonder if it would be cheaper for them to have a hot standby system, and swap them every few hours.


Seems pragmatic; reminds me of this old story: https://devblogs.microsoft.com/oldnewthing/20180228-00/?p=98...


why not just use C++ or something and never deallocate memory then?


Malloc is usually slower than GC-based allocation (which basically just increments a pointer). Of course, one can emulate this via custom allocators in C++. My guess is that Java development is just easier and quicker, and the JIT may even result in more effective optimizations than AOT compilation.
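The JVM-side equivalent of that custom allocator is the off-heap arena the zero-GC crowd in this thread uses: one big region, allocation is a pointer bump, and everything is "freed" at once. A toy sketch using sun.misc.Unsafe, with made-up names and no real error handling:

    import sun.misc.Unsafe;
    import java.lang.reflect.Field;

    // Toy off-heap arena: one big region, allocation is just a pointer bump,
    // and the whole region is reset (or freed) at the end of the trading day.
    final class Arena {
        private static final Unsafe UNSAFE = loadUnsafe();

        private final long base;
        private final long limit;
        private long next;

        Arena(long sizeInBytes) {
            this.base = UNSAFE.allocateMemory(sizeInBytes);
            this.limit = base + sizeInBytes;
            this.next = base;
        }

        long alloc(long bytes) {
            long addr = next;
            if (addr + bytes > limit) throw new OutOfMemoryError("arena exhausted");
            next = addr + bytes;
            return addr;                    // caller reads/writes via UNSAFE at this address
        }

        void reset() { next = base; }       // "free everything" in O(1)
        void close() { UNSAFE.freeMemory(base); }

        private static Unsafe loadUnsafe() {
            try {
                Field f = Unsafe.class.getDeclaredField("theUnsafe");
                f.setAccessible(true);
                return (Unsafe) f.get(null);
            } catch (ReflectiveOperationException e) {
                throw new ExceptionInInitializerError(e);
            }
        }
    }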


In C++ you can link with your own version of malloc (one which just returns the next consecutive block, and can be implemented simply by adding the allocated size to a pointer).


In C++ they could also try using profile-guided optimization to optimize hot code paths.


You generally choose the language which has the libraries and ecosystem which solves your problem. For instance, you'd be silly to use anything but Java to work with Hadoop. This is the pragmatic choice.


But as soon as you make use of almost any library from the ecosystem, you get allocations. Doesn't that defeat the purpose? Why not isolate the Java-using part in a separate process from the non-allocating fast part?


How was this achieved before Epsilon? http://openjdk.java.net/jeps/318

Also ZGC / Shenandoah can help a lot otherwise


There are still a number of players in the HFT space that use JVM languages.

I think as with any space, if you really zoom into the details, you see a lot of diversity. You're obviously right in that people who are still hitting the CPU can't compete with people that do everything on FPGA, but it seems like there's still plenty of money to be made by people who are just a bit less fast than that.

I've heard people argue that they prefer being in that sort of space, too, as it gives them a bit more room to compete on their own alphas, and tends to be a bit less winner-take-all.


Yeah absolutely. There's money to be made across the spectrum of fast and smart. My experience is just way on the left side, where the special sauce is only barely statistics-y but extremely technical (both from a software/hardware standpoint and a market-microstructure-savviness standpoint).

I'm not trying to gatekeep who can call themselves HFTs or not. The main thing I find funny is if you ask 10 firms if they're HFT or quant shops, it will probably not actually line up all that well with exactly how many orders they send or how speed-sensitive they are.


I think GP had that covered mentioning a [fast]<--->[smart] continuum.


In an info session, IMC mentioned they use Java for some of their systems.


You're making good points, but one thing I want to emphasize is that latency is not the sole dimension of competition in HFT space.

Certain things may just be "table-stakes", but for many strategies table-stakes is the only requirement. Simply being "fast enough" may be fine if you have a smarter model, exploit niche opportunities overlooked by others, or are willing to shoulder certain risks that other HFTs are trying to offload.

To take this to the extreme, look at Renaissance's Medallion fund. It's certainly the case that much of what they're doing is "HFT-ish" in the sense that they're executing high turnover, high Sharpe strategies with short holding periods. Yet they've managed to continue to be successful from the 90s well until the age of HFT, without competing very hard on latency at all.

Markets are an ecosystem, and like biological ecosystems there isn't just one trait that predicts dominance.


Correct, there is a peculiar nuance in financial trading topics where people conflate "computer-executed trade" with "high-frequency trading".

You may need to rapidly adjust the expected fill price at a high frequency on a multi-leg strategy being run in 100 positions, but the actual sending of orders to the exchange doesn't need to be high frequency, and it's an important distinction that you wouldn't be competing with others on frequency at this stage. It doesn't matter if this occurs in 1 millisecond or 500 milliseconds, or even a couple of seconds. Very different ballgame than the wishful femtosecond game.

It's probably better that people don't understand that, but it is annoying that there are this many roadblocks to a nuanced conversation.

Either way, your servers still have to have all the authentication code, and various algorithms running to monitor the tape.


what's your source for the fact that they are not competing on latency?


It’s difficult to compete on latency and remain anonymous.


That's right, the more to the fast end you are, the more your strategy is obvious. I worked at an HFT where it was literally buy-here-sell-there for one of the strategies, and of course you can't do that without being ridiculously fast.

DSquare is reasonably famous in London, somehow everyone knows them. They are on the smarter end of things, because the founders were plugged into the early development of the electronic FX market at the beginning of this century.

There's a sort of engineering-vs-business culture thing to this as well, but I think it blurs over time.


There is no singular definition of HFT. The extreme end of low latency is really only applicable for market making systems, but there are plenty of systems in the HFT sphere that do not rely on market making latencies. For example, those microwave links you're talking about? Those are specifically for cross exchange arbitrage. In that domain acceptable latencies go up quite a bit, and general purpose computing is still quite competitive.


Nice overview. The fast vs smart comparison is apt. But something that gets missed in these conversations is what trading system architectures actually look like. The part where latency is measured in nanoseconds is just the tip of the iceberg. Java and 2007 tech is absolutely acceptable for most of the architecture.

Another thing is the structure of Equity and Derivative markets vs FX. The former is fairly standardized. You'll need infrastructure at a handful of exchanges in 2 or 3 cities. Microwave networks and FPGAs are the norm.

But for FX markets, where these guys trade, the picture is much more messy. There are countless places to trade. Co-location means different things at all of them. The entire structure is much less regulated and understood. As a result, strategies and trading system architecture looks much different.


Agreed. A lot of this is all standard now - even when I wrote about it circa 2013 [1]

[1] https://queue.acm.org/detail.cfm?id=2536492


I think the JavaOne presentation was made by the LMAX devs: https://www.youtube.com/watch?v=eTeWxZvlCZ8

They built the Disruptor data structure around 2011 for their high performance financial exchange on the JVM: https://lmax-exchange.github.io/disruptor/files/Disruptor-1....

I used the Disruptor at a smart grid startup in 2012-2014, after LMAX open sourced it.

Martin Thompson has a lot of interesting presentations on the concept of mechanical sympathy:

- https://www.youtube.com/watch?v=929OrIvbW18 - Adventures with concurrent programming in Java: A quest for predictable latency by Martin Thompson

- https://www.youtube.com/watch?v=03GsLxVdVzU - Designing for Performance by Martin Thompson
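For anyone who hasn't used it, a minimal sketch of typical Disruptor 3.x usage (the event type and values here are made up); the point is that ring entries are pre-allocated and mutated in place, so the hot path never allocates:

    import com.lmax.disruptor.EventHandler;
    import com.lmax.disruptor.RingBuffer;
    import com.lmax.disruptor.dsl.Disruptor;
    import com.lmax.disruptor.util.DaemonThreadFactory;

    public final class DisruptorSketch {
        // ring entries are created once up front and reused
        static final class TickEvent { long price; long size; }

        public static void main(String[] args) {
            Disruptor<TickEvent> disruptor =
                    new Disruptor<>(TickEvent::new, 1024, DaemonThreadFactory.INSTANCE);

            EventHandler<TickEvent> handler = (ev, sequence, endOfBatch) ->
                    System.out.println("price=" + ev.price + " size=" + ev.size);
            disruptor.handleEventsWith(handler);

            RingBuffer<TickEvent> ring = disruptor.start();

            // publish one event: claim a slot, fill it in place, publish the sequence
            long seq = ring.next();
            try {
                TickEvent ev = ring.get(seq);
                ev.price = 10025;   // e.g. price in hundredths
                ev.size = 500;
            } finally {
                ring.publish(seq);
            }
        }
    }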


> Basically, there's a bit of a spectrum of fast vs smart. In general it's hard to do incredibly smart stuff fast enough to compete in the "speed-critical" bucket of trades and vice-versa there's barely any point in being ultra-fast in the "non-speed-critical" bucket because your alphas last for minutes to hours.

Depending on the overall strategy, you can be smart and fast at once. If you have occasional alpha harvesting opportunities that must be acted on quickly (very low latency but also low-ish frequency), then it is possible to spend the time in between trades modeling market context and developing optimal short-term plans for how to react in case of various market triggers.


I think he means smart as in "how long your model prediction takes". If your neural net (I haven't actually met anyone who uses these in trading) takes 5ms to make a prediction, that'll lock you out of a whole lot of trading opportunities/strategies.

Speed always matters, no matter where on the smartness spectrum you are, but it's relative. If your model prediction takes 5ms you're not getting much ROI out of investing $1M into shaving off 50ns in your data processing. But if your end-to-end latency without prediction is 1ms, you better invest in getting that down.


> If your neural net (haven't actually met anyone who uses these in trading) takes 5ms to make a prediction that'll lock you out of a whole lot of trading opportunities/strategies.

Let's say you have trading opportunities once every 100ms. They need to be acted upon within 2ms or they vanish.

You don't have the time budget to run a NNet every time the state of the market changes. You can, however, train a NNet to output a very small decision tree that can run in under 1ms. The NNet can then decide which "micro-strategy" in the form of a much faster and more reactive decision tree is more appropriate for the current market context.
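A toy sketch of that split, with entirely made-up names and thresholds: the slow model runs between opportunities and picks/parameterizes a tiny rule, and the hot path only evaluates the rule.

    // Toy version of "slow model picks a micro-strategy, hot path runs a tiny decision tree".
    // All names, fields and thresholds are illustrative.
    interface MicroStrategy {
        // returns desired order size: >0 buy, <0 sell, 0 do nothing
        int decide(long bidPrice, long askPrice, long imbalance);
    }

    final class SpreadCaptureRule implements MicroStrategy {
        private final long maxSpread;      // chosen by the slow model between trades
        private final long minImbalance;

        SpreadCaptureRule(long maxSpread, long minImbalance) {
            this.maxSpread = maxSpread;
            this.minImbalance = minImbalance;
        }

        @Override
        public int decide(long bidPrice, long askPrice, long imbalance) {
            // a two-level "decision tree": cheap comparisons only, no allocation
            if (askPrice - bidPrice > maxSpread) return 0;
            return imbalance > minImbalance ? 1 : (imbalance < -minImbalance ? -1 : 0);
        }
    }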


Flexibility in code is important. When markets change it is good to be able to quickly rework your code. Sometimes depending on the market this is more important than pure speed.

Market data is the slow monster - especially in the options market - not order entry. This is where the FPGAs come in to help.

Often I have seen trading groups throw money at infrastructure when it was really just that a competitor (a virtu or citadel) has a better way of internalizing order flow or managing risk so they can pay more for the same trade.



