
How low can you go? Ultra low latency Java in the real world [video] - based2
https://www.youtube.com/watch?v=BD9cRbxWQx8
======
ynniv
The reason Java has been so successful is that avoiding memory errors is far
more important than being fast. Once you have removed the possibility of those
errors, being fast is still valuable, which is how we end up with people
spending so much time optimizing their Java instead of switching to C.

It's more likely that people will switch to OCaml or Rust than ever go back to
unmanaged memory. First secure, then correct, then fast.

~~~
pron
In case you haven't watched the talk, people _don't_ spend so much time
optimizing Java; on the contrary, Java is often faster than C/C++ without
optimization effort. It does have a lower ceiling, but as he says, if you need
to be faster than Java (for the uses he talks about), then you're better off
going FPGA than another language, because that's the only way to gain a very
significant boost.

~~~
willtim
Getting good performance out of Java is not always easy. One often has to
structure code imperfectly to get efficient branching (Java offers only vtable
dispatch or switch on integers/enums). It also boxes nearly all values,
leading to cache-inefficient pointer chasing. These problems can be avoided,
but you'll be working against the language.
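As a rough, hypothetical illustration of the boxing point (the class and method names are invented for this sketch): both methods below compute the same sum, but the primitive version walks one contiguous array while the boxed version dereferences a separate heap object per element.

```java
import java.util.ArrayList;
import java.util.List;

public class BoxingDemo {
    // Cache-unfriendly: a List<Integer> stores references to heap-allocated
    // Integer boxes, so each element access is a pointer dereference plus an
    // unboxing step.
    static long sumBoxed(List<Integer> xs) {
        long total = 0;
        for (Integer x : xs) {
            total += x;
        }
        return total;
    }

    // Cache-friendly: an int[] lays the values out contiguously, so the CPU
    // can stream through whole cache lines with no indirection.
    static long sumPrimitive(int[] xs) {
        long total = 0;
        for (int x : xs) {
            total += x;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Integer> boxed = new ArrayList<>();
        int[] flat = new int[1000];
        for (int i = 0; i < 1000; i++) {
            boxed.add(i);
            flat[i] = i;
        }
        System.out.println(sumBoxed(boxed) == sumPrimitive(flat)); // true
    }
}
```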

~~~
atulatul
> switch on integers/enums

Strings in switch Statements:

[https://docs.oracle.com/javase/7/docs/technotes/guides/langu...](https://docs.oracle.com/javase/7/docs/technotes/guides/language/strings-switch.html)

~~~
willtim
Not sure that really makes up for the lack of sum types!

Supporting a switch on static final references would be a good first step.
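For context, a minimal sketch of the dispatch form Java does offer today (names invented): a switch over an enum compiles to an efficient table lookup, whereas switching on an arbitrary static final object reference is simply rejected by the compiler.

```java
public class DispatchDemo {
    enum Side { BUY, SELL }

    // A switch on an enum compiles to a tableswitch/lookupswitch; Java
    // permits switch operands only of integral types, String, and enums,
    // so `switch` on a static final object reference does not compile.
    static int sign(Side side) {
        switch (side) {
            case BUY:  return 1;
            case SELL: return -1;
            default:   throw new IllegalStateException("unknown side");
        }
    }

    public static void main(String[] args) {
        System.out.println(sign(Side.BUY));  // 1
    }
}
```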

------
matant
London has an amazing Java community. They organise a lot of interesting
talks!

~~~
kitd
They have a large pool of developers working in HFT etc, which has a history
of using Java.

~~~
piokoch
That has always amazed me. Why do people keep torturing themselves by
applying Java to use cases where it is clearly not the right technology? I've
watched a very interesting talk by the LMAX people, and half of it was about
how to overcome garbage collection pauses, latency, etc.

~~~
dmos62
I researched this a bit a few months back. I don't blame you for saying that
Java is wrong for HFT. My first reaction was the same, but there's much more
to it.

Java has a few things going for it in HFT. The obvious pluses are it's mature
and memory safe. What's less obvious is that you can make it low-latency. It
takes a lot of work, but it's doable, at which point you have all the nice
things: mature ecosystem, speed, latency, safety. It takes a lot of work,
because Java was always oriented towards server use cases, as in high-
throughput, not low-latency. That's changing, by the way: there are two new
low-latency-oriented GC engines coming out. Also, there have been third-party
JVMs with low-latency guarantees for quite a while.

Of course, what's between the lines is that there aren't any easy answers for
HFT people. You either choose mature safety and do GC gymnastics (because
everything is throughput-oriented), or you choose manual memory management,
which is its own kind of gymnastics.

Anyway, that's my take. I welcome input and contradictions.

~~~
bluGill
Mostly I agree, but there is one factor that makes C++ (or any natively
compiled language) better than Java for HFT: the ability to lie to the
optimizer about what the hot path is. In HFT you have thousands of no-trade
events for every trade, so the Java optimizer will treat the no-trade code
path as more likely; then, when a trade happens, Java pays a CPU branch
misprediction penalty at the only time low latency matters.

You have to have good algorithms optimized to the max for this to matter
though.
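One common workaround for this (not necessarily what the parent has in mind, and all names below are invented) is "path warming": periodically running the rare trade path with a synthetic dry-run input so the JIT keeps it compiled and hot, even though real trades are scarce.

```java
public class PathWarmer {
    static long tradesExecuted = 0;

    // Stand-in for the expensive, latency-critical order-submission logic.
    static void tradePath(boolean dryRun) {
        if (!dryRun) {
            tradesExecuted++;   // only real runs have side effects
        }
    }

    static void onMarketData(boolean shouldTrade, long tick) {
        if (shouldTrade) {
            tradePath(false);            // the rare branch that must be fast
        } else if (tick % 1024 == 0) {
            tradePath(true);             // periodic dry run keeps it warm
        }
    }

    public static void main(String[] args) {
        for (long tick = 0; tick < 100_000; tick++) {
            onMarketData(tick == 50_000, tick);  // exactly one real trade
        }
        System.out.println(tradesExecuted);      // 1
    }
}
```

The dry runs exercise the same compiled code as a real trade, so the trade path is not left cold and interpreted at the one moment latency matters.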

~~~
scott00
I was with you until you mentioned branch prediction... isn't branch
prediction a hardware feature? How do you trick the HW branch predictor into
predicting the unlikely case?

~~~
bluGill
Yes it is a hardware feature. However the hardware can be given hints as to
which branch is more likely. This is generally documented by the manufacturer,
in one of those technical documents aimed at compiler writers.

With profile-guided optimization it is possible for the compiler to have much
better information about branches than the CPU can guess. Java applies
profile-guided optimization in real time; with C++ it is much more complex to
apply.

~~~
usefulcat
> However the hardware can be given hints as to which branch is more likely.

I don't think that is the case for modern (last 8 years or so) Intel
processors. For example, I'm under the impression that gcc's __builtin_expect
only affects the layout of the generated code. However I'd love to learn
something new here; do you have a source or any additional info you could
share?

~~~
scott00
The source for this on Intel processors is the Intel Optimization Reference
Manual, section 3.4.1:
[https://www.intel.com/content/dam/www/public/us/en/documents...](https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf)

~~~
hyperpape
To me, that section says that you can emit machine code that will allow the
branch predictor to do a better job, not that you can control what the branch
predictor does.

~~~
bluGill
To my mind those amount to the same thing.

~~~
hyperpape
See my other comment:
[https://news.ycombinator.com/item?id=18735053](https://news.ycombinator.com/item?id=18735053).
I really should have consolidated them somehow.

------
based2
[https://www.reddit.com/r/java/comments/a7ygnm/how_low_can_yo...](https://www.reddit.com/r/java/comments/a7ygnm/how_low_can_you_go_ultra_low_latency_java_in_the/)

------
DrBazza
tl;dr

Use memory-mapped files to "talk" between processes. Use flyweights for the
messages. Avoid strings in the messages wherever possible. Use single-threaded
processes wherever possible. Isolate CPUs, and pin each process to an isolated
CPU. Turn off hyperthreading. Use some form of object pooling.

The last point is the only Java/GC-language-specific thing.

Having used Chronicle, I can confirm that you can see single-digit-microsecond
latency (or better), and reduce 99th-percentile jitter with significant
analysis.
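The memory-mapped-file part can be sketched in plain NIO (no Chronicle here; the field layout and names are invented for illustration). Two processes would map the same file; for brevity a single process plays both roles, writing fields flyweight-style at fixed offsets with no String allocation.

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MmapIpcDemo {
    static long[] roundTrip() throws Exception {
        File f = File.createTempFile("ipc", ".dat");
        f.deleteOnExit();
        try (RandomAccessFile raf = new RandomAccessFile(f, "rw");
             FileChannel ch = raf.getChannel()) {
            // Map a small shared region; in real use both processes map
            // the same file and see each other's writes.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, 64);

            // "Writer" side: fixed offsets, flyweight over the buffer,
            // no message objects and no strings.
            buf.putLong(0, 42L);   // e.g. order id
            buf.putInt(8, 100);    // e.g. quantity

            // "Reader" side (would be the second process).
            return new long[] { buf.getLong(0), buf.getInt(8) };
        }
    }

    public static void main(String[] args) throws Exception {
        long[] msg = roundTrip();
        System.out.println(msg[0] + " " + msg[1]);  // 42 100
    }
}
```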

~~~
xtreme
HPC software like MPI has been using these kinds of tricks for years to
achieve sub-microsecond (often as low as 0.2-0.3 us) latency for inter-process
communication.

------
lbj
I really enjoyed having this guy read his slides out loud

------
InGodsName
In short, you need to write Java like C++ for low latency.
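For instance, the object pooling mentioned above amounts to hand-rolled memory management, C++-style. A minimal sketch (all names invented): pre-allocate message objects and reuse them, so the steady-state path produces no garbage for the collector.

```java
import java.util.ArrayDeque;

public class MessagePool {
    static final class Message {
        long orderId;
        int quantity;
        void clear() { orderId = 0; quantity = 0; }
    }

    // LIFO free list of pre-allocated messages.
    private final ArrayDeque<Message> free = new ArrayDeque<>();

    MessagePool(int size) {
        for (int i = 0; i < size; i++) free.push(new Message());
    }

    Message acquire() {
        Message m = free.poll();
        return (m != null) ? m : new Message(); // allocate only if exhausted
    }

    void release(Message m) {
        m.clear();      // scrub state before the object is reused
        free.push(m);
    }

    public static void main(String[] args) {
        MessagePool pool = new MessagePool(2);
        Message m = pool.acquire();
        m.orderId = 7;
        pool.release(m);
        Message m2 = pool.acquire();  // same instance handed back out
        System.out.println(m == m2);  // true
    }
}
```

The trade-off is exactly the one the parent implies: you take back responsibility for object lifetimes (use-after-release bugs and all), which is what the GC normally handles for you.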

~~~
qalmakka
This. What is the point of using a higher-level, garbage-collected language
instead of C++ or Rust if you're going to spend the same amount of time
fighting the VM or the garbage collector to make it work the way you want? It
feels like it's just for the sake of avoiding any exposure to a newer
language.

~~~
kitd
The GC isn't the feature you want though, it's the JIT compiler.

JIT compilers optimise based on the actual runtime characteristics of the
running code, not the best guesses a static compiler has to make. The effect
can be significant: it's no surprise that the Java leaders in the TechEmpower
benchmarks match or surpass the C++ ones.

~~~
zozbot123
You can use profile-guided optimization with an AOT compiler too, or you
could just add optimization hints to the performance-critical parts of the
code. Either way, the compiler doesn't need to "guess".

~~~
gpderetta
Also, one of the advantages of a JIT is that it can take advantage of
whatever dedicated instructions are available on the current CPU, while an
AOT compiler (discounting runtime dispatching) needs to target the lowest
common denominator.

In this case though you would know exactly the target hardware.

