
How the JVM compares strings on x86 using pcmpestri - mpweiher
http://jcdav.is/2016/09/01/How-the-JVM-compares-your-strings/
======
lvh
The JVM has a number of cool features that enable you to efficiently drop down
into C (FFI) yourself: these tricks aren't just for built-ins.

A quick overview:

- First there was JNI. With JNI you write method stubs with the "native"
keyword. Then you run javah, which gives you some C glue code that you
eventually need to compile. This is very fast, but it's annoying because now
you need a toolchain everywhere. The Python equivalent of this is roughly
writing CPython extensions.

- People thought JNI was annoying, so Sun developed JNA. JNA lets you bind a
library directly: all the magic comes with the JVM, and you can just dlopen
something and call some syms. This works fine, but it's very slow. The Python
equivalent of this is roughly ctypes.

- Most recently, there's JNR and jnr-ffi. They do a very clever trick: you
use JNI to get to libffi, and then you use libffi to call everything else
performantly. You get roughly the performance of JNI, with the convenience of
JNA. The Python equivalent of this is roughly cffi.
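
As a rough illustration of the JNI step only, the Java side might look like the sketch below. The class name `NativeSum` and the method `add` are hypothetical, and no C library is actually loaded here, so the code only inspects the declaration via reflection. On newer JDKs `javac -h` takes the place of `javah`.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

// Hypothetical example class; not part of any real library.
final class NativeSum {
    // Declared in Java, implemented in C. With JNI naming rules the C side
    // would export a function called Java_NativeSum_add, and the library
    // would be loaded with System.loadLibrary before the first call.
    static native int add(int a, int b);

    // Reports whether the named (int, int) method is declared native.
    static boolean isNative(String name) {
        try {
            Method m = NativeSum.class.getDeclaredMethod(name, int.class, int.class);
            return Modifier.isNative(m.getModifiers());
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Calling add() here would throw UnsatisfiedLinkError, because no
        // native library has been loaded in this sketch.
        System.out.println("add is native: " + isNative("add"));
    }
}
```

JNA and jnr-ffi replace the hand-written C stub with a Java interface that is bound to the library at run time; both are third-party dependencies, so they are not shown here.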

JNR is way more usable than I thought it would be. I develop caesium[0], a
Clojure libsodium binding, using jnr-ffi and a pile of macros. I gave a talk
about this at Clojure Conj (recording [1], slides [2]) if you're interested.

To be fair: this code uses intrinsics, which means that it's implemented
differently than the three methods shown above, so it's still slightly
different. It's just not different in a way that's meaningful to you unless
you're working on the JVM itself :)

[0]: [https://github.com/lvh/caesium](https://github.com/lvh/caesium)

[1]: [https://dev-videos.com/videos/Lf-M1ZH6KME/Using-Clojure-with...](https://dev-videos.com/videos/Lf-M1ZH6KME/Using-Clojure-with-C-APIs-for-crypto-and-more--lvh)

[2]: [https://www.lvh.io/CCryptoClojure/#/sec-title-slide](https://www.lvh.io/CCryptoClojure/#/sec-title-slide)

~~~
quotemstr
JNR (and JNA and cffi) strike me as unsafe due to the lack of type
enforcement. Systems like this let you call any random pointer as if it were a
C function of any type. That usually works, but sometimes doesn't. If you're
lucky, mistakes make your program blow up right away. If you're unlucky, you
get impossible to diagnose memory corruption.

I'd much rather write conventional bridges and have the system check that I'm
right.

~~~
bdonlan
To be fair, when you're writing the C side of your JNI bindings, it'll also
let you call any random pointer as if it were a C function of any type. You'll
have non-type-safe code _somewhere_, it's just a choice of where.

~~~
quotemstr
Ah, not true! You'll have as much type-checking as C allows for, which is
actually quite substantial. More so in C++. The opportunities for accidental
type mismatch are much reduced: basically, to the JNI->function bindings
(eliminated if you codegen the glue).

~~~
lvh
I see what you’re saying, but that piece of code doesn’t exist at all in JNR
(so it can’t have bugs) and you can codegen JNR/FFI too to eliminate the same
hole, right?

I feel like saying “don’t mess up this signature” is not only plausible but
it’s a lot better than having to deal with writing and compiling the stub for
every platform.

This is why pycparser exists, for example.
[https://github.com/eliben/pycparser](https://github.com/eliben/pycparser)

------
pcwalton
One thing I learned about pcmpxstrx is that it's surprisingly slow: latency of
10-11 cycles and reciprocal throughput of 3-5 cycles on Haswell according to
Agner's tables, depending on the precise instruction variant. The instructions
are also limited in the ALU ports they can use. Since AVX2 has made SIMD on
x86 fairly flexible, it can sometimes not be worth using the string comparison
instructions if simpler instructions suffice: even a slightly longer sequence
of simpler SIMD instructions sometimes beats a single string compare.

The SSE 4.2 string comparison instructions still have their uses, but it's
always worth testing alternate instruction sequences when optimizing code that
might use them.
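
A scalar analogue of "simpler wide operations instead of one complex instruction", as a minimal sketch: compare eight bytes at a time with XOR and locate the first differing byte with a trailing-zero count. This is not SIMD and not what HotSpot emits; the class and method names are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; illustrates the idea only.
final class WideMismatch {
    // Returns the index of the first differing byte, the shorter length if
    // one array is a proper prefix of the other, or -1 if they are equal.
    static int mismatch(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        // Little-endian so that the lowest-order byte of each long is the
        // byte at the lowest array index.
        ByteBuffer wa = ByteBuffer.wrap(a).order(ByteOrder.LITTLE_ENDIAN);
        ByteBuffer wb = ByteBuffer.wrap(b).order(ByteOrder.LITTLE_ENDIAN);
        int i = 0;
        for (; i + 8 <= n; i += 8) {
            long diff = wa.getLong(i) ^ wb.getLong(i);
            if (diff != 0) {
                // Each byte covers 8 bits, so bit position / 8 is the byte offset.
                return i + (Long.numberOfTrailingZeros(diff) >>> 3);
            }
        }
        // Tail: fewer than 8 bytes left.
        for (; i < n; i++) {
            if (a[i] != b[i]) {
                return i;
            }
        }
        return a.length == b.length ? -1 : n;
    }

    static int mismatch(String a, String b) {
        return mismatch(a.getBytes(StandardCharsets.US_ASCII),
                        b.getBytes(StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) {
        System.out.println(mismatch("the quick brown fox", "the quack brown fox"));
        System.out.println(mismatch("the quick brown fox", "the quick brown fix"));
    }
}
```

Whether a sequence like this (or its SSE/AVX2 counterpart) beats a string-compare instruction depends on the data and the CPU, which is the point about always measuring.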

~~~
burntsushi
I have the same experience. I've tried using pcmpestr in substring search a
few times, and it has always turned out not to be worth it. I have, however,
never tried it in comparison functions, so I can't speak to that, but I
wouldn't be surprised if the latency of the instruction had an impact there
too.

------
dullgiulio
Go uses the same technique[1], or a Duff's device[2] when AVX2 is unsupported.

I am convinced that similar tricks are employed by almost every language
runtime or standard library (GNU libc also does this, etc.)

[1]
[https://github.com/golang/go/blob/master/src/runtime/asm_amd...](https://github.com/golang/go/blob/master/src/runtime/asm_amd64.s#L1530)
[2]
[https://en.wikipedia.org/wiki/Duff%27s_device](https://en.wikipedia.org/wiki/Duff%27s_device)

------
JanecekPetr
This guy's blog is awesome. A lot of in-depth information, original research,
nice and funny writing style. I hope he finds some time to write / speak more
soon.

------
userbinator
I find it a bit sad that, instead of optimising the existing REP CMPS
instruction to do vectorised compares like they did with REP MOVS/STOS and
block copies/writes, Intel introduced another even more complex instruction
that itself requires a bunch of additional support code to use. I certainly
don't think it's a "good use of CISC".

~~~
pcwalton
pcmpxstrx is a lot more powerful than rep cmps.

I'm lukewarm on pcmpxstrx too, but for a different reason: I'd prefer the
effort to go into more general purpose, highly flexible SIMD instructions
(which is thankfully happening now with AVX 512).

~~~
eddyb
What do you think about Cray-style vectors, which are coming back in the form
of ARM SVE and the RISC-V V extension?

At least the latter claims code compiled once is compatible with _all_
possible hardware configurations, from the start (by way of giving the CPU a
"remaining iterations count" and having it reply with how many it can do for
the chosen vector lane shapes).
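
The "remaining iterations count" handshake can be modelled in plain Java as a strip-mined loop. This is only a simulation under stated assumptions: `setVectorLength` is a hypothetical stand-in for the hardware's reply, `hardwareLanes` stands for whatever width a given CPU happens to have, and both arrays are assumed to have the same length.

```java
import java.util.Arrays;

// Hypothetical simulation of a vector-length-agnostic loop.
final class StripMine {
    // Stand-in for the hardware answering "you have `remaining` elements
    // left; I can process this many in the next step".
    static int setVectorLength(int remaining, int hardwareLanes) {
        return Math.min(remaining, hardwareLanes);
    }

    // Element-wise a + b. The loop never hard-codes a vector width, so the
    // same code works for any hardwareLanes >= 1.
    static int[] add(int[] a, int[] b, int hardwareLanes) {
        int[] out = new int[a.length];
        int i = 0;
        while (i < a.length) {
            int vl = setVectorLength(a.length - i, hardwareLanes);
            // On real hardware this inner loop would be one vector instruction.
            for (int lane = 0; lane < vl; lane++) {
                out[i + lane] = a[i + lane] + b[i + lane];
            }
            i += vl;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4, 5, 6, 7};
        int[] b = {10, 20, 30, 40, 50, 60, 70};
        System.out.println(Arrays.toString(add(a, b, 4)));
        System.out.println(Arrays.toString(add(a, b, 16)));
    }
}
```

Note that there is no separate scalar tail loop: the last step simply gets a shorter length back.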

IMO, if it does end up working that well in practice, it does put all of the
various incompatible versions of packed SIMD extensions in a pretty awkward
spot - could we have skipped all of MMX, SSE, AVX, NEON, etc. versions with
technology that has been around for almost half a century?

~~~
pcwalton
Sounds like a good idea to me. Seems similar to how GPUs work, which is
generally a good path to follow.

------
r4um
Another excellent series on JVM perf/internals:
[https://shipilev.net/jvm-anatomy-park/](https://shipilev.net/jvm-anatomy-park/)

------
exikyut
> _But did you know there is also a secret second implementation?
> String.compareTo is one of a few methods that is important enough to also
> get a special hand-rolled assembly version._

oooooh.

Is there a list somewhere of all the functions that have had such special
treatment?

~~~
nicoulaj
The intrinsics for the HotSpot JVM are declared here:
[http://hg.openjdk.java.net/jdk10/jdk10/hotspot/file/5ab7a67b...](http://hg.openjdk.java.net/jdk10/jdk10/hotspot/file/5ab7a67bc155/src/share/vm/classfile/vmSymbols.hpp#l679)

(look for "java_lang_Math" for instance)

~~~
rplnt
Are you telling us it's not a secret then?

~~~
sudhirj
Don’t know if this is sarcasm, but no, doubtful it’s a secret. Or might depend
on the JVM license. OpenJDK obviously has no secrets.

But otherwise, assembly overrides for hot paths on execution targets that
support them aren’t actually a big secret, just that they’re rarely visible.
I think Go has a lot of processor-architecture-specific assembly for some
functions, especially crypto, IIRC.

~~~
rplnt
Yeah, it was sarcasm. The article (quoted in the parent comment) called it a
secret implementation.

------
vardump
I think it could be fairly profitable performance-wise to write a specialized
version of compareTo (a simple memcmp) for cases where the result is directly
compared to 0, i.e. the equal or not-equal case.

Or to have a compareEqualityTo() version.

The current version supports greater and less as well, which is of course
useful in many data structures and sorting, but might not represent the bulk
of the call sites. This feature does not come for free.

This alternative version (memcmp alias) could be written to run faster.

Only benchmarking could tell whether or not it's ultimately worth it.

~~~
BeeOnRope
That version exists - it's called equals(), and it generates different code
that doesn't need to distinguish the less than/greater than cases.
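
A small sketch of the difference in contract, using only standard `String` methods (the class and helper names are made up): `compareTo` has to report an ordering, while `equals` only has to answer yes or no.

```java
// Hypothetical demo class.
final class CompareVsEquals {
    // Collapses compareTo's result to -1, 0 or 1.
    static int order(String a, String b) {
        return Integer.signum(a.compareTo(b));
    }

    static boolean same(String a, String b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        // compareTo must find the first differing char and report which
        // side is smaller; equals can stop at "they differ".
        System.out.println(order("apple", "apricot"));
        System.out.println(same("apple", "apricot"));
        System.out.println(order("apple", "apple"));
        System.out.println(same("apple", "apple"));
    }
}
```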

~~~
vardump
Of course, silly me, haven't touched Java for a while.

------
saagarjha
> The code that generates this, MacroAssembler::string_compare in
> macroAssembler_x86.cpp is well-documented for the curious.

I was expecting that this would be some C++ code, or maybe inline assembly,
but it turns out it's just a bunch of macros that are a thin layer over
assembly instructions. Is this done for portability, to abstract over
differing instruction names on different architectures?

------
ahmadzaraei
What a lovely read; I enjoy learning about the intricate details of the JVM.
Thank you for sharing!

------
ghusbands
tl;dr: The string comparison intrinsic in the JVM uses a vectorised string
comparison instruction.

(vpcmpestri (of the pcmpxstrx family) isn't an especially crazy instruction to
use for string comparison. That's what it's designed for.)

~~~
amelius
Thanks for the tl;dr. I think the "crazy" adjective refers to the instruction,
and not to the use of it. As the article explains, the instruction is quite
complicated and has a large number of parameters.

~~~
B4TMAN
Is there any place to get a proper manual of all these uses?

------
zakk
Click here to discover that crazy instruction! C++ programmers HATE IT!!

(I'm just being ironic about the title; the content is actually great!)

~~~
sctb
Yes indeed. We've un-clickbaited the title. Nothing is safe, it seems.

------
frik
UTF-16 is one of these ill-fated developments that curse some languages &
platforms (WinNT incl Win10, WinAPI32, Java, Flash, JS, Python 3) to this day.

    
    
      compareTo uses 0x19, which means doing the “equal each” 
      (aka string comparison) operation across 8 unsigned words 
      (thanks UTF-16!) with a negated result. This monster of an 
      instruction takes in 4 registers of input:

~~~
RyanZAG
Java uses Latin-1 (one byte per character) in the latest release for strings without special characters.

[http://www.baeldung.com/java-9-compact-string](http://www.baeldung.com/java-9-compact-string)

UTF16 is not really a curse for languages that require it. String operations
in non-English languages are very fast because of it, and most software these
days has to deal with localization.

~~~
oblio
Well, not necessarily a curse, but a suboptimal solution. Are there any
situations where UTF16 is a clear upgrade over UTF8?

~~~
Const-me
Any non-Latin string operations.

While UTF16 is technically variable length, 99.99% of cases use a single word
per character. I.e. on modern hardware with branch prediction and speculative
execution, these branches don't affect speed. With UTF8, the CPU mispredicts
branches all the time because spaces, punctuation and newlines are single
bytes even in non-Latin-1 text.

~~~
vardump
I think the most common operations are comparisons for equality and copying
anyways. UTF-8 is faster for those.

I tried out how fast I could make UTF-8 strlen, with an assumption of a valid
UTF-8 string. The routine ran at 18 GB/s on a single core using SSE.
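
For reference, the idea behind such a routine can be sketched in scalar Java (not SSE, and not the code that was benchmarked): in valid UTF-8 every code point contributes exactly one byte that is not a continuation byte, so counting those bytes gives the length. The class name is hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; assumes the input is valid UTF-8.
final class Utf8Length {
    static int codePoints(byte[] utf8) {
        int count = 0;
        for (byte b : utf8) {
            // Continuation bytes have the bit pattern 10xxxxxx; every other
            // byte starts a new code point.
            count += ((b & 0xC0) != 0x80) ? 1 : 0;
        }
        return count;
    }

    static int codePoints(String s) {
        return codePoints(s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // Mixed one-, two- and three-byte characters.
        System.out.println(codePoints("h\u00e9llo, \u4e16\u754c"));
    }
}
```

The loop body does the same work for every byte regardless of its value, which is why this kind of routine vectorises well.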

> With UTF8, CPU mispredicts branches all the time because spaces,
> punctuations and newlines are single bytes

I don't understand this sentence. Why would there be any more mispredictions
because of those being single bytes? These days code is so often bandwidth
limited if anything, so smaller data helps.

~~~
Const-me
> most common operations are comparisons for equality and copying anyways

Indexing & substrings are common, too.

> These days code is so often bandwidth limited if anything

Right, and for 1 billion Chinese speaking people UTF16 is 2 bytes/character,
UTF8 is 3 bytes/character.

~~~
vardump
> Indexing & substrings are common, too.

Indexing code points in both UTF-8 and UTF-16 requires reading the whole
string up to index location. Substrings are the same as well.

> Right, and for 1 billion Chinese speaking people UTF16 is 2 bytes/character,
> UTF8 is 3 bytes/character.

That's true for a text file without markup. But most text is not like that in
2017. HTML is probably the most common text format nowadays.

So let's see how a popular Chinese language website does.

    
    
      curl http://language.chinadaily.com.cn/ --silent | wc -c
      52678
    
      curl http://language.chinadaily.com.cn/ --silent | iconv -f utf8 -t utf-16le | wc -c
      93368
    

So UTF-8 seems to be quite a bit more efficient in this case, 52678 bytes.
When converted to UTF-16, same page was 93368 bytes.
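
The same effect can be reproduced on a tiny scale in Java. This is a sketch with made-up sample strings, not a measurement of any real page: bare CJK text is smaller in UTF-16, but a little ASCII markup is enough to flip the result.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the sample strings are invented.
final class EncodedSize {
    static int utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int utf16(String s) {
        // UTF_16LE writes no byte-order mark, so this is 2 bytes per char.
        return s.getBytes(StandardCharsets.UTF_16LE).length;
    }

    public static void main(String[] args) {
        String plain = "\u4e2d\u6587\u7f51\u7ad9"; // four CJK characters
        String marked = "<p class=\"title\">" + plain + "</p>";
        System.out.println("plain:  utf8=" + utf8(plain) + " utf16=" + utf16(plain));
        System.out.println("markup: utf8=" + utf8(marked) + " utf16=" + utf16(marked));
    }
}
```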

~~~
Ded7xSEoPKYNsDd
> Indexing code points in both UTF-8 and UTF-16 requires reading the whole
> string up to index location. Substrings are the same as well.

Java's String functions don't index by Unicode code points, though. Java
strings are encoded in UCS-2, or at least the API needs to pretend that they
are.
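
A short sketch of what that looks like in practice, using only standard `String` and `Character` methods (the class and helper names are made up): `length()` and `charAt()` count 16-bit units, and code-point access needs a separate, linear-time step.

```java
// Hypothetical demo class.
final class CharVsCodePoint {
    // Returns the code point at the given code-point index, which requires
    // walking the string from the start.
    static int codePointAtIndex(String s, int codePointIndex) {
        return s.codePointAt(s.offsetByCodePoints(0, codePointIndex));
    }

    public static void main(String[] args) {
        // 'a', then U+1F600 (stored as a surrogate pair), then 'b'.
        String s = "a\uD83D\uDE00b";
        System.out.println("length() in chars:  " + s.length());
        System.out.println("code points:        " + s.codePointCount(0, s.length()));
        System.out.println("charAt(1) is high surrogate: "
                + Character.isHighSurrogate(s.charAt(1)));
        System.out.println("third code point is 'b': "
                + (codePointAtIndex(s, 2) == 'b'));
    }
}
```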

~~~
Const-me
Right. Same in C#, C++ STL, and in Apple’s obj-c/swift.

~~~
tjalfi
Swift takes a different approach than Objective C[0].

[0] [https://www.mikeash.com/pyblog/friday-qa-2015-11-06-why-is-s...](https://www.mikeash.com/pyblog/friday-qa-2015-11-06-why-is-swifts-string-api-so-hard.html)

