
Complete X86/x64 JIT and Remote Assembler for C++ - nkurz
https://github.com/kobalicek/asmjit
======
haberman
Interesting. This seems to be more or less a C++ equivalent of DynASM
([http://luajit.org/dynasm.html](http://luajit.org/dynasm.html)). Like DynASM,
it lets you generate machine code at runtime. The library handles instruction
encoding and linking of jump targets.

Like DynASM, it exposes the underlying architecture/instructions directly. It
does not have an architecture-agnostic IR like LLVM.
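To make "generate machine code at runtime" concrete, here is what the core trick looks like with no library at all: hand-encoded x86-64 bytes plus POSIX mmap (a minimal Linux/macOS sketch; asmjit's value is doing the instruction encoding, jump fixups, and memory management for you):

```cpp
#include <cstdint>
#include <cstring>
#include <sys/mman.h>

// JIT the x86-64 function "mov eax, 42; ret" into executable memory,
// call it, and return its result.
int jit_const42() {
    // Hand-encoded machine code: B8 2A 00 00 00 = mov eax, 42; C3 = ret.
    const uint8_t code[] = { 0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3 };

    // Allocate a writable page, copy the bytes in, then flip it to executable.
    void* mem = mmap(nullptr, 4096, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return -1;
    memcpy(mem, code, sizeof(code));
    mprotect(mem, 4096, PROT_READ | PROT_EXEC);

    int result = reinterpret_cast<int (*)()>(mem)();
    munmap(mem, 4096);
    return result;
}
```

An assembler library replaces the hand-written byte array with mnemonics and resolves jump targets for you.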

It does do a little more than DynASM though. It provides (as an option) a
higher-level interface that abstracts away ABI calling conventions and
physical registers. It has a register allocator that assigns virtual registers
to physical ones, which presumably means it can also spill registers to the
stack when necessary. It is interesting; I have not come across any assemblers that
take this particular approach (native CPU instructions with virtual
registers).

One thing that confused me about the title: the project exposes a C++ API, but
it is not a JIT for compiling C++ programs _themselves_.

~~~
tiffanyh
More information on DynASM can be found in the unofficial reference guide.

[http://corsix.github.io/dynasm-doc/](http://corsix.github.io/dynasm-doc/)

------
saurik
Does anyone know what they mean by "remote"? I skimmed through the
documentation, and it still is not clear. When I search for "remote assembler"
the only results are 1) this project and 2) a kind of job in manufacturing
(mostly with reference to mainframes).

~~~
davepage
Remote in the sense that a local assembler runs on the developer's system,
while a remote assembler runs on the end user's system. Using this lib, the
developer uses C++ as if it were metacode or a macro system, emitting machine
code instructions on demand. This will presumably then execute faster than
plain C++ in certain applications.

I can think of several interesting applications for this, such as text search.

~~~
stingraycharles
Can you elaborate on the text search example? I genuinely have trouble finding
a fitting use case for this.

~~~
vardump
Regexp JIT was already mentioned. It's notable that the regexp JIT technique
is already used by practically all high-performance regexp libraries, such
as:

* Perl and PCRE: [http://sljit.sourceforge.net/pcre.html](http://sljit.sourceforge.net/pcre.html)

* Java: [https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pa...](https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html)

* .NET CLR: [http://msdn.microsoft.com/en-us/library/gg578045(v=vs.110).a...](http://msdn.microsoft.com/en-us/library/gg578045\(v=vs.110\).aspx)

And a lot of others.

For text search in particular, you could also take advantage of SSE 4.2 string
instructions when available, while still running on older CPUs.
[http://en.wikipedia.org/wiki/SSE4#SSE4.2](http://en.wikipedia.org/wiki/SSE4#SSE4.2)

Similar story with AVX2, which gives you 256-bit wide registers. Soon (Intel
Skylake in 2015) there will be AVX-512: 512-bit wide registers with byte-level
processing instructions. Being able to process 64 bytes in one instruction,
with an ILP [1] potential of two or more instructions per clock cycle, can
provide an order of magnitude performance advantage.

You can also optimize away unnecessary code for that particular search: no
need for those ignore-case etc. flags. Or, for example, you could omit
Unicode-related logic if you know ahead of time that normalization etc. won't
be necessary for this particular text search case. This can yield particularly
high savings if you're already limited by the branch predictor buffer, by
reducing the number of branches [2].
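The effect of compiling a flag away can be imitated ahead of time with templates; a JIT does the same thing per call site at runtime. A sketch (the `count_*` helpers are hypothetical names):

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Generic version: the ignore_case flag is tested inside the hot loop,
// costing a branch per character.
size_t count_generic(const std::string& s, char c, bool ignore_case) {
    size_t n = 0;
    for (unsigned char ch : s) {
        if (ignore_case ? (std::tolower(ch) == std::tolower((unsigned char)c))
                        : (ch == (unsigned char)c))
            ++n;
    }
    return n;
}

// Specialized version: the flag is a compile-time constant, so the compiler
// deletes the dead branch entirely -- the same effect a JIT gets by simply
// not emitting the unused code path.
template <bool IgnoreCase>
size_t count_specialized(const std::string& s, char c) {
    size_t n = 0;
    for (unsigned char ch : s) {
        if (IgnoreCase ? (std::tolower(ch) == std::tolower((unsigned char)c))
                       : (ch == (unsigned char)c))
            ++n;
    }
    return n;
}
```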

If the memory access patterns are not sequential (= predictable by the CPU),
you could insert prefetch instructions at CPU-model-appropriate places to
ensure data will be in the L1 cache in time before use.
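For instance, in a pointer-chasing walk the hardware prefetcher has nothing to go on, so you can issue software prefetches a couple of nodes ahead. A minimal sketch with the GCC/Clang `__builtin_prefetch` builtin:

```cpp
// Sum a linked list, prefetching two nodes ahead so the cache line is
// (hopefully) in L1 by the time the walk reaches it.
struct Node { int value; Node* next; };

long sum_list(const Node* n) {
    long total = 0;
    while (n) {
        if (n->next)
            __builtin_prefetch(n->next->next);  // safe even if null: prefetch never faults
        total += n->value;
        n = n->next;
    }
    return total;
}
```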

If you know the data is going to be searched only once, you could hint to the
CPU that you're streaming the data. The CPU can then optimize memory access
patterns and minimize L1/L2 cache evictions, because it knows this data should
not be kept in cache. In other words, non-temporal (= streaming) memory loads
and stores, like
[http://www.felixcloutier.com/x86/MOVNTDQA.html](http://www.felixcloutier.com/x86/MOVNTDQA.html).
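On the store side, the same hint is available as SSE2 intrinsics. A sketch of filling a buffer with non-temporal stores (x86/x86-64 only; `dst` must be 16-byte aligned):

```cpp
#include <emmintrin.h>  // SSE2: _mm_stream_si128, _mm_sfence
#include <cstddef>
#include <cstdint>

// Fill dst[0..n) with `value` using non-temporal stores, telling the CPU
// not to keep these lines in cache. dst must be 16-byte aligned.
void stream_fill(int32_t* dst, size_t n, int32_t value) {
    __m128i v = _mm_set1_epi32(value);
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        _mm_stream_si128(reinterpret_cast<__m128i*>(dst + i), v);
    for (; i < n; ++i)  // scalar tail for the last few elements
        dst[i] = value;
    _mm_sfence();  // order the streaming stores before any later loads
}
```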

You could do profile guided optimization at runtime. Or just try random
variations and pick the fastest for that particular combination of parameters
and hardware without recompiling anything. Different CPU models have a lot of
variation [3].
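A crude sketch of that "try variants and keep the fastest" idea, timing interchangeable implementations on the actual data and hardware (all names here are illustrative; a JIT extends this by also generating the candidates on the fly):

```cpp
#include <chrono>
#include <functional>
#include <vector>

// A candidate implementation: consumes the data, returns a result.
using Variant = std::function<long(const std::vector<int>&)>;

// Run every candidate once on a sample input and return the fastest one.
Variant pick_fastest(const std::vector<Variant>& candidates,
                     const std::vector<int>& sample) {
    Variant best;
    auto best_time = std::chrono::nanoseconds::max();
    for (const auto& v : candidates) {
        auto t0 = std::chrono::steady_clock::now();
        volatile long sink = v(sample);  // keep the call from being elided
        (void)sink;
        auto dt = std::chrono::duration_cast<std::chrono::nanoseconds>(
            std::chrono::steady_clock::now() - t0);
        if (dt < best_time) { best_time = dt; best = v; }
    }
    return best;
}
```

A real tuner would run each candidate many times and discard warm-up noise; this only shows the shape of the approach.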

And a lot of other things. If the data sets are large, ability to adapt to a
particular problem at runtime can have a huge payoff.

[1]: Instruction level parallelism.

[2]: A branch can mean if-statements, the ?: ternary operator, boolean logic
("||", "&&" etc.), switch statements, and so on. Every branch in the currently
executing loop can potentially need an entry in the CPU branch predictor. If
branch predictor buffer entries run out, the CPU might mispredict that branch
every time. The cost of a mispredicted branch is very high: on an Intel Ivy
Bridge processor, a single branch misprediction costs 14 clock cycles, which
is theoretically enough time to execute up to 4*14=56 instructions (in
practice, about 15-30). Slightly related links, LLVM CPU scheduler
definitions:

[http://llvm.org/klaus/llvm/blob/release_33/lib/Target/X86/X8...](http://llvm.org/klaus/llvm/blob/release_33/lib/Target/X86/X86SchedSandyBridge.td)

[http://llvm.org/klaus/llvm/blob/release_33/lib/Target/X86/X8...](http://llvm.org/klaus/llvm/blob/release_33/lib/Target/X86/X86SchedHaswell.td)

[3]: About CPU model variation, see for example this:
[http://www.agner.org/optimize/blog/read.php?i=285](http://www.agner.org/optimize/blog/read.php?i=285)

