
Tracing JITs and modern CPUs part 3: A bad case - nkurz
https://github.com/lukego/blog/issues/8
======
nkurz
_This is necessary because there is a branch (if) in our code and traces are
not allowed to have internal branches._

I (mostly) understand the performance of the assembly, but don't know much (if
anything) about the limitations of tracing JITs. Predictable branches such as
the one in this example are approximately free on modern CPUs, so it's not
clear to me why the JIT couldn't just treat the branching assembly as part of
the root trace. Is there a fundamental reason why a trace is not allowed to
have internal branches, or is this just an implementation detail of the current
trace selection algorithm?

~~~
rayiner
A branch-free section of straight-line code is pretty much the definition of a
trace. You can use something other than traces in a "tracing JIT" of
course,[1] but your complexity goes up.

One nice property of straight-line traces is that it makes optimization
extremely simple. Any instruction dominates all instructions that occur later
in the trace. In the example with the "if" statement inside the loop, that is
not true. Operations that happen in either branch do not dominate anything
that comes after control flow merges back together after the if statement. So
you have to do more sophisticated analysis to merge data flows at that point.
It's doable--the point of SSA representation is to make that easier--but it's
a lot slower and more complex than what you can do with straight line code.
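As a hypothetical sketch (mine, not from the post) of why the merge point is
the problem:

```lua
-- Two definitions of x reach the line after the "if", so neither
-- assignment dominates it. An optimizer needs a merge (an SSA phi-node)
-- there, whereas in a straight-line trace every instruction dominates
-- everything after it.
local function merge(cond, a, b)
   local x
   if cond then
      x = a * 2   -- this definition does not dominate the return
   else
      x = b + 1   -- neither does this one
   end
   return x       -- merge point: both definitions can reach here
end

print(merge(true, 3, 0), merge(false, 0, 3))  -- 6  4
```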

[1] E.g. trees of traces:
[https://github.com/oleganza/iovm2/blob/master/doc/papers/Incremental%20Dynamic%20Code%20Generation%20with%20Trace%20Trees.pdf](https://github.com/oleganza/iovm2/blob/master/doc/papers/Incremental%20Dynamic%20Code%20Generation%20with%20Trace%20Trees.pdf).
Trace trees avoid the above problem by basically doing aggressive tail
duplication.

~~~
nkurz
Thanks for the answer and the link. I think my confusion was that an "if"
statement in the interpreted language might not compile to a branch at all: it
might even be implemented as a conditional move in assembly. Thus it seems
like it would be useful to distinguish "if (condition) a = a + 1" from "if
(condition) frobnicate(a)" when defining a trace, even though both are written
with an "if" in the interpreted language.
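For what it's worth, the first form can even be made branch-free at the source
level. A portable sketch (mine; plain arithmetic rather than LuaJIT's bit.* so
it runs on any Lua):

```lua
-- Branch-free version of: if i % 2 == 0 then a = a + 1 end
-- (1 - i % 2) is 1 for even i and 0 for odd i, so the loop body needs
-- no branch. A call like frobnicate(a) could not be rewritten this way,
-- because the call itself must be either taken or skipped.
local function count_even(n)
   local a = 0
   for i = 1, n do
      a = a + (1 - i % 2)
   end
   return a
end

print(count_even(10))  -- 5
```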

------
seanmcdirmid
I'm working on a live programming environment that is quite reactive,
requiring a lot of bookkeeping to know when recomputations occur. Since code
can be executed multiple times over time, I perform a trace-based optimization
that caches bookkeeping work in a fast path as long as the objects used and
branches taken don't change. So the fast path has to check that these
invariants haven't changed, and its branching behavior is necessarily always
the same as that of the slow path that preceded it (otherwise, another
slow-path recomputation is triggered).

And branching behavior for the most part doesn't change: your condition is
often a continuous function, where small changes in input cause small changes
in output, meaning the condition flips between true and false only very
rarely. Conditions that are continuous nowhere, like the one pointed out in
the example (if i % 2 == 0), are very rare in real code! Modern CPUs also
depend on coherent branching behavior, and conditions like this throw it off.
I bet most of the 15X slowdown that the author is experiencing comes not from
the extra conditions, but from thrashing in the BPU (branch prediction unit)!

~~~
nkurz
_Modern CPUs also depend on coherent branching behavior, and conditions like
this throw it off. I bet most of the 15X slowdown that the author is
experiencing is not from the extra conditions, but from thrashing in the BPU
(branch prediction unit)!_

You might be underestimating the efficiency of branch prediction on modern
CPUs. It's getting close to the point where if you (the human) can easily
predict the pattern based on past history, the CPU will perfectly predict it
as well. Manufacturers are usually a little cagey about the exact
specifications, so one is often limited to empirical testing, but a simple
pattern like the one in the example is going to be predicted perfectly after
the first couple of iterations of training. Here's what one authority who's
done lots of testing has to say about Haswell, the CPU the post is concerned
with:

    
    
      The Haswell is able to predict very long repetitive jump  
      patterns with few or no mispredictions. I found no specific 
      limit to the length of jump patterns that could be 
      predicted. Loops are successfully predicted up to a count
      of 32 or a little more. Nested loops and branches inside 
      loops are predicted reasonably well.
    

[http://www.agner.org/optimize/microarchitecture.pdf](http://www.agner.org/optimize/microarchitecture.pdf)
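One way to see this empirically is to time the same loop under a predictable
and an unpredictable condition. This is my sketch, not from the thread;
absolute timings are machine-dependent, and under an interpreter the dispatch
overhead will mask much of the branch-prediction effect:

```lua
-- Compare a perfectly predictable condition (i % 2 == 0) with a
-- pseudo-random one over the same loop. On a CPU with a good branch
-- predictor the alternating pattern stays cheap, while the random
-- condition forces frequent mispredictions.
local function bench(pred, n)
   local count = 0
   local t0 = os.clock()
   for i = 1, n do
      if pred(i) then count = count + 1 end
   end
   return os.clock() - t0, count
end

local n = 1000000
local t_even, c_even = bench(function(i) return i % 2 == 0 end, n)
math.randomseed(42)
local t_rand, c_rand = bench(function(i) return math.random() < 0.5 end, n)
print(string.format("even: %.4fs (%d hits), random: %.4fs (%d hits)",
                    t_even, c_even, t_rand, c_rand))
```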

~~~
seanmcdirmid
Looks like you guys are right, though I do have to execute the loop multiple
times to find the same performance result. Weird.

------
malkia
A long time ago I did some experiments with LuaJIT and branching (well, more
with the CPU). It was based on this article:
[http://igoro.com/archive/fast-and-slow-if-statements-branch-prediction-in-modern-processors/](http://igoro.com/archive/fast-and-slow-if-statements-branch-prediction-in-modern-processors/)

[https://raw.githubusercontent.com/malkia/ufo/master/samples/...](https://raw.githubusercontent.com/malkia/ufo/master/samples/bench/badif.lua)

    
    
      local ffi = require("ffi")  -- unused in this sample; left from the original
      local bit = require("bit")  -- explicit require; LuaJIT also preloads "bit"
      local band = bit.band
    
      local function test(n, m)
         local count = 0
         for i = 1, n do
            if band(i, m) == 0 then count = count + 1 end
         end
         return count
      end
    
      local function timeit(m)
         local t = os.clock()
         test(0xFFFFFFF,m)
         local t = os.clock() - t
         print(string.format("%08X",m),t)
      end
    
      timeit(0x80000000)
      timeit(0xffffffff)
      timeit(1)
      timeit(3)
      timeit(2)
      timeit(4)
      timeit(8)
      timeit(16)
    
      --[[
      -- This is on OSX 10.7.2 MBP 2008 Jan build
        ./luajit samples/badif.lua
      80000000	14.067395
      FFFFFFFF	20.252955
      00000001	13.497108
      00000003	17.337942
      00000002	14.142266
      00000004	14.221376
      00000008	14.42377
      00000010	14.708237
      --]]
    

Results now, in 2015 (again a MacBook running OS X; much faster than before,
and now with LuaJIT 2.0.4):

    
    
      80000000	0.247822
      FFFFFFFF	1.253168
      00000001	0.664586
      00000003	0.974336
      00000002	0.676823
      00000004	0.679609
      00000008	0.675207
      00000010	0.673961
    

Without the JIT, just the interpreter (the results are pretty much the same
across all patterns):

    
    
      $ src/luajit -joff ~/badif.lua 
    
      80000000	3.329909
      FFFFFFFF	3.553143
      00000001	3.440378
      00000003	3.484646
      00000002	3.424834
      00000004	3.424539
      00000008	3.616981
      00000010	3.598598

------
haberman
Perhaps this example is just meant to be illustrative, but if this is a real
case, it seems easy to remove the branch:

    
    
        local counter = require("core.counter")
    
        local n = 1e9
        local c = counter.open("test")
        for i = 1,n do
           -- bit.band(i, 1) is 1 for odd i and 0 for even, so this adds
           -- 10 for odd i and 1 for even, with no branch in the loop body.
           counter.add(c, 1 + (bit.band(i, 1) * 9))
        end

~~~
tedunangst
Yup. The link at the end goes to
[http://wiki.luajit.org/Numerical-Computing-Performance-Guide](http://wiki.luajit.org/Numerical-Computing-Performance-Guide),
which mentions much the same: "Use bit.*, e.g. for conditional index
computations."
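A small sketch (my example, not from the wiki) of what "conditional index
computations" means: compute the array index arithmetically so the loop body
has no branch to predict.

```lua
-- Instead of: if i % 2 == 0 then evens = evens + 1 else odds = odds + 1 end
-- pick the bucket index with arithmetic, so there is no branch to predict.
local buckets = {0, 0}         -- buckets[1] counts odd i, buckets[2] even i
for i = 1, 10 do
   local idx = 2 - i % 2       -- 2 when i is even, 1 when odd
   buckets[idx] = buckets[idx] + 1
end
print(buckets[1], buckets[2])  -- 5  5
```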

It's certainly a real problem that can affect real code; this "issue" just
falls a wee bit short of demonstrating how to fix it.

------
sillyryan
I had read somewhere that LuaJIT's documentation specifically recommends
avoiding branches in loops.

