Since we are primarily using it for dynamic code generation, I haven't done much benchmarking against LuaJIT directly. Instead, we have compared it C by implementing a few of the language benchmarks (nbody and fannkuchredux, performance is normally within 5% of C), and comparing it against ATLAS, which implements BLAS routines by autotuning x86 assembly. In the case of ATLAS, we're 20% slower, but we are comparing auto-tuned Terra and auto-tuned x86 assembly.
Small note, the BF description on the website does go on to implement the '[' and ']' operators below. I just left them out of the initial code so it was easier to grok what was going on. The full implementation is at (https://github.com/zdevito/terra/blob/master/tests/bf.t).
I'm taking a different albeit related approach for dynamic runtime code gen, but either way this is rock solid work, though I'm pretty terrible at deciphering the lua + macro heavy code that is your code examples.
edit: I'm doing something more akin to the Accelerate haskell EDSL approach, with some changes
Just what they've done is a pretty solid. That said, it's not really done as part of a framework for numerics, which just means its a great validation benchmark of their code Gen.
Here's one perhaps relevant paper about LuaJIT for dynamic code generation in QEMU-esque instruction set simulation: http://ieee-hpec.org/2012/index_htm_files/Steele.pdf
It would be better to comparing it to LuaJIT with ffi