Author here. You're right that we designed Terra primarily to be an environment for generating low-level code.
In particular, we want to be able to easily design and prototype DSLs and auto-tuners for high-performance programming applications.
We explain this use-case in more detail in our upcoming PLDI paper (http://terralang.org/pldi071-devito.pdf).
Since we are primarily using it for dynamic code generation, I haven't done much benchmarking against LuaJIT directly. Instead, we have compared it to C by implementing a few of the language benchmarks (nbody and fannkuchredux; performance is normally within 5% of C), and compared it against ATLAS, which implements BLAS routines by autotuning x86 assembly. In the case of ATLAS, we're 20% slower, but we are comparing auto-tuned Terra against auto-tuned x86 assembly.
Small note, the BF description on the website does go on to implement the '[' and ']' operators below. I just left them out of the initial code so it was easier to grok what was going on. The full implementation is at (https://github.com/zdevito/terra/blob/master/tests/bf.t).
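For readers who haven't implemented BF before, here is a minimal sketch of what handling '[' and ']' involves. It is plain Python, not the Terra code from bf.t, and the names `build_jump_table` and `run_bf` are made up for illustration. The only assumption beyond standard BF semantics is that reading past the end of input stores 0.

```python
def build_jump_table(program):
    """Map each '[' to its matching ']' and each ']' back to its '['."""
    stack, jumps = [], {}
    for pos, op in enumerate(program):
        if op == "[":
            stack.append(pos)
        elif op == "]":
            if not stack:
                raise SyntaxError(f"unmatched ']' at {pos}")
            start = stack.pop()
            jumps[start], jumps[pos] = pos, start
    if stack:
        raise SyntaxError(f"unmatched '[' at {stack[-1]}")
    return jumps


def run_bf(program, input_bytes=b"", tape_size=30000):
    """Interpret a BF program and return its output as bytes."""
    jumps = build_jump_table(program)
    tape = [0] * tape_size
    out = []
    ptr = pc = in_pos = 0
    while pc < len(program):
        op = program[pc]
        if op == ">":
            ptr += 1
        elif op == "<":
            ptr -= 1
        elif op == "+":
            tape[ptr] = (tape[ptr] + 1) % 256
        elif op == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif op == ".":
            out.append(tape[ptr])
        elif op == ",":
            # Assumption: end of input reads as 0.
            tape[ptr] = input_bytes[in_pos] if in_pos < len(input_bytes) else 0
            in_pos += 1
        elif op == "[" and tape[ptr] == 0:
            pc = jumps[pc]  # skip the loop body
        elif op == "]" and tape[ptr] != 0:
            pc = jumps[pc]  # jump back to the loop start
        pc += 1
    return bytes(out)
```

The bracket matching is done once up front, so the loop operators become constant-time jumps. A code generator can use the same pairing to emit structured loops instead of interpreting.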
This is really great CS work. Props. The fact that your numerical example is DGEMM, AND that you're comparing against ATLAS and MKL is very compelling, especially since you're only showcasing the kernel itself!
I'm taking a different, albeit related, approach to dynamic runtime code gen, but either way this is rock solid work, though I'm pretty terrible at deciphering the Lua- and macro-heavy code in your examples.
edit: I'm doing something more akin to the Accelerate Haskell EDSL approach, with some changes
It's also a very rare research paper that actually uses BLAS DGEMM as the benchmark without being a paper by someone explicitly focused on writing BLAS. Usually they just use dot product or a local convolution kernel (whereas in some sense matrix mult is a global convolution).
Just what they've done is pretty solid. That said, it's not really done as part of a framework for numerics, which just means it's a great validation benchmark of their code gen.
I saw LLVM IR being referenced, but I am not sure if you are referring to the LLVM bitcode. If you are, wouldn't it be possible to compile Terra to JavaScript by using Emscripten?