
I tried to do something similar with 'first-class' dimension objects in PyTorch: https://github.com/pytorch/pytorch/blob/main/functorch/dim/R... . For instance, multi-head attention looks like:

    from torchdim import dims, softmax
    def multiheadattention(q, k, v, num_attention_heads, dropout_prob, use_positional_embedding):
        batch, query_sequence, key_sequence, heads, features = dims(5)
        heads.size = num_attention_heads
    
        # binding dimensions, and unflattening the heads from the feature dimension
        q = q[batch, query_sequence, [heads, features]]
        k = k[batch, key_sequence, [heads, features]]
        v = v[batch, key_sequence, [heads, features]]
    
        # einsum-style operators to calculate scores,
        attention_scores = (q*k).sum(features) * (features.size ** -0.5)
    
        # use first-class dim to specify dimension for softmax
        attention_probs = softmax(attention_scores, dim=key_sequence)
    
        # dropout works pointwise, following Rule #1
        attention_probs = torch.nn.functional.dropout(attention_probs, p=dropout_prob)
    
        # another matrix product
        context_layer = (attention_probs*v).sum(key_sequence)
    
        # flatten heads back into features
        return context_layer.order(batch, query_sequence, [heads, features]) 
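For reference, the same computation with conventional positional axes requires manual reshape/transpose bookkeeping. This is a sketch in NumPy with made-up shapes (dropout omitted), not the torchdim API:

```python
import numpy as np

# made-up shapes, purely for illustration
batch, qlen, klen, heads, feat = 2, 4, 6, 3, 5
q = np.random.rand(batch, qlen, heads * feat)
k = np.random.rand(batch, klen, heads * feat)
v = np.random.rand(batch, klen, heads * feat)

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# unflatten heads out of the feature dimension
qh = q.reshape(batch, qlen, heads, feat)
kh = k.reshape(batch, klen, heads, feat)
vh = v.reshape(batch, klen, heads, feat)

# scores: contract over features -> (batch, heads, qlen, klen)
scores = np.einsum('bqhf,bkhf->bhqk', qh, kh) * feat ** -0.5
probs = softmax(scores, axis=-1)

# context: contract over the key sequence, then flatten heads back
ctx = np.einsum('bhqk,bkhf->bqhf', probs, vh)
out = ctx.reshape(batch, qlen, heads * feat)
```

The einsum strings carry the same information the first-class dim objects carry above, but the reader has to decode them positionally.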
        
However, my impression from trying to attract a wider user base is that while numpy-style APIs are maybe not as good as some better array language, they might not be the bottleneck for getting things done in PyTorch. Other domains might suffer more, though, and I am very excited to see a better array language catch on.


When developing PyTorch, we also run into a lot of mixed Python/C++ language situations. We've recently been experimenting with in-process 'combined' Python/C++/PyTorch 2.0 stack traces to make it easier to understand where code is executing (https://dev-discuss.pytorch.org/t/fast-combined-c-python-tor...).


Funny, I was going to post this, but the horse's mouth is even better ;)


One of our design goals was to make sure terra could execute independently of Lua. So everything that you describe is possible. For instance our simple hello world program (https://github.com/zdevito/terra/blob/master/tests/hello.t) compiles a standalone executable with the "terralib.saveobj" function. You can also write out object (.o) files that are ABI compatible with C. For instance, gemm.t (https://github.com/zdevito/terra/blob/master/tests/gemm.t) our matrix-matrix multiply autotuner writes out a .o file my_dgemm.o which we then call from a test harness in a separate C program (https://github.com/zdevito/terra/blob/master/tests/reference...). Once you have the .o files, you can use Lua to call the system linker to generate a dynamic library.


Yes! One of the benefits of making sure that Terra code can execute independently of Lua is that you can use multi-threading libraries pretty much out-of-the-box. For instance, we have an example that launches some threads using pthreads (https://github.com/zdevito/terra/blob/master/tests/pthreads....).

There are still some limitations. You'd still have to manage thread synchronization manually, and I think LuaJIT only allows one thread of Lua execution to run at a time, so if your threads call back into Lua they may serialize on that bottleneck.
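For intuition, that single-Lua-thread restriction behaves like a global interpreter lock. A toy Python sketch of the pattern (the names and numbers are made up, and the explicit lock just models LuaJIT's restriction): threads doing pure native (Terra) work run freely in parallel, but any callback into the shared Lua state must serialize.

```python
import threading

lua_lock = threading.Lock()   # models LuaJIT's one-Lua-thread-at-a-time restriction
results = []

def terra_worker(n, calls_back_into_lua):
    # "Terra" portion: pure native computation, free to run in parallel
    total = sum(i * i for i in range(n))
    if calls_back_into_lua:
        # any callback into the shared Lua state serializes here
        with lua_lock:
            results.append(total)
    else:
        results.append(total)

threads = [threading.Thread(target=terra_worker, args=(100, True))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The more work you keep on the "native" side of that boundary, the less the shared lock matters.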


So it's using LuaJIT and not PUC Lua? (A good thing to be sure, as LuaJIT is much faster, even without the JIT.)

Is there a way to run Terra with a separate Lua state per thread? So as to not have the problem with serializing when calling Lua from Terra?


To answer my own questions: yes, it uses (relies on) LuaJIT, and there seems to be no problem running multiple Lua states, each using Terra. In fact, because of the independent nature of compiled Terra code, I would wager you can create the Lua states and threads from within Terra itself. LuaJIT actually can't do this currently; you need a little bit of C code, because the callback passed to pthread_create will be invoked in the new thread on the old state (which would be invalid in LuaJIT, as you can't share a state across threads like that). Anyway, I'll try it out and submit it as a test case for Terra if it works.


LuaJIT definitely has sweet spots where it just flies, but there are other cases where LuaJIT isn't faster than PUC Lua, e.g. where you're doing a lot of string manipulation and calling into non-Lua libraries. In "average" code it varies a lot, but you often don't get the insane speedups that make LuaJIT look so great on small benchmarks.

Given that LuaJIT has some other drawbacks (e.g. it has memory limitations that PUC Lua doesn't have, due to the details of LuaJIT's NaN-encoding), the usual lesson applies: YMMV, so benchmark... :]


Yes, that's true, except for the part about calling into non-Lua libraries; that is where LuaJIT + ffi really shines. It can actually optimize away boxing/unboxing and inline the native function call into the trace (not the body of the native function, but the call itself). Surely you're referring to something else? The often-repeated wisdom on the LuaJIT mailing list is to use the Lua C API only for legacy code, as it can't come close to the performance or ease of use of the ffi. If your experience is otherwise, maybe the JIT bailed on your test code? The ffi is very slow without the JIT.
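For readers more familiar with Python, the ffi style is roughly analogous to ctypes declaring a C function signature and calling it directly, with no extension-module glue; the difference is that LuaJIT can additionally inline such a call into a compiled trace. A rough sketch of the analogy:

```python
import ctypes
import ctypes.util

# declare and call a C function directly, no wrapper code needed
libc = ctypes.CDLL(ctypes.util.find_library("c"))
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

print(libc.strlen(b"hello, ffi"))  # 10
```

In LuaJIT the equivalent `ffi.cdef`/`ffi.C` call sites get compiled into the trace, which is why the mailing-list advice favors the ffi over the C API for hot paths.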

String and memory limitations have yet to bother me at all, because anywhere speed matters you get order-of-magnitude improvements by managing the strings/memory yourself via the ffi. With Terra, it's clear that's the approach being advocated as well. I agree it really bollixes small benchmarks, especially if the code is written for PUC Lua and not done the "LuaJIT way" with the ffi. Outside of embedded or other exotic environments, I think one would be hard-pressed to come up with a real-world workload where PUC Lua outperforms LuaJIT and there's no easy way to turn the tables. There are just far more options for optimization with LuaJIT and more ability to get close to the metal than you have with PUC Lua.

None of that is to take away from what the PUC Lua guys have accomplished. Like any craftsman who enjoys his work, I just like to use the best tools. That's LuaJIT in my opinion, and now Terra too.


Author here. You're right that we designed Terra primarily to be an environment for generating low-level code. In particular, we want to be able to easily design and prototype DSLs and auto-tuners for high-performance programming applications. We explain this use case in more detail in our upcoming PLDI paper (http://terralang.org/pldi071-devito.pdf).

Since we are primarily using it for dynamic code generation, I haven't done much benchmarking against LuaJIT directly. Instead, we have compared it to C by implementing a few of the language benchmarks (nbody and fannkuchredux; performance is normally within 5% of C), and compared it against ATLAS, which implements BLAS routines by autotuning x86 assembly. In the case of ATLAS, we're 20% slower, but we are comparing auto-tuned Terra against auto-tuned x86 assembly.

Small note: the BF description on the website does go on to implement the '[' and ']' operators below. I just left them out of the initial code so it was easier to grok what was going on. The full implementation is at https://github.com/zdevito/terra/blob/master/tests/bf.t.


This is really great CS work. Props. The fact that your numerical example is DGEMM, AND that you're comparing against ATLAS and MKL is very compelling, especially since you're only showcasing the kernel itself!

I'm taking a different, albeit related, approach to dynamic runtime code gen, but either way this is rock-solid work, though I'm pretty terrible at deciphering the Lua + macro-heavy code in your examples.

edit: I'm doing something more akin to the Accelerate haskell EDSL approach, with some changes


It's also a very rare research paper that actually uses BLAS dgemm as the benchmark yet isn't written by someone explicitly focused on writing BLAS. Usually they just use a dot product or a local convolution kernel (whereas in some sense matrix multiply is a global convolution).

What they've done is pretty solid on its own. That said, it's not really done as part of a framework for numerics, which just means it's a great validation benchmark for their code gen.


Cool stuff :-)

Here's one perhaps relevant paper about LuaJIT for dynamic code generation in QEMU-esque instruction set simulation: http://ieee-hpec.org/2012/index_htm_files/Steele.pdf


Thanks for the info! Sorry for missing the implemented '[' and ']' -- you might want to replace the "NYI" comment with "Implemented below". :)


>I haven't done much benchmarking against LuaJIT directly

It would be better to compare it to LuaJIT with the ffi.


Thank you so much for this. You actually implemented Haskell/Caml for us, mere mortals, and as close to bare metal as possible. Bravo!


I saw LLVM IR being referenced, but I am not sure if you are referring to the LLVM bitcode. If you are, wouldn't it be possible to compile Terra to JavaScript by using Emscripten?

