
Inline assembly in Linux - 0xAX
https://github.com/0xAX/linux-insides/blob/master/Theory/asm.md
======
sillysaurus3
I'm trying to add call/cc to node, or to lua.

Recap: call/cc is the ability to save the current state of a running thread,
then revert to that state at a later point in time. In other words, at any
point in your program, you can say "Save the current stack." It's saved as a
function. Later, whenever you call that function, the current stack is thrown
out, and replaced with the saved stack.

This is very useful for a number of reasons. It's also a very rare feature to
have in your language.

Neither node nor lua has support for this. The closest I've found is a Lua
extension which adds coroutine.clone(). In principle, this is the solution. In
practice, it has a number of limitations, such as restrictions on when you're
allowed to call coroutine.clone(). (For example, if your stack looks like Lua
-> C -> Lua, then it won't work.)

I tried to channel my inner Mike Pall and solve this problem once and for all,
but I'm not Mike Pall, and this is really hard. I was hoping you might know of
any possible solution which is (a) practical, (b) cross-platform, and (c)
works in all cases.

Why post this here? Because this post happens to attract exactly the kind of
people that might know a way forward. There _must_ be a way. Apologies for the
off-topic comment.

~~~
haberman
> if your stack looks like Lua -> C -> Lua, then it won't work.

I don't think you can safely solve this in the general case. There is a key
problem I don't think you can work around.

Say your stack looks like C(1) -> Lua -> C -> Lua. The outermost C frames
might not know anything about Lua (they just use some library that uses Lua as
a library). Say you try to take a snapshot of this stack to create a
continuation. You probably just want to snapshot the Lua -> C -> Lua part,
since that is the portion of the stack representing the execution of the Lua
program.

Now say all these frames return. Then the main program calls Lua again,
through a slightly different code-path, and now you have C(2) -> Lua.
Say the embedded Lua program decides to resume the continuation.

Now keep in mind that the C stack is not position-independent. The C stack can
contain pointers to the C stack, so when you resume, you need your resumed
stack to live at _exactly_ the same address as last time it ran. But what if
C(1) and C(2) are not exactly the same size? What if we called one extra
function before calling Lua the second time? It is impossible to copy the
continuation's C stack back into its original position. So it's impossible to
resume the continuation.
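
The "not position-independent" point can be seen without touching the real stack. A minimal sketch, assuming a made-up `struct frame` that stands in for a stack frame holding a pointer to one of its own locals; `read_through_copied_pointer` is a hypothetical name, not anything from Lua:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a C stack frame that points into itself. */
struct frame {
    int value;
    int *self;   /* address of some frame's `value` */
};

/* Returns the value seen through the copied frame's pointer. */
static int read_through_copied_pointer(void) {
    struct frame original = { 1, NULL };
    original.self = &original.value;

    struct frame copy;
    memcpy(&copy, &original, sizeof copy);  /* byte-for-byte "snapshot" */
    copy.value = 2;

    /* The copied pointer was not relocated: it still targets `original`. */
    assert(copy.self == &original.value);
    return *copy.self;
}
```

Here `read_through_copied_pointer()` returns 1, not 2: the copy's pointer still refers to the original bytes, which is why a saved C stack only makes sense when it is put back at its original address.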

You could try to snapshot the _entire_ C stack to get around this, including
the outermost C frames. But this would be most unexpected for the C program
that is using the Lua interpreter. Lua is supposed to just be a regular C
library: you call a function, it does things, and then returns. It wouldn't be
acceptable that calling a Lua interpreter function like lua_call() backs your
entire C program to a previous state just because the embedded Lua program
used a fancy feature called continuations!

There are many other things that would make this tricky at best to get
working, but I think the problem above really tanks the idea completely.

~~~
sillysaurus3
_But what if C(1) and C(2) are not exactly the same size?_

Say you want to resume continuation K, which has a stack of some size N.

The current thread has a stack of size M. If M >= N, everything is fine: you
can safely overwrite the current stack with K's stack.

If M < N, recurse until M >= N.

_You could try to snapshot the entire C stack to get around this, including
the outermost C frames._

Indeed! This is a solution.

_It wouldn't be acceptable..._

I like doing unacceptable things in my programs. It's the best part of
programming, really.

There are a lot of solid arguments against call/cc. I think the most
persuasive argument in favor of call/cc is that you become more powerful.
Whatever metric you use to measure power, call/cc will improve it: Smaller
code, less time spent writing code, and you can even write algorithms that you
otherwise would not be able to.

Personally, I want call/cc in order to be able to use choose and fail. It's
the ability to write programs that are guaranteed to never call fail(). pg
explains it well:

"For example, this is a perfectly legitimate nondeterministic algorithm for
discovering whether you have a known ancestor called Igor:

    
    
      Function Ig(n)
        if name(n) = ‘Igor’
          return n
        if parents(n)
          return Ig(choose(parents(n)))
        fail
    

The fail operator is used to influence the value returned by choose. If we
ever encounter a fail, choose would have chosen incorrectly. By definition
choose guesses correctly."

Call/cc makes this possible. There are a lot of fun things to do. The last few
chapters of _On Lisp_ show some particularly interesting sketches.
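
For readers without call/cc at hand, the effect of choose and fail on the Igor example can be approximated with ordinary backtracking. This is a swapped-in technique, not what pg describes: a depth-first search tries each parent in turn, and NULL plays the role of fail. The `struct person` layout and the name `find_igor` are assumptions made for the sketch:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical pedigree node; a NULL parent means "unknown". */
struct person {
    const char *name;
    const struct person *parents[2];
};

/* Backtracking stand-in for Ig(n): returns the ancestor named Igor,
   or NULL (the "fail" case) when no branch leads to one. */
static const struct person *find_igor(const struct person *n) {
    if (n == NULL) {
        return NULL;                      /* ran out of ancestors: fail */
    }
    assert(n->name != NULL);
    if (strcmp(n->name, "Igor") == 0) {
        return n;
    }
    for (size_t i = 0; i < 2; i++) {      /* "choose" by trying each parent */
        const struct person *hit = find_igor(n->parents[i]);
        if (hit != NULL) {
            return hit;
        }
    }
    return NULL;
}
```

The difference from real choose/fail is that the search order is spelled out by hand here, while with continuations the caller just writes the straight-line version and the rewinding happens behind the scenes.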

~~~
haberman
> Say you want to resume continuation K, which has a stack of some size N.

The size of the continuation's stack doesn't matter for the problem I
described, it's the size of the stack "underneath" your continuation that
matters (ie. C(1) and C(2) above).

If C(2) > C(1) there is no way to shrink C(2) such that the continuation's
stack can be copied into the right place.

> I like doing unacceptable things in my programs. It's the best part of
> programming, really.

What you do in your programs is up to you! But nobody else is going to use a C
library that messes with the execution state of its callers (unless that is
the point of the library, which it isn't with Lua).

~~~
sillysaurus3
I don't understand, but I'd like to.

To create a continuation, we need to copy the entire stack, by definition. But
"the stack" is just an array of bytes. It's all the bytes between the current
stack pointer and the "root" stack frame. So to create a continuation, copy
these bytes and stash them somewhere, then set up a longjmp target to the
current instruction.

To apply a continuation, i.e. to restore the stack, we overwrite the current
stack starting from the root frame. Then we longjmp to where the continuation
was originally created.

It seems like this scheme should work in any situation, but perhaps I'm
missing something?

loeg pointed out getcontext(3) / setcontext(3), which seems promising. It
looks like a standard way to sidestep all of this bookkeeping. It appears to
be a high-level interface to the operations described above.
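
A quick way to see what getcontext(3) / setcontext(3) do and do not save. This is a minimal sketch, assuming glibc on Linux (the interface was removed from POSIX.1-2008 and some libcs do not ship it); `count_with_context` is a hypothetical name. It only jumps back into a frame that is still live, so it is not a full continuation:

```c
#include <assert.h>
#include <ucontext.h>

/* Re-enters the saved context until the counter reaches `limit`. */
static int count_with_context(int limit) {
    assert(limit > 0);

    ucontext_t ctx;
    volatile int n = 0;   /* volatile: keep it in memory, not in a saved register */

    if (getcontext(&ctx) != 0) {   /* execution resumes here after setcontext */
        return -1;
    }
    n++;
    if (n < limit) {
        setcontext(&ctx);          /* does not return on success */
    }
    return n;
}
```

`count_with_context(3)` returns 3. The counter keeps its value across the jumps because getcontext records registers and the stack pointer, not the bytes on the stack, so copying and restoring the stack contents would still be the caller's job.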

Lua is just a language, though. It's not "for" anything in particular.

~~~
haberman
The Lua implementation is a C library. You invoke it by calling C functions
like lua_call().

Imagine you have a C program like this:

    
    
        #include <stdio.h>
        #include <fancylib.h>  /* hypothetical library that embeds Lua */
    
        int main(void) {
          for (int i = 0; i < 10; i++) {
            printf("val[%d] = %d\n", i, fancylib_calculate(i));
          }
          fancylib_cleanup();
          return 0;
        }
    

Now imagine that internally, fancylib uses Lua. So fancylib_calculate() calls
lua_call().

Now imagine that the Lua function run by fancylib decides to use
continuations. When you call fancylib_calculate(0), it creates a continuation.
And when you call fancylib_calculate(1), it decides to call the continuation.

If you restore the entire C stack to resume the Lua continuation, it will
reset the loop in main() to i=0! Your program might end up printing val[0]
over and over, in an infinite loop. This would be extremely surprising to you
as the author of main(), because you were just trying to write a normal old
for() loop. The Lua continuation should just restore the Lua-related stack,
not the stack of the functions calling Lua!

~~~
sillysaurus3
_If you restore the entire C stack to resume the Lua continuation, it will
reset the loop in main() to i=0!_

That's the point of continuations, though. That's a feature, not a bug. When
you create a continuation, you're saying "whatever happens after this, allow
me to do it again at some later time." If the calling library happened to be
in a loop, then the goal is to serialize that loop so that it can be invoked
again, at a later point.

If you keep applying the continuation in a loop, then you'll get an infinite
loop. But if you invoke the continuation once, (and if subsequent calls to
fancylib_calculate() don't), then you'll get the ability to print

    
    
      val[0] = 42
      val[1] = 99
      ...
      val[9] = 7
    

on demand. By invoking the continuation, you cause the loop to happen again.

In fact, you gain the ability to prevent the program from terminating, in a
controlled fashion. Since you have access to the continuation, you can choose
to invoke it on the 10th call to fancylib_calculate(), up to 3 times in a row.
That would produce output like:

    
    
      val[0] = 42
      val[1] = 99
      ...
      val[9] = 7
      val[0] = 42
      val[1] = 99
      ...
      val[9] = 7
      val[0] = 42
      val[1] = 99
      ...
      val[9] = 7
    

then the program would exit.

Does that make sense? Apologies if we're talking past each other. I appreciate
the patience.

~~~
haberman
> That's the point of continuations, though. That's a feature, not a bug.

If that's what you're after, then by all means implement that. :) But I think
most people would expect the Lua interpreter state to be self-contained, and
not to affect the state of the surrounding C execution environment.

~~~
a_t48
The lua library can already affect the surrounding C state (either through FFI
with LuaJIT, or having a lua function call back into C code), so I don't see
that as a real argument.

Unless you are allowing arbitrary code execution, you are likely the one
passing in scripts to the interpreter - these sorts of interactions should be
well documented by the lua scripts themselves...and I don't recommend allowing
arbitrary lua anyhow - os.execute and friends say hi - you need to block access to
those carefully. Thus, we have two situations - either you know and trust the
code not to do unexpected things (or to do them in an expected manner :) ), or
you've set up your lua environment in such a way to disallow such calls, and
it doesn't matter.

------
nkurz
I've been using a lot of inline assembly lately, and while the Stockholm
syndrome might be in effect, I'm coming to like the GCC syntax. For me, the main
thing that has helped has been to adopt a consistent syntax. Here are some
examples of what I'm currently using for an AVX2 popcnt optimization, with
some explanation.

    
    
      #define ASM_VEC_BYTE_COUNT_SET(vec, sum, mask, shuf)                  \
        __asm volatile ("vpsrld $4, %[VEC], %[SUM]\n"                       \
                        "vpand %[MASK], %[VEC], %[VEC]\n"                   \
                        "vpand %[MASK], %[SUM], %[SUM]\n"                   \
                        "vpshufb %[VEC], %[SHUF], %[VEC]\n"                 \
                        "vpshufb %[SUM], %[SHUF], %[SUM]\n"                 \
                        "vpaddb %[VEC], %[SUM], %[SUM]\n" :                 \
                        /* rd/wr ymm */ [VEC] "+&x" (vec),                  \
                        /* write ymm */ [SUM] "=&x" (sum) :                 \
                        /* read ymm  */ [MASK] "x" (mask),                  \
                        /* read ymm  */ [SHUF] "x" (shuf))
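
For anyone trying to follow what the macro computes: it looks like the usual nibble-lookup popcount, done on every byte lane at once. A scalar sketch, assuming MASK holds 0x0F in each byte and SHUF holds the 16-entry bit-count table (the macro itself shows neither value); `NIBBLE_BITS`, `byte_count_set` and `buffer_count_set` are hypothetical names:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Set-bit count for every 4-bit value; the role assumed for SHUF. */
static const uint8_t NIBBLE_BITS[16] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
};

/* One byte lane: look up the low and the high nibble, then add them. */
static uint8_t byte_count_set(uint8_t b) {
    return (uint8_t)(NIBBLE_BITS[b & 0x0F] + NIBBLE_BITS[b >> 4]);
}

/* Sum of the per-byte counts over a buffer. */
static size_t buffer_count_set(const uint8_t *buf, size_t len) {
    assert(buf != NULL || len == 0);
    size_t total = 0;
    for (size_t i = 0; i < len; i++) {
        total += byte_count_set(buf[i]);
    }
    return total;
}
```

The vector version does the same two lookups with vpshufb and the add with vpaddb. Since vpsrld shifts whole 32-bit lanes, bits from the neighbouring byte leak into each byte, which is why the shifted copy is masked as well.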
    

1) Try to use the %[symbolic] syntax rather than the numeric %0, %1 form. It's
slightly longer to write, but usually clearer to read. Use upper case for the
symbolic name. Put your inputs one per line, with a preceding comment.

2) If you are using the same assembly more than once in your program, declare
your assembly within a #define macro, then use the macro in your code.

3) Use "__asm volatile". Declaring "volatile" is not required, but once you
are writing inline assembly you usually know more than the compiler about
where the block should go.

5) If you have multiple lines of assembly and output registers, you are almost
always safer to use "+&" and "=&" for your constraint rather than just "+" or
"=". Search for "early clobber" for details.

6) Strongly prefer single type constraints. The more flexibility you give the
compiler, the more likely it will defeat your efforts at optimization. Use
explicit memory addressing modes rather than "m". The modifier "c" is needed
for the offset.

    
    
      #define ASM_VEC_LOAD_OFFSET_MEM(off, mem, vec)                    \
        __asm volatile ("vmovdqu %c[OFF](%[MEM]), %[VEC]\n" :           \
                        /* destination */ [VEC] "=x" (vec) :            \
                        /* byte offset */ [OFF] "i" (off),              \
                        /* mem address */ [MEM] "r" (mem))
    

7) The register constraints for vectors are tricky, because the "x" constraint
is used for both XMM and YMM vectors. There is no way to specify that one
wants only one or the other. This sort of makes sense, since in hardware they
share the same register. You can use the "q" modifier when you need to specify
XMM syntax in the output when you need both forms of the same vector.
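
Points 1 and 5 are easier to see on a small scalar case. A minimal sketch, assuming GCC or Clang on x86-64 (other targets take the plain C branch); `add_then_double` and its demo function are hypothetical names. The output is written by the first instruction while input B is still unread, which is the situation "=&" exists for:

```c
#include <assert.h>
#include <stdint.h>

/* Computes (a + b) * 2 with wraparound. */
static inline uint64_t add_then_double(uint64_t a, uint64_t b) {
#if defined(__GNUC__) && defined(__x86_64__)
    uint64_t out;
    __asm__ ("mov %[A], %[OUT]\n\t"   /* OUT is written here...           */
             "add %[B], %[OUT]\n\t"   /* ...while B is still to be read   */
             "add %[OUT], %[OUT]"
             : /* write gpr */ [OUT] "=&r" (out)
             : /* read gpr  */ [A] "r" (a),
               /* read gpr  */ [B] "r" (b)
             : "cc");
    return out;
#else
    return (a + b) * 2;               /* plain C fallback for other targets */
#endif
}

/* Usage: both inputs are intact at the point where they are read. */
static inline void add_then_double_demo(void) {
    assert(add_then_double(3, 4) == 14);
}
```

With a plain "=r" the compiler would be allowed to put OUT and B in the same register, and the first mov would then overwrite B before the add reads it.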

~~~
brigade
3 - using volatile for asm that doesn't have otherwise inexpressible side
effects gets looked at askance the same way using it for thread safety does.
If you think you need it, maybe you needed to add a "memory" clobber instead.

5 - I can't think of any meaning early clobber has on an input+output
constraint ("+")?

6 - there are many cases where you really do want to give the compiler
flexibility in addressing modes. Unfortunately clang tends to ignore that and
generate (reg) regardless.

7 - not really different than GPRs; you use "r" as the constraint then a
modifier like "k" for the size.

I guess the lesson is that yeah gcc inline asm is powerful, but they try to
leave it undocumented for a reason. Also, who stole number 4?
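
To illustrate point 7 for GPRs: the constraint stays "r" and the "k" modifier selects the 32-bit name of whatever register was picked. A minimal sketch, assuming GCC or Clang on x86-64 (other targets take the plain C branch); `low_half` and its demo function are hypothetical names:

```c
#include <assert.h>
#include <stdint.h>

/* Returns the low 32 bits of a 64-bit value. */
static inline uint32_t low_half(uint64_t wide) {
#if defined(__GNUC__) && defined(__x86_64__)
    uint32_t out;
    __asm__ ("movl %k[WIDE], %[OUT]"   /* %k prints e.g. %eax instead of %rax */
             : /* write gpr */ [OUT] "=r" (out)
             : /* read gpr  */ [WIDE] "r" (wide));
    return out;
#else
    return (uint32_t)wide;             /* plain C fallback for other targets */
#endif
}

/* Usage: the upper half of the input is dropped. */
static inline void low_half_demo(void) {
    assert(low_half(0x1122334455667788ULL) == 0x55667788u);
}
```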

~~~
nkurz
re 3: If it were for correctness, I'd agree. But I don't need volatile to make
it work, I need it to produce the assembly I want. If one instruction can
execute only on Port 1 (popcnt) and the other can execute on Ports 0, 1, 5, or
6, there's sometimes a 50% performance difference based on the order two
seemingly independent instructions are executed. Volatile also prevents the
compiler from hoisting loads ahead of my inline assembly, which sometimes
makes a difference. Clobbering "memory" might force other reloads that I don't
want to happen.

re 5: Barring compiler bugs, I think you'd be right if correctness was the
only issue. But I'm pretty sure I've sometimes solved problems by adding it,
although this may have been when working around the POPCNT bug that added a
false dependency on the output. It also might have been when reading and
writing a variable multiple times?

re 6: In theory, yes. But usually in these cases you should be writing
intrinsics or straight C instead of inline assembly. The place where this
comes up most for me is when I have two variables that use the same index, and
I want to ensure "DEC/JNZ" fusion at the end of the loop. If I let the
compiler choose, it will find a way to defeat me by incrementing both array
addresses. The other case is when you explicitly want a store to use Port 7
for address generation, which only happens without an index register.

re 7: Yes, I just personally find it more confusing because "x" fits so well
with "XMM", and thus it feels odd to use it when you want only a "YMM". Also,
see here for problems with Clang and %q[VEC]:
[http://stackoverflow.com/questions/34459803/in-gnu-c-inline-...](http://stackoverflow.com/questions/34459803/in-gnu-c-inline-asm-whatre-the-modifiers-for-xmm-ymm-zmm-for-a-single-operand)

re 4: Oops, I forgot to renumber. I had another comment suggesting that one
always use the "V" VEX prefix on vector commands and the explicit output
register, but deleted it because it seemed off topic.

------
ndesaulniers
Is it ever the case that inline assembly is required over separate object
sources just in assembly? I would have thought it would be preferred to not
use inline assembly, and simply link in object files of what you need. It
would seem simpler syntax-wise, too. Why prefer inline assembly?

~~~
cyphar
There are several macros in the kernel which contain inline assembly, and you
can't use code written in assembly because it would require using the stack to
call the function (the case I'm thinking of is the switch_to macro which
switches between tasks in the kernel).

------
TwoBit
GCC inline assembly is one of the most terrible things I've ever had to work
with. Somebody seriously needs to redesign it or replace it. Aside from that, a
decent reference manual for the existing version would be welcome.

~~~
AndyKelley
I'm working on a new language that has inline assembly and I pretty much just
copied GCC's syntax[1]. Do you have any specific suggestions on how to do
better?

[1]:
[https://github.com/andrewrk/zig/blob/master/std/linux_x86_64...](https://github.com/andrewrk/zig/blob/master/std/linux_x86_64.zig#L60)

------
Ameo
Having lived my development life so far removed from the actual physical
CPU/memory, thinking about implementing this kind of low-level stuff into
actual code is mind-boggling to me.

