CorePy: Assembly Programming in Python (corepy.org)
68 points by mace on Nov 26, 2008 | 27 comments

We've got something similar in Ruby (we're not the first, and ours is nowhere nearly as mature as CorePy). We use it for building ad-hoc debuggers for reversing targets that don't support native debugging and tracing interfaces.

Ruby is an excellent language for this, because it gets out of the way. Here's a Rasm snippet:

            @epilog ||= @prog.add {
                push ebx
                mov ebx, retv
                mov [ebx], eax
                pop ebx
                xor eax, eax
            }
Obviously, that's all pure Ruby.

Once you have a class mapping for each of the instructions in your target ISA --- way easier than it sounds --- getting your code to jump into a runtime generated buffer is pretty easy.
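The "jump into a runtime generated buffer" step is the same in any language with FFI access. A minimal Python sketch (not Rasm's internals), assuming x86-64 and a Unix mmap that permits writable+executable pages:

```python
import ctypes
import mmap

# x86-64 machine code for: mov eax, 42; ret
code = b"\xb8\x2a\x00\x00\x00\xc3"

# Allocate a page we can both write to and execute from (assumes the
# OS allows PROT_WRITE | PROT_EXEC mappings).
buf = mmap.mmap(-1, mmap.PAGESIZE,
                prot=mmap.PROT_READ | mmap.PROT_WRITE | mmap.PROT_EXEC)
buf.write(code)

# Build a ctypes function pointer aimed at the buffer, then call it.
addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))
func = ctypes.CFUNCTYPE(ctypes.c_uint32)(addr)
print(func())  # 42
```

The assembler's job is then just turning mnemonics into those bytes; the class-per-instruction mapping mentioned above is what makes that tractable.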

We're starting to throw code onto Github; I've been thinking about publishing this, but didn't think it would be very interesting, except as a hack.

SBCL has an extremely sophisticated facility called a VOP (Virtual Operation) that allows you to write assembly with s-expressions and integrate the code into the compiler and runtime. The SBCL backend itself is composed of these VOPs.

Significantly, this means that you can write assembly macros with the full power and elegance of lisp. A trivial example is the move macro, which is used extensively in VOPs:

  (defmacro move (dst src)
    "Move SRC into DST unless they are location=."
    (once-only ((n-dst dst)
                (n-src src))
      `(unless (location= ,n-dst ,n-src)
         (inst mov ,n-dst ,n-src))))
Check out the SBCL source if you want to see some wild uses of VOPs and macros that expand into assembly code. In compiler/x86/arith.lisp there is a lisp/assembly implementation of the Mersenne Twister.

VOPs can optionally handle argument type checking and can be tuned to emit specialized code in various circumstances, such as when one of the arguments to the call is a compile-time constant.

What I want to see is Python or Ruby taking advantage of a runtime code generation library to start rewriting the VM or JIT'ing method calls.


Yep. I forgot about Psyco. Does it still work?

I am not sure which version of Python will break Psyco. It looks like it stopped being maintained in '06.

That's really nice. I'd love to see the code, even if there are other "something similar"s in Ruby. Did you actually implement the entire x86 instruction set, or just the parts of it you have used so far?

> "CorePy makes assembly fun again!" (Alex Breuer)

This is great because it allows you to create algorithms at runtime for your data.

So rather than doing...

  def doit(option1, data):
      for x in data:
          if option1:
              x += 7

You can create a function that doesn't include the "if option1" part. This is great if you have many different options: instead of bloating your code with many different optimized functions, or settling for one slower generic function, you can create the optimized functions at run time.
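The same specialization trick works in plain Python without any assembly. A hedged sketch (the names doit and option1 come from the snippet above) that builds the function source at run time, so the branch is absent from the generated code:

```python
def make_doit(option1):
    # Bake the option into the generated source: when the option is
    # off, the emitted loop contains no test at all.
    body = "        x += 7\n" if option1 else ""
    src = ("def doit(data):\n"
           "    out = []\n"
           "    for x in data:\n"
           + body +
           "        out.append(x)\n"
           "    return out\n")
    namespace = {}
    exec(src, namespace)
    return namespace["doit"]

add7 = make_doit(True)
plain = make_doit(False)
print(add7([1, 2, 3]))   # [8, 9, 10]
print(plain([1, 2, 3]))  # [1, 2, 3]
```

CorePy pushes the same idea one level down, emitting specialized machine code instead of specialized Python.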

Looks very nice. I just wish it had support for 3dnow, and windows :)

I would have liked to see LLVM bindings for Python... I am sure they are out there...

Has anybody benchmarked the performance of CorePy?

No, but I'd assume any intense math operations that happen inside a loop would be much faster.

Then your understanding of machine language and compilers would be flawed.

In most cases the compiler will do a better job at optimizing machine code than humans will.

I am interested to hear about how often you've actually tried to beat a c compiler with hand optimized assembly code. In my experience, this statement is only said by people who have never tried.

For example, numerical calculations that are highly SIMD can be improved substantially by using SSE instructions. Autovectorization is ok, but not great. Furthermore, a programmer who is familiar with SSE instructions can alter/swizzle data to make things easier to use with SIMD instructions. Yet further, a programmer can take advantage of things like non-temporal storing which compilers will not do on their own.
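To illustrate the "swizzle" point without intrinsics: SIMD loads want each field contiguous in memory, so a struct-of-arrays layout beats the natural array-of-structs one. A pure-Python sketch of the transformation (the point data here is made up):

```python
from array import array

# Array-of-structs: x, y, z interleaved -- a single SIMD load here
# would grab a mix of fields.
aos = [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0), (7.0, 8.0, 9.0)]

# Struct-of-arrays ("swizzled"): each field is contiguous, so four
# consecutive x-values fit in one 128-bit SSE load.
xs = array("f", (p[0] for p in aos))
ys = array("f", (p[1] for p in aos))
zs = array("f", (p[2] for p in aos))

print(list(xs))  # [1.0, 4.0, 7.0]
```

A compiler auto-vectorizing the original layout has to shuffle fields on every iteration; a programmer can simply change the layout up front.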

Now, granted, you can massage gcc into giving you "good" code with a lot of hints, but to do it "right" you are still peering at the assembly and making sure gcc isn't doing anything "stupid". Highly optimized C code is so dense with compiler directives as to be unrecognizable to someone unfamiliar with the underlying architecture.

The belief that your naive for-loop computation is somehow transformed automagically into perfection by the compiler is a pure and unadulterated myth.

Using SSE is specifically one of the cases where writing something in assembly can make sense (and in fact, the only case where I've written things in asm for purely performance reasons).

Let's go back to the original statement:

I'd assume any intense math operations that happen inside a loop would be much faster.

That's what I was responding to. Just translating the logic to asm won't make it fast. If you look at my earlier followup, I mentioned that if you can make assumptions about your code that the compiler can't know, then you're back in the land where asm optimizations can make sense. That seems to be what you're getting at with the rest of your points.

I tried it, back when I was doing a lot of assembly language programming in college. I lost to the compiler miserably every time.

SIMD is kind of a special case. The vast majority of code I've written in my life has not been SIMD'able. Heck, the vast majority of the code I've written hasn't even been for-loops.

I'm interested to see what kind of (non-SIMD) code can be done better in hand-coded assembly than from an optimizing compiler, for real programs.

Indeed, this is quite true; I have functions which are as much as 18 times faster in SIMD assembly than in C. This is one reason every programmer should spend a few days learning assembly: it teaches you that the compiler is stupid.

If said intensive math is currently written in Python, then rewriting it in CorePy will likely give a big speed boost. CPython does not compile down to machine code, so it can't do the optimizations that a full compiler can do.

Now, if a project like this made it easier to embed C or C++, you might get the best of both worlds...
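In fairness, calling already-compiled C from Python needs no build step at all via ctypes. A minimal sketch using the system math library (the library name is assumed to be resolvable on the host):

```python
import ctypes
import ctypes.util

# Locate and load the C math library (libm.so.6 on typical Linux).
libm = ctypes.CDLL(ctypes.util.find_library("m"))

# Declare the C signature so ctypes marshals doubles correctly;
# without this, arguments would be passed as ints.
libm.sqrt.restype = ctypes.c_double
libm.sqrt.argtypes = [ctypes.c_double]

print(libm.sqrt(9.0))  # 3.0
```

The build step only reappears when you want to embed *new* C, which is where tools like weave come in.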

Those are all nice frameworks for building extensions, but requiring an external build step is a bit cumbersome.

Ultimately though, I agree that a JIT is the way to go.

Embedding C or C++ into python is no big deal with scipy.weave and gcc. Just try it:


Thanks -- scipy.weave looks to do the trick!

BTW, while that linked code is a good example of how weave works, it's a bad example of when to use it. I've no doubt that the code runs faster with that portion written in C, but a different algorithm could blow it away. A prime generator like that would be MUCH faster with a sieve, even in pure Python.
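For the record, a pure-Python Sieve of Eratosthenes is only a few lines (a sketch, not the linked code):

```python
def sieve(limit):
    # Classic Sieve of Eratosthenes: cross off multiples of each prime.
    is_prime = [True] * (limit + 1)
    is_prime[0] = is_prime[1] = False
    for p in range(2, int(limit ** 0.5) + 1):
        if is_prime[p]:
            for multiple in range(p * p, limit + 1, p):
                is_prime[multiple] = False
    return [n for n, flag in enumerate(is_prime) if flag]

print(sieve(30))  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```

O(n log log n) overall, versus trial division's roughly O(n^1.5) -- the algorithmic win dwarfs the constant-factor win from dropping into C.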

That code was proof of principle, just playing. No claim to have written any particularly efficient prime number generator.

I'm glad you found it instructive.

AFAIK, that is true for whole programs. What about a tiny routine that is executed millions of times deep inside a loop? You can get better performance by hand-coding it, especially since you can keep the machine-generated version for comparison.

Also, do you compile your python programs?

There are a lot of tricks to modern architectures and things that modern compilers can figure out. There's more to making things fast than using the right instructions -- modern optimizers also try to keep the CPU pipelines full, unroll loops where appropriate, try to detect operations that don't do anything, etc.

Note that I'm not so much comparing against Python as comparing to calling a C function from Python (which is easy).

The cases where you tend to want to do things in assembly are usually where there are assumptions that you can make about the execution that the compiler can't know.

Wow. So cool.

Thanks for this news.
