
Then your understanding of machine language and compilers would be flawed.

In most cases the compiler will do a better job at optimizing machine code than humans will.



I'm interested to hear how often you've actually tried to beat a C compiler with hand-optimized assembly code. In my experience, this statement is only made by people who have never tried.

For example, numerical calculations that are highly SIMD-friendly can be improved substantially by using SSE instructions. Autovectorization is OK, but not great. Furthermore, a programmer who is familiar with SSE instructions can alter/swizzle data to make it easier to work with SIMD instructions. Yet further, a programmer can take advantage of things like non-temporal stores, which compilers will not use on their own.
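
To make that concrete, here's a rough sketch of my own (not the parent's code) of the kind of loop a compiler won't emit by itself; it assumes 16-byte-aligned buffers and a length that's a multiple of 4:

    /* Explicit SSE add with a non-temporal store. _mm_stream_ps bypasses
       the cache, which only pays off if you know the destination won't be
       read again soon -- something the compiler can't know on its own. */
    #include <stddef.h>
    #include <xmmintrin.h>
    void add_stream(float *dst, const float *a, const float *b, size_t n)
    {
        for (size_t i = 0; i < n; i += 4) {
            __m128 va = _mm_load_ps(a + i);       /* aligned loads */
            __m128 vb = _mm_load_ps(b + i);
            _mm_stream_ps(dst + i, _mm_add_ps(va, vb));
        }
        _mm_sfence();  /* make the streaming stores globally visible */
    }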

Now, granted, you can massage gcc into giving you "good" code with a lot of hints, but to do it "right" you are still peering at the assembly and making sure gcc isn't doing anything "stupid". Highly optimized C code is so dense with compiler directives as to be unrecognizable to someone unfamiliar with the underlying architecture.
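
For what it's worth, the hints in question look something like this (a sketch only; the exact directives you need vary by compiler and version):

    /* Hints gcc typically wants before it will autovectorize: restrict
       promises the buffers don't alias, and __builtin_assume_aligned
       promises 16-byte alignment so it can use aligned SSE loads/stores. */
    void scale(float *restrict dst, const float *restrict src, float k, int n)
    {
        float *d = __builtin_assume_aligned(dst, 16);
        const float *s = __builtin_assume_aligned(src, 16);
        for (int i = 0; i < n; i++)
            d[i] = k * s[i];
    }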

The belief that your naive for-loop computation is somehow transformed automagically into perfection by the compiler is a pure and unadulterated myth.


Using SSE is specifically one of the cases where writing something in assembly can make sense (and in fact, the only case where I've written things in asm for purely performance reasons).

Let's go back to the original statement:

I'd assume any intense math operations that happen inside a loop would be much faster.

That's what I was responding to. Just translating the logic to asm won't make it fast. If you look at my earlier follow-up, I mentioned that if you can make assumptions about your code that the compiler can't know, then you're back in the land where asm optimizations can make sense. That seems to be what you're getting at with the rest of your points.


I tried it, back when I was doing a lot of assembly language programming in college. I lost to the compiler miserably every time.

SIMD is kind of a special case. The vast majority of code I've written in my life has not been SIMD'able. Heck, the vast majority of the code I've written hasn't even been for-loops.

I'm interested to see what kind of (non-SIMD) code can be written better in hand-coded assembly than by an optimizing compiler, in real programs.


Indeed, this is quite true; I have functions which are as much as 18 times faster in SIMD assembly than in C. This is one reason every programmer should spend a few days learning assembly: it teaches you that the compiler is stupid.


If said intensive math is currently written in Python, then rewriting it in CorePy will likely give a big speed boost. CPython does not compile down to machine code, so it can't do the optimizations that a full compiler can do.

Now, if a project like this made it easier to embed C or C++, you might get the best of both worlds...



Those are all nice frameworks for building extensions, but requiring an external build step is a bit cumbersome.

Ultimately though, I agree that a JIT is the way to go.


Embedding C or C++ into Python is no big deal with scipy.weave and gcc. Just try it:

http://albert.rierol.net/doodle_programming.html#1


Thanks -- scipy.weave looks to do the trick!

BTW, while that linked code is a good example of how weave works, it's a bad example of when to use it. I've no doubt that the code runs faster with that portion written in C, but a different algorithm could blow it away. A prime generator like that would be MUCH faster with a sieve, even in pure Python.
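
For anyone curious, the sieve idea looks roughly like this (sketched in C to match the rest of the thread; the same structure works fine in pure Python):

    /* Sieve of Eratosthenes: mark composites instead of trial-dividing
       every candidate. Returns a malloc'd array with is_prime[i] != 0
       iff i is prime, for 0 <= i <= limit. */
    #include <stdlib.h>
    #include <string.h>
    unsigned char *sieve(size_t limit)
    {
        unsigned char *is_prime = malloc(limit + 1);
        if (!is_prime) return NULL;
        memset(is_prime, 1, limit + 1);
        is_prime[0] = 0;
        if (limit >= 1) is_prime[1] = 0;
        for (size_t p = 2; p * p <= limit; p++)
            if (is_prime[p])
                for (size_t m = p * p; m <= limit; m += p)
                    is_prime[m] = 0;
        return is_prime;
    }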


That code was a proof of principle, just playing around. I make no claim to have written a particularly efficient prime number generator.

I'm glad you found it instructive.


AFAIK, that is true for whole programs. What about a tiny routine that is executed millions of times deep inside a loop? You can get better performance by hand-coding it, especially since you have the machine-generated version to compare against.

Also, do you compile your Python programs?


There are a lot of tricks to modern architectures, and modern compilers know many of them. There's more to making things fast than using the right instructions -- modern optimizers also try to keep the CPU pipelines full, unroll loops where appropriate, detect operations that don't do anything, etc.
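
A toy illustration of what I mean (behavior varies by compiler and flags, but gcc or clang at -O2 typically handle both cases):

    long sum(const int *a, int n)
    {
        long total = 0;
        for (int i = 0; i < n; i++)
            total += a[i];   /* candidate for unrolling/vectorization */
        int unused = 0;
        for (int i = 0; i < n; i++)
            unused += i;     /* has no observable effect: typically deleted outright */
        return total;
    }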

Note that I'm not so much comparing against Python as comparing to calling a C function from Python (which is easy).

The cases where you tend to want to do things in assembly are usually where there are assumptions that you can make about the execution that the compiler can't know.
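
For example (an illustrative sketch, not anyone's production code): if the caller guarantees the length is a multiple of 8 and the buffer is 16-byte aligned, you can skip the scalar tail and the alignment checks the compiler would otherwise have to emit -- whether you write it with intrinsics or in straight asm:

    #include <xmmintrin.h>
    /* Precondition known to the programmer, not the compiler:
       n % 8 == 0 and data is 16-byte aligned. */
    void scale_by_two(float *data, int n)
    {
        const __m128 two = _mm_set1_ps(2.0f);
        for (int i = 0; i < n; i += 8) {
            _mm_store_ps(data + i,     _mm_mul_ps(_mm_load_ps(data + i),     two));
            _mm_store_ps(data + i + 4, _mm_mul_ps(_mm_load_ps(data + i + 4), two));
        }
    }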



