
Outperforming everything with anything: Python? Sure, why not? - pjmlp
https://wordsandbuttons.online/outperforming_everything_with_anything.html
======
jcranmer
Arrggh. An example of bad benchmarking, I'd say.

One of the annoying things about the x86-64 ABI is that floating-point numbers
use vector registers, even if they're only scalars and not vectors. The only
way you can tell the difference is the p versus the s in vmulsd/vmulpd.

So the optimized assembly here isn't using any vectorization. Which isn't
surprising, since as far as I could tell, the author isn't actually
_optimizing_ the code using LLVM. Most of the optimizations happen in opt, not
in llc, which only does codegen optimizations. Those sorts of optimizations
are largely things like stack frame optimization, instruction scheduling, or
some more powerful pattern matching in instruction selection (which, even in
-O0 in LLVM, does a limited amount of common subexpression elimination and the
like). You might have to add restrict to the pointers (in LLVM terms, noalias
on the arguments) to get vectorization to kick in, but the number of values is
small enough for some sort of loop-versioning to probably kick in anyways.
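
For reference, getting the mid-level optimizer involved is one extra step
before llc. A minimal sketch of the pipeline (assuming LLVM's opt and llc
binaries are on PATH; the file names are made up):

    
    
      import subprocess
    
      # opt runs the mid-level pass pipeline; vectorization lives here
      subprocess.run(["opt", "-O3", "sum.ll", "-o", "sum.bc"], check=True)
      # llc then only performs codegen-level optimization
      subprocess.run(["llc", "-O3", "-filetype=obj", "sum.bc", "-o", "sum.o"],
                     check=True)
    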

The benchmarking is also pretty unfair. The C and C++ code have the values get
copied in from a volatile array before progressing each loop iteration, which
the Python-via-LLVM doesn't have. That doesn't sound like much, but it's 30
million extra guaranteed memory accesses over the time of the program, which
will come out to a few milliseconds. (And, gee, the Python code is only
faster by a few milliseconds.) A better approach is to measure the time by
moving the function to an external file and calling it in a tight loop,
wrapped by a high-precision clock.
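
Something like this, say, with the kernel compiled into a shared object (a
sketch; the library and function names here are hypothetical):

    
    
      import ctypes, time
    
      lib = ctypes.CDLL("./solve.so")     # hypothetical compiled kernel
      lib.solve_5x5.restype = None
    
      start = time.perf_counter()         # high-resolution clock
      for _ in range(1000000):
          lib.solve_5x5()
      print(time.perf_counter() - start)
    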

Also, benchmarking a 5×5? In practice, it's the memory access patterns that
kill code, so you'll want to use programs large enough to actually cause any
expected cache movements at the L2/L3 level. 5×5 is small enough that you
could actually stick it all in vector registers if there's a bit of
vectorization.

~~~
mehrdadn
Speaking of bad benchmarking, also see
[https://www.youtube.com/watch?v=vm1GJMp0QN4&t=17m49s](https://www.youtube.com/watch?v=vm1GJMp0QN4&t=17m49s)

~~~
gnulinux
That's madness. This is done by a company whose entire business is
benchmarking? That's like selling a bike with square tires.

------
jVinc
"You don't have to learn some special language to create high-performant
applications or libraries."

I like this comment, which seems to suggest that I don't need to learn LLVM IR
in order to write a python class that emits LLVM IR when run through python
code. I'd suggest that perhaps the reason he's able to do this is exactly
because he knows "some special language" quite well.

~~~
yvdriess
When statements in the style of "you don't have to learn a new language" are
uttered, they usually completely ignore the fact that solving the problem at
hand is many times harder than learning some new syntax. Learning C when
coming from Python is not hard; it's not the language itself that makes C
hard.

To generalize, I have the impression that general CS/programmer culture
values initial ease of use much more than long-term usefulness. And that's
not directed at the younger generation; you will find plenty of old farts who
only want to use or learn git once it's been thoroughly lathered in the thick
sugary syrup of a pretty GUI.

~~~
d33
Git CLI is universally agreed to be absolutely terrible, with its manuals
actually making it worse. Just flick through one or two blog posts you can
find here, many of which already made HN:

[https://www.google.com/search?q=why%20git%20cli%20sucks](https://www.google.com/search?q=why%20git%20cli%20sucks)

~~~
mlthoughts2018
I like git cli and find it to be amazingly well-designed and intuitive.

~~~
d33
If you could expand it into a longer answer, I would appreciate that - perhaps
there are some pieces of good design (git add -p?) in Git, but overall it's
too messy for many.

~~~
mlthoughts2018
git add, push, pull, fetch, bisect and rebase, along with the git grep and
log commands, are very easy to use, and anything else is exceedingly rarely
needed. Just use a rebase-only workflow and the cli is essentially
self-explanatory in terms of how your mutations affect the revisions.

Other commands like stash, checkout and branch are equally simple. It’s just a
very simple user experience.

I think where people express confusion over git is that they don’t expect to
interact with revision history, rather just to inflict changes onto a
head-only, svn-style repo.

People are mistakenly encouraged to bring that mentality to the table when
it’s not useful. You need to bring a mentality that you’re mutating and
sculpting branches.

It’s like a beginner saying “snowboards are complicated” because they come in
expecting it to be like a sled. You hear the same complaints about other
systems like darcs and mercurial too, for the same reasons, even though each
of them, like git, offers a super easy to use cli experience.

~~~
anon4242
> Other commands like stash, checkout and branch are equally simple

Ok, so they are simple because you say so? Stash in all but the simplest
cases is a sure-fire way to shoot yourself in the foot. Checkout is a weird
one; it does too many different things, and what's up with the double dashes?
Super-weird! Also beware if any of your branch names happen to match any of
the file names in the repo :O.

> I think where people express confusion over git is that they don’t expect
> to interact with revision history

This is one of the things that confuses beginners about git, but it's not a
confusion about git cli. You are conflating the criticism of git cli with
criticism of git. Git is good, git cli is (notoriously) bad. However,
eventually you'll (hopefully) learn the proper incantations and then using git
cli will seem simple to you.

~~~
mlthoughts2018
I never claimed any of this was anything more than my own technical appraisal
/ opinion. But to be clear, you’re not offering any counter argument, just
unsupported assertions of your different opinion, including things like
“what's up with the double dashes?“ which makes it hard for me to believe I
could personally find value in your view on it.

> “You are conflating the criticism of git cli with criticism of git.”

Not at all. I’m saying most of the time when people say they are confused by
git cli, that it behaves counter to their expectations, they are not actually
expressing any design limitation of the cli commands themselves, but instead
blaming cli commands for the conceptual confusion they have about git’s model
in general.

------
a-dub
There are a few tools out there that fuse Python and LLVM. What I find
interesting about this is the way he did codegen by passing in mock objects
that just emit LLVM IR in the simplest possible way, and that it actually
works while being incredibly succinct.
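
Roughly, the trick looks like this (a toy sketch with made-up names, not the
author's exact code): overload the arithmetic operators so that every
operation appends an IR instruction and returns a fresh virtual register.

    
    
      class Mock:
          code = []       # IR instructions recorded so far
          counter = 0
    
          def __init__(self, name):
              self.name = name
    
          def _emit(self, op, other):
              Mock.counter += 1
              result = Mock("%%t%d" % Mock.counter)
              Mock.code.append("%s = %s double %s, %s"
                               % (result.name, op, self.name, other.name))
              return result
    
          def __add__(self, other): return self._emit("fadd", other)
          def __mul__(self, other): return self._emit("fmul", other)
    
      def f(x, y):
          return x * y + x            # plain Python arithmetic
    
      f(Mock("%x"), Mock("%y"))
      print("\n".join(Mock.code))     # %t1 = fmul double %x, %y
                                      # %t2 = fadd double %t1, %x
    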

That said, like all things that try to accelerate Python, it only works for a
subset of the language (and in this case a very narrow one), and TBH, while
this whole "let's accelerate Python" movement is cool, it's really quite
awkward to have hidden subsets of a language that are accelerable. I'm not
sure, but I think it might be better if there were a second, accelerable
language that was inlined with clear rules, rather than this stuff that seems
like it will always be prone to inconsistency...

...or Julia. I suspect there are tradeoffs there too though... Building a
language/runtime for high level experimentation with built in support for
performance that rivals custom low level code is hard, the goals themselves
are at odds with each other...

~~~
jerf
"What I find interesting about this is the way he did codegen by passing in
mock objects that just emit llvm it in the simplest possible way and that it
actually works while being incredibly succinct."

It's interesting, but it doesn't work in general, which is why you don't see
the technique in common use. The problem is that it only catches methods
called on objects, but it can't catch anything that isn't a method called on
an object. Unfortunately, that includes for loops, if statements, some aspects
of exceptions, and just generally "all flow control".
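
To make that concrete: Python's `if` demands an actual True or False at trace
time via __bool__, so a mock object has no way to record both branches (toy
sketch):

    
    
      class Mock:
          def __gt__(self, other):
              return Mock()   # the comparison itself could be recorded...
    
          def __bool__(self):
              # ...but the `if` forces a concrete answer right here
              raise TypeError("can't trace data-dependent control flow")
    
      def clamp(x):
          if x > Mock():      # raises TypeError during tracing
              return x
          return Mock()
    
      clamp(Mock())
    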

So rather than being a powerful and general technique, it's a powerful, but
very very specific technique that only works in certain toy problems.
Unfortunately, once you leave those toy problems behind, it doesn't even help
you solve the problem because the paradigm this forces you to operate under
makes easy things easy, but medium things hard and hard things pull-your-hair-
out impossible. You're better off writing a real compiler... _or_ doing this
in a language that _does_ expose enough of what's going on to make this
technique work, although you tend to end up in either a Lisp or Haskell, which
come at this from very different angles but both have the capacity to make
this work. (Lisp via syntax tree manipulation, Haskell via Free Monads and
other similar techniques.)

~~~
cwillu
The AST is accessible in Python; I wish more projects took advantage of it.
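
For example, with nothing but the standard library (a trivial sketch):

    
    
      import ast, inspect
    
      def f(a, b):
          return a * b + a
    
      tree = ast.parse(inspect.getsource(f))
      print(ast.dump(tree))   # the full syntax tree, ready for rewriting
    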

------
twtw
Has anyone reproduced these results?

FWIW, I can't.

    
    
      $ cd wordsandbuttons/exp/python_to_llvm/exp_c
      $ make
      $ /usr/bin/time -f "%U\n" ./exp_volatile_a_b
      0.13
    
      $ cd wordsandbuttons/exp/python_to_llvm/exp_embed_on_call
      $ make
      $ /usr/bin/time -f "%U\n" ./benchmark
      0.16
    

Intel(R) Core(TM) i7-7800X CPU @ 3.50GHz, gcc (Ubuntu 7.3.0-27ubuntu1~18.04),
clang version 6.0.0-1ubuntu2, LLVM version 6.0.0

~~~
twtw
Some more interesting points:

- Switching to -O3 for gcc (instead of the author's default -O2), the C
version (exp_volatile_a_b) gets ~30% faster, bringing it down to ~50% of the
python-generated llvm - it appears to still be doing the full computation at
runtime.

- Switching to -O3 for llc doesn't make any difference for the
python-generated llvm version.

- With gcc 4.8 -O2 (instead of 7.3 -O2), I get ~0.06s - it looks like gcc 4.8
decides to inline everything, and gcc 7.3 doesn't.

~~~
BeeOnRope
It makes sense: gcc's -O3 is closer to LLVM/clang's -O2 in terms of the
big-picture optimizations they enable, such as vectorization and loop
unrolling.

------
bazizbaziz
The 'partial interpretation' trick used here has also been used successfully
in the database community to accelerate whole queries, etc. Tiark Rompf's
group in particular has been pushing this idea to its limit.

Functional Pearl: A SQL to C Compiler in 500 Lines of Code -
[https://www.cs.purdue.edu/homes/rompf/papers/rompf-icfp15.pdf](https://www.cs.purdue.edu/homes/rompf/papers/rompf-icfp15.pdf)

How to Architect a Query Compiler, Revisited -
[https://www.cs.purdue.edu/homes/rompf/papers/tahboub-sigmod18.pdf](https://www.cs.purdue.edu/homes/rompf/papers/tahboub-sigmod18.pdf)

Flare: Optimizing Apache Spark with Native Compilation for Scale-Up
Architectures and Medium-Size Data -
[https://www.usenix.org/conference/osdi18/presentation/essertel](https://www.usenix.org/conference/osdi18/presentation/essertel)

------
fizixer
I haven't seen this website before and, after taking a cursory look at this
and the previous posts mentioned in the article, I'm highly skeptical of the
quality of this work:

- In the C article he shows that LAPACK can be massively improved upon. That
sounds very impressive, until you realize he hasn't mentioned which
LAPACK/BLAS implementation he used. The default download from netlib is the
vanilla version and is known to be very slow in general. Any benchmark dissing
LAPACK is incomplete without a comparison with OpenBLAS and Intel MKL. And
instead we see the author pulling all sorts of overengineered tricks to top
the performance of a vanilla LAPACK build (I mean, there isn't even a mention
of whether or not he used optimization flags while compiling LAPACK).

- In this Python article, again: Python comes with the highly developed
scipy/numpy stack, which can be coupled with a highly performant OpenBLAS or
Intel MKL, but there is absolutely zero mention of the most popular scientific
stack in the Python community. Again, instead, we see a roundabout and
"clever" solution of trying to couple Python with LLVM.

~~~
twtw
On my system, I see the following:

    
    
      $ /usr/bin/time -f "%U\n" ./exp_volatile_a_b
      0.13
      $ /usr/bin/time -f "%U\n" ./exp_lapack
      0.52
      $ /usr/bin/time -f "%U\n" ./exp_openblas
      0.32
    

I got LAPACK and BLAS via apt-get on Ubuntu 18.04, so whatever that means. I
installed OpenBLAS from source, and it looks like the target detection decided
to enable AVX2 (but not AVX512).

So the fully-unrolled version is still faster on the 5x5 matrix, which doesn't
seem super surprising to me. I would expect a LAPACK implementation to have
some overhead compared to a straight line solution to a problem this small.

~~~
celrod
Yes. This is easy to see in Julia by comparing native arrays with
StaticArrays.jl. Native array operations are linked to OpenBLAS, while many
StaticArray operations are unrolled (I'll have to get to a computer to see if
that includes 5x5 inversion). At these small sizes, BLAS is tens of times
slower, especially OpenBLAS (MKL does better at small sizes). There's a lot of
overhead from things like determining optimal blocking structures.

------
d4l3k
The tracing-Python-objects technique is very similar to how pytorch traces
models to produce a graph representation that can be executed entirely in C++
for performance reasons.

The pytorch one is a little more general since it can trace arbitrary models
and have custom high-level operations that might not be supported by default.
You just pass it a python method and an example input.

    
    
      import torch
      def foo(x, y):
          return 2*x + y
      traced_foo = torch.jit.trace(foo, (torch.rand(3), torch.rand(3)))
    

[https://pytorch.org/docs/stable/jit.html#torch.jit.ScriptMod...](https://pytorch.org/docs/stable/jit.html#torch.jit.ScriptModule)

------
paultopia
_But it shows that you can write programs in Python that outperform their
counterparts written in C or C++. You don't have to learn some special
language to create high-performant applications or libraries._

... you just have to write bespoke macros to turn python code into LLVM
code...

------
spricket
I think this is a false dichotomy. You can make almost any language fast by
writing very non-idiomatic code. The advantage of something like Java is that
you can write readable, "normal" code that runs at nearly the speed of
optimized C.

I see these comparisons all the time, and they inevitably jump through a ton
of hoops to make a language like Python perform like Java/C#/Go/Rust. IMO the
big advantage of these languages is that they're fast for the majority of
normal, totally unoptimized code.

I can write REST endpoints in Java/C# that do a hundred thousand requests per
second using a mainstream framework with no optimization. That's important in
the real world, where I'm usually working on large, ancient projects with
somewhat poor code quality.

~~~
gnulinux
If you write a simple for loop incrementing `i` then yeah Java runs almost as
fast as C. In whole programs where GC unpredictably runs and does all sorts of
nasty things like dirtying pages and pressuring the cache, you won't get
anything close to C. Citation needed.

~~~
spricket
This was true until a few years ago. Java 11 has a new ultra-low-pause GC
that's almost entirely multi-threaded and mostly predictable.

There are plenty of semi-realistic benchmarks out there. I like TechEmpower
the best since they test an entire web framework + database.

Several java implementations are right up there with C, along with .NET Core,
Go, and Rust. In the vast majority of cases the extra 20-50% speed of C isn't
worth the dangers.

------
jlarocco
Meanwhile, I'll just keep using Common Lisp and get the benefits of a higher
level language AND have _all_ of my code run 15-20x faster than equivalent
Python.

And I'll still be able to manually generate LLVM IR if I really want to (but I
don't).

------
ChrisSD
To summarise: Use Python to output LLVM intermediate representation (IR) and
then compile the IR.

C was created in the PDP-11 era, whereas LLVM IR was designed for more modern
processors, so coding in the latter allows for performance optimisations that
C compilers can't always make.

~~~
stochastic_monk
Is gcc’s IR fundamentally limited compared to LLVM’s somehow? I see better
assembly generated by each for different cases.

~~~
codeflo
I don't think so, and even where it would be, GCC's IR can evolve. But GP was
talking about the limitations of C, not GCC. C's memory model puts some limits
on what a compiler can do that other languages might not. Fortran is usually
given as an example of a language that can be faster than C, because the
compiler is allowed to do more. Rust might get there as well, at least in very
specific circumstances. Given that GCC also has a good Fortran frontend, I'd
assume that their IR is already more powerful than what's needed to compile C
alone.

~~~
celrod
Isn't it just that it is hard to prove there's no aliasing in C, which can
make it harder to vectorize? If so, "restrict" should handle that in most
cases.

------
ychen306
LLVM IR is _not_ portable. LLVM's infrastructure allows you to write a
portable code generator, but the IR itself is definitely not portable.

~~~
saagarjha
LLVM IR isn’t portable in the general sense, but if you’re careful to maintain
specific ABI details it’s possible to have IR that works on multiple
platforms.

------
ris
This approach is basically like building a tiny version of numba
[http://numba.pydata.org/](http://numba.pydata.org/).

------
kayamon
For anyone interested in doing this kind of stuff in an actually sane manner
(i.e. using a high-level dynamic language to intermix LLVM), the sadly now
abandoned Terra project is worth checking out.

[http://terralang.org/](http://terralang.org/)

------
magicalhippo
Reminds me of the FEniCS[1] project, a framework for solving partial
differential equations, which I got introduced to some 10 years ago during my
studies.

LLVM wasn't a thing back then, so they generated C++ source code, which got
compiled into a Python module that was loaded back into the running Python
script and executed.
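
The same generate-compile-load-back pattern is easy to sketch today, using
ctypes instead of a full extension module (the compiler invocation and names
here are assumptions):

    
    
      import ctypes, os, subprocess, tempfile
    
      src = "double axpy(double a, double x, double y) { return a * x + y; }"
      d = tempfile.mkdtemp()
      c_file = os.path.join(d, "kernel.c")
      so_file = os.path.join(d, "kernel.so")
      with open(c_file, "w") as fh:
          fh.write(src)
    
      # compile the generated C into a shared object, then load it back in
      subprocess.run(["cc", "-O2", "-shared", "-fPIC", c_file, "-o", so_file],
                     check=True)
      lib = ctypes.CDLL(so_file)
      lib.axpy.restype = ctypes.c_double
      lib.axpy.argtypes = [ctypes.c_double] * 3
      print(lib.axpy(2.0, 3.0, 1.0))   # 7.0
    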

The translation pass also allowed for optimizations, for example if one of the
inputs was a constant it could call a different library function which handled
that special case. The result was a very flexible solver which allowed you to
write code that looked almost like the math you were trying to solve, yet
could match or beat a handwritten solver.

[1]: [https://fenicsproject.org/](https://fenicsproject.org/)

------
GuB-42
In my experience, what makes programs slow more often has to do with memory
than with CPU cycles.

More memory use means more cache misses, and cache misses have a huge impact.
Roughly, L2 cache is 10x slower than L1, and RAM is 10x slower than L2.

With C, when you want to allocate some memory, you typically precalculate the
size and malloc() just the right amount. In C++, you are more likely to use
containers, which are safer and more flexible but tend to introduce some
overhead. GC languages like Java have the GC overhead on top of that, and
dynamic languages like Python are the worst because they also need to keep
track of object types.

Now there are some applications that are really CPU-limited, like complex
calculations. But in that case, chances are you'll want to offload things to
the GPU.

Still, I really liked the article. That's another great tool to have when it
comes to performance. But it won't change the general case where C > C++ >
Java > Python when it comes to performance.

A little caveat when it comes to C vs C++ though. C++ can be faster than C if
you use C-like memory management, but that's not the usually recommended way
of doing things.

------
cnezin
Looks like it's just benchmarked against inverting a 5x5 matrix? Would the
results still stand for larger matrices?

~~~
Animats
It's total loop unrolling, so, no. If you did this on a 5000x5000 matrix,
you'd have a huge sequential code block that wouldn't fit in cache.

This trick is only useful for code that makes no control flow decisions based
on the data being processed.

~~~
stochastic_monk
I also expect Intel’s libxsmm to be as good or better for an operation of
this size.

------
carlsborg
Related. Has anyone used this?
[https://il.pycon.org/2016/static/sessions/anna-herlihy.pdf](https://il.pycon.org/2016/static/sessions/anna-herlihy.pdf)

"A compiler from a subset of Python to LLVM-IR"

~~~
viraptor
No, but there's also mypyc if you're interested in that area:
[https://github.com/mypyc/mypyc](https://github.com/mypyc/mypyc)

------
adamnemecek
This isn’t really python is it?

~~~
chubot
It is Python in the sense that the solve_linear_system() function is
unmodified. At first it runs with Python objects, and then it runs with
LLVMCode objects.

It's a form of metaprogramming which some would call staged programming or
multi-stage programming:

[https://en.wikipedia.org/wiki/Multi-stage_programming](https://en.wikipedia.org/wiki/Multi-stage_programming)
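
A tiny illustration of the two stages, with a hypothetical Sym class standing
in for the article's LLVMCode:

    
    
      class Sym:
          def __init__(self, s): self.s = s
          def __add__(self, o): return Sym("(fadd %s %s)" % (self.s, o.s))
          def __mul__(self, o): return Sym("(fmul %s %s)" % (self.s, o.s))
    
      def f(a, b):
          return a + a * b                # the function itself never changes
    
      print(f(2.0, 3.0))                  # stage 1: plain floats -> 8.0
      print(f(Sym("a"), Sym("b")).s)      # stage 2: -> (fadd a (fmul a b))
    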

~~~
adamnemecek
It’s not metaprogramming, it’s a program that outputs llvm ir. What about that
is meta?

~~~
chubot
See the links I put here:

[https://news.ycombinator.com/item?id=19013437](https://news.ycombinator.com/item?id=19013437)

The authors of all those systems call it metaprogramming, more or less.
Metaprogramming is a very general term, and there are many varieties of it.

Once you get past the syntax and specific technologies, you'll see that many
of these systems have a similar structure, and you could probably do a line-
by-line port of solve_linear_system() to those systems. I think Scala LMS is
probably the closest one.

------
nudpiedo
I agree with the sentiment of the article; nowadays there is almost no reason
not to use a high-level language for most tasks... native code can just be
plugged in on demand.

That's exactly what Numba[0] does for Python: just add a decorator to a
function and it magically gets native performance by being compiled with
LLVM.

[0] [http://numba.pydata.org](http://numba.pydata.org)
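
A minimal example of what that looks like (a sketch; assumes numba and numpy
are installed):

    
    
      import numpy as np
      from numba import njit
    
      @njit          # compiled to native code via LLVM on first call
      def dot(a, b):
          s = 0.0
          for i in range(a.shape[0]):
              s += a[i] * b[i]
          return s
    
      print(dot(np.arange(3.0), np.arange(3.0)))   # 5.0
    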

------
lixtra
Since the author appears to iterate a calculation on the constants a and b
one million times, I would expect a perfect compiler to compute x only once
at compile time, and then also compute sum_x, so that there is not much left
to do at runtime.

------
cybersol
I love using Python for getting stuff made fast, and then, only if needed,
accelerating a few function calls to make it run fast. I have used scipy,
cython, and even the C-API for this, but this looks like an interesting new
take.

