One of the annoying things about the x86-64 ABI is that floating-point numbers use vector registers, even if they're only scalars and not vectors. The only way you can tell the difference is the s versus the p in vmulsd/vmulpd (scalar versus packed).
So the optimized assembly here isn't using any vectorization. Which isn't surprising, since as far as I could tell, the author isn't actually optimizing the code using LLVM. Most of the optimizations happen in opt, not in llc, which only does codegen optimizations. Those are largely things like stack frame optimization, instruction scheduling, or some more powerful pattern matching in instruction selection (which, even at -O0, does a limited amount of common subexpression elimination and the like in LLVM). You might have to add restrict to the pointers (in LLVM terms, noalias on the arguments) to get vectorization to kick in, but the number of values is small enough that some sort of loop versioning would probably kick in anyway.
The benchmarking is also pretty unfair. The C and C++ code copy the values in from a volatile array before each loop iteration, which the Python-via-LLVM version doesn't do. That doesn't sound like much, but it's 30 million extra guaranteed memory accesses over the course of the program, which will come out to a few milliseconds. (And, gee, the Python code is only faster by a few milliseconds.) A better approach is to move the function to an external file and call it in a tight loop, wrapped by a high-precision clock.
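A minimal sketch of that approach in Python, with a stand-in function in place of the real solver:

    import time

    def f(a, b):
        # stand-in for the function under test
        return a * b + a

    N = 30_000_000
    start = time.perf_counter()  # high-resolution monotonic clock
    for _ in range(N):
        f(2.0, 3.0)
    elapsed = time.perf_counter() - start
    print(f"{elapsed / N * 1e9:.1f} ns per call")

At this scale the loop overhead itself matters, so subtracting the time of an empty loop gives a fairer per-call number.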
Also, benchmarking a 5×5 matrix? In practice, it's the memory access patterns that kill code, so you'll want inputs large enough to actually cause the expected cache traffic at the L2/L3 level. 5×5 is small enough that you could actually keep it all in vector registers with a bit of vectorization.
I like this comment, which seems to suggest that I don't need to learn LLVM IR in order to write a python class that emits LLVM IR when run through python code. I'd suggest that perhaps the reason he's able to do this is exactly because he knows "some special language" quite well.
To generalize, I have the impression that the general CS/programmer culture values initial ease of use much more than long-term usefulness.
And that's not directed at the younger generation; you will find plenty of old farts who only want to use or learn git once it's been thoroughly lathered in the thick sugary syrup of a pretty GUI.
I find I am frequently removing sections of inscrutable “vectorized” NumPy code that is inefficient, not because NumPy itself is slow but because conforming the data to a structure amenable to NumPy is too expensive in the first place, and rewriting them as straightforward, simple Cython functions.
Also the next maintainer will thank you for it.
To be clear, this is not meant as a critique of C; on the contrary. Put simply, it is just that making computers do stuff is hard. This is especially true if you want it done in a very particular way, as in high-performance code. If you cannot already leverage the work of other experts, you need to know about modern optimizing compilers, language runtimes, modern processor architectures, the OS/system, and so on.
To continue your NumPy example, that framework entombs the know-how and man-years of work of specialists in a certain area. If you need to do simple dense algebra, you will have a hard time beating NumPy/Julia/Eigen/MKL etc.
If your problem goes slightly off-track (e.g. sparse problems), where you cannot easily leverage the hard work and expertise of others, you had better start rolling up your sleeves and open that main.c.
With more experience you'll find that certain operations are easier with particular workflows.
Git vs other ... well, Subversion is the standard. I'm not getting paid to move repositories on a whim. I have to make a business case ... etc. Guess what, I've got better things to do (i.e. do things that make money) than chase the newest shiny thing.
Sometimes a multi-hundred-line git dump works and sometimes it doesn't. In some workflows the GUI scans better, e.g. for picking out the hot spots in functions.
- learn a basic workflow and ask/google for help when problems come up. A gui makes this slightly easier.
- learn the underlying model of git. Then you probably still have to google for invocations occasionally, but you would have to do the same with a sugary gui. And for complex actions in a gui the answer usually is 'use the cli'.
Plus occasionally a gui does something incredibly weird, such as giving direct access to `git clean --force`, as vs code did a while back.
I like git, but the cli is only usable after a dash and a spoonful of aliases.
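For instance, a few illustrative entries in ~/.gitconfig (these particular aliases are just my guess at common ones, nothing canonical):

    [alias]
        st = status -sb
        co = checkout
        lg = log --oneline --graph --decorate --all
        amend = commit --amend --no-edit

After that, `git lg` gives a readable history graph and `git amend` folds staged changes into the last commit.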
Other commands like stash, checkout and branch are equally simple. It’s just a very simple user experience.
I think where people express confusion over git is that they don’t expect to interact with revision history, just to inflict changes onto a head-only svn-style repo.
People are mistakenly encouraged to bring that mentality to the table when it’s not useful. You need to bring a mentality that you’re mutating and sculpting branches.
It’s like a beginner saying “snowboards are complicated” because they come in expecting it to be like a sled. You hear the same complaints about other systems like darcs and mercurial too, for the same reasons, even though each of those, like git, offers a super easy-to-use cli experience.
Once you grok rebasing branches, git add, push, pull, bisect, rebase, log, commit --amend are all I have needed to work with Git and the experience has been far better than working with any other version control like SVN or Mercurial.
Ok, so they are simple because you say so? Stash in all but the simplest cases is a sure-fire way to shoot yourself in the foot. Checkout is a weird guy, he does too many different things, and what’s up with the double dashes? Super-weird! Also, watch out if any of your branch names happen to match any of the file names in the repo :O.
> I think where people express confusion over git is that they don’t expect to interact with revision history
This is one of the things that confuses beginners about git, but it's not a confusion about git cli. You are conflating the criticism of git cli with criticism of git. Git is good, git cli is (notoriously) bad. However, eventually you'll (hopefully) learn the proper incantations and then using git cli will seem simple to you.
> “You are conflating the criticism of git cli with criticism of git.”
Not at all. I’m saying most of the time when people say they are confused by git cli, that it behaves counter to their expectations, they are not actually expressing any design limitation of the cli commands themselves, but instead blaming cli commands for the conceptual confusion they have about git’s model in general.
Basically the author has written a tiny framework for making a reasonable subset of Python a DSL that targets LLVM, which is super cool.
That said, like all things that try to accelerate Python, it only works for a subset of the language (and in this case a very narrow one), and TBH while this whole "let's accelerate Python" movement is cool, it's really quite awkward to have hidden subsets of a language that are accelerable. I'm not sure, but I think it might be better if there were a second, accelerable language that could be inlined with clear rules, rather than this approach, which seems like it will always be prone to inconsistency...
...or Julia. I suspect there are tradeoffs there too though... Building a language/runtime for high level experimentation with built in support for performance that rivals custom low level code is hard, the goals themselves are at odds with each other...
It's interesting, but it doesn't work in general, which is why you don't see the technique in common use. The problem is that it only catches methods called on objects; it can't catch anything that isn't a method call on an object. Unfortunately, that includes for loops, if statements, some aspects of exceptions, and just generally "all flow control".
So rather than being a powerful and general technique, it's a powerful, but very very specific technique that only works in certain toy problems. Unfortunately, once you leave those toy problems behind, it doesn't even help you solve the problem because the paradigm this forces you to operate under makes easy things easy, but medium things hard and hard things pull-your-hair-out impossible. You're better off writing a real compiler... or doing this in a language that does expose enough of what's going on to make this technique work, although you tend to end up in either a Lisp or Haskell, which come at this from very different angles but both have the capacity to make this work. (Lisp via syntax tree manipulation, Haskell via Free Monads and other similar techniques.)
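To make the limitation concrete, here's a minimal sketch of the overloading trick (the Expr class and names are made up for illustration):

    class Expr:
        # Records operations instead of evaluating them.
        def __init__(self, text):
            self.text = text
        def __add__(self, other):
            return Expr(f"(add {self.text} {as_text(other)})")
        def __mul__(self, other):
            return Expr(f"(mul {self.text} {as_text(other)})")

    def as_text(x):
        return x.text if isinstance(x, Expr) else repr(x)

    x = Expr("x")
    print((x * 2 + 1).text)  # captured: (add (mul x 2) 1)

    # But Python's `if` demands a real boolean immediately, so a
    # data-dependent branch like `if x > 0: ...` either raises or
    # silently takes one path; the branch never appears in the trace.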
The one issue is compilation time, which is much improved and will be further reduced through more caching and planned multi-level compilation.
Then we have metaclasses, abstract base classes and protocols.
Also super() and self everywhere.
Even more complexity with typing, which is necessary if you want to approach type safety similar to Julia's, i.e. to make the comparison more fair.
This has happened to me 5 or 6 times now. At this point I just stick to stock python.
https://scala-lms.github.io/ -- Scala LMS is similar in that you replace type T in your algorithms with Rep[T], and you get a code generator (compiler) rather than an interpreter.
This links to a CACM article by the authors of Scala LMS:
https://andrewkelley.me/post/zig-programming-language-blurs-... -- Zig has a code generator for printf using compile-time metaprogramming.
http://terralang.org/ -- You can write arbitrary Lua to generate Terra code, sort of like he's writing Python to generate LLVM code. I think this is more general and doesn't necessarily work by retaining the structure of say solve_linear_system(). I'm not quite sure. But the problem domains they tackle are very similar to this (numerical code).
TensorFlow also has two stages of computation -- you define a graph of objects in imperative Python, using Python operator overloading in the same way that LLVMCode in this example does. And then later the separate TensorFlow language is evaluated in parallel / distributed on GPUs.
In one case you're defining an SSA graph of ops in LLVM; in the other case you're defining a graph of TensorFlow ops.
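As a rough sketch of those two stages, using the TF 1.x-style graph API (via the compat shim in current TensorFlow):

    import tensorflow.compat.v1 as tf
    tf.disable_eager_execution()

    # Stage 1: plain Python runs once, and operator overloading
    # records mul/add nodes into a graph; no arithmetic happens yet.
    a = tf.placeholder(tf.float32, shape=(3,))
    b = tf.placeholder(tf.float32, shape=(3,))
    c = 2.0 * a + b  # c is a handle to a graph node, not a value

    # Stage 2: the graph is handed to the runtime and evaluated,
    # potentially on a GPU or distributed across machines.
    with tf.Session() as sess:
        print(sess.run(c, feed_dict={a: [1, 2, 3], b: [4, 5, 6]}))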
Note that the semantics of TensorFlow language are distinct from the semantics of Python, e.g. https://ai.google/research/pubs/pub46196
One interesting language in the same vein is Spiral, with "first-class staging" for machine learning:
You can also do a lot of this with constexpr / templates in C++, although the syntax is horrible.
As you allude to, these staged programming systems can have TWO languages or one. In the case of Scala and Zig it's one language, but in the case of Lua/Terra it's two. Arguably in the case of C++, it's two languages and not one.
Another thing to search for is "partial evaluation". I still have to understand the difference more precisely, as alluded to in the link above.
The Python/RPython language pair in PyPy is also related:
At first I was a little critical but I get it a little more now. Of course the real test is trying it, but I'm not working in that area at the moment, though I've used TF before.
What I find interesting is the recurring theme of two languages -- in Swift's case they say you can use a "static subset". I think that means all the imperative control flow, and then static dispatch (as in templates), but NOT dynamic dispatch (as in OOP).
So in this example you have Python/LLVM, then there's Lua/Terra, Python/RPython, Python/TensorFlow, or Swift/static Swift, etc. Except in the latter case I guess it's all mixed together and you don't explicitly have a graph building stage.
This is not true for Cython.
FWIW, I can't.
$ cd wordsandbuttons/exp/python_to_llvm/exp_c
$ /usr/bin/time -f "%U\n" ./exp_volatile_a_b
$ cd wordsandbuttons/exp/python_to_llvm/exp_embed_on_call
$ /usr/bin/time -f "%U\n" ./benchmark
- Switching to -O3 for gcc (instead of the author's default -O2), the C version (exp_volatile_a_b) gets ~30% faster, bringing it down to ~50% of the python-generated llvm version's time - it appears to still be doing the full computation at runtime.
- Switching to -O3 for llc doesn't make any difference for the python-generated llvm version.
- With gcc 4.8 -O2 (instead of 7.3 -O2), I get ~0.06s - it looks like gcc 4.8 decides to inline everything, and gcc 7.3 doesn't.
Functional Pearl: A SQL to C Compiler in 500 Lines of Code -
How to Architect a Query Compiler, Revisited -
Flare: Optimizing Apache Spark with Native Compilation for Scale-Up Architectures and Medium-Size Data - https://www.usenix.org/conference/osdi18/presentation/essert...
- In the C article he shows that LAPACK can be massively improved upon. That sounds very impressive, until you realize he hasn't mentioned which LAPACK/BLAS implementation he used. The default download from netlib is the vanilla version and is known to be very slow in general. Any benchmark dissing LAPACK is incomplete without comparison with OpenBLAS and Intel MKL. And instead we see the author pulling all sorts of overengineered tricks to top the performance of a vanilla LAPACK build (I mean there isn't even a mention of whether or not he used optimization flags while compiling LAPACK).
- In this python article, again, python comes with the highly developed scipy/numpy stack which can be coupled with a highly performant OpenBLAS or Intel MKL, but there is absolutely zero mention of the most popular scientific stack in the python community. Again, instead, we see a roundabout and "clever" solution of trying to couple python with LLVM.
$ /usr/bin/time -f "%U\n" ./exp_volatile_a_b
$ /usr/bin/time -f "%U\n" ./exp_lapack
$ /usr/bin/time -f "%U\n" ./exp_openblas
So the fully-unrolled version is still faster on the 5x5 matrix, which doesn't seem super surprising to me. I would expect a LAPACK implementation to have some overhead compared to a straight line solution to a problem this small.
The pytorch one is a little bit more general since it can trace arbitrary models + have custom high level operations that might not be supported by default. You just pass it a python method and an example input.
    import torch

    def foo(x, y):
        return 2 * x + y

    traced_foo = torch.jit.trace(foo, (torch.rand(3), torch.rand(3)))
... you just have to write bespoke macros to turn python code into LLVM code...
I see these comparisons all the time and they inevitably jump through a ton of hoops to make a language like Python perform like Java/C#/Go/Rust. IMO the big advantage of these languages is that they're fast for the majority of normal, totally unoptimized code.
I can write REST endpoints in Java/C# that do a hundred thousand requests per second using a mainstream framework with no optimization. That's important in the real world, where I'm usually working on large ancient projects with somewhat poor code quality.
There's plenty of semi-realistic benchmarks out there. I like TechEmpower the best since they test an entire web framework + database.
Several Java implementations are right up there with C, along with .NET Core, Go, and Rust. In the vast majority of cases the extra 20-50% speed of C isn't worth the dangers.
That's also without making any attempt at optimizing, basically a straight conversion that took ten minutes.
It's not the speed of C but it's pretty close. And it has memory safety.
It's an inversion of the instinct people have with our culture's commandment against premature optimization. The instinct is that selecting for performance at the start of a project is premature. I argue that selecting a platform is precisely the time to consider performance, because it is a decision that is extremely high-friction to change later. Therefore, it's timely, not premature.
By selecting a high-performance platform, you set the performance ceiling and baseline performance of your application high. This gives you the freedom to develop application code in a sub-optimal manner—deferring its optimization until such time that it is necessary, perhaps forever. With a lower-performance platform, you will run into a performance hurdle event that requires optimization effort earlier and the level of effort to raise your performance ceiling will be greater than it would be had you selected a higher-performance platform at the start.
The key, of course, is all else being equal. That's not always the case for all teams, but it is still often the case. The key is knowing when your team can manage to adopt a high-performance platform and being careful about that decision.
If the goal were to write one algorithm to be the fastest implementation, performance would usually equate to having low level control.
If the goal were to do automatic optimization of a wide variety of similar algorithms, then we're in the "actually writing a compiler" territory and then there is little to be gained from defaulting to low-level control.
And I'll still be able to manually generate LLVM IR if I really want to (but I don't).
C was created in the PDP-11 era whereas LLVM IR was designed for more modern processors, so coding in the latter allows for more performance optimisations that C compilers can't always make.
Besides that, LLVM IR is not all that different from C. LLVM essentially originated as an implementation of the abstract machine described in the C standard, and most of the semantics of LLVM essentially align with the equivalent semantics in C. There are some additions for things that don't exist in C--bitcast, vector types, unwinding/exception handling--and there are knobs to turn down some of the undefinedness in C (e.g., you can choose whether or not integer overflow causes undefined behavior, and you can vary the alignment of load and store operations). But using LLVM IR directly isn't going to expose any extra optimization opportunities that you couldn't have already expressed with C, potentially with some extra gcc-isms (such as vector types).
"An overview of the PL.8 compiler"
LLVM wasn't a thing back then, so they generated C++ source code, which was compiled into a Python module and loaded back into the running Python script and executed.
The translation pass also allowed for optimizations, for example if one of the inputs was a constant it could call a different library function which handled that special case. The result was a very flexible solver which allowed you to write code that looked almost like the math you were trying to solve, yet could match or beat a handwritten solver.
More memory use means more cache misses, and cache misses have a huge impact. Roughly, L2 cache is 10x slower than L1, and RAM is 10x slower than L2.
With C, when you want to allocate some memory, you typically precalculate the size and malloc() just the right amount. In C++, you are more likely to use containers, which are safer and more flexible but tend to introduce some overhead. GC languages like Java have the GC overhead in addition and dynamic languages like Python are the worst because they also need to keep track of object types.
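A quick illustration of that per-object overhead on CPython (exact byte counts vary by version and platform):

    import sys
    import numpy as np

    # a lone Python float carries type-pointer and refcount bookkeeping
    print(sys.getsizeof(1.0))           # typically 24 bytes for 8 bytes of data
    # a list of floats additionally pays one pointer per element
    print(sys.getsizeof([1.0] * 1000))  # ~8 KB of pointers, objects not included
    # a numpy array stores raw doubles back to back
    print(np.zeros(1000).nbytes)        # exactly 8000 bytes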
Now there are some applications that are really CPU-limited, like complex calculations. But in that case, chances are that you'll probably want to offload things to the GPU.
Still, I really liked the article. That's another great tool to have when it comes to performance. But it won't change the general case where C > C++ > Java > Python when it comes to performance.
A little caveat when it comes to C vs C++ though. C++ can be faster than C if you use C-like memory management, but that's not the usually recommended way of doing things.
This trick is only useful for code that makes no control flow decisions based on the data being processed.
"A compiler from a subset of Python to LLVM-IR"
From a programmer productivity scenario, I'd find it difficult to justify all the extra work and time spent on doing the Python + LLVM version, as compared to something that was far simpler to maintain, and performed almost as well.
Now if he came out with some sort of two-orders-of-magnitude improvement, that would be a different story. But I don't see Python getting there. Without appealing to something external, which isn't, itself, Python.
It's a form of metaprogramming which some would call staged programming or multi-stage programming:
The authors of all those systems call it metaprogramming, more or less. Metaprogramming is a very general term, and there are many varieties of it.
Once you get past the syntax and specific technologies, you'll see that many of these systems have a similar structure, and you could probably do a line-by-line port of solve_linear_system() to those systems. I think Scala LMS is probably the closest one.
Many low-performance dynamic languages did all sorts of tricks to improve performance. It's even a standard engineering practice to rewrite the things that need more speed.
Similarly you can twist the language if it has metalevel capabilities...
It's something I liked in less common languages: they often enjoy generating low-level code and treat the system as an object (the CPU being a consumer of an instruction stream) rather than a divinity.
That's exactly what Numba does for Python: just put a decorator on a function and it magically gets native performance by being compiled with LLVM.
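A minimal sketch of that (assuming Numba is installed):

    import numpy as np
    from numba import njit

    @njit  # compiled to native code via LLVM on first call
    def dot(a, b):
        total = 0.0
        for i in range(a.shape[0]):
            total += a[i] * b[i]
        return total

    x = np.random.rand(1000)
    y = np.random.rand(1000)
    print(dot(x, y))  # first call triggers JIT compilation; later calls run at native speed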