Making Python fast – Adventures with mypyc (meadsteve.dev)
224 points by meadsteve on Sept 27, 2022 | 92 comments



I’ll be that guy who says I love Python but it’s been shoved into too many spaces now. It’s been a great tool for me for writing things that require a lot of I/O and aren’t CPU bound.

I am even rethinking that now, because I was able to write a program in Go with an HTTP API, using JSON as the usual interchange format, in one night (all stdlib too). It was so easy that I plan to pitch Go for several services we need to rewrite at work that are currently in Python; they would be very similar to what I wrote in a day.

If Python doesn’t fix its packaging, its performance, and the massive expansion of the language, I think it’s going to start losing ground to other languages.


Can't fault Python as a scripting language, but the lengths people go to to kludge it into areas where it shouldn't really be (in prod, at least) are always a massive red flag for any org.


Microsoft is bankrolling some efforts to improve Python performance. They even hired Guido for that.

(Disclosure: I'm a very minor volunteer contributor to that effort. I have a series of pull requests in the works that will improve runtime performance by about 1% across the board. I also have some more specialised improvements to memory.)


Syntactically, Python is great. At runtime, though, it's not. It's fast to code in - that is all.


Agreed. I'm past the "it's like writing pseudo-code - so cool!" honeymoon phase and onto the "some sort of static typing is actually pretty useful" phase as a developer.


The good thing is that Python 3 provides typing.


Optional type hints are not the same at all.


No, it's better. You can build the code first, test that it functions as expected, then add types, then compile it.

This is miles ahead of having to struggle with getting types correct in C++ before you even have anything resembling a workable solution.
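
For example, a toy sketch of that workflow (Python 3.9+ syntax for the hints):

    # first pass: untyped, quick to iterate on
    def mean(xs):
        return sum(xs) / len(xs)

    # later: once the behaviour is settled, add hints so mypy can
    # check the code and mypyc can compile it
    def mean(xs: list[float]) -> float:
        return sum(xs) / len(xs)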


I don’t get the struggle with types people have. It’s not hard to use a knife instead of a hammer to cut stuff. What’s the struggle with types, specifically?


Try teaching C++ to a new programmer and you’ll understand the struggles people have with types.

You might have forgotten, but you had head-scratchers too when you were learning. Everyone has them. I’ve taught people who are absolute geniuses, and even they struggled initially. And sure, you get over it, just like people can become quite adept at programming in esoteric languages like Brainfuck, but that doesn’t mean there aren’t better ways.


Yeah, types are hard, but I think types are really important, almost necessary, for writing quality software. I guess starting with a foundation where types are minimal, like Python, is good for learning; adding static typing is then just a level of abstraction on top. We have to learn programming gradually through abstractions anyway, so I don’t see a problem there.


I don't disagree, but I wish Python had built-in support for runtime type checking. I've thought about switching to Go or Rust for certain projects, but Python's rich ecosystem makes it hard for me to switch, so for now I long for runtime type checking without needing an external library (e.g. typeguard).
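
Something like this stdlib-only sketch is roughly what I'd want built in (the decorator name is hypothetical, and it only handles plain classes, not generics - which is exactly the hard part typeguard solves):

    import functools
    import inspect
    import typing

    def enforce_types(func):
        """Hypothetical decorator: validate plain-class annotations at call time."""
        hints = typing.get_type_hints(func)
        sig = inspect.signature(func)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            bound = sig.bind(*args, **kwargs)
            for name, value in bound.arguments.items():
                expected = hints.get(name)
                # skip generics like list[int]; isinstance can't check those
                if (expected is not None
                        and typing.get_origin(expected) is None
                        and isinstance(expected, type)
                        and not isinstance(value, expected)):
                    raise TypeError(f"{name!r} expected {expected.__name__}, "
                                    f"got {type(value).__name__}")
            return func(*args, **kwargs)

        return wrapper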


Very much agree. It's one thing I really appreciated about how PHP added type annotations: you didn't need them, but once you added them they became a guarantee.


You can check types at runtime in python. You can even check values at runtime. Heck you can check the weather at runtime if you like. What’s the problem exactly?


You can configure VS Code to fail on typing problems, and it will show errors in your editor.


So that's an IDE feature rather than a language-level feature.


Nope, that's a language-level feature. Your code will also fail at runtime. The IDE is just turning it on.


I don't think that's how it works. Unless you install a library like typeguard, Python doesn't do runtime type validation.
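
Easy to demonstrate; annotations are just metadata to CPython:

    def double(x: int) -> int:
        return x * 2

    print(double("ha"))  # prints 'haha'; the int annotation is never checked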


I didn't know that you can compile individual modules with mypyc. That's very interesting since it allows a gradual adoption of the compiler, which really helps with big codebases.

Do you know if there are any requirements for which modules can be compiled? E.g. can they be imported in other modules or do they have to be a leaf in the import tree/graph?


Having read through the docs: mypyc has a concept of "native" classes versus regular Python classes, and it looks like you can use a native (compiled) class from regular Python and vice versa.

So my reading is that it should be pretty seamless.
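
For example (names made up, based on my reading of the docs): a class in a mypyc-compiled module becomes a native class, and uncompiled code can import and use it as usual:

    # shapes.py -- compiled with mypyc, so Point becomes a "native" class
    class Point:
        def __init__(self, x: float, y: float) -> None:
            self.x = x
            self.y = y

        def dist2(self) -> float:
            return self.x * self.x + self.y * self.y

    # main.py -- plain interpreted Python, no compilation needed
    from shapes import Point

    print(Point(3.0, 4.0).dist2())  # 25.0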


I recently experimented with using mypyc to make some of my python a little faster. I was pleasantly surprised with how well it worked for very little code change so I thought I'd share my experiences.

The blog post wanders around a little because I had to add setuptools and wheel building as my project had previously skipped this.
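
For anyone curious, the setuptools side boils down to something like this (package and module paths here are placeholders):

    # setup.py -- minimal sketch using mypyc's documented setuptools hook
    from setuptools import setup
    from mypyc.build import mypycify

    setup(
        name="mylib",
        packages=["mylib"],
        # compile just the hot module; everything else stays interpreted
        ext_modules=mypycify(["mylib/core.py"]),
    )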


I just found out about Lagom from this blog post and it's exactly what I have been looking for.

All other Python options I've seen feel too involved or leak too much into your code. Lagom seems to balance everything just right.

Thank you!


Haha, I can imagine Steve is quite pleased with this comment. You should look up the meaning of the (Swedish) word lagom.


Thanks 4140tm and thanks cinntaile. I was very pleased. That was very much the intention of the name


Anybody using Python and Rust should also check out maturin and PyO3. I run some (non-public) Python modules created in Rust, and both the performance and the testability are stellar.


Yeah, these are great approaches too. I'd actually considered a rewrite of the core in Rust before I went with mypyc, but it was nice not to have to do a rewrite.


Totally understandable. More options are better anyways.


I have the exact same experience. Both Maturin and PyO3 have been a game changer for the work that I have been doing lately. It works so seamlessly.


We built the logic backing the Temporal Python SDK[0] in Rust and leverage PyO3 (and PyO3 Asyncio). Unfortunately Maturin didn't let us do some of the advanced things we needed to do for wheel creation (at the time, unsure now), so we use setuptools-rust with Poetry.

0 - https://github.com/temporalio/sdk-python


I had no issues with the standard maturin way of building wheels – but my requirements were not special at all. I also did this maybe 5 months ago, so maybe it has indeed gotten better, I cannot tell.


Taichi, a high-performance parallel programming language embedded in Python, is also worth mentioning. I've experimented with it a bit, and the high-performance claim is very true. One can pretty much just write ordinary Python, and enhancing existing Python is not that difficult either.

From their docs:

You can write computationally intensive tasks in Python while obeying a few extra rules imposed by Taichi to take advantage of the latter's high performance. Use decorators @ti.func and @ti.kernel as signals for Taichi to take over the implementation of the tasks, and Taichi's just-in-time (JIT) compiler would compile the decorated functions to machine code. All subsequent calls to them are executed on multi-CPU cores or GPUs. In a typical compute-intensive scenario (such as a numerical simulation), Taichi can lead to a 50x~100x speed up over native Python code.

Taichi's built-in ahead-of-time (AOT) system also allows you to export your code as binary/shader files, which can then be invoked in C/C++ and run without the Python environment.

https://www.taichi-lang.org/
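
A minimal sketch of that decorator-driven workflow, assuming taichi is installed:

    import taichi as ti

    ti.init(arch=ti.cpu)  # or ti.gpu if one is available

    @ti.kernel
    def sum_of_squares(n: int) -> float:
        total = 0.0
        for i in range(n):  # the outermost loop in a kernel is parallelized
            total += float(i) * i  # accumulation is handled atomically
        return total

    print(sum_of_squares(1_000_000))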


> export your code as binary/shader files

That's cool af.


Speaking of python performance, I recently benchmarked "numpy vs js" matrix multiplication performance, and was surprised to find js significantly outperforming numpy. For multiplying two 512x512 matrices:

    python
      numpy:               ~3.30ms
      numpy with numba:    ~2.90ms

    node
      tfjs:                ~1.00ms
      gpu.js:              ~4.00ms
      ndarray:           ~118.00ms
      vanilla loop:      ~138.00ms
      mathjs:           ~1876.00ms

    browser
      tfjs webgpu:         ~0.16ms
      tfjs webgl:          ~0.76ms
      tfjs wasm:           ~2.51ms
      gpu.js:              ~6.00ms
      tfjs cpu:          ~244.65ms
      mathjs:           ~3469.00ms

    c
      accelerate.h:        ~0.06ms

Source here: https://github.com/raphaelrk/matrix-mul-test
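
For reference, the numpy number boils down to something like this (a rough equivalent, not the linked code verbatim):

    import timeit
    import numpy as np

    a = np.random.rand(512, 512)
    b = np.random.rand(512, 512)
    per_call = timeit.timeit(lambda: a @ b, number=100) / 100
    print(f"numpy: ~{per_call * 1000:.2f}ms")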


Numpy runs on CPU. Tfjs runs on GPU. Not a fair comparison.


You're right, it's not a fair comparison -- I think it's still interesting though, since numpy is the standard people would reach for, which made me think it would be the fastest / use the GPU. I expect a python library that uses the GPU would be just as fast as the others.


For that, you can use cupy[0], PyTorch[1], or TensorFlow[2]. They all mimic numpy's API with the possibility of using your GPU.

[0] https://cupy.dev/
[1] https://pytorch.org/
[2] https://www.tensorflow.org/
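
With cupy, for example, the change is mostly the import (a sketch; assumes a CUDA-capable GPU):

    import cupy as cp  # near drop-in for `import numpy as np`

    a = cp.random.rand(512, 512)
    b = cp.random.rand(512, 512)
    c = a @ b                          # the multiply runs on the GPU
    cp.cuda.Stream.null.synchronize()  # wait for the kernel before timing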


It adds a great deal of complexity (and often user frustration) for a Python package to support both CPU and GPGPU.


People don’t reach for numpy because it’s the absolute fastest…


What I meant was that I expected numpy to be faster than the JS libraries I was testing, simply because people use it so much more for real "scientific computing" work. And indeed it is very fast given that it only uses the CPU, but that still leaves its matrix multiplication ~100x slower than what my Mac is capable of.


TensorFlow.js matrices are immutable, which puts more restrictions on your programming style than standard numpy. You can get immutable, GPU-enhanced matrices for Python, too.


Is anyone else using mypyc in production who can share their experience? Did you attempt the compile-it-all approach, or add it incrementally? What do compile times look like at scale?

Would love to buy you a coffee and hear about your experience and the challenges and benefits.


I like the concept of using mypyc to leverage type hints to compile Python. But I was pretty frustrated recently when I got bitten by a bug in mypyc[1] while trying to use black, especially since I wasn't using mypyc myself and so didn't realize it was even in my dependency tree. Beware adding "alpha"-quality software as a dependency to your supposedly production-ready tool.

[1] https://github.com/psf/black/issues/2846


Nice to see this. Do they have a project roadmap for mypyc?

Doubling performance is nice, though it does still leave a lot of performance on the table. I’d be curious to see a comparison between this and Cython.


> for free ... this was a problem as a number of my tests rely on this [incompatible behaviour]


Maybe I should have said for cheap


I'm curious how this compares with the PyPy FAQ: https://doc.pypy.org/en/latest/faq.html#would-type-annotatio... which describes a bit about why type hints aren't as helpful for optimizing code under PyPy as one (including myself) might think.

Can someone explain more about how mypyc is in a better position to produce better optimizations than pypy, or am I confused about this?


PyPy argues that considering type annotations gives them less useful data than their existing tracing does, and thus PyPy wouldn't be faster if it considered them. Something like mypyc by design has no chance of doing tracing, and thus has to work with annotations. (I also don't see where you get the claim that mypyc has better optimizations than PyPy. But the two also follow different designs, so they might be good at different things.)


Sorry, I didn't mean to claim that mypyc does have better optimizations; I meant to ask whether that was possible. My superficial read was: this post about mypyc goes from type hints to compiling to "faster", and then I remembered the PyPy FAQ, which says type hints didn't help with that.

But if mypyc has no runtime information to go on (which pypy does have), then certainly having some type information is better than having none.


Mind you, it still requires a C compiler to be installed separately. That's very easy on Linux, but it means an Xcode install on Mac, and it can be fiddly on Windows.

Still nice, but not like Go or Rust, where you have a standalone solution.

It's an alternative to Nuitka, which I recommend trying out.


What's with this fascination with making Python fast? It's not supposed to be fast; it's supposed to be simple. If you want speed, use a compiled language. Trying to make Python fast is like trying to strap a turbocharger to a tricycle.


I agree with you -- but I also don't say no to free food.

I mean, regardless of whether mypy was going to make my code run faster, I would have used it for the sheer confidence it gives with respect to my code's correctness. The fact that I can use that same code (untouched) to speed it up... that just means I get to have my cake and eat it too :P


Yeah this is exactly it for me. I already had type annotations and ran mypy to help with correctness. And I tried this out because it felt like a nice thing to get for free.


Python programming is simple because I, the programmer, have to do less stuff. But also, to a really, really crude approximation, some asm that does less stuff is both simpler and faster than some asm that does more stuff (it depends on what I'm doing in particular, but it's a decent rule of thumb).

But there is a disconnect between Python programming simplicity and Python speed, which stems from the fact that under the hood Python is doing much more than its minimal asm 'spiritual equivalent'.

But in a pure abstract theory sense, it shouldn't "need" to. I don't really care about the intricacies of garbage collection or global interpreter lock or page misses etc - what if I just care about "can I make this nice idea into reality in 10 minutes". The reality is that I'm just barely working with the tip of an iceberg composed of 60+ years of computer abstractions. But who can blame me - I am but a mere mortal.

If we could have a programming language that is both simple for the programmer and simple for the computer, it would be great. It's not that unreasonable that people start from the user-experience side - making a simple language faster by getting rid of unnecessary work - rather than the opposite extreme: making it simpler to come up with optimal machine code whose simplicity withstands contact with hundreds of vestigial appendages that just have to be dealt with because of computer history (spanning from how to do a syscall to how to make a GUI in some particular OS).


I can think of use cases in academic research, for example.

Many pieces of code run only a handful of times but potentially move a lot of data. Sometimes reimplementing existing code in C/Haskell/Rust - including finding equivalent libraries, writing tests, and debugging - only because the computation turned out to be heavier than I had expected is not a good use of time. If that's where I'm at, PyPy, mypyc, etc. might just do the trick.


mypyc is cool and all, but I can't help thinking about how Node just JITs everything automatically without the need for any special steps like this.


That's not Node - that's V8. And it's possible to do the same thing for Python - there's nothing magic about JavaScript compared to Python - it's just a lot of engineering work, which is beyond this project's scope. PyPy does it, but not inside standard Python.


I'm well aware of V8 and pypy. I also really like Python as a language, especially with mypy.

It just makes me sad that in a world with multiple high-performance JIT engines (including pypy, for Python itself), the standard Python version that most people use is an interpreter. I know it's largely due to compatibility reasons (C extensions being deeply intertwined with CPython's API).

There is a really important (if not "magic") difference between JavaScript and Python. JS has always (well, since IE added support) been a language with multiple widely used implementations in the wild, which has prevented the emergence of a third-party package ecosystem heavily tied to one particular implementation. Python, on the other hand, is for a large proportion of the userbase synonymous with CPython, with alternate implementations being second-class citizens, despite some truly impressive efforts on the latter.

The fact that packages written in JS are not tied to (or at least don't work best with) a single implementation is also what made it possible for developers of JS engines to experiment with different implementation approaches, including JIT. While I'm not intimately familiar with writing native extension modules for Node (having dabbled only a little), my understanding is that the API surface is much narrower than Python's, allowing for changes in the engine without breaking APIs. But there is less need for native modules in JS because of the presence of JITs in all major engines.


> It just makes me sad that in a world with multiple high-performance JIT engines (including pypy, for Python itself), the standard Python version that most people use is an interpreter. I know it's largely due to compatibility reasons (C extensions being deeply intertwined with CPython's API).

This is misleading, if one takes "interpreter" to mean that code is represented as syntax-derived trees or other data structures which are then traversed at runtime to produce results - someone correct me if I'm wrong, but this would apply to well-known interpreted languages like Perl 5. CPython is a bytecode interpreter, not conceptually unlike the Java VM before JITs were added. It just happens to compile scripts to bytecode on the fly.
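
You can inspect the bytecode CPython actually executes with the stdlib dis module (exact opcode names vary by version):

    import dis

    def double(x):
        return x * 2

    # prints instructions such as LOAD_FAST x / LOAD_CONST 2 / BINARY_OP *
    dis.dis(double)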


That's not misleading, that's standard terminology. An interpreter using bytecode is still an interpreter.


Bytecode is just another data structure that you traverse at runtime to produce results. It's a postfix transformation of the AST. It's still an interpreter.


Well, OK, but then isn't a CPU also just an interpreter, traversing the object code of a compiled program?


We don't normally call hardware or firmware implementations an 'interpreter'.

Almost all execution techniques include some combination of compilation and interpretation. Even some ASTs include aspects of transformation to construct them from the source code, which we could call a compiler. Native compilers sometimes have to interpret metadata to do things like roll forward for deoptimisation.

But most people in the field would describe CPython firmly as an 'interpreter'.


I call it "bytecode interpreted" to distinguish it from traditional parse-tree interpretation such as Perl 5 and others


so you'd call the pre-JIT JVM an "interpreter" and you'd call Java an interpreted language?


> so you'd call the pre-JIT JVM an "interpreter"

Yeah? I think almost everyone would?

> and you'd call Java an interpreted language?

Java is interpreted in many ways, and compiled in many ways, as I said it's complicated. It's compiled to bytecode, which is interpreted until it's time to be compiled... at which point it's abstract interpreted to a graph, which is compiled to machine code, until it needs to deoptimise at which point the metadata from the graph is interpreted again, allowing it to jump back into the original interpreter.

But if it didn't have the JIT it'd always be an interpreter running.


I am not too concerned about the word "interpreter", and more about CPython being called an "interpreted language", which implies it works like Perl 5, or that CPython being an "interpreter" is somehow a problem. Its normal mode of operation works more like pre-JIT Java, with interpreted bytecode from .pyc files.


Most people don’t make this distinction, and would just say ‘interpreter’. Interpreting bytecode vs an AST is a pretty minor difference. It’s exactly the same data in a slightly different format. The ‘compilation’ is just a post-order linearisation. And storing it in files or not even more so.


As I'm sure you're aware, bytecode interpretation typically implies a better-performing model than AST interpretation, and compiling into bytecode opens up a lot of optimization opportunities that are not typically feasible when working with an AST directly. Of course it's all bits and anything is possible, but it's assumed to be a better approach in a generally non-subtle way.


To clarify my comment, I did mean a bytecode interpreter.

This is a common implementation approach: parse the source to generate an AST, transform the AST to bytecode, then interpret the bytecode. It's still interpretation, and it's slow. Contrast this with JIT engines, which transform the intermediate code (whether that's AST or bytecode) into machine code, which is fast.


> someone correct me if I'm wrong, but this would apply to well-known interpreted languages like Perl 5

Perl uses the same execution method you describe for CPython.


This is in the process of being addressed - look into the HPy project


Python is a bit more dynamic than JS, which makes it uniquely hard to optimize. There is more improvement to be had, however, and it is being done.


Right, but I think we know how to optimise all these things. It's all solved problems.


A few things are impossible without changing/subsetting the language. That's what I was trying to get at.


I think it's more that CPython is so slow that a lot of the things people use are implemented using the C API, and many optimizations would break a bunch of them. If everything were pure Python, the situation would be different.


What things are you thinking of?

(Not trying to interrogate you or prove you wrong, but I've got an interest in optimising very difficult meta-programming patterns.)


Nearly everything (or is it everything?) in memory can be modified at runtime. There are no real constants for example. The whole stack top to bottom can be monkeypatched on a whim.

This means nothing is guaranteed and so every instruction must do multiple checks to make sure data structures are what is expected at the current moment.

This is true of JS as well, but to a lesser extent.
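
A toy illustration of that dynamism:

    import math

    class Greeter:
        def greet(self) -> str:
            return "hello"

    Greeter.greet = lambda self: "patched"  # rebind a method at runtime
    print(Greeter().greet())                # prints "patched"

    math.pi = 3.0  # even stdlib "constants" are just rebindable attributes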


> so every instruction must do multiple checks

Aren't all the things you mentioned already fixed by deoptimisation?

You assume constants cannot be modified, and then get the code that wants to modify constants to do the work of stopping everyone who is assuming a constant value and notifying them that they need to pick up the new value?

> To deoptimize means to jump from more optimised code to less optimized code. In practice that usually means to jump from just-in-time compiled machine code back into an interpreter. If we can do this at any point, and if we can perfectly restore the entire state of the interpreter, then we can start to throw away those checks in our optimized code, and instead we can deoptimize when the check would fail.

https://chrisseaton.com/truffleruby/deoptimizing/

I work on a compiler for Ruby, and mutable constants and the ability to monkey patch etc adds literally zero extra checks to optimised code.


No such thing as a constant in Python. You can optionally name a variable in uppercase to signal to others that it should be, but that's about it.

You can write a new compiler if you'd like, as detailed on this page. But CPython doesn't work that way and 99% of the ecosystem is targeted there.

There is some work on making more assumptions as it runs, now that the project has funding. This is about where my off-the-top-of-my-head knowledge ends, however, so someone else will want to chime in here. The HN search probably has a few blog posts and discussions as well.


> No such thing as a constant in Python. You can optionally name a variable in uppercase to signal to others that it should be, but that's about it.

Yeah that’s the point - the JIT takes that capitalisation as a hint to treat it as a true constant and bake the value in until it’s redefined.

This is all solved stuff and isn’t a barrier to implementing a powerful JIT for Python if someone wanted to.


It's solved stuff in languages other than Python. Many groups, even at Google, have tried and failed.


No, we know how to optimise all these issues. They're solved, through a combination of online profiling, inline caching, splitting, deoptimisation, scalar replacement, etc. (I wrote a PhD on it.) I don't think you could name a single Python language feature that we don't know how to optimise efficiently. (I'd be interested if you could.) But implementing them all is a difficult engineering challenge, even for Google, mainly because it involves storing a lot of state in a system that isn't designed to have state attached everywhere.


Yes, that’s what my reply means; your “no…” is poor communication style. If you think you can do better than the folks who have been working on it for a decade plus, by all means step up.


But you can't actually give any examples? Ok.

I'll give you one you could have used - the GIL - however, I'm not sure the GIL's semantics are really specified for Python; they're an implementation detail people have accidentally relied on.


If it's solved, why is Python so slow?


That's what Microsoft is paying Guido for, for the next versions of Python.


I think that's not really the plan - they're talking about just basic template compilation, nothing like V8 https://github.com/markshannon/faster-cpython/blob/master/pl....


I'm curious - if you can't have nested classes, does @dataclass(slots=True) not work?
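
i.e. whether something like this compiles:

    from dataclasses import dataclass

    @dataclass(slots=True)  # slots=True needs Python 3.10+
    class Point:
        x: float
        y: float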


Have they cleaned up Python's packaging/dependency problem yet?


You're going to have to define what you mean by that first, given that the Python packaging landscape has changed quite a bit in the past few years.


Can this work with pyinstaller to make an executable faster?


I can't see why not. I've packaged some complex dependencies with PyInstaller – on Windows. There is always a way. This wouldn't even be particularly difficult.



