Hacker News new | comments | ask | show | jobs | submit login
Show HN: Cannoli – A compiler for a subset of Python written in Rust (github.com)
292 points by joncatanio 8 months ago | hide | past | web | favorite | 94 comments



I recently finished the code for my thesis and wanted to share with you all :). The goal of the thesis was to evaluate language features of Python that were hypothesized to cause performance issues. Quantifying the cost of these features could be valuable to language designers moving forward. Some interesting results were observed when implementing compiler optimizations for Python. An average speedup of 51% was achieved across a number of benchmarks. The thesis paper is linked on the GitHub repo, I encourage you to read it!

This was also my first experience with Rust. The Rust community is absolutely fantastic and the documentation is great. I had very little trouble with the "learning curve hell" that I hear associated with the language. It was definitely a great choice for this work.

I also included PyPy in my validation section and "WOW". It blew both Cannoli and CPython out of the water in performance. The work they're doing is very interesting and it definitely showed on the benchmarks I worked with.


I still don't understand why Pypy hasn't been adopted by Google or Dropbox (the standard bearers of the Python ecosystem) as a forward looking investment. It is constantly underfunded (https://pypy.org/py3donate.html) and given the potential for the work that's happening, I don't understand why these guys don't write cheques for a few hundred k.


After I ran the experimental evaluation, I had similar thoughts. If PyPy ever matches the current version of CPython I'm not sure why one wouldn't use PyPy over CPython. The biggest hurdle is matching support for popular libraries like NumPy, Tensorflow, Pandas, Scipy etc. I know they're working on supporting these, it's definitely a lot of work to do, easier said than done.


PyPy doesn't speed up all workloads, sometimes the JIT overhead is just too large to still get a speedup in the end. E.g. the Oil shell runs slower under PyPy: http://www.oilshell.org/blog/2018/03/04.html#toc_13


(author here) Thanks for the reference :)

I should also add: Suppose PyPy was twice as fast as CPython for a given workload, but it also used twice as much memory.

I doubt Google or Dropbox would use it in that case. On large clusters, memory usage probably contributes to the need to buy machines more than CPU usage (CPU utilization can be low; memory utilization is higher).

I've personally rewritten some Python code as a C++ extension module and gotten 5x decrease in memory usage across thousands of machines.

(As far as I understand, this is the typical tradeoff for PyPy: it's faster but uses more memory. I'm happy to hear more detail on this though.)


I'm not so sure about this. The single biggest factor in data centers are cooling costs. So it stands to reason that if your CPU usage drops, so does your cooling costs (or you run double your workloads per unit of cooling).

Memory is cheap OTOH.


Memory may be cheap, but it is a limiting factor for a lot of workloads. If your memory consumption goes up by 2x, you need to provision 2x the nodes.

In fact, I've heard that one of the reasons IBM is investing in Swift is because it uses so much less memory compared to jitted Java. Apparently, most of their cloud workloads sit idle most of the time, meaning the number of client VMs/jobs/whatever they can put on a machine is almost entirely determined by memory use.

Also, memory usage to a large extent is performance and is power consumption. A random memory access is still on the order of 50-60ns, in that time you can do several hundred ALU operations. (This is assuming the memory is actually used and not just sitting around). See for example http://www.ists.dartmouth.edu/library/202.pdf

"...computation is essentially free, because it happens “in the cracks” between data fetch and data store; on a clocked processor, there is little difference in energy consumption between performing an arithmetic operation and peforming a no-op." -- Andrew Black in Object-oriented programming: some history, and challenges for the next fifty years


Yup - long running and very repetitive processes are the best fit for PyPy. If you have a slow but short-lived process then PyPy is not going to improve things for you.


This is exactly why PyPy blew both Cannoli and CPython away in the microbenchmarks used for analysis. As I've said elsewhere, the focus was on comparing Cannoli (unoptimized) to Cannoli (optimized) and not a direct comparison to CPython or PyPy. However, the microbenchmarks were running iterations of 1-10 million, giving the JIT plenty of time to find beneficial traces in the PyPy interpreter.


BTW, the test I did where PyPy was slower than CPython ran for a minute or so (IIRC). It wasn't that long lived, but it wasn't like the "instant" invocation you often see with shell scripts either.

I don't think the JIT warmup was the main issue there; I think it was PyPy's lack of ability to optimize certain kinds of code combined with increased memory usage.


I am hoping Facebook to fund PyPy given that Instagram runs on Python.

It seems Google and Dropbox are not interested. Google is working on Grumpy, Dropbox worked on Pyston.


I thought so too. But Google has a history of funding multiple initiatives in parallel. Grumpy is a very different effort to translate Python to Go - but Pypy is more or less drop in.

Pyston also seems dead - nobody has committed in a year. You have to give kudos to the Pypy devs. The level of passion contributing countless hours to an underfunded, high impact project must be incredible.


It seems that Instagram is making tweaks here and there, as reported in this weeks LWN for example:

https://lwn.net/SubscriberLink/754163/a38214c50e7b3ece/

To be honest I barely use python, but when I read lwn, etc, I get the impression multiple people are tryign to solve multiple problems with python (from the GiL onwards).

It seems like a hard problem, given the dynamic nature of the language and the unwillingness to break the C API.


thanks for linking to that!

> Thomas Wouters asked if he had looked at PyPy. Shapiro said the company had, but there was only a modest bump in performance for its workload. He was not the one who did the work, however. Wouters noted that PyPy is more than "Python with a JIT" because it has its own data model as well.

This is interesting. How much was a "modest bump" in performance ? And why was the bump in performance not a reason for adoption ? Does Pypy break a lot of stuff ?

Oh and this

> Some of what Shapiro presented did not sit well with Guido van Rossum, who loudly objected to Shapiro's tone, which was condescending, he said. Van Rossum thought that Shapiro did not really mean to be condescending, but that was how he came across and it was not appreciated. The presentation made it sound like Shapiro and his colleagues were the first to think about these issues and to recognize the inefficiencies, but that is not the case. Shapiro was momentarily flustered by the outburst and its vehemence, but got back on track fairly quickly.

> Shapiro's overall point was that he felt Python sacrificed its performance for flexibility and generality, but the dynamic features are typically not used heavily in performance-sensitive production workloads. So he believes it makes sense to optimize for the common case at the expense of the less-common cases. But Shapiro may not be aware that the Python core developers have often preferred simpler, more understandable code that is easier to read and follow, over more complex algorithms and data structures in the interpreter. Some performance may well have been sacrificed for readability.


Grumpy development has stopped, at least on the grumpy github repo.


On Google's case it appears they see more worthwhile to migrate their Python code into Go and Swift than improving Python runtimes.

Remember Unladen Swallow?


Unladen swallow has never been an official attempt to solve that. It was the initiative of one guy (http://qinsb.blogspot.fr/2011/03/unladen-swallow-retrospecti...), during an internship, with basically zero support. And so google never even consdered as a serious solution.

Beside, when they took the Go decision, they were still using Python 2.4. It had poor unicode support, bad async io, no multiprocessing pool, aweful packaging and deployment story and the project of a 3.X breaking everything.

It made a lot of sense, business wise. If you have to rewrite your code base, better rewrite it in a language you control (no PSF to fight against), specifically design for your workfload, and fixing all those quirks, plus running faster and eating each memory.

Today's Python unicode support is top notch, asyncio ensure easy networking code, you can use multi-core easily, deployment is pretty much a solved problem and 3.X is running pretty much for 90% of people. You even have a hook to plug in a JIT inside CPython, waiting to be used, and Type hints to facilitate the handling of huge code bases.

Their decision would probably have been different then, although that wouldn't have made Python any easier to speed up. But if a few smart people managed to make it work, by rewritting python in python, no less, I'm sure a dedicated google team would have done great.

They did it for the slug that was JS after all. But they didn't have the luxury to be able to change the language for that. So they pourred millions into it. Not one guy during an internship.


What kind of Python code are they migrating to Swift? On the server? If so, I'm a bit surprised they're already using Swift on the server. It's a cool language, but for Linux servers it seems really young.



Wouldn't the biggest issue still be that C modules either don't work or are slow? I'd imagine it's much better to be able to solve performance bottlenecks by using cython/c than to have overall faster runtime, but no option to go further.


The Python ecosystem in general is severely underfunded despite all big players using it extensively.

I think one reason is that the community is doing too good of a job. The language is pretty sane, it solves most problems right, the libs and docs are good, and the general direction thinks take is reasonable. And it's free not only as beer and freedom, but also free from business influences. The PSF is really giving away pretty much everything.

Everybody contribues a little (we have the brett canon team from ms, the guido team from dropbox, the alexi martelli team from google, mozilla even donated for pypy, etc). But it's nothing massive. Nobody said "ok here is 10 millions euros, solve the packaging problem".

Compare to JS: the language started as slow, with terrible design, and no consensus on the direction to take. So eventually, people (Google first) pourred a load of money to it until it became usable, and they had a cleaner leadership. They had huge problem to solve on the ever expending market that is the web plateform. Of course JS as the unfair advantage of a captive audience and total monopoly on its field.

Remember Unladen shallow ? "Google" attempt to JIT Python ? It was just one guy during his internship (http://qinsb.blogspot.fr/2011/03/unladen-swallow-retrospecti...).

And look at the budget the PSF had in 2011 to help the community: http://pyfound.blogspot.fr/2012/01/psf-grants-over-37000-to-... I mean, even today they have to go though so many shenanigans for barely 20k (https://www.python.org/psf/donations/2018-q2-drive/).

But at the same time you hear people complaining they yet can't migrate to Python 3 because they have millions of lines of Python. You hear of them when they want to extend the support for free, but never to support the community.

It's ridiculous.

Also compare to PHP: the creators made a business out of it, plain and simple.

Compare to Java/C#/Go: it's owned by huge players that have a lot of money engaged.

Python really needs a sugar daddy so that we can tackle the few items remaining on the list:

- integrated steps to make an exe/rpm/deb/.app

- JIT that works everywhere

- mobile dev

- multi-core with fast and safe memory sharing

There are projects for that (nuikta, pyjion, kivi, etc), but they all lack of human power, money and hence integration, perfs, features, etc.

You need a simple way to code some GUI, make it work on mobile or desktop, turn it into and exe and distribute it.

You need a simple way to say "this is a long running process, JIT the hell out of it".


> So eventually, people (Google first) pourred a load of money to it until it became usable, and they had a cleaner leadership.

Google was not the first company to pour lots of money into JavaScript. Remember the first browser war?


Yes, and JS was not the center of it at all. Support was inconsistent, it was slow, leaked memory...

Actually, most actors tried to find a way to replace JS with something in house (ActionScript, ActiveX, Java applets) so they could dominate the market.


Most scripting languages were slow at that point, and memory leaks are still rampant in popular browser engines. Microsoft had an enormous team working on "JScript". I wouldn't be surprised if the size of that team significantly exceeded the size of the V8 team in the early days.

Netscape and later Mozilla were also heavily investing in JavaScript long before Google came on the scene.


Engineers usually have more fun rewriting everything in “that new shiny tool that people is speaking about”.

Managers enjoy avoiding conflicts.

Very rarely someone in a position of power will point out to this kind of solution, which anyway is going to be against wishes of many employees.


Interesting work! I have a bunch of comments and questions.

> The goal of the thesis was to evaluate language features of Python that were hypothesized to cause performance issues.

In another life I did something similar using a similar compiler simulation technique, looking at other Python features like redundant reference count operations, boxing of numbers, dynamic type checks etc. See G. Barany, Python Interpreter Performance Deconstructed. Dyla'14. http://www.complang.tuwien.ac.at/gergo/papers/dyla14.pdf

After obtaining the numbers in that paper, the work didn't really go anywhere; there were no really obvious optimizations to try based on the data. But it was fun!

Anyway, questions:

1. If I understand the source on GitHub correctly, you parse Python source code yourself. I'm fairly sure your simulation would be a lot more faithful if you compiled Python bytecode instead. Did you consider this, and if yes, was there a particular reason not to do it that way?

I ask this in particular because if I understand your thesis correctly, you look up local variables in hash tables every time they are referenced. This is not what Python does: It maps variable names to integer indices during compilation to bytecode, and the bytecode just takes those embedded constant indices and indexes into an array to obtain a local variable's value. That's a lot faster. And you would get it automatically if you started from bytecode. (Plus, it would be easier to parse, but if you have fun parsing stuff, that's reasonable too.)

2. Where do you actually make useful use of Rust's static ownership system? I've only skimmed that part of the thesis very quickly, but I missed how you track ownership in Python programs and can be sure that things don't escape. Can you give an example of a Python program using dynamic allocation that your compiler maps to Rust with purely static ownership tracking and freeing of the memory when it's no longer used?

3. Related to 2: Why bother with any notion of ownership at all? Did you try mapping everything to Rust's reference counting and just letting it do its best? I'm wondering how much slower that would be. Python is also reference counted, after all, and I guess the Rust compiler should have more opportunities to optimize reference counting operations.

4. In general, do you have an idea why your code is slower than Python, besides the hash table variable lookup issue I mentioned above?


Regarding the bytecode - it was always considered an internal implementation detail subject to change (unlike the JVM bytecode) and in 3.6 they have in fact made a fairly major change[1]:

"The Python interpreter now uses a 16-bit wordcode instead of bytecode which made a number of opcode optimizations possible."

They haven't been shy about changing it in the past either, since there's no guarantee of stability, so it's likely to continue to change in incompatible ways.

[1] https://docs.python.org/3/whatsnew/3.6.html#optimizations


Good point, and shows that I haven't kept up with the latest developments here :-) Though in the context of a one-off research project, I still think picking a concrete Python version and using its byte/wordcode is the better choice. It won't be an enterprise-quality project with a long lifetime, but neither is the author's solution of parsing (a subset of) Python manually.


> you look up local variables in hash tables every time they are referenced. This is not what Python does: It maps variable names to integer indices during compilation to bytecode, and the bytecode just takes those embedded constant indices and indexes into an array to obtain a local variable's value.

This is only true for function arguments right? Module level bindings and class and object attributes are looked up in dictionaries. I think the same for variables used in closures too?


And for any local variables, not just arguments.


> Module level bindings and class and object attributes are looked up in dictionaries.

Depends on __slots__, yes?


> the work didn't really go anywhere;

That's really too bad.

> there were no really obvious optimizations to try based on the data.

Is that because Python already is the way it is? In other words, if you started from scratch, how would you design a language differently so that it doesn't run into these issues?

Asking for a friend ;-)


> Is that because Python already is the way it is?

Partly, yes. But also because the optimizations I would have come up with were all tried by Stefan Brunthaler before. See some citations in the paper, in particular to: http://arxiv.org/abs/1310.2300 where he claims a 4x speedup over CPython using purely interpreter-based and safe optimizations. So I guess I should have written "no really obvious and new optimizations to try".

Anyway, Stefan's work in general is very interesting for implementors of dynamic language interpreters. A lot of it harks back to things that were first developed for Smalltalk, which I think is where you are coming from?

> In other words, if you started from scratch, how would you design a language differently so that it doesn't run into these issues?

I don't think there would be a need to start from scratch. Python 3 is a pretty good language! Also, it's not clear that it needs to be fast, it's just fun to think about how it might be fast.

That said, I think there is only one key feature missing to enable aggressive optimizations: An annotation on functions/methods that says that the function may be optimized even at the cost of some loss in reflective capabilities. In current Python any function's code can be inspected and changed at run time, as can its local variables by callees who can walk up the call stack. This means that every optimization that compiles Python code to machine code or even just changes the interpreter's handling of local variables can, in general, change the program's semantics.

I think it would suffice to add a @static decorator to disable these reflective features for individual functions. (And disabling dynamic lookup of globals would be good too.) The interpreter could then recognize that you want those optimized and could do whatever magic it likes without caring about boring corner cases like invalidating its optimizations if someone tries to patch the function's code at runtime.

This would not be a big thing to do, and there would be no real need to restart from scratch; Python 3 is a pretty good programming language!

Everything else would pretty much fall out automatically from that using known/already implemented techniques, especially with the type hints that would allow you to gradually move a fully dynamic, interpreted application to a statically typed, compiled one function by function. Such a Python would still not win the speed crown, but it could beat the current one by miles on a lot of applications.


That work is great!

> We have presented the first limit study that tries to quantify the costs of various dynamic language features in Python.

This is spot on what we were doing as well, that's great to have this as a reference.

> 1. If I understand the source on GitHub correctly, you parse Python source code yourself. I'm fairly sure your simulation would be a lot more faithful if you compiled Python bytecode instead. Did you consider this, and if yes, was there a particular reason not to do it that way?

We did not consider this actually. This would be a very interesting concept to explore. For the unoptimized version of Cannoli we do look up variables in a list of hash tables (which represent the current levels of scope). We did perform a scope optimization that then uses indices to access scope elements and this was much faster. However, it meant that the use of functions like `exec` and `del` were no longer permitted since we would not be able to statically determine all scope elements at run time (consider `exec(input())`, this could introduce anything into scope and we can't track that).

If you know, how does CPython resolve scope if it maps variable names to indices? In the case of `exec(input())` and say the input string is `x = 1`, how would it compile bytecode to allocate space for x and index into the value? I don't have much experience with the CPython source, so please excuse me if the question seems naive :)!

> 2. Where do you actually make useful use of Rust's static ownership system? I've only skimmed that part of the thesis very quickly, but I missed how you track ownership in Python programs and can be sure that things don't escape. Can you give an example of a Python program using dynamic allocation that your compiler maps to Rust with purely static ownership tracking and freeing of the memory when it's no longer used?

Elements of the Value enum (that encapsulates all types) relied on `Rc` and `RefCell` to defer borrow checking to run time. Consider a function who has a local variable that instantiates some object. Once that function call has finished Cannoli will pop that local scope table and all mappings will be dropped when it goes out of scope. The object encapsulated in a `Rc` will have it's reference count decremented to 0 and be freed.

This is how I've interpreted the Rust borrow checker, I will say that this was the first time I had ever used Rust so it's possible that I am not completely right on this. But once that table goes out of scope, all elements should be dropped by the borrow checker and any Rc should be decremented/dropped.

> 3. Related to 2: Why bother with any notion of ownership at all? Did you try mapping everything to Rust's reference counting and just letting it do its best? I'm wondering how much slower that would be. Python is also reference counted, after all, and I guess the Rust compiler should have more opportunities to optimize reference counting operations.

I did defer a lot of borrow checking to run time with Rc, but I tried to use this as little as possible to maximize optimizations that may result from static borrow checking.

> 4. In general, do you have an idea why your code is slower than Python, besides the hash table variable lookup issue I mentioned above?

If you remove the 3 outlier benchmarks (that are slow because of Rust printing and a suboptimal implementation of slices), Cannoli isn't too far off from CPython. And in fact, with the ray casting benchmark, Cannoli began to outperform CPython at scale. This leads me to believe that the computations in Cannoli are faster than CPython. However, there is still a lot of work to do to create a more performant version of Cannoli. The compiler itself was only developed for ~4 months, I have no doubt that more development time would yield a better results.

That being said, I think the biggest slowdown comes from features of Rust that might not have been utilized. This is just speculation, but I think the use of lifetimes could benefit the compiled code a lot. I also think there may be more elegant solutions to some of the translations (e.g. slices), that could provide speedup. But I can't say that there is one thing causing the slowdown, and profiling the benchmarks (excluding the outliers) support that.


Thanks for your answers!

> This leads me to believe that the computations in Cannoli are faster than CPython.

Right, the speed difference due to output buffering seems annoying. I wonder if you can change the stdout buffering mode to be more similar to Python's (if Python's stdout is not line buffered, I don't recall).

Also, you might want to try to run the same benchmarks PyPy uses: http://speed.pypy.org/

These are still micro-ish benchmarks, but not quite as micro as some of yours.

> However, there is still a lot of work to do to create a more performant version of Cannoli.

I saw in another post that you are going off to industry to do something else, but if you will continue development and evaluation of Cannoli and want to discuss further, feel free to drop me a line at the email address in the paper linked above.


In Python 2, the bytecode optimization that lets you access variables by index is turned off if your function has exec code in it that may modify locals.

So if your function is:

   def foo(): exec "a=1"; return a
Then running dis.dis on foo to disassemble the bytecode it you will see:

              8 LOAD_NAME                0 (a)
while you normally would see:

              6 LOAD_FAST                0 (a)
You can't use the exec in locals thing in Python 3 at all I believe.


I just tested it in Python3:

    def foo():
        exec("a=1")
        return a

    print(foo())
Fails with a NameError:

    Traceback (most recent call last):
      File "test.py", line 5, in <module>
        print(foo())
      File "test.py", line 3, in foo
        return a
    NameError: name 'a' is not defined


I get the same error with your example, but this works fine (Python 3.6.4):

    exec("a = 1")
    print(a)
This will print "1".


That is because the exec runs in the global scope. When Python sees a variable that is not assigned to in the local scope, it is assumed to be a global variable, so when exec creates a new local variable, the load still fails because it looks into the globals dictionary.

But you can do this:

  def foo():
    exec("a = 1")
    return locals()['a']


Ahh, I see, this makes more sense. Thanks for the clarification!


I am aware of PyPy but have not used it myself. My understanding of PyPy is that it gains performance improvements mainly through a hotspot JIT compiler. If Cannoli compiles the entire Python program down to machine code (via rust) then how does PyPy "blow it away"?


As others have commented, AOT compilation is limited to the information available at compile time. Various features of Python like dynamic typing and object/class mutation (via del) preclude many static analysis techniques. In Cannoli, this meant that the compiler had to also generate code that manages scope at run time. Whenever an identifier was encountered in the compiled code a hashmap would be searched to find the bound value. This overhead becomes expensive, and the thesis covers optimizations that avoid this. PyPy's JIT operates on the PyPy interpreter itself, finding linear lists of operations that are frequently used. It can then compile these operations to bytecode so the next time that trace is encountered it can execute the compiled code. The self-analysis at run time provides information that an AOT compiler just doesn't have.

That being said, I did leave a few suggestions in the "future work" section that talk about writing an AOT compiler for RPython (the version of Python that PyPy's interpreter is written in). This would provide more information at compile time and would be an interesting comparison between a Python interpreter compiled AOT versus a Python interpreter with a JIT (PyPy).


> As others have commented, AOT compilation is limited to the information available at compile time

Does this mean that a AOT compiler could get a lot faster with pep 484 type hints?


Yes :)! So I mention PEP 3107 (https://www.python.org/dev/peps/pep-3107/) in the thesis. This allows type annotations in the function signature. Cannoli leverages both type annotations in function signatures and assignments to output optimized code. Other projects like Pythran also use type annotations.

PEP 3107 does say:

> By itself, Python does not attach any particular meaning or significance to annotations.

However, I think this will change especially as more projects begin to outperform CPython. In the "Results > Object Optimization" section of the thesis paper, I cover using these very type annotations to optimize the code.

The biggest problem with Python annotations right now is that they don't really mean anything. Nothing is really enforced so it is totally valid to have 'x : int = "string"'. The compiler would have to just ignore this annotation since it was provided the wrong data. This could also be difficult to identify if a variable was being used and its type mislabeled. So it's not perfect but I think it's a step in the right direction.


Only if you change the semantics of the language (e.g. raise an exception if you pass in a value of the wrong type). Otherwise, the type hints probably aren't super useful because the language doesn't actually enforce them.

Edit: That said, if the compiler can determine with certainty that the type signature is always obeyed, then yes, you could apply optimizations to remove a lot of the runtime overhead. I imagine this would be rather difficult to do unless you have type annotations for all (or the vast majority) of your code, including third-party libraries.


JIT'ers tend to be aggressive and disregard language semantics. As far as I know, the java hotspot compiler goes ahead and inlines implementations of calls on generic interfaces, as long as a couple of hundred calls to into the same implementation method. For large methods, it might not inline, but it'll remove the conditional jump based on the type with a direct jump.

That's obviously wrong, examples to show that are trivial. But that's why the direct call also contains a trap to check the type to perform de-optimization if the assertion appears to be wrong. But a simple 'assert type byte sequence is a known value' is still faster than a dynamic jump. A lot faster.


> That's obviously wrong, examples to show that are trivial.

Does that actually affect the behavior of the code, though? That seems to me like a private implementation detail, but I don't know much about the HotSpot JIT or the JVM in general.


Instead of an exception you could fall back to a generic version of the function/block, like deoptimization in JITs.


Deoptimization is actually really hard to implement if you have an ahead of time compiler like Cannoli. You need to get all the stuff that is living in machine registers or in the C stack and then convert them back to whatever representation your generic interpreter uses.

I think this is actually one of the things that most get in the way if you want to use traditional AOT compiler technology (like gcc or LLVM) to implement a JIT. In state of the art JIT compilers this part is always a nest of highly complex and nonportable assembly language.


Compiling to machine code is not a panacea for optimization. A optimized JIT compiler is going to blow an AOT compiler out of the water. Being smart about the machine code generated is significantly more important than generating machine code. In particular, PyPy makes several optimizations over python code that a more direct implementation of CPython at the machine level probably wouldn't. For example, PyPy erases dictionary lookups for object member access if the object shape is statically known. Given how prevalent this kind of lookup is in Python code, it's possible that even an interpreter that made this optimization would be faster than a machine code version that used an actual hash table.

I think this compiler also makes this particular optimization, but this is just one of many many optimizations PyPy does. I imagine that with sufficient work, this compiler could be brought up to speed with PyPy, but as it stands right now, PyPy simply benefits from having years of optimization work that a new project doesn't.


For most dynamic languages, the available speedups aren't in simple compilation, but in removing the runtime type checks, method lookups, and other slow operations. This needs the ability to guess at what the code is going to do based on past behavior, and generate specialized versions that get thrown away if the guesses are invalidated.

So, for example, you might see that the last 100 calls to a function were done with integers, so you can generate a variant of the function that only works for integers, and check if it's applicable when you enter the function. If that function stops getting used, you can throw it away.

Doing that well ahead of time requires an extremely good idea of how the program will behave at run time, and even with good information, is still very likely to bloat up your binary hugely. (Facebook used to compile their PHP codebase to a multi-gigabyte binary before moving to HHVM, for example).


Actually I think you answered a question that I already asked about Rust being faster than C. If you don't need to carry out as many checks then I see how that will speed things up.


JITs get to analyze both code and data and optimize for each machine deployed to. A static compiler can only analyze code and the machine used for compilation. If dependencies were pre-compiled, the static compiler won't be able to optimize their relationship with the project. If the machine is changed for deployment.

More information means better optimizations. JITs FTW.


JITs also tend to optimize for compilation time over final performance. Doing ahead-of-time superoptimization or polyhedral loop transformation isn't going to happen in a JIT.


There's no restriction on the kind of optimization a JIT can do. Perhaps there's a current implementation tendency. In contrast, ahead-of-time (data- and platform-ignorant) compilers are restricted.


Question, is Rust inherently faster than C?

I thought the main benefits were safer code. Is it just the fact that you looked at what needed optimized and put some effort in or did the language choice help?


It can be, but "inherently" is a bit strong. There's also the question of "the best Rust programmer vs the best C programmer" vs "the average Rust programmer" vs the "average C programmer" here too.


IIRC Nuitka the other Python compiler claims better performance than PyPy. Is this just years of optimisation?

http://nuitka.net/


Does it claim better performance than CPython or PyPy? I can't quite find the reference to PyPy (after a quick scan of the page/github repo. It looks like a cool project! They seem to be doing a lot of optimizations, which they list on their github page https://github.com/kayhayen/Nuitka#optimization. It looks like the git repo was created ~2013 (I dunno if it was hosted/worked-on elsewhere prior to that) so they've had a few years to optimize. Cool project though!


It doesn't claim that at all.


This could be used to ditch the Python interpreter and distribute Python binaries for Win/Linux/Mac. When will standard library, exceptions, and inheritance support be added?


Probably never. This is a subset of Python that would break most libraries.

The author says that this is a research project, and adding the standard library (if possible at all) would be a humongus task by itself.


A large part of the Python standard library is actually written in Python itself. You can even use inspect.getsource to look at the source of inspect.getsource itself.

So if you can already compile arbitrary Python, you only need to reimplement the parts that are written in C, or maybe just rewrite them to use a different FFI API.


It can't compile arbitrary Python, it's only a subset of Python (removing some of the dynamic features that might be used in the standard library).


Now something that would be interesting: writing python extensions in rust.


This is already quite possible! There are even multiple libraries to help you get started. Extending languages like this is a huge use case for Rust; one of the first production Rust uses was extending Ruby like this.


Hey Steve, glad to see you saw this post, now I get to personally thank you for all of your work on the Rust project. It's a really great community, it was very easy to get help in the IRC when I'd get stuck on new-to-me concepts. The documentation was also incredible, and the language itself is awesome! So thanks, and keep up the great work over there, I'm going to be definitely peddling Rust when I can haha.


Thanks! I have your paper open in a tag, im excited to dig in more. Congrats!


Nuitka is also an alternative implementation of Python.


> I had very little trouble with the "learning curve hell" that I hear associated with the language

Dude, you're doing a thesis in computer science. Of course it's easy for you.


Well, a professional software developer should be at least on the same level - and these comments are made by professionals.


> Well, a professional software developer should be at least on the same level

You can become a "professional software developer" in less time than it takes to settle the subject of your thesis - generally two years.


This was a masters thesis, and those normally don't take 2 years to choose the subject


Well, yes I see your point. But I wouldn't say that it was easy, just not as bad as it had been hyped up to be. Although I think most of that sentiment comes from older Rust versions, when lifetimes were defined in more places and you had to learn some of the more advanced concepts.


This isn't a really important question, but why the name Cannoli? I feel like you missed an opportunity here to call it "PycRust". (c standing for "compiled to" of course)


My heritage is Italian and I happened to be eating cannoli when I started writing the parser around December, figured it'd be a fun name :)


Well, it is a fun name.

Do you plan to support this down the line? I feel like compilers that compile down to Rust could become really good tools depending on how the popularity of the language goes.


I would certainly like to. The base implementation of Cannoli is the best place to continue this. Some optimizations that were implemented apply restrictions to parts of the language and following these any further would ultimately diverge from the Python project. That being said, many of the optimizations could still be done and "fall back" on unoptimized code at run time if needed. I think this project certainly shows a lot of promise, it just hasn't been developed as long as other projects so it still needs plenty of work :).


Having been around the Rust community for a few years, I'm a bit tired of -rs and rust names.

I feel like the postfix is something languages lose the bigger they get, I hope this happens to rust too.


I think I read this as "Pike Rust" about 5 times before I got the joke. I guess it's time for that 2PM coffee.


Took me a minute too. "pie crust".


Fortunately, that means the "pie crust" name is still available for a Pike[1]†-Rust project :-)

[1]: https://en.wikipedia.org/wiki/Pike_(programming_language) †: I have a particular fondness for Pike, because it's derived from the first programming language I ever used: LPC. Anyone else an LPC hacker?


Nice work Jon! The cannoli logo is great!

Spun up a quick dashboard of the project here: https://chart.ly/github-dashboard/joncatanio/cannoli

Not tons of revelations there, but cool to see your longest streak was 7 days straight committing to the repo. Also cool to know this is part of your thesis.

What are your plans after Cal Poly?


This is very cool! Thanks for doing that :)!

I'm actually moving out to NYC this July to work for Major League Baseball. The Advanced Media division (MLBAM). I'll be doing some software engineering there, mainly API work for various apps, I'm very excited about it!

I'll have to work on compilers in my free time haha, I really enjoyed the work I did on this thesis.


This is awesome! I definitely recommend reading through Jon's thesis (link on GitHub). It's well written and very readable even if you know nothing of Rust or compilers.


How was your experience using Rust as a target language (instead of C)? I understand that Rust has lots of features for when you want to write code by hand but do those also help when you are working with generated code? Or does the borrow checker get in the way all the time?


Great question! The "Compiling Python" section of my thesis is pretty much an explanation of how I had to translate elements of Python into Rust because of the borrow checker. There were a couple tricks (like using closures for functions) to getting around compile-time borrow checking. Some situations required the use of Rc & RefCell to provide multiple references to mutable data, this defers borrow checking to run time. So yes, the borrow checker got in the way. But I didn't have to write a garbage collector because the automatic memory management was handled via Rust's ownership rules (the caveat here is with cyclical references which would need to be tracked, this work was omitted for time).

It does complicate the generated code, I don't know if Rust is the greatest intermediate representation. But I do think it was a better choice than C. Debugging the generated code was so great because of the detail that the Rust compiler displays for warnings/errors.

I'd be interested in seeing how a Python interpreter written in Rust would compare to CPython, this would probably make use of more Rust optimizations (than trying to generate code).


Ah, I hadn't realized that Cannoli is also using Rust-style memory management. In that case compiling to Rust would certainly help a lot.


Not to answer your question but if you want C / C++, have a look at Shedskin

https://github.com/shedskin/shedskin


Interesting project... Why python, out of curiosity?


I've used Python quite a bit for various projects. For a compilers class I wrote a compiler in Python and had a blast. So I spoke with that advisor and decided I wanted to get a Master's and he had suggested a project that analyses Python. The main question concerned which dynamic features of a language cause performance issues. Python just happened to have a lot of the features that we hypothesized caused slowdowns so we chose it. Plus we were both familiar with the language so that was a draw.

The same analysis could be done on JS or Ruby, it would be cool to see if a similar compiler would yield the same performance results for restricting features in JS/Ruby. It would also validate this work nicely as well.


The name of the thesis this repo is a part of

> Leave the Features: Take the Cannoli - Jonathan Catanio

That's pretty good


Leave the features - take the cannoli.

Made me laugh out loud.


Skimmed it. They spent a lot of time flushing printing to the screen




Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact

Search: