This was also my first experience with Rust. The Rust community is absolutely fantastic and the documentation is great. I had very little trouble with the "learning curve hell" that I hear associated with the language. It was definitely a great choice for this work.
I also included PyPy in my validation section and "WOW". It blew both Cannoli and CPython out of the water in performance. The work they're doing is very interesting and it definitely showed on the benchmarks I worked with.
I should also add: Suppose PyPy was twice as fast as CPython for a given workload, but it also used twice as much memory.
I doubt Google or Dropbox would use it in that case. On large clusters, memory usage probably contributes to the need to buy machines more than CPU usage (CPU utilization can be low; memory utilization is higher).
I've personally rewritten some Python code as a C++ extension module and gotten a 5x decrease in memory usage across thousands of machines.
(As far as I understand, this is the typical tradeoff for PyPy: it's faster but uses more memory. I'm happy to hear more detail on this though.)
Memory is cheap OTOH.
In fact, I've heard that one of the reasons IBM is investing in Swift is because it uses so much less memory compared to jitted Java. Apparently, most of their cloud workloads sit idle most of the time, meaning the number of client VMs/jobs/whatever they can put on a machine is almost entirely determined by memory use.
Also, memory usage to a large extent is performance, and is power consumption. A random memory access is still on the order of 50-60 ns; in that time you can do several hundred ALU operations. (This is assuming the memory is actually used and not just sitting around.) See for example http://www.ists.dartmouth.edu/library/202.pdf
"...computation is essentially free, because it happens “in the cracks” between data fetch and data store; on a clocked processor, there is little difference in energy consumption between performing an arithmetic operation and peforming a no-op." -- Andrew Black in Object-oriented programming: some history, and challenges for the next fifty years
I don't think the JIT warmup was the main issue there; I think it was PyPy's lack of ability to optimize certain kinds of code combined with increased memory usage.
It seems Google and Dropbox are not interested. Google is working on Grumpy, Dropbox worked on Pyston.
Pyston also seems dead; nobody has committed in a year. You have to give kudos to the PyPy devs. The level of passion it takes to contribute countless hours to an underfunded, high-impact project must be incredible.
To be honest I barely use Python, but when I read LWN, etc., I get the impression multiple people are trying to solve multiple problems with Python (from the GIL onwards).
It seems like a hard problem, given the dynamic nature of the language and the unwillingness to break the C API.
> Thomas Wouters asked if he had looked at PyPy. Shapiro said the company had, but there was only a modest bump in performance for its workload. He was not the one who did the work, however. Wouters noted that PyPy is more than "Python with a JIT" because it has its own data model as well.
This is interesting. How much was a "modest bump" in performance? And why was the bump in performance not a reason for adoption?
Does PyPy break a lot of stuff?
Oh and this
> Some of what Shapiro presented did not sit well with Guido van Rossum, who loudly objected to Shapiro's tone, which was condescending, he said. Van Rossum thought that Shapiro did not really mean to be condescending, but that was how he came across and it was not appreciated. The presentation made it sound like Shapiro and his colleagues were the first to think about these issues and to recognize the inefficiencies, but that is not the case. Shapiro was momentarily flustered by the outburst and its vehemence, but got back on track fairly quickly.
> Shapiro's overall point was that he felt Python sacrificed its performance for flexibility and generality, but the dynamic features are typically not used heavily in performance-sensitive production workloads. So he believes it makes sense to optimize for the common case at the expense of the less-common cases. But Shapiro may not be aware that the Python core developers have often preferred simpler, more understandable code that is easier to read and follow, over more complex algorithms and data structures in the interpreter. Some performance may well have been sacrificed for readability.
Remember Unladen Swallow?
Besides, when they took the Go decision, they were still using Python 2.4. It had poor Unicode support, bad async IO, no multiprocessing pool, an awful packaging and deployment story, and the prospect of a 3.x release breaking everything.
It made a lot of sense, business-wise. If you have to rewrite your code base, better to rewrite it in a language you control (no PSF to fight against), one specifically designed for your workload, fixing all those quirks, running faster, and eating less memory.
Today's Python Unicode support is top notch, asyncio enables easy networking code, you can use multiple cores easily, deployment is pretty much a solved problem, and 3.x works for pretty much 90% of people. You even have a hook to plug a JIT into CPython, waiting to be used, and type hints to facilitate the handling of huge code bases.
Their decision would probably have been different today, although that wouldn't have made Python any easier to speed up. But if a few smart people managed to make it work, by rewriting Python in Python, no less, I'm sure a dedicated Google team would have done great.
They did it for the slug that was JS, after all. But there they didn't have the luxury of being able to change the language. So they poured millions into it. Not one guy during an internship.
I think one reason is that the community is doing too good a job. The language is pretty sane, it solves most problems right, the libs and docs are good, and the general direction things take is reasonable. And it's free not only as in beer and freedom, but also free from business influences. The PSF is really giving away pretty much everything.
Everybody contributes a little (we have the Brett Cannon team from MS, the Guido team from Dropbox, the Alex Martelli team from Google, Mozilla even donated for PyPy, etc.). But it's nothing massive. Nobody has said "OK, here is 10 million euros, solve the packaging problem".
Compare to JS: the language started out slow, with terrible design and no consensus on the direction to take. So eventually, people (Google first) poured loads of money into it until it became usable, and they had a cleaner leadership. They had huge problems to solve on the ever-expanding market that is the web platform. Of course, JS has the unfair advantage of a captive audience and a total monopoly on its field.
Remember Unladen Swallow? "Google's" attempt to JIT Python? It was just one guy during his internship (http://qinsb.blogspot.fr/2011/03/unladen-swallow-retrospecti...).
And look at the budget the PSF had in 2011 to help the community: http://pyfound.blogspot.fr/2012/01/psf-grants-over-37000-to-... I mean, even today they have to go through so many shenanigans for barely 20k (https://www.python.org/psf/donations/2018-q2-drive/).
But at the same time you hear people complaining that they still can't migrate to Python 3 because they have millions of lines of Python. You hear from them when they want the free support extended, but never to support the community.
Also compare to PHP: the creators made a business out of it, plain and simple.
Compare to Java/C#/Go: they're owned by huge players with a lot of money at stake.
Python really needs a sugar daddy so that we can tackle the few items remaining on the list:
- integrated steps to make an exe/rpm/deb/.app
- JIT that works everywhere
- mobile dev
- multi-core with fast and safe memory sharing
There are projects for that (Nuitka, Pyjion, Kivy, etc.), but they all lack manpower and money, and hence integration, performance, features, etc.
You need a simple way to code a GUI, make it work on mobile or desktop, turn it into an exe, and distribute it.
You need a simple way to say "this is a long running process, JIT the hell out of it".
Actually, most actors tried to find a way to replace JS with something in house (ActionScript, ActiveX, Java applets) so they could dominate the market.
Managers enjoy avoiding conflicts.
Very rarely will someone in a position of power push for this kind of solution, which in any case will go against the wishes of many employees.
> The goal of the thesis was to evaluate language features of Python that were hypothesized to cause performance issues.
In another life I did something similar using a similar compiler simulation technique, looking at other Python features like redundant reference count operations, boxing of numbers, dynamic type checks etc. See G. Barany, Python Interpreter Performance Deconstructed. Dyla'14. http://www.complang.tuwien.ac.at/gergo/papers/dyla14.pdf
After obtaining the numbers in that paper, the work didn't really go anywhere; there were no really obvious optimizations to try based on the data. But it was fun!
1. If I understand the source on GitHub correctly, you parse Python source code yourself. I'm fairly sure your simulation would be a lot more faithful if you compiled Python bytecode instead. Did you consider this, and if yes, was there a particular reason not to do it that way?
I ask this in particular because if I understand your thesis correctly, you look up local variables in hash tables every time they are referenced. This is not what Python does: It maps variable names to integer indices during compilation to bytecode, and the bytecode just takes those embedded constant indices and indexes into an array to obtain a local variable's value. That's a lot faster. And you would get it automatically if you started from bytecode. (Plus, it would be easier to parse, but if you have fun parsing stuff, that's reasonable too.)
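A quick way to see this from the outside (a minimal sketch using the stdlib `dis` module; the function `f` is just a made-up illustration):

```python
import dis

def f(x):
    y = x + 1   # 'x' and 'y' are fast locals: array-indexed slots, no hashing
    return y

opnames = [ins.opname for ins in dis.get_instructions(f)]
print(opnames)  # contains LOAD_FAST-style opcodes, never LOAD_NAME
```

Module-level code, by contrast, compiles to LOAD_NAME/STORE_NAME, which really do go through a dictionary.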
2. Where do you actually make useful use of Rust's static ownership system? I've only skimmed that part of the thesis very quickly, but I missed how you track ownership in Python programs and can be sure that things don't escape. Can you give an example of a Python program using dynamic allocation that your compiler maps to Rust with purely static ownership tracking and freeing of the memory when it's no longer used?
3. Related to 2: Why bother with any notion of ownership at all? Did you try mapping everything to Rust's reference counting and just letting it do its best? I'm wondering how much slower that would be. Python is also reference counted, after all, and I guess the Rust compiler should have more opportunities to optimize reference counting operations.
4. In general, do you have an idea why your code is slower than Python, besides the hash table variable lookup issue I mentioned above?
"The Python interpreter now uses a 16-bit wordcode instead of bytecode which made a number of opcode optimizations possible."
They haven't been shy about changing it in the past either, since there's no guarantee of stability, so it's likely to continue to change in incompatible ways.
This is only true for function arguments, right? Module-level bindings and class and object attributes are looked up in dictionaries. I think the same goes for variables used in closures too?
Depends on __slots__, yes?
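For illustration (a minimal sketch; the class names are made up): with `__slots__`, instances lose their per-instance `__dict__` and attribute storage becomes fixed slots instead of a hash table:

```python
class WithDict:
    pass

class WithSlots:
    __slots__ = ("x", "y")  # instance storage is fixed slots, not a dict

d = WithDict()
d.x = 1
s = WithSlots()
s.x = 1

print(hasattr(d, "__dict__"))  # True: attributes live in a per-instance dict
print(hasattr(s, "__dict__"))  # False: access goes through slot descriptors
```

Note that even with `__slots__`, the attribute name is still resolved through descriptors on the class, so it's not free, just cheaper per instance.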
That's really too bad.
> there were no really obvious optimizations to try based on the data.
Is that because Python already is the way it is? In other words, if you started from scratch, how would you design a language differently so that it doesn't run into these issues?
Asking for a friend ;-)
Partly, yes. But also because the optimizations I would have come up with were all tried by Stefan Brunthaler before. See some citations in the paper, in particular to: http://arxiv.org/abs/1310.2300 where he claims a 4x speedup over CPython using purely interpreter-based and safe optimizations. So I guess I should have written "no really obvious and new optimizations to try".
Anyway, Stefan's work in general is very interesting for implementors of dynamic language interpreters. A lot of it harks back to things that were first developed for Smalltalk, which I think is where you are coming from?
> In other words, if you started from scratch, how would you design a language differently so that it doesn't run into these issues?
I don't think there would be a need to start from scratch. Python 3 is a pretty good language! Also, it's not clear that it needs to be fast, it's just fun to think about how it might be fast.
That said, I think there is only one key feature missing to enable aggressive optimizations: An annotation on functions/methods that says that the function may be optimized even at the cost of some loss in reflective capabilities. In current Python any function's code can be inspected and changed at run time, as can its local variables by callees who can walk up the call stack. This means that every optimization that compiles Python code to machine code or even just changes the interpreter's handling of local variables can, in general, change the program's semantics.
I think it would suffice to add a @static decorator to disable these reflective features for individual functions. (And disabling dynamic lookup of globals would be good too.) The interpreter could then recognize that you want those optimized and could do whatever magic it likes without caring about boring corner cases like invalidating its optimizations if someone tries to patch the function's code at runtime.
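For a concrete picture of the reflective corner cases involved, here is a minimal made-up example of a callee reading its caller's locals through the frame stack; any compiler that kept `secret` in a register instead of a frame slot would break it:

```python
import sys

def snoop():
    # walk one frame up the call stack and read the caller's locals by name
    caller = sys._getframe(1)
    return caller.f_locals["secret"]

def compute():
    secret = 42   # never passed to snoop(), yet snoop() can see it
    return snoop()

print(compute())  # 42
```

A @static decorator as described above would declare that `compute` opts out of this, freeing the interpreter to optimize its locals however it likes.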
This would not be a big thing to do, and there would be no real need to restart from scratch; Python 3 is a pretty good programming language!
Everything else would pretty much fall out automatically from that using known/already implemented techniques, especially with the type hints that would allow you to gradually move a fully dynamic, interpreted application to a statically typed, compiled one function by function. Such a Python would still not win the speed crown, but it could beat the current one by miles on a lot of applications.
> We have presented the first limit study that tries to quantify the costs of various dynamic language features in Python.
This is spot on what we were doing as well, that's great to have this as a reference.
> 1. If I understand the source on GitHub correctly, you parse Python source code yourself. I'm fairly sure your simulation would be a lot more faithful if you compiled Python bytecode instead. Did you consider this, and if yes, was there a particular reason not to do it that way?
We did not consider this actually. This would be a very interesting concept to explore. For the unoptimized version of Cannoli we do look up variables in a list of hash tables (which represent the current levels of scope). We did perform a scope optimization that then uses indices to access scope elements, and this was much faster. However, it meant that the use of functions like `exec` and `del` was no longer permitted, since we would not be able to statically determine all scope elements at compile time (consider `exec(input())`: this could introduce anything into scope and we can't track that).
If you know: how does CPython resolve scope if it maps variable names to indices? In the case of `exec(input())`, say the input string is `x = 1`: how would the compiled bytecode allocate space for x and index into the value? I don't have much experience with the CPython source, so please excuse me if the question seems naive :)!
> 2. Where do you actually make useful use of Rust's static ownership system? I've only skimmed that part of the thesis very quickly, but I missed how you track ownership in Python programs and can be sure that things don't escape. Can you give an example of a Python program using dynamic allocation that your compiler maps to Rust with purely static ownership tracking and freeing of the memory when it's no longer used?
Elements of the Value enum (which encapsulates all types) relied on `Rc` and `RefCell` to defer borrow checking to run time. Consider a function that has a local variable instantiating some object. Once that function call has finished, Cannoli will pop that local scope table, and all mappings will be dropped when it goes out of scope. The object, encapsulated in an `Rc`, will have its reference count decremented to 0 and be freed.
This is how I've interpreted Rust's ownership model; I will say that this was the first time I had ever used Rust, so it's possible that I am not completely right on this. But once that table goes out of scope, all elements should be dropped and any `Rc` should be decremented/dropped.
> 3. Related to 2: Why bother with any notion of ownership at all? Did you try mapping everything to Rust's reference counting and just letting it do its best? I'm wondering how much slower that would be. Python is also reference counted, after all, and I guess the Rust compiler should have more opportunities to optimize reference counting operations.
I did defer a lot of borrow checking to run time with Rc, but I tried to use this as little as possible to maximize optimizations that may result from static borrow checking.
> 4. In general, do you have an idea why your code is slower than Python, besides the hash table variable lookup issue I mentioned above?
If you remove the 3 outlier benchmarks (which are slow because of Rust printing and a suboptimal implementation of slices), Cannoli isn't too far off from CPython. And in fact, with the ray casting benchmark, Cannoli began to outperform CPython at scale. This leads me to believe that the computations in Cannoli are faster than CPython. However, there is still a lot of work to do to create a more performant version of Cannoli. The compiler itself was only developed for ~4 months; I have no doubt that more development time would yield better results.
That being said, I think the biggest slowdown comes from features of Rust that might not have been utilized. This is just speculation, but I think the use of lifetimes could benefit the compiled code a lot. I also think there may be more elegant solutions to some of the translations (e.g. slices) that could provide speedup. But I can't say that there is one thing causing the slowdown, and profiling the benchmarks (excluding the outliers) supports that.
> This leads me to believe that the computations in Cannoli are faster than CPython.
Right, the speed difference due to output buffering seems annoying. I wonder if you can change the stdout buffering mode to be more similar to Python's (I don't recall whether Python's stdout is line-buffered).
Also, you might want to try to run the same benchmarks PyPy uses: http://speed.pypy.org/
These are still micro-ish benchmarks, but not quite as micro as some of yours.
> However, there is still a lot of work to do to create a more performant version of Cannoli.
I saw in another post that you are going off to industry to do something else, but if you will continue development and evaluation of Cannoli and want to discuss further, feel free to drop me a line at the email address in the paper linked above.
So if your function is (Python 2 syntax):

    def foo(): exec "a=1"; return a

the exec statement disables fast locals, and the load of a compiles to a by-name dictionary lookup:

    8 LOAD_NAME 0 (a)

A function without exec gets the index-based version instead:

    6 LOAD_FAST 0 (a)

In Python 3, exec() is an ordinary function and can no longer create locals, so the equivalent code fails:

    Traceback (most recent call last):
      File "test.py", line 5, in <module>
      File "test.py", line 3, in foo
    NameError: name 'a' is not defined

even though the function body contains:

    exec("a = 1")

But you can do this at module level, where names live in a real dictionary:

    exec("a = 1")
That being said, I did leave a few suggestions in the "future work" section that talk about writing an AOT compiler for RPython (the version of Python that PyPy's interpreter is written in). This would provide more information at compile time and would be an interesting comparison between a Python interpreter compiled AOT versus a Python interpreter with a JIT (PyPy).
Does this mean that an AOT compiler could get a lot faster with PEP 484 type hints?
PEP 3107 does say:
> By itself, Python does not attach any particular meaning or significance to annotations.
However, I think this will change especially as more projects begin to outperform CPython. In the "Results > Object Optimization" section of the thesis paper, I cover using these very type annotations to optimize the code.
The biggest problem with Python annotations right now is that they don't really mean anything. Nothing is enforced, so it is totally valid to have `x: int = "string"`. The compiler would have to just ignore this annotation since it was provided the wrong data. It could also be difficult to identify whether a variable was being used with its type mislabeled. So it's not perfect, but I think it's a step in the right direction.
Edit: That said, if the compiler can determine with certainty that the type signature is always obeyed, then yes, you could apply optimizations to remove a lot of the runtime overhead. I imagine this would be rather difficult to do unless you have type annotations for all (or the vast majority) of your code, including third-party libraries.
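To make the "annotations mean nothing at runtime" point concrete (a minimal sketch; `double` is a made-up example):

```python
from typing import get_type_hints

x: int = "string"   # perfectly legal: CPython never checks the annotation

def double(n: int) -> int:
    return n * 2

result = double("ab")            # also legal: string repetition, no error raised
hints = get_type_hints(double)   # the hints are retained, just not enforced
print(type(x).__name__, result, hints)
```

This is exactly why a compiler can only trust annotations if it can prove (or guard) that they hold.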
That's obviously wrong; examples to show that are trivial. But that's why the direct call also contains a trap that checks the type and triggers de-optimization if the assumption turns out to be wrong. And a simple "assert this type byte sequence is a known value" is still faster than a dynamic jump. A lot faster.
Does that actually affect the behavior of the code, though? That seems to me like a private implementation detail, but I don't know much about the HotSpot JIT or the JVM in general.
I think this is actually one of the things that gets in the way most if you want to use traditional AOT compiler technology (like GCC or LLVM) to implement a JIT. In state-of-the-art JIT compilers this part is always a nest of highly complex and nonportable assembly language.
I think this compiler also makes this particular optimization, but this is just one of many many optimizations PyPy does. I imagine that with sufficient work, this compiler could be brought up to speed with PyPy, but as it stands right now, PyPy simply benefits from having years of optimization work that a new project doesn't.
So, for example, you might see that the last 100 calls to a function were done with integers, so you can generate a variant of the function that only works for integers, and check if it's applicable when you enter the function. If that function stops getting used, you can throw it away.
Doing that well ahead of time requires an extremely good idea of how the program will behave at run time, and even with good information, is still very likely to bloat up your binary hugely. (Facebook used to compile their PHP codebase to a multi-gigabyte binary before moving to HHVM, for example).
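The "specialize after 100 int calls, with a guard" idea can be sketched in plain Python (a toy model, not how any real JIT is implemented; all names here are made up):

```python
def specializing(generic_fn, fast_fn, threshold=100):
    """Toy inline cache: after `threshold` consecutive int calls,
    switch to the int-only fast path, guarded by a type check."""
    state = {"int_calls": 0, "specialized": False}

    def call(x):
        if state["specialized"]:
            if type(x) is int:            # guard: still monomorphic?
                return fast_fn(x)
            state["specialized"] = False  # guard failed: deoptimize
            state["int_calls"] = 0
        if type(x) is int:
            state["int_calls"] += 1
            if state["int_calls"] >= threshold:
                state["specialized"] = True
        else:
            state["int_calls"] = 0
        return generic_fn(x)

    return call

square = specializing(lambda x: x * x, lambda x: x * x)
for i in range(200):
    square(i)         # after 100 int calls, the fast path is installed
print(square(7))      # 49, via the specialized variant
print(square(2.5))    # 6.25, guard fails -> back to the generic path
```

A real JIT emits machine code for the fast path and patches call sites, but the shape is the same: profile, specialize, guard, deoptimize.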
More information means better optimizations. JITs FTW.
I thought the main benefit was safer code. Is it just the fact that you looked at what needed optimizing and put some effort in, or did the language choice help?
The author says that this is a research project, and adding the standard library (if possible at all) would be a humongous task by itself.
So if you can already compile arbitrary Python, you only need to reimplement the parts that are written in C, or maybe just rewrite them to use a different FFI API.
Dude, you're doing a thesis in computer science. Of course it's easy for you.
You can become a "professional software developer" in less time than it takes to settle the subject of your thesis - generally two years.
Do you plan to support this down the line? I feel like compilers that compile down to Rust could become really good tools depending on how the popularity of the language goes.
I feel like the postfix is something languages lose the bigger they get, I hope this happens to rust too.
†: I have a particular fondness for Pike, because it's derived from the first programming language I ever used: LPC. Anyone else an LPC hacker?
Spun up a quick dashboard of the project here: https://chart.ly/github-dashboard/joncatanio/cannoli
Not tons of revelations there, but cool to see your longest streak was 7 days straight committing to the repo. Also cool to know this is part of your thesis.
What are your plans after Cal Poly?
I'm actually moving out to NYC this July to work for Major League Baseball, in the Advanced Media division (MLBAM). I'll be doing some software engineering there, mainly API work for various apps. I'm very excited about it!
I'll have to work on compilers in my free time haha, I really enjoyed the work I did on this thesis.
It does complicate the generated code; I don't know if Rust is the greatest intermediate representation. But I do think it was a better choice than C. Debugging the generated code was great because of the detail the Rust compiler provides in its warnings/errors.
I'd be interested in seeing how a Python interpreter written in Rust would compare to CPython, this would probably make use of more Rust optimizations (than trying to generate code).
The same analysis could be done on JS or Ruby; it would be cool to see if a similar compiler would yield the same performance results for restricting features in JS/Ruby. It would also validate this work nicely.
> Leave the Features: Take the Cannoli - Jonathan Catanio
That's pretty good
Made me laugh out loud.