PyPy funded to begin support for Python 3 and Numpy (morepypy.blogspot.com)
178 points by synparb on Jan 27, 2012 | 30 comments

This post made me think of this article: http://technicaldiscovery.blogspot.com/2011/10/thoughts-on-p...

For those who don't know, Travis Oliphant is the creator of NumPy and one of the original authors of SciPy.

It's a good read and it puts some of the issues with a port into perspective.


While I strongly disagree with Travis in general about how far PyPy can go, I wonder what kind of perspective you're talking about?

I think this statement sums up the perspective I'm talking about:

"NumPy is just the beginning (SciPy, matplotlib, scikits, and 100s of other packages and legacy C/C++ and Fortran code are all very important)"

I'm not that familiar with matplotlib and not familiar at all with scikits. But the point is that there is a lot of other C/Fortran code that users of NumPy rely on. How much do you gain by porting NumPy to PyPy? (Not a rhetorical question; I'm genuinely curious why the PyPy folks have chosen this as a goal.)

PyPy team, if you're out there, please don't take my question as criticism -- it's not. I'm just genuinely curious. Congrats on getting the funding and keep doing what you love!

This has been discussed to death in the comments on that blog and others, but reading them all might be very boring, so I'll repeat my stance here (I'm one of the people implementing numpy on pypy):

A faster NumPy is already very interesting for many people, because you don't have to go to great lengths shifting code to C or Cython just to experiment. Besides, it integrates seamlessly with whatever current stack you have in Python.

Regarding the low-level API: calling C/Fortran from PyPy's numpy should be dead easy, via, say, ctypes. You should be able to call whatever C libraries you wish.
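
To make that concrete, here is a minimal sketch of what a ctypes call into a C library looks like (libm's cos() is just a stock example, nothing PyPy-specific):

  # Calling into a C shared library via ctypes; this works the same way
  # on CPython and PyPy, with no CPython C API involved.
  import ctypes
  import ctypes.util

  # find_library may return None on unusual platforms; assumes a Unix libm.
  libm = ctypes.CDLL(ctypes.util.find_library("m"))
  libm.cos.argtypes = [ctypes.c_double]
  libm.cos.restype = ctypes.c_double

  print(libm.cos(0.0))  # 1.0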

Matplotlib, SciPy, and scikits should be relatively easy to get working to some extent using hacks like this: http://morepypy.blogspot.com/2011/12/plotting-using-matplotl...

As for other stuff -- well, if it depends too much on the CPython C API, port it. It's not that hard, and once you have a respectable Python runtime you can do it; it has been done before.

Just because we won't support all possible users from day one does not mean we should not try. There are very valid use cases where people shy away from Python because as soon as you try to write a loop in Python, things slow to such a crawl that you can't even run experiments. I personally believe Cython is not the answer here; you actually need full Python to do most of this, especially for inexperienced users, so we're primarily targeting the niche that can't possibly be attacked by any solution based on CPython.

As for performance -- numpy, even when you vectorize everything, is nowhere near the speed of C. We are trying to attack that as well, and eventually even surpass C.

This is pretty much it, feel free to ask more questions.

Your statement that NumPy is "nowhere near the speed of C" is false and misleading. For some people, "NumPy" is actually just a front-end to vendor-optimized libraries that do the real work (and your C-coded loops are going to be much slower than those). For most operations (with large-enough vectors), NumPy is only about 2x slower than a specifically crafted C loop.

Yes, there are generic operations in NumPy that you can speed up with specific code in C (or any other compiled language). In addition, there is much low-hanging fruit to optimize in NumPy as well (which we at Continuum are working on as I write this).

Fijal, I know you are enthusiastic about PyPy, and you should be -- it's a cool system. But please don't spread misinformation. There are a lot of people who don't understand enough about the details of what you are talking about, and you are just going to alienate them once they realize that you don't have all your facts about NumPy clear.

For people who only make occasional use of NumPy, PyPy and its version of numpy will likely be fine. But those people should be well aware that they are intentionally remaining outside the larger Python/NumPy ecosystem (Matplotlib, SciPy, scikits, etc.), and it will be a long haul to build the features in PyPy that would enable that ecosystem to migrate (and that assumes the individual projects decide it's even worthwhile to do so).

I think I disagree with pretty much every single point you make. First, for something as simple as a Laplace equation solver, the NumPy vectorized loop takes 35ms per iteration vs. 6.3ms for C. As your list of operations grows, your need for intermediates grows and your speed decreases, but let's not get into details. Obviously, if you just call a vendor-optimized library, you can use whatever you feel like and it'll be equally good, be it PyPy, be it numpy, be it Matlab.
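
For concreteness, the kind of vectorized update being benchmarked looks roughly like this (a sketch only; the grid size and iteration count behind the 35ms/6.3ms figures aren't given here):

  import numpy as np

  # One Jacobi-style iteration for the 2D Laplace equation, vectorized
  # with NumPy slicing. Each sub-expression materializes a temporary
  # array, which is part of why this trails a hand-written C loop.
  def laplace_step(u):
      u[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                              u[1:-1, :-2] + u[1:-1, 2:])

  u = np.zeros((500, 500))
  u[0, :] = 100.0  # fixed boundary condition on one edge
  for _ in range(100):
      laplace_step(u)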

You consistently spread the rumor that we intend to reimplement all of scipy/matplotlib/scikits etc. in RPython, and this is plainly false. I think those projects are completely reusable using one hack or another -- for example, the blog post I posted, where within a day I was able to draw basic plots using matplotlib on PyPy. We seriously want to reuse as much code as possible from the entire ecosystem, but part of the project is also to provide people with a really fast Python that can perform numeric computations.

Also, which facts about numpy didn't I get clear?

> First, for something as simple as a Laplace equation solver, the NumPy vectorized loop takes 35ms per iteration vs. 6.3ms for C.

Isn't numexpr a good solution to that problem? numexpr is much, much simpler than PyPy, so if it can reduce the performance overhead of complex vector operations with loop fusion, that seems like a huge win.

Yes; numexpr and weave are pretty reasonable solutions, although a little "weird" because they take opaque strings.
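
For reference, the string-wrapping is about this much ceremony (a minimal sketch):

  import numpy as np
  import numexpr as ne

  a = np.random.rand(1000000)
  b = np.random.rand(1000000)

  # Plain NumPy evaluates this in several passes over the data,
  # materializing intermediate arrays along the way:
  r1 = 2.0 * a + 3.0 * b ** 2

  # numexpr compiles the string and fuses the whole expression into a
  # single pass; the "opaque string" is the trade-off mentioned above.
  r2 = ne.evaluate("2.0 * a + 3.0 * b ** 2")

  assert np.allclose(r1, r2)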

One bit of context that is missing from this discussion (although Travis did allude to it earlier) is that we are actively working on building robust deferred-computation support into NumPy, and on making these operations run much faster than what hand-tuned C can provide, via a variety of mechanisms.

(Disclaimer: I also work with Travis at http://continuum.io, along with the author of Numexpr and PyTables. :-)

[citation needed] on numexpr being simpler than PyPy. By the dumbest measure of simplicity, total code size, it fails: PyPy's NumPy is 167KB of code; NumExpr is 181KB.

Um, that's not a dumb measure of simplicity; it's just a measure of code size. You might as well look at the average number of characters in the names of the authors of the two projects; it would be equally irrelevant.

Here is a real measure of simplicity: how long does it take to explain to a NumPy user how to wrap an array expression in a string, versus explaining how a JITting compiler works, how to interface its runtime with their existing Python installation, how to build it, and what the limitations of RPython are?

Heck, I'm an actual developer (not a scientific programmer) and it took me a little while to understand what PyPy does.

>I'm an actual developer (not a scientific programmer) and it took me a little while to understand what PyPy does.

I'm a scientist, not a scientific programmer or a developer, and this is all I really care about: PyPy is currently -- in its partially implemented state -- much, much faster than CPython on the vast majority of things it can do. If I am able to use PyPy's NumPy and it's faster than traditional NumPy, I will do so as long as the opportunity cost doesn't outweigh the speed increase (NumPy is pretty useless to me -- maybe not to some -- without SciPy and matplotlib).

I don't care that PyPy is written in RPython any more than I care that it has a JIT, or that CPython is written in C. I also don't care how that JIT works, or how CPython compiles to bytecode, or how Jython does magic to make my Python code run on the JVM. I do care that it "works," as a scientist. I do care that the implementations are "correct," as a scientist. As an individual, I am interested in the inner workings of PyPy and CPython, CoffeeScript, Go, and Brainfuck, but when I'm working on research, the only thing that actually matters as far as the language implementation is concerned is that it just works. The interpreter is just a brand of the particular tool I'm using.

I would certainly prefer it if PyPy were 100% compatible with the CPython C API, even if it ran at 80% (maybe even 60%) of CPython's speed, because then I wouldn't even have to think. I'd be using PyPy because it's faster overall, and I could do the analyses I want faster.

Anyway, I think if you're explaining all of what you mentioned to a NumPy user or a PyPy NumPy user, you're doing it wrong. Or maybe the PyPy folks would be doing it wrong. Because this is how that conversation would go with my peers:

  Sad Panda: "Ugh, my code is running slowly. I think I have to jump into C."
  Me: "Have you tried PyPy's NumPy yet?"
  Sad Panda: "What's that?"
  Me: "It's faster Python and NumPy. Go here [link], download it, and see if it runs your code faster."
  Sad Panda: "Okay, I'll do that."
  ..a while later..
  Sad Panda: "It was a little faster, but I ended up getting one of the CS guys to help me run it on a Tesla cluster with OpenCL. But I think I can use it on the spiny lumpsucker data I'm collecting."

> I would certainly prefer it if PyPy were 100% compatible with the CPython C API, even if it ran at 80% (maybe even 60%) of CPython's speed, because then I wouldn't even have to think. I'd be using PyPy because it's faster overall, and I could do the analyses I want faster.

While part of me agrees with this, if PyPy starts sacrificing performance for CPython compatibility, then pretty soon it'll degenerate into CPython.

Why do the people wrapping stuff in strings have to understand the limitations of RPython? It's "wrap expressions in strings" vs. "do nothing".

numexpr is much simpler than PyPy. I never said it was much simpler than PyPy's incomplete numpy support.

Fijal, I'm not sure you even understand my perspective; none of your replies have given me confidence that you do. I have no disagreement with you about how far "PyPy could go". Obviously, we could re-create the entire Python ecosystem, including the scientific stack, under its runtime. I just question your understanding of how expensive and time-consuming that would be. My official position is that it's possible to get the benefits of PyPy (i.e., fast Python loops) in ways that don't also toss out years of extension modules in the process, and that don't answer current users who rely on CPython features not working under PyPy with "just port it" or "just use ctypes".

From having read the various replies to this and similar threads, I think a basic problem (as Travis pointed out) is that Fijal has been (1) highly dismissive of numpy and particularly Cython, which a lot of really smart people have put a lot of time and effort into, creating an amazing scientific community within Python, and (2) confrontational with people who are trying to offer an alternative opinion informed by a lot of experience in the scientific Python world. PyPy might eventually be a wholly superior alternative for scientific computing with Python, but it would be good to acknowledge the insight of those who are 'in the trenches' even if you go a different route, because part of the success of scientific Python has been its community.

I know very little about the development side of PyPy or NumPy, but I know that at this moment in time Numpy/Scipy/Cython have revolutionized how I do research on a day-to-day basis. It's unfortunate that there is such animosity surrounding this issue.

My thoughts exactly.

Note: this reply is at the wrong level; I cannot reply in the correct place.

I don't think I'm highly dismissive of the numpy/scipy/cython community. I wouldn't be implementing all this stuff if I didn't think those APIs are good and the future of scientific computing; they just lack a reasonable replacement for C. If Python is to surpass Fortran in the scientific field, it really does need a way to express fast algorithms in Python itself, and I don't think CPython can provide that.

Personally, I don't like Cython as a way to speed up Python, because it sacrifices the beauty of the language in favor of performance. I think time invested in the Python VM is time much better spent, but this is a very personal opinion, and I won't blame people who think otherwise. I think Cython is a better way to call C than all the other options that exist right now (like the CPython C API or ctypes), but that is entirely different from using it for speedups.

The proposed way so far has been "everything must be 100% backwards compatible, otherwise it won't work". This is all well and good, but I'm not aware of a way to make things both 100% compatible and fast, so we decided to break some compatibility, such as with the CPython C API, and not to reuse most of what's implemented in C in numpy. I have argued for those decisions many times, but in short: there has to be a bit of breakage before we can make a leap. This is not based on dismissal of other people's opinions, especially those who have spent tons of time with the scientific community; it's just that I don't see a way forward that does not introduce some breakage.

"...they just lack a reasonable replacement for C"

Actually, Fijal, I have a question -- and this is related to our previous Skype discussion as well. I know that a lot of the work on PyPy has been on the JIT, but have you guys ever really pursued the idea of building PyPy as a front end for LLVM? All your type-inference logic would probably be a lot more code than what a traditional LLVM front end normally consists of, but you'd get to leverage all of the massive community effort on LLVM's optimized code generation and other backend optimizers.

Just a thought.

LLVM has been looked at, both as a backend for RPython and for the JIT. I've written down some of the reasons it's not appropriate for PyPy (particularly WRT the JIT) here: http://www.quora.com/LLVM/Is-LLVM-not-good-for-interpreted-l...

Thanks for the link!

Well, fast Python loops in CPython have been tried before and have failed. I seriously don't see a way of getting both a working JIT that really optimizes the code that's out there and native support for CPython extensions; one side will have to suffer. Also, numpy is fairly special in that it has good potential to be optimized by the JIT in ways that are not quite possible in C or Cython.

I like PyPy and its ambitions. The last time I tested it, about a month ago, it was very speedy, and its startup time was considerably faster than CPython's. However, the regex exercises I wanted to do couldn't be done; PyPy seemingly didn't have a good regexp engine. If I remember correctly, it was something with groups and back-references...

Has that changed?

No, not much has changed about our regex engine in the last month. However, our `re` module is fully compatible with CPython's, so I'm a bit confused: are you saying it didn't run something, or that it was slow?

Hi, could you give some details about the regexes you were trying to run? Are you sure these are problems with the PyPy implementation, not with the Python regexp spec itself? I've used grouping and back-references with PyPy with no problems so far, so I wonder which case you're talking about.

Yes, I am sure it was the PyPy implementation, since the same regex did fine on Python 2.7.

Unfortunately, I don't have the regex at hand; I just remember it was something with groups and backtracking.
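
(For the curious, a group-plus-backreference pattern of the general shape being described looks like this -- purely illustrative, since the original regex isn't available:)

  import re

  # A capturing group plus a backreference (\1): matches a repeated word.
  # Patterns like this exercise the backtracking engine, which is where
  # the parent comment reports PyPy diverging from CPython at the time.
  pattern = re.compile(r"\b(\w+)\s+\1\b")

  print(pattern.search("the the quick fox"))    # matches "the the"
  print(pattern.search("the quick brown fox"))  # None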

The fundamental question is: how compatible will PyPy ever be? Which kinds of applications will run on PyPy?

http://pypy.org/compat.html mentions compatibility with the standard library. This is fine for (web) servers and command-line applications.

But what about desktop applications? Can I (someday) take PyQt or PyGTK code and run it on PyPy without modifications?


cpyext is a hack that lets PyPy run Python/C API modules. It works for some, but not all, and it's slow. I'm sure it'll keep getting better, but I'm not sure it'll ever be good enough to run PyGTK or PyQt.

PyPy has good support for ctypes, so ctypes bindings are a good option. There are projects out there like pygir-ctypes and ctypes-gtk; one of them just needs to become complete enough to be a good choice for GTK programming. Compatibility with the newer PyGObject is more likely than compatibility with legacy PyGTK, though.
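
To give a flavor of that approach, here is a hypothetical sketch of driving GTK directly over ctypes (the library soname and constant assume GTK+ 2 on Linux; real bindings like ctypes-gtk wrap far more than this):

  import ctypes

  # Load the GTK+ 2 shared library directly; no CPython C API involved.
  gtk = ctypes.CDLL("libgtk-x11-2.0.so.0")
  gtk.gtk_window_new.restype = ctypes.c_void_p   # avoid pointer truncation
  gtk.gtk_widget_show.argtypes = [ctypes.c_void_p]

  gtk.gtk_init(None, None)        # NULL argc/argv is accepted by GTK
  window = gtk.gtk_window_new(0)  # 0 == GTK_WINDOW_TOPLEVEL
  gtk.gtk_widget_show(window)
  gtk.gtk_main()                  # blocks running the main loop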

PyQt is harder because it's C++.

I believe that C extensions, which have always been and still are very CPython-specific, will eventually have to be replaced with ctypes and possibly Cython versions.

I expect to see quite a lot of improvement in that area as a side effect of the work on numpy.

I have a pure Python regexp implementation that you can use in PyPy (it passes almost every test used by the CPython re package; only the LOCALE flag is not supported). However, it's around 100x[1] slower than the re package (on simple matches; it's not backtracking, so it can be similar in speed for pathological cases).

Also, when I looked at PyPy, the re implementation was in C and looked so similar to CPython's that I assumed it was the same or closely related. But I didn't pay much attention, so I may be mistaken.

[1] Roughly. It actually runs 6x faster on recent PyPy than on CPython.
