Ian Bicking: My Unsolicited Advice For PyPy (ianbicking.org)
121 points by jnoller on Apr 4, 2011 | 52 comments

Ahh, but PyPy is not being written by web developers -- or for them. The main guy behind the project is a research computer scientist (http://cfbolz.de/) and most of the people who are excited about it are also doing scientific computing.

I am one of them, and the ability to write scientific code in Python that runs as fast as (or faster than!) C would be so awesome that I can hardly put it into words. Unlike in web development, performance is still a huge bottleneck in this field; people still mill about for 3 weeks waiting for their job to finish just like it was the 1970s. Consequently, most code written for research projects consists of a lot of arcane, difficult-to-read C, C++, or even (gasp!) FORTRAN.

The benefits of switching to Python in terms of development time, ease of debugging, reusability, potential for collaboration, reproducibility, etc. are immense, but the performance penalty is such that it rarely happens. Even in spite of that, Python already has a large academic following; if PyPy manages to pull off its stated goals, it would be revolutionary.

The big issue here is that most scientific code in python runs on top of numpy, and getting numpy to work on top of pypy would be challenging.

I understand that relying on a C extension compatibility layer would make things slower at first, but the alternative of re-implementing numpy from scratch in python does not seem realistic to me. Even if numpy on top of pypy is itself, say, 2 or 3x slower than on top of cpython, this could be quite acceptable if you can get speed back on top of it for things for which numpy does not help. This would also enable an iterative process of getting away from cpython internals (which already started in numpy a couple of years ago, albeit more slowly than everybody would have liked).

I hear there is effort to get Numpy working on Cython, and work to get Cython to target PyPy as a backend. Given the two efforts... maybe doable.

There is no effort to "get numpy working on cython" (I am not sure what that would really mean). There is an effort to refactor numpy code into the python-dependent part and the more core, "pure C" part, which would help.

Adding pypy as a target to cython is news to me, but that also seems like significant work given the nature of both projects.

What fundamentally bothers me with those approaches is that they rely on a lot of work before even proving they can work at all. Whereas a more complete C API emulation layer would help many more projects, and give a more iterative process. The latter almost always works better in my experience.

Enough with the Cython nonsense! Cython has excellent marketing but it is a really naive compiler. Just figure out how to translate Cython to ctypes+Python for backwards compatibility and use pypy.

Cython is supposed to be a naive compiler; its primary use case is speeding up specific functions by having them compiled. That it uses a python-like language to do this is great.

I'm reasonably certain the C parts of numpy rely more on implementation optimisations than compiler optimisations for their increased efficiency.

You don't know what you're talking about. Using something like Cython (I'm still using Pyrex) to make C extensions dramatically speeds development, if only because it largely solves the reference count problem.

I also do scientific computing, with scipy. The weave module allows me to write C code inline directly, which is great for optimizing away the bottlenecks. On the minus side, documentation is scarce, the error messages are worthy of template C++, and the idiosyncrasies suck.

Upside: I get 10-100X performance improvements!


Now imagine the same kind of performance improvement, without any complicated inline C, wrapping or any of that!

I really have high hopes for PyPy :)

That would be awesome.

Right now, with Scipy arrays, you get pretty good performance if you use matrix operations and built-in functions. That can be unnatural, besides it would be more efficient to do a plain for loop in an ideal world. Is PyPy optimized for matrix-style code, for loops, or both? What about functional programming, say, finding the maximum of a user-supplied function? Will the maximizing and maximized functions be inlined for performance? Thank you to any answering PyPy developer :)

You can already do that, albeit in a somewhat more limited scope, with shedskin. I've had great success with it.

Use Cython, then. It's my preferred way of writing C.

Well, some of us (the PyPy developers) are web developers, but I'll grant you I'm the exception, not the rule :)

It seems to me that there could be a version of Python that is solely optimized for speed, and allows full compilation. I haven't given this a great deal of thought, but I bet that if you defined a subset of the language that is Python-like, but doesn't allow the full introspection and fancy class manipulations, and other cool features, you could have a fully compilable language that would rival C in performance, but kick C's butt in readability.

Maybe it's not a binary switch (performance exclusions switch, for example). Maybe it's a setting with various degrees of performance. You need introspection? We'll give you that, but now you can only compile at the 2x performance level (or whatever). You need certain libraries that aren't written for performance? That's going to cost you.

Then again, it's quite possible that I don't know what I am talking about!

PyPy itself has RPython, which is a restricted Python that can be compiled down to something fast. Cython allows you to annotate Python code (inline or externally) to allow compiling it down. ShedSkin compiles some Python directly to C++. So there's a bunch of options (all of which PyPy has to compete with!)

A lot of scientific code also runs on the (crufty, spiffy) Matlab, which is fast but not super-fast, is less portable than most things but has good usability and lots of history.

A python that could run incrementally, with robust threads, in the ground might, just possibly, make SAGE a better environment and so increase migration from MATLAB (my personal scientific computing experience was with Matlab, and anything that escapes this monster is great in my book).

I mean, Python won't become faster than optimized C, so however convincing or unconvincing the faster-iteration and usability argument is, that, in the end, is going to have to be the argument Python makes.

Right now it looks like LuaJIT is your best option for this type of performance. PyPy might get there eventually though...

LuaJIT is awesome, but python already has a large volume of scientific libraries. I haven't heard mention of scientific researchers using Lua (yet).

"What do I want? [...] Shared objects across [micro-]processes with copy-on-write; then you can efficiently share objects (like modules!) across concurrent processes without the danger of shared state, but without the overhead of copying everything you want to share."

Is overhead of modules really an issue on servers with GB of RAM? More importantly, almost every multi-threaded string library (like C++ STL) has moved away from sharing across threads and COW because the cost of the atomic operations is too high. See Herb Sutter's "Optimizations That Aren't (In a Multithreaded World)."


Also, to say that speed is "really uninteresting" but latency is important seems like a contradiction. Latency is absolutely gated by speed for non-parallelizable problems. And even if the problem can be parallelized, speed is a much easier and simpler way to decrease latency than parallelizing.

I guess this reads to me more like a list of theoretically interesting ideas than a set of features that will actually help anyone in the real world.

RAM seems relatively constrained to me -- at least, running a multiprocess server without a few GB of RAM is hard to do (and memory seems to be the most expensive part of servers). And when it goes wrong (you reach whatever your limit is) then things tend to fail in less than ideal ways. Tools around memory usage are also quite poor, so while performance gets optimized, memory seldom does. And a culture of cavalier memory usage doesn't help either -- too many people are borrowing memory to get performance, compounded in the case of an application that is typically an aggregation of many people's work.

The amount of code involved in systems is also continuing to go up, so that the amount of memory you use before you've done anything (but when you've loaded up all the code) is getting constantly higher. In the world of static/compiled languages this might not be as notable as there's a clear sense of "code" and "data" -- Python has no such distinction, so if you don't share anything then you don't share anything.

WRT speed-vs-latency -- definitely related, but most benchmarking seems to specifically remove latency from the benchmark and instead test throughput. E.g., tests frequently throw away the slowest run, even though abnormally slow runs exist in real programs and can have an effect. (Of course they aren't predictable and might be affected by other things on the system -- which is a kind of bias to not optimize things that are hard to measure). But latency is mostly just simplicity of implementation, so no, I wouldn't expect parallelizing to help.

I guess I can't disagree with your specific points about memory, but OS pages are already copy-on-write -- if you're really concerned about that kind of overhead you could always fork() individual OS processes from one that already has everything loaded.
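The fork-after-loading pattern described above can be sketched with the standard library alone. This is an illustration under assumptions, not anyone's production setup: `PRELOADED` and `run_in_child` are hypothetical names, the dict stands in for expensive imports done once in the parent, and the code is Unix-only. It shows only the mechanics; it says nothing about how many pages actually stay shared under CPython.

```python
import os

# Stand-in for modules and data loaded once in the parent (hypothetical).
PRELOADED = {"answer": 42}


def run_in_child(func):
    """Fork, run func() in the child, and return its reply as a string.

    Unix-only sketch: the child inherits the parent's memory pages
    copy-on-write, so PRELOADED is not reloaded or explicitly copied.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child process
        try:
            os.close(read_fd)
            os.write(write_fd, str(func()).encode())
        finally:
            os._exit(0)  # leave immediately, skipping the parent's cleanup
    # parent process
    os.close(write_fd)
    with os.fdopen(read_fd) as reader:
        reply = reader.read()
    os.waitpid(pid, 0)  # reap the child
    return reply


reply = run_in_child(lambda: PRELOADED["answer"])
print(reply)
```

The pipe is there only so the parent can observe that the child really saw the preloaded data.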

Re: copy-on-write: http://news.ycombinator.com/item?id=2407397 -- tl;dr reference counting ruins copy-on-write

I was thinking he was talking about (mostly?) immutable data structures.

For what it's worth, another 2c from another web developer - for me memory isolation and COW for module dependencies would be less useful than a single process and a lack of GIL (hold the flamethrowers for one second please) - allowing me to share on a multicore server resources like those same modules, as well as database connections, as well as any significant static data I may read in at start up, decreasing my operational complexity, etc. In a web server environment shared memory isn't particularly scary, to me anyway. Am I misunderstanding the reason for COW?

Also, the most important characteristic to me is latency. I don't care about throughput that much as long as it's "good enough"; hardware is getting really cheap really quickly anyway (in terms of cores and RAM/$). Latency on the other hand is tough. While IO is normally going to be the biggest factor here, if the runtime allows me to consistently shave 5ms off of every request - that's still a great improvement. In this sense I would say that raw speed IS important to me.

Perhaps I haven't encountered the use cases Ian is talking about; it was difficult to figure them out from the blog post (I'm a relative novice in Python).

But I am hoping PyPy can at some point help with the concerns I have above, even lower latency by itself may be compelling enough to switch.

Copy-on-write lets you efficiently share objects across boundaries -- not copying everything right away (and everything in Python means everything), while also safely avoiding the sharing of state. If you are cool with threads then it's not an issue. If you can spare the overhead of a few processes then the GIL isn't an issue. I'd rather it not be a matter of tradeoffs, and I don't think either OS processes or threads are it.

It seems like he could get most of what he wants without a new runtime. Diesel (http://dieselweb.org) provides async without callbacks or Deferreds. Diesel1 used Python 2.5's yield expression, and Diesel2 switched to using greenlets. As long as your modules and your applications don't rely on global state (which they shouldn't anyway), you essentially get everything on this list. For web programming, cooperative multitasking usually works perfectly well and no state is generally shared between any 2 http requests, so I'm not sure how much a web programmer stands to gain from preemptive multitasking and completely separate address spaces with CoW.

The one thing Diesel is missing is good runtime introspection and more primitives to encourage reliability (a la Erlang), but if you hang out in #diesel you'll see that jamwt has already implemented a lot of that at Bump and he plans on pushing it down into the framework soon (probably for Diesel3).

All these systems are share-everything, which I'm not excited about -- if I want share-everything I can just use threads, and they work far better than most people seem willing to admit.

Can you say more about what you mean when you said they're "share-everything?" I'm not entirely sure what you mean by that, since, for example, we do diesel in what we consider to be a "shared nothing" approach at Bump. Redis is the transport (queuing, pubsub) and the network-bound state, while diesel loops are stateless.

In fact, the higher-level erlang-style stuff that's not released yet really abstracts the idea of queues and consumers to the point where it's not entirely clear or important if particular jobs are within the same process or even the same machine (in practice, they usually aren't).

Generally cooperative multitasking systems like these (anything based on greenlets, I think) mean that you are running concurrent tasks in a single process, with a single set of shared data. It's basically the same as threads, only you can't have multiple tasks literally accessing the same data at the same time (though with the GIL I guess technically that's true with threads too ;).
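A rough sketch of what "concurrent tasks in a single process, with a single set of shared data" looks like. Plain generators stand in for greenlets here; this is not the greenlet or Diesel API, and `task`, `run_round_robin` and `shared_log` are made-up names for illustration.

```python
from collections import deque

# One set of shared data, visible to every task in the process.
shared_log = []


def task(name, steps):
    """A cooperative task: do one unit of work, then yield control."""
    for i in range(steps):
        shared_log.append((name, i))
        yield  # hand control back to the scheduler


def run_round_robin(tasks):
    """Run tasks one step at a time until all of them have finished."""
    queue = deque(tasks)
    while queue:
        current = queue.popleft()
        try:
            next(current)  # resume the task until its next yield
        except StopIteration:
            continue  # task finished, drop it
        queue.append(current)


run_round_robin([task("a", 2), task("b", 2)])
print(shared_log)
```

Only one task runs at any moment, and a switch can happen only at a `yield`, which is why two tasks never touch `shared_log` at literally the same time.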

Well, it's only shared if you share it!

I guess it boils down to "don't use mutable globals". That's as true in threaded environments as it is in async ones.

If your requirement is that there must be system-enforced safety rather than safe practices and idioms, I'd say Python isn't the right tool (Haskell is) to begin with. As you know, Python is all about practices, idioms, and discipline.

Just don't access the same data? Threads, greenlets, tasklets, etc. allow you to share state, but don't force you to.

Greenlets/tasklets are game changing with respect to OS threads not because they introduce new ways of working, but because their performance and scalability afford you ways of working that would be theoretically possible but impractical with threads. The GIL only widens that gap.

Threads have at least the following problems in high concurrency environments:

(1) context switching is expensive

(2) access to shared variables leads to race conditions, which requires careful synchronization (to be fair, this is true for any shared storage: databases, mmap()ed regions, etc.)

See also http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1...., which provides a thorough treatment of why multithreaded programs are inherently less reliable than single-threaded ones.

Quoting Robert Harper: 'Concurrency is concerned with nondeterministic composition of programs (or their components).' [1]

Threads can be a very efficient way of introducing non-determinism (preferable to incurring IPC overhead). Writing correct code then boils down to controlling interactions between threads. Actor architectures can be very helpful in such settings. Key to safe implementations are type-enforced isolation and immutability. More experimentally, STMs offer the promise of high-level composition of actors. My 2p.

[1] http://existentialtype.wordpress.com/2011/03/17/parallelism-...

> access to shared variables leads to race conditions ...

Race conditions are a type of programmer error, not an invariable consequence of shared state. You seem to be implying that high concurrency increases the risk of race conditions, which it doesn't. A race condition is either present or it isn't. A correctly written application that uses threads isn't inherently less reliable than a single-threaded application, it is just harder to write.
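As a small illustration of "correct but harder to write", here is a sketch using only the standard `threading` module. The shared counter is updated with a read-modify-write, which is the classic place for a race; holding a lock around the update removes it. `run_counter` is an illustrative name, not taken from any library.

```python
import threading


def run_counter(n_threads=4, n_increments=10_000):
    """Increment a shared counter from several threads, guarded by a lock."""
    counter = {"value": 0}
    lock = threading.Lock()

    def worker():
        for _ in range(n_increments):
            # Without the lock, this read-modify-write could interleave
            # with another thread's update and lose increments.
            with lock:
                counter["value"] += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["value"]


total = run_counter()
print(total)
```

With the lock in place the result is the same on every run, whatever the number of threads.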

Well threads certainly don't work well for high concurrency, which I read as your biggest impetus for a new runtime.

I suppose if you mean true (multi-core) concurrency then you're right that Python has some big limitations, but that pretty much boils down to the age old "remove the GIL" proposal.

Threads work pretty darn well, actually. Depending on the libraries you use the GIL may not get in the way (e.g., database libraries, stuff like lxml that releases the GIL). If you have blocking stuff like I/O (socket or file) then threads help. The only time when the GIL really gets in the way is with multiple processors and CPU-bound routines. Even with CPU-bound routines you are often better off with threads as the machinery does share the CPU transparently and relatively fairly.
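A minimal sketch of the blocking-I/O case, assuming `time.sleep` as a stand-in for a blocking socket or file read (like those calls, it releases the GIL while waiting). The function names are illustrative only.

```python
import threading
import time


def blocking_call(delay, results, index):
    """Pretend to wait on I/O, then record a result."""
    time.sleep(delay)  # stand-in for a blocking read; releases the GIL
    results[index] = delay


def run_in_threads(delays):
    """Run one blocking call per thread and time the whole batch."""
    results = [None] * len(delays)
    threads = [
        threading.Thread(target=blocking_call, args=(d, results, i))
        for i, d in enumerate(delays)
    ]
    start = time.monotonic()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.monotonic() - start
    return results, elapsed


results, elapsed = run_in_threads([0.2] * 5)
print(results, round(elapsed, 2))
```

Run one after another, the five waits would take about a second; in threads they overlap. A CPU-bound loop in place of the sleep would not overlap this way under the GIL.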

Threads are a little half-assed, sure, but so are most of the currently offered "replacements" for threads, including cooperative multitasking (which also doesn't do anything with multiple processors!)

I guess that depends on whether you define "high concurrency" as ~50 or more like ~5000.

I'm guessing your "half-assed" comment is in reference to some of the same things David Beazley mentioned in his PyCon talk, but it's worth linking to anyway: http://www.dabeaz.com/GIL/

Basically threads work pretty darn well until you throw some real load at them.

Well, practice shows threads work pretty darn well with real load, as there are a huge number of running examples of high-load threaded Python applications. There is the potential of failure due to race conditions, but IMHO it's not as bad as it's often made out to be.

For high concurrency, there is of course the question of what level in the stack you want to handle the concurrency at. There's not much use to doing 5000 things at the same time. But you might want to handle 5000 things at the same time (say, open sockets). The question is how much you want to twist your code around to do this -- with threads you might have worker pools, and you might have to be sure your chunks of work are of reasonable size (and often it's tricky to do some things incrementally). With cooperative concurrency... well, you do all the same things ;) But sometimes when threads are enough you don't have to contort your code, and in some error cases threads will help a lot (like you happen to not partition your work sufficiently).

I really just want to get rid of those tweaky code things for all cases, so that it's not a matter of these different choices with different tradeoffs, but just one really good choice that applies much more widely.

Threads for python webapps are fine, but they break down when doing things like chat (using long polling or websockets) where you have a lot of open, mostly idle connections. Coroutines and libraries like gevent and eventlet let you elegantly use greenlets without contorting your code -- i.e., you get async without callbacks or deferreds -- and you have the benefit of being able to handle thousands of connections in a single python process. Of course you don't get processor concurrency, but you don't really get that with straight CPython because of the GIL anyway -- and if you're deploying a largish web app, you're going to have a load balancer (haproxy ftw) and you can just run one process per core anyway. This also mostly transparently lets you scale out to N boxes with M cores per box with very few architectural changes at the inbound request to web-server level.

Which brings me to the microprocess / shared module ideas -- RAM is pretty cheap -- 1GB or 2GB per core isn't anything fancy, and if you're using more than a couple gig in a python webapp process, something is probably wrong. As far as sharing state between microprocesses, I don't see how that really solves anything -- you're going to have processes on different boxes if you have any sort of traffic or fault tolerance and you're going to be putting that shared state into a cache or database somewhere anyway.

Raw compute speed is important to the scipy/numpy crowds, so I think things like Cython and PyPy make a lot of sense there -- webapps are I/O bound, so you spend 90% or more of the time it takes to service a request waiting, which async is really good at. PyPy isn't going to beat CPython by 20x on some django benchmark, but it will on some compute-intensive ones.

So anyway, I think my point is, you can have your cake and eat it too -- coroutines without pain (gevent) and processor-level concurrency (load balancer) for web applications using off the shelf production-ready technology on commodity hardware...

The first two points are describing fork(), which python already has on unix-like systems. If he means something different, then he needs to explain more how exactly it would work (especially regarding hardware/os support).

Interpreter-level fork() is very much what I mean ("green processes" would be another phrasing I guess?) Of course the Python interpreter always acts as an intermediary for any access to resources, so I feel like you can kind of punt on that, call it an interpreter implementation detail.

fork() isn't awesome with CPython, as its reference counting kind of destroys the utility of copy-on-write (reading an object creates a reference, which increments the object's reference count, which is a write). And there's really a lot of shared data that you can have between processes, since every module you import is data (and expensive data!) -- I'd like to see it feasible to create extremely short-lived processes, for instance (why have a worker pool at all?)
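The "reading is a write" point can be seen directly on CPython with `sys.getrefcount`. This is a small sketch, assuming CPython; other implementations such as PyPy do not keep per-object reference counts this way.

```python
import sys

big = object()  # stand-in for a large object loaded before fork()

before = sys.getrefcount(big)
alias = big  # merely holding another reference to it...
after = sys.getrefcount(big)

# ...changes the count stored inside the object itself (on CPython),
# which dirties the memory page the object lives on.
print(before, after)
```

Nothing about the object's value changed, yet its header did, and that is the write that makes a forked child copy the page.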

Once you've created a new microprocess I suppose the actual concurrency would be up to the interpreter to best decide (or at least there would be a lot of flexibility because that concurrency would not be so overtly exposed to the Python program) -- it would be reasonable for a Python microprocess to be an OS-level thread, for instance, or for it to simply be cooperative multitasking at the VM level, or even be a real OS fork (though without an OS fork it seems reasonable to share immutable objects, but once forked the objects must be immutable and serializable).

Eh, PyPy does not use reference counting. Are you saying PyPy + fork() is what you want?

BTW, PyPy (unlike CPython) is indeed great for multiprocessing.

CPython's ref-counting forces copy-on-write in RAM even when no write operation has been applied at all. So, if you have a huge read-only object in RAM, forget about multiprocessing. The object will be cloned for each process.

I agree. I think there needs to be an effort to push towards concurrency in Pypy. They were optimistic that the GIL would be easier to factor out in their runtime, and I think that could be a great start. I'm all for new methods for concurrency, but I feel being able to make efficient use of native threads would bring a lot of momentum to the project. Even though it's not as big a deal as most people make it out to be, Python gets a bad rap for the well publicized GIL problems.

Everyone is optimistic that they can factor out the GIL (see: Unladen Swallow). Then they find out the problem is much, much harder than they initially thought. That said, all it requires is a fundamental change to (e.g. dropping) the ref counting system:


So, perhaps PyPy can do it, and I hope they do - from an evolutionary standpoint, they're the interpreter with the most velocity inside of the interpreter core development, so if it's going to happen, it's going to happen there.

CPython will stay conservative, and much as it is today for some time.

For a while I had it in my mind that maybe you could save some memory while having safety just by using os.fork() after pre-importing a bunch of stuff; but then reference counting was pointed out to me and I realized it would be pointless -- the act of merely looking at an object changes the object, so copy-on-write means copy-everything.

Reference counting seems like it could be removed... except for all the C extensions that are specifically handling reference counting internally, removal of which I'm guessing would cause a lot of breakage. But then... it seems like those C extensions are exactly the problem, so if PyPy tries to keep them is it messing itself up? (Though I'm guessing it would somehow use proxy objects or some other trick to keep its internal object representation flexible.)

Ah! But you see, it's possible to trick CPython extensions into thinking you do support refcounting: http://code.google.com/p/ironclad/ is proof. So you can be compatible, yet advanced :)

> ...reference counting was pointed out to me and I realized it would be pointless -- the act of merely looking at an object changes the object, so copy-on-write means copy-everything.

Hmm...might it be worthwhile to work around this by storing reference counts in a separate set of pages, and having each object instead hold a constant pointer to its refcount? Presumably the extra level of indirection would have some degree of performance impact, but if it allows CoW to actually work in a meaningful way...

Though I guess that would probably break every C extension in existence, no? (Note: I know absolutely nothing about Python's C extension API, just guessing.)

Have your cake and eat it too: open a socket to a CPython process and let it handle the legacy C extensions.

In addition to speed and the points raised by the OP, I am interested in static/duck typing as it relates to both speed and static error-checking and documentation. i.e., making debugging easier and faster. I somehow think this can happen while keeping almost all of Python's strengths and ease of development.

Python 3 added function annotations, but I've yet to see much interest in actually using them. Traits (http://code.enthought.com/projects/traits/) pursues the error-checking side of type declarations, while Cython pursues the speed side. If you want both then too bad ;)
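For the error-checking side, here is a sketch of how annotations could be put to use at call time. `check_types` is a hypothetical helper written for this illustration, not part of Traits, Cython or the standard library; it assumes a modern Python 3 where `inspect.signature` is available, and it only checks annotations that are plain classes.

```python
import functools
import inspect


def check_types(func):
    """Check annotated arguments with isinstance at call time (hypothetical)."""
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            expected = func.__annotations__.get(name)
            # Only plain classes are checked; other annotations are ignored.
            if isinstance(expected, type) and not isinstance(value, expected):
                raise TypeError(
                    f"{name} must be {expected.__name__}, "
                    f"got {type(value).__name__}"
                )
        return func(*args, **kwargs)

    return wrapper


@check_types
def scale(vector: list, factor: float) -> list:
    return [x * factor for x in vector]


print(scale([1.0, 2.0], 2.0))
```

This buys error checking only. The annotations do nothing for speed here, which is the other half of the trade-off mentioned above.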
