I am one of them, and the ability to write scientific code in Python that runs as fast as (or faster than!) C would be so awesome that I can hardly put it into words. Unlike web development, performance is still a huge bottleneck in this field; people still mill about for 3 weeks waiting for their job to finish, just like it was the 1970s. Consequently, most code written for research projects consists of a lot of arcane, difficult-to-read C, C++, or even (gasp!) FORTRAN.
The benefits of switching to Python in terms of development time, ease of debugging, reusability, potential for collaboration, reproducibility, etc. are immense, but the performance penalty is such that it rarely happens. Even in spite of that, Python already has a large academic following; if PyPy manages to pull off its stated goals, it would be revolutionary.
I understand that relying on a C extension compatibility layer would make things slower at first, but the alternative of re-implementing numpy from scratch in Python does not seem realistic to me. Even if numpy on top of PyPy is itself, say, 2 or 3x slower than on top of CPython, this could be quite acceptable if you can get the speed back on top of it for things for which numpy does not help. This would also enable an iterative process of getting away from CPython internals (which has already started in numpy a couple of years ago, albeit more slowly than everybody would have liked).
Adding PyPy as a target to Cython is news to me, but that also seems like a significant amount of work given the nature of both projects.
What fundamentally bothers me about those approaches is that they require a lot of work before they can prove they work at all, whereas a more complete C API emulation layer would help many more projects and allow a more iterative process. The latter almost always works better in my experience.
I'm reasonably certain the C parts of numpy rely more on implementation optimisations than compiler optimisations for their increased efficiency.
Upside: I get 10-100X performance improvements!
I really have high hopes for PyPy :)
Right now, with SciPy arrays, you get pretty good performance if you use matrix operations and built-in functions. That can be unnatural, and in an ideal world a plain for loop would actually be more efficient. Is PyPy optimized for matrix-style code, for loops, or both? What about functional programming, say, finding the maximum of a user-supplied function? Will the maximizing and maximized functions be inlined for performance? Thank you to any answering PyPy developer :)
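To make that concrete, here's a minimal sketch (assuming numpy is installed; the function and sizes are made up) of the three styles I'm asking about:

    import numpy as np

    x = np.linspace(0.0, 10.0, 100000)

    # Matrix-style / built-in functions: fast on CPython because the loop runs in C,
    # but you pay for the temporaries and it can feel unnatural.
    best_vectorized = np.max(np.sin(x) * np.exp(-x))

    # Plain for loop: the natural way to write it, slow on CPython today --
    # exactly the kind of code one would hope a JIT makes competitive.
    best_loop = float("-inf")
    for v in x:
        y = np.sin(v) * np.exp(-v)
        if y > best_loop:
            best_loop = y

    # Functional style: maximum of a user-supplied function.  Whether f() gets
    # inlined into the loop is the inlining question above.
    def f(v):
        return np.sin(v) * np.exp(-v)

    best_functional = max(f(v) for v in x)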
Maybe it's not a binary switch (performance exclusions switch, for example). Maybe it's a setting with various degrees of performance. You need introspection? We'll give you that, but now you can only compile at the 2x performance level (or whatever). You need certain libraries that aren't written for performance? That's going to cost you.
Then again, it's quite possible that I don't know what I am talking about!
A Python that could run incrementally, with robust threads, might just possibly make SAGE a better environment and so increase migration from MATLAB (my personal scientific computing experience was with Matlab, and anything that escapes this monster is great in my book).
I mean, Python won't become faster than optimized C, so however convincing or unconvincing the faster-iteration and usability argument is, that, in the end, is going to have to be what makes the case for Python.
Is overhead of modules really an issue on servers with GB of RAM? More importantly, almost every multi-threaded string library (like C++ STL) has moved away from sharing across threads and COW because the cost of the atomic operations is too high. See Herb Sutter's "Optimizations That Aren't (In a Multithreaded World)."
Also, to say that speed is "really uninteresting" but latency is important seems like a contradiction. Latency is absolutely gated by speed for non-parallelizable problems. And even if the problem can be parallelized, speed is a much easier and simpler way to decrease latency than parallelizing.
I guess this reads to me more like a list of theoretically interesting ideas than a set of features that will actually help anyone in the real world.
The amount of code involved in systems is also continuing to go up, so that the amount of memory you use before you've done anything (but when you've loaded up all the code) is getting constantly higher. In the world of static/compiled languages this might not be as notable as there's a clear sense of "code" and "data" -- Python has no such distinction, so if you don't share anything then you don't share anything.
WRT speed-vs-latency -- definitely related, but most benchmarking seems to specifically remove latency from the benchmark and instead test throughput. E.g., tests frequently throw away the slowest run, even though abnormally slow runs exist in real programs and can have an effect. (Of course they aren't predictable and might be affected by other things on the system -- which is a kind of bias to not optimize things that are hard to measure). But latency is mostly just simplicity of implementation, so no, I wouldn't expect parallelizing to help.
Also, the most important characteristic to me is latency. I don't care about throughput that much as long as it's "good enough"; hardware is getting really cheap really quickly anyway (in terms of cores and RAM/$). Latency, on the other hand, is tough. While IO is normally going to be the biggest factor here, if the runtime allows me to consistently shave 5ms off of every request, that's still a great improvement. In this sense I would say that raw speed IS important to me.
Perhaps I haven't encountered the use cases Ian is talking about; it was difficult to figure them out from the blog post (I'm a relative novice in Python).
But I am hoping PyPy can at some point help with the concerns I have above, even lower latency by itself may be compelling enough to switch.
The one thing Diesel is missing is good runtime introspection and more primitives to encourage reliability (a la Erlang), but if you hang out in #diesel you'll see that jamwt has already implemented a lot of that at Bump and he plans on pushing it down into the framework soon (probably for Diesel3).
In fact, the higher-level erlang-style stuff that's not released yet really abstracts the idea of queues and consumers to the point where it's not entirely clear or important whether particular jobs are within the same process or even on the same machine (in practice, they usually aren't).
I guess it boils down to "don't use mutable globals". That's as true in threaded environments as it is in async ones.
If your requirement is that there must be system-enforced safety rather than safe practices and idioms, I'd say Python isn't the right tool (Haskell is) to begin with. As you know, Python is all about practices, idioms, and discipline.
Greenlets/tasklets are game changing with respect to OS threads not because they introduce new ways of working, but because their performance and scalability afford you ways of working that would be theoretically possible but impractical with threads. The GIL only widens that gap.
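As a throwaway illustration of the scale difference (numbers arbitrary, using gevent for concreteness), something like this is routine with greenlets but would be a bad idea with 100,000 OS threads:

    import gevent

    def tick(i):
        gevent.sleep(1)   # each greenlet costs a few KB of heap; 100,000 OS thread stacks would not be so cheap
        return i

    jobs = [gevent.spawn(tick, i) for i in range(100000)]
    gevent.joinall(jobs)
    print(sum(1 for j in jobs if j.successful()))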
(1) context switching is expensive
(2) access to shared variables leads to race conditions, which requires careful synchronization (to be fair, this is true for any shared storage: databases, mmap()ed regions, etc.)
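To illustrate (2), a minimal sketch of the classic lost-update race and the locking it forces on you (counts are arbitrary):

    import threading

    counter = 0
    lock = threading.Lock()

    def unsafe_increment(n):
        global counter
        for _ in range(n):
            counter += 1          # read-modify-write; a thread switch in the middle loses updates

    def safe_increment(n):
        global counter
        for _ in range(n):
            with lock:            # the "careful synchronization" in question
                counter += 1

    threads = [threading.Thread(target=safe_increment, args=(100000,)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(counter)   # 400000 with the lock; swap in unsafe_increment and it can come up short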
See also http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1...., which provides a thorough treatment of why multithreaded programs are inherently less reliable than single-threaded ones.
Threads can be a very efficient way of introducing non-determinism (preferable to incurring IPC overhead). Writing correct code then boils down to controlling interactions between threads. Actor architectures can be very helpful in such settings. Key to safe implementations are type-enforced isolation and immutability. More experimentally, STMs offer the promise of high-level composition of actors. My 2p.
Race conditions are a type of programmer error, not an invariable consequence of shared state. You seem to be implying that high concurrency increases the risk of race conditions, which it doesn't. A race condition is either present or it isn't. A correctly written application that uses threads isn't inherently less reliable than a single-threaded application, it is just harder to write.
I suppose if you mean true (multi-core) concurrency then you're right that Python has some big limitations, but that pretty much boils down to the age old "remove the GIL" proposal.
Threads are a little half-assed, sure, but so are most of the currently offered "replacements" for threads, including cooperative multitasking (which also doesn't do anything with multiple processors!)
I'm guessing your "half-assed" comment is in reference to some of the same things David Beazley mentioned in his PyCon talk, but it's worth linking to anyway: http://www.dabeaz.com/GIL/
Basically threads work pretty darn well until you throw some real load at them.
For high concurrency, there is of course the question of what level in the stack you want to handle the concurrency at. There's not much use in doing 5000 things at the same time. But you might want to handle 5000 things at the same time (say, open sockets). The question is how much you want to twist your code around to do this -- with threads you might have worker pools, and you might have to be sure your chunks of work are of reasonable size (and often it's tricky to do some things incrementally). With cooperative concurrency... well, you do all the same things ;) But sometimes when threads are enough you don't have to contort your code, and in some error cases threads will help a lot (like when you happen not to partition your work sufficiently).
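For what it's worth, here's a rough sketch of the thread-pool-plus-chunking contortion I mean (names and sizes made up):

    import queue
    import threading

    tasks = queue.Queue()
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            chunk = tasks.get()
            if chunk is None:                      # sentinel: no more work
                return
            partial = sum(x * x for x in chunk)    # stand-in for a "reasonably sized" piece of work
            with results_lock:
                results.append(partial)

    # Picking the chunk size is the tweaky part: too small and queue overhead dominates,
    # too big and one slow chunk holds everything up.
    CHUNK = 10000
    pool = [threading.Thread(target=worker) for _ in range(8)]
    for t in pool:
        t.start()
    for i in range(0, 1000000, CHUNK):
        tasks.put(list(range(i, i + CHUNK)))
    for _ in pool:
        tasks.put(None)
    for t in pool:
        t.join()
    print(sum(results))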
I really just want to get rid of those tweaky code things for all cases, so that it's not a matter of these different choices with different tradeoffs, but just one really good choice that applies much more widely.
Which brings me to the microprocess / shared module ideas -- RAM is pretty cheap -- 1GB or 2GB per core isn't anything fancy, and if you're using more than a couple gig in a python webapp process, something is probably wrong. As far as sharing state between microprocesses, I don't see how that really solves anything -- you're going to have processes on different boxes if you have any sort of traffic or fault tolerance and you're going to be putting that shared state into a cache or database somewhere anyway.
Raw compute speed is important to the scipy/numpy crowds, so I think things like Cython and PyPy make a lot of sense there -- webapps are I/O bound, so you spend 90% or more of the time it takes to service a request waiting, which async is really good at. PyPy isn't going to beat CPython by 20x on some django benchmark, but they will on some compute-intensive ones.
So anyway, I think my point is, you can have your cake and eat it too -- coroutines without pain (gevent) and processor-level concurrency (load balancer) for web applications using off the shelf production-ready technology on commodity hardware...
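Concretely, the "coroutines without pain" half might look like this minimal gevent sketch (the URL is a placeholder):

    import gevent
    from gevent import monkey
    monkey.patch_all()      # sockets become cooperative; blocking-looking code just yields to the hub

    from urllib.request import urlopen

    def fetch(url):
        # Reads synchronously as far as this code is concerned; only this greenlet waits.
        return len(urlopen(url).read())

    urls = ["http://example.com/"] * 100
    jobs = [gevent.spawn(fetch, u) for u in urls]
    gevent.joinall(jobs, timeout=10)
    print(sum(job.value for job in jobs if job.value is not None))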
fork() isn't awesome with CPython, as its reference counting kind of destroys the utility of copy-on-write (reading an object creates a reference, which increments the object's reference count, which is a write). And there's really a lot of shared data that you can have between processes, since every module you import is data (and expensive data!) -- I'd like to see it feasible to create extremely short-lived processes, for instance (why have a worker pool at all?)
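You can see the "reading is a write" part directly from the interpreter (a trivial sketch):

    import sys

    obj = object()
    print(sys.getrefcount(obj))   # baseline (getrefcount's own argument adds a temporary reference)

    alias = obj                   # merely naming the object again...
    print(sys.getrefcount(obj))   # ...incremented the count stored in the object's own header

    # In a forked child the same thing happens to every object you so much as iterate over,
    # so the OS ends up copying the pages those objects live in.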
Once you've created a new microprocess, I suppose the actual concurrency would be up to the interpreter to decide (or at least there would be a lot of flexibility, because that concurrency would not be so overtly exposed to the Python program) -- it would be reasonable for a Python microprocess to be an OS-level thread, for instance, or to simply be cooperative multitasking at the VM level, or even a real OS fork (though without an OS fork it seems reasonable to share immutable objects, while once forked any objects passed between microprocesses would have to be immutable and serializable).
CPython's ref-counting forces copy-on-write copies in RAM even when no write operation has been applied to the object at all. So, if you have a huge read-only object in RAM, forget about multiprocessing: the object will effectively be cloned in each process.
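A minimal sketch of the pattern being described (sizes are made up; this assumes Linux-style fork() workers):

    from multiprocessing import Pool

    # A large structure the workers only ever read.
    TABLE = [float(i) for i in range(5 * 1000 * 1000)]

    def lookup(i):
        # Pure reads as far as the program is concerned -- but each access bumps the
        # element's refcount, which dirties the page it lives on, so copy-on-write
        # ends up duplicating those pages in every worker anyway.
        return TABLE[i] * 2.0

    if __name__ == "__main__":
        with Pool(processes=4) as pool:    # each worker starts life as a fork() of the parent
            print(sum(pool.map(lookup, range(0, 5 * 1000 * 1000, 500000))))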
So, perhaps PyPy can do it, and I hope they do - from an evolutionary standpoint, they're the interpreter with the most velocity in core interpreter development, so if it's going to happen, it's going to happen there.
CPython will stay conservative and remain much as it is today for some time.
Reference counting seems like it could be removed... except for all the C extensions that are specifically handling reference counting internally, removal of which I'm guessing would cause a lot of breakage. But then... it seems like those C extensions are exactly the problem, so if PyPy tries to keep them is it messing itself up? (Though I'm guessing it would somehow use proxy objects or some other trick to keep its internal object representation flexible.)
Hmm...might it be worthwhile to work around this by storing reference counts in a separate set of pages, and having each object instead hold a constant pointer to its refcount? Presumably the extra level of indirection would have some degree of performance impact, but if it allows CoW to actually work in a meaningful way...
Though I guess that would probably break every C extension in existence, no? (Note: I know absolutely nothing about Python's C extension API, just guessing.)