
Ian Bicking: My Unsolicited Advice For PyPy - jnoller
http://blog.ianbicking.org/2011/04/04/unsolicited-advice-for-pypy/
======
hyperbovine
Ahh, but PyPy is not being written by web developers -- or for them. The main
guy behind the project is a research computer scientist (<http://cfbolz.de/>)
and most of the people who are excited about it are also doing scientific
computing.

I am one of them, and the ability to write scientific code in Python that runs
as fast (or faster!) than C would be so awesome that I can hardly put it into
words. Unlike web development, performance is still a huge bottleneck in this
field; people still mill about for 3 weeks waiting for their job to finish
just like it was the 1970s. Consequently, most code written for research
projects consists of a lot of arcane, difficult-to-read C, C++, or even (gasp!)
FORTRAN.

The benefits of switching to Python in terms of development time, ease of
debugging, reusability, potential for collaboration, reproducibility, etc. are
immense, but the performance penalty is such that it rarely happens. Even in
spite of that, Python already has a large academic following; if PyPy manages
to pull off its stated goals, it would be revolutionary.

~~~
cdavid
The big issue here is that most scientific code in python runs on top of
numpy, and getting numpy to work on top of pypy would be challenging.

I understand that relying on a C extension compatibility layer would make
things slower at first, but the alternative of re-implementing numpy from
scratch in python does not seem realistic to me. Even if numpy on top of pypy
is itself say 2 or 3x slower than on top of cpython, this could be quite
acceptable if you can get speed back on top of it for things for which numpy
does not help. This would also enable an iterative process of moving away from
cpython internals (which already started in numpy a couple of years ago,
albeit more slowly than everybody would have liked).

~~~
ianb
I hear there is an effort to get Numpy working on Cython, and work to get
Cython to target PyPy as a backend. Given the two efforts... maybe doable.

~~~
joeforker
Enough with the Cython nonsense! Cython has excellent marketing but it is a
really naive compiler. Just figure out how to translate Cython to
ctypes+Python for backwards compatibility and use pypy.
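A minimal sketch of what the ctypes+Python route looks like (this is an
illustration of calling a C library via ctypes in general, not an actual
Cython translation): load a shared library, declare the signature, and call
in; ctypes has historically been well supported on PyPy.

```python
import ctypes
import ctypes.util

# Load the C math library and declare sqrt's signature explicitly --
# without argtypes/restype, ctypes would misinterpret the double values.
libm = ctypes.CDLL(ctypes.util.find_library("m"))
libm.sqrt.argtypes = [ctypes.c_double]
libm.sqrt.restype = ctypes.c_double

result = libm.sqrt(9.0)
print(result)  # 3.0
```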

~~~
glenjamin
Cython is supposed to be a naive compiler; its primary use case is speeding up
specific functions by having them compiled. That it uses a Python-like
language to do this is great.

I'm reasonably certain the C parts of numpy rely more on implementation
optimisations than compiler optimisations for their increased efficiency.

------
haberman
"What do I want? [...] Shared objects across [micro-]processes with copy-on-
write; then you can efficiently share objects (like modules!) across
concurrent processes without the danger of shared state, but without the
overhead of copying everything you want to share."

Is the overhead of modules really an issue on servers with GBs of RAM? More
importantly, almost every multi-threaded string library (like C++ STL) has
moved _away_ from sharing across threads and COW because the cost of the
atomic operations is too high. See Herb Sutter's "Optimizations That Aren't
(In a Multithreaded World)."

<http://www.gotw.ca/publications/optimizations.htm>

Also, to say that speed is "really uninteresting" but latency is important
seems like a contradiction. Latency is absolutely gated by speed for non-
parallelizable problems. And even if the problem can be parallelized, speed is
a _much_ easier and simpler way to decrease latency than parallelizing.

I guess this reads to me more like a list of theoretically interesting ideas
than a set of features that will actually help anyone in the real world.

~~~
ianb
RAM seems relatively constrained to me -- at least, running a multiprocess
server _without_ a few GB of RAM is hard to do (and memory seems to be the
most expensive part of servers). And when it goes wrong (you reach whatever
your limit is) then things tend to fail in less than ideal ways. Tools around
memory usage are also quite poor, so while performance gets optimized memory
seldom does. And a culture of cavalier memory usage doesn't help either -- too
many people are borrowing memory to get performance, compounded in the case of
an application that is typically an aggregation of many people's work.

The amount of code involved in systems is also continuing to go up, so that
the amount of memory you use before you've done anything (but when you've
loaded up all the code) is getting constantly higher. In the world of
static/compiled languages this might not be as notable as there's a clear
sense of "code" and "data" -- Python has no such distinction, so if you don't
share anything then you don't share _anything_.

WRT speed-vs-latency -- definitely related, but most benchmarking seems to
specifically remove latency from the benchmark and instead test throughput.
E.g., tests frequently throw away the slowest run, even though abnormally slow
runs exist in real programs and can have an effect. (Of course they aren't
predictable and might be affected by other things on the system -- which is a
kind of bias to not optimize things that are hard to measure). But latency is
mostly just simplicity of implementation, so no, I wouldn't expect
parallelizing to help.

~~~
haberman
I guess I can't disagree with your specific points about memory, but OS pages
are already copy-on-write -- if you're really concerned about that kind of
overhead you could always fork() individual OS processes from one that already
has everything loaded.

~~~
ianb
Re: copy-on-write: <http://news.ycombinator.com/item?id=2407397> -- tl;dr:
reference counting ruins copy-on-write
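The mechanism is easy to demonstrate in CPython: merely *reading* an object
(binding another name to it) bumps its reference count, which is a write to
the object's memory, so the OS must copy the page even though the program
never mutated anything.

```python
import sys

# A sketch of why refcounting defeats copy-on-write: binding a new name
# to an object increments its refcount -- a write to the object's header.
obj = object()
before = sys.getrefcount(obj)
alias = obj                    # a "read-only" access, from the program's view...
after = sys.getrefcount(obj)
print(before, after)           # ...yet the count went up: memory was modified
assert after == before + 1
```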

------
n_are_q
For what it's worth, another 2c from another web developer: for me, memory
isolation and COW for module dependencies would be less useful than a single
process and a lack of GIL (hold the flamethrowers for one second please),
allowing me to share resources on a multicore server -- those same modules,
database connections, and any significant static data I may read in at
startup -- decreasing my operational complexity, etc. In a web server
environment shared memory isn't particularly scary, to me anyway. Am I
misunderstanding the reason for COW?

Also, the most important characteristic to me is latency. I don't care about
throughput that much as long as it's "good enough", hardware is getting really
cheap really quickly anyway (in terms of cores and RAM/$). Latency on the
other hand is tough. While IO is normally going to be the biggest factor here,
if the runtime allows me to consistently shave 5ms off of every request -
that's still a great improvement. In this sense I would say that raw speed IS
important to me.

Perhaps I haven't encountered the use cases Ian is talking about; it was
difficult to figure them out from the blog post (I'm a relative novice in
Python).

But I am hoping PyPy can at some point help with the concerns I have above,
even lower latency by itself may be compelling enough to switch.

~~~
ianb
Copy-on-write lets you efficiently share objects across boundaries -- not
copying everything right away (and everything in Python means _everything_ ),
while also safely avoiding the sharing of state. If you are cool with threads
then it's not an issue. If you can spare the overhead of a few processes then
the GIL isn't an issue. I'd rather it not be a matter of tradeoffs, and I
don't think either OS processes or threads are it.

------
mrshoe
It seems like he could get most of what he wants without a new runtime. Diesel
(<http://dieselweb.org>) provides async without callbacks or Deferreds.
Diesel1 used Python 2.5's yield expression, and Diesel2 switched to using
greenlets. As long as your modules and your applications don't rely on global
state (which they shouldn't anyway), you essentially get everything on this
list. For web programming, cooperative multitasking usually works perfectly
well, and generally no state is shared between any two HTTP requests, so I'm
not sure how much a web programmer stands to gain from preemptive multitasking
and
completely separate address spaces with CoW.
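The generator-based style mentioned above can be sketched in a few lines (a
toy scheduler in the spirit of Diesel1's yield-expression approach, not
Diesel's actual API): each task is a generator that yields to cooperatively
hand control back, so there are no callbacks and no preemption.

```python
from collections import deque

def task(name, steps, log):
    """A cooperative task: each yield is a scheduling point."""
    for i in range(steps):
        log.append((name, i))
        yield                      # hand control back to the scheduler

def run(tasks):
    ready = deque(tasks)
    while ready:
        t = ready.popleft()
        try:
            next(t)
            ready.append(t)        # reschedule until the generator finishes
        except StopIteration:
            pass

log = []
run([task("a", 2, log), task("b", 2, log)])
print(log)  # interleaved: [('a', 0), ('b', 0), ('a', 1), ('b', 1)]
```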

The one thing Diesel is missing is good runtime introspection and more
primitives to encourage reliability (a la Erlang), but if you hang out in
#diesel you'll see that jamwt has already implemented a lot of that at Bump
and he plans on pushing it down into the framework soon (probably for
Diesel3).

~~~
ianb
All these systems are share-everything, which I'm not excited about -- if I
want share-everything I can just use threads, and they work far better than
most people seem willing to admit.

~~~
otterley
Threads have _at least_ the following problems in high concurrency
environments:

(1) context switching is expensive

(2) access to shared variables leads to race conditions, which requires
careful synchronization (to be fair, this is true for any shared storage:
databases, mmap()ed regions, etc.)

See also
<http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1.pdf>, which
provides a thorough treatment of why multithreaded programs are inherently
less reliable than single-threaded ones.
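Point (2) is easy to reproduce: `counter += 1` is a read-modify-write, so
without synchronization, concurrent increments can interleave and lose
updates. A minimal sketch of the careful synchronization otterley mentions:

```python
import threading

# Without the lock, the read-modify-write in `counter += 1` can interleave
# across threads and updates are silently lost; the Lock serializes it.
counter = 0
lock = threading.Lock()

def add(n):
    global counter
    for _ in range(n):
        with lock:
            counter += 1

threads = [threading.Thread(target=add, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 40000 with the lock; typically less without it
```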

~~~
FixedPoint
Quoting Robert Harper: 'Concurrency is concerned with nondeterministic
composition of programs (or their components).' [1]

Threads can be a very efficient way of introducing non-determinism (preferable
to incurring IPC overhead). Writing correct code then boils down to
controlling interactions between threads. Actor architectures can be very
helpful in such settings. Key to safe implementations are type-enforced
isolation and immutability. More experimentally, STMs offer the promise of
high-level composition of actors. My 2p.

[1]
<http://existentialtype.wordpress.com/2011/03/17/parallelism-is-not-concurrency/>
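The actor idea above can be sketched with just the standard library (a toy
illustration, not tied to any actor framework): threads never touch shared
mutable state directly; their only interaction is message passing through
queues, which is the isolation FixedPoint describes.

```python
import queue
import threading

def actor(inbox, outbox):
    """A minimal actor: owns its state, communicates only by messages."""
    while True:
        msg = inbox.get()
        if msg is None:          # poison pill shuts the actor down
            break
        outbox.put(msg * 2)      # reply by message, never by shared state

inbox, outbox = queue.Queue(), queue.Queue()
t = threading.Thread(target=actor, args=(inbox, outbox))
t.start()
inbox.put(21)
result = outbox.get()
inbox.put(None)
t.join()
print(result)  # 42
```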

------
btmorex
The first two points are describing fork(), which python already has on unix-
like systems. If he means something different, then he needs to explain more
how exactly it would work (especially regarding hardware/os support).

~~~
ianb
Interpreter-level fork() is very much what I mean ("green processes" would be
another phrasing I guess?) Of course the Python interpreter always acts as an
intermediary for any access to resources, so I feel like you can kind of punt
on that, call it an interpreter implementation detail.

fork() isn't awesome with CPython, as its reference counting kind of destroys
the utility of copy-on-write ( _reading_ an object creates a reference, which
increments the object's reference count, which is a write). And there's really
a lot of shared data that you can have between processes, since every module
you import is data (and expensive data!) -- I'd like to see it feasible to
create extremely short-lived processes, for instance (why have a worker pool
at all?)

Once you've created a new microprocess I suppose the actual concurrency would
be up to the interpreter to best decide (or at least there would be a lot of
flexibility because that concurrency would not be so overtly exposed to the
Python program) -- it would be reasonable for a Python microprocess to be an
OS-level thread, for instance, or for it to simply be cooperative multitasking
at the VM level, or even be a real OS fork (though without an OS fork it seems
reasonable to share immutable objects, but once forked the objects must be
immutable _and_ serializable).
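At the OS level the baseline for this already exists; a sketch of the
pre-import-then-fork pattern the "green process" idea would mirror at the
interpreter level (Unix-only; the `data` name is just an illustration):

```python
import os

# Load expensive state once in the parent; fork() then shares it with the
# child via OS copy-on-write pages -- no explicit copy is made.
data = {"config": list(range(1000))}

pid = os.fork()
if pid == 0:
    # The child sees the parent's objects "for free"...
    assert data["config"][0] == 0
    # ...but in CPython, even reading them bumps refcounts, dirtying pages.
    os._exit(0)
else:
    _, status = os.waitpid(pid, 0)
    print(os.WEXITSTATUS(status))  # 0: the child exited cleanly
```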

~~~
sanxiyn
Eh, PyPy does not use reference counting. Are you saying PyPy + fork() is what
you want?

~~~
vak
BTW, PyPy (unlike CPython) is indeed great for multiprocessing.

CPython's ref-counting forces copies of copy-on-write pages even when no write
operation has been applied at all. So, if you have a huge read-only object in
RAM, forget about multiprocessing: the object will be cloned for each process.

------
ominous_prime
I agree. I think there needs to be an effort to push towards concurrency in
Pypy. They were optimistic that the GIL would be easier to factor out in their
runtime, and I think that could be a great start. I'm all for new methods for
concurrency, but I feel being able to make efficient use of native threads
would bring a lot of momentum to the project. Even though it's not as big a
deal as most people make it out to be, Python gets a bad rap for the well
publicized GIL problems.

~~~
jnoller
Everyone is optimistic that they can factor out the GIL (see: Unladen
Swallow). Then they find out the problem is much, much harder than they
initially thought. That said, all it requires is a fundamental change to
(e.g. dropping) the ref counting system:

<http://code.google.com/p/unladen-swallow/wiki/ProjectPlan#Global_Interpreter_Lock>

So, perhaps PyPy _can_ do it, and I hope they do - from an evolutionary
standpoint, they're the interpreter with the most velocity in core interpreter
development, so if it's going to happen, it's going to happen there.

CPython will stay conservative, and much as it is today for some time.

~~~
ianb
For a while I had it in my mind that maybe you could save some memory while
having safety just by using os.fork() after pre-importing a bunch of stuff;
but then reference counting was pointed out to me and I realized it would be
pointless -- the act of merely looking at an object changes the object, so
copy-on-write means copy-everything.

Reference counting seems like it could be removed... except for all the C
extensions that are specifically handling reference counting internally,
removal of which I'm guessing would cause a lot of breakage. But then... it
seems like those C extensions are exactly the problem, so if PyPy tries to
keep them is it messing itself up? (Though I'm guessing it would somehow use
proxy objects or some other trick to keep its internal object representation
flexible.)

~~~
jjs
Have your cake and eat it too: open a socket to a CPython process and let _it_
handle the legacy C extensions.

~~~
jnoller
<http://morepypy.blogspot.com/2010/04/using-cpython-extension-modules-with.html>

------
baltcode
In addition to speed and the points raised by the OP, I am interested in
static/duck typing as it relates to both speed and static error-checking and
documentation. i.e., making debugging easier and faster. I somehow think this
can happen while keeping almost all of Python's strengths and ease of
development.

~~~
ianb
Python 3 added type annotations, but I've yet to see much interest in actually
using them. Traits (<http://code.enthought.com/projects/traits/>) pursues the
error-checking side of type declarations, while Cython pursues the speed side.
If you want _both_ then too bad ;)
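For the error-checking side, Python 3's function annotations (PEP 3107) are
enough to build a checker by hand; a sketch (not the Traits API -- `checked`
and `scale` are hypothetical names for illustration):

```python
import functools
import inspect

def checked(func):
    """Verify arguments against the function's annotations at call time."""
    sig = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = sig.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            expected = func.__annotations__.get(name)
            if expected is not None and not isinstance(value, expected):
                raise TypeError("{} must be {}".format(name, expected.__name__))
        return func(*args, **kwargs)

    return wrapper

@checked
def scale(x: float, factor: int):
    return x * factor

print(scale(2.0, 3))  # 6.0; scale(2.0, "3") would raise TypeError
```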

