
Automatic SIMD vectorization support in PyPy - wyldfire
http://morepypy.blogspot.com/2015/10/automatic-simd-vectorization-support-in.html
======
Fede_V
PyPy is an absolutely amazing piece of technology and the core developers
are brilliant people. However, their main use case is not numeric Python.

Their framework is (theoretically) able to trace any Python operation, no
matter how dynamic, and speed it up. This means that if you want to speed up
pure Python code, PyPy is really the only game in town.

The downside is that when scientists use Python, Python is used as a beautiful
API on top of very optimized code written in Fortran or C. If you want to write
numerically heavy code in Python, you are much, much better off using Numba.
Numba is much less ambitious than PyPy - it handles a small subset of Python
(basically, NumPy) but it is very, very fast and very efficient at speeding up
pure NumPy code. In my experience, 100x speedups (over pure NumPy code) are
not that uncommon.

The founder of Continuum (Travis Oliphant) wrote two blog posts about his
technical vision for a Python JIT:
http://technicaldiscovery.blogspot.it/2012/08/numba-and-llvmpy.html and
http://technicaldiscovery.blogspot.it/2012/07/more-pypy-discussions.html.
Basically, the Continuum team made a big bet that a very
efficient JIT that targets only numerical Python would be more useful than a
generic JIT that can theoretically handle all of Python. For my use case
(scientific coding), Numba is far superior.

~~~
mangecoeur
PyPy is really impressive, I love the idea of getting all these optimisations
(SIMD, STM) for free... but as you say, numerical work means NumPy, SciPy,
pandas, which don't work with PyPy. Even if the NumPyPy project were able to
fully match the NumPy API, you still have a lot of large projects like pandas
that depend on the C API. It would be stupid to copy everything. Perhaps
something can be worked out between PyPy, cffi, and Cython.

In the short term Numba is much more practical for numerics. In the longer
term Pyston looks promising - it's actually similar to Numba in that it also
uses LLVM, I imagine there could be synergy between the two...

~~~
sitkack
NumPy is essentially a protocol that dictates the memory layout of
multidimensional arrays, plus really fast Fortran code that knows that layout.
What needs to be copied is that memory layout protocol so that we get n:m
sharing instead of n^2 duplication.
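
As a rough sketch of what a shared layout protocol looks like in practice, the snippet below uses only the standard library: `array` stands in for NumPy's data buffer and `memoryview` exposes the shape, strides, and element format that any consumer would need. The names `buf` and `grid` are illustrative, and this is not how NumPy or NumPyPy is implemented, just the general idea of describing one block of memory so that several libraries can read it without copying.

```python
from array import array

# Twelve float64 values in one contiguous buffer (a stand-in for array data).
buf = array("d", range(12))

# View the same memory as a 3x4 row-major grid without copying.
# memoryview.cast needs a byte format on one side, hence the "B" hop.
grid = memoryview(buf).cast("B").cast("d", shape=[3, 4])

print(grid.shape)    # extent of each dimension
print(grid.strides)  # bytes to step along each dimension
print(grid.format)   # struct-style element type code

# Element (row, col) lives at row*strides[0] + col*strides[1] bytes.
row, col = 1, 2
byte_offset = row * grid.strides[0] + col * grid.strides[1]
flat_index = byte_offset // grid.itemsize

# A write through the original buffer is visible through the view,
# because both describe the same memory.
buf[flat_index] = 99.0
print(grid.tolist()[row][col])
```

Anything that understands shape, strides, and format can consume the buffer, which is why agreeing on the layout matters more than porting any one library.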

~~~
mangecoeur
My point was that even if you copy the NumPy protocol, you still have huge
projects that depend on C extensions that you wouldn't want to port.

Pandas is one, others include scikit-learn, scikit-image, Astropy,
bioinformatics libraries, stats libraries, etc... which all have heavy
C/Cython use and depend to varying degrees on the Python C API. Porting NumPy
barely scratches the surface of scientific Python.

~~~
dalke
Biopython runs on pypy. It also runs under Jython. While it has C use, it is
not "heavy" C use.

Going on a tangent, and though I realize it's a lost battle, I wish people
would stop saying that NumPy is the base of scientific programming in Python.
As Biopython shows, it isn't required for at least some of bioinformatics.

My own research[1] deals with chemical graphs, and NumPy/SciPy/etc. are nearly
irrelevant to that research.

[1] For example, given a set of 100 structures, what is the largest
substructure (based on the number of bonds) which is in at least 90 of the
structures?

------
ngoldbaum
All of the operations in the plots are being calculated using tiny arrays.
Even 128 elements is pretty small. Additionally, I'd be interested to know how
they compiled NumPy, and which BLAS implementation they linked against.

~~~
jcranmer
The graphs are a travesty of detail, and I suspect that the more you poke at
them, the worse the results get. There's a definite downward trend in speedup as you
increase vector size, and that is completely different from what you would
naively expect (at larger vector sizes, you spend more time in the
embarrassingly-parallelizable kernel)--so this suggests that the baseline
itself is not scalar code but vector code.

So what the graphs end up highlighting is that CPython (NumPy?) has considerable
overhead on tiny calls that don't really matter for gross performance. As you
say, 128 elements is tiny: that's where I'd consider starting a graph, not
ending it. At 4 elements, you're looking at about 40 clock cycles to do a
simple vector-add in the scalar case (I think x86 L1 cache hits are ~3 clock
cycles and there are two load/store units; if these numbers are wrong, my
estimate is off). This also means that there's absolutely no measurement of
the overhead of the JIT autovectorizing operation (!), which is quite
significant because the usual resistance to adding autovectorizers in JITs [1]
is that they're way too slow for the speedup they give.

[1] As far as I'm aware, the only other JIT of note that includes an
autovectorizer is the Hotspot JVM. The approach taken by other people (e.g.,
.NET CLR) is to effectively expose a SIMD primitive in the bytecode and get
compilers to target those via static autovectorization instead of
autovectorizing at runtime.

~~~
lqdc13
The demo should really be 128 thousand elements or 128 million.

128 elements is still useful for when you have an inner loop that has a small
vector operation. Most of the time this is not the case though.
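
A minimal sketch of how one might check how much of a tiny-array timing is fixed per-call overhead: time the same element-wise add at 4, 128, and 128,000 elements and compare the cost per element. This uses plain Python lists and `timeit` so it runs anywhere; it is not the benchmark from the blog post, and `add_lists`, `per_element_ns`, and the `budget` parameter are made-up names for illustration. Swapping in NumPy arrays would follow the same pattern.

```python
import timeit

def add_lists(a, b):
    """Element-wise add of two equal-length lists."""
    return [x + y for x, y in zip(a, b)]

def per_element_ns(n, repeats=5, budget=200_000):
    """Best-of-repeats time per element, in nanoseconds, for size n."""
    a = [float(i) for i in range(n)]
    b = [float(i) for i in range(n)]
    # Keep total work roughly constant: many calls for small n, few for large n.
    calls = max(1, budget // n)
    best = min(timeit.repeat(lambda: add_lists(a, b), number=calls, repeat=repeats))
    return best / (calls * n) * 1e9

results = {n: per_element_ns(n) for n in (4, 128, 128_000)}
for n, ns in results.items():
    print(f"{n:>7} elements: {ns:8.1f} ns per element")
```

If the per-element cost at 4 elements is much higher than at 128,000, the small-size numbers are mostly measuring call overhead rather than the kernel itself.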

------
amit_m
Can anyone explain the version numbers? PyPy 2.6.1 compared to 15.11, which
will eventually become 4.0.0...?

~~~
sirn
AFAIK, they were considering changing the versioning scheme because the old
scheme was becoming too confusing.

The old scheme (PyPy 2.x) resembles Python 2 versioning too much (PyPy 2.6 vs.
Python 2.7) and may give the impression that the PyPy version corresponds to the
Python version it implements (when in fact it does not). Say, if the next release is
PyPy 2.7, some may assume it is a PyPy implementation of Python 2.7 (even
though PyPy 2.6 already implements Python 2.7). The situation is just as
confusing with Python 3 (PyPy3 2.6 implements Python 3.2).

I think PyPy's initial plan was to use YY.MM for versioning, so the tentative
version was 15.11, but now it looks like they decided to keep the old scheme
with a major version greater than 3 instead (so the next release is PyPy 4.0.0
and PyPy3 4.0.0).

------
frozenport
Does CPython 3 do any better?

