
How to Make Python Run as Fast as Julia (2015) - aaronchall
https://www.ibm.com/developerworks/community/blogs/jfp/entry/Python_Meets_Julia_Micro_Performance?lang=en
======
cs702
If you really care about extracting every possible ounce of performance out of
Python's scientific stack, which relies extensively on Numpy, the best
practical guide I have found for doing that is "From Python to Numpy," by
Nicolas P. Rougier of Inria: [https://www.labri.fr/perso/nrougier/from-python-
to-numpy/](https://www.labri.fr/perso/nrougier/from-python-to-numpy/)

Here's a typical example of the kinds of optimizations this guide teaches you,
in this case by avoiding the creation of temporary copies of Numpy arrays in
memory:

    
    
      # Create two int arrays, each filled with one billion 1's.
      X = np.ones(1000000000, dtype=np.int)
      Y = np.ones(1000000000, dtype=np.int)
    
      # Add 2 * Y to X, element by element:

      # Slowest
      %timeit X = X + 2.0 * Y
      100 loops, best of 3: 3.61 ms per loop

      # A bit faster
      %timeit X = X + 2 * Y
      100 loops, best of 3: 3.47 ms per loop

      # Much faster
      %timeit X += 2 * Y
      100 loops, best of 3: 2.79 ms per loop

      # Fastest
      %timeit np.add(X, Y, out=X); np.add(X, Y, out=X)
      100 loops, best of 3: 1.57 ms per loop
    

That's a 2.3x speed improvement (from 3.61 ms to 1.57 ms) on a simple vector
operation (your mileage will vary!).[1] This only scratches the surface. The
guide goes into quite a bit of _explicit_ detail about how Numpy arrays are
constructed and stored in memory and always explains the underlying reasons
why some operations are faster than others. In addition, the guide has a
section titled "Beyond Numpy" that points to even more ways of improving
performance, e.g., by using Cython, Numba, PyCUDA, and a range of other tools.
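The fastest variant above is easy to sanity-check: the two in-place adds really do compute X + 2*Y. A tiny sketch (sizes shrunk for illustration; assumes numpy is installed):

```python
import numpy as np

X = np.ones(10, dtype=np.int64)
Y = np.ones(10, dtype=np.int64)

# Non-destructive form: allocates temporaries for 2 * Y and for the sum.
expected = X + 2 * Y

# In-place form: two adds written into X's existing buffer, no temporaries.
np.add(X, Y, out=X)
np.add(X, Y, out=X)

assert (X == expected).all()  # both forms yield X + 2*Y
```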

I highly recommend reading the whole thing!

--

[1] Example copied from here: [https://www.labri.fr/perso/nrougier/from-
python-to-numpy/#an...](https://www.labri.fr/perso/nrougier/from-python-to-
numpy/#anatomy-of-an-array)

~~~
hzhou321
Not only are you now employing a foreign data type (the numpy array) that
requires separate learning, independent of (and potentially confusing
alongside) your existing Python knowledge, you also need to mentally
differentiate between:

    
    
        X = X + 2.0 * Y
        X = X + 2 * Y
        X += 2 * Y
        X += Y; X += Y
    

all of which are irrelevant to your semantic logic.

Why not recognize that it is much more straightforward to do that directly in
C? It is only a bit more to type, but a much smaller mental load to understand
(and to maintain in the long run). If performance is at stake, spell it out in
C (with x86 intrinsics if necessary) and make it part of the semantics of your
code.

Worrying about millisecond-scale performance in Python is misguided.

~~~
cs702
There's no need to mentally differentiate between all those approaches when
writing code with Python's scientific stack.

In practice, code is initially written with relatively little regard for how
it affects the performance of Numpy and other libraries like it, and then,
_afterwards_, and only _if necessary_, these techniques are used to optimize
those lines of code that prove critical to performance, typically only a small
fraction of all lines.
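That measure-first workflow can be sketched with the stdlib profiler; the function names below are hypothetical stand-ins for real code:

```python
import cProfile
import io
import pstats

def slow_inner(n):
    # Hypothetical hot spot: the kind of line you would later vectorize.
    return sum(i * i for i in range(n))

def pipeline():
    return [slow_inner(10_000) for _ in range(50)]

# Profile the whole run, then rank by cumulative time to find the
# small fraction of lines that is actually worth optimizing.
profiler = cProfile.Profile()
profiler.enable()
result = pipeline()
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
```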

However, if you're working on a project for which _every_ line of code is
performance-critical, then I would agree with you, Python would not be a good
choice for that.

------
temporaryred
From looking at most of these examples it's clear that Python will be able to
perform really well with projects like numba, Cython and PyPy in the picture.
My impressions after using Cython are that Cython can for the most part even
outperform Julia. Couple that with the fact that with Julia there will
probably be a few steps before you squeeze the maximum possible performance
out of your algorithm, and it makes Cython a no-brainer for a developer who is
already using Python.

That is, until you consider that most of these examples do not showcase (what
is in my humble opinion) Julia's true strength: homoiconicity and an opt-in
extensible type system that is part of the language.

It's trivial to make a new class in Python and set up whatever behavior one is
interested in. One can easily use decorators, metaclasses, descriptors,
properties and other dunder methods to customize class behavior to one's
heart's desire. But Cython (as far as I understand) does not support these
Python features. If an existing library implements an interface using these
Python features, it's not obvious to me how it would be possible to use
Cython to improve performance. Julia, on the other hand, has no such
restrictions. I would even say that using macros and metaprogramming in
everyday programming is encouraged. I've been a Python user for over 6 years
now. I've only looked at Julia for a few months, and I can already see
concepts and ideas in Julia that have no analogue in Python. But most if not
all Python programming features translate to Julia one way or the other.
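For example, a few lines of dunder methods and a property already customize a class quite heavily (a toy sketch, not from any library):

```python
class Meters:
    """Toy value type customizing construction, arithmetic and display."""

    def __init__(self, value):
        self._value = float(value)

    @property
    def feet(self):  # computed attribute via the descriptor protocol
        return self._value * 3.28084

    def __add__(self, other):  # operator overloading via a dunder method
        return Meters(self._value + other._value)

    def __repr__(self):
        return f"Meters({self._value})"

total = Meters(2) + Meters(3)  # Meters(5.0)
```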

I love Python, but I can't help but feel that the lack of an opt-in type
system and more powerful metaprogramming constructs is hurting it in these
comparisons.

------
kristofferc
Isn't adding things like memoization or writing the code in C (Cython) missing
the whole point of the benchmark, which is to test the overhead of recursion /
function calls in the language itself?

~~~
knlje
Users of programming languages interested in the end result (the output of the
program, etc.) do not care. They want the fastest-performing language for the
job. The Julia website has been misleading people. Due to those claims, I
spent a week porting some of my simulation code to Julia before I realized
that it was actually slower in (optimized) Julia than in optimized Python.

~~~
kristofferc
For those who only care about the output of fib(20) there are more efficient
methods than any of the Julia or Python implementations posted in the link,
e.g. lookup tables.
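A memoized version is effectively such a lookup table. A sketch with the stdlib cache, which illustrates the point: it changes what is being measured entirely:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    # Each value is computed once and then served from the cache,
    # so this no longer benchmarks recursion overhead at all.
    return n if n < 2 else fib(n - 1) + fib(n - 2)

assert fib(20) == 6765  # the "constant literal" answer
```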

The assumption in benchmarks is of course that the results carry over to other
use cases. Here, what is being tested is the overhead of recursion, nothing
else. The fact that it happens to be Fibonacci numbers that are being computed
is irrelevant.
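Concretely, the benchmarked form is the naive recursion, where nearly all of the work is function-call overhead; a sketch with stdlib timeit:

```python
import timeit

def fib(n):
    # Deliberately naive: fib(20) makes over 20,000 calls, so the
    # measurement is dominated by call and recursion overhead.
    return n if n < 2 else fib(n - 1) + fib(n - 2)

assert fib(20) == 6765
elapsed = timeit.timeit(lambda: fib(20), number=100)  # total seconds for 100 runs
```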

~~~
coldtea
Why, is anybody restricted to using recursion in their solutions?

~~~
kristofferc
You are not restricted to recursion, loops, functions etc., but these language
constructs might be _useful_ in some situations, and therefore the performance
(overhead) of these constructs is interesting.

------
hojijoji
Make "Python" run faster by shifting all the runtime-heavy computation to
C implementations... this article is missing the whole point.

~~~
coldtea
Python is a scripting language, so its strong point is being used as glue,
and half of its standard library is implemented in C anyway.

Plus, the "C-implementations" he mentions are available as readily usable
modules (numpy), or semi-transparent jit/aot compilers only needing a few
annotations (Cython, numba), not actual C you have to write.

Besides, isn't the whole point: finish your project fast with the language you
know using whatever it makes available to easily speed your code up?

As opposed to: "be a purist and not use wrapped libs written in another
language".

Who cares for that? Even if it comes up, it's to avoid the hassle of having to
deal with an additional language, setup etc. -- which for numpy, Cython etc.
is almost non-existent (as you don't need to actually deal with C).

And of course, despite the purity of Julia's "single language", the hassle of
moving to a totally different language (one which few use, is not yet stable
in syntax or compiler, and has fewer libs) should also be considered...

~~~
FabHK
The specific purpose of the benchmark, though, is to compare implementations
of the same algorithm natively in the language itself, as explained explicitly
on the Julia website just under the table of benchmark results (see quote
below).

As such, I do think the article misses the point somewhat. Of course, if
there's a numpy function that does what you want, you'd use it in real life.
But what if there isn't? The nice thing about Julia is that the function can
be written in Julia itself, and fast.

> It is important to note that these benchmark implementations are not written
> for absolute maximal performance (the fastest code to compute fib(20) is the
> constant literal 6765). Rather, all of the benchmarks are written to test
> the performance of specific algorithms implemented in each language. In
> particular, all languages use the same algorithm: the Fibonacci benchmarks
> are all recursive while the pi summation benchmarks are all iterative; the
> “algorithm” for random matrix multiplication is to call the most obvious
> built-in/standard random-number and matmul routines (or to directly call
> BLAS if the language does not provide a high-level matmul), except where a
> matmul/BLAS call is not possible (such as in JavaScript). The point of these
> benchmarks is to compare the performance of specific algorithms across
> language implementations, not to compare the fastest means of computing a
> result, which in most high-level languages relies on calling C code.

[https://julialang.org](https://julialang.org)

~~~
ProblemFactory
> Of course, if there's a numpy function that does what you want, you'd use it
> in real life. But what if there isn't?

I have been in this exact situation: a numerical algorithm was missing from
Numpy, but the rest of the project was in Python.

The solution is:

1. Write a Python function that operates on numpy arrays,

2. Add a few Cython type declarations to loop variables,

3. Mark the source file as "compile with Cython at runtime", which seamlessly
turns the Python function into a C library.

The end result was a 1000x speedup compared to pure Python, very close to
numpy built-in functions working on similarly sized arrays. And it needed only
about 5 lines of setup code and type declarations for a few variables - all
the code could still be Python and use all of Python even in the compiled
files.
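The commenter's actual function isn't shown, so here is the shape of that workflow on a hypothetical kernel; the comments mark where the step 2 type declarations would go in the Cython version:

```python
import numpy as np

# Step 1: a plain Python function operating on numpy arrays.
def pairwise_sum(a, b):
    # Step 2 (in the Cython version): declare C types for the loop
    # variables, e.g. "cdef Py_ssize_t i" and "cdef double acc";
    # that is what removes interpreter overhead from the inner loop.
    acc = 0.0
    for i in range(a.shape[0]):
        acc += a[i] * b[i]
    return acc

# Step 3 would compile this with Cython at import time (e.g. pyximport);
# the source stays valid Python either way.
x = np.ones(1000)
y = np.full(1000, 2.0)
result = pairwise_sum(x, y)  # 2000.0
```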

------
drej
There are a few things I consider when I try to improve performance of my
code.

1) I really like the advice of "make it run, make it right, make it fast" [0].
First I build it, then I write a thorough test harness (or make multiple
versions run at the same time and check their results at runtime), then I
rewrite it.

2) I don't really like optimising to the core. As mentioned in the article,
things like Cython and numba often add dependencies that don't port well, and
they may reduce legibility and maintainability (that's more about Cython than
numba). What I find most useful, and this should be obvious, is that
algorithm/data structure changes often yield the biggest benefits.

3) The old rule that if you speed up code that only runs 1% of the time,
you're not gaining that much: I try to keep that in mind. (There are
exceptions, of course.)

4) Performance is not just runtime; it starts with coding it up in the first
place. So, start to finish, you're not always better off with a fast language.
If it's a one-off, it can be more performant to use a slow language; if it's a
tool that will run untouched, you might want to spend some time tinkering with
it. Etc.

All of this seems obvious in retrospect, but it took me a while to appreciate
these principles.

[0] [https://www.facebook.com/notes/kent-beck/mastering-
programmi...](https://www.facebook.com/notes/kent-beck/mastering-
programming/1184427814923414)

~~~
eesmith
A side historical note which doesn't affect anything:
[http://wiki.c2.com/?MakeItWorkMakeItRightMakeItFast](http://wiki.c2.com/?MakeItWorkMakeItRightMakeItFast)
traces that expression back to Stephen C. Johnson and Brian W. Kernighan's
"The C Language and Models for Systems Programming" in Byte magazine (August
1983) - "the strategy is definitely: first make it work, then make it right,
and, finally, make it fast."

~~~
moomin
The make it work/make it right principle is also referred to as "tracer
bullets" in certain circles: if you get the bare-bones full process working,
it's much easier to flesh it out than doing it one "fully specced" component
at a time.

------
unfamiliar
I had a lot of experience with Julia a few years ago, having ported a lot of
my numerical code to the language. I ended up severely disappointed with it
for moderately complex numerical computing projects.

The language was really nice; the problem was terrible performance, and I
believe that this was caused by bad design of the memory semantics. Even in
optimised code, there were temporary objects and copies all over the place,
and they were very hard to eliminate without resorting to using global arrays
everywhere.

~~~
vanderZwan
> _Even in optimised code, there were temporary objects and copies all over
> the place, and they were very hard to eliminate without resorting to using
> global arrays everywhere._

Could you elaborate a bit with some examples? This goes against everything I
understand about the strengths of the language (without having used it beyond
a few first examples myself):

- you can define types and operations on them down to the bits themselves,
giving you a _lot_ of control over the memory (and definitely a lot more than
Python)

- it has a lot of support for in-place mutation out of the box

- globals are almost always bad news for performance

Sure, adding two arrays with

    
    
        C = A + B
    

.. produces an intermediate array, yes, but if you know A and B are not
reused, wouldn't using:

    
    
        A += B
        C = A
    

.. instead be enough?
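For the numpy analogue of that distinction, it is easy to verify that += writes into the existing buffer while A + B allocates a new one (a sketch; Julia's semantics differ in detail):

```python
import numpy as np

A = np.ones(5)
B = np.full(5, 2.0)

C = A + B                        # allocates a brand-new result array
assert not np.shares_memory(C, A)

buf = A                          # second reference to A's buffer
A += B                           # in-place: mutates the existing buffer
assert np.shares_memory(A, buf)
assert (A == C).all()            # same values, no new allocation for A
```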

~~~
ffriend
Even better would be to write (see dot vectorization [1]):

    
    
        C .= A .+ B
    

Benchmarks for 3 matrices of size 1000x1000:

    
    
        julia> using BenchmarkTools
    
        julia> @benchmark C = A + B
        BenchmarkTools.Trial: 
          memory estimate:  7.63 MiB
          allocs estimate:  2
          --------------
          minimum time:     2.359 ms (0.00% GC)
          median time:      2.713 ms (0.00% GC)
          mean time:        3.794 ms (28.81% GC)
          maximum time:     62.708 ms (95.27% GC)
          --------------
          samples:          1314
          evals/sample:     1
    
        julia> @benchmark C .= A .+ B
        BenchmarkTools.Trial: 
          memory estimate:  128 bytes
          allocs estimate:  4
          --------------
          minimum time:     1.232 ms (0.00% GC)
          median time:      1.320 ms (0.00% GC)
          mean time:        1.356 ms (0.00% GC)
          maximum time:     2.572 ms (0.00% GC)
          --------------
          samples:          3651
          evals/sample:     1
    

Note that memory usage dropped from 7.63MiB to 128 bytes.

[1]:
[https://docs.julialang.org/en/stable/manual/functions/#man-v...](https://docs.julialang.org/en/stable/manual/functions/#man-
vectorized-1)

~~~
vanderZwan
Thanks! Like I said: I never truly dove into it, although I loved reading
about the approach to the type system, and the multiple dispatch.

> _Note that memory usage dropped from 7.63MiB to 128 bytes._

Which is important if you're working with large data-sets. Both for
performance and for being able to run the calculations _at all_.

~~~
stabbles
A little bit late to the party here, but the allocation is really 0 bytes. It
shows 128 bytes because the benchmark is creating new references to A, B and
C. To correct this, either interpolate with $A, $B and $C, or initialize A, B
and C in the setup phase:

    
    
        > @benchmark C .= A .+ B setup = (A = rand(1000, 1000); B = rand(1000, 1000); C = rand(1000, 1000))
        BenchmarkTools.Trial:
          memory estimate:  0 bytes
          allocs estimate:  0
          --------------
          minimum time:     2.048 ms (0.00% GC)
    

This is showing 0 bytes indeed.

------
pjmlp
I guess it will take losing market share to Julia for Python devs to actually
start considering making PyPy the canonical implementation.

Rewriting libraries in C isn't making Python code run faster.

The latest Julia conf had lots of cool presentations, and even though 1.0
might come only at the end of the year (with luck), the uptake among the
research community is quite good for such a young language.

~~~
coldtea
> _Rewriting libraries in C isn't making Python code run faster._

No, but nobody cares for the purity of only-Python code. They care for running
their calculation fast, and if that takes C-enabled modules they are totally
fine with it. With things like numpy available, it's not like they have to
write those calculations themselves.

Besides Numpy smokes PyPy for the same kind of calculations (not that one
would care to rewrite the tons of stuff available in there in pure Python).

~~~
pjmlp
If people didn't care, they wouldn't be moving into Julia, Go, Chapel,...

~~~
Majestic121
Do you have any source for this claim that people are moving from Python to
Julia/Go/etc.?

Those languages are definitely growing, but so is Python, and while there is
movement between the languages, I am not sure it is the massive unilateral
migration you seem to imply.

~~~
pjmlp
Just empirical information from the number of talks we get to see at the
respective LanguageConf, and which institutes are giving them.

~~~
pwang
The whole point of programming language conferences is to focus on bleeding
edge "what's new/potentially coming down the pike", instead of "what's being
adopted massively".

Of course the type of people who want to dabble with different languages are
going to move around. Most of the rest of the world wants to learn the one
language that mostly gets them where they need to go, and be done with it.

There is a photo I recently saw of 1200 students starting in Berkeley's
introductory data science class. They're doing Python. Are there even 1200
serious Chapel users in the world?! I mean, I love the ideas in Chapel (and
even the early ZPL), and I drew inspiration from them. But the fate of a
computer language is as much determined by the growth dynamics of its user
community as by its differentiating features.

~~~
benstrumental
Full disclosure, Chapel developer here -

While Chapel's community is small today, it has real users and it's growing.

Chapel is a compelling language that offers a good combination of productivity
and performance.

The Chapel project is more ambitious than Python, Julia, or Go in that it
provides a unified programming model for parallelism and locality that enables
scaling from laptops to clusters and supercomputers. Because we are aiming for
a higher target, it has taken longer to move beyond the prototype stage than
it would for a serial language.

------
jernfrost
This is totally missing the point of the Julia benchmarks. They were meant to
show that you can implement fast low-level code in Julia. That is why the
Python code also had to be written with loops; otherwise it would really just
test the performance of a library and not the language.

The point of Julia is to not have to deal with both C/C++/Fortran and Python
code when creating a high-performance solution. You don't always have a
high-performance library to utilize for your particular problem.

~~~
Recurecur
Exactly.

Being able to do fast things using typical procedural idioms, and being able
to implement new, fast routines without dropping to a different, lower-level
language are huge wins.

Python is a nice language syntactically, it's just too bad that fundamental
design decisions made it intrinsically slow.

------
marmaduke
It's easy to optimize small kernels. Less straightforward is optimizing an
entire library in Python, especially if pieces need to be composed. For this,
the most interesting library I've come across is loopy, which handles things
like fusing kernels and unrolling loops _before_ code generation.

[https://documen.tician.de/loopy/](https://documen.tician.de/loopy/)

------
fnord123
Python's only slow if you use it wrong:
[http://apenwarr.ca/diary/2011-10-pycodeconf-
apenwarr.pdf](http://apenwarr.ca/diary/2011-10-pycodeconf-apenwarr.pdf)

~~~
jernfrost
If Python did not have a performance problem, you wouldn't need to implement
any Python libraries in C.

The point of having Julia is to avoid the use of two languages. If you can
have C performance in a scripting language, why bother with Python?

The only reason to pick Python is momentum, libraries and mindshare.

The language by itself is nothing special. Julia is faster, is a better glue
language, and is easier to use and more powerful.

~~~
joshuamorton
If Julia didn't have a performance problem, you wouldn't need to call out to a
BLAS.

> is a better glue language

I'm confused, I thought you said you were avoiding the usage of two languages.
If so, what is Julia "gluing"?

~~~
jernfrost
You don't have to, as Julia can still get high performance. And BLAS is
actually an example of Julia's strengths: the ability to pick exact code
specializations helps it effectively pick the most suitable BLAS function.
Python can't pick the correct function based on type and number of arguments.

As for gluing: that is often needed not for performance reasons, but because
you don't want to reimplement lots of proven code which is well debugged and
tested. One of the reasons to pick BLAS.

I use Julia a lot for shell-like scripting. That means running a lot of Unix
commands whose implementations I don't know, like the iOS code-signing tools.

Doing this in Julia is far superior to doing it in Python. I was actually
shocked at how clunky it was in Python when I attempted a rewrite.

~~~
joshuamorton
> Python can't pick correct function based on type and number of arguments.

What? Numpy does this.
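To be fair to both sides: the stdlib also has singledispatch, but it dispatches on the first argument's type only, not on the full signature the way Julia's multiple dispatch does. A sketch:

```python
from functools import singledispatch

@singledispatch
def scale(x, factor):
    raise TypeError(f"unsupported type: {type(x).__name__}")

@scale.register(int)
@scale.register(float)
def _(x, factor):
    return x * factor

@scale.register(list)
def _(x, factor):
    # A different specialization is chosen for lists.
    return [item * factor for item in x]

assert scale(3, 2) == 6
assert scale([1, 2], 2) == [2, 4]
```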

> Doing this in Julia is far superior to doing it in python.

Then, quite simply, you're doing it wrong in Python. Show me an example of
this system code that's nice in Julia and clunky in Python, and I'll show you
a superior implementation in Python.

~~~
tavert
> What? Numpy does this.

And Numpy needs an ad-hoc, grafted-on type system to do it.

> Show me an example of this system code that's nice in julia and clunky in
> python and I'll show you a superior implementation in python.

Silly over-reduction, but show me a nice way of running this in Python in 3
lines:

    
    
        run(`./configure`)
        run(`make`)
        run(`make install`)

~~~
shele
I am curious (coming from Julia), what is the difficulty in Python?

~~~
fnord123
There's no problem, but if you want to shell out there are a few ways to do
it, to make sure you're in control of the subshell.

At the simplest there's:

    
    
        import os  
        os.system('command')
    

Then you might want to capture stdout, or feed stdin to interact with the
subprocess (e.g. hitting 'y'). So you use:

    
    
        import subprocess
        # note: shell=True uses /bin/sh, so bash-only syntax like
        # "cat <(head tmp)" would fail; stick to portable pipelines
        cmd = subprocess.Popen("head tmp | cat", shell=True, stdout=subprocess.PIPE)
        stdout, _ = cmd.communicate()
    

And if you want a timeout, or to parse logs, or whatever, you go on from there
and it gets more and more complicated. If Julia has solved this, then the
community should be advertising that above all the alleged performance stuff.
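For what it's worth, the modern stdlib answer to the three-line challenge upthread is subprocess.run (Python 3.5+), which handles checking, capture and timeouts in one call (a sketch; echo stands in for ./configure and friends):

```python
import subprocess

# check=True raises CalledProcessError on a non-zero exit status;
# capture_output and text give you stdout/stderr as strings.
proc = subprocess.run(
    ["echo", "hello"],
    check=True,
    capture_output=True,
    text=True,
    timeout=10,
)
output = proc.stdout.strip()  # "hello"
```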

------
goatlover
You can use Numpy inside Julia with the PyCall library. Julia can also call R,
C & Fortran code.

------
stevesimmons
Can the HN title add "(2015)"?

