Hacker News
How to Make Python Run as Fast as Julia (2015) (ibm.com)
203 points by aaronchall on Aug 29, 2017 | 120 comments



If you really care about extracting every possible ounce of performance out of Python's scientific stack, which relies extensively on Numpy, the best practical guide I have found for doing that is "From Python to Numpy," by Nicolas P. Rougier of Inria: https://www.labri.fr/perso/nrougier/from-python-to-numpy/

Here's a typical example of the kinds of optimizations this guide teaches you, in this case by avoiding the creation of temporary copies of Numpy arrays in memory:

  # Create two int arrays, each filled with one billion 1's.
  X = np.ones(1000000000, dtype=np.int)
  Y = np.ones(1000000000, dtype=np.int)

  # Add 2 * Y to X, element by element:
  
  # Slowest
  %time X = X + 2.0 * Y
  100 loops, best of 3: 3.61 ms per loop

  # A bit faster
  %time X = X + 2 * Y
  100 loops, best of 3: 3.47 ms per loop

  # Much faster
  %time X += 2 * Y
  100 loops, best of 3: 2.79 ms per loop

  # Fastest
  %time np.add(X, Y, out=X); np.add(X, Y, out=X)
  100 loops, best of 3: 1.57 ms per loop
That's a 2.3x speed improvement (from 3.61 ms to 1.57 ms) on a simple vector operation (your mileage will vary!).[1] This only scratches the surface. The guide goes into quite a bit of explicit detail about how Numpy arrays are constructed and stored in memory and always explains the underlying reasons why some operations are faster than others. In addition, the guide has a section titled "Beyond Numpy" that points to even more ways of improving performance, e.g., by using Cython, Numba, PyCUDA, and a range of other tools.

I highly recommend reading the whole thing!

--

[1] Example copied from here: https://www.labri.fr/perso/nrougier/from-python-to-numpy/#an...


Faster than fastest:

  import numba
  @numba.vectorize(nopython=True)
  def add2(x, y):
      return x + 2 * y

  add2(X, Y, out=X)
This is 40% faster on my machine. It only needs to read X and Y once, and write X once. It does take a bit of extra time on the first run, because it uses JIT compilation (LLVM).

Numba lets you implement vectorized operations in terms of elements. In the above, x and y are scalars. You get support for various NumPy "ufunc" features for free, such as the out parameter used here.

Ref: http://numba.pydata.org/numba-doc/dev/user/vectorize.html


In Julia this would read using broadcasting:

    X = ones(Int, 1_000_000_000)
    Y = ones(Int, 1_000_000_000)

    # explicit dot notation
    X .+= 2 .* Y

    # or with a macro
    @. X += 2 * Y
The + and * do not allocate, the operations are fused into element-wise operations so the vectors are traversed only once, and the loop is compiled to vectorized code. No need for Numpy / BLAS.


That's impressive! What does fusion mean? I've heard of fusion multiple times and I have the feeling that fusion (in Haskell at least) is something very advanced that isn't really taught. Can you explain?


Two passes with one operation each are fused into one pass that has two operations?
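In pure-Python terms, the idea looks like this (a list-based sketch just to illustrate the concept, not how NumPy or Julia implement it):

```python
# Unfused: two passes over the data, plus a temporary list holding 2*y.
def add2_unfused(x, y):
    tmp = [2 * b for b in y]                  # pass 1: allocates a temporary
    return [a + t for a, t in zip(x, tmp)]    # pass 2: the add

# Fused: one pass, no temporary -- both operations run per element.
def add2_fused(x, y):
    return [a + 2 * b for a, b in zip(x, y)]

assert add2_unfused([1, 1], [1, 1]) == add2_fused([1, 1], [1, 1]) == [3, 3]
```

The results are identical; the fused form just touches each element once and skips the intermediate allocation.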

Too bad common languages today (Python and its extensions for example) are a big step back from Matlab when it comes to matrix and vector operations syntax.


Not only are you now employing a foreign data type (the numpy array) that requires separate learning, independent of (and potentially confusing alongside) your existing Python knowledge, you also need to mentally differentiate:

    X = X + 2.0 * Y
    X = X + 2 * Y
    X += 2 * Y
    X += Y; X += Y
, all of which is irrelevant to your semantic logic.

Why not recognize that it is much more straightforward to do this directly in C? It is only a bit more to type but a much smaller mental load to understand (and maintain in the long run). If performance is at stake, spell it out in C (with x86 intrinsics if necessary) and make that part of the semantics of your code.

Being concerned with millisecond-order performance in Python is misguided.


There's no need to mentally differentiate between all those approaches when writing code with Python's scientific stack.

In practice, code is initially written with relatively little regard for how it affects the performance of Numpy and other libraries like it; afterwards, and only if necessary, these techniques are used to optimize the lines of code that prove critical to performance, typically only a small fraction of all lines.

However, if you're working on a project for which every line of code is performance-critical, then I would agree with you, Python would not be a good choice for that.


When you use numpy, you're not one semicolon away from a segfault.


How could it possibly take on the order of 1ms to operate on arrays of size 1e9?


numpy calls BLAS libraries, which are heavily optimized and parallelized. Note that it is also very dependent on which particular BLAS library is used, as several are older and not nearly as optimized as newer ones (the difference can be rather dramatic). If you're curious, the best one to use is OpenBLAS.


BLAS really shines when you do matrix multiplication; for element-wise operations the best you can do is add the numbers using SIMD instructions or push the load to the GPU, and most numeric libraries already do so when possible. The benchmark above seems unrealistic; here are results from my newest MacBook Pro:

    In [2]: import numpy as np

    In [3]: X = np.ones(1000000000, dtype=np.int)

    In [4]: Y = np.ones(1000000000, dtype=np.int)

    In [5]: %time X = X + 2.0 * Y
    CPU times: user 10.4 s, sys: 27.1 s, total: 37.5 s
    Wall time: 46 s

    In [6]: %time X = X + 2 * Y
    CPU times: user 8.66 s, sys: 26 s, total: 34.7 s
    Wall time: 42.6 s

    In [7]: %time X += 2 * Y
    CPU times: user 8.58 s, sys: 23.2 s, total: 31.8 s
    Wall time: 37.7 s

    In [8]: %time np.add(X, Y, out=X); np.add(X, Y, out=X)
    CPU times: user 11.3 s, sys: 25.6 s, total: 36.9 s
    Wall time: 42.6 s
No surprise, Julia gives nearly the same result:

    julia> X = ones(Int, 1000000000);
    julia> Y = ones(Int, 1000000000); 

    julia> @btime X .= X .+ 2Y
      34.814 s (6 allocations: 7.45 GiB)

UPD: I just noticed the 7.45 GiB of allocations. We can get rid of them like this:

    julia> @btime X .= X .+ 2 .* Y
      20.464 s (4 allocations: 96 bytes)
or:

    julia> @btime X .+= 2 .* Y
      20.098 s (4 allocations: 96 bytes)


I may not have noticed swap usage in the previous test, so I repeated it on a Linux box with 1e8 numbers (instead of 1e9). Julia took 100.583 ms while Python took 207 ms (probably due to reading the array twice). So I guess adding 1e9 numbers should take about 1 second on a modern desktop CPU.


I think the benchmark was probably done on a supercomputer. But that's really interesting how well Julia did. I did a basic logistic regression ML implementation in it years ago and I was impressed, but I stopped following its progress. Might have to keep it on my radar!


I still don't see how it's possible, no matter how optimized it is. Assuming 8-byte ints (which is what np.int seems to be on 64-bit) you're looking at reading at least 16GB of data since you're operating on two 8GB arrays and you have to read the data in each one at least once. If you can do that in a millisecond, that's a memory bandwidth of about 16TB/s. I thought modern CPUs had memory bandwidth of tens of GB/s, maybe low hundreds for really high-end stuff, and some brief searching seems to confirm that. What am I missing?

Edit: testing the given code on my 2013 Mac Pro, the fastest one at the end completes in one second or so (just eyeballing it), which makes a lot more sense.
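The arithmetic behind that sanity check, as a quick sketch (using the quoted 1.57 ms rather than the rounded 1 ms):

```python
# Two arrays of 1e9 8-byte ints: a single pass must read both
# (ignoring the write-back), so at least ~16 GB of memory traffic.
n = 1_000_000_000
bytes_per_int = 8
traffic = 2 * n * bytes_per_int        # 16 GB read at minimum

# At the quoted 1.57 ms, the implied memory bandwidth:
seconds = 1.57e-3
bandwidth_tb_per_s = traffic / seconds / 1e12
print(f"{bandwidth_tb_per_s:.1f} TB/s")   # roughly 10 TB/s, about two orders
                                          # of magnitude beyond typical DRAM
```

Which is why a figure around one second for 1e9 elements is the plausible one.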


The example the OP gave was from a tutorial/website that is hosted at the Laboratoire Bordelais de Recherche en Informatique. I imagine they probably have some heavy duty machines to crunch numbers on.


Not only that, but going through the array twice is apparently faster than doing it once and multiplying by 2. Is multiplying more expensive than fetching/storing from memory? This is counter-intuitive. I must be missing something.


You're going through the array twice whatever you do (once when multiplying by two then a second time when adding the arrays together), Python/NumPy isn't clever enough to figure out that it can be done in a single loop.
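Spelled out with explicit ufunc calls (a sketch; the temporary in the first form is exactly what gets allocated behind the scenes):

```python
import numpy as np

X = np.ones(10, dtype=np.int64)
Y = np.ones(10, dtype=np.int64)

# X += 2 * Y really executes as two passes:
tmp = np.multiply(Y, 2)    # pass 1: allocates a temporary array
np.add(X, tmp, out=X)      # pass 2: in-place add

# Restructuring the arithmetic is the only way to avoid the temporary
# with plain ufuncs, e.g. the doubled add from the benchmark:
X2 = np.ones(10, dtype=np.int64)
np.add(X2, Y, out=X2)
np.add(X2, Y, out=X2)

assert (X == X2).all()     # both compute 1 + 2*1 == 3
```

Either way the data is traversed twice; NumPy has no view of the whole expression, so it cannot fuse the multiply and the add into one loop.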


Good point.


Would this be as fast as 'Fastest'?

    %time X += Y; X += Y
It's a bit easier to read, imo.


Looks like it:

  In [1]: %%timeit x = np.ones(100000000); y = np.ones(100000000)
     ...: np.add(x, y, out=x)
     ...: np.add(x, y, out=x)
     ...: 
  1 loops, best of 3: 287 ms per loop

  In [2]: %%timeit x = np.ones(100000000); y = np.ones(100000000)
     ...: x += y
     ...: x += y
     ...: 
  1 loops, best of 3: 287 ms per loop

  In [3]: %%timeit x = np.ones(100000000); y = np.ones(100000000)
     ...: np.add(x, y, out=x)
     ...: np.add(x, y, out=x)
     ...: 
  1 loops, best of 3: 286 ms per loop

  In [4]: %%timeit x = np.ones(100000000); y = np.ones(100000000)
     ...: x += y
     ...: x += y
     ...: 
  1 loops, best of 3: 280 ms per loop


From looking at most of these examples it's clear that Python can perform really well with projects like Numba, Cython and PyPy in the picture. My impression after using Cython is that it can for the most part even outperform Julia. Couple that with the fact that with Julia there will probably be a few steps before you squeeze the maximum possible performance out of your algorithm, and Cython is a no-brainer for a developer who is already using Python.

That is, until you consider that most of these examples do not showcase (what is in my humble opinion) Julia's true strength: homoiconicity and an opt-in, extensible type system that is part of the language.

It's trivial to make a new class in Python and set up the behavior you're interested in. One can easily use decorators, metaclasses, descriptors, properties and other dunder methods to customize class behavior to one's heart's desire. But Cython (as far as I understand) does not support these Python features. If an existing library implements an interface using them, it's not obvious to me how Cython could be used to improve performance. Julia, on the other hand, has no such restrictions; using macros and metaprogramming in everyday programming is encouraged. I've been a Python user for over 6 years. I've only looked at Julia for a few months and I can already see concepts and ideas in Julia that have no analogue in Python, while most if not all Python programming features translate to Julia one way or another.

I love Python, but I can't help feeling that the lack of an opt-in type system and more powerful metaprogramming constructs is hurting it in these comparisons.


Isn't adding things like memoization or writing the code in C (Cython) missing the whole point of the benchmark, which is to test the overhead of recursion / function calls in the language itself?


Users of programming languages who are interested in the end result (the output of the program, etc.) do not care. They want the fastest-performing language for the job. The Julia website has been misleading people. Due to those claims, I spent a week porting some of my simulation code to Julia before I realized that it is actually slower in (optimized) Julia than in optimized Python.


For those who only care about the output of fib(20) there are more efficient methods than any of the Julia or Python implementations posted in the link, e.g. lookup tables.

The assumption in benchmarks is of course that the results carry over to other use cases. Here, what is being tested is the overhead in recursion, nothing else. The fact that it happens to be Fibonacci-numbers that is being computed is irrelevant.


Isn't that basically what he does when he adds caching? I know it's not a static lookup table but you could prep it by invoking it with a sufficiently large n.


Why, is anybody restricted to using recursion in their solutions?


You are not restricted to recursion, loops, functions etc., but these language constructs might be useful in some situations, and therefore the performance (overhead) of these constructs is interesting.


I highly doubt that was optimized Julia code. Sure, you might have used a really fast Python library and a very slow Julia library, but there is no way optimized Julia code should be slower than Python. Julia simply offers far more ways to control performance than Python.


> actually slower in (optimized) Julia than in optimized Python

That shouldn't be possible. If you have a code example, I (and probably other people as well) will be happy to look at it.


And thus we want benchmarks that measure language performance, not the fastest way to compute Fibonacci numbers. The solution to the latter problem is the same in Python and Julia and consists of calling the assembly function in gmp...

Julia has a lot more potential for optimizations than python, but what python has going for it is the larger ecosystem. So if you want to write a one-off experiment that's similar to stuff that already exists in C bindings to python you should use that. If you plan to write a large application that you still want to optimize for current processors in 10 years then I'm not sure if python is a good choice.


I agree... Judged against the title, the article adds little value: take one micro-benchmark; implement the naive Python algorithm using non-CPython approaches like Cython, Numpy and Numba; and stick on a clickbait title that implies a speedup that applies in all cases.

The article would be much better if it ditched the comparison to Julia and instead showcased "Some ways to make Python code faster."


Isn't the benchmark missing the whole point of what people actually want to do, which is to run their calculations fast (without caring whether they are written using the constructs the benchmark tests or not)?


People might want to use recursion. They might want to split out their code into small functions without having to think about the overhead of function calls. They might want to just write a for loop instead of transforming it to vectorized notation.

The benchmark investigated in the link answers the question "if I use recursion, and the function body is small, how much will I be penalized?". Changing the benchmark so that it no longer answers that question makes it pointless.


I'm also wondering why the article skips PyPy. It's a Python JIT which can often be used as a drop-in replacement for CPython.


In my experience when working with computationally heavy code (i.e. the kind that Julia targets) Numba tends to give better results than PyPy.


That may very well be, and a good way to be sure would be ... a benchmark in this article.


Yes.


Make "Python" run faster by shifting all the runtime heavy computation to C-implementations... this article is missing the whole point


Python is a scripting language, so its strong point is being used as a glue, and half of its standard library is implemented in C anyway.

Plus, the "C-implementations" he mentions are available as readily usable modules (numpy), or semi-transparent jit/aot compilers only needing a few annotations (Cython, numba), not actual C you have to write.

Besides, isn't the whole point: finish your project fast with the language you know using whatever it makes available to easily speed your code up?

As opposed to: "be a purist and not use wrapped libs written in another language".

Who cares for that? Even if it comes up, it's to avoid the hassle of having to deal with an additional language, setup etc -- which for numpy, Cython etc. is almost non-existent (as you don't need to actually deal with C).

And of course, despite the purity of Julia's "single language", the hassle of moving to a totally different language, one which few people use, which is not yet stable in syntax or compiler, and which has fewer libs, should also be considered...


The specific purpose of the benchmark, though, is to compare implementations of the same algorithm natively in the language itself, as explained explicitly on the Julia website just under the table of benchmark results (see quote below).

As such, I do think the article misses the point somewhat. Of course, if there's a numpy function that does what you want, you'd use it in real life. But what if there isn't? The nice thing about Julia is that the function can be written in Julia itself, and fast.

> It is important to note that these benchmark implementations are not written for absolute maximal performance (the fastest code to compute fib(20) is the constant literal 6765). Rather, all of the benchmarks are written to test the performance of specific algorithms implemented in each language. In particular, all languages use the same algorithm: the Fibonacci benchmarks are all recursive while the pi summation benchmarks are all iterative; the “algorithm” for random matrix multiplication is to call the most obvious built-in/standard random-number and matmul routines (or to directly call BLAS if the language does not provide a high-level matmul), except where a matmul/BLAS call is not possible (such as in JavaScript). The point of these benchmarks is to compare the performance of specific algorithms across language implementations, not to compare the fastest means of computing a result, which in most high-level languages relies on calling C code.

https://julialang.org


> Of course, if there's a numpy function that does what you want, you'd use it in real life. But what if there isn't?

I have been in this exact situation, a numerical algorithm that was missing from Numpy but the rest of the project is in Python.

The solution is:

1. Write a Python function that operates on numpy arrays,

2. Add a few Cython type declarations to loop variables,

3. Mark the source file as "compile with Cython at runtime", which seamlessly turns the Python function into a C library.

The end result was a 1000x speedup compared to pure Python, very close to numpy built-in functions working on similarly sized arrays. And it needed only about 5 lines of setup code and type declarations for a few variables - all the code could still be Python and use all of Python even in the compiled files.
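A hedged sketch of what step 2 can look like (the function `moving_sum` and its variables are hypothetical, not the parent's actual algorithm). In pure Python the hot loop reads as below; the Cython version only adds a few type declarations, e.g. `cdef Py_ssize_t i, n` and `cdef double acc`:

```python
import numpy as np

def moving_sum(a, window):
    # In the .pyx version, i, n and acc get C types here
    # (cdef Py_ssize_t i, n; cdef double acc), which lets
    # Cython compile the loop body to plain C arithmetic.
    n = a.shape[0]
    out = np.empty(n - window + 1)
    acc = float(a[:window].sum())
    out[0] = acc
    for i in range(1, n - window + 1):
        acc += a[i + window - 1] - a[i - 1]   # slide the window by one
        out[i] = acc
    return out
```

Step 3 then corresponds to `import pyximport; pyximport.install()`, which compiles the `.pyx` file transparently on first import.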


>The specific purpose of the benchmark, though, is to compare implementations of the same algorithm natively in the language itself, as explained explicitly on the Julia website just under the table of benchmark results (see quote below).

But then they go and write their own sort instead of using the language-provided ones when offered. All this shows is that Julia is apparently faster than incredibly unidiomatic Python written by someone who clearly doesn't write Python. Okay. That's neat.


Numpy is such an essential library for any type of scientific computing in Python that ignoring it would be missing the point, if anything. The library infrastructure is part of the appeal of a programming language and Numpy is the default for anything compute-heavy in Python.


Completely agree.

Python users shouldn't run standard loops in any scientific computing. The only valid example is looping over a set of HTTP parameters in a web app...

In which case, the computation will definitely not be the bottleneck.


>shifting all the runtime heavy computation to C-implementations... this article is missing the whole point

The author understands your perspective but he's deliberately using a different one. The idea is that a data scientist user would realistically use NumPy/SciPy optimized C libraries instead of writing raw loops in "pure Python" to walk pure Python lists that model matrices. Therefore comparing pure Python code (interpreted by the canonical CPython interpreter) to Julia is the opposite of his goal.

The article's title is: "How To Make Python Run As Fast As Julia"

The author wanted to write about: "How To Make Python _Projects_ Run As Fast As Julia"

But many readers insist that the article should have been: "How To Make Pure Python Code Run As Fast As Julia"

(The 2nd type of article is also interesting, but the author didn't write it and didn't claim to.)

The article's comment permalink doesn't seem to jump to his exact comment, so I'll copy-paste the text here:

>There is indeed a disagreement about the purposes of the benchmarks. I see at least two purposes at stake here.

>1. A user point of view, which is to see how to best accomplish things in a given language. It is the result of various tradeoffs, including this: balance the time and effort to code something with the efficiency you get. That's the view of most Python users reacting to my post. We don't mind using Python libraries, even if they aren't written in 'pure' Python. Actually, the massive set of existing Python libraries is probably one key reason for its success.

>2. A language implementer point of view, which focuses on how elementary language operations perform. That's the purpose of Julia micro benchmarks I think.

>If people do not agree on the yardstick they use, then the discussion is not going to be fruitful. This disagreement explains most of the comments I saw until now.

>I am using the 'user point of view' in my post


> The author understands your perspective but he's deliberately using a different one.

In this case the way the author shows it isn't the best one: he modifies the Python code to be more realistic, which is fine, but shouldn't he then do the same for Julia? Obviously, writing a recursive Fibonacci function isn't the best way to implement it. Obviously, using caching can improve performance. But why not apply these changes to both implementations?


>he modifies Python code to be more realistic - that's ok, but doesn't he do the same thing for Julia?

Yes, I agree he didn't rewrite the Julia fibonacci examples the same way as Python.

My comment was speaking more to the usage of "optimized C libraries" in his benchmarks as being appropriate for his particular goal. (As response to poster hojijoji's objection to C-implementations.)


And his goal is to compute fib(20) really quickly?


I used "goal" to mean his "overall goal" of showing optimized C libraries instead of pure Python for various scenarios. (My response to poster hojijoji objection to C libraries.) Using C libs is not an invalid benchmark if one understands why the author used them.

Yes, when the author didn't change both Python AND Julia fibonacci examples in exactly the same 1-for-1 manner, it does detract from his overall message because it invites nitpicking. (The nitpicking is reasonable if you're hyperfocused on that fibonacci example.)

Based on your other responses in this thread, you seem to want him to write Python-vs-Julia benchmarks that's suitable for benchmarksgame[1]. You have a valid perspective but that's not the article he claimed to write.

[1] http://benchmarksgame.alioth.debian.org/u64q/python.html


My question is then, why bring up Julia at all? Of course, there will be nitpicking when you put two languages against each other, in a benchmark written for a specific purpose, and then start to modify the implementation for one of the languages. It seems like the goal of the blog post would just as well be achieved by saying "here are some ways of speeding up a function in Python".


>My question is then, why bring up Julia at all?

Because he wasn't writing about Python in a vacuum. In his very first paragraph[1], one can see that the article was a response to Julia's benchmark.

Your question could be reversed for the authors of julialang.org website and they could've restricted themselves to say "here are some ways of writing functions in Julia" -- without bringing up Python at all.

But the Julia folks didn't do that because ... people like to write comparisons to other things!

[1] see 1st paragraph that begins and ends with: "Should we ditch Python and other languages in favor of Julia for technical computing? [...] did the Julia team wrote Python benchmarks the best way for Python?"


That is 100% the point of python.


There are a few things I consider when I try to improve performance of my code.

1) I really like the advice of "make it run, make it right, make it fast" [0]. First I build it, then I write a thorough test harness (or make multiple versions run at the same time and check their results at runtime), then I rewrite it. 2) I don't really like optimising to the core, like it's mentioned in the article, things like Cython and numba often add dependencies that don't port well and they may reduce legibility and maintainability (that's more about Cython than numba). What I find most useful, and this should be obvious, that algorithm/data structure changes often yield the biggest benefits. 3) The old rule that if you speed up code that only runs 1% of the time, you're not gaining that much, I try to keep that in mind. (There are exceptions, of course.) 4) Performance is not just runtime, it starts with coding it up in the first place. So start to finish, you're not always better off with a fast language. If it's a one-off, it can be more performant to use a slow language, if it's a tool that will run untouched, you might want to spend some time tinkering with it. Etc.

All of this seems obvious in retrospect, but it took me a while to appreciate these principles.

[0] https://www.facebook.com/notes/kent-beck/mastering-programmi...


A side historical note which doesn't affect anything: http://wiki.c2.com/?MakeItWorkMakeItRightMakeItFast traces that expression back to Stephen C. Johnson and Brian W. Kernighan's "The C Language and Models for Systems Programming" in Byte magazine (August 1983) - "the strategy is definitely: first make it work, then make it right, and, finally, make it fast."


The make it work/make it right principle is also referred to as "tracer bullets" in certain circles: if you get the bare bones full process working it's much easier to flesh it out than doing it one "fully specced" component at a time.


I had a lot of experience a few years ago with Julia, having ported a lot of my numerical code to the language. I ended up severely disappointed with it for moderately complex numerical computing projects.

The language was really nice; the problem was terrible performance, and I believe that this was caused by bad design of the memory semantics. Even in optimised code, there were temporary objects and copies all over the place, and they were very hard to eliminate without resorting to using global arrays everywhere.


I had exactly the opposite experience. If you were having problems with temporaries, you were probably using Matlab style vectorized operations. Back when you tried Julia, explicit loops could avoid this allocation issue. Now, Julia has syntactic broadcast operators which will fuse loops for you. See https://julialang.org/blog/2017/01/moredots


> Even in optimised code, there were temporary objects and copies all over the place, and they were very hard to eliminate without resorting to using global arrays everywhere.

Could you elaborate a bit with some examples? This goes against everything I understand about the strengths of the language (without having used it beyond a few first examples myself):

- you can define types and operations on them down to the bits themselves, giving you a lot of control over the memory (and definitely a lot more than Python)

- it has a lot of support for in-place mutation out of the box

- globals are almost always bad news for performance

Sure, adding two arrays with

    C = A + B
.. produces an intermediate array, yes, but if you know A and B are not reused, wouldn't using:

    A += B
    C = A
.. instead be enough?


Even better would be to write (see dot vectorization [1]):

    C .= A .+ B
Benchmarks for 3 matrices of size 1000x1000:

    julia> using BenchmarkTools

    julia> @benchmark C = A + B
    BenchmarkTools.Trial: 
      memory estimate:  7.63 MiB
      allocs estimate:  2
      --------------
      minimum time:     2.359 ms (0.00% GC)
      median time:      2.713 ms (0.00% GC)
      mean time:        3.794 ms (28.81% GC)
      maximum time:     62.708 ms (95.27% GC)
      --------------
      samples:          1314
      evals/sample:     1

    julia> @benchmark C .= A .+ B
    BenchmarkTools.Trial: 
      memory estimate:  128 bytes
      allocs estimate:  4
      --------------
      minimum time:     1.232 ms (0.00% GC)
      median time:      1.320 ms (0.00% GC)
      mean time:        1.356 ms (0.00% GC)
      maximum time:     2.572 ms (0.00% GC)
      --------------
      samples:          3651
      evals/sample:     1
Note that memory usage dropped from 7.63MiB to 128 bytes.

[1]: https://docs.julialang.org/en/stable/manual/functions/#man-v...


Thanks! Like I said: I never truly dove into it, although I loved reading about the approach to the type system, and the multiple dispatch.

> Note that memory usage dropped from 7.63MiB to 128 bytes.

Which is important if you're working with large data-sets. Both for performance and for being able to run the calculations at all.


A little bit late to the party here, but the real allocation count is 0 bytes. It shows 128 bytes because the benchmark creates new references to A, B and C. To correct this, either use interpolation with $A, $B and $C or initialize A, B and C in the setup phase:

    > @benchmark C .= A .+ B setup = (A = rand(1000, 1000); B = rand(1000, 1000); C = rand(1000, 1000))
    BenchmarkTools.Trial:
      memory estimate:  0 bytes
      allocs estimate:  0
      --------------
      minimum time:     2.048 ms (0.00% GC)
This is showing 0 bytes indeed.


I had exactly the same experience. Julia's JIT compiler is (or at least was) really picky about what code it did and didn't like. Two pieces of code which look 'basically' the same can have a performance difference of an order of magnitude because the JIT likes one version and not the other.


How do you know that the code was optimized for Julia? Your poor results suggest you are simply not familiar with how to optimize in Julia. E.g., did you check type stability with @code_warntype?


I guess it will take losing market share to Julia for Python devs to actually start considering making PyPy the canonical implementation.

Rewriting libraries in C isn't making Python code run faster.

The latest Julia conf had lots of cool presentations, and even though 1.0 might come only at the end of the year (with luck), the uptake among the research community is quite good for such a young language.


PyPy is not that much faster than CPython and is certainly still much slower than using a C library. Indeed, there's a limit to how fast Python can be, considering how many operations it needs to do for even simple statements (such as summing two integers [0]).

This is the cost of being so extremely dynamic, more so than Javascript and many other scripting languages.

[0] https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow...
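The dynamism shows up directly in the bytecode: even adding two locals compiles to a generic binary-add instruction whose implementation must inspect both operands' types on every execution:

```python
import dis

def add(a, b):
    return a + b

# CPython emits a generic binary-add opcode (BINARY_ADD on older
# versions, BINARY_OP on 3.11+); the interpreter dispatches on the
# runtime types of a and b every single time the function runs.
dis.dis(add)
```

There is no integer-add instruction anywhere; the type resolution that a C compiler does once at compile time happens here on every call.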


Even after reading that post, I doubt Python can be more dynamic than Lisp or Smalltalk, and yet they are behind the genesis of JIT compilers.


I don't know that it "is", but it certainly can be. In Python there are just too many ways to screw with the meaning of "a.b" at runtime: it could be in the class dict, it could be a property, it could be modified via a couple of methods on the class, and it could be screwed with in a couple of other ways too, IIRC. You have to write some sort of code that either proves these things can't happen, which gets really hard as the program scales up, or code that checks for them happening. Other languages that can be "dynamic" but still have only one or two things to check can be much more easily JIT'ed.
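A small illustration of how many different ways a single `a.b` lookup can be resolved, all of which a JIT would have to guard against:

```python
class A:
    b = 1                           # found in the class dict

a = A()
assert a.b == 1

a.b = 2                             # instance dict now shadows the class dict
assert a.b == 2

class B:
    @property
    def b(self):                    # a property: the lookup runs code
        return 3

assert B().b == 3

B.b = property(lambda self: 4)      # and it can all be rebound at runtime
assert B().b == 4
```

Each of these changes the meaning of the same attribute expression without any syntactic hint at the call site.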


Smalltalk has a message called become:, where you literally change the complete representation of an object to something else.

You can change code on the fly on the debugger and just do a redo of the failed call.

When you change a specific class, or its meta-class, the whole system is automatically updated, reflecting the changes.

I can give many more examples.


I believe much similar magic is possible in Python, but you get to implement it on top of the relatively simple interactions between instance dict, class dict, and metaclass. They would not be bullet-proof, and they are not built into the default implementation. But they are very possible.
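As a sketch of that point: reassigning an object's `__class__` at runtime is a (much weaker) Python cousin of Smalltalk's `become:`, built from exactly those simple pieces. The classes here are made up for illustration:

```python
class Dog:
    def speak(self):
        return "woof"

class Cat:
    def speak(self):
        return "meow"

pet = Dog()
print(pet.speak())   # woof

# Reassigning __class__ changes the object's behaviour in place --
# every existing reference to `pet` now sees a Cat.
pet.__class__ = Cat
print(pet.speak())   # meow
```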


"Everything is an object" is a higher-level abstraction (that is, maps worse to native operations) than "everything is a function" and "everything is a linked list." Also, I don't know much about Lisp, but I was under the impression that Lisp metaprogramming can be entirely or mostly resolved at compile-time, while the same is not true of Python.

Smalltalk is even slower than Python.


That is funny, the language whose JIT research lead to Hotspot, is slower than Python.

Since I was using Smalltalk for university projects before Java was a thing, I guess you mean Python wrappers written in C.


In theory one could build a computer that runs Smalltalk at a fast speed.


And in reality people have built interpreters that run python at a high speed.


I guess we don't have the same understanding of what high speed actually means.


In the context of the above comment, I'd say that they're about the same. Pypy appears to run between half-as-fast and twice-as-fast as java (hotspot) with a much smaller memory footprint, depending on the benchmark.


There's little reason to move to PyPy when Numba is available, at least for scientific code.

For backend work, I find the bottleneck is never Python, tbh


Either Numba has improved by an order of magnitude since I looked last (congratulations that's fantastic, its a difficult problem to solve), or you haven't met the edge cases yet


>Rewriting libraries in C isn't making Python code run faster.

No, but nobody cares for the purity of only-Python code. They care for running their calculation fast, and if that takes C-enabled modules they are totally fine with it. With things like numpy available, it's not like they have to write those calculations themselves.

Besides Numpy smokes PyPy for the same kind of calculations (not that one would care to rewrite the tons of stuff available in there in pure Python).


>No, but nobody cares for the purity of only-Python code.

Let me call BS on that. I would prefer not to be forced to drop down to C if I could have it. I am quite sure many feel that way.

I would definitely like the ease of calling an existing C or Fortran library that already does what I need. But that I need (almost to a fault) to context switch between languages is certainly not a high point especially when I am prototyping something new and for which there aren't any good libraries yet.


Numpy doesn't force you to drop down to C, though. That it's C under the hood is almost entirely irrelevant when making use of it


Yes, it absolutely doesn't. I was talking about cases where you have to reach beyond Numpy/Scipy/Pandas/Scikits. For C or Fortran libraries and tools like those that are already there, Python is very pleasant to use.


What if the existing libraries don't solve your problem? How do you make python fast without writing new C code?


Using Cython, Numba, caching annotations, better algorithms and so on -- all things that the article also covered.
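Of those options, a caching annotation is often the cheapest win, since it's in the standard library; a minimal sketch with `functools.lru_cache` (the Fibonacci function is just an illustration):

```python
from functools import lru_cache

@lru_cache(maxsize=None)   # memoize: repeat calls become dict lookups
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(30))   # 832040; the uncached version makes millions of redundant calls
```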


If people didn't care, they wouldn't be moving into Julia, Go, Chapel,...


Do you have any source for this claim that people are moving from Python to Julia/Go/etc.?

Those languages are definitely growing, but so is Python, and while there is movement between the languages, I am not sure it is the massive unilateral migration you seem to imply.


Just empirical information from the amount of talks we get to see at the respective LanguageConf, and which institutes are doing them.


The whole point of programming language conferences is to focus on bleeding edge "what's new/potentially coming down the pike", instead of "what's being adopted massively".

Of course the type of people who want to dabble with different languages are going to move around. Most of the rest of the world wants to learn the one language that mostly gets them where they need to go, and be done with it.

There is a photo I recently saw of 1200 students starting in Berkeley's introductory data science class. They're doing Python. Are there even 1200 serious Chapel users in the world?! I mean, I love the ideas in Chapel (and even the early ZPL), and I drew inspiration from them. But the fate of a computer language is as much determined by the growth dynamics of its user community as by its differentiating features.


Full disclosure, Chapel developer here -

While Chapel's community is small today, it has real users and it's growing.

Chapel is a compelling language that offers a good combination of productivity and performance.

The Chapel project is more ambitious than Python, Julia, or Go in that it provides a unified programming model for parallelism and locality that enables scaling from laptops to clusters and supercomputers. Because we are aiming for a higher target, it has taken longer to move beyond the prototype stage than it would for a serial language.


There is no correlation whatsoever between the number of talks at conferences and what's happening in real life.

If conferences were a metric for anything you'd assume most of coding in the world is happening in Node.JS/React/JS-package-manager-of-the day.

The reality is that the vast, vast majority of devs and code that's actually running the world are doing some combination of Java, C#, PHP, C++, C and COBOL. Those people aren't exactly going to conferences; they have regular old boring 9 to 5 jobs and are plenty busy trying to make shit work or diving in piles of legacy code or installing run-of-the-mill CMSes.


Conversely, clearly people are moving from Julia to Python, since there are more, and larger, PyCons each year.


>If people didn't care, they wouldn't be moving into Julia, Go, Chapel,...

Who said they are? I'm sure you can find 100 that did. I don't think you can find 10,000.

(To Go, some. To Julia, no. Chapel? Might as well talk about people moving to Modula-3.)


The empirical information from the increase in the amount of talks we get to see at the respective LanguageConf, and who is doing them, seems to state otherwise.


You can find 20 people to give a talk for any odd language.

Can you find > 1 language confs for the same language? How does it fare in libs and Stack Overflow and GitHub?


What would it mean to make it the "canonical implementation"?

Is there something preventing you from using PyPy today?


> Is there something preventing you from using PyPy today?

A lot of libraries aren't tested with/don't work with pypy.


File issues with them. Use the GitLab editor to add `pypy` to their `.gitlab-ci.yml`, submit a merge request, and see if any tests fail.


I mean the version installed by default that anyone can use without requiring root.

I stopped using Python a long time ago.


What would people's reaction be when their existing small Python scripts take at least 3x as long to run?


If the community focused on making PyPy the canonical implementation, there would surely be a major effort to improve that scenario.

Lisp, Scheme, Dylan, JavaScript, and Julia don't suffer from such a huge speed bump; on the contrary.


Neither does C. But that, and all the other languages you listed, are not Python. They have differing implementations, constraints and resources that make the comparison unfair and IMO invalid.


C is not in the same league as the languages I mentioned, in regards to productivity and memory safety, while enjoying quite good performance.

What is unfair about being as dynamic as Python and yet enjoying top quality JIT compilers?

Just the community not caring about it and forcing everyone else to go down into C or somewhere else, it seems.


> Just the community not caring about it and forcing everyone else to go down into C or somewhere else, it seems.

That, or the inherent dynamic nature of Python that makes it ridiculously hard to JIT (especially without breaking compatibility), and that a tiny subset of developers have enough experience (and will) to work on it as there is no Google willing to funnel buckets of money into making a V8 for Python, to the point where PyPy has reached an OK state after 10 years of development and a codebase that dwarfs CPython in size and complexity.

I hate people who deride the community for not caring enough about adding a JIT. Yeah, that sure is the reason. Not mind-boggling technical limitations, nope, just pure laziness.

Why don't you try?

Exactly.


I gave up on Python about version 2.0, and went to languages that care about having a JIT/AOT compiler from the get go, instead of forcing me to write C.


A lot has changed in the 17 years since 2.0 was released. Performance is one of them.

C and Python have sometimes overlapping use-cases, but often not. I'm not sure if your arguments still hold water, especially given projects like Cython and PyPy that give AOT and JIT compilation to the language so you don't have to write C.


This is totally missing the point of the Julia benchmarks. They were meant to show that you can implement fast low level code in Julia. That is why the python code also had to be written with loops otherwise it would really just test the performance of a library and not the language.

The point of Julia is to not have to deal with both C/C++/fortran and python code when creating a high performance solution. You don't always have a high performance library to utilize for your particular problem.


Exactly.

Being able to do fast things using typical procedural idioms, and being able to implement new, fast routines without dropping to a different, lower-level language are huge wins.

Python is a nice language syntactically, it's just too bad that fundamental design decisions made it intrinsically slow.


It's easy to optimize small kernels. Less straightforward is optimizing an entire library in Python, especially if pieces need to be composed. For this, the most interesting library I've come across is Loopy, which handles things like fusing kernels and unrolling loops _before_ code generation.

https://documen.tician.de/loopy/


Python's only slow if you use it wrong: http://apenwarr.ca/diary/2011-10-pycodeconf-apenwarr.pdf


If python did not have a performance problem you wouldn't need to implement any python libraries in C.

The point of having Julia is avoiding the usage of two languages. If you can have C performance in a script language why bother with python?

The only reason to pick python is momentum, libraries and mindshare.

The language by itself is nothing special. Julia is faster, is a better glue language, easier to use and more powerful.


If Julia didn't have a performance problem, you wouldn't need to call out to a BLAS.

>is a better glue language

I'm confused, I thought you said you were avoiding the usage of two languages. If so, what is Julia "gluing"?


You don't have to, as Julia can still get high performance. And BLAS is actually an example of Julia's strengths: the ability to pick exact code specializations helps effectively pick the most suitable BLAS function. Python can't pick the correct function based on the type and number of arguments.

As for gluing: that is often needed not for performance reasons but because you don't want to reimplement lots of proven code which is well debugged and tested. One of the reasons to pick BLAS.

I use Julia a lot for shell-like scripting. That means running a lot of Unix commands whose implementation I don't know, like the iOS code-signing tools.

Doing this in Julia is far superior to doing it in Python. I was actually shocked at how clunky it was in Python when I attempted a rewrite.


>Python can't pick correct function based on type and number of arguments.

What? Numpy does this.
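For what it's worth, plain Python can also dispatch on argument type via the standard library, though only on the first argument (single dispatch, as opposed to Julia's multiple dispatch); a small sketch:

```python
from functools import singledispatch

@singledispatch
def describe(x):
    # generic fallback for unregistered types
    return "something else"

@describe.register
def _(x: int):
    return "an int"

@describe.register
def _(x: list):
    return "a list"

print(describe(3))      # an int
print(describe([1]))    # a list
print(describe("hi"))   # something else
```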

>Doing this in Julia is far superior to doing it in python.

Then, quite simply, you're doing it wrong in python. Show me an example of this system code that's nice in julia and clunky in python and I'll show you a superior implementation in python.


> What? Numpy does this.

And Numpy needs an ad-hoc, grafted-on type system to do it.

> Show me an example of this system code that's nice in julia and clunky in python and I'll show you a superior implementation in python.

Silly over-reduction, but show me a nice way of running this in Python in 3 lines:

    run(`./configure`)
    run(`make`)
    run(`make install`)


    os.system('./configure')
    os.system('make')
    os.system('make install')
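One caveat with the `os.system` version: it silently ignores non-zero exit codes, whereas Julia's `run` throws on failure. A closer equivalent is `subprocess.run` with `check=True`; a sketch with placeholder commands:

```python
import subprocess

# check=True raises CalledProcessError on a non-zero exit,
# matching the fail-fast behaviour of Julia's run(`cmd`).
subprocess.run(["echo", "configure step"], check=True)

try:
    subprocess.run(["false"], check=True)   # a command that exits non-zero
except subprocess.CalledProcessError as e:
    print("step failed with exit code", e.returncode)
```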


I am curious (coming from Julia), what is the difficulty in Python?


There's no problem but if you want to shell out there are a few ways to do it to make sure you're in control of the subshell.

At the simplest there's:

    import os  
    os.system('command')
Then you might want to capture stdout and stdin or interact with the subshell (e.g. hitting 'y'). So you use:

    import subprocess
    cmd = subprocess.Popen("cat <(head tmp)", shell=True, stdout=subprocess.PIPE, executable="/bin/bash")  # <(...) is a bashism; shell=True uses /bin/sh by default
    stdout, _ = cmd.communicate()
And if you want to timeout or parse logs or whatever, you go from there and it gets more and more complicated. If Julia has solved this then the community should be advertising this above all the alleged performance stuff.
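For what it's worth, since Python 3.5 most of that ladder collapses into `subprocess.run`, which covers capture and timeouts in one call (assuming a POSIX system with `echo` and `sleep`):

```python
import subprocess

# capture_output collects stdout/stderr; timeout kills the child
# and raises TimeoutExpired if it runs too long.
result = subprocess.run(["echo", "hello"],
                        capture_output=True, text=True, timeout=5)
print(result.stdout.strip())   # hello

try:
    subprocess.run(["sleep", "10"], timeout=0.1)
except subprocess.TimeoutExpired:
    print("timed out")
```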


>I use Julia a lot for shell like scripting.

That seems absurd.

    $ time julia -e 'print(1+1, "\n")'
    2  

    real	0m0.996s
    user	0m0.664s
    sys	        0m0.367s

    $ time python3 -c 'print(1+1)'
    2

    real	0m0.160s
    user	0m0.080s
    sys	        0m0.033s
Almost a second just to print a number in Julia. This makes it unattractive for shell scripting. Startup performance is an open issue in Julia and smart people are working on it (I hope); but it's not there yet.


"The only reason to pick python is momentum, libraries and mindshare."

I'm confused. Are you using this as an argument for Julia?

I don't post "LOL" very often on HN, but congrats... LOL!

Unless you code in a vacuum, solving the most basic of problems that don't require integration with other services, formats, and APIs, and can figure everything out yourself, then momentum, libraries, and mindshare are EXACTLY the things you should be optimizing for.

(FWIW I think Julia is a nice language and has an amazing dev community.)


You can use Numpy inside Julia with the PyCall library. Julia can also load R, C, and Fortran code.


Can the HN title add "(2015)"?



